Siddharth/fix ep sync#4607
Merged
wdykas merged 11 commits intoNVIDIA:mainfrom May 5, 2026
Merged
Conversation
The every-20-step consensus optimization caches `_last_ep_consensus`, but the cache is never invalidated on state transitions. After PAUSING -> PAUSED the cache holds (global_work>0, all_pausing=True); when UNPAUSING -> RUNNING the next loop iteration skips a fresh consensus (counter % 20 != 0 and global_work != 0), reads the stale all_pausing=True, and immediately re-pauses the engine. This wedges test_control_logic_lifecycle[tp2-pp2-ep2] at the post-unpause asyncio.gather (10s timeout). Reset the cache to (0, False) on UNPAUSING -> RUNNING so the first RUNNING iteration always re-runs consensus. Also add `--inference-disable-ep-consensus` (DynamicInferenceContextConfig .disable_ep_consensus) as an escape hatch: when set, the control loop skips _ep_establish_consensus and dummy_forward entirely and steps on local state. Safe only when EP coordination is not required. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Contributor
|
This PR has been automatically converted to draft because all PRs must start as drafts. When you are ready for review, click Ready for Review to begin the review process. This will:
See the contribution guide for more details. |
Contributor
Author
|
/ok to test 9a741d8 |
The disable_ep_consensus branch previously slept for 20ms when local_pending == 0. With EP > 1 a peer running a real forward blocks at its NCCL all-to-all kernel waiting for this rank's collective participation, so sleeping here deadlocks the EP group. Mirror the consensus path: when there is no local work, run controller. dummy_forward() instead. Trades the consensus all-reduce CPU cost for unconditional dummy forwards on idle ranks. Add `await asyncio.sleep(0)` after dummy_forward so the engine task yields to the asyncio loop — the consensus path naturally yields via _ep_establish_consensus and async_step, but the disable path otherwise has no awaits between steps and starves signal delivery. Add test_disable_ep_consensus that wraps controller.dummy_forward in a MagicMock spy and asserts the call count grows during a 200 ms idle window. Also exercises request submission and a pause/unpause cycle. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Contributor
Author
|
/ok to test 9637e6b |
Contributor
Author
|
/ok to test a8128e8 |
…ix-ep-sync # Conflicts: # megatron/inference/utils.py
Contributor
Author
|
/ok to test ffd9ca5 |
sidsingh-nvidia
approved these changes
May 5, 2026
tdene
approved these changes
May 5, 2026
kvareddy
approved these changes
May 5, 2026
kvareddy
approved these changes
May 5, 2026
Contributor
|
For future work, I still believe we should move the EP sync to a side-task in the asyncio event loop - so that it runs in parallel with the forward step. |
shanmugamr1992
approved these changes
May 5, 2026
Contributor
Author
|
/ok to test 478ed44 |
shanmugamr1992
approved these changes
May 5, 2026
Contributor
Author
|
/ok to test bca15ba |
|
🔄 Merge queue validation started! You can track the progress here: https://github.com/NVIDIA/Megatron-LM/actions/runs/25401534202 |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
What does this PR do ?
Issue tracking
For PRs from open-source community contributors:
Linked issue:
Contribution process
Pre-checks
Code review
Feel free to message or comment the @mcore-oncall to help accelerate your merge into main. The less complex your PR is, the faster it will be approved and merged!
All PRs start as draft. If you open a non-draft PR, it will be automatically converted to draft.
Step 1: Mark PR as "Ready for Review"
.github/CODEOWNERS.Final Review might get declined if these requirements are not fulfilled.
Step 2: Final Review
For PRs that change
megatron/core, once all expert reviewers have approved, theFinal Reviewlabel is applied automatically and final reviewers are assigned.For PRs outside
megatron/core, this step is skipped.Step 3: Approved
Once all required reviewers have approved, the
Approvedlabel is applied automatically.Merge
Any member of mcore-engineers will be able to merge your PR.
For MRs into `dev` branch
The proposed review process for `dev` branch is under active discussion.MRs are mergable after one approval by either
eharper@nvidia.comorzijiey@nvidia.com.