[Dev][fix] FSDP EP-overlap CUDA-graph guard uses post-refactor API #4834
Merged
PR NVIDIA#3796 ("Support A2A Overlap for Megatron-FSDP") landed on dev with guard logic that iterates the legacy config.cuda_graph_scope list:

    if config.cuda_graph_impl not in ["none", "full_iteration"]:
        partial_cuda_graph_scopes = [
            scope for scope in config.cuda_graph_scope ...
        ]

After PR NVIDIA#4293 normalized cuda_graph_scope to None in __post_init__, the inner iteration crashes with TypeError: 'NoneType' object is not iterable whenever a user combines overlap_moe_expert_parallel_comm with cuda_graph_impl in {"local", "transformer_engine"}.

Drop the legacy iteration; the outer check on cuda_graph_impl is the only signal needed under the new API. Also drop the now-unused CudaGraphScope import.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
/ok to test 089a5ed
hxbai approved these changes on May 18, 2026
🔄 Merge queue validation started! You can track the progress here: https://github.com/NVIDIA/Megatron-LM/actions/runs/26014336816
Summary
PR #3796 (A2A overlap for Megatron-FSDP) introduced a guard in `mcore_fsdp_adapter.py` that iterates the legacy `config.cuda_graph_scope` list. After PR #4293 normalized `cuda_graph_scope` to `None` in `TransformerConfig.__post_init__`, this iteration raises `TypeError: 'NoneType' object is not iterable` whenever a user combines `overlap_moe_expert_parallel_comm=True` with `cuda_graph_impl in {"local", "transformer_engine"}`.

The bug is currently dormant on dev because:
- `deepseek_proxy_fsdp_ep2_fsdp2_ep_overlap` does not set `--cuda-graph-impl` (defaults to `"none"`).
- `test_fsdp_1f1b_overlap.py` likewise leaves `cuda_graph_impl="none"`.
- The outer check `if config.cuda_graph_impl not in ["none", "full_iteration"]` short-circuits the buggy iteration in both cases.

But any legitimate config that combines EP overlap with per-layer CUDA graphs will crash immediately. The equivalent bug on main (PR #3797, where no outer guard exists) was causing the analogous merge-queue failures on the corresponding refactor PR (#4292); that fix has already been pushed there.
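To make the dormant failure mode concrete, here is a minimal sketch of the guard's shape. It assumes a hypothetical `FakeConfig` dataclass standing in for `TransformerConfig` and a hypothetical `legacy_guard` function; neither is the real Megatron code, and the filter condition elided in the original comprehension is deliberately left out.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class FakeConfig:
    """Hypothetical stand-in for TransformerConfig (not the real Megatron class)."""

    cuda_graph_impl: str = "none"
    # Assumption for illustration: after the refactor this field is None.
    cuda_graph_scope: Optional[list] = None


def legacy_guard(config: FakeConfig) -> list:
    """Mimic the legacy guard: an outer impl check gating an inner scope iteration."""
    partial_cuda_graph_scopes = []
    if config.cuda_graph_impl not in ["none", "full_iteration"]:
        # Only reached for per-layer CUDA graph implementations.
        # Iterating None raises TypeError: 'NoneType' object is not iterable.
        partial_cuda_graph_scopes = [scope for scope in config.cuda_graph_scope]
    return partial_cuda_graph_scopes
```

With the default `cuda_graph_impl="none"`, the outer check skips the iteration, which is why the existing tests never hit the error; setting it to `"local"` reaches the comprehension and raises.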
This PR applies the same migration to dev: drop the legacy `for scope in config.cuda_graph_scope` iteration and the now-unused `CudaGraphScope` import, and rely on a single `cuda_graph_impl` assertion that matches the original intent (per-layer CUDA graphs are not supported with FSDP EP overlap).

Test plan
- `tests/unit_tests/a2a_overlap/test_fsdp_1f1b_overlap.py` continues to pass (`cuda_graph_impl` defaults to `"none"`, same path).
- `tests/functional_tests/test_cases/moe/deepseek_proxy_fsdp_ep2_fsdp2_ep_overlap` continues to pass.
- `--overlap-moe-expert-parallel-comm --cuda-graph-impl=local` now produces a clear assertion error instead of a `TypeError` traceback.

cc @Wohox: friendly ping, this is a follow-up to your #3796 to make it compatible with the post-refactor CUDA-graph API. Please review when you have a moment.
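The migrated guard and the last test-plan item could be sketched roughly as follows. The helper names `check_fsdp_ep_overlap` and `error_type` and the assertion message are hypothetical, chosen for illustration; the actual wording and placement in `mcore_fsdp_adapter.py` may differ.

```python
def check_fsdp_ep_overlap(
    cuda_graph_impl: str, overlap_moe_expert_parallel_comm: bool
) -> None:
    """Hypothetical helper: one assertion on cuda_graph_impl, no scope iteration."""
    if overlap_moe_expert_parallel_comm:
        assert cuda_graph_impl in ("none", "full_iteration"), (
            # Hypothetical message text, not copied from the PR.
            f"cuda_graph_impl={cuda_graph_impl!r} is not supported "
            "with FSDP EP overlap"
        )


def error_type(cuda_graph_impl: str) -> str:
    """Return the exception class name raised for an EP-overlap config, or 'ok'."""
    try:
        check_fsdp_ep_overlap(cuda_graph_impl, overlap_moe_expert_parallel_comm=True)
    except Exception as exc:
        return type(exc).__name__
    return "ok"
```

Under these assumptions, `error_type("local")` and `error_type("transformer_engine")` report `AssertionError` rather than `TypeError`, while `"none"` and `"full_iteration"` pass through unchanged.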
🤖 Generated with Claude Code