Skip to content

[Dev][fix] FSDP EP-overlap CUDA-graph guard uses post-refactor API#4834

Merged
hxbai merged 1 commit into
NVIDIA:devfrom
buptzyb:fix/dev-fsdp-adapter-cuda-graph-api
May 18, 2026
Merged

[Dev][fix] FSDP EP-overlap CUDA-graph guard uses post-refactor API#4834
hxbai merged 1 commit into
NVIDIA:devfrom
buptzyb:fix/dev-fsdp-adapter-cuda-graph-api

Conversation

@buptzyb
Copy link
Copy Markdown
Contributor

@buptzyb buptzyb commented May 17, 2026

Summary

PR #3796 (A2A overlap for Megatron-FSDP) introduced a guard in
mcore_fsdp_adapter.py that iterates the legacy config.cuda_graph_scope
list. After PR #4293 normalized cuda_graph_scope to None in
TransformerConfig.__post_init__, this iteration raises
TypeError: 'NoneType' object is not iterable whenever a user combines
overlap_moe_expert_parallel_comm=True with cuda_graph_impl in {"local", "transformer_engine"}.

The bug is currently dormant on dev because:

  • The functional test deepseek_proxy_fsdp_ep2_fsdp2_ep_overlap does not set --cuda-graph-impl (defaults to "none").
  • The unit test test_fsdp_1f1b_overlap.py likewise leaves cuda_graph_impl="none".
  • The outer guard if config.cuda_graph_impl not in ["none", "full_iteration"] short-circuits the buggy iteration in both cases.

But any legitimate config that combines EP overlap with per-layer CUDA graphs will crash immediately. The equivalent silent failure on main (PR #3797, where no outer guard exists) was triggering the analogous merge-queue failures the corresponding refactor PR (#4292) — that fix has already been pushed there.

This PR applies the same migration to dev: drop the legacy for scope in config.cuda_graph_scope iteration and the now-unused CudaGraphScope import, and rely on a single cuda_graph_impl assertion that matches the original intent (per-layer CUDA graphs are not supported with FSDP EP overlap).

Test plan

  • Verify existing tests/unit_tests/a2a_overlap/test_fsdp_1f1b_overlap.py continues to pass (cuda_graph_impl defaults to "none", same path).
  • Verify tests/functional_tests/test_cases/moe/deepseek_proxy_fsdp_ep2_fsdp2_ep_overlap continues to pass.
  • Manually verify a config with --overlap-moe-expert-parallel-comm --cuda-graph-impl=local now produces a clear assertion error instead of a TypeError traceback.

cc @Wohox — friendly ping, this is a follow-up to your #3796 to make it compatible with the post-refactor CUDA-graph API. Please review when you have a moment.

🤖 Generated with Claude Code

PR NVIDIA#3796 ("Support A2A Overlap for Megatron-FSDP") landed on dev with
guard logic that iterates the legacy config.cuda_graph_scope list:

    if config.cuda_graph_impl not in ["none", "full_iteration"]:
        partial_cuda_graph_scopes = [
            scope
            for scope in config.cuda_graph_scope
            ...
        ]

After PR NVIDIA#4293 normalized cuda_graph_scope to None in __post_init__,
the inner iteration crashes with TypeError: 'NoneType' object is not
iterable whenever a user combines overlap_moe_expert_parallel_comm
with cuda_graph_impl in {"local", "transformer_engine"}.

Drop the legacy iteration; the outer check on cuda_graph_impl is the
only signal needed under the new API. Also drop the now-unused
CudaGraphScope import.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@buptzyb buptzyb requested a review from Wohox May 17, 2026 02:04
@copy-pr-bot
Copy link
Copy Markdown

copy-pr-bot Bot commented May 17, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@buptzyb
Copy link
Copy Markdown
Contributor Author

buptzyb commented May 17, 2026

/ok to test 089a5ed

Copy link
Copy Markdown
Contributor

@Wohox Wohox left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM!

@buptzyb buptzyb marked this pull request as ready for review May 18, 2026 00:47
@buptzyb buptzyb requested review from a team as code owners May 18, 2026 00:47
@hxbai hxbai added this pull request to the merge queue May 18, 2026
@svcnvidia-nemo-ci
Copy link
Copy Markdown

🔄 Merge queue validation started!

You can track the progress here: https://github.com/NVIDIA/Megatron-LM/actions/runs/26014336816

Merged via the queue into NVIDIA:dev with commit bac68ec May 18, 2026
69 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants