Skip to content

[Dev] add support for deepep/hybridep dispatcher under thd format training#4816

Merged
Victarry merged 9 commits into
NVIDIA:devfrom
HaochenYuan:thd_e2e_deepep
May 21, 2026
Merged

[Dev] add support for deepep/hybridep dispatcher under thd format training#4816
Victarry merged 9 commits into
NVIDIA:devfrom
HaochenYuan:thd_e2e_deepep

Conversation

@HaochenYuan
Copy link
Copy Markdown
Contributor

@HaochenYuan HaochenYuan commented May 15, 2026

What does this PR do ?

Previous PR added support for thd format in training, but for MoE dispatcher, only all2all type is supported. This PR adds the deepep & hybridep backend.

⚠️ For major changes (either in lines of code or in its impact), please make sure to first share a design doc with the team. If you're unsure what's the best way to do so, contact the @mcore-oncall.

Issue tracking

For PRs from open-source community contributors:

  • New features: a linked issue is required. Please open a feature request and reference it here before submitting the PR.
  • Small updates (bug fixes, minor improvements): a linked issue is recommended and will accelerate the PR review process.

Linked issue:

Contribution process

Pre-checks

  • I have added relevant unit tests
  • I have added relevant functional tests
  • I have added proper typing to my code Typing guidelines
  • I have added relevant documentation
  • I have run the autoformatter.sh on my PR

Code review

Feel free to message or comment the @mcore-oncall to help accelerate your merge into main. The less complex your PR is, the faster it will be approved and merged!

All PRs start as draft. If you open a non-draft PR, it will be automatically converted to draft.

Step 1: Mark PR as "Ready for Review"

  1. When your PR is ready, click Ready for Review.
  2. An oncall reviewer is auto-assigned and expert reviewers are notified based on your changes.
    • Some PRs may jump straight to step 2. This is determined by .github/CODEOWNERS.

⚠️ Only mark as ready once merge-conflicts are resolved and the CI is passing.
Final Review might get declined if these requirements are not fulfilled.

Step 2: Final Review

For PRs that change megatron/core, once all expert reviewers have approved, the Final Review label is applied automatically and final reviewers are assigned.

For PRs outside megatron/core, this step is skipped.

Step 3: Approved

Once all required reviewers have approved, the Approved label is applied automatically.

Merge

Any member of mcore-engineers will be able to merge your PR.

For MRs into `dev` branch The proposed review process for `dev` branch is under active discussion.

MRs are mergable after one approval by either eharper@nvidia.com or zijiey@nvidia.com.

@HaochenYuan HaochenYuan requested review from a team as code owners May 15, 2026 08:52
@HaochenYuan HaochenYuan requested a review from kunlunl May 15, 2026 08:52
@Victarry Victarry requested a review from Autumn1998 May 15, 2026 08:53
@HaochenYuan HaochenYuan added module: moe Expert Review [deprecated] Apply this label to indicate that your PR is ready for expert review. dev branch Dev branch related issues and development labels May 15, 2026
Comment thread megatron/core/transformer/transformer_config.py Outdated
Comment thread megatron/core/transformer/moe/token_dispatcher.py Outdated
@hxbai hxbai mentioned this pull request May 15, 2026
3 tasks
@Victarry
Copy link
Copy Markdown
Contributor

Victarry commented May 18, 2026

@HaochenYuan Could please you add an UT to guard this and submit a PR to main.

cc @Autumn1998 for review

@copy-pr-bot
Copy link
Copy Markdown

copy-pr-bot Bot commented May 18, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@HaochenYuan
Copy link
Copy Markdown
Contributor Author

@HaochenYuan Could please you add an UT to guard this and submit a PR to main.

cc @Autumn1998 for review

UT added.

@yaox12
Copy link
Copy Markdown
Member

yaox12 commented May 19, 2026

/claude strict-review

Comment thread megatron/core/transformer/moe/token_dispatcher.py Outdated
Comment thread megatron/core/transformer/moe/token_dispatcher.py Outdated
Comment thread tests/unit_tests/transformer/moe/test_token_dispatcher.py
Comment thread megatron/core/transformer/moe/fused_a2a.py
@claude
Copy link
Copy Markdown
Contributor

claude Bot commented May 19, 2026

Code Review Summary

PR: Add support for deepep/hybridep dispatcher under THD format training

Overview

This PR enables HybridEP and DeepEP MoE dispatcher backends to work with THD-format sequence packing (variable-length token counts per rank). The core approach — all-reduce MAX to find the group-wide padded token count, pad routing metadata and hidden states before dispatch, trim after combine — is sound. The gradient flow through pad/trim is correct since torch.cat and slicing are differentiable with the right zero-gradient semantics.

The assertion update in transformer_config.py correctly allows "flex" (which wraps deepep/hybridep) alongside "alltoall" for sequence packing. The fused_a2a.py change to skip RDMA buffer allocation for node-local EP groups is a reasonable fix for DeepEP builds without internode support.

Findings

Severity Count
CRITICAL 0
IMPORTANT 2
SUGGESTION 2

IMPORTANT:

  1. CPU-GPU sync in _HybridEPManager.setup_metadataint(nt.item()) forces a host-device sync on every forward pass when THD packing is active. This is functionally necessary but worth documenting, and worth considering whether a config-derived upper bound could avoid it.
  2. Test skip condition too weaktest_sequence_packing_thd_e2e_proxy_model needs 16 ranks (tp=2, pp=2, cp=2, ep=2) but only guards with world_size % 8 != 0, which passes on 8-GPU nodes where initialize_model_parallel would then fail. Should be world_size < 16.

SUGGESTION:

  1. The hardcoded 64 alignment for padded token count should be a named constant with a reference to the HybridEP kernel requirement.
  2. The torch.cuda.device_count() heuristic in fused_a2a.py assumes all local GPUs are visible; a comment noting this assumption would help.

Risk Assessment: Low-Medium

The core dispatcher changes are well-contained in _HybridEPManager and correctly gated behind sequence_packing_scheduler is not None, so existing non-packing paths are unaffected. The fused_a2a.py RDMA change is additive (only skips unnecessary allocation). Main risk is the test skip condition allowing false passes on standard 8-GPU CI nodes.

HaochenYuan and others added 3 commits May 19, 2026 16:41
Co-authored-by: claude[bot] <209825114+claude[bot]@users.noreply.github.com>
Copy link
Copy Markdown
Member

@yaox12 yaox12 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. Please fix the CI failure.

Comment thread megatron/core/transformer/moe/token_dispatcher.py Outdated
Copy link
Copy Markdown
Contributor

@Autumn1998 Autumn1998 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@Victarry Victarry added this pull request to the merge queue May 21, 2026
@svcnvidia-nemo-ci
Copy link
Copy Markdown

🔄 Merge queue validation started!

You can track the progress here: https://github.com/NVIDIA/Megatron-LM/actions/runs/26200974444

@github-merge-queue github-merge-queue Bot removed this pull request from the merge queue due to failed status checks May 21, 2026
@Victarry Victarry added this pull request to the merge queue May 21, 2026
@svcnvidia-nemo-ci
Copy link
Copy Markdown

🔄 Merge queue validation started!

You can track the progress here: https://github.com/NVIDIA/Megatron-LM/actions/runs/26206185854

Merged via the queue into NVIDIA:dev with commit 0afbb98 May 21, 2026
67 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

dev branch Dev branch related issues and development Expert Review [deprecated] Apply this label to indicate that your PR is ready for expert review. module: moe

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants