[Dev] add support for deepep/hybridep dispatcher under thd format training by HaochenYuan · Pull Request #4816 · NVIDIA/Megatron-LM

HaochenYuan · 2026-05-15T08:52:04Z

What does this PR do ?

A previous PR added support for the THD format in training, but for the MoE dispatcher only the all2all type was supported. This PR adds the deepep and hybridep backends.

Victarry · 2026-05-18T02:54:10Z

@HaochenYuan Could you please add a UT to guard this and submit a PR to main?

cc @Autumn1998 for review

copy-pr-bot · 2026-05-18T07:32:18Z

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

HaochenYuan · 2026-05-18T09:07:04Z

> @HaochenYuan Could you please add a UT to guard this and submit a PR to main?
>
> cc @Autumn1998 for review

UT added.

yaox12 · 2026-05-19T02:28:13Z

/claude strict-review

claude · 2026-05-19T02:33:38Z

Code Review Summary

PR: Add support for deepep/hybridep dispatcher under THD format training

Overview

This PR enables HybridEP and DeepEP MoE dispatcher backends to work with THD-format sequence packing (variable-length token counts per rank). The core approach — all-reduce MAX to find the group-wide padded token count, pad routing metadata and hidden states before dispatch, trim after combine — is sound. The gradient flow through pad/trim is correct since torch.cat and slicing are differentiable with the right zero-gradient semantics.
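The pad-then-trim flow described above can be illustrated with a minimal, framework-free sketch. It simulates three ranks as plain Python lists; `group_max`, `pad_tokens`, and `trim_tokens` are hypothetical names for illustration, not functions from the PR, and `group_max` merely stands in for the all-reduce MAX collective. The real dispatcher operates on tensors, where the review notes that torch.cat and slicing keep gradients flowing.

```python
from typing import List

Token = List[float]  # one hidden-state row


def group_max(counts: List[int]) -> int:
    """Stand-in for an all-reduce with MAX over the EP group (assumption)."""
    return max(counts)


def pad_tokens(tokens: List[Token], target: int, hidden: int) -> List[Token]:
    """Append zero rows until this rank holds `target` tokens."""
    return tokens + [[0.0] * hidden for _ in range(target - len(tokens))]


def trim_tokens(tokens: List[Token], original: int) -> List[Token]:
    """Drop the padding rows again after combine."""
    return tokens[:original]


hidden = 2
# Three simulated ranks with different packed (THD) token counts.
ranks = [
    [[1.0, 1.0]] * 3,
    [[2.0, 2.0]] * 5,
    [[3.0, 3.0]] * 1,
]
counts = [len(r) for r in ranks]
target = group_max(counts)  # every rank agrees on the same padded length
padded = [pad_tokens(r, target, hidden) for r in ranks]
restored = [trim_tokens(p, n) for p, n in zip(padded, counts)]
```

After padding, every simulated rank holds the same number of rows, and trimming with the original per-rank count recovers the unpadded input.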

The assertion update in transformer_config.py correctly allows "flex" (which wraps deepep/hybridep) alongside "alltoall" for sequence packing. The fused_a2a.py change to skip RDMA buffer allocation for node-local EP groups is a reasonable fix for DeepEP builds without internode support.
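As a rough sketch of what skipping RDMA allocation for a node-local EP group could look like, the snippet below assumes the check compares the EP group size with the number of GPUs visible to the process (the review elsewhere mentions a torch.cuda.device_count() heuristic). The actual logic in fused_a2a.py is not shown in this thread, so both function names and the comparison itself are assumptions.

```python
def ep_group_is_node_local(ep_group_size: int, visible_gpu_count: int) -> bool:
    """Assumed heuristic: the group fits on one node if it is no larger
    than the number of GPUs this process can see."""
    return ep_group_size <= visible_gpu_count


def rdma_buffer_bytes(
    ep_group_size: int, visible_gpu_count: int, requested_bytes: int
) -> int:
    """Return 0 for node-local groups, otherwise the requested size."""
    if ep_group_is_node_local(ep_group_size, visible_gpu_count):
        return 0
    return requested_bytes


# An 8-way EP group on a node with 8 visible GPUs needs no RDMA buffer.
local_case = rdma_buffer_bytes(8, 8, 1024)
# A 16-way EP group must span nodes, so the buffer is kept.
internode_case = rdma_buffer_bytes(16, 8, 1024)
# If only 4 of 8 GPUs are visible, the same 8-way group looks internode.
restricted_case = rdma_buffer_bytes(8, 4, 1024)
```

The last case shows why the visibility assumption matters: when a process sees only a subset of the node's GPUs, a heuristic of this shape would treat a node-local group as internode.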

Findings

Severity	Count
CRITICAL	0
IMPORTANT	2
SUGGESTION	2

IMPORTANT:

- CPU-GPU sync in _HybridEPManager.setup_metadata: int(nt.item()) forces a host-device sync on every forward pass when THD packing is active. This is functionally necessary but worth documenting, and worth considering whether a config-derived upper bound could avoid it.
- Test skip condition too weak: test_sequence_packing_thd_e2e_proxy_model needs 16 ranks (tp=2, pp=2, cp=2, ep=2) but only guards with world_size % 8 != 0, which passes on 8-GPU nodes where initialize_model_parallel would then fail. Should be world_size < 16.
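The difference between the two skip conditions can be checked with a small sketch. It takes the review's figure of 16 required ranks as given; the function names are hypothetical and only mirror the two predicates quoted above.

```python
def weak_guard_skips(world_size: int) -> bool:
    """Guard as described in the review: skip unless divisible by 8."""
    return world_size % 8 != 0


def strict_guard_skips(world_size: int, required: int = 16) -> bool:
    """Suggested guard: skip whenever fewer than `required` ranks exist."""
    return world_size < required


# On an 8-GPU node the weak guard lets the test run; the strict one skips it.
runs_with_weak_guard_on_8 = not weak_guard_skips(8)
skipped_with_strict_guard_on_8 = strict_guard_skips(8)
```

With 16 ranks neither predicate skips, so the stricter guard only changes behavior on configurations that are too small to run the test.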

SUGGESTION:

- The hardcoded 64 alignment for padded token count should be a named constant with a reference to the HybridEP kernel requirement.
- The torch.cuda.device_count() heuristic in fused_a2a.py assumes all local GPUs are visible; a comment noting this assumption would help.
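The first suggestion could look roughly like the sketch below. The constant name `HYBRIDEP_TOKEN_ALIGNMENT` and the helper `align_up` are hypothetical, and rounding the padded token count up to the next multiple of 64 is an assumption about how the alignment is applied.

```python
# Hypothetical name. The review attributes the value 64 to a HybridEP
# kernel requirement; the exact requirement is not quoted in this thread.
HYBRIDEP_TOKEN_ALIGNMENT = 64


def align_up(num_tokens: int, alignment: int = HYBRIDEP_TOKEN_ALIGNMENT) -> int:
    """Round `num_tokens` up to the next multiple of `alignment`."""
    if num_tokens < 0:
        raise ValueError("num_tokens must be non-negative")
    return (num_tokens + alignment - 1) // alignment * alignment


# A group-wide maximum of 65 tokens would be padded to 128 under this rule.
padded_count = align_up(65)
```

Naming the constant keeps the magic number in one place, so a comment pointing at the kernel requirement only has to be written once.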

Risk Assessment: Low-Medium

The core dispatcher changes are well-contained in _HybridEPManager and correctly gated behind sequence_packing_scheduler is not None, so existing non-packing paths are unaffected. The fused_a2a.py RDMA change is additive (only skips unnecessary allocation). Main risk is the test skip condition allowing false passes on standard 8-GPU CI nodes.

yaox12

LGTM. Please fix the CI failure.

Autumn1998

LGTM

svcnvidia-nemo-ci · 2026-05-21T02:00:27Z

🔄 Merge queue validation started!

You can track the progress here: https://github.com/NVIDIA/Megatron-LM/actions/runs/26200974444

svcnvidia-nemo-ci · 2026-05-21T04:52:12Z

🔄 Merge queue validation started!

You can track the progress here: https://github.com/NVIDIA/Megatron-LM/actions/runs/26206185854

add support for deepep/hybridep dispatcher under thd format training

e7b8eee

HaochenYuan requested review from a team as code owners May 15, 2026 08:52

HaochenYuan requested a review from kunlunl May 15, 2026 08:52

Victarry requested a review from Autumn1998 May 15, 2026 08:53

HaochenYuan added the module: moe, Expert Review, and dev branch labels May 15, 2026

kunlunl requested changes May 15, 2026

Comment thread megatron/core/transformer/transformer_config.py Outdated

Comment thread megatron/core/transformer/moe/token_dispatcher.py Outdated

hxbai mentioned this pull request May 15, 2026

DeepSeek-V4 training support #4468

Open

3 tasks

fix & linting

b1f06d2

copy-pr-bot Bot temporarily deployed to test May 15, 2026 10:25 Inactive

HaochenYuan force-pushed the thd_e2e_deepep branch from 0450ef9 to f61785f Compare May 18, 2026 07:54

refactor UT

38db167

HaochenYuan force-pushed the thd_e2e_deepep branch from 5e73b40 to 38db167 Compare May 18, 2026 09:06

kunlunl approved these changes May 18, 2026

Merge branch 'dev' into thd_e2e_deepep

d9f01d5

claude Bot reviewed May 19, 2026

Comment thread megatron/core/transformer/moe/token_dispatcher.py Outdated

claude Bot reviewed May 19, 2026

Comment thread megatron/core/transformer/moe/token_dispatcher.py Outdated

claude Bot reviewed May 19, 2026

Comment thread tests/unit_tests/transformer/moe/test_token_dispatcher.py

claude Bot reviewed May 19, 2026

Comment thread megatron/core/transformer/moe/fused_a2a.py

HaochenYuan and others added 3 commits May 19, 2026 16:41

Update megatron/core/transformer/moe/token_dispatcher.py

7cc04a8

Co-authored-by: claude[bot] <209825114+claude[bot]@users.noreply.github.com>

fix linting

a5a11ef

Merge branch 'thd_e2e_deepep' of github.com:HaochenYuan/Megatron-LM i…

c42fb2d

…nto thd_e2e_deepep

copy-pr-bot Bot temporarily deployed to test May 19, 2026 09:57 Inactive

yaox12 approved these changes May 20, 2026

Victarry reviewed May 20, 2026

Comment thread megatron/core/transformer/moe/token_dispatcher.py Outdated

Autumn1998 approved these changes May 20, 2026

rename

1d2df5b

copy-pr-bot Bot temporarily deployed to test May 20, 2026 12:10 Inactive

fix test

483cf00

copy-pr-bot Bot temporarily deployed to test May 20, 2026 14:03 Inactive

Victarry added this pull request to the merge queue May 21, 2026

Victarry approved these changes May 21, 2026

Victarry mentioned this pull request May 21, 2026

[ROADMAP][2026 Q2] Megatron Core MoE Roadmap #4815

Open

71 tasks

github-merge-queue Bot removed this pull request from the merge queue due to failed status checks May 21, 2026

Victarry added this pull request to the merge queue May 21, 2026

Merged via the queue into NVIDIA:dev with commit 0afbb98 May 21, 2026
67 checks passed

zhongbozhu mentioned this pull request May 28, 2026

[Dev][DO NOT MERGE] Add separate toggle for varlen input padding for HybridEP in THD training #5048

Draft

6 tasks

6 participants
