
Add Python-side guardrail for HybridEP InfiniBand limit and rename seq_len (#4094)

Merged

janEbert merged 3 commits into NVIDIA:main from Shreyas-S-809:fix3999 on May 8, 2026

Conversation

@Shreyas-S-809 (Contributor) commented Apr 1, 2026

Description

Currently, when running multi-node MoE training with the HybridEP backend, passing a total token count (seq_length * micro_batch_size) that results in a DeepEP Queue Pair depth (tx_depth = 3 * num_tokens + 1) greater than 65535 causes an immediate, ungraceful C++ SIGABRT across all ranks, due to InfiniBand hardware limits.
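For illustration (numbers chosen for this example, not taken from the issue thread): seq_length = 8192 with micro_batch_size = 8 gives num_tokens = 65,536, so tx_depth = 3 * 65,536 + 1 = 196,609, far above the 65,535 ceiling.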

Following the architectural discussion in the issue thread, this PR improves the UX around this hardware limitation by catching the overflow in Python and raising a clean, actionable error message.

Changes

  • Reverted the previous attempt to alter the tensor shape logic in token_dispatcher.py, as DeepEP intentionally expects the fully flattened batch.

  • Renamed seq_len to num_tokens in fused_a2a.py (init_hybrid_ep_buffer and HybridEPDispatch.forward) to accurately reflect that the variable holds the folded seq_length * batch_size.

  • Added a ValueError guardrail in HybridEPDispatch.forward. If the calculated tx_depth breaches the InfiniBand limit, it now raises a clear, actionable error advising the user to reduce sequence length/batch size or increase their TP/CP degrees, rather than dumping core (a minimal sketch follows below).
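The check amounts to roughly the following (a minimal sketch; the helper name, constant, and error wording here are illustrative assumptions, not the exact merged code):

```python
IB_QP_DEPTH_LIMIT = 65535  # maximum InfiniBand RDMA Queue Pair depth


def _check_hybrid_ep_token_limit(num_tokens: int) -> None:
    """Fail fast in Python before DeepEP aborts in C++.

    num_tokens is the flattened seq_length * micro_batch_size; DeepEP
    sizes its dispatch queue as tx_depth = 3 * num_tokens + 1.
    """
    tx_depth = 3 * num_tokens + 1
    if tx_depth > IB_QP_DEPTH_LIMIT:
        raise ValueError(
            f"HybridEP: {num_tokens} tokens per rank requires a DeepEP queue "
            f"depth of {tx_depth}, which exceeds the InfiniBand limit of "
            f"{IB_QP_DEPTH_LIMIT}. Reduce seq_length/micro_batch_size or "
            f"increase the TP/CP degree."
        )
```

With num_tokens = 32768, for example, this raises the ValueError (tx_depth = 98,305) instead of letting every rank SIGABRT inside the C++ layer.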

Testing

Tested locally to ensure Python syntax and logic are sound. Relying on CI to verify no regressions in the standard dispatch workflow.

@Shreyas-S-809 Shreyas-S-809 requested review from a team as code owners April 1, 2026 17:41
@copy-pr-bot (Bot) commented Apr 1, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@github-actions (Bot) commented Apr 1, 2026

This PR has been automatically converted to draft because all PRs must start as drafts.

When you are ready for review, click Ready for Review to begin the review process. This will:

  1. Add the oncall reviewer (optional reviewer)
  2. Add required review teams based on your changes

See the contribution guide for more details.

Comment thread megatron/core/transformer/moe/fused_a2a.py Outdated
Comment thread megatron/core/transformer/moe/fused_a2a.py Outdated
Comment thread megatron/core/transformer/moe/token_dispatcher.py Outdated
@janEbert (Contributor) commented Apr 2, 2026

See the comment in the issue; I'm not sure whether we need to fix anything here, actually.

- Renames seq_len to num_tokens in init_hybrid_ep_buffer and HybridEPDispatch for clarity, as the variable actually represents the flattened micro-batch (seq_len * batch_size).
- Adds a Python-side ValueError guardrail before DeepEP buffer initialization to catch RDMA Queue Pair depths that exceed the InfiniBand hardware limit (65535). This prevents ungraceful C++ SIGABRT crashes and instructs users to increase their Tensor/Context Parallelism degrees.
@Shreyas-S-809 Shreyas-S-809 changed the title Fix DeepEP RDMA QP assertion failure by passing correct token limits in HybridEP Add Python-side guardrail for HybridEP InfiniBand limit and rename seq_len Apr 27, 2026
@Shreyas-S-809 Shreyas-S-809 requested a review from janEbert April 27, 2026 19:07
Comment thread megatron/core/transformer/moe/fused_a2a.py Outdated
Comment thread megatron/core/transformer/moe/fused_a2a.py
Comment thread megatron/core/transformer/moe/fused_a2a.py Outdated
@janEbert (Contributor) left a comment

I think we should check for the lower bound, i.e., tx_depth = 2 * num_tokens + 1!
Thanks a lot!

@Shreyas-S-809 (Contributor, Author) commented

> I think we should check for the lower bound, i.e., tx_depth = 2 * num_tokens + 1! Thanks a lot!

Hey @janEbert, great catch finding the allocate_combine_buffers logic!

Just a quick mathematical double-check before I push the update: Since both the dispatch (3x + 1) and combine (2x + 1) queues must stay under 65536, doesn't the 3 * num_tokens + 1 calculation actually hit the hardware ceiling first?

  • 3 * num_tokens + 1 >= 65536 triggers when tokens reach 21,845

  • 2 * num_tokens + 1 >= 65536 triggers when tokens exceed 32,767

If we update the guardrail to only check 2x + 1, a config with, e.g., 25,000 tokens will pass our Python check but still crash the C++ backend with a SIGABRT during the dispatch allocation.

Should we keep the stricter 3 * num_tokens + 1 check to safely cover both allocations, or would you prefer I check the max() of both explicitly? Happy to push whichever you think is best!
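For the record, the two crossover points are easy to verify with standalone arithmetic (a scratch check, not project code):

```python
IB_QP_DEPTH_LIMIT = 65535

# Last safe token count under each queue-sizing formula.
dispatch_safe = (IB_QP_DEPTH_LIMIT - 1) // 3  # 21844 for tx_depth = 3x + 1
combine_safe = (IB_QP_DEPTH_LIMIT - 1) // 2   # 32767 for tx_depth = 2x + 1

assert 3 * dispatch_safe + 1 <= IB_QP_DEPTH_LIMIT        # 65533: fits
assert 3 * (dispatch_safe + 1) + 1 > IB_QP_DEPTH_LIMIT   # 65536: breaches
assert 2 * combine_safe + 1 <= IB_QP_DEPTH_LIMIT         # 65535: fits
assert 2 * (combine_safe + 1) + 1 > IB_QP_DEPTH_LIMIT    # 65537: breaches

print(dispatch_safe, combine_safe)  # 21844 32767 -> the 3x + 1 queue breaches first
```

So the 3x + 1 bound is the binding one: any num_tokens that passes it also passes 2x + 1.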

@janEbert (Contributor) commented

Oops, you're totally right, forget about that part!

@janEbert (Contributor) commented

/ok to test 8bf0cfe

@janEbert (Contributor) commented May 4, 2026

/ok to test 1bbd149

@janEbert (Contributor) commented May 4, 2026

Hey, please run tools/autoformat.sh (or pre-commit hooks) to fix the linting errors.

@janEbert (Contributor) commented May 4, 2026

Afterwards, we still need reviews from @NVIDIA/core-adlr, @NVIDIA/core-nemo, @NVIDIA/mixture-of-experts-adlr, and @NVIDIA/mixture-of-experts-devtech.

@Shreyas-S-809 (Contributor, Author) commented

> Hey, please run tools/autoformat.sh (or pre-commit hooks) to fix the linting errors.

Fixed lint issues, sorry for that.

@janEbert (Contributor) commented May 4, 2026

/ok to test 7548fb8

@janEbert janEbert marked this pull request as ready for review May 4, 2026 18:36
@svcnvidia-nemo-ci svcnvidia-nemo-ci requested a review from a team May 4, 2026 18:37
@svcnvidia-nemo-ci svcnvidia-nemo-ci added the waiting-on-maintainers Waiting on maintainers to respond label May 6, 2026
@janEbert janEbert enabled auto-merge May 7, 2026 21:09
@janEbert janEbert added this pull request to the merge queue May 7, 2026
@svcnvidia-nemo-ci commented

🔄 Merge queue validation started!

You can track the progress here: https://github.com/NVIDIA/Megatron-LM/actions/runs/25522321476

@github-merge-queue github-merge-queue Bot removed this pull request from the merge queue due to failed status checks May 7, 2026
@svcnvidia-nemo-ci svcnvidia-nemo-ci removed the waiting-on-maintainers Waiting on maintainers to respond label May 8, 2026
@janEbert janEbert added this pull request to the merge queue May 8, 2026
@svcnvidia-nemo-ci commented

🔄 Merge queue validation started!

You can track the progress here: https://github.com/NVIDIA/Megatron-LM/actions/runs/25567101301

Merged via the queue into NVIDIA:main with commit a08e259 May 8, 2026
66 of 70 checks passed
ko3n1g added a commit that referenced this pull request May 11, 2026
…ename seq_len (#4094)" (#4718)

Signed-off-by: oliver könig <okoenig@nvidia.com>
