Revert "Add Python-side guardrail for HybridEP InfiniBand limit and rename seq_len (#4094)" by ko3n1g · Pull Request #4718 · NVIDIA/Megatron-LM

ko3n1g · 2026-05-11T07:44:49Z

What

Revert PR #4094 ("Add Python-side guardrail for HybridEP InfiniBand limit and rename seq_len").

Why

The guardrail at megatron/core/transformer/moe/fused_a2a.py:398 is overly conservative: it raises ValueError whenever tx_depth = 3*num_tokens + 1 >= 65536, citing InfiniBand's tx_depth < 65536 limit. Empirically, DeepEP / the underlying RDMA stack already handles dispatch at tx_depth = 98305 without crashing on the affected GB300/B300 clusters, so the Python pre-check rejects workloads that would have run fine.
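To make the arithmetic concrete, here is a minimal sketch of the check as described above. It is not the code from fused_a2a.py; the function names and the constant name are hypothetical, and only the formula and the 65536 threshold come from this description. Under that formula, tx_depth = 98305 corresponds to num_tokens = 32768.

```python
# Sketch of the reverted check's arithmetic (names are illustrative assumptions,
# not the identifiers used in megatron/core/transformer/moe/fused_a2a.py).
IB_TX_DEPTH_LIMIT = 65536  # threshold cited by the guardrail


def tx_depth(num_tokens: int) -> int:
    """tx_depth as stated in the PR description: 3 * num_tokens + 1."""
    return 3 * num_tokens + 1


def guardrail_rejects(num_tokens: int) -> bool:
    """True when the described pre-check would have raised ValueError."""
    return tx_depth(num_tokens) >= IB_TX_DEPTH_LIMIT


# Largest num_tokens that passes: 3 * n + 1 <= 65535, so n <= 21844.
max_accepted_tokens = (IB_TX_DEPTH_LIMIT - 2) // 3

print(tx_depth(32768))           # 98305, the depth reported as working
print(guardrail_rejects(32768))  # True
print(max_accepted_tokens)       # 21844
```

In other words, the pre-check capped dispatch at 21844 tokens, while the re-tested configurations ran at a depth roughly 1.5 times the cited limit.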

Evidence

End-to-end re-test of every test that the guardrail was preventing from running, with the revert applied:

Revert SHA: 9924c3f205 (this PR's HEAD)
Test image: dgxctestingtemp-nemofw-nightly.50777079 (the same nightly that contains the guardrail baked-in, with the revert overlaid at runtime)
Pinning: MBridge main (original MBS=4/8 values, not halved)
Verification pipeline (nemo-ci internal): 50889489
Test cases: all 9 tests the guardrail was rejecting:

| Test | Cluster | MBS | Result with revert |
|---|---|---|---|
| `nemotron_3_nano_8gpu_b300_bf16_perf` | bia (b300) | 4 | (still running at write time) |
| `nemotron_3_nano_8gpu_b300_fp8_mx_perf` | bia (b300) | 4 | (still running at write time) |
| `nemotron_3_nano_8gpu_gb300_bf16_perf` | lyris (gb300) | 4 | ✅ success |
| `nemotron_3_nano_8gpu_gb300_fp8_mx_perf` | lyris (gb300) | 4 | ✅ success |
| `qwen3_30b_a3b_64gpu_b300_bf16_perf` | bia (b300) | 8 | ✅ success |
| `qwen3_30b_a3b_64gpu_b300_fp8_mx_perf` | bia (b300) | 8 | ✅ success |
| `qwen3_30b_a3b_64gpu_gb300_bf16_perf` | lyris (gb300) | 8 | ✅ success |
| `qwen3_30b_a3b_64gpu_gb300_fp8_mx_perf` | lyris (gb300) | 8 | ✅ success |
| `qwen3_vl_30b_a3b_8gpu_gb300_bf16_perf` | lyris (gb300) | 8 | ✅ success |

(I'll update the table with the 2 remaining results when they land.)

Path forward

If maintainers want to keep some form of guardrail, please re-introduce it with a more accurate threshold (e.g. the actual max_qp_wr reported by the IB device at runtime), and behind a feature flag so platforms where it's wrong can opt out.
If the goal is just to surface a clearer error than DeepEP's lower-level assert, a warnings.warn or a try/except wrapper around the dispatch call would achieve that without rejecting valid configurations.
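One possible shape for such a softer guardrail is sketched below. Everything here is an assumption for illustration: the function name, the environment variable, and the idea that the caller passes in a limit (for example, a max_qp_wr value it has obtained from the IB device by some other means). It is not an existing Megatron-LM or DeepEP API.

```python
import os
import warnings

DEFAULT_TX_DEPTH_LIMIT = 65536  # the threshold used by the reverted check

# Hypothetical opt-in flag; not an existing Megatron-LM setting.
STRICT_ENV_VAR = "HYBRIDEP_STRICT_TX_DEPTH_CHECK"


def check_tx_depth(num_tokens, max_tx_depth=DEFAULT_TX_DEPTH_LIMIT, strict=None):
    """Return True if tx_depth is under the limit; otherwise warn or raise.

    max_tx_depth is supplied by the caller, e.g. a value reported by the
    device at runtime. strict=None defers to the environment variable.
    """
    if strict is None:
        strict = os.environ.get(STRICT_ENV_VAR, "0") == "1"

    tx_depth = 3 * num_tokens + 1
    if tx_depth < max_tx_depth:
        return True

    message = (
        f"tx_depth={tx_depth} (num_tokens={num_tokens}) is not below the "
        f"configured limit {max_tx_depth}; dispatch may fail on some fabrics."
    )
    if strict:
        # Opt-in hard failure for platforms where the limit is known to hold.
        raise ValueError(message)
    # Default: surface the risk without rejecting the configuration.
    warnings.warn(message, RuntimeWarning, stacklevel=2)
    return False
```

With the default settings, a configuration such as num_tokens = 32768 would produce a RuntimeWarning and still proceed to dispatch; setting the flag, or passing strict=True, restores the hard failure.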

Test plan

Run tests label triggers MCore unit tests
End-to-end CI: pipeline 50889489 (above)

ko3n1g · 2026-05-11T07:44:57Z

/ok to test

github-actions · 2026-05-11T07:45:00Z

This PR has been automatically converted to draft because all PRs must start as drafts.

When you are ready for review, click Ready for Review to begin the review process. This will:

Add the oncall reviewer (optional reviewer)
Add required review teams based on your changes

See the contribution guide for more details.

copy-pr-bot · 2026-05-11T07:45:01Z

Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually.

Contributors can view more details about this message here.

ko3n1g · 2026-05-11T07:52:16Z

#4094 forced us to reduce performance for Nemotron 3, Qwen3 MoE 30B, and Qwen3-VL by lowering MBS. Reverting until we find a more fine-grained solution.

Revert "Add Python-side guardrail for HybridEP InfiniBand limit and r…

9924c3f

…ename seq_len (NVIDIA#4094)" This reverts commit a08e259. Signed-off-by: oliver könig <okoenig@nvidia.com>

ko3n1g requested review from a team as code owners May 11, 2026 07:44

svcnvidia-nemo-ci marked this pull request as draft May 11, 2026 07:44

copy-pr-bot Bot temporarily deployed to test May 11, 2026 07:45 Inactive

ko3n1g marked this pull request as ready for review May 11, 2026 07:50

svcnvidia-nemo-ci requested a review from a team May 11, 2026 07:50

svcnvidia-nemo-ci added the complexity: low label May 11, 2026

ko3n1g merged commit a2ec5c1 into NVIDIA:main May 11, 2026
26 of 28 checks passed

janEbert mentioned this pull request May 11, 2026

Add Python-side guardrail for DeepEP IB limits #4719

Merged

ko3n1g merged 1 commit into NVIDIA:main from ko3n1g:ko3n1g/revert/hybridep-ib-guardrail