Revert "Add Python-side guardrail for HybridEP InfiniBand limit and rename seq_len (#4094)" #4718

Merged
ko3n1g merged 1 commit into NVIDIA:main from ko3n1g:ko3n1g/revert/hybridep-ib-guardrail on May 11, 2026


@ko3n1g ko3n1g commented May 11, 2026

What

Revert PR #4094 ("Add Python-side guardrail for HybridEP InfiniBand limit and rename seq_len").

Why

The guardrail at megatron/core/transformer/moe/fused_a2a.py:398 is overly conservative — it raises ValueError for tx_depth = 3*num_tokens + 1 >= 65536, citing InfiniBand's tx_depth < 65536 limit. Empirically, DeepEP / the underlying RDMA stack already handles dispatch at tx_depth = 98305 without crashing on the affected GB300/B300 clusters; the Python pre-check rejects workloads that would have run fine.
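To make the numbers above concrete, here is a minimal sketch of the arithmetic, using only the formula and limit quoted in this description. The function names are hypothetical and are not the code from fused_a2a.py; the token count of 32768 is simply what the quoted formula implies for tx_depth = 98305.

```python
# Illustrative only: reproduces the arithmetic quoted in the PR description.
# Function names are hypothetical; this is not the reverted guardrail code.

IB_TX_DEPTH_LIMIT = 65536  # limit cited by the reverted guardrail


def tx_depth(num_tokens: int) -> int:
    """tx_depth as quoted in the description: 3 * num_tokens + 1."""
    return 3 * num_tokens + 1


def guardrail_rejects(num_tokens: int, limit: int = IB_TX_DEPTH_LIMIT) -> bool:
    """True when the quoted condition (tx_depth >= limit) would have raised."""
    return tx_depth(num_tokens) >= limit


# Largest token count with 3 * n + 1 < 65536.
max_accepted_tokens = (IB_TX_DEPTH_LIMIT - 2) // 3

print(max_accepted_tokens)       # 21844
print(tx_depth(32768))           # 98305, the depth reported as working
print(guardrail_rejects(32768))  # True: this workload was rejected up front
```

Under that formula, the pre-check refused anything above 21844 tokens per dispatch, while the depth observed to work corresponds to 32768 tokens.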

Evidence

End-to-end re-test of every test that the guardrail was preventing from running, with the revert applied:

  • Revert SHA: 9924c3f205 (this PR's HEAD)
  • Test image: dgxctestingtemp-nemofw-nightly.50777079 (the same nightly that contains the guardrail baked-in, with the revert overlaid at runtime)
  • Pinning: MBridge main (original MBS=4/8 values, not halved)
  • Verification pipeline (nemo-ci internal): 50889489
  • Test cases: all 9 tests the guardrail was rejecting:
| Test | Cluster | MBS | Result with revert |
|---|---|---|---|
| nemotron_3_nano_8gpu_b300_bf16_perf | bia (b300) | 4 | (still running at write time) |
| nemotron_3_nano_8gpu_b300_fp8_mx_perf | bia (b300) | 4 | (still running at write time) |
| nemotron_3_nano_8gpu_gb300_bf16_perf | lyris (gb300) | 4 | ✅ success |
| nemotron_3_nano_8gpu_gb300_fp8_mx_perf | lyris (gb300) | 4 | ✅ success |
| qwen3_30b_a3b_64gpu_b300_bf16_perf | bia (b300) | 8 | ✅ success |
| qwen3_30b_a3b_64gpu_b300_fp8_mx_perf | bia (b300) | 8 | ✅ success |
| qwen3_30b_a3b_64gpu_gb300_bf16_perf | lyris (gb300) | 8 | ✅ success |
| qwen3_30b_a3b_64gpu_gb300_fp8_mx_perf | lyris (gb300) | 8 | ✅ success |
| qwen3_vl_30b_a3b_8gpu_gb300_bf16_perf | lyris (gb300) | 8 | ✅ success |

(I'll update the table with the 2 remaining results when they land.)

Path forward

  • If maintainers want to keep some form of guardrail, please re-introduce it with a more accurate threshold (e.g. the actual max_qp_wr reported by the IB device at runtime), and behind a feature flag so platforms where it's wrong can opt out.
  • If the goal is just to surface a clearer error than DeepEP's lower-level assert, a warnings.warn or a try/except wrapper around the dispatch call would achieve that without rejecting valid configurations.
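One possible shape for the two options above, as a rough sketch rather than a proposed patch. All names here (`check_tx_depth`, `dispatch_with_context`, the `strict` flag) are hypothetical. The `limit` argument stands in for a value obtained from the IB device at runtime (how to query it is not shown), and the wrapper only helps if the lower-level failure actually surfaces as a Python exception, which is an assumption.

```python
import warnings
from typing import Any, Callable, Optional

# Hypothetical default: the limit cited by the reverted guardrail.
DEFAULT_TX_DEPTH_LIMIT = 65536


def check_tx_depth(
    num_tokens: int, limit: Optional[int] = None, strict: bool = False
) -> bool:
    """Return True if 3 * num_tokens + 1 is below the limit.

    Otherwise warn and return False, or raise ValueError when strict=True.
    `limit` is meant to come from the device at runtime; None falls back
    to the default above.
    """
    limit = DEFAULT_TX_DEPTH_LIMIT if limit is None else limit
    depth = 3 * num_tokens + 1
    if depth < limit:
        return True
    msg = (
        f"tx_depth={depth} (num_tokens={num_tokens}) is at or above the "
        f"assumed limit {limit}; dispatch may fail on some platforms."
    )
    if strict:
        raise ValueError(msg)
    warnings.warn(msg, RuntimeWarning, stacklevel=2)
    return False


def dispatch_with_context(
    dispatch_fn: Callable[..., Any], num_tokens: int, *args: Any, **kwargs: Any
) -> Any:
    """Call dispatch_fn and attach tx_depth context if it raises."""
    try:
        return dispatch_fn(*args, **kwargs)
    except Exception as exc:
        raise RuntimeError(
            f"dispatch failed with tx_depth={3 * num_tokens + 1}; "
            "this may exceed the device's send-queue depth."
        ) from exc
```

With `strict=False` as the default, configurations such as the ones in the table above would run with a warning instead of being rejected; platforms that want the hard failure can opt in.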

Test plan

  • Run tests label triggers MCore unit tests
  • End-to-end CI: pipeline 50889489 (above)

…ename seq_len (NVIDIA#4094)"

This reverts commit a08e259.

Signed-off-by: oliver könig <okoenig@nvidia.com>
@ko3n1g ko3n1g requested review from a team as code owners May 11, 2026 07:44

ko3n1g commented May 11, 2026

/ok to test

@svcnvidia-nemo-ci svcnvidia-nemo-ci marked this pull request as draft May 11, 2026 07:44
@github-actions

This PR has been automatically converted to draft because all PRs must start as drafts.

When you are ready for review, click Ready for Review to begin the review process. This will:

  1. Add the oncall reviewer (optional reviewer)
  2. Add required review teams based on your changes

See the contribution guide for more details.


copy-pr-bot Bot commented May 11, 2026

Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually.

Contributors can view more details about this message here.

@ko3n1g ko3n1g marked this pull request as ready for review May 11, 2026 07:50
@svcnvidia-nemo-ci svcnvidia-nemo-ci requested a review from a team May 11, 2026 07:50

ko3n1g commented May 11, 2026

#4094 forced us to reduce MBS, and with it perf, for nemotron3, qwen3 moe 30B and qwen3-vl. Reverting until we find a more fine-grained solution.

@ko3n1g ko3n1g merged commit a2ec5c1 into NVIDIA:main May 11, 2026
26 of 28 checks passed