Add Python-side guardrail for HybridEP InfiniBand limit and rename seq_len#4094
janEbert merged 3 commits into NVIDIA:main
Conversation
This PR has been automatically converted to draft because all PRs must start as drafts. When you are ready for review, click Ready for Review to begin the review process.
See the contribution guide for more details.
See the comment in the issue; I'm not sure whether we need to fix anything here, actually.
- Renames `seq_len` to `num_tokens` in `init_hybrid_ep_buffer` and `HybridEPDispatch` for clarity, as the variable actually represents the flattened micro-batch (`seq_len * batch_size`).
- Adds a Python-side `ValueError` guardrail before DeepEP buffer initialization to catch RDMA Queue Pair depths that exceed the InfiniBand hardware limit (65535). This prevents ungraceful C++ `SIGABRT` crashes and instructs users to increase their Tensor/Context Parallelism degrees.
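A minimal sketch of what such a guardrail could look like (hypothetical helper, not the actual Megatron-LM code; the `tx_depth = 3 * num_tokens + 1` formula and the 65535 limit are taken from the PR description):

```python
IB_MAX_QP_DEPTH = 65535  # InfiniBand hardware limit on RDMA Queue Pair depth


def check_qp_depth(num_tokens: int) -> int:
    """Raise a clean ValueError instead of letting DeepEP SIGABRT in C++.

    Hypothetical standalone helper; in the PR the check lives inside
    HybridEPDispatch before the DeepEP buffer is initialized.
    """
    tx_depth = 3 * num_tokens + 1  # QP depth formula from the PR description
    if tx_depth > IB_MAX_QP_DEPTH:
        raise ValueError(
            f"HybridEP requires tx_depth={tx_depth}, which exceeds the "
            f"InfiniBand Queue Pair limit of {IB_MAX_QP_DEPTH}. Reduce "
            "seq_length/micro_batch_size or increase TP/CP degrees."
        )
    return tx_depth
```

For example, `num_tokens = 4096` yields `tx_depth = 12289` and passes, while `num_tokens = 32768` yields `98305` and raises before any C++ code runs.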
janEbert
left a comment
I think we should check for the lower bound, i.e., `tx_depth = 2 * num_tokens + 1`!
Thanks a lot!
Hey @janEbert, great catch finding the […]! Just a quick mathematical double-check before I push the update: since both the dispatch (…) […] If we update the guardrail to only check […] Should we keep the stricter […]?
Oops, you're totally right, forget about that part!

/ok to test 8bf0cfe

/ok to test 1bbd149
Hey, please run […]

Afterwards, we still need reviews from @NVIDIA/core-adlr, @NVIDIA/core-nemo, @NVIDIA/mixture-of-experts-adlr, and @NVIDIA/mixture-of-experts-devtech.
Fixed lint issues, sorry for that. |
/ok to test 7548fb8

🔄 Merge queue validation started! You can track the progress here: https://github.com/NVIDIA/Megatron-LM/actions/runs/25522321476

🔄 Merge queue validation started! You can track the progress here: https://github.com/NVIDIA/Megatron-LM/actions/runs/25567101301
Description
Currently, when running multi-node MoE training with the HybridEP backend, passing a total token count (`seq_length * micro_batch_size`) that results in a DeepEP Queue Pair depth (`tx_depth = 3 * num_tokens + 1`) greater than 65535 causes an immediate and ungraceful C++ `SIGABRT` across all ranks due to InfiniBand hardware limits.

Following the architectural discussion in the issue thread, this PR improves the UX around this hardware limitation by catching the overflow in Python and raising a clean, actionable error message.
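The error message advises raising TP/CP degrees; a quick arithmetic sketch shows why that helps, assuming (as a hypothetical simplification) that the micro-batch's tokens are sharded evenly across the `tp * cp` ranks, so each rank's `num_tokens` shrinks accordingly:

```python
IB_MAX_QP_DEPTH = 65535  # InfiniBand Queue Pair depth limit


def fits_ib_limit(seq_length: int, micro_batch_size: int, tp: int, cp: int) -> bool:
    """Check whether the per-rank token count keeps tx_depth under the IB limit.

    Hypothetical back-of-the-envelope model: assumes the flattened batch is
    split evenly across tp * cp ranks; the real sharding depends on the
    parallelism configuration.
    """
    num_tokens = seq_length * micro_batch_size // (tp * cp)
    return 3 * num_tokens + 1 <= IB_MAX_QP_DEPTH


# A 65536-token micro-batch overflows unsharded, but fits once split 4 ways:
print(fits_ib_limit(65536, 1, tp=1, cp=1))  # False: tx_depth = 196609
print(fits_ib_limit(65536, 1, tp=2, cp=2))  # True:  tx_depth = 49153
```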
Changes
Reverted the previous attempt to alter the tensor shape logic in `token_dispatcher.py`, as DeepEP intentionally expects the fully flattened batch.
Renamed `seq_len` to `num_tokens` in `fused_a2a.py` (`init_hybrid_ep_buffer` and `HybridEPDispatch.forward`) to accurately reflect that the variable holds the folded `seq_length * batch_size`.

Added a `ValueError` guardrail in `HybridEPDispatch.forward`. If the calculated `tx_depth` breaches the InfiniBand limit, it now raises a clear error advising the user to reduce sequence length/batch size or increase their TP/CP degrees, rather than dumping core.

Testing
Tested locally to ensure Python syntax and logic are sound. Relying on CI to verify no regressions in the standard dispatch workflow.