fix v4 TBO hang; hash topk single all-gather #1109
Merged
Pull request overview
This PR aims to address a TBO (two-thread batch overlap) hang by moving the DP all-gather of input_ids out of the MoE hash-routing callback and into the model forward path, so the hash-topk routing can reuse a single gathered input_ids buffer rather than gathering inside _hash_topk.
Changes:
- Removed DP `input_ids` gathering logic from `MoE._hash_topk` and replaced it with a simple slice+clamp to `num_tokens`.
- Added `MoE._gather_ids_for_dp(...)` helper to all-gather `input_ids` across DP ranks.
- Added `_need_ids_gather` gating and updated `DeepseekV4ForCausalLM.forward` to populate `forward_context.context.input_ids` (gathered when needed).
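The change listed above can be sketched in plain Python, with lists standing in for tensors and a counter standing in for the DP collective. Every name here (`FakeDPGroup`, `hash_topk_ids`, `forward`) is hypothetical and none of it is the PR's code. The sketch only illustrates the shape of the change: the gather runs once per forward pass, and each MoE layer slices the shared buffer instead of issuing its own collective.

```python
from typing import List


class FakeDPGroup:
    """Hypothetical stand-in for a DP process group; counts collective calls."""

    def __init__(self, ids_per_rank: List[List[int]]):
        self.ids_per_rank = ids_per_rank
        self.num_all_gathers = 0

    def all_gather_ids(self) -> List[int]:
        # A real collective would block until every rank joins it.
        self.num_all_gathers += 1
        return [tok for rank_ids in self.ids_per_rank for tok in rank_ids]


def hash_topk_ids(gathered_ids: List[int], num_tokens: int) -> List[int]:
    """Per-layer step after the change: slice the shared buffer, no collective."""
    return gathered_ids[:num_tokens]


def forward(group: FakeDPGroup, num_layers: int, num_tokens: int) -> List[List[int]]:
    # Gather once in the model forward path ...
    gathered = group.all_gather_ids()
    # ... then let every MoE layer reuse the same buffer.
    return [hash_topk_ids(gathered, num_tokens) for _ in range(num_layers)]


group = FakeDPGroup([[11, 12, 13], [21, 22]])
per_layer = forward(group, num_layers=4, num_tokens=5)
print(group.num_all_gathers, per_layer[0])
```

With four layers the counter still reads one, whereas gathering inside the routing callback would issue one collective per layer.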
Comment on lines +2266 to +2282

    def _gather_ids_for_dp(ids: torch.Tensor, ctx) -> torch.Tensor:
        """All-gather input_ids across DP ranks to match gathered hidden_states."""
        from aiter.dist.parallel_state import get_dp_group

        ids_2d = ids.unsqueeze(-1)
        dp_eager_mode = (
            not ctx.context.dp_uniform_decode
        ) and ctx.dp_metadata is not None
        if dp_eager_mode:
            from atom.model_ops.moe import all_gatherv
            sizes = ctx.dp_metadata.get_sizes_across_dp()
            ids_2d = all_gatherv(ids_2d, sizes, get_dp_group())
        else:
            from atom.model_ops.moe import pad_for_all_gather
            ids_2d, _ = pad_for_all_gather(ids_2d)
            ids_2d = get_dp_group().all_gather(ids_2d, use_custom=False, dim=0)
        return ids_2d.flatten()
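The two branches of the helper above differ in the shape of what they return, which a list-based simulation can make concrete. This is a rough sketch, not the real `all_gatherv` or `pad_for_all_gather`: the `_sim` functions are hypothetical, and both the pad target (the longest rank) and the pad value (0) are assumptions, since the excerpt does not show them.

```python
from typing import List, Tuple


def all_gatherv_sim(per_rank: List[List[int]], sizes: List[int]) -> List[int]:
    """Variable-size gather: concatenate exactly sizes[r] ids from rank r."""
    out: List[int] = []
    for rank_ids, size in zip(per_rank, sizes):
        out.extend(rank_ids[:size])
    return out


def pad_for_all_gather_sim(
    ids: List[int], target: int, pad_id: int = 0
) -> Tuple[List[int], int]:
    """Pad one rank's ids to `target` entries; return (padded, original_len)."""
    return ids + [pad_id] * (target - len(ids)), len(ids)


def padded_all_gather_sim(per_rank: List[List[int]], pad_id: int = 0) -> List[int]:
    """Fixed-size gather: every rank contributes the same padded length."""
    target = max(len(rank_ids) for rank_ids in per_rank)  # assumed pad target
    out: List[int] = []
    for rank_ids in per_rank:
        padded, _ = pad_for_all_gather_sim(rank_ids, target, pad_id)
        out.extend(padded)
    return out


per_rank = [[5, 6, 7], [8]]
exact = all_gatherv_sim(per_rank, sizes=[3, 1])
padded = padded_all_gather_sim(per_rank)
print(exact, padded)
```

Under these assumptions the variable-size path yields one id per real token, while the padded path also carries filler entries, so a consumer of the padded result has to tolerate or skip them.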
Comment on lines +2805 to +2806

    if ctx.context.input_ids is not None:
        pass  # already set (e.g. TBO pre-gathered in UBatchWrapper)
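The guard above makes populating `input_ids` idempotent: a caller that has already gathered the ids is left alone. A minimal sketch of that pattern follows, assuming a simplified context object; `Context`, `ensure_input_ids`, and the `gather` callable are hypothetical names and not part of the PR.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Context:
    """Hypothetical stand-in for ctx.context; only the field used here."""

    input_ids: Optional[List[int]] = None


def ensure_input_ids(
    context: Context,
    local_ids: List[int],
    need_gather: bool,
    gather: Callable[[List[int]], List[int]],
) -> List[int]:
    """Populate context.input_ids at most once and return it."""
    if context.input_ids is not None:
        # Already set by an earlier stage, so do not issue another collective.
        return context.input_ids
    context.input_ids = gather(local_ids) if need_gather else local_ids
    return context.input_ids


calls = []


def fake_gather(ids: List[int]) -> List[int]:
    # Assumption for illustration: a two-rank gather that mirrors local ids.
    calls.append(ids)
    return ids + ids


ctx = Context()
first = ensure_input_ids(ctx, [1, 2], need_gather=True, gather=fake_gather)
second = ensure_input_ids(ctx, [9, 9], need_gather=True, gather=fake_gather)
print(first, second, len(calls))
```

The second call returns the buffer stored by the first and never reaches `fake_gather`, which is the behavior the early-exit branch is there to guarantee.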
Comment on lines +2280 to +2284

    from atom.model_ops.moe import pad_for_all_gather

    ids_2d, _ = pad_for_all_gather(ids_2d)
    ids_2d = get_dp_group().all_gather(ids_2d, use_custom=False, dim=0)
    return ids_2d.flatten()
valarLip approved these changes Jun 6, 2026
ZhangLirong-amd added a commit that referenced this pull request Jun 7, 2026

The DP input_ids all-gather (hoisted into DeepseekV4ForCausalLM.forward by #1109) and the MoE combine_outputs TP all-reduce both run on the same stream as the main TBO compute kernels. With NCCL interleaved on that lane the kernel queue serializes compute and NCCL, blocking the hardware-level overlap with TBO comm_stream during ping-pong.

Move both collectives onto a per-device auxiliary stream with wait_stream sync at both ends so the main compute lane stays free of NCCL interleaving and can overlap with TBO comm_stream NCCL.

Trace results on DeepSeek-V4-Pro (TP=8 --enable-dp-attention --enable-tbo, c=256 ISL=8192):
- before: 3.8 % of TBO NCCL overlaps with compute
- after: 91.7 % of TBO NCCL overlaps with compute
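The commit message does not say how the overlap percentages were derived from the trace. One plausible definition, offered here purely as an assumption, is the fraction of total communication-kernel time during which at least one compute kernel is also running. The sketch below computes that from `(start, end)` intervals; `merge` and `overlap_fraction` are hypothetical helpers, not tooling from the repository.

```python
from typing import List, Tuple

Interval = Tuple[float, float]  # (start, end) in any consistent time unit


def merge(intervals: List[Interval]) -> List[Interval]:
    """Union overlapping or touching intervals into a sorted disjoint list."""
    merged: List[Interval] = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


def overlap_fraction(comm: List[Interval], compute: List[Interval]) -> float:
    """Share of communication time that coincides with compute time."""
    comm_m, compute_m = merge(comm), merge(compute)
    total = sum(end - start for start, end in comm_m)
    if total == 0:
        return 0.0
    covered = 0.0
    for c_start, c_end in comm_m:
        for k_start, k_end in compute_m:
            # Both lists are disjoint, so pairwise overlaps never double count.
            covered += max(0.0, min(c_end, k_end) - max(c_start, k_start))
    return covered / total


serialized = overlap_fraction(comm=[(0, 5)], compute=[(5, 10)])
overlapped = overlap_fraction(comm=[(0, 4), (10, 14)], compute=[(2, 12)])
print(serialized, overlapped)
```

A fully serialized timeline scores 0.0 under this definition, which matches the qualitative picture the commit describes for the before case.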
valarLip pushed a commit that referenced this pull request Jun 8, 2026

…gather hang for PD prefill (#1121)

* fix(mooncake): synchronize gather kernel before NIC reads staging buffer

  Without this sync, the RDMA write can race the still-in-flight gather kernel that fills the staging buffer, causing GPU page faults under high-concurrency TBO prefill (DeepSeek-V4-Pro, mesh router, c=256).

* fix(deepseek_v4): route TBO-adjacent collectives onto a dedicated stream

  The DP input_ids all-gather (hoisted into DeepseekV4ForCausalLM.forward by #1109) and the MoE combine_outputs TP all-reduce both run on the same stream as the main TBO compute kernels. With NCCL interleaved on that lane the kernel queue serializes compute and NCCL, blocking the hardware-level overlap with TBO comm_stream during ping-pong.

  Move both collectives onto a per-device auxiliary stream with wait_stream sync at both ends so the main compute lane stays free of NCCL interleaving and can overlap with TBO comm_stream NCCL.

  Trace results on DeepSeek-V4-Pro (TP=8 --enable-dp-attention --enable-tbo, c=256 ISL=8192):
  - before: 3.8 % of TBO NCCL overlaps with compute
  - after: 91.7 % of TBO NCCL overlaps with compute
Fix TBO hang and do hash-topk DP all-gather only once.