fix v4 TBO hang; hash topk single all-gather #1109

Merged: valarLip merged 3 commits into main from tbo-hang-topk-allgather-v2 on Jun 6, 2026
@ZhangLirong-amd
Contributor

Fix TBO hang and do hash-topk DP all-gather only once.

Copilot AI review requested due to automatic review settings June 6, 2026 02:01
@ZhangLirong-amd changed the title from "fix TBO hang; hash topk single all-gather" to "fix v4 TBO hang; hash topk single all-gather" on Jun 6, 2026
Contributor

Copilot AI left a comment

Pull request overview

This PR aims to address a TBO (two-thread batch overlap) hang by moving the DP all-gather of input_ids out of the MoE hash-routing callback and into the model forward path, so the hash-topk routing can reuse a single gathered input_ids buffer rather than gathering inside _hash_topk.

Changes:

  • Removed DP input_ids gathering logic from MoE._hash_topk and replaced it with a simple slice+clamp to num_tokens.
  • Added MoE._gather_ids_for_dp(...) helper to all-gather input_ids across DP ranks.
  • Added _need_ids_gather gating and updated DeepseekV4ForCausalLM.forward to populate forward_context.context.input_ids (gathered when needed).
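The effect of hoisting the gather can be sketched with a toy, pure-Python model. Everything here is hypothetical and only illustrates the call pattern: `FakeDPGroup` stands in for a DP process group, the modulo "hash" is a placeholder for the real routing, and none of these names come from the PR.

```python
class FakeDPGroup:
    """Stand-in for a DP process group that counts collective calls."""

    def __init__(self, per_rank_ids):
        self.per_rank_ids = per_rank_ids
        self.all_gather_calls = 0

    def all_gather(self):
        # Concatenate every rank's ids in rank order, as an all-gather would.
        self.all_gather_calls += 1
        return [tok for ids in self.per_rank_ids for tok in ids]


def hash_topk(gathered_ids, num_tokens, num_experts):
    """Toy hash routing: slice to the rows in use, then map id -> expert."""
    ids = gathered_ids[:num_tokens]
    return [tok % num_experts for tok in ids]  # placeholder hash, not the real one


def forward(group, num_layers, num_experts):
    # Hoisted: one collective per forward pass, shared by every MoE layer.
    gathered = group.all_gather()
    num_tokens = len(gathered)
    return [hash_topk(gathered, num_tokens, num_experts) for _ in range(num_layers)]


group = FakeDPGroup([[11, 12, 13], [21, 22]])
routes = forward(group, num_layers=4, num_experts=8)
print(group.all_gather_calls)  # 1, independent of num_layers
```

In this model the routing callback issues no collective at all, which is the property the PR description asks for ("do hash-topk DP all-gather only once").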

Comment on lines +2266 to +2282
def _gather_ids_for_dp(ids: torch.Tensor, ctx) -> torch.Tensor:
    """All-gather input_ids across DP ranks to match gathered hidden_states."""
    from aiter.dist.parallel_state import get_dp_group

    ids_2d = ids.unsqueeze(-1)
    dp_eager_mode = (
        not ctx.context.dp_uniform_decode
    ) and ctx.dp_metadata is not None
    if dp_eager_mode:
        from atom.model_ops.moe import all_gatherv

        sizes = ctx.dp_metadata.get_sizes_across_dp()
        ids_2d = all_gatherv(ids_2d, sizes, get_dp_group())
    else:
        from atom.model_ops.moe import pad_for_all_gather

        ids_2d, _ = pad_for_all_gather(ids_2d)
        ids_2d = get_dp_group().all_gather(ids_2d, use_custom=False, dim=0)
    return ids_2d.flatten()
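The two branches above produce differently shaped results, which matters because the gathered ids have to line up row for row with the gathered hidden_states. The following pure-Python simulation is a sketch only: it does not call `all_gatherv` or `pad_for_all_gather`, and it assumes the padded path pads every rank to a common length, which the snippet does not confirm.

```python
def all_gatherv_sim(per_rank, sizes):
    """Variable-size gather: take exactly sizes[r] rows from rank r."""
    out = []
    for rows, n in zip(per_rank, sizes):
        out.extend(rows[:n])
    return out


def padded_all_gather_sim(per_rank, pad_value=0):
    """Fixed-size gather: pad each rank to the longest rank, then concatenate.

    Assumption: padding to the maximum per-rank length; the real helper
    may choose the padded length differently.
    """
    width = max(len(rows) for rows in per_rank)
    out = []
    for rows in per_rank:
        out.extend(rows + [pad_value] * (width - len(rows)))
    return out


per_rank_ids = [[11, 12, 13], [21]]
sizes = [len(rows) for rows in per_rank_ids]

eager = all_gatherv_sim(per_rank_ids, sizes)   # [11, 12, 13, 21]
padded = padded_all_gather_sim(per_rank_ids)   # [11, 12, 13, 21, 0, 0]
```

Under these assumptions the padded result contains filler rows, so any consumer that indexes it must use the same layout as the hidden_states gathered by the same path.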
Comment thread atom/models/deepseek_v4.py Outdated
Comment on lines +2805 to +2806
if ctx.context.input_ids is not None:
    pass  # already set (e.g. TBO pre-gathered in UBatchWrapper)
Copilot AI review requested due to automatic review settings June 6, 2026 03:41
Contributor

Copilot AI left a comment

Pull request overview

Copilot reviewed 1 out of 1 changed files in this pull request and generated 1 comment.

Comment on lines +2280 to +2284
        from atom.model_ops.moe import pad_for_all_gather

        ids_2d, _ = pad_for_all_gather(ids_2d)
        ids_2d = get_dp_group().all_gather(ids_2d, use_custom=False, dim=0)
    return ids_2d.flatten()
@valarLip merged commit 1d9bf20 into main on Jun 6, 2026
13 of 32 checks passed
@valarLip deleted the tbo-hang-topk-allgather-v2 branch on June 6, 2026 10:55
ZhangLirong-amd added a commit that referenced this pull request Jun 7, 2026
The DP input_ids all-gather (hoisted into DeepseekV4ForCausalLM.forward
by #1109) and the MoE combine_outputs TP all-reduce both run on the same
stream as the main TBO compute kernels. With NCCL interleaved on that
lane the kernel queue serializes compute and NCCL, blocking the
hardware-level overlap with TBO comm_stream during ping-pong.

Move both collectives onto a per-device auxiliary stream with wait_stream
sync at both ends so the main compute lane stays free of NCCL
interleaving and can overlap with TBO comm_stream NCCL.

Trace results on DeepSeek-V4-Pro (TP=8 --enable-dp-attention --enable-tbo,
c=256 ISL=8192):
  before: 3.8 % of TBO NCCL overlaps with compute
  after:  91.7 % of TBO NCCL overlaps with compute
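The "auxiliary stream with wait_stream sync at both ends" pattern described in this commit message can be sketched with the public PyTorch stream API. This is a minimal illustration, not the code from the commit: `run_on_aux_stream` and `_aux_stream` are hypothetical names, and the CPU fallback exists only so the sketch also runs without a GPU or without PyTorch installed.

```python
try:
    import torch
except ImportError:  # lets the sketch run where PyTorch is not installed
    torch = None

_AUX_STREAMS = {}  # device index -> torch.cuda.Stream (one per device)


def _aux_stream():
    """Return a cached side stream for the current device (hypothetical helper)."""
    dev = torch.cuda.current_device()
    if dev not in _AUX_STREAMS:
        _AUX_STREAMS[dev] = torch.cuda.Stream(device=dev)
    return _AUX_STREAMS[dev]


def run_on_aux_stream(fn, *args):
    """Run fn(*args) on a side stream, fenced against the current stream."""
    if torch is None or not torch.cuda.is_available():
        return fn(*args)  # no GPU: nothing to overlap, just call through

    main = torch.cuda.current_stream()
    aux = _aux_stream()

    aux.wait_stream(main)          # entry fence: inputs produced on main are ready
    with torch.cuda.stream(aux):
        out = fn(*args)            # e.g. a collective, enqueued on aux
    main.wait_stream(aux)          # exit fence: main resumes only after aux is done

    if isinstance(out, torch.Tensor):
        # Tell the caching allocator that main also uses this memory.
        out.record_stream(main)
    return out
```

A caller would wrap only the collective, for example `run_on_aux_stream(gather_fn, ids)`, where `gather_fn` is whatever function issues the all-gather. The two `wait_stream` calls preserve ordering with respect to the main stream while keeping the collective's kernels out of the main stream's queue.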
valarLip pushed a commit that referenced this pull request Jun 8, 2026
…gather hang for PD prefill (#1121)

* fix(mooncake): synchronize gather kernel before NIC reads staging buffer

Without this sync, the RDMA write can race the still-in-flight gather
kernel that fills the staging buffer, causing GPU page faults under
high-concurrency TBO prefill (DeepSeek-V4-Pro, mesh router, c=256).

* fix(deepseek_v4): route TBO-adjacent collectives onto a dedicated stream

The DP input_ids all-gather (hoisted into DeepseekV4ForCausalLM.forward
by #1109) and the MoE combine_outputs TP all-reduce both run on the same
stream as the main TBO compute kernels. With NCCL interleaved on that
lane the kernel queue serializes compute and NCCL, blocking the
hardware-level overlap with TBO comm_stream during ping-pong.

Move both collectives onto a per-device auxiliary stream with wait_stream
sync at both ends so the main compute lane stays free of NCCL
interleaving and can overlap with TBO comm_stream NCCL.

Trace results on DeepSeek-V4-Pro (TP=8 --enable-dp-attention --enable-tbo,
c=256 ISL=8192):
  before: 3.8 % of TBO NCCL overlaps with compute
  after:  91.7 % of TBO NCCL overlaps with compute
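The commit message does not say how the overlap percentage was derived from the trace. One plausible definition, shown here purely as an assumption, is the share of communication-kernel time that is covered by at least one compute kernel. The function names and the interval data below are illustrative, not taken from the trace.

```python
def merge(intervals):
    """Merge overlapping (start, end) intervals into a sorted, disjoint list."""
    merged = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    return merged


def overlap_fraction(comm, compute):
    """Fraction of total comm time that coincides with any compute interval."""
    comm = merge(comm)
    compute = merge(compute)
    total = sum(end - start for start, end in comm)
    covered = 0
    for c_start, c_end in comm:
        for k_start, k_end in compute:
            covered += max(0, min(c_end, k_end) - max(c_start, k_start))
    return covered / total if total else 0.0


# Made-up timestamps: two comm kernels, one long compute kernel.
comm_kernels = [(0, 10), (20, 30)]
compute_kernels = [(5, 25)]
print(overlap_fraction(comm_kernels, compute_kernels))  # 0.5
```

Merging both interval lists first keeps kernels that overlap each other from being counted twice.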

3 participants