fix v4 TBO hang; hash topk single all-gather by ZhangLirong-amd · Pull Request #1109 · ROCm/ATOM

ZhangLirong-amd · 2026-06-06T02:01:44Z

Fix TBO hang and do hash-topk DP all-gather only once.

Copilot

Pull request overview

This PR aims to address a TBO (two-thread batch overlap) hang by moving the DP all-gather of input_ids out of the MoE hash-routing callback and into the model forward path, so the hash-topk routing can reuse a single gathered input_ids buffer rather than gathering inside _hash_topk.

Changes:

Removed DP input_ids gathering logic from MoE._hash_topk and replaced it with a simple slice+clamp to num_tokens.
Added MoE._gather_ids_for_dp(...) helper to all-gather input_ids across DP ranks.
Added _need_ids_gather gating and updated DeepseekV4ForCausalLM.forward to populate forward_context.context.input_ids (gathered when needed).
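
To illustrate the "gather once" idea from the list above, here is a toy, single-process sketch. It is not ATOM code: `GatherCounter`, `forward_gather_per_layer`, and `forward_gather_once` are hypothetical names, and the "collective" is only a counter. It shows how hoisting the gather out of the per-layer routing callback reduces the number of collectives per forward pass from one per MoE layer to one in total.

```python
from typing import List


class GatherCounter:
    """Stand-in for a DP communicator; counts how many gathers are issued."""

    def __init__(self) -> None:
        self.calls = 0

    def all_gather(self, ids: List[int]) -> List[int]:
        self.calls += 1
        return list(ids)  # toy: a real all-gather would concatenate all ranks


def forward_gather_per_layer(ids: List[int], num_layers: int, comm: GatherCounter) -> int:
    """Old shape (assumed): every MoE layer's routing callback gathers ids itself."""
    for _ in range(num_layers):
        gathered = comm.all_gather(ids)
        _ = gathered[: len(ids)]  # routing would consume the ids here
    return comm.calls


def forward_gather_once(ids: List[int], num_layers: int, comm: GatherCounter) -> int:
    """New shape (assumed): gather once in forward, every layer reuses the buffer."""
    gathered = comm.all_gather(ids)
    for _ in range(num_layers):
        _ = gathered[: len(gathered)]  # routing only slices the shared buffer
    return comm.calls


per_layer_calls = forward_gather_per_layer([1, 2, 3], num_layers=4, comm=GatherCounter())
hoisted_calls = forward_gather_once([1, 2, 3], num_layers=4, comm=GatherCounter())
print(per_layer_calls, hoisted_calls)
```

With fewer collectives issued from inside callbacks, there are fewer points at which ranks or micro-batches can disagree on collective ordering, which is one plausible reading of why the change helps with the hang.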


+    def _gather_ids_for_dp(ids: torch.Tensor, ctx) -> torch.Tensor:
+        """All-gather input_ids across DP ranks to match gathered hidden_states."""
+        from aiter.dist.parallel_state import get_dp_group
+
+        ids_2d = ids.unsqueeze(-1)
+        dp_eager_mode = (
+            not ctx.context.dp_uniform_decode
+        ) and ctx.dp_metadata is not None
+        if dp_eager_mode:
+            from atom.model_ops.moe import all_gatherv
+            sizes = ctx.dp_metadata.get_sizes_across_dp()
+            ids_2d = all_gatherv(ids_2d, sizes, get_dp_group())
+        else:
+            from atom.model_ops.moe import pad_for_all_gather
+            ids_2d, _ = pad_for_all_gather(ids_2d)
+            ids_2d = get_dp_group().all_gather(ids_2d, use_custom=False, dim=0)
+        return ids_2d.flatten()
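
The helper above picks between two gather paths. The pure-Python simulation below sketches how their output layouts could differ. It rests on assumptions: `all_gatherv_sim` and `padded_all_gather_sim` are hypothetical stand-ins, and the real `pad_for_all_gather` may pad to a different length or with a different value than shown here.

```python
from typing import List


def all_gatherv_sim(per_rank: List[List[int]]) -> List[int]:
    """Variable-size gather: concatenate each rank's ids in rank order."""
    return [tok for rank_ids in per_rank for tok in rank_ids]


def padded_all_gather_sim(per_rank: List[List[int]], pad_value: int = 0) -> List[int]:
    """Fixed-size gather: pad every rank to the longest length, then concatenate.

    Assumption: padding goes to the per-step maximum and uses pad_value.
    """
    width = max(len(rank_ids) for rank_ids in per_rank)
    out: List[int] = []
    for rank_ids in per_rank:
        out.extend(rank_ids + [pad_value] * (width - len(rank_ids)))
    return out


ranks = [[11, 12, 13], [21], [31, 32]]
ragged = all_gatherv_sim(ranks)
padded = padded_all_gather_sim(ranks)
print(ragged)  # [11, 12, 13, 21, 31, 32]
print(padded)  # [11, 12, 13, 21, 0, 0, 31, 32, 0]
```

Whichever path is taken, the gathered ids need to line up row for row with the gathered hidden_states, which is why the helper's docstring ties the two together.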


+        if ctx.context.input_ids is not None:
+            pass  # already set (e.g. TBO pre-gathered in UBatchWrapper)


Copilot

Pull request overview

Copilot reviewed 1 out of 1 changed files in this pull request and generated 1 comment.

+            from atom.model_ops.moe import pad_for_all_gather
+
+            ids_2d, _ = pad_for_all_gather(ids_2d)
+            ids_2d = get_dp_group().all_gather(ids_2d, use_custom=False, dim=0)
+        return ids_2d.flatten()



…gather hang for PD prefill (#1121)

* fix(mooncake): synchronize gather kernel before NIC reads staging buffer

Without this sync, the RDMA write can race the still-in-flight gather kernel that fills the staging buffer, causing GPU page faults under high-concurrency TBO prefill (DeepSeek-V4-Pro, mesh router, c=256).

* fix(deepseek_v4): route TBO-adjacent collectives onto a dedicated stream

The DP input_ids all-gather (hoisted into DeepseekV4ForCausalLM.forward by #1109) and the MoE combine_outputs TP all-reduce both run on the same stream as the main TBO compute kernels. With NCCL interleaved on that lane the kernel queue serializes compute and NCCL, blocking the hardware-level overlap with TBO comm_stream during ping-pong. Move both collectives onto a per-device auxiliary stream with wait_stream sync at both ends so the main compute lane stays free of NCCL interleaving and can overlap with TBO comm_stream NCCL.

Trace results on DeepSeek-V4-Pro (TP=8 --enable-dp-attention --enable-tbo, c=256 ISL=8192):
before: 3.8 % of TBO NCCL overlaps with compute
after: 91.7 % of TBO NCCL overlaps with compute
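
The "auxiliary stream with wait_stream sync at both ends" pattern described in that commit message can be sketched roughly as follows. This is a minimal illustration, not the #1121 implementation: `run_on_aux_stream` is a hypothetical helper, the lambda stands in for a collective, and the code falls back to a plain call when CUDA (or ROCm via the same `torch.cuda` API) is unavailable.

```python
import torch


def run_on_aux_stream(fn, *args, aux_stream=None):
    """Run fn(*args) on aux_stream, fenced by wait_stream on both sides."""
    if aux_stream is None or not torch.cuda.is_available():
        return fn(*args)  # CPU fallback: no streams to coordinate
    main = torch.cuda.current_stream()
    aux_stream.wait_stream(main)  # aux waits until inputs produced on main are ready
    with torch.cuda.stream(aux_stream):
        out = fn(*args)  # work (e.g. a collective) is enqueued on the aux stream
    main.wait_stream(aux_stream)  # main waits for the result before consuming it
    return out


aux = torch.cuda.Stream() if torch.cuda.is_available() else None
ids = torch.arange(8)
if aux is not None:
    ids = ids.cuda()
out = run_on_aux_stream(lambda t: t * 2, ids, aux_stream=aux)
print(out.cpu().tolist())
```

One caveat for real use: tensors allocated on the auxiliary stream and later consumed on the main stream may also need `Tensor.record_stream` so the caching allocator does not reuse their memory too early.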

fix TBO hang; hash topk single all-gather

ba6d762

Copilot AI review requested due to automatic review settings June 6, 2026 02:01

Copilot started reviewing on behalf of ZhangLirong-amd June 6, 2026 02:01

ZhangLirong-amd changed the title ~~fix TBO hang; hash topk single all-gather~~ fix v4 TBO hang; hash topk single all-gather Jun 6, 2026

Copilot AI reviewed Jun 6, 2026


ZhangLirong-amd added 2 commits June 6, 2026 02:11

style: black format

3fccef8

unify hash-id gather; drop stale TBO branch

12e972e

Copilot AI review requested due to automatic review settings June 6, 2026 03:41

Copilot started reviewing on behalf of ZhangLirong-amd June 6, 2026 03:41

Copilot AI reviewed Jun 6, 2026


Comment thread atom/models/deepseek_v4.py

Comment on lines +2280 to +2284

            from atom.model_ops.moe import pad_for_all_gather

            ids_2d, _ = pad_for_all_gather(ids_2d)
            ids_2d = get_dp_group().all_gather(ids_2d, use_custom=False, dim=0)
        return ids_2d.flatten()

valarLip approved these changes Jun 6, 2026


valarLip merged commit 1d9bf20 into main Jun 6, 2026
13 of 32 checks passed

valarLip deleted the tbo-hang-topk-allgather-v2 branch June 6, 2026 10:55

ZhangLirong-amd mentioned this pull request Jun 7, 2026

fix: enable TBO compute/comm overlap in Deepseek V4 and fix mooncake gather hang for PD prefill #1121

Merged

3 tasks

valarLip merged 3 commits into main from tbo-hang-topk-allgather-v2


3 participants
