[None][fix] py_executor: lift conditional tp_allgather sites out of per-rank-divergent gates by lancelly · Pull Request #14560 · NVIDIA/TensorRT-LLM

lancelly · 2026-05-26T08:14:55Z

Summary

Fix the GEN-side ADP-router desync that crashes all 8 ranks of DSv4-Pro DEP8+DEP8 disagg with one of:

TypeError: RankState.__init__() takes from 2 to 4 positional arguments but 27 were given     # TLLM_METRICS_ALL_RANKS=1, doc-variant 1
TypeError: argument of type 'bool' is not iterable                                           # TLLM_METRICS_ALL_RANKS unset,  doc-variant 2
TypeError: 'int' object is not iterable     at tp_cp_allgather                               # observed in this PR's A/B verify (variant 3)

Repro doc: ape-repo/astra-projects/wwfo/artificial-analysis@DSV4-disagg:docs/knowledge/dsv4_disagg_dep8_adp_router_desync.md.

Root cause

Two conditional tp_allgather collectives in tensorrt_llm/_torch/pyexecutor/py_executor.py live inside per-rank-divergent gates:

_handle_responses line 4438 — tp_allgather(bool(timed_out_requests)). Reached from _process_previous_batch (overlap loop), which is gated on should_process_previous_batch = can_queue or not can_queue_this_rank plus previous_batch is not None. When one ADP rank's scheduled_batch.batch_size == 0 and its previous_batch was cleared the iter before, that rank skips _process_previous_batch while peers enter it. Peers issue the tp_allgather, the empty-batch rank does not — the next collective in the queue (the very next gather_all_rank_states) then receives the bool payload positionally as RankState fields and crashes on every rank.
_append_iter_stats line 1436 — tp_allgather(local_dict) (27-key payload under TLLM_METRICS_ALL_RANKS=1). Same outer gate via _process_iter_stats. With the env var set, the per-rank-divergent branch fires both the timeout gather AND the iter-stats gather, so the misalignment moves by two positions and the 27-key dict lands in the RankState.deserialize slot.

The three observed variants (dict, bool, int) are the same bug, differing only in how many extra collectives the divergent branch issued and which downstream collective consumes the misaligned payload.

Fix

Mirror the existing _flush_pending_transfer_responses pattern: buffer the payload on self and drain at a rank-symmetric flush point.

_handle_responses → push to self._pending_timed_out_requests. New helper _handle_kv_transfer_timeouts_synced() does the ADP-safe drain.
_append_iter_stats → push to self._pending_iter_stats_dict. New helper _flush_iter_stats_synced() does the ADP-safe drain (gathers None from idle ranks so the queue stays aligned).
Both helpers are called from a rank-symmetric position in each executor loop:
- Overlap loop: right after the if previous_batch is not None and should_process_previous_batch block (immediately adjacent to the existing _enqueue_responses([]) sync sibling in the else branch).
- Non-overlap loop: right after the rank-symmetric if can_queue block that calls _handle_responses (outside the if to be defensive against future restructuring).
- PP loop / _handle_executed_batch: at the bottom of every call, outside the executed_batch is not None conditional. handle_executed_batches(executed_batch_num) calls this helper an equal number of times on every PP rank (broadcast executed_batch_num).

No new collectives introduced — the existing two are just moved out of the divergent branches to the same flush point the codebase already uses for _enqueue_responses.

reasonsolo · 2026-05-26T10:11:03Z

+        with self.stats_lock:
+            cap = self.max_stats_len * tp_size
+            overflow = max(0, len(self.stats) + len(rank_dicts) - cap)
+            if overflow:


This is awkward, appending all rank_dicts and trimming by cap is much more human readable

This is aligned with origin code, but I think you are right. Fixed.

reasonsolo · 2026-05-26T10:13:39Z

+        _append_iter_stats takes its legacy path and never populates the
+        buffer.
+        """
+        tp_size = getattr(self.dist, "tp_size", 1)


self.dist.tp_size is accessible for sure. Agents use getattr toooo much.

Yes. This function is extracted out of the origin code.
I will fix this.

…er-rank-divergent gates Under ADP + disagg on feat/deepseek_v4, GEN workers desync after a small number of successful requests and all 8 ranks crash inside ADPRouter.gather_all_rank_states with either: TypeError: RankState.__init__() takes from 2 to 4 positional arguments but 27 were given (with TLLM_METRICS_ALL_RANKS=1) TypeError: argument of type 'bool' is not iterable (without TLLM_METRICS_ALL_RANKS=1) Root cause is in py_executor.py, not adp_router.py: two conditional tp_allgather collectives live inside per-rank-divergent gates and misalign the MPI collective queue. 1. _handle_responses:4438-4440 (bool payload) — called from _process_previous_batch which is gated by `should_process_previous_batch = can_queue or not can_queue_this_rank` plus `previous_batch is not None`. When one DP rank has an empty batch but its previous_batch was cleared the iter before, that rank skips _process_previous_batch while peers enter it. Peers issue the tp_allgather, the empty-batch rank does not, and the next collective (the very next gather_all_rank_states) receives the bool payload positionally as RankState fields. 2. _append_iter_stats:1436 (dict payload, ~27 keys) — same outer gate via _process_iter_stats. With TLLM_METRICS_ALL_RANKS=1 the same divergent gate fires this gather AND the timeout gather, so the misalignment lands the 27-key dict in the RankState slot. Fix: mirror the existing _flush_pending_transfer_responses pattern. Buffer the timed-out requests and the per-rank iter-stats dict on self, drain at a rank-symmetric position next to the existing flush. Adds two new helpers: - _handle_kv_transfer_timeouts_synced — replaces the inline ADP branch in _handle_responses - _flush_iter_stats_synced — replaces the inline gather in _append_iter_stats Wired in all three executor loops: - _executor_loop_overlap: after the per-rank-divergent `if previous_batch is not None and should_process_previous_batch` block, alongside the existing _flush_pending_transfer_responses. - _executor_loop (non-overlap): right after the `if can_queue` block that calls _handle_responses (already rank-symmetric, but the buffer-drain symmetry is preserved here too). - _executor_loop_pp / _handle_executed_batch: at the bottom of every call (each rank calls it an equal number of times per outer iter via the rank-0-broadcast executed_batch_num), outside the inner `executed_batch is not None` conditional. Bug report: ape-repo/astra-projects/wwfo/artificial-analysis@DSV4-disagg: docs/knowledge/dsv4_disagg_dep8_adp_router_desync.md Repro: DEP8+DEP8 1p1d DSv4-Pro disagg smoke at concurrency 72; failure fires every iter ~900 once at least one GEN rank's batch_size hits 0. Signed-off-by: Lance Liao <108499334+lancelly@users.noreply.github.com>

lancelly · 2026-05-26T15:02:48Z

/bot run --disable-fail-fast

tensorrt-cicd · 2026-05-26T15:08:22Z

PR_Github #50354 [ run ] triggered by Bot. Commit: 55065c6 Link to invocation

tensorrt-cicd · 2026-05-26T18:01:06Z

PR_Github #50354 [ run ] completed with state FAILURE. Commit: 55065c6
/LLM/main/L0_MergeRequest_PR pipeline #39881 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

Please check the failed tests and fix your PR
If you cannot view the failures, ask the CI triggerer to share details
Once fixed, request an NVIDIA team member to trigger CI again

CI Agent Failure Analysis

Link to invocation

zheyuf · 2026-05-26T18:20:01Z

/bot run --disable-fail-fast

tensorrt-cicd · 2026-05-26T18:26:29Z

PR_Github #50377 [ run ] triggered by Bot. Commit: 55065c6 Link to invocation

zheyuf

LGTM

tensorrt-cicd · 2026-05-26T20:14:51Z

PR_Github #50377 [ run ] completed with state SUCCESS. Commit: 55065c6
/LLM/main/L0_MergeRequest_PR pipeline #39904 completed with status: 'SUCCESS'

CI Report

Link to invocation

github-actions Bot assigned lancelly May 26, 2026

lancelly force-pushed the pr/dsv4-adp-router-allgather-desync branch 4 times, most recently from a716c67 to f062a13 Compare May 26, 2026 09:52

reasonsolo reviewed May 26, 2026

View reviewed changes

reasonsolo approved these changes May 26, 2026

View reviewed changes

lancelly force-pushed the pr/dsv4-adp-router-allgather-desync branch from f062a13 to 55065c6 Compare May 26, 2026 14:58

lancelly marked this pull request as ready for review May 26, 2026 15:02

lancelly requested a review from a team as a code owner May 26, 2026 15:02

lancelly requested review from joyang-nv and removed request for a team May 26, 2026 15:02

zheyuf self-requested a review May 26, 2026 19:58

zheyuf approved these changes May 26, 2026

View reviewed changes

zheyuf merged commit 857df2f into NVIDIA:feat/deepseek_v4 May 26, 2026
7 checks passed

lfr-0531 added the deepseek-v4 label May 28, 2026

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[None][fix] py_executor: lift conditional tp_allgather sites out of per-rank-divergent gates#14560

[None][fix] py_executor: lift conditional tp_allgather sites out of per-rank-divergent gates#14560
zheyuf merged 1 commit into
NVIDIA:feat/deepseek_v4from
lancelly:pr/dsv4-adp-router-allgather-desync

lancelly commented May 26, 2026 •

edited

Loading

Uh oh!

reasonsolo May 26, 2026

Uh oh!

lancelly May 26, 2026

Uh oh!

reasonsolo May 26, 2026

Uh oh!

lancelly May 26, 2026

Uh oh!

lancelly commented May 26, 2026

Uh oh!

tensorrt-cicd commented May 26, 2026

Uh oh!

tensorrt-cicd commented May 26, 2026

Uh oh!

zheyuf commented May 26, 2026

Uh oh!

tensorrt-cicd commented May 26, 2026

Uh oh!

zheyuf left a comment

Uh oh!

tensorrt-cicd commented May 26, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

5 participants

Conversation

lancelly commented May 26, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Root cause

Fix

Uh oh!

reasonsolo May 26, 2026

Choose a reason for hiding this comment

Uh oh!

lancelly May 26, 2026

Choose a reason for hiding this comment

Uh oh!

reasonsolo May 26, 2026

Choose a reason for hiding this comment

Uh oh!

lancelly May 26, 2026

Choose a reason for hiding this comment

Uh oh!

lancelly commented May 26, 2026

Uh oh!

tensorrt-cicd commented May 26, 2026

Uh oh!

tensorrt-cicd commented May 26, 2026

Uh oh!

zheyuf commented May 26, 2026

Uh oh!

tensorrt-cicd commented May 26, 2026

Uh oh!

zheyuf left a comment

Choose a reason for hiding this comment

Uh oh!

tensorrt-cicd commented May 26, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

5 participants

lancelly commented May 26, 2026 •

edited

Loading