Skip to content

[None][fix] py_executor: lift conditional tp_allgather sites out of per-rank-divergent gates#14560

Merged
zheyuf merged 1 commit into
NVIDIA:feat/deepseek_v4from
lancelly:pr/dsv4-adp-router-allgather-desync
May 26, 2026
Merged

[None][fix] py_executor: lift conditional tp_allgather sites out of per-rank-divergent gates#14560
zheyuf merged 1 commit into
NVIDIA:feat/deepseek_v4from
lancelly:pr/dsv4-adp-router-allgather-desync

Conversation

@lancelly
Copy link
Copy Markdown
Collaborator

@lancelly lancelly commented May 26, 2026

Summary

Fix the GEN-side ADP-router desync that crashes all 8 ranks of DSv4-Pro DEP8+DEP8 disagg with one of:

TypeError: RankState.__init__() takes from 2 to 4 positional arguments but 27 were given     # TLLM_METRICS_ALL_RANKS=1, doc-variant 1
TypeError: argument of type 'bool' is not iterable                                           # TLLM_METRICS_ALL_RANKS unset,  doc-variant 2
TypeError: 'int' object is not iterable     at tp_cp_allgather                               # observed in this PR's A/B verify (variant 3)

Repro doc: ape-repo/astra-projects/wwfo/artificial-analysis@DSV4-disagg:docs/knowledge/dsv4_disagg_dep8_adp_router_desync.md.

Root cause

Two conditional tp_allgather collectives in tensorrt_llm/_torch/pyexecutor/py_executor.py live inside per-rank-divergent gates:

  1. _handle_responses line 4438tp_allgather(bool(timed_out_requests)). Reached from _process_previous_batch (overlap loop), which is gated on should_process_previous_batch = can_queue or not can_queue_this_rank plus previous_batch is not None. When one ADP rank's scheduled_batch.batch_size == 0 and its previous_batch was cleared the iter before, that rank skips _process_previous_batch while peers enter it. Peers issue the tp_allgather, the empty-batch rank does not — the next collective in the queue (the very next gather_all_rank_states) then receives the bool payload positionally as RankState fields and crashes on every rank.

  2. _append_iter_stats line 1436tp_allgather(local_dict) (27-key payload under TLLM_METRICS_ALL_RANKS=1). Same outer gate via _process_iter_stats. With the env var set, the per-rank-divergent branch fires both the timeout gather AND the iter-stats gather, so the misalignment moves by two positions and the 27-key dict lands in the RankState.deserialize slot.

The three observed variants (dict, bool, int) are the same bug, differing only in how many extra collectives the divergent branch issued and which downstream collective consumes the misaligned payload.

Fix

Mirror the existing _flush_pending_transfer_responses pattern: buffer the payload on self and drain at a rank-symmetric flush point.

  • _handle_responses → push to self._pending_timed_out_requests. New helper _handle_kv_transfer_timeouts_synced() does the ADP-safe drain.
  • _append_iter_stats → push to self._pending_iter_stats_dict. New helper _flush_iter_stats_synced() does the ADP-safe drain (gathers None from idle ranks so the queue stays aligned).
  • Both helpers are called from a rank-symmetric position in each executor loop:
    • Overlap loop: right after the if previous_batch is not None and should_process_previous_batch block (immediately adjacent to the existing _enqueue_responses([]) sync sibling in the else branch).
    • Non-overlap loop: right after the rank-symmetric if can_queue block that calls _handle_responses (outside the if to be defensive against future restructuring).
    • PP loop / _handle_executed_batch: at the bottom of every call, outside the executed_batch is not None conditional. handle_executed_batches(executed_batch_num) calls this helper an equal number of times on every PP rank (broadcast executed_batch_num).

No new collectives introduced — the existing two are just moved out of the divergent branches to the same flush point the codebase already uses for _enqueue_responses.

@lancelly lancelly force-pushed the pr/dsv4-adp-router-allgather-desync branch 4 times, most recently from a716c67 to f062a13 Compare May 26, 2026 09:52
with self.stats_lock:
cap = self.max_stats_len * tp_size
overflow = max(0, len(self.stats) + len(rank_dicts) - cap)
if overflow:
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is awkward, appending all rank_dicts and trimming by cap is much more human readable

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is aligned with origin code, but I think you are right. Fixed.

_append_iter_stats takes its legacy path and never populates the
buffer.
"""
tp_size = getattr(self.dist, "tp_size", 1)
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

self.dist.tp_size is accessible for sure. Agents use getattr toooo much.

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes. This function is extracted out of the origin code.
I will fix this.

…er-rank-divergent gates

Under ADP + disagg on feat/deepseek_v4, GEN workers desync after a small
number of successful requests and all 8 ranks crash inside
ADPRouter.gather_all_rank_states with either:

  TypeError: RankState.__init__() takes from 2 to 4 positional arguments
             but 27 were given     (with TLLM_METRICS_ALL_RANKS=1)
  TypeError: argument of type 'bool' is not iterable
                                    (without TLLM_METRICS_ALL_RANKS=1)

Root cause is in py_executor.py, not adp_router.py: two conditional
tp_allgather collectives live inside per-rank-divergent gates and
misalign the MPI collective queue.

  1. _handle_responses:4438-4440 (bool payload) — called from
     _process_previous_batch which is gated by
     `should_process_previous_batch = can_queue or not can_queue_this_rank`
     plus `previous_batch is not None`.  When one DP rank has an empty
     batch but its previous_batch was cleared the iter before, that
     rank skips _process_previous_batch while peers enter it.  Peers
     issue the tp_allgather, the empty-batch rank does not, and the
     next collective (the very next gather_all_rank_states) receives
     the bool payload positionally as RankState fields.

  2. _append_iter_stats:1436 (dict payload, ~27 keys) — same outer gate
     via _process_iter_stats.  With TLLM_METRICS_ALL_RANKS=1 the same
     divergent gate fires this gather AND the timeout gather, so the
     misalignment lands the 27-key dict in the RankState slot.

Fix: mirror the existing _flush_pending_transfer_responses pattern.
Buffer the timed-out requests and the per-rank iter-stats dict on
self, drain at a rank-symmetric position next to the existing flush.

Adds two new helpers:
  - _handle_kv_transfer_timeouts_synced — replaces the inline ADP
    branch in _handle_responses
  - _flush_iter_stats_synced — replaces the inline gather in
    _append_iter_stats

Wired in all three executor loops:
  - _executor_loop_overlap: after the per-rank-divergent
    `if previous_batch is not None and should_process_previous_batch`
    block, alongside the existing _flush_pending_transfer_responses.
  - _executor_loop (non-overlap): right after the `if can_queue` block
    that calls _handle_responses (already rank-symmetric, but the
    buffer-drain symmetry is preserved here too).
  - _executor_loop_pp / _handle_executed_batch: at the bottom of
    every call (each rank calls it an equal number of times per outer
    iter via the rank-0-broadcast executed_batch_num), outside the
    inner `executed_batch is not None` conditional.

Bug report:
  ape-repo/astra-projects/wwfo/artificial-analysis@DSV4-disagg:
  docs/knowledge/dsv4_disagg_dep8_adp_router_desync.md

Repro: DEP8+DEP8 1p1d DSv4-Pro disagg smoke at concurrency 72; failure
fires every iter ~900 once at least one GEN rank's batch_size hits 0.

Signed-off-by: Lance Liao <108499334+lancelly@users.noreply.github.com>
@lancelly lancelly force-pushed the pr/dsv4-adp-router-allgather-desync branch from f062a13 to 55065c6 Compare May 26, 2026 14:58
@lancelly lancelly marked this pull request as ready for review May 26, 2026 15:02
@lancelly lancelly requested a review from a team as a code owner May 26, 2026 15:02
@lancelly lancelly requested review from joyang-nv and removed request for a team May 26, 2026 15:02
@lancelly
Copy link
Copy Markdown
Collaborator Author

/bot run --disable-fail-fast

@tensorrt-cicd
Copy link
Copy Markdown
Collaborator

PR_Github #50354 [ run ] triggered by Bot. Commit: 55065c6 Link to invocation

@tensorrt-cicd
Copy link
Copy Markdown
Collaborator

PR_Github #50354 [ run ] completed with state FAILURE. Commit: 55065c6
/LLM/main/L0_MergeRequest_PR pipeline #39881 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

CI Agent Failure Analysis

Link to invocation

@zheyuf
Copy link
Copy Markdown
Collaborator

zheyuf commented May 26, 2026

/bot run --disable-fail-fast

@tensorrt-cicd
Copy link
Copy Markdown
Collaborator

PR_Github #50377 [ run ] triggered by Bot. Commit: 55065c6 Link to invocation

@zheyuf zheyuf self-requested a review May 26, 2026 19:58
Copy link
Copy Markdown
Collaborator

@zheyuf zheyuf left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@tensorrt-cicd
Copy link
Copy Markdown
Collaborator

PR_Github #50377 [ run ] completed with state SUCCESS. Commit: 55065c6
/LLM/main/L0_MergeRequest_PR pipeline #39904 completed with status: 'SUCCESS'

CI Report

Link to invocation

@zheyuf zheyuf merged commit 857df2f into NVIDIA:feat/deepseek_v4 May 26, 2026
7 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants