
[TRTLLM-12901][fix] cap per-rank max_num_active_requests by max_num_tokens under attention DP #14481

Merged: xwang233 merged 1 commit into NVIDIA:main from xwang233:fix/adp-router-per-rank-token-cap on May 29, 2026



@xwang233
Collaborator

@xwang233 xwang233 commented May 23, 2026

Summary

Under enable_attention_dp=true, the per-rank assert at model_engine.py (total_num_tokens <= max_num_tokens) is the only safeguard against per-rank gen-phase step-token overflow. If max_batch_size * (1 + max_total_draft_tokens) > max_num_tokens, decode steady-state trips the assert and silently deadlocks the rank (HTTP keeps returning 200 OK, no decode tokens; nvbug-6133201).
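The overflow condition is plain arithmetic and can be checked by hand. The sketch below is illustrative, not code from this PR: the function name is hypothetical, and the sample values are those the commit message reports for the nvbug-6133201 config, assuming its `mdl=3` corresponds to `max_total_draft_tokens=3`.

```python
def gen_step_tokens(num_active_requests: int, max_total_draft_tokens: int) -> int:
    """Tokens one rank processes in a decode step: one target token plus
    the draft tokens for every active generation request."""
    return num_active_requests * (1 + max_total_draft_tokens)


# Values reported for the failing config (assuming mdl=3 means 3 draft tokens).
max_batch_size = 128
max_total_draft_tokens = 3
max_num_tokens = 256

worst_case = gen_step_tokens(max_batch_size, max_total_draft_tokens)
overflows = worst_case > max_num_tokens
print(worst_case, overflows)  # 512 True
```

With these values a full batch needs twice the token budget, which is the state in which the per-rank assert would fire.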

This PR tightens PyExecutor.max_num_active_requests at init to min(model_engine.get_max_num_sequences(), max_num_tokens // (1 + max_total_draft_tokens)), mirroring the existing CUDA-graph batch-size derivation at model_engine._filter_cuda_graph_batch_sizes. Per-rank step-token load can no longer exceed max_num_tokens by construction.

  • Correctly-sized configs: no-op (derived cap == base cap).
  • Misconfigured configs: the cap tightens, and a logger.warning at init names the field values involved and the minimum max_num_tokens needed to restore the declared max_batch_size. Deployers can self-diagnose at startup instead of silently running at reduced throughput.
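A minimal sketch of the startup diagnostic described in the two cases above. It is not the PR's code: `check_per_rank_cap` and the message wording are hypothetical, the standard `logging` module stands in for the project's logger, and the base cap is assumed to equal the declared `max_batch_size`.

```python
import logging

logger = logging.getLogger("adp_cap_sketch")


def check_per_rank_cap(base_cap: int, max_num_tokens: int,
                       max_total_draft_tokens: int) -> int:
    """Return the tightened per-rank request cap, warning if it shrank."""
    tokens_per_request = 1 + max_total_draft_tokens
    derived_cap = min(base_cap, max_num_tokens // tokens_per_request)
    if derived_cap < base_cap:
        # Smallest max_num_tokens for which base_cap requests fit in one step.
        min_tokens_needed = base_cap * tokens_per_request
        logger.warning(
            "attention DP: per-rank request cap tightened from %d to %d "
            "(max_num_tokens=%d, max_total_draft_tokens=%d); set "
            "max_num_tokens >= %d to restore the declared batch size",
            base_cap, derived_cap, max_num_tokens, max_total_draft_tokens,
            min_tokens_needed)
    return derived_cap


print(check_per_rank_cap(128, 256, 3))  # 64, and a warning is logged
print(check_per_rank_cap(128, 512, 3))  # 128, no warning
```

The warning carries the number a deployer needs: here, `max_num_tokens >= 512` would restore a batch size of 128 with 3 draft tokens.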

Context-phase prompt tokens remain bounded by the existing cluster-wide scheduler at scheduler.py; this PR addresses only the gen-phase per-rank overflow path.

Test plan

  • 5 unit tests for the cap derivation helper (tests/unittest/_torch/executor/test_request_utils.py::TestDeriveAttentionDpPerRankRequestCap): no-tightening, failing-config, correctly-sized, no-spec-decoding, defensive-clamp.
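The five cases could look roughly like the sketch below. This is not the actual test file: `derive_cap` is a local stand-in with an assumed signature, written to match the behavior described in this PR (floor division by `1 + max_total_draft_tokens`, pass-through when `max_num_tokens` is None, negative draft counts clamped to zero). The mapping of "no-tightening" to the None case is also an assumption.

```python
from typing import Optional


def derive_cap(base_cap: int, max_num_tokens: Optional[int],
               max_total_draft_tokens: int) -> int:
    """Stand-in for the PR's helper; the signature is an assumption."""
    if max_num_tokens is None:
        return base_cap
    draft = max(0, max_total_draft_tokens)  # defensive clamp
    return min(base_cap, max_num_tokens // (1 + draft))


def test_no_tightening_without_token_budget():
    assert derive_cap(128, None, 3) == 128


def test_failing_config_is_tightened():
    # 128 * (1 + 3) = 512 > 256, so the cap drops to 256 // 4.
    assert derive_cap(128, 256, 3) == 64


def test_correctly_sized_config_unchanged():
    assert derive_cap(128, 512, 3) == 128


def test_no_spec_decoding():
    # Without draft tokens each request costs exactly one token per step.
    assert derive_cap(128, 100, 0) == 100


def test_negative_draft_tokens_clamped():
    # A negative draft count is treated as zero rather than dividing by <= 0.
    assert derive_cap(128, 256, -5) == 128


ALL_TESTS = (
    test_no_tightening_without_token_budget,
    test_failing_config_is_tightened,
    test_correctly_sized_config_unchanged,
    test_no_spec_decoding,
    test_negative_draft_tokens_clamped,
)
for test in ALL_TESTS:
    test()
print("all cases pass")
```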

Summary by CodeRabbit

  • New Features

    • Added an automatic request capacity safeguard for attention data parallelism (attention DP) to prevent token buffer overflow and maintain stability under concurrent request loads.
  • Tests

    • Added comprehensive unit tests validating request capacity derivation across multiple configurations and edge cases.


@xwang233 xwang233 requested a review from a team as a code owner May 23, 2026 05:41
@xwang233 xwang233 requested a review from lancelly May 23, 2026 05:41
@xwang233
Collaborator Author

/bot run

@xwang233 xwang233 requested a review from chienchunhung May 23, 2026 05:44
@coderabbitai
Contributor

coderabbitai Bot commented May 23, 2026

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: d47d2912-23c4-4b8e-a1be-5a414b9c90ca

📥 Commits

Reviewing files that changed from the base of the PR and between 9ec3c84 and 2f15a86.

📒 Files selected for processing (3)
  • tensorrt_llm/_torch/pyexecutor/py_executor.py
  • tensorrt_llm/_torch/pyexecutor/request_utils.py
  • tests/unittest/_torch/executor/test_request_utils.py

📝 Walkthrough

This PR introduces a safeguard for Attention DP by adding a helper function that derives a per-rank request capacity cap based on token budget constraints, integrating it into PyExecutor initialization, and validating the logic with comprehensive unit tests.

Changes

Attention DP Per-Rank Request Cap

| Layer / File(s) | Summary |
|---|---|
| Helper function definition and export (tensorrt_llm/_torch/pyexecutor/request_utils.py) | derive_attention_dp_per_rank_request_cap computes a tightened per-rank cap using per-step token arithmetic (max_num_tokens // (1 + max_total_draft_tokens)), clamping negative draft tokens to zero and returning the base cap unchanged when max_num_tokens is None. |
| PyExecutor initialization safeguard (tensorrt_llm/_torch/pyexecutor/py_executor.py) | PyExecutor imports the helper and invokes it during initialization when attention DP is enabled; logs a warning when the derived cap reduces concurrency and tightens self.max_num_active_requests accordingly. |
| Unit tests for request capacity derivation (tests/unittest/_torch/executor/test_request_utils.py) | TestDeriveAttentionDpPerRankRequestCap validates helper behavior across no-op cases (max_num_tokens is None), correct cap tightening including the nvbug-6133201 configuration, unchanged results when already fitting, zero draft-token handling, and negative draft-token clamping. |

🎯 2 (Simple) | ⏱️ ~12 minutes

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
|---|---|---|---|
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 30.00%, which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (4 passed)

| Check name | Status | Explanation |
|---|---|---|
| Title check | ✅ Passed | The title accurately describes the main change: capping per-rank max_num_active_requests by max_num_tokens under attention DP, matching the core fix in PyExecutor initialization. |
| Description check | ✅ Passed | The PR description clearly explains the issue (silent deadlock under enable_attention_dp=true when max_batch_size * (1 + max_total_draft_tokens) > max_num_tokens), the solution (tightening max_num_active_requests at init), and test coverage (5 unit tests); all required sections per template are adequately addressed. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |


@xwang233 xwang233 force-pushed the fix/adp-router-per-rank-token-cap branch from 2f15a86 to 06c176f Compare May 23, 2026 05:46
@tensorrt-cicd
Collaborator

PR_Github #50028 [ run ] triggered by Bot. Commit: 06c176f Link to invocation

@tensorrt-cicd
Collaborator

PR_Github #50028 [ run ] completed with state SUCCESS. Commit: 06c176f
/LLM/main/L0_MergeRequest_PR pipeline #39591 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

CI Agent Failure Analysis

Link to invocation

@xwang233
Collaborator Author

/bot run

@tensorrt-cicd
Collaborator

PR_Github #50366 [ run ] triggered by Bot. Commit: 06c176f Link to invocation

@tensorrt-cicd
Collaborator

PR_Github #50366 [ run ] completed with state SUCCESS. Commit: 06c176f
/LLM/main/L0_MergeRequest_PR pipeline #39893 completed with status: 'SUCCESS'
Pipeline passed with automatic retried tests. Check the rerun report for details.

CI Report

Link to invocation

Collaborator

@achartier achartier left a comment


LGTM

Nitpicking: I think the comments are a bit too verbose

…okens under attention DP

Under ``enable_attention_dp=true`` the per-rank assert at
``model_engine.py`` (``total_num_tokens <= max_num_tokens``) is the
only safeguard against per-rank gen-phase step-token overflow: the
global Python scheduler enforces ``max_num_tokens`` cluster-wide and
the ADP router enforces a per-rank request-count cap, but no
component enforces a per-rank token cap.  Under an arithmetically-
broken config where
``max_batch_size * (1 + max_total_draft_tokens) > max_num_tokens``,
the executor admits enough active gen requests per rank to trip the
assert at decode steady state, which silently deadlocks the rank
(forward step throws but the executor does not propagate, so HTTP
keeps returning 200 OK with no decode tokens).  Originally surfaced
as nvbug-6133201 on a disagg perf YAML with mbs=128, mdl=3,
max_num_tokens=256 (the YAML was fixed in a prior PR; this is the
architectural safety net).

Fix: at ``PyExecutor.__init__``, tighten the per-rank request cap to
the smaller of ``model_engine.get_max_num_sequences()`` and
``max_num_tokens // (1 + max_total_draft_tokens)``.  This mirrors the
existing CUDA graph batch-size derivation in
``model_engine._filter_cuda_graph_batch_sizes`` and uses the same
``self.max_total_draft_tokens`` field, so chain (MTP) and tree
(Eagle/Medusa) spec decoding are both handled correctly.  Per-rank
step-token load ``num_active_requests * (1 + max_total_draft_tokens)``
cannot exceed ``max_num_tokens`` by construction; gen-phase token
overflow at the per-rank assert is arithmetically impossible.
Context-phase prompt tokens remain bounded by the existing cluster-
wide scheduler.

For correctly-sized configs (``max_batch_size * (1 + max_total_draft_tokens)
<= max_num_tokens``) the derived cap equals ``base_cap`` and behavior
is unchanged.  For broken configs the cap tightens and a
``logger.warning`` is emitted at init, pointing at the field values
involved and the minimum ``max_num_tokens`` needed to restore the
declared ``max_batch_size``; the warning lets misconfigured
deployments self-diagnose at startup instead of silently running at
reduced throughput.

Adds 5 unit tests for the cap derivation helper covering the
no-tightening / failing-config / correctly-sized / no-spec-decoding /
defensive-clamp cases.

Signed-off-by: Xiao Wang <24860335+xwang233@users.noreply.github.com>
@xwang233 xwang233 force-pushed the fix/adp-router-per-rank-token-cap branch from 06c176f to 3fb091e Compare May 28, 2026 21:38
@xwang233
Collaborator Author

/bot run

@xwang233 xwang233 enabled auto-merge (squash) May 28, 2026 21:40
@tensorrt-cicd
Collaborator

PR_Github #50894 [ run ] triggered by Bot. Commit: 3fb091e Link to invocation

@tensorrt-cicd
Collaborator

PR_Github #50894 [ run ] completed with state SUCCESS. Commit: 3fb091e
/LLM/main/L0_MergeRequest_PR pipeline #40360 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

CI Agent Failure Analysis

Link to invocation

@xwang233
Collaborator Author

/bot skip --comment "full pipeline passed in https://nv/trt-llm-cicd/job/helpers/job/PR_Github/50366/"

@tensorrt-cicd
Collaborator

PR_Github #51071 [ skip ] triggered by Bot. Commit: 3fb091e Link to invocation

@tensorrt-cicd
Collaborator

PR_Github #51071 [ skip ] completed with state SUCCESS. Commit: 3fb091e
Skipping testing for commit 3fb091e

Link to invocation

@xwang233 xwang233 merged commit 29971b4 into NVIDIA:main May 29, 2026
8 checks passed