[TRTLLM-12901][fix] cap per-rank max_num_active_requests by max_num_tokens under attention DP by xwang233 · Pull Request #14481 · NVIDIA/TensorRT-LLM

xwang233 · 2026-05-23T05:41:31Z

Summary

Under `enable_attention_dp=true`, the per-rank assert in `model_engine.py` (`total_num_tokens <= max_num_tokens`) is the only safeguard against per-rank gen-phase step-token overflow. If `max_batch_size * (1 + max_total_draft_tokens) > max_num_tokens`, decode steady state trips the assert and the rank silently deadlocks: HTTP keeps returning 200 OK but no decode tokens are produced (nvbug-6133201).

This PR tightens `PyExecutor.max_num_active_requests` at init to `min(model_engine.get_max_num_sequences(), max_num_tokens // (1 + max_total_draft_tokens))`, mirroring the existing CUDA-graph batch-size derivation in `model_engine._filter_cuda_graph_batch_sizes`. By construction, the per-rank step-token load can no longer exceed `max_num_tokens`.
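The derivation can be sketched as a small pure function. This is a minimal sketch under stated assumptions, not the PR's code: the name `sketch_per_rank_request_cap` and its parameter names are hypothetical, and the `None` passthrough and negative-draft clamp follow the behavior described in the CodeRabbit walkthrough rather than the actual source.

```python
from typing import Optional


def sketch_per_rank_request_cap(
    base_cap: int,
    max_num_tokens: Optional[int],
    max_total_draft_tokens: int,
) -> int:
    """Hypothetical stand-in for the cap derivation described in this PR."""
    if max_num_tokens is None:
        # No token budget configured: leave the base cap untouched.
        return base_cap
    # Each active gen request contributes one target token plus its draft
    # tokens per step; a negative draft count is clamped to zero.
    tokens_per_request = 1 + max(0, max_total_draft_tokens)
    return min(base_cap, max_num_tokens // tokens_per_request)
```

Because the result is a floor division capped by `base_cap`, the derived cap multiplied by `tokens_per_request` can never exceed `max_num_tokens`, which is the property the PR relies on.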

- Correctly-sized configs: no-op (derived cap == base cap).
- Misconfigured configs: the cap tightens and a `logger.warning` at init names the field values involved and the minimum `max_num_tokens` needed to restore the declared `max_batch_size`. Deployers self-diagnose at startup instead of silently running at reduced throughput.
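The two cases above can be illustrated with a sketch of the startup check. Everything here is an assumption for illustration: the function name, the message wording, and the simplification that the base cap equals `max_batch_size` are hypothetical and are not taken from the PR's source.

```python
from typing import Optional


def sketch_tightening_warning(
    max_batch_size: int,
    max_num_tokens: int,
    max_total_draft_tokens: int,
) -> Optional[str]:
    """Return a warning string if the token budget tightens the cap, else None."""
    tokens_per_request = 1 + max(0, max_total_draft_tokens)
    derived_cap = min(max_batch_size, max_num_tokens // tokens_per_request)
    if derived_cap >= max_batch_size:
        return None  # correctly sized: no-op, nothing to report
    # Smallest max_num_tokens at which the declared max_batch_size fits again.
    required = max_batch_size * tokens_per_request
    return (
        f"attention DP: per-rank request cap tightened from {max_batch_size} "
        f"to {derived_cap} (max_num_tokens={max_num_tokens}, "
        f"max_total_draft_tokens={max_total_draft_tokens}); set "
        f"max_num_tokens >= {required} to restore max_batch_size={max_batch_size}"
    )
```

In a real executor the returned string would be passed to the logger at init; returning it here keeps the sketch free of logging setup and easy to test.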

Context-phase prompt tokens remain bounded by the existing cluster-wide scheduler at scheduler.py; this PR addresses only the gen-phase per-rank overflow path.

Test plan

5 unit tests for the cap derivation helper (tests/unittest/_torch/executor/test_request_utils.py::TestDeriveAttentionDpPerRankRequestCap): no-tightening, failing-config, correctly-sized, no-spec-decoding, defensive-clamp.
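A table-driven sketch of what such tests might look like is shown below. The stand-in function `cap_under_test`, the class name, and the input values are all illustrative assumptions; the real tests live in `TestDeriveAttentionDpPerRankRequestCap` and may use different inputs and one method per case.

```python
import unittest
from typing import Optional


def cap_under_test(base_cap: int, max_num_tokens: Optional[int],
                   max_total_draft_tokens: int) -> int:
    # Hypothetical stand-in for the real helper, following the PR description.
    if max_num_tokens is None:
        return base_cap
    return min(base_cap, max_num_tokens // (1 + max(0, max_total_draft_tokens)))


class TestCapSketch(unittest.TestCase):
    # (case name, base cap, max_num_tokens, draft tokens, expected cap)
    CASES = [
        ("no_tightening", 128, None, 3, 128),
        ("failing_config", 128, 256, 3, 64),
        ("correctly_sized", 64, 256, 3, 64),
        ("no_spec_decoding", 128, 256, 0, 128),
        ("defensive_clamp", 128, 256, -1, 128),
    ]

    def test_cases(self):
        for name, base, tokens, drafts, expected in self.CASES:
            with self.subTest(name):
                self.assertEqual(cap_under_test(base, tokens, drafts), expected)
```

Saved to a file, the sketch can be run with `python -m unittest <module>`.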

Summary by CodeRabbit

New Features
- Added automatic request capacity safeguard for Attention Distributed Parallel processing to prevent token buffer overflow and maintain stability under concurrent request loads.
Tests
- Added comprehensive unit tests validating request capacity derivation across multiple configurations and edge cases.

xwang233 · 2026-05-23T05:43:08Z

/bot run

coderabbitai · 2026-05-23T05:44:46Z

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: d47d2912-23c4-4b8e-a1be-5a414b9c90ca

📥 Commits

Reviewing files that changed from the base of the PR and between 9ec3c84 and 2f15a86.

📒 Files selected for processing (3)

tensorrt_llm/_torch/pyexecutor/py_executor.py
tensorrt_llm/_torch/pyexecutor/request_utils.py
tests/unittest/_torch/executor/test_request_utils.py

📝 Walkthrough

This PR introduces a safeguard for Attention DP by adding a helper function that derives a per-rank request capacity cap based on token budget constraints, integrating it into PyExecutor initialization, and validating the logic with comprehensive unit tests.

Changes

Attention DP Per-Rank Request Cap

| Layer / File(s) | Summary |
| --- | --- |
| Helper function definition and export: `tensorrt_llm/_torch/pyexecutor/request_utils.py` | `derive_attention_dp_per_rank_request_cap` computes a tightened per-rank cap using per-step token arithmetic (`max_num_tokens // (1 + max_total_draft_tokens)`), clamping negative draft tokens to zero and returning the base cap unchanged when `max_num_tokens` is `None`. |
| PyExecutor initialization safeguard: `tensorrt_llm/_torch/pyexecutor/py_executor.py` | PyExecutor imports the helper and invokes it during initialization when attention DP is enabled; logs a warning when the derived cap reduces concurrency and tightens `self.max_num_active_requests` accordingly. |
| Unit tests for request capacity derivation: `tests/unittest/_torch/executor/test_request_utils.py` | `TestDeriveAttentionDpPerRankRequestCap` validates helper behavior across no-op cases (`max_num_tokens` is `None`), correct cap tightening including nvbug-6133201 configuration, unchanged results when already fitting, zero draft-token handling, and negative draft-token clamping. |

🎯 2 (Simple) | ⏱️ ~12 minutes

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 30.00% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (4 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Title check | ✅ Passed | The title accurately describes the main change: capping per-rank max_num_active_requests by max_num_tokens under attention DP, matching the core fix in PyExecutor initialization. |
| Description check | ✅ Passed | The PR description clearly explains the issue (silent deadlock under enable_attention_dp=true when max_batch_size * (1 + max_total_draft_tokens) > max_num_tokens), the solution (tightening max_num_active_requests at init), and test coverage (5 unit tests); all required sections per template are adequately addressed. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |

tensorrt-cicd · 2026-05-23T05:49:20Z

PR_Github #50028 [ run ] triggered by Bot. Commit: 06c176f Link to invocation

tensorrt-cicd · 2026-05-23T09:00:39Z

PR_Github #50028 [ run ] completed with state SUCCESS. Commit: 06c176f
/LLM/main/L0_MergeRequest_PR pipeline #39591 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

- Please check the failed tests and fix your PR
- If you cannot view the failures, ask the CI triggerer to share details
- Once fixed, request an NVIDIA team member to trigger CI again

CI Agent Failure Analysis

Link to invocation

xwang233 · 2026-05-26T16:23:14Z

/bot run

tensorrt-cicd · 2026-05-26T16:29:55Z

PR_Github #50366 [ run ] triggered by Bot. Commit: 06c176f Link to invocation

tensorrt-cicd · 2026-05-26T23:19:44Z

PR_Github #50366 [ run ] completed with state SUCCESS. Commit: 06c176f
/LLM/main/L0_MergeRequest_PR pipeline #39893 completed with status: 'SUCCESS'
Pipeline passed with automatic retried tests. Check the rerun report for details.

CI Report

Link to invocation

achartier

LGTM

Nitpicking: I think the comments are a bit too verbose

…okens under attention DP

Under ``enable_attention_dp=true`` the per-rank assert at ``model_engine.py`` (``total_num_tokens <= max_num_tokens``) is the only safeguard against per-rank gen-phase step-token overflow: the global Python scheduler enforces ``max_num_tokens`` cluster-wide and the ADP router enforces a per-rank request-count cap, but no component enforces a per-rank token cap. Under an arithmetically-broken config where ``max_batch_size * (1 + max_total_draft_tokens) > max_num_tokens``, the executor admits enough active gen requests per rank to trip the assert at decode steady state, which silently deadlocks the rank (forward step throws but the executor does not propagate, so HTTP keeps returning 200 OK with no decode tokens). Originally surfaced as nvbug-6133201 on a disagg perf YAML with mbs=128, mdl=3, max_num_tokens=256 (the YAML was fixed in a prior PR; this is the architectural safety net).

Fix: at ``PyExecutor.__init__``, tighten the per-rank request cap to the smaller of ``model_engine.get_max_num_sequences()`` and ``max_num_tokens // (1 + max_total_draft_tokens)``. This mirrors the existing CUDA graph batch-size derivation in ``model_engine._filter_cuda_graph_batch_sizes`` and uses the same ``self.max_total_draft_tokens`` field, so chain (MTP) and tree (Eagle/Medusa) spec decoding are both handled correctly.

Per-rank step-token load ``num_active_requests * (1 + max_total_draft_tokens)`` cannot exceed ``max_num_tokens`` by construction; gen-phase token overflow at the per-rank assert is arithmetically impossible. Context-phase prompt tokens remain bounded by the existing cluster-wide scheduler.

For correctly-sized configs (``max_batch_size * (1 + max_total_draft_tokens) <= max_num_tokens``) the derived cap equals ``base_cap`` and behavior is unchanged. For broken configs the cap tightens and a ``logger.warning`` is emitted at init, pointing at the field values involved and the minimum ``max_num_tokens`` needed to restore the declared ``max_batch_size``; the warning lets misconfigured deployments self-diagnose at startup instead of silently running at reduced throughput.

Adds 5 unit tests for the cap derivation helper covering the no-tightening / failing-config / correctly-sized / no-spec-decoding / defensive-clamp cases.

Signed-off-by: Xiao Wang <24860335+xwang233@users.noreply.github.com>
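As a worked check of the arithmetic for the reported configuration, the sketch below compares the per-rank step-token load before and after capping. It assumes that "mbs" is `max_batch_size` and that "mdl" corresponds to `max_total_draft_tokens` = 3 for this config; the helper name `step_tokens` and the constants are illustrative, not taken from the source.

```python
def step_tokens(num_active_requests: int, max_total_draft_tokens: int) -> int:
    # Per-rank gen-phase load: one target token plus draft tokens per request.
    return num_active_requests * (1 + max_total_draft_tokens)


MAX_BATCH_SIZE = 128   # "mbs" in the commit message
DRAFT_TOKENS = 3       # "mdl"; assumed equal to max_total_draft_tokens here
MAX_NUM_TOKENS = 256

# Without the cap, a full batch of gen requests overshoots the token budget.
uncapped = step_tokens(MAX_BATCH_SIZE, DRAFT_TOKENS)

# With the cap, the number of active requests is limited by the token budget.
capped_requests = min(MAX_BATCH_SIZE, MAX_NUM_TOKENS // (1 + DRAFT_TOKENS))
capped = step_tokens(capped_requests, DRAFT_TOKENS)

print(f"uncapped: {MAX_BATCH_SIZE} requests -> {uncapped} tokens (limit {MAX_NUM_TOKENS})")
print(f"capped:   {capped_requests} requests -> {capped} tokens (limit {MAX_NUM_TOKENS})")
```

Under these assumptions the uncapped load is 512 tokens against a budget of 256, while the capped configuration admits 64 requests for exactly 256 tokens.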

xwang233 · 2026-05-28T21:40:45Z

/bot run

tensorrt-cicd · 2026-05-28T21:47:24Z

PR_Github #50894 [ run ] triggered by Bot. Commit: 3fb091e Link to invocation

tensorrt-cicd · 2026-05-29T02:44:21Z

PR_Github #50894 [ run ] completed with state SUCCESS. Commit: 3fb091e
/LLM/main/L0_MergeRequest_PR pipeline #40360 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

- Please check the failed tests and fix your PR
- If you cannot view the failures, ask the CI triggerer to share details
- Once fixed, request an NVIDIA team member to trigger CI again

CI Agent Failure Analysis

Link to invocation

xwang233 · 2026-05-29T16:50:12Z

/bot skip --comment "full pipeline passed in https://nv/trt-llm-cicd/job/helpers/job/PR_Github/50366/"

tensorrt-cicd · 2026-05-29T17:00:17Z

PR_Github #51071 [ skip ] triggered by Bot. Commit: 3fb091e Link to invocation

tensorrt-cicd · 2026-05-29T17:08:03Z

PR_Github #51071 [ skip ] completed with state SUCCESS. Commit: 3fb091e
Skipping testing for commit 3fb091e

Link to invocation

xwang233 requested a review from a team as a code owner May 23, 2026 05:41

xwang233 requested a review from lancelly May 23, 2026 05:41

github-actions Bot assigned xwang233 May 23, 2026

xwang233 requested a review from chienchunhung May 23, 2026 05:44

xwang233 force-pushed the fix/adp-router-per-rank-token-cap branch from 2f15a86 to 06c176f Compare May 23, 2026 05:46

achartier approved these changes May 28, 2026

View reviewed changes

xwang233 force-pushed the fix/adp-router-per-rank-token-cap branch from 06c176f to 3fb091e Compare May 28, 2026 21:38

xwang233 enabled auto-merge (squash) May 28, 2026 21:40

xwang233 merged commit 29971b4 into NVIDIA:main May 29, 2026
8 checks passed
