[TRTLLM-12901][fix] cap per-rank max_num_active_requests by max_num_tokens under attention DP#14481
Conversation
|
/bot run |
|
No actionable comments were generated in the recent review. 🎉 ℹ️ Recent review info⚙️ Run configurationConfiguration used: Path: .coderabbit.yaml Review profile: CHILL Plan: Enterprise Run ID: 📒 Files selected for processing (3)
📝 WalkthroughWalkthroughThis PR introduces a safeguard for Attention DP by adding a helper function that derives a per-rank request capacity cap based on token budget constraints, integrating it into PyExecutor initialization, and validating the logic with comprehensive unit tests. ChangesAttention DP Per-Rank Request Cap
🎯 2 (Simple) | ⏱️ ~12 minutes 🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Comment |
2f15a86 to
06c176f
Compare
|
PR_Github #50028 [ run ] triggered by Bot. Commit: |
|
PR_Github #50028 [ run ] completed with state
|
|
/bot run |
|
PR_Github #50366 [ run ] triggered by Bot. Commit: |
|
PR_Github #50366 [ run ] completed with state |
achartier
left a comment
There was a problem hiding this comment.
LGTM
Nitpicking: I think the comments are a bit too verbose
…okens under attention DP Under ``enable_attention_dp=true`` the per-rank assert at ``model_engine.py`` (``total_num_tokens <= max_num_tokens``) is the only safeguard against per-rank gen-phase step-token overflow: the global Python scheduler enforces ``max_num_tokens`` cluster-wide and the ADP router enforces a per-rank request-count cap, but no component enforces a per-rank token cap. Under an arithmetically- broken config where ``max_batch_size * (1 + max_total_draft_tokens) > max_num_tokens``, the executor admits enough active gen requests per rank to trip the assert at decode steady state, which silently deadlocks the rank (forward step throws but the executor does not propagate, so HTTP keeps returning 200 OK with no decode tokens). Originally surfaced as nvbug-6133201 on a disagg perf YAML with mbs=128, mdl=3, max_num_tokens=256 (the YAML was fixed in a prior PR; this is the architectural safety net). Fix: at ``PyExecutor.__init__``, tighten the per-rank request cap to the smaller of ``model_engine.get_max_num_sequences()`` and ``max_num_tokens // (1 + max_total_draft_tokens)``. This mirrors the existing CUDA graph batch-size derivation in ``model_engine._filter_cuda_graph_batch_sizes`` and uses the same ``self.max_total_draft_tokens`` field, so chain (MTP) and tree (Eagle/Medusa) spec decoding are both handled correctly. Per-rank step-token load ``num_active_requests * (1 + max_total_draft_tokens)`` cannot exceed ``max_num_tokens`` by construction; gen-phase token overflow at the per-rank assert is arithmetically impossible. Context-phase prompt tokens remain bounded by the existing cluster- wide scheduler. For correctly-sized configs (``max_batch_size * (1 + max_total_draft_tokens) <= max_num_tokens``) the derived cap equals ``base_cap`` and behavior is unchanged. For broken configs the cap tightens and a ``logger.warning`` is emitted at init, pointing at the field values involved and the minimum ``max_num_tokens`` needed to restore the declared ``max_batch_size``; the warning lets misconfigured deployments self-diagnose at startup instead of silently running at reduced throughput. Adds 5 unit tests for the cap derivation helper covering the no-tightening / failing-config / correctly-sized / no-spec-decoding / defensive-clamp cases. Signed-off-by: Xiao Wang <24860335+xwang233@users.noreply.github.com>
06c176f to
3fb091e
Compare
|
/bot run |
|
PR_Github #50894 [ run ] triggered by Bot. Commit: |
|
PR_Github #50894 [ run ] completed with state
|
|
/bot skip --comment "full pipeline passed in https://nv/trt-llm-cicd/job/helpers/job/PR_Github/50366/" |
|
PR_Github #51071 [ skip ] triggered by Bot. Commit: |
|
PR_Github #51071 [ skip ] completed with state |
Summary
Under
enable_attention_dp=true, the per-rank assert atmodel_engine.py(total_num_tokens <= max_num_tokens) is the only safeguard against per-rank gen-phase step-token overflow. Ifmax_batch_size * (1 + max_total_draft_tokens) > max_num_tokens, decode steady-state trips the assert and silently deadlocks the rank (HTTP keeps returning 200 OK, no decode tokens; nvbug-6133201).This PR tightens
PyExecutor.max_num_active_requestsat init tomin(model_engine.get_max_num_sequences(), max_num_tokens // (1 + max_total_draft_tokens)), mirroring the existing CUDA-graph batch-size derivation atmodel_engine._filter_cuda_graph_batch_sizes. Per-rank step-token load can no longer exceedmax_num_tokensby construction.logger.warningat init names the field values involved and the minimummax_num_tokensneeded to restore the declaredmax_batch_size. Deployers self-diagnose at startup instead of silently running at reduced throughput.Context-phase prompt tokens remain bounded by the existing cluster-wide scheduler at
scheduler.py; this PR addresses only the gen-phase per-rank overflow path.Test plan
tests/unittest/_torch/executor/test_request_utils.py::TestDeriveAttentionDpPerRankRequestCap): no-tightening, failing-config, correctly-sized, no-spec-decoding, defensive-clamp.Summary by CodeRabbit
New Features
Tests