[TRTLLM-12123][feat] Add per-iteration request-aggregate counters to InflightBatchingStats (#13199)
Extend IterationStats with 9 flat top-level fields (scheduled_* and
queued_* for prefill / decode counts and token sums) so consumers such
as Dynamo's planner can build ForwardPassMetrics directly from
LLM.get_stats_async() dicts, closing the TODO at
dynamo/components/src/dynamo/common/forward_pass_metrics.py
("add metrics for TrtLLM/SGLang").
Changes:
* cpp/include/tensorrt_llm/executor/types.h:
Add 9 SizeType32 fields to IterationStats. Variance fields
(var_prefill_length, var_decode_kv_tokens, and their queued
counterparts) are deferred: the live Dynamo planner consumes only
sums on origin/main.
* cpp/tensorrt_llm/executor/{serialization,jsonSerialization}.cpp:
Plumb the new fields through binary serialize / deserialize /
serializedSize and the NLOHMANN_DEFINE_TYPE_NON_INTRUSIVE macro.
* cpp/tensorrt_llm/nanobind/executor/bindings.cpp:
Expose the 9 fields as snake-cased Python attributes.
* tensorrt_llm/_torch/pyexecutor/py_executor.py:
Populate the 9 fields in _update_iter_stats by walking
scheduled_batch.context_requests (context_chunk_size +
context_current_position), generation_requests, paused_requests,
and the executor_request_queue. Attention-DP dummy requests are
excluded. Gated by the existing enable_iter_perf_stats flag.
* _process_iter_stats: when enable_attention_dp and tp_size > 1,
tp_allgather each rank's serialized + tagged (attentionDpRank) dict
across TP siblings and store N-per-iter tagged dicts on rank 0 so
the stats RPC can surface per-attention-DP-rank FPM streams to the
Dynamo planner (which keys its perf model by (worker_id, dp_rank)).
* tensorrt_llm/executor/base_worker.py _stats_serializer:
Accept pre-serialized dict tuples from the attention-DP path;
otherwise inject attentionDpRank=0 so downstream consumers can
always read the field without a missing-key branch.
* tests/unittest/pyexecutor/test_iter_stats_fpm_fields.py (new):
Fixture-driven coverage of the 9-field mapping (prefill, decode,
chunked prefill, prefix-cache hit, preempted decode, queued,
attention-DP dummy exclusion) plus a to_json_str roundtrip.
Semantics mirror vLLM's InstrumentedScheduler in Dynamo
(_extract_scheduled + _compute_queued): semantically equivalent
definitions, accepting numeric divergences from backend-specific
block managers / scheduler budgets.
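The aggregation walk described in the py_executor.py bullet can be sketched with simplified stand-ins. `FakeRequest` and `aggregate_scheduled_prefill` are illustrative names, not TRT-LLM types, and cover only the three scheduled-prefill counters under the stated dummy-exclusion rule:

```python
from dataclasses import dataclass


@dataclass
class FakeRequest:
    # Stand-in for a scheduled context request; field names mirror the
    # accessors named in the commit message above.
    context_chunk_size: int = 0
    context_current_position: int = 0
    is_attention_dp_dummy: bool = False


def aggregate_scheduled_prefill(context_requests):
    """Count scheduled prefill requests and sum their token work,
    skipping attention-DP dummy requests (as the populate loop does)."""
    num = sum_tokens = sum_kv_tokens = 0
    for req in context_requests:
        if req.is_attention_dp_dummy:
            continue
        num += 1
        sum_tokens += req.context_chunk_size           # tokens computed this iteration
        sum_kv_tokens += req.context_current_position  # KV already in place
    return num, sum_tokens, sum_kv_tokens
```

For a chunked 1024-token prompt scheduled as two 512-token chunks plus one dummy, the helper yields two requests, 1024 summed tokens, and 512 pre-existing KV tokens.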
Signed-off-by: Yuewei Na <nv-yna@users.noreply.github.com>
Original FPM populate code read req.context_chunk_size /
req.context_current_position on each scheduled context request, but
_update_iter_stats runs after _update_request_states —
context_requests_last_chunk members have transitioned to
GENERATION_IN_PROGRESS, and the C++ accessors assert 'context phase or
generation init phase' on that state. E2E smoke with Qwen3-0.6B
(8 requests, ISL=1024, OSL=128) showed scheduled_sum_prefill_tokens was
stuck at 0 because every per-request access threw RuntimeError.
Fix:
* For scheduled_sum_prefill_kv_tokens: read py_last_context_chunk, a
  Python-cached (start, end) tuple set in _update_request_states BEFORE
  the state transition. The start offset equals the pre-mutation
  context_current_position.
* For scheduled_sum_prefill_tokens: read from
  self.model_engine.iter_states['num_ctx_tokens'], a model-engine-level
  aggregate already computed before state mutation (existing code path
  used by inflight_batching_stats.num_ctx_tokens).
This preserves the MVP's single-rank and per-rank (attention-DP)
behavior. E2E verification with Qwen3-0.6B now emits
scheduledSumPrefillTokens=1024 during prefill iters, matching the
per-request prompt length.
Test artifact:
~/code/tmp_context/forwardmetric-redesign/20260416/trtllm-only-e2e/fpm_results.json
Signed-off-by: Yuewei Na <nv-yna@users.noreply.github.com>
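A minimal sketch of the first fix above, assuming each request carries the `py_last_context_chunk` (start, end) cache set before the state transition; the helper name is hypothetical:

```python
from types import SimpleNamespace


def prefill_kv_tokens(context_requests):
    """Sum pre-transition KV offsets from the Python-side cache
    (py_last_context_chunk = (start, end)). The start offset equals the
    pre-mutation context_current_position, so no C++ accessor that
    asserts on GENERATION_IN_PROGRESS requests is touched."""
    total = 0
    for req in context_requests:
        start, _end = req.py_last_context_chunk
        total += start
    return total


# Second chunk of a 1024-token prompt split 512/512: start offset is 512.
second_chunk = SimpleNamespace(py_last_context_chunk=(512, 1024))
```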
…t tests with real populate
Remove the tp_allgather-based attention-DP per-rank delivery from the
FPM MVP branch. Ships the 9 flat IterationStats fields for the
non-attention-DP case only; attention-DP per-rank emission becomes a
follow-up with a proper RC2 redesign (piggyback on
adp_router.gather_all_rank_states or dedicated MPI communicator).
Changes:
- py_executor.py: delete the `if self.enable_attention_dp and
self.dist.tp_size > 1 and self.dist.is_first_pp_rank:` branch in
_process_iter_stats that built a dict, called tp_allgather, and
stored (dict, None, None) sentinels. Reduces _process_iter_stats to
a straightforward populate + append. Remove now-unused `import json`.
- base_worker.py: delete the dict-sentinel `isinstance(iteration_stats,
dict)` short-circuit in _stats_serializer. Keep
`stats_dict.setdefault("attentionDpRank", 0)` so the serialized dict
stays self-describing for the future attention-DP follow-up.
- tests/unittest/pyexecutor/test_iter_stats_fpm_fields.py: remove
_FakeDist + test_attention_dp_path_tags_rank_and_short_circuits
(they exercised the deleted branch). Update the _populate helper to
mirror the REAL populate algorithm: read py_last_context_chunk[0]
(primary) + context_current_position (fallback) for prefill-KV, read
model_engine.iter_states["num_ctx_tokens"] (engine-level aggregate)
for prefill-tokens. Add a stub iter_states dict on the mock executor.
Tests now validate the real code path instead of the pre-populate-fix
approximation.
Rationale for the delete:
- The tp_allgather in _process_iter_stats interleaved with
adp_router.gather_all_rank_states on the same TP communicator,
causing MPI collective-ordering desync (RankState(*24 args)
type-confusion). Documented in
~/code/tmp_context/forwardmetric-redesign/20260416/STEPS_14_15_POSTMORTEM.md.
- Under enable_attention_dp=True, the Dynamo adapter side is gated off
separately (see companion commit on dynamo-fpm branch) so no
misleading fake-idle FPM streams reach the Planner.
Signed-off-by: Yuewei Na <nv-yna@users.noreply.github.com>
The prior round-trip test set and asserted only 5 of the 9 flat
IterationStats fields added by this PR. A camelCase typo or a dropped
serializer line in jsonSerialization.cpp on the other 4
(scheduledNumDecodeRequests, scheduledSumDecodeKvTokens,
queuedNumPrefillRequests, queuedSumPrefillTokens) would silently ship.
Expand the test to exercise all 9 keys with distinct values per field,
so a cross-wiring bug in the C++ serializer (e.g. swapping prefill and
decode count slots) fails the assertion rather than round-tripping
cleanly with the wrong semantics.
Test locally:
pytest tests/unittest/pyexecutor/test_iter_stats_fpm_fields.py -v
-> 13 passed
Signed-off-by: Yuewei Na <nv-yna@users.noreply.github.com>
…for FPM fields
Replace the `_populate` shim in `test_iter_stats_fpm_fields.py` with direct
unbound calls to the production `PyExecutor._update_iter_stats`. The prior
shim was a verbatim copy of the populate block, so it could not detect
drift between tests and production. A `MagicMock`-based fake `self` holds
only the attributes the method actually reads (two for the FPM fields:
`executor_request_queue.get_request_queue().queue` and
`model_engine.iter_states`), and `torch.cuda.mem_get_info` is patched so
the test runs without a live CUDA context.
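The unbound-call pattern described above can be illustrated generically; `Worker` and its attributes here are stand-ins, not the real `PyExecutor` (and the `torch.cuda.mem_get_info` patch is omitted so the sketch runs without torch):

```python
from unittest.mock import MagicMock


class Worker:
    def count_queued(self):
        # Production-style method: reads only a couple of attributes of
        # ``self``, which is what makes the unbound-call trick viable.
        return len(self.request_queue.get_queue().queue)


# A MagicMock ``self`` that provides just the attributes the method reads.
fake_self = MagicMock()
fake_self.request_queue.get_queue.return_value.queue = [object()] * 3

# Unbound call: the real method body runs against the fake state, so the
# test exercises production code instead of a copied shim.
queued = Worker.count_queued(fake_self)
```

Because the production method body executes unchanged, any drift in the attributes it reads shows up as a test failure rather than staying hidden in a duplicated shim.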
Also extend `validate_stats` in `test_llm.py` to assert the 9 FPM
camelCase keys on every stats iteration under the pytorch backend. This
exercises the full pipeline (`_update_iter_stats` -> `_stats_serializer` ->
RPC -> `get_stats` / `get_stats_async`) via the existing
`llm_get_stats_test_harness`, so any drop or rename along the chain fails
the test. Prefill-iteration `scheduledSumPrefillTokens > 0` is gated off
under `use_overlap=True` because the populate block sources that field
from the `model_engine.iter_states["num_ctx_tokens"]` side channel, which
can be stale at the moment `_update_iter_stats` runs under overlap
scheduling; all other FPM keys remain strictly asserted.
Local runs (inside the TRT-LLM build container, RTX 4090):
pytest tests/unittest/pyexecutor/test_iter_stats_fpm_fields.py
-> 13 passed
pytest tests/unittest/llmapi/test_llm_pytorch.py::test_llm_get_stats
-> 4 passed (all parametrize combinations)
pytest tests/unittest/llmapi/test_llm_pytorch.py::test_llm_get_stats_async
-> 4 passed (all parametrize combinations)
Signed-off-by: Yuewei Na <nv-yna@users.noreply.github.com>
…verlap-safe)
Under overlap scheduling `_executor_loop_overlap` invokes
`_process_iter_stats` on `self.previous_batch` (py_executor.py:2654-2656),
but `self.model_engine.iter_states["num_ctx_tokens"]` has been
overwritten by the CURRENT iteration's `_forward_step` before the
previous batch's stats are emitted. Reading the side channel therefore
reports the wrong batch's prefill-token count for every emission, and
at iter 0 collapses to 0 (initial dict empty). The Planner reads that as
"worker idle" and can scale down a loaded worker — silent planner poison.
Reproduction before the fix:
pytest tests/unittest/llmapi/test_llm_pytorch.py::test_llm_get_stats
-> 2 failed (use_overlap=True configurations), 2 passed
Fix: drop the side-channel read and compute `scheduled_sum_prefill_tokens`
locally inside the existing context_requests loop by summing
`py_last_context_chunk[1] - py_last_context_chunk[0]` (= chunk_size =
end_compute - begin_compute) across non-dummy context requests. This
mirrors what `_prepare_tp_inputs` computes as `len(input_ids)` but
without depending on when that write happens relative to
`_update_iter_stats`. `py_last_context_chunk` is a Python-side cache
set by `_update_request_states` BEFORE state mutation, so it remains
valid for the previous batch's requests even after they transition to
GENERATION_IN_PROGRESS.
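The replacement computation can be sketched as follows, assuming requests expose `py_last_context_chunk` and `is_attention_dp_dummy` as described; the function name is illustrative:

```python
from types import SimpleNamespace


def sum_prefill_tokens(context_requests):
    """Chunk size per request = end - start of its last context chunk,
    cached before the state transition. Summing this locally is
    overlap-safe: it never reads the engine-level side channel, which
    the current iteration may already have overwritten."""
    return sum(
        req.py_last_context_chunk[1] - req.py_last_context_chunk[0]
        for req in context_requests
        if not req.is_attention_dp_dummy
    )
```

The single dummy guard in the comprehension is what keeps the exclusion semantics aligned with the sibling count and KV-token fields.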
Dummy-exclusion semantics are now consistent across all three
scheduled-prefill fields (count, sum_tokens, sum_kv_tokens) — the
loop's single `if is_attention_dp_dummy: continue` guard drives all of
them. Previously `scheduled_sum_prefill_tokens` came from the engine
aggregate which included dummies while siblings excluded them.
The `test_llm.py::validate_stats` helper drops the `if not use_overlap`
relaxation; the strict `scheduledSumPrefillTokens > 0` assertion now
holds under every scheduler configuration.
`test_iter_states_missing_falls_back_to_zero` is repurposed as a
regression guard: passes a deliberately wrong `iter_states["num_ctx_tokens"]`
value and asserts the result comes from `py_last_context_chunk`, not
the side channel.
Known separate issue, NOT fixed here: `queued_num_prefill_requests` and
`queued_sum_prefill_tokens` still read `self.executor_request_queue`
at stats emission time. Under overlap this produces values from iter
N's state paired with batch N-1's scheduled_* counts — a semantic
drift, not a wrong value. Fixing it requires snapshotting queue state
at batch-prep time and belongs in a dedicated design discussion.
Local runs (inside dynamo-fpm-mvp-build:test with the modified
py_executor.py mounted over the installed copy):
pytest tests/unittest/pyexecutor/test_iter_stats_fpm_fields.py
-> 13 passed
pytest tests/unittest/llmapi/test_llm_pytorch.py::test_llm_get_stats
-> 4 passed (all parametrize combinations, including use_overlap=True)
pytest tests/unittest/llmapi/test_llm_pytorch.py::test_llm_get_stats_async
-> 4 passed (all parametrize combinations, including use_overlap=True)
Signed-off-by: Yuewei Na <nv-yna@users.noreply.github.com>
…onventions
Two related cleanups on the per-iteration request-aggregate counters
added earlier in this PR:
1. Comments made self-contained. References to downstream consumers
and unrelated reference implementations ("ForwardPassMetrics",
"Dynamo", "Dynamo's planner", vLLM's InstrumentedScheduler,
components/src/dynamo/...) are removed from the TRT-LLM source
tree. Comments now describe the behavior and invariants of the
code in front of the reader without requiring familiarity with an
external repo.
2. Default-initialization moved from the struct to the populate path.
The 9 new primitive members on IterationStats were declared with
`{0}` defaults, but every sibling field on the struct is instead
populated unconditionally by the TensorRT-backend populate chain
(Executor::Impl::getCurrentIterationStats → mModel->
getCurrentIterationStats). Breaking that invariant silently exempts
the new members and makes it harder to verify that every field is
reached by the populate chain. Drop the `{0}` defaults and instead
assign the 9 fields to 0 in
Executor::Impl::getCurrentIterationStats before the virtual
mModel callback, so a future TRT-side populate can overwrite and
the struct-level convention is restored. Net behavior unchanged:
TRT backend emits 0 for these fields (the backend does not track
them); the PyTorch backend populates them via its own path.
Files:
- cpp/include/tensorrt_llm/executor/types.h — drop {0} defaults,
rewrite section comment.
- cpp/tensorrt_llm/executor/executorImpl.cpp — add 9 zero
assignments with a comment explaining the invariant.
- tensorrt_llm/_torch/pyexecutor/py_executor.py — comment rewrite.
- tests/unittest/pyexecutor/test_iter_stats_fpm_fields.py —
docstring/comment rewrite.
- tests/unittest/llmapi/test_llm.py — comment rewrite.
Signed-off-by: Yuewei Na <nv-yna@users.noreply.github.com>
The 9 top-level flat fields added earlier in this PR had two systemic
issues:
1. Five of the nine were semantic duplicates of existing fields already
   tracked elsewhere in IterationStats or InflightBatchingStats.
   Keeping them meant two fields answered the same question with
   slightly different dummy-filtering semantics, inviting ambiguity for
   downstream consumers.
2. The naming vocabulary and word order didn't match the rest of the
   file. The struct uses Context/Ctx/Gen/Paused and num-prefix word
   order (numContextRequests, numCtxTokens); the new fields introduced
   Prefill/Decode and inverted the word order
   (scheduledNumPrefillRequests).
Cleanup:
- Delete the 9 top-level IterationStats members introduced in commit
  086f692.
- Add 5 members to InflightBatchingStats using the file's native
  vocabulary and word order:
  numCtxPrecomputedTokens — tokens read from prior state (prefix-cache
    hits + previously chunked tokens) across scheduled context
    requests; pairs with numCtxTokens (tokens computed this iteration).
  numGenKvTokens — total KV context summed across scheduled generation
    requests.
  numQueuedContextRequests — normal-only requests in the executor
    request queue (filters shutdown/cancel control items).
  numQueuedCtxTokens — prompt tokens for the above.
  numPausedKvTokens — total KV context across paused (preempted-decode)
    requests.
All serialization paths (binary, NLOHMANN JSON, nanobind) are updated
in lockstep so the wire format is self-consistent. Also removes the 9
explicit zero-assignments in Executor::Impl::getCurrentIterationStats
that were added in commit 9bd9133 as compensation for the old
IterationStats field set. The new fields live inside
InflightBatchingStats, which is value-initialized at its model-populate
construction site (InflightBatchingStats modelStats{} in
trtGptModelInflightBatching.cpp), so TRT-backend stats emissions get
zero sentinels automatically.
The populate block in PyExecutor._update_iter_stats collapses — 5 loops
feed 5 new InflightBatchingStats fields; 4 duplicate top-level
assignment targets go away entirely (replaced by pre-existing
InflightBatchingStats assignments at py_executor.py:1128-1134, which
handle num_context_requests / num_gen_requests / num_paused_requests /
num_scheduled_requests from the scheduled_batch directly).
Tests updated: test_iter_stats_fpm_fields.py now asserts on the nested
InflightBatchingStats fields and on existing sibling fields where the
old top-level assertions mapped. test_llm.py::validate_stats now reads
the new keys from ifbStats instead of the removed top-level keys, and
uses numCtxTokens > 0 on prefill iterations (overlap-safe, set after
_forward_step).
Local validation: Python syntax check clean; clang-format clean on all
5 cpp files. Tests against the current container .so fail with
AttributeError on the new nested field names (expected — the stale
binding has the old schema). A local TRT-LLM wheel rebuild will land
the updated nanobind binding and unblock e2e validation.
Signed-off-by: Yuewei Na <nv-yna@users.noreply.github.com>
- py_executor.py: drop the overlap-safety paragraph (no longer
  interesting to users); inline per-field rationale directly above each
  computation loop; keep the short high-level intro.
- py_executor.py: restore the single-expression form of
  self._append_iter_stats(self._update_iter_stats(...)) in
  _process_iter_stats; the intermediate iter_stats local variable was
  unrelated drift from this PR's refactors.
- base_worker.py: drop the cross-repo pointer "(see Dynamo publisher.py
  gate)" from the attentionDpRank comment so the comment remains
  self-contained in TRT-LLM.
Signed-off-by: Yuewei Na <nv-yna@users.noreply.github.com>
…_populate.py
The old name referenced Dynamo's ForwardPassMetrics vocabulary, but
these tests now cover the populate block in
``PyExecutor._update_iter_stats`` generally — specifically the extra
members on ``InflightBatchingStats`` written by that block. Rename to
match the scope and drop the out-of-tree concept.
Pure rename; no content change.
Signed-off-by: Yuewei Na <nv-yna@users.noreply.github.com>
Pure rename across all 7 touched files — C++ struct, both serializers,
nanobind binding, Python populate, and both test files. The snake_case
counterpart is renamed to ``num_ctx_kv_tokens``. Matches the `KvTokens`
pattern already used for the sibling `numGenKvTokens` /
`numPausedKvTokens` fields on the same struct — the symmetry makes the
three KV-token fields group visually in the struct.
Signed-off-by: Yuewei Na <nv-yna@users.noreply.github.com>
/bot run --disable-fail-fast
Actionable comments posted: 1
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
cpp/tensorrt_llm/nanobind/executor/bindings.cpp (1)
2-2: ⚠️ Potential issue | 🟡 Minor
Update SPDX copyright year for this modified file.
This file was modified, but the header still ends at 2025. Please bump
it to the current modification year.
Suggested fix:
- * SPDX-FileCopyrightText: Copyright (c) 2022-2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+ * SPDX-FileCopyrightText: Copyright (c) 2022-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
As per coding guidelines: "Add NVIDIA copyright header on ALL new files
and update year on modified files" and "All TensorRT-LLM source files
must contain an NVIDIA copyright header with the year of latest
meaningful modification".
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@cpp/tensorrt_llm/nanobind/executor/bindings.cpp` at line 2, The SPDX header in the top of cpp/tensorrt_llm/nanobind/executor/bindings.cpp still lists copyright years ending at 2025; update the SPDX comment to include the current modification year (bump 2025 to the current year) so the NVIDIA copyright header reflects the latest meaningful modification.
🧹 Nitpick comments (1)
tests/unittest/pyexecutor/test_iter_stats_populate.py (1)
91-95: Let the queue stub model payload-less entries too.
`PyExecutor._update_iter_stats()` has an explicit `item.request is None`
skip path, but `_StubQueueItem` always fabricates a request object.
That leaves the "payload-less requests are excluded" contract untested.
Suggested test helper tweak:
 class _StubQueueItem:
-    def __init__(self, input_token_ids, is_normal_request=True):
-        self.request = types.SimpleNamespace(input_token_ids=input_token_ids)
+    def __init__(
+        self,
+        input_token_ids=None,
+        *,
+        is_normal_request=True,
+        has_request=True,
+    ):
+        self.request = (
+            types.SimpleNamespace(input_token_ids=input_token_ids)
+            if has_request
+            else None
+        )
         self.is_normal_request = is_normal_request
Then add a small assertion that `has_request=False` leaves both queued
counters unchanged.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tests/unittest/pyexecutor/test_iter_stats_populate.py` around lines 91 - 95, Modify the test stub and assertions so payload-less queue entries can be simulated: update _StubQueueItem to allow request=None (or add a parameter like has_request=False) instead of always creating types.SimpleNamespace, then in the test that covers PyExecutor._update_iter_stats create an item with no request and assert that both queued counters remain unchanged; reference the _StubQueueItem constructor and the PyExecutor._update_iter_stats call when making the change and add the extra assertion that has_request=False leaves queued counters unchanged.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@tensorrt_llm/_torch/pyexecutor/py_executor.py`:
- Around line 1234-1247: The counters currently increment
num_queued_context_requests before verifying the request payload, so move the
payload validation so that you only increment num_queued_context_requests when
the item has a usable payload: check getattr(item, "is_normal_request", False)
and that item.request exists and that item.request.input_token_ids is present
and consumable (e.g., wrapped in the existing try/except that computes
len(item.request.input_token_ids)); only on successful token-length computation
increment both num_queued_ctx_tokens (by the length) and
num_queued_context_requests. Update the loop around
executor_request_queue.get_request_queue().queue to perform the payload
existence/len check first and skip incrementing the request counter when
input_token_ids is missing or raises in the try block.
---
Outside diff comments:
In `@cpp/tensorrt_llm/nanobind/executor/bindings.cpp`:
- Line 2: The SPDX header in the top of
cpp/tensorrt_llm/nanobind/executor/bindings.cpp still lists copyright years
ending at 2025; update the SPDX comment to include the current modification year
(bump 2025 to the current year) so the NVIDIA copyright header reflects the
latest meaningful modification.
---
Nitpick comments:
In `@tests/unittest/pyexecutor/test_iter_stats_populate.py`:
- Around line 91-95: Modify the test stub and assertions so payload-less queue
entries can be simulated: update _StubQueueItem to allow request=None (or add a
parameter like has_request=False) instead of always creating
types.SimpleNamespace, then in the test that covers
PyExecutor._update_iter_stats create an item with no request and assert that
both queued counters remain unchanged; reference the _StubQueueItem constructor
and the PyExecutor._update_iter_stats call when making the change and add the
extra assertion that has_request=False leaves queued counters unchanged.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro Plus
Run ID: 3bb2f0bc-5ed0-4c25-b0a3-926ad864b261
📒 Files selected for processing (8)
cpp/include/tensorrt_llm/executor/types.h
cpp/tensorrt_llm/executor/jsonSerialization.cpp
cpp/tensorrt_llm/executor/serialization.cpp
cpp/tensorrt_llm/nanobind/executor/bindings.cpp
tensorrt_llm/_torch/pyexecutor/py_executor.py
tensorrt_llm/executor/base_worker.py
tests/unittest/llmapi/test_llm.py
tests/unittest/pyexecutor/test_iter_stats_populate.py
PR_Github #44807 [ run ] triggered by Bot.
brb-nv
left a comment
Left a few questions to consider.
…onStats
Follow-up work is documented on TRTLLM-12123. The stub
stats_dict.setdefault("attentionDpRank", 0) keeps downstream consumers
able to read the field unconditionally today.
Signed-off-by: Yuewei Na <nv-yna@users.noreply.github.com>
Two regression guards in PyTorchModelEngineTestCase, placed next to the
existing test_pad_generation_requests. They pin the invariant that
CUDAGraphRunner.pad_batch strips every is_cuda_graph_dummy=True entry
from scheduled_requests.generation_requests before its `with` block
exits — on both clean-exit and exception paths. Downstream consumers of
scheduled_batch.generation_requests rely on this, including the new FPM
populate block added earlier in this PR
(tensorrt_llm/_torch/pyexecutor/py_executor.py:1220); those callers
filter only is_attention_dp_dummy and trust pad_batch to handle the
cudagraph kind upstream.
Negative-control verified locally: replacing the `finally` slice with
`pass` makes both tests fail with 'AssertionError: 8 != 5' (5 real
requests + 3 padding dummies retained).
Signed-off-by: Yuewei Na <nv-yna@users.noreply.github.com>
…t comments
Line-number references (cuda_graph_runner.py:524, py_executor.py:1220)
drift easily under refactors; the comment still conveys intent without
them. 'FPM' is a Dynamo-side term; describe the dependent code as 'the
per-iteration stats populate block' so the comment stands on its own
within TRT-LLM.
Signed-off-by: Yuewei Na <nv-yna@users.noreply.github.com>
…, narrow except, warn on drift
Addresses CodeRabbit's review comment on PyExecutor._update_iter_stats:
- Move the num_queued_context_requests increment AFTER the input-token
  length check so both counters move in lockstep. Downstream consumers
  that compute avg = num_queued_ctx_tokens /
  num_queued_context_requests can no longer see a deflated ratio if a
  future API change makes input_token_ids unreadable.
- Narrow the bare 'except Exception' to (AttributeError, TypeError);
  silences Ruff S110 / BLE001 and stops masking unexpected errors.
- Drop the defensive getattr() on item.is_normal_request; item is
  always a RequestQueueItem (the queue type is
  queue.Queue[RequestQueueItem]), so the fallback was dead code.
- Emit logger.warning on the skip branch so future API drift surfaces
  instead of being silently absorbed.
No behavior change on the current API: the C++ ExecutorRequest
constructor requires non-empty input_token_ids, so the except branch is
unreachable today.
Signed-off-by: Yuewei Na <nv-yna@users.noreply.github.com>
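The reordered counting logic can be sketched as follows; the function and item attributes are illustrative stand-ins for the real `_update_iter_stats` loop over the executor request queue:

```python
import logging
from types import SimpleNamespace

logger = logging.getLogger(__name__)


def count_queued_prefill(queue_items):
    """Increment the request counter only after the token length is
    readable, so both counters move in lockstep and a downstream
    average (tokens / requests) cannot be deflated by skipped items."""
    num_requests = sum_tokens = 0
    for item in queue_items:
        if not item.is_normal_request or item.request is None:
            continue
        try:
            n = len(item.request.input_token_ids)
        except (AttributeError, TypeError):
            # Narrowed from a bare Exception; surface drift loudly.
            logger.warning("queued request without readable input_token_ids; skipped")
            continue
        num_requests += 1
        sum_tokens += n
    return num_requests, sum_tokens
```

An unreadable item now skips both counters instead of inflating the request count alone.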
…ed InflightBatchingStats fields
Addresses the disagg-decode routing concern raised on PR 13199: the
queued-context loop in PyExecutor._update_iter_stats currently lumps
REQUEST_TYPE_GENERATION_ONLY queued requests (disagg-decode engine,
awaiting KV transfer from a prefill engine) into
numQueuedContextRequests / numQueuedCtxTokens. Autoscaling consumers
read that as 'prefill backlog deep' on a decode engine, when the real
signal is 'decode work pending KV transfer' — wrong scaling action.
Extend InflightBatchingStats with two new members:
numQueuedGenRequests — count of gen-only queued items
numQueuedGenKvTokens — prompt-token sum across those items
(the KV budget they will occupy once
their KV transfer completes)
Populate block now routes by item.request.request_type:
- CONTEXT_AND_GENERATION (non-disagg default) and CONTEXT_ONLY
(disagg-prefill side) -> queued-context counters.
- GENERATION_ONLY (disagg-decode side) -> queued-gen counters.
Downstream consumers that build Dynamo's ForwardPassMetrics can now
compute a correct queued_requests.num_decode_requests as
numPausedRequests + numQueuedGenRequests (same for the KV-token sum).
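A sketch of the routing, with a stand-in enum in place of the executor's request-type constants and simplified queue items:

```python
from enum import Enum, auto
from types import SimpleNamespace


class RequestType(Enum):  # stand-in for the executor enum
    CONTEXT_AND_GENERATION = auto()
    CONTEXT_ONLY = auto()
    GENERATION_ONLY = auto()


def route_queued(items):
    """Context-bearing queued requests feed the queued-context counters;
    generation-only (disagg-decode) requests feed the queued-gen
    counters, so a decode engine's backlog is not misread as prefill
    depth."""
    ctx_reqs = ctx_tokens = gen_reqs = gen_tokens = 0
    for item in items:
        n = len(item.request.input_token_ids)
        if item.request.request_type is RequestType.GENERATION_ONLY:
            gen_reqs += 1
            gen_tokens += n  # KV budget once the transfer completes
        else:  # CONTEXT_AND_GENERATION or CONTEXT_ONLY
            ctx_reqs += 1
            ctx_tokens += n
    return ctx_reqs, ctx_tokens, gen_reqs, gen_tokens
```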
Wired through the full binding surface: types.h (struct members),
serialization.cpp (binary ser/deser/size), jsonSerialization.cpp
(NLOHMANN macro), nanobind/executor/bindings.cpp (Python properties).
The __init__.pyi stub is generated from the bindings and is
regenerated by the build.
Tests:
* test_iter_stats_populate.py::test_queued_routes_by_request_type
covers all three enum values with distinct token counts, asserts
correct routing across both counter pairs.
* test_empty_iteration and the JSON round-trip test extended to
include the two new fields.
* test_llm.py::validate_stats's 'new_aggregate_keys' tuple extended
to assert presence of the two new keys under the full
get_stats -> get_stats_async RPC pipeline.
Signed-off-by: Yuewei Na <nv-yna@users.noreply.github.com>
/bot run --disable-fail-fast
PR_Github #45368 [ run ] triggered by Bot.
/bot run --disable-fail-fast
PR_Github #45443 [ run ] triggered by Bot.
/bot run --disable-fail-fast
PR_Github #45459 [ run ] triggered by Bot.
PR_Github #45459 [ run ] completed with state
/bot run --disable-fail-fast
PR_Github #45532 [ run ] triggered by Bot.
PR_Github #45532 [ run ] completed with state
Refreshed all 16 files under docs/overview/ to reflect TRT-LLM
v1.3.0rc14 (upstream/main 3b7af1c, ~3 weeks of changes since the
2026-04-06 baseline 2b80f8d), competing frameworks (vLLM v0.20.0,
SGLang v0.5.10.post1, LMCache v0.4.4, NVIDIA Dynamo v1.0.0), and the
current hardware (Vera Rubin in production, AMD MI355X MLPerf 6.0,
Google TPU v7 GA) and academic landscape (GOOSE, StreamServe, PrfaaS,
FlowKV).
Highlights:
- Spec-dec algorithm count corrected 7 to 8 (DFlash added; EAGLE3
  dynamic tree re-enabled; LoRA + spec-dec generic and EAGLE3-specific
  now work).
- Block reuse + overlap scheduler combined (NVIDIA#12816), removing a
  long-standing internal gap.
- First-class lmcache and kvbm KV connectors (NVIDIA#12626).
- Production observability stack (Prometheus NVIDIA#12545, modular
  logger NVIDIA#13202, NvTelemetry NVIDIA#12384, per-iteration
  aggregate counters NVIDIA#13199).
- Disagg reliability fail-fast wave (NVIDIA#13119, NVIDIA#13408,
  NVIDIA#12718, NVIDIA#12888).
- AutoDeploy onboarded DeepSeek-R1, Gemma-4 + 4-31B NVFP4,
  MiniMax-M2.7; standalone-package-ready; legacy EdgeLLM ONNX export
  removed.
- KV cache V2 still default OFF (multiple V2 fixes; tracked via new
  V2-default-on milestone in §06).
- New 4-framework comparison column for Dynamo v1.0; updated feature
  matrix and gap analysis to reflect vLLM v0.20 (FA4 MLA prefill
  default, TurboQuant 2-bit KV, vLLM IR foundation, Model Runner V2).
- Strategic prioritization quadrant chart fully re-ranked; new Tier 1
  items: TTFT re-benchmark vs vLLM v0.20, low-bit KV, MLA prefill
  kernel default, disagg chaos-test harness; new Tier 2: Dynamo
  Snapshot integration, TRT-LLM IR strategy, adaptive spec-dec depth.
Snapshot of pre-refresh content saved at
docs/overview/.snapshots/2026-04-06/. Per-file diff highlights,
priority-shift table, sources, and blocked/skipped notes are in
docs/overview/CHANGELOG.md.
Signed-off-by: Chien-Chun Hung <2679986+chienchunhung@users.noreply.github.com>
NVIDIA/TensorRT-LLM#13199 merged with a different shape than the draft
this PR was originally written against: the new IterationStats fields
live nested inside `inflightBatchingStats` and use names that don't
match the draft's flat top-level keys.
Field-extraction changes (publisher.py):
- _check_fpm_schema reads the nested `inflightBatchingStats` dict and
  fails closed if it is missing or any of the 11 required IBS fields is
  absent. The strict probe protects against silent planner poison when
  running against a TRT-LLM that predates #13199.
- handle_stat reads the 11 IBS fields and composes Dynamo-side fields:
    scheduled_num_prefill_requests  = numContextRequests
    scheduled_sum_prefill_tokens    = numCtxTokens (compute this iter;
      excludes prefix-cache carryover counted in numCtxKvTokens, to
      match vLLM's TTFT-prediction semantics)
    scheduled_sum_prefill_kv_tokens = numCtxKvTokens
    scheduled_num_decode_requests   = numGenRequests
    scheduled_sum_decode_kv_tokens  = numGenKvTokens
    queued_num_prefill_requests     = numQueuedContextRequests
    queued_sum_prefill_tokens       = numQueuedCtxTokens
    queued_num_decode_requests      = numPausedRequests + numQueuedGenRequests
    queued_sum_decode_kv_tokens     = numPausedKvTokens + numQueuedGenKvTokens
  Paused-decode requests are intentionally double-counted with
  numScheduledRequests upstream because the planner reads queued_decode
  as a KV-pressure preemption signal, where paused decodes carry the
  same semantic weight as queued gen-only requests.
- FpmDirectPublisher.publish() call site switched to keyword-only args
  so adjacent ints with similar units (`scheduled_*` vs `queued_*`,
  `*_prefill_*` vs `*_decode_*`) cannot be silently transposed.
- Init-time and cleanup-time `except Exception` narrowed to `except
  RuntimeError` (PyO3 surfaces FpmDirectPublisher new/shutdown failures
  only as PyRuntimeError); the hot-path catch in handle_stat stays
  broad on purpose, with a comment explaining why.
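The composition above can be expressed as a small mapping function. The key names follow the formulas listed; `compose_fpm` itself is an illustrative name, not the real publisher.py API:

```python
def compose_fpm(ibs):
    """Map the nested inflightBatchingStats dict onto flat
    ForwardPassMetrics-style fields. Paused-decode requests fold into
    the queued-decode pair alongside queued gen-only requests."""
    return {
        "scheduled_num_prefill_requests": ibs["numContextRequests"],
        "scheduled_sum_prefill_tokens": ibs["numCtxTokens"],
        "scheduled_sum_prefill_kv_tokens": ibs["numCtxKvTokens"],
        "scheduled_num_decode_requests": ibs["numGenRequests"],
        "scheduled_sum_decode_kv_tokens": ibs["numGenKvTokens"],
        "queued_num_prefill_requests": ibs["numQueuedContextRequests"],
        "queued_sum_prefill_tokens": ibs["numQueuedCtxTokens"],
        "queued_num_decode_requests":
            ibs["numPausedRequests"] + ibs["numQueuedGenRequests"],
        "queued_sum_decode_kv_tokens":
            ibs["numPausedKvTokens"] + ibs["numQueuedGenKvTokens"],
    }
```

A missing IBS key raises KeyError, which matches the fail-closed posture of the schema probe.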
Rust binding changes (lib/bindings/python/rust/llm/fpm.rs):
- FpmDirectPublisher struct field `_inner` renamed to `inner`. The underscore prefix conventionally signals "intentionally unused" to Rust readers, but the field is load-bearing for Drop-driven shutdown of all per-rank serialization tasks; a future cleanup pass would delete it as dead and leak every spawned task. Documented in a comment.
- publish() signature switched to keyword-only via #[pyo3(signature = (*, ...))]; matching update in the _core.pyi stub.

Test changes (test_trtllm_fpm_publisher.py):
- Module-level pytestmark = [unit, trtllm, pre_merge, gpu_0] is picked up by CI marker filters. Without these, the tests were silently skipped by the strict-marker config.
- _build_fake_stat constructs the nested IBS structure matching the real JSON; _invoke_handler mirrors handle_stat's nested extraction and kwargs publish() exactly. A guardrail test (test_invoke_handler_matches_publisher_keyword_set) catches drift.
- The schema-probe parametrize iterates all 11 IBS fields plus the ibs-not-a-dict and missing-ibs-dict edge cases.
- New tests cover the queued-decode composite under disagg-decode (only-queued-gen) and aggregated (only-paused) engine shapes.
- _build_publisher_stub uses the monkeypatch fixture for hermetic state — no module-level rebinds that leak across tests. The blanket try/except in _run_initialize is removed; WorkerMetricsPublisher and KvEventPublisher are stubbed via monkeypatch so initialize() reaches the FPM gate cleanly.

Verification (single-GPU local, non-attention-DP):
- Unit: 28/28 pass.
- Schema probe against real Qwen3-0.6B inside L0_PostMerge build 2697 (commit e903428, the first build containing #13199): all 11 required IBS fields present at the nested path, attentionDpRank tagged at the top level.
- Sustained load (concurrency 16 vs max_batch_size 4, 45 s): 520 stats sampled, max scheduled_num_decode_requests=4 (saturated batch), max queued_num_prefill_requests=3 (queueing observed), 512/520 iterations had a non-zero decode signal.

Known limitation: the TensorRT (legacy) backend leaves IBS at its zero default, so FPM emission against a TRT-engine deployment will be all-zero. The schema probe accepts keys-exist; a value-aware probe is left as a follow-up.

Signed-off-by: Yuewei Na <nv-yna@users.noreply.github.com>
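In Python terms, the keyword-only contract that the `#[pyo3(signature = (*, ...))]` change enforces looks roughly like the following. The real method is implemented in Rust via pyo3, and only a two-parameter subset is shown here for illustration:

```python
class FpmDirectPublisherStub:
    """Illustrative Python equivalent of the keyword-only publish() contract."""

    def publish(self, *, scheduled_num_decode_requests: int,
                queued_num_decode_requests: int) -> None:
        # The bare `*` makes both counters keyword-only: a positional call
        # raises TypeError instead of silently transposing two same-typed
        # ints (scheduled vs queued).
        self.last = (scheduled_num_decode_requests, queued_num_decode_requests)
```

A positional call such as `publish(4, 1)` fails fast with `TypeError`, which is exactly the transposition protection the binding change buys.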
… fields

Closes the "old TRT-LLM" planner-poison gap left by ai-dynamo#8356.

Problem: handle_stat's .publish() call uses stat.get(field, 0) for every required IterationStats field. On a TRT-LLM version that predates NVIDIA/TensorRT-LLM#13199, every field defaults to 0 and the emitted ForwardPassSnapshot is byte-identical to the Rust-side idle heartbeat. The Planner reads that as "worker is idle" and can scale down a worker under real load. Same class of bug as the attention-DP rank 1..N-1 case that ai-dynamo#8356 explicitly gates, but ungated for the old-TRT-LLM path.

Fix: a one-shot schema probe on the first IterationStats delivered to handle_stat. Strict — all 9 fields added by #13199 must be present, else the probe shuts down the publisher and sets self.fpm_publisher = None. The existing `if self.fpm_publisher is not None:` guard then short-circuits all further FPM emission, matching the attention-DP gate's "zero messages" contract.

Scope:
- components/src/dynamo/trtllm/publisher.py:
  - Add _FPM_REQUIRED_STAT_FIELDS (9 fields).
  - Add self._fpm_schema_checked: bool = False in Publisher.__init__.
  - Add the Publisher._check_fpm_schema(stat) method.
  - Wire the probe into handle_stat before the existing FPM publish branch.
- components/src/dynamo/trtllm/tests/test_trtllm_fpm_publisher.py:
  - 7 new tests (15 cases via parametrize), covering: all-fields-present keeps the publisher; each of the 9 fields missing individually disables the publisher; missing-all-fields (legacy TRT-LLM) disables; no-op when fpm_publisher is already None; a shutdown exception still disables; the probe runs only once; a required-fields-tuple-matches-publish guardrail.

No Rust changes. No API surface changes.

Signed-off-by: Yuewei Na <nv-yna@users.noreply.github.com>
Summary
Add 7 new members to `InflightBatchingStats` that expose per-iteration request-aggregate counts and token-weighted totals not covered by the existing `numContextRequests`/`numGenRequests`/`numCtxTokens`/`numPausedRequests`/`numScheduledRequests`. They are populated by the PyTorch backend's `PyExecutor._update_iter_stats`; the TensorRT (legacy) backend leaves them at 0 via value-initialization of the `InflightBatchingStats` struct at its populate site — 0 is the correct "not tracked by this backend" sentinel.

These fields support downstream autoscaling / load-prediction consumers (e.g., Dynamo's Planner) that need richer per-iteration composition data than the existing counters provide, including disaggregated-decode engines whose queue is dominated by generation-only requests awaiting KV transfer.
Scope: non-attention-DP (single-rank + plain TP). True per-rank emission under attention-DP requires an MPI collective-ordering redesign and is deferred.
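As a rough illustration of the populate path, the queued counters could be derived from the executor request queue along these lines. This is not the actual `_update_iter_stats` code: the request-type constants' values, the queue-item dict shape, and the attribute names (`kv_tokens`, `prompt_tokens`) are invented for the sketch; only the counting semantics (skip payload-less control items, split by request type) follow the PR text.

```python
# Hypothetical request-type tags; real TRT-LLM uses its own enum values.
REQUEST_TYPE_CONTEXT_AND_GENERATION = "context_and_generation"
REQUEST_TYPE_CONTEXT_ONLY = "context_only"
REQUEST_TYPE_GENERATION_ONLY = "generation_only"

def count_queued_stats(queue_items):
    """Derive the four queued counters from queue items (sketch)."""
    stats = {"numQueuedContextRequests": 0, "numQueuedCtxTokens": 0,
             "numQueuedGenRequests": 0, "numQueuedGenKvTokens": 0}
    for item in queue_items:
        req = item.get("request")
        if req is None:
            # Non-normal control item (shutdown/cancel): no payload, skip.
            continue
        if req["type"] == REQUEST_TYPE_GENERATION_ONLY:
            # Disagg-decode: prefill completed elsewhere, awaiting KV transfer.
            stats["numQueuedGenRequests"] += 1
            stats["numQueuedGenKvTokens"] += req["kv_tokens"]
        else:
            # Context requests awaiting their first scheduling.
            stats["numQueuedContextRequests"] += 1
            stats["numQueuedCtxTokens"] += req["prompt_tokens"]
    return stats
```

On a non-disaggregated engine the generation-only branch never fires, which is why `numQueuedGenRequests` stays 0 there.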
New `InflightBatchingStats` members
- `numCtxKvTokens`: KV-cache tokens across scheduled context requests, including prefix-cache carryover not counted in `numCtxTokens` (tokens computed this iteration).
- `numGenKvTokens`: KV-cache tokens summed across scheduled generation requests.
- `numQueuedContextRequests`: requests (`REQUEST_TYPE_CONTEXT_AND_GENERATION`/`REQUEST_TYPE_CONTEXT_ONLY`) waiting in `executor_request_queue` for first scheduling; excludes non-normal control items (shutdown/cancel) and requests without a payload.
- `numQueuedCtxTokens`: prompt tokens summed over the requests counted in `numQueuedContextRequests`.
- `numQueuedGenRequests`: requests (`REQUEST_TYPE_GENERATION_ONLY`) waiting in `executor_request_queue`. On a disaggregated-decode engine these are requests that have completed prefill elsewhere and are awaiting KV-cache transfer before they can start decoding. Always 0 on a non-disaggregated or disaggregated-prefill engine.
- `numQueuedGenKvTokens`: KV tokens summed over the requests counted in `numQueuedGenRequests`; acts as the KV budget these requests will occupy once their KV transfer completes.
- `numPausedKvTokens`: KV tokens summed over the requests counted in `numPausedRequests`.

Naming follows the file's existing conventions: `Ctx` for token-count compounds, `Context` for request counts, `Gen` for decode/generation, `Paused` for preempted-decode, `num` prefix first.

Dynamo `ForwardPassMetrics` → TRT-LLM mapping

For downstream readers (e.g., Dynamo's TRT-LLM adapter), here is the full mapping from Dynamo's `ForwardPassMetrics` fields to the TRT-LLM `IterationStats` fields that should be consumed. Four fields reuse existing `InflightBatchingStats` members; seven are new (added in this PR). The two disagg-decode-aware rows compose the new gen-queue fields with the existing `numPaused*` fields so the Dynamo adapter reports a correct "queued decode" signal on both preempted-decode and disagg-decode-awaiting-KV paths.

| `ForwardPassMetrics` field | TRT-LLM `IterationStats` field |
| --- | --- |
| `scheduled_requests.num_prefill_requests` | `inflightBatchingStats.numContextRequests` |
| `scheduled_requests.sum_prefill_tokens` | `inflightBatchingStats.numCtxTokens` |
| `scheduled_requests.sum_prefill_kv_tokens` | `inflightBatchingStats.numCtxKvTokens` |
| `scheduled_requests.num_decode_requests` | `inflightBatchingStats.numGenRequests` |
| `scheduled_requests.sum_decode_kv_tokens` | `inflightBatchingStats.numGenKvTokens` |
| `queued_requests.num_prefill_requests` | `inflightBatchingStats.numQueuedContextRequests` |
| `queued_requests.sum_prefill_tokens` | `inflightBatchingStats.numQueuedCtxTokens` |
| `queued_requests.num_decode_requests` | `inflightBatchingStats.numPausedRequests + numQueuedGenRequests` |
| `queued_requests.sum_decode_kv_tokens` | `inflightBatchingStats.numPausedKvTokens + numQueuedGenKvTokens` |

Variance fields on Dynamo's `ForwardPassMetrics` (`var_prefill_length`, `var_decode_kv_tokens`) are not populated by TRT-LLM yet; tracked as a follow-up once a downstream consumer begins reading them.

Attention-DP is NOT covered

Under `enable_attention_dp=True, tp_size > 1`, TRT-LLM's rank-0-only RPC gate (rpc_worker_mixin.py) ensures only rank-0's `IterationStats` reach the user, same behavior as today. True per-rank emission under attention-DP requires a separate MPI collective-ordering redesign (piggybacking on `adp_router.gather_all_rank_states` is one proposal) and is tracked as a follow-up.

Test plan
- `pytest tests/unittest/pyexecutor/test_iter_stats_populate.py -v`: 14 passed (local, in a dev container with a mount-over of `py_executor.py`).
- `pytest tests/unittest/_torch/executor/test_pytorch_model_engine.py::PyTorchModelEngineTestCase -k cudagraph_dummies`: 2 passed. Negative control verified: disabling the strip in `CUDAGraphRunner.pad_batch`'s `finally` makes both tests fail with `AssertionError: 8 != 5`, confirming they would catch a regression in the strip invariant.
- `pytest tests/unittest/llmapi/test_llm_pytorch.py::test_llm_get_stats ::test_llm_get_stats_async`: 8 passed across all parametrize combinations including `use_overlap=True` (against an earlier schema; will re-run against the final nanobind binding once the local wheel rebuild completes).
- `pre-commit run`: clean on all modified files.
- `LLM.get_stats()` against a real prompt workload (TinyLlama, pytorch backend) once the local TRT-LLM wheel rebuild with the new `.so` finishes.
- `/bot run`

Follow-up (tracked separately)
- Populate `var_*` once a downstream consumer starts reading them.
- Dynamo adapter: consume the `inflightBatchingStats.*` names and compose the queued-decode signal from `numPausedRequests + numQueuedGenRequests` / `numPausedKvTokens + numQueuedGenKvTokens` (separate Dynamo-side PR).