[TRTLLM-12123][feat] Add per-iteration request-aggregate counters to InflightBatchingStats (#13199)
Extend IterationStats with 9 flat top-level fields (scheduled_* and
queued_* for prefill / decode counts and token sums) so consumers such
as Dynamo's planner can build ForwardPassMetrics directly from
LLM.get_stats_async() dicts, closing the TODO at
dynamo/components/src/dynamo/common/forward_pass_metrics.py
("add metrics for TrtLLM/SGLang").
Changes:
* cpp/include/tensorrt_llm/executor/types.h:
Add 9 SizeType32 fields to IterationStats. Variance fields
(var_prefill_length, var_decode_kv_tokens, and their queued
counterparts) are deferred: the live Dynamo planner consumes only
sums on origin/main.
* cpp/tensorrt_llm/executor/{serialization,jsonSerialization}.cpp:
Plumb the new fields through binary serialize / deserialize /
serializedSize and the NLOHMANN_DEFINE_TYPE_NON_INTRUSIVE macro.
* cpp/tensorrt_llm/nanobind/executor/bindings.cpp:
Expose the 9 fields as snake-cased Python attributes.
* tensorrt_llm/_torch/pyexecutor/py_executor.py:
Populate the 9 fields in _update_iter_stats by walking
scheduled_batch.context_requests (context_chunk_size +
context_current_position), generation_requests, paused_requests,
and the executor_request_queue. Attention-DP dummy requests are
excluded. Gated by the existing enable_iter_perf_stats flag.
* _process_iter_stats: when enable_attention_dp and tp_size > 1,
tp_allgather each rank's serialized + tagged (attentionDpRank) dict
across TP siblings and store N-per-iter tagged dicts on rank 0 so
the stats RPC can surface per-attention-DP-rank FPM streams to the
Dynamo planner (which keys its perf model by (worker_id, dp_rank)).
* tensorrt_llm/executor/base_worker.py _stats_serializer:
Accept pre-serialized dict tuples from the attention-DP path;
otherwise inject attentionDpRank=0 so downstream consumers can
always read the field without a missing-key branch.
* tests/unittest/pyexecutor/test_iter_stats_fpm_fields.py (new):
Fixture-driven coverage of the 9-field mapping (prefill, decode,
chunked prefill, prefix-cache hit, preempted decode, queued,
attention-DP dummy exclusion) plus a to_json_str roundtrip.
Semantics mirror vLLM's InstrumentedScheduler in Dynamo
(_extract_scheduled + _compute_queued): semantically equivalent
definitions, accepting numeric divergences from backend-specific
block managers / scheduler budgets.
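The aggregation walk described in the py_executor.py bullet can be sketched with simplified stand-ins. `FakeRequest` and `aggregate_scheduled_prefill` are illustrative names, not TRT-LLM types, and cover only the three scheduled-prefill counters under the stated dummy-exclusion rule:

```python
from dataclasses import dataclass


@dataclass
class FakeRequest:
    # Stand-in for a scheduled context request; field names mirror the
    # accessors named in the commit message above.
    context_chunk_size: int = 0
    context_current_position: int = 0
    is_attention_dp_dummy: bool = False


def aggregate_scheduled_prefill(context_requests):
    """Count scheduled prefill requests and sum their token work,
    skipping attention-DP dummy requests (as the populate loop does)."""
    num = sum_tokens = sum_kv_tokens = 0
    for req in context_requests:
        if req.is_attention_dp_dummy:
            continue
        num += 1
        sum_tokens += req.context_chunk_size           # tokens computed this iteration
        sum_kv_tokens += req.context_current_position  # KV already in place
    return num, sum_tokens, sum_kv_tokens
```

For a chunked 1024-token prompt scheduled as two 512-token chunks plus one dummy, the helper yields two requests, 1024 summed tokens, and 512 pre-existing KV tokens.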
Signed-off-by: Yuewei Na <nv-yna@users.noreply.github.com>
Original FPM populate code read req.context_chunk_size /
req.context_current_position on each scheduled context request, but
_update_iter_stats runs after _update_request_states —
context_requests_last_chunk members have transitioned to
GENERATION_IN_PROGRESS, and the C++ accessors assert 'context phase or
generation init phase' on that state. E2E smoke with Qwen3-0.6B
(8 requests, ISL=1024, OSL=128) showed scheduled_sum_prefill_tokens was
stuck at 0 because every per-request access threw RuntimeError.
Fix:
* For scheduled_sum_prefill_kv_tokens: read py_last_context_chunk, a
  Python-cached (start, end) tuple set in _update_request_states BEFORE
  the state transition. The start offset equals the pre-mutation
  context_current_position.
* For scheduled_sum_prefill_tokens: read from
  self.model_engine.iter_states['num_ctx_tokens'], a model-engine-level
  aggregate already computed before state mutation (existing code path
  used by inflight_batching_stats.num_ctx_tokens).
This preserves the MVP's single-rank and per-rank (attention-DP)
behavior. E2E verification with Qwen3-0.6B now emits
scheduledSumPrefillTokens=1024 during prefill iters, matching the
per-request prompt length.
Test artifact:
~/code/tmp_context/forwardmetric-redesign/20260416/trtllm-only-e2e/fpm_results.json
Signed-off-by: Yuewei Na <nv-yna@users.noreply.github.com>
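A minimal sketch of the first fix above, assuming each request carries the `py_last_context_chunk` (start, end) cache set before the state transition; the helper name is hypothetical:

```python
from types import SimpleNamespace


def prefill_kv_tokens(context_requests):
    """Sum pre-transition KV offsets from the Python-side cache
    (py_last_context_chunk = (start, end)). The start offset equals the
    pre-mutation context_current_position, so no C++ accessor that
    asserts on GENERATION_IN_PROGRESS requests is touched."""
    total = 0
    for req in context_requests:
        start, _end = req.py_last_context_chunk
        total += start
    return total


# Second chunk of a 1024-token prompt split 512/512: start offset is 512.
second_chunk = SimpleNamespace(py_last_context_chunk=(512, 1024))
```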
…t tests with real populate
Remove the tp_allgather-based attention-DP per-rank delivery from the
FPM MVP branch. Ships the 9 flat IterationStats fields for the
non-attention-DP case only; attention-DP per-rank emission becomes a
follow-up with a proper RC2 redesign (piggyback on
adp_router.gather_all_rank_states or dedicated MPI communicator).
Changes:
- py_executor.py: delete the `if self.enable_attention_dp and
self.dist.tp_size > 1 and self.dist.is_first_pp_rank:` branch in
_process_iter_stats that built a dict, called tp_allgather, and
stored (dict, None, None) sentinels. Reduces _process_iter_stats to
a straightforward populate + append. Remove now-unused `import json`.
- base_worker.py: delete the dict-sentinel `isinstance(iteration_stats,
dict)` short-circuit in _stats_serializer. Keep
`stats_dict.setdefault("attentionDpRank", 0)` so the serialized dict
stays self-describing for the future attention-DP follow-up.
- tests/unittest/pyexecutor/test_iter_stats_fpm_fields.py: remove
_FakeDist + test_attention_dp_path_tags_rank_and_short_circuits
(they exercised the deleted branch). Update the _populate helper to
mirror the REAL populate algorithm: read py_last_context_chunk[0]
(primary) + context_current_position (fallback) for prefill-KV, read
model_engine.iter_states["num_ctx_tokens"] (engine-level aggregate)
for prefill-tokens. Add a stub iter_states dict on the mock executor.
Tests now validate the real code path instead of the pre-populate-fix
approximation.
Rationale for the delete:
- The tp_allgather in _process_iter_stats interleaved with
adp_router.gather_all_rank_states on the same TP communicator,
causing MPI collective-ordering desync (RankState(*24 args)
type-confusion). Documented in
~/code/tmp_context/forwardmetric-redesign/20260416/STEPS_14_15_POSTMORTEM.md.
- Under enable_attention_dp=True, the Dynamo adapter side is gated off
separately (see companion commit on dynamo-fpm branch) so no
misleading fake-idle FPM streams reach the Planner.
Signed-off-by: Yuewei Na <nv-yna@users.noreply.github.com>
The prior round-trip test set and asserted only 5 of the 9 flat
IterationStats fields added by this PR. A camelCase typo or a dropped
serializer line in jsonSerialization.cpp on the other 4
(scheduledNumDecodeRequests, scheduledSumDecodeKvTokens,
queuedNumPrefillRequests, queuedSumPrefillTokens) would silently ship.
Expand the test to exercise all 9 keys with distinct values per field,
so a cross-wiring bug in the C++ serializer (e.g. swapping prefill and
decode count slots) fails the assertion rather than round-tripping
cleanly with the wrong semantics.
Test locally:
pytest tests/unittest/pyexecutor/test_iter_stats_fpm_fields.py -v
-> 13 passed
Signed-off-by: Yuewei Na <nv-yna@users.noreply.github.com>
…for FPM fields
Replace the `_populate` shim in `test_iter_stats_fpm_fields.py` with direct
unbound calls to the production `PyExecutor._update_iter_stats`. The prior
shim was a verbatim copy of the populate block, so it could not detect
drift between tests and production. A `MagicMock`-based fake `self` holds
only the attributes the method actually reads (two for the FPM fields:
`executor_request_queue.get_request_queue().queue` and
`model_engine.iter_states`), and `torch.cuda.mem_get_info` is patched so
the test runs without a live CUDA context.
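The unbound-call pattern described above can be illustrated generically; `Worker` and its attributes here are stand-ins, not the real `PyExecutor` (and the `torch.cuda.mem_get_info` patch is omitted so the sketch runs without torch):

```python
from unittest.mock import MagicMock


class Worker:
    def count_queued(self):
        # Production-style method: reads only a couple of attributes of
        # ``self``, which is what makes the unbound-call trick viable.
        return len(self.request_queue.get_queue().queue)


# A MagicMock ``self`` that provides just the attributes the method reads.
fake_self = MagicMock()
fake_self.request_queue.get_queue.return_value.queue = [object()] * 3

# Unbound call: the real method body runs against the fake state, so the
# test exercises production code instead of a copied shim.
queued = Worker.count_queued(fake_self)
```

Because the production method body executes unchanged, any drift in the attributes it reads shows up as a test failure rather than staying hidden in a duplicated shim.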
Also extend `validate_stats` in `test_llm.py` to assert the 9 FPM
camelCase keys on every stats iteration under the pytorch backend. This
exercises the full pipeline (`_update_iter_stats` -> `_stats_serializer` ->
RPC -> `get_stats` / `get_stats_async`) via the existing
`llm_get_stats_test_harness`, so any drop or rename along the chain fails
the test. Prefill-iteration `scheduledSumPrefillTokens > 0` is gated off
under `use_overlap=True` because the populate block sources that field
from the `model_engine.iter_states["num_ctx_tokens"]` side channel, which
can be stale at the moment `_update_iter_stats` runs under overlap
scheduling; all other FPM keys remain strictly asserted.
Local runs (inside the TRT-LLM build container, RTX 4090):
pytest tests/unittest/pyexecutor/test_iter_stats_fpm_fields.py
-> 13 passed
pytest tests/unittest/llmapi/test_llm_pytorch.py::test_llm_get_stats
-> 4 passed (all parametrize combinations)
pytest tests/unittest/llmapi/test_llm_pytorch.py::test_llm_get_stats_async
-> 4 passed (all parametrize combinations)
Signed-off-by: Yuewei Na <nv-yna@users.noreply.github.com>
…verlap-safe)
Under overlap scheduling `_executor_loop_overlap` invokes
`_process_iter_stats` on `self.previous_batch` (py_executor.py:2654-2656),
but `self.model_engine.iter_states["num_ctx_tokens"]` has been
overwritten by the CURRENT iteration's `_forward_step` before the
previous batch's stats are emitted. Reading the side channel therefore
reports the wrong batch's prefill-token count for every emission, and
at iter 0 collapses to 0 (initial dict empty). The Planner reads that as
"worker idle" and can scale down a loaded worker — silent planner poison.
Reproduction before the fix:
pytest tests/unittest/llmapi/test_llm_pytorch.py::test_llm_get_stats
-> 2 failed (use_overlap=True configurations), 2 passed
Fix: drop the side-channel read and compute `scheduled_sum_prefill_tokens`
locally inside the existing context_requests loop by summing
`py_last_context_chunk[1] - py_last_context_chunk[0]` (= chunk_size =
end_compute - begin_compute) across non-dummy context requests. This
mirrors what `_prepare_tp_inputs` computes as `len(input_ids)` but
without depending on when that write happens relative to
`_update_iter_stats`. `py_last_context_chunk` is a Python-side cache
set by `_update_request_states` BEFORE state mutation, so it remains
valid for the previous batch's requests even after they transition to
GENERATION_IN_PROGRESS.
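The replacement computation can be sketched as follows, assuming requests expose `py_last_context_chunk` and `is_attention_dp_dummy` as described; the function name is illustrative:

```python
from types import SimpleNamespace


def sum_prefill_tokens(context_requests):
    """Chunk size per request = end - start of its last context chunk,
    cached before the state transition. Summing this locally is
    overlap-safe: it never reads the engine-level side channel, which
    the current iteration may already have overwritten."""
    return sum(
        req.py_last_context_chunk[1] - req.py_last_context_chunk[0]
        for req in context_requests
        if not req.is_attention_dp_dummy
    )
```

The single dummy guard in the comprehension is what keeps the exclusion semantics aligned with the sibling count and KV-token fields.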
Dummy-exclusion semantics are now consistent across all three
scheduled-prefill fields (count, sum_tokens, sum_kv_tokens) — the
loop's single `if is_attention_dp_dummy: continue` guard drives all of
them. Previously `scheduled_sum_prefill_tokens` came from the engine
aggregate which included dummies while siblings excluded them.
The `test_llm.py::validate_stats` helper drops the `if not use_overlap`
relaxation; the strict `scheduledSumPrefillTokens > 0` assertion now
holds under every scheduler configuration.
`test_iter_states_missing_falls_back_to_zero` is repurposed as a
regression guard: passes a deliberately wrong `iter_states["num_ctx_tokens"]`
value and asserts the result comes from `py_last_context_chunk`, not
the side channel.
Known separate issue, NOT fixed here: `queued_num_prefill_requests` and
`queued_sum_prefill_tokens` still read `self.executor_request_queue`
at stats emission time. Under overlap this produces values from iter
N's state paired with batch N-1's scheduled_* counts — a semantic
drift, not a wrong value. Fixing it requires snapshotting queue state
at batch-prep time and belongs in a dedicated design discussion.
Local runs (inside dynamo-fpm-mvp-build:test with the modified
py_executor.py mounted over the installed copy):
pytest tests/unittest/pyexecutor/test_iter_stats_fpm_fields.py
-> 13 passed
pytest tests/unittest/llmapi/test_llm_pytorch.py::test_llm_get_stats
-> 4 passed (all parametrize combinations, including use_overlap=True)
pytest tests/unittest/llmapi/test_llm_pytorch.py::test_llm_get_stats_async
-> 4 passed (all parametrize combinations, including use_overlap=True)
Signed-off-by: Yuewei Na <nv-yna@users.noreply.github.com>
…onventions
Two related cleanups on the per-iteration request-aggregate counters
added earlier in this PR:
1. Comments made self-contained. References to downstream consumers
and unrelated reference implementations ("ForwardPassMetrics",
"Dynamo", "Dynamo's planner", vLLM's InstrumentedScheduler,
components/src/dynamo/...) are removed from the TRT-LLM source
tree. Comments now describe the behavior and invariants of the
code in front of the reader without requiring familiarity with an
external repo.
2. Default-initialization moved from the struct to the populate path.
The 9 new primitive members on IterationStats were declared with
`{0}` defaults, but every sibling field on the struct is instead
populated unconditionally by the TensorRT-backend populate chain
(Executor::Impl::getCurrentIterationStats → mModel->
getCurrentIterationStats). Breaking that invariant silently exempts
the new members and makes it harder to verify that every field is
reached by the populate chain. Drop the `{0}` defaults and instead
assign the 9 fields to 0 in
Executor::Impl::getCurrentIterationStats before the virtual
mModel callback, so a future TRT-side populate can overwrite and
the struct-level convention is restored. Net behavior unchanged:
TRT backend emits 0 for these fields (the backend does not track
them); the PyTorch backend populates them via its own path.
Files:
- cpp/include/tensorrt_llm/executor/types.h — drop {0} defaults,
rewrite section comment.
- cpp/tensorrt_llm/executor/executorImpl.cpp — add 9 zero
assignments with a comment explaining the invariant.
- tensorrt_llm/_torch/pyexecutor/py_executor.py — comment rewrite.
- tests/unittest/pyexecutor/test_iter_stats_fpm_fields.py —
docstring/comment rewrite.
- tests/unittest/llmapi/test_llm.py — comment rewrite.
Signed-off-by: Yuewei Na <nv-yna@users.noreply.github.com>
The 9 top-level flat fields added earlier in this PR had two systemic
issues:
1. Five of the nine were semantic duplicates of existing fields already
   tracked elsewhere in IterationStats or InflightBatchingStats.
   Keeping them meant two fields answered the same question with
   slightly different dummy-filtering semantics, inviting ambiguity for
   downstream consumers.
2. The naming vocabulary and word order didn't match the rest of the
   file. The struct uses Context/Ctx/Gen/Paused and num-prefix word
   order (numContextRequests, numCtxTokens); the new fields introduced
   Prefill/Decode and inverted the word order
   (scheduledNumPrefillRequests).
Cleanup:
- Delete the 9 top-level IterationStats members introduced in commit
  086f692.
- Add 5 members to InflightBatchingStats using the file's native
  vocabulary and word order:
  numCtxPrecomputedTokens — tokens read from prior state (prefix-cache
    hits + previously chunked tokens) across scheduled context
    requests; pairs with numCtxTokens (tokens computed this iteration).
  numGenKvTokens — total KV context summed across scheduled generation
    requests.
  numQueuedContextRequests — normal-only requests in the executor
    request queue (filters shutdown/cancel control items).
  numQueuedCtxTokens — prompt tokens for the above.
  numPausedKvTokens — total KV context across paused (preempted-decode)
    requests.
All serialization paths (binary, NLOHMANN JSON, nanobind) are updated
in lockstep so the wire format is self-consistent. Also removes the 9
explicit zero-assignments in Executor::Impl::getCurrentIterationStats
that were added in commit 9bd9133 as compensation for the old
IterationStats field set. The new fields live inside
InflightBatchingStats, which is value-initialized at its model-populate
construction site (InflightBatchingStats modelStats{} in
trtGptModelInflightBatching.cpp), so TRT-backend stats emissions get
zero sentinels automatically.
The populate block in PyExecutor._update_iter_stats collapses — 5 loops
feed 5 new InflightBatchingStats fields; 4 duplicate top-level
assignment targets go away entirely (replaced by pre-existing
InflightBatchingStats assignments at py_executor.py:1128-1134, which
handle num_context_requests / num_gen_requests / num_paused_requests /
num_scheduled_requests from the scheduled_batch directly).
Tests updated: test_iter_stats_fpm_fields.py now asserts on the nested
InflightBatchingStats fields and on existing sibling fields where the
old top-level assertions mapped. test_llm.py::validate_stats now reads
the new keys from ifbStats instead of the removed top-level keys, and
uses numCtxTokens > 0 on prefill iterations (overlap-safe, set after
_forward_step).
Local validation: Python syntax check clean; clang-format clean on all
5 cpp files. Tests against the current container .so fail with
AttributeError on the new nested field names (expected — the stale
binding has the old schema). A local TRT-LLM wheel rebuild will land
the updated nanobind binding and unblock e2e validation.
Signed-off-by: Yuewei Na <nv-yna@users.noreply.github.com>
- py_executor.py: drop the overlap-safety paragraph (no longer
  interesting to users); inline per-field rationale directly above each
  computation loop; keep the short high-level intro.
- py_executor.py: restore the single-expression form of
  self._append_iter_stats(self._update_iter_stats(...)) in
  _process_iter_stats; the intermediate iter_stats local variable was
  unrelated drift from this PR's refactors.
- base_worker.py: drop the cross-repo pointer "(see Dynamo publisher.py
  gate)" from the attentionDpRank comment so the comment remains
  self-contained in TRT-LLM.
Signed-off-by: Yuewei Na <nv-yna@users.noreply.github.com>
…_populate.py
The old name referenced Dynamo's ForwardPassMetrics vocabulary, but
these tests now cover the populate block in
``PyExecutor._update_iter_stats`` generally — specifically the extra
members on ``InflightBatchingStats`` written by that block. Rename to
match the scope and drop the out-of-tree concept.
Pure rename; no content change.
Signed-off-by: Yuewei Na <nv-yna@users.noreply.github.com>
Pure rename across all 7 touched files — C++ struct, both serializers,
nanobind binding, Python populate, and both test files. The snake_case
counterpart is renamed to ``num_ctx_kv_tokens``. Matches the `KvTokens`
pattern already used for the sibling `numGenKvTokens` /
`numPausedKvTokens` fields on the same struct — the symmetry makes the
three KV-token fields group visually in the struct.
Signed-off-by: Yuewei Na <nv-yna@users.noreply.github.com>
/bot run --disable-fail-fast
Actionable comments posted: 1
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
cpp/tensorrt_llm/nanobind/executor/bindings.cpp (1)
2-2: ⚠️ Potential issue | 🟡 Minor
Update SPDX copyright year for this modified file.
This file was modified, but the header still ends at 2025. Please bump
it to the current modification year.
Suggested fix:
- * SPDX-FileCopyrightText: Copyright (c) 2022-2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+ * SPDX-FileCopyrightText: Copyright (c) 2022-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
As per coding guidelines: "Add NVIDIA copyright header on ALL new files
and update year on modified files" and "All TensorRT-LLM source files
must contain an NVIDIA copyright header with the year of latest
meaningful modification".
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@cpp/tensorrt_llm/nanobind/executor/bindings.cpp` at line 2, The SPDX header in the top of cpp/tensorrt_llm/nanobind/executor/bindings.cpp still lists copyright years ending at 2025; update the SPDX comment to include the current modification year (bump 2025 to the current year) so the NVIDIA copyright header reflects the latest meaningful modification.
🧹 Nitpick comments (1)
tests/unittest/pyexecutor/test_iter_stats_populate.py (1)
91-95: Let the queue stub model payload-less entries too.
`PyExecutor._update_iter_stats()` has an explicit `item.request is None`
skip path, but `_StubQueueItem` always fabricates a request object.
That leaves the "payload-less requests are excluded" contract untested.
Suggested test helper tweak:
 class _StubQueueItem:
-    def __init__(self, input_token_ids, is_normal_request=True):
-        self.request = types.SimpleNamespace(input_token_ids=input_token_ids)
+    def __init__(
+        self,
+        input_token_ids=None,
+        *,
+        is_normal_request=True,
+        has_request=True,
+    ):
+        self.request = (
+            types.SimpleNamespace(input_token_ids=input_token_ids)
+            if has_request
+            else None
+        )
         self.is_normal_request = is_normal_request
Then add a small assertion that `has_request=False` leaves both queued
counters unchanged.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tests/unittest/pyexecutor/test_iter_stats_populate.py` around lines 91 - 95, Modify the test stub and assertions so payload-less queue entries can be simulated: update _StubQueueItem to allow request=None (or add a parameter like has_request=False) instead of always creating types.SimpleNamespace, then in the test that covers PyExecutor._update_iter_stats create an item with no request and assert that both queued counters remain unchanged; reference the _StubQueueItem constructor and the PyExecutor._update_iter_stats call when making the change and add the extra assertion that has_request=False leaves queued counters unchanged.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@tensorrt_llm/_torch/pyexecutor/py_executor.py`:
- Around line 1234-1247: The counters currently increment
num_queued_context_requests before verifying the request payload, so move the
payload validation so that you only increment num_queued_context_requests when
the item has a usable payload: check getattr(item, "is_normal_request", False)
and that item.request exists and that item.request.input_token_ids is present
and consumable (e.g., wrapped in the existing try/except that computes
len(item.request.input_token_ids)); only on successful token-length computation
increment both num_queued_ctx_tokens (by the length) and
num_queued_context_requests. Update the loop around
executor_request_queue.get_request_queue().queue to perform the payload
existence/len check first and skip incrementing the request counter when
input_token_ids is missing or raises in the try block.
---
Outside diff comments:
In `@cpp/tensorrt_llm/nanobind/executor/bindings.cpp`:
- Line 2: The SPDX header in the top of
cpp/tensorrt_llm/nanobind/executor/bindings.cpp still lists copyright years
ending at 2025; update the SPDX comment to include the current modification year
(bump 2025 to the current year) so the NVIDIA copyright header reflects the
latest meaningful modification.
---
Nitpick comments:
In `@tests/unittest/pyexecutor/test_iter_stats_populate.py`:
- Around line 91-95: Modify the test stub and assertions so payload-less queue
entries can be simulated: update _StubQueueItem to allow request=None (or add a
parameter like has_request=False) instead of always creating
types.SimpleNamespace, then in the test that covers
PyExecutor._update_iter_stats create an item with no request and assert that
both queued counters remain unchanged; reference the _StubQueueItem constructor
and the PyExecutor._update_iter_stats call when making the change and add the
extra assertion that has_request=False leaves queued counters unchanged.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro Plus
Run ID: 3bb2f0bc-5ed0-4c25-b0a3-926ad864b261
📒 Files selected for processing (8)
cpp/include/tensorrt_llm/executor/types.h
cpp/tensorrt_llm/executor/jsonSerialization.cpp
cpp/tensorrt_llm/executor/serialization.cpp
cpp/tensorrt_llm/nanobind/executor/bindings.cpp
tensorrt_llm/_torch/pyexecutor/py_executor.py
tensorrt_llm/executor/base_worker.py
tests/unittest/llmapi/test_llm.py
tests/unittest/pyexecutor/test_iter_stats_populate.py
PR_Github #44807 [ run ] triggered by Bot.
brb-nv
left a comment
Left a few questions to consider.
…onStats
Follow-up work is documented on TRTLLM-12123. The stub
stats_dict.setdefault("attentionDpRank", 0) keeps downstream consumers
able to read the field unconditionally today.
Signed-off-by: Yuewei Na <nv-yna@users.noreply.github.com>
Two regression guards in PyTorchModelEngineTestCase, placed next to the
existing test_pad_generation_requests. They pin the invariant that
CUDAGraphRunner.pad_batch strips every is_cuda_graph_dummy=True entry
from scheduled_requests.generation_requests before its `with` block
exits — on both clean-exit and exception paths. Downstream consumers of
scheduled_batch.generation_requests rely on this, including the new FPM
populate block added earlier in this PR
(tensorrt_llm/_torch/pyexecutor/py_executor.py:1220); those callers
filter only is_attention_dp_dummy and trust pad_batch to handle the
cudagraph kind upstream.
Negative-control verified locally: replacing the `finally` slice with
`pass` makes both tests fail with 'AssertionError: 8 != 5' (5 real
requests + 3 padding dummies retained).
Signed-off-by: Yuewei Na <nv-yna@users.noreply.github.com>
…t comments
Line-number references (cuda_graph_runner.py:524, py_executor.py:1220)
drift easily under refactors; the comment still conveys intent without
them. 'FPM' is a Dynamo-side term; describe the dependent code as 'the
per-iteration stats populate block' so the comment stands on its own
within TRT-LLM.
Signed-off-by: Yuewei Na <nv-yna@users.noreply.github.com>
…, narrow except, warn on drift
Addresses CodeRabbit's review comment on PyExecutor._update_iter_stats:
- Move the num_queued_context_requests increment AFTER the input-token
  length check so both counters move in lockstep. Downstream consumers
  that compute avg = num_queued_ctx_tokens /
  num_queued_context_requests can no longer see a deflated ratio if a
  future API change makes input_token_ids unreadable.
- Narrow the bare 'except Exception' to (AttributeError, TypeError);
  silences Ruff S110 / BLE001 and stops masking unexpected errors.
- Drop the defensive getattr() on item.is_normal_request; item is
  always a RequestQueueItem (the queue type is
  queue.Queue[RequestQueueItem]), so the fallback was dead code.
- Emit logger.warning on the skip branch so future API drift surfaces
  instead of being silently absorbed.
No behavior change on the current API: the C++ ExecutorRequest
constructor requires non-empty input_token_ids, so the except branch is
unreachable today.
Signed-off-by: Yuewei Na <nv-yna@users.noreply.github.com>
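The reordered counting logic can be sketched as follows; the function and item attributes are illustrative stand-ins for the real `_update_iter_stats` loop over the executor request queue:

```python
import logging
from types import SimpleNamespace

logger = logging.getLogger(__name__)


def count_queued_prefill(queue_items):
    """Increment the request counter only after the token length is
    readable, so both counters move in lockstep and a downstream
    average (tokens / requests) cannot be deflated by skipped items."""
    num_requests = sum_tokens = 0
    for item in queue_items:
        if not item.is_normal_request or item.request is None:
            continue
        try:
            n = len(item.request.input_token_ids)
        except (AttributeError, TypeError):
            # Narrowed from a bare Exception; surface drift loudly.
            logger.warning("queued request without readable input_token_ids; skipped")
            continue
        num_requests += 1
        sum_tokens += n
    return num_requests, sum_tokens
```

An unreadable item now skips both counters instead of inflating the request count alone.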
…ed InflightBatchingStats fields
Addresses the disagg-decode routing concern raised on PR 13199: the
queued-context loop in PyExecutor._update_iter_stats currently lumps
REQUEST_TYPE_GENERATION_ONLY queued requests (disagg-decode engine,
awaiting KV transfer from a prefill engine) into
numQueuedContextRequests / numQueuedCtxTokens. Autoscaling consumers
read that as 'prefill backlog deep' on a decode engine, when the real
signal is 'decode work pending KV transfer' — wrong scaling action.
Extend InflightBatchingStats with two new members:
numQueuedGenRequests — count of gen-only queued items
numQueuedGenKvTokens — prompt-token sum across those items
(the KV budget they will occupy once
their KV transfer completes)
Populate block now routes by item.request.request_type:
- CONTEXT_AND_GENERATION (non-disagg default) and CONTEXT_ONLY
(disagg-prefill side) -> queued-context counters.
- GENERATION_ONLY (disagg-decode side) -> queued-gen counters.
Downstream consumers that build Dynamo's ForwardPassMetrics can now
compute a correct queued_requests.num_decode_requests as
numPausedRequests + numQueuedGenRequests (same for the KV-token sum).
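A sketch of the routing, with a stand-in enum in place of the executor's request-type constants and simplified queue items:

```python
from enum import Enum, auto
from types import SimpleNamespace


class RequestType(Enum):  # stand-in for the executor enum
    CONTEXT_AND_GENERATION = auto()
    CONTEXT_ONLY = auto()
    GENERATION_ONLY = auto()


def route_queued(items):
    """Context-bearing queued requests feed the queued-context counters;
    generation-only (disagg-decode) requests feed the queued-gen
    counters, so a decode engine's backlog is not misread as prefill
    depth."""
    ctx_reqs = ctx_tokens = gen_reqs = gen_tokens = 0
    for item in items:
        n = len(item.request.input_token_ids)
        if item.request.request_type is RequestType.GENERATION_ONLY:
            gen_reqs += 1
            gen_tokens += n  # KV budget once the transfer completes
        else:  # CONTEXT_AND_GENERATION or CONTEXT_ONLY
            ctx_reqs += 1
            ctx_tokens += n
    return ctx_reqs, ctx_tokens, gen_reqs, gen_tokens
```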
Wired through the full binding surface: types.h (struct members),
serialization.cpp (binary ser/deser/size), jsonSerialization.cpp
(NLOHMANN macro), nanobind/executor/bindings.cpp (Python properties).
The __init__.pyi stub is generated from the bindings and is
regenerated by the build.
Tests:
* test_iter_stats_populate.py::test_queued_routes_by_request_type
covers all three enum values with distinct token counts, asserts
correct routing across both counter pairs.
* test_empty_iteration and the JSON round-trip test extended to
include the two new fields.
* test_llm.py::validate_stats's 'new_aggregate_keys' tuple extended
to assert presence of the two new keys under the full
get_stats -> get_stats_async RPC pipeline.
Signed-off-by: Yuewei Na <nv-yna@users.noreply.github.com>
/bot run --disable-fail-fast
PR_Github #45368 [ run ] triggered by Bot.
/bot run --disable-fail-fast
PR_Github #45443 [ run ] triggered by Bot.
/bot run --disable-fail-fast
PR_Github #45459 [ run ] triggered by Bot.
PR_Github #45459 [ run ] completed with state
/bot run --disable-fail-fast
PR_Github #45532 [ run ] triggered by Bot.
PR_Github #45532 [ run ] completed with state
Refreshed all 16 files under docs/overview/ to reflect TRT-LLM
v1.3.0rc14 (upstream/main 3b7af1c, ~3 weeks of changes since the
2026-04-06 baseline 2b80f8d), competing frameworks (vLLM v0.20.0,
SGLang v0.5.10.post1, LMCache v0.4.4, NVIDIA Dynamo v1.0.0), and the
current hardware (Vera Rubin in production, AMD MI355X MLPerf 6.0,
Google TPU v7 GA) and academic landscape (GOOSE, StreamServe, PrfaaS,
FlowKV).
Highlights:
- Spec-dec algorithm count corrected 7 to 8 (DFlash added; EAGLE3
  dynamic tree re-enabled; LoRA + spec-dec generic and EAGLE3-specific
  now work).
- Block reuse + overlap scheduler combined (NVIDIA#12816), removing a
  long-standing internal gap.
- First-class lmcache and kvbm KV connectors (NVIDIA#12626).
- Production observability stack (Prometheus NVIDIA#12545, modular
  logger NVIDIA#13202, NvTelemetry NVIDIA#12384, per-iteration
  aggregate counters NVIDIA#13199).
- Disagg reliability fail-fast wave (NVIDIA#13119, NVIDIA#13408,
  NVIDIA#12718, NVIDIA#12888).
- AutoDeploy onboarded DeepSeek-R1, Gemma-4 + 4-31B NVFP4,
  MiniMax-M2.7; standalone-package-ready; legacy EdgeLLM ONNX export
  removed.
- KV cache V2 still default OFF (multiple V2 fixes; tracked via new
  V2-default-on milestone in §06).
- New 4-framework comparison column for Dynamo v1.0; updated feature
  matrix and gap analysis to reflect vLLM v0.20 (FA4 MLA prefill
  default, TurboQuant 2-bit KV, vLLM IR foundation, Model Runner V2).
- Strategic prioritization quadrant chart fully re-ranked; new Tier 1
  items: TTFT re-benchmark vs vLLM v0.20, low-bit KV, MLA prefill
  kernel default, disagg chaos-test harness; new Tier 2: Dynamo
  Snapshot integration, TRT-LLM IR strategy, adaptive spec-dec depth.
Snapshot of pre-refresh content saved at
docs/overview/.snapshots/2026-04-06/. Per-file diff highlights,
priority-shift table, sources, and blocked/skipped notes are in
docs/overview/CHANGELOG.md.
Signed-off-by: Chien-Chun Hung <2679986+chienchunhung@users.noreply.github.com>
NVIDIA/TensorRT-LLM#13199 merged with a different shape than the draft
this PR was originally written against: the new IterationStats fields
live nested inside `inflightBatchingStats` and use names that don't
match the draft's flat top-level keys.
Field-extraction changes (publisher.py):
- _check_fpm_schema reads the nested `inflightBatchingStats` dict and
  fails closed if it is missing or any of the 11 required IBS fields is
  absent. The strict probe protects against silent planner poison when
  running against a TRT-LLM that predates #13199.
- handle_stat reads the 11 IBS fields and composes Dynamo-side fields:
    scheduled_num_prefill_requests  = numContextRequests
    scheduled_sum_prefill_tokens    = numCtxTokens (compute this iter;
      excludes prefix-cache carryover counted in numCtxKvTokens, to
      match vLLM's TTFT-prediction semantics)
    scheduled_sum_prefill_kv_tokens = numCtxKvTokens
    scheduled_num_decode_requests   = numGenRequests
    scheduled_sum_decode_kv_tokens  = numGenKvTokens
    queued_num_prefill_requests     = numQueuedContextRequests
    queued_sum_prefill_tokens       = numQueuedCtxTokens
    queued_num_decode_requests      = numPausedRequests + numQueuedGenRequests
    queued_sum_decode_kv_tokens     = numPausedKvTokens + numQueuedGenKvTokens
  Paused-decode requests are intentionally double-counted with
  numScheduledRequests upstream because the planner reads queued_decode
  as a KV-pressure preemption signal, where paused decodes carry the
  same semantic weight as queued gen-only requests.
- FpmDirectPublisher.publish() call site switched to keyword-only args
  so adjacent ints with similar units (`scheduled_*` vs `queued_*`,
  `*_prefill_*` vs `*_decode_*`) cannot be silently transposed.
- Init-time and cleanup-time `except Exception` narrowed to `except
  RuntimeError` (PyO3 surfaces FpmDirectPublisher new/shutdown failures
  only as PyRuntimeError); the hot-path catch in handle_stat stays
  broad on purpose, with a comment explaining why.
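The composition above can be expressed as a small mapping function. The key names follow the formulas listed; `compose_fpm` itself is an illustrative name, not the real publisher.py API:

```python
def compose_fpm(ibs):
    """Map the nested inflightBatchingStats dict onto flat
    ForwardPassMetrics-style fields. Paused-decode requests fold into
    the queued-decode pair alongside queued gen-only requests."""
    return {
        "scheduled_num_prefill_requests": ibs["numContextRequests"],
        "scheduled_sum_prefill_tokens": ibs["numCtxTokens"],
        "scheduled_sum_prefill_kv_tokens": ibs["numCtxKvTokens"],
        "scheduled_num_decode_requests": ibs["numGenRequests"],
        "scheduled_sum_decode_kv_tokens": ibs["numGenKvTokens"],
        "queued_num_prefill_requests": ibs["numQueuedContextRequests"],
        "queued_sum_prefill_tokens": ibs["numQueuedCtxTokens"],
        "queued_num_decode_requests":
            ibs["numPausedRequests"] + ibs["numQueuedGenRequests"],
        "queued_sum_decode_kv_tokens":
            ibs["numPausedKvTokens"] + ibs["numQueuedGenKvTokens"],
    }
```

A missing IBS key raises KeyError, which matches the fail-closed posture of the schema probe.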
Rust binding changes (lib/bindings/python/rust/llm/fpm.rs):
- FpmDirectPublisher struct field `_inner` renamed to `inner`. The underscore prefix conventionally signals "intentionally unused" to Rust readers, but the field is load-bearing for Drop-driven shutdown of all per-rank serialization tasks; a future cleanup pass would delete it as dead and leak every spawned task. Documented in a comment.
- publish() signature switched to keyword-only via #[pyo3(signature = (*, ...))]; matching update in the _core.pyi stub.

Test changes (test_trtllm_fpm_publisher.py):
- Module-level pytestmark = [unit, trtllm, pre_merge, gpu_0] is picked up by CI marker filters. Without these, the tests were silently skipped by the strict-marker config.
- _build_fake_stat constructs the nested IBS structure matching the real JSON; _invoke_handler mirrors handle_stat's nested extraction and kwargs publish() exactly. A guardrail test (test_invoke_handler_matches_publisher_keyword_set) catches drift.
- The schema-probe parametrize iterates all 11 IBS fields plus the ibs-not-a-dict and missing-ibs-dict edge cases.
- New tests cover the queued-decode composite under disagg-decode (only-queued-gen) and aggregated (only-paused) engine shapes.
- _build_publisher_stub uses the monkeypatch fixture for hermetic state — no module-level rebinds that leak across tests. The blanket try/except in _run_initialize is removed; WorkerMetricsPublisher and KvEventPublisher are stubbed via monkeypatch so initialize() reaches the FPM gate cleanly.

Verification (single-GPU local, non-attention-DP):
- Unit: 28/28 pass.
- Schema probe against real Qwen3-0.6B inside L0_PostMerge build 2697 (commit e903428, the first build containing #13199): all 11 required IBS fields present at the nested path, attentionDpRank tagged at the top level.
- Sustained load (concurrency 16 vs max_batch_size 4, 45 s): 520 stats sampled, max scheduled_num_decode_requests=4 (saturated batch), max queued_num_prefill_requests=3 (queueing observed), 512/520 iterations had a non-zero decode signal.

Known limitation: the TensorRT (legacy) backend leaves IBS at its zero default, so FPM emission against a TRT-engine deployment will be all-zero. The schema probe accepts keys-exist; a value-aware probe is left as a follow-up.

Signed-off-by: Yuewei Na <nv-yna@users.noreply.github.com>
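In Python terms, the keyword-only contract that the `#[pyo3(signature = (*, ...))]` change enforces looks roughly like the following. The real method is implemented in Rust via pyo3, and only a two-parameter subset is shown here for illustration:

```python
class FpmDirectPublisherStub:
    """Illustrative Python equivalent of the keyword-only publish() contract."""

    def publish(self, *, scheduled_num_decode_requests: int,
                queued_num_decode_requests: int) -> None:
        # The bare `*` makes both counters keyword-only: a positional call
        # raises TypeError instead of silently transposing two same-typed
        # ints (scheduled vs queued).
        self.last = (scheduled_num_decode_requests, queued_num_decode_requests)
```

A positional call such as `publish(4, 1)` fails fast with `TypeError`, which is exactly the transposition protection the binding change buys.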
… fields

Closes the "old TRT-LLM" planner-poison gap left by ai-dynamo#8356.

Problem: handle_stat's .publish() call uses stat.get(field, 0) for every required IterationStats field. On a TRT-LLM version that predates NVIDIA/TensorRT-LLM#13199, every field defaults to 0 and the emitted ForwardPassSnapshot is byte-identical to the Rust-side idle heartbeat. The Planner reads that as "worker is idle" and can scale down a worker under real load. Same class of bug as the attention-DP rank 1..N-1 case that ai-dynamo#8356 explicitly gates, but ungated for the old-TRT-LLM path.

Fix: a one-shot schema probe on the first IterationStats delivered to handle_stat. Strict — all 9 fields added by #13199 must be present, else the probe shuts down the publisher and sets self.fpm_publisher = None. The existing `if self.fpm_publisher is not None:` guard then short-circuits all further FPM emission, matching the attention-DP gate's "zero messages" contract.

Scope:
- components/src/dynamo/trtllm/publisher.py:
  - Add _FPM_REQUIRED_STAT_FIELDS (9 fields).
  - Add self._fpm_schema_checked: bool = False in Publisher.__init__.
  - Add the Publisher._check_fpm_schema(stat) method.
  - Wire the probe into handle_stat before the existing FPM publish branch.
- components/src/dynamo/trtllm/tests/test_trtllm_fpm_publisher.py:
  - 7 new tests (15 cases via parametrize), covering: all-fields-present keeps the publisher; each of the 9 fields missing individually disables the publisher; missing-all-fields (legacy TRT-LLM) disables; no-op when fpm_publisher is already None; a shutdown exception still disables; the probe runs only once; a required-fields-tuple-matches-publish guardrail.

No Rust changes. No API surface changes.

Signed-off-by: Yuewei Na <nv-yna@users.noreply.github.com>
Summary
Add 7 new members to `InflightBatchingStats` that expose per-iteration request-aggregate counts and token-weighted totals not covered by the existing `numContextRequests`/`numGenRequests`/`numCtxTokens`/`numPausedRequests`/`numScheduledRequests`. They are populated by the PyTorch backend's `PyExecutor._update_iter_stats`; the TensorRT (legacy) backend leaves them at 0 via value-initialization of the `InflightBatchingStats` struct at its populate site — 0 is the correct "not tracked by this backend" sentinel.

These fields support downstream autoscaling / load-prediction consumers (e.g., Dynamo's Planner) that need richer per-iteration composition data than the existing counters provide, including disaggregated-decode engines whose queue is dominated by generation-only requests awaiting KV transfer.
Scope: non-attention-DP (single-rank + plain TP). True per-rank emission under attention-DP requires an MPI collective-ordering redesign and is deferred.
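As a rough illustration of the populate path, the queued counters could be derived from the executor request queue along these lines. This is not the actual `_update_iter_stats` code: the request-type constants' values, the queue-item dict shape, and the attribute names (`kv_tokens`, `prompt_tokens`) are invented for the sketch; only the counting semantics (skip payload-less control items, split by request type) follow the PR text.

```python
# Hypothetical request-type tags; real TRT-LLM uses its own enum values.
REQUEST_TYPE_CONTEXT_AND_GENERATION = "context_and_generation"
REQUEST_TYPE_CONTEXT_ONLY = "context_only"
REQUEST_TYPE_GENERATION_ONLY = "generation_only"

def count_queued_stats(queue_items):
    """Derive the four queued counters from queue items (sketch)."""
    stats = {"numQueuedContextRequests": 0, "numQueuedCtxTokens": 0,
             "numQueuedGenRequests": 0, "numQueuedGenKvTokens": 0}
    for item in queue_items:
        req = item.get("request")
        if req is None:
            # Non-normal control item (shutdown/cancel): no payload, skip.
            continue
        if req["type"] == REQUEST_TYPE_GENERATION_ONLY:
            # Disagg-decode: prefill completed elsewhere, awaiting KV transfer.
            stats["numQueuedGenRequests"] += 1
            stats["numQueuedGenKvTokens"] += req["kv_tokens"]
        else:
            # Context requests awaiting their first scheduling.
            stats["numQueuedContextRequests"] += 1
            stats["numQueuedCtxTokens"] += req["prompt_tokens"]
    return stats
```

On a non-disaggregated engine the generation-only branch never fires, which is why `numQueuedGenRequests` stays 0 there.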
New `InflightBatchingStats` members
- `numCtxKvTokens`: KV-cache tokens across scheduled context requests, including prefix-cache carryover not counted in `numCtxTokens` (tokens computed this iteration).
- `numGenKvTokens`: KV-cache tokens summed across scheduled generation requests.
- `numQueuedContextRequests`: requests (`REQUEST_TYPE_CONTEXT_AND_GENERATION`/`REQUEST_TYPE_CONTEXT_ONLY`) waiting in `executor_request_queue` for first scheduling; excludes non-normal control items (shutdown/cancel) and requests without a payload.
- `numQueuedCtxTokens`: prompt tokens summed over the requests counted in `numQueuedContextRequests`.
- `numQueuedGenRequests`: requests (`REQUEST_TYPE_GENERATION_ONLY`) waiting in `executor_request_queue`. On a disaggregated-decode engine these are requests that have completed prefill elsewhere and are awaiting KV-cache transfer before they can start decoding. Always 0 on a non-disaggregated or disaggregated-prefill engine.
- `numQueuedGenKvTokens`: KV tokens summed over the requests counted in `numQueuedGenRequests`; acts as the KV budget these requests will occupy once their KV transfer completes.
- `numPausedKvTokens`: KV tokens summed over the requests counted in `numPausedRequests`.

Naming follows the file's existing conventions: `Ctx` for token-count compounds, `Context` for request counts, `Gen` for decode/generation, `Paused` for preempted-decode, `num` prefix first.

Dynamo `ForwardPassMetrics` → TRT-LLM mapping

For downstream readers (e.g., Dynamo's TRT-LLM adapter), here is the full mapping from Dynamo's `ForwardPassMetrics` fields to the TRT-LLM `IterationStats` fields that should be consumed. Four fields reuse existing `InflightBatchingStats` members; seven are new (added in this PR). The two disagg-decode-aware rows compose the new gen-queue fields with the existing `numPaused*` fields so the Dynamo adapter reports a correct "queued decode" signal on both preempted-decode and disagg-decode-awaiting-KV paths.

| `ForwardPassMetrics` field | TRT-LLM `IterationStats` field |
| --- | --- |
| `scheduled_requests.num_prefill_requests` | `inflightBatchingStats.numContextRequests` |
| `scheduled_requests.sum_prefill_tokens` | `inflightBatchingStats.numCtxTokens` |
| `scheduled_requests.sum_prefill_kv_tokens` | `inflightBatchingStats.numCtxKvTokens` |
| `scheduled_requests.num_decode_requests` | `inflightBatchingStats.numGenRequests` |
| `scheduled_requests.sum_decode_kv_tokens` | `inflightBatchingStats.numGenKvTokens` |
| `queued_requests.num_prefill_requests` | `inflightBatchingStats.numQueuedContextRequests` |
| `queued_requests.sum_prefill_tokens` | `inflightBatchingStats.numQueuedCtxTokens` |
| `queued_requests.num_decode_requests` | `inflightBatchingStats.numPausedRequests + numQueuedGenRequests` |
| `queued_requests.sum_decode_kv_tokens` | `inflightBatchingStats.numPausedKvTokens + numQueuedGenKvTokens` |

Variance fields on Dynamo's `ForwardPassMetrics` (`var_prefill_length`, `var_decode_kv_tokens`) are not populated by TRT-LLM yet; tracked as a follow-up once a downstream consumer begins reading them.

Attention-DP is NOT covered

Under `enable_attention_dp=True, tp_size > 1`, TRT-LLM's rank-0-only RPC gate (rpc_worker_mixin.py) ensures only rank-0's `IterationStats` reach the user, same behavior as today. True per-rank emission under attention-DP requires a separate MPI collective-ordering redesign (piggybacking on `adp_router.gather_all_rank_states` is one proposal) and is tracked as a follow-up.

Test plan
- `pytest tests/unittest/pyexecutor/test_iter_stats_populate.py -v`: 14 passed (local, in a dev container with a mount-over of `py_executor.py`).
- `pytest tests/unittest/_torch/executor/test_pytorch_model_engine.py::PyTorchModelEngineTestCase -k cudagraph_dummies`: 2 passed. Negative control verified: disabling the strip in `CUDAGraphRunner.pad_batch`'s `finally` makes both tests fail with `AssertionError: 8 != 5`, confirming they would catch a regression in the strip invariant.
- `pytest tests/unittest/llmapi/test_llm_pytorch.py::test_llm_get_stats ::test_llm_get_stats_async`: 8 passed across all parametrize combinations including `use_overlap=True` (against an earlier schema; will re-run against the final nanobind binding once the local wheel rebuild completes).
- `pre-commit run`: clean on all modified files.
- `LLM.get_stats()` against a real prompt workload (TinyLlama, pytorch backend) once the local TRT-LLM wheel rebuild with the new `.so` finishes.
- `/bot run`

Follow-up (tracked separately)
- Populate `var_*` once a downstream consumer starts reading them.
- Dynamo adapter: consume the `inflightBatchingStats.*` names and compose the queued-decode signal from `numPausedRequests + numQueuedGenRequests` / `numPausedKvTokens + numQueuedGenKvTokens` (separate Dynamo-side PR).