[TRTLLM-12123][feat] Add per-iteration request-aggregate counters to InflightBatchingStats#13199

Merged
pcastonguay merged 19 commits into NVIDIA:main from nv-yna:yna/feat/iter-stats-fpm-mvp
Apr 27, 2026

Conversation


@nv-yna nv-yna commented Apr 20, 2026

Summary

Add 7 new members to InflightBatchingStats that expose per-iteration request-aggregate counts and token-weighted totals not covered by the existing numContextRequests / numGenRequests / numCtxTokens / numPausedRequests / numScheduledRequests. The PyTorch backend populates them in PyExecutor._update_iter_stats; the TensorRT (legacy) backend leaves them at 0 via value-initialization of the InflightBatchingStats struct at its populate site, where 0 is the intended "not tracked by this backend" sentinel.

These fields support downstream autoscaling / load-prediction consumers (e.g., Dynamo's Planner) that need richer per-iteration composition data than the existing counters provide, including disaggregated-decode engines whose queue is dominated by generation-only requests awaiting KV transfer.

Scope: non-attention-DP (single-rank + plain TP). True per-rank emission under attention-DP requires an MPI collective-ordering redesign and is deferred.
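To make the queued-side semantics concrete, here is a minimal, self-contained sketch of the queue walk these counters imply. The queue items and `Request` class below are stand-ins for the real executor types; only the attributes named in this description (`is_normal_request`, `request`, `input_token_ids`, generation-only typing) are modeled, and the helper name is hypothetical:

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Request:
    input_token_ids: List[int]
    is_generation_only: bool = False  # REQUEST_TYPE_GENERATION_ONLY


@dataclass
class QueueItem:
    request: Optional[Request]
    is_normal_request: bool = True  # False for shutdown/cancel control items


def count_queued(queue):
    """Aggregate queued counters as described above: control items and
    payload-less entries are excluded from every counter."""
    n_ctx_req = n_ctx_tok = n_gen_req = n_gen_tok = 0
    for item in queue:
        if not item.is_normal_request or item.request is None:
            continue  # shutdown/cancel, or no payload: not counted
        tokens = len(item.request.input_token_ids)
        if item.request.is_generation_only:
            n_gen_req += 1   # disagg-decode arrival awaiting KV transfer
            n_gen_tok += tokens
        else:
            n_ctx_req += 1   # context request awaiting first scheduling
            n_ctx_tok += tokens
    return n_ctx_req, n_ctx_tok, n_gen_req, n_gen_tok


queue = [
    QueueItem(Request([1, 2, 3])),                        # context request
    QueueItem(Request([4, 5], is_generation_only=True)),  # gen-only request
    QueueItem(None),                                      # payload-less: skipped
    QueueItem(Request([6]), is_normal_request=False),     # control item: skipped
]
print(count_queued(queue))  # (1, 3, 1, 2)
```

The four return values correspond to numQueuedContextRequests, numQueuedCtxTokens, numQueuedGenRequests, and numQueuedGenKvTokens respectively.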

New InflightBatchingStats members

| Field | Meaning |
| --- | --- |
| numCtxKvTokens | Tokens read from prior state (prefix-cache hits + previously-chunked tokens), summed across scheduled context requests. Complements numCtxTokens (tokens computed this iteration). |
| numGenKvTokens | Total KV context length (prompt + tokens generated so far), summed across scheduled generation requests. |
| numQueuedContextRequests | Context-type requests (REQUEST_TYPE_CONTEXT_AND_GENERATION / REQUEST_TYPE_CONTEXT_ONLY) waiting in executor_request_queue for first scheduling; excludes non-normal control items (shutdown/cancel) and requests without a payload. |
| numQueuedCtxTokens | Prompt-token sum across the requests counted in numQueuedContextRequests. |
| numQueuedGenRequests | Generation-only requests (REQUEST_TYPE_GENERATION_ONLY) waiting in executor_request_queue. On a disaggregated-decode engine these are requests that have completed prefill elsewhere and are awaiting KV-cache transfer before they can start decoding. Always 0 on a non-disaggregated or disaggregated-prefill engine. |
| numQueuedGenKvTokens | Prompt-token sum across the requests counted in numQueuedGenRequests; acts as the KV budget these requests will occupy once their KV transfer completes. |
| numPausedKvTokens | Total KV context length summed across paused (preempted-decode) requests. Complements numPausedRequests. |

Naming follows the file's existing conventions: Ctx for token-count compounds, Context for request counts, Gen for decode/generation, Paused for preempted-decode, num prefix first.
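For orientation, here is a consumer-side sketch of a deserialized per-iteration stats dict carrying the nested counters. The key names follow this PR's JSON camelCase convention, but the payload itself is illustrative, not output from a real run:

```python
import json

# Illustrative payload only; values are made up for the example.
payload = json.dumps({
    "iter": 42,
    "inflightBatchingStats": {
        "numContextRequests": 2,
        "numCtxTokens": 1024,
        "numCtxKvTokens": 256,
        "numGenRequests": 6,
        "numGenKvTokens": 4096,
        "numQueuedContextRequests": 3,
        "numQueuedCtxTokens": 1536,
        "numQueuedGenRequests": 0,
        "numQueuedGenKvTokens": 0,
        "numPausedRequests": 1,
        "numPausedKvTokens": 512,
    },
})

ifb = json.loads(payload)["inflightBatchingStats"]
# numCtxKvTokens (read from prior state) + numCtxTokens (computed this
# iteration) gives the total context length touched this iteration.
print(ifb["numCtxKvTokens"] + ifb["numCtxTokens"])  # 1280
```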

Dynamo ForwardPassMetrics → TRT-LLM mapping

For downstream readers (e.g., Dynamo's TRT-LLM adapter), here is the full mapping from Dynamo's ForwardPassMetrics fields to the TRT-LLM IterationStats fields that should be consumed. Four fields reuse existing InflightBatchingStats members; seven are new (added in this PR). The two disagg-decode-aware rows compose the new gen-queue fields with the existing numPaused* fields so the Dynamo adapter reports a correct "queued decode" signal on both preempted-decode and disagg-decode-awaiting-KV paths.

| Dynamo ForwardPassMetrics | TRT-LLM field | Status |
| --- | --- | --- |
| scheduled_requests.num_prefill_requests | inflightBatchingStats.numContextRequests | existing |
| scheduled_requests.sum_prefill_tokens | inflightBatchingStats.numCtxTokens | existing |
| scheduled_requests.sum_prefill_kv_tokens | inflightBatchingStats.numCtxKvTokens | new |
| scheduled_requests.num_decode_requests | inflightBatchingStats.numGenRequests | existing |
| scheduled_requests.sum_decode_kv_tokens | inflightBatchingStats.numGenKvTokens | new |
| queued_requests.num_prefill_requests | inflightBatchingStats.numQueuedContextRequests | new |
| queued_requests.sum_prefill_tokens | inflightBatchingStats.numQueuedCtxTokens | new |
| queued_requests.num_decode_requests | inflightBatchingStats.numPausedRequests + numQueuedGenRequests | existing + new |
| queued_requests.sum_decode_kv_tokens | inflightBatchingStats.numPausedKvTokens + numQueuedGenKvTokens | new + new |

Variance fields on Dynamo's ForwardPassMetrics (var_prefill_length, var_decode_kv_tokens) are not populated by TRT-LLM yet — tracked as a follow-up once a downstream consumer begins reading them.
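The mapping table can be expressed as a small adapter-side function. This is a sketch under the assumption that `ifb` is a deserialized inflightBatchingStats dict; the function name and output shape are illustrative, not Dynamo's actual adapter code:

```python
def to_forward_pass_metrics(ifb: dict) -> dict:
    """Map TRT-LLM InflightBatchingStats keys to Dynamo
    ForwardPassMetrics names per the mapping table above."""
    return {
        "scheduled_requests": {
            "num_prefill_requests": ifb["numContextRequests"],
            "sum_prefill_tokens": ifb["numCtxTokens"],
            "sum_prefill_kv_tokens": ifb["numCtxKvTokens"],
            "num_decode_requests": ifb["numGenRequests"],
            "sum_decode_kv_tokens": ifb["numGenKvTokens"],
        },
        "queued_requests": {
            "num_prefill_requests": ifb["numQueuedContextRequests"],
            "sum_prefill_tokens": ifb["numQueuedCtxTokens"],
            # Compose preempted-decode with disagg-decode-awaiting-KV so the
            # queued-decode signal is correct on both paths:
            "num_decode_requests": ifb["numPausedRequests"]
                + ifb["numQueuedGenRequests"],
            "sum_decode_kv_tokens": ifb["numPausedKvTokens"]
                + ifb["numQueuedGenKvTokens"],
        },
    }


ifb = {"numContextRequests": 1, "numCtxTokens": 512, "numCtxKvTokens": 0,
       "numGenRequests": 4, "numGenKvTokens": 2048,
       "numQueuedContextRequests": 2, "numQueuedCtxTokens": 700,
       "numPausedRequests": 1, "numQueuedGenRequests": 3,
       "numPausedKvTokens": 300, "numQueuedGenKvTokens": 900}
m = to_forward_pass_metrics(ifb)
print(m["queued_requests"]["num_decode_requests"])   # 4
print(m["queued_requests"]["sum_decode_kv_tokens"])  # 1200
```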

Attention-DP is NOT covered

Under enable_attention_dp=True, tp_size > 1, TRT-LLM's rank-0-only RPC gate (rpc_worker_mixin.py) ensures only rank-0's IterationStats reach the user — same behavior as today. True per-rank emission under attention-DP requires a separate MPI collective-ordering redesign (piggyback on adp_router.gather_all_rank_states is one proposal) and is tracked as a follow-up.

Test plan

  • pytest tests/unittest/pyexecutor/test_iter_stats_populate.py -v — 14 passed (local, in dev container with mount-over py_executor.py)
  • pytest tests/unittest/_torch/executor/test_pytorch_model_engine.py::PyTorchModelEngineTestCase -k cudagraph_dummies — 2 passed. Negative-control verified: disabling the strip in CUDAGraphRunner.pad_batch's finally makes both tests fail with AssertionError: 8 != 5, confirming they would catch a regression in the strip invariant.
  • pytest tests/unittest/llmapi/test_llm_pytorch.py::test_llm_get_stats and ::test_llm_get_stats_async — 8 passed across all parametrize combinations including use_overlap=True (against an earlier schema; will re-run against the final nanobind binding once the local wheel rebuild completes)
  • pre-commit run clean on all modified files
  • Full end-to-end sanity via LLM.get_stats() against a real prompt workload (TinyLlama, pytorch backend) once the local TRT-LLM wheel rebuild with the new .so finishes
  • /bot run

Follow-up (tracked separately)

  1. Attention-DP per-rank emission — MPI collective-ordering redesign. Tracked on TRTLLM-12123.
  2. Variance fields — Welford-based var_* once a downstream consumer starts reading them.
  3. Dynamo adapter update — rename the adapter's field reads to the new nested inflightBatchingStats.* names and compose the queued-decode signal from numPausedRequests + numQueuedGenRequests / numPausedKvTokens + numQueuedGenKvTokens (separate Dynamo-side PR).

Summary by CodeRabbit

  • New Features

    • Extended inflight batching statistics with five new granular metrics to track token distribution and queue status during model execution: context KV tokens, generation KV tokens, queued context requests, queued context tokens, and paused KV tokens.
  • Tests

    • Added comprehensive test suite validating the new iteration-level statistics fields across various batch execution scenarios.

nv-yna added 5 commits April 16, 2026 20:58
Extend IterationStats with 9 flat top-level fields (scheduled_* and
queued_* for prefill / decode counts and token sums) so consumers such
as Dynamo's planner can build ForwardPassMetrics directly from
LLM.get_stats_async() dicts, closing the TODO at
dynamo/components/src/dynamo/common/forward_pass_metrics.py
("add metrics for TrtLLM/SGLang").

Changes:
* cpp/include/tensorrt_llm/executor/types.h:
  Add 9 SizeType32 fields to IterationStats. Variance fields
  (var_prefill_length, var_decode_kv_tokens, and their queued
  counterparts) are deferred: the live Dynamo planner consumes only
  sums on origin/main.
* cpp/tensorrt_llm/executor/{serialization,jsonSerialization}.cpp:
  Plumb the new fields through binary serialize / deserialize /
  serializedSize and the NLOHMANN_DEFINE_TYPE_NON_INTRUSIVE macro.
* cpp/tensorrt_llm/nanobind/executor/bindings.cpp:
  Expose the 9 fields as snake-cased Python attributes.
* tensorrt_llm/_torch/pyexecutor/py_executor.py:
  Populate the 9 fields in _update_iter_stats by walking
  scheduled_batch.context_requests (context_chunk_size +
  context_current_position), generation_requests, paused_requests,
  and the executor_request_queue. Attention-DP dummy requests are
  excluded. Gated by the existing enable_iter_perf_stats flag.
* _process_iter_stats: when enable_attention_dp and tp_size > 1,
  tp_allgather each rank's serialized + tagged (attentionDpRank) dict
  across TP siblings and store N-per-iter tagged dicts on rank 0 so
  the stats RPC can surface per-attention-DP-rank FPM streams to the
  Dynamo planner (which keys its perf model by (worker_id, dp_rank)).
* tensorrt_llm/executor/base_worker.py _stats_serializer:
  Accept pre-serialized dict tuples from the attention-DP path;
  otherwise inject attentionDpRank=0 so downstream consumers can
  always read the field without a missing-key branch.
* tests/unittest/pyexecutor/test_iter_stats_fpm_fields.py (new):
  Fixture-driven coverage of the 9-field mapping (prefill, decode,
  chunked prefill, prefix-cache hit, preempted decode, queued,
  attention-DP dummy exclusion) plus a to_json_str roundtrip.

Semantics mirror vLLM's InstrumentedScheduler in Dynamo
(_extract_scheduled + _compute_queued): semantically equivalent
definitions, accepting numeric divergences from backend-specific
block managers / scheduler budgets.

Signed-off-by: Yuewei Na <nv-yna@users.noreply.github.com>
Signed-off-by: Yuewei Na <nv-yna@users.noreply.github.com>
Original FPM populate code read req.context_chunk_size /
req.context_current_position on each scheduled context request, but
_update_iter_stats runs after _update_request_states: by that point,
context requests on their last chunk have transitioned to
GENERATION_IN_PROGRESS, and the C++ accessors assert 'context phase or
generation init phase' on that state.

E2E smoke with Qwen3-0.6B (8 requests, ISL=1024, OSL=128) showed
scheduled_sum_prefill_tokens was stuck at 0 because every per-request
access threw RuntimeError.

Fix:
* For scheduled_sum_prefill_kv_tokens: read py_last_context_chunk, a
  Python-cached (start, end) tuple set in _update_request_states BEFORE
  the state transition. The start offset equals what the pre-mutation
  context_current_position was.
* For scheduled_sum_prefill_tokens: read from
  self.model_engine.iter_states['num_ctx_tokens'], a model-engine-level
  aggregate already computed before state mutation (existing code path
  used by inflight_batching_stats.num_ctx_tokens).

This preserves the MVP's single-rank and per-rank (attention-DP)
behavior. E2E verification with Qwen3-0.6B now emits
scheduledSumPrefillTokens=1024 during prefill iters, matching the
per-request prompt length. Test artifact:
~/code/tmp_context/forwardmetric-redesign/20260416/trtllm-only-e2e/fpm_results.json

Signed-off-by: Yuewei Na <nv-yna@users.noreply.github.com>
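The failure mode and fix in the commit above can be modeled standalone. The class below is a stand-in for the real C++-backed request object, keeping only the two attributes the commit discusses; the state strings and caching behavior follow the commit text:

```python
class FakeRequest:
    """Stand-in for a scheduled context request whose C++ accessor
    asserts on request state, mirroring the failure mode above."""

    def __init__(self, start, end):
        self.state = "CONTEXT"
        self._pos = start          # context_current_position before mutation
        self._chunk = end - start  # context_chunk_size
        self.py_last_context_chunk = None  # Python-side cache

    @property
    def context_current_position(self):
        if self.state != "CONTEXT":
            # Models the C++ assertion that threw RuntimeError in the smoke run.
            raise RuntimeError("context phase or generation init phase")
        return self._pos

    def update_state(self):
        # Mirrors _update_request_states: cache the (start, end) tuple
        # BEFORE the state transition.
        self.py_last_context_chunk = (self._pos, self._pos + self._chunk)
        self.state = "GENERATION_IN_PROGRESS"


req = FakeRequest(start=256, end=1024)  # 256 tokens read from prior state
req.update_state()

# Post-transition, the accessor throws...
try:
    _ = req.context_current_position
except RuntimeError:
    pass
# ...but the cached tuple still yields the pre-mutation offsets:
start, end = req.py_last_context_chunk
print(start)        # 256 -> prefill-KV contribution
print(end - start)  # 768 -> tokens computed this iteration
```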
…t tests with real populate

Remove the tp_allgather-based attention-DP per-rank delivery from the
FPM MVP branch. Ships the 9 flat IterationStats fields for the
non-attention-DP case only; attention-DP per-rank emission becomes a
follow-up with a proper RC2 redesign (piggyback on
adp_router.gather_all_rank_states or dedicated MPI communicator).

Changes:
- py_executor.py: delete the `if self.enable_attention_dp and
  self.dist.tp_size > 1 and self.dist.is_first_pp_rank:` branch in
  _process_iter_stats that built a dict, called tp_allgather, and
  stored (dict, None, None) sentinels. Reduces _process_iter_stats to
  a straightforward populate + append. Remove now-unused `import json`.
- base_worker.py: delete the dict-sentinel `isinstance(iteration_stats,
  dict)` short-circuit in _stats_serializer. Keep
  `stats_dict.setdefault("attentionDpRank", 0)` so the serialized dict
  stays self-describing for the future attention-DP follow-up.
- tests/unittest/pyexecutor/test_iter_stats_fpm_fields.py: remove
  _FakeDist + test_attention_dp_path_tags_rank_and_short_circuits
  (they exercised the deleted branch). Update the _populate helper to
  mirror the REAL populate algorithm: read py_last_context_chunk[0]
  (primary) + context_current_position (fallback) for prefill-KV, read
  model_engine.iter_states["num_ctx_tokens"] (engine-level aggregate)
  for prefill-tokens. Add a stub iter_states dict on the mock executor.
  Tests now validate the real code path instead of the pre-populate-fix
  approximation.

Rationale for the delete:
- The tp_allgather in _process_iter_stats interleaved with
  adp_router.gather_all_rank_states on the same TP communicator,
  causing MPI collective-ordering desync (RankState(*24 args)
  type-confusion). Documented in
  ~/code/tmp_context/forwardmetric-redesign/20260416/STEPS_14_15_POSTMORTEM.md.
- Under enable_attention_dp=True, the Dynamo adapter side is gated off
  separately (see companion commit on dynamo-fpm branch) so no
  misleading fake-idle FPM streams reach the Planner.

Signed-off-by: Yuewei Na <nv-yna@users.noreply.github.com>
Signed-off-by: Yuewei Na <nv-yna@users.noreply.github.com>
nv-yna added 8 commits April 20, 2026 14:47
The prior round-trip test set and asserted only 5 of the 9 flat
IterationStats fields added by this PR. A camelCase typo or a dropped
serializer line in jsonSerialization.cpp on the other 4 (scheduledNumDecodeRequests,
scheduledSumDecodeKvTokens, queuedNumPrefillRequests, queuedSumPrefillTokens)
would silently ship.

Expand the test to exercise all 9 keys with distinct values per field, so
a cross-wiring bug in the C++ serializer (e.g. swapping prefill and
decode count slots) fails the assertion rather than round-tripping
cleanly with the wrong semantics.

Test locally:
  pytest tests/unittest/pyexecutor/test_iter_stats_fpm_fields.py -v
  -> 13 passed

Signed-off-by: Yuewei Na <nv-yna@users.noreply.github.com>
…for FPM fields

Replace the `_populate` shim in `test_iter_stats_fpm_fields.py` with direct
unbound calls to the production `PyExecutor._update_iter_stats`. The prior
shim was a verbatim copy of the populate block, so it could not detect
drift between tests and production. A `MagicMock`-based fake `self` holds
only the attributes the method actually reads (two for the FPM fields:
`executor_request_queue.get_request_queue().queue` and
`model_engine.iter_states`), and `torch.cuda.mem_get_info` is patched so
the test runs without a live CUDA context.
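The fixture style described above can be sketched in a few lines. The `populate` helper below is a stand-in for the unbound `PyExecutor._update_iter_stats` call, modeling only the queue-walking part; in the real test, `torch.cuda.mem_get_info` is also patched so no CUDA context is needed, which is omitted here to keep the sketch dependency-free:

```python
from unittest import mock


def populate(executor):
    # Stand-in for the queue-walking portion of _update_iter_stats:
    # read only executor.executor_request_queue.get_request_queue().queue.
    queue = executor.executor_request_queue.get_request_queue().queue
    return sum(len(item.request.input_token_ids) for item in queue)


# A MagicMock "self" carrying only the attributes the method reads.
fake_self = mock.MagicMock()
entry = mock.MagicMock()
entry.request.input_token_ids = [1, 2, 3, 4]
fake_self.executor_request_queue.get_request_queue.return_value.queue = [entry]

print(populate(fake_self))  # 4
```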

Also extend `validate_stats` in `test_llm.py` to assert the 9 FPM
camelCase keys on every stats iteration under the pytorch backend. This
exercises the full pipeline (`_update_iter_stats` -> `_stats_serializer` ->
RPC -> `get_stats` / `get_stats_async`) via the existing
`llm_get_stats_test_harness`, so any drop or rename along the chain fails
the test. Prefill-iteration `scheduledSumPrefillTokens > 0` is gated off
under `use_overlap=True` because the populate block sources that field
from the `model_engine.iter_states["num_ctx_tokens"]` side channel, which
can be stale at the moment `_update_iter_stats` runs under overlap
scheduling; all other FPM keys remain strictly asserted.

Local runs (inside the TRT-LLM build container, RTX 4090):
  pytest tests/unittest/pyexecutor/test_iter_stats_fpm_fields.py
    -> 13 passed
  pytest tests/unittest/llmapi/test_llm_pytorch.py::test_llm_get_stats
    -> 4 passed (all parametrize combinations)
  pytest tests/unittest/llmapi/test_llm_pytorch.py::test_llm_get_stats_async
    -> 4 passed (all parametrize combinations)

Signed-off-by: Yuewei Na <nv-yna@users.noreply.github.com>
…verlap-safe)

Under overlap scheduling `_executor_loop_overlap` invokes
`_process_iter_stats` on `self.previous_batch` (py_executor.py:2654-2656),
but `self.model_engine.iter_states["num_ctx_tokens"]` has been
overwritten by the CURRENT iteration's `_forward_step` before the
previous batch's stats are emitted. Reading the side channel therefore
reports the wrong batch's prefill-token count for every emission, and
at iter 0 collapses to 0 (initial dict empty). The Planner reads that as
"worker idle" and can scale down a loaded worker — silent planner poison.

Reproduction before the fix:
  pytest tests/unittest/llmapi/test_llm_pytorch.py::test_llm_get_stats
  -> 2 failed (use_overlap=True configurations), 2 passed

Fix: drop the side-channel read and compute `scheduled_sum_prefill_tokens`
locally inside the existing context_requests loop by summing
`py_last_context_chunk[1] - py_last_context_chunk[0]` (= chunk_size =
end_compute - begin_compute) across non-dummy context requests. This
mirrors what `_prepare_tp_inputs` computes as `len(input_ids)` but
without depending on when that write happens relative to
`_update_iter_stats`. `py_last_context_chunk` is a Python-side cache
set by `_update_request_states` BEFORE state mutation, so it remains
valid for the previous batch's requests even after they transition to
GENERATION_IN_PROGRESS.

Dummy-exclusion semantics are now consistent across all three
scheduled-prefill fields (count, sum_tokens, sum_kv_tokens) — the
loop's single `if is_attention_dp_dummy: continue` guard drives all of
them. Previously `scheduled_sum_prefill_tokens` came from the engine
aggregate which included dummies while siblings excluded them.
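A minimal sketch of the overlap-safe computation described above, with stand-in request objects (attribute names follow the commit text):

```python
class Req:
    """Stand-in for a scheduled context request after state mutation."""

    def __init__(self, chunk, dummy=False):
        self.py_last_context_chunk = chunk  # (start, end), cached pre-mutation
        self.is_attention_dp_dummy = dummy


def scheduled_sum_prefill_tokens(context_requests):
    total = 0
    for req in context_requests:
        if req.is_attention_dp_dummy:
            continue  # single guard drives all three scheduled-prefill fields
        start, end = req.py_last_context_chunk
        total += end - start  # chunk size computed this iteration
    return total


reqs = [Req((0, 512)), Req((512, 1024)), Req((0, 64), dummy=True)]
print(scheduled_sum_prefill_tokens(reqs))  # 1024, dummy excluded
```

Because the tuple is cached before state mutation and never overwritten by the next iteration's forward step, the sum stays correct for the previous batch under overlap scheduling, unlike the engine-level iter_states aggregate.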

The `test_llm.py::validate_stats` helper drops the `if not use_overlap`
relaxation; the strict `scheduledSumPrefillTokens > 0` assertion now
holds under every scheduler configuration.

`test_iter_states_missing_falls_back_to_zero` is repurposed as a
regression guard: passes a deliberately wrong `iter_states["num_ctx_tokens"]`
value and asserts the result comes from `py_last_context_chunk`, not
the side channel.

Known separate issue, NOT fixed here: `queued_num_prefill_requests` and
`queued_sum_prefill_tokens` still read `self.executor_request_queue`
at stats emission time. Under overlap this produces values from iter
N's state paired with batch N-1's scheduled_* counts — a semantic
drift, not a wrong value. Fixing it requires snapshotting queue state
at batch-prep time and belongs in a dedicated design discussion.

Local runs (inside dynamo-fpm-mvp-build:test with the modified
py_executor.py mounted over the installed copy):
  pytest tests/unittest/pyexecutor/test_iter_stats_fpm_fields.py
    -> 13 passed
  pytest tests/unittest/llmapi/test_llm_pytorch.py::test_llm_get_stats
    -> 4 passed (all parametrize combinations, including use_overlap=True)
  pytest tests/unittest/llmapi/test_llm_pytorch.py::test_llm_get_stats_async
    -> 4 passed (all parametrize combinations, including use_overlap=True)

Signed-off-by: Yuewei Na <nv-yna@users.noreply.github.com>
…onventions

Two related cleanups on the per-iteration request-aggregate counters
added earlier in this PR:

1. Comments made self-contained. References to downstream consumers
   and unrelated reference implementations ("ForwardPassMetrics",
   "Dynamo", "Dynamo's planner", vLLM's InstrumentedScheduler,
   components/src/dynamo/...) are removed from the TRT-LLM source
   tree. Comments now describe the behavior and invariants of the
   code in front of the reader without requiring familiarity with an
   external repo.

2. Default-initialization moved from the struct to the populate path.
   The 9 new primitive members on IterationStats were declared with
   `{0}` defaults, but every sibling field on the struct is instead
   populated unconditionally by the TensorRT-backend populate chain
   (Executor::Impl::getCurrentIterationStats → mModel->
   getCurrentIterationStats). Breaking that invariant silently exempts
   the new members and makes it harder to verify that every field is
   reached by the populate chain. Drop the `{0}` defaults and instead
   assign the 9 fields to 0 in
   Executor::Impl::getCurrentIterationStats before the virtual
   mModel callback, so a future TRT-side populate can overwrite and
   the struct-level convention is restored. Net behavior unchanged:
   TRT backend emits 0 for these fields (the backend does not track
   them); the PyTorch backend populates them via its own path.

Files:
- cpp/include/tensorrt_llm/executor/types.h — drop {0} defaults,
  rewrite section comment.
- cpp/tensorrt_llm/executor/executorImpl.cpp — add 9 zero
  assignments with a comment explaining the invariant.
- tensorrt_llm/_torch/pyexecutor/py_executor.py — comment rewrite.
- tests/unittest/pyexecutor/test_iter_stats_fpm_fields.py —
  docstring/comment rewrite.
- tests/unittest/llmapi/test_llm.py — comment rewrite.

Signed-off-by: Yuewei Na <nv-yna@users.noreply.github.com>
The 9 top-level flat fields added earlier in this PR had two systemic
issues:

1. Five of the nine were semantic duplicates of existing fields already
   tracked elsewhere in IterationStats or InflightBatchingStats. Keeping
   them meant two fields answered the same question with slightly
   different dummy-filtering semantics, inviting ambiguity for
   downstream consumers.

2. The naming vocabulary and word order didn't match the rest of the
   file. The struct uses Context/Ctx/Gen/Paused and num-prefix word
   order (numContextRequests, numCtxTokens); the new fields introduced
   Prefill/Decode and inverted the word order
   (scheduledNumPrefillRequests).

Cleanup:

- Delete the 9 top-level IterationStats members introduced in commit
  086f692.
- Add 5 members to InflightBatchingStats using the file's native
  vocabulary and word order:
    numCtxPrecomputedTokens       — tokens read from prior state
                                    (prefix cache hits + previously
                                    chunked tokens) across scheduled
                                    context requests; pairs with
                                    numCtxTokens (tokens computed
                                    this iteration).
    numGenKvTokens                — total KV context summed across
                                    scheduled generation requests.
    numQueuedContextRequests      — normal-only requests in the
                                    executor request queue (filters
                                    shutdown/cancel control items).
    numQueuedCtxTokens            — prompt tokens for the above.
    numPausedKvTokens             — total KV context across paused
                                    (preempted-decode) requests.

All serialization paths (binary, NLOHMANN JSON, nanobind) are updated
in lockstep so the wire format is self-consistent.

Also removes the 9 explicit zero-assignments in
Executor::Impl::getCurrentIterationStats that were added in commit
9bd9133 as a compensation for the old IterationStats field set. The
new fields live inside InflightBatchingStats which is
value-initialized at its model-populate construction site
(InflightBatchingStats modelStats{} in trtGptModelInflightBatching.cpp),
so TRT-backend stats emissions get zero sentinels automatically.

Populate block in PyExecutor._update_iter_stats collapses — 5 loops
feed 5 new InflightBatchingStats fields; 4 duplicate top-level
assignment targets go away entirely (replaced by pre-existing
InflightBatchingStats assignments at py_executor.py:1128-1134 which
handle num_context_requests / num_gen_requests / num_paused_requests /
num_scheduled_requests from the scheduled_batch directly).

Tests updated: test_iter_stats_fpm_fields.py now asserts on the nested
InflightBatchingStats fields and on existing sibling fields where the
old top-level assertions mapped. test_llm.py::validate_stats now reads
new keys from ifbStats instead of the removed top-level keys, and
uses numCtxTokens > 0 on prefill iterations (overlap-safe, set after
_forward_step).

Local validation: Python syntax check clean; clang-format clean on
all 5 cpp files. Tests against the current container .so fail with
AttributeError on the new nested field names (expected — the stale
binding has the old schema). A local TRT-LLM wheel rebuild will land
the updated nanobind binding and unblock e2e validation.

Signed-off-by: Yuewei Na <nv-yna@users.noreply.github.com>
Signed-off-by: Yuewei Na <nv-yna@users.noreply.github.com>
- py_executor.py: drop the overlap-safety paragraph (no longer
  user-facing interesting), inline per-field rationale directly
  above each computation loop, keep the short high-level intro.
- py_executor.py: restore the single-expression form of
  self._append_iter_stats(self._update_iter_stats(...)) in
  _process_iter_stats; the intermediate iter_stats local variable
  was an unrelated drift from this PR's refactors.
- base_worker.py: drop the cross-repo pointer "(see Dynamo
  publisher.py gate)" from the attentionDpRank comment so the
  comment remains self-contained in TRT-LLM.

Signed-off-by: Yuewei Na <nv-yna@users.noreply.github.com>
…_populate.py

The old name referenced Dynamo's ForwardPassMetrics vocabulary, but
these tests now cover the populate block in
``PyExecutor._update_iter_stats`` generally — specifically the extra
members on ``InflightBatchingStats`` written by that block. Renaming
to match the scope and drop the out-of-tree concept.

Pure rename; no content change.

Signed-off-by: Yuewei Na <nv-yna@users.noreply.github.com>
@nv-yna nv-yna changed the title [None][feat] Expose flat ForwardPassMetrics fields on IterationStats (non-attention-DP scope) [None][feat] Add per-iteration request-aggregate counters to InflightBatchingStats Apr 21, 2026
Pure rename across all 7 touched files — C++ struct, both serializers,
nanobind binding, Python populate, and both test files. snake_case
counterpart renamed to ``num_ctx_kv_tokens``.

Matches the `KvTokens` pattern already used for the sibling
`numGenKvTokens` / `numPausedKvTokens` fields on the same struct —
symmetry makes the three KV-token fields visually group in the struct.

Signed-off-by: Yuewei Na <nv-yna@users.noreply.github.com>
@nv-yna nv-yna marked this pull request as ready for review April 21, 2026 21:40
@nv-yna nv-yna requested review from a team as code owners April 21, 2026 21:40

nv-yna commented Apr 21, 2026

/bot run --disable-fail-fast


coderabbitai Bot commented Apr 21, 2026

📝 Walkthrough

Walkthrough

The PR extends InflightBatchingStats struct with five new fields tracking KV-cache metrics: KV tokens from context and generation requests, queued context request counts, total queued context tokens, and paused request KV tokens. Changes span type definitions, serialization layers, Python bindings, computation logic in the executor, and comprehensive test coverage.

Changes

  • Type Definition (cpp/include/tensorrt_llm/executor/types.h): Added five new SizeType32 fields to the InflightBatchingStats struct: numCtxKvTokens, numGenKvTokens, numQueuedContextRequests, numQueuedCtxTokens, numPausedKvTokens.
  • Serialization Layer (cpp/tensorrt_llm/executor/jsonSerialization.cpp, cpp/tensorrt_llm/executor/serialization.cpp): Extended the JSON serialization macro to include the new fields, and updated binary serialize/deserialize/serializedSize methods to handle the five additional KV-token and queue-related counters.
  • Python Integration (cpp/tensorrt_llm/nanobind/executor/bindings.cpp, tensorrt_llm/_torch/pyexecutor/py_executor.py): Exposed the new properties to Python via nanobind and implemented computation logic in _update_iter_stats to aggregate KV tokens from scheduled requests, queued requests from the request queue, and paused requests via py_last_context_chunk and per-request token counts.
  • Utility & Stats (tensorrt_llm/executor/base_worker.py): Added default value initialization for the attentionDpRank field in the JSON-parsed stats dictionary to keep the serialized stats payload shape consistent.
  • Tests (tests/unittest/llmapi/test_llm.py, tests/unittest/pyexecutor/test_iter_stats_populate.py): Enhanced existing test validation to assert the new fields are present and non-negative, added context/generation phase-specific assertions, and introduced a new test suite covering empty iterations, prefill/decode scenarios, queued/paused request aggregation, and JSON serialization round-trips.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)
  • Docstring Coverage ⚠️ Warning: docstring coverage is 19.44%, below the required 80.00% threshold. Resolution: write docstrings for the functions missing them.

✅ Passed checks (4 passed)
  • Linked Issues check ✅: skipped because no linked issues were found for this pull request.
  • Out of Scope Changes check ✅: skipped because no linked issues were found for this pull request.
  • Title check ✅: the title clearly and specifically describes the main change, adding per-iteration request-aggregate counters to InflightBatchingStats, which is the primary focus of all file modifications.
  • Description check ✅: the pull request description clearly explains the purpose, scope, new fields, and implementation details.




@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
cpp/tensorrt_llm/nanobind/executor/bindings.cpp (1)

2-2: ⚠️ Potential issue | 🟡 Minor

Update SPDX copyright year for this modified file.

This file was modified, but the header still ends at 2025. Please bump it to the current modification year.

Suggested fix
- * SPDX-FileCopyrightText: Copyright (c) 2022-2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+ * SPDX-FileCopyrightText: Copyright (c) 2022-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.

As per coding guidelines: "Add NVIDIA copyright header on ALL new files and update year on modified files" and "All TensorRT-LLM source files must contain an NVIDIA copyright header with the year of latest meaningful modification".

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cpp/tensorrt_llm/nanobind/executor/bindings.cpp` at line 2, The SPDX header
in the top of cpp/tensorrt_llm/nanobind/executor/bindings.cpp still lists
copyright years ending at 2025; update the SPDX comment to include the current
modification year (bump 2025 to the current year) so the NVIDIA copyright header
reflects the latest meaningful modification.
🧹 Nitpick comments (1)
tests/unittest/pyexecutor/test_iter_stats_populate.py (1)

91-95: Let the queue stub model payload-less entries too.

PyExecutor._update_iter_stats() has an explicit item.request is None skip path, but _StubQueueItem always fabricates a request object. That leaves the "payload-less requests are excluded" contract untested.

Suggested test helper tweak
 class _StubQueueItem:
-    def __init__(self, input_token_ids, is_normal_request=True):
-        self.request = types.SimpleNamespace(input_token_ids=input_token_ids)
+    def __init__(
+        self,
+        input_token_ids=None,
+        *,
+        is_normal_request=True,
+        has_request=True,
+    ):
+        self.request = (
+            types.SimpleNamespace(input_token_ids=input_token_ids)
+            if has_request
+            else None
+        )
         self.is_normal_request = is_normal_request

Then add a small assertion that has_request=False leaves both queued counters unchanged.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/unittest/pyexecutor/test_iter_stats_populate.py` around lines 91 - 95,
Modify the test stub and assertions so payload-less queue entries can be
simulated: update _StubQueueItem to allow request=None (or add a parameter like
has_request=False) instead of always creating types.SimpleNamespace, then in the
test that covers PyExecutor._update_iter_stats create an item with no request
and assert that both queued counters remain unchanged; reference the
_StubQueueItem constructor and the PyExecutor._update_iter_stats call when
making the change and add the extra assertion that has_request=False leaves
queued counters unchanged.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@tensorrt_llm/_torch/pyexecutor/py_executor.py`:
- Around line 1234-1247: The counters currently increment
num_queued_context_requests before verifying the request payload, so move the
payload validation so that you only increment num_queued_context_requests when
the item has a usable payload: check getattr(item, "is_normal_request", False)
and that item.request exists and that item.request.input_token_ids is present
and consumable (e.g., wrapped in the existing try/except that computes
len(item.request.input_token_ids)); only on successful token-length computation
increment both num_queued_ctx_tokens (by the length) and
num_queued_context_requests. Update the loop around
executor_request_queue.get_request_queue().queue to perform the payload
existence/len check first and skip incrementing the request counter when
input_token_ids is missing or raises in the try block.
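A minimal sketch of the suggested ordering, using the attribute names from the comment above (the real loop lives in PyExecutor._update_iter_stats; the scaffolding here is illustrative):

```python
import types

def tally_queued_context(queue_items):
    # Sketch: compute the token length first, and only on success bump
    # BOTH counters, so request and token counts move in lockstep.
    num_queued_context_requests = 0
    num_queued_ctx_tokens = 0
    for item in queue_items:
        if not getattr(item, "is_normal_request", False) or item.request is None:
            continue
        try:
            num_tokens = len(item.request.input_token_ids)
        except (AttributeError, TypeError):
            continue  # unreadable payload: skip both counters
        num_queued_ctx_tokens += num_tokens
        num_queued_context_requests += 1
    return num_queued_context_requests, num_queued_ctx_tokens

items = [
    types.SimpleNamespace(
        is_normal_request=True,
        request=types.SimpleNamespace(input_token_ids=[1, 2])),
    types.SimpleNamespace(
        is_normal_request=True,
        request=types.SimpleNamespace(input_token_ids=None)),
]
counts = tally_queued_context(items)
```

With this ordering, the unreadable-payload item contributes to neither counter, so a downstream average of tokens per request cannot be deflated.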



📥 Commits

Reviewing files that changed from the base of the PR and between 7ae13fb and c05da0c.

📒 Files selected for processing (8)
  • cpp/include/tensorrt_llm/executor/types.h
  • cpp/tensorrt_llm/executor/jsonSerialization.cpp
  • cpp/tensorrt_llm/executor/serialization.cpp
  • cpp/tensorrt_llm/nanobind/executor/bindings.cpp
  • tensorrt_llm/_torch/pyexecutor/py_executor.py
  • tensorrt_llm/executor/base_worker.py
  • tests/unittest/llmapi/test_llm.py
  • tests/unittest/pyexecutor/test_iter_stats_populate.py

@tensorrt-cicd
Collaborator

PR_Github #44807 [ run ] triggered by Bot. Commit: c05da0c Link to invocation

Collaborator

@brb-nv brb-nv left a comment


Left a few questions to consider.

@nv-yna nv-yna changed the title [None][feat] Add per-iteration request-aggregate counters to InflightBatchingStats [TRTLLM-12123][feat] Add per-iteration request-aggregate counters to InflightBatchingStats Apr 23, 2026
nv-yna added 5 commits April 23, 2026 14:38
…onStats

Follow-up work is documented on TRTLLM-12123. The stub
stats_dict.setdefault("attentionDpRank", 0) keeps downstream consumers
able to read the field unconditionally today.

Signed-off-by: Yuewei Na <nv-yna@users.noreply.github.com>
Add two regression guards in PyTorchModelEngineTestCase, placed next to the
existing test_pad_generation_requests. They pin the invariant that
CUDAGraphRunner.pad_batch strips every is_cuda_graph_dummy=True entry
from scheduled_requests.generation_requests before its `with` block
exits — on both clean exit and exception paths.

Downstream consumers of scheduled_batch.generation_requests rely on
this, including the new FPM populate block added earlier in this PR
(tensorrt_llm/_torch/pyexecutor/py_executor.py:1220); those callers
filter only is_attention_dp_dummy and trust pad_batch to handle the
cudagraph kind upstream.

Negative-control verified locally: replacing the `finally` slice with
`pass` makes both tests fail with 'AssertionError: 8 != 5' (5 real
requests + 3 padding dummies retained).

Signed-off-by: Yuewei Na <nv-yna@users.noreply.github.com>
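The invariant those guards pin can be modeled with a minimal context-manager sketch (names and structure are hypothetical; the real logic is CUDAGraphRunner.pad_batch):

```python
import contextlib
import types

@contextlib.contextmanager
def pad_batch(generation_requests, target_size):
    """Sketch: append dummy requests up to target_size, then strip every
    is_cuda_graph_dummy entry on exit -- clean or exceptional -- so
    downstream readers never observe padding."""
    dummies = [
        types.SimpleNamespace(is_cuda_graph_dummy=True)
        for _ in range(target_size - len(generation_requests))
    ]
    generation_requests.extend(dummies)
    try:
        yield generation_requests
    finally:
        generation_requests[:] = [
            r for r in generation_requests
            if not getattr(r, "is_cuda_graph_dummy", False)
        ]

reqs = [types.SimpleNamespace(is_cuda_graph_dummy=False) for _ in range(5)]

# Clean-exit path: padded inside the block, stripped after it.
with pad_batch(reqs, 8):
    padded_len = len(reqs)
after_len = len(reqs)

# Exception path: padding is stripped even when the block raises.
try:
    with pad_batch(reqs, 8):
        raise RuntimeError("simulated failure")
except RuntimeError:
    pass
after_exception_len = len(reqs)
```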
…t comments

Line-number references (cuda_graph_runner.py:524, py_executor.py:1220)
drift easily under refactors; the comment still conveys intent without
them. 'FPM' is a Dynamo-side term; describe the dependent code as
'the per-iteration stats populate block' so the comment stands on its
own within TRT-LLM.

Signed-off-by: Yuewei Na <nv-yna@users.noreply.github.com>
…, narrow except, warn on drift

Addresses CodeRabbit's review comment on PyExecutor._update_iter_stats:

- Move num_queued_context_requests increment AFTER the input-token
  length check so both counters move in lockstep. Downstream consumers
  that compute avg = num_queued_ctx_tokens / num_queued_context_requests
  can no longer see a deflated ratio if a future API change makes
  input_token_ids unreadable.
- Narrow bare 'except Exception' to (AttributeError, TypeError); silences
  Ruff S110 / BLE001 and stops masking unexpected errors.
- Drop defensive getattr() on item.is_normal_request; item is always a
  RequestQueueItem (queue type is queue.Queue[RequestQueueItem]) so the
  fallback was dead code.
- Emit logger.warning on the skip branch so future API drift surfaces
  instead of being silently absorbed.

No behavior change on the current API: the C++ ExecutorRequest
constructor requires non-empty input_token_ids, so the except branch
is unreachable today.

Signed-off-by: Yuewei Na <nv-yna@users.noreply.github.com>
…ed InflightBatchingStats fields

Addresses the disagg-decode routing concern raised on PR 13199: the
queued-context loop in PyExecutor._update_iter_stats currently lumps
REQUEST_TYPE_GENERATION_ONLY queued requests (disagg-decode engine,
awaiting KV transfer from a prefill engine) into
numQueuedContextRequests / numQueuedCtxTokens. Autoscaling consumers
read that as 'prefill backlog deep' on a decode engine, when the real
signal is 'decode work pending KV transfer' — wrong scaling action.

Extend InflightBatchingStats with two new members:
  numQueuedGenRequests  — count of gen-only queued items
  numQueuedGenKvTokens  — prompt-token sum across those items
                          (the KV budget they will occupy once
                          their KV transfer completes)

Populate block now routes by item.request.request_type:
  - CONTEXT_AND_GENERATION (non-disagg default) and CONTEXT_ONLY
    (disagg-prefill side) -> queued-context counters.
  - GENERATION_ONLY (disagg-decode side) -> queued-gen counters.

Downstream consumers that build Dynamo's ForwardPassMetrics can now
compute a correct queued_requests.num_decode_requests as
numPausedRequests + numQueuedGenRequests (same for the KV-token sum).

Wired through the full binding surface: types.h (struct members),
serialization.cpp (binary ser/deser/size), jsonSerialization.cpp
(NLOHMANN macro), nanobind/executor/bindings.cpp (Python properties).
The __init__.pyi stub is generated from the bindings and is
regenerated by the build.

Tests:
  * test_iter_stats_populate.py::test_queued_routes_by_request_type
    covers all three enum values with distinct token counts, asserts
    correct routing across both counter pairs.
  * test_empty_iteration and the JSON round-trip test extended to
    include the two new fields.
  * test_llm.py::validate_stats's 'new_aggregate_keys' tuple extended
    to assert presence of the two new keys under the full
    get_stats -> get_stats_async RPC pipeline.
Signed-off-by: Yuewei Na <nv-yna@users.noreply.github.com>
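The routing described above can be sketched as follows (enum and counter names follow the commit message; the enum values and scaffolding are illustrative, not the real TRT-LLM definitions):

```python
import types
from enum import Enum

class RequestType(Enum):
    # Illustrative stand-in for the executor request-type enum.
    REQUEST_TYPE_CONTEXT_AND_GENERATION = 0
    REQUEST_TYPE_CONTEXT_ONLY = 1
    REQUEST_TYPE_GENERATION_ONLY = 2

def route_queued(items):
    """Sketch: split queued items into context-side and gen-side counters."""
    ctx_reqs = ctx_toks = gen_reqs = gen_toks = 0
    for item in items:
        n = len(item.request.input_token_ids)
        if item.request.request_type is RequestType.REQUEST_TYPE_GENERATION_ONLY:
            gen_reqs += 1   # disagg-decode side: awaiting KV transfer
            gen_toks += n   # KV budget it will occupy once transfer completes
        else:
            ctx_reqs += 1   # prefill-side backlog
            ctx_toks += n
    return ctx_reqs, ctx_toks, gen_reqs, gen_toks

def _item(rtype, n_tokens):
    return types.SimpleNamespace(request=types.SimpleNamespace(
        request_type=rtype, input_token_ids=[0] * n_tokens))

counts = route_queued([
    _item(RequestType.REQUEST_TYPE_CONTEXT_AND_GENERATION, 3),
    _item(RequestType.REQUEST_TYPE_CONTEXT_ONLY, 5),
    _item(RequestType.REQUEST_TYPE_GENERATION_ONLY, 7),
])
```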
@nv-yna
Collaborator Author

nv-yna commented Apr 24, 2026

/bot run --disable-fail-fast

@tensorrt-cicd
Collaborator

PR_Github #45368 [ run ] triggered by Bot. Commit: 4995f8f Link to invocation

@nv-yna nv-yna requested review from brb-nv and tedzhouhk April 24, 2026 15:32
Collaborator

@brb-nv brb-nv left a comment


LGTM.

@nv-yna nv-yna requested review from tedzhouhk and removed request for Superjomn, lfr-0531 and tedzhouhk April 24, 2026 17:24
@nv-yna
Collaborator Author

nv-yna commented Apr 24, 2026

/bot run --disable-fail-fast

@tensorrt-cicd
Collaborator

PR_Github #45443 [ run ] triggered by Bot. Commit: 4995f8f Link to invocation

@nv-yna
Collaborator Author

nv-yna commented Apr 24, 2026

/bot run --disable-fail-fast

@tensorrt-cicd
Collaborator

PR_Github #45459 [ run ] triggered by Bot. Commit: 4995f8f Link to invocation

@tensorrt-cicd
Collaborator

PR_Github #45459 [ run ] completed with state SUCCESS. Commit: 4995f8f
/LLM/main/L0_MergeRequest_PR pipeline #35690 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

Link to invocation

@nv-yna
Collaborator Author

nv-yna commented Apr 26, 2026

/bot run --disable-fail-fast

@tensorrt-cicd
Collaborator

PR_Github #45532 [ run ] triggered by Bot. Commit: 4995f8f Link to invocation

@tensorrt-cicd
Collaborator

PR_Github #45532 [ run ] completed with state SUCCESS. Commit: 4995f8f
/LLM/main/L0_MergeRequest_PR pipeline #35753 completed with status: 'SUCCESS'

CI Report

Link to invocation

@pcastonguay pcastonguay merged commit 734a146 into NVIDIA:main Apr 27, 2026
13 of 15 checks passed
chienchunhung added a commit to chienchunhung/TensorRT-LLM that referenced this pull request Apr 29, 2026
Refreshed all 16 files under docs/overview/ to reflect TRT-LLM v1.3.0rc14
(upstream/main 3b7af1c, ~3 weeks of changes since the 2026-04-06
baseline 2b80f8d), competing frameworks (vLLM v0.20.0, SGLang
v0.5.10.post1, LMCache v0.4.4, NVIDIA Dynamo v1.0.0), and the current
hardware (Vera Rubin in production, AMD MI355X MLPerf 6.0, Google TPU v7
GA) and academic landscape (GOOSE, StreamServe, PrfaaS, FlowKV).

Highlights:
- Spec-dec algorithm count corrected from 7 to 8 (DFlash added; EAGLE3 dynamic
  tree re-enabled; LoRA + spec-dec generic and EAGLE3-specific now work).
- Block reuse + overlap scheduler combined (NVIDIA#12816), removing a
  long-standing internal gap.
- First-class lmcache and kvbm KV connectors (NVIDIA#12626).
- Production observability stack (Prometheus NVIDIA#12545, modular logger
  NVIDIA#13202, NvTelemetry NVIDIA#12384, per-iteration aggregate counters NVIDIA#13199).
- Disagg reliability fail-fast wave (NVIDIA#13119, NVIDIA#13408, NVIDIA#12718, NVIDIA#12888).
- AutoDeploy onboarded DeepSeek-R1, Gemma-4 + 4-31B NVFP4, MiniMax-M2.7;
  standalone-package-ready; legacy EdgeLLM ONNX export removed.
- KV cache V2 still default OFF (multiple V2 fixes; tracked via new
  V2-default-on milestone in §06).
- New fourth framework-comparison column for Dynamo v1.0; updated feature
  matrix and gap analysis to reflect vLLM v0.20 (FA4 MLA prefill default,
  TurboQuant 2-bit KV, vLLM IR foundation, Model Runner V2).
- Strategic prioritization quadrant chart fully re-ranked; new Tier 1
  items: TTFT re-benchmark vs vLLM v0.20, low-bit KV, MLA prefill kernel
  default, disagg chaos-test harness; new Tier 2: Dynamo Snapshot
  integration, TRT-LLM IR strategy, adaptive spec-dec depth.

Snapshot of pre-refresh content saved at docs/overview/.snapshots/2026-04-06/.
Per-file diff highlights, priority-shift table, sources, and
blocked/skipped notes are in docs/overview/CHANGELOG.md.

Signed-off-by: Chien-Chun Hung <2679986+chienchunhung@users.noreply.github.com>
nv-yna added a commit to nv-yna/dynamo that referenced this pull request Apr 30, 2026
NVIDIA/TensorRT-LLM#13199 merged with a different shape than the draft
this PR was originally written against: the new IterationStats fields
live nested inside `inflightBatchingStats` and use names that don't
match the draft's flat top-level keys.

Field-extraction changes (publisher.py):
  - _check_fpm_schema reads the nested `inflightBatchingStats` dict and
    fails closed if it is missing or any of the 11 required IBS fields
    is absent. Strict probe protects against silent planner poison when
    running against a TRT-LLM that predates #13199.
  - handle_stat reads the 11 IBS fields and composes Dynamo-side fields:
      scheduled_num_prefill_requests = numContextRequests
      scheduled_sum_prefill_tokens   = numCtxTokens (compute this iter;
                                       excludes prefix-cache carryover
                                       counted in numCtxKvTokens, to
                                       match vLLM's TTFT-prediction
                                       semantics)
      scheduled_sum_prefill_kv_tokens = numCtxKvTokens
      scheduled_num_decode_requests   = numGenRequests
      scheduled_sum_decode_kv_tokens  = numGenKvTokens
      queued_num_prefill_requests     = numQueuedContextRequests
      queued_sum_prefill_tokens       = numQueuedCtxTokens
      queued_num_decode_requests      = numPausedRequests +
                                        numQueuedGenRequests
      queued_sum_decode_kv_tokens     = numPausedKvTokens +
                                        numQueuedGenKvTokens
    Paused-decode requests are intentionally double-counted with
    numScheduledRequests upstream because the planner reads queued_decode
    as a KV-pressure preemption signal, where paused-decodes carry the
    same semantic weight as queued-gen-only requests.
  - FpmDirectPublisher.publish() callsite switched to keyword-only args
    so adjacent ints with similar units (`scheduled_*` vs `queued_*`,
    `*_prefill_*` vs `*_decode_*`) cannot be silently transposed.
  - Init-time and cleanup-time `except Exception` narrowed to
    `except RuntimeError` (PyO3 surfaces FpmDirectPublisher new/shutdown
    failures only as PyRuntimeError); hot-path catch in handle_stat
    stays broad on purpose, with a comment explaining why.

Rust binding changes (lib/bindings/python/rust/llm/fpm.rs):
  - FpmDirectPublisher struct field `_inner` renamed to `inner`. The
    underscore prefix conventionally signals "intentionally unused" to
    Rust readers, but the field is load-bearing for Drop-driven shutdown
    of all per-rank serialization tasks. A future cleanup pass would
    delete it as dead and leak every spawned task. Documented in a
    comment.
  - publish() signature switched to keyword-only via #[pyo3(signature =
    (*, ...))]; matching update in _core.pyi stub.

Test changes (test_trtllm_fpm_publisher.py):
  - Module-level pytestmark = [unit, trtllm, pre_merge, gpu_0] picked up
    by CI marker filters. Without these, the tests were silently skipped
    by the strict-marker config.
  - _build_fake_stat constructs the nested IBS structure matching the
    real JSON; _invoke_handler mirrors handle_stat's nested extraction
    and kwargs publish() exactly. A guardrail test
    (test_invoke_handler_matches_publisher_keyword_set) catches drift.
  - Schema-probe parametrize iterates all 11 IBS fields plus the
    ibs-not-a-dict and missing-ibs-dict edge cases.
  - New tests for the queued-decode composite under disagg-decode
    (only-queued-gen) and aggregated (only-paused) engine shapes.
  - _build_publisher_stub uses the monkeypatch fixture for hermetic
    state — no module-level rebinds that leak across tests. The blanket
    try/except in _run_initialize is removed; WorkerMetricsPublisher and
    KvEventPublisher are stubbed via monkeypatch so initialize() reaches
    the FPM gate cleanly.

Verification (single-GPU local, non-attention-DP):
  - Unit: 28/28 pass.
  - Schema probe against real Qwen3-0.6B inside L0_PostMerge build 2697
    (commit e903428, first build containing #13199): all 11 required IBS
    fields present at the nested path, attentionDpRank tagged at the
    top level.
  - Sustained load (concurrency 16 vs max_batch_size 4, 45 s): 520 stats
    sampled, max scheduled_num_decode_requests=4 (saturated batch), max
    queued_num_prefill_requests=3 (queueing observed), 512/520 iters had
    non-zero decode signal.

Known limitation: TensorRT (legacy) backend leaves IBS at zero-default,
so FPM emission against a TRT-engine deployment will be all-zero. The
schema probe accepts keys-exist; a value-aware probe is left as
follow-up.

Signed-off-by: Yuewei Na <nv-yna@users.noreply.github.com>
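The field composition above can be sketched as a pure mapping (the function name and dict-returning shape are hypothetical; the real arithmetic lives inline in handle_stat, and the IBS field names come from #13199):

```python
def compose_fpm_fields(stat):
    """Sketch: map the nested inflightBatchingStats dict onto the
    Dynamo-side fields listed in the commit message above."""
    ibs = stat["inflightBatchingStats"]
    return {
        "scheduled_num_prefill_requests": ibs["numContextRequests"],
        # numCtxTokens = compute this iteration; prefix-cache carryover
        # is reported separately via numCtxKvTokens.
        "scheduled_sum_prefill_tokens": ibs["numCtxTokens"],
        "scheduled_sum_prefill_kv_tokens": ibs["numCtxKvTokens"],
        "scheduled_num_decode_requests": ibs["numGenRequests"],
        "scheduled_sum_decode_kv_tokens": ibs["numGenKvTokens"],
        "queued_num_prefill_requests": ibs["numQueuedContextRequests"],
        "queued_sum_prefill_tokens": ibs["numQueuedCtxTokens"],
        # Paused decodes carry the same weight as queued gen-only requests.
        "queued_num_decode_requests":
            ibs["numPausedRequests"] + ibs["numQueuedGenRequests"],
        "queued_sum_decode_kv_tokens":
            ibs["numPausedKvTokens"] + ibs["numQueuedGenKvTokens"],
    }

sample = {"inflightBatchingStats": {
    "numContextRequests": 2, "numCtxTokens": 128, "numCtxKvTokens": 64,
    "numGenRequests": 4, "numGenKvTokens": 512,
    "numQueuedContextRequests": 1, "numQueuedCtxTokens": 32,
    "numPausedRequests": 1, "numQueuedGenRequests": 2,
    "numPausedKvTokens": 100, "numQueuedGenKvTokens": 200,
}}
fields = compose_fpm_fields(sample)
```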
nv-yna added a commit to nv-yna/dynamo that referenced this pull request Apr 30, 2026
… fields

Closes the "old TRT-LLM" planner-poison gap left by ai-dynamo#8356.

Problem: handle_stat's .publish() call uses stat.get(field, 0) for every
required IterationStats field. On a TRT-LLM version that predates
NVIDIA/TensorRT-LLM#13199, every field defaults to 0 and the emitted
ForwardPassSnapshot is byte-identical to the Rust-side idle heartbeat.
The Planner reads that as "worker is idle" and can scale down a worker
under real load. Same class of bug as the attention-DP rank 1..N-1 case
that ai-dynamo#8356 explicitly gates, but ungated for the old-TRT-LLM path.

Fix: one-shot schema probe on the first IterationStats delivered to
handle_stat. Strict — all 9 fields added by #13199 must be present, else
the probe shuts down the publisher and sets self.fpm_publisher = None.
The existing `if self.fpm_publisher is not None:` guard then short-
circuits all further FPM emission, matching the attention-DP gate's
"zero messages" contract.

Scope:
- components/src/dynamo/trtllm/publisher.py:
  * Add _FPM_REQUIRED_STAT_FIELDS (9 fields).
  * Add self._fpm_schema_checked: bool = False in Publisher.__init__.
  * Add Publisher._check_fpm_schema(stat) method.
  * Wire the probe into handle_stat before the existing FPM publish branch.
- components/src/dynamo/trtllm/tests/test_trtllm_fpm_publisher.py:
  * 7 new tests (15 cases via parametrize). Cover: all-fields-present
    keeps publisher; each of the 9 fields missing individually disables
    publisher; missing-all-fields (legacy TRT-LLM) disables; noop when
    fpm_publisher is already None; shutdown exception still disables;
    probe runs only once; required-fields-tuple-matches-publish guardrail.

No Rust changes. No API surface changes.

Signed-off-by: Yuewei Na <nv-yna@users.noreply.github.com>
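A minimal model of the fail-closed probe (the method and flag names follow the commit message; the field tuple here is an illustrative subset, not the full 9-field list, and the class scaffolding is hypothetical):

```python
class Publisher:
    """Sketch of the one-shot, fail-closed FPM schema probe."""

    # Illustrative subset; the real tuple lists every field added by #13199.
    _FPM_REQUIRED_STAT_FIELDS = (
        "numCtxKvTokens",
        "numGenKvTokens",
        "numQueuedContextRequests",
        "numQueuedCtxTokens",
    )

    def __init__(self, fpm_publisher):
        self.fpm_publisher = fpm_publisher
        self._fpm_schema_checked = False

    def _check_fpm_schema(self, stat):
        # Runs once, on the first IterationStats delivered.
        if self._fpm_schema_checked or self.fpm_publisher is None:
            return
        self._fpm_schema_checked = True
        if any(f not in stat for f in self._FPM_REQUIRED_STAT_FIELDS):
            # Pre-#13199 TRT-LLM: stat.get(field, 0) everywhere would emit
            # an idle-looking snapshot, so fail closed and stop emission.
            self.fpm_publisher = None

legacy = Publisher(fpm_publisher=object())
legacy._check_fpm_schema({"numCtxTokens": 10})  # old stats: new fields absent
current = Publisher(fpm_publisher=object())
current._check_fpm_schema({f: 0 for f in Publisher._FPM_REQUIRED_STAT_FIELDS})
```

After the probe disables the publisher, the existing `if self.fpm_publisher is not None:` guard short-circuits all further FPM emission.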
