[None][feat] add batch-full benchmark throughput metric by zhaoyangwang-nvidia · Pull Request #13638 · NVIDIA/TensorRT-LLM

zhaoyangwang-nvidia · 2026-04-30T03:28:31Z

Summary by CodeRabbit

New Features
- Added a batch-full output throughput metric that reports output tokens per second over the periods when the batch is at full capacity. It estimates steady-state throughput under full load and is now included in the performance statistics and the report overview.

Motivation

In speculative decoding benchmarks, we want generated outputs to remain valid user-facing content. For that reason, these tests should not ignore EOS, and the input prompts should use the chat template when appropriate.

That makes the benchmark closer to real serving behavior, but it also means output lengths are naturally uneven. With 80 test samples and larger batch sizes such as 32 or 64, a single long-tail request can keep running after most other requests have finished. This final drain phase can noticeably reduce end-to-end output throughput, even when the steady-state full-batch throughput is healthy.

For example, in one Llama 8B rejection-sampling run:

bs32/linear_rejection_on
- End-to-end output throughput: 637.2 tok/s
- Batch-full output throughput: 2445.8 tok/s
- One request reached the 2000 token cap without EOS, which heavily dragged down the end-to-end metric.

bs32/dynamic_tree_rejection_on
- End-to-end output throughput: 1473.6 tok/s
- Batch-full output throughput: 2502.5 tok/s
- Two requests reached the 2000 token cap, creating a visible tail/drain effect.

The existing output throughput metric is still useful because it reflects total wall-clock behavior, including tail latency. However, it can obscure the steady-state throughput when the batch is full.

Changes

This PR adds a new benchmark metric:

batch_full_output_throughput_tok_s

The metric estimates output token throughput only over intervals where the number of active requests is at least the configured batch size. Since request records do not include per-token timestamps, output tokens are prorated uniformly over each request's start/end interval.

This gives us a complementary steady-state throughput view while preserving the existing end-to-end throughput metric.
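As a rough illustration of the idea, here is a minimal sketch of the calculation described above. It is not the code in this PR: the `Span` tuple layout, the `batch_full_throughput` name, and the nanosecond timestamps are assumptions made for the example. It first finds the intervals where at least `batch_size` requests are active, then prorates each request's output tokens uniformly over its own span and counts only the share that overlaps those intervals.

```python
from typing import List, Optional, Tuple

# Hypothetical record layout: (start_ns, end_ns, output_tokens).
Span = Tuple[int, int, int]


def batch_full_throughput(spans: List[Span], batch_size: int) -> Optional[float]:
    """Estimate output tok/s over intervals with >= batch_size active requests."""
    spans = [s for s in spans if s[1] > s[0]]  # skip zero-duration requests

    # Sweep over start (+1) and end (-1) events to find batch-full intervals.
    # Tuples sort so that, at equal timestamps, ends come before starts.
    events = sorted([(s[0], 1) for s in spans] + [(s[1], -1) for s in spans])
    full_intervals: List[Tuple[int, int]] = []
    active, full_start = 0, None
    for t, delta in events:
        active += delta
        if active >= batch_size and full_start is None:
            full_start = t
        elif active < batch_size and full_start is not None:
            if t > full_start:
                full_intervals.append((full_start, t))
            full_start = None

    full_ns = sum(hi - lo for lo, hi in full_intervals)
    if full_ns == 0:
        return None  # the batch was never full

    # Prorate each request's tokens uniformly over its own span and keep
    # only the share that overlaps a batch-full interval.
    tokens = 0.0
    for start, end, n_tok in spans:
        rate = n_tok / (end - start)  # tokens per nanosecond
        for lo, hi in full_intervals:
            overlap = min(end, hi) - max(start, lo)
            if overlap > 0:
                tokens += rate * overlap
    return tokens / full_ns * 1e9  # tokens per second


NS = 1_000_000_000
# Two requests, batch size 2: one finishes after 1 s, the other runs for 5 s.
spans = [(0, NS, 100), (0, 5 * NS, 500)]
end_to_end = sum(n for _, _, n in spans) / 5  # 600 tokens over 5 s of wall clock
batch_full = batch_full_throughput(spans, batch_size=2)
print(f"end-to-end: {end_to_end:.1f} tok/s, batch-full: {batch_full:.1f} tok/s")
```

In this toy case the batch is full only during the first second, where both requests contribute 100 tokens each, so the batch-full estimate is 200 tok/s while the 4-second drain pulls the end-to-end figure down to 120 tok/s.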

PR Checklist

Please review the following before submitting your PR:

PR description clearly explains what and why. If using CodeRabbit's summary, please make sure it makes sense.
PR Follows TRT-LLM CODING GUIDELINES to the best of your knowledge.
Test cases are provided for new code paths (see test instructions)
Any new dependencies have been scanned for license and vulnerabilities
CODEOWNERS updated if ownership changes
Documentation updated as needed
Update tava architecture diagram if there is a significant design change in PR.
The reviewers assigned automatically/manually are appropriate for the PR.
Please check this after reviewing the above items as appropriate for this PR.

GitHub Bot Help

To see a list of available CI bot commands, please comment /bot help.

zhaoyangwang-nvidia · 2026-04-30T03:34:52Z

Hi @NVIDIA/trtllm-bench-reviewers, could you help review this PR? It only adds a small benchmark statistics feature: reporting batch_full_output_throughput_tok_s in addition to the existing end-to-end output throughput metric. This helps distinguish steady-state full-batch throughput from cases where a few long-tail outputs drag down the overall wall-clock throughput.

coderabbitai · 2026-04-30T03:35:41Z

📝 Walkthrough

A new batch-full output throughput metric is added to the benchmarking statistics. The metric measures output tokens per second during periods when active request counts meet or exceed the specified batch size. It is computed from request records, stored in the BenchmarkStatistics model, and exposed through a ReportUtility property.

Changes

| Cohort / File(s) | Summary |
|---|---|
| Statistics Model `tensorrt_llm/bench/dataclasses/statistics.py` | Added optional `batch_full_output_throughput_tok_ns` field to `BenchmarkStatistics` Pydantic model to store the new throughput metric. |
| Reporting and Computation Logic `tensorrt_llm/bench/dataclasses/reporting.py` | Implemented static method `_compute_batch_full_output_throughput()` to calculate batch-full output throughput by prorating tokens across request spans and summing overlaps with batch-full intervals. Integrated metric into `generate_statistics_summary()` and exposed via new `ReportUtility.batch_full_output_throughput_tok_s` property for serialization and reporting. |

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

🚥 Pre-merge checks | ✅ 5

✅ Passed checks (5 passed)

| Check name | Status | Explanation |
|---|---|---|
| Title check | ✅ Passed | The title clearly describes the main change: adding a new batch-full benchmark throughput metric. It follows the required template format with ticket reference and type annotation. |
| Description check | ✅ Passed | The PR description includes a detailed Motivation section explaining the problem, a clear Changes section describing the new metric, and verification of the PR Checklist items. All major template sections are adequately filled. |
| Docstring Coverage | ✅ Passed | Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |

coderabbitai

Actionable comments posted: 1

🧹 Nitpick comments (1)

tensorrt_llm/bench/dataclasses/reporting.py (1)
88-143: ⚡ Quick win

Add focused tests for the sweep-line boundary cases.

This logic depends on subtle timestamp semantics: identical start/end boundaries, zero-duration requests being skipped, and the None path when no full-batch interval exists. A few table-driven tests here would make future refactors much safer.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tensorrt_llm/bench/dataclasses/reporting.py` around lines 88 - 143, Add
focused, table-driven unit tests for _compute_batch_full_output_throughput
(involving RequestRecord instances) that cover the sweep-line boundary cases:
(1) requests with identical start and end timestamps (zero-duration) are
skipped, (2) intervals where active_requests equals batch_size at boundaries
(start/end coincident) are handled correctly, (3) the function returns None when
no batch-full interval exists, and (4) token prorating over partial overlaps
yields the expected fractional token counts; create small test rows exercising
single-request, overlapping requests, exact-boundary transitions, and
multi-request partial-overlap scenarios to assert numeric throughput or None as
appropriate.

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@tensorrt_llm/bench/dataclasses/reporting.py`:
- Around line 131-143: Replace the nested loops that compute
batch_full_output_tokens by building a single sorted event sweep: create events
for each request span boundary (at request_start add delta = output_tokens /
request_duration_ns, at request_end subtract same delta) and events for each
batch_full_interval boundary (at interval_start mark +1, at interval_end mark
-1), sort all events by time, then iterate once maintaining current_active_rate
and an inside_batch_full counter; for each consecutive event segment, if
inside_batch_full > 0 add current_active_rate * segment_duration to
batch_full_output_tokens, and at the end divide by batch_full_duration_ns as
before. Update the code around request_spans, batch_full_intervals and
batch_full_output_tokens to use this single-sweep approach.

---

Nitpick comments:
In `@tensorrt_llm/bench/dataclasses/reporting.py`:
- Around line 88-143: Add focused, table-driven unit tests for
_compute_batch_full_output_throughput (involving RequestRecord instances) that
cover the sweep-line boundary cases: (1) requests with identical start and end
timestamps (zero-duration) are skipped, (2) intervals where active_requests
equals batch_size at boundaries (start/end coincident) are handled correctly,
(3) the function returns None when no batch-full interval exists, and (4) token
prorating over partial overlaps yields the expected fractional token counts;
create small test rows exercising single-request, overlapping requests,
exact-boundary transitions, and multi-request partial-overlap scenarios to
assert numeric throughput or None as appropriate.
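For readers following the review, the single-sweep approach from the inline comment can be sketched roughly as below, together with a small table of the boundary cases the nitpick asks for. This is an illustration under assumptions, not the code that was merged: the `(start_ns, end_ns, output_tokens)` tuple layout and the function name are hypothetical, and the sketch tracks the active-request count directly instead of emitting separate events for batch-full interval boundaries.

```python
from typing import List, Optional, Tuple

# Hypothetical record layout: (start_ns, end_ns, output_tokens).
Span = Tuple[int, int, int]


def batch_full_throughput_single_sweep(
    spans: List[Span], batch_size: int
) -> Optional[float]:
    """One pass over sorted events; returns tok/s or None if never batch-full."""
    events = []  # (time_ns, change in active count, change in token rate)
    for start, end, n_tok in spans:
        if end <= start:
            continue  # zero-duration requests are skipped
        rate = n_tok / (end - start)  # tokens per ns, uniform proration
        events.append((start, 1, rate))
        events.append((end, -1, -rate))
    events.sort(key=lambda ev: ev[0])

    active, rate, prev_t = 0, 0.0, None
    tokens, full_ns = 0.0, 0
    for t, d_active, d_rate in events:
        # Account for the segment [prev_t, t) using the state before this event.
        if prev_t is not None and t > prev_t and active >= batch_size:
            tokens += rate * (t - prev_t)
            full_ns += t - prev_t
        active += d_active
        rate += d_rate
        prev_t = t
    return tokens / full_ns * 1e9 if full_ns else None


NS = 1_000_000_000
# Table-driven boundary cases: (name, spans, batch_size, expected tok/s or None).
CASES = [
    ("zero-duration skipped", [(0, 0, 50), (0, NS, 100), (0, NS, 100)], 2, 200.0),
    ("never full", [(0, NS, 100)], 2, None),
    ("coincident end/start", [(0, NS, 100), (NS, 2 * NS, 100)], 2, None),
    ("partial overlap", [(0, 2 * NS, 200), (NS, 3 * NS, 200)], 2, 200.0),
]

for name, case_spans, bs, expected in CASES:
    got = batch_full_throughput_single_sweep(case_spans, bs)
    if expected is None:
        assert got is None, name
    else:
        assert got is not None and abs(got - expected) < 1e-6, name
```

Because each segment between consecutive event times is charged using the state before the event, the order of events that share a timestamp does not matter: the zero-length segment between them contributes nothing.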

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: ed6b4eaa-c6e9-417b-bdcf-5a3748bcb699

📥 Commits

Reviewing files that changed from the base of the PR and between 37fc0e3 and 0ae5d69.

📒 Files selected for processing (2)

tensorrt_llm/bench/dataclasses/reporting.py
tensorrt_llm/bench/dataclasses/statistics.py

Signed-off-by: ZhaoyangWang <zhaoyangw@nvidia.com>

zhaoyangwang-nvidia · 2026-05-13T02:01:59Z

/bot run

tensorrt-cicd · 2026-05-13T02:08:23Z

PR_Github #48065 [ run ] triggered by Bot. Commit: 88e810f Link to invocation

tensorrt-cicd · 2026-05-13T04:18:12Z

PR_Github #48065 [ run ] completed with state SUCCESS. Commit: 88e810f
/LLM/main/L0_MergeRequest_PR pipeline #37897 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

Please check the failed tests and fix your PR
If you cannot view the failures, ask the CI triggerer to share details
Once fixed, request an NVIDIA team member to trigger CI again

CI Agent Failure Analysis

Link to invocation

zhaoyangwang-nvidia · 2026-05-13T05:33:54Z

/bot run

tensorrt-cicd · 2026-05-13T05:39:11Z

PR_Github #48107 [ run ] triggered by Bot. Commit: 88e810f Link to invocation

tensorrt-cicd · 2026-05-13T07:17:59Z

PR_Github #48107 [ run ] completed with state SUCCESS. Commit: 88e810f
/LLM/main/L0_MergeRequest_PR pipeline #37935 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

Please check the failed tests and fix your PR
If you cannot view the failures, ask the CI triggerer to share details
Once fixed, request an NVIDIA team member to trigger CI again

CI Agent Failure Analysis

Link to invocation

zhaoyangwang-nvidia · 2026-05-13T07:37:59Z

/bot run

tensorrt-cicd · 2026-05-13T07:43:38Z

PR_Github #48136 [ run ] triggered by Bot. Commit: 88e810f Link to invocation

tensorrt-cicd · 2026-05-13T11:14:09Z

PR_Github #48136 [ run ] completed with state SUCCESS. Commit: 88e810f
/LLM/main/L0_MergeRequest_PR pipeline #37961 completed with status: 'SUCCESS'

CI Report

Link to invocation

github-actions Bot assigned zhaoyangwang-nvidia Apr 30, 2026

zhaoyangwang-nvidia marked this pull request as ready for review April 30, 2026 03:29

zhaoyangwang-nvidia requested a review from a team as a code owner April 30, 2026 03:29

zhaoyangwang-nvidia requested a review from FrankD412 April 30, 2026 03:29

coderabbitai Bot reviewed Apr 30, 2026

Comment thread tensorrt_llm/bench/dataclasses/reporting.py Outdated

qiaoxj07 approved these changes May 12, 2026

zhaoyangwang-nvidia added 2 commits May 13, 2026 10:01

[None][feat] add batch-full benchmark throughput

207e95c

Signed-off-by: ZhaoyangWang <zhaoyangw@nvidia.com>

[None][fix] optimize batch-full throughput calculation

88e810f

Signed-off-by: ZhaoyangWang <zhaoyangw@nvidia.com>

zhaoyangwang-nvidia force-pushed the add-full-batch-result branch from 6f8cf1f to 88e810f Compare May 13, 2026 02:01

sunnyqgg merged commit 6ccb07a into NVIDIA:main May 14, 2026
6 checks passed

4 participants