[None][feat] add batch-full benchmark throughput metric #13638

Merged: sunnyqgg merged 2 commits into NVIDIA:main from zhaoyangwang-nvidia:add-full-batch-result on May 14, 2026
Conversation

@zhaoyangwang-nvidia zhaoyangwang-nvidia commented Apr 30, 2026

Summary by CodeRabbit

  • New Features
    • Added a new batch-full output throughput metric that measures output tokens per second while the batch is operating at full capacity. The metric estimates performance under peak load and gives visibility into the maximum achievable throughput. It is now included in the performance statistics and the reporting overview.

Motivation

In speculative decoding benchmarks, we want generated outputs to remain valid user-facing content. For that reason, these tests should not ignore EOS, and the input prompts should use the chat template when appropriate.

That makes the benchmark closer to real serving behavior, but it also means output lengths are naturally uneven. With 80 test samples and larger batch sizes such as 32 or 64, a single long-tail request can keep running after most other requests have finished. This final drain phase can noticeably reduce end-to-end output throughput, even when the steady-state full-batch throughput is healthy.

For example, in one Llama 8B rejection-sampling run:

  • bs32/linear_rejection_on

    • End-to-end output throughput: 637.2 tok/s
    • Batch-full output throughput: 2445.8 tok/s
    • One request reached the 2000 token cap without EOS, which heavily dragged down the end-to-end metric.
  • bs32/dynamic_tree_rejection_on

    • End-to-end output throughput: 1473.6 tok/s
    • Batch-full output throughput: 2502.5 tok/s
    • Two requests reached the 2000 token cap, creating a visible tail/drain effect.

The existing output throughput metric is still useful because it reflects total wall-clock behavior, including tail latency. However, it can obscure the steady-state throughput when the batch is full.
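
To make the drain effect concrete, here is a minimal sketch with made-up numbers (not taken from the runs above): 31 requests finish quickly and one long-tail request keeps the run alive. The tuple layout `(start_s, end_s, output_tokens)` and the helper name are assumptions for illustration only.

```python
# Hypothetical request records: (start_s, end_s, output_tokens).
# 31 short requests plus one long-tail request that runs ten times longer.
requests = [(0.0, 3.0, 200)] * 31 + [(0.0, 30.0, 2000)]


def end_to_end_throughput(records):
    """Total output tokens divided by total wall-clock time."""
    total_tokens = sum(tokens for _, _, tokens in records)
    first_start = min(start for start, _, _ in records)
    last_end = max(end for _, end, _ in records)
    return total_tokens / (last_end - first_start)


e2e = end_to_end_throughput(requests)

# All 32 requests are active only during the first 3 seconds.
# Assuming tokens are spread uniformly over each request's lifetime,
# the long-tail request contributes 2000 * (3 / 30) tokens in that window.
full_window_s = 3.0
full_window_tokens = 31 * 200 + 2000 * (full_window_s / 30.0)
batch_full = full_window_tokens / full_window_s

print(f"end-to-end: {e2e:.1f} tok/s, batch-full: {batch_full:.1f} tok/s")
```

In this toy setup the single long request stretches the wall-clock denominator by 10x, so the end-to-end figure is dominated by the drain phase while the full-batch window tells a different story.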

Changes

This PR adds a new benchmark metric:

  • batch_full_output_throughput_tok_s

The metric estimates output token throughput only over intervals where the number of active requests is at least the configured batch size. Since request records do not include per-token timestamps, output tokens are prorated uniformly over each request's start/end interval.

This gives us a complementary steady-state throughput view while preserving the existing end-to-end throughput metric.
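
For readers who want to see the idea end to end, the following is a rough sketch of the approach described above, not the code merged in this PR. The `(start, end, output_tokens)` tuple shape, the function names, and the choice to process request ends before starts at identical timestamps are all assumptions made for illustration.

```python
from typing import List, Optional, Tuple

# Hypothetical record shape: (start, end, output_tokens), times in any one unit.
Span = Tuple[float, float, int]


def batch_full_intervals(spans: List[Span], batch_size: int) -> List[Tuple[float, float]]:
    """Sweep over start/end events and collect intervals with >= batch_size active requests."""
    events = []
    for start, end, _ in spans:
        if end <= start:
            continue  # zero-duration requests carry no usable interval
        events.append((start, 1))
        events.append((end, -1))
    events.sort()  # at equal timestamps, -1 (end) sorts before +1 (start)

    intervals = []
    active = 0
    full_since = None
    for time, delta in events:
        active += delta
        if active >= batch_size and full_since is None:
            full_since = time
        elif active < batch_size and full_since is not None:
            if time > full_since:
                intervals.append((full_since, time))
            full_since = None
    return intervals


def batch_full_output_throughput(spans: List[Span], batch_size: int) -> Optional[float]:
    """Prorate each request's tokens uniformly and sum the share inside batch-full intervals."""
    intervals = batch_full_intervals(spans, batch_size)
    full_duration = sum(end - start for start, end in intervals)
    if full_duration <= 0:
        return None  # the batch was never full

    tokens = 0.0
    for start, end, output_tokens in spans:
        if end <= start:
            continue
        rate = output_tokens / (end - start)  # uniform tokens per time unit
        for full_start, full_end in intervals:
            overlap = min(end, full_end) - max(start, full_start)
            if overlap > 0:
                tokens += rate * overlap
    return tokens / full_duration


spans = [(0, 4, 400), (0, 2, 100), (1, 3, 200)]
print(batch_full_output_throughput(spans, batch_size=2))
print(batch_full_output_throughput(spans, batch_size=3))
print(batch_full_output_throughput(spans, batch_size=4))
```

The uniform prorating is an approximation: a request that decodes faster early on and slower later will have its tokens smeared evenly, so the result is an estimate rather than an exact measurement.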

PR Checklist

Please review the following before submitting your PR:

  • PR description clearly explains what and why. If using CodeRabbit's summary, please make sure it makes sense.

  • PR Follows TRT-LLM CODING GUIDELINES to the best of your knowledge.

  • Test cases are provided for new code paths (see test instructions)

  • Any new dependencies have been scanned for license and vulnerabilities

  • CODEOWNERS updated if ownership changes

  • Documentation updated as needed

  • Update tava architecture diagram if there is a significant design change in PR.

  • The reviewers assigned automatically/manually are appropriate for the PR.

  • Please check this after reviewing the above items as appropriate for this PR.

GitHub Bot Help

To see a list of available CI bot commands, please comment /bot help.

@zhaoyangwang-nvidia zhaoyangwang-nvidia marked this pull request as ready for review April 30, 2026 03:29
@zhaoyangwang-nvidia zhaoyangwang-nvidia requested a review from a team as a code owner April 30, 2026 03:29
@zhaoyangwang-nvidia

Hi @NVIDIA/trtllm-bench-reviewers, could you help review this PR? It only adds a small benchmark statistics feature: reporting batch_full_output_throughput_tok_s in addition to the existing end-to-end output throughput metric. This helps distinguish steady-state full-batch throughput from cases where a few long-tail outputs drag down the overall wall-clock throughput.

coderabbitai Bot commented Apr 30, 2026

Walkthrough

A new batch-full output throughput metric is added to the benchmarking statistics. The metric measures output tokens per second during periods when active request counts meet or exceed the specified batch size. It is computed from request records, stored in the BenchmarkStatistics model, and exposed through a ReportUtility property.

Changes

| Cohort / File(s) | Summary |
| --- | --- |
| Statistics Model: `tensorrt_llm/bench/dataclasses/statistics.py` | Added optional batch_full_output_throughput_tok_ns field to BenchmarkStatistics Pydantic model to store the new throughput metric. |
| Reporting and Computation Logic: `tensorrt_llm/bench/dataclasses/reporting.py` | Implemented static method _compute_batch_full_output_throughput() to calculate batch-full output throughput by prorating tokens across request spans and summing overlaps with batch-full intervals. Integrated metric into generate_statistics_summary() and exposed via new ReportUtility.batch_full_output_throughput_tok_s property for serialization and reporting. |

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
| Check name | Status | Explanation |
| --- | --- | --- |
| Title check | ✅ Passed | The title clearly describes the main change: adding a new batch-full benchmark throughput metric. It follows the required template format with ticket reference and type annotation. |
| Description check | ✅ Passed | The PR description includes a detailed Motivation section explaining the problem, a clear Changes section describing the new metric, and verification of the PR Checklist items. All major template sections are adequately filled. |
| Docstring Coverage | ✅ Passed | Docstring coverage is 100.00%, which is sufficient. The required threshold is 80.00%. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |

@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 1

🧹 Nitpick comments (1)
tensorrt_llm/bench/dataclasses/reporting.py (1)

88-143: ⚡ Quick win

Add focused tests for the sweep-line boundary cases.

This logic depends on subtle timestamp semantics: identical start/end boundaries, zero-duration requests being skipped, and the None path when no full-batch interval exists. A few table-driven tests here would make future refactors much safer.
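
As a starting point for such tests, here is a hedged sketch of a table-driven layout. It does not import the real `_compute_batch_full_output_throughput`; instead it uses a brute-force stand-in (`reference_throughput`, a hypothetical helper) that walks every elementary segment between timestamps, which can double as an oracle. The tuple shape `(start, end, output_tokens)` and the expected values encode this sketch's own boundary convention (back-to-back requests do not overlap), which is an assumption and may differ from the merged code.

```python
def reference_throughput(spans, batch_size):
    """Brute-force oracle: check every elementary segment between sorted timestamps."""
    spans = [(s, e, n) for s, e, n in spans if e > s]  # skip zero-duration requests
    times = sorted({t for s, e, _ in spans for t in (s, e)})
    tokens = 0.0
    duration = 0.0
    for seg_start, seg_end in zip(times, times[1:]):
        # A request is active on a segment only if it covers the whole segment.
        active = [(s, e, n) for s, e, n in spans if s <= seg_start and e >= seg_end]
        if len(active) >= batch_size:
            length = seg_end - seg_start
            duration += length
            tokens += sum(n / (e - s) for s, e, n in active) * length
    return tokens / duration if duration > 0 else None


# Each row: (case name, spans, batch_size, expected throughput or None).
CASES = [
    ("no batch-full interval", [(0, 1, 10)], 2, None),
    ("zero-duration request skipped", [(0, 2, 20), (1, 1, 5)], 1, 10.0),
    ("back-to-back boundary", [(0, 1, 10), (1, 2, 10)], 2, None),
    ("partial overlap prorated", [(0, 4, 400), (2, 6, 400)], 2, 200.0),
]


def run_cases(compute):
    """Run every table row against the given implementation."""
    for name, spans, batch_size, expected in CASES:
        got = compute(spans, batch_size)
        if expected is None:
            assert got is None, name
        else:
            assert got is not None and abs(got - expected) < 1e-9, name


run_cases(reference_throughput)
```

Passing the implementation in as `compute` keeps the table reusable: the same rows could be pointed at the real method once its records are adapted to this shape.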

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tensorrt_llm/bench/dataclasses/reporting.py` around lines 88 - 143, Add
focused, table-driven unit tests for _compute_batch_full_output_throughput
(involving RequestRecord instances) that cover the sweep-line boundary cases:
(1) requests with identical start and end timestamps (zero-duration) are
skipped, (2) intervals where active_requests equals batch_size at boundaries
(start/end coincident) are handled correctly, (3) the function returns None when
no batch-full interval exists, and (4) token prorating over partial overlaps
yields the expected fractional token counts; create small test rows exercising
single-request, overlapping requests, exact-boundary transitions, and
multi-request partial-overlap scenarios to assert numeric throughput or None as
appropriate.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@tensorrt_llm/bench/dataclasses/reporting.py`:
- Around line 131-143: Replace the nested loops that compute
batch_full_output_tokens by building a single sorted event sweep: create events
for each request span boundary (at request_start add delta = output_tokens /
request_duration_ns, at request_end subtract same delta) and events for each
batch_full_interval boundary (at interval_start mark +1, at interval_end mark
-1), sort all events by time, then iterate once maintaining current_active_rate
and an inside_batch_full counter; for each consecutive event segment, if
inside_batch_full > 0 add current_active_rate * segment_duration to
batch_full_output_tokens, and at the end divide by batch_full_duration_ns as
before. Update the code around request_spans, batch_full_intervals and
batch_full_output_tokens to use this single-sweep approach.

---

Nitpick comments:
In `@tensorrt_llm/bench/dataclasses/reporting.py`:
- Around line 88-143: Add focused, table-driven unit tests for
_compute_batch_full_output_throughput (involving RequestRecord instances) that
cover the sweep-line boundary cases: (1) requests with identical start and end
timestamps (zero-duration) are skipped, (2) intervals where active_requests
equals batch_size at boundaries (start/end coincident) are handled correctly,
(3) the function returns None when no batch-full interval exists, and (4) token
prorating over partial overlaps yields the expected fractional token counts;
create small test rows exercising single-request, overlapping requests,
exact-boundary transitions, and multi-request partial-overlap scenarios to
assert numeric throughput or None as appropriate.
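
The single-sweep restructuring suggested in the inline comment on lines 131-143 could look roughly like the sketch below. This is a variant rather than a transcription of the suggestion: instead of adding separate events for batch-full interval boundaries, it tracks the active request count in the same pass and treats a segment as batch-full when that count reaches `batch_size`. The function name and the `(start, end, output_tokens)` tuple shape are assumptions.

```python
def single_sweep_throughput(spans, batch_size):
    """One pass over sorted events, tracking active count and aggregate token rate."""
    events = []  # (time, active_delta, rate_delta)
    for start, end, output_tokens in spans:
        if end <= start:
            continue  # zero-duration requests are skipped
        rate = output_tokens / (end - start)  # uniform prorating
        events.append((start, 1, rate))
        events.append((end, -1, -rate))
    events.sort(key=lambda event: event[0])

    active = 0
    current_rate = 0.0
    full_tokens = 0.0
    full_duration = 0.0
    prev_time = None
    for time, active_delta, rate_delta in events:
        # Account for the segment that just ended before applying this event.
        if prev_time is not None and active >= batch_size:
            segment = time - prev_time
            full_tokens += current_rate * segment
            full_duration += segment
        active += active_delta
        current_rate += rate_delta
        prev_time = time

    return full_tokens / full_duration if full_duration > 0 else None


spans = [(0, 4, 400), (0, 2, 100), (1, 3, 200)]
print(single_sweep_throughput(spans, batch_size=2))
print(single_sweep_throughput(spans, batch_size=3))
```

Events that share a timestamp produce zero-length segments, so their relative order does not change the result. One trade-off to keep in mind: adding and subtracting per-request rates accumulates floating-point error over many requests, which the nested-loop formulation avoids.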
ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: ed6b4eaa-c6e9-417b-bdcf-5a3748bcb699

📥 Commits

Reviewing files that changed from the base of the PR and between 37fc0e3 and 0ae5d69.

📒 Files selected for processing (2)
  • tensorrt_llm/bench/dataclasses/reporting.py
  • tensorrt_llm/bench/dataclasses/statistics.py

Comment thread tensorrt_llm/bench/dataclasses/reporting.py Outdated
Signed-off-by: ZhaoyangWang <zhaoyangw@nvidia.com>
Signed-off-by: ZhaoyangWang <zhaoyangw@nvidia.com>
@zhaoyangwang-nvidia

/bot run

@tensorrt-cicd

PR_Github #48065 [ run ] triggered by Bot. Commit: 88e810f Link to invocation

@tensorrt-cicd

PR_Github #48065 [ run ] completed with state SUCCESS. Commit: 88e810f
/LLM/main/L0_MergeRequest_PR pipeline #37897 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

CI Agent Failure Analysis

Link to invocation

@zhaoyangwang-nvidia

/bot run

@tensorrt-cicd

PR_Github #48107 [ run ] triggered by Bot. Commit: 88e810f Link to invocation

@tensorrt-cicd

PR_Github #48107 [ run ] completed with state SUCCESS. Commit: 88e810f
/LLM/main/L0_MergeRequest_PR pipeline #37935 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

CI Agent Failure Analysis

Link to invocation

@zhaoyangwang-nvidia

/bot run

@tensorrt-cicd

PR_Github #48136 [ run ] triggered by Bot. Commit: 88e810f Link to invocation

@tensorrt-cicd

PR_Github #48136 [ run ] completed with state SUCCESS. Commit: 88e810f
/LLM/main/L0_MergeRequest_PR pipeline #37961 completed with status: 'SUCCESS'

CI Report

Link to invocation

@sunnyqgg sunnyqgg merged commit 6ccb07a into NVIDIA:main May 14, 2026
6 checks passed
4 participants