[None][feat] add batch-full benchmark throughput metric #13638
Hi @NVIDIA/trtllm-bench-reviewers, could you help review this PR? It adds only a small benchmark statistics feature: reporting batch_full_output_throughput_tok_s in addition to the existing end-to-end output throughput metric. This helps distinguish steady-state full-batch throughput from cases where a few long-tail outputs drag down the overall wall-clock throughput.
📝 Walkthrough
A new batch-full output throughput metric is added to the benchmarking statistics. The metric measures output tokens per second during periods when active request counts meet or exceed the specified batch size. It is computed from request records, stored in the
Changes
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~20 minutes
🚥 Pre-merge checks: ✅ 5 passed
Actionable comments posted: 1
🧹 Nitpick comments (1)
tensorrt_llm/bench/dataclasses/reporting.py (1)
88-143: ⚡ Quick win: Add focused tests for the sweep-line boundary cases.
This logic depends on subtle timestamp semantics: identical start/end boundaries, zero-duration requests being skipped, and the `None` path when no full-batch interval exists. A few table-driven tests here would make future refactors much safer.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@tensorrt_llm/bench/dataclasses/reporting.py`:
- Around line 131-143: Replace the nested loops that compute
batch_full_output_tokens by building a single sorted event sweep: create events
for each request span boundary (at request_start add delta = output_tokens /
request_duration_ns, at request_end subtract same delta) and events for each
batch_full_interval boundary (at interval_start mark +1, at interval_end mark
-1), sort all events by time, then iterate once maintaining current_active_rate
and an inside_batch_full counter; for each consecutive event segment, if
inside_batch_full > 0 add current_active_rate * segment_duration to
batch_full_output_tokens, and at the end divide by batch_full_duration_ns as
before. Update the code around request_spans, batch_full_intervals and
batch_full_output_tokens to use this single-sweep approach.
---
Nitpick comments:
In `@tensorrt_llm/bench/dataclasses/reporting.py`:
- Around line 88-143: Add focused, table-driven unit tests for
_compute_batch_full_output_throughput (involving RequestRecord instances) that
cover the sweep-line boundary cases: (1) requests with identical start and end
timestamps (zero-duration) are skipped, (2) intervals where active_requests
equals batch_size at boundaries (start/end coincident) are handled correctly,
(3) the function returns None when no batch-full interval exists, and (4) token
prorating over partial overlaps yields the expected fractional token counts;
create small test rows exercising single-request, overlapping requests,
exact-boundary transitions, and multi-request partial-overlap scenarios to
assert numeric throughput or None as appropriate.
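The single-sweep idea from the inline comment can be sketched roughly as follows. This is an illustrative variant, not the code in `reporting.py`: the function name `batch_full_throughput_single_sweep` and the `(start_ns, end_ns, output_tokens)` tuple layout are assumptions made for the example, and it tracks the active-request count directly in the sweep instead of keeping a separate `inside_batch_full` counter over precomputed intervals.

```python
from typing import List, Optional, Tuple

# Hypothetical record layout for this sketch: (start_ns, end_ns, output_tokens).
Request = Tuple[int, int, int]


def batch_full_throughput_single_sweep(
    requests: List[Request], batch_size: int
) -> Optional[float]:
    """Return tok/s over periods with >= batch_size active requests, or None."""
    events = []
    for start_ns, end_ns, tokens in requests:
        if end_ns <= start_ns:
            continue  # zero-duration requests cannot be prorated, so skip them
        rate = tokens / (end_ns - start_ns)  # uniform tokens per nanosecond
        events.append((start_ns, 1, rate))
        events.append((end_ns, -1, -rate))
    # At equal timestamps, process ends before starts so that back-to-back
    # requests are not counted as overlapping.
    events.sort(key=lambda e: (e[0], e[1]))

    active = 0          # number of requests currently running
    current_rate = 0.0  # summed token rate of the running requests
    full_ns = 0         # total time spent with a full batch
    full_tokens = 0.0   # prorated tokens produced during that time
    prev_t = None
    for t, d_active, d_rate in events:
        # The segment [prev_t, t) uses the state from before this event.
        if prev_t is not None and active >= batch_size:
            full_ns += t - prev_t
            full_tokens += current_rate * (t - prev_t)
        active += d_active
        current_rate += d_rate
        prev_t = t

    if full_ns == 0:
        return None  # the batch was never full
    return full_tokens / (full_ns / 1e9)
```

The sweep costs one sort plus one pass over the events, instead of comparing every request span against every batch-full interval. Because rates are accumulated in floating point, tests against a sketch like this should compare with a tolerance rather than exact equality.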
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: ed6b4eaa-c6e9-417b-bdcf-5a3748bcb699
📒 Files selected for processing (2)
tensorrt_llm/bench/dataclasses/reporting.py
tensorrt_llm/bench/dataclasses/statistics.py
Signed-off-by: ZhaoyangWang <zhaoyangw@nvidia.com>
Signed-off-by: ZhaoyangWang <zhaoyangw@nvidia.com>
6f8cf1f to 88e810f
/bot run
PR_Github #48065 [ run ] triggered by Bot. Commit:
PR_Github #48065 [ run ] completed with state
/bot run
PR_Github #48107 [ run ] triggered by Bot. Commit:
PR_Github #48107 [ run ] completed with state
/bot run
PR_Github #48136 [ run ] triggered by Bot. Commit:
PR_Github #48136 [ run ] completed with state
Summary by CodeRabbit
Motivation
In speculative decoding benchmarks, we want generated outputs to remain valid user-facing content. For that reason, these tests should not ignore EOS, and the input prompts should use the chat template when appropriate.
That makes the benchmark closer to real serving behavior, but it also means output lengths are naturally uneven. With 80 test samples and larger batch sizes such as 32 or 64, a single long-tail request can keep running after most other requests have finished. This final drain phase can noticeably reduce end-to-end output throughput, even when the steady-state full-batch throughput is healthy.
For example, in one Llama 8B rejection-sampling run:
| Case | End-to-end output throughput | Batch-full output throughput | Notes |
| --- | --- | --- | --- |
| bs32/linear_rejection_on | 637.2 tok/s | 2445.8 tok/s | … 2000 token cap without EOS, which heavily dragged down the end-to-end metric. |
| bs32/dynamic_tree_rejection_on | 1473.6 tok/s | 2502.5 tok/s | … 2000 token cap, creating a visible tail/drain effect. |

The existing output throughput metric is still useful because it reflects total wall-clock behavior, including tail latency. However, it can obscure the steady-state throughput when the batch is full.
Changes
This PR adds a new benchmark metric:
batch_full_output_throughput_tok_s
The metric estimates output token throughput only over intervals where the number of active requests is at least the configured batch size. Since request records do not include per-token timestamps, output tokens are prorated uniformly over each request's start/end interval.
This gives us a complementary steady-state throughput view while preserving the existing end-to-end throughput metric.
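To make the proration concrete, here is a minimal sketch of how such a metric could be computed. It is not the implementation in this PR: the `Span` class, its field names, and both function names are hypothetical stand-ins, and timestamps are assumed to be in nanoseconds.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Span:
    """Hypothetical stand-in for a request record (timestamps in ns)."""
    start_ns: int
    end_ns: int
    output_tokens: int


def full_batch_intervals(spans: List[Span], batch_size: int) -> List[Tuple[int, int]]:
    """Collect intervals during which at least batch_size requests are active."""
    events = []
    for s in spans:
        if s.end_ns > s.start_ns:  # zero-duration requests are skipped
            events.append((s.start_ns, 1))
            events.append((s.end_ns, -1))
    # Ends sort before starts at the same timestamp.
    events.sort(key=lambda e: (e[0], e[1]))

    intervals = []
    active = 0
    full_since = None
    for t, delta in events:
        active += delta
        if active >= batch_size and full_since is None:
            full_since = t
        elif active < batch_size and full_since is not None:
            if t > full_since:
                intervals.append((full_since, t))
            full_since = None
    return intervals


def batch_full_output_throughput(spans: List[Span], batch_size: int) -> Optional[float]:
    """Prorate each request's tokens uniformly and return tok/s, or None."""
    intervals = full_batch_intervals(spans, batch_size)
    full_ns = sum(end - start for start, end in intervals)
    if full_ns == 0:
        return None  # no full-batch interval exists
    tokens = 0.0
    for s in spans:
        duration = s.end_ns - s.start_ns
        if duration <= 0:
            continue
        for start, end in intervals:
            overlap = min(s.end_ns, end) - max(s.start_ns, start)
            if overlap > 0:
                # Share of this request's tokens attributed to the overlap.
                tokens += s.output_tokens * overlap / duration
    return tokens / (full_ns / 1e9)
```

As a small worked case under these assumptions: with a batch size of 2, one request producing 100 tokens over 10 s and another producing 40 tokens over the first 2 s, the batch is full only for those 2 s. The long request contributes 100 × 2/10 = 20 prorated tokens and the short one all 40, so the sketch reports 60 / 2 = 30 tok/s, while end-to-end throughput over the same run is 140 / 10 = 14 tok/s.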
PR Checklist
Please review the following before submitting your PR:
PR description clearly explains what and why. If using CodeRabbit's summary, please make sure it makes sense.
PR Follows TRT-LLM CODING GUIDELINES to the best of your knowledge.
Test cases are provided for new code paths (see test instructions)
Any new dependencies have been scanned for license and vulnerabilities
CODEOWNERS updated if ownership changes
Documentation updated as needed
Update tava architecture diagram if there is a significant design change in PR.
The reviewers assigned automatically/manually are appropriate for the PR.
Please check this after reviewing the above items as appropriate for this PR.
GitHub Bot Help
To see a list of available CI bot commands, please comment /bot help.