[None][doc] Add blog post for tuning batch sizes for CUDA graph padding and increasing the default batch size granularity for it #13393
Conversation
📝 Walkthrough
A new blog post documenting TensorRT-LLM's CUDA graph batch size optimization strategies. It covers captured batch size configurations, throughput impacts across serving topologies, GPU memory overhead quantification, and server initialization costs, recommending the "+64" configuration as the default.
Estimated code review effort: 🎯 1 (Trivial) | ⏱️ ~5 minutes
Pre-merge checks: ✅ 5 passed (5 of 5)
Actionable comments posted: 2
🧹 Nitpick comments (1)
docs/source/blogs/tech_blog/blog20_Tuning_CUDA_Graph_Batch_Sizes_for_Higher_Throughput.md (1)
152-152: Hyphenate compound adjective for clarity. At line 152, "dry run phase" reads better as "dry-run phase."
✍️ Suggested wording tweak

```diff
-- **Optimized dry-run estimation**: The serve memory dry-run procedure could be streamlined by capturing only two to four graphs during the dry run phase to estimate per-graph memory footprints, rather than capturing the full CUDA graph set twice.
+- **Optimized dry-run estimation**: The serve memory dry-run procedure could be streamlined by capturing only two to four graphs during the dry-run phase to estimate per-graph memory footprints, rather than capturing the full CUDA graph set twice.
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@docs/source/blogs/tech_blog/blog20_Tuning_CUDA_Graph_Batch_Sizes_for_Higher_Throughput.md` at line 152, change the phrase "dry run phase" to the hyphenated "dry-run phase" in the sentence beginning with "Optimized dry-run estimation" (specifically in the clause "serve memory dry-run procedure could be streamlined by capturing only two to four graphs during the dry run phase...") so the compound adjective is properly hyphenated as "dry-run phase".
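The streamlining idea behind this suggestion can be sketched as follows (a hypothetical illustration, not the actual serve dry-run code; the function name and the ~7 MB per-graph figure are assumptions, chosen only to be roughly consistent with the ~260 MB total mentioned in the PR description):

```python
# Hypothetical sketch of the suggested dry-run optimization: instead of
# capturing the full CUDA graph set twice, capture only a small sample of
# graphs, measure the memory delta per capture, and extrapolate.

def estimate_total_graph_memory(sample_deltas_bytes, num_graphs):
    """Extrapolate total CUDA graph metadata cost from 2-4 sampled captures."""
    per_graph = sum(sample_deltas_bytes) / len(sample_deltas_bytes)
    return per_graph * num_graphs

# Two sampled captures of ~7 MB each, extrapolated to a 38-graph set,
# land near the ~260 MB per-GPU figure reported in the PR description.
total = estimate_total_graph_memory([7 * 2**20, 7 * 2**20], 38)
print(f"{total / 2**20:.0f} MB")  # 266 MB
```

This trades a small amount of estimation error for not having to capture every graph during the dry run.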
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In
`@docs/source/blogs/tech_blog/blog20_Tuning_CUDA_Graph_Batch_Sizes_for_Higher_Throughput.md`:
- Around line 29-31: The fenced code block containing the batch-size list (the
line with "1, 2, 4, 8, ... 2048") is missing a language tag; update that
markdown fence to include a language identifier (e.g., change ``` to ```text) so
the block reads ```text followed by the batch-size list and ends with ```,
ensuring MD040 linter compliance.
- Line 86: Fix the typo "CUDA gragh metadata" to "CUDA graph metadata" in the
sentence that begins "The observable memory difference lies instead in the CUDA
gragh metadata outside the pool." Update that exact phrase so the user-facing
technical text reads "CUDA graph metadata" (ensure only the misspelled word is
changed and surrounding wording remains the same).
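For reference, the MD040 fix described in the first inline comment amounts to adding a language identifier to the fence, along these lines (a sketch; the elided batch-size list is quoted as it appears in the comment):

````markdown
```text
1, 2, 4, 8, ... 2048
```
````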
ℹ️ Review info
⚙️ Run configuration
Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: 08aa84aa-3efa-4b52-9638-28177772457e
⛔ Files ignored due to path filters (3)
- `docs/source/blogs/media/tech_blog20_aggr_all_models.png` is excluded by `!**/*.png`
- `docs/source/blogs/media/tech_blog20_disagg_all_models.png` is excluded by `!**/*.png`
- `docs/source/blogs/media/tech_blog20_p8_all_models.png` is excluded by `!**/*.png`
📒 Files selected for processing (1)
docs/source/blogs/tech_blog/blog20_Tuning_CUDA_Graph_Batch_Sizes_for_Higher_Throughput.md
Force-pushed 3b14054 to 2d67904.
Force-pushed 2d67904 to c76c2eb.
Signed-off-by: Yijing Li <257409031+yijingl-nvidia@users.noreply.github.com>
Force-pushed c76c2eb to f53bdb4.
/bot run --disable-fail-fast
PR_Github #45766 [ run ] triggered by Bot. Commit:
PR_Github #45766 [ run ] completed with state
Description
The blog post presents our experiments with different batch size sets for CUDA graphs when CUDA graph padding is enabled. The finer-grained the batch size set, the less compute is wasted on padding, but the more GPU memory the CUDA graphs consume and the longer the server takes to start.
Based on the experiments, a batch size set with a +64 increment shows better output throughput than the default x2 exponential set: up to 1.3x higher for aggregated serving and 1.5x higher for disaggregated serving at high concurrencies. It increases GPU memory usage by about 260 MB per GPU (measured on DeepSeek R1) and slows server startup by 1.17x.
As a result, we changed the default batch size set used when CUDA graph padding is enabled to the +64 set. The corresponding code change is #12895.
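To make the padding trade-off concrete, here is a minimal Python sketch (illustrative only, not TensorRT-LLM code; the helper names and the 2048 cap are assumptions based on the batch-size list quoted in the review comments):

```python
# Sketch: compare the default x2 exponential batch size set against the
# +64 increment set. Helper names are illustrative, not TensorRT-LLM APIs.

def exponential_sizes(max_bs=2048):
    """x2 set: 1, 2, 4, ..., up to max_bs (12 graphs for max_bs=2048)."""
    sizes, bs = [], 1
    while bs <= max_bs:
        sizes.append(bs)
        bs *= 2
    return sizes

def plus64_sizes(max_bs=2048):
    """+64 set: powers of two up to 64, then 128, 192, ... (38 graphs)."""
    sizes = [bs for bs in (1, 2, 4, 8, 16, 32, 64) if bs <= max_bs]
    sizes += list(range(128, max_bs + 1, 64))
    return sizes

def padded_size(batch, sizes):
    """CUDA graph padding rounds a runtime batch up to the next captured size."""
    return min(s for s in sizes if s >= batch)

# A runtime batch of 1100 pads to 2048 with the x2 set (~46% of the padded
# batch is wasted) but only to 1152 with the +64 set (~5% waste), at the
# cost of capturing 38 graphs instead of 12.
print(padded_size(1100, exponential_sizes()))  # 2048
print(padded_size(1100, plus64_sizes()))       # 1152
```

Capturing more sizes reduces padding waste but multiplies the number of graphs held in GPU memory, which is consistent with the ~260 MB overhead and slower startup reported above.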
Test Coverage
N/A
PR Checklist
Please review the following before submitting your PR:
PR description clearly explains what and why. If using CodeRabbit's summary, please make sure it makes sense.
PR Follows TRT-LLM CODING GUIDELINES to the best of your knowledge.
Test cases are provided for new code paths (see test instructions)
Any new dependencies have been scanned for license and vulnerabilities
CODEOWNERS updated if ownership changes
Documentation updated as needed
Update tava architecture diagram if there is a significant design change in PR.
The reviewers assigned automatically/manually are appropriate for the PR.
Please check this after reviewing the above items as appropriate for this PR.
GitHub Bot Help
To see a list of available CI bot commands, please comment `/bot help`.