[None][doc] Add blog post for tuning batch sizes for CUDA graph padding and increasing the default batch size granularity for it #13393
Conversation
📝 Walkthrough
A new blog post documenting TensorRT-LLM's CUDA graph batch size optimization strategies. It covers captured batch size configurations, throughput impacts across serving topologies, GPU memory overhead quantification, and server initialization costs, recommending the "+64" configuration as the default.
Estimated code review effort: 🎯 1 (Trivial) | ⏱️ ~5 minutes
Pre-merge checks: ✅ 5 passed (5 of 5)
Actionable comments posted: 2
🧹 Nitpick comments (1)
docs/source/blogs/tech_blog/blog20_Tuning_CUDA_Graph_Batch_Sizes_for_Higher_Throughput.md (1)
152-152: Hyphenate compound adjective for clarity. At line 152, "dry run phase" reads better as "dry-run phase."
✍️ Suggested wording tweak

```diff
-- **Optimized dry-run estimation**: The serve memory dry-run procedure could be streamlined by capturing only two to four graphs during the dry run phase to estimate per-graph memory footprints, rather than capturing the full CUDA graph set twice.
+- **Optimized dry-run estimation**: The serve memory dry-run procedure could be streamlined by capturing only two to four graphs during the dry-run phase to estimate per-graph memory footprints, rather than capturing the full CUDA graph set twice.
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@docs/source/blogs/tech_blog/blog20_Tuning_CUDA_Graph_Batch_Sizes_for_Higher_Throughput.md` at line 152, change the phrase "dry run phase" to the hyphenated "dry-run phase" in the sentence beginning with "Optimized dry-run estimation" (specifically in the clause "serve memory dry-run procedure could be streamlined by capturing only two to four graphs during the dry run phase...") so the compound adjective is properly hyphenated as "dry-run phase".
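The streamlining idea behind this suggestion can be sketched as follows (a hypothetical illustration, not the actual serve dry-run code; the function name and the ~7 MB per-graph figure are assumptions, chosen only to be roughly consistent with the ~260 MB total mentioned in the PR description):

```python
# Hypothetical sketch of the suggested dry-run optimization: instead of
# capturing the full CUDA graph set twice, capture only a small sample of
# graphs, measure the memory delta per capture, and extrapolate.

def estimate_total_graph_memory(sample_deltas_bytes, num_graphs):
    """Extrapolate total CUDA graph metadata cost from 2-4 sampled captures."""
    per_graph = sum(sample_deltas_bytes) / len(sample_deltas_bytes)
    return per_graph * num_graphs

# Two sampled captures of ~7 MB each, extrapolated to a 38-graph set,
# land near the ~260 MB per-GPU figure reported in the PR description.
total = estimate_total_graph_memory([7 * 2**20, 7 * 2**20], 38)
print(f"{total / 2**20:.0f} MB")  # 266 MB
```

This trades a small amount of estimation error for not having to capture every graph during the dry run.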
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In
`@docs/source/blogs/tech_blog/blog20_Tuning_CUDA_Graph_Batch_Sizes_for_Higher_Throughput.md`:
- Around line 29-31: The fenced code block containing the batch-size list (the
line with "1, 2, 4, 8, ... 2048") is missing a language tag; update that
markdown fence to include a language identifier (e.g., change ``` to ```text) so
the block reads ```text followed by the batch-size list and ends with ```,
ensuring MD040 linter compliance.
- Line 86: Fix the typo "CUDA gragh metadata" to "CUDA graph metadata" in the
sentence that begins "The observable memory difference lies instead in the CUDA
gragh metadata outside the pool." Update that exact phrase so the user-facing
technical text reads "CUDA graph metadata" (ensure only the misspelled word is
changed and surrounding wording remains the same).
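For reference, the MD040 fix described in the first inline comment amounts to adding a language identifier to the fence, along these lines (a sketch; the elided batch-size list is quoted as it appears in the comment):

````markdown
```text
1, 2, 4, 8, ... 2048
```
````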
ℹ️ Review info
⚙️ Run configuration
Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: 08aa84aa-3efa-4b52-9638-28177772457e
⛔ Files ignored due to path filters (3)
- `docs/source/blogs/media/tech_blog20_aggr_all_models.png` is excluded by `!**/*.png`
- `docs/source/blogs/media/tech_blog20_disagg_all_models.png` is excluded by `!**/*.png`
- `docs/source/blogs/media/tech_blog20_p8_all_models.png` is excluded by `!**/*.png`
📒 Files selected for processing (1)
docs/source/blogs/tech_blog/blog20_Tuning_CUDA_Graph_Batch_Sizes_for_Higher_Throughput.md
Force-pushed 3b14054 to 2d67904.
Force-pushed 2d67904 to c76c2eb.
Signed-off-by: Yijing Li <257409031+yijingl-nvidia@users.noreply.github.com>
Force-pushed c76c2eb to f53bdb4.
/bot run --disable-fail-fast
PR_Github #45766 [ run ] triggered by Bot. Commit:
PR_Github #45766 [ run ] completed with state
Description
The blog post presents our experiments with different batch size sets for CUDA graphs when CUDA graph padding is enabled. The finer-grained the batch size set, the less compute is wasted on padding, but the more GPU memory the CUDA graphs consume and the longer the server takes to start.
Based on the experiments, a batch size set with a +64 increment shows better output throughput than the default x2 exponential set: up to 1.3x higher for aggregated serving and 1.5x higher for disaggregated serving at high concurrencies. It increases GPU memory usage by about 260 MB per GPU (measured on DeepSeek R1) and slows server startup by 1.17x.
As a result, we changed the default batch size set used when CUDA graph padding is enabled to the +64 set. The corresponding code change is #12895.
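To make the padding trade-off concrete, here is a minimal Python sketch (illustrative only, not TensorRT-LLM code; the helper names and the 2048 cap are assumptions based on the batch-size list quoted in the review comments):

```python
# Sketch: compare the default x2 exponential batch size set against the
# +64 increment set. Helper names are illustrative, not TensorRT-LLM APIs.

def exponential_sizes(max_bs=2048):
    """x2 set: 1, 2, 4, ..., up to max_bs (12 graphs for max_bs=2048)."""
    sizes, bs = [], 1
    while bs <= max_bs:
        sizes.append(bs)
        bs *= 2
    return sizes

def plus64_sizes(max_bs=2048):
    """+64 set: powers of two up to 64, then 128, 192, ... (38 graphs)."""
    sizes = [bs for bs in (1, 2, 4, 8, 16, 32, 64) if bs <= max_bs]
    sizes += list(range(128, max_bs + 1, 64))
    return sizes

def padded_size(batch, sizes):
    """CUDA graph padding rounds a runtime batch up to the next captured size."""
    return min(s for s in sizes if s >= batch)

# A runtime batch of 1100 pads to 2048 with the x2 set (~46% of the padded
# batch is wasted) but only to 1152 with the +64 set (~5% waste), at the
# cost of capturing 38 graphs instead of 12.
print(padded_size(1100, exponential_sizes()))  # 2048
print(padded_size(1100, plus64_sizes()))       # 1152
```

Capturing more sizes reduces padding waste but multiplies the number of graphs held in GPU memory, which is consistent with the ~260 MB overhead and slower startup reported above.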
Test Coverage
N/A
PR Checklist
Please review the following before submitting your PR:
PR description clearly explains what and why. If using CodeRabbit's summary, please make sure it makes sense.
PR Follows TRT-LLM CODING GUIDELINES to the best of your knowledge.
Test cases are provided for new code paths (see test instructions)
Any new dependencies have been scanned for license and vulnerabilities
CODEOWNERS updated if ownership changes
Documentation updated as needed
Update tava architecture diagram if there is a significant design change in PR.
The reviewers assigned automatically/manually are appropriate for the PR.
Please check this after reviewing the above items as appropriate for this PR.
GitHub Bot Help
To see a list of available CI bot commands, please comment `/bot help`.