[TRTLLM-12374][fix] Tighten DSv4 cache constraint floor to match warmup shapes by lancelly · Pull Request #13657 · NVIDIA/TensorRT-LLM

lancelly · 2026-04-30T08:58:50Z

Summary

DSv4-Pro hits cuMemCreate-FAIL during V2 KV cache initialization at small batch sizes (e.g., max_batch_size=8), with accumulated allocation reaching ~124.7 GiB even though the user-derived quota_from_max_tokens is only 117.84 GiB. This is not a batch-size problem: the constraint floor is over-reserving.

Root cause

_storage_manager.py::_compute_slot_count_for_level takes:

quota = max(min_quota_from_constraints, user_quota)
                                       
so once min_quota_from_constraints exceeds user_quota, the user's free_gpu_memory_fraction is silently overridden.                                                                                                                                      
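
A minimal sketch of that override, using the figures reported in this PR (117.84 GiB user quota, a floor of roughly 125 GiB). The function name `effective_quota` is a hypothetical stand-in for illustration and is not the actual code in _storage_manager.py; only the max() rule itself comes from the description above.

```python
GIB = 1024 ** 3


def effective_quota(min_quota_from_constraints: int, user_quota: int) -> int:
    """Hypothetical helper mirroring the quoted rule: the larger value wins."""
    return max(min_quota_from_constraints, user_quota)


# Figures taken from the PR text; the floor is only approximate (~125 GiB).
user_quota = int(117.84 * GIB)
floor = 125 * GIB

quota = effective_quota(floor, user_quota)
user_quota_overridden = quota > user_quota

print(f"effective quota: {quota / GIB:.2f} GiB")
print(f"user quota overridden: {user_quota_overridden}")
```

Whenever the floor is the larger argument, the user-derived quota has no effect on the result, which is why lowering the batch size did not help.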
                                                    
The previous Constraint 1 was:                                                                                                                                                                                                                          
                                        
KVCacheDesc(capacity=max_seq_len, history_length=0)                                                                                                                                                                                                     
                                                                                                                                                                                                                                                        
history_length=0 represents the freshest state of a request that has the full max_seq_len capacity, a worst case that never actually occurs at runtime. With 64-layer DSv4-Pro and max_seq_len=300K, every pool group's min_slots was forced to max_seq_len / tokens_per_block ≈ 2344, including SWA and SSM pools that should collapse to their windowed working set. The resulting floor (~125 GiB) directly exceeded the user quota and triggered the OOM.
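
The block arithmetic can be sketched as follows. This is a simplified model, not the real slot computation: it assumes max_seq_len=300000 and tokens_per_block=128 (values inferred from the ≈ 2344 figure above), a hypothetical window of 4096 tokens, and a windowed pool that keeps at most `window` tokens of history plus all new tokens.

```python
import math

MAX_SEQ_LEN = 300_000      # assumption: "300K" read as 300000
TOKENS_PER_BLOCK = 128     # assumption: inferred from 300000 / 2344
WINDOW = 4096              # hypothetical sliding-window size


def full_pool_blocks(capacity: int) -> int:
    """A full-cache pool holds every token up to capacity."""
    return math.ceil(capacity / TOKENS_PER_BLOCK)


def windowed_pool_blocks(capacity: int, history_length: int) -> int:
    """Simplified model: keep at most WINDOW history tokens plus new tokens."""
    new_tokens = capacity - history_length
    live_tokens = min(history_length, WINDOW) + new_tokens
    return math.ceil(live_tokens / TOKENS_PER_BLOCK)


# Old constraint: history_length=0, so all 300000 tokens count as new.
old_windowed = windowed_pool_blocks(MAX_SEQ_LEN, history_length=0)
# Decode at the tail: one new token on top of a windowed history.
tail_windowed = windowed_pool_blocks(MAX_SEQ_LEN, history_length=MAX_SEQ_LEN - 1)

print(full_pool_blocks(MAX_SEQ_LEN), old_windowed, tail_windowed)
```

Under these assumptions the windowed pool needs as many blocks as a full-cache pool when history_length=0, and only a few dozen once the request sits at the tail of max_seq_len.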
                                                    
Fix                                                                                                                                                                                                                                                     
                                                                                                                                                                                                                                                        
Reshape the constraints to match the two real warmup shapes used by _capture_generation_cuda_graphs and _general_warmup_impl:                                                                                                                           
                                                                                                                                                                                                                                                        
┌────────────┬────────────────────────────────────────────────────────────────────────────────────────────────┬────────────────────────────────────────────────────┐                                                                                    
│ Constraint │                                          Warmup form                                           │                    KVCacheDesc                     │
├────────────┼────────────────────────────────────────────────────────────────────────────────────────────────┼────────────────────────────────────────────────────┤
│ C1         │ CUDA graph generation warmup — one decode request at the tail of max_seq_len                   │ capacity=max_seq_len, history_length=max_seq_len-1 │                                                                                    
├────────────┼────────────────────────────────────────────────────────────────────────────────────────────────┼────────────────────────────────────────────────────┤
│ C2         │ General / chunked-prefill warmup — one fresh context request of the per-iteration token budget │ capacity=max_num_tokens, history_length=0          │                                                                                    
└────────────┴────────────────────────────────────────────────────────────────────────────────────────────────┴────────────────────────────────────────────────────┘
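
The two constraints in the table can be written down as below. The `KVCacheDesc` dataclass here is a hypothetical stand-in carrying only the two fields named in this PR, `build_constraints` is an illustrative helper, and the max_num_tokens value of 8192 is an assumed example rather than a value from the PR.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class KVCacheDesc:
    """Hypothetical stand-in with only the fields named in this PR."""
    capacity: int
    history_length: int

    @property
    def new_tokens(self) -> int:
        # Tokens not yet in the cache when the request is scheduled.
        return self.capacity - self.history_length


def build_constraints(max_seq_len: int, max_num_tokens: int) -> list[KVCacheDesc]:
    # C1: one decode request at the tail of max_seq_len (CUDA graph warmup).
    c1 = KVCacheDesc(capacity=max_seq_len, history_length=max_seq_len - 1)
    # C2: one fresh context request of the per-iteration token budget.
    c2 = KVCacheDesc(capacity=max_num_tokens, history_length=0)
    return [c1, c2]


# max_num_tokens=8192 is an assumed example value.
c1, c2 = build_constraints(max_seq_len=300_000, max_num_tokens=8192)
print(c1, c1.new_tokens)
print(c2, c2.new_tokens)
```

C1 contributes a single new token at full sequence length, while C2 contributes a full per-iteration budget of new tokens with no history, so neither describes a fresh request of max_seq_len tokens.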

jiaganc

LGTM

Constraint 1 used history_length=0 with capacity=max_seq_len, which is the worst-case freshest state that never actually occurs at runtime. On DSv4-Pro with 64 layers and max_seq_len=300K this pushed min_quota above the user quota and caused cuMemCreate failures during V2 KV cache init even at batch_size=8.

Reshape the constraints to match the two warmup forms:

C1 (cuda graph generation warmup): one decode request at the tail of max_seq_len -- capacity=max_seq_len, history_length=max_seq_len-1. SWA / SSM pools collapse to their windowed working set; full-cache pools still reserve max_seq_len/tokens_per_block blocks.

C2 (general / chunked-prefill warmup): one fresh context request of the per-iteration token budget -- capacity=max_num_tokens, history_length=0.

Also pass max_num_tokens (rather than max_seq_len) as the capacity of the single context request in typical_step, matching the same per-iteration budget.

Signed-off-by: Lance Liao <laliao@login-bia01.bia.clusters.nvidia.com>
Signed-off-by: Lanyu Liao <lancelly@users.noreply.github.com>

…up shapes (#13657) Signed-off-by: Lance Liao <laliao@login-bia01.bia.clusters.nvidia.com> Signed-off-by: Lanyu Liao <lancelly@users.noreply.github.com> Co-authored-by: Lance Liao <laliao@login-bia01.bia.clusters.nvidia.com>

…up shapes (#13657) Signed-off-by: Lance Liao <laliao@login-bia01.bia.clusters.nvidia.com> Signed-off-by: Lanyu Liao <lancelly@users.noreply.github.com> Co-authored-by: Lance Liao <laliao@login-bia01.bia.clusters.nvidia.com> Signed-off-by: Fanrong Li <lfr-0531@users.noreply.github.com>

…up shapes (NVIDIA#13657) Signed-off-by: Lance Liao <laliao@login-bia01.bia.clusters.nvidia.com> Signed-off-by: Lanyu Liao <lancelly@users.noreply.github.com> Co-authored-by: Lance Liao <laliao@login-bia01.bia.clusters.nvidia.com> Signed-off-by: Fanrong Li <lfr-0531@users.noreply.github.com> (cherry picked from commit 8e8f37d) Signed-off-by: Fanrong Li <lfr-0531@users.noreply.github.com>

lancelly requested a review from a team as a code owner April 30, 2026 08:58

lancelly requested review from brb-nv and removed request for a team April 30, 2026 08:58

github-actions Bot assigned lancelly Apr 30, 2026

lancelly changed the title ~~[None][fix] Tighten DSv4 cache constraint floor to match warmup shapes~~ [TRTLLM-12374][fix] Tighten DSv4 cache constraint floor to match warmup shapes Apr 30, 2026

lfr-0531 requested a review from jiaganc April 30, 2026 09:10

lfr-0531 added the deepseek-v4 label Apr 30, 2026

jiaganc approved these changes Apr 30, 2026

lancelly force-pushed the fix/dsv4-cache-constraint-warmup branch from 4cb79f5 to 939926d Compare April 30, 2026 09:38

lfr-0531 merged commit 8b7386b into NVIDIA:feat/deepseek_v4 Apr 30, 2026
4 checks passed

lfr-0531 merged 1 commit into NVIDIA:feat/deepseek_v4 from lancelly:fix/dsv4-cache-constraint-warmup

lancelly commented Apr 30, 2026 •

edited

Loading

Uh oh!

jiaganc left a comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Conversation

lancelly commented Apr 30, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Root cause

Uh oh!

jiaganc left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

lancelly commented Apr 30, 2026 •

edited

Loading