[TRTLLM-12374][fix] Tighten DSv4 cache constraint floor to match warmup shapes#13657

Merged
lfr-0531 merged 1 commit into NVIDIA:feat/deepseek_v4 from lancelly:fix/dsv4-cache-constraint-warmup on Apr 30, 2026

@lancelly lancelly commented Apr 30, 2026

Summary

DSv4-Pro hits cuMemCreate-FAIL during V2 KV cache initialization at small batch sizes (e.g., max_batch_size=8): the accumulated allocation reaches ~124.7 GiB even though the user-derived quota_from_max_tokens is only 117.84 GiB. This is not a batch-size problem; the constraint floor is over-reserving.

Root cause

_storage_manager.py::_compute_slot_count_for_level takes:

quota = max(min_quota_from_constraints, user_quota)
                                       
so once min_quota_from_constraints exceeds user_quota, the user's free_gpu_memory_fraction is silently overridden.                                                                                                                                      
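
A minimal sketch of the override, using the figures from the summary. Only the `max(...)` expression comes from the source; the helper name `effective_quota` is hypothetical and does not reproduce the real body of `_compute_slot_count_for_level`, and the 125 GiB floor is the approximate pre-fix value quoted in this PR.

```python
GIB = 1 << 30


def effective_quota(min_quota_from_constraints: int, user_quota: int) -> int:
    """Hypothetical stand-in for the quota selection quoted above."""
    return max(min_quota_from_constraints, user_quota)


# quota_from_max_tokens reported in the summary.
user_quota = int(117.84 * GIB)
# Approximate constraint floor before this fix (~125 GiB).
old_floor = 125 * GIB

quota = effective_quota(old_floor, user_quota)
# True whenever the floor, not the user setting, decides the allocation.
overridden = quota > user_quota

print(f"effective quota: {quota / GIB:.2f} GiB, user quota overridden: {overridden}")
```

Because `max` always picks the larger value, any floor above the user quota wins without a warning, which is why keeping the floor tight matters.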
                                                    
The previous Constraint 1 was:                                                                                                                                                                                                                          
                                        
KVCacheDesc(capacity=max_seq_len, history_length=0)                                                                                                                                                                                                     
                                                                                                                                                                                                                                                        
history_length=0 represents the freshest state of a request that has the full max_seq_len capacity, a worst case that never actually occurs at runtime. With 64-layer DSv4-Pro and max_seq_len=300K, every pool group's min_slots was forced to max_seq_len / tokens_per_block ≈ 2344, including SWA and SSM pools that should collapse to their windowed working set. The resulting floor (~125 GiB) directly exceeded the user quota and triggered the OOM.
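
The figure of 2344 slots is consistent with rounding up max_seq_len / tokens_per_block, assuming max_seq_len = 300000 and tokens_per_block = 128. Neither exact value is stated in this PR, so both are assumptions, and the sliding-window size below is purely illustrative. The sketch only shows the scale of the gap between a full-length reservation and a windowed one.

```python
def blocks_for_tokens(num_tokens: int, tokens_per_block: int) -> int:
    """Number of fixed-size blocks needed to hold num_tokens (ceiling division)."""
    return -(-num_tokens // tokens_per_block)


max_seq_len = 300_000    # assumed exact value behind "300K"
tokens_per_block = 128   # assumed block size
window = 4_096           # hypothetical sliding-window size, for illustration only

# What the old floor forced on every pool group.
full_blocks = blocks_for_tokens(max_seq_len, tokens_per_block)
# What a windowed pool would need if it only kept its window resident.
windowed_blocks = blocks_for_tokens(window, tokens_per_block)

print(f"full-length reservation: {full_blocks} blocks")
print(f"windowed reservation:    {windowed_blocks} blocks")
```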
                                                    
Fix                                                                                                                                                                                                                                                     
                                                                                                                                                                                                                                                        
Reshape the constraints to match the two real warmup shapes used by _capture_generation_cuda_graphs and _general_warmup_impl:                                                                                                                           
                                                                                                                                                                                                                                                        
┌────────────┬────────────────────────────────────────────────────────────────────────────────────────────────┬────────────────────────────────────────────────────┐
│ Constraint │ Warmup form                                                                                    │ KVCacheDesc                                        │
├────────────┼────────────────────────────────────────────────────────────────────────────────────────────────┼────────────────────────────────────────────────────┤
│ C1         │ CUDA graph generation warmup: one decode request at the tail of max_seq_len                    │ capacity=max_seq_len, history_length=max_seq_len-1 │
├────────────┼────────────────────────────────────────────────────────────────────────────────────────────────┼────────────────────────────────────────────────────┤
│ C2         │ General / chunked-prefill warmup: one fresh context request of the per-iteration token budget  │ capacity=max_num_tokens, history_length=0          │
└────────────┴────────────────────────────────────────────────────────────────────────────────────────────────┴────────────────────────────────────────────────────┘
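
A rough sketch of why the reshaped constraints lower the floor for windowed pools. The `KVCacheDesc` dataclass here is a simplified stand-in that only mirrors the two field names used in this PR, not the real class. The `resident_tokens` model (new tokens plus at most one window of history) is an assumption made for illustration, as are the `max_num_tokens` and `window` values.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class KVCacheDesc:
    """Simplified stand-in: only the two fields named in the PR."""
    capacity: int
    history_length: int


def resident_tokens(desc: KVCacheDesc, window: Optional[int] = None) -> int:
    """Tokens a pool must keep resident under this simplified model.

    Full-cache pools (window=None) keep the whole capacity. Windowed pools
    keep the tokens added in this step plus at most `window` tokens of history.
    """
    if window is None:
        return desc.capacity
    new_tokens = desc.capacity - desc.history_length
    return new_tokens + min(desc.history_length, window)


max_seq_len = 300_000    # assumed exact value behind "300K"
max_num_tokens = 8_192   # hypothetical per-iteration token budget
window = 4_096           # hypothetical sliding-window size

constraints = {
    "old C1": KVCacheDesc(capacity=max_seq_len, history_length=0),
    "new C1": KVCacheDesc(capacity=max_seq_len, history_length=max_seq_len - 1),
    "new C2": KVCacheDesc(capacity=max_num_tokens, history_length=0),
}

for name, desc in constraints.items():
    print(f"{name}: full={resident_tokens(desc)}, "
          f"windowed={resident_tokens(desc, window)}")
```

Under this model the old C1 pins windowed pools to the full max_seq_len, while the new C1 needs only the window plus the single decode token; full-cache pools are unaffected by the change to C1.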

@lancelly lancelly requested a review from a team as a code owner April 30, 2026 08:58
@lancelly lancelly requested review from brb-nv and removed request for a team April 30, 2026 08:58
@lancelly lancelly changed the title from "[None][fix] Tighten DSv4 cache constraint floor to match warmup shapes" to "[TRTLLM-12374][fix] Tighten DSv4 cache constraint floor to match warmup shapes" Apr 30, 2026
@lfr-0531 lfr-0531 requested a review from jiaganc April 30, 2026 09:10

@jiaganc jiaganc left a comment


LGTM

Constraint 1 used history_length=0 with capacity=max_seq_len, which is the
worst-case freshest state that never actually occurs at runtime. On DSv4-Pro
with 64 layers and max_seq_len=300K this pushed min_quota above the user
quota and caused cuMemCreate failures during V2 KV cache init even at
batch_size=8.

Reshape the constraints to match the two warmup forms:

  C1 (cuda graph generation warmup): one decode request at the tail of
      max_seq_len -- capacity=max_seq_len, history_length=max_seq_len-1.
      SWA / SSM pools collapse to their windowed working set; full-cache
      pools still reserve max_seq_len/tokens_per_block blocks.

  C2 (general / chunked-prefill warmup): one fresh context request of the
      per-iteration token budget -- capacity=max_num_tokens,
      history_length=0.

Also pass max_num_tokens (rather than max_seq_len) as the capacity of the
single context request in typical_step, matching the same per-iteration
budget.

Signed-off-by: Lance Liao <laliao@login-bia01.bia.clusters.nvidia.com>
Signed-off-by: Lanyu Liao <lancelly@users.noreply.github.com>
@lancelly lancelly force-pushed the fix/dsv4-cache-constraint-warmup branch from 4cb79f5 to 939926d Compare April 30, 2026 09:38
@lfr-0531 lfr-0531 merged commit 8b7386b into NVIDIA:feat/deepseek_v4 Apr 30, 2026
4 checks passed
lfr-0531 pushed a commit that referenced this pull request May 7, 2026
…up shapes (#13657)

Signed-off-by: Lance Liao <laliao@login-bia01.bia.clusters.nvidia.com>
Signed-off-by: Lanyu Liao <lancelly@users.noreply.github.com>
Co-authored-by: Lance Liao <laliao@login-bia01.bia.clusters.nvidia.com>
lfr-0531 pushed a commit that referenced this pull request May 14, 2026
…up shapes (#13657)

Signed-off-by: Lance Liao <laliao@login-bia01.bia.clusters.nvidia.com>
Signed-off-by: Lanyu Liao <lancelly@users.noreply.github.com>
Co-authored-by: Lance Liao <laliao@login-bia01.bia.clusters.nvidia.com>
Signed-off-by: Fanrong Li <lfr-0531@users.noreply.github.com>
lfr-0531 pushed a commit to lfr-0531/TensorRT-LLM that referenced this pull request May 29, 2026
…up shapes (NVIDIA#13657)

Signed-off-by: Lance Liao <laliao@login-bia01.bia.clusters.nvidia.com>
Signed-off-by: Lanyu Liao <lancelly@users.noreply.github.com>
Co-authored-by: Lance Liao <laliao@login-bia01.bia.clusters.nvidia.com>
Signed-off-by: Fanrong Li <lfr-0531@users.noreply.github.com>
(cherry picked from commit 8e8f37d)
Signed-off-by: Fanrong Li <lfr-0531@users.noreply.github.com>

3 participants