[https://nvbugs/6114821][fix] Fix extra_tokens in V2 KV cache #13619
Conversation
Force-pushed from 8dcbfdd to cbec4e0.
/bot run
PR_Github #46265 [ run ] triggered by Bot.
PR_Github #46265 [ run ] completed.
Force-pushed from cbec4e0 to ccc559e.
/bot run
/bot run
PR_Github #46401 [ run ] triggered by Bot.
Force-pushed from ccc559e to 6e33b81.
/bot run
PR_Github #46402 [ run ] triggered by Bot.
PR_Github #46403 [ run ] triggered by Bot.
PR_Github #46402 [ run ] completed.
PR_Github #46403 [ run ] completed.
Force-pushed from 6e33b81 to cffc312.
/bot run
PR_Github #46441 [ run ] triggered by Bot.
PR_Github #46441 [ run ] completed.
Force-pushed from cffc312 to f8bd7e8.
/bot run
PR_Github #46529 [ run ] triggered by Bot.
PR_Github #46529 [ run ] completed.
/bot run
/bot run
PR_Github #46563 [ run ] triggered by Bot.
PR_Github #46563 [ run ] completed.
/bot run
PR_Github #46570 [ run ] triggered by Bot.
PR_Github #46570 [ run ] completed.
Description

clamp_max_seq_len_for_mem must be called with (token_num_upper_bound + extra_tokens) so that the function answers the question "given that each sequence actually uses N + extra actual tokens, what seq_len fits?". PR #12306 dropped the + extra_tokens from the argument, making the function answer the wrong question and under-report user-visible capacity by extra_tokens.

In the memory-plentiful case this clamps self.max_seq_len down by extra_tokens during V2 init (resource_manager.py:1874), which triggers the SWA-detection branch in _util.py:591 to rebuild _dummy_reqs mid-init, leaving the estimation results and warmup state internally inconsistent. Under sustained spec-dec load (GPT-OSS-120B + Eagle3 one-model + V2 KV cache + non-greedy sampling), this manifests as an intermittent OutOfPagesError on draft KV cache resize, an IMA in the spec sampler, or a hang at cuda_event.synchronize() during GPQA evaluation.
The _gpu_max_tokens - extra_tokens cap from PR #12306 is preserved, since it correctly converts the GPU-only cap into user-visible token units, as shown in the sketch below.
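A minimal sketch of the contract this fix restores. The stand-in function body, its parameter names, and all numbers are illustrative assumptions, not the real TRT-LLM implementation; only the two conventions it encodes, passing (token_num_upper_bound + extra_tokens) at the call site and converting the GPU-only cap via gpu_max_tokens - extra_tokens, come from this PR.

```python
# Illustrative stand-in for clamp_max_seq_len_for_mem: the body and the
# parameter list are assumptions, not the real resource_manager.py code.
def clamp_max_seq_len_for_mem(actual_tokens_per_seq: int,
                              max_seq_len: int,
                              extra_tokens: int,
                              gpu_max_tokens: int) -> int:
    """Given that each sequence actually occupies actual_tokens_per_seq
    tokens of KV cache (user-visible tokens plus spec-dec extra tokens),
    return the user-visible seq_len that fits."""
    # Convert the GPU-only cap to user-visible units (the cap kept from PR #12306).
    user_visible_cap = gpu_max_tokens - extra_tokens
    # Convert the per-sequence actual-token budget back to user-visible tokens.
    return min(max_seq_len, actual_tokens_per_seq - extra_tokens, user_visible_cap)


token_num_upper_bound = 8192  # hypothetical user-visible token budget per sequence
extra_tokens = 4              # hypothetical spec-dec draft-token overhead
max_seq_len = 8192
gpu_max_tokens = 1_000_000    # memory-plentiful case

# After PR #12306 (buggy): the user-visible budget is passed as if it were the
# actual per-sequence usage, so capacity is under-reported by extra_tokens and
# max_seq_len is clamped from 8192 to 8188 even though memory is plentiful.
buggy = clamp_max_seq_len_for_mem(token_num_upper_bound,
                                  max_seq_len, extra_tokens, gpu_max_tokens)

# This fix: pass the tokens each sequence actually consumes, so the
# memory-plentiful case leaves max_seq_len at 8192.
fixed = clamp_max_seq_len_for_mem(token_num_upper_bound + extra_tokens,
                                  max_seq_len, extra_tokens, gpu_max_tokens)

assert (buggy, fixed) == (8188, 8192)
```

In this toy model, the silent clamp from 8192 to 8188 is the analogue of the mid-init max_seq_len change that trips the SWA-detection branch described above.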
Test Coverage

Tested test_eagle3_4gpus[v2_kv_cache-trtllm-one_model-no_overlap_scheduler]:
- baseline (PR #12306 applied): ~18% fail rate
- with this fix: 41/41 passing across 2 nodes

Tracking: nvbugs/6113016 (overlap_scheduler), nvbugs/6114821 (no_overlap_scheduler)
PR Checklist
Please review the following before submitting your PR:
- PR description clearly explains what and why. If using CodeRabbit's summary, please make sure it makes sense.
- PR follows the TRT-LLM CODING GUIDELINES to the best of your knowledge.
- Test cases are provided for new code paths (see test instructions).
- Any new dependencies have been scanned for license and vulnerabilities.
- CODEOWNERS updated if ownership changes.
- Documentation updated as needed.
- Update the tava architecture diagram if there is a significant design change in the PR.
- The reviewers assigned automatically/manually are appropriate for the PR.
- Please check this after reviewing the above items as appropriate for this PR.
GitHub Bot Help
To see a list of available CI bot commands, please comment /bot help.