[None][fix] KVCacheManagerV2 bug fixes (V2 remains default OFF) #12306
yizhang-nv merged 4 commits into NVIDIA:main
Conversation
📝 Walkthrough

Two files were modified to add multimodal token augmentation support for block reuse in KV cache management, and to refine the fallback conditions that trigger a downgrade from KVCacheManagerV2 to KVCacheManager when a cache_transceiver_config with a non-None backend is present.
Sequence Diagram(s)

```mermaid
sequenceDiagram
participant Request as LlmRequest
participant Manager as KVCacheManager/<br/>KVCacheManagerV2
participant Augment as _augment_tokens_<br/>for_block_reuse
participant GenTokens as gen_multi_modal_<br/>tokens
participant Cache as KV Cache/<br/>Commit
Request->>Manager: prepare_resources()
alt Block reuse enabled & context chunk is first
Manager->>Augment: _augment_tokens_for_block_reuse(tokens)
Augment->>GenTokens: gen_multi_modal_tokens(multimodal regions)
GenTokens-->>Augment: embedding content digests
Augment-->>Manager: augmented tokens
Manager->>Cache: _create_kv_cache(augmented_tokens)
else Block reuse disabled
Manager->>Cache: _create_kv_cache(original_tokens)
end
Request->>Manager: update_resources()
alt Block reuse enabled & uncommitted tokens exist
Manager->>Augment: _augment_tokens_for_block_reuse(committed_range)
Augment->>GenTokens: gen_multi_modal_tokens()
GenTokens-->>Augment: embedding content digests
Augment-->>Manager: augmented tokens
Manager->>Cache: commit(augmented_tokens)
else
Manager->>Cache: commit(original_tokens)
end
```
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~20 minutes

🚥 Pre-merge checks: ❌ Failed checks (1 warning)
✅ Passed checks (2 passed)

✏️ Tip: You can configure your own custom pre-merge checks in the settings.
Actionable comments posted: 3
🧹 Nitpick comments (1)
tensorrt_llm/_torch/pyexecutor/resource_manager.py (1)
39-40: Use a module import here to keep namespace-qualified usage. This new direct function import diverges from the repo's Python import rule and makes internal API provenance less clear.
💡 Suggested diff

```diff
-from tensorrt_llm.runtime.kv_cache_manager_v2._block_radix_tree import \
-    gen_multi_modal_tokens
+from tensorrt_llm.runtime.kv_cache_manager_v2 import \
+    _block_radix_tree

-        mm_tokens = gen_multi_modal_tokens(self.vocab_size, digest, length)
+        mm_tokens = _block_radix_tree.gen_multi_modal_tokens(
+            self.vocab_size, digest, length)
```

As per coding guidelines: "When importing in Python, always maintain the namespace. Import the module, not individual classes or functions (e.g., use `from package.subpackage import foo`, then `foo.SomeClass()`)."

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tensorrt_llm/_torch/pyexecutor/resource_manager.py` around lines 39 - 40, Replace the direct function import of gen_multi_modal_tokens with a module import so the callsite remains namespace-qualified: import the module tensorrt_llm.runtime.kv_cache_manager_v2._block_radix_tree (instead of "from ... import gen_multi_modal_tokens") and update all uses of gen_multi_modal_tokens in resource_manager.py to call _block_radix_tree.gen_multi_modal_tokens so the internal API provenance is preserved.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@tensorrt_llm/_torch/pyexecutor/_util.py`:
- Around line 127-129: The fallback warning for KVCacheManagerV2 is missing the
new cache transceiver condition; update the logger.warning message (the call
that currently mentions kv_connector_manager, beam width, and event buffer max
size) to also reference the cache transceiver condition (e.g., "cache
transceiver enabled" or the actual flag name used in code such as
cache_transceiver or cache_transceiver_enabled) so the predicate matches the
real checks that trigger the fallback from KVCacheManagerV2 to KVCacheManager.
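A minimal sketch of what the updated predicate-plus-warning pairing could look like; `llm_args`, its attribute names, and `logger` are assumptions taken from the comment's wording, not the actual code in `_util.py`:

```python
def should_fall_back_to_v1(llm_args, logger) -> bool:
    """Illustrative fallback check; the attribute names are hypothetical."""
    reasons = []
    if llm_args.kv_connector_manager is not None:
        reasons.append("kv_connector_manager is set")
    if llm_args.max_beam_width > 1:
        reasons.append("beam width > 1")
    if llm_args.kv_cache_config.event_buffer_max_size > 0:
        reasons.append("event buffer max size > 0")
    transceiver = llm_args.cache_transceiver_config
    if transceiver is not None and transceiver.backend is not None:
        # The new condition the warning message should also mention.
        reasons.append("cache transceiver enabled")
    if reasons:
        logger.warning(
            "Falling back from KVCacheManagerV2 to KVCacheManager: "
            + "; ".join(reasons))
        return True
    return False
```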
In `@tensorrt_llm/_torch/pyexecutor/resource_manager.py`:
- Around line 2452-2457: free_resources() currently calls kv_cache.commit(...)
with raw tokens when committing an uncommitted tail, which can mix non-canonical
radix entries; update free_resources() to mirror the other commit path by
passing tokens through _augment_tokens_for_block_reuse before calling
kv_cache.commit. Specifically, locate the commit call in free_resources() and
replace the raw tokens argument (from req.get_tokens(...) or similar) with a
call to self._augment_tokens_for_block_reuse(req.get_tokens(DEFAULT_BEAM_INDEX),
req, start=kv_cache.num_committed_tokens, end=req.context_current_position) so
the same canonicalization logic used at the Line 2452–2457 path is applied
consistently.
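A hedged sketch of the change this prompt describes, mirroring the augmented commit path; the cache lookup and surrounding class attributes are assumptions based on the comment's wording, not verified signatures:

```python
def free_resources(self, req):
    """Illustrative only: commit the uncommitted tail with augmented tokens."""
    kv_cache = self.kv_caches[req.request_id]  # hypothetical lookup
    end = req.context_current_position
    if end > kv_cache.num_committed_tokens:
        tokens = self._augment_tokens_for_block_reuse(
            req.get_tokens(DEFAULT_BEAM_INDEX),
            req,
            start=kv_cache.num_committed_tokens,
            end=end,
        )
        # Same canonicalization as the prepare_resources() commit path,
        # so no non-canonical radix entries are mixed in.
        kv_cache.commit(tokens)
```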
- Around line 1967-1969: The for-loop that iterates over req.multimodal_hashes,
req.multimodal_positions, and req.multimodal_lengths uses zip() which silently
truncates mismatched lists; change the zip call to zip(req.multimodal_hashes,
req.multimodal_positions, req.multimodal_lengths, strict=True) so Python raises
on length mismatches and prevents silent token-augmentation errors; update the
loop where those symbols (req.multimodal_hashes, req.multimodal_positions,
req.multimodal_lengths) are zipped to include strict=True and run tests to
confirm a ValueError is raised for mismatched lengths.
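The `strict=True` behavior (Python 3.10+) is easy to demonstrate in isolation; the list contents below are made up:

```python
hashes = [b"h0", b"h1"]
positions = [0, 128]
lengths = [64]  # deliberately one entry short

try:
    for h, pos, length in zip(hashes, positions, lengths, strict=True):
        pass
except ValueError as exc:
    # Raises instead of silently truncating to the shortest input.
    print(exc)  # zip() argument 3 is shorter than arguments 1-2
```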
---
Nitpick comments:
In `@tensorrt_llm/_torch/pyexecutor/resource_manager.py`:
- Around line 39-40: Replace the direct function import of
gen_multi_modal_tokens with a module import so the callsite remains
namespace-qualified: import the module
tensorrt_llm.runtime.kv_cache_manager_v2._block_radix_tree (instead of "from ...
import gen_multi_modal_tokens") and update all uses of gen_multi_modal_tokens in
resource_manager.py to call _block_radix_tree.gen_multi_modal_tokens so the
internal API provenance is preserved.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: c87a564d-166c-41ce-b64b-6deba6c4f38b
📒 Files selected for processing (2)
tensorrt_llm/_torch/pyexecutor/_util.py
tensorrt_llm/_torch/pyexecutor/resource_manager.py
@lowsfer could you review this? thanks
Force-pushed from 5bb15e3 to 235c426
Force-pushed from 8149f4c to 2f60b5a
Force-pushed from 6547686 to 8c6cc5e
Force-pushed from 8c6cc5e to fdf7bb0
/bot run --disable-fail-fast
PR_Github #40046 [ run ] triggered by Bot. Commit:
Force-pushed from fdf7bb0 to ceec81d
PR_Github #40046 [ run ] completed with state
Force-pushed from 450e42d to 0b503ab
/bot run
PR_Github #42321 [ run ] triggered by Bot. Commit:
PR_Github #42321 [ run ] completed with state
/bot run
/bot kill
PR_Github #44353 [ run ] completed with state
Force-pushed from 82238df to ea80f05
/bot run --disable-fail-fast
PR_Github #44383 [ run ] triggered by Bot. Commit:
/bot run --disable-fail-fast
PR_Github #44459 [ run ] triggered by Bot. Commit:
PR_Github #44383 [ run ] completed with state
PR_Github #44459 [ run ] completed with state
Force-pushed from ea80f05 to 9a12dcf
/bot run --disable-fail-fast
PR_Github #44658 [ run ] triggered by Bot. Commit:
Force-pushed from 9a12dcf to 11302e6
/bot run --disable-fail-fast
PR_Github #44701 [ run ] triggered by Bot. Commit:
PR_Github #44701 [ run ] completed with state
/bot run --disable-fail-fast
Port comprehensive bug fixes for KVCacheManagerV2 while keeping V2 default OFF. The fallback mechanism now warns instead of silently switching. Fixes: partial block rebase corruption, max_blocks_per_seq undercount, multimodal block reuse, host cache tier auto-provisioning, CUDA graph capture/replay safety, draft KV cache stream and resource release, two-phase scheduler with PEFT pre-claim and deadlock detection, speculative decoding padding KV cache rewind mismatch, and float type propagation in token estimation.

Signed-off-by: Yi Zhang <187001205+yizhang-nv@users.noreply.github.com>
Remove _extend_kv_cache_for_padding from ModelDrafter and NGramDrafter (padding is now handled in the drafter base class), simplify scheduler_v2 and resource_manager interfaces, and remove leftover debug prints from KV cache manager V2.

Signed-off-by: Yi Zhang <187001205+yizhang-nv@users.noreply.github.com>
…d correctness fixes

- Consolidate CUDA graph skip logic into CUDAGraphRunner.maybe_get_cuda_graph
- Fix get_num_available_tokens over-subtraction of extra_tokens
- Unify get_batch_cache_indices parameter name (layer_id -> layer_idx)
- Add defensive assert for multimodal hash length
- Update warning message to include cache transceiver condition
- Restore upstream _is_prop_supported implementation
- Fix copyright year, TYPE_CHECKING import, return type annotation

Signed-off-by: Yi Zhang <yizhang@nvidia.com>
Signed-off-by: Yi Zhang <187001205+yizhang-nv@users.noreply.github.com>
Port test_encoder_counts_toward_batch fix from enable-v2-by-default-v2: increase max_batch_size to 2 so the gen request wins phase 1 and the encoder fills the remaining slot in phase 2, verifying that the encoder correctly counts toward batch size.

Signed-off-by: Yi Zhang <187001205+yizhang-nv@users.noreply.github.com>
Force-pushed from 11302e6 to 53ce2fd
/bot run --disable-fail-fast
PR_Github #44935 [ run ] triggered by Bot. Commit:
/bot run --disable-fail-fast
PR_Github #45081 [ run ] triggered by Bot. Commit:
PR_Github #45081 [ run ] completed with state
/bot run --disable-fail-fast
PR_Github #45297 [ run ] triggered by Bot. Commit:
PR_Github #45297 [ run ] completed with state
@coderabbitai summary
Description
Port comprehensive bug fixes for KVCacheManagerV2 while keeping V2 default OFF (use_kv_cache_manager_v2=False). The fallback mechanism now warns instead of silently switching to V1.

Fixes
KV Cache Core
- Partial rebase corruption in `_kv_cache.py`, where a full tree block could corrupt shared pages during partial rebase (added an `is_full` guard).
- `get_batch_cache_indices` default `layer_id`: made `layer_id` optional (default pool 0) to avoid a mandatory argument when the layer is irrelevant.
- `max_blocks_per_seq` undercount: computed after `max_seq_len` clamping and now accounts for `num_extra_kv_tokens + max_total_draft_tokens` so the host page-index buffer is large enough (see the sketch below).
- Multimodal block reuse: `_augment_tokens_for_block_reuse` generates distinct tree entries from the embedding content digest (Blake3 hash).
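A worked sketch of the corrected sizing arithmetic, assuming the names from the bullet above; the manager's real computation may differ:

```python
import math

def max_blocks_per_seq(max_seq_len: int, num_extra_kv_tokens: int,
                       max_total_draft_tokens: int,
                       tokens_per_block: int) -> int:
    # Apply the extras after max_seq_len clamping so the host
    # page-index buffer covers the worst case.
    total = max_seq_len + num_extra_kv_tokens + max_total_draft_tokens
    return math.ceil(total / tokens_per_block)

# Three draft tokens push the sequence over a block boundary:
print(max_blocks_per_seq(8192, 0, 3, 64))  # 129, not 128
```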
KV Cache Manager

- Host cache tier auto-provisioning: the `MAX_UTILIZATION` scheduler relies on suspend/resume; without a host tier, suspended pages have nowhere to go, causing deadlock. Now auto-provisions a host tier matching the GPU quota, capped at 50% of available host memory (see the sketch below).
- `extend_capacity_for_tokens`: added to both `KVCacheManager` and `KVCacheManagerV2` for CUDA graph padding token capacity extension.
- Wrapped the `max_num_tokens_in_memory` computation with `int()` to prevent float propagation.
- Fixed `resume()` to use `draft_kv_cache_manager._stream` instead of `torch.cuda.current_stream()`.
- Extended `release_resources` to also free the draft KV cache on context scheduling failure.
- Use `None` capacity for the context phase, removing the fragile draft OOM workaround.
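A minimal sketch of the auto-provision sizing rule described above; using `psutil` for host memory is an illustrative assumption, not necessarily what the manager does:

```python
import psutil

def host_tier_bytes(gpu_quota_bytes: int) -> int:
    # Match the GPU quota, but never claim more than half of the
    # currently available host memory.
    available = psutil.virtual_memory().available
    return min(gpu_quota_bytes, available // 2)
```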
Scheduler V2

- PEFT pre-claim for `GENERATION_TO_COMPLETE`: pre-claims PEFT pages for requests whose adapters haven't been released yet (overlap executor timing).
- Deadlock detection: raises `RuntimeError` when generation requests exist but none can be scheduled or evicted (KV cache exhausted with no host tier); see the sketch below.
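A hedged sketch of the deadlock detection described above; the names are illustrative, not the scheduler's actual API:

```python
def assert_scheduler_progress(pending_generation: int, scheduled: int,
                              evictable: int) -> None:
    # Fail loudly instead of spinning forever when no progress is possible.
    if pending_generation > 0 and scheduled == 0 and evictable == 0:
        raise RuntimeError(
            "Scheduler deadlock: generation requests exist but none can "
            "be scheduled or evicted (KV cache exhausted, no host tier).")
```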
CUDA Graph

- Added a `_capture_allowed` flag and an `allow_capture()` context manager to prevent workspace tensor reallocation from invalidating existing graphs (sketched below).
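A minimal sketch of the `_capture_allowed` / `allow_capture()` pattern described above, with a hypothetical workspace check; the actual CUDAGraphRunner integration is not shown:

```python
import contextlib

class GraphWorkspace:
    def __init__(self) -> None:
        self._capture_allowed = False

    @contextlib.contextmanager
    def allow_capture(self):
        # Only within this scope may workspace tensors be (re)allocated.
        self._capture_allowed = True
        try:
            yield
        finally:
            self._capture_allowed = False

    def ensure_capacity(self, needed: int, current: int) -> None:
        if needed > current and not self._capture_allowed:
            # Reallocating outside capture would leave already-captured
            # graphs pointing at freed memory.
            raise RuntimeError("workspace growth requires allow_capture()")
```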
Speculative Decoding

- `pad_draft_tokens_for_cuda_graph` now extends KV cache capacity to match the padded length, preventing rewind underflow. Implemented in `drafter.py`, `model_drafter.py` (two-model), and `ngram.py` (see the sketch below).
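A hedged sketch of the padding rule described above: extend KV capacity by the same amount the draft tokens are padded, so the later rewind stays in bounds. `extend_capacity_for_tokens` is named in this description; everything else here is illustrative:

```python
def pad_draft_tokens_for_cuda_graph(draft_tokens, target_len,
                                    kv_cache_manager, request,
                                    pad_id=0):
    pad = target_len - len(draft_tokens)
    if pad > 0:
        draft_tokens = draft_tokens + [pad_id] * pad
        # Keep KV capacity in sync with the padded length so the
        # post-iteration rewind does not underflow.
        kv_cache_manager.extend_capacity_for_tokens(request, pad)
    return draft_tokens
```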
Fallback

- The downgrade from KVCacheManagerV2 to KVCacheManager now emits a warning naming the triggering condition (kv_connector_manager, beam width, event buffer max size, or a cache transceiver backend) instead of happening silently.

Test Coverage
tests/unittest/

PR Checklist
GitHub Bot Help
To see a list of available CI bot commands, please comment `/bot help`.