[TRTLLM-12015][feat] Introduce KV reuse in transceiver v2#13115
Conversation
6258387 to
35bce1d
Compare
35bce1d to
cbdb40e
Compare
|
/bot run --add-multi-gpu-test --disable-fail-fast |
📝 WalkthroughWalkthroughThis PR introduces a Changes
Estimated code review effort🎯 4 (Complex) | ⏱️ ~45 minutes 🚥 Pre-merge checks | ✅ 1 | ❌ 2❌ Failed checks (2 warnings)
✅ Passed checks (1 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Comment |
|
PR_Github #43985 [ run ] triggered by Bot. Commit: |
There was a problem hiding this comment.
Actionable comments posted: 6
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
tensorrt_llm/_torch/disaggregation/transceiver.py (1)
166-197:⚠️ Potential issue | 🟠 MajorKeep
token_range.startaligned with the trimmed block list.On Line 166,
token_rangeis derived fromcached_tokensbefore any per-layer-group trimming. After Lines 187-197, a sliding-window group can start much later than that original offset, and an exact full-cache hit still leaves all blocks in place because theblock_ids.size > num_cached_blocksguard never trims when the counts are equal. The receiver later forwardstoken_range.startasstart_token_idx, so the sender can align against the wrong token offset and either skip live window blocks or retransmit a fully cached prompt. Please recompute the effective start from the surviving block range and allow the full-cache case to trim to an empty block list.🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tensorrt_llm/_torch/disaggregation/transceiver.py` around lines 166 - 197, The token_range.start needs to be recomputed after per-layer-group trimming so it matches the surviving block_ids for each layer group: after obtaining and trimming block_ids (the window_size logic and the cached-blocks trimming around adapter.get_block_ids and MambaLayerGroup handling) compute the effective first surviving token index from the trimmed block_ids (or set token_range to an empty/zero-length range when no blocks remain) and update token_range.start accordingly before it is forwarded; also change the cache-trim guard to allow trimming when block_ids.size >= num_cached_blocks (so a full-cache match can produce an empty block_ids) and use that trimmed block_ids when recomputing the start.
🧹 Nitpick comments (2)
tensorrt_llm/_torch/disaggregation/resource/cache_reuse.py (1)
4-4: Use|instead oftyping.Unionin new 3.10+ code.This module is new Python 3.10+ code, so the factory signature should use PEP 604 syntax and drop the
Unionimport.Proposed fix
-from typing import Union - import numpy as np @@ def create_cache_reuse_adapter( - mgr: Union[KVCacheManager, KVCacheManagerV2], + mgr: KVCacheManager | KVCacheManagerV2, ) -> CacheReuseAdapter:As per coding guidelines, "Use the
|syntax instead oftyping.Unionin Python type annotations."Also applies to: 126-128
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tensorrt_llm/_torch/disaggregation/resource/cache_reuse.py` at line 4, Replace the use of typing.Union by using PEP 604 |-syntax: remove the line importing "Union" and update the factory signature and any annotations (notably the factory function at the reported 126-128 region and any other annotations referencing Union) to use "TypeA | TypeB" form instead of "Union[TypeA, TypeB]"; ensure imports no longer include Union and that all updated annotations are valid Python 3.10+ syntax.tests/unittest/disaggregated/test_cache_reuse_adapter.py (1)
113-135: Add a transceiver-path regression test for full-cache/prefix cases.These tests validate
TokenRangedirectly, but they do not exerciseKvCacheTransceiverV2._create_kv_slice()/Receiver._build_recv_req_info(), which is wherestart_token_idxis actually derived from the trimmed block list. A case that goes through that path would catch exact full-cache hits and sliding-window prefix offsets, which the helper-level tests here can miss.🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tests/unittest/disaggregated/test_cache_reuse_adapter.py` around lines 113 - 135, Add a regression unit test that exercises the transceiver/receiver path (not just TokenRange) to catch full-cache and prefix-offset behavior: create a scenario that calls KvCacheTransceiverV2._create_kv_slice (or triggers it via the public transceiver send/receive flow) and Receiver._build_recv_req_info with a trimmed block list and cached_tokens equal to prompt_len and with a sliding-window prefix, then assert the produced start_token_idx and resulting TokenRange behavior (expect a ValueError for exact full-cache hit or correct start/end for prefix offsets) to ensure the same edge cases covered by TokenRange unit tests are validated end-to-end.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@tensorrt_llm/_torch/disaggregation/base/transfer.py`:
- Around line 48-49: The docstring for KVSlice's token_range incorrectly claims
a single half-open interval "[start, end) of the tokens covered by the blocks in
block_ids_per_layer_groups" while block_ids_per_layer_groups can contain layer
groups with different retained suffix lengths (mixed windowed groups); update
the KVSlice docstring and any related comments to remove the claim that
token_range fully describes tokens covered by every group and instead state the
guaranteed semantics: token_range represents the common start-offset and the
half-open interval for the base retention span (or the minimal shared prefix
span) but per-layer-group retention may differ, or move/duplicate the range
information into the per-group structure within block_ids_per_layer_groups;
reference KVSlice.token_range and block_ids_per_layer_groups and adjust tests or
docs accordingly so callers understand token_range is not a precise descriptor
for mixed windowed layer groups.
In `@tensorrt_llm/_torch/disaggregation/native/transfer.py`:
- Around line 490-523: The file is missing or has an outdated NVIDIA
copyright/header; update the top-of-file header to include the required NVIDIA
copyright block and current year (or update the year if this is a modified
file). Edit the source file containing the function _align_kv_blocks (and its
module-level header) to add the canonical NVIDIA header used across the repo,
ensuring formatting and year match project guidelines.
In `@tensorrt_llm/_torch/disaggregation/resource/cache_reuse.py`:
- Around line 1-12: Add the standard NVIDIA copyright/header block to the top of
the new module (tensorrt_llm._torch.disaggregation.resource.cache_reuse.py)
before any imports; ensure it matches the project's canonical NVIDIA header text
and includes the appropriate copyright year(s) and licensing lines, then leave
the existing imports and symbols (LlmRequest, KVCacheManager, KVCacheManagerV2,
AttentionLayerGroup, get_global_layer_ids) unchanged.
- Around line 18-24: The abstract property stubs enable_block_reuse and
tokens_per_block are written with the ellipsis on the same line, triggering
E704; expand each stub to a normal multi-line function body by moving the
ellipsis onto its own indented line (e.g. keep the decorators and signature
lines for def enable_block_reuse(self) -> bool: followed by a new indented line
with ...; do the same for def tokens_per_block(self) -> int:).
In `@tensorrt_llm/_torch/disaggregation/transceiver.py`:
- Around line 13-21: This file is missing or has an out-of-date NVIDIA copyright
header; update the top-of-file header to include the proper NVIDIA copyright
notice and current year per project guidelines. Edit the header at the very top
of the module that defines TransferWorker, TransferWorkerConfig and the
CacheReuseAdapter/create_cache_reuse_adapter imports (the transceiver module)
and add or replace the existing header block with the required NVIDIA copyright
text and updated year.
In `@tests/unittest/disaggregated/test_cache_reuse_adapter.py`:
- Around line 1-14: Update the NVIDIA copyright header year range in the file to
include 2026 by changing the trailing year value from "2025" to "2026" in the
existing header block (the SPDX copyright line and any matching year mentions at
the top of tests/unittest/disaggregated/test_cache_reuse_adapter.py).
---
Outside diff comments:
In `@tensorrt_llm/_torch/disaggregation/transceiver.py`:
- Around line 166-197: The token_range.start needs to be recomputed after
per-layer-group trimming so it matches the surviving block_ids for each layer
group: after obtaining and trimming block_ids (the window_size logic and the
cached-blocks trimming around adapter.get_block_ids and MambaLayerGroup
handling) compute the effective first surviving token index from the trimmed
block_ids (or set token_range to an empty/zero-length range when no blocks
remain) and update token_range.start accordingly before it is forwarded; also
change the cache-trim guard to allow trimming when block_ids.size >=
num_cached_blocks (so a full-cache match can produce an empty block_ids) and use
that trimmed block_ids when recomputing the start.
---
Nitpick comments:
In `@tensorrt_llm/_torch/disaggregation/resource/cache_reuse.py`:
- Line 4: Replace the use of typing.Union by using PEP 604 |-syntax: remove the
line importing "Union" and update the factory signature and any annotations
(notably the factory function at the reported 126-128 region and any other
annotations referencing Union) to use "TypeA | TypeB" form instead of
"Union[TypeA, TypeB]"; ensure imports no longer include Union and that all
updated annotations are valid Python 3.10+ syntax.
In `@tests/unittest/disaggregated/test_cache_reuse_adapter.py`:
- Around line 113-135: Add a regression unit test that exercises the
transceiver/receiver path (not just TokenRange) to catch full-cache and
prefix-offset behavior: create a scenario that calls
KvCacheTransceiverV2._create_kv_slice (or triggers it via the public transceiver
send/receive flow) and Receiver._build_recv_req_info with a trimmed block list
and cached_tokens equal to prompt_len and with a sliding-window prefix, then
assert the produced start_token_idx and resulting TokenRange behavior (expect a
ValueError for exact full-cache hit or correct start/end for prefix offsets) to
ensure the same edge cases covered by TokenRange unit tests are validated
end-to-end.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro Plus
Run ID: a142dc9e-631e-41f3-90a5-8b91d9676b79
📒 Files selected for processing (8)
tensorrt_llm/_torch/disaggregation/base/transfer.pytensorrt_llm/_torch/disaggregation/native/transfer.pytensorrt_llm/_torch/disaggregation/resource/cache_reuse.pytensorrt_llm/_torch/disaggregation/transceiver.pytensorrt_llm/_torch/pyexecutor/py_executor.pytests/unittest/disaggregated/test_cache_reuse_adapter.pytests/unittest/disaggregated/test_kv_transfer.pytests/unittest/disaggregated/test_kv_transfer_mp.py
|
PR_Github #43985 [ run ] completed with state |
4cb7527 to
6db3765
Compare
|
/bot run --add-multi-gpu-test --disable-fail-fast |
|
PR_Github #44427 [ run ] triggered by Bot. Commit: |
|
PR_Github #44427 [ run ] completed with state
|
065d012 to
49aef23
Compare
|
/bot run --add-multi-gpu-test --disable-fail-fast |
|
PR_Github #45391 [ run ] triggered by Bot. Commit: |
|
PR_Github #45391 [ run ] completed with state
|
|
PR_Github #45520 [ run ] triggered by Bot. Commit: |
|
PR_Github #45520 [ run ] completed with state
|
4bf153f to
97de349
Compare
Signed-off-by: Shixiaowei02 <39303645+Shixiaowei02@users.noreply.github.com>
Signed-off-by: Shixiaowei02 <39303645+Shixiaowei02@users.noreply.github.com>
97de349 to
cc8f3f7
Compare
|
/bot run --add-multi-gpu-test --disable-fail-fast |
|
PR_Github #46893 [ run ] triggered by Bot. Commit: |
|
PR_Github #46893 [ run ] completed with state
|
|
/bot run --stage-list "DGX_B200-4_GPUs-PyTorch-1, DGX_B200-4_GPUs-PyTorch-2" --disable-fail-fast |
|
PR_Github #47118 [ run ] triggered by Bot. Commit: |
|
PR_Github #47118 [ run ] completed with state |
|
/bot skip --comment "CI passed in the assembled form." |
|
PR_Github #47176 [ skip ] triggered by Bot. Commit: |
|
PR_Github #47176 [ skip ] completed with state |
Signed-off-by: Shixiaowei02 <39303645+Shixiaowei02@users.noreply.github.com>
Summary by CodeRabbit
New Features
Bug Fixes
Tests
Limitation
In V2,
_KVCache.num_committed_tokensis a single per-sequence scalar, so_setup_for_reuseintersects all layer-group constraints into one global cutoff. For hybrid models with a short SWA window, the SWA check often truncates the match to 0, dragging full-attention layers down with it, even though their pages are still in the prefix tree.The transceiver inherits this:
cached_tokensandKVSlice.token_rangeare also single values, so every layer group shares one reuse boundary. Lifting this requires per-group commit lengths in the manager and per-group token ranges in the transfer protocol.Description
Test Coverage
PR Checklist
Please review the following before submitting your PR:
PR description clearly explains what and why. If using CodeRabbit's summary, please make sure it makes sense.
PR Follows TRT-LLM CODING GUIDELINES to the best of your knowledge.
Test cases are provided for new code paths (see test instructions)
Any new dependencies have been scanned for license and vulnerabilities
CODEOWNERS updated if ownership changes
Documentation updated as needed
Update tava architecture diagram if there is a significant design change in PR.
The reviewers assigned automatically/manually are appropriate for the PR.
Please check this after reviewing the above items as appropriate for this PR.
GitHub Bot Help
To see a list of available CI bot commands, please comment
/bot help.