Skip to content

[TRTLLM-12015][feat] Introduce KV reuse in transceiver v2#13115

Merged
Shixiaowei02 merged 2 commits into
NVIDIA:mainfrom
Shixiaowei02:user/xiaoweis/ctx-cache-reuse
May 8, 2026
Merged

[TRTLLM-12015][feat] Introduce KV reuse in transceiver v2#13115
Shixiaowei02 merged 2 commits into
NVIDIA:mainfrom
Shixiaowei02:user/xiaoweis/ctx-cache-reuse

Conversation

@Shixiaowei02
Copy link
Copy Markdown
Collaborator

@Shixiaowei02 Shixiaowei02 commented Apr 16, 2026

Summary by CodeRabbit

  • New Features

    • Added cache reuse adapter abstraction for improved KV block management across cache manager versions.
    • Enhanced KV slice token range tracking to support cached token accounting in disaggregated execution.
  • Bug Fixes

    • Improved block alignment logic for speculative decoding in distributed KV cache scenarios.
  • Tests

    • Added comprehensive tests for KV cache block alignment and token range handling in disaggregated transfer workflows.

Limitation

In V2, _KVCache.num_committed_tokens is a single per-sequence scalar, so _setup_for_reuse intersects all layer-group constraints into one global cutoff. For hybrid models with a short SWA window, the SWA check often truncates the match to 0, dragging full-attention layers down with it, even though their pages are still in the prefix tree.

The transceiver inherits this: cached_tokens and KVSlice.token_range are also single values, so every layer group shares one reuse boundary. Lifting this requires per-group commit lengths in the manager and per-group token ranges in the transfer protocol.

Description

Test Coverage

PR Checklist

Please review the following before submitting your PR:

  • PR description clearly explains what and why. If using CodeRabbit's summary, please make sure it makes sense.

  • PR Follows TRT-LLM CODING GUIDELINES to the best of your knowledge.

  • Test cases are provided for new code paths (see test instructions)

  • Any new dependencies have been scanned for license and vulnerabilities

  • CODEOWNERS updated if ownership changes

  • Documentation updated as needed

  • Update tava architecture diagram if there is a significant design change in PR.

  • The reviewers assigned automatically/manually are appropriate for the PR.

  • Please check this after reviewing the above items as appropriate for this PR.

GitHub Bot Help

To see a list of available CI bot commands, please comment /bot help.

@Shixiaowei02 Shixiaowei02 changed the title [TRTLLM-11635][feat] Introduce KV reuse in transceiver v2 [TRTLLM-12015][feat] Introduce KV reuse in transceiver v2 Apr 16, 2026
@Shixiaowei02 Shixiaowei02 force-pushed the user/xiaoweis/ctx-cache-reuse branch from 6258387 to 35bce1d Compare April 16, 2026 09:00
@Shixiaowei02 Shixiaowei02 marked this pull request as ready for review April 17, 2026 06:24
@Shixiaowei02 Shixiaowei02 requested review from a team as code owners April 17, 2026 06:24
@Shixiaowei02 Shixiaowei02 force-pushed the user/xiaoweis/ctx-cache-reuse branch from 35bce1d to cbdb40e Compare April 17, 2026 06:24
@Shixiaowei02
Copy link
Copy Markdown
Collaborator Author

/bot run --add-multi-gpu-test --disable-fail-fast

@coderabbitai
Copy link
Copy Markdown
Contributor

coderabbitai Bot commented Apr 17, 2026

📝 Walkthrough

Walkthrough

This PR introduces a CacheReuseAdapter abstraction to normalize KV block reuse behavior across cache manager versions, refactors block alignment logic to support speculative decoding with explicit token ranges, and updates the transceiver and executor to integrate cache reuse tracking. It includes corresponding test coverage for token range and block alignment scenarios.

Changes

Cohort / File(s) Summary
KV Transfer Core
tensorrt_llm/_torch/disaggregation/base/transfer.py, tensorrt_llm/_torch/disaggregation/native/transfer.py
Updated KVSlice docstring to clarify token_range as half-open interval. Replaced _filter_kv_blocks with _align_kv_blocks to slice src/dst block-id arrays based on token offsets. Extended RecvReqInfo with start_token_idx propagation. Updated RxSession.process_kv_agent_result to accept sender_slice_id and introduced _terminal_status to prevent overwriting non-ERROR terminal states.
Cache Reuse Abstraction
tensorrt_llm/_torch/disaggregation/resource/cache_reuse.py
Introduced CacheReuseAdapter ABC with uniform API for block-reuse operations across KVCacheManager V1 and V2. Concrete adapters delegate to C++-backed methods (V1) or Python kv_cache_map (V2). Factory function create_cache_reuse_adapter selects adapter based on manager type.
Transceiver & Executor Integration
tensorrt_llm/_torch/disaggregation/transceiver.py, tensorrt_llm/_torch/pyexecutor/py_executor.py
Integrated CacheReuseAdapter into transceiver to compute cached token/block counts and derive per-layer-group block_ids. Refactored context send flow with _finalize_send for auxiliary packing. Added executor call to commit_blocks_for_reuse during generation transmission completion.
Test Coverage
tests/unittest/disaggregated/test_cache_reuse_adapter.py, tests/unittest/disaggregated/test_kv_transfer.py, tests/unittest/disaggregated/test_kv_transfer_mp.py
Added new test module for cache reuse adapter covering block alignment scenarios with/without prefix cache. Updated existing KV transfer tests to include explicit token_range in KVSlice construction. Added end-to-end test for generation with prefix offset across V1/V2 cache managers.

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

🚥 Pre-merge checks | ✅ 1 | ❌ 2

❌ Failed checks (2 warnings)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 32.76% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
Description check ⚠️ Warning The PR description is incomplete and lacks essential sections required by the template. Complete the PR description by filling in: the title with proper format [ticket][type], the Description section explaining what and why, and the Test Coverage section listing relevant tests.
✅ Passed checks (1 passed)
Check name Status Explanation
Title check ✅ Passed The PR title clearly identifies the main feature: introducing KV reuse in transceiver v2. It follows the required format with JIRA ticket [TRTLLM-12015], type [feat], and a concise summary.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Comment @coderabbitai help to get the list of available commands and usage tips.

@tensorrt-cicd
Copy link
Copy Markdown
Collaborator

PR_Github #43985 [ run ] triggered by Bot. Commit: cbdb40e Link to invocation

Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 6

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
tensorrt_llm/_torch/disaggregation/transceiver.py (1)

166-197: ⚠️ Potential issue | 🟠 Major

Keep token_range.start aligned with the trimmed block list.

On Line 166, token_range is derived from cached_tokens before any per-layer-group trimming. After Lines 187-197, a sliding-window group can start much later than that original offset, and an exact full-cache hit still leaves all blocks in place because the block_ids.size > num_cached_blocks guard never trims when the counts are equal. The receiver later forwards token_range.start as start_token_idx, so the sender can align against the wrong token offset and either skip live window blocks or retransmit a fully cached prompt. Please recompute the effective start from the surviving block range and allow the full-cache case to trim to an empty block list.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tensorrt_llm/_torch/disaggregation/transceiver.py` around lines 166 - 197,
The token_range.start needs to be recomputed after per-layer-group trimming so
it matches the surviving block_ids for each layer group: after obtaining and
trimming block_ids (the window_size logic and the cached-blocks trimming around
adapter.get_block_ids and MambaLayerGroup handling) compute the effective first
surviving token index from the trimmed block_ids (or set token_range to an
empty/zero-length range when no blocks remain) and update token_range.start
accordingly before it is forwarded; also change the cache-trim guard to allow
trimming when block_ids.size >= num_cached_blocks (so a full-cache match can
produce an empty block_ids) and use that trimmed block_ids when recomputing the
start.
🧹 Nitpick comments (2)
tensorrt_llm/_torch/disaggregation/resource/cache_reuse.py (1)

4-4: Use | instead of typing.Union in new 3.10+ code.

This module is new Python 3.10+ code, so the factory signature should use PEP 604 syntax and drop the Union import.

Proposed fix
-from typing import Union
-
 import numpy as np
@@
 def create_cache_reuse_adapter(
-    mgr: Union[KVCacheManager, KVCacheManagerV2],
+    mgr: KVCacheManager | KVCacheManagerV2,
 ) -> CacheReuseAdapter:

As per coding guidelines, "Use the | syntax instead of typing.Union in Python type annotations."

Also applies to: 126-128

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tensorrt_llm/_torch/disaggregation/resource/cache_reuse.py` at line 4,
Replace the use of typing.Union by using PEP 604 |-syntax: remove the line
importing "Union" and update the factory signature and any annotations (notably
the factory function at the reported 126-128 region and any other annotations
referencing Union) to use "TypeA | TypeB" form instead of "Union[TypeA, TypeB]";
ensure imports no longer include Union and that all updated annotations are
valid Python 3.10+ syntax.
tests/unittest/disaggregated/test_cache_reuse_adapter.py (1)

113-135: Add a transceiver-path regression test for full-cache/prefix cases.

These tests validate TokenRange directly, but they do not exercise KvCacheTransceiverV2._create_kv_slice() / Receiver._build_recv_req_info(), which is where start_token_idx is actually derived from the trimmed block list. A case that goes through that path would catch exact full-cache hits and sliding-window prefix offsets, which the helper-level tests here can miss.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/unittest/disaggregated/test_cache_reuse_adapter.py` around lines 113 -
135, Add a regression unit test that exercises the transceiver/receiver path
(not just TokenRange) to catch full-cache and prefix-offset behavior: create a
scenario that calls KvCacheTransceiverV2._create_kv_slice (or triggers it via
the public transceiver send/receive flow) and Receiver._build_recv_req_info with
a trimmed block list and cached_tokens equal to prompt_len and with a
sliding-window prefix, then assert the produced start_token_idx and resulting
TokenRange behavior (expect a ValueError for exact full-cache hit or correct
start/end for prefix offsets) to ensure the same edge cases covered by
TokenRange unit tests are validated end-to-end.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@tensorrt_llm/_torch/disaggregation/base/transfer.py`:
- Around line 48-49: The docstring for KVSlice's token_range incorrectly claims
a single half-open interval "[start, end) of the tokens covered by the blocks in
block_ids_per_layer_groups" while block_ids_per_layer_groups can contain layer
groups with different retained suffix lengths (mixed windowed groups); update
the KVSlice docstring and any related comments to remove the claim that
token_range fully describes tokens covered by every group and instead state the
guaranteed semantics: token_range represents the common start-offset and the
half-open interval for the base retention span (or the minimal shared prefix
span) but per-layer-group retention may differ, or move/duplicate the range
information into the per-group structure within block_ids_per_layer_groups;
reference KVSlice.token_range and block_ids_per_layer_groups and adjust tests or
docs accordingly so callers understand token_range is not a precise descriptor
for mixed windowed layer groups.

In `@tensorrt_llm/_torch/disaggregation/native/transfer.py`:
- Around line 490-523: The file is missing or has an outdated NVIDIA
copyright/header; update the top-of-file header to include the required NVIDIA
copyright block and current year (or update the year if this is a modified
file). Edit the source file containing the function _align_kv_blocks (and its
module-level header) to add the canonical NVIDIA header used across the repo,
ensuring formatting and year match project guidelines.

In `@tensorrt_llm/_torch/disaggregation/resource/cache_reuse.py`:
- Around line 1-12: Add the standard NVIDIA copyright/header block to the top of
the new module (tensorrt_llm._torch.disaggregation.resource.cache_reuse.py)
before any imports; ensure it matches the project's canonical NVIDIA header text
and includes the appropriate copyright year(s) and licensing lines, then leave
the existing imports and symbols (LlmRequest, KVCacheManager, KVCacheManagerV2,
AttentionLayerGroup, get_global_layer_ids) unchanged.
- Around line 18-24: The abstract property stubs enable_block_reuse and
tokens_per_block are written with the ellipsis on the same line, triggering
E704; expand each stub to a normal multi-line function body by moving the
ellipsis onto its own indented line (e.g. keep the decorators and signature
lines for def enable_block_reuse(self) -> bool: followed by a new indented line
with ...; do the same for def tokens_per_block(self) -> int:).

In `@tensorrt_llm/_torch/disaggregation/transceiver.py`:
- Around line 13-21: This file is missing or has an out-of-date NVIDIA copyright
header; update the top-of-file header to include the proper NVIDIA copyright
notice and current year per project guidelines. Edit the header at the very top
of the module that defines TransferWorker, TransferWorkerConfig and the
CacheReuseAdapter/create_cache_reuse_adapter imports (the transceiver module)
and add or replace the existing header block with the required NVIDIA copyright
text and updated year.

In `@tests/unittest/disaggregated/test_cache_reuse_adapter.py`:
- Around line 1-14: Update the NVIDIA copyright header year range in the file to
include 2026 by changing the trailing year value from "2025" to "2026" in the
existing header block (the SPDX copyright line and any matching year mentions at
the top of tests/unittest/disaggregated/test_cache_reuse_adapter.py).

---

Outside diff comments:
In `@tensorrt_llm/_torch/disaggregation/transceiver.py`:
- Around line 166-197: The token_range.start needs to be recomputed after
per-layer-group trimming so it matches the surviving block_ids for each layer
group: after obtaining and trimming block_ids (the window_size logic and the
cached-blocks trimming around adapter.get_block_ids and MambaLayerGroup
handling) compute the effective first surviving token index from the trimmed
block_ids (or set token_range to an empty/zero-length range when no blocks
remain) and update token_range.start accordingly before it is forwarded; also
change the cache-trim guard to allow trimming when block_ids.size >=
num_cached_blocks (so a full-cache match can produce an empty block_ids) and use
that trimmed block_ids when recomputing the start.

---

Nitpick comments:
In `@tensorrt_llm/_torch/disaggregation/resource/cache_reuse.py`:
- Line 4: Replace the use of typing.Union by using PEP 604 |-syntax: remove the
line importing "Union" and update the factory signature and any annotations
(notably the factory function at the reported 126-128 region and any other
annotations referencing Union) to use "TypeA | TypeB" form instead of
"Union[TypeA, TypeB]"; ensure imports no longer include Union and that all
updated annotations are valid Python 3.10+ syntax.

In `@tests/unittest/disaggregated/test_cache_reuse_adapter.py`:
- Around line 113-135: Add a regression unit test that exercises the
transceiver/receiver path (not just TokenRange) to catch full-cache and
prefix-offset behavior: create a scenario that calls
KvCacheTransceiverV2._create_kv_slice (or triggers it via the public transceiver
send/receive flow) and Receiver._build_recv_req_info with a trimmed block list
and cached_tokens equal to prompt_len and with a sliding-window prefix, then
assert the produced start_token_idx and resulting TokenRange behavior (expect a
ValueError for exact full-cache hit or correct start/end for prefix offsets) to
ensure the same edge cases covered by TokenRange unit tests are validated
end-to-end.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro Plus

Run ID: a142dc9e-631e-41f3-90a5-8b91d9676b79

📥 Commits

Reviewing files that changed from the base of the PR and between f67eba2 and cbdb40e.

📒 Files selected for processing (8)
  • tensorrt_llm/_torch/disaggregation/base/transfer.py
  • tensorrt_llm/_torch/disaggregation/native/transfer.py
  • tensorrt_llm/_torch/disaggregation/resource/cache_reuse.py
  • tensorrt_llm/_torch/disaggregation/transceiver.py
  • tensorrt_llm/_torch/pyexecutor/py_executor.py
  • tests/unittest/disaggregated/test_cache_reuse_adapter.py
  • tests/unittest/disaggregated/test_kv_transfer.py
  • tests/unittest/disaggregated/test_kv_transfer_mp.py

Comment thread tensorrt_llm/_torch/disaggregation/base/transfer.py Outdated
Comment thread tensorrt_llm/_torch/disaggregation/native/transfer.py
Comment thread tensorrt_llm/_torch/disaggregation/resource/cache_reuse.py
Comment thread tensorrt_llm/_torch/disaggregation/resource/cache_reuse.py
Comment thread tensorrt_llm/_torch/disaggregation/transceiver.py
Comment thread tests/unittest/disaggregated/test_cache_reuse_adapter.py Outdated
@tensorrt-cicd
Copy link
Copy Markdown
Collaborator

PR_Github #43985 [ run ] completed with state ABORTED. Commit: cbdb40e

Link to invocation

@Shixiaowei02 Shixiaowei02 force-pushed the user/xiaoweis/ctx-cache-reuse branch 2 times, most recently from 4cb7527 to 6db3765 Compare April 20, 2026 09:24
@Shixiaowei02
Copy link
Copy Markdown
Collaborator Author

/bot run --add-multi-gpu-test --disable-fail-fast

@tensorrt-cicd
Copy link
Copy Markdown
Collaborator

PR_Github #44427 [ run ] triggered by Bot. Commit: 6db3765 Link to invocation

@tensorrt-cicd
Copy link
Copy Markdown
Collaborator

PR_Github #44427 [ run ] completed with state SUCCESS. Commit: 6db3765
/LLM/main/L0_MergeRequest_PR pipeline #34838 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

Link to invocation

Comment thread tensorrt_llm/_torch/disaggregation/native/transfer.py Outdated
@Shixiaowei02 Shixiaowei02 force-pushed the user/xiaoweis/ctx-cache-reuse branch 2 times, most recently from 065d012 to 49aef23 Compare April 24, 2026 11:46
@Shixiaowei02
Copy link
Copy Markdown
Collaborator Author

/bot run --add-multi-gpu-test --disable-fail-fast

@tensorrt-cicd
Copy link
Copy Markdown
Collaborator

PR_Github #45391 [ run ] triggered by Bot. Commit: 49aef23 Link to invocation

@tensorrt-cicd
Copy link
Copy Markdown
Collaborator

PR_Github #45391 [ run ] completed with state SUCCESS. Commit: 49aef23
/LLM/main/L0_MergeRequest_PR pipeline #35631 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

Link to invocation

@tensorrt-cicd
Copy link
Copy Markdown
Collaborator

PR_Github #45520 [ run ] triggered by Bot. Commit: 6942b31 Link to invocation

@tensorrt-cicd
Copy link
Copy Markdown
Collaborator

PR_Github #45520 [ run ] completed with state SUCCESS. Commit: 6942b31
/LLM/main/L0_MergeRequest_PR pipeline #35744 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

Link to invocation

Comment thread tensorrt_llm/_torch/disaggregation/native/transfer.py Outdated
Comment thread tensorrt_llm/_torch/disaggregation/transceiver.py Outdated
Comment thread tensorrt_llm/_torch/disaggregation/resource/cache_reuse.py Outdated
Comment thread tensorrt_llm/_torch/disaggregation/resource/cache_reuse.py
@Shixiaowei02 Shixiaowei02 force-pushed the user/xiaoweis/ctx-cache-reuse branch 2 times, most recently from 4bf153f to 97de349 Compare May 6, 2026 01:49
@Shixiaowei02 Shixiaowei02 requested review from a team as code owners May 6, 2026 01:49
@Shixiaowei02 Shixiaowei02 requested a review from chuangz0 May 6, 2026 01:49
Signed-off-by: Shixiaowei02 <39303645+Shixiaowei02@users.noreply.github.com>
Signed-off-by: Shixiaowei02 <39303645+Shixiaowei02@users.noreply.github.com>
@Shixiaowei02 Shixiaowei02 force-pushed the user/xiaoweis/ctx-cache-reuse branch from 97de349 to cc8f3f7 Compare May 6, 2026 02:01
@Shixiaowei02
Copy link
Copy Markdown
Collaborator Author

/bot run --add-multi-gpu-test --disable-fail-fast

@tensorrt-cicd
Copy link
Copy Markdown
Collaborator

PR_Github #46893 [ run ] triggered by Bot. Commit: cc8f3f7 Link to invocation

@tensorrt-cicd
Copy link
Copy Markdown
Collaborator

PR_Github #46893 [ run ] completed with state SUCCESS. Commit: cc8f3f7
/LLM/main/L0_MergeRequest_PR pipeline #36902 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

CI Agent Failure Analysis

Link to invocation

@Shixiaowei02
Copy link
Copy Markdown
Collaborator Author

/bot run --stage-list "DGX_B200-4_GPUs-PyTorch-1, DGX_B200-4_GPUs-PyTorch-2" --disable-fail-fast

@tensorrt-cicd
Copy link
Copy Markdown
Collaborator

PR_Github #47118 [ run ] triggered by Bot. Commit: cc8f3f7 Link to invocation

@tensorrt-cicd
Copy link
Copy Markdown
Collaborator

PR_Github #47118 [ run ] completed with state SUCCESS. Commit: cc8f3f7
/LLM/main/L0_MergeRequest_PR pipeline #37084 (Partly Tested) completed with status: 'SUCCESS'

CI Report

Link to invocation

@Shixiaowei02
Copy link
Copy Markdown
Collaborator Author

/bot skip --comment "CI passed in the assembled form."

@Shixiaowei02 Shixiaowei02 requested a review from pcastonguay May 7, 2026 09:14
@tensorrt-cicd
Copy link
Copy Markdown
Collaborator

PR_Github #47176 [ skip ] triggered by Bot. Commit: cc8f3f7 Link to invocation

@tensorrt-cicd
Copy link
Copy Markdown
Collaborator

PR_Github #47176 [ skip ] completed with state SUCCESS. Commit: cc8f3f7
Skipping testing for commit cc8f3f7

Link to invocation

Comment thread tensorrt_llm/_torch/pyexecutor/resource_manager.py
@Shixiaowei02 Shixiaowei02 enabled auto-merge (squash) May 8, 2026 07:41
@Shixiaowei02 Shixiaowei02 requested a review from lfr-0531 May 8, 2026 08:38
@Shixiaowei02 Shixiaowei02 merged commit 27b91f2 into NVIDIA:main May 8, 2026
13 checks passed
@Shixiaowei02 Shixiaowei02 deleted the user/xiaoweis/ctx-cache-reuse branch May 8, 2026 08:40
yufeiwu-nv pushed a commit to yufeiwu-nv/TensorRT-LLM that referenced this pull request May 19, 2026
Signed-off-by: Shixiaowei02 <39303645+Shixiaowei02@users.noreply.github.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

7 participants