[None][refactor] Decouple cached prefix from KVSlice token_range by Shixiaowei02 · Pull Request #13937 · NVIDIA/TensorRT-LLM

Shixiaowei02 · 2026-05-09T09:42:04Z

Summary by CodeRabbit

Bug Fixes
- Improved token offset calculations in distributed KV cache transfers to enhance alignment accuracy
- Enhanced sliding window attention cache reuse logic with better stale range handling
Documentation
- Clarified KV slice token range derivation, offset calculations, and sliding window attention behavior
Tests
- Expanded test coverage for cache reuse adapter functionality and token/block alignment validation

Description

Test Coverage

PR Checklist

Please review the following before submitting your PR:

PR description clearly explains what and why. If using CodeRabbit's summary, please make sure it makes sense.
PR Follows TRT-LLM CODING GUIDELINES to the best of your knowledge.
Test cases are provided for new code paths (see test instructions)
Any new dependencies have been scanned for license and vulnerabilities
CODEOWNERS updated if ownership changes
Documentation updated as needed
Update tava architecture diagram if there is a significant design change in PR.
The reviewers assigned automatically/manually are appropriate for the PR.
Please check this after reviewing the above items as appropriate for this PR.

GitHub Bot Help

To see a list of available CI bot commands, please comment /bot help.

Shixiaowei02 · 2026-05-09T14:11:31Z

/bot run --add-multi-gpu-test --disable-fail-fast

tensorrt-cicd · 2026-05-09T14:16:51Z

PR_Github #47525 [ run ] triggered by Bot. Commit: f18a0d7 Link to invocation

tensorrt-cicd · 2026-05-09T23:23:47Z

PR_Github #47525 [ run ] completed with state SUCCESS. Commit: f18a0d7
/LLM/main/L0_MergeRequest_PR pipeline #37442 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

Please check the failed tests and fix your PR
If you cannot view the failures, ask the CI triggerer to share details
Once fixed, request an NVIDIA team member to trigger CI again

CI Agent Failure Analysis

Link to invocation

Shixiaowei02 · 2026-05-10T03:18:01Z

/bot run --add-multi-gpu-test --disable-fail-fast

Signed-off-by: Shixiaowei02 <39303645+Shixiaowei02@users.noreply.github.com>

Shixiaowei02 · 2026-05-10T03:18:42Z

/bot run --add-multi-gpu-test --disable-fail-fast

tensorrt-cicd · 2026-05-10T03:24:03Z

PR_Github #47551 [ run ] triggered by Bot. Commit: 8926048 Link to invocation

tensorrt-cicd · 2026-05-10T03:24:29Z

PR_Github #47552 [ run ] triggered by Bot. Commit: 8926048 Link to invocation

tensorrt-cicd · 2026-05-10T03:24:32Z

PR_Github #47551 [ run ] completed with state ABORTED. Commit: 8926048

Link to invocation

tensorrt-cicd · 2026-05-10T12:14:39Z

PR_Github #47552 [ run ] completed with state SUCCESS. Commit: 8926048
/LLM/main/L0_MergeRequest_PR pipeline #37466 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

Please check the failed tests and fix your PR
If you cannot view the failures, ask the CI triggerer to share details
Once fixed, request an NVIDIA team member to trigger CI again

CI Agent Failure Analysis

Link to invocation

Shixiaowei02 · 2026-05-11T02:48:56Z

/bot run --stage-list "DGX_B200-PyTorch-2" --disable-fail-fast

coderabbitai · 2026-05-11T02:50:12Z

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: e0ff149a-e1b8-4f43-a0e9-0e0093025b34

📥 Commits

Reviewing files that changed from the base of the PR and between a31d650 and 8926048.

📒 Files selected for processing (6)

tensorrt_llm/_torch/disaggregation/base/transfer.py
tensorrt_llm/_torch/disaggregation/native/transfer.py
tensorrt_llm/_torch/disaggregation/resource/cache_reuse.py
tensorrt_llm/_torch/disaggregation/transceiver.py
tests/unittest/disaggregated/test_cache_reuse_adapter.py
tests/unittest/disaggregated/test_kv_transfer.py

📝 Walkthrough

This PR refactors KV cache token offset derivation in a disaggregation (distributed inference) system. The receiver no longer provides explicit token offsets; instead, the sender computes block alignment offsets implicitly from block-list sizes. The cache adapter interface is refactored to report per-layer-group cached token counts with sliding-window clamping, replacing global cached-token counting. Session metadata threads prompt_len through to enable stale-end SWA computation.
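The per-layer-group counting described above can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: `LayerGroup`, `stale_end`, and `cached_tokens_per_layer_group` are hypothetical names, not the TensorRT-LLM API, and the clamping rule (count only cached tokens that lie past the block-aligned start of the sliding window) is one plausible reading of "SWA clamping", which the PR's actual rule may not match.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass(frozen=True)
class LayerGroup:
    """Hypothetical stand-in for a layer group; not the TensorRT-LLM type."""
    name: str
    window_size: Optional[int] = None  # None means full attention


def stale_end(prompt_len: int, window_size: Optional[int],
              tokens_per_block: int) -> int:
    """Assumed rule: tokens before prompt_len - window_size are stale.

    The boundary is rounded down to a block boundary so that any block
    still touching the window is kept whole.
    """
    if window_size is None:
        return 0
    first_live = max(0, prompt_len - window_size)
    return (first_live // tokens_per_block) * tokens_per_block


def cached_tokens_per_layer_group(global_cached: int, prompt_len: int,
                                  groups: List[LayerGroup],
                                  tokens_per_block: int,
                                  reuse_enabled: bool = True) -> Dict[str, int]:
    """Report a cached-token count per layer group instead of one global count."""
    counts: Dict[str, int] = {}
    for group in groups:
        if not reuse_enabled:
            counts[group.name] = 0  # reuse disabled: nothing counts as cached
        elif group.window_size is None:
            counts[group.name] = global_cached  # full attention: passthrough
        else:
            cutoff = stale_end(prompt_len, group.window_size, tokens_per_block)
            # Assumed clamp: cached tokens before the stale end do not count.
            counts[group.name] = max(0, global_cached - cutoff)
    return counts


groups = [LayerGroup("full"), LayerGroup("swa", window_size=8)]
print(cached_tokens_per_layer_group(16, 20, groups, tokens_per_block=4))
# {'full': 16, 'swa': 4}
```

With a 20-token prompt and an 8-token window, the assumed stale end is token 12, so only 4 of the 16 globally cached tokens count for the sliding-window group in this sketch.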

Changes

KV Token Offset & Per-Layer-Group Cache Alignment

| Layer / File(s) | Summary |
| --- | --- |
| Session Base Types with Prompt Length `tensorrt_llm/_torch/disaggregation/base/transfer.py` | `SessionArgsBase` gains optional `prompt_len` field. `KVSlice` docstring clarified for single/multi-slice token-range and SWA stale-end semantics. |
| Protocol & Data Contract Changes `tensorrt_llm/_torch/disaggregation/native/transfer.py` | `RecvReqInfo` replaces `start_token_idx` with optional `dst_start_token` field (msgpacked); signals sender to derive destination start implicitly. |
| Session & Task Constructors with Prompt Length `tensorrt_llm/_torch/disaggregation/native/transfer.py` | `TxSession`, `RxSession`, and `KVSendTask` constructors accept optional `prompt_len` and propagate into `SessionArgsBase`. |
| Receiver Request Info Construction `tensorrt_llm/_torch/disaggregation/native/transfer.py` | `Receiver._build_recv_req_info()` stops populating token offsets; sets `dst_start_token=None` to signal sender-side implicit derivation. |
| Cache Reuse Adapter API Refactor `tensorrt_llm/_torch/disaggregation/resource/cache_reuse.py` | `CacheReuseAdapter` interface replaced `get_cached_token_count()` with `_global_cached_token_count()` and added `get_cached_token_count_per_layer_group()` with SWA clamping per layer group. |
| Adapter Implementations `tensorrt_llm/_torch/disaggregation/resource/cache_reuse.py` | V1 and V2 adapters implement `_global_cached_token_count()` using their cache sources (prepopulated_prompt_len, kv_cache.num_committed_tokens) with block alignment. |
| Sender Block Alignment & Token Starts `tensorrt_llm/_torch/disaggregation/native/transfer.py` | `Sender._build_kv_write_meta()` rewritten: derives `src_start`/`dst_start` from implicit cached-prefix size (block-list sizes, slice_end, tokens_per_block); SWA stale-end uses `task._prompt_len` with validation. |
| Transceiver KV Slice Selection `tensorrt_llm/_torch/disaggregation/transceiver.py` | `KvCacheTransceiverV2._create_kv_slice()` refactored: uses per-layer-group cached counts, defaults `token_range` to full prompt (0..prompt_len), applies per-layer-group cache-skip trimming with consistency assertions. |
| Test Module Imports & Setup `tests/unittest/disaggregated/test_cache_reuse_adapter.py` | Test module updated with imports for `TokenRange`, `Sender`, `CacheReuseAdapter`, and `AttentionLayerGroup`. |
| Token Alignment & Token Range Tests `tests/unittest/disaggregated/test_cache_reuse_adapter.py` | Refactored `TestAlignKvBlocks` with granular no-offset/dst-later/src-later/both-offset/no-overlap cases. Renamed `TestTokenRangeWithPrefix` → `TestTokenRange` with start/end invariant checks. |
| Test Helper Functions & Fixtures `tests/unittest/disaggregated/test_cache_reuse_adapter.py` | Added `_swa_trim()` and `_derive_starts()` helpers; added `_StubAdapter`, `_FakeReq`, `_lg()` fixtures for deterministic SWA/caching test scenarios. |
| Per-Layer-Group Adapter Tests `tests/unittest/disaggregated/test_cache_reuse_adapter.py` | New `TestAdapterPerLayerGroup` validates `get_cached_token_count_per_layer_group()`: reuse-disabled zeros, full-attention passthrough, SWA clamping relative to stale-end, mixed layer-group outputs. |
| SWA Trim & Cache-Skip Tests `tests/unittest/disaggregated/test_cache_reuse_adapter.py` | New `TestSwaTrim` exercises SWA window trimming and adapter cache-skip across no-cache, stale, partial/full window coverage, offset interactions, and V1 pre-eviction scenarios. |
| KV Transfer Integration Test Update `tests/unittest/disaggregated/test_kv_transfer.py` | Updated `test_transfer_with_gen_prefix_offset`: receiver now derives destination start implicitly from suffix block count; gen-side `token_range.start` defaults to 0. |
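
As a rough illustration of the implicit start derivation summarized in the table, the sketch below recovers each side's start token from the length of its block list and then aligns the two lists on their overlap. The function names `derive_start` and `align_blocks` are hypothetical, and the rule that a block list always covers the tail of `[0, slice_end)` is an assumption; the actual logic in `Sender._build_kv_write_meta()` may differ.

```python
from typing import List, Tuple


def derive_start(num_blocks: int, slice_end: int, tokens_per_block: int) -> int:
    """Infer a start token from how many blocks a side holds.

    Assumption: the block list covers the tail of [0, slice_end), so the
    blocks that are missing from the front form the cached prefix.
    """
    total_blocks = -(-slice_end // tokens_per_block)  # ceiling division
    if num_blocks > total_blocks:
        raise ValueError("block list is longer than the slice allows")
    return (total_blocks - num_blocks) * tokens_per_block


def align_blocks(src_blocks: List[int], dst_blocks: List[int],
                 slice_end: int,
                 tokens_per_block: int) -> Tuple[List[int], List[int], int]:
    """Drop leading blocks on either side so both lists start at the same token."""
    src_start = derive_start(len(src_blocks), slice_end, tokens_per_block)
    dst_start = derive_start(len(dst_blocks), slice_end, tokens_per_block)
    start = max(src_start, dst_start)  # first token both sides can exchange
    src_skip = (start - src_start) // tokens_per_block
    dst_skip = (start - dst_start) // tokens_per_block
    return src_blocks[src_skip:], dst_blocks[dst_skip:], start


# The sender holds the whole 16-token slice; the receiver already has 8 tokens
# cached, so it only lists blocks for the suffix ("dst-later" case).
src, dst, start = align_blocks([10, 11, 12, 13], [20, 21],
                               slice_end=16, tokens_per_block=4)
print(src, dst, start)  # [12, 13] [20, 21] 8
```

Under these assumptions the receiver never has to send an explicit offset: the length of its block list is enough for the sender to recover where the destination range begins.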

🎯 4 (Complex) | ⏱️ ~75 minutes

🚥 Pre-merge checks | ✅ 3 | ❌ 2

❌ Failed checks (2 warnings)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Description check | ⚠️ Warning | The PR description is largely a template with no concrete explanation of changes, test coverage, or rationale provided by the author despite checking the PR checklist box. | Fill in the Description and Test Coverage sections with clear explanations of what was changed and why, and list the relevant tests that validate the refactoring. |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 8.96% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (3 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Title check | ✅ Passed | The title is a clear, concise summary of the main refactoring change: decoupling cached prefix from KVSlice token_range. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |

tensorrt-cicd · 2026-05-11T02:54:35Z

PR_Github #47647 [ run ] triggered by Bot. Commit: 8926048 Link to invocation

tensorrt-cicd · 2026-05-11T04:01:06Z

PR_Github #47647 [ run ] completed with state SUCCESS. Commit: 8926048
/LLM/main/L0_MergeRequest_PR pipeline #37551 (Partly Tested) completed with status: 'SUCCESS'

CI Report

Link to invocation

Shixiaowei02 · 2026-05-11T04:44:47Z

/bot skip --comment "CI passed in the assembled form."

tensorrt-cicd · 2026-05-11T04:52:06Z

PR_Github #47662 [ skip ] triggered by Bot. Commit: 8926048 Link to invocation

tensorrt-cicd · 2026-05-11T04:58:13Z

PR_Github #47662 [ skip ] completed with state SUCCESS. Commit: 8926048
Skipping testing for commit 8926048

Link to invocation

github-actions Bot assigned Shixiaowei02 May 9, 2026

Shixiaowei02 force-pushed the user/xiaoweis/reuse-swa branch from f84b368 to f18a0d7 Compare May 9, 2026 14:11

Shixiaowei02 force-pushed the user/xiaoweis/reuse-swa branch from f18a0d7 to 9b2bed1 Compare May 10, 2026 03:17

decouple cached prefix from KVSlice token_range

8926048

Signed-off-by: Shixiaowei02 <39303645+Shixiaowei02@users.noreply.github.com>

Shixiaowei02 force-pushed the user/xiaoweis/reuse-swa branch from 9b2bed1 to 8926048 Compare May 10, 2026 03:18

Shixiaowei02 marked this pull request as ready for review May 11, 2026 02:45

Shixiaowei02 requested a review from a team as a code owner May 11, 2026 02:45

Shixiaowei02 requested review from byshiue, chuangz0 and dongxuy04 May 11, 2026 02:45

Shixiaowei02 requested a review from pcastonguay May 12, 2026 05:07

Shixiaowei02 enabled auto-merge (squash) May 12, 2026 05:07

chuangz0 approved these changes May 12, 2026

View reviewed changes

lfr-0531 approved these changes May 12, 2026

View reviewed changes

Shixiaowei02 merged commit 7e07c8b into NVIDIA:main May 12, 2026
10 checks passed

Shixiaowei02 deleted the user/xiaoweis/reuse-swa branch May 12, 2026 05:34

yufeiwu-nv pushed a commit to yufeiwu-nv/TensorRT-LLM that referenced this pull request May 19, 2026

[None][refactor] Decouple cached prefix from KVSlice token_range (NVI…

6880a23

…DIA#13937) Signed-off-by: Shixiaowei02 <39303645+Shixiaowei02@users.noreply.github.com>


4 participants
