[TRTLLM-9772][feat] Support cache reuse for SSM in KVCacheManagerV2 by lowsfer · Pull Request #12644 · NVIDIA/TensorRT-LLM

lowsfer · 2026-04-01T02:46:24Z

Description

SSM layers were already supported, but cache reuse was disabled whenever SSM layers were present. This PR enables cache reuse for SSM layers as well by periodically snapshotting SSM states.

Test Coverage

Added test cases in test_kv_cache_manager_v2.py to cover the new feature.

PR Checklist

Please review the following before submitting your PR:

PR description clearly explains what and why. If using CodeRabbit's summary, please make sure it makes sense.
PR Follows TRT-LLM CODING GUIDELINES to the best of your knowledge.
Test cases are provided for new code paths (see test instructions)
Any new dependencies have been scanned for license and vulnerabilities
CODEOWNERS updated if ownership changes
Documentation updated as needed
Update tava architecture diagram if there is a significant design change in PR.
The reviewers assigned automatically/manually are appropriate for the PR.
Please check this after reviewing the above items as appropriate for this PR.

GitHub Bot Help

To see a list of available CI bot commands, please comment /bot help.

lowsfer · 2026-04-01T02:49:57Z

/bot run --disable-fail-fast

tensorrt-cicd · 2026-04-01T02:55:27Z

PR_Github #41088 [ run ] triggered by Bot. Commit: fee5e84 Link to invocation

coderabbitai · 2026-04-01T03:00:46Z

📝 Walkthrough

The PR introduces SSM (State Space Model) reuse with interval-based snapshots to the KV cache manager. Changes include new configuration types and fields for SSM support, a deferred GPU copy mechanism during first resume, reworked prefix reuse logic accounting for SSM lifecycle stages, and snapshot-driven commit behavior. Validation ensures compatibility constraints are met.

Changes

| Cohort / File(s) | Summary |
| --- | --- |
| Type Stubs & Public API `tensorrt_llm/runtime/kv_cache_manager_v2/__init__.pyi` | Added new dataclasses (`SsmLayerConfig`, `KVCacheDesc`, `BatchDesc`), expanded `LayerConfig` union, extended `KVCacheManagerConfig` with `enable_partial_reuse`, `constraints`, `typical_step`, `ssm_reuse_interval` fields, removed `set_page_index_buf()` and `get_page_indices()` from `_KVCache`, added `adjust()` and properties `need_adjustment` and `ssm_reuse_interval` to `KVCacheManager`. |
| Configuration & Validation `tensorrt_llm/runtime/kv_cache_manager_v2/_config.py` | Added `ssm_reuse_interval: int = 512` field to `KVCacheManagerConfig` with post-init validation ensuring positive value, exact divisibility by `tokens_per_block`, and exclusivity with `enable_partial_reuse` when SSM layers present. |
| Core Cache Manager `tensorrt_llm/runtime/kv_cache_manager_v2/_core/_kv_cache_manager.py` | Added `ssm_reuse_interval` property exposing config value. |
| Cache Logic & SSM Handling `tensorrt_llm/runtime/kv_cache_manager_v2/_core/_kv_cache.py` | Restructured SSM ownership model (`_ssm_blocks` no longer nullable, added `_never_resumed` tracking), implemented deferred GPU-to-GPU batched copies on first resume, reworked prefix reuse with SSM snapshot truncation, added interval-based snapshot commits via `_snapshot_ssm_to_tree_block()`, and updated eviction/cleanup for SSM pages. |
| Radix Tree & Block Management `tensorrt_llm/runtime/kv_cache_manager_v2/_block_radix_tree.py` | Conditioned subtree-removal logic in `unset_page()` to only execute for `AttnLifeCycle` with appropriate window/sink constraints, allowing graceful handling of non-attention lifecycle types. |
| Page & Memory Management `tensorrt_llm/runtime/kv_cache_manager_v2/_page.py` | Updated `UncommittedPage.convert_to_committed()` to accept and assign `ready_event`, refactored completion-event collection via new `notify_finish()` method to prevent unbounded growth on shared pages, and relaxed assertion in `__del__` gated by SSM lifecycle check. |
| Tests `tests/unittest/kv_cache_manager_v2_tests/test_kv_cache_manager_v2.py` | Enhanced SSM test configuration helper with `ssm_reuse_interval` parameter, added `_make_ssm_reuse_config()` builder, and introduced three new tests covering interval boundary behavior, data integrity after reuse, and configuration validation. |
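
The "SSM snapshot truncation" and "interval-based snapshot commits" items can be illustrated with a minimal sketch. It assumes, beyond what the summary states, that snapshots are taken at every multiple of `ssm_reuse_interval` tokens and that a matched prefix is reusable only up to the last such boundary, since a recurrent SSM state cannot be cut at an arbitrary token the way attention KV blocks can. The helper names `snapshot_positions` and `reusable_prefix_len` are hypothetical and are not part of the TensorRT-LLM API.

```python
def snapshot_positions(num_tokens: int, ssm_reuse_interval: int = 512) -> list[int]:
    """Token counts at which an SSM snapshot would be taken (assumed policy)."""
    if ssm_reuse_interval <= 0:
        raise ValueError("ssm_reuse_interval must be positive")
    # One snapshot per full interval that fits inside the sequence.
    return list(range(ssm_reuse_interval, num_tokens + 1, ssm_reuse_interval))


def reusable_prefix_len(matched_tokens: int, ssm_reuse_interval: int = 512) -> int:
    """Round a matched prefix length down to the last snapshot boundary."""
    if ssm_reuse_interval <= 0:
        raise ValueError("ssm_reuse_interval must be positive")
    if matched_tokens < 0:
        raise ValueError("matched_tokens must be non-negative")
    return (matched_tokens // ssm_reuse_interval) * ssm_reuse_interval


if __name__ == "__main__":
    # A 1300-token match with the default interval falls back to 1024 tokens.
    print(snapshot_positions(1300))
    print(reusable_prefix_len(1300))
```

Under this assumed policy, a larger interval stores fewer snapshots but discards more of a matched prefix on reuse, which is the trade-off the `ssm_reuse_interval` setting would control.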

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

🚥 Pre-merge checks | ✅ 1 | ❌ 2

❌ Failed checks (2 warnings)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Description check | ⚠️ Warning | The PR description consists entirely of the template with unfilled placeholders and no actual implementation details, rationale, test coverage, or checklist verification. | Complete the PR description by filling in the Description, Test Coverage sections, and verifying the PR Checklist items are addressed and marked as appropriate. |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 24.49% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (1 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Title check | ✅ Passed | The PR title clearly and specifically describes the main feature: adding SSM cache reuse support to KVCacheManagerV2, matching the substantial code changes across multiple files. |

coderabbitai

Actionable comments posted: 4

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@tensorrt_llm/runtime/kv_cache_manager_v2/__init__.pyi`:
- Around line 109-115: The stub for KVCacheManagerConfig has an incorrect type
for the layers field (currently declared as list[AttentionLayerConfig]); update
the annotation to use the union type LayerConfig so SsmLayerConfig instances are
accepted at type-check time—i.e., change the KVCacheManagerConfig.layers
annotation from list[AttentionLayerConfig] to list[LayerConfig] (LayerConfig is
already defined as AttentionLayerConfig | SsmLayerConfig).
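
The typing change requested above can be sketched as follows. The classes here are simplified stand-ins with a hypothetical `layer_id` field, not the real definitions from `__init__.pyi`; only the names `AttentionLayerConfig`, `SsmLayerConfig`, `LayerConfig`, and `KVCacheManagerConfig.layers` come from the review comment.

```python
from dataclasses import dataclass, field
from typing import Union


@dataclass
class AttentionLayerConfig:
    layer_id: int  # hypothetical field, for illustration only


@dataclass
class SsmLayerConfig:
    layer_id: int  # hypothetical field, for illustration only


# The union described in the comment: AttentionLayerConfig | SsmLayerConfig.
LayerConfig = Union[AttentionLayerConfig, SsmLayerConfig]


@dataclass
class KVCacheManagerConfig:
    tokens_per_block: int
    # Annotated with the union so a type checker accepts SSM layers too;
    # list[AttentionLayerConfig] would reject the SsmLayerConfig entry below.
    layers: list[LayerConfig] = field(default_factory=list)


config = KVCacheManagerConfig(
    tokens_per_block=32,
    layers=[AttentionLayerConfig(layer_id=0), SsmLayerConfig(layer_id=1)],
)
```

The annotation does not change runtime behavior; it only affects what static checkers such as mypy or pyright accept.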

In `@tensorrt_llm/runtime/kv_cache_manager_v2/_config.py`:
- Around line 209-213: The constructor/validation currently enforces
ssm_reuse_interval divisibility against tokens_per_block for all configs; change
the logic so the checks that ssm_reuse_interval is positive and a multiple of
tokens_per_block only run when has_ssm_layer is True (refer to the
ssm_reuse_interval, tokens_per_block, and has_ssm_layer symbols and the
validation block in the class/constructor that currently raises on
non-divisors), and add an attention-only regression test that constructs a
config with has_ssm_layer=False and tokens_per_block=96 to ensure the default
ssm_reuse_interval=512 does not raise.
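
A minimal sketch of the gated validation suggested above, assuming the checks live in a dataclass `__post_init__`. `SketchConfig`, `AttnLayer`, and `SsmLayer` are hypothetical stand-ins; only the field names `ssm_reuse_interval`, `tokens_per_block`, `enable_partial_reuse`, and `has_ssm_layer` come from the review text.

```python
from dataclasses import dataclass, field


@dataclass
class AttnLayer:
    """Hypothetical stand-in for an attention layer config."""


@dataclass
class SsmLayer:
    """Hypothetical stand-in for an SSM layer config."""


@dataclass
class SketchConfig:
    tokens_per_block: int
    layers: list = field(default_factory=list)
    ssm_reuse_interval: int = 512
    enable_partial_reuse: bool = False

    @property
    def has_ssm_layer(self) -> bool:
        return any(isinstance(layer, SsmLayer) for layer in self.layers)

    def __post_init__(self) -> None:
        # Attention-only configs never use the interval, so skip its checks.
        if not self.has_ssm_layer:
            return
        if self.ssm_reuse_interval <= 0:
            raise ValueError("ssm_reuse_interval must be positive")
        if self.ssm_reuse_interval % self.tokens_per_block != 0:
            raise ValueError(
                "ssm_reuse_interval must be a multiple of tokens_per_block")
        if self.enable_partial_reuse:
            raise ValueError(
                "enable_partial_reuse is not supported with SSM layers")


# Attention-only regression case from the comment: 512 % 96 != 0, yet no error.
attention_only = SketchConfig(tokens_per_block=96, layers=[AttnLayer()])
```

With an `SsmLayer` in `layers`, the same `tokens_per_block=96` would raise, because the default interval of 512 is not a multiple of 96.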

In `@tensorrt_llm/runtime/kv_cache_manager_v2/_core/_kv_cache.py`:
- Around line 816-817: UncommittedPage is being constructed with
BlockOrdinal(0), which forces block-0 priority for every SSM snapshot; replace
BlockOrdinal(0) with the snapshot block's actual ordinal pulled from the
snapshot/tree_block (e.g., use tree_block.ordinal or tree_block.block_ordinal as
appropriate) so that UncommittedPage(self, <snapshot_ordinal>, ssm_lc_id, lvl,
new_slot, beam_idx) computes correct priority; update the constructor call
before calling convert_to_committed(tree_block, ready_event) to pass that real
ordinal.
- Around line 633-646: The new deferred allocation path in _kv_cache.py can
raise OutOfPagesError before the existing recovery path runs, changing
resume()'s boolean-only failure contract; wrap the storage.new_gpu_slots(...)
call and the subsequent loop that assigns deferred_slots (the block inside the
if self._never_resumed branch that constructs num_slots and calls
storage.new_gpu_slots and iterates tmp_slots) in a try/except that catches
OutOfPagesError and returns False from resume() (preserving other exception
propagation), so that when SsmLifeCycle or has_partial allocation fails under
memory pressure resume() still returns False rather than raising.
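
The second `_kv_cache.py` comment, about keeping `resume()` on a boolean-only failure contract, can be sketched as below. `FakeStorage` and `resume_sketch` are hypothetical stand-ins written for this illustration, and the local `OutOfPagesError` merely mirrors the name used in the comment; the real `resume()` and storage classes are not shown in this thread.

```python
class OutOfPagesError(Exception):
    """Stand-in for the allocator's out-of-memory error."""


class FakeStorage:
    """Hypothetical allocator with a fixed number of free GPU slots."""

    def __init__(self, free_slots: int) -> None:
        self.free_slots = free_slots

    def new_gpu_slots(self, num_slots: int) -> list[int]:
        if num_slots > self.free_slots:
            raise OutOfPagesError(f"need {num_slots}, have {self.free_slots}")
        self.free_slots -= num_slots
        return list(range(num_slots))


def resume_sketch(storage: FakeStorage, num_slots: int,
                  never_resumed: bool = True) -> bool:
    """Return False on allocation failure instead of raising."""
    deferred_slots: list[int] = []
    if never_resumed:
        try:
            tmp_slots = storage.new_gpu_slots(num_slots)
            for slot in tmp_slots:
                deferred_slots.append(slot)
        except OutOfPagesError:
            # Only the out-of-pages case maps to False; other errors propagate.
            return False
    return True
```

Catching only `OutOfPagesError` keeps unrelated failures visible to the caller while preserving the existing return-value contract under memory pressure.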

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: d82a63b7-3093-45e4-8c4e-1699e7621999

📥 Commits

Reviewing files that changed from the base of the PR and between 7a450b4 and fee5e84.

📒 Files selected for processing (7)

tensorrt_llm/runtime/kv_cache_manager_v2/__init__.pyi
tensorrt_llm/runtime/kv_cache_manager_v2/_block_radix_tree.py
tensorrt_llm/runtime/kv_cache_manager_v2/_config.py
tensorrt_llm/runtime/kv_cache_manager_v2/_core/_kv_cache.py
tensorrt_llm/runtime/kv_cache_manager_v2/_core/_kv_cache_manager.py
tensorrt_llm/runtime/kv_cache_manager_v2/_page.py
tests/unittest/kv_cache_manager_v2_tests/test_kv_cache_manager_v2.py

Signed-off-by: Yao Yao <lowsfer@users.noreply.github.com>

lowsfer · 2026-04-01T06:35:13Z

/bot run --disable-fail-fast

tensorrt-cicd · 2026-04-01T06:41:55Z

PR_Github #41133 [ run ] triggered by Bot. Commit: 4e941f6 Link to invocation

tensorrt-cicd · 2026-04-01T11:15:51Z

PR_Github #41133 [ run ] completed with state SUCCESS. Commit: 4e941f6
/LLM/main/L0_MergeRequest_PR pipeline #32103 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

Please check the failed tests and fix your PR
If you cannot view the failures, ask the CI triggerer to share details
Once fixed, request an NVIDIA team member to trigger CI again

Link to invocation

lowsfer · 2026-04-02T07:55:44Z

/bot run --disable-fail-fast

tensorrt-cicd · 2026-04-02T08:02:27Z

PR_Github #41383 [ run ] triggered by Bot. Commit: 4e941f6 Link to invocation

tensorrt-cicd · 2026-04-02T11:01:28Z

PR_Github #41383 [ run ] completed with state SUCCESS. Commit: 4e941f6
/LLM/main/L0_MergeRequest_PR pipeline #32323 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

Please check the failed tests and fix your PR
If you cannot view the failures, ask the CI triggerer to share details
Once fixed, request an NVIDIA team member to trigger CI again

Link to invocation

lowsfer · 2026-04-03T11:39:16Z

/bot run --disable-fail-fast

tensorrt-cicd · 2026-04-03T11:44:50Z

PR_Github #41658 [ run ] triggered by Bot. Commit: 3c7df70 Link to invocation

tensorrt-cicd · 2026-04-03T16:07:12Z

PR_Github #41658 [ run ] completed with state SUCCESS. Commit: 3c7df70
/LLM/main/L0_MergeRequest_PR pipeline #32563 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

Please check the failed tests and fix your PR
If you cannot view the failures, ask the CI triggerer to share details
Once fixed, request an NVIDIA team member to trigger CI again

Link to invocation

lowsfer · 2026-04-04T04:12:39Z

/bot run --disable-fail-fast

tensorrt-cicd · 2026-04-04T04:18:46Z

PR_Github #41797 [ run ] triggered by Bot. Commit: bdbc762 Link to invocation

tensorrt-cicd · 2026-04-04T08:33:56Z

PR_Github #41797 [ run ] completed with state SUCCESS. Commit: bdbc762
/LLM/main/L0_MergeRequest_PR pipeline #32691 completed with status: 'SUCCESS'

CI Report

Link to invocation

lowsfer requested a review from yizhang-nv April 1, 2026 02:46

github-actions bot assigned lowsfer Apr 1, 2026

coderabbitai bot reviewed Apr 1, 2026

lowsfer force-pushed the ssm-reuse branch from fee5e84 to 295f396 Compare April 1, 2026 05:58

[TRTLLM-9772][feat] Support cache reuse for SSM in KVCacheManagerV2

4e941f6

Signed-off-by: Yao Yao <lowsfer@users.noreply.github.com>

lowsfer force-pushed the ssm-reuse branch from 295f396 to 4e941f6 Compare April 1, 2026 06:34

yizhang-nv approved these changes Apr 1, 2026

lowsfer enabled auto-merge (squash) April 2, 2026 07:55

Merge branch 'main' into ssm-reuse

3c7df70

Merge branch 'main' into ssm-reuse

bdbc762

lowsfer merged commit fd7cc85 into NVIDIA:main Apr 4, 2026
5 checks passed

yufeiwu-nv pushed a commit to yufeiwu-nv/TensorRT-LLM that referenced this pull request Apr 7, 2026

[TRTLLM-9772][feat] Support cache reuse for SSM in KVCacheManagerV2 (N…

cb375ac

…VIDIA#12644) Signed-off-by: Yao Yao <lowsfer@users.noreply.github.com>

karen-sy pushed a commit to karen-sy/TensorRT-LLM that referenced this pull request Apr 7, 2026

[TRTLLM-9772][feat] Support cache reuse for SSM in KVCacheManagerV2 (N…

970debf

…VIDIA#12644) Signed-off-by: Yao Yao <lowsfer@users.noreply.github.com>
