
[TRTLLM-9772][feat] Support cache reuse for SSM in KVCacheManagerV2 #12644

Merged
lowsfer merged 3 commits into NVIDIA:main from lowsfer:ssm-reuse
Apr 4, 2026

Conversation

@lowsfer
Member

@lowsfer lowsfer commented Apr 1, 2026

@coderabbitai summary

Description

SSM layers were already supported, but cache reuse was disabled whenever they were present. This PR enables cache reuse for SSM layers as well by periodically snapshotting SSM states.

Test Coverage

Added test cases in test_kv_cache_manager_v2.py to cover the new feature.

PR Checklist

Please review the following before submitting your PR:

  • PR description clearly explains what and why. If using CodeRabbit's summary, please make sure it makes sense.

  • PR Follows TRT-LLM CODING GUIDELINES to the best of your knowledge.

  • Test cases are provided for new code paths (see test instructions)

  • Any new dependencies have been scanned for license and vulnerabilities

  • CODEOWNERS updated if ownership changes

  • Documentation updated as needed

  • Update tava architecture diagram if there is a significant design change in PR.

  • The reviewers assigned automatically/manually are appropriate for the PR.

  • Please check this after reviewing the above items as appropriate for this PR.

GitHub Bot Help

To see a list of available CI bot commands, please comment /bot help.

@lowsfer
Member Author

lowsfer commented Apr 1, 2026

/bot run --disable-fail-fast

@tensorrt-cicd
Collaborator

PR_Github #41088 [ run ] triggered by Bot. Commit: fee5e84 Link to invocation

@coderabbitai
Contributor

coderabbitai bot commented Apr 1, 2026

📝 Walkthrough


The PR introduces SSM (State Space Model) reuse with interval-based snapshots to the KV cache manager. Changes include new configuration types and fields for SSM support, a deferred GPU copy mechanism during first resume, reworked prefix reuse logic accounting for SSM lifecycle stages, and snapshot-driven commit behavior. Validation ensures compatibility constraints are met.
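An SSM state summarizes every token seen so far, so it can only be restored at positions where a snapshot was actually saved. The sketch below illustrates that idea under a simplifying assumption: snapshots are taken at every multiple of the interval, and a matched prefix is truncated down to the most recent snapshot boundary. The function names are hypothetical and are not taken from the PR.

```python
def snapshot_positions(num_tokens: int, interval: int) -> list[int]:
    """Token counts at which a snapshot would exist (assumed: every multiple of interval)."""
    if interval <= 0:
        raise ValueError("interval must be positive")
    return list(range(interval, num_tokens + 1, interval))


def reusable_prefix_len(matched_tokens: int, interval: int) -> int:
    """Truncate a matched prefix down to the most recent snapshot boundary."""
    if interval <= 0:
        raise ValueError("interval must be positive")
    return (matched_tokens // interval) * interval


# With the default interval of 512, a 1300-token match can reuse 1024 tokens.
print(snapshot_positions(1300, 512))
print(reusable_prefix_len(1300, 512))
```

A smaller interval wastes fewer matched tokens on truncation but stores more snapshots; the real trade-off depends on the SSM state size, which this sketch does not model.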

Changes

| Cohort / File(s) | Summary |
|---|---|
| Type Stubs & Public API: `tensorrt_llm/runtime/kv_cache_manager_v2/__init__.pyi` | Added new dataclasses (SsmLayerConfig, KVCacheDesc, BatchDesc), expanded LayerConfig union, extended KVCacheManagerConfig with enable_partial_reuse, constraints, typical_step, ssm_reuse_interval fields, removed set_page_index_buf() and get_page_indices() from _KVCache, added adjust() and properties need_adjustment and ssm_reuse_interval to KVCacheManager. |
| Configuration & Validation: `tensorrt_llm/runtime/kv_cache_manager_v2/_config.py` | Added ssm_reuse_interval: int = 512 field to KVCacheManagerConfig with post-init validation ensuring positive value, exact divisibility by tokens_per_block, and exclusivity with enable_partial_reuse when SSM layers present. |
| Core Cache Manager: `tensorrt_llm/runtime/kv_cache_manager_v2/_core/_kv_cache_manager.py` | Added ssm_reuse_interval property exposing config value. |
| Cache Logic & SSM Handling: `tensorrt_llm/runtime/kv_cache_manager_v2/_core/_kv_cache.py` | Restructured SSM ownership model (_ssm_blocks no longer nullable, added _never_resumed tracking), implemented deferred GPU-to-GPU batched copies on first resume, reworked prefix reuse with SSM snapshot truncation, added interval-based snapshot commits via _snapshot_ssm_to_tree_block(), and updated eviction/cleanup for SSM pages. |
| Radix Tree & Block Management: `tensorrt_llm/runtime/kv_cache_manager_v2/_block_radix_tree.py` | Conditioned subtree-removal logic in unset_page() to only execute for AttnLifeCycle with appropriate window/sink constraints, allowing graceful handling of non-attention lifecycle types. |
| Page & Memory Management: `tensorrt_llm/runtime/kv_cache_manager_v2/_page.py` | Updated UncommittedPage.convert_to_committed() to accept and assign ready_event, refactored completion-event collection via new notify_finish() method to prevent unbounded growth on shared pages, and relaxed assertion in __del__ gated by SSM lifecycle check. |
| Tests: `tests/unittest/kv_cache_manager_v2_tests/test_kv_cache_manager_v2.py` | Enhanced SSM test configuration helper with ssm_reuse_interval parameter, added _make_ssm_reuse_config() builder, and introduced three new tests covering interval boundary behavior, data integrity after reuse, and configuration validation. |
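The validation rules summarized for `_config.py` can be sketched as a small dataclass. This is an illustrative stand-in, not the actual `KVCacheManagerConfig`: the class name `SketchConfig`, the boolean field `has_ssm_layer`, and the helper `is_valid` are assumptions made for the example, and only the three checks named in the summary are modeled.

```python
from dataclasses import dataclass


@dataclass
class SketchConfig:
    """Hypothetical stand-in for the config; only the fields needed for validation."""

    tokens_per_block: int
    has_ssm_layer: bool = False  # assumed flag; the real config may derive this from its layers
    enable_partial_reuse: bool = False
    ssm_reuse_interval: int = 512

    def __post_init__(self) -> None:
        # Check 1: the interval must be positive.
        if self.ssm_reuse_interval <= 0:
            raise ValueError("ssm_reuse_interval must be positive")
        # Check 2: the interval must be an exact multiple of tokens_per_block.
        if self.ssm_reuse_interval % self.tokens_per_block != 0:
            raise ValueError("ssm_reuse_interval must be a multiple of tokens_per_block")
        # Check 3: partial reuse and SSM layers are mutually exclusive.
        if self.has_ssm_layer and self.enable_partial_reuse:
            raise ValueError("enable_partial_reuse is not supported with SSM layers")


def is_valid(**kwargs) -> bool:
    """Return True if a SketchConfig can be built from kwargs, False if validation rejects it."""
    try:
        SketchConfig(**kwargs)
    except ValueError:
        return False
    return True


print(is_valid(tokens_per_block=32, has_ssm_layer=True))
print(is_valid(tokens_per_block=96, has_ssm_layer=True))
```

In this sketch the divisibility check runs for every config, so `tokens_per_block=96` is rejected with the default interval of 512 even without SSM layers.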

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

🚥 Pre-merge checks | ✅ 1 | ❌ 2

❌ Failed checks (2 warnings)

| Check name | Status | Explanation | Resolution |
|---|---|---|---|
| Description check | ⚠️ Warning | The PR description consists entirely of the template with unfilled placeholders and no actual implementation details, rationale, test coverage, or checklist verification. | Complete the PR description by filling in the Description, Test Coverage sections, and verifying the PR Checklist items are addressed and marked as appropriate. |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 24.49% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |
✅ Passed checks (1 passed)
| Check name | Status | Explanation |
|---|---|---|
| Title check | ✅ Passed | The PR title clearly and specifically describes the main feature: adding SSM cache reuse support to KVCacheManagerV2, matching the substantial code changes across multiple files. |


Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 4

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@tensorrt_llm/runtime/kv_cache_manager_v2/__init__.pyi`:
- Around line 109-115: The stub for KVCacheManagerConfig has an incorrect type
for the layers field (currently declared as list[AttentionLayerConfig]); update
the annotation to use the union type LayerConfig so SsmLayerConfig instances are
accepted at type-check time—i.e., change the KVCacheManagerConfig.layers
annotation from list[AttentionLayerConfig] to list[LayerConfig] (LayerConfig is
already defined as AttentionLayerConfig | SsmLayerConfig).

In `@tensorrt_llm/runtime/kv_cache_manager_v2/_config.py`:
- Around line 209-213: The constructor/validation currently enforces
ssm_reuse_interval divisibility against tokens_per_block for all configs; change
the logic so the checks that ssm_reuse_interval is positive and a multiple of
tokens_per_block only run when has_ssm_layer is True (refer to the
ssm_reuse_interval, tokens_per_block, and has_ssm_layer symbols and the
validation block in the class/constructor that currently raises on
non-divisors), and add an attention-only regression test that constructs a
config with has_ssm_layer=False and tokens_per_block=96 to ensure the default
ssm_reuse_interval=512 does not raise.

In `@tensorrt_llm/runtime/kv_cache_manager_v2/_core/_kv_cache.py`:
- Around line 816-817: UncommittedPage is being constructed with
BlockOrdinal(0), which forces block-0 priority for every SSM snapshot; replace
BlockOrdinal(0) with the snapshot block's actual ordinal pulled from the
snapshot/tree_block (e.g., use tree_block.ordinal or tree_block.block_ordinal as
appropriate) so that UncommittedPage(self, <snapshot_ordinal>, ssm_lc_id, lvl,
new_slot, beam_idx) computes correct priority; update the constructor call
before calling convert_to_committed(tree_block, ready_event) to pass that real
ordinal.
- Around line 633-646: The new deferred allocation path in _kv_cache.py can
raise OutOfPagesError before the existing recovery path runs, changing
resume()'s boolean-only failure contract; wrap the storage.new_gpu_slots(...)
call and the subsequent loop that assigns deferred_slots (the block inside the
if self._never_resumed branch that constructs num_slots and calls
storage.new_gpu_slots and iterates tmp_slots) in a try/except that catches
OutOfPagesError and returns False from resume() (preserving other exception
propagation), so that when SsmLifeCycle or has_partial allocation fails under
memory pressure resume() still returns False rather than raising.
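The last review comment asks that `resume()` keep returning `False` on allocation failure instead of raising. A minimal sketch of that pattern follows. `OutOfPagesError`, `FakeStorage`, and this `resume` function are simplified stand-ins written for illustration; the real signatures of `storage.new_gpu_slots` and `resume()` are not shown in this conversation.

```python
class OutOfPagesError(Exception):
    """Stand-in for the allocator error named in the review comment."""


class FakeStorage:
    """Hypothetical allocator with a fixed number of free GPU slots."""

    def __init__(self, free_slots: int) -> None:
        self.free_slots = free_slots

    def new_gpu_slots(self, num_slots: int) -> list[int]:
        if num_slots > self.free_slots:
            raise OutOfPagesError(f"need {num_slots}, have {self.free_slots}")
        self.free_slots -= num_slots
        return list(range(num_slots))


def resume(storage: FakeStorage, num_slots: int) -> bool:
    """Return False when allocation fails; let every other exception propagate."""
    try:
        deferred_slots = storage.new_gpu_slots(num_slots)
    except OutOfPagesError:
        return False
    # The deferred GPU-to-GPU copies would be issued here in a real implementation.
    assert len(deferred_slots) == num_slots
    return True


print(resume(FakeStorage(free_slots=4), num_slots=2))
print(resume(FakeStorage(free_slots=1), num_slots=2))
```

Catching only `OutOfPagesError` keeps the boolean contract for memory pressure while still surfacing unrelated bugs as exceptions.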

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: d82a63b7-3093-45e4-8c4e-1699e7621999

📥 Commits

Reviewing files that changed from the base of the PR and between 7a450b4 and fee5e84.

📒 Files selected for processing (7)
  • tensorrt_llm/runtime/kv_cache_manager_v2/__init__.pyi
  • tensorrt_llm/runtime/kv_cache_manager_v2/_block_radix_tree.py
  • tensorrt_llm/runtime/kv_cache_manager_v2/_config.py
  • tensorrt_llm/runtime/kv_cache_manager_v2/_core/_kv_cache.py
  • tensorrt_llm/runtime/kv_cache_manager_v2/_core/_kv_cache_manager.py
  • tensorrt_llm/runtime/kv_cache_manager_v2/_page.py
  • tests/unittest/kv_cache_manager_v2_tests/test_kv_cache_manager_v2.py

Signed-off-by: Yao Yao <lowsfer@users.noreply.github.com>
@lowsfer
Member Author

lowsfer commented Apr 1, 2026

/bot run --disable-fail-fast

@tensorrt-cicd
Collaborator

PR_Github #41133 [ run ] triggered by Bot. Commit: 4e941f6 Link to invocation

@tensorrt-cicd
Collaborator

PR_Github #41133 [ run ] completed with state SUCCESS. Commit: 4e941f6
/LLM/main/L0_MergeRequest_PR pipeline #32103 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

Link to invocation

@lowsfer
Member Author

lowsfer commented Apr 2, 2026

/bot run --disable-fail-fast

@lowsfer lowsfer enabled auto-merge (squash) April 2, 2026 07:55
@tensorrt-cicd
Collaborator

PR_Github #41383 [ run ] triggered by Bot. Commit: 4e941f6 Link to invocation

@tensorrt-cicd
Collaborator

PR_Github #41383 [ run ] completed with state SUCCESS. Commit: 4e941f6
/LLM/main/L0_MergeRequest_PR pipeline #32323 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

Link to invocation

@lowsfer
Member Author

lowsfer commented Apr 3, 2026

/bot run --disable-fail-fast

@tensorrt-cicd
Collaborator

PR_Github #41658 [ run ] triggered by Bot. Commit: 3c7df70 Link to invocation

@tensorrt-cicd
Collaborator

PR_Github #41658 [ run ] completed with state SUCCESS. Commit: 3c7df70
/LLM/main/L0_MergeRequest_PR pipeline #32563 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

Link to invocation

@lowsfer
Member Author

lowsfer commented Apr 4, 2026

/bot run --disable-fail-fast

@tensorrt-cicd
Collaborator

PR_Github #41797 [ run ] triggered by Bot. Commit: bdbc762 Link to invocation

@tensorrt-cicd
Collaborator

PR_Github #41797 [ run ] completed with state SUCCESS. Commit: bdbc762
/LLM/main/L0_MergeRequest_PR pipeline #32691 completed with status: 'SUCCESS'

CI Report

Link to invocation

@lowsfer lowsfer merged commit fd7cc85 into NVIDIA:main Apr 4, 2026
5 checks passed
yufeiwu-nv pushed a commit to yufeiwu-nv/TensorRT-LLM that referenced this pull request Apr 7, 2026
…VIDIA#12644)

Signed-off-by: Yao Yao <lowsfer@users.noreply.github.com>
karen-sy pushed a commit to karen-sy/TensorRT-LLM that referenced this pull request Apr 7, 2026
…VIDIA#12644)

Signed-off-by: Yao Yao <lowsfer@users.noreply.github.com>
3 participants