[TRTLLM-12188][feat] Implement SWA prefill memory reuse (scratch slots) by lowsfer · Pull Request #13368 · NVIDIA/TensorRT-LLM

lowsfer · 2026-04-23T07:36:54Z

This commit introduces an opt-in memory saving feature for SWA (Sliding Window Attention) layers during prefill.

During prefill of a new request, out-of-window blocks' KV data is only needed during a single layer's attention computation, and can then be overwritten by the next layer. We leverage this by reinterpreting shared sub-pages within a coalesced slot to serve different blocks for the currently executing layer, rather than different layers for the same block.

Memory Savings:
For a 32 SWA layer model with prompt=1024, window=128, and tokens_per_block=32:

- Current peak: 32 coalesced slots (one per block, each storing all 32 layers).
- With scratch reuse: ceil(27/32) = 1 scratch slot + 5 normal slots = 6 slots.
- Total reduction in peak KV cache memory: ~81%.
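
The arithmetic behind these figures can be checked with a short sketch. The split into 27 scratch blocks and 5 normal blocks is taken from the numbers above; how the manager derives that split from the window size is not shown in this description, so the sketch takes the normal-block count as an input. The function name is hypothetical and is not part of the PR.

```python
import math


def peak_slots_with_scratch(prompt_len, tokens_per_block, num_layers, normal_blocks):
    """Return (baseline_slots, scratch_reuse_slots) for one request.

    Assumption: one scratch slot holds `num_layers` sub-pages, so it can serve
    up to `num_layers` out-of-window blocks for the currently executing layer.
    """
    total_blocks = math.ceil(prompt_len / tokens_per_block)
    scratch_blocks = total_blocks - normal_blocks
    scratch_slots = math.ceil(scratch_blocks / num_layers)
    return total_blocks, scratch_slots + normal_blocks


baseline, reduced = peak_slots_with_scratch(
    prompt_len=1024, tokens_per_block=32, num_layers=32, normal_blocks=5
)
saving = 1 - reduced / baseline  # fraction of peak slots no longer needed
print(baseline, reduced, f"{saving:.2%}")
```

With the example's inputs this gives 32 baseline slots against 6 with scratch reuse, which is where the ~81% figure comes from.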

Implementation details:

1. `KVCacheManagerConfig` introduces `enable_swa_scratch_reuse`.
2. `_KVCache.resize()` partitions new blocks into scratch and normal blocks.
3. Added `ScratchDesc` and `PageIndexMode` to handle the two-source index conversion logic. `PageIndexMode.PER_LAYER` implies the converted indices include the layer's position within the coalesced slot, while `PageIndexMode.SHARED` indicates that the base pointer holds the per-layer offset.
4. `PageIndexConverter.__call__` now supports processing base indices via scratch mode when configured.
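
As a rough illustration of the two index modes, the sketch below computes the address of one layer's sub-page both ways. The memory layout (a coalesced slot made of contiguous, equally sized per-layer sub-pages), the function `page_address`, and all of its parameters are assumptions made for this example; only the enum member names come from the PR.

```python
from enum import Enum, auto


class PageIndexMode(Enum):  # stand-in for the enum added by this PR
    SHARED = auto()
    PER_LAYER = auto()


def page_address(pool_base, page_bytes, layers_per_slot, slot_index, layer_offset, mode):
    """Byte address of one layer's sub-page under an assumed layout.

    Assumed layout: a coalesced slot is `layers_per_slot` contiguous sub-pages
    of `page_bytes` each, one per layer.
    """
    if mode is PageIndexMode.PER_LAYER:
        # The index itself encodes the layer's position within the slot.
        index = slot_index * layers_per_slot + layer_offset
        return pool_base + index * page_bytes
    # SHARED: every layer uses the same index; the per-layer offset is folded
    # into the base pointer and the stride is a whole slot.
    layer_base = pool_base + layer_offset * page_bytes
    return layer_base + slot_index * layers_per_slot * page_bytes


addr_per_layer = page_address(0x1000, 256, 32, 3, 7, PageIndexMode.PER_LAYER)
addr_shared = page_address(0x1000, 256, 32, 3, 7, PageIndexMode.SHARED)
```

Under this assumed layout both modes reach the same bytes for a normal slot. The per-layer form is the one that can still describe a scratch slot, where a sub-page is reinterpreted to hold a different block for the currently executing layer.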

Trade-off:
KV cache prefix reuse is degraded since scratch blocks have no preserved KV data after the step.

Made-with: Cursor

Summary by CodeRabbit

New Features
- Added scratch memory reuse support for sliding window attention models, improving KV cache efficiency during prefill operations.
- Extended public API with new page indexing modes and scratch descriptor utilities.
Documentation
- Added comprehensive subsystem documentation with build instructions and architectural guidance.

Description

Test Coverage

lowsfer · 2026-04-23T07:38:19Z

/bot run --disable-fail-fast

tensorrt-cicd · 2026-04-23T07:44:06Z

PR_Github #45148 [ run ] triggered by Bot. Commit: 6165374 Link to invocation

coderabbitai · 2026-04-23T07:45:21Z

📝 Walkthrough

Walkthrough

The pull request introduces SWA (Sliding Window Attention) scratch-slot reuse to the KV cache manager subsystem. Changes include a new PageIndexMode enum to distinguish indexing strategies, a configuration option to enable scratch reuse, a scratch descriptor dataclass for slot/range management, page index conversion that is aware of the index mode and scratch state, slot lock utilities for RAII-style slot management, lifecycle management updates for scratch blocks, and test coverage for the new scratch reuse behavior across resize operations and lifecycle transitions.

Changes

| Cohort / File(s) | Summary |
|---|---|
| Documentation & Guidance `tensorrt_llm/runtime/kv_cache_manager_v2/CLAUDE.md` | New repository guidance document explaining KVCacheManagerV2 architecture, build/test procedures, environment variables, and implementation gotchas. |
| Core Enums & Configuration `tensorrt_llm/runtime/kv_cache_manager_v2/_common.py`, `tensorrt_llm/runtime/kv_cache_manager_v2/_config.py` | Introduces `PageIndexMode` enum (SHARED/PER_LAYER) and adds `enable_swa_scratch_reuse` boolean config option to `KVCacheManagerConfig`. |
| Public API Exports `tensorrt_llm/runtime/kv_cache_manager_v2/__init__.py`, `tensorrt_llm/runtime/kv_cache_manager_v2/__init__.pyi`, `tensorrt_llm/runtime/kv_cache_manager_v2/_core/__init__.py` | Exports new public types: `PageIndexMode`, `PageIndexConverter`, `ScratchDesc` with updated stub signatures including scratch-aware APIs and `page_index_mode` property. |
| Scratch Slot Management `tensorrt_llm/runtime/kv_cache_manager_v2/_page.py` | Introduces `TempSlotLock` and `ScratchSlotLock` RAII-style helpers with explicit unlock/destructor logic, optional CUDA synchronization, and slot-ready-event reattachment support. |
| KVCache Scratch Integration `tensorrt_llm/runtime/kv_cache_manager_v2/_core/_kv_cache.py` | Extends `_KVCache` with per-lifecycle scratch slot locks, scratch metadata exposure (`get_scratch_desc`, `has_scratch_slots`), mode selection based on allocation, updated `resize()` for scratch block ordinal computation and slot management, and lifecycle transition handling for scratch slots. |
| Page Index Conversion & Scratch Descriptors `tensorrt_llm/runtime/kv_cache_manager_v2/_core/_kv_cache_manager.py` | Updates `PageIndexConverter` to accept base index sequences with optional mode and scratch descriptor, conditionally applies per-layer offsets, and remaps scratch blocks. Introduces `ScratchDesc` dataclass. Updates `KVCacheManager.get_mem_pool_base_address` to accept optional `index_mode` parameter. |
| Lifecycle Type Safety `tensorrt_llm/runtime/kv_cache_manager_v2/_life_cycle_registry.py` | Updates stale-range computation in `AttnLifeCycle` and `SsmLifeCycle` to use typed `HalfOpenRange[BlockOrdinal]` instead of untyped ranges. |
| Storage Layer Attributes `tensorrt_llm/runtime/kv_cache_manager_v2/_storage/_config.py`, `tensorrt_llm/runtime/kv_cache_manager_v2/_storage_manager.py` | Adds `LayerAttr` dataclass to track per-layer metadata and slot utilization. Updates storage manager to compute and expose layer attributes and per-lifecycle max slot utilization fractions. Changes `get_mem_pool_base_address` signature to accept pool group/pool indices directly. |
| Utility Updates `tensorrt_llm/runtime/kv_cache_manager_v2/_utils.py` | Adds `__contains__` to `HalfOpenRange` for membership checks. Changes `to_typed` index_type parameter from `Type[Index]` to `Callable[[Any], Index]`. |
| Test Infrastructure & Coverage `tests/unittest/kv_cache_manager_v2_tests/fake_engine.py`, `tests/unittest/kv_cache_manager_v2_tests/test_kv_cache_manager_v2.py` | Updates `FakeEngine` to use mode-aware page index conversion with scratch descriptors. Adds new `TestScratchReuse` test suite validating scratch allocation reduction, shared slot IDs, and behavior across resize chunk/window-size variations and lifecycle transitions. |

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
|---|---|---|---|
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 28.92% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (4 passed)

| Check name | Status | Explanation |
|---|---|---|
| Title check | ✅ Passed | The PR title clearly and specifically describes the main feature being introduced: SWA prefill memory reuse using scratch slots, with proper ticket reference and type designation. |
| Description check | ✅ Passed | The PR description comprehensively explains the feature, motivations, memory savings, implementation approach, and trade-offs, though the Description and Test Coverage sections are not explicitly filled following the template structure. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |

coderabbitai

Actionable comments posted: 3

🧹 Nitpick comments (6)

tensorrt_llm/runtime/kv_cache_manager_v2/CLAUDE.md (1)

15-32: Make command examples repo-portable instead of host-specific.

Using ~/tekit/... makes the instructions brittle for other dev environments. Consider using a repo-root variable.

Proposed doc refactor

+REPO_ROOT="$(git rev-parse --show-toplevel)"
-PYTHONPATH=~/tekit/tensorrt_llm/runtime/ \
-    python ~/tekit/tests/unittest/kv_cache_manager_v2_tests/test_kv_cache_manager_v2.py -v
+PYTHONPATH="$REPO_ROOT/tensorrt_llm/runtime" \
+    python "$REPO_ROOT/tests/unittest/kv_cache_manager_v2_tests/test_kv_cache_manager_v2.py" -v

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In `@tensorrt_llm/runtime/kv_cache_manager_v2/CLAUDE.md` around lines 15 - 32,
Update the example commands in the "Fast mode", "Single test class or method",
and "Production mode" sections so they are repo-portable instead of referencing
a hardcoded home path (~/tekit); replace the literal paths used in the
PYTHONPATH and python invocations with a repo-root variable or command
substitution (e.g., REPO_ROOT placeholder or $(git rev-parse --show-toplevel) /
$(pwd)) so contributors can run the examples from any clone; keep the same
examples and flags but update the paths in those three command blocks to use the
chosen repo-root variable (refer to the command examples under "Fast mode",
"Single test class or method", and "Production mode").

tensorrt_llm/runtime/kv_cache_manager_v2/_utils.py (1)

272-283: Clarify index_type docstring to match the new callable annotation.

Line 272 now accepts a callable, but the parameter docs still read like a concrete type-only contract. Tightening wording will reduce confusion for NewType constructors.

Suggested docstring tweak

 def to_typed(index_type: Callable[[Any], Index], lst: list[T]) -> TypedIndexList[Index, T]:
@@
-    Parameters:
-        index_type: A type alias for the NewType index, e.g. type(BlockOrdinal(0)) or a concrete class derived from int.
+    Parameters:
+        index_type: A callable that constructs the typed index (e.g. BlockOrdinal), or an int subclass.
         lst: The list to cast

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In `@tensorrt_llm/runtime/kv_cache_manager_v2/_utils.py` around lines 272 - 283,
Update the docstring of to_typed to reflect that index_type is a callable (e.g.,
a NewType constructor or any callable that produces the Index from a base value)
rather than a concrete type; mention it can be a NewType factory like
BlockOrdinal or any callable that accepts an integer (or base index) and returns
an Index, and clarify that it is only used for typing/casting and not invoked on
list elements. Target the to_typed function's parameter docs (index_type and
lst) and adjust phrasing to remove ambiguity about concrete vs callable types.

tests/unittest/kv_cache_manager_v2_tests/test_kv_cache_manager_v2.py (1)

1867-2158: Add pre-merge perf coverage for the scratch-reuse path.

These unit tests cover correctness, but this PR changes KV cache management and the feature’s main value is memory reduction. Please add a test-db perf entry that exercises SWA scratch reuse; QA functional list updates are unnecessary for this unit-only addition, but a perf-list follow-up is warranted if you want scheduled coverage. As per coding guidelines, "If the PR touches performance-sensitive paths (attention kernels, MoE routing/dispatch, KV cache management, scheduler, batching logic, CUDA graph capture, speculative decoding, or quantization kernels), check whether a perf test entry is present or updated in: (a) tests/integration/test_lists/test-db/l0_perf.yml ... and (b) tests/integration/test_lists/qa/llm_perf_*.yml ..."
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/unittest/kv_cache_manager_v2_tests/test_kv_cache_manager_v2.py` around
lines 1867 - 2158, Add a perf-test entry that exercises the SWA scratch-reuse
path by invoking the new TestScratchReuse tests (e.g.,
TestScratchReuse::test_scratch_slot_count or
TestScratchReuse::test_scratch_shared_slot_ids) so the memory-reduction behavior
is covered by perf CI; update the L0 perf list (l0_perf.yml) to include a job
that runs pytest filtering for TestScratchReuse (or the specific test names)
with the same config/quota used in the unit tests, and also add a corresponding
entry to the QA perf list (llm_perf_*.yml) per guidelines so this
performance-sensitive change is tracked.

tensorrt_llm/runtime/kv_cache_manager_v2/_core/_kv_cache.py (3)

753-757: Prefix unused variable with underscore.

Static analysis indicates scratch_ranges is unpacked but never used in the resume() method.

🔧 Proposed fix

         stale_scratch_slots, delta_scratch_slots, _scratch_ranges = self._take_stale_scratch_slots(
             self.capacity, self.history_length
         )

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In `@tensorrt_llm/runtime/kv_cache_manager_v2/_core/_kv_cache.py` around lines 753
- 757, In the resume() method change the unused unpacked variable from
scratch_ranges to a name starting with an underscore (e.g., _scratch_ranges)
where you call self._take_stale_scratch_slots(self.capacity,
self.history_length) so the tuple still unpacks into stale_scratch_slots,
delta_scratch_slots and _scratch_ranges and static analysis no longer flags the
unused variable.

506-519: Prefix unused variables with underscore.

Static analysis indicates scratch_beg and scratch_end are unpacked but never used. Prefix them with underscores to indicate intentional non-use.

🔧 Proposed fix

                 if enable_scratch:
-                    scratch_beg, scratch_end = scratch_ranges[lc]
+                    _scratch_beg, _scratch_end = scratch_ranges[lc]
                     num_scratch_blocks = len(scratch_ranges[lc])

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In `@tensorrt_llm/runtime/kv_cache_manager_v2/_core/_kv_cache.py` around lines 506
- 519, The unpacked variables scratch_beg and scratch_end in the enable_scratch
branch are not used; update the unpack to indicate intentional non-use by
renaming them to _scratch_beg and _scratch_end where scratch_ranges is unpacked
(within the enable_scratch handling around variables enable_scratch,
scratch_ranges, lc), leaving the rest of the logic (num_scratch_blocks,
num_new_normal_blocks, num_new_slots) unchanged so static-analysis warnings are
silenced.

779-784: Add explicit strict= parameter to zip().

Static analysis recommends adding explicit strict= parameter. Since num_slots and tmp_slots should have matching lengths (both sized by num_life_cycles), using strict=True would catch any mismatch.

🔧 Proposed fix

-            for lc_idx, slot_lst in zip(typed_range(num_life_cycles), tmp_slots):
+            for lc_idx, slot_lst in zip(typed_range(num_life_cycles), tmp_slots, strict=True):

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In `@tensorrt_llm/runtime/kv_cache_manager_v2/_core/_kv_cache.py` around lines 779
- 784, The zip over typed_range(num_life_cycles) and tmp_slots in the loop
should be made explicit about length matching—change the call in _kv_cache.py
where the loop uses "for lc_idx, slot_lst in zip(typed_range(num_life_cycles),
tmp_slots):" to "for lc_idx, slot_lst in zip(typed_range(num_life_cycles),
tmp_slots, strict=True):" so mismatched lengths raise immediately; update the
single location inside the KV cache management logic (referencing variables
typed_range, num_life_cycles, tmp_slots, deferred_slots, scratch_slots_to_add
and the loop body) and ensure the runtime target supports zip(strict=True).

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@tensorrt_llm/runtime/kv_cache_manager_v2/CLAUDE.md`:
- Around line 40-41: The build instruction in CLAUDE.md runs the mypyc setup
from the wrong directory: after entering rawref a single `cd ..` lands back in
kv_cache_manager_v2/ but setup_mypyc.py expects to be run from the runtime/
directory; update the command in CLAUDE.md to change up two directories before
invoking setup_mypyc.py (e.g., use `cd ../.. && python
kv_cache_manager_v2/setup_mypyc.py build_ext --inplace`) so the script
`setup_mypyc.py` is executed with runtime/ as the current working directory.

In `@tests/unittest/kv_cache_manager_v2_tests/test_kv_cache_manager_v2.py`:
- Around line 1974-1983: Add a regression case after the existing commit/close
sequence that reconstructs a new KV cache from the same prompt and asserts the
new cache's num_committed_tokens does not include tokens that were in scratch
slots; locate the block using symbols kv.commit, kv.stop_committing,
kv.has_scratch_slots, kv.close and manager.clear_reusable_blocks, create a
follow-up kv (or reuse manager API to build a cache from the same prompt), and
assert new_kv.num_committed_tokens == expected_non_scratch_prefix_length (i.e.,
only the non-scratch prefix is counted) to prove scratch-range tokens are not
preserved for reuse.
- Around line 1901-1983: The test only inspects ScratchDesc and base page
indices but doesn't assert the real allocated GPU slot count; update
test_scratch_slot_count to query the actual allocated slots after
kv.resize(prompt_len) (before commit) for the layer group and assert it equals
expected_total and is less than num_blocks. Locate where the layer group id is
computed (LayerGroupId(0)) and after scratch_desc is computed call the
manager/kv API that reports current allocated slot count for that layer group
(e.g., a method like kv.get_allocated_slot_count(lg_id) or
manager.get_allocated_slots for the group) and add assertions: actual_allocated
== expected_total and actual_allocated < num_blocks; if such API doesn't exist,
add a small helper that counts non-BAD_PAGE_INDEX entries from
kv.get_base_page_indices(lg_id) to derive the real allocated slot count and
assert it matches expected_total.
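
The fallback helper suggested in the last item could look roughly like the sketch below. It works on a plain list of integers rather than on the real `kv.get_base_page_indices(lg_id)` return value, the sentinel value `-1` for `BAD_PAGE_INDEX` is an assumption, and the helper name is hypothetical. It counts distinct indices on the further assumption that blocks served by the same scratch slot report the same slot ID.

```python
BAD_PAGE_INDEX = -1  # assumed sentinel value; the real constant lives in the library


def count_allocated_slots(base_page_indices, bad_index=BAD_PAGE_INDEX):
    """Count distinct valid slot indices in a base-page-index list.

    Distinct values are counted because, with scratch reuse, several blocks
    may report the same slot ID.
    """
    return len({idx for idx in base_page_indices if idx != bad_index})


# Hypothetical index list: 27 scratch blocks sharing slot 0, then 5 normal slots.
indices = [0] * 27 + [1, 2, 3, 4, 5]
actual_allocated = count_allocated_slots(indices)
```

A test could then assert `actual_allocated == expected_total` and `actual_allocated < num_blocks`, as the comment proposes.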

📥 Commits

Reviewing files that changed from the base of the PR and between e3ca723 and 6165374.

📒 Files selected for processing (15)

tensorrt_llm/runtime/kv_cache_manager_v2/CLAUDE.md
tensorrt_llm/runtime/kv_cache_manager_v2/__init__.py
tensorrt_llm/runtime/kv_cache_manager_v2/__init__.pyi
tensorrt_llm/runtime/kv_cache_manager_v2/_common.py
tensorrt_llm/runtime/kv_cache_manager_v2/_config.py
tensorrt_llm/runtime/kv_cache_manager_v2/_core/__init__.py
tensorrt_llm/runtime/kv_cache_manager_v2/_core/_kv_cache.py
tensorrt_llm/runtime/kv_cache_manager_v2/_core/_kv_cache_manager.py
tensorrt_llm/runtime/kv_cache_manager_v2/_life_cycle_registry.py
tensorrt_llm/runtime/kv_cache_manager_v2/_page.py
tensorrt_llm/runtime/kv_cache_manager_v2/_storage/_config.py
tensorrt_llm/runtime/kv_cache_manager_v2/_storage_manager.py
tensorrt_llm/runtime/kv_cache_manager_v2/_utils.py
tests/unittest/kv_cache_manager_v2_tests/fake_engine.py
tests/unittest/kv_cache_manager_v2_tests/test_kv_cache_manager_v2.py

tensorrt-cicd · 2026-04-23T07:49:44Z

PR_Github #45148 [ run ] completed with state FAILURE. Commit: 6165374

Link to invocation

lowsfer · 2026-04-23T10:39:23Z

Responding to CodeRabbit nitpick comments from the review:

Nitpick: _utils.py line 272-283 — index_type docstring
Not a real issue. NewType constructors ARE callables — that's exactly how Python's NewType works. The annotation Callable[[Any], Index] matches the docstring description ("A type alias for the NewType index, e.g. type(BlockOrdinal(0))"). No change needed.
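
The point about `NewType` can be checked directly. The sketch below is standalone: `make_index` is a hypothetical helper written for this example, not the library's `to_typed`, and `BlockOrdinal` is redefined locally. It also shows that `type(BlockOrdinal(0))` evaluates to plain `int` at runtime, since `NewType` adds no wrapper class.

```python
from typing import Any, Callable, NewType, TypeVar

BlockOrdinal = NewType("BlockOrdinal", int)
Index = TypeVar("Index")


def make_index(index_type: Callable[[Any], Index], value: int) -> Index:
    # Accepts a NewType constructor and a plain int subclass alike.
    return index_type(value)


ordinal = make_index(BlockOrdinal, 3)
is_callable = callable(BlockOrdinal)
runtime_type = type(BlockOrdinal(0))  # NewType adds no runtime wrapper
```

Both `BlockOrdinal` and `int` therefore satisfy a `Callable[[Any], Index]` annotation.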

lowsfer · 2026-04-23T10:40:35Z

Nitpick: CLAUDE.md hardcoded ~/tekit paths (lines 15-32)
Good catch — fixed. Replaced hardcoded ~/tekit paths with REPO_ROOT="$(git rev-parse --show-toplevel)" in all three command blocks (Fast mode, Single test, Production mode).

lowsfer · 2026-04-23T10:41:31Z

Nitpick: Unused scratch_ranges in resume() (_kv_cache.py line 754)
Valid — renamed to _.

lowsfer · 2026-04-23T10:41:59Z

Nitpick: Unused scratch_beg/end in resize() (_kv_cache.py line 507)
Valid — removed the dead unpack line entirely.

lowsfer · 2026-04-23T10:43:00Z

Nitpick: zip(..., strict=True) (_kv_cache.py line 779)
Added strict=True.

lowsfer · 2026-04-23T10:43:27Z

Nitpick: Add perf test CI entries (test file lines 1867-2158)
Not applicable. These are unit tests with FakeEngine (mock GPU operations), not end-to-end inference benchmarks. The perf CI configs (l0_perf.yml, llm_perf_*.yml) are for model-level perf tracking — adding a unit test class there is meaningless.

lowsfer · 2026-04-23T10:52:47Z

/bot run

tensorrt-cicd · 2026-04-23T10:58:33Z

PR_Github #45174 [ run ] triggered by Bot. Commit: 00f81f6 Link to invocation

tensorrt-cicd · 2026-04-23T18:53:42Z

PR_Github #45174 [ run ] completed with state SUCCESS. Commit: 00f81f6
/LLM/main/L0_MergeRequest_PR pipeline #35451 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

Please check the failed tests and fix your PR
If you cannot view the failures, ask the CI triggerer to share details
Once fixed, request an NVIDIA team member to trigger CI again

Link to invocation

lowsfer · 2026-04-24T06:21:23Z

/bot run

tensorrt-cicd · 2026-04-24T06:28:12Z

PR_Github #45352 [ run ] triggered by Bot. Commit: 16713c6 Link to invocation

tensorrt-cicd · 2026-04-24T09:13:10Z

PR_Github #45352 [ run ] completed with state SUCCESS. Commit: 16713c6
/LLM/main/L0_MergeRequest_PR pipeline #35598 completed with status: 'FAILURE'

lowsfer · 2026-04-27T10:25:55Z

/bot run --disable-fail-fast

tensorrt-cicd · 2026-04-27T10:32:01Z

PR_Github #45717 [ run ] triggered by Bot. Commit: 1c0b64f Link to invocation

tensorrt-cicd · 2026-04-27T15:04:47Z

PR_Github #45717 [ run ] completed with state FAILURE. Commit: 1c0b64f
/LLM/main/L0_MergeRequest_PR pipeline #35918 completed with status: 'FAILURE'

lowsfer · 2026-04-28T04:13:44Z

/bot run --disable-fail-fast

tensorrt-cicd · 2026-04-28T04:19:37Z

PR_Github #45847 [ run ] triggered by Bot. Commit: 7c59d97 Link to invocation

tensorrt-cicd · 2026-04-28T13:35:46Z

PR_Github #45847 [ run ] completed with state FAILURE. Commit: 7c59d97
/LLM/main/L0_MergeRequest_PR pipeline #36025 completed with status: 'FAILURE'

lowsfer · 2026-04-29T02:30:00Z

/bot run --disable-fail-fast

lowsfer · 2026-05-05T07:32:36Z

/bot run --disable-fail-fast

tensorrt-cicd · 2026-05-05T07:39:16Z

PR_Github #46771 [ run ] triggered by Bot. Commit: fa68abb Link to invocation

lowsfer · 2026-05-05T10:24:37Z

/bot run --disable-fail-fast

tensorrt-cicd · 2026-05-05T10:30:31Z

PR_Github #46790 [ run ] triggered by Bot. Commit: 8c0751c Link to invocation

tensorrt-cicd · 2026-05-05T14:20:09Z

PR_Github #46790 [ run ] completed with state SUCCESS. Commit: 8c0751c
/LLM/main/L0_MergeRequest_PR pipeline #36812 completed with status: 'FAILURE'

lowsfer · 2026-05-05T15:13:45Z

/bot run --disable-fail-fast

tensorrt-cicd · 2026-05-05T15:21:46Z

PR_Github #46824 [ run ] triggered by Bot. Commit: 8c0751c Link to invocation

tensorrt-cicd · 2026-05-06T14:00:21Z

PR_Github #46824 [ run ] completed with state FAILURE. Commit: 8c0751c
/LLM/main/L0_MergeRequest_PR pipeline #36844 completed with status: 'FAILURE'

lfr-0531 · 2026-05-07T01:04:27Z

/bot run --disable-fail-fast

tensorrt-cicd · 2026-05-07T01:10:53Z

PR_Github #47062 [ run ] triggered by Bot. Commit: 8c0751c Link to invocation

This commit introduces an opt-in memory saving feature for SWA (Sliding Window Attention) layers during prefill.

During prefill of a new request, out-of-window blocks' KV data is only needed during a single layer's attention computation, and can then be overwritten by the next layer. We leverage this by reinterpreting shared sub-pages within a coalesced slot to serve different blocks for the currently executing layer, rather than different layers for the same block.

Memory Savings:
For a 32 SWA layer model with prompt=1024, window=128, and tokens_per_block=32:

- Current peak: 32 coalesced slots (one per block, each storing all 32 layers).
- With scratch reuse: ceil(27/32) = 1 scratch slot + 5 normal slots = 6 slots.
- Total reduction in peak KV cache memory: ~81%.

Implementation details:

1. `KVCacheManagerConfig` introduces `enable_swa_scratch_reuse`.
2. `_KVCache.resize()` partitions new blocks into scratch and normal blocks.
3. Added `ScratchDesc` and `PageIndexMode` to handle the two-source index conversion logic. `PageIndexMode.PER_LAYER` implies the converted indices include the layer's position within the coalesced slot, while `PageIndexMode.SHARED` indicates that the base pointer holds the per-layer offset.
4. `PageIndexConverter.__call__` now supports processing base indices via scratch mode when configured.

Trade-off:
KV cache prefix reuse is degraded since scratch blocks have no preserved KV data after the step.

Signed-off-by: Yao Yao <lowsfer@users.noreply.github.com>
Made-with: Cursor

…er_offset

- Replace `_KVCache.page_index_mode` property with `supports_index_mode(mode)` method that returns bool (PER_LAYER: always True, SHARED: not has_scratch_slots).
- Add `KVCacheManager.supports_index_mode(mode)` returning `bool | None` (True=always, False=never, None=per-instance).
- Always populate `PageIndexConverter.layer_offset` so the converter supports both index modes unconditionally.
- Keep `index_mode` defaulting to None with runtime checks: defaults to SHARED, asserts when scratch is active without explicit mode.
- Add `ScratchDesc.__bool__` for natural truthiness checks on scratch range.
- Update fake_engine and test callers to use new API.

Signed-off-by: Yao Yao <lowsfer@users.noreply.github.com>
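
The per-cache rules in this commit message can be modelled in a few lines. This is a toy sketch, not the PR's code: `SketchKVCache` and `resolve_index_mode` are hypothetical names, the enum is redefined locally, and only the behaviour stated above is reproduced (PER_LAYER always supported, SHARED only without scratch slots, a `None` mode falling back to SHARED with an assertion when scratch is active).

```python
from enum import Enum, auto
from typing import Optional


class PageIndexMode(Enum):  # local stand-in for the PR's enum
    SHARED = auto()
    PER_LAYER = auto()


class SketchKVCache:
    """Toy stand-in for _KVCache; only the scratch flag is modelled."""

    def __init__(self, has_scratch_slots: bool) -> None:
        self.has_scratch_slots = has_scratch_slots

    def supports_index_mode(self, mode: PageIndexMode) -> bool:
        if mode is PageIndexMode.PER_LAYER:
            return True
        return not self.has_scratch_slots


def resolve_index_mode(
    cache: SketchKVCache, index_mode: Optional[PageIndexMode] = None
) -> PageIndexMode:
    if index_mode is None:
        # Default to SHARED, but refuse to guess when scratch is active.
        assert not cache.has_scratch_slots, "scratch is active: pass index_mode explicitly"
        return PageIndexMode.SHARED
    assert cache.supports_index_mode(index_mode), f"{index_mode} is not supported"
    return index_mode
```

A caller with scratch slots would pass `PageIndexMode.PER_LAYER` explicitly, while callers that never enable scratch reuse can keep omitting the argument.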

Signed-off-by: Yao Yao <lowsfer@users.noreply.github.com>

tensorrt-cicd · 2026-05-07T03:26:22Z

PR_Github #47096 [ run ] triggered by Bot. Commit: 8c0751c Link to invocation

lowsfer · 2026-05-07T03:26:59Z

/bot run --disable-fail-fast

tensorrt-cicd · 2026-05-07T03:32:43Z

PR_Github #47098 [ run ] triggered by Bot. Commit: 98b49e1 Link to invocation

tensorrt-cicd · 2026-05-07T08:41:44Z

PR_Github #47098 [ run ] completed with state SUCCESS. Commit: 98b49e1
/LLM/main/L0_MergeRequest_PR pipeline #37067 completed with status: 'FAILURE'

lowsfer · 2026-05-07T09:57:53Z

/bot run --disable-fail-fast

tensorrt-cicd · 2026-05-07T10:04:56Z

PR_Github #47187 [ run ] triggered by Bot. Commit: 98b49e1 Link to invocation

tensorrt-cicd · 2026-05-07T11:46:47Z

PR_Github #47187 [ run ] completed with state SUCCESS. Commit: 98b49e1
/LLM/main/L0_MergeRequest_PR pipeline #37144 completed with status: 'SUCCESS'

CI Report

Link to invocation

…s) (NVIDIA#13368) Signed-off-by: Yao Yao <lowsfer@users.noreply.github.com>

github-actions Bot assigned lowsfer Apr 23, 2026

lowsfer requested review from jiaganc, lfr-0531 and yizhang-nv April 23, 2026 07:37

coderabbitai Bot reviewed Apr 23, 2026

Comment thread tensorrt_llm/runtime/kv_cache_manager_v2/CLAUDE.md Outdated

Comment thread tests/unittest/kv_cache_manager_v2_tests/test_kv_cache_manager_v2.py

Comment thread tests/unittest/kv_cache_manager_v2_tests/test_kv_cache_manager_v2.py Outdated

lowsfer force-pushed the swa-mem-saving branch from 6165374 to 00f81f6 Compare April 23, 2026 10:52

lowsfer force-pushed the swa-mem-saving branch from 00f81f6 to 16713c6 Compare April 24, 2026 06:19

lowsfer force-pushed the swa-mem-saving branch 2 times, most recently from a5c6d75 to fa68abb Compare May 5, 2026 07:32

lowsfer force-pushed the swa-mem-saving branch from fa68abb to 8c0751c Compare May 5, 2026 10:23

yizhang-nv approved these changes May 7, 2026

lowsfer added 3 commits May 7, 2026 03:20

Add per-request SWA scratch reuse toggle

98b49e1

Signed-off-by: Yao Yao <lowsfer@users.noreply.github.com>

lowsfer force-pushed the swa-mem-saving branch from 8c0751c to 98b49e1 Compare May 7, 2026 03:26

lowsfer merged commit 47d7ecc into NVIDIA:main May 7, 2026
6 checks passed

jiaganc mentioned this pull request May 11, 2026

[TRTLLM-12229][feat] Enable DeepSeek V4 scratch reuse #13965

Closed

1 task

jiaganc pushed a commit to jiaganc/TensorRT-LLM that referenced this pull request May 11, 2026

[TRTLLM-12188][feat] Implement SWA prefill memory reuse (scratch slot…

092a6f2

…s) (NVIDIA#13368) Signed-off-by: Yao Yao <lowsfer@users.noreply.github.com>

chuangz0 pushed a commit to chuangz0/TensorRT-LLM that referenced this pull request May 14, 2026

[TRTLLM-12188][feat] Implement SWA prefill memory reuse (scratch slot…

06dab8e

…s) (NVIDIA#13368) Signed-off-by: Yao Yao <lowsfer@users.noreply.github.com>

yufeiwu-nv pushed a commit to yufeiwu-nv/TensorRT-LLM that referenced this pull request May 19, 2026

[TRTLLM-12188][feat] Implement SWA prefill memory reuse (scratch slot…

5db620a

…s) (NVIDIA#13368) Signed-off-by: Yao Yao <lowsfer@users.noreply.github.com>
