[TRTLLM-12188][feat] Implement SWA prefill memory reuse (scratch slots) #13368

Merged
lowsfer merged 3 commits into NVIDIA:main from lowsfer:swa-mem-saving
May 7, 2026

@lowsfer
Member

@lowsfer lowsfer commented Apr 23, 2026

This commit introduces an opt-in memory saving feature for SWA (Sliding Window Attention) layers during prefill.

During prefill of a new request, out-of-window blocks' KV data is only needed during a single layer's attention computation, and can then be overwritten by the next layer. We leverage this by reinterpreting shared sub-pages within a coalesced slot to serve different blocks for the currently executing layer, rather than different layers for the same block.

Memory Savings:
For a 32 SWA layer model with prompt=1024, window=128, and tokens_per_block=32:

  • Current peak: 32 coalesced slots (one per block, each storing all 32 layers).
  • With scratch reuse: ceil(27/32) = 1 scratch slot + 5 normal slots = 6 slots.
  • Total reduction in peak KV cache memory: ~81%.
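The arithmetic above can be reproduced with a small sketch. It is a toy model only: `peak_slots` is a hypothetical helper, and the rule "normal blocks = window blocks + 1" is an assumption chosen because it matches the 5 normal slots quoted above; the actual partition performed by `_KVCache.resize()` may differ.

```python
import math


def peak_slots(prompt_len, window, tokens_per_block, num_layers, scratch_reuse):
    """Peak coalesced slots held by one request during prefill (toy model)."""
    num_blocks = math.ceil(prompt_len / tokens_per_block)
    if not scratch_reuse:
        # One coalesced slot per block, each storing all layers.
        return num_blocks
    # ASSUMPTION: in-window blocks plus one extra block stay "normal".
    num_normal = min(num_blocks, math.ceil(window / tokens_per_block) + 1)
    num_scratch_blocks = num_blocks - num_normal
    # A scratch slot has num_layers sub-pages; each sub-page serves one
    # out-of-window block for the layer that is currently executing.
    num_scratch_slots = math.ceil(num_scratch_blocks / num_layers)
    return num_scratch_slots + num_normal


baseline = peak_slots(1024, 128, 32, 32, scratch_reuse=False)  # 32 slots
reuse = peak_slots(1024, 128, 32, 32, scratch_reuse=True)      # 1 + 5 = 6 slots
reduction = 1 - reuse / baseline                               # 0.8125, i.e. ~81%
```

With these inputs there are 27 out-of-window blocks, which fit in a single 32-sub-page scratch slot, hence 6 slots instead of 32.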

Implementation details:

  1. KVCacheManagerConfig introduces enable_swa_scratch_reuse.
  2. _KVCache.resize() partitions new blocks into scratch and normal blocks.
  3. Added ScratchDesc and PageIndexMode to handle the two-source index conversion logic. PageIndexMode.PER_LAYER implies the converted indices include the layer's position within the coalesced slot, while PageIndexMode.SHARED indicates that the base pointer holds the per-layer offset.
  4. PageIndexConverter.__call__ now supports processing base indices via scratch mode when configured.
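A rough sketch of the two index modes and the scratch remapping, under stated assumptions. Everything here (`ToyIndexMode`, `ToyScratch`, `toy_convert`, the page layout `slot * layers_per_slot + offset`) is hypothetical and only illustrates the idea described above; it is not the signature or layout of `PageIndexConverter` or `ScratchDesc`.

```python
from dataclasses import dataclass
from enum import Enum


class ToyIndexMode(Enum):
    SHARED = "shared"        # base pointer already carries the per-layer offset
    PER_LAYER = "per_layer"  # index itself carries the layer's position


@dataclass(frozen=True)
class ToyScratch:
    begin: int    # first scratch block ordinal (inclusive)
    end: int      # one past the last scratch block ordinal
    slots: tuple  # scratch slot ids backing the range


def toy_convert(base_slots, layer_offset, layers_per_slot, mode, scratch=None):
    """Map one slot id per block to a page index for a single layer."""
    if scratch is not None and mode is ToyIndexMode.SHARED:
        # ASSUMPTION: scratch pages ignore the layer, so a layer-shifted
        # base pointer cannot address them in this toy model.
        raise ValueError("toy model: scratch remapping needs PER_LAYER indices")
    pages = []
    for ordinal, slot in enumerate(base_slots):
        if scratch is not None and scratch.begin <= ordinal < scratch.end:
            # Sub-pages of a scratch slot serve different blocks of the
            # current layer; the entry in base_slots is not used here.
            rel = ordinal - scratch.begin
            scratch_slot = scratch.slots[rel // layers_per_slot]
            pages.append(scratch_slot * layers_per_slot + rel % layers_per_slot)
        elif mode is ToyIndexMode.PER_LAYER:
            pages.append(slot * layers_per_slot + layer_offset)
        else:
            pages.append(slot * layers_per_slot)
    return pages


# Blocks 0..3 are scratch (all in scratch slot 7); blocks 4 and 5 are normal.
mixed = toy_convert([7, 7, 7, 7, 10, 11], 2, 4, ToyIndexMode.PER_LAYER,
                    ToyScratch(0, 4, (7,)))
per_layer = toy_convert([10, 11], 2, 4, ToyIndexMode.PER_LAYER)
shared = toy_convert([10, 11], 2, 4, ToyIndexMode.SHARED)
```

In this model `per_layer[i] == shared[i] + layer_offset`, which is the sense in which the SHARED base pointer "holds" the per-layer offset, while the scratch pages in `mixed` do not depend on the layer at all.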

Trade-off:
KV cache prefix reuse is degraded since scratch blocks have no preserved KV data after the step.

Made-with: Cursor

Summary by CodeRabbit

  • New Features

    • Added scratch memory reuse support for sliding window attention models, improving KV cache efficiency during prefill operations.
    • Extended public API with new page indexing modes and scratch descriptor utilities.
  • Documentation

    • Added comprehensive subsystem documentation with build instructions and architectural guidance.

Description

Test Coverage

PR Checklist

Please review the following before submitting your PR:

  • PR description clearly explains what and why. If using CodeRabbit's summary, please make sure it makes sense.

  • PR Follows TRT-LLM CODING GUIDELINES to the best of your knowledge.

  • Test cases are provided for new code paths (see test instructions)

  • Any new dependencies have been scanned for license and vulnerabilities

  • CODEOWNERS updated if ownership changes

  • Documentation updated as needed

  • Update tava architecture diagram if there is a significant design change in PR.

  • The reviewers assigned automatically/manually are appropriate for the PR.

  • Please check this after reviewing the above items as appropriate for this PR.

GitHub Bot Help

To see a list of available CI bot commands, please comment /bot help.

@lowsfer
Member Author

lowsfer commented Apr 23, 2026

/bot run --disable-fail-fast

@tensorrt-cicd
Collaborator

PR_Github #45148 [ run ] triggered by Bot. Commit: 6165374 Link to invocation

@coderabbitai
Contributor

coderabbitai Bot commented Apr 23, 2026

📝 Walkthrough

The pull request introduces SWA (Sliding Window Attention) scratch-slot reuse functionality to the KV cache manager subsystem. Changes include a new PageIndexMode enum to distinguish indexing strategies, configuration option to enable scratch reuse, scratch descriptor dataclass for slot/range management, enhanced page index conversion with mode/scratch awareness, slot lock utilities for RAII-style slot management, lifecycle management updates for scratch blocks, and comprehensive test coverage for the new scratch reuse behavior across resize operations and lifecycle transitions.

Changes

| Cohort | File(s) | Summary |
|---|---|---|
| Documentation & Guidance | `tensorrt_llm/runtime/kv_cache_manager_v2/CLAUDE.md` | New repository guidance document explaining KVCacheManagerV2 architecture, build/test procedures, environment variables, and implementation gotchas. |
| Core Enums & Configuration | `tensorrt_llm/runtime/kv_cache_manager_v2/_common.py`, `tensorrt_llm/runtime/kv_cache_manager_v2/_config.py` | Introduces PageIndexMode enum (SHARED/PER_LAYER) and adds enable_swa_scratch_reuse boolean config option to KVCacheManagerConfig. |
| Public API Exports | `tensorrt_llm/runtime/kv_cache_manager_v2/__init__.py`, `tensorrt_llm/runtime/kv_cache_manager_v2/__init__.pyi`, `tensorrt_llm/runtime/kv_cache_manager_v2/_core/__init__.py` | Exports new public types: PageIndexMode, PageIndexConverter, ScratchDesc with updated stub signatures including scratch-aware APIs and page_index_mode property. |
| Scratch Slot Management | `tensorrt_llm/runtime/kv_cache_manager_v2/_page.py` | Introduces TempSlotLock and ScratchSlotLock RAII-style helpers with explicit unlock/destructor logic, optional CUDA synchronization, and slot-ready-event reattachment support. |
| KVCache Scratch Integration | `tensorrt_llm/runtime/kv_cache_manager_v2/_core/_kv_cache.py` | Extends _KVCache with per-lifecycle scratch slot locks, scratch metadata exposure (get_scratch_desc, has_scratch_slots), mode selection based on allocation, updated resize() for scratch block ordinal computation and slot management, and lifecycle transition handling for scratch slots. |
| Page Index Conversion & Scratch Descriptors | `tensorrt_llm/runtime/kv_cache_manager_v2/_core/_kv_cache_manager.py` | Updates PageIndexConverter to accept base index sequences with optional mode and scratch descriptor, conditionally applies per-layer offsets, and remaps scratch blocks. Introduces ScratchDesc dataclass. Updates KVCacheManager.get_mem_pool_base_address to accept optional index_mode parameter. |
| Lifecycle Type Safety | `tensorrt_llm/runtime/kv_cache_manager_v2/_life_cycle_registry.py` | Updates stale-range computation in AttnLifeCycle and SsmLifeCycle to use typed HalfOpenRange[BlockOrdinal] instead of untyped ranges. |
| Storage Layer Attributes | `tensorrt_llm/runtime/kv_cache_manager_v2/_storage/_config.py`, `tensorrt_llm/runtime/kv_cache_manager_v2/_storage_manager.py` | Adds LayerAttr dataclass to track per-layer metadata and slot utilization. Updates storage manager to compute and expose layer attributes and per-lifecycle max slot utilization fractions. Changes get_mem_pool_base_address signature to accept pool group/pool indices directly. |
| Utility Updates | `tensorrt_llm/runtime/kv_cache_manager_v2/_utils.py` | Adds `__contains__` to HalfOpenRange for membership checks. Changes to_typed index_type parameter from Type[Index] to Callable[[Any], Index]. |
| Test Infrastructure & Coverage | `tests/unittest/kv_cache_manager_v2_tests/fake_engine.py`, `tests/unittest/kv_cache_manager_v2_tests/test_kv_cache_manager_v2.py` | Updates FakeEngine to use mode-aware page index conversion with scratch descriptors. Adds new TestScratchReuse test suite validating scratch allocation reduction, shared slot IDs, and behavior across resize chunk/window-size variations and lifecycle transitions. |

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
|---|---|---|---|
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 28.92% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (4 passed)

| Check name | Status | Explanation |
|---|---|---|
| Title check | ✅ Passed | The PR title clearly and specifically describes the main feature being introduced: SWA prefill memory reuse using scratch slots, with proper ticket reference and type designation. |
| Description check | ✅ Passed | The PR description comprehensively explains the feature, motivations, memory savings, implementation approach, and trade-offs, though the Description and Test Coverage sections are not explicitly filled following the template structure. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |

Contributor

@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 3

🧹 Nitpick comments (6)
tensorrt_llm/runtime/kv_cache_manager_v2/CLAUDE.md (1)

15-32: Make command examples repo-portable instead of host-specific.

Using ~/tekit/... makes the instructions brittle for other dev environments. Consider using a repo-root variable.

Proposed doc refactor
+REPO_ROOT="$(git rev-parse --show-toplevel)"
-PYTHONPATH=~/tekit/tensorrt_llm/runtime/ \
-    python ~/tekit/tests/unittest/kv_cache_manager_v2_tests/test_kv_cache_manager_v2.py -v
+PYTHONPATH="$REPO_ROOT/tensorrt_llm/runtime" \
+    python "$REPO_ROOT/tests/unittest/kv_cache_manager_v2_tests/test_kv_cache_manager_v2.py" -v
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tensorrt_llm/runtime/kv_cache_manager_v2/CLAUDE.md` around lines 15 - 32,
Update the example commands in the "Fast mode", "Single test class or method",
and "Production mode" sections so they are repo-portable instead of referencing
a hardcoded home path (~/tekit); replace the literal paths used in the
PYTHONPATH and python invocations with a repo-root variable or command
substitution (e.g., REPO_ROOT placeholder or $(git rev-parse --show-toplevel) /
$(pwd)) so contributors can run the examples from any clone; keep the same
examples and flags but update the paths in those three command blocks to use the
chosen repo-root variable (refer to the command examples under "Fast mode",
"Single test class or method", and "Production mode").
tensorrt_llm/runtime/kv_cache_manager_v2/_utils.py (1)

272-283: Clarify index_type docstring to match the new callable annotation.

Line 272 now accepts a callable, but the parameter docs still read like a concrete type-only contract. Tightening wording will reduce confusion for NewType constructors.

Suggested docstring tweak
 def to_typed(index_type: Callable[[Any], Index], lst: list[T]) -> TypedIndexList[Index, T]:
@@
-    Parameters:
-        index_type: A type alias for the NewType index, e.g. type(BlockOrdinal(0)) or a concrete class derived from int.
+    Parameters:
+        index_type: A callable that constructs the typed index (e.g. BlockOrdinal), or an int subclass.
         lst: The list to cast
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tensorrt_llm/runtime/kv_cache_manager_v2/_utils.py` around lines 272 - 283,
Update the docstring of to_typed to reflect that index_type is a callable (e.g.,
a NewType constructor or any callable that produces the Index from a base value)
rather than a concrete type; mention it can be a NewType factory like
BlockOrdinal or any callable that accepts an integer (or base index) and returns
an Index, and clarify that it is only used for typing/casting and not invoked on
list elements. Target the to_typed function's parameter docs (index_type and
lst) and adjust phrasing to remove ambiguity about concrete vs callable types.
tests/unittest/kv_cache_manager_v2_tests/test_kv_cache_manager_v2.py (1)

1867-2158: Add pre-merge perf coverage for the scratch-reuse path.

These unit tests cover correctness, but this PR changes KV cache management and the feature’s main value is memory reduction. Please add a test-db perf entry that exercises SWA scratch reuse; QA functional list updates are unnecessary for this unit-only addition, but a perf-list follow-up is warranted if you want scheduled coverage. As per coding guidelines, "If the PR touches performance-sensitive paths (attention kernels, MoE routing/dispatch, KV cache management, scheduler, batching logic, CUDA graph capture, speculative decoding, or quantization kernels), check whether a perf test entry is present or updated in: (a) tests/integration/test_lists/test-db/l0_perf.yml ... and (b) tests/integration/test_lists/qa/llm_perf_*.yml ..."

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/unittest/kv_cache_manager_v2_tests/test_kv_cache_manager_v2.py` around
lines 1867 - 2158, Add a perf-test entry that exercises the SWA scratch-reuse
path by invoking the new TestScratchReuse tests (e.g.,
TestScratchReuse::test_scratch_slot_count or
TestScratchReuse::test_scratch_shared_slot_ids) so the memory-reduction behavior
is covered by perf CI; update the L0 perf list (l0_perf.yml) to include a job
that runs pytest filtering for TestScratchReuse (or the specific test names)
with the same config/quota used in the unit tests, and also add a corresponding
entry to the QA perf list (llm_perf_*.yml) per guidelines so this
performance-sensitive change is tracked.
tensorrt_llm/runtime/kv_cache_manager_v2/_core/_kv_cache.py (3)

753-757: Prefix unused variable with underscore.

Static analysis indicates scratch_ranges is unpacked but never used in the resume() method.

🔧 Proposed fix
         stale_scratch_slots, delta_scratch_slots, _scratch_ranges = self._take_stale_scratch_slots(
             self.capacity, self.history_length
         )
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tensorrt_llm/runtime/kv_cache_manager_v2/_core/_kv_cache.py` around lines 753
- 757, In the resume() method change the unused unpacked variable from
scratch_ranges to a name starting with an underscore (e.g., _scratch_ranges)
where you call self._take_stale_scratch_slots(self.capacity,
self.history_length) so the tuple still unpacks into stale_scratch_slots,
delta_scratch_slots and _scratch_ranges and static analysis no longer flags the
unused variable.

506-519: Prefix unused variables with underscore.

Static analysis indicates scratch_beg and scratch_end are unpacked but never used. Prefix them with underscores to indicate intentional non-use.

🔧 Proposed fix
                 if enable_scratch:
-                    scratch_beg, scratch_end = scratch_ranges[lc]
+                    _scratch_beg, _scratch_end = scratch_ranges[lc]
                     num_scratch_blocks = len(scratch_ranges[lc])
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tensorrt_llm/runtime/kv_cache_manager_v2/_core/_kv_cache.py` around lines 506
- 519, The unpacked variables scratch_beg and scratch_end in the enable_scratch
branch are not used; update the unpack to indicate intentional non-use by
renaming them to _scratch_beg and _scratch_end where scratch_ranges is unpacked
(within the enable_scratch handling around variables enable_scratch,
scratch_ranges, lc), leaving the rest of the logic (num_scratch_blocks,
num_new_normal_blocks, num_new_slots) unchanged so static-analysis warnings are
silenced.

779-784: Add explicit strict= parameter to zip().

Static analysis recommends adding explicit strict= parameter. Since num_slots and tmp_slots should have matching lengths (both sized by num_life_cycles), using strict=True would catch any mismatch.

🔧 Proposed fix
-            for lc_idx, slot_lst in zip(typed_range(num_life_cycles), tmp_slots):
+            for lc_idx, slot_lst in zip(typed_range(num_life_cycles), tmp_slots, strict=True):
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tensorrt_llm/runtime/kv_cache_manager_v2/_core/_kv_cache.py` around lines 779
- 784, The zip over typed_range(num_life_cycles) and tmp_slots in the loop
should be made explicit about length matching—change the call in _kv_cache.py
where the loop uses "for lc_idx, slot_lst in zip(typed_range(num_life_cycles),
tmp_slots):" to "for lc_idx, slot_lst in zip(typed_range(num_life_cycles),
tmp_slots, strict=True):" so mismatched lengths raise immediately; update the
single location inside the KV cache management logic (referencing variables
typed_range, num_life_cycles, tmp_slots, deferred_slots, scratch_slots_to_add
and the loop body) and ensure the runtime target supports zip(strict=True).
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@tensorrt_llm/runtime/kv_cache_manager_v2/CLAUDE.md`:
- Around line 40-41: The build instruction in CLAUDE.md runs the mypyc setup
from the wrong directory: after entering rawref a single `cd ..` lands back in
kv_cache_manager_v2/ but setup_mypyc.py expects to be run from the runtime/
directory; update the command in CLAUDE.md to change up two directories before
invoking setup_mypyc.py (e.g., use `cd ../.. && python
kv_cache_manager_v2/setup_mypyc.py build_ext --inplace`) so the script
`setup_mypyc.py` is executed with runtime/ as the current working directory.

In `@tests/unittest/kv_cache_manager_v2_tests/test_kv_cache_manager_v2.py`:
- Around line 1974-1983: Add a regression case after the existing commit/close
sequence that reconstructs a new KV cache from the same prompt and asserts the
new cache's num_committed_tokens does not include tokens that were in scratch
slots; locate the block using symbols kv.commit, kv.stop_committing,
kv.has_scratch_slots, kv.close and manager.clear_reusable_blocks, create a
follow-up kv (or reuse manager API to build a cache from the same prompt), and
assert new_kv.num_committed_tokens == expected_non_scratch_prefix_length (i.e.,
only the non-scratch prefix is counted) to prove scratch-range tokens are not
preserved for reuse.
- Around line 1901-1983: The test only inspects ScratchDesc and base page
indices but doesn't assert the real allocated GPU slot count; update
test_scratch_slot_count to query the actual allocated slots after
kv.resize(prompt_len) (before commit) for the layer group and assert it equals
expected_total and is less than num_blocks. Locate where the layer group id is
computed (LayerGroupId(0)) and after scratch_desc is computed call the
manager/kv API that reports current allocated slot count for that layer group
(e.g., a method like kv.get_allocated_slot_count(lg_id) or
manager.get_allocated_slots for the group) and add assertions: actual_allocated
== expected_total and actual_allocated < num_blocks; if such API doesn't exist,
add a small helper that counts non-BAD_PAGE_INDEX entries from
kv.get_base_page_indices(lg_id) to derive the real allocated slot count and
assert it matches expected_total.


ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro Plus

Run ID: bb321402-9c01-428d-bca4-b14e508fdca4

📥 Commits

Reviewing files that changed from the base of the PR and between e3ca723 and 6165374.

📒 Files selected for processing (15)
  • tensorrt_llm/runtime/kv_cache_manager_v2/CLAUDE.md
  • tensorrt_llm/runtime/kv_cache_manager_v2/__init__.py
  • tensorrt_llm/runtime/kv_cache_manager_v2/__init__.pyi
  • tensorrt_llm/runtime/kv_cache_manager_v2/_common.py
  • tensorrt_llm/runtime/kv_cache_manager_v2/_config.py
  • tensorrt_llm/runtime/kv_cache_manager_v2/_core/__init__.py
  • tensorrt_llm/runtime/kv_cache_manager_v2/_core/_kv_cache.py
  • tensorrt_llm/runtime/kv_cache_manager_v2/_core/_kv_cache_manager.py
  • tensorrt_llm/runtime/kv_cache_manager_v2/_life_cycle_registry.py
  • tensorrt_llm/runtime/kv_cache_manager_v2/_page.py
  • tensorrt_llm/runtime/kv_cache_manager_v2/_storage/_config.py
  • tensorrt_llm/runtime/kv_cache_manager_v2/_storage_manager.py
  • tensorrt_llm/runtime/kv_cache_manager_v2/_utils.py
  • tests/unittest/kv_cache_manager_v2_tests/fake_engine.py
  • tests/unittest/kv_cache_manager_v2_tests/test_kv_cache_manager_v2.py

Comment thread tensorrt_llm/runtime/kv_cache_manager_v2/CLAUDE.md Outdated
Comment thread tests/unittest/kv_cache_manager_v2_tests/test_kv_cache_manager_v2.py Outdated
@tensorrt-cicd
Collaborator

PR_Github #45148 [ run ] completed with state FAILURE. Commit: 6165374

Link to invocation

@lowsfer
Member Author

lowsfer commented Apr 23, 2026

Responding to CodeRabbit nitpick comments from the review:

Nitpick: _utils.py line 272-283 — index_type docstring
Not a real issue. NewType constructors ARE callables — that's exactly how Python's NewType works. The annotation Callable[[Any], Index] matches the docstring description ("A type alias for the NewType index, e.g. type(BlockOrdinal(0))"). No change needed.
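A quick way to check the claim, as a sketch: `BlockOrdinal` is redefined locally here as a stand-in for the project's type, and `make_index` is a hypothetical helper that is not part of `_utils.py`.

```python
from typing import Any, Callable, NewType, TypeVar

# Local stand-in for the project's BlockOrdinal (assumption for this example).
BlockOrdinal = NewType("BlockOrdinal", int)
Index = TypeVar("Index", bound=int)


def make_index(index_type: Callable[[Any], Index], value: int) -> Index:
    """Hypothetical helper: accepts any callable that builds an index."""
    return index_type(value)


newtype_is_callable = callable(BlockOrdinal)  # the NewType object can be called
runtime_type = type(BlockOrdinal(0))          # int: NewType adds no runtime wrapper
from_newtype = make_index(BlockOrdinal, 3)
from_class = make_index(int, 3)               # a concrete class is a callable too
```

Both the `NewType` constructor and the concrete class returned by `type(BlockOrdinal(0))` satisfy `Callable[[Any], Index]`, so either can be passed where the annotation asks for a callable.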

@lowsfer
Member Author

lowsfer commented Apr 23, 2026

Nitpick: CLAUDE.md hardcoded ~/tekit paths (lines 15-32)
Good catch — fixed. Replaced hardcoded ~/tekit paths with REPO_ROOT="$(git rev-parse --show-toplevel)" in all three command blocks (Fast mode, Single test, Production mode).

@lowsfer
Member Author

lowsfer commented Apr 23, 2026

Nitpick: Unused scratch_ranges in resume() (_kv_cache.py line 754)
Valid — renamed to _.

@lowsfer
Member Author

lowsfer commented Apr 23, 2026

Nitpick: Unused scratch_beg/end in resize() (_kv_cache.py line 507)
Valid — removed the dead unpack line entirely.

@lowsfer
Member Author

lowsfer commented Apr 23, 2026

Nitpick: zip(..., strict=True) (_kv_cache.py line 779)
Added strict=True.

@lowsfer
Member Author

lowsfer commented Apr 23, 2026

Nitpick: Add perf test CI entries (test file lines 1867-2158)
Not applicable. These are unit tests with FakeEngine (mock GPU operations), not end-to-end inference benchmarks. The perf CI configs (l0_perf.yml, llm_perf_*.yml) are for model-level perf tracking — adding a unit test class there is meaningless.

@lowsfer
Member Author

lowsfer commented Apr 23, 2026

/bot run

@tensorrt-cicd
Collaborator

PR_Github #45174 [ run ] triggered by Bot. Commit: 00f81f6 Link to invocation

@tensorrt-cicd
Collaborator

PR_Github #45174 [ run ] completed with state SUCCESS. Commit: 00f81f6
/LLM/main/L0_MergeRequest_PR pipeline #35451 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

Link to invocation

@lowsfer
Member Author

lowsfer commented Apr 24, 2026

/bot run

@tensorrt-cicd
Collaborator

PR_Github #45352 [ run ] triggered by Bot. Commit: 16713c6 Link to invocation

@tensorrt-cicd
Collaborator

PR_Github #45352 [ run ] completed with state SUCCESS. Commit: 16713c6
/LLM/main/L0_MergeRequest_PR pipeline #35598 completed with status: 'FAILURE'

CI Report

Link to invocation

@lowsfer
Member Author

lowsfer commented Apr 27, 2026

/bot run --disable-fail-fast

@tensorrt-cicd
Collaborator

PR_Github #45717 [ run ] triggered by Bot. Commit: 1c0b64f Link to invocation

@tensorrt-cicd
Collaborator

PR_Github #45717 [ run ] completed with state FAILURE. Commit: 1c0b64f
/LLM/main/L0_MergeRequest_PR pipeline #35918 completed with status: 'FAILURE'

CI Report

Link to invocation

@lowsfer
Member Author

lowsfer commented Apr 28, 2026

/bot run --disable-fail-fast

@tensorrt-cicd
Collaborator

PR_Github #45847 [ run ] triggered by Bot. Commit: 7c59d97 Link to invocation

@tensorrt-cicd
Collaborator

PR_Github #45847 [ run ] completed with state FAILURE. Commit: 7c59d97
/LLM/main/L0_MergeRequest_PR pipeline #36025 completed with status: 'FAILURE'

CI Report

Link to invocation

@lowsfer
Member Author

lowsfer commented Apr 29, 2026

/bot run --disable-fail-fast

@lowsfer lowsfer force-pushed the swa-mem-saving branch 2 times, most recently from a5c6d75 to fa68abb May 5, 2026 07:32
@lowsfer
Member Author

lowsfer commented May 5, 2026

/bot run --disable-fail-fast

@tensorrt-cicd
Collaborator

PR_Github #46771 [ run ] triggered by Bot. Commit: fa68abb Link to invocation

@lowsfer
Member Author

lowsfer commented May 5, 2026

/bot run --disable-fail-fast

@tensorrt-cicd
Collaborator

PR_Github #46790 [ run ] triggered by Bot. Commit: 8c0751c Link to invocation

@tensorrt-cicd
Collaborator

PR_Github #46790 [ run ] completed with state SUCCESS. Commit: 8c0751c
/LLM/main/L0_MergeRequest_PR pipeline #36812 completed with status: 'FAILURE'

CI Report

CI Agent Failure Analysis

Link to invocation

@lowsfer
Member Author

lowsfer commented May 5, 2026

/bot run --disable-fail-fast

@tensorrt-cicd
Collaborator

PR_Github #46824 [ run ] triggered by Bot. Commit: 8c0751c Link to invocation

@tensorrt-cicd
Collaborator

PR_Github #46824 [ run ] completed with state FAILURE. Commit: 8c0751c
/LLM/main/L0_MergeRequest_PR pipeline #36844 completed with status: 'FAILURE'

CI Report

CI Agent Failure Analysis

Link to invocation

@lfr-0531
Collaborator

lfr-0531 commented May 7, 2026

/bot run --disable-fail-fast

@tensorrt-cicd
Collaborator

PR_Github #47062 [ run ] triggered by Bot. Commit: 8c0751c Link to invocation

lowsfer added 3 commits May 7, 2026 03:20
This commit introduces an opt-in memory saving feature for SWA (Sliding Window Attention) layers during prefill.

During prefill of a new request, out-of-window blocks' KV data is only needed during a single layer's attention computation, and can then be overwritten by the next layer. We leverage this by reinterpreting shared sub-pages within a coalesced slot to serve different blocks for the currently executing layer, rather than different layers for the same block.

Memory Savings:
For a 32 SWA layer model with prompt=1024, window=128, and tokens_per_block=32:
- Current peak: 32 coalesced slots (one per block, each storing all 32 layers).
- With scratch reuse: ceil(27/32) = 1 scratch slot + 5 normal slots = 6 slots.
- Total reduction in peak KV cache memory: ~81%.

Implementation details:
1. `KVCacheManagerConfig` introduces `enable_swa_scratch_reuse`.
2. `_KVCache.resize()` partitions new blocks into scratch and normal blocks.
3. Added `ScratchDesc` and `PageIndexMode` to handle the two-source index conversion logic. `PageIndexMode.PER_LAYER` implies the converted indices include the layer's position within the coalesced slot, while `PageIndexMode.SHARED` indicates that the base pointer holds the per-layer offset.
4. `PageIndexConverter.__call__` now supports processing base indices via scratch mode when configured.

Trade-off:
KV cache prefix reuse is degraded since scratch blocks have no preserved KV data after the step.

Signed-off-by: Yao Yao <lowsfer@users.noreply.github.com>
Made-with: Cursor
…er_offset

- Replace _KVCache.page_index_mode property with supports_index_mode(mode)
  method that returns bool (PER_LAYER: always True, SHARED: not has_scratch_slots).
- Add KVCacheManager.supports_index_mode(mode) returning bool | None
  (True=always, False=never, None=per-instance).
- Always populate PageIndexConverter.layer_offset so the converter supports
  both index modes unconditionally.
- Keep index_mode defaulting to None with runtime checks: defaults to SHARED,
  asserts when scratch is active without explicit mode.
- Add ScratchDesc.__bool__ for natural truthiness checks on scratch range.
- Update fake_engine and test callers to use new API.
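
The support rules listed above can be sketched as follows. This is a toy model, not the project's code: `ToyIndexMode`, `cache_supports`, `manager_supports`, and `resolve_mode` are hypothetical names; the rule for when the manager answers `None` is an assumption, and the sketch raises `ValueError` where the commit message says the real code asserts.

```python
from enum import Enum
from typing import Optional


class ToyIndexMode(Enum):
    SHARED = "shared"
    PER_LAYER = "per_layer"


def cache_supports(mode: ToyIndexMode, has_scratch_slots: bool) -> bool:
    """Per-cache answer: PER_LAYER always, SHARED only without scratch slots."""
    if mode is ToyIndexMode.PER_LAYER:
        return True
    return not has_scratch_slots


def manager_supports(mode: ToyIndexMode, scratch_reuse_enabled: bool) -> Optional[bool]:
    """Manager-level answer: True=always, False=never, None=per-instance."""
    # ASSUMPTION: SHARED becomes a per-instance question only when scratch
    # reuse is enabled; this toy never needs to answer False.
    if mode is ToyIndexMode.PER_LAYER or not scratch_reuse_enabled:
        return True
    return None


def resolve_mode(requested: Optional[ToyIndexMode], has_scratch_slots: bool) -> ToyIndexMode:
    """Default to SHARED, but insist on an explicit mode when scratch is active."""
    if requested is None:
        if has_scratch_slots:
            raise ValueError("scratch slots are active: pass an explicit index mode")
        return ToyIndexMode.SHARED
    if not cache_supports(requested, has_scratch_slots):
        raise ValueError(f"{requested.name} is not usable while scratch slots are active")
    return requested
```

A caller that may see scratch slots would pass `ToyIndexMode.PER_LAYER` explicitly; callers that never enable scratch reuse can keep relying on the SHARED default.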

Signed-off-by: Yao Yao <lowsfer@users.noreply.github.com>
Signed-off-by: Yao Yao <lowsfer@users.noreply.github.com>
@tensorrt-cicd
Collaborator

PR_Github #47096 [ run ] triggered by Bot. Commit: 8c0751c Link to invocation

@lowsfer
Member Author

lowsfer commented May 7, 2026

/bot run --disable-fail-fast

@tensorrt-cicd
Collaborator

PR_Github #47098 [ run ] triggered by Bot. Commit: 98b49e1 Link to invocation

@tensorrt-cicd
Collaborator

PR_Github #47098 [ run ] completed with state SUCCESS. Commit: 98b49e1
/LLM/main/L0_MergeRequest_PR pipeline #37067 completed with status: 'FAILURE'

CI Report

CI Agent Failure Analysis

Link to invocation

@lowsfer
Member Author

lowsfer commented May 7, 2026

/bot run --disable-fail-fast

@tensorrt-cicd
Collaborator

PR_Github #47187 [ run ] triggered by Bot. Commit: 98b49e1 Link to invocation

@tensorrt-cicd
Collaborator

PR_Github #47187 [ run ] completed with state SUCCESS. Commit: 98b49e1
/LLM/main/L0_MergeRequest_PR pipeline #37144 completed with status: 'SUCCESS'

CI Report

Link to invocation

@lowsfer lowsfer merged commit 47d7ecc into NVIDIA:main May 7, 2026
6 checks passed
jiaganc pushed a commit to jiaganc/TensorRT-LLM that referenced this pull request May 11, 2026
…s) (NVIDIA#13368)

Signed-off-by: Yao Yao <lowsfer@users.noreply.github.com>
chuangz0 pushed a commit to chuangz0/TensorRT-LLM that referenced this pull request May 14, 2026
…s) (NVIDIA#13368)

Signed-off-by: Yao Yao <lowsfer@users.noreply.github.com>
yufeiwu-nv pushed a commit to yufeiwu-nv/TensorRT-LLM that referenced this pull request May 19, 2026
…s) (NVIDIA#13368)

Signed-off-by: Yao Yao <lowsfer@users.noreply.github.com>

6 participants