
Prefix caching | Mamba memory only. #3657

Merged
lmcafee-nvidia merged 2 commits into NVIDIA:main from
lmcafee-nvidia:prefix-caching-mamba-memory-only
Mar 3, 2026

Conversation

@lmcafee-nvidia
Contributor

Summary

  • Hybrid models (Transformer + Mamba) cannot skip prefill computation because Mamba layers maintain recurrent states that depend on the full sequence history, unlike attention KV cache blocks which are self-contained and reusable.
  • This PR adds a guard in _compute_prefix_match that forces prefix_skip_tokens = 0 when is_hybrid_model is True, so matched prefix blocks are shared (saving memory) while all prompt tokens are still processed through the model (preserving Mamba state correctness).
  • Adds 4 tests in TestHybridModelMemoryOnly verifying: no prefill skipping, block reuse for memory savings, correct ref counts for shared blocks, and all prompt tokens present in context.
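The contrast driving the first bullet can be sketched with a toy model (illustrative only; not the real cache code): attention KV entries are per-token, so a cached block of them is self-contained, while a Mamba-style recurrent state folds the entire history into one value and cannot be recovered from a cached block in isolation.

```python
def kv_entries(tokens):
    # One entry per token, independent of all other tokens,
    # so a block of entries is reusable as-is.
    return [ord(t) for t in tokens]

def recurrent_state(tokens):
    # Each step folds the previous state in, so the final state
    # depends on every token seen so far.
    state = 0
    for t in tokens:
        state = (31 * state + ord(t)) % (2**32)
    return state

# Two requests sharing the prefix "abc": their KV entries for that
# prefix are identical and reusable block-for-block...
assert kv_entries("abcX")[:3] == kv_entries("abcY")[:3]
# ...but the recurrent state after the full prompt differs per request,
# so it must be recomputed from all tokens rather than read from a block.
assert recurrent_state("abcX") != recurrent_state("abcY")
```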

Details

When prefix caching is enabled for a hybrid model, the system operates in "memory-only" mode:

  • KV cache blocks are shared across requests with matching prefix hashes, reducing memory consumption.
  • Prefill is NOT skipped because Mamba layers must process the full sequence to reconstruct their internal states.

The change is a single 3-line guard in _compute_prefix_match (~line 1624 of dynamic_context.py):

# Hybrid models: disable prefill skipping (no Mamba states per block),
# but keep matched blocks for memory sharing.
if self.is_hybrid_model:
    prefix_skip_tokens = 0

Benchmarked on a 2B hybrid model (23 Mamba + 4 Attention + 23 MLP layers, 50 total) with 10 identical requests (644 tokens each):

  • 64.2% block savings (31.0 → 11.1 blocks used)
  • 0% prefill token reduction (6440 tokens in all configs), as expected
  • Token-for-token output correctness vs. prefix caching disabled
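The reported figures are internally consistent and can be reproduced from the numbers above:

```python
# Block savings: 31.0 blocks without caching vs 11.1 with caching.
blocks_without_caching = 31.0
blocks_with_caching = 11.1
savings = 1 - blocks_with_caching / blocks_without_caching
print(f"{savings:.1%}")  # → 64.2%

# Prefill is unchanged: 10 requests x 644 tokens, all processed.
assert 10 * 644 == 6440
```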

Test plan

  • test_no_prefill_skipping_for_hybrid_model: verifies prefix_skip_tokens == 0 and effective_chunk_length == chunk_length even when blocks match
  • test_matched_blocks_reused_saving_memory: verifies second request consumes no additional blocks from pool
  • test_ref_counts_incremented_for_matched_blocks: verifies matched blocks have ref_count == 2 after sharing
  • test_all_prompt_tokens_in_context: verifies all prompt tokens are active (none skipped) and kv_length_offset == 0
/opt/venv/bin/python -m torch.distributed.run --nproc-per-node 1 -m pytest \
  tests/unit_tests/inference/contexts/test_dynamic_prefix_caching.py -v
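A minimal self-contained analogue of the first test (names and structure are illustrative; the real tests exercise the actual inference context in dynamic_context.py):

```python
# Simplified stand-in for the context under test (hypothetical API).
class FakeContext:
    def __init__(self, is_hybrid_model):
        self.is_hybrid_model = is_hybrid_model

    def compute_prefix_skip(self, matched_prefix_tokens):
        prefix_skip_tokens = matched_prefix_tokens
        # The guard under test: hybrid models never skip prefill.
        if self.is_hybrid_model:
            prefix_skip_tokens = 0
        return prefix_skip_tokens

def test_no_prefill_skipping_for_hybrid_model():
    # Even with a fully matched 128-token prefix, nothing is skipped.
    assert FakeContext(is_hybrid_model=True).compute_prefix_skip(128) == 0

def test_skipping_still_allowed_for_attention_only_model():
    assert FakeContext(is_hybrid_model=False).compute_prefix_skip(128) == 128

test_no_prefill_skipping_for_hybrid_model()
test_skipping_still_allowed_for_attention_only_model()
```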

🤖 Generated with Claude Code

@lmcafee-nvidia lmcafee-nvidia requested review from a team as code owners March 2, 2026 20:14
@lmcafee-nvidia lmcafee-nvidia self-assigned this Mar 2, 2026
@svcnvidia-nemo-ci svcnvidia-nemo-ci requested a review from a team March 2, 2026 20:14
@lmcafee-nvidia lmcafee-nvidia changed the title from "Disable prefill skipping for hybrid models while preserving block sharing" to "Prefix caching | Mamba memory only." Mar 2, 2026
@copy-pr-bot

copy-pr-bot bot commented Mar 2, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@lmcafee-nvidia lmcafee-nvidia requested a review from a team as a code owner March 2, 2026 20:38
@lmcafee-nvidia lmcafee-nvidia marked this pull request as draft March 3, 2026 02:48
@lmcafee-nvidia lmcafee-nvidia marked this pull request as ready for review March 3, 2026 02:49
Disable prefill skipping for hybrid models while preserving block sharing

Hybrid models (Transformer + Mamba) lack per-block Mamba states, so prefix
computation cannot be skipped. This adds a guard in _compute_prefix_match
that forces prefix_skip_tokens=0 when is_hybrid_model is True, ensuring all
tokens are recomputed while still sharing KV blocks for memory savings.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@lmcafee-nvidia lmcafee-nvidia force-pushed the prefix-caching-mamba-memory-only branch from e6fd0b4 to cc45bb3 March 3, 2026 04:00
@lmcafee-nvidia
Contributor Author

/ok to test 13af54a


@lmcafee-nvidia lmcafee-nvidia added this pull request to the merge queue Mar 3, 2026
@svcnvidia-nemo-ci

🔄 Merge queue validation started!

You can track the progress here: https://github.com/NVIDIA/Megatron-LM/actions/runs/22643519147

@svcnvidia-nemo-ci

🔄 Merge queue validation started!

You can track the progress here: https://github.com/NVIDIA/Megatron-LM/actions/runs/22645678779

Merged via the queue into NVIDIA:main with commit 6fc7690 Mar 3, 2026
73 of 82 checks passed
@lmcafee-nvidia lmcafee-nvidia deleted the prefix-caching-mamba-memory-only branch March 3, 2026 23:23
