Skip to content

[None][fix] CppMambaHybridCacheManager: handle ranks with zero local mamba layers#13999

Merged
VALLIS-NERIA merged 2 commits into
NVIDIA:mainfrom
VALLIS-NERIA:user/xiweny/mamba-hybrid-zero-mamba-layers
May 12, 2026
Merged

[None][fix] CppMambaHybridCacheManager: handle ranks with zero local mamba layers#13999
VALLIS-NERIA merged 2 commits into
NVIDIA:mainfrom
VALLIS-NERIA:user/xiweny/mamba-hybrid-zero-mamba-layers

Conversation

@VALLIS-NERIA
Copy link
Copy Markdown
Collaborator

@VALLIS-NERIA VALLIS-NERIA commented May 11, 2026

Summary

Under PP sharding, a rank can end up with only full-attention layers and no mamba layers. The previous CppMambaHybridCacheManager.__init__ unconditionally allocated SSM/conv state and set up linear-attention metadata, leaving these ranks in an invalid state.

This PR adds an early-exit path after super().__init__() when local_num_mamba_layers == 0:

  • Forward num_layers (not mamba_num_layers + num_layers) and the union of mamba+attention layer masks to the parent KVCacheManager.
  • Skip all mamba-only setup (ssm/conv_state_shape, linear_attention_metadata, MTP intermediate states, replay buffers, layer offsets, etc.).
  • Initialize self.requests = [] before the early return so the guards below are safe.

Guard prepare_resources / update_mamba_states / _setup_state_indices with the same condition so they no-op on these ranks (the parent's prepare_resources still runs).

Test plan

  • New unit test test_cpp_hybrid_zero_local_mamba_layers in tests/unittest/_torch/executor/test_mamba_cache_manager.py:
    • Constructs CppMambaHybridCacheManager with the real parent KVCacheManager (no mocks) on world_size=1 with an all-attention layer mask — same Python branch as a PP rank with zero local mamba layers.
    • Verifies early-exit invariants: local_num_mamba_layers == 0, mamba_pp_layers == [], requests == [], pp_layers == [0,1,2,3], parent self.impl / blocks_per_window initialized, is_linear_attention is False, no mamba-only attributes set.
    • Calls the three guarded methods on the zero-mamba manager to confirm they no-op without raising on the uninitialized mamba state.
  • Existing test_mamba_cache_manager.py suite still passes (10/10).

Summary by CodeRabbit

  • Refactor

    • Optimized Mamba cache manager initialization to skip unnecessary state setup when ranks have no local Mamba layers, improving efficiency in distributed configurations.
  • Tests

    • Added test coverage for Mamba cache manager edge cases in distributed tensor-parallel scenarios.

Review Change Stack

…mamba layers (By Agent)

Under PP sharding, a rank can end up with only full-attention layers and
no mamba layers. The previous constructor unconditionally allocated SSM /
conv state and set up linear-attention metadata, leading to invalid state
on these ranks.

Take an early-exit after super().__init__ when local_num_mamba_layers == 0:
forward num_layers (not mamba_num_layers + num_layers) and the union of
mamba+attention layer masks to the parent KVCacheManager, skip all
mamba-only setup. Guard prepare_resources / update_mamba_states /
_setup_state_indices with the same condition so they no-op on these ranks.

Add a regression test that constructs the manager with a real parent
KVCacheManager and verifies: early-exit invariants on mgr state, no
mamba-only attributes set, and the three guarded methods no-op without
raising.

Signed-off-by: Xiwen Yu <13230610+VALLIS-NERIA@users.noreply.github.com>
@VALLIS-NERIA VALLIS-NERIA force-pushed the user/xiweny/mamba-hybrid-zero-mamba-layers branch from ca348ad to ac30ded Compare May 11, 2026 12:31
@VALLIS-NERIA
Copy link
Copy Markdown
Collaborator Author

/bot run

@coderabbitai
Copy link
Copy Markdown
Contributor

coderabbitai Bot commented May 11, 2026

📝 Walkthrough

Walkthrough

CppMambaHybridCacheManager now detects when the current rank has zero local mamba layers during __init__, normalizes the layer mask and PP-sharding upfront, and short-circuits initialization and state-handling methods to skip unneeded mamba machinery on those ranks. Test coverage validates the early-exit path.

Changes

Zero local mamba layers optimization

Layer / File(s) Summary
Early-exit initialization logic
tensorrt_llm/_torch/pyexecutor/mamba_cache_manager.py
__init__ normalizes the combined layer_mask, computes pp_layers before KV backend initialization, derives mamba_pp_layers and local_num_mamba_layers, and returns early after base KVCacheManager construction when local_num_mamba_layers == 0, skipping mamba state setup.
Attention window configuration
tensorrt_llm/_torch/pyexecutor/mamba_cache_manager.py
After kv_cache_config.max_attention_window is set, per-layer attention window types and per_layer_kv_heads are derived using the already-normalized layer_mask, including zeroing KV heads on mamba layers.
State-handling method short-circuits
tensorrt_llm/_torch/pyexecutor/mamba_cache_manager.py
prepare_resources, update_mamba_states, and _setup_state_indices return immediately when local_num_mamba_layers == 0.
Test coverage for zero-layers edge case
tests/unittest/_torch/executor/test_mamba_cache_manager.py
Module docstring expanded to document CppMambaHybridCacheManager PP-sharding edge cases; new imports for cache types and configuration; _build_zero_mamba_hybrid() helper constructs a manager with zero local layers; test_cpp_hybrid_zero_local_mamba_layers() verifies early-exit invariants, absence of mamba-only attributes, and no-op behavior of mamba methods.

🎯 2 (Simple) | ⏱️ ~12 minutes

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 33.33% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
Check name Status Explanation
Title check ✅ Passed The title clearly and specifically describes the main fix: handling the edge case of zero local mamba layers in CppMambaHybridCacheManager, which is the core change across the modified files.
Description check ✅ Passed The description covers the problem, solution approach, and test coverage. It explains the early-exit path, what is skipped, how methods are guarded, and provides a comprehensive test plan with verification steps.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Tip

💬 Introducing Slack Agent: The best way for teams to turn conversations into code.

Slack Agent is built on CodeRabbit's deep understanding of your code, so your team can collaborate across the entire SDLC without losing context.

  • Generate code and open pull requests
  • Plan features and break down work
  • Investigate incidents and troubleshoot customer tickets together
  • Automate recurring tasks and respond to alerts with triggers
  • Summarize progress and report instantly

Built for teams:

  • Shared memory across your entire org—no repeating context
  • Per-thread sandboxes to safely plan and execute work
  • Governance built-in—scoped access, auditability, and budget controls

One agent for your entire SDLC. Right inside Slack.

👉 Get started


Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@tensorrt_llm/_torch/pyexecutor/mamba_cache_manager.py`:
- Around line 1000-1015: The constructor reads and copies layer_mask
unconditionally which breaks when layer_mask is None; before calling .copy() or
indexing, compute total_layers = len(mamba_layer_mask) and if layer_mask is None
set full_attention_layer_mask = [False] * total_layers else set
full_attention_layer_mask = layer_mask.copy(), then build the combined
layer_mask = [mamba_layer_mask[i] or full_attention_layer_mask[i] for i in
range(total_layers)]; update the code paths around the existing variables
(layer_mask, mamba_layer_mask, full_attention_layer_mask) to use this guarded
logic.
- Around line 1029-1053: Seed the publicly visible mamba-related fields before
the zero-layer early return so accessor calls don't raise AttributeError: assign
self._use_replay_state_update = use_replay_state_update and initialize the
backing attribute(s) used by get_mamba_ssm_cache_dtype (e.g.
self._mamba_ssm_cache_dtype or whatever private field that method reads) as well
as any other public/externally accessed mamba fields that are normally set
later; do this immediately after setting self.local_num_mamba_layers and
self.requests and before the if self.local_num_mamba_layers == 0: early return,
then keep the existing super().__init__ call and return.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 2ae55170-1756-4283-a131-19a7bf8a466f

📥 Commits

Reviewing files that changed from the base of the PR and between 9547230 and ca348ad.

📒 Files selected for processing (2)
  • tensorrt_llm/_torch/pyexecutor/mamba_cache_manager.py
  • tests/unittest/_torch/executor/test_mamba_cache_manager.py

Comment thread tensorrt_llm/_torch/pyexecutor/mamba_cache_manager.py
Comment thread tensorrt_llm/_torch/pyexecutor/mamba_cache_manager.py Outdated
@tensorrt-cicd
Copy link
Copy Markdown
Collaborator

PR_Github #47737 [ run ] triggered by Bot. Commit: ac30ded Link to invocation

@tensorrt-cicd
Copy link
Copy Markdown
Collaborator

PR_Github #47737 [ run ] completed with state SUCCESS. Commit: ac30ded
/LLM/main/L0_MergeRequest_PR pipeline #37632 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

CI Agent Failure Analysis

Link to invocation

… Agent)

Two fixes from CodeRabbit's review on NVIDIA#13999:

1. Handle layer_mask=None defensively. The signature has always typed it
   as Optional[List[bool]], and the new constructor was unconditionally
   calling layer_mask.copy() / indexing before the zero-mamba guard ran.
   Treat None as an all-False full-attention mask and validate length
   against mamba_layer_mask.

2. Seed externally visible mamba fields before the early-return guard.
   get_mamba_ssm_cache_dtype() reads self.ssm_state_dtype, and the
   use_replay_state_update property reads self._use_replay_state_update;
   both were only assigned on the non-early-exit path, so any caller
   that hit those accessors on a zero-local-mamba rank would have
   AttributeError'd. Move both assignments above the early-return.

Signed-off-by: Xiwen Yu <13230610+VALLIS-NERIA@users.noreply.github.com>
@VALLIS-NERIA
Copy link
Copy Markdown
Collaborator Author

Addressed CodeRabbit feedback:

  • API Usage Error #1015: Handle layer_mask=None (treat as all-False full-attention mask) and validate length against mamba_layer_mask.
  • New Model Support #1053: Seed self._use_replay_state_update and self.ssm_state_dtype before the zero-mamba early-return so get_mamba_ssm_cache_dtype() and the use_replay_state_update property don't AttributeError on ranks with no local mamba layers.

@VALLIS-NERIA
Copy link
Copy Markdown
Collaborator Author

/bot run --disable-fail-fast

@tensorrt-cicd
Copy link
Copy Markdown
Collaborator

PR_Github #47860 [ run ] triggered by Bot. Commit: 5aad39a Link to invocation

@Wanli-Jiang
Copy link
Copy Markdown
Collaborator

Verified that it can run well for new ckpt, thanks!

@tensorrt-cicd
Copy link
Copy Markdown
Collaborator

PR_Github #47860 [ run ] completed with state SUCCESS. Commit: 5aad39a
/LLM/main/L0_MergeRequest_PR pipeline #37724 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

CI Agent Failure Analysis

Link to invocation

@VALLIS-NERIA
Copy link
Copy Markdown
Collaborator Author

/bot run --disable-fail-fast

@VALLIS-NERIA VALLIS-NERIA enabled auto-merge (squash) May 12, 2026 14:11
@tensorrt-cicd
Copy link
Copy Markdown
Collaborator

PR_Github #47978 [ run ] triggered by Bot. Commit: 5aad39a Link to invocation

VALLIS-NERIA added a commit to VALLIS-NERIA/TensorRT-LLM that referenced this pull request May 12, 2026
…recurrent-cuda-graph-snapshots (By Agent)

Brings in the CppMambaHybridCacheManager zero-local-mamba-layers fix and
its CodeRabbit follow-ups (PR NVIDIA#13999). Conflict in
tests/unittest/_torch/executor/test_mamba_cache_manager.py is
resolved by keeping both test sections (recurrent-state pool sizing from
this branch, zero-local-mamba early-exit from the merged branch) and
unioning the import set.

Signed-off-by: Xiwen Yu <13230610+VALLIS-NERIA@users.noreply.github.com>
@tensorrt-cicd
Copy link
Copy Markdown
Collaborator

PR_Github #47978 [ run ] completed with state SUCCESS. Commit: 5aad39a
/LLM/main/L0_MergeRequest_PR pipeline #37818 completed with status: 'SUCCESS'

CI Report

Link to invocation

@VALLIS-NERIA VALLIS-NERIA merged commit 1eb420a into NVIDIA:main May 12, 2026
7 checks passed
Wanli-Jiang pushed a commit to Wanli-Jiang/TensorRT-LLM that referenced this pull request May 13, 2026
… Agent)

Two fixes from CodeRabbit's review on NVIDIA#13999:

1. Handle layer_mask=None defensively. The signature has always typed it
   as Optional[List[bool]], and the new constructor was unconditionally
   calling layer_mask.copy() / indexing before the zero-mamba guard ran.
   Treat None as an all-False full-attention mask and validate length
   against mamba_layer_mask.

2. Seed externally visible mamba fields before the early-return guard.
   get_mamba_ssm_cache_dtype() reads self.ssm_state_dtype, and the
   use_replay_state_update property reads self._use_replay_state_update;
   both were only assigned on the non-early-exit path, so any caller
   that hit those accessors on a zero-local-mamba rank would have
   AttributeError'd. Move both assignments above the early-return.

Signed-off-by: Xiwen Yu <13230610+VALLIS-NERIA@users.noreply.github.com>
yufeiwu-nv pushed a commit to yufeiwu-nv/TensorRT-LLM that referenced this pull request May 19, 2026
…mamba layers (NVIDIA#13999)

Signed-off-by: Xiwen Yu <13230610+VALLIS-NERIA@users.noreply.github.com>
Wanli-Jiang added a commit to Wanli-Jiang/TensorRT-LLM that referenced this pull request May 21, 2026
commit c9933a8
Author: Xiwen Yu <13230610+VALLIS-NERIA@users.noreply.github.com>
Date:   Mon May 18 17:58:20 2026 +0800

    [None][fix] Scale recurrent-state pool by pp_size for hybrid Mamba models (By Agent)

    `_calculate_max_num_blocks_for_linear_attention` sizes the recurrent-state
    slot pool as `max_batch_size + 1`, ignoring `pp_size`. With pipeline
    parallelism (e.g. `TestNemotronV3Super::test_nvfp4_parallelism[TP4_PP2]`),
    multiple microbatches are in-flight on the same rank simultaneously, each
    holding up to `max_batch_size` sequences' Mamba state — so the live-state
    slot count must be `max_batch_size * pp_size + 1`. The KVCacheManagerV2
    path already does this (`max_num_sequences = max_batch_size * pp_size`);
    this commit aligns the linear-attention sizing path with that invariant.

    Three call sites are fixed:

    1. `KVCacheManager._calculate_max_num_blocks_for_linear_attention` — both
       `intercept` (memory reservation in the affine model) and `max_snapshots`
       (slot count) now scale with `pp_size`.
    2. The `is_linear_attention` branch in the estimation dry-run code path,
       which previously used a bare `max_batch_size`.
    3. `CppMambaHybridCacheManager.get_cache_size_per_token` — the per-request
       fixed-cost intercept that drives upstream `_tokens_for_budget`.

    Without this fix, `PP > 1` runs trip `No free block found. This shouldn't
    happen!` in `evictionPolicy::getFreeBlock` once requests beyond the first
    microbatch arrive at the pool. Added a regression unit test that asserts
    `recurrent_primary >= max_batch_size * pp_size + 1` for `pp_size ∈ {2, 4}`,
    and verified the fix end-to-end with the TP4_PP2 model run on a TP2_PP2
    configuration locally (test passed in 8m53s, no `No free block` error).

    Signed-off-by: Xiwen Yu <13230610+VALLIS-NERIA@users.noreply.github.com>

commit 7a6e90f
Author: Xiwen Yu <13230610+VALLIS-NERIA@users.noreply.github.com>
Date:   Mon May 18 01:45:33 2026 +0800

    [None][fix] Cap CppMamba warmup tokens by recurrent-state pool capacity (By Agent)

    `_create_cuda_graph_warmup_request` constructs the longest dummy request
    using `available_tokens` returned by `kv_cache_manager.get_num_available_tokens`.
    For `CppMambaHybridCacheManager`, the parent's implementation only bounds
    this by the attention KV cache, so the long dummy's `token_num` could
    balloon to near `max_seq_len` (262K for Qwen3.5-397B-A17B). With block
    reuse on and `mamba_state_cache_interval=256`, that one dummy then needs
    ~1024 recurrent-state blocks, exhausting the RS pool when stacked with
    the (batch_size - 1) short dummies in the same warmup batch and tripping
    `No free block found` in `evictionPolicy::getFreeBlock`.

    Override `get_num_available_tokens` in CppMambaHybridCacheManager to also
    clamp by `(rs_free_blocks - 1) * states_snapshot_interval` whenever the
    snapshot interval is positive, leaving one block of headroom for the
    always-allocated trailing snapshot. The override is a no-op when
    `enable_block_reuse=False` (interval == 0).

    Verified with
    `tests/integration/defs/accuracy/test_llm_api_pytorch.py::TestQwen3_5_397B_A17B::test_nvfp4[tep4_block_reuse]`.

    Signed-off-by: Xiwen Yu <13230610+VALLIS-NERIA@users.noreply.github.com>

commit 717edeb
Author: Xiwen Yu <13230610+VALLIS-NERIA@users.noreply.github.com>
Date:   Sat May 16 16:27:02 2026 +0800

    fix schedule issue, improve runtime perf, allow change manager

    Signed-off-by: Xiwen Yu <13230610+VALLIS-NERIA@users.noreply.github.com>

commit 1813212
Author: Xiwen Yu <13230610+VALLIS-NERIA@users.noreply.github.com>
Date:   Fri May 15 08:47:09 2026 +0800

    [None][fix] Address recurrent-state pool sizing and layer-first transfer bugs (By Agent)

    Concern 1 (kvCacheTransferManager.cpp): In the layer-first cudaMemcpy2DAsync
    path, source and destination pitches were both derived from pool.primaryPtr.
    For host-offload transfers, the secondary pool has mNumSecondaryBlocks which
    can differ from mNumPrimaryBlocks, giving a different per-layer stride.
    Fix: compute srcLayerStrideBytes and dstLayerStrideBytes independently from
    srcPool and dstPool, and use them as separate pitches in cudaMemcpy2DAsync.

    Concern 2 (resource_manager.py): In _calculate_max_num_blocks_for_linear_attention,
    the block-reuse branch unconditionally overwrote max_snapshots with
    max_tokens // interval, discarding the live-state + CUDA-graph-padding floor
    (max_batch_size + 1 + max_draft_len) set earlier.
    Fix: use max(max_tokens // interval, max_snapshots) to preserve the floor.

    Add a regression test (enable_block_reuse=True) that would have caught the
    dropped floor: with max_batch_size=4 and max_tokens=512/interval=256,
    the naive branch returns 2 slots (< required floor of 5).

    Signed-off-by: Xiwen Yu <13230610+VALLIS-NERIA@users.noreply.github.com>

commit e00c230
Merge: 69899c5 be4f57e
Author: Xiwen Yu <13230610+VALLIS-NERIA@users.noreply.github.com>
Date:   Wed May 13 18:12:00 2026 +0800

    Merge branch 'main' into user/xiweny/mamba-recurrent-cuda-graph-snapshots

    Signed-off-by: Xiwen Yu <13230610+VALLIS-NERIA@users.noreply.github.com>

commit 69899c5
Author: Xiwen Yu <13230610+VALLIS-NERIA@users.noreply.github.com>
Date:   Wed May 13 18:01:04 2026 +0800

    fix replay check on sm120

    Signed-off-by: Xiwen Yu <13230610+VALLIS-NERIA@users.noreply.github.com>

commit 3f0e36c
Author: Xiwen Yu <13230610+VALLIS-NERIA@users.noreply.github.com>
Date:   Wed May 13 17:57:48 2026 +0800

    fix missing is_draft

    Signed-off-by: Xiwen Yu <13230610+VALLIS-NERIA@users.noreply.github.com>

commit 516ea84
Author: Xiwen Yu <13230610+VALLIS-NERIA@users.noreply.github.com>
Date:   Tue May 12 23:16:54 2026 +0800

    reduce sync overhead

    Signed-off-by: Xiwen Yu <13230610+VALLIS-NERIA@users.noreply.github.com>

commit ace03a2
Merge: 5a85831 5aad39a
Author: Xiwen Yu <13230610+VALLIS-NERIA@users.noreply.github.com>
Date:   Tue May 12 21:27:53 2026 +0800

    Merge branch 'user/xiweny/mamba-hybrid-zero-mamba-layers' into mamba-recurrent-cuda-graph-snapshots (By Agent)

    Brings in the CppMambaHybridCacheManager zero-local-mamba-layers fix and
    its CodeRabbit follow-ups (PR NVIDIA#13999). Conflict in
    tests/unittest/_torch/executor/test_mamba_cache_manager.py is
    resolved by keeping both test sections (recurrent-state pool sizing from
    this branch, zero-local-mamba early-exit from the merged branch) and
    unioning the import set.

    Signed-off-by: Xiwen Yu <13230610+VALLIS-NERIA@users.noreply.github.com>

commit 5aad39a
Author: Xiwen Yu <13230610+VALLIS-NERIA@users.noreply.github.com>
Date:   Tue May 12 11:15:21 2026 +0800

    [None][fix] CppMambaHybridCacheManager: address CodeRabbit review (By Agent)

    Two fixes from CodeRabbit's review on NVIDIA#13999:

    1. Handle layer_mask=None defensively. The signature has always typed it
       as Optional[List[bool]], and the new constructor was unconditionally
       calling layer_mask.copy() / indexing before the zero-mamba guard ran.
       Treat None as an all-False full-attention mask and validate length
       against mamba_layer_mask.

    2. Seed externally visible mamba fields before the early-return guard.
       get_mamba_ssm_cache_dtype() reads self.ssm_state_dtype, and the
       use_replay_state_update property reads self._use_replay_state_update;
       both were only assigned on the non-early-exit path, so any caller
       that hit those accessors on a zero-local-mamba rank would have
       AttributeError'd. Move both assignments above the early-return.

    Signed-off-by: Xiwen Yu <13230610+VALLIS-NERIA@users.noreply.github.com>

commit 5a85831
Author: Xiwen Yu <13230610+VALLIS-NERIA@users.noreply.github.com>
Date:   Tue May 12 09:41:25 2026 +0800

    [None][fix] KVCacheManager: store spec_config on self in __init__ (By Agent)

    Address CodeRabbit review on NVIDIA#14003: the linear-attention sizing branch
    reads self.spec_config, but KVCacheManager.__init__ never assigned the
    incoming spec_config argument to self. The code worked only by accident,
    because CppMambaHybridCacheManager's _setup_mtp_intermediate_states runs
    before super().__init__ and writes self.spec_config on the partially
    initialized instance. Any other KVCacheManager path that exercises the
    linear-attention branch with a spec_config would AttributeError.

    Assign self.spec_config = spec_config at the top of KVCacheManager.__init__
    so the attribute is always present regardless of subclass.

    Signed-off-by: Xiwen Yu <13230610+VALLIS-NERIA@users.noreply.github.com>

commit 3310de0
Author: Xiwen Yu <13230610+VALLIS-NERIA@users.noreply.github.com>
Date:   Mon May 11 23:46:42 2026 +0800

    [None][fix] KVCacheManager: fix max_draft_tokens typo and add slot reservation tests (By Agent)

    Follow-up to the recurrent-state slot reservation: spec_config has
    no max_draft_tokens attribute (DecodingBaseConfig uses StrictBaseModel
    with extra="forbid"). The correct field is max_draft_len, which is also
    the count of additional dummy request ids issued by
    CUDAGraphRunner._get_padded_batch.

    Add two unit tests that construct a real CppMambaHybridCacheManager and
    inspect blocks_per_window[RECURRENT_STATES]:
      - without spec_config: pool reserves max_batch_size + 1 slots
      - with spec_config(max_draft_len=N): pool reserves max_batch_size + 1 + N
    Tests disable block reuse so the override (max_tokens // interval) does
    not mask the addition under test.

    Signed-off-by: Xiwen Yu <13230610+VALLIS-NERIA@users.noreply.github.com>

commit 992ec1b
Author: Xiwen Yu <13230610+VALLIS-NERIA@users.noreply.github.com>
Date:   Mon May 11 23:34:04 2026 +0800

    [None][fix] KVCacheManager: reserve recurrent-state slots for CUDA graph padding and draft-len sentinels (By Agent)

    The recurrent-state snapshot pool was sized at exactly max_batch_size,
    which leaves no room for the CUDA graph padding sentinel
    (CUDA_GRAPH_DUMMY_REQUEST_ID) and the per-draft-len sentinels that
    CUDAGraphRunner::_get_padded_batch issues during speculative decoding.

    Reserve one extra slot for the CUDA graph padding dummy, and when a
    spec_config is present, reserve one additional slot per draft length so
    all distinct sentinel request IDs can coexist without evicting live
    recurrent state. The block-reuse path (max_tokens // interval) still
    overrides this when active.

    Signed-off-by: Xiwen Yu <13230610+VALLIS-NERIA@users.noreply.github.com>

commit ac30ded
Author: Xiwen Yu <13230610+VALLIS-NERIA@users.noreply.github.com>
Date:   Mon May 11 20:26:14 2026 +0800

    [None][fix] CppMambaHybridCacheManager: handle ranks with zero local mamba layers (By Agent)

    Under PP sharding, a rank can end up with only full-attention layers and
    no mamba layers. The previous constructor unconditionally allocated SSM /
    conv state and set up linear-attention metadata, leading to invalid state
    on these ranks.

    Take an early-exit after super().__init__ when local_num_mamba_layers == 0:
    forward num_layers (not mamba_num_layers + num_layers) and the union of
    mamba+attention layer masks to the parent KVCacheManager, skip all
    mamba-only setup. Guard prepare_resources / update_mamba_states /
    _setup_state_indices with the same condition so they no-op on these ranks.

    Add a regression test that constructs the manager with a real parent
    KVCacheManager and verifies: early-exit invariants on mgr state, no
    mamba-only attributes set, and the three guarded methods no-op without
    raising.

    Signed-off-by: Xiwen Yu <13230610+VALLIS-NERIA@users.noreply.github.com>

Signed-off-by: Wanli Jiang <35160485+Wanli-Jiang@users.noreply.github.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants