[None][fix] CppMambaHybridCacheManager: handle ranks with zero local mamba layers#13999
Conversation
…mamba layers (By Agent) Under PP sharding, a rank can end up with only full-attention layers and no mamba layers. The previous constructor unconditionally allocated SSM / conv state and set up linear-attention metadata, leading to invalid state on these ranks. Take an early-exit after super().__init__ when local_num_mamba_layers == 0: forward num_layers (not mamba_num_layers + num_layers) and the union of mamba+attention layer masks to the parent KVCacheManager, skip all mamba-only setup. Guard prepare_resources / update_mamba_states / _setup_state_indices with the same condition so they no-op on these ranks. Add a regression test that constructs the manager with a real parent KVCacheManager and verifies: early-exit invariants on mgr state, no mamba-only attributes set, and the three guarded methods no-op without raising. Signed-off-by: Xiwen Yu <13230610+VALLIS-NERIA@users.noreply.github.com>
ca348ad to
ac30ded
Compare
|
/bot run |
📝 WalkthroughWalkthrough
ChangesZero local mamba layers optimization
🎯 2 (Simple) | ⏱️ ~12 minutes 🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Tip 💬 Introducing Slack Agent: The best way for teams to turn conversations into code.Slack Agent is built on CodeRabbit's deep understanding of your code, so your team can collaborate across the entire SDLC without losing context.
Built for teams:
One agent for your entire SDLC. Right inside Slack. Comment |
There was a problem hiding this comment.
Actionable comments posted: 2
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@tensorrt_llm/_torch/pyexecutor/mamba_cache_manager.py`:
- Around line 1000-1015: The constructor reads and copies layer_mask
unconditionally which breaks when layer_mask is None; before calling .copy() or
indexing, compute total_layers = len(mamba_layer_mask) and if layer_mask is None
set full_attention_layer_mask = [False] * total_layers else set
full_attention_layer_mask = layer_mask.copy(), then build the combined
layer_mask = [mamba_layer_mask[i] or full_attention_layer_mask[i] for i in
range(total_layers)]; update the code paths around the existing variables
(layer_mask, mamba_layer_mask, full_attention_layer_mask) to use this guarded
logic.
- Around line 1029-1053: Seed the publicly visible mamba-related fields before
the zero-layer early return so accessor calls don't raise AttributeError: assign
self._use_replay_state_update = use_replay_state_update and initialize the
backing attribute(s) used by get_mamba_ssm_cache_dtype (e.g.
self._mamba_ssm_cache_dtype or whatever private field that method reads) as well
as any other public/externally accessed mamba fields that are normally set
later; do this immediately after setting self.local_num_mamba_layers and
self.requests and before the if self.local_num_mamba_layers == 0: early return,
then keep the existing super().__init__ call and return.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: 2ae55170-1756-4283-a131-19a7bf8a466f
📒 Files selected for processing (2)
tensorrt_llm/_torch/pyexecutor/mamba_cache_manager.pytests/unittest/_torch/executor/test_mamba_cache_manager.py
|
PR_Github #47737 [ run ] triggered by Bot. Commit: |
|
PR_Github #47737 [ run ] completed with state
|
… Agent) Two fixes from CodeRabbit's review on NVIDIA#13999: 1. Handle layer_mask=None defensively. The signature has always typed it as Optional[List[bool]], and the new constructor was unconditionally calling layer_mask.copy() / indexing before the zero-mamba guard ran. Treat None as an all-False full-attention mask and validate length against mamba_layer_mask. 2. Seed externally visible mamba fields before the early-return guard. get_mamba_ssm_cache_dtype() reads self.ssm_state_dtype, and the use_replay_state_update property reads self._use_replay_state_update; both were only assigned on the non-early-exit path, so any caller that hit those accessors on a zero-local-mamba rank would have AttributeError'd. Move both assignments above the early-return. Signed-off-by: Xiwen Yu <13230610+VALLIS-NERIA@users.noreply.github.com>
|
Addressed CodeRabbit feedback:
|
|
/bot run --disable-fail-fast |
|
PR_Github #47860 [ run ] triggered by Bot. Commit: |
|
Verified that it can run well for new ckpt, thanks! |
|
PR_Github #47860 [ run ] completed with state
|
|
/bot run --disable-fail-fast |
|
PR_Github #47978 [ run ] triggered by Bot. Commit: |
…recurrent-cuda-graph-snapshots (By Agent) Brings in the CppMambaHybridCacheManager zero-local-mamba-layers fix and its CodeRabbit follow-ups (PR NVIDIA#13999). Conflict in tests/unittest/_torch/executor/test_mamba_cache_manager.py is resolved by keeping both test sections (recurrent-state pool sizing from this branch, zero-local-mamba early-exit from the merged branch) and unioning the import set. Signed-off-by: Xiwen Yu <13230610+VALLIS-NERIA@users.noreply.github.com>
|
PR_Github #47978 [ run ] completed with state |
… Agent) Two fixes from CodeRabbit's review on NVIDIA#13999: 1. Handle layer_mask=None defensively. The signature has always typed it as Optional[List[bool]], and the new constructor was unconditionally calling layer_mask.copy() / indexing before the zero-mamba guard ran. Treat None as an all-False full-attention mask and validate length against mamba_layer_mask. 2. Seed externally visible mamba fields before the early-return guard. get_mamba_ssm_cache_dtype() reads self.ssm_state_dtype, and the use_replay_state_update property reads self._use_replay_state_update; both were only assigned on the non-early-exit path, so any caller that hit those accessors on a zero-local-mamba rank would have AttributeError'd. Move both assignments above the early-return. Signed-off-by: Xiwen Yu <13230610+VALLIS-NERIA@users.noreply.github.com>
…mamba layers (NVIDIA#13999) Signed-off-by: Xiwen Yu <13230610+VALLIS-NERIA@users.noreply.github.com>
commit c9933a8 Author: Xiwen Yu <13230610+VALLIS-NERIA@users.noreply.github.com> Date: Mon May 18 17:58:20 2026 +0800 [None][fix] Scale recurrent-state pool by pp_size for hybrid Mamba models (By Agent) `_calculate_max_num_blocks_for_linear_attention` sizes the recurrent-state slot pool as `max_batch_size + 1`, ignoring `pp_size`. With pipeline parallelism (e.g. `TestNemotronV3Super::test_nvfp4_parallelism[TP4_PP2]`), multiple microbatches are in-flight on the same rank simultaneously, each holding up to `max_batch_size` sequences' Mamba state — so the live-state slot count must be `max_batch_size * pp_size + 1`. The KVCacheManagerV2 path already does this (`max_num_sequences = max_batch_size * pp_size`); this commit aligns the linear-attention sizing path with that invariant. Three call sites are fixed: 1. `KVCacheManager._calculate_max_num_blocks_for_linear_attention` — both `intercept` (memory reservation in the affine model) and `max_snapshots` (slot count) now scale with `pp_size`. 2. The `is_linear_attention` branch in the estimation dry-run code path, which previously used a bare `max_batch_size`. 3. `CppMambaHybridCacheManager.get_cache_size_per_token` — the per-request fixed-cost intercept that drives upstream `_tokens_for_budget`. Without this fix, `PP > 1` runs trip `No free block found. This shouldn't happen!` in `evictionPolicy::getFreeBlock` once requests beyond the first microbatch arrive at the pool. Added a regression unit test that asserts `recurrent_primary >= max_batch_size * pp_size + 1` for `pp_size ∈ {2, 4}`, and verified the fix end-to-end with the TP4_PP2 model run on a TP2_PP2 configuration locally (test passed in 8m53s, no `No free block` error). Signed-off-by: Xiwen Yu <13230610+VALLIS-NERIA@users.noreply.github.com> commit 7a6e90f Author: Xiwen Yu <13230610+VALLIS-NERIA@users.noreply.github.com> Date: Mon May 18 01:45:33 2026 +0800 [None][fix] Cap CppMamba warmup tokens by recurrent-state pool capacity (By Agent) `_create_cuda_graph_warmup_request` constructs the longest dummy request using `available_tokens` returned by `kv_cache_manager.get_num_available_tokens`. For `CppMambaHybridCacheManager`, the parent's implementation only bounds this by the attention KV cache, so the long dummy's `token_num` could balloon to near `max_seq_len` (262K for Qwen3.5-397B-A17B). With block reuse on and `mamba_state_cache_interval=256`, that one dummy then needs ~1024 recurrent-state blocks, exhausting the RS pool when stacked with the (batch_size - 1) short dummies in the same warmup batch and tripping `No free block found` in `evictionPolicy::getFreeBlock`. Override `get_num_available_tokens` in CppMambaHybridCacheManager to also clamp by `(rs_free_blocks - 1) * states_snapshot_interval` whenever the snapshot interval is positive, leaving one block of headroom for the always-allocated trailing snapshot. The override is a no-op when `enable_block_reuse=False` (interval == 0). Verified with `tests/integration/defs/accuracy/test_llm_api_pytorch.py::TestQwen3_5_397B_A17B::test_nvfp4[tep4_block_reuse]`. Signed-off-by: Xiwen Yu <13230610+VALLIS-NERIA@users.noreply.github.com> commit 717edeb Author: Xiwen Yu <13230610+VALLIS-NERIA@users.noreply.github.com> Date: Sat May 16 16:27:02 2026 +0800 fix schedule issue, improve runtime perf, allow change manager Signed-off-by: Xiwen Yu <13230610+VALLIS-NERIA@users.noreply.github.com> commit 1813212 Author: Xiwen Yu <13230610+VALLIS-NERIA@users.noreply.github.com> Date: Fri May 15 08:47:09 2026 +0800 [None][fix] Address recurrent-state pool sizing and layer-first transfer bugs (By Agent) Concern 1 (kvCacheTransferManager.cpp): In the layer-first cudaMemcpy2DAsync path, source and destination pitches were both derived from pool.primaryPtr. For host-offload transfers, the secondary pool has mNumSecondaryBlocks which can differ from mNumPrimaryBlocks, giving a different per-layer stride. Fix: compute srcLayerStrideBytes and dstLayerStrideBytes independently from srcPool and dstPool, and use them as separate pitches in cudaMemcpy2DAsync. Concern 2 (resource_manager.py): In _calculate_max_num_blocks_for_linear_attention, the block-reuse branch unconditionally overwrote max_snapshots with max_tokens // interval, discarding the live-state + CUDA-graph-padding floor (max_batch_size + 1 + max_draft_len) set earlier. Fix: use max(max_tokens // interval, max_snapshots) to preserve the floor. Add a regression test (enable_block_reuse=True) that would have caught the dropped floor: with max_batch_size=4 and max_tokens=512/interval=256, the naive branch returns 2 slots (< required floor of 5). Signed-off-by: Xiwen Yu <13230610+VALLIS-NERIA@users.noreply.github.com> commit e00c230 Merge: 69899c5 be4f57e Author: Xiwen Yu <13230610+VALLIS-NERIA@users.noreply.github.com> Date: Wed May 13 18:12:00 2026 +0800 Merge branch 'main' into user/xiweny/mamba-recurrent-cuda-graph-snapshots Signed-off-by: Xiwen Yu <13230610+VALLIS-NERIA@users.noreply.github.com> commit 69899c5 Author: Xiwen Yu <13230610+VALLIS-NERIA@users.noreply.github.com> Date: Wed May 13 18:01:04 2026 +0800 fix replay check on sm120 Signed-off-by: Xiwen Yu <13230610+VALLIS-NERIA@users.noreply.github.com> commit 3f0e36c Author: Xiwen Yu <13230610+VALLIS-NERIA@users.noreply.github.com> Date: Wed May 13 17:57:48 2026 +0800 fix missing is_draft Signed-off-by: Xiwen Yu <13230610+VALLIS-NERIA@users.noreply.github.com> commit 516ea84 Author: Xiwen Yu <13230610+VALLIS-NERIA@users.noreply.github.com> Date: Tue May 12 23:16:54 2026 +0800 reduce sync overhead Signed-off-by: Xiwen Yu <13230610+VALLIS-NERIA@users.noreply.github.com> commit ace03a2 Merge: 5a85831 5aad39a Author: Xiwen Yu <13230610+VALLIS-NERIA@users.noreply.github.com> Date: Tue May 12 21:27:53 2026 +0800 Merge branch 'user/xiweny/mamba-hybrid-zero-mamba-layers' into mamba-recurrent-cuda-graph-snapshots (By Agent) Brings in the CppMambaHybridCacheManager zero-local-mamba-layers fix and its CodeRabbit follow-ups (PR NVIDIA#13999). Conflict in tests/unittest/_torch/executor/test_mamba_cache_manager.py is resolved by keeping both test sections (recurrent-state pool sizing from this branch, zero-local-mamba early-exit from the merged branch) and unioning the import set. Signed-off-by: Xiwen Yu <13230610+VALLIS-NERIA@users.noreply.github.com> commit 5aad39a Author: Xiwen Yu <13230610+VALLIS-NERIA@users.noreply.github.com> Date: Tue May 12 11:15:21 2026 +0800 [None][fix] CppMambaHybridCacheManager: address CodeRabbit review (By Agent) Two fixes from CodeRabbit's review on NVIDIA#13999: 1. Handle layer_mask=None defensively. The signature has always typed it as Optional[List[bool]], and the new constructor was unconditionally calling layer_mask.copy() / indexing before the zero-mamba guard ran. Treat None as an all-False full-attention mask and validate length against mamba_layer_mask. 2. Seed externally visible mamba fields before the early-return guard. get_mamba_ssm_cache_dtype() reads self.ssm_state_dtype, and the use_replay_state_update property reads self._use_replay_state_update; both were only assigned on the non-early-exit path, so any caller that hit those accessors on a zero-local-mamba rank would have AttributeError'd. Move both assignments above the early-return. Signed-off-by: Xiwen Yu <13230610+VALLIS-NERIA@users.noreply.github.com> commit 5a85831 Author: Xiwen Yu <13230610+VALLIS-NERIA@users.noreply.github.com> Date: Tue May 12 09:41:25 2026 +0800 [None][fix] KVCacheManager: store spec_config on self in __init__ (By Agent) Address CodeRabbit review on NVIDIA#14003: the linear-attention sizing branch reads self.spec_config, but KVCacheManager.__init__ never assigned the incoming spec_config argument to self. The code worked only by accident, because CppMambaHybridCacheManager's _setup_mtp_intermediate_states runs before super().__init__ and writes self.spec_config on the partially initialized instance. Any other KVCacheManager path that exercises the linear-attention branch with a spec_config would AttributeError. Assign self.spec_config = spec_config at the top of KVCacheManager.__init__ so the attribute is always present regardless of subclass. Signed-off-by: Xiwen Yu <13230610+VALLIS-NERIA@users.noreply.github.com> commit 3310de0 Author: Xiwen Yu <13230610+VALLIS-NERIA@users.noreply.github.com> Date: Mon May 11 23:46:42 2026 +0800 [None][fix] KVCacheManager: fix max_draft_tokens typo and add slot reservation tests (By Agent) Follow-up to the recurrent-state slot reservation: spec_config has no max_draft_tokens attribute (DecodingBaseConfig uses StrictBaseModel with extra="forbid"). The correct field is max_draft_len, which is also the count of additional dummy request ids issued by CUDAGraphRunner._get_padded_batch. Add two unit tests that construct a real CppMambaHybridCacheManager and inspect blocks_per_window[RECURRENT_STATES]: - without spec_config: pool reserves max_batch_size + 1 slots - with spec_config(max_draft_len=N): pool reserves max_batch_size + 1 + N Tests disable block reuse so the override (max_tokens // interval) does not mask the addition under test. Signed-off-by: Xiwen Yu <13230610+VALLIS-NERIA@users.noreply.github.com> commit 992ec1b Author: Xiwen Yu <13230610+VALLIS-NERIA@users.noreply.github.com> Date: Mon May 11 23:34:04 2026 +0800 [None][fix] KVCacheManager: reserve recurrent-state slots for CUDA graph padding and draft-len sentinels (By Agent) The recurrent-state snapshot pool was sized at exactly max_batch_size, which leaves no room for the CUDA graph padding sentinel (CUDA_GRAPH_DUMMY_REQUEST_ID) and the per-draft-len sentinels that CUDAGraphRunner::_get_padded_batch issues during speculative decoding. Reserve one extra slot for the CUDA graph padding dummy, and when a spec_config is present, reserve one additional slot per draft length so all distinct sentinel request IDs can coexist without evicting live recurrent state. The block-reuse path (max_tokens // interval) still overrides this when active. Signed-off-by: Xiwen Yu <13230610+VALLIS-NERIA@users.noreply.github.com> commit ac30ded Author: Xiwen Yu <13230610+VALLIS-NERIA@users.noreply.github.com> Date: Mon May 11 20:26:14 2026 +0800 [None][fix] CppMambaHybridCacheManager: handle ranks with zero local mamba layers (By Agent) Under PP sharding, a rank can end up with only full-attention layers and no mamba layers. The previous constructor unconditionally allocated SSM / conv state and set up linear-attention metadata, leading to invalid state on these ranks. Take an early-exit after super().__init__ when local_num_mamba_layers == 0: forward num_layers (not mamba_num_layers + num_layers) and the union of mamba+attention layer masks to the parent KVCacheManager, skip all mamba-only setup. Guard prepare_resources / update_mamba_states / _setup_state_indices with the same condition so they no-op on these ranks. Add a regression test that constructs the manager with a real parent KVCacheManager and verifies: early-exit invariants on mgr state, no mamba-only attributes set, and the three guarded methods no-op without raising. Signed-off-by: Xiwen Yu <13230610+VALLIS-NERIA@users.noreply.github.com> Signed-off-by: Wanli Jiang <35160485+Wanli-Jiang@users.noreply.github.com>
Summary
Under PP sharding, a rank can end up with only full-attention layers and no mamba layers. The previous
CppMambaHybridCacheManager.__init__unconditionally allocated SSM/conv state and set up linear-attention metadata, leaving these ranks in an invalid state.This PR adds an early-exit path after
super().__init__()whenlocal_num_mamba_layers == 0:num_layers(notmamba_num_layers + num_layers) and the union of mamba+attention layer masks to the parentKVCacheManager.ssm/conv_state_shape,linear_attention_metadata, MTP intermediate states, replay buffers, layer offsets, etc.).self.requests = []before the early return so the guards below are safe.Guard
prepare_resources/update_mamba_states/_setup_state_indiceswith the same condition so they no-op on these ranks (the parent'sprepare_resourcesstill runs).Test plan
test_cpp_hybrid_zero_local_mamba_layersintests/unittest/_torch/executor/test_mamba_cache_manager.py:CppMambaHybridCacheManagerwith the real parentKVCacheManager(no mocks) onworld_size=1with an all-attention layer mask — same Python branch as a PP rank with zero local mamba layers.local_num_mamba_layers == 0,mamba_pp_layers == [],requests == [],pp_layers == [0,1,2,3], parentself.impl/blocks_per_windowinitialized,is_linear_attention is False, no mamba-only attributes set.test_mamba_cache_manager.pysuite still passes (10/10).Summary by CodeRabbit
Refactor
Tests