[None][fix] DSv4 indexer: resize radix aux scratch in update_spec_dec_param by longcheng-nv · Pull Request #14443 · NVIDIA/TensorRT-LLM

longcheng-nv · 2026-05-22T06:29:01Z

Summary

Fix a buffer-sizing regression introduced together with the persistent
radix aux scratch in PR #14297, and backfill missing CI coverage for
DSv4-relevant unit tests in tests/unittest/_torch/thop/parallel/ and
tests/unittest/_torch/attention/sparse/dsa/.

Root cause

DSAtrtllmAttentionMetadata.__init__ allocates
radix_aux_indices / radix_aux_logits sized to
max_num_sequences * (1 + max_draft_tokens) * kMaxBlocksPerRowDecode(=10) * num_sparse_topk,
and cpp/tensorrt_llm/thop/IndexerTopKOp.cpp enforces
numel >= num_rows * blocks_per_row * index_topk at every
indexer_topk_decode call. The same PR taught
update_spec_dec_param to resize kv_lens_expanded_host and
heuristic_scratch_values whenever max_draft_tokens changes at
runtime — but the parallel radix buffers were not added to that
resize path.

When the framework reconfigures max_draft_tokens post-construction
(spec decoding warmup → real run, MTP depth change, disagg gen server
picking up a larger draft length, etc.), num_rows reflects the new
bound while the radix buffers stay at their construction-time size,
producing:

RuntimeError: radix_aux_{indices,logits} must hold at least
num_rows*blocks_per_row*index_topk elements (got 10240 / 10240, need 16384)

inside torch.ops.trtllm.indexer_topk_decode. The math matches a
typical reconfig:
10240 = max_num_sequences*(1+max_draft_tokens_init) * 10 * num_sparse_topk
(init=1 Flash / init=0 Pro), 16384 = num_rows_new * blocks_per_row * index_topk.
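
The arithmetic behind the error can be sketched in a few lines. This is a minimal illustration, not code from the PR: the function names are hypothetical, and the concrete values (one sequence, 512 top-k entries, 8 blocks per row, draft length growing from 1 to 3) are assumptions picked only so that the products land on the figures in the error message. The real configuration is not stated here.

```python
K_MAX_BLOCKS_PER_ROW_DECODE = 10  # value quoted in the description above


def allocated_numel(max_num_sequences: int, max_draft_tokens: int,
                    num_sparse_topk: int) -> int:
    """Capacity of each radix aux buffer for a given draft length."""
    rows = max_num_sequences * (1 + max_draft_tokens)
    return rows * K_MAX_BLOCKS_PER_ROW_DECODE * num_sparse_topk


def required_numel(num_rows: int, blocks_per_row: int, index_topk: int) -> int:
    """Lower bound that the decode op checks on every call."""
    return num_rows * blocks_per_row * index_topk


# Hypothetical values, chosen only to reproduce the numbers in the error text.
have = allocated_numel(max_num_sequences=1, max_draft_tokens=1, num_sparse_topk=512)
need = required_numel(num_rows=4, blocks_per_row=8, index_topk=512)
print(have, need, have >= need)  # stale buffer is too small for the new row count

# Re-sizing with the new draft length restores the invariant.
resized = allocated_numel(max_num_sequences=1, max_draft_tokens=3, num_sparse_topk=512)
print(resized, resized >= need)
```

Because the allocation always uses the maximum blocks per row, a buffer re-sized for the new draft length satisfies the check for any blocks_per_row up to that maximum.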

Fix

Mirror the existing heuristic_scratch_values resize block for
radix_aux_indices / radix_aux_logits. Unlike the heuristic scratch, the
radix buffers are allocated unconditionally (the radix dispatcher can
still run when enable_heuristic_topk=True and the heuristic path falls
back for small numColumns), so the resize is placed outside the
if self.enable_heuristic_topk: guard but inside the existing
if self.max_num_sequences * (1 + self.max_draft_tokens) != init_shape:
gate.
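
The placement described above can be sketched with a small stand-in class. Everything here is an assumption made for illustration: MetadataSketch is hypothetical, plain Python lists replace the real tensors, and the heuristic scratch size is arbitrary. Only the control flow (shape-change gate, heuristic guard, unconditional radix resize) follows the description.

```python
class MetadataSketch:
    """Hypothetical stand-in for the attention metadata; not the real class."""

    MAX_BLOCKS_PER_ROW = 10  # kMaxBlocksPerRowDecode in the description

    def __init__(self, max_num_sequences, max_draft_tokens, num_sparse_topk,
                 enable_heuristic_topk=False):
        self.max_num_sequences = max_num_sequences
        self.max_draft_tokens = max_draft_tokens
        self.num_sparse_topk = num_sparse_topk
        self.enable_heuristic_topk = enable_heuristic_topk
        self._rows = max_num_sequences * (1 + max_draft_tokens)
        self.heuristic_scratch_values = None
        self._allocate()

    def _radix_numel(self):
        return self._rows * self.MAX_BLOCKS_PER_ROW * self.num_sparse_topk

    def _allocate(self):
        if self.enable_heuristic_topk:
            # Size is illustrative only; the real layout is not shown here.
            self.heuristic_scratch_values = [0] * self._rows
        # Radix buffers are allocated regardless of the heuristic flag.
        self.radix_aux_indices = [0] * self._radix_numel()
        self.radix_aux_logits = [0.0] * self._radix_numel()

    def update_spec_dec_param(self, max_draft_tokens):
        init_shape = self._rows
        self.max_draft_tokens = max_draft_tokens
        # Gate: only reallocate when the row bound actually changed.
        if self.max_num_sequences * (1 + self.max_draft_tokens) != init_shape:
            self._rows = self.max_num_sequences * (1 + self.max_draft_tokens)
            # Inside the gate, outside the heuristic guard.
            self._allocate()


meta = MetadataSketch(max_num_sequences=1, max_draft_tokens=1, num_sparse_topk=512)
before = len(meta.radix_aux_indices)
meta.update_spec_dec_param(max_draft_tokens=3)
after = len(meta.radix_aux_indices)
print(before, after)
```

With enable_heuristic_topk left at False, the radix buffers still grow, which is the case the original resize path did not handle.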

CI coverage backfill

Investigation revealed that this regression slipped past pre-merge CI
because the relevant unit tests were never listed in any
tests/integration/test_lists/test-db/*.yml:

- tests/unittest/_torch/thop/parallel/test_indexer_topk.py: directly
  exercises the persistent radix_aux scratch and CUDA Graph capture/replay
  added in #14297 ("[None][fix] DSv4 indexer: stable radix aux scratch for
  CUDA Graph safety"); its own docstring documents the very bug being fixed.
- tests/unittest/_torch/attention/sparse/dsa/test_dsa_indexer.py:
  contains TestPrepareRestoreAttnMetadataForDraftReplay, the natural
  regression site for the resize path.
- tests/unittest/_torch/attention/sparse/dsa/test_dsa_sparse_mla.py:
  DSA sparse-MLA forward, previously uncovered.
- tests/unittest/_torch/thop/parallel/test_dsv3_fused_a_gemm.py and
  test_dsv3_router_gemm.py: the only two ops from thop/parallel/
  that modeling_deepseekv4.py invokes directly.

All five files are added to l0_b200_ds.yml (single-B200 pre-merge, which
matches the world_size=1 declarations in the test files and their
skip_pre_blackwell gating).
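
One way to audit this kind of gap is a plain text scan of the test lists. The sketch below is an assumption about how such a check could look, not a tool from the repository: uncovered_tests is a hypothetical helper, it treats the .yml files as text rather than parsing them, and it assumes list entries omit the leading tests/ directory (as the paths in the follow-up commit message suggest).

```python
from pathlib import Path


def uncovered_tests(test_files, test_db_dir):
    """Return the test files that no *.yml file under test_db_dir mentions."""
    listings = [
        path.read_text(encoding="utf-8")
        for path in Path(test_db_dir).glob("*.yml")
    ]
    missing = []
    for test_file in test_files:
        # Assumption: list entries are written without the "tests/" prefix.
        entry = test_file.removeprefix("tests/")
        if not any(entry in text for text in listings):
            missing.append(test_file)
    return missing
```

Calling uncovered_tests with the five paths above and tests/integration/test_lists/test-db as the directory would, under these assumptions, return the files that still lack a list entry. A substring match can produce false positives for commented-out entries, so the result is a starting point rather than proof of coverage.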

Test Coverage

- Manual repro on the failing 8×B300 spec-decoding warmup → real run
  transition: the runtime TORCH_CHECK no longer fires after the fix.
- The added test list entries will exercise the resize path in pre-merge
  CI going forward (TestPrepareRestoreAttnMetadataForDraftReplay,
  test_indexer_topk_decode_radix_aux_cuda_graph_replay, and the
  _v4_cr4 / _dist Top-K parametric sweeps).
- No kernel or C++ changes; existing IndexerTopKOp.cpp /
  indexerTopK.cu behaviour preserved.

Notes

- Pure CPU-side buffer-bookkeeping fix; no kernel changes.
- The fix matches the exact pattern PR #14297 already used for
  heuristic_scratch_values, so no new convention is introduced.
- Broader cleanup of the remaining ~25 uncovered files under
  tests/unittest/_torch/thop/parallel/ (FP4 / FP8 / W4A* generic
  quant kernels) is tracked as a dedicated follow-up to keep this
  fix-PR scope bounded.


🤖 Generated with Claude Code

…_param

PR NVIDIA#14297 added persistent radix_aux_{indices,logits} scratch buffers in DSAtrtllmAttentionMetadata.__init__ sized to max_num_sequences * (1 + max_draft_tokens), and added a kernel-side TORCH_CHECK in IndexerTopKOp.cpp that the buffers' numel >= num_rows * blocks_per_row * index_topk. It also patched update_spec_dec_param to resize kv_lens_expanded_host (via create_expanded_buffers) and heuristic_scratch_values when max_draft_tokens changes at runtime, but missed the parallel radix buffers.

When the framework reconfigures max_draft_tokens (e.g. spec decoding warmup -> real run, or disagg gen server picking up a different draft length), num_rows starts reflecting the new bound while the radix aux buffers stay at their construction-time size, triggering

    RuntimeError: radix_aux_{indices,logits} must hold at least num_rows*blocks_per_row*index_topk elements (got 10240 / 10240, need 16384)

inside torch.ops.trtllm.indexer_topk_decode on the next forward step.

This patch mirrors the existing heuristic_scratch_values resize block for the radix buffers, allocated unconditionally to match the __init__ path (the radix dispatcher can still run when enable_heuristic_topk=True falls back for small numColumns).

Made-with: claude-code (https://claude.com/claude-code)
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: longcheng-nv <243710427+longcheng-nv@users.noreply.github.com>

longcheng-nv · 2026-05-22T06:54:02Z

/bot run --disable-fail-fast

Five previously-uncovered files added to the single-B200 DS pre-merge list, covering kernel paths exercised by DeepSeek-V4 modeling:

- unittest/_torch/thop/parallel/test_indexer_topk.py
- unittest/_torch/attention/sparse/dsa/test_dsa_indexer.py
- unittest/_torch/attention/sparse/dsa/test_dsa_sparse_mla.py
- unittest/_torch/thop/parallel/test_dsv3_fused_a_gemm.py
- unittest/_torch/thop/parallel/test_dsv3_router_gemm.py

The first three guard the radix_aux scratch + update_spec_dec_param resize path fixed in the preceding commit; the last two cover torch.ops.trtllm.dsv3_fused_a_gemm_op and dsv3_router_gemm_op which modeling_deepseekv4.py invokes directly.

All files declare world_size=1 / tp_size=1 and use skip_pre_blackwell or skip_pre_hopper gating, so they fit the single-B200 pre_merge condition.

Broader cleanup of the remaining ~25 uncovered files under tests/unittest/_torch/thop/parallel/ (FP4 / FP8 / W4A* generic quant kernels) is tracked as a follow-up.

Made-with: claude-code (https://claude.com/claude-code)
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: longcheng-nv <243710427+longcheng-nv@users.noreply.github.com>

tensorrt-cicd · 2026-05-22T07:00:00Z

PR_Github #49873 [ run ] triggered by Bot. Commit: 975ef28 Link to invocation

longcheng-nv · 2026-05-22T07:00:01Z

/bot run --disable-fail-fast

tensorrt-cicd · 2026-05-22T07:06:51Z

PR_Github #49876 [ run ] triggered by Bot. Commit: 975ef28 Link to invocation

tensorrt-cicd · 2026-05-22T07:10:31Z

PR_Github #49873 [ run ] completed with state ABORTED. Commit: 975ef28

…PRoundingMode

Migrate `cute.arch.fma_packed_f32x2(..., rnd=nvvm.RoundingModeKind.RN)` to `rnd='rn'` (8 callsites in fp8_paged_mqa_logits.py). The cute DSL FPRoundingMode parameter now accepts only string literals; the enum form raises:

    TypeError: Expected a string literal for FPRoundingMode, but got enum 'RoundingModeKind.RN'. Please pass a string instead (e.g., 'rn' instead of RoundingModeKind.RN).

on the DSL kernel path used by torch.ops.trtllm.cute_dsl_fp8_paged_mqa_logits.

Drop now-unused `nvvm` from the cutlass._mlir.dialects import.

Verified locally on B200:

    pytest tests/unittest/_torch/attention/sparse/dsa/test_dsa_indexer.py \
        -k "test_indexer_decode_with_paged_kv_cache and dsl"
    -> 8 passed in 11.85s (previously 8 failed with FPRoundingMode TypeError).

Made-with: Claude Code
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: longcheng-nv <243710427+longcheng-nv@users.noreply.github.com>

longcheng-nv · 2026-05-22T09:27:22Z

/bot run --disable-fail-fast

tensorrt-cicd · 2026-05-22T09:33:42Z

PR_Github #49910 [ run ] triggered by Bot. Commit: b67c7b0 Link to invocation

tensorrt-cicd · 2026-05-22T09:39:34Z

PR_Github #49876 [ run ] completed with state ABORTED. Commit: 975ef28

longcheng-nv requested a review from a team as a code owner May 22, 2026 06:29

longcheng-nv requested review from brb-nv and removed request for a team May 22, 2026 06:29

github-actions Bot assigned longcheng-nv May 22, 2026

longcheng-nv requested review from lfr-0531 and mingyangHao and removed request for brb-nv May 22, 2026 06:41

longcheng-nv force-pushed the fix/dsa-radix-aux-resize-on-spec-dec-update branch from 98bf3f2 to f779a46 Compare May 22, 2026 06:48

longcheng-nv force-pushed the fix/dsa-radix-aux-resize-on-spec-dec-update branch from f779a46 to 975ef28 Compare May 22, 2026 06:59

lfr-0531 approved these changes May 22, 2026

lfr-0531 added the deepseek-v4 label May 22, 2026

longcheng-nv requested a review from a team as a code owner May 22, 2026 09:27

longcheng-nv requested review from HuiGao-NV and suyoggupta and removed request for a team May 22, 2026 09:27

longcheng-nv wants to merge 3 commits into NVIDIA:feat/deepseek_v4 from longcheng-nv:fix/dsa-radix-aux-resize-on-spec-dec-update

3 participants
