[None][fix] DSv4 indexer: stable radix aux scratch for CUDA Graph safety by longcheng-nv · Pull Request #14297 · NVIDIA/TensorRT-LLM

longcheng-nv · 2026-05-19T08:21:35Z

Description

The fp32 Radix path of indexer_topk_decode was not CUDA-Graph-safe: it allocated its split-work scratch buffers (aux_indices, aux_logits) via th::empty per call when blocks_per_row > 1. Under CUDA Graph capture + replay these per-call pointers become stale when the caching allocator is perturbed by chunked-prefill activations at high concurrency. The captured Radix part1 / part2 kernels then write to recycled memory; the resulting CUDA_ERROR_ILLEGAL_ADDRESS surfaces at the next sync (commonly inside DeepGEMM's smxx_layout.hpp:97 TMA descriptor pack), looking like a downstream FP8 GEMM crash but caused upstream by the indexer.

IndexerTopKOp.cpp:91 already documents the pitfall for the Heuristic path:

"Caller-owned scratch buffer for heuristic TopK output values. Must be pre-allocated with stable address for CUDA Graph compatibility."

and dsa.py:879 follows that contract for heuristic_scratch_values. The Radix path was the missing half of the same pattern — this PR adds it.

Observed failure (Flash MXFP4, ISL ≈ 100K, TP = EP = 4, BS = 32, MTP = 0)

cell	CONC	Mode	GVR	result
`cell_013_C008_TEP_GVRF`	8	TEP	OFF (Radix)	`CUDA 700` at `smxx_layout.hpp:97`
`cell_017_C016_TEP_GVRF`	16	TEP	OFF (Radix)	`CUDA 700` at `smxx_layout.hpp:97`
`cell_021_C032_TEP_GVRF`	32	TEP	OFF (Radix)	`CUDA 700` at `smxx_layout.hpp:97`

All three crashes converge on the identical stack: fp8_swap_ab_gemm → _fp8_quantize_1x128_ue8m0 → deep_gemm.get_mn_major_tma_aligned_packed_ue8m0_tensor → smxx_layout.hpp:97. The DeepGEMM line is the next sync after the offending kernel, not the offender itself. Same-point GVR-ON cells (Heuristic path, uses persistent heuristic_scratch) run clean — confirming the asymmetry is in the indexer.

Change

cpp/tensorrt_llm/thop/IndexerTopKOp.cpp — extend indexer_topk_decode with two optional Tensor? kwargs (radix_aux_indices, radix_aux_logits). When both are provided and blocks_per_row > 1, use the caller-owned buffers; otherwise fall back to per-call th::empty (back-compat for bench scripts / warmup helpers / callers not under CUDA Graph capture). Validates is_cuda, same device, contiguous, dtype (int32 / float32), numel ≥ num_rows × blocks_per_row × index_topk.
tensorrt_llm/_torch/attention_backend/sparse/dsa.py — unconditionally allocate persistent radix_aux_indices / radix_aux_logits in DSAtrtllmAttentionMetadata sized to the worst case (max_gen_tokens × kMaxBlocksPerRowDecode = 10 × num_sparse_topk). Buffers are routed through self.get_empty(self.cuda_graph_buffers, ...) so they have stable addresses across graph replays and survive metadata-level CUDA-graph capture. Passed at the call site.
tensorrt_llm/_torch/custom_ops/cpp_custom_ops.py — extend the register_fake Python signature to match the new schema.

Side-effect surface

Same Radix kernel invocation, same dispatch logic, same numerical output.
DSv3.2 sparse path shares this op and metadata class — benefits transparently, no regression.
API back-compat: new kwargs default to None; existing callers continue to use the th::empty fallback unchanged.
Memory cost: ≈ 1.3 MB per DSAtrtllmAttentionMetadata instance on Flash defaults (max_num_seqs = 32 × kMaxBlocksPerRowDecode = 10 × index_topk = 512 × 2 buffers × 4 B/elt). Negligible.

Test Coverage

Unit tests added to tests/unittest/_torch/thop/parallel/test_indexer_topk.py:

test_indexer_topk_decode_radix_aux_equivalence — Parametrised over (batch_size, num_tokens, compress_ratio) shapes that force blocks_per_row > 1 (including the exact batch_size=8, num_tokens=16384, compress_ratio=4 that mirrors the Flash CONC=8 failure point). Verifies caller-owned-aux output is bit-identical to the legacy th::empty fallback.
test_indexer_topk_decode_radix_aux_cuda_graph_replay — Captures the op into a CUDA Graph with caller-owned aux, replays 8× and asserts each replay's output matches a non-graph reference. This is the direct regression test for the original stale-pointer bug.
test_indexer_topk_decode_radix_aux_validation — Negative tests for the new validation paths: wrong dtype (indices not int32 / logits not float32), undersized buffer (< num_rows × blocks_per_row × index_topk), non-contiguous tensor.

Existing coverage retained:

test_indexer_topk_decode / test_indexer_topk_decode_sm_saturation / test_indexer_topk_decode_launch_policy_transitions continue to exercise the th::empty fallback path (do not pass the new kwargs).
bench_indexer_topk_kernel.py continues to work unchanged (does not pass new kwargs).

End-to-end validation:

Re-run of pareto_v4_pro_swebench100k_b300 sweep against this binary on 8× B300 is in progress; CONC = 1 paired cells (P1 / P2) already completed cleanly with the new fix in place. High-CONC GVR-OFF cells (the ones that previously triggered G4) will be the live test.

PR Checklist

Please review the following before submitting your PR:

PR description clearly explains what and why. If using CodeRabbit's summary, please make sure it makes sense.
PR Follows TRT-LLM CODING GUIDELINES to the best of your knowledge.
Test cases are provided for new code paths (see test instructions)
Any new dependencies have been scanned for license and vulnerabilities
CODEOWNERS updated if ownership changes
Documentation updated as needed
Update tava architecture diagram if there is a significant design change in PR.
The reviewers assigned automatically/manually are appropriate for the PR.
Please check this after reviewing the above items as appropriate for this PR.

GitHub Bot Help

To see a list of available CI bot commands, please comment /bot help.

coderabbitai · 2026-05-19T08:33:18Z

Caution

Review failed

Failed to post review comments

📝 Walkthrough

Walkthrough

This PR introduces DeepSeek-V4 KV cache compression kernels, MHC hyper-connection fused GPU kernels, NIXL transfer agent virtual memory region mapping, TopK decode compression support, and FP8 block-scale MoE enhancements across the TensorRT-LLM kernel and executor stacks, along with supporting CMake wiring and infrastructure updates.

Changes

KV Cache Compression System

Layer / File(s)	Summary
DeepSeek-V4 KV Cache Compression Kernels `cpp/tensorrt_llm/kernels/compressorKernels/CMakeLists.txt`, `compressorKernels.h`, `compressorKernels.cu`	Three main CUDA kernels implement online-softmax-based KV cache compression: `pagedKvCompressKernel` for decode-step token compression, `prefillReductionKernel` for bulk prefill compression, and `postProcessScatterKernel` for RMSNorm + RoPE + Hadamard butterfly + paged scatter with FP8/MXFP4 quantization support.

MHC Hyper-Connection Fused Kernels

Layer / File(s)	Summary
MHC Kernel Infrastructure `cpp/tensorrt_llm/kernels/mhcKernels/CMakeLists.txt`, `mhcKernels.h`, `mhcKernels.cu`	Core MHC kernel launchers: `mhcBigFuseKernel` for token-wise split-K reduction + Sinkhorn normalization + RMSNorm fusion, `mhcGemmSqrsumFmaKernel` for split-N GEMM + sqr-sum fusion, and post-mapping/head-apply kernels for residual and gating operations.
MHC TF32 Fused Kernels `cpp/tensorrt_llm/kernels/mhcKernels/fused_tf32_pmap_gemm.cuh`, `mhcFusedHcKernel.cu`	TF32 tcgen05 MMA + TMA pipelined kernels: two-step `mhcFusedHcLaunch` (GEMM → bigFuse postlog) and all-in-one `mhcFusedHcAllInOneLaunch` with done-counter-based last-CTA election for fused postlogue execution.
MHC FMA Fused Kernels `cpp/tensorrt_llm/kernels/mhcKernels/mhc_fused_fma.cuh`, `mhcFusedHcKernel.cu`	Register-path FMA variants: `fused_pmap_gemm_fma_ksplit` for two-step K-split execution and `fused_pmap_gemm_fma_allinone` for single-launch pmap + GEMM + bigFuse with atomicAdd reduction when multi-split.

NIXL Transfer Agent VMM Region Mapping

Layer / File(s)	Summary
VMM Region Utilities and Agent Metadata `cpp/include/tensorrt_llm/executor/transferAgent.h`, `cpp/tensorrt_llm/executor/cache_transmission/transferAgent.cpp`	Introduces `VramRegionInfo`, `VramRegionMap`, and `VmmDescSplitter` utilities for backend-agnostic chunk-aware descriptor splitting. `VmmDescSplitter` provides static methods: `lookupChunkInfo()`, `splitDescsWithRegionMap()`, `splitTransferDescsWithRegionMaps()`, `splitVmmDescs()`, and `detectVramRegionMap()`. Extends `AgentDesc` with serialization/deserialization of per-region VMM metadata for remote boundary computation.
NIXL Transfer Status and Agent Lifecycle `cpp/tensorrt_llm/executor/cache_transmission/nixl_utils/transferAgent.h`, `transferAgent.cpp`	Adds RAII cleanup for `NixlTransferStatus` with destructor, status tracking via `mLastStatus` atomic, and accessors `getLastStatus()`/`getLastStatusStr()`. Extends `NixlTransferAgent` with `shutdown()` method and updates `registerMemory()`, `deregisterMemory()`, `loadRemoteAgent()`, `getLocalAgentDesc()`, and `submitTransferRequests()` to use `VmmDescSplitter` region maps instead of registry-based splitting. `NixlLoopbackAgent` gains a destructor for safe resource cleanup.
Python Bindings `cpp/tensorrt_llm/executor/cache_transmission/nixl_utils/agentBindings.cpp`	Adds nanobind methods `serialize()` and `deserialize()` for `AgentDesc`, and extends `BaseTransferAgent` and `NixlTransferAgent` bindings with `nb::call_guard<nb::gil_scoped_release>()` to release Python GIL during long-running transfer operations.

TopK Decode Compression Support

Layer / File(s)	Summary
Heuristic TopK Decode Compression `cpp/tensorrt_llm/kernels/heuristicTopKDecode.h`, `heuristicTopKDecode.cu`	Updates heuristic TopK kernels to accept `compressRatio` parameter, compute `actual_kv_len` from token window, and derive search domain `N` as `actual_kv_len / compressRatio`. Adjusts `preIdxOffset` temporal-shift logic to depend on `compressRatio` mode (nonzero offset for ratio==1, zero otherwise).
Indexer TopK Multi-Block Scheduling `cpp/tensorrt_llm/kernels/IndexerTopK.h`, `indexerTopK.cu`	Adds `computeIndexerTopKDecodeBlocksPerRow()` helper for SM-aware multi-block split/merge sizing. Introduces file-scope tuning constants (`kNumBins`, split-work thresholds). Updates `topKPerRowDecode` kernel to compute `rowEnd` as `actual_kv_len / compressRatio`. Tightens heuristic eligibility gating to allowed `compressRatio` modes (1 or 4). Refactors dispatch logic around computed `blocksPerRow` and fixes histogram refinement with explicit counter resets and garbage index initialization.

FP8 Block-Scale MoE and Routing Enhancements

Layer / File(s)	Summary
SwiGLU Clamping for FP8 Activation `cpp/tensorrt_llm/kernels/trtllmGenKernels/blockScaleMoe/DevKernel.h`, `DevKernel.cu`, `runner.cu`	Extends `activation::Data` and `activation::KernelParams` with optional `swigluLimit` and `hasSwigluLimit` fields. Both standard and DeepSeek FP8 kernels conditionally clamp gate/up inputs before activation computation. Runner forwards scalar clamp value for DeepSeek FP8 path.
DeepSeekV4 Routing Mode `cpp/tensorrt_llm/kernels/trtllmGenKernels/blockScaleMoe/runner.h`, `runner.cu`	Adds `DeepSeekV4 = 7` to `RoutingMethodType` enum (renumbering `Unspecified` to 8), adds serialization support, and extends `MoERunnerArgs` with `gemm1_clamp_limit_value` and `has_gemm1_clamp_limit_value` for scalar clamp-limit FP8 path.
Top-K Sort Specializations `cpp/tensorrt_llm/kernels/trtllmGenKernels/blockScaleMoe/routing/RoutingKernelTopK.cuh`	Adds explicit `Sort<N, RedType>` template specializations for N=5,6,7,8 using hardcoded `TOPK_SWAP` sequences to enable up-to-8-way expert selection.
FP8 Block-Scale GEMM Quantization `cpp/tensorrt_llm/kernels/cutlass_kernels/fp8_blockscale_gemm/fp8_blockscale_quant_packed.h`, `fp8_blockscale_quant_packed.cu`	Fused FP8 1x128 quantization kernel: loads BF16, computes per-block amax, derives UE8M0 scale, quantizes to FP8 e4m3, and packs four scale bytes into uint32 MN-major layout for block-scale GEMM dispatch.
Custom MoE Gate Forward Kernel `cpp/tensorrt_llm/kernels/customMoeRoutingKernels.h`, `customMoeRoutingKernels.cu`	New gate_forward kernel computes per-expert score transform (softplus), performs top-K selection in hash or topK mode, normalizes weights, and dispatches for `n_experts` 256 and 384.

Infrastructure and Configuration Updates

Layer / File(s)	Summary
Build System Wiring `cpp/tensorrt_llm/CMakeLists.txt`, `cpp/tensorrt_llm/kernels/CMakeLists.txt`	Adds `mhcKernels_src` and `compressorKernels_src` subdirectories and object libraries to CMake build tree and links them into the main shared target.
Attention Workspace Sizing `cpp/tensorrt_llm/common/attentionOp.h`, `attentionOp.cpp`	Extends `getWorkspaceSizeForContext()` with `total_kv_len` parameter for KV cache reuse scenarios where effective total KV length exceeds `max_num_tokens`. Updates FP8 Context MLA K/V buffer sizing accordingly.
KV Cache Manager Utils `cpp/tensorrt_llm/batch_manager/kvCacheManagerV2Utils.h`	Adds `IndexMapper::size()` and `IndexMapper::numFreeSlots()` accessor methods for active sequence and free-slot counts.
NIXL CMake Update `cpp/tensorrt_llm/executor/cache_transmission/nixl_utils/CMakeLists.txt`	Narrows NIXL wrapper CUDA dependencies from cuda_driver + cudart to cudart only.
Configuration and Dependencies `.pre-commit-config.yaml`, `3rdparty/fetch_content.json`, `cpp/tensorrt_llm/kernels/dsv3MinLatencyKernels/dsv3RouterGemm.cu`	Excludes broader `trtllm-gen` generated artifacts in pre-commit config, updates deepgemm dependency pin, and adds DSV3 router GEMM template instantiations for `kHiddenDim=4096`.

Sequence Diagram(s)

sequenceDiagram
  participant App as Application
  participant Compressor as Compressor Kernel
  participant KVCache as Paged KV Cache
  participant Transfer as NIXL Transfer
  
  Note over App,Transfer: KV Cache Compression + Transfer Workflow
  
  App->>Compressor: Decode: pagedKvCompressLaunch(kv_score, ...)
  activate Compressor
  Compressor->>KVCache: Load prior compressor state
  Compressor->>Compressor: Online softmax (running max + sum)
  Compressor->>KVCache: Write updated state + compressed tokens
  deactivate Compressor
  
  App->>Compressor: Prefill: prefillReductionLaunch(kv_score, ...)
  activate Compressor
  Compressor->>KVCache: Load full sequence
  Compressor->>Compressor: Reduce COMPRESS_RATIO rows via softmax
  Compressor->>KVCache: Write compressed output
  deactivate Compressor
  
  App->>Compressor: PostProcess: postProcessScatterLaunch(kv_comp, ...)
  activate Compressor
  Compressor->>Compressor: RMSNorm + RoPE + Hadamard
  Compressor->>KVCache: Scatter with FP8/MXFP4 quantization
  deactivate Compressor
  
  App->>Transfer: submitTransferRequests(src, dst)
  activate Transfer
  Transfer->>Transfer: detectVramRegionMap(src)
  Transfer->>Transfer: splitTransferDescsWithRegionMaps
  Transfer->>Transfer: Create per-chunk transfer ops
  deactivate Transfer

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Possibly related PRs

NVIDIA/TensorRT-LLM#13740: Updates DSV3 router GEMM template instantiation variants for bf16 with different hidden dimensions, directly related to the dsv3RouterGemm.cu changes in this PR.

Suggested reviewers

2ez4bz
byshiue
Wanli-Jiang
omera-nv
moraxu

✨ Finishing Touches

🧪 Generate unit tests (beta)

Create PR with unit tests

⚔️ Resolve merge conflicts

Resolve merge conflict in branch fix/indexer-radix-aux-cuda-graph

tensorrt-cicd · 2026-05-19T14:40:51Z

PR_Github #49177 [ run ] completed with state ABORTED. Commit: 5f75930

Link to invocation

tensorrt-cicd · 2026-05-19T16:30:44Z

PR_Github #49209 [ run ] completed with state SUCCESS. Commit: 5f75930
/LLM/main/L0_MergeRequest_PR pipeline #38883 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

Please check the failed tests and fix your PR
If you cannot view the failures, ask the CI triggerer to share details
Once fixed, request an NVIDIA team member to trigger CI again

CI Agent Failure Analysis

Link to invocation

longcheng-nv · 2026-05-20T01:00:07Z

/bot run --disable-fail-fast

tensorrt-cicd · 2026-05-20T01:07:41Z

PR_Github #49291 [ run ] triggered by Bot. Commit: 5f75930 Link to invocation

heyuhhh · 2026-05-20T06:51:22Z

+            // (back-compat for bench scripts / older callers — exposes the
+            // CUDA-Graph stale-pointer hazard at high CONC, see SKILL G4).
+            int64_t const needed_elts = static_cast<int64_t>(num_rows) * blocks_per_row * index_topk;
+            if (radix_aux_indices.has_value() && radix_aux_logits.has_value())


This should always be true because we pre-allocated the space and pass them in the function. For simplify i think we can add some assert sentence here to make sure and remove the remaining code, especially the if-else

heyuhhh

LGTM. Leave little comments for refine the code. Thanks!

tensorrt-cicd · 2026-05-20T08:00:11Z

PR_Github #49291 [ run ] completed with state SUCCESS. Commit: 5f75930
/LLM/main/L0_MergeRequest_PR pipeline #38953 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

Please check the failed tests and fix your PR
If you cannot view the failures, ask the CI triggerer to share details
Once fixed, request an NVIDIA team member to trigger CI again

CI Agent Failure Analysis

Link to invocation

…precondition Two small follow-ups on top of the radix-aux PR head: 1. model_config.py — drop the duplicate `indexer_k_dtype` kwarg in the V4 `DeepSeekV4SparseAttentionConfig(...)` construction. The `update_sparse_attention_indexer_config` dict-builder already writes `indexer_config['indexer_k_dtype']`, and `**indexer_config` unpacks it at the call site. Passing it ALSO as an explicit kwarg raised: TypeError: DeepSeekV4SparseAttentionConfig() got multiple values for keyword argument 'indexer_k_dtype' This was failing all 10 of the L0_MergeRequest_PR/38953 test cases that hit `ModelConfig.from_pretrained` on a V4 checkpoint (`test_deepseek_v4_sparse_ratios_*`, `TestDeepSeekV4Flash.test_auto_dtype`, B200 / B300 / GB200 agg + disagg). None of the 10 was related to the radix-aux change itself; all fail before reaching any indexer code. 2. IndexerTopKOp.cpp — apply heyuhhh's review suggestion. Now that the in-tree caller (`dsa.py:DSAtrtllmAttentionMetadata.allocate`) always pre-allocates the radix aux buffers, the per-call `th::empty` fallback is dead code in production. Replace the `if/else` branch with a single `TORCH_CHECK(radix_aux_indices.has_value() && ...)` precondition plus the existing tensor-property checks. Net: -22/+12 lines, same behavior on the success path, cleaner failure mode (clear error vs silent CUDA-Graph stale-pointer corruption) on any future caller that forgets to pass the buffers. Validation: - Existing radix-aux unit tests still cover the path: `test_indexer_topk_decode_radix_aux_equivalence` (7 cases) and `test_indexer_topk_decode_radix_aux_cuda_graph_replay` (1 case) both pass the buffers explicitly, so they exercise the success branch. - Existing parametrized `test_indexer_topk_decode` (3 test classes, `num_tokens` ∈ {4096, 8192, 16384, 32768}) all keep `num_columns < splitWorkThreshold = 200_000`, so `blocks_per_row == 1` and the new TORCH_CHECK does not fire. - Warmup helper at dsa.py:102 calls with `num_cols` for a single row; also `blocks_per_row == 1`. Made-with: Claude Code (Opus 4.7, 1M context) Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> Signed-off-by: longcheng-nv <243710427+longcheng-nv@users.noreply.github.com>

longcheng-nv · 2026-05-20T08:33:28Z

Pushed commit 383d8c5 (top of HEAD now) addressing two items:

1. @heyuhhh's inline review on IndexerTopKOp.cpp:L132 — done. Replaced the if (radix_aux_indices.has_value() && ...) runtime branch with a TORCH_CHECK(...) precondition and dropped the dead else fallback (per-call th::empty). The success path is byte-identical; future callers that forget to supply the buffers now get a clear error instead of silent CUDA-Graph stale-pointer corruption.

2. Unrelated CI unblock in model_config.py — the previous run (PR_Github #49291, L0_MergeRequest_PR/38953) failed all 10 V4 test cases with:

TypeError: tensorrt_llm.llmapi.llm_args.DeepSeekV4SparseAttentionConfig()
got multiple values for keyword argument 'indexer_k_dtype'

Root cause was a duplicate-kwarg on the V4 construction site at model_config.py:838 — indexer_k_dtype was being passed both as an explicit kwarg and via **indexer_config (which itself sets indexer_config['indexer_k_dtype'] in update_sparse_attention_indexer_config). None of the 10 failures involved the radix-aux code itself; they all abort in ModelConfig.from_pretrained before reaching any indexer path. Folded the 2-line fix into this PR rather than opening a separate one for speed. Full reasoning in the commit body.

@lfr-0531 @mingyangHao Could you take a look when convenient? Local validation summary already in the PR body's "Local validation results" section (7/7 radix-aux unit tests pass + 3 high-CONC TEP+Radix CUDA-illegal-address crashes on B200 Pareto sweep are fully eliminated).

longcheng-nv · 2026-05-20T08:33:29Z

/bot run --disable-fail-fast

tensorrt-cicd · 2026-05-20T08:40:55Z

PR_Github #49397 [ run ] triggered by Bot. Commit: 383d8c5 Link to invocation

heyuhhh · 2026-05-20T09:13:28Z

Do we really need this two tensors? As we have already passed two aux tensors. I think we can just used them to avoid incomplete graph capture

@heyuhhh

Address @heyuhhh's second review comment on PR NVIDIA#14297 (cpp/tensorrt_llm/thop/IndexerTopKOp.cpp:122). The two `th::empty({0}, ...)` placeholders for the fp32 radix aux scratch became vestigial after commit 383d8c5 made the caller-owned buffers required for blocks_per_row > 1 — they were only reachable in the blocks_per_row == 1 case, where the kernel never dereferences them. Replace both Tensor placeholders with raw nullable `int32_t*` / `float*` pointers (matching the existing `heuristicScratchPtr` convention at lines 98-111). Hoist all caller-buffer validation out of the `blocks_per_row > 1` branch (dtype/device/contiguity always; numel only when the kernel actually needs them). For `blocks_per_row == 1` without caller-owned scratch, pass nullptr through to the kernel; for `> 1` without scratch, fail early with the same TORCH_CHECK message as before. Side-effect: also fix two same-PR-introduced regressions that the strict TORCH_CHECK in 383d8c5 would now trip: 1. `warmup_heuristic_topk_decode` (dsa.py:68) uses `num_cols=4096`, which falls below `kSeqSmall=12288`, so canUseHeuristic is false and the call routes to the Radix path with `blocks_per_row=2` (`num_rows=1` sweeps bp ∈ [2, maxByCols=2]). Allocate worst-case (kMaxBlocksPerRowDecode=10) caller-owned scratch in the warmup — cost is ~80 KB, the warmup runs once. 2. `_run_indexer_topk_decode_check` test helper (test_indexer_topk.py:229) covers params that produce `blocks_per_row > 1` (e.g. `num_tokens=32768, compress_ratio=1` → numCols=32768 → bp=9–10). Same fix: pre-allocate `_build_radix_aux_buffers` and pass via kwargs. Same change applied to `test_indexer_topk_decode_dist` (fp32 with num_tokens=8192 falls below kSeqSmall, hits Radix with bp>1) and `_run_indexer_topk_decode_v4_gvr_check` (defense-in-depth — GVR almost always succeeds for its param matrix but the kwarg is now uniform). The radix-aux equivalence test no longer makes sense in its original form (it compared caller-owned vs the removed `th::empty` fallback); rewrite it to verify allocation-pointer-independence — two invocations with freshly-allocated aux at distinct backing addresses must yield the same selected top-K logit values. Validation: - `pre-commit run` clean on the three changed files (clang-format, ruff-format, isort, codespell, yapf). - radix_aux unit tests will run in CI; pre-commit pre-flight is green. Made with Claude Code (Opus 4.7, 1M context) Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> Signed-off-by: longcheng-nv <243710427+longcheng-nv@users.noreply.github.com>

longcheng-nv · 2026-05-20T15:45:10Z

Pushed commit ffbc5addce addressing @heyuhhh's second inline review comment on IndexerTopKOp.cpp:122.

The reviewer's point — the two th::empty({0}, ...) placeholder Tensors at lines 122-123 became vestigial after the prior commit 383d8c5 tightened the blocks_per_row > 1 contract. They were only reachable in the blocks_per_row == 1 case, where the kernel never dereferences them. Reviewer suggested using the caller-passed buffers directly to avoid the leftover per-call allocation.

The fix — replace both Tensor placeholders with raw nullable int32_t* / float* pointers, matching the existing heuristicScratchPtr convention at lines 98-111. Hoist all caller-buffer validation (dtype / device / contiguity) out of the bp > 1 branch; keep the numel check inside it since that bound only matters when the kernel actually uses the buffers. For bp == 1 without caller-owned scratch, pass nullptr through; for bp > 1 without scratch, fail early with the same TORCH_CHECK message as before.

Same-PR-introduced regressions also fixed — chased down two paths that the strict TORCH_CHECK in 383d8c5 would now trip but weren't caught locally (the previous run was filtered with -k radix_aux, and CI broke earlier at model_config.py so the indexer tests never executed):

warmup_heuristic_topk_decode (dsa.py:68) — default num_cols=4096 falls below kSeqSmall=12288 so canUseHeuristic is false and the call routes to Radix with blocks_per_row=2 (num_rows=1 sweeps bp ∈ [2, maxByCols=2]). Allocate worst-case kMaxBlocksPerRowDecode=10 caller-owned scratch in the warmup. Cost ~80 KB, one-shot.
_run_indexer_topk_decode_check (test_indexer_topk.py:229) — covers params like num_tokens=32768, compress_ratio=1 → numCols=32768 → bp=9–10. Same fix: pre-allocate via existing _build_radix_aux_buffers and pass via kwargs. Same change applied to test_indexer_topk_decode_dist (fp32 + num_tokens=8192 hits Radix bp>1) and _run_indexer_topk_decode_v4_gvr_check (defense-in-depth — GVR almost always succeeds for that param matrix, but the kwarg is now uniform).

The radix-aux equivalence test no longer made sense in its original form (compared caller-owned vs the removed th::empty fallback). Rewrote it to verify allocation-pointer-independence — two invocations with freshly-allocated aux at distinct backing addresses must yield the same selected top-K logit values.

pre-commit run is clean on all three changed files (clang-format, ruff-format, isort, codespell, yapf).

cc @heyuhhh @lfr-0531 @mingyangHao

longcheng-nv · 2026-05-20T15:45:19Z

/bot run --disable-fail-fast

tensorrt-cicd · 2026-05-20T15:54:09Z

PR_Github #49446 [ run ] triggered by Bot. Commit: ffbc5ad Link to invocation

tensorrt-cicd · 2026-05-20T15:59:27Z

PR_Github #49397 [ run ] completed with state ABORTED. Commit: 383d8c5

Link to invocation

tensorrt-cicd · 2026-05-20T20:00:07Z

PR_Github #49446 [ run ] completed with state SUCCESS. Commit: ffbc5ad
/LLM/main/L0_MergeRequest_PR pipeline #39091 completed with status: 'SUCCESS'

CI Report

Link to invocation

longcheng-nv · 2026-05-21T01:02:02Z

@lfr-0531 Friendly ping — this PR is ready and would appreciate a merge when you have a moment.

Current status (head ffbc5ad):

✅ CI green — latest PR_Github #49446 on ffbc5ad → SUCCESS (L0_MergeRequest_PR pipeline #39091 SUCCESS)
✅ Approved by @heyuhhh on 2026-05-20: "LGTM. Leave little comments for refine the code. Thanks!"
✅ Both inline review comments resolved:
- IndexerTopKOp.cpp:L132 — replaced the if (radix_aux_indices.has_value() && ...) runtime branch with a TORCH_CHECK(...) precondition and dropped the dead else fallback (commit 383d8c5).
- IndexerTopKOp.cpp:L122 — removed the two vestigial th::empty({0}, ...) placeholder tensors; the caller-passed aux tensors are reused directly (commit ffbc5ad).
✅ Mergeable: clean, base = feat/deepseek_v4.

Scope recap: makes the fp32 Radix path of indexer_topk_decode CUDA-Graph-safe by moving aux_indices / aux_logits scratch to caller-owned stable allocations — mirroring the same contract heuristic_scratch_values already follows on the Heuristic path. Fixes the CUDA 700 at smxx_layout.hpp:97 seen at ISL ≈ 100K / BS ≥ 8 with GVR-OFF (Radix) on Flash MXFP4.

Thanks!

…ynth Two new skills mirror the V3.2 swebench-temporal-synth methodology but parameterised for V4 production: swebench-temporal-synth-v4flash - K=512, compress_ratio=4 - 3 beta cfgs fitted from real V4 Flash captures (21 GVR-active layers, even 2..42), three-bucketed by mean - per-cfg target_hr ~ 0.36 / 0.46 / 0.44 swebench-temporal-synth-v4pro - K=1024, compress_ratio=4 - 3 beta cfgs fitted from real V4 Pro captures (30 GVR-active layers, even 2..60) - per-cfg target_hr ~ 0.69 / 0.75 / 0.77 (Pro decode steps show much stronger prev/current preIdx overlap than Flash) Both skills enforce the V4 kernel contract: - preIdx caller-side offset = 0 (cr=4 path; kernel uses preIdx[i] directly per heuristicTopKDecode.cu preIdxOffset rule) - radix_aux_{indices,logits} pre-allocated for split-work path (post-NVIDIA#14297 contract) - numColumns % 8 == 0 (V4 kernel alignment) - logits dtype switchable to fp32 / bf16 / fp16 Sources: - Real V4 Flash + Pro captures under auto_optimization_v1/ablation_study/gvr_phase_timing/09_precision_ablation/{11,12}_dsv4_*_indexer_data_capture/ - One-shot beta + hit_rate fit via analyze_v4_dist.py - V3.2 sibling skill .claude/skills/swebench-temporal-synth/ 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <noreply@anthropic.com> Signed-off-by: longcheng-nv <243710427+longcheng-nv@users.noreply.github.com>

…mmetric clip Two refinements to the V4 sibling skills landed last commit: 1. BENCH=1 path now uses nsys instead of cuda.Event - run_all_n.sh BENCH=1 synthesises all 12 cells (no inline --bench), then runs ONE nsys session over the whole output dir via bench_nsys.py, exports the NVTX→GPU projection CSV via `nsys stats`, and parses to a per-(cfg, N, BS, dtype) R/H summary (`summary_table.txt` + `nsys_speedup_summary.json`). - nsys timing measures pure GPU kernel duration and avoids the 5-10 µs launch-tail bias of cuda.Event that systematically under-reported R/H by 0.6-1.8×. 2. Distribution clip is now asymmetric (clip_low, clip_high) - V4 indexer logits show stronger positive-tail extension than negative (e.g. Flash L2 32K: mean=-1.21, range=[-3.28, +3.54] so the range-center is +0.13, well above mean). The V3.2 sibling's `low = mean - fr/2, high = mean + fr/2` symmetric envelope truncates the positive tail and over-extends negative. - BETA_CFGS replaces `full_range` with explicit `clip_low / clip_high` taken from the per-(layer, ISL) observed (min, max) widest envelope. sample_beta_row uses these directly. Same beta-fit math; only the support changes. Updated cfgs: Flash beta_shallow: [-4.36, +7.83] (was [-4.97, +2.34]) Flash beta_moderate: [-5.08, +7.17] (was [-6.06, +1.89]) Flash beta_deep: [-6.28, +7.55] (was [-8.32, +3.13]) Pro beta_shallow: [-4.54, +7.33] (was [-5.56, +3.20]) Pro beta_moderate: [-6.15, +8.45] (was [-6.91, +3.14]) Pro beta_deep: [-5.42, +6.47] (was [-7.46, +2.28]) Validated on B200 (sm_10.0, 148 SM) with the corrected build (PR-A kFTarget=K, PR-B BSMAX env, post-NVIDIA#14297 radix_aux contract): Flash K=512 R/H bf16 (nsys, BS=1): 2.98-4.60x (N=4K to 25K) Pro K=1024 R/H bf16 (nsys, BS=1): 2.46-3.68x (N=4K to 25K) 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <noreply@anthropic.com> Signed-off-by: longcheng-nv <243710427+longcheng-nv@users.noreply.github.com>

Add tests/unittest/_torch/thop/parallel/test_indexer_topk.py to the single-B200 DS pre-merge list so the DeepSeek-V4 CI stage actually exercises the Heuristic / Radix Top-K kernel paths and the radix_aux scratch + CUDA Graph replay equivalence tests introduced by PR NVIDIA#14297. Without this entry the indexer Top-K kernel only got coverage from end-to-end DSv4 accuracy stages, leaving the kernel-level resize and TORCH_CHECK behaviour (which this PR fixes for the update_spec_dec_param resize path) implicitly untested. TIMEOUT (30) chosen with margin over measured ~120 s wall on B200 for the full file at the default parallel_factor. Made-with: claude-code (https://claude.com/claude-code) Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> Signed-off-by: longcheng-nv <243710427+longcheng-nv@users.noreply.github.com>

…_param PR NVIDIA#14297 added persistent radix_aux_{indices,logits} scratch buffers in DSAtrtllmAttentionMetadata.__init__ sized to max_num_sequences * (1 + max_draft_tokens), and added a kernel-side TORCH_CHECK in IndexerTopKOp.cpp that the buffers' numel >= num_rows * blocks_per_row * index_topk. It also patched update_spec_dec_param to resize kv_lens_expanded_host (via create_expanded_buffers) and heuristic_scratch_values when max_draft_tokens changes at runtime, but missed the parallel radix buffers. When the framework reconfigures max_draft_tokens (e.g. spec decoding warmup -> real run, or disagg gen server picking up a different draft length), num_rows starts reflecting the new bound while the radix aux buffers stay at their construction-time size, triggering RuntimeError: radix_aux_{indices,logits} must hold at least num_rows*blocks_per_row*index_topk elements (got 10240 / 10240, need 16384) inside torch.ops.trtllm.indexer_topk_decode on the next forward step. This patch mirrors the existing heuristic_scratch_values resize block for the radix buffers, allocated unconditionally to match the __init__ path (the radix dispatcher can still run when enable_heuristic_topk=True falls back for small numColumns). Made-with: claude-code (https://claude.com/claude-code) Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> Signed-off-by: longcheng-nv <243710427+longcheng-nv@users.noreply.github.com>

Add tests/unittest/_torch/thop/parallel/test_indexer_topk.py to the single-B200 DS pre-merge list so the DeepSeek-V4 CI stage actually exercises the Heuristic / Radix Top-K kernel paths and the radix_aux scratch + CUDA Graph replay equivalence tests introduced by PR NVIDIA#14297. Without this entry the indexer Top-K kernel only got coverage from end-to-end DSv4 accuracy stages, leaving the kernel-level resize and TORCH_CHECK behaviour (which this PR fixes for the update_spec_dec_param resize path) implicitly untested. TIMEOUT (30) chosen with margin over measured ~120 s wall on B200 for the full file at the default parallel_factor. Made-with: claude-code (https://claude.com/claude-code) Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> Signed-off-by: longcheng-nv <243710427+longcheng-nv@users.noreply.github.com>

longcheng-nv requested review from a team as code owners May 19, 2026 08:21

longcheng-nv requested review from EmmaQiaoCh and dpitman-nvda May 19, 2026 08:21

github-actions Bot assigned longcheng-nv May 19, 2026

longcheng-nv changed the base branch from main to feat/deepseek_v4 May 19, 2026 08:26

longcheng-nv requested review from a team as code owners May 19, 2026 08:26

longcheng-nv requested review from Superjomn, brb-nv, dongxuy04 and yizhang-nv and removed request for a team May 19, 2026 08:26

longcheng-nv force-pushed the fix/indexer-radix-aux-cuda-graph branch from e0b4e34 to d04f58a Compare May 19, 2026 08:29

longcheng-nv requested review from lfr-0531 and mingyangHao and removed request for a team, EmmaQiaoCh, Superjomn, brb-nv and dpitman-nvda May 19, 2026 09:00

longcheng-nv requested a review from heyuhhh May 20, 2026 05:53

heyuhhh reviewed May 20, 2026

View reviewed changes

heyuhhh approved these changes May 20, 2026

View reviewed changes

heyuhhh reviewed May 20, 2026

View reviewed changes

longcheng-nv removed the request for review from mingyangHao May 21, 2026 01:01

longcheng-nv added the bug Something isn't working label May 21, 2026

lfr-0531 approved these changes May 21, 2026

View reviewed changes

lfr-0531 merged commit f030122 into NVIDIA:feat/deepseek_v4 May 21, 2026
6 checks passed

longcheng-nv mentioned this pull request May 22, 2026

[None][fix] DSv4 indexer: resize radix aux scratch in update_spec_dec_param #14443

Open

1 task

Conversation

longcheng-nv commented May 19, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Description

Observed failure (Flash MXFP4, ISL ≈ 100K, TP = EP = 4, BS = 32, MTP = 0)

Change

Side-effect surface

Test Coverage

PR Checklist

GitHub Bot Help

Uh oh!

coderabbitai Bot commented May 19, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Review failed

Walkthrough

Changes

Sequence Diagram(s)

Estimated code review effort

Possibly related PRs

Suggested reviewers

Uh oh!

tensorrt-cicd commented May 19, 2026

Uh oh!

tensorrt-cicd commented May 19, 2026

Uh oh!

longcheng-nv commented May 20, 2026

Uh oh!

tensorrt-cicd commented May 20, 2026

Uh oh!

heyuhhh May 20, 2026

Choose a reason for hiding this comment

Uh oh!

heyuhhh left a comment

Choose a reason for hiding this comment

Uh oh!

tensorrt-cicd commented May 20, 2026

Uh oh!

longcheng-nv commented May 20, 2026

Uh oh!

longcheng-nv commented May 20, 2026

Uh oh!

tensorrt-cicd commented May 20, 2026

Uh oh!

heyuhhh May 20, 2026

Choose a reason for hiding this comment

Uh oh!

longcheng-nv commented May 20, 2026

Uh oh!

longcheng-nv commented May 20, 2026

Uh oh!

tensorrt-cicd commented May 20, 2026

Uh oh!

tensorrt-cicd commented May 20, 2026

Uh oh!

tensorrt-cicd commented May 20, 2026

Uh oh!

longcheng-nv commented May 21, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

longcheng-nv commented May 19, 2026 •

edited

Loading

coderabbitai Bot commented May 19, 2026 •

edited

Loading