Skip to content

[None][fix] DSv4 indexer: stable radix aux scratch for CUDA Graph safety#14297

Merged
lfr-0531 merged 3 commits into
NVIDIA:feat/deepseek_v4from
longcheng-nv:fix/indexer-radix-aux-cuda-graph
May 21, 2026
Merged

[None][fix] DSv4 indexer: stable radix aux scratch for CUDA Graph safety#14297
lfr-0531 merged 3 commits into
NVIDIA:feat/deepseek_v4from
longcheng-nv:fix/indexer-radix-aux-cuda-graph

Conversation

@longcheng-nv
Copy link
Copy Markdown
Collaborator

@longcheng-nv longcheng-nv commented May 19, 2026

Description

The fp32 Radix path of indexer_topk_decode was not CUDA-Graph-safe: it allocated its split-work scratch buffers (aux_indices, aux_logits) via th::empty per call when blocks_per_row > 1. Under CUDA Graph capture + replay these per-call pointers become stale when the caching allocator is perturbed by chunked-prefill activations at high concurrency. The captured Radix part1 / part2 kernels then write to recycled memory; the resulting CUDA_ERROR_ILLEGAL_ADDRESS surfaces at the next sync (commonly inside DeepGEMM's smxx_layout.hpp:97 TMA descriptor pack), looking like a downstream FP8 GEMM crash but caused upstream by the indexer.

IndexerTopKOp.cpp:91 already documents the pitfall for the Heuristic path:

"Caller-owned scratch buffer for heuristic TopK output values. Must be pre-allocated with stable address for CUDA Graph compatibility."

and dsa.py:879 follows that contract for heuristic_scratch_values. The Radix path was the missing half of the same pattern — this PR adds it.

Observed failure (Flash MXFP4, ISL ≈ 100K, TP = EP = 4, BS = 32, MTP = 0)

cell CONC Mode GVR result
cell_013_C008_TEP_GVRF 8 TEP OFF (Radix) CUDA 700 at smxx_layout.hpp:97
cell_017_C016_TEP_GVRF 16 TEP OFF (Radix) CUDA 700 at smxx_layout.hpp:97
cell_021_C032_TEP_GVRF 32 TEP OFF (Radix) CUDA 700 at smxx_layout.hpp:97

All three crashes converge on the identical stack: fp8_swap_ab_gemm_fp8_quantize_1x128_ue8m0deep_gemm.get_mn_major_tma_aligned_packed_ue8m0_tensorsmxx_layout.hpp:97. The DeepGEMM line is the next sync after the offending kernel, not the offender itself. Same-point GVR-ON cells (Heuristic path, uses persistent heuristic_scratch) run clean — confirming the asymmetry is in the indexer.

Change

  • cpp/tensorrt_llm/thop/IndexerTopKOp.cpp — extend indexer_topk_decode with two optional Tensor? kwargs (radix_aux_indices, radix_aux_logits). When both are provided and blocks_per_row > 1, use the caller-owned buffers; otherwise fall back to per-call th::empty (back-compat for bench scripts / warmup helpers / callers not under CUDA Graph capture). Validates is_cuda, same device, contiguous, dtype (int32 / float32), numel ≥ num_rows × blocks_per_row × index_topk.
  • tensorrt_llm/_torch/attention_backend/sparse/dsa.py — unconditionally allocate persistent radix_aux_indices / radix_aux_logits in DSAtrtllmAttentionMetadata sized to the worst case (max_gen_tokens × kMaxBlocksPerRowDecode = 10 × num_sparse_topk). Buffers are routed through self.get_empty(self.cuda_graph_buffers, ...) so they have stable addresses across graph replays and survive metadata-level CUDA-graph capture. Passed at the call site.
  • tensorrt_llm/_torch/custom_ops/cpp_custom_ops.py — extend the register_fake Python signature to match the new schema.

Side-effect surface

  • Same Radix kernel invocation, same dispatch logic, same numerical output.
  • DSv3.2 sparse path shares this op and metadata class — benefits transparently, no regression.
  • API back-compat: new kwargs default to None; existing callers continue to use the th::empty fallback unchanged.
  • Memory cost: ≈ 1.3 MB per DSAtrtllmAttentionMetadata instance on Flash defaults (max_num_seqs = 32 × kMaxBlocksPerRowDecode = 10 × index_topk = 512 × 2 buffers × 4 B/elt). Negligible.

Test Coverage

Unit tests added to tests/unittest/_torch/thop/parallel/test_indexer_topk.py:

  • test_indexer_topk_decode_radix_aux_equivalence — Parametrised over (batch_size, num_tokens, compress_ratio) shapes that force blocks_per_row > 1 (including the exact batch_size=8, num_tokens=16384, compress_ratio=4 that mirrors the Flash CONC=8 failure point). Verifies caller-owned-aux output is bit-identical to the legacy th::empty fallback.
  • test_indexer_topk_decode_radix_aux_cuda_graph_replay — Captures the op into a CUDA Graph with caller-owned aux, replays 8× and asserts each replay's output matches a non-graph reference. This is the direct regression test for the original stale-pointer bug.
  • test_indexer_topk_decode_radix_aux_validation — Negative tests for the new validation paths: wrong dtype (indices not int32 / logits not float32), undersized buffer (< num_rows × blocks_per_row × index_topk), non-contiguous tensor.

Existing coverage retained:

  • test_indexer_topk_decode / test_indexer_topk_decode_sm_saturation / test_indexer_topk_decode_launch_policy_transitions continue to exercise the th::empty fallback path (do not pass the new kwargs).
  • bench_indexer_topk_kernel.py continues to work unchanged (does not pass new kwargs).

End-to-end validation:

  • Re-run of pareto_v4_pro_swebench100k_b300 sweep against this binary on 8× B300 is in progress; CONC = 1 paired cells (P1 / P2) already completed cleanly with the new fix in place. High-CONC GVR-OFF cells (the ones that previously triggered G4) will be the live test.

PR Checklist

Please review the following before submitting your PR:

  • PR description clearly explains what and why. If using CodeRabbit's summary, please make sure it makes sense.

  • PR Follows TRT-LLM CODING GUIDELINES to the best of your knowledge.

  • Test cases are provided for new code paths (see test instructions)

  • Any new dependencies have been scanned for license and vulnerabilities

  • CODEOWNERS updated if ownership changes

  • Documentation updated as needed

  • Update tava architecture diagram if there is a significant design change in PR.

  • The reviewers assigned automatically/manually are appropriate for the PR.

  • Please check this after reviewing the above items as appropriate for this PR.

GitHub Bot Help

To see a list of available CI bot commands, please comment /bot help.

@longcheng-nv longcheng-nv requested review from a team as code owners May 19, 2026 08:21
@longcheng-nv longcheng-nv changed the base branch from main to feat/deepseek_v4 May 19, 2026 08:26
@longcheng-nv longcheng-nv requested review from a team as code owners May 19, 2026 08:26
@longcheng-nv longcheng-nv requested review from Superjomn, brb-nv, dongxuy04 and yizhang-nv and removed request for a team May 19, 2026 08:26
@longcheng-nv longcheng-nv force-pushed the fix/indexer-radix-aux-cuda-graph branch from e0b4e34 to d04f58a Compare May 19, 2026 08:29
@coderabbitai
Copy link
Copy Markdown
Contributor

coderabbitai Bot commented May 19, 2026

Caution

Review failed

Failed to post review comments

📝 Walkthrough

Walkthrough

This PR introduces DeepSeek-V4 KV cache compression kernels, MHC hyper-connection fused GPU kernels, NIXL transfer agent virtual memory region mapping, TopK decode compression support, and FP8 block-scale MoE enhancements across the TensorRT-LLM kernel and executor stacks, along with supporting CMake wiring and infrastructure updates.

Changes

KV Cache Compression System

Layer / File(s) Summary
DeepSeek-V4 KV Cache Compression Kernels
cpp/tensorrt_llm/kernels/compressorKernels/CMakeLists.txt, compressorKernels.h, compressorKernels.cu
Three main CUDA kernels implement online-softmax-based KV cache compression: pagedKvCompressKernel for decode-step token compression, prefillReductionKernel for bulk prefill compression, and postProcessScatterKernel for RMSNorm + RoPE + Hadamard butterfly + paged scatter with FP8/MXFP4 quantization support.

MHC Hyper-Connection Fused Kernels

Layer / File(s) Summary
MHC Kernel Infrastructure
cpp/tensorrt_llm/kernels/mhcKernels/CMakeLists.txt, mhcKernels.h, mhcKernels.cu
Core MHC kernel launchers: mhcBigFuseKernel for token-wise split-K reduction + Sinkhorn normalization + RMSNorm fusion, mhcGemmSqrsumFmaKernel for split-N GEMM + sqr-sum fusion, and post-mapping/head-apply kernels for residual and gating operations.
MHC TF32 Fused Kernels
cpp/tensorrt_llm/kernels/mhcKernels/fused_tf32_pmap_gemm.cuh, mhcFusedHcKernel.cu
TF32 tcgen05 MMA + TMA pipelined kernels: two-step mhcFusedHcLaunch (GEMM → bigFuse postlog) and all-in-one mhcFusedHcAllInOneLaunch with done-counter-based last-CTA election for fused postlogue execution.
MHC FMA Fused Kernels
cpp/tensorrt_llm/kernels/mhcKernels/mhc_fused_fma.cuh, mhcFusedHcKernel.cu
Register-path FMA variants: fused_pmap_gemm_fma_ksplit for two-step K-split execution and fused_pmap_gemm_fma_allinone for single-launch pmap + GEMM + bigFuse with atomicAdd reduction when multi-split.

NIXL Transfer Agent VMM Region Mapping

Layer / File(s) Summary
VMM Region Utilities and Agent Metadata
cpp/include/tensorrt_llm/executor/transferAgent.h, cpp/tensorrt_llm/executor/cache_transmission/transferAgent.cpp
Introduces VramRegionInfo, VramRegionMap, and VmmDescSplitter utilities for backend-agnostic chunk-aware descriptor splitting. VmmDescSplitter provides static methods: lookupChunkInfo(), splitDescsWithRegionMap(), splitTransferDescsWithRegionMaps(), splitVmmDescs(), and detectVramRegionMap(). Extends AgentDesc with serialization/deserialization of per-region VMM metadata for remote boundary computation.
NIXL Transfer Status and Agent Lifecycle
cpp/tensorrt_llm/executor/cache_transmission/nixl_utils/transferAgent.h, transferAgent.cpp
Adds RAII cleanup for NixlTransferStatus with destructor, status tracking via mLastStatus atomic, and accessors getLastStatus()/getLastStatusStr(). Extends NixlTransferAgent with shutdown() method and updates registerMemory(), deregisterMemory(), loadRemoteAgent(), getLocalAgentDesc(), and submitTransferRequests() to use VmmDescSplitter region maps instead of registry-based splitting. NixlLoopbackAgent gains a destructor for safe resource cleanup.
Python Bindings
cpp/tensorrt_llm/executor/cache_transmission/nixl_utils/agentBindings.cpp
Adds nanobind methods serialize() and deserialize() for AgentDesc, and extends BaseTransferAgent and NixlTransferAgent bindings with nb::call_guard<nb::gil_scoped_release>() to release Python GIL during long-running transfer operations.

TopK Decode Compression Support

Layer / File(s) Summary
Heuristic TopK Decode Compression
cpp/tensorrt_llm/kernels/heuristicTopKDecode.h, heuristicTopKDecode.cu
Updates heuristic TopK kernels to accept compressRatio parameter, compute actual_kv_len from token window, and derive search domain N as actual_kv_len / compressRatio. Adjusts preIdxOffset temporal-shift logic to depend on compressRatio mode (nonzero offset for ratio==1, zero otherwise).
Indexer TopK Multi-Block Scheduling
cpp/tensorrt_llm/kernels/IndexerTopK.h, indexerTopK.cu
Adds computeIndexerTopKDecodeBlocksPerRow() helper for SM-aware multi-block split/merge sizing. Introduces file-scope tuning constants (kNumBins, split-work thresholds). Updates topKPerRowDecode kernel to compute rowEnd as actual_kv_len / compressRatio. Tightens heuristic eligibility gating to allowed compressRatio modes (1 or 4). Refactors dispatch logic around computed blocksPerRow and fixes histogram refinement with explicit counter resets and garbage index initialization.

FP8 Block-Scale MoE and Routing Enhancements

Layer / File(s) Summary
SwiGLU Clamping for FP8 Activation
cpp/tensorrt_llm/kernels/trtllmGenKernels/blockScaleMoe/DevKernel.h, DevKernel.cu, runner.cu
Extends activation::Data and activation::KernelParams with optional swigluLimit and hasSwigluLimit fields. Both standard and DeepSeek FP8 kernels conditionally clamp gate/up inputs before activation computation. Runner forwards scalar clamp value for DeepSeek FP8 path.
DeepSeekV4 Routing Mode
cpp/tensorrt_llm/kernels/trtllmGenKernels/blockScaleMoe/runner.h, runner.cu
Adds DeepSeekV4 = 7 to RoutingMethodType enum (renumbering Unspecified to 8), adds serialization support, and extends MoERunnerArgs with gemm1_clamp_limit_value and has_gemm1_clamp_limit_value for scalar clamp-limit FP8 path.
Top-K Sort Specializations
cpp/tensorrt_llm/kernels/trtllmGenKernels/blockScaleMoe/routing/RoutingKernelTopK.cuh
Adds explicit Sort<N, RedType> template specializations for N=5,6,7,8 using hardcoded TOPK_SWAP sequences to enable up-to-8-way expert selection.
FP8 Block-Scale GEMM Quantization
cpp/tensorrt_llm/kernels/cutlass_kernels/fp8_blockscale_gemm/fp8_blockscale_quant_packed.h, fp8_blockscale_quant_packed.cu
Fused FP8 1x128 quantization kernel: loads BF16, computes per-block amax, derives UE8M0 scale, quantizes to FP8 e4m3, and packs four scale bytes into uint32 MN-major layout for block-scale GEMM dispatch.
Custom MoE Gate Forward Kernel
cpp/tensorrt_llm/kernels/customMoeRoutingKernels.h, customMoeRoutingKernels.cu
New gate_forward kernel computes per-expert score transform (softplus), performs top-K selection in hash or topK mode, normalizes weights, and dispatches for n_experts 256 and 384.

Infrastructure and Configuration Updates

Layer / File(s) Summary
Build System Wiring
cpp/tensorrt_llm/CMakeLists.txt, cpp/tensorrt_llm/kernels/CMakeLists.txt
Adds mhcKernels_src and compressorKernels_src subdirectories and object libraries to CMake build tree and links them into the main shared target.
Attention Workspace Sizing
cpp/tensorrt_llm/common/attentionOp.h, attentionOp.cpp
Extends getWorkspaceSizeForContext() with total_kv_len parameter for KV cache reuse scenarios where effective total KV length exceeds max_num_tokens. Updates FP8 Context MLA K/V buffer sizing accordingly.
KV Cache Manager Utils
cpp/tensorrt_llm/batch_manager/kvCacheManagerV2Utils.h
Adds IndexMapper::size() and IndexMapper::numFreeSlots() accessor methods for active sequence and free-slot counts.
NIXL CMake Update
cpp/tensorrt_llm/executor/cache_transmission/nixl_utils/CMakeLists.txt
Narrows NIXL wrapper CUDA dependencies from cuda_driver + cudart to cudart only.
Configuration and Dependencies
.pre-commit-config.yaml, 3rdparty/fetch_content.json, cpp/tensorrt_llm/kernels/dsv3MinLatencyKernels/dsv3RouterGemm.cu
Excludes broader trtllm-gen generated artifacts in pre-commit config, updates deepgemm dependency pin, and adds DSV3 router GEMM template instantiations for kHiddenDim=4096.

Sequence Diagram(s)

sequenceDiagram
  participant App as Application
  participant Compressor as Compressor Kernel
  participant KVCache as Paged KV Cache
  participant Transfer as NIXL Transfer
  
  Note over App,Transfer: KV Cache Compression + Transfer Workflow
  
  App->>Compressor: Decode: pagedKvCompressLaunch(kv_score, ...)
  activate Compressor
  Compressor->>KVCache: Load prior compressor state
  Compressor->>Compressor: Online softmax (running max + sum)
  Compressor->>KVCache: Write updated state + compressed tokens
  deactivate Compressor
  
  App->>Compressor: Prefill: prefillReductionLaunch(kv_score, ...)
  activate Compressor
  Compressor->>KVCache: Load full sequence
  Compressor->>Compressor: Reduce COMPRESS_RATIO rows via softmax
  Compressor->>KVCache: Write compressed output
  deactivate Compressor
  
  App->>Compressor: PostProcess: postProcessScatterLaunch(kv_comp, ...)
  activate Compressor
  Compressor->>Compressor: RMSNorm + RoPE + Hadamard
  Compressor->>KVCache: Scatter with FP8/MXFP4 quantization
  deactivate Compressor
  
  App->>Transfer: submitTransferRequests(src, dst)
  activate Transfer
  Transfer->>Transfer: detectVramRegionMap(src)
  Transfer->>Transfer: splitTransferDescsWithRegionMaps
  Transfer->>Transfer: Create per-chunk transfer ops
  deactivate Transfer
Loading

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Possibly related PRs

  • NVIDIA/TensorRT-LLM#13740: Updates DSV3 router GEMM template instantiation variants for bf16 with different hidden dimensions, directly related to the dsv3RouterGemm.cu changes in this PR.

Suggested reviewers

  • 2ez4bz
  • byshiue
  • Wanli-Jiang
  • omera-nv
  • moraxu
✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests
⚔️ Resolve merge conflicts
  • Resolve merge conflict in branch fix/indexer-radix-aux-cuda-graph

@longcheng-nv longcheng-nv requested review from lfr-0531 and mingyangHao and removed request for a team, EmmaQiaoCh, Superjomn, brb-nv and dpitman-nvda May 19, 2026 09:00
@tensorrt-cicd
Copy link
Copy Markdown
Collaborator

PR_Github #49177 [ run ] completed with state ABORTED. Commit: 5f75930

Link to invocation

@tensorrt-cicd
Copy link
Copy Markdown
Collaborator

PR_Github #49209 [ run ] completed with state SUCCESS. Commit: 5f75930
/LLM/main/L0_MergeRequest_PR pipeline #38883 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

CI Agent Failure Analysis

Link to invocation

@longcheng-nv
Copy link
Copy Markdown
Collaborator Author

/bot run --disable-fail-fast

@tensorrt-cicd
Copy link
Copy Markdown
Collaborator

PR_Github #49291 [ run ] triggered by Bot. Commit: 5f75930 Link to invocation

@longcheng-nv longcheng-nv requested a review from heyuhhh May 20, 2026 05:53
Comment thread cpp/tensorrt_llm/thop/IndexerTopKOp.cpp Outdated
// (back-compat for bench scripts / older callers — exposes the
// CUDA-Graph stale-pointer hazard at high CONC, see SKILL G4).
int64_t const needed_elts = static_cast<int64_t>(num_rows) * blocks_per_row * index_topk;
if (radix_aux_indices.has_value() && radix_aux_logits.has_value())
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should always be true because we pre-allocated the space and pass them in the function. For simplify i think we can add some assert sentence here to make sure and remove the remaining code, especially the if-else

Copy link
Copy Markdown
Collaborator

@heyuhhh heyuhhh left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. Leave little comments for refine the code. Thanks!

@tensorrt-cicd
Copy link
Copy Markdown
Collaborator

PR_Github #49291 [ run ] completed with state SUCCESS. Commit: 5f75930
/LLM/main/L0_MergeRequest_PR pipeline #38953 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

CI Agent Failure Analysis

Link to invocation

…precondition

Two small follow-ups on top of the radix-aux PR head:

1. model_config.py — drop the duplicate `indexer_k_dtype` kwarg in the
   V4 `DeepSeekV4SparseAttentionConfig(...)` construction. The
   `update_sparse_attention_indexer_config` dict-builder already writes
   `indexer_config['indexer_k_dtype']`, and `**indexer_config` unpacks it
   at the call site. Passing it ALSO as an explicit kwarg raised:

       TypeError: DeepSeekV4SparseAttentionConfig() got multiple values
       for keyword argument 'indexer_k_dtype'

   This was failing all 10 of the L0_MergeRequest_PR/38953 test cases
   that hit `ModelConfig.from_pretrained` on a V4 checkpoint
   (`test_deepseek_v4_sparse_ratios_*`, `TestDeepSeekV4Flash.test_auto_dtype`,
   B200 / B300 / GB200 agg + disagg). None of the 10 was related to the
   radix-aux change itself; all fail before reaching any indexer code.

2. IndexerTopKOp.cpp — apply heyuhhh's review suggestion. Now that the
   in-tree caller (`dsa.py:DSAtrtllmAttentionMetadata.allocate`) always
   pre-allocates the radix aux buffers, the per-call `th::empty` fallback
   is dead code in production. Replace the `if/else` branch with a
   single `TORCH_CHECK(radix_aux_indices.has_value() && ...)` precondition
   plus the existing tensor-property checks. Net: -22/+12 lines, same
   behavior on the success path, cleaner failure mode (clear error vs
   silent CUDA-Graph stale-pointer corruption) on any future caller that
   forgets to pass the buffers.

Validation:
- Existing radix-aux unit tests still cover the path:
  `test_indexer_topk_decode_radix_aux_equivalence` (7 cases) and
  `test_indexer_topk_decode_radix_aux_cuda_graph_replay` (1 case) both
  pass the buffers explicitly, so they exercise the success branch.
- Existing parametrized `test_indexer_topk_decode` (3 test classes,
  `num_tokens` ∈ {4096, 8192, 16384, 32768}) all keep
  `num_columns < splitWorkThreshold = 200_000`, so `blocks_per_row == 1`
  and the new TORCH_CHECK does not fire.
- Warmup helper at dsa.py:102 calls with `num_cols` for a single row;
  also `blocks_per_row == 1`.

Made-with: Claude Code (Opus 4.7, 1M context)
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: longcheng-nv <243710427+longcheng-nv@users.noreply.github.com>
@longcheng-nv
Copy link
Copy Markdown
Collaborator Author

Pushed commit 383d8c5 (top of HEAD now) addressing two items:

1. @heyuhhh's inline review on IndexerTopKOp.cpp:L132 — done. Replaced the if (radix_aux_indices.has_value() && ...) runtime branch with a TORCH_CHECK(...) precondition and dropped the dead else fallback (per-call th::empty). The success path is byte-identical; future callers that forget to supply the buffers now get a clear error instead of silent CUDA-Graph stale-pointer corruption.

2. Unrelated CI unblock in model_config.py — the previous run (PR_Github #49291, L0_MergeRequest_PR/38953) failed all 10 V4 test cases with:

TypeError: tensorrt_llm.llmapi.llm_args.DeepSeekV4SparseAttentionConfig()
got multiple values for keyword argument 'indexer_k_dtype'

Root cause was a duplicate-kwarg on the V4 construction site at model_config.py:838indexer_k_dtype was being passed both as an explicit kwarg and via **indexer_config (which itself sets indexer_config['indexer_k_dtype'] in update_sparse_attention_indexer_config). None of the 10 failures involved the radix-aux code itself; they all abort in ModelConfig.from_pretrained before reaching any indexer path. Folded the 2-line fix into this PR rather than opening a separate one for speed. Full reasoning in the commit body.

@lfr-0531 @mingyangHao Could you take a look when convenient? Local validation summary already in the PR body's "Local validation results" section (7/7 radix-aux unit tests pass + 3 high-CONC TEP+Radix CUDA-illegal-address crashes on B200 Pareto sweep are fully eliminated).

@longcheng-nv
Copy link
Copy Markdown
Collaborator Author

/bot run --disable-fail-fast

@tensorrt-cicd
Copy link
Copy Markdown
Collaborator

PR_Github #49397 [ run ] triggered by Bot. Commit: 383d8c5 Link to invocation

Comment thread cpp/tensorrt_llm/thop/IndexerTopKOp.cpp Outdated
@@ -122,10 +123,28 @@ void indexer_topk_decode(th::Tensor const& logits, th::Tensor const& seq_lens, t
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we really need this two tensors? As we have already passed two aux tensors. I think we can just used them to avoid incomplete graph capture

Address @heyuhhh's second review comment on PR NVIDIA#14297
(cpp/tensorrt_llm/thop/IndexerTopKOp.cpp:122). The two
`th::empty({0}, ...)` placeholders for the fp32 radix
aux scratch became vestigial after commit 383d8c5
made the caller-owned buffers required for blocks_per_row > 1
— they were only reachable in the blocks_per_row == 1 case,
where the kernel never dereferences them.

Replace both Tensor placeholders with raw nullable `int32_t*` /
`float*` pointers (matching the existing `heuristicScratchPtr`
convention at lines 98-111). Hoist all caller-buffer validation
out of the `blocks_per_row > 1` branch (dtype/device/contiguity
always; numel only when the kernel actually needs them). For
`blocks_per_row == 1` without caller-owned scratch, pass nullptr
through to the kernel; for `> 1` without scratch, fail early
with the same TORCH_CHECK message as before.

Side-effect: also fix two same-PR-introduced regressions
that the strict TORCH_CHECK in 383d8c5 would now trip:

  1. `warmup_heuristic_topk_decode` (dsa.py:68) uses
     `num_cols=4096`, which falls below `kSeqSmall=12288`,
     so canUseHeuristic is false and the call routes to
     the Radix path with `blocks_per_row=2` (`num_rows=1`
     sweeps bp ∈ [2, maxByCols=2]). Allocate worst-case
     (kMaxBlocksPerRowDecode=10) caller-owned scratch
     in the warmup — cost is ~80 KB, the warmup runs once.

  2. `_run_indexer_topk_decode_check` test helper
     (test_indexer_topk.py:229) covers params that
     produce `blocks_per_row > 1` (e.g. `num_tokens=32768,
     compress_ratio=1` → numCols=32768 → bp=9–10).
     Same fix: pre-allocate `_build_radix_aux_buffers`
     and pass via kwargs. Same change applied to
     `test_indexer_topk_decode_dist` (fp32 with
     num_tokens=8192 falls below kSeqSmall, hits Radix
     with bp>1) and `_run_indexer_topk_decode_v4_gvr_check`
     (defense-in-depth — GVR almost always succeeds for
     its param matrix but the kwarg is now uniform).

The radix-aux equivalence test no longer makes sense in its
original form (it compared caller-owned vs the removed
`th::empty` fallback); rewrite it to verify
allocation-pointer-independence — two invocations with
freshly-allocated aux at distinct backing addresses must
yield the same selected top-K logit values.

Validation:
  - `pre-commit run` clean on the three changed files
    (clang-format, ruff-format, isort, codespell, yapf).
  - radix_aux unit tests will run in CI; pre-commit
    pre-flight is green.

Made with Claude Code (Opus 4.7, 1M context)
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Signed-off-by: longcheng-nv <243710427+longcheng-nv@users.noreply.github.com>
@longcheng-nv
Copy link
Copy Markdown
Collaborator Author

Pushed commit ffbc5addce addressing @heyuhhh's second inline review comment on IndexerTopKOp.cpp:122.

The reviewer's point — the two th::empty({0}, ...) placeholder Tensors at lines 122-123 became vestigial after the prior commit 383d8c5 tightened the blocks_per_row > 1 contract. They were only reachable in the blocks_per_row == 1 case, where the kernel never dereferences them. Reviewer suggested using the caller-passed buffers directly to avoid the leftover per-call allocation.

The fix — replace both Tensor placeholders with raw nullable int32_t* / float* pointers, matching the existing heuristicScratchPtr convention at lines 98-111. Hoist all caller-buffer validation (dtype / device / contiguity) out of the bp > 1 branch; keep the numel check inside it since that bound only matters when the kernel actually uses the buffers. For bp == 1 without caller-owned scratch, pass nullptr through; for bp > 1 without scratch, fail early with the same TORCH_CHECK message as before.

Same-PR-introduced regressions also fixed — chased down two paths that the strict TORCH_CHECK in 383d8c5 would now trip but weren't caught locally (the previous run was filtered with -k radix_aux, and CI broke earlier at model_config.py so the indexer tests never executed):

  1. warmup_heuristic_topk_decode (dsa.py:68) — default num_cols=4096 falls below kSeqSmall=12288 so canUseHeuristic is false and the call routes to Radix with blocks_per_row=2 (num_rows=1 sweeps bp ∈ [2, maxByCols=2]). Allocate worst-case kMaxBlocksPerRowDecode=10 caller-owned scratch in the warmup. Cost ~80 KB, one-shot.
  2. _run_indexer_topk_decode_check (test_indexer_topk.py:229) — covers params like num_tokens=32768, compress_ratio=1 → numCols=32768 → bp=9–10. Same fix: pre-allocate via existing _build_radix_aux_buffers and pass via kwargs. Same change applied to test_indexer_topk_decode_dist (fp32 + num_tokens=8192 hits Radix bp>1) and _run_indexer_topk_decode_v4_gvr_check (defense-in-depth — GVR almost always succeeds for that param matrix, but the kwarg is now uniform).

The radix-aux equivalence test no longer made sense in its original form (compared caller-owned vs the removed th::empty fallback). Rewrote it to verify allocation-pointer-independence — two invocations with freshly-allocated aux at distinct backing addresses must yield the same selected top-K logit values.

pre-commit run is clean on all three changed files (clang-format, ruff-format, isort, codespell, yapf).

cc @heyuhhh @lfr-0531 @mingyangHao

@longcheng-nv
Copy link
Copy Markdown
Collaborator Author

/bot run --disable-fail-fast

@tensorrt-cicd
Copy link
Copy Markdown
Collaborator

PR_Github #49446 [ run ] triggered by Bot. Commit: ffbc5ad Link to invocation

@tensorrt-cicd
Copy link
Copy Markdown
Collaborator

PR_Github #49397 [ run ] completed with state ABORTED. Commit: 383d8c5

Link to invocation

@tensorrt-cicd
Copy link
Copy Markdown
Collaborator

PR_Github #49446 [ run ] completed with state SUCCESS. Commit: ffbc5ad
/LLM/main/L0_MergeRequest_PR pipeline #39091 completed with status: 'SUCCESS'

CI Report

Link to invocation

@longcheng-nv longcheng-nv removed the request for review from mingyangHao May 21, 2026 01:01
@longcheng-nv
Copy link
Copy Markdown
Collaborator Author

@lfr-0531 Friendly ping — this PR is ready and would appreciate a merge when you have a moment.

Current status (head ffbc5ad):

  • CI green — latest PR_Github #49446 on ffbc5adSUCCESS (L0_MergeRequest_PR pipeline #39091 SUCCESS)
  • Approved by @heyuhhh on 2026-05-20: "LGTM. Leave little comments for refine the code. Thanks!"
  • Both inline review comments resolved:
    • IndexerTopKOp.cpp:L132 — replaced the if (radix_aux_indices.has_value() && ...) runtime branch with a TORCH_CHECK(...) precondition and dropped the dead else fallback (commit 383d8c5).
    • IndexerTopKOp.cpp:L122 — removed the two vestigial th::empty({0}, ...) placeholder tensors; the caller-passed aux tensors are reused directly (commit ffbc5ad).
  • Mergeable: clean, base = feat/deepseek_v4.

Scope recap: makes the fp32 Radix path of indexer_topk_decode CUDA-Graph-safe by moving aux_indices / aux_logits scratch to caller-owned stable allocations — mirroring the same contract heuristic_scratch_values already follows on the Heuristic path. Fixes the CUDA 700 at smxx_layout.hpp:97 seen at ISL ≈ 100K / BS ≥ 8 with GVR-OFF (Radix) on Flash MXFP4.

Thanks!

@longcheng-nv longcheng-nv added the bug Something isn't working label May 21, 2026
@lfr-0531 lfr-0531 merged commit f030122 into NVIDIA:feat/deepseek_v4 May 21, 2026
6 checks passed
longcheng-nv added a commit to longcheng-nv/TensorRT-LLM that referenced this pull request May 21, 2026
…ynth

Two new skills mirror the V3.2 swebench-temporal-synth methodology
but parameterised for V4 production:

  swebench-temporal-synth-v4flash
    - K=512, compress_ratio=4
    - 3 beta cfgs fitted from real V4 Flash captures (21 GVR-active
      layers, even 2..42), three-bucketed by mean
    - per-cfg target_hr ~ 0.36 / 0.46 / 0.44

  swebench-temporal-synth-v4pro
    - K=1024, compress_ratio=4
    - 3 beta cfgs fitted from real V4 Pro captures (30 GVR-active
      layers, even 2..60)
    - per-cfg target_hr ~ 0.69 / 0.75 / 0.77 (Pro decode steps show
      much stronger prev/current preIdx overlap than Flash)

Both skills enforce the V4 kernel contract:
  - preIdx caller-side offset = 0 (cr=4 path; kernel uses preIdx[i]
    directly per heuristicTopKDecode.cu preIdxOffset rule)
  - radix_aux_{indices,logits} pre-allocated for split-work path
    (post-NVIDIA#14297 contract)
  - numColumns % 8 == 0 (V4 kernel alignment)
  - logits dtype switchable to fp32 / bf16 / fp16

Sources:
  - Real V4 Flash + Pro captures under
    auto_optimization_v1/ablation_study/gvr_phase_timing/09_precision_ablation/{11,12}_dsv4_*_indexer_data_capture/
  - One-shot beta + hit_rate fit via analyze_v4_dist.py
  - V3.2 sibling skill .claude/skills/swebench-temporal-synth/

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
Signed-off-by: longcheng-nv <243710427+longcheng-nv@users.noreply.github.com>
longcheng-nv added a commit to longcheng-nv/TensorRT-LLM that referenced this pull request May 21, 2026
…mmetric clip

Two refinements to the V4 sibling skills landed last commit:

1. BENCH=1 path now uses nsys instead of cuda.Event
   - run_all_n.sh BENCH=1 synthesises all 12 cells (no inline --bench),
     then runs ONE nsys session over the whole output dir via
     bench_nsys.py, exports the NVTX→GPU projection CSV via
     `nsys stats`, and parses to a per-(cfg, N, BS, dtype) R/H
     summary (`summary_table.txt` + `nsys_speedup_summary.json`).
   - nsys timing measures pure GPU kernel duration and avoids the
     5-10 µs launch-tail bias of cuda.Event that systematically
     under-reported R/H by 0.6-1.8×.

2. Distribution clip is now asymmetric (clip_low, clip_high)
   - V4 indexer logits show stronger positive-tail extension than
     negative (e.g. Flash L2 32K: mean=-1.21, range=[-3.28, +3.54]
     so the range-center is +0.13, well above mean). The V3.2
     sibling's `low = mean - fr/2, high = mean + fr/2` symmetric
     envelope truncates the positive tail and over-extends negative.
   - BETA_CFGS replaces `full_range` with explicit `clip_low /
     clip_high` taken from the per-(layer, ISL) observed (min, max)
     widest envelope. sample_beta_row uses these directly. Same
     beta-fit math; only the support changes.

   Updated cfgs:
     Flash beta_shallow:  [-4.36, +7.83]  (was [-4.97, +2.34])
     Flash beta_moderate: [-5.08, +7.17]  (was [-6.06, +1.89])
     Flash beta_deep:     [-6.28, +7.55]  (was [-8.32, +3.13])
     Pro   beta_shallow:  [-4.54, +7.33]  (was [-5.56, +3.20])
     Pro   beta_moderate: [-6.15, +8.45]  (was [-6.91, +3.14])
     Pro   beta_deep:     [-5.42, +6.47]  (was [-7.46, +2.28])

Validated on B200 (sm_10.0, 148 SM) with the corrected build
(PR-A kFTarget=K, PR-B BSMAX env, post-NVIDIA#14297 radix_aux contract):

  Flash K=512 R/H bf16 (nsys, BS=1): 2.98-4.60x (N=4K to 25K)
  Pro   K=1024 R/H bf16 (nsys, BS=1): 2.46-3.68x (N=4K to 25K)

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
Signed-off-by: longcheng-nv <243710427+longcheng-nv@users.noreply.github.com>
longcheng-nv added a commit to longcheng-nv/TensorRT-LLM that referenced this pull request May 22, 2026
Add tests/unittest/_torch/thop/parallel/test_indexer_topk.py to the
single-B200 DS pre-merge list so the DeepSeek-V4 CI stage actually
exercises the Heuristic / Radix Top-K kernel paths and the radix_aux
scratch + CUDA Graph replay equivalence tests introduced by PR NVIDIA#14297.

Without this entry the indexer Top-K kernel only got coverage from
end-to-end DSv4 accuracy stages, leaving the kernel-level resize and
TORCH_CHECK behaviour (which this PR fixes for the
update_spec_dec_param resize path) implicitly untested.

TIMEOUT (30) chosen with margin over measured ~120 s wall on B200 for
the full file at the default parallel_factor.

Made-with: claude-code (https://claude.com/claude-code)
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: longcheng-nv <243710427+longcheng-nv@users.noreply.github.com>
longcheng-nv added a commit to longcheng-nv/TensorRT-LLM that referenced this pull request May 22, 2026
…_param

PR NVIDIA#14297 added persistent radix_aux_{indices,logits} scratch buffers in
DSAtrtllmAttentionMetadata.__init__ sized to
max_num_sequences * (1 + max_draft_tokens), and added a kernel-side
TORCH_CHECK in IndexerTopKOp.cpp that the buffers' numel >=
num_rows * blocks_per_row * index_topk.

It also patched update_spec_dec_param to resize kv_lens_expanded_host
(via create_expanded_buffers) and heuristic_scratch_values when
max_draft_tokens changes at runtime, but missed the parallel radix
buffers. When the framework reconfigures max_draft_tokens (e.g. spec
decoding warmup -> real run, or disagg gen server picking up a different
draft length), num_rows starts reflecting the new bound while the radix
aux buffers stay at their construction-time size, triggering

  RuntimeError: radix_aux_{indices,logits} must hold at least
  num_rows*blocks_per_row*index_topk elements (got 10240 / 10240,
  need 16384)

inside torch.ops.trtllm.indexer_topk_decode on the next forward step.

This patch mirrors the existing heuristic_scratch_values resize block
for the radix buffers, allocated unconditionally to match the __init__
path (the radix dispatcher can still run when enable_heuristic_topk=True
falls back for small numColumns).

Made-with: claude-code (https://claude.com/claude-code)
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: longcheng-nv <243710427+longcheng-nv@users.noreply.github.com>
longcheng-nv added a commit to longcheng-nv/TensorRT-LLM that referenced this pull request May 22, 2026
Add tests/unittest/_torch/thop/parallel/test_indexer_topk.py to the
single-B200 DS pre-merge list so the DeepSeek-V4 CI stage actually
exercises the Heuristic / Radix Top-K kernel paths and the radix_aux
scratch + CUDA Graph replay equivalence tests introduced by PR NVIDIA#14297.

Without this entry the indexer Top-K kernel only got coverage from
end-to-end DSv4 accuracy stages, leaving the kernel-level resize and
TORCH_CHECK behaviour (which this PR fixes for the
update_spec_dec_param resize path) implicitly untested.

TIMEOUT (30) chosen with margin over measured ~120 s wall on B200 for
the full file at the default parallel_factor.

Made-with: claude-code (https://claude.com/claude-code)
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: longcheng-nv <243710427+longcheng-nv@users.noreply.github.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

bug Something isn't working deepseek-v4

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants