[None][fix] DSv4 indexer: stable radix aux scratch for CUDA Graph safety#14297
Conversation
e0b4e34 to
d04f58a
Compare
|
Caution Review failedFailed to post review comments 📝 WalkthroughWalkthroughThis PR introduces DeepSeek-V4 KV cache compression kernels, MHC hyper-connection fused GPU kernels, NIXL transfer agent virtual memory region mapping, TopK decode compression support, and FP8 block-scale MoE enhancements across the TensorRT-LLM kernel and executor stacks, along with supporting CMake wiring and infrastructure updates. ChangesKV Cache Compression System
MHC Hyper-Connection Fused Kernels
NIXL Transfer Agent VMM Region Mapping
TopK Decode Compression Support
FP8 Block-Scale MoE and Routing Enhancements
Infrastructure and Configuration Updates
Sequence Diagram(s)sequenceDiagram
participant App as Application
participant Compressor as Compressor Kernel
participant KVCache as Paged KV Cache
participant Transfer as NIXL Transfer
Note over App,Transfer: KV Cache Compression + Transfer Workflow
App->>Compressor: Decode: pagedKvCompressLaunch(kv_score, ...)
activate Compressor
Compressor->>KVCache: Load prior compressor state
Compressor->>Compressor: Online softmax (running max + sum)
Compressor->>KVCache: Write updated state + compressed tokens
deactivate Compressor
App->>Compressor: Prefill: prefillReductionLaunch(kv_score, ...)
activate Compressor
Compressor->>KVCache: Load full sequence
Compressor->>Compressor: Reduce COMPRESS_RATIO rows via softmax
Compressor->>KVCache: Write compressed output
deactivate Compressor
App->>Compressor: PostProcess: postProcessScatterLaunch(kv_comp, ...)
activate Compressor
Compressor->>Compressor: RMSNorm + RoPE + Hadamard
Compressor->>KVCache: Scatter with FP8/MXFP4 quantization
deactivate Compressor
App->>Transfer: submitTransferRequests(src, dst)
activate Transfer
Transfer->>Transfer: detectVramRegionMap(src)
Transfer->>Transfer: splitTransferDescsWithRegionMaps
Transfer->>Transfer: Create per-chunk transfer ops
deactivate Transfer
Estimated code review effort🎯 4 (Complex) | ⏱️ ~60 minutes Possibly related PRs
Suggested reviewers
✨ Finishing Touches🧪 Generate unit tests (beta)
⚔️ Resolve merge conflicts
|
|
PR_Github #49177 [ run ] completed with state |
|
PR_Github #49209 [ run ] completed with state
|
|
/bot run --disable-fail-fast |
|
PR_Github #49291 [ run ] triggered by Bot. Commit: |
| // (back-compat for bench scripts / older callers — exposes the | ||
| // CUDA-Graph stale-pointer hazard at high CONC, see SKILL G4). | ||
| int64_t const needed_elts = static_cast<int64_t>(num_rows) * blocks_per_row * index_topk; | ||
| if (radix_aux_indices.has_value() && radix_aux_logits.has_value()) |
There was a problem hiding this comment.
This should always be true because we pre-allocated the space and pass them in the function. For simplify i think we can add some assert sentence here to make sure and remove the remaining code, especially the if-else
heyuhhh
left a comment
There was a problem hiding this comment.
LGTM. Leave little comments for refine the code. Thanks!
|
PR_Github #49291 [ run ] completed with state
|
…precondition
Two small follow-ups on top of the radix-aux PR head:
1. model_config.py — drop the duplicate `indexer_k_dtype` kwarg in the
V4 `DeepSeekV4SparseAttentionConfig(...)` construction. The
`update_sparse_attention_indexer_config` dict-builder already writes
`indexer_config['indexer_k_dtype']`, and `**indexer_config` unpacks it
at the call site. Passing it ALSO as an explicit kwarg raised:
TypeError: DeepSeekV4SparseAttentionConfig() got multiple values
for keyword argument 'indexer_k_dtype'
This was failing all 10 of the L0_MergeRequest_PR/38953 test cases
that hit `ModelConfig.from_pretrained` on a V4 checkpoint
(`test_deepseek_v4_sparse_ratios_*`, `TestDeepSeekV4Flash.test_auto_dtype`,
B200 / B300 / GB200 agg + disagg). None of the 10 was related to the
radix-aux change itself; all fail before reaching any indexer code.
2. IndexerTopKOp.cpp — apply heyuhhh's review suggestion. Now that the
in-tree caller (`dsa.py:DSAtrtllmAttentionMetadata.allocate`) always
pre-allocates the radix aux buffers, the per-call `th::empty` fallback
is dead code in production. Replace the `if/else` branch with a
single `TORCH_CHECK(radix_aux_indices.has_value() && ...)` precondition
plus the existing tensor-property checks. Net: -22/+12 lines, same
behavior on the success path, cleaner failure mode (clear error vs
silent CUDA-Graph stale-pointer corruption) on any future caller that
forgets to pass the buffers.
Validation:
- Existing radix-aux unit tests still cover the path:
`test_indexer_topk_decode_radix_aux_equivalence` (7 cases) and
`test_indexer_topk_decode_radix_aux_cuda_graph_replay` (1 case) both
pass the buffers explicitly, so they exercise the success branch.
- Existing parametrized `test_indexer_topk_decode` (3 test classes,
`num_tokens` ∈ {4096, 8192, 16384, 32768}) all keep
`num_columns < splitWorkThreshold = 200_000`, so `blocks_per_row == 1`
and the new TORCH_CHECK does not fire.
- Warmup helper at dsa.py:102 calls with `num_cols` for a single row;
also `blocks_per_row == 1`.
Made-with: Claude Code (Opus 4.7, 1M context)
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: longcheng-nv <243710427+longcheng-nv@users.noreply.github.com>
|
Pushed commit 383d8c5 (top of HEAD now) addressing two items: 1. @heyuhhh's inline review on 2. Unrelated CI unblock in Root cause was a duplicate-kwarg on the V4 construction site at @lfr-0531 @mingyangHao Could you take a look when convenient? Local validation summary already in the PR body's "Local validation results" section (7/7 radix-aux unit tests pass + 3 high-CONC TEP+Radix CUDA-illegal-address crashes on B200 Pareto sweep are fully eliminated). |
|
/bot run --disable-fail-fast |
|
PR_Github #49397 [ run ] triggered by Bot. Commit: |
| @@ -122,10 +123,28 @@ void indexer_topk_decode(th::Tensor const& logits, th::Tensor const& seq_lens, t | |||
There was a problem hiding this comment.
Do we really need this two tensors? As we have already passed two aux tensors. I think we can just used them to avoid incomplete graph capture
Address @heyuhhh's second review comment on PR NVIDIA#14297 (cpp/tensorrt_llm/thop/IndexerTopKOp.cpp:122). The two `th::empty({0}, ...)` placeholders for the fp32 radix aux scratch became vestigial after commit 383d8c5 made the caller-owned buffers required for blocks_per_row > 1 — they were only reachable in the blocks_per_row == 1 case, where the kernel never dereferences them. Replace both Tensor placeholders with raw nullable `int32_t*` / `float*` pointers (matching the existing `heuristicScratchPtr` convention at lines 98-111). Hoist all caller-buffer validation out of the `blocks_per_row > 1` branch (dtype/device/contiguity always; numel only when the kernel actually needs them). For `blocks_per_row == 1` without caller-owned scratch, pass nullptr through to the kernel; for `> 1` without scratch, fail early with the same TORCH_CHECK message as before. Side-effect: also fix two same-PR-introduced regressions that the strict TORCH_CHECK in 383d8c5 would now trip: 1. `warmup_heuristic_topk_decode` (dsa.py:68) uses `num_cols=4096`, which falls below `kSeqSmall=12288`, so canUseHeuristic is false and the call routes to the Radix path with `blocks_per_row=2` (`num_rows=1` sweeps bp ∈ [2, maxByCols=2]). Allocate worst-case (kMaxBlocksPerRowDecode=10) caller-owned scratch in the warmup — cost is ~80 KB, the warmup runs once. 2. `_run_indexer_topk_decode_check` test helper (test_indexer_topk.py:229) covers params that produce `blocks_per_row > 1` (e.g. `num_tokens=32768, compress_ratio=1` → numCols=32768 → bp=9–10). Same fix: pre-allocate `_build_radix_aux_buffers` and pass via kwargs. Same change applied to `test_indexer_topk_decode_dist` (fp32 with num_tokens=8192 falls below kSeqSmall, hits Radix with bp>1) and `_run_indexer_topk_decode_v4_gvr_check` (defense-in-depth — GVR almost always succeeds for its param matrix but the kwarg is now uniform). The radix-aux equivalence test no longer makes sense in its original form (it compared caller-owned vs the removed `th::empty` fallback); rewrite it to verify allocation-pointer-independence — two invocations with freshly-allocated aux at distinct backing addresses must yield the same selected top-K logit values. Validation: - `pre-commit run` clean on the three changed files (clang-format, ruff-format, isort, codespell, yapf). - radix_aux unit tests will run in CI; pre-commit pre-flight is green. Made with Claude Code (Opus 4.7, 1M context) Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> Signed-off-by: longcheng-nv <243710427+longcheng-nv@users.noreply.github.com>
|
Pushed commit The reviewer's point — the two The fix — replace both Tensor placeholders with raw nullable Same-PR-introduced regressions also fixed — chased down two paths that the strict TORCH_CHECK in 383d8c5 would now trip but weren't caught locally (the previous run was filtered with
The radix-aux equivalence test no longer made sense in its original form (compared caller-owned vs the removed
|
|
/bot run --disable-fail-fast |
|
PR_Github #49446 [ run ] triggered by Bot. Commit: |
|
PR_Github #49397 [ run ] completed with state |
|
PR_Github #49446 [ run ] completed with state |
|
@lfr-0531 Friendly ping — this PR is ready and would appreciate a merge when you have a moment. Current status (head
Scope recap: makes the fp32 Radix path of Thanks! |
…ynth
Two new skills mirror the V3.2 swebench-temporal-synth methodology
but parameterised for V4 production:
swebench-temporal-synth-v4flash
- K=512, compress_ratio=4
- 3 beta cfgs fitted from real V4 Flash captures (21 GVR-active
layers, even 2..42), three-bucketed by mean
- per-cfg target_hr ~ 0.36 / 0.46 / 0.44
swebench-temporal-synth-v4pro
- K=1024, compress_ratio=4
- 3 beta cfgs fitted from real V4 Pro captures (30 GVR-active
layers, even 2..60)
- per-cfg target_hr ~ 0.69 / 0.75 / 0.77 (Pro decode steps show
much stronger prev/current preIdx overlap than Flash)
Both skills enforce the V4 kernel contract:
- preIdx caller-side offset = 0 (cr=4 path; kernel uses preIdx[i]
directly per heuristicTopKDecode.cu preIdxOffset rule)
- radix_aux_{indices,logits} pre-allocated for split-work path
(post-NVIDIA#14297 contract)
- numColumns % 8 == 0 (V4 kernel alignment)
- logits dtype switchable to fp32 / bf16 / fp16
Sources:
- Real V4 Flash + Pro captures under
auto_optimization_v1/ablation_study/gvr_phase_timing/09_precision_ablation/{11,12}_dsv4_*_indexer_data_capture/
- One-shot beta + hit_rate fit via analyze_v4_dist.py
- V3.2 sibling skill .claude/skills/swebench-temporal-synth/
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <noreply@anthropic.com>
Signed-off-by: longcheng-nv <243710427+longcheng-nv@users.noreply.github.com>
…mmetric clip
Two refinements to the V4 sibling skills landed last commit:
1. BENCH=1 path now uses nsys instead of cuda.Event
- run_all_n.sh BENCH=1 synthesises all 12 cells (no inline --bench),
then runs ONE nsys session over the whole output dir via
bench_nsys.py, exports the NVTX→GPU projection CSV via
`nsys stats`, and parses to a per-(cfg, N, BS, dtype) R/H
summary (`summary_table.txt` + `nsys_speedup_summary.json`).
- nsys timing measures pure GPU kernel duration and avoids the
5-10 µs launch-tail bias of cuda.Event that systematically
under-reported R/H by 0.6-1.8×.
2. Distribution clip is now asymmetric (clip_low, clip_high)
- V4 indexer logits show stronger positive-tail extension than
negative (e.g. Flash L2 32K: mean=-1.21, range=[-3.28, +3.54]
so the range-center is +0.13, well above mean). The V3.2
sibling's `low = mean - fr/2, high = mean + fr/2` symmetric
envelope truncates the positive tail and over-extends negative.
- BETA_CFGS replaces `full_range` with explicit `clip_low /
clip_high` taken from the per-(layer, ISL) observed (min, max)
widest envelope. sample_beta_row uses these directly. Same
beta-fit math; only the support changes.
Updated cfgs:
Flash beta_shallow: [-4.36, +7.83] (was [-4.97, +2.34])
Flash beta_moderate: [-5.08, +7.17] (was [-6.06, +1.89])
Flash beta_deep: [-6.28, +7.55] (was [-8.32, +3.13])
Pro beta_shallow: [-4.54, +7.33] (was [-5.56, +3.20])
Pro beta_moderate: [-6.15, +8.45] (was [-6.91, +3.14])
Pro beta_deep: [-5.42, +6.47] (was [-7.46, +2.28])
Validated on B200 (sm_10.0, 148 SM) with the corrected build
(PR-A kFTarget=K, PR-B BSMAX env, post-NVIDIA#14297 radix_aux contract):
Flash K=512 R/H bf16 (nsys, BS=1): 2.98-4.60x (N=4K to 25K)
Pro K=1024 R/H bf16 (nsys, BS=1): 2.46-3.68x (N=4K to 25K)
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <noreply@anthropic.com>
Signed-off-by: longcheng-nv <243710427+longcheng-nv@users.noreply.github.com>
Add tests/unittest/_torch/thop/parallel/test_indexer_topk.py to the single-B200 DS pre-merge list so the DeepSeek-V4 CI stage actually exercises the Heuristic / Radix Top-K kernel paths and the radix_aux scratch + CUDA Graph replay equivalence tests introduced by PR NVIDIA#14297. Without this entry the indexer Top-K kernel only got coverage from end-to-end DSv4 accuracy stages, leaving the kernel-level resize and TORCH_CHECK behaviour (which this PR fixes for the update_spec_dec_param resize path) implicitly untested. TIMEOUT (30) chosen with margin over measured ~120 s wall on B200 for the full file at the default parallel_factor. Made-with: claude-code (https://claude.com/claude-code) Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> Signed-off-by: longcheng-nv <243710427+longcheng-nv@users.noreply.github.com>
…_param PR NVIDIA#14297 added persistent radix_aux_{indices,logits} scratch buffers in DSAtrtllmAttentionMetadata.__init__ sized to max_num_sequences * (1 + max_draft_tokens), and added a kernel-side TORCH_CHECK in IndexerTopKOp.cpp that the buffers' numel >= num_rows * blocks_per_row * index_topk. It also patched update_spec_dec_param to resize kv_lens_expanded_host (via create_expanded_buffers) and heuristic_scratch_values when max_draft_tokens changes at runtime, but missed the parallel radix buffers. When the framework reconfigures max_draft_tokens (e.g. spec decoding warmup -> real run, or disagg gen server picking up a different draft length), num_rows starts reflecting the new bound while the radix aux buffers stay at their construction-time size, triggering RuntimeError: radix_aux_{indices,logits} must hold at least num_rows*blocks_per_row*index_topk elements (got 10240 / 10240, need 16384) inside torch.ops.trtllm.indexer_topk_decode on the next forward step. This patch mirrors the existing heuristic_scratch_values resize block for the radix buffers, allocated unconditionally to match the __init__ path (the radix dispatcher can still run when enable_heuristic_topk=True falls back for small numColumns). Made-with: claude-code (https://claude.com/claude-code) Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> Signed-off-by: longcheng-nv <243710427+longcheng-nv@users.noreply.github.com>
Add tests/unittest/_torch/thop/parallel/test_indexer_topk.py to the single-B200 DS pre-merge list so the DeepSeek-V4 CI stage actually exercises the Heuristic / Radix Top-K kernel paths and the radix_aux scratch + CUDA Graph replay equivalence tests introduced by PR NVIDIA#14297. Without this entry the indexer Top-K kernel only got coverage from end-to-end DSv4 accuracy stages, leaving the kernel-level resize and TORCH_CHECK behaviour (which this PR fixes for the update_spec_dec_param resize path) implicitly untested. TIMEOUT (30) chosen with margin over measured ~120 s wall on B200 for the full file at the default parallel_factor. Made-with: claude-code (https://claude.com/claude-code) Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> Signed-off-by: longcheng-nv <243710427+longcheng-nv@users.noreply.github.com>
Description
The fp32 Radix path of
indexer_topk_decodewas not CUDA-Graph-safe: it allocated its split-work scratch buffers (aux_indices,aux_logits) viath::emptyper call whenblocks_per_row > 1. Under CUDA Graph capture + replay these per-call pointers become stale when the caching allocator is perturbed by chunked-prefill activations at high concurrency. The captured Radix part1 / part2 kernels then write to recycled memory; the resultingCUDA_ERROR_ILLEGAL_ADDRESSsurfaces at the next sync (commonly inside DeepGEMM'ssmxx_layout.hpp:97TMA descriptor pack), looking like a downstream FP8 GEMM crash but caused upstream by the indexer.IndexerTopKOp.cpp:91already documents the pitfall for the Heuristic path:and
dsa.py:879follows that contract forheuristic_scratch_values. The Radix path was the missing half of the same pattern — this PR adds it.Observed failure (Flash MXFP4, ISL ≈ 100K, TP = EP = 4, BS = 32, MTP = 0)
cell_013_C008_TEP_GVRFCUDA 700atsmxx_layout.hpp:97cell_017_C016_TEP_GVRFCUDA 700atsmxx_layout.hpp:97cell_021_C032_TEP_GVRFCUDA 700atsmxx_layout.hpp:97All three crashes converge on the identical stack:
fp8_swap_ab_gemm→_fp8_quantize_1x128_ue8m0→deep_gemm.get_mn_major_tma_aligned_packed_ue8m0_tensor→smxx_layout.hpp:97. The DeepGEMM line is the next sync after the offending kernel, not the offender itself. Same-point GVR-ON cells (Heuristic path, uses persistentheuristic_scratch) run clean — confirming the asymmetry is in the indexer.Change
cpp/tensorrt_llm/thop/IndexerTopKOp.cpp— extendindexer_topk_decodewith two optionalTensor?kwargs (radix_aux_indices,radix_aux_logits). When both are provided andblocks_per_row > 1, use the caller-owned buffers; otherwise fall back to per-callth::empty(back-compat for bench scripts / warmup helpers / callers not under CUDA Graph capture). Validatesis_cuda, same device, contiguous, dtype (int32/float32),numel ≥ num_rows × blocks_per_row × index_topk.tensorrt_llm/_torch/attention_backend/sparse/dsa.py— unconditionally allocate persistentradix_aux_indices/radix_aux_logitsinDSAtrtllmAttentionMetadatasized to the worst case (max_gen_tokens × kMaxBlocksPerRowDecode = 10 × num_sparse_topk). Buffers are routed throughself.get_empty(self.cuda_graph_buffers, ...)so they have stable addresses across graph replays and survive metadata-level CUDA-graph capture. Passed at the call site.tensorrt_llm/_torch/custom_ops/cpp_custom_ops.py— extend theregister_fakePython signature to match the new schema.Side-effect surface
None; existing callers continue to use theth::emptyfallback unchanged.DSAtrtllmAttentionMetadatainstance on Flash defaults (max_num_seqs = 32 × kMaxBlocksPerRowDecode = 10 × index_topk = 512 × 2 buffers × 4 B/elt). Negligible.Test Coverage
Unit tests added to
tests/unittest/_torch/thop/parallel/test_indexer_topk.py:test_indexer_topk_decode_radix_aux_equivalence— Parametrised over(batch_size, num_tokens, compress_ratio)shapes that forceblocks_per_row > 1(including the exactbatch_size=8, num_tokens=16384, compress_ratio=4that mirrors the Flash CONC=8 failure point). Verifies caller-owned-aux output is bit-identical to the legacyth::emptyfallback.test_indexer_topk_decode_radix_aux_cuda_graph_replay— Captures the op into a CUDA Graph with caller-owned aux, replays 8× and asserts each replay's output matches a non-graph reference. This is the direct regression test for the original stale-pointer bug.test_indexer_topk_decode_radix_aux_validation— Negative tests for the new validation paths: wrongdtype(indices not int32 / logits not float32), undersized buffer (<num_rows × blocks_per_row × index_topk), non-contiguous tensor.Existing coverage retained:
test_indexer_topk_decode/test_indexer_topk_decode_sm_saturation/test_indexer_topk_decode_launch_policy_transitionscontinue to exercise theth::emptyfallback path (do not pass the new kwargs).bench_indexer_topk_kernel.pycontinues to work unchanged (does not pass new kwargs).End-to-end validation:
pareto_v4_pro_swebench100k_b300sweep against this binary on 8× B300 is in progress; CONC = 1 paired cells (P1 / P2) already completed cleanly with the new fix in place. High-CONC GVR-OFF cells (the ones that previously triggered G4) will be the live test.PR Checklist
Please review the following before submitting your PR:
PR description clearly explains what and why. If using CodeRabbit's summary, please make sure it makes sense.
PR Follows TRT-LLM CODING GUIDELINES to the best of your knowledge.
Test cases are provided for new code paths (see test instructions)
Any new dependencies have been scanned for license and vulnerabilities
CODEOWNERS updated if ownership changes
Documentation updated as needed
Update tava architecture diagram if there is a significant design change in PR.
The reviewers assigned automatically/manually are appropriate for the PR.
Please check this after reviewing the above items as appropriate for this PR.
GitHub Bot Help
To see a list of available CI bot commands, please comment
/bot help.