[None][perf] Scheme X L2-aware dispatcher and PDL launchers for sparse-attention GVR Top-K #13477
longcheng-nv wants to merge 6 commits into NVIDIA:main
Conversation
Add ``warmup_heuristic_topk_decode`` helper that issues one small ``indexer_topk_decode`` call from the ``Indexer`` setup hook before any CUDA Graph capture begins. This forces the C++ Scheme X dispatcher to run its one-time ``cudaGetDevice`` / ``cudaDeviceGetAttribute`` host queries outside capture, so the cached ``sm_count`` and ``L2CacheSize`` values are populated up front and not frozen into a captured graph.

The warmup is gated on ``enable_heuristic_topk`` to match the runtime configuration already used to select the heuristic path; cold-start rows stay cold (no ``threshold_pred`` is passed), so Opt-M semantics are unaffected.

Signed-off-by: longcheng-nv <243710427+longcheng-nv@users.noreply.github.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
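To make the capture ordering concrete, here is a minimal standalone CUDA sketch (an illustration only, not the PR's code; the PR performs the warmup through the Python ``indexer_topk_decode`` custom op, and ``dispatchSketch`` below is a hypothetical stand-in for the dispatcher) showing why one throwaway call before ``cudaStreamBeginCapture`` leaves the attribute cache populated for every later captured launch:

#include <cstdio>
#include <cuda_runtime.h>

__global__ void noopKernel() {}

// Hypothetical stand-in for the dispatcher: on first use it runs the host-side
// cudaGetDevice / cudaDeviceGetAttribute queries and caches the result, then launches.
void dispatchSketch(cudaStream_t stream)
{
    static int smCount = 0;
    if (smCount == 0)
    {
        int dev = 0;
        cudaGetDevice(&dev);
        cudaDeviceGetAttribute(&smCount, cudaDevAttrMultiProcessorCount, dev);
    }
    noopKernel<<<1, 32, 0, stream>>>();
}

int main()
{
    cudaStream_t stream;
    cudaStreamCreate(&stream);

    // Warmup outside any capture region: the one-time host queries run here.
    dispatchSketch(stream);
    cudaStreamSynchronize(stream);

    // During capture only the kernel launch is recorded; the cached smCount is reused.
    cudaGraph_t graph;
    cudaStreamBeginCapture(stream, cudaStreamCaptureModeGlobal);
    dispatchSketch(stream);
    cudaStreamEndCapture(stream, &graph);

    std::printf("device attributes were cached before capture began\n");
    cudaGraphDestroy(graph);
    cudaStreamDestroy(stream);
    return 0;
}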
… Top-K

Host-side dispatcher in `invokeIndexerTopKDecode` routes between the heuristic and radix Top-K paths by comparing `numRows` against an architecture-derived threshold:

    kBsLarge = min(3*SM - SM/8, 0.9*L2 / (4*N))

The occupancy bound (3*SM - SM/8) reflects the per-CTA SMEM budget (~58 KB kernel SMEM vs B200's 228 KB dynamic SMEM → 3 CTA/SM max), with a -SM/8 margin for CTA-setup and L2-ingestion overhead. On B200 (148 SM): 3×148 − 18 = 426. The L2 bound (0.9*L2 / (4*N)) reflects whether the per-row logits fit into the GPU L2 — once `concurrent_CTAs × N × 4B` exceeds L2, eviction dominates and the heuristic kernel loses to radix.

The two constraints cross near N ≈ 73K; for SWE-Bench N ≈ 70K both yield ≈426, so this is a zero-regression change for DSv3.2 decode. For larger N (e.g. 128K → 227, 196K → 148), the L2 bound auto-tightens and preserves the no-regression guarantee.

Both hardware attributes (`cudaDevAttrMultiProcessorCount`, `cudaDevAttrL2CacheSize`) are queried once and static-cached. Zero data-dependent hyperparameters; zero kernel changes; +10 host LOC. An opt-in `TRTLLM_SCHEMEX_DEBUG=1` env var emits a per-launch dispatch trace for introspection.

Signed-off-by: longcheng-nv <243710427+longcheng-nv@users.noreply.github.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
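As a worked illustration of the formula, a self-contained host-side sketch (hypothetical helper name ``schemeXRowThreshold``; the real logic lives in ``invokeIndexerTopKDecode``) that reproduces both bounds with the same integer arithmetic:

#include <algorithm>
#include <cstdint>
#include <cstdio>
#include <cuda_runtime.h>

// Sketch of kBsLarge = min(3*SM - SM/8, 0.9*L2 / (4*N)); not the PR's code.
int schemeXRowThreshold(int numColumns)
{
    int dev = 0, smCount = 0, l2Bytes = 0;
    cudaGetDevice(&dev);
    cudaDeviceGetAttribute(&smCount, cudaDevAttrMultiProcessorCount, dev);
    cudaDeviceGetAttribute(&l2Bytes, cudaDevAttrL2CacheSize, dev);

    // Occupancy bound: ~3 CTAs per SM from the SMEM budget, minus an SM/8 margin.
    int const bsWave = smCount * 3 - smCount / 8; // 148 SMs -> 426
    // L2 bound: number of rows whose fp32 logits fit in 90% of L2.
    int const bsL2 = static_cast<int>(
        static_cast<int64_t>(l2Bytes) * 9 / 10 / (static_cast<int64_t>(numColumns) * 4));
    return std::min(bsWave, bsL2 > 0 ? bsL2 : bsWave);
}

int main()
{
    // Per the commit text, larger N tightens the L2 bound and shrinks the threshold.
    for (int n : {70688, 131072, 196608})
    {
        std::printf("N=%d -> kBsLarge=%d\n", n, schemeXRowThreshold(n));
    }
    return 0;
}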
…2.3)
Add a small-N lower bound `kSeqSmall` to `canUseHeuristic` so the
GVR heuristic Top-K path engages wherever the original TRT-LLM
Radix-radix branch would have triggered. Below `kSeqSmall` the
existing Insertion / Radix-radix path stays active.
Rationale:
* GVR Heuristic has a fixed per-launch overhead (P1 preIdx stats
+ P4 2048-bin histogram snap, ~11 us regardless of N). For small
N the fixed cost dominates and the kernel loses to the existing
insertion-sort path. The crossover N depends on data: random
benchmarks show Heuristic reaching parity at N=16384, but real
SWE-Bench workloads see Heuristic ~6.3 us faster than random
(preIdx-vs-logits ~99% correlated -> P1 stats accurate -> P2
secant converges in 1-2 iterations), shifting the real crossover
into the [12288, 16384] band.
* Setting `kSeqSmall = 12288` lets the Heuristic axis take over
wherever the original Radix-radix branch would fire on real
workload, while keeping N < 12288 on the insertion path (where
GVR's fixed overhead remains uncompetitive).
The original radix dispatcher constants (`kSortingAlgorithmThreshold
= 12288`, `kDefaultSplitWorkThreshold = 200000`) are NOT touched --
when `canUseHeuristic` is false (e.g. preIdx missing, BS too large,
N < kSeqSmall), the dispatcher falls back to BYTE-IDENTICAL
pre-Scheme-X behavior.
Tunables:
* `TRTLLM_HEURISTIC_NMIN` env (range [1024, 200000]) overrides the
default kSeqSmall at process start; cached after first query.
* `TRTLLM_SCHEMEX_DEBUG=1` env now prints `kSeqSmall` plus a
"(small-N route)" marker when N < kSeqSmall, alongside the
existing `kBsWave / kBsL2 / kBsLarge` trace.
Validation (B200, single-GPU):
* Indices set match `torch.topk` for N in {12288, 13312, 14336,
15360, 16384, 32768, 70688} -- both Heuristic and fallback paths.
* Crossover-band perf at [12288, 16384) (correlated preIdx, BS=1):
Heuristic 20-23 us vs Radix-radix fallback 24-26 us -- 1.09-1.20x
speedup at the new boundary.
* Production N=70688 path unchanged: 32.77 us Heuristic vs 55.30 us
Radix-radix fallback (1.69x preserved).
* Scenario B regression guard (preIdx=null) -- all N produce
indices set matching `torch.topk` reference.
No kernel-body or radix dispatcher changes; +57 lines (mostly
explanatory comments around the new `kSeqSmall` cache + the updated
Heuristic eligibility predicate), -5 lines.
Signed-off-by: longcheng-nv <243710427+longcheng-nv@users.noreply.github.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
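Taken together with the Scheme X upper bound, the routing this commit describes can be sketched as a simplified predicate (illustration only; the real ``canUseHeuristic`` in ``indexerTopK.cu`` also checks ``preIdx``, strides, ``topK``, scratch availability, and the radix split-work threshold):

#include <cstdio>

enum class TopKPath { Fallback, Heuristic };

// Simplified Scheme X routing: kSeqSmall and kBsLarge are the thresholds from the
// commit text; every other condition in the real predicate is elided.
TopKPath routeTopKDecode(int numRows, int numColumns, int kSeqSmall, int kBsLarge)
{
    if (numColumns < kSeqSmall)
        return TopKPath::Fallback; // small N: GVR's ~11 us fixed overhead dominates (insertion / radix-radix path)
    if (numRows >= kBsLarge)
        return TopKPath::Fallback; // large BS: logits spill out of L2 / occupancy bound crossed (radix path)
    return TopKPath::Heuristic;    // GVR heuristic kernel
}

int main()
{
    struct Case { int rows, cols; } const cases[] = {{1, 8192}, {1, 70688}, {512, 70688}};
    for (auto const& c : cases)
    {
        bool const heuristic
            = routeTopKDecode(c.rows, c.cols, /*kSeqSmall=*/12288, /*kBsLarge=*/426) == TopKPath::Heuristic;
        std::printf("BS=%d N=%d -> %s\n", c.rows, c.cols, heuristic ? "Heuristic" : "Fallback");
    }
    return 0;
}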
Wire the heuristic TopK decode path into the project's Programmatic Dependent Launch (PDL) pipeline so it stays symmetric with the radix and insertion fallbacks (both of which already use PDL via ``cudaLaunchKernelEx`` in ``invokeIndexerTopKDecode``).

Changes:
* ``launchHeuristicTopKDecode`` now launches via ``cudaLaunchKernelEx`` with ``cudaLaunchAttributeProgrammaticStreamSerialization`` set from ``tensorrt_llm::common::getEnvEnablePDL()``. When ``TRTLLM_ENABLE_PDL=1`` (the default in production), the heuristic kernel is allowed to start before the preceding PDL kernel's tail drains, giving the same overlap the radix path already enjoyed.
* Both heuristic kernel entry points -- the multi-row ``heuristicTopKMultiRowKernel`` and the standalone ``heuristicTopKKernel`` -- now call ``cudaTriggerProgrammaticLaunchCompletion()`` at their exit points (guarded by ``__CUDA_ARCH__ >= 900``) so the next PDL kernel can likewise pre-launch.

No functional change when PDL is disabled. No kernel-body or dispatcher-logic changes; this only adjusts the launch attributes and adds the device-side trigger.

Signed-off-by: longcheng-nv <243710427+longcheng-nv@users.noreply.github.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
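For reference, a minimal CUDA sketch of both halves of that handshake (assumed kernel and launcher names; the real entry points are ``heuristicTopKMultiRowKernel`` / ``heuristicTopKKernel`` and ``launchHeuristicTopKDecode``):

#include <cuda_runtime.h>

__global__ void pdlAwareKernelSketch(float const* in, float* out, int n)
{
    int const i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
    {
        out[i] = in[i] * 2.0f;
    }
#if defined(__CUDA_ARCH__) && (__CUDA_ARCH__ >= 900)
    // Tell the runtime this grid's dependent work is done, so the next kernel
    // launched with programmatic stream serialization may start its prologue early.
    cudaTriggerProgrammaticLaunchCompletion();
#endif
}

// Host-side launch that enables PDL via the attribute described above.
void launchPdlSketch(float const* in, float* out, int n, bool enablePdl, cudaStream_t stream)
{
    cudaLaunchConfig_t config{};
    config.gridDim = dim3((n + 255) / 256);
    config.blockDim = dim3(256);
    config.stream = stream;

    cudaLaunchAttribute attr{};
    attr.id = cudaLaunchAttributeProgrammaticStreamSerialization;
    attr.val.programmaticStreamSerializationAllowed = enablePdl ? 1 : 0;
    config.attrs = &attr;
    config.numAttrs = 1;

    cudaLaunchKernelEx(&config, pdlAwareKernelSketch, in, out, n);
}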
Rename the single-CTA single-row micro-kernel symbols inside
`heuristic_topk.cuh` from the `heuristicTopK*` family to the
`gvrTopK*` family, to align the in-source vocabulary with the
upcoming algorithm note "Guess-Verify-Refine: Data-Aware Top-K
for Sparse-Attention Decoding on Blackwell via Temporal
Correlation".
Renames (inner micro-kernel only):
* `heuristicTopKJob` -> `gvrTopKJob` (`__device__` per-CTA algorithm body)
* `heuristicTopKKernel` -> `gvrTopKKernel` (`__global__` single-row wrapper used by the standalone launcher `launchHeuristicTopK`)
Outer wrappers keep their `heuristic*` / `kHeuristic*` names because
they sit on the public dispatcher / launch surface that callers
already depend on:
* `launchHeuristicTopKDecode` -- multi-row decode launcher
* `heuristicTopKMultiRowKernel` -- multi-row global wrapper
* `launchHeuristicTopK` -- single-row standalone launcher
* `canUseHeuristic`, `kHeuristicTopK`, `kHeuristicSize` -- dispatcher predicates + public constants
Phase comments inside `gvrTopKJob` are also annotated with the GVR
mapping for cross-reference with the algorithm note:
* P1 (preIdx Min/Max/Mean) -> GVR Guess, part 1
* P2 (Secant threshold search) -> GVR Guess, part 2
* P3 (Ballot-free collect) -> GVR Verify
* P4 (Histogram snap + partition) -> GVR Refine
Pure source-level rename; no signature changes on any public symbol,
no kernel-body changes, no dispatcher-logic changes. Rebuilt
`tensorrt_llm` + `th_common` and re-ran the indices-correctness +
crossover-band + Scenario-B suite -- all PASS, with N=70688 timing
byte-identical to the pre-rename run (32.77 us / 1.69x vs
Radix-radix).
Signed-off-by: longcheng-nv <243710427+longcheng-nv@users.noreply.github.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
/bot run --disable-fail-fast

PR_Github #45585 [ run ] triggered by Bot. Commit:
📝 Walkthrough

Implements a "Guess-Verify-Refine" (GVR) micro-kernel architecture for TopK decoding with renamed kernels.

Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~60 minutes

Pre-merge checks: ❌ 1 failed (1 warning) | ✅ 4 passed
Actionable comments posted: 2
🧹 Nitpick comments (1)
tensorrt_llm/_torch/custom_ops/cpp_custom_ops.py (1)
1161-1191: Make the warmup idempotent. This helper initializes process-global dispatcher caches, but it reruns the CUDA allocations and torch.cuda.synchronize() every time it is called. Memoizing it per (device, top_k, hint_size, num_cols) would avoid repeated init-time stalls when multiple models/indexers are constructed in one process.

♻️ Suggested direction

+_heuristic_topk_warmups: set[tuple[int, int, int, int]] = set()
+
 def warmup_heuristic_topk_decode(top_k: int = 2048, hint_size: int = 2048, num_cols: int = 4096) -> None:
+    device_idx = torch.cuda.current_device()
+    key = (device_idx, top_k, hint_size, num_cols)
+    if key in _heuristic_topk_warmups:
+        return
-    device = torch.device("cuda")
+    device = torch.device(f"cuda:{device_idx}")
     logits = torch.zeros((1, num_cols), dtype=torch.float32, device=device)
     seq_lens = torch.tensor([num_cols], dtype=torch.int32, device=device)
     indices = torch.empty((1, top_k), dtype=torch.int32, device=device)
     pre_idx = torch.zeros((1, hint_size), dtype=torch.int32, device=device)
     scratch = torch.empty((top_k, ), dtype=torch.float32, device=device)
     torch.ops.trtllm.indexer_topk_decode(logits, seq_lens, indices, 1, top_k, pre_idx=pre_idx, heuristic_scratch=scratch)
     torch.cuda.synchronize()
+    _heuristic_topk_warmups.add(key)

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tensorrt_llm/_torch/custom_ops/cpp_custom_ops.py` around lines 1161 - 1191, Make warmup_heuristic_topk_decode idempotent by adding a module-level cache and lock (e.g. _warmup_heuristic_topk_decode_cache and _warmup_heuristic_topk_decode_lock) and short-circuiting when the tuple key for (device, top_k, hint_size, num_cols) is already warmed; construct the key using the CUDA device identity (device.index or device.type+index) plus the three ints, acquire the lock, check cache, perform the existing allocation/call/synchronize only if not present, then add the key to the cache before releasing the lock so subsequent calls skip the heavy work.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@cpp/tensorrt_llm/kernels/heuristic_topk.cuh`:
- Around line 802-806: The launcher for gvrTopKKernel must enable programmatic
stream serialization (PDL); replace the triple-angle launch in
launchHeuristicTopK() with a cudaLaunchKernelEx call and pass
cudaLaunchAttributeProgrammaticStreamSerialization in the attributes. Keep the
same kernel function (gvrTopKKernel), grid (1), block (BLOCK_SIZE), shared
memory size (smemSize), and stream, and build the void* args array containing
input, &N, &preIdx, &M, &topK, outputValues, outputIndices, &thresholdPos
(matching the kernel signature). Also preserve the prior cudaFuncSetAttribute
call for MaxDynamicSharedMemorySize; ensure cudaLaunchKernelEx is used exactly
like the pattern in heuristicTopKDecode.cu with the programmatic serialization
attribute.
In `@cpp/tensorrt_llm/kernels/indexerTopK.cu`:
- Around line 740-803: The function-level mutable statics sCachedSmCount,
sCachedL2Bytes, sCachedNMin and the debug flags must be initialized once to
avoid races; wrap all their initialization (the cudaDeviceGetAttribute calls,
the TRTLLM_HEURISTIC_NMIN getenv logic, and the TRTLLM_SCHEMEX_DEBUG
getenv/flags) in a single std::call_once using a static std::once_flag (e.g.,
sSchemeXInitFlag) so concurrent callers of invokeIndexerTopKDecode() cannot
race; move the existing if (sCached... == 0) blocks into a single lambda invoked
by std::call_once and remove the separate sDebugChecked pattern so sDebug is set
under the same call_once guard.
---
Nitpick comments:
In `@tensorrt_llm/_torch/custom_ops/cpp_custom_ops.py`:
- Around line 1161-1191: Make warmup_heuristic_topk_decode idempotent by adding
a module-level cache and lock (e.g. _warmup_heuristic_topk_decode_cache and
_warmup_heuristic_topk_decode_lock) and short-circuiting when the tuple key for
(device, top_k, hint_size, num_cols) is already warmed; construct the key using
the CUDA device identity (device.index or device.type+index) plus the three
ints, acquire the lock, check cache, perform the existing
allocation/call/synchronize only if not present, then add the key to the cache
before releasing the lock so subsequent calls skip the heavy work.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: eaa9f343-c674-49cf-b002-5a5fb043ef42
📒 Files selected for processing (5)
cpp/tensorrt_llm/kernels/heuristicTopKDecode.cu
cpp/tensorrt_llm/kernels/heuristic_topk.cuh
cpp/tensorrt_llm/kernels/indexerTopK.cu
tensorrt_llm/_torch/attention_backend/sparse/dsa.py
tensorrt_llm/_torch/custom_ops/cpp_custom_ops.py
    cudaFuncSetAttribute(gvrTopKKernel, cudaFuncAttributeMaxDynamicSharedMemorySize, static_cast<int>(smemSize));
}

-   heuristicTopKKernel<<<1, BLOCK_SIZE, smemSize, stream>>>(
+   gvrTopKKernel<<<1, BLOCK_SIZE, smemSize, stream>>>(
        input, N, preIdx, M, topK, outputValues, outputIndices, thresholdPos);
🧩 Analysis chain
🏁 Script executed:
#!/bin/bash
set -euo pipefail
echo "Call sites / definition for launchHeuristicTopK:"
rg -nP '\blaunchHeuristicTopK\s*\('
echo
echo "Current single-row launcher:"
sed -n '790,807p' cpp/tensorrt_llm/kernels/heuristic_topk.cuh
echo
echo "PDL-enabled decode launcher for comparison:"
sed -n '113,125p' cpp/tensorrt_llm/kernels/heuristicTopKDecode.cu

Repository: NVIDIA/TensorRT-LLM
Length of output: 1701
🏁 Script executed:
# Check for all call sites of launchHeuristicTopK across the repository
rg -n '\blaunchHeuristicTopK\s*\(' --type cpp --type cu --type cuh -A 2
echo "---"
echo "Checking gvrTopKKernel implementation for cudaTriggerProgrammaticLaunchCompletion:"
sed -n '770,780p' cpp/tensorrt_llm/kernels/heuristic_topk.cuh

Repository: NVIDIA/TensorRT-LLM
Length of output: 585
🏁 Script executed:
# Search for all references to launchHeuristicTopK in the repo
rg -i 'launchHeuristicTopK' --no-type -n
echo "---"
echo "Checking if launchHeuristicTopK is declared in any header:"
rg -i 'launchHeuristicTopK' --no-type -n | grep -E '\.(h|hpp|cuh):'

Repository: NVIDIA/TensorRT-LLM
Length of output: 1016
🏁 Script executed:
# Simple search for all launchHeuristicTopK references
rg 'launchHeuristicTopK' -n
echo "---"
echo "Check the full function definition and context:"
sed -n '782,810p' cpp/tensorrt_llm/kernels/heuristic_topk.cuh

Repository: NVIDIA/TensorRT-LLM
Length of output: 2418
Enable PDL on launchHeuristicTopK() to match the kernel's PDL-aware design.
The kernel gvrTopKKernel calls cudaTriggerProgrammaticLaunchCompletion() (lines 772–774, Hopper+ only), but this launcher uses <<<...>>> syntax and doesn't enable PDL. While no internal call sites exist, the function is template-instantiated and exported (line 811), making it part of the public API. The mismatch between the kernel's PDL-aware completion signaling and the launcher's inability to enable stream serialization should be resolved.
Update the launcher to use cudaLaunchKernelEx with cudaLaunchAttributeProgrammaticStreamSerialization to match the pattern used in heuristicTopKDecode.cu (lines 113–125).
🔧 Suggested fix
- gvrTopKKernel<<<1, BLOCK_SIZE, smemSize, stream>>>(
- input, N, preIdx, M, topK, outputValues, outputIndices, thresholdPos);
+ cudaLaunchConfig_t config;
+ config.gridDim = 1;
+ config.blockDim = BLOCK_SIZE;
+ config.dynamicSmemBytes = smemSize;
+ config.stream = stream;
+
+ cudaLaunchAttribute attrs[1];
+ attrs[0].id = cudaLaunchAttributeProgrammaticStreamSerialization;
+ attrs[0].val.programmaticStreamSerializationAllowed = tensorrt_llm::common::getEnvEnablePDL();
+ config.numAttrs = 1;
+ config.attrs = attrs;
+
+ cudaLaunchKernelEx(&config, gvrTopKKernel, input, N, preIdx, M, topK, outputValues, outputIndices, thresholdPos);🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@cpp/tensorrt_llm/kernels/heuristic_topk.cuh` around lines 802 - 806, The
launcher for gvrTopKKernel must enable programmatic stream serialization (PDL);
replace the triple-angle launch in launchHeuristicTopK() with a
cudaLaunchKernelEx call and pass
cudaLaunchAttributeProgrammaticStreamSerialization in the attributes. Keep the
same kernel function (gvrTopKKernel), grid (1), block (BLOCK_SIZE), shared
memory size (smemSize), and stream, and build the void* args array containing
input, &N, &preIdx, &M, &topK, outputValues, outputIndices, &thresholdPos
(matching the kernel signature). Also preserve the prior cudaFuncSetAttribute
call for MaxDynamicSharedMemorySize; ensure cudaLaunchKernelEx is used exactly
like the pattern in heuristicTopKDecode.cu with the programmatic serialization
attribute.
static int sCachedSmCount = 0;
static int sCachedL2Bytes = 0;
if (sCachedSmCount == 0 || sCachedL2Bytes == 0)
{
    int dev = 0;
    cudaGetDevice(&dev);
    cudaDeviceGetAttribute(&sCachedSmCount, cudaDevAttrMultiProcessorCount, dev);
    cudaDeviceGetAttribute(&sCachedL2Bytes, cudaDevAttrL2CacheSize, dev);
}
int const kBsWave = (sCachedSmCount > 0) ? (sCachedSmCount * 3 - sCachedSmCount / 8) : 426;
int const kBsL2 = (sCachedL2Bytes > 0 && numColumns > 0)
    ? (int) ((int64_t) sCachedL2Bytes * 9 / 10 / ((int64_t) numColumns * 4))
    : kBsWave;
int const kBsLarge = std::min(kBsWave, kBsL2 > 0 ? kBsL2 : kBsWave);

// v1.2: small-N lower bound — set to kSortingAlgorithmThreshold (12288) so
// the Heuristic axis takes over wherever the original Radix-radix branch
// would have triggered. Random-data benchmarks suggested 16384, but real
// SWE-Bench workloads see Heuristic ~6.3 us faster than random (preIdx-vs-
// logits ~99% correlated → P1 stats accurate → P2 secant 1-2 iter), shifting
// the real crossover into the [12288, 16384] band. Below 12288 the Insertion
// path is still used (canUseHeuristic gating + dispatcher fallback both
// route there). Configurable via TRTLLM_HEURISTIC_NMIN env (>=1024).
static int sCachedNMin = 0;
if (sCachedNMin == 0)
{
    constexpr int kSeqSmallDefault = 12288;
    char const* env = std::getenv("TRTLLM_HEURISTIC_NMIN");
    if (env != nullptr)
    {
        int const v = std::atoi(env);
        sCachedNMin = (v >= 1024 && v <= 200000) ? v : kSeqSmallDefault;
    }
    else
    {
        sCachedNMin = kSeqSmallDefault;
    }
}
int const kSeqSmall = sCachedNMin;

bool const canUseHeuristic = preIdx != nullptr && stride1 == 1 && topK == kHeuristicTopK
    && preIdxCount == kHeuristicSize && preIdxStride >= preIdxCount && numColumns < effectiveSplitWorkThreshold
-    && heuristicScratch != nullptr;
+    && numColumns >= kSeqSmall && heuristicScratch != nullptr && numRows < kBsLarge;

// Optional env-gated dispatch trace (set TRTLLM_SCHEMEX_DEBUG=1 to enable)
{
    static bool sDebugChecked = false;
    static bool sDebug = false;
    if (!sDebugChecked)
    {
        char const* env = std::getenv("TRTLLM_SCHEMEX_DEBUG");
        sDebug = (env != nullptr && env[0] == '1');
        sDebugChecked = true;
    }
    if (sDebug)
    {
        fprintf(stderr,
            "[Scheme X v1.2] numRows=%d numColumns=%d kBsWave=%d kBsL2=%d kBsLarge=%d kSeqSmall=%d smCount=%d "
            "L2=%dMB -> %s path%s\n",
            numRows, numColumns, kBsWave, kBsL2, kBsLarge, kSeqSmall, sCachedSmCount,
            sCachedL2Bytes / (1024 * 1024), canUseHeuristic ? "Heuristic" : "Radix",
            (numColumns < kSeqSmall) ? " (small-N route)" : "");
    }
}
🧩 Analysis chain
🏁 Script executed:
# First, locate and examine the indexerTopK.cu file
find . -name "indexerTopK.cu" -type fRepository: NVIDIA/TensorRT-LLM
Length of output: 105
🏁 Script executed:
# Check the specific lines 740-803 in the file
if [ -f "cpp/tensorrt_llm/kernels/indexerTopK.cu" ]; then
wc -l cpp/tensorrt_llm/kernels/indexerTopK.cu
fi

Repository: NVIDIA/TensorRT-LLM
Length of output: 107
🏁 Script executed:
# Read lines around 740-803
if [ -f "cpp/tensorrt_llm/kernels/indexerTopK.cu" ]; then
sed -n '740,803p' cpp/tensorrt_llm/kernels/indexerTopK.cu
fi

Repository: NVIDIA/TensorRT-LLM
Length of output: 2988
🏁 Script executed:
# Search for getEnvEnablePDL to verify the reference pattern
rg "getEnvEnablePDL" --type cpp --type h -A 10 -B 2Repository: NVIDIA/TensorRT-LLM
Length of output: 28720
🏁 Script executed:
# Check for std::call_once usage in the codebase
rg "std::call_once" --type cpp --type h -B 2 -A 5Repository: NVIDIA/TensorRT-LLM
Length of output: 7434
🏁 Script executed:
# Look at the function containing these statics to understand the context
grep -n "invokeIndexerTopKDecode" cpp/tensorrt_llm/kernels/indexerTopK.cu | head -5Repository: NVIDIA/TensorRT-LLM
Length of output: 172
Protect Scheme X cache initialization with std::call_once to prevent data races.
The static caches sCachedSmCount, sCachedL2Bytes, sCachedNMin, and the debug flags are mutable function-local statics updated without synchronization. In multi-threaded scenarios, multiple host threads calling invokeIndexerTopKDecode() concurrently can race during initialization—multiple threads may observe zero values, enter the if-block, and write to the same statics simultaneously. Use std::call_once with a static std::once_flag, following the established pattern in getEnvEnablePDL() and elsewhere in the codebase.
Suggested fix
+#include <mutex>
...
+ static std::once_flag sSchemeXInit;
static int sCachedSmCount = 0;
static int sCachedL2Bytes = 0;
- if (sCachedSmCount == 0 || sCachedL2Bytes == 0)
+ static int sCachedNMin = 0;
+ static bool sDebugChecked = false;
+ static bool sDebug = false;
+
+ std::call_once(sSchemeXInit, [] {
+ int dev = 0;
+ cudaGetDevice(&dev);
+ cudaDeviceGetAttribute(&sCachedSmCount, cudaDevAttrMultiProcessorCount, dev);
+ cudaDeviceGetAttribute(&sCachedL2Bytes, cudaDevAttrL2CacheSize, dev);
+
+ constexpr int kSeqSmallDefault = 12288;
+ char const* env = std::getenv("TRTLLM_HEURISTIC_NMIN");
+ if (env != nullptr)
+ {
+ int const v = std::atoi(env);
+ sCachedNMin = (v >= 1024 && v <= 200000) ? v : kSeqSmallDefault;
+ }
+ else
+ {
+ sCachedNMin = kSeqSmallDefault;
+ }
+
+ env = std::getenv("TRTLLM_SCHEMEX_DEBUG");
+ sDebug = (env != nullptr && env[0] == '1');
+ sDebugChecked = true;
+ });
- {
- int dev = 0;
- cudaGetDevice(&dev);
- cudaDeviceGetAttribute(&sCachedSmCount, cudaDevAttrMultiProcessorCount, dev);
- cudaDeviceGetAttribute(&sCachedL2Bytes, cudaDevAttrL2CacheSize, dev);
- }🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@cpp/tensorrt_llm/kernels/indexerTopK.cu` around lines 740 - 803, The
function-level mutable statics sCachedSmCount, sCachedL2Bytes, sCachedNMin and
the debug flags must be initialized once to avoid races; wrap all their
initialization (the cudaDeviceGetAttribute calls, the TRTLLM_HEURISTIC_NMIN
getenv logic, and the TRTLLM_SCHEMEX_DEBUG getenv/flags) in a single
std::call_once using a static std::once_flag (e.g., sSchemeXInitFlag) so
concurrent callers of invokeIndexerTopKDecode() cannot race; move the existing
if (sCached... == 0) blocks into a single lambda invoked by std::call_once and
remove the separate sDebugChecked pattern so sDebug is set under the same
call_once guard.
PR_Github #45585 [ run ] completed with state

Three fixes responding to the automated review of the Scheme X / GVR Top-K PR:

1. heuristic_topk.cuh: switch launchHeuristicTopK to cudaLaunchKernelEx with cudaLaunchAttributeProgrammaticStreamSerialization so the kernel epilogue's cudaTriggerProgrammaticLaunchCompletion() actually takes effect. Honors TRTLLM_ENABLE_PDL=0 via std::getenv to stay self-contained (this header is also reused by the standalone JIT-compiled PyTorch extension under ablation_study/, which cannot pull in tensorrt_llm/common headers).
2. indexerTopK.cu: wrap the three function-local static caches inside invokeIndexerTopKDecode (sm count + L2 capacity, kSeqSmall, debug flag) in std::call_once with a once_flag to remove the data race on first concurrent entry. Pattern matches getEnvEnablePDL() in tensorrt_llm/common/envUtils.cpp.
3. cpp_custom_ops.py: add a module-level idempotency guard keyed by (device, top_k, hint_size, num_cols) around warmup_heuristic_topk_decode so repeated Indexer constructions in the same process do not re-allocate scratch tensors or issue redundant synchronizations.

Verified: rebuilt tensorrt_llm + th_common, smoke-tested both Heuristic and Radix dispatch paths plus the standalone JIT extension.

Signed-off-by: longcheng-nv <243710427+longcheng-nv@users.noreply.github.com>

/bot run --disable-fail-fast

PR_Github #45604 [ run ] triggered by Bot. Commit:

PR_Github #45604 [ run ] completed with state

/bot run --disable-fail-fast

PR_Github #45647 [ run ] triggered by Bot. Commit:

PR_Github #45647 [ run ] completed with state

/bot run --disable-fail-fast

PR_Github #45666 [ run ] triggered by Bot. Commit:

PR_Github #45666 [ run ] completed with state
Summary
Follow-up to #12385 (Temporally-Correlated Heuristic-guided Indexer TopK).
This PR adds the per-(BS, N) Scheme X dispatcher, PDL launchers,
CUDA Graph warmup, and renames the inner micro-kernel symbols to
GVR (Guess-Verify-Refine).
Scheme X dispatcher: per-(BS, N) routing between the heuristic kernel and the radix fallback, derived from MultiProcessorCount and L2CacheSize queried once at runtime. Closes the regression band around BS=128 / N=70K where the heuristic alone was slower than the radix path.

PDL launchers: switch launchHeuristicTopKDecode to cudaLaunchKernelEx with cudaLaunchAttributeProgrammaticStreamSerialization; call cudaTriggerProgrammaticLaunchCompletion() at the kernel epilogue. Symmetric with the radix path that already used PDL via invokeIndexerTopKDecode.

CUDA Graph warmup: warmup_heuristic_topk_decode helper invoked from the Indexer setup hook (layer_idx == 0) so the dispatcher's cudaGetDevice / cudaDeviceGetAttribute queries land outside any capture region; the cached sm_count and L2CacheSize are populated before any graph capture begins.

GVR rename: heuristicTopK* to gvrTopK* (gvr = Guess-Verify-Refine, the algorithm of the upcoming algorithm note). Public dispatcher / launcher names remain unchanged.
Commit Breakdown (5 commits)
Key Files
cpp/tensorrt_llm/kernels/heuristic_topk.cuh, cpp/tensorrt_llm/kernels/heuristicTopKDecode.cu -- cudaLaunchKernelEx + GVR rename + 2 PDL triggers
cpp/tensorrt_llm/kernels/indexerTopK.cu -- Scheme X dispatcher thresholds (kBsWave, kBsL2, kSeqSmall=12288)
tensorrt_llm/_torch/custom_ops/cpp_custom_ops.py -- warmup_heuristic_topk_decode Python helper
tensorrt_llm/_torch/attention_backend/sparse/dsa.py -- Indexer setup hook calling the warmup helper

API
No new user-facing API. The dispatcher is automatically engaged when enable_heuristic_topk=True (per-DeepSeekSparseAttentionConfig, already shipped in #12385). Override knob for benchmarking: TRTLLM_HEURISTIC_NMIN=<int> env var (defaults to dispatcher-decided).

Test plan

pytest tests/unittest/_torch/thop/parallel/test_indexer_topk.py -k test_indexer_topk_decode → 284 passed / 0 failed (32.4 s)
Results checked against torch.topk (set equality; outputs are unordered)

Correctness coverage
test_indexer_topk_decode: bs={1,64,512,2048} × next_n={1,2} × index_topk={2048,128} × num_tokens={4K,8K}
test_indexer_topk_decode_dist: beta / lognorm / logistic / weibull_min × MTP × success_ratio
preIdx=null path byte-identical to torch.topk

Performance Results (B200 sm_100)
The heuristic TopK micro-kernel (gvrTopKJob — single-CTA single-row, called from heuristicTopKMultiRowKernel) is benchmarked against the default radix-sort path (topKPerRowDecode).

BS=1 single-op vs N — realistic input (DeepSeek V3.2 SWE-Bench-64K decode logits)
Profiled across 9 layers × 17 decode steps (N ≈ 68.7K – 70.7K):
Average BS=1 speedup: 1.91× across 9 layers (consistent with the
1.81× on the synthetic DeepSeek workload reported in #12385). T4 spot
check at N=70688 with random-correlated preIdx: 34.82 μs Heuristic vs
55.30 μs Radix → 1.59×.
BS sweep — pooled across 9 layers × 16 rows
The Scheme X dispatcher routes BS ≥ 432 back to the radix path on a
148-SM B200 (where Heuristic loses its lead because the wave-occupancy
crossover point has been crossed). All 15 BS values are within ±2 μs of
the v1.1 baseline (max abs Δ = -1.09 μs, mean Δ = -0.07 μs).
Sweet spot — small-N + large-BS (Scheme X v1.2.3 vs ForceHeuristic)
kSeqSmall=12288 lets the dispatcher route small-N + large-BS regimes to the radix path where Heuristic histogram overhead dominates:
Crossover band [12288, 16384) — T3 verification
At kSeqSmall=12288, the heuristic kernel must remain ≤ the radix fallback in the BS=1 + correlated-preIdx case:
Regression guard
Verified against the Scheme X v1.1 baseline:
GitHub Bot Help
/bot run --disable-fail-fast

Summary by CodeRabbit
New Features
Bug Fixes
Performance
Configuration
Environment variables (TRTLLM_HEURISTIC_NMIN, TRTLLM_SCHEMEX_DEBUG, PDL control) for tuning heuristic top-K behavior.