
[None][feat] Integrate FP4 indexer for DSA on Blackwell #13340

Merged
longlee0622 merged 1 commit into NVIDIA:main from lfr-0531:user/fanrongl/fp4_indexer
May 7, 2026

Conversation

@lfr-0531
Collaborator

@lfr-0531 lfr-0531 commented Apr 22, 2026

@coderabbitai summary

Description

Add the FP4 indexer path for DSA on Blackwell (SM100+) and land it with a native CUDA fused-quantize op. Four commits on top of main (the two review-feedback commits are grouped under item 3):

  1. [None][feat] Integrate FP4 indexer for DSA on Blackwell (e9542afeb)

    • Upgrade DeepGEMM to Barry-Delaney/DeepGEMM@a97b74d7 (user/jinshik/nv_dev_rebase), which adds the FP4 MQA logits kernels and switches the paged MQA logits APIs to require 2D context_lens. All get_paged_mqa_logits_metadata / fp8_paged_mqa_logits call sites in dsa.py are migrated to 2D via a pre-allocated kv_lens_cuda_2d buffer (no capture-time allocations).
    • Add DeepSeekSparseAttentionConfig.indexer_k_dtype: Literal["fp8","fp4"].
    • Thread an indexerKCacheUseFp4 flag through the C++ KVCacheManager chain (WindowBlockManager, BlockManager, all 4 KVCacheManager overloads, the nanobind binding) and through the disagg CacheState (serialization + operator==, so prefill/decode refuse to pair when the flag disagrees). createIndexerKCachePools picks data_size = index_head_dim/2 under FP4, shrinking per-token size from 132B to 68B.
    • Relax IndexerKCacheScatterOp / IndexerKCacheGatherOp (+ their kernels) to accept head_dim ∈ {128 (FP8), 64 (FP4-packed)}, pick threads_per_block = head_dim / VEC_SIZE at runtime, and forward head_dim=64 under FP4 from the callers.
    • Indexer._prep_q_or_k dispatches to the FP4 quantizer when use_fp4; pre_indexer_proj returns a 5th q_scale output; weights under FP4 carry only softmax_scale * n_heads^-0.5 because the kernel applies q_scale internally. New _call_mqa_logits / _call_paged_mqa_logits helpers reinterpret the FP8 gather output as int8 / int32 under FP4 and route to fp8_fp4_mqa_logits / fp8_fp4_paged_mqa_logits. DSACacheManager computes per-token size based on use_fp4.
    • Thread q_scale through the two-op CUDA-graph-split DSA path in attention.py and through KVCacheManager.__init__ in resource_manager.py.
    • First-pass quantizer: fp4_quantize_1x32_sf_transpose — a Triton kernel that fuses amax, UE8M0 ceil, FP4 E2M1 quantize, nibble packing, and four-per-int32 scale packing, bit-identical to DeepGEMM's testing.per_token_cast_to_fp4(..., gran_k=32, use_packed_ue8m0=True) reference.
  2. [None][perf] Replace Triton FP4 indexer quantizer with fused_cat_fp4 CUDA op (c90c56e85)

    • Mirror the existing fused_cat_fp8 op with an FP4 E2M1 variant: torch.ops.trtllm.fused_cat_fp4(pe, nope) -> (packed_int8, scale_int32). Fuses concat + per-block-32 quantize + UE8M0 scale packing into one CUDA kernel, removing the remaining Triton DSL dependency on the DSA CUDA-graph hot path and giving FP4/FP8 indexer branches a uniform native-CUDA call-site shape.
    • Kernel: grid (ceil(M / 8),), block 256. Each warp handles one 128-element row; thread t covers elements [4t, 4t+3]. Per-block-32 amax via 3-round __shfl_xor_sync (offsets 1, 2, 4 — stays inside each group of 8 lanes). UE8M0 scale via IEEE 754 bit trick; MIN_AMAX = 1e-12f keeps ratio normal so no exp clamp needed. FP4 E2M1 quantize via 7-way bucketize on {0.25, 0.75, 1.25, 1.75, 2.5, 3.5, 5.0}. IEEE div.rn.f32 for the scaled value. Nibble pack (even → low, odd → high); lane 0 gathers the four UE8M0 exponent bytes from {0, 8, 16, 24} via __shfl_sync and writes one int32 per row (little-endian). A reference NumPy sketch of this quantization scheme appears right after this list.
    • Indexer._prep_q_or_k switched to the new op; fp4_quantize_1x32_sf_transpose import removed.
    • tensorrt_llm/quantization/utils/fp4_utils.py reverted exactly to its pre-PR state (git diff HEAD^ on this file is empty). tests/unittest/_torch/quantization/{__init__.py,test_fp4_quantize.py} removed; coverage moves to test_cpp_custom_ops.py. The unittest/_torch/quantization entry is dropped from l0_b200.yml.
  3. [None][fix] Address review feedback on FP4 indexer PR (397eafe06) + [None][fix] Address round-2 review feedback on FP4 indexer PR (eac155a4c)

    • Round-1 P0: the FP4 branch of pre_indexer_proj now runs q_scale = q_scale.view(-1, self.n_heads, 1), because fused_cat_fp4 flattens to M = N * n_heads and downstream token-axis slicing in sparse_attn_indexer (chunk / ctx / decode) needs the same reshape the FP8 branch already had. Without this, the FP4 path was silently slicing the wrong per-token scales.
    • Round-1 P1: DSACacheManager.get_cache_size_per_token / get_cache_bytes_per_token now pick index_head_dim // 2 under FP4, so the KV cache block budget sees the 68 B/token footprint (was hardcoded to 132 B). Without this, the block budget over-counted and the FP4 memory win never reached the allocator.
    • Round-1 P1: @model_validator(mode="after") on DeepSeekSparseAttentionConfig rejects fp4 on SM<100 or index_head_dim != 128, degrades gracefully when CUDA is unavailable, and now includes recovery hints in the error message.
    • Round-2 P0: ModelConfig.from_pretrained (DSV3.2 / GlmMoeDsa rebuild branch) now forwards indexer_k_dtype — previously it was hand-copying every other sparse-attention field, so the user knob was silently reset to "fp8" before any downstream consumer saw it. Added a regression-guard test that exercises the rebuild and statically asserts the keyword is present in the source.
    • Round-2 P1: dropped the dead scheduler_metadata_buffer_mtp3 path. DeepGEMM's upgraded paged MQA logits kernel picks num_kv_multicast=1 for every next_n on SM100 (verified by the _schedule_meta_size assertion in deepgemm-src/csrc/apis/attention.hpp:339 firing when the legacy layout is passed). Removed the buffer allocation, both population sites, and the production dispatch branch.
    • Round-1 / Round-2 P2 + nits: CacheState::operator== now also compares mHasIndexerKCache, mIndexerDimPerHead, mIndexerKCacheQuantBlockSize so prefill/decode refuse to pair on any incompatible indexer layout (not just FP8/FP4); quant_block_size assertion tightened to == 128; DeepGEMM-source reference comment on the FP4 magic constants; 8-byte base-pointer TORCH_CHECK (with pointer in the message) on fused_cat_fp4Op; refreshed _call_mqa_logits comment post q_scale reshape fix; copyright year bumps on dataTransceiverState.h, cacheTransceiver.cpp, serialization.cpp.
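
For readers who want the packed layout spelled out, the sketch below is a minimal NumPy model of the per-block-32 scheme that both the Triton first pass and fused_cat_fp4 implement: amax per 32-element block, a power-of-two (UE8M0) scale, E2M1 bucketize on the boundaries listed above, nibble packing, and four scale bytes per row packed little-endian. It is illustrative only: the function names are invented, the amax / 6.0 convention and the ceil-of-log2 rounding are assumptions standing in for the kernel's IEEE-754 bit trick, and it is not the DeepGEMM reference implementation.

    # Hypothetical reference model of the FP4 (E2M1) + packed-UE8M0 layout.
    # Not the shipped kernel; names and the amax/6.0 convention are assumptions.
    import math
    import numpy as np

    E2M1_VALUES = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])   # FP4 codes 0..7
    E2M1_BOUNDS = np.array([0.25, 0.75, 1.25, 1.75, 2.5, 3.5, 5.0])    # midpoints between them
    BLOCK, HEAD_DIM, MIN_AMAX = 32, 128, 1e-12

    def quantize_row_fp4(row):
        """One 128-element row -> 64 packed FP4 bytes + 4 UE8M0 scale bytes (68 B/token)."""
        row = np.asarray(row, dtype=np.float64)            # 1-D, length HEAD_DIM
        codes = np.zeros(HEAD_DIM, dtype=np.uint8)
        exp_bytes = []
        for b in range(HEAD_DIM // BLOCK):                 # 4 blocks of 32
            blk = row[b * BLOCK:(b + 1) * BLOCK]
            amax = max(float(np.abs(blk).max()), MIN_AMAX)
            exp = math.ceil(math.log2(amax / 6.0))         # scale = 2**exp ("UE8M0 ceil")
            exp_bytes.append(exp + 127)                    # biased exponent byte
            scaled = blk / (2.0 ** exp)
            mag = np.searchsorted(E2M1_BOUNDS, np.abs(scaled))   # 7-way bucketize -> 0..7
            sign = (scaled < 0).astype(np.uint8)
            codes[b * BLOCK:(b + 1) * BLOCK] = mag | (sign << 3)
        packed = (codes[0::2] | (codes[1::2] << 4)).astype(np.uint8)  # even -> low nibble
        # Four UE8M0 bytes, little-endian (stored as one int32 per row in the real op).
        scale_word = sum(e << (8 * i) for i, e in enumerate(exp_bytes))
        return packed, scale_word

    def dequantize_row_fp4(packed, scale_word):
        """Invert the packing above; handy for round-trip sanity checks."""
        codes = np.empty(HEAD_DIM, dtype=np.uint8)
        codes[0::2], codes[1::2] = packed & 0xF, packed >> 4
        out = E2M1_VALUES[codes & 0x7] * np.where(codes >> 3, -1.0, 1.0)
        for b in range(HEAD_DIM // BLOCK):
            exp = ((scale_word >> (8 * b)) & 0xFF) - 127
            out[b * BLOCK:(b + 1) * BLOCK] *= 2.0 ** exp
        return out

In this layout a token costs 64 packed data bytes plus 4 scale bytes (68 B), versus 128 + 4 = 132 B for the FP8 indexer K cache, which is exactly the per-token shrink quoted above.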

Test Coverage

Bit-exactness of the new CUDA op:

  • tests/unittest/_torch/attention/sparse/test_cpp_custom_ops.py::test_fused_cat_fp4_matches_deepgemm — parametrized over shapes (4, 128), (1, 32, 128), (3, 7, 128), (2, 5, 4, 128) and seeds {0, 42, 2026}. Uses torch.equal (not allclose) on both packed bytes and scale int32 vs tensorrt_llm.deep_gemm.utils.math.per_token_cast_to_fp4(..., use_ue8m0=True, gran_k=32, use_packed_ue8m0=True) — all 12 parametrizations pass.
  • test_fused_cat_fp4_shape_dispatch[64-64 / 32-96 / 16-112 / 96-32] — asymmetric pe/nope splits, shape and dtype sanity.
  • test_fused_cat_fp4_noncontiguous_split — op accepts non-contiguous views from torch.split() and matches the contiguous baseline.
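
For orientation, the sketch below shows the rough shape of the bit-exactness check described in the first bullet; the op signature is quoted from the description above, while the reference import path, its keyword arguments, and the pe/nope split order are assumptions that may not match the tree verbatim:

    # Hypothetical shape of the bit-exactness test; not the checked-in test file.
    import pytest
    import torch

    @pytest.mark.parametrize("shape", [(4, 128), (1, 32, 128), (3, 7, 128), (2, 5, 4, 128)])
    @pytest.mark.parametrize("seed", [0, 42, 2026])
    def test_fused_cat_fp4_matches_reference(shape, seed):
        torch.manual_seed(seed)
        x = torch.randn(*shape, dtype=torch.bfloat16, device="cuda")
        nope, pe = x[..., :64], x[..., 64:]  # split order assumed for illustration
        packed, scales = torch.ops.trtllm.fused_cat_fp4(pe, nope)

        # Reference call as quoted in the bullet above (kwargs assumed to match).
        from tensorrt_llm.deep_gemm.utils.math import per_token_cast_to_fp4
        ref_packed, ref_scales = per_token_cast_to_fp4(
            x.view(-1, 128), use_ue8m0=True, gran_k=32, use_packed_ue8m0=True)

        # torch.equal, not allclose: packed bytes and scale words must be bit-identical.
        assert torch.equal(packed.view(-1, 64), ref_packed.view(-1, 64))
        assert torch.equal(scales.view(-1), ref_scales.view(-1))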

Config plumbing:

  • test_dsa_fp4_indexer.py::test_indexer_k_dtype_survives_model_config_rebuild — new. Exercises the ModelConfig.from_pretrained DSV3.2 rebuild branch with a stub pretrained config, asserts indexer_k_dtype="fp4" round-trips, and statically greps ModelConfig.from_pretrained's source for the forwarding keyword so a future edit that drops it fails fast.
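
As a usage note, the knob this test protects is set as in the sketch below; the field name and allowed values come from this PR, while the import path and the ability to construct the config with only this field are assumptions:

    # Hypothetical usage; import path assumed, field name/values from the PR description.
    from tensorrt_llm.llmapi.llm_args import DeepSeekSparseAttentionConfig

    # Other sparse-attention fields omitted for brevity; default for indexer_k_dtype is "fp8".
    sparse_cfg = DeepSeekSparseAttentionConfig(indexer_k_dtype="fp4")
    # The model_validator added in round 1 rejects "fp4" on SM < 100 or index_head_dim != 128.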

FP4 indexer end-to-end on B200:

  • test_fp4_mqa_logits_shape_and_topk_intersection[32|64] — FP4 vs FP8 top-k overlap.
  • test_fp4_quantize_roundtrip_matches_bf16_kv — packing / scale recovery round-trip.
  • test_fp4_indexer_k_cache_per_token_size_drops_to_68_bytes — per-token KV-cache shrinkage from 132B to 68B.

FP8 indexer regression:

  • test_dsa_indexer.py — FP8 path unregressed; all parametrizations including next_n ∈ {1,2,3,4} now pass with the base scheduler metadata buffer.
  • test_cpp_custom_ops.py existing gather / scatter / convert_req_index tests still pass.

Full attention sweep:

  • tests/unittest/_torch/attention on B200 — no regressions from any of the four commits.

FP8 DSA smoke on a real model:

  • tests/integration/defs/accuracy/test_llm_api_pytorch.py::TestDeepSeekV32::test_fp8_blockscale[baseline] launched end-to-end on 8 × B200 with DeepSeek-V3.2-Exp-hf; model loaded and forward passes proceeded without error, confirming the FP8 fused_cat_fp8 path is unaffected (stopped after ramp-up to save time).

PR Checklist

Please review the following before submitting your PR:

  • PR description clearly explains what and why. If using CodeRabbit's summary, please make sure it makes sense.

  • PR Follows TRT-LLM CODING GUIDELINES to the best of your knowledge.

  • Test cases are provided for new code paths (see test instructions)

  • Any new dependencies have been scanned for license and vulnerabilities

  • CODEOWNERS updated if ownership changes

  • Documentation updated as needed

  • Update tava architecture diagram if there is a significant design change in PR.

  • The reviewers assigned automatically/manually are appropriate for the PR.

  • Please check this after reviewing the above items as appropriate for this PR.

GitHub Bot Help

To see a list of available CI bot commands, please comment /bot help.

@lfr-0531 lfr-0531 requested review from a team as code owners April 22, 2026 11:35
@lfr-0531 lfr-0531 changed the title from "[None][perf] Replace Triton FP4 indexer quantizer with fused_cat_fp4 CUDA op" to "[None][feat] Integrate FP4 indexer for DSA on Blackwell" on Apr 22, 2026
@lfr-0531 lfr-0531 requested a review from Barry-Delaney April 22, 2026 11:39
@coderabbitai
Contributor

coderabbitai Bot commented Apr 22, 2026

📝 Walkthrough


This pull request adds FP4 quantization support to the indexer K-cache system. It introduces new CUDA kernels for fused concatenation and FP4 quantization, updates configuration parameters throughout the stack to enable FP4 mode, generalizes kernel implementations to support variable head dimensions (128 for FP8 or 64 for FP4 packed), and adds comprehensive test coverage for the FP4 indexer path.

Changes

  • Dependency Update (3rdparty/fetch_content.json): Updated deepgemm git tag from 4ff3f54d... to c491439e... for FP4 support.
  • KV Cache Manager Configuration (cpp/include/tensorrt_llm/batch_manager/kvCacheManager.h, cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp): Extended WindowBlockManager, BlockManager, and KVCacheManager constructors and interfaces with indexerKCacheUseFp4 parameter. Modified WindowBlockManager::createIndexerKCachePools() to compute per-token pool size conditionally: scaleSize + mIndexerKCacheIndexHeadDim (FP8) vs scaleSize + mIndexerKCacheIndexHeadDim/2 (FP4 packed).
  • Executor State Serialization (cpp/include/tensorrt_llm/executor/dataTransceiverState.h, cpp/tensorrt_llm/executor/serialization.cpp, cpp/tensorrt_llm/batch_manager/cacheTransceiver.cpp): Added indexerKCacheUseFp4 boolean field to kv_cache::CacheState with constructor, equality, serialization, and accessor support. Updated serialization/deserialization to include the new flag in size calculations.
  • FP4 CUDA Kernels (cpp/tensorrt_llm/kernels/fusedCatFp4.h, cpp/tensorrt_llm/kernels/fusedCatFp4.cu): New kernel implementation for fused concatenation of two BF16 row tensors into FP4-packed output with per-block quantization scale. Validates head_dim==128, enforces vectorization alignment, and performs 4-bit quantization via boundary-matching comparators.
  • Indexer Kernel Generalization (cpp/tensorrt_llm/kernels/indexerKCacheGather.cu, cpp/tensorrt_llm/kernels/indexerKCacheScatter.cu): Replaced fixed head_dim==128 assertions and THREADS_PER_BLOCK=32 constants with runtime validation allowing head_dim of 128 (FP8) or 64 (FP4 packed). Compute threads_per_block = head_dim / VEC_SIZE dynamically for kernel launch configuration.
  • PyTorch Operator Bindings (cpp/tensorrt_llm/thop/IndexerKCacheGatherOp.cpp, cpp/tensorrt_llm/thop/IndexerKCacheScatterOp.cpp, cpp/tensorrt_llm/thop/fusedCatFp4Op.cpp, cpp/tensorrt_llm/thop/CMakeLists.txt): Added head_dim runtime parameter to indexer_k_cache_gather_op. Expanded validation in scatter op to accept FP4 configurations. New fused_cat_fp4 PyTorch operator with CUDA backend invocation. Updated CMakeLists to compile new FP4 operator source.
  • Deep GEMM Build Integration (cpp/tensorrt_llm/deep_gemm/CMakeLists.txt): Added rewrite rule to map from .. import _C to import tensorrt_llm.deep_gemm_cpp_tllm as _C for Python source files in the deep_gemm subtree.
  • Python Custom Ops Registration (tensorrt_llm/_torch/custom_ops/cpp_custom_ops.py): Registered fake implementations for new fused_cat_fp4 operator and updated indexer_k_cache_gather_op to include head_dim parameter in fake shape inference.
  • Configuration Schema (tensorrt_llm/llmapi/llm_args.py): Added indexer_k_dtype: Literal["fp8", "fp4"] field to DeepSeekSparseAttentionConfig with default "fp8".
  • Resource Manager (tensorrt_llm/_torch/pyexecutor/resource_manager.py): Extended KVCacheManager.__init__ to accept indexer_k_cache_use_fp4 parameter and propagate it to underlying KVCacheManagerCpp configuration.
  • DSA Sparse Attention Implementation (tensorrt_llm/_torch/attention_backend/sparse/dsa.py): Major refactoring: imported FP4-capable DeepGEMM variants, added pre-allocated 2D kv_lens_cuda_2d buffer for new API, introduced FP4 dispatch control via self.use_fp4, extended indexer TopK pipeline to propagate q_scale, updated quantization logic to conditionally use fused_cat_fp4 for FP4 mode, adjusted cache pool sizing for FP4 packed layout, and modified kernel metadata generation for 2D context_lens.
  • Attention Module Integration (tensorrt_llm/_torch/modules/attention.py): Updated mla_layer.forward_dsa_proj and downstream unpacking to capture and propagate q_scale from pre_indexer_proj, which the FP4 kernel requires as a scale tensor.
  • Test Coverage (tests/unittest/_torch/attention/sparse/test_cpp_custom_ops.py, tests/unittest/_torch/attention/sparse/test_dsa_fp4_indexer.py, tests/unittest/_torch/attention/sparse/test_dsa_indexer.py): Extended existing custom ops tests to include head_dim parameter. Added comprehensive new FP4 indexer test module with bit-exact quantization verification, roundtrip correctness checks, memory footprint validation, and JIT compilation latency probing. Updated DSA indexer tests to use 2D kv_lens_cuda_2d buffer with new DeepGEMM API.

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~50 minutes

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

  • Docstring Coverage ⚠️ Warning: Docstring coverage is 46.88%, below the required threshold of 80.00%. Resolution: write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (4 passed)

  • Title check ✅ Passed: The title clearly summarizes the main change: replacing a Triton FP4 quantizer with a fused CUDA op for DSA on Blackwell, which is the core focus of the second commit and PR.
  • Linked Issues check ✅ Passed: Check skipped because no linked issues were found for this pull request.
  • Out of Scope Changes check ✅ Passed: Check skipped because no linked issues were found for this pull request.
  • Description check ✅ Passed: PR description is comprehensive and follows the template structure with clear sections for Description, Test Coverage, and PR Checklist.



Warning

Review ran into problems

🔥 Problems

Git: Failed to clone repository. Please run the @coderabbitai full review command to re-trigger a full review. If the issue persists, set path_filters to include or exclude specific files.


Comment @coderabbitai help to get the list of available commands and usage tips.

Contributor

@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 12

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (6)
cpp/tensorrt_llm/batch_manager/cacheTransceiver.cpp (1)

2-2: ⚠️ Potential issue | 🟡 Minor

Update the copyright year on this modified source file.

The file was changed in this PR but still shows 2025.

🛠️ Proposed fix
- * SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+ * SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.

As per coding guidelines: “Add NVIDIA copyright header on ALL new files and update year on modified files.”

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cpp/tensorrt_llm/batch_manager/cacheTransceiver.cpp` at line 2, Update the
copyright header at the top of cacheTransceiver.cpp to reflect the current
modification year (replace "2025" with the correct year) so the NVIDIA copyright
header is current for this modified file; ensure the header text and SPDX line
remain otherwise unchanged and match the project's standard header format used
in other files.
tensorrt_llm/llmapi/llm_args.py (1)

1-1: ⚠️ Potential issue | 🟠 Major

Add the required NVIDIA source header to this modified Python file.

This file was modified but is missing the mandatory NVIDIA copyright/license header.

🛠️ Proposed fix
+ # SPDX-FileCopyrightText: Copyright (c) 2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+ # SPDX-License-Identifier: Apache-2.0
+
  import ast

As per coding guidelines: “All TensorRT-LLM source files must contain an NVIDIA copyright header with the year of latest meaningful modification.”

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tensorrt_llm/llmapi/llm_args.py` at line 1, Add the mandatory NVIDIA
copyright/license source header to the top of the modified module llm_args.py
(before any imports); ensure the header matches the project's standard NVIDIA
header used in other TensorRT-LLM files and includes the correct year of latest
meaningful modification and full license text, then save the file so the header
precedes the existing "import ast" statement.
tensorrt_llm/_torch/attention_backend/sparse/dsa.py (3)

1-1: ⚠️ Potential issue | 🟡 Minor

Add the required NVIDIA copyright header to this modified source file.

This file is part of the changed surface, but it still starts directly with the module docstring.

As per coding guidelines, "All TensorRT-LLM source files must contain an NVIDIA copyright header with the year of latest meaningful modification".

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tensorrt_llm/_torch/attention_backend/sparse/dsa.py` at line 1, Add the
required NVIDIA copyright header at the top of the modified source file (before
the module docstring) so the file begins with the standard multi-line copyright
comment including "NVIDIA" and the year of latest meaningful modification;
update the header year to the current modification year and ensure it precedes
the existing module docstring in
tensorrt_llm._torch.attention_backend.sparse.dsa (the top of the file).

76-78: ⚠️ Potential issue | 🔴 Critical

Use the FP4 packed byte width in slot-mapping math.

block_stride and scale_base_offset still assume head_dim data bytes per token. In FP4 mode the cache view is head_dim // 2 + 4 bytes per token (get_indexer_k_cache_buffers() at Lines 2194-2198), so these offsets are too large and the scatter/gather path will address the wrong locations.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tensorrt_llm/_torch/attention_backend/sparse/dsa.py` around lines 76 - 78,
The current math uses head_dim bytes per token when computing scale_size,
block_stride, and scale_base_offset, but FP4 packs two values per byte so the
cache layout uses head_dim//2 + 4 bytes per token; update the calculations to
use a data_bytes_per_token variable that is head_dim for normal modes and
head_dim // 2 for FP4 mode, then compute scale_size (based on quant_block_size)
and set block_stride = tokens_per_block * (data_bytes_per_token + scale_size)
and scale_base_offset = tokens_per_block * data_bytes_per_token; refer to
get_indexer_k_cache_buffers() to detect/align with the FP4 layout and use the
same packed-byte logic for FP4.

2207-2258: ⚠️ Potential issue | 🟠 Major

Update the cache-size estimators for FP4.

Both sizing helpers still charge the indexer cache as index_head_dim + scale_bytes per token. The runtime layout is only index_head_dim // 2 + 4 bytes in FP4 mode, so capacity planning overestimates KV usage and under-admits requests.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tensorrt_llm/_torch/attention_backend/sparse/dsa.py` around lines 2207 -
2258, Both cache-size helpers overcount indexer K-cache for FP4; update
get_cache_size_per_token and get_cache_bytes_per_token to use the FP4 runtime
layout (index_head_dim//2 + 4) instead of the current index_head_dim +
(index_head_dim // quant_block_size * 4) when FP4 is active. In
get_cache_size_per_token detect FP4 via model_config.quant_config /
model_config.quant_config.quant_mode (same check used elsewhere) and compute
head_dim_factor = (index_head_dim//2 + 4) / head_dim for FP4, otherwise keep the
existing formula; in get_cache_bytes_per_token branch on self.dtype ==
DataType.NVFP4 and replace the head_dim_factor calculation with
(self.index_head_dim//2 + 4) / self.head_dim, leaving the rest (kv_factor, size
conversion, and calculate_scaling_factor_size_bytes) unchanged.
cpp/tensorrt_llm/executor/serialization.cpp (1)

2-2: ⚠️ Potential issue | 🟡 Minor

Update the copyright year in this modified source file.

This file was meaningfully modified in this PR, but the header still lists 2025 only.

🛠️ Suggested header update
- * SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+ * SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.

As per coding guidelines: "Add NVIDIA copyright header on ALL new files and update year on modified files."

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cpp/tensorrt_llm/executor/serialization.cpp` at line 2, The file header in
serialization.cpp still lists only "2025" but this file was modified; update the
copyright header to include the current year (e.g., change "2025" to "2025-2026"
or "2026") so the NVIDIA copyright header reflects the modification; locate the
top-of-file comment block in serialization.cpp and adjust the year range
accordingly.
🧹 Nitpick comments (2)
cpp/tensorrt_llm/batch_manager/cacheTransceiver.cpp (1)

150-154: Add parameter-name comments to this long positional constructor call.

This call has multiple non-obvious positional values (especially bools); inline param comments will reduce maintenance risk.

♻️ Proposed refactor
-    mCacheState
-        = std::make_unique<executor::kv_cache::CacheState>(cacheStateModelCfg, worldConfig, attentionLayerNumPerPP,
-            dataType, attentionType, kvFactor, cacheManager->isEnableBlockReuse(), cacheManager->isEnablePartialReuse(),
-            cacheManager->isEnableIndexerKCache(), cacheManager->getIndexerKCacheIndexHeadDim(),
-            cacheManager->getIndexerKCacheQuantBlockSize(), cacheManager->getIndexerKCacheUseFp4());
+    mCacheState = std::make_unique<executor::kv_cache::CacheState>(
+        /*modelConfig=*/cacheStateModelCfg,
+        /*worldConfig=*/worldConfig,
+        /*attentionLayerNumPerPP=*/attentionLayerNumPerPP,
+        /*dataType=*/dataType,
+        /*attentionType=*/attentionType,
+        /*kvFactor=*/kvFactor,
+        /*enableBlockReuse=*/cacheManager->isEnableBlockReuse(),
+        /*enablePartialReuse=*/cacheManager->isEnablePartialReuse(),
+        /*hasIndexerKCache=*/cacheManager->isEnableIndexerKCache(),
+        /*indexerDimPerHead=*/cacheManager->getIndexerKCacheIndexHeadDim(),
+        /*indexerKCacheQuantBlockSize=*/cacheManager->getIndexerKCacheQuantBlockSize(),
+        /*indexerKCacheUseFp4=*/cacheManager->getIndexerKCacheUseFp4());

As per coding guidelines: “In C++ function calls with non-obvious parameters, use inline C comments with the format /*paramName=*/ to document parameters.”

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cpp/tensorrt_llm/batch_manager/cacheTransceiver.cpp` around lines 150 - 154,
The long positional call creating mCacheState using
executor::kv_cache::CacheState should document non-obvious parameters with
inline C comments; update the call to annotate each argument (especially the
bool-returning cacheManager methods) using the /*paramName=*/ style so readers
know what each value means — e.g., label cacheStateModelCfg, worldConfig,
attentionLayerNumPerPP, dataType, attentionType, kvFactor, and the cacheManager
calls as /*enableBlockReuse=*/ cacheManager->isEnableBlockReuse(),
/*enablePartialReuse=*/ cacheManager->isEnablePartialReuse(),
/*enableIndexerKCache=*/ cacheManager->isEnableIndexerKCache(),
/*indexerKCacheIndexHeadDim=*/ cacheManager->getIndexerKCacheIndexHeadDim(),
/*indexerKCacheQuantBlockSize=*/ cacheManager->getIndexerKCacheQuantBlockSize(),
/*indexerKCacheUseFp4=*/ cacheManager->getIndexerKCacheUseFp4() so the
CacheState(...) constructor arguments are explicit and maintainable.
cpp/tensorrt_llm/kernels/indexerKCacheGather.cu (1)

143-147: Prefer named constants for head-dim and scale literals.

Please replace inline literals (128, 64, 4) with named constants in this check block.

♻️ Suggested cleanup
     constexpr int32_t VEC_SIZE = 4;
+    constexpr int32_t kFp8HeadDim = 128;
+    constexpr int32_t kFp4PackedHeadDim = 64;
+    constexpr int32_t kScaleBytes = 4;
-    TLLM_CHECK_WITH_INFO(head_dim == 128 || head_dim == 64,
+    TLLM_CHECK_WITH_INFO(head_dim == kFp8HeadDim || head_dim == kFp4PackedHeadDim,
         "head_dim must be 128 (FP8) or 64 (FP4 packed) for the indexer cache (got %d)", head_dim);
     TLLM_CHECK_WITH_INFO(head_dim % VEC_SIZE == 0, "head_dim (%d) must be a multiple of %d", head_dim, VEC_SIZE);
-    TLLM_CHECK_WITH_INFO(scale_size == 4,
+    TLLM_CHECK_WITH_INFO(scale_size == kScaleBytes,
         "scale_size must equal 4 bytes (packed UE8M0 x4 for FP4, 1 float32 for FP8, got %d)", scale_size);

As per coding guidelines: "Except for 0, nullptr, true, and false, all other literal values in C++ should only be used for variable initialization; use named constants instead."

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cpp/tensorrt_llm/kernels/indexerKCacheGather.cu` around lines 143 - 147,
Replace the magic literals in the TLLM_CHECK_WITH_INFO calls with named
constants: define constants like kHeadDimFp8 = 128, kHeadDimFp4Packed = 64 and
kScaleSizeBytes = 4 (in this translation unit or the appropriate header) and use
them in the checks and formatted messages instead of raw numbers; update the
conditions (head_dim == kHeadDimFp8 || head_dim == kHeadDimFp4Packed, head_dim %
VEC_SIZE == 0, scale_size == kScaleSizeBytes) and the error strings to reference
the constant names so the checks in indexerKCacheGather.cu (the
TLLM_CHECK_WITH_INFO calls referencing head_dim, VEC_SIZE, scale_size) no longer
contain hard-coded literals.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@cpp/include/tensorrt_llm/batch_manager/kvCacheManager.h`:
- Around line 2108-2109: The new boolean flag indexerKCacheUseFp4 is declared on
the KVCacheManager constructor signature but not threaded through the remaining
overloaded constructors and call sites; update every KVCacheManager constructor
overload (the signatures that include SizeType32 indexerKCacheQuantBlockSize and
indexerKCacheIndexHeadDim) to accept and forward the indexerKCacheUseFp4
parameter, and update the production instantiation in
trtGptModelInflightBatching.cpp (the KVCacheManager construction around the
block previously at lines ~686-694) to pass the appropriate indexerKCacheUseFp4
argument (and the related indexerKCacheQuantBlockSize/indexerKCacheIndexHeadDim
where present) so the FP4 path is not silently disabled.

In `@cpp/include/tensorrt_llm/executor/dataTransceiverState.h`:
- Around line 56-57: Update the copyright header block in the top-of-file
comment for tensorrt_llm/executor/dataTransceiverState.h to include 2026
(replace the current 2024 year with 2026); ensure the header matches the NVIDIA
copyright format used across the project so the file-level comment reflects the
latest meaningful modification.
- Around line 114-119: operator== in dataTransceiverState (kv_cache::CacheState)
currently omits several layout-defining fields so states with different cache
layouts compare equal; update the operator== implementation (the method named
operator== in dataTransceiverState.h) to also compare mEnableBlockReuse,
mEnablePartialReuse, mHasIndexerKCache, mIndexerDimPerHead, and
mIndexerKCacheQuantBlockSize (i.e., include all layout-defining members used to
determine cache layout) so any differing cache-layout-related fields cause
inequality.

In `@cpp/tensorrt_llm/kernels/fusedCatFp4.cu`:
- Around line 188-202: Summary: The kernel uses *reinterpret_cast<int2
const*>(src + col) which requires 8-byte alignment of the base pointers; current
checks only validate strides not base-address alignment. Fix: In the
launcher/wrapper that prepares inputs for fusedCatFp4 (where pe.data_ptr() and
nope.data_ptr() are passed), add runtime checks that
reinterpret_cast<uintptr_t>(pe.data_ptr()) % 8 == 0 and
reinterpret_cast<uintptr_t>(nope.data_ptr()) % 8 == 0 before launching
fusedCatFp4; if either base pointer is not 8-byte aligned, either return an
error or take the scalar-safe fallback path (e.g., a separate kernel or code
path that reads elements individually instead of using int2 loads). Ensure the
new checks live alongside the existing dimension/stride checks and reference the
same symbols (pe.data_ptr(), nope.data_ptr(), fusedCatFp4 kernel launch).

In `@cpp/tensorrt_llm/thop/fusedCatFp4Op.cpp`:
- Around line 27-71: In fused_cat_fp4, add a device check and a device guard:
TORCH_CHECK that nope.device() == pe.device() (and that pe.is_cuda()) to reject
mixed-device inputs, then create an at::cuda::CUDAGuard (or at::DeviceGuard)
scoped to pe.device() before calling at::cuda::getCurrentCUDAStream(...) and
invoking tensorrt_llm::kernels::invokeFusedCatFp4 so the kernel launch and raw
data_ptr() access occur on the correct CUDA device/stream.

In `@tensorrt_llm/_torch/attention_backend/sparse/dsa.py`:
- Around line 1957-1970: The FP4 branch reshapes q_fp8 but never reshapes
q_scale, so scales returned by torch.ops.trtllm.fused_cat_fp4 (shape [num_tokens
* n_heads, 1]) are left in token*head-major layout and later indexing/views
break; fix by reshaping q_scale in the same branch (e.g., after q_fp8 =
q_fp8.view(-1, self.n_heads, self.head_dim // 2) add q_scale = q_scale.view(-1,
self.n_heads, 1)) so callers (and any later _weight_scale usage) see q_scale in
[num_tokens, n_heads, 1] token-major format while preserving the existing
weights logic (weights *= self.weight_scale_factor).

In `@tensorrt_llm/_torch/modules/attention.py`:
- Around line 1752-1757: The fake custom-op implementation _mla_dsa_proj_fake
currently models the old 8-output contract while
forward_dsa_proj/forward_dsa_attn now return nine outputs (with q_scale as the
9th and the FP4 path expecting 5 tensors passed into mla_dsa_attn_inplace via
indexer_intermediates); update _mla_dsa_proj_fake to return the new 9-tensor
tuple (and ensure the list/tuple used for indexer_intermediates reflects five
runtime tensors for the FP4/FP8 path), and update the surrounding
docstrings/comments to state the new 9/5 contract so torch.compile shape
propagation matches runtime (adjust any unpacking in callers of
_mla_dsa_proj_fake, e.g., where indexer_intermediates is built/consumed, to
match the new ordering).

In `@tests/unittest/_torch/attention/sparse/test_cpp_custom_ops.py`:
- Around line 168-170: Parametrize the gather tests to run with head_dim values
128 and 64 so the FP4-packed branch in indexer_k_cache_gather_op is exercised:
replace the hardcoded HEAD_DIM usage in the test(s) that call
torch.ops.trtllm.indexer_k_cache_gather_op with a parameter (e.g., head_dim) and
compute per_token_size and slot_scale from that head_dim (derive per_token_size
= head_dim / some_unit used in the op and set slot_scale accordingly) before
constructing k_cache/slot_fp8/slot_scale; update the three affected call sites
around the test lines (168-170, 235-237, 261-263) to use the parameterized
head_dim so both 128 and 64 cases run.

In `@tests/unittest/_torch/attention/sparse/test_dsa_fp4_indexer.py`:
- Around line 198-224: The test
test_fp4_indexer_k_cache_per_token_size_drops_to_68_bytes currently only
recomputes constants and never exercises the implementation; replace the
synthetic assertions with an integration check that constructs the real FP4
cache manager and inspects the allocated pools: instantiate or obtain
WindowBlockManager and/or DSACacheManager, call createIndexerKCachePools (or
get_indexer_k_cache_buffers) with index_head_dim=128 and quant_block_size=128,
retrieve the actual buffer/stride/bytes-per-token from the returned pool objects
or buffers, and assert those runtime values equal the expected 132 (FP8) and 68
(FP4) and the shrink ratio; if constructing the real managers is heavy, use a
minimal fixture or monkeypatch to run the real allocation code instead of
recomputing literals so the test will fail on regressions.
- Around line 230-308: The test
test_fp4_paged_mqa_logits_jit_first_compile_latency currently only logs JIT
timings and must be turned into a perf-sanity check: add an assertion that fails
when jit_overhead (first_ms - warm_ms) exceeds a small, documented threshold
(parameterize by next_n) so regressions block CI, and make the threshold
configurable via environment or pytest marker; update the test to record the
measured values in the authoritative B200 perf DB entry and add the matching QA
perf list entry so this case is tracked by scheduled/perf jobs (ensure you
update the test metadata that references the B200 path and QA list accordingly).
- Around line 29-34: The current broad except hides real import/runtime errors;
change the import block for tensorrt_llm.deep_gemm so only import failures are
caught and attribute checks are done separately: wrap "from tensorrt_llm import
deep_gemm" in a try/except ImportError (or ModuleNotFoundError) and set
HAS_DEEP_GEMM = False on import failure; if import succeeds (else branch) set
HAS_DEEP_GEMM = hasattr(deep_gemm, "fp8_fp4_mqa_logits") so unexpected
exceptions during import or attribute access are allowed to surface instead of
being swallowed by a bare except.

In `@tests/unittest/_torch/attention/sparse/test_dsa_indexer.py`:
- Around line 460-470: The mock currently fills only kv_lens_cuda_2d and leaves
scheduler_metadata_buffer zeroed; update the mock to mirror production by
copying the same 2D slice into scheduler_metadata_buffer using the same shape
and dtype as done by get_paged_mqa_logits_metadata (i.e., take
gen_kv_lens.unsqueeze(-1).expand(-1, next_n_cap) or the equivalent 2D view and
write it into scheduler_metadata_buffer[:num_generations, :next_n_cap]),
ensuring the expanded path does the same so
test_indexer_decode_with_paged_kv_cache sees the populated scheduler metadata.

---

Outside diff comments:
In `@cpp/tensorrt_llm/batch_manager/cacheTransceiver.cpp`:
- Line 2: Update the copyright header at the top of cacheTransceiver.cpp to
reflect the current modification year (replace "2025" with the correct year) so
the NVIDIA copyright header is current for this modified file; ensure the header
text and SPDX line remain otherwise unchanged and match the project's standard
header format used in other files.

In `@cpp/tensorrt_llm/executor/serialization.cpp`:
- Line 2: The file header in serialization.cpp still lists only "2025" but this
file was modified; update the copyright header to include the current year
(e.g., change "2025" to "2025-2026" or "2026") so the NVIDIA copyright header
reflects the modification; locate the top-of-file comment block in
serialization.cpp and adjust the year range accordingly.

In `@tensorrt_llm/_torch/attention_backend/sparse/dsa.py`:
- Line 1: Add the required NVIDIA copyright header at the top of the modified
source file (before the module docstring) so the file begins with the standard
multi-line copyright comment including "NVIDIA" and the year of latest
meaningful modification; update the header year to the current modification year
and ensure it precedes the existing module docstring in
tensorrt_llm._torch.attention_backend.sparse.dsa (the top of the file).
- Around line 76-78: The current math uses head_dim bytes per token when
computing scale_size, block_stride, and scale_base_offset, but FP4 packs two
values per byte so the cache layout uses head_dim//2 + 4 bytes per token; update
the calculations to use a data_bytes_per_token variable that is head_dim for
normal modes and head_dim // 2 for FP4 mode, then compute scale_size (based on
quant_block_size) and set block_stride = tokens_per_block *
(data_bytes_per_token + scale_size) and scale_base_offset = tokens_per_block *
data_bytes_per_token; refer to get_indexer_k_cache_buffers() to detect/align
with the FP4 layout and use the same packed-byte logic for FP4.
- Around line 2207-2258: Both cache-size helpers overcount indexer K-cache for
FP4; update get_cache_size_per_token and get_cache_bytes_per_token to use the
FP4 runtime layout (index_head_dim//2 + 4) instead of the current index_head_dim
+ (index_head_dim // quant_block_size * 4) when FP4 is active. In
get_cache_size_per_token detect FP4 via model_config.quant_config /
model_config.quant_config.quant_mode (same check used elsewhere) and compute
head_dim_factor = (index_head_dim//2 + 4) / head_dim for FP4, otherwise keep the
existing formula; in get_cache_bytes_per_token branch on self.dtype ==
DataType.NVFP4 and replace the head_dim_factor calculation with
(self.index_head_dim//2 + 4) / self.head_dim, leaving the rest (kv_factor, size
conversion, and calculate_scaling_factor_size_bytes) unchanged.

In `@tensorrt_llm/llmapi/llm_args.py`:
- Line 1: Add the mandatory NVIDIA copyright/license source header to the top of
the modified module llm_args.py (before any imports); ensure the header matches
the project's standard NVIDIA header used in other TensorRT-LLM files and
includes the correct year of latest meaningful modification and full license
text, then save the file so the header precedes the existing "import ast"
statement.

---

Nitpick comments:
In `@cpp/tensorrt_llm/batch_manager/cacheTransceiver.cpp`:
- Around line 150-154: The long positional call creating mCacheState using
executor::kv_cache::CacheState should document non-obvious parameters with
inline C comments; update the call to annotate each argument (especially the
bool-returning cacheManager methods) using the /*paramName=*/ style so readers
know what each value means — e.g., label cacheStateModelCfg, worldConfig,
attentionLayerNumPerPP, dataType, attentionType, kvFactor, and the cacheManager
calls as /*enableBlockReuse=*/ cacheManager->isEnableBlockReuse(),
/*enablePartialReuse=*/ cacheManager->isEnablePartialReuse(),
/*enableIndexerKCache=*/ cacheManager->isEnableIndexerKCache(),
/*indexerKCacheIndexHeadDim=*/ cacheManager->getIndexerKCacheIndexHeadDim(),
/*indexerKCacheQuantBlockSize=*/ cacheManager->getIndexerKCacheQuantBlockSize(),
/*indexerKCacheUseFp4=*/ cacheManager->getIndexerKCacheUseFp4() so the
CacheState(...) constructor arguments are explicit and maintainable.

In `@cpp/tensorrt_llm/kernels/indexerKCacheGather.cu`:
- Around line 143-147: Replace the magic literals in the TLLM_CHECK_WITH_INFO
calls with named constants: define constants like kHeadDimFp8 = 128,
kHeadDimFp4Packed = 64 and kScaleSizeBytes = 4 (in this translation unit or the
appropriate header) and use them in the checks and formatted messages instead of
raw numbers; update the conditions (head_dim == kHeadDimFp8 || head_dim ==
kHeadDimFp4Packed, head_dim % VEC_SIZE == 0, scale_size == kScaleSizeBytes) and
the error strings to reference the constant names so the checks in
indexerKCacheGather.cu (the TLLM_CHECK_WITH_INFO calls referencing head_dim,
VEC_SIZE, scale_size) no longer contain hard-coded literals.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro Plus

Run ID: a1585573-a2ff-41c4-9421-166d5d822372

📥 Commits

Reviewing files that changed from the base of the PR and between 62ce575 and c90c56e.

📒 Files selected for processing (24)
  • 3rdparty/fetch_content.json
  • cpp/include/tensorrt_llm/batch_manager/kvCacheManager.h
  • cpp/include/tensorrt_llm/executor/dataTransceiverState.h
  • cpp/tensorrt_llm/batch_manager/cacheTransceiver.cpp
  • cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp
  • cpp/tensorrt_llm/deep_gemm/CMakeLists.txt
  • cpp/tensorrt_llm/executor/serialization.cpp
  • cpp/tensorrt_llm/kernels/fusedCatFp4.cu
  • cpp/tensorrt_llm/kernels/fusedCatFp4.h
  • cpp/tensorrt_llm/kernels/indexerKCacheGather.cu
  • cpp/tensorrt_llm/kernels/indexerKCacheScatter.cu
  • cpp/tensorrt_llm/nanobind/batch_manager/kvCacheManager.cpp
  • cpp/tensorrt_llm/thop/CMakeLists.txt
  • cpp/tensorrt_llm/thop/IndexerKCacheGatherOp.cpp
  • cpp/tensorrt_llm/thop/IndexerKCacheScatterOp.cpp
  • cpp/tensorrt_llm/thop/fusedCatFp4Op.cpp
  • tensorrt_llm/_torch/attention_backend/sparse/dsa.py
  • tensorrt_llm/_torch/custom_ops/cpp_custom_ops.py
  • tensorrt_llm/_torch/modules/attention.py
  • tensorrt_llm/_torch/pyexecutor/resource_manager.py
  • tensorrt_llm/llmapi/llm_args.py
  • tests/unittest/_torch/attention/sparse/test_cpp_custom_ops.py
  • tests/unittest/_torch/attention/sparse/test_dsa_fp4_indexer.py
  • tests/unittest/_torch/attention/sparse/test_dsa_indexer.py

@lfr-0531 lfr-0531 requested a review from a team as a code owner April 22, 2026 15:00
@lfr-0531
Collaborator Author

/bot run --disable-fail-fast

@tensorrt-cicd
Collaborator

PR_Github #44980 [ run ] triggered by Bot. Commit: eac155a Link to invocation

@Barry-Delaney
Collaborator

The two YAML files under https://github.com/NVIDIA/TensorRT-LLM/tree/main/scripts/attribution/data need to be updated.
We can re-run the script to update them.

@lfr-0531 lfr-0531 requested a review from a team as a code owner April 23, 2026 07:13
lfr-0531 added a commit to lfr-0531/TensorRT-LLM that referenced this pull request Apr 23, 2026
Re-ran scripts/attribute.py against the Ninja build produced from this
branch. The scanner picked up the dependency versions this PR introduces
(or that landed on main since the last attribution refresh) and added:

  - deepgemm/c491439ed5966833d56883ca302b6f72e74f8105 (the upgrade this
    PR pulls in via 3rdparty/fetch_content.json)
  - cutlass/v4.4.2
  - cuda/13.2
  - nccl/2.29.2-1+cuda13.1

plus the matching file-hash entries in files_to_dependency.yml and three
new content-addressable license/copyright blobs under data/cas/.

No manual edits to the YAML/CAS files — everything is a straight
regeneration from `python scripts/attribute.py --build-dir cpp/build`.

Addresses PR feedback from Barry-Delaney:
NVIDIA#13340 (comment)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: Fanrong Li <23290157+lfr-0531@users.noreply.github.com>
@lfr-0531
Collaborator Author

/bot run --disable-fail-fast

@tensorrt-cicd
Collaborator

PR_Github #45181 [ run ] triggered by Bot. Commit: bbd7b05 Link to invocation

@tensorrt-cicd
Collaborator

PR_Github #45181 [ run ] completed with state FAILURE. Commit: bbd7b05
/LLM/main/L0_MergeRequest_PR pipeline #35457 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

Link to invocation

@lfr-0531
Collaborator Author

/bot run --disable-fail-fast

@tensorrt-cicd
Collaborator

PR_Github #45296 [ run ] triggered by Bot. Commit: 8a060c2 Link to invocation

@tensorrt-cicd
Collaborator

PR_Github #45296 [ run ] completed with state SUCCESS. Commit: 8a060c2
/LLM/main/L0_MergeRequest_PR pipeline #35550 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

Link to invocation

@lfr-0531
Collaborator Author

/bot run --disable-fail-fast

@tensorrt-cicd
Collaborator

PR_Github #45494 [ run ] triggered by Bot. Commit: 8a060c2 Link to invocation

@Barry-Delaney Barry-Delaney force-pushed the user/fanrongl/fp4_indexer branch from edf9dd7 to 102e6be on May 6, 2026 13:43
@Barry-Delaney
Collaborator

/bot run

@tensorrt-cicd
Collaborator

PR_Github #47010 [ run ] triggered by Bot. Commit: 102e6be Link to invocation

@tensorrt-cicd
Collaborator

PR_Github #47010 [ run ] completed with state SUCCESS. Commit: 102e6be
/LLM/main/L0_MergeRequest_PR pipeline #36985 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

CI Agent Failure Analysis

Link to invocation

@pcastonguay
Collaborator

/bot run --disable-fail-fast

@tensorrt-cicd
Collaborator

PR_Github #47024 [ run ] triggered by Bot. Commit: 102e6be Link to invocation

@mikeiovine
Collaborator

/bot help

@github-actions

github-actions Bot commented May 6, 2026

GitHub Bot Help

/bot [-h] ['run', 'kill', 'skip', 'reuse-pipeline'] ...

Provide a user friendly way for developers to interact with a Jenkins server.

Run /bot [-h|--help] to print this help message.

See details below for each supported subcommand.

Details

run [--reuse-test (optional)pipeline-id --disable-fail-fast --skip-test --stage-list "A10-PyTorch-1, xxx" --gpu-type "A30, H100_PCIe" --test-backend "pytorch, cpp" --add-multi-gpu-test --only-multi-gpu-test --disable-multi-gpu-test --post-merge --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx" --detailed-log --debug(experimental) --high-priority]

Launch build/test pipelines. All previously running jobs will be killed.

--reuse-test (optional)pipeline-id (OPTIONAL) : Allow the new pipeline to reuse build artifacts and skip successful test stages from a specified pipeline or the last pipeline if no pipeline-id is indicated. If the Git commit ID has changed, this option will be always ignored. The DEFAULT behavior of the bot is to reuse build artifacts and successful test results from the last pipeline.

--disable-reuse-test (OPTIONAL) : Explicitly prevent the pipeline from reusing build artifacts and skipping successful test stages from a previous pipeline. Ensure that all builds and tests are run regardless of previous successes.

--disable-fail-fast (OPTIONAL) : Disable fail fast on build/tests/infra failures.

--skip-test (OPTIONAL) : Skip all test stages, but still run build stages, package stages and sanity check stages. Note: Does NOT update GitHub check status.

--stage-list "A10-PyTorch-1, xxx" (OPTIONAL) : Only run the specified test stages. Supports wildcard * for pattern matching (e.g., "*PerfSanity*" matches all stages containing PerfSanity). Examples: "A10-PyTorch-1, xxx", "PerfSanity". Note: Does NOT update GitHub check status.

--gpu-type "A30, H100_PCIe" (OPTIONAL) : Only run the test stages on the specified GPU types. Examples: "A30, H100_PCIe". Note: Does NOT update GitHub check status.

--test-backend "pytorch, cpp" (OPTIONAL) : Skip test stages which don't match the specified backends. Only support [pytorch, cpp, tensorrt, triton]. Examples: "pytorch, cpp" (does not run test stages with tensorrt or triton backend). Note: Does NOT update GitHub pipeline status.

--only-multi-gpu-test (OPTIONAL) : Only run the multi-GPU tests. Note: Does NOT update GitHub check status.

--disable-multi-gpu-test (OPTIONAL) : Disable the multi-GPU tests. Note: Does NOT update GitHub check status.

--add-multi-gpu-test (OPTIONAL) : Force run the multi-GPU tests in addition to running L0 pre-merge pipeline.

--post-merge (OPTIONAL) : Run the L0 post-merge pipeline instead of the ordinary L0 pre-merge pipeline.

--extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx" (OPTIONAL) : Run the ordinary L0 pre-merge pipeline and specified test stages. Supports wildcard * for pattern matching. Examples: --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx", --extra-stage "Post-Merge".

--detailed-log (OPTIONAL) : Enable flushing out all logs to the Jenkins console. This will significantly increase the log volume and may slow down the job.

--debug (OPTIONAL) : Experimental feature. Enable access to the CI container for debugging purpose. Note: Specify exactly one stage in the stage-list parameter to access the appropriate container environment. Note: Does NOT update GitHub check status.

--high-priority (OPTIONAL) : Run the pipeline with high priority. This option is restricted to authorized users only and will route the job to a high-priority queue.

kill

kill

Kill all running builds associated with pull request.

skip

skip --comment COMMENT

Skip testing for latest commit on pull request. --comment "Reason for skipping build/test" is required. IMPORTANT NOTE: This is dangerous since lack of user care and validation can cause top of tree to break.

reuse-pipeline

reuse-pipeline

Reuse a previous pipeline to validate current commit. This action will also kill all currently running builds associated with the pull request. IMPORTANT NOTE: This is dangerous since lack of user care and validation can cause top of tree to break.

@tensorrt-cicd
Collaborator

PR_Github #47024 [ run ] completed with state SUCCESS. Commit: 102e6be
/LLM/main/L0_MergeRequest_PR pipeline #36999 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

CI Agent Failure Analysis

Link to invocation

@lfr-0531
Collaborator Author

lfr-0531 commented May 7, 2026

/bot run --disable-fail-fast

@tensorrt-cicd
Collaborator

PR_Github #47061 [ run ] triggered by Bot. Commit: 102e6be Link to invocation

@tensorrt-cicd
Collaborator

PR_Github #47061 [ run ] completed with state SUCCESS. Commit: 102e6be
/LLM/main/L0_MergeRequest_PR pipeline #37033 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

CI Agent Failure Analysis

Link to invocation

@Barry-Delaney
Collaborator

/bot run --disable-fail-fast

@tensorrt-cicd
Collaborator

PR_Github #47104 [ run ] triggered by Bot. Commit: 102e6be Link to invocation

@tensorrt-cicd
Collaborator

PR_Github #47104 [ run ] completed with state SUCCESS. Commit: 102e6be
/LLM/main/L0_MergeRequest_PR pipeline #37073 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

CI Agent Failure Analysis

Link to invocation

Barry-Delaney added a commit to lfr-0531/TensorRT-LLM that referenced this pull request May 7, 2026
…elated fails

Add waives for tests blocking PR NVIDIA#13340 CI that bisect to causes
unrelated to this PR:

1. piecewise CUDA graph capture rework (https://nvbugs/6153575)
   Two symptoms with the same bisection point:
     - perf/test_perf_sanity.py::test_e2e[aggr_upload-super_ad_blackwell-super_ad_ws1_1k1k]
       Reproducible -14% throughput / +103% mean_ttft on
       Nemotron-3-Super-120B-A12B-NVFP4 served via openai_server +
       AutoDeploy. TPOT only +3%; the regression lives in the warmup
       / first-token window, matching CUDA graph capture overhead.
     - unittest/tools/test_layer_wise_benchmarks.py::test_qwen3_next_gen_tep[1]
       Triton "out of memory" during chunk_fwd_kernel_o autotune on
       Qwen3-Next-80B-A3B-Instruct (--moe-backend=TRTLLM, layers 6,7).
       Increased CUDA graph footprint from the rework starves the
       Triton autotune scratch, which OOMs on the larger BLOCK_K/V
       configs.

   Both bisect to main commit 9c1869b
   "[https://nvbugs/5615248][fix] Broader capture of piecewise
   cudagraph (NVIDIA#13574)" which reworked piecewise CUDA graph capture
   filtering in tensorrt_llm/_torch/pyexecutor/model_engine.py to
   force-include max_batch_size*(max_seq_len-1-N) ceiling. Verified
   by comparing PR builds rebased before vs after that commit:
     L0/36967 (base 3e4a775)        : both PASS
     L0/37033 (base 7943536)        : both FAIL
     L0/37073 (base 7943536, retry) : both FAIL
   PR NVIDIA#13340 itself does not touch model_engine.py, AutoDeploy, fla,
   mamba, or fused_moe in the relevant paths.

2. openai-server smoke tests on A10 (https://nvbugs/6153638)
     - test_e2e.py::test_openai_lora           (A10-PyTorch-1)
     - test_e2e.py::test_openai_tool_call      (A10-PyTorch-2)
     - test_e2e.py::test_trtllm_serve_lora_example (A10-PyTorch-2)
   All three crash uniformly with "Server exited unexpectedly" /
   "Connection refused on health endpoint", with no underlying
   server-side traceback captured. Inconsistent across CI runs of
   the same PR head 102e6be (build 47061 / L0 37033 did not hit
   any of these), so the symptom is environmental/host-level
   flake on A10 stages, not a code regression. PR NVIDIA#13340 only
   touches openai_server.py to add getattr fallbacks for
   hf_tokenizer_path and vocab_size; both are no-op on the happy
   path.

Signed-off-by: Barry Kang <43644113+Barry-Delaney@users.noreply.github.com>
@Barry-Delaney
Collaborator

/bot run --disable-fail-fast

@tensorrt-cicd
Collaborator

PR_Github #47108 [ run ] triggered by Bot. Commit: 33676f9 Link to invocation

@Barry-Delaney
Collaborator

/bot kill

@Barry-Delaney Barry-Delaney force-pushed the user/fanrongl/fp4_indexer branch from 33676f9 to 102e6be on May 7, 2026 06:34
@tensorrt-cicd
Collaborator

PR_Github #47130 [ kill ] triggered by Bot. Commit: 102e6be Link to invocation

@tensorrt-cicd
Collaborator

PR_Github #47130 [ kill ] completed with state SUCCESS. Commit: 102e6be
Successfully killed previous jobs for commit 102e6be

Link to invocation

@Barry-Delaney
Collaborator

/bot run --disable-fail-fast

@tensorrt-cicd
Collaborator

PR_Github #47153 [ run ] triggered by Bot. Commit: 102e6be Link to invocation

@tensorrt-cicd
Collaborator

PR_Github #47153 [ run ] completed with state FAILURE. Commit: 102e6be
/LLM/main/L0_MergeRequest_PR pipeline #37114 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

CI Agent Failure Analysis

Link to invocation

@Barry-Delaney
Collaborator

/bot run --disable-fail-fast

@tensorrt-cicd
Collaborator

PR_Github #47157 [ run ] triggered by Bot. Commit: 102e6be Link to invocation

@tensorrt-cicd
Collaborator

PR_Github #47157 [ run ] completed with state SUCCESS. Commit: 102e6be
/LLM/main/L0_MergeRequest_PR pipeline #37117 completed with status: 'SUCCESS'

CI Report

Link to invocation

@longlee0622 longlee0622 merged commit 897c4bf into NVIDIA:main May 7, 2026
10 checks passed
limin2021 added a commit to limin2021/TensorRT-LLM that referenced this pull request May 7, 2026
…xer_gemm

Resolve conflicts in DSA paged-MQA-logits dispatch and tests after
DeepGEMM submodule bump (4ff3f54d -> c491439e via PR NVIDIA#13340 / DG NVIDIA#304):

- dsa.py: take upstream's scheduler_metadata_buffer / _full_next_n
  selection (mtp3 buffer removed); add DSL early-branch using the
  existing scheduler_metadata_buffer (built with (num_gen, 1) shape,
  num_atoms=1, matching DSL's 1-atom-per-q design) and the 1D
  kv_lens_cuda_runtime slice for context_lens.
- dsa.py: introduce module-level _DG_SCHEDULE_BLOCK_KV = 64, used by
  all 6 get_paged_mqa_logits_metadata calls (3 in on_update_kv_lens(),
  3 in Indexer.prepare()) instead of cache tokens_per_block.
  Decouples schedule SPLIT_KV from cache page size and side-steps a
  SM100 + block_kv=32 latent regression in DG commit 7f2a703 (NVIDIA#304).
- test_dsa_indexer.py: take upstream's scheduler buffer selection;
  DSL test branch reads scheduler_metadata_buffer + 1D kv_lens.
- test_cute_dsl_fp8_paged_mqa_logits.py: 4 metadata calls now pass
  2D context_lens via .unsqueeze(-1) and DG_METADATA_BLOCK_KV=64;
  DG bench drops cluster(2,1,1) for next_n=4 (SM100 always uses
  num_kv_multicast=1) and passes 2D context_lens to fp8_paged_mqa_logits.

Signed-off-by: Mindy Li <11663212+limin2021@users.noreply.github.com>