[None] [feat] Use triton kernels for RocketKV prediction module #8682

heyuhhh · 2025-10-27T09:01:54Z

Summary by CodeRabbit

New Features
- Added sparse attention block size parameter for improved control over attention computation granularity.
- Introduced comprehensive GPU-accelerated kernels for sparse attention operations including QK splitting, top-k selection, and KT cache management.
- Extended RocketKV sparse attention with optional interleave mode for enhanced performance.
Refactor
- Optimized sparse attention kernel data layout and processing pipeline for better performance.
- Propagated sparse attention configuration throughout the attention computation stack.
Tests
- Updated sparse attention tests with new token-level indexing and comprehensive validation.

Description

Add triton kernels for RocketKV prediction module:

Fuse index conversion into the gather op
Replace for-loop operations with batched computation for better parallelism
Enable CUDA graph in the generation phase
Add more tests for the prediction module and related kernels

Limitation

Some Triton kernels do not perform as well as expected and need further optimization
Enabling CUDA graph with padding reduces accuracy; this still needs debugging

Test Coverage

PR Checklist

Please review the following before submitting your PR:

PR description clearly explains what and why. If using CodeRabbit's summary, please make sure it makes sense.
PR Follows TRT-LLM CODING GUIDELINES to the best of your knowledge.
Test cases are provided for new code paths (see test instructions)
Any new dependencies have been scanned for license and vulnerabilities
CODEOWNERS updated if ownership changes
Documentation updated as needed
The reviewers assigned automatically/manually are appropriate for the PR.
Please check this after reviewing the above items as appropriate for this PR.

coderabbitai · 2025-10-27T09:11:47Z

Walkthrough

This PR introduces comprehensive enhancements to the sparse attention system, including kernel optimizations with new template parameters, extended sparse attention metadata structures, new Triton-based GPU kernels for sparse operations (argsort, top-k selection, QK split, KT cache management), expanded sparse attention configuration options with interleaving support, and updated Python/C++ bindings to propagate a new sparse_attn_indices_block_size parameter throughout the attention operation pipeline.

Changes

Cohort / File(s)	Summary
Kernel Template and Parameter Extension `cpp/tensorrt_llm/kernels/sparseAttentionKernels.cu`, `cpp/tensorrt_llm/kernels/sparseAttentionKernels.h`	Added new template parameter MAX_NUM_PAGES to gatherKvPageOffsetsKernel; replaced single shared memory reduction with BlockScan/BlockReduce approach and per-page masking arrays; reworked sparse-index range handling with multi-stage loops; introduced sparse_attn_indices_block_size field to SparseAttentionParams with default value 1
Python/C++ Binding Layer `cpp/tensorrt_llm/nanobind/thop/bindings.cpp`, `cpp/tensorrt_llm/pybind/thop/bindings.cpp`	Added sparse_attn_indices_block_size parameter to attention function bindings in both nanobind and pybind11 interfaces
Attention Operation C++ Implementation `cpp/tensorrt_llm/thop/attentionOp.cpp`, `cpp/tensorrt_llm/thop/attentionOp.h`	Introduced sparse_attn_indices_block_size parameter throughout public interfaces; updated RunnerBase::prepare, RunnerBase::run, and top-level attention() function signatures; propagated parameter through context and generation flows via op.mRuntimeSparseAttentionParams
Sparse Attention Triton Kernels `tensorrt_llm/_torch/attention_backend/sparse/kernel.py`	Added extensive Triton-based GPU kernels including: argsort and bitonic merge for top-k selection; QK split kernel for query/context partitioning; dense BMM kernel with causal masking; softmax and flatten kernels; KT cache update/BMM kernels; top-k and score reduction kernels
RocketKV Implementation Overhaul `tensorrt_llm/_torch/attention_backend/sparse/rocket.py`	Significantly expanded RocketTrtllmAttentionMetadata with numerous CUDA buffers and configuration fields (window_size, page_size, topk, use_interleave, context_lens_cuda, q/k extraction offsets/lengths, etc.); introduced preprocess_for_gen method; replaced token-to-page sparse indexing with Triton-assisted flow; reworked sparse_kv_predict to use Triton kernels for score computation, softmax, and top-k selection; added num_extra_decoding_steps parameter to add_dummy_requests
TrtLLM Backend Integration `tensorrt_llm/_torch/attention_backend/trtllm.py`	Added sparse_attn_indices_block_size parameter (default 1) to TrtllmAttentionWrapper.plan method; updated plan/run to propagate parameter to thop.attention; introduced logic to compute sparse_attn_indices_block_size from sparse_attention_config.get_indices_block_size(); swapped docstring semantics between sparse_kv_predict and sparse_attn_predict methods
Sparse Attention Configuration `tensorrt_llm/llmapi/llm_args.py`	Added get_indices_block_size() method to BaseSparseAttentionConfig (returns 1 by default); added get_indices_block_size() override to RocketSparseAttentionConfig (returns 1 if use_interleave is True, otherwise page_size); introduced use_interleave field to RocketSparseAttentionConfig with default True
Unit Tests `cpp/tests/unit_tests/kernels/sparseAttentionKernelsTest.cpp`	Updated test data structure to use per-head, per-token sparse indices instead of per-page patterns; replaced total_sparse_pages with total_sparse_tokens; updated offsets to reflect per-batch token counts; rewrote verification logic with new ExpectedResult structure; added sparse_attn_indices_block_size = 1 to test initialization
Python Unit Tests `tests/unittest/_torch/attention/sparse/test_rocketkv.py`	Added helper functions create_rocket_kv_cache_manager and create_test_metadata; introduced test_sparse_kv_predict and test_sparse_attn_predict test functions; expanded test setup with cache manager creation, metadata building, and synthetic input generation
Examples and Configurations `examples/llm-api/llm_sparse_attention.py`, `examples/longbench/eval_longbench_v1.py`, `examples/longbench/eval_longbench_v2.py`	Added DSA to supported sparse attention algorithms; updated usage examples to use ROCKETKV variants; increased default max_seq_len to 10240 and max_num_tokens to 81920; changed default kv_cache_fraction to 0.7; removed torch_compile_config argument in eval_longbench_v1; added CudaGraphConfig support in eval_longbench_v2

Sequence Diagram(s)

sequenceDiagram
    participant User
    participant TrtLLMAttention
    participant TrtLLMWrapper
    participant RocketKV
    participant TritonKernels
    participant ThorpOp

    User->>TrtLLMAttention: forward(q, k, v, sparse_attention_config)
    TrtLLMAttention->>TrtLLMAttention: compute sparse_attn_indices_block_size<br/>from config.get_indices_block_size()
    TrtLLMAttention->>TrtLLMWrapper: plan(..., sparse_attn_indices_block_size)
    TrtLLMWrapper->>TrtLLMWrapper: store sparse_attn_indices_block_size
    
    alt Context Phase
        TrtLLMWrapper->>RocketKV: prepare_for_context()
        RocketKV->>TritonKernels: triton_update_kt_cache_ctx()
        TritonKernels-->>RocketKV: KT cache updated
    end
    
    alt Generation Phase
        TrtLLMWrapper->>RocketKV: sparse_attn_predict()
        RocketKV->>RocketKV: preprocess_for_gen(q, k)
        RocketKV->>TritonKernels: triton_kt_cache_update_and_bmm()
        TritonKernels-->>RocketKV: attention scores
        RocketKV->>TritonKernels: triton_softmax()
        TritonKernels-->>RocketKV: normalized scores
        RocketKV->>TritonKernels: triton_topk()
        TritonKernels-->>RocketKV: top-k sparse indices
    end
    
    TrtLLMWrapper->>ThorpOp: run(..., sparse_attn_indices_block_size)
    ThorpOp->>ThorpOp: propagate to sparse attention params
    ThorpOp-->>User: attention output
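
The generation-phase branch of the diagram (attention scores, then softmax, then top-k) can be illustrated in plain Python. This is a minimal sketch for a single head and a flat list of scores; `predict_sparse_indices` is a hypothetical name, and nothing here reflects how the actual Triton kernels are written.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def predict_sparse_indices(scores, topk):
    """Hypothetical helper: normalize scores, then keep the top-k positions."""
    probs = softmax(scores)
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    # Return the kept positions in ascending order for a stable layout.
    return sorted(ranked[:topk])


scores = [0.1, 2.0, -1.0, 1.5, 0.3]
selected = predict_sparse_indices(scores, topk=2)
```

Because softmax is monotonic, the selected positions are the same as the top-k of the raw scores; the normalization matters only if the scores are later combined or compared across heads.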

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Dense kernel logic: The rewrite of gatherKvPageOffsetsKernel introduces complex BlockScan/BlockReduce patterns and multi-stage loop logic requiring careful validation
Large new kernel module: kernel.py adds 20+ new Triton kernels spanning multiple concern areas (sorting, BMM, softmax, caching) with varying complexity
Significant metadata overhaul: RocketTrtllmAttentionMetadata expansion introduces 25+ new CUDA buffer fields requiring coherent initialization and lifecycle management
Cross-cutting parameter propagation: sparse_attn_indices_block_size threading through multiple layers (C++, Python bindings, TrtLLM backend) requires careful trace-through
Mixed homogeneity: Parameter additions are repetitive across binding layers, but core kernel and Triton implementations require individual reasoning

Areas requiring extra attention:

cpp/tensorrt_llm/kernels/sparseAttentionKernels.cu: Verify correctness of BlockScan/BlockReduce orchestration, loop bounds, and updated offset calculations
tensorrt_llm/_torch/attention_backend/sparse/kernel.py: Each Triton kernel's thread block tiling, synchronization, and edge-case handling should be validated independently
tensorrt_llm/_torch/attention_backend/sparse/rocket.py: Ensure new CUDA buffer initialization and lifecycle management are consistent; verify Triton kernel call sequences and parameter passing
cpp/tests/unit_tests/kernels/sparseAttentionKernelsTest.cpp: Confirm test data layout changes (per-head, per-token) align with kernel expectations and new offset semantics

Suggested reviewers

syuoni
laikhtewari
Wanli-Jiang
juney-nvidia

Pre-merge checks and finishing touches

❌ Failed checks (1 warning, 1 inconclusive)

Check name	Status	Explanation	Resolution
Docstring Coverage	⚠️ Warning	Docstring coverage is 43.94% which is insufficient. The required threshold is 80.00%.	You can run `@coderabbitai generate docstrings` to improve docstring coverage.
Description check	❓ Inconclusive	PR description lacks critical details: missing JIRA ticket/issue reference in title, no detailed problem statement, Test Coverage section empty, and checklist items unchecked.	Add proper ticket reference (e.g., [TRTLLM-XXXX]), explain the problem being solved, document test coverage with specific test names, and check relevant checklist items before submission.

✅ Passed checks (1 passed)

Check name	Status	Explanation
Title Check	✅ Passed	The PR title "[None] [feat] Use triton kernels for RocketKV prediction module" follows the repository's title format correctly and clearly communicates the primary change in the changeset. The title is concise, specific, and directly summarizes the main objective—introducing Triton kernels for the RocketKV prediction module. This aligns well with the substantive changes shown in the raw summary, which includes significant additions of Triton-based kernel implementations, modifications to the RocketKV attention backend, and supporting infrastructure changes. A reviewer scanning the project history would immediately understand the core purpose of this change.

coderabbitai

Actionable comments posted: 7

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (4)

cpp/tensorrt_llm/thop/attentionOp.h (1)
17-17: Add include guard macro (guideline requirement).

Replace pragma once with TRTLLM_ATTENTIONOP_H guard.

[coding_guidelines]
Apply this diff:
-#pragma once
+#ifndef TRTLLM_ATTENTIONOP_H
+#define TRTLLM_ATTENTIONOP_H
...
-} // namespace torch_ext
+} // namespace torch_ext
+
+#endif // TRTLLM_ATTENTIONOP_H
cpp/tests/unit_tests/kernels/sparseAttentionKernelsTest.cpp (1)
1-1: Missing NVIDIA Apache-2.0 header.

Add the standard 2025 NVIDIA license header at file top.

[coding_guidelines]
Apply this diff:
+/*
+ * Copyright (c) 2025, NVIDIA CORPORATION.  All rights reserved.
+ *
+ * Licensed under the Apache License, Version 2.0 (the "License");
+ * you may not use this file except in compliance with the License.
+ * You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
tests/unittest/_torch/attention/sparse/test_rocketkv.py (1)
1-1: Missing NVIDIA Apache-2.0 header.

Add standard header at top of Python file.

[coding_guidelines]
Apply this diff:
+# Copyright (c) 2025, NVIDIA CORPORATION.  All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
tensorrt_llm/_torch/attention_backend/sparse/rocket.py (1)

1-26: Missing NVIDIA Apache‑2.0 header

Add the standard NVIDIA Apache‑2.0 header with current year at the top of the file.

♻️ Duplicate comments (1)

tests/unittest/_torch/attention/sparse/test_rocketkv.py (1)
545-547: Useless standalone expression (duplicate).

Remove stray arithmetic.

Apply this diff:
-            end_idx - start_idx

🧹 Nitpick comments (23)

examples/longbench/eval_longbench_v2.py (1)
366-368: Consider adding explanatory comment.

The conditional CUDA graph configuration logic is correct. However, consider adding a brief comment explaining why CUDA graph is only enabled for the TRTLLM backend, which would improve code maintainability.

Optionally apply this diff to add explanatory comments:
+        # Enable CUDA graph only for TRTLLM backend for performance optimization
+        # Note: enable_padding defaults to False to avoid accuracy issues (see PR objectives)
         cuda_graph_config = CudaGraphConfig(
             max_batch_size=args.max_batch_size
         ) if args.attention_backend == "TRTLLM" else None
tensorrt_llm/llmapi/llm_args.py (1)

221-224: Document and surface interleave behavior

Add a brief docstring to clarify that when use_interleave is True, indices are per-token (block size 1); when False, indices group by page. This helps downstream users wire correct metadata.
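
The behavior this comment asks to document can be sketched as follows. `RocketConfigSketch` is a hypothetical stand-in for `RocketSparseAttentionConfig`, the `page_size` default of 4 is chosen only for illustration, and the assumption that index `i` with block size `B` covers tokens `i*B` through `i*B + B - 1` is mine rather than something the PR states.

```python
from dataclasses import dataclass


@dataclass
class RocketConfigSketch:
    """Hypothetical stand-in for RocketSparseAttentionConfig."""
    page_size: int = 4          # illustrative value, not the real default
    use_interleave: bool = True

    def get_indices_block_size(self) -> int:
        # Interleave mode: one index per token. Otherwise: one index per page.
        return 1 if self.use_interleave else self.page_size


def covered_tokens(indices, block_size):
    """Expand block indices into the token positions they are assumed to cover."""
    return [i * block_size + off for i in indices for off in range(block_size)]


per_token = RocketConfigSketch(use_interleave=True)
per_page = RocketConfigSketch(page_size=4, use_interleave=False)
```

With these assumptions, `covered_tokens([0, 2], per_page.get_indices_block_size())` expands to tokens 0 to 3 and 8 to 11, while the same indices in interleave mode refer to tokens 0 and 2 only.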
cpp/tensorrt_llm/kernels/sparseAttentionKernels.cu (3)
79-91: Remove unnecessary atomic on shared memory mask

Multiple threads setting a flag from 0 to 1 in shared memory does not need atomics; a plain store is sufficient and faster.
-                atomicExch(&s_page_mask[src_page_idx - src_page_idx_offset], 1);
+                s_page_mask[src_page_idx - src_page_idx_offset] = 1;
160-179: Sequence length logic: document non‑contiguous page handling

If sparse indices can produce non‑contiguous valid pages, num_valid_pages based length may overcount. Either enforce contiguity or document that we assume contiguous pages from 0..max_page_index except possibly the last partial page.

193-195: Template constant MAX_NUM_PAGES=512: expose or derive from runtime

Hard‑coding 512 is fine as a default, but consider making it a constexpr tied to max_num_pages_per_seq or exposing a compile‑time knob so tests with larger per‑seq capacities can reuse the kernel without rebuilds.
examples/llm-api/llm_sparse_attention.py (1)

73-75: Expose interleave/page sizing in the example CLI

Since RocketKV now uses indices block size semantics, add flags to control interleave/page size to make behavior explicit and to help reproduce perf/accuracy trade‑offs.

Example additions:

--rocketkv_use_interleave (bool)

--rocketkv_page_size (int, when interleave is False)

Also applies to: 108-109
cpp/tensorrt_llm/thop/attentionOp.h (2)
64-65: Align type width with kernels (int32_t vs int64_t).

SparseAttentionParams uses int32_t sparse_attn_indices_block_size; header exposes int64_t. Align to int32_t to avoid silent narrowing later.

Apply this diff:
-    std::vector<std::optional<torch::Tensor>> spec_decoding_tensor_params,
-    std::vector<std::optional<torch::Tensor>> sparse_attention_params, int64_t const sparse_attn_indices_block_size);
+    std::vector<std::optional<torch::Tensor>> spec_decoding_tensor_params,
+    std::vector<std::optional<torch::Tensor>> sparse_attention_params, int32_t const sparse_attn_indices_block_size);
Also ensure the corresponding cpp and bindings signatures match. Based on learnings.
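
To see why a width mismatch can go unnoticed, the snippet below imitates an unchecked 64-bit to 32-bit conversion with `ctypes`, which performs no overflow checking. It is an illustration in Python rather than the C++ code path, and `checked_int32` is a hypothetical helper showing one way to fail loudly instead.

```python
import ctypes

INT32_MIN = -2**31
INT32_MAX = 2**31 - 1


def narrow_to_int32(value: int) -> int:
    """Mimic an unchecked narrowing: ctypes keeps only the low 32 bits."""
    return ctypes.c_int32(value).value


def checked_int32(value: int) -> int:
    """Hypothetical guard: reject values an int32 field cannot represent."""
    if not INT32_MIN <= value <= INT32_MAX:
        raise OverflowError(f"{value} does not fit in int32")
    return value


small = narrow_to_int32(64)            # typical block size, unchanged
wrapped = narrow_to_int32(2**32 + 5)   # out of range, silently wraps
```

Realistic block sizes are far below the int32 limit, so the practical benefit of aligning the types is consistency across the binding layers rather than protection against a likely overflow.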

25-36: Document the new parameter in the Doxygen block.

Add a brief line for sparse_attn_indices_block_size.

Example:
  * - Speculative decoding
  * - ...
+ * @param sparse_attn_indices_block_size Block size for sparse attention indices flattening/interleave.
cpp/tests/unit_tests/kernels/sparseAttentionKernelsTest.cpp (1)
61-71: Use fixed-width int32_t for INT32 buffers.

seq_lengths_host and sparse_indices_offsets_host are INT32; casting to int* is inconsistent. Use int32_t consistently.

Apply this diff:
-    auto seq_lengths_ptr = bufferCast<int>(*seq_lengths_host);
-    auto sparse_indices_ptr = bufferCast<int>(*sparse_indices_host);
-    auto sparse_indices_offsets_ptr = bufferCast<int>(*sparse_indices_offsets_host);
+    auto seq_lengths_ptr = bufferCast<int32_t>(*seq_lengths_host);
+    auto sparse_indices_ptr = bufferCast<int32_t>(*sparse_indices_host);
+    auto sparse_indices_offsets_ptr = bufferCast<int32_t>(*sparse_indices_offsets_host);
And later:
-    auto output_seq_len_ptr = bufferCast<int>(*output_seq_lengths_host);
+    auto output_seq_len_ptr = bufferCast<int32_t>(*output_seq_lengths_host);
tests/unittest/_torch/attention/sparse/test_rocketkv.py (4)
55-63: Skip E2E if input file missing to avoid CI flakes.

Guard test_model with a skip when data is absent.

Apply this diff:
-    with open(input_file, 'r') as f:
+    if not os.path.isfile(input_file):
+        pytest.skip(f"Missing test data: {input_file}")
+    with open(input_file, 'r') as f:
213-219: zip without strict — add strict=True.

Make intent explicit and catch length mismatches.

Apply this diff:
-    token_nums = [
-        seq_len + past_token
-        for seq_len, past_token in zip(seq_lens, past_seen_tokens)
-    ]
+    token_nums = [
+        seq_len + past_token
+        for seq_len, past_token in zip(seq_lens, past_seen_tokens, strict=True)
+    ]
236-246: Remove unused variable vanilla_metadata.

It’s assigned but never used in this test.

Apply this diff:
-    vanilla_metadata = create_test_metadata(seq_lens, num_contexts,
-                                            past_seen_tokens, request_ids,
-                                            vanilla_kv_cache_manager,
-                                            sparse_attn_config,
-                                            RocketVanillaAttentionMetadata)
406-411: zip without strict — add strict=True.

Mirror change from earlier loop.

Apply this diff:
-    token_nums = [
-        seq_len + past_token
-        for seq_len, past_token in zip(seq_lens, past_seen_tokens)
-    ]
+    token_nums = [
+        seq_len + past_token
+        for seq_len, past_token in zip(seq_lens, past_seen_tokens, strict=True)
+    ]
tensorrt_llm/_torch/attention_backend/trtllm.py (1)
200-201: Expose default semantics in docstring for new parameter.

Plan signature adds sparse_attn_indices_block_size; document it and its source (sparse_attention_config.get_indices_block_size()).

Apply this diff near the Args section:
         Args:
@@
-            sparse_attn_offsets (torch.Tensor): The batch offsets for the sparse attention indices, with shape of (num_generations + 1) on GPU.
+            sparse_attn_offsets (torch.Tensor): The batch offsets for the sparse attention indices, with shape of (num_generations + 1) on GPU.
+            sparse_attn_indices_block_size (int): Block size to pack/flatten sparse attention indices. Defaults to 1. Typically set via sparse_attention_config.get_indices_block_size().
tensorrt_llm/_torch/attention_backend/sparse/kernel.py (6)
80-96: argsort assumes power-of-two length; add guard

The bitonic merge relies on BLOCK_SIZE being a power of two. Add a static assert to prevent accidental misconfig.
 def argsort(x,
@@
-    n_dims: core.constexpr = _log2(x.shape[_dim])
+    n_dims: core.constexpr = _log2(x.shape[_dim])
+    core.static_assert((1 << n_dims) == x.shape[_dim],
+                       "argsort requires length to be a power of two")
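
The power-of-two requirement comes from the compare-exchange pattern of a bitonic network, which pairs element `i` with element `i ^ j`. The plain-Python sketch below shows the same guard and a reference bitonic argsort; it is an illustration only and is not the Triton implementation in `kernel.py`.

```python
def is_power_of_two(n: int) -> bool:
    """Same check as the suggested static_assert: (1 << log2(n)) == n."""
    return n > 0 and (1 << (n.bit_length() - 1)) == n


def bitonic_argsort(values, descending=False):
    """Return indices that sort values, using a bitonic sorting network."""
    n = len(values)
    if not is_power_of_two(n):
        raise ValueError("bitonic argsort requires a power-of-two length")
    idx = list(range(n))
    k = 2
    while k <= n:              # size of the bitonic sequences being merged
        j = k // 2
        while j >= 1:          # compare-exchange distance
            for i in range(n):
                partner = i ^ j
                if partner > i:
                    a, b = values[idx[i]], values[idx[partner]]
                    ascending_block = (i & k) == 0
                    should_swap = a > b if ascending_block else a < b
                    if should_swap:
                        idx[i], idx[partner] = idx[partner], idx[i]
            j //= 2
        k *= 2
    if descending:
        idx.reverse()
    return idx


values = [3.0, 1.0, 4.0, 1.5, 9.0, 2.0, 6.0, 5.0]
order = bitonic_argsort(values)
```

With a length such as 6, `i ^ j` can point past the end of the input, which is why the guard rejects it up front instead of producing a partially sorted result.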
429-495: Replace lambda grid with def and type Optional for sm_scale

Satisfy linters and PEP484 without behavior change.
-    grid_bmm = lambda meta: (batch_size, num_q_heads)
+    def grid_bmm(meta):
+        return (batch_size, num_q_heads)
@@
-               sm_scale: float = None,
+               sm_scale: Optional[float] = None,
610-706: Replace lambda grid with def in flatten_to_batched

Align with style and E731.
-    grid = lambda meta: (batch_size, num_heads)
+    def grid(meta):
+        return (batch_size, num_heads)
1514-1567: Top‑k temp buffers and final tuple: drop unused value, keep masks solid

final_sorted_values is unused; store to underscore to appease linters.
-    final_sorted_values, final_sorted_indices = argsort(final_values,
+    _final_sorted_values, final_sorted_indices = argsort(final_values,
                                                         final_indices,
                                                         dim=0,
                                                         descending=True)
Also consider loading padded tails with -inf instead of 0.0 if you ever feed non‑softmaxed scores here; current pipeline uses softmax so zeros are fine.

Also applies to: 1642-1654
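
The padding concern can be reproduced with a few lines of plain Python. The helper names and values below are hypothetical; the point is only that a 0.0 fill outranks negative raw scores but not softmax outputs.

```python
import math


def topk_indices(scores, k):
    """Indices of the k largest scores, highest first."""
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]


def pad(scores, length, fill):
    """Pad a score list to a fixed block length with a fill value."""
    return scores + [fill] * (length - len(scores))


raw_logits = [-2.0, -0.5, -3.0]   # non-softmaxed, all negative
probs = [0.2, 0.7, 0.1]           # softmaxed, all non-negative

zero_padded_logits = topk_indices(pad(raw_logits, 4, 0.0), k=2)
inf_padded_logits = topk_indices(pad(raw_logits, 4, -math.inf), k=2)
zero_padded_probs = topk_indices(pad(probs, 4, 0.0), k=2)
```

In the first case the padding slot at index 3 is selected as the top entry, which is the failure mode the comment warns about; the other two cases select only real positions.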

714-855: O(valid_batch_size) search per work‑item; precompute a mapping

flatten_sparse_indices_kernel linearly scans valid_seq_indices for every batch/head, increasing kernel time when valid_batch_size is large.

Precompute a dense boolean mask or an index map of size batch_size on CPU/GPU and pass it in; avoid the per‑kernel inner loop.
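
A minimal sketch of the suggested index map, written in plain Python rather than Triton. `build_index_map` and `linear_lookup` are hypothetical names; the map would be built once per step and passed to the kernel in place of the per-work-item scan.

```python
def build_index_map(valid_seq_indices, batch_size):
    """Map batch index -> position in valid_seq_indices, or -1 if absent."""
    index_map = [-1] * batch_size
    for pos, batch_idx in enumerate(valid_seq_indices):
        index_map[batch_idx] = pos
    return index_map


def linear_lookup(valid_seq_indices, batch_idx):
    """The scan described in the comment: O(valid_batch_size) per lookup."""
    for pos, candidate in enumerate(valid_seq_indices):
        if candidate == batch_idx:
            return pos
    return -1


valid_seq_indices = [5, 2, 7]
index_map = build_index_map(valid_seq_indices, batch_size=8)
```

Each lookup then becomes a single load, `index_map[batch_idx]`, at the cost of one extra buffer of `batch_size` integers.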

261-321: Remove unused function args (nits) to reduce register pressure

total_tokens and prompt_budget are unused; prefix them with an underscore to make the intent clear.
-def triton_rocket_qk_split(input_tensor: torch.Tensor, num_heads: int, num_kv_heads: int,
-        head_dim: int, window_size: int, prompt_budget: int,
+def triton_rocket_qk_split(input_tensor: torch.Tensor, num_heads: int, num_kv_heads: int,
+        head_dim: int, window_size: int, _prompt_budget: int,
@@
-    total_tokens = input_tensor.shape[0]
+    _total_tokens = input_tensor.shape[0]
tensorrt_llm/_torch/attention_backend/sparse/rocket.py (3)
258-265: Potential shape source for cumsum

Ensure self.context_lens is a CPU Tensor with length == num_contexts. If not, compute cumsum on self.context_lens[:self.num_contexts] to avoid overrun.
-        self.context_cumsum[1:self.num_contexts + 1] = torch.cumsum(
-            self.context_lens, dim=0)
+        self.context_cumsum[1:self.num_contexts + 1] = torch.cumsum(
+            self.context_lens[:self.num_contexts], dim=0)
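
The effect of slicing before the prefix sum can be shown without PyTorch. In the sketch below, `itertools.accumulate` stands in for `torch.cumsum`, and `fill_context_cumsum` and its `capacity` argument are hypothetical names for illustration.

```python
from itertools import accumulate


def fill_context_cumsum(context_lens, num_contexts, capacity):
    """Return a prefix-sum buffer of size capacity + 1 with a leading zero.

    Only the first num_contexts lengths are summed, mirroring the
    context_lens[:num_contexts] slice in the suggested diff.
    """
    prefix = list(accumulate(context_lens[:num_contexts]))
    if len(prefix) != num_contexts:
        raise ValueError("context_lens has fewer entries than num_contexts")
    buffer = [0] * (capacity + 1)
    buffer[1:num_contexts + 1] = prefix
    return buffer


# Four entries are present, but only the first three are context requests.
cumsum_buffer = fill_context_cumsum([3, 5, 2, 9], num_contexts=3, capacity=4)
```

Without the slice, the fourth length would be included in the sum, and in PyTorch the assignment into a three-element destination would fail with a shape mismatch.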
579-586: sparse_attn_predict: kwargs unused

Remove unused kwargs or accept **_ to silence linters.
-    def sparse_attn_predict(
+    def sparse_attn_predict(
         self,
         q: torch.Tensor,
         k: torch.Tensor,
         metadata: TrtllmAttentionMetadata,
-        **kwargs,
+        **_kwargs,
425-432: sparse_kv_predict: kwargs unused

Same as above.
-    def sparse_kv_predict(
+    def sparse_kv_predict(
         self,
         q: torch.Tensor,
         k: torch.Tensor,
         metadata: TrtllmAttentionMetadata,
-        **kwargs,
+        **_kwargs,

📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 990b0c0 and cf70147.

📒 Files selected for processing (15)

cpp/tensorrt_llm/kernels/sparseAttentionKernels.cu (4 hunks)
cpp/tensorrt_llm/kernels/sparseAttentionKernels.h (1 hunks)
cpp/tensorrt_llm/nanobind/thop/bindings.cpp (1 hunks)
cpp/tensorrt_llm/pybind/thop/bindings.cpp (1 hunks)
cpp/tensorrt_llm/thop/attentionOp.cpp (6 hunks)
cpp/tensorrt_llm/thop/attentionOp.h (1 hunks)
cpp/tests/unit_tests/kernels/sparseAttentionKernelsTest.cpp (7 hunks)
examples/llm-api/llm_sparse_attention.py (4 hunks)
examples/longbench/eval_longbench_v1.py (0 hunks)
examples/longbench/eval_longbench_v2.py (3 hunks)
tensorrt_llm/_torch/attention_backend/sparse/kernel.py (3 hunks)
tensorrt_llm/_torch/attention_backend/sparse/rocket.py (7 hunks)
tensorrt_llm/_torch/attention_backend/trtllm.py (6 hunks)
tensorrt_llm/llmapi/llm_args.py (3 hunks)
tests/unittest/_torch/attention/sparse/test_rocketkv.py (3 hunks)

💤 Files with no reviewable changes (1)

examples/longbench/eval_longbench_v1.py

🧰 Additional context used

📓 Path-based instructions (8)

**/*.{h,hpp,hh,hxx,cpp,cxx,cc,cu,cuh,py}