[TRTLLM-11058][feat] Support Helix CP with GQA #11570
Conversation
📝 Walkthrough

This PR adds Helix context-parallelism (CP) support to TensorRT-LLM's attention computation stack, threading new `helix_position_offsets` and `helix_is_inactive_rank` parameters from the Python attention modules through the Torch backend and C++ enqueue/dispatch layers down to the CUDA attention kernels, and combining per-rank partial outputs via `softmax_stats`.
Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant App as Python App
    participant AttentionMod as Attention Module<br/>(Python)
    participant TorchBackend as Torch Backend
    participant Enqueue as Enqueue Interface<br/>(C++)
    participant Dispatcher as XQA Dispatcher<br/>(C++)
    participant KernelParams as Kernel Parameter<br/>Structs
    participant AttentionKernel as Attention Kernels<br/>(CUDA)
    participant KVCache as KV Cache
    App->>AttentionMod: forward(input, attn_metadata)
    AttentionMod->>AttentionMod: detect Helix CP mode<br/>(mapping.has_cp_helix())
    AttentionMod->>AttentionMod: compute helix_position_offsets<br/>helix_is_inactive_rank
    AttentionMod->>TorchBackend: call attention backend<br/>with helix params
    TorchBackend->>Enqueue: create EnqueueGenerationParams<br/>with helix_position_offsets,<br/>helix_is_inactive_rank
    Enqueue->>Dispatcher: dispatch with helix params
    Dispatcher->>KernelParams: populate XQAParams<br/>helix_position_offsets,<br/>helix_is_inactive_rank,<br/>softmax_stats
    Dispatcher->>AttentionKernel: launch kernel<br/>with configured params
    AttentionKernel->>AttentionKernel: select rope_position from<br/>helix_position_offsets or tlength
    AttentionKernel->>AttentionKernel: check helix_is_inactive_rank<br/>for KV store gating
    AttentionKernel->>KVCache: conditionally update KV<br/>based on inactive flag
    AttentionKernel->>AttentionKernel: compute attention with<br/>adjusted rope positions
    AttentionKernel-->>TorchBackend: partial outputs + softmax_stats
    TorchBackend->>AttentionMod: return outputs
    AttentionMod->>AttentionMod: if Helix CP:<br/>_helix_post_process<br/>(partial_o, softmax_stats)
    AttentionMod->>AttentionMod: alltoall redistribution<br/>across CP ranks
    AttentionMod-->>App: final attention output
```
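The diagram's final step merges per-rank partial outputs using their softmax stats. As a hedged illustration only (not the actual `helix_post_process` kernel, whose exact semantics live in the CUDA code), the standard log-sum-exp merge of partial softmax-attention outputs can be sketched in pure Python for a single head/token value:

```python
import math

def merge_partial_attention(partials):
    """Merge per-rank partial attention outputs using their softmax stats.

    Each entry is (partial_out, lse), where partial_out is that rank's
    locally normalized attention output and lse is the log of the
    rank-local softmax denominator.
    """
    # Subtracting the global max keeps the exponentials numerically stable.
    m = max(lse for _, lse in partials)
    num = sum(math.exp(lse - m) * out for out, lse in partials)
    den = sum(math.exp(lse - m) for _, lse in partials)
    return num / den
```

Splitting the key/value tokens across two "ranks" and merging reproduces the full softmax-weighted result, which is why only partial outputs plus one scalar stat per rank need to cross the all-to-all.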
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~60 minutes

🚥 Pre-merge checks: ❌ 1 failed (1 warning) | ✅ 2 passed
Actionable comments posted: 4
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (10)
cpp/tensorrt_llm/kernels/unfusedAttentionKernels.h (1)
Lines 1-15: ⚠️ Potential issue | 🟡 Minor
Update the copyright year.
This file was modified for Helix support, but the header still lists 2020–2023. Please update it to include 2026. As per coding guidelines: “All source files must contain an NVIDIA copyright header with the year of latest meaningful modification. Include NVIDIA copyright header on ALL new files and update year on modified files.”
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@cpp/tensorrt_llm/kernels/unfusedAttentionKernels.h` around lines 1-15, update the NVIDIA copyright header at the top of the file by changing the year range "2020-2023" to include the latest modification year (e.g., "2020-2026") so the header reflects the most recent meaningful change; locate the header comment block at the file top and replace the year range accordingly.

cpp/tensorrt_llm/kernels/decoderMaskedMultiheadAttention/xqaParams.h (1)
Lines 1-15: ⚠️ Potential issue | 🟡 Minor
Update the copyright year.
Please update the header to include 2026 to reflect this modification. As per coding guidelines: “All source files must contain an NVIDIA copyright header with the year of latest meaningful modification. Include NVIDIA copyright header on ALL new files and update year on modified files.”
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@cpp/tensorrt_llm/kernels/decoderMaskedMultiheadAttention/xqaParams.h` around lines 1-15, update the top-of-file NVIDIA copyright header so the year range includes 2026 (e.g., change "2020-2025" to "2020-2026") in the header block at the top of xqaParams.h; edit the existing comment block rather than adding a new header and ensure the License text remains unchanged.

cpp/tensorrt_llm/kernels/unfusedAttentionKernels/unfusedAttentionKernels_2_template.h (1)
Lines 1-16: ⚠️ Potential issue | 🟡 Minor
Update the copyright year.
The header still lists 2019–2024, but the file was modified in 2026. Please update the year range accordingly. As per coding guidelines: “All source files must contain an NVIDIA copyright header with the year of latest meaningful modification. Include NVIDIA copyright header on ALL new files and update year on modified files.”
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@cpp/tensorrt_llm/kernels/unfusedAttentionKernels/unfusedAttentionKernels_2_template.h` around lines 1-16, update the copyright header's year range by replacing the existing "Copyright (c) 2019-2024, NVIDIA CORPORATION." entry in the top-of-file comment block so it reflects the latest modification year (e.g., change "2019-2024" to "2019-2026"); ensure the rest of the header text (including the NAVER/CLOVA line and Apache License block) remains unchanged.

tensorrt_llm/_torch/attention_backend/trtllm.py (2)
Lines 1-5: ⚠️ Potential issue | 🟡 Minor
Add the NVIDIA copyright header (2026).
This modified Python file currently has no NVIDIA Apache 2.0 header. Please add the standard header with the latest modification year. As per coding guidelines: “All source files must contain an NVIDIA copyright header with the year of latest meaningful modification. Include NVIDIA copyright header on ALL new files and update year on modified files.”
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tensorrt_llm/_torch/attention_backend/trtllm.py` around lines 1 - 5, Add the standard NVIDIA Apache-2.0 copyright header (with year 2026) to the top of the file tensorrt_llm/_torch/attention_backend/trtllm.py before any imports; ensure the header matches the project's canonical NVIDIA header text and license block and update the file's modification year to 2026, leaving the rest of the module (imports like math, os, weakref and dataclass/type hints) unchanged.
Lines 513-629: ⚠️ Potential issue | 🟠 Major
Avoid hard-failing on non-KV-cache unsupported reasons.
The new assert crashes whenever TRTLLM-GEN is enabled but unsupported for reasons other than "KV cache update", including missing flashinfer, unsupported head configs (MLA, cross-attention, spec-decoding), ALiBi, padded input, or custom mask types. This is a behavior regression; previous behavior fell back to `thop.attention` for all unsupported reasons. Prefer a warning + fallback to preserve backward compatibility and prevent crashes for users with the env var enabled.

🐛 Proposed fix (warn + fallback)
```diff
-        # KV cache update is expected to fall back to thop since
-        # trtllm-gen only reads from KV cache. Assert on other reasons.
-        assert not _TRTLLM_ENABLE_TRTLLM_GEN_ATTENTION or (
-            "KV cache update" in _trtllm_gen_reason
-        ), (
-            f"TRTLLM_ENABLE_TRTLLM_GEN_ATTENTION is set but trtllm-gen "
-            f"is not supported: {_trtllm_gen_reason}"
-        )
+        # KV cache update is expected to fall back to thop since
+        # trtllm-gen only reads from KV cache. Warn (don't hard-fail) for other reasons.
+        if _TRTLLM_ENABLE_TRTLLM_GEN_ATTENTION and _trtllm_gen_reason and (
+                "KV cache update" not in _trtllm_gen_reason):
+            logger.warning(
+                "TRTLLM_ENABLE_TRTLLM_GEN_ATTENTION is set but trtllm-gen "
+                f"is not supported: {_trtllm_gen_reason}. Falling back to thop."
+            )
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tensorrt_llm/_torch/attention_backend/trtllm.py` around lines 513-629, the assert that checks _TRTLLM_ENABLE_TRTLLM_GEN_ATTENTION and _trtllm_gen_reason should be replaced with a non-fatal warning and fall back to the existing thop attention path when trtllm_gen.is_supported returns False for reasons other than KV cache update; locate the assert block after the trtllm_gen.is_supported call and the trtllm_gen_attention invocation and change it to log a warning including _trtllm_gen_reason (use your logger or warnings.warn) and let execution continue to the else path (thop.attention fallback) instead of raising, keeping the special case for KV cache update handling if needed.

cpp/tensorrt_llm/kernels/xqaDispatcher.cpp (1)
Lines 1-15: ⚠️ Potential issue | 🟡 Minor
Update the copyright year.
The header still lists 2020–2024; please include 2026 for this modification. As per coding guidelines: “All source files must contain an NVIDIA copyright header with the year of latest meaningful modification. Include NVIDIA copyright header on ALL new files and update year on modified files.”
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@cpp/tensorrt_llm/kernels/xqaDispatcher.cpp` around lines 1-15, update the copyright header block at the top of cpp/tensorrt_llm/kernels/xqaDispatcher.cpp by changing the year range in the comment that currently reads "2020-2024" to include 2026 (e.g., "2020-2026") so the NVIDIA copyright header reflects the latest modification.

cpp/tensorrt_llm/kernels/decoderMaskedMultiheadAttention.h (1)
Lines 1-15: ⚠️ Potential issue | 🟡 Minor
Update the copyright year.
The header still lists 2020–2023; please update to include 2026 after these changes. As per coding guidelines: “All source files must contain an NVIDIA copyright header with the year of latest meaningful modification. Include NVIDIA copyright header on ALL new files and update year on modified files.”
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@cpp/tensorrt_llm/kernels/decoderMaskedMultiheadAttention.h` around lines 1-15, update the top-of-file NVIDIA copyright header block (the multi-line comment starting with "Copyright (c) 2020-2023, NVIDIA CORPORATION.") to reflect the latest modification year by replacing "2020-2023" with "2020-2026" so the header shows the current year of change.

cpp/tensorrt_llm/kernels/decoderMaskedMultiheadAttention/decoderMaskedMultiheadAttentionTemplate.h (1)
Lines 1562-1564: ⚠️ Potential issue | 🟠 Major
`update_rotary_base_n_scale` should use `rope_position` instead of `tlength` when Helix is active with dynamic scaling.
For `RotaryScalingType::kDYNAMIC`, the dynamic base extension depends directly on the sequence-length parameter, specifically in the formula `base * powf((scale*seq_len/max_positions)-(scale-1), d/(d-2))`. When Helix is enabled and `rope_position != tlength`, passing `tlength` (the KV cache length) to this function causes the base frequency extension to be computed for the wrong context window, while `apply_rotary_embedding` later applies rotation at `rope_position`. This mismatch yields inconsistent rotation frequencies. Line 1726 already uses `rope_position` for the m_scale decision; line 1564 should be updated similarly.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@cpp/tensorrt_llm/kernels/decoderMaskedMultiheadAttention/decoderMaskedMultiheadAttentionTemplate.h` around lines 1562-1564, the call to mmha::update_rotary_base_n_scale is using tlength (KV cache length) which causes incorrect base extension for RotaryScalingType::kDYNAMIC when Helix dynamic scaling is active; change the third argument for sequence length from tlength to rope_position so the dynamic base is computed using the actual rotation position used later by apply_rotary_embedding; update the invocation that passes rotary_embedding_base, rotary_embedding_scale, params.rotary_embedding_scale_type, params.rotary_embedding_dim, params.rotary_embedding_max_positions, tlength to instead pass rope_position (keeping the same other symbols) so the computed frequencies match apply_rotary_embedding's behavior.

cpp/tensorrt_llm/common/attentionOp.cpp (1)
Lines 1-2: ⚠️ Potential issue | 🟡 Minor
Update copyright year to 2026.
The copyright header still shows `1993-2025`, but this file has meaningful modifications in 2026. As per coding guidelines, the year should reflect the latest meaningful modification.

```diff
- * SPDX-FileCopyrightText: Copyright (c) 1993-2025 NVIDIA CORPORATION &
+ * SPDX-FileCopyrightText: Copyright (c) 1993-2026 NVIDIA CORPORATION &
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@cpp/tensorrt_llm/common/attentionOp.cpp` around lines 1-2, update the SPDX copyright header at the top of the file by changing the year range from "1993-2025" to "1993-2026" so it reflects the current meaningful modification; locate the SPDX header line beginning with "SPDX-FileCopyrightText" (the comment block in attentionOp.cpp) and replace the trailing year range accordingly, preserving the rest of the header text and formatting.

tensorrt_llm/_torch/modules/attention.py (1)
Lines 1633-1637: ⚠️ Potential issue | 🟡 Minor
`forward_context_default` sets `helix_position_offsets` with only the `enable_helix_test` guard, unlike `Attention.forward`.
In `Attention.forward` (line 728), helix position offsets are only set when `self.enable_helix_test and self.mapping.has_cp_helix()`. However, in `MLA.forward_context_default` (line 1633), the guard is only `self.enable_helix_test`; it doesn't check `self.mapping.has_cp_helix()`. If `enable_helix_test=True` is ever set without Helix CP, this will unconditionally write to `attn_metadata.helix_position_offsets`.

Proposed fix

```diff
-        if self.enable_helix_test:
+        if self.enable_helix_test and self.mapping.has_cp_helix():
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tensorrt_llm/_torch/modules/attention.py` around lines 1633 - 1637, MLA.forward_context_default currently sets attn_metadata.helix_position_offsets whenever self.enable_helix_test is true; change it to match Attention.forward by guarding the write with both self.enable_helix_test and self.mapping.has_cp_helix() so helix_position_offsets is only set when Helix CP is present. Locate MLA.forward_context_default and replace the single-condition block that assigns attn_metadata.helix_position_offsets = position_ids with a compound condition checking self.enable_helix_test and self.mapping.has_cp_helix() before assigning.
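The compound guard the comment asks for can be sketched generically. `maybe_set_offsets` and its dict-based metadata below are hypothetical stand-ins for the module's attributes, shown only to illustrate the guarded-write shape:

```python
def maybe_set_offsets(metadata: dict, position_ids,
                      enable_helix_test: bool, has_cp_helix: bool) -> None:
    """Record helix position offsets only when both conditions hold,
    mirroring the guard used in Attention.forward."""
    if enable_helix_test and has_cp_helix:
        metadata["helix_position_offsets"] = position_ids
```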
🧹 Nitpick comments (9)
cpp/tensorrt_llm/common/attentionOp.h (1)
Lines 147-206: `enqueueContextParamsToString()` is missing the two new Helix fields.
`helix_position_offsets` and `helix_is_inactive_rank` are not emitted. When debugging Helix-related attention issues, this omission makes the dump incomplete.

♻️ Proposed addition

```diff
     ss << "v_ptr: " << this->v_ptr << std::endl;
+    ss << "helix_position_offsets: " << this->helix_position_offsets << std::endl;
+    ss << "helix_is_inactive_rank: " << this->helix_is_inactive_rank << std::endl;
     return ss.str();
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@cpp/tensorrt_llm/common/attentionOp.h` around lines 147-206, enqueueContextParamsToString() currently omits the two new Helix members; update this function to append the helix fields to the string dump by adding lines that output this->helix_position_offsets and this->helix_is_inactive_rank (mirror the style used for other pointer/int members such as "block_offsets" and "cross_kv"); ensure you format them consistently (e.g., "helix_position_offsets: " << this->helix_position_offsets << std::endl and "helix_is_inactive_rank: " << this->helix_is_inactive_rank << std::endl) so Helix-related attention debugging includes these values.

tests/unittest/_torch/modules/test_mha_helix.py (5)
Lines 616-628: Broad exception catch and re-raise without chaining loses context.
Catching bare `Exception` is overly broad, and re-raising without `from` loses the exception chain. Use `raise ... from err` to preserve the original traceback. As per coding guidelines: "When using try-except blocks, limit the except to the smallest set of errors possible. Avoid bare `except:` clauses."

Suggested fix

```diff
 def _run_single_rank(func, *args, **kwargs):
     rank = tensorrt_llm.mpi_rank()
     torch.cuda.set_device(rank)
     print(f"rank {rank} starting")
     try:
         ret = func(rank, *args, **kwargs)
-        print(f"rank {rank} done")
-        return ret
-    except Exception:
+    except Exception as err:
         traceback.print_exc()
         tb = traceback.format_exc()
-        raise Exception(f"\n\nError occurred. Original traceback is\n{tb}\n")
+        raise RuntimeError(
+            f"\n\nError occurred on rank {rank}. Original traceback is\n{tb}\n"
+        ) from err
+    print(f"rank {rank} done")
+    return ret
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tests/unittest/_torch/modules/test_mha_helix.py` around lines 616 - 628, The helper _run_single_rank currently catches Exception broadly and re-raises a new Exception without chaining, which loses the original traceback; change the except block to "except Exception as err" and re-raise the new Exception using "raise Exception(... ) from err" (or simply re-raise the original error) so the original exception context from the call to func(rank, ...) is preserved; update references in this function around tensorrt_llm.mpi_rank() and torch.cuda.set_device(rank) accordingly.
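The chaining behavior the review recommends can be demonstrated in isolation. This is a simplified stand-in for the test helper (no MPI or CUDA specifics); `run_single_rank` is a hypothetical name:

```python
import traceback

def run_single_rank(func, rank, *args, **kwargs):
    """Run one rank's work, re-raising failures with the chain preserved."""
    try:
        return func(rank, *args, **kwargs)
    except Exception as err:
        tb = traceback.format_exc()
        # `from err` stores the original exception in __cause__, so the
        # root cause survives even after the message is reformatted.
        raise RuntimeError(f"rank {rank} failed; original traceback:\n{tb}") from err
```

Because the original exception rides along as `__cause__`, a caller can still distinguish, say, a `ValueError` from an OOM even though every rank failure surfaces as `RuntimeError`.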
Lines 311-317: Dead assignment: `start` is assigned twice before the loop.
Line 311 assigns `start = time.time()`, but line 317 immediately overwrites it before any use. Remove the first assignment.

```diff
     outputs = []
-    start = time.time()
 
     # CUDA graph setup for timing
     use_cuda_graph = gen_steps > scenario.ref_steps
     graph = None
     graph_output = None
 
     start = time.time()
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tests/unittest/_torch/modules/test_mha_helix.py` around lines 311 - 317, Remove the redundant initial timestamp assignment: the variable start is set to time.time() at the top and immediately overwritten later before use; delete the first start = time.time() so only the later assignment remains. Edit the test (around the CUDA graph setup where use_cuda_graph, graph, and graph_output are declared) to keep a single start = time.time() just before timing begins.
Lines 197-205: Unused loop variable `name`.

```diff
-    for name, param in attn.named_parameters():
+    for _name, param in attn.named_parameters():
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tests/unittest/_torch/modules/test_mha_helix.py` around lines 197 - 205, The loop in _generate_random_weights uses an unused variable name from attn.named_parameters(); change the loop to avoid the unused binding by either iterating over attn.parameters() or replacing name with an underscore (for _, param in attn.named_parameters()), then keep the existing dtype/initialization logic for param.data so there are no unused variables flagged.
Lines 596-602: Using `cp_allgather` to broadcast is functional but wasteful.
Only rank 0 has a valid `ref_output`; other ranks allocate an empty tensor just to participate in the allgather. A `torch.distributed.broadcast` from rank 0 would be more efficient. This is a test, so performance isn't critical, but worth noting.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tests/unittest/_torch/modules/test_mha_helix.py` around lines 596 - 602, The test currently uses cp_allgather(ref_output, mapping=mapping, dim=0) to broadcast the reference from rank 0, which forces all other ranks to allocate empty tensors; replace that allgather with a broadcast from rank 0 (e.g., torch.distributed.broadcast or your test-suite broadcast helper) so only rank 0 provides the real data and other ranks create an appropriately shaped/typed tensor and receive it; update the code around the cp_allgather call (referencing cp_allgather, ref_output, and mapping) to allocate ref_output on non-root ranks with the same shape/dtype and then call broadcast(ref_output, src=0), removing the mapping/allgather usage.
Line 22: Use built-in generic types instead of `typing.List` and `typing.Optional`.
Since TensorRT-LLM requires Python ≥ 3.10, `List` and `Optional` from `typing` are unnecessary. Use `list[float]` and `int | None` directly.

```diff
-from typing import List, Optional
```

Then update usages at lines 639-641:

```diff
-    gen_steps: Optional[int] = None,
+    gen_steps: int | None = None,
     max_mismatch_ratio: float = 0.02,
-    mismatch_ratios: Optional[List[float]] = None,
+    mismatch_ratios: list[float] | None = None,
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tests/unittest/_torch/modules/test_mha_helix.py` at line 22, remove the typing import and migrate annotations that use List and Optional to Python 3.10+ built-in generics: delete the line importing "List" and "Optional" and replace any occurrences of "List[float]" with "list[float]" and "Optional[int]" (or similar Optional[...] uses) with the union form "int | None" (or the appropriate type | None) throughout the module; specifically update the spots that reference the symbols "List" and "Optional" so all type annotations use built-in generics.

tensorrt_llm/_torch/modules/attention.py (3)
Lines 737-741: Simplify the redundant `hasattr` + `getattr` guard.
`getattr(obj, attr, None) is not None` already handles the case where the attribute doesn't exist; the preceding `hasattr` check is redundant.

Proposed simplification

```diff
-        if hasattr(attn_metadata,
-                   'helix_is_inactive_rank') and getattr(
-                       attn_metadata, 'helix_is_inactive_rank',
-                       None) is not None:
-            attn_metadata.helix_is_inactive_rank.fill_(False)
+        inactive = getattr(attn_metadata, 'helix_is_inactive_rank', None)
+        if inactive is not None:
+            inactive.fill_(False)
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tensorrt_llm/_torch/modules/attention.py` around lines 737 - 741, The guard before clearing helix_is_inactive_rank is redundant: replace the combined hasattr + getattr check with a single getattr(attn_metadata, 'helix_is_inactive_rank', None) is not None check and only then call attn_metadata.helix_is_inactive_rank.fill_(False); locate this in the attention module where attn_metadata and its helix_is_inactive_rank attribute are referenced and remove the initial hasattr(...) condition so the code uses the single getattr-based null check.
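The equivalence behind this simplification is easy to check in isolation. A minimal sketch, using a list's `clear()` as a stand-in for the tensor's `fill_(False)`:

```python
def clear_inactive_flags(attn_metadata) -> None:
    """Reset helix_is_inactive_rank if present.

    getattr with a None default covers both a missing attribute and an
    attribute explicitly set to None, so no separate hasattr check is needed.
    """
    flags = getattr(attn_metadata, "helix_is_inactive_rank", None)
    if flags is not None:
        flags.clear()  # stand-in for tensor.fill_(False)
```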
Lines 991-995: Silent fallback of `rms_norm_eps` in Helix test mode may hide configuration bugs.
When `enable_helix_test` is `True`, `rms_norm_eps` silently falls back to `1e-6` if the attribute is missing from `pretrained_config`. If the pretrained model actually uses a different epsilon (e.g., `1e-5`), this silent default could produce subtly wrong test results. Consider logging when the fallback is used.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tensorrt_llm/_torch/modules/attention.py` around lines 991 - 995, The silent fallback of rms_norm_eps when enable_helix_test is True can hide config mismatches; update the block that sets rms_norm_eps (referencing enable_helix_test and config.pretrained_config) to detect whether pretrained_config actually has rms_norm_eps and, if not, emit a warning or info log stating that the default 1e-6 is being used for helix tests (include the model identifier or config reference if available) so callers know a fallback occurred; keep the existing fallback value but ensure the log is clear and only triggered when the attribute is missing.
Lines 459-507: Helix post-processing logic is duplicated between `Attention._helix_post_process` and `MLA._attn_forward_gen`.
The NCCL path (lines 465-475) is character-for-character identical to `MLA._attn_forward_gen` lines 1298-1315, and the FIFO paths differ only in the value dimension (`head_dim` vs `kv_lora_rank`) and the use of `maybe_execute_in_parallel` in MLA. Consider extracting a shared helper (e.g., `_helix_alltoall_and_combine(partial_o, softmax_stats, mapping, num_heads_tp_cp, value_dim, ...)`) to avoid maintaining two copies of the same multi-branch logic.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tensorrt_llm/_torch/modules/attention.py` around lines 459 - 507, The duplicated Helix all-to-all + combine logic in Attention._helix_post_process and MLA._attn_forward_gen should be extracted into a shared helper (suggested name _helix_alltoall_and_combine) that accepts (partial_o, softmax_stats, mapping, num_heads_tp_cp, value_dim, fifo_version_override=None, use_maybe_parallel=False) and encapsulates the NCCL branch (torch.transpose/contiguous, torch.split, alltoall_helix, transpose back, torch.ops.trtllm.helix_post_process) and the FIFO branches (HelixAllToAllNative.get(mapping), view/transpose patterns for fifo_version==1 and else, helix.alltoall_native, appropriate reshapes, and calls to torch.ops.trtllm.helix_post_process_native with the correct final flag); then replace logic in _helix_post_process to call this helper with value_dim=head_dim and in MLA._attn_forward_gen to call it with value_dim=kv_lora_rank and use_maybe_parallel set as before, preserving fifo_version from mapping.cp_config and cp_size/num_tokens behavior.
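The shape of the suggested refactor, stripped of all communication and softmax-stat details, is simply "one parameterized combine, two thin wrappers". This is a toy sketch of that structure only; the function names and the trivial sum stand in for the real all-to-all plus combine logic:

```python
def combine_partials(partials, value_dim, scale=1.0):
    """Hypothetical shared combine; callers differ only in value_dim."""
    merged = [0.0] * value_dim
    for chunk in partials:
        for i in range(value_dim):
            merged[i] += chunk[i]
    return [x * scale for x in merged]

def mha_post_process(partials, head_dim):
    # MHA variant: the value dimension is head_dim.
    return combine_partials(partials, value_dim=head_dim)

def mla_post_process(partials, kv_lora_rank):
    # MLA variant: the value dimension is kv_lora_rank.
    return combine_partials(partials, value_dim=kv_lora_rank)
```

Centralizing the branchy body this way means a future fix to the combine path lands in one place instead of two.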
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In
`@cpp/tensorrt_llm/kernels/decoderMaskedMultiheadAttention/decoderMaskedMultiheadAttentionTemplate.h`:
- Around line 1494-1500: The helix inactivity flag list is being appended
per-beam causing a [b*beam_width] shape; in _torch/pyexecutor/model_engine.py
fix the accumulation of helix_is_inactive_rank so it appends once per request
instead of once per beam: move the helix_is_inactive_rank.append(...) out of the
beam loop (or wrap it with a conditional like only append when beam_idx==0) so
that helix_is_inactive_rank has length b (one entry per request) and indexing
used by decoderMaskedMultiheadAttentionTemplate.h (batch_idx_for_helix,
helix_is_inactive_rank) works correctly with beam_width>1.
In
`@cpp/tensorrt_llm/kernels/unfusedAttentionKernels/unfusedAttentionKernels_2_template.h`:
- Around line 431-445: The code computes rotary_position and helix_inactive
using helix_position_offsets[global_token_idx] (and other arrays) before
checking valid_token, which can cause out-of-bounds access when remove_padding
is enabled; move the valid_token guard to before the rotary_position and
helix_inactive calculations (i.e., evaluate valid_token first), and only read
params.helix_position_offsets[global_token_idx],
params.spec_decoding_position_offsets[...], and
params.mrope_position_deltas[...] when valid_token is true (otherwise use safe
defaults like 0 for rotary_position and false for helix_inactive); update
references in the block that sets rotary_position and helix_inactive (symbols:
rotary_position, helix_inactive, helix_position_offsets, global_token_idx,
spec_decoding_position_offsets, local_token_idx, batch_idx, past_seq_len,
mrope_position_deltas, helix_is_inactive_rank) accordingly.
In `@cpp/tensorrt_llm/thop/attentionOp.cpp`:
- Around line 474-483: Add brace-delimited blocks for the single-statement if
bodies around helix extraction and remove the duplicated extraction by creating
a small lambda that captures mla_tensor_params and assigns
helix_position_offsets and helix_is_inactive_rank into a given enqueue_params
instance; call this lambda for both EnqueueContextParams<T> and
EnqueueGenerationParams<T> (both inherit helix_* from EnqueueParams<T>) so the
logic for extracting mla_tensor_params and setting
enqueue_params.helix_position_offsets / enqueue_params.helix_is_inactive_rank is
centralized and all if(...) statements use braces.
In `@tensorrt_llm/_torch/modules/attention.py`:
- Around line 543-561: The Helix CP branch in attention.py (the block guarded by
self.mapping.has_cp_helix() and attn_metadata.num_contexts == 0) silently
bypasses all quantization parameters (out_scale, out_scale_sf, kv_scales_sf,
kv_scales_sf_inv); add an explicit guard or comment: either assert that
quantization is incompatible with Helix CP (e.g., raise/assert if any of those
quant parameters are set) or add a clear comment above the block referencing
mapping.has_cp_helix(), attn_metadata and explaining that Helix CP currently
disables quantization and why, and if applicable emit a one-line warning/log
when quant params are present to prevent silent skipping; ensure references to
self.attn.forward(), softmax_stats and self._helix_post_process() remain
unchanged.
---
Outside diff comments:
In `@cpp/tensorrt_llm/common/attentionOp.cpp`:
- Around line 1-2: Update the SPDX copyright header at the top of the file by
changing the year range from "1993-2025" to "1993-2026" so it reflects the
current meaningful modification; locate the SPDX header line beginning with
"SPDX-FileCopyrightText" (the comment block in attentionOp.cpp) and replace the
trailing year range accordingly, preserving the rest of the header text and
formatting.
In `@cpp/tensorrt_llm/kernels/decoderMaskedMultiheadAttention.h`:
- Around line 1-15: Update the top-of-file NVIDIA copyright header block (the
multi-line comment starting with "Copyright (c) 2020-2023, NVIDIA CORPORATION.")
to reflect the latest modification year by replacing "2020-2023" with
"2020-2026" so the header shows the current year of change.
In
`@cpp/tensorrt_llm/kernels/decoderMaskedMultiheadAttention/decoderMaskedMultiheadAttentionTemplate.h`:
- Around line 1562-1564: The call to mmha::update_rotary_base_n_scale is using
tlength (KV cache length) which causes incorrect base extension for
RotaryScalingType::kDYNAMIC when Helix dynamic scaling is active; change the
third argument for sequence length from tlength to rope_position so the dynamic
base is computed using the actual rotation position used later by
apply_rotary_embedding; update the invocation that passes rotary_embedding_base,
rotary_embedding_scale, params.rotary_embedding_scale_type,
params.rotary_embedding_dim, params.rotary_embedding_max_positions, tlength to
instead pass rope_position (keeping the same other symbols) so the computed
frequencies match apply_rotary_embedding’s behavior.
In `@cpp/tensorrt_llm/kernels/decoderMaskedMultiheadAttention/xqaParams.h`:
- Around line 1-15: Update the top-of-file NVIDIA copyright header so the year
range includes 2026 (e.g., change "2020-2025" to "2020-2026") in the header
block at the top of xqaParams.h; edit the existing comment block rather than
adding a new header and ensure the License text remains unchanged.
In `@cpp/tensorrt_llm/kernels/unfusedAttentionKernels.h`:
- Around line 1-15: Update the NVIDIA copyright header at the top of the file by
changing the year range "2020-2023" to include the latest modification year
(e.g., "2020-2026") so the header reflects the most recent meaningful change;
locate the header comment block at the file top and replace the year range
accordingly.
In
`@cpp/tensorrt_llm/kernels/unfusedAttentionKernels/unfusedAttentionKernels_2_template.h`:
- Around line 1-16: Update the copyright header's year range in the file by
replacing the existing "Copyright (c) 2019-2024, NVIDIA CORPORATION." entry in
the top-of-file comment block so it reflects the latest modification year (e.g.,
change "2019-2024" to "2019-2026"); ensure the rest of the header text
(including the NAVER/CLOVA line and Apache License block) remains unchanged.
In `@cpp/tensorrt_llm/kernels/xqaDispatcher.cpp`:
- Around line 1-15: Update the copyright header block at the top of
cpp/tensorrt_llm/kernels/xqaDispatcher.cpp by changing the year range in the
comment that currently reads "2020-2024" to include 2026 (e.g., "2020-2026") so
the NVIDIA copyright header reflects the latest modification.
In `@tensorrt_llm/_torch/attention_backend/trtllm.py`:
- Around line 1-5: Add the standard NVIDIA Apache-2.0 copyright header (with
year 2026) to the top of the file
tensorrt_llm/_torch/attention_backend/trtllm.py before any imports; ensure the
header matches the project's canonical NVIDIA header text and license block and
update the file's modification year to 2026, leaving the rest of the module
(imports like math, os, weakref and dataclass/type hints) unchanged.
- Around line 513-629: The assert that checks
_TRTLLM_ENABLE_TRTLLM_GEN_ATTENTION and _trtllm_gen_reason should be replaced
with a non‑fatal warning and fall back to the existing thop attention path when
trtllm_gen.is_supported returns False for reasons other than KV cache update;
locate the assert block after the trtllm_gen.is_supported call and the
trtllm_gen_attention invocation and change it to log a warning including
_trtllm_gen_reason (use your logger or warnings.warn) and let execution continue
to the else path (thop.attention fallback) instead of raising, keeping the
special case for KV cache update handling if needed.
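The suggested change, replacing the hard assert with a logged warning and a fall-through to the thop path, could look roughly like this sketch. `dispatch_attention`, the stub backends, and the `(bool, reason)` return shape of `is_supported` are illustrative assumptions, not the real TensorRT-LLM API:

```python
import logging

logger = logging.getLogger("attention_dispatch")

def dispatch_attention(trtllm_gen, thop, params):
    """Sketch of the warn-and-fall-back pattern.

    `trtllm_gen` and `thop` stand in for the real backends; is_supported
    returning a (bool, reason) pair is an assumption for illustration.
    """
    supported, reason = trtllm_gen.is_supported(params)
    if supported:
        return trtllm_gen.attention(params)
    # Previously a hard assert: now warn with the reason and continue
    # to the existing thop attention fallback path.
    logger.warning("trtllm-gen attention not supported (%s); "
                   "falling back to thop.attention", reason)
    return thop.attention(params)
```

The key behavioral difference is that an unsupported configuration degrades to the slower path instead of aborting the run.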
In `@tensorrt_llm/_torch/modules/attention.py`:
- Around line 1633-1637: MLA.forward_context_default currently sets
attn_metadata.helix_position_offsets whenever self.enable_helix_test is true;
change it to match Attention.forward by guarding the write with both
self.enable_helix_test and self.mapping.has_cp_helix() so helix_position_offsets
is only set when Helix CP is present. Locate MLA.forward_context_default and
replace the single-condition block that assigns
attn_metadata.helix_position_offsets = position_ids with a compound condition
checking self.enable_helix_test and self.mapping.has_cp_helix() before
assigning.
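A minimal sketch of the compound guard, using stand-in objects rather than the real MLA and metadata classes (`maybe_set_helix_offsets` is a hypothetical helper, not actual module code):

```python
from types import SimpleNamespace

def maybe_set_helix_offsets(module, attn_metadata, position_ids):
    # Guard with BOTH conditions, mirroring Attention.forward: write the
    # offsets only when helix testing is enabled AND Helix CP is present.
    # `module` is a stand-in for the MLA instance.
    if module.enable_helix_test and module.mapping.has_cp_helix():
        attn_metadata.helix_position_offsets = position_ids

mod = SimpleNamespace(enable_helix_test=True,
                      mapping=SimpleNamespace(has_cp_helix=lambda: False))
meta = SimpleNamespace(helix_position_offsets=None)
maybe_set_helix_offsets(mod, meta, [0, 1, 2])  # no Helix CP: left unset
```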
---
Nitpick comments:
In `@cpp/tensorrt_llm/common/attentionOp.h`:
- Around line 147-206: enqueueContextParamsToString() currently omits the two
new Helix members; update this function to append the helix fields to the string
dump by adding lines that output this->helix_position_offsets and
this->helix_is_inactive_rank (mirror the style used for other pointer/int
members such as "block_offsets" and "cross_kv"); ensure you format them
consistently (e.g., "helix_position_offsets: " << this->helix_position_offsets
<< std::endl and "helix_is_inactive_rank: " << this->helix_is_inactive_rank <<
std::endl) so Helix-related attention debugging includes these values.
In `@tensorrt_llm/_torch/modules/attention.py`:
- Around line 737-741: The guard before clearing helix_is_inactive_rank is
redundant: replace the combined hasattr + getattr check with a single
getattr(attn_metadata, 'helix_is_inactive_rank', None) is not None check and
only then call attn_metadata.helix_is_inactive_rank.fill_(False); locate this in
the attention module where attn_metadata and its helix_is_inactive_rank
attribute are referenced and remove the initial hasattr(...) condition so the
code uses the single getattr-based null check.
- Around line 991-995: The silent fallback of rms_norm_eps when
enable_helix_test is True can hide config mismatches; update the block that sets
rms_norm_eps (referencing enable_helix_test and config.pretrained_config) to
detect whether pretrained_config actually has rms_norm_eps and, if not, emit a
warning or info log stating that the default 1e-6 is being used for helix tests
(include the model identifier or config reference if available) so callers know
a fallback occurred; keep the existing fallback value but ensure the log is
clear and only triggered when the attribute is missing.
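A hedged sketch of the suggested logging fallback; `resolve_rms_norm_eps` is a hypothetical helper name (the real code reads `config.pretrained_config` inline):

```python
import logging
from types import SimpleNamespace

logger = logging.getLogger("helix_test")
_DEFAULT_RMS_NORM_EPS = 1e-6

def resolve_rms_norm_eps(pretrained_config, model_id="<unknown>"):
    # Prefer the config's own epsilon; when it is missing, log the
    # fallback so silent config mismatches become visible.
    eps = getattr(pretrained_config, "rms_norm_eps", None)
    if eps is None:
        logger.warning(
            "pretrained_config for %s has no rms_norm_eps; "
            "defaulting to %g for helix tests",
            model_id, _DEFAULT_RMS_NORM_EPS)
        eps = _DEFAULT_RMS_NORM_EPS
    return eps
```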
- Around line 459-507: The duplicated Helix all-to-all + combine logic in
Attention._helix_post_process and MLA._attn_forward_gen should be extracted into
a shared helper (suggested name _helix_alltoall_and_combine) that accepts
(partial_o, softmax_stats, mapping, num_heads_tp_cp, value_dim,
fifo_version_override=None, use_maybe_parallel=False) and encapsulates the NCCL
branch (torch.transpose/contiguous, torch.split, alltoall_helix, transpose back,
torch.ops.trtllm.helix_post_process) and the FIFO branches
(HelixAllToAllNative.get(mapping), view/transpose patterns for fifo_version==1
and else, helix.alltoall_native, appropriate reshapes, and calls to
torch.ops.trtllm.helix_post_process_native with the correct final flag); then
replace logic in _helix_post_process to call this helper with value_dim=head_dim
and in MLA._attn_forward_gen to call it with value_dim=kv_lora_rank and
use_maybe_parallel set as before, preserving fifo_version from mapping.cp_config
and cp_size/num_tokens behavior.
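The communication plumbing in such a helper is backend-specific, but the numerical core it feeds, merging per-rank partial attention outputs using their softmax stats, can be sketched in pure Python. This is an assumed model of what a combine like `helix_post_process` computes (a log-sum-exp merge), not the kernel's actual layout or signature:

```python
import math

def local_partial(scores, values):
    """One rank's softmax-weighted sum over its local scores/values,
    plus the stats (m = local max, l = sum of exp(score - m))."""
    m = max(scores)
    l = sum(math.exp(s - m) for s in scores)
    out = [0.0] * len(values[0])
    for s, v in zip(scores, values):
        w = math.exp(s - m) / l
        for i, x in enumerate(v):
            out[i] += w * x
    return out, m, l

def combine_partials(partials):
    """Merge per-rank (out, m, l) triples so the result equals a single
    softmax over the concatenated scores."""
    m_glob = max(m for _, m, _ in partials)
    l_glob = sum(l * math.exp(m - m_glob) for _, m, l in partials)
    out = [0.0] * len(partials[0][0])
    for vec, m, l in partials:
        scale = l * math.exp(m - m_glob) / l_glob  # rescale each rank
        for i, v in enumerate(vec):
            out[i] += scale * v
    return out
```

Because each rank only needs its own max and normalizer, the all-to-all only has to move the partial outputs plus two scalars per head, which is why the softmax stats travel with `partial_o`.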
In `@tests/unittest/_torch/modules/test_mha_helix.py`:
- Around line 616-628: The helper _run_single_rank currently catches Exception
broadly and re-raises a new Exception without chaining, which loses the original
traceback; change the except block to "except Exception as err" and re-raise the
new Exception using "raise Exception(... ) from err" (or simply re-raise the
original error) so the original exception context from the call to func(rank,
...) is preserved; update references in this function around
tensorrt_llm.mpi_rank() and torch.cuda.set_device(rank) accordingly.
- Around line 311-317: Remove the redundant initial timestamp assignment: the
variable start is set to time.time() at the top and immediately overwritten
later before use; delete the first start = time.time() so only the later
assignment remains. Edit the test (around the CUDA graph setup where
use_cuda_graph, graph, and graph_output are declared) to keep a single start =
time.time() just before timing begins.
- Around line 197-205: The loop in _generate_random_weights uses an unused
variable name from attn.named_parameters(); change the loop to avoid the unused
binding by either iterating over attn.parameters() or replacing name with an
underscore (for _, param in attn.named_parameters()), then keep the existing
dtype/initialization logic for param.data so there are no unused variables
flagged.
- Around line 596-602: The test currently uses cp_allgather(ref_output,
mapping=mapping, dim=0) to broadcast the reference from rank 0, which forces all
other ranks to allocate empty tensors; replace that allgather with a broadcast
from rank 0 (e.g., torch.distributed.broadcast or your test-suite broadcast
helper) so only rank 0 provides the real data and other ranks create an
appropriately shaped/typed tensor and receive it; update the code around the
cp_allgather call (referencing cp_allgather, ref_output, and mapping) to
allocate ref_output on non-root ranks with the same shape/dtype and then call
broadcast(ref_output, src=0), removing the mapping/allgather usage.
- Line 22: Remove the typing import and migrate annotations that use List and
Optional to Python 3.10+ built-in generics: delete the line importing "List" and
"Optional" and replace any occurrences of "List[float]" with "list[float]" and
"Optional[int]" (or similar Optional[...] uses) with the union form "int | None"
(or the appropriate type | None) throughout the module; specifically update the
spots that reference the symbols "List" and "Optional" so all type annotations
use built-in generics.
Force-pushed e409754 to d046dac
Force-pushed 6c9550f to 681c940
Force-pushed edb7f6c to d90ab3a
Force-pushed d90ab3a to 7165932
Force-pushed 5416db9 to 6afc9e0
Force-pushed 6eb0db6 to 4043768
Force-pushed 6b2eafa to 4ae518f
/bot run --disable-fail-fast
PR_Github #36195 [ run ] triggered by Bot. Commit:
Force-pushed b2a951f to 579333c
/bot run --disable-fail-fast
PR_Github #36878 [ run ] triggered by Bot. Commit:
PR_Github #36878 [ run ] completed with state
mikeiovine left a comment:
Signing off on torch module changes
Force-pushed 579333c to df698fd
/bot run --disable-fail-fast
PR_Github #36965 [ run ] triggered by Bot. Commit:
PR_Github #36965 [ run ] completed with state
/bot run --disable-fail-fast
PR_Github #37023 [ run ] triggered by Bot. Commit:
Signed-off-by: Balaram Buddharaju <169953907+brb-nv@users.noreply.github.com>
Force-pushed df698fd to dafaee1
/bot run --disable-fail-fast
PR_Github #37024 [ run ] triggered by Bot. Commit:
PR_Github #37024 [ run ] completed with state
/bot run --disable-fail-fast
PR_Github #37090 [ run ] triggered by Bot. Commit:
PR_Github #37090 [ run ] completed with state
/bot run --disable-fail-fast
PR_Github #37116 [ run ] triggered by Bot. Commit:
PR_Github #37116 [ run ] completed with state
Description
This MR generalizes Helix CP from MLA-only to standard GQA/MHA.
- Moves Helix CP support into the `Attention` class.
- Moves `enable_helix_test` out of the MLA module as it's only for testing.
- Updates `splitKVCacheKernel` and `cacheFormatter` for CP-aware block distribution.

Test Coverage
PR Checklist
Please review the following before submitting your PR:
PR description clearly explains what and why. If using CodeRabbit's summary, please make sure it makes sense.
PR Follows TRT-LLM CODING GUIDELINES to the best of your knowledge.
Test cases are provided for new code paths (see test instructions)
Any new dependencies have been scanned for license and vulnerabilities
CODEOWNERS updated if ownership changes
Documentation updated as needed
Update tava architecture diagram if there is a significant design change in PR.
The reviewers assigned automatically/manually are appropriate for the PR.
Please check this after reviewing the above items as appropriate for this PR.
GitHub Bot Help
`/bot [-h] ['run', 'kill', 'skip', 'reuse-pipeline'] ...`
Provides a user-friendly way for developers to interact with a Jenkins server.
Run `/bot [-h|--help]` to print this help message. See details below for each supported subcommand.
Details
`run [--reuse-test (optional)pipeline-id --disable-fail-fast --skip-test --stage-list "A10-PyTorch-1, xxx" --gpu-type "A30, H100_PCIe" --test-backend "pytorch, cpp" --add-multi-gpu-test --only-multi-gpu-test --disable-multi-gpu-test --post-merge --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx" --detailed-log --debug (experimental)]`

Launch build/test pipelines. All previously running jobs will be killed.

- `--reuse-test (optional)pipeline-id` (OPTIONAL): Allow the new pipeline to reuse build artifacts and skip successful test stages from a specified pipeline, or from the last pipeline if no pipeline-id is indicated. If the Git commit ID has changed, this option will always be ignored. The DEFAULT behavior of the bot is to reuse build artifacts and successful test results from the last pipeline.
- `--disable-reuse-test` (OPTIONAL): Explicitly prevent the pipeline from reusing build artifacts and skipping successful test stages from a previous pipeline. Ensures that all builds and tests are run regardless of previous successes.
- `--disable-fail-fast` (OPTIONAL): Disable fail fast on build/tests/infra failures.
- `--skip-test` (OPTIONAL): Skip all test stages, but still run build stages, package stages and sanity check stages. Note: does NOT update GitHub check status.
- `--stage-list "A10-PyTorch-1, xxx"` (OPTIONAL): Only run the specified test stages. Examples: "A10-PyTorch-1, xxx". Note: does NOT update GitHub check status.
- `--gpu-type "A30, H100_PCIe"` (OPTIONAL): Only run the test stages on the specified GPU types. Examples: "A30, H100_PCIe". Note: does NOT update GitHub check status.
- `--test-backend "pytorch, cpp"` (OPTIONAL): Skip test stages which don't match the specified backends. Only supports [pytorch, cpp, tensorrt, triton]. Examples: "pytorch, cpp" (does not run test stages with tensorrt or triton backend). Note: does NOT update GitHub pipeline status.
- `--only-multi-gpu-test` (OPTIONAL): Only run the multi-GPU tests. Note: does NOT update GitHub check status.
- `--disable-multi-gpu-test` (OPTIONAL): Disable the multi-GPU tests. Note: does NOT update GitHub check status.
- `--add-multi-gpu-test` (OPTIONAL): Force run the multi-GPU tests in addition to running the L0 pre-merge pipeline.
- `--post-merge` (OPTIONAL): Run the L0 post-merge pipeline instead of the ordinary L0 pre-merge pipeline.
- `--extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx"` (OPTIONAL): Run the ordinary L0 pre-merge pipeline and specified test stages. Examples: --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx".
- `--detailed-log` (OPTIONAL): Enable flushing out all logs to the Jenkins console. This will significantly increase the log volume and may slow down the job.
- `--debug` (OPTIONAL): Experimental feature. Enable access to the CI container for debugging purposes. Note: specify exactly one stage in the `stage-list` parameter to access the appropriate container environment. Note: does NOT update GitHub check status.

For guidance on mapping tests to stage names, see `docs/source/reference/ci-overview.md` and the `scripts/test_to_stage_mapping.py` helper.

kill

`kill`: Kill all running builds associated with the pull request.

skip

`skip --comment COMMENT`: Skip testing for the latest commit on the pull request. `--comment "Reason for skipping build/test"` is required. IMPORTANT NOTE: This is dangerous since lack of user care and validation can cause top of tree to break.

reuse-pipeline

`reuse-pipeline`: Reuse a previous pipeline to validate the current commit. This action will also kill all currently running builds associated with the pull request. IMPORTANT NOTE: This is dangerous since lack of user care and validation can cause top of tree to break.