[None][feat] Avoid duplicated computation with ADP + Helix CP in GQA #11891

Merged

brb-nv merged 3 commits into NVIDIA:main from brb-nv:user/brb/avoid-duplication-for-qwen on Mar 6, 2026

Conversation

@brb-nv
Collaborator

@brb-nv brb-nv commented Mar 4, 2026

Description

This PR is the GQA counterpart of #11167.

It consolidates shared functionality between MLA() and Attention().

Test Coverage

$ pytest tests/integration/defs/accuracy/test_disaggregated_serving.py::TestQwen3_8B::test_auto_dtype_with_helix[fifo_v2-cudagraph:with_padding-pp1dp2cp2] -s -v
$ pytest tests/integration/defs/accuracy/test_disaggregated_serving.py::TestDeepSeekV3Lite::test_auto_dtype_with_helix[fifo_v2-cudagraph:with_padding-pp1dp2cp2] -s -v

PR Checklist

Please review the following before submitting your PR:

  • PR description clearly explains what and why. If using CodeRabbit's summary, please make sure it makes sense.

  • PR follows the TRT-LLM coding guidelines to the best of your knowledge.

  • Test cases are provided for new code paths (see test instructions).

  • Any new dependencies have been scanned for license and vulnerabilities.

  • CODEOWNERS updated if ownership changes.

  • Documentation updated as needed.

  • Tava architecture diagram updated if there is a significant design change in the PR.

  • The reviewers assigned automatically/manually are appropriate for the PR.

  • Please check this after reviewing the above items as appropriate for this PR.

GitHub Bot Help

To see a list of available CI bot commands, please comment /bot help.

Summary by CodeRabbit

  • Refactor
    • Enhanced residual processing in attention mechanisms for improved data flow handling
    • Optimized context parallelism utilities for efficient computation across distributed settings
    • Streamlined attention layer integration with improved parameter propagation

Signed-off-by: Balaram Buddharaju <169953907+brb-nv@users.noreply.github.com>
@brb-nv brb-nv requested review from a team as code owners March 4, 2026 04:55
@brb-nv brb-nv requested a review from yuxianq March 4, 2026 04:55
@coderabbitai
Contributor

coderabbitai bot commented Mar 4, 2026

📝 Walkthrough

The pull request adds Helix Context Parallelism (CP) support with residual forwarding in Qwen3 models. Changes introduce CP utility functions for data partitioning and cross-rank communication, extend Attention and MLA forward signatures to accept and propagate residuals, and update Qwen3 decoder layers to pass residuals through attention operations with CP-aware data handling.

Changes

Qwen3 Model CP Integration (tensorrt_llm/_torch/models/modeling_qwen3.py):
Updated Qwen3DecoderLayer.forward to unpack and forward residuals into attention calls via AllReduceParams. Added a mapping_with_cp attribute, initialized in Qwen3Model and Qwen3ForCausalLM, to enable CP/Helix mapping configuration during model construction; triggers cp_allgather across CP groups with subsequent token slicing in the final forward pass.

Attention Module CP Utilities (tensorrt_llm/_torch/modules/attention.py):
Introduced four CP helper functions (_helix_cp_pad, _helix_cp_slice, _helix_cp_allgather_input, and _helix_cp_output_projection) to handle CP-based data partitioning and cross-rank communication. Extended the Attention.forward and MLA.forward signatures to accept an optional residual and return a tuple when a residual is provided. Added a public mapping_o attribute representing the CP-output mapping for combined TP+CP scenarios. Replaced the prior private CP helpers with the new implementations.
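To make the partitioning idea concrete, here is a minimal pure-Python sketch of what helpers like _helix_cp_pad and _helix_cp_slice are described as doing: pad the token dimension so it splits evenly across the CP group, then let each CP rank keep only its contiguous shard. The function names and list-based tensors are illustrative, not the actual TRT-LLM implementation.

```python
# Illustrative sketch (not the TRT-LLM code) of CP pad/slice partitioning.

def helix_cp_pad(tokens: list, cp_size: int, pad_value=0) -> list:
    """Pad the sequence so it divides evenly across cp_size ranks."""
    remainder = len(tokens) % cp_size
    if remainder:
        tokens = tokens + [pad_value] * (cp_size - remainder)
    return tokens

def helix_cp_slice(tokens: list, cp_rank: int, cp_size: int) -> list:
    """Return the contiguous shard owned by cp_rank after padding."""
    padded = helix_cp_pad(tokens, cp_size)
    shard = len(padded) // cp_size
    return padded[cp_rank * shard:(cp_rank + 1) * shard]

tokens = [1, 2, 3, 4, 5]
print(helix_cp_slice(tokens, 0, 2))  # [1, 2, 3]
print(helix_cp_slice(tokens, 1, 2))  # [4, 5, 0]  (last slot is padding)
```

The padding step is what makes the later per-rank slice a fixed-size, branch-free operation regardless of the original sequence length.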

Sequence Diagram(s)

sequenceDiagram
    participant DecoderLayer as Qwen3DecoderLayer
    participant Attn as Attention/MLA
    participant CPUtil as CP Utilities
    participant CPGroup as CP Rank Group

    DecoderLayer->>Attn: forward(hidden_states, residual, AllReduceParams)
    activate Attn
    Attn->>CPUtil: _helix_cp_allgather_input(hidden_states, attn_metadata, mapping)
    activate CPUtil
    CPUtil->>CPGroup: cp_allgather across ranks
    CPGroup-->>CPUtil: gathered data
    CPUtil-->>Attn: concatenated input
    deactivate CPUtil
    Attn->>Attn: attention computation
    Attn->>CPUtil: _helix_cp_output_projection(o_proj, attn_output, residual, mapping_o)
    activate CPUtil
    CPUtil->>CPUtil: CP-aware projection & slice
    CPUtil->>CPUtil: handle residual slicing if provided
    CPUtil-->>Attn: (projected_output, sliced_residual)
    deactivate CPUtil
    Attn-->>DecoderLayer: (hidden_states, residual) or tensor
    deactivate Attn
    DecoderLayer->>DecoderLayer: propagate residual to next layer
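The flow in the sequence diagram can be sketched end to end in plain Python: gather every rank's shard of the hidden states, run attention over the full sequence once, then slice the result back down to this rank's tokens. All names here (cp_allgather, helix_cp_forward, attention_fn) are hypothetical stand-ins for the roles the diagram assigns to _helix_cp_allgather_input and _helix_cp_slice.

```python
# Illustrative walk-through of the diagram's forward flow (hypothetical names).

def cp_allgather(shards):
    """Stand-in for the cross-rank allgather: concatenate all rank shards."""
    full = []
    for shard in shards:
        full.extend(shard)
    return full

def helix_cp_forward(rank_shards, cp_rank, attention_fn):
    cp_size = len(rank_shards)
    full_input = cp_allgather(rank_shards)       # gather across the CP group
    full_output = attention_fn(full_input)       # attention on the full sequence
    shard = len(full_output) // cp_size
    start = cp_rank * shard                      # keep only this rank's tokens
    return full_output[start:start + shard]

shards = [[1, 2], [3, 4]]
double = lambda xs: [2 * x for x in xs]          # toy "attention" computation
print(helix_cp_forward(shards, 0, double))       # [2, 4]
print(helix_cp_forward(shards, 1, double))       # [6, 8]
```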

Estimated Code Review Effort

🎯 4 (Complex) | ⏱️ ~45 minutes

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

  • Docstring Coverage ⚠️ Warning: docstring coverage is 66.67%, below the required threshold of 80.00%. Resolution: write docstrings for the functions missing them.

✅ Passed checks (2 passed)

  • Title check ✅ Passed: the title clearly summarizes the main objective, avoiding duplicated computation when using ADP (Attention Data-Parallel) with Helix CP (Context Parallel) in GQA (Grouped Query Attention).
  • Description check ✅ Passed: the PR description includes the required sections, a clear explanation of purpose (GQA counterpart work), specific test coverage with pytest commands, and a completed checklist indicating author review.



Comment @coderabbitai help to get the list of available commands and usage tips.

@coderabbitai coderabbitai bot left a comment

🧹 Nitpick comments (2)
tensorrt_llm/_torch/modules/attention.py (1)

279-280: The Ellipsis (...) sentinel pattern for optional residual handling.

The `residual is not ...` check accommodates MLA.forward, where residual defaults to `...` (Ellipsis) as a sentinel to distinguish "no residual provided" from residual=None. This is a valid runtime pattern, though the type annotation `residual: Optional[torch.Tensor]` doesn't capture it.

Consider documenting this sentinel pattern in the docstring for clarity, since the interaction between Attention.forward (where residual=None is the default) and MLA.forward (where residual=... is the default) relies on this distinction.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tensorrt_llm/_torch/modules/attention.py` around lines 279 - 280, The code
uses the Ellipsis sentinel (residual == ...) to distinguish "no residual
provided" (MLA.forward default) from an explicit None (Attention.forward
default); document this pattern in the relevant docstrings so future readers
understand why the check `residual is not ...` exists. Update the docstrings on
MLA.forward and Attention.forward to state that the residual parameter can be
Ellipsis to mean "not provided" (and explain the semantic difference from None),
reference the residual parameter and the runtime check (`residual is not ...`)
and, if desired, mention the helper `_helix_cp_slice` that is run when a
residual is present.
tensorrt_llm/_torch/models/modeling_qwen3.py (1)

285-300: Consider documenting the _frozen attribute access pattern.

The temporary unfreezing of model_config._frozen to modify the mapping is necessary for the CP-to-TP repurposing logic. However, accessing this private attribute (_frozen) creates a coupling with the internal implementation of ModelConfig.

If this pattern is expected to be used elsewhere or ModelConfig changes, this could silently break. Consider either:

  1. Adding a comment explaining this is intentional and tested, or
  2. Exposing a public method on ModelConfig like with_mapping(new_mapping) that handles the freeze/unfreeze safely.

The logic itself is correct: attention layers use the original CP mapping via mapping_with_cp, while other components (MLP, etc.) see the repurposed TP mapping.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tensorrt_llm/_torch/models/modeling_qwen3.py` around lines 285 - 300, This
code temporarily toggles the private flag model_config._frozen to swap mappings
(using model_config.mapping.repurpose_helix_cp_to_tp) before constructing
Qwen3Model and then restores it; document this pattern: add a concise inline
comment above the block that mentions the intentional unfreeze/freeze of
model_config._frozen, why mapping_with_cp is preserved, and that this behavior
is tested/required for CP→TP repurposing (referencing model_config._frozen,
mapping_with_cp, repurpose_helix_cp_to_tp, and Qwen3Model); alternatively, if
preferred, add a public helper on ModelConfig (e.g., with_mapping(new_mapping))
that encapsulates the freeze/unfreeze and call that here instead—either add the
explanatory comment near the shown block or refactor to use
ModelConfig.with_mapping to avoid direct _frozen access.
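The reviewer's second suggestion, a public helper that encapsulates the freeze/unfreeze, could look like a context manager on the config. Everything below is a hypothetical sketch: ModelConfig, _frozen, and with_mapping follow the names used in the review comment, not the actual TRT-LLM API.

```python
# Hypothetical sketch of the suggested ModelConfig.with_mapping helper.
from contextlib import contextmanager

class ModelConfig:
    def __init__(self, mapping):
        self.mapping = mapping
        self._frozen = True

    def __setattr__(self, name, value):
        # Reject mutation while frozen, except for the freeze flag itself.
        if getattr(self, "_frozen", False) and name != "_frozen":
            raise AttributeError(f"config is frozen; cannot set {name!r}")
        super().__setattr__(name, value)

    @contextmanager
    def with_mapping(self, new_mapping):
        """Temporarily install new_mapping, restoring state afterwards."""
        old_mapping, old_frozen = self.mapping, self._frozen
        self._frozen = False
        self.mapping = new_mapping
        try:
            yield self
        finally:
            self.mapping = old_mapping
            self._frozen = old_frozen

cfg = ModelConfig(mapping="cp_mapping")
with cfg.with_mapping("tp_mapping"):
    print(cfg.mapping)  # tp_mapping (repurposed view during construction)
print(cfg.mapping)      # cp_mapping (original restored, config refrozen)
```

Centralizing the unfreeze/refreeze this way removes the call-site coupling to the private _frozen flag and guarantees restoration even if model construction raises.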

ℹ️ Review info
Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 559091fc-4df2-4371-818b-a83d68fd0ee6

📥 Commits

Reviewing files that changed from the base of the PR and between a106419 and b042381.

📒 Files selected for processing (2)
  • tensorrt_llm/_torch/models/modeling_qwen3.py
  • tensorrt_llm/_torch/modules/attention.py

Signed-off-by: Balaram Buddharaju <169953907+brb-nv@users.noreply.github.com>
@brb-nv brb-nv requested a review from a team as a code owner March 4, 2026 05:09
@brb-nv brb-nv requested a review from hlu1 March 4, 2026 05:09
Collaborator

@2ez4bz 2ez4bz left a comment

Approving

@brb-nv
Collaborator Author

brb-nv commented Mar 5, 2026

/bot run --disable-fail-fast

@tensorrt-cicd
Collaborator

PR_Github #37920 [ run ] triggered by Bot. Commit: e93b873 Link to invocation

Signed-off-by: Balaram Buddharaju <169953907+brb-nv@users.noreply.github.com>
@brb-nv brb-nv force-pushed the user/brb/avoid-duplication-for-qwen branch from e93b873 to e1d12bb Compare March 6, 2026 01:12
Collaborator

@byshiue byshiue left a comment


LGTM

@brb-nv
Collaborator Author

brb-nv commented Mar 6, 2026

/bot run --disable-fail-fast

@tensorrt-cicd
Collaborator

PR_Github #37935 [ run ] triggered by Bot. Commit: e1d12bb Link to invocation

@brb-nv brb-nv enabled auto-merge (squash) March 6, 2026 03:27
@tensorrt-cicd
Collaborator

PR_Github #37935 [ run ] completed with state SUCCESS. Commit: e1d12bb
/LLM/main/L0_MergeRequest_PR pipeline #29379 completed with status: 'SUCCESS'

Link to invocation

@brb-nv brb-nv merged commit c6c6dc1 into NVIDIA:main Mar 6, 2026
5 checks passed
dominicshanshan pushed a commit to dominicshanshan/TensorRT-LLM that referenced this pull request Mar 9, 2026
…VIDIA#11891)

Signed-off-by: Balaram Buddharaju <169953907+brb-nv@users.noreply.github.com>

Labels

None yet

Projects

None yet

6 participants