[None][feat] Avoid duplicated computation with ADP + Helix CP in GQA #11891

Merged

brb-nv merged 3 commits into NVIDIA:main from brb-nv:user/brb/avoid-duplication-for-qwen on Mar 6, 2026

Conversation

@brb-nv
Collaborator

@brb-nv brb-nv commented Mar 4, 2026

Description

This PR is the GQA counterpart of #11167.

It consolidates shared functionality between MLA() and Attention().

Test Coverage

$ pytest tests/integration/defs/accuracy/test_disaggregated_serving.py::TestQwen3_8B::test_auto_dtype_with_helix[fifo_v2-cudagraph:with_padding-pp1dp2cp2] -s -v
$ pytest tests/integration/defs/accuracy/test_disaggregated_serving.py::TestDeepSeekV3Lite::test_auto_dtype_with_helix[fifo_v2-cudagraph:with_padding-pp1dp2cp2] -s -v

PR Checklist

Please review the following before submitting your PR:

  • PR description clearly explains what and why. If using CodeRabbit's summary, please make sure it makes sense.

  • PR follows the TRT-LLM coding guidelines to the best of your knowledge.

  • Test cases are provided for new code paths (see test instructions).

  • Any new dependencies have been scanned for license and vulnerabilities.

  • CODEOWNERS updated if ownership changes.

  • Documentation updated as needed.

  • Tava architecture diagram updated if there is a significant design change in the PR.

  • The reviewers assigned automatically/manually are appropriate for the PR.

  • Please check this after reviewing the above items as appropriate for this PR.

GitHub Bot Help

To see a list of available CI bot commands, please comment /bot help.

Summary by CodeRabbit

  • Refactor
    • Enhanced residual processing in attention mechanisms for improved data flow handling
    • Optimized context parallelism utilities for efficient computation across distributed settings
    • Streamlined attention layer integration with improved parameter propagation

Signed-off-by: Balaram Buddharaju <169953907+brb-nv@users.noreply.github.com>
@brb-nv brb-nv requested review from a team as code owners March 4, 2026 04:55
@brb-nv brb-nv requested a review from yuxianq March 4, 2026 04:55
@coderabbitai
Contributor

coderabbitai bot commented Mar 4, 2026

📝 Walkthrough

The pull request adds Helix Context Parallelism (CP) support with residual forwarding in Qwen3 models. Changes introduce CP utility functions for data partitioning and cross-rank communication, extend Attention and MLA forward signatures to accept and propagate residuals, and update Qwen3 decoder layers to pass residuals through attention operations with CP-aware data handling.

Changes

Qwen3 Model CP Integration (tensorrt_llm/_torch/models/modeling_qwen3.py):
Updated Qwen3DecoderLayer.forward to unpack and forward residuals into attention calls via AllReduceParams. Added a mapping_with_cp attribute, initialized in Qwen3Model and Qwen3ForCausalLM, to enable CP/Helix mapping configuration during model construction; triggers cp_allgather across CP groups with subsequent token slicing in the final forward pass.

Attention Module CP Utilities (tensorrt_llm/_torch/modules/attention.py):
Introduced four CP helper functions (_helix_cp_pad, _helix_cp_slice, _helix_cp_allgather_input, and _helix_cp_output_projection) to handle CP-based data partitioning and cross-rank communication. Extended the Attention.forward and MLA.forward signatures to accept an optional residual and return a tuple when a residual is provided. Added a public mapping_o attribute representing the CP-output mapping for combined TP+CP scenarios. Replaced the prior private CP helpers with the new implementations.
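To make the partitioning idea concrete, here is a minimal pure-Python sketch of what helpers like _helix_cp_pad and _helix_cp_slice are described as doing: pad the token dimension so it splits evenly across the CP group, then let each CP rank keep only its contiguous shard. The function names and list-based tensors are illustrative, not the actual TRT-LLM implementation.

```python
# Illustrative sketch (not the TRT-LLM code) of CP pad/slice partitioning.

def helix_cp_pad(tokens: list, cp_size: int, pad_value=0) -> list:
    """Pad the sequence so it divides evenly across cp_size ranks."""
    remainder = len(tokens) % cp_size
    if remainder:
        tokens = tokens + [pad_value] * (cp_size - remainder)
    return tokens

def helix_cp_slice(tokens: list, cp_rank: int, cp_size: int) -> list:
    """Return the contiguous shard owned by cp_rank after padding."""
    padded = helix_cp_pad(tokens, cp_size)
    shard = len(padded) // cp_size
    return padded[cp_rank * shard:(cp_rank + 1) * shard]

tokens = [1, 2, 3, 4, 5]
print(helix_cp_slice(tokens, 0, 2))  # [1, 2, 3]
print(helix_cp_slice(tokens, 1, 2))  # [4, 5, 0]  (last slot is padding)
```

The padding step is what makes the later per-rank slice a fixed-size, branch-free operation regardless of the original sequence length.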

Sequence Diagram(s)

sequenceDiagram
    participant DecoderLayer as Qwen3DecoderLayer
    participant Attn as Attention/MLA
    participant CPUtil as CP Utilities
    participant CPGroup as CP Rank Group

    DecoderLayer->>Attn: forward(hidden_states, residual, AllReduceParams)
    activate Attn
    Attn->>CPUtil: _helix_cp_allgather_input(hidden_states, attn_metadata, mapping)
    activate CPUtil
    CPUtil->>CPGroup: cp_allgather across ranks
    CPGroup-->>CPUtil: gathered data
    CPUtil-->>Attn: concatenated input
    deactivate CPUtil
    Attn->>Attn: attention computation
    Attn->>CPUtil: _helix_cp_output_projection(o_proj, attn_output, residual, mapping_o)
    activate CPUtil
    CPUtil->>CPUtil: CP-aware projection & slice
    CPUtil->>CPUtil: handle residual slicing if provided
    CPUtil-->>Attn: (projected_output, sliced_residual)
    deactivate CPUtil
    Attn-->>DecoderLayer: (hidden_states, residual) or tensor
    deactivate Attn
    DecoderLayer->>DecoderLayer: propagate residual to next layer
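The flow in the sequence diagram can be sketched end to end in plain Python: gather every rank's shard of the hidden states, run attention over the full sequence once, then slice the result back down to this rank's tokens. All names here (cp_allgather, helix_cp_forward, attention_fn) are hypothetical stand-ins for the roles the diagram assigns to _helix_cp_allgather_input and _helix_cp_slice.

```python
# Illustrative walk-through of the diagram's forward flow (hypothetical names).

def cp_allgather(shards):
    """Stand-in for the cross-rank allgather: concatenate all rank shards."""
    full = []
    for shard in shards:
        full.extend(shard)
    return full

def helix_cp_forward(rank_shards, cp_rank, attention_fn):
    cp_size = len(rank_shards)
    full_input = cp_allgather(rank_shards)       # gather across the CP group
    full_output = attention_fn(full_input)       # attention on the full sequence
    shard = len(full_output) // cp_size
    start = cp_rank * shard                      # keep only this rank's tokens
    return full_output[start:start + shard]

shards = [[1, 2], [3, 4]]
double = lambda xs: [2 * x for x in xs]          # toy "attention" computation
print(helix_cp_forward(shards, 0, double))       # [2, 4]
print(helix_cp_forward(shards, 1, double))       # [6, 8]
```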

Estimated Code Review Effort

🎯 4 (Complex) | ⏱️ ~45 minutes

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

  • Docstring Coverage ⚠️ Warning: docstring coverage is 66.67%, below the required threshold of 80.00%. Resolution: write docstrings for the functions missing them.

✅ Passed checks (2 passed)

  • Title check ✅ Passed: the title clearly summarizes the main objective, avoiding duplicated computation when using ADP (Attention Data-Parallel) with Helix CP (Context Parallel) in GQA (Grouped Query Attention).
  • Description check ✅ Passed: the PR description includes the required sections, a clear explanation of purpose (GQA counterpart work), specific test coverage with pytest commands, and a completed checklist indicating author review.



Comment @coderabbitai help to get the list of available commands and usage tips.

@coderabbitai coderabbitai bot left a comment

🧹 Nitpick comments (2)
tensorrt_llm/_torch/modules/attention.py (1)

279-280: The Ellipsis (...) sentinel pattern for optional residual handling.

The `residual is not ...` check accommodates MLA.forward, where residual defaults to `...` (Ellipsis) as a sentinel to distinguish "no residual provided" from residual=None. This is a valid runtime pattern, though the type annotation `residual: Optional[torch.Tensor]` doesn't capture it.

Consider documenting this sentinel pattern in the docstring for clarity, since the interaction between Attention.forward (where residual=None is the default) and MLA.forward (where residual=... is the default) relies on this distinction.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tensorrt_llm/_torch/modules/attention.py` around lines 279 - 280, The code
uses the Ellipsis sentinel (residual == ...) to distinguish "no residual
provided" (MLA.forward default) from an explicit None (Attention.forward
default); document this pattern in the relevant docstrings so future readers
understand why the check `residual is not ...` exists. Update the docstrings on
MLA.forward and Attention.forward to state that the residual parameter can be
Ellipsis to mean "not provided" (and explain the semantic difference from None),
reference the residual parameter and the runtime check (`residual is not ...`)
and, if desired, mention the helper `_helix_cp_slice` that is run when a
residual is present.
tensorrt_llm/_torch/models/modeling_qwen3.py (1)

285-300: Consider documenting the _frozen attribute access pattern.

The temporary unfreezing of model_config._frozen to modify the mapping is necessary for the CP-to-TP repurposing logic. However, accessing this private attribute (_frozen) creates a coupling with the internal implementation of ModelConfig.

If this pattern is expected to be used elsewhere or ModelConfig changes, this could silently break. Consider either:

  1. Adding a comment explaining this is intentional and tested, or
  2. Exposing a public method on ModelConfig like with_mapping(new_mapping) that handles the freeze/unfreeze safely.

The logic itself is correct: attention layers use the original CP mapping via mapping_with_cp, while other components (MLP, etc.) see the repurposed TP mapping.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tensorrt_llm/_torch/models/modeling_qwen3.py` around lines 285 - 300, This
code temporarily toggles the private flag model_config._frozen to swap mappings
(using model_config.mapping.repurpose_helix_cp_to_tp) before constructing
Qwen3Model and then restores it; document this pattern: add a concise inline
comment above the block that mentions the intentional unfreeze/freeze of
model_config._frozen, why mapping_with_cp is preserved, and that this behavior
is tested/required for CP→TP repurposing (referencing model_config._frozen,
mapping_with_cp, repurpose_helix_cp_to_tp, and Qwen3Model); alternatively, if
preferred, add a public helper on ModelConfig (e.g., with_mapping(new_mapping))
that encapsulates the freeze/unfreeze and call that here instead—either add the
explanatory comment near the shown block or refactor to use
ModelConfig.with_mapping to avoid direct _frozen access.
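The reviewer's second suggestion, a public helper that encapsulates the freeze/unfreeze, could look like a context manager on the config. Everything below is a hypothetical sketch: ModelConfig, _frozen, and with_mapping follow the names used in the review comment, not the actual TRT-LLM API.

```python
# Hypothetical sketch of the suggested ModelConfig.with_mapping helper.
from contextlib import contextmanager

class ModelConfig:
    def __init__(self, mapping):
        self.mapping = mapping
        self._frozen = True

    def __setattr__(self, name, value):
        # Reject mutation while frozen, except for the freeze flag itself.
        if getattr(self, "_frozen", False) and name != "_frozen":
            raise AttributeError(f"config is frozen; cannot set {name!r}")
        super().__setattr__(name, value)

    @contextmanager
    def with_mapping(self, new_mapping):
        """Temporarily install new_mapping, restoring state afterwards."""
        old_mapping, old_frozen = self.mapping, self._frozen
        self._frozen = False
        self.mapping = new_mapping
        try:
            yield self
        finally:
            self.mapping = old_mapping
            self._frozen = old_frozen

cfg = ModelConfig(mapping="cp_mapping")
with cfg.with_mapping("tp_mapping"):
    print(cfg.mapping)  # tp_mapping (repurposed view during construction)
print(cfg.mapping)      # cp_mapping (original restored, config refrozen)
```

Centralizing the unfreeze/refreeze this way removes the call-site coupling to the private _frozen flag and guarantees restoration even if model construction raises.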

ℹ️ Review info
Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 559091fc-4df2-4371-818b-a83d68fd0ee6

📥 Commits

Reviewing files that changed from the base of the PR and between a106419 and b042381.

📒 Files selected for processing (2)
  • tensorrt_llm/_torch/models/modeling_qwen3.py
  • tensorrt_llm/_torch/modules/attention.py

Signed-off-by: Balaram Buddharaju <169953907+brb-nv@users.noreply.github.com>
@brb-nv brb-nv requested a review from a team as a code owner March 4, 2026 05:09
@brb-nv brb-nv requested a review from hlu1 March 4, 2026 05:09
Collaborator

@2ez4bz 2ez4bz left a comment

Approving

@brb-nv
Collaborator Author

brb-nv commented Mar 5, 2026

/bot run --disable-fail-fast

@tensorrt-cicd
Collaborator

PR_Github #37920 [ run ] triggered by Bot. Commit: e93b873 Link to invocation

Signed-off-by: Balaram Buddharaju <169953907+brb-nv@users.noreply.github.com>
@brb-nv brb-nv force-pushed the user/brb/avoid-duplication-for-qwen branch from e93b873 to e1d12bb Compare March 6, 2026 01:12
Collaborator

@byshiue byshiue left a comment


LGTM

@brb-nv
Collaborator Author

brb-nv commented Mar 6, 2026

/bot run --disable-fail-fast

@tensorrt-cicd
Collaborator

PR_Github #37935 [ run ] triggered by Bot. Commit: e1d12bb Link to invocation

@brb-nv brb-nv enabled auto-merge (squash) March 6, 2026 03:27
@tensorrt-cicd
Collaborator

PR_Github #37935 [ run ] completed with state SUCCESS. Commit: e1d12bb
/LLM/main/L0_MergeRequest_PR pipeline #29379 completed with status: 'SUCCESS'

Link to invocation

@brb-nv brb-nv merged commit c6c6dc1 into NVIDIA:main Mar 6, 2026
5 checks passed
dominicshanshan pushed a commit to dominicshanshan/TensorRT-LLM that referenced this pull request Mar 9, 2026
…VIDIA#11891)

Signed-off-by: Balaram Buddharaju <169953907+brb-nv@users.noreply.github.com>

Labels

None yet

Projects

None yet

6 participants