[None][refactor] VisualGen attention backend refactor #12663
chang-l merged 3 commits into NVIDIA:main from
Conversation
📝 Walkthrough
This PR introduces a standardized attention backend interface.
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~25 minutes
🚥 Pre-merge checks: ✅ 2 passed | ❌ 1 failed (1 warning)
Actionable comments posted: 2
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (3)
tensorrt_llm/_torch/visual_gen/attention_backend/trtllm.py (2)
1-1: ⚠️ Potential issue | 🟡 Minor
Update copyright year to 2026.
Per coding guidelines, the copyright year should reflect the latest meaningful modification.
Proposed fix
```diff
-# SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
```

As per coding guidelines: "All TensorRT-LLM Open Source Software code should contain an NVIDIA copyright header that includes the year of its latest meaningful modification."
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tensorrt_llm/_torch/visual_gen/attention_backend/trtllm.py` at line 1, Update the SPDX copyright header in the file by changing the year from 2025 to 2026: locate the line beginning with "# SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved." (the SPDX header in trtllm.py) and revise the year to 2026 so it reads 2026.
242-245: ⚠️ Potential issue | 🟡 Minor
Handle mismatched k/v None states.
If only one of k or v is None (but not both), the code falls through to _concat_qkv, which will fail when accessing .view() on None. Consider adding validation or explicitly handling this edge case.
Proposed fix
```diff
+ if (k is None) != (v is None):
+     raise ValueError("k and v must both be None (fused QKV) or both provided")
+
  if k is None and v is None:
      qkv = q.reshape(batch_size * seq_len, -1)
  else:
      qkv = self._concat_qkv(q, k, v, batch_size, seq_len, kv_seq_len)
```

🤖 Prompt for AI Agents
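Stripped of the surrounding class, the suggested guard can be sketched as a runnable stand-alone function. This is a hypothetical simplification: `build_qkv` and the plain lists below stand in for the real `_concat_qkv` method and torch tensors.

```python
def build_qkv(q, k=None, v=None):
    # Fail fast on the mismatched case instead of crashing later inside
    # the concat path, where .view() would be called on None.
    if (k is None) != (v is None):
        raise ValueError("k and v must both be None (fused QKV) or both provided")
    if k is None and v is None:
        return ("fused", q)         # q already holds the fused QKV tensor
    return ("separate", q + k + v)  # stand-in for the real concatenation
```

With the guard in place, passing only one of k/v raises immediately with a clear message rather than an opaque `AttributeError` from `NoneType`.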
Verify each finding against the current code and only fix it if needed. In `@tensorrt_llm/_torch/visual_gen/attention_backend/trtllm.py` around lines 242 - 245, The code currently only handles both k and v being None or both present; add a guard in the block using q, k, v (around the qkv assignment) to detect mismatched None states (when (k is None) != (v is None)) and handle it: either raise a clear ValueError mentioning _concat_qkv and the mismatched k/v state, or normalize by setting the missing tensor to the present one before calling self._concat_qkv; ensure the check references k, v, self._concat_qkv, and q.reshape so the intent and location are clear.
tensorrt_llm/_torch/visual_gen/attention_backend/vanilla.py (1)
1-1: ⚠️ Potential issue | 🟡 Minor
Update copyright year to 2026.
Per coding guidelines, the NVIDIA copyright header should include the year of the latest meaningful modification. Since this file is being modified in 2026, the copyright year should be updated.
Proposed fix
```diff
-# SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
```

As per coding guidelines: "All TensorRT-LLM Open Source Software code should contain an NVIDIA copyright header that includes the year of its latest meaningful modification."
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tensorrt_llm/_torch/visual_gen/attention_backend/vanilla.py` at line 1, Update the copyright year in the SPDX header: locate the SPDX comment line that currently reads "# SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved." and change the year from 2025 to 2026 so the header reflects the latest modification year.
🧹 Nitpick comments (3)
tensorrt_llm/_torch/visual_gen/modules/attention.py (1)
254-254: Call the backend module via __call__, not forward() directly.
Directly invoking self.attn.forward(...) bypasses nn.Module.__call__, so forward hooks and wrappers on the backend never run. Using self.attn(...) preserves the normal PyTorch dispatch path and still reaches the same implementation.
Suggested fix
```diff
- out = self.attn.forward(q=q, k=k, v=v, **kwargs)
+ out = self.attn(q=q, k=k, v=v, **kwargs)
```

🤖 Prompt for AI Agents
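The hook-skipping behavior this comment describes can be reproduced without PyTorch. The `MiniModule` class below is a simplified stand-in for `nn.Module.__call__` dispatch, not the real implementation: `__call__` runs registered forward hooks, while calling `.forward()` directly silently skips them.

```python
class MiniModule:
    """Simplified stand-in for nn.Module dispatch (illustrative only)."""

    def __init__(self):
        self._forward_hooks = []

    def register_forward_hook(self, fn):
        self._forward_hooks.append(fn)

    def __call__(self, *args, **kwargs):
        out = self.forward(*args, **kwargs)
        for hook in self._forward_hooks:
            hook(self, args, out)  # hooks only run on the __call__ path
        return out


class Backend(MiniModule):
    def forward(self, q):
        return q * 2


hook_outputs = []
backend = Backend()
backend.register_forward_hook(lambda mod, args, out: hook_outputs.append(out))

via_call = backend(3)             # dispatches through __call__, hook fires
via_forward = backend.forward(3)  # bypasses __call__, hook does not fire
```

Both calls return the same value, but only the first one reaches the hook, which is exactly why `self.attn(...)` is preferred over `self.attn.forward(...)`.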
Verify each finding against the current code and only fix it if needed. In `@tensorrt_llm/_torch/visual_gen/modules/attention.py` at line 254, Replace the direct backend forward invocation with a normal module call so PyTorch hooks/wrappers run: change the call site that currently does out = self.attn.forward(q=q, k=k, v=v, **kwargs) to use the module __call__ (e.g., out = self.attn(q=q, k=k, v=v, **kwargs)) in the method where self.attn is used so that nn.Module.__call__ dispatch executes; ensure any keyword/positional arguments remain identical to preserve behavior.
tests/unittest/_torch/visual_gen/multi_gpu/test_ulysses_attention.py (1)
455-466: Keep the new backend test double typed like the interface it is validating.
_FusedVanillaAttention.forward() drops the tensor and return annotations right where this class starts modeling the shared AttentionBackend contract. Adding them here makes interface drift visible to type checkers instead of silently collapsing to Any.
Suggested fix
```diff
- def forward(self, q, k=None, v=None, **kwargs):
+ def forward(
+     self,
+     q: torch.Tensor,
+     k: torch.Tensor | None = None,
+     v: torch.Tensor | None = None,
+     **kwargs,
+ ) -> torch.Tensor:
```

As per coding guidelines, "Static type checking is opt-in by submodule PICs in Python. Always annotate functions with return types, and make the return type None if the function does not return anything."
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tests/unittest/_torch/visual_gen/multi_gpu/test_ulysses_attention.py` around lines 455 - 466, The new test backend _FusedVanillaAttention must preserve the same typed signature as the AttentionBackend interface: add explicit parameter and return type annotations on _FusedVanillaAttention.forward (e.g. q: torch.Tensor, k: Optional[torch.Tensor]=None, v: Optional[torch.Tensor]=None, **kwargs) -> torch.Tensor (or -> None if the interface forward returns None) so static type checkers catch drift; ensure you import Optional and torch.Tensor and match the exact return type declared on AttentionBackend.forward.
tensorrt_llm/_torch/visual_gen/attention_backend/parallel.py (1)
166-168: support_fused_qkv() is broader than this wrapper's public API.
UlyssesAttention.forward() still requires separate q, k, and v and only builds fused QKV internally. Returning True here makes the wrapper look interchangeable with backends that actually accept fused inputs, which is not true today. Either document this as an internal optimization hint or make the wrapper accept fused inputs as well.
Verify each finding against the current code and only fix it if needed. In `@tensorrt_llm/_torch/visual_gen/attention_backend/parallel.py` around lines 166 - 168, The method support_fused_qkv() currently returns True but the wrapper’s public API (UlyssesAttention.forward) still requires separate q, k, v; change support_fused_qkv() to return False so the wrapper is not advertised as accepting fused inputs, or alternatively implement fused-input handling in UlyssesAttention.forward (accept a single fused qkv tensor, split it into q/k/v before existing logic) and update any input checks; prefer the first option (set support_fused_qkv() -> False) unless you also add fused-input parsing in UlyssesAttention.forward.
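To see why the flag matters, here is an illustrative dispatch sketch using stand-in classes (not the real TensorRT-LLM backends): callers branch on `support_fused_qkv()`, so a wrapper whose public `forward()` still needs separate q, k, v should advertise `False`.

```python
class FusedBackend:
    """Accepts a single pre-fused QKV tensor via its public forward()."""

    def support_fused_qkv(self):
        return True

    def forward(self, q, k=None, v=None):
        return ("fused", q) if k is None and v is None else ("separate", (q, k, v))


class UlyssesWrapper:
    """Builds fused QKV internally, but its public API still needs q, k, v."""

    def support_fused_qkv(self):
        # Advertising True here would let dispatch() call forward(q) alone,
        # which this wrapper's signature cannot handle.
        return False

    def forward(self, q, k, v):
        return ("separate", (q, k, v))


def dispatch(backend, q, k, v):
    # Callers pre-fuse inputs only when the backend says it can take them.
    if backend.support_fused_qkv():
        return backend.forward(q)
    return backend.forward(q, k, v)
```

If `UlyssesWrapper.support_fused_qkv()` returned `True`, `dispatch()` would call `forward(q)` with two required arguments missing and raise a `TypeError`, which is the interchangeability problem the comment points at.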
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@tensorrt_llm/_torch/visual_gen/attention_backend/interface.py`:
- Around line 48-55: Update the abstract base class method signature for
AttentionBackend.forward to accept Optional[torch.Tensor] for k and v (matching
implementations like TRTLLMAttention and call sites such as
UlyssesAttention._forward_fused which pass k=None, v=None), and split the
one-line ellipsis stubs into their own lines (replace single-line "->
torch.Tensor: ..." with a proper signature line followed by a separate line
containing "..." for both forward and the preferred_layout property) so the ABC
matches implementations and fixes the E704 style violation.
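A sketch of the ABC shape this prompt describes, under stated assumptions: `Any` stands in for `torch.Tensor`, `preferred_layout` is assumed to return a string, and the layout name is a placeholder.

```python
from abc import ABC, abstractmethod
from typing import Any, Optional


class AttentionBackend(ABC):
    @abstractmethod
    def forward(
        self,
        q: Any,
        k: Optional[Any] = None,  # Optional, matching call sites that pass k=None
        v: Optional[Any] = None,
        **kwargs: Any,
    ) -> Any:
        # Stub body on its own line avoids the E704 one-line "def f(): ..." style.
        ...

    @property
    @abstractmethod
    def preferred_layout(self) -> str:
        ...


class VanillaBackend(AttentionBackend):
    """Minimal concrete subclass showing the signatures line up."""

    def forward(self, q: Any, k: Optional[Any] = None,
                v: Optional[Any] = None, **kwargs: Any) -> Any:
        return q

    @property
    def preferred_layout(self) -> str:
        return "bshd"  # placeholder layout name, not from the source
```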
In `@tensorrt_llm/_torch/visual_gen/attention_backend/trtllm.py`:
- Line 148: The MRO breaks because AttentionBackend.__init__ in the main backend
doesn't call super().__init__, so nn.Module.__init__ is never run when creating
TrtllmAttention (which inherits BaseTrtllmAttention, AttentionBackend); fix by
updating the main backend AttentionBackend.__init__ to call
super().__init__(**kwargs) (preserving existing logic) so the init chain reaches
visual_gen AttentionBackend -> nn.Module, or alternatively ensure
TrtllmAttention.__init__ explicitly invokes nn.Module.__init__(self) before
other inits; reference AttentionBackend.__init__, TrtllmAttention, and
BaseTrtllmAttention when making the change.
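The fix described above relies on cooperative `super().__init__` calls along the MRO. A minimal pure-Python reproduction of the *fixed* chain, with stand-in classes (`Module` plays the role of `nn.Module`; these are not the real TensorRT-LLM classes):

```python
class Module:
    """Stand-in for nn.Module; its __init__ must run for the instance to work."""

    def __init__(self):
        self.initialized = True


class AttentionBackend(Module):
    def __init__(self, **kwargs):
        # The crucial line: without this super().__init__() call,
        # Module.__init__ is skipped and the init chain is broken.
        super().__init__()
        self.backend_kwargs = kwargs


class BaseTrtllmAttention:
    def __init__(self):
        super().__init__()  # cooperative: continues to the next class in the MRO


class TrtllmAttention(BaseTrtllmAttention, AttentionBackend):
    def __init__(self):
        # MRO: TrtllmAttention -> BaseTrtllmAttention -> AttentionBackend -> Module
        super().__init__()
```

Note that `BaseTrtllmAttention` does not inherit from `Module`, yet its `super().__init__()` still reaches `AttentionBackend` when called on a `TrtllmAttention` instance, because `super()` follows the instance's MRO rather than the static base class.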
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: 27942b7e-f67f-4d22-ba17-6501dcd38b6d
📒 Files selected for processing (11)
- tensorrt_llm/_torch/visual_gen/attention_backend/__init__.py
- tensorrt_llm/_torch/visual_gen/attention_backend/flash_attn4.py
- tensorrt_llm/_torch/visual_gen/attention_backend/interface.py
- tensorrt_llm/_torch/visual_gen/attention_backend/parallel.py
- tensorrt_llm/_torch/visual_gen/attention_backend/trtllm.py
- tensorrt_llm/_torch/visual_gen/attention_backend/utils.py
- tensorrt_llm/_torch/visual_gen/attention_backend/vanilla.py
- tensorrt_llm/_torch/visual_gen/models/flux/attention.py
- tensorrt_llm/_torch/visual_gen/models/wan/transformer_wan.py
- tensorrt_llm/_torch/visual_gen/modules/attention.py
- tests/unittest/_torch/visual_gen/multi_gpu/test_ulysses_attention.py
|
/bot run --disable-fail-fast |
|
PR_Github #41238 [ run ] triggered by Bot. Commit: |
|
/bot kill |
|
PR_Github #41239 [ kill ] triggered by Bot. Commit: |
|
PR_Github #41238 [ run ] completed with state |
|
PR_Github #41239 [ kill ] completed with state |
|
/bot run --disable-fail-fast |
|
PR_Github #41242 [ run ] triggered by Bot. Commit: |
|
PR_Github #41242 [ run ] completed with state
|
|
/bot run --disable-fail-fast |
|
PR_Github #41290 [ run ] triggered by Bot. Commit: |
|
PR_Github #41290 [ run ] completed with state
|
Signed-off-by: Shreyas Misra <shreyasm@nvidia.com>
Signed-off-by: Shreyas Misra <shreyasm@nvidia.com>
Force-pushed 4b48264 to 4ec41ad (Compare)
|
/bot run --disable-fail-fast |
|
PR_Github #41443 [ run ] triggered by Bot. Commit: |
chang-l left a comment
In general, LGTM with minor comments
Signed-off-by: Shreyas Misra <shreyasm@nvidia.com>
|
/bot kill |
|
PR_Github #41467 [ kill ] triggered by Bot. Commit: |
|
PR_Github #41467 [ kill ] completed with state |
|
/bot run --disable-fail-fast |
|
PR_Github #41475 [ run ] triggered by Bot. Commit: |
|
PR_Github #41475 [ run ] completed with state
|
|
/bot run --disable-fail-fast |
|
PR_Github #41512 [ run ] triggered by Bot. Commit: |
|
PR_Github #41512 [ run ] completed with state
|
|
/bot help |
GitHub Bot Help
Provide a user-friendly way for developers to interact with a Jenkins server. See details below for each supported subcommand.
run
Launch build/test pipelines. All previously running jobs will be killed.
kill
Kill all running builds associated with pull request.
skip
Skip testing for latest commit on pull request.
reuse-pipeline
Reuse a previous pipeline to validate current commit. This action will also kill all currently running builds associated with the pull request. IMPORTANT NOTE: This is dangerous since lack of user care and validation can cause top of tree to break.
|
/bot skip --comment "Failing autodeploy DS test is unrelated to this PRs changes" |
|
PR_Github #41515 [ skip ] triggered by Bot. Commit: |
|
PR_Github #41515 [ skip ] completed with state |
Signed-off-by: Shreyas Misra <shreyasm@nvidia.com>
Signed-off-by: Shreyas Misra <shreyasm@nvidia.com>
Summary by CodeRabbit
Release Notes
AttentionBackend interface to unify behavior across attention implementations.
Description
Test Coverage
PR Checklist
Please review the following before submitting your PR:
PR description clearly explains what and why. If using CodeRabbit's summary, please make sure it makes sense.
PR Follows TRT-LLM CODING GUIDELINES to the best of your knowledge.
Test cases are provided for new code paths (see test instructions)
Any new dependencies have been scanned for license and vulnerabilities
CODEOWNERS updated if ownership changes
Documentation updated as needed
Update tava architecture diagram if there is a significant design change in PR.
The reviewers assigned automatically/manually are appropriate for the PR.
Please check this after reviewing the above items as appropriate for this PR.
GitHub Bot Help
To see a list of available CI bot commands, please comment
/bot help.