[https://nvbugs/6299530][fix] Capture Qwen3.5 GDN for piecewise CUDA … by liji-nv · Pull Request #15594 · NVIDIA/TensorRT-LLM

liji-nv · 2026-06-24T12:58:12Z

…graph

Add an inplace custom op boundary for Qwen3.5 GDN so torch.compile piecewise CUDA graph can keep tokenwise projections outside the custom op while hiding FLA state updates from FX capture.

Update the FLA GDN helpers to write into caller-provided output tensors, register inplace custom-op metadata, and exclude the GDN custom op from piecewise CUDA graph capture. Add a Qwen3.5 FP8 piecewise CUDA graph smoke test.
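A minimal sketch of what excluding an op from piecewise capture means, assuming a linear sequence of op names instead of a real FX graph. The op names and the `split_at_boundaries` helper are hypothetical and are not taken from `piecewise_optimizer.py`; the point is only that a boundary op ends the current capturable piece and runs on its own outside any CUDA graph.

```python
from typing import List, Set, Tuple

# Illustrative boundary set; the real optimizer inspects FX graph nodes.
BOUNDARY_OPS: Set[str] = {"gdn_custom_op_inplace"}


def split_at_boundaries(
    ops: List[str], boundary_ops: Set[str]
) -> List[Tuple[bool, List[str]]]:
    """Split an op sequence into (capturable, ops) pieces.

    Each boundary op becomes its own non-capturable piece; the runs of
    ordinary ops between boundaries become capturable pieces.
    """
    pieces: List[Tuple[bool, List[str]]] = []
    current: List[str] = []
    for op in ops:
        if op in boundary_ops:
            if current:
                pieces.append((True, current))
                current = []
            pieces.append((False, [op]))
        else:
            current.append(op)
    if current:
        pieces.append((True, current))
    return pieces


# Hypothetical layer: tokenwise projections stay capturable, the GDN op does not.
layer_ops = ["rmsnorm", "in_proj", "gdn_custom_op_inplace", "out_proj", "add"]
pieces = split_at_boundaries(layer_ops, BOUNDARY_OPS)
```

In this sketch the projections before and after the GDN op land in two separate capturable pieces, which mirrors the stated goal of keeping tokenwise projections outside the custom op.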

Bug: https://nvbugs/6299530

Tested:

python -m py_compile tests/integration/defs/accuracy/test_llm_api_pytorch.py
git diff --check
PDX job 110733: pytest -q tests/integration/defs/accuracy/test_llm_api_pytorch.py::TestQwen3_5_4B::test_fp8_piecewise_cuda_graph -s

Summary by CodeRabbit

New Features
- Added support for optional preallocated output buffers in several gated delta rule and recurrent inference paths, improving flexibility and reducing extra allocations.
- Expanded model execution to carry additional runtime metadata needed for newer optimized execution paths.
Bug Fixes
- Improved piecewise partitioning behavior for compiled graphs, including better handling of boundary and stop operations.
- Added support for an additional custom runtime operator when available.
Tests
- Added regression coverage for FP8 piecewise CUDA graph generation on Qwen3.5-4B.
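The optional preallocated output buffer mentioned in the summary follows a common pattern, sketched here with plain Python lists as a stand-in for tensors. The `running_sum` function is hypothetical and unrelated to the FLA kernels; it only illustrates "write into the caller's buffer when one is given, allocate otherwise".

```python
from typing import List, Optional


def running_sum(
    values: List[float], output: Optional[List[float]] = None
) -> List[float]:
    """Write prefix sums of `values` into `output`, allocating only if needed."""
    if output is None:
        # No caller buffer: allocate a fresh result.
        output = [0.0] * len(values)
    elif len(output) != len(values):
        # Fail early instead of writing past or short of the buffer.
        raise ValueError(
            f"`output` has length {len(output)}, expected {len(values)}"
        )
    total = 0.0
    for i, value in enumerate(values):
        total += value
        output[i] = total
    return output


buf = [0.0] * 3
result = running_sum([1.0, 2.0, 3.0], output=buf)  # result is the same object as buf
```

Returning the buffer that was written keeps the call site identical whether or not the caller supplied one.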

Description

Test Coverage

PR Checklist

Please review the following before submitting your PR:

PR description clearly explains what and why. If using CodeRabbit's summary, please make sure it makes sense.
PR Follows TRT-LLM CODING GUIDELINES to the best of your knowledge.
Test cases are provided for new code paths (see test instructions)
If PR introduces API changes, an appropriate PR label is added - either api-compatible or api-breaking. For api-breaking, include BREAKING in the PR title.
Any new dependencies have been scanned for license and vulnerabilities
CODEOWNERS updated if ownership changes
Documentation updated as needed
Update tava architecture diagram if there is a significant design change in PR.
The reviewers assigned automatically/manually are appropriate for the PR.
Please check this after reviewing the above items as appropriate for this PR.

GitHub Bot Help

To see a list of available CI bot commands, please comment /bot help.

liji-nv · 2026-06-24T12:58:47Z

/bot run --disable-fail-fast

tensorrt-cicd · 2026-06-24T13:04:10Z

PR_Github #55499 [ run ] triggered by Bot. Commit: b9281aa Link to invocation

coderabbitai · 2026-06-24T13:04:32Z

📝 Walkthrough

Walkthrough

This PR enables piecewise CUDA graph compilation for Qwen3Next GatedDeltaNet (GDN) layers. It propagates preallocated output buffers through FLA kernel APIs (chunk_fwd_o, fused_recurrent), introduces a gdn_custom_op_inplace custom op with weakref-based layer registry, refactors GDN forward kwargs from a/b to g/beta, registers the new op with the piecewise optimizer, and adds an FP8 integration test.

Changes

GDN Piecewise CUDA Graph Support

| Layer / File(s) | Summary |
| --- | --- |
| FLA kernel preallocated output buffer APIs: `tensorrt_llm/_torch/modules/fla/chunk_o.py`, `tensorrt_llm/_torch/modules/fla/chunk.py`, `tensorrt_llm/_torch/modules/fla/flashinfer_chunk.py`, `tensorrt_llm/_torch/modules/fla/fused_recurrent.py` | Adds optional `output: Optional[torch.Tensor]` to `chunk_fwd_o`, `chunk_gated_delta_rule_fwd`, both `ChunkGatedDeltaRuleFunction.forward` and `chunk_gated_delta_rule`, and the full `fused_recurrent` family. Each function reuses the provided tensor instead of allocating a new one; `@input_guard` decorators are updated to exclude `output`. |
| `gdn_custom_op_inplace` registration and weakref layer registry: `tensorrt_llm/_torch/modules/mamba/gdn_mixer.py`, `tensorrt_llm/_torch/pyexecutor/model_engine.py` | Adds `weakref` import and removes `fused_sigmoid_gating_delta_rule_update`. Extends `__init__` to register each GDN instance by unique `layer_idx_str` into `model_config.extra_attrs["gdn_layers"]`. Defines `_extract_gdn_extra_attrs` and the `gdn_custom_op_inplace` custom op that fetches runtime metadata and calls `gdn_layer.forward_core` to write results in-place. Propagates `spec_metadata` into `extra_attrs` in `model_engine.model_forward`. |
| GDN tokenwise input refactor, `a`/`b` → `g`/`beta`: `tensorrt_llm/_torch/modules/mamba/gdn_mixer.py` | Extracts `_compute_tokenwise_inputs(hidden_states)` that runs dual projection matmuls and computes `g`/`beta` via `fused_gdn_gating_with_sigmoid`. Updates `_postprocess_gdn_output`. Changes the kwargs payload from `{mixed_qkv, a, b, z}` to `{mixed_qkv, g, beta}` at call sites. |
| GDN `forward_decode`/`forward_extend`/`forward_core` output threading: `tensorrt_llm/_torch/modules/mamba/gdn_mixer.py` | Updates `forward_decode` and `forward_extend` to accept `output`, consume `g`/`beta` from kwargs, construct `output_d`/`output_p` views for speculative paths, and thread `output` into kernel calls. Switches the primary decode kernel to `fused_recurrent_gated_delta_rule_update` with `g`/`beta`. Refactors `forward_core` to dispatch to `gdn_custom_op_inplace` when compiling or to the direct path with `output`. |
| Piecewise optimizer boundary op and inplace map: `tensorrt_llm/_torch/compilation/piecewise_optimizer.py`, `tensorrt_llm/_torch/compilation/utils.py` | Adds `_piecewise_boundary_ops()` helper that conditionally includes `gdn_custom_op_inplace.default`. Refactors node classification in `piecewise_optimizer` to use the helper. Registers `gdn_custom_op_inplace.default → {1: "output"}` in `inplace_info()`'s `inplace_map` via a guarded `try/except` block. |
| FP8 piecewise CUDA graph integration test: `tests/integration/defs/accuracy/test_llm_api_pytorch.py`, `tests/integration/test_lists/qa/llm_function_core.txt`, `tests/integration/test_lists/test-db/l0_h100.yml` | Adds `test_fp8_piecewise_cuda_graph` to `TestQwen3_5_4B` with `TorchCompileConfig` for fullgraph + piecewise CUDA graphs, running `LLM.generate` on FP8 with chunked prefill. Registers the test in the QA list and H100 pre-merge database. |
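The guarded registration described for `inplace_info()` can be sketched as an optional lookup that tolerates a missing op. This is an illustration under assumptions: `optional_op`, `build_inplace_map`, and the `SimpleNamespace` stand-in for the op namespace are hypothetical, and the sketch assumes that looking up an unregistered op raises `AttributeError`.

```python
from types import SimpleNamespace
from typing import Any, Dict, Optional


def optional_op(namespace: Any, name: str) -> Optional[Any]:
    """Return `namespace.<name>.default`, or None when the op is not registered."""
    try:
        return getattr(namespace, name).default
    except AttributeError:
        return None


def build_inplace_map(namespace: Any) -> Dict[Any, Dict[int, str]]:
    """Map each available in-place op to {argument position: argument name}."""
    inplace_map: Dict[Any, Dict[int, str]] = {}
    gdn_op = optional_op(namespace, "gdn_custom_op_inplace")
    if gdn_op is not None:
        # Mapping value taken from the review summary above.
        inplace_map[gdn_op] = {1: "output"}
    return inplace_map


# Stand-in namespaces: one where the op module was imported, one where it was not.
ops_with_gdn = SimpleNamespace(
    gdn_custom_op_inplace=SimpleNamespace(default="gdn_custom_op_inplace.default")
)
ops_without_gdn = SimpleNamespace()

with_gdn = build_inplace_map(ops_with_gdn)
without_gdn = build_inplace_map(ops_without_gdn)
```

The benefit of this shape is that compilation utilities can be imported before any model-specific module has registered its custom op.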

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Suggested reviewers

xxi-nv
hyukn
QiJune
galagam

🚥 Pre-merge checks | ✅ 5

✅ Passed checks (5 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Title check | ✅ Passed | The title matches the main change and includes the bug link and fix type. |
| Docstring Coverage | ✅ Passed | No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Description check | ✅ Passed | The PR description clearly states the issue, solution, bug link, and test coverage, even though it doesn’t follow the template headings exactly. |


coderabbitai

Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (3)

tensorrt_llm/_torch/modules/fla/flashinfer_chunk.py (1)

118-123: 🩺 Stability & Availability | 🟠 Major | ⚡ Quick win

Check output before reusing it as FlashInfer’s output buffer.

Line 122 assumes output.squeeze(0) has shape [T, num_o_heads, head_size] and a compatible contiguous layout. Validate this before passing it as output= so bad caller buffers fail deterministically.

Suggested guard

     total_seq_len = q3.shape[0]
     num_o_heads = max(q3.shape[1], v3.shape[1])
     head_size = q3.shape[2]
     need_state = inplace_indexed_state_update or output_final_state
-    output_buf = output.squeeze(0) if output is not None else q3.new_empty(
-        total_seq_len, num_o_heads, head_size)
+    if output is not None:
+        expected_shape = (1, total_seq_len, num_o_heads, head_size)
+        if output.shape != expected_shape or output.dtype != q3.dtype or output.device != q3.device:
+            raise ValueError(
+                "`output` must match FlashInfer output shape/dtype/device; "
+                f"got {tuple(output.shape)}/{output.dtype}/{output.device}, "
+                f"expected {expected_shape}/{q3.dtype}/{q3.device}"
+            )
+        if not output.is_contiguous():
+            raise ValueError("`output` must be contiguous")
+        output_buf = output.squeeze(0)
+    else:
+        output_buf = q3.new_empty(total_seq_len, num_o_heads, head_size)

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tensorrt_llm/_torch/modules/fla/flashinfer_chunk.py` around lines 118 - 123,
In flashinfer_chunk.py, the output buffer handling in the FlashInfer path
currently reuses output via FlashInfer’s output buffer without validating its
shape or layout. Update the logic around the output.squeeze(0) reuse in the
chunk/forward flow to first check that a caller-provided output matches [T,
num_o_heads, head_size] and is contiguous/compatible before passing it as
output=; otherwise fall back to allocating a fresh buffer or raise a clear error
from the same code path.

tensorrt_llm/_torch/modules/fla/fused_recurrent.py (1)

130-141: 🩺 Stability & Availability | 🟠 Major | ⚡ Quick win

Validate output before unsqueezing it into the Triton destination.

Both paths pass caller-owned output into kernels that assume the same dense layout as v. Add shape/dtype/device/contiguity checks before output.unsqueeze(0) to avoid bad writes.

Suggested helper

+def _validate_recurrent_output(output: torch.Tensor, v: torch.Tensor) -> None:
+    if output.shape != v.shape or output.dtype != v.dtype or output.device != v.device:
+        raise ValueError(
+            "`output` must match `v` in shape, dtype, and device; "
+            f"got output={tuple(output.shape)}/{output.dtype}/{output.device}, "
+            f"v={tuple(v.shape)}/{v.dtype}/{v.device}"
+        )
+    if not output.is_contiguous():
+        raise ValueError("`output` must be contiguous for fused recurrent kernels")

Then call it before each output.unsqueeze(0) branch.

Also applies to: 483-498

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tensorrt_llm/_torch/modules/fla/fused_recurrent.py` around lines 130 - 141,
The `output` tensor is passed into Triton kernels via `fused_recurrent` and
related call sites, but it is not validated before `output.unsqueeze(0)` is used
as the destination. Add a helper in this module to check `output`’s shape,
dtype, device, and contiguity against `v`, and invoke it before each
`output.unsqueeze(0)` branch so only compatible dense tensors reach the kernel.

tensorrt_llm/_torch/modules/fla/chunk_o.py (1)

133-144: 🩺 Stability & Availability | 🟠 Major | ⚡ Quick win

Validate the caller-provided output buffer before launching the Triton kernel.

Line 144 now lets callers supply o, but chunk_fwd_kernel_o writes with raw contiguous pointer arithmetic based on v.shape. A wrong dtype/device/shape/stride can silently write incorrect memory. Add a local contract check before assigning o.

Suggested guard

-    o = output if output is not None else torch.empty_like(v)
+    if output is not None:
+        if output.shape != v.shape or output.dtype != v.dtype or output.device != v.device:
+            raise ValueError(
+                "`output` must match `v` in shape, dtype, and device; "
+                f"got output={tuple(output.shape)}/{output.dtype}/{output.device}, "
+                f"v={tuple(v.shape)}/{v.dtype}/{v.device}"
+            )
+        if not output.is_contiguous():
+            raise ValueError("`output` must be contiguous for chunk_fwd_kernel_o")
+        o = output
+    else:
+        o = torch.empty_like(v)

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tensorrt_llm/_torch/modules/fla/chunk_o.py` around lines 133 - 144, Validate
the caller-supplied output buffer in chunk_fwd_kernel_o before using it for the
Triton launch: ensure output/o matches the expected tensor dtype, device, shape,
and contiguity/stride layout derived from v.shape and q.shape. Add a local
assertion or explicit contract check right before assigning o so invalid buffers
fail fast instead of allowing raw pointer writes to corrupt memory.

🧹 Nitpick comments (2)

tensorrt_llm/_torch/compilation/piecewise_optimizer.py (1)

22-32: 📐 Maintainability & Code Quality | 🔵 Trivial | ⚡ Quick win

Add a return type annotation to _piecewise_boundary_ops.

This new function is missing a return annotation. As per coding guidelines, "Always annotate functions with return types (use None if no return)."

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tensorrt_llm/_torch/compilation/piecewise_optimizer.py` around lines 22 - 32,
The helper _piecewise_boundary_ops currently lacks the required return type
annotation. Update the function signature for _piecewise_boundary_ops to
explicitly declare its return type based on the list of ops it builds, keeping
the implementation unchanged and ensuring it follows the project’s function
annotation guidelines.

Source: Coding guidelines

tensorrt_llm/_torch/modules/fla/chunk.py (1)

135-180: 📐 Maintainability & Code Quality | 🔵 Trivial | ⚡ Quick win

Document the new output tensor contract.

The public chunk_gated_delta_rule docstring now omits output, but callers need to know it must match the returned o layout/shape/dtype. As per coding guidelines, public Tensor-like arguments should document expected dimensions and dtype options.

Suggested docstring addition

         cu_seqlens (torch.LongTensor):
             Cumulative sequence lengths of shape `[N+1]` used for variable-length training,
             consistent with the FlashAttention API.
+        output (Optional[torch.Tensor]):
+            Optional preallocated output buffer with the same shape, dtype, device, and contiguous
+            layout as the returned `o`.

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tensorrt_llm/_torch/modules/fla/chunk.py` around lines 135 - 180, The public
docstring for chunk_gated_delta_rule is missing the new output argument
contract. Update the docstring near the existing parameter docs in chunk.py to
describe output as an optional preallocated tensor that must match the returned
o layout, shape, and dtype (including head_first-dependent dimensions). Keep the
description aligned with the other tensor arguments so callers can safely pass a
correctly sized buffer.

Source: Coding guidelines

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@tensorrt_llm/_torch/modules/mamba/gdn_mixer.py`:
- Around line 82-90: The custom-op signature for gdn_custom_op_inplace is out of
sync with the inplace metadata, causing mutation tracking to point at the wrong
argument. Update the gdn_custom_op_inplace parameter order so output matches the
position expected by inplace_info() (or change the inplace_info() mapping to the
current output position), and keep mutates_args consistent with the actual
mutable tensor name.
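
The mismatch described in this inline comment can be caught mechanically. The following is a minimal sketch, assuming the in-place map uses zero-based positional indices; `check_inplace_metadata` and both example signatures are hypothetical and do not reproduce the real `gdn_custom_op_inplace` parameter list.

```python
import inspect
from typing import Callable, Dict, Sequence


def check_inplace_metadata(
    fn: Callable, inplace_args: Dict[int, str], mutates_args: Sequence[str]
) -> None:
    """Raise if the position-to-name map disagrees with the function signature."""
    params = list(inspect.signature(fn).parameters)
    for index, name in inplace_args.items():
        actual = params[index] if index < len(params) else None
        if actual != name:
            raise ValueError(
                f"position {index} is `{actual}`, but the inplace map expects `{name}`"
            )
        if name not in mutates_args:
            raise ValueError(f"`{name}` is missing from mutates_args")


# Hypothetical signatures: `output` in position 1, and `output` moved to the end.
def op_output_second(mixed_qkv, output, layer_idx):
    pass


def op_output_last(mixed_qkv, layer_idx, output):
    pass


check_inplace_metadata(op_output_second, {1: "output"}, ("output",))  # consistent

try:
    check_inplace_metadata(op_output_last, {1: "output"}, ("output",))
    mismatch = None
except ValueError as exc:
    mismatch = str(exc)  # reports that position 1 is `layer_idx`
```

A check like this could run once at registration time, so a reordered parameter list fails loudly instead of silently tracking the wrong mutated argument.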



ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 81ac5010-965d-4e82-8f6d-76fd40bf0aaf

📥 Commits

Reviewing files that changed from the base of the PR and between 71613f9 and b9281aa.

📒 Files selected for processing (11)

tensorrt_llm/_torch/compilation/piecewise_optimizer.py
tensorrt_llm/_torch/compilation/utils.py
tensorrt_llm/_torch/modules/fla/chunk.py
tensorrt_llm/_torch/modules/fla/chunk_o.py
tensorrt_llm/_torch/modules/fla/flashinfer_chunk.py
tensorrt_llm/_torch/modules/fla/fused_recurrent.py
tensorrt_llm/_torch/modules/mamba/gdn_mixer.py
tensorrt_llm/_torch/pyexecutor/model_engine.py
tests/integration/defs/accuracy/test_llm_api_pytorch.py
tests/integration/test_lists/qa/llm_function_core.txt
tests/integration/test_lists/test-db/l0_h100.yml

tensorrt-cicd · 2026-06-24T20:20:58Z

PR_Github #55499 [ run ] completed with state FAILURE. Commit: b9281aa
/LLM/main/L0_MergeRequest_PR pipeline #44424 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

Please check the failed tests and fix your PR
If you cannot view the failures, ask the CI triggerer to share details
Once fixed, request an NVIDIA team member to trigger CI again

CI Agent Failure Analysis

Link to invocation

liji-nv · 2026-06-25T06:44:31Z

/bot run --disable-fail-fast

liji-nv · 2026-06-25T07:26:44Z

/bot run --disable-fail-fast

tensorrt-cicd · 2026-06-25T07:33:09Z

PR_Github #55740 [ run ] triggered by Bot. Commit: e94bdc8 Link to invocation

tensorrt-cicd · 2026-06-25T14:38:16Z

PR_Github #55740 [ run ] completed with state SUCCESS. Commit: e94bdc8
/LLM/main/L0_MergeRequest_PR pipeline #44642 completed with status: 'FAILURE'


liji-nv · 2026-06-26T03:07:19Z

/bot run --disable-fail-fast

tensorrt-cicd · 2026-06-29T03:15:40Z

PR_Github #56284 [ run ] triggered by Bot. Commit: 02addfd Link to invocation

tensorrt-cicd · 2026-06-29T07:33:44Z

PR_Github #56284 [ run ] completed with state FAILURE. Commit: 02addfd
/LLM/main/L0_MergeRequest_PR pipeline #45137 completed with status: 'FAILURE'


liji-nv · 2026-06-29T07:39:43Z

/bot run --disable-fail-fast

tensorrt-cicd · 2026-06-29T07:45:07Z

PR_Github #56331 [ run ] triggered by Bot. Commit: 02addfd Link to invocation

tensorrt-cicd · 2026-06-29T13:34:46Z

PR_Github #56331 [ run ] completed with state SUCCESS. Commit: 02addfd
/LLM/main/L0_MergeRequest_PR pipeline #45180 completed with status: 'FAILURE'


liji-nv · 2026-06-30T05:02:19Z

/bot run --disable-fail-fast

liji-nv · 2026-06-30T06:28:42Z

/bot run --disable-fail-fast

tensorrt-cicd · 2026-06-30T06:35:00Z

PR_Github #56522 [ run ] triggered by Bot. Commit: 0892a72 Link to invocation

tensorrt-cicd · 2026-06-30T14:27:02Z

PR_Github #56522 [ run ] completed with state SUCCESS. Commit: 0892a72
/LLM/main/L0_MergeRequest_PR pipeline #45358 completed with status: 'FAILURE'


liji-nv · 2026-07-01T02:44:11Z

/bot run --disable-fail-fast

tensorrt-cicd · 2026-07-01T02:50:13Z

PR_Github #56797 [ run ] triggered by Bot. Commit: 0892a72 Link to invocation

tensorrt-cicd · 2026-07-01T06:37:02Z

PR_Github #56797 [ run ] completed with state SUCCESS. Commit: 0892a72
/LLM/main/L0_MergeRequest_PR pipeline #45614 completed with status: 'FAILURE'


liji-nv · 2026-07-02T02:54:24Z

/bot run

tensorrt-cicd · 2026-07-02T03:03:27Z

PR_Github #57079 [ run ] triggered by Bot. Commit: 0892a72 Link to invocation

tensorrt-cicd · 2026-07-02T06:26:51Z

PR_Github #57079 [ run ] completed with state SUCCESS. Commit: 0892a72
/LLM/main/L0_MergeRequest_PR pipeline #45870 completed with status: 'SUCCESS'

CI Report

Link to invocation

…graph

Keep eager and torch-compile GDN execution on the same forward_core path by passing the original mixed QKV and gating projection tensors into the custom op. The custom op only provides a compile boundary and an inplace output buffer.

Restore the standard decode path to fused_sigmoid_gating_delta_rule_update so FlashInfer GDN decode receives the original a/b tensors and preserves the eager accuracy behavior. Thread the optional output buffer through the FlashInfer and Triton decode paths to avoid an extra copy.

Tests:
- python -m py_compile tensorrt_llm/_torch/modules/mamba/gdn_mixer.py tensorrt_llm/_torch/modules/fla/fused_sigmoid_gating_recurrent.py
- git diff --check
- PDX sqsh build job 112288: COMPLETED
- PDX accuracy job 112301: TestQwen3_5_4B test_fp8 and test_fp8_piecewise_cuda_graph passed
- PDX accuracy job 112326: TestQwen3_5_35B_A3B test_bf16[tp2-TRTLLM] passed

Signed-off-by: Jin Li <59594262+liji-nv@users.noreply.github.com>

liji-nv · 2026-07-03T11:29:53Z

/bot run --disable-fail-fast

tensorrt-cicd · 2026-07-03T11:38:42Z

PR_Github #57461 [ run ] triggered by Bot. Commit: 4492490 Link to invocation

Wrap the MiniMax M3 metadata- and cache-dependent attention core in an inplace custom op so torch.compile can split it out of piecewise CUDA graphs. Keep QKV/index projections, QK normalization, RoPE, and the output projection visible to the compiled graph.

Write dense and sparse attention results into the custom-op output buffer. Preserve FP32 sparse GQA accumulation until the final copy/cast, and expose the output buffer through MiniMaxM3SparseRuntimeBackend.forward.

Register attention boundaries and mutation metadata through optional TRT-LLM op lookup, matching the latest GDN registration pattern from PR NVIDIA#15594. This avoids depending on model-specific custom ops being imported when compilation utilities initialize.

Track piecewise runners owned by the compile backend and reset their CUDA graphs, captured addresses, outputs, and warmup state when phase-1 KV-cache estimation is released. Phase 2 then recaptures against the final allocations instead of replaying stale graph pointers.

Add an 8-GPU MiniMax-M3-MXFP8 torch.compile E2E variant covering TP8/EP8, attention DP, TRTLLM MoE, padding CUDA graphs, multi-stream piecewise capture, and phase-2 recapture.

Signed-off-by: Jin Li <59594262+liji-nv@users.noreply.github.com>
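
The runner tracking and reset described in this commit message can be pictured with a small sketch. Every name here (`PieceRunner`, `CompileBackend`, `reset_captured_state`) is hypothetical, and plain dictionaries stand in for CUDA graphs and captured buffers; the sketch only shows a backend that owns its runners and clears their captured state in one call.

```python
from typing import Dict, List


class PieceRunner:
    """Hypothetical per-piece runner holding captured state keyed by batch size."""

    def __init__(self) -> None:
        self.graphs: Dict[int, object] = {}
        self.captured_addresses: Dict[int, List[int]] = {}
        self.outputs: Dict[int, object] = {}
        self.warmed_up = False

    def reset(self) -> None:
        # Drop everything tied to the old allocations so the next run recaptures.
        self.graphs.clear()
        self.captured_addresses.clear()
        self.outputs.clear()
        self.warmed_up = False


class CompileBackend:
    """Hypothetical backend that owns every runner it creates."""

    def __init__(self) -> None:
        self._runners: List[PieceRunner] = []

    def make_runner(self) -> PieceRunner:
        runner = PieceRunner()
        self._runners.append(runner)
        return runner

    def reset_captured_state(self) -> int:
        """Reset all owned runners; returns how many were reset."""
        for runner in self._runners:
            runner.reset()
        return len(self._runners)


backend = CompileBackend()
runner = backend.make_runner()
runner.graphs[8] = "graph captured during phase 1"
runner.captured_addresses[8] = [0x1000]
runner.warmed_up = True
reset_count = backend.reset_captured_state()
```

Resetting the warmup flag together with the graphs matters in a design like this: otherwise a runner would skip warmup and try to replay a graph that no longer exists.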

tensorrt-cicd · 2026-07-03T19:08:01Z

PR_Github #57461 [ run ] completed with state SUCCESS. Commit: 4492490
/LLM/main/L0_MergeRequest_PR pipeline #46198 completed with status: 'FAILURE'


liji-nv requested review from a team as code owners June 24, 2026 12:58

liji-nv requested review from QiJune, schetlur-nv, shaharmor98, symphonylyh and yizhang-nv June 24, 2026 12:58

github-actions Bot assigned liji-nv Jun 24, 2026

coderabbitai Bot reviewed Jun 24, 2026


Comment thread tensorrt_llm/_torch/modules/mamba/gdn_mixer.py Outdated

liji-nv force-pushed the liji/bug-6299530-qwen35-piecewise-cuda-graph branch 2 times, most recently from 3ca4c97 to 30793be Compare June 25, 2026 06:00

liji-nv force-pushed the liji/bug-6299530-qwen35-piecewise-cuda-graph branch 3 times, most recently from 890832c to e94bdc8 Compare June 25, 2026 07:26

xinhe-nv approved these changes Jun 25, 2026


liji-nv force-pushed the liji/bug-6299530-qwen35-piecewise-cuda-graph branch from e94bdc8 to 6f1d4c7 Compare June 26, 2026 03:07

liji-nv force-pushed the liji/bug-6299530-qwen35-piecewise-cuda-graph branch from 02addfd to 0892a72 Compare June 30, 2026 05:02

nv-guomingz approved these changes Jul 2, 2026


liji-nv enabled auto-merge (squash) July 3, 2026 03:18

yizhang-nv approved these changes Jul 3, 2026


yuxianq reviewed Jul 3, 2026


Comment thread tests/integration/defs/accuracy/test_llm_api_pytorch.py Outdated

yuxianq reviewed Jul 3, 2026


Comment thread tensorrt_llm/_torch/compilation/utils.py Outdated

liji-nv added 2 commits July 3, 2026 04:09

[https://nvbugs/6299530][fix] Guard optional piecewise custom ops

4492490

Signed-off-by: Jin Li <59594262+liji-nv@users.noreply.github.com>

liji-nv force-pushed the liji/bug-6299530-qwen35-piecewise-cuda-graph branch from 0892a72 to 4492490 Compare July 3, 2026 11:25

liji-nv mentioned this pull request Jul 3, 2026

[None][fix] Enable MiniMax M3 piecewise CUDA graphs #15923

Open

1 task
