
[https://nvbugs/6299530][fix] Capture Qwen3.5 GDN for piecewise CUDA …#15594

Open
liji-nv wants to merge 2 commits into NVIDIA:main from liji-nv:liji/bug-6299530-qwen35-piecewise-cuda-graph

Conversation

@liji-nv

@liji-nv liji-nv commented Jun 24, 2026

Collaborator

…graph

Add an inplace custom op boundary for Qwen3.5 GDN so that torch.compile piecewise CUDA graph capture can keep the tokenwise projections outside the custom op while hiding the FLA state updates from FX capture.

Update the FLA GDN helpers to write into caller-provided output tensors, register inplace custom-op metadata, and exclude the GDN custom op from piecewise CUDA graph capture. Add a Qwen3.5 FP8 piecewise CUDA graph smoke test.
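
The caller-provided output pattern described above can be sketched without any GPU code. This is a minimal illustration, not the FLA helper API: `scale_into` and its arguments are hypothetical names, and plain Python lists stand in for tensors.

```python
from typing import List, Optional


def scale_into(values: List[float], factor: float,
               output: Optional[List[float]] = None) -> List[float]:
    """Write values * factor into `output`, allocating only if none is given.

    Hypothetical stand-in for a kernel wrapper; lists play the role of tensors.
    """
    if output is None:
        # Eager-style path: allocate a fresh result buffer.
        output = [0.0] * len(values)
    elif len(output) != len(values):
        # Fail fast on a mismatched caller buffer instead of writing out of bounds.
        raise ValueError(
            f"output has length {len(output)}, expected {len(values)}")
    for i, v in enumerate(values):
        output[i] = v * factor
    return output


# The caller owns the buffer, so the result lands in memory it already holds.
buf = [0.0, 0.0, 0.0]
result = scale_into([1.0, 2.0, 3.0], 2.0, output=buf)
```

Returning the same object that was passed in lets callers treat both paths uniformly, which is likely why an inplace custom op can reuse the eager helpers unchanged.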

Bug: https://nvbugs/6299530

Tested:

  • python -m py_compile tests/integration/defs/accuracy/test_llm_api_pytorch.py
  • git diff --check
  • PDX job 110733: pytest -q tests/integration/defs/accuracy/test_llm_api_pytorch.py::TestQwen3_5_4B::test_fp8_piecewise_cuda_graph -s

Summary by CodeRabbit

  • New Features

    • Added support for optional preallocated output buffers in several gated delta rule and recurrent inference paths, improving flexibility and reducing extra allocations.
    • Expanded model execution to carry additional runtime metadata needed for newer optimized execution paths.
  • Bug Fixes

    • Improved piecewise partitioning behavior for compiled graphs, including better handling of boundary and stop operations.
    • Added support for an additional custom runtime operator when available.
  • Tests

    • Added regression coverage for FP8 piecewise CUDA graph generation on Qwen3.5-4B.
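
The piecewise partitioning idea in the summary above can be modeled on a flat list of op names. This is a simplified sketch under stated assumptions: the real optimizer works on FX graph nodes, `split_at_boundaries` is a hypothetical helper, and every op name except `gdn_custom_op_inplace` is invented for illustration.

```python
from typing import FrozenSet, List, Sequence


def split_at_boundaries(ops: Sequence[str],
                        boundary_ops: FrozenSet[str]) -> List[List[str]]:
    """Split a linear op sequence into pieces, isolating each boundary op.

    Ops between boundaries form capturable pieces; each boundary op becomes
    its own single-op piece that would run outside CUDA graph capture.
    """
    pieces: List[List[str]] = []
    current: List[str] = []
    for op in ops:
        if op in boundary_ops:
            if current:
                # Close the capturable piece that precedes the boundary.
                pieces.append(current)
                current = []
            pieces.append([op])
        else:
            current.append(op)
    if current:
        pieces.append(current)
    return pieces


# Hypothetical op names; only "gdn_custom_op_inplace" comes from the PR.
ops = ["in_proj", "gating", "gdn_custom_op_inplace", "norm", "out_proj"]
pieces = split_at_boundaries(ops, frozenset({"gdn_custom_op_inplace"}))
```

Here the projections before and after the boundary end up in separate capturable pieces, while the boundary op sits alone between them.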

Description

Test Coverage

PR Checklist

Please review the following before submitting your PR:

  • PR description clearly explains what and why. If using CodeRabbit's summary, please make sure it makes sense.

  • PR Follows TRT-LLM CODING GUIDELINES to the best of your knowledge.

  • Test cases are provided for new code paths (see test instructions)

  • If PR introduces API changes, an appropriate PR label is added - either api-compatible or api-breaking. For api-breaking, include BREAKING in the PR title.

  • Any new dependencies have been scanned for license and vulnerabilities

  • CODEOWNERS updated if ownership changes

  • Documentation updated as needed

  • Update tava architecture diagram if there is a significant design change in PR.

  • The reviewers assigned automatically/manually are appropriate for the PR.

  • Please check this after reviewing the above items as appropriate for this PR.

GitHub Bot Help

To see a list of available CI bot commands, please comment /bot help.

@liji-nv

liji-nv commented Jun 24, 2026

Collaborator Author

/bot run --disable-fail-fast

@tensorrt-cicd

Collaborator

PR_Github #55499 [ run ] triggered by Bot. Commit: b9281aa Link to invocation

@coderabbitai

coderabbitai Bot commented Jun 24, 2026

Contributor


📝 Walkthrough


This PR enables piecewise CUDA graph compilation for Qwen3Next GatedDeltaNet (GDN) layers. It propagates preallocated output buffers through FLA kernel APIs (chunk_fwd_o, fused_recurrent), introduces a gdn_custom_op_inplace custom op with weakref-based layer registry, refactors GDN forward kwargs from a/b to g/beta, registers the new op with the piecewise optimizer, and adds an FP8 integration test.

Changes

GDN Piecewise CUDA Graph Support

| Layer | File(s) | Summary |
| --- | --- | --- |
| FLA kernel preallocated output buffer APIs | tensorrt_llm/_torch/modules/fla/chunk_o.py, tensorrt_llm/_torch/modules/fla/chunk.py, tensorrt_llm/_torch/modules/fla/flashinfer_chunk.py, tensorrt_llm/_torch/modules/fla/fused_recurrent.py | Adds optional output: Optional[torch.Tensor] to chunk_fwd_o, chunk_gated_delta_rule_fwd, both ChunkGatedDeltaRuleFunction.forward and chunk_gated_delta_rule, and the full fused_recurrent family. Each function reuses the provided tensor instead of allocating a new one; @input_guard decorators are updated to exclude output. |
| gdn_custom_op_inplace registration and weakref layer registry | tensorrt_llm/_torch/modules/mamba/gdn_mixer.py, tensorrt_llm/_torch/pyexecutor/model_engine.py | Adds weakref import and removes fused_sigmoid_gating_delta_rule_update. Extends `__init__` to register each GDN instance by unique layer_idx_str into model_config.extra_attrs["gdn_layers"]. Defines _extract_gdn_extra_attrs and the gdn_custom_op_inplace custom op that fetches runtime metadata and calls gdn_layer.forward_core to write results in-place. Propagates spec_metadata into extra_attrs in model_engine.model_forward. |
| GDN tokenwise input refactor: a/b → g/beta | tensorrt_llm/_torch/modules/mamba/gdn_mixer.py | Extracts _compute_tokenwise_inputs(hidden_states) that runs dual projection matmuls and computes g/beta via fused_gdn_gating_with_sigmoid. Updates _postprocess_gdn_output. Changes the kwargs payload from {mixed_qkv, a, b, z} to {mixed_qkv, g, beta} at call sites. |
| GDN forward_decode/forward_extend/forward_core output threading | tensorrt_llm/_torch/modules/mamba/gdn_mixer.py | Updates forward_decode and forward_extend to accept output, consume g/beta from kwargs, construct output_d/output_p views for speculative paths, and thread output into kernel calls. Switches the primary decode kernel to fused_recurrent_gated_delta_rule_update with g/beta. Refactors forward_core to dispatch to gdn_custom_op_inplace when compiling or to the direct path with output. |
| Piecewise optimizer boundary op and inplace map | tensorrt_llm/_torch/compilation/piecewise_optimizer.py, tensorrt_llm/_torch/compilation/utils.py | Adds _piecewise_boundary_ops() helper that conditionally includes gdn_custom_op_inplace.default. Refactors node classification in piecewise_optimizer to use the helper. Registers gdn_custom_op_inplace.default → {1: "output"} in inplace_info()'s inplace_map via a guarded try/except block. |
| FP8 piecewise CUDA graph integration test | tests/integration/defs/accuracy/test_llm_api_pytorch.py, tests/integration/test_lists/qa/llm_function_core.txt, tests/integration/test_lists/test-db/l0_h100.yml | Adds test_fp8_piecewise_cuda_graph to TestQwen3_5_4B with TorchCompileConfig for fullgraph + piecewise CUDA graphs, running LLM.generate on FP8 with chunked prefill. Registers the test in the QA list and H100 pre-merge database. |
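
The weakref layer registry mentioned in the walkthrough can be illustrated with the standard library alone. This is a sketch under assumptions, not the code from gdn_mixer.py: the `Layer` class and `lookup` helper are hypothetical, and a plain dict stands in for model_config.extra_attrs["gdn_layers"]. A string key is presumably used because a custom op can receive strings but not arbitrary module objects.

```python
import weakref
from typing import Dict


class Layer:
    """Hypothetical stand-in for a GDN layer; only the registry logic matters."""

    def __init__(self, layer_idx: int,
                 registry: Dict[str, "weakref.ref[Layer]"]) -> None:
        self.layer_idx_str = str(layer_idx)
        if self.layer_idx_str in registry:
            raise ValueError(f"duplicate layer index {self.layer_idx_str}")
        # Store a weak reference so the registry does not keep the layer alive.
        registry[self.layer_idx_str] = weakref.ref(self)


def lookup(registry: Dict[str, "weakref.ref[Layer]"],
           layer_idx_str: str) -> Layer:
    """Resolve a string key back to a live layer, as a custom op body might."""
    layer = registry[layer_idx_str]()
    if layer is None:
        raise RuntimeError(f"layer {layer_idx_str} was garbage collected")
    return layer


gdn_layers: Dict[str, "weakref.ref[Layer]"] = {}
layer0 = Layer(0, gdn_layers)
found = lookup(gdn_layers, "0")
```

Holding weak references avoids a reference cycle between the model config and its layers, at the cost of an explicit liveness check on lookup.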

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Suggested reviewers

  • xxi-nv
  • hyukn
  • QiJune
  • galagam
🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
| Check name | Status | Explanation |
| --- | --- | --- |
| Title check | ✅ Passed | The title matches the main change and includes the bug link and fix type. |
| Docstring Coverage | ✅ Passed | No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Description check | ✅ Passed | The PR description clearly states the issue, solution, bug link, and test coverage, even though it doesn’t follow the template headings exactly. |


Comment @coderabbitai help to get the list of available commands.

@coderabbitai coderabbitai Bot left a comment

Contributor


Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (3)
tensorrt_llm/_torch/modules/fla/flashinfer_chunk.py (1)

118-123: 🩺 Stability & Availability | 🟠 Major | ⚡ Quick win

Check output before reusing it as FlashInfer’s output buffer.

Line 122 assumes output.squeeze(0) has shape [T, num_o_heads, head_size] and a compatible contiguous layout. Validate this before passing it as output= so bad caller buffers fail deterministically.

Suggested guard
     total_seq_len = q3.shape[0]
     num_o_heads = max(q3.shape[1], v3.shape[1])
     head_size = q3.shape[2]
     need_state = inplace_indexed_state_update or output_final_state
-    output_buf = output.squeeze(0) if output is not None else q3.new_empty(
-        total_seq_len, num_o_heads, head_size)
+    if output is not None:
+        expected_shape = (1, total_seq_len, num_o_heads, head_size)
+        if output.shape != expected_shape or output.dtype != q3.dtype or output.device != q3.device:
+            raise ValueError(
+                "`output` must match FlashInfer output shape/dtype/device; "
+                f"got {tuple(output.shape)}/{output.dtype}/{output.device}, "
+                f"expected {expected_shape}/{q3.dtype}/{q3.device}"
+            )
+        if not output.is_contiguous():
+            raise ValueError("`output` must be contiguous")
+        output_buf = output.squeeze(0)
+    else:
+        output_buf = q3.new_empty(total_seq_len, num_o_heads, head_size)
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tensorrt_llm/_torch/modules/fla/flashinfer_chunk.py` around lines 118 - 123,
In flashinfer_chunk.py, the output buffer handling in the FlashInfer path
currently reuses output via FlashInfer’s output buffer without validating its
shape or layout. Update the logic around the output.squeeze(0) reuse in the
chunk/forward flow to first check that a caller-provided output matches [T,
num_o_heads, head_size] and is contiguous/compatible before passing it as
output=; otherwise fall back to allocating a fresh buffer or raise a clear error
from the same code path.
tensorrt_llm/_torch/modules/fla/fused_recurrent.py (1)

130-141: 🩺 Stability & Availability | 🟠 Major | ⚡ Quick win

Validate output before unsqueezing it into the Triton destination.

Both paths pass caller-owned output into kernels that assume the same dense layout as v. Add shape/dtype/device/contiguity checks before output.unsqueeze(0) to avoid bad writes.

Suggested helper
+def _validate_recurrent_output(output: torch.Tensor, v: torch.Tensor) -> None:
+    if output.shape != v.shape or output.dtype != v.dtype or output.device != v.device:
+        raise ValueError(
+            "`output` must match `v` in shape, dtype, and device; "
+            f"got output={tuple(output.shape)}/{output.dtype}/{output.device}, "
+            f"v={tuple(v.shape)}/{v.dtype}/{v.device}"
+        )
+    if not output.is_contiguous():
+        raise ValueError("`output` must be contiguous for fused recurrent kernels")

Then call it before each output.unsqueeze(0) branch.

Also applies to: 483-498

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tensorrt_llm/_torch/modules/fla/fused_recurrent.py` around lines 130 - 141,
The `output` tensor is passed into Triton kernels via `fused_recurrent` and
related call sites, but it is not validated before `output.unsqueeze(0)` is used
as the destination. Add a helper in this module to check `output`’s shape,
dtype, device, and contiguity against `v`, and invoke it before each
`output.unsqueeze(0)` branch so only compatible dense tensors reach the kernel.
tensorrt_llm/_torch/modules/fla/chunk_o.py (1)

133-144: 🩺 Stability & Availability | 🟠 Major | ⚡ Quick win

Validate the caller-provided output buffer before launching the Triton kernel.

Line 144 now lets callers supply o, but chunk_fwd_kernel_o writes with raw contiguous pointer arithmetic based on v.shape. A wrong dtype/device/shape/stride can silently write incorrect memory. Add a local contract check before assigning o.

Suggested guard
-    o = output if output is not None else torch.empty_like(v)
+    if output is not None:
+        if output.shape != v.shape or output.dtype != v.dtype or output.device != v.device:
+            raise ValueError(
+                "`output` must match `v` in shape, dtype, and device; "
+                f"got output={tuple(output.shape)}/{output.dtype}/{output.device}, "
+                f"v={tuple(v.shape)}/{v.dtype}/{v.device}"
+            )
+        if not output.is_contiguous():
+            raise ValueError("`output` must be contiguous for chunk_fwd_kernel_o")
+        o = output
+    else:
+        o = torch.empty_like(v)
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tensorrt_llm/_torch/modules/fla/chunk_o.py` around lines 133 - 144, Validate
the caller-supplied output buffer in chunk_fwd_kernel_o before using it for the
Triton launch: ensure output/o matches the expected tensor dtype, device, shape,
and contiguity/stride layout derived from v.shape and q.shape. Add a local
assertion or explicit contract check right before assigning o so invalid buffers
fail fast instead of allowing raw pointer writes to corrupt memory.
🧹 Nitpick comments (2)
tensorrt_llm/_torch/compilation/piecewise_optimizer.py (1)

22-32: 📐 Maintainability & Code Quality | 🔵 Trivial | ⚡ Quick win

Add a return type annotation to _piecewise_boundary_ops.

This new function is missing a return annotation.
As per coding guidelines, "Always annotate functions with return types (use None if no return)."

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tensorrt_llm/_torch/compilation/piecewise_optimizer.py` around lines 22 - 32,
The helper _piecewise_boundary_ops currently lacks the required return type
annotation. Update the function signature for _piecewise_boundary_ops to
explicitly declare its return type based on the list of ops it builds, keeping
the implementation unchanged and ensuring it follows the project’s function
annotation guidelines.

Source: Coding guidelines
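
A return-annotated helper with an optional op lookup might look like the sketch below. It is illustrative only: `fake_ops` is a hypothetical stand-in for an op namespace such as torch.ops.trtllm, `always_present_op` is an invented name, and the real `_piecewise_boundary_ops` body is not shown in this review.

```python
from types import SimpleNamespace
from typing import Any, List

# Hypothetical stand-in for an op namespace; only the name
# gdn_custom_op_inplace comes from the PR, the rest is invented.
fake_ops = SimpleNamespace(
    always_present_op=SimpleNamespace(default="base.default"))


def piecewise_boundary_ops(ops_namespace: Any) -> List[Any]:
    """Return boundary ops, adding optional ones only if they are registered."""
    boundary = [ops_namespace.always_present_op.default]
    # Optional op: it may be absent if its defining module was never imported.
    gdn_op = getattr(ops_namespace, "gdn_custom_op_inplace", None)
    if gdn_op is not None:
        boundary.append(gdn_op.default)
    return boundary


without_gdn = piecewise_boundary_ops(fake_ops)
fake_ops.gdn_custom_op_inplace = SimpleNamespace(default="gdn.default")
with_gdn = piecewise_boundary_ops(fake_ops)
```

The explicit `List[Any]` return type satisfies the annotation guideline without changing behavior, and the guarded lookup keeps the helper usable when the model-specific op has not been registered.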

tensorrt_llm/_torch/modules/fla/chunk.py (1)

135-180: 📐 Maintainability & Code Quality | 🔵 Trivial | ⚡ Quick win

Document the new output tensor contract.

The public chunk_gated_delta_rule docstring now omits output, but callers need to know it must match the returned o layout/shape/dtype. As per coding guidelines, public Tensor-like arguments should document expected dimensions and dtype options.

Suggested docstring addition
         cu_seqlens (torch.LongTensor):
             Cumulative sequence lengths of shape `[N+1]` used for variable-length training,
             consistent with the FlashAttention API.
+        output (Optional[torch.Tensor]):
+            Optional preallocated output buffer with the same shape, dtype, device, and contiguous
+            layout as the returned `o`.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tensorrt_llm/_torch/modules/fla/chunk.py` around lines 135 - 180, The public
docstring for chunk_gated_delta_rule is missing the new output argument
contract. Update the docstring near the existing parameter docs in chunk.py to
describe output as an optional preallocated tensor that must match the returned
o layout, shape, and dtype (including head_first-dependent dimensions). Keep the
description aligned with the other tensor arguments so callers can safely pass a
correctly sized buffer.

Source: Coding guidelines

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@tensorrt_llm/_torch/modules/mamba/gdn_mixer.py`:
- Around line 82-90: The custom-op signature for gdn_custom_op_inplace is out of
sync with the inplace metadata, causing mutation tracking to point at the wrong
argument. Update the gdn_custom_op_inplace parameter order so output matches the
position expected by inplace_info() (or change the inplace_info() mapping to the
current output position), and keep mutates_args consistent with the actual
mutable tensor name.
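
The mismatch described in this inline comment could be caught mechanically. The following is a sketch under assumptions: it treats the integer keys of the inplace map as positional argument indices, `check_inplace_map` is a hypothetical helper, and both example signatures are invented because the PR's real parameter list is not shown here.

```python
import inspect
from typing import Callable, Dict


def check_inplace_map(fn: Callable, inplace_map: Dict[int, str]) -> None:
    """Raise if any index in `inplace_map` names a different parameter of `fn`."""
    params = list(inspect.signature(fn).parameters)
    for index, name in inplace_map.items():
        if index >= len(params) or params[index] != name:
            actual = params[index] if index < len(params) else "<missing>"
            raise ValueError(
                f"inplace map expects '{name}' at position {index}, "
                f"but the signature has '{actual}'")


# Hypothetical signatures used only to exercise the check.
def op_in_sync(mixed_qkv, output, layer_idx): ...
def op_out_of_sync(mixed_qkv, layer_idx, output): ...


check_inplace_map(op_in_sync, {1: "output"})  # passes silently
try:
    check_inplace_map(op_out_of_sync, {1: "output"})
    mismatch_detected = False
except ValueError:
    mismatch_detected = True
```

A check of this kind could run once at registration time so that a reordered parameter list fails loudly instead of silently mis-tracking the mutated argument.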


ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 81ac5010-965d-4e82-8f6d-76fd40bf0aaf

📥 Commits

Reviewing files that changed from the base of the PR and between 71613f9 and b9281aa.

📒 Files selected for processing (11)
  • tensorrt_llm/_torch/compilation/piecewise_optimizer.py
  • tensorrt_llm/_torch/compilation/utils.py
  • tensorrt_llm/_torch/modules/fla/chunk.py
  • tensorrt_llm/_torch/modules/fla/chunk_o.py
  • tensorrt_llm/_torch/modules/fla/flashinfer_chunk.py
  • tensorrt_llm/_torch/modules/fla/fused_recurrent.py
  • tensorrt_llm/_torch/modules/mamba/gdn_mixer.py
  • tensorrt_llm/_torch/pyexecutor/model_engine.py
  • tests/integration/defs/accuracy/test_llm_api_pytorch.py
  • tests/integration/test_lists/qa/llm_function_core.txt
  • tests/integration/test_lists/test-db/l0_h100.yml

Comment thread tensorrt_llm/_torch/modules/mamba/gdn_mixer.py Outdated
@tensorrt-cicd

Collaborator

PR_Github #55499 [ run ] completed with state FAILURE. Commit: b9281aa
/LLM/main/L0_MergeRequest_PR pipeline #44424 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

CI Agent Failure Analysis

Link to invocation

@liji-nv liji-nv force-pushed the liji/bug-6299530-qwen35-piecewise-cuda-graph branch 2 times, most recently from 3ca4c97 to 30793be Compare June 25, 2026 06:00
@liji-nv

liji-nv commented Jun 25, 2026

Collaborator Author

/bot run --disable-fail-fast

@liji-nv liji-nv force-pushed the liji/bug-6299530-qwen35-piecewise-cuda-graph branch 3 times, most recently from 890832c to e94bdc8 Compare June 25, 2026 07:26
@liji-nv

liji-nv commented Jun 25, 2026

Collaborator Author

/bot run --disable-fail-fast

@tensorrt-cicd

Collaborator

PR_Github #55740 [ run ] triggered by Bot. Commit: e94bdc8 Link to invocation

@tensorrt-cicd

Collaborator

PR_Github #55740 [ run ] completed with state SUCCESS. Commit: e94bdc8
/LLM/main/L0_MergeRequest_PR pipeline #44642 completed with status: 'FAILURE'

CI Report


CI Agent Failure Analysis

Link to invocation

@liji-nv liji-nv force-pushed the liji/bug-6299530-qwen35-piecewise-cuda-graph branch from e94bdc8 to 6f1d4c7 Compare June 26, 2026 03:07
@liji-nv

liji-nv commented Jun 26, 2026

Collaborator Author

/bot run --disable-fail-fast

@tensorrt-cicd

Collaborator

PR_Github #56284 [ run ] triggered by Bot. Commit: 02addfd Link to invocation

@tensorrt-cicd

Collaborator

PR_Github #56284 [ run ] completed with state FAILURE. Commit: 02addfd
/LLM/main/L0_MergeRequest_PR pipeline #45137 completed with status: 'FAILURE'

CI Report


CI Agent Failure Analysis

Link to invocation

@liji-nv

liji-nv commented Jun 29, 2026

Collaborator Author

/bot run --disable-fail-fast

@tensorrt-cicd

Collaborator

PR_Github #56331 [ run ] triggered by Bot. Commit: 02addfd Link to invocation

@tensorrt-cicd

Collaborator

PR_Github #56331 [ run ] completed with state SUCCESS. Commit: 02addfd
/LLM/main/L0_MergeRequest_PR pipeline #45180 completed with status: 'FAILURE'

CI Report


CI Agent Failure Analysis

Link to invocation

@liji-nv liji-nv force-pushed the liji/bug-6299530-qwen35-piecewise-cuda-graph branch from 02addfd to 0892a72 Compare June 30, 2026 05:02
@liji-nv

liji-nv commented Jun 30, 2026

Collaborator Author

/bot run --disable-fail-fast

1 similar comment
@liji-nv

liji-nv commented Jun 30, 2026

Collaborator Author

/bot run --disable-fail-fast

@tensorrt-cicd

Collaborator

PR_Github #56522 [ run ] triggered by Bot. Commit: 0892a72 Link to invocation

@tensorrt-cicd

Collaborator

PR_Github #56522 [ run ] completed with state SUCCESS. Commit: 0892a72
/LLM/main/L0_MergeRequest_PR pipeline #45358 completed with status: 'FAILURE'

CI Report


CI Agent Failure Analysis

Link to invocation

@liji-nv

liji-nv commented Jul 1, 2026

Collaborator Author

/bot run --disable-fail-fast

@tensorrt-cicd

Collaborator

PR_Github #56797 [ run ] triggered by Bot. Commit: 0892a72 Link to invocation

@tensorrt-cicd

Collaborator

PR_Github #56797 [ run ] completed with state SUCCESS. Commit: 0892a72
/LLM/main/L0_MergeRequest_PR pipeline #45614 completed with status: 'FAILURE'

CI Report


CI Agent Failure Analysis

Link to invocation

@liji-nv

liji-nv commented Jul 2, 2026

Collaborator Author

/bot run

@tensorrt-cicd

Collaborator

PR_Github #57079 [ run ] triggered by Bot. Commit: 0892a72 Link to invocation

@tensorrt-cicd

Collaborator

PR_Github #57079 [ run ] completed with state SUCCESS. Commit: 0892a72
/LLM/main/L0_MergeRequest_PR pipeline #45870 completed with status: 'SUCCESS'

CI Report

Link to invocation

@liji-nv liji-nv enabled auto-merge (squash) July 3, 2026 03:18
Comment thread tests/integration/defs/accuracy/test_llm_api_pytorch.py Outdated
Comment thread tensorrt_llm/_torch/compilation/utils.py Outdated
liji-nv added 2 commits July 3, 2026 04:09
…graph

Keep eager and torch-compile GDN execution on the same forward_core path by
passing the original mixed QKV and gating projection tensors into the custom op.
The custom op only provides a compile boundary and an inplace output buffer.

Restore the standard decode path to fused_sigmoid_gating_delta_rule_update so
FlashInfer GDN decode receives the original a/b tensors and preserves the eager
accuracy behavior. Thread the optional output buffer through the FlashInfer and
Triton decode paths to avoid an extra copy.

Tests:
- python -m py_compile tensorrt_llm/_torch/modules/mamba/gdn_mixer.py tensorrt_llm/_torch/modules/fla/fused_sigmoid_gating_recurrent.py
- git diff --check
- PDX sqsh build job 112288: COMPLETED
- PDX accuracy job 112301: TestQwen3_5_4B test_fp8 and test_fp8_piecewise_cuda_graph passed
- PDX accuracy job 112326: TestQwen3_5_35B_A3B test_bf16[tp2-TRTLLM] passed

Signed-off-by: Jin Li <59594262+liji-nv@users.noreply.github.com>
Signed-off-by: Jin Li <59594262+liji-nv@users.noreply.github.com>
@liji-nv liji-nv force-pushed the liji/bug-6299530-qwen35-piecewise-cuda-graph branch from 0892a72 to 4492490 Compare July 3, 2026 11:25
@liji-nv

liji-nv commented Jul 3, 2026

Collaborator Author

/bot run --disable-fail-fast

@tensorrt-cicd

Collaborator

PR_Github #57461 [ run ] triggered by Bot. Commit: 4492490 Link to invocation

liji-nv added a commit to liji-nv/TensorRT-LLM that referenced this pull request Jul 3, 2026
Wrap the MiniMax M3 metadata- and cache-dependent attention core in an inplace custom op so torch.compile can split it out of piecewise CUDA graphs. Keep QKV/index projections, QK normalization, RoPE, and the output projection visible to the compiled graph.

Write dense and sparse attention results into the custom-op output buffer. Preserve FP32 sparse GQA accumulation until the final copy/cast, and expose the output buffer through MiniMaxM3SparseRuntimeBackend.forward.

Register attention boundaries and mutation metadata through optional TRT-LLM op lookup, matching the latest GDN registration pattern from PR NVIDIA#15594. This avoids depending on model-specific custom ops being imported when compilation utilities initialize.

Track piecewise runners owned by the compile backend and reset their CUDA graphs, captured addresses, outputs, and warmup state when phase-1 KV-cache estimation is released. Phase 2 then recaptures against the final allocations instead of replaying stale graph pointers.

Add an 8-GPU MiniMax-M3-MXFP8 torch.compile E2E variant covering TP8/EP8, attention DP, TRTLLM MoE, padding CUDA graphs, multi-stream piecewise capture, and phase-2 recapture.

Signed-off-by: Jin Li <59594262+liji-nv@users.noreply.github.com>
@tensorrt-cicd

Collaborator

PR_Github #57461 [ run ] completed with state SUCCESS. Commit: 4492490
/LLM/main/L0_MergeRequest_PR pipeline #46198 completed with status: 'FAILURE'

CI Report


CI Agent Failure Analysis

Link to invocation


6 participants