
[https://nvbugs/6029220][fix] Disable multi-stream in maybe_execute_i…#12659

Merged

longlee0622 merged 1 commit into NVIDIA:main from liji-nv:fix-multistream-compile-event-mismatch on Apr 9, 2026

Conversation

@liji-nv
Collaborator

@liji-nv liji-nv commented Apr 1, 2026

[https://nvbugs/6029220][fix] Disable multi-stream in maybe_execute_in_parallel under torch.compile

PyTorch 2.11 has a bug (pytorch/pytorch#176486) where dynamo captures CUDA stream/event operations and converts them into torch.ops.streams.* nodes. At runtime these nodes create events with uninitialized device type (CPU), causing "Event device type CPU does not match recording stream's device type CUDA" errors.

Add a disable_on_compile flag to maybe_execute_in_parallel. Callers outside custom ops (model-level MoE, MTP norm, mamba mixer, etc.) set it to True so stream/event ops are not captured by dynamo. Callers inside custom ops (fused MoE, attention) leave it False since custom ops are opaque to the compiler.

There is no perf impact on the torch.compile workflow. Before this change, the multi-stream ops were already eliminated by DCE, and we rely on auto-multi-stream in a later pass to satisfy the multi-stream requirements.
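As a rough sketch of the control flow described above (parameter names and the compile-detection helper are illustrative; the real implementation lives in tensorrt_llm/_torch/modules/multi_stream_utils.py):

```python
def _is_compiling():
    # Best-effort compile-tracing check; guarded so this sketch also runs
    # in environments without torch installed.
    try:
        import torch
        return torch.compiler.is_compiling()
    except Exception:
        return False


def maybe_execute_in_parallel(fn_a, fn_b, event_a=None, event_b=None,
                              aux_stream=None, disable_on_compile=False):
    """Run fn_a and fn_b, overlapping them on two CUDA streams when possible."""
    # Sequential fallback: no aux stream available, or the stream/event ops
    # must be kept out of the dynamo graph (disable_on_compile under tracing).
    if aux_stream is None or (disable_on_compile and _is_compiling()):
        return fn_a(), fn_b()

    import torch
    # Multi-stream path: fn_b runs on aux_stream, synchronized via events so
    # both results are safe to consume on the main stream afterwards.
    event_a.record()
    result_a = fn_a()
    with torch.cuda.stream(aux_stream):
        event_a.wait()
        result_b = fn_b()
        event_b.record()
    event_b.wait()
    return result_a, result_b


# Sequential fallback demo (no aux stream), runnable on CPU:
print(maybe_execute_in_parallel(lambda: 1 + 1, lambda: 2 * 3))  # (2, 6)
```

When dynamo is tracing and the flag is set, the function degrades to two ordinary calls, so no torch.ops.streams.* nodes ever enter the graph.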

Summary by CodeRabbit

  • Refactor
    • Updated parallel execution handling across multiple model architectures (Deepseek V3, Exaone, GLM, Llama, Nemotron, Qwen3, Mamba2) to improve compilation behavior.
    • Enhanced internal parallel execution utility to support finer-grained control over behavior during model compilation.

Description

Test Coverage

PR Checklist

Please review the following before submitting your PR:

  • PR description clearly explains what and why. If using CodeRabbit's summary, please make sure it makes sense.

  • PR Follows TRT-LLM CODING GUIDELINES to the best of your knowledge.

  • Test cases are provided for new code paths (see test instructions)

  • Any new dependencies have been scanned for license and vulnerabilities

  • CODEOWNERS updated if ownership changes

  • Documentation updated as needed

  • Update tava architecture diagram if there is a significant design change in PR.

  • The reviewers assigned automatically/manually are appropriate for the PR.

  • Please check this after reviewing the above items as appropriate for this PR.

GitHub Bot Help

To see a list of available CI bot commands, please comment /bot help.

@coderabbitai
Copy link
Copy Markdown
Contributor

coderabbitai bot commented Apr 1, 2026

📝 Walkthrough

Walkthrough

This pull request adds a new disable_on_compile parameter to the maybe_execute_in_parallel() function and updates multiple model implementations to pass disable_on_compile=True when invoking this function. The parameter controls whether parallel execution via dual CUDA streams should be disabled during torch.compile tracing.

Changes

Cohort / File(s) Summary
Core Infrastructure
tensorrt_llm/_torch/modules/multi_stream_utils.py
Updated maybe_execute_in_parallel() function signature to accept disable_on_compile: bool = False parameter. Modified conditional logic to skip multi-stream parallel execution when disable_on_compile=True and torch.compile tracing is active, falling back to sequential execution.
MoE Model Implementations
tensorrt_llm/_torch/models/modeling_deepseekv3.py, tensorrt_llm/_torch/models/modeling_exaone_moe.py, tensorrt_llm/_torch/models/modeling_glm.py, tensorrt_llm/_torch/models/modeling_llama.py, tensorrt_llm/_torch/models/modeling_llama_min_latency.py, tensorrt_llm/_torch/models/modeling_nemotron_h.py, tensorrt_llm/_torch/models/modeling_qwen3_next.py
Added disable_on_compile=True argument to maybe_execute_in_parallel() calls for parallel execution of MoE routed/shared outputs and embedding/hidden-state normalization routines.
Mamba2 Mixer Implementation
tensorrt_llm/_torch/modules/mamba/mamba2_mixer.py
Added disable_on_compile=True argument to maybe_execute_in_parallel() call in the decode path for parallel processing of CUDA stream/event operations.
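Caller-side, the pattern described in this table looks roughly like the following (a sequential stub stands in for the real utility so the snippet is self-contained; the expert functions are placeholders):

```python
# Stub standing in for the real utility in
# tensorrt_llm/_torch/modules/multi_stream_utils.py; it simply runs the two
# callables sequentially so this caller-side example is runnable anywhere.
def maybe_execute_in_parallel(fn_a, fn_b, event_a=None, event_b=None,
                              aux_stream=None, disable_on_compile=False):
    return fn_a(), fn_b()


# Placeholders for the two branches a MoE layer would overlap.
def routed_experts():
    return sum(range(5))  # pretend routed-expert compute

def shared_experts():
    return 2 * 3          # pretend shared-expert compute


# Model-level call sites (outside custom ops) pass disable_on_compile=True so
# dynamo never captures the stream/event ops; call sites inside fused custom
# ops leave it at the default False, since custom ops are opaque to dynamo.
routed, shared = maybe_execute_in_parallel(
    routed_experts, shared_experts, disable_on_compile=True)
print(routed, shared)  # 10 6
```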

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~12 minutes

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

  • Docstring Coverage ⚠️ Warning: Docstring coverage is 8.33%, which is below the required threshold of 80.00%. Resolution: write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (2 passed)

  • Title check ✅ Passed: The title clearly identifies the fix (disable multi-stream in maybe_execute_in_parallel) and references the bug tracker (nvbugs/6029220), following the repository template format.
  • Description check ✅ Passed: The PR description adequately explains the issue, the solution, and the rationale. It references a specific PyTorch bug, describes the problem (dynamo capturing CUDA operations), explains the fix (adding a disable_on_compile flag), and clarifies the usage pattern across different call sites.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.


Comment @coderabbitai help to get the list of available commands and usage tips.

Contributor

@coderabbitai coderabbitai bot left a comment


Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (3)
tensorrt_llm/_torch/models/modeling_llama_min_latency.py (1)

1-1: ⚠️ Potential issue | 🟠 Major

Add required NVIDIA SPDX header to this modified Python file.

The file currently has no NVIDIA copyright header.

Proposed fix
+ # SPDX-FileCopyrightText: Copyright (c) 2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+ # SPDX-License-Identifier: Apache-2.0
+
  from collections.abc import Callable
As per coding guidelines "`**/*.{cpp,h,hpp,cu,cuh,py}`: All TensorRT-LLM Open Source Software code should contain an NVIDIA copyright header that includes the year of its latest meaningful modification."
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tensorrt_llm/_torch/models/modeling_llama_min_latency.py` at line 1, This
file lacks the required NVIDIA SPDX/header; add an NVIDIA copyright header and
SPDX identifier at the very top of
tensorrt_llm/_torch/models/modeling_llama_min_latency.py (above the first
import), including the copyright owner "NVIDIA CORPORATION & AFFILIATES", the
year of the latest meaningful modification, and the SPDX-License-Identifier
(e.g., Apache-2.0) per project guidelines so the file complies with the
repository's header policy.
tensorrt_llm/_torch/models/modeling_glm.py (1)

1-1: ⚠️ Potential issue | 🟠 Major

Add required NVIDIA SPDX header to this modified Python file.

The file currently lacks the required NVIDIA copyright header.

Proposed fix
+ # SPDX-FileCopyrightText: Copyright (c) 2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+ # SPDX-License-Identifier: Apache-2.0
+
  import inspect
As per coding guidelines "`**/*.{cpp,h,hpp,cu,cuh,py}`: All TensorRT-LLM Open Source Software code should contain an NVIDIA copyright header that includes the year of its latest meaningful modification."
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tensorrt_llm/_torch/models/modeling_glm.py` at line 1, The file
tensorrt_llm/_torch/models/modeling_glm.py is missing the required NVIDIA SPDX
copyright header; add the standard NVIDIA copyright/SPDX header (including the
year of latest meaningful modification) at the top of the file before any
imports (e.g., above the existing import inspect) so the file complies with the
repository coding guidelines.
tensorrt_llm/_torch/modules/mamba/mamba2_mixer.py (1)

1-1: ⚠️ Potential issue | 🟠 Major

Update SPDX copyright year for this modified file.

This file is modified in this PR, but the header still ends at 2024.

Proposed fix
-# SPDX-FileCopyrightText: Copyright (c) 2022-2024 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# SPDX-FileCopyrightText: Copyright (c) 2022-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
As per coding guidelines "`**/*.{cpp,h,hpp,cu,cuh,py}`: All TensorRT-LLM Open Source Software code should contain an NVIDIA copyright header that includes the year of its latest meaningful modification."
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tensorrt_llm/_torch/modules/mamba/mamba2_mixer.py` at line 1, Update the SPDX
copyright header in mamba2_mixer.py to reflect the latest modification year by
changing the trailing year range from "2022-2024" to "2022-2026" (or to include
2026 as the newest year), keeping the rest of the SPDX header text intact so the
file header matches the repository policy.
🧹 Nitpick comments (1)
tensorrt_llm/_torch/modules/multi_stream_utils.py (1)

56-60: Clarify the docstring example for custom-op scope.

At Line 59-Line 60, “attention, MoE” can be read as model-level Python modules, while this PR sets disable_on_compile=True for several such wrappers. Consider wording this as “inside fused custom-op implementations” to avoid misuse.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tensorrt_llm/_torch/modules/multi_stream_utils.py` around lines 56 - 60, The
docstring for disable_on_compile is ambiguous about "attention, MoE": update the
wording to clarify you mean fused/custom-op implementations rather than
high-level Python modules; e.g., change the parenthetical to say something like
"inside fused custom-op implementations (e.g., fused attention kernels,
Mixture-of-Experts custom ops)" and keep the guidance that callers inside those
fused custom ops should leave disable_on_compile=False; edit the docstring in
multi_stream_utils.py where disable_on_compile is documented to reflect this
phrasing.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Outside diff comments:
In `@tensorrt_llm/_torch/models/modeling_glm.py`:
- Line 1: The file tensorrt_llm/_torch/models/modeling_glm.py is missing the
required NVIDIA SPDX copyright header; add the standard NVIDIA copyright/SPDX
header (including the year of latest meaningful modification) at the top of the
file before any imports (e.g., above the existing import inspect) so the file
complies with the repository coding guidelines.

In `@tensorrt_llm/_torch/models/modeling_llama_min_latency.py`:
- Line 1: This file lacks the required NVIDIA SPDX/header; add an NVIDIA
copyright header and SPDX identifier at the very top of
tensorrt_llm/_torch/models/modeling_llama_min_latency.py (above the first
import), including the copyright owner "NVIDIA CORPORATION & AFFILIATES", the
year of the latest meaningful modification, and the SPDX-License-Identifier
(e.g., Apache-2.0) per project guidelines so the file complies with the
repository's header policy.

In `@tensorrt_llm/_torch/modules/mamba/mamba2_mixer.py`:
- Line 1: Update the SPDX copyright header in mamba2_mixer.py to reflect the
latest modification year by changing the trailing year range from "2022-2024" to
"2022-2026" (or to include 2026 as the newest year), keeping the rest of the
SPDX header text intact so the file header matches the repository policy.

---

Nitpick comments:
In `@tensorrt_llm/_torch/modules/multi_stream_utils.py`:
- Around line 56-60: The docstring for disable_on_compile is ambiguous about
"attention, MoE": update the wording to clarify you mean fused/custom-op
implementations rather than high-level Python modules; e.g., change the
parenthetical to say something like "inside fused custom-op implementations
(e.g., fused attention kernels, Mixture-of-Experts custom ops)" and keep the
guidance that callers inside those fused custom ops should leave
disable_on_compile=False; edit the docstring in multi_stream_utils.py where
disable_on_compile is documented to reflect this phrasing.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 7c5b7209-2d84-4f74-99a2-b9a288bfa540

📥 Commits

Reviewing files that changed from the base of the PR and between bb60eb0 and 12513bb.

📒 Files selected for processing (9)
  • tensorrt_llm/_torch/models/modeling_deepseekv3.py
  • tensorrt_llm/_torch/models/modeling_exaone_moe.py
  • tensorrt_llm/_torch/models/modeling_glm.py
  • tensorrt_llm/_torch/models/modeling_llama.py
  • tensorrt_llm/_torch/models/modeling_llama_min_latency.py
  • tensorrt_llm/_torch/models/modeling_nemotron_h.py
  • tensorrt_llm/_torch/models/modeling_qwen3_next.py
  • tensorrt_llm/_torch/modules/mamba/mamba2_mixer.py
  • tensorrt_llm/_torch/modules/multi_stream_utils.py

@liji-nv
Collaborator Author

liji-nv commented Apr 2, 2026

/bot run --disable-fail-fast

@tensorrt-cicd
Collaborator

PR_Github #41308 [ run ] triggered by Bot. Commit: 12513bb Link to invocation

@tensorrt-cicd
Collaborator

PR_Github #41308 [ run ] completed with state SUCCESS. Commit: 12513bb
/LLM/main/L0_MergeRequest_PR pipeline #32262 completed with status: 'SUCCESS'

CI Report

Link to invocation

@2ez4bz
Collaborator

2ez4bz commented Apr 2, 2026

Do existing test cases cover the new functionality? If not, could we add some? 🙏

@liji-nv
Collaborator Author

liji-nv commented Apr 3, 2026

Do existing test cases cover the new functionality? If not, could we add some? 🙏

The issue should be captured by accuracy/test_llm_api_pytorch.py::TestDeepSeekV32::test_nvfp4_multi_gpus_piecewise_cuda_graph[mtp3_fp8kv_chunked], but that test was recently waived due to https://nvbugspro.nvidia.com/bug/5989920. After this PR merges, we will re-enable it in #12533.

[https://nvbugs/6029220][fix] Disable multi-stream in maybe_execute_in_parallel under torch.compile

PyTorch 2.11 has a bug (pytorch/pytorch#176486) where dynamo captures
CUDA stream/event operations and converts them into torch.ops.streams.*
nodes. At runtime these nodes create events with uninitialized device
type (CPU), causing "Event device type CPU does not match recording
stream's device type CUDA" errors.

Add a `disable_on_compile` flag to `maybe_execute_in_parallel`. Callers
outside custom ops (model-level MoE, MTP norm, mamba mixer, etc.) set
it to True so stream/event ops are not captured by dynamo. Callers
inside custom ops (fused MoE, attention) leave it False since custom ops
are opaque to the compiler.

Signed-off-by: Jin Li <59594262+liji-nv@users.noreply.github.com>
@liji-nv liji-nv force-pushed the fix-multistream-compile-event-mismatch branch from 12513bb to 036c8e5 on April 8, 2026 03:11
@liji-nv
Collaborator Author

liji-nv commented Apr 8, 2026

/bot run --disable-fail-fast

Collaborator

@QiJune QiJune left a comment


LGTM

@tensorrt-cicd
Collaborator

PR_Github #42254 [ run ] triggered by Bot. Commit: 036c8e5 Link to invocation

@longlee0622 longlee0622 enabled auto-merge (squash) April 8, 2026 03:46
@tensorrt-cicd
Collaborator

PR_Github #42254 [ run ] completed with state SUCCESS. Commit: 036c8e5
/LLM/main/L0_MergeRequest_PR pipeline #33063 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

Link to invocation

@liji-nv
Collaborator Author

liji-nv commented Apr 8, 2026

/bot run --disable-fail-fast

@tensorrt-cicd
Collaborator

PR_Github #42349 [ run ] triggered by Bot. Commit: 036c8e5 Link to invocation

@tensorrt-cicd
Collaborator

PR_Github #42349 [ run ] completed with state SUCCESS. Commit: 036c8e5
/LLM/main/L0_MergeRequest_PR pipeline #33134 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

Link to invocation

@liji-nv
Collaborator Author

liji-nv commented Apr 9, 2026

/bot run

1 similar comment
@liji-nv
Collaborator Author

liji-nv commented Apr 9, 2026

/bot run

@tensorrt-cicd
Collaborator

PR_Github #42431 [ run ] triggered by Bot. Commit: 036c8e5 Link to invocation

@tensorrt-cicd
Collaborator

PR_Github #42431 [ run ] completed with state SUCCESS. Commit: 036c8e5
/LLM/main/L0_MergeRequest_PR pipeline #33199 completed with status: 'SUCCESS'

CI Report

Link to invocation

@longlee0622 longlee0622 merged commit 3e942cc into NVIDIA:main Apr 9, 2026
5 checks passed