[https://nvbugs/6029220][fix] Disable multi-stream in maybe_execute_i…#12659
Conversation
📝 Walkthrough: This pull request adds a new `disable_on_compile` flag to `maybe_execute_in_parallel` so that CUDA stream/event operations are not captured by dynamo under `torch.compile`.
Estimated code review effort: 🎯 2 (Simple) | ⏱️ ~12 minutes
🚥 Pre-merge checks: ✅ 2 passed | ❌ 1 failed (1 warning)
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (3)
tensorrt_llm/_torch/models/modeling_llama_min_latency.py (1)
1-1: ⚠️ Potential issue | 🟠 Major — Add required NVIDIA SPDX header to this modified Python file.
The file currently has no NVIDIA copyright header.
As per coding guidelines: "`**/*.{cpp,h,hpp,cu,cuh,py}`: All TensorRT-LLM Open Source Software code should contain an NVIDIA copyright header that includes the year of its latest meaningful modification."

Proposed fix:

+ # SPDX-FileCopyrightText: Copyright (c) 2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+ # SPDX-License-Identifier: Apache-2.0
+ from collections.abc import Callable

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tensorrt_llm/_torch/models/modeling_llama_min_latency.py` at line 1, this file lacks the required NVIDIA SPDX header; add an NVIDIA copyright header and SPDX identifier at the very top of tensorrt_llm/_torch/models/modeling_llama_min_latency.py (above the first import), including the copyright owner "NVIDIA CORPORATION & AFFILIATES", the year of the latest meaningful modification, and the SPDX-License-Identifier (e.g., Apache-2.0), per project guidelines, so the file complies with the repository's header policy.

tensorrt_llm/_torch/models/modeling_glm.py (1)
1-1: ⚠️ Potential issue | 🟠 Major — Add required NVIDIA SPDX header to this modified Python file.
The file currently lacks the required NVIDIA copyright header.
As per coding guidelines: "`**/*.{cpp,h,hpp,cu,cuh,py}`: All TensorRT-LLM Open Source Software code should contain an NVIDIA copyright header that includes the year of its latest meaningful modification."

Proposed fix:

+ # SPDX-FileCopyrightText: Copyright (c) 2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+ # SPDX-License-Identifier: Apache-2.0
+ import inspect

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tensorrt_llm/_torch/models/modeling_glm.py` at line 1, the file tensorrt_llm/_torch/models/modeling_glm.py is missing the required NVIDIA SPDX copyright header; add the standard NVIDIA copyright/SPDX header (including the year of latest meaningful modification) at the top of the file before any imports (e.g., above the existing `import inspect`) so the file complies with the repository coding guidelines.

tensorrt_llm/_torch/modules/mamba/mamba2_mixer.py (1)
1-1: ⚠️ Potential issue | 🟠 Major — Update the SPDX copyright year for this modified file.
This file is modified in this PR, but the header still ends at 2024.
As per coding guidelines: "`**/*.{cpp,h,hpp,cu,cuh,py}`: All TensorRT-LLM Open Source Software code should contain an NVIDIA copyright header that includes the year of its latest meaningful modification."

Proposed fix:

-# SPDX-FileCopyrightText: Copyright (c) 2022-2024 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# SPDX-FileCopyrightText: Copyright (c) 2022-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tensorrt_llm/_torch/modules/mamba/mamba2_mixer.py` at line 1, update the SPDX copyright header in mamba2_mixer.py to reflect the latest modification year by changing the trailing year range from "2022-2024" to "2022-2026" (or to include 2026 as the newest year), keeping the rest of the SPDX header text intact so the file header matches the repository policy.
🧹 Nitpick comments (1)
tensorrt_llm/_torch/modules/multi_stream_utils.py (1)
56-60: Clarify the docstring example for custom-op scope.

At lines 59-60, "attention, MoE" can be read as model-level Python modules, while this PR sets `disable_on_compile=True` for several such wrappers. Consider wording this as "inside fused custom-op implementations" to avoid misuse.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tensorrt_llm/_torch/modules/multi_stream_utils.py` around lines 56 - 60, The docstring for disable_on_compile is ambiguous about "attention, MoE": update the wording to clarify you mean fused/custom-op implementations rather than high-level Python modules; e.g., change the parenthetical to say something like "inside fused custom-op implementations (e.g., fused attention kernels, Mixture-of-Experts custom ops)" and keep the guidance that callers inside those fused custom ops should leave disable_on_compile=False; edit the docstring in multi_stream_utils.py where disable_on_compile is documented to reflect this phrasing.
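One way to apply the reviewer's suggested rewording is sketched below. The constant name and exact phrasing are made up for illustration; they are not taken from `multi_stream_utils.py`:

```python
# Hypothetical rewording of the disable_on_compile docstring fragment,
# following the reviewer's suggestion. The constant name is invented
# for illustration and does not exist in multi_stream_utils.py.
DISABLE_ON_COMPILE_DOC = """\
disable_on_compile (bool): If True, fall back to sequential execution
    while torch.compile (dynamo) is tracing, so CUDA stream/event ops
    are not captured in the graph. Leave False only for callers inside
    fused custom-op implementations (e.g. fused attention kernels,
    Mixture-of-Experts custom ops), which are opaque to the compiler.
"""
```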
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: 7c5b7209-2d84-4f74-99a2-b9a288bfa540
📒 Files selected for processing (9)
- tensorrt_llm/_torch/models/modeling_deepseekv3.py
- tensorrt_llm/_torch/models/modeling_exaone_moe.py
- tensorrt_llm/_torch/models/modeling_glm.py
- tensorrt_llm/_torch/models/modeling_llama.py
- tensorrt_llm/_torch/models/modeling_llama_min_latency.py
- tensorrt_llm/_torch/models/modeling_nemotron_h.py
- tensorrt_llm/_torch/models/modeling_qwen3_next.py
- tensorrt_llm/_torch/modules/mamba/mamba2_mixer.py
- tensorrt_llm/_torch/modules/multi_stream_utils.py
|
/bot run --disable-fail-fast |
|
PR_Github #41308 [ run ] triggered by Bot. Commit: |
|
PR_Github #41308 [ run ] completed with state |
|
Do existing test cases cover the new functionality? If not, could we add some? 🙏 |
The issue should be captured by accuracy/test_llm_api_pytorch.py::TestDeepSeekV32::test_nvfp4_multi_gpus_piecewise_cuda_graph[mtp3_fp8kv_chunked], but that test was recently waived due to https://nvbugspro.nvidia.com/bug/5989920. After this PR merges, we will re-enable it in #12533. |
…n_parallel under torch.compile PyTorch 2.11 has a bug (pytorch/pytorch#176486) where dynamo captures CUDA stream/event operations and converts them into torch.ops.streams.* nodes. At runtime these nodes create events with uninitialized device type (CPU), causing "Event device type CPU does not match recording stream's device type CUDA" errors. Add a `disable_on_compile` flag to `maybe_execute_in_parallel`. Callers outside custom ops (model-level MoE, MTP norm, mamba mixer, etc.) set it to True so stream/event ops are not captured by dynamo. Callers inside custom ops (fused MoE, attention) leave it False since custom ops are opaque to the compiler. Signed-off-by: Jin Li <59594262+liji-nv@users.noreply.github.com>
Force-pushed from 12513bb to 036c8e5.
|
/bot run --disable-fail-fast |
|
PR_Github #42254 [ run ] triggered by Bot. Commit: |
|
PR_Github #42254 [ run ] completed with state
|
|
/bot run --disable-fail-fast |
|
PR_Github #42349 [ run ] triggered by Bot. Commit: |
|
PR_Github #42349 [ run ] completed with state
|
|
/bot run |
|
PR_Github #42431 [ run ] triggered by Bot. Commit: |
|
PR_Github #42431 [ run ] completed with state |
…n_parallel under torch.compile
PyTorch 2.11 has a bug (pytorch/pytorch#176486) where dynamo captures CUDA stream/event operations and converts them into torch.ops.streams.* nodes. At runtime these nodes create events with uninitialized device type (CPU), causing "Event device type CPU does not match recording stream's device type CUDA" errors.
Add a `disable_on_compile` flag to `maybe_execute_in_parallel`. Callers outside custom ops (model-level MoE, MTP norm, mamba mixer, etc.) set it to True so stream/event ops are not captured by dynamo. Callers inside custom ops (fused MoE, attention) leave it False since custom ops are opaque to the compiler.

There is no perf impact on the torch compile workflow. Before this change, the multi-stream ops were eliminated by DCE, and we actually rely on auto-multi-stream in a later pass to support multi-stream requirements.
Summary by CodeRabbit
Description
Test Coverage
PR Checklist
Please review the following before submitting your PR:
PR description clearly explains what and why. If using CodeRabbit's summary, please make sure it makes sense.
PR Follows TRT-LLM CODING GUIDELINES to the best of your knowledge.
Test cases are provided for new code paths (see test instructions)
Any new dependencies have been scanned for license and vulnerabilities
CODEOWNERS updated if ownership changes
Documentation updated as needed
Update tava architecture diagram if there is a significant design change in PR.
The reviewers assigned automatically/manually are appropriate for the PR.
Please check this after reviewing the above items as appropriate for this PR.
GitHub Bot Help
To see a list of available CI bot commands, please comment
/bot help.