[None][feat] Qwen3.5 perf optimizations #11581
1ba67a7 to 4c1879d
ℹ️ Recent review info
⚙️ Run configuration — Path: .coderabbit.yaml | Review profile: CHILL | Plan: Pro
📒 Files selected for processing (42)
📝 Walkthrough
This pull request introduces significant architectural improvements to the auto-deploy infrastructure for large language models, primarily focusing on refactoring gated attention mechanisms, adding fused GDN gating operations, implementing multi-stream GEMM parallelization, and enhancing MoE routing with fused kernels. Configuration files are updated to reflect new backend strategies and performance-tuning parameters. Custom-operation signatures are modified to internalize preprocessing steps, and new graph transforms optimize runtime execution paths.

Changes
Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant App as Application
    participant SeqIface as SequenceInfo<br/>Interface
    participant FlaOp as FLA Gated Delta<br/>Custom Op
    participant PreProc as Internal<br/>Preprocessing
    participant Kernel as Attention<br/>Kernel
    App->>SeqIface: Prepare batch with use_initial_states,<br/>batch_info, and new any_prefill_use_initial_states_host
    SeqIface->>SeqIface: Precompute any_prefill_use_initial_states_host<br/>(host-side torch.any on CPU)
    SeqIface->>FlaOp: Pass raw inputs (q, k, v, a, b,<br/>A_log, dt_bias) + metadata
    FlaOp->>PreProc: Forward raw a, b, A_log, dt_bias
    PreProc->>PreProc: Compute g = -exp(A_log) * softplus(a + dt_bias)
    PreProc->>PreProc: Compute beta = sigmoid(b)
    PreProc->>PreProc: L2 normalize q, k
    PreProc->>PreProc: Expand q, k for GQA if num_v_heads > num_k_heads
    FlaOp->>Kernel: Execute attention with normalized q/k,<br/>gating params g/beta,<br/>initial_states decision from host flag
    Kernel-->>FlaOp: Output attention result
    FlaOp-->>App: Return fused output HV shape
```
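The gating preprocessing in the diagram above can be sketched in eager PyTorch. This is a minimal illustration of the math only; the function name, argument shapes, and head layout below are assumptions, not the custom op's actual signature:

```python
import torch
import torch.nn.functional as F


def gdn_preprocess(q, k, a, b, A_log, dt_bias, num_v_heads, num_k_heads):
    """Illustrative gating preprocessing; q/k are [B, T, H, D] with heads at dim -2."""
    # Decay gate: g = -exp(A_log) * softplus(a + dt_bias)
    g = -torch.exp(A_log) * F.softplus(a + dt_bias)
    # Mixing coefficient: beta = sigmoid(b)
    beta = torch.sigmoid(b)
    # L2-normalize q and k along the feature dimension
    q = F.normalize(q, p=2.0, dim=-1)
    k = F.normalize(k, p=2.0, dim=-1)
    # Expand q/k across value heads for GQA when num_v_heads > num_k_heads
    if num_v_heads > num_k_heads:
        q = q.repeat_interleave(num_v_heads // num_k_heads, dim=-2)
        k = k.repeat_interleave(num_v_heads // num_k_heads, dim=-2)
    return q, k, g, beta
```

Folding these elementwise steps into the custom op, as the diagram describes, lets the op consume raw `a`, `b`, `A_log`, and `dt_bias` directly instead of exposing the preprocessing in the traced graph.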
```mermaid
sequenceDiagram
    participant Graph as FX GraphModule
    participant Transform as MultiStreamGemm<br/>Transform
    participant Inspector as Fork Point<br/>Detector
    participant Builder as Aux Op<br/>Builder
    participant Rewriter as Graph<br/>Rewriter
    Graph->>Transform: Accept graph for optimization
    Transform->>Inspector: Scan for fork points<br/>(2+ supported ops on same input)
    Inspector->>Inspector: Filter by: not already multi-stream,<br/>≥2 linear users, same input
    Inspector-->>Transform: List fork points with weights
    Transform->>Builder: For largest GEMM by weight,<br/>create _aux variant
    Builder->>Builder: Derive custom op name with _aux suffix
    Builder-->>Transform: Return aux op registration
    Transform->>Rewriter: Insert record_event_passthrough<br/>before earliest remaining linear
    Rewriter->>Rewriter: Wire largest GEMM to aux op,<br/>remaining to main stream
    Rewriter->>Rewriter: Adjust downstream topology<br/>to preserve order
    Rewriter-->>Transform: Updated graph with overlap
    Transform-->>Graph: Return optimized FX graph
```
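The Inspector's fork-point scan can be illustrated with a toy FX pass. This is a sketch under simplifying assumptions: it matches plain `nn.Linear` call_module nodes rather than the custom GEMM ops the real transform targets, and `TwoBranch`/`find_fork_points` are hypothetical names:

```python
import torch
import torch.fx as fx
from torch import nn


class TwoBranch(nn.Module):
    """Toy model where one input feeds two GEMMs (a fork point)."""

    def __init__(self):
        super().__init__()
        self.q_proj = nn.Linear(8, 16)
        self.k_proj = nn.Linear(8, 16)

    def forward(self, x):
        return self.q_proj(x), self.k_proj(x)


def find_fork_points(gm: fx.GraphModule):
    """Return (node, linear_users) pairs where 2+ Linear modules share an input."""
    forks = []
    for node in gm.graph.nodes:
        linear_users = [
            u for u in node.users
            if u.op == "call_module"
            and isinstance(gm.get_submodule(u.target), nn.Linear)
        ]
        if len(linear_users) >= 2:  # 2+ GEMMs on the same input -> fork point
            forks.append((node, linear_users))
    return forks


gm = fx.symbolic_trace(TwoBranch())
forks = find_fork_points(gm)
```

On the real graph, each detected fork point would then be rewritten so the largest GEMM runs via the `_aux` op on an auxiliary stream while the remaining GEMMs stay on the main stream, with recorded events preserving ordering.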
```mermaid
sequenceDiagram
    participant Router as MoE Router<br/>Logits
    participant MatchTx as MatchMoeRoutingPattern<br/>Transform
    participant Detector as Pattern<br/>Detector
    participant Fusion as Fused Routing<br/>Builder
    participant Graph as Graph<br/>Rewriter
    Router->>MatchTx: Forward routing logits through<br/>softmax → topk → renormalize
    MatchTx->>Detector: Scan FX graph for topk nodes
    Detector->>Detector: Locate softmax decomposition<br/>preceding topk (multiple variants)
    Detector->>Detector: Verify renormalization on topk values
    Detector->>Detector: Extract original logits tensor
    Detector-->>MatchTx: Pattern matched, ready for fusion
    MatchTx->>Fusion: Build single fused_topk_softmax call
    Fusion->>Fusion: Triton kernel: in-kernel argmax loop,<br/>numerically stable softmax on top-k
    Fusion-->>MatchTx: Fused node (weights, indices)
    MatchTx->>Graph: Rewire downstream uses to fused result
    Graph->>Graph: Remove old softmax, topk, renormalize nodes
    Graph-->>MatchTx: Updated graph with fused routing
    MatchTx-->>Router: Return optimized routing graph
```
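The softmax → topk → renormalize pattern being matched, and the semantics a fused kernel must preserve, can be written as a pair of eager references. Because softmax is monotone, taking top-k of the raw logits and applying softmax to only those k entries is mathematically identical to renormalizing the top-k probabilities (function names here are illustrative, not the Triton kernel's API):

```python
import torch


def routing_unfused(logits, k):
    """The graph pattern the transform matches: softmax -> topk -> renormalize."""
    probs = torch.softmax(logits, dim=-1)
    vals, idx = torch.topk(probs, k, dim=-1)
    return vals / vals.sum(dim=-1, keepdim=True), idx


def routing_fused_reference(logits, k):
    """Equivalent single-pass form: topk on raw logits, softmax over the k kept."""
    # topk commutes with the monotone softmax; renormalizing top-k probabilities
    # equals a softmax restricted to the top-k logits.
    vals, idx = torch.topk(logits, k, dim=-1)
    return torch.softmax(vals, dim=-1), idx
```

This identity is what lets the fused Triton kernel skip materializing the full softmax: it selects the top-k logits in-kernel and runs a numerically stable softmax over just those k values.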
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~75 minutes
🚥 Pre-merge checks: ✅ 1 | ❌ 2
❌ Failed checks (2 warnings)
✅ Passed checks (1 passed)
Actionable comments posted: 14
Note
Due to the large number of review comments, Critical and Major severity comments were prioritized as inline comments.
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (8)
tensorrt_llm/_torch/auto_deploy/utils/node_utils.py (1)
1-1: ⚠️ Potential issue | 🟡 Minor
Missing NVIDIA copyright header
The file starts directly with the module docstring; no NVIDIA copyright header is present. As per coding guidelines, modified files must have an updated copyright year.
📄 Proposed fix

```diff
+# SPDX-FileCopyrightText: Copyright (c) 2022-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# SPDX-License-Identifier: Apache-2.0
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
 """Common utils for torch fx graph transformation."""
```

As per coding guidelines: "All source files must contain an NVIDIA copyright header with the year of latest meaningful modification."
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tensorrt_llm/_torch/auto_deploy/utils/node_utils.py` at line 1, Add the required NVIDIA copyright header at the top of tensorrt_llm/_torch/auto_deploy/utils/node_utils.py immediately above the existing module docstring ("""Common utils for torch fx graph transformation."""), updating the copyright year to the latest meaningful modification year; ensure the header follows the project's standard header format and includes the correct year and ownership text.

tests/unittest/_torch/auto_deploy/unit/multigpu/transformations/library/test_ep_sharding.py (1)
1-1: ⚠️ Potential issue | 🟡 Minor
Add the NVIDIA copyright header.
This test file is missing the required NVIDIA Apache 2.0 header with the latest modification year.
As per coding guidelines, "All source files must contain an NVIDIA copyright header with the year of latest meaningful modification. Use the Apache License 2.0 format."
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tests/unittest/_torch/auto_deploy/unit/multigpu/transformations/library/test_ep_sharding.py` at line 1, This file's top-level docstring lacks the required NVIDIA Apache-2.0 copyright header; update the file header by replacing or prepending the existing module docstring (the first line/string in the test_ep_sharding.py module) with the standard NVIDIA Apache-2.0 copyright header including the correct current year of latest modification and license text, ensuring the header appears at the very top of test_ep_sharding.py before any code or docstrings.

tensorrt_llm/_torch/auto_deploy/custom_ops/fused_moe/trtllm_moe.py (1)
1-1: ⚠️ Potential issue | 🟡 Minor
Update the copyright year to include 2026.
The header still lists 2025 even though the file changed in 2026. Please bump the year (e.g., 2025-2026).
📄 Suggested header update

```diff
-# SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
```

As per coding guidelines, "All source files must contain an NVIDIA copyright header with the year of latest meaningful modification. Use the Apache License 2.0 format."
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tensorrt_llm/_torch/auto_deploy/custom_ops/fused_moe/trtllm_moe.py` at line 1, Update the copyright header line that currently reads "# SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved." to include 2026 (e.g., "2025-2026" or "2026") so the header reflects the latest modification year and conforms to the Apache License 2.0 format; locate and edit this exact header string in trtllm_moe.py.

tensorrt_llm/_torch/auto_deploy/transform/library/sharding.py (1)
1-1: ⚠️ Potential issue | 🟡 Minor
Add the NVIDIA copyright header.
This file lacks the required NVIDIA Apache 2.0 header with the latest modification year.
As per coding guidelines, "All source files must contain an NVIDIA copyright header with the year of latest meaningful modification. Use the Apache License 2.0 format."
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tensorrt_llm/_torch/auto_deploy/transform/library/sharding.py` at line 1, Add the required NVIDIA Apache-2.0 copyright header at the top of the file immediately above the existing module docstring (the string starting with """Transformations to support graph sharding.). Use the standard NVIDIA Apache License 2.0 header format with the latest modification year, ensure the SPDX identifier (Apache-2.0) is present, and keep the header as the very first lines of the file.

tests/unittest/_torch/auto_deploy/unit/multigpu/transformations/library/test_tp_sharding.py (1)
1-1: ⚠️ Potential issue | 🟡 Minor
Add the NVIDIA copyright header.
This test file is missing the required NVIDIA Apache 2.0 header with the latest modification year.
As per coding guidelines, "All source files must contain an NVIDIA copyright header with the year of latest meaningful modification. Use the Apache License 2.0 format."
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tests/unittest/_torch/auto_deploy/unit/multigpu/transformations/library/test_tp_sharding.py` at line 1, The file is missing the required NVIDIA Apache-2.0 copyright header; add the standard NVIDIA Apache 2.0 license header (with the latest modification year) at the top of the file before the module docstring (the current leading triple-quoted string in the test file), ensuring it matches the project's canonical NVIDIA header format and includes the correct year and license notice.

tests/unittest/_torch/auto_deploy/unit/singlegpu/transformations/library/test_moe_fusion.py (1)
1-5: ⚠️ Potential issue | 🟠 Major
Add the NVIDIA Apache 2.0 header (modified file).
This file is missing the required NVIDIA copyright/license header.
As per coding guidelines: All source files must contain an NVIDIA copyright header with the year of latest meaningful modification. Use the Apache License 2.0 format.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tests/unittest/_torch/auto_deploy/unit/singlegpu/transformations/library/test_moe_fusion.py` around lines 1 - 5, Add the NVIDIA Apache 2.0 header to the top of the file (test_moe_fusion.py) — insert the standard NVIDIA copyright/license block (Apache License, Version 2.0) with the correct year of latest meaningful modification and copyright owner before any imports (before the existing imports: pytest, torch, torch.fx, torch.nn, torch.nn.functional) so the file begins with the required header.

tensorrt_llm/_torch/auto_deploy/transform/library/fused_moe.py (1)
1-5: ⚠️ Potential issue | 🟠 Major
Add the NVIDIA Apache 2.0 header (modified file).
This file is missing the required NVIDIA copyright/license header.
As per coding guidelines: All source files must contain an NVIDIA copyright header with the year of latest meaningful modification. Use the Apache License 2.0 format.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tensorrt_llm/_torch/auto_deploy/transform/library/fused_moe.py` around lines 1 - 5, This file (fused_moe.py) is missing the required NVIDIA Apache-2.0 license header; add the standard NVIDIA Apache License, Version 2.0 header block at the very top of the file (above the existing imports), include the correct copyright line with the year of latest meaningful modification, and ensure the header text matches the project's canonical Apache-2.0 template (including the SPDX identifier or full license notice) so the module (fused_moe.py) is properly licensed.

tests/integration/defs/accuracy/test_llm_api_autodeploy.py (1)
1-2: ⚠️ Potential issue | 🟠 Major
Update the copyright year to 2026.
This file is modified but still shows a 2025 header.
As per coding guidelines: Include NVIDIA copyright header on ALL new files and update year on modified files.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tests/integration/defs/accuracy/test_llm_api_autodeploy.py` around lines 1 - 2, Update the SPDX copyright header year from 2025 to 2026 in this file by changing the lines that begin with "# SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved." and ensuring the "# SPDX-License-Identifier: Apache-2.0" line remains unchanged; make the year 2026 so the header reflects the modified file status.
🟡 Minor comments (19)
tensorrt_llm/_torch/auto_deploy/custom_ops/fused_moe/trtllm_moe.py-726-741 (1)
726-741: ⚠️ Potential issue | 🟡 Minor
Silence ARG001 warnings in the fake kernel.
Ruff flags unused arguments in `trtllm_quant_finegrained_fp8_moe_fused_fake`. Add a `_ = ...` discard line or `# noqa: ARG001` to keep lint clean.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tensorrt_llm/_torch/auto_deploy/custom_ops/fused_moe/trtllm_moe.py` around lines 726 - 741, The fake kernel function trtllm_quant_finegrained_fp8_moe_fused_fake declares many parameters that are unused and trigger ARG001; silence the warnings by explicitly discarding unused parameters (e.g., add lines like "_ = selected_experts; _ = routing_weights; _ = fc1_expert_weights; ..." or a single tuple discard) or append "# noqa: ARG001" to the function signature, keeping the call to _validate_mlp_style_and_act_fn intact and referencing the same function name to locate the change.

tests/unittest/_torch/auto_deploy/unit/multigpu/transformations/library/test_tp_sharding.py-1163-1165 (1)
1163-1165: ⚠️ Potential issue | 🟡 Minor
Silence B007 by discarding the unused loop variable.
Ruff flags `name` as unused in this loop.
🧹 Example fix

```diff
-    for name, param in gm_transformed.named_parameters():
+    for _, param in gm_transformed.named_parameters():
         param.data.fill_(0.0)
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tests/unittest/_torch/auto_deploy/unit/multigpu/transformations/library/test_tp_sharding.py` around lines 1163 - 1165, Ruff flags the unused loop variable `name` in the loop over `gm_transformed.named_parameters()`; change the loop to discard the name (e.g., use `_` or `_name`) so the variable is not flagged and keep the body that zeros each parameter before calling `gm_transformed.load_state_dict(original_state_dict, strict=False)`. Ensure you only modify the `for name, param in gm_transformed.named_parameters():` header to drop the unused name while leaving `param.data.fill_(0.0)` and the subsequent `load_state_dict` call unchanged.

tensorrt_llm/_torch/auto_deploy/transform/library/sharding.py-548-563 (1)
548-563: ⚠️ Potential issue | 🟡 Minor
Silence ARG002 warnings for unused parameters in `shard_load_hook`.
Ruff flags unused `prefix`, `*args`, and `weight_shape` in the new override. Rename to `_prefix`, `*_args`, `_weight_shape` (or add `# noqa: ARG002`).
🧹 Example fix

```diff
 def shard_load_hook(
     self,
     state_dict,
-    prefix,
-    *args,
+    _prefix,
+    *_args,
     weight_name: str,
-    weight_shape: torch.Size,
+    _weight_shape: torch.Size,
     dim: int,
     rank: int,
     world_size: int,
 ) -> None:
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tensorrt_llm/_torch/auto_deploy/transform/library/sharding.py` around lines 548 - 563, In shard_load_hook, silence ARG002 by renaming unused parameters prefix, *args, and weight_shape to _prefix, *_args, and _weight_shape (or alternatively append "# noqa: ARG002" to their declarations); update the function signature in shard_load_hook accordingly so only weight_name, dim, rank, and world_size remain as used names while keeping behavior (still use scale_key logic and self._split_scale) intact.

tensorrt_llm/_torch/auto_deploy/transform/library/sharding.py-801-805 (1)
801-805: ⚠️ Potential issue | 🟡 Minor
Silence ARG002 warnings in `validate`.
`gm` is unused; ruff flags it. Rename to `_gm` (or add `# noqa: ARG002`).
🧹 Example fix

```diff
-    def validate(self, gm: GraphModule = None, node: Node = None) -> bool:
+    def validate(self, _gm: GraphModule = None, node: Node = None) -> bool:
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tensorrt_llm/_torch/auto_deploy/transform/library/sharding.py` around lines 801 - 805, The validate method currently declares an unused parameter gm which triggers ARG002; update the signature of validate to rename gm to _gm (def validate(self, _gm: GraphModule = None, node: Node = None) -> bool) so the linter ignores it, or alternatively add a per-parameter noqa comment for ARG002—ensure you change only the parameter name and keep the body unchanged (preserve the is_op check against torch.ops.auto_deploy.torch_quant_finegrained_fp8_moe and the ad_logger.warning/return behavior).

tensorrt_llm/_torch/auto_deploy/custom_ops/quantization/torch_quant.py-617-626 (1)
617-626: ⚠️ Potential issue | 🟡 Minor
Silence ARG001 warnings in `_trtllm_finegrained_fp8_linear_fake`.
`bias` and `weight_scale` are unused in the fake kernel; ruff will flag this.
🧹 Example fix

```diff
 def _trtllm_finegrained_fp8_linear_fake(
     input: torch.Tensor,
     weight: torch.Tensor,
     bias: Optional[torch.Tensor],
     weight_scale: torch.Tensor,
 ) -> torch.Tensor:
+    _ = bias, weight_scale
     """Fake implementation for torch.export tracing."""
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tensorrt_llm/_torch/auto_deploy/custom_ops/quantization/torch_quant.py` around lines 617 - 626, The fake kernel _trtllm_finegrained_fp8_linear_fake currently has unused parameters bias and weight_scale which trigger ARG001; to silence the warning, explicitly consume them (for example assign them to a throwaway variable or del them) at the top of the function body so they are referenced (e.g., unused = (bias, weight_scale) or del bias, weight_scale) while leaving the rest of _trtllm_finegrained_fp8_linear_fake unchanged.

tests/unittest/_torch/auto_deploy/unit/multigpu/transformations/library/test_tp_sharding.py-229-327 (1)
229-327: ⚠️ Potential issue | 🟡 Minor
Rename classes to PascalCase (no underscores).
`GDN_Block` and `GDN_Block_Unfused` are not PascalCase. Please rename (e.g., `GdnBlock`, `GdnBlockUnfused`) and update references.
🔧 Example rename (apply similarly in references)

```diff
-class GDN_Block(nn.Module):
+class GdnBlock(nn.Module):
 ...
-class GDN_Block_Unfused(nn.Module):
+class GdnBlockUnfused(nn.Module):
 ...
-    elif model_cls in (GDN_Block, GDN_Block_Unfused):
+    elif model_cls in (GdnBlock, GdnBlockUnfused):
```

As per coding guidelines, "Use PascalCase for class names (e.g., `class SomeClass`)."
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tests/unittest/_torch/auto_deploy/unit/multigpu/transformations/library/test_tp_sharding.py` around lines 229 - 327, Class names GDN_Block and GDN_Block_Unfused violate PascalCase; rename them to PascalCase (e.g., GdnBlock and GdnBlockUnfused) and update all local references. Change the class definitions for GDN_Block -> GdnBlock and GDN_Block_Unfused -> GdnBlockUnfused, update any instantiations, type hints, subclassing, and test references (search for the symbols GDN_Block and GDN_Block_Unfused) to use the new names, and run tests to ensure no remaining references or import/name errors.

tensorrt_llm/_torch/auto_deploy/custom_ops/quantization/torch_quant.py-18-20 (1)
18-20: ⚠️ Potential issue | 🟡 Minor
Use module-level imports to preserve namespace.
The new `import triton` / `import triton.language as tl` lines violate the namespace rule. Consider importing the submodules/functions explicitly and prefixing usage to keep a clear namespace.
🔧 Suggested refactor

```diff
-import triton
-import triton.language as tl
+from triton import cdiv as triton_cdiv, jit as triton_jit, language as tl

-@triton.jit
+@triton_jit
 def _act_quant_kernel(...):
     ...

-    grid = lambda meta: (triton.cdiv(x.numel(), meta["BLOCK_SIZE"]),)  # noqa: E731
+    grid = lambda meta: (triton_cdiv(x.numel(), meta["BLOCK_SIZE"]),)  # noqa: E731
```

As per coding guidelines, "Always maintain the namespace when importing. Use `from package.subpackage import foo` instead of `from package.subpackage.foo import SomeClass` or `import package`."
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tensorrt_llm/_torch/auto_deploy/custom_ops/quantization/torch_quant.py` around lines 18 - 20, The file currently does a bare import of triton.language as tl which breaks the namespace rule; replace the alias import with a module-level import (keep only import triton) and update all usages of tl to fully-qualified references (replace tl.* with triton.language.*) so the triton namespace is preserved throughout (search for references to tl in this file and update them accordingly).

tests/unittest/_torch/auto_deploy/unit/multigpu/transformations/library/test_tp_sharding.py-13-13 (1)
13-13: ⚠️ Potential issue | 🟡 Minor
Keep module namespace for `_model_test_utils` imports.
The new direct class import violates the namespace rule. Prefer importing the module and qualifying usages.
🔧 Suggested refactor

```diff
-from _model_test_utils import FakeFineGrainedFP8Linear, FakeFP8Linear
+import _model_test_utils as model_test_utils

-        self.linear1 = FakeFP8Linear(in_features, 4 * in_features, bias=bias)
-        self.linear2 = FakeFP8Linear(4 * in_features, out_features, bias=bias)
+        self.linear1 = model_test_utils.FakeFP8Linear(in_features, 4 * in_features, bias=bias)
+        self.linear2 = model_test_utils.FakeFP8Linear(4 * in_features, out_features, bias=bias)

-        self.linear1 = FakeFineGrainedFP8Linear(in_features, hidden_features, bias=bias)
-        self.linear2 = FakeFineGrainedFP8Linear(hidden_features, out_features, bias=bias)
+        self.linear1 = model_test_utils.FakeFineGrainedFP8Linear(in_features, hidden_features, bias=bias)
+        self.linear2 = model_test_utils.FakeFineGrainedFP8Linear(hidden_features, out_features, bias=bias)
```

As per coding guidelines, "Always maintain the namespace when importing. Use `from package.subpackage import foo` instead of `from package.subpackage.foo import SomeClass` or `import package`."
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tests/unittest/_torch/auto_deploy/unit/multigpu/transformations/library/test_tp_sharding.py` at line 13, The test imports concrete classes directly from _model_test_utils (FakeFineGrainedFP8Linear, FakeFP8Linear) which breaks the namespace guideline; change the import to import _model_test_utils as _model_test_utils (or import _model_test_utils) and update all references in this test (e.g., usages of FakeFineGrainedFP8Linear and FakeFP8Linear) to use the qualified names _model_test_utils.FakeFineGrainedFP8Linear and _model_test_utils.FakeFP8Linear so the module namespace is preserved.

tensorrt_llm/_torch/auto_deploy/custom_ops/quantization/torch_quant.py-474-529 (1)
474-529: ⚠️ Potential issue | 🟡 Minor
Silence ARG001 warnings in FineGrained FP8 ops/fakes.
Ruff reports unused args (`input_scale`, `input_zp`, `weight_zp`, plus `bias` in the fake). Add a discard line or `# noqa: ARG001` to keep lint clean.
🧹 Example fix (apply similarly to fakes)

```diff
 def torch_fake_quant_finegrained_fp8_linear(
     input: torch.Tensor,  # [..., K]
     weight_quantized: torch.Tensor,  # [N, K] float8_e4m3fn
     bias: Optional[torch.Tensor],  # [N] or None
     input_scale: List[torch.Tensor],  # unused for FineGrained FP8 (input quantized on the fly)
     weight_scale: List[torch.Tensor],  # [weight_scale_inv]
     input_zp: List[torch.Tensor],  # unused
     weight_zp: List[torch.Tensor],  # unused
 ) -> torch.Tensor:
+    _ = input_scale, input_zp, weight_zp
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tensorrt_llm/_torch/auto_deploy/custom_ops/quantization/torch_quant.py` around lines 474 - 529, Silence the ARG001 lint by marking the unused parameters as intentionally unused in both the real and fake FineGrained FP8 ops: in torch_fake_quant_finegrained_fp8_linear add a discard line referencing input_scale, input_zp, and weight_zp (e.g. assign them to _ or tuple-unpack to a throwaway variable) and in _torch_fake_quant_finegrained_fp8_linear_fake also discard input_scale, input_zp, weight_zp and bias; this keeps the signatures intact for tracing but prevents Ruff from flagging unused-argument warnings (alternatively add "# noqa: ARG001" to those parameter names if you prefer a comment-based suppression).

tensorrt_llm/_torch/auto_deploy/custom_ops/fused_moe/trtllm_moe.py-25-27 (1)
25-27: ⚠️ Potential issue | 🟡 Minor
Keep module namespace in the new imports.
The new imports pull symbols directly; guidelines require keeping module namespaces. Consider importing the modules and qualifying usage.
🔧 Suggested refactor

```diff
-from tensorrt_llm._torch.modules.fused_moe.routing import RoutingMethodType
+from tensorrt_llm._torch.modules.fused_moe import routing
 ...
-from tensorrt_llm._utils import is_sm_100f
+from tensorrt_llm import _utils as trtllm_utils

-    if is_sm_100f():
+    if trtllm_utils.is_sm_100f():
     ...
-        RoutingMethodType.Renormalize,
+        routing.RoutingMethodType.Renormalize,
```

As per coding guidelines, "Always maintain the namespace when importing. Use `from package.subpackage import foo` instead of `from package.subpackage.foo import SomeClass` or `import package`."
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tensorrt_llm/_torch/auto_deploy/custom_ops/fused_moe/trtllm_moe.py` around lines 25 - 27, The three direct imports (RoutingMethodType, ActivationType, is_sm_100f) violate the "keep module namespace" rule; change them to import their modules (e.g., import tensorrt_llm._torch.modules.fused_moe.routing as routing, import tensorrt_llm._torch.utils as torch_utils, import tensorrt_llm._utils as llm_utils) and update all references in this file from RoutingMethodType, ActivationType, and is_sm_100f to routing.RoutingMethodType, torch_utils.ActivationType, and llm_utils.is_sm_100f respectively so the module namespace is preserved across usages.

tensorrt_llm/_torch/auto_deploy/models/patches/qwen3_next.py-1-12 (1)
1-12: ⚠️ Potential issue | 🟡 Minor
Missing NVIDIA copyright header.
This new file is missing the required NVIDIA copyright header with Apache License 2.0. As per coding guidelines: "All source files must contain an NVIDIA copyright header with the year of latest meaningful modification."
Add copyright header

```diff
+# SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# SPDX-License-Identifier: Apache-2.0
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
 """Patches for Qwen3Next to make it compatible with torch.export and reduce export time.
```

As per coding guidelines: "All source files must contain an NVIDIA copyright header with the year of latest meaningful modification. Use the Apache License 2.0 format."
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tensorrt_llm/_torch/auto_deploy/models/patches/qwen3_next.py` around lines 1 - 12, Add the required NVIDIA Apache-2.0 copyright header at the top of this module (qwen3_next patch file) with the year of latest meaningful modification; place the standard NVIDIA Apache License 2.0 boilerplate before the existing module docstring so the file begins with the license header, and ensure it mentions ownership by NVIDIA and the correct year. Keep the rest of the file intact (this patch touches definitions like Qwen3NextSparseMoeBlock and Qwen3NextGatedDeltaNet), only prepending the canonical NVIDIA Apache-2.0 header text.

tests/unittest/_torch/auto_deploy/unit/singlegpu/custom_ops/fla/test_torch_cached_gated_delta_rule.py-16-16 (1)
16-16: ⚠️ Potential issue | 🟡 Minor
Stale `noqa: F401` directive
Ruff reports RUF100: the `# noqa: F401` directive at line 16 references a rule (F401) that is not enabled in the project's Ruff configuration, making the suppression a no-op. The same pattern appears in `test_quant.py` line 6, so this may be consistent with a codebase-wide convention. Consider switching to a plain comment (`# side-effect import – registers custom ops`) or enabling `F401` in the Ruff config to make the suppression meaningful.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tests/unittest/_torch/auto_deploy/unit/singlegpu/custom_ops/fla/test_torch_cached_gated_delta_rule.py` at line 16, The import of tensorrt_llm._torch.auto_deploy.custom_ops currently uses a stale suppression "# noqa: F401"; update the import line in test_torch_cached_gated_delta_rule (and the matching line in test_quant.py) to remove the unused/no-op noqa and instead add an explanatory side-effect comment such as "# side-effect import — registers custom ops" (or alternatively enable F401 in the Ruff config if you prefer keeping a suppression); locate the statement importing tensorrt_llm._torch.auto_deploy.custom_ops and replace the directive accordingly.

tests/unittest/_torch/auto_deploy/unit/singlegpu/transformations/library/test_quantization.py-116-118 (1)
116-118: ⚠️ Potential issue | 🟡 Minor
`bias` parameter is declared and parameterized but never used — both runs are identical
Ruff reports ARG001: unused function argument `bias`. The test creates `MLP(128, 256, 128)`, which hard-codes `bias=False` in both its `nn.Linear` layers (per the `MLP` definition in `_model_test_utils.py`), so neither the `True` nor `False` variant actually exercises bias logic.
Either wire the parameter into the model construction, or remove the parametrize if bias coverage is not intended here:
🔧 Option: use an inline model that respects `bias`

```diff
-    model = MLP(128, 256, 128).to(torch.float16).to("cuda")
+    import torch.nn as nn
+    model = nn.Sequential(
+        nn.Linear(128, 256, bias=bias),
+        nn.ReLU(),
+        nn.Linear(256, 128, bias=bias),
+    ).to(torch.float16).to("cuda")
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tests/unittest/_torch/auto_deploy/unit/singlegpu/transformations/library/test_quantization.py` around lines 116 - 118, The test test_finegrained_fp8_quantization declares a parameter bias but never uses it (ARG001) because MLP in _model_test_utils.py currently hard-codes bias=False; either remove the `@pytest.mark.parametrize`("bias", [True, False]) line or wire the parameter into the model construction by passing bias into MLP (e.g., construct MLP(128, 256, 128, bias=bias) or equivalent call site), ensuring the test actually exercises both bias=True and bias=False cases.

tests/unittest/_torch/auto_deploy/unit/singlegpu/transformations/library/test_gated_delta_rule_cache.py-28-30 (1)
28-30: ⚠️ Potential issue | 🟡 Minor
Use namespace-style import and drop the unused `# noqa`.
The `# noqa: F401` is unused, and the namespace import rule prefers importing the submodule from its package.
Suggested fix

```diff
-    import tensorrt_llm._torch.auto_deploy.custom_ops  # noqa: F401
+    from tensorrt_llm._torch.auto_deploy import custom_ops  # side-effect registration
```

As per coding guidelines: Always maintain the namespace when importing. Use `from package.subpackage import foo` instead of `from package.subpackage.foo import SomeClass` or `import package`.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tests/unittest/_torch/auto_deploy/unit/singlegpu/transformations/library/test_gated_delta_rule_cache.py` around lines 28 - 30, The import line uses a bare module import with an unnecessary "# noqa: F401" and should use a namespace-style import; replace "import tensorrt_llm._torch.auto_deploy.custom_ops # noqa: F401" with a namespace import "from tensorrt_llm._torch.auto_deploy import custom_ops" and remove the "# noqa" comment so the submodule is imported with its package namespace preserved (referencing the module symbol tensorrt_llm._torch.auto_deploy.custom_ops).

tests/unittest/_torch/auto_deploy/unit/singlegpu/transformations/library/test_moe_fusion.py-1244-1253 (1)
1244-1253: ⚠️ Potential issue | 🟡 Minor

Remove or use the unused `dtype` parameter. `dtype` is accepted but not used. Either drop it or apply it when creating weights to avoid lint warnings and confusion.

Suggested fix (use dtype)

```diff
- weight_fp32 = torch.randn(out_features, in_features, device=device) * 0.01
+ weight_fp32 = torch.randn(out_features, in_features, device=device, dtype=dtype) * 0.01
```

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@tests/unittest/_torch/auto_deploy/unit/singlegpu/transformations/library/test_moe_fusion.py` around lines 1244-1253: the `__init__` signature accepts a `dtype` parameter that is never used; either remove `dtype` from the signature or apply it when constructing tensors/weights in this class (e.g., pass `dtype` to `torch.empty`/`torch.randn` or call `.to(dtype)`) so the parameter is meaningful; update the constructor (the `__init__` method that takes `ffn_dim, hidden_dim, dtype=torch.bfloat16, device="cuda", block_size=None`) and any weight initializations to use the dtype, or delete the `dtype` parameter and all references to it to avoid the unused-parameter lint warning.

tests/unittest/_torch/auto_deploy/unit/singlegpu/transformations/library/test_torch_gated_delta_rule_cache.py (lines 20-22)
20-22: ⚠️ Potential issue | 🟡 Minor

Use namespace-style import and drop the unused `# noqa`. The `# noqa: F401` is unused, and the namespace import rule prefers importing the submodule from its package.

Suggested fix

```diff
- import tensorrt_llm._torch.auto_deploy.custom_ops  # noqa: F401
+ from tensorrt_llm._torch.auto_deploy import custom_ops  # side-effect registration
```

As per coding guidelines: Always maintain the namespace when importing. Use `from package.subpackage import foo` instead of `from package.subpackage.foo import SomeClass` or `import package`.

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@tests/unittest/_torch/auto_deploy/unit/singlegpu/transformations/library/test_torch_gated_delta_rule_cache.py` around lines 20-22: replace the bare module import `import tensorrt_llm._torch.auto_deploy.custom_ops  # noqa: F401` with the namespace-style import `from tensorrt_llm._torch.auto_deploy import custom_ops` and remove the unused `# noqa: F401` comment; ensure the imported symbol is still referenced as `custom_ops` so the module is registered/loaded as before.

tests/unittest/_torch/auto_deploy/unit/singlegpu/custom_ops/fla/test_fla_cached_gated_delta_rule.py (lines 17-20)
17-20: ⚠️ Potential issue | 🟡 Minor

Use namespace-style import and drop the unused `# noqa`. The `# noqa: F401` is flagged as unused, and the import style violates the namespace rule.

Suggested fix

```diff
- import tensorrt_llm._torch.auto_deploy.custom_ops  # noqa: F401
+ from tensorrt_llm._torch.auto_deploy import custom_ops  # side-effect registration
```

As per coding guidelines: Always maintain the namespace when importing. Use `from package.subpackage import foo` instead of `from package.subpackage.foo import SomeClass` or `import package`.

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@tests/unittest/_torch/auto_deploy/unit/singlegpu/custom_ops/fla/test_fla_cached_gated_delta_rule.py` around lines 17-20: remove the trailing `# noqa: F401` and switch to namespace-style imports: replace `import tensorrt_llm._torch.auto_deploy.custom_ops  # noqa: F401` with the same import minus the noqa, and change the two specific imports to import from the fla namespace (use `from tensorrt_llm._torch.modules.fla import chunk_gated_delta_rule` and `from tensorrt_llm._torch.modules.fla import fused_recurrent_gated_delta_rule_fwd`) so the module namespace is preserved and no unused noqa annotation remains.

tensorrt_llm/_torch/auto_deploy/custom_ops/fla/fla_gated_delta.py (lines 171-179)
171-179: ⚠️ Potential issue | 🟡 Minor

Silence unused-argument warnings in `torch_gated_delta_rule_fake`. Ruff flags the unused params; a small `del` block (or `_` prefixes) avoids lint noise while keeping the signature intact.

Suggested fix

```diff
  def torch_gated_delta_rule_fake(
      q: torch.Tensor,
      k: torch.Tensor,
      v: torch.Tensor,
      g: torch.Tensor,
      beta: torch.Tensor,
      scale: Optional[float] = None,
  ) -> torch.Tensor:
+     del q, k, g, beta, scale
      return torch.empty_like(v)
```

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@tensorrt_llm/_torch/auto_deploy/custom_ops/fla/fla_gated_delta.py` around lines 171-179: the function `torch_gated_delta_rule_fake` declares parameters (q, k, g, beta, scale) that go unused and trigger lint warnings; silence them by explicitly deleting or referencing them at the top of `torch_gated_delta_rule_fake` (e.g., add a small `del q, k, g, beta, scale`, or assign them to `_` to mark them intentionally unused) while keeping the public signature intact and not changing behavior — note that `v` must stay alive because it feeds `torch.empty_like(v)`.

tests/unittest/_torch/auto_deploy/unit/singlegpu/transformations/library/test_torch_gated_delta_rule_cache.py (lines 36-54)
36-54: ⚠️ Potential issue | 🟡 Minor

Initialize ModelFactory state in `DummyFactory.__init__`. `DummyFactory` bypasses `ModelFactory.__init__`, leaving base attributes (e.g., `_prefetched_model_path`, `model_kwargs`, `skip_loading_weights`) undefined if accessed by the optimizer pipeline. Either call `super().__init__` or explicitly initialize those members.

```bash
#!/bin/bash
# Check how ModelFactory fields are accessed in the auto_deploy pipeline.
rg -n "factory\.|ModelFactory" tensorrt_llm/_torch/auto_deploy -g '*.py'
rg -n "_prefetched_model_path|_prefetched_tokenizer_path|model_kwargs|tokenizer_kwargs|skip_loading_weights|max_seq_len" tensorrt_llm/_torch/auto_deploy -g '*.py'
```

As per coding guidelines: Initialize all externally visible members of a class in the constructor.

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@tests/unittest/_torch/auto_deploy/unit/singlegpu/transformations/library/test_torch_gated_delta_rule_cache.py` around lines 36-54: `DummyFactory` currently bypasses `ModelFactory.__init__`, causing missing base attributes; update `DummyFactory.__init__` to call `super().__init__()` or explicitly initialize the base members used by the auto-deploy pipeline (e.g., `_prefetched_model_path`, `_prefetched_tokenizer_path`, `model_kwargs`, `tokenizer_kwargs`, `skip_loading_weights`, `max_seq_len`) so accesses from methods like `build_model`, `get_cache_config_updates`, `_build_model`, or `_load_checkpoint` won't hit undefined attributes.
tensorrt_llm/_torch/auto_deploy/custom_ops/attention/flashinfer_attention.py
tensorrt_llm/_torch/auto_deploy/custom_ops/fla/torch_backend_gated_delta.py
dd6ad31 to c4aa3af (compare)
+ review comments Signed-off-by: Balamurugan Marimuthu <246387390+bmarimuthu-nv@users.noreply.github.com>
13b183b to a1e28f2 (compare)
b3f2260 to 6e95538 (compare)
/bot run --disable-fail-fast

PR_Github #38091 [ run ] triggered by Bot. Commit:

PR_Github #38091 [ run ] completed with state

/bot run
5bff4ba to c759cff (compare)
/bot run --reuse-test

PR_Github #38741 [ run ] triggered by Bot. Commit:

PR_Github #38741 [ run ] completed with state

/bot run --reuse-test

/bot run --extra-stage "DGX_B200-4_GPUs-AutoDeploy-1, DGX_H100-4_GPUs-AutoDeploy-1" --disable-fail-fast

PR_Github #38768 [ run ] triggered by Bot. Commit:

PR_Github #38769 [ run ] triggered by Bot. Commit:

PR_Github #38768 [ run ] completed with state
tensorrt_llm/_torch/auto_deploy/custom_ops/fla/fla_backend_gated_delta.py
tests/unittest/auto_deploy/singlegpu/smoke/test_ad_build_small_single.py
tensorrt_llm/_torch/auto_deploy/custom_ops/fused_moe/benchmark_routing.py
The PR is too large to review; please consider splitting it into smaller PRs in the future.
tensorrt_llm/_torch/auto_deploy/custom_ops/fla/fla_backend_gated_delta.py
PR_Github #38769 [ run ] completed with state

/bot run --extra-stage "DGX_B200-4_GPUs-AutoDeploy-1, DGX_H100-4_GPUs-AutoDeploy-1" --disable-fail-fast

PR_Github #38799 [ run ] triggered by Bot. Commit:
Signed-off-by: Suyog Gupta <41447211+suyoggupta@users.noreply.github.com>
/bot run --extra-stage "DGX_B200-4_GPUs-AutoDeploy-1, DGX_H100-4_GPUs-AutoDeploy-1" --disable-fail-fast

PR_Github #38799 [ run ] completed with state

/bot run --reuse-test

PR_Github #38830 [ run ] triggered by Bot. Commit:

PR_Github #38830 [ run ] completed with state

/bot run

PR_Github #38854 [ run ] triggered by Bot. Commit:

PR_Github #38854 [ run ] completed with state

/bot skip --comment "Only AD changes, and all AD tests passed recent CI runs"

PR_Github #38876 [ skip ] triggered by Bot. Commit:

PR_Github #38876 [ skip ] completed with state
Summary by CodeRabbit

Release Notes

New Features

Performance Improvements

Qwen3.5 35B A3B accuracy:
- BF16: ~82 MMLU, ~84 GSM8K
- FP8: ~81 MMLU, ~81 GSM8K