
[None][feat] Qwen3.5 perf optimizations#11581

Merged
suyoggupta merged 40 commits into NVIDIA:main from nv-auto-deploy:sg/qwen3.5-fp8
Mar 13, 2026

Conversation

@suyoggupta
Collaborator

@suyoggupta suyoggupta commented Feb 19, 2026

Summary by CodeRabbit

Release Notes

  • New Features

    • Added fused MoE routing optimization using Triton kernels for improved throughput.
    • Introduced FineGrained FP8 quantization support with optimized SwiGLU fusion.
    • Enabled multi-stream GEMM parallelization for concurrent GEMM operations.
    • Added fused gating operations for enhanced attention performance.
  • Performance Improvements

    • Optimized model configurations for better resource utilization across various model sizes.
    • Improved KV cache management and memory efficiency in attention layers.
    • Enhanced kernel fusion strategies for MoE and attention computations.

Qwen3.5 35B A3B:
BF16: ~82 MMLU, ~84 GSM8K
FP8: ~81 MMLU, ~81 GSM8K

@coderabbitai
Contributor

coderabbitai bot commented Feb 19, 2026

ℹ️ Recent review info
⚙️ Run configuration

Configuration file: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: b022e82f-5414-4a35-b824-3835a6bf95df

📥 Commits

Reviewing files that changed from the base of the PR and between 02c8a94 and c00dc3f.

📒 Files selected for processing (42)
  • examples/auto_deploy/model_registry/configs/qwen3.5_moe_35b.yaml
  • examples/auto_deploy/model_registry/configs/qwen3.5_moe_400b.yaml
  • tensorrt_llm/_torch/auto_deploy/config/default.yaml
  • tensorrt_llm/_torch/auto_deploy/custom_ops/attention/flashinfer_attention.py
  • tensorrt_llm/_torch/auto_deploy/custom_ops/attention_interface.py
  • tensorrt_llm/_torch/auto_deploy/custom_ops/fla/fla_backend_delta.py
  • tensorrt_llm/_torch/auto_deploy/custom_ops/fla/fla_backend_gated_delta.py
  • tensorrt_llm/_torch/auto_deploy/custom_ops/fla/fla_gated_delta.py
  • tensorrt_llm/_torch/auto_deploy/custom_ops/fla/gdn_gating.py
  • tensorrt_llm/_torch/auto_deploy/custom_ops/fla/torch_backend_gated_delta.py
  • tensorrt_llm/_torch/auto_deploy/custom_ops/fused_moe/benchmark_routing.py
  • tensorrt_llm/_torch/auto_deploy/custom_ops/fused_moe/triton_routing.py
  • tensorrt_llm/_torch/auto_deploy/custom_ops/linear/swiglu.py
  • tensorrt_llm/_torch/auto_deploy/custom_ops/mamba/flashinfer_backend_mamba.py
  • tensorrt_llm/_torch/auto_deploy/custom_ops/mamba/mamba_backend_common.py
  • tensorrt_llm/_torch/auto_deploy/custom_ops/mamba/triton_backend_mamba.py
  • tensorrt_llm/_torch/auto_deploy/custom_ops/quantization/torch_quant.py
  • tensorrt_llm/_torch/auto_deploy/models/custom/modeling_qwen3_5_moe.py
  • tensorrt_llm/_torch/auto_deploy/models/patches/qwen3_next.py
  • tensorrt_llm/_torch/auto_deploy/transform/library/fuse_causal_conv.py
  • tensorrt_llm/_torch/auto_deploy/transform/library/fuse_gdn_gating.py
  • tensorrt_llm/_torch/auto_deploy/transform/library/fuse_swiglu.py
  • tensorrt_llm/_torch/auto_deploy/transform/library/moe_routing.py
  • tensorrt_llm/_torch/auto_deploy/transform/library/multi_stream_gemm.py
  • tensorrt_llm/_torch/auto_deploy/transform/library/multi_stream_moe.py
  • tensorrt_llm/_torch/auto_deploy/transform/library/sharding.py
  • tests/integration/defs/accuracy/test_llm_api_autodeploy.py
  • tests/integration/test_lists/waives.txt
  • tests/unittest/auto_deploy/_utils_test/_model_test_utils.py
  • tests/unittest/auto_deploy/multigpu/transformations/library/test_tp_sharding.py
  • tests/unittest/auto_deploy/singlegpu/custom_ops/fla/test_fla_cached_gated_delta_rule.py
  • tests/unittest/auto_deploy/singlegpu/custom_ops/fla/test_fused_gdn_gating.py
  • tests/unittest/auto_deploy/singlegpu/custom_ops/fla/test_torch_cached_gated_delta_rule.py
  • tests/unittest/auto_deploy/singlegpu/custom_ops/mamba/test_flashinfer_mamba_cached_op.py
  • tests/unittest/auto_deploy/singlegpu/custom_ops/mamba/test_triton_mamba_cached_op.py
  • tests/unittest/auto_deploy/singlegpu/custom_ops/test_multi_stream_gemm.py
  • tests/unittest/auto_deploy/singlegpu/models/test_qwen3_5_moe.py
  • tests/unittest/auto_deploy/singlegpu/models/test_qwen3_next_gdn_patches.py
  • tests/unittest/auto_deploy/singlegpu/transformations/library/test_finegrained_fp8_swiglu.py
  • tests/unittest/auto_deploy/singlegpu/transformations/library/test_fuse_gdn_gating.py
  • tests/unittest/auto_deploy/singlegpu/transformations/library/test_gated_delta_rule_cache.py
  • tests/unittest/auto_deploy/singlegpu/transformations/library/test_torch_gated_delta_rule_cache.py

📝 Walkthrough

Walkthrough

This pull request introduces significant architectural improvements to auto-deploy infrastructure for large language models, primarily focusing on refactoring gated attention mechanisms, adding fused GDN gating operations, implementing multi-stream GEMM parallelization, and enhancing MoE routing with fused kernels. Configuration files are updated to reflect new backend strategies and performance tuning parameters. Custom operation signatures are modified to internalize preprocessing steps, and new graph transforms optimize runtime execution paths.

Changes

Cohort / File(s) Summary
Configuration Updates
examples/auto_deploy/model_registry/configs/qwen3.5_moe_*.yaml
Updated runtime configurations with new attention backend settings, increased sequence lengths, CUDA graph batch sizes, modified KV cache parameters (tokens_per_block: 64→32), and replaced manual TP sharding with automated SYMM_MEM allreduce and multi-stream MoE/GEMM strategies.
Default Transform Configuration
tensorrt_llm/_torch/auto_deploy/config/default.yaml
Added pattern matchers for MoE routing and fine-grained FP8 SwiGLU, enabled additional quantization and fusion transforms, and introduced new compilation-stage multi-stream GEMM support.
Attention Backend & Flashinfer
tensorrt_llm/_torch/auto_deploy/custom_ops/attention/flashinfer_attention.py
Increased workspace buffer size from 320MB to 1GB for improved memory allocation flexibility.
FLA Gated Delta Core Operations
tensorrt_llm/_torch/auto_deploy/custom_ops/fla/fla_gated_delta.py, fla_backend_gated_delta.py, torch_backend_gated_delta.py
Refactored gated delta attention to accept raw gating inputs (a, b, A_log, dt_bias) instead of pre-computed (g, beta); internalized L2 normalization, GQA expansion, and gating computation within operations. Updated public function signatures and cache shape semantics to HV-based (value-head) layouts.
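The L2 normalization and GQA expansion that moved inside the op can be sketched in plain Python; `l2_normalize` and `expand_gqa` are illustrative helper names (not the PR's actual functions), showing the preprocessing the refactored op now performs on q/k before the attention kernel runs:

```python
import math

def l2_normalize(v, eps=1e-6):
    """L2-normalize one head vector, as now done inside the gated delta op."""
    n = math.sqrt(sum(x * x for x in v)) + eps
    return [x / n for x in v]

def expand_gqa(k_heads, num_v_heads):
    """GQA expansion: replicate each key head so every value head has a
    matching key head (num_v_heads must be a multiple of len(k_heads))."""
    rep = num_v_heads // len(k_heads)
    return [h for h in k_heads for _ in range(rep)]
```

With this layout the cache shapes become value-head (HV) based, since keys are expanded to match value heads before caching.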
FLA Delta Backend Updates
tensorrt_llm/_torch/auto_deploy/custom_ops/fla/fla_backend_delta.py
Added host-side any_prefill_use_initial_states_host flag parameter to eliminate GPU synchronization during initial state decisions.
Fused GDN Gating Operations
tensorrt_llm/_torch/auto_deploy/custom_ops/fla/gdn_gating.py
Introduced new fused GDN gating custom ops (torch_fused_gdn_gating, triton_fused_gdn_gating) combining exponential, softplus, and sigmoid computations into single kernels.
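A scalar sketch of the gating math these fused ops combine (per the formulas in the sequence diagram below: `g = -exp(A_log) * softplus(a + dt_bias)`, `beta = sigmoid(b)`); the fused kernels apply this elementwise in a single pass instead of three separate ops:

```python
import math

def softplus(x: float) -> float:
    # Numerically stable softplus: log(1 + exp(x)) without overflow.
    return math.log1p(math.exp(-abs(x))) + max(x, 0.0)

def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))

def gdn_gating(a: float, b: float, A_log: float, dt_bias: float):
    """Per-element GDN gating: one fused exp + softplus + sigmoid pass."""
    g = -math.exp(A_log) * softplus(a + dt_bias)
    beta = sigmoid(b)
    return g, beta
```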
FP8 Quantization Enhancements
tensorrt_llm/_torch/auto_deploy/custom_ops/quantization/torch_quant.py
Added fallback path for non-standard FP8 block sizes, using dequantization to BF16 when block dimensions don't match expected sizes.
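The fallback path amounts to block-wise dequantization back to a higher-precision dtype; a minimal sketch with nested lists standing in for tensors (`dequant_block_fp8` is an illustrative name, and the real op dequantizes to BF16 on device):

```python
def dequant_block_fp8(w_q, scales, block):
    """Dequantize block-quantized weights: each (block x block) tile of w_q
    shares one scale from `scales`. Used when the block shape does not match
    what the fast FP8 kernel expects."""
    rows, cols = len(w_q), len(w_q[0])
    out = [[0.0] * cols for _ in range(rows)]
    for i in range(rows):
        for j in range(cols):
            s = scales[i // block][j // block]
            out[i][j] = w_q[i][j] * s
    return out
```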
Fused MoE Routing
tensorrt_llm/_torch/auto_deploy/custom_ops/fused_moe/triton_routing.py, benchmark_routing.py
Implemented fused top-k softmax MoE routing using Triton kernel, combining expert logits processing and routing weight computation; added benchmarking utilities for validation.
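A pure-Python reference for the three-step pattern the Triton kernel fuses (stable softmax over expert logits, top-k selection, renormalization of the selected weights); the function name is illustrative, not the op's registered name:

```python
import math

def topk_softmax_routing(logits, k):
    """Reference MoE routing: softmax -> top-k -> renormalize.
    Returns (routing_weights, expert_indices)."""
    m = max(logits)                              # subtract max for stability
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    probs = [e / z for e in exps]
    idx = sorted(range(len(probs)), key=lambda i: -probs[i])[:k]
    sel = [probs[i] for i in idx]
    norm = sum(sel)
    return [p / norm for p in sel], idx
```

The Triton version performs the same computation in-kernel with an argmax loop, avoiding the intermediate softmax and top-k tensors.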
Fine-Grained FP8 SwiGLU Operations
tensorrt_llm/_torch/auto_deploy/custom_ops/linear/swiglu.py
Added new fine-grained FP8 SwiGLU MLP custom ops (torch_finegrained_fp8_swiglu_mlp, fused_finegrained_fp8_swiglu_mlp) with dedicated kernels for quantized operations.
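The activation these ops fuse is standard SwiGLU — `silu(gate) * up` applied to the two MLP projections; a scalar sketch (the real ops run this on FP8-quantized GEMM outputs):

```python
import math

def swiglu(gate: float, up: float) -> float:
    """SwiGLU activation: silu(gate) * up, where gate and up are the
    gate-projection and up-projection outputs for one element."""
    silu = gate / (1.0 + math.exp(-gate))  # silu(x) = x * sigmoid(x)
    return silu * up
```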
Mamba SSM Backend Updates
tensorrt_llm/_torch/auto_deploy/custom_ops/mamba/
Added any_prefill_use_initial_states_host parameter to flashinfer, Triton, and common mamba backends to optimize host-side initial state decisions.
Attention Interface & Metadata
tensorrt_llm/_torch/auto_deploy/custom_ops/attention_interface.py
Introduced new any_prefill_use_initial_states_host field to SequenceInfo metadata with host-side precomputation support to avoid per-layer GPU synchronization.
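The idea behind the host-side flag is simple: compute the "any prefill sequence uses initial states" decision once per batch as a plain Python bool, instead of calling `.any()` on a device tensor inside every layer (each such call forces a GPU sync). An illustrative sketch, with a hypothetical function name:

```python
def precompute_any_prefill_flag(use_initial_states_per_seq):
    """Evaluate the batch-level decision once on the host. The resulting
    bool is passed to every layer, so no per-layer device sync is needed."""
    return any(bool(x) for x in use_initial_states_per_seq)
```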
Sharding Transforms
tensorrt_llm/_torch/auto_deploy/transform/library/sharding.py
Added is_quantized_linear_scale_tensor helper to prevent double-sharding of quantized scales; updated all-reduce operations to use backend-determined distribution ops via _get_dist_ops; exposed is_fake_quantized_linear_op public symbol; removed enable_attention_dp field from config.
New Graph Transforms
tensorrt_llm/_torch/auto_deploy/transform/library/fuse_gdn_gating.py, moe_routing.py, multi_stream_gemm.py, multi_stream_moe.py, fuse_swiglu.py
Introduced five new graph transformation passes: FuseGdnGating replaces torch ops with Triton variants; MatchMoeRoutingPattern fuses softmax-topk-renormalize sequences; MultiStreamGemm parallelizes shared-input FP8 GEMMs; MatchFineGrainedFP8SwiGLUPattern and FuseFineGrainedFP8SwiGLU optimize SwiGLU paths; updated MultiStreamMoE to support fused MoE ops.
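The MultiStreamGemm fork-point idea can be sketched without FX machinery: any activation consumed by two or more GEMMs is a fork point, and the largest GEMM by weight size is the candidate to move onto a side stream. The mapping shape and names here are illustrative, not the transform's actual data structures:

```python
def find_fork_points(gemm_users):
    """gemm_users maps an activation name to a list of (gemm_name,
    weight_numel) pairs consuming it. Returns, for each fork point,
    the name of the largest GEMM (the side-stream candidate)."""
    forks = {}
    for src, users in gemm_users.items():
        if len(users) >= 2:  # 2+ GEMMs sharing one input = fork point
            largest = max(users, key=lambda u: u[1])
            forks[src] = largest[0]
    return forks
```

The actual transform then rewires the graph: the largest GEMM gets an `_aux` op variant on the side stream, an event-record passthrough is inserted before the remaining GEMMs, and downstream topology is adjusted to preserve ordering.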
Debugging & Tracing
tensorrt_llm/_torch/auto_deploy/transform/library/fuse_causal_conv.py
Added debug logging of transformed GraphModule to file after fusion.
Model Patches
tensorrt_llm/_torch/auto_deploy/models/patches/qwen3_next.py, models/custom/modeling_qwen3_5_moe.py
Updated GDN forward path to use single torch_gated_delta_rule op with raw gating inputs; introduced shared-expert gating in MoE path; refactored internal parameter handling; simplified L2 normalization and GQA expansion to be op-internal.
Integration Tests
tests/integration/defs/accuracy/test_llm_api_autodeploy.py
Added small-model test path for Qwen3.5 MoE with separate config; removed KimiK2_5 AutoDeploy accuracy tests.
Unit Tests: FLA Gated Delta
tests/unittest/auto_deploy/singlegpu/custom_ops/fla/test_fla_cached_gated_delta_rule.py, test_torch_cached_gated_delta_rule.py, test_gated_delta_rule_cache.py
Refactored tests to exercise raw input handling (a, b, A_log, dt_bias) with preprocessing helpers; added multi-head parameterization (num_k_heads vs num_v_heads) for GVA scenarios; updated reference implementations.
Unit Tests: GDN Gating
tests/unittest/auto_deploy/singlegpu/custom_ops/fla/test_fused_gdn_gating.py, test_qwen3_next_gdn_patches.py
Added comprehensive tests for fused GDN gating (Torch and Triton variants); updated GDN patch tests to use raw inputs and internal preprocessing.
Unit Tests: Transform Validation
tests/unittest/auto_deploy/singlegpu/transformations/library/test_fuse_gdn_gating.py, test_finegrained_fp8_swiglu.py, test_multi_stream_gemm.py
Introduced tests for new graph transforms: FuseGdnGating validation, fine-grained FP8 SwiGLU pattern matching and fusion, and comprehensive multi-stream GEMM fork-point detection and parallelization.
Unit Tests: Sharding & Mamba
tests/unittest/auto_deploy/multigpu/transformations/library/test_tp_sharding.py, tests/unittest/auto_deploy/singlegpu/custom_ops/mamba/test_*.py
Updated TP sharding tests for new gated delta signatures; added any_prefill_use_initial_states_host parameter to Mamba SSM cached op tests.
Test Configuration
tests/unittest/auto_deploy/_utils_test/_model_test_utils.py, tests/integration/test_lists/waives.txt
Updated DeepSeek-V3 small model config (hidden_size: 32→128, intermediate_size: 64→128); removed DeepSeek-V3 build test skip marker.
Model Tests
tests/unittest/auto_deploy/singlegpu/models/test_qwen3_5_moe.py
Updated documentation to reflect torch_gated_delta_rule handling of L2 norm, GQA expansion, and gating internally.

Sequence Diagram(s)

sequenceDiagram
    participant App as Application
    participant SeqIface as SequenceInfo<br/>Interface
    participant FlaOp as FLA Gated Delta<br/>Custom Op
    participant PreProc as Internal<br/>Preprocessing
    participant Kernel as Attention<br/>Kernel

    App->>SeqIface: Prepare batch with use_initial_states,<br/>batch_info, and new any_prefill_use_initial_states_host
    SeqIface->>SeqIface: Precompute any_prefill_use_initial_states_host<br/>(host-side torch.any on CPU)
    SeqIface->>FlaOp: Pass raw inputs (q, k, v, a, b,<br/>A_log, dt_bias) + metadata
    FlaOp->>PreProc: Forward raw a, b, A_log, dt_bias
    PreProc->>PreProc: Compute g = -exp(A_log) * softplus(a + dt_bias)
    PreProc->>PreProc: Compute beta = sigmoid(b)
    PreProc->>PreProc: L2 normalize q, k
    PreProc->>PreProc: Expand q, k for GQA if num_v_heads > num_k_heads
    FlaOp->>Kernel: Execute attention with normalized q/k,<br/>gating params g/beta,<br/>initial_states decision from host flag
    Kernel-->>FlaOp: Output attention result
    FlaOp-->>App: Return fused output HV shape
sequenceDiagram
    participant Graph as FX GraphModule
    participant Transform as MultiStreamGemm<br/>Transform
    participant Inspector as Fork Point<br/>Detector
    participant Builder as Aux Op<br/>Builder
    participant Rewriter as Graph<br/>Rewriter

    Graph->>Transform: Accept graph for optimization
    Transform->>Inspector: Scan for fork points<br/>(2+ supported ops on same input)
    Inspector->>Inspector: Filter by: not already multi-stream,<br/>≥2 linear users, same input
    Inspector-->>Transform: List fork points with weights
    Transform->>Builder: For largest GEMM by weight,<br/>create _aux variant
    Builder->>Builder: Derive custom op name with _aux suffix
    Builder-->>Transform: Return aux op registration
    Transform->>Rewriter: Insert record_event_passthrough<br/>before earliest remaining linear
    Rewriter->>Rewriter: Wire largest GEMM to aux op,<br/>remaining to main stream
    Rewriter->>Rewriter: Adjust downstream topology<br/>to preserve order
    Rewriter-->>Transform: Updated graph with overlap
    Transform-->>Graph: Return optimized FX graph
sequenceDiagram
    participant Router as MoE Router<br/>Logits
    participant MatchTx as MatchMoeRoutingPattern<br/>Transform
    participant Detector as Pattern<br/>Detector
    participant Fusion as Fused Routing<br/>Builder
    participant Graph as Graph<br/>Rewriter

    Router->>MatchTx: Forward routing logits through<br/>softmax → topk → renormalize
    MatchTx->>Detector: Scan FX graph for topk nodes
    Detector->>Detector: Locate softmax decomposition<br/>preceding topk (multiple variants)
    Detector->>Detector: Verify renormalization on topk values
    Detector->>Detector: Extract original logits tensor
    Detector-->>MatchTx: Pattern matched, ready for fusion
    MatchTx->>Fusion: Build single fused_topk_softmax call
    Fusion->>Fusion: Triton kernel: in-kernel argmax loop,<br/>numerically stable softmax on top-k
    Fusion-->>MatchTx: Fused node (weights, indices)
    MatchTx->>Graph: Rewire downstream uses to fused result
    Graph->>Graph: Remove old softmax, topk, renormalize nodes
    Graph-->>MatchTx: Updated graph with fused routing
    MatchTx-->>Router: Return optimized routing graph

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~75 minutes

🚥 Pre-merge checks | ✅ 1 | ❌ 2

❌ Failed checks (2 warnings)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 45.66%, which is below the required threshold of 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
Description check ⚠️ Warning The PR description is minimal and lacks required information. It only contains benchmark results without explaining the feature, motivation, or implementation details required by the template. Expand the description following the provided template. Include: a clear summary of what block-scale FP8 quantization is and why it's needed, which models/use cases benefit, test coverage details, and checklist items. Reference the stacked PR and explain the relationship.
✅ Passed checks (1 passed)
Check name Status Explanation
Title check ✅ Passed The title '[None][feat] Qwen3.5 perf optimizations' is directly related to the main changes in the changeset, which introduce performance optimization features for Qwen3.5 models (FP8 quantization, gated delta rule caching, and MoE enhancements).



Comment @coderabbitai help to get the list of available commands and usage tips.


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 14

Note

Due to the large number of review comments, Critical and Major severity comments were prioritized as inline comments.

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (8)
tensorrt_llm/_torch/auto_deploy/utils/node_utils.py (1)

1-1: ⚠️ Potential issue | 🟡 Minor

Missing NVIDIA copyright header

The file starts directly with the module docstring; no NVIDIA copyright header is present. As per coding guidelines, modified files must have an updated copyright year.

📄 Proposed fix
+# SPDX-FileCopyrightText: Copyright (c) 2022-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# SPDX-License-Identifier: Apache-2.0
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
 """Common utils for torch fx graph transformation."""

As per coding guidelines: "All source files must contain an NVIDIA copyright header with the year of latest meaningful modification."

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tensorrt_llm/_torch/auto_deploy/utils/node_utils.py` at line 1, Add the
required NVIDIA copyright header at the top of
tensorrt_llm/_torch/auto_deploy/utils/node_utils.py immediately above the
existing module docstring ("""Common utils for torch fx graph
transformation."""), updating the copyright year to the latest meaningful
modification year; ensure the header follows the project's standard header
format and includes the correct year and ownership text.
tests/unittest/_torch/auto_deploy/unit/multigpu/transformations/library/test_ep_sharding.py (1)

1-1: ⚠️ Potential issue | 🟡 Minor

Add the NVIDIA copyright header.

This test file is missing the required NVIDIA Apache 2.0 header with the latest modification year.

As per coding guidelines, "All source files must contain an NVIDIA copyright header with the year of latest meaningful modification. Use the Apache License 2.0 format."

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@tests/unittest/_torch/auto_deploy/unit/multigpu/transformations/library/test_ep_sharding.py`
at line 1, This file's top-level docstring lacks the required NVIDIA Apache-2.0
copyright header; update the file header by replacing or prepending the existing
module docstring (the first line/string in the test_ep_sharding.py module) with
the standard NVIDIA Apache-2.0 copyright header including the correct current
year of latest modification and license text, ensuring the header appears at the
very top of test_ep_sharding.py before any code or docstrings.
tensorrt_llm/_torch/auto_deploy/custom_ops/fused_moe/trtllm_moe.py (1)

1-1: ⚠️ Potential issue | 🟡 Minor

Update the copyright year to include 2026.

The header still lists 2025 even though the file changed in 2026. Please bump the year (e.g., 2025-2026).

📄 Suggested header update
-# SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.

As per coding guidelines, "All source files must contain an NVIDIA copyright header with the year of latest meaningful modification. Use the Apache License 2.0 format."

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tensorrt_llm/_torch/auto_deploy/custom_ops/fused_moe/trtllm_moe.py` at line
1, Update the copyright header line that currently reads "#
SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All
rights reserved." to include 2026 (e.g., "2025-2026" or "2026") so the header
reflects the latest modification year and conforms to the Apache License 2.0
format; locate and edit this exact header string in trtllm_moe.py.
tensorrt_llm/_torch/auto_deploy/transform/library/sharding.py (1)

1-1: ⚠️ Potential issue | 🟡 Minor

Add the NVIDIA copyright header.

This file lacks the required NVIDIA Apache 2.0 header with the latest modification year.

As per coding guidelines, "All source files must contain an NVIDIA copyright header with the year of latest meaningful modification. Use the Apache License 2.0 format."

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tensorrt_llm/_torch/auto_deploy/transform/library/sharding.py` at line 1, Add
the required NVIDIA Apache-2.0 copyright header at the top of the file
immediately above the existing module docstring (the string starting with
"""Transformations to support graph sharding.). Use the standard NVIDIA Apache
License 2.0 header format with the latest modification year, ensure the SPDX
identifier (Apache-2.0) is present, and keep the header as the very first lines
of the file.
tests/unittest/_torch/auto_deploy/unit/multigpu/transformations/library/test_tp_sharding.py (1)

1-1: ⚠️ Potential issue | 🟡 Minor

Add the NVIDIA copyright header.

This test file is missing the required NVIDIA Apache 2.0 header with the latest modification year.

As per coding guidelines, "All source files must contain an NVIDIA copyright header with the year of latest meaningful modification. Use the Apache License 2.0 format."

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@tests/unittest/_torch/auto_deploy/unit/multigpu/transformations/library/test_tp_sharding.py`
at line 1, The file is missing the required NVIDIA Apache-2.0 copyright header;
add the standard NVIDIA Apache 2.0 license header (with the latest modification
year) at the top of the file before the module docstring (the current leading
triple-quoted string in the test file), ensuring it matches the project's
canonical NVIDIA header format and includes the correct year and license notice.
tests/unittest/_torch/auto_deploy/unit/singlegpu/transformations/library/test_moe_fusion.py (1)

1-5: ⚠️ Potential issue | 🟠 Major

Add the NVIDIA Apache 2.0 header (modified file).
This file is missing the required NVIDIA copyright/license header.

As per coding guidelines: All source files must contain an NVIDIA copyright header with the year of latest meaningful modification. Use the Apache License 2.0 format.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@tests/unittest/_torch/auto_deploy/unit/singlegpu/transformations/library/test_moe_fusion.py`
around lines 1 - 5, Add the NVIDIA Apache 2.0 header to the top of the file
(test_moe_fusion.py) — insert the standard NVIDIA copyright/license block
(Apache License, Version 2.0) with the correct year of latest meaningful
modification and copyright owner before any imports (before the existing
imports: pytest, torch, torch.fx, torch.nn, torch.nn.functional) so the file
begins with the required header.
tensorrt_llm/_torch/auto_deploy/transform/library/fused_moe.py (1)

1-5: ⚠️ Potential issue | 🟠 Major

Add the NVIDIA Apache 2.0 header (modified file).
This file is missing the required NVIDIA copyright/license header.

As per coding guidelines: All source files must contain an NVIDIA copyright header with the year of latest meaningful modification. Use the Apache License 2.0 format.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tensorrt_llm/_torch/auto_deploy/transform/library/fused_moe.py` around lines
1 - 5, This file (fused_moe.py) is missing the required NVIDIA Apache-2.0
license header; add the standard NVIDIA Apache License, Version 2.0 header block
at the very top of the file (above the existing imports), include the correct
copyright line with the year of latest meaningful modification, and ensure the
header text matches the project's canonical Apache-2.0 template (including the
SPDX identifier or full license notice) so the module (fused_moe.py) is properly
licensed.
tests/integration/defs/accuracy/test_llm_api_autodeploy.py (1)

1-2: ⚠️ Potential issue | 🟠 Major

Update the copyright year to 2026.
This file is modified but still shows a 2025 header.

As per coding guidelines: Include NVIDIA copyright header on ALL new files and update year on modified files.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/integration/defs/accuracy/test_llm_api_autodeploy.py` around lines 1 -
2, Update the SPDX copyright header year from 2025 to 2026 in this file by
changing the lines that begin with "# SPDX-FileCopyrightText: Copyright (c) 2025
NVIDIA CORPORATION & AFFILIATES. All rights reserved." and ensuring the "#
SPDX-License-Identifier: Apache-2.0" line remains unchanged; make the year 2026
so the header reflects the modified file status.
🟡 Minor comments (19)
tensorrt_llm/_torch/auto_deploy/custom_ops/fused_moe/trtllm_moe.py-726-741 (1)

726-741: ⚠️ Potential issue | 🟡 Minor

Silence ARG001 warnings in the fake kernel.

Ruff flags unused arguments in trtllm_quant_finegrained_fp8_moe_fused_fake. Add a _ = ... discard line or # noqa: ARG001 to keep lint clean.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tensorrt_llm/_torch/auto_deploy/custom_ops/fused_moe/trtllm_moe.py` around
lines 726 - 741, The fake kernel function
trtllm_quant_finegrained_fp8_moe_fused_fake declares many parameters that are
unused and trigger ARG001; silence the warnings by explicitly discarding unused
parameters (e.g., add lines like "_ = selected_experts; _ = routing_weights; _ =
fc1_expert_weights; ..." or a single tuple discard) or append "# noqa: ARG001"
to the function signature, keeping the call to _validate_mlp_style_and_act_fn
intact and referencing the same function name to locate the change.
tests/unittest/_torch/auto_deploy/unit/multigpu/transformations/library/test_tp_sharding.py-1163-1165 (1)

1163-1165: ⚠️ Potential issue | 🟡 Minor

Silence B007 by discarding the unused loop variable.

Ruff flags name as unused in this loop.

🧹 Example fix
-    for name, param in gm_transformed.named_parameters():
+    for _, param in gm_transformed.named_parameters():
         param.data.fill_(0.0)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@tests/unittest/_torch/auto_deploy/unit/multigpu/transformations/library/test_tp_sharding.py`
around lines 1163 - 1165, Ruff flags the unused loop variable `name` in the loop
over `gm_transformed.named_parameters()`; change the loop to discard the name
(e.g., use `_` or `_name`) so the variable is not flagged and keep the body that
zeros each parameter before calling
`gm_transformed.load_state_dict(original_state_dict, strict=False)`. Ensure you
only modify the `for name, param in gm_transformed.named_parameters():` header
to drop the unused name while leaving `param.data.fill_(0.0)` and the subsequent
`load_state_dict` call unchanged.
tensorrt_llm/_torch/auto_deploy/transform/library/sharding.py-548-563 (1)

548-563: ⚠️ Potential issue | 🟡 Minor

Silence ARG002 warnings for unused parameters in shard_load_hook.

Ruff flags unused prefix, *args, and weight_shape in the new override. Rename to _prefix, *_args, _weight_shape (or add # noqa: ARG002).

🧹 Example fix
     def shard_load_hook(
         self,
         state_dict,
-        prefix,
-        *args,
+        _prefix,
+        *_args,
         weight_name: str,
-        weight_shape: torch.Size,
+        _weight_shape: torch.Size,
         dim: int,
         rank: int,
         world_size: int,
     ) -> None:
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tensorrt_llm/_torch/auto_deploy/transform/library/sharding.py` around lines
548 - 563, In shard_load_hook, silence ARG002 by renaming unused parameters
prefix, *args, and weight_shape to _prefix, *_args, and _weight_shape (or
alternatively append "# noqa: ARG002" to their declarations); update the
function signature in shard_load_hook accordingly so only weight_name, dim,
rank, and world_size remain as used names while keeping behavior (still use
scale_key logic and self._split_scale) intact.
tensorrt_llm/_torch/auto_deploy/transform/library/sharding.py-801-805 (1)

801-805: ⚠️ Potential issue | 🟡 Minor

Silence ARG002 warnings in validate.

gm is unused; ruff flags it. Rename to _gm (or add # noqa: ARG002).

🧹 Example fix
-    def validate(self, gm: GraphModule = None, node: Node = None) -> bool:
+    def validate(self, _gm: GraphModule = None, node: Node = None) -> bool:
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tensorrt_llm/_torch/auto_deploy/transform/library/sharding.py` around lines
801 - 805, The validate method currently declares an unused parameter gm which
triggers ARG002; update the signature of validate to rename gm to _gm (def
validate(self, _gm: GraphModule = None, node: Node = None) -> bool) so the
linter ignores it, or alternatively add a per-parameter noqa comment for
ARG002—ensure you change only the parameter name and keep the body unchanged
(preserve the is_op check against
torch.ops.auto_deploy.torch_quant_finegrained_fp8_moe and the
ad_logger.warning/return behavior).
tensorrt_llm/_torch/auto_deploy/custom_ops/quantization/torch_quant.py-617-626 (1)

617-626: ⚠️ Potential issue | 🟡 Minor

Silence ARG001 warnings in _trtllm_finegrained_fp8_linear_fake.

bias and weight_scale are unused in the fake kernel; ruff will flag this.

🧹 Example fix
 def _trtllm_finegrained_fp8_linear_fake(
     input: torch.Tensor,
     weight: torch.Tensor,
     bias: Optional[torch.Tensor],
     weight_scale: torch.Tensor,
 ) -> torch.Tensor:
+    _ = bias, weight_scale
     """Fake implementation for torch.export tracing."""
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tensorrt_llm/_torch/auto_deploy/custom_ops/quantization/torch_quant.py`
around lines 617 - 626, The fake kernel _trtllm_finegrained_fp8_linear_fake
currently has unused parameters bias and weight_scale which trigger ARG001; to
silence the warning, explicitly consume them (for example assign them to a
throwaway variable or del them) at the top of the function body so they are
referenced (e.g., unused = (bias, weight_scale) or del bias, weight_scale) while
leaving the rest of _trtllm_finegrained_fp8_linear_fake unchanged.
tests/unittest/_torch/auto_deploy/unit/multigpu/transformations/library/test_tp_sharding.py-229-327 (1)

229-327: ⚠️ Potential issue | 🟡 Minor

Rename classes to PascalCase (no underscores).

GDN_Block and GDN_Block_Unfused are not PascalCase. Please rename (e.g., GdnBlock, GdnBlockUnfused) and update references.

🔧 Example rename (apply similarly in references)
-class GDN_Block(nn.Module):
+class GdnBlock(nn.Module):
     ...

-class GDN_Block_Unfused(nn.Module):
+class GdnBlockUnfused(nn.Module):
     ...
-    elif model_cls in (GDN_Block, GDN_Block_Unfused):
+    elif model_cls in (GdnBlock, GdnBlockUnfused):

As per coding guidelines, "Use PascalCase for class names (e.g., class SomeClass)."

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@tests/unittest/_torch/auto_deploy/unit/multigpu/transformations/library/test_tp_sharding.py`
around lines 229 - 327, Class names GDN_Block and GDN_Block_Unfused violate
PascalCase; rename them to PascalCase (e.g., GdnBlock and GdnBlockUnfused) and
update all local references. Change the class definitions for GDN_Block ->
GdnBlock and GDN_Block_Unfused -> GdnBlockUnfused, update any instantiations,
type hints, subclassing, and test references (search for the symbols GDN_Block
and GDN_Block_Unfused) to use the new names, and run tests to ensure no
remaining references or import/name errors.
tensorrt_llm/_torch/auto_deploy/custom_ops/quantization/torch_quant.py-18-20 (1)

18-20: ⚠️ Potential issue | 🟡 Minor

Use module-level imports to preserve namespace.

The new lines import triton and import triton.language as tl violate the namespace rule. Consider importing the needed submodules/functions explicitly and prefixing usages to keep a clear namespace.

🔧 Suggested refactor
-import triton
-import triton.language as tl
+from triton import cdiv as triton_cdiv, jit as triton_jit, language as tl
-@triton.jit
+@triton_jit
 def _act_quant_kernel(...):
     ...
-    grid = lambda meta: (triton.cdiv(x.numel(), meta["BLOCK_SIZE"]),)  # noqa: E731
+    grid = lambda meta: (triton_cdiv(x.numel(), meta["BLOCK_SIZE"]),)  # noqa: E731

As per coding guidelines, "Always maintain the namespace when importing. Use from package.subpackage import foo instead of from package.subpackage.foo import SomeClass or import package."

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tensorrt_llm/_torch/auto_deploy/custom_ops/quantization/torch_quant.py`
around lines 18 - 20, The file currently does a bare import of triton.language
as tl which breaks the namespace rule; replace the alias import with a
module-level import (keep only import triton) and update all usages of tl to
fully-qualified references (replace tl.* with triton.language.*) so the triton
namespace is preserved throughout (search for references to tl in this file and
update them accordingly).
tests/unittest/_torch/auto_deploy/unit/multigpu/transformations/library/test_tp_sharding.py-13-13 (1)

13-13: ⚠️ Potential issue | 🟡 Minor

Keep module namespace for _model_test_utils imports.

The new direct class import violates the namespace rule. Prefer importing the module and qualifying usages.

🔧 Suggested refactor
-from _model_test_utils import FakeFineGrainedFP8Linear, FakeFP8Linear
+import _model_test_utils as model_test_utils
-        self.linear1 = FakeFP8Linear(in_features, 4 * in_features, bias=bias)
-        self.linear2 = FakeFP8Linear(4 * in_features, out_features, bias=bias)
+        self.linear1 = model_test_utils.FakeFP8Linear(in_features, 4 * in_features, bias=bias)
+        self.linear2 = model_test_utils.FakeFP8Linear(4 * in_features, out_features, bias=bias)
-        self.linear1 = FakeFineGrainedFP8Linear(in_features, hidden_features, bias=bias)
-        self.linear2 = FakeFineGrainedFP8Linear(hidden_features, out_features, bias=bias)
+        self.linear1 = model_test_utils.FakeFineGrainedFP8Linear(in_features, hidden_features, bias=bias)
+        self.linear2 = model_test_utils.FakeFineGrainedFP8Linear(hidden_features, out_features, bias=bias)

As per coding guidelines, "Always maintain the namespace when importing. Use from package.subpackage import foo instead of from package.subpackage.foo import SomeClass or import package."

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@tests/unittest/_torch/auto_deploy/unit/multigpu/transformations/library/test_tp_sharding.py`
at line 13, The test imports concrete classes directly from _model_test_utils
(FakeFineGrainedFP8Linear, FakeFP8Linear) which breaks the namespace guideline;
change the import to import _model_test_utils as _model_test_utils (or import
_model_test_utils) and update all references in this test (e.g., usages of
FakeFineGrainedFP8Linear and FakeFP8Linear) to use the qualified names
_model_test_utils.FakeFineGrainedFP8Linear and _model_test_utils.FakeFP8Linear
so the module namespace is preserved.
tensorrt_llm/_torch/auto_deploy/custom_ops/quantization/torch_quant.py-474-529 (1)

474-529: ⚠️ Potential issue | 🟡 Minor

Silence ARG001 warnings in FineGrained FP8 ops/fakes.

Ruff reports unused args (input_scale, input_zp, weight_zp, plus bias in the fake). Add a discard line or # noqa: ARG001 to keep lint clean.

🧹 Example fix (apply similarly to fakes)
 def torch_fake_quant_finegrained_fp8_linear(
     input: torch.Tensor,  # [..., K]
     weight_quantized: torch.Tensor,  # [N, K] float8_e4m3fn
     bias: Optional[torch.Tensor],  # [N] or None
     input_scale: List[torch.Tensor],  # unused for FineGrained FP8 (input quantized on the fly)
     weight_scale: List[torch.Tensor],  # [weight_scale_inv]
     input_zp: List[torch.Tensor],  # unused
     weight_zp: List[torch.Tensor],  # unused
 ) -> torch.Tensor:
+    _ = input_scale, input_zp, weight_zp
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tensorrt_llm/_torch/auto_deploy/custom_ops/quantization/torch_quant.py`
around lines 474 - 529, Silence the ARG001 lint by marking the unused parameters
as intentionally unused in both the real and fake FineGrained FP8 ops: in
torch_fake_quant_finegrained_fp8_linear add a discard line referencing
input_scale, input_zp, and weight_zp (e.g. assign them to _ or tuple-unpack to a
throwaway variable) and in _torch_fake_quant_finegrained_fp8_linear_fake also
discard input_scale, input_zp, weight_zp and bias; this keeps the signatures
intact for tracing but prevents Ruff from flagging unused-argument warnings
(alternatively add "# noqa: ARG001" to those parameter names if you prefer a
comment-based suppression).
tensorrt_llm/_torch/auto_deploy/custom_ops/fused_moe/trtllm_moe.py-25-27 (1)

25-27: ⚠️ Potential issue | 🟡 Minor

Keep module namespace in the new imports.

The new imports pull symbols directly; guidelines require keeping module namespaces. Consider importing the modules and qualifying usage.

🔧 Suggested refactor
-from tensorrt_llm._torch.modules.fused_moe.routing import RoutingMethodType
+from tensorrt_llm._torch.modules.fused_moe import routing
 ...
-from tensorrt_llm._utils import is_sm_100f
+from tensorrt_llm import _utils as trtllm_utils
-    if is_sm_100f():
+    if trtllm_utils.is_sm_100f():
         ...
-            RoutingMethodType.Renormalize,
+            routing.RoutingMethodType.Renormalize,

As per coding guidelines, "Always maintain the namespace when importing. Use from package.subpackage import foo instead of from package.subpackage.foo import SomeClass or import package."

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tensorrt_llm/_torch/auto_deploy/custom_ops/fused_moe/trtllm_moe.py` around
lines 25 - 27, The three direct imports (RoutingMethodType, ActivationType,
is_sm_100f) violate the "keep module namespace" rule; change them to import
their modules (e.g., import tensorrt_llm._torch.modules.fused_moe.routing as
routing, import tensorrt_llm._torch.utils as torch_utils, import
tensorrt_llm._utils as llm_utils) and update all references in this file from
RoutingMethodType, ActivationType, and is_sm_100f to routing.RoutingMethodType,
torch_utils.ActivationType, and llm_utils.is_sm_100f respectively so the module
namespace is preserved across usages.
tensorrt_llm/_torch/auto_deploy/models/patches/qwen3_next.py-1-12 (1)

1-12: ⚠️ Potential issue | 🟡 Minor

Missing NVIDIA copyright header.

This new file is missing the required NVIDIA copyright header with Apache License 2.0. As per coding guidelines: "All source files must contain an NVIDIA copyright header with the year of latest meaningful modification."

Add copyright header
+# SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# SPDX-License-Identifier: Apache-2.0
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
 """Patches for Qwen3Next to make it compatible with torch.export and reduce export time.

As per coding guidelines: "All source files must contain an NVIDIA copyright header with the year of latest meaningful modification. Use the Apache License 2.0 format."

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tensorrt_llm/_torch/auto_deploy/models/patches/qwen3_next.py` around lines 1
- 12, Add the required NVIDIA Apache-2.0 copyright header at the top of this
module (qwen3_next patch file) with the year of latest meaningful modification;
place the standard NVIDIA Apache License 2.0 boilerplate before the existing
module docstring so the file begins with the license header, and ensure it
mentions ownership by NVIDIA and the correct year. Keep the rest of the file
intact (this patch touches definitions like Qwen3NextSparseMoeBlock and
Qwen3NextGatedDeltaNet), only prepending the canonical NVIDIA Apache-2.0 header
text.
tests/unittest/_torch/auto_deploy/unit/singlegpu/custom_ops/fla/test_torch_cached_gated_delta_rule.py-16-16 (1)

16-16: ⚠️ Potential issue | 🟡 Minor

Stale noqa: F401 directive

Ruff reports RUF100: the # noqa: F401 directive at line 16 references a rule (F401) that is not enabled in the project's Ruff configuration, making the suppression a no-op. The same pattern appears in test_quant.py line 6, so this may be consistent with a codebase-wide convention. Consider switching to a plain comment (# side-effect import – registers custom ops) or enabling F401 in the Ruff config to make the suppression meaningful.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@tests/unittest/_torch/auto_deploy/unit/singlegpu/custom_ops/fla/test_torch_cached_gated_delta_rule.py`
at line 16, The import of tensorrt_llm._torch.auto_deploy.custom_ops currently
uses a stale suppression "# noqa: F401"; update the import line in
test_torch_cached_gated_delta_rule (and the matching line in test_quant.py) to
remove the unused/no-op noqa and instead add an explanatory side-effect comment
such as "# side-effect import — registers custom ops" (or alternatively enable
F401 in the Ruff config if you prefer keeping a suppression); locate the
statement importing tensorrt_llm._torch.auto_deploy.custom_ops and replace the
directive accordingly.
tests/unittest/_torch/auto_deploy/unit/singlegpu/transformations/library/test_quantization.py-116-118 (1)

116-118: ⚠️ Potential issue | 🟡 Minor

bias parameter is declared and parameterized but never used — both runs are identical

Ruff reports ARG001: unused function argument bias. The test creates MLP(128, 256, 128) which hard-codes bias=False in both its nn.Linear layers (per the MLP definition in _model_test_utils.py), so neither the True nor False variant actually exercises bias logic.

Either wire the parameter into the model construction, or remove the parametrize if bias coverage is not intended here:

🔧 Option: use an inline model that respects `bias`
-    model = MLP(128, 256, 128).to(torch.float16).to("cuda")
+    import torch.nn as nn
+    model = nn.Sequential(
+        nn.Linear(128, 256, bias=bias),
+        nn.ReLU(),
+        nn.Linear(256, 128, bias=bias),
+    ).to(torch.float16).to("cuda")
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@tests/unittest/_torch/auto_deploy/unit/singlegpu/transformations/library/test_quantization.py`
around lines 116 - 118, The test test_finegrained_fp8_quantization declares a
parameter bias but never uses it (ARG001) because MLP in _model_test_utils.py
currently hard-codes bias=False; either remove the
`@pytest.mark.parametrize`("bias", [True, False]) line or wire the parameter into
the model construction by passing bias into MLP (e.g., construct MLP(128, 256,
128, bias=bias) or equivalent call site), ensuring the test actually exercises
both bias=True and bias=False cases.
tests/unittest/_torch/auto_deploy/unit/singlegpu/transformations/library/test_gated_delta_rule_cache.py-28-30 (1)

28-30: ⚠️ Potential issue | 🟡 Minor

Use namespace-style import and drop the unused # noqa.
The # noqa: F401 is unused, and the namespace import rule prefers importing the submodule from its package.

Suggested fix
- import tensorrt_llm._torch.auto_deploy.custom_ops  # noqa: F401
+ from tensorrt_llm._torch.auto_deploy import custom_ops  # side-effect registration

As per coding guidelines: Always maintain the namespace when importing. Use from package.subpackage import foo instead of from package.subpackage.foo import SomeClass or import package.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@tests/unittest/_torch/auto_deploy/unit/singlegpu/transformations/library/test_gated_delta_rule_cache.py`
around lines 28 - 30, The import line uses a bare module import with an
unnecessary "# noqa: F401" and should use a namespace-style import; replace
"import tensorrt_llm._torch.auto_deploy.custom_ops  # noqa: F401" with a
namespace import "from tensorrt_llm._torch.auto_deploy import custom_ops" and
remove the "# noqa" comment so the submodule is imported with its package
namespace preserved (referencing the module symbol
tensorrt_llm._torch.auto_deploy.custom_ops).
tests/unittest/_torch/auto_deploy/unit/singlegpu/transformations/library/test_moe_fusion.py-1244-1253 (1)

1244-1253: ⚠️ Potential issue | 🟡 Minor

Remove or use the unused dtype parameter.
dtype is accepted but not used. Either drop it or apply it when creating weights to avoid lint warnings and confusion.

Suggested fix (use dtype)
-        weight_fp32 = torch.randn(out_features, in_features, device=device) * 0.01
+        weight_fp32 = torch.randn(out_features, in_features, device=device, dtype=dtype) * 0.01
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@tests/unittest/_torch/auto_deploy/unit/singlegpu/transformations/library/test_moe_fusion.py`
around lines 1244 - 1253, The __init__ signature accepts a dtype parameter that
is never used; either remove dtype from the signature or apply it when
constructing tensors/weights in this class (e.g., pass dtype to
torch.empty/torch.randn or .to(dtype)) so the parameter is meaningful; update
the constructor in the class (the __init__ method that takes ffn_dim,
hidden_dim, dtype=torch.bfloat16, device="cuda", block_size=None) and any weight
initializations to use the dtype, or delete the dtype parameter and all
references to it to avoid the unused-parameter lint warning.
tests/unittest/_torch/auto_deploy/unit/singlegpu/transformations/library/test_torch_gated_delta_rule_cache.py-20-22 (1)

20-22: ⚠️ Potential issue | 🟡 Minor

Use namespace-style import and drop the unused # noqa.
The # noqa: F401 is unused, and the namespace import rule prefers importing the submodule from its package.

Suggested fix
- import tensorrt_llm._torch.auto_deploy.custom_ops  # noqa: F401
+ from tensorrt_llm._torch.auto_deploy import custom_ops  # side-effect registration

As per coding guidelines: Always maintain the namespace when importing. Use from package.subpackage import foo instead of from package.subpackage.foo import SomeClass or import package.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@tests/unittest/_torch/auto_deploy/unit/singlegpu/transformations/library/test_torch_gated_delta_rule_cache.py`
around lines 20 - 22, Replace the bare module import line that uses "import
tensorrt_llm._torch.auto_deploy.custom_ops  # noqa: F401" with a namespace-style
import "from tensorrt_llm._torch.auto_deploy import custom_ops" and remove the
unused "# noqa: F401" comment; ensure the imported symbol is still referenced as
custom_ops so the module is registered/loaded as before.
tests/unittest/_torch/auto_deploy/unit/singlegpu/custom_ops/fla/test_fla_cached_gated_delta_rule.py-17-20 (1)

17-20: ⚠️ Potential issue | 🟡 Minor

Use namespace-style import and drop the unused # noqa.
The # noqa: F401 is flagged as unused, and the import style violates the namespace rule.

Suggested fix
- import tensorrt_llm._torch.auto_deploy.custom_ops  # noqa: F401
+ from tensorrt_llm._torch.auto_deploy import custom_ops  # side-effect registration

As per coding guidelines: Always maintain the namespace when importing. Use from package.subpackage import foo instead of from package.subpackage.foo import SomeClass or import package.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@tests/unittest/_torch/auto_deploy/unit/singlegpu/custom_ops/fla/test_fla_cached_gated_delta_rule.py`
around lines 17 - 20, Remove the trailing "# noqa: F401" and switch to
namespace-style imports: replace "import
tensorrt_llm._torch.auto_deploy.custom_ops  # noqa: F401" with the same import
but without the noqa, and change the two specific imports to import from the fla
namespace (use "from tensorrt_llm._torch.modules.fla import
chunk_gated_delta_rule" and "from tensorrt_llm._torch.modules.fla import
fused_recurrent_gated_delta_rule_fwd") so the module namespace is preserved and
no unused noqa annotation remains.
tensorrt_llm/_torch/auto_deploy/custom_ops/fla/fla_gated_delta.py-171-179 (1)

171-179: ⚠️ Potential issue | 🟡 Minor

Silence unused-argument warnings in torch_gated_delta_rule_fake.
Ruff flags the unused params; a small del block (or _ prefixes) avoids lint noise while keeping the signature intact.

Suggested fix
 def torch_gated_delta_rule_fake(
     q: torch.Tensor,
     k: torch.Tensor,
     v: torch.Tensor,
     g: torch.Tensor,
     beta: torch.Tensor,
     scale: Optional[float] = None,
 ) -> torch.Tensor:
+    del q, k, g, beta, scale
     return torch.empty_like(v)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tensorrt_llm/_torch/auto_deploy/custom_ops/fla/fla_gated_delta.py` around
lines 171 - 179, The function torch_gated_delta_rule_fake currently declares
parameters (q, k, g, beta, scale) that go unused and trigger lint warnings;
silence them by explicitly deleting or referencing them at the top of
torch_gated_delta_rule_fake (e.g., add a small del q, k, g, beta, scale or
assign them to _ to indicate intentional unused status; note that v must stay,
since the return value torch.empty_like(v) uses it) while keeping the public
signature intact and not changing behavior.
tests/unittest/_torch/auto_deploy/unit/singlegpu/transformations/library/test_torch_gated_delta_rule_cache.py-36-54 (1)

36-54: ⚠️ Potential issue | 🟡 Minor

Initialize ModelFactory state in DummyFactory.__init__.
DummyFactory bypasses ModelFactory.__init__, leaving base attributes (e.g., _prefetched_model_path, model_kwargs, skip_loading_weights) undefined if accessed by the optimizer pipeline. Either call super().__init__ or explicitly initialize those members.

#!/bin/bash
# Check how ModelFactory fields are accessed in the auto_deploy pipeline.
rg -n "factory\\.|ModelFactory" tensorrt_llm/_torch/auto_deploy -g '*.py'
rg -n "_prefetched_model_path|_prefetched_tokenizer_path|model_kwargs|tokenizer_kwargs|skip_loading_weights|max_seq_len" tensorrt_llm/_torch/auto_deploy -g '*.py'

As per coding guidelines: Initialize all externally visible members of a class in the constructor.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@tests/unittest/_torch/auto_deploy/unit/singlegpu/transformations/library/test_torch_gated_delta_rule_cache.py`
around lines 36 - 54, DummyFactory currently bypasses ModelFactory.__init__
causing missing base attributes; update DummyFactory.__init__ (or at top of it)
to call super().__init__() or explicitly initialize the base members used by the
auto-deploy pipeline (e.g., _prefetched_model_path, _prefetched_tokenizer_path,
model_kwargs, tokenizer_kwargs, skip_loading_weights, max_seq_len) so accesses
from methods like build_model, get_cache_config_updates, _build_model, or
_load_checkpoint won't hit undefined attributes.
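A minimal sketch of the super().__init__ pattern the finding suggests (BaseFactory and its attribute names are illustrative stand-ins; the real ModelFactory constructor and its fields differ):

```python
class BaseFactory:
    """Stand-in for ModelFactory: initializes externally visible state."""

    def __init__(self, model_kwargs=None, skip_loading_weights=False):
        self.model_kwargs = model_kwargs or {}
        self.skip_loading_weights = skip_loading_weights
        self._prefetched_model_path = None


class DummyFactory(BaseFactory):
    def __init__(self):
        # Chaining to the base constructor guarantees the attributes the
        # optimizer pipeline may touch exist on every instance.
        super().__init__(skip_loading_weights=True)
```

With the chained constructor, attribute accesses on the dummy no longer raise AttributeError even when the base class adds new state later.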

bmarimuthu-nv added a commit to nv-auto-deploy/TensorRT-LLM that referenced this pull request Mar 5, 2026
 + review comments

Signed-off-by: Balamurugan Marimuthu <246387390+bmarimuthu-nv@users.noreply.github.com>
bmarimuthu-nv added a commit to nv-auto-deploy/TensorRT-LLM that referenced this pull request Mar 5, 2026
 + review comments

Signed-off-by: Balamurugan Marimuthu <246387390+bmarimuthu-nv@users.noreply.github.com>
@suyoggupta suyoggupta force-pushed the sg/qwen3.5-fp8 branch 2 times, most recently from 13b183b to a1e28f2 on March 6, 2026 07:41
@suyoggupta suyoggupta changed the title [None][feat] Add support for block scale fp8 quantization. Enable Qwen3.5 fp8 [None][feat] Qwen3.5 perf optimizations Mar 6, 2026
@suyoggupta
Collaborator Author

/bot run --disable-fail-fast

@tensorrt-cicd
Collaborator

PR_Github #38091 [ run ] triggered by Bot. Commit: 6e95538 Link to invocation

@tensorrt-cicd
Collaborator

PR_Github #38091 [ run ] completed with state SUCCESS. Commit: 6e95538
/LLM/main/L0_MergeRequest_PR pipeline #29515 completed with status: 'FAILURE'

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

Link to invocation

@suyoggupta
Collaborator Author

/bot run

@suyoggupta suyoggupta enabled auto-merge (squash) March 12, 2026 13:32
@suyoggupta
Collaborator Author

/bot run --reuse-test

@tensorrt-cicd
Collaborator

PR_Github #38741 [ run ] triggered by Bot. Commit: 35cdcd4 Link to invocation

@tensorrt-cicd
Collaborator

PR_Github #38741 [ run ] completed with state SUCCESS. Commit: 35cdcd4
/LLM/main/L0_MergeRequest_PR pipeline #30060 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

Link to invocation

@suyoggupta
Collaborator Author

/bot run --reuse-test

@suyoggupta
Collaborator Author

/bot run --extra-stage "DGX_B200-4_GPUs-AutoDeploy-1, DGX_H100-4_GPUs-AutoDeploy-1" --disable-fail-fast

@tensorrt-cicd
Collaborator

PR_Github #38768 [ run ] triggered by Bot. Commit: 35cdcd4 Link to invocation

@tensorrt-cicd
Collaborator

PR_Github #38769 [ run ] triggered by Bot. Commit: 35cdcd4 Link to invocation

@tensorrt-cicd
Collaborator

PR_Github #38768 [ run ] completed with state ABORTED. Commit: 35cdcd4

Link to invocation

@nvchenghaoz
Collaborator

This PR is too large to review; please consider splitting future changes into smaller PRs.

@tensorrt-cicd
Collaborator

PR_Github #38769 [ run ] completed with state SUCCESS. Commit: 35cdcd4
/LLM/main/L0_MergeRequest_PR pipeline #30084 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

Link to invocation

@suyoggupta
Collaborator Author

/bot run --extra-stage "DGX_B200-4_GPUs-AutoDeploy-1, DGX_H100-4_GPUs-AutoDeploy-1" --disable-fail-fast

@tensorrt-cicd
Collaborator

PR_Github #38799 [ run ] triggered by Bot. Commit: 35cdcd4 Link to invocation

Signed-off-by: Suyog Gupta <41447211+suyoggupta@users.noreply.github.com>
@suyoggupta
Collaborator Author

/bot run --extra-stage "DGX_B200-4_GPUs-AutoDeploy-1, DGX_H100-4_GPUs-AutoDeploy-1" --disable-fail-fast

@tensorrt-cicd
Collaborator

PR_Github #38799 [ run ] completed with state SUCCESS. Commit: 35cdcd4
/LLM/main/L0_MergeRequest_PR pipeline #30113 completed with status: 'SUCCESS'

CI Report

Link to invocation

@suyoggupta
Collaborator Author

/bot run --reuse-test

@tensorrt-cicd
Collaborator

PR_Github #38830 [ run ] triggered by Bot. Commit: 660113c Link to invocation

@tensorrt-cicd
Collaborator

PR_Github #38830 [ run ] completed with state SUCCESS. Commit: 660113c
/LLM/main/L0_MergeRequest_PR pipeline #30141 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

Link to invocation

@suyoggupta
Collaborator Author

/bot run

@tensorrt-cicd
Collaborator

PR_Github #38854 [ run ] triggered by Bot. Commit: 660113c Link to invocation

@tensorrt-cicd
Collaborator

PR_Github #38854 [ run ] completed with state SUCCESS. Commit: 660113c
/LLM/main/L0_MergeRequest_PR pipeline #30164 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

Link to invocation

@suyoggupta
Collaborator Author

/bot skip --comment "Only AD changes, and all AD tests passed recent CI runs"

@tensorrt-cicd
Collaborator

PR_Github #38876 [ skip ] triggered by Bot. Commit: 660113c Link to invocation

@tensorrt-cicd
Collaborator

PR_Github #38876 [ skip ] completed with state SUCCESS. Commit: 660113c
Skipping testing for commit 660113c

Link to invocation

@suyoggupta suyoggupta merged commit 390a7fd into NVIDIA:main Mar 13, 2026
5 checks passed

Development

Successfully merging this pull request may close these issues.

[Feature]: AutoDeploy: Improve AllReduce perf for Qwen3.5 model
