
[None][feat] Add mix-precision checkpoint support in AutoDeploy #12175

Merged
suyoggupta merged 1 commit into NVIDIA:main from nv-auto-deploy:user/fridah/super-aq on Mar 13, 2026
Conversation

@Fridah-nv (Collaborator) commented Mar 13, 2026

Summary by CodeRabbit

Release Notes

  • New Features

    • Added support for mixed-precision quantization configurations with per-layer quantization algorithm selection and validation
    • Enhanced quantization validation with expanded logging for per-layer algorithm distribution and layer counts
  • Chores

    • Updated backend configuration in auto-deployment settings

Description

Enables serving the mixed-precision checkpoint https://huggingface.co/nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-NVFP4

Test Coverage

Accuracy test using tests/integration/defs/accuracy/test_llm_api_autodeploy.py::TestNemotronSuperV3::test_accuracy[nvfp4-1-attn_dp_off-trtllm] with the nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-NVFP4 checkpoint:

| Benchmark | Reference accuracy | Threshold | Evaluated accuracy |
|-----------|--------------------|-----------|--------------------|
| MMLU      | 86.120             | 84.303    | 85.453             |
| GSM8K     | 82.121             | 78.918    | 92.456             |

PR Checklist

Please review the following before submitting your PR:

  • PR description clearly explains what and why. If using CodeRabbit's summary, please make sure it makes sense.

  • PR Follows TRT-LLM CODING GUIDELINES to the best of your knowledge.

  • Test cases are provided for new code paths (see test instructions)

  • Any new dependencies have been scanned for license and vulnerabilities

  • CODEOWNERS updated if ownership changes

  • Documentation updated as needed

  • Update tava architecture diagram if there is a significant design change in PR.

  • The reviewers assigned automatically/manually are appropriate for the PR.

  • Please check this after reviewing the above items as appropriate for this PR.

GitHub Bot Help

To see a list of available CI bot commands, please comment /bot help.

Signed-off-by: Fridah-nv <201670829+Fridah-nv@users.noreply.github.com>
@Fridah-nv Fridah-nv self-assigned this Mar 13, 2026
@Fridah-nv Fridah-nv requested a review from a team as a code owner March 13, 2026 00:36
@Fridah-nv Fridah-nv requested a review from bmarimuthu-nv March 13, 2026 00:36
coderabbitai bot (Contributor) commented Mar 13, 2026

📝 Walkthrough

Walkthrough

This pull request adds support for mixed-precision quantization configuration in the auto-deploy pipeline. Changes include branching logic in the config reader to distinguish between mixed-precision and single-algorithm paths, new helper utilities for mixed-precision validation, and updated transform logic to conditionally skip or process quantization based on configuration type.

Changes

Cohort / File(s) Summary
Configuration
examples/auto_deploy/super_v3.yaml
Changed backend for fuse_nvfp4_moe transform from trtllm_gen to trtllm.
Quantization Config Reader
tensorrt_llm/_torch/auto_deploy/models/quant_config_reader.py
Added branching logic to read_config to dispatch between _read_mixed_precision_config and _read_single_algo_config based on quant_algo type. Introduced new helper methods for KV cache handling, mixed-precision validation, and config normalization.
Quantization Transforms
tensorrt_llm/_torch/auto_deploy/transform/library/quantization.py, tensorrt_llm/_torch/auto_deploy/transform/library/quantize_moe.py
Added mixed-precision detection and skipping logic across quantization transforms. Updated control flow to validate algo support, extract quantized_layers, and apply per-node filtering for mixed-precision scenarios. Applied same gating logic consistently to FP8 and NVFP4 MOE quantization paths.
Quantization Utilities
tensorrt_llm/_torch/auto_deploy/utils/quantization_utils.py
Added helper functions for mixed-precision configuration detection (is_mixed_precision_config, mixed_precision_has_algo) and per-node skipping logic (should_skip_mixed_precision_quantization, _extract_modname). Removed debug print statement.
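The helper names listed for quantization_utils.py suggest a small predicate-based design. A minimal sketch of how these helpers might behave, assuming a ModelOpt-style config with a `quantized_layers` mapping; the actual implementations in the PR may differ:

```python
from typing import Any, Dict

# Sentinel algo value assumed for mixed-precision checkpoints.
MIXED_PRECISION = "MIXED_PRECISION"


def is_mixed_precision_config(qcfg: Dict[str, Any]) -> bool:
    """A config is mixed-precision when its top-level quant_algo is the sentinel."""
    return qcfg.get("quant_algo") == MIXED_PRECISION


def mixed_precision_has_algo(qcfg: Dict[str, Any], algo: str) -> bool:
    """True if any per-layer entry uses the given algorithm."""
    layers: Dict[str, Dict[str, Any]] = qcfg.get("quantized_layers", {})
    return any(entry.get("quant_algo") == algo for entry in layers.values())


def should_skip_mixed_precision_quantization(
    modname: str, algo: str, quantized_layers: Dict[str, Dict[str, Any]]
) -> bool:
    """Skip a module when its per-layer algo does not match the transform's algo."""
    entry = quantized_layers.get(modname)
    return entry is None or entry.get("quant_algo") != algo
```

With this shape, each quantization transform can keep its single-algo fast path and only consult `quantized_layers` when the config is mixed.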

Sequence Diagram(s)

sequenceDiagram
    actor Transform as Quantization Transform
    participant Reader as Config Reader
    participant Utils as Quantization Utils
    participant Graph as Model Graph

    Transform->>Reader: read_config(quant_config)
    
    alt MIXED_PRECISION Config
        Reader->>Reader: _read_mixed_precision_config()
        Reader->>Reader: validate quantized_layers
        Reader->>Reader: extract per-layer algo info
        Reader->>Utils: initialize exclude_modules
        Reader->>Reader: _handle_kv_cache()
        Reader->>Transform: return (config type: MIXED)
        
        Transform->>Utils: is_mixed_precision_config(qcfg)
        Utils-->>Transform: true
        
        Transform->>Graph: iterate linear nodes
        loop For each node
            Transform->>Utils: should_skip_mixed_precision_quantization(node, algo, layers)
            Utils-->>Transform: skip decision
            alt skip = false
                Transform->>Graph: apply quantization
            end
        end
    else Single Algorithm Config
        Reader->>Reader: _read_single_algo_config()
        Reader->>Reader: _handle_kv_cache()
        Reader->>Transform: return (config type: SINGLE)
        
        Transform->>Utils: is_mixed_precision_config(qcfg)
        Utils-->>Transform: false
        
        alt quant_algo matches transform
            Transform->>Graph: apply standard quantization
        else
            Transform-->>Transform: skip (algo mismatch)
        end
    end
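The dispatch at the top of the diagram can be sketched as follows. The method names come from the summary above; the bodies (validation, the per-layer algorithm tally that feeds the expanded logging) are assumptions:

```python
from collections import Counter


class QuantConfigReader:
    """Hypothetical sketch of the read_config dispatch in quant_config_reader.py."""

    def read_config(self, quant_config: dict) -> dict:
        # Mixed-precision checkpoints use a sentinel quant_algo and carry a
        # per-layer mapping; everything else takes the single-algo path.
        if quant_config.get("quant_algo") == "MIXED_PRECISION":
            return self._read_mixed_precision_config(dict(quant_config))
        return self._read_single_algo_config(dict(quant_config))

    def _read_mixed_precision_config(self, cfg: dict) -> dict:
        layers = cfg.get("quantized_layers") or {}
        if not layers:
            raise ValueError("MIXED_PRECISION config requires 'quantized_layers'")
        # Summary of the per-layer algorithm distribution, e.g. for logging.
        cfg["algo_counts"] = dict(Counter(v["quant_algo"] for v in layers.values()))
        return cfg

    def _read_single_algo_config(self, cfg: dict) -> dict:
        return cfg
```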

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

🚥 Pre-merge checks: ✅ 1 passed | ❌ 2 warnings

❌ Failed checks (2 warnings)

  • Docstring Coverage ⚠️ Warning — Docstring coverage is 58.82%, below the required threshold of 80.00%. Resolution: write docstrings for the functions missing them.
  • Description check ⚠️ Warning — The PR description is incomplete: it mentions enabling a specific model checkpoint but does not explain the implementation changes, the architecture, or why mixed-precision support was needed. Resolution: expand the Description section to cover the technical approach, what mixed-precision support entails, which files were modified and why, and how the code changes relate to the deployment objective.

✅ Passed checks (1 passed)

  • Title check ✅ Passed — The title clearly describes the main change: adding mixed-precision checkpoint support in AutoDeploy, matching the new mixed-precision quantization config parsing and related helpers.

Comment @coderabbitai help to get the list of available commands and usage tips.

coderabbitai bot (Contributor) left a comment

Actionable comments posted: 3

🧹 Nitpick comments (2)
tensorrt_llm/_torch/auto_deploy/transform/library/quantize_moe.py (1)

195-230: Factor shared mixed-precision gating into a common helper.

Both _apply implementations repeat the same qcfg/mixed-precision gating and skip checks. Extracting the common block will reduce drift risk.

Also applies to: 261-296

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tensorrt_llm/_torch/auto_deploy/transform/library/quantize_moe.py` around
lines 195 - 230, Extract the repeated qcfg/mixed-precision gating and per-node
skip logic into a shared helper (e.g., a function like
_get_moe_quant_params(factory, algo_name) that returns (qcfg, is_mixed,
quantized_layers, excluded_patterns) or a boolean helper like
_should_skip_moe_node(node, qcfg, is_mixed, quantized_layers,
excluded_patterns)). Move the logic around factory.get_quant_config(),
is_mixed_precision_config(), mixed_precision_has_algo(), quant_algo check, and
building quantized_layers/excluded_patterns into that helper and have both
_apply implementations call it; preserve the existing early-return behavior
(return gm, TransformInfo(skipped=True,...)) when qcfg is absent or algo doesn’t
match, and keep the per-node checks using _extract_moe_weight_param_lists(),
is_op(... torch_moe), should_skip_quantization(), and
should_skip_mixed_precision_quantization() unchanged but invoked via the new
helper to avoid duplication.
tensorrt_llm/_torch/auto_deploy/transform/library/quantization.py (1)

27-38: Maintain namespace in quantization_utils import.

Import the quantization_utils module rather than individual symbols, then reference them through the module (e.g., quantization_utils.fp4_global_scale(...) instead of fp4_global_scale(...)). This follows the repository's import policy: "When importing in Python, always maintain the namespace. Import the module, not individual classes or functions."

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tensorrt_llm/_torch/auto_deploy/transform/library/quantization.py` around
lines 27 - 38, The import should maintain the namespace: replace the star-list
import of functions from ...utils.quantization_utils with a module import
(import quantization_utils from ...utils) and update every call site in this
file that currently calls fp4_global_scale, fp8_scale,
get_quantization_from_linear_node, is_mixed_precision_config,
is_quantized_graph, is_quantized_op, mixed_precision_has_algo,
remove_output_quantizers, should_skip_mixed_precision_quantization, and
should_skip_quantization to use the module prefix (e.g.,
quantization_utils.fp4_global_scale(...),
quantization_utils.get_quantization_from_linear_node(...), etc.) so all
references consistently use the quantization_utils namespace.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@tensorrt_llm/_torch/auto_deploy/models/quant_config_reader.py`:
- Around line 155-159: The code checks kv_algo (from
quant_config["kv_cache_quant_algo"]) case-sensitively and will reject values
like "fp8"; normalize kv_algo before validation by lowercasing or uppercasing it
(e.g., set kv_algo_normalized = kv_algo.lower()) and compare against the
expected token (e.g., "fp8"); then set quant_config["kv_cache_dtype"] to the
normalized dtype string ("fp8") when valid. Update the block that reads kv_algo
and assigns kv_cache_dtype accordingly (referencing kv_algo, quant_config,
kv_cache_quant_algo, kv_cache_dtype).
- Line 143: The current assignment quant_config["exclude_modules"] =
list(self._ALWAYS_EXCLUDE) in quant_config_reader overwrites any caller-provided
exclude_modules; change it to merge the user-provided
quant_config.get("exclude_modules", []) with self._ALWAYS_EXCLUDE (e.g., union
or ordered unique concatenation) so explicit user exclusions are preserved.
Update the logic around quant_config and the key "exclude_modules" in the
class/method where self._ALWAYS_EXCLUDE is referenced so the final list contains
both sets without duplicates.

In `@tensorrt_llm/_torch/auto_deploy/transform/library/quantize_moe.py`:
- Around line 28-33: Replace the individual function imports from
quantization_utils with a module-namespace import and update all call sites to
use that namespace: import the module (e.g., from ...utils import
quantization_utils) and change usages like is_mixed_precision_config(...),
mixed_precision_has_algo(...), should_skip_mixed_precision_quantization(...),
and should_skip_quantization(...) to
quantization_utils.is_mixed_precision_config(...),
quantization_utils.mixed_precision_has_algo(...),
quantization_utils.should_skip_mixed_precision_quantization(...), and
quantization_utils.should_skip_quantization(...) respectively so the file
follows the repo policy of importing the module rather than individual symbols.
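The two quant_config_reader.py inline comments above amount to small, self-contained fixes. Hedged sketches of both, assuming a plain-dict config; the real code in the PR may differ:

```python
from typing import Iterable, List


def normalize_kv_cache_algo(quant_config: dict) -> dict:
    """Validate kv_cache_quant_algo case-insensitively (first comment above)."""
    kv_algo = quant_config.get("kv_cache_quant_algo")
    if kv_algo is not None:
        # Lowercase before comparing so "FP8", "fp8", "Fp8" are all accepted.
        if kv_algo.lower() != "fp8":
            raise ValueError(f"Unsupported kv_cache_quant_algo: {kv_algo}")
        quant_config["kv_cache_dtype"] = "fp8"
    return quant_config


def merge_exclude_modules(
    user_excludes: Iterable[str], always_exclude: Iterable[str]
) -> List[str]:
    """Ordered-unique union so caller-provided exclusions survive the
    always-exclude defaults (second comment above)."""
    return list(dict.fromkeys([*user_excludes, *always_exclude]))
```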


ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 88c9e45c-5be9-43bd-8bb4-30f5a3204e46

📥 Commits

Reviewing files that changed from the base of the PR and between 2eee701 and f56a258.

📒 Files selected for processing (5)
  • examples/auto_deploy/super_v3.yaml
  • tensorrt_llm/_torch/auto_deploy/models/quant_config_reader.py
  • tensorrt_llm/_torch/auto_deploy/transform/library/quantization.py
  • tensorrt_llm/_torch/auto_deploy/transform/library/quantize_moe.py
  • tensorrt_llm/_torch/auto_deploy/utils/quantization_utils.py

@lucaslie (Member) left a comment

Did you run Super with the mixed-precision checkpoint, and does the output look coherent? Otherwise, looks good to me and happy to approve.

@Fridah-nv (Collaborator, Author)

> Did you run Super with the mixed-precision checkpoint, and does the output look coherent? Otherwise, looks good to me and happy to approve.

Yes, the output looks fine. I also updated the accuracy test results in the PR description.

@Fridah-nv (Collaborator, Author)

/bot run --extra-stage "DGX_B200-4_GPUs-AutoDeploy-1, DGX_H100-4_GPUs-AutoDeploy-1"

@tensorrt-cicd (Collaborator)

PR_Github #38824 [ run ] triggered by Bot. Commit: f56a258 Link to invocation

@tensorrt-cicd (Collaborator)

PR_Github #38824 [ run ] completed with state SUCCESS. Commit: f56a258
/LLM/main/L0_MergeRequest_PR pipeline #30136 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

Link to invocation

@Fridah-nv (Collaborator, Author)

/bot run --extra-stage "DGX_B200-4_GPUs-AutoDeploy-1, DGX_H100-4_GPUs-AutoDeploy-1"

@tensorrt-cicd (Collaborator)

PR_Github #38902 [ run ] triggered by Bot. Commit: f56a258 Link to invocation

@tensorrt-cicd (Collaborator)

PR_Github #38902 [ run ] completed with state SUCCESS. Commit: f56a258
/LLM/main/L0_MergeRequest_PR pipeline #30210 completed with status: 'SUCCESS'

CI Report

Link to invocation

@suyoggupta suyoggupta merged commit 7754c66 into NVIDIA:main Mar 13, 2026
10 checks passed