
[None][feat] Add mix-precision checkpoint support in AutoDeploy #12175

Merged
suyoggupta merged 1 commit into NVIDIA:main from nv-auto-deploy:user/fridah/super-aq on Mar 13, 2026
Conversation

@Fridah-nv (Collaborator) commented Mar 13, 2026

Summary by CodeRabbit

Release Notes

  • New Features

    • Added support for mixed-precision quantization configurations with per-layer quantization algorithm selection and validation
    • Enhanced quantization validation with expanded logging for per-layer algorithm distribution and layer counts
  • Chores

    • Updated backend configuration in auto-deployment settings

Description

Enables serving the mixed-precision checkpoint https://huggingface.co/nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-NVFP4

Test Coverage

Accuracy test using tests/integration/defs/accuracy/test_llm_api_autodeploy.py::TestNemotronSuperV3::test_accuracy[nvfp4-1-attn_dp_off-trtllm] with the nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-NVFP4 checkpoint:

| Benchmark | Reference accuracy | Threshold | Evaluated accuracy |
|-----------|--------------------|-----------|--------------------|
| MMLU      | 86.120             | 84.303    | 85.453             |
| GSM8K     | 82.121             | 78.918    | 92.456             |

PR Checklist

Please review the following before submitting your PR:

  • PR description clearly explains what and why. If using CodeRabbit's summary, please make sure it makes sense.

  • PR Follows TRT-LLM CODING GUIDELINES to the best of your knowledge.

  • Test cases are provided for new code paths (see test instructions)

  • Any new dependencies have been scanned for license and vulnerabilities

  • CODEOWNERS updated if ownership changes

  • Documentation updated as needed

  • Update tava architecture diagram if there is a significant design change in PR.

  • The reviewers assigned automatically/manually are appropriate for the PR.

  • Please check this after reviewing the above items as appropriate for this PR.

GitHub Bot Help

To see a list of available CI bot commands, please comment /bot help.

Signed-off-by: Fridah-nv <201670829+Fridah-nv@users.noreply.github.com>
@Fridah-nv Fridah-nv self-assigned this Mar 13, 2026
@Fridah-nv Fridah-nv requested a review from a team as a code owner March 13, 2026 00:36
@Fridah-nv Fridah-nv requested a review from bmarimuthu-nv March 13, 2026 00:36
coderabbitai bot (Contributor) commented Mar 13, 2026

📝 Walkthrough

Walkthrough

This pull request adds support for mixed-precision quantization configuration in the auto-deploy pipeline. Changes include branching logic in the config reader to distinguish between mixed-precision and single-algorithm paths, new helper utilities for mixed-precision validation, and updated transform logic to conditionally skip or process quantization based on configuration type.

Changes

Cohort / File(s) Summary
Configuration
examples/auto_deploy/super_v3.yaml
Changed backend for fuse_nvfp4_moe transform from trtllm_gen to trtllm.
Quantization Config Reader
tensorrt_llm/_torch/auto_deploy/models/quant_config_reader.py
Added branching logic to read_config to dispatch between _read_mixed_precision_config and _read_single_algo_config based on quant_algo type. Introduced new helper methods for KV cache handling, mixed-precision validation, and config normalization.
Quantization Transforms
tensorrt_llm/_torch/auto_deploy/transform/library/quantization.py, tensorrt_llm/_torch/auto_deploy/transform/library/quantize_moe.py
Added mixed-precision detection and skipping logic across quantization transforms. Updated control flow to validate algo support, extract quantized_layers, and apply per-node filtering for mixed-precision scenarios. Applied same gating logic consistently to FP8 and NVFP4 MOE quantization paths.
Quantization Utilities
tensorrt_llm/_torch/auto_deploy/utils/quantization_utils.py
Added helper functions for mixed-precision configuration detection (is_mixed_precision_config, mixed_precision_has_algo) and per-node skipping logic (should_skip_mixed_precision_quantization, _extract_modname). Removed debug print statement.
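The helper names listed for quantization_utils.py suggest a small predicate-based design. A minimal sketch of how these helpers might behave, assuming a ModelOpt-style config with a `quantized_layers` mapping; the actual implementations in the PR may differ:

```python
from typing import Any, Dict

# Sentinel algo value assumed for mixed-precision checkpoints.
MIXED_PRECISION = "MIXED_PRECISION"


def is_mixed_precision_config(qcfg: Dict[str, Any]) -> bool:
    """A config is mixed-precision when its top-level quant_algo is the sentinel."""
    return qcfg.get("quant_algo") == MIXED_PRECISION


def mixed_precision_has_algo(qcfg: Dict[str, Any], algo: str) -> bool:
    """True if any per-layer entry uses the given algorithm."""
    layers: Dict[str, Dict[str, Any]] = qcfg.get("quantized_layers", {})
    return any(entry.get("quant_algo") == algo for entry in layers.values())


def should_skip_mixed_precision_quantization(
    modname: str, algo: str, quantized_layers: Dict[str, Dict[str, Any]]
) -> bool:
    """Skip a module when its per-layer algo does not match the transform's algo."""
    entry = quantized_layers.get(modname)
    return entry is None or entry.get("quant_algo") != algo
```

With this shape, each quantization transform can keep its single-algo fast path and only consult `quantized_layers` when the config is mixed.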

Sequence Diagram(s)

sequenceDiagram
    actor Transform as Quantization Transform
    participant Reader as Config Reader
    participant Utils as Quantization Utils
    participant Graph as Model Graph

    Transform->>Reader: read_config(quant_config)
    
    alt MIXED_PRECISION Config
        Reader->>Reader: _read_mixed_precision_config()
        Reader->>Reader: validate quantized_layers
        Reader->>Reader: extract per-layer algo info
        Reader->>Utils: initialize exclude_modules
        Reader->>Reader: _handle_kv_cache()
        Reader->>Transform: return (config type: MIXED)
        
        Transform->>Utils: is_mixed_precision_config(qcfg)
        Utils-->>Transform: true
        
        Transform->>Graph: iterate linear nodes
        loop For each node
            Transform->>Utils: should_skip_mixed_precision_quantization(node, algo, layers)
            Utils-->>Transform: skip decision
            alt skip = false
                Transform->>Graph: apply quantization
            end
        end
    else Single Algorithm Config
        Reader->>Reader: _read_single_algo_config()
        Reader->>Reader: _handle_kv_cache()
        Reader->>Transform: return (config type: SINGLE)
        
        Transform->>Utils: is_mixed_precision_config(qcfg)
        Utils-->>Transform: false
        
        alt quant_algo matches transform
            Transform->>Graph: apply standard quantization
        else
            Transform-->>Transform: skip (algo mismatch)
        end
    end
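The dispatch at the top of the diagram can be sketched as follows. The method names come from the summary above; the bodies (validation, the per-layer algorithm tally that feeds the expanded logging) are assumptions:

```python
from collections import Counter


class QuantConfigReader:
    """Hypothetical sketch of the read_config dispatch in quant_config_reader.py."""

    def read_config(self, quant_config: dict) -> dict:
        # Mixed-precision checkpoints use a sentinel quant_algo and carry a
        # per-layer mapping; everything else takes the single-algo path.
        if quant_config.get("quant_algo") == "MIXED_PRECISION":
            return self._read_mixed_precision_config(dict(quant_config))
        return self._read_single_algo_config(dict(quant_config))

    def _read_mixed_precision_config(self, cfg: dict) -> dict:
        layers = cfg.get("quantized_layers") or {}
        if not layers:
            raise ValueError("MIXED_PRECISION config requires 'quantized_layers'")
        # Summary of the per-layer algorithm distribution, e.g. for logging.
        cfg["algo_counts"] = dict(Counter(v["quant_algo"] for v in layers.values()))
        return cfg

    def _read_single_algo_config(self, cfg: dict) -> dict:
        return cfg
```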

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

🚥 Pre-merge checks: ✅ 1 passed | ❌ 2 warnings

❌ Failed checks (2 warnings)

  • Docstring Coverage ⚠️ Warning — Docstring coverage is 58.82%, below the required threshold of 80.00%. Resolution: write docstrings for the functions missing them.
  • Description check ⚠️ Warning — The PR description is incomplete: it mentions enabling a specific model checkpoint but does not explain the implementation changes, the architecture, or why mixed-precision support was needed. Resolution: expand the Description section to cover the technical approach, what mixed-precision support entails, which files were modified and why, and how the code changes relate to the deployment objective.

✅ Passed checks (1 passed)

  • Title check ✅ Passed — The title clearly describes the main change: adding mixed-precision checkpoint support in AutoDeploy, matching the new mixed-precision quantization config parsing and related helpers.

Comment @coderabbitai help to get the list of available commands and usage tips.

coderabbitai bot (Contributor) left a comment

Actionable comments posted: 3

🧹 Nitpick comments (2)
tensorrt_llm/_torch/auto_deploy/transform/library/quantize_moe.py (1)

195-230: Factor shared mixed-precision gating into a common helper.

Both _apply implementations repeat the same qcfg/mixed-precision gating and skip checks. Extracting the common block will reduce drift risk.

Also applies to: 261-296

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tensorrt_llm/_torch/auto_deploy/transform/library/quantize_moe.py` around
lines 195 - 230, Extract the repeated qcfg/mixed-precision gating and per-node
skip logic into a shared helper (e.g., a function like
_get_moe_quant_params(factory, algo_name) that returns (qcfg, is_mixed,
quantized_layers, excluded_patterns) or a boolean helper like
_should_skip_moe_node(node, qcfg, is_mixed, quantized_layers,
excluded_patterns)). Move the logic around factory.get_quant_config(),
is_mixed_precision_config(), mixed_precision_has_algo(), quant_algo check, and
building quantized_layers/excluded_patterns into that helper and have both
_apply implementations call it; preserve the existing early-return behavior
(return gm, TransformInfo(skipped=True,...)) when qcfg is absent or algo doesn’t
match, and keep the per-node checks using _extract_moe_weight_param_lists(),
is_op(... torch_moe), should_skip_quantization(), and
should_skip_mixed_precision_quantization() unchanged but invoked via the new
helper to avoid duplication.
tensorrt_llm/_torch/auto_deploy/transform/library/quantization.py (1)

27-38: Maintain namespace in quantization_utils import.

Import the quantization_utils module rather than individual symbols, then reference them through the module (e.g., quantization_utils.fp4_global_scale(...) instead of fp4_global_scale(...)). This follows the repository's import policy: "When importing in Python, always maintain the namespace. Import the module, not individual classes or functions."

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tensorrt_llm/_torch/auto_deploy/transform/library/quantization.py` around
lines 27 - 38, The import should maintain the namespace: replace the star-list
import of functions from ...utils.quantization_utils with a module import
(import quantization_utils from ...utils) and update every call site in this
file that currently calls fp4_global_scale, fp8_scale,
get_quantization_from_linear_node, is_mixed_precision_config,
is_quantized_graph, is_quantized_op, mixed_precision_has_algo,
remove_output_quantizers, should_skip_mixed_precision_quantization, and
should_skip_quantization to use the module prefix (e.g.,
quantization_utils.fp4_global_scale(...),
quantization_utils.get_quantization_from_linear_node(...), etc.) so all
references consistently use the quantization_utils namespace.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@tensorrt_llm/_torch/auto_deploy/models/quant_config_reader.py`:
- Around line 155-159: The code checks kv_algo (from
quant_config["kv_cache_quant_algo"]) case-sensitively and will reject values
like "fp8"; normalize kv_algo before validation by lowercasing or uppercasing it
(e.g., set kv_algo_normalized = kv_algo.lower()) and compare against the
expected token (e.g., "fp8"); then set quant_config["kv_cache_dtype"] to the
normalized dtype string ("fp8") when valid. Update the block that reads kv_algo
and assigns kv_cache_dtype accordingly (referencing kv_algo, quant_config,
kv_cache_quant_algo, kv_cache_dtype).
- Line 143: The current assignment quant_config["exclude_modules"] =
list(self._ALWAYS_EXCLUDE) in quant_config_reader overwrites any caller-provided
exclude_modules; change it to merge the user-provided
quant_config.get("exclude_modules", []) with self._ALWAYS_EXCLUDE (e.g., union
or ordered unique concatenation) so explicit user exclusions are preserved.
Update the logic around quant_config and the key "exclude_modules" in the
class/method where self._ALWAYS_EXCLUDE is referenced so the final list contains
both sets without duplicates.

In `@tensorrt_llm/_torch/auto_deploy/transform/library/quantize_moe.py`:
- Around line 28-33: Replace the individual function imports from
quantization_utils with a module-namespace import and update all call sites to
use that namespace: import the module (e.g., from ...utils import
quantization_utils) and change usages like is_mixed_precision_config(...),
mixed_precision_has_algo(...), should_skip_mixed_precision_quantization(...),
and should_skip_quantization(...) to
quantization_utils.is_mixed_precision_config(...),
quantization_utils.mixed_precision_has_algo(...),
quantization_utils.should_skip_mixed_precision_quantization(...), and
quantization_utils.should_skip_quantization(...) respectively so the file
follows the repo policy of importing the module rather than individual symbols.
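The two quant_config_reader.py inline comments above amount to small, self-contained fixes. Hedged sketches of both, assuming a plain-dict config; the real code in the PR may differ:

```python
from typing import Iterable, List


def normalize_kv_cache_algo(quant_config: dict) -> dict:
    """Validate kv_cache_quant_algo case-insensitively (first comment above)."""
    kv_algo = quant_config.get("kv_cache_quant_algo")
    if kv_algo is not None:
        # Lowercase before comparing so "FP8", "fp8", "Fp8" are all accepted.
        if kv_algo.lower() != "fp8":
            raise ValueError(f"Unsupported kv_cache_quant_algo: {kv_algo}")
        quant_config["kv_cache_dtype"] = "fp8"
    return quant_config


def merge_exclude_modules(
    user_excludes: Iterable[str], always_exclude: Iterable[str]
) -> List[str]:
    """Ordered-unique union so caller-provided exclusions survive the
    always-exclude defaults (second comment above)."""
    return list(dict.fromkeys([*user_excludes, *always_exclude]))
```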


ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 88c9e45c-5be9-43bd-8bb4-30f5a3204e46

📥 Commits

Reviewing files that changed from the base of the PR and between 2eee701 and f56a258.

📒 Files selected for processing (5)
  • examples/auto_deploy/super_v3.yaml
  • tensorrt_llm/_torch/auto_deploy/models/quant_config_reader.py
  • tensorrt_llm/_torch/auto_deploy/transform/library/quantization.py
  • tensorrt_llm/_torch/auto_deploy/transform/library/quantize_moe.py
  • tensorrt_llm/_torch/auto_deploy/utils/quantization_utils.py

@lucaslie (Member) left a comment

Did you run Super with the mixed-precision checkpoint, and does the output look coherent? Otherwise, looks good to me and happy to approve.

@Fridah-nv (Collaborator, Author)

> Did you run Super with the mixed-precision checkpoint, and does the output look coherent? Otherwise, looks good to me and happy to approve.

Yes, the output looks fine. I also updated the accuracy test results in the PR description.

@Fridah-nv (Collaborator, Author)

/bot run --extra-stage "DGX_B200-4_GPUs-AutoDeploy-1, DGX_H100-4_GPUs-AutoDeploy-1"

@tensorrt-cicd (Collaborator)

PR_Github #38824 [ run ] triggered by Bot. Commit: f56a258 Link to invocation

@tensorrt-cicd (Collaborator)

PR_Github #38824 [ run ] completed with state SUCCESS. Commit: f56a258
/LLM/main/L0_MergeRequest_PR pipeline #30136 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

Link to invocation

@Fridah-nv (Collaborator, Author)

/bot run --extra-stage "DGX_B200-4_GPUs-AutoDeploy-1, DGX_H100-4_GPUs-AutoDeploy-1"

@tensorrt-cicd (Collaborator)

PR_Github #38902 [ run ] triggered by Bot. Commit: f56a258 Link to invocation

@tensorrt-cicd (Collaborator)

PR_Github #38902 [ run ] completed with state SUCCESS. Commit: f56a258
/LLM/main/L0_MergeRequest_PR pipeline #30210 completed with status: 'SUCCESS'

CI Report

Link to invocation

@suyoggupta suyoggupta merged commit 7754c66 into NVIDIA:main Mar 13, 2026
10 checks passed