[None][feat] Add mix-precision checkpoint support in AutoDeploy #12175
suyoggupta merged 1 commit into NVIDIA:main from
Conversation
Signed-off-by: Fridah-nv <201670829+Fridah-nv@users.noreply.github.com>
📝 Walkthrough

This pull request adds support for mixed-precision quantization configuration in the auto-deploy pipeline. Changes include branching logic in the config reader to distinguish between mixed-precision and single-algorithm paths, new helper utilities for mixed-precision validation, and updated transform logic to conditionally skip or process quantization based on configuration type.
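As a hedged illustration of the two configuration shapes being distinguished, a minimal sketch follows; field names such as `quant_algo` and `quantized_layers` are assumptions for this sketch, not the exact checkpoint schema:

```python
# Hypothetical sketch of the two config shapes distinguished by the reader.
# Field names ("quant_algo", "quantized_layers") are illustrative assumptions.
MIXED_PRECISION = "MIXED_PRECISION"

def is_mixed_precision_config(qcfg: dict) -> bool:
    """A config is mixed-precision when its top-level algo is the sentinel."""
    return qcfg.get("quant_algo") == MIXED_PRECISION

single = {"quant_algo": "NVFP4"}
mixed = {
    "quant_algo": MIXED_PRECISION,
    "quantized_layers": {
        "model.layers.0.self_attn.q_proj": {"quant_algo": "FP8"},
        "model.layers.0.mlp.up_proj": {"quant_algo": "NVFP4"},
    },
}

print(is_mixed_precision_config(single), is_mixed_precision_config(mixed))  # False True
```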
Sequence Diagram(s)

sequenceDiagram
actor Transform as Quantization Transform
participant Reader as Config Reader
participant Utils as Quantization Utils
participant Graph as Model Graph
Transform->>Reader: read_config(quant_config)
alt MIXED_PRECISION Config
Reader->>Reader: _read_mixed_precision_config()
Reader->>Reader: validate quantized_layers
Reader->>Reader: extract per-layer algo info
Reader->>Utils: initialize exclude_modules
Reader->>Reader: _handle_kv_cache()
Reader->>Transform: return (config type: MIXED)
Transform->>Utils: is_mixed_precision_config(qcfg)
Utils-->>Transform: true
Transform->>Graph: iterate linear nodes
loop For each node
Transform->>Utils: should_skip_mixed_precision_quantization(node, algo, layers)
Utils-->>Transform: skip decision
alt skip = false
Transform->>Graph: apply quantization
end
end
else Single Algorithm Config
Reader->>Reader: _read_single_algo_config()
Reader->>Reader: _handle_kv_cache()
Reader->>Transform: return (config type: SINGLE)
Transform->>Utils: is_mixed_precision_config(qcfg)
Utils-->>Transform: false
alt quant_algo matches transform
Transform->>Graph: apply standard quantization
else
Transform-->>Transform: skip (algo mismatch)
end
end
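The reader-side branching in the diagram can be sketched as follows; helper names follow the diagram, but the bodies are illustrative stubs rather than the real implementation:

```python
# Sketch of the reader dispatch shown in the diagram. Helper names follow the
# diagram; the bodies are illustrative stubs, not the real implementation.

def _is_mixed(qcfg: dict) -> bool:
    return qcfg.get("quant_algo") == "MIXED_PRECISION"

def _read_mixed_precision_config(qcfg: dict) -> dict:
    # validate quantized_layers and extract per-layer algo info
    layers = qcfg.get("quantized_layers") or {}
    if not all("quant_algo" in info for info in layers.values()):
        raise ValueError("each entry in quantized_layers needs a quant_algo")
    return {"type": "MIXED", "quantized_layers": layers}

def _read_single_algo_config(qcfg: dict) -> dict:
    return {"type": "SINGLE", "quant_algo": qcfg["quant_algo"]}

def _handle_kv_cache(cfg: dict, qcfg: dict) -> dict:
    if qcfg.get("kv_cache_quant_algo"):
        cfg["kv_cache_dtype"] = qcfg["kv_cache_quant_algo"].lower()
    return cfg

def read_config(qcfg: dict) -> dict:
    if _is_mixed(qcfg):
        cfg = _read_mixed_precision_config(qcfg)
    else:
        cfg = _read_single_algo_config(qcfg)
    return _handle_kv_cache(cfg, qcfg)

print(read_config({"quant_algo": "NVFP4"})["type"])  # SINGLE
```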
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~25 minutes

🚥 Pre-merge checks: ✅ 1 | ❌ 2

❌ Failed checks (2 warnings)
✅ Passed checks (1 passed)
Actionable comments posted: 3
🧹 Nitpick comments (2)
tensorrt_llm/_torch/auto_deploy/transform/library/quantize_moe.py (1)

195-230: Factor shared mixed-precision gating into a common helper.

Both _apply implementations repeat the same qcfg/mixed-precision gating and skip checks. Extracting the common block will reduce drift risk.

Also applies to: 261-296
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tensorrt_llm/_torch/auto_deploy/transform/library/quantize_moe.py` around lines 195-230: Extract the repeated qcfg/mixed-precision gating and per-node skip logic into a shared helper (e.g., a function like _get_moe_quant_params(factory, algo_name) that returns (qcfg, is_mixed, quantized_layers, excluded_patterns), or a boolean helper like _should_skip_moe_node(node, qcfg, is_mixed, quantized_layers, excluded_patterns)). Move the logic around factory.get_quant_config(), is_mixed_precision_config(), mixed_precision_has_algo(), the quant_algo check, and the building of quantized_layers/excluded_patterns into that helper and have both _apply implementations call it; preserve the existing early-return behavior (return gm, TransformInfo(skipped=True, ...)) when qcfg is absent or the algo doesn't match, and keep the per-node checks using _extract_moe_weight_param_lists(), is_op(..., torch_moe), should_skip_quantization(), and should_skip_mixed_precision_quantization() unchanged but invoked via the new helper to avoid duplication.

tensorrt_llm/_torch/auto_deploy/transform/library/quantization.py (1)

27-38: Maintain namespace in quantization_utils import.

Import the quantization_utils module rather than individual symbols, then reference them through the module (e.g., quantization_utils.fp4_global_scale(...) instead of fp4_global_scale(...)). This follows the repository's import policy: "When importing in Python, always maintain the namespace. Import the module, not individual classes or functions."

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@tensorrt_llm/_torch/auto_deploy/transform/library/quantization.py` around lines 27-38: The import should maintain the namespace: replace the star-list import of functions from ...utils.quantization_utils with a module import (import quantization_utils from ...utils) and update every call site in this file that currently calls fp4_global_scale, fp8_scale, get_quantization_from_linear_node, is_mixed_precision_config, is_quantized_graph, is_quantized_op, mixed_precision_has_algo, remove_output_quantizers, should_skip_mixed_precision_quantization, and should_skip_quantization to use the module prefix (e.g., quantization_utils.fp4_global_scale(...), quantization_utils.get_quantization_from_linear_node(...), etc.) so all references consistently use the quantization_utils namespace.
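The namespace policy applied to a standard-library module, as a runnable illustration; os.path stands in for quantization_utils here, and the real import in the PR's files would be a relative `from ...utils import quantization_utils`:

```python
# Discouraged by the policy: pulling symbols out of the module, e.g.
#   from os.path import basename, join
# Preferred: import the module and keep the namespace at every call site,
# analogous to `from ...utils import quantization_utils` in the PR's files.
from os import path as osp

# Call sites stay prefixed, so the origin of each helper is always visible.
print(osp.basename("checkpoints/model.safetensors"))  # model.safetensors
```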
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@tensorrt_llm/_torch/auto_deploy/models/quant_config_reader.py`:
- Around line 155-159: The code checks kv_algo (from
quant_config["kv_cache_quant_algo"]) case-sensitively and will reject values
like "fp8"; normalize kv_algo before validation by lowercasing or uppercasing it
(e.g., set kv_algo_normalized = kv_algo.lower()) and compare against the
expected token (e.g., "fp8"); then set quant_config["kv_cache_dtype"] to the
normalized dtype string ("fp8") when valid. Update the block that reads kv_algo
and assigns kv_cache_dtype accordingly (referencing kv_algo, quant_config,
kv_cache_quant_algo, kv_cache_dtype).
- Line 143: The current assignment quant_config["exclude_modules"] =
list(self._ALWAYS_EXCLUDE) in quant_config_reader overwrites any caller-provided
exclude_modules; change it to merge the user-provided
quant_config.get("exclude_modules", []) with self._ALWAYS_EXCLUDE (e.g., union
or ordered unique concatenation) so explicit user exclusions are preserved.
Update the logic around quant_config and the key "exclude_modules" in the
class/method where self._ALWAYS_EXCLUDE is referenced so the final list contains
both sets without duplicates.
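The two reader fixes above can be sketched together; key names mirror the comments (kv_cache_quant_algo, kv_cache_dtype, exclude_modules, _ALWAYS_EXCLUDE), but the bodies and the _ALWAYS_EXCLUDE contents are illustrative, not the actual quant_config_reader code:

```python
# Illustrative sketch of the two suggested quant_config_reader.py fixes.
# _ALWAYS_EXCLUDE contents are hypothetical.
_ALWAYS_EXCLUDE = ["lm_head", "model.embed_tokens"]

def handle_kv_cache(quant_config: dict) -> dict:
    """Case-insensitive kv-cache algo check, storing a normalized token."""
    kv_algo = quant_config.get("kv_cache_quant_algo")
    if kv_algo is not None:
        if kv_algo.lower() != "fp8":
            raise ValueError(f"Unsupported kv cache quant algo: {kv_algo}")
        quant_config["kv_cache_dtype"] = "fp8"  # normalized, so "FP8" also passes
    return quant_config

def merge_excludes(quant_config: dict) -> list:
    """Ordered unique union of user-provided exclusions and the built-ins."""
    user = quant_config.get("exclude_modules", [])
    merged = list(dict.fromkeys([*user, *_ALWAYS_EXCLUDE]))
    quant_config["exclude_modules"] = merged
    return merged

print(handle_kv_cache({"kv_cache_quant_algo": "FP8"})["kv_cache_dtype"])  # fp8
print(merge_excludes({"exclude_modules": ["vision_tower"]}))
```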
In `@tensorrt_llm/_torch/auto_deploy/transform/library/quantize_moe.py`:
- Around line 28-33: Replace the individual function imports from
quantization_utils with a module-namespace import and update all call sites to
use that namespace: import the module (e.g., from ...utils import
quantization_utils) and change usages like is_mixed_precision_config(...),
mixed_precision_has_algo(...), should_skip_mixed_precision_quantization(...),
and should_skip_quantization(...) to
quantization_utils.is_mixed_precision_config(...),
quantization_utils.mixed_precision_has_algo(...),
quantization_utils.should_skip_mixed_precision_quantization(...), and
quantization_utils.should_skip_quantization(...) respectively so the file
follows the repo policy of importing the module rather than individual symbols.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: 88c9e45c-5be9-43bd-8bb4-30f5a3204e46
📒 Files selected for processing (5)
- examples/auto_deploy/super_v3.yaml
- tensorrt_llm/_torch/auto_deploy/models/quant_config_reader.py
- tensorrt_llm/_torch/auto_deploy/transform/library/quantization.py
- tensorrt_llm/_torch/auto_deploy/transform/library/quantize_moe.py
- tensorrt_llm/_torch/auto_deploy/utils/quantization_utils.py
lucaslie
left a comment
Did you run Super with the mixed-precision checkpoint, and does the output look coherent? Otherwise, looks good to me and happy to approve.
Yes, the output looks fine; I also updated the accuracy test results in the PR description.
/bot run --extra-stage "DGX_B200-4_GPUs-AutoDeploy-1, DGX_H100-4_GPUs-AutoDeploy-1"

PR_Github #38824 [ run ] triggered by Bot. Commit:

PR_Github #38824 [ run ] completed with state

/bot run --extra-stage "DGX_B200-4_GPUs-AutoDeploy-1, DGX_H100-4_GPUs-AutoDeploy-1"

PR_Github #38902 [ run ] triggered by Bot. Commit:

PR_Github #38902 [ run ] completed with state
Summary by CodeRabbit
Release Notes
New Features
Chores
Description
For enabling https://huggingface.co/nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-NVFP4
Test Coverage
Accuracy test using the tests/integration/defs/accuracy/test_llm_api_autodeploy.py::TestNemotronSuperV3::test_accuracy[nvfp4-1-attn_dp_off-trtllm] setting and the nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-NVFP4 checkpoint:
Please review the following before submitting your PR:
PR description clearly explains what and why. If using CodeRabbit's summary, please make sure it makes sense.
PR Follows TRT-LLM CODING GUIDELINES to the best of your knowledge.
Test cases are provided for new code paths (see test instructions)
Any new dependencies have been scanned for license and vulnerabilities
CODEOWNERS updated if ownership changes
Documentation updated as needed
Update tava architecture diagram if there is a significant design change in PR.
The reviewers assigned automatically/manually are appropriate for the PR.
Please check this after reviewing the above items as appropriate for this PR.
GitHub Bot Help
To see a list of available CI bot commands, please comment
/bot help.