[Cherry-pick] PRs #1256 #1305 #1322 #1317 #1321 #1289 #1311 #1332 #1104 #1318 #1350
kevalmorabia97 merged 10 commits into release/0.44.0 from
Conversation
### What does this PR do? Exclude small-dimension MatMul nodes from INT8 quantization. MatMuls with N or K < 16 cannot efficiently use INT8, causing performance regressions. ### Before your PR is "*Ready for review*" Make sure you read and follow [Contributor guidelines](https://github.com/NVIDIA/Model-Optimizer/blob/main/CONTRIBUTING.md) and your commits are signed (`git commit -s -S`). Make sure you read and follow the [Security Best Practices](https://github.com/NVIDIA/Model-Optimizer/blob/main/SECURITY.md#security-coding-practices-for-contributors) (e.g. avoiding hardcoded `trust_remote_code=True`, `torch.load(..., weights_only=False)`, `pickle`, etc.). - Is this change backward compatible?: ✅ / ❌ / N/A <!--- If ❌, explain why. --> - If you copied code from any other sources or added a new PIP dependency, did you follow guidance in `CONTRIBUTING.md`: ✅ / ❌ / N/A <!--- Mandatory --> - Did you write any new necessary tests?: ✅ / ❌ / N/A <!--- Mandatory for new features or examples. --> - Did you update [Changelog](https://github.com/NVIDIA/Model-Optimizer/blob/main/CHANGELOG.rst)?: ✅ / ❌ / N/A <!--- Only for new features, API changes, critical bug fixes or backward incompatible changes. --> ### Additional Information <!-- E.g. related issue. --> <!-- This is an auto-generated comment: release notes by coderabbit.ai --> ## Summary by CodeRabbit * **Bug Fixes** * Improved quantization exclusions so MatMul/Gemm ops with derived K<16 or N<16 are skipped, honoring Gemm transB, using inferred and runtime-determined shapes, and avoiding duplicate outputs. * **Tests** * Expanded unit tests to cover constant, inferred, and runtime-derived shapes, Gemm transB behavior, small-dimension edge cases, and output deduplication. * **Documentation** * Added changelog entry documenting the new small-dimension exclusion thresholds and transB handling. <!-- end of auto-generated comment: release notes by coderabbit.ai --> --------- Signed-off-by: samcheng <samcheng@nvidia.com> Signed-off-by: Keval Morabia <28916987+kevalmorabia97@users.noreply.github.com>
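For illustration, the exclusion rule described in this PR can be sketched as a small pass over the ONNX graph. The sketch below is not the actual `graph_utils.py` implementation: the function name and shape lookup are assumptions, and the real logic additionally handles Gemm `transB`, runtime-determined shapes, and output deduplication as noted in the release notes.

```python
import onnx
from onnx import shape_inference


def small_matmul_nodes_to_exclude(model: onnx.ModelProto, min_dim: int = 16) -> list[str]:
    """Return names of MatMul nodes whose weight-side K or N dimension is below min_dim."""
    inferred = shape_inference.infer_shapes(model)

    # Map tensor name -> static shape (None marks symbolic/unknown dims).
    shapes: dict[str, list] = {}
    for vi in list(inferred.graph.value_info) + list(inferred.graph.input):
        shapes[vi.name] = [
            d.dim_value if d.HasField("dim_value") else None
            for d in vi.type.tensor_type.shape.dim
        ]
    for init in model.graph.initializer:
        shapes[init.name] = list(init.dims)

    excluded = []
    for node in model.graph.node:
        if node.op_type != "MatMul":
            continue
        b_shape = shapes.get(node.input[1])
        if not b_shape or len(b_shape) < 2:
            continue  # shape unknown at export time; this sketch leaves such nodes alone
        k, n = b_shape[-2], b_shape[-1]
        if (k is not None and k < min_dim) or (n is not None and n < min_dim):
            excluded.append(node.name)
    return excluded
```

In a real pipeline the returned node names would be appended to whatever exclusion list the INT8 quantizer already consumes; that wiring is omitted here.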
### What does this PR do?

Type of change: Bug fix

Fixes two bugs in the vLLM + Megatron-Core MoE export path and cleans up the related weight-collection helper:

1. **`_QuantFusedMoEBase` (vllm.py)**: The weight-quantizer path in `_invoke_fused_moe_quantized_function` was temporarily mutating `self.w13_weight` / `self.w2_weight` to the quantized tensor, then restoring them via `finally`. This exposed a stale quantized tensor on `self` between the mutation and the kernel call. Fixed by computing the quantized weight directly into a local `B` without touching `self.*` attributes.
2. **`GPTModelExporter` / `VllmFqGPTModelExporter` (unified_export_megatron.py / vllm_fakequant_megatron.py)**: `expert_bias` (present in grouped MoE layers) was silently dropped during export because the bias collection ran after the early return on missing `weight`. Extracted a `_get_weight_bias` helper that collects weight, bias, and expert_bias together, so bias/expert_bias are captured even when weight is absent or zero-element.

### Usage

```python
# No API change; export pipelines pick this up automatically.
# export_mcore_gpt_to_hf_vllm_fq / export_mcore_gpt_to_hf now correctly
# export expert_bias for grouped-MoE checkpoints.
```

### Testing

Step 1 — Quantize (run from Megatron-LM examples/post_training/modelopt):

```
HF_MODEL_CKPT=<path/to/hf/weights> MLM_MODEL_SAVE=<quant-ckpt-name> \
  bash quantize.sh <hf-model-id> NVFP4_DEFAULT_CFG
```

Step 2 — Export for vLLM fakequant:

```
MLM_EXTRA_ARGS=--export-vllm-fq \
  HF_MODEL_CKPT=<path/to/hf/weights> \
  MLM_MODEL_CKPT=<quant-ckpt-name> \
  EXPORT_DIR=<export-dir> \
  bash export.sh <hf-model-id>
```

Step 3 — Serve (run from examples/vllm_serve):

```
QUANT_CFG=NVFP4_DEFAULT_CFG \
  QUANT_FILE_PATH=<export-dir>/quantizer_state.pth \
  python3 vllm_serve_fakequant.py <export-dir> \
  -tp 1 --served-model-name <model-name> \
  --host 0.0.0.0 --port 8000 \
  --trust-remote-code --enforce-eager \
  --disable-custom-all-reduce \
  --gpu-memory-utilization 0.8
```

### Before your PR is "*Ready for review*"

Make sure you read and follow [Contributor guidelines](https://github.com/NVIDIA/Model-Optimizer/blob/main/CONTRIBUTING.md) and your commits are signed (`git commit -s -S`). Make sure you read and follow the [Security Best Practices](https://github.com/NVIDIA/Model-Optimizer/blob/main/SECURITY.md#security-coding-practices-for-contributors) (e.g. avoiding hardcoded `trust_remote_code=True`, `torch.load(..., weights_only=False)`, `pickle`, etc.).

- Is this change backward compatible?: ✅
- If you copied code from any other sources or added a new PIP dependency, did you follow guidance in `CONTRIBUTING.md`: N/A
- Did you write any new necessary tests?: ❌
- Did you update [Changelog](https://github.com/NVIDIA/Model-Optimizer/blob/main/CHANGELOG.rst)?: N/A

### Additional Information

<!-- E.g. related issue. -->

<!-- This is an auto-generated comment: release notes by coderabbit.ai -->
## Summary by CodeRabbit

* **Refactor**
  * Centralized weight/bias/expert-bias extraction and export to a single helper for consistent handling.
  * Standardized quantized-weight flow to temporarily swap and restore parameter tensors during computation.
* **Bug Fixes**
  * Prevented missing or incorrect weight/bias exports by unifying extraction logic.
  * Broadened checkpoint key matching to preserve more quantizer state during reloads.
<!-- end of auto-generated comment: release notes by coderabbit.ai --> --------- Signed-off-by: Kinjal Patel <kinjalpravin@nvidia.com> Signed-off-by: Keval Morabia <28916987+kevalmorabia97@users.noreply.github.com>
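To make item 1 above concrete, here is a rough sketch of the pattern described there. The names and signature are hypothetical and not the actual `_QuantFusedMoEBase` code; the point is only that the quantized weight goes into a local variable instead of being swapped onto `self` and restored in a `finally` block.

```python
# Hypothetical sketch, not the ModelOpt implementation: compute the quantized
# weight into a local tensor instead of temporarily mutating self.* attributes.
def _invoke_quantized(self, weight, weight_quantizer, fused_kernel, hidden_states):
    # Old pattern (buggy): self.w13_weight = weight_quantizer(self.w13_weight),
    # call the kernel, then restore self.w13_weight in a finally block. Between
    # mutation and restore, a stale quantized tensor was visible on self.
    # New pattern: `self` is never touched.
    B = weight_quantizer(weight) if weight_quantizer is not None else weight
    return fused_kernel(hidden_states, B)
```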
…1322) ## Summary - Replaces the old pip license notice ("Please review the license terms of ModelOpt and any dependencies before use") with the Legal-approved wording: "Model Optimizer will download and install additional third-party open source software projects. Review the license terms of these open source projects before use." - Adds a generic container license review notice ("Before pulling and using the container images, please review their respective license terms.") to the Linux installation doc (Docker tab) and README. - Adds a `.. note::` with the pip notice to the Windows installation page (covers both standalone and Olive child pages). - Expands the README container section to explicitly list all four recommended NVIDIA container images (`pytorch`, `nemo`, `tensorrt-llm`, `tensorrt`). ## Test plan - [x] Verify rendered docs look correct (`nox -s docs`) - [x] Confirm legal notices appear in Linux, Windows, and README install sections 🤖 Generated with [Claude Code](https://claude.com/claude-code) <!-- This is an auto-generated comment: release notes by coderabbit.ai --> ## Summary by CodeRabbit * **Documentation** * Updated installation guides with explicit references to supported NVIDIA container images (PyTorch, NeMo, TensorRT-LLM and variants), clarified pre-installed Model Optimizer in some images, and added notes to review each container’s license terms; clarified conditional environment setup wording and local install license guidance. * **Chores** * Updated project license header year. <!-- end of auto-generated comment: release notes by coderabbit.ai --> --------- Signed-off-by: Keval Morabia <28916987+kevalmorabia97@users.noreply.github.com> Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com> Signed-off-by: Keval Morabia <28916987+kevalmorabia97@users.noreply.github.com>
### What does this PR do?

Type of change: documentation

This PR updates vLLM deployment instructions, taking into account heterogeneous models created with AnyModel.

### Usage

Does not apply.

### Testing

Run the updated instructions in the documentation.

### Before your PR is "*Ready for review*"

Make sure you read and follow [Contributor guidelines](https://github.com/NVIDIA/Model-Optimizer/blob/main/CONTRIBUTING.md) and your commits are signed (`git commit -s -S`). Make sure you read and follow the [Security Best Practices](https://github.com/NVIDIA/Model-Optimizer/blob/main/SECURITY.md#security-coding-practices-for-contributors) (e.g. avoiding hardcoded `trust_remote_code=True`, `torch.load(..., weights_only=False)`, `pickle`, etc.).

- Is this change backward compatible?: N/A
- If you copied code from any other sources or added a new PIP dependency, did you follow guidance in `CONTRIBUTING.md`: N/A
- Did you write any new necessary tests?: N/A
- Did you update [Changelog](https://github.com/NVIDIA/Model-Optimizer/blob/main/CHANGELOG.rst)?: N/A

### Additional Information

<!-- E.g. related issue. -->

<!-- This is an auto-generated comment: release notes by coderabbit.ai -->
## Summary by CodeRabbit

* **Documentation**
  * Replaced a benchmarking-focused section with a deployment guide for running compressed models on vLLM.
  * Added step-by-step setup for using an AnyModel-enabled vLLM fork, including checkout and install guidance and required model config edits (with optional architecture metadata).
  * Simplified runtime to a single vllm serve command, removing manual model rearrangement steps.
  * Restored inference benchmarking as a subsection, retaining vllm bench latency/throughput examples.
<!-- end of auto-generated comment: release notes by coderabbit.ai -->

---------

Signed-off-by: Grzegorz Karch <gkarch@nvidia.com>
Co-authored-by: Keval Morabia <28916987+kevalmorabia97@users.noreply.github.com>
Signed-off-by: Keval Morabia <28916987+kevalmorabia97@users.noreply.github.com>
### What does this PR do? Type of change: bug fix <!-- Use one of the following: Bug fix, new feature, new example, new tests, documentation. --> `lm_eval` does not have `__version__` attribute ### Additional Information <!-- E.g. related issue. --> NVBug 6102101 <!-- This is an auto-generated comment: release notes by coderabbit.ai --> ## Summary by CodeRabbit * **Refactor** * Enhanced the package version detection system to improve overall reliability and stability of the application while reducing unnecessary external dependencies. All functionality, including version gating and system warnings, continues to operate exactly as expected with no impact on the user experience. <!-- end of auto-generated comment: release notes by coderabbit.ai --> Signed-off-by: Keval Morabia <28916987+kevalmorabia97@users.noreply.github.com>
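A sketch of the general approach this kind of fix takes, assuming the version is read from the installed distribution metadata rather than the missing attribute; the helper name below is illustrative and not necessarily what the PR uses.

```python
from importlib.metadata import PackageNotFoundError, version


def detect_lm_eval_version() -> str | None:
    """Return lm_eval's installed version, or None if it is not installed.

    lm_eval does not define `__version__`, so query the package metadata instead.
    """
    try:
        return version("lm_eval")
    except PackageNotFoundError:
        return None
```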
…1289)

Enables TensorRT attention-v2 fusion for vision transformers when exported to ONNX with FP8 Q/DQ. The core library changes are architecture-agnostic (drop-in for any FP8 ONNX export); coverage is exercised by the existing `examples/torch_onnx/torch_quant_to_onnx.py` pipeline.

- **`modelopt/onnx/export/fp8_exporter.py`** — new post-processing passes: move attention-scaling `Mul` and K `Transpose` to the Q-side so DQ feeds MatMul directly, pre-transpose constant weights, and insert FP8 Q/DQ on Softmax outputs (fixed `1/448` scale, data-independent) for MHA-v2 fusion. Rewrites only fire when every downstream consumer is a MatMul so non-attention branches are never perturbed.
- **`modelopt/onnx/utils.py`** — `fold_dq_fp32_to_fp16_casts` / `fold_q_fp16_to_fp32_casts` remove the Cast nodes `convert_float_to_float16` inserts around Q/DQ and rewrite scale initializers to FP16 so TRT fuses DQ into the downstream GEMM. Guarded behind opset >= 19 (FP16 Q/DQ scale requirement). Warns on FP16 overflow/underflow.
- **`modelopt/torch/_deploy/utils/torch_onnx.py`** — calls the fold helpers for FP8-quantized models after `convert_float_to_float16`.
- **`modelopt/torch/quantization/export_onnx.py`** — keeps FP8 Q/DQ scale in the native input dtype so no Cast is emitted between graph and Q/DQ. Removes the now-unused `trt_high_precision_dtype` parameter from `_fp8_quantize`/`_fp8_dequantize`.
- **`modelopt/torch/quantization/nn/modules/quant_layernorm.py`** (new) — registers `nn.LayerNorm` in `QuantModuleRegistry` so LayerNorm output quantizers are honored.
- **`modelopt/torch/quantization/plugins/huggingface.py`** — skips `*Attention` wrappers whose children are also `*Attention` per-instance (not per-class) to avoid double-patching `eager_attention_forward` (e.g. `ViTAttention` vs `ViTSelfAttention`).
- **`examples/torch_onnx/torch_quant_to_onnx.py`** — adds a `_FP8_MHA_OVERRIDE` config block to FP8 mode that enables LayerNorm output quantizer + disables its input quantizer for TRT attention fusion.
- **Unit tests** (12 CPU tests, ~1.2 s total) — fp8_exporter rewrites + fanout safety, fold-cast helpers + opset guard, LayerNorm quant-wrapper identity, per-instance nested-attention detection.

ViT-base-patch16-224, RTX 6000 Ada, strongly-typed FP8 via `trtexec`. Accuracy on 2,000 ImageNet-1k validation samples (streaming).

**Batch = 1 (latency-bound)**

| Model | Top-1 | Top-5 | TRT latency | Speedup |
|---|---|---|---|---|
| FP16 baseline | 80.96% | 95.80% | 0.722 ms | 1.00x |
| Torch FP8 MHA | 80.66% | 95.75% | 0.657 ms | **1.10x** |
| ONNX PTQ FP8 | — | — | 0.589 ms | **1.23x** |

**Batch = 64 (throughput-bound, realistic inference)**

| Model | TRT latency | Speedup | Images/s |
|---|---|---|---|
| FP16 baseline | 23.40 ms | 1.00x | 1152 |
| Torch FP8 MHA | 15.89 ms | **1.47x** | 1152 |
| ONNX PTQ FP8 | 15.89 ms | **1.47x** | 1216 |

Top-1 accuracy stays within 0.30 pp of FP16; at batch=64 the Torch FP8 MHA path matches the ONNX PTQ wall-time — attention is the bottleneck there and both paths achieve full FP8 attention fusion (36/36 attention MatMuls with QDQ in ViT-base).
- [x] CPU unit tests (new): `python -m pytest tests/unit/onnx/quantization/test_fp8_mha_exporter.py tests/unit/onnx/test_fold_casts.py tests/unit/torch/quantization/test_quant_layernorm.py tests/unit/torch/quantization/plugins/test_nested_attention_skip.py`
- [x] Existing ONNX / quantization unit suites unaffected: `python -m pytest tests/unit/onnx tests/unit/torch/quantization`
- [x] End-to-end ViT FP8 export: `python examples/torch_onnx/torch_quant_to_onnx.py --timm_model_name vit_base_patch16_224 --quantize_mode fp8 --onnx_save_path vit_base_fp8.onnx` — expect log lines `Folded 48 weight Transpose nodes`, `Inserted FP8 weight DequantizeLinear for 1 Conv nodes`, and `Attention QDQ rewrites: ... inserted QDQ on 12 Softmax outputs`
- [x] trtexec FP8 strongly-typed build: `trtexec --onnx=vit_base_fp8.onnx --fp8 --stronglyTyped`
- [x] Accuracy within ~0.3 pp of FP16 baseline on ImageNet-1k subset

---------

Signed-off-by: ajrasane <131806219+ajrasane@users.noreply.github.com>
Signed-off-by: Keval Morabia <28916987+kevalmorabia97@users.noreply.github.com>
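To make the Softmax Q/DQ pass above concrete, here is an illustrative sketch (not the `fp8_exporter.py` implementation) of inserting an FP8 Q/DQ pair with the fixed 1/448 scale after a Softmax node. It assumes an onnx version with `FLOAT8E4M3FN` support; a production pass would also keep the graph topologically sorted and deduplicate initializers.

```python
import numpy as np
import onnx
from onnx import TensorProto, helper, numpy_helper


def insert_fp8_qdq_after_softmax(graph: onnx.GraphProto, softmax_node: onnx.NodeProto) -> None:
    """Insert an FP8 (E4M3) QuantizeLinear/DequantizeLinear pair on a Softmax output.

    Softmax outputs lie in [0, 1], so a fixed, data-independent scale of 1/448
    (448 is the E4M3 max) is enough and no calibration pass is required.
    """
    out = softmax_node.output[0]
    scale = numpy_helper.from_array(
        np.array(1.0 / 448.0, dtype=np.float32), name=f"{out}_fp8_scale"
    )
    # FP8 zero-point; assumes an onnx release that accepts FLOAT8E4M3FN in make_tensor.
    zero_point = helper.make_tensor(f"{out}_fp8_zp", TensorProto.FLOAT8E4M3FN, [], [0.0])
    graph.initializer.extend([scale, zero_point])

    q_out, dq_out = f"{out}_q", f"{out}_dq"
    q_node = helper.make_node("QuantizeLinear", [out, scale.name, zero_point.name], [q_out])
    dq_node = helper.make_node("DequantizeLinear", [q_out, scale.name, zero_point.name], [dq_out])

    # Rewire existing consumers of the Softmax output to read the dequantized tensor.
    for node in graph.node:
        for i, name in enumerate(node.input):
            if name == out:
                node.input[i] = dq_out
    graph.node.extend([q_node, dq_node])
    # Note: appending at the end leaves the node list un-sorted; re-topo-sort before saving.
```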
### What does this PR do?

Type of change: Bug fix <!-- Use one of the following: Bug fix, new feature, new example, new tests, documentation. -->

### Usage

Fixes a bug in the example script. We were trying to work out why our models were not that strong at long context; it turns out a recent refactor stopped passing the max sequence length through, so the default of 512 was used.

### Testing

<!-- Mention how have you tested your change if applicable. -->

### Before your PR is "*Ready for review*"

Make sure you read and follow [Contributor guidelines](https://github.com/NVIDIA/Model-Optimizer/blob/main/CONTRIBUTING.md) and your commits are signed (`git commit -s -S`). Make sure you read and follow the [Security Best Practices](https://github.com/NVIDIA/Model-Optimizer/blob/main/SECURITY.md#security-coding-practices-for-contributors) (e.g. avoiding hardcoded `trust_remote_code=True`, `torch.load(..., weights_only=False)`, `pickle`, etc.).

- Is this change backward compatible?: ✅ / ❌ / N/A
- If you copied code from any other sources or added a new PIP dependency, did you follow guidance in `CONTRIBUTING.md`: ✅ / ❌ / N/A
- Did you write any new necessary tests?: ✅ / ❌ / N/A
- Did you update [Changelog](https://github.com/NVIDIA/Model-Optimizer/blob/main/CHANGELOG.rst)?: ✅ / ❌ / N/A

### Additional Information

<!-- E.g. related issue. -->

<!-- This is an auto-generated comment: release notes by coderabbit.ai -->
## Summary by CodeRabbit

## Release Notes

* **Improvements**
  * Calibration data loading now enforces a maximum sequence/sample length during dataset preparation, ensuring calibration inputs adhere to configured length limits. This yields more predictable calibration behavior, reduces peak memory usage during calibration runs, and improves consistency of quantization preprocessing.
<!-- end of auto-generated comment: release notes by coderabbit.ai -->

---------

Signed-off-by: Michael Feil <63565275+michaelfeil@users.noreply.github.com>
Signed-off-by: Keval Morabia <28916987+kevalmorabia97@users.noreply.github.com>
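A minimal sketch of the fixed behavior, assuming a Hugging Face style tokenizer; the function and argument names here are illustrative and not the example script's actual API. The key point is that the configured maximum length is forwarded to tokenization instead of silently falling back to 512.

```python
def build_calib_batches(tokenizer, texts, max_sample_length=2048, batch_size=4):
    """Tokenize calibration texts while enforcing the configured maximum sample length."""
    batches = []
    for i in range(0, len(texts), batch_size):
        enc = tokenizer(
            texts[i : i + batch_size],
            truncation=True,
            max_length=max_sample_length,  # previously not forwarded, so the 512 default applied
            padding=True,  # assumes the tokenizer has a pad token configured
            return_tensors="pt",
        )
        batches.append(enc)
    return batches
```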
## Summary - Removes Mllama (Llama 3.2 Vision) model-type branches from the `llm_ptq` example (`hf_ptq.py`, `example_utils.py`) and drops the now-unused `MllamaImageProcessor` wrapper from `modelopt/torch/utils/`. - Drops the legacy `MllamaImageProcessor` path in `modelopt/torch/utils/vlm_dataset_utils.py`; the generic HF ProcessorMixin path handles the remaining cases. - Adds a CHANGELOG entry under 0.44 Backward Breaking Changes. ## Test plan - [x] CI lint / unit tests pass - [x] Smoke-run ``examples/llm_ptq/scripts/huggingface_example.sh --model <llm> --quant fp8`` (text-only path, non-mllama) <!-- This is an auto-generated comment: release notes by coderabbit.ai --> ## Summary by CodeRabbit * **Chores** * Removed Mllama (Llama 3.2 Vision) support from quantization examples. This includes removal of dedicated image processor implementation, specialized model handling, and related calibration logic. * Updated VLM image-text calibration guidance to use `--calib_with_images` flag with other supported VLMs instead of Mllama-specific processing paths. <!-- end of auto-generated comment: release notes by coderabbit.ai --> Signed-off-by: Chenjie Luo <chenjiel@nvidia.com> Signed-off-by: Keval Morabia <28916987+kevalmorabia97@users.noreply.github.com>
### Type of change

- [x] Bug fix (non-breaking change which fixes an issue)

### Description

Fixes #1088 — `RuntimeError: one of the variables needed for gradient computation has been modified by an inplace operation: IndexPutBackward0` when training with `eagle_mix_hidden_states=True`.

**Root cause:** In `HFEagleModel._eagle_training_forward`, the indexed assignment at line 991–994 modifies `eagle_input_hiddens` in-place while it is still part of the autograd computation graph.

**Fix:** Clone the tensor before the in-place assignment. This is the same pattern already used in the Megatron backend at `megatron_eagle.py:1201-1202`:

```python
# Clone to avoid inplace modification of view created in no_grad mode
eagle_module_input_hidden_states = eagle_module_input_hidden_states.clone()
```

The HF backend was missing this clone.

### Usage

```python
config["eagle_mix_hidden_states"] = True
config["eagle_ttt_steps"] = 2
mtsp.convert(model, mode=[("eagle", config)])
model.train()
outputs = model(input_ids=input_ids, labels=labels)
outputs.loss.backward()  # no longer crashes
```

### Testing

Added `test_eagle_mix_hidden_states_backward` parametrized over `eagle_ttt_steps` [1, 2] that:

- Converts a tiny LLaMA to EAGLE with `eagle_mix_hidden_states=True`
- Runs forward + backward pass
- Asserts loss is not None and gradients flow to `eagle_module`

```
pytest tests/unit/torch/speculative/plugins/test_hf_speculative.py::test_eagle_mix_hidden_states_backward -v
```

### Checklist

- [x] I have read the [contributor guidelines](CONTRIBUTING.md) and signed my commits
- [x] I have followed the [security best practices](SECURITY.md)
- [x] This change is backward compatible
- [x] I have followed third-party code and dependency guidelines
- [x] I have added tests that prove my fix is effective

<!-- This is an auto-generated comment: release notes by coderabbit.ai -->
## Summary by CodeRabbit

* **Bug Fixes**
  * Fixed gradient computation issue in speculative decoding during model training to ensure proper autograd behavior.
* **Tests**
  * Added regression test to validate gradient computation in speculative decoding scenarios.
<!-- end of auto-generated comment: release notes by coderabbit.ai -->

Signed-off-by: javierdejesusda <javier.dejesusj9@gmail.com>
Signed-off-by: Keval Morabia <28916987+kevalmorabia97@users.noreply.github.com>
### What does this PR do?

This PR fixes PTQ with image calibration for VLMs.

### Usage

```python
python3 examples/llm_ptq/hf_ptq.py --pyt_ckpt_path Qwen/Qwen3-VL-8B-Instruct --qformat fp8 --export_path Qwen3-VL-8B-Instruct-fp8 --trust_remote_code --kv_cache_qformat none --calib_with_images --calib_size 512
```

<!-- This is an auto-generated comment: release notes by coderabbit.ai -->
## Summary by CodeRabbit

* **Bug Fixes**
  * Image-text calibration now extends support to additional model architectures when image calibration is enabled.
  * Improved tokenizer truncation handling in multimodal dataset processing to prevent configuration conflicts when image inputs are present.
<!-- end of auto-generated comment: release notes by coderabbit.ai -->

Signed-off-by: Liana Mikaelyan <lmikaelyan@nvidia.com>
Signed-off-by: Keval Morabia <28916987+kevalmorabia97@users.noreply.github.com>
📝 Walkthrough

This pull request deprecates Mllama support in PTQ examples, introduces FP8 multi-head attention quantization optimizations in ONNX export, adds MatMul dimension-based exclusion logic to prevent small-GEMM quantization, refactors weight/bias extraction in quantization exports, updates documentation with third-party license notices, and includes comprehensive test coverage for new quantization features.

Changes
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~70 minutes

🚥 Pre-merge checks: ✅ 5 | ❌ 1

❌ Failed checks (1 warning)
✅ Passed checks (5 passed)
Actionable comments posted: 5
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
modelopt/torch/utils/vlm_dataset_utils.py (1)
441-448: ⚠️ Potential issue | 🔴 Critical: Fix logic bug in max_length condition that prevents truncation from ever being applied.

The condition `"images" not in kwargs` at line 447 will always be `False` because `"images"` is unconditionally added to `kwargs` at line 443. This means the `truncation` and `max_length` parameters are never passed to the processor, making the `max_length` function parameter ineffective.

If the intent is to skip truncation for multimodal cases (when images are present), the condition should be inverted to `"images" in kwargs`. If truncation should always apply when `max_length` is provided, the condition should be removed entirely.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@modelopt/torch/utils/vlm_dataset_utils.py` around lines 441 - 448, The current logic never applies truncation because "images" is always present in kwargs; update the max_length handling so truncation is applied when max_length is provided: remove the `"images" not in kwargs` check and simply, when max_length is not None, call kwargs.update({"truncation": True, "max_length": max_length}) (referencing the kwargs dict, max_length parameter, and the earlier creation that includes "text": list(prompts) and "images": list(images)).
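As a sketch of the resolution suggested in the prompt above (treat it as illustrative; the variable names follow the review comment rather than the file verbatim), the kwargs handling would look like this:

```python
def encode_multimodal_batch(processor, prompts, images, max_length=None):
    """Build processor kwargs so a configured max_length always triggers truncation."""
    kwargs = {"text": list(prompts), "images": list(images)}
    if max_length is not None:
        # The old guard (`"images" not in kwargs`) could never be True here,
        # so truncation/max_length were silently dropped.
        kwargs.update({"truncation": True, "max_length": max_length})
    return processor(**kwargs)
```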
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@examples/torch_onnx/torch_quant_to_onnx.py`:
- Around line 273-274: The current guard on the quantizer uses
torch.any(torch.isnan(amax)) or torch.all(amax <= 0) so it only disables q when
every channel/block is non-positive; change it to disable the quantizer when any
entry is invalid (NaN) or any entry is <= 0. Locate the use of amax and
q.disable() in torch_quant_to_onnx.py and replace the torch.all(amax <= 0)
condition with torch.any(amax <= 0) (or equivalent) so per-channel/per-block
zero/negative values trigger q.disable() and avoid 448/amax infinities.
- Around line 595-600: The auto-mode gate is too broad: change the any(...)
check used to set uses_fp8_conv_input so it only triggers when the
auto_quantization_formats actually include an FP8-family format (not merely
anything other than "INT8_DEFAULT_CFG"); update the condition in the
uses_fp8_conv_input computation to detect FP8-style formats (e.g., check fmt
strings for FP8-family identifiers like "fp8", "mxfp8", "nvfp4" or a specific
FP8-format set) and only call
_disable_low_channel_conv_input_quantizers(quantized_model) when such an
FP8-family format is present, keeping the existing references to
args.quantize_mode, args.auto_quantization_formats, uses_fp8_conv_input and
_disable_low_channel_conv_input_quantizers.
In `@modelopt/onnx/export/fp8_exporter.py`:
- Around line 88-122: The transpose-folding code currently rewires Transpose to
consume dq_op.outputs[0] without ensuring that dq_op.outputs[0] has no other
live consumers; update the logic around dq_op, transpose_to_remove and
cast_to_remove to first verify that the DQ output node (dq_op.outputs[0]) has
exactly one live consumer (the candidate Transpose or Cast→Transpose) before
mutating torch_weights and setting transpose_to_remove/cast_to_remove; if there
are any other consumers, skip folding and leave
cast_to_remove/transpose_to_remove as None. Apply the same single-consumer check
to the analogous block later in the file (the second transpose-folding
occurrence).
In `@modelopt/onnx/utils.py`:
- Around line 1422-1437: The underflow/overflow warning in _scale_fp32_to_fp16
is too coarse: change the condition to check elementwise where a non-zero FP32
value becomes zero in FP16 by replacing the combined np.any(...) logic with an
elementwise mask (e.g., mask = (fp16_data == 0) & (scale_data != 0)) and warn
only if np.any(np.isinf(fp16_data)) or np.any(mask); keep the existing inf check
and the rest of the in-place conversion logic using scale_init, fp16_data and
scale_data unchanged.
- Around line 1440-1474: fold_q_fp16_to_fp32_casts currently treats any
Cast(..., to=FLOAT) as FP16→FP32 and rewrites Q scales; update it to first
verify the cast source is actually FP16 before proceeding: for each Cast node
(in function fold_q_fp16_to_fp32_casts) look up the input tensor's dtype (from
graph.value_info, graph.input, or initializers) and only continue when that
dtype == onnx.TensorProto.FLOAT16; keep the existing logic that calls
_scale_fp32_to_fp16(initializers[...]) and _bypass_cast_node(onnx_model, node)
but skip/break for casts from BF16 or other dtypes so you do not rewrite Q
scales for non-FP16→FP32 casts.
---
Outside diff comments:
In `@modelopt/torch/utils/vlm_dataset_utils.py`:
- Around line 441-448: The current logic never applies truncation because
"images" is always present in kwargs; update the max_length handling so
truncation is applied when max_length is provided: remove the `"images" not in
kwargs` check and simply, when max_length is not None, call
kwargs.update({"truncation": True, "max_length": max_length}) (referencing the
kwargs dict, max_length parameter, and the earlier creation that includes
"text": list(prompts) and "images": list(images)).
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: 8d54ab83-f2cc-4766-b089-ee6c30d50c0e
📒 Files selected for processing (31)
- CHANGELOG.rst
- LICENSE_HEADER
- README.md
- docs/source/getting_started/_installation_for_Linux.rst
- docs/source/getting_started/windows/_installation_for_Windows.rst
- examples/llm_eval/lm_eval_hf.py
- examples/llm_ptq/example_utils.py
- examples/llm_ptq/hf_ptq.py
- examples/puzzletron/README.md
- examples/torch_onnx/torch_quant_to_onnx.py
- examples/vllm_serve/vllm_reload_utils.py
- modelopt/onnx/export/fp8_exporter.py
- modelopt/onnx/quantization/graph_utils.py
- modelopt/onnx/utils.py
- modelopt/torch/_deploy/utils/torch_onnx.py
- modelopt/torch/export/plugins/vllm_fakequant_megatron.py
- modelopt/torch/export/unified_export_megatron.py
- modelopt/torch/quantization/export_onnx.py
- modelopt/torch/quantization/nn/__init__.py
- modelopt/torch/quantization/nn/modules/quant_layernorm.py
- modelopt/torch/quantization/plugins/huggingface.py
- modelopt/torch/quantization/plugins/vllm.py
- modelopt/torch/speculative/plugins/transformers.py
- modelopt/torch/utils/image_processor.py
- modelopt/torch/utils/vlm_dataset_utils.py
- tests/unit/onnx/quantization/test_fp8_mha_exporter.py
- tests/unit/onnx/quantization/test_graph_utils.py
- tests/unit/onnx/test_fold_casts.py
- tests/unit/torch/quantization/plugins/test_nested_attention_skip.py
- tests/unit/torch/quantization/test_quant_layernorm.py
- tests/unit/torch/speculative/plugins/test_hf_speculative.py
💤 Files with no reviewable changes (1)
- modelopt/torch/utils/image_processor.py
    if torch.any(torch.isnan(amax)) or torch.all(amax <= 0):
        q.disable()
Handle partially-dead amax tensors too.
This only disables a quantizer when all amax entries are <= 0, but FP8 export will still blow up if a per-channel/per-block quantizer has just one zero/negative entry. That makes the current guard miss exactly the mixed-validity case that still produces 448 / amax infinities.
Proposed fix
- if torch.any(torch.isnan(amax)) or torch.all(amax <= 0):
+ if torch.any(torch.isnan(amax)) or torch.any(amax <= 0):
          q.disable()

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@examples/torch_onnx/torch_quant_to_onnx.py` around lines 273 - 274, The
current guard on the quantizer uses torch.any(torch.isnan(amax)) or
torch.all(amax <= 0) so it only disables q when every channel/block is
non-positive; change it to disable the quantizer when any entry is invalid (NaN)
or any entry is <= 0. Locate the use of amax and q.disable() in
torch_quant_to_onnx.py and replace the torch.all(amax <= 0) condition with
torch.any(amax <= 0) (or equivalent) so per-channel/per-block zero/negative
values trigger q.disable() and avoid 448/amax infinities.
    uses_fp8_conv_input = args.quantize_mode in ("fp8", "mxfp8", "nvfp4") or (
        args.quantize_mode == "auto"
        and any(fmt != "INT8_DEFAULT_CFG" for fmt in args.auto_quantization_formats)
    )
    if uses_fp8_conv_input:
        _disable_low_channel_conv_input_quantizers(quantized_model)
Narrow the auto-mode gate to actual FP8-family formats.
any(fmt != "INT8_DEFAULT_CFG" ...) also fires for INT4_AWQ_CFG, so auto mode will disable low-channel Conv input quantizers even when no FP8-style format is in the search space. That changes the search budget/accuracy for a workaround that is only needed for TRT_FP8QuantizeLinear.
Proposed fix
+ fp8_family_formats = {
+ "FP8_DEFAULT_CFG",
+ "MXFP8_DEFAULT_CFG",
+ "NVFP4_AWQ_LITE_CFG",
+ "NVFP4_DEFAULT_CFG",
+ }
uses_fp8_conv_input = args.quantize_mode in ("fp8", "mxfp8", "nvfp4") or (
args.quantize_mode == "auto"
- and any(fmt != "INT8_DEFAULT_CFG" for fmt in args.auto_quantization_formats)
+ and any(fmt in fp8_family_formats for fmt in args.auto_quantization_formats)
      )

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@examples/torch_onnx/torch_quant_to_onnx.py` around lines 595 - 600, The
auto-mode gate is too broad: change the any(...) check used to set
uses_fp8_conv_input so it only triggers when the auto_quantization_formats
actually include an FP8-family format (not merely anything other than
"INT8_DEFAULT_CFG"); update the condition in the uses_fp8_conv_input computation
to detect FP8-style formats (e.g., check fmt strings for FP8-family identifiers
like "fp8", "mxfp8", "nvfp4" or a specific FP8-format set) and only call
_disable_low_channel_conv_input_quantizers(quantized_model) when such an
FP8-family format is present, keeping the existing references to
args.quantize_mode, args.auto_quantization_formats, uses_fp8_conv_input and
_disable_low_channel_conv_input_quantizers.
    # Pre-transpose constant weights if DQ feeds ``Transpose → MatMul`` (or
    # ``Cast → Transpose → MatMul`` after fp16 conversion) so TRT sees DQ→MatMul.
    # Control flow: scan candidates; a Cast-wrapped candidate is accepted only if it
    # leads to a Transpose; a bare Transpose whose all consumers are MatMul wins and
    # breaks the loop. Any other shape defaults `cast_to_remove` back to None and
    # continues scanning.
    transpose_to_remove = None
    cast_to_remove = None
    for candidate in list(dq_op.outputs[0].outputs):
        if candidate.op == "Cast":
            cast_to_remove = candidate
            candidate = next(
                (c for c in candidate.outputs[0].outputs if c.op == "Transpose"),
                None,
            )
            if candidate is None:
                cast_to_remove = None
                continue
        if candidate.op != "Transpose":
            cast_to_remove = None
            continue
        t_consumers = list(candidate.outputs[0].outputs)
        # Only fold the transpose when every downstream consumer is MatMul; otherwise
        # non-MatMul consumers would observe the un-transposed weights.
        if t_consumers and all(c.op == "MatMul" for c in t_consumers):
            perm = candidate.attrs.get("perm", None)
            torch_weights = (
                torch_weights.permute(*perm).contiguous()
                if perm is not None
                else torch_weights.T.contiguous()
            )
            transpose_to_remove = candidate
        else:
            cast_to_remove = None
        break
Only fold the transpose when the DQ output has no other live consumers.
This rewrite pre-transposes the stored weight and then rewires the Transpose branch to consume dq_op.outputs[0] directly, but it never checks whether that same DQ output also feeds some other branch. If it does, that branch now starts seeing the transposed weight too, which is a silent graph corruption.
Proposed fix
transpose_to_remove = None
cast_to_remove = None
- for candidate in list(dq_op.outputs[0].outputs):
+ dq_consumers = list(dq_op.outputs[0].outputs)
+ for candidate in dq_consumers:
if candidate.op == "Cast":
cast_to_remove = candidate
candidate = next(
(c for c in candidate.outputs[0].outputs if c.op == "Transpose"),
None,
@@
if t_consumers and all(c.op == "MatMul" for c in t_consumers):
+ if len(dq_consumers) != 1:
+ transpose_to_remove = None
+ cast_to_remove = None
+ break
perm = candidate.attrs.get("perm", None)
torch_weights = (
torch_weights.permute(*perm).contiguous()
if perm is not None
                else torch_weights.T.contiguous()

Also applies to: 142-151
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@modelopt/onnx/export/fp8_exporter.py` around lines 88 - 122, The
transpose-folding code currently rewires Transpose to consume dq_op.outputs[0]
without ensuring that dq_op.outputs[0] has no other live consumers; update the
logic around dq_op, transpose_to_remove and cast_to_remove to first verify that
the DQ output node (dq_op.outputs[0]) has exactly one live consumer (the
candidate Transpose or Cast→Transpose) before mutating torch_weights and setting
transpose_to_remove/cast_to_remove; if there are any other consumers, skip
folding and leave cast_to_remove/transpose_to_remove as None. Apply the same
single-consumer check to the analogous block later in the file (the second
transpose-folding occurrence).
def _scale_fp32_to_fp16(scale_init: onnx.TensorProto) -> None:
    """Convert a scalar Q/DQ scale initializer in-place from FP32 to FP16.

    Warns if any non-zero scale saturates to 0/inf in FP16 (out of FP16 representable range).
    """
    if scale_init.data_type != onnx.TensorProto.FLOAT:
        return
    scale_data = np.frombuffer(scale_init.raw_data, dtype=np.float32)
    if not scale_data.size:
        scale_data = np.array(scale_init.float_data, dtype=np.float32)
    fp16_data = scale_data.astype(np.float16)
    if np.any(np.isinf(fp16_data)) or (np.any(fp16_data == 0) and np.any(scale_data != 0)):
        logger.warning(f"Q/DQ scale '{scale_init.name}' overflows or underflows when cast to FP16")
    scale_init.data_type = onnx.TensorProto.FLOAT16
    scale_init.raw_data = fp16_data.tobytes()
    del scale_init.float_data[:]
Make the underflow warning elementwise.
np.any(fp16_data == 0) and np.any(scale_data != 0) fires whenever the tensor contains any real zero plus any non-zero value. That produces false overflow/underflow warnings for mixed scale tensors.
Suggested fix
- if np.any(np.isinf(fp16_data)) or (np.any(fp16_data == 0) and np.any(scale_data != 0)):
+ if np.any(np.isinf(fp16_data)) or np.any((fp16_data == 0) & (scale_data != 0)):
logger.warning(f"Q/DQ scale '{scale_init.name}' overflows or underflows when cast to FP16")📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
def _scale_fp32_to_fp16(scale_init: onnx.TensorProto) -> None:
    """Convert a scalar Q/DQ scale initializer in-place from FP32 to FP16.

    Warns if any non-zero scale saturates to 0/inf in FP16 (out of FP16 representable range).
    """
    if scale_init.data_type != onnx.TensorProto.FLOAT:
        return
    scale_data = np.frombuffer(scale_init.raw_data, dtype=np.float32)
    if not scale_data.size:
        scale_data = np.array(scale_init.float_data, dtype=np.float32)
    fp16_data = scale_data.astype(np.float16)
    if np.any(np.isinf(fp16_data)) or np.any((fp16_data == 0) & (scale_data != 0)):
        logger.warning(f"Q/DQ scale '{scale_init.name}' overflows or underflows when cast to FP16")
    scale_init.data_type = onnx.TensorProto.FLOAT16
    scale_init.raw_data = fp16_data.tobytes()
    del scale_init.float_data[:]
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@modelopt/onnx/utils.py` around lines 1422 - 1437, The underflow/overflow
warning in _scale_fp32_to_fp16 is too coarse: change the condition to check
elementwise where a non-zero FP32 value becomes zero in FP16 by replacing the
combined np.any(...) logic with an elementwise mask (e.g., mask = (fp16_data ==
0) & (scale_data != 0)) and warn only if np.any(np.isinf(fp16_data)) or
np.any(mask); keep the existing inf check and the rest of the in-place
conversion logic using scale_init, fp16_data and scale_data unchanged.
def fold_q_fp16_to_fp32_casts(onnx_model: onnx.ModelProto) -> onnx.ModelProto:
    """Remove ``Cast(FP16→FP32) → Q`` patterns inserted by ``convert_float_to_float16``.

    The Q scale is rewritten to FP16 so Q consumes the FP16 graph directly. Skipped for
    opsets below ``BASE_MIN_OPSET`` since FP16 Q scales require opset >= 19.
    """
    if get_opset_version(onnx_model) < BASE_MIN_OPSET:
        logger.debug(
            f"Skipping fold_q_fp16_to_fp32_casts: opset < {BASE_MIN_OPSET} (FP16 Q scale unsupported)"
        )
        return onnx_model

    consumer_map: dict[str, list[onnx.NodeProto]] = {}
    for node in onnx_model.graph.node:
        for inp in node.input:
            consumer_map.setdefault(inp, []).append(node)
    initializers = {init.name: init for init in onnx_model.graph.initializer}

    to_remove = []
    for node in onnx_model.graph.node:
        if node.op_type != "Cast":
            continue
        cast_to = next((a.i for a in node.attribute if a.name == "to"), None)
        if cast_to != onnx.TensorProto.FLOAT:
            continue
        consumers = consumer_map.get(node.output[0], [])
        if not consumers or not all(c.op_type in _Q_OPS for c in consumers):
            continue

        for q_node in consumers:
            if len(q_node.input) >= 2 and q_node.input[1] in initializers:
                _scale_fp32_to_fp16(initializers[q_node.input[1]])

        _bypass_cast_node(onnx_model, node)
        to_remove.append(node)
Guard this fold to actual FP16→FP32 casts.
The implementation only checks Cast(..., to=FLOAT), so it will also bypass BF16→FP32 or any other *→FP32 cast and then rewrite the Q scale to FP16. That changes semantics and can leave the quantizer consuming a dtype this pass was not meant to legalize.
Suggested fix
def fold_q_fp16_to_fp32_casts(onnx_model: onnx.ModelProto) -> onnx.ModelProto:
@@
- consumer_map: dict[str, list[onnx.NodeProto]] = {}
+ consumer_map: dict[str, list[onnx.NodeProto]] = {}
for node in onnx_model.graph.node:
for inp in node.input:
consumer_map.setdefault(inp, []).append(node)
initializers = {init.name: init for init in onnx_model.graph.initializer}
+ type_map = _build_tensor_type_map(onnx_model)
@@
cast_to = next((a.i for a in node.attribute if a.name == "to"), None)
if cast_to != onnx.TensorProto.FLOAT:
continue
+ if type_map.get(node.input[0]) != onnx.TensorProto.FLOAT16:
+ continue
consumers = consumer_map.get(node.output[0], [])
if not consumers or not all(c.op_type in _Q_OPS for c in consumers):
            continue

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@modelopt/onnx/utils.py` around lines 1440 - 1474, fold_q_fp16_to_fp32_casts
currently treats any Cast(..., to=FLOAT) as FP16→FP32 and rewrites Q scales;
update it to first verify the cast source is actually FP16 before proceeding:
for each Cast node (in function fold_q_fp16_to_fp32_casts) look up the input
tensor's dtype (from graph.value_info, graph.input, or initializers) and only
continue when that dtype == onnx.TensorProto.FLOAT16; keep the existing logic
that calls _scale_fp32_to_fp16(initializers[...]) and
_bypass_cast_node(onnx_model, node) but skip/break for casts from BF16 or other
dtypes so you do not rewrite Q scales for non-FP16→FP32 casts.
Codecov Report

❌ Patch coverage is

Additional details and impacted files

@@ Coverage Diff @@
## release/0.44.0 #1350 +/- ##
==================================================
+ Coverage 75.38% 75.87% +0.49%
==================================================
Files 462 462
Lines 49960 50173 +213
==================================================
+ Hits 37662 38069 +407
+ Misses 12298 12104 -194
Flags with carried forward coverage won't be shown. ☔ View full report in Codecov by Sentry.
Adds `.claude/skills/release-cherry-pick/SKILL.md` — a Claude Code skill for cherry-picking labeled PRs to a release branch. Invoke with `/release-cherry-pick <version>`. See this PR created with the skill: #1350 <!-- This is an auto-generated comment: release notes by coderabbit.ai --> ## Summary by CodeRabbit * **New Features** * Added automated release cherry-pick workflow to streamline selecting and applying multiple PRs into release branches. <!-- end of auto-generated comment: release notes by coderabbit.ai --> --------- Signed-off-by: Keval Morabia <28916987+kevalmorabia97@users.noreply.github.com> Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>
Cherry-picked PRs
Summary by CodeRabbit
Release Notes
Documentation
Deprecations
`--calib_with_images` with supported models.

Bug Fixes
New Features