
[Cherry-pick] PRs #1256 #1305 #1322 #1317 #1321 #1289 #1311 #1332 #1104 #1318 #1350

Merged
kevalmorabia97 merged 10 commits into release/0.44.0 from
cherry-picks/release-0.44.0
Apr 27, 2026

Conversation


@kevalmorabia97 kevalmorabia97 commented Apr 27, 2026

Cherry-picked PRs

Summary by CodeRabbit

Release Notes

  • Documentation

    • Updated installation guides with third-party software license disclaimers.
    • Added vLLM deployment instructions for model deployment.
    • Expanded NGC container image recommendations.
  • Deprecations

    • Deprecated Mllama/vision model image processor support; users are directed to use --calib_with_images with supported models.
  • Bug Fixes

    • Fixed ONNX quantization to exclude small MatMul/Gemm operations from INT8/FP8 quantization.
    • Improved FP8 export with enhanced cast-folding and attention fusion optimizations.
  • New Features

    • Added LayerNorm quantization support for improved FP8 attention quantization.

nv-samcheng and others added 10 commits April 27, 2026 07:07
### What does this PR do?
Exclude small-dimension MatMul nodes from INT8 quantization. MatMuls
with N or K < 16 cannot efficiently use INT8, causing performance
regressions.
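The exclusion rule can be sketched as a simple shape check (a hypothetical helper; the actual implementation derives shapes from the ONNX graph via inference and honors Gemm's transB attribute):

```python
def should_exclude_matmul(shape_a, shape_b, trans_b=False, threshold=16):
    """Return True when a MatMul/Gemm should be kept out of INT8 quantization.

    shape_a is the (..., M, K) activation shape; shape_b is the (K, N)
    weight shape, or (N, K) when trans_b is set (Gemm's transB attribute).
    MatMuls with K or N below the threshold cannot use INT8 efficiently.
    """
    k = shape_a[-1]
    n = shape_b[-2] if trans_b else shape_b[-1]
    return k < threshold or n < threshold
```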

### Before your PR is "*Ready for review*"

Make sure you read and follow [Contributor
guidelines](https://github.com/NVIDIA/Model-Optimizer/blob/main/CONTRIBUTING.md)
and your commits are signed (`git commit -s -S`).

Make sure you read and follow the [Security Best
Practices](https://github.com/NVIDIA/Model-Optimizer/blob/main/SECURITY.md#security-coding-practices-for-contributors)
(e.g. avoiding hardcoded `trust_remote_code=True`, `torch.load(...,
weights_only=False)`, `pickle`, etc.).

- Is this change backward compatible?: ✅ / ❌ / N/A <!--- If ❌, explain
why. -->
- If you copied code from any other sources or added a new PIP
dependency, did you follow guidance in `CONTRIBUTING.md`: ✅ / ❌ / N/A
<!--- Mandatory -->
- Did you write any new necessary tests?: ✅ / ❌ / N/A <!--- Mandatory
for new features or examples. -->
- Did you update
[Changelog](https://github.com/NVIDIA/Model-Optimizer/blob/main/CHANGELOG.rst)?:
✅ / ❌ / N/A <!--- Only for new features, API changes, critical bug fixes
or backward incompatible changes. -->

### Additional Information
<!-- E.g. related issue. -->

<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->
## Summary by CodeRabbit

* **Bug Fixes**
* Improved quantization exclusions so MatMul/Gemm ops with derived K<16
or N<16 are skipped, honoring Gemm transB, using inferred and
runtime-determined shapes, and avoiding duplicate outputs.

* **Tests**
* Expanded unit tests to cover constant, inferred, and runtime-derived
shapes, Gemm transB behavior, small-dimension edge cases, and output
deduplication.

* **Documentation**
* Added changelog entry documenting the new small-dimension exclusion
thresholds and transB handling.
<!-- end of auto-generated comment: release notes by coderabbit.ai -->

---------

Signed-off-by: samcheng <samcheng@nvidia.com>
Signed-off-by: Keval Morabia <28916987+kevalmorabia97@users.noreply.github.com>
### What does this PR do?

Type of change: Bug fix
Fixes two bugs in the vLLM + Megatron-Core MoE export path and cleans up
the related weight-collection helper:

1. **`_QuantFusedMoEBase` (vllm.py)**: The weight-quantizer path in
`_invoke_fused_moe_quantized_function` was temporarily mutating
`self.w13_weight` / `self.w2_weight` to the quantized tensor, then
restoring them via `finally`. This exposed a stale quantized tensor on
`self` between the mutation and the kernel call. Fixed by computing the
quantized weight directly into a local `B` without touching `self.*`
attributes.
2. **`GPTModelExporter` / `VllmFqGPTModelExporter`
(unified_export_megatron.py / vllm_fakequant_megatron.py)**:
`expert_bias` (present in grouped MoE layers) was silently dropped
during export because the bias collection ran after the early-return on
missing `weight`. Extracted a `_get_weight_bias` helper that collects
weight, bias, and expert_bias together, so bias/expert_bias are captured
even when weight is absent or zero-element.

### Usage

```python
# No API change; export pipelines pick this up automatically.
# export_mcore_gpt_to_hf_vllm_fq / export_mcore_gpt_to_hf now correctly
# export expert_bias for grouped-MoE checkpoints.
```

### Testing
Step 1 — Quantize (run from Megatron-LM
examples/post_training/modelopt):
```
HF_MODEL_CKPT=<path/to/hf/weights> MLM_MODEL_SAVE=<quant-ckpt-name> \
bash quantize.sh <hf-model-id> NVFP4_DEFAULT_CFG
```
Step 2 — Export for vLLM fakequant:
```
MLM_EXTRA_ARGS=--export-vllm-fq \
HF_MODEL_CKPT=<path/to/hf/weights> \
MLM_MODEL_CKPT=<quant-ckpt-name> \
EXPORT_DIR=<export-dir> \
bash export.sh <hf-model-id>
```
Step 3 — Serve (run from examples/vllm_serve):
```
QUANT_CFG=NVFP4_DEFAULT_CFG \
QUANT_FILE_PATH=<export-dir>/quantizer_state.pth \
python3 vllm_serve_fakequant.py <export-dir> \
  -tp 1 --served-model-name <model-name> \
  --host 0.0.0.0 --port 8000 \
  --trust-remote-code --enforce-eager \
  --disable-custom-all-reduce \
  --gpu-memory-utilization 0.8
```
### Before your PR is "*Ready for review*"

Make sure you read and follow [Contributor
guidelines](https://github.com/NVIDIA/Model-Optimizer/blob/main/CONTRIBUTING.md)
and your commits are signed (`git commit -s -S`).

Make sure you read and follow the [Security Best
Practices](https://github.com/NVIDIA/Model-Optimizer/blob/main/SECURITY.md#security-coding-practices-for-contributors)
(e.g. avoiding hardcoded `trust_remote_code=True`, `torch.load(...,
weights_only=False)`, `pickle`, etc.).

- Is this change backward compatible?: ✅
- If you copied code from any other sources or added a new PIP
dependency, did you follow guidance in `CONTRIBUTING.md`: N/A
- Did you write any new necessary tests?: ❌
- Did you update
[Changelog](https://github.com/NVIDIA/Model-Optimizer/blob/main/CHANGELOG.rst)?:
N/A

### Additional Information
<!-- E.g. related issue. -->

<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->
## Summary by CodeRabbit

* **Refactor**
* Centralized weight/bias/expert-bias extraction and export to a single
helper for consistent handling.
* Standardized quantized-weight flow to temporarily swap and restore
parameter tensors during computation.

* **Bug Fixes**
* Prevented missing or incorrect weight/bias exports by unifying
extraction logic.
* Broadened checkpoint key matching to preserve more quantizer state
during reloads.
<!-- end of auto-generated comment: release notes by coderabbit.ai -->

---------

Signed-off-by: Kinjal Patel <kinjalpravin@nvidia.com>
Signed-off-by: Keval Morabia <28916987+kevalmorabia97@users.noreply.github.com>
…1322)

## Summary

- Replaces the old pip license notice ("Please review the license terms
of ModelOpt and any dependencies before use") with the Legal-approved
wording: "Model Optimizer will download and install additional
third-party open source software projects. Review the license terms of
these open source projects before use."
- Adds a generic container license review notice ("Before pulling and
using the container images, please review their respective license
terms.") to the Linux installation doc (Docker tab) and README.
- Adds a `.. note::` with the pip notice to the Windows installation
page (covers both standalone and Olive child pages).
- Expands the README container section to explicitly list all four
recommended NVIDIA container images (`pytorch`, `nemo`, `tensorrt-llm`,
`tensorrt`).

## Test plan

- [x] Verify rendered docs look correct (`nox -s docs`)
- [x] Confirm legal notices appear in Linux, Windows, and README install
sections

🤖 Generated with [Claude Code](https://claude.com/claude-code)

<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->
## Summary by CodeRabbit

* **Documentation**
* Updated installation guides with explicit references to supported
NVIDIA container images (PyTorch, NeMo, TensorRT-LLM and variants),
clarified pre-installed Model Optimizer in some images, and added notes
to review each container’s license terms; clarified conditional
environment setup wording and local install license guidance.
* **Chores**
  * Updated project license header year.
<!-- end of auto-generated comment: release notes by coderabbit.ai -->

---------

Signed-off-by: Keval Morabia <28916987+kevalmorabia97@users.noreply.github.com>
Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
Signed-off-by: Keval Morabia <28916987+kevalmorabia97@users.noreply.github.com>
### What does this PR do?

Type of change: documentation.

This PR updates vLLM deployment instructions, taking into account
heterogeneous models created with AnyModel.

### Usage

Does not apply.

### Testing

Run the updated instructions in the documentation.

### Before your PR is "*Ready for review*"

Make sure you read and follow [Contributor
guidelines](https://github.com/NVIDIA/Model-Optimizer/blob/main/CONTRIBUTING.md)
and your commits are signed (`git commit -s -S`).

Make sure you read and follow the [Security Best
Practices](https://github.com/NVIDIA/Model-Optimizer/blob/main/SECURITY.md#security-coding-practices-for-contributors)
(e.g. avoiding hardcoded `trust_remote_code=True`, `torch.load(...,
weights_only=False)`, `pickle`, etc.).

- Is this change backward compatible?: N/A
- If you copied code from any other sources or added a new PIP
dependency, did you follow guidance in `CONTRIBUTING.md`: N/A
- Did you write any new necessary tests?: N/A
- Did you update
[Changelog](https://github.com/NVIDIA/Model-Optimizer/blob/main/CHANGELOG.rst)?:
N/A

### Additional Information
<!-- E.g. related issue. -->

<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->
## Summary by CodeRabbit

* **Documentation**
* Replaced a benchmarking-focused section with a deployment guide for
running compressed models on vLLM.
* Added step-by-step setup for using an AnyModel-enabled vLLM fork,
including checkout and install guidance and required model config edits
(with optional architecture metadata).
* Simplified runtime to a single vllm serve command, removing manual
model rearrangement steps.
* Restored inference benchmarking as a subsection, retaining vllm bench
latency/throughput examples.
<!-- end of auto-generated comment: release notes by coderabbit.ai -->

---------

Signed-off-by: Grzegorz Karch <gkarch@nvidia.com>
Co-authored-by: Keval Morabia <28916987+kevalmorabia97@users.noreply.github.com>
Signed-off-by: Keval Morabia <28916987+kevalmorabia97@users.noreply.github.com>
### What does this PR do?

Type of change: Bug fix

`lm_eval` does not have `__version__` attribute
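A robust alternative, sketched here, is to query the installed distribution's metadata instead of a module attribute the package may not define:

```python
from importlib.metadata import PackageNotFoundError, version


def get_package_version(name):
    """Look up an installed package's version from distribution metadata.

    Works even when the package defines no __version__ attribute
    (as is the case for lm_eval). Returns None if not installed.
    """
    try:
        return version(name)
    except PackageNotFoundError:
        return None
```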

### Additional Information
<!-- E.g. related issue. -->
NVBug 6102101

<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->

## Summary by CodeRabbit

* **Refactor**
* Enhanced the package version detection system to improve overall
reliability and stability of the application while reducing unnecessary
external dependencies. All functionality, including version gating and
system warnings, continues to operate exactly as expected with no impact
on the user experience.

<!-- end of auto-generated comment: release notes by coderabbit.ai -->

Signed-off-by: Keval Morabia <28916987+kevalmorabia97@users.noreply.github.com>
…1289)

Enables TensorRT attention-v2 fusion for vision transformers when
exported to ONNX with FP8 Q/DQ. The core library changes are
architecture-agnostic (drop-in for any FP8 ONNX export); coverage is
exercised by the existing `examples/torch_onnx/torch_quant_to_onnx.py`
pipeline.

- **`modelopt/onnx/export/fp8_exporter.py`** — new post-processing
passes: move attention-scaling `Mul` and K `Transpose` to the Q-side so
DQ feeds MatMul directly, pre-transpose constant weights, and insert FP8
Q/DQ on Softmax outputs (fixed `1/448` scale, data-independent) for
MHA-v2 fusion. Rewrites only fire when every downstream consumer is a
MatMul so non-attention branches are never perturbed.
- **`modelopt/onnx/utils.py`** — `fold_dq_fp32_to_fp16_casts` /
`fold_q_fp16_to_fp32_casts` remove the Cast nodes
`convert_float_to_float16` inserts around Q/DQ and rewrite scale
initializers to FP16 so TRT fuses DQ into the downstream GEMM. Guarded
behind opset >= 19 (FP16 Q/DQ scale requirement). Warns on FP16
overflow/underflow.
- **`modelopt/torch/_deploy/utils/torch_onnx.py`** — calls the fold
helpers for FP8-quantized models after `convert_float_to_float16`.
- **`modelopt/torch/quantization/export_onnx.py`** — keeps FP8 Q/DQ
scale in the native input dtype so no Cast is emitted between graph and
Q/DQ. Removes the now-unused `trt_high_precision_dtype` parameter from
`_fp8_quantize`/`_fp8_dequantize`.
- **`modelopt/torch/quantization/nn/modules/quant_layernorm.py`** (new)
— registers `nn.LayerNorm` in `QuantModuleRegistry` so LayerNorm output
quantizers are honored.
- **`modelopt/torch/quantization/plugins/huggingface.py`** — skips
`*Attention` wrappers whose children are also `*Attention` per-instance
(not per-class) to avoid double-patching `eager_attention_forward` (e.g.
`ViTAttention` vs `ViTSelfAttention`).
- **`examples/torch_onnx/torch_quant_to_onnx.py`** — adds a
`_FP8_MHA_OVERRIDE` config block to FP8 mode that enables LayerNorm
output quantizer + disables its input quantizer for TRT attention
fusion.
- **Unit tests** (12 CPU tests, ~1.2s total) — fp8_exporter rewrites +
fanout safety, fold-cast helpers + opset guard, LayerNorm quant-wrapper
identity, per-instance nested-attention detection.
Benchmark setup: ViT-base-patch16-224 on RTX 6000 Ada, strongly-typed FP8 via `trtexec`.
Accuracy measured on 2,000 ImageNet-1k validation samples (streaming).

**Batch = 1 (latency-bound)**
| Model | Top-1 | Top-5 | TRT latency | Speedup |
|---|---|---|---|---|
| FP16 baseline | 80.96% | 95.80% | 0.722 ms | 1.00x |
| Torch FP8 MHA | 80.66% | 95.75% | 0.657 ms | **1.10x** |
| ONNX PTQ FP8 | — | — | 0.589 ms | **1.23x** |

**Batch = 64 (throughput-bound, realistic inference)**
| Model | TRT latency | Speedup | Images/s |
|---|---|---|---|
| FP16 baseline | 23.40 ms | 1.00x | 1152 |
| Torch FP8 MHA | 15.89 ms | **1.47x** | 1152 |
| ONNX PTQ FP8 | 15.89 ms | **1.47x** | 1216 |

Top-1 accuracy stays within 0.30 pp of FP16; at batch=64 the Torch FP8
MHA path matches ONNX PTQ wall-time — attention is the bottleneck there
and both paths achieve full FP8 attention fusion (36/36 attention
MatMuls with QDQ in ViT-base).
- [x] CPU unit tests (new): `python -m pytest
tests/unit/onnx/quantization/test_fp8_mha_exporter.py
tests/unit/onnx/test_fold_casts.py
tests/unit/torch/quantization/test_quant_layernorm.py
tests/unit/torch/quantization/plugins/test_nested_attention_skip.py`
- [x] Existing ONNX / quantization unit suites unaffected: `python -m
pytest tests/unit/onnx tests/unit/torch/quantization`
- [x] End-to-end ViT FP8 export: `python
examples/torch_onnx/torch_quant_to_onnx.py --timm_model_name
vit_base_patch16_224 --quantize_mode fp8 --onnx_save_path
vit_base_fp8.onnx` — expect log lines `Folded 48 weight Transpose
nodes`, `Inserted FP8 weight DequantizeLinear for 1 Conv nodes`, and
`Attention QDQ rewrites: ... inserted QDQ on 12 Softmax outputs`
- [x] trtexec FP8 strongly-typed build: `trtexec
--onnx=vit_base_fp8.onnx --fp8 --stronglyTyped`
- [x] Accuracy within ~0.3 pp of FP16 baseline on ImageNet-1k subset
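The cast-folding behavior from `modelopt/onnx/utils.py` can be illustrated on a toy graph representation (dicts standing in for ONNX nodes with a single input/output each; the real passes operate on the ONNX graph and also rewrite scale initializers to FP16):

```python
def fold_dq_casts(nodes, opset):
    """Remove Cast nodes that directly consume a DequantizeLinear output,
    rewiring downstream consumers to the DQ output.

    Toy sketch of the fold pass: guarded behind opset >= 19, which is
    when FP16 Q/DQ scales are permitted.
    """
    if opset < 19:
        return nodes
    folded = []
    rename = {}  # folded Cast output name -> DQ output name
    for n in nodes:
        feeds_from_dq = any(
            p["op"] == "DequantizeLinear" and p["output"] == n["input"]
            for p in nodes
        )
        if n["op"] == "Cast" and feeds_from_dq:
            rename[n["output"]] = n["input"]
            continue
        folded.append(dict(n))
    for n in folded:
        if n.get("input") in rename:
            n["input"] = rename[n["input"]]
    return folded
```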

---------

Signed-off-by: ajrasane <131806219+ajrasane@users.noreply.github.com>
Signed-off-by: Keval Morabia <28916987+kevalmorabia97@users.noreply.github.com>
### What does this PR do?

Type of change: Bug fix

<!-- Details about the change. -->

### Usage
- Fixes a bug in an example script. We were investigating why our models
were not strong at long context; it turns out a recent refactor did not
apply the configured max sequence length, so the 512 default was used.
### Testing
<!-- Mention how have you tested your change if applicable. -->

### Before your PR is "*Ready for review*"

Make sure you read and follow [Contributor
guidelines](https://github.com/NVIDIA/Model-Optimizer/blob/main/CONTRIBUTING.md)
and your commits are signed (`git commit -s -S`).

Make sure you read and follow the [Security Best
Practices](https://github.com/NVIDIA/Model-Optimizer/blob/main/SECURITY.md#security-coding-practices-for-contributors)
(e.g. avoiding hardcoded `trust_remote_code=True`, `torch.load(...,
weights_only=False)`, `pickle`, etc.).

- Is this change backward compatible?: ✅ / ❌ / N/A <!--- If ❌, explain
why. -->
- If you copied code from any other sources or added a new PIP
dependency, did you follow guidance in `CONTRIBUTING.md`: ✅ / ❌ / N/A
<!--- Mandatory -->
- Did you write any new necessary tests?: ✅ / ❌ / N/A <!--- Mandatory
for new features or examples. -->
- Did you update
[Changelog](https://github.com/NVIDIA/Model-Optimizer/blob/main/CHANGELOG.rst)?:
✅ / ❌ / N/A <!--- Only for new features, API changes, critical bug fixes
or backward incompatible changes. -->

### Additional Information
<!-- E.g. related issue. -->

<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->
## Summary by CodeRabbit

## Release Notes

* **Improvements**
* Calibration data loading now enforces a maximum sequence/sample length
during dataset preparation, ensuring calibration inputs adhere to
configured length limits. This yields more predictable calibration
behavior, reduces peak memory usage during calibration runs, and
improves consistency of quantization preprocessing.
<!-- end of auto-generated comment: release notes by coderabbit.ai -->
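The gist of the fix, sketched with a stand-in tokenizer (the real code passes truncation/max_length through to the Hugging Face tokenizer during dataset preparation):

```python
def prepare_calib_samples(samples, tokenize, max_length):
    """Enforce a maximum sequence length while preparing calibration data,
    rather than silently falling back to a 512-token default."""
    return [tokenize(text)[:max_length] for text in samples]
```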

---------

Signed-off-by: Michael Feil <63565275+michaelfeil@users.noreply.github.com>
Signed-off-by: Keval Morabia <28916987+kevalmorabia97@users.noreply.github.com>
## Summary

- Removes Mllama (Llama 3.2 Vision) model-type branches from the
`llm_ptq` example (`hf_ptq.py`, `example_utils.py`) and drops the
now-unused `MllamaImageProcessor` wrapper from `modelopt/torch/utils/`.
- Drops the legacy `MllamaImageProcessor` path in
`modelopt/torch/utils/vlm_dataset_utils.py`; the generic HF
ProcessorMixin path handles the remaining cases.
- Adds a CHANGELOG entry under 0.44 Backward Breaking Changes.

## Test plan

- [x] CI lint / unit tests pass
- [x] Smoke-run ``examples/llm_ptq/scripts/huggingface_example.sh
--model <llm> --quant fp8`` (text-only path, non-mllama)

<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->

## Summary by CodeRabbit

* **Chores**
* Removed Mllama (Llama 3.2 Vision) support from quantization examples.
This includes removal of dedicated image processor implementation,
specialized model handling, and related calibration logic.
* Updated VLM image-text calibration guidance to use
`--calib_with_images` flag with other supported VLMs instead of
Mllama-specific processing paths.

<!-- end of auto-generated comment: release notes by coderabbit.ai -->

Signed-off-by: Chenjie Luo <chenjiel@nvidia.com>
Signed-off-by: Keval Morabia <28916987+kevalmorabia97@users.noreply.github.com>
### Type of change

- [x] Bug fix (non-breaking change which fixes an issue)

### Description

Fixes #1088 — `RuntimeError: one of the variables needed for gradient
computation has been modified by an inplace operation:
IndexPutBackward0` when training with `eagle_mix_hidden_states=True`.

**Root cause:** In `HFEagleModel._eagle_training_forward`, the indexed
assignment at line 991–994 modifies `eagle_input_hiddens` in-place while
it is still part of the autograd computation graph.

**Fix:** Clone the tensor before the in-place assignment. This is the
same pattern already used in the Megatron backend at
`megatron_eagle.py:1201-1202`:

```python
# Clone to avoid inplace modification of view created in no_grad mode
eagle_module_input_hidden_states = eagle_module_input_hidden_states.clone()
```

The HF backend was missing this clone.

### Usage

```python
config["eagle_mix_hidden_states"] = True
config["eagle_ttt_steps"] = 2
mtsp.convert(model, mode=[("eagle", config)])
model.train()
outputs = model(input_ids=input_ids, labels=labels)
outputs.loss.backward()  # no longer crashes
```

### Testing

Added `test_eagle_mix_hidden_states_backward` parametrized over
`eagle_ttt_steps` [1, 2] that:
- Converts a tiny LLaMA to EAGLE with `eagle_mix_hidden_states=True`
- Runs forward + backward pass
- Asserts loss is not None and gradients flow to `eagle_module`

```
pytest tests/unit/torch/speculative/plugins/test_hf_speculative.py::test_eagle_mix_hidden_states_backward -v
```

### Checklist

- [x] I have read the [contributor guidelines](CONTRIBUTING.md) and
signed my commits
- [x] I have followed the [security best practices](SECURITY.md)
- [x] This change is backward compatible
- [x] I have followed third-party code and dependency guidelines
- [x] I have added tests that prove my fix is effective

<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->

## Summary by CodeRabbit

* **Bug Fixes**
* Fixed gradient computation issue in speculative decoding during model
training to ensure proper autograd behavior.

* **Tests**
* Added regression test to validate gradient computation in speculative
decoding scenarios.

<!-- end of auto-generated comment: release notes by coderabbit.ai -->

Signed-off-by: javierdejesusda <javier.dejesusj9@gmail.com>
Signed-off-by: Keval Morabia <28916987+kevalmorabia97@users.noreply.github.com>
### What does this PR do?

This PR fixes PTQ with image calibration for VLMs.

### Usage

```python
python3 examples/llm_ptq/hf_ptq.py --pyt_ckpt_path Qwen/Qwen3-VL-8B-Instruct --qformat fp8 --export_path Qwen3-VL-8B-Instruct-fp8 --trust_remote_code --kv_cache_qformat none --calib_with_images --calib_size 512
```

<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->

## Summary by CodeRabbit

* **Bug Fixes**
* Image-text calibration now extends support to additional model
architectures when image calibration is enabled.
* Improved tokenizer truncation handling in multimodal dataset
processing to prevent configuration conflicts when image inputs are
present.

<!-- end of auto-generated comment: release notes by coderabbit.ai -->

Signed-off-by: Liana Mikaelyan <lmikaelyan@nvidia.com>
Signed-off-by: Keval Morabia <28916987+kevalmorabia97@users.noreply.github.com>
@kevalmorabia97 kevalmorabia97 requested review from a team as code owners April 27, 2026 14:15
@kevalmorabia97 kevalmorabia97 requested review from cjluo-nv, galagam, h-guo18 and realAsma and removed request for a team April 27, 2026 14:15

coderabbitai Bot commented Apr 27, 2026

Walkthrough

This pull request deprecates Mllama support in PTQ examples, introduces FP8 multi-head attention quantization optimizations in ONNX export, adds MatMul dimension-based exclusion logic to prevent small-GeMM quantization, refactors weight/bias extraction in quantization exports, updates documentation with third-party license notices, and includes comprehensive test coverage for new quantization features.

Changes

Cohort / File(s) Summary
Documentation & License
CHANGELOG.rst, LICENSE_HEADER, README.md, docs/source/getting_started/_installation_for_Linux.rst, docs/source/getting_started/windows/_installation_for_Windows.rst
License year update (2024→2026) and expanded installation/license disclaimers clarifying third-party software downloads; replaces single TensorRT-LLM image reference with multiple NVIDIA NGC container options; updates Linux container guidance with explicit pre-installation confirmation.
Mllama Deprecation
examples/llm_ptq/example_utils.py, examples/llm_ptq/hf_ptq.py, modelopt/torch/utils/image_processor.py, modelopt/torch/utils/vlm_dataset_utils.py
Removes MllamaImageProcessor and all mllama-specific branches from calibration, processor initialization, and post-quantization logic; narrows processor type annotations from BaseImageProcessor | ProcessorMixin to ProcessorMixin | None; unifies VLM image processing via args.calib_with_images and AutoProcessor.
ONNX FP8 MHA Quantization
modelopt/onnx/export/fp8_exporter.py, tests/unit/onnx/quantization/test_fp8_mha_exporter.py
Centralizes FP8 constants, adds transpose-folding for constant weights, and introduces three graph-rewrite passes: moving Mul/Transpose before Q/DQ and inserting fixed-scale Q/DQ after Softmax; new test validates rewrite patterns and non-MatMul consumer skipping.
MatMul Exclusion & Shape Inference
modelopt/onnx/quantization/graph_utils.py, tests/unit/onnx/quantization/test_graph_utils.py
Expands MatMul quantization exclusion from GEMV-only to include small-GeMM (K or N < 16); adds _get_inp_b_k_dim helper for K derivation with transB awareness; extends symbolic/runtime inference to infer K from constant initializers, graph inputs, and execution results; comprehensive test coverage including transB and deduplication logic.
FP16/FP32 Cast Folding
modelopt/onnx/utils.py, modelopt/torch/_deploy/utils/torch_onnx.py, tests/unit/onnx/test_fold_casts.py
Adds fold_q_fp16_to_fp32_casts for Q/DQ scale cast elimination; gates FP16 folding functions by minimum opset version; integrates FP8-specific cast folding into ONNX optimization pipeline; test validates cast removal and scale rewriting across opset versions.
PyTorch FP8 Export & Quantization
modelopt/torch/quantization/export_onnx.py, examples/torch_onnx/torch_quant_to_onnx.py
Removes intermediate Cast operations and trt_high_precision_dtype constraints from FP8 Q/DQ; adds FP8 MHA-specific LayerNorm quantizer config; introduces runtime helpers to disable quantizers for 4D+ inputs, low-channel Conv2d, and "dead" quantizers with NaN/non-positive amax.
Quantization Registry & Modules
modelopt/torch/quantization/nn/__init__.py, modelopt/torch/quantization/nn/modules/quant_layernorm.py
Exports new LayerNorm quantization module via registry; enables LayerNorm output quantizers for FP8 attention fusion scenarios.
Weight/Bias Extraction Refactoring
modelopt/torch/export/unified_export_megatron.py, modelopt/torch/export/plugins/vllm_fakequant_megatron.py
Centralizes weight/bias tensor extraction via new _get_weight_bias helper; replaces direct hasattr checks with dict-based detection; refactors vLLM quantized-weight swap to safely re-wrap as Parameter and restore in finally block.
Attention & Plugin Improvements
modelopt/torch/quantization/plugins/huggingface.py, modelopt/torch/quantization/plugins/vllm.py, modelopt/torch/speculative/plugins/transformers.py
Introduces nested-Attention detection to skip double-patching; adds parameter type-preservation logic for quantized weights; clones eagle_input_hiddens before in-place assignment to prevent autograd breakage.
Example & Utility Updates
examples/llm_eval/lm_eval_hf.py, examples/puzzletron/README.md, examples/vllm_serve/vllm_reload_utils.py
Replaces direct version import with importlib.metadata; updates vLLM deployment instructions with AnyModel fork and config.json requirements; broadens quantizer-key matching from "quantizer_" substring to any "quantizer" occurrence.
New Test Coverage
tests/unit/torch/quantization/plugins/test_nested_attention_skip.py, tests/unit/torch/quantization/test_quant_layernorm.py, tests/unit/torch/speculative/plugins/test_hf_speculative.py
Validates nested-attention skip logic, LayerNorm quantization exactness, and EAGLE speculative decoding backward-pass gradient flow.

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~70 minutes

🚥 Pre-merge checks | ✅ 5 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 78.02% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (5 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.
Security Anti-Patterns ✅ Passed Comprehensive security scan found no instances of torch.load with weights_only=False, numpy.load with allow_pickle=True, hardcoded trust_remote_code=True, eval/exec calls, or # nosec comments bypassing security checks.
Title check ✅ Passed The title directly summarizes the main change: cherry-picking multiple PRs into a release branch. It is specific, clear, and accurately reflects the changeset.


@kevalmorabia97 kevalmorabia97 removed the request for review from a team April 27, 2026 14:17

github-actions Bot commented Apr 27, 2026

PR Preview Action v1.8.1
Preview removed because the pull request was closed.
2026-04-27 17:06 UTC

@kevalmorabia97 kevalmorabia97 changed the title Cherry-picks for release/0.44.0 Cherry-pick #1256 #1305 #1322 #1317 #1321 #1289 #1311 #1332 #1104 #1318 Apr 27, 2026
@kevalmorabia97 kevalmorabia97 changed the title Cherry-pick #1256 #1305 #1322 #1317 #1321 #1289 #1311 #1332 #1104 #1318 Cherry-pick PRs #1256 #1305 #1322 #1317 #1321 #1289 #1311 #1332 #1104 #1318 Apr 27, 2026
@kevalmorabia97 kevalmorabia97 changed the title Cherry-pick PRs #1256 #1305 #1322 #1317 #1321 #1289 #1311 #1332 #1104 #1318 [Cherry-pick] PRs #1256 #1305 #1322 #1317 #1321 #1289 #1311 #1332 #1104 #1318 Apr 27, 2026

@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 5

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
modelopt/torch/utils/vlm_dataset_utils.py (1)

441-448: ⚠️ Potential issue | 🔴 Critical

Fix logic bug in max_length condition that prevents truncation from ever being applied.

The condition "images" not in kwargs at line 447 will always be False because "images" is unconditionally added to kwargs at line 443. This means the truncation and max_length parameters are never passed to the processor, making the max_length function parameter ineffective.

If the intent is to skip truncation for multimodal cases (when images are present), the condition should be inverted to "images" in kwargs. If truncation should always apply when max_length is provided, the condition should be removed entirely.
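The corrected kwargs construction the comment suggests (second option) can be sketched as follows — hypothetical function name; the real code builds these kwargs inline before calling an HF processor:

```python
def build_processor_kwargs(prompts, images, max_length=None):
    """Build processor kwargs where truncation is gated only on max_length.

    Since "images" is always added to kwargs, the original
    '"images" not in kwargs' guard could never be True; dropping it
    lets truncation/max_length actually reach the processor.
    """
    kwargs = {"text": list(prompts), "images": list(images)}
    if max_length is not None:
        kwargs.update({"truncation": True, "max_length": max_length})
    return kwargs
```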

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@modelopt/torch/utils/vlm_dataset_utils.py` around lines 441 - 448, The
current logic never applies truncation because "images" is always present in
kwargs; update the max_length handling so truncation is applied when max_length
is provided: remove the `"images" not in kwargs` check and simply, when
max_length is not None, call kwargs.update({"truncation": True, "max_length":
max_length}) (referencing the kwargs dict, max_length parameter, and the earlier
creation that includes "text": list(prompts) and "images": list(images)).
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@examples/torch_onnx/torch_quant_to_onnx.py`:
- Around line 273-274: The current guard on the quantizer uses
torch.any(torch.isnan(amax)) or torch.all(amax <= 0) so it only disables q when
every channel/block is non-positive; change it to disable the quantizer when any
entry is invalid (NaN) or any entry is <= 0. Locate the use of amax and
q.disable() in torch_quant_to_onnx.py and replace the torch.all(amax <= 0)
condition with torch.any(amax <= 0) (or equivalent) so per-channel/per-block
zero/negative values trigger q.disable() and avoid 448/amax infinities.
- Around line 595-600: The auto-mode gate is too broad: change the any(...)
check used to set uses_fp8_conv_input so it only triggers when the
auto_quantization_formats actually include an FP8-family format (not merely
anything other than "INT8_DEFAULT_CFG"); update the condition in the
uses_fp8_conv_input computation to detect FP8-style formats (e.g., check fmt
strings for FP8-family identifiers like "fp8", "mxfp8", "nvfp4" or a specific
FP8-format set) and only call
_disable_low_channel_conv_input_quantizers(quantized_model) when such an
FP8-family format is present, keeping the existing references to
args.quantize_mode, args.auto_quantization_formats, uses_fp8_conv_input and
_disable_low_channel_conv_input_quantizers.

In `@modelopt/onnx/export/fp8_exporter.py`:
- Around line 88-122: The transpose-folding code currently rewires Transpose to
consume dq_op.outputs[0] without ensuring that dq_op.outputs[0] has no other
live consumers; update the logic around dq_op, transpose_to_remove and
cast_to_remove to first verify that the DQ output node (dq_op.outputs[0]) has
exactly one live consumer (the candidate Transpose or Cast→Transpose) before
mutating torch_weights and setting transpose_to_remove/cast_to_remove; if there
are any other consumers, skip folding and leave
cast_to_remove/transpose_to_remove as None. Apply the same single-consumer check
to the analogous block later in the file (the second transpose-folding
occurrence).

In `@modelopt/onnx/utils.py`:
- Around line 1422-1437: The underflow/overflow warning in _scale_fp32_to_fp16
is too coarse: change the condition to check elementwise where a non-zero FP32
value becomes zero in FP16 by replacing the combined np.any(...) logic with an
elementwise mask (e.g., mask = (fp16_data == 0) & (scale_data != 0)) and warn
only if np.any(np.isinf(fp16_data)) or np.any(mask); keep the existing inf check
and the rest of the in-place conversion logic using scale_init, fp16_data and
scale_data unchanged.
- Around line 1440-1474: fold_q_fp16_to_fp32_casts currently treats any
Cast(..., to=FLOAT) as FP16→FP32 and rewrites Q scales; update it to first
verify the cast source is actually FP16 before proceeding: for each Cast node
(in function fold_q_fp16_to_fp32_casts) look up the input tensor's dtype (from
graph.value_info, graph.input, or initializers) and only continue when that
dtype == onnx.TensorProto.FLOAT16; keep the existing logic that calls
_scale_fp32_to_fp16(initializers[...]) and _bypass_cast_node(onnx_model, node)
but skip/break for casts from BF16 or other dtypes so you do not rewrite Q
scales for non-FP16→FP32 casts.

---

Outside diff comments:
In `@modelopt/torch/utils/vlm_dataset_utils.py`:
- Around line 441-448: The current logic never applies truncation because
"images" is always present in kwargs; update the max_length handling so
truncation is applied when max_length is provided: remove the `"images" not in
kwargs` check and simply, when max_length is not None, call
kwargs.update({"truncation": True, "max_length": max_length}) (referencing the
kwargs dict, max_length parameter, and the earlier creation that includes
"text": list(prompts) and "images": list(images)).
ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 8d54ab83-f2cc-4766-b089-ee6c30d50c0e

📥 Commits

Reviewing files that changed from the base of the PR and between f3da713 and 2892b95.

📒 Files selected for processing (31)
  • CHANGELOG.rst
  • LICENSE_HEADER
  • README.md
  • docs/source/getting_started/_installation_for_Linux.rst
  • docs/source/getting_started/windows/_installation_for_Windows.rst
  • examples/llm_eval/lm_eval_hf.py
  • examples/llm_ptq/example_utils.py
  • examples/llm_ptq/hf_ptq.py
  • examples/puzzletron/README.md
  • examples/torch_onnx/torch_quant_to_onnx.py
  • examples/vllm_serve/vllm_reload_utils.py
  • modelopt/onnx/export/fp8_exporter.py
  • modelopt/onnx/quantization/graph_utils.py
  • modelopt/onnx/utils.py
  • modelopt/torch/_deploy/utils/torch_onnx.py
  • modelopt/torch/export/plugins/vllm_fakequant_megatron.py
  • modelopt/torch/export/unified_export_megatron.py
  • modelopt/torch/quantization/export_onnx.py
  • modelopt/torch/quantization/nn/__init__.py
  • modelopt/torch/quantization/nn/modules/quant_layernorm.py
  • modelopt/torch/quantization/plugins/huggingface.py
  • modelopt/torch/quantization/plugins/vllm.py
  • modelopt/torch/speculative/plugins/transformers.py
  • modelopt/torch/utils/image_processor.py
  • modelopt/torch/utils/vlm_dataset_utils.py
  • tests/unit/onnx/quantization/test_fp8_mha_exporter.py
  • tests/unit/onnx/quantization/test_graph_utils.py
  • tests/unit/onnx/test_fold_casts.py
  • tests/unit/torch/quantization/plugins/test_nested_attention_skip.py
  • tests/unit/torch/quantization/test_quant_layernorm.py
  • tests/unit/torch/speculative/plugins/test_hf_speculative.py
💤 Files with no reviewable changes (1)
  • modelopt/torch/utils/image_processor.py

Comment on lines +273 to +274
            if torch.any(torch.isnan(amax)) or torch.all(amax <= 0):
                q.disable()
⚠️ Potential issue | 🟠 Major

Handle partially-dead amax tensors too.

This only disables a quantizer when all amax entries are <= 0, but FP8 export will still blow up if a per-channel/per-block quantizer has just one zero/negative entry. That makes the current guard miss exactly the mixed-validity case that still produces 448 / amax infinities.

Proposed fix
-            if torch.any(torch.isnan(amax)) or torch.all(amax <= 0):
+            if torch.any(torch.isnan(amax)) or torch.any(amax <= 0):
                 q.disable()
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@examples/torch_onnx/torch_quant_to_onnx.py` around lines 273 - 274, The
current guard on the quantizer uses torch.any(torch.isnan(amax)) or
torch.all(amax <= 0) so it only disables q when every channel/block is
non-positive; change it to disable the quantizer when any entry is invalid (NaN)
or any entry is <= 0. Locate the use of amax and q.disable() in
torch_quant_to_onnx.py and replace the torch.all(amax <= 0) condition with
torch.any(amax <= 0) (or equivalent) so per-channel/per-block zero/negative
values trigger q.disable() and avoid 448/amax infinities.
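
The difference between the two guards can be seen on a mixed-validity per-channel amax. The sketch below is illustrative only; the helper name `should_disable` is hypothetical and just wraps the two guard variants side by side:

```python
import torch

def should_disable(amax: torch.Tensor, strict: bool = True) -> bool:
    """Return True when the quantizer must be disabled for this amax tensor.

    strict=True uses torch.any(amax <= 0) (the proposed fix); strict=False
    reproduces the original torch.all(amax <= 0) guard.
    """
    bad = torch.any(amax <= 0) if strict else torch.all(amax <= 0)
    return bool(torch.any(torch.isnan(amax)) or bad)

# A per-channel amax with one dead channel: valid entries alongside a zero.
mixed = torch.tensor([0.5, 0.0, 1.2])
print(should_disable(mixed, strict=False))  # False: original guard misses it
print(should_disable(mixed, strict=True))   # True: any zero/negative entry trips the fix

# The zero entry is exactly what produces inf in the FP8 scale 448 / amax:
print(torch.isinf(448.0 / mixed).any().item())  # True
```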

Comment on lines +595 to +600
    uses_fp8_conv_input = args.quantize_mode in ("fp8", "mxfp8", "nvfp4") or (
        args.quantize_mode == "auto"
        and any(fmt != "INT8_DEFAULT_CFG" for fmt in args.auto_quantization_formats)
    )
    if uses_fp8_conv_input:
        _disable_low_channel_conv_input_quantizers(quantized_model)
⚠️ Potential issue | 🟠 Major

Narrow the auto-mode gate to actual FP8-family formats.

any(fmt != "INT8_DEFAULT_CFG" ...) also fires for INT4_AWQ_CFG, so auto mode will disable low-channel Conv input quantizers even when no FP8-style format is in the search space. That changes the search budget/accuracy for a workaround that is only needed for TRT_FP8QuantizeLinear.

Proposed fix
+    fp8_family_formats = {
+        "FP8_DEFAULT_CFG",
+        "MXFP8_DEFAULT_CFG",
+        "NVFP4_AWQ_LITE_CFG",
+        "NVFP4_DEFAULT_CFG",
+    }
     uses_fp8_conv_input = args.quantize_mode in ("fp8", "mxfp8", "nvfp4") or (
         args.quantize_mode == "auto"
-        and any(fmt != "INT8_DEFAULT_CFG" for fmt in args.auto_quantization_formats)
+        and any(fmt in fp8_family_formats for fmt in args.auto_quantization_formats)
     )
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@examples/torch_onnx/torch_quant_to_onnx.py` around lines 595 - 600, The
auto-mode gate is too broad: change the any(...) check used to set
uses_fp8_conv_input so it only triggers when the auto_quantization_formats
actually include an FP8-family format (not merely anything other than
"INT8_DEFAULT_CFG"); update the condition in the uses_fp8_conv_input computation
to detect FP8-style formats (e.g., check fmt strings for FP8-family identifiers
like "fp8", "mxfp8", "nvfp4" or a specific FP8-format set) and only call
_disable_low_channel_conv_input_quantizers(quantized_model) when such an
FP8-family format is present, keeping the existing references to
args.quantize_mode, args.auto_quantization_formats, uses_fp8_conv_input and
_disable_low_channel_conv_input_quantizers.
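
A standalone sketch of the narrowed gate, with the format names taken from the reviewer's proposed set (the free function `uses_fp8_conv_input` is a hypothetical stand-in for the inline expression in the example script):

```python
# Format names mirroring the reviewer's proposed FP8-family set.
FP8_FAMILY_FORMATS = {
    "FP8_DEFAULT_CFG",
    "MXFP8_DEFAULT_CFG",
    "NVFP4_AWQ_LITE_CFG",
    "NVFP4_DEFAULT_CFG",
}

def uses_fp8_conv_input(quantize_mode: str, auto_formats: list[str]) -> bool:
    """Gate the low-channel Conv-input workaround on FP8-family formats only."""
    if quantize_mode in ("fp8", "mxfp8", "nvfp4"):
        return True
    return quantize_mode == "auto" and any(f in FP8_FAMILY_FORMATS for f in auto_formats)

# The broad `fmt != "INT8_DEFAULT_CFG"` gate would fire here; the narrowed one does not:
print(uses_fp8_conv_input("auto", ["INT4_AWQ_CFG"]))                          # False
print(uses_fp8_conv_input("auto", ["INT8_DEFAULT_CFG", "FP8_DEFAULT_CFG"]))   # True
```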

Comment on lines +88 to +122
                # Pre-transpose constant weights if DQ feeds ``Transpose → MatMul`` (or
                # ``Cast → Transpose → MatMul`` after fp16 conversion) so TRT sees DQ→MatMul.
                # Control flow: scan candidates; a Cast-wrapped candidate is accepted only if it
                # leads to a Transpose; a bare Transpose whose all consumers are MatMul wins and
                # breaks the loop. Any other shape defaults `cast_to_remove` back to None and
                # continues scanning.
                transpose_to_remove = None
                cast_to_remove = None
                for candidate in list(dq_op.outputs[0].outputs):
                    if candidate.op == "Cast":
                        cast_to_remove = candidate
                        candidate = next(
                            (c for c in candidate.outputs[0].outputs if c.op == "Transpose"),
                            None,
                        )
                        if candidate is None:
                            cast_to_remove = None
                            continue
                    if candidate.op != "Transpose":
                        cast_to_remove = None
                        continue
                    t_consumers = list(candidate.outputs[0].outputs)
                    # Only fold the transpose when every downstream consumer is MatMul; otherwise
                    # non-MatMul consumers would observe the un-transposed weights.
                    if t_consumers and all(c.op == "MatMul" for c in t_consumers):
                        perm = candidate.attrs.get("perm", None)
                        torch_weights = (
                            torch_weights.permute(*perm).contiguous()
                            if perm is not None
                            else torch_weights.T.contiguous()
                        )
                        transpose_to_remove = candidate
                    else:
                        cast_to_remove = None
                    break
⚠️ Potential issue | 🟠 Major

Only fold the transpose when the DQ output has no other live consumers.

This rewrite pre-transposes the stored weight and then rewires the Transpose branch to consume dq_op.outputs[0] directly, but it never checks whether that same DQ output also feeds some other branch. If it does, that branch now starts seeing the transposed weight too, which is a silent graph corruption.

Proposed fix
                 transpose_to_remove = None
                 cast_to_remove = None
-                for candidate in list(dq_op.outputs[0].outputs):
+                dq_consumers = list(dq_op.outputs[0].outputs)
+                for candidate in dq_consumers:
                     if candidate.op == "Cast":
                         cast_to_remove = candidate
                         candidate = next(
                             (c for c in candidate.outputs[0].outputs if c.op == "Transpose"),
                             None,
@@
                     if t_consumers and all(c.op == "MatMul" for c in t_consumers):
+                        if len(dq_consumers) != 1:
+                            transpose_to_remove = None
+                            cast_to_remove = None
+                            break
                         perm = candidate.attrs.get("perm", None)
                         torch_weights = (
                             torch_weights.permute(*perm).contiguous()
                             if perm is not None
                             else torch_weights.T.contiguous()

Also applies to: 142-151

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@modelopt/onnx/export/fp8_exporter.py` around lines 88 - 122, The
transpose-folding code currently rewires Transpose to consume dq_op.outputs[0]
without ensuring that dq_op.outputs[0] has no other live consumers; update the
logic around dq_op, transpose_to_remove and cast_to_remove to first verify that
the DQ output node (dq_op.outputs[0]) has exactly one live consumer (the
candidate Transpose or Cast→Transpose) before mutating torch_weights and setting
transpose_to_remove/cast_to_remove; if there are any other consumers, skip
folding and leave cast_to_remove/transpose_to_remove as None. Apply the same
single-consumer check to the analogous block later in the file (the second
transpose-folding occurrence).

Comment thread modelopt/onnx/utils.py
Comment on lines +1422 to +1437
def _scale_fp32_to_fp16(scale_init: onnx.TensorProto) -> None:
    """Convert a scalar Q/DQ scale initializer in-place from FP32 to FP16.

    Warns if any non-zero scale saturates to 0/inf in FP16 (out of FP16 representable range).
    """
    if scale_init.data_type != onnx.TensorProto.FLOAT:
        return
    scale_data = np.frombuffer(scale_init.raw_data, dtype=np.float32)
    if not scale_data.size:
        scale_data = np.array(scale_init.float_data, dtype=np.float32)
    fp16_data = scale_data.astype(np.float16)
    if np.any(np.isinf(fp16_data)) or (np.any(fp16_data == 0) and np.any(scale_data != 0)):
        logger.warning(f"Q/DQ scale '{scale_init.name}' overflows or underflows when cast to FP16")
    scale_init.data_type = onnx.TensorProto.FLOAT16
    scale_init.raw_data = fp16_data.tobytes()
    del scale_init.float_data[:]
⚠️ Potential issue | 🟡 Minor

Make the underflow warning elementwise.

np.any(fp16_data == 0) and np.any(scale_data != 0) fires whenever the tensor contains any real zero plus any non-zero value. That produces false overflow/underflow warnings for mixed scale tensors.

Suggested fix
-    if np.any(np.isinf(fp16_data)) or (np.any(fp16_data == 0) and np.any(scale_data != 0)):
+    if np.any(np.isinf(fp16_data)) or np.any((fp16_data == 0) & (scale_data != 0)):
         logger.warning(f"Q/DQ scale '{scale_init.name}' overflows or underflows when cast to FP16")
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@modelopt/onnx/utils.py` around lines 1422 - 1437, The underflow/overflow
warning in _scale_fp32_to_fp16 is too coarse: change the condition to check
elementwise where a non-zero FP32 value becomes zero in FP16 by replacing the
combined np.any(...) logic with an elementwise mask (e.g., mask = (fp16_data ==
0) & (scale_data != 0)) and warn only if np.any(np.isinf(fp16_data)) or
np.any(mask); keep the existing inf check and the rest of the in-place
conversion logic using scale_init, fp16_data and scale_data unchanged.
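
The false-alarm behavior is easy to reproduce with plain NumPy. The sketch below is standalone and not code from `modelopt/onnx/utils.py`; it just contrasts the coarse check with the elementwise mask:

```python
import numpy as np

# One legitimate zero, one value below the FP16 subnormal range (~5.96e-8), one normal.
scale = np.array([0.0, 2.0e-8, 1.0], dtype=np.float32)
fp16 = scale.astype(np.float16)  # 2e-8 underflows to 0; 0.0 stays 0 legitimately

# Coarse check: fires whenever any FP16 zero coexists with any non-zero FP32 value.
coarse = bool(np.any(fp16 == 0) and np.any(scale != 0))

# Elementwise check: a value that was non-zero in FP32 but became zero in FP16.
elementwise = bool(np.any((fp16 == 0) & (scale != 0)))
print(coarse, elementwise)  # True True — both catch the real underflow here

# A benign tensor (zero was already zero in FP32) shows the false alarm:
benign = np.array([0.0, 1.0], dtype=np.float32)
b16 = benign.astype(np.float16)
print(bool(np.any(b16 == 0) and np.any(benign != 0)))  # True — false alarm
print(bool(np.any((b16 == 0) & (benign != 0))))        # False — correct
```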

Comment thread modelopt/onnx/utils.py
Comment on lines +1440 to +1474
def fold_q_fp16_to_fp32_casts(onnx_model: onnx.ModelProto) -> onnx.ModelProto:
    """Remove ``Cast(FP16→FP32) → Q`` patterns inserted by ``convert_float_to_float16``.

    The Q scale is rewritten to FP16 so Q consumes the FP16 graph directly. Skipped for
    opsets below ``BASE_MIN_OPSET`` since FP16 Q scales require opset >= 19.
    """
    if get_opset_version(onnx_model) < BASE_MIN_OPSET:
        logger.debug(
            f"Skipping fold_q_fp16_to_fp32_casts: opset < {BASE_MIN_OPSET} (FP16 Q scale unsupported)"
        )
        return onnx_model

    consumer_map: dict[str, list[onnx.NodeProto]] = {}
    for node in onnx_model.graph.node:
        for inp in node.input:
            consumer_map.setdefault(inp, []).append(node)
    initializers = {init.name: init for init in onnx_model.graph.initializer}

    to_remove = []
    for node in onnx_model.graph.node:
        if node.op_type != "Cast":
            continue
        cast_to = next((a.i for a in node.attribute if a.name == "to"), None)
        if cast_to != onnx.TensorProto.FLOAT:
            continue
        consumers = consumer_map.get(node.output[0], [])
        if not consumers or not all(c.op_type in _Q_OPS for c in consumers):
            continue

        for q_node in consumers:
            if len(q_node.input) >= 2 and q_node.input[1] in initializers:
                _scale_fp32_to_fp16(initializers[q_node.input[1]])

        _bypass_cast_node(onnx_model, node)
        to_remove.append(node)
⚠️ Potential issue | 🟠 Major

Guard this fold to actual FP16→FP32 casts.

The implementation only checks Cast(..., to=FLOAT), so it will also bypass BF16→FP32 or any other *→FP32 cast and then rewrite the Q scale to FP16. That changes semantics and can leave the quantizer consuming a dtype this pass was not meant to legalize.

Suggested fix
 def fold_q_fp16_to_fp32_casts(onnx_model: onnx.ModelProto) -> onnx.ModelProto:
@@
-    consumer_map: dict[str, list[onnx.NodeProto]] = {}
+    consumer_map: dict[str, list[onnx.NodeProto]] = {}
     for node in onnx_model.graph.node:
         for inp in node.input:
             consumer_map.setdefault(inp, []).append(node)
     initializers = {init.name: init for init in onnx_model.graph.initializer}
+    type_map = _build_tensor_type_map(onnx_model)
@@
         cast_to = next((a.i for a in node.attribute if a.name == "to"), None)
         if cast_to != onnx.TensorProto.FLOAT:
             continue
+        if type_map.get(node.input[0]) != onnx.TensorProto.FLOAT16:
+            continue
         consumers = consumer_map.get(node.output[0], [])
         if not consumers or not all(c.op_type in _Q_OPS for c in consumers):
             continue
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@modelopt/onnx/utils.py` around lines 1440 - 1474, fold_q_fp16_to_fp32_casts
currently treats any Cast(..., to=FLOAT) as FP16→FP32 and rewrites Q scales;
update it to first verify the cast source is actually FP16 before proceeding:
for each Cast node (in function fold_q_fp16_to_fp32_casts) look up the input
tensor's dtype (from graph.value_info, graph.input, or initializers) and only
continue when that dtype == onnx.TensorProto.FLOAT16; keep the existing logic
that calls _scale_fp32_to_fp16(initializers[...]) and
_bypass_cast_node(onnx_model, node) but skip/break for casts from BF16 or other
dtypes so you do not rewrite Q scales for non-FP16→FP32 casts.
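
The dtype lookup the prompt describes can be sketched without a full ONNX model. The helpers `build_type_map` and `should_fold` below are hypothetical, and the enum constants are the `TensorProto` values from the ONNX spec (FLOAT=1, FLOAT16=10, BFLOAT16=16); plain (name, elem_type) tuples stand in for value_info, graph input, and initializer entries:

```python
# TensorProto elem_type values per the ONNX spec.
FLOAT, FLOAT16, BFLOAT16 = 1, 10, 16

def build_type_map(value_infos, graph_inputs, initializers):
    """Map tensor name -> elem_type from value_info, graph inputs, and initializers."""
    type_map = {}
    for name, elem_type in [*value_infos, *graph_inputs, *initializers]:
        type_map[name] = elem_type
    return type_map

def should_fold(cast_input: str, cast_to: int, type_map: dict) -> bool:
    """Fold only genuine FP16→FP32 casts; skip BF16→FP32 and unknown sources."""
    return cast_to == FLOAT and type_map.get(cast_input) == FLOAT16

tm = build_type_map([("x_fp16", FLOAT16), ("x_bf16", BFLOAT16)], [], [])
print(should_fold("x_fp16", FLOAT, tm))  # True
print(should_fold("x_bf16", FLOAT, tm))  # False — BF16 source must not be folded
```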

@codecov
codecov Bot commented Apr 27, 2026

Codecov Report

❌ Patch coverage is 91.25874% with 25 lines in your changes missing coverage. Please review.
✅ Project coverage is 75.87%. Comparing base (f3da713) to head (2892b95).

Files with missing lines Patch % Lines
modelopt/onnx/export/fp8_exporter.py 89.61% 16 Missing ⚠️
modelopt/onnx/utils.py 89.58% 5 Missing ⚠️
modelopt/torch/export/unified_export_megatron.py 78.57% 3 Missing ⚠️
modelopt/torch/utils/vlm_dataset_utils.py 0.00% 1 Missing ⚠️
Additional details and impacted files
@@                Coverage Diff                 @@
##           release/0.44.0    #1350      +/-   ##
==================================================
+ Coverage           75.38%   75.87%   +0.49%     
==================================================
  Files                 462      462              
  Lines               49960    50173     +213     
==================================================
+ Hits                37662    38069     +407     
+ Misses              12298    12104     -194     
Flag Coverage Δ
examples 41.64% <56.29%> (+0.93%) ⬆️
gpu 58.39% <20.62%> (-0.74%) ⬇️
regression 14.78% <2.09%> (-0.01%) ⬇️
unit 52.87% <70.62%> (+0.48%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.

@kevalmorabia97 kevalmorabia97 merged commit b1ec471 into release/0.44.0 Apr 27, 2026
46 checks passed
@kevalmorabia97 kevalmorabia97 deleted the cherry-picks/release-0.44.0 branch April 27, 2026 17:05
kevalmorabia97 added a commit that referenced this pull request Apr 27, 2026
Adds `.claude/skills/release-cherry-pick/SKILL.md` — a Claude Code skill
for cherry-picking labeled PRs to a release branch.

Invoke with `/release-cherry-pick <version>`.

See this PR created with the skill:
#1350

<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->

## Summary by CodeRabbit

* **New Features**
* Added automated release cherry-pick workflow to streamline selecting
and applying multiple PRs into release branches.

<!-- end of auto-generated comment: release notes by coderabbit.ai -->

---------

Signed-off-by: Keval Morabia <28916987+kevalmorabia97@users.noreply.github.com>
Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>
