[Cherry-pick] PRs #1256 #1305 #1322 #1317 #1321 #1289 #1311 #1332 #1104 #1318 #1350
kevalmorabia97 merged 10 commits into release/0.44.0 from
Conversation
### What does this PR do? Exclude small-dimension MatMul nodes from INT8 quantization. MatMuls with N or K < 16 cannot efficiently use INT8, causing performance regressions. ### Before your PR is "*Ready for review*" Make sure you read and follow [Contributor guidelines](https://github.com/NVIDIA/Model-Optimizer/blob/main/CONTRIBUTING.md) and your commits are signed (`git commit -s -S`). Make sure you read and follow the [Security Best Practices](https://github.com/NVIDIA/Model-Optimizer/blob/main/SECURITY.md#security-coding-practices-for-contributors) (e.g. avoiding hardcoded `trust_remote_code=True`, `torch.load(..., weights_only=False)`, `pickle`, etc.). - Is this change backward compatible?: ✅ / ❌ / N/A <!--- If ❌, explain why. --> - If you copied code from any other sources or added a new PIP dependency, did you follow guidance in `CONTRIBUTING.md`: ✅ / ❌ / N/A <!--- Mandatory --> - Did you write any new necessary tests?: ✅ / ❌ / N/A <!--- Mandatory for new features or examples. --> - Did you update [Changelog](https://github.com/NVIDIA/Model-Optimizer/blob/main/CHANGELOG.rst)?: ✅ / ❌ / N/A <!--- Only for new features, API changes, critical bug fixes or backward incompatible changes. --> ### Additional Information <!-- E.g. related issue. --> <!-- This is an auto-generated comment: release notes by coderabbit.ai --> ## Summary by CodeRabbit * **Bug Fixes** * Improved quantization exclusions so MatMul/Gemm ops with derived K<16 or N<16 are skipped, honoring Gemm transB, using inferred and runtime-determined shapes, and avoiding duplicate outputs. * **Tests** * Expanded unit tests to cover constant, inferred, and runtime-derived shapes, Gemm transB behavior, small-dimension edge cases, and output deduplication. * **Documentation** * Added changelog entry documenting the new small-dimension exclusion thresholds and transB handling. <!-- end of auto-generated comment: release notes by coderabbit.ai --> --------- Signed-off-by: samcheng <samcheng@nvidia.com> Signed-off-by: Keval Morabia <28916987+kevalmorabia97@users.noreply.github.com>
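For illustration, the exclusion rule described in this PR can be sketched as a small pass over the ONNX graph. The sketch below is not the actual `graph_utils.py` implementation: the function name and shape lookup are assumptions, and the real logic additionally handles Gemm `transB`, runtime-determined shapes, and output deduplication as noted in the release notes.

```python
import onnx
from onnx import shape_inference


def small_matmul_nodes_to_exclude(model: onnx.ModelProto, min_dim: int = 16) -> list[str]:
    """Return names of MatMul nodes whose weight-side K or N dimension is below min_dim."""
    inferred = shape_inference.infer_shapes(model)

    # Map tensor name -> static shape (None marks symbolic/unknown dims).
    shapes: dict[str, list] = {}
    for vi in list(inferred.graph.value_info) + list(inferred.graph.input):
        shapes[vi.name] = [
            d.dim_value if d.HasField("dim_value") else None
            for d in vi.type.tensor_type.shape.dim
        ]
    for init in model.graph.initializer:
        shapes[init.name] = list(init.dims)

    excluded = []
    for node in model.graph.node:
        if node.op_type != "MatMul":
            continue
        b_shape = shapes.get(node.input[1])
        if not b_shape or len(b_shape) < 2:
            continue  # shape unknown at export time; this sketch leaves such nodes alone
        k, n = b_shape[-2], b_shape[-1]
        if (k is not None and k < min_dim) or (n is not None and n < min_dim):
            excluded.append(node.name)
    return excluded
```

In a real pipeline the returned node names would be appended to whatever exclusion list the INT8 quantizer already consumes; that wiring is omitted here.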
### What does this PR do?

Type of change: Bug fix

Fixes two bugs in the vLLM + Megatron-Core MoE export path and cleans up the related weight-collection helper:

1. **`_QuantFusedMoEBase` (vllm.py)**: The weight-quantizer path in `_invoke_fused_moe_quantized_function` was temporarily mutating `self.w13_weight` / `self.w2_weight` to the quantized tensor, then restoring them via `finally`. This exposed a stale quantized tensor on `self` between the mutation and the kernel call. Fixed by computing the quantized weight directly into a local `B` without touching `self.*` attributes.
2. **`GPTModelExporter` / `VllmFqGPTModelExporter` (unified_export_megatron.py / vllm_fakequant_megatron.py)**: `expert_bias` (present in grouped MoE layers) was silently dropped during export because the bias collection ran after the early return on missing `weight`. Extracted a `_get_weight_bias` helper that collects weight, bias, and expert_bias together, so bias/expert_bias are captured even when weight is absent or zero-element.

### Usage

```python
# No API change; export pipelines pick this up automatically.
# export_mcore_gpt_to_hf_vllm_fq / export_mcore_gpt_to_hf now correctly
# export expert_bias for grouped-MoE checkpoints.
```

### Testing

Step 1 — Quantize (run from Megatron-LM examples/post_training/modelopt):

```
HF_MODEL_CKPT=<path/to/hf/weights> MLM_MODEL_SAVE=<quant-ckpt-name> \
  bash quantize.sh <hf-model-id> NVFP4_DEFAULT_CFG
```

Step 2 — Export for vLLM fakequant:

```
MLM_EXTRA_ARGS=--export-vllm-fq \
  HF_MODEL_CKPT=<path/to/hf/weights> \
  MLM_MODEL_CKPT=<quant-ckpt-name> \
  EXPORT_DIR=<export-dir> \
  bash export.sh <hf-model-id>
```

Step 3 — Serve (run from examples/vllm_serve):

```
QUANT_CFG=NVFP4_DEFAULT_CFG \
  QUANT_FILE_PATH=<export-dir>/quantizer_state.pth \
  python3 vllm_serve_fakequant.py <export-dir> \
  -tp 1 --served-model-name <model-name> \
  --host 0.0.0.0 --port 8000 \
  --trust-remote-code --enforce-eager \
  --disable-custom-all-reduce \
  --gpu-memory-utilization 0.8
```

### Before your PR is "*Ready for review*"

Make sure you read and follow [Contributor guidelines](https://github.com/NVIDIA/Model-Optimizer/blob/main/CONTRIBUTING.md) and your commits are signed (`git commit -s -S`). Make sure you read and follow the [Security Best Practices](https://github.com/NVIDIA/Model-Optimizer/blob/main/SECURITY.md#security-coding-practices-for-contributors) (e.g. avoiding hardcoded `trust_remote_code=True`, `torch.load(..., weights_only=False)`, `pickle`, etc.).

- Is this change backward compatible?: ✅
- If you copied code from any other sources or added a new PIP dependency, did you follow guidance in `CONTRIBUTING.md`: N/A
- Did you write any new necessary tests?: ❌
- Did you update [Changelog](https://github.com/NVIDIA/Model-Optimizer/blob/main/CHANGELOG.rst)?: N/A

### Additional Information

<!-- E.g. related issue. -->

<!-- This is an auto-generated comment: release notes by coderabbit.ai -->
## Summary by CodeRabbit

* **Refactor**
  * Centralized weight/bias/expert-bias extraction and export to a single helper for consistent handling.
  * Standardized quantized-weight flow to temporarily swap and restore parameter tensors during computation.
* **Bug Fixes**
  * Prevented missing or incorrect weight/bias exports by unifying extraction logic.
  * Broadened checkpoint key matching to preserve more quantizer state during reloads.
<!-- end of auto-generated comment: release notes by coderabbit.ai --> --------- Signed-off-by: Kinjal Patel <kinjalpravin@nvidia.com> Signed-off-by: Keval Morabia <28916987+kevalmorabia97@users.noreply.github.com>
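To make item 1 above concrete, here is a rough sketch of the pattern described there. The names and signature are hypothetical and not the actual `_QuantFusedMoEBase` code; the point is only that the quantized weight goes into a local variable instead of being swapped onto `self` and restored in a `finally` block.

```python
# Hypothetical sketch, not the ModelOpt implementation: compute the quantized
# weight into a local tensor instead of temporarily mutating self.* attributes.
def _invoke_quantized(self, weight, weight_quantizer, fused_kernel, hidden_states):
    # Old pattern (buggy): self.w13_weight = weight_quantizer(self.w13_weight),
    # call the kernel, then restore self.w13_weight in a finally block. Between
    # mutation and restore, a stale quantized tensor was visible on self.
    # New pattern: `self` is never touched.
    B = weight_quantizer(weight) if weight_quantizer is not None else weight
    return fused_kernel(hidden_states, B)
```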
…1322) ## Summary - Replaces the old pip license notice ("Please review the license terms of ModelOpt and any dependencies before use") with the Legal-approved wording: "Model Optimizer will download and install additional third-party open source software projects. Review the license terms of these open source projects before use." - Adds a generic container license review notice ("Before pulling and using the container images, please review their respective license terms.") to the Linux installation doc (Docker tab) and README. - Adds a `.. note::` with the pip notice to the Windows installation page (covers both standalone and Olive child pages). - Expands the README container section to explicitly list all four recommended NVIDIA container images (`pytorch`, `nemo`, `tensorrt-llm`, `tensorrt`). ## Test plan - [x] Verify rendered docs look correct (`nox -s docs`) - [x] Confirm legal notices appear in Linux, Windows, and README install sections 🤖 Generated with [Claude Code](https://claude.com/claude-code) <!-- This is an auto-generated comment: release notes by coderabbit.ai --> ## Summary by CodeRabbit * **Documentation** * Updated installation guides with explicit references to supported NVIDIA container images (PyTorch, NeMo, TensorRT-LLM and variants), clarified pre-installed Model Optimizer in some images, and added notes to review each container’s license terms; clarified conditional environment setup wording and local install license guidance. * **Chores** * Updated project license header year. <!-- end of auto-generated comment: release notes by coderabbit.ai --> --------- Signed-off-by: Keval Morabia <28916987+kevalmorabia97@users.noreply.github.com> Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com> Signed-off-by: Keval Morabia <28916987+kevalmorabia97@users.noreply.github.com>
### What does this PR do?

Type of change: documentation

This PR updates vLLM deployment instructions, taking into account heterogeneous models created with AnyModel.

### Usage

Does not apply.

### Testing

Run the updated instructions in the documentation.

### Before your PR is "*Ready for review*"

Make sure you read and follow [Contributor guidelines](https://github.com/NVIDIA/Model-Optimizer/blob/main/CONTRIBUTING.md) and your commits are signed (`git commit -s -S`). Make sure you read and follow the [Security Best Practices](https://github.com/NVIDIA/Model-Optimizer/blob/main/SECURITY.md#security-coding-practices-for-contributors) (e.g. avoiding hardcoded `trust_remote_code=True`, `torch.load(..., weights_only=False)`, `pickle`, etc.).

- Is this change backward compatible?: N/A
- If you copied code from any other sources or added a new PIP dependency, did you follow guidance in `CONTRIBUTING.md`: N/A
- Did you write any new necessary tests?: N/A
- Did you update [Changelog](https://github.com/NVIDIA/Model-Optimizer/blob/main/CHANGELOG.rst)?: N/A

### Additional Information

<!-- E.g. related issue. -->

<!-- This is an auto-generated comment: release notes by coderabbit.ai -->
## Summary by CodeRabbit

* **Documentation**
  * Replaced a benchmarking-focused section with a deployment guide for running compressed models on vLLM.
  * Added step-by-step setup for using an AnyModel-enabled vLLM fork, including checkout and install guidance and required model config edits (with optional architecture metadata).
  * Simplified runtime to a single vllm serve command, removing manual model rearrangement steps.
  * Restored inference benchmarking as a subsection, retaining vllm bench latency/throughput examples.
<!-- end of auto-generated comment: release notes by coderabbit.ai -->

---------

Signed-off-by: Grzegorz Karch <gkarch@nvidia.com>
Co-authored-by: Keval Morabia <28916987+kevalmorabia97@users.noreply.github.com>
Signed-off-by: Keval Morabia <28916987+kevalmorabia97@users.noreply.github.com>
### What does this PR do? Type of change: bug fix <!-- Use one of the following: Bug fix, new feature, new example, new tests, documentation. --> `lm_eval` does not have `__version__` attribute ### Additional Information <!-- E.g. related issue. --> NVBug 6102101 <!-- This is an auto-generated comment: release notes by coderabbit.ai --> ## Summary by CodeRabbit * **Refactor** * Enhanced the package version detection system to improve overall reliability and stability of the application while reducing unnecessary external dependencies. All functionality, including version gating and system warnings, continues to operate exactly as expected with no impact on the user experience. <!-- end of auto-generated comment: release notes by coderabbit.ai --> Signed-off-by: Keval Morabia <28916987+kevalmorabia97@users.noreply.github.com>
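A sketch of the general approach this kind of fix takes, assuming the version is read from the installed distribution metadata rather than the missing attribute; the helper name below is illustrative and not necessarily what the PR uses.

```python
from importlib.metadata import PackageNotFoundError, version


def detect_lm_eval_version() -> str | None:
    """Return lm_eval's installed version, or None if it is not installed.

    lm_eval does not define `__version__`, so query the package metadata instead.
    """
    try:
        return version("lm_eval")
    except PackageNotFoundError:
        return None
```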
…1289)

Enables TensorRT attention-v2 fusion for vision transformers when exported to ONNX with FP8 Q/DQ. The core library changes are architecture-agnostic (drop-in for any FP8 ONNX export); coverage is exercised by the existing `examples/torch_onnx/torch_quant_to_onnx.py` pipeline.

- **`modelopt/onnx/export/fp8_exporter.py`** — new post-processing passes: move attention-scaling `Mul` and K `Transpose` to the Q-side so DQ feeds MatMul directly, pre-transpose constant weights, and insert FP8 Q/DQ on Softmax outputs (fixed `1/448` scale, data-independent) for MHA-v2 fusion. Rewrites only fire when every downstream consumer is a MatMul so non-attention branches are never perturbed.
- **`modelopt/onnx/utils.py`** — `fold_dq_fp32_to_fp16_casts` / `fold_q_fp16_to_fp32_casts` remove the Cast nodes `convert_float_to_float16` inserts around Q/DQ and rewrite scale initializers to FP16 so TRT fuses DQ into the downstream GEMM. Guarded behind opset >= 19 (FP16 Q/DQ scale requirement). Warns on FP16 overflow/underflow.
- **`modelopt/torch/_deploy/utils/torch_onnx.py`** — calls the fold helpers for FP8-quantized models after `convert_float_to_float16`.
- **`modelopt/torch/quantization/export_onnx.py`** — keeps FP8 Q/DQ scale in the native input dtype so no Cast is emitted between graph and Q/DQ. Removes the now-unused `trt_high_precision_dtype` parameter from `_fp8_quantize`/`_fp8_dequantize`.
- **`modelopt/torch/quantization/nn/modules/quant_layernorm.py`** (new) — registers `nn.LayerNorm` in `QuantModuleRegistry` so LayerNorm output quantizers are honored.
- **`modelopt/torch/quantization/plugins/huggingface.py`** — skips `*Attention` wrappers whose children are also `*Attention` per-instance (not per-class) to avoid double-patching `eager_attention_forward` (e.g. `ViTAttention` vs `ViTSelfAttention`).
- **`examples/torch_onnx/torch_quant_to_onnx.py`** — adds a `_FP8_MHA_OVERRIDE` config block to FP8 mode that enables LayerNorm output quantizer + disables its input quantizer for TRT attention fusion.
- **Unit tests** (12 CPU tests, ~1.2 s total) — fp8_exporter rewrites + fanout safety, fold-cast helpers + opset guard, LayerNorm quant-wrapper identity, per-instance nested-attention detection.

ViT-base-patch16-224, RTX 6000 Ada, strongly-typed FP8 via `trtexec`. Accuracy on 2,000 ImageNet-1k validation samples (streaming).

**Batch = 1 (latency-bound)**

| Model | Top-1 | Top-5 | TRT latency | Speedup |
|---|---|---|---|---|
| FP16 baseline | 80.96% | 95.80% | 0.722 ms | 1.00x |
| Torch FP8 MHA | 80.66% | 95.75% | 0.657 ms | **1.10x** |
| ONNX PTQ FP8 | — | — | 0.589 ms | **1.23x** |

**Batch = 64 (throughput-bound, realistic inference)**

| Model | TRT latency | Speedup | Images/s |
|---|---|---|---|
| FP16 baseline | 23.40 ms | 1.00x | 1152 |
| Torch FP8 MHA | 15.89 ms | **1.47x** | 1152 |
| ONNX PTQ FP8 | 15.89 ms | **1.47x** | 1216 |

Top-1 accuracy stays within 0.30 pp of FP16; at batch=64 the Torch FP8 MHA path matches the ONNX PTQ wall-time — attention is the bottleneck there and both paths achieve full FP8 attention fusion (36/36 attention MatMuls with QDQ in ViT-base).
- [x] CPU unit tests (new): `python -m pytest tests/unit/onnx/quantization/test_fp8_mha_exporter.py tests/unit/onnx/test_fold_casts.py tests/unit/torch/quantization/test_quant_layernorm.py tests/unit/torch/quantization/plugins/test_nested_attention_skip.py`
- [x] Existing ONNX / quantization unit suites unaffected: `python -m pytest tests/unit/onnx tests/unit/torch/quantization`
- [x] End-to-end ViT FP8 export: `python examples/torch_onnx/torch_quant_to_onnx.py --timm_model_name vit_base_patch16_224 --quantize_mode fp8 --onnx_save_path vit_base_fp8.onnx` — expect log lines `Folded 48 weight Transpose nodes`, `Inserted FP8 weight DequantizeLinear for 1 Conv nodes`, and `Attention QDQ rewrites: ... inserted QDQ on 12 Softmax outputs`
- [x] trtexec FP8 strongly-typed build: `trtexec --onnx=vit_base_fp8.onnx --fp8 --stronglyTyped`
- [x] Accuracy within ~0.3 pp of FP16 baseline on ImageNet-1k subset

---------

Signed-off-by: ajrasane <131806219+ajrasane@users.noreply.github.com>
Signed-off-by: Keval Morabia <28916987+kevalmorabia97@users.noreply.github.com>
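To make the Softmax Q/DQ pass above concrete, here is an illustrative sketch (not the `fp8_exporter.py` implementation) of inserting an FP8 Q/DQ pair with the fixed 1/448 scale after a Softmax node. It assumes an onnx version with `FLOAT8E4M3FN` support; a production pass would also keep the graph topologically sorted and deduplicate initializers.

```python
import numpy as np
import onnx
from onnx import TensorProto, helper, numpy_helper


def insert_fp8_qdq_after_softmax(graph: onnx.GraphProto, softmax_node: onnx.NodeProto) -> None:
    """Insert an FP8 (E4M3) QuantizeLinear/DequantizeLinear pair on a Softmax output.

    Softmax outputs lie in [0, 1], so a fixed, data-independent scale of 1/448
    (448 is the E4M3 max) is enough and no calibration pass is required.
    """
    out = softmax_node.output[0]
    scale = numpy_helper.from_array(
        np.array(1.0 / 448.0, dtype=np.float32), name=f"{out}_fp8_scale"
    )
    # FP8 zero-point; assumes an onnx release that accepts FLOAT8E4M3FN in make_tensor.
    zero_point = helper.make_tensor(f"{out}_fp8_zp", TensorProto.FLOAT8E4M3FN, [], [0.0])
    graph.initializer.extend([scale, zero_point])

    q_out, dq_out = f"{out}_q", f"{out}_dq"
    q_node = helper.make_node("QuantizeLinear", [out, scale.name, zero_point.name], [q_out])
    dq_node = helper.make_node("DequantizeLinear", [q_out, scale.name, zero_point.name], [dq_out])

    # Rewire existing consumers of the Softmax output to read the dequantized tensor.
    for node in graph.node:
        for i, name in enumerate(node.input):
            if name == out:
                node.input[i] = dq_out
    graph.node.extend([q_node, dq_node])
    # Note: appending at the end leaves the node list un-sorted; re-topo-sort before saving.
```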
### What does this PR do?

Type of change: Bug fix <!-- Use one of the following: Bug fix, new feature, new example, new tests, documentation. -->

### Usage

Fixes a bug in the example script. We were trying to work out why our models were not that strong at long context; it turns out a recent refactor stopped passing the max sequence length through, so the default of 512 was used.

### Testing

<!-- Mention how have you tested your change if applicable. -->

### Before your PR is "*Ready for review*"

Make sure you read and follow [Contributor guidelines](https://github.com/NVIDIA/Model-Optimizer/blob/main/CONTRIBUTING.md) and your commits are signed (`git commit -s -S`). Make sure you read and follow the [Security Best Practices](https://github.com/NVIDIA/Model-Optimizer/blob/main/SECURITY.md#security-coding-practices-for-contributors) (e.g. avoiding hardcoded `trust_remote_code=True`, `torch.load(..., weights_only=False)`, `pickle`, etc.).

- Is this change backward compatible?: ✅ / ❌ / N/A
- If you copied code from any other sources or added a new PIP dependency, did you follow guidance in `CONTRIBUTING.md`: ✅ / ❌ / N/A
- Did you write any new necessary tests?: ✅ / ❌ / N/A
- Did you update [Changelog](https://github.com/NVIDIA/Model-Optimizer/blob/main/CHANGELOG.rst)?: ✅ / ❌ / N/A

### Additional Information

<!-- E.g. related issue. -->

<!-- This is an auto-generated comment: release notes by coderabbit.ai -->
## Summary by CodeRabbit

## Release Notes

* **Improvements**
  * Calibration data loading now enforces a maximum sequence/sample length during dataset preparation, ensuring calibration inputs adhere to configured length limits. This yields more predictable calibration behavior, reduces peak memory usage during calibration runs, and improves consistency of quantization preprocessing.
<!-- end of auto-generated comment: release notes by coderabbit.ai -->

---------

Signed-off-by: Michael Feil <63565275+michaelfeil@users.noreply.github.com>
Signed-off-by: Keval Morabia <28916987+kevalmorabia97@users.noreply.github.com>
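A minimal sketch of the fixed behavior, assuming a Hugging Face style tokenizer; the function and argument names here are illustrative and not the example script's actual API. The key point is that the configured maximum length is forwarded to tokenization instead of silently falling back to 512.

```python
def build_calib_batches(tokenizer, texts, max_sample_length=2048, batch_size=4):
    """Tokenize calibration texts while enforcing the configured maximum sample length."""
    batches = []
    for i in range(0, len(texts), batch_size):
        enc = tokenizer(
            texts[i : i + batch_size],
            truncation=True,
            max_length=max_sample_length,  # previously not forwarded, so the 512 default applied
            padding=True,  # assumes the tokenizer has a pad token configured
            return_tensors="pt",
        )
        batches.append(enc)
    return batches
```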
## Summary - Removes Mllama (Llama 3.2 Vision) model-type branches from the `llm_ptq` example (`hf_ptq.py`, `example_utils.py`) and drops the now-unused `MllamaImageProcessor` wrapper from `modelopt/torch/utils/`. - Drops the legacy `MllamaImageProcessor` path in `modelopt/torch/utils/vlm_dataset_utils.py`; the generic HF ProcessorMixin path handles the remaining cases. - Adds a CHANGELOG entry under 0.44 Backward Breaking Changes. ## Test plan - [x] CI lint / unit tests pass - [x] Smoke-run ``examples/llm_ptq/scripts/huggingface_example.sh --model <llm> --quant fp8`` (text-only path, non-mllama) <!-- This is an auto-generated comment: release notes by coderabbit.ai --> ## Summary by CodeRabbit * **Chores** * Removed Mllama (Llama 3.2 Vision) support from quantization examples. This includes removal of dedicated image processor implementation, specialized model handling, and related calibration logic. * Updated VLM image-text calibration guidance to use `--calib_with_images` flag with other supported VLMs instead of Mllama-specific processing paths. <!-- end of auto-generated comment: release notes by coderabbit.ai --> Signed-off-by: Chenjie Luo <chenjiel@nvidia.com> Signed-off-by: Keval Morabia <28916987+kevalmorabia97@users.noreply.github.com>
### Type of change

- [x] Bug fix (non-breaking change which fixes an issue)

### Description

Fixes #1088 — `RuntimeError: one of the variables needed for gradient computation has been modified by an inplace operation: IndexPutBackward0` when training with `eagle_mix_hidden_states=True`.

**Root cause:** In `HFEagleModel._eagle_training_forward`, the indexed assignment at line 991–994 modifies `eagle_input_hiddens` in-place while it is still part of the autograd computation graph.

**Fix:** Clone the tensor before the in-place assignment. This is the same pattern already used in the Megatron backend at `megatron_eagle.py:1201-1202`:

```python
# Clone to avoid inplace modification of view created in no_grad mode
eagle_module_input_hidden_states = eagle_module_input_hidden_states.clone()
```

The HF backend was missing this clone.

### Usage

```python
config["eagle_mix_hidden_states"] = True
config["eagle_ttt_steps"] = 2
mtsp.convert(model, mode=[("eagle", config)])
model.train()
outputs = model(input_ids=input_ids, labels=labels)
outputs.loss.backward()  # no longer crashes
```

### Testing

Added `test_eagle_mix_hidden_states_backward` parametrized over `eagle_ttt_steps` [1, 2] that:

- Converts a tiny LLaMA to EAGLE with `eagle_mix_hidden_states=True`
- Runs forward + backward pass
- Asserts loss is not None and gradients flow to `eagle_module`

```
pytest tests/unit/torch/speculative/plugins/test_hf_speculative.py::test_eagle_mix_hidden_states_backward -v
```

### Checklist

- [x] I have read the [contributor guidelines](CONTRIBUTING.md) and signed my commits
- [x] I have followed the [security best practices](SECURITY.md)
- [x] This change is backward compatible
- [x] I have followed third-party code and dependency guidelines
- [x] I have added tests that prove my fix is effective

<!-- This is an auto-generated comment: release notes by coderabbit.ai -->
## Summary by CodeRabbit

* **Bug Fixes**
  * Fixed gradient computation issue in speculative decoding during model training to ensure proper autograd behavior.
* **Tests**
  * Added regression test to validate gradient computation in speculative decoding scenarios.
<!-- end of auto-generated comment: release notes by coderabbit.ai -->

Signed-off-by: javierdejesusda <javier.dejesusj9@gmail.com>
Signed-off-by: Keval Morabia <28916987+kevalmorabia97@users.noreply.github.com>
### What does this PR do?

This PR fixes PTQ with image calibration for VLMs.

### Usage

```python
python3 examples/llm_ptq/hf_ptq.py --pyt_ckpt_path Qwen/Qwen3-VL-8B-Instruct --qformat fp8 --export_path Qwen3-VL-8B-Instruct-fp8 --trust_remote_code --kv_cache_qformat none --calib_with_images --calib_size 512
```

<!-- This is an auto-generated comment: release notes by coderabbit.ai -->
## Summary by CodeRabbit

* **Bug Fixes**
  * Image-text calibration now extends support to additional model architectures when image calibration is enabled.
  * Improved tokenizer truncation handling in multimodal dataset processing to prevent configuration conflicts when image inputs are present.
<!-- end of auto-generated comment: release notes by coderabbit.ai -->

Signed-off-by: Liana Mikaelyan <lmikaelyan@nvidia.com>
Signed-off-by: Keval Morabia <28916987+kevalmorabia97@users.noreply.github.com>
📝 Walkthrough

This pull request deprecates Mllama support in PTQ examples, introduces FP8 multi-head attention quantization optimizations in ONNX export, adds MatMul dimension-based exclusion logic to prevent small-GEMM quantization, refactors weight/bias extraction in quantization exports, updates documentation with third-party license notices, and includes comprehensive test coverage for new quantization features.

Changes
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~70 minutes

🚥 Pre-merge checks: ✅ 5 | ❌ 1

❌ Failed checks (1 warning)
✅ Passed checks (5 passed)
Actionable comments posted: 5
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
modelopt/torch/utils/vlm_dataset_utils.py (1)
441-448: ⚠️ Potential issue | 🔴 Critical: Fix logic bug in max_length condition that prevents truncation from ever being applied.

The condition `"images" not in kwargs` at line 447 will always be `False` because `"images"` is unconditionally added to `kwargs` at line 443. This means the `truncation` and `max_length` parameters are never passed to the processor, making the `max_length` function parameter ineffective.

If the intent is to skip truncation for multimodal cases (when images are present), the condition should be inverted to `"images" in kwargs`. If truncation should always apply when `max_length` is provided, the condition should be removed entirely.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@modelopt/torch/utils/vlm_dataset_utils.py` around lines 441 - 448, The current logic never applies truncation because "images" is always present in kwargs; update the max_length handling so truncation is applied when max_length is provided: remove the `"images" not in kwargs` check and simply, when max_length is not None, call kwargs.update({"truncation": True, "max_length": max_length}) (referencing the kwargs dict, max_length parameter, and the earlier creation that includes "text": list(prompts) and "images": list(images)).
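As a sketch of the resolution suggested in the prompt above (treat it as illustrative; the variable names follow the review comment rather than the file verbatim), the kwargs handling would look like this:

```python
def encode_multimodal_batch(processor, prompts, images, max_length=None):
    """Build processor kwargs so a configured max_length always triggers truncation."""
    kwargs = {"text": list(prompts), "images": list(images)}
    if max_length is not None:
        # The old guard (`"images" not in kwargs`) could never be True here,
        # so truncation/max_length were silently dropped.
        kwargs.update({"truncation": True, "max_length": max_length})
    return processor(**kwargs)
```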
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@examples/torch_onnx/torch_quant_to_onnx.py`:
- Around line 273-274: The current guard on the quantizer uses
torch.any(torch.isnan(amax)) or torch.all(amax <= 0) so it only disables q when
every channel/block is non-positive; change it to disable the quantizer when any
entry is invalid (NaN) or any entry is <= 0. Locate the use of amax and
q.disable() in torch_quant_to_onnx.py and replace the torch.all(amax <= 0)
condition with torch.any(amax <= 0) (or equivalent) so per-channel/per-block
zero/negative values trigger q.disable() and avoid 448/amax infinities.
- Around line 595-600: The auto-mode gate is too broad: change the any(...)
check used to set uses_fp8_conv_input so it only triggers when the
auto_quantization_formats actually include an FP8-family format (not merely
anything other than "INT8_DEFAULT_CFG"); update the condition in the
uses_fp8_conv_input computation to detect FP8-style formats (e.g., check fmt
strings for FP8-family identifiers like "fp8", "mxfp8", "nvfp4" or a specific
FP8-format set) and only call
_disable_low_channel_conv_input_quantizers(quantized_model) when such an
FP8-family format is present, keeping the existing references to
args.quantize_mode, args.auto_quantization_formats, uses_fp8_conv_input and
_disable_low_channel_conv_input_quantizers.
In `@modelopt/onnx/export/fp8_exporter.py`:
- Around line 88-122: The transpose-folding code currently rewires Transpose to
consume dq_op.outputs[0] without ensuring that dq_op.outputs[0] has no other
live consumers; update the logic around dq_op, transpose_to_remove and
cast_to_remove to first verify that the DQ output node (dq_op.outputs[0]) has
exactly one live consumer (the candidate Transpose or Cast→Transpose) before
mutating torch_weights and setting transpose_to_remove/cast_to_remove; if there
are any other consumers, skip folding and leave
cast_to_remove/transpose_to_remove as None. Apply the same single-consumer check
to the analogous block later in the file (the second transpose-folding
occurrence).
In `@modelopt/onnx/utils.py`:
- Around line 1422-1437: The underflow/overflow warning in _scale_fp32_to_fp16
is too coarse: change the condition to check elementwise where a non-zero FP32
value becomes zero in FP16 by replacing the combined np.any(...) logic with an
elementwise mask (e.g., mask = (fp16_data == 0) & (scale_data != 0)) and warn
only if np.any(np.isinf(fp16_data)) or np.any(mask); keep the existing inf check
and the rest of the in-place conversion logic using scale_init, fp16_data and
scale_data unchanged.
- Around line 1440-1474: fold_q_fp16_to_fp32_casts currently treats any
Cast(..., to=FLOAT) as FP16→FP32 and rewrites Q scales; update it to first
verify the cast source is actually FP16 before proceeding: for each Cast node
(in function fold_q_fp16_to_fp32_casts) look up the input tensor's dtype (from
graph.value_info, graph.input, or initializers) and only continue when that
dtype == onnx.TensorProto.FLOAT16; keep the existing logic that calls
_scale_fp32_to_fp16(initializers[...]) and _bypass_cast_node(onnx_model, node)
but skip/break for casts from BF16 or other dtypes so you do not rewrite Q
scales for non-FP16→FP32 casts.
---
Outside diff comments:
In `@modelopt/torch/utils/vlm_dataset_utils.py`:
- Around line 441-448: The current logic never applies truncation because
"images" is always present in kwargs; update the max_length handling so
truncation is applied when max_length is provided: remove the `"images" not in
kwargs` check and simply, when max_length is not None, call
kwargs.update({"truncation": True, "max_length": max_length}) (referencing the
kwargs dict, max_length parameter, and the earlier creation that includes
"text": list(prompts) and "images": list(images)).
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: 8d54ab83-f2cc-4766-b089-ee6c30d50c0e
📒 Files selected for processing (31)
- CHANGELOG.rst
- LICENSE_HEADER
- README.md
- docs/source/getting_started/_installation_for_Linux.rst
- docs/source/getting_started/windows/_installation_for_Windows.rst
- examples/llm_eval/lm_eval_hf.py
- examples/llm_ptq/example_utils.py
- examples/llm_ptq/hf_ptq.py
- examples/puzzletron/README.md
- examples/torch_onnx/torch_quant_to_onnx.py
- examples/vllm_serve/vllm_reload_utils.py
- modelopt/onnx/export/fp8_exporter.py
- modelopt/onnx/quantization/graph_utils.py
- modelopt/onnx/utils.py
- modelopt/torch/_deploy/utils/torch_onnx.py
- modelopt/torch/export/plugins/vllm_fakequant_megatron.py
- modelopt/torch/export/unified_export_megatron.py
- modelopt/torch/quantization/export_onnx.py
- modelopt/torch/quantization/nn/__init__.py
- modelopt/torch/quantization/nn/modules/quant_layernorm.py
- modelopt/torch/quantization/plugins/huggingface.py
- modelopt/torch/quantization/plugins/vllm.py
- modelopt/torch/speculative/plugins/transformers.py
- modelopt/torch/utils/image_processor.py
- modelopt/torch/utils/vlm_dataset_utils.py
- tests/unit/onnx/quantization/test_fp8_mha_exporter.py
- tests/unit/onnx/quantization/test_graph_utils.py
- tests/unit/onnx/test_fold_casts.py
- tests/unit/torch/quantization/plugins/test_nested_attention_skip.py
- tests/unit/torch/quantization/test_quant_layernorm.py
- tests/unit/torch/speculative/plugins/test_hf_speculative.py
💤 Files with no reviewable changes (1)
- modelopt/torch/utils/image_processor.py
    if torch.any(torch.isnan(amax)) or torch.all(amax <= 0):
        q.disable()
Handle partially-dead amax tensors too.
This only disables a quantizer when all amax entries are <= 0, but FP8 export will still blow up if a per-channel/per-block quantizer has just one zero/negative entry. That makes the current guard miss exactly the mixed-validity case that still produces 448 / amax infinities.
Proposed fix
- if torch.any(torch.isnan(amax)) or torch.all(amax <= 0):
+ if torch.any(torch.isnan(amax)) or torch.any(amax <= 0):
          q.disable()

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@examples/torch_onnx/torch_quant_to_onnx.py` around lines 273 - 274, The
current guard on the quantizer uses torch.any(torch.isnan(amax)) or
torch.all(amax <= 0) so it only disables q when every channel/block is
non-positive; change it to disable the quantizer when any entry is invalid (NaN)
or any entry is <= 0. Locate the use of amax and q.disable() in
torch_quant_to_onnx.py and replace the torch.all(amax <= 0) condition with
torch.any(amax <= 0) (or equivalent) so per-channel/per-block zero/negative
values trigger q.disable() and avoid 448/amax infinities.
    uses_fp8_conv_input = args.quantize_mode in ("fp8", "mxfp8", "nvfp4") or (
        args.quantize_mode == "auto"
        and any(fmt != "INT8_DEFAULT_CFG" for fmt in args.auto_quantization_formats)
    )
    if uses_fp8_conv_input:
        _disable_low_channel_conv_input_quantizers(quantized_model)
Narrow the auto-mode gate to actual FP8-family formats.
any(fmt != "INT8_DEFAULT_CFG" ...) also fires for INT4_AWQ_CFG, so auto mode will disable low-channel Conv input quantizers even when no FP8-style format is in the search space. That changes the search budget/accuracy for a workaround that is only needed for TRT_FP8QuantizeLinear.
Proposed fix
+ fp8_family_formats = {
+ "FP8_DEFAULT_CFG",
+ "MXFP8_DEFAULT_CFG",
+ "NVFP4_AWQ_LITE_CFG",
+ "NVFP4_DEFAULT_CFG",
+ }
uses_fp8_conv_input = args.quantize_mode in ("fp8", "mxfp8", "nvfp4") or (
args.quantize_mode == "auto"
- and any(fmt != "INT8_DEFAULT_CFG" for fmt in args.auto_quantization_formats)
+ and any(fmt in fp8_family_formats for fmt in args.auto_quantization_formats)
      )

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@examples/torch_onnx/torch_quant_to_onnx.py` around lines 595 - 600, The
auto-mode gate is too broad: change the any(...) check used to set
uses_fp8_conv_input so it only triggers when the auto_quantization_formats
actually include an FP8-family format (not merely anything other than
"INT8_DEFAULT_CFG"); update the condition in the uses_fp8_conv_input computation
to detect FP8-style formats (e.g., check fmt strings for FP8-family identifiers
like "fp8", "mxfp8", "nvfp4" or a specific FP8-format set) and only call
_disable_low_channel_conv_input_quantizers(quantized_model) when such an
FP8-family format is present, keeping the existing references to
args.quantize_mode, args.auto_quantization_formats, uses_fp8_conv_input and
_disable_low_channel_conv_input_quantizers.
    # Pre-transpose constant weights if DQ feeds ``Transpose → MatMul`` (or
    # ``Cast → Transpose → MatMul`` after fp16 conversion) so TRT sees DQ→MatMul.
    # Control flow: scan candidates; a Cast-wrapped candidate is accepted only if it
    # leads to a Transpose; a bare Transpose whose all consumers are MatMul wins and
    # breaks the loop. Any other shape defaults `cast_to_remove` back to None and
    # continues scanning.
    transpose_to_remove = None
    cast_to_remove = None
    for candidate in list(dq_op.outputs[0].outputs):
        if candidate.op == "Cast":
            cast_to_remove = candidate
            candidate = next(
                (c for c in candidate.outputs[0].outputs if c.op == "Transpose"),
                None,
            )
            if candidate is None:
                cast_to_remove = None
                continue
        if candidate.op != "Transpose":
            cast_to_remove = None
            continue
        t_consumers = list(candidate.outputs[0].outputs)
        # Only fold the transpose when every downstream consumer is MatMul; otherwise
        # non-MatMul consumers would observe the un-transposed weights.
        if t_consumers and all(c.op == "MatMul" for c in t_consumers):
            perm = candidate.attrs.get("perm", None)
            torch_weights = (
                torch_weights.permute(*perm).contiguous()
                if perm is not None
                else torch_weights.T.contiguous()
            )
            transpose_to_remove = candidate
        else:
            cast_to_remove = None
        break
Only fold the transpose when the DQ output has no other live consumers.
This rewrite pre-transposes the stored weight and then rewires the Transpose branch to consume dq_op.outputs[0] directly, but it never checks whether that same DQ output also feeds some other branch. If it does, that branch now starts seeing the transposed weight too, which is a silent graph corruption.
Proposed fix
transpose_to_remove = None
cast_to_remove = None
- for candidate in list(dq_op.outputs[0].outputs):
+ dq_consumers = list(dq_op.outputs[0].outputs)
+ for candidate in dq_consumers:
if candidate.op == "Cast":
cast_to_remove = candidate
candidate = next(
(c for c in candidate.outputs[0].outputs if c.op == "Transpose"),
None,
@@
if t_consumers and all(c.op == "MatMul" for c in t_consumers):
+ if len(dq_consumers) != 1:
+ transpose_to_remove = None
+ cast_to_remove = None
+ break
perm = candidate.attrs.get("perm", None)
torch_weights = (
torch_weights.permute(*perm).contiguous()
if perm is not None
                else torch_weights.T.contiguous()

Also applies to: 142-151
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@modelopt/onnx/export/fp8_exporter.py` around lines 88 - 122, The
transpose-folding code currently rewires Transpose to consume dq_op.outputs[0]
without ensuring that dq_op.outputs[0] has no other live consumers; update the
logic around dq_op, transpose_to_remove and cast_to_remove to first verify that
the DQ output node (dq_op.outputs[0]) has exactly one live consumer (the
candidate Transpose or Cast→Transpose) before mutating torch_weights and setting
transpose_to_remove/cast_to_remove; if there are any other consumers, skip
folding and leave cast_to_remove/transpose_to_remove as None. Apply the same
single-consumer check to the analogous block later in the file (the second
transpose-folding occurrence).
def _scale_fp32_to_fp16(scale_init: onnx.TensorProto) -> None:
    """Convert a scalar Q/DQ scale initializer in-place from FP32 to FP16.

    Warns if any non-zero scale saturates to 0/inf in FP16 (out of FP16 representable range).
    """
    if scale_init.data_type != onnx.TensorProto.FLOAT:
        return
    scale_data = np.frombuffer(scale_init.raw_data, dtype=np.float32)
    if not scale_data.size:
        scale_data = np.array(scale_init.float_data, dtype=np.float32)
    fp16_data = scale_data.astype(np.float16)
    if np.any(np.isinf(fp16_data)) or (np.any(fp16_data == 0) and np.any(scale_data != 0)):
        logger.warning(f"Q/DQ scale '{scale_init.name}' overflows or underflows when cast to FP16")
    scale_init.data_type = onnx.TensorProto.FLOAT16
    scale_init.raw_data = fp16_data.tobytes()
    del scale_init.float_data[:]
Make the underflow warning elementwise.
np.any(fp16_data == 0) and np.any(scale_data != 0) fires whenever the tensor contains any real zero plus any non-zero value. That produces false overflow/underflow warnings for mixed scale tensors.
Suggested fix
- if np.any(np.isinf(fp16_data)) or (np.any(fp16_data == 0) and np.any(scale_data != 0)):
+ if np.any(np.isinf(fp16_data)) or np.any((fp16_data == 0) & (scale_data != 0)):
logger.warning(f"Q/DQ scale '{scale_init.name}' overflows or underflows when cast to FP16")📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
def _scale_fp32_to_fp16(scale_init: onnx.TensorProto) -> None:
    """Convert a scalar Q/DQ scale initializer in-place from FP32 to FP16.

    Warns if any non-zero scale saturates to 0/inf in FP16 (out of FP16 representable range).
    """
    if scale_init.data_type != onnx.TensorProto.FLOAT:
        return
    scale_data = np.frombuffer(scale_init.raw_data, dtype=np.float32)
    if not scale_data.size:
        scale_data = np.array(scale_init.float_data, dtype=np.float32)
    fp16_data = scale_data.astype(np.float16)
    if np.any(np.isinf(fp16_data)) or np.any((fp16_data == 0) & (scale_data != 0)):
        logger.warning(f"Q/DQ scale '{scale_init.name}' overflows or underflows when cast to FP16")
    scale_init.data_type = onnx.TensorProto.FLOAT16
    scale_init.raw_data = fp16_data.tobytes()
    del scale_init.float_data[:]
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@modelopt/onnx/utils.py` around lines 1422 - 1437, The underflow/overflow
warning in _scale_fp32_to_fp16 is too coarse: change the condition to check
elementwise where a non-zero FP32 value becomes zero in FP16 by replacing the
combined np.any(...) logic with an elementwise mask (e.g., mask = (fp16_data ==
0) & (scale_data != 0)) and warn only if np.any(np.isinf(fp16_data)) or
np.any(mask); keep the existing inf check and the rest of the in-place
conversion logic using scale_init, fp16_data and scale_data unchanged.
def fold_q_fp16_to_fp32_casts(onnx_model: onnx.ModelProto) -> onnx.ModelProto:
    """Remove ``Cast(FP16→FP32) → Q`` patterns inserted by ``convert_float_to_float16``.

    The Q scale is rewritten to FP16 so Q consumes the FP16 graph directly. Skipped for
    opsets below ``BASE_MIN_OPSET`` since FP16 Q scales require opset >= 19.
    """
    if get_opset_version(onnx_model) < BASE_MIN_OPSET:
        logger.debug(
            f"Skipping fold_q_fp16_to_fp32_casts: opset < {BASE_MIN_OPSET} (FP16 Q scale unsupported)"
        )
        return onnx_model

    consumer_map: dict[str, list[onnx.NodeProto]] = {}
    for node in onnx_model.graph.node:
        for inp in node.input:
            consumer_map.setdefault(inp, []).append(node)
    initializers = {init.name: init for init in onnx_model.graph.initializer}

    to_remove = []
    for node in onnx_model.graph.node:
        if node.op_type != "Cast":
            continue
        cast_to = next((a.i for a in node.attribute if a.name == "to"), None)
        if cast_to != onnx.TensorProto.FLOAT:
            continue
        consumers = consumer_map.get(node.output[0], [])
        if not consumers or not all(c.op_type in _Q_OPS for c in consumers):
            continue

        for q_node in consumers:
            if len(q_node.input) >= 2 and q_node.input[1] in initializers:
                _scale_fp32_to_fp16(initializers[q_node.input[1]])

        _bypass_cast_node(onnx_model, node)
        to_remove.append(node)
Guard this fold to actual FP16→FP32 casts.
The implementation only checks Cast(..., to=FLOAT), so it will also bypass BF16→FP32 or any other *→FP32 cast and then rewrite the Q scale to FP16. That changes semantics and can leave the quantizer consuming a dtype this pass was not meant to legalize.
Suggested fix
def fold_q_fp16_to_fp32_casts(onnx_model: onnx.ModelProto) -> onnx.ModelProto:
@@
- consumer_map: dict[str, list[onnx.NodeProto]] = {}
+ consumer_map: dict[str, list[onnx.NodeProto]] = {}
for node in onnx_model.graph.node:
for inp in node.input:
consumer_map.setdefault(inp, []).append(node)
initializers = {init.name: init for init in onnx_model.graph.initializer}
+ type_map = _build_tensor_type_map(onnx_model)
@@
cast_to = next((a.i for a in node.attribute if a.name == "to"), None)
if cast_to != onnx.TensorProto.FLOAT:
continue
+ if type_map.get(node.input[0]) != onnx.TensorProto.FLOAT16:
+ continue
consumers = consumer_map.get(node.output[0], [])
if not consumers or not all(c.op_type in _Q_OPS for c in consumers):
            continue

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@modelopt/onnx/utils.py` around lines 1440 - 1474, fold_q_fp16_to_fp32_casts
currently treats any Cast(..., to=FLOAT) as FP16→FP32 and rewrites Q scales;
update it to first verify the cast source is actually FP16 before proceeding:
for each Cast node (in function fold_q_fp16_to_fp32_casts) look up the input
tensor's dtype (from graph.value_info, graph.input, or initializers) and only
continue when that dtype == onnx.TensorProto.FLOAT16; keep the existing logic
that calls _scale_fp32_to_fp16(initializers[...]) and
_bypass_cast_node(onnx_model, node) but skip/break for casts from BF16 or other
dtypes so you do not rewrite Q scales for non-FP16→FP32 casts.
Codecov Report

❌ Patch coverage is

Additional details and impacted files

@@ Coverage Diff @@
## release/0.44.0 #1350 +/- ##
==================================================
+ Coverage 75.38% 75.87% +0.49%
==================================================
Files 462 462
Lines 49960 50173 +213
==================================================
+ Hits 37662 38069 +407
+ Misses 12298 12104 -194
Flags with carried forward coverage won't be shown. ☔ View full report in Codecov by Sentry.
Adds `.claude/skills/release-cherry-pick/SKILL.md` — a Claude Code skill for cherry-picking labeled PRs to a release branch. Invoke with `/release-cherry-pick <version>`. See this PR created with the skill: #1350 <!-- This is an auto-generated comment: release notes by coderabbit.ai --> ## Summary by CodeRabbit * **New Features** * Added automated release cherry-pick workflow to streamline selecting and applying multiple PRs into release branches. <!-- end of auto-generated comment: release notes by coderabbit.ai --> --------- Signed-off-by: Keval Morabia <28916987+kevalmorabia97@users.noreply.github.com> Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>
Cherry-picked PRs
Summary by CodeRabbit
Release Notes
Documentation
Deprecations
`--calib_with_images` with supported models.

Bug Fixes
New Features