[6110209] Patch zero FP16 scales in INT4_AWQ ONNX export #1353
kevalmorabia97 merged 1 commit into main from
Conversation
🧹 Nitpick comments (1)
tests/unit/onnx/quantization/test_qdq_utils.py (1)
1058-1059: Consider expanding parametrization to include QuantizeLinear variants. The production collector now supports `QuantizeLinear` and `TRT_INT4QuantizeLinear` too; adding them here would fully lock in op-type coverage for initializer-backed scales.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tests/unit/onnx/quantization/test_qdq_utils.py` around lines 1058 - 1059, Update the parametrization on the test named test_zero_scale_initializer_fed_to_dq_is_patched to also cover QuantizeLinear variants: include "QuantizeLinear" and "TRT_INT4QuantizeLinear" in the `@pytest.mark.parametrize` list (alongside "DequantizeLinear" and "TRT_INT4DequantizeLinear") so the test exercises initializer-backed scales for both quantize and dequantize op types; locate the decorator above the test function to modify the list of dq_op_type values.
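A minimal sketch of the expanded decorator the comment asks for — the test and parameter names follow the AI-agent prompt above, so the exact decorator in the repo may differ:

```python
import pytest

# Sketch only: the four op-type strings come from the review comment; the
# parameter and test names follow the AI-agent prompt above.
@pytest.mark.parametrize(
    "dq_op_type",
    [
        "DequantizeLinear",
        "TRT_INT4DequantizeLinear",
        "QuantizeLinear",
        "TRT_INT4QuantizeLinear",
    ],
)
def test_zero_scale_initializer_fed_to_dq_is_patched(dq_op_type):
    ...  # existing test body, unchanged
```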
📒 Files selected for processing (3)
- CHANGELOG.rst
- modelopt/onnx/quantization/qdq_utils.py
- tests/unit/onnx/quantization/test_qdq_utils.py
Codecov Report
✅ All modified and coverable lines are covered by tests.

Additional details and impacted files:

```
@@            Coverage Diff             @@
##             main    #1353      +/-   ##
==========================================
- Coverage   76.90%   72.96%   -3.95%
==========================================
  Files         471      471
  Lines       50562    50570       +8
==========================================
- Hits        38886    36898    -1988
- Misses      11676    13672    +1996
```

Flags with carried forward coverage won't be shown. ☔ View full report in Codecov by Sentry.
> - Add offline DFlash speculative decoding training. Train the draft module from pre-computed base-model hidden states dumped by ``examples/speculative_decoding/collect_hidden_states/compute_hidden_states_hf.py``; base-model transformer layers are deleted after conversion to save memory. Controlled by the auto-derived ``dflash_offline`` flag on ``DFlashConfig`` (derived from ``data_args.offline_data_path``). The dump scripts now share ``collect_hidden_states/common.py`` for aux-layer selection (``--aux-layers eagle|dflash|<list>``) and optional assistant-token ``loss_mask`` for answer-only-loss training.
>
> **Bug Fixes**
Doesn't this need to go in the 0.44 section below?
I think we should not include this in the changelog. Removed.
a3385aa to 9fc8306
🧹 Nitpick comments (1)
tests/unit/onnx/quantization/test_qdq_utils.py (1)
1076-1110: LGTM! Legacy Constant-node path is properly regression-tested. This ensures the existing code path continues to work after the refactoring.
Consider adding a float32 scale test case to strengthen dtype preservation coverage. The current tests only use float16 scales.
🧪 Optional: Add float32 scale test
```python
def test_float32_scale_dtype_preserved(self):
    """Verify float32 scales remain float32 after patching."""
    from modelopt.onnx.quantization.qdq_utils import replace_zero_scale_with_smallest_nonzero

    # Build model with float32 scale containing zeros
    weight_data = np.random.randint(-8, 8, size=(3, 4), dtype=np.int8)
    weight_tensor = numpy_helper.from_array(weight_data, "weight")
    scale_data = np.array([0.0, 1e-3, 0.0], dtype=np.float32).reshape(3, 1)
    scale_tensor = numpy_helper.from_array(scale_data, "scale")
    input_tensor = helper.make_tensor_value_info("input", TensorProto.FLOAT, [None, 3])
    dq_node = helper.make_node(
        "DequantizeLinear", inputs=["weight", "scale"], outputs=["dq_output"], name="dq"
    )
    graph = helper.make_graph(
        nodes=[dq_node],
        name="test_graph",
        inputs=[input_tensor],
        outputs=[helper.make_tensor_value_info("dq_output", TensorProto.FLOAT, [3, 4])],
        initializer=[weight_tensor, scale_tensor],
    )
    model = helper.make_model(graph)

    patched = replace_zero_scale_with_smallest_nonzero(model)

    scale_init = next(i for i in patched.graph.initializer if i.name == "scale")
    assert scale_init.data_type == TensorProto.FLOAT  # float32 preserved
    scale_arr = numpy_helper.to_array(scale_init)
    assert not (scale_arr == 0).any()
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tests/unit/onnx/quantization/test_qdq_utils.py` around lines 1076 - 1110, Add a new unit test function (e.g. test_float32_scale_dtype_preserved) in tests/unit/onnx/quantization/test_qdq_utils.py that constructs a model with a float32 scale initializer (use numpy_helper.from_array with dtype float32) consumed by a DequantizeLinear node, calls replace_zero_scale_with_smallest_nonzero(model), then locate the patched initializer by name (e.g. "scale") and assert its data_type == TensorProto.FLOAT and that the scale array contains no zeros; reference helpers used in the file such as replace_zero_scale_with_smallest_nonzero, numpy_helper.from_array, DequantizeLinear, and TensorProto.FLOAT to match existing patterns.
📒 Files selected for processing (2)
- modelopt/onnx/quantization/qdq_utils.py
- tests/unit/onnx/quantization/test_qdq_utils.py
replace_zero_scale_with_smallest_nonzero() in qdq_utils.py only inspected QuantizeLinear consumers and Constant-node producers, which made it a no-op for INT4_AWQ exports — those use DequantizeLinear (default and trt:: domain) consumers and store scales as graph initializers. Zero scales produced when the FP32→FP16 cast underflows therefore reached TensorRT, causing trtexec --stronglyTyped to fail with "Scale coefficients must all be positive".

Extend the sanitizer to also walk DequantizeLinear / TRT_INT4DequantizeLinear nodes and to patch initializer-backed scales, while preserving dtype. Add regression tests for both the initializer + DQ path (default and trt:: domain) and the legacy Constant + QuantizeLinear path.

Signed-off-by: ajrasane <131806219+ajrasane@users.noreply.github.com>
9fc8306 to 18d46e5
cjluo-nv left a comment
Bot review — DM the bot to share feedback.
Clean, well-scoped bug fix. The two bugs are clearly explained and the fix is correct:
- Scale collection now covers all 4 QDQ op types (QuantizeLinear, DequantizeLinear, TRT_INT4QuantizeLinear, TRT_INT4DequantizeLinear) instead of only QuantizeLinear.
- Scale patching now handles both graph initializers (INT4_AWQ path) and Constant nodes (legacy path), with proper dtype preservation via `.astype(tensor.dtype)`.
- The `dtype.kind == "f"` guard is a good addition to avoid corrupting non-float scale tensors.
- Tests cover both the initializer path (parameterized over DequantizeLinear and TRT_INT4DequantizeLinear) and the legacy Constant-node path — all with assertions on zero elimination, positivity, and dtype preservation.
The change is small (~30 lines of logic + ~90 lines of tests), backward compatible, and well tested.
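A rough sketch of the patching behavior those bullets describe — the helper name and the replacement-value policy are assumptions for illustration, not the exact code in `qdq_utils.py`:

```python
import numpy as np
from onnx import numpy_helper

def patch_zero_scale_initializer(initializer):
    """Sketch: return a copy of a scale initializer with zero entries made positive."""
    arr = numpy_helper.to_array(initializer)
    if arr.dtype.kind != "f":  # guard mentioned in the review: leave non-float tensors alone
        return initializer
    nonzero = arr[arr != 0]
    # Assumed replacement policy: reuse the tensor's smallest nonzero magnitude,
    # falling back to the FP16 minimum subnormal if every entry is zero.
    fill = np.abs(nonzero).min() if nonzero.size else np.float16(5.96e-8)
    patched = np.where(arr == 0, fill, arr).astype(arr.dtype)  # preserve original dtype
    return numpy_helper.from_array(patched, initializer.name)
```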
### What does this PR do?
Type of change: Bug fix
Fix `replace_zero_scale_with_smallest_nonzero()` in
`modelopt/onnx/quantization/qdq_utils.py` so the FP16-scale sanitizer
actually runs for INT4_AWQ ONNX exports.
The function is supposed to ensure all FP16 scales are strictly positive
before the model reaches TensorRT, since `trtexec --stronglyTyped`
asserts `scaleAllPositive`. It had two latent bugs that made it a
complete no-op for INT4_AWQ:
1. It only collected scales from `QuantizeLinear` consumers — but
INT4_AWQ exports use `DequantizeLinear` (default domain) and
`TRT_INT4DequantizeLinear` (trt:: domain). There are zero
`QuantizeLinear` nodes in such graphs, so the collected set was empty.
2. It only patched scales emitted by `Constant` nodes — but INT4_AWQ
stores scales as graph initializers.
When the FP32→FP16 cast in `_convert_fp32_init_to_fp16()` underflowed
small amax values to `0.0` (FP16 min subnormal is 5.96e-8), those zeros
sailed through into the exported ONNX. TRT then rejected the model with:
```
Assertion failed: (scaleAllPositive || allowNegativeScale): Scale coefficients must all be positive
```
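The underflow itself is easy to reproduce in isolation with NumPy (a standalone illustration, not code from the repo):

```python
import numpy as np

amax = np.float32(2e-8)                      # below half the FP16 minimum subnormal (~5.96e-8)
print(amax.astype(np.float16))               # 0.0: the FP32->FP16 cast underflows the scale to zero
print(np.float32(1e-7).astype(np.float16))   # a tiny but nonzero FP16 value survives the cast
```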
This PR extends the sanitizer to:
- Walk `QuantizeLinear` / `DequantizeLinear` / `TRT_INT4QuantizeLinear`
/ `TRT_INT4DequantizeLinear` nodes when collecting scale tensor names (see the sketch after this list).
- Patch zero entries in float-typed graph initializers in addition to
`Constant`-node values, preserving the original dtype.
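A minimal sketch of the extended scale-name collection from the first bullet above — the function name and loop structure are illustrative, not the actual implementation:

```python
# Illustrative only: walks an onnx.GraphProto and returns the names of the
# scale inputs (input index 1) of every Q/DQ node variant the PR now handles.
QDQ_OP_TYPES = {
    "QuantizeLinear",
    "DequantizeLinear",
    "TRT_INT4QuantizeLinear",
    "TRT_INT4DequantizeLinear",
}

def collect_scale_tensor_names(graph):
    names = set()
    for node in graph.node:
        if node.op_type in QDQ_OP_TYPES and len(node.input) > 1:
            names.add(node.input[1])
    return names
```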
Files modified:
- `modelopt/onnx/quantization/qdq_utils.py` — fix.
- `tests/unit/onnx/quantization/test_qdq_utils.py` —
`TestReplaceZeroScaleWithSmallestNonzero` regression tests.
- `CHANGELOG.rst` — bug-fix entry under 0.45.
### Usage
```bash
# Repro that previously failed and now succeeds:
python torch_quant_to_onnx.py \
--quantize_mode=int4_awq \
--timm_model_name=vit_base_patch16_224 \
--onnx_save_path=/tmp/vit_base_patch16_224.int4_awq.onnx \
--calibration_data_size=32
trtexec --onnx=/tmp/vit_base_patch16_224.int4_awq.onnx --stronglyTyped --skipInference
```
### Testing
- New tests pass: `pytest
tests/unit/onnx/quantization/test_qdq_utils.py::TestReplaceZeroScaleWithSmallestNonzero
-v` (3 passed).
- Full file: `pytest tests/unit/onnx/quantization/test_qdq_utils.py` (25
passed).
- Broader sanity: `pytest tests/unit/onnx/quantization/` (288 passed).
- Smoke test on a synthetic model with explicit zero scales: 3 zeros → 0
zeros, all positive, FP16 dtype preserved.
- `pre-commit run --files <changed>` clean.
### Before your PR is "*Ready for review*"
Make sure you read and follow [Contributor
guidelines](https://github.com/NVIDIA/Model-Optimizer/blob/main/CONTRIBUTING.md)
and your commits are signed (`git commit -s -S`).
Make sure you read and follow the [Security Best
Practices](https://github.com/NVIDIA/Model-Optimizer/blob/main/SECURITY.md#security-coding-practices-for-contributors)
(e.g. avoiding hardcoded `trust_remote_code=True`, `torch.load(...,
weights_only=False)`, `pickle`, etc.).
- Is this change backward compatible?: ✅
- If you copied code from any other sources or added a new PIP
dependency, did you follow guidance in `CONTRIBUTING.md`: N/A
- Did you write any new necessary tests?: ✅
- Did you update
[Changelog](https://github.com/NVIDIA/Model-Optimizer/blob/main/CHANGELOG.rst)?:
✅
### Additional Information
- NVBug: 6110209
- Reproduces with `nvidia-modelopt==0.44.0rc1`, TensorRT 10.15.1.29, on
B200 / H20.
- Root cause introduced when the `TRT_INT4DequantizeLinear` export path
was added (PR #575, commit `0a4f0a8b`); that PR didn't update the
sanitizer to handle the new node type or scale storage.
<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->
## Summary by CodeRabbit
* **Bug Fixes**
* Zero scale values in quantization/dequantization ops (including
additional operator variants) are now replaced with the smallest nonzero
fp16 scale for matching tensors; replacements preserve the original
tensor data type and handle scales provided via initializers or constant
nodes.
* **Tests**
* Added regression tests covering initializer- and constant-backed scale
tensors across multiple operator configurations to ensure zeros are
eliminated and dtype is preserved.
<!-- end of auto-generated comment: release notes by coderabbit.ai -->
Signed-off-by: ajrasane <131806219+ajrasane@users.noreply.github.com>
Signed-off-by: Keval Morabia <28916987+kevalmorabia97@users.noreply.github.com>
#1368 #1373 #1359 #1361 #1325 #1369 #1370 #1371 #1375 #1386 #1353 #1356 #1390 (#1385)

## Cherry-picked PRs

- #1352
- #1351
- #1330
- #1354
- #1355
- #1360
- #1342
- #1324
- #1340
- #1368
- #1373
- #1359
- #1361
- #1325
- #1369
- #1370
- #1371
- #1375
- #1386
- #1353
- #1356
- #1390

<!-- This is an auto-generated comment: release notes by coderabbit.ai -->
## Summary by CodeRabbit

* **New Features**
  * Added Python 3.14 support (basic unit tests verified; production defaults on Python 3.12)
  * Added Windows CUDA 13.x installation guidance
  * Introduced LLM ONNX export utilities with quantization support
  * Extended Medusa mode support in speculative decoding pipeline
* **Bug Fixes**
  * Fixed FP8 quantization for vision transformer multi-head attention
  * Improved MoE expert handling in quantization calibration and inference
  * Enhanced ONNX graph utilities for FP8 weight transformation
* **Documentation**
  * Comprehensive Minitron pruning + distillation + quantization + vLLM tutorials with ablation studies
  * Megatron data preparation guide for tokenization workflows
  * Puzzletron distillation results and cross-reference updates
<!-- end of auto-generated comment: release notes by coderabbit.ai -->

---------

Signed-off-by: Keval Morabia <28916987+kevalmorabia97@users.noreply.github.com>
Signed-off-by: ajrasane <131806219+ajrasane@users.noreply.github.com>
Signed-off-by: Grzegorz Karch <gkarch@nvidia.com>
Signed-off-by: Grzegorz K. Karch <grzegorz-k-karch@users.noreply.github.com>
Signed-off-by: Chenjie Luo <chenjiel@nvidia.com>
Signed-off-by: Asha Anoosheh <aanoosheh@nvidia.com>
Signed-off-by: Jennifer Chen <jennifchen@nvidia.com>
Signed-off-by: weimingc <17592131+meenchen@users.noreply.github.com>
Signed-off-by: ynankani <ynankani@nvidia.com>
Signed-off-by: h-guo18 <67671475+h-guo18@users.noreply.github.com>
Signed-off-by: vipandya <vipandya@nvidia.com>
Signed-off-by: dmoodie <dmoodie@nvidia.com>
Signed-off-by: Hrishith Thadicherla <hthadicherla@nvidia.com>
Signed-off-by: Ye Yu <yeyu@nvidia.com>
Signed-off-by: Kai Xu <kaix@nvidia.com>
Signed-off-by: Suguna Velury <178320438+sugunav14@users.noreply.github.com>
Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>
Co-authored-by: Ajinkya Rasane <131806219+ajrasane@users.noreply.github.com>
Co-authored-by: Grzegorz K. Karch <grzegorz-k-karch@users.noreply.github.com>
Co-authored-by: CodeRabbit <noreply@coderabbit.ai>
Co-authored-by: Chenjie Luo <108829653+cjluo-nv@users.noreply.github.com>
Co-authored-by: Asha Anoosheh <aanoosheh@nvidia.com>
Co-authored-by: Jenny Chen <jennifchen@nvidia.com>
Co-authored-by: Wei-Ming Chen <17592131+meenchen@users.noreply.github.com>
Co-authored-by: ynankani <ynankani@nvidia.com>
Co-authored-by: h-guo18 <67671475+h-guo18@users.noreply.github.com>
Co-authored-by: vishalpandya1990 <vishalpandya1990@gmail.com>
Co-authored-by: dthienan-nv <dmoodie@nvidia.com>
Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-authored-by: Hrishith Thadicherla <99313418+hthadicherla@users.noreply.github.com>
Co-authored-by: yeyu-nvidia <yeyu@nvidia.com>
Co-authored-by: kaix-nv <kaix@nvidia.com>
Co-authored-by: sugunav14 <178320438+sugunav14@users.noreply.github.com>