[6110209] Patch zero FP16 scales in INT4_AWQ ONNX export #1353
kevalmorabia97 merged 1 commit into main from
Conversation
🧹 Nitpick comments (1)
tests/unit/onnx/quantization/test_qdq_utils.py (1)
1058-1059: Consider expanding parametrization to include QuantizeLinear variants. The production collector now supports `QuantizeLinear` and `TRT_INT4QuantizeLinear` too; adding them here would fully lock in op-type coverage for initializer-backed scales.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tests/unit/onnx/quantization/test_qdq_utils.py` around lines 1058 - 1059, Update the parametrization on the test named test_zero_scale_initializer_fed_to_dq_is_patched to also cover QuantizeLinear variants: include "QuantizeLinear" and "TRT_INT4QuantizeLinear" in the `@pytest.mark.parametrize` list (alongside "DequantizeLinear" and "TRT_INT4DequantizeLinear") so the test exercises initializer-backed scales for both quantize and dequantize op types; locate the decorator above the test function to modify the list of dq_op_type values.
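A minimal sketch of the expanded decorator the comment asks for — the test and parameter names follow the AI-agent prompt above, so the exact decorator in the repo may differ:

```python
import pytest

# Sketch only: the four op-type strings come from the review comment; the
# parameter and test names follow the AI-agent prompt above.
@pytest.mark.parametrize(
    "dq_op_type",
    [
        "DequantizeLinear",
        "TRT_INT4DequantizeLinear",
        "QuantizeLinear",
        "TRT_INT4QuantizeLinear",
    ],
)
def test_zero_scale_initializer_fed_to_dq_is_patched(dq_op_type):
    ...  # existing test body, unchanged
```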
📒 Files selected for processing (3)
- CHANGELOG.rst
- modelopt/onnx/quantization/qdq_utils.py
- tests/unit/onnx/quantization/test_qdq_utils.py
Codecov Report
✅ All modified and coverable lines are covered by tests.

Additional details and impacted files:

```
@@            Coverage Diff             @@
##             main    #1353      +/-   ##
==========================================
- Coverage   76.90%   72.96%   -3.95%
==========================================
  Files         471      471
  Lines       50562    50570       +8
==========================================
- Hits        38886    36898    -1988
- Misses      11676    13672    +1996
```

Flags with carried forward coverage won't be shown. ☔ View full report in Codecov by Sentry.
> - Add offline DFlash speculative decoding training. Train the draft module from pre-computed base-model hidden states dumped by ``examples/speculative_decoding/collect_hidden_states/compute_hidden_states_hf.py``; base-model transformer layers are deleted after conversion to save memory. Controlled by the auto-derived ``dflash_offline`` flag on ``DFlashConfig`` (derived from ``data_args.offline_data_path``). The dump scripts now share ``collect_hidden_states/common.py`` for aux-layer selection (``--aux-layers eagle|dflash|<list>``) and optional assistant-token ``loss_mask`` for answer-only-loss training.
>
> **Bug Fixes**
Doesn't this need to go in the 0.44 section below?
I think we should not include this in the changelog. Removed.
a3385aa to 9fc8306
🧹 Nitpick comments (1)
tests/unit/onnx/quantization/test_qdq_utils.py (1)
1076-1110: LGTM! Legacy Constant-node path is properly regression-tested. This ensures the existing code path continues to work after the refactoring.
Consider adding a float32 scale test case to strengthen dtype preservation coverage. The current tests only use float16 scales.
🧪 Optional: Add float32 scale test
```python
def test_float32_scale_dtype_preserved(self):
    """Verify float32 scales remain float32 after patching."""
    from modelopt.onnx.quantization.qdq_utils import replace_zero_scale_with_smallest_nonzero

    # Build model with float32 scale containing zeros
    weight_data = np.random.randint(-8, 8, size=(3, 4), dtype=np.int8)
    weight_tensor = numpy_helper.from_array(weight_data, "weight")
    scale_data = np.array([0.0, 1e-3, 0.0], dtype=np.float32).reshape(3, 1)
    scale_tensor = numpy_helper.from_array(scale_data, "scale")
    input_tensor = helper.make_tensor_value_info("input", TensorProto.FLOAT, [None, 3])
    dq_node = helper.make_node(
        "DequantizeLinear", inputs=["weight", "scale"], outputs=["dq_output"], name="dq"
    )
    graph = helper.make_graph(
        nodes=[dq_node],
        name="test_graph",
        inputs=[input_tensor],
        outputs=[helper.make_tensor_value_info("dq_output", TensorProto.FLOAT, [3, 4])],
        initializer=[weight_tensor, scale_tensor],
    )
    model = helper.make_model(graph)

    patched = replace_zero_scale_with_smallest_nonzero(model)

    scale_init = next(i for i in patched.graph.initializer if i.name == "scale")
    assert scale_init.data_type == TensorProto.FLOAT  # float32 preserved
    scale_arr = numpy_helper.to_array(scale_init)
    assert not (scale_arr == 0).any()
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tests/unit/onnx/quantization/test_qdq_utils.py` around lines 1076 - 1110, Add a new unit test function (e.g. test_float32_scale_dtype_preserved) in tests/unit/onnx/quantization/test_qdq_utils.py that constructs a model with a float32 scale initializer (use numpy_helper.from_array with dtype float32) consumed by a DequantizeLinear node, calls replace_zero_scale_with_smallest_nonzero(model), then locate the patched initializer by name (e.g. "scale") and assert its data_type == TensorProto.FLOAT and that the scale array contains no zeros; reference helpers used in the file such as replace_zero_scale_with_smallest_nonzero, numpy_helper.from_array, DequantizeLinear, and TensorProto.FLOAT to match existing patterns.
📒 Files selected for processing (2)
- modelopt/onnx/quantization/qdq_utils.py
- tests/unit/onnx/quantization/test_qdq_utils.py
replace_zero_scale_with_smallest_nonzero() in qdq_utils.py only inspected QuantizeLinear consumers and Constant-node producers, which made it a no-op for INT4_AWQ exports — those use DequantizeLinear (default and trt:: domain) consumers and store scales as graph initializers. Zero scales produced when the FP32→FP16 cast underflows therefore reached TensorRT, causing trtexec --stronglyTyped to fail with "Scale coefficients must all be positive".

Extend the sanitizer to also walk DequantizeLinear / TRT_INT4DequantizeLinear nodes and to patch initializer-backed scales, while preserving dtype. Add regression tests for both the initializer + DQ path (default and trt:: domain) and the legacy Constant + QuantizeLinear path.

Signed-off-by: ajrasane <131806219+ajrasane@users.noreply.github.com>
9fc8306 to 18d46e5
cjluo-nv left a comment
Bot review — DM the bot to share feedback.
Clean, well-scoped bug fix. The two bugs are clearly explained and the fix is correct:
- Scale collection now covers all 4 QDQ op types (QuantizeLinear, DequantizeLinear, TRT_INT4QuantizeLinear, TRT_INT4DequantizeLinear) instead of only QuantizeLinear.
- Scale patching now handles both graph initializers (INT4_AWQ path) and Constant nodes (legacy path), with proper dtype preservation via `.astype(tensor.dtype)`.
- The `dtype.kind == "f"` guard is a good addition to avoid corrupting non-float scale tensors.
- Tests cover both the initializer path (parameterized over DequantizeLinear and TRT_INT4DequantizeLinear) and the legacy Constant-node path — all with assertions on zero elimination, positivity, and dtype preservation.
The change is small (~30 lines of logic + ~90 lines of tests), backward compatible, and well tested.
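A rough sketch of the patching behavior those bullets describe — the helper name and the replacement-value policy are assumptions for illustration, not the exact code in `qdq_utils.py`:

```python
import numpy as np
from onnx import numpy_helper

def patch_zero_scale_initializer(initializer):
    """Sketch: return a copy of a scale initializer with zero entries made positive."""
    arr = numpy_helper.to_array(initializer)
    if arr.dtype.kind != "f":  # guard mentioned in the review: leave non-float tensors alone
        return initializer
    nonzero = arr[arr != 0]
    # Assumed replacement policy: reuse the tensor's smallest nonzero magnitude,
    # falling back to the FP16 minimum subnormal if every entry is zero.
    fill = np.abs(nonzero).min() if nonzero.size else np.float16(5.96e-8)
    patched = np.where(arr == 0, fill, arr).astype(arr.dtype)  # preserve original dtype
    return numpy_helper.from_array(patched, initializer.name)
```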
### What does this PR do?
Type of change: Bug fix
Fix `replace_zero_scale_with_smallest_nonzero()` in
`modelopt/onnx/quantization/qdq_utils.py` so the FP16-scale sanitizer
actually runs for INT4_AWQ ONNX exports.
The function is supposed to ensure all FP16 scales are strictly positive
before the model reaches TensorRT, since `trtexec --stronglyTyped`
asserts `scaleAllPositive`. It had two latent bugs that made it a
complete no-op for INT4_AWQ:
1. It only collected scales from `QuantizeLinear` consumers — but
INT4_AWQ exports use `DequantizeLinear` (default domain) and
`TRT_INT4DequantizeLinear` (trt:: domain). There are zero
`QuantizeLinear` nodes in such graphs, so the collected set was empty.
2. It only patched scales emitted by `Constant` nodes — but INT4_AWQ
stores scales as graph initializers.
When the FP32→FP16 cast in `_convert_fp32_init_to_fp16()` underflowed
small amax values to `0.0` (FP16 min subnormal is 5.96e-8), those zeros
sailed through into the exported ONNX. TRT then rejected the model with:
```
Assertion failed: (scaleAllPositive || allowNegativeScale): Scale coefficients must all be positive
```
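The underflow itself is easy to reproduce in isolation with NumPy (a standalone illustration, not code from the repo):

```python
import numpy as np

amax = np.float32(2e-8)                      # below half the FP16 minimum subnormal (~5.96e-8)
print(amax.astype(np.float16))               # 0.0: the FP32->FP16 cast underflows the scale to zero
print(np.float32(1e-7).astype(np.float16))   # a tiny but nonzero FP16 value survives the cast
```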
This PR extends the sanitizer to:
- Walk `QuantizeLinear` / `DequantizeLinear` / `TRT_INT4QuantizeLinear`
/ `TRT_INT4DequantizeLinear` nodes when collecting scale tensor names (see the sketch after this list).
- Patch zero entries in float-typed graph initializers in addition to
`Constant`-node values, preserving the original dtype.
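A minimal sketch of the extended scale-name collection from the first bullet above — the function name and loop structure are illustrative, not the actual implementation:

```python
# Illustrative only: walks an onnx.GraphProto and returns the names of the
# scale inputs (input index 1) of every Q/DQ node variant the PR now handles.
QDQ_OP_TYPES = {
    "QuantizeLinear",
    "DequantizeLinear",
    "TRT_INT4QuantizeLinear",
    "TRT_INT4DequantizeLinear",
}

def collect_scale_tensor_names(graph):
    names = set()
    for node in graph.node:
        if node.op_type in QDQ_OP_TYPES and len(node.input) > 1:
            names.add(node.input[1])
    return names
```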
Files modified:
- `modelopt/onnx/quantization/qdq_utils.py` — fix.
- `tests/unit/onnx/quantization/test_qdq_utils.py` —
`TestReplaceZeroScaleWithSmallestNonzero` regression tests.
- `CHANGELOG.rst` — bug-fix entry under 0.45.
### Usage
```bash
# Repro that previously failed and now succeeds:
python torch_quant_to_onnx.py \
--quantize_mode=int4_awq \
--timm_model_name=vit_base_patch16_224 \
--onnx_save_path=/tmp/vit_base_patch16_224.int4_awq.onnx \
--calibration_data_size=32
trtexec --onnx=/tmp/vit_base_patch16_224.int4_awq.onnx --stronglyTyped --skipInference
```
### Testing
- New tests pass: `pytest
tests/unit/onnx/quantization/test_qdq_utils.py::TestReplaceZeroScaleWithSmallestNonzero
-v` (3 passed).
- Full file: `pytest tests/unit/onnx/quantization/test_qdq_utils.py` (25
passed).
- Broader sanity: `pytest tests/unit/onnx/quantization/` (288 passed).
- Smoke test on a synthetic model with explicit zero scales: 3 zeros → 0
zeros, all positive, FP16 dtype preserved.
- `pre-commit run --files <changed>` clean.
### Before your PR is "*Ready for review*"
Make sure you read and follow [Contributor
guidelines](https://github.com/NVIDIA/Model-Optimizer/blob/main/CONTRIBUTING.md)
and your commits are signed (`git commit -s -S`).
Make sure you read and follow the [Security Best
Practices](https://github.com/NVIDIA/Model-Optimizer/blob/main/SECURITY.md#security-coding-practices-for-contributors)
(e.g. avoiding hardcoded `trust_remote_code=True`, `torch.load(...,
weights_only=False)`, `pickle`, etc.).
- Is this change backward compatible?: ✅
- If you copied code from any other sources or added a new PIP
dependency, did you follow guidance in `CONTRIBUTING.md`: N/A
- Did you write any new necessary tests?: ✅
- Did you update
[Changelog](https://github.com/NVIDIA/Model-Optimizer/blob/main/CHANGELOG.rst)?:
✅
### Additional Information
- NVBug: 6110209
- Reproduces with `nvidia-modelopt==0.44.0rc1`, TensorRT 10.15.1.29, on
B200 / H20.
- Root cause introduced when the `TRT_INT4DequantizeLinear` export path
was added (PR #575, commit `0a4f0a8b`); that PR didn't update the
sanitizer to handle the new node type or scale storage.
<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->
## Summary by CodeRabbit
* **Bug Fixes**
* Zero scale values in quantization/dequantization ops (including
additional operator variants) are now replaced with the smallest nonzero
fp16 scale for matching tensors; replacements preserve the original
tensor data type and handle scales provided via initializers or constant
nodes.
* **Tests**
* Added regression tests covering initializer- and constant-backed scale
tensors across multiple operator configurations to ensure zeros are
eliminated and dtype is preserved.
<!-- end of auto-generated comment: release notes by coderabbit.ai -->
Signed-off-by: ajrasane <131806219+ajrasane@users.noreply.github.com>
Signed-off-by: Keval Morabia <28916987+kevalmorabia97@users.noreply.github.com>
#1368 #1373 #1359 #1361 #1325 #1369 #1370 #1371 #1375 #1386 #1353 #1356 #1390 (#1385)

## Cherry-picked PRs

- #1352
- #1351
- #1330
- #1354
- #1355
- #1360
- #1342
- #1324
- #1340
- #1368
- #1373
- #1359
- #1361
- #1325
- #1369
- #1370
- #1371
- #1375
- #1386
- #1353
- #1356
- #1390

<!-- This is an auto-generated comment: release notes by coderabbit.ai -->
## Summary by CodeRabbit

* **New Features**
  * Added Python 3.14 support (basic unit tests verified; production defaults on Python 3.12)
  * Added Windows CUDA 13.x installation guidance
  * Introduced LLM ONNX export utilities with quantization support
  * Extended Medusa mode support in speculative decoding pipeline
* **Bug Fixes**
  * Fixed FP8 quantization for vision transformer multi-head attention
  * Improved MoE expert handling in quantization calibration and inference
  * Enhanced ONNX graph utilities for FP8 weight transformation
* **Documentation**
  * Comprehensive Minitron pruning + distillation + quantization + vLLM tutorials with ablation studies
  * Megatron data preparation guide for tokenization workflows
  * Puzzletron distillation results and cross-reference updates
<!-- end of auto-generated comment: release notes by coderabbit.ai -->

---------

Signed-off-by: Keval Morabia <28916987+kevalmorabia97@users.noreply.github.com>
Signed-off-by: ajrasane <131806219+ajrasane@users.noreply.github.com>
Signed-off-by: Grzegorz Karch <gkarch@nvidia.com>
Signed-off-by: Grzegorz K. Karch <grzegorz-k-karch@users.noreply.github.com>
Signed-off-by: Chenjie Luo <chenjiel@nvidia.com>
Signed-off-by: Asha Anoosheh <aanoosheh@nvidia.com>
Signed-off-by: Jennifer Chen <jennifchen@nvidia.com>
Signed-off-by: weimingc <17592131+meenchen@users.noreply.github.com>
Signed-off-by: ynankani <ynankani@nvidia.com>
Signed-off-by: h-guo18 <67671475+h-guo18@users.noreply.github.com>
Signed-off-by: vipandya <vipandya@nvidia.com>
Signed-off-by: dmoodie <dmoodie@nvidia.com>
Signed-off-by: Hrishith Thadicherla <hthadicherla@nvidia.com>
Signed-off-by: Ye Yu <yeyu@nvidia.com>
Signed-off-by: Kai Xu <kaix@nvidia.com>
Signed-off-by: Suguna Velury <178320438+sugunav14@users.noreply.github.com>
Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>
Co-authored-by: Ajinkya Rasane <131806219+ajrasane@users.noreply.github.com>
Co-authored-by: Grzegorz K. Karch <grzegorz-k-karch@users.noreply.github.com>
Co-authored-by: CodeRabbit <noreply@coderabbit.ai>
Co-authored-by: Chenjie Luo <108829653+cjluo-nv@users.noreply.github.com>
Co-authored-by: Asha Anoosheh <aanoosheh@nvidia.com>
Co-authored-by: Jenny Chen <jennifchen@nvidia.com>
Co-authored-by: Wei-Ming Chen <17592131+meenchen@users.noreply.github.com>
Co-authored-by: ynankani <ynankani@nvidia.com>
Co-authored-by: h-guo18 <67671475+h-guo18@users.noreply.github.com>
Co-authored-by: vishalpandya1990 <vishalpandya1990@gmail.com>
Co-authored-by: dthienan-nv <dmoodie@nvidia.com>
Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-authored-by: Hrishith Thadicherla <99313418+hthadicherla@users.noreply.github.com>
Co-authored-by: yeyu-nvidia <yeyu@nvidia.com>
Co-authored-by: kaix-nv <kaix@nvidia.com>
Co-authored-by: sugunav14 <178320438+sugunav14@users.noreply.github.com>