Conversation
…zer disabled

`awq_lite.setup()` disables `module.input_quantizer` at the start of search. The calibrated branch re-enables it inside `postprocess()`, but the uncalibrated branch (no cache-pass tokens, e.g. an MoE expert that never gets routed) never did. Worse, for experts that had cache hits but missed the search pass, the per-channel `_amax` left over from `max_calibrate` during cache mode tripped `preprocess_linear_fusion`'s `numel == 1` assertion and prevented `_export_quantized_weight` from emitting a per-tensor `input_scale`.

Result: per-expert `input_scale` was missing in the exported HF checkpoint, and TRT-LLM `CutlassFusedMoE` crashed on load with `KeyError: '<idx>.w1.input_scale'` for any expert that did not see enough calibration tokens (e.g. Qwen3-30B-A3B + `nvfp4_awq`).

Fix: mirror the calibrated `postprocess()` path: collapse any per-channel `_amax` to scalar (`axis=None`) and re-enable the `input_quantizer`.

Tests: added a regression test using `NVFP4_AWQ_LITE_CFG` with a two-branch model where only one branch is exercised; verifies the uncalibrated linear's `input_quantizer` remains enabled. Verified end-to-end on a tiny synthetic Qwen3-MoE that all expert `input_scale` keys are present in the exported state_dict.

Signed-off-by: Chenjie Luo <chenjiel@nvidia.com>
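Concretely, the fallback described above amounts to something like the following; a minimal sketch based on the diff quoted later in this thread, where the function name and the `act_amax` argument are illustrative rather than the repository's actual signature:

```python
import torch


# Illustrative stand-in for the fallback added to awq_lite postprocess() for
# modules that never saw search tokens. `module` is assumed to expose an
# `input_quantizer`, and `act_amax` the per-channel activation amax cached
# during max calibration in cache mode.
def finalize_uncalibrated_input_quantizer(module, act_amax: torch.Tensor):
    # Collapse the leftover per-channel amax to a scalar so the export path's
    # numel == 1 assumption in preprocess_linear_fusion holds.
    module.input_quantizer.reset_amax()
    module.input_quantizer.axis = None
    module.input_quantizer.amax = act_amax.amax()
    # Re-enable the input quantizer that setup() disabled, so
    # _export_quantized_weight can emit a per-tensor input_scale.
    module.input_quantizer.enable()
```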
Actionable comments posted: 1
🧹 Nitpick comments (1)
tests/unit/torch/quantization/test_calib.py (1)
349-353: ⚡ Quick win: strengthen this regression to also assert the export-critical scalar amax invariant.
Right now it checks only `is_enabled`. Add assertions on the uncalibrated branch that `axis is None` and that `amax` is a scalar, so regressions in the `numel == 1` export path are caught too.

Suggested test additions:

      assert model.calibrated.input_quantizer.is_enabled
      assert model.uncalibrated.input_quantizer.is_enabled, (
          "Uncalibrated linear's input_quantizer must remain enabled after "
          "awq_lite postprocess so export emits input_scale (NVBug 6143871)."
      )
    + assert model.uncalibrated.input_quantizer.axis is None
    + assert model.uncalibrated.input_quantizer.amax is not None
    + assert model.uncalibrated.input_quantizer.amax.numel() == 1
📒 Files selected for processing (2)
- modelopt/torch/quantization/model_calib.py
- tests/unit/torch/quantization/test_calib.py
meenchen left a comment
Bot review — DM the bot to share feedback.
Fix is the right shape and mirrors the calibrated postprocess path, but I think it's missing the if module.awq_lite.is_input_quantized: guard that the calibrated branch has. For weight-only AWQ configs (e.g. INT4_AWQ_CFG, where the config sets *input_quantizer enable: False), is_input_quantized is False and setup() never disabled the input_quantizer. With this PR, an uncalibrated expert under INT4_AWQ will now unconditionally call module.input_quantizer.enable(), turning on a quantizer that the user's config had explicitly disabled. Combined with the pre-existing _enable_pre_quant_scale = True / pre_quant_scale = ones(...) lines a few lines up (which were harmless when the quantizer stayed disabled), this will start applying pre_quant_scale in forward and cause an amax-less input_quantizer to try to quantize.
Also, the test only covers NVFP4_AWQ_LITE_CFG (input_quantizer enabled). Please add a regression test for the weight-only AWQ path (INT4_AWQ_CFG or similar) with an uncalibrated expert, asserting input_quantizer.is_enabled == False after mtq.quantize — that would have caught this.
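A regression test along those lines might look roughly like this; a sketch assuming modelopt's public `mtq.quantize(model, config, forward_loop)` API and `INT4_AWQ_CFG`, with an illustrative two-branch model rather than the repo's actual fixture:

```python
import torch
import torch.nn as nn

import modelopt.torch.quantization as mtq


class TwoBranchModel(nn.Module):
    """Only `calibrated` ever sees tokens, mimicking an unrouted MoE expert."""

    def __init__(self):
        super().__init__()
        self.calibrated = nn.Linear(128, 64)
        self.uncalibrated = nn.Linear(128, 64)

    def forward(self, x):
        return self.calibrated(x)


def test_uncalibrated_weight_only_keeps_input_quantizer_disabled():
    model = TwoBranchModel()

    def forward_loop(m):
        for _ in range(4):
            m(torch.randn(2, 128))

    mtq.quantize(model, mtq.INT4_AWQ_CFG, forward_loop)

    # Weight-only AWQ leaves input quantization disabled by config, so the
    # uncalibrated branch must not be re-enabled by the postprocess fallback.
    assert not model.uncalibrated.input_quantizer.is_enabled
```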
    module.input_quantizer.reset_amax()
    module.input_quantizer.axis = None
    module.input_quantizer.amax = act_amax.amax()
    module.input_quantizer.enable()
Bot comment.
Missing guard on module.awq_lite.is_input_quantized. The calibrated branch of postprocess() (right above, around line ~1270) only runs this block under if module.awq_lite.is_input_quantized:. Without that guard here, weight-only AWQ configs (e.g. INT4_AWQ_CFG, which sets *input_quantizer enable: False) will have module.input_quantizer.enable() called on an input_quantizer that the user's config explicitly disabled and that setup() never touched (setup only disables input_quantizer if is_enabled). That's a behavior regression for INT4_AWQ / weight-only AWQ on uncalibrated experts — the input_quantizer would start fake-quantizing in forward without an amax.
Suggest wrapping the new block in if module.awq_lite.is_input_quantized: to match the calibrated path, and adding a regression test with INT4_AWQ_CFG (or any weight-only AWQ config) + an uncalibrated branch that asserts uncalibrated.input_quantizer.is_enabled is False after mtq.quantize.
@meenchen: the unconditional `input_quantizer.enable()` in the uncalibrated branch wrongly turned on input quantization for weight-only AWQ configs (e.g. `INT4_AWQ_CFG`, where the user's config sets `*input_quantizer` enable=False and `setup()` therefore never disabled it). Gate the entire postprocess block (per-channel-amax collapse and `enable()`) behind `module.awq_lite.is_input_quantized` so weight-only configs are untouched.

@coderabbitai: strengthen the existing regression test to also assert the export-critical scalar amax invariant (`axis=None`, `numel==1`) when `amax` exists, and add a companion test on `INT4_AWQ_CFG` asserting the uncalibrated linear's `input_quantizer` stays disabled. The NVFP4 test now requires CUDA (dynamic block quantization is CUDA-only), guarded with `pytest.mark.skipif`.

Signed-off-by: Chenjie Luo <chenjiel@nvidia.com>
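A sketch of the gated shape of that block, reusing the illustrative names from the snippet earlier in this thread (`module.awq_lite.is_input_quantized` is the flag discussed above):

```python
import torch


def finalize_uncalibrated_input_quantizer(module, act_amax: torch.Tensor):
    # Only touch the input quantizer when the config actually quantizes
    # inputs; weight-only AWQ configs (e.g. INT4_AWQ_CFG) skip this block,
    # matching the guard on the calibrated postprocess path.
    if module.awq_lite.is_input_quantized:
        module.input_quantizer.reset_amax()
        module.input_quantizer.axis = None
        module.input_quantizer.amax = act_amax.amax()
        module.input_quantizer.enable()
```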
@meenchen good catch — fixed. The whole postprocess block (per-channel-amax collapse + `enable()`) is now gated behind `module.awq_lite.is_input_quantized`, so weight-only configs stay untouched. @coderabbitai's nitpick is also in: the existing test now asserts the scalar-amax invariant (`axis is None`, `amax.numel() == 1` when `amax` exists), and a companion `INT4_AWQ_CFG` test asserts the uncalibrated linear's `input_quantizer` stays disabled.
The implementation looks solid. The symmetry between the calibrated and uncalibrated postprocess paths is clear and well-commented.
meenchen left a comment
Bot review — DM the bot to share feedback.
Re-review: both critical comments from the previous review are addressed.
- `is_input_quantized` guard: the new block in `model_calib.py` (lines 1325-1331) wraps the `_amax` collapse and `input_quantizer.enable()` in `if module.awq_lite.is_input_quantized:`, correctly mirroring the calibrated `postprocess()` path. Weight-only AWQ configs (e.g. `INT4_AWQ_CFG`) where `setup()` never disabled the input_quantizer are now left untouched.
- Regression test for weight-only AWQ: `test_awq_lite_uncalibrated_weight_only_keeps_input_quantizer_disabled` has been added, using `INT4_AWQ_CFG` with the two-branch model and asserting `uncalibrated.input_quantizer.is_enabled is False` after `mtq.quantize`. This would have caught the missing-guard bug.
- CodeRabbit nitpick on scalar amax invariant: also addressed (conditionally, with a reasonable comment about the None-amax export path).
Fix is small, targeted, and well-tested.
Codecov Report

❌ Patch coverage is

Additional details and impacted files

    @@            Coverage Diff             @@
    ##             main    #1410      +/-   ##
    ==========================================
    + Coverage   76.74%   76.80%   +0.05%
    ==========================================
      Files         476      476
      Lines       51307    51363      +56
    ==========================================
    + Hits        39377    39448      +71
    + Misses      11930    11915      -15
Flags with carried forward coverage won't be shown. ☔ View full report in Codecov by Sentry.
Summary
`awq_lite.setup()` disables `module.input_quantizer` at the start of search. The calibrated branch re-enables it inside `postprocess()`, but the uncalibrated branch (no cache-pass tokens, e.g. an MoE expert that never gets routed) never did. Worse, for experts that had cache hits but missed the search pass, the per-channel `_amax` left over from `max_calibrate` during cache mode tripped `preprocess_linear_fusion`'s `numel == 1` assertion and prevented `_export_quantized_weight` from emitting a per-tensor `input_scale`.

Result: per-expert `input_scale` was missing in the exported HF checkpoint, and TRT-LLM `CutlassFusedMoE` crashed on load with `KeyError: '<idx>.w1.input_scale'` for any expert that did not see enough calibration tokens (e.g. Qwen3-30B-A3B + `nvfp4_awq` from the bug report).

Fix
Mirror the calibrated `postprocess()` path in `modelopt/torch/quantization/model_calib.py`: collapse any per-channel `_amax` to scalar (`axis=None`) and re-enable the `input_quantizer`.

Test plan
- Added `test_awq_lite_uncalibrated_linear_keeps_input_quantizer_enabled` using `NVFP4_AWQ_LITE_CFG` with a two-branch model where only one branch is exercised; verifies the uncalibrated linear's `input_quantizer` remains enabled after `mtq.quantize`.
- Verified on a tiny synthetic Qwen3-MoE export that all expert `input_scale` keys are present in the exported state_dict (vs. multiple missing pre-fix).
- A repro run with `--qformat nvfp4_awq` confirmed the missing `input_scale` keys before the fix.

🤖 Generated with Claude Code
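As a companion to the test plan above, a hypothetical helper for the state_dict check could look like this; the key pattern is an assumption modeled on the `KeyError` in the bug report, not the exact export naming:

```python
def missing_expert_input_scales(state_dict, num_experts, prefix):
    # Collect any per-expert input_scale keys absent from the exported
    # checkpoint; an empty list is expected after the fix.
    return [
        f"{prefix}.{idx}.w1.input_scale"
        for idx in range(num_experts)
        if f"{prefix}.{idx}.w1.input_scale" not in state_dict
    ]


# Illustrative usage:
# assert not missing_expert_input_scales(exported_sd, 8, "model.layers.0.mlp.experts")
```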
Summary by CodeRabbit
Bug Fixes
Tests