
[NVBug 6143871] Fix awq_lite uncalibrated branch leaving input_quantizer disabled#1410

Merged
cjluo-nv merged 2 commits into main from fix-awq-lite-uncalibrated-input-quantizer
May 8, 2026

Conversation

Collaborator

@cjluo-nv cjluo-nv commented May 7, 2026

Summary

awq_lite.setup() disables module.input_quantizer at the start of search. The calibrated branch re-enables it inside postprocess(), but the uncalibrated branch (no cache-pass tokens, e.g. an MoE expert that never gets routed) never did. Worse, for experts that had cache hits but missed the search pass, the per-channel _amax left over from max_calibrate during cache mode tripped preprocess_linear_fusion's numel == 1 assertion and prevented _export_quantized_weight from emitting a per-tensor input_scale.

Result: per-expert input_scale was missing in the exported HF checkpoint, and TRT-LLM CutlassFusedMoE crashed on load with KeyError: '<idx>.w1.input_scale' for any expert that did not see enough calibration tokens (e.g. Qwen3-30B-A3B + nvfp4_awq from the bug report).

Fix

Mirror the calibrated postprocess() path in modelopt/torch/quantization/model_calib.py: collapse any per-channel _amax to scalar (axis=None) and re-enable the input_quantizer.
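
For illustration, a minimal sketch of what that fallback looks like, using only field names that appear in this PR's discussion (surrounding context is assumed; the merged version additionally gates the block behind module.awq_lite.is_input_quantized, as the review below explains):

act_amax = module.input_quantizer._amax_for_smoothing   # per-channel amax left over from cache mode
module.input_quantizer.reset_amax()                     # clear the per-channel buffer
module.input_quantizer.axis = None                      # switch to per-tensor quantization
if act_amax is not None:                                # assumed guard: truly-uncalibrated experts keep amax unset
    module.input_quantizer.amax = act_amax.amax()       # collapse per-channel amax to a scalar
module.input_quantizer.enable()                         # mirror the calibrated postprocess() path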

Test plan

  • Added regression test test_awq_lite_uncalibrated_linear_keeps_input_quantizer_enabled using NVFP4_AWQ_LITE_CFG with a two-branch model where only one branch is exercised; verifies the uncalibrated linear's input_quantizer remains enabled after mtq.quantize (a sketch follows this list).
  • End-to-end pipeline test on a tiny synthetic Qwen3-MoE (8 experts, top-1 routing) confirms all 48 expert input_scale keys are present in the exported state_dict (vs. multiple missing pre-fix).
  • Manual repro of the bug command (Qwen3-30B-A3B, --qformat nvfp4_awq) confirmed the missing input_scale keys before the fix.
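
For reference, a minimal sketch of the regression test's shape (illustrative only; the actual fixture and test live in tests/unit/torch/quantization/test_calib.py, and as noted in review the NVFP4 path is CUDA-only):

import pytest
import torch
import torch.nn as nn

import modelopt.torch.quantization as mtq

class TwoBranchModel(nn.Module):
    # Hypothetical two-branch model: only `calibrated` ever sees tokens,
    # so `uncalibrated` behaves like an MoE expert that is never routed.
    def __init__(self):
        super().__init__()
        self.calibrated = nn.Linear(16, 16)
        self.uncalibrated = nn.Linear(16, 16)

    def forward(self, x):
        return self.calibrated(x)

@pytest.mark.skipif(not torch.cuda.is_available(), reason="NVFP4 dynamic block quantization is CUDA-only")
def test_awq_lite_uncalibrated_linear_keeps_input_quantizer_enabled():
    model = TwoBranchModel().cuda()
    mtq.quantize(model, mtq.NVFP4_AWQ_LITE_CFG, lambda m: m(torch.randn(4, 16, device="cuda")))
    assert model.calibrated.input_quantizer.is_enabled
    assert model.uncalibrated.input_quantizer.is_enabled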

🤖 Generated with Claude Code

Summary by CodeRabbit

  • Bug Fixes

    • Fixed AWQ-Lite quantization for uncalibrated modules so export/preprocessing invariants are preserved even when calibration/parameter updates are skipped.
  • Tests

    • Added regression tests: one verifies uncalibrated linear modules keep their input quantizer enabled after quantization; another verifies weight-only AWQ-Lite cases keep the input quantizer disabled as expected.

[NVBug 6143871] Fix awq_lite uncalibrated branch leaving input_quantizer disabled

awq_lite.setup() disables module.input_quantizer at the start of search.
The calibrated branch re-enables it inside postprocess(), but the
uncalibrated branch (no cache-pass tokens, e.g. an MoE expert that never
gets routed) never did. Worse, for experts that had cache hits but
missed the search pass, the per-channel _amax left over from
max_calibrate during cache mode tripped preprocess_linear_fusion's
numel==1 assertion and prevented _export_quantized_weight from emitting
a per-tensor input_scale.

Result: per-expert input_scale was missing in the exported HF
checkpoint, and TRT-LLM CutlassFusedMoE crashed on load with
KeyError: '<idx>.w1.input_scale' for any expert that did not see enough
calibration tokens (e.g. Qwen3-30B-A3B + nvfp4_awq).

Fix: mirror the calibrated postprocess() path — collapse any per-channel
_amax to scalar (axis=None) and re-enable the input_quantizer.

Tests: added regression test using NVFP4_AWQ_LITE_CFG with a two-branch
model where only one branch is exercised; verifies the uncalibrated
linear's input_quantizer remains enabled. Verified end-to-end on a tiny
synthetic Qwen3-MoE that all expert input_scale keys are present in the
exported state_dict.

Signed-off-by: Chenjie Luo <chenjiel@nvidia.com>
@cjluo-nv cjluo-nv requested a review from a team as a code owner May 7, 2026 20:37
@cjluo-nv cjluo-nv requested a review from kaix-nv May 7, 2026 20:37
@coderabbitai
Contributor

coderabbitai Bot commented May 7, 2026

Review Change Stack
No actionable comments were generated in the recent review. 🎉

📥 Commits

Reviewing files that changed from the base of the PR and between af1bfd6 and deb0e3a.

📒 Files selected for processing (2)
  • modelopt/torch/quantization/model_calib.py
  • tests/unit/torch/quantization/test_calib.py
🚧 Files skipped from review as they are similar to previous changes (1)
  • modelopt/torch/quantization/model_calib.py

📝 Walkthrough

This PR fixes AWQ-Lite postprocessing to properly normalize input quantizer amax state for expert-disabled (uncalibrated) modules. The fallback path now collapses per-channel amax to scalar and re-enables quantizers to match export invariants. A regression test validates the fix by asserting uncalibrated linears retain enabled quantizers.

Changes

AWQ-Lite Input Quantizer Fix

  • Core Implementation (modelopt/torch/quantization/model_calib.py): the AWQ-Lite fallback branch collapses leftover per-channel input_quantizer amax to a scalar (via amax() over _amax_for_smoothing), resets buffers, sets axis = None, and re-enables the quantizer.
  • Regression Test (tests/unit/torch/quantization/test_calib.py): new fixture and CUDA-only regression test verify uncalibrated linears in a two-branch model retain enabled input quantizers after AWQ-Lite quantization; an additional weight-only test asserts the expected disabled state for weight-only AWQ.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

🚥 Pre-merge checks | ✅ 5 | ❌ 1

❌ Failed checks (1 warning)

  • Docstring Coverage: ⚠️ Warning. Docstring coverage is 44.44%, which is insufficient; the required threshold is 80.00%. Resolution: write docstrings for the functions missing them.
✅ Passed checks (5 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title directly and accurately describes the main fix: addressing the issue where awq_lite leaves input_quantizer disabled in uncalibrated branches, which is the core problem solved in this PR.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.
Security Anti-Patterns ✅ Passed No CRITICAL security anti-patterns found. PR modifies quantization calibration logic only. No unsafe deserialization, eval/exec, hardcoded remote code execution flags, or new dependencies introduced.


Contributor

@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 1

🧹 Nitpick comments (1)
tests/unit/torch/quantization/test_calib.py (1)

349-353: ⚡ Quick win

Strengthen this regression to also assert the export-critical scalar amax invariant.

Right now it checks only is_enabled. Add assertions that the uncalibrated branch's axis is None and its amax is scalar, so regressions in the numel == 1 export path are caught too.

Suggested test additions
     assert model.calibrated.input_quantizer.is_enabled
     assert model.uncalibrated.input_quantizer.is_enabled, (
         "Uncalibrated linear's input_quantizer must remain enabled after "
         "awq_lite postprocess so export emits input_scale (NVBug 6143871)."
     )
+    assert model.uncalibrated.input_quantizer.axis is None
+    assert model.uncalibrated.input_quantizer.amax is not None
+    assert model.uncalibrated.input_quantizer.amax.numel() == 1


📥 Commits

Reviewing files that changed from the base of the PR and between 1f9c0bf and af1bfd6.

📒 Files selected for processing (2)
  • modelopt/torch/quantization/model_calib.py
  • tests/unit/torch/quantization/test_calib.py

Comment thread: modelopt/torch/quantization/model_calib.py (Outdated)
@github-actions
Contributor

github-actions Bot commented May 7, 2026

PR Preview Action v1.8.1
Preview removed because the pull request was closed.
2026-05-08 04:44 UTC

Contributor

@meenchen meenchen left a comment


Bot review — DM the bot to share feedback.

Fix is the right shape and mirrors the calibrated postprocess path, but I think it's missing the if module.awq_lite.is_input_quantized: guard that the calibrated branch has. For weight-only AWQ configs (e.g. INT4_AWQ_CFG, where the config sets *input_quantizer enable: False), is_input_quantized is False and setup() never disabled the input_quantizer. With this PR, an uncalibrated expert under INT4_AWQ will now unconditionally call module.input_quantizer.enable(), turning on a quantizer that the user's config had explicitly disabled. Combined with the pre-existing _enable_pre_quant_scale = True / pre_quant_scale = ones(...) lines a few lines up (which were harmless when the quantizer stayed disabled), this will start applying pre_quant_scale in forward and cause an amax-less input_quantizer to try to quantize.

Also, the test only covers NVFP4_AWQ_LITE_CFG (input_quantizer enabled). Please add a regression test for the weight-only AWQ path (INT4_AWQ_CFG or similar) with an uncalibrated expert, asserting input_quantizer.is_enabled == False after mtq.quantize — that would have caught this.

module.input_quantizer.reset_amax()
module.input_quantizer.axis = None
module.input_quantizer.amax = act_amax.amax()
module.input_quantizer.enable()
Contributor


Bot comment.

Missing guard on module.awq_lite.is_input_quantized. The calibrated branch of postprocess() (right above, around line ~1270) only runs this block under if module.awq_lite.is_input_quantized:. Without that guard here, weight-only AWQ configs (e.g. INT4_AWQ_CFG, which sets *input_quantizer enable: False) will have module.input_quantizer.enable() called on an input_quantizer that the user's config explicitly disabled and that setup() never touched (setup only disables input_quantizer if is_enabled). That's a behavior regression for INT4_AWQ / weight-only AWQ on uncalibrated experts — the input_quantizer would start fake-quantizing in forward without an amax.

Suggest wrapping the new block in if module.awq_lite.is_input_quantized: to match the calibrated path, and adding a regression test with INT4_AWQ_CFG (or any weight-only AWQ config) + an uncalibrated branch that asserts uncalibrated.input_quantizer.is_enabled is False after mtq.quantize.
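
As a sketch, the suggested gating applied to the quoted block (exact surrounding context assumed):

-module.input_quantizer.reset_amax()
-module.input_quantizer.axis = None
-module.input_quantizer.amax = act_amax.amax()
-module.input_quantizer.enable()
+if module.awq_lite.is_input_quantized:
+    module.input_quantizer.reset_amax()
+    module.input_quantizer.axis = None
+    module.input_quantizer.amax = act_amax.amax()
+    module.input_quantizer.enable()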

@meenchen: the unconditional input_quantizer.enable() in the
uncalibrated branch wrongly turned on input quantization for weight-only
AWQ configs (e.g. INT4_AWQ_CFG, where the user's config sets
*input_quantizer enable=False and setup() therefore never disabled it).
Gate the entire postprocess block — per-channel-amax collapse and
enable() — behind module.awq_lite.is_input_quantized so weight-only
configs are untouched.

@coderabbitai: strengthen the existing regression test to also assert
the export-critical scalar amax invariant (axis=None, numel==1) when
amax exists, and add a companion test on INT4_AWQ_CFG asserting the
uncalibrated linear's input_quantizer stays disabled.

The NVFP4 test now requires CUDA (dynamic block quantization is
CUDA-only), guarded with pytest.mark.skipif.

Signed-off-by: Chenjie Luo <chenjiel@nvidia.com>
@cjluo-nv
Collaborator Author

cjluo-nv commented May 7, 2026

@meenchen good catch — fixed. The whole postprocess block (per-channel-amax collapse + enable()) is now gated behind module.awq_lite.is_input_quantized, so weight-only AWQ configs (e.g. INT4_AWQ_CFG) are untouched. Also added a regression test on INT4_AWQ_CFG asserting the uncalibrated linear's input_quantizer stays disabled.

@coderabbitai strengthened the existing test with the scalar-amax invariant (axis is None, amax.numel() == 1) — gated on amax is not None because the truly-uncalibrated case A (no cache hits) leaves amax for set_expert_quantizer_amax to populate during export.
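
A sketch of the strengthened assertions (illustrative; uncal_q stands for the uncalibrated linear's input_quantizer, matching the naming in the review reply below):

uncal_q = model.uncalibrated.input_quantizer
assert uncal_q.is_enabled
# Truly-uncalibrated experts (case A, no cache hits) keep amax=None so that
# set_expert_quantizer_amax can populate it during export; gate the scalar checks.
if uncal_q.amax is not None:
    assert uncal_q.axis is None
    assert uncal_q.amax.numel() == 1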

@coderabbitai
Contributor

coderabbitai Bot commented May 7, 2026


🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Read the relevant sections of model_calib.py around awq_lite uncalibrated path
rg -n "is_input_quantized|uncalibrated|awq_lite|postprocess|_amax_for_smoothing|enable()" modelopt/torch/quantization/model_calib.py -C 5

echo "---TEST FILE---"
# Read the test for the uncalibrated linear
rg -n "uncalibrated|is_input_quantized|INT4_AWQ|axis is None|amax.numel|amax is not None" tests/unit/torch/quantization/test_calib.py -C 5



@cjluo-nv Thanks for the detailed update — all three changes look correct from the code:

  1. is_input_quantized gate (Lines 1326–1333 in model_calib.py): The per-channel-amax collapse and enable() are cleanly gated, mirroring the calibrated postprocess() at Line 1261. Weight-only configs will see no behavioral change.

  2. INT4_AWQ_CFG companion test (test_awq_lite_uncalibrated_weight_only_keeps_input_quantizer_disabled, Lines 368–390): Well-structured — asserts both calibrated and uncalibrated linears' input_quantizer.is_enabled remain False, which guards against accidentally activating activation quantization for weight-only configs.

  3. Scalar-amax invariant (Lines 358–365): The if uncal_q.amax is not None guard is the right call — it covers the cache-hit/search-miss path while correctly skipping the truly-uncalibrated (no cache hits, amax=None) path where set_expert_quantizer_amax populates the amax during export.

The implementation looks solid. The symmetry between the calibrated and uncalibrated postprocess paths is clear and well-commented.

Contributor

@meenchen meenchen left a comment


Bot review — DM the bot to share feedback.

Re-review: both critical comments from the previous review are addressed.

  1. is_input_quantized guard: The new block in model_calib.py (lines 1325-1331) wraps the _amax collapse and input_quantizer.enable() in if module.awq_lite.is_input_quantized:, correctly mirroring the calibrated postprocess() path. Weight-only AWQ configs (e.g. INT4_AWQ_CFG) where setup() never disabled the input_quantizer are now left untouched.

  2. Regression test for weight-only AWQ: test_awq_lite_uncalibrated_weight_only_keeps_input_quantizer_disabled has been added, using INT4_AWQ_CFG with the two-branch model and asserting uncalibrated.input_quantizer.is_enabled is False after mtq.quantize. This would have caught the missing-guard bug.

  3. CodeRabbit nitpick on scalar amax invariant: also addressed (conditionally, with a reasonable comment about the None-amax export path).

Fix is small, targeted, and well-tested.

Contributor

@meenchen meenchen left a comment


LGTM

@cjluo-nv cjluo-nv added the cherry-pick-0.44.0 After code freeze, cherry-pick to release branch for next rc (bulk update). Only for bug fixes / doc label May 7, 2026
@codecov

codecov Bot commented May 7, 2026

Codecov Report

❌ Patch coverage is 37.50000% with 5 lines in your changes missing coverage. Please review.
✅ Project coverage is 76.80%. Comparing base (555be6c) to head (deb0e3a).
⚠️ Report is 5 commits behind head on main.

Files with missing lines:
  • modelopt/torch/quantization/model_calib.py: patch 37.50%, 5 lines missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #1410      +/-   ##
==========================================
+ Coverage   76.74%   76.80%   +0.05%     
==========================================
  Files         476      476              
  Lines       51307    51363      +56     
==========================================
+ Hits        39377    39448      +71     
+ Misses      11930    11915      -15     
Flags:
  • examples: 41.80% <37.50%> (+2.62%) ⬆️
  • gpu: 59.84% <0.00%> (-0.60%) ⬇️
  • regression: 15.19% <0.00%> (+0.07%) ⬆️
  • unit: 52.56% <12.50%> (+0.02%) ⬆️

Flags with carried forward coverage won't be shown.


@cjluo-nv cjluo-nv enabled auto-merge (squash) May 8, 2026 04:04
@cjluo-nv cjluo-nv merged commit e2d29c8 into main May 8, 2026
62 of 64 checks passed
@cjluo-nv cjluo-nv deleted the fix-awq-lite-uncalibrated-input-quantizer branch May 8, 2026 04:44
@kevalmorabia97 kevalmorabia97 added the cherry-pick-done Added by bot once PR is cherry-picked to the release branch label May 11, 2026