Conversation
…zer disabled

`awq_lite.setup()` disables `module.input_quantizer` at the start of search. The calibrated branch re-enables it inside `postprocess()`, but the uncalibrated branch (no cache-pass tokens, e.g. an MoE expert that never gets routed) never did. Worse, for experts that had cache hits but missed the search pass, the per-channel `_amax` left over from `max_calibrate` during cache mode tripped `preprocess_linear_fusion`'s `numel == 1` assertion and prevented `_export_quantized_weight` from emitting a per-tensor `input_scale`.

Result: per-expert `input_scale` was missing in the exported HF checkpoint, and TRT-LLM `CutlassFusedMoE` crashed on load with `KeyError: '<idx>.w1.input_scale'` for any expert that did not see enough calibration tokens (e.g. Qwen3-30B-A3B + `nvfp4_awq`).

Fix: mirror the calibrated `postprocess()` path: collapse any per-channel `_amax` to scalar (`axis=None`) and re-enable the `input_quantizer`.

Tests: added a regression test using `NVFP4_AWQ_LITE_CFG` with a two-branch model where only one branch is exercised; verifies the uncalibrated linear's `input_quantizer` remains enabled. Verified end-to-end on a tiny synthetic Qwen3-MoE that all expert `input_scale` keys are present in the exported state_dict.

Signed-off-by: Chenjie Luo <chenjiel@nvidia.com>
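Concretely, the fallback described above amounts to something like the following; a minimal sketch based on the diff quoted later in this thread, where the function name and the `act_amax` argument are illustrative rather than the repository's actual signature:

```python
import torch


# Illustrative stand-in for the fallback added to awq_lite postprocess() for
# modules that never saw search tokens. `module` is assumed to expose an
# `input_quantizer`, and `act_amax` the per-channel activation amax cached
# during max calibration in cache mode.
def finalize_uncalibrated_input_quantizer(module, act_amax: torch.Tensor):
    # Collapse the leftover per-channel amax to a scalar so the export path's
    # numel == 1 assumption in preprocess_linear_fusion holds.
    module.input_quantizer.reset_amax()
    module.input_quantizer.axis = None
    module.input_quantizer.amax = act_amax.amax()
    # Re-enable the input quantizer that setup() disabled, so
    # _export_quantized_weight can emit a per-tensor input_scale.
    module.input_quantizer.enable()
```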
Actionable comments posted: 1
🧹 Nitpick comments (1)
tests/unit/torch/quantization/test_calib.py (1)
349-353: ⚡ Quick win: strengthen this regression to also assert the export-critical scalar amax invariant.
Right now it checks only `is_enabled`. Add assertions on the uncalibrated branch that `axis is None` and that `amax` is a scalar, so regressions in the `numel == 1` export path are caught too.

Suggested test additions:

      assert model.calibrated.input_quantizer.is_enabled
      assert model.uncalibrated.input_quantizer.is_enabled, (
          "Uncalibrated linear's input_quantizer must remain enabled after "
          "awq_lite postprocess so export emits input_scale (NVBug 6143871)."
      )
    + assert model.uncalibrated.input_quantizer.axis is None
    + assert model.uncalibrated.input_quantizer.amax is not None
    + assert model.uncalibrated.input_quantizer.amax.numel() == 1
📒 Files selected for processing (2)
- modelopt/torch/quantization/model_calib.py
- tests/unit/torch/quantization/test_calib.py
meenchen left a comment
Bot review — DM the bot to share feedback.
Fix is the right shape and mirrors the calibrated postprocess path, but I think it's missing the if module.awq_lite.is_input_quantized: guard that the calibrated branch has. For weight-only AWQ configs (e.g. INT4_AWQ_CFG, where the config sets *input_quantizer enable: False), is_input_quantized is False and setup() never disabled the input_quantizer. With this PR, an uncalibrated expert under INT4_AWQ will now unconditionally call module.input_quantizer.enable(), turning on a quantizer that the user's config had explicitly disabled. Combined with the pre-existing _enable_pre_quant_scale = True / pre_quant_scale = ones(...) lines a few lines up (which were harmless when the quantizer stayed disabled), this will start applying pre_quant_scale in forward and cause an amax-less input_quantizer to try to quantize.
Also, the test only covers NVFP4_AWQ_LITE_CFG (input_quantizer enabled). Please add a regression test for the weight-only AWQ path (INT4_AWQ_CFG or similar) with an uncalibrated expert, asserting input_quantizer.is_enabled == False after mtq.quantize — that would have caught this.
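A regression test along those lines might look roughly like this; a sketch assuming modelopt's public `mtq.quantize(model, config, forward_loop)` API and `INT4_AWQ_CFG`, with an illustrative two-branch model rather than the repo's actual fixture:

```python
import torch
import torch.nn as nn

import modelopt.torch.quantization as mtq


class TwoBranchModel(nn.Module):
    """Only `calibrated` ever sees tokens, mimicking an unrouted MoE expert."""

    def __init__(self):
        super().__init__()
        self.calibrated = nn.Linear(128, 64)
        self.uncalibrated = nn.Linear(128, 64)

    def forward(self, x):
        return self.calibrated(x)


def test_uncalibrated_weight_only_keeps_input_quantizer_disabled():
    model = TwoBranchModel()

    def forward_loop(m):
        for _ in range(4):
            m(torch.randn(2, 128))

    mtq.quantize(model, mtq.INT4_AWQ_CFG, forward_loop)

    # Weight-only AWQ leaves input quantization disabled by config, so the
    # uncalibrated branch must not be re-enabled by the postprocess fallback.
    assert not model.uncalibrated.input_quantizer.is_enabled
```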
    module.input_quantizer.reset_amax()
    module.input_quantizer.axis = None
    module.input_quantizer.amax = act_amax.amax()
    module.input_quantizer.enable()
Bot comment.
Missing guard on module.awq_lite.is_input_quantized. The calibrated branch of postprocess() (right above, around line ~1270) only runs this block under if module.awq_lite.is_input_quantized:. Without that guard here, weight-only AWQ configs (e.g. INT4_AWQ_CFG, which sets *input_quantizer enable: False) will have module.input_quantizer.enable() called on an input_quantizer that the user's config explicitly disabled and that setup() never touched (setup only disables input_quantizer if is_enabled). That's a behavior regression for INT4_AWQ / weight-only AWQ on uncalibrated experts — the input_quantizer would start fake-quantizing in forward without an amax.
Suggest wrapping the new block in if module.awq_lite.is_input_quantized: to match the calibrated path, and adding a regression test with INT4_AWQ_CFG (or any weight-only AWQ config) + an uncalibrated branch that asserts uncalibrated.input_quantizer.is_enabled is False after mtq.quantize.
@meenchen: the unconditional `input_quantizer.enable()` in the uncalibrated branch wrongly turned on input quantization for weight-only AWQ configs (e.g. `INT4_AWQ_CFG`, where the user's config sets `*input_quantizer` enable=False and `setup()` therefore never disabled it). Gate the entire postprocess block (per-channel-amax collapse and `enable()`) behind `module.awq_lite.is_input_quantized` so weight-only configs are untouched.

@coderabbitai: strengthen the existing regression test to also assert the export-critical scalar amax invariant (`axis=None`, `numel==1`) when `amax` exists, and add a companion test on `INT4_AWQ_CFG` asserting the uncalibrated linear's `input_quantizer` stays disabled. The NVFP4 test now requires CUDA (dynamic block quantization is CUDA-only), guarded with `pytest.mark.skipif`.

Signed-off-by: Chenjie Luo <chenjiel@nvidia.com>
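A sketch of the gated shape of that block, reusing the illustrative names from the snippet earlier in this thread (`module.awq_lite.is_input_quantized` is the flag discussed above):

```python
import torch


def finalize_uncalibrated_input_quantizer(module, act_amax: torch.Tensor):
    # Only touch the input quantizer when the config actually quantizes
    # inputs; weight-only AWQ configs (e.g. INT4_AWQ_CFG) skip this block,
    # matching the guard on the calibrated postprocess path.
    if module.awq_lite.is_input_quantized:
        module.input_quantizer.reset_amax()
        module.input_quantizer.axis = None
        module.input_quantizer.amax = act_amax.amax()
        module.input_quantizer.enable()
```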
@meenchen good catch — fixed. The whole postprocess block (per-channel-amax collapse + `enable()`) is now gated behind `module.awq_lite.is_input_quantized`, so weight-only configs stay untouched. @coderabbitai's nitpick is also in: the existing test now asserts the scalar-amax invariant (`axis is None`, `amax.numel() == 1` when `amax` exists), and a companion `INT4_AWQ_CFG` test asserts the uncalibrated linear's `input_quantizer` stays disabled.
The implementation looks solid. The symmetry between the calibrated and uncalibrated postprocess paths is clear and well-commented.
meenchen left a comment
Bot review — DM the bot to share feedback.
Re-review: both critical comments from the previous review are addressed.
- `is_input_quantized` guard: the new block in `model_calib.py` (lines 1325-1331) wraps the `_amax` collapse and `input_quantizer.enable()` in `if module.awq_lite.is_input_quantized:`, correctly mirroring the calibrated `postprocess()` path. Weight-only AWQ configs (e.g. `INT4_AWQ_CFG`) where `setup()` never disabled the input_quantizer are now left untouched.
- Regression test for weight-only AWQ: `test_awq_lite_uncalibrated_weight_only_keeps_input_quantizer_disabled` has been added, using `INT4_AWQ_CFG` with the two-branch model and asserting `uncalibrated.input_quantizer.is_enabled is False` after `mtq.quantize`. This would have caught the missing-guard bug.
- CodeRabbit nitpick on scalar amax invariant: also addressed (conditionally, with a reasonable comment about the None-amax export path).
Fix is small, targeted, and well-tested.
Codecov Report

❌ Patch coverage is

Additional details and impacted files

    @@            Coverage Diff             @@
    ##             main    #1410      +/-   ##
    ==========================================
    + Coverage   76.74%   76.80%   +0.05%
    ==========================================
      Files         476      476
      Lines       51307    51363      +56
    ==========================================
    + Hits        39377    39448      +71
    + Misses      11930    11915      -15
Flags with carried forward coverage won't be shown. ☔ View full report in Codecov by Sentry.
Summary
`awq_lite.setup()` disables `module.input_quantizer` at the start of search. The calibrated branch re-enables it inside `postprocess()`, but the uncalibrated branch (no cache-pass tokens, e.g. an MoE expert that never gets routed) never did. Worse, for experts that had cache hits but missed the search pass, the per-channel `_amax` left over from `max_calibrate` during cache mode tripped `preprocess_linear_fusion`'s `numel == 1` assertion and prevented `_export_quantized_weight` from emitting a per-tensor `input_scale`.

Result: per-expert `input_scale` was missing in the exported HF checkpoint, and TRT-LLM `CutlassFusedMoE` crashed on load with `KeyError: '<idx>.w1.input_scale'` for any expert that did not see enough calibration tokens (e.g. Qwen3-30B-A3B + `nvfp4_awq` from the bug report).

Fix
Mirror the calibrated `postprocess()` path in `modelopt/torch/quantization/model_calib.py`: collapse any per-channel `_amax` to scalar (`axis=None`) and re-enable the `input_quantizer`.

Test plan
- Added `test_awq_lite_uncalibrated_linear_keeps_input_quantizer_enabled` using `NVFP4_AWQ_LITE_CFG` with a two-branch model where only one branch is exercised; verifies the uncalibrated linear's `input_quantizer` remains enabled after `mtq.quantize`.
- Verified on a tiny synthetic Qwen3-MoE export that all expert `input_scale` keys are present in the exported state_dict (vs. multiple missing pre-fix).
- A repro run with `--qformat nvfp4_awq` confirmed the missing `input_scale` keys before the fix.

🤖 Generated with Claude Code
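As a companion to the test plan above, a hypothetical helper for the state_dict check could look like this; the key pattern is an assumption modeled on the `KeyError` in the bug report, not the exact export naming:

```python
def missing_expert_input_scales(state_dict, num_experts, prefix):
    # Collect any per-expert input_scale keys absent from the exported
    # checkpoint; an empty list is expected after the fix.
    return [
        f"{prefix}.{idx}.w1.input_scale"
        for idx in range(num_experts)
        if f"{prefix}.{idx}.w1.input_scale" not in state_dict
    ]


# Illustrative usage:
# assert not missing_expert_input_scales(exported_sd, 8, "model.layers.0.mlp.experts")
```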
Summary by CodeRabbit
Bug Fixes
Tests