Fix weight-only quantization for TEGroupedMLP (MoE models) #971
jenchen13 merged 7 commits into NVIDIA:main
Conversation
Note: Reviews paused. It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review.
📝 Walkthrough

This pull request refactors weight-only quantization calibration to use an iterator-based API. The `QuantModule` class and its specialized subclasses now expose weight calibration pairs via an `iter_weights_for_calibration()` iterator.
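The iterator-based API described in the walkthrough can be sketched with toy classes. Names mirror the PR (`QuantModule`, `iter_weights_for_calibration`, `num_gemms`), but the bodies are illustrative stand-ins, not the modelopt implementation:

```python
class TensorQuantizer:
    """Toy quantizer: calling it with a weight updates the running amax."""

    def __init__(self):
        self._amax = None

    def __call__(self, weight):
        w_max = max(abs(v) for v in weight)
        self._amax = w_max if self._amax is None else max(self._amax, w_max)


class QuantModule:
    """Base class: yields (weight, quantizer) pairs for calibration."""

    weight = None

    def iter_weights_for_calibration(self):
        if getattr(self, "weight", None) is not None:
            yield self.weight, self.weight_quantizer


class QuantGroupedLinear(QuantModule):
    """Grouped variant: per-expert weights weight0..weightN share one quantizer."""

    def __init__(self, expert_weights):
        self.weight_quantizer = TensorQuantizer()
        self.num_gemms = len(expert_weights)
        for i, w in enumerate(expert_weights):
            setattr(self, f"weight{i}", w)

    def iter_weights_for_calibration(self):
        # Yield every per-expert weight with the shared quantizer.
        for i in range(self.num_gemms):
            yield getattr(self, f"weight{i}"), self.weight_quantizer


module = QuantGroupedLinear([[0.1, -0.5], [0.9, 0.2]])
for weight, quantizer in module.iter_weights_for_calibration():
    quantizer(weight)

print(module.weight_quantizer._amax)  # 0.9
```

Calibration code can then iterate any `QuantModule` uniformly, without knowing how a particular subclass stores its weights.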
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~20 minutes

🚥 Pre-merge checks: ✅ 3 passed | ❌ 1 failed (warning)
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@modelopt/torch/quantization/model_calib.py`:
- Around line 74-84: The QuantModule branch captures weights by calling
module.iter_weights_for_calibration() outside the
enable_weight_access_and_writeback context, so remapped/sharded/offloaded
weights may be stale; move the call into the context so iteration happens while
enable_weight_access_and_writeback(module, model) is active (i.e., enter the
with block first, then call module.iter_weights_for_calibration() and call
weight_quantizer(weight) inside that context). Keep the else branch behavior for
weight_attr_names/quantizer_attr_names unchanged.
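The ordering the comment asks for can be shown with a self-contained sketch. The context manager below is a hypothetical stand-in for `enable_weight_access_and_writeback` (the real one gathers remapped, sharded, or offloaded weights; here a flag simulates that):

```python
from contextlib import contextmanager


@contextmanager
def enable_weight_access_and_writeback(module, model):
    # Stand-in: pretend to materialize sharded/offloaded weights on entry.
    module.materialized = True
    try:
        yield
    finally:
        module.materialized = False


class QuantModule:
    def __init__(self):
        self.materialized = False
        self.seen_materialized = None

    def iter_weights_for_calibration(self):
        # Record whether weights were read while the context was active.
        self.seen_materialized = self.materialized
        yield [1.0], (lambda w: None)


module, model = QuantModule(), None

# Ordering the comment asks for: enter the context first, then iterate.
with enable_weight_access_and_writeback(module, model):
    for weight, weight_quantizer in module.iter_weights_for_calibration():
        weight_quantizer(weight)

print(module.seen_materialized)  # True
```

Because `iter_weights_for_calibration` is a generator, its body only runs when the loop pulls from it, so iterating inside the `with` block guarantees the weights are read while they are materialized.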
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: a8c54433-356b-4651-9f26-5e26affd01eb
📥 Commits
Reviewing files that changed from the base of the PR and between a34d613 and 77cf3d2fb4dbb8e0c08c33b8ed7d60051fc4ba73.
📒 Files selected for processing (5)
- modelopt/torch/export/unified_export_megatron.py
- modelopt/torch/quantization/model_calib.py
- modelopt/torch/quantization/nn/modules/quant_module.py
- modelopt/torch/quantization/plugins/transformer_engine.py
- modelopt/torch/quantization/utils.py
Force-pushed 286eef1 to 57099fb
Signed-off-by: larkzhang-nv <larkz@nvidia.com>
Force-pushed 57099fb to 45ab8ca
🧹 Nitpick comments (1)
modelopt/torch/quantization/model_calib.py (1)
67-85: Cache `name_to_module` once in this hot path.

`enable_weight_access_and_writeback()` falls back to rebuilding `dict(model.named_modules())` when `name_to_module` is omitted. Doing that for every module here makes weight-only calibration unnecessarily expensive on large MoE models. It would be better to precompute the mapping once and keep a single writeback context per module.

♻️ Proposed refactor
```diff
 def weight_only_quantize(model: nn.Module):
     """Just quantize the weights of the model."""
     seen_modules = set()
-    for name, module in model.named_modules():
+    name_to_module = dict(model.named_modules())
+    for name, module in name_to_module.items():
         if module in seen_modules:
             continue
-        if isinstance(module, QuantModule):
-            with enable_weight_access_and_writeback(module, model):
+        with enable_weight_access_and_writeback(module, model, name_to_module):
+            if isinstance(module, QuantModule):
                 for weight, weight_quantizer in module.iter_weights_for_calibration():
                     weight_quantizer(weight)
-        else:
-            for weight_name in weight_attr_names(module):
-                with enable_weight_access_and_writeback(module, model):
+            else:
+                for weight_name in weight_attr_names(module):
                     weight_quantizer = getattr(
                         module, quantizer_attr_names(weight_name).weight_quantizer
                     )
                     weight_quantizer(getattr(module, weight_name))
         seen_modules.add(module)
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@modelopt/torch/quantization/model_calib.py` around lines 67 - 85, The hot loop in weight_only_quantize rebuilds model.named_modules() inside enable_weight_access_and_writeback for every module; precompute name_to_module once and reuse a single context per module to avoid repeated dict construction. Modify weight_only_quantize to call dict(model.named_modules()) once (e.g., name_to_module = dict(model.named_modules())) and pass that mapping into each enable_weight_access_and_writeback(...) invocation for both the QuantModule branch and the fallback branch, and also reuse the same context per module rather than creating a new temporary mapping on each weight access.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Nitpick comments:
In `@modelopt/torch/quantization/model_calib.py`:
- Around line 67-85: The hot loop in weight_only_quantize rebuilds
model.named_modules() inside enable_weight_access_and_writeback for every
module; precompute name_to_module once and reuse a single context per module to
avoid repeated dict construction. Modify weight_only_quantize to call
dict(model.named_modules()) once (e.g., name_to_module =
dict(model.named_modules())) and pass that mapping into each
enable_weight_access_and_writeback(...) invocation for both the QuantModule
branch and the fallback branch, and also reuse the same context per module
rather than creating a new temporary mapping on each weight access.
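The cost argument behind this suggestion can be illustrated with a self-contained sketch. `named_modules` and `enable_weight_access_and_writeback` below are simplified stand-ins that only count how often the name-to-module mapping is rebuilt:

```python
from contextlib import contextmanager

rebuild_count = 0


def named_modules(model):
    """Stand-in for model.named_modules(); counts how often it is walked."""
    global rebuild_count
    rebuild_count += 1
    return list(model.items())


@contextmanager
def enable_weight_access_and_writeback(module, model, name_to_module=None):
    # Fallback path: rebuild the mapping when it is not supplied.
    if name_to_module is None:
        name_to_module = dict(named_modules(model))
    yield


model = {"fc1": object(), "fc2": object(), "fc3": object()}

# Without caching: the mapping is rebuilt once per module.
for _name, module in model.items():
    with enable_weight_access_and_writeback(module, model):
        pass
uncached_rebuilds = rebuild_count

# With caching: built once up front and passed through.
rebuild_count = 0
name_to_module = dict(named_modules(model))
for _name, module in name_to_module.items():
    with enable_weight_access_and_writeback(module, model, name_to_module):
        pass
cached_rebuilds = rebuild_count

print(uncached_rebuilds, cached_rebuilds)  # 3 1
```

On a model with thousands of submodules, the uncached path walks the whole module tree once per visited module, which is the quadratic behavior the nitpick is flagging.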
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: a8d33326-47ca-480b-baff-e56c4eb6fc3c
📥 Commits
Reviewing files that changed from the base of the PR and between 57099fbae96bc6c02e8aef1c2267906b8dc8304b and 45ab8ca.
📒 Files selected for processing (4)
- modelopt/torch/quantization/model_calib.py
- modelopt/torch/quantization/nn/modules/quant_module.py
- modelopt/torch/quantization/plugins/transformer_engine.py
- modelopt/torch/quantization/utils.py
🚧 Files skipped from review as they are similar to previous changes (1)
- modelopt/torch/quantization/plugins/transformer_engine.py
The PR description says: [...] But no export files are changed. If this was intentionally deferred, the description should be updated.
Overall LGTM, you can rebase and merge once it passes all the tests.
🧹 Nitpick comments (1)
modelopt/torch/quantization/model_calib.py (1)
76-81: Consider a single writeback context for the fallback loop.

For modules with multiple weight attributes, entering/exiting the writeback context per weight adds avoidable overhead. Wrapping the whole loop once should keep behavior and reduce context churn.
♻️ Proposed refactor
```diff
-        else:
-            for weight_name in weight_attr_names(module):
-                with enable_weight_access_and_writeback(module, model, name_to_module):
-                    weight_quantizer = getattr(
-                        module, quantizer_attr_names(weight_name).weight_quantizer
-                    )
-                    weight_quantizer(getattr(module, weight_name))
+        else:
+            with enable_weight_access_and_writeback(module, model, name_to_module):
+                for weight_name in weight_attr_names(module):
+                    weight_quantizer = getattr(
+                        module, quantizer_attr_names(weight_name).weight_quantizer
+                    )
+                    weight_quantizer(getattr(module, weight_name))
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@modelopt/torch/quantization/model_calib.py` around lines 76 - 81, The loop currently enters enable_weight_access_and_writeback(module, model, name_to_module) for every weight_name; instead, wrap the entire for weight_name in weight_attr_names(module) loop with a single with enable_weight_access_and_writeback(module, model, name_to_module) context so you open the writeback once, then inside call weight_quantizer = getattr(module, quantizer_attr_names(weight_name).weight_quantizer) and weight_quantizer(getattr(module, weight_name)) for each weight_name; this preserves behavior while avoiding repeated context entry/exit overhead.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Nitpick comments:
In `@modelopt/torch/quantization/model_calib.py`:
- Around line 76-81: The loop currently enters
enable_weight_access_and_writeback(module, model, name_to_module) for every
weight_name; instead, wrap the entire for weight_name in
weight_attr_names(module) loop with a single with
enable_weight_access_and_writeback(module, model, name_to_module) context so you
open the writeback once, then inside call weight_quantizer = getattr(module,
quantizer_attr_names(weight_name).weight_quantizer) and
weight_quantizer(getattr(module, weight_name)) for each weight_name; this
preserves behavior while avoiding repeated context entry/exit overhead.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: 6f77587b-0518-4df9-a9a6-a734ce3144e6
📒 Files selected for processing (2)
- modelopt/torch/quantization/model_calib.py
- modelopt/torch/quantization/utils/core_utils.py
@yueshen2016 Thanks! This PR is just for quantization. I've updated the description regarding the export part.
Codecov Report

❌ Patch coverage is ...

Additional details and impacted files:

```
@@            Coverage Diff             @@
##             main     #971      +/-   ##
==========================================
+ Coverage   74.68%   75.83%   +1.14%
==========================================
  Files         349      349
  Lines       39846    39856      +10
==========================================
+ Hits        29759    30224     +465
+ Misses      10087     9632     -455
```

Flags with carried forward coverage won't be shown. ☔ View full report in Codecov by Sentry.
Force-pushed 81c7a9b to 052e360
ChenhanYu left a comment:

Review Summary
Clean, well-scoped bug fix for MoE weight-only quantization. In `TEGroupedMLP`, weights are stored per-expert as `weight0`...`weightN` (not a single `self.weight`). The existing calibration code assumed a single weight attribute and missed these, causing `_amax` to never be computed — crash during export.

The fix introduces `iter_weights_for_calibration()` on `QuantModule` (base class) and overrides it in `_QuantTEGroupedLinear` to yield all per-expert weights with their shared quantizer. Good use of polymorphism — small, focused changes.
Attention
Shared quantizer across all experts — In `transformer_engine.py`, all expert weights are yielded with the same `self.weight_quantizer`, so `_amax` is computed across all experts jointly. This is likely intentional for NVFP4 (one scale for the whole layer), but worth confirming vs. per-expert calibration.
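The distinction being raised can be shown with a toy example (illustrative numbers, not the modelopt code): a quantizer shared across experts produces one joint amax, while per-expert calibration would keep separate maxima:

```python
expert_weights = [
    [0.10, -0.50],  # expert 0
    [0.90, 0.20],   # expert 1
]

# Shared quantizer: one amax computed across all experts jointly.
shared_amax = max(abs(v) for w in expert_weights for v in w)
print(shared_amax)  # 0.9

# Per-expert alternative: each expert keeps its own scale.
per_expert_amax = [max(abs(v) for v in w) for w in expert_weights]
print(per_expert_amax)  # [0.5, 0.9]
```

With the shared scheme, expert 0's smaller dynamic range (0.5) is quantized against the joint 0.9 scale, which is the trade-off worth confirming against per-expert calibration.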
Minor Notes

- The `isinstance(module, QuantModule)` check in `model_calib.py` changes behavior from the old code, which iterated `weight_attr_names(module)` for any module. If there are non-`QuantModule` subclasses with quantizable weights, they'd be silently skipped. Fine if all quantizable modules are `QuantModule` subclasses.
- `core_utils.py` — Adding the `weight is not None` guard is the right root fix to prevent yielding `"weight"` when `self.weight` has been deleted.
- Minor nit: no upper bound validation on `self.num_gemms` in the TE override — a defensive `assert self.num_gemms > 0` in `_setup` would be nice but not blocking.
LGTM overall.
Need one owner approval from @kinjalpatel27 or @realAsma. Also, the DCO check failed. Take a look at this, which is required for OSS contributions.
Force-pushed 21e2b6c to 83b7319
@kinjalpatel27 Hi, I've fixed the DCO issue. Could you please approve the workflows? Thank you!
```python
# Remove self.weight after post_restore.
delattr(self, "weight")


def iter_weights_for_calibration(self):
```
This is unnecessary because `_ParallelLinear` inherits from `QuantModule`, so the `iter_weights_for_calibration` defined in `QuantModule` will be inherited.
@jQizhang
Hi @jenchen13, thanks for the review! The reason for overriding `iter_weights_for_calibration` is that the base implementation is not compatible with `_QuantTEGroupedLinear`.

`_QuantTEGroupedLinear` doesn't have a `self.weight` attribute; the actual weights are stored as `weight0`, `weight1`, ... The base `QuantModule.iter_weights_for_calibration` relies on `weight_attr_names()`, which checks for `self.weight`. So without this override, weight calibration would be silently skipped for grouped linear layers.
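This failure mode can be sketched in a self-contained way, with a simplified `weight_attr_names` standing in for the real helper:

```python
def weight_attr_names(module):
    # Simplified stand-in: only yields "weight" when the attribute exists.
    if getattr(module, "weight", None) is not None:
        yield "weight"


class GroupedLinear:
    """Toy model of _QuantTEGroupedLinear's weight layout after _setup."""

    def __init__(self):
        self.weight = [1.0]
        # _setup deletes self.weight; weights live in per-expert attributes.
        delattr(self, "weight")
        self.weight0 = [0.5, -0.25]
        self.weight1 = [0.75]


# The base-class path finds no weights, so calibration is silently skipped.
base_names = list(weight_attr_names(GroupedLinear()))
print(base_names)  # []

# The override instead walks the per-expert attributes explicitly.
module = GroupedLinear()
expert_weights = [getattr(module, f"weight{i}") for i in range(2)]
print(len(expert_weights))  # 2
```

The empty list from the base path is exactly why `_amax` was never populated before this PR: nothing was ever handed to the weight quantizer.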
Hi @kinjalpatel27 @jenchen13 @ChenhanYu, I noticed that some checks have been stuck for a while. Could you please help take a look? Thanks!

I approved workflows to run, please don't make any more changes to this branch. Thank you!

Hi @jenchen13, it looks like example-pr-required-check and gpu-pr-required-check are still stuck in a pending state. Could you please take a look when you have a moment? Thanks!

Hi @jenchen13, sorry to bother you again! I'm not very familiar with the ModelOpt PR workflow. I noticed the CI failed with a 403 Permission Denied error, which seems unrelated to my code changes. Could you please take a look when you have a moment? Thanks!
/ok to test 75e94cf

/ok to test 75e94cf

/ok to test c10aa27
### What does this PR do?

This PR fixes a critical issue where weight-only quantization fails for MoE models utilizing `TEGroupedMLP` (e.g., Qwen3-30B-A3B).

#### The Problem

In `TEGroupedMLP`, weights are stored per-expert as `weight0`, `weight1`, ..., `weightN`. During `_QuantTEGroupedLinear._setup`, the standard `self.weight` attribute is deleted. The existing `weight_only_quantize` logic expects to find a `self.weight` associated with the quantizer. Because it couldn't find these "hidden" expert weights, the `weight_quantizer` failed to calibrate, resulting in a missing `_amax` attribute. This leads to the following crash during export/inference:

<img width="2792" height="1034" alt="image" src="https://github.com/user-attachments/assets/9e2b1abd-80f4-4b8b-bb95-f8ee7a8f686a" />

```python
File ".../modelopt/torch/quantization/qtensor/nvfp4_tensor.py", line 59, in get_weights_scaling_factor_2_from_quantizer
    assert hasattr(weight_quantizer, "_amax"), "Weight quantizer does not have attribute amax"
```

#### The Solution

1. **Calibration Interface:** Introduced `iter_weights_for_calibration` in the `QuantModule` base class.
2. **MoE Support:** Overrode this method in `_QuantTEGroupedLinear` to yield all per-expert weights (`weight0`...`weightN`) that share the same quantizer. This ensures the calibrator "sees" all expert weights and calculates a valid `_amax`.

---

### 2. Type of change

* [x] Bug fix

---

### 3. Usage / Reproduction

This issue is reproducible when running weight-only quantization on MoE models like Qwen3-30B-A3B:

```bash
# Step 1: Quantization
torchrun --nproc_per_node 8 examples/quantization/quantize.py \
    --hf-model-id Qwen/Qwen3-30B-A3B \
    --export-quant-cfg nvfp4 \
    --tp 2 \
    --ep 8 \
    --weight-only \
    --megatron-save-path ./qwen3_30b_nvfp4
```

---

### 4. Testing & Verification

* **Models Tested:** Qwen3-8B (Dense), Qwen3-30B-A3B (MoE).
* **Quantization:** NVFP4/FP8 weight-only quantization.
* **Verification:**
  * Confirmed that `QuantTEGroupedMLP` now correctly shows calculated `_amax` values in the quantization statistics table instead of remaining `dynamic`.
  * Validated that the change does not regress the dense model (Qwen3-8B) quantization flow.
  * After the fix, the amax of experts is calculated correctly:

```
Quantization Statistics
┃ Parameter Name                                                      ┃ Shape ┃ Max Value  ┃
│ decoder.layers.0.self_attention.linear_proj.weight_quantizer._amax │ ()    │ 7.5781e-01 │
│ decoder.layers.0.self_attention.linear_qkv.weight_quantizer._amax  │ ()    │ 2.8711e-01 │
│ decoder.layers.0.mlp.experts.linear_fc1.weight_quantizer._amax     │ ()    │ 7.1094e-01 │
│ decoder.layers.0.mlp.experts.linear_fc2.weight_quantizer._amax     │ ()    │ 8.6719e-01 │
│ decoder.layers.1.self_attention.linear_proj.weight_quantizer._amax │ ()    │ 5.8594e-01 │
│ decoder.layers.1.self_attention.linear_qkv.weight_quantizer._amax  │ ()    │ 7.4219e-01 │
│ decoder.layers.1.mlp.experts.linear_fc1.weight_quantizer._amax     │ ()    │ 7.2266e-01 │
│ decoder.layers.1.mlp.experts.linear_fc2.weight_quantizer._amax     │ ()    │ 1.9922e+00 │
│ decoder.layers.2.self_attention.linear_proj.weight_quantizer._amax │ ()    │ 1.0859e+00 │
│ decoder.layers.2.self_attention.linear_qkv.weight_quantizer._amax  │ ()    │ 1.7812e+00 │
│ decoder.layers.2.mlp.experts.linear_fc1.weight_quantizer._amax     │ ()    │ 7.3047e-01 │
│ decoder.layers.2.mlp.experts.linear_fc2.weight_quantizer._amax     │ ()    │ 1.9219e+00 │
```

<img width="2790" height="1184" alt="image" src="https://github.com/user-attachments/assets/0a77e8eb-8c9c-41cc-a2d1-b4d0f30aec52" />

## Summary by CodeRabbit

* **New Features**
  * Enhanced weight-only quantization calibration with improved support for specialized quantization modules and grouped-linear quantization paths.
* **Bug Fixes**
  * Fixed handling of missing weight attributes during quantization calibration to prevent incorrect processing.

---

Signed-off-by: larkzhang-nv <larkz@nvidia.com>
Signed-off-by: larkz <larkz@nvidia.com>