Fix weight-only quantization for TEGroupedMLP (MoE models)#971

Merged
jenchen13 merged 7 commits into NVIDIA:main from jQizhang:weight_only_te_fix
Apr 3, 2026

Conversation

@jQizhang
Contributor

@jQizhang jQizhang commented Mar 4, 2026

What does this PR do?

This PR fixes a critical issue where weight-only quantization fails for MoE models utilizing TEGroupedMLP (e.g., Qwen3-30B-A3B).

The Problem:

In TEGroupedMLP, weights are stored per-expert as weight0, weight1, ..., weightN. During _QuantTEGroupedLinear._setup, the standard self.weight attribute is deleted.
The existing weight_only_quantize logic expects to find a self.weight associated with the quantizer. Because it couldn't find these "hidden" expert weights, the weight_quantizer failed to calibrate, resulting in a missing _amax attribute. This leads to the following crash during export/inference:

File ".../modelopt/torch/quantization/qtensor/nvfp4_tensor.py", line 59, in get_weights_scaling_factor_2_from_quantizer
assert hasattr(weight_quantizer, "_amax"), "Weight quantizer does not have attribute amax"

The Solution:

  1. Calibration Interface: Introduced iter_weights_for_calibration in the QuantModule base class.
  2. MoE Support: Overrode this method in _QuantTEGroupedLinear to yield all per-expert weights (weight0...weightN) that share the same quantizer. This ensures the calibrator "sees" all expert weights and calculates a valid _amax.

2. Type of change

  • Bug fix

3. Usage / Reproduction

This issue is reproducible when running weight-only quantization on MoE models like Qwen3-30B-A3B:

# Step 1: Quantization
torchrun --nproc_per_node 8 examples/quantization/quantize.py \
    --hf-model-id Qwen/Qwen3-30B-A3B \
    --export-quant-cfg nvfp4 \
    --tp 2 \
    --ep 8 \
    --weight-only \
    --megatron-save-path ./qwen3_30b_nvfp4

4. Testing & Verification

  • Models Tested: Qwen3-8B (Dense), Qwen3-30B-A3B (MoE).
  • Quantization: NVFP4/FP8 weight-only quantization.
  • Verification: Confirmed that QuantTEGroupedMLP now correctly shows calculated _amax values in the quantization statistics table instead of remaining dynamic.
  • Validated that the change does not regress the dense-model (Qwen3-8B) quantization flow.
  • After the fix, the amax of the expert layers is calculated correctly:
                                  Quantization Statistics                                   
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━┳━━━━━━━━━━━━┓
┃ Parameter Name                                                      ┃ Shape ┃  Max Value ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━╇━━━━━━━━━━━━┩
│ decoder.layers.0.self_attention.linear_proj.weight_quantizer._amax  │ ()    │ 7.5781e-01 │
│ decoder.layers.0.self_attention.linear_qkv.weight_quantizer._amax   │ ()    │ 2.8711e-01 │
│ decoder.layers.0.mlp.experts.linear_fc1.weight_quantizer._amax      │ ()    │ 7.1094e-01 │
│ decoder.layers.0.mlp.experts.linear_fc2.weight_quantizer._amax      │ ()    │ 8.6719e-01 │
│ decoder.layers.1.self_attention.linear_proj.weight_quantizer._amax  │ ()    │ 5.8594e-01 │
│ decoder.layers.1.self_attention.linear_qkv.weight_quantizer._amax   │ ()    │ 7.4219e-01 │
│ decoder.layers.1.mlp.experts.linear_fc1.weight_quantizer._amax      │ ()    │ 7.2266e-01 │
│ decoder.layers.1.mlp.experts.linear_fc2.weight_quantizer._amax      │ ()    │ 1.9922e+00 │
│ decoder.layers.2.self_attention.linear_proj.weight_quantizer._amax  │ ()    │ 1.0859e+00 │
│ decoder.layers.2.self_attention.linear_qkv.weight_quantizer._amax   │ ()    │ 1.7812e+00 │
│ decoder.layers.2.mlp.experts.linear_fc1.weight_quantizer._amax      │ ()    │ 7.3047e-01 │
│ decoder.layers.2.mlp.experts.linear_fc2.weight_quantizer._amax      │ ()    │ 1.9219e+00 │

Summary by CodeRabbit

  • New Features

    • Enhanced weight-only quantization calibration with improved support for specialized quantization modules and grouped-linear quantization paths.
  • Bug Fixes

    • Fixed handling of missing weight attributes during quantization calibration to prevent incorrect processing.

@jQizhang jQizhang requested review from a team as code owners March 4, 2026 15:47
@jQizhang jQizhang requested review from jingyu-ml and sychen52 March 4, 2026 15:47
@copy-pr-bot

copy-pr-bot bot commented Mar 4, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@coderabbitai
Contributor

coderabbitai bot commented Mar 4, 2026

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

📝 Walkthrough

This pull request refactors weight-only quantization calibration to use an iterator-based API. The QuantModule class and its specialized subclasses now expose weight calibration pairs via an iter_weights_for_calibration() method, replacing direct weight attribute access patterns in the calibration flow.

Changes

  • Calibration Logic (modelopt/torch/quantization/model_calib.py): Modified weight_only_quantize to add a QuantModule-specific calibration path using the new iter_weights_for_calibration() iterator, replacing direct weight-attribute access while preserving non-QuantModule behavior.
  • QuantModule Base & Plugin (modelopt/torch/quantization/nn/modules/quant_module.py, modelopt/torch/quantization/plugins/transformer_engine.py): Added an iter_weights_for_calibration() method to the QuantModule base class and specialized it in the _QuantTEGroupedLinear subclass to yield (weight, weight_quantizer) pairs for calibration.
  • Utility Validation (modelopt/torch/quantization/utils/core_utils.py): Added a guard in weight_attr_names() to verify weight-attribute existence before yielding, preventing invalid entries for modules without weight attributes.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

🚥 Pre-merge checks: 3 passed, 1 failed

❌ Failed checks (1 warning)
  • Docstring Coverage ⚠️ Warning: Docstring coverage is 56.25%, below the required threshold of 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (3)
  • Description Check: Check skipped; CodeRabbit's high-level summary is enabled.
  • Security Anti-Patterns: PR introduces weight calibration methods with only algorithmic changes and no security anti-patterns like unsafe torch.load(), eval(), or exec().
  • Title Check: The title 'Fix weight-only quantization for TEGroupedMLP (MoE models)' directly and clearly summarizes the main change.


Comment @coderabbitai help to get the list of available commands and usage tips.

Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@modelopt/torch/quantization/model_calib.py`:
- Around line 74-84: The QuantModule branch captures weights by calling
module.iter_weights_for_calibration() outside the
enable_weight_access_and_writeback context, so remapped/sharded/offloaded
weights may be stale; move the call into the context so iteration happens while
enable_weight_access_and_writeback(module, model) is active (i.e., enter the
with block first, then call module.iter_weights_for_calibration() and call
weight_quantizer(weight) inside that context). Keep the else branch behavior for
weight_attr_names/quantizer_attr_names unchanged.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: a8c54433-356b-4651-9f26-5e26affd01eb

📥 Commits

Reviewing files that changed from the base of the PR and between a34d613 and 77cf3d2fb4dbb8e0c08c33b8ed7d60051fc4ba73.

📒 Files selected for processing (5)
  • modelopt/torch/export/unified_export_megatron.py
  • modelopt/torch/quantization/model_calib.py
  • modelopt/torch/quantization/nn/modules/quant_module.py
  • modelopt/torch/quantization/plugins/transformer_engine.py
  • modelopt/torch/quantization/utils.py

Comment thread modelopt/torch/quantization/model_calib.py Outdated
@jQizhang jQizhang force-pushed the weight_only_te_fix branch from 286eef1 to 57099fb Compare March 10, 2026 15:19
Signed-off-by: larkzhang-nv <larkz@nvidia.com>
@jQizhang jQizhang force-pushed the weight_only_te_fix branch from 57099fb to 45ab8ca Compare March 12, 2026 02:35
Contributor

@coderabbitai coderabbitai bot left a comment


🧹 Nitpick comments (1)
modelopt/torch/quantization/model_calib.py (1)

67-85: Cache name_to_module once in this hot path.

enable_weight_access_and_writeback() falls back to rebuilding dict(model.named_modules()) when name_to_module is omitted. Doing that for every module here makes weight-only calibration unnecessarily expensive on large MoE models. It would be better to precompute the mapping once and keep a single writeback context per module.

♻️ Proposed refactor
 def weight_only_quantize(model: nn.Module):
     """Just quantize the weights of the model."""
     seen_modules = set()
-    for name, module in model.named_modules():
+    name_to_module = dict(model.named_modules())
+    for name, module in name_to_module.items():
         if module in seen_modules:
             continue

-        if isinstance(module, QuantModule):
-            with enable_weight_access_and_writeback(module, model):
+        with enable_weight_access_and_writeback(module, model, name_to_module):
+            if isinstance(module, QuantModule):
                 for weight, weight_quantizer in module.iter_weights_for_calibration():
                     weight_quantizer(weight)
-        else:
-            for weight_name in weight_attr_names(module):
-                with enable_weight_access_and_writeback(module, model):
+            else:
+                for weight_name in weight_attr_names(module):
                     weight_quantizer = getattr(
                         module, quantizer_attr_names(weight_name).weight_quantizer
                     )
                     weight_quantizer(getattr(module, weight_name))
         seen_modules.add(module)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@modelopt/torch/quantization/model_calib.py` around lines 67 - 85, The hot
loop in weight_only_quantize rebuilds model.named_modules() inside
enable_weight_access_and_writeback for every module; precompute name_to_module
once and reuse a single context per module to avoid repeated dict construction.
Modify weight_only_quantize to call dict(model.named_modules()) once (e.g.,
name_to_module = dict(model.named_modules())) and pass that mapping into each
enable_weight_access_and_writeback(...) invocation for both the QuantModule
branch and the fallback branch, and also reuse the same context per module
rather than creating a new temporary mapping on each weight access.


📥 Commits

Reviewing files that changed from the base of the PR and between 57099fbae96bc6c02e8aef1c2267906b8dc8304b and 45ab8ca.

📒 Files selected for processing (4)
  • modelopt/torch/quantization/model_calib.py
  • modelopt/torch/quantization/nn/modules/quant_module.py
  • modelopt/torch/quantization/plugins/transformer_engine.py
  • modelopt/torch/quantization/utils.py
🚧 Files skipped from review as they are similar to previous changes (1)
  • modelopt/torch/quantization/plugins/transformer_engine.py

@yueshen2016
Contributor

The PR description says:
▎ Updated GPTModelExporter to correctly handle the structure of TEGroupedMLP during HuggingFace format conversion

But no export files are changed. If this was intentionally deferred, the description should be updated.

@yueshen2016
Contributor

Overall LGTM, you can rebase and merge once it passes all the tests.
Is this fix just for quantization or also includes export?

Contributor

@coderabbitai coderabbitai bot left a comment


🧹 Nitpick comments (1)
modelopt/torch/quantization/model_calib.py (1)

76-81: Consider a single writeback context for the fallback loop.

For modules with multiple weight attributes, entering/exiting writeback context per weight adds avoidable overhead. Wrapping the whole loop once should keep behavior and reduce context churn.

♻️ Proposed refactor
-        else:
-            for weight_name in weight_attr_names(module):
-                with enable_weight_access_and_writeback(module, model, name_to_module):
-                    weight_quantizer = getattr(
-                        module, quantizer_attr_names(weight_name).weight_quantizer
-                    )
-                    weight_quantizer(getattr(module, weight_name))
+        else:
+            with enable_weight_access_and_writeback(module, model, name_to_module):
+                for weight_name in weight_attr_names(module):
+                    weight_quantizer = getattr(
+                        module, quantizer_attr_names(weight_name).weight_quantizer
+                    )
+                    weight_quantizer(getattr(module, weight_name))
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@modelopt/torch/quantization/model_calib.py` around lines 76 - 81, The loop
currently enters enable_weight_access_and_writeback(module, model,
name_to_module) for every weight_name; instead, wrap the entire for weight_name
in weight_attr_names(module) loop with a single with
enable_weight_access_and_writeback(module, model, name_to_module) context so you
open the writeback once, then inside call weight_quantizer = getattr(module,
quantizer_attr_names(weight_name).weight_quantizer) and
weight_quantizer(getattr(module, weight_name)) for each weight_name; this
preserves behavior while avoiding repeated context entry/exit overhead.


📥 Commits

Reviewing files that changed from the base of the PR and between 45ab8ca and 052e360.

📒 Files selected for processing (2)
  • modelopt/torch/quantization/model_calib.py
  • modelopt/torch/quantization/utils/core_utils.py

@jQizhang jQizhang changed the title Fix weight-only quantization and export for TEGroupedMLP (MoE models) Fix weight-only quantization for TEGroupedMLP (MoE models) Mar 18, 2026
@jQizhang
Contributor Author

Overall LGTM, you can rebase and merge once it passes all the tests. Is this fix just for quantization or also includes export?

@yueshen2016 Thanks! This PR is just for quantization; I've updated the description regarding the export part.
I also finished a rebase to clear some conflicts. Could you please take one final look?

Contributor

@yueshen2016 yueshen2016 left a comment


LGTM

@codecov

codecov bot commented Mar 18, 2026

Codecov Report

❌ Patch coverage is 71.42857% with 4 lines in your changes missing coverage. Please review.
✅ Project coverage is 75.83%. Comparing base (18ddcb7) to head (c10aa27).
⚠️ Report is 2 commits behind head on main.

Files with missing lines Patch % Lines
...t/torch/quantization/plugins/transformer_engine.py 20.00% 4 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #971      +/-   ##
==========================================
+ Coverage   74.68%   75.83%   +1.14%     
==========================================
  Files         349      349              
  Lines       39846    39856      +10     
==========================================
+ Hits        29759    30224     +465     
+ Misses      10087     9632     -455     
Flag coverage:
  • examples: 43.92% <71.42%> (+3.61%) ⬆️
  • gpu: 57.09% <71.42%> (-0.16%) ⬇️
  • unit: 54.53% <64.28%> (-0.02%) ⬇️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.

@jQizhang jQizhang force-pushed the weight_only_te_fix branch from 81c7a9b to 052e360 Compare March 20, 2026 07:52
@kevalmorabia97 kevalmorabia97 requested review from jenchen13 and realAsma and removed request for jingyu-ml March 20, 2026 09:51
@ChenhanYu ChenhanYu removed request for a team and sychen52 March 20, 2026 21:50
Copy link
Copy Markdown
Collaborator

@ChenhanYu ChenhanYu left a comment


Review Summary

Clean, well-scoped bug fix for MoE weight-only quantization. In TEGroupedMLP, weights are stored per-expert as weight0...weightN (not a single self.weight). The existing calibration code assumed a single weight attribute and missed these, causing _amax to never be computed — crash during export.

The fix introduces iter_weights_for_calibration() on QuantModule (base class) and overrides it in _QuantTEGroupedLinear to yield all per-expert weights with their shared quantizer. Good use of polymorphism — small, focused changes.

Attention

Shared quantizer across all experts — In transformer_engine.py, all expert weights are yielded with the same self.weight_quantizer, so _amax is computed across all experts jointly. This is likely intentional for NVFP4 (one scale for the whole layer), but worth confirming vs. per-expert calibration.
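The joint-calibration behavior flagged above can be illustrated with a small, hypothetical example (plain Python, not ModelOpt code): with a shared quantizer, the recorded amax is the maximum absolute value across all experts combined, rather than one amax per expert.

```python
# Hypothetical per-expert weight values, flattened to plain lists for clarity.
expert_weights = [
    [0.2, -0.71],   # weight0
    [0.5, 0.33],    # weight1
    [-0.86, 0.1],   # weight2
]

# Per-expert calibration would record one amax for each expert...
per_expert_amax = [max(abs(w) for w in ws) for ws in expert_weights]

# ...whereas a shared quantizer observes all experts and records a single amax.
shared_amax = max(per_expert_amax)

print(per_expert_amax)  # [0.71, 0.5, 0.86]
print(shared_amax)      # 0.86
```

A single shared scale simplifies export but can lose precision for experts whose weights have a much smaller dynamic range than the layer-wide maximum.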

Minor Notes

  1. The isinstance(module, QuantModule) check in model_calib.py changes behavior from the old code which iterated weight_attr_names(module) for any module. If there are non-QuantModule subclasses with quantizable weights, they'd be silently skipped. Fine if all quantizable modules are QuantModule subclasses.

  2. core_utils.py — Adding weight is not None guard is the right root fix to prevent yielding "weight" when self.weight has been deleted.

  3. Minor nit: no upper bound validation on self.num_gemms in the TE override — a defensive assert self.num_gemms > 0 in _setup would be nice but not blocking.

LGTM overall.

@ChenhanYu
Collaborator

ChenhanYu commented Mar 20, 2026

Need one of the owner approval from @kinjalpatel27 or @realAsma

Also, this PR fails the DCO check. Take a look at the following, which is required for OSS contributions:
https://github.com/NVIDIA/Model-Optimizer/pull/971/checks?check_run_id=67882147421

Signed-off-by: larkz <larkz@nvidia.com>
@jQizhang jQizhang force-pushed the weight_only_te_fix branch from 21e2b6c to 83b7319 Compare March 21, 2026 06:02
@jQizhang
Contributor Author

@kinjalpatel27 Hi, I've fixed the DCO issue. Could you please approve the workflows? Thank you!

# Remove self.weight after post_restore.
delattr(self, "weight")

def iter_weights_for_calibration(self):
Contributor

@jenchen13 jenchen13 Mar 23, 2026


this is unnecessary because _ParallelLinear inherits from QuantModule, so the iter_weights_for_calibration defined in QuantModule will be inherited
@jQizhang

Contributor Author


Hi @jenchen13 thanks for the review! The reason for overriding iter_weights_for_calibration is that the base implementation is not compatible with _QuantTEGroupedLinear.
_QuantTEGroupedLinear doesn't have a self.weight attribute; the actual weights are stored as weight0, weight1, .... The base QuantModule.iter_weights_for_calibration relies on weight_attr_names(), which checks for self.weight, so without this override, weight calibration would be silently skipped for grouped linear layers.
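This skip-on-missing-weight behavior can be sketched as follows, assuming a simplified base iterator (hypothetical names; not the exact ModelOpt code):

```python
# Simplified stand-in for the base-class iterator: it only yields when a
# `weight` attribute exists, mirroring the weight_attr_names() check.
class QuantModuleBase:
    def iter_weights_for_calibration(self):
        if getattr(self, "weight", None) is not None:
            yield self.weight, self.weight_quantizer


class GroupedLinearWithoutOverride(QuantModuleBase):
    def __init__(self):
        # Per-expert storage only; no `weight` attribute, as after _setup.
        self.weight0 = "expert-0 weight"
        self.weight_quantizer = "quantizer"


# Without the override, the iterator yields nothing, so calibration is
# silently skipped and _amax is never computed.
assert list(GroupedLinearWithoutOverride().iter_weights_for_calibration()) == []
```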

@jenchen13 jenchen13 enabled auto-merge (squash) March 24, 2026 04:03
@jQizhang
Contributor Author

Hi @kinjalpatel27 @jenchen13 @ChenhanYu , I noticed that some checks have been stuck for a while. Could you please help take a look? Thanks!

@jenchen13
Contributor

I approved workflows to run, please don't make any more changes to this branch. thank you!

@jQizhang
Contributor Author

Hi @jenchen13 , It looks like example-pr-required-check and gpu-pr-required-check are still stuck in a pending state. Could you please take a look when you have a moment? Thanks!

@jQizhang
Contributor Author

jQizhang commented Apr 1, 2026

Hi @jenchen13 , sorry to bother you again! I'm not very familiar with the ModelOpt PR workflow. I noticed the CI failed with a 403 Permission Denied error, which seems unrelated to my code changes. Could you please take a look when you have a moment? Thanks!

@jenchen13
Contributor

/ok to test 75e94cf


@jenchen13
Contributor

/ok to test c10aa27

@jenchen13 jenchen13 merged commit 9aab38c into NVIDIA:main Apr 3, 2026
61 of 68 checks passed
kevalmorabia97 pushed a commit that referenced this pull request Apr 6, 2026
