Fix weight-only quantization for TEGroupedMLP (MoE models) by jQizhang · Pull Request #971 · NVIDIA/Model-Optimizer

jQizhang · 2026-03-04T15:47:36Z

What does this PR do?

This PR fixes a critical issue where weight-only quantization fails for MoE models utilizing TEGroupedMLP (e.g., Qwen3-30B-A3B).

The Problem:

In TEGroupedMLP, weights are stored per-expert as weight0, weight1, ..., weightN. During _QuantTEGroupedLinear._setup, the standard self.weight attribute is deleted.
The existing weight_only_quantize logic expects to find a self.weight associated with the quantizer. Because it couldn't find these "hidden" expert weights, the weight_quantizer failed to calibrate, resulting in a missing _amax attribute. This leads to the following crash during export/inference:

File ".../modelopt/torch/quantization/qtensor/nvfp4_tensor.py", line 59, in get_weights_scaling_factor_2_from_quantizer
assert hasattr(weight_quantizer, "_amax"), "Weight quantizer does not have attribute amax"

The Solution:

Calibration Interface: Introduced iter_weights_for_calibration in the QuantModule base class.
MoE Support: Overrode this method in _QuantTEGroupedLinear to yield all per-expert weights (weight0...weightN) that share the same quantizer. This ensures the calibrator "sees" all expert weights and calculates a valid _amax.

2. Type of change

Bug fix

3. Usage / Reproduction

This issue is reproducible when running weight-only quantization on MoE models like Qwen3-30B-A3B:

# Step 1: Quantization
torchrun --nproc_per_node 8 examples/quantization/quantize.py \
    --hf-model-id Qwen/Qwen3-30B-A3B \
    --export-quant-cfg nvfp4 \
    --tp 2 \
    --ep 8 \
    --weight-only \
    --megatron-save-path ./qwen3_30b_nvfp4

4. Testing & Verification

Models Tested: Qwen3-8B (Dense), Qwen3-30B-A3B (MoE).
Quantization: NVFP4/FP8 weight-only quantization.
Verification: - Confirmed that QuantTEGroupedMLP now correctly shows calculated _amax values in the quantization statistics table instead of remaining dynamic.
Validated that the change does not regress dense model (Qwen3-8B) quantization flow.
After fix, the amax of experts can be calculated correctly.

                                  Quantization Statistics                                   
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━┳━━━━━━━━━━━━┓
┃ Parameter Name                                                      ┃ Shape ┃  Max Value ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━╇━━━━━━━━━━━━┩
│ decoder.layers.0.self_attention.linear_proj.weight_quantizer._amax  │ ()    │ 7.5781e-01 │
│ decoder.layers.0.self_attention.linear_qkv.weight_quantizer._amax   │ ()    │ 2.8711e-01 │
│ decoder.layers.0.mlp.experts.linear_fc1.weight_quantizer._amax      │ ()    │ 7.1094e-01 │
│ decoder.layers.0.mlp.experts.linear_fc2.weight_quantizer._amax      │ ()    │ 8.6719e-01 │
│ decoder.layers.1.self_attention.linear_proj.weight_quantizer._amax  │ ()    │ 5.8594e-01 │
│ decoder.layers.1.self_attention.linear_qkv.weight_quantizer._amax   │ ()    │ 7.4219e-01 │
│ decoder.layers.1.mlp.experts.linear_fc1.weight_quantizer._amax      │ ()    │ 7.2266e-01 │
│ decoder.layers.1.mlp.experts.linear_fc2.weight_quantizer._amax      │ ()    │ 1.9922e+00 │
│ decoder.layers.2.self_attention.linear_proj.weight_quantizer._amax  │ ()    │ 1.0859e+00 │
│ decoder.layers.2.self_attention.linear_qkv.weight_quantizer._amax   │ ()    │ 1.7812e+00 │
│ decoder.layers.2.mlp.experts.linear_fc1.weight_quantizer._amax      │ ()    │ 7.3047e-01 │
│ decoder.layers.2.mlp.experts.linear_fc2.weight_quantizer._amax      │ ()    │ 1.9219e+00 │

Summary by CodeRabbit

New Features
- Enhanced weight-only quantization calibration with improved support for specialized quantization modules and grouped-linear quantization paths.
Bug Fixes
- Fixed handling of missing weight attributes during quantization calibration to prevent incorrect processing.

copy-pr-bot · 2026-03-04T15:47:41Z

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

coderabbitai · 2026-03-04T15:47:57Z

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

@coderabbitai resume to resume automatic reviews.
@coderabbitai review to trigger a single review.

Use the checkboxes below for quick actions:

▶️ Resume reviews
🔍 Trigger review

📝 Walkthrough

Walkthrough

This pull request refactors weight-only quantization calibration to use an iterator-based API. The QuantModule class and its specialized subclasses now expose weight calibration pairs via an iter_weights_for_calibration() method, replacing direct weight attribute access patterns in the calibration flow.

Changes

Cohort / File(s)	Summary
Calibration Logic `modelopt/torch/quantization/model_calib.py`	Modified `weight_only_quantize` to add QuantModule-specific calibration path using the new `iter_weights_for_calibration()` iterator, replacing direct weighted attribute access while preserving non-QuantModule behavior.
QuantModule Base & Plugin `modelopt/torch/quantization/nn/modules/quant_module.py`, `modelopt/torch/quantization/plugins/transformer_engine.py`	Added `iter_weights_for_calibration()` method to QuantModule base class and specialized `_QuantTEGroupedLinear` subclass to yield (weight, weight_quantizer) pairs for calibration.
Utility Validation `modelopt/torch/quantization/utils/core_utils.py`	Added guard in `weight_attr_names()` to verify weight attribute existence before yielding, preventing invalid entries for modules without weight attributes.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

🚥 Pre-merge checks | ✅ 3 | ❌ 1

❌ Failed checks (1 warning)

Check name	Status	Explanation	Resolution
Docstring Coverage	⚠️ Warning	Docstring coverage is 56.25% which is insufficient. The required threshold is 80.00%.	Write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (3 passed)

Check name	Status	Explanation
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.
Security Anti-Patterns	✅ Passed	PR introduces weight calibration methods with only algorithmic changes and no security anti-patterns like unsafe torch.load(), eval(), or exec().
Title check	✅ Passed	The title 'Fix weight-only quantization for TEGroupedMLP (MoE models)' directly and clearly summarizes the main change: fixing a bug in weight-only quantization for MoE models using TEGroupedMLP.

_{✏️ Tip: You can configure your own custom pre-merge checks in the settings.}

✨ Finishing Touches

🧪 Generate unit tests (beta)

Create PR with unit tests

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

coderabbitai

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@modelopt/torch/quantization/model_calib.py`:
- Around line 74-84: The QuantModule branch captures weights by calling
module.iter_weights_for_calibration() outside the
enable_weight_access_and_writeback context, so remapped/sharded/offloaded
weights may be stale; move the call into the context so iteration happens while
enable_weight_access_and_writeback(module, model) is active (i.e., enter the
with block first, then call module.iter_weights_for_calibration() and call
weight_quantizer(weight) inside that context). Keep the else branch behavior for
weight_attr_names/quantizer_attr_names unchanged.

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: a8c54433-356b-4651-9f26-5e26affd01eb

📥 Commits

Reviewing files that changed from the base of the PR and between a34d613 and 77cf3d2fb4dbb8e0c08c33b8ed7d60051fc4ba73.

📒 Files selected for processing (5)

modelopt/torch/export/unified_export_megatron.py
modelopt/torch/quantization/model_calib.py
modelopt/torch/quantization/nn/modules/quant_module.py
modelopt/torch/quantization/plugins/transformer_engine.py
modelopt/torch/quantization/utils.py

Signed-off-by: larkzhang-nv <larkz@nvidia.com>

coderabbitai

🧹 Nitpick comments (1)

modelopt/torch/quantization/model_calib.py (1)

67-85: Cache name_to_module once in this hot path.

enable_weight_access_and_writeback() falls back to rebuilding dict(model.named_modules()) when name_to_module is omitted. Doing that for every module here makes weight-only calibration unnecessarily expensive on large MoE models. It would be better to precompute the mapping once and keep a single writeback context per module.

♻️ Proposed refactor

 def weight_only_quantize(model: nn.Module):
     """Just quantize the weights of the model."""
     seen_modules = set()
-    for name, module in model.named_modules():
+    name_to_module = dict(model.named_modules())
+    for name, module in name_to_module.items():
         if module in seen_modules:
             continue

-        if isinstance(module, QuantModule):
-            with enable_weight_access_and_writeback(module, model):
+        with enable_weight_access_and_writeback(module, model, name_to_module):
+            if isinstance(module, QuantModule):
                 for weight, weight_quantizer in module.iter_weights_for_calibration():
                     weight_quantizer(weight)
-        else:
-            for weight_name in weight_attr_names(module):
-                with enable_weight_access_and_writeback(module, model):
+            else:
+                for weight_name in weight_attr_names(module):
                     weight_quantizer = getattr(
                         module, quantizer_attr_names(weight_name).weight_quantizer
                     )
                     weight_quantizer(getattr(module, weight_name))
         seen_modules.add(module)

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In `@modelopt/torch/quantization/model_calib.py` around lines 67 - 85, The hot
loop in weight_only_quantize rebuilds model.named_modules() inside
enable_weight_access_and_writeback for every module; precompute name_to_module
once and reuse a single context per module to avoid repeated dict construction.
Modify weight_only_quantize to call dict(model.named_modules()) once (e.g.,
name_to_module = dict(model.named_modules())) and pass that mapping into each
enable_weight_access_and_writeback(...) invocation for both the QuantModule
branch and the fallback branch, and also reuse the same context per module
rather than creating a new temporary mapping on each weight access.

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Nitpick comments:
In `@modelopt/torch/quantization/model_calib.py`:
- Around line 67-85: The hot loop in weight_only_quantize rebuilds
model.named_modules() inside enable_weight_access_and_writeback for every
module; precompute name_to_module once and reuse a single context per module to
avoid repeated dict construction. Modify weight_only_quantize to call
dict(model.named_modules()) once (e.g., name_to_module =
dict(model.named_modules())) and pass that mapping into each
enable_weight_access_and_writeback(...) invocation for both the QuantModule
branch and the fallback branch, and also reuse the same context per module
rather than creating a new temporary mapping on each weight access.

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: a8d33326-47ca-480b-baff-e56c4eb6fc3c

📥 Commits

Reviewing files that changed from the base of the PR and between 57099fbae96bc6c02e8aef1c2267906b8dc8304b and 45ab8ca.

📒 Files selected for processing (4)

modelopt/torch/quantization/model_calib.py
modelopt/torch/quantization/nn/modules/quant_module.py
modelopt/torch/quantization/plugins/transformer_engine.py
modelopt/torch/quantization/utils.py

🚧 Files skipped from review as they are similar to previous changes (1)

modelopt/torch/quantization/plugins/transformer_engine.py

yueshen2016 · 2026-03-17T22:41:45Z

The PR description says:
▎ Updated GPTModelExporter to correctly handle the structure of TEGroupedMLP during HuggingFace format conversion

But no export files are changed. If this was intentionally deferred, the description should be updated.

yueshen2016 · 2026-03-17T22:43:10Z

Overall LGTM, you can rebase and merge once it passes all the tests.
Is this fix just for quantization or also includes export?

coderabbitai

🧹 Nitpick comments (1)

modelopt/torch/quantization/model_calib.py (1)

76-81: Consider a single writeback context for the fallback loop.

For modules with multiple weight attributes, entering/exiting writeback context per weight adds avoidable overhead. Wrapping the whole loop once should keep behavior and reduce context churn.

♻️ Proposed refactor

-        else:
-            for weight_name in weight_attr_names(module):
-                with enable_weight_access_and_writeback(module, model, name_to_module):
-                    weight_quantizer = getattr(
-                        module, quantizer_attr_names(weight_name).weight_quantizer
-                    )
-                    weight_quantizer(getattr(module, weight_name))
+        else:
+            with enable_weight_access_and_writeback(module, model, name_to_module):
+                for weight_name in weight_attr_names(module):
+                    weight_quantizer = getattr(
+                        module, quantizer_attr_names(weight_name).weight_quantizer
+                    )
+                    weight_quantizer(getattr(module, weight_name))

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In `@modelopt/torch/quantization/model_calib.py` around lines 76 - 81, The loop
currently enters enable_weight_access_and_writeback(module, model,
name_to_module) for every weight_name; instead, wrap the entire for weight_name
in weight_attr_names(module) loop with a single with
enable_weight_access_and_writeback(module, model, name_to_module) context so you
open the writeback once, then inside call weight_quantizer = getattr(module,
quantizer_attr_names(weight_name).weight_quantizer) and
weight_quantizer(getattr(module, weight_name)) for each weight_name; this
preserves behavior while avoiding repeated context entry/exit overhead.

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Nitpick comments:
In `@modelopt/torch/quantization/model_calib.py`:
- Around line 76-81: The loop currently enters
enable_weight_access_and_writeback(module, model, name_to_module) for every
weight_name; instead, wrap the entire for weight_name in
weight_attr_names(module) loop with a single with
enable_weight_access_and_writeback(module, model, name_to_module) context so you
open the writeback once, then inside call weight_quantizer = getattr(module,
quantizer_attr_names(weight_name).weight_quantizer) and
weight_quantizer(getattr(module, weight_name)) for each weight_name; this
preserves behavior while avoiding repeated context entry/exit overhead.

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 6f77587b-0518-4df9-a9a6-a734ce3144e6

📥 Commits

Reviewing files that changed from the base of the PR and between 45ab8ca and 052e360.

📒 Files selected for processing (2)

modelopt/torch/quantization/model_calib.py
modelopt/torch/quantization/utils/core_utils.py

jQizhang · 2026-03-18T03:37:30Z

Overall LGTM, you can rebase and merge once it passes all the tests. Is this fix just for quantization or also includes export?

@yueshen2016 Thanks! this PR is just for quantization. I've update the description regarding the export part.
I also finished a rebase to clear some conflicts. Could you please take one final look?

yueshen2016

LGTM

codecov · 2026-03-18T07:45:22Z

Codecov Report

❌ Patch coverage is 71.42857% with 4 lines in your changes missing coverage. Please review.
✅ Project coverage is 75.83%. Comparing base (18ddcb7) to head (c10aa27).
⚠️ Report is 2 commits behind head on main.

Files with missing lines	Patch %	Lines
...t/torch/quantization/plugins/transformer_engine.py	20.00%	4 Missing ⚠️

Additional details and impacted files

@@            Coverage Diff             @@
##             main     #971      +/-   ##
==========================================
+ Coverage   74.68%   75.83%   +1.14%     
==========================================
  Files         349      349              
  Lines       39846    39856      +10     
==========================================
+ Hits        29759    30224     +465     
+ Misses      10087     9632     -455

Flag	Coverage Δ
examples	`43.92% <71.42%> (+3.61%)`	⬆️
gpu	`57.09% <71.42%> (-0.16%)`	⬇️
unit	`54.53% <64.28%> (-0.02%)`	⬇️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:

❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

ChenhanYu

Review Summary

Clean, well-scoped bug fix for MoE weight-only quantization. In TEGroupedMLP, weights are stored per-expert as weight0...weightN (not a single self.weight). The existing calibration code assumed a single weight attribute and missed these, causing _amax to never be computed — crash during export.

The fix introduces iter_weights_for_calibration() on QuantModule (base class) and overrides it in _QuantTEGroupedLinear to yield all per-expert weights with their shared quantizer. Good use of polymorphism — small, focused changes.

Attention

Shared quantizer across all experts — In transformer_engine.py, all expert weights are yielded with the same self.weight_quantizer, so _amax is computed across all experts jointly. This is likely intentional for NVFP4 (one scale for the whole layer), but worth confirming vs. per-expert calibration.

Minor Notes

The isinstance(module, QuantModule) check in model_calib.py changes behavior from the old code which iterated weight_attr_names(module) for any module. If there are non-QuantModule subclasses with quantizable weights, they'd be silently skipped. Fine if all quantizable modules are QuantModule subclasses.
core_utils.py — Adding weight is not None guard is the right root fix to prevent yielding "weight" when self.weight has been deleted.
Minor nit: no upper bound validation on self.num_gemms in the TE override — a defensive assert self.num_gemms > 0 in _setup would be nice but not blocking.

LGTM overall.

ChenhanYu · 2026-03-20T21:54:06Z

Need one of the owner approval from @kinjalpatel27 or @realAsma

Also, fail DCO. Take a look at this which is required by OSS contribution
https://github.com/NVIDIA/Model-Optimizer/pull/971/checks?check_run_id=67882147421

Signed-off-by: larkz <larkz@nvidia.com>

jQizhang · 2026-03-23T02:29:41Z

@kinjalpatel27 Hi, I've fixed the DCO issue. Could you please approve the workflows, Thank you!

jenchen13 · 2026-03-23T18:47:40Z

        # Remove self.weight after post_restore.
        delattr(self, "weight")

+    def iter_weights_for_calibration(self):


this is unnecessary because _ParallelLinear inherits from QuantModule, so the iter_weights_for_calibration defined in QuantModule will be inherited
@jQizhang

Hi @jenchen13 thanks for the review! The reason for overriding iter_weights_for_calibration is that the base implementation is not compatible with _QuantTEGroupedLinear.
_QuantTEGroupedLinear doesn't have a self.weight attribute. The actual weights are stored as weight0, weight1, ..... The base QuantModule.iter_weights_for_calibration relies on weight_attr_names(), which checks for self.weight. So without this override, weight calibration would be silently skipped for grouped linear layers.

jQizhang · 2026-03-25T10:21:30Z

Hi @kinjalpatel27 @jenchen13 @ChenhanYu , I noticed that some checks have been stuck for a while. Could you please help take a look? Thanks!

jenchen13 · 2026-03-25T15:02:59Z

I approved workflows to run, please don't make any more changes to this branch. thank you!

jQizhang · 2026-03-30T16:13:31Z

Hi @jenchen13 , It looks like example-pr-required-check and gpu-pr-required-check are still stuck in a pending state. Could you please take a look when you have a moment? Thanks!

jQizhang · 2026-04-01T16:04:12Z

Hi @jenchen13 , sorry to bother you again! I'm not very familiar with the ModelOpt PR workflow. I noticed the CI failed with a 403 Permission Denied error, which seems unrelated to my code changes. Could you please take a look when you have a moment? Thanks!

jenchen13 · 2026-04-02T15:31:52Z

/ok to test 75e94cf

jenchen13 · 2026-04-02T15:57:51Z

/ok to test 75e94cf

jenchen13 · 2026-04-03T15:26:48Z

/ok to test c10aa27

### What does this PR do? This PR fixes a critical issue where weight-only quantization fails for MoE models utilizing `TEGroupedMLP` (e.g., Qwen3-30B-A3B). #### The Problem: In `TEGroupedMLP`, weights are stored per-expert as `weight0`, `weight1`, ..., `weightN`. During `_QuantTEGroupedLinear._setup`, the standard `self.weight` attribute is deleted. The existing `weight_only_quantize` logic expects to find a `self.weight` associated with the quantizer. Because it couldn't find these "hidden" expert weights, the `weight_quantizer` failed to calibrate, resulting in a missing `_amax` attribute. This leads to the following crash during export/inference: <img width="2792" height="1034" alt="image" src="https://github.com/user-attachments/assets/9e2b1abd-80f4-4b8b-bb95-f8ee7a8f686a" /> ```python File ".../modelopt/torch/quantization/qtensor/nvfp4_tensor.py", line 59, in get_weights_scaling_factor_2_from_quantizer assert hasattr(weight_quantizer, "_amax"), "Weight quantizer does not have attribute amax" ``` #### The Solution: 1. **Calibration Interface:** Introduced `iter_weights_for_calibration` in the `QuantModule` base class. 2. **MoE Support:** Overrode this method in `_QuantTEGroupedLinear` to yield all per-expert weights (`weight0`...`weightN`) that share the same quantizer. This ensures the calibrator "sees" all expert weights and calculates a valid `_amax`. --- ### 2. Type of change * [x] Bug fix --- ### 3. Usage / Reproduction This issue is reproducible when running weight-only quantization on MoE models like Qwen3-30B-A3B: ```bash # Step 1: Quantization torchrun --nproc_per_node 8 examples/quantization/quantize.py \ --hf-model-id Qwen/Qwen3-30B-A3B \ --export-quant-cfg nvfp4 \ --tp 2 \ --ep 8 \ --weight-only \ --megatron-save-path ./qwen3_30b_nvfp4 ``` --- ### 4. Testing & Verification * **Models Tested:** Qwen3-8B (Dense), Qwen3-30B-A3B (MoE). * **Quantization:** NVFP4/FP8 weight-only quantization. * **Verification:** - Confirmed that `QuantTEGroupedMLP` now correctly shows calculated `_amax` values in the quantization statistics table instead of remaining `dynamic`. * Validated that the change does not regress dense model (Qwen3-8B) quantization flow. * After fix, the amax of experts can be calculated correctly. ``` Quantization Statistics ┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━┳━━━━━━━━━━━━┓ ┃ Parameter Name ┃ Shape ┃ Max Value ┃ ┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━╇━━━━━━━━━━━━┩ │ decoder.layers.0.self_attention.linear_proj.weight_quantizer._amax │ () │ 7.5781e-01 │ │ decoder.layers.0.self_attention.linear_qkv.weight_quantizer._amax │ () │ 2.8711e-01 │ │ decoder.layers.0.mlp.experts.linear_fc1.weight_quantizer._amax │ () │ 7.1094e-01 │ │ decoder.layers.0.mlp.experts.linear_fc2.weight_quantizer._amax │ () │ 8.6719e-01 │ │ decoder.layers.1.self_attention.linear_proj.weight_quantizer._amax │ () │ 5.8594e-01 │ │ decoder.layers.1.self_attention.linear_qkv.weight_quantizer._amax │ () │ 7.4219e-01 │ │ decoder.layers.1.mlp.experts.linear_fc1.weight_quantizer._amax │ () │ 7.2266e-01 │ │ decoder.layers.1.mlp.experts.linear_fc2.weight_quantizer._amax │ () │ 1.9922e+00 │ │ decoder.layers.2.self_attention.linear_proj.weight_quantizer._amax │ () │ 1.0859e+00 │ │ decoder.layers.2.self_attention.linear_qkv.weight_quantizer._amax │ () │ 1.7812e+00 │ │ decoder.layers.2.mlp.experts.linear_fc1.weight_quantizer._amax │ () │ 7.3047e-01 │ │ decoder.layers.2.mlp.experts.linear_fc2.weight_quantizer._amax │ () │ 1.9219e+00 │ ```  ## Summary by CodeRabbit * **New Features** * Enhanced weight-only quantization calibration with improved support for specialized quantization modules and grouped-linear quantization paths. * **Bug Fixes** * Fixed handling of missing weight attributes during quantization calibration to prevent incorrect processing.  --------- Signed-off-by: larkzhang-nv <larkz@nvidia.com> Signed-off-by: larkz <larkz@nvidia.com>

jQizhang requested review from a team as code owners March 4, 2026 15:47

jQizhang requested review from jingyu-ml and sychen52 March 4, 2026 15:47

coderabbitai bot reviewed Mar 4, 2026

View reviewed changes

Comment thread modelopt/torch/quantization/model_calib.py Outdated

jQizhang force-pushed the weight_only_te_fix branch from 286eef1 to 57099fb Compare March 10, 2026 15:19

Fix nvfp4 weight-only quantization for TEGroupedMLP (MoE models)

45ab8ca

Signed-off-by: larkzhang-nv <larkz@nvidia.com>

jQizhang force-pushed the weight_only_te_fix branch from 57099fb to 45ab8ca Compare March 12, 2026 02:35

coderabbitai bot reviewed Mar 12, 2026

View reviewed changes

Merge remote-tracking branch 'origin/main' into weight_only_te_fix

052e360

coderabbitai bot reviewed Mar 18, 2026

View reviewed changes

jQizhang changed the title ~~Fix weight-only quantization and export for TEGroupedMLP (MoE models)~~ Fix weight-only quantization for TEGroupedMLP (MoE models) Mar 18, 2026

yueshen2016 approved these changes Mar 18, 2026

View reviewed changes

jQizhang force-pushed the weight_only_te_fix branch from 81c7a9b to 052e360 Compare March 20, 2026 07:52

kevalmorabia97 requested review from jenchen13 and realAsma and removed request for jingyu-ml March 20, 2026 09:51

ChenhanYu removed request for a team and sychen52 March 20, 2026 21:50

ChenhanYu approved these changes Mar 20, 2026

View reviewed changes

kinjalpatel27 approved these changes Mar 20, 2026

View reviewed changes

minor fix

83b7319

Signed-off-by: larkz <larkz@nvidia.com>

jQizhang force-pushed the weight_only_te_fix branch from 21e2b6c to 83b7319 Compare March 21, 2026 06:02

jQizhang added 2 commits March 21, 2026 14:02

Merge branch 'main' into weight_only_te_fix

06576f2

Merge branch 'main' into weight_only_te_fix

c31fbb4

jenchen13 reviewed Mar 23, 2026

View reviewed changes

jenchen13 approved these changes Mar 24, 2026

View reviewed changes

jenchen13 enabled auto-merge (squash) March 24, 2026 04:03

Merge branch 'main' into weight_only_te_fix

75e94cf

Merge branch 'main' into weight_only_te_fix

c10aa27

jenchen13 merged commit 9aab38c into NVIDIA:main Apr 3, 2026
61 of 68 checks passed

Conversation

jQizhang commented Mar 4, 2026 • edited by coderabbitai bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

What does this PR do?

The Problem:

The Solution:

2. Type of change

3. Usage / Reproduction

4. Testing & Verification

Summary by CodeRabbit

Uh oh!

copy-pr-bot bot commented Mar 4, 2026

Uh oh!

coderabbitai bot commented Mar 4, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Reviews paused

Walkthrough

Changes

Estimated code review effort

❌ Failed checks (1 warning)

Uh oh!

coderabbitai bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

coderabbitai bot left a comment

Choose a reason for hiding this comment

Uh oh!

yueshen2016 commented Mar 17, 2026

Uh oh!

yueshen2016 commented Mar 17, 2026

Uh oh!

coderabbitai bot left a comment

Choose a reason for hiding this comment

Uh oh!

jQizhang commented Mar 18, 2026

Uh oh!

yueshen2016 left a comment

Choose a reason for hiding this comment

Uh oh!

codecov bot commented Mar 18, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Codecov Report

Uh oh!

ChenhanYu left a comment

Choose a reason for hiding this comment

Review Summary

Attention

Minor Notes

Uh oh!

ChenhanYu commented Mar 20, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

jQizhang commented Mar 23, 2026

Uh oh!

jenchen13 Mar 23, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

jQizhang Mar 24, 2026

Choose a reason for hiding this comment

Uh oh!

jQizhang commented Mar 25, 2026

Uh oh!

jenchen13 commented Mar 25, 2026

Uh oh!

jQizhang commented Mar 30, 2026

Uh oh!

jQizhang commented Apr 1, 2026

Uh oh!

jenchen13 commented Apr 2, 2026

Uh oh!

jenchen13 commented Apr 2, 2026

Uh oh!

jenchen13 commented Apr 3, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

jQizhang commented Mar 4, 2026 •

edited by coderabbitai bot

Loading

coderabbitai bot commented Mar 4, 2026 •

edited

Loading

codecov bot commented Mar 18, 2026 •

edited

Loading

ChenhanYu commented Mar 20, 2026 •

edited

Loading

jenchen13 Mar 23, 2026 •

edited

Loading