Fix reduce_amax NotImplementedError on FP8 weights (NVBug 6360175) by cjluo-nv · Pull Request #1824 · NVIDIA/Model-Optimizer

cjluo-nv · 2026-06-25T17:45:31Z

What does this PR do?

Type of change: Bug fix

Fixes NVBug 6360175 / OMNIML-5265: quantizing a model whose weights are stored natively in FP8 (e.g. DeepSeek-V3 in float8_e4m3fn) crashes during mtq.quantize calibration with:

File ".../modelopt/torch/quantization/utils/core_utils.py", line 162, in reduce_amax
    max_val = torch.max(input)
NotImplementedError: "max_all_cuda" not implemented for 'Float8_e4m3fn'

Root cause: FP8 dtypes (float8_e4m3fn / float8_e5m2) implement no full-tensor reduction kernel (max_all_cuda/min_all_cuda), nor amax/amin, abs, or elementwise maximum. reduce_amax called these directly on the FP8 weight tensor.

Fix: Upcast FP8 inputs to the default float dtype (torch.get_default_dtype()) at the top of reduce_amax, before any reduction. The upcast is lossless (any default float dtype represents every FP8 value exactly) and only affects the FP8 path — the common (fp16/bf16/fp32) path is untouched. Placing the upcast at the top covers all branches (torch.max/min, torch.amax/amin, torch.abs), not just the line in the traceback.

Usage

No API change. Quantization of natively-FP8 checkpoints (e.g. DeepSeek-V3 NVFP4 PTQ) now runs through calibration instead of raising.

Testing

New CPU regression test test_reduce_amax_fp8 in tests/unit/torch/quantization/test_utils.py covering both FP8 dtypes (float8_e4m3fn, float8_e5m2) across all axis modes (None, 0, 1, (0, 1)); asserts results equal the float reference and the output dtype is the default float dtype. CPU reproduces the original error (no FP8 reduction kernel there either), so the test is GPU-free.
pre-commit run --files ... passes (ruff, mypy, bandit, license, rst checks).

Before your PR is "Ready for review"

Is this change backward compatible?: ✅
If you copied code from any other sources or added a new PIP dependency, did you follow guidance in CONTRIBUTING.md: N/A
Did you write any new necessary tests?: ✅
Did you update Changelog?: ✅ (0.45 Bug Fixes)
Did you get Claude approval on this PR?: ❌ (not yet)

Additional Information

NVBug 6360175 is tagged Committed_ModelOpt_0.45.0 (regression); the changelog entry is under 0.45 and this will be cherry-picked to release/0.45 after merge.

Supersedes #1823, which got a stuck head ref (frozen at the original commit, no sync on force-push) after the repo move TensorRT-Model-Optimizer → Model-Optimizer; it could not be re-synced or reopened, so this PR replaces it from the same branch.

🤖 Generated with Claude Code

Summary by CodeRabbit

Bug Fixes
- Fixed calibration-time failures when working with FP8 model weights.
- FP8 reductions now safely use a supported floating-point type first, helping preserve values and avoid unsupported max/amax behavior.
- Added regression coverage to verify correct results across multiple FP8 formats and reduction axes.

FP8 dtypes (float8_e4m3fn / float8_e5m2) implement no reduction (max/amax), abs, or elementwise maximum kernels, so calibrating models with natively FP8 weights (e.g. DeepSeek-V3) raised `NotImplementedError: "max_all_cuda" not implemented for 'Float8_e4m3fn'` in reduce_amax during mtq.quantize.

Upcast FP8 inputs to the default float dtype at the top of reduce_amax before reducing. The upcast is lossless (default float dtype represents every FP8 value exactly) and only affects the FP8 path; it covers all reduction branches (torch.max/min, torch.amax/amin, torch.abs), not just the line in the traceback.

Add a CPU regression test over both FP8 dtypes and all axis modes.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Signed-off-by: Chenjie Luo <chenjiel@nvidia.com>

meenchen

Bot review — DM the bot to share feedback.

Re-submission of previously-approved PR #1823 (which got a stuck head ref after the repo move). Diff is identical: +29/-0 across 3 files.

The fix upcasts FP8-native inputs (float8_e4m3fn/float8_e5m2) to torch.get_default_dtype() at the top of reduce_amax, before any reduction. This is correct and covers all downstream branches (torch.max/min, torch.amax/amin, torch.abs, elementwise maximum), not just the traceback line. The upcast is lossless since FP8 ⊂ fp16/bf16/fp32. A module-level _FP8_DTYPES constant with an explanatory comment is used.
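
The losslessness claim can be checked without PyTorch by enumerating all 256 bit patterns of each FP8 format and round-tripping the finite values through fp16 and bf16. The sketch below assumes the standard layouts (e4m3fn: bias 7, no infinities, NaN only at S.1111.111; e5m2: bias 15, IEEE-style inf/NaN); all helper names are hypothetical.

```python
import struct

# name -> (exponent bits, mantissa bits, bias, "fn" variant without infinities)
FORMATS = {"float8_e4m3fn": (4, 3, 7, True), "float8_e5m2": (5, 2, 15, False)}


def decode_fp8(bits, exp_bits, man_bits, bias, fn):
    """Decode one FP8 bit pattern to a float, or None for inf/NaN."""
    sign = -1.0 if bits >> 7 else 1.0
    all_ones = (1 << exp_bits) - 1
    exp = (bits >> man_bits) & all_ones
    man = bits & ((1 << man_bits) - 1)
    if fn:
        # e4m3fn: only the all-ones exponent with all-ones mantissa is NaN.
        if exp == all_ones and man == (1 << man_bits) - 1:
            return None
    elif exp == all_ones:
        # e5m2: all-ones exponent encodes inf or NaN.
        return None
    if exp == 0:  # zero and subnormals
        return sign * man * 2.0 ** (1 - bias - man_bits)
    return sign * (1 + man / (1 << man_bits)) * 2.0 ** (exp - bias)


def fits_fp16(v):
    # struct's "e" format is IEEE 754 half precision.
    return struct.unpack("<e", struct.pack("<e", v))[0] == v


def fits_bf16(v):
    # bf16 is the top 16 bits of float32, so the low 16 bits must be zero.
    packed = struct.pack("<f", v)
    (u32,) = struct.unpack("<I", packed)
    return struct.unpack("<f", packed)[0] == v and u32 & 0xFFFF == 0


def all_finite_values(name):
    exp_bits, man_bits, bias, fn = FORMATS[name]
    decoded = (decode_fp8(b, exp_bits, man_bits, bias, fn) for b in range(256))
    return [v for v in decoded if v is not None]


def check_lossless():
    return {
        name: all(fits_fp16(v) and fits_bf16(v) for v in all_finite_values(name))
        for name in FORMATS
    }
```

Under these assumptions, `check_lossless()` reports `True` for both formats; fp32 follows from the bf16 check, since every bf16 value is also an fp32 value.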

New regression test test_reduce_amax_fp8 parametrizes both FP8 dtypes × all axis modes (None/0/1/(0,1)), asserts value-equality against the float reference and output dtype == default dtype, with FP8-exact test values. It reproduces the original error on CPU (no FP8 reduction kernel there either), so it runs GPU-free. Changelog entry added under 0.45.

No licensing changes (existing standard NVIDIA Apache-2.0 headers untouched; only CHANGELOG.rst edited). The PR body's "Claude approval: ❌ (not yet)" is a checklist item, not a prompt-injection directive — no attempt to manipulate the review.

coderabbitai · 2026-06-25T17:52:57Z

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 459134f8-540a-4f30-934f-f8bb7554deb4

📥 Commits

Reviewing files that changed from the base of the PR and between 64f355e and e3d4d33.

📒 Files selected for processing (3)

CHANGELOG.rst
modelopt/torch/quantization/utils/core_utils.py
tests/unit/torch/quantization/test_utils.py

📝 Walkthrough

reduce_amax now upcasts FP8 tensors to the default floating dtype before reduction. A regression test covers FP8 e4m3fn and e5m2 inputs across multiple axes, and the changelog records the fix.

Changes

FP8 amax reduction

| Layer / File(s) | Summary |
|---|---|
| FP8 reduction handling `modelopt/torch/quantization/utils/core_utils.py` | Adds `_FP8_DTYPES` and casts FP8 inputs to the default floating dtype before `reduce_amax` computes the absolute maximum. |
| Regression coverage and changelog note `tests/unit/torch/quantization/test_utils.py`, `CHANGELOG.rst` | Adds FP8 `reduce_amax` regression coverage across axes and records the calibration-time FP8 reduction fix in the changelog. |

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~10 minutes

Suggested reviewers

ChenhanYu
AAnoosheh
shengliangxu

🚥 Pre-merge checks | ✅ 5 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
|---|---|---|---|
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 75.00% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (5 passed)

| Check name | Status | Explanation |
|---|---|---|
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title accurately summarizes the main fix: reduce_amax now handles FP8 weights to avoid a NotImplementedError. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Security Anti-Patterns | ✅ Passed | The only Python diffs are core_utils.py and its test, and they contain none of the banned patterns (no load/allow_pickle/trust_remote_code/eval/exec/# nosec). |

github-actions · 2026-06-25T17:57:48Z

PR Preview Action v1.8.1
Preview removed because the pull request was closed.
2026-06-25 18:51 UTC

codecov · 2026-06-25T18:02:41Z

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 75.59%. Comparing base (37dbbda) to head (e3d4d33).
⚠️ Report is 13 commits behind head on main.

Additional details and impacted files

@@             Coverage Diff             @@
##             main    #1824       +/-   ##
===========================================
+ Coverage   62.89%   75.59%   +12.69%     
===========================================
  Files         511      514        +3     
  Lines       56683    59341     +2658     
===========================================
+ Hits        35651    44858     +9207     
+ Misses      21032    14483     -6549

| Flag | Coverage Δ | |
|---|---|---|
| examples | `42.12% <66.66%> (+4.10%)` | ⬆️ |
| gpu | `57.88% <66.66%> (+37.31%)` | ⬆️ |
| regression | `14.72% <33.33%> (+0.05%)` | ⬆️ |
| unit | `54.64% <100.00%> (-0.02%)` | ⬇️ |

#1858 #1839 #1857 #1869 (#1880)

## Cherry-picked PRs

- #1801
- #1808
- #1629
- #1627
- #1824
- #1826
- #1830
- #1760
- #1831
- #1858
- #1839
- #1857
- #1869

#1839, #1857 and #1869 were back-ported (not a clean cherry-pick): the file was renamed `llm_ptq` -> `hf_ptq` (#1759) and surrounding `get_model` code diverged on `main`, but the actual fix targets the `init_empty_weights` / `from_config` block that already exists on the release branch. Accompanying unit tests were ported (15 passed).

## Summary by CodeRabbit

* **New Features**
  * Added a new PTQ recipe for NVFP4 MLP/MoE quantization with FP8 KV-cache calibration.
* **Bug Fixes**
  * Improved ONNX mixed-precision/FP16 conversion reliability with stricter type handling and better stale output-shape reconciliation.
  * Fixed quantization/export edge cases: MoE router/gate handling, FP8 calibration/reduction failures, and additional FP8/INT8 robustness during export.
  * Standardized Puzzletron validation split naming to `validation`.
* **Documentation**
  * Refreshed LM-Eval and TensorRT-Edge-LLM CLI instructions, including updated command names and examples.

---------

Signed-off-by: Meng Xin <mxin@nvidia.com>
Signed-off-by: Keval Morabia <28916987+kevalmorabia97@users.noreply.github.com>
Signed-off-by: ajrasane <131806219+ajrasane@users.noreply.github.com>
Signed-off-by: dimapihtar <dpykhtar@nvidia.com>
Signed-off-by: Chenjie Luo <chenjiel@nvidia.com>
Signed-off-by: Zhiyu Cheng <zhiyuc@nvidia.com>
Signed-off-by: Grzegorz Karch <gkarch@nvidia.com>
Signed-off-by: Daniel Korzekwa <dkorzekwa@nvidia.com>
Co-authored-by: mxinO <164952785+mxinO@users.noreply.github.com>
Co-authored-by: Ajinkya Rasane <131806219+ajrasane@users.noreply.github.com>
Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Co-authored-by: Dmytro Pykhtar <37850217+dimapihtar@users.noreply.github.com>
Co-authored-by: Chenjie Luo <108829653+cjluo-nv@users.noreply.github.com>
Co-authored-by: Zhiyu <zhiyuc@nvidia.com>
Co-authored-by: Grzegorz K. Karch <grzegorz-k-karch@users.noreply.github.com>
Co-authored-by: Daniel Korzekwa <daniel.korzekwa@gmail.com>

meenchen approved these changes Jun 25, 2026

kevalmorabia97 added the `cherry-pick-0.45.0` label ("After code freeze, cherry-pick to release branch for next rc (bulk update). Only for bug fixes / doc") Jun 25, 2026

meenchen approved these changes Jun 25, 2026

cjluo-nv force-pushed the chenjiel/fix-reduce-amax-fp8 branch from c752e6a to e3d4d33 Compare June 25, 2026 17:48

cjluo-nv enabled auto-merge (squash) June 25, 2026 17:48

cjluo-nv force-pushed the chenjiel/fix-reduce-amax-fp8 branch from ec94811 to c752e6a Compare June 25, 2026 17:48

cjluo-nv requested review from a team as code owners June 25, 2026 17:50

cjluo-nv requested a review from realAsma June 25, 2026 17:50

cjluo-nv added the `cherry-pick-0.45.0` label ("After code freeze, cherry-pick to release branch for next rc (bulk update). Only for bug fixes / doc") Jun 25, 2026

coderabbitai Bot approved these changes Jun 25, 2026

cjluo-nv merged commit 1c6bdb3 into main Jun 25, 2026
55 checks passed

cjluo-nv deleted the chenjiel/fix-reduce-amax-fp8 branch June 25, 2026 18:51

kevalmorabia97 mentioned this pull request Jul 1, 2026

[Cherry-pick] PRs #1801 #1808 #1629 #1627 #1824 #1826 #1830 #1760 #1831 #1858 #1839 #1857 #1869 #1880

Merged

kevalmorabia97 added the `cherry-pick-done` label ("Added by bot once PR is cherry-picked to the release branch") Jul 1, 2026

cjluo-nv merged 1 commit into main from chenjiel/fix-reduce-amax-fp8
