Fix reduce_amax NotImplementedError on FP8 weights (NVBug 6360175)#1824

Merged: cjluo-nv merged 1 commit into main from chenjiel/fix-reduce-amax-fp8 on Jun 25, 2026

@cjluo-nv cjluo-nv commented Jun 25, 2026

Collaborator

What does this PR do?

Type of change: Bug fix

Fixes NVBug 6360175 / OMNIML-5265: quantizing a model whose weights are stored natively in FP8 (e.g. DeepSeek-V3 in float8_e4m3fn) crashes during mtq.quantize calibration with:

File ".../modelopt/torch/quantization/utils/core_utils.py", line 162, in reduce_amax
    max_val = torch.max(input)
NotImplementedError: "max_all_cuda" not implemented for 'Float8_e4m3fn'

Root cause: FP8 dtypes (float8_e4m3fn / float8_e5m2) implement no full-tensor reduction kernel (max_all_cuda/min_all_cuda), nor amax/amin, abs, or elementwise maximum. reduce_amax called these directly on the FP8 weight tensor.

Fix: Upcast FP8 inputs to the default float dtype (torch.get_default_dtype()) at the top of reduce_amax, before any reduction. The upcast is lossless (any default float dtype represents every FP8 value exactly) and only affects the FP8 path — the common (fp16/bf16/fp32) path is untouched. Placing the upcast at the top covers all branches (torch.max/min, torch.amax/amin, torch.abs), not just the line in the traceback.

Usage

No API change. Quantization of natively-FP8 checkpoints (e.g. DeepSeek-V3 NVFP4 PTQ) now runs through calibration instead of raising.

Testing

  • New CPU regression test test_reduce_amax_fp8 in tests/unit/torch/quantization/test_utils.py covering both FP8 dtypes (float8_e4m3fn, float8_e5m2) across all axis modes (None, 0, 1, (0, 1)); asserts results equal the float reference and the output dtype is the default float dtype. CPU reproduces the original error (no FP8 reduction kernel there either), so the test is GPU-free.
  • pre-commit run --files ... passes (ruff, mypy, bandit, license, rst checks).

Before your PR is "Ready for review"

  • Is this change backward compatible?: ✅
  • If you copied code from any other sources or added a new PIP dependency, did you follow guidance in CONTRIBUTING.md: N/A
  • Did you write any new necessary tests?: ✅
  • Did you update Changelog?: ✅ (0.45 Bug Fixes)
  • Did you get Claude approval on this PR?: ❌ (not yet)

Additional Information

NVBug 6360175 is tagged Committed_ModelOpt_0.45.0 (regression); the changelog entry is under 0.45 and this will be cherry-picked to release/0.45 after merge.

Supersedes #1823, which got a stuck head ref (frozen at the original commit, no sync on force-push) after the repo move from TensorRT-Model-Optimizer to Model-Optimizer; it could not be re-synced or reopened, so this PR replaces it from the same branch.

🤖 Generated with Claude Code

Summary by CodeRabbit

  • Bug Fixes
    • Fixed calibration-time failures when working with FP8 model weights.
    • FP8 reductions now safely use a supported floating-point type first, helping preserve values and avoid unsupported max/amax behavior.
    • Added regression coverage to verify correct results across multiple FP8 formats and reduction axes.

FP8 dtypes (float8_e4m3fn / float8_e5m2) implement no reduction
(max/amax), abs, or elementwise maximum kernels, so calibrating models
with natively FP8 weights (e.g. DeepSeek-V3) raised
`NotImplementedError: "max_all_cuda" not implemented for 'Float8_e4m3fn'`
in reduce_amax during mtq.quantize.

Upcast FP8 inputs to the default float dtype at the top of reduce_amax
before reducing. The upcast is lossless (default float dtype represents
every FP8 value exactly) and only affects the FP8 path; it covers all
reduction branches (torch.max/min, torch.amax/amin, torch.abs), not just
the line in the traceback.

Add a CPU regression test over both FP8 dtypes and all axis modes.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Signed-off-by: Chenjie Luo <chenjiel@nvidia.com>

@meenchen meenchen left a comment

Contributor

Bot review — DM the bot to share feedback.

Re-submission of previously-approved PR #1823 (which got a stuck head ref after the repo move). Diff is identical: +29/-0 across 3 files.

The fix upcasts FP8-native inputs (float8_e4m3fn/float8_e5m2) to torch.get_default_dtype() at the top of reduce_amax, before any reduction. This is correct and covers all downstream branches (torch.max/min, torch.amax/amin, torch.abs, elementwise maximum), not just the traceback line. The upcast is lossless since FP8 ⊂ fp16/bf16/fp32. A module-level _FP8_DTYPES constant with an explanatory comment is used.
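
The losslessness claim can be checked without PyTorch by enumerating all 256 bit patterns of each format. The sketch below assumes the usual layouts (e4m3fn: 4 exponent bits, 3 mantissa bits, bias 7, no infinities, one NaN encoding per sign; e5m2: 5 exponent bits, 2 mantissa bits, bias 15, IEEE-style inf/NaN); `decode_fp8` and the two `exact_in_*` helpers are illustrative names, not part of any library:

```python
import struct


def decode_fp8(bits, exp_bits, man_bits, bias, has_inf):
    """Decode one FP8 bit pattern to a Python float, or None for inf/NaN."""
    sign = -1.0 if bits >> 7 else 1.0
    exp = (bits >> man_bits) & ((1 << exp_bits) - 1)
    man = bits & ((1 << man_bits) - 1)
    max_exp = (1 << exp_bits) - 1
    if has_inf and exp == max_exp:
        return None  # e5m2: inf or NaN
    if not has_inf and exp == max_exp and man == (1 << man_bits) - 1:
        return None  # e4m3fn: NaN
    if exp == 0:  # zero and subnormals
        return sign * man * 2.0 ** (1 - bias - man_bits)
    return sign * (1 + man / (1 << man_bits)) * 2.0 ** (exp - bias)


def exact_in_fp16(v):
    # Round-trip through IEEE half precision via the struct "e" format.
    return struct.unpack("<e", struct.pack("<e", v))[0] == v


def exact_in_bf16(v):
    # bfloat16 keeps the top 16 bits of float32, so the value is exact in
    # bf16 if it is exact in float32 and the low 16 bits are all zero.
    raw = struct.pack("<f", v)  # little-endian: low-order bytes first
    return struct.unpack("<f", raw)[0] == v and (raw[0] | raw[1]) == 0


FORMATS = {
    "float8_e4m3fn": dict(exp_bits=4, man_bits=3, bias=7, has_inf=False),
    "float8_e5m2": dict(exp_bits=5, man_bits=2, bias=15, has_inf=True),
}

report = {}
for name, fmt in FORMATS.items():
    decoded = (decode_fp8(b, **fmt) for b in range(256))
    finite = [v for v in decoded if v is not None]
    report[name] = {
        "finite_values": len(finite),
        "max": max(finite),
        "exact_in_fp16": all(exact_in_fp16(v) for v in finite),
        "exact_in_bf16": all(exact_in_bf16(v) for v in finite),
    }
```

Exactness in fp32 follows from exactness in bf16, since every bf16 value is also an fp32 value.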

New regression test test_reduce_amax_fp8 parametrizes both FP8 dtypes × all axis modes (None/0/1/(0,1)), asserts value-equality against the float reference and output dtype == default dtype, with FP8-exact test values. It reproduces the original error on CPU (no FP8 reduction kernel there either), so it runs GPU-free. Changelog entry added under 0.45.

No licensing changes (existing standard NVIDIA Apache-2.0 headers untouched; only CHANGELOG.rst edited). The PR body's "Claude approval: ❌ (not yet)" is a checklist item, not a prompt-injection directive — no attempt to manipulate the review.

@kevalmorabia97 kevalmorabia97 added the cherry-pick-0.45.0 label (After code freeze, cherry-pick to release branch for next rc (bulk update). Only for bug fixes / doc) Jun 25, 2026
@cjluo-nv cjluo-nv force-pushed the chenjiel/fix-reduce-amax-fp8 branch from c752e6a to e3d4d33 Compare June 25, 2026 17:48
@cjluo-nv cjluo-nv enabled auto-merge (squash) June 25, 2026 17:48
@cjluo-nv cjluo-nv force-pushed the chenjiel/fix-reduce-amax-fp8 branch from ec94811 to c752e6a Compare June 25, 2026 17:48
@cjluo-nv cjluo-nv requested review from a team as code owners June 25, 2026 17:50
@cjluo-nv cjluo-nv requested a review from realAsma June 25, 2026 17:50
@cjluo-nv cjluo-nv added the cherry-pick-0.45.0 label (After code freeze, cherry-pick to release branch for next rc (bulk update). Only for bug fixes / doc) Jun 25, 2026

coderabbitai Bot commented Jun 25, 2026

Contributor

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 459134f8-540a-4f30-934f-f8bb7554deb4

📥 Commits

Reviewing files that changed from the base of the PR and between 64f355e and e3d4d33.

📒 Files selected for processing (3)
  • CHANGELOG.rst
  • modelopt/torch/quantization/utils/core_utils.py
  • tests/unit/torch/quantization/test_utils.py

📝 Walkthrough

Walkthrough

reduce_amax now upcasts FP8 tensors to the default floating dtype before reduction. A regression test covers FP8 e4m3fn and e5m2 inputs across multiple axes, and the changelog records the fix.

Changes

FP8 amax reduction

| Layer / File(s) | Summary |
|---|---|
| FP8 reduction handling: modelopt/torch/quantization/utils/core_utils.py | Adds _FP8_DTYPES and casts FP8 inputs to the default floating dtype before reduce_amax computes the absolute maximum. |
| Regression coverage and changelog note: tests/unit/torch/quantization/test_utils.py, CHANGELOG.rst | Adds FP8 reduce_amax regression coverage across axes and records the calibration-time FP8 reduction fix in the changelog. |

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~10 minutes

Suggested reviewers

  • ChenhanYu
  • AAnoosheh
  • shengliangxu
🚥 Pre-merge checks | ✅ 5 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
|---|---|---|---|
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 75.00% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (5 passed)

| Check name | Status | Explanation |
|---|---|---|
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title accurately summarizes the main fix: reduce_amax now handles FP8 weights to avoid a NotImplementedError. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Security Anti-Patterns | ✅ Passed | The only Python diffs are core_utils.py and its test, and they contain none of the banned patterns (no load/allow_pickle/trust_remote_code/eval/exec/# nosec). |

github-actions Bot commented Jun 25, 2026

Contributor
PR Preview Action v1.8.1
Preview removed because the pull request was closed.
2026-06-25 18:51 UTC

codecov Bot commented Jun 25, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 75.59%. Comparing base (37dbbda) to head (e3d4d33).
⚠️ Report is 13 commits behind head on main.

Additional details and impacted files
@@             Coverage Diff             @@
##             main    #1824       +/-   ##
===========================================
+ Coverage   62.89%   75.59%   +12.69%     
===========================================
  Files         511      514        +3     
  Lines       56683    59341     +2658     
===========================================
+ Hits        35651    44858     +9207     
+ Misses      21032    14483     -6549     
Flag Coverage Δ
examples 42.12% <66.66%> (+4.10%) ⬆️
gpu 57.88% <66.66%> (+37.31%) ⬆️
regression 14.72% <33.33%> (+0.05%) ⬆️
unit 54.64% <100.00%> (-0.02%) ⬇️

@cjluo-nv cjluo-nv merged commit 1c6bdb3 into main Jun 25, 2026
55 checks passed
@cjluo-nv cjluo-nv deleted the chenjiel/fix-reduce-amax-fp8 branch June 25, 2026 18:51
@kevalmorabia97 kevalmorabia97 added the cherry-pick-done label (Added by bot once PR is cherry-picked to the release branch) Jul 1, 2026
kevalmorabia97 added a commit that referenced this pull request Jul 2, 2026
#1858 #1839 #1857 #1869 (#1880)

## Cherry-picked PRs

- #1801
- #1808
- #1629
- #1627
- #1824
- #1826
- #1830
- #1760
- #1831
- #1858
- #1839
- #1857
- #1869

#1839, #1857 and #1869 were back-ported (not a clean cherry-pick): the file was
renamed `llm_ptq` -> `hf_ptq` (#1759) and surrounding `get_model` code diverged
on `main`, but the actual fix targets the `init_empty_weights` / `from_config`
block that already exists on the release branch. Accompanying unit tests were
ported (15 passed).

## Summary by CodeRabbit

* **New Features**
  * Added a new PTQ recipe for NVFP4 MLP/MoE quantization with FP8 KV-cache calibration.
* **Bug Fixes**
  * Improved ONNX mixed-precision/FP16 conversion reliability with stricter type handling and better stale output-shape reconciliation.
  * Fixed quantization/export edge cases: MoE router/gate handling, FP8 calibration/reduction failures, and additional FP8/INT8 robustness during export.
  * Standardized Puzzletron validation split naming to `validation`.
* **Documentation**
  * Refreshed LM-Eval and TensorRT-Edge-LLM CLI instructions, including updated command names and examples.

---------

Signed-off-by: Meng Xin <mxin@nvidia.com>
Signed-off-by: Keval Morabia <28916987+kevalmorabia97@users.noreply.github.com>
Signed-off-by: ajrasane <131806219+ajrasane@users.noreply.github.com>
Signed-off-by: dimapihtar <dpykhtar@nvidia.com>
Signed-off-by: Chenjie Luo <chenjiel@nvidia.com>
Signed-off-by: Zhiyu Cheng <zhiyuc@nvidia.com>
Signed-off-by: Grzegorz Karch <gkarch@nvidia.com>
Signed-off-by: Daniel Korzekwa <dkorzekwa@nvidia.com>
Co-authored-by: mxinO <164952785+mxinO@users.noreply.github.com>
Co-authored-by: Ajinkya Rasane <131806219+ajrasane@users.noreply.github.com>
Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Co-authored-by: Dmytro Pykhtar <37850217+dimapihtar@users.noreply.github.com>
Co-authored-by: Chenjie Luo <108829653+cjluo-nv@users.noreply.github.com>
Co-authored-by: Zhiyu <zhiyuc@nvidia.com>
Co-authored-by: Grzegorz K. Karch <grzegorz-k-karch@users.noreply.github.com>
Co-authored-by: Daniel Korzekwa <daniel.korzekwa@gmail.com>

Labels

  • cherry-pick-0.45.0: After code freeze, cherry-pick to release branch for next rc (bulk update). Only for bug fixes / doc
  • cherry-pick-done: Added by bot once PR is cherry-picked to the release branch
