Fix reduce_amax NotImplementedError on FP8 weights (NVBug 6360175)#1824
Conversation
FP8 dtypes (float8_e4m3fn / float8_e5m2) implement no reduction (max/amax), abs, or elementwise maximum kernels, so calibrating models with natively FP8 weights (e.g. DeepSeek-V3) raised `NotImplementedError: "max_all_cuda" not implemented for 'Float8_e4m3fn'` in reduce_amax during mtq.quantize. Upcast FP8 inputs to the default float dtype at the top of reduce_amax before reducing. The upcast is lossless (default float dtype represents every FP8 value exactly) and only affects the FP8 path; it covers all reduction branches (torch.max/min, torch.amax/amin, torch.abs), not just the line in the traceback. Add a CPU regression test over both FP8 dtypes and all axis modes. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> Signed-off-by: Chenjie Luo <chenjiel@nvidia.com>
meenchen
left a comment
There was a problem hiding this comment.
Bot review — DM the bot to share feedback.
Re-submission of previously-approved PR #1823 (which got a stuck head ref after the repo move). Diff is identical: +29/-0 across 3 files.
The fix upcasts FP8-native inputs (float8_e4m3fn/float8_e5m2) to torch.get_default_dtype() at the top of reduce_amax, before any reduction. This is correct and covers all downstream branches (torch.max/min, torch.amax/amin, torch.abs, elementwise maximum), not just the traceback line. The upcast is lossless since FP8 ⊂ fp16/bf16/fp32. A module-level _FP8_DTYPES constant with an explanatory comment is used.
New regression test test_reduce_amax_fp8 parametrizes both FP8 dtypes × all axis modes (None/0/1/(0,1)), asserts value-equality against the float reference and output dtype == default dtype, with FP8-exact test values. It reproduces the original error on CPU (no FP8 reduction kernel there either), so it runs GPU-free. Changelog entry added under 0.45.
No licensing changes (existing standard NVIDIA Apache-2.0 headers untouched; only CHANGELOG.rst edited). The PR body's "Claude approval: ❌ (not yet)" is a checklist item, not a prompt-injection directive — no attempt to manipulate the review.
c752e6a to
e3d4d33
Compare
ec94811 to
c752e6a
Compare
|
No actionable comments were generated in the recent review. 🎉 ℹ️ Recent review info⚙️ Run configurationConfiguration used: Path: .coderabbit.yaml Review profile: CHILL Plan: Enterprise Run ID: 📒 Files selected for processing (3)
📝 WalkthroughWalkthrough
ChangesFP8 amax reduction
Estimated code review effort🎯 2 (Simple) | ⏱️ ~10 minutes Suggested reviewers
🚥 Pre-merge checks | ✅ 5 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (5 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches📝 Generate docstrings
🧪 Generate unit tests (beta)
Comment |
|
Codecov Report✅ All modified and coverable lines are covered by tests. Additional details and impacted files@@ Coverage Diff @@
## main #1824 +/- ##
===========================================
+ Coverage 62.89% 75.59% +12.69%
===========================================
Files 511 514 +3
Lines 56683 59341 +2658
===========================================
+ Hits 35651 44858 +9207
+ Misses 21032 14483 -6549
Flags with carried forward coverage won't be shown. Click here to find out more. ☔ View full report in Codecov by Harness. 🚀 New features to boost your workflow:
|
#1858 #1839 #1857 #1869 (#1880) ## Cherry-picked PRs - #1801 - #1808 - #1629 - #1627 - #1824 - #1826 - #1830 - #1760 - #1831 - #1858 - #1839 - #1857 - #1869 #1839, #1857 and #1869 were back-ported (not a clean cherry-pick): the file was renamed `llm_ptq` -> `hf_ptq` (#1759) and surrounding `get_model` code diverged on `main`, but the actual fix targets the `init_empty_weights` / `from_config` block that already exists on the release branch. Accompanying unit tests were ported (15 passed). <!-- This is an auto-generated comment: release notes by coderabbit.ai --> ## Summary by CodeRabbit * **New Features** * Added a new PTQ recipe for NVFP4 MLP/MoE quantization with FP8 KV-cache calibration. * **Bug Fixes** * Improved ONNX mixed-precision/FP16 conversion reliability with stricter type handling and better stale output-shape reconciliation. * Fixed quantization/export edge cases: MoE router/gate handling, FP8 calibration/reduction failures, and additional FP8/INT8 robustness during export. * Standardized Puzzletron validation split naming to `validation`. * **Documentation** * Refreshed LM-Eval and TensorRT-Edge-LLM CLI instructions, including updated command names and examples. <!-- end of auto-generated comment: release notes by coderabbit.ai --> --------- Signed-off-by: Meng Xin <mxin@nvidia.com> Signed-off-by: Keval Morabia <28916987+kevalmorabia97@users.noreply.github.com> Signed-off-by: ajrasane <131806219+ajrasane@users.noreply.github.com> Signed-off-by: dimapihtar <dpykhtar@nvidia.com> Signed-off-by: Chenjie Luo <chenjiel@nvidia.com> Signed-off-by: Zhiyu Cheng <zhiyuc@nvidia.com> Signed-off-by: Grzegorz Karch <gkarch@nvidia.com> Signed-off-by: Daniel Korzekwa <dkorzekwa@nvidia.com> Co-authored-by: mxinO <164952785+mxinO@users.noreply.github.com> Co-authored-by: Ajinkya Rasane <131806219+ajrasane@users.noreply.github.com> Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Co-authored-by: Dmytro Pykhtar <37850217+dimapihtar@users.noreply.github.com> Co-authored-by: Chenjie Luo <108829653+cjluo-nv@users.noreply.github.com> Co-authored-by: Zhiyu <zhiyuc@nvidia.com> Co-authored-by: Grzegorz K. Karch <grzegorz-k-karch@users.noreply.github.com> Co-authored-by: Daniel Korzekwa <daniel.korzekwa@gmail.com>
What does this PR do?
Type of change: Bug fix
Fixes NVBug 6360175 / OMNIML-5265: quantizing a model whose weights are stored natively in FP8 (e.g. DeepSeek-V3 in
float8_e4m3fn) crashes duringmtq.quantizecalibration with:Root cause: FP8 dtypes (
float8_e4m3fn/float8_e5m2) implement no full-tensor reduction kernel (max_all_cuda/min_all_cuda), noramax/amin,abs, or elementwisemaximum.reduce_amaxcalled these directly on the FP8 weight tensor.Fix: Upcast FP8 inputs to the default float dtype (
torch.get_default_dtype()) at the top ofreduce_amax, before any reduction. The upcast is lossless (any default float dtype represents every FP8 value exactly) and only affects the FP8 path — the common (fp16/bf16/fp32) path is untouched. Placing the upcast at the top covers all branches (torch.max/min,torch.amax/amin,torch.abs), not just the line in the traceback.Usage
No API change. Quantization of natively-FP8 checkpoints (e.g. DeepSeek-V3 NVFP4 PTQ) now runs through calibration instead of raising.
Testing
test_reduce_amax_fp8intests/unit/torch/quantization/test_utils.pycovering both FP8 dtypes (float8_e4m3fn,float8_e5m2) across all axis modes (None,0,1,(0, 1)); asserts results equal the float reference and the output dtype is the default float dtype. CPU reproduces the original error (no FP8 reduction kernel there either), so the test is GPU-free.pre-commit run --files ...passes (ruff, mypy, bandit, license, rst checks).Before your PR is "Ready for review"
CONTRIBUTING.md: N/AAdditional Information
NVBug 6360175 is tagged
Committed_ModelOpt_0.45.0(regression); the changelog entry is under 0.45 and this will be cherry-picked torelease/0.45after merge.Supersedes #1823, which got a stuck head ref (frozen at the original commit, no sync on force-push) after the repo move
TensorRT-Model-Optimizer→Model-Optimizer; it could not be re-synced or reopened, so this PR replaces it from the same branch.🤖 Generated with Claude Code
Summary by CodeRabbit