Conversation
Important: Review skipped — auto incremental reviews are disabled on this repository. Please check the settings in the CodeRabbit UI.
📝 Walkthrough

This PR introduces NVFP4 (NVIDIA FP4) static quantization support with MSE calibration. It refactors MseCalibrator into a generalized candidate-based framework, adds NVFP4StaticQuantizer for two-level scaling quantization, updates Triton FP4 kernel implementations with dequantization and Hopper-optimized paths, and integrates these components into the model calibration pipeline.
Sequence Diagram

```mermaid
sequenceDiagram
    participant Model
    participant Calibrator as NVFP4MSECalibrator
    participant Quantizer as NVFP4StaticQuantizer
    participant TritonKernel as Triton FP4
    Model->>Calibrator: collect(activations)
    activate Calibrator
    Calibrator->>Calibrator: _generate_candidates()
    Note over Calibrator: Generate FP8-based candidates
    loop For each candidate
        Calibrator->>TritonKernel: static_blockwise_fp4_fake_quant(x, amax, global_amax)
        activate TritonKernel
        TritonKernel->>TritonKernel: Two-level scaling<br/>(per-block + global)
        TritonKernel-->>Calibrator: quantized output
        deactivate TritonKernel
        Calibrator->>Calibrator: Accumulate loss
    end
    deactivate Calibrator
    Calibrator->>Calibrator: compute_amax()
    Note over Calibrator: Select best candidate<br/>based on minimal loss
    Calibrator-->>Quantizer: Update amax & global_amax
    Model->>Quantizer: forward(x)
    activate Quantizer
    Quantizer->>TritonKernel: _fake_quantize(inputs)
    activate TritonKernel
    TritonKernel->>TritonKernel: Block-wise FP4 quantization<br/>scaled by global_amax
    TritonKernel-->>Quantizer: Quantized tensor
    deactivate TritonKernel
    Quantizer-->>Model: Output
    deactivate Quantizer
```
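To make the "two-level scaling" step in the diagram concrete, here is a minimal PyTorch sketch of the fake-quantization math it describes. This is an illustration, not the PR's Triton kernel: the function name is hypothetical, it assumes one `amax` entry per block and a tensor whose size divides evenly into blocks, and it uses 6.0 / 448.0 as the FP4 (E2M1) / FP8 (E4M3) maxima.

```python
import torch

E2M1_MAX = 6.0    # largest FP4 (E2M1) magnitude
E4M3_MAX = 448.0  # largest FP8 (E4M3) magnitude

def nvfp4_fake_quant_sketch(x, amax, global_amax, block_size=16):
    """Illustrative two-level-scaled FP4 fake quantization (not the Triton kernel)."""
    orig_shape, orig_dtype = x.shape, x.dtype
    blocks = x.reshape(-1, block_size).float()           # assumes numel % block_size == 0
    # Level 2: one global scale derived from the tensor-wide amax.
    global_scale = global_amax.float() / (E2M1_MAX * E4M3_MAX)
    # Level 1: per-block scales, themselves stored as FP8 relative to the global scale.
    block_scale = amax.reshape(-1, 1).float() / E2M1_MAX  # assumes one amax per block
    block_scale = (
        (block_scale / global_scale).clamp(max=E4M3_MAX).to(torch.float8_e4m3fn).float()
        * global_scale
    )
    # Quantize each block to the FP4 value grid and dequantize back.
    scaled = blocks / block_scale.clamp_min(1e-12)
    fp4_grid = torch.tensor([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0], device=x.device)
    idx = (scaled.abs().unsqueeze(-1) - fp4_grid).abs().argmin(dim=-1)  # nearest value
    q = fp4_grid[idx] * scaled.sign()
    return (q * block_scale).reshape(orig_shape).to(orig_dtype)
```

The calibrator's candidate search then amounts to re-running this with different `amax` candidates and keeping the one whose output has the lowest MSE against `x`.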
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~50 minutes

🚥 Pre-merge checks: ✅ 2 passed | ❌ 1 failed (warning)
Actionable comments posted: 5
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
modelopt/torch/quantization/triton/fp4_kernel.py (1)
Lines 69-85: ⚠️ Potential issue | 🔴 Critical — Guard the final tile to prevent out-of-bounds access in `fp4_dequantize_kernel`.
`packed_mask` only checks `packed_col_idx < (N // 2)`, which is always true since `packed_col_idx = packed_offs % (N // 2)`. When `TILE_SIZE` doesn't evenly divide `packed_tensor.numel()`, the final kernel instance can read/write past the end of the tensor. Add a `TOTAL_PACKED_ELEMS` parameter and change the mask to `packed_offs < TOTAL_PACKED_ELEMS`.
🛠️ Proposed fix

```diff
 def fp4_dequantize_kernel(
     packed_ptr,
     scale_ptr,
     global_scale_ptr,
     output_ptr,
     N,
+    TOTAL_PACKED_ELEMS,
     BLOCK_SIZE: tl.constexpr,
     TILE_SIZE: tl.constexpr,
 ):
@@
-    packed_mask = packed_col_idx < (N // 2)
+    packed_mask = packed_offs < TOTAL_PACKED_ELEMS
@@
     fp4_dequantize_kernel[grid](
         packed_tensor,
         scale_tensor,
         global_scale,
         output,
         N,
+        TOTAL_PACKED_ELEMS=packed_tensor.numel(),
         BLOCK_SIZE=block_size,
         TILE_SIZE=tile_size,
     )
```
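To see why the original mask can never exclude anything: `packed_col_idx = packed_offs % (N // 2)` is by construction always strictly less than `N // 2`, so the guard is a tautology. A tiny standalone example with made-up sizes:

```python
# Made-up sizes, chosen only to illustrate the tautology.
N = 64                                   # row width -> N // 2 = 32 packed columns
TILE_SIZE = 1024
total_packed = 3 * (N // 2)              # e.g. a 3 x 64 tensor -> 96 packed elements

packed_offs = range(TILE_SIZE)           # offsets touched by one kernel instance
packed_col_idx = [o % (N // 2) for o in packed_offs]

assert all(c < N // 2 for c in packed_col_idx)        # old mask: always True
oob = [o for o in packed_offs if o >= total_packed]   # new mask would drop these
print(len(oob), "offsets fall past the end of the tensor")  # -> 928
```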
🤖 Fix all issues with AI agents
In `@modelopt/torch/quantization/calib/mse.py`:
- Around line 161-172: Update NVFP4MSECalibrator.__init__ to accept a
block_size:int parameter and assign it to self._block_size so tests passing
block_size=16 succeed; specifically, add block_size to the __init__ signature of
NVFP4MSECalibrator and set self._block_size = block_size (leave the existing
call to super() and other parameters unchanged).
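A minimal sketch of that change, assuming `NVFP4MSECalibrator` subclasses the refactored `MseCalibrator` and that its other parameters pass through unchanged (the `*args, **kwargs` form below is a placeholder for the real signature):

```python
class NVFP4MSECalibrator(MseCalibrator):
    def __init__(self, *args, block_size: int = 16, **kwargs):
        # Only block_size is new; everything else is forwarded as before.
        super().__init__(*args, **kwargs)
        self._block_size = block_size
```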
In `@modelopt/torch/quantization/nn/modules/tensor_quantizer.py`:
- Around line 1278-1288: The call in NVFP4StaticQuantizer._fake_quantize passes
a removed parameter (_pass_through_bwd) to static_blockwise_fp4_fake_quant
causing a TypeError; update the invocation in _fake_quantize to call
static_blockwise_fp4_fake_quant with only the supported arguments (inputs,
self.amax, self.global_amax, True, inputs.dtype) and remove the trailing
_pass_through_bwd argument, ensuring argument order or keywords match the
current static_blockwise_fp4_fake_quant signature.
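Concretely, the corrected call site would look roughly like this, using the five supported arguments named above:

```python
def _fake_quantize(self, inputs):
    # _pass_through_bwd was removed from the kernel's signature, so it is
    # no longer passed here (sketch based on the review comment).
    return static_blockwise_fp4_fake_quant(
        inputs, self.amax, self.global_amax, True, inputs.dtype
    )
```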
In `@modelopt/torch/quantization/triton/fp4_kernel_hopper.py`:
- Around line 141-144: The docstring for the FP4 kernel (the
block_size/tile_rows/tile_cols parameter description) is out of sync with the
function signature: update the documented defaults to match the signature's
tile_rows=16 and tile_cols=64 (instead of 64/128) so the docstring for the
function/class that documents block_size, tile_rows, and tile_cols reflects the
actual defaults used in the code.
- Around line 151-191: The kernel launch uses global_amax on the wrong device
and doesn't ensure the correct CUDA context; move and validate global_amax to
x.device before using it (e.g., ensure global_amax is a scalar tensor, call
global_amax = global_amax.to(x.device) and then .float()), compute global_scale
from that device-local tensor, and wrap the fp4_fake_quant_kernel[grid](...)
launch in the same CUDA context as x (use torch.cuda.device(x.device) as in
static_blockwise_fp4_fake_quant) so the kernel runs on the correct GPU; also
validate that global_amax is a scalar and raise/convert if not.
In `@modelopt/torch/quantization/triton/fp4_kernel.py`:
- Around line 241-275: The function static_blockwise_fp4_fake_quant must be
back-compatible with older callers that pass scale, skip_scale_quant, or
scale_fp8_quant_amax: update the signature to accept those parameters (e.g.,
scale=None, skip_scale_quant=None, scale_fp8_quant_amax=None or via **kwargs)
and map them to the new behavior—if scale is provided, use it instead of
computing scale = amax/6.0; if skip_scale_quant is True, set
quantize_block_scales=False; if scale_fp8_quant_amax is provided, use it as
global_amax (convert to float and compute scale_fp8_quant_amax =
scale_fp8_quant_amax/6.0) before calling scaled_e4m3_impl/reduce_amax; preserve
existing type conversions (amax.float(), global_amax.float()) and emit a
deprecation warning when any legacy arg is used.
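One possible shape for that shim, sketched under the review's description; the new-signature parameter names here are assumptions, not the file's actual definition:

```python
import warnings

def static_blockwise_fp4_fake_quant(
    x, amax, global_amax=None, quantize_block_scales=True, out_dtype=None,
    *, scale=None, skip_scale_quant=None, scale_fp8_quant_amax=None,
):
    # Back-compat shim: map legacy keyword arguments onto the new behavior.
    if any(a is not None for a in (scale, skip_scale_quant, scale_fp8_quant_amax)):
        warnings.warn(
            "scale, skip_scale_quant and scale_fp8_quant_amax are deprecated",
            DeprecationWarning, stacklevel=2,
        )
    if scale is not None:
        amax = scale * 6.0          # invert the new code's scale = amax / 6.0
    if skip_scale_quant:
        quantize_block_scales = False
    if scale_fp8_quant_amax is not None:
        global_amax = scale_fp8_quant_amax
    amax = amax.float()
    global_amax = global_amax.float() if global_amax is not None else None
    ...  # existing kernel-launch body, unchanged
```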
Reviewed hunk (the launch path in `fp4_kernel_hopper.py`):

```python
x_shape = x.shape
x_dtype = x.dtype
x = x.reshape(-1, x_shape[-1]).contiguous()

M, N = x.shape
y = torch.empty_like(x)

stride_xm, stride_xn = x.stride()
stride_ym, stride_yn = y.stride()

# Align the tile width up to a multiple of the FP4 block size.
tile_cols = max(tile_cols, block_size)
tile_cols_aligned = ((tile_cols + block_size - 1) // block_size) * block_size
num_fp4_blocks = tile_cols_aligned // block_size

# Two-level scale: 6.0 is the FP4 (E2M1) max, 448.0 the FP8 (E4M3) max.
global_scale = global_amax.float() / (6.0 * 448.0)

grid = lambda *_: (triton.cdiv(M, tile_rows), triton.cdiv(N, tile_cols_aligned))

launch_kwargs = {
    "BLOCK_SIZE": block_size,
    "TILE_M": tile_rows,
    "TILE_N": tile_cols_aligned,
    "NUM_FP4_BLOCKS": num_fp4_blocks,
    "OUT_DTYPE": _torch_dtype_to_tl(x_dtype),
}
if num_warps is not None:
    launch_kwargs["num_warps"] = num_warps
if num_stages is not None:
    launch_kwargs["num_stages"] = num_stages
fp4_fake_quant_kernel[grid](
    x,
    y,
    M,
    N,
    global_scale,
    stride_xm,
    stride_xn,
    stride_ym,
    stride_yn,
    **launch_kwargs,
)
```
🧩 Analysis chain
🏁 Scripts executed:

```bash
find . -name "fp4_kernel_hopper.py" -type f
cat -n ./modelopt/torch/quantization/triton/fp4_kernel_hopper.py | sed -n '140,200p'
cat -n ./modelopt/torch/quantization/triton/fp4_kernel_hopper.py | sed -n '100,155p'
cat -n ./modelopt/torch/quantization/triton/fp4_kernel_hopper.py | sed -n '1,70p'
cat -n ./modelopt/torch/quantization/triton/fp4_kernel_hopper.py | sed -n '70,125p'

# Check if there are any existing validations or device handling in the full function
cat -n ./modelopt/torch/quantization/triton/fp4_kernel_hopper.py | sed -n '127,200p'

# Check if there are similar functions in the codebase that might show the correct pattern
find . -name "*.py" -path "*/quantization/triton/*" | head -5

# Check the similar fp4_kernel.py to see how it handles device management
rg -A 50 "def fp4_fake_quant" ./modelopt/torch/quantization/triton/fp4_kernel.py | head -80

cat -n ./modelopt/torch/quantization/triton/fp4_kernel.py | head -150
cat -n ./modelopt/torch/quantization/triton/fp4_kernel.py | sed -n '141,250p'
cat -n ./modelopt/torch/quantization/triton/fp4_kernel.py | sed -n '241,330p'

# Let me verify if there are any calls to fp4_fake_quant_block to see if global_amax is always guaranteed to be on device
rg -B 5 -A 2 "fp4_fake_quant_block" ./modelopt/torch/quantization/triton/ --type py

# Look for calls/usages of fp4_fake_quant_block in the entire codebase
rg "fp4_fake_quant_block" --type py -B 3 -A 3 | grep -v "def fp4_fake_quant_block" | grep -v "__all__"

cat -n ./modelopt/torch/quantization/tensor_quant.py | grep -B 20 "fp4_fake_quant_block" | head -40
cat -n ./tests/gpu/torch/quantization/test_tensor_quant_cuda.py | grep -B 5 -A 10 "fp4_fake_quant_block"
```
Ensure global_amax is on x.device and launch kernel in correct CUDA context.
Lines 165-191: global_amax.float() does not move the tensor to x.device. If global_amax is on CPU or a different CUDA device, global_scale will be on the wrong device and the kernel will fail or use incorrect data. Additionally, the kernel launch lacks a torch.cuda.device context, unlike the similar static_blockwise_fp4_fake_quant function in fp4_kernel.py. Add a CUDA device check, move global_amax to x.device, validate it's scalar, and wrap the kernel launch in torch.cuda.device.
🔧 Proposed fix

```diff
 x_shape = x.shape
 x_dtype = x.dtype
+if not x.is_cuda:
+    raise RuntimeError("fp4_fake_quant_block requires a CUDA tensor.")
+if global_amax.numel() != 1:
+    raise ValueError("global_amax must be a scalar tensor.")
+global_amax = global_amax.to(device=x.device, dtype=torch.float32)
 x = x.reshape(-1, x_shape[-1]).contiguous()
@@
-global_scale = global_amax.float() / (6.0 * 448.0)
+global_scale = global_amax / (6.0 * 448.0)
@@
-fp4_fake_quant_kernel[grid](
-    x,
-    y,
-    M,
-    N,
-    global_scale,
-    stride_xm,
-    stride_xn,
-    stride_ym,
-    stride_yn,
-    **launch_kwargs,
-)
+with torch.cuda.device(x.device):
+    fp4_fake_quant_kernel[grid](
+        x,
+        y,
+        M,
+        N,
+        global_scale,
+        stride_xm,
+        stride_xn,
+        stride_ym,
+        stride_yn,
+        **launch_kwargs,
+    )
```
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
```python
x_shape = x.shape
x_dtype = x.dtype
if not x.is_cuda:
    raise RuntimeError("fp4_fake_quant_block requires a CUDA tensor.")
if global_amax.numel() != 1:
    raise ValueError("global_amax must be a scalar tensor.")
global_amax = global_amax.to(device=x.device, dtype=torch.float32)
x = x.reshape(-1, x_shape[-1]).contiguous()

M, N = x.shape
y = torch.empty_like(x)

stride_xm, stride_xn = x.stride()
stride_ym, stride_yn = y.stride()

tile_cols = max(tile_cols, block_size)
tile_cols_aligned = ((tile_cols + block_size - 1) // block_size) * block_size
num_fp4_blocks = tile_cols_aligned // block_size

global_scale = global_amax / (6.0 * 448.0)

grid = lambda *_: (triton.cdiv(M, tile_rows), triton.cdiv(N, tile_cols_aligned))

launch_kwargs = {
    "BLOCK_SIZE": block_size,
    "TILE_M": tile_rows,
    "TILE_N": tile_cols_aligned,
    "NUM_FP4_BLOCKS": num_fp4_blocks,
    "OUT_DTYPE": _torch_dtype_to_tl(x_dtype),
}
if num_warps is not None:
    launch_kwargs["num_warps"] = num_warps
if num_stages is not None:
    launch_kwargs["num_stages"] = num_stages
with torch.cuda.device(x.device):
    fp4_fake_quant_kernel[grid](
        x,
        y,
        M,
        N,
        global_scale,
        stride_xm,
        stride_xn,
        stride_ym,
        stride_yn,
        **launch_kwargs,
    )
```
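For orientation, a call into this path could look like the snippet below. The positional/keyword layout of `fp4_fake_quant_block` is assumed from the excerpt and the docstring fix above (tile_rows=16, tile_cols=64), not copied from the file:

```python
import torch

x = torch.randn(128, 256, dtype=torch.bfloat16, device="cuda")
global_amax = x.abs().amax().float()   # scalar tensor, as the new guard requires

y = fp4_fake_quant_block(
    x, global_amax,
    tile_rows=16, tile_cols=64, block_size=16,
)
# Expected: y has the same shape and dtype as x, with FP4 fake-quantized values.
```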
Force-pushed 1a09c12 to fcf071f (Compare)
Codecov Report

❌ Patch coverage is …

Additional details and impacted files:

```
@@            Coverage Diff             @@
##             main     #849      +/-   ##
==========================================
+ Coverage   73.38%   73.45%   +0.07%
==========================================
  Files         193      197       +4
  Lines       19893    20651     +758
==========================================
+ Hits        14598    15169     +571
- Misses       5295     5482     +187
```

☔ View full report in Codecov by Sentry.
Fridah-nv left a comment:
NVFP4MSECalibrator and NVFP4StaticQuantizer refactor LGTM.
…antizer, NVFP4MSECalibrator
fp4 static kernel fix, test fixes, minor clean ups
minor
minor
minor
minor
Signed-off-by: realAsma <akuriparambi@nvidia.com>
Force-pushed 4ca3180 to d0dfae0 (Compare)
sugunav14 left a comment:
I reviewed the calibrator and the static quantizer logic! Just had a general question about scale search.
Signed-off-by: realAsma <akuriparambi@nvidia.com>
…ntizer, NVFP4MSECalibrator (#849)
Signed-off-by: realAsma <akuriparambi@nvidia.com>
What does this PR do?

Type of change: ?

Overview: ?

Usage

```python
# Add a code snippet demonstrating how to use this
```

Testing

Before your PR is "Ready for review"

Additional Information

Summary by CodeRabbit

Release Notes

New Features
- Added NVFP4StaticQuantizer for improved 4-bit quantization with enhanced precision control
- Introduced NVFP4MSECalibrator with flexible candidate generation for calibration optimization

Improvements
- Optimized GPU kernels for Hopper+ graphics cards with better performance
- Extended Triton support to broader GPU compatibility
- Enhanced backward compatibility for restoring previously quantized models

Tests
- Added comprehensive test coverage for new quantizers and calibration methods