Issue with running quantized fp8 model on onnxruntime.

Hello,

I'm trying to apply post-training quantization (PTQ) to a simple ResNet-50 model using TensorRT-Model-Optimizer and then run the quantized model with onnxruntime, but loading the model fails with a type error.

Installed versions:
tensorrt                  10.13.3.9
onnxruntime-gpu           1.23.2
nvidia-modelopt           0.37.0
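
For anyone trying to reproduce this, the versions above can be collected with a small script along these lines. This is only a sketch: `installed_versions` is a hypothetical helper name, and it assumes the pip distribution names match the ones listed above.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_versions(packages):
    """Map each distribution name to its installed version, or 'not installed'."""
    found = {}
    for name in packages:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            # The distribution is not present in the current environment.
            found[name] = "not installed"
    return found


if __name__ == "__main__":
    # Distribution names are assumed to match the list above.
    for name, ver in installed_versions(
        ["tensorrt", "onnxruntime-gpu", "nvidia-modelopt"]
    ).items():
        print(f"{name:<25} {ver}")
```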

When I run the fp8 quantization with
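```
from modelopt.onnx.quantization import quantize

# FP32_ONNX, FP8_ONNX, and calib_tensor are defined elsewhere in the script (not shown).
quantize(
    onnx_path=FP32_ONNX,
    quantize_mode="fp8",  # fp8, int8, int4 etc.
    calibration_data=calib_tensor,
    calibration_method="max",  # max, entropy, awq_clip, rtn_dq etc.
    output_path=FP8_ONNX,
)
```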

I get the following logs:

```
[modelopt][onnx] - INFO - Starting quantization process for model: resnet50_fp32.onnx
[modelopt][onnx] - INFO - Quantization mode: fp8
[modelopt][onnx] - INFO - Preprocessing the model resnet50_fp32.onnx
[modelopt][onnx] - INFO - Duplicating shared constants
[modelopt][onnx] - INFO - Model is cloned to resnet50_fp32_named.onnx after naming the nodes
[modelopt][onnx] - INFO - Setting up CalibrationDataProvider for calibration
[modelopt][onnx] - INFO - Analyzing MHA nodes for fp8 quantization
[modelopt][onnx] - INFO - No MHA partitions found in the model
[modelopt][onnx] - INFO - Starting FP8 quantization process
[modelopt][onnx] - INFO - Loading ONNX model from resnet50_fp32_named.onnx
[modelopt][onnx] - INFO - Detecting GEMV patterns for TRT optimization
[modelopt][onnx] - INFO - Scanning for unsupported Conv nodes for quantization
[modelopt][onnx] - INFO - Found 1 unsupported Conv nodes for quantization
[modelopt][onnx] - INFO - Configuring ORT for ModelOpt ONNX quantization
[modelopt][onnx] - INFO - Checking for cuDNN library
[modelopt][onnx] - INFO - *cudnn*.dll is accessible in C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v11.8\bin\cudnn64_8.dll! Please check that this is the correct version needed for your ORT version at https://onnxruntime.ai/docs/execution-providers/CUDA-ExecutionProvider.html#requirements.
[modelopt][onnx] - INFO - Successfully imported the `tensorrt` python package with version 10.13.3.9
[modelopt][onnx] - INFO - Checking for cuDNN library
[modelopt][onnx] - INFO - *cudnn*.dll is accessible in C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v11.8\bin\cudnn64_8.dll! Please check that this is the correct version needed for your ORT version at https://onnxruntime.ai/docs/execution-providers/CUDA-ExecutionProvider.html#requirements.
[modelopt][onnx] - INFO - Successfully enabled 3 EPs for ORT: ['CPUExecutionProvider', ('CUDAExecutionProvider', {'device_id': 0}), 'TensorrtExecutionProvider']
[modelopt][onnx] - INFO - Quantizable op types in the model: ['Gemm', 'Conv', 'Add']
[modelopt][onnx] - INFO - Finding nodes to quantize
[modelopt][onnx] - INFO - Building non-residual Add input map
[modelopt][onnx] - INFO - Searching for patterns like MHA, LayerNorm, etc
[modelopt][onnx] - INFO - Found 0 layer norm partitions
[modelopt][onnx] - INFO - Found 0 MHA (QK_AV) Patterns
[modelopt][onnx] - INFO - Found 1 non-quantizable partitions
[modelopt][onnx] - INFO - Building KGEN/CASK targeted partitions
[modelopt][onnx] - INFO - Classifying partition nodes
[modelopt][onnx] - INFO - Found 52 quantizable partition nodes and 0 quantizable KGEN heads
[modelopt][onnx] - INFO - Finding quantizable nodes. Initial nodes to quantize: 69
[modelopt][onnx] - INFO - Found 0 pooling/window ops
[modelopt][onnx] - INFO - Total number of quantizable nodes: 69
[modelopt][onnx] - INFO - Finding concat eliminated tensors
[modelopt][onnx] - INFO - Starting INT8 quantization with 'max' calibration
[modelopt][onnx] - INFO - Starting static quantization
[modelopt][onnx] - INFO - Starting post-processing of quantized model
[modelopt][onnx] - INFO - Deleting QDQ nodes from marked inputs to make certain operations fusible
[modelopt][onnx] - INFO - Converting float tensors to fp16
2025-10-27 14:52:07,043 - autocast - WARNING - precisionconverter.py - Initializer layer2.1.conv2.weight used by node layer2.1.conv2.weight_QuantizeLinear contains values smaller than smallest fp16 value, values will be replaced with 6.0e-08.
2025-10-27 14:52:07,069 - autocast - WARNING - precisionconverter.py - Initializer layer2.3.conv2.weight used by node layer2.3.conv2.weight_QuantizeLinear contains values smaller than smallest fp16 value, values will be replaced with 6.0e-08.
2025-10-27 14:52:07,085 - autocast - WARNING - precisionconverter.py - Initializer layer3.0.conv2.weight used by node layer3.0.conv2.weight_QuantizeLinear contains values smaller than smallest fp16 value, values will be replaced with 6.0e-08.
2025-10-27 14:52:07,101 - autocast - WARNING - precisionconverter.py - Initializer layer3.0.downsample.0.weight used by node layer3.0.downsample.0.weight_QuantizeLinear contains values smaller than smallest fp16 value, values will be replaced with 6.0e-08.
2025-10-27 14:52:07,116 - autocast - WARNING - precisionconverter.py - Initializer layer3.1.conv2.weight used by node layer3.1.conv2.weight_QuantizeLinear contains values smaller than smallest fp16 value, values will be replaced with 6.0e-08.
2025-10-27 14:52:07,143 - autocast - WARNING - precisionconverter.py - Initializer layer3.2.conv2.weight used by node layer3.2.conv2.weight_QuantizeLinear contains values smaller than smallest fp16 value, values will be replaced with 6.0e-08.
2025-10-27 14:52:07,163 - autocast - WARNING - precisionconverter.py - Initializer layer3.3.conv2.weight used by node layer3.3.conv2.weight_QuantizeLinear contains values smaller than smallest fp16 value, values will be replaced with 6.0e-08.
2025-10-27 14:52:07,195 - autocast - WARNING - precisionconverter.py - Initializer layer3.4.conv2.weight used by node layer3.4.conv2.weight_QuantizeLinear contains values smaller than smallest fp16 value, values will be replaced with 6.0e-08.
2025-10-27 14:52:07,210 - autocast - WARNING - precisionconverter.py - Initializer layer3.5.conv2.weight used by node layer3.5.conv2.weight_QuantizeLinear contains values smaller than smallest fp16 value, values will be replaced with 6.0e-08.
2025-10-27 14:52:07,227 - autocast - WARNING - precisionconverter.py - Initializer layer3.5.conv3.weight used by node layer3.5.conv3.weight_QuantizeLinear contains values smaller than smallest fp16 value, values will be replaced with 6.0e-08.
2025-10-27 14:52:07,259 - autocast - WARNING - precisionconverter.py - Initializer layer4.0.conv2.weight used by node layer4.0.conv2.weight_QuantizeLinear contains values smaller than smallest fp16 value, values will be replaced with 6.0e-08.
2025-10-27 14:52:07,288 - autocast - WARNING - precisionconverter.py - Initializer layer4.0.conv3.weight used by node layer4.0.conv3.weight_QuantizeLinear contains values smaller than smallest fp16 value, values will be replaced with 6.0e-08.
2025-10-27 14:52:07,319 - autocast - WARNING - precisionconverter.py - Initializer layer4.0.downsample.0.weight used by node layer4.0.downsample.0.weight_QuantizeLinear contains values smaller than smallest fp16 value, values will be replaced with 6.0e-08.
2025-10-27 14:52:07,354 - autocast - WARNING - precisionconverter.py - Initializer layer4.1.conv1.weight used by node layer4.1.conv1.weight_QuantizeLinear contains values smaller than smallest fp16 value, values will be replaced with 6.0e-08.
2025-10-27 14:52:07,381 - autocast - WARNING - precisionconverter.py - Initializer layer4.1.conv2.weight used by node layer4.1.conv2.weight_QuantizeLinear contains values smaller than smallest fp16 value, values will be replaced with 6.0e-08.
2025-10-27 14:52:07,427 - autocast - WARNING - precisionconverter.py - Initializer layer4.2.conv1.weight used by node layer4.2.conv1.weight_QuantizeLinear contains values smaller than smallest fp16 value, values will be replaced with 6.0e-08.
2025-10-27 14:52:07,443 - autocast - WARNING - precisionconverter.py - Initializer layer4.2.conv2.weight used by node layer4.2.conv2.weight_QuantizeLinear contains values smaller than smallest fp16 value, values will be replaced with 6.0e-08.
2025-10-27 14:52:07,474 - autocast - WARNING - precisionconverter.py - Initializer layer4.2.conv3.weight used by node layer4.2.conv3.weight_QuantizeLinear contains values smaller than smallest fp16 value, values will be replaced with 6.0e-08.
2025-10-27 14:52:09,529 - autocast - WARNING - precisionconverter.py - Initializer fc.weight used by node node_linear contains values smaller than smallest fp16 value, values will be replaced with 6.0e-08.
[modelopt][onnx] - INFO - Starting INT8 to FP8 conversion
[modelopt][onnx] - INFO - FP8 quantization completed in 79.18 seconds
[modelopt][onnx] - INFO - Converting model with QDQ nodes to DQ only model
[modelopt][onnx] - INFO - Removing 0 redundant Cast nodes
[modelopt][onnx] - INFO - Removed 52 Q nodes and redundant cast nodes
[W] colored module is not installed, will not use colors when logging. To enable colors, please install the colored module: python3 -m pip install colored
[W] Could not convert: FLOAT8E4M3FN to a corresponding NumPy type. The original ONNX type will be preserved.
[modelopt][onnx] - INFO - Total number of nodes: 333
[modelopt][onnx] - INFO - Total number of quantized nodes: 68
[modelopt][onnx] - INFO - Quantized onnx model is saved as resnet50_quantized.onnx
[modelopt][onnx] - INFO - Cleaning up intermediate files
[modelopt][onnx] - INFO - Validating quantized model
[modelopt][onnx] - INFO - Quantization process completed
```


Here is the Python script I use to load the model with onnxruntime, followed by the error it produces:

```
import onnxruntime as ort

path = "resnet50_quantized.onnx"  # model written by the quantize() call above

sess_opts = ort.SessionOptions()
# Enable all graph optimizations
sess_opts.graph_optimization_level = ort.GraphOptimizationLevel.ORT_ENABLE_ALL

# Execution providers in priority order: TensorRT, then CUDA, then CPU
providers = [
    "TensorrtExecutionProvider",
    ("CUDAExecutionProvider", {"device_id": 0}),
    "CPUExecutionProvider",
]

sess = ort.InferenceSession(path, sess_options=sess_opts, providers=providers)
```

```
onnxruntime.capi.onnxruntime_pybind11_state.Fail: [ONNXRuntimeError] : 1 : FAIL : Load model from resnet50_quantized.onnx failed:Type Error: Type parameter (T1) of Optype (DequantizeLinear) bound to different types (tensor(int8) and tensor(float8e4m3fn) in node (layer1.0.conv1.weight_DequantizeLinear).
```
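
In case it helps with triage, here is a minimal sketch for listing the affected nodes. It assumes that `T1` in the error covers the `x` (input 0) and `x_zero_point` (input 2) inputs of `DequantizeLinear`, as in the ONNX operator spec, and it only checks inputs stored as graph initializers. The helper names `mismatched_dq_nodes` and `read_onnx` are hypothetical and not part of ModelOpt or onnxruntime.

```python
from typing import Dict, Iterable, List, Sequence, Tuple

# (node_name, op_type, input_names); plain tuples keep the check independent of onnx.
Node = Tuple[str, str, Sequence[str]]


def mismatched_dq_nodes(
    nodes: Iterable[Node], elem_types: Dict[str, str]
) -> List[Tuple[str, str, str]]:
    """Return (node_name, x_type, zero_point_type) for DequantizeLinear nodes
    whose x and x_zero_point element types are both known and differ."""
    bad = []
    for name, op_type, inputs in nodes:
        # x_zero_point is optional, so skip nodes that do not provide it.
        if op_type != "DequantizeLinear" or len(inputs) < 3 or not inputs[2]:
            continue
        x_type = elem_types.get(inputs[0])
        zp_type = elem_types.get(inputs[2])
        # Skip tensors whose type is unknown (e.g. produced by another node).
        if x_type is None or zp_type is None:
            continue
        if x_type != zp_type:
            bad.append((name, x_type, zp_type))
    return bad


def read_onnx(path: str):
    """Extract node tuples and initializer element types from an ONNX file."""
    import onnx  # imported lazily so the check above has no dependencies

    graph = onnx.load(path, load_external_data=False).graph
    elem_types = {
        init.name: onnx.TensorProto.DataType.Name(init.data_type)
        for init in graph.initializer
    }
    nodes = [(n.name, n.op_type, list(n.input)) for n in graph.node]
    return nodes, elem_types


if __name__ == "__main__":
    nodes, elem_types = read_onnx("resnet50_quantized.onnx")
    for name, x_type, zp_type in mismatched_dq_nodes(nodes, elem_types):
        print(f"{name}: x={x_type}, x_zero_point={zp_type}")
```

If the assumption holds, each weight `DequantizeLinear` node named in the error should show up here with one input typed `INT8` and the other `FLOAT8E4M3FN`.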

Any help is appreciated!

Issue with running quantized fp8 model on onnxruntime. #468
