Copyright (c) Microsoft Corporation. All rights reserved.  
Licensed under the MIT License.

# Mobilenet v2 Quantization with ONNX Runtime on CPU

In this tutorial, we will load a mobilenet v2 model pretrained with [PyTorch](https://pytorch.org/), export it to ONNX, quantize it, and then run it with ONNX Runtime.

## 0. Prerequisites ##

If you have Jupyter Notebook, you can run this notebook directly with it. You may need to install or upgrade [PyTorch](https://pytorch.org/), [OnnxRuntime](https://microsoft.github.io/onnxruntime/), and other required packages.

Otherwise, you can set up a new environment. First, install [Anaconda](https://www.anaconda.com/distribution/). Then open an Anaconda prompt window and run the following commands:

```console
conda create -n cpu_env python=3.8
conda activate cpu_env
conda install jupyter
jupyter notebook
```
The last command launches Jupyter Notebook, and we can open this notebook in a browser to continue.

### 0.1 Install packages
Let's install the necessary packages to start the tutorial. We will install PyTorch 1.8, OnnxRuntime 1.8, latest ONNX and pillow.
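
Before continuing, it can help to confirm which versions are actually installed. The following is a minimal sketch that uses only the standard library; the helper name `installed_versions` is hypothetical, and the distribution names passed to it are assumed to be the usual PyPI names.

```python
from importlib import metadata

def installed_versions(packages):
    """Return {package: version string, or None if it is not installed}."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions

# Distribution names are assumed; adjust them to match your environment.
for name, version in installed_versions(
        ["torch", "torchvision", "onnx", "onnxruntime", "Pillow"]).items():
    print(f"{name}: {version or 'not installed'}")
```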

In this step, we load a pretrained mobilenet v2 model, and export it to ONNX.

### 1.1 Load the pretrained model
torchvision provides an API to load the mobilenet_v2 model. The cell below instead uses `timm` to load a pretrained ResNet-50.

In [13]:
from torchvision import models, datasets, transforms as T
import timm
model = timm.create_model('resnet50', pretrained=True)



### 1.2 Export the model to ONNX
Use the PyTorch ONNX export API to export the model.

In [14]:
import torch
image_height = 224
image_width = 224
# # image_height = 256
# # image_width = 256
# # image_height = 800
# # image_width = 1199
# image_height = 640
# image_width = 640
x = torch.randn(1, 3, image_height, image_width, requires_grad=True)
torch_out = model(x)

# Export the model
torch.onnx.export(model,              # model being run
                  x,                         # model input (or a tuple for multiple inputs)
                  "resnet50_float.onnx", # where to save the model (can be a file or file-like object)
                  export_params=True,        # store the trained parameter weights inside the model file
                  opset_version=12,          # the ONNX opset version to export the model to
                  do_constant_folding=True,  # whether to execute constant folding for optimization
                  input_names = ['input'],   # the model's input names
                  output_names = ['output']) # the model's output names


verbose: False, log level: Level.ERROR



### 1.3 Sample Execution with ONNXRuntime

Run a sample with the full-precision ONNX model. First, implement the preprocessing.

In [1]:
from PIL import Image
import numpy as np
import onnxruntime
import torch

def preprocess_image(image_path, height, width, channels=3):
    image = Image.open(image_path)
    image = image.resize((width, height), Image.LANCZOS)  # Image.ANTIALIAS is deprecated and was removed in Pillow 10
    image_data = np.asarray(image).astype(np.float32)
    image_data = image_data.transpose([2, 0, 1]) # transpose to CHW
    mean = np.array([0.079, 0.05, 0]) + 0.406
    std = np.array([0.005, 0, 0.001]) + 0.224
    # print(image_data.shape[0])
    for channel in range(image_data.shape[0]):
        image_data[channel, :, :] = (image_data[channel, :, :] / 255 - mean[channel]) / std[channel]
    image_data = np.expand_dims(image_data, 0)
    return image_data
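
The per-channel loop in `preprocess_image` can also be written with NumPy broadcasting. The following is a minimal sketch rather than a replacement for the function above: the helper name `normalize_chw` is hypothetical, and the pixel data is random stand-in input instead of a decoded image.

```python
import numpy as np

# ImageNet channel statistics, the same values that preprocess_image builds.
MEAN = np.array([0.485, 0.456, 0.406], dtype=np.float32)
STD = np.array([0.229, 0.224, 0.225], dtype=np.float32)

def normalize_chw(image_chw):
    """Scale a CHW image to [0, 1] and normalize each channel."""
    image_chw = image_chw.astype(np.float32) / 255.0
    # Reshape to (3, 1, 1) so the statistics broadcast over height and width.
    return (image_chw - MEAN[:, None, None]) / STD[:, None, None]

# Compare against the explicit per-channel loop on random stand-in pixels.
rng = np.random.default_rng(0)
pixels = rng.integers(0, 256, size=(3, 4, 4)).astype(np.float32)
looped = pixels.copy()
for c in range(3):
    looped[c] = (looped[c] / 255 - MEAN[c]) / STD[c]
print(np.allclose(normalize_chw(pixels), looped, atol=1e-6))
```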

#### Download the ImageNet labels and load them

In [10]:
# Download ImageNet labels
!curl -o imagenet_classes.txt https://raw.githubusercontent.com/pytorch/hub/master/imagenet_classes.txt

# Read the categories
with open("/home/xlx/code/onnxruntime-inference-examples/quantization/notebooks/imagenet_v2/imagenet_classes.txt", "r") as f:
    categories = [s.strip() for s in f.readlines()]

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0
curl: (7) Failed to connect to raw.githubusercontent.com port 443: Connection refused


#### Run the example with ONNXRuntime

In [17]:
session_fp32 = onnxruntime.InferenceSession("/home/xlx/code/neural-compressor/examples/pytorch/image_recognition/torchvision_models/quantization/qat/fx/saved_results/resnet18_float.onnx",providers=['CUDAExecutionProvider'])

def softmax(x):
    """Compute softmax values for each set of scores in x."""
    e_x = np.exp(x - np.max(x))
    return e_x / e_x.sum()

def run_sample(session, image_file, categories):
    output = session.run([], {'input':preprocess_image(image_file, image_height, image_width)})[0]
    print(output)
    output = output.flatten()
    output = softmax(output) # this is optional
    # print(output)
    top5_catid = np.argsort(-output)[:5]
    for catid in top5_catid:
        print(categories[catid], output[catid])

run_sample(session_fp32, '/home/xlx/code/onnxruntime-inference-examples/quantization/notebooks/imagenet_v2/cat.jpg', categories)

[[-1.00364494e+00 -2.68677783e+00 -1.24730301e+00 -1.02071035e+00
  -1.62582362e+00  2.60259628e-01 -2.42855930e+00 -2.56515920e-01
   5.46876013e-01 -1.44562960e+00  1.21661448e+00 -2.51657796e+00
   2.63004780e+00 -6.94954753e-01 -2.36902165e+00  1.43998456e+00
  -1.27346885e+00 -2.16607976e+00 -4.55389023e+00 -3.44421887e+00
  -4.66023125e-02  2.68369770e+00  1.03526270e+00 -1.15526676e+00
   2.54656649e+00 -3.57944584e+00 -5.62320828e-01 -2.81269717e+00
  -2.70262790e+00 -3.09027582e-01  1.36424506e+00 -3.05889487e+00
  -1.36863005e+00 -2.33973145e+00  9.87966120e-01  6.95207179e-01
   1.89699376e+00 -3.19865674e-01  2.82360649e+00  3.63455391e+00
  -1.81797051e+00  3.94215178e+00  6.08824313e-01  2.95863700e+00
   1.33709764e+00  1.54550886e-03 -1.57516509e-01  1.91143107e+00
   8.45279276e-01 -1.24554396e+00 -1.37848234e+00  2.57497573e+00
  -7.25240707e-01 -3.70089483e+00 -7.25843832e-02 -2.88598824e+00
  -3.29884434e+00 -3.49452198e-01 -1.58990216e+00  2.95361608e-01
  -5.18140



In [11]:
session_int8 = onnxruntime.InferenceSession("/home/xlx/code/neural-compressor/examples/pytorch/image_recognition/torchvision_models/quantization/qat/fx/saved_results/resnet18int.onnx",providers=['CUDAExecutionProvider'])

def softmax(x):
    """Compute softmax values for each set of scores in x."""
    e_x = np.exp(x - np.max(x))
    return e_x / e_x.sum()

def run_sample(session, image_file, categories):
    output = session.run([], {'input':preprocess_image(image_file, image_height, image_width)})[0]
    output = output.flatten()
    output = softmax(output) # this is optional
    top5_catid = np.argsort(-output)[:5]
    for catid in top5_catid:
        print(categories[catid], output[catid])

run_sample(session_int8, '/home/xlx/code/onnxruntime-inference-examples/quantization/notebooks/imagenet_v2/cat.jpg', categories)

2023-06-29 17:40:46.988765459 [W:onnxruntime:, session_state.cc:1169 VerifyEachNodeIsAssignedToAnEp] Some nodes were not assigned to the preferred execution providers which may or may not have an negative impact on performance. e.g. ORT explicitly assigns shape related ops to CPU to improve perf.
2023-06-29 17:40:46.988801939 [W:onnxruntime:, session_state.cc:1171 VerifyEachNodeIsAssignedToAnEp] Rerunning with verbose output on a non-minimal build will show node assignments.


NameError: name 'image_height' is not defined
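
The post-processing inside `run_sample` (softmax followed by a top-5 selection) can be studied in isolation. This is a minimal sketch on stand-in logits; the helper name `top_k` is hypothetical, and a real model would return 1000 ImageNet scores instead of six values.

```python
import numpy as np

def softmax(logits):
    """Softmax over a 1-D score vector, shifted by the max for stability."""
    shifted = logits - np.max(logits)
    exp = np.exp(shifted)
    return exp / exp.sum()

def top_k(probabilities, k=5):
    """Return the indices of the k largest probabilities, best first."""
    return np.argsort(-probabilities)[:k]

# Stand-in logits; a real model would return 1000 ImageNet scores.
logits = np.array([2.0, 1.0, 0.1, 3.0, -1.0, 0.5])
probs = softmax(logits)
for index in top_k(probs, k=3):
    print(index, probs[index])
```

Subtracting the maximum before exponentiating does not change the result, but it keeps `np.exp` from overflowing when the logits are large.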

In [77]:
import torchvision.transforms as transforms
import torch
import torchvision
def prepare_data_loaders(data_path):
    normalize = transforms.Normalize(mean=[0.485, 0.456, 0.406],
                                     std=[0.229, 0.224, 0.225])
    dataset = torchvision.datasets.ImageNet(
        data_path, split="train", transform=transforms.Compose([
            transforms.RandomResizedCrop(224),
            transforms.RandomHorizontalFlip(),
            transforms.ToTensor(),
            normalize,
        ]))
    dataset_test = torchvision.datasets.ImageNet(
        data_path, split="val", transform=transforms.Compose([
            transforms.Resize(256),
            transforms.CenterCrop(224),
            transforms.ToTensor(),
            normalize,
        ]))

    train_sampler = torch.utils.data.RandomSampler(dataset)
    test_sampler = torch.utils.data.SequentialSampler(dataset_test)

    data_loader = torch.utils.data.DataLoader(
        dataset, batch_size=train_batch_size,
        sampler=train_sampler)

    data_loader_test = torch.utils.data.DataLoader(
        dataset_test, batch_size=eval_batch_size,
        sampler=test_sampler)

    return data_loader, data_loader_test
data_path = '/data1/data/imagenet2012/'
train_batch_size = 30
eval_batch_size = 1
data_loader, data_loader_test = prepare_data_loaders(data_path)

In [78]:
# example_inputs = (next(iter(data_loader_test))[0]) # get an example input
# print(example_inputs.shape)
# for step, (batch_x, batch_y) in enumerate(data_loader_test):
#             # training
#         print("steop:{}, batch_x:{}, batch_y:{}".format(step, batch_x.numpy().shape, batch_y.shape))
from onnxruntime.quantization import quantize_static, CalibrationDataReader, QuantType, QuantFormat, CalibrationMethod
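class ImagenetDataReader(CalibrationDataReader):
    def __init__(self, data_loader):
        self.data = data_loader
        # Create the iterator once. Rebuilding it inside get_next() would
        # return the first batch on every call, so calibration never ends.
        self.enum_data_dicts = iter(self.data)

    def get_next(self):
        batch = next(self.enum_data_dicts, None)
        if batch is None:
            return None  # signals the end of the calibration data
        batch_x, _ = batch
        return {'input': batch_x.numpy()}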

# 2 Quantize the model with ONNXRuntime 
In this step, we load the full-precision model and quantize it with the ONNX Runtime quantization tool. We then compare the size of the full-precision model with that of the quantized model. Finally, we run the same sample with the quantized model.
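
To make the idea concrete: 8-bit quantization stores each value as an integer plus a shared scale factor, so a weight takes one byte instead of four. The following is a minimal NumPy sketch of symmetric int8 quantization. It illustrates the arithmetic only and is not ONNX Runtime's implementation; the function names are hypothetical.

```python
import numpy as np

def quantize_symmetric_int8(values):
    """Map float values to int8 with a single scale and zero point 0."""
    max_abs = float(np.max(np.abs(values)))
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    quantized = np.clip(np.round(values / scale), -127, 127).astype(np.int8)
    return quantized, scale

def dequantize(quantized, scale):
    """Recover approximate float values from the int8 representation."""
    return quantized.astype(np.float32) * scale

weights = np.array([-1.5, -0.2, 0.0, 0.7, 1.27], dtype=np.float32)
q, scale = quantize_symmetric_int8(weights)
restored = dequantize(q, scale)
print(q, scale)
print(np.max(np.abs(weights - restored)))  # rounding error, about scale / 2 at most
```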

## 2.1 Implement a CalibrationDataReader
CalibrationDataReader takes in calibration data and generates input for the model

In [2]:
from onnxruntime.quantization import quantize_static, CalibrationDataReader, QuantType, QuantFormat, CalibrationMethod
import os
image_height = 224
image_width = 224
def preprocess_func(images_folder, height, width, size_limit=6000):
    """Load up to size_limit images from images_folder and preprocess them.

    Returns an array of shape (N, 1, 3, height, width).
    """
    image_names = os.listdir(images_folder)
    if size_limit > 0 and len(image_names) >= size_limit:
        batch_filenames = image_names[:size_limit]
    else:
        batch_filenames = image_names
    unconcatenated_batch_data = []
    for image_name in batch_filenames:
        image_filepath = os.path.join(images_folder, image_name)
        with Image.open(image_filepath) as image:
            num_bands = len(image.getbands())
        # Skip images that do not have exactly 3 channels (grayscale, RGBA, ...),
        # because preprocess_image expects 3 channels.
        if num_bands != 3:
            continue
        image_data = preprocess_image(image_filepath, height, width)
        unconcatenated_batch_data.append(image_data)
    batch_data = np.concatenate(np.expand_dims(unconcatenated_batch_data, axis=0), axis=0)
    return batch_data


class MobilenetDataReader(CalibrationDataReader):
    def __init__(self, calibration_image_folder):
        self.image_folder = calibration_image_folder
        self.preprocess_flag = True
        self.enum_data_dicts = []
        self.datasize = 0

    def get_next(self):
        if self.preprocess_flag:
            self.preprocess_flag = False
            nhwc_data_list = preprocess_func(self.image_folder, image_height, image_width, size_limit=5000)
            self.datasize = len(nhwc_data_list)
            # print(nhwc_data_list[0].shape)
            self.enum_data_dicts = iter([{'input': nhwc_data} for nhwc_data in nhwc_data_list])
        return next(self.enum_data_dicts, None)
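
During calibration, the data reader is drained by calling `get_next()` repeatedly until it returns `None`. The following sketch mimics that loop with a stand-in class, so it runs without ONNX Runtime installed. `ListDataReader` and `drain` are hypothetical names, and the batches are zero-filled placeholders.

```python
import numpy as np

class ListDataReader:
    """Stand-in reader: serves one {'input': array} dict per call, then None."""

    def __init__(self, arrays, input_name="input"):
        # Build the iterator once so each call advances through the data.
        self._iterator = iter({input_name: array} for array in arrays)

    def get_next(self):
        return next(self._iterator, None)

def drain(reader):
    """Mimic a calibration loop: pull feeds until the reader is exhausted."""
    count = 0
    while True:
        feed = reader.get_next()
        if feed is None:
            break
        count += 1
    return count

batches = [np.zeros((1, 3, 224, 224), dtype=np.float32) for _ in range(4)]
print(drain(ListDataReader(batches)))
```

The important property is that the iterator is created once and every call advances it; a reader that never returns `None` would keep the loop running indefinitely.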


In [3]:
calibration_data_folder = "/data1/data/imagenet2012/dataILSVRC2012_img_val"
dr = MobilenetDataReader(calibration_data_folder)
# dr_torch=ImagenetDataReader(data_loader_test)



In [4]:
# change it to your real calibration data set

quantize_static('/home/xlx/code/quantize/resnet50_float.onnx',
                '/home/xlx/code/quantize/resnet50_int8.onnx',
                dr,
                quant_format=QuantFormat.QOperator,  # quantization format: QDQ / QOperator
                activation_type=QuantType.QInt8,  # activation type: Int8 / UInt8
                weight_type=QuantType.QInt8,  # weight type: Int8 / UInt8
                calibrate_method=CalibrationMethod.Percentile,  # calibration method: MinMax / Entropy / Percentile
                extra_options={"ActivationSymmetric": True, "WeightSymmetric": True, "extra.Sigmoid.nnapi": True},
                )

print('ONNX full precision model size (MB):', os.path.getsize("/home/xlx/code/quantize/resnet50_float.onnx")/(1024*1024))
# Read the size of the file that quantize_static wrote above.
print('ONNX quantized model size (MB):', os.path.getsize("/home/xlx/code/quantize/resnet50_int8.onnx")/(1024*1024))
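
The `calibrate_method` argument decides how the activation range is derived from the calibration data. The sketch below contrasts a min/max range with a percentile-based range on synthetic activations. It is a conceptual illustration only, not ONNX Runtime's calibrator; the function names and the 99.99 percentile are arbitrary assumptions.

```python
import numpy as np

def minmax_range(activations):
    """Range that covers every observed value, outliers included."""
    return float(activations.min()), float(activations.max())

def percentile_range(activations, percentile=99.99):
    """Symmetric range that clips the largest magnitudes as outliers."""
    limit = float(np.percentile(np.abs(activations), percentile))
    return -limit, limit

rng = np.random.default_rng(0)
activations = rng.normal(0.0, 1.0, size=100_000)
activations[0] = 50.0  # a single artificial outlier

print("MinMax:    ", minmax_range(activations))
print("Percentile:", percentile_range(activations))
```

A single outlier stretches the min/max range, which spends most of the 256 integer levels on values that almost never occur. A percentile range clips that outlier and keeps finer resolution for typical activations.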



In [9]:
session_fp32 = onnxruntime.InferenceSession("/home/xlx/code/quantize/resnet50_uint8.onnx",providers=['CUDAExecutionProvider', 'CPUExecutionProvider'])

def softmax(x):
    """Compute softmax values for each set of scores in x."""
    e_x = np.exp(x - np.max(x))
    return e_x / e_x.sum()

def run_sample(session, image_file, categories):
    output = session.run([], {'input':preprocess_image(image_file, image_height, image_width)})[0]
    output = output.flatten()
    output = softmax(output) # this is optional
    print(output)
    top5_catid = np.argsort(-output)[:5]
    for catid in top5_catid:
        print(categories[catid], output[catid])

run_sample(session_fp32, '/home/xlx/code/onnxruntime-inference-examples/quantization/notebooks/imagenet_v2/cat.jpg', categories)

NameError: name 'categories' is not defined

In [14]:
import os.path
import json
import pandas as pd
import glob
import onnxruntime as rt
import torch
from scipy import spatial
import numpy as np
import time

input_data = np.random.random([1,3,224,224])
print(input_data.shape)

onnx_bin="/home/xlx/code/quantize/resnet.bin"
input_data.astype(np.float32).tofile(onnx_bin)
def get_cosine_dist(x, y):
    cosine_dist = spatial.distance.cosine(x.reshape(-1), y.reshape(-1))
    return cosine_dist
def bench_performance(model_path):
    """
    Benchmark inference latency.
    :param model_path: path to the ONNX model
    :return: None; timings are printed
    """
    session = onnxruntime.InferenceSession(model_path,providers=['CUDAExecutionProvider'])
    input_name = session.get_inputs()[0].name

    total = 0.0
    runs = 10
    input_data = np.zeros((1,3,224,224), np.float32)  # dummy input; the shape must match the model, e.g. (1, 1, H, W) for grayscale input and (1, 3, H, W) for three-channel input
    # warming up
    _ = session.run([], {input_name: input_data})
    for i in range(runs):
        start = time.perf_counter()
        _ = session.run([], {input_name: input_data})
        end = (time.perf_counter() - start) * 1000
        total += end
        print(f"{end:.2f}ms")
    total /= runs
    print(f"Avg: {total:.2f}ms")

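def bench_accuracy(model_path,onnx_bin):
    """
    Run one inference for accuracy comparison.
    :param model_path: path to the ONNX model
    :param onnx_bin: path to a raw float32 input file
    :return: the first model output
    """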
    session = rt.InferenceSession(model_path,providers=['CUDAExecutionProvider'])
    input_name = session.get_inputs()[0].name

    
    onnx_input= np.fromfile(onnx_bin,dtype=np.float32).reshape(1, 3, 224, 224)
   
    # warming up
    _ = session.run([], {input_name: onnx_input})

    return _[0]


input_model_path = '/home/xlx/code/neural-compressor/examples/pytorch/image_recognition/torchvision_models/quantization/qat/fx/saved_results/resnet18_float.onnx'  # input ONNX model (float32)
output_model_path = '/home/xlx/code/neural-compressor/examples/pytorch/image_recognition/torchvision_models/quantization/qat/fx/saved_results/resnet18int.onnx'  # output (quantized) model


# dist = get_cosine_dist(bench_accuracy(input_model_path,onnx_bin),bench_accuracy(output_model_path,onnx_bin))
# print(dist)
print('ONNX full precision model size (MB):', os.path.getsize(input_model_path)/(1024*1024))
print('ONNX quantized model size (MB):', os.path.getsize(output_model_path)/(1024*1024))
print("float32 test")
bench_performance(input_model_path)
print("int8 test")
bench_performance(output_model_path)


(1, 3, 224, 224)
ONNX full precision model size (MB): 44.582664489746094
ONNX quantized model size (MB): 11.220354080200195
float32 test
5.87ms
4.96ms
4.80ms
4.76ms
4.70ms
4.74ms
4.80ms
4.78ms
4.75ms
4.72ms
Avg: 4.89ms
int8 test
5.92ms
5.39ms
5.27ms
5.25ms
5.24ms
5.26ms
5.24ms
5.24ms
5.25ms
5.23ms
Avg: 5.33ms
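
The figures printed above can be turned into two ratios. This short calculation only restates the reported numbers; the variable names are illustrative.

```python
# Figures copied from the output above.
fp32_size_mb = 44.582664489746094
int8_size_mb = 11.220354080200195
fp32_avg_ms = 4.89
int8_avg_ms = 5.33

compression = fp32_size_mb / int8_size_mb
slowdown = int8_avg_ms / fp32_avg_ms

print(f"Size reduction: {compression:.2f}x")
print(f"INT8 latency relative to FP32: {slowdown:.2f}x")
```

In this run, which used `CUDAExecutionProvider`, the quantized model is roughly four times smaller but slightly slower than the float32 model.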


image_height = 512
image_width = 512
# change it to your real calibration data set
calibration_data_folder = "calibration_imagenet"
dr = MobilenetDataReader(calibration_data_folder)
print(dr)
quantize_static('/data1/need_quant_models/deeplabv3plus_r101-d8_4xb2-80k_cityscapes-512x1024.onnx',
                '/data1/need_quant_models/deeplabv3plus_r101-d8_4xb2-80k_cityscapes-512x1024_uint8.onnx',
                dr,
                quant_format= QuantFormat.QOperator, # quantization format: QDQ / QOperator
                per_channel=True,
                activation_type=QuantType.QUInt8, # activation type: Int8 / UInt8
                weight_type=QuantType.QInt8, # weight type: Int8 / UInt8
                calibrate_method=CalibrationMethod.MinMax, ) # calibration method: MinMax / Entropy / Percentile

print('ONNX full precision model size (MB):', os.path.getsize("/data1/need_quant_models/deeplabv3plus_r101-d8_4xb2-80k_cityscapes-512x1024.onnx")/(1024*1024))
print('ONNX quantized model size (MB):', os.path.getsize("/data1/need_quant_models/deeplabv3plus_r101-d8_4xb2-80k_cityscapes-512x1024_uint8.onnx")/(1024*1024))
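
The call above sets `per_channel=True`. The following NumPy sketch shows why one scale per output channel can reduce quantization error when channels differ widely in magnitude. It is a conceptual illustration under assumed synthetic weights, not ONNX Runtime's implementation, and the function names are hypothetical.

```python
import numpy as np

def int8_scales(weights, per_channel):
    """Symmetric int8 scales for a weight tensor shaped (out_channels, ...)."""
    if per_channel:
        # One maximum per output channel.
        max_abs = np.abs(weights).reshape(weights.shape[0], -1).max(axis=1)
    else:
        # A single maximum shared by the whole tensor.
        max_abs = np.array([np.abs(weights).max()])
    return max_abs / 127.0

def roundtrip_error(weights, scales):
    """Mean absolute error after quantizing to int8 and dequantizing."""
    s = scales.reshape((-1,) + (1,) * (weights.ndim - 1))
    restored = np.clip(np.round(weights / s), -127, 127) * s
    return float(np.abs(weights - restored).mean())

rng = np.random.default_rng(0)
# Two output channels with very different magnitudes.
weights = np.stack([rng.normal(0, 0.01, size=(3, 3, 3)),
                    rng.normal(0, 1.0, size=(3, 3, 3))])
print("per-tensor :", roundtrip_error(weights, int8_scales(weights, False)))
print("per-channel:", roundtrip_error(weights, int8_scales(weights, True)))
```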

## 2.2 Quantize the model

In [None]:
# change it to your real calibration data set
calibration_data_folder = "calibration_imagenet"
dr = MobilenetDataReader(calibration_data_folder)

quantize_static('/home/xlx/code/quantization/swinv2_tiny_window8_256.ms_in1k.onnx',
                '/home/xlx/code/quantization/swinv2_tiny_window8_256.ms_in1k_uint8.onnx',
                dr)

print('ONNX full precision model size (MB):', os.path.getsize("/home/xlx/code/quantization/swinv2_tiny_window8_256.ms_in1k.onnx")/(1024*1024))
print('ONNX quantized model size (MB):', os.path.getsize("/home/xlx/code/quantization/swinv2_tiny_window8_256.ms_in1k_uint8.onnx")/(1024*1024))

Because we cannot upload the full calibration data set for copyright reasons, we demonstrate with only a few example images. In practice, you need to use your own calibration data set.
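
When building your own calibration set, one option is to draw a reproducible random subset of image files from a larger folder. The following is a minimal sketch using only the standard library; the helper name, the default extensions, and the seed are assumptions.

```python
import random
from pathlib import Path

def sample_calibration_files(folder, count, seed=0,
                             extensions=(".jpg", ".jpeg", ".png")):
    """Pick up to `count` image files from `folder`, reproducibly."""
    files = sorted(p for p in Path(folder).iterdir()
                   if p.suffix.lower() in extensions)
    rng = random.Random(seed)
    return rng.sample(files, min(count, len(files)))

# Usage with the folder name from this tutorial (assumed to exist locally):
# subset = sample_calibration_files("calibration_imagenet", count=100)
```

Sorting before sampling keeps the selection stable across runs, because the order returned by the file system is not guaranteed.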

In [None]:
import onnx
model_in_file="/home/xlx/code/onnxruntime-inference-examples/quantization/notebooks/imagenet_v2/mobilenet_v2_float.onnx"
model=onnx.load(model_in_file)
nodes = model.graph.node
for node in nodes:
    print(node.name)
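
Besides printing node names, it can be useful to see which operator types a graph contains, for example when deciding which nodes to leave unquantized. The following is a minimal sketch; `count_op_types` is a hypothetical helper that works on any objects exposing an `op_type` attribute, as ONNX graph nodes do.

```python
from collections import Counter

def count_op_types(nodes):
    """Count how often each operator type appears in a list of graph nodes."""
    return Counter(node.op_type for node in nodes)

# Usage, assuming `model` was loaded with onnx.load(...) as above:
# for op_type, count in count_op_types(model.graph.node).most_common():
#     print(op_type, count)
```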

## 2.3 Run the model with OnnxRuntime