# Part 2: Quantization and Inference

## Goals
* Learn how to quantize an ONNX model using *'vai_q_onnx'*.
* Learn how to perform inference on an AIE using ONNX Runtime.

Quantization is the process of converting a model's high-bit floating-point data format (usually 32-bit) into a lower-bit fixed-point data format (e.g., 8-bit integers) without significantly affecting the accuracy of the model's inference. This reduces computational complexity and storage space requirements, making the model more efficient.
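As a rough illustration of the idea, the sketch below maps a handful of floating-point values to 8-bit integers and back. It assumes a symmetric scheme (zero point of 0) and uses made-up weight values; the function names are hypothetical and not part of any library used in this tutorial.

```python
def quantize(values, scale, zero_point=0, qmin=-128, qmax=127):
    """Map floats to clamped 8-bit integers: q = clamp(round(x / scale) + zero_point)."""
    return [max(qmin, min(qmax, round(v / scale) + zero_point)) for v in values]


def dequantize(qvalues, scale, zero_point=0):
    """Recover approximate floats: x ~ (q - zero_point) * scale."""
    return [(q - zero_point) * scale for q in qvalues]


weights = [-1.0, -0.5, 0.0, 0.26, 1.0]       # made-up FP32 values
scale = max(abs(w) for w in weights) / 127   # symmetric scale, zero_point = 0

q = quantize(weights, scale)
restored = dequantize(q, scale)
max_error = max(abs(w - r) for w, r in zip(weights, restored))

print(q)          # 8-bit integer codes
print(max_error)  # worst-case rounding error, at most scale / 2 here
```

Each value is stored in one byte instead of four, and the reconstruction error stays within half a quantization step as long as no value is clipped.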


## Step 1: Import Packages

First, we need to import the necessary packages to run the inference on the Ryzen AI NPU:

In [1]:
import torch
import onnx
import onnxruntime
import torchvision
from torchvision import datasets
from torch.utils.data import DataLoader

import vai_q_onnx
from onnxruntime.quantization import CalibrationDataReader, QuantType, QuantFormat

import numpy as np
import random
import os

## Step 2: Quantize the Model



### Quantization Methods

Quantization methods fall into two main types: post-training quantization (PTQ) and quantization-aware training (QAT). PTQ is further divided into dynamic and static methods. Dynamic PTQ converts the weights from FP32 to INT8 ahead of time and computes the quantization parameters for activations on the fly during inference, which adds runtime overhead. Static PTQ instead uses a small calibration dataset to determine in advance how to map FP32 values to INT8, so all quantization parameters are fixed before inference.

In this example, we will use the PTQ static method to quantize the model.
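The difference between the two PTQ variants can be sketched in a few lines. The example below is only an illustration: it assumes symmetric INT8 quantization, and the activation values and the `symmetric_scale` helper are hypothetical.

```python
def symmetric_scale(values, qmax=127):
    """Scale that maps the largest magnitude in `values` to qmax."""
    peak = max(abs(v) for v in values)
    return peak / qmax if peak > 0 else 1.0


# Hypothetical activations observed for one tensor on three calibration batches.
calibration_batches = [
    [0.1, -0.4, 0.9],
    [1.6, -0.2, 0.3],
    [-2.0, 0.7, 0.5],
]

# Static PTQ: fix the scale once, offline, from all calibration data.
static_scale = symmetric_scale([v for batch in calibration_batches for v in batch])

# Dynamic PTQ: recompute the scale from each input at inference time.
new_input = [0.05, -0.25, 0.2]
dynamic_scale = symmetric_scale(new_input)

print(static_scale, dynamic_scale)
```

The static scale is computed once and stored in the model, so nothing has to be measured at inference time. The dynamic scale adapts to each input, at the cost of extra work per inference.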

### Prepare Calibration Dataset

We randomly select 1,000 samples from the Fashion-MNIST training split as the calibration dataset:

In [2]:
train_data = torchvision.datasets.FashionMNIST(root="./data/", train=True, download=False, transform=torchvision.transforms.ToTensor())
_, calibration_data = torch.utils.data.random_split(train_data, [len(train_data) - 1000, 1000])
# drop_last=True keeps only full batches: 15 batches of 64, i.e. 960 of the 1,000 samples.
calibration_loader = DataLoader(calibration_data, batch_size=64, drop_last=True, shuffle=True)

### Calibration Dataset Loader

We wrap the calibration DataLoader in a reader object that the quantizer can iterate over, much like a training loop. To do this, we subclass *'CalibrationDataReader'* from *'onnxruntime'* and implement the required *'__init__'* and *'get_next'* methods:

In [3]:
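class FashionMNISTCalibrationDatasetLoader(CalibrationDataReader):
    def __init__(self) -> None:
        super().__init__()
        self.iterator = iter(calibration_loader)

    def get_next(self) -> dict:
        try:
            images, _ = next(self.iterator)  # labels are not needed for calibration
            # Flatten each 1x28x28 image into a 784-element vector for the MLP input.
            images = torch.flatten(images, start_dim=1)
            return {"input": images.numpy()}
        except StopIteration:
            # Returning None signals that the calibration data is exhausted.
            return None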
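To make the *'get_next'* contract concrete, the sketch below uses a plain-Python stand-in reader and a loop that consumes it until it returns `None`. It is an assumption about how a calibrator typically drives a reader, not the actual *'vai_q_onnx'* code; `ListCalibrationReader` and `drain` are hypothetical names.

```python
class ListCalibrationReader:
    """Stand-in with the same get_next() contract (not the onnxruntime base class)."""

    def __init__(self, batches, input_name="input"):
        self.input_name = input_name
        self.iterator = iter(batches)

    def get_next(self):
        # Return one feed dict per call, then None once the data is exhausted.
        batch = next(self.iterator, None)
        return None if batch is None else {self.input_name: batch}


def drain(reader):
    """Collect every feed dict the way a calibrator would: loop until None."""
    feeds = []
    while (feed := reader.get_next()) is not None:
        feeds.append(feed)
    return feeds


reader = ListCalibrationReader([[0.0, 0.5], [1.0, 0.25]])
feeds = drain(reader)
print(len(feeds))  # one feed dict per batch
```

The dictionary key must match the model's input name (`"input"` in this tutorial), because each feed is passed to an inference session during calibration.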

### Quantize the Model Using vai_q_onnx
We use the *'vai_q_onnx.quantize_static'* API from Vitis AI. It takes several parameters, such as the input and output model paths, the calibration data reader, the data types for activations and weights (INT8), and the calibration method.

In addition, finer-grained quantization options such as *'enable_dpu'* and *'extra_options'* adapt the quantized model to different hardware and improve inference performance.

In [4]:
onnx_model_path = "models/mlp_trained.onnx"
quantization_model = "models/mlp_qdq.U8S8.onnx"
MNIST_cdr = FashionMNISTCalibrationDatasetLoader()

vai_q_onnx.quantize_static(
    onnx_model_path,
    quantization_model,
    MNIST_cdr,
    quant_format=QuantFormat.QDQ,
    calibrate_method=vai_q_onnx.PowerOfTwoMethod.MinMSE,
    activation_type=QuantType.QInt8,
    weight_type=QuantType.QInt8,
    enable_dpu=True,  # determines whether to generate a quantized model that is suitable for the DPU.
    extra_options={'ActivationSymmetric': True} # reduce computation
)
print('Calibrated and quantized model saved at:', quantization_model)

INFO:vai_q_onnx.quant_utils:The input ONNX model models/mlp_trained.onnx can create InferenceSession successfully
INFO:vai_q_onnx.quantize:Removed initializers from input
INFO:vai_q_onnx.quantize:Loading model...
INFO:vai_q_onnx.quantize:enable_dpu is True, optimize the model for better hardware compatibility.
INFO:vai_q_onnx.quantize:Start calibration...
INFO:vai_q_onnx.quantize:Start collecting data, runtime depends on your model size and the number of calibration dataset.
INFO:vai_q_onnx.calibrate:Finding optimal threshold for each tensor using PowerOfTwoMethod.MinMSE algorithm ...
INFO:vai_q_onnx.calibrate:Use all calibration data to calculate min mse


[VAI_Q_ONNX_INFO]: Time information:
2024-07-24 08:14:25.962812
[VAI_Q_ONNX_INFO]: OS and CPU information:
                                        system --- Windows
                                          node --- AUP-MINIPC-D6
                                       release --- 10
                                       version --- 10.0.22621
                                       machine --- AMD64
                                     processor --- AMD64 Family 25 Model 116 Stepping 1, AuthenticAMD
[VAI_Q_ONNX_INFO]: Tools version information:
                                        python --- 3.9.2
                                          onnx --- 1.15.0
                                   onnxruntime --- 1.15.1
                                    vai_q_onnx --- 1.16.0+60e82ab
[VAI_Q_ONNX_INFO]: Quantized Configuration information:
                                   model_input --- models/mlp_trained.onnx
                                  model_output --- models/mlp_qdq.U8S8.onnx
  

Computing range: 100%|███████████████████████████████████████████████████████████████| 6/6 [00:00<00:00, 61.82tensor/s]
INFO:vai_q_onnx.qdq_quantizer:Remove QuantizeLinear & DequantizeLinear on certain operations(such as conv-relu).


Calibrated and quantized model saved at: models/mlp_qdq.U8S8.onnx
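The call above selects `PowerOfTwoMethod.MinMSE` as the calibration method. The sketch below illustrates the general idea behind such a method: restrict the scale to powers of two (so that hardware can rescale with bit shifts) and pick the candidate with the lowest mean squared error. This is a simplified assumption for illustration, not the *'vai_q_onnx'* implementation; the helper names, the exponent range, and the activation values are hypothetical.

```python
def quantize_dequantize(values, scale, qmin=-128, qmax=127):
    """Round to the int8 grid defined by `scale`, clamp, and map back to floats."""
    return [max(qmin, min(qmax, round(v / scale))) * scale for v in values]


def mse(a, b):
    """Mean squared error between two equally long sequences."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def best_power_of_two_scale(values, exponents=range(-12, 5)):
    """Try scale = 2**e for each candidate exponent and keep the lowest-MSE one."""
    return min(
        (2.0 ** e for e in exponents),
        key=lambda s: mse(values, quantize_dequantize(values, s)),
    )


activations = [0.03, -0.7, 1.9, 0.25, -1.2, 0.6]  # hypothetical calibration values
scale = best_power_of_two_scale(activations)
print(scale)
```

Smaller scales give finer resolution but clip large values, while larger scales avoid clipping at the cost of coarser rounding. The search settles on the power of two that balances the two effects for the observed data.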


## Step 3: Inference


### Load the Quantized Model
In the previous step, we saved the quantized model to the given file path.

Here we first load the quantized model.

In [5]:
quantized_model_path = './models/mlp_qdq.U8S8.onnx'
model = onnx.load(quantized_model_path)

### Select Device for Inference

Depending on the device you want to use for inference, set the appropriate execution provider:

To run inference on the CPU, configure the session as follows:

In [6]:
ep = 'cpu'
if ep == 'cpu':
    providers = ['CPUExecutionProvider']
    provider_options = [{}]
session = onnxruntime.InferenceSession(model.SerializeToString(), providers=providers,
    provider_options=provider_options)

To run inference on the AIE, configure the session as follows. The *'cacheDir'* and *'cacheKey'* entries in *'provider_options'* are optional and can be left unset, but *'config_file'* is **required**.

**NOTE: The 'vaip_config.json' file comes from the Execution Provider setup package. Place it at the path given by 'config_file' in 'provider_options'.**

Here, we need to copy the 'vaip_config.json' file into this directory.

In [None]:
ep = 'ipu'
providers = ['VitisAIExecutionProvider']
cache_dir = os.path.join(os.getcwd(), "onnx")
provider_options = [{
        'config_file': 'vaip_config.json',
        'cacheDir': str(cache_dir),
        'cacheKey': 'mlpmodel'
    }]
session = onnxruntime.InferenceSession(model.SerializeToString(), providers=providers,
    provider_options=provider_options)
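The two cells above can be folded into one helper that falls back to the CPU when the Vitis AI provider is not installed. The sketch below is a suggestion under stated assumptions: `choose_providers` is a hypothetical helper, and the list of available providers is passed in explicitly so the logic can run anywhere.

```python
def choose_providers(ep, available, config_file="vaip_config.json"):
    """Return (providers, provider_options) for the requested target.

    Falls back to the CPU provider when the Vitis AI provider is unavailable.
    """
    if ep == "ipu" and "VitisAIExecutionProvider" in available:
        return ["VitisAIExecutionProvider"], [{"config_file": config_file}]
    return ["CPUExecutionProvider"], [{}]


# In the notebook, `available` would come from onnxruntime.get_available_providers().
providers, provider_options = choose_providers("ipu", ["CPUExecutionProvider"])
print(providers)  # prints ['CPUExecutionProvider']
```

The returned pair can be passed directly as the `providers` and `provider_options` arguments of `onnxruntime.InferenceSession`.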

### Predict

Randomly select images from the Fashion-MNIST test dataset for prediction. We use 1,000 images in this case:

In [17]:
test_data = torchvision.datasets.FashionMNIST(root="./data/", train=False, download=False, transform=torchvision.transforms.ToTensor())  # same root directory
print(len(test_data))

inference_data_len = 1000
size = len(test_data)
indice = [random.randint(0,size-1) for _ in range(inference_data_len)]

10000
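Note that `random.randint` draws with replacement, so the same test image can be picked more than once. If unique images are preferred, one possible alternative is `random.sample`, sketched below; the fixed seed is an added assumption to make the selection reproducible.

```python
import random

random.seed(0)            # fixed seed so the selection is reproducible
dataset_size = 10000      # size printed by the cell above
inference_data_len = 1000

# random.sample draws without replacement, so every index is unique.
indices = random.sample(range(dataset_size), inference_data_len)
print(len(indices), len(set(indices)))  # prints 1000 1000
```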


We feed the 1,000 images to the model one at a time, compare each predicted class with its label, and report the percentage of correct predictions as the model's accuracy.

In [18]:
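acc = 0
for i in indice:
    img, label = test_data[i]
    # img has shape (1, 28, 28); flattening from dim 1 gives a (1, 784) batch of one.
    img = torch.flatten(img, start_dim=1)
    output = session.run(None, {'input': img.numpy()})
    output_arr = output[0]
    res = np.argmax(output_arr)  # index of the highest score is the predicted class
    if res == label:
        acc += 1

print(f"acc is {100*acc/inference_data_len}%")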

acc is 90.1%
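The accuracy calculation itself does not depend on the model or the runtime. The self-contained sketch below reproduces it on a few hypothetical score vectors; `argmax` and `accuracy` are illustrative helpers, not functions from the notebook.

```python
def argmax(scores):
    """Index of the largest score (the first one wins on ties)."""
    return max(range(len(scores)), key=scores.__getitem__)


def accuracy(all_scores, labels):
    """Percentage of samples whose top-scoring class equals the label."""
    correct = sum(argmax(s) == y for s, y in zip(all_scores, labels))
    return 100 * correct / len(labels)


# Hypothetical model outputs for four samples over three classes.
scores = [[0.1, 2.3, -1.0], [1.5, 0.2, 0.1], [0.0, 0.1, 0.9], [0.3, 0.2, 0.1]]
labels = [1, 0, 2, 1]
print(f"acc is {accuracy(scores, labels)}%")  # prints acc is 75.0%
```

Running the same measurement on the original FP32 model with the same indices gives a baseline for judging how much accuracy the quantization cost.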
