## Automatic quantization and optimized inference for YOLOv5 with ONNX Runtime (TensorRT Execution Provider)

This notebook demonstrates a simple procedure for quantizing Ultralytics YOLOv5.

The quantization process consists of calibrating the quantized model, adjusting the quantization thresholds and fine-tuning the weights using distillation. Finally, we run inference with the quantized model using ONNX Runtime.

### Main chapters of this notebook:
1. Set up the environment
1. Prepare the dataset and create dataloaders
1. Create the baseline YOLOv5 ONNX model
1. Quantize YOLOv5
1. Measure the speed of the default YOLOv5 run in PyTorch and of the quantized YOLOv5 run in ONNX Runtime (TensorRT Execution Provider)
1. Measure mAP for the float and quantized versions

Before running this example, make sure that TensorRT supports INT8 inference on your GPU (CUDA compute capability 6.1 or higher in most cases; check your exact device [here](https://docs.nvidia.com/deeplearning/tensorrt/support-matrix/index.html#hardware-precision-matrix)).

## Set up the environment

First, let's set up the environment and make some common imports.

In [None]:
!pip install -r requirements.txt
!pip install 'numpy<1.24'

In [None]:
# You may need to uncomment this line and change the value to the index of a free GPU.
# %env CUDA_VISIBLE_DEVICES=0

1. Install the enot-autodl and ONNX Runtime libraries and create a Jupyter kernel with them.
2. Clone a specific commit of the YOLOv5 repository: https://github.com/ultralytics/yolov5/commit/f76a78e7078185ecdc67470d8658103cf2067c81
3. Replace the `val.py` script with our `val.py`.
4. Replace the path to the COCO dataset folder in the `yolov5/data/coco.yaml` file. If you do not have a pre-downloaded MS COCO dataset, you can leave it as is and the dataset will be downloaded automatically.

The following commands perform steps 2 and 3:

In [None]:
! git clone https://github.com/ultralytics/yolov5
! cd yolov5/ && git checkout f76a78e7078185ecdc67470d8658103cf2067c81
! cp tutorial_utils/val.py yolov5/val.py

In [None]:
import sys

sys.path.append('yolov5/')

import itertools
import statistics
import numpy as np
from timeit import Timer

import torch
from torch.optim.lr_scheduler import CosineAnnealingLR
from torch.optim import RAdam
from tqdm.auto import tqdm
from pathlib import Path

from itertools import islice

# quantization procedure
from enot.quantization import TensorRTFakeQuantizedModel
from enot.quantization import calibrate
from enot.quantization import distill
from enot.quantization import RMSELoss

# optimized inference
from tutorial_utils.inference import create_onnxruntime_session
import onnxsim

# converters from onnx to pytorch
from onnx2torch import convert
from onnx2torch.utils.custom_export_to_onnx import OnnxToTorchModuleWithCustomExport

# dataset creation functions
from yolov5.utils.dataloaders import create_dataloader
from yolov5.utils.general import check_dataset

# function for loading yolo checkpoint
from yolov5.models.experimental import attempt_load

# onnx conversion function
from yolov5.export import export_onnx

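### In the following cell we set up all necessary constants

* `HOME_DIR` - experiments home directory
* `DATASETS_DIR` - directory where the MS COCO dataset is stored
* `PROJECT_DIR` - project directory to save training logs, checkpoints, ...
* `ONNX_PATH` - path to the baseline ONNX model
* `QUANT_ONNX_PATH` - path to the quantized ONNX model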

In [None]:
HOME_DIR = Path.home() / '.optimization_experiments'
DATASETS_DIR = HOME_DIR / 'datasets/coco_for_yolo'
PROJECT_DIR = HOME_DIR / 'yolov5s_quantization'
QUANT_ONNX_PATH = './yolov5s_trt_int8.onnx'
ONNX_PATH = './yolov5s.onnx'

HOME_DIR.mkdir(exist_ok=True)
PROJECT_DIR.mkdir(exist_ok=True)

BATCH_SIZE = 8
IMG_SIZE = 640
IMG_SHAPE = (BATCH_SIZE, 3, IMG_SIZE, IMG_SIZE)

## Prepare dataset and create dataloaders

We will use the MS COCO dataset in this example.

The `check_dataset` and `create_dataloader` functions prepare the dataset in this example; specifically, they:
1. download and unpack the dataset into the folder specified in `yolov5/data/coco.yaml`;
1. create and return the validation dataloader.

**IMPORTANT NOTE**: since this is an example notebook, we train and validate the model on **THE SAME DATASET**. For better performance and generalization, use separate datasets for training and validation.


In [None]:
import yaml

In [None]:
with open('yolov5/data/coco.yaml', 'r') as f:
    coco_cfg = yaml.load(f, yaml.Loader)

coco_cfg['path'] = DATASETS_DIR.as_posix()

with open('yolov5/data/coco.yaml', 'w') as f:
    yaml.dump(coco_cfg, f)

data = check_dataset('yolov5/data/coco.yaml', autodownload=True)

valid_dataloader = create_dataloader(data["val"], IMG_SIZE, BATCH_SIZE, 32, False, pad=0.5, rect=False)[0]

## Baseline YOLOv5 ONNX creation

In [None]:
# Since the default YOLOv5 model contains conditional execution ('if' nodes), we have to save
# it to ONNX format and convert back to PyTorch to perform quantization.

In [None]:
%run yolov5/export.py --weights=yolov5s.pt --include=onnx --batch-size={BATCH_SIZE} --imgsz={IMG_SIZE}

In [None]:
regular_model = convert(ONNX_PATH).cuda()
regular_model.eval();

## Quantize YOLOv5

In [None]:
# Let's define a function for converting dataset samples to model inputs.


def sample_to_model_inputs(x):
    # x[0] is the first item of the dataloader sample. The sample is a tuple whose element 0 is a tensor with images.
    x = x[0]

    # The model is on CUDA, so the input images should also be on CUDA.
    x = x.cuda()

    # Convert the tensor from uint8 to a float data type.
    x = x.float()

    # YOLOv5 image normalization (0-255 to 0-1 normalization)
    x /= 255
    return x

In [None]:
# Consider specifying `quantization_scheme` for `TensorRTFakeQuantizedModel`:
# the quantization scheme can affect the performance of the quantized model.
# For details, see: https://enot-autodl.rtd.enot.ai/en/stable/reference_documentation/quantization.html#enot.quantization.TensorRTFakeQuantizedModel

fake_quantized_model = TensorRTFakeQuantizedModel(regular_model).cuda()

In [None]:
# Calibrate quantization thresholds using 10 batches.
# Note that we use **valid_dataloader** here to keep the example fast.
# In a real project, use your training data, or at least a part of it.

with torch.no_grad(), calibrate(fake_quantized_model):
    for batch in itertools.islice(valid_dataloader, 10):
        batch = sample_to_model_inputs(batch)
        fake_quantized_model(batch)
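
Conceptually, calibration picks a clipping threshold for each quantized tensor from the values seen on the calibration batches, and fake quantization then rounds values to the INT8 grid defined by that threshold. The sketch below illustrates the idea with symmetric max-abs calibration in plain Python. It is an illustration under that assumption only: `calibrate_threshold` and `fake_quantize` are hypothetical helpers, not part of ENOT, and ENOT may choose thresholds differently.

```python
def calibrate_threshold(batches):
    """Return the largest absolute value seen across all calibration batches."""
    return max(abs(value) for batch in batches for value in batch)


def fake_quantize(values, threshold, num_bits=8):
    """Quantize to a symmetric signed integer grid, then map back to floats."""
    q_max = 2 ** (num_bits - 1) - 1  # 127 for INT8
    scale = threshold / q_max
    result = []
    for value in values:
        q = round(value / scale)
        q = max(-q_max, min(q_max, q))  # clip values that fall outside the threshold
        result.append(q * scale)
    return result


# Toy "activations" standing in for real calibration batches.
calibration_batches = [[0.1, -0.5, 0.25], [1.27, -0.9, 0.0]]
threshold = calibrate_threshold(calibration_batches)

# Small values snap to the nearest grid point; -2.0 is clipped to -threshold.
quantized = fake_quantize([0.014, 0.5, -2.0], threshold)
```

A larger threshold avoids clipping but makes the grid coarser, which is why the thresholds are tuned further in the distillation step.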

In [None]:
# Distill model quantization thresholds and weights using RMSE loss.
# Note that we use **valid_dataloader** here to keep the example fast.
# In a real project, use your training data, or at least a part of it.

n_epochs = 5
with distill(fq_model=fake_quantized_model, tune_weight_scale_factors=True) as (qdistill_model, params):
    optimizer = RAdam(params=params, lr=0.005, betas=(0.9, 0.95))
    scheduler = CosineAnnealingLR(optimizer=optimizer, T_max=len(valid_dataloader) * n_epochs)
    distillation_criterion = RMSELoss()

    for _ in range(n_epochs):
        for batch in (tqdm_it := tqdm(valid_dataloader)):
            batch = sample_to_model_inputs(batch)

            optimizer.zero_grad()
            loss: torch.Tensor = torch.tensor(0.0).cuda()
            for student_output, teacher_output in qdistill_model(batch):
                loss += distillation_criterion(student_output, teacher_output)

            loss.backward()
            optimizer.step()
            scheduler.step()

            tqdm_it.set_description(f'loss: {loss.item():.3f}')
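
The loop above sums one RMSE term for each pair of student (quantized) and teacher (float) outputs. As a rough illustration, assuming the loss is the square root of the mean squared difference over all elements (the exact reduction used by ENOT's `RMSELoss` is not shown in this notebook), the computation for flat lists of numbers looks like this; `rmse` is a hypothetical helper:

```python
import math


def rmse(student, teacher):
    """Root mean squared error between two equally sized sequences."""
    if len(student) != len(teacher):
        raise ValueError('student and teacher outputs must have the same length')
    squared_errors = [(s - t) ** 2 for s, t in zip(student, teacher)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# One (student, teacher) pair per distilled output.
output_pairs = [
    ([1.0, 2.0, 3.0], [1.0, 2.0, 3.0]),  # identical outputs contribute zero loss
    ([0.0, 0.0], [3.0, 4.0]),  # sqrt((9 + 16) / 2)
]
total_loss = sum(rmse(student, teacher) for student, teacher in output_pairs)
```

The loss reaches zero only when the quantized model reproduces the float model's outputs, so no labels are needed for this fine-tuning step.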

In [None]:
fake_quantized_model.cuda()
fake_quantized_model.enable_quantization_mode(True)
fake_quantized_model.cpu()

torch.onnx.export(
    model=fake_quantized_model,
    args=torch.ones(*IMG_SHAPE),
    f=QUANT_ONNX_PATH,
    input_names=['input'],
    output_names=['output'],
    opset_version=13,
)

## Speed measurement

In [None]:
torch.cuda.empty_cache()

In [None]:
yolov5s = attempt_load('yolov5s.pt').cuda()

In [None]:
def measure_fps(infer):
    for _ in range(50):  # warmup
        infer()

    number = 50
    measurements = Timer(infer).repeat(repeat=50, number=number)
    norm = statistics.mean(measurements) / number / BATCH_SIZE
    fps = 1.0 / norm
    return fps
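
`Timer.repeat` returns one total time per repetition, each covering `number` calls, and every call processes `BATCH_SIZE` images. The arithmetic in `measure_fps` can be checked on made-up timings; the helper name `fps_from_measurements` and the numbers below are illustrative assumptions, not measured results:

```python
import statistics


def fps_from_measurements(measurements, number, batch_size):
    """Convert total times of `number` batched calls into images per second."""
    seconds_per_call = statistics.mean(measurements) / number
    seconds_per_image = seconds_per_call / batch_size
    return 1.0 / seconds_per_image


# Three repetitions of 50 calls each, about 2 seconds per repetition, batch size 8:
# 2 s / 50 calls / 8 images = 5 ms per image, i.e. about 200 images per second.
fps = fps_from_measurements([1.9, 2.0, 2.1], number=50, batch_size=8)
```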

In [None]:
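inputs = torch.ones((BATCH_SIZE, 3, IMG_SIZE, IMG_SIZE), dtype=torch.float32, device='cuda')


def infer_torch():
    # Disable autograd and wait for the GPU to finish so the timer covers the whole forward pass.
    with torch.no_grad():
        yolov5s(inputs)
    torch.cuda.synchronize()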

In [None]:
proto, _ = onnxsim.simplify(QUANT_ONNX_PATH)

onnxruntime_sess = create_onnxruntime_session(
    proto=proto,
    input_sample=inputs,
    output_shape=(BATCH_SIZE, 25200, 85),
)


def infer_onnxruntime():
    onnxruntime_sess(inputs)
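
The `output_shape` passed above follows from the YOLOv5 detection head: predictions are made on three grids with strides 8, 16 and 32, with 3 anchors per grid cell, and each prediction holds 4 box coordinates, 1 objectness score and 80 COCO class scores. A short check of that arithmetic, using a hypothetical helper `yolov5_output_shape`:

```python
def yolov5_output_shape(batch_size, img_size, num_classes=80,
                        strides=(8, 16, 32), anchors_per_cell=3):
    """Shape of the concatenated detection output for a square input image."""
    cells = sum((img_size // stride) ** 2 for stride in strides)
    num_predictions = cells * anchors_per_cell
    values_per_prediction = 4 + 1 + num_classes  # box, objectness, class scores
    return (batch_size, num_predictions, values_per_prediction)


# (80*80 + 40*40 + 20*20) * 3 = 25200 predictions, 85 values each.
shape = yolov5_output_shape(batch_size=8, img_size=640)
```

If `BATCH_SIZE` or `IMG_SIZE` is changed, the `output_shape` argument has to change accordingly.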

In [None]:
measure_fps(infer_torch)  # PyTorch FPS

In [None]:
measure_fps(infer_onnxruntime)  # ONNX Runtime (TensorRT Execution Provider) FPS

In [None]:
torch.cuda.empty_cache()

## mAP evaluation

In [None]:
# common validation function for Ultralytics YOLO models
from yolov5.val import run

In [None]:
opt = {
    'data': 'yolov5/data/coco.yaml',
    'weights': 'yolov5s.pt',
    'half': True,
    'batch_size': BATCH_SIZE,
}

In [None]:
run(**opt);

In [None]:
torch.cuda.empty_cache()

In [None]:
opt['onnxruntime_sess'] = onnxruntime_sess
opt['half'] = False

In [None]:
run(**opt);