## Automatic quantization for enot-lite

This notebook demonstrates a simple end-to-end pipeline for MobileNetV2 quantization.

Our quantization process consists of quantized model calibration, quantization threshold adjustment, and weight fine-tuning using distillation. Finally, we demonstrate inference of our quantized model using the [enot-lite](https://enot-lite.rtd.enot.ai/en/latest/) framework.

### Main chapters of this notebook:
1. Set up the environment
1. Prepare dataset and create dataloaders
1. Evaluate pretrained MobileNetV2 from torchvision
1. End2end quantization with ENOT
1. Inference using enot-lite with TensorRT int8 backend

Before running this example, make sure that TensorRT supports int8 inference on your GPU (``cuda compute capability`` >= 6.1, as described [here](https://docs.nvidia.com/deeplearning/tensorrt/support-matrix/index.html#hardware-precision-matrix)).

## Set up the environment

First, let's set up the environment and make some common imports.

In [None]:
import os

os.environ['CUDA_DEVICE_ORDER'] = 'PCI_BUS_ID'
# You may need to change this variable to match free GPU index
os.environ['CUDA_VISIBLE_DEVICES'] = '0'
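
Once the visible device is fixed, the int8 requirement from the introduction can be checked programmatically. The sketch below is a rough check only: the helper names `supports_int8` and `current_gpu_capability` are hypothetical, and treating 6.1 itself as sufficient is our assumption. The NVIDIA support matrix remains the authority for individual devices.

```python
def supports_int8(capability, minimum=(6, 1)):
    """Return True if a (major, minor) compute capability meets the minimum."""
    # Tuples compare element by element, so (7, 5) >= (6, 1) and (6, 0) < (6, 1).
    return tuple(capability) >= tuple(minimum)


def current_gpu_capability():
    """Return the (major, minor) capability of visible GPU 0, or None."""
    try:
        import torch
    except ImportError:
        return None
    if not torch.cuda.is_available():
        return None
    return torch.cuda.get_device_capability(0)


capability = current_gpu_capability()
if capability is None:
    print('No CUDA device visible; cannot check int8 support.')
else:
    print(f'Compute capability {capability}: threshold met = {supports_int8(capability)}')
```

Index `0` here refers to the first device left visible by `CUDA_VISIBLE_DEVICES`, not necessarily the first physical GPU.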

In [None]:
# Common:
import numpy as np
import torch
from pathlib import Path
from torch import nn
from tqdm.auto import tqdm
from tutorial_utils.dataset import create_imagenet10k_dataloaders
from tutorial_utils.train import accuracy

# Quantization:
from enot.quantization import TrtFakeQuantizedModel
from enot.quantization import DefaultQuantizationDistiller

# TensorRT inference:
from enot_lite import backend
from enot_lite.calibration import CalibrationTableTensorrt

Define the model evaluation function:

In [None]:
# This function can evaluate both nn.Modules and plain callables.
def eval_model(model_fn, dataloader):
    """Return (mean loss, mean accuracy) of model_fn over the dataloader."""
    if isinstance(model_fn, nn.Module):
        model_fn.eval()

    total = 0
    total_accuracy = 0.0
    total_loss = 0.0

    criterion = nn.CrossEntropyLoss()

    with torch.no_grad():
        for inputs, labels in tqdm(dataloader):

            n = inputs.shape[0]

            pred_labels = model_fn(inputs)
            batch_loss = criterion(pred_labels, labels)
            batch_accuracy = accuracy(pred_labels, labels)

            total += n
            total_loss += batch_loss.item() * n
            total_accuracy += batch_accuracy.item() * n

    return total_loss / total, total_accuracy / total
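
`eval_model` multiplies each batch metric by the batch size `n` before summing. The sketch below shows why, using plain Python and made-up numbers: `nn.CrossEntropyLoss` averages over the batch by default (and we assume `accuracy` does too), so weighting by `n` recovers the exact per-sample mean even when the last batch is smaller. The helper `dataset_mean` is hypothetical.

```python
def dataset_mean(batch_means, batch_sizes):
    """Combine per-batch means into the exact mean over all samples."""
    total = sum(batch_sizes)
    weighted_sum = sum(m * n for m, n in zip(batch_means, batch_sizes))
    return weighted_sum / total


# Hypothetical per-batch losses: two full batches of 50 and a final batch of 10.
batch_losses = [1.0, 1.0, 4.0]
batch_sizes = [50, 50, 10]

naive = sum(batch_losses) / len(batch_losses)    # treats every batch equally
exact = dataset_mean(batch_losses, batch_sizes)  # treats every sample equally
print(f'naive={naive:.3f}, weighted={exact:.3f}')
```

With the settings used later in this notebook (5000 images, batch size 50) all batches are full, so both averages coincide. The weighting matters as soon as the batch size does not divide the dataset size.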

### In the following cell we set up all necessary directories

* `ENOT_HOME_DIR` - ENOT framework home directory
* `ENOT_DATASETS_DIR` - root directory for datasets (imagenette2, ...)
* `PROJECT_DIR` - project directory to save training logs, checkpoints, ...
* `ONNX_MODEL_PATH` - ONNX model path

In [None]:
ENOT_HOME_DIR = Path.home() / '.enot'
ENOT_DATASETS_DIR = ENOT_HOME_DIR / 'datasets'
PROJECT_DIR = ENOT_HOME_DIR / 'enot-lite_quantization'
ONNX_MODEL_PATH = PROJECT_DIR / 'mobilenetv2.onnx'

ENOT_HOME_DIR.mkdir(exist_ok=True)
ENOT_DATASETS_DIR.mkdir(exist_ok=True)
PROJECT_DIR.mkdir(exist_ok=True)

## Prepare dataset and create dataloaders

We will use the Imagenet-10k dataset in this example.

The Imagenet-10k dataset is a subsample of the [Imagenet](https://image-net.org/challenges/LSVRC/index.php) dataset. It contains 5000 training images and 5000 validation images. Training images are sampled uniformly from the original training set, and validation images are taken from the original validation set, 5 per class.

The `create_imagenet10k_dataloaders` function prepares the datasets for you in this example; specifically, it:
1. downloads and unpacks dataset into `ENOT_DATASETS_DIR`;
1. creates and returns train and validation dataloaders.

The two parts of the dataset:
* train: for quantization procedure (`ENOT_DATASETS_DIR`/imagenet10k/train/)
* validation: for model validation (`ENOT_DATASETS_DIR`/imagenet10k/val/)
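
To illustrate the "5 per class" validation subsample described above, here is a minimal sketch of per-class sampling in plain Python. It is not the code that built Imagenet-10k: the function `sample_per_class`, the toy file names, and the fixed seed are all hypothetical.

```python
import random
from collections import defaultdict


def sample_per_class(samples, per_class, seed=0):
    """Pick `per_class` items from each class of (item, label) pairs."""
    by_label = defaultdict(list)
    for item, label in samples:
        by_label[label].append(item)

    rng = random.Random(seed)  # fixed seed keeps the subsample reproducible
    subset = []
    for label in sorted(by_label):
        # random.sample raises ValueError if a class has fewer than per_class items.
        chosen = rng.sample(by_label[label], per_class)
        subset.extend((item, label) for item in chosen)
    return subset


# Toy stand-in for a labelled dataset: 3 classes with 20 file names each.
toy = [(f'class{c}_img{i}.jpg', c) for c in range(3) for i in range(20)]
subset = sample_per_class(toy, per_class=5)
print(len(subset))
```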

In [None]:
train_dataloader, validation_dataloader = create_imagenet10k_dataloaders(
    dataset_root_dir=ENOT_DATASETS_DIR,
    input_size=224,
    batch_size=50,
    num_workers=4,
)

## Evaluate pretrained MobileNetV2 from torchvision

In [None]:
from torchvision.models.mobilenetv2 import mobilenet_v2
regular_model = mobilenet_v2(pretrained=True).cuda()
regular_model.classifier[0].p = 0.0  # Disable dropout; this is required to stabilize the fine-tuning procedure.

In [None]:
val_loss, val_accuracy = eval_model(regular_model, validation_dataloader)
print(f'Regular (non-quantized) model: accuracy={val_accuracy:.3f}, loss={val_loss:.3f}')

## End2end quantization with ENOT

Simply run our ``DefaultQuantizationDistiller`` class to use distillation with quantization.

In [None]:
fake_quantized_model = TrtFakeQuantizedModel(regular_model).cuda()

# Distill model quantization thresholds and weights using RMSE loss.
dist = DefaultQuantizationDistiller(
    quantized_model=fake_quantized_model,
    dataloader=train_dataloader,
    device='cuda',
    logdir=PROJECT_DIR,
    verbose=2,
)

# Uncomment lines below if you want to reach the best quantization
# performance (71.90% top1 accuracy for quantized model).

# dist.distillers[0].n_epochs = 10  # Increase the number of threshold fine-tuning epochs.
# dist.distillers[0].scheduler.T_max *= 10  # Fix learning rate schedule.

dist.distill()
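
To make the role of a quantization threshold concrete, here is a minimal sketch of symmetric int8 fake quantization in plain Python. It is not ENOT's implementation: the function names and sample values are hypothetical, and it simply compares a few fixed thresholds by RMSE instead of fine-tuning them by distillation as `DefaultQuantizationDistiller` does.

```python
import math


def fake_quantize(x, threshold, num_bits=8):
    """Simulate symmetric signed integer quantization in floating point."""
    qmax = 2 ** (num_bits - 1) - 1   # 127 for int8
    scale = threshold / qmax         # size of one quantization step
    q = round(x / scale)             # snap to the integer grid
    q = max(-qmax, min(qmax, q))     # saturate values beyond the threshold
    return q * scale                 # map back to float ("dequantize")


def rmse(a, b):
    """Root mean squared error between two equally long sequences."""
    return math.sqrt(sum((u - v) ** 2 for u, v in zip(a, b)) / len(a))


# Hypothetical activations with one large outlier.
values = [-0.9, -0.31, -0.05, 0.0, 0.02, 0.27, 0.6, 4.0]

for threshold in (0.5, 1.0, 4.0):
    quantized = [fake_quantize(v, threshold) for v in values]
    print(f'threshold={threshold}: rmse={rmse(values, quantized):.4f}')
```

A small threshold gives fine steps but clips large values, while a large threshold avoids clipping at the cost of coarser steps. Adjusting thresholds is about balancing these two error sources.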

In [None]:
fake_quantized_model.enable_quantization_mode(True)
val_loss, val_accuracy = eval_model(fake_quantized_model, validation_dataloader)
print(f'Optimized quantized model: accuracy={val_accuracy:.3f}, loss={val_loss:.3f}')

## Inference using enot-lite with TensorRT int8 backend

For **enot-lite**, we need to export our quantized model to ONNX and save its calibration table:

In [None]:
fake_quantized_model.cpu()
fake_quantized_model.export_to_onnx(
    torch.zeros(50, 3, 224, 224),
    'exported_model.onnx',
    input_names=['input'],
    output_names=['output'],
)
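
The cells that follow expect two files in the working directory: the ONNX model and its calibration table. A quick existence check, sketched below with a hypothetical `missing_files` helper, can catch a failed export before the inference session is created. The file names are the ones used in this notebook.

```python
from pathlib import Path


def missing_files(paths):
    """Return the subset of paths that do not exist on disk."""
    return [str(p) for p in paths if not Path(p).exists()]


expected = [
    'exported_model.onnx',
    'exported_model.onnx.calibration_table_enot_lite',
]
missing = missing_files(expected)
if missing:
    print('Missing export artifacts:', missing)
else:
    print('Both export artifacts are present.')
```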

Initialize **enot-lite** inference session with TensorRT Int8 Execution Provider:

In [None]:
torch.cuda.empty_cache()  # Empty PyTorch CUDA cache before running enot-lite.

calibration_table = CalibrationTableTensorrt.from_file_json(
    './exported_model.onnx.calibration_table_enot_lite'
)
sess = backend.OrtTensorrtInt8Backend('./exported_model.onnx', calibration_table)
input_name = sess.get_inputs()[0].name

The first TensorRT run is usually slow because TensorRT selects the best algorithms for inference during that run.

Let's run the session once before validation:

In [None]:
sess.run(output_names=None, input_feed={input_name: np.zeros((50, 3, 224, 224), dtype=np.float32)});
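
To see the warm-up effect for yourself, you can time the first call separately from the later ones. The sketch below uses a hypothetical `time_calls` helper and a stand-in workload so that it runs anywhere; in this notebook you could pass a lambda that wraps `sess.run(...)` instead.

```python
import time


def time_calls(fn, n_runs=5):
    """Return wall-clock durations in seconds for repeated calls to fn."""
    durations = []
    for _ in range(n_runs):
        start = time.perf_counter()
        fn()
        durations.append(time.perf_counter() - start)
    return durations


# Stand-in workload; replace with a call to the inference session.
durations = time_calls(lambda: sum(range(100_000)))
first, rest = durations[0], durations[1:]
print(f'first run: {first:.6f}s, mean of later runs: {sum(rest) / len(rest):.6f}s')
```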

Evaluate quantized model on TensorRT:

In [None]:
def model_fn(inputs):
    input_feed = {input_name: inputs.cpu().numpy()}
    trt_output = sess.run(output_names=None, input_feed=input_feed)[0]
    return torch.tensor(trt_output, device='cuda')

val_loss, val_accuracy = eval_model(model_fn, validation_dataloader)
print(f'Quantized model with fine-tuned weights with TRT: accuracy={val_accuracy:.3f}, loss={val_loss:.3f}')