# Neural network quantization

Today we will deal with neural network quantization!

Remember that our goal is to reduce network size while keeping the accuracy high!

For this purpose we will use Intel's OpenVINO and the Neural Network Compression Framework (NNCF). Be aware that there are other frameworks to choose from: built-in PyTorch quantization, Brevitas from Xilinx, TensorRT and others.

Use this link for OpenVINO reference and documentation: https://docs.openvino.ai/2023.0/home.html

First, install and import the necessary libraries.

In [None]:
!pip3 install openvino
!pip3 install nncf

In [None]:
import torch
import nncf
import openvino as ov
import time
import numpy as np
import tqdm

from nncf import NNCFConfig
from nncf.torch import create_compressed_model, register_default_init_args
from openvino.runtime.ie_api import CompiledModel
from torch import nn
from torch.utils.data import DataLoader
from torchvision import datasets
from torchvision.transforms import ToTensor, RandomRotation
from typing import Union, List, Tuple, Any
from abc import ABC, abstractmethod


Let's start with...

## Quantizing Models Post-training

Post-training model optimization is the process of applying special methods that transform the model into a more hardware-friendly representation without retraining or fine-tuning. The most popular and widespread method here is 8-bit post-training quantization because:

- It is easy to use.
- It does not hurt accuracy a lot.
- It provides significant performance improvement.
- It suits much of the hardware available off the shelf, since most of it supports 8-bit computation natively.

8-bit integer quantization lowers the precision of weights and activations to 8 bits, which leads to significant reduction in the model footprint and significant improvements in inference speed.

Source: https://docs.openvino.ai/2023.0/ptq_introduction.html
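To make the idea concrete, here is a minimal NumPy sketch of 8-bit quantization applied to a random weight matrix. It assumes the simplest possible scheme (one symmetric scale for the whole tensor); the helper names `quantize_int8` and `dequantize` are made up for illustration, and real toolkits such as NNCF use more refined schemes.

```python
import numpy as np

def quantize_int8(x: np.ndarray):
    """Symmetric per-tensor quantization of a float array to int8 (illustrative)."""
    # One scale for the whole tensor: the largest magnitude maps to 127.
    scale = float(np.max(np.abs(x))) / 127.0
    if scale == 0.0:
        scale = 1.0  # all-zero tensor, any scale works
    q = np.clip(np.round(x / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    """Map int8 codes back to approximate float values."""
    return q.astype(np.float32) * scale

rng = np.random.default_rng(0)
weights = rng.normal(0.0, 0.1, size=(64, 32)).astype(np.float32)

q_weights, scale = quantize_int8(weights)
restored = dequantize(q_weights, scale)

print("float32 bytes:", weights.nbytes)    # 4 bytes per value
print("int8 bytes:   ", q_weights.nbytes)  # 1 byte per value
print("max abs error:", float(np.max(np.abs(weights - restored))))
```

The stored tensor shrinks by a factor of four, and the rounding error of each value stays within about half a quantization step (`scale / 2`).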

So first, we need a model to quantize.
Reuse the CNN model from Laboratory 1 (along with the training loops, metrics, optimizers and loss function).

Train it for 5 epochs on the MNIST dataset. You should get around 90% accuracy.

Name the final trained model `CNN_MNIST`.

In [None]:
class CNN(torch.nn.Module):
  ...

torch_device = ...

train_dataset = ...
test_dataset = ...

train_loader = ...
test_loader = ...

CNN_MNIST = CNN(...)
metric = ...
loss_fcn = ...
optimizer = ...

def train_test_pass():
  ...

def training():
  ...

CNN_MNIST, history = training( ... )

# Alternatively, load the .pth file created during Lab 1 with torch.load.
# Use the following code to upload files to Colab:
# from google.colab import files
# files.upload()


Now - we will quantize this model to INT8.

NNCF enables post-training quantization (PTQ) by adding the quantization layers into the model graph and then using a subset of the training dataset to initialize the parameters of these additional quantization layers.

By default PTQ uses an unannotated dataset to perform quantization. It uses representative dataset items to estimate the range of activation values in a network and then quantizes the network.
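The range-estimation step can be sketched in plain NumPy. This is only an illustration under simple assumptions (plain min/max statistics, unsigned 8-bit codes, random arrays standing in for activations); the helpers `calibrate_range` and `affine_params` are hypothetical names, and NNCF's own range estimators are more elaborate.

```python
import numpy as np

def calibrate_range(batches):
    """Track the smallest and largest activation seen over the calibration batches."""
    lo, hi = np.inf, -np.inf
    for batch in batches:
        lo = min(lo, float(batch.min()))
        hi = max(hi, float(batch.max()))
    return lo, hi

def affine_params(lo: float, hi: float, num_levels: int = 256):
    """Turn a float range into a scale and zero point for unsigned 8-bit codes."""
    lo, hi = min(lo, 0.0), max(hi, 0.0)  # the range must contain zero
    scale = (hi - lo) / (num_levels - 1)
    if scale == 0.0:
        scale = 1.0
    zero_point = int(round(-lo / scale))
    return scale, zero_point

rng = np.random.default_rng(0)
# Stand-in for activations collected from a few unlabeled calibration batches.
calibration_batches = [
    rng.normal(0.5, 1.0, size=(32, 16)).astype(np.float32) for _ in range(10)
]

lo, hi = calibrate_range(calibration_batches)
scale, zero_point = affine_params(lo, hi)

# Quantize and dequantize a new batch with the calibrated parameters.
x = rng.normal(0.5, 1.0, size=(4, 16)).astype(np.float32)
q = np.clip(np.round(x / scale) + zero_point, 0, 255).astype(np.uint8)
x_hat = (q.astype(np.float32) - zero_point) * scale

print(f"range=[{lo:.3f}, {hi:.3f}] scale={scale:.5f} zero_point={zero_point}")
```

Note that no labels are needed anywhere: only the input values matter, which is why an unannotated subset of the training data is enough.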

Create an instance of the `nncf.Dataset` class by passing two parameters:
- `data_source` (a PyTorch loader containing training samples)
- `transform_fn` (a function that makes the data suitable for the API).

Call this instance `calibration_dataset`.

Then, quantize the `CNN_MNIST` model with the `nncf.quantize()` function, which takes two parameters as input: the model and `calibration_dataset`. Call the result `quantized_model`.

In [None]:
def transform_fn(data_item):
    images, _ = data_item
    return images

calibration_dataset = nncf.Dataset( ... )
quantized_model = nncf.quantize( ... )

Finally, we will convert the models to the OpenVINO Intermediate Representation (IR) format.

OpenVINO IR is the proprietary model format of OpenVINO. It is produced by converting a model with the model conversion API, which translates frequently used deep learning operations to their corresponding representations in OpenVINO and tunes them with the associated weights and biases from the trained model. The resulting IR consists of two files:
- `xml` - Describes the model topology.
- `bin` - Contains the weights and binary data.

To do that, we'll need a `dummy_input` tensor filled with random values, of shape:

`[batch_size, channel_number, image_shape[0], image_shape[1]]`

Create the `MNIST_fp32_ir` model with `ov.convert_model`, passing three arguments: the model, the dummy input (`example_input`) and the input shape (`input`). Use the `CNN_MNIST` model.

Then, create the `MNIST_int8_ir` model in the same way using `quantized_model`.

Save both models to files (named `MNIST_fp32_ir.xml` and `MNIST_int8_ir.xml` respectively). Use the `ov.save_model()` function.

Finally, compile both models with the `core.compile_model` function and use the `validate` function to calculate the accuracy of both models.

In [None]:
core = ov.Core()
devices = core.available_devices

dummy_input = torch.randn( ... )
MNIST_fp32_ir = ov.convert_model(..., example_input=..., input=[-1, ..., ..., ...]) #TODO - FILL THE GAPS
MNIST_int8_ir = ov.convert_model(..., example_input=..., input=[-1, ..., ..., ...]) #TODO - FILL THE GAPS
ov.save_model(MNIST_fp32_ir, ...) #TODO - FILL THE FILENAME
ov.save_model(MNIST_int8_ir, ...) #TODO - FILL THE FILENAME

fp32_compiled_model = core.compile_model(MNIST_fp32_ir, devices[0])
int8_compiled_model = core.compile_model(MNIST_int8_ir, devices[0])

def validate(val_loader: torch.utils.data.DataLoader, model: Union[torch.nn.Module, CompiledModel], metric: BaseMetic):

    # Switch to evaluate mode.
    if not isinstance(model, CompiledModel):
        model.eval()
        model.to(torch_device)
    total_accuracy = 0
    samples_num = 0
    with torch.no_grad():
        for images, target in tqdm.tqdm(val_loader):
            images = images.to(torch_device)
            target = target.to(torch_device)

            # Compute the output.
            if isinstance(model, CompiledModel):
                output_layer = model.output(0)
                # OpenVINO expects host data, so pass a CPU NumPy array.
                output = model(images.cpu().numpy())[output_layer]
                output = torch.from_numpy(output).to(torch_device)
            else:
                output = model(images)

            # Accumulate accuracy, weighted by the batch size.
            accuracy = metric(output, target)
            total_accuracy += accuracy.item() * target.shape[0]
            samples_num += target.shape[0]

    return total_accuracy / samples_num

acc1 = validate( ... )
print(f'FP 32 model acc={acc1:.4f}')

acc2 = validate( ... )
print(f'INT 8 model acc={acc2:.4f}')

Is the INT8 model's accuracy similar to the FP32 model's accuracy? We should hope so!

But let's verify what we have saved in terms of memory and what we have gained in throughput!

First, check the size of the OpenVINO IR binary files. You saved both of them on your drive. Is the INT8 model smaller?

In [None]:
!ls -lh #BINARY_FILE_NAME
!ls -lh #BINARY_FILE_NAME
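If you prefer to compare sizes from Python rather than the shell, a small helper like the one below can do it. This is a sketch: `ir_size_mb` is a hypothetical name, and it assumes the `.bin` file sits next to the `.xml` file with the same stem, which is how `ov.save_model()` writes them. Files that do not exist yet are counted as zero bytes.

```python
from pathlib import Path

def ir_size_mb(xml_path: str) -> float:
    """Return the combined size of an IR's .xml and .bin files in megabytes."""
    xml_file = Path(xml_path)
    bin_file = xml_file.with_suffix(".bin")
    # Missing files are skipped, so the result is 0.0 before the models are saved.
    total_bytes = sum(f.stat().st_size for f in (xml_file, bin_file) if f.exists())
    return total_bytes / 1e6

for name in ("MNIST_fp32_ir.xml", "MNIST_int8_ir.xml"):
    print(f"{name}: {ir_size_mb(name):.3f} MB")
```

Most of the difference should show up in the `.bin` files, since that is where the weights live.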

Then, use the following code to benchmark both models. Is the INT8 model faster?

In [None]:
def parse_benchmark_output(benchmark_output: List[str]):
    """Prints the lines of benchmark_app output that report FPS."""
    parsed_output = [line for line in benchmark_output if 'FPS' in line]
    print(*parsed_output, sep='\n')


print('Benchmark FP32 model on CPU')
benchmark_output = ! benchmark_app -m MNIST_fp32_ir.xml -d CPU -api async -t 15 -shape "[1, 1, 28, 28]"
parse_benchmark_output(benchmark_output)

print('Benchmark INT8 model on CPU')
benchmark_output = ! benchmark_app -m MNIST_int8_ir.xml -d CPU -api async -t 15 -shape "[1, 1, 28, 28]"
parse_benchmark_output(benchmark_output)
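To compare the two runs numerically instead of by eye, you could pull the throughput out of the captured lines. The sketch below assumes the number appears directly before the word `FPS` on the throughput line; `extract_fps` is a hypothetical helper, and the sample lines and their numbers are made up for illustration, not measured results.

```python
import re
from typing import List, Optional

def extract_fps(benchmark_output: List[str]) -> Optional[float]:
    """Return the first number that is followed by 'FPS' in the lines, if any."""
    for line in benchmark_output:
        match = re.search(r"([0-9]+(?:\.[0-9]+)?)\s*FPS", line)
        if match:
            return float(match.group(1))
    return None

# Illustrative lines only: the format is assumed and the values are invented.
fp32_lines = ["[ INFO ] Throughput:   1000.00 FPS"]
int8_lines = ["[ INFO ] Throughput:   2500.00 FPS"]

fp32_fps = extract_fps(fp32_lines)
int8_fps = extract_fps(int8_lines)
if fp32_fps and int8_fps:
    print(f"INT8 speed-up over FP32: {int8_fps / fp32_fps:.2f}x")
```

In the notebook, pass the `benchmark_output` captured from each `benchmark_app` run in place of the sample lists.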


Note that we used a very small network on a very simple task. For bigger models and harder tasks, the performance and size differences can be even more significant!

***Extension exercise***

Read about `Quantizing with Accuracy Control` and try to use it on a pretrained network. Use `nncf.quantize_with_accuracy_control`. You can find pretrained networks in `torchvision.models`.

## Quantization-aware Training (QAT)

Training-time model compression improves model performance by applying optimizations (such as quantization) during the training. The training process minimizes the loss associated with the lower-precision optimizations, so it is able to maintain the model’s accuracy while reducing its latency and memory footprint. Generally, training-time model optimization results in better model performance and accuracy than post-training optimization, but it can require more effort to set up.

Quantization-aware Training is a popular method that allows quantizing a model and applying fine-tuning to restore accuracy degradation caused by quantization. In fact, this is the most accurate quantization method.
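A common way to make training "aware" of quantization is to insert fake-quantization operations: values are rounded to the integer grid and immediately converted back to float, while gradients are passed through the rounding as if it were not there (the straight-through estimator). The NumPy sketch below shows that general idea only; the function names are hypothetical and this is not NNCF's actual implementation.

```python
import numpy as np

def fake_quantize(x: np.ndarray, scale: float, qmin: int = -128, qmax: int = 127) -> np.ndarray:
    """Quantize and immediately dequantize, so the output stays in float."""
    q = np.clip(np.round(x / scale), qmin, qmax)
    return (q * scale).astype(np.float32)

def ste_grad(x: np.ndarray, scale: float, upstream: np.ndarray,
             qmin: int = -128, qmax: int = 127) -> np.ndarray:
    """Straight-through estimator: pass the gradient only where x was not clipped."""
    inside = (x / scale >= qmin) & (x / scale <= qmax)
    return upstream * inside

x = np.array([-2.0, -0.26, 0.0, 0.26, 2.0], dtype=np.float32)
scale = 0.01  # step size; the representable range is about [-1.28, 1.27]

y = fake_quantize(x, scale)
grad = ste_grad(x, scale, np.ones_like(x))

print(y)     # values snapped to the int8 grid, out-of-range ones clipped
print(grad)  # 1 where the value was in range, 0 where it was clipped
```

Because the forward pass already contains the rounding error, the optimizer can adjust the weights to compensate for it, which is what restores the accuracy lost to quantization.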

For this part, let's use a slightly harder dataset. For MNIST, the PTQ method was enough, right?

Train your CNN model on the CIFAR10 dataset for 10-20 epochs (google it!). Use the same training loops, metrics, optimizers and loss function.

Name the final trained model `CNN_CIFAR`, convert it to OpenVINO IR and save it to an xml file.

We start our QAT process by creating a compressed model. Just use the following code (fill in the gaps).

In [None]:
train_dataset = ...
test_dataset = ...
train_loader = ...
test_loader = ...
CNN_CIFAR, history = training( ... )

# Save the floating-point model converted to OpenVINO IR
dummy_input = torch.randn(...) # Create dummy_input
CIFAR_fp32_ir = ov.convert_model(...)
ov.save_model(CIFAR_fp32_ir, ...) #TODO - FILL THE FILENAME

# Compress model
nncf_config_dict = {
    "input_info": {"sample_size": [1, ..., ..., ...]}, #Put number of channels, image_size[0] and image_size[1]
    "compression": {
        "algorithm": "quantization",
    },
}
nncf_config = NNCFConfig.from_dict(nncf_config_dict)
nncf_config = register_default_init_args(nncf_config, train_loader)
compression_ctrl, CNN_CIFAR_int8 = create_compressed_model(CNN_CIFAR, nncf_config)

We have our CIFAR CNN model ready for QAT. So... just train it!

Use your `training` function to train the `CNN_CIFAR_int8` model for one more epoch!

Thanks to the NNCF API (part of the OpenVINO toolkit), after creating the compressed model all we need to do is continue training the INT8 model :) We call this process fine-tuning. It is applied to further improve the quantized model's accuracy. Normally, several epochs of tuning are required with a small learning rate, the same one that is usually used at the end of training the original model. No other changes in the training pipeline are required.

In [None]:
CNN_CIFAR_int8_finetuned, history = training( ... ) # just one epoch

Convert the fine-tuned model to OpenVINO IR, save it to xml and compare the sizes of `CIFAR_fp32_ir` and `CIFAR_int8_ir`.

Is the INT8 network smaller?

In [None]:
core = ov.Core()
devices = core.available_devices
dummy_input = ...
CIFAR_int8_ir = ov.convert_model( ... )
ov.save_model( ... )

In [None]:
!ls -lh ...

Finally, compile the models, then validate and benchmark them.

Did the accuracy decrease?
Is the INT8 model faster?

In [None]:
fp32_cifar_compiled_model = core.compile_model( ... )
int8_cifar_compiled_model = core.compile_model( ... )
acc1 = validate( ... )
print( ... )
acc2 = validate( ... )
print( ... )

In [None]:
def parse_benchmark_output(benchmark_output: List[str]):
    """Prints the lines of benchmark_app output that report FPS."""
    parsed_output = [line for line in benchmark_output if 'FPS' in line]
    print(*parsed_output, sep='\n')


print('Benchmark FP32 model on CPU')
benchmark_output = ! benchmark_app -m CIFAR_fp32_ir.xml -d CPU -api async -t 15 -shape "[1, 3, 32, 32]"
parse_benchmark_output(benchmark_output)

print('Benchmark INT8 model on CPU')
benchmark_output = ! benchmark_app -m CIFAR_int8_ir.xml -d CPU -api async -t 15 -shape "[1, 3, 32, 32]"
parse_benchmark_output(benchmark_output)

***Extension exercise***

Compare PTQ and QAT. Create a CNN model and:
- train it for 20 epochs and save it as `CNN_long.pth`
- train it for 15 epochs and save it as `CNN_short.pth`

Then, apply PTQ to the `CNN_long.pth` model and QAT (for 5 epochs) to the `CNN_short.pth` model. Compare the resulting models (in terms of accuracy, size and FPS).