# Quantization

Before starting, you need to download the Tiny-ImageNet-200 dataset with the provided
utility script and run the `prepare_imagefolder_dataset.sh` bash script, which arranges
the data in the directory structure expected by the torchvision `ImageFolder` dataset that we will use later.

You can use it as follows: `prepare_imagefolder_dataset.sh <path to val_annotations.txt> <path to images/>`
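To make the expected layout concrete, here is a minimal Python sketch of the kind of reorganization `ImageFolder` needs (one subdirectory per class). It is not the provided bash script. It assumes `val_annotations.txt` is tab-separated with the file name in the first column and the class ID in the second; the function name `prepare_imagefolder` and the output directory are hypothetical.

```python
import shutil
from pathlib import Path


def prepare_imagefolder(annotations: Path, images_dir: Path, out_dir: Path) -> int:
    """Copy each image into out_dir/<class id>/ and return the number copied."""
    copied = 0
    for line in annotations.read_text().splitlines():
        if not line.strip():
            continue
        # Assumed columns: file name, class ID, then extra fields that are ignored
        file_name, class_id = line.split("\t")[:2]
        class_dir = out_dir / class_id
        class_dir.mkdir(parents=True, exist_ok=True)
        shutil.copy(images_dir / file_name, class_dir / file_name)
        copied += 1
    return copied
```

With this layout, `ImageFolder` derives the class list from the subdirectory names.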

# Preparation

## Selecting the model

For this part of the challenge I chose to work with a vision transformer pretrained with the DINOv2 framework.

First of all, we need to list the available models and choose one. We can do that using the `torch.hub.list` function.

In [None]:
import torch


torch.hub.list('facebookresearch/dinov2')

We can see there are plenty of pretrained models to choose from. The ones that interest us are the models with a linear classifier head trained on the ImageNet-1000 dataset; they are marked with a `_lc` suffix.

Because of the limitations of the hardware I am working with, I chose the base ViT size.

I opted for the version with extra registers because it performs slightly better than the normal ViT with minimal computational overhead.
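The selection criteria above can be expressed as a small filter over the names returned by `torch.hub.list`. This is only a sketch: the helper `select_candidates` is hypothetical, and the sample list is an illustrative subset rather than the full output of the hub.

```python
def select_candidates(names, suffix="_lc", registers=True):
    """Keep entry names with a linear classifier head, optionally with registers."""
    selected = [n for n in names if n.endswith(suffix)]
    if registers:
        selected = [n for n in selected if "_reg" in n]
    return selected


# Illustrative subset of names; the real list comes from torch.hub.list(...)
sample_names = [
    "dinov2_vits14",
    "dinov2_vitb14",
    "dinov2_vitb14_lc",
    "dinov2_vitb14_reg",
    "dinov2_vitb14_reg_lc",
]
print(select_candidates(sample_names))
```

On this sample, the only name left is `dinov2_vitb14_reg_lc`, the model loaded in the next section.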

## Loading the model

Now that we have selected a model, we can load it and prepare it for inference.

The first step is loading it from the Torch Hub. We also set it to "eval mode" to deactivate Dropout layers.
For now, native PyTorch quantization is only supported on CPU, so we keep the model on CPU.

In [None]:
model = torch.hub.load('facebookresearch/dinov2', 'dinov2_vitb14_reg_lc')
model.eval()

## Preparing the data

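We will use the validation set of the Tiny-ImageNet-200 dataset for inference. We first load the list of ImageNet-1000 class IDs, which we will need later to interpret the model outputs, and then the dataset itself.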

In [None]:
# Read the ImageNet-1000 class IDs (first tab-separated field of each line).
# The position of an ID in the list is used later as the model output index.
with open('imagenet-1000_words.txt', "r") as f:
    imagenet_1000_classes = [line.split('\t')[0] for line in f]

In [None]:
import torchvision


dataset = torchvision.datasets.ImageFolder('dataset/')

Since the embedding layer of a Vision Transformer expects image widths and heights that are multiples of the patch size, we have to resize the examples.

I chose to resize the images to 70×70: Tiny-ImageNet-200 images are 64×64, and 70 is the multiple of the patch size (14) closest to 64. Since Tiny-ImageNet-200 is a subset of ImageNet, we also apply the standard ImageNet normalization to the examples.

In [None]:
import torchvision.transforms as T


transforms = T.Compose([
    T.ToTensor(),
    T.Normalize([0.485, 0.456, 0.406], [0.229, 0.224, 0.225]),
    T.Resize((70, 70))
])
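The target size of 70 can be checked with a little integer arithmetic. This is a minimal sketch that assumes 64×64 input images and a patch size of 14; the helper `closest_multiple` is hypothetical and not part of the notebook.

```python
def closest_multiple(size: int, patch: int) -> int:
    """Return the multiple of `patch` closest to `size` (ties round up)."""
    return max(patch, (size + patch // 2) // patch * patch)


side = closest_multiple(64, 14)   # image side after resizing
patches = (side // 14) ** 2       # number of patch tokens per image
print(side, patches)
```

This prints `70 25`: each resized image is split into a 5×5 grid of patches.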

Let's pick a single example from the dataset. This seems to be the image of a fish/lobster.

In [None]:
img, cls = dataset[999]
img

We can use the ID lookup table to see the corresponding human-readable class.

In [None]:
from utils import build_id2cls_map


id2cls = build_id2cls_map("imagenet_words.txt")

id2cls[dataset.classes[cls]]
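The `build_id2cls_map` helper lives in the local `utils` module, which is not shown here. A plausible minimal version could look like the sketch below. It assumes each line of the words file has the form `<id><TAB><name>`; the name `build_id2cls_map_sketch` is hypothetical and only stands in for the real helper.

```python
def build_id2cls_map_sketch(path):
    """Map each class ID to its human-readable name (assumed 'id<TAB>name' lines)."""
    id2cls = {}
    with open(path, "r") as f:
        for line in f:
            line = line.rstrip("\n")
            if not line:
                continue
            # Split on the first tab only, so names may contain commas or spaces
            wnid, _, name = line.partition("\t")
            id2cls[wnid] = name
    return id2cls
```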

## Running the model

We need to transform the image into an example that can be processed by our model.

In [None]:
x = transforms(img)[None, ...]  # add a leading batch dimension of size 1

Now we can forward it through the model (inside a `torch.no_grad` context because we are doing inference, not training).

In [None]:
with torch.no_grad():
    y_hat = model(x)

Now we'll check the model's predictions.

Because the model's linear classifier has been trained on ImageNet-1000, we first have to match the output index to an ID.

Because Tiny-ImageNet-200 only contains a subset of the ImageNet-1000 classes, the indices of the dataset `classes` attribute do not match the model's output indices, so we use a file that maps the ImageNet-1000 class numbers to their IDs.

In [None]:
top_1 = torch.argmax(y_hat).item()
id_predicted_top_1 = imagenet_1000_classes[top_1]
cls_predicted_top_1 = id2cls[id_predicted_top_1]

top_1, id_predicted_top_1, cls_predicted_top_1

The output seems a bit fishy, although it is not very far from the ground truth, so let's check the top-5 outputs.

In [None]:
top_5 = torch.topk(y_hat, 5, dim=1)
id_predicted_top_5 = [imagenet_1000_classes[t.item()] for t in top_5.indices[0]]
cls_predicted_top_5 = [id2cls[id] for id in id_predicted_top_5]

top_5, id_predicted_top_5, cls_predicted_top_5

The model has predicted the right class in the top-5 outputs.
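Instead of comparing the names by eye, the check can be done programmatically by comparing class IDs. The sketch below works on plain Python lists; the helper names `is_top_k_hit` and `top_k_accuracy` are hypothetical and are not taken from `benchmark.py`.

```python
def is_top_k_hit(top_indices, imagenet_ids, target_id):
    """True if the ground-truth class ID is among the predicted top-k indices."""
    return target_id in {imagenet_ids[i] for i in top_indices}


def top_k_accuracy(all_top_indices, imagenet_ids, target_ids):
    """Fraction of examples whose ground-truth ID appears in their top-k predictions."""
    hits = sum(
        is_top_k_hit(top, imagenet_ids, target)
        for top, target in zip(all_top_indices, target_ids)
    )
    return hits / len(target_ids)
```

In this notebook, the call for the single example would be `is_top_k_hit(top_5.indices[0].tolist(), imagenet_1000_classes, dataset.classes[cls])`.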

# Quantization effects

You can run the `benchmark.py` script to get benchmarks for the accuracy, inference time
and model size.

## Accuracy

By quantizing a model we generally trade accuracy for space and compute time.
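To see where the accuracy loss comes from, the sketch below rounds a few floating-point values to 8-bit integers and back. It uses a simplified affine scheme written in pure Python for illustration; it is not PyTorch's implementation, and the function name `quantize_dequantize` is hypothetical.

```python
def quantize_dequantize(values, num_bits=8):
    """Round-trip floats through a simplified affine integer quantization."""
    qmin, qmax = -(2 ** (num_bits - 1)), 2 ** (num_bits - 1) - 1
    # The representable range must include zero
    lo, hi = min(min(values), 0.0), max(max(values), 0.0)
    scale = (hi - lo) / (qmax - qmin) or 1.0
    zero_point = round(qmin - lo / scale)
    restored = []
    for v in values:
        q = round(v / scale) + zero_point
        q = max(qmin, min(qmax, q))  # clamp to the integer range
        restored.append((q - zero_point) * scale)
    return restored, scale


weights = [-1.0, -0.5, 0.0, 0.25, 1.0]
restored, scale = quantize_dequantize(weights)
max_error = max(abs(w - r) for w, r in zip(weights, restored))
print(f"scale={scale:.5f}, max round-trip error={max_error:.5f}")
```

Each value can move by up to about one quantization step, and the step grows with the range of the values, so a few large outliers make every other value less precise.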

We tested the normal model and the models quantized with the `torch.qint8` and
`torch.float16` data types.

The accuracy benchmark gives the following results:

```
===== Running ACCURACY benchmark ======

Evaluating models accuracy using TOP-5
Accuracy (normal):              0.7209
Accuracy (quantized - qint8):   0.0718
Accuracy (quantized - float16): 0.7208
```

While the accuracies of the normal model and the `float16` quantized model are nearly
identical (0.7209 and 0.7208), the result of the `qint8` quantized model is catastrophic (0.0718).

## Inference time

Quantized models are generally faster than the normal versions.

In order to do a quick check of the inference time we can use the `%%timeit` magic command.

In [None]:
%%timeit

with torch.no_grad():
    model(x)

On my machine I get an inference time of about 75 ms for the normal model.
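Outside a notebook, where the `%%timeit` magic is not available, a similar quick check can be done with the standard `timeit` module. The sketch below times a stand-in workload so that it runs on its own; the helper `time_call` and the function `workload` are hypothetical.

```python
import timeit


def time_call(fn, repeat=5, number=10):
    """Return the best average time per call, in milliseconds."""
    timings = timeit.repeat(fn, repeat=repeat, number=number)
    return min(timings) / number * 1000.0


# Stand-in workload; in practice this would wrap the forward pass of the model
def workload():
    return sum(i * i for i in range(10_000))


print(f"{time_call(workload):.3f} ms per call")
```

To time the model, `workload` would be replaced by a function that calls `model(x)` inside a `torch.no_grad()` context.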

Now let's check the effect of quantization on the inference time of the model.

First we use the `torch.quantization` module to convert the weights of our model's `Linear` layers to the `torch.qint8` type.

In [None]:
quantized_model = torch.quantization.quantize_dynamic(model, qconfig_spec={torch.nn.Linear}, dtype=torch.qint8)
quantized_model.eval()

Then we run the cell with the magic function to get an estimate of the inference time.

In [None]:
%%timeit

with torch.no_grad():
    quantized_model(x)

With the quantized model I get an inference time of about three quarters of the normal
inference time.

This result can seem disappointing, but we have to keep in mind that the default
quantization backend can only convert the `Linear` layers in our model.
Operations such as self-attention, GELU, etc. are not quantized.

The benchmark results are the following:

```
===== Running INFERENCE benchmark ======

[-------------------------------------- ViT Forward --------------------------------------]                                                                                     
                                    |  normal  |  quantized - qint8  |  quantized - float16
6 threads: --------------------------------------------------------------------------------
      torch.Size([1, 3, 518, 518])  |    3.1   |          2.1        |           2.7       
      torch.Size([2, 3, 518, 518])  |    4.4   |          3.9        |           4.2       
      torch.Size([4, 3, 518, 518])  |    8.8   |          8.2        |          10.8       
      torch.Size([8, 3, 518, 518])  |   23.1   |         18.2        |          20.7       

Times are in seconds (s).
```

The quantized models are slightly faster than the normal model in most cases (the
`float16` model is actually slower at batch size 4), but the loss in accuracy
for the `qint8` quantized model is not worth the small acceleration we get.
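The relative speedups are easier to read as ratios. The sketch below recomputes them from the timings in the benchmark table above; the dictionary layout and the helper `speedups` are only illustrative.

```python
# Timings in seconds, copied from the benchmark table above (keyed by batch size)
timings = {
    1: {"normal": 3.1, "qint8": 2.1, "float16": 2.7},
    2: {"normal": 4.4, "qint8": 3.9, "float16": 4.2},
    4: {"normal": 8.8, "qint8": 8.2, "float16": 10.8},
    8: {"normal": 23.1, "qint8": 18.2, "float16": 20.7},
}


def speedups(row):
    """Speedup of each quantized variant relative to the normal model (>1 is faster)."""
    return {name: row["normal"] / t for name, t in row.items() if name != "normal"}


for batch_size, row in timings.items():
    formatted = ", ".join(f"{k}: {v:.2f}x" for k, v in speedups(row).items())
    print(f"batch {batch_size}: {formatted}")
```

A ratio below 1 means the quantized variant is slower than the normal model for that batch size.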

The `benchmark.py` results above are more accurate than the quick `%%timeit` check, but
they do not show where the time is spent.

We can run the `torch.autograd.profiler` to get more in-depth results about the
inference timings of the normal and quantized models and find the bottlenecks.

In [None]:
import torch.autograd.profiler as profiler


# Profile the normal model (inference only, so gradients are disabled)
with torch.no_grad(), profiler.profile(with_stack=True, profile_memory=True) as p:
    y_hat = model(x)

print(p.key_averages(group_by_stack_n=5).table(sort_by='self_cpu_time_total'))

# Profile the quantized model in the same way
with torch.no_grad(), profiler.profile(with_stack=True, profile_memory=True) as p:
    y_hat = quantized_model(x)

print(p.key_averages(group_by_stack_n=5).table(sort_by='self_cpu_time_total'))

## Size of the model

One of the main purposes of quantization is running models on edge devices with limited
memory and compute power, so let's check the scale of the size reduction for quantized
models.

We can check the size that the models will occupy with the benchmark script:

```
===== Running SIZE benchmark ======

Estimated size in memory
Size (normal):   disk - 345.026460647583 MB  | memory - 361.698208 MB 
Size (quantized - qint8):        disk - 91.05587959289551 MB  | memory - 6.263808 MB 
Size (quantized - float16):      disk - 345.040864944458 MB  | memory - 6.263808 MB
```

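The size reduction for the file of the `qint8` quantized model is sizeable (about 91 MB
instead of 345 MB), while the `float16` quantized model occupies almost the same size on
disk as the normal model, even though the normal model stores `float32` parameters.

The reported size of the quantized models in memory is noticeably lower, and it is
identical for both of them (about 6.26 MB).

These results are probably explained by some implementation quirks. A plausible but
unverified explanation is that dynamically quantized `Linear` modules keep their weights
in packed buffers rather than in regular parameters, so a memory estimate based on the
model parameters would not count them, and that the `float16` weights may be saved back
as `float32` values.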
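The on-disk figures can be sanity-checked with simple arithmetic: `float32` uses 4 bytes per value and `int8` uses 1, so a model stored entirely in `int8` would take a quarter of the space. The sketch below uses only the numbers reported by the benchmark; the interpretation of the remainder is an assumption, not something the benchmark measures.

```python
normal_mb = 345.026460647583    # float32 model on disk (from the benchmark)
qint8_mb = 91.05587959289551    # qint8 model on disk (from the benchmark)

# Size the file would have if every float32 value were stored as int8
all_int8_mb = normal_mb / 4

compression = normal_mb / qint8_mb
overhead_mb = qint8_mb - all_int8_mb

print(f"compression ratio: {compression:.2f}x")
print(f"size if fully int8: {all_int8_mb:.1f} MB")
print(f"remaining overhead: {overhead_mb:.1f} MB")
```

The `qint8` file is close to the ideal quarter size. The few remaining megabytes would be consistent with the non-`Linear` parameters staying in `float32`, plus quantization metadata.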