# Assignment 5: Neural Network Quantization


The goal of this assignment is to reduce the size of a deep neural network, which yields a lighter and potentially faster model. We will rely on PyTorch's functionality for quantizing neural networks. The autonomous driving models from the previous assignments will be used for this purpose.

In our context, the process of quantization will convert the floating point parameters (32-bit, single precision) to integer parameters.
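
To make the float-to-integer conversion concrete, here is a minimal sketch of affine (asymmetric) quantization in plain Python. The function names, the int8 range, and the sample weights are illustrative assumptions and not part of the assignment; PyTorch performs a comparable mapping internally and chooses the scale and zero point with its own observers.

```python
def choose_qparams(values, qmin=-128, qmax=127):
    """Pick a scale and zero point so the value range maps onto [qmin, qmax]."""
    lo = min(min(values), 0.0)  # include 0 so that 0.0 is exactly representable
    hi = max(max(values), 0.0)
    scale = (hi - lo) / (qmax - qmin) or 1.0  # fall back to 1.0 for all-zero input
    zero_point = round(qmin - lo / scale)
    return scale, zero_point

def quantize(values, scale, zero_point, qmin=-128, qmax=127):
    """Map floats to integers, clamping to the representable range."""
    return [max(qmin, min(qmax, round(v / scale) + zero_point)) for v in values]

def dequantize(q_values, scale, zero_point):
    """Map integers back to (approximate) floats."""
    return [(q - zero_point) * scale for q in q_values]

weights = [-0.62, -0.1, 0.0, 0.25, 0.91]  # hypothetical float32 weights
scale, zero_point = choose_qparams(weights)
q = quantize(weights, scale, zero_point)
restored = dequantize(q, scale, zero_point)
print(q)
print([round(r, 4) for r in restored])
```

Each restored value differs from the original by at most about half a quantization step (`scale / 2`), which is the precision given up in exchange for storing one byte per weight instead of four.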

Note that all scripts should be self-contained and executable on *any* machine that has the required libraries installed.

The solutions of the assignment can be delivered as Python notebooks or .py files. The visual results can be delivered as PDF or image files.

**Important**: There is a helpful tutorial on quantization at [https://pytorch.org/tutorials/advanced/static_quantization_tutorial.html](https://pytorch.org/tutorials/advanced/static_quantization_tutorial.html).



## 1. Evaluation Metrics

You will make use of the convolutional neural network from the autonomous driving assignment. The goal is to analyze it in terms of inference time, [FLOPS](https://github.com/facebookresearch/fvcore/blob/main/docs/flop_count.md) (floating point operations), [model size](https://discuss.pytorch.org/t/finding-model-size/130275) (in MB) and accuracy (classification problem). This task does not require training. A pre-trained model from the previous assignments can be employed.

*Note*: You may use the library [ptflops](https://pypi.org/project/ptflops/) to determine the number of floating point operations (FLOPS) of your model.
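
As a sanity check for the numbers a FLOP counter reports, the cost of a single convolution can be worked out by hand. The sketch below uses a hypothetical first layer (3 input channels, 16 output channels, 5x5 kernel, 96x96 input, no padding); these values and the helper names are assumptions for illustration, not the architecture from the previous assignments.

```python
def conv_output_size(size, kernel_size, stride=1, padding=0):
    """Spatial output size of a convolution along one dimension."""
    return (size + 2 * padding - kernel_size) // stride + 1

def conv2d_macs(c_in, c_out, kernel_size, h_out, w_out):
    """Multiply-accumulate operations of one 2D convolution (bias ignored)."""
    return c_out * h_out * w_out * c_in * kernel_size * kernel_size

# Hypothetical layer: 3 -> 16 channels, 5x5 kernel, 96x96 input
h_out = conv_output_size(96, kernel_size=5)
macs = conv2d_macs(c_in=3, c_out=16, kernel_size=5, h_out=h_out, w_out=h_out)
flops = 2 * macs  # one multiplication plus one addition per MAC
print(f"{macs:,} MACs, about {flops:,} FLOPs")
```

Summing this quantity over all layers gives a rough estimate to compare against the library output.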

*Task Output*: The convolutional model from the autonomous driving assignment should be used in order to compute the execution time, FLOPS, model size and accuracy. For that reason, one function for each metric should be created. The same functions will be later used for evaluating the quantized model.

*Important*: The scripts should be **self-contained**.

In [1]:
from imitation_learning import *
from model import *
import numpy as np
import torch

In [2]:
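def load_model(model_path):
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    model = CNN().to(device)
    # map_location lets a checkpoint saved on a GPU load on a CPU-only machine
    model.load_state_dict(torch.load(model_path, map_location=device))
    model.eval()  # inference mode for dropout and batch-norm layers
    return model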

model = load_model('./models/best_model1.pth')

In [3]:
%pip install ptflops

In [4]:
from ptflops import get_model_complexity_info
import time
import os

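# Function to calculate the average inference time per sample
def get_inference_time(model, input_size=(3, 224, 224), num_samples=100):
    # Use the device the model already lives on. A fully quantized model
    # typically exposes no float parameters and runs on the CPU only,
    # hence the CPU fallback.
    device = next((p.device for p in model.parameters()), torch.device("cpu"))
    inputs = torch.randn(num_samples, *input_size, device=device)
    model.eval()

    with torch.no_grad():
        _ = model(inputs[:1])  # warm-up run, excluded from the timing
        if device.type == "cuda":
            torch.cuda.synchronize()  # wait for pending GPU work
        start_time = time.perf_counter()
        for i in range(num_samples):
            _ = model(inputs[i:i+1])
        if device.type == "cuda":
            torch.cuda.synchronize()  # GPU kernels run asynchronously
        end_time = time.perf_counter()

    total_time = end_time - start_time
    avg_time_per_inference = total_time / num_samples
    return avg_time_per_inference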

# Function to calculate FLOPS
def get_flops(model, input_size=(3, 224, 224)):
    macs, params = get_model_complexity_info(model, input_size, as_strings=False, print_per_layer_stat=False)
    flops = macs * 2  # FLOPS are 2x the number of MACs
    return flops

# Function to calculate model size
def get_model_size(model_path):
    size_in_mb = os.path.getsize(model_path) / (1024 * 1024)
    return size_in_mb

# Function to calculate accuracy (assuming a dataset and evaluation function are available)
def get_accuracy(model, dataset_path):
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    model.to(device)
    accuracy = evaluate_model_accuracy(model, dataset_path, device)
    return accuracy
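
The helper `evaluate_model_accuracy` is not defined in the cells shown here and is presumably provided by the earlier assignment code. If it is not available, a function along the following lines could take its place. This is a sketch that assumes the test data comes as a `DataLoader` yielding `(inputs, labels)` batches and that the model outputs one logit per class; the name `classification_accuracy` and the toy data are hypothetical.

```python
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset

def classification_accuracy(model, loader, device="cpu"):
    """Return the top-1 accuracy in percent over all batches of `loader`."""
    model.to(device)
    model.eval()
    correct = 0
    total = 0
    with torch.no_grad():
        for inputs, labels in loader:
            logits = model(inputs.to(device))
            preds = logits.argmax(dim=1).cpu()
            correct += (preds == labels).sum().item()
            total += labels.numel()
    return 100.0 * correct / max(total, 1)

# Toy check: an identity "model" on one-hot rows predicts the row index
toy_loader = DataLoader(
    TensorDataset(torch.eye(4), torch.tensor([0, 1, 2, 0])), batch_size=2
)
toy_accuracy = classification_accuracy(nn.Identity(), toy_loader)
print(f"Accuracy: {toy_accuracy:.2f}%")
```

For a quantized model, `device` should stay `"cpu"`.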

In [5]:
# Define input size based on model requirements
input_size = (3, 96, 96)
model_path = './models/best_model1.pth'
# Compute metrics
inference_time = get_inference_time(model, input_size=input_size)
flops = get_flops(model, input_size=input_size)
model_size = get_model_size(model_path)
# accuracy = get_accuracy(model, './data/test_dataset')  # Assuming the path to the dataset

# Print results
print(f"Inference Time (s): {inference_time:.6f}")
print(f"FLOPS: {flops / 1e9:.2f} GFLOPS")
print(f"Model Size (MB): {model_size:.2f}")
# print(f"Accuracy: {accuracy:.2f}%")
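
The file size above includes serialization overhead. As a cross-check for the float model, the size can also be estimated from the parameters and buffers held in memory, in the spirit of the forum thread linked in the task description. The function name and the toy layer below are assumptions for illustration.

```python
import torch
from torch import nn

def in_memory_size_mb(model):
    """Estimate the model size in MB from its parameters and buffers."""
    param_bytes = sum(p.numel() * p.element_size() for p in model.parameters())
    buffer_bytes = sum(b.numel() * b.element_size() for b in model.buffers())
    return (param_bytes + buffer_bytes) / (1024 ** 2)

# Toy layer: 10 * 5 weights + 5 biases = 55 float32 values = 220 bytes
toy = nn.Linear(10, 5)
size_mb = in_memory_size_mb(toy)
print(f"{size_mb * 1024 ** 2:.0f} bytes")
```

This estimate is less suitable for quantized models, whose packed weights are not exposed as regular parameters; comparing the saved files is the more reliable route there.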

## 2. Static Quantization

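In this task, the parameters of the feed-forward model will be quantized. To reach this goal, PyTorch's experimental quantization functions will be used, such as `torch.quantization` (exposed as `torch.ao.quantization` in recent releases). The quantization is static and applied after training, so it does not involve any training process; only a short calibration pass over sample inputs is needed.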

See the PyTorch documentation on [quantization](https://pytorch.org/docs/stable/quantization.html#post-training-static-quantization).

*Task Output*: The weights of the model are float32 variables. They should be converted to int8 (i.e., 8-bit integers). Then the execution time, FLOPS, model size and accuracy should be computed and compared to the original model.

*Important*: The scripts should be **self-contained**.
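
The eager-mode workflow consists of five steps: attach a quantization configuration, fuse layers, insert observers, calibrate, and convert. The sketch below walks through them on a toy network. `TinyNet`, its layer names, and the random calibration data are assumptions for illustration; the real CNN from the previous assignments would need `QuantStub` and `DeQuantStub` at the start and end of its `forward` method in the same way.

```python
import torch
from torch import nn
from torch.ao.quantization import (
    DeQuantStub, QuantStub, convert, fuse_modules, get_default_qconfig, prepare,
)

class TinyNet(nn.Module):
    """Toy stand-in for the real CNN (hypothetical architecture)."""
    def __init__(self):
        super().__init__()
        self.quant = QuantStub()      # float -> int8 at the input
        self.conv1 = nn.Conv2d(3, 8, kernel_size=3)
        self.relu1 = nn.ReLU()
        self.dequant = DeQuantStub()  # int8 -> float at the output

    def forward(self, x):
        x = self.quant(x)
        x = self.relu1(self.conv1(x))
        return self.dequant(x)

# Pick a backend supported on this machine (fbgemm on x86, qnnpack on ARM)
engines = torch.backends.quantized.supported_engines
backend = "fbgemm" if "fbgemm" in engines else "qnnpack"
torch.backends.quantized.engine = backend

model = TinyNet().eval()
model.qconfig = get_default_qconfig(backend)        # 1. configuration
fused = fuse_modules(model, [["conv1", "relu1"]])   # 2. fuse conv + relu
prepared = prepare(fused)                           # 3. insert observers
with torch.no_grad():                               # 4. calibrate
    for _ in range(8):
        prepared(torch.randn(1, 3, 32, 32))  # placeholder for real samples
quantized = convert(prepared)                       # 5. convert to int8

out = quantized(torch.randn(1, 3, 32, 32))
print(out.shape, out.dtype)
```

The output is a float tensor again because `DeQuantStub` converts the int8 activations back at the end of the forward pass.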

In [8]:
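def quantize_model(model):
    # The fbgemm backend targets x86 CPUs, so quantize on the CPU
    model = model.cpu().eval()
    model.qconfig = torch.ao.quantization.get_default_qconfig('fbgemm')
    # Layer names must match the attribute names defined in CNN. CNN.forward
    # is assumed to wrap its computation in QuantStub/DeQuantStub.
    model_fused = torch.ao.quantization.fuse_modules(model, [['conv1', 'relu1'], ['conv2', 'relu2']])
    model_prepared = torch.ao.quantization.prepare(model_fused, inplace=True)
    # Calibration pass to collect activation statistics. Random inputs are a
    # placeholder; representative dataset samples give better ranges.
    with torch.no_grad():
        for _ in range(100):
            input_tensor = torch.randn(1, 3, 96, 96)
            model_prepared(input_tensor)
    model = torch.ao.quantization.convert(model_prepared, inplace=True)
    return model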

# Quantize the model
quantized_model = load_model(model_path)  # Reload the model to avoid inplace changes affecting original model
quantized_model = quantize_model(quantized_model)

# Save quantized model to file to get its size
quantized_model_path = './models/quantized_model.pth'
torch.save(quantized_model.state_dict(), quantized_model_path)

In [9]:
quantized_inference_time = get_inference_time(quantized_model, input_size=input_size)
quantized_flops = get_flops(quantized_model, input_size=input_size)
quantized_model_size = get_model_size(quantized_model_path)
# quantized_accuracy = get_accuracy(quantized_model, './data/test_dataset')


print(f"Inference Time (s): {quantized_inference_time:.6f}")
print(f"FLOPS: {quantized_flops / 1e9:.2f} GFLOPS")
print(f"Model Size (MB): {quantized_model_size:.2f}")
# print(f"Accuracy: {quantized_accuracy:.2f}%")

## 3. Dynamic Quantization (optional)

In this task, the parameters of the feed-forward model will be quantized and trained at the same time. To reach this goal, PyTorch's experimental quantization functions will be used, such as `torch.quantization`. This type of training is called quantization-aware training (QAT).

See the PyTorch documentation on [quantization](https://pytorch.org/docs/stable/quantization.html#post-training-static-quantization).

*Note*: This task is optional and not required to pass the lab course.

*Task Output*: The already quantized model from the previous task will be used to conduct the training process. The model should be trained until convergence. Then the execution time, FLOPS, model size and accuracy should be computed and compared to the static quantization and the original model.

*Important*: Quantization is available in the eager mode of PyTorch. This may require installing a different version of PyTorch.

*Important*: The scripts should be **self-contained**.
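
A minimal sketch of the QAT loop in eager mode is given below. It starts from a toy float model and random data, both of which are assumptions for illustration; in the assignment, the driving CNN and its training set would take their place. `prepare_qat` inserts fake-quantization modules so that training sees the rounding effects, and `convert` produces the int8 model afterwards.

```python
import torch
from torch import nn
from torch.ao.quantization import (
    DeQuantStub, QuantStub, convert, get_default_qat_qconfig, prepare_qat,
)

class TinyClassifier(nn.Module):
    """Toy stand-in for the real model (hypothetical architecture)."""
    def __init__(self):
        super().__init__()
        self.quant = QuantStub()
        self.fc = nn.Linear(4, 3)
        self.dequant = DeQuantStub()

    def forward(self, x):
        return self.dequant(self.fc(self.quant(x)))

engines = torch.backends.quantized.supported_engines
backend = "fbgemm" if "fbgemm" in engines else "qnnpack"
torch.backends.quantized.engine = backend

torch.manual_seed(0)
model = TinyClassifier().train()  # prepare_qat expects training mode
model.qconfig = get_default_qat_qconfig(backend)
qat_model = prepare_qat(model)    # inserts fake-quantization modules

# Random placeholder data: 32 samples, 4 features, 3 classes
inputs = torch.randn(32, 4)
labels = torch.randint(0, 3, (32,))
optimizer = torch.optim.SGD(qat_model.parameters(), lr=0.1)
loss_fn = nn.CrossEntropyLoss()

for _ in range(20):  # in practice: train until convergence
    optimizer.zero_grad()
    loss = loss_fn(qat_model(inputs), labels)
    loss.backward()
    optimizer.step()

qat_model.eval()
int8_model = convert(qat_model)   # replace float modules with int8 ones
logits = int8_model(inputs)
print(logits.shape, logits.dtype)
```

The converted model can then be passed to the same metric functions as the original and the statically quantized model.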