# The Effect of Quantization on Neural Networks


# What are we aiming for?

Our aim is to build a program that quantizes a neural network and, if time allows, to try to compress the network as well. Done carefully, quantization should have only a small effect on the accuracy of the model. Our quantizer should decrease the storage space and memory usage of the pre-trained model and, hopefully, increase its running speed.

Our method is based on the paper *Incremental Network Quantization: Towards Lossless CNNs with Low-Precision Weights* by Aojun Zhou, Anbang Yao, Yiwen Guo, Lin Xu and Yurong Chen. It uses weight partitioning and group-wise weight quantization to convert a pre-trained full-precision convolutional neural network into a low-precision model in which every weight is either a power of two or zero.

### We will look at how quantization affects the following three factors:
 - Storage space usage
 - Memory usage
 - Running speed

But first ...

Let's import some stuff...

In [1]:
import torch
import matplotlib.pyplot as plt
from fastai.metrics import dice
from fastai.vision import get_transforms, SegmentationItemList, to_device, Learner,  validate
from zb_hps_algo.pytorch_models.models.unet import Unet
from pathlib import Path
import gc
import tester #our little tester module
from quantization import INQScheduler #the quantizer object
import quantization #the quantization module

Loading up a dataset and some pretrained models for later...

In [2]:
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")

img_size = 256
zb_dataset_path = Path('/data/home/craig/zb-datasets/')
training_path = zb_dataset_path / 'nn_training' / 'head_finder' / f'{img_size}'
model_save_dir = Path('models/')

classes = ['Background', 'Right', 'Left']
tfms = get_transforms(do_flip=False, flip_vert=False,
                      p_lighting=0.5, max_lighting=0.85,
                      max_rotate=20,
                      max_warp=0,
                      max_zoom=1.2)
data = (SegmentationItemList.from_folder(training_path / 'data', convert_mode='L')#, div=False)
        .split_by_rand_pct(seed=42)
        .label_from_func(lambda x: training_path / 'labels' / x.name, classes=classes)
        .transform(tfms, tfm_y=True, size=img_size)
        .databunch(bs=6)  # 4@512 and 12@256 will just about max out a rtx2070
        .normalize())  # This uses a batch's statistics to normalize the dataset. I don't like this.


model = Unet(1, 3)
model.load_state_dict(torch.load(model_save_dir / "model.tmp"))
model.eval()
learner = Learner(data, model, metrics=[dice], opt_func=torch.optim.SGD)

## How the Quantizer works
We want to represent every weight by an integer *__p__* such that $2^{p}$ is the closest possible approximation of that weight.

To minimise the loss in accuracy, we quantize the weights in increments. After each increment we re-train the model so that the network can compensate for the loss in accuracy.
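
To make the mapping from a weight to its exponent *__p__* concrete, here is a minimal sketch in plain Python. It is not the code from our `quantization` module: the function names are made up for illustration, "closest" is assumed to mean smallest absolute difference, and the sign is kept separately so that negative weights and zero can be represented.

```python
import math

def quantize_weight(w):
    """Return (sign, p) so that sign * 2**p is the power of two closest to w.

    Hypothetical helper: zero is returned as (0, None) because no power
    of two equals zero.
    """
    if w == 0:
        return 0, None
    sign = 1 if w > 0 else -1
    magnitude = abs(w)
    lower = math.floor(math.log2(magnitude))  # 2**lower <= magnitude
    upper = lower + 1                         # magnitude < 2**upper
    # Pick whichever neighbouring power of two is closer (ties go down).
    if magnitude - 2.0 ** lower <= 2.0 ** upper - magnitude:
        return sign, lower
    return sign, upper

def dequantize(sign, p):
    """Recover the float value represented by (sign, p)."""
    return 0.0 if sign == 0 else sign * 2.0 ** p

for w in [0.3, -0.4, 0.0, 1.0]:
    sign, p = quantize_weight(w)
    print(w, "->", (sign, p), "->", dequantize(sign, p))
```

A real implementation would do this on whole tensors at once and would normally restrict *__p__* to a fixed range so that it fits in a few bits.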

### Algorithm

>`Instantiate quantizer
For (number of increments - 1):
    Step quantizer for next increment
    Train network
Step quantizer for final increment` <br>


Using this algorithm, the increments let us specify how much of the network to quantize at each step, which gives us fine control over the process.

In [None]:
increments = [0.5, 0.75, 0.82, 1.0]  # We define the increments, noting that the last increment must ALWAYS be one
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)  # We instantiate an optimizer

quantizer = INQScheduler(optimizer, increments)  # Instantiating the quantizer, which steps
                                                 # through the network in the pre-defined increments

# Quantize and re-train for every increment except the last one
for _ in range(len(increments) - 1):
    quantizer.step()
    learner.fit(5, lr=1e-3)
quantizer.step()  # Final increment: all weights are now quantized, so no re-training follows
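
We do not show the internals of `INQScheduler` here. As a rough sketch of what a single step could look like, assume it follows the partitioning idea from the paper: at each increment, the given fraction of weights with the largest magnitudes is quantized and frozen, and the rest stays in full precision for re-training. The function name and the toy weights below are illustrative only.

```python
def partition_by_magnitude(weights, fraction):
    """Return a mask over `weights`.

    True  = quantize and freeze this weight at the current increment.
    False = keep it in full precision so re-training can still adjust it.
    Hypothetical helper; assumes larger-magnitude weights are quantized first.
    """
    n_quantized = round(fraction * len(weights))
    # Indices sorted from largest to smallest absolute value.
    order = sorted(range(len(weights)), key=lambda i: abs(weights[i]), reverse=True)
    mask = [False] * len(weights)
    for i in order[:n_quantized]:
        mask[i] = True
    return mask

toy_weights = [0.9, -0.05, 0.3, -0.6]
for fraction in [0.5, 0.75, 1.0]:
    print(fraction, partition_by_magnitude(toy_weights, fraction))
```

Because the fractions only grow, a weight that is frozen at one increment stays frozen at every later one, and the final fraction of one freezes everything.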

With our model quantized, we can now compare it to the original model and assess its performance...

In [3]:
#Loading in our pre-quantized model
quant   = Unet(1, 3)
quant.load_state_dict(torch.load(model_save_dir / "model_quant.tmp"))
quant.eval()
quant_learner = Learner(data, quant, metrics=[dice], opt_func=torch.optim.SGD)

#Loading in our pre-trained model
unquant = Unet(1, 3)
unquant.load_state_dict(torch.load(model_save_dir / "model.tmp"))
unquant.eval()
unquant_learner = Learner(data, unquant, metrics=[dice], opt_func=torch.optim.SGD)

quant.to(device)
unquant.to(device)

tester.compare_loss(quant_learner,unquant_learner)

Unquantized loss:	0.0391
Quantized Loss:		0.0408
Percentage Increase:	4.2370%


## Model Size
We can drastically decrease the on-disk size of the network by saving the integer *__p__* for every weight instead of a float. When the model is loaded, each saved integer is converted back into a float using $2^{p}$, recovering a model that is ready to run.

In [4]:
#First we load up our quantized model
model = Unet(1, 3)
model.load_state_dict(torch.load(model_save_dir / "model_quant.tmp"))
model.eval()

#We create a quantizer for it and have it turn the weights into exponent integers
optimizer = torch.optim.SGD(model.parameters(), lr = 1e-3)
quantizer = INQScheduler(optimizer, [1]) #Constructing the quantizer
quantizer.quantize_int()

#Now we can save the model
quantization.save_quantized_model(model, model_save_dir / "saved_qaunt.tmp")

In [5]:
model = Unet(1, 3)
model.load_state_dict(torch.load(model_save_dir / "model_quant.tmp"))
model.eval()

loaded_model = Unet(1, 3)
quantization.load_quantized_model(loaded_model, model_save_dir / "saved_qaunt.tmp")

learnerloaded = Learner(data, loaded_model, metrics=[dice], opt_func=torch.optim.SGD)
learnermodel = Learner(data, model, metrics=[dice], opt_func=torch.optim.SGD)

tester.print_loss(learnermodel,learnerloaded)
tester.compare_size(model_save_dir / "model.tmp", model_save_dir / "saved_qaunt.tmp")

Original model loss:	0.0408
Loaded model loss:	0.0408

Unqauntized Model Size:	62Mb
Quantized Model Size:	15.55Mb
Quantized size is 24.99% of the unquantized size
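
A ratio of roughly 25% is what we would expect if every 32-bit float is replaced by a single byte. The sketch below illustrates that arithmetic with a made-up one-byte encoding (sign in the top bit, biased exponent in the low seven bits, and the byte `0` reserved for a zero weight). It is an assumption for illustration, not necessarily the format that `quantization.save_quantized_model` writes.

```python
import array

EXP_OFFSET = 64  # hypothetical bias: exponents -63..63 map to codes 1..127

def encode(sign, p):
    """Pack (sign, exponent) into one byte; 0 is reserved for a zero weight."""
    if sign == 0:
        return 0
    code = p + EXP_OFFSET
    if not 1 <= code <= 127:
        raise ValueError("exponent out of range for this encoding")
    return code | (0x80 if sign < 0 else 0)

def decode(byte):
    """Turn one packed byte back into a float weight."""
    if byte == 0:
        return 0.0
    sign = -1.0 if byte & 0x80 else 1.0
    return sign * 2.0 ** ((byte & 0x7F) - EXP_OFFSET)

weights = [0.25, -0.5, 0.0, 1.0]                  # already powers of two or zero
pairs = [(1, -2), (-1, -1), (0, None), (1, 0)]    # their (sign, p) form

as_floats = array.array("f", weights)                       # 4 bytes per weight
as_bytes = array.array("B", [encode(s, p) for s, p in pairs])  # 1 byte per weight

ratio = len(as_bytes.tobytes()) / len(as_floats.tobytes())
print(f"packed size is {ratio:.0%} of the float32 size")
print([decode(b) for b in as_bytes])
```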


## Memory and Run Speed
While we were unable to implement it, it is entirely possible to avoid floats at every point in the process. If the entire network is converted to run on integers, the costly float multiplications can be replaced: to multiply two powers of two together, you only need to calculate the sum of the exponents.

$2^{a} \cdot 2^{b} = 2^{c}$

Since in this hypothetical we are storing only the exponents, this becomes even simpler:

$a + b = c$
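
As a small illustration of the idea (we did not implement this in the project), the snippet below checks the exponent rule and shows the closely related trick of multiplying an integer by $2^{p}$ with a bit shift. Both function names are invented for this example, and the shift version assumes the other operand is already an integer.

```python
def multiply_exponents(a, b):
    """Exponent of the product 2**a * 2**b: just an integer addition."""
    return a + b

def shift_multiply(x, p):
    """Multiply the integer x by 2**p using shifts only.

    For negative p this shifts right, which rounds towards negative
    infinity instead of producing a fraction.
    """
    return x << p if p >= 0 else x >> -p

a, b = 3, -5
c = multiply_exponents(a, b)
print(2.0 ** a * 2.0 ** b == 2.0 ** c)  # the float product and the exponent sum agree

print(shift_multiply(12, 2))   # 12 * 4
print(shift_multiply(12, -2))  # 12 / 4
```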

## Future work

Another way to optimize the neural net is to prune it. Pruning can be done by following the steps below.

1) Find a measure of importance and rank the filters accordingly.

2) Prune the weakest (lowest ranked) filter away.

3) Retrain the network.

4) Assess its performance.

5) If performance starts to decrease, stop the process.

These five steps can be repeated iteratively over the entire model.
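
The loop could be sketched roughly as follows. This is future work, so nothing here is implemented in the project: the L1 norm as the importance measure, the `evaluate` and `retrain` callables, and the `tolerance` parameter are all assumptions chosen to keep the example small and runnable.

```python
def l1_norm(filter_weights):
    """Importance measure (assumed): sum of absolute weights in a filter."""
    return sum(abs(w) for w in filter_weights)

def prune_iteratively(filters, evaluate, retrain, tolerance):
    """Remove the weakest filter one at a time until performance drops.

    evaluate(filters) -> score, higher is better (hypothetical callable)
    retrain(filters)  -> filters after re-training (hypothetical callable)
    """
    baseline = evaluate(filters)
    while len(filters) > 1:
        # Steps 1 and 2: rank by importance and drop the lowest-ranked filter.
        weakest = min(range(len(filters)), key=lambda i: l1_norm(filters[i]))
        candidate = filters[:weakest] + filters[weakest + 1:]
        # Step 3: retrain the smaller network.
        candidate = retrain(candidate)
        # Steps 4 and 5: assess, and stop once performance starts to decrease.
        if evaluate(candidate) < baseline - tolerance:
            break
        filters = candidate
    return filters

# Toy stand-ins so the sketch runs without a real network.
toy_filters = [[0.9, -0.8], [0.01, 0.02], [0.5, 0.4], [0.0, 0.1]]
toy_evaluate = lambda fs: min(1.0, sum(l1_norm(f) for f in fs) / 2.0)
toy_retrain = lambda fs: fs  # no-op "re-training"

pruned = prune_iteratively(toy_filters, toy_evaluate, toy_retrain, tolerance=0.05)
print(len(toy_filters), "filters before,", len(pruned), "after")
```

In a real network, `evaluate` would run validation (for example the dice metric used above) and `retrain` would fit the model for a few epochs.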