# Torch Quantization

Continuing from the previous section, the CNN model runs at floating-point (FP32) precision by default.

In [1]:
import os
from models import initialize_wrapper

dataset_name = 'cifar10'
dataset_path = os.path.expanduser("~/dataset/cifar10")
model_name = 'vgg16_bn'
batch_size = 64
workers = 4

model_wrapper = initialize_wrapper(dataset_name, model_name,
                                    dataset_path, batch_size, workers)
                                 

Files already downloaded and verified
Files already downloaded and verified


Using cache found in /home/zy18/.cache/torch/hub/chenyaofo_pytorch-cifar-models_master


In [2]:
print("FLOAT32 Inference")
model_wrapper.inference()

FLOAT32 Inference
Inference mode: test
 * Acc@1 94.160 Acc@5 99.710


(94.16000366210938, 99.70999908447266)

In fpgaconvnet-torch, we can emulate the effect of quantizing weights and activations by calling [`quantize_model`](https://github.com/Yu-Zhewen/fpgaconvnet-torch/blob/main/quantization/utils.py#L208), which updates the weight parameters and inserts [`QuantAct`](https://github.com/Yu-Zhewen/fpgaconvnet-torch/blob/main/quantization/utils.py#L105) modules between layers to quantize the activations during the forward pass.

We then run the modified model on a calibration set to capture the dynamic range of the activations and determine their scaling factors. Finally, the quantized model is evaluated again on the target validation/test data.
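To make the calibrate-then-quantize idea concrete, here is a minimal sketch of emulated quantization with one shared scaling factor. It is not the fpgaconvnet-torch implementation: the helper name `fake_quantize`, the symmetric signed scheme, and the sample values are all assumptions for illustration, and plain Python lists stand in for tensors.

```python
def fake_quantize(values, width, lo, hi):
    """Snap floats to a signed `width`-bit grid spanning [lo, hi], then return floats.

    Hypothetical helper: assumes a symmetric scheme with a single scale.
    """
    max_abs = max(abs(lo), abs(hi))
    if max_abs == 0.0:
        return list(values), 1.0
    qmax = 2 ** (width - 1) - 1                 # largest positive code, e.g. 127 for 8 bits
    scale = max_abs / qmax                      # float size of one integer step
    quantized = []
    for v in values:
        code = round(v / scale)                 # nearest integer code
        code = max(-qmax - 1, min(qmax, code))  # saturate out-of-range values
        quantized.append(code * scale)          # map back to float: emulation only
    return quantized, scale


# Calibration: record the dynamic range seen on a (made-up) calibration batch.
calibration_batch = [-10.6, -0.75, 0.3, 2.5, 13.8]
lo, hi = min(calibration_batch), max(calibration_batch)

# Evaluation data is then quantized with the calibrated range.
test_batch = [-3.2, 0.04, 1.7, 9.9]
q16, scale16 = fake_quantize(test_batch, 16, lo, hi)
q8, scale8 = fake_quantize(test_batch, 8, lo, hi)

err16 = max(abs(a - b) for a, b in zip(test_batch, q16))
err8 = max(abs(a - b) for a, b in zip(test_batch, q8))
print(f"16-bit step {scale16:.6f}, max error {err16:.6f}")
print(f" 8-bit step {scale8:.6f}, max error {err8:.6f}")
```

The values returned are still ordinary floats that happen to lie on the quantization grid, which is what "emulating" quantization means here: the accuracy effect is reproduced, but no integer arithmetic is performed.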

In [3]:
from quantization.utils import QuantMode, quantize_model

For example, if we would like to quantize both weights and activations into 16-bit fixed point, just run the following code and observe the change in accuracy.

In [4]:
print("NETWORK FP16 Inference")
# reload the model every time a new quantization mode is tested
model_wrapper.load_model()
quantize_model(model_wrapper, {
                'weight_width': 16, 'data_width': 16, 'mode': QuantMode.NETWORK_FP})
model_wrapper.inference("test")

NETWORK FP16 Inference
network weight min: tensor(-0.6226, grad_fn=<MinimumBackward0>)
network weight max: tensor(0.4982, grad_fn=<MaximumBackward0>)
Inference mode: calibrate


Using cache found in /home/zy18/.cache/torch/hub/chenyaofo_pytorch-cifar-models_master


 * Acc@1 100.000 Acc@5 100.000
activation min: tensor(-10.6308)
activation max: tensor(13.8312)
Inference mode: test
 * Acc@1 94.160 Acc@5 99.710


(94.16000366210938, 99.70999908447266)

If a different quantization format is desired, we can replace the values in the dict. Note that each time we switch to a different precision, we call [`load_model`](https://github.com/Yu-Zhewen/fpgaconvnet-torch/blob/main/models/base.py#L17) to re-initialize the model.

In [5]:
print("NETWORK FP8 Inference")
model_wrapper.load_model()
quantize_model(model_wrapper, {
                'weight_width': 8, 'data_width': 8, 'mode': QuantMode.NETWORK_FP})
model_wrapper.inference("test")

NETWORK FP8 Inference
network weight min: tensor(-0.6226, grad_fn=<MinimumBackward0>)
network weight max: tensor(0.4982, grad_fn=<MaximumBackward0>)
Inference mode: calibrate


Using cache found in /home/zy18/.cache/torch/hub/chenyaofo_pytorch-cifar-models_master


 * Acc@1 100.000 Acc@5 100.000
activation min: tensor(-6.9199)
activation max: tensor(9.9050)
Inference mode: test
 * Acc@1 13.270 Acc@5 53.130


(13.270000457763672, 53.130001068115234)

The two quantization schemes above share a single scaling factor across the whole network. We can also apply block floating-point by changing the [`QuantMode`](https://github.com/Yu-Zhewen/fpgaconvnet-torch/blob/main/quantization/utils.py#L9) to [`CHANNEL_BFP`](https://github.com/Yu-Zhewen/fpgaconvnet-torch/blob/main/quantization/utils.py#L12).

In [6]:
print("CHANNEL BFP8 Inference")
model_wrapper.load_model()
quantize_model(model_wrapper, {
                'weight_width': 8, 'data_width': 8, 'mode': QuantMode.CHANNEL_BFP})
model_wrapper.inference("test")

CHANNEL BFP8 Inference
network weight min: tensor(-0.6226, grad_fn=<MinimumBackward0>)
network weight max: tensor(0.4982, grad_fn=<MaximumBackward0>)
Inference mode: calibrate


Using cache found in /home/zy18/.cache/torch/hub/chenyaofo_pytorch-cifar-models_master


 * Acc@1 100.000 Acc@5 100.000
activation min: tensor(-10.6390)
activation max: tensor(13.8376)
Inference mode: test
 * Acc@1 94.090 Acc@5 99.690


(94.08999633789062, 99.69000244140625)
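One plausible reading of the gap between the two 8-bit results is that a single network-wide scale leaves too few quantization levels for channels whose values are small. The sketch below illustrates this with block floating point in its general form: each block shares one power-of-two exponent and stores signed mantissas. The helper names `bfp_quantize` and `max_error` and the sample channels are assumptions for illustration, not the library's `CHANNEL_BFP` code.

```python
import math


def bfp_quantize(block, width):
    """Quantize one block with a shared power-of-two step and signed `width`-bit mantissas.

    Hypothetical helper illustrating block floating point in general.
    """
    max_abs = max(abs(v) for v in block)
    if max_abs == 0.0:
        return list(block)
    qmax = 2 ** (width - 1) - 1
    # Smallest power-of-two step that lets the largest value fit in qmax.
    step = 2.0 ** math.ceil(math.log2(max_abs / qmax))
    return [max(-qmax - 1, min(qmax, round(v / step))) * step for v in block]


def max_error(original, quantized):
    """Largest absolute difference between the original and quantized values."""
    return max(abs(a - b) for a, b in zip(original, quantized))


# Two made-up channels with very different magnitudes.
large_channel = [0.5, -0.25, 0.1]
small_channel = [0.004, -0.002, 0.001]

# Shared scale: everything is treated as a single block.
shared = bfp_quantize(large_channel + small_channel, 8)
shared_small = shared[len(large_channel):]

# Per-channel scale: each channel is its own block.
per_channel_small = bfp_quantize(small_channel, 8)

shared_err = max_error(small_channel, shared_small)
per_channel_err = max_error(small_channel, per_channel_small)
print(f"small channel, shared scale:      max error {shared_err:.6f}")
print(f"small channel, per-channel scale: max error {per_channel_err:.6f}")
```

With the shared scale, the step is set by the large channel, so most of the small channel rounds to zero. Giving each channel its own exponent keeps the step proportional to that channel's range.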

Note that we are only emulating the effect of quantization, so there is no speed-up on CPU/GPU compared with floating-point. In fact, the emulation overhead makes inference run slower.