# Network Quantization in PyTorch 

## Performing post-training quantization in PyTorch 

In the case of PyTorch, there are two different post-training quantization methods: dynamic quantization and static quantization. They differ by when the quantization occurs and have different advantages and disadvantages. 

### Dynamic quantization: quantizing the model at runtime 
First, we will look at dynamic quantization, the simplest form of quantization available in PyTorch. This type of algorithm applies the quantization on weights ahead of time while quantization on activations occurs dynamically during inference. Therefore, dynamic quantization is often used for situations where the model execution is mainly throttled by loading weights while computing matrix multiplication is not an issue. This type of quantization is often used for LSTM or Transformer networks.

Please note that the code we are providing here is based on the official tutorial: https://pytorch.org/tutorials/recipes/recipes/dynamic_quantization.html

### Create a sample LSTM model

In [31]:
import torch
import torch.quantization
import torch.nn as nn

torch.manual_seed(0)  # set the seed for reproducibility

class SampleLSTM(nn.Module):
  """Sample lstm model"""

  def __init__(self,in_dim,out_dim,depth):
     super(SampleLSTM,self).__init__()
     self.lstm = nn.LSTM(in_dim,out_dim,depth)

  def forward(self,inputs,hidden):
     out,hidden = self.lstm(inputs,hidden)
     return out, hidden


#shape parameters
model_dimension=20
sequence_length=10
batch_size=1
lstm_depth=1

# random data for input
inputs = torch.randn(sequence_length,batch_size,model_dimension)
# hidden is actually is a tuple of the initial hidden state and the initial cell state
hidden = (torch.randn(lstm_depth,batch_size,model_dimension), torch.randn(lstm_depth,batch_size,model_dimension))

### Apply quantization

In [32]:
# here is our floating point instance
original_lstm = SampleLSTM(model_dimension, model_dimension, lstm_depth)

# apply quantization on the model
quantized_lstm = torch.quantization.quantize_dynamic(
    original_lstm, {nn.LSTM, nn.Linear}, dtype=torch.qint8
)

# show the changes that were made
print('Original model:')
print(original_lstm)
print('')
print('Quantized model:')
print(quantized_lstm)

Original model:
SampleLSTM(
  (lstm): LSTM(20, 20)
)

Quantized model:
SampleLSTM(
  (lstm): DynamicQuantizedLSTM(20, 20)
)


### Compare model size

In [33]:
import os

# save the model and check the model size
def print_size_of_model(model, label=""):
    torch.save(model.state_dict(), "temp.p")
    size=os.path.getsize("temp.p")
    print("model: ",label,' \t','Size (KB):', size/1e3)
    os.remove('temp.p')
    return size


In [34]:
f=print_size_of_model(original_lstm,"fp32")
q=print_size_of_model(quantized_lstm,"int8")
print("{0:.2f} times smaller".format(f/q))

model:  fp32  	 Size (KB): 14.879
model:  int8  	 Size (KB): 5.791
2.57 times smaller


### Compare inference latency

In [35]:
print("Floating point FP32: ")
%timeit original_lstm.forward(inputs, hidden)

print("Quantized INT8: ")
%timeit quantized_lstm.forward(inputs,hidden)

Floating point FP32: 
318 µs ± 1.36 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
Quantized INT8: 
216 µs ± 1.52 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)


### Compare accuracy

In [36]:
# run the float model
out1, hidden1 = original_lstm(inputs, hidden)
mag1 = torch.mean(abs(out1)).item()
print('mean absolute value of output tensor values in the FP32 model is {0:.5f} '.format(mag1))

# run the quantized model
out2, hidden2 = quantized_lstm(inputs, hidden)
mag2 = torch.mean(abs(out2)).item()
print('mean absolute value of output tensor values in the INT8 model is {0:.5f}'.format(mag2))

# compare them
mag3 = torch.mean(abs(out1-out2)).item()
print('mean absolute value of the difference between the output tensors is {0:.5f} or {1:.2f} percent'.format(mag3,mag3/mag1*100))

mean absolute value of output tensor values in the FP32 model is 0.11877 
mean absolute value of output tensor values in the INT8 model is 0.11880
mean absolute value of the difference between the output tensors is 0.00166 or 1.40 percent


### Static quantization: determining optimal quantization parameters using a representative dataset 

The other type of quantization is called static quantization. Like full integer quantization of TF, this type of quantization minimizes the model performance degradation by estimating the range of numbers that the model interacts with using a representative dataset.

Detailed explanation on static quantization can be found here: https://pytorch.org/tutorials/advanced/static_quantization_tutorial.html

### Create a sample model

In [37]:
# A model with few linear layers 
class SampleLinearModel(torch.nn.Module): 

    def __init__(self): 
        super(SampleLinearModel, self).__init__() 
        # QuantStub converts the incoming floating point tensors into a quantized tensor 
        self.quant = torch.quantization.QuantStub() 
        self.linear1 = torch.nn.Linear(10, 100) 
        self.linear2 = torch.nn.Linear(100, 100) 
        self.linear3 = torch.nn.Linear(100, 100) 
        self.linear4 = torch.nn.Linear(100, 100) 
        self.linear5 = torch.nn.Linear(100, 1) 
        # DeQuantStub converts the given quantized tensor into a tensor in floating point 
        self.dequant = torch.quantization.DeQuantStub() 

    def forward(self, x): 
        # using QuantStub and DeQuantStub operations, we can indicate the region for quantization 
        # point to quantized in the quantized model 
        x = self.quant(x) 
        x = self.linear1(x) 
        x = self.linear2(x) 
        x = self.linear3(x) 
        x = self.linear4(x) 
        x = self.linear5(x) 
        x = self.dequant(x)
        return x 

In [38]:
# Prepare model for static quantization
original_model = SampleLinearModel()
print(original_model)

SampleLinearModel(
  (quant): QuantStub()
  (linear1): Linear(in_features=10, out_features=100, bias=True)
  (linear2): Linear(in_features=100, out_features=100, bias=True)
  (linear3): Linear(in_features=100, out_features=100, bias=True)
  (linear4): Linear(in_features=100, out_features=100, bias=True)
  (linear5): Linear(in_features=100, out_features=1, bias=True)
  (dequant): DeQuantStub()
)


### Apply Quantization

In [39]:
class CustomCalibrationDataset(torch.utils.data.Dataset):
    def __init__(self):
        self.num_samples = 100
        self.data = torch.rand([self.num_samples, 10])
        self.label = torch.rand([self.num_samples, 1])

    def __len__(self):
        return self.num_samples

    def __getitem__(self, idx):
        return self.data[idx], self.label[idx]


calibration_dataset = CustomCalibrationDataset()
calibration_data_loader = torch.utils.data.DataLoader(calibration_dataset)

In [40]:
original_model.eval()
original_model.qconfig = torch.quantization.get_default_qconfig('fbgemm') 
quantized_model = torch.quantization.prepare(original_model) 

quantized_model.eval()
for data, label in calibration_data_loader:
    quantized_model(data)

torch.quantization.convert(quantized_model, inplace=True)
print(quantized_model)

SampleLinearModel(
  (quant): Quantize(scale=tensor([0.0079]), zero_point=tensor([0]), dtype=torch.quint8)
  (linear1): QuantizedLinear(in_features=10, out_features=100, scale=0.017628204077482224, zero_point=57, qscheme=torch.per_channel_affine)
  (linear2): QuantizedLinear(in_features=100, out_features=100, scale=0.012031901627779007, zero_point=61, qscheme=torch.per_channel_affine)
  (linear3): QuantizedLinear(in_features=100, out_features=100, scale=0.00714643020182848, zero_point=64, qscheme=torch.per_channel_affine)
  (linear4): QuantizedLinear(in_features=100, out_features=100, scale=0.005204709246754646, zero_point=58, qscheme=torch.per_channel_affine)
  (linear5): QuantizedLinear(in_features=100, out_features=1, scale=0.00029902311507612467, zero_point=0, qscheme=torch.per_channel_affine)
  (dequant): DeQuantize()
)


### Compare model size

In [41]:
# compare the sizes
f=print_size_of_model(original_model,"fp32")
q=print_size_of_model(quantized_model,"int8")
print("{0:.2f} times smaller".format(f/q))

model:  fp32  	 Size (KB): 129.031
model:  int8  	 Size (KB): 48.261
2.67 times smaller


## Quantization aware training in PyTorch 

QAT in PyTorch goes through the similar process. Throughout training, the necessary calculations are achieved in floating point. However, the intermediate values are clamped and rounded to simulate the effect of quantization. The complete details are available at https://pytorch.org/docs/stable/quantization.html. Let’s look at how to set up a QAT for a PyTorch model. 

### Create a sample model

In [42]:
original_model = SampleLinearModel()

training_dataset = CustomCalibrationDataset()
training_data_loader = torch.utils.data.DataLoader(calibration_dataset, 5)

### Apply Quantization

In [43]:
original_model.train()
original_model.qconfig = torch.quantization.get_default_qconfig('fbgemm') 
quantized_model = torch.quantization.prepare_qat(original_model) 

# train the model
quantized_model.train()
mse_loss = torch.nn.MSELoss()
optimizer = torch.optim.SGD(original_model.parameters(), lr=0.001, momentum=0.9)
for data, label in training_data_loader:
    optimizer.zero_grad()
    pred = quantized_model(data)
    loss = mse_loss(pred, label)
    loss.backward()
    optimizer.step()

quantized_model.eval()
torch.quantization.convert(quantized_model, inplace=True)
print(quantized_model)

SampleLinearModel(
  (quant): Quantize(scale=tensor([0.0079]), zero_point=tensor([0]), dtype=torch.quint8)
  (linear1): QuantizedLinear(in_features=10, out_features=100, scale=0.020053371787071228, zero_point=63, qscheme=torch.per_channel_affine)
  (linear2): QuantizedLinear(in_features=100, out_features=100, scale=0.011381133459508419, zero_point=65, qscheme=torch.per_channel_affine)
  (linear3): QuantizedLinear(in_features=100, out_features=100, scale=0.007688559591770172, zero_point=57, qscheme=torch.per_channel_affine)
  (linear4): QuantizedLinear(in_features=100, out_features=100, scale=0.005497108679264784, zero_point=56, qscheme=torch.per_channel_affine)
  (linear5): QuantizedLinear(in_features=100, out_features=1, scale=0.001096641062758863, zero_point=0, qscheme=torch.per_channel_affine)
  (dequant): DeQuantize()
)


### Compare model size

In [44]:
# compare the sizes
f=print_size_of_model(original_model,"fp32")
q=print_size_of_model(quantized_model,"int8")
print("{0:.2f} times smaller".format(f/q))

model:  fp32  	 Size (KB): 129.031
model:  int8  	 Size (KB): 48.261
2.67 times smaller
