# Quantizing a PyTorch Model

Quantization reduces the numerical precision of a neural network's weights and activations, converting them from floating point to lower-precision formats such as integer or fixed point. This can significantly reduce model size and speed up inference, ideally with little loss in the original model's accuracy.
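To make the idea concrete, here is a minimal sketch of a fixed-point round trip in plain Python. It only illustrates the arithmetic and is not Metinor's implementation: the helper names `to_fixed_point` and `from_fixed_point`, the 8-bit width, and the 4 fractional bits are all assumptions chosen for the example.

```python
def to_fixed_point(x: float, total_bits: int = 8, frac_bits: int = 4) -> int:
    """Map x to a signed integer code with `frac_bits` fractional bits (hypothetical helper)."""
    scale = 1 << frac_bits                    # 2**frac_bits steps per unit
    q_min = -(1 << (total_bits - 1))          # smallest representable code
    q_max = (1 << (total_bits - 1)) - 1       # largest representable code
    q = round(x * scale)                      # snap to the nearest step
    return max(q_min, min(q_max, q))          # saturate instead of overflowing


def from_fixed_point(q: int, frac_bits: int = 4) -> float:
    """Recover the approximate real value from an integer code."""
    return q / (1 << frac_bits)


values = [0.1, -0.73, 1.5, 9.0]
codes = [to_fixed_point(v) for v in values]
restored = [from_fixed_point(q) for q in codes]

for v, q, r in zip(values, codes, restored):
    print(f"{v:>6} -> code {q:>4} -> {r}")
```

Values inside the representable range come back with a small rounding error (0.1 becomes 0.125), while 9.0 saturates at 7.9375, the largest value this particular format can hold. Fewer bits shrink storage but widen both kinds of error.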

The quantization package in Metinor offers a straightforward API for applying various quantization strategies to your neural network models. In this tutorial, we will use Metinor to quantize a PyTorch model.

## Supported Quantization Strategies

First, to see which quantization strategies are available, use the `list_quantization_strategies` function:

In [1]:
from metinor.optimizer import list_quantization_strategies

strategies = list_quantization_strategies()
print(strategies)

['WeightOnlyQuantization', 'WeightBiasQuantization']


## Using the Quantization API

The `quantize` function in Metinor's quantization package is the main entry point for quantizing a model. It takes a PyTorch model, the name of a quantization strategy, the type of quantization (`"float"` or `"fixed"` in the examples below), and the precision level, and returns a quantized model.

### Create a Model

First, let's create a simple neural network model using PyTorch. For demonstration purposes, we will use the [LeNet](http://yann.lecun.com/exdb/publis/pdf/lecun-98.pdf) architecture from LeCun et al., 1998.

In [2]:
import torch
import torch.nn as nn
import torch.nn.functional as F


device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

class LeNet(nn.Module):
    def __init__(self):
        super().__init__()
        # 1 input image channel, 6 output channels, 5x5 square conv kernel
        self.conv1 = nn.Conv2d(1, 6, 5)
        self.conv2 = nn.Conv2d(6, 16, 5)
        self.fc1 = nn.Linear(16 * 5 * 5, 120)  # 5x5 image dimension
        self.fc2 = nn.Linear(120, 84)
        self.fc3 = nn.Linear(84, 10)

    def forward(self, x):
        x = F.max_pool2d(F.relu(self.conv1(x)), (2, 2))
        x = F.max_pool2d(F.relu(self.conv2(x)), 2)
        x = torch.flatten(x, 1)  # flatten everything except the batch dimension
        x = F.relu(self.fc1(x))
        x = F.relu(self.fc2(x))
        x = self.fc3(x)
        return x

model = LeNet().to(device=device)
input_shape = (1, 1, 32, 32)

print(model)

LeNet(
  (conv1): Conv2d(1, 6, kernel_size=(5, 5), stride=(1, 1))
  (conv2): Conv2d(6, 16, kernel_size=(5, 5), stride=(1, 1))
  (fc1): Linear(in_features=400, out_features=120, bias=True)
  (fc2): Linear(in_features=120, out_features=84, bias=True)
  (fc3): Linear(in_features=84, out_features=10, bias=True)
)
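The `16 * 5 * 5` input size of `fc1` follows from the 32x32 input and the layer sequence in `forward`. The short sketch below traces the spatial size through each step using the standard output-size formula for unpadded convolutions and pooling; the helper names `conv_out` and `pool_out` are introduced here for illustration only.

```python
def conv_out(size: int, kernel: int, stride: int = 1, padding: int = 0) -> int:
    """Spatial output size of a convolution along one dimension."""
    return (size + 2 * padding - kernel) // stride + 1


def pool_out(size: int, kernel: int = 2) -> int:
    """Spatial output size of max pooling; the stride defaults to the kernel size."""
    return (size - kernel) // kernel + 1


size = 32                    # input images are 32x32
size = conv_out(size, 5)     # conv1 with a 5x5 kernel
after_conv1 = size
size = pool_out(size)        # 2x2 max pool
size = conv_out(size, 5)     # conv2 with a 5x5 kernel
after_conv2 = size
size = pool_out(size)        # 2x2 max pool

fc1_in_features = 16 * size * size   # 16 channels from conv2
print(after_conv1, after_conv2, size, fc1_in_features)
```

This is why `fc1` reports `in_features=400` in the printed summary. A different input resolution would change this number and require a different `fc1`.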


### Import Required Functions

Next, we import the required functions from the `metinor.functional.quantization` module.

In [3]:
# Import quantization functions
from metinor.functional.quantization import (
    quantize,
    quantize_node,
)

import warnings
warnings.filterwarnings("ignore")

To quantize a model, you will use the `quantize` function. This function requires specifying the model, the quantization strategy, the type of quantization, and the precision level.

### Weight-Only Quantization

This strategy quantizes only the weights of the model to the specified precision.

In [4]:
model_quant_weightonly_float16 = quantize(model, "WeightOnlyQuantization", "float", 16)
print(model_quant_weightonly_float16)

LeNet(
  (conv1): QuantConv2d(
    1, 6, kernel_size=(5, 5), stride=(1, 1)
    (input_quant): ActQuantProxyFromInjector(
      (_zero_hw_sentinel): StatelessBuffer()
    )
    (output_quant): ActQuantProxyFromInjector(
      (_zero_hw_sentinel): StatelessBuffer()
    )
    (weight_quant): WeightQuantProxyFromInjector(
      (_zero_hw_sentinel): StatelessBuffer()
      (tensor_quant): RescalingIntQuant(
        (int_quant): IntQuant(
          (float_to_int_impl): RoundSte()
          (tensor_clamp_impl): TensorClampSte()
          (delay_wrapper): DelayWrapper(
            (delay_impl): _NoDelay()
          )
        )
        (scaling_impl): StatsFromParameterScaling(
          (parameter_list_stats): _ParameterListStats(
            (first_tracked_param): _ViewParameterWrapper(
              (view_shape_impl): OverTensorView()
            )
            (stats): _Stats(
              (stats_impl): AbsMax()
            )
          )
          (stats_scaling_impl): _StatsScaling(
      

If you compare the printed model summaries before and after quantization, you will see that `Conv2d` has been replaced by `QuantConv2d`, which contains a `weight_quant` submodule of type `WeightQuantProxyFromInjector`. This indicates that the weights of the model have been quantized. Note that the output shown above is truncated.
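The module names in the printed summary (`AbsMax`, `RoundSte`, `TensorClampSte`) hint at a scheme in which a scale is derived from the largest absolute weight, and weights are then rounded and clamped to an integer range. The following plain-Python sketch shows that general idea under stated assumptions: the function name `absmax_quantize`, the 8-bit width, and the symmetric range are choices made for this example, and it should not be read as Metinor's actual implementation.

```python
def absmax_quantize(weights, bits=8):
    """Symmetric quantization with a scale taken from the largest absolute weight."""
    q_max = (1 << (bits - 1)) - 1                  # 127 for 8 bits
    abs_max = max(abs(w) for w in weights)
    scale = abs_max / q_max if abs_max > 0 else 1.0
    # Round to the nearest integer code, then clamp to the symmetric range.
    codes = [max(-q_max, min(q_max, round(w / scale))) for w in weights]
    return codes, scale


weights = [0.4, -1.0, 0.2, 0.0]
codes, scale = absmax_quantize(weights, bits=8)
dequantized = [c * scale for c in codes]

for w, c, d in zip(weights, codes, dequantized):
    print(f"{w:>5} -> code {c:>4} -> {d:.5f}")
```

Because the scale is tied to the largest weight, that weight maps exactly to the edge of the integer range, and every other weight is recovered to within half a quantization step.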

### Weight-Bias Quantization

This strategy quantizes both the weights and biases of the model to the specified precision.
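As a rough way to compare this with weight-only quantization, the sketch below estimates the theoretical parameter storage of the LeNet model defined earlier. It assumes each parameter is stored at exactly its target bit width and ignores scales and other metadata. Whether the in-memory model actually shrinks depends on how the library stores quantized parameters, which this tutorial does not show.

```python
# (weight count, bias count) per layer, taken from the LeNet definition above
layers = {
    "conv1": (1 * 6 * 5 * 5, 6),
    "conv2": (6 * 16 * 5 * 5, 16),
    "fc1": (400 * 120, 120),
    "fc2": (120 * 84, 84),
    "fc3": (84 * 10, 10),
}

n_weights = sum(w for w, _ in layers.values())
n_biases = sum(b for _, b in layers.values())


def size_bytes(weight_bits: int, bias_bits: int) -> int:
    """Idealized storage for all parameters at the given bit widths."""
    return (n_weights * weight_bits + n_biases * bias_bits) // 8


baseline = size_bytes(32, 32)        # everything in 32-bit floating point
weight_only_8 = size_bytes(8, 32)    # 8-bit weights, 32-bit biases
weight_bias_8 = size_bytes(8, 8)     # 8-bit weights and biases

print(n_weights, n_biases)
print(baseline, weight_only_8, weight_bias_8)
```

For this model the biases are a tiny fraction of the parameters (236 of 61,706), so quantizing them saves only a few hundred bytes beyond weight-only quantization. The choice between the two strategies is therefore more likely to hinge on accuracy and target hardware than on model size.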

In [5]:
model_quant_weightbias_int8 = quantize(model, "WeightBiasQuantization", "fixed", 8)
print(model_quant_weightbias_int8)

LeNet(
  (conv1): Conv2d(1, 6, kernel_size=(5, 5), stride=(1, 1))
  (conv2): Conv2d(6, 16, kernel_size=(5, 5), stride=(1, 1))
  (fc1): Linear(in_features=400, out_features=120, bias=True)
  (fc2): Linear(in_features=120, out_features=84, bias=True)
  (fc3): Linear(in_features=84, out_features=10, bias=True)
)

In the output shown here, the layers are still printed as plain `Conv2d` and `Linear` modules, unlike the `QuantConv2d` wrapper in the weight-only summary, so the printed summary alone does not confirm that quantization was applied.


### Layerwise Quantization

Instead of quantizing the entire model at once, you can also quantize specific layers of the model. This can be useful if you want to apply different quantization strategies to different parts of the model.

In [6]:
from metinor.visualizer import get_graph
from metinor.functional.quantization import quantize_node

# Create model graph with maximum depth
graph = get_graph(model, input_size=input_shape, depth='max', device="cpu")

# Create a dictionary mapping each node_id to its graph node
node_ids = list(graph.id_dict.keys())
node_module_dict = {
    node_id: graph.find_node_by_id(node_id) for node_id in node_ids
}

# Quantize layer by id
node = node_module_dict[node_ids[1]]
print("Node Module: ", node.module_unit)

quantized_module = quantize_node(
    node_ids[1], graph, "WeightOnlyQuantization", "float", 4
)
# print('Quantized Module: ', quantized_module)

# Find node again to verify the change
node = graph.find_node_by_id(node_ids[1])
print("Updated Node: ", node.module_unit)

Node Module:  Conv2d(1, 6, kernel_size=(5, 5), stride=(1, 1))
Updated Node:  QuantConv2d(
  1, 6, kernel_size=(5, 5), stride=(1, 1)
  (input_quant): ActQuantProxyFromInjector(
    (_zero_hw_sentinel): StatelessBuffer()
  )
  (output_quant): ActQuantProxyFromInjector(
    (_zero_hw_sentinel): StatelessBuffer()
  )
  (weight_quant): WeightQuantProxyFromInjector(
    (_zero_hw_sentinel): StatelessBuffer()
    (tensor_quant): RescalingIntQuant(
      (int_quant): IntQuant(
        (float_to_int_impl): RoundSte()
        (tensor_clamp_impl): TensorClampSte()
        (delay_wrapper): DelayWrapper(
          (delay_impl): _NoDelay()
        )
      )
      (scaling_impl): StatsFromParameterScaling(
        (parameter_list_stats): _ParameterListStats(
          (first_tracked_param): _ViewParameterWrapper(
            (view_shape_impl): OverTensorView()
          )
          (stats): _Stats(
            (stats_impl): AbsMax()
          )
        )
        (stats_scaling_impl): _StatsScaling(
 

### Other Quantization Strategies

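In addition to the two strategies demonstrated above, Metinor also supports the following quantization strategies. They do not appear in the `list_quantization_strategies()` output shown earlier, so confirm that they are available in your installed version: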

- **ActivationOnlyQuantization**: Applies quantization only to the activations during inference.
- **WeightBiasActivationQuantization**: Quantizes weights, biases, and activations of the model, offering a comprehensive quantization approach.
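Activation quantization differs from weight quantization in one practical respect: activations depend on the input, so their range has to be estimated from sample data or computed at run time. The sketch below illustrates one common approach, unsigned quantization of non-negative (post-ReLU) activations with a scale calibrated from observed values. The function names, the 8-bit width, and the calibration values are assumptions for illustration and do not describe how Metinor implements these strategies.

```python
def calibrate_scale(samples, bits=8):
    """Pick a scale so the largest observed activation maps to the top code."""
    q_max = (1 << bits) - 1          # 255: unsigned, since ReLU outputs are >= 0
    observed_max = max(samples)
    return observed_max / q_max if observed_max > 0 else 1.0


def quantize_activation(x, scale, bits=8):
    """Round to the nearest code and clamp to the unsigned range."""
    q_max = (1 << bits) - 1
    return max(0, min(q_max, round(x / scale)))


# Activations observed on a (made-up) calibration batch
calibration = [0.0, 0.8, 2.55, 1.2]
scale = calibrate_scale(calibration)

# Activations seen later at inference time; 4.0 exceeds the calibrated range
runtime = [0.0, 1.0, 2.55, 4.0]
codes = [quantize_activation(x, scale) for x in runtime]
print(scale, codes)
```

The last value saturates at the top code because it lies outside the calibrated range, which is why the calibration data should be representative of the inputs seen at inference time.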