# Linear Quantization
Quantization is the process of mapping a large set of values to a smaller set. For example, applying 8-bit linear quantization to the following matrix:
$$
\begin{bmatrix}
191.6 & -13.5 & 728.6 \\
92.14 & 295.5 & -184 \\
0 & 684.6 & 245.5
\end{bmatrix}
$$

**Quantized Matrix:**
$$
\begin{bmatrix}
-23 & -81 & 127 \\
-51 & 6 & -128 \\
-77 & 114 & -8
\end{bmatrix}
$$

We map the most positive number in the matrix **(728.6)** to the maximum value that **int8** can store, which is **127**, and the most negative number **(-184)** to **-128**. The remaining values follow from a linear mapping between these two endpoints. After this we can delete the original tensor to free up memory, keeping only the quantized tensor together with the two parameters used for the linear mapping: **s (scale)** and **z (zero point)**.
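
The mapping described above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the implementation used by any library: the helper name `linear_quantize` is hypothetical, and the scale and zero point are computed with the formulas given in the theory section of this document.

```python
# Minimal sketch of 8-bit linear quantization (hypothetical helper, no dependencies).
original = [
    [191.6, -13.5, 728.6],
    [92.14, 295.5, -184.0],
    [0.0, 684.6, 245.5],
]

Q_MIN, Q_MAX = -128, 127  # representable range of int8


def linear_quantize(matrix, q_min=Q_MIN, q_max=Q_MAX):
    """Return (quantized matrix, scale, zero point) for a nested list of floats."""
    flat = [v for row in matrix for v in row]
    r_min, r_max = min(flat), max(flat)
    scale = (r_max - r_min) / (q_max - q_min)
    zero_point = int(round(q_min - r_min / scale))
    # q = round(r / s + z), clamped to the int8 range
    quantized = [
        [max(q_min, min(q_max, int(round(v / scale + zero_point)))) for v in row]
        for row in matrix
    ]
    return quantized, scale, zero_point


quantized, scale, zero_point = linear_quantize(original)
print(quantized)          # should match the quantized matrix shown above
print(scale, zero_point)
```

Only `quantized`, `scale`, and `zero_point` need to be kept; the original floats can be discarded.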

## How can we go the other way back to the original tensor?
We can invert the mapping, but we won't recover exactly the same values, because rounding during quantization discards information. Applying the inverse linear mapping, $r = s(q - z)$, to the quantized matrix gives the dequantized matrix.

**Dequantized Matrix:**
$$
\begin{bmatrix}
193.2 & -14.3 & 730.1 \\
93.1 & 297 & -182.5 \\
0 & 683.6 & 246.9
\end{bmatrix}
$$

**Error Matrix:**
$$
\begin{bmatrix}
1.66 & 0.82 & 1.48 \\
0.91 & 1.54 & 1.48 \\
0 & 1.04 & 1.44
\end{bmatrix}
$$

We can see that the values of the original matrix and the dequantized matrix are approximately the same. The error matrix is the absolute difference between the original and dequantized matrices. The error is not zero, but it is small compared with the magnitude of most of the values.
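
The round trip can be checked with a short sketch. It reuses the quantized matrix from this example and recomputes the scale from the extreme values; `linear_dequantize` is a hypothetical helper name, not a library function.

```python
# Dequantize the example matrix and measure the round-trip error.
original = [
    [191.6, -13.5, 728.6],
    [92.14, 295.5, -184.0],
    [0.0, 684.6, 245.5],
]
quantized = [[-23, -81, 127], [-51, 6, -128], [-77, 114, -8]]

scale = (728.6 - (-184.0)) / (127 - (-128))  # s = (r_max - r_min) / (q_max - q_min)
zero_point = -77                             # z = int(round(q_min - r_min / s))


def linear_dequantize(q_matrix, scale, zero_point):
    """Apply r = s * (q - z) element-wise."""
    return [[scale * (q - zero_point) for q in row] for row in q_matrix]


dequantized = linear_dequantize(quantized, scale, zero_point)
errors = [
    [abs(r - d) for r, d in zip(r_row, d_row)]
    for r_row, d_row in zip(original, dequantized)
]
max_error = max(max(row) for row in errors)

for row in errors:
    print([round(e, 2) for e in row])
print("largest error:", round(max_error, 2), "scale:", round(scale, 2))
```

In this example every error stays below one quantization step (the scale).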

In [None]:
!pip install transformers
!pip install quanto
!pip install sentencepiece



In [None]:
model_name = "google/flan-t5-small"

Flan-T5 is a variant of T5 (Text-to-Text Transfer Transformer), a widely used transformer-based model for natural language processing (NLP) tasks. Flan-T5 is T5 fine-tuned on a large collection of tasks phrased as instructions, which improves its zero-shot and few-shot performance, that is, its ability to handle a new task when given no examples or only a few examples in the prompt.

In [None]:
import sentencepiece as spm
from transformers import T5Tokenizer, T5ForConditionalGeneration

tokenizer = T5Tokenizer.from_pretrained("google/flan-t5-small")

Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


The tokenizer converts the text into a sequence of token IDs that the model can process.

In [None]:
model = T5ForConditionalGeneration.from_pretrained("google/flan-t5-small")

In [None]:
input_text = "Hello, my name is "
input_ids = tokenizer(input_text, return_tensors="pt").input_ids

outputs = model.generate(input_ids)
print(tokenizer.decode(outputs[0]))

<pad> annie scott</s>


In [None]:
import torch

def named_module_tensors(module, recurse=False):
    """Yield (name, tensor) pairs for every parameter and buffer of `module`."""
    for name, val in module.named_parameters(recurse=recurse):
        # Quantized parameters are detected through their `_data` and `_scale`
        # attributes; report those tensors so sizes reflect the stored dtypes.
        if hasattr(val, "_data") or hasattr(val, "_scale"):
            if hasattr(val, "_data"):
                yield name + "._data", val._data
            if hasattr(val, "_scale"):
                yield name + "._scale", val._scale
        else:
            yield name, val

    for named_buffer in module.named_buffers(recurse=recurse):
        yield named_buffer

def dtype_byte_size(dtype):
    """
    Returns the size (in bytes) occupied by one parameter of type `dtype`.
    """
    import re
    if dtype == torch.bool:
        return 1 / 8
    bit_search = re.search(r"[^\d](\d+)$", str(dtype))
    if bit_search is None:
        raise ValueError(f"`dtype` is not a valid dtype: {dtype}.")
    bit_size = int(bit_search.groups()[0])
    return bit_size // 8

def compute_module_sizes(model):
    """
    Compute the size of each submodule of a given model.
    """
    from collections import defaultdict
    module_sizes = defaultdict(int)
    for name, tensor in named_module_tensors(model, recurse=True):
        size = tensor.numel() * dtype_byte_size(tensor.dtype)
        name_parts = name.split(".")
        # Add this tensor's size to every enclosing module, from the root ("") down.
        for idx in range(len(name_parts) + 1):
            module_sizes[".".join(name_parts[:idx])] += size

    return module_sizes

In [None]:
module_sizes = compute_module_sizes(model)
print(f"The model size is {module_sizes[''] * 1e-9} GB")

The model size is 0.307844608 GB


Flan-T5-small contains roughly 77 million parameters, each stored in FP32 (8 bits = 1 byte, so 32 bits = 4 bytes). 77x$10^6$x4 ≈ 308 million bytes = 308 MB ≈ 0.3 GB, which matches the size reported above.
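
The same arithmetic can be turned around to estimate the footprint at other precisions. The sketch below is a back-of-the-envelope estimate that assumes every stored value counted above is FP32. The BF16 and int8 figures are idealized, because they assume every tensor is converted.

```python
# Back-of-the-envelope footprint estimate, assuming every stored value is FP32.
reported_bytes = 307_844_608       # 0.307844608 GB, from the output above
n_params = reported_bytes // 4     # 4 bytes per FP32 value

bytes_per_value = {"fp32": 4, "bf16": 2, "int8": 1}
estimates_gb = {
    name: n_params * size * 1e-9 for name, size in bytes_per_value.items()
}

print(f"{n_params:,} parameters")
for name, gb in estimates_gb.items():
    print(f"{name}: {gb:.3f} GB")
```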

## Quantizing the Model

In [None]:
from quanto import quantize, freeze

The model has many kinds of layers, and here we quantize only the linear layers. To quantize the model we just need to call the **quantize()** function of the quanto library. We quantize only the weights to int8 and leave the activations unquantized (`activations=None`).

In [None]:
quantize(model, weights=torch.int8, activations=None)

In [None]:
print(model)

T5ForConditionalGeneration(
  (shared): Embedding(32128, 512)
  (encoder): T5Stack(
    (embed_tokens): Embedding(32128, 512)
    (block): ModuleList(
      (0): T5Block(
        (layer): ModuleList(
          (0): T5LayerSelfAttention(
            (SelfAttention): T5Attention(
              (q): QLinear(in_features=512, out_features=384, bias=False)
              (k): QLinear(in_features=512, out_features=384, bias=False)
              (v): QLinear(in_features=512, out_features=384, bias=False)
              (o): QLinear(in_features=384, out_features=512, bias=False)
              (relative_attention_bias): Embedding(32, 6)
            )
            (layer_norm): T5LayerNorm()
            (dropout): Dropout(p=0.1, inplace=False)
          )
          (1): T5LayerFF(
            (DenseReluDense): T5DenseGatedActDense(
              (wi_0): QLinear(in_features=512, out_features=1024, bias=False)
              (wi_1): QLinear(in_features=512, out_features=1024, bias=False)
              

Here we can see that all the Linear layers have been replaced by QLinear (quantized linear) layers. To obtain the final quantized model, we then call the **freeze()** function.

In [None]:
freeze(model)

In [None]:
module_sizes = compute_module_sizes(model)
print(f"The model size is {module_sizes[''] * 1e-9} GB")

The model size is 0.12682868 GB
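
The reduction is smaller than the 4x that int8 would give if every tensor were converted. A rough check, sketched below, suggests why: the `Embedding(32128, 512)` table visible in the printed model is not a linear layer, so it is assumed here to stay in FP32. The sketch also assumes that `shared` and `embed_tokens` refer to one stored tensor. Both are assumptions, not something the outputs above confirm.

```python
fp32_gb = 0.307844608   # size reported before quantization
int8_gb = 0.12682868    # size reported after quantize() and freeze()
ratio = fp32_gb / int8_gb

# Embedding(32128, 512) from the printed model, assumed to remain FP32
# (4 bytes per value) and to be stored once.
embedding_gb = 32128 * 512 * 4 * 1e-9

print(f"compression ratio: {ratio:.2f}x")
print(f"embedding table:   {embedding_gb:.3f} GB")
print(f"share of quantized model: {embedding_gb / int8_gb:.0%}")
```

Under these assumptions the unquantized embedding table alone accounts for about half of the quantized model's size.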


Let's check whether there is any performance degradation.

In [None]:
input_text = "Hello, my name is "
input_ids = tokenizer(input_text, return_tensors="pt").input_ids

outputs = model.generate(input_ids)
print(tokenizer.decode(outputs[0]))

<pad> annie scott</s>


# Theory for Linear Quantization
Formula: **$r = s(q-z)$**, where $r$ is the original value (FP32), $s$ is the scale (FP32), $q$ is the quantized value (int8), and $z$ is the zero point (int8).<br>
Linear quantization maps the floating-point range [$r_{min}$, $r_{max}$] to the quantized range [$q_{min}$, $q_{max}$].<br>
At the extreme values, the formula gives:<br>
**$r_{min} = s(q_{min} - z)$** and **$r_{max} = s(q_{max} - z)$**

Subtracting the first equation from the second gives the scale $s$:<br>
$s = (r_{max} - r_{min}) / (q_{max} - q_{min})$

The zero point $z$ must be rounded, since it is an n-bit integer:<br>
$z = \mathrm{int}(\mathrm{round}(q_{min} - r_{min}/s))$
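
Written out step by step, the derivation and a numeric check against the example matrix from the first section look as follows. The last line, which solves $r = s(q - z)$ for $q$ and clamps the result, is an assumption about how the quantization step itself is carried out, since the text above only states the dequantization formula.

$$
\begin{aligned}
r_{max} - r_{min} &= s(q_{max} - z) - s(q_{min} - z) = s(q_{max} - q_{min}) \\
s &= \frac{r_{max} - r_{min}}{q_{max} - q_{min}} = \frac{728.6 - (-184)}{127 - (-128)} = \frac{912.6}{255} \approx 3.579 \\
r_{min} = s(q_{min} - z) \;&\Rightarrow\; z = q_{min} - \frac{r_{min}}{s} \approx -128 + 51.41 = -76.59, \quad \mathrm{round}(z) = -77 \\
q &= \mathrm{clamp}\left(\mathrm{round}\left(\frac{r}{s} + z\right),\; q_{min},\; q_{max}\right)
\end{aligned}
$$

For instance, $r = 0$ gives $q = \mathrm{round}(0 - 77) = -77$, which is the bottom-left entry of the quantized matrix.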

## Intermediate State
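The quanto library creates an intermediate state after we call quantize. We then call freeze to get the quantized weights. This intermediate state is useful for two things:
1. Calibration: when we run inference on a model by passing it an input (an image, a text, etc.), the activations of the model vary depending on the input. The intermediate state lets us observe them in order to get good linear quantization parameters.
2. Quantization-aware training: the intermediate state holds both the quantized weights and the original unquantized weights, which are both needed during training.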

## Calibration
1. Calibrate the model when quantizing its activations:
- The range of activation values depends on the input that was given.
- e.g. a different input text will generate different activations.
2. The min/max of the activation ranges are used to perform linear quantization.
3. How do we get the min and max range of the activations?
- Gather sample input data.
- Run inference.
- Calculate the min/max of the activations.
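
The three calibration steps above can be sketched without any framework. Everything here is a toy under stated assumptions: `toy_layer` is a hypothetical stand-in for a real model layer, `MinMaxObserver` is a hypothetical helper, and the sample inputs are made up purely for illustration. None of this reflects how quanto implements calibration.

```python
import math


def toy_layer(x):
    """Hypothetical stand-in for one model layer: returns its activations."""
    return [math.tanh(v) * 3.0 + 0.5 * v for v in x]


class MinMaxObserver:
    """Tracks the running min/max of activations seen during calibration."""

    def __init__(self):
        self.r_min = float("inf")
        self.r_max = float("-inf")

    def update(self, activations):
        self.r_min = min(self.r_min, min(activations))
        self.r_max = max(self.r_max, max(activations))

    def qparams(self, q_min=-128, q_max=127):
        # Same formulas as in the theory section.
        scale = (self.r_max - self.r_min) / (q_max - q_min)
        zero_point = int(round(q_min - self.r_min / scale))
        return scale, zero_point


# 1. Gather sample input data (illustrative values only).
calibration_samples = [[-2.0, 0.1, 1.5], [0.3, 4.0, -0.7], [2.2, -3.1, 0.0]]

# 2. Run inference and 3. record the min/max of the activations.
observer = MinMaxObserver()
for sample in calibration_samples:
    observer.update(toy_layer(sample))

scale, zero_point = observer.qparams()
print(observer.r_min, observer.r_max)
print(scale, zero_point)
```

A different set of samples would give a different range and therefore different parameters, which is why the calibration data should resemble the inputs expected at inference time.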

## Quantization Aware Training
Training in a way that controls how the model performs once it is quantized.
1. The intermediate state holds both:
- A quantized version of the weights.
- The original unquantized weights.
2. Forward pass (inference)
- Use the quantized version of the model weights to make predictions, e.g. in BF16.
3. Back propagation (updating model weights)
- Update the original, unquantized version of the model weights, e.g. in FP32.
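
A toy sketch of this loop for a single weight is shown below. It makes several assumptions that the list above does not specify. The gradient computed with the quantized weight is applied directly to the full-precision weight, a common choice known as the straight-through estimator. The quantization grid uses a fixed scale of 0.1, and `fake_quantize` is a hypothetical helper. The task (fitting $y = 2x$) is made up for illustration.

```python
def fake_quantize(w, scale=0.1, q_min=-128, q_max=127):
    """Round w to the nearest point on the quantization grid, returned as a float."""
    q = max(q_min, min(q_max, round(w / scale)))
    return q * scale


data = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]  # samples of y = 2x
w = 0.0     # original, unquantized weight (kept in full precision)
lr = 0.05

for epoch in range(50):
    for x, y in data:
        # Forward pass: predict with the quantized version of the weight.
        w_q = fake_quantize(w)
        pred = w_q * x
        # Backward pass: gradient of (pred - y)^2 with respect to w_q,
        # applied to the unquantized weight (straight-through estimator).
        grad = 2 * (pred - y) * x
        w -= lr * grad

print("full-precision weight:", w)
print("quantized weight:     ", fake_quantize(w))
```

Training stops changing `w` once its quantized version fits the data, so the full-precision weight can settle slightly away from 2.0 while the quantized weight lands on the grid point 2.0.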