# 🤗 Quanto: a PyTorch quantization toolkit

Welcome to this tutorial, where we showcase how to use the `quanto` library to quantize any model to 8-bit, 4-bit, or even 2-bit precision on GPU, CPU, and MPS devices! Let's get started 🔥

### Setup

In [None]:
pip install quanto accelerate


In [None]:
from transformers import AutoModelForCausalLM, AutoTokenizer
from quanto import freeze, quantize
import torch

In [None]:
# monkey patched for quanto: yield the underlying `_data` and `_scale`
# tensors of quantized parameters instead of the parameter itself
def named_module_tensors(module, recurse=False):
    for name, val in module.named_parameters(recurse=recurse):
        if hasattr(val, "_data") or hasattr(val, "_scale"):
            if hasattr(val, "_data"):
                yield name + "._data", val._data
            if hasattr(val, "_scale"):
                yield name + "._scale", val._scale
        else:
            yield name, val

    # buffers are never quantized, so they are yielded unchanged
    yield from module.named_buffers(recurse=recurse)

def dtype_byte_size(dtype):
    """
    Returns the size (in bytes) occupied by one parameter of type `dtype`.
    """
    import re
    if dtype == torch.bool:
        return 1 / 8
    bit_search = re.search(r"[^\d](\d+)$", str(dtype))
    if bit_search is None:
        raise ValueError(f"`dtype` is not a valid dtype: {dtype}.")
    bit_size = int(bit_search.groups()[0])
    return bit_size // 8

def compute_module_sizes(model):
    """
    Compute the size of each submodule of a given model.
    """
    from collections import defaultdict
    module_sizes = defaultdict(int)
    for name, tensor in named_module_tensors(model, recurse=True):
        size = tensor.numel() * dtype_byte_size(tensor.dtype)
        name_parts = name.split(".")
        for idx in range(len(name_parts) + 1):
            module_sizes[".".join(name_parts[:idx])] += size

    return module_sizes
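
To make the bookkeeping in `compute_module_sizes` concrete, here is a minimal pure-Python sketch of the same prefix accumulation. The tensor names and byte counts are made up for illustration; only the accumulation logic mirrors the function above.

```python
from collections import defaultdict

# Hypothetical tensor names and sizes in bytes (illustrative values only).
tensor_sizes = {
    "h.0.attn.weight": 400,
    "h.0.attn.bias": 40,
    "h.1.attn.weight": 400,
    "ln_f.weight": 16,
}

prefix_sizes = defaultdict(int)
for name, size in tensor_sizes.items():
    parts = name.split(".")
    # idx = 0 gives the empty prefix "" (the whole model);
    # idx = len(parts) gives the full tensor name itself.
    for idx in range(len(parts) + 1):
        prefix_sizes[".".join(parts[:idx])] += size

print(prefix_sizes[""])     # total over all tensors
print(prefix_sizes["h.0"])  # total for the submodule h.0
```

This is why the total size is later read from the empty-string key, `module_sizes['']`.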

### Without quantization

Let's first load the `bigscience/bloom-560m` model and play with it.

In [None]:
model = AutoModelForCausalLM.from_pretrained("bigscience/bloom-560m", low_cpu_mem_usage=True, torch_dtype=torch.float32).to("cuda")
tokenizer = AutoTokenizer.from_pretrained("bigscience/bloom-560m")

In [None]:
print(model)

In [None]:
text = "Hello my name is"

inputs = tokenizer(text, return_tensors="pt").to("cuda")
outputs = model.generate(**inputs, max_new_tokens=10)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

You can see that the model size is about 2.2 GB (~560M parameters × 4 bytes), since we loaded the model in `torch.float32` (32 bits = 4 bytes).

In [None]:
module_sizes = compute_module_sizes(model.transformer)
print(module_sizes)
print(f"The model size is {module_sizes[''] * 1e-9} GB")
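
As a quick cross-check of the 2.2 GB figure, the arithmetic can be written out as a small helper. `estimate_size_gb` is a hypothetical name, and 560 million is the rounded parameter count quoted above, not an exact value.

```python
def estimate_size_gb(num_params: float, bits_per_param: int) -> float:
    """Approximate storage in GB (1 GB = 1e9 bytes) at a given precision."""
    return num_params * bits_per_param / 8 * 1e-9

# Rounded parameter count taken from the text (assumption: ~560M).
approx_params = 560e6

for bits in (32, 8, 4, 2):
    print(f"{bits:>2}-bit: {estimate_size_gb(approx_params, bits):.2f} GB")
```

These numbers assume every parameter is stored at the given precision and ignore any extra storage for quantization scales, so they are rough estimates rather than exact sizes.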

In [None]:
print(model.transformer.h[0].self_attention.query_key_value.weight)

### 8-bit quantization (weight only)

Now we will use quanto to quantize the model to 8 bits. To do that, we specify that the weights should be quantized with the quanto data type `qint8`. In this example we only quantize the weights, but the activations can be quantized as well.
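
For intuition about what 8-bit weight quantization does numerically, the following is a minimal sketch of symmetric per-tensor quantization in plain Python. It is a generic textbook scheme, not quanto's implementation; the helper names `quantize_int8` and `dequantize` are hypothetical.

```python
def quantize_int8(weights):
    """Map floats to integers in [-127, 127] using a single shared scale."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    q = [max(-127, min(127, round(w / scale))) for w in weights]
    return q, scale

def dequantize(q, scale):
    """Recover approximate float values from the integers and the scale."""
    return [v * scale for v in q]

weights = [0.5, -1.27, 0.0, 0.3]  # made-up example values
q, scale = quantize_int8(weights)
approx = dequantize(q, scale)

print(q, scale)
print(approx)
```

Each weight is stored as one small integer plus a shared float scale, and the reconstruction error per weight is at most half the scale.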

#### 1. Quantize
At this stage, only the inference of the model is modified so that the weights are quantized dynamically (a.k.a. dynamic quantization).

In [None]:
from quanto import qint8
# quantization happens in-place
quantize(model.transformer, weights=qint8, activations=None)

You can see that the `Linear` layers have been replaced by `QLinear` from the quanto library. Note that we only passed `model.transformer` to `quantize`, so the `lm_head` is not quantized; this avoids performance degradation with LLMs.

In [None]:
print(model)

In [None]:
print(model.transformer.h[0].self_attention.query_key_value.weight)

In [None]:
module_sizes = compute_module_sizes(model.transformer)

print(f"The model size is {module_sizes[''] * 1e-9} GB")

As you can see, the weights are still stored in `torch.float32`; they are quantized only during inference.
This intermediate state is useful for:
- **Calibration**: record the activation ranges while passing representative samples through the quantized model.
- **Quantization-Aware Training**: if the performance of the model degrades too much, one can tune it for a few epochs to recover the float model performance.
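
The calibration idea in the first bullet can be sketched as a running range observer. This is a generic illustration that assumes a symmetric int8 activation range; `RangeObserver` is a hypothetical class, not part of quanto.

```python
class RangeObserver:
    """Tracks the largest absolute activation seen across calibration batches."""

    def __init__(self):
        self.max_abs = 0.0

    def observe(self, activations):
        # Widen the recorded range if this batch contains a larger value.
        self.max_abs = max(self.max_abs, max(abs(a) for a in activations))

    def scale(self, qmax=127):
        # Scale that maps the observed range onto [-qmax, qmax].
        return self.max_abs / qmax if self.max_abs > 0 else 1.0

observer = RangeObserver()
for batch in ([0.1, -0.4, 2.0], [3.81, -1.0], [0.2, 0.3]):  # made-up samples
    observer.observe(batch)

print(observer.max_abs, observer.scale())
```

The more representative the calibration samples, the better the recorded range matches what the model sees at inference time.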


We can also dynamically quantize the weights by accessing the `qweight` property. This is what happens in practice during inference if we don't freeze the weights.

In [None]:
print(model.transformer.h[0].self_attention.query_key_value.qweight)

#### 2. Freeze the weights (float -> int)

When freezing a model, its float weights are replaced by quantized integer weights.

In [None]:
freeze(model)

You can see that we now have integer weights.

In [None]:
print(model.transformer.h[0].self_attention.query_key_value.weight)
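
To summarize the difference between the unfrozen and frozen states, here is a toy sketch in plain Python. `ToyQLinear` is a hypothetical stand-in written for this illustration; it does not reproduce quanto's `QLinear`, and the symmetric per-tensor scheme it uses is an assumption.

```python
class ToyQLinear:
    """Hypothetical stand-in for a quantized layer; not quanto's QLinear."""

    def __init__(self, weight):
        self.weight = list(weight)  # float weights until frozen
        self.scale = None
        self.frozen = False

    @property
    def qweight(self):
        """Return (integers, scale), quantizing on the fly if not frozen."""
        if self.frozen:
            return self.weight, self.scale
        max_abs = max(abs(w) for w in self.weight)
        scale = max_abs / 127 if max_abs > 0 else 1.0
        return [round(w / scale) for w in self.weight], scale

    def freeze(self):
        """Replace the float weights with their integer representation."""
        if not self.frozen:
            self.weight, self.scale = self.qweight
            self.frozen = True

layer = ToyQLinear([0.5, -1.27, 0.3])  # made-up example values
before = layer.qweight  # computed on the fly, floats are still stored
layer.freeze()
after = layer.qweight   # now read directly from the stored integers

print(layer.weight)
```

Before freezing, the integers are recomputed on every access; after freezing, only the integers and the scale are kept, which is where the memory saving comes from.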

The model size is now only 1.3 GB! Note that the embedding layer takes about 1 GB and is not quantized. This is why the model size is not divided by 4, as would be expected if the entire model were converted from 32 bits to 8 bits. You can see this by running `print(module_sizes)`.

In [None]:
module_sizes = compute_module_sizes(model.transformer)
print(f"The model size is {module_sizes[''] * 1e-9} GB")
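
The 1.3 GB figure can be reproduced approximately from the numbers quoted above: about 1 GB of unquantized embeddings plus the remaining weights shrunk from 4 bytes to 1 byte each. The sketch below uses those rounded values and ignores the small overhead of the quantization scales and of any other tensors left in float32.

```python
total_fp32_gb = 2.2  # approximate float32 size quoted earlier
embedding_gb = 1.0   # approximate embedding size, left in float32

quantizable_gb = total_fp32_gb - embedding_gb  # weights that become int8
quantized_gb = quantizable_gb * 8 / 32         # 32-bit -> 8-bit
estimated_total_gb = embedding_gb + quantized_gb

print(f"Estimated size after freezing: {estimated_total_gb:.1f} GB")
```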

In [None]:
text = "Hello my name is"
inputs = tokenizer(text, return_tensors="pt").to("cuda")
outputs = model.generate(**inputs, max_new_tokens=10)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))