<a href="https://www.kaggle.com/code/aisuko/introduction-to-weight-quantization?scriptVersionId=163020432" target="_blank"><img align="left" alt="Kaggle" title="Open in Kaggle" src="https://kaggle.com/static/images/open-in-kaggle.svg"></a>

# Overview

**Note: all the images are from the Credits section or the internet.**

Typically, the size of a model is calculated by multiplying the number of parameters (**size**) by the precision of these values (**data type**). However, to save memory, weights can be stored using lower-precision data types through a process known as quantization.
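As a rough illustration of this size calculation, the sketch below multiplies a parameter count by the number of bytes per value for a few data types. The 7-billion parameter count and the helper name `model_size_gb` are illustrative assumptions, not values taken from this notebook.

```python
# Bytes needed to store one value in each data type.
BYTES_PER_VALUE = {"float32": 4, "float16": 2, "bfloat16": 2, "int8": 1}

def model_size_gb(num_params, dtype):
    """Approximate weight storage in gigabytes (1 GB = 1e9 bytes)."""
    return num_params * BYTES_PER_VALUE[dtype] / 1e9

num_params = 7_000_000_000  # hypothetical model size, for illustration only
for dtype in BYTES_PER_VALUE:
    print(f"{dtype:>8}: {model_size_gb(num_params, dtype):.1f} GB")
```

This only counts the weights themselves; activations, buffers, and optimizer state add to the real memory usage.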

We distinguish two main families of weight quantization techniques in the literature:

**Post-Training Quantization(PTQ)**

It is a straightforward technique where the weights of an already trained model are converted to lower precision without necessitating any retraining. Although easy to implement, PTQ is associated with potential performance degradation. For more examples of Post-Training Quantization, see [Post-Training Quantization](https://www.kaggle.com/code/aisuko/post-training-quantization-methods/notebook).

**Quantization-Aware Training(QAT)**

It incorporates the weight conversion process during the pre-training or fine-tuning stage, resulting in enhanced model performance. However, QAT is computationally expensive and demands representative training data.

Here, we focus on PTQ to reduce the precision of our parameters.


# Background on Floating Point representation

The choice of data type dictates the quantity of computational resources required, affecting the speed and efficiency of the model. In deep learning applications, balancing precision and computational performance becomes a vital exercise as higher precision often implies greater computational demands.

Among various data types, floating point numbers are predominantly employed in deep learning due to their ability to represent a wide range of values with high precision. Typically, a floating point number uses n bits to store a numerical value. These n bits are further partitioned into three distinct components:


## Sign

The sign bit indicates the positive or negative nature of the number. It uses one bit where 0 indicates a positive number and 1 signals a negative number.

## Exponent

The exponent is a segment of bits that represents the power to which the base (usually 2 in binary representation) is raised. The exponent can also be positive or negative, allowing the number to represent very large or very small values.

## Significand/Mantissa

The remaining bits are used to store the significand, also referred to as the mantissa. This represents the significant digits of the number. The precision of the number heavily depends on the length of the significand.


This design allows floating point numbers to cover a wide range of values with varying levels of precision. The formula used for this representation is:


$$(-1)^{sign}*base^{exponent}*significand$$

For example, here is how an integer is converted to binary:

<div style="text-align: center"><img src="https://files.mastodon.social/media_attachments/files/111/803/484/486/105/591/original/de67143ca647507a.webp" width="60%" heigh="60%" alt="Converting int to binary"></div>

And here is how a floating point number is converted to binary:

<div style="text-align: center"><img src="https://files.mastodon.social/media_attachments/files/111/803/486/083/309/139/original/ef0c081f705360d5.webp" width="60%" heigh="60%" alt="Converting float to binary"></div>
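To make the sign/exponent/significand split concrete, the following sketch pulls the three bit fields out of a 32-bit float using only the standard library. It assumes the usual IEEE 754 conventions for normal numbers (an exponent bias of 127 and an implicit leading 1 in the significand), which the formula above leaves implicit. The function names are hypothetical.

```python
import struct

def float32_fields(value):
    """Split a number stored as IEEE 754 float32 into its three bit fields."""
    # Reinterpret the 4 bytes of the float as an unsigned 32-bit integer.
    bits = struct.unpack(">I", struct.pack(">f", value))[0]
    sign = bits >> 31                # 1 bit
    exponent = (bits >> 23) & 0xFF   # 8 bits, stored with a bias of 127
    mantissa = bits & 0x7FFFFF       # 23 bits, fraction after the implicit leading 1
    return sign, exponent, mantissa

def rebuild(sign, exponent, mantissa):
    """Apply (-1)^sign * 2^(exponent - 127) * (1 + mantissa / 2^23) for normal numbers."""
    return (-1) ** sign * 2.0 ** (exponent - 127) * (1 + mantissa / 2**23)

sign, exponent, mantissa = float32_fields(-6.25)
# -6.25 = (-1)^1 * 2^2 * 1.5625
print(sign, exponent - 127, 1 + mantissa / 2**23)
print(rebuild(sign, exponent, mantissa))
```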

To understand this better, let's delve into some commonly used data types in deep learning


# Common data types used in ML

The size of a model is determined by the number of its parameters and their precision, typically one of float32 (FP32), float16 (FP16) or bfloat16 (BF16).

<div style="text-align: center"><img src="https://files.mastodon.social/media_attachments/files/111/780/434/306/380/735/original/a8c1a7528940019b.webp" width="60%" heigh="60%" alt="float types in ML"></div>


## Float32(FP32)

It stands for the standardized IEEE 32-bit floating point representation. With this data type it is possible to represent a wide range of floating point numbers. In FP32, 8 bits are reserved for the "exponent", 23 bits for the "mantissa" and 1 bit for the "sign" of the number. In addition to that, most hardware supports FP32 operations and instructions. While it provides a high degree of precision, the downside of FP32 is its high computational and memory footprint.


## Float16(FP16)

5 bits are reserved for the exponent and 10 bits are reserved for the mantissa, with 1 bit for the sign. This makes the representable range of FP16 numbers much smaller than that of FP32. In exchange, FP16 is more memory-efficient and accelerates computations. However, the reduced range and precision can introduce numerical instability, potentially impacting model accuracy.

For example, 10k * 10k gives 100M, which cannot be represented in FP16 because the largest representable number is about 64k (65,504).
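The overflow can be reproduced with the standard library alone. This is a minimal sketch that uses the `struct` module's half-precision format (`"e"`); the helper name `to_fp16` is hypothetical. Note that `struct` raises an `OverflowError` for values that do not fit, whereas deep learning frameworks such as PyTorch typically return `inf` instead.

```python
import struct

def to_fp16(value):
    """Round-trip a Python float through IEEE 754 half precision."""
    return struct.unpack("<e", struct.pack("<e", value))[0]

print(to_fp16(65504.0))  # the largest finite FP16 value survives the round trip
print(to_fp16(0.1))      # stored with only 10 mantissa bits, so not exactly 0.1

try:
    to_fp16(10_000.0 * 10_000.0)
except OverflowError as err:
    print("overflow:", err)
```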


## BFloat16(BF16)

It is also a 16-bit format but with one bit for the sign, eight for the exponent, and seven for the significand. BF16 expands the representable range compared to FP16, thus decreasing underflow and overflow risk. Despite a reduction in precision due to fewer significand bits, BF16 typically does not significantly impact model performance and is a useful compromise for deep learning tasks.
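Because BF16 shares its sign and exponent layout with FP32, it can be approximated by keeping only the top 16 bits of the FP32 encoding. The sketch below does exactly that with the standard library. It truncates rather than rounds to nearest, which is a simplification, and `to_bf16` is a hypothetical helper name.

```python
import struct

def to_bf16(value):
    """Approximate bfloat16 by keeping the top 16 bits of the float32 encoding.

    Truncation is used here for simplicity; real hardware usually rounds to nearest.
    """
    bits = struct.unpack(">I", struct.pack(">f", value))[0]
    return struct.unpack(">f", struct.pack(">I", bits & 0xFFFF0000))[0]

print(to_bf16(100_000_000.0))  # still finite: BF16 keeps the 8-bit exponent of FP32
print(to_bf16(3.14159))        # only 7 significand bits remain, so precision drops
```

The value 100M that overflowed in FP16 stays finite here, at the cost of a coarser approximation.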


In ML jargon, FP32 is often termed "full precision" (4 bytes), while BF16 and FP16 are "half-precision" (2 bytes).

Storing those weights in less memory by using a different data type is called **quantization**. According to the blog posts in the Credit section, INT8 is an 8-bit representation capable of storing $2^8=256$ different values.

# Introduction to model quantization(8bit)

Experimentally, we have discovered that instead of using the 4-byte FP32 precision, we can get an almost identical inference outcome with 2-byte BF16/FP16 half-precision, which halves the model size. If we cut it further, the inference quality starts to drop dramatically at lower precision.

To remediate that, we introduce 8-bit quantization. This method uses a quarter precision, thus needing only 1/4th of the model size. But it's not done by just dropping another half of the bits.

Quantization is done by essentially "rounding" from one data type to another. For example, if one data type has the range 0..9 and another 0..4, then the value "4" in the first data type would be rounded to "2" in the second data type. However, if we have the value "3" in the first data type, which lies between 1 and 2 of the second type, we would usually round to "2". This shows that both values "4" and "3" of the first data type have the same value "2" in the second data type. This highlights that quantization is a noisy process that can lead to information loss, a sort of lossy compression.

The two most common 8-bit quantization techniques are zero-point quantization and absolute maximum (absmax) quantization. Both map floating point values into more compact INT8 (1 byte) values. Let's map an FP32 tensor X to an INT8 tensor X_quant.


## Absolute maximum(absmax)

With absmax quantization, the original number is divided by the absolute maximum value of the tensor and multiplied by a scaling factor (127) to map inputs into the range [-127, 127]. To retrieve the original floating point values, the INT8 number is divided by the quantization factor, acknowledging some loss of precision due to rounding.

$$X_{quant}=round({127 \over max|X|}*X)$$

$$X_{dequant}={{max|X| \over 127 }* X_{quant}}$$


For instance, let's say we have an absolute maximum value of 3.2. A weight of 0.1 would be quantized to $round(0.1*127/3.2)=4$. If we want to dequantize it, we would get $4*3.2/127=0.1008$, which implies an error of 0.0008. Here's the corresponding Python implementation:

In [1]:
import torch

def absmax_quantize(X):
    # Calculate scale
    scale=127/torch.max(torch.abs(X))
    
    # Quantize
    X_quant=(scale*X).round()
    
    # Dequantize
    X_dequant=X_quant/scale
    
    return X_quant.to(torch.int8), X_dequant
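
The worked example above (absolute maximum 3.2, weight 0.1) can be checked with a scalar version of the same arithmetic. This is a plain-Python sketch that does not use PyTorch; `absmax_quantize_value` is a hypothetical helper that mirrors the tensor function for a single number.

```python
def absmax_quantize_value(x, abs_max):
    """Quantize one value to the INT8 range [-127, 127] and map it back."""
    scale = 127 / abs_max
    x_quant = round(scale * x)
    x_dequant = x_quant / scale
    return x_quant, x_dequant

x_quant, x_dequant = absmax_quantize_value(0.1, abs_max=3.2)
# Prints the quantized value, the dequantized value, and the absolute error.
print(x_quant, round(x_dequant, 4), round(abs(x_dequant - 0.1), 4))
```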

## Zero-point quantization

With zero-point quantization, we can consider asymmetric input distributions, which is useful when you consider the output of a ReLU function (only positive values), for example. The input values are first scaled by the total range of values (255) divided by the difference between the maximum and minimum values. This distribution is then shifted by the zero-point to map it into the range [-128, 127] (notice the extra value compared to absmax). First, we calculate the scale factor and the zero-point value:

$$scale={255 \over max(X)-min(X)}$$

$$zeropoint=-round(scale*min(X))-128$$

Then, we can use these variables to quantize or dequantize our weights:

$$X_{quant}=round(scale*X+zeropoint)$$

$$X_{dequant}={X_{quant}-zeropoint \over scale}$$

For example, suppose we have a maximum value of 3.2 and a minimum value of -3.0. The scale is $255/(3.2+3.0)=41.13$ and the zero-point is $-round(41.13*(-3.0))-128=123-128=-5$, so our previous weight of 0.1 would be quantized to $round(41.13*0.1-5)=-1$. This is very different from the previous value obtained using absmax (4 vs. -1).

<div style="text-align: center"><img src="https://files.mastodon.social/media_attachments/files/111/780/706/117/880/782/original/ff18414a971886e3.webp" width="60%" heigh="60%" alt="Absmax/Zero-point quantization"></div>

In [2]:
def zeropoint_quantize(X):
    # Calculate value range (denominator)
    x_range=torch.max(X) -torch.min(X)
    x_range=1 if x_range==0 else x_range
    
    # Calculate scale
    scale=255/x_range
    
    # Shift by zero-point
    zeropoint=(-scale*torch.min(X)-128).round()
    
    # Scale and round the inputs
    X_quant=torch.clip((X*scale+zeropoint).round(),-128,127)
    
    # Dequantize
    X_dequant=(X_quant-zeropoint) / scale
    
    return X_quant.to(torch.int8), X_dequant
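
The zero-point example above (minimum -3.0, maximum 3.2, weight 0.1) can be verified the same way with scalar arithmetic. This sketch is plain Python, and both helper names are hypothetical; it follows the formulas in this section rather than any library's implementation.

```python
def zeropoint_params(x_min, x_max):
    """Scale and zero-point that map [x_min, x_max] onto the INT8 range [-128, 127]."""
    scale = 255 / (x_max - x_min)
    zeropoint = -round(scale * x_min) - 128
    return scale, zeropoint

def zeropoint_quantize_value(x, scale, zeropoint):
    """Quantize one value, clamp it to INT8, and map it back to floating point."""
    x_quant = max(-128, min(127, round(scale * x + zeropoint)))
    x_dequant = (x_quant - zeropoint) / scale
    return x_quant, x_dequant

scale, zeropoint = zeropoint_params(-3.0, 3.2)
x_quant, x_dequant = zeropoint_quantize_value(0.1, scale, zeropoint)
print(round(scale, 2), zeropoint, x_quant, round(x_dequant, 4))
```

The extremes of the range land on the ends of the INT8 interval: -3.0 maps to -128 and 3.2 maps to 127.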

# Demo with Transformers

We start by loading the model and tokenizer for GPT-2. It is a very small model, which makes the demo easier to run.

In [3]:
%%capture
!pip install transformers==4.36.2
!pip install accelerate==0.25.0
!pip install bitsandbytes==0.41.3

In [4]:
!accelerate estimate-memory gpt2 --library_name transformers

Loading pretrained config for `gpt2` from `transformers`...
config.json: 100%|█████████████████████████████| 665/665 [00:00<00:00, 3.11MB/s]
┌────────────────────────────────────────────────────┐
│          Memory Usage for loading `gpt2`           │
├───────┬─────────────┬──────────┬───────────────────┤
│ dtype │Largest Layer│Total Size│Training using Adam│
├───────┼─────────────┼──────────┼───────────────────┤
│float32│  147.24 MB  │ 476.2 MB │      1.86 GB      │
│float16│   73.62 MB  │ 238.1 MB │      952.4 MB     │
│  int8 │   36.81 MB  │119.05 MB │      476.2 MB     │
│  int4 │   18.4 MB   │ 59.53 MB │      238.1 MB     │
└───────┴─────────────┴──────────┴───────────────────┘


In [5]:
from transformers import AutoModelForCausalLM, AutoTokenizer

torch.manual_seed(0)

model_name='gpt2'

model=AutoModelForCausalLM.from_pretrained(model_name, device_map="auto")
tokenizer=AutoTokenizer.from_pretrained(model_name)
# the size of model in FP32 in bytes
model.get_memory_footprint()

510342192

In [6]:
# original weights
weights=model.transformer.h[0].attn.c_attn.weight.data
weights

tensor([[-0.4738, -0.2614, -0.0978,  ...,  0.0513, -0.0584,  0.0250],
        [ 0.0874,  0.1473,  0.2387,  ..., -0.0525, -0.0113, -0.0156],
        [ 0.0039,  0.0695,  0.3668,  ...,  0.1143,  0.0363, -0.0318],
        ...,
        [-0.2592, -0.0164,  0.1991,  ...,  0.0095, -0.0516,  0.0319],
        [ 0.1517,  0.2170,  0.1043,  ...,  0.0293, -0.0429, -0.0475],
        [-0.4100, -0.1924, -0.2400,  ..., -0.0046,  0.0070,  0.0198]],
       device='cuda:0')

In [7]:
# Quantize layer using absmax quantization
weights_abs_quant,_=absmax_quantize(weights)
weights_abs_quant

tensor([[-21, -12,  -4,  ...,   2,  -3,   1],
        [  4,   7,  11,  ...,  -2,  -1,  -1],
        [  0,   3,  16,  ...,   5,   2,  -1],
        ...,
        [-12,  -1,   9,  ...,   0,  -2,   1],
        [  7,  10,   5,  ...,   1,  -2,  -2],
        [-18,  -9, -11,  ...,   0,   0,   1]], device='cuda:0',
       dtype=torch.int8)

In [8]:
# Quantize layer using zero-point-quantize
weights_zp_quant,_=zeropoint_quantize(weights)
weights_zp_quant

tensor([[-20, -11,  -3,  ...,   3,  -2,   2],
        [  5,   8,  12,  ...,  -1,   0,   0],
        [  1,   4,  18,  ...,   6,   3,   0],
        ...,
        [-11,   0,  10,  ...,   1,  -1,   2],
        [  8,  11,   6,  ...,   2,  -1,  -1],
        [-18,  -8, -10,  ...,   1,   1,   2]], device='cuda:0',
       dtype=torch.int8)

The difference between the original (FP32) and quantized (INT8) values is clear, but the difference between absmax and zero-point weights is more subtle. In this case, the two sets of quantized values differ by an offset of about 1. This suggests that the weight distribution in this layer is quite symmetric.


We can compare these techniques by quantizing every layer in GPT-2 (linear layers, attention layers, etc.) and creating two new models: `model_abs` and `model_zp`. To be precise, we will actually replace the original weights with de-quantized ones. This has two benefits: it allows us to (1) compare the distribution of our weights (same scale) and (2) actually run the models.

Indeed, PyTorch doesn't allow INT8 matrix multiplication by default. In a real scenario, we should dequantize them to run the model(in FP16 for example) but store them as INT8. In the next section, we will use the `bitsandbytes` library to solve this issue.

In [9]:
import numpy as np
from copy import deepcopy

# Store original weights
weights=[param.data.clone() for param in model.parameters()]

# Create model to quantize
model_abs=deepcopy(model)

# Quantize all model weights
weights_abs=[]
for param in model_abs.parameters():
    _,dequantized=absmax_quantize(param.data)
    param.data=dequantized
    weights_abs.append(dequantized)

# Create model to quantize
model_zp=deepcopy(model)

# Quantize all model weights
weights_zp=[]
for param in model_zp.parameters():
    _, dequantized=zeropoint_quantize(param.data)
    param.data=dequantized
    weights_zp.append(dequantized)

# Comparing the perplexity

We can quantify the quality of their outputs by calculating the perplexity of each output. This is a common metric used to evaluate language models, which measures the uncertainty of a model in predicting the next token in a sequence. In this comparison, we make the common assumption that the lower the score, the better the model is.
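Perplexity is the exponential of the average negative log-likelihood the model assigns to the observed tokens. The sketch below computes it from a list of per-token probabilities. The probabilities are made-up illustrative numbers, not model outputs, and the function name is hypothetical.

```python
import math

def perplexity(token_probs):
    """Perplexity = exp of the average negative log-likelihood of the observed tokens."""
    nll = [-math.log(p) for p in token_probs]
    return math.exp(sum(nll) / len(nll))

# Hypothetical probabilities a model assigned to each correct next token.
confident = [0.5, 0.5, 0.5, 0.5]
uncertain = [0.1, 0.1, 0.1, 0.1]

print(perplexity(confident))  # about 2: like choosing between 2 equally likely tokens
print(perplexity(uncertain))  # about 10: like choosing between 10 equally likely tokens
```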

In [10]:
def generate_text(model, input_text, max_length=50):
    input_ids=tokenizer.encode(input_text, return_tensors='pt').to('cuda')
    output=model.generate(
        inputs=input_ids,
        max_length=max_length,
        do_sample=True,
        top_k=30,
        pad_token_id=tokenizer.eos_token_id,
        attention_mask=input_ids.new_ones(input_ids.shape)
    )
    
    return tokenizer.decode(output[0], skip_special_tokens=True)

# Generate text with original and quantized models
original_text=generate_text(model, "The weather in Melbourne")
abs_max_text=generate_text(model_abs, "The weather in Melbourne")
zp_text=generate_text(model_zp, "The weather in Melbourne")



In [11]:
def calculate_perplexity(model, text):
    encodings=tokenizer(text, return_tensors="pt").to('cuda')
    
    input_ids=encodings.input_ids
    target_ids=input_ids.clone()
    
    with torch.no_grad():
        outputs=model(input_ids, labels=target_ids)
        
    # loss calculation
    neg_log_likelihood=outputs.loss
    
    # Perplexity calculation
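    # The loss is already a tensor, so exponentiate it directly
    # (wrapping it in torch.tensor() triggers a copy-construct warning)
    ppl=torch.exp(neg_log_likelihood.float())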
    
    return ppl

ppl = calculate_perplexity(model, original_text)
ppl_abs=calculate_perplexity(model_abs, abs_max_text)
ppl_zp=calculate_perplexity(model_zp, zp_text)

print(f"Original perplexity: {ppl.item():.2f}")
print(f"Absmax perplexity: {ppl_abs.item():.2f}")
print(f"Zeropoint perplexity: {ppl_zp.item():.2f}")

Original perplexity: 14.23
Absmax perplexity: 15.59
Zeropoint perplexity: 17.43




We see that the perplexity of the original model is lower than that of the two others. A single experiment is not very reliable, but we could repeat this process multiple times to see the difference between each model. In theory, zero-point quantization should be slightly better than absmax, but it is also more costly to compute.

We applied quantization techniques to entire layers (per-tensor basis). However, we could apply it at different granularity levels: from the entire model to individual values. Quantizing the entire model in one pass would seriously degrade the performance, while quantizing individual values would create a big overhead. In practice, we often prefer **vector-wise quantization**, which considers the variability of values in rows and columns inside the same tensor.
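The effect of granularity can be seen on a tiny example. The sketch below compares per-tensor absmax quantization with a row-wise variant on a made-up 2x3 matrix whose second row contains one large value. It is plain Python with hypothetical helper names, meant only to illustrate the idea.

```python
def absmax_roundtrip(values, abs_max):
    """Quantize values to INT8 with a shared absmax scale, then dequantize."""
    scale = 127 / abs_max
    return [round(v * scale) / scale for v in values]

def mean_abs_error(original, restored):
    """Average absolute difference between two equally long lists."""
    return sum(abs(a - b) for a, b in zip(original, restored)) / len(original)

# Hypothetical 2x3 weight matrix: the second row contains one large value.
W = [[0.02, -0.05, 0.03],
     [0.40, -8.00, 0.90]]

# Per-tensor: one scale for the whole matrix, dominated by the value -8.0.
tensor_max = max(abs(v) for row in W for v in row)
per_tensor = [absmax_roundtrip(row, tensor_max) for row in W]

# Vector-wise (row-wise here): each row gets its own scale.
per_row = [absmax_roundtrip(row, max(abs(v) for v in row)) for row in W]

print("row 0 error, per-tensor:", mean_abs_error(W[0], per_tensor[0]))
print("row 0 error, row-wise:  ", mean_abs_error(W[0], per_row[0]))
```

With a single scale, the small values in the first row collapse to almost nothing; with one scale per row they keep most of their precision. The second row is still dominated by its large value in both cases.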

However, even vector-wise quantization doesn't solve the problem of outlier features. **Outlier features are extreme values (negative or positive) that appear in all transformer layers when the model reaches a certain scale (>6.7B parameters)**. This is an issue since a single outlier can reduce the precision for all other values. But discarding these outlier features is not an option since it would **greatly degrade** the model's performance.

# LLM.int8()

LLM.int8() is a solution to the outlier problem. **It relies on a vector-wise (absmax) quantization scheme and introduces mixed-precision quantization.** This means that outlier features are processed in FP16 format to retain their precision, while the other values are processed in INT8 format. As outliers represent about 0.1% of values, this effectively reduces the memory footprint of the LLM by almost 2x.

LLM.int8() works by conducting matrix multiplication computation in three key steps:
1. Extract columns from the input hidden states X containing outlier features using a custom threshold.
2. Perform the matrix multiplication of the outliers using FP16 and the non-outliers using INT8 with vector-wise quantization (row-wise for the hidden state X and column-wise for the weight matrix W).
3. Dequantize the non-outlier results (INT8 to FP16) and add them to the outlier results to get the full result in FP16.
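
The three steps above can be sketched in plain Python. This is a simplified illustration, not the `bitsandbytes` implementation: matrices are lists of lists, ordinary Python floats stand in for FP16, all function names are hypothetical, and the sketch assumes at least one non-outlier column exists.

```python
def matmul(A, B):
    """Plain matrix product of two lists of lists."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def quantize_rows(M):
    """Row-wise absmax quantization: returns integer rows and one scale per row."""
    scales = [127 / (max(abs(v) for v in row) or 1.0) for row in M]
    return [[round(v * s) for v in row] for row, s in zip(M, scales)], scales

def llm_int8_matmul(X, W, threshold=6.0):
    """Mixed-precision X @ W; assumes at least one non-outlier feature column."""
    n_features = len(W)
    # 1. Feature columns of X that contain at least one value above the threshold.
    outliers = [j for j in range(n_features) if any(abs(row[j]) > threshold for row in X)]
    regular = [j for j in range(n_features) if j not in outliers]

    # 2a. Outlier part: kept in floating point (FP16 in the real method).
    X_out = [[row[j] for j in outliers] for row in X]
    W_out = [W[j] for j in outliers]
    out_result = matmul(X_out, W_out) if outliers else None

    # 2b. Regular part: X quantized per row, W per column, multiplied as integers.
    X_reg = [[row[j] for j in regular] for row in X]
    W_reg = [W[j] for j in regular]
    Xq, x_scales = quantize_rows(X_reg)
    Wq_T, w_scales = quantize_rows([list(col) for col in zip(*W_reg)])  # columns of W
    int_result = matmul(Xq, [list(r) for r in zip(*Wq_T)])

    # 3. Dequantize the integer result and add the outlier contribution.
    result = []
    for i, row in enumerate(int_result):
        result.append([
            v / (x_scales[i] * w_scales[k]) + (out_result[i][k] if out_result else 0.0)
            for k, v in enumerate(row)
        ])
    return result

# Hypothetical inputs: the third feature column of X holds an outlier (12.0).
X = [[0.5, -1.0, 12.0], [1.5, 0.25, -0.5]]
W = [[0.2, -0.1], [0.4, 0.3], [-0.5, 0.6]]
print("exact:", matmul(X, W))
print("mixed:", llm_int8_matmul(X, W))
```

Because the column containing 12.0 bypasses quantization, it cannot stretch the scale used for the small values, and the mixed result stays close to the exact product.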


This approach is necessary because 8-bit precision is limited and can lead to substantial errors when quantizing a vector with large values. These errors also tend to amplify as they propagate through multiple layers.

Beyond that, 8-bit models come with some useful features, such as offloading, outlier thresholds, and skipping module conversion.


## Offloading

8-bit models can offload weights between the CPU and GPU to support fitting very large models into memory. The weights dispatched to the CPU are actually stored in `float32`, and aren't converted to 8-bit.
 
## Outlier threshold

An "outlier" is a hidden state value greater than a certain threshold, and these values are computed in FP16. While the values are usually normally distributed ([-3.5, 3.5]), this distribution can be very different for large models ([-60, -6] or [6, 60]). 8-bit quantization works well for values of magnitude around 5, but beyond that, there is a significant performance penalty. A good default threshold value is 6, but a lower threshold may be needed for more unstable models (small models or fine-tuning).

## Skip module conversion

For some models, like Jukebox, we do not need to quantize every module to 8-bit, since doing so can actually cause instability. With Jukebox, there are several `lm_head` modules that should be skipped using the `llm_int8_skip_modules` parameter.

In [12]:
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

bnb_config=BitsAndBytesConfig(
    load_in_8bit=True,
    # https://github.com/huggingface/transformers/issues/22018#issuecomment-1460139242
#     llm_int8_enable_fp32_cpu_offload=True,
#     llm_int8_skip_modules=["lm_head"],
    llm_int8_threshold=6.0,
)

# device_map={
#     "transformer.word_embeddings":0,
#     "transformer.word_embeddings_layernorm":0,
#     "lm_head":"cpu",
#     "transformer.h":0,
#     "transformer.ln_f":0,
# }



int8_model=AutoModelForCausalLM.from_pretrained(
    model_name,
    device_map='auto',
    quantization_config=bnb_config
)

int8_model.get_memory_footprint()

176527896

The INT8 model is almost 3 times smaller than the original (FP32).

In [13]:
print(f"{model.get_memory_footprint()/int8_model.get_memory_footprint()}")

2.8910002530138352


In [14]:
prompt=generate_text(int8_model, "The weather in Melbourne is ")

ppl_int8=calculate_perplexity(int8_model, prompt)

print(f"Model with INT8 perplexity: {ppl_int8.item():.2f}")

Model with INT8 perplexity: 20.28




# More demos using INT8

For practice with Transformers, see [Lighter models on GPU for inference](https://www.kaggle.com/code/aisuko/lighter-models-on-gpu-for-inference/notebook).

For practice with PyTorch, see [Zero degradation matrix multiplication](https://www.kaggle.com/code/aisuko/zero-degradation-matrix-multiplication).

# Credit

* https://huggingface.co/blog/hf-bitsandbytes-integration?source=post_page-----287da2d5d7f1--------------------------------
* https://huggingface.co/blog/4bit-transformers-bitsandbytes
* https://towardsdatascience.com/introduction-to-weight-quantization-2494701b9c0c
* https://huggingface.co/docs/transformers/main/quantization#8-bit
* https://pub.towardsai.net/how-to-fit-large-language-models-in-small-memory-quantization-e8c3981430b2