## Quantization in Large Language Models

### Introduction to Quantization

Quantization is a process used to reduce the memory requirements and computational complexity of large machine learning models. By representing model parameters with lower-precision values, quantization makes it possible to run models more efficiently on devices with limited memory and computational resources.
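To make "lower-precision values" concrete, here is a minimal sketch of affine (scale plus zero-point) int8 quantization in plain Python. The helper names `quantize_int8` and `dequantize_int8` are illustrative and not part of any library; real frameworks apply the same idea to whole tensors with optimized kernels.

```python
def quantize_int8(values):
    """Map floats to int8 codes with an affine (scale + zero-point) scheme."""
    lo, hi = min(values), max(values)
    # Widen the range to include 0.0 so that zero is exactly representable.
    lo, hi = min(lo, 0.0), max(hi, 0.0)
    # 256 int8 levels cover [lo, hi]; fall back to 1.0 if all values are zero.
    scale = (hi - lo) / 255 or 1.0
    zero_point = round(-128 - lo / scale)
    codes = [max(-128, min(127, round(v / scale) + zero_point)) for v in values]
    return codes, scale, zero_point


def dequantize_int8(codes, scale, zero_point):
    """Recover approximate floats from int8 codes."""
    return [(c - zero_point) * scale for c in codes]


weights = [-1.0, -0.25, 0.0, 0.5, 1.0]
codes, scale, zero_point = quantize_int8(weights)
restored = dequantize_int8(codes, scale, zero_point)
max_error = max(abs(a - b) for a, b in zip(weights, restored))

print("int8 codes:", codes)
print("restored  :", [round(v, 4) for v in restored])
print("max error :", max_error, "(scale =", scale, ")")
```

The round-trip error is bounded by roughly one quantization step (`scale`), which is the precision given up in exchange for storing one byte per value.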

For large language models (LLMs), quantization can:
- **Reduce Memory Usage:** Lower-precision data types (such as int8) use less memory than higher-precision types (like float32), allowing models to fit into memory-constrained environments.
- **Improve Inference Speed:** Operations on smaller data types move less data and can use faster integer kernels where the hardware supports them, which can reduce the time a model takes to process inputs and generate outputs.
- **Preserve Accuracy:** Well-designed quantization schemes keep the accuracy loss small, but a trade-off between precision and efficiency usually remains, so the quantized model's outputs should be checked against the baseline.
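The memory claim above follows from simple arithmetic: weight storage is roughly the parameter count times the bytes per parameter. The sketch below assumes a hypothetical round figure of one billion parameters (not a measured count for any particular model) and ignores activations, quantization metadata such as scales, and file-format overhead.

```python
# Bytes needed to store one parameter in each data type.
BYTES_PER_PARAM = {"float32": 4.0, "float16": 2.0, "int8": 1.0, "int4": 0.5}


def weight_memory_gb(num_params, dtype):
    """Approximate weight storage in GB (1 GB = 1e9 bytes)."""
    return num_params * BYTES_PER_PARAM[dtype] / 1e9


num_params = 1_000_000_000  # hypothetical round number, for illustration only

for dtype in BYTES_PER_PARAM:
    print(f"{dtype:>8}: {weight_memory_gb(num_params, dtype):.2f} GB")
```

Under these assumptions, moving from float32 to int8 cuts weight storage to about a quarter.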

In this lab, we will explore and compare two common types of quantization: **dynamic quantization** and **static quantization**.

In [1]:
import torch
import os
import time 
from transformers import AutoTokenizer, AutoModelForCausalLM

In [2]:
from huggingface_hub import login

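# Log in to the Hugging Face Hub: meta-llama models are gated, so downloading them requires an authenticated token
with open("../hf_token.txt", "r") as f:
    token = f.read().strip()  # strip the trailing newline; the with-block closes the file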

login(token=token)

In [3]:
# Load the pre-trained Llama 3.2 1B model and its tokenizer
model_id = "meta-llama/Llama-3.2-1B"
model = AutoModelForCausalLM.from_pretrained(model_id)
tokenizer = AutoTokenizer.from_pretrained(model_id)

# The tokenizer has no padding token by default, so reuse the end-of-sequence token
tokenizer.pad_token = tokenizer.eos_token

In [None]:
def get_model_size(model):
    torch.save(model.state_dict(), "temp.pth")
    size = os.path.getsize("temp.pth") / 1e6  # size in MB
    os.remove("temp.pth")
    return size

print(f"Model size before quantization: {get_model_size(model):.1f} MB")

In [None]:
text = "The secret of life is"
inputs = tokenizer(text, return_tensors="pt")

tic = time.time()

with torch.no_grad():
    baseline_output = model.generate(inputs['input_ids'], max_length=50)

elapsed_time = time.time() - tic

baseline_decoded = tokenizer.decode(baseline_output[0], skip_special_tokens=True)

print("\nBaseline model output:", baseline_decoded)
print("\nTime taken for baseline model:", elapsed_time)
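A single wall-clock measurement like the one above can be noisy, because the first call often includes one-off setup costs. The following is a minimal sketch of a more robust timing helper; `time_call` is a hypothetical name introduced here, and the lambda is a stand-in workload so the snippet runs without a model.

```python
import statistics
import time


def time_call(fn, warmup=1, repeats=3):
    """Run fn a few times and return the median wall-clock time in seconds."""
    for _ in range(warmup):
        fn()  # warm-up runs are executed but not timed
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        timings.append(time.perf_counter() - start)
    return statistics.median(timings)


# Stand-in workload; replace with the call you want to measure.
median_s = time_call(lambda: sum(i * i for i in range(100_000)))
print(f"Median time: {median_s:.4f} s")
```

In this notebook it could be used as `time_call(lambda: model.generate(inputs['input_ids'], max_length=50))`, with the same `max_length` for every model being compared.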

### 1. Dynamic Quantization

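Dynamic quantization converts model weights to a lower-precision type ahead of time and quantizes activations on the fly at runtime, computing their quantization parameters from the values observed in each forward pass. This method does not require modifications to the model architecture or retraining, which makes it relatively easy to apply.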

- **Advantages:**
  - Quick to implement with minimal code changes, and no calibration step is needed.

- **Limitations:**
  - Activation quantization parameters are not precomputed. They are derived from each input at inference time, which tracks the actual value range closely but adds some runtime overhead.
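The split between "weights quantized once" and "activations quantized per call" can be sketched in plain Python. This is an illustrative toy, not how PyTorch implements it: `DynamicLinearSketch` and `symmetric_quantize` are hypothetical names, the layer has no bias, and it uses symmetric int8 quantization with one scale per weight row.

```python
def symmetric_quantize(values):
    """Return int8 codes and the scale that maps them back to floats."""
    max_abs = max(abs(v) for v in values) or 1.0
    scale = max_abs / 127
    return [round(v / scale) for v in values], scale


class DynamicLinearSketch:
    """Toy bias-free linear layer with dynamically quantized activations."""

    def __init__(self, weight_rows):
        # Weights are quantized once, when the layer is built.
        self.rows = [symmetric_quantize(row) for row in weight_rows]

    def __call__(self, x):
        # The activation scale is computed at call time from the actual input.
        x_q, x_scale = symmetric_quantize(x)
        outputs = []
        for w_q, w_scale in self.rows:
            acc = sum(a * b for a, b in zip(w_q, x_q))  # integer accumulation
            outputs.append(acc * w_scale * x_scale)     # rescale to float
        return outputs


weights = [[0.5, -1.0], [0.25, 0.75]]
x = [2.0, 1.0]

layer = DynamicLinearSketch(weights)
approx = layer(x)
exact = [sum(w * v for w, v in zip(row, x)) for row in weights]

print("exact :", exact)
print("approx:", approx)
```

Because the activation scale is recomputed for every input, no calibration data is needed, at the cost of the extra min/max pass on each call.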

In [None]:
model

In [None]:
# Replace every nn.Linear layer with a dynamically quantized int8 version (runs on CPU)
quantized_model = torch.quantization.quantize_dynamic(
    model, {torch.nn.Linear}, dtype=torch.qint8
).to('cpu')

# Model size after quantization
print(f"Model size after quantization: {get_model_size(quantized_model):.1f} MB")

In [None]:
tic = time.time()

with torch.no_grad():
    # Use the same max_length as the baseline so the timings are comparable
    output = quantized_model.generate(inputs['input_ids'], max_length=50)

elapsed_time = time.time() - tic

# Decode the quantized model's output, not the baseline's
output_decoded = tokenizer.decode(output[0], skip_special_tokens=True)

print("\nQuantized model output:", output_decoded)
print("\nTime taken for quantized model:", elapsed_time)

In [None]:
quantized_model

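Transformers also integrates several quantization backends directly. The next cells use Quanto to quantize the weights to int8 while loading the model. See the quantization overview for the available methods: https://huggingface.co/docs/transformers/v4.46.0/quantization/overview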

In [None]:
from transformers import QuantoConfig

# Quantize the weights to int8 at load time and leave lm_head unconverted
quantization_config = QuantoConfig(weights="int8", modules_to_not_convert=["lm_head"])
quantized_model = AutoModelForCausalLM.from_pretrained(model_id, quantization_config=quantization_config)

print(f"Model size after quantization: {get_model_size(quantized_model):.1f} MB")

In [None]:
tic = time.time()

with torch.no_grad():
    output = quantized_model.generate(inputs['input_ids'], max_length=50)

elapsed_time = time.time() - tic

# Decode the quantized model's output, not the baseline's
output_decoded = tokenizer.decode(output[0], skip_special_tokens=True)

print("\nQuantized model output:", output_decoded)
print("\nTime taken for quantized model:", elapsed_time)

### 2. Static Quantization

Static quantization, on the other hand, converts both weights and activations to lower-precision values using quantization parameters that are fixed before inference. The activation ranges are determined by calibrating the model on a small, representative subset of data.

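- **Advantages:**
  - Can reduce memory use beyond dynamic quantization, because activations are also stored in low precision.
  - Can speed up inference more effectively, since activation quantization parameters are fixed ahead of time rather than computed at runtime.

- **Limitations:**
  - Requires a representative calibration dataset, which adds an extra step. Ranges calibrated on unrepresentative data can hurt accuracy.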
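The calibration step can be sketched as follows: an observer records the minimum and maximum activation seen on calibration batches, and the resulting scale and zero-point are then frozen for inference. This is a simplified min/max illustration in plain Python with made-up numbers; `MinMaxObserver` and `quantize_with` are hypothetical names, not the API of any library.

```python
class MinMaxObserver:
    """Tracks the running min/max of the values it is shown."""

    def __init__(self):
        self.lo = float("inf")
        self.hi = float("-inf")

    def observe(self, values):
        self.lo = min(self.lo, min(values))
        self.hi = max(self.hi, max(values))

    def qparams(self):
        # Include 0.0 in the range so zero is exactly representable.
        lo, hi = min(self.lo, 0.0), max(self.hi, 0.0)
        scale = (hi - lo) / 255 or 1.0
        zero_point = round(-128 - lo / scale)
        return scale, zero_point


def quantize_with(values, scale, zero_point):
    """Quantize with fixed parameters; out-of-range values saturate."""
    return [max(-128, min(127, round(v / scale) + zero_point)) for v in values]


# Made-up activation values standing in for calibration batches.
calibration_batches = [[0.1, 0.9, -0.3], [1.5, -0.7, 0.2], [0.0, 2.0, -1.0]]

observer = MinMaxObserver()
for batch in calibration_batches:
    observer.observe(batch)

scale, zero_point = observer.qparams()  # frozen after calibration

in_range = quantize_with([0.5, -0.5], scale, zero_point)
out_of_range = quantize_with([10.0], scale, zero_point)  # clipped to the int8 limit

print("scale:", scale, "zero_point:", zero_point)
print("in range    :", in_range)
print("out of range:", out_of_range)
```

The last line shows why the calibration data must be representative: a value far outside the calibrated range is clipped, and that information is lost.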

In [43]:
tokenizer = AutoTokenizer.from_pretrained(model_id) 

tokenizer.pad_token_id = tokenizer.eos_token_id

In [None]:
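from functools import partial
from transformers import AutoTokenizer
from optimum.onnxruntime import ORTQuantizer, ORTModelForCausalLM
from optimum.onnxruntime.configuration import AutoQuantizationConfig, AutoCalibrationConfig

model_id = "meta-llama/Llama-3.2-1B"

# Export the checkpoint to ONNX. Despite its name, `quantized_model` is still the
# unquantized ONNX model at this point; quantization happens in the cells below.
quantized_model = ORTModelForCausalLM.from_pretrained(model_id, export=True)
tokenizer = AutoTokenizer.from_pretrained(model_id)
# Reloading the tokenizer resets the padding token, which padding="max_length" needs
tokenizer.pad_token = tokenizer.eos_token
quantizer = ORTQuantizer.from_pretrained(quantized_model)
quantizer.tokenizer = tokenizer
# Static int8 configuration for ARM64 CPUs, with per-tensor rather than per-channel scales
qconfig = AutoQuantizationConfig.arm64(is_static=True, per_channel=False)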

In [None]:
# Preprocessing function for the calibration dataset
def preprocess_fn(ex, tokenizer):
    return tokenizer(ex["text"], truncation=True, padding="max_length", max_length=512)

# Load the calibration dataset
calibration_dataset = quantizer.get_calibration_dataset(
    "wikitext",
    dataset_config_name="wikitext-2-raw-v1",
    preprocess_function=partial(preprocess_fn, tokenizer=tokenizer),
    num_samples=50,
    dataset_split="train",
)

# Use the min/max of the observed activations to set the quantization ranges
calibration_config = AutoCalibrationConfig.minmax(calibration_dataset)

In [None]:
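# Optional: load a dataset from the Hugging Face Hub to build a custom calibration set.
# Note: this is a large download and is not used by the calibration step below,
# which relies on `calibration_dataset`.
from datasets import load_dataset

dataset = load_dataset("wikipedia", "20220301.en", trust_remote_code=True)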

In [None]:
# Perform the calibration step: computes the activations quantization ranges
ranges = quantizer.fit(
    dataset=calibration_dataset,
    calibration_config=calibration_config,
    operators_to_quantize=qconfig.operators_to_quantize,
)

# Apply static quantization on the model
model_quantized_path = quantizer.quantize(
    save_dir="static_quantized_model",
    calibration_tensors_range=ranges,
    quantization_config=qconfig,
)

In [None]:
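text = "The secret of life is"
inputs = tokenizer(text, return_tensors="pt")

# Load the statically quantized model written by quantizer.quantize().
# Assumption: it was saved as "model_quantized.onnx"; adjust file_name if your save directory differs.
static_model = ORTModelForCausalLM.from_pretrained(model_quantized_path, file_name="model_quantized.onnx")

tic = time.time()

with torch.no_grad():
    # Use the same max_length as the baseline so the timings are comparable
    output = static_model.generate(inputs['input_ids'], max_length=50)

elapsed_time = time.time() - tic

# Decode the quantized model's output, not the baseline's
output_decoded = tokenizer.decode(output[0], skip_special_tokens=True)

print("\nQuantized model output:", output_decoded)
print("\nTime taken for quantized model:", elapsed_time)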