### **What is Quantization?**
Quantization reduces the precision of a model's weights and activations to use fewer bits, enabling:
- **Smaller model size** (e.g., 4-bit vs 32-bit → 8x compression)
- **Faster inference** (less memory bandwidth needed)
- **Lower hardware requirements** (runs on consumer GPUs)
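
To make the idea concrete, here is a minimal sketch of symmetric "absmax" quantization to 8-bit integers in plain Python. It is an illustration only: the function names are hypothetical, and real libraries such as bitsandbytes use more elaborate block-wise schemes.

```python
def quantize_absmax_int8(values):
    """Map floats to integers in [-127, 127] using one shared scale."""
    max_abs = max(abs(v) for v in values)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    codes = [round(v / scale) for v in values]
    return codes, scale


def dequantize(codes, scale):
    """Recover approximate floats from the integer codes."""
    return [c * scale for c in codes]


weights = [0.12, -0.53, 0.98, -1.27, 0.004]
codes, scale = quantize_absmax_int8(weights)
restored = dequantize(codes, scale)
max_error = max(abs(w - r) for w, r in zip(weights, restored))

print("integer codes:", codes)
print(f"scale={scale:.4f}, max round-trip error={max_error:.4f}")
```

Each weight is now stored as one small integer plus a shared scale, and the round-trip error is bounded by half a quantization step.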

### **Types of Quantization in LLMs**

#### **1. Post-Training Quantization (PTQ)**
Quantize a pre-trained model without retraining.

| Type          | Bits | Key Features                          | Use Case                     |
|---------------|------|---------------------------------------|------------------------------|
| **FP16**      | 16   | Half-precision float                  | Baseline for comparisons     |
| **INT8**      | 8    | Simple 8-bit integer                  | Balanced speed/accuracy      |
| **NF4**       | 4    | 4-bit "Normal Float" (optimal bins)   | QLoRA fine-tuning            |
| **GPTQ**      | 3-4  | Layer-wise calibration                | GPU inference                |

```python
# Example: 8-bit quantization via bitsandbytes
from transformers import BitsAndBytesConfig
quant_config = BitsAndBytesConfig(load_in_8bit=True)
```

#### **2. Quantization-Aware Training (QAT)**
The model is trained or fine-tuned with quantization simulated in the forward pass, so the weights learn to tolerate rounding error.

| Type          | Bits | Key Features                          |
|---------------|------|---------------------------------------|
| **QAT-FP8**   | 8    | Keeps a floating-point format         |
| **QAT-INT4**  | 4    | Simulates 4-bit during training       |
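
The "simulated quantization" idea can be sketched with a one-weight toy problem. This is an assumption-laden illustration, not how any particular framework implements QAT: the forward pass uses a rounded 4-bit weight, while the gradient is applied to a full-precision copy as if rounding were the identity (the straight-through estimator). The scale of 0.25 and the helper names are made up for the example.

```python
def fake_quantize(w, scale=0.25, qmin=-8, qmax=7):
    """Round w to the nearest signed 4-bit level, but return it as a float."""
    code = max(qmin, min(qmax, round(w / scale)))
    return code * scale


def train(target=0.73, steps=50, lr=0.1):
    w = 0.0  # full-precision "shadow" weight that the optimizer updates
    x = 1.0  # single toy input
    for _ in range(steps):
        pred = fake_quantize(w) * x   # forward pass sees the quantized weight
        error = pred - target * x
        # Straight-through estimator: treat d(fake_quantize)/dw as 1,
        # so the gradient reaches the full-precision weight unchanged.
        grad = 2 * error * x
        w -= lr * grad
    return w


w = train()
print(f"shadow weight={w:.3f}, quantized weight={fake_quantize(w)}")
```

The shadow weight settles near the boundary between the two 4-bit levels closest to the target, so the deployed (quantized) weight ends up on one of them.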

#### **3. Hybrid Quantization**
Combines different precisions:
- **Weights**: 4-bit (e.g., NF4)
- **Activations**: 8/16-bit

```python
# QLoRA hybrid example
import torch
from transformers import BitsAndBytesConfig

quant_config = BitsAndBytesConfig(
    load_in_4bit=True,                      # weights stored in 4 bits
    bnb_4bit_compute_dtype=torch.bfloat16   # matmuls and activations run in bfloat16
)
```

### **Key Tradeoffs**
| Technique     | Memory Savings (vs. FP32) | Accuracy Drop (rough, task-dependent) | Hardware Support |
|---------------|---------------------------|---------------------------------------|------------------|
| FP16          | 2x                        | None                                  | Universal        |
| INT8          | 4x                        | ~1-2%                                 | NVIDIA GPUs      |
| NF4 (QLoRA)   | 8x                        | ~2-5%                                 | Recent GPUs      |
| GPTQ          | 8-10x (4-bit to 3-bit)    | ~5-10%                                | Consumer GPUs    |
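
The memory-savings column follows directly from the number of bits stored per weight. The sketch below estimates weight memory only; `NUM_PARAMS` is an assumed round figure for an 8B-parameter model, and real footprints come out higher because some layers stay unquantized and quantization constants need storage too.

```python
def estimate_weight_memory_gb(num_params, bits_per_param):
    """Weight storage in GB (10^9 bytes), ignoring activations and overhead."""
    return num_params * bits_per_param / 8 / 1e9


NUM_PARAMS = 8_000_000_000  # assumption: round figure for an 8B model

for name, bits in [("FP32", 32), ("FP16", 16), ("INT8", 8), ("4-bit", 4)]:
    gb = estimate_weight_memory_gb(NUM_PARAMS, bits)
    print(f"{name:>5}: {gb:5.1f} GB ({32 // bits}x smaller than FP32)")
```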


### **Why QLoRA Uses NF4**
1. **Optimal binning**: Places the 16 4-bit levels at quantiles of a normal distribution, which matches how pretrained weights are typically distributed
2. **Double quantization**: Compresses quantization constants
3. **bfloat16 compute**: Maintains stability during fine-tuning
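
The binning idea in point 1 can be sketched with the standard library. This is a simplified illustration, not the actual NF4 codebook: the real one differs in detail (for example, it reserves an exact zero), and `normal_quantile_levels` is a hypothetical helper.

```python
from statistics import NormalDist


def normal_quantile_levels(bits=4):
    """Place 2**bits levels at the midpoints of equal-probability bins of N(0, 1)."""
    n = 2 ** bits
    normal = NormalDist()
    quantiles = [normal.inv_cdf((i + 0.5) / n) for i in range(n)]
    max_abs = max(abs(q) for q in quantiles)
    return [q / max_abs for q in quantiles]  # rescale into [-1, 1]


levels = normal_quantile_levels(4)
inner_gap = levels[8] - levels[7]     # spacing around zero
outer_gap = levels[15] - levels[14]   # spacing at the tail

print([round(level, 3) for level in levels])
print(f"gap near zero={inner_gap:.3f}, gap at the tail={outer_gap:.3f}")
```

The levels crowd together near zero, where most weights live, and spread out in the tails. A uniform 4-bit grid would spend the same resolution everywhere.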

```python
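# Recommended QLoRA config
import torch
from transformers import BitsAndBytesConfig

quant_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",       # <- Normal Float 4-bit type
    bnb_4bit_use_double_quant=True,  # <- Also quantizes the quantization constants
    bnb_4bit_compute_dtype=torch.bfloat16
)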
```
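
To see what double quantization buys, the arithmetic below estimates the per-parameter overhead of storing quantization constants. The block sizes (64 weights per first-level block, 256 constants per second-level block) and bit widths are assumptions taken from the description in the QLoRA paper, not values read from the library.

```python
def constant_overhead_bits(block_size=64, const_bits=32):
    """Extra bits per weight when each block stores one full-precision constant."""
    return const_bits / block_size


def double_quant_overhead_bits(block_size=64, first_bits=8,
                               second_block_size=256, second_bits=32):
    """Extra bits per weight when the constants are themselves quantized."""
    first_level = first_bits / block_size
    second_level = second_bits / (block_size * second_block_size)
    return first_level + second_level


plain = constant_overhead_bits()
double = double_quant_overhead_bits()
saved_gb = (plain - double) * 8_000_000_000 / 8 / 1e9  # assumed 8B parameters

print(f"without double quantization: {plain:.3f} bits/param")
print(f"with double quantization:    {double:.3f} bits/param")
print(f"approximate saving on an 8B model: {saved_gb:.2f} GB")
```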

### **Today's Learning Objectives**
1. Understand QLoRA and its advantages for efficient fine-tuning
2. Compare different quantization approaches
3. Analyze memory footprint of different model configurations
4. Examine LoRA adapter architecture

### **Setup**


In [None]:
!pip install -q datasets requests torch peft bitsandbytes transformers trl accelerate sentencepiece

In [None]:
# Import with clear grouping
import os
import re
import math
from datetime import datetime
from tqdm import tqdm

# HuggingFace and Colab specific
from google.colab import userdata
from huggingface_hub import login

# PyTorch and Transformers
import torch
import transformers
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    BitsAndBytesConfig,
    TrainingArguments,
    set_seed
)
from peft import LoraConfig, PeftModel

In [None]:
# Constants
BASE_MODEL = "meta-llama/Meta-Llama-3.1-8B"
FINETUNED_MODEL = "ed-donner/pricer-2024-09-13_13.04.39"

# QLoRA Hyperparameters

LORA_R = 32          # LoRA rank (dimension of the low-rank matrices)
LORA_ALPHA = 64      # Scaling factor for LoRA weights
TARGET_MODULES = [   # Which layers to apply LoRA to
    "q_proj",        # Query projection
    "v_proj",        # Value projection
    "k_proj",        # Key projection
    "o_proj"         # Output projection
]

Before proceeding, you'll need:
1. A HuggingFace account (https://huggingface.co)
2. An access token (create one at https://huggingface.co/settings/tokens)
3. The token saved in Colab secrets (Key icon → New secret) under the name 'HF_TOKEN'
4. Access to the gated meta-llama model, requested on its HuggingFace model page

In [None]:
hf_token = userdata.get('HF_TOKEN')
login(hf_token, add_to_git_credential=True)

## Quantization Comparison

We'll compare three configurations:
1. No quantization (full precision)
2. 8-bit quantization
3. 4-bit quantization (QLoRA)

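Note: loading a second copy of the model on top of the first can exhaust GPU memory. After each full model load, clear GPU memory by choosing Runtime → Restart session and re-running the setup cells before moving to the next configuration.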
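
As a lighter-weight alternative to restarting, the following sketch tries to release GPU memory from within the session. Treat it as a best-effort helper (the name `free_gpu_memory` is hypothetical): any lingering reference to the model keeps its memory alive, so a restart remains the more reliable option.

```python
import gc


def free_gpu_memory():
    """Run garbage collection and release cached CUDA memory if possible.

    Returns True when the CUDA cache was cleared, False otherwise.
    """
    gc.collect()  # drop unreachable Python objects first
    try:
        import torch
    except ImportError:
        return False  # PyTorch is not installed in this environment
    if torch.cuda.is_available():
        torch.cuda.empty_cache()  # hand cached blocks back to the driver
        return True
    return False


cleared = free_gpu_memory()
print("CUDA cache cleared:", cleared)
```

In the notebook you would run `del base_model` first, then call `free_gpu_memory()`.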

## 1. No Quantization

### Base Model (No Quantization)
- Full 32-bit precision (float32)
- Maximum memory usage: 4 bytes per parameter
- Reference quality: no quantization error, so the other configurations are measured against it

In [None]:
base_model = AutoModelForCausalLM.from_pretrained(
    BASE_MODEL,
    device_map="auto"
)

In [None]:
print(f"Memory footprint: {base_model.get_memory_footprint() / 1e9:,.1f} GB")

In [None]:
base_model

## 2. 8-bit Quantization
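- Stores quantized weights in 1 byte each instead of 4, so weight memory drops to roughly a quarter of float32
- Usually only a small accuracy loss
- Good balance of memory and quality for many applications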


In [None]:
quant_config = BitsAndBytesConfig(load_in_8bit=True)

In [None]:
base_model = AutoModelForCausalLM.from_pretrained(
    BASE_MODEL,
    quantization_config=quant_config,
    device_map="auto",
)

In [None]:
print(f"Memory footprint: {base_model.get_memory_footprint() / 1e9:,.1f} GB")

In [None]:
base_model

## 3. 4-bit QLoRA

### 4-bit QLoRA Configuration
- Most memory efficient
- Uses 'nf4' (normal float 4) quantization
- Double quantization for additional savings
- bfloat16 compute dtype for stability

In [None]:
quant_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_quant_type="nf4"
)

In [None]:
base_model = AutoModelForCausalLM.from_pretrained(
    BASE_MODEL,
    quantization_config=quant_config,
    device_map="auto",
)

In [None]:
print(f"Memory footprint: {base_model.get_memory_footprint() / 1e9:,.2f} GB")


In [None]:
base_model

## Loading Fine-tuned Adapters
The PeftModel combines:
1. Original quantized base model
2. Trained LoRA adapters

In [None]:
fine_tuned_model = PeftModel.from_pretrained(base_model, FINETUNED_MODEL)

In [None]:
print(f"Memory footprint with adapters: {fine_tuned_model.get_memory_footprint() / 1e9:,.2f} GB")

In [None]:
fine_tuned_model

## LoRA Architecture Analysis

### How LoRA Adapts Pretrained Models

LoRA (Low-Rank Adaptation) works by injecting **trainable low-rank matrices** into specific layers while keeping the original weights frozen. This is more efficient than full fine-tuning because:

1. Only ~0.1-1% of parameters are updated
2. The base weights stay frozen, so removing the adapter restores the original model exactly
3. Adapters can be swapped for different tasks


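Each target module gets two low-rank matrices:
- lora_A (dimension: input_dim × r), which projects the input down to rank r
- lora_B (dimension: r × output_dim), which projects back up to the module's output size

Here r = LORA_R (32 in our case). For q_proj and o_proj both dimensions are 4096; k_proj and v_proj map 4096 → 1024.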
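
The way the two matrices combine with the frozen weight can be sketched in plain Python with toy dimensions. This is a minimal illustration of the standard LoRA formulation, output = W·x + (alpha / r)·B·(A·x), and not the PEFT implementation. The matrix values are made up, and the matrices are stored row-major so that `A` has r rows and `B` has output_dim rows.

```python
def matvec(matrix, vector):
    """Multiply a row-major matrix by a vector."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


def lora_forward(x, W, A, B, alpha, r):
    """Frozen path W x plus the scaled low-rank update B (A x)."""
    base = matvec(W, x)                 # frozen pretrained weights
    update = matvec(B, matvec(A, x))    # down-project to rank r, then up-project
    scaling = alpha / r
    return [b + scaling * u for b, u in zip(base, update)]


x = [1.0, 2.0, 3.0, 4.0]                       # input_dim = 4
W = [[1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 1, 0]]  # output_dim = 3
A = [[1, 0, 0, 0], [0, 0, 0, 1]]                # r = 2
B_zero = [[0, 0], [0, 0], [0, 0]]               # typical init: B starts at zero
B_trained = [[0.5, 0], [0, 0], [0, 0.25]]       # made-up "trained" values

before = lora_forward(x, W, A, B_zero, alpha=64, r=32)
after = lora_forward(x, W, A, B_trained, alpha=64, r=32)
print("before training:", before)
print("after training: ", after)
```

With `B` at zero the adapter contributes nothing, so fine-tuning starts from the unmodified base model. With LORA_ALPHA = 64 and LORA_R = 32 the update is scaled by 2.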

In [None]:
# Calculate adapter parameters for one layer: (input_dim × r) + (r × output_dim)
lora_q_proj = 4096 * 32 + 4096 * 32
# k_proj and v_proj output 1024 dims: Llama 3.1 8B uses grouped-query attention (8 KV heads × 128)
lora_k_proj = 4096 * 32 + 1024 * 32
lora_v_proj = 4096 * 32 + 1024 * 32
lora_o_proj = 4096 * 32 + 4096 * 32

In [None]:
lora_layer = lora_q_proj + lora_k_proj + lora_v_proj + lora_o_proj
total_params = lora_layer * 32  # 32 layers in the model
size_mb = (total_params * 4) / 1_000_000  # 4 bytes per parameter (float32)

In [None]:
print(f"Total number of params: {total_params:,} and size {size_mb:,.1f}MB")
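
The same calculation can be generalized to other ranks, which also lets us check the "~0.1-1% of parameters" claim. The helper names are hypothetical, and `BASE_PARAMS` is an assumed approximate parameter count for the base model.

```python
def lora_params_per_module(in_dim, out_dim, r):
    """Parameters in lora_A (in_dim x r) plus lora_B (r x out_dim)."""
    return in_dim * r + r * out_dim


def lora_total_params(module_shapes, r, num_layers):
    """Total adapter parameters across all target modules and layers."""
    per_layer = sum(
        lora_params_per_module(in_dim, out_dim, r)
        for in_dim, out_dim in module_shapes.values()
    )
    return per_layer * num_layers


SHAPES = {  # (input_dim, output_dim) for each target module
    "q_proj": (4096, 4096),
    "k_proj": (4096, 1024),
    "v_proj": (4096, 1024),
    "o_proj": (4096, 4096),
}
BASE_PARAMS = 8_030_000_000  # assumption: approximate size of the 8B base model

for r in (8, 16, 32):
    total = lora_total_params(SHAPES, r, num_layers=32)
    share = 100 * total / BASE_PARAMS
    print(f"r={r:>2}: {total:>12,} params, {total * 4 / 1e6:6.1f} MB in float32, {share:.2f}% of base")
```

Adapter size grows linearly with the rank, so halving r halves both the parameter count and the storage.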

### Key Takeaways:
1. QLoRA enables efficient fine-tuning with minimal memory overhead
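2. 4-bit quantization cuts weight memory by up to ~8x compared to 32-bit precision; the measured footprint shrinks somewhat less because some layers stay in higher precision
3. LoRA adapters add only ~27 million parameters (~109 MB in float32), a small fraction of the base model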