# 📚 Quantization Basics: Your First Steps into LLM Optimization

Welcome to the world of Large Language Model optimization! This interactive tutorial will teach you the fundamentals of quantization - the art and science of making AI models smaller, faster, and more efficient.

## 🎯 What You'll Learn
- What quantization is and why it matters
- Different types of quantization (4-bit, 8-bit, 16-bit)
- Hands-on quantization of your first model
- How to measure and compare results
- Real-world applications and use cases

## ⏱️ Time Required: 30-45 minutes
## 📋 Prerequisites: Basic Python knowledge

## 🔍 What is Quantization?

Imagine you have a high-resolution photo that takes up 10MB of space. You can compress it to 2MB with minimal visual quality loss. Quantization does something similar for AI models - it reduces the precision of the numbers used in the model to save memory and, in many cases, increase speed.

### The Magic Numbers
- **32-bit (FP32)**: Full precision - like a 4K photo
- **16-bit (FP16)**: Half precision - like 1080p video
- **8-bit (INT8)**: Quarter precision - like a compressed image
- **4-bit (INT4)**: Ultra compression - like a thumbnail
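
As a rough sketch of what "reducing precision" means, the snippet below maps a few FP32 values onto signed 8-bit integers using a single scale factor, then maps them back. The function names `quantize_int8` and `dequantize_int8` are made up for this illustration, and symmetric per-tensor scaling is an assumption; real libraries use more elaborate schemes.

```python
import numpy as np

def quantize_int8(values):
    """Symmetric per-tensor quantization to the int8 range [-127, 127]."""
    values = np.asarray(values, dtype=np.float32)
    max_abs = float(np.abs(values).max())
    scale = max_abs / 127.0 if max_abs > 0 else 1.0  # float value of one integer step
    q = np.clip(np.round(values / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize_int8(q, scale):
    """Map int8 codes back to approximate float values."""
    return q.astype(np.float32) * scale

# Illustrative weights (hypothetical values, one of them negative)
weights = np.array([3.14159265, -2.71828182, 1.41421356, 0.57721566], dtype=np.float32)
codes, scale = quantize_int8(weights)
restored = dequantize_int8(codes, scale)

print("int8 codes:", codes)
print("restored:  ", restored)
print("max error: ", np.abs(weights - restored).max())
```

Each restored value lands within half a step (`scale / 2`) of the original, which is the price paid for storing one byte instead of four.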

Let's see this in action!

In [None]:
# Let's start with the basics - understanding number precision
import torch
import numpy as np
import matplotlib.pyplot as plt

# Create a sample "weight" - this represents a tiny piece of an AI model
original_weight = torch.tensor([3.14159265, 2.71828182, 1.41421356, 0.57721566])
print(f"Original weights (32-bit): {original_weight}")

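# Convert to different precisions
weight_16bit = original_weight.half()  # 16-bit
# Simulate integer quantization: scale the largest magnitude onto the integer range,
# round to the nearest integer step, then scale back to floats
max_abs = original_weight.abs().max()
weight_8bit = torch.round(original_weight / max_abs * 127) / 127 * max_abs  # Simulated 8-bit (-127..127)
weight_4bit = torch.round(original_weight / max_abs * 7) / 7 * max_abs      # Simulated 4-bit (-7..7)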

print(f"16-bit weights: {weight_16bit}")
print(f"8-bit weights:  {weight_8bit}")
print(f"4-bit weights:  {weight_4bit}")

# Calculate memory savings
memory_32bit = original_weight.element_size() * original_weight.numel()
memory_16bit = weight_16bit.element_size() * weight_16bit.numel()
memory_8bit = memory_32bit // 4  # 8-bit uses 1/4 the memory
memory_4bit = memory_32bit // 8  # 4-bit uses 1/8 the memory

print(f"\n💾 Memory Usage:")
print(f"32-bit: {memory_32bit} bytes")
print(f"16-bit: {memory_16bit} bytes ({memory_32bit/memory_16bit:.1f}x smaller)")
print(f"8-bit:  {memory_8bit} bytes ({memory_32bit/memory_8bit:.1f}x smaller)")
print(f"4-bit:  {memory_4bit} bytes ({memory_32bit/memory_4bit:.1f}x smaller)")
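
The same arithmetic scales up to whole models. The sketch below estimates weight storage from a parameter count and a bit width; `estimate_weight_memory_mb` is a hypothetical helper, the 117M figure is only an illustrative parameter count, and the estimate ignores activations and any per-block quantization metadata.

```python
def estimate_weight_memory_mb(num_params, bits_per_weight):
    """Approximate weight storage in MB (1024**2 bytes), ignoring metadata overhead."""
    total_bytes = num_params * bits_per_weight / 8
    return total_bytes / 1024**2

num_params = 117_000_000  # illustrative parameter count, not a measured value
for bits in (32, 16, 8, 4):
    size_mb = estimate_weight_memory_mb(num_params, bits)
    print(f"{bits:>2}-bit: {size_mb:7.1f} MB")
```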

## 🎨 Visualizing Quantization

Let's create a visual representation of how quantization affects data:

In [None]:
# Create a visualization of quantization effects
plt.figure(figsize=(12, 8))

# Generate smooth curve (representing model weights)
x = np.linspace(0, 4*np.pi, 1000)
y_original = np.sin(x) + 0.5 * np.sin(3*x)

# Simulate different quantization levels
def quantize(data, levels):
    min_val, max_val = data.min(), data.max()
    scale = (max_val - min_val) / (levels - 1)
    quantized = np.round((data - min_val) / scale) * scale + min_val
    return quantized

y_32bit = y_original  # No quantization
y_8bit = quantize(y_original, 256)  # 8-bit = 256 levels
y_4bit = quantize(y_original, 16)   # 4-bit = 16 levels
y_2bit = quantize(y_original, 4)    # 2-bit = 4 levels

# Plot comparisons
plt.subplot(2, 2, 1)
plt.plot(x, y_32bit, 'b-', linewidth=2, label='32-bit (Original)')
plt.title('32-bit: Full Precision')
plt.legend()
plt.grid(True, alpha=0.3)

plt.subplot(2, 2, 2)
plt.plot(x, y_original, 'b-', alpha=0.5, label='Original')
plt.plot(x, y_8bit, 'r-', linewidth=2, label='8-bit')
plt.title('8-bit: Minimal Loss')
plt.legend()
plt.grid(True, alpha=0.3)

plt.subplot(2, 2, 3)
plt.plot(x, y_original, 'b-', alpha=0.5, label='Original')
plt.plot(x, y_4bit, 'g-', linewidth=2, label='4-bit')
plt.title('4-bit: Noticeable but Acceptable')
plt.legend()
plt.grid(True, alpha=0.3)

plt.subplot(2, 2, 4)
plt.plot(x, y_original, 'b-', alpha=0.5, label='Original')
plt.plot(x, y_2bit, 'm-', linewidth=2, label='2-bit')
plt.title('2-bit: Significant Loss')
plt.legend()
plt.grid(True, alpha=0.3)

plt.tight_layout()
plt.suptitle('📊 How Quantization Affects Data Quality', fontsize=16, y=1.02)
plt.show()

# Calculate and display error metrics
mse_8bit = np.mean((y_original - y_8bit)**2)
mse_4bit = np.mean((y_original - y_4bit)**2)
mse_2bit = np.mean((y_original - y_2bit)**2)

print(f"📈 Quantization Error (Mean Squared Error):")
print(f"8-bit:  {mse_8bit:.6f} (Excellent)")
print(f"4-bit:  {mse_4bit:.6f} (Good)")
print(f"2-bit:  {mse_2bit:.6f} (Poor)")
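
For a uniform quantizer whose rounding errors are spread evenly within each step, the expected MSE is roughly step²/12, so each extra bit (half the step size) cuts the error by about 4x. The sketch below checks this rule of thumb on uniformly distributed random data. `uniform_quantize` is a standalone copy of the rounding logic used above, and the random data is an assumption for illustration, not the curve plotted earlier.

```python
import numpy as np

def uniform_quantize(data, levels):
    """Round data onto `levels` evenly spaced values between its min and max."""
    lo, hi = data.min(), data.max()
    step = (hi - lo) / (levels - 1)
    return np.round((data - lo) / step) * step + lo, step

rng = np.random.default_rng(0)
data = rng.uniform(-1.0, 1.0, size=100_000)

for bits in (8, 4, 2):
    quantized, step = uniform_quantize(data, 2**bits)
    mse = np.mean((data - quantized) ** 2)
    print(f"{bits}-bit: measured MSE={mse:.2e}, step^2/12={step**2 / 12:.2e}")
```

Real model weights are not uniformly distributed, so treat the rule as a first approximation rather than a guarantee.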

## 🤖 Your First Model Quantization

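Now let's quantize a real language model! We'll start with a small model so it loads quickly. The full-precision model runs on a CPU, but the 8-bit step below relies on bitsandbytes, which generally needs a CUDA-capable GPU.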

In [None]:
# Install required packages (run this if you haven't already)
# !pip install transformers torch bitsandbytes accelerate matplotlib pandas psutil

from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
import time
import psutil
import os

# Choose a small model for this tutorial
model_name = "microsoft/DialoGPT-small"  # Small GPT-2-style model (reported as 117M parameters)

print(f"🤖 Loading model: {model_name}")
print("This might take a minute the first time...")

# Load the original model (32-bit)
start_time = time.time()
tokenizer = AutoTokenizer.from_pretrained(model_name)
model_original = AutoModelForCausalLM.from_pretrained(model_name)
load_time_original = time.time() - start_time

# Calculate original model size
def get_model_size(model):
    total_params = sum(p.numel() for p in model.parameters())
    total_size = sum(p.numel() * p.element_size() for p in model.parameters())
    return total_params, total_size

params_orig, size_orig = get_model_size(model_original)

print(f"\n📊 Original Model Stats:")
print(f"Parameters: {params_orig:,}")
print(f"Size: {size_orig / 1024**2:.1f} MB")
print(f"Load time: {load_time_original:.2f} seconds")

In [None]:
# Now let's create a quantized version!
print("🔧 Creating 8-bit quantized model...")

# Configure 8-bit quantization
quantization_config = BitsAndBytesConfig(
    load_in_8bit=True,
    llm_int8_threshold=6.0  # Threshold for outlier detection
)

# Load quantized model
start_time = time.time()
model_8bit = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=quantization_config,
    device_map="auto"
)
load_time_8bit = time.time() - start_time

# Measure the quantized model's parameter storage
params_8bit, size_8bit = get_model_size(model_8bit)
# Idealized bound: if every weight took 1 byte, the model would be 1/4 of its FP32 size.
# In practice some tensors (e.g. embeddings, layer norms) stay in higher precision,
# so the measured size is usually larger than this estimate.
estimated_8bit_size = size_orig / 4

print(f"\n📊 8-bit Model Stats:")
print(f"Parameters: {params_8bit:,} (same as original)")
print(f"Measured parameter size: {size_8bit / 1024**2:.1f} MB")
print(f"Estimated size: {estimated_8bit_size / 1024**2:.1f} MB")
print(f"Load time: {load_time_8bit:.2f} seconds")
print(f"Memory savings: {size_orig / estimated_8bit_size:.1f}x smaller! 🎉")
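
The `llm_int8_threshold` setting relates to how outliers are treated: as we understand the LLM.int8() approach, feature dimensions containing values whose magnitude exceeds the threshold are handled in higher precision, while the rest go through the int8 path. The sketch below only illustrates that splitting idea on a toy activation matrix. `split_outlier_columns` is a hypothetical helper, not part of bitsandbytes, and the real kernels are considerably more involved.

```python
import numpy as np

def split_outlier_columns(hidden_states, threshold=6.0):
    """Return indices of columns with and without any |value| above threshold."""
    has_outlier = (np.abs(hidden_states) > threshold).any(axis=0)
    outlier_cols = np.flatnonzero(has_outlier)
    regular_cols = np.flatnonzero(~has_outlier)
    return outlier_cols, regular_cols

# Toy activations: 3 tokens x 5 feature dimensions, with one large value in column 2
hidden_states = np.array([
    [0.5, -1.2,  0.3,  2.0, -0.7],
    [1.1,  0.4, 45.0, -0.9,  0.2],
    [-0.3, 0.8, -1.5,  1.7,  0.6],
])

outlier_cols, regular_cols = split_outlier_columns(hidden_states)
print("Higher-precision columns:", outlier_cols)
print("int8 columns:            ", regular_cols)
```

Keeping the rare large values out of the int8 path matters because a single outlier would otherwise stretch the scale and wipe out the resolution available to every other value.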

## 🧪 Testing Model Quality

The big question: Does our quantized model still work well? Let's test it!

In [None]:
# Set up tokenizer padding
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token

# Test prompts
test_prompts = [
    "Hello, how are you?",
    "What's the weather like?",
    "Tell me a joke",
    "Explain artificial intelligence"
]

def generate_response(model, prompt, max_length=50):
    """Generate a response from the model"""
    inputs = tokenizer.encode(prompt + tokenizer.eos_token, return_tensors='pt')
    inputs = inputs.to(model.device)  # Keep inputs on the same device as the model
    
    start_time = time.time()
    with torch.no_grad():
        outputs = model.generate(
            inputs,
            max_length=max_length,
            num_return_sequences=1,
            temperature=0.7,
            do_sample=True,
            pad_token_id=tokenizer.eos_token_id
        )
    generation_time = time.time() - start_time
    
    # Decode only the newly generated tokens, skipping the prompt tokens
    new_tokens = outputs[0][inputs.shape[-1]:]
    response = tokenizer.decode(new_tokens, skip_special_tokens=True).strip()
    
    return response, generation_time

print("🧪 Testing Model Quality\n")
print("=" * 80)

for i, prompt in enumerate(test_prompts, 1):
    print(f"\n🔸 Test {i}: '{prompt}'")
    print("-" * 50)
    
    # Test original model
    response_orig, time_orig = generate_response(model_original, prompt)
    print(f"Original (32-bit): {response_orig}")
    print(f"Time: {time_orig:.3f}s")
    
    # Test quantized model
    response_8bit, time_8bit = generate_response(model_8bit, prompt)
    print(f"Quantized (8-bit): {response_8bit}")
    print(f"Time: {time_8bit:.3f}s")
    
    # Speed comparison (only meaningful when both models run on the same device)
    speedup = time_orig / time_8bit if time_8bit > 0 else 1
    print(f"Speedup: {speedup:.2f}x {'🚀' if speedup > 1 else ''}")

print("\n" + "=" * 80)
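
Because sampling makes every run differ, eyeballing outputs can mislead. One rough way to put a number on how close two responses are is a character-level similarity ratio from the standard library's `difflib`. The helper name `response_similarity` and the sample strings are invented for illustration; the score reflects textual overlap only and says nothing about correctness.

```python
from difflib import SequenceMatcher

def response_similarity(text_a, text_b):
    """Return a 0..1 similarity ratio between two strings (1.0 means identical)."""
    return SequenceMatcher(None, text_a.lower(), text_b.lower()).ratio()

# Made-up example strings standing in for model outputs
original_reply = "I'm doing well, thanks for asking!"
quantized_reply = "I'm doing great, thanks for asking!"

score = response_similarity(original_reply, quantized_reply)
print(f"Similarity: {score:.2f}")
```

For a fairer comparison, consider calling `generate` with `do_sample=False` so both models decode greedily and differences come from quantization rather than random sampling.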

## 📈 Comprehensive Comparison

Let's create a detailed comparison of our models:

In [None]:
# Create a comprehensive comparison
import pandas as pd

# Collect metrics
comparison_data = {
    'Metric': [
        'Parameters',
        'Model Size (MB)',
        'Load Time (s)',
        'Memory Usage',
        'Precision',
        'Compression Ratio'
    ],
    'Original (32-bit)': [
        f"{params_orig:,}",
        f"{size_orig / 1024**2:.1f}",
        f"{load_time_original:.2f}",
        "High",
        "Full (FP32)",
        "1x (baseline)"
    ],
    'Quantized (8-bit)': [
        f"{params_8bit:,}",
        f"{estimated_8bit_size / 1024**2:.1f}",
        f"{load_time_8bit:.2f}",
        "Low",
        "Reduced (INT8)",
        f"{size_orig / estimated_8bit_size:.1f}x smaller"
    ]
}

df = pd.DataFrame(comparison_data)
print("📊 Model Comparison Summary")
print("=" * 60)
print(df.to_string(index=False))

# Create a visual comparison
plt.figure(figsize=(12, 6))

# Memory usage comparison
plt.subplot(1, 2, 1)
models = ['Original\n(32-bit)', 'Quantized\n(8-bit)']
sizes = [size_orig / 1024**2, estimated_8bit_size / 1024**2]
colors = ['#ff7f7f', '#7fbf7f']

bars = plt.bar(models, sizes, color=colors, alpha=0.8)
plt.title('💾 Memory Usage Comparison')
plt.ylabel('Size (MB)')
plt.grid(True, alpha=0.3)

# Add value labels on bars
for bar, size in zip(bars, sizes):
    plt.text(bar.get_x() + bar.get_width()/2, bar.get_height() + 5,
             f'{size:.1f} MB', ha='center', va='bottom', fontweight='bold')

# Speed comparison (if we have timing data)
plt.subplot(1, 2, 2)
load_times = [load_time_original, load_time_8bit]
bars = plt.bar(models, load_times, color=colors, alpha=0.8)
plt.title('⚡ Load Time Comparison')
plt.ylabel('Time (seconds)')
plt.grid(True, alpha=0.3)

# Add value labels
for bar, time_val in zip(bars, load_times):
    plt.text(bar.get_x() + bar.get_width()/2, bar.get_height() + 0.1,
             f'{time_val:.2f}s', ha='center', va='bottom', fontweight='bold')

plt.tight_layout()
plt.show()

# Summary insights
print(f"\n🎯 Key Insights:")
print(f"✅ Estimated memory reduction: {size_orig / estimated_8bit_size:.1f}x (idealized)")
print(f"✅ Compare the responses above to confirm the quantized model stays coherent")
print(f"✅ Load time: {load_time_8bit:.2f}s vs {load_time_original:.2f}s")
print(f"✅ Same number of parameters, just stored more efficiently")

## 🎓 What You've Learned

Congratulations! You've just completed your first quantization experiment. Here's what you accomplished:

### ✅ Key Concepts Mastered
1. **Quantization Basics**: Converting high-precision numbers to lower precision
2. **Memory Efficiency**: How quantization reduces model size
3. **Quality Trade-offs**: Understanding the balance between size and performance
4. **Practical Implementation**: Using real tools to quantize models

### 🔍 What We Observed
- **Memory Savings**: Up to ~4x reduction in weight memory compared with FP32
- **Quality Retention**: Model still generates coherent responses
- **Speed Benefits**: Results vary; 8-bit inference is not always faster than full precision, so measure on your own hardware
- **Minimal Setup**: Easy to implement with existing tools
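
Single timings like the ones above are noisy, so a speed claim is easier to trust when it comes from repeated runs. The following is a minimal sketch of a timing helper with warmup calls and a median; `benchmark` is a hypothetical name and the workload is a stand-in, not a model call.

```python
import statistics
import time

def benchmark(fn, warmup=2, repeats=5):
    """Return the median wall-clock time of fn() in seconds after a few warmup calls."""
    for _ in range(warmup):
        fn()
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        timings.append(time.perf_counter() - start)
    return statistics.median(timings)

# Stand-in workload; in the notebook this could wrap a call to generate_response
median_s = benchmark(lambda: sum(i * i for i in range(100_000)))
print(f"Median time: {median_s * 1000:.2f} ms")
```

The median is used rather than the mean so that one slow run (for example, a cache miss or background process) does not skew the result.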

### 🚀 Real-World Applications
- **Mobile Deployment**: Run larger models on phones/tablets
- **Edge Computing**: Deploy AI in resource-constrained environments
- **Cost Reduction**: Use smaller, cheaper GPUs for inference
- **Batch Processing**: Process more requests simultaneously

## 🎯 Next Steps

Ready to dive deeper? Here are your next learning opportunities:

### Immediate Next Steps (Choose One)
1. **[Tutorial 2: Advanced Quantization](./02_advanced_quantization.ipynb)** - Learn about 4-bit quantization and QLoRA
2. **[Tutorial 3: Model Comparison](./03_model_comparison.ipynb)** - Compare different quantization methods
3. **[Interactive Examples](../../examples/interactive/)** - Try quantization on different model types

### Longer-Term Learning Path
1. **Intermediate**: Explore GPTQ, AWQ, and other advanced methods
2. **Advanced**: Learn about abliteration and model modification
3. **Expert**: Contribute to research and develop new techniques

## 🤝 Get Help & Connect

- **Questions?** Open an issue on our [GitHub repository](https://github.com/your-repo/issues)
- **Discussions**: Join our [community forum](https://github.com/your-repo/discussions)
- **Updates**: Follow our [research blog](https://your-blog.com)

---

**🎉 Congratulations on completing your first quantization tutorial! You're now equipped with the fundamental knowledge to optimize language models efficiently.**

## 🔬 Bonus: Experiment Ideas

Want to explore further? Try these experiments:

### Experiment 1: Different Model Sizes
```python
# Try quantizing different sized models
models_to_try = [
    "microsoft/DialoGPT-small",   # 117M parameters
    "microsoft/DialoGPT-medium",  # 345M parameters
    "gpt2",                       # 124M parameters
    "gpt2-medium"                 # 355M parameters
]
```
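
Before downloading anything, it can help to estimate how large each model should be. The sketch below counts parameters for a GPT-2-style decoder from its layer count and hidden size, assuming the standard GPT-2 layout with tied input/output embeddings. `gpt2_param_count` is a hypothetical helper, the default vocabulary and context sizes are the values we understand the public `gpt2` checkpoints to use, and the INT8 figure is an idealized bound that pretends every weight is stored in one byte.

```python
def gpt2_param_count(n_layer, n_embd, vocab_size=50257, n_positions=1024):
    """Count parameters of a GPT-2-style decoder with tied input/output embeddings."""
    embeddings = vocab_size * n_embd + n_positions * n_embd
    # Attention: fused QKV projection plus output projection (weights + biases)
    attention = (n_embd * 3 * n_embd + 3 * n_embd) + (n_embd * n_embd + n_embd)
    # MLP: expand to 4x the hidden size, then project back (weights + biases)
    mlp = (n_embd * 4 * n_embd + 4 * n_embd) + (4 * n_embd * n_embd + n_embd)
    layer_norms = 2 * (2 * n_embd)  # two layer norms per block, scale and bias each
    final_norm = 2 * n_embd
    return embeddings + n_layer * (attention + mlp + layer_norms) + final_norm

for name, n_layer, n_embd in [("gpt2", 12, 768), ("gpt2-medium", 24, 1024)]:
    params = gpt2_param_count(n_layer, n_embd)
    print(f"{name}: {params:,} parameters, "
          f"~{params * 4 / 1024**2:.0f} MB in FP32, ~{params / 1024**2:.0f} MB in INT8")
```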

### Experiment 2: 4-bit Quantization
```python
# Try even more aggressive quantization
quantization_config_4bit = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16
)
```
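
Part of why 4-bit storage halves memory again is that two 4-bit codes fit in a single byte. The sketch below shows that packing idea in plain Python. `pack_4bit` and `unpack_4bit` are hypothetical helpers for illustration and do not describe how bitsandbytes lays out its NF4 data internally.

```python
def pack_4bit(codes):
    """Pack a list of integers in 0..15 into bytes, two codes per byte."""
    if any(not 0 <= c <= 15 for c in codes):
        raise ValueError("4-bit codes must be in the range 0..15")
    padded = list(codes) + [0] * (len(codes) % 2)  # pad to an even length
    return bytes((hi << 4) | lo for hi, lo in zip(padded[0::2], padded[1::2]))

def unpack_4bit(packed, count):
    """Recover `count` 4-bit codes from packed bytes."""
    codes = []
    for byte in packed:
        codes.extend((byte >> 4, byte & 0x0F))
    return codes[:count]

codes = [3, 15, 0, 7, 9]
packed = pack_4bit(codes)
print(f"{len(codes)} codes stored in {len(packed)} bytes")
assert unpack_4bit(packed, len(codes)) == codes
```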

### Experiment 3: Quality Metrics
```python
# Implement more sophisticated quality measurements
from transformers import pipeline

# Use perplexity or other metrics to measure quality loss
```
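
As a starting point for the perplexity idea, the sketch below computes perplexity from per-token log-probabilities: the exponential of the negative mean log-probability. The `perplexity` helper is hypothetical and the probabilities are made up; in a real experiment the log-probabilities would come from scoring the same text with the original and quantized models.

```python
import math

def perplexity(token_log_probs):
    """Perplexity = exp of the negative mean natural-log probability per token."""
    if not token_log_probs:
        raise ValueError("need at least one token log-probability")
    return math.exp(-sum(token_log_probs) / len(token_log_probs))

# Made-up per-token probabilities standing in for two models scoring the same text
original_log_probs = [math.log(p) for p in (0.50, 0.25, 0.40, 0.10)]
quantized_log_probs = [math.log(p) for p in (0.45, 0.25, 0.35, 0.10)]

print(f"Original perplexity:  {perplexity(original_log_probs):.2f}")
print(f"Quantized perplexity: {perplexity(quantized_log_probs):.2f}")
```

Lower perplexity means the model finds the text less surprising, so a small increase after quantization suggests only a modest quality loss.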

Happy experimenting! 🧪