# Lab 3.2.2: GPTQ Quantization

**Module:** 3.2 - Model Quantization & Optimization  
**Time:** 2 hours  
**Difficulty:** ⭐⭐⭐⭐☆

---

## 🎯 Learning Objectives

By the end of this notebook, you will:
- [ ] Understand how GPTQ quantization works algorithmically
- [ ] Quantize a model using AutoGPTQ
- [ ] Experiment with different group sizes (32, 64, 128)
- [ ] Evaluate the quality/speed tradeoffs
- [ ] Save and load GPTQ quantized models

---

## 📚 Prerequisites

- Completed: Lab 3.2.1 (Quantization Overview)
- Knowledge of: Basic quantization concepts, PyTorch
- Hardware: DGX Spark with 128GB unified memory

---

## 🌍 Real-World Context

**The Problem:** You want to run Llama 7B on a consumer GPU with 8GB VRAM.

- **FP16**: 7B × 2 bytes = 14GB → Doesn't fit!
- **INT8**: 7B × 1 byte = 7GB → Barely fits, no room for KV cache
- **GPTQ 4-bit**: 7B × 0.5 bytes = 3.5GB → Fits with room to spare!
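
The arithmetic above can be reproduced with a few lines of Python. This is a back-of-the-envelope sketch: the helper name `weight_memory_gb` is made up for illustration, "7B" is treated as exactly 7e9 parameters, and only weight storage is counted (no KV cache, activations, or quantization scales).

```python
def weight_memory_gb(num_params: float, bits_per_weight: float) -> float:
    """Weight storage in GB (1 GB = 1e9 bytes); ignores KV cache and activations."""
    return num_params * bits_per_weight / 8 / 1e9

num_params = 7e9  # nominal "7B"; real checkpoints differ slightly

for label, bits in [("FP16", 16), ("INT8", 8), ("4-bit", 4)]:
    print(f"{label:>6}: {weight_memory_gb(num_params, bits):.1f} GB")
```

Real 4-bit checkpoints come out somewhat larger than this estimate, because each group of weights also stores a scale factor and some layers stay in FP16.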

**Why GPTQ?**
- Most widely adopted 4-bit quantization method
- Thousands of pre-quantized models on Hugging Face
- Optimized CUDA kernels for fast inference
- Works great on DGX Spark!

---

## 🧒 ELI5: What is GPTQ?

> **Imagine you're an artist making a mosaic...**
>
> You have a beautiful photograph to recreate, but you can only use 16 different colored tiles.
>
> **Naive approach:** Just pick the closest color for each pixel independently.
> - Result: The mosaic looks grainy and loses detail.
>
> **GPTQ approach:** Start from one corner and work systematically:
> 1. Pick the best tile for pixel 1
> 2. **Adjust the remaining pixels** to compensate for any error
> 3. Move to pixel 2, repeat
> 4. The errors spread out and cancel, giving a better overall image!
>
> **In AI terms:** GPTQ quantizes weights one-by-one, updating remaining weights to compensate for quantization errors. This "error compensation" is what makes it so good!

---

## Part 1: Understanding GPTQ Algorithm

GPTQ is based on **Optimal Brain Quantization (OBQ)**, which uses second-order information (the Hessian matrix) to minimize quantization error.

### The Key Insight

When we quantize weight $w_i$, we introduce an error. Instead of just accepting this error, GPTQ asks:

> "How should I adjust the remaining unquantized weights to compensate?"

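The answer uses the Hessian of the layer's output error: $H = X^T X$ (up to a constant factor), where $X$ holds the layer's input activations from the calibration data, one row per token.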
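
To make the compensation step concrete, here is the update rule from the OBQ and GPTQ papers, restated for a single row of weights $\mathbf{w}$. The symbols $\text{quant}(\cdot)$ (rounding to the 4-bit grid), $q$ (the index being quantized), and $F$ (the set of weights not yet quantized) are notation introduced for this sketch only.

The layer-wise error for one row is a quadratic form in $H$:

$$
\lVert X\mathbf{w} - X\hat{\mathbf{w}} \rVert_2^2 = (\mathbf{w} - \hat{\mathbf{w}})^T H \,(\mathbf{w} - \hat{\mathbf{w}}), \qquad H = X^T X
$$

After quantizing $w_q$, the remaining weights are shifted by

$$
\boldsymbol{\delta}_F = -\,\frac{w_q - \text{quant}(w_q)}{[H_F^{-1}]_{qq}} \,(H_F^{-1})_{:,q}
$$

So the rounding error on $w_q$ is spread over the other weights in proportion to column $q$ of the inverse Hessian: weights whose inputs are strongly correlated with input $q$ absorb most of the correction.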

### Algorithm Overview

```
1. Collect calibration data (run forward passes)
2. Compute Hessian matrix for each layer
3. For each weight column:
   a. Quantize the weight
   b. Compute quantization error
   c. Update remaining weights to compensate
4. Save quantized model
```
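
Steps 2 and 3 can be sketched in NumPy on a toy layer. This is a simplified illustration rather than AutoGPTQ's implementation: it uses one scale per output row, an explicit matrix inverse, and no damping, blocking, or grouping. The function names `quantize_rtn` and `gptq_sketch` are made up for this example.

```python
import numpy as np

def quantize_rtn(w, scale, bits=4):
    """Round-to-nearest onto a signed integer grid; returns dequantized values."""
    qmax = 2 ** (bits - 1) - 1
    return np.clip(np.round(w / scale), -qmax - 1, qmax) * scale

def gptq_sketch(W, X, bits=4):
    """Quantize W (out_features x in_features) column by column with compensation.

    X is (num_tokens x in_features). Scales are fixed per output row up front.
    """
    W = W.astype(np.float64).copy()
    Hinv = np.linalg.inv(X.T @ X)          # inverse Hessian of the layer error
    qmax = 2 ** (bits - 1) - 1
    scale = np.maximum(np.abs(W).max(axis=1), 1e-10) / qmax
    Q = np.zeros_like(W)
    for i in range(W.shape[1]):
        q = quantize_rtn(W[:, i], scale, bits)
        Q[:, i] = q
        err = (W[:, i] - q) / Hinv[i, i]
        # Push the rounding error onto the columns that are not yet quantized
        W[:, i + 1:] -= np.outer(err, Hinv[i, i + 1:])
        # Remove column i from the inverse Hessian of the remaining problem
        Hinv[i + 1:, i + 1:] -= np.outer(Hinv[i + 1:, i], Hinv[i, i + 1:]) / Hinv[i, i]
    return Q

rng = np.random.default_rng(0)
X = rng.normal(size=(512, 32)) @ rng.normal(size=(32, 32))  # correlated inputs
W = rng.normal(size=(16, 32))

scale = np.abs(W).max(axis=1, keepdims=True) / 7
Q_rtn = quantize_rtn(W, scale)
Q_gptq = gptq_sketch(W, X)

def output_error(Q):
    """Frobenius norm of the difference between original and quantized layer outputs."""
    return np.linalg.norm(X @ (W - Q).T)

print(f"Round-to-nearest output error: {output_error(Q_rtn):.3f}")
print(f"Compensated output error:      {output_error(Q_gptq):.3f}")
```

Both variants land on the same 4-bit grid; the only difference is that the compensated version adjusts later columns before rounding them. On correlated inputs like these, that usually gives the lower output error.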

### Group Size

Instead of using one scale factor for an entire weight matrix, GPTQ uses **groups** of weights with shared scale factors.

- **Group size 128:** Faster, less memory for scales, slightly lower quality
- **Group size 64:** Balanced
- **Group size 32:** Slower, more scales stored, higher quality

Let's visualize this:

In [None]:
import torch
import numpy as np
import matplotlib.pyplot as plt
import os
import time

print("=" * 60)
print("DGX Spark Environment Check")
print("=" * 60)

print(f"PyTorch version: {torch.__version__}")
print(f"CUDA available: {torch.cuda.is_available()}")

if torch.cuda.is_available():
    print(f"GPU: {torch.cuda.get_device_name(0)}")
    print(f"Total GPU memory: {torch.cuda.get_device_properties(0).total_memory / 1e9:.1f} GB")

In [None]:
# Visualize group-wise quantization

def visualize_group_quantization(weights, group_sizes=(128, 64, 32)):
    """
    Visualize how different group sizes affect quantization.
    """
    fig, axes = plt.subplots(1, len(group_sizes) + 1, figsize=(15, 4))
    
    # Original weights
    im = axes[0].imshow(weights.numpy(), cmap='RdBu', aspect='auto')
    axes[0].set_title(f'Original\n({weights.numel()*4/1024:.1f} KB)')
    axes[0].set_xlabel('Columns')
    axes[0].set_ylabel('Rows')
    plt.colorbar(im, ax=axes[0], fraction=0.046)
    
    for idx, gs in enumerate(group_sizes):
        # Reshape weights into groups
        flat = weights.flatten()
        num_groups = len(flat) // gs
        grouped = flat[:num_groups * gs].reshape(-1, gs)
        
        # Quantize each group separately
        scales = grouped.abs().max(dim=1, keepdim=True).values
        scales = scales.clamp(min=1e-10)
        quantized = torch.round(grouped / scales * 7).clamp(-8, 7)
        dequantized = quantized * scales / 7
        
        # Reshape back
        result = torch.zeros_like(weights)
        result.view(-1)[:num_groups * gs] = dequantized.flatten()  # view(-1) always aliases result
        
        # Calculate error and storage
        error = (weights - result).pow(2).mean().sqrt()
        storage = (weights.numel() * 0.5 + num_groups * 2) / 1024  # 4-bit + FP16 scales
        
        im = axes[idx + 1].imshow(result.numpy(), cmap='RdBu', aspect='auto')
        axes[idx + 1].set_title(f'Group Size {gs}\n({storage:.1f} KB, RMSE={error:.4f})')
        axes[idx + 1].set_xlabel('Columns')
        plt.colorbar(im, ax=axes[idx + 1], fraction=0.046)
    
    plt.tight_layout()
    plt.savefig('group_quantization.png', dpi=150, bbox_inches='tight')
    plt.show()
    plt.close(fig)  # Free memory from figure

# Create sample weight matrix
torch.manual_seed(42)
sample_weights = torch.randn(256, 256) * 0.5

visualize_group_quantization(sample_weights)
print("\n💡 Notice: Smaller group sizes have lower RMSE but slightly more storage!")

### 🔍 What Just Happened?

We visualized how group size affects quantization:

1. **Group size 128**: Fewer scale factors, more compression, slightly more error
2. **Group size 64**: Balance between compression and accuracy
3. **Group size 32**: More scale factors, less compression, lower error

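The difference in error is small because every variant uses the same 4-bit grid. The group size mainly controls how closely each scale factor can fit the weight distribution within its group. Note that this demo applies plain round-to-nearest inside each group; it does not include GPTQ's error compensation.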
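
The storage side of this tradeoff follows directly from the formula used in the visualization cell (0.5 bytes per weight plus 2 bytes per group). A small sketch, with the hypothetical helper `effective_bits_per_weight`, and zero points left out by default to match that cell:

```python
def effective_bits_per_weight(bits=4, group_size=128, scale_bits=16, zero_bits=0):
    """Average storage per weight when every group carries its own scale."""
    return bits + (scale_bits + zero_bits) / group_size

for gs in (128, 64, 32):
    print(f"group_size={gs:>3}: {effective_bits_per_weight(group_size=gs):.3f} bits/weight")
# group_size=128: 4.125 bits/weight
# group_size= 64: 4.250 bits/weight
# group_size= 32: 4.500 bits/weight
```

Going from group size 128 to 32 therefore costs about 9% more storage for the quantized weights, which is the price of the lower RMSE seen in the plots.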

---

## Part 2: Quantizing with AutoGPTQ

Let's use the AutoGPTQ library to quantize a real model!

In [None]:
# Install AutoGPTQ if needed
# Note: On DGX Spark (ARM64), prefer using pre-installed NGC container packages

try:
    from auto_gptq import AutoGPTQForCausalLM, BaseQuantizeConfig
    print("✅ AutoGPTQ is available!")
except ImportError:
    print("Installing AutoGPTQ...")
    print("⚠️  On DGX Spark (ARM64), this may compile CUDA kernels from source.")
    print("   This can take 5-10 minutes. Please be patient...")

    import subprocess
    import sys

    # Use the running interpreter's pip so the package lands in this kernel's environment
    result = subprocess.run(
        [sys.executable, "-m", "pip", "install", "auto-gptq", "--no-cache-dir"],
        capture_output=True,
        text=True
    )

    if result.returncode != 0:
        print("❌ Installation failed!")
        print("Error output (last 1000 chars):")
        print(result.stderr[-1000:])
        print("\n💡 Solution: Use an NGC container with AutoGPTQ pre-installed")
        print("   Or try: pip install auto-gptq --no-build-isolation")
        raise ImportError("AutoGPTQ installation failed - see error above")

    from auto_gptq import AutoGPTQForCausalLM, BaseQuantizeConfig
    print("✅ AutoGPTQ installed successfully!")

from transformers import AutoTokenizer
import gc

In [None]:
# Prepare calibration data
# GPTQ needs example inputs to calculate the Hessian matrix

def get_calibration_data(tokenizer, num_samples=128, max_length=512):
    """
    Generate calibration data for GPTQ quantization.
    
    In practice, use data similar to your inference workload!
    """
    # Sample calibration texts (diverse topics)
    calibration_texts = [
        "The field of machine learning has grown exponentially in recent years.",
        "Artificial intelligence systems can now perform complex reasoning tasks.",
        "Large language models are transforming how we interact with computers.",
        "Neural networks consist of interconnected layers of artificial neurons.",
        "Deep learning has enabled breakthroughs in computer vision and NLP.",
        "The transformer architecture revolutionized sequence modeling in 2017.",
        "Quantization reduces model size while maintaining performance.",
        "GPU acceleration enables training of billion-parameter models.",
        "Transfer learning allows models to leverage pre-trained knowledge.",
        "Attention mechanisms help models focus on relevant information.",
        # Add more diverse samples for better calibration
        "In the year 1969, humans first landed on the moon.",
        "The capital of France is Paris, known for the Eiffel Tower.",
        "Python is a popular programming language for data science.",
        "Climate change poses significant challenges to global ecosystems.",
        "Quantum computing may revolutionize cryptography and drug discovery.",
        "The stock market experienced significant volatility last quarter.",
        "Healthy eating habits contribute to overall well-being.",
        "Space exploration continues to expand our understanding of the universe.",
        "Renewable energy sources are becoming increasingly cost-effective.",
        "The history of mathematics spans thousands of years.",
    ]
    
    # Repeat and extend to get desired number of samples
    # (demo shortcut: repeated sentences add no new information to the Hessian estimate)
    extended_texts = (calibration_texts * ((num_samples // len(calibration_texts)) + 1))[:num_samples]
    
    # Tokenize
    calibration_data = []
    for text in extended_texts:
        tokenized = tokenizer(
            text,
            return_tensors="pt",
            truncation=True,
            max_length=max_length,
            padding=False
        )
        # AutoGPTQ expects each example as a dict with input_ids and attention_mask
        calibration_data.append({
            "input_ids": tokenized.input_ids[0].tolist(),
            "attention_mask": tokenized.attention_mask[0].tolist(),
        })
    
    return calibration_data

print("Calibration data generator defined!")

In [None]:
# Choose a model to quantize
# For this tutorial, we use a smaller model. For production, use Llama 7B/13B/70B

model_id = "facebook/opt-350m"  # Small model for quick demo
# For larger models:
# model_id = "facebook/opt-1.3b"
# model_id = "meta-llama/Llama-2-7b-hf"  # Requires HF login

print(f"Selected model: {model_id}")

# Load tokenizer with error handling for network issues
try:
    tokenizer = AutoTokenizer.from_pretrained(model_id, use_fast=True)
    if tokenizer.pad_token is None:
        tokenizer.pad_token = tokenizer.eos_token
    print(f"✅ Tokenizer loaded. Vocab size: {len(tokenizer)}")
except Exception as e:
    print(f"❌ Failed to load tokenizer: {e}")
    print("\nPossible solutions:")
    print("  1. Check your internet connection")
    print("  2. Verify the model ID is correct")
    print("  3. For gated models (e.g., Llama), run: huggingface-cli login")
    raise

In [None]:
# Generate calibration data
print("Generating calibration data...")
calibration_data = get_calibration_data(tokenizer, num_samples=128)
print(f"Generated {len(calibration_data)} calibration samples")
print(f"Sample lengths: {[len(s['input_ids']) for s in calibration_data[:5]]}")

In [None]:
# GPTQ Configuration
# We'll create configs for different group sizes to compare

def create_gptq_config(bits=4, group_size=128, desc_act=True, sym=False):
    """
    Create GPTQ quantization configuration.
    
    Args:
        bits: Number of bits (4 is most common)
        group_size: Weights per group (32, 64, 128)
        desc_act: Quantize columns in descending activation order (often improves quality; may slow inference)
        sym: Use symmetric quantization
    """
    return BaseQuantizeConfig(
        bits=bits,
        group_size=group_size,
        desc_act=desc_act,
        sym=sym,
        damp_percent=0.1,  # Damping for numerical stability
    )

# Create configs for comparison
configs = {
    'group_128': create_gptq_config(group_size=128),
    'group_64': create_gptq_config(group_size=64),
    'group_32': create_gptq_config(group_size=32),
}

print("GPTQ Configurations:")
for name, config in configs.items():
    print(f"\n{name}:")
    print(f"  Bits: {config.bits}")
    print(f"  Group size: {config.group_size}")
    print(f"  Desc act: {config.desc_act}")
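
The `damp_percent` setting deserves a short illustration. In the reference GPTQ code, damping adds a fraction of the mean diagonal of $H$ to every diagonal entry before inversion. The sketch below assumes that same rule; the helper name `damp_hessian` and the toy shapes are made up for the example.

```python
import numpy as np

def damp_hessian(H, damp_percent=0.1):
    """Return H with damp_percent * mean(diag(H)) added to its diagonal."""
    damp = damp_percent * np.mean(np.diag(H))
    return H + damp * np.eye(H.shape[0])

rng = np.random.default_rng(0)
X = rng.normal(size=(8, 32))   # only 8 calibration tokens for 32 input features
H = X.T @ X                    # rank <= 8, so H cannot be inverted as-is
H_damped = damp_hessian(H, damp_percent=0.1)

print("rank before damping:", np.linalg.matrix_rank(H))
print("rank after damping: ", np.linalg.matrix_rank(H_damped))
print(f"condition number after damping: {np.linalg.cond(H_damped):.1f}")
```

A larger `damp_percent` makes the inverse more stable, but it also pulls $H$ toward a scaled identity, which weakens the error compensation. If quantization fails with a numerical error, raising this value is a common first thing to try.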

In [None]:
# Quantize with group_size=128 first (fastest)
import time

print("="*60)
print("Quantizing with group_size=128...")
print("="*60)

# Clear memory
gc.collect()
torch.cuda.empty_cache()

start_time = time.time()

# Load and quantize
quantized_model_128 = AutoGPTQForCausalLM.from_pretrained(
    model_id,
    configs['group_128']
)

# Perform quantization (this is where GPTQ runs)
quantized_model_128.quantize(
    calibration_data,
    batch_size=4,  # Adjust based on your GPU memory
)

quant_time_128 = time.time() - start_time
print(f"\nQuantization time: {quant_time_128:.1f} seconds")

# Memory usage
mem_128 = torch.cuda.memory_allocated() / 1e9
print(f"GPU memory used: {mem_128:.2f} GB")

In [None]:
# Save the quantized model
import os

save_dir_128 = "./quantized_models/opt-350m-gptq-4bit-g128"
os.makedirs(save_dir_128, exist_ok=True)

print(f"Saving quantized model to {save_dir_128}...")
quantized_model_128.save_quantized(save_dir_128)
tokenizer.save_pretrained(save_dir_128)

# Check file sizes
total_size = 0
print("\nSaved files:")
for f in os.listdir(save_dir_128):
    size = os.path.getsize(os.path.join(save_dir_128, f))
    total_size += size
    print(f"  {f}: {size/1e6:.2f} MB")
print(f"\nTotal: {total_size/1e6:.2f} MB")

In [None]:
# Clean up and quantize with other group sizes
del quantized_model_128
gc.collect()
torch.cuda.empty_cache()

In [None]:
# Quantize with group_size=64
print("="*60)
print("Quantizing with group_size=64...")
print("="*60)

start_time = time.time()

quantized_model_64 = AutoGPTQForCausalLM.from_pretrained(
    model_id,
    configs['group_64']
)

quantized_model_64.quantize(
    calibration_data,
    batch_size=4,
)

quant_time_64 = time.time() - start_time
print(f"\nQuantization time: {quant_time_64:.1f} seconds")

save_dir_64 = "./quantized_models/opt-350m-gptq-4bit-g64"
os.makedirs(save_dir_64, exist_ok=True)
quantized_model_64.save_quantized(save_dir_64)
tokenizer.save_pretrained(save_dir_64)

del quantized_model_64
gc.collect()
torch.cuda.empty_cache()

In [None]:
# Quantize with group_size=32
print("="*60)
print("Quantizing with group_size=32...")
print("="*60)

start_time = time.time()

quantized_model_32 = AutoGPTQForCausalLM.from_pretrained(
    model_id,
    configs['group_32']
)

quantized_model_32.quantize(
    calibration_data,
    batch_size=4,
)

quant_time_32 = time.time() - start_time
print(f"\nQuantization time: {quant_time_32:.1f} seconds")

save_dir_32 = "./quantized_models/opt-350m-gptq-4bit-g32"
os.makedirs(save_dir_32, exist_ok=True)
quantized_model_32.save_quantized(save_dir_32)
tokenizer.save_pretrained(save_dir_32)

del quantized_model_32
gc.collect()
torch.cuda.empty_cache()

---

## Part 3: Comparing Quantized Models

Let's compare the three group sizes on:
1. Model size
2. Inference speed
3. Output quality (perplexity)

In [None]:
# Load original FP16 model for baseline
from transformers import AutoModelForCausalLM
import math

print("Loading FP16 baseline model...")
model_fp16 = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float16,
    device_map="cuda"
)

# Calculate baseline model size
param_count = sum(p.numel() for p in model_fp16.parameters())
fp16_size_mb = param_count * 2 / 1e6
print(f"FP16 model size: {fp16_size_mb:.1f} MB ({param_count/1e6:.0f}M params)")
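
Before measuring the quantized checkpoints, it helps to estimate what size to expect. The sketch below assumes that only the linear layers inside the transformer blocks are quantized, while embeddings and similar tensors stay in FP16. The parameter split is hypothetical and chosen for illustration, not measured from OPT-350M, and `estimated_gptq_size_mb` is a made-up helper.

```python
def estimated_gptq_size_mb(total_params, fp16_params, bits=4, group_size=128):
    """Rough checkpoint size: quantized weights plus FP16 scales plus FP16 leftovers."""
    quantized_params = total_params - fp16_params
    bits_per_weight = bits + 16 / group_size      # one FP16 scale per group
    quantized_bytes = quantized_params * bits_per_weight / 8
    return (quantized_bytes + fp16_params * 2) / 1e6

total = 350e6      # hypothetical total parameter count
kept_fp16 = 50e6   # hypothetical count of parameters left unquantized

fp16_mb = total * 2 / 1e6
gptq_mb = estimated_gptq_size_mb(total, kept_fp16)
print(f"FP16: {fp16_mb:.0f} MB")
print(f"GPTQ estimate: {gptq_mb:.0f} MB ({fp16_mb / gptq_mb:.2f}x smaller)")
```

On small models the unquantized tensors are a large share of the total, so the overall ratio falls short of 4x. On 7B-class models that share shrinks and the ratio moves closer to 4x.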

In [None]:
# Perplexity evaluation function
from tqdm import tqdm

def calculate_perplexity(model, tokenizer, texts, max_length=256):
    """Calculate perplexity on a set of texts."""
    model.eval()
    total_loss = 0
    total_tokens = 0
    
    with torch.no_grad():
        for text in tqdm(texts, desc="Evaluating", leave=False):
            encodings = tokenizer(
                text, 
                return_tensors='pt', 
                truncation=True, 
                max_length=max_length
            )
            input_ids = encodings.input_ids.to(model.device)
            
            if input_ids.size(1) < 2:
                continue
            
            outputs = model(input_ids, labels=input_ids)
            loss = outputs.loss.item()
            num_tokens = input_ids.size(1) - 1
            
            total_loss += loss * num_tokens
            total_tokens += num_tokens
    
    return math.exp(total_loss / total_tokens)

# Evaluation texts
eval_texts = [
    "The quick brown fox jumps over the lazy dog in the garden.",
    "Machine learning is transforming industries around the world.",
    "Scientists have discovered a new species of deep-sea fish.",
    "The history of ancient civilizations fascinates many scholars.",
    "Technology continues to advance at an unprecedented rate.",
    "Climate change affects ecosystems across the planet.",
    "The stock market showed significant gains this quarter.",
    "Music has the power to evoke strong emotions in listeners.",
    "Space exploration opens new frontiers for humanity.",
    "Healthy eating habits contribute to longevity and well-being.",
]

In [None]:
# Evaluate FP16 baseline
print("\nEvaluating FP16 baseline...")
ppl_fp16 = calculate_perplexity(model_fp16, tokenizer, eval_texts)
print(f"FP16 Perplexity: {ppl_fp16:.2f}")

# Clean up
del model_fp16
gc.collect()
torch.cuda.empty_cache()
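
The `calculate_perplexity` function weights each text's loss by its token count before exponentiating. The standalone sketch below shows why that matters, using made-up loss values; `perplexity_from_losses` is a hypothetical helper, not part of any library.

```python
import math

def perplexity_from_losses(losses_and_counts):
    """Combine per-text mean losses (nats per token) into one corpus perplexity."""
    total_nll = sum(loss * n for loss, n in losses_and_counts)
    total_tokens = sum(n for _, n in losses_and_counts)
    return math.exp(total_nll / total_tokens)

# Made-up example: a short easy text and a longer, harder one
per_text = [(2.0, 10), (3.0, 30)]

weighted = perplexity_from_losses(per_text)
unweighted = math.exp(sum(loss for loss, _ in per_text) / len(per_text))

print(f"Token-weighted perplexity: {weighted:.2f}")
print(f"Naive per-text average:    {unweighted:.2f}")
```

The token-weighted version corresponds to $\exp$ of the average negative log-likelihood over all predicted tokens, so long texts count for more than short ones.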

In [None]:
# Load and evaluate GPTQ models
results = {'FP16': {'perplexity': ppl_fp16, 'size_mb': fp16_size_mb}}

for name, save_dir in [
    ('GPTQ-128', save_dir_128),
    ('GPTQ-64', save_dir_64),
    ('GPTQ-32', save_dir_32)
]:
    print(f"\nLoading {name}...")
    
    # Check which format is available (safetensors or bin)
    safetensors_exists = any(f.endswith('.safetensors') for f in os.listdir(save_dir))
    
    # Load quantized model with appropriate format
    model = AutoGPTQForCausalLM.from_quantized(
        save_dir,
        device="cuda:0",
        use_safetensors=safetensors_exists
    )
    
    # Get model size from saved files
    size_mb = sum(
        os.path.getsize(os.path.join(save_dir, f)) 
        for f in os.listdir(save_dir) 
        if f.endswith('.safetensors') or f.endswith('.bin')
    ) / 1e6
    
    # Evaluate perplexity
    ppl = calculate_perplexity(model, tokenizer, eval_texts)
    
    results[name] = {
        'perplexity': ppl,
        'size_mb': size_mb
    }
    
    print(f"  Size: {size_mb:.1f} MB")
    print(f"  Perplexity: {ppl:.2f}")
    
    del model
    gc.collect()
    torch.cuda.empty_cache()

In [None]:
# Summary comparison
print("\n" + "="*70)
print("GPTQ Quantization Comparison")
print("="*70)
print(f"{'Model':<15} {'Size (MB)':>12} {'Perplexity':>12} {'PPL Delta':>12} {'Compression':>12}")
print("-"*70)

baseline_ppl = results['FP16']['perplexity']
baseline_size = results['FP16']['size_mb']

for name, data in results.items():
    ppl_delta = data['perplexity'] - baseline_ppl
    compression = baseline_size / data['size_mb']
    
    # Signed delta, so a quantized model that scores below FP16 is not mislabeled as the baseline
    delta_str = "baseline" if name == 'FP16' else f"{ppl_delta:+.2f}"
    
    print(f"{name:<15} {data['size_mb']:>12.1f} {data['perplexity']:>12.2f} {delta_str:>12} {compression:>11.2f}x")

print("="*70)
print("\n💡 Key Observations:")
print("   - GPTQ variants approach 4x compression (unquantized FP16 tensors keep small models below that)")
print("   - Smaller group sizes usually have slightly better perplexity")
print("   - Quality degradation is usually small (<0.5 PPL is typically acceptable)")

---

## Part 4: Inference Speed Comparison

Let's benchmark generation speed for each variant.

In [None]:
def benchmark_generation(model, tokenizer, prompt, num_tokens=50, num_runs=5):
    """
    Benchmark text generation speed.
    Returns tokens per second.
    """
    model.eval()
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    
    # Warmup
    with torch.no_grad():
        _ = model.generate(**inputs, max_new_tokens=10, do_sample=False)
    torch.cuda.synchronize()
    
    # Benchmark
    times = []
    generated = []
    for _ in range(num_runs):
        torch.cuda.synchronize()
        start = time.perf_counter()
        
        with torch.no_grad():
            outputs = model.generate(
                **inputs,
                max_new_tokens=num_tokens,
                do_sample=False,
                pad_token_id=tokenizer.pad_token_id
            )
        
        torch.cuda.synchronize()
        end = time.perf_counter()
        times.append(end - start)
        # Generation can stop early at EOS, so count the tokens actually produced
        generated.append(outputs.shape[1] - inputs["input_ids"].shape[1])
    
    avg_time = sum(times) / len(times)
    tokens_per_second = (sum(generated) / len(generated)) / avg_time
    
    return tokens_per_second, avg_time * 1000  # Return tok/s and latency_ms

prompt = "The future of artificial intelligence will"
print(f"Benchmark prompt: '{prompt}'")
print("Generating 50 tokens, 5 runs each...")

In [None]:
# Benchmark FP16
print("\nBenchmarking FP16...")
model_fp16 = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float16,
    device_map="cuda"
)

tok_s, latency = benchmark_generation(model_fp16, tokenizer, prompt)
results['FP16']['tokens_per_sec'] = tok_s
results['FP16']['latency_ms'] = latency
print(f"  {tok_s:.1f} tokens/sec, {latency:.0f} ms")

del model_fp16
gc.collect()
torch.cuda.empty_cache()

In [None]:
# Benchmark GPTQ variants
for name, save_dir in [
    ('GPTQ-128', save_dir_128),
    ('GPTQ-64', save_dir_64),
    ('GPTQ-32', save_dir_32)
]:
    print(f"\nBenchmarking {name}...")
    
    # Check which format is available
    safetensors_exists = any(f.endswith('.safetensors') for f in os.listdir(save_dir))
    
    model = AutoGPTQForCausalLM.from_quantized(
        save_dir,
        device="cuda:0",
        use_safetensors=safetensors_exists
    )
    
    tok_s, latency = benchmark_generation(model, tokenizer, prompt)
    results[name]['tokens_per_sec'] = tok_s
    results[name]['latency_ms'] = latency
    print(f"  {tok_s:.1f} tokens/sec, {latency:.0f} ms")
    
    del model
    gc.collect()
    torch.cuda.empty_cache()

In [None]:
# Final comprehensive comparison
print("\n" + "="*85)
print("COMPREHENSIVE GPTQ COMPARISON")
print("="*85)
print(f"{'Model':<12} {'Size (MB)':>10} {'PPL':>8} {'PPL Δ':>8} {'Tok/s':>10} {'Speedup':>10} {'Compress':>10}")
print("-"*85)

baseline_speed = results['FP16']['tokens_per_sec']

for name, data in results.items():
    ppl_delta = data['perplexity'] - baseline_ppl
    speedup = data['tokens_per_sec'] / baseline_speed
    compression = baseline_size / data['size_mb']
    
    # Signed delta, so improvements over FP16 are shown rather than hidden
    delta_str = "-" if name == 'FP16' else f"{ppl_delta:+.2f}"
    
    print(f"{name:<12} {data['size_mb']:>10.1f} {data['perplexity']:>8.2f} {delta_str:>8} {data['tokens_per_sec']:>10.1f} {speedup:>9.2f}x {compression:>9.2f}x")

print("="*85)

In [None]:
# Visualize the comparison
fig, axes = plt.subplots(1, 3, figsize=(15, 5))

models = list(results.keys())
colors = ['#2196F3', '#4CAF50', '#FF9800', '#F44336']

# Size comparison
sizes = [results[m]['size_mb'] for m in models]
axes[0].bar(models, sizes, color=colors)
axes[0].set_ylabel('Size (MB)')
axes[0].set_title('Model Size')
for i, v in enumerate(sizes):
    axes[0].text(i, v + 5, f'{v:.0f}', ha='center')

# Perplexity comparison
ppls = [results[m]['perplexity'] for m in models]
axes[1].bar(models, ppls, color=colors)
axes[1].set_ylabel('Perplexity (lower is better)')
axes[1].set_title('Quality (Perplexity)')
for i, v in enumerate(ppls):
    axes[1].text(i, v + 0.5, f'{v:.1f}', ha='center')

# Speed comparison
speeds = [results[m]['tokens_per_sec'] for m in models]
axes[2].bar(models, speeds, color=colors)
axes[2].set_ylabel('Tokens/second')
axes[2].set_title('Inference Speed')
for i, v in enumerate(speeds):
    axes[2].text(i, v + 2, f'{v:.0f}', ha='center')

plt.tight_layout()
plt.savefig('gptq_comparison.png', dpi=150, bbox_inches='tight')
plt.show()
plt.close(fig)  # Free memory from figure

---

## ✋ Try It Yourself

### Exercise 1: Quantize a Larger Model

Quantize OPT-1.3B or Llama-2-7B with different group sizes. How does model size affect the quality-compression tradeoff?

<details>
<summary>💡 Hint</summary>

Use the same code patterns, but with:
```python
model_id = "facebook/opt-1.3b"  # or "meta-llama/Llama-2-7b-hf"
# You may need to reduce batch_size for larger models
batch_size = 2
```
</details>

In [None]:
# TODO: Quantize a larger model
# YOUR CODE HERE

### Exercise 2: Custom Calibration Data

The quality of GPTQ quantization depends heavily on calibration data. Try using:
1. Code snippets (for a coding assistant)
2. Scientific papers (for a research assistant)
3. Conversation logs (for a chatbot)

Compare perplexity on domain-specific vs generic test data.

<details>
<summary>💡 Hint</summary>

Load domain-specific data from a text file:
```python
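with open("code_samples.txt", encoding="utf-8") as f:
    domain_texts = [t for t in f.read().split("\n\n") if t.strip()]
# AutoGPTQ expects dict-like examples with input_ids and attention_mask
calibration_data = [tokenizer(t, truncation=True, max_length=512) for t in domain_texts]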
```
</details>

In [None]:
# TODO: Experiment with domain-specific calibration data
# YOUR CODE HERE

---

## ⚠️ Common Mistakes

### Mistake 1: Insufficient Calibration Data

```python
# ❌ Wrong: Too few samples
calibration_data = get_calibration_data(tokenizer, num_samples=10)

# ✅ Right: Use at least 128 samples
calibration_data = get_calibration_data(tokenizer, num_samples=128)
```

**Why:** GPTQ needs enough samples to estimate the Hessian accurately.
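
A toy experiment shows how the Hessian estimate improves with more calibration tokens. This sketch draws synthetic activations from a known covariance and compares $X^T X / n$ against it. The function name `hessian_estimate_error` and all sizes are assumptions for the example and are not tied to any real model.

```python
import numpy as np

def hessian_estimate_error(num_tokens, dim=32, seed=0):
    """Relative Frobenius error between X^T X / n and the true covariance."""
    rng = np.random.default_rng(seed)
    A = rng.normal(size=(dim, dim)) / np.sqrt(dim)
    true_cov = A.T @ A
    X = rng.normal(size=(num_tokens, dim)) @ A   # rows drawn with covariance A^T A
    H = X.T @ X / num_tokens
    return np.linalg.norm(H - true_cov) / np.linalg.norm(true_cov)

for n in (16, 128, 4096):
    print(f"{n:>5} tokens -> relative error {hessian_estimate_error(n):.3f}")
```

What counts is the total number of tokens, not the number of samples. Many short sentences, or the same sentences repeated, provide far less information than the same number of long, varied passages.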

### Mistake 2: Wrong Model Loading for Inference

```python
# ❌ Wrong: Loading as regular model
model = AutoModelForCausalLM.from_pretrained("./quantized_model")

# ✅ Right: Use AutoGPTQ loader
model = AutoGPTQForCausalLM.from_quantized(
    "./quantized_model",
    device="cuda:0"
)
```

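**Why:** GPTQ checkpoints store packed integer weights plus per-group scales and zero points, so they need a GPTQ-aware loader. Recent Transformers releases can also load them through `AutoModelForCausalLM`, but only when `optimum` and `auto-gptq` are installed.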

### Mistake 3: Ignoring desc_act

```python
# ❌ Wrong (when quality matters most): desc_act=False
config = BaseQuantizeConfig(bits=4, group_size=128, desc_act=False)

# ✅ Right: desc_act=True for better quality (may slow inference)
config = BaseQuantizeConfig(bits=4, group_size=128, desc_act=True)
```

**Why:** Descending activation order processes most important weights first, improving quality.
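
The reordering can be illustrated with NumPy. In the reference GPTQ code, activation order is obtained by sorting columns by the diagonal of the Hessian, largest first. The sketch below assumes that rule; `act_order_permutation` is a made-up helper and the numbers are toy values.

```python
import numpy as np

def act_order_permutation(H):
    """Column order with the largest Hessian diagonal entries first."""
    return np.argsort(-np.diag(H), kind="stable")

# Toy activations: column 1 carries by far the largest inputs
X = np.array([[ 0.1,  2.0, -0.5],
              [ 0.2, -1.0,  0.5],
              [-0.1,  3.0,  1.0]])
H = X.T @ X

perm = act_order_permutation(H)
inv_perm = np.argsort(perm)      # undoes the reordering after quantization

print(perm)  # [1 2 0]

W = np.arange(6, dtype=float).reshape(2, 3)
W_reordered = W[:, perm]          # quantize columns in this order
W_restored = W_reordered[:, inv_perm]
```

Columns with large activations are quantized while many other columns are still free to absorb their error. The cost is that the stored weights are permuted relative to the group layout, which some inference kernels handle more slowly, so `desc_act=False` is a reasonable choice when throughput matters more than the last bit of quality.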

---

## 🎉 Checkpoint

You've learned:

- ✅ **GPTQ algorithm**: Uses Hessian-based error compensation for optimal quantization
- ✅ **Group sizes**: Smaller = better quality, larger = faster quantization
- ✅ **Calibration matters**: Representative data improves quantization quality
- ✅ **4x compression**: With minimal quality loss (<0.5 PPL typically)
- ✅ **Speedup bonus**: Quantized models often run faster!

---

## 🚀 Challenge (Optional)

**Create a GPTQ Quantization Pipeline**

Build an end-to-end function that:
1. Takes a model name and quantization config
2. Downloads and quantizes the model
3. Evaluates perplexity and speed
4. Saves with proper naming convention
5. Uploads to Hugging Face Hub (optional)

```python
def quantize_and_publish(
    model_id: str,
    bits: int = 4,
    group_size: int = 128,
    upload_to_hub: bool = False
):
    # YOUR CODE HERE
    pass
```

---

## 📖 Further Reading

- [GPTQ Paper: Accurate Post-Training Quantization for Generative Pre-trained Transformers](https://arxiv.org/abs/2210.17323)
- [AutoGPTQ GitHub](https://github.com/PanQiWei/AutoGPTQ)
- [Optimal Brain Compression](https://arxiv.org/abs/2208.11580) (OBC, GPTQ's predecessor)
- [TheBloke's Quantized Models](https://huggingface.co/TheBloke) (Thousands of pre-quantized models!)

---

## 🧹 Cleanup

In [None]:
# Clean up quantized models (optional - comment out to keep them)
import shutil

# Uncomment to delete quantized models:
# shutil.rmtree("./quantized_models", ignore_errors=True)

# Clear GPU memory
gc.collect()
torch.cuda.empty_cache()

print("Cleanup complete!")
print(f"GPU memory: {torch.cuda.memory_allocated()/1e9:.2f} GB")

---

## Next Steps

In the next notebook, we'll explore **AWQ (Activation-aware Weight Quantization)**, which improves on GPTQ by protecting "salient" weights!

➡️ Continue to: [03-awq-quantization.ipynb](03-awq-quantization.ipynb)