# Lab 3.2.1: Quantization Overview

**Module:** 3.2 - Model Quantization & Optimization  
**Time:** 2 hours  
**Difficulty:** ⭐⭐⭐☆☆

---

## 🎯 Learning Objectives

By the end of this notebook, you will:
- [ ] Understand different numerical data types (FP32, FP16, BF16, INT8, INT4, FP4)
- [ ] Explain the precision-memory-speed tradeoffs in quantization
- [ ] Compare model sizes and inference speeds across precision levels
- [ ] Measure perplexity to assess quality degradation
- [ ] Appreciate DGX Spark's unique capabilities for quantized inference

---

## 📚 Prerequisites

- Completed: Module 10 (LLM Fine-tuning)
- Knowledge of: PyTorch basics, transformers library
- Hardware: DGX Spark with 128GB unified memory

---

## 🌍 Real-World Context

**The Problem:** You've fine-tuned a powerful 70B parameter model, but deploying it costs a fortune!

Consider the math:
- **70B parameters × 2 bytes (FP16) = 140GB** just for weights
- Plus activations, KV cache, framework overhead...
- That's $25,000+ for a cloud GPU instance

**The Solution:** Quantization reduces weight memory by 2-4× relative to FP16 (INT8 gives 2×, 4-bit formats give about 4×), usually with minimal quality loss.

| Company | Use Case | Quantization Win |
|---------|----------|------------------|
| Google | On-device Gemini Nano | 4-bit enables running on phones |
| Meta | Llama deployment | INT8 halves serving costs |
| Apple | CoreML models | 16→4 bit for Neural Engine |
| **You** | DGX Spark | NVFP4 gives 3.5× compression! |

---

## 🧒 ELI5: What is Quantization?

> **Imagine you're taking notes in class...**
>
> You could write down every single word the teacher says (FP32 - full precision).  
> That's accurate, but your notebook fills up fast and your hand gets tired!
>
> Instead, you could:
> - Write only the key points (FP16 - half precision)
> - Use abbreviations like "b/c" for "because" (INT8 - 8-bit integers)
> - Just draw simple diagrams (INT4 - 4-bit integers)
>
> Each shorthand:
> - ✅ Uses less notebook space
> - ✅ Lets you write faster
> - ⚠️ Might lose some details
>
> **In AI terms:** Quantization is using fewer bits to store each number in a neural network.  
> Just like your notes, fewer bits = smaller models = faster inference, but potentially less accurate.

---

## Part 1: Understanding Data Types

Neural networks are just massive collections of numbers (weights). How we store those numbers matters!

### The Number Line Analogy

Imagine a ruler that can measure from -1000 to +1000:

- **FP32 (32 bits)**: Like having 4 billion tick marks on your ruler. Incredibly precise!
- **FP16 (16 bits)**: 65,000 tick marks. Still very precise for most purposes.
- **BF16 (16 bits)**: Same range as FP32, but coarser tick marks than FP16 (7 mantissa bits vs. 10). It trades precision for range, which suits deep learning well.
- **INT8 (8 bits)**: Only 256 tick marks. Some values get rounded to neighbors.
- **INT4 (4 bits)**: Just 16 tick marks. Significant rounding, but surprisingly okay for many tasks!
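
The tick-mark picture follows from how each floating-point format splits its bits between exponent (range) and mantissa (precision). Here is a minimal sketch in plain Python, assuming the standard layouts (FP32: 8 exponent / 23 mantissa bits, FP16: 5 / 10, BF16: 8 / 7). The helper name `describe_format` is ours, not a library function.

```python
# (exponent bits, mantissa bits) for each format; one more bit holds the sign
FORMATS = {
    "FP32": (8, 23),
    "FP16": (5, 10),
    "BF16": (8, 7),
}

def describe_format(exp_bits, man_bits):
    """Return (largest finite value, machine epsilon) for a binary float format."""
    bias = 2 ** (exp_bits - 1) - 1
    max_exp = (2 ** exp_bits - 2) - bias   # the top exponent code is reserved for inf/NaN
    max_val = (2 - 2.0 ** -man_bits) * 2.0 ** max_exp
    eps = 2.0 ** -man_bits                 # gap between 1.0 and the next representable value
    return max_val, eps

for name, (exp_bits, man_bits) in FORMATS.items():
    max_val, eps = describe_format(exp_bits, man_bits)
    print(f"{name}: largest value ~ {max_val:.3e}, epsilon = {eps:.3e}")
```

BF16's largest value is close to FP32's, while its epsilon is 8 times larger than FP16's: that is the range-for-precision trade in numbers.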

Let's visualize this:

In [None]:
# First, let's check our DGX Spark environment
import torch
import numpy as np
import matplotlib.pyplot as plt
import os
import gc
import time
import math

print("=" * 60)
print("DGX Spark Environment Check")
print("=" * 60)

print(f"PyTorch version: {torch.__version__}")
print(f"CUDA available: {torch.cuda.is_available()}")

if torch.cuda.is_available():
    print(f"GPU: {torch.cuda.get_device_name(0)}")
    print(f"Total Memory: {torch.cuda.get_device_properties(0).total_memory / 1e9:.1f} GB")
    print(f"Compute Capability: {torch.cuda.get_device_capability()}")

# Check for BF16 support (Blackwell has native support)
print(f"\nBF16 supported: {torch.cuda.is_bf16_supported()}")

print("=" * 60)

> 💡 **Note:** This module includes reusable utility scripts in the `scripts/` directory.
> For your own projects, you can import them directly:
> ```python
> import sys
> sys.path.insert(0, '../scripts')
> from memory_utils import get_gpu_memory, clear_memory, MemoryTracker
> from perplexity import calculate_perplexity
> from quantization_utils import symmetric_quantize, dequantize
> ```
> For this tutorial, we define helper functions inline for clarity.


In [None]:
# Visualizing precision loss across data types

def show_float_representation(value, dtype_name):
    """Show how a number is represented in different data types."""
    if dtype_name == 'FP32':
        tensor = torch.tensor([value], dtype=torch.float32)
    elif dtype_name == 'FP16':
        tensor = torch.tensor([value], dtype=torch.float16)
    elif dtype_name == 'BF16':
        tensor = torch.tensor([value], dtype=torch.bfloat16)
    else:
        tensor = torch.tensor([value], dtype=torch.float32)
    
    stored_value = tensor.item()
    error = abs(value - stored_value)
    
    return stored_value, error

# Let's see how the same number is stored differently
test_value = 3.141592653589793  # Pi

print(f"Original value (Python float64): {test_value}")
print("\nHow different precisions store π:")
print("-" * 50)

for dtype in ['FP32', 'FP16', 'BF16']:
    stored, error = show_float_representation(test_value, dtype)
    print(f"{dtype:>6}: {stored:20.15f}  (error: {error:.2e})")

In [None]:
# Memory comparison for model weights
# This is the key insight: fewer bits = smaller models

model_sizes_billions = [1, 3, 7, 13, 34, 70]  # Common LLM sizes

# Bytes per parameter for each data type
bytes_per_param = {
    'FP32': 4,
    'FP16': 2,
    'BF16': 2,
    'INT8': 1,
    'INT4': 0.5,
    'FP4': 0.5,
}

print("Model Memory Requirements (GB) by Precision")
print("=" * 80)
print(f"{'Model Size':<12}", end="")
for dtype in bytes_per_param:
    print(f"{dtype:>10}", end="")
print("\n" + "-" * 80)

for params_b in model_sizes_billions:
    print(f"{params_b}B{'':<9}", end="")
    for dtype, bpp in bytes_per_param.items():
        size_gb = (params_b * 1e9 * bpp) / 1e9
        print(f"{size_gb:>10.1f}", end="")
    print()

print("\n" + "=" * 80)
print("\n💡 Key Insight: DGX Spark has 128GB unified memory!")
print("   - FP16: Weights for up to ~64B parameters fit")
print("   - INT4: Weights for up to ~256B parameters fit!")
print("   - FP4 (Blackwell exclusive): Same size as INT4, but better quality!")
print("   (Weights only: activations and the KV cache need headroom too.)")

### 🔍 What Just Happened?

We calculated how much memory different model sizes require at various precisions:

1. **FP32 → FP16**: Cuts memory in half with almost no quality loss. This is why 16-bit formats (FP16 or BF16, usually in mixed precision) are the standard for training.

2. **FP16 → INT8**: Another 50% reduction. Works great for inference.

3. **INT8 → INT4/FP4**: Another 50%! This is where DGX Spark shines with hardware FP4 support.
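
The halving above counts only the weight bits. Practical 4-bit formats also store a scale factor for each small block of weights, which adds some overhead. Below is a rough sketch of that arithmetic. The block sizes and scale widths are illustrative assumptions (check the documentation of the format you actually use), and the function names are ours.

```python
def effective_bits_per_weight(weight_bits, block_size, scale_bits):
    """Average storage per weight once the shared per-block scale is included."""
    return weight_bits + scale_bits / block_size

def compression_vs_fp16(weight_bits, block_size, scale_bits):
    """How many times smaller the weights are than 16-bit storage."""
    return 16 / effective_bits_per_weight(weight_bits, block_size, scale_bits)

# (label, weight bits, block size, scale bits); block settings are assumptions
configs = [
    ("4-bit, no scales (ideal)", 4, 1, 0),
    ("4-bit, block of 16, 8-bit scale", 4, 16, 8),
    ("4-bit, block of 64, 16-bit scale", 4, 64, 16),
]

for label, weight_bits, block_size, scale_bits in configs:
    bits = effective_bits_per_weight(weight_bits, block_size, scale_bits)
    ratio = compression_vs_fp16(weight_bits, block_size, scale_bits)
    print(f"{label:<34} {bits:.2f} bits/weight, {ratio:.2f}x smaller than FP16")
```

With one 8-bit scale per 16 weights, the effective cost is 4.5 bits per weight, which is one plausible reading of why a 4-bit format is quoted at roughly 3.5× rather than a full 4×.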

---

## Part 2: Quantization in Action

Let's actually quantize some tensors and see what happens to the values.

### Understanding Quantization Math

At its core, quantization maps continuous floating-point values to discrete integers:

```
quantized_value = round(float_value / scale) + zero_point
dequantized_value = (quantized_value - zero_point) * scale
```

The `scale` and `zero_point` parameters determine how we map the float range to the integer range. Here `zero_point` is the integer that represents the float value 0.0; symmetric quantization fixes it at 0.
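
To make the mapping concrete, here is a minimal plain-Python sketch of the symmetric special case (`zero_point = 0`), where quantization reduces to `round(x / scale)`. The helper names and sample values are ours. It also checks a useful property: when the scale is chosen from the largest magnitude, nothing is clipped and the round-trip error never exceeds half a step (`scale / 2`).

```python
def quantize_symmetric(values, bits=8):
    """Map floats to signed integers in [-qmax, qmax] with a single scale."""
    qmax = 2 ** (bits - 1) - 1
    max_abs = max(abs(v) for v in values)
    scale = max_abs / qmax if max_abs > 0 else 1.0  # avoid dividing by zero
    quantized = [round(v / scale) for v in values]
    return quantized, scale

def dequantize(quantized, scale):
    """Recover approximate floats from the integers."""
    return [q * scale for q in quantized]

values = [-1.2, -0.3, 0.0, 0.45, 0.9]
report = {}
for bits in (8, 4):
    q, scale = quantize_symmetric(values, bits)
    restored = dequantize(q, scale)
    worst = max(abs(v - r) for v, r in zip(values, restored))
    report[bits] = (q, scale, worst)
    print(f"INT{bits}: q={q}, scale={scale:.4f}, "
          f"worst error={worst:.4f} (bound {scale / 2:.4f})")
```

Halving the number of bits roughly squares-roots the number of levels, so the step size (and with it the error bound) grows quickly as you go from INT8 to INT4.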

In [None]:
# Simple quantization demonstration

def quantize_tensor(tensor, bits=8):
    """
    Symmetric quantization of a tensor.
    
    Args:
        tensor: Input float tensor
        bits: Number of bits for quantization (4 or 8)
    
    Returns:
        quantized: Integer tensor
        scale: Scale factor for dequantization
    """
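    # Symmetric quantization: map [-max_val, max_val] to [-qmax, qmax]
    max_val = tensor.abs().max()
    qmax = 2 ** (bits - 1) - 1  # 127 for INT8, 7 for INT4
    
    # Guard against an all-zero tensor, which would give scale = 0
    scale = torch.clamp(max_val / qmax, min=1e-12)
    
    # Quantize (4-bit values are still stored in int8 here; no bit-packing)
    quantized = torch.clamp(torch.round(tensor / scale), -qmax, qmax).to(torch.int8)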
    
    return quantized, scale


def dequantize_tensor(quantized, scale):
    """Convert quantized tensor back to float."""
    return quantized.float() * scale


# Create a sample weight tensor (like from a neural network layer)
torch.manual_seed(42)
original_weights = torch.randn(4, 4)  # Small example for visibility

print("Original Weights (FP32):")
print(original_weights)
print(f"\nMemory: {original_weights.numel() * 4} bytes")

In [None]:
# Quantize to INT8
quantized_int8, scale_int8 = quantize_tensor(original_weights, bits=8)

print("INT8 Quantized Weights:")
print(quantized_int8)
print(f"Scale factor: {scale_int8:.6f}")
print(f"\nMemory: {quantized_int8.numel() * 1} bytes (4x smaller!)")

# Dequantize and check error
dequantized_int8 = dequantize_tensor(quantized_int8, scale_int8)
error_int8 = (original_weights - dequantized_int8).abs()

print(f"\nReconstruction Error (mean): {error_int8.mean():.6f}")
print(f"Reconstruction Error (max):  {error_int8.max():.6f}")

In [None]:
# Compare INT8 vs INT4
quantized_int4, scale_int4 = quantize_tensor(original_weights, bits=4)

print("INT4 Quantized Weights:")
print(quantized_int4)  # Note: values range from -7 to 7
print(f"Scale factor: {scale_int4:.6f}")

# Dequantize and check error
dequantized_int4 = dequantize_tensor(quantized_int4, scale_int4)
error_int4 = (original_weights - dequantized_int4).abs()

print(f"\nReconstruction Error (mean): {error_int4.mean():.6f}")
print(f"Reconstruction Error (max):  {error_int4.max():.6f}")

# Comparison
print("\n" + "=" * 50)
print("Comparison: INT8 vs INT4")
print("=" * 50)
print(f"{'Metric':<25} {'INT8':>12} {'INT4':>12}")
print("-" * 50)
print(f"{'Memory (bytes)':<25} {quantized_int8.numel():>12} {quantized_int8.numel() // 2:>12}")
print(f"{'Mean Error':<25} {error_int8.mean():>12.6f} {error_int4.mean():>12.6f}")
print(f"{'Max Error':<25} {error_int8.max():>12.6f} {error_int4.max():>12.6f}")

### 🔍 What Just Happened?

1. **INT8 quantization** maps float values to at most 256 integer levels (our symmetric scheme uses -127 to 127), introducing small rounding errors.

2. **INT4 quantization** has only 16 levels available (our symmetric scheme uses -7 to 7), causing larger errors but needing half the bits of INT8.

3. **The key insight**: Even with 8× compression (FP32→INT4), the fundamental "shape" of the weights is preserved!
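
One detail the demo glosses over: `quantize_tensor` stores its 4-bit results in `torch.int8`, so each value still occupies a full byte. The memory halving only materializes when two 4-bit values are packed into one byte. Here is a minimal plain-Python sketch of that packing. The helper names are ours, and real libraries use their own layouts.

```python
def pack_int4(values):
    """Pack signed 4-bit integers (-8..7) two per byte, low nibble first."""
    values = list(values)
    if len(values) % 2:
        values.append(0)  # pad to an even count
    packed = bytearray()
    for low, high in zip(values[0::2], values[1::2]):
        # "& 0xF" keeps the two's-complement low 4 bits of each value
        packed.append((low & 0xF) | ((high & 0xF) << 4))
    return bytes(packed)

def unpack_int4(packed, count):
    """Inverse of pack_int4: recover `count` signed 4-bit integers."""
    values = []
    for byte in packed:
        for nibble in (byte & 0xF, byte >> 4):
            # Nibbles 8..15 encode the negative values -8..-1
            values.append(nibble - 16 if nibble >= 8 else nibble)
    return values[:count]

weights_int4 = [-7, 7, 0, -1, 3]
packed = pack_int4(weights_int4)
print(f"{len(weights_int4)} values stored in {len(packed)} bytes")
print("Round trip OK:", unpack_int4(packed, len(weights_int4)) == weights_int4)
```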

---

## Part 3: Quantizing a Real Model

Let's move from toy examples to real models. We'll compare a model at different precision levels.
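
Before measuring, it helps to know roughly what to expect. Weight-only loaders typically quantize the large linear-layer matrices and leave embeddings in 16-bit, so real savings fall short of the ideal 2× or 4×. The sketch below estimates this for GPT-2 small from its published shape (12 layers, hidden size 768, vocabulary 50,257, context length 1,024). Treat the split between quantized and unquantized tensors as an assumption: what your installed quantization library actually converts may differ, and biases and LayerNorm parameters are ignored here.

```python
# GPT-2 small architecture constants
HIDDEN, LAYERS, VOCAB, CONTEXT = 768, 12, 50257, 1024

# Weight matrices in one transformer block: attention QKV projection,
# attention output projection, and the two MLP matrices (4x expansion)
per_block = HIDDEN * 3 * HIDDEN + HIDDEN * HIDDEN + 2 * (HIDDEN * 4 * HIDDEN)
linear_params = LAYERS * per_block
embedding_params = VOCAB * HIDDEN + CONTEXT * HIDDEN  # token + position embeddings

def estimated_size_mb(linear_bits, other_bits=16):
    """Estimated weight size in MB if only the linear layers use `linear_bits`."""
    total_bits = linear_params * linear_bits + embedding_params * other_bits
    return total_bits / 8 / 1e6

print(f"Linear-layer parameters: {linear_params / 1e6:.1f}M")
print(f"Embedding parameters:    {embedding_params / 1e6:.1f}M")
for label, bits in [("16-bit", 16), ("8-bit", 8), ("4-bit", 4)]:
    print(f"{label:>6} linear layers: ~{estimated_size_mb(bits):.0f} MB")
```

Under these assumptions the embeddings make up a large share of a small model, so measured savings on GPT-2 will look less impressive than on a 7B+ model, where linear layers hold nearly all of the parameters.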

### Loading Models at Different Precisions

In [None]:
# Memory monitoring utility for DGX Spark
import gc

def get_gpu_memory():
    """Get current GPU memory usage in GB."""
    if torch.cuda.is_available():
        allocated = torch.cuda.memory_allocated() / 1e9
        reserved = torch.cuda.memory_reserved() / 1e9
        return allocated, reserved
    return 0, 0

def clear_memory():
    """Clear GPU memory cache."""
    gc.collect()
    if torch.cuda.is_available():
        torch.cuda.empty_cache()

print("Memory utilities loaded!")
allocated, reserved = get_gpu_memory()
print(f"Current GPU memory: {allocated:.2f} GB allocated, {reserved:.2f} GB reserved")

In [None]:
# Load a small model to demonstrate precision differences
from transformers import AutoModelForCausalLM, AutoTokenizer

# We'll use a small model first - GPT-2 (124M parameters)
model_name = "gpt2"

print("Loading GPT-2 at different precisions...")
print("=" * 60)

# Clear any existing models
clear_memory()

# Load tokenizer (shared across all precision levels)
tokenizer = AutoTokenizer.from_pretrained(model_name)
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token

In [None]:
# Now let's actually compare loading models at different precisions
from transformers import AutoModelForCausalLM

# We'll use a small model for demonstration
model_name = "gpt2"  # ~500MB, good for quick demo

print("Loading models at different precisions...")
print("=" * 60)

# Get baseline memory
initial_mem = get_gpu_memory()[0]

# FP32 Model
print("\n1. Loading FP32 model...")
model_fp32 = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.float32,
    device_map="cuda"
)

fp32_mem = get_gpu_memory()[0] - initial_mem
print(f"   FP32 Memory: {fp32_mem:.2f} GB")

In [None]:
# FP16 Model
print("\n2. Loading FP16 model...")

# Defensive check: ensure fp32_mem is defined from previous cell
if 'fp32_mem' not in dir() or fp32_mem is None or fp32_mem == 0:
    # Fallback estimate based on GPT-2 parameter count
    fp32_mem = 0.5  # ~500MB for GPT-2 in FP32

del model_fp32
clear_memory()
initial_mem = get_gpu_memory()[0]

model_fp16 = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.float16,
    device_map="cuda"
)

fp16_mem = get_gpu_memory()[0] - initial_mem
print(f"   FP16 Memory: {fp16_mem:.2f} GB")
print(f"   Savings vs FP32: {(1 - fp16_mem/fp32_mem) * 100:.1f}%")

In [None]:
# BF16 Model (preferred for Blackwell architecture)
print("\n3. Loading BF16 model (recommended for DGX Spark)...")

del model_fp16
clear_memory()
initial_mem = get_gpu_memory()[0]

model_bf16 = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.bfloat16,  # Native Blackwell support
    device_map="cuda"
)

bf16_mem = get_gpu_memory()[0] - initial_mem
print(f"   BF16 Memory: {bf16_mem:.2f} GB")
print(f"   Savings vs FP32: {(1 - bf16_mem/fp32_mem) * 100:.1f}%")
print(f"   Note: BF16 has native tensor core support on Blackwell!")

In [None]:
# INT8 Model using bitsandbytes
print("\n4. Loading INT8 model (8-bit quantization)...")
del model_bf16
clear_memory()
initial_mem = get_gpu_memory()[0]

try:
    from transformers import BitsAndBytesConfig
    
    quantization_config_8bit = BitsAndBytesConfig(
        load_in_8bit=True
    )
    
    model_int8 = AutoModelForCausalLM.from_pretrained(
        model_name,
        quantization_config=quantization_config_8bit,
        device_map="cuda"
    )
    
    int8_mem = get_gpu_memory()[0] - initial_mem
    print(f"   INT8 Memory: {int8_mem:.2f} GB")
    print(f"   Savings vs FP32: {(1 - int8_mem/fp32_mem) * 100:.1f}%")
    
except ImportError:
    print("   Note: bitsandbytes not installed. Run: pip install bitsandbytes")
    model_int8 = None
    int8_mem = fp32_mem / 4  # Estimated

In [None]:
# INT4 Model using bitsandbytes
print("\n5. Loading INT4 model (4-bit quantization)...")

# Safely clean up previous model if it exists
if 'model_int8' in dir() and model_int8 is not None:
    del model_int8
clear_memory()
initial_mem = get_gpu_memory()[0]

try:
    from transformers import BitsAndBytesConfig
    
    quantization_config_4bit = BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_compute_dtype=torch.bfloat16,  # Use BF16 for compute on Blackwell
        bnb_4bit_quant_type="nf4"  # NormalFloat4 - optimized for weights
    )
    
    model_int4 = AutoModelForCausalLM.from_pretrained(
        model_name,
        quantization_config=quantization_config_4bit,
        device_map="cuda"
    )
    
    int4_mem = get_gpu_memory()[0] - initial_mem
    print(f"   INT4 Memory: {int4_mem:.2f} GB")
    print(f"   Savings vs FP32: {(1 - int4_mem/fp32_mem) * 100:.1f}%")
    
except ImportError:
    print("   Note: bitsandbytes not installed. Run: pip install bitsandbytes")
    model_int4 = None
    int4_mem = fp32_mem / 8  # Estimated

In [None]:
# Summary comparison
# Guard against missing baseline
if 'fp32_mem' not in dir() or fp32_mem is None or fp32_mem == 0:
    print("⚠️  FP32 baseline not available (previous cell may have failed).")
    print("   Using estimated values for comparison.")
    fp32_mem = 0.5  # Estimated for GPT-2
print("\n" + "=" * 60)
print("Memory Comparison Summary")
print("=" * 60)
print(f"{'Precision':<15} {'Memory (GB)':>12} {'vs FP32':>12} {'Compression':>12}")
print("-" * 60)
print(f"{'FP32':<15} {fp32_mem:>12.3f} {'baseline':>12} {'1.0x':>12}")

# Safe division helper to avoid ZeroDivisionError
def safe_ratio(a, b, default="N/A"):
    return f"{a/b:.1f}x" if b > 0 else default

def safe_percent(a, b, default="N/A"):
    return f"{(1-a/b)*100:.0f}% less" if b > 0 else default

print(f"{'FP16':<15} {fp16_mem:>12.3f} {safe_percent(fp16_mem, fp32_mem):>12} {safe_ratio(fp32_mem, fp16_mem):>12}")
print(f"{'BF16':<15} {bf16_mem:>12.3f} {safe_percent(bf16_mem, fp32_mem):>12} {safe_ratio(fp32_mem, bf16_mem):>12}")
print(f"{'INT8':<15} {int8_mem:>12.3f} {safe_percent(int8_mem, fp32_mem):>12} {safe_ratio(fp32_mem, int8_mem):>12}")
print(f"{'INT4':<15} {int4_mem:>12.3f} {safe_percent(int4_mem, fp32_mem):>12} {safe_ratio(fp32_mem, int4_mem):>12}")
print("=" * 60)

---

## Part 4: Measuring Quality - Perplexity

Memory savings are useless if the model becomes terrible! Let's measure quality using **perplexity**.

### 🧒 ELI5: What is Perplexity?

> **Imagine you're playing a guessing game...**
>
> I say "The cat sat on the ___". How many reasonable words could fill the blank?
> - If you're confused and think it could be 100 different words → high perplexity
> - If you're confident it's "mat" or "couch" → low perplexity
>
> **In AI terms:** Perplexity measures how "surprised" a model is by text.  
> Lower perplexity = model predicts words better = higher quality.
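
The same idea as a formula: perplexity is the exponential of the average negative log-probability the model assigns to each actual next token. Here is a minimal sketch in plain Python, using made-up probabilities rather than real model outputs (the function name is ours):

```python
import math

def perplexity_from_probs(token_probs):
    """exp(mean negative log-likelihood) over the probabilities of the true tokens."""
    nll = [-math.log(p) for p in token_probs]
    return math.exp(sum(nll) / len(nll))

# Probability the model gave to each word that actually came next (illustrative)
confident = [0.9, 0.8, 0.95, 0.85]    # usually expects the right word
confused = [0.01, 0.02, 0.01, 0.05]   # surprised by almost every word
four_way_tie = [0.25] * 4             # always torn between 4 equally likely words

for name, probs in [("confident", confident),
                    ("confused", confused),
                    ("four-way tie", four_way_tie)]:
    print(f"{name:>12}: perplexity = {perplexity_from_probs(probs):.2f}")
```

A model that is always torn between 4 equally likely words has a perplexity of about 4, which is why perplexity is often read as an effective number of choices per token.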

In [None]:
# Perplexity calculation function
import math
from tqdm import tqdm

def calculate_perplexity(model, tokenizer, texts, max_length=512):
    """
    Calculate perplexity of a model on given texts.
    
    Perplexity = exp(average negative log-likelihood)
    Lower is better!
    
    Args:
        model: The language model
        tokenizer: The tokenizer
        texts: List of text strings
        max_length: Maximum sequence length
    
    Returns:
        float: The perplexity score
    """
    model.eval()
    total_loss = 0
    total_tokens = 0
    
    with torch.no_grad():
        for text in tqdm(texts, desc="Calculating perplexity"):
            # Tokenize
            encodings = tokenizer(
                text,
                return_tensors='pt',
                truncation=True,
                max_length=max_length
            )
            
            input_ids = encodings.input_ids.to(model.device)
            
            # Skip very short sequences
            if input_ids.size(1) < 2:
                continue
            
            # Get model outputs
            outputs = model(input_ids, labels=input_ids)
            
            # Accumulate loss
            loss = outputs.loss.item()
            num_tokens = input_ids.size(1) - 1  # Exclude first token
            
            total_loss += loss * num_tokens
            total_tokens += num_tokens
    
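    # Calculate perplexity (guard against every text being skipped as too short)
    if total_tokens == 0:
        raise ValueError("No text with at least 2 tokens was provided")
    avg_loss = total_loss / total_tokens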
    perplexity = math.exp(avg_loss)
    
    return perplexity

print("Perplexity function defined!")

In [None]:
# Sample evaluation texts
# In practice, you'd use a proper benchmark like WikiText or C4

sample_texts = [
    "The quick brown fox jumps over the lazy dog. This sentence contains every letter of the alphabet.",
    "Machine learning is a subset of artificial intelligence that enables computers to learn from data.",
    "The capital of France is Paris, which is known for the Eiffel Tower and its rich cultural heritage.",
    "In the year 2024, large language models became increasingly sophisticated and widely deployed.",
    "Neural networks consist of layers of interconnected nodes that process information in a hierarchical manner.",
    "The Python programming language is widely used in data science and machine learning applications.",
    "Quantization reduces the precision of neural network weights to decrease memory usage and increase speed.",
    "The transformer architecture, introduced in 2017, revolutionized natural language processing.",
    "Deep learning has enabled significant advances in computer vision, speech recognition, and translation.",
    "The DGX Spark platform provides 128GB of unified memory for running large AI models efficiently."
]

print(f"Prepared {len(sample_texts)} sample texts for evaluation")

In [None]:
# Compare perplexity across different precision levels
# Note: INT4 is still loaded from Part 3; FP16 and INT8 are reloaded fresh below

results = {}

# Test INT4 model (already loaded)
if model_int4 is not None:
    print("Testing INT4 model...")
    ppl_int4 = calculate_perplexity(model_int4, tokenizer, sample_texts)
    results['INT4'] = ppl_int4
    print(f"INT4 Perplexity: {ppl_int4:.2f}")

# Clean up and load FP16 for comparison
if model_int4 is not None:
    del model_int4
clear_memory()

In [None]:
# Test FP16 model (baseline quality)
print("Testing FP16 model...")
model_fp16 = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.float16,
    device_map="cuda"
)

ppl_fp16 = calculate_perplexity(model_fp16, tokenizer, sample_texts)
results['FP16'] = ppl_fp16
print(f"FP16 Perplexity: {ppl_fp16:.2f}")

del model_fp16
clear_memory()

In [None]:
# Test INT8 model
print("Testing INT8 model...")
try:
    # Re-import to ensure availability even if earlier cells failed
    from transformers import BitsAndBytesConfig
    
    model_int8 = AutoModelForCausalLM.from_pretrained(
        model_name,
        quantization_config=BitsAndBytesConfig(load_in_8bit=True),
        device_map="cuda"
    )
    
    ppl_int8 = calculate_perplexity(model_int8, tokenizer, sample_texts)
    results['INT8'] = ppl_int8
    print(f"INT8 Perplexity: {ppl_int8:.2f}")
    
    del model_int8
    clear_memory()
except Exception as e:
    print(f"INT8 test skipped: {e}")

In [None]:
# Summary of quality comparison
print("\n" + "=" * 60)
print("Quality Comparison: Perplexity by Precision")
print("=" * 60)
print(f"{'Precision':<15} {'Perplexity':>15} {'vs FP16':>15}")
print("-" * 60)

baseline = results.get('FP16', 0)

for precision in ['FP16', 'INT8', 'INT4']:
    if precision in results:
        ppl = results[precision]
        if precision == 'FP16':
            diff = "baseline"
        else:
            diff = f"{ppl - baseline:+.2f}"  # explicit sign, so an improvement shows as "-"
        print(f"{precision:<15} {ppl:>15.2f} {diff:>15}")

print("=" * 60)
print("\n💡 Interpretation (rules of thumb, not hard limits):")
print("   - Lower perplexity = Better quality")
print("   - Difference of <1.0 is typically acceptable")
print("   - Difference of <0.5 is excellent (production-ready)")

---

## Part 5: Speed Comparison

Quantization not only saves memory but can also speed up inference. Let's measure it!
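
Why would fewer bits be faster? When generating one token at a time, the GPU has to read every weight once per token, so decoding is often limited by memory bandwidth rather than arithmetic. Below is a back-of-the-envelope sketch, assuming a purely bandwidth-bound workload. The bandwidth figure is a hypothetical placeholder, not a DGX Spark specification, and the function name is ours.

```python
def max_tokens_per_sec(params_billions, bytes_per_param, bandwidth_gb_s):
    """Upper bound on decode speed if each token requires reading all weights once."""
    weight_gb = params_billions * bytes_per_param
    return bandwidth_gb_s / weight_gb

ASSUMED_BANDWIDTH_GB_S = 250.0  # hypothetical; substitute your hardware's measured value

for name, bytes_per_param in [("FP16", 2), ("INT8", 1), ("INT4", 0.5)]:
    bound = max_tokens_per_sec(7, bytes_per_param, ASSUMED_BANDWIDTH_GB_S)
    print(f"7B model @ {name}: at most ~{bound:.0f} tokens/sec")
```

This is only a ceiling. Dequantization overhead and kernel quality can erase the gain, and for a tiny model like GPT-2 quantized inference may even come out slower than FP16, which is why we measure rather than assume.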

In [None]:
import time

def benchmark_inference(model, tokenizer, prompt, num_tokens=50, num_runs=5):
    """
    Benchmark inference speed.
    
    Returns:
        dict: Contains tokens_per_second, latency_ms, and memory_gb
    """
    model.eval()
    
    # Prepare input
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    
    # Warmup
    with torch.no_grad():
        _ = model.generate(**inputs, max_new_tokens=10, do_sample=False)
    
    torch.cuda.synchronize()
    
    # Benchmark
    times = []
    for _ in range(num_runs):
        torch.cuda.synchronize()
        start = time.perf_counter()
        
        with torch.no_grad():
            outputs = model.generate(
                **inputs,
                max_new_tokens=num_tokens,
                do_sample=False,
                pad_token_id=tokenizer.pad_token_id
            )
        
        torch.cuda.synchronize()
        end = time.perf_counter()
        times.append(end - start)
    
    avg_time = sum(times) / len(times)
    tokens_per_second = num_tokens / avg_time
    
    return {
        'tokens_per_second': tokens_per_second,
        'latency_ms': avg_time * 1000,
        'memory_gb': get_gpu_memory()[0]
    }

print("Benchmark function defined!")

In [None]:
# Run benchmarks
benchmark_prompt = "The future of artificial intelligence is"
benchmark_results = {}

print("Running inference benchmarks...")
print(f"Prompt: '{benchmark_prompt}'")
print(f"Generating 50 tokens, 5 runs each\n")

In [None]:
# Benchmark FP16
print("Benchmarking FP16...")
clear_memory()

model_fp16 = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.float16,
    device_map="cuda"
)

benchmark_results['FP16'] = benchmark_inference(model_fp16, tokenizer, benchmark_prompt)
print(f"  Tokens/sec: {benchmark_results['FP16']['tokens_per_second']:.1f}")
print(f"  Memory: {benchmark_results['FP16']['memory_gb']:.2f} GB")

del model_fp16
clear_memory()

In [None]:
# Benchmark INT8
print("Benchmarking INT8...")

try:
    model_int8 = AutoModelForCausalLM.from_pretrained(
        model_name,
        quantization_config=BitsAndBytesConfig(load_in_8bit=True),
        device_map="cuda"
    )
    
    benchmark_results['INT8'] = benchmark_inference(model_int8, tokenizer, benchmark_prompt)
    print(f"  Tokens/sec: {benchmark_results['INT8']['tokens_per_second']:.1f}")
    print(f"  Memory: {benchmark_results['INT8']['memory_gb']:.2f} GB")
    
    del model_int8
    clear_memory()
except Exception as e:
    print(f"  Skipped: {e}")

In [None]:
# Benchmark INT4
print("Benchmarking INT4...")

try:
    model_int4 = AutoModelForCausalLM.from_pretrained(
        model_name,
        quantization_config=BitsAndBytesConfig(
            load_in_4bit=True,
            bnb_4bit_compute_dtype=torch.bfloat16,
            bnb_4bit_quant_type="nf4"
        ),
        device_map="cuda"
    )
    
    benchmark_results['INT4'] = benchmark_inference(model_int4, tokenizer, benchmark_prompt)
    print(f"  Tokens/sec: {benchmark_results['INT4']['tokens_per_second']:.1f}")
    print(f"  Memory: {benchmark_results['INT4']['memory_gb']:.2f} GB")
    
    del model_int4
    clear_memory()
except Exception as e:
    print(f"  Skipped: {e}")

In [None]:
# Final summary table
print("\n" + "=" * 70)
print("FINAL COMPARISON: Memory, Speed, and Quality")
print("=" * 70)
print(f"{'Precision':<12} {'Memory (GB)':>12} {'Tokens/sec':>12} {'Perplexity':>12} {'Speedup':>10}")
print("-" * 70)

fp16_speed = benchmark_results.get('FP16', {}).get('tokens_per_second', 1)

for precision in ['FP16', 'INT8', 'INT4']:
    if precision in benchmark_results:
        br = benchmark_results[precision]
        ppl = results.get(precision, 'N/A')
        if isinstance(ppl, float):
            ppl = f"{ppl:.2f}"
        speedup = br['tokens_per_second'] / fp16_speed
        
        print(f"{precision:<12} {br['memory_gb']:>12.2f} {br['tokens_per_second']:>12.1f} {ppl:>12} {speedup:>9.2f}x")

print("=" * 70)
print("\n🎉 You've just compared quantization techniques on real models!")

---

## ✋ Try It Yourself

### Exercise 1: Quantization Math

Implement asymmetric quantization (with zero-point) and compare it to symmetric quantization.

<details>
<summary>💡 Hint</summary>

Asymmetric quantization uses:
```python
scale = (max_val - min_val) / (qmax - qmin)
zero_point = qmin - round(min_val / scale)  # equals round(-min_val / scale) when qmin = 0
```
</details>

In [None]:
# TODO: Implement asymmetric quantization

def asymmetric_quantize(tensor, bits=8):
    """
    Asymmetric quantization with zero-point.
    
    Args:
        tensor: Input float tensor
        bits: Number of bits (default: 8)
    
    Returns:
        quantized: Quantized tensor
        scale: Scale factor
        zero_point: Zero point offset
    """
    # YOUR CODE HERE
    pass

# Test your implementation
# test_tensor = torch.tensor([0.0, 0.5, 1.0, 1.5, 2.0])  # All positive values
# q, s, zp = asymmetric_quantize(test_tensor)
# print(f"Quantized: {q}")
# print(f"Scale: {s}, Zero point: {zp}")

### Exercise 2: Larger Model Comparison

Try the same experiments with a larger model (e.g., `meta-llama/Llama-2-7b-hf`). 
How do the memory savings and quality compare?

<details>
<summary>💡 Hint</summary>

You'll need to:
1. Log in to Hugging Face (`huggingface-cli login`)
2. Accept the Llama 2 license agreement
3. Use the same code patterns from above
</details>

In [None]:
# TODO: Test with a larger model
# large_model_name = "meta-llama/Llama-2-7b-hf"

# YOUR CODE HERE

---

## ⚠️ Common Mistakes

### Mistake 1: Ignoring Calibration Data

```python
# ❌ Wrong: Using random data for calibration
calibration_data = torch.randn(100, 512)

# ✅ Right: Use representative data from your domain
calibration_data = load_samples_from_training_data()
```

**Why:** Quantization needs to see the actual value distributions in your weights and activations. Random data leads to poor scale factors.
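
To see why, the sketch below picks a symmetric INT8 scale from two calibration sets and then quantizes the same "deployment" values with each. All distributions are hypothetical stand-ins (a unit Gaussian for random calibration data, a wider Gaussian for real activations), and the helper names are ours.

```python
import random

random.seed(0)  # reproducible illustration

def calibrate_scale(samples, bits=8):
    """Pick a symmetric scale from the largest magnitude seen during calibration."""
    qmax = 2 ** (bits - 1) - 1
    return max(abs(s) for s in samples) / qmax

def mean_quant_error(values, scale, bits=8):
    """Mean absolute round-trip error; values outside the calibrated range are clipped."""
    qmax = 2 ** (bits - 1) - 1
    total = 0.0
    for v in values:
        q = max(-qmax, min(qmax, round(v / scale)))
        total += abs(v - q * scale)
    return total / len(values)

# Hypothetical stand-in for real activations: much wider than a unit Gaussian
deployment_values = [random.gauss(0, 5) for _ in range(2000)]
random_calibration = [random.gauss(0, 1) for _ in range(500)]
representative_calibration = [random.gauss(0, 5) for _ in range(500)]

err_random = mean_quant_error(deployment_values, calibrate_scale(random_calibration))
err_representative = mean_quant_error(
    deployment_values, calibrate_scale(representative_calibration)
)
print(f"Mean error with random calibration:         {err_random:.4f}")
print(f"Mean error with representative calibration: {err_representative:.4f}")
```

The scale learned from random data is far too small for the real range, so most deployment values are clipped and the error is dominated by clipping rather than rounding.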

### Mistake 2: Quantizing Without Evaluation

```python
# ❌ Wrong: Just quantize and deploy
model = quantize(model)
deploy(model)

# ✅ Right: Always measure quality first
ppl_original = evaluate(original_model)
quantized_model = quantize(original_model)
ppl_quantized = evaluate(quantized_model)
if ppl_quantized - ppl_original < 0.5:  # Acceptable threshold
    deploy(quantized_model)
```

**Why:** Some models/tasks are more sensitive to quantization than others.
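
Because sensitivity varies, a fixed absolute threshold can mislead: an increase of 0.5 is 10% on a baseline perplexity of 5 but only about 1% on a baseline of 40. One option is to check both an absolute and a relative limit. Here is a small sketch with hypothetical thresholds and a helper name of our own (`quality_gate`); tune both limits for your task.

```python
def quality_gate(ppl_original, ppl_quantized,
                 max_abs_increase=0.5, max_rel_increase=0.05):
    """Return (passed, absolute increase, relative increase) for a quantized model."""
    abs_increase = ppl_quantized - ppl_original
    rel_increase = abs_increase / ppl_original
    passed = abs_increase <= max_abs_increase and rel_increase <= max_rel_increase
    return passed, abs_increase, rel_increase

# Same absolute increase, very different relative impact (illustrative numbers)
for baseline, quantized in [(40.0, 40.4), (5.0, 5.4)]:
    passed, abs_inc, rel_inc = quality_gate(baseline, quantized)
    verdict = "deploy" if passed else "investigate"
    print(f"baseline {baseline:>5.1f} -> {quantized:>5.1f}: "
          f"+{abs_inc:.2f} ({rel_inc:.1%}) -> {verdict}")
```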

### Mistake 3: Wrong Compute Dtype

```python
# ❌ Wrong: Using FP32 compute for quantized weights
config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.float32  # Slow!
)

# ✅ Right: Use BF16 on Blackwell for best performance
config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.bfloat16  # Native Blackwell support
)
```

**Why:** The compute dtype affects inference speed. BF16 has native tensor core support on Blackwell.

---

## 🎉 Checkpoint

You've learned:

- ✅ **Data types matter**: FP32→FP16→INT8→INT4 each halve memory
- ✅ **Quantization = compression**: Map floats to fewer bits with controllable error
- ✅ **Perplexity measures quality**: Lower is better; as a rule of thumb, an increase of <0.5 is excellent
- ✅ **Speed can improve too**: Smaller data = less memory traffic, though dequantization overhead can offset the gain, so always benchmark
- ✅ **DGX Spark advantage**: 128GB of unified memory leaves room to experiment with very large models

---

## 🚀 Challenge (Optional)

**Build a Quantization Dashboard**

Create a function that takes a model and automatically:
1. Quantizes it to FP16, INT8, and INT4
2. Measures memory, speed, and perplexity for each
3. Recommends the best precision based on user constraints (memory budget, quality threshold)

```python
def quantization_analysis(model_name, memory_budget_gb=8, max_ppl_increase=0.5):
    """
    Analyze quantization options for a model.
    
    Returns recommendation based on constraints.
    """
    # YOUR CODE HERE
    pass
```

---

## 📖 Further Reading

- [A Survey of Quantization Methods for Efficient Neural Network Inference](https://arxiv.org/abs/2103.13630)
- [LLM.int8(): 8-bit Matrix Multiplication for Transformers at Scale](https://arxiv.org/abs/2208.07339)
- [QLoRA: Efficient Finetuning of Quantized LLMs](https://arxiv.org/abs/2305.14314)
- [NVIDIA Quantization Documentation](https://docs.nvidia.com/deeplearning/tensorrt/developer-guide/index.html#working-with-int8)

---

## 🧹 Cleanup

In [None]:
# Clear GPU memory for next notebook
import gc
import torch

# Delete any remaining model objects (but keep plain values such as model_name)
for var_name, value in list(globals().items()):
    if isinstance(value, torch.nn.Module):
        del globals()[var_name]
value = None  # drop the loop variable's reference to the last object

gc.collect()
if torch.cuda.is_available():
    torch.cuda.empty_cache()

print("Memory cleared!")
if torch.cuda.is_available():
    print(f"GPU memory after cleanup: {torch.cuda.memory_allocated()/1e9:.2f} GB")

---

## Next Steps

In the next notebook, we'll dive deep into **GPTQ Quantization**, one of the most widely used 4-bit quantization methods for GPU inference!

➡️ Continue to: [02-gptq-quantization.ipynb](02-gptq-quantization.ipynb)