# Task 11.6: Quality Benchmark Suite

**Module:** 11 - Model Quantization & Optimization  
**Time:** 2 hours  
**Difficulty:** ⭐⭐⭐⭐☆

---

## 🎯 Learning Objectives

By the end of this notebook, you will:
- [ ] Build a comprehensive benchmark suite for quantized models
- [ ] Measure perplexity across multiple datasets
- [ ] Evaluate task-specific accuracy (MMLU, HellaSwag)
- [ ] Compare all quantization methods systematically
- [ ] Create publication-quality comparison tables and visualizations

---

## 📚 Prerequisites

- Completed: Tasks 11.1-11.5 (all quantization notebooks)
- Models: Quantized models from previous notebooks (or we'll create them)
- Hardware: DGX Spark with 128GB unified memory

### ⚠️ Optional Dependencies (for full comparison)

For complete GPTQ and AWQ benchmarking, you should have completed:
- **Task 11.2** (`02-gptq-quantization.ipynb`) → Creates `./quantized_models/opt-350m-gptq-4bit-g128`
- **Task 11.3** (`03-awq-quantization.ipynb`) → Creates `./quantized_models/opt-350m-awq-4bit-g128`

**Note:** This notebook will still work without these models! It will benchmark FP16, INT8, and INT4 (bitsandbytes), and gracefully skip GPTQ/AWQ if not available.

---

## 🌍 Real-World Context

**The Problem:** You've created multiple quantized versions of your model. Which one should you deploy?

**The Answer:** It depends on your priorities!
- **Maximum quality** → FP16 or Q8
- **Best balance** → Q4_K_M (GGUF) or AWQ
- **Maximum compression** → NVFP4 (Blackwell) or Q2_K
- **Task-specific** → Benchmark on YOUR task!

This notebook gives you the tools to make data-driven decisions.

---

## 🧒 ELI5: Why Benchmarking Matters

> **Imagine you're buying a car...**
>
> You could just look at the price tag. But smart buyers check:
> - **Fuel efficiency** (like model size)
> - **Horsepower** (like inference speed)
> - **Safety rating** (like model quality)
> - **How it handles YOUR roads** (like task-specific accuracy)
>
> A sports car might be "best" on a track but terrible for city driving.
> Similarly, the "best" quantization depends on YOUR use case!
>
> **In AI terms:** A comprehensive benchmark tests multiple dimensions so you can pick the right model for YOUR deployment.

---

## Part 1: Setting Up the Benchmark Suite

We'll create a modular benchmark system that can evaluate any model.

In [None]:
import torch
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import gc
import time
import math
from tqdm import tqdm
from typing import Dict, List, Optional, Callable
from dataclasses import dataclass, field

print("="*60)
print("DGX Spark Benchmark Suite")
print("="*60)

print(f"PyTorch version: {torch.__version__}")
print(f"CUDA available: {torch.cuda.is_available()}")

if torch.cuda.is_available():
    print(f"GPU: {torch.cuda.get_device_name(0)}")
    print(f"Memory: {torch.cuda.get_device_properties(0).total_memory / 1e9:.1f} GB")

In [None]:
# Install lm-eval for comprehensive benchmarking
try:
    import lm_eval
    print(f"✅ lm-eval version: {lm_eval.__version__}")
    
    # Check minimum version (API changed significantly in 0.4.0)
    from packaging import version
    if version.parse(lm_eval.__version__) < version.parse("0.4.0"):
        print(f"⚠️  Warning: lm-eval version {lm_eval.__version__} is older than 0.4.0")
        print("   Some APIs may differ. Consider upgrading: pip install --upgrade lm-eval")
except ImportError:
    print("Installing lm-eval...")
    import subprocess
    import sys
    # Use the running interpreter's pip so the package lands in this kernel's environment
    subprocess.run([sys.executable, "-m", "pip", "install", "lm-eval", "--quiet"], check=True)
    import lm_eval
    print(f"✅ lm-eval installed: {lm_eval.__version__}")

In [None]:
@dataclass
class BenchmarkResult:
    """Store benchmark results for a single model."""
    model_name: str
    quantization_type: str
    model_size_mb: float
    perplexity: Optional[float] = None
    tokens_per_second: Optional[float] = None
    memory_used_gb: Optional[float] = None
    task_scores: Dict[str, float] = field(default_factory=dict)
    metadata: Dict[str, object] = field(default_factory=dict)
    
    def compression_ratio(self, baseline_size_mb: float) -> float:
        """Calculate compression ratio vs baseline."""
        return baseline_size_mb / self.model_size_mb
    
    def to_dict(self) -> dict:
        """Convert to dictionary for DataFrame."""
        result = {
            'Model': self.model_name,
            'Quantization': self.quantization_type,
            'Size (MB)': self.model_size_mb,
            'Perplexity': self.perplexity,
            'Tokens/s': self.tokens_per_second,
            'Memory (GB)': self.memory_used_gb,
        }
        result.update(self.task_scores)
        return result


class BenchmarkSuite:
    """Comprehensive benchmark suite for quantized models."""
    
    def __init__(self, baseline_model_id: str):
        self.baseline_model_id = baseline_model_id
        self.results: List[BenchmarkResult] = []
        self.tokenizer = None
        
    def _load_tokenizer(self):
        """Load tokenizer if not already loaded."""
        if self.tokenizer is None:
            from transformers import AutoTokenizer
            self.tokenizer = AutoTokenizer.from_pretrained(self.baseline_model_id)
            if self.tokenizer.pad_token is None:
                self.tokenizer.pad_token = self.tokenizer.eos_token
        return self.tokenizer
    
    def calculate_perplexity(
        self, 
        model, 
        texts: List[str], 
        max_length: int = 512
    ) -> float:
        """Calculate perplexity on a list of texts."""
        tokenizer = self._load_tokenizer()
        model.eval()
        
        total_loss = 0
        total_tokens = 0
        
        with torch.no_grad():
            for text in tqdm(texts, desc="Perplexity", leave=False):
                encodings = tokenizer(
                    text,
                    return_tensors='pt',
                    truncation=True,
                    max_length=max_length
                )
                input_ids = encodings.input_ids.to(model.device)
                
                if input_ids.size(1) < 2:
                    continue
                
                outputs = model(input_ids, labels=input_ids)
                loss = outputs.loss.item()
                num_tokens = input_ids.size(1) - 1
                
                total_loss += loss * num_tokens
                total_tokens += num_tokens
        
        return math.exp(total_loss / total_tokens) if total_tokens > 0 else float('inf')
    
    def benchmark_speed(
        self, 
        model, 
        prompt: str = "The future of artificial intelligence is",
        num_tokens: int = 50,
        num_runs: int = 5
    ) -> float:
        """Benchmark generation speed."""
        tokenizer = self._load_tokenizer()
        model.eval()
        
        inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
        
        # Warmup
        with torch.no_grad():
            _ = model.generate(**inputs, max_new_tokens=10, do_sample=False)
        
        torch.cuda.synchronize()
        
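        times = []
        generated = []
        for _ in range(num_runs):
            torch.cuda.synchronize()
            start = time.perf_counter()
            
            with torch.no_grad():
                output_ids = model.generate(
                    **inputs,
                    max_new_tokens=num_tokens,
                    do_sample=False,
                    pad_token_id=tokenizer.pad_token_id
                )
            
            torch.cuda.synchronize()
            times.append(time.perf_counter() - start)
            # Generation can stop early at EOS, so count the tokens actually produced
            generated.append(output_ids.shape[1] - inputs["input_ids"].shape[1])
        
        # Tokens per second over all timed runs
        return sum(generated) / sum(times)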
    
    def add_result(self, result: BenchmarkResult):
        """Add a benchmark result."""
        self.results.append(result)
    
    def to_dataframe(self) -> pd.DataFrame:
        """Convert results to pandas DataFrame."""
        return pd.DataFrame([r.to_dict() for r in self.results])
    
    def plot_comparison(self, save_path: Optional[str] = None):
        """Create comparison visualization."""
        df = self.to_dataframe()
        
        fig, axes = plt.subplots(2, 2, figsize=(14, 10))
        
        colors = plt.cm.tab10(np.linspace(0, 1, len(df)))
        
        # Size comparison
        ax = axes[0, 0]
        ax.barh(df['Quantization'], df['Size (MB)'], color=colors)
        ax.set_xlabel('Size (MB)')
        ax.set_title('Model Size')
        ax.invert_yaxis()
        
        # Perplexity comparison
        ax = axes[0, 1]
        if 'Perplexity' in df.columns and df['Perplexity'].notna().any():
            ax.barh(df['Quantization'], df['Perplexity'], color=colors)
            ax.set_xlabel('Perplexity (lower is better)')
            ax.set_title('Model Quality')
            ax.invert_yaxis()
        else:
            ax.text(0.5, 0.5, 'No perplexity data', ha='center', va='center')
        
        # Speed comparison
        ax = axes[1, 0]
        if 'Tokens/s' in df.columns and df['Tokens/s'].notna().any():
            ax.barh(df['Quantization'], df['Tokens/s'], color=colors)
            ax.set_xlabel('Tokens/second')
            ax.set_title('Inference Speed')
            ax.invert_yaxis()
        else:
            ax.text(0.5, 0.5, 'No speed data', ha='center', va='center')
        
        # Efficiency (quality/size)
        ax = axes[1, 1]
        if 'Perplexity' in df.columns and df['Perplexity'].notna().any():
            baseline_size = df['Size (MB)'].max()
            efficiency = (baseline_size / df['Size (MB)']) / df['Perplexity']
            ax.barh(df['Quantization'], efficiency, color=colors)
            ax.set_xlabel('Efficiency Score (higher is better)')
            ax.set_title('Quality/Size Efficiency')
            ax.invert_yaxis()
        else:
            ax.text(0.5, 0.5, 'No efficiency data', ha='center', va='center')
        
        plt.tight_layout()
        
        if save_path:
            plt.savefig(save_path, dpi=150, bbox_inches='tight')
        
        plt.show()
        plt.close(fig)  # Free memory from figure

    def print_summary(self):
        """Print summary table."""
        df = self.to_dataframe()
        print("\n" + "="*80)
        print("BENCHMARK SUMMARY")
        print("="*80)
        print(df.to_string(index=False))
        print("="*80)


print("Benchmark suite classes defined!")
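
`calculate_perplexity` weights each text's mean loss by its number of predicted tokens before exponentiating, rather than averaging per-text perplexities. A minimal sketch of why that matters follows; the helper name `corpus_perplexity` and the loss values are illustrative assumptions, not output from a real run. With one short easy text and one long hard text, the two aggregation methods give noticeably different numbers.

```python
import math

def corpus_perplexity(per_text):
    """per_text: list of (mean_loss, num_predicted_tokens) pairs."""
    total_loss = sum(loss * n for loss, n in per_text)
    total_tokens = sum(n for _, n in per_text)
    return math.exp(total_loss / total_tokens) if total_tokens > 0 else float("inf")

# Illustrative values only: a short easy text and a long hard one
per_text = [(2.0, 10), (4.0, 90)]

weighted = corpus_perplexity(per_text)
# Averaging per-text perplexities instead treats both texts as equally important
naive = sum(math.exp(loss) for loss, _ in per_text) / len(per_text)

print(f"token-weighted: {weighted:.2f}, naive mean of perplexities: {naive:.2f}")
```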

---

## Part 2: Evaluation Datasets

We'll use multiple evaluation datasets for comprehensive benchmarking.

In [None]:
# Diverse evaluation texts for perplexity
PERPLEXITY_EVAL_TEXTS = [
    # General knowledge
    "The history of human civilization spans thousands of years, from ancient Mesopotamia to modern times.",
    "The solar system contains eight planets orbiting around the Sun, each with unique characteristics.",
    "Climate change is caused by the accumulation of greenhouse gases in the Earth's atmosphere.",
    
    # Technical/Scientific
    "Machine learning algorithms can be categorized into supervised, unsupervised, and reinforcement learning.",
    "DNA molecules carry genetic information through sequences of nucleotide bases.",
    "Quantum mechanics describes the behavior of particles at the atomic and subatomic level.",
    
    # Creative/Literary
    "The old lighthouse stood on the cliff, its beam cutting through the fog like a sword of light.",
    "She walked through the autumn forest, leaves crunching beneath her feet like whispered secrets.",
    "The city never sleeps; its streets pulse with the rhythm of a million heartbeats.",
    
    # Factual/News-like
    "The stock market experienced significant volatility as investors reacted to economic indicators.",
    "New research published today suggests a breakthrough in renewable energy technology.",
    "The international conference brought together leaders from over fifty countries.",
    
    # Conversational
    "How are you doing today? I hope everything is going well with your projects.",
    "Could you please explain how to solve this problem step by step?",
    "That's a great question! Let me think about the best way to answer it.",
    
    # Code-like (for code models)
    "The function takes two parameters and returns their sum after validation.",
    "Import the necessary libraries and initialize the model with default parameters.",
    "Error handling is crucial for robust software development and user experience.",
    
    # More diverse topics
    "The recipe calls for flour, sugar, eggs, and butter mixed in specific proportions.",
    "The game ended with a dramatic last-minute goal that shocked everyone in the stadium.",
    "Music has the power to evoke emotions and connect people across cultures.",
]

print(f"Prepared {len(PERPLEXITY_EVAL_TEXTS)} evaluation texts for perplexity")
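
The texts above are single sentences, so the `max_length=512` truncation in `calculate_perplexity` never discards anything. For longer documents, a common alternative to truncation is a strided sliding window, where each window scores only the tokens the previous window did not already score. The sketch below computes just the window boundaries; `sliding_windows`, `stride`, and the toy sizes are assumptions for illustration and are not part of this notebook's suite.

```python
from typing import Iterator, Tuple

def sliding_windows(seq_len: int, max_length: int, stride: int) -> Iterator[Tuple[int, int, int]]:
    """Yield (begin, end, num_new_tokens) for each evaluation window."""
    if not 0 < stride <= max_length:
        raise ValueError("stride must be in (0, max_length] so no token is skipped")
    prev_end = 0
    for begin in range(0, seq_len, stride):
        end = min(begin + max_length, seq_len)
        yield begin, end, end - prev_end  # only tokens past prev_end are scored
        prev_end = end
        if end == seq_len:
            break

windows = list(sliding_windows(seq_len=10, max_length=4, stride=2))
for begin, end, new in windows:
    print(f"tokens [{begin}:{end}) -> score last {new}")
```

In a real evaluation, the labels for all but the last `num_new_tokens` positions of each window would be set to `-100`, so those tokens provide context without being counted in the loss.
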

In [None]:
# Task-specific evaluation using lm-eval
BENCHMARK_TASKS = [
    "hellaswag",      # Commonsense sentence completion
    "arc_easy",       # Grade-school science questions (easy set)
    "winogrande",     # Commonsense pronoun resolution
    # "mmlu",         # Massive Multitask Language Understanding (slow!)
]

print(f"Will evaluate on tasks: {BENCHMARK_TASKS}")
print("\nNote: MMLU is commented out by default as it takes a long time.")
print("Uncomment it for comprehensive evaluation.")
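
`BenchmarkResult.task_scores` is where task accuracies belong. Assuming lm-eval 0.4-style output, where `simple_evaluate` returns a dict whose `"results"` entry maps each task to metrics keyed like `"acc,none"`, a hypothetical helper such as `extract_task_scores` could flatten it into a dict suitable for `task_scores=`. The key layout is an assumption about your installed version, and the sample dict holds placeholder numbers, not measured scores.

```python
from typing import Dict

def extract_task_scores(results: dict, metric: str = "acc") -> Dict[str, float]:
    """Pull one metric per task out of an lm-eval style results dict."""
    scores = {}
    for task, metrics in results.get("results", {}).items():
        for key, value in metrics.items():
            # Keys are assumed to look like "<metric>,<filter>", e.g. "acc,none"
            if key.split(",")[0] == metric and isinstance(value, (int, float)):
                scores[f"{task} ({metric})"] = float(value)
                break
    return scores

# Placeholder numbers for illustration only, not real benchmark results
sample = {
    "results": {
        "hellaswag": {"acc,none": 0.5, "acc_stderr,none": 0.01, "alias": "hellaswag"},
        "arc_easy": {"acc,none": 0.25, "acc_norm,none": 0.3},
    }
}
print(extract_task_scores(sample))
```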

---

## Part 3: Running the Benchmark

Let's benchmark multiple quantized models.

In [None]:
from transformers import AutoModelForCausalLM, AutoTokenizer

# Model to benchmark
MODEL_ID = "facebook/opt-350m"  # Small for demo; use larger for real benchmarks

# Initialize benchmark suite
suite = BenchmarkSuite(MODEL_ID)

print(f"Benchmarking model: {MODEL_ID}")

In [None]:
def get_model_size(model) -> float:
    """
    Calculate model size in MB based on parameter count and data types.

    Args:
        model: PyTorch model to measure

    Returns:
        Size in megabytes (MB)
    """
    param_bytes = sum(
        p.numel() * p.element_size() for p in model.parameters()
    )
    return param_bytes / 1e6


def get_memory_usage() -> float:
    """
    Get current GPU memory usage in GB.

    Returns:
        Memory allocated on GPU in gigabytes (GB), or 0 if no GPU
    """
    if torch.cuda.is_available():
        return torch.cuda.memory_allocated() / 1e9
    return 0


def clear_memory():
    """Clear GPU memory cache and run garbage collection."""
    gc.collect()
    if torch.cuda.is_available():
        torch.cuda.empty_cache()


print("Utility functions defined!")
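
`get_model_size` multiplies parameter count by bytes per element. For group-quantized formats, dividing the FP16 size by a fixed factor is only a rough guide, because each group of weights also stores a scale and usually a zero point. The arithmetic below is a sketch under stated assumptions: FP16 scales, zero points stored at the weight bit width, and every weight quantized. Real checkpoints typically keep embeddings and normalization layers in higher precision, so measured sizes tend to be larger.

```python
from typing import Optional

def effective_bits_per_weight(bits: int, group_size: int,
                              scale_bits: int = 16,
                              zero_bits: Optional[int] = None) -> float:
    """Average storage per weight, including per-group metadata."""
    if zero_bits is None:
        zero_bits = bits  # assumption: zero point stored at the weight bit width
    return bits + (scale_bits + zero_bits) / group_size

def estimated_size_mb(num_params: int, bits_per_weight: float) -> float:
    """Convert a parameter count and bits/weight into megabytes."""
    return num_params * bits_per_weight / 8 / 1e6

num_params = 350_000_000  # rough parameter count, for illustration
for label, bpw in [("FP16", 16.0), ("4-bit, g128", effective_bits_per_weight(4, 128))]:
    print(f"{label}: {bpw:.3f} bits/weight, ~{estimated_size_mb(num_params, bpw):.0f} MB")
```

Under these assumptions a 4-bit, group-size-128 format costs slightly more than 4 bits per weight, so its compression ratio against FP16 is a little under 4x.
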

In [None]:
# Benchmark 1: FP16 Baseline
print("="*60)
print("Benchmarking FP16 Baseline")
print("="*60)

clear_memory()

model_fp16 = AutoModelForCausalLM.from_pretrained(
    MODEL_ID,
    torch_dtype=torch.float16,
    device_map="cuda"
)

# Measure size and memory
fp16_size = get_model_size(model_fp16)
fp16_memory = get_memory_usage()

# Calculate perplexity
print("Calculating perplexity...")
fp16_ppl = suite.calculate_perplexity(model_fp16, PERPLEXITY_EVAL_TEXTS)

# Benchmark speed
print("Benchmarking speed...")
fp16_speed = suite.benchmark_speed(model_fp16)

# Store results
suite.add_result(BenchmarkResult(
    model_name=MODEL_ID,
    quantization_type="FP16",
    model_size_mb=fp16_size,
    perplexity=fp16_ppl,
    tokens_per_second=fp16_speed,
    memory_used_gb=fp16_memory
))

print(f"\nFP16 Results:")
print(f"  Size: {fp16_size:.1f} MB")
print(f"  Perplexity: {fp16_ppl:.2f}")
print(f"  Speed: {fp16_speed:.1f} tok/s")
print(f"  Memory: {fp16_memory:.2f} GB")

del model_fp16
clear_memory()
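
`benchmark_speed` averages over all timed runs, so a single slow run (for example, one hit by a background process) can pull the reported figure down. A sketch of a more robust summary using the standard library's `statistics` module; `summarize_throughput` and the timing values are hypothetical, chosen only to show the effect of an outlier.

```python
import statistics
from typing import Dict, List

def summarize_throughput(times_s: List[float], num_tokens: int) -> Dict[str, float]:
    """Summarize per-run generation times as tokens/second."""
    rates = [num_tokens / t for t in times_s]
    return {
        "mean_tok_s": statistics.mean(rates),
        "median_tok_s": statistics.median(rates),
        "stdev_tok_s": statistics.stdev(rates) if len(rates) > 1 else 0.0,
    }

# Made-up timings: four steady runs and one slow outlier
times = [0.50, 0.50, 0.50, 0.50, 1.00]
summary = summarize_throughput(times, num_tokens=50)
print({k: round(v, 1) for k, v in summary.items()})
```

Reporting the median alongside the spread makes it easier to tell a real speed difference between quantization methods from run-to-run noise.
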

In [None]:
# Benchmark 2: INT8 (bitsandbytes)
print("\n" + "="*60)
print("Benchmarking INT8 (bitsandbytes)")
print("="*60)

try:
    from transformers import BitsAndBytesConfig
    
    clear_memory()
    
    int8_config = BitsAndBytesConfig(load_in_8bit=True)
    
    model_int8 = AutoModelForCausalLM.from_pretrained(
        MODEL_ID,
        quantization_config=int8_config,
        device_map="cuda"
    )
    
    int8_size = fp16_size / 2  # Approximate
    int8_memory = get_memory_usage()
    
    print("Calculating perplexity...")
    int8_ppl = suite.calculate_perplexity(model_int8, PERPLEXITY_EVAL_TEXTS)
    
    print("Benchmarking speed...")
    int8_speed = suite.benchmark_speed(model_int8)
    
    suite.add_result(BenchmarkResult(
        model_name=MODEL_ID,
        quantization_type="INT8",
        model_size_mb=int8_size,
        perplexity=int8_ppl,
        tokens_per_second=int8_speed,
        memory_used_gb=int8_memory
    ))
    
    print(f"\nINT8 Results:")
    print(f"  Size: {int8_size:.1f} MB")
    print(f"  Perplexity: {int8_ppl:.2f} ({int8_ppl - fp16_ppl:+.2f})")
    print(f"  Speed: {int8_speed:.1f} tok/s")
    print(f"  Memory: {int8_memory:.2f} GB")
    
    del model_int8
    clear_memory()
    
except Exception as e:
    print(f"INT8 benchmark skipped: {e}")

In [None]:
# Benchmark 3: INT4 (bitsandbytes NF4)
print("\n" + "="*60)
print("Benchmarking INT4/NF4 (bitsandbytes)")
print("="*60)

try:
    clear_memory()
    
    int4_config = BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_compute_dtype=torch.bfloat16,
        bnb_4bit_quant_type="nf4"
    )
    
    model_int4 = AutoModelForCausalLM.from_pretrained(
        MODEL_ID,
        quantization_config=int4_config,
        device_map="cuda"
    )
    
    int4_size = fp16_size / 4  # Approximate
    int4_memory = get_memory_usage()
    
    print("Calculating perplexity...")
    int4_ppl = suite.calculate_perplexity(model_int4, PERPLEXITY_EVAL_TEXTS)
    
    print("Benchmarking speed...")
    int4_speed = suite.benchmark_speed(model_int4)
    
    suite.add_result(BenchmarkResult(
        model_name=MODEL_ID,
        quantization_type="INT4/NF4",
        model_size_mb=int4_size,
        perplexity=int4_ppl,
        tokens_per_second=int4_speed,
        memory_used_gb=int4_memory
    ))
    
    print(f"\nINT4/NF4 Results:")
    print(f"  Size: {int4_size:.1f} MB")
    print(f"  Perplexity: {int4_ppl:.2f} ({int4_ppl - fp16_ppl:+.2f})")
    print(f"  Speed: {int4_speed:.1f} tok/s")
    print(f"  Memory: {int4_memory:.2f} GB")
    
    del model_int4
    clear_memory()
    
except Exception as e:
    print(f"INT4 benchmark skipped: {e}")

In [None]:
# Benchmark 4: GPTQ (if available from previous notebook)
print("\n" + "="*60)
print("Benchmarking GPTQ")
print("="*60)

import os
gptq_path = "./quantized_models/opt-350m-gptq-4bit-g128"

if os.path.exists(gptq_path):
    try:
        from auto_gptq import AutoGPTQForCausalLM
        
        clear_memory()
        
        model_gptq = AutoGPTQForCausalLM.from_quantized(
            gptq_path,
            device="cuda:0",
            use_safetensors=True
        )
        
        # Get actual model size from files
        gptq_size = sum(
            os.path.getsize(os.path.join(gptq_path, f))
            for f in os.listdir(gptq_path)
            if f.endswith('.safetensors') or f.endswith('.bin')
        ) / 1e6
        gptq_memory = get_memory_usage()
        
        print("Calculating perplexity...")
        gptq_ppl = suite.calculate_perplexity(model_gptq, PERPLEXITY_EVAL_TEXTS)
        
        print("Benchmarking speed...")
        gptq_speed = suite.benchmark_speed(model_gptq)
        
        suite.add_result(BenchmarkResult(
            model_name=MODEL_ID,
            quantization_type="GPTQ-4bit",
            model_size_mb=gptq_size,
            perplexity=gptq_ppl,
            tokens_per_second=gptq_speed,
            memory_used_gb=gptq_memory
        ))
        
        print(f"\nGPTQ Results:")
        print(f"  Size: {gptq_size:.1f} MB")
        print(f"  Perplexity: {gptq_ppl:.2f} ({gptq_ppl - fp16_ppl:+.2f})")
        print(f"  Speed: {gptq_speed:.1f} tok/s")
        print(f"  Memory: {gptq_memory:.2f} GB")
        
        del model_gptq
        clear_memory()
        
    except Exception as e:
        print(f"GPTQ benchmark failed: {e}")
else:
    print(f"GPTQ model not found at {gptq_path}")
    print("Run notebook 02 first to create GPTQ models.")

In [None]:
# Benchmark 5: AWQ (if available from previous notebook)
print("\n" + "="*60)
print("Benchmarking AWQ")
print("="*60)

awq_path = "./quantized_models/opt-350m-awq-4bit-g128"

if os.path.exists(awq_path):
    try:
        from awq import AutoAWQForCausalLM
        
        clear_memory()
        
        model_awq = AutoAWQForCausalLM.from_quantized(
            awq_path,
            fuse_layers=True
        )
        
        awq_size = sum(
            os.path.getsize(os.path.join(awq_path, f))
            for f in os.listdir(awq_path)
            if f.endswith('.safetensors') or f.endswith('.bin')
        ) / 1e6
        awq_memory = get_memory_usage()
        
        print("Calculating perplexity...")
        awq_ppl = suite.calculate_perplexity(model_awq, PERPLEXITY_EVAL_TEXTS)
        
        print("Benchmarking speed...")
        awq_speed = suite.benchmark_speed(model_awq)
        
        suite.add_result(BenchmarkResult(
            model_name=MODEL_ID,
            quantization_type="AWQ-4bit",
            model_size_mb=awq_size,
            perplexity=awq_ppl,
            tokens_per_second=awq_speed,
            memory_used_gb=awq_memory
        ))
        
        print(f"\nAWQ Results:")
        print(f"  Size: {awq_size:.1f} MB")
        print(f"  Perplexity: {awq_ppl:.2f} ({awq_ppl - fp16_ppl:+.2f})")
        print(f"  Speed: {awq_speed:.1f} tok/s")
        print(f"  Memory: {awq_memory:.2f} GB")
        
        del model_awq
        clear_memory()
        
    except Exception as e:
        print(f"AWQ benchmark failed: {e}")
else:
    print(f"AWQ model not found at {awq_path}")
    print("Run notebook 03 first to create AWQ models.")

---

## Part 4: Results Analysis

Let's analyze and visualize our benchmark results.

In [None]:
# Print summary table
suite.print_summary()

In [None]:
# Create visualization
suite.plot_comparison('benchmark_comparison.png')
plt.close('all')  # Free memory from figures

In [None]:
# Detailed analysis
df = suite.to_dataframe()

if len(df) > 0:
    print("\n" + "="*60)
    print("DETAILED ANALYSIS")
    print("="*60)
    
    baseline = df[df['Quantization'] == 'FP16'].iloc[0]
    
    print(f"\nBaseline (FP16):")
    print(f"  Size: {baseline['Size (MB)']:.1f} MB")
    print(f"  Perplexity: {baseline['Perplexity']:.2f}")
    print(f"  Speed: {baseline['Tokens/s']:.1f} tok/s")
    
    print(f"\nCompression Analysis:")
    for _, row in df.iterrows():
        if row['Quantization'] != 'FP16':
            compression = baseline['Size (MB)'] / row['Size (MB)']
            ppl_delta = row['Perplexity'] - baseline['Perplexity']
            speed_ratio = row['Tokens/s'] / baseline['Tokens/s']
            
            print(f"\n{row['Quantization']}:")
            print(f"  Compression: {compression:.1f}x")
            print(f"  Perplexity change: {ppl_delta:+.2f} ({ppl_delta/baseline['Perplexity']*100:+.1f}%)")
            print(f"  Speed ratio: {speed_ratio:.2f}x")
    
    # Recommendations
    print("\n" + "="*60)
    print("RECOMMENDATIONS")
    print("="*60)
    
    if len(df) > 1:
        # Best quality (lowest perplexity after FP16)
        # Copy so that adding the Efficiency column below does not write into a view of df
        quant_only = df[df['Quantization'] != 'FP16'].copy()
        if len(quant_only) > 0:
            best_quality = quant_only.loc[quant_only['Perplexity'].idxmin()]
            print(f"\n🏆 Best Quality: {best_quality['Quantization']}")
            print(f"   Perplexity: {best_quality['Perplexity']:.2f}")
            
            # Best compression
            best_compression = quant_only.loc[quant_only['Size (MB)'].idxmin()]
            print(f"\n💾 Best Compression: {best_compression['Quantization']}")
            print(f"   Size: {best_compression['Size (MB)']:.1f} MB ({baseline['Size (MB)']/best_compression['Size (MB)']:.1f}x)")
            
            # Best speed
            best_speed = quant_only.loc[quant_only['Tokens/s'].idxmax()]
            print(f"\n⚡ Best Speed: {best_speed['Quantization']}")
            print(f"   Speed: {best_speed['Tokens/s']:.1f} tok/s")
            
            # Best balance (efficiency)
            quant_only['Efficiency'] = (baseline['Size (MB)'] / quant_only['Size (MB)']) / quant_only['Perplexity'] * 100
            best_balance = quant_only.loc[quant_only['Efficiency'].idxmax()]
            print(f"\n⚖️  Best Balance: {best_balance['Quantization']}")
            print(f"   Efficiency score: {best_balance['Efficiency']:.2f}")
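
The "Best Balance" pick relies on the efficiency score: compression ratio divided by perplexity, times 100. It is a heuristic that treats halving the size as worth a doubling of perplexity, which may not match your priorities. A small worked example with placeholder numbers (not measured results) shows how the score is computed; `efficiency_score` is an illustrative name.

```python
def efficiency_score(size_mb: float, perplexity: float, baseline_size_mb: float) -> float:
    """Compression ratio divided by perplexity, scaled by 100."""
    return (baseline_size_mb / size_mb) / perplexity * 100

# Placeholder values for illustration only
baseline_size_mb = 800.0
candidates = {
    "8-bit": {"size_mb": 400.0, "perplexity": 20.0},
    "4-bit": {"size_mb": 200.0, "perplexity": 25.0},
}

scores = {
    name: efficiency_score(c["size_mb"], c["perplexity"], baseline_size_mb)
    for name, c in candidates.items()
}
best = max(scores, key=scores.get)
print(scores, "->", best)
```

Here the 4-bit candidate wins despite a 25% higher perplexity, because its compression ratio is twice as large. If quality matters more than size for your deployment, a constraint such as a maximum perplexity increase may be a better selection rule than this score.
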

In [None]:
# Create comprehensive comparison chart
fig, axes = plt.subplots(1, 2, figsize=(14, 6))

df = suite.to_dataframe()

if len(df) > 0:
    baseline_ppl = df[df['Quantization'] == 'FP16']['Perplexity'].values[0]
    baseline_size = df[df['Quantization'] == 'FP16']['Size (MB)'].values[0]
    
    # Scatter: Size vs Perplexity
    ax = axes[0]
    scatter = ax.scatter(
        df['Size (MB)'], 
        df['Perplexity'], 
        s=200, 
        c=range(len(df)), 
        cmap='viridis',
        alpha=0.7
    )
    
    for i, row in df.iterrows():
        ax.annotate(
            row['Quantization'],
            (row['Size (MB)'], row['Perplexity']),
            textcoords="offset points",
            xytext=(0, 10),
            ha='center'
        )
    
    ax.set_xlabel('Model Size (MB)')
    ax.set_ylabel('Perplexity (lower is better)')
    ax.set_title('Size vs Quality Trade-off')
    ax.grid(True, alpha=0.3)
    
    # Pareto frontier line
    sorted_df = df.sort_values('Size (MB)')
    pareto = []
    min_ppl = float('inf')
    for _, row in sorted_df.iterrows():
        if row['Perplexity'] < min_ppl:
            pareto.append(row)
            min_ppl = row['Perplexity']
    if len(pareto) > 1:
        pareto_df = pd.DataFrame(pareto)
        ax.plot(pareto_df['Size (MB)'], pareto_df['Perplexity'], 'r--', alpha=0.5, label='Pareto frontier')
        ax.legend()
    
    # Bar chart: Relative comparison
    ax = axes[1]
    
    x = np.arange(len(df))
    width = 0.35
    
    # Normalize to percentages
    size_pct = df['Size (MB)'] / baseline_size * 100
    ppl_pct = df['Perplexity'] / baseline_ppl * 100
    
    bars1 = ax.bar(x - width/2, size_pct, width, label='Size (% of FP16)', color='steelblue')
    bars2 = ax.bar(x + width/2, ppl_pct, width, label='Perplexity (% of FP16)', color='coral')
    
    ax.axhline(y=100, color='gray', linestyle='--', alpha=0.5)
    ax.set_ylabel('Percentage of FP16 Baseline')
    ax.set_title('Relative Performance')
    ax.set_xticks(x)
    ax.set_xticklabels(df['Quantization'], rotation=45, ha='right')
    ax.legend()
    
    plt.tight_layout()
    plt.savefig('benchmark_detailed.png', dpi=150, bbox_inches='tight')
    plt.show()
    plt.close(fig)  # Free memory from figure
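
The dashed line in the left panel connects the Pareto-optimal configurations: those for which no other configuration is both smaller and lower in perplexity. The same filter is sketched below as a standalone function, run on placeholder (size, perplexity) pairs rather than measured results; `pareto_frontier` and the configuration names are illustrative.

```python
from typing import Dict, List, Tuple

def pareto_frontier(points: Dict[str, Tuple[float, float]]) -> List[str]:
    """Return names on the frontier when minimizing both size and perplexity."""
    frontier = []
    best_ppl = float("inf")
    # Walk from smallest to largest; keep a point only if it improves perplexity
    for name, (size, ppl) in sorted(points.items(), key=lambda kv: kv[1]):
        if ppl < best_ppl:
            frontier.append(name)
            best_ppl = ppl
    return frontier

# Placeholder values for illustration only
points = {
    "FP16": (800.0, 20.0),
    "8-bit": (400.0, 20.5),
    "4-bit A": (200.0, 24.0),
    "4-bit B": (210.0, 22.0),
    "4-bit C": (220.0, 23.0),
}
print(pareto_frontier(points))
```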

---

## ✋ Try It Yourself

### Exercise 1: Benchmark a Larger Model

Run the full benchmark suite on Llama-2-7B or Mistral-7B.

<details>
<summary>💡 Hint</summary>

```python
MODEL_ID = "meta-llama/Llama-2-7b-hf"
suite = BenchmarkSuite(MODEL_ID)
# Run all benchmarks...
```
</details>

In [None]:
# TODO: Benchmark a larger model
# YOUR CODE HERE

### Exercise 2: Task-Specific Evaluation

Use lm-eval to evaluate on specific tasks (HellaSwag, ARC, etc.).

<details>
<summary>💡 Hint</summary>

```python
from lm_eval import evaluator

results = evaluator.simple_evaluate(
    model="hf",
    model_args=f"pretrained={model_path}",
    tasks=["hellaswag"],
    batch_size=8
)
```
</details>

In [None]:
# TODO: Run task-specific evaluation
# YOUR CODE HERE

---

## ⚠️ Common Mistakes

### Mistake 1: Only Using Perplexity

```python
# ❌ Wrong: Only perplexity
if ppl < 10:
    deploy(model)

# ✅ Right: Multiple metrics
if ppl < 10 and accuracy > 0.8 and speed > 20:
    deploy(model)
```

**Why:** Perplexity doesn't capture task-specific performance.

### Mistake 2: Not Testing on Target Task

```python
# ❌ Wrong: General benchmarks only
results = evaluate_on_hellaswag(model)  # For a coding assistant?

# ✅ Right: Evaluate on your use case
results = evaluate_on_code_completion(model)
```

**Why:** A model that's great at common sense may be terrible at code.

### Mistake 3: Ignoring Memory During Benchmark

```python
# ❌ Wrong: Not clearing between models
model1 = load("model1")  # Uses 8GB
model2 = load("model2")  # Now using 16GB total!

# ✅ Right: Clear memory between benchmarks
model1 = load("model1")
benchmark(model1)
del model1
gc.collect()              # Drop lingering Python references first...
torch.cuda.empty_cache()  # ...then release the cached GPU blocks
model2 = load("model2")
```

**Why:** A model left on the GPU inflates the memory reading for the next one and can cause out-of-memory errors with larger models.

---

## 🎉 Checkpoint

You've learned:

- ✅ **Comprehensive benchmarking**: Size, speed, quality, efficiency
- ✅ **Multiple metrics**: Perplexity isn't everything
- ✅ **Visualization**: Clear comparison charts
- ✅ **Data-driven decisions**: Choose based on YOUR priorities
- ✅ **Production readiness**: Benchmark like you deploy

---

## 🚀 Challenge (Optional)

**Build an Automated Model Selection Pipeline**

Create a function that:
1. Takes constraints (max size, min quality, min speed)
2. Benchmarks all available quantization methods
3. Returns the best model for your constraints

```python
def select_best_model(
    model_id: str,
    max_size_mb: float = 1000,
    max_ppl_increase: float = 0.5,
    min_speed_tok_s: float = 20
) -> str:
    """
    Automatically select the best quantization method.
    
    Returns: Path to the best model
    """
    # YOUR CODE HERE
    pass
```
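
One possible starting point for step 3, assuming steps 1 and 2 have already produced a list of result dicts: filter by the constraints, then pick the smallest survivor. The dict keys, the `pick_by_constraints` name, and the numbers are all hypothetical. How you rank the survivors (by size, speed, or perplexity) is a design choice left to you.

```python
from typing import List, Optional

def pick_by_constraints(results: List[dict], baseline_ppl: float,
                        max_size_mb: float, max_ppl_increase: float,
                        min_speed_tok_s: float) -> Optional[dict]:
    """Return the smallest result meeting every constraint, or None."""
    feasible = [
        r for r in results
        if r["size_mb"] <= max_size_mb
        and r["perplexity"] - baseline_ppl <= max_ppl_increase
        and r["tok_s"] >= min_speed_tok_s
    ]
    return min(feasible, key=lambda r: r["size_mb"]) if feasible else None

# Placeholder values for illustration only
results = [
    {"name": "FP16", "size_mb": 800.0, "perplexity": 20.0, "tok_s": 40.0},
    {"name": "8-bit", "size_mb": 400.0, "perplexity": 20.2, "tok_s": 25.0},
    {"name": "4-bit", "size_mb": 200.0, "perplexity": 21.0, "tok_s": 30.0},
]
choice = pick_by_constraints(results, baseline_ppl=20.0, max_size_mb=500,
                             max_ppl_increase=0.5, min_speed_tok_s=20)
print(choice["name"] if choice else "no configuration satisfies the constraints")
```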

---

## 📖 Further Reading

- [lm-evaluation-harness](https://github.com/EleutherAI/lm-evaluation-harness)
- [HELM Benchmark](https://crfm.stanford.edu/helm/)
- [Open LLM Leaderboard](https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard)
- [Quantization Benchmarks Collection](https://huggingface.co/collections/quantization-benchmarks)

---

## 🧹 Cleanup

In [None]:
# Final cleanup
clear_memory()

print("Cleanup complete!")
print(f"Final GPU memory: {get_memory_usage():.2f} GB")

---

## 🎓 Module Complete!

Congratulations! You've completed **Module 11: Model Quantization & Optimization**.

### What You've Learned:

1. **Quantization Fundamentals** - Data types, precision, memory tradeoffs
2. **GPTQ Quantization** - Hessian-based post-training quantization
3. **AWQ Quantization** - Activation-aware weight quantization
4. **GGUF Conversion** - llama.cpp compatibility
5. **FP4 Deep Dive** - Blackwell exclusive quantization
6. **Quality Benchmarking** - Comprehensive evaluation suite

### Your DGX Spark Superpowers:

- 🚀 Run 70B models with FP4 quantization
- ⚡ 3× prefill speedup with native FP4 tensor cores
- 💾 3.5× memory reduction with <1% quality loss
- 🎯 Data-driven quantization method selection

### Next Steps:

Continue to **Module 12: Inference Optimization** to learn about:
- TensorRT deployment
- Continuous batching
- KV cache optimization
- Production inference pipelines

---

*Happy quantizing! You're now a DGX Spark quantization expert!* 🎉