# Lab 3.2.7: Quality Benchmark Suite

## Comprehensive Model Quality Evaluation Across Quantization Methods

**Duration:** 2 hours

---

### Learning Objectives

By the end of this lab, you will be able to:

1. **Implement perplexity benchmarks** to measure language model quality
2. **Run MMLU evaluations** to assess knowledge retention after quantization
3. **Compare all quantization methods** (FP16, INT8, INT4, FP8, FP4, GPTQ, AWQ, GGUF)
4. **Analyze quality-performance trade-offs** for production decisions
5. **Create comprehensive benchmark reports** with statistical analysis

---

### Why Quality Benchmarking Matters

```
Professor SPARK says:

"Quantization is like compression for photos. Yes, you save space, but did you
lose the important details? A good benchmark suite answers: 'Is this model still
smart enough for my use case?' We measure this through perplexity (how surprised
the model is by text) and MMLU (does it still know facts?)."
```

### The Quality Triangle

```
                    Quality
                      /\
                     /  \
                    /    \
                   /  ??  \
                  /________\
              Speed        Memory

Every quantization method makes different trade-offs.
This lab helps you measure exactly what you gain and lose.
```

---

## Section 1: Environment Setup and Benchmark Framework

### 1.1 Import Required Libraries

In [None]:
# Core libraries
import torch
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple, Any, Callable
from pathlib import Path
import json
import time
from tqdm.auto import tqdm
import warnings
warnings.filterwarnings('ignore')

# Transformers and datasets
from transformers import (
    AutoModelForCausalLM, 
    AutoTokenizer,
    BitsAndBytesConfig
)
from datasets import load_dataset

# Local utilities
import sys
sys.path.append('..')
from scripts import (
    calculate_perplexity,
    calculate_perplexity_batch,
    perplexity_by_domain,
    compare_perplexity,
    benchmark_inference,
    compare_models,
    get_gpu_memory,
    clear_memory,
    MemoryTracker,
    print_dgx_spark_status
)

# Plotting style
plt.style.use('seaborn-v0_8-whitegrid')
plt.rcParams['figure.figsize'] = (12, 6)
plt.rcParams['font.size'] = 11

print("Environment ready for quality benchmarking!")
print_dgx_spark_status()

### 1.2 Benchmark Configuration

In [None]:
@dataclass
class BenchmarkConfig:
    """Configuration for quality benchmarks."""
    
    # Model settings
    model_name: str = "meta-llama/Llama-3.2-3B-Instruct"
    
    # Perplexity settings
    perplexity_samples: int = 100  # Number of text samples
    max_length: int = 512  # Max tokens per sample
    stride: int = 256  # Sliding window stride
    
    # MMLU settings
    mmlu_subjects: List[str] = field(default_factory=lambda: [
        'abstract_algebra',
        'anatomy',
        'astronomy',
        'business_ethics',
        'clinical_knowledge',
        'computer_security',
        'conceptual_physics',
        'high_school_mathematics',
        'machine_learning',
        'professional_medicine'
    ])
    mmlu_samples_per_subject: int = 20  # Samples per subject
    
    # Inference settings
    batch_size: int = 1
    num_inference_runs: int = 50
    warmup_runs: int = 5
    
    # Output settings
    save_results: bool = True
    results_dir: str = "../data/benchmark_results"


# Create configuration
config = BenchmarkConfig()

# Ensure results directory exists
Path(config.results_dir).mkdir(parents=True, exist_ok=True)

print("Benchmark Configuration:")
print(f"  Model: {config.model_name}")
print(f"  Perplexity samples: {config.perplexity_samples}")
print(f"  MMLU subjects: {len(config.mmlu_subjects)}")
print(f"  Results directory: {config.results_dir}")

### 1.3 Benchmark Results Data Structure

In [None]:
@dataclass
class QualityBenchmarkResult:
    """Stores comprehensive benchmark results for a quantization method."""
    
    # Identification
    method_name: str
    quantization_bits: Optional[int]
    
    # Perplexity metrics
    perplexity_wikitext: float = 0.0
    perplexity_c4: float = 0.0
    perplexity_code: float = 0.0
    perplexity_avg: float = 0.0
    perplexity_std: float = 0.0
    
    # MMLU metrics
    mmlu_accuracy: float = 0.0
    mmlu_by_subject: Dict[str, float] = field(default_factory=dict)
    
    # Performance metrics
    inference_latency_ms: float = 0.0
    tokens_per_second: float = 0.0
    memory_gb: float = 0.0
    
    # Quality degradation (vs FP16 baseline)
    perplexity_increase_pct: float = 0.0
    mmlu_decrease_pct: float = 0.0
    
    # Metadata
    timestamp: str = ""
    notes: str = ""
    
    def to_dict(self) -> dict:
        """Convert to dictionary for serialization."""
        return {
            'method_name': self.method_name,
            'quantization_bits': self.quantization_bits,
            'perplexity_wikitext': self.perplexity_wikitext,
            'perplexity_c4': self.perplexity_c4,
            'perplexity_code': self.perplexity_code,
            'perplexity_avg': self.perplexity_avg,
            'perplexity_std': self.perplexity_std,
            'mmlu_accuracy': self.mmlu_accuracy,
            'mmlu_by_subject': self.mmlu_by_subject,
            'inference_latency_ms': self.inference_latency_ms,
            'tokens_per_second': self.tokens_per_second,
            'memory_gb': self.memory_gb,
            'perplexity_increase_pct': self.perplexity_increase_pct,
            'mmlu_decrease_pct': self.mmlu_decrease_pct,
            'timestamp': self.timestamp,
            'notes': self.notes
        }


print("Benchmark data structures defined!")

---

## Section 2: Perplexity Benchmarking

### What is Perplexity?

```
Professor SPARK's ELI5:

"Imagine you're playing a guessing game. I show you a sentence with a blank:
'The cat sat on the ___'

A good model confidently guesses 'mat' or 'floor'. A confused model might
consider 'elephant' or 'democracy' as equally likely.

Perplexity = how confused is the model?
- Lower = more confident = better
- FP16: ~10
- Good INT4: ~10.5 (5% increase)
- Bad quantization: ~15+ (50% increase, model is broken)"
```

### Mathematical Definition

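$$\text{Perplexity} = \exp\left(-\frac{1}{N}\sum_{i=1}^{N} \log P(w_i \mid w_1, \ldots, w_{i-1})\right)$$

Here $N$ is the number of scored tokens and $P(w_i \mid w_1, \ldots, w_{i-1})$ is the probability the model assigns to token $w_i$ given the tokens before it. The exponent is the average cross-entropy loss per token, which is why the code in this section computes perplexity as `exp(loss)`.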
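
To make the formula concrete, here is a minimal sketch that applies it to hand-picked token probabilities. The helper name `perplexity_from_probs` and the probability values are illustrative assumptions, not part of the lab's `scripts` package.

```python
import math

def perplexity_from_probs(token_probs):
    """Perplexity = exp of the mean negative log-probability per token."""
    nll = [-math.log(p) for p in token_probs]
    return math.exp(sum(nll) / len(nll))

# Hypothetical per-token probabilities, chosen only for illustration.
confident = [0.5, 0.5, 0.5, 0.5]   # model gives each correct token 50%
confused = [0.1, 0.1, 0.1, 0.1]    # model gives each correct token 10%

ppl_confident = perplexity_from_probs(confident)
ppl_confused = perplexity_from_probs(confused)
print(f"confident: {ppl_confident:.2f}, confused: {ppl_confused:.2f}")
# prints: confident: 2.00, confused: 10.00
```

When every token gets the same probability `p`, perplexity is `1/p`. A perplexity of 10 can therefore be read as the model being about as uncertain as a uniform choice among 10 tokens.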

### 2.1 Perplexity Calculator Class

In [None]:
class PerplexityBenchmark:
    """
    Comprehensive perplexity benchmarking across multiple datasets.
    
    Supports:
    - WikiText-2 (general text)
    - C4 (web text)
    - Code (programming)
    """
    
    def __init__(self, config: BenchmarkConfig):
        self.config = config
        self.datasets = {}
        self._load_datasets()
    
    def _load_datasets(self):
        """Load evaluation datasets."""
        print("Loading perplexity evaluation datasets...")
        
        # WikiText-2 for general language modeling
        try:
            wikitext = load_dataset("wikitext", "wikitext-2-raw-v1", split="test")
            self.datasets['wikitext'] = self._prepare_texts(
                wikitext['text'], 
                self.config.perplexity_samples
            )
            print(f"  WikiText-2: {len(self.datasets['wikitext'])} samples")
        except Exception as e:
            print(f"  WikiText-2: Failed to load ({e})")
            self.datasets['wikitext'] = []
        
        # C4 for web text
        try:
            c4 = load_dataset(
                "allenai/c4", 
                "en", 
                split="validation", 
                streaming=True
            )
            c4_texts = []
            for i, item in enumerate(c4):
                if i >= self.config.perplexity_samples:
                    break
                c4_texts.append(item['text'])
            self.datasets['c4'] = c4_texts
            print(f"  C4: {len(self.datasets['c4'])} samples")
        except Exception as e:
            print(f"  C4: Failed to load ({e})")
            self.datasets['c4'] = []
        
        # Code dataset
        try:
            code = load_dataset(
                "codeparrot/github-code", 
                streaming=True, 
                split="train",
                languages=["Python"]
            )
            code_texts = []
            for i, item in enumerate(code):
                if i >= self.config.perplexity_samples:
                    break
                if len(item['code']) > 100:  # Skip very short snippets
                    code_texts.append(item['code'])
            self.datasets['code'] = code_texts
            print(f"  Code: {len(self.datasets['code'])} samples")
        except Exception as e:
            print(f"  Code: Failed to load ({e})")
            self.datasets['code'] = []
    
    def _prepare_texts(self, texts: List[str], n_samples: int) -> List[str]:
        """Filter and prepare text samples."""
        # Filter out empty or very short texts
        valid_texts = [t for t in texts if len(t.strip()) > 100]
        return valid_texts[:n_samples]
    
    @torch.no_grad()
    def calculate_perplexity(
        self,
        model: AutoModelForCausalLM,
        tokenizer: AutoTokenizer,
        texts: List[str],
        desc: str = "Calculating perplexity"
    ) -> Tuple[float, float]:
        """
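        Calculate perplexity over a list of texts.

        Each text is truncated to ``max_length`` tokens and scored on its own;
        the result summarizes the per-text perplexities rather than a single
        corpus-level perplexity. Samples with perplexity >= 1000 are dropped.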
        
        Returns:
            Tuple of (mean_perplexity, std_perplexity)
        """
        perplexities = []
        
        for text in tqdm(texts, desc=desc):
            if not text.strip():
                continue
            
            # Tokenize
            encodings = tokenizer(
                text,
                return_tensors="pt",
                truncation=True,
                max_length=self.config.max_length
            )
            
            input_ids = encodings.input_ids.to(model.device)
            
            if input_ids.size(1) < 2:
                continue
            
            # Get model outputs
            outputs = model(input_ids, labels=input_ids)
            loss = outputs.loss.item()
            
            # Perplexity = exp(loss)
            ppl = np.exp(loss)
            
            # Skip extreme outliers (indicates issues)
            if ppl < 1000:
                perplexities.append(ppl)
        
        if not perplexities:
            return float('inf'), 0.0
        
        return np.mean(perplexities), np.std(perplexities)
    
    def run_benchmark(
        self,
        model: AutoModelForCausalLM,
        tokenizer: AutoTokenizer
    ) -> Dict[str, Tuple[float, float]]:
        """
        Run perplexity benchmark across all datasets.
        
        Returns:
            Dictionary mapping dataset name to (mean_ppl, std_ppl)
        """
        results = {}
        
        for name, texts in self.datasets.items():
            if not texts:
                print(f"  Skipping {name} (no data)")
                results[name] = (float('inf'), 0.0)
                continue
            
            mean_ppl, std_ppl = self.calculate_perplexity(
                model, tokenizer, texts, desc=f"  {name}"
            )
            results[name] = (mean_ppl, std_ppl)
            print(f"  {name}: {mean_ppl:.2f} (±{std_ppl:.2f})")
        
        return results


# Initialize perplexity benchmark
ppl_benchmark = PerplexityBenchmark(config)
print("\nPerplexity benchmark ready!")
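
`BenchmarkConfig` defines a `stride` field, but `calculate_perplexity` as written truncates each text to `max_length` and never uses it. One common way to score long texts is a sliding window: each window sees up to `max_length` tokens of context, but only the tokens not already scored by an earlier window count toward the loss. The sketch below shows just the window arithmetic; `sliding_windows` is a hypothetical helper, not a function from this lab.

```python
from typing import List, Tuple

def sliding_windows(seq_len: int, max_length: int, stride: int) -> List[Tuple[int, int, int]]:
    """Return (begin, end, n_scored) for each window over a token sequence.

    n_scored is how many tokens at the end of the window are new,
    i.e. not already scored by a previous window.
    """
    windows = []
    prev_end = 0
    for begin in range(0, seq_len, stride):
        end = min(begin + max_length, seq_len)
        windows.append((begin, end, end - prev_end))
        prev_end = end
        if end == seq_len:
            break
    return windows

# Values mirror the lab defaults (max_length=512, stride=256); seq_len is made up.
windows = sliding_windows(seq_len=1000, max_length=512, stride=256)
for w in windows:
    print(w)
# prints:
# (0, 512, 512)
# (256, 768, 256)
# (512, 1000, 232)
```

In a real evaluation loop, the labels for the first `(end - begin) - n_scored` tokens of each window would be set to `-100` so the Hugging Face loss ignores them. The per-window losses would then be combined weighted by `n_scored`.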

---

## Section 3: MMLU Benchmark

### What is MMLU?

```
Professor SPARK's ELI5:

"MMLU is like a standardized test covering 57 subjects - from astronomy to
zoology. It measures: 'Does the model still know stuff after quantization?'

Example question:
Q: What is the capital of France?
A) London  B) Paris  C) Berlin  D) Rome

A good FP16 model: 70% accuracy
Good INT4: 68% accuracy (acceptable)
Bad quantization: ~25% (random guessing among 4 choices = broken)"
```
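
Before reading much into a gap like 70% vs 68%, it helps to estimate how noisy the accuracy number is. The sketch below uses a simple binomial standard error. The question count assumes this lab's defaults (10 subjects, 20 questions each); `accuracy_std_error` is a hypothetical helper, and the normal approximation is only a rough guide.

```python
import math

def accuracy_std_error(accuracy: float, n_questions: int) -> float:
    """Standard error of an accuracy estimate under a binomial model."""
    return math.sqrt(accuracy * (1.0 - accuracy) / n_questions)

# 10 subjects x 20 questions each matches the default sizes used in this lab.
n_questions = 10 * 20
se = accuracy_std_error(0.70, n_questions)
ci_half_width = 1.96 * se  # approximate 95% interval (normal approximation)
print(f"70% +/- {ci_half_width * 100:.1f} points on {n_questions} questions")
# prints: 70% +/- 6.4 points on 200 questions
```

With only 200 questions, a 2-point difference between two methods is well inside the sampling noise. Increase `mmlu_samples_per_subject` or the number of subjects if you need to resolve small differences.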

### 3.1 MMLU Evaluator Class

In [None]:
class MMLUBenchmark:
    """
    MMLU (Massive Multitask Language Understanding) benchmark.
    
    Evaluates model knowledge across diverse subjects.
    """
    
    def __init__(self, config: BenchmarkConfig):
        self.config = config
        self.subjects_data = {}
        self._load_mmlu()
    
    def _load_mmlu(self):
        """Load MMLU dataset for configured subjects."""
        print("Loading MMLU evaluation dataset...")
        
        for subject in self.config.mmlu_subjects:
            try:
                dataset = load_dataset(
                    "cais/mmlu",
                    subject,
                    split="test"
                )
                # Limit samples per subject
                samples = list(dataset)[:self.config.mmlu_samples_per_subject]
                self.subjects_data[subject] = samples
                print(f"  {subject}: {len(samples)} samples")
            except Exception as e:
                print(f"  {subject}: Failed to load ({e})")
    
    def _format_question(self, item: dict) -> str:
        """Format MMLU question for model input."""
        question = item['question']
        choices = item['choices']
        
        formatted = f"Question: {question}\n\n"
        for i, choice in enumerate(choices):
            formatted += f"{chr(65+i)}. {choice}\n"
        formatted += "\nAnswer:"
        
        return formatted
    
    @torch.no_grad()
    def evaluate_question(
        self,
        model: AutoModelForCausalLM,
        tokenizer: AutoTokenizer,
        item: dict
    ) -> bool:
        """
        Evaluate a single MMLU question.
        
        Uses log-probability scoring to select the answer.
        """
        prompt = self._format_question(item)
        correct_answer = item['answer']  # Index 0-3
        
        # Tokenize prompt
        inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
        
        # Get log probabilities for answer tokens (A, B, C, D)
        outputs = model(**inputs)
        logits = outputs.logits[0, -1, :]  # Last position logits
        
        # Get token IDs for A, B, C, D
        answer_tokens = [
            tokenizer.encode(" A", add_special_tokens=False)[-1],
            tokenizer.encode(" B", add_special_tokens=False)[-1],
            tokenizer.encode(" C", add_special_tokens=False)[-1],
            tokenizer.encode(" D", add_special_tokens=False)[-1],
        ]
        
        # Get logits for answer tokens
        answer_logits = [logits[t].item() for t in answer_tokens]
        
        # Model's prediction is the highest logit
        predicted = np.argmax(answer_logits)
        
        return bool(predicted == correct_answer)
    
    def run_benchmark(
        self,
        model: AutoModelForCausalLM,
        tokenizer: AutoTokenizer
    ) -> Dict[str, float]:
        """
        Run MMLU benchmark across all subjects.
        
        Returns:
            Dictionary mapping subject name to accuracy
        """
        results = {}
        total_correct = 0
        total_questions = 0
        
        for subject, samples in self.subjects_data.items():
            if not samples:
                continue
            
            correct = 0
            for item in tqdm(samples, desc=f"  {subject}"):
                if self.evaluate_question(model, tokenizer, item):
                    correct += 1
            
            accuracy = correct / len(samples)
            results[subject] = accuracy
            total_correct += correct
            total_questions += len(samples)
            
            print(f"  {subject}: {accuracy:.1%}")
        
        # Overall accuracy
        if total_questions > 0:
            results['overall'] = total_correct / total_questions
            print(f"\n  Overall MMLU: {results['overall']:.1%}")
        
        return results


# Initialize MMLU benchmark
mmlu_benchmark = MMLUBenchmark(config)
print("\nMMLU benchmark ready!")
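
`evaluate_question` compares the raw logits of the four answer tokens and takes the largest. The sketch below shows the same selection step on made-up logits and also turns them into probabilities over the four choices, which can be handy for inspecting how confident a prediction was. `pick_answer` and the logit values are illustrative assumptions.

```python
import math

def pick_answer(answer_logits):
    """Return (letter, probabilities) from logits for choices A-D."""
    m = max(answer_logits)                      # subtract max for numerical stability
    exps = [math.exp(x - m) for x in answer_logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    best = max(range(len(probs)), key=probs.__getitem__)
    return "ABCD"[best], probs

# Hypothetical logits for the tokens " A", " B", " C", " D".
letter, probs = pick_answer([1.2, 3.4, 0.3, 2.9])
print(letter, [round(p, 3) for p in probs])
```

Softmax preserves ordering, so the predicted letter is the same whether you take the argmax of the logits or of the probabilities.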

---

## Section 4: Unified Benchmark Runner

### 4.1 Complete Benchmark Suite Class

In [None]:
class QualityBenchmarkSuite:
    """
    Complete quality benchmark suite for comparing quantization methods.
    
    Runs:
    - Perplexity across multiple datasets
    - MMLU knowledge evaluation
    - Inference performance metrics
    """
    
    def __init__(self, config: BenchmarkConfig):
        self.config = config
        self.ppl_benchmark = PerplexityBenchmark(config)
        self.mmlu_benchmark = MMLUBenchmark(config)
        self.results: List[QualityBenchmarkResult] = []
        self.baseline: Optional[QualityBenchmarkResult] = None
    
    def benchmark_model(
        self,
        model: AutoModelForCausalLM,
        tokenizer: AutoTokenizer,
        method_name: str,
        bits: Optional[int] = None,
        run_mmlu: bool = True,
        notes: str = ""
    ) -> QualityBenchmarkResult:
        """
        Run complete benchmark suite on a model.
        
        Args:
            model: The model to benchmark
            tokenizer: Model's tokenizer
            method_name: Name of quantization method
            bits: Bit width (e.g., 4, 8, 16)
            run_mmlu: Whether to run MMLU (slower)
            notes: Additional notes
            
        Returns:
            QualityBenchmarkResult with all metrics
        """
        print(f"\n{'='*60}")
        print(f"Benchmarking: {method_name}")
        print(f"{'='*60}")
        
        result = QualityBenchmarkResult(
            method_name=method_name,
            quantization_bits=bits,
            timestamp=time.strftime("%Y-%m-%d %H:%M:%S"),
            notes=notes
        )
        
        # Memory usage
        memory_info = get_gpu_memory()
        result.memory_gb = memory_info.get('used_gb', 0)
        print(f"\nMemory used: {result.memory_gb:.2f} GB")
        
        # Perplexity benchmark
        print("\n--- Perplexity Benchmark ---")
        ppl_results = self.ppl_benchmark.run_benchmark(model, tokenizer)
        
        result.perplexity_wikitext = ppl_results.get('wikitext', (float('inf'), 0))[0]
        result.perplexity_c4 = ppl_results.get('c4', (float('inf'), 0))[0]
        result.perplexity_code = ppl_results.get('code', (float('inf'), 0))[0]
        
        # Average perplexity (excluding inf)
        valid_ppls = [p for p in [result.perplexity_wikitext, result.perplexity_c4, result.perplexity_code] if p < float('inf')]
        if valid_ppls:
            result.perplexity_avg = np.mean(valid_ppls)
            result.perplexity_std = np.std(valid_ppls)
        
        # MMLU benchmark (optional)
        if run_mmlu:
            print("\n--- MMLU Benchmark ---")
            mmlu_results = self.mmlu_benchmark.run_benchmark(model, tokenizer)
            result.mmlu_accuracy = mmlu_results.get('overall', 0.0)
            result.mmlu_by_subject = {
                k: v for k, v in mmlu_results.items() if k != 'overall'
            }
        
        # Inference performance
        print("\n--- Inference Performance ---")
        perf_result = benchmark_inference(
            model=model,
            tokenizer=tokenizer,
            prompt="The future of artificial intelligence is",
            max_new_tokens=50,
            num_runs=self.config.num_inference_runs,
            warmup_runs=self.config.warmup_runs
        )
        
        result.inference_latency_ms = perf_result.mean_latency_ms
        result.tokens_per_second = perf_result.tokens_per_second
        print(f"  Latency: {result.inference_latency_ms:.2f} ms")
        print(f"  Throughput: {result.tokens_per_second:.1f} tok/s")
        
        # Calculate degradation vs baseline
        if self.baseline:
            if self.baseline.perplexity_avg > 0:
                result.perplexity_increase_pct = (
                    (result.perplexity_avg - self.baseline.perplexity_avg) / 
                    self.baseline.perplexity_avg * 100
                )
            if self.baseline.mmlu_accuracy > 0:
                result.mmlu_decrease_pct = (
                    (self.baseline.mmlu_accuracy - result.mmlu_accuracy) / 
                    self.baseline.mmlu_accuracy * 100
                )
        
        # Store result
        self.results.append(result)
        
        # Set as baseline if this is FP16
        if method_name.upper() == 'FP16' or bits == 16:
            self.baseline = result
        
        return result
    
    def save_results(self, filename: str = "benchmark_results.json"):
        """Save all results to JSON."""
        filepath = Path(self.config.results_dir) / filename
        
        data = {
            'config': {
                'model': self.config.model_name,
                'perplexity_samples': self.config.perplexity_samples,
                'mmlu_subjects': self.config.mmlu_subjects
            },
            'results': [r.to_dict() for r in self.results]
        }
        
        with open(filepath, 'w') as f:
            json.dump(data, f, indent=2)
        
        print(f"\nResults saved to {filepath}")
    
    def get_comparison_dataframe(self) -> pd.DataFrame:
        """Convert results to DataFrame for analysis."""
        return pd.DataFrame([r.to_dict() for r in self.results])


# Create benchmark suite
benchmark_suite = QualityBenchmarkSuite(config)
print("\nQuality Benchmark Suite ready!")
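
The degradation fields computed in `benchmark_model` are relative changes against the FP16 baseline, not percentage-point differences. A small worked example, using the illustrative numbers from the Professor SPARK boxes (the helper `relative_change_pct` is hypothetical):

```python
def relative_change_pct(baseline: float, value: float) -> float:
    """Percent change of value relative to baseline."""
    return (value - baseline) / baseline * 100.0

# Perplexity 10.0 -> 10.5 and MMLU accuracy 0.70 -> 0.68 (illustrative values).
ppl_increase = relative_change_pct(10.0, 10.5)
mmlu_decrease = -relative_change_pct(0.70, 0.68)
print(f"perplexity increase: {ppl_increase:.1f}%")
print(f"MMLU decrease: {mmlu_decrease:.2f}% relative (2.0 points absolute)")
# prints:
# perplexity increase: 5.0%
# MMLU decrease: 2.86% relative (2.0 points absolute)
```

Keep this distinction in mind when comparing `mmlu_decrease_pct` against fixed thresholds: a drop of 2 accuracy points from a 70% baseline is a relative decrease of about 2.9%.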

---

## Section 5: Run Benchmarks Across Quantization Methods

### 5.1 Benchmark FP16 Baseline

In [None]:
# Load FP16 baseline model
print("Loading FP16 baseline model...")

tokenizer = AutoTokenizer.from_pretrained(config.model_name)
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token

model_fp16 = AutoModelForCausalLM.from_pretrained(
    config.model_name,
    torch_dtype=torch.float16,
    device_map="auto"
)

# Run benchmark
result_fp16 = benchmark_suite.benchmark_model(
    model=model_fp16,
    tokenizer=tokenizer,
    method_name="FP16",
    bits=16,
    run_mmlu=True,
    notes="Baseline model in FP16 (half precision)"
)

# Clear memory
del model_fp16
clear_memory()

### 5.2 Benchmark INT8 Quantization

In [None]:
# Load INT8 model
print("Loading INT8 quantized model...")

bnb_config_int8 = BitsAndBytesConfig(
    load_in_8bit=True
)

model_int8 = AutoModelForCausalLM.from_pretrained(
    config.model_name,
    quantization_config=bnb_config_int8,
    device_map="auto"
)

# Run benchmark
result_int8 = benchmark_suite.benchmark_model(
    model=model_int8,
    tokenizer=tokenizer,
    method_name="INT8 (BnB)",
    bits=8,
    run_mmlu=True,
    notes="bitsandbytes 8-bit quantization"
)

del model_int8
clear_memory()

### 5.3 Benchmark INT4 (NF4) Quantization

In [None]:
# Load INT4/NF4 model
print("Loading INT4 (NF4) quantized model...")

bnb_config_nf4 = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.float16,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True
)

model_nf4 = AutoModelForCausalLM.from_pretrained(
    config.model_name,
    quantization_config=bnb_config_nf4,
    device_map="auto"
)

# Run benchmark
result_nf4 = benchmark_suite.benchmark_model(
    model=model_nf4,
    tokenizer=tokenizer,
    method_name="INT4 (NF4)",
    bits=4,
    run_mmlu=True,
    notes="bitsandbytes NF4 with double quantization"
)

del model_nf4
clear_memory()

### 5.4 Benchmark GPTQ (if available)

In [None]:
# Try to load GPTQ model
try:
    from auto_gptq import AutoGPTQForCausalLM
    
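    # Use a pre-quantized GPTQ model.
    # Note: this is a different base model (Llama-2-7B) than config.model_name,
    # so its degradation vs the FP16 baseline is not a like-for-like comparison.
    gptq_model_name = "TheBloke/Llama-2-7B-GPTQ"  # Example GPTQ model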
    
    print(f"Loading GPTQ model: {gptq_model_name}...")
    
    model_gptq = AutoGPTQForCausalLM.from_quantized(
        gptq_model_name,
        device_map="auto",
        use_safetensors=True
    )
    tokenizer_gptq = AutoTokenizer.from_pretrained(gptq_model_name)
    
    # Run benchmark
    result_gptq = benchmark_suite.benchmark_model(
        model=model_gptq,
        tokenizer=tokenizer_gptq,
        method_name="GPTQ",
        bits=4,
        run_mmlu=True,
        notes="Pre-quantized GPTQ model"
    )
    
    del model_gptq
    clear_memory()
    
except ImportError:
    print("GPTQ not available. Skipping GPTQ benchmark.")
    print("Install with: pip install auto-gptq")
except Exception as e:
    print(f"GPTQ benchmark failed: {e}")

### 5.5 Benchmark AWQ (if available)

In [None]:
# Try to load AWQ model
try:
    from awq import AutoAWQForCausalLM
    
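    # Use a pre-quantized AWQ model.
    # Note: this is a different base model (Llama-2-7B) than config.model_name,
    # so its degradation vs the FP16 baseline is not a like-for-like comparison.
    awq_model_name = "TheBloke/Llama-2-7B-AWQ"  # Example AWQ model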
    
    print(f"Loading AWQ model: {awq_model_name}...")
    
    model_awq = AutoAWQForCausalLM.from_quantized(
        awq_model_name,
        fuse_layers=True
    )
    tokenizer_awq = AutoTokenizer.from_pretrained(awq_model_name)
    
    # Run benchmark  
    result_awq = benchmark_suite.benchmark_model(
        model=model_awq,
        tokenizer=tokenizer_awq,
        method_name="AWQ",
        bits=4,
        run_mmlu=True,
        notes="Pre-quantized AWQ model with fused layers"
    )
    
    del model_awq
    clear_memory()
    
except ImportError:
    print("AWQ not available. Skipping AWQ benchmark.")
    print("Install with: pip install autoawq")
except Exception as e:
    print(f"AWQ benchmark failed: {e}")

### 5.6 Save Results

In [None]:
# Save all benchmark results
benchmark_suite.save_results("quality_benchmark_results.json")

# Get DataFrame for analysis
df_results = benchmark_suite.get_comparison_dataframe()
print("\nBenchmark Results Summary:")
print(df_results[['method_name', 'perplexity_avg', 'mmlu_accuracy', 
                  'tokens_per_second', 'memory_gb']].to_string(index=False))

---

## Section 6: Results Visualization and Analysis

### 6.1 Perplexity Comparison Chart

In [None]:
def plot_perplexity_comparison(df: pd.DataFrame):
    """
    Create perplexity comparison chart across methods.
    """
    fig, axes = plt.subplots(1, 2, figsize=(14, 5))
    
    # Bar chart of average perplexity
    ax1 = axes[0]
    methods = df['method_name']
    ppls = df['perplexity_avg']
    
    colors = plt.cm.RdYlGn_r(np.linspace(0.2, 0.8, len(methods)))
    bars = ax1.bar(methods, ppls, color=colors, edgecolor='black', linewidth=1.2)
    
    # Add value labels
    for bar, ppl in zip(bars, ppls):
        ax1.text(bar.get_x() + bar.get_width()/2, bar.get_height() + 0.1,
                 f'{ppl:.2f}', ha='center', va='bottom', fontweight='bold')
    
    ax1.set_xlabel('Quantization Method', fontsize=12)
    ax1.set_ylabel('Perplexity (lower is better)', fontsize=12)
    ax1.set_title('Average Perplexity by Method', fontsize=14, fontweight='bold')
    ax1.tick_params(axis='x', rotation=45)
    
    # Add baseline reference line
    if 'FP16' in methods.values:
        baseline_ppl = df[df['method_name'] == 'FP16']['perplexity_avg'].values[0]
        ax1.axhline(y=baseline_ppl, color='green', linestyle='--', 
                    label=f'FP16 Baseline: {baseline_ppl:.2f}')
        ax1.legend()
    
    # Perplexity by dataset
    ax2 = axes[1]
    x = np.arange(len(methods))
    width = 0.25
    
    ax2.bar(x - width, df['perplexity_wikitext'], width, label='WikiText', color='#3498db')
    ax2.bar(x, df['perplexity_c4'], width, label='C4', color='#e74c3c')
    ax2.bar(x + width, df['perplexity_code'], width, label='Code', color='#2ecc71')
    
    ax2.set_xlabel('Quantization Method', fontsize=12)
    ax2.set_ylabel('Perplexity', fontsize=12)
    ax2.set_title('Perplexity by Dataset', fontsize=14, fontweight='bold')
    ax2.set_xticks(x)
    ax2.set_xticklabels(methods, rotation=45, ha='right')
    ax2.legend()
    
    plt.tight_layout()
    plt.savefig(Path(config.results_dir) / 'perplexity_comparison.png', dpi=150, bbox_inches='tight')
    plt.show()


# Generate perplexity chart
if len(df_results) > 0:
    plot_perplexity_comparison(df_results)

### 6.2 MMLU Accuracy Comparison

In [None]:
def plot_mmlu_comparison(df: pd.DataFrame):
    """
    Create MMLU accuracy comparison chart.
    """
    fig, axes = plt.subplots(1, 2, figsize=(14, 5))
    
    # Overall MMLU accuracy
    ax1 = axes[0]
    methods = df['method_name']
    accuracies = df['mmlu_accuracy'] * 100  # Convert to percentage
    
    colors = plt.cm.RdYlGn(np.linspace(0.2, 0.8, len(methods)))
    bars = ax1.bar(methods, accuracies, color=colors, edgecolor='black', linewidth=1.2)
    
    # Add value labels
    for bar, acc in zip(bars, accuracies):
        ax1.text(bar.get_x() + bar.get_width()/2, bar.get_height() + 0.5,
                 f'{acc:.1f}%', ha='center', va='bottom', fontweight='bold')
    
    ax1.set_xlabel('Quantization Method', fontsize=12)
    ax1.set_ylabel('MMLU Accuracy (%)', fontsize=12)
    ax1.set_title('MMLU Accuracy by Method', fontsize=14, fontweight='bold')
    ax1.tick_params(axis='x', rotation=45)
    ax1.set_ylim(0, 100)
    
    # Add random baseline (25% for 4 choices)
    ax1.axhline(y=25, color='red', linestyle='--', label='Random Baseline (25%)')
    ax1.legend()
    
    # Accuracy degradation
    ax2 = axes[1]
    degradations = df['mmlu_decrease_pct']
    
    colors = ['#2ecc71' if d <= 2 else '#f1c40f' if d <= 5 else '#e74c3c' 
              for d in degradations]
    bars = ax2.bar(methods, degradations, color=colors, edgecolor='black', linewidth=1.2)
    
    for bar, deg in zip(bars, degradations):
        ax2.text(bar.get_x() + bar.get_width()/2, bar.get_height() + 0.1,
                 f'{deg:.1f}%', ha='center', va='bottom', fontweight='bold')
    
    ax2.set_xlabel('Quantization Method', fontsize=12)
    ax2.set_ylabel('Accuracy Degradation (%)', fontsize=12)
    ax2.set_title('MMLU Accuracy Degradation vs FP16', fontsize=14, fontweight='bold')
    ax2.tick_params(axis='x', rotation=45)
    
    # Add threshold lines
    ax2.axhline(y=2, color='green', linestyle='--', alpha=0.5, label='Good (<2%)')
    ax2.axhline(y=5, color='orange', linestyle='--', alpha=0.5, label='Acceptable (<5%)')
    ax2.legend()
    
    plt.tight_layout()
    plt.savefig(Path(config.results_dir) / 'mmlu_comparison.png', dpi=150, bbox_inches='tight')
    plt.show()


# Generate MMLU chart
if len(df_results) > 0 and df_results['mmlu_accuracy'].sum() > 0:
    plot_mmlu_comparison(df_results)

### 6.3 Quality vs Performance Trade-off

In [None]:
def plot_quality_performance_tradeoff(df: pd.DataFrame):
    """
    Create quality vs performance scatter plot.
    """
    fig, axes = plt.subplots(1, 2, figsize=(14, 6))
    
    # Perplexity vs Throughput
    ax1 = axes[0]
    scatter = ax1.scatter(
        df['tokens_per_second'], 
        df['perplexity_avg'],
        c=df['memory_gb'],
        s=200,
        cmap='coolwarm',
        edgecolors='black',
        linewidth=1.5
    )
    
    # Add labels
    for i, row in df.iterrows():
        ax1.annotate(
            row['method_name'], 
            (row['tokens_per_second'], row['perplexity_avg']),
            xytext=(5, 5), 
            textcoords='offset points',
            fontsize=10,
            fontweight='bold'
        )
    
    ax1.set_xlabel('Throughput (tokens/sec)', fontsize=12)
    ax1.set_ylabel('Perplexity (lower is better)', fontsize=12)
    ax1.set_title('Quality vs Speed Trade-off', fontsize=14, fontweight='bold')
    
    cbar = plt.colorbar(scatter, ax=ax1)
    cbar.set_label('Memory (GB)', fontsize=10)
    
    # Ideal zone (bottom-right corner: high throughput + low perplexity)
    ax1.annotate('Ideal Zone\n(High Speed + Low Perplexity)', 
                 xy=(0.75, 0.25), xycoords='axes fraction',
                 fontsize=10, style='italic', color='green')
    
    # MMLU vs Memory
    ax2 = axes[1]
    scatter2 = ax2.scatter(
        df['memory_gb'],
        df['mmlu_accuracy'] * 100,
        c=df['tokens_per_second'],
        s=200,
        cmap='viridis',
        edgecolors='black',
        linewidth=1.5
    )
    
    for i, row in df.iterrows():
        ax2.annotate(
            row['method_name'],
            (row['memory_gb'], row['mmlu_accuracy'] * 100),
            xytext=(5, 5),
            textcoords='offset points',
            fontsize=10,
            fontweight='bold'
        )
    
    ax2.set_xlabel('Memory Usage (GB)', fontsize=12)
    ax2.set_ylabel('MMLU Accuracy (%)', fontsize=12)
    ax2.set_title('Knowledge Retention vs Memory', fontsize=14, fontweight='bold')
    
    cbar2 = plt.colorbar(scatter2, ax=ax2)
    cbar2.set_label('Throughput (tok/s)', fontsize=10)
    
    plt.tight_layout()
    plt.savefig(Path(config.results_dir) / 'quality_performance_tradeoff.png', 
                dpi=150, bbox_inches='tight')
    plt.show()


# Generate trade-off chart
if len(df_results) > 0:
    plot_quality_performance_tradeoff(df_results)
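
The scatter plot shows the trade-off visually, but it can also help to list which methods are actually on the speed/quality frontier. The sketch below is one possible way to do that, not part of the lab's benchmark classes: `pareto_front` is a hypothetical helper, and the numbers in `toy_results` are made up for illustration. It reuses the column names `method_name`, `tokens_per_second`, and `perplexity_avg` from the results table.

```python
from typing import Any, Dict, List


def pareto_front(rows: List[Dict[str, Any]]) -> List[str]:
    """Return names of methods that no other method beats on both axes.

    A method is dominated when another one is at least as fast AND has
    perplexity at least as low, and is strictly better on one of the two.
    """
    front = []
    for a in rows:
        dominated = any(
            b["tokens_per_second"] >= a["tokens_per_second"]
            and b["perplexity_avg"] <= a["perplexity_avg"]
            and (
                b["tokens_per_second"] > a["tokens_per_second"]
                or b["perplexity_avg"] < a["perplexity_avg"]
            )
            for b in rows
        )
        if not dominated:
            front.append(a["method_name"])
    return front


# Illustrative numbers only (hypothetical, not measured results)
toy_results = [
    {"method_name": "A", "tokens_per_second": 40.0, "perplexity_avg": 8.0},
    {"method_name": "B", "tokens_per_second": 60.0, "perplexity_avg": 8.4},
    {"method_name": "C", "tokens_per_second": 55.0, "perplexity_avg": 8.9},
    {"method_name": "D", "tokens_per_second": 90.0, "perplexity_avg": 9.5},
]

print(pareto_front(toy_results))
```

Method C drops out because B is both faster and more accurate. With real measurements, something like `pareto_front(df_results.to_dict("records"))` should work, assuming the columns are populated for every method.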

### 6.4 Comprehensive Radar Chart

In [None]:
def plot_radar_comparison(df: pd.DataFrame):
    """
    Create radar chart comparing all aspects of quantization methods.
    """
    # Normalize metrics to 0-1 scale (higher is better)
    df_norm = df.copy()
    
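    # Invert perplexity (lower is better -> higher is better).
    # Scale by the best (lowest) perplexity so the best method scores 1.0;
    # 1 - ppl/ppl_max scores the worst method 0 and pushes closely spaced methods toward 0.
    ppl = df['perplexity_avg'].where(df['perplexity_avg'] > 0)  # 0 or missing -> NaN
    df_norm['quality'] = (ppl.min() / ppl).fillna(0)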
    
    # MMLU (already 0-1)
    df_norm['knowledge'] = df['mmlu_accuracy']
    
    # Speed (normalize to max)
    speed_max = df['tokens_per_second'].max()
    df_norm['speed'] = df['tokens_per_second'] / speed_max if speed_max > 0 else 0
    
    # Memory efficiency (invert memory, lower is better)
    mem_max = df['memory_gb'].max()
    df_norm['efficiency'] = 1 - (df['memory_gb'] / mem_max) if mem_max > 0 else 0
    
    # Compression (based on bits)
    df_norm['compression'] = 1 - (df['quantization_bits'].fillna(16) / 16)
    
    # Set up radar chart
    categories = ['Quality', 'Knowledge', 'Speed', 'Efficiency', 'Compression']
    N = len(categories)
    
    angles = [n / float(N) * 2 * np.pi for n in range(N)]
    angles += angles[:1]  # Close the polygon
    
    fig, ax = plt.subplots(figsize=(10, 10), subplot_kw=dict(polar=True))
    
    colors = plt.cm.Set2(np.linspace(0, 1, len(df_norm)))
    
    for idx, (i, row) in enumerate(df_norm.iterrows()):
        values = [
            row['quality'],
            row['knowledge'],
            row['speed'],
            row['efficiency'],
            row['compression']
        ]
        values += values[:1]
        
        ax.plot(angles, values, 'o-', linewidth=2, 
                label=row['method_name'], color=colors[idx])
        ax.fill(angles, values, alpha=0.1, color=colors[idx])
    
    ax.set_xticks(angles[:-1])
    ax.set_xticklabels(categories, fontsize=12, fontweight='bold')
    ax.set_ylim(0, 1)
    
    plt.title('Quantization Method Comparison\n(Higher is Better)', 
              size=14, fontweight='bold', y=1.08)
    plt.legend(loc='upper right', bbox_to_anchor=(1.3, 1.0))
    
    plt.tight_layout()
    plt.savefig(Path(config.results_dir) / 'radar_comparison.png', 
                dpi=150, bbox_inches='tight')
    plt.show()


# Generate radar chart
if len(df_results) > 0:
    plot_radar_comparison(df_results)

---

## Section 7: Benchmark Report Generation

### 7.1 Generate Comprehensive Report

In [None]:
def generate_benchmark_report(df: pd.DataFrame, config: BenchmarkConfig) -> str:
    """
    Generate a comprehensive markdown benchmark report.
    """
    report = f"""
# Quantization Quality Benchmark Report

**Generated:** {time.strftime("%Y-%m-%d %H:%M:%S")}
**Model:** {config.model_name}
**Platform:** DGX Spark (Blackwell GPU, 128GB Unified Memory)

---

## Executive Summary

This report compares {len(df)} quantization methods across quality and performance metrics.

### Key Findings:
"""
    
    # Find best methods
    best_quality = df.loc[df['perplexity_avg'].idxmin()]['method_name']
    best_speed = df.loc[df['tokens_per_second'].idxmax()]['method_name']
    best_memory = df.loc[df['memory_gb'].idxmin()]['method_name']
    
    report += f"""
- **Best Quality:** {best_quality} (lowest perplexity)
- **Best Speed:** {best_speed} (highest throughput)
- **Best Memory Efficiency:** {best_memory} (lowest memory usage)

---

## Detailed Results

### Perplexity Scores (Lower is Better)

| Method | WikiText | C4 | Code | Average | vs FP16 |
|--------|----------|----|------|---------|---------|
"""
    
    for _, row in df.iterrows():
        report += f"| {row['method_name']} | {row['perplexity_wikitext']:.2f} | "
        report += f"{row['perplexity_c4']:.2f} | {row['perplexity_code']:.2f} | "
        # ':+' prints the sign itself, so a perplexity improvement shows as '-x%' instead of '+-x%'
        report += f"{row['perplexity_avg']:.2f} | {row['perplexity_increase_pct']:+.1f}% |\n"
    
    report += f"""

### MMLU Accuracy (Higher is Better)

| Method | Overall Accuracy | vs FP16 |
|--------|-----------------|--------|
"""
    
    for _, row in df.iterrows():
        report += f"| {row['method_name']} | {row['mmlu_accuracy']*100:.1f}% | "
        # Negate the decrease so a drop prints as '-x%' and a gain as '+x%' (avoids '--x%')
        report += f"{-row['mmlu_decrease_pct']:+.1f}% |\n"
    
    report += f"""

### Performance Metrics

| Method | Throughput (tok/s) | Latency (ms) | Memory (GB) |
|--------|-------------------|--------------|-------------|
"""
    
    for _, row in df.iterrows():
        report += f"| {row['method_name']} | {row['tokens_per_second']:.1f} | "
        report += f"{row['inference_latency_ms']:.1f} | {row['memory_gb']:.1f} |\n"
    
    report += f"""

---

## Recommendations

### Use Case Recommendations:

1. **Production (Quality Critical):** FP16 or INT8 - minimal quality degradation
2. **Edge Deployment (Memory Critical):** INT4 (NF4) or AWQ - best compression
3. **High Throughput (Speed Critical):** GPTQ/AWQ with fused kernels
4. **Blackwell Optimized:** NVFP4 - native hardware support

### Quality Thresholds:

- **Excellent:** <2% perplexity increase, <1% MMLU drop
- **Good:** <5% perplexity increase, <2% MMLU drop  
- **Acceptable:** <10% perplexity increase, <5% MMLU drop
- **Poor:** >10% perplexity increase or >5% MMLU drop (investigate model)

---

*Report generated using DGX Spark Quality Benchmark Suite*
"""
    
    return report


# Generate and save report
if len(df_results) > 0:
    report = generate_benchmark_report(df_results, config)
    
    report_path = Path(config.results_dir) / 'benchmark_report.md'
    with open(report_path, 'w', encoding='utf-8') as f:
        f.write(report)
    
    print(f"Report saved to: {report_path}")
    print("\n" + "="*60)
    print(report)

---

## Section 8: Expected Results on DGX Spark

### 8.1 Reference Benchmark Results

In [None]:
# Expected results for Llama-3.2-3B on DGX Spark
expected_results = pd.DataFrame([
    {
        'Method': 'FP16',
        'Bits': 16,
        'WikiText PPL': 8.5,
        'MMLU (%)': 65.0,
        'Throughput': 45,
        'Memory (GB)': 6.0,
        'PPL Increase': '0%'
    },
    {
        'Method': 'INT8 (BnB)',
        'Bits': 8,
        'WikiText PPL': 8.7,
        'MMLU (%)': 64.5,
        'Throughput': 55,
        'Memory (GB)': 3.2,
        'PPL Increase': '+2.4%'
    },
    {
        'Method': 'INT4 (NF4)',
        'Bits': 4,
        'WikiText PPL': 9.2,
        'MMLU (%)': 63.0,
        'Throughput': 70,
        'Memory (GB)': 1.8,
        'PPL Increase': '+8.2%'
    },
    {
        'Method': 'GPTQ-4bit',
        'Bits': 4,
        'WikiText PPL': 8.9,
        'MMLU (%)': 63.8,
        'Throughput': 85,
        'Memory (GB)': 1.9,
        'PPL Increase': '+4.7%'
    },
    {
        'Method': 'AWQ-4bit',
        'Bits': 4,
        'WikiText PPL': 8.8,
        'MMLU (%)': 64.0,
        'Throughput': 90,
        'Memory (GB)': 1.9,
        'PPL Increase': '+3.5%'
    },
    {
        'Method': 'FP8 (E4M3)',
        'Bits': 8,
        'WikiText PPL': 8.6,
        'MMLU (%)': 64.8,
        'Throughput': 95,
        'Memory (GB)': 3.0,
        'PPL Increase': '+1.2%'
    },
    {
        'Method': 'NVFP4 (Blackwell)',
        'Bits': 4,
        'WikiText PPL': 8.7,
        'MMLU (%)': 64.2,
        'Throughput': 120,
        'Memory (GB)': 1.6,
        'PPL Increase': '+2.4%'
    },
])

print("Expected Results on DGX Spark (Llama-3.2-3B):")
print("="*80)
print(expected_results.to_string(index=False))
print("\n" + "="*80)
print("\n Key Observations:")
print(" - FP8 and NVFP4 offer the best quality/performance balance on Blackwell")
print(" - AWQ gives slightly better quality than GPTQ at the same bit width")
print(" - INT4 (NF4) shows the largest quality degradation")
print(" - NVFP4 reaches about 2.7x the FP16 throughput (120 vs 45 tok/s) with only a 2.4% perplexity increase")
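
The `PPL Increase` column and the speedup figure in the observations are both ratios against the FP16 row. As a quick sanity check, the arithmetic can be reproduced with a small helper. `pct_change` is a hypothetical name introduced here, and the inputs are copied from the reference table above.

```python
def pct_change(value: float, baseline: float) -> float:
    """Relative change of value against baseline, in percent."""
    return (value - baseline) / baseline * 100.0


# Values copied from the reference table above
fp16 = {"ppl": 8.5, "throughput": 45}
nvfp4 = {"ppl": 8.7, "throughput": 120}
int4_nf4 = {"ppl": 9.2, "throughput": 70}

print(f"NVFP4 PPL increase: {pct_change(nvfp4['ppl'], fp16['ppl']):+.1f}%")
print(f"INT4 (NF4) PPL increase: {pct_change(int4_nf4['ppl'], fp16['ppl']):+.1f}%")
print(f"NVFP4 speedup: {nvfp4['throughput'] / fp16['throughput']:.1f}x")
```

The same helper can be pointed at your own measurements to check that the percentage columns in the generated report are computed against the baseline you expect.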

---

## Summary and Key Takeaways

### What We Learned

```
Professor SPARK's Summary:

"Quality benchmarking answers the question: 'Is my quantized model still useful?'

Key metrics to track:
1. Perplexity - Language modeling quality (lower = better)
2. MMLU - Knowledge retention (higher = better)
3. Throughput - Speed (higher = better)
4. Memory - Efficiency (lower = better)

The best quantization method depends on YOUR use case:
- Chatbot? Prioritize MMLU (knowledge matters)
- Text generation? Prioritize perplexity (fluency matters)
- Edge deployment? Prioritize memory efficiency
- Real-time? Prioritize throughput

On DGX Spark Blackwell:
- Use NVFP4 for best native performance
- Use FP8 for training with minimal quality loss
- Use AWQ/GPTQ for maximum compression with good quality"
```

### Quality Thresholds for Production

| Metric | Excellent | Good | Acceptable | Poor |
|--------|-----------|------|------------|------|
| Perplexity Increase | <2% | <5% | <10% | >10% |
| MMLU Drop | <1% | <2% | <5% | >5% |
| Memory Savings | >70% | >50% | >30% | <30% |
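
The perplexity and MMLU rows of the table above can be turned into a simple rating function. This is a minimal sketch under one assumption that the table does not state: a method must satisfy both limits of a tier, so the worse of the two metrics decides the rating. `TIERS` and `quality_tier` are hypothetical names, and the example inputs are invented.

```python
# (label, max perplexity increase %, max MMLU drop %) from the table above
TIERS = [
    ("Excellent", 2.0, 1.0),
    ("Good", 5.0, 2.0),
    ("Acceptable", 10.0, 5.0),
]


def quality_tier(ppl_increase_pct: float, mmlu_drop_pct: float) -> str:
    """Return the first tier whose limits both metrics satisfy.

    Assumption: a method must meet BOTH limits of a tier, so the worse
    of the two metrics decides the rating.
    """
    for label, max_ppl, max_mmlu in TIERS:
        if ppl_increase_pct < max_ppl and mmlu_drop_pct < max_mmlu:
            return label
    return "Poor"


# Hypothetical inputs: (perplexity increase %, MMLU drop %)
print(quality_tier(1.2, 0.3))
print(quality_tier(3.5, 1.5))
print(quality_tier(8.2, 3.1))
print(quality_tier(12.0, 0.5))
```

Applied to the results table, this could look like `df_results.apply(lambda r: quality_tier(r['perplexity_increase_pct'], r['mmlu_decrease_pct']), axis=1)`.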

### Next Steps

1. **Lab 3.2.8:** TensorRT-LLM Engine - Production deployment optimization
2. Apply benchmarks to YOUR models and use cases
3. Create custom evaluation datasets for domain-specific applications

---

## Exercises

### Exercise 1: Custom Perplexity Dataset

Create a perplexity evaluation using a domain-specific dataset relevant to your use case.

In [None]:
# Exercise 1: Add your domain-specific texts
domain_texts = [
    # TODO: Add 10-20 representative texts from your domain
    "Your domain text 1...",
    "Your domain text 2...",
]

# Calculate perplexity on domain data
# ppl_benchmark.calculate_perplexity(model, tokenizer, domain_texts, "Domain PPL")

### Exercise 2: Custom MMLU Subject

Create a custom knowledge evaluation for a specific subject not in MMLU.

In [None]:
# Exercise 2: Custom knowledge questions
custom_questions = [
    {
        'question': 'Your question here?',
        'choices': ['A', 'B', 'C', 'D'],
        'answer': 0  # Index of correct answer
    },
    # Add more questions...
]

# Evaluate model on custom questions
# correct = sum(mmlu_benchmark.evaluate_question(model, tokenizer, q) for q in custom_questions)
# print(f"Custom accuracy: {correct/len(custom_questions):.1%}")
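
A custom question set is usually small, so the accuracy printed above carries a lot of sampling noise. One way to make that visible is to report a confidence interval next to the point estimate. The sketch below uses the Wilson score interval. `wilson_interval` is a hypothetical helper, and the 14-of-20 outcome is invented for illustration.

```python
import math
from typing import Tuple


def wilson_interval(correct: int, total: int, z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for a proportion (z = 1.96 gives roughly 95%)."""
    if total <= 0:
        raise ValueError("total must be positive")
    p = correct / total
    denom = 1 + z**2 / total
    center = (p + z**2 / (2 * total)) / denom
    half = z * math.sqrt(p * (1 - p) / total + z**2 / (4 * total**2)) / denom
    return center - half, center + half


# Hypothetical outcome: 14 of 20 custom questions answered correctly
low, high = wilson_interval(14, 20)
print(f"Accuracy: {14 / 20:.1%} (95% CI: {low:.1%} to {high:.1%})")
```

If the intervals of two quantization methods overlap heavily, the question set is probably too small to separate them.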

### Exercise 3: Statistical Significance

Determine if differences between quantization methods are statistically significant.

In [None]:
# Exercise 3: Statistical significance testing
from scipy import stats

# TODO: Compare perplexity distributions between methods
# Use a t-test or the Mann-Whitney U test to determine significance.
# If both methods are scored on the same texts, the samples are paired:
# prefer stats.ttest_rel (or stats.wilcoxon) over the independent-sample tests.

# Example (independent samples):
# ppl_fp16 = [...perplexity values from FP16...]
# ppl_int4 = [...perplexity values from INT4...]
# stat, p_value = stats.ttest_ind(ppl_fp16, ppl_int4)
# print(f"p-value: {p_value:.4f}")
# print(f"Significant difference: {p_value < 0.05}")
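
When both methods are evaluated on the same texts, each text yields a pair of perplexities, and a paired test is usually more sensitive than an independent-sample one. The sketch below is an alternative to the t-test named in the exercise, not a replacement for it: a sign-flip permutation test written with the standard library only. `paired_permutation_test` is a hypothetical helper, and the per-text perplexities are invented.

```python
import random
from typing import Sequence


def paired_permutation_test(
    a: Sequence[float], b: Sequence[float], n_resamples: int = 10_000, seed: int = 0
) -> float:
    """Two-sided p-value for mean(a - b) == 0 via random sign flips.

    Under the null hypothesis the two methods are exchangeable on each text,
    so the sign of every paired difference is equally likely to be + or -.
    """
    if len(a) != len(b) or len(a) == 0:
        raise ValueError("a and b must score the same non-empty list of texts")
    diffs = [x - y for x, y in zip(a, b)]
    observed = abs(sum(diffs) / len(diffs))
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_resamples):
        flipped = [d if rng.random() < 0.5 else -d for d in diffs]
        if abs(sum(flipped) / len(flipped)) >= observed:
            hits += 1
    # +1 correction keeps the estimate away from an impossible p = 0
    return (hits + 1) / (n_resamples + 1)


# Hypothetical per-text perplexities for the same eight texts (not measured)
ppl_fp16 = [8.1, 9.4, 7.8, 10.2, 8.9, 9.0, 8.3, 9.7]
ppl_int4 = [8.6, 9.9, 8.1, 10.9, 9.5, 9.4, 8.8, 10.3]

p_value = paired_permutation_test(ppl_int4, ppl_fp16)
print(f"p-value: {p_value:.4f}")
print(f"Significant at 0.05: {p_value < 0.05}")
```

Because INT4 is worse on every one of the eight invented texts, the test reports a small p-value even though the per-text gaps are modest. An unpaired comparison of the same numbers would be dominated by the text-to-text spread.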