# Lab 4.2.1: LLM Benchmark Suite

**Module:** 4.2 - Benchmarking, Evaluation & MLOps  
**Time:** 3 hours  
**Difficulty:** ⭐⭐⭐

---

## 🎯 Learning Objectives

By the end of this notebook, you will:
- [ ] Understand what LLM benchmarks measure and why they matter
- [ ] Run the LM Evaluation Harness on multiple models
- [ ] Interpret benchmark results correctly
- [ ] Compare model performance across different tasks
- [ ] Know the limitations of standard benchmarks

---

## 📚 Prerequisites

- Completed: Module 3.1 -14 (LLM fundamentals)
- Knowledge of: Model loading, inference basics
- Hardware: DGX Spark with 128GB unified memory

---

## 🌍 Real-World Context

**Imagine you're a chef opening a new restaurant.** Before the grand opening, you'd want to:
- Taste-test every dish (does it taste good?)
- Check presentation (does it look appetizing?)
- Time the kitchen (can we serve quickly?)
- Get outside opinions (what do critics say?)

**LLM benchmarks are the same idea for AI models!** They provide standardized tests that let you:
- Compare your model against others on a level playing field
- Identify strengths and weaknesses
- Make informed decisions about deployment
- Track improvements over time

**Companies like OpenAI, Anthropic, Meta, and Google all use these benchmarks** to demonstrate their models' capabilities. When you see "GPT-4 scores 86.4% on MMLU," that's a benchmark result!

---

## 🧒 ELI5: What Are LLM Benchmarks?

> **Imagine you're in school, and there's a big test coming up.** But it's not just any test: it's one that EVERY student in EVERY school takes, with the exact same questions.
>
> When you get your score, you can compare yourself to:
> - Your classmates
> - Students from other schools
> - Students from other countries!
>
> **LLM benchmarks are like standardized tests for AI.** They ask the same questions to every AI model, so we can fairly compare them.
>
> Different benchmarks test different "subjects":
> - **MMLU** = A test covering 57 subjects (like SATs for AI)
> - **HellaSwag** = A test for common sense ("What happens next?")
> - **HumanEval** = A coding test ("Can you write this program?")
> - **MT-Bench** = A conversation test ("Can you chat naturally?")
>
> **In AI terms:** Benchmarks are standardized evaluation datasets with known correct answers, allowing reproducible comparison across models.

---

## Part 1: Understanding the Major Benchmarks

Before we run any code, let's understand what each benchmark actually measures:

### The Benchmark Landscape

| Benchmark | What It Tests | # Questions (approx.) | Example Task |
|-----------|--------------|-------------|---------------|
| **MMLU** | World knowledge | 14,042 | Multiple choice across 57 subjects |
| **HellaSwag** | Common sense | 10,042 | Predict sentence completion |
| **ARC** | Science reasoning | 7,787 | Grade-school science questions |
| **WinoGrande** | Pronoun resolution | 1,767 | "The trophy doesn't fit in the suitcase because it's too [big/small]" |
| **TruthfulQA** | Factual accuracy | 817 | Avoid common misconceptions |
| **GSM8K** | Math reasoning | 8,500 | Grade-school math word problems |
| **HumanEval** | Code generation | 164 | Write Python functions |
| **MT-Bench** | Multi-turn chat | 80 | Conversational quality |

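When reading scores for the multiple-choice benchmarks in this table, it helps to know what random guessing would score. The following is a minimal sketch, assuming four answer options for MMLU and HellaSwag and two for WinoGrande; the option counts and the helper names `chance_accuracy` and `lift_over_chance` are illustrative and not something lm-eval reports.

```python
# Chance-level accuracy for multiple-choice benchmarks.
# The option counts below are assumptions made for illustration.
NUM_OPTIONS = {"mmlu": 4, "hellaswag": 4, "winogrande": 2}


def chance_accuracy(num_options: int) -> float:
    """Expected accuracy of uniform random guessing."""
    return 1.0 / num_options


def lift_over_chance(score: float, num_options: int) -> float:
    """Rescale a score so that 0.0 is chance level and 1.0 is perfect."""
    chance = chance_accuracy(num_options)
    return (score - chance) / (1.0 - chance)


for task, k in NUM_OPTIONS.items():
    print(f"{task:<12} chance = {chance_accuracy(k):.0%}")

# A hypothetical 60% on a two-option task, rescaled relative to chance.
print(f"{lift_over_chance(0.60, NUM_OPTIONS['winogrande']):.2f}")
```

A score close to the chance level says little about the model, however high the raw percentage looks.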
### 🎯 The "Open LLM Leaderboard" Suite

The original (v1) Hugging Face Open LLM Leaderboard used a standard set of six benchmarks:
1. **ARC** (AI2 Reasoning Challenge) - 25-shot
2. **HellaSwag** - 10-shot
3. **MMLU** - 5-shot
4. **TruthfulQA** - 0-shot
5. **Winogrande** - 5-shot
6. **GSM8K** - 5-shot (chain-of-thought)

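Because each benchmark in the list above uses its own few-shot count, one way to reproduce the setup is a separate `lm_eval` invocation per task. The sketch below only builds and prints the commands. The task identifiers (`arc_challenge`, `truthfulqa_mc2`, and so on) are assumed to match the names your installed lm-eval version lists, and `build_commands` and the output directory are hypothetical names chosen for this example.

```python
import shlex

# Few-shot settings from the list above. Task identifiers are assumptions;
# confirm them against the task list of your lm-eval installation.
LEADERBOARD_FEWSHOT = {
    "arc_challenge": 25,
    "hellaswag": 10,
    "mmlu": 5,
    "truthfulqa_mc2": 0,
    "winogrande": 5,
    "gsm8k": 5,
}


def build_commands(model_name: str, output_dir: str) -> list:
    """Return one lm_eval command (as an argument list) per benchmark."""
    commands = []
    for task, shots in LEADERBOARD_FEWSHOT.items():
        commands.append([
            "lm_eval", "--model", "hf",
            "--model_args", f"pretrained={model_name},dtype=bfloat16",
            "--tasks", task,
            "--num_fewshot", str(shots),
            "--output_path", f"{output_dir}/{task}",
        ])
    return commands


for cmd in build_commands("microsoft/phi-2", "./results/phi2_leaderboard"):
    print(shlex.join(cmd))
```

Each argument list can be passed to `subprocess.run` as-is once you are ready to spend the compute.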
### 🔍 What Does "N-shot" Mean?

> **ELI5:** Imagine you're taking a test, but before each question, the teacher shows you some example questions WITH their answers.
> 
> - **0-shot** = No examples, just the question
> - **5-shot** = 5 example Q&A pairs before your question
> - **25-shot** = 25 examples!
>
> More examples usually = better performance, but uses more memory and time.

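To make the idea concrete, here is a minimal sketch of how a k-shot prompt could be assembled. The `Question:`/`Answer:` template and the function name `build_fewshot_prompt` are assumptions for illustration; each lm-eval task defines its own prompt format.

```python
def build_fewshot_prompt(examples, question, k):
    """Prepend the first k solved examples to the question being tested."""
    shots = [f"Question: {q}\nAnswer: {a}" for q, a in examples[:k]]
    # The final block leaves the answer empty for the model to complete.
    shots.append(f"Question: {question}\nAnswer:")
    return "\n\n".join(shots)


# Made-up examples, for illustration only.
examples = [
    ("What is 2 + 2?", "4"),
    ("What color is the sky on a clear day?", "Blue"),
    ("How many days are in a week?", "7"),
]

print(build_fewshot_prompt(examples, "What is 3 + 5?", k=0))  # 0-shot
print("-" * 40)
print(build_fewshot_prompt(examples, "What is 3 + 5?", k=2))  # 2-shot
```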
---

## Part 2: Setting Up the Environment

Let's install and configure the LM Evaluation Harness, the gold-standard tool for benchmarking LLMs.

In [None]:
# First, let's check our environment
import torch
import sys

print(f"Python version: {sys.version}")
print(f"PyTorch version: {torch.__version__}")
print(f"CUDA available: {torch.cuda.is_available()}")

if torch.cuda.is_available():
    print(f"GPU: {torch.cuda.get_device_name(0)}")
    print(f"GPU Memory: {torch.cuda.get_device_properties(0).total_memory / 1e9:.1f} GB")
else:
    print("WARNING: No GPU detected! Benchmarks will be very slow.")

In [None]:
# Install lm-evaluation-harness
# Note: On DGX Spark (ARM64), the NGC container should have PyTorch pre-installed
# We install lm-eval which is mostly pure Python

import subprocess
import sys
from importlib.metadata import version

try:
    import lm_eval
    print("lm-eval already installed")
except ImportError:
    print("Installing lm-eval...")
    # Install lm-eval - works on ARM64 as it's mostly pure Python
    subprocess.check_call([sys.executable, "-m", "pip", "install", "lm-eval", "-q"])
    import lm_eval
    print("lm-eval installed")

# Read the version from package metadata, since lm_eval.__version__
# is not defined in every release
print(f"lm-eval version: {version('lm_eval')}")

In [None]:
# Verify installation and check available tasks
!lm_eval --tasks list | head -50

### 🔍 What Just Happened?

We installed `lm-eval`, which provides:
- A standardized framework for running evaluations
- Pre-configured benchmark tasks (MMLU, HellaSwag, etc.)
- Support for various model backends (HuggingFace, vLLM, OpenAI API)
- Reproducible evaluation with consistent prompting

---

### ⚠️ Docker Configuration Note (DGX Spark)

If running in a Docker container, ensure you started it with:

```bash
docker run --gpus all -it --rm \
    -v $HOME/workspace:/workspace \
    -v $HOME/.cache/huggingface:/root/.cache/huggingface \
    --ipc=host \
    nvcr.io/nvidia/pytorch:25.11-py3
```

| Flag | Purpose |
|------|---------|
| `--gpus all` | Enable GPU access |
| `-it` | Interactive terminal |
| `--rm` | Remove container on exit |
| `-v workspace` | Mount workspace directory |
| `-v hf_cache` | Mount HuggingFace cache |
| `--ipc=host` | **Required** for PyTorch DataLoader with multiple workers |

Without `--ipc=host`, you may see errors like:
> "unable to open shared memory object"

This is especially important for lm-eval runs that load data with multiple worker processes.

---

## Part 3: Running Your First Benchmark

Let's start with a small model and a quick benchmark to understand the workflow.

In [None]:
# Clear GPU memory before loading models
import gc
import os
import subprocess
import torch

def clear_memory(clear_cache: bool = False):
    """
    Clear GPU memory to ensure clean state.
    
    Args:
        clear_cache: If True, also clear system buffer cache (requires sudo).
                    Recommended before loading large models (>10GB).
    """
    gc.collect()
    
    if torch.cuda.is_available():
        torch.cuda.empty_cache()
        torch.cuda.synchronize()
    
    # Clear system buffer cache for large model loading on DGX Spark
    if clear_cache:
        try:
            subprocess.run(
                ['sudo', 'sh', '-c', 'sync; echo 3 > /proc/sys/vm/drop_caches'],
                check=True, capture_output=True, timeout=10
            )
            print("System buffer cache cleared")
        except (subprocess.CalledProcessError, subprocess.TimeoutExpired, FileNotFoundError):
            print("Note: Could not clear buffer cache (requires sudo)")
    
    if torch.cuda.is_available():
        allocated = torch.cuda.memory_allocated() / 1e9
        print(f"GPU Memory: {allocated:.2f} GB allocated")

clear_memory()

In [None]:
import os
import json
from pathlib import Path

# Create output directory for results
# Use absolute path for reliability
NOTEBOOK_DIR = Path(os.getcwd())
RESULTS_DIR = str((NOTEBOOK_DIR / "../data/benchmark_results").resolve())
os.makedirs(RESULTS_DIR, exist_ok=True)

print(f"Results will be saved to: {RESULTS_DIR}")

### Running a Quick Benchmark with lm-eval

We'll start with a small, fast benchmark to understand the process.

In [None]:
# Run a quick benchmark on a small model
# Using HellaSwag with a small subset for speed

# For DGX Spark, we can use larger models!
# Start with a smaller one to learn the process

import subprocess

# Build command with proper path interpolation
# Note: Using subprocess.run() instead of ! shell command for reliable path handling
cmd = [
    "lm_eval",
    "--model", "hf",
    "--model_args", "pretrained=microsoft/phi-2,dtype=bfloat16",
    "--tasks", "hellaswag",
    "--num_fewshot", "0",
    "--batch_size", "8",
    "--limit", "100",
    "--output_path", f"{RESULTS_DIR}/phi2_quick_test"
]

print(f"Running command: {' '.join(cmd)}")
result = subprocess.run(cmd, capture_output=False, text=True)

if result.returncode != 0:
    print("⚠️ Benchmark may have encountered issues. Check output above.")

### 🔍 Understanding the Command

Let's break down what each argument does:

| Argument | Purpose |
|----------|----------|
| `--model hf` | Use HuggingFace backend |
| `--model_args pretrained=...` | Specify the model to evaluate |
| `dtype=bfloat16` | Use bfloat16 for memory efficiency (native on Blackwell!) |
| `--tasks hellaswag` | Which benchmark(s) to run |
| `--num_fewshot 0` | How many examples to show (0-shot) |
| `--batch_size 8` | Process 8 examples at once |
| `--limit 100` | Only run 100 samples (for testing) |
| `--output_path` | Where to save results |

In [None]:
# Load and display the results
import glob

# The results file name depends on the lm-eval version (results.json or
# results_<timestamp>.json, sometimes inside a per-model subdirectory),
# so search recursively and load the newest match.
result_files = glob.glob(f"{RESULTS_DIR}/phi2_quick_test/**/results*.json", recursive=True)
if result_files:
    with open(max(result_files, key=os.path.getmtime), 'r') as f:
        results = json.load(f)
    
    print("\n" + "="*50)
    print("📊 BENCHMARK RESULTS")
    print("="*50)
    
    # Display results nicely
    for task_name, task_results in results.get('results', {}).items():
        print(f"\n📝 Task: {task_name}")
        for metric, value in task_results.items():
            if isinstance(value, float):
                print(f"   {metric}: {value:.4f}")
            else:
                print(f"   {metric}: {value}")
else:
    print("No results found. Check the benchmark output above for errors.")

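When you read individual metrics out of a results file, note that recent lm-eval releases appear to store them under keys that combine the metric and a filter name (for example `acc_norm,none`), while older releases use the bare metric name. The helper below is a small sketch that tries both forms. `get_metric` and the sample dictionary are hypothetical and only mimic the shape of one task's entry.

```python
def get_metric(task_results: dict, metric: str, filter_name: str = "none"):
    """Return a metric value, trying '<metric>,<filter>' first, then the bare name."""
    for key in (f"{metric},{filter_name}", metric):
        if key in task_results:
            return task_results[key]
    return None  # metric not reported for this task


# Hypothetical entry shaped like one task's block in a results file.
sample = {"alias": "hellaswag", "acc,none": 0.41, "acc_norm,none": 0.53}

print(get_metric(sample, "acc_norm"))
print(get_metric(sample, "f1"))
```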
---

## Part 4: Comprehensive Model Evaluation

Now let's run a full benchmark suite on a model. With DGX Spark's 128GB memory, we can evaluate larger models locally!

### 🧒 ELI5: Why Multiple Benchmarks?

> **Imagine testing a new car.** You wouldn't just check the speed; you'd also check:
> - Fuel efficiency (how far can it go?)
> - Safety (what happens in a crash?)
> - Comfort (is the ride smooth?)
> - Reliability (will it break down?)
>
> **One benchmark = one aspect.** A model might ace MMLU (knowledge) but fail at GSM8K (math). Running multiple benchmarks gives you the full picture.

In [None]:
# Define the benchmark suite we'll use
# This is similar to the Open LLM Leaderboard evaluation

BENCHMARK_SUITE = {
    "arc_easy": {
        "description": "Easy science reasoning questions",
        "num_fewshot": 0,
        "metric": "acc_norm"
    },
    "hellaswag": {
        "description": "Common sense sentence completion",
        "num_fewshot": 0, 
        "metric": "acc_norm"
    },
    "truthfulqa_mc2": {
        "description": "Avoiding false beliefs and misconceptions",
        "num_fewshot": 0,
        "metric": "acc"
    },
    "winogrande": {
        "description": "Pronoun resolution/coreference",
        "num_fewshot": 0,
        "metric": "acc"
    }
}

print("📋 Benchmark Suite Overview:")
print("-" * 60)
for name, config in BENCHMARK_SUITE.items():
    print(f"  {name}: {config['description']}")

In [None]:
# Helper function to run benchmarks programmatically
import glob
import json
import os
import subprocess
import time
from typing import Optional

def run_benchmark(model_name: str, tasks: list, output_name: str, 
                  batch_size: int = 8, limit: Optional[int] = None,
                  dtype: str = "bfloat16") -> Optional[dict]:
    """
    Run lm-eval benchmark on a model.
    
    Args:
        model_name: HuggingFace model path
        tasks: List of benchmark tasks
        output_name: Name for output directory
        batch_size: Batch size for evaluation
        limit: Optional limit on number of samples
        dtype: Data type (bfloat16 recommended for DGX Spark)
    
    Returns:
        Dictionary of results
    """
    clear_memory()
    
    output_path = f"{RESULTS_DIR}/{output_name}"
    tasks_str = ",".join(tasks)
    
    cmd = [
        "lm_eval",
        "--model", "hf",
        "--model_args", f"pretrained={model_name},dtype={dtype}",
        "--tasks", tasks_str,
        "--batch_size", str(batch_size),
        "--output_path", output_path
    ]
    
    if limit is not None:
        cmd.extend(["--limit", str(limit)])
    
    print(f"\n🚀 Starting evaluation of {model_name}")
    print(f"   Tasks: {tasks_str}")
    print(f"   Limit: {limit if limit is not None else 'Full evaluation'}")
    
    start_time = time.time()
    result = subprocess.run(cmd, capture_output=True, text=True)
    elapsed = time.time() - start_time
    
    print(f"\n⏱️  Completed in {elapsed/60:.1f} minutes")
    
    if result.returncode != 0:
        print(f"\n❌ Error: {result.stderr}")
        return None
    
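    # Load and return the newest results file. The file name depends on the
    # lm-eval version (results.json or results_<timestamp>.json), so match both.
    result_files = glob.glob(f"{output_path}/**/results*.json", recursive=True)
    if result_files:
        with open(max(result_files, key=os.path.getmtime), 'r') as f:
            return json.load(f)
    return None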

In [None]:
# Run benchmarks on a model
# Using phi-2 as an example (2.7B params - fast to evaluate)

tasks = list(BENCHMARK_SUITE.keys())

# For learning, use limit=100. Remove for full evaluation.
phi2_results = run_benchmark(
    model_name="microsoft/phi-2",
    tasks=tasks,
    output_name="phi2_benchmark",
    batch_size=8,
    limit=100  # Remove this line for full benchmark
)

In [None]:
# Display results in a nice table
def display_results(results: dict, model_name: str) -> None:
    """
    Display benchmark results in a formatted table.
    
    Args:
        results: Dictionary of benchmark results from lm-eval
        model_name: Name of the model being evaluated
    """
    if not results:
        print("No results to display.")
        return
    
    print(f"\n{'='*60}")
    print(f"📊 Results for {model_name}")
    print(f"{'='*60}")
    
    task_results = results.get('results', {})
    
    print(f"\n{'Task':<20} {'Metric':<15} {'Score':<10}")
    print("-" * 45)
    
    total_score = 0
    num_tasks = 0
    
    for task_name, metrics in task_results.items():
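        # Find the main metric. Recent lm-eval releases key metrics as
        # "<metric>,<filter>" (e.g. "acc_norm,none"); older ones use the bare name.
        main_metric = BENCHMARK_SUITE.get(task_name, {}).get('metric', 'acc')
        candidate_keys = (f"{main_metric},none", main_metric, "acc,none", "acc")
        score = next((metrics[k] for k in candidate_keys if k in metrics), None)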
        
        if isinstance(score, (int, float)):
            print(f"{task_name:<20} {main_metric:<15} {score*100:>6.2f}%")
            total_score += score
            num_tasks += 1
    
    if num_tasks > 0:
        avg_score = (total_score / num_tasks) * 100
        print("-" * 45)
        print(f"{'AVERAGE':<20} {'':<15} {avg_score:>6.2f}%")

if phi2_results:
    display_results(phi2_results, "microsoft/phi-2")

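With `limit=100`, each score is estimated from only 100 questions, so it carries a sizeable sampling error (roughly ±10 percentage points for a score near 60%). The sketch below uses the textbook binomial standard error with a normal approximation to show how the uncertainty shrinks as the sample grows. It is a simplified illustration with a hypothetical score, and it does not replace the `stderr` values lm-eval reports.

```python
import math


def accuracy_interval(acc: float, n: int, z: float = 1.96):
    """Normal-approximation confidence interval for an accuracy measured on n samples."""
    stderr = math.sqrt(acc * (1.0 - acc) / n)
    return max(0.0, acc - z * stderr), min(1.0, acc + z * stderr)


# Hypothetical score of 60% measured at two sample sizes.
for n in (100, 10_000):
    low, high = accuracy_interval(0.60, n)
    print(f"n={n:>6}: 60.0% (95% CI {low:.1%} to {high:.1%})")
```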
---

## Part 5: Comparing Multiple Models

The real power of benchmarks comes from comparison. Let's evaluate multiple models and compare them.

In [None]:
# Models to compare
# Adjust based on your memory constraints and time

MODELS_TO_COMPARE = [
    {
        "name": "microsoft/phi-2",
        "size": "2.7B",
        "description": "Microsoft's compact powerhouse"
    },
    {
        "name": "TinyLlama/TinyLlama-1.1B-Chat-v1.0",
        "size": "1.1B", 
        "description": "Compact chat model"
    },
    # Uncomment for larger models (DGX Spark can handle these!)
    # {
    #     "name": "Qwen/Qwen3-4B",
    #     "size": "4B",
    #     "description": "Compact model from the Qwen3 family"
    # },
]

print("📋 Models to Evaluate:")
for m in MODELS_TO_COMPARE:
    print(f"  • {m['name']} ({m['size']}) - {m['description']}")

In [None]:
# Run benchmarks on all models
# This will take some time - perfect for a coffee break!

all_results = {}

for model_info in MODELS_TO_COMPARE:
    model_name = model_info['name']
    safe_name = model_name.replace('/', '_').replace('-', '_')
    
    print(f"\n{'='*60}")
    print(f"Evaluating: {model_name} ({model_info['size']})")
    print(f"{'='*60}")
    
    results = run_benchmark(
        model_name=model_name,
        tasks=tasks,
        output_name=f"{safe_name}_benchmark",
        batch_size=8,
        limit=100  # Remove for full evaluation
    )
    
    if results:
        all_results[model_name] = results
        display_results(results, model_name)
    
    # Clear memory between models
    clear_memory()

In [None]:
# Create a comparison table
import pandas as pd

def create_comparison_table(results_dict: dict) -> pd.DataFrame:
    """Create a comparison table from multiple model results."""
    data = []
    
    for model_name, results in results_dict.items():
        row = {'Model': model_name.split('/')[-1]}
        
        task_results = results.get('results', {})
        
        for task_name, metrics in task_results.items():
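            # Recent lm-eval releases key metrics as "<metric>,<filter>"
            # (e.g. "acc_norm,none"); older ones use the bare metric name.
            main_metric = BENCHMARK_SUITE.get(task_name, {}).get('metric', 'acc')
            candidate_keys = (f"{main_metric},none", main_metric, "acc,none", "acc")
            score = next((metrics[k] for k in candidate_keys if k in metrics), None)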
            if isinstance(score, (int, float)):
                row[task_name] = score * 100
        
        # Calculate average
        scores = [v for k, v in row.items() if k != 'Model']
        row['Average'] = sum(scores) / len(scores) if scores else 0
        
        data.append(row)
    
    df = pd.DataFrame(data)
    df = df.set_index('Model')
    return df.round(2)

if all_results:
    comparison_df = create_comparison_table(all_results)
    print("\n📊 Model Comparison Table:")
    print(comparison_df.to_string())

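Before declaring one model "better" from a comparison table, it is worth checking whether the gap is larger than the sampling noise. The following is a minimal sketch using a two-proportion z-score. It treats the two runs as independent samples, which is a simplification, because both models answer the same questions and a paired test would be more precise. The scores are hypothetical and `difference_z_score` is a name chosen for this example.

```python
import math


def difference_z_score(acc_a: float, acc_b: float, n: int) -> float:
    """z-score for the gap between two accuracies, each measured on n samples."""
    se_a = math.sqrt(acc_a * (1 - acc_a) / n)
    se_b = math.sqrt(acc_b * (1 - acc_b) / n)
    return (acc_a - acc_b) / math.sqrt(se_a ** 2 + se_b ** 2)


# Hypothetical scores: the same 5-point gap at two sample sizes.
for n in (100, 5000):
    z = difference_z_score(0.65, 0.60, n)
    verdict = "likely real" if abs(z) > 1.96 else "within noise"
    print(f"n={n}: z = {z:.2f} ({verdict})")
```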
In [None]:
# Visualize the comparison
import matplotlib.pyplot as plt
import numpy as np

if all_results and len(all_results) > 1:
    fig, ax = plt.subplots(figsize=(12, 6))
    
    x = np.arange(len(comparison_df.columns) - 1)  # Exclude 'Average'
    width = 0.8 / len(comparison_df)  # scale bar width so groups don't overlap
    
    colors = plt.cm.Set2(np.linspace(0, 1, len(comparison_df)))
    
    for i, (model, row) in enumerate(comparison_df.iterrows()):
        values = [row[col] for col in comparison_df.columns if col != 'Average']
        offset = width * (i - len(comparison_df)/2 + 0.5)
        ax.bar(x + offset, values, width, label=model, color=colors[i])
    
    ax.set_ylabel('Score (%)')
    ax.set_title('LLM Benchmark Comparison')
    ax.set_xticks(x)
    ax.set_xticklabels([col for col in comparison_df.columns if col != 'Average'], 
                       rotation=45, ha='right')
    ax.legend()
    ax.set_ylim(0, 100)
    ax.grid(axis='y', alpha=0.3)
    
    plt.tight_layout()
    plt.savefig(f"{RESULTS_DIR}/benchmark_comparison.png", dpi=150)
    plt.show()
    
    print(f"\n📁 Chart saved to {RESULTS_DIR}/benchmark_comparison.png")

---

## Part 6: Understanding MMLU In-Depth

MMLU (Massive Multitask Language Understanding) is one of the most comprehensive benchmarks. Let's explore it in detail.

### 🧒 ELI5: What is MMLU?

> **Imagine a genius who claims to know EVERYTHING.** How would you test them?
>
> You'd ask questions from:
> - 🧬 Biology: "What is the powerhouse of the cell?"
> - ⚖️ Law: "What is the Fifth Amendment about?"
> - 🔬 Physics: "What is E=mc²?"
> - 📜 History: "When did WW2 end?"
> - 🖥️ Computer Science: "What is a binary tree?"
>
> **MMLU does exactly this!** It has 14,042 questions across 57 subjects, testing if the AI truly understands diverse topics.

In [None]:
# MMLU subject categories
MMLU_SUBJECTS = {
    "STEM": [
        "abstract_algebra", "anatomy", "astronomy", "college_biology",
        "college_chemistry", "college_computer_science", "college_mathematics",
        "college_physics", "computer_security", "conceptual_physics",
        "electrical_engineering", "elementary_mathematics", "high_school_biology",
        "high_school_chemistry", "high_school_computer_science", 
        "high_school_mathematics", "high_school_physics", "high_school_statistics",
        "machine_learning"
    ],
    "Humanities": [
        "formal_logic", "high_school_european_history", "high_school_us_history",
        "high_school_world_history", "international_law", "jurisprudence",
        "logical_fallacies", "moral_disputes", "moral_scenarios", "philosophy",
        "prehistory", "professional_law", "world_religions"
    ],
    "Social_Sciences": [
        "econometrics", "high_school_geography", "high_school_government_and_politics",
        "high_school_macroeconomics", "high_school_microeconomics", "high_school_psychology",
        "human_sexuality", "professional_psychology", "public_relations", "security_studies",
        "sociology", "us_foreign_policy"
    ],
    "Other": [
        "business_ethics", "clinical_knowledge", "college_medicine", "global_facts",
        "human_aging", "management", "marketing", "medical_genetics", "miscellaneous",
        "nutrition", "professional_accounting", "professional_medicine", "virology"
    ]
}

print("📚 MMLU Subject Categories:")
for category, subjects in MMLU_SUBJECTS.items():
    print(f"\n{category} ({len(subjects)} subjects):")
    print(f"  {', '.join(subjects[:5])}..." if len(subjects) > 5 else f"  {', '.join(subjects)}")

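Once per-subject scores are available, a category mapping like the one above can be used to roll them up. The sketch below computes an unweighted mean per category, which can differ from an average weighted by question count. The mapping excerpt, the scores, and the name `category_averages` are hypothetical and only illustrate the aggregation.

```python
from statistics import mean

# Small excerpt of a subject-to-category mapping, for illustration.
CATEGORIES = {
    "STEM": ["college_physics", "high_school_computer_science"],
    "Humanities": ["philosophy"],
    "Social_Sciences": ["high_school_psychology"],
}


def category_averages(subject_scores: dict, categories: dict) -> dict:
    """Average per-subject accuracy within each category, skipping missing subjects."""
    averages = {}
    for category, subjects in categories.items():
        found = [subject_scores[s] for s in subjects if s in subject_scores]
        if found:
            averages[category] = mean(found)
    return averages


# Hypothetical per-subject accuracies (not real results).
scores = {
    "college_physics": 0.30,
    "high_school_computer_science": 0.50,
    "philosophy": 0.60,
}

for category, avg in category_averages(scores, CATEGORIES).items():
    print(f"{category:<16} {avg:.1%}")
```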
In [None]:
# Run MMLU on specific categories (much faster than full MMLU)
# Let's test on a few representative subjects

mmlu_sample_tasks = [
    "mmlu_high_school_computer_science",
    "mmlu_college_physics",
    "mmlu_philosophy",
    "mmlu_high_school_psychology"
]

print("MMLU sample tasks (run the next cell to evaluate them):")
print(mmlu_sample_tasks)

In [None]:
# Run MMLU sample (uncomment to execute - takes ~10 minutes)
# mmlu_results = run_benchmark(
#     model_name="microsoft/phi-2",
#     tasks=mmlu_sample_tasks,
#     output_name="phi2_mmlu_sample",
#     batch_size=8,
#     limit=50  # 50 samples per subject
# )
# 
# if mmlu_results:
#     display_results(mmlu_results, "phi-2 MMLU Sample")

---

## Part 7: Advanced Evaluation Techniques

### Using Different Backends

The LM Evaluation Harness supports multiple backends for different use cases.

In [None]:
# Different model backends available
BACKENDS = {
    "hf": {
        "name": "HuggingFace Transformers",
        "use_case": "Standard evaluation, most compatible",
        "example": "--model hf --model_args pretrained=Qwen/Qwen3-8B"
    },
    "hf-auto": {
        "name": "HuggingFace Auto",
        "use_case": "Alias of the hf backend; add parallelize=True to shard large models across GPUs",
        "example": "--model hf-auto --model_args pretrained=Qwen/Qwen3-32B,parallelize=True"
    },
    "vllm": {
        "name": "vLLM",
        "use_case": "Fastest inference, PagedAttention",
        "example": "--model vllm --model_args pretrained=Qwen/Qwen3-8B,tensor_parallel_size=1"
    },
    "openai-completions": {
        "name": "OpenAI API",
        "use_case": "Evaluate API-based completion models",
        "example": "--model openai-completions --model_args model=davinci-002"
    },
    "local-completions": {
        "name": "Local OpenAI-compatible API",
        "use_case": "Ollama, vLLM server, etc.",
        "example": "--model local-completions --model_args model=qwen3:4b,base_url=http://localhost:11434/v1/completions"
    }
}

print("🔧 Available Model Backends:")
print("="*60)
for backend, info in BACKENDS.items():
    print(f"\n{backend}:")
    print(f"  Name: {info['name']}")
    print(f"  Use: {info['use_case']}")
    print(f"  Example: {info['example']}")

In [None]:
# Example: Evaluate an Ollama model (if Ollama is running)
# First, start Ollama: ollama serve
# Then pull a model: ollama pull qwen3:4b

ollama_cmd = """
# Uncomment and run if you have Ollama set up:
# lm_eval --model local-completions \\
#     --model_args model=qwen3:4b,base_url=http://localhost:11434/v1/completions \\
#     --tasks hellaswag,arc_easy \\
#     --batch_size 1 \\
#     --limit 50 \\
#     --output_path ./results/ollama_qwen3_4b
"""
print(ollama_cmd)

---

## ✋ Try It Yourself: Exercise 1

**Task:** Run a full benchmark comparison on two models of your choice.

Requirements:
1. Choose two models (can be different sizes)
2. Run at least 3 benchmarks on each
3. Create a comparison visualization
4. Write a brief analysis of which model is "better" and why

<details>
<summary>💡 Hint</summary>

Try comparing:
- Different sizes of the same family (Llama 3B vs 8B)
- Same size, different families (phi-2 vs TinyLlama)
- Base vs instruction-tuned versions

</details>

In [None]:
# YOUR CODE HERE
# Step 1: Define your models
my_models = [
    # Add your model choices here
]

# Step 2: Define benchmarks
my_benchmarks = [
    # Add your benchmark choices here
]

# Step 3: Run evaluations
# Use the run_benchmark function

# Step 4: Create visualization
# Use matplotlib to plot results

---

## ⚠️ Common Mistakes

### Mistake 1: Comparing Models with Different Few-shot Settings

In [None]:
# ❌ Wrong: Different few-shot settings make comparison unfair
# Model A: 0-shot evaluation
# lm_eval --model hf --model_args pretrained=model_a --tasks mmlu --num_fewshot 0

# Model B: 5-shot evaluation  
# lm_eval --model hf --model_args pretrained=model_b --tasks mmlu --num_fewshot 5

# ✅ Right: Use the same settings for fair comparison
# Both models: 5-shot
# lm_eval --model hf --model_args pretrained=model_a --tasks mmlu --num_fewshot 5
# lm_eval --model hf --model_args pretrained=model_b --tasks mmlu --num_fewshot 5

print("✅ Always use identical evaluation settings when comparing models!")

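A quick way to guard against this mistake is to record the settings of each run and compare them before comparing scores. The sketch below assumes you keep the settings in plain dictionaries. The keys shown and the function name `settings_mismatches` are hypothetical and not part of lm-eval.

```python
def settings_mismatches(settings_a: dict, settings_b: dict) -> dict:
    """Return {setting: (value_a, value_b)} for every setting that differs."""
    keys = settings_a.keys() | settings_b.keys()
    return {
        key: (settings_a.get(key), settings_b.get(key))
        for key in sorted(keys)
        if settings_a.get(key) != settings_b.get(key)
    }


# Hypothetical settings recorded for two runs.
run_a = {"task": "mmlu", "num_fewshot": 0, "limit": None, "dtype": "bfloat16"}
run_b = {"task": "mmlu", "num_fewshot": 5, "limit": None, "dtype": "bfloat16"}

diff = settings_mismatches(run_a, run_b)
if diff:
    print("Not comparable, settings differ:", diff)
else:
    print("Settings match.")
```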
### Mistake 2: Running Out of Memory on Large Models

In [None]:
# ❌ Risky: Loading a large model with default settings
# lm_eval --model hf --model_args pretrained=Qwen/Qwen3-32B --tasks mmlu
# Depending on dtype and batch size, this can run out of memory even on DGX Spark!

# ✅ Right: Use model parallelization and smaller batch size
correct_cmd = """
lm_eval --model hf \\
    --model_args pretrained=Qwen/Qwen3-32B,dtype=bfloat16,parallelize=True \\
    --tasks mmlu \\
    --batch_size 1 \\
    --output_path ./results/qwen3_32b
"""

print("For large models (32B and up) on DGX Spark:")
print("1. Use parallelize=True to shard across memory")
print("2. Use batch_size=1 to minimize peak memory")
print("3. Use dtype=bfloat16 for memory efficiency")

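A back-of-the-envelope estimate helps decide in advance whether a model is likely to fit. The sketch below multiplies parameter count by bytes per parameter, and it covers the weights only: activations, the KV cache, and framework overhead come on top. The function name `weight_memory_gb` and the chosen model sizes are illustrative.

```python
# Bytes needed to store one parameter in each dtype.
BYTES_PER_PARAM = {"float32": 4, "bfloat16": 2, "float16": 2, "int8": 1}


def weight_memory_gb(num_params_billion: float, dtype: str = "bfloat16") -> float:
    """Approximate memory for the weights only (no activations or KV cache)."""
    return num_params_billion * 1e9 * BYTES_PER_PARAM[dtype] / 1e9


for size in (2.7, 32, 70):
    for dtype in ("float32", "bfloat16"):
        print(f"{size:>5}B in {dtype:<8}: ~{weight_memory_gb(size, dtype):.0f} GB")
```

Comparing these numbers against the 128GB of unified memory shows quickly which model and dtype combinations leave headroom for evaluation batches.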
### Mistake 3: Misinterpreting Normalized vs Raw Accuracy

In [None]:
# Different metrics mean different things!

print("""
📊 Understanding Metrics:

acc (Accuracy)
  - Raw correct / total
  - Simple percentage of correct answers

acc_norm (Normalized Accuracy)  
  - Normalizes each answer's score by its length to reduce length bias
  - Preferred for multiple choice tasks
  - Used in: HellaSwag, ARC

acc_stderr
  - Standard error of the accuracy
  - Shows confidence in the result
  - Lower = more reliable

⚠️ Always report which metric you're using!
""")

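The difference between `acc` and `acc_norm` is easiest to see on a single question. The sketch below picks an answer two ways: by total log-likelihood, and by log-likelihood divided by the answer's length in bytes, which is how length normalization is commonly described. The log-likelihood values are invented for illustration, and the helper names are hypothetical.

```python
# Hypothetical summed log-likelihoods a model assigns to each candidate ending.
candidates = [
    {"text": "sat.", "loglik": -4.0},
    {"text": "sat down on the warm windowsill.", "loglik": -6.0},
]


def pick_raw(options):
    """acc-style choice: highest total log-likelihood."""
    return max(options, key=lambda o: o["loglik"])


def pick_normalized(options):
    """acc_norm-style choice: highest log-likelihood per byte of answer text."""
    return max(options, key=lambda o: o["loglik"] / len(o["text"].encode("utf-8")))


print("raw:       ", pick_raw(candidates)["text"])
print("normalized:", pick_normalized(candidates)["text"])
```

The two rules select different answers here, which is why the same model can report noticeably different `acc` and `acc_norm` values on the same task.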
### Mistake 4: Testing on Training Data (Contamination)

In [None]:
print("""
🚨 Data Contamination Warning

Some models may have been trained on benchmark data!
This leads to inflated scores that don't reflect true capability.

Signs of contamination:
- Unusually high scores on specific benchmarks
- Model performance doesn't match real-world use
- Perfect recall of exact benchmark questions

Mitigation strategies:
1. Use newer benchmarks not in training data
2. Create custom evaluation sets
3. Test on held-out data
4. Use human evaluation for sanity checks
""")

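If you have access to text that may have been in a model's training data, a naive word n-gram overlap check can flag benchmark items that appear verbatim. This is a rough heuristic sketch, not a standard tool: the 8-gram window, the function names, and the example texts are all assumptions made for illustration.

```python
def ngrams(text: str, n: int) -> set:
    """Set of lowercase word n-grams in text."""
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def overlap_ratio(benchmark_item: str, corpus_text: str, n: int = 8) -> float:
    """Fraction of the item's n-grams that also appear in the corpus text."""
    item_grams = ngrams(benchmark_item, n)
    if not item_grams:
        return 0.0
    return len(item_grams & ngrams(corpus_text, n)) / len(item_grams)


# Made-up texts, for illustration only.
question = "A ball is dropped from a tall tower and falls freely under gravity toward the ground"
leaked = "forum post a ball is dropped from a tall tower and falls freely under gravity toward the ground please discuss"
clean = "the recipe calls for two cups of flour and a pinch of salt mixed in a large bowl"

print(f"leaked page: {overlap_ratio(question, leaked):.2f}")
print(f"clean page:  {overlap_ratio(question, clean):.2f}")
```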
---

## 🎉 Checkpoint

You've learned:
- ✅ What LLM benchmarks are and why they matter
- ✅ How to use the LM Evaluation Harness
- ✅ Running benchmarks on multiple models
- ✅ Comparing and visualizing results
- ✅ Common pitfalls to avoid

---

## 🚀 Challenge (Optional)

**Advanced Challenge: Build a Benchmark Dashboard**

Create a Streamlit or Gradio app that:
1. Lets users select models and benchmarks
2. Runs evaluations in the background
3. Displays real-time progress
4. Shows interactive comparison charts
5. Exports results to CSV/JSON

This is how companies like HuggingFace built the Open LLM Leaderboard!

---

## 📖 Further Reading

- [LM Evaluation Harness Documentation](https://github.com/EleutherAI/lm-evaluation-harness)
- [MMLU Paper](https://arxiv.org/abs/2009.03300)
- [HellaSwag Paper](https://arxiv.org/abs/1905.07830)
- [Open LLM Leaderboard](https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard)
- [HELM: Holistic Evaluation of Language Models](https://crfm.stanford.edu/helm/)
- [AlpacaEval](https://tatsu-lab.github.io/alpaca_eval/)

---

## 🧹 Cleanup

In [None]:
# Clear GPU memory
import torch
import gc

gc.collect()
if torch.cuda.is_available():
    torch.cuda.empty_cache()
    print(f"GPU memory freed. Current allocation: {torch.cuda.memory_allocated() / 1e9:.2f} GB")

---

## 📝 Summary

In this notebook, we:

1. **Explored** the landscape of LLM benchmarks (MMLU, HellaSwag, ARC, etc.)
2. **Installed** and configured the LM Evaluation Harness
3. **Ran** benchmarks on multiple models
4. **Compared** results across models
5. **Visualized** benchmark results
6. **Learned** common mistakes to avoid

**Next up:** In notebook 02, we'll learn how to create custom evaluation frameworks when standard benchmarks don't fit your use case!