# Task 4.3.3: LLM Benchmark Suite

**Module:** 4.3 - MLOps & Experiment Tracking  
**Time:** 2 hours  
**Difficulty:** ⭐⭐⭐ (Intermediate)

---

## Learning Objectives

By the end of this notebook, you will:
- [ ] Understand the major LLM benchmarks and what they measure
- [ ] Run standard benchmarks using lm-evaluation-harness
- [ ] Interpret benchmark results correctly
- [ ] Compare models using standardized evaluations
- [ ] Avoid common benchmarking pitfalls

---

## Prerequisites

- Completed: Tasks 4.3.1-4.3.2 (MLflow/W&B Setup)
- Knowledge of: LLM basics, transformers
- Hardware: DGX Spark with at least 20GB free GPU memory

---

## Real-World Context

When Meta releases Llama 3.1, or when Mistral announces a new model, how do they prove it's better than competitors?

**Benchmarks.**

Every major AI lab uses standardized benchmarks to:
- Compare models objectively
- Track progress over time
- Identify strengths and weaknesses
- Justify research claims in papers

Without proper benchmarking, you're just guessing whether your model is good.

---

## ELI5: What are LLM Benchmarks?

> **Imagine you're testing how smart different students are.**
>
> You could ask each student random questions, but that's not fair! One might get easy questions, another hard ones.
>
> Instead, you give everyone the **same standardized test** - like the SAT. Now you can compare fairly:
> - Math section = reasoning ability
> - Reading section = language understanding
> - Writing section = generation quality
>
> **In AI terms:** Benchmarks are standardized tests for LLMs. MMLU tests knowledge, HellaSwag tests common sense, HumanEval tests coding. Same questions for every model, fair comparison.

---

## The Major LLM Benchmarks

| Benchmark | What It Tests | # Questions | How It Works |
|-----------|---------------|-------------|-------------|
| **MMLU** | World knowledge | 14,042 | Multiple choice (57 subjects) |
| **HellaSwag** | Common sense | 10,042 | Sentence completion |
| **ARC** | Science reasoning | 7,787 | Multiple choice (grade school) |
| **WinoGrande** | Pronoun resolution | 1,767 | Fill in the blank |
| **TruthfulQA** | Factual accuracy | 817 | Multiple choice |
| **GSM8K** | Math reasoning | 1,319 | Word problems |
| **HumanEval** | Coding | 164 | Write Python functions |
| **MT-Bench** | Chat quality | 80 | LLM-as-judge scoring |

---

## Part 1: Setting Up lm-evaluation-harness

EleutherAI's [lm-evaluation-harness](https://github.com/EleutherAI/lm-evaluation-harness) is one of the most widely used frameworks for LLM benchmarking.

In [None]:
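# Install lm-eval
!pip install lm-eval -q

from importlib.metadata import version

import lm_eval
# Read the version from package metadata, since not every lm-eval release defines lm_eval.__version__
print(f"lm-eval version: {version('lm_eval')}")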

In [None]:
# Check available GPU memory before loading models
import torch

def print_gpu_memory():
    """Print current GPU memory usage."""
    if torch.cuda.is_available():
        total = torch.cuda.get_device_properties(0).total_memory / 1024**3
        allocated = torch.cuda.memory_allocated(0) / 1024**3
        cached = torch.cuda.memory_reserved(0) / 1024**3
        free = total - cached
        print(f"GPU Memory:")
        print(f"  Total:     {total:.1f} GB")
        print(f"  Allocated: {allocated:.1f} GB")
        print(f"  Cached:    {cached:.1f} GB")
        print(f"  Free:      {free:.1f} GB")
    else:
        print("No GPU available")

print_gpu_memory()

In [None]:
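# List available benchmark tasks
# lm-eval 0.4+ exposes the task registry through TaskManager
# (older releases used a module-level tasks.ALL_TASKS list instead)
from lm_eval.tasks import TaskManager

# Get all available tasks
task_manager = TaskManager()
all_tasks = task_manager.all_tasks
print(f"Total available tasks: {len(all_tasks)}")
print("\nPopular benchmarks:")

# Task names vary between lm-eval versions (e.g. truthfulqa_mc became truthfulqa_mc1/_mc2)
popular = ['mmlu', 'hellaswag', 'arc_easy', 'arc_challenge', 'winogrande',
           'truthfulqa_mc2', 'gsm8k', 'humaneval']
for task in popular:
    status = "available" if task in all_tasks else "not found"
    print(f"  - {task}: {status}")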

---

## Part 2: Running Your First Benchmark

Let's benchmark a small model first to understand the process.

In [None]:
# Clear cache before loading model
import gc
gc.collect()
torch.cuda.empty_cache()

print("Starting benchmark run...")
print("This will:")
print("  1. Load the model (microsoft/phi-2)")
print("  2. Run HellaSwag benchmark (10k questions in full; the demo run below uses limit=100)")
print("  3. Report accuracy metrics")
print("\nEstimated time: 10-15 minutes on DGX Spark")

In [None]:
from typing import List, Optional

from lm_eval import evaluator
import time

def run_benchmark(
    model_name: str,
    tasks: List[str],
    batch_size: int = 8,
    limit: Optional[int] = None,
    dtype: str = "bfloat16"
):
    """
    Run lm-eval benchmarks on a model.
    
    Args:
        model_name: HuggingFace model ID
        tasks: List of benchmark tasks to run
        batch_size: Batch size for inference
        limit: Limit number of examples (None for full benchmark)
        dtype: Model dtype (bfloat16 recommended for DGX Spark)
    
    Returns:
        dict: Benchmark results
    """
    print(f"\n{'='*60}")
    print(f"Benchmarking: {model_name}")
    print(f"Tasks: {tasks}")
    print(f"Batch size: {batch_size}")
    if limit:
        print(f"Limit: {limit} examples per task")
    print(f"{'='*60}\n")
    
    start_time = time.time()
    
    # Run evaluation
    results = evaluator.simple_evaluate(
        model="hf",
        model_args=f"pretrained={model_name},dtype={dtype}",
        tasks=tasks,
        batch_size=batch_size,
        limit=limit,
        device="cuda" if torch.cuda.is_available() else "cpu"
    )
    
    elapsed = time.time() - start_time
    
    # Print results
    print(f"\n{'='*60}")
    print(f"RESULTS for {model_name}")
    print(f"{'='*60}")
    
    for task_name, task_results in results['results'].items():
        print(f"\n{task_name}:")
        for metric, value in task_results.items():
            if isinstance(value, float):
                print(f"  {metric}: {value:.4f}")
            else:
                print(f"  {metric}: {value}")
    
    print(f"\nTotal time: {elapsed/60:.1f} minutes")
    print_gpu_memory()
    
    return results

In [None]:
# Run a quick benchmark with limited examples
# Using limit=100 for demonstration (full run takes longer)

quick_results = run_benchmark(
    model_name="microsoft/phi-2",
    tasks=["hellaswag"],
    batch_size=8,
    limit=100  # Remove this for full benchmark
)

### What Just Happened?

The benchmark:
1. Loaded the model onto GPU
2. Fed it 100 HellaSwag questions (sentence completion)
3. Calculated accuracy (how often it picked the right completion)

**Note:** With `limit=100`, this is a quick approximation. For publishable results, run the full benchmark!
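
To get a feel for how rough a 100-example estimate is, the sketch below computes the binomial standard error of an accuracy score at different sample sizes. The accuracy of 0.55 is a made-up value for illustration, and the function name is hypothetical. The calculation assumes independent examples, which is a simplification.

```python
import math

def accuracy_standard_error(accuracy: float, n_examples: int) -> float:
    """Binomial standard error of an accuracy estimate from n independent examples."""
    return math.sqrt(accuracy * (1.0 - accuracy) / n_examples)

# Hypothetical accuracy of 0.55, used only to compare sample sizes
for n in (100, 1_000, 10_042):
    se = accuracy_standard_error(0.55, n)
    print(f"n={n:>6}: 0.550 +/- {1.96 * se:.3f} (approx. 95% interval)")
```

At 100 examples the interval is roughly ±10 percentage points; on the full 10,042-question HellaSwag validation set it shrinks to about ±1 point.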

---

## Part 3: Understanding the Benchmarks

### HellaSwag: Common Sense Reasoning

Example question:
> **Context:** "A woman is standing at a podium. She"
> 
> **Choices:**
> - A) "adjusts the microphone and begins speaking."
> - B) "starts doing jumping jacks."
> - C) "eats a sandwich loudly."
> - D) "falls asleep immediately."
>
> **Answer:** A

Models need common sense to know what typically happens next.
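
Multiple-choice tasks like this are usually scored without generating any text: the harness compares the log-likelihood the model assigns to each ending and picks the highest. The `acc` metric uses the raw log-likelihoods, while `acc_norm` divides by the ending's length so that short endings are not favored. The following is a simplified sketch of that idea, not the harness's own code; `pick_answer` is a hypothetical helper and the log-likelihood values are invented.

```python
from typing import List

def pick_answer(loglikelihoods: List[float], endings: List[str], normalize: bool = False) -> int:
    """Return the index of the highest-scoring ending.

    With normalize=True, each log-likelihood is divided by the ending's
    character length, mimicking the idea behind acc_norm.
    """
    scores = [
        ll / len(ending) if normalize else ll
        for ll, ending in zip(loglikelihoods, endings)
    ]
    return max(range(len(scores)), key=scores.__getitem__)

endings = [
    "adjusts the microphone and begins speaking.",
    "starts doing jumping jacks.",
    "eats a sandwich loudly.",
    "falls asleep immediately.",
]
# Invented log-likelihoods: the long correct ending gets a lower raw score
lls = [-14.0, -12.5, -13.0, -15.0]

print("raw choice:       ", chr(65 + pick_answer(lls, endings)))
print("normalized choice:", chr(65 + pick_answer(lls, endings, normalize=True)))
```

With these invented numbers the raw score prefers the shorter ending B, while length normalization recovers A. This is why `acc` and `acc_norm` can differ for the same model.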

In [None]:
# Let's look at actual HellaSwag examples
from datasets import load_dataset

hellaswag = load_dataset("hellaswag", split="validation[:5]")

print("Sample HellaSwag questions:")
print("="*60)

for i, example in enumerate(hellaswag):
    print(f"\nQuestion {i+1}:")
    print(f"Activity: {example['activity_label']}")
    print(f"Context: {example['ctx']}")
    print(f"Endings:")
    for j, ending in enumerate(example['endings']):
        marker = "-> " if j == int(example['label']) else "   "
        # Truncate long endings to 80 characters for display
        text = f"{ending[:80]}..." if len(ending) > 80 else ending
        print(f"  {marker}{chr(65+j)}) {text}")
    print()

### MMLU: Massive Multitask Language Understanding

MMLU tests knowledge across 57 subjects:

In [None]:
# MMLU subjects (a sample; names match the cais/mmlu dataset configs)
mmlu_subjects = [
    # STEM
    "abstract_algebra", "astronomy", "college_computer_science",
    "electrical_engineering", "machine_learning", "college_physics",
    # Humanities
    "philosophy", "high_school_world_history", "moral_scenarios",
    "high_school_us_history", "professional_law",
    # Social Sciences
    "high_school_macroeconomics", "high_school_psychology", "sociology",
    "high_school_government_and_politics",
    # Other
    "anatomy", "clinical_knowledge", "medical_genetics",
    "professional_accounting", "marketing"
]

print("MMLU covers 57 subjects across 4 categories:")
print(f"\nSample subjects: {mmlu_subjects[:10]}...")
print("\nDifficulty ranges from high school to professional level.")

In [None]:
# Sample MMLU question
# cais/mmlu config names are full subject names such as "college_computer_science"
mmlu_sample = load_dataset("cais/mmlu", "college_computer_science", split="test[:3]")

print("Sample MMLU College Computer Science questions:")
print("="*60)

for i, example in enumerate(mmlu_sample):
    print(f"\nQuestion {i+1}: {example['question']}")
    for j, choice in enumerate(example['choices']):
        marker = "-> " if j == example['answer'] else "   "
        print(f"  {marker}{chr(65+j)}) {choice}")
    print()

---

## Part 4: Comparing Multiple Models

Let's benchmark multiple models and compare them.

In [None]:
# Models to compare (small models for faster demo)
models_to_test = [
    "microsoft/phi-2",           # 2.7B params
    "TinyLlama/TinyLlama-1.1B-Chat-v1.0",  # 1.1B params
]

# Benchmarks to run
benchmark_tasks = ["hellaswag", "arc_easy"]

print("Model Comparison Plan:")
print(f"  Models: {models_to_test}")
print(f"  Benchmarks: {benchmark_tasks}")
print(f"  Note: Using limit=50 for quick comparison")

In [None]:
# Run comparisons
comparison_results = {}

for model_name in models_to_test:
    # Clear GPU memory between models
    gc.collect()
    torch.cuda.empty_cache()
    
    try:
        results = run_benchmark(
            model_name=model_name,
            tasks=benchmark_tasks,
            batch_size=4,
            limit=50  # Small limit for demo
        )
        comparison_results[model_name] = results
    except Exception as e:
        print(f"Error benchmarking {model_name}: {e}")
        comparison_results[model_name] = None

In [None]:
# Create comparison table
import pandas as pd

comparison_data = []

for model_name, results in comparison_results.items():
    if results is None:
        continue
    
    row = {"model": model_name.split("/")[-1]}
    
    for task_name, task_results in results['results'].items():
        # Get accuracy metric, skipping its standard error
        # (lm-eval 0.4+ uses keys such as "acc,none" and "acc_stderr,none")
        acc_key = [k for k in task_results.keys()
                   if 'acc' in k.lower() and 'stderr' not in k.lower()]
        if acc_key:
            row[task_name] = task_results[acc_key[0]]
    
    comparison_data.append(row)

if comparison_data:
    df = pd.DataFrame(comparison_data)
    print("\n" + "="*60)
    print("MODEL COMPARISON")
    print("="*60)
    print(df.to_string(index=False))
else:
    print("No results to compare")

---

## Part 5: Running Benchmarks from Command Line

For production use, the command line is often more convenient.

In [None]:
# Example command for full benchmark
cli_command = """
# Full MMLU benchmark (takes ~2-3 hours for an 8B model)
lm_eval --model hf \\
    --model_args pretrained=meta-llama/Llama-3.1-8B,dtype=bfloat16 \\
    --tasks mmlu \\
    --batch_size 8 \\
    --output_path ./results/llama-3.1-8b-mmlu

# Quick benchmark suite (common tasks)
lm_eval --model hf \\
    --model_args pretrained=meta-llama/Llama-3.1-8B,dtype=bfloat16 \\
    --tasks hellaswag,arc_easy,arc_challenge,winogrande \\
    --batch_size 8 \\
    --output_path ./results/llama-3.1-8b-quick

# With logging to W&B
lm_eval --model hf \\
    --model_args pretrained=microsoft/phi-2,dtype=bfloat16 \\
    --tasks hellaswag,mmlu \\
    --batch_size 8 \\
    --wandb_args project=llm-benchmarks,name=phi2-eval \\
    --output_path ./results/phi2
"""

print("Command Line Examples:")
print(cli_command)

In [None]:
# Run a quick CLI benchmark
!lm_eval --model hf \
    --model_args pretrained=microsoft/phi-2,dtype=bfloat16 \
    --tasks arc_easy \
    --batch_size 4 \
    --limit 20 \
    --output_path ./results/phi2_quick_test

---

## Part 6: Integrating with MLflow

Let's log benchmark results to MLflow for tracking.

In [None]:
import mlflow
import json

def benchmark_with_mlflow(
    model_name: str,
    tasks: list,
    batch_size: int = 8,
    limit: int = None,
    experiment_name: str = "LLM-Benchmarks"
):
    """
    Run benchmarks and log results to MLflow.
    """
    mlflow.set_experiment(experiment_name)
    
    with mlflow.start_run(run_name=f"{model_name.split('/')[-1]}-benchmark"):
        # Log configuration
        mlflow.log_params({
            "model_name": model_name,
            "tasks": ",".join(tasks),
            "batch_size": batch_size,
            "limit": limit or "full",
            "dtype": "bfloat16"
        })
        
        # Run benchmark
        results = evaluator.simple_evaluate(
            model="hf",
            model_args=f"pretrained={model_name},dtype=bfloat16",
            tasks=tasks,
            batch_size=batch_size,
            limit=limit,
            device="cuda" if torch.cuda.is_available() else "cpu"
        )
        
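        # Log metrics
        # lm-eval metric keys look like "acc,none"; MLflow metric names cannot contain commas
        for task_name, task_results in results['results'].items():
            for metric, value in task_results.items():
                if isinstance(value, (int, float)):
                    metric_name = f"{task_name}_{metric}".replace(",", "_")
                    mlflow.log_metric(metric_name, value)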
        
        # Save full results as artifact
        with open("benchmark_results.json", "w") as f:
            json.dump(results, f, indent=2, default=str)
        mlflow.log_artifact("benchmark_results.json")
        
        print(f"Results logged to MLflow run: {mlflow.active_run().info.run_id}")
        
        return results

In [None]:
# Run benchmark with MLflow logging
gc.collect()
torch.cuda.empty_cache()

mlflow_results = benchmark_with_mlflow(
    model_name="microsoft/phi-2",
    tasks=["arc_easy"],
    batch_size=4,
    limit=30
)

---

## Part 7: Understanding Benchmark Scores

### What's a "Good" Score?

| Benchmark | Random Baseline | Good (7B) | Excellent (70B+) | Human Level |
|-----------|-----------------|-----------|------------------|-------------|
| MMLU | 25% | 55-65% | 80-90% | ~90% |
| HellaSwag | 25% | 75-85% | 90-95% | ~95% |
| ARC-Easy | 25% | 75-85% | 90-95% | ~95% |
| ARC-Challenge | 25% | 45-55% | 70-85% | ~80% |
| WinoGrande | 50% | 70-75% | 85-90% | ~94% |
| GSM8K | 0% | 30-50% | 70-90% | ~100% |
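
Because the random baseline differs between benchmarks, the same raw score can mean very different things. One way to see this is to rescale a score so that random guessing maps to 0 and a perfect score to 100. The helper below is a hypothetical illustration of that arithmetic, using baselines from the table above; it is not a metric that lm-evaluation-harness reports.

```python
def chance_adjusted(score_pct: float, random_baseline_pct: float) -> float:
    """Rescale a score so random guessing maps to 0 and a perfect score to 100."""
    return 100.0 * (score_pct - random_baseline_pct) / (100.0 - random_baseline_pct)

# The same raw 60% on a 4-way task (25% baseline) and a 2-way task (50% baseline)
print(f"4-way task: {chance_adjusted(60.0, 25.0):.1f}")
print(f"2-way task: {chance_adjusted(60.0, 50.0):.1f}")
```

A raw 60% is much further above chance on a four-option benchmark such as MMLU than on a two-option benchmark such as WinoGrande.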

In [None]:
# Reference scores from model cards
reference_scores = {
    "Model": ["Llama-2-7B", "Llama-2-70B", "Llama-3.1-8B", "Llama-3.1-70B", "GPT-4"],
    "MMLU": [45.3, 69.8, 66.6, 79.3, 86.4],
    "HellaSwag": [77.2, 85.3, 80.1, 88.0, 95.3],
    "ARC-C": [53.1, 68.0, 55.4, 68.8, 96.3],
    "WinoGrande": [74.0, 80.5, 77.0, 85.3, 87.5],
}

ref_df = pd.DataFrame(reference_scores)
print("Reference Benchmark Scores (from model cards):")
print("="*60)
print(ref_df.to_string(index=False))
print("\nNote: Scores are accuracy percentages as reported by their sources; evaluation settings may differ between models.")

---

## Try It Yourself

Run a full benchmark comparison on two models of your choice.

<details>
<summary>Hint</summary>

Try comparing:
- A base model vs. its instruction-tuned version
- Models from different families at similar sizes
- A model before and after your fine-tuning

Use `limit=None` for accurate results (will take longer).

</details>

In [None]:
# YOUR CODE HERE
# Compare two models on at least 2 benchmarks

# Suggested models:
# - "meta-llama/Llama-3.2-1B" (1B)
# - "meta-llama/Llama-3.2-3B" (3B)
# - "microsoft/phi-2" (2.7B)

# Your comparison code here...

---

## Common Mistakes

### Mistake 1: Cherry-Picking Benchmarks

```python
# Wrong - only reporting where your model does well
results = run_benchmark(model, tasks=["my_model_is_great_at_this"])

# Right - run standard benchmark suite
results = run_benchmark(model, tasks=["mmlu", "hellaswag", "arc_challenge", "winogrande"])
```
**Why:** Cherry-picking gives a misleading picture of model quality.

### Mistake 2: Comparing Apples to Oranges

```python
# Wrong - different evaluation settings
model_a_results = run_benchmark(model_a, tasks=["hellaswag"], batch_size=8, limit=100)
model_b_results = run_benchmark(model_b, tasks=["hellaswag"], batch_size=1, limit=None)

# Right - identical settings for fair comparison
for model in [model_a, model_b]:
    results = run_benchmark(model, tasks=["hellaswag"], batch_size=8, limit=None)
```
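**Why:** A limited run scores a different, smaller set of questions, and batch size can shift scores slightly, so the numbers are no longer directly comparable.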
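
One way to enforce identical settings is to define them once and reuse the same object for every model. The sketch below is an assumption about how you might organize this, not part of lm-evaluation-harness: `EvalSettings` and `assert_comparable` are hypothetical names, and the fields mirror the arguments of the `run_benchmark` helper from Part 2.

```python
from dataclasses import asdict, dataclass
from typing import Optional, Tuple

@dataclass(frozen=True)
class EvalSettings:
    """Settings that must match across runs for a fair comparison."""
    tasks: Tuple[str, ...]
    batch_size: int = 8
    limit: Optional[int] = None
    dtype: str = "bfloat16"

def assert_comparable(a: EvalSettings, b: EvalSettings) -> None:
    """Raise ValueError if two runs used different evaluation settings."""
    a_dict, b_dict = asdict(a), asdict(b)
    diffs = {k: (a_dict[k], b_dict[k]) for k in a_dict if a_dict[k] != b_dict[k]}
    if diffs:
        raise ValueError(f"Runs are not comparable, settings differ: {diffs}")

shared = EvalSettings(tasks=("hellaswag", "arc_easy"))

# Every model is then evaluated with the same shared settings, e.g.:
# run_benchmark(model, tasks=list(shared.tasks), batch_size=shared.batch_size,
#               limit=shared.limit, dtype=shared.dtype)
```

Storing the settings alongside each result makes it possible to call `assert_comparable` before putting two runs in the same table.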

### Mistake 3: Using Limit for Published Results

```python
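# Wrong for publication - limited examples
results = run_benchmark(model, tasks=["mmlu"], limit=100)
# The "acc,none" key follows lm-eval 0.4+ naming; adjust for your version
print(f"MMLU Score: {results['results']['mmlu']['acc,none']:.1%}")  # Not statistically valid!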

# Right for publication - full benchmark
results = run_benchmark(model, tasks=["mmlu"], limit=None)
```
**Why:** Small samples have high variance and aren't reproducible.
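
lm-evaluation-harness reports a standard error next to each accuracy (for example `acc_stderr,none` in recent versions), which can be used to check whether a gap between two models is larger than the noise. The sketch below uses a simple z-statistic with invented numbers. It treats the two runs as independent, which ignores the fact that both models answered the same questions, so read it as a rough check only.

```python
import math

def z_score(acc_a: float, se_a: float, acc_b: float, se_b: float) -> float:
    """z-statistic for the difference between two independent accuracy estimates."""
    return (acc_a - acc_b) / math.sqrt(se_a**2 + se_b**2)

# Invented example: the same 6-point gap at two sample sizes
z_small = z_score(0.60, 0.049, 0.54, 0.050)    # standard errors typical of ~100 examples
z_full = z_score(0.60, 0.0049, 0.54, 0.0050)   # standard errors typical of ~10,000 examples

for label, z in [("limit=100", z_small), ("full run", z_full)]:
    verdict = "likely a real difference" if abs(z) > 1.96 else "within noise"
    print(f"{label}: z = {z:.2f} ({verdict})")
```

With these numbers, the same 6-point gap is indistinguishable from noise at 100 examples but clearly significant on the full benchmark.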

### Mistake 4: Ignoring Benchmark Leakage

```python
# Be aware: training data might contain benchmark questions!
# This inflates scores artificially.

# Good practice: report training data sources
# Good practice: use held-out test sets
# Good practice: report results on multiple benchmarks
```
**Why:** Some models are accidentally (or intentionally) trained on benchmark data.
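
If you have access to the training text, a rough contamination check is to look for long word sequences shared between benchmark items and training documents. The sketch below is a minimal illustration under stated assumptions: whitespace tokenization, an arbitrary n-gram length of 8, and hypothetical function names. Real contamination studies use more careful normalization and much larger corpora.

```python
from typing import Iterable, Set, Tuple

def ngrams(text: str, n: int) -> Set[Tuple[str, ...]]:
    """Set of lowercase word n-grams in a text (whitespace tokenization)."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def overlap_fraction(benchmark_item: str, training_docs: Iterable[str], n: int = 8) -> float:
    """Fraction of the item's n-grams that also appear in any training document."""
    item_grams = ngrams(benchmark_item, n)
    if not item_grams:
        return 0.0  # item shorter than n words: nothing to compare
    train_grams: Set[Tuple[str, ...]] = set()
    for doc in training_docs:
        train_grams |= ngrams(doc, n)
    return len(item_grams & train_grams) / len(item_grams)

question = "A woman is standing at a podium. She adjusts the microphone and begins speaking."
leaked_doc = "Scraped page: " + question + " End."
clean_doc = "The quarterly report covers revenue, costs, hiring plans, and the product roadmap for next year."

print(f"leaked corpus: {overlap_fraction(question, [leaked_doc]):.2f}")
print(f"clean corpus:  {overlap_fraction(question, [clean_doc]):.2f}")
```

A high overlap fraction does not prove the model memorized the answer, but it is a signal that the score on that item should be treated with caution.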

---

## Checkpoint

You've learned:
- What major LLM benchmarks measure (MMLU, HellaSwag, ARC, etc.)
- How to run benchmarks using lm-evaluation-harness
- How to compare models fairly
- How to integrate benchmarks with MLflow tracking
- Common benchmarking pitfalls to avoid

---

## Challenge (Optional)

Create a benchmark dashboard that:
1. Runs 5 standard benchmarks on a model
2. Logs all results to MLflow
3. Creates a radar chart comparing to reference scores
4. Saves a summary report as an artifact

---

## Further Reading

- [lm-evaluation-harness](https://github.com/EleutherAI/lm-evaluation-harness)
- [HELM Benchmark](https://crfm.stanford.edu/helm/) - Stanford's holistic evaluation
- [Open LLM Leaderboard](https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard)
- [Chatbot Arena](https://arena.lmsys.org/) - Human preference rankings

---

## Cleanup

In [None]:
# Clear GPU memory
import torch
import gc
import os

gc.collect()
torch.cuda.empty_cache()

# Clean up temporary files
if os.path.exists("benchmark_results.json"):
    os.remove("benchmark_results.json")

print("Cleanup complete!")
print_gpu_memory()

---

## Next Steps

Standard benchmarks are great for comparing to other models, but what about your specific use case? The next notebook covers **custom evaluation** - creating task-specific metrics and using LLM-as-judge for nuanced assessment.

**Continue to:** [04-custom-evaluation.ipynb](04-custom-evaluation.ipynb)