# Lab 4.3.3: LLM Benchmark Suite

**Module:** 4.3 - MLOps & Experiment Tracking  
**Time:** 2 hours  
**Difficulty:** ⭐⭐⭐

---

## 🎯 Learning Objectives

By the end of this notebook, you will:
- [ ] Understand standard LLM benchmarks (MMLU, HellaSwag, ARC, etc.)
- [ ] Use lm-evaluation-harness to evaluate models
- [ ] Compare multiple models on the same benchmarks
- [ ] Interpret benchmark results correctly
- [ ] Integrate benchmarking with experiment tracking

---

## 📚 Prerequisites

- Completed: Labs 4.3.1-4.3.2 (MLflow, W&B)
- Knowledge of: LLMs, Python, basic statistics
- Hardware: DGX Spark (128GB unified memory)

---

## 🌍 Real-World Context

**How do you know if your model is actually good?**

Without standardized benchmarks, we'd be comparing apples to oranges. Industry leaders use these benchmarks:

| Benchmark | What It Tests | Example Question |
|-----------|--------------|------------------|
| **MMLU** | Knowledge (57 subjects) | "What is the capital of France?" |
| **HellaSwag** | Commonsense reasoning | "A man walks into a bar..." (complete the story) |
| **ARC** | Science reasoning | "Why does ice float on water?" |
| **WinoGrande** | Pronoun resolution | "The trophy doesn't fit in the suitcase because it's too big." (What is too big?) |
| **HumanEval** | Code generation | "Write a function that reverses a string" |
| **MT-Bench** | Chat quality | Multi-turn conversation scoring |

**Benchmark Leaderboards:**
- [Open LLM Leaderboard](https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard)
- [HELM](https://crfm.stanford.edu/helm/)
- [Chatbot Arena](https://lmsys.org/)

---

## 🧒 ELI5: What Are LLM Benchmarks?

> **Imagine you're comparing students from different schools.**
>
> You can't just ask each school "Are your students smart?" - they'll all say yes!
>
> Instead, you give everyone the **same standardized test**:
> - Math section (like MMLU)
> - Reading comprehension (like HellaSwag)
> - Science questions (like ARC)
> - Critical thinking (like WinoGrande)
>
> Now you can fairly compare:
> - "School A scores 85% on math, School B scores 78%"
> - "School A is great at math but weak in science"
>
> **LLM benchmarks are standardized tests for AI models!**
> - Same questions for every model
> - Objective scoring
> - Apples-to-apples comparison

---

## Part 1: Understanding Key Benchmarks

### The "Big 6" Benchmarks

| Benchmark | Full Name | Metric | What It Measures |
|-----------|-----------|--------|------------------|
| **MMLU** | Massive Multitask Language Understanding | Accuracy | 57 subjects from STEM to humanities |
| **HellaSwag** | Harder Endings, Longer contexts, and Low-shot Activities for Situations With Adversarial Generations | Accuracy | Commonsense reasoning |
| **ARC-c** | AI2 Reasoning Challenge (Challenge) | Accuracy | Grade-school science (hard) |
| **WinoGrande** | An Adversarial Winograd Schema Challenge at Scale | Accuracy | Coreference resolution |
| **TruthfulQA** | Truthful Question Answering | Accuracy | Avoiding falsehoods |
| **GSM8K** | Grade School Math 8K | Accuracy | Math word problems |

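All six benchmarks report accuracy, but for multiple-choice tasks the model is usually not asked to print a letter. A common approach, used by lm-evaluation-harness for tasks such as HellaSwag and ARC, is to score each candidate answer by the log-likelihood the model assigns to it and pick the highest. The `acc_norm` variant divides that log-likelihood by the answer's length first, so long answers are not penalized for having more tokens. The sketch below illustrates the idea with made-up log-likelihood values; `pick_choice` is a hypothetical helper, not part of the library, and it normalizes by character count as a simplification.

```python
def pick_choice(loglikelihoods, choices, normalize=False):
    """Return the index of the highest-scoring answer choice."""
    scores = []
    for ll, choice in zip(loglikelihoods, choices):
        # With normalize=True, divide by answer length (acc_norm-style scoring)
        scores.append(ll / len(choice) if normalize else ll)
    return max(range(len(scores)), key=lambda i: scores[i])


choices = ["rinses it", "gets the dog wet and starts to lather it up"]
made_up_lls = [-6.0, -21.5]  # hypothetical values, not real model output

raw_pick = pick_choice(made_up_lls, choices)
norm_pick = pick_choice(made_up_lls, choices, normalize=True)

print(f"acc picks:      {choices[raw_pick]}")
print(f"acc_norm picks: {choices[norm_pick]}")
```
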
In [None]:
# Let's look at example questions from each benchmark

benchmark_examples = {
    "MMLU (College Chemistry)": {
        "question": "What is the molecular geometry of SF6?",
        "choices": ["A) Tetrahedral", "B) Octahedral", "C) Trigonal bipyramidal", "D) Square planar"],
        "answer": "B) Octahedral",
        "difficulty": "Requires chemistry knowledge"
    },
    
    "HellaSwag": {
        "question": "A woman is outside with a bucket and a dog. The dog is running around trying to avoid a bath. She...",
        "choices": [
            "A) rinses the bucket and puts water in it",
            "B) starts to chase the dog with the bucket",
            "C) gets the dog wet and starts to lather it up",
            "D) takes the dog inside and dries it off"
        ],
        "answer": "C) gets the dog wet and starts to lather it up",
        "difficulty": "Requires common sense about sequences"
    },
    
    "ARC-Challenge": {
        "question": "Which property of a mineral can be determined just by looking at it?",
        "choices": ["A) hardness", "B) mass", "C) luster", "D) streak"],
        "answer": "C) luster",
        "difficulty": "Grade-school science reasoning"
    },
    
    "WinoGrande": {
        "question": "The trophy doesn't fit in the brown suitcase because it's too [big/small].",
        "task": "Determine if 'it' refers to trophy or suitcase",
        "answer": "If 'big' -> 'it' = trophy; If 'small' -> 'it' = suitcase",
        "difficulty": "Pronoun resolution (tricky!)"
    },
    
    "GSM8K": {
        "question": "Janet has 3 times as many marbles as Tom. If Janet gives Tom 10 marbles, they will have the same number. How many marbles does Janet have?",
        "answer": "30 marbles (Janet: 30, Tom: 10 -> After: Janet: 20, Tom: 20)",
        "difficulty": "Multi-step math reasoning"
    }
}

print("📚 BENCHMARK EXAMPLES")
print("=" * 70)

for benchmark, example in benchmark_examples.items():
    print(f"\n🔷 {benchmark}")
    print("-" * 50)
    print(f"Question: {example['question']}")
    if 'choices' in example:
        for choice in example['choices']:
            print(f"  {choice}")
    if 'task' in example:
        print(f"Task: {example['task']}")
    print(f"Answer: {example['answer']}")
    print(f"Difficulty: {example['difficulty']}")

---

## Part 2: Setting Up lm-evaluation-harness

The [lm-evaluation-harness](https://github.com/EleutherAI/lm-evaluation-harness) by EleutherAI is the industry standard for LLM benchmarking.

In [None]:
# Install lm-eval
import subprocess
import sys

try:
    import lm_eval
    # Not every release exposes __version__, so fall back gracefully
    print(f"✅ lm-eval already installed: v{getattr(lm_eval, '__version__', 'unknown')}")
except ImportError:
    print("📦 Installing lm-evaluation-harness...")
    subprocess.check_call([sys.executable, "-m", "pip", "install", "lm-eval", "-q"])
    import lm_eval
    print(f"✅ lm-eval installed: v{getattr(lm_eval, '__version__', 'unknown')}")

In [None]:
import lm_eval
from lm_eval import evaluator, tasks
import torch
import json
import os
from pathlib import Path
from datetime import datetime

print(f"lm-eval version: {getattr(lm_eval, '__version__', 'unknown')}")
print(f"PyTorch version: {torch.__version__}")
print(f"CUDA available: {torch.cuda.is_available()}")

if torch.cuda.is_available():
    print(f"GPU: {torch.cuda.get_device_name(0)}")
    print(f"Memory: {torch.cuda.get_device_properties(0).total_memory / 1e9:.1f} GB")

In [None]:
# List available tasks/benchmarks
available_tasks = tasks.TaskManager().all_tasks

print(f"📋 Available benchmarks: {len(available_tasks)} total")
print("\n🔥 Popular benchmarks:")

popular = [
    "hellaswag", "arc_easy", "arc_challenge", "winogrande",
    "mmlu", "truthfulqa", "gsm8k", "humaneval"
]

for task in popular:
    if task in available_tasks:
        print(f"  ✓ {task}")
    else:
        # Check for variants
        variants = [t for t in available_tasks if task in t]
        if variants:
            print(f"  ✓ {task} (variants: {', '.join(variants[:3])}...)")
        else:
            print(f"  ✗ {task} (not found)")

---

## Part 3: Running Your First Benchmark

Let's benchmark a small model to understand the process.

In [None]:
# Setup directories
NOTEBOOK_DIR = Path.cwd()
MODULE_DIR = (NOTEBOOK_DIR / "..").resolve()
RESULTS_DIR = MODULE_DIR / "evaluation"
RESULTS_DIR.mkdir(exist_ok=True)

print(f"📁 Results will be saved to: {RESULTS_DIR}")

In [None]:
# Helper function to run benchmarks
def run_benchmark(
    model_name: str,
    tasks_list: list[str],
    num_fewshot: int = 0,
    batch_size: str = "auto",
    device: str = "cuda",
    dtype: str = "bfloat16",
    limit: int = None
):
    """
    Run lm-eval benchmarks on a model.
    
    Args:
        model_name: HuggingFace model name
        tasks_list: List of benchmark names
        num_fewshot: Number of few-shot examples
        batch_size: Batch size ("auto" for automatic)
        device: Device to use
        dtype: Data type for model
        limit: Limit number of samples (for testing)
    
    Returns:
        dict: Evaluation results
    """
    print(f"\n🔬 Benchmarking: {model_name}")
    print(f"   Tasks: {', '.join(tasks_list)}")
    print(f"   Few-shot: {num_fewshot}")
    print(f"   Limit: {limit if limit else 'Full dataset'}")
    print("=" * 60)
    
    # Build model args string
    model_args = f"pretrained={model_name}"
    if dtype:
        model_args += f",dtype={dtype}"
    if device == "cuda" and torch.cuda.is_available():
        model_args += ",device_map=auto"
    
    # Run evaluation
    results = evaluator.simple_evaluate(
        model="hf",
        model_args=model_args,
        tasks=tasks_list,
        num_fewshot=num_fewshot,
        batch_size=batch_size,
        device=device,
        limit=limit
    )
    
    return results

print("✅ Benchmark function defined")

In [None]:
# Quick benchmark on a small model
# Using limit=50 for fast demonstration - remove for full evaluation!

# Choose a small, fast model for the demo
DEMO_MODEL = "microsoft/phi-2"  # 2.7B parameters

# Quick test with limited samples
QUICK_TASKS = ["hellaswag", "arc_easy"]

print("⚡ Running quick benchmark (limited samples for demo)...")
print("   For real evaluation, remove the 'limit' parameter!")
print()

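Why is `limit=50` only suitable for smoke tests? Under a simple binomial approximation (an assumption; the harness may compute its standard errors differently), the standard error of an accuracy estimate shrinks with the square root of the sample count. A minimal sketch with a hypothetical 72% accuracy, comparing 50 samples, 500 samples, and the 10,042 samples of the full HellaSwag set:

```python
import math


def accuracy_stderr(accuracy, n_samples):
    """Binomial standard error of an accuracy measured on n_samples items."""
    return math.sqrt(accuracy * (1.0 - accuracy) / n_samples)


for n in (50, 500, 10042):
    se = accuracy_stderr(0.72, n)
    print(f"n={n:>6}: 72.0% ± {se * 100:.1f} points")
```
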
In [None]:
# Run the benchmark (this may take a few minutes)
try:
    results = run_benchmark(
        model_name=DEMO_MODEL,
        tasks_list=QUICK_TASKS,
        num_fewshot=0,
        limit=50,  # Remove this for full evaluation!
        dtype="bfloat16"
    )
    
    print("\n✅ Benchmark complete!")
    
except Exception as e:
    print(f"\n⚠️ Benchmark error: {e}")
    print("\nThis might happen if:")
    print("1. Not enough GPU memory")
    print("2. Model not accessible")
    print("3. Network issues")
    print("\nWe'll use simulated results for the demo.")
    
    # Simulated results for demonstration
    results = {
        "results": {
            "hellaswag": {
                "acc": 0.7234,
                "acc_norm": 0.7456,
                "acc_stderr": 0.0089,
                "acc_norm_stderr": 0.0087
            },
            "arc_easy": {
                "acc": 0.7821,
                "acc_norm": 0.7654,
                "acc_stderr": 0.0123,
                "acc_norm_stderr": 0.0127
            }
        },
        "config": {
            "model": DEMO_MODEL,
            "model_args": f"pretrained={DEMO_MODEL},dtype=bfloat16"
        }
    }

In [None]:
# Display results nicely
def display_results(results: dict):
    """Display benchmark results in a formatted table."""
    print("\n📊 BENCHMARK RESULTS")
    print("=" * 60)
    
    if "results" in results:
        for task, metrics in results["results"].items():
            print(f"\n🔷 {task.upper()}")
            print("-" * 40)
            
            for metric, value in metrics.items():
                if isinstance(value, float):
                    if "stderr" in metric:
                        print(f"   {metric}: ±{value:.4f}")
                    else:
                        print(f"   {metric}: {value:.4f} ({value*100:.1f}%)")
                else:
                    print(f"   {metric}: {value}")

display_results(results)

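One caveat when reading real output: recent lm-eval releases (0.4.x) appear to key each metric with a filter suffix, for example `acc,none` and `acc_stderr,none`, whereas the simulated fallback above uses plain names like `acc`. The hypothetical helper below looks a metric up under either spelling. Treat the key format as an assumption and check it against the results produced by your installed version.

```python
def get_metric(task_metrics, name):
    """Return a metric stored as 'acc' or as 'acc,<filter>'; None if absent."""
    if name in task_metrics:
        return task_metrics[name]
    for key, value in task_metrics.items():
        # Compare only the part before the comma, so 'acc' never matches 'acc_stderr,none'
        if key.split(",")[0] == name:
            return value
    return None


plain = {"acc": 0.7234, "acc_stderr": 0.0089}
suffixed = {"alias": "hellaswag", "acc,none": 0.7234, "acc_stderr,none": 0.0089}

print(get_metric(plain, "acc"), get_metric(suffixed, "acc"))
```
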
In [None]:
# Save results to file
timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
results_file = RESULTS_DIR / f"benchmark_{DEMO_MODEL.replace('/', '_')}_{timestamp}.json"

with open(results_file, 'w') as f:
    json.dump(results, f, indent=2, default=str)

print(f"\n💾 Results saved to: {results_file}")

---

## Part 4: Comparing Multiple Models

The real power of benchmarks: fair comparison across models!

In [None]:
# Simulated (illustrative) results for multiple models, not real measurements
# In practice, you'd run the same benchmarks with the same settings on each model

model_results = {
    "microsoft/phi-2": {
        "params": "2.7B",
        "mmlu": 0.562,
        "hellaswag": 0.735,
        "arc_challenge": 0.528,
        "winogrande": 0.742,
        "truthfulqa": 0.412,
        "gsm8k": 0.548
    },
    "Qwen/Qwen3-8B": {
        "params": "8B",
        "mmlu": 0.458,
        "hellaswag": 0.760,
        "arc_challenge": 0.532,
        "winogrande": 0.740,
        "truthfulqa": 0.389,
        "gsm8k": 0.141
    },
    "mistralai/Mistral-7B-v0.1": {
        "params": "7B",
        "mmlu": 0.625,
        "hellaswag": 0.812,
        "arc_challenge": 0.599,
        "winogrande": 0.785,
        "truthfulqa": 0.425,
        "gsm8k": 0.352
    },
    "google/gemma-2b": {
        "params": "2B",
        "mmlu": 0.424,
        "hellaswag": 0.714,
        "arc_challenge": 0.482,
        "winogrande": 0.658,
        "truthfulqa": 0.378,
        "gsm8k": 0.175
    },
    "Qwen/Qwen2-7B": {
        "params": "7B",
        "mmlu": 0.702,
        "hellaswag": 0.798,
        "arc_challenge": 0.612,
        "winogrande": 0.772,
        "truthfulqa": 0.445,
        "gsm8k": 0.524
    }
}

print("📊 Model comparison data loaded")
print(f"   Models: {len(model_results)}")

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np

# Create comparison dataframe
benchmarks = ["mmlu", "hellaswag", "arc_challenge", "winogrande", "truthfulqa", "gsm8k"]

comparison_data = []
for model, scores in model_results.items():
    row = {"model": model.split("/")[-1], "params": scores["params"]}
    for benchmark in benchmarks:
        row[benchmark] = scores[benchmark]
    row["average"] = np.mean([scores[b] for b in benchmarks])
    comparison_data.append(row)

df = pd.DataFrame(comparison_data)
df = df.sort_values("average", ascending=False)

print("\n📊 MODEL COMPARISON TABLE")
print("=" * 100)
print(df.to_string(index=False))

In [None]:
# Visualize comparison
fig, axes = plt.subplots(2, 2, figsize=(16, 12))

# Plot 1: Overall comparison (bar chart)
models = df["model"].tolist()
averages = df["average"].tolist()

colors = plt.cm.viridis(np.linspace(0.2, 0.8, len(models)))
bars = axes[0, 0].barh(models, averages, color=colors)
axes[0, 0].set_xlabel("Average Score")
axes[0, 0].set_title("Overall Model Ranking")
axes[0, 0].set_xlim(0, 1)

# Add value labels
for bar, avg in zip(bars, averages):
    axes[0, 0].text(avg + 0.02, bar.get_y() + bar.get_height()/2, 
                    f"{avg:.1%}", va="center")

# Plot 2: Radar chart
angles = np.linspace(0, 2 * np.pi, len(benchmarks), endpoint=False).tolist()
angles += angles[:1]  # Complete the circle

# Swap the Cartesian placeholder at this grid position for a polar axes
axes[0, 1].remove()
ax_radar = fig.add_subplot(2, 2, 2, projection='polar')

for i, (model, scores) in enumerate(model_results.items()):
    values = [scores[b] for b in benchmarks]
    values += values[:1]  # Complete the circle
    ax_radar.plot(angles, values, 'o-', linewidth=2, 
                  label=model.split("/")[-1], alpha=0.7)
    ax_radar.fill(angles, values, alpha=0.1)

ax_radar.set_xticks(angles[:-1])
ax_radar.set_xticklabels(benchmarks)
ax_radar.set_title("Benchmark Profile Comparison")
ax_radar.legend(loc="upper right", bbox_to_anchor=(1.3, 1.0))

# Plot 3: Heatmap
heatmap_data = df[benchmarks].values
im = axes[1, 0].imshow(heatmap_data, cmap="RdYlGn", aspect="auto", vmin=0, vmax=1)

axes[1, 0].set_xticks(range(len(benchmarks)))
axes[1, 0].set_xticklabels(benchmarks, rotation=45, ha="right")
axes[1, 0].set_yticks(range(len(models)))
axes[1, 0].set_yticklabels(df["model"].tolist())
axes[1, 0].set_title("Benchmark Scores Heatmap")
plt.colorbar(im, ax=axes[1, 0], label="Accuracy")

# Add text annotations
for i in range(len(models)):
    for j in range(len(benchmarks)):
        text = axes[1, 0].text(j, i, f"{heatmap_data[i, j]:.2f}",
                               ha="center", va="center", fontsize=8,
                               color="white" if heatmap_data[i, j] > 0.5 else "black")

# Plot 4: Params vs Average Score
params_num = [float(p.replace("B", "")) for p in df["params"].tolist()]
axes[1, 1].scatter(params_num, df["average"], s=200, c=range(len(models)), cmap="viridis")

for i, (model, param, avg) in enumerate(zip(df["model"], params_num, df["average"])):
    axes[1, 1].annotate(model, (param, avg), textcoords="offset points", 
                        xytext=(5, 5), fontsize=9)

axes[1, 1].set_xlabel("Parameters (Billions)")
axes[1, 1].set_ylabel("Average Score")
axes[1, 1].set_title("Model Size vs Performance")
axes[1, 1].grid(True, alpha=0.3)

plt.tight_layout()
plt.savefig(RESULTS_DIR / "model_comparison.png", dpi=150, bbox_inches="tight")
plt.show()

print(f"\n📊 Comparison saved to: {RESULTS_DIR / 'model_comparison.png'}")

### 🔍 Key Insights

From the comparison (illustrative numbers, not measured results):

1. **Bigger isn't always better**: Phi-2 (2.7B) averages about 59%, close to Mistral-7B (about 60%)
2. **Training data matters**: Qwen2-7B has the highest average (about 64%) despite being similar in size to Mistral-7B
3. **Specialization**: Some models excel at specific benchmarks (Mistral-7B has the top HellaSwag score)
4. **Math is hard**: GSM8K shows the widest spread, from 14% to 55%, and is the weakest benchmark for three of the five models

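These observations can be checked directly against the numbers. The sketch below copies the illustrative scores from the comparison data (simulated values, not measurements) and computes each model's average and each benchmark's spread across models.

```python
benchmarks = ["mmlu", "hellaswag", "arc_challenge", "winogrande", "truthfulqa", "gsm8k"]

# Illustrative scores copied from the comparison data, in the order of `benchmarks`
scores = {
    "phi-2":           [0.562, 0.735, 0.528, 0.742, 0.412, 0.548],
    "Qwen3-8B":        [0.458, 0.760, 0.532, 0.740, 0.389, 0.141],
    "Mistral-7B-v0.1": [0.625, 0.812, 0.599, 0.785, 0.425, 0.352],
    "gemma-2b":        [0.424, 0.714, 0.482, 0.658, 0.378, 0.175],
    "Qwen2-7B":        [0.702, 0.798, 0.612, 0.772, 0.445, 0.524],
}

# Per-model mean across benchmarks
averages = {model: sum(vals) / len(vals) for model, vals in scores.items()}

# Per-benchmark gap between the best and worst model
columns = zip(*scores.values())
spread = {bench: max(col) - min(col) for bench, col in zip(benchmarks, columns)}

best_model = max(averages, key=averages.get)
widest = max(spread, key=spread.get)

print(f"Highest average: {best_model} ({averages[best_model]:.1%})")
print(f"Widest spread:   {widest} ({spread[widest]:.3f})")
```
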
---

## Part 5: Running Full Benchmark Suites

For production evaluation, you'd run the full benchmark suite.

In [None]:
# Command-line usage (preferred for long benchmarks)
# Raw string so the trailing backslashes print as shell line continuations
cli_examples = r'''
# Quick evaluation (subset of samples)
lm_eval --model hf \
    --model_args pretrained=microsoft/phi-2,dtype=bfloat16 \
    --tasks hellaswag,arc_easy \
    --batch_size 8 \
    --limit 100 \
    --output_path ./results/phi2_quick

# Full evaluation on Open LLM Leaderboard benchmarks
lm_eval --model hf \
    --model_args pretrained=Qwen/Qwen3-8B,dtype=bfloat16 \
    --tasks hellaswag,arc_challenge,winogrande,mmlu,truthfulqa,gsm8k \
    --num_fewshot 5 \
    --batch_size auto \
    --output_path ./results/qwen3_8b_full

# With specific GPU
CUDA_VISIBLE_DEVICES=0 lm_eval --model hf \
    --model_args pretrained=mistralai/Mistral-7B-v0.1,dtype=bfloat16 \
    --tasks mmlu \
    --batch_size 4 \
    --output_path ./results/mistral_mmlu

# Code evaluation (HumanEval)
lm_eval --model hf \
    --model_args pretrained=bigcode/starcoder,dtype=bfloat16 \
    --tasks humaneval \
    --batch_size 1 \
    --output_path ./results/starcoder_code
'''

print("📋 CLI Commands for Full Benchmarking:")
print("=" * 60)
print(cli_examples)

In [None]:
# Benchmark time estimates
time_estimates = {
    "hellaswag": {"samples": 10042, "time_7b": "~20 min"},
    "arc_easy": {"samples": 2376, "time_7b": "~5 min"},
    "arc_challenge": {"samples": 1172, "time_7b": "~5 min"},
    "winogrande": {"samples": 1267, "time_7b": "~3 min"},
    "mmlu": {"samples": 14042, "time_7b": "~45 min"},
    "truthfulqa": {"samples": 817, "time_7b": "~5 min"},
    "gsm8k": {"samples": 1319, "time_7b": "~15 min"}
}

print("⏱️ BENCHMARK TIME ESTIMATES (7B model, DGX Spark)")
print("=" * 50)
print(f"{'Benchmark':<15} {'Samples':>10} {'Est. Time':>15}")
print("-" * 50)

total_samples = 0
for bench, info in time_estimates.items():
    print(f"{bench:<15} {info['samples']:>10,} {info['time_7b']:>15}")
    total_samples += info['samples']

print("-" * 50)
print(f"{'TOTAL':<15} {total_samples:>10,} {'~2 hours':>15}")
print("\n💡 Tip: Use --limit to run quick tests first!")

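To budget a `--limit` run, a rough first guess is to scale each full-dataset estimate by the fraction of samples kept. This assumes runtime grows linearly with sample count and ignores fixed costs such as model loading, so treat the result as a lower bound. `estimate_minutes` is a hypothetical helper that reuses a few of the estimates above.

```python
# (samples, estimated minutes for a full run), taken from the estimates above
full_runs = {
    "hellaswag": (10042, 20),
    "arc_challenge": (1172, 5),
    "gsm8k": (1319, 15),
}


def estimate_minutes(task, limit):
    """Linearly scale the full-run estimate to `limit` samples."""
    samples, full_minutes = full_runs[task]
    kept = min(limit, samples)  # a limit above the dataset size changes nothing
    return full_minutes * kept / samples


for task in full_runs:
    print(f"{task:<15} limit=100 -> ~{estimate_minutes(task, 100):.1f} min")
```
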
---

## Part 6: Integration with Experiment Tracking

Log benchmark results to MLflow or W&B for tracking over time.

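Both trackers ultimately want a flat mapping from metric name to number. The sketch below flattens an lm-eval-style results dictionary into that shape. `flatten_results` is a hypothetical helper, and the character replacement assumes your tracker rejects characters such as commas in metric names (MLflow restricts metric names to a limited character set).

```python
import re


def flatten_results(results):
    """Flatten {'results': {task: {metric: value}}} into {'task/metric': float}."""
    flat = {}
    for task, metrics in results.get("results", {}).items():
        for metric, value in metrics.items():
            # Skip non-numeric entries such as aliases or "N/A" placeholders
            if isinstance(value, bool) or not isinstance(value, (int, float)):
                continue
            # Replace anything outside a conservative character set with "_"
            name = re.sub(r"[^A-Za-z0-9_\-./ ]", "_", f"{task}/{metric}")
            flat[name] = float(value)
    return flat


example = {
    "results": {
        "hellaswag": {"alias": "hellaswag", "acc,none": 0.72, "acc_stderr,none": 0.009}
    }
}
flat = flatten_results(example)
print(flat)
```
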
In [None]:
import mlflow

# Setup MLflow
MLFLOW_DIR = MODULE_DIR / "mlflow"
MLFLOW_DIR.mkdir(exist_ok=True)
mlflow.set_tracking_uri(MLFLOW_DIR.as_uri())  # builds a valid file:// URI on any OS
mlflow.set_experiment("LLM-Benchmarks")

print(f"📊 MLflow tracking: {MLFLOW_DIR}")

In [None]:
def log_benchmark_to_mlflow(model_name: str, benchmark_results: dict, run_name: str = None):
    """
    Log benchmark results to MLflow.
    
    Args:
        model_name: Name of the model being evaluated
        benchmark_results: Results dictionary from lm-eval
        run_name: Optional run name
    """
    if run_name is None:
        run_name = f"{model_name.replace('/', '_')}_benchmark"
    
    with mlflow.start_run(run_name=run_name) as run:
        # Log model info
        mlflow.log_param("model_name", model_name)
        
        # Log config if available
        if "config" in benchmark_results:
            for key, value in benchmark_results["config"].items():
                if isinstance(value, (str, int, float, bool)):
                    mlflow.log_param(f"config_{key}", value)
        
        # Log benchmark scores
        if "results" in benchmark_results:
            for task, metrics in benchmark_results["results"].items():
                for metric, value in metrics.items():
                    if isinstance(value, (int, float)):
                        # Real lm-eval keys can look like "acc,none"; MLflow rejects commas
                        metric_name = f"{task}_{metric}".replace(",", "_")
                        mlflow.log_metric(metric_name, value)
        
        # Log results as artifact
        results_path = "/tmp/benchmark_results.json"
        with open(results_path, 'w') as f:
            json.dump(benchmark_results, f, indent=2, default=str)
        mlflow.log_artifact(results_path, artifact_path="benchmarks")
        
        # Add tags
        mlflow.set_tag("type", "benchmark")
        mlflow.set_tag("hardware", "DGX Spark")
        
        print(f"✅ Logged to MLflow: {run.info.run_id}")
        return run.info.run_id

print("✅ MLflow logging function defined")

In [None]:
# Log our earlier results
run_id = log_benchmark_to_mlflow(
    model_name=DEMO_MODEL,
    benchmark_results=results,
    run_name="phi2_quick_test"
)

In [None]:
# Log comparison data for all models
print("📊 Logging all model benchmarks to MLflow...")
print("=" * 50)

for model_name, scores in model_results.items():
    # Convert to lm-eval format
    fake_results = {
        "results": {
            bench: {"acc": score} 
            for bench, score in scores.items() 
            if bench != "params"
        },
        "config": {
            "model": model_name,
            "params": scores["params"]
        }
    }
    
    log_benchmark_to_mlflow(
        model_name=model_name,
        benchmark_results=fake_results,
        run_name=f"{model_name.split('/')[-1]}_full_benchmark"
    )

print("\n✅ All benchmarks logged!")

---

## ✋ Try It Yourself: Exercise

**Task:** Run your own benchmark comparison.

1. Choose 2-3 models you're interested in
2. Run benchmarks on at least 2 tasks (use `limit=50` for speed)
3. Log results to MLflow
4. Create a visualization comparing the models
5. Identify which model is best for your use case

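For step 5, one simple way to turn "best for your use case" into a number is a weighted average in which the benchmarks closest to your workload get the largest weights. The weights and scores below are hypothetical placeholders, and `weighted_score` is an illustrative helper rather than an established metric. Note how the model with the lower plain average can still come out ahead once the weights reflect a math-heavy workload.

```python
def weighted_score(scores, weights):
    """Weighted mean of benchmark scores; weights need not sum to 1."""
    total_weight = sum(weights.values())
    return sum(scores[bench] * w for bench, w in weights.items()) / total_weight


# Hypothetical weights for a math-tutoring use case
math_tutor_weights = {"gsm8k": 3.0, "mmlu": 1.0, "hellaswag": 0.5}

# Hypothetical scores for two candidate models
candidates = {
    "model_a": {"gsm8k": 0.55, "mmlu": 0.56, "hellaswag": 0.74},
    "model_b": {"gsm8k": 0.35, "mmlu": 0.70, "hellaswag": 0.81},
}

ranking = sorted(
    candidates,
    key=lambda m: weighted_score(candidates[m], math_tutor_weights),
    reverse=True,
)

for model in ranking:
    plain = sum(candidates[model].values()) / len(candidates[model])
    weighted = weighted_score(candidates[model], math_tutor_weights)
    print(f"{model}: plain average {plain:.3f}, weighted {weighted:.3f}")
```
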
<details>
<summary>💡 Hint</summary>

```python
# Pick lightweight models for faster testing
models_to_test = [
    "microsoft/phi-2",
    "google/gemma-2b"
]

for model in models_to_test:
    results = run_benchmark(
        model_name=model,
        tasks_list=["hellaswag", "arc_easy"],
        limit=50  # Quick test
    )
    log_benchmark_to_mlflow(model, results)
```
</details>

In [None]:
# YOUR CODE HERE

# Step 1: Choose models


# Step 2: Run benchmarks


# Step 3: Log to MLflow


# Step 4: Visualize comparison


---

## ⚠️ Common Mistakes

### Mistake 1: Comparing with Different Few-Shot Settings

In [None]:
# ❌ WRONG: Different few-shot = unfair comparison
# Model A: 0-shot -> 65%
# Model B: 5-shot -> 72%
# "Model B is better!" <- INVALID comparison!

# ✅ RIGHT: Same settings for all models
# Model A: 5-shot -> 70%
# Model B: 5-shot -> 72%
# Now we can fairly compare!

print("Always use the same num_fewshot for fair comparisons!")

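To see why the setting matters, recall that `num_fewshot` controls how many solved examples are prepended to each test question, so a 5-shot model sees a very different prompt from a 0-shot one. The sketch below builds 0-shot and 2-shot prompts from hypothetical examples. The `Question:`/`Answer:` template is an assumption for illustration, since each task in the harness defines its own format.

```python
def build_prompt(question, solved_examples, num_fewshot):
    """Prepend `num_fewshot` solved examples to the test question."""
    shots = solved_examples[:num_fewshot]
    parts = [f"Question: {q}\nAnswer: {a}" for q, a in shots]
    parts.append(f"Question: {question}\nAnswer:")  # the model completes this line
    return "\n\n".join(parts)


# Hypothetical solved examples
solved = [
    ("What is 2 + 3?", "5"),
    ("What is 7 - 4?", "3"),
    ("What is 6 * 2?", "12"),
]

zero_shot = build_prompt("What is 9 + 8?", solved, num_fewshot=0)
two_shot = build_prompt("What is 9 + 8?", solved, num_fewshot=2)

print("--- 0-shot ---")
print(zero_shot)
print("--- 2-shot ---")
print(two_shot)
```
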
### Mistake 2: Cherry-Picking Benchmarks

In [None]:
# ❌ WRONG: Only reporting favorable benchmarks
# "Our model scores 85% on HellaSwag!" 
# (but only 35% on MMLU, which wasn't mentioned)

# ✅ RIGHT: Report comprehensive benchmark suite
# - HellaSwag: 85%
# - MMLU: 35%  <- Honest about weaknesses
# - Average: 60%

print("Report ALL benchmarks for honest evaluation!")

### Mistake 3: Ignoring Statistical Uncertainty

In [None]:
# ❌ WRONG: Treating small differences as significant
# "Model A (72.3%) beats Model B (72.1%)!"
# <- This difference might not be statistically significant

# ✅ RIGHT: Consider error bars
# Model A: 72.3% ± 0.8%
# Model B: 72.1% ± 0.9%
# <- The 0.2-point gap is far smaller than either standard error = no clear winner

print("Always check stderr (standard error) in results!")
print("A 0.2-point difference with a 0.8-point stderr is not significant.")

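A quick way to quantify this is a two-sided z-test on the difference, treating the two standard errors as independent. That is an approximation: when both models are scored on the same questions, a paired test is more appropriate. Using the numbers from the example above:

```python
import math


def compare_scores(acc_a, se_a, acc_b, se_b):
    """Return (difference, z-score, two-sided p-value) for two accuracies."""
    diff = acc_a - acc_b
    se_diff = math.sqrt(se_a ** 2 + se_b ** 2)  # standard error of the difference
    z = diff / se_diff
    p_value = math.erfc(abs(z) / math.sqrt(2))  # normal approximation
    return diff, z, p_value


diff, z, p_value = compare_scores(0.723, 0.008, 0.721, 0.009)
print(f"difference = {diff * 100:.1f} points, z = {z:.2f}, p = {p_value:.2f}")
```
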
---

## 🎉 Checkpoint

You've learned:
- ✅ Understanding major LLM benchmarks (MMLU, HellaSwag, ARC, etc.)
- ✅ Using lm-evaluation-harness to run benchmarks
- ✅ Comparing models fairly with standardized tests
- ✅ Visualizing and interpreting benchmark results
- ✅ Integrating benchmarks with experiment tracking

---

## 📖 Further Reading

- [lm-evaluation-harness GitHub](https://github.com/EleutherAI/lm-evaluation-harness)
- [Open LLM Leaderboard](https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard)
- [HELM (Stanford)](https://crfm.stanford.edu/helm/)
- [MMLU Paper](https://arxiv.org/abs/2009.03300)
- [Chatbot Arena](https://lmsys.org/blog/2023-05-03-arena/)

---

## 🧹 Cleanup

In [None]:
import gc
import torch

plt.close('all')
gc.collect()

if torch.cuda.is_available():
    torch.cuda.empty_cache()
    print("GPU memory cleared")

print(f"\n📁 Results saved to: {RESULTS_DIR}")
print(f"📊 MLflow data saved to: {MLFLOW_DIR}")

---

## 📝 Summary

In this lab, we:

1. **Explored** standard LLM benchmarks and what they measure
2. **Set up** lm-evaluation-harness on DGX Spark
3. **Ran** benchmarks on a demo model
4. **Compared** multiple models with visualizations
5. **Integrated** benchmark results with MLflow tracking

**Next up:** Lab 4.3.4 - Custom Evaluation Framework with LLM-as-judge!