# Week 1 Exercise: Benchmarking Language Models

In this exercise, you'll gain hands-on experience with the fundamental concepts of benchmarking LLMs:
- Loading and using a language model
- Examining token probabilities
- Testing different sampling strategies
- Implementing three prompting approaches
- Computing evaluation metrics
- Measuring statistical significance

## Setup

First, install the required libraries:

In [None]:
!pip install transformers torch numpy scipy scikit-learn statsmodels -q

In [None]:
import torch
import numpy as np
from transformers import AutoModelForCausalLM, AutoTokenizer
from scipy import stats
from sklearn.metrics import precision_score, recall_score, f1_score
import warnings
warnings.filterwarnings('ignore')

## Part 1: Loading a Language Model

We'll use GPT-2, a small but capable autoregressive language model.

In [None]:
# Load GPT-2 small (124M parameters)
model_name = "gpt2"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

# Set padding token
tokenizer.pad_token = tokenizer.eos_token

# Move to GPU if available
device = "cuda" if torch.cuda.is_available() else "cpu"
model = model.to(device)

print(f"Model loaded: {model_name}")
print(f"Device: {device}")
print(f"Vocabulary size: {len(tokenizer)}")

## Part 2: Understanding Token Probabilities

Let's examine what the model actually computes: a probability distribution over the next token.

In [None]:
def get_next_token_probabilities(text):
    """Get probability distribution over next token given input text."""
    inputs = tokenizer(text, return_tensors="pt").to(device)
    
    with torch.no_grad():
        outputs = model(**inputs)
        # Get logits for the last token position
        logits = outputs.logits[0, -1, :]
        # Convert to probabilities
        probs = torch.softmax(logits, dim=0)
    
    return probs.cpu()

# Example: What comes after "The capital of France is"?
prompt = "The capital of France is"
probs = get_next_token_probabilities(prompt)

# Show top 10 most likely tokens
top_k = 10
top_probs, top_indices = torch.topk(probs, top_k)

print(f"Prompt: '{prompt}'")
print(f"\nTop {top_k} most likely next tokens:")
for prob, idx in zip(top_probs, top_indices):
    token = tokenizer.decode([idx.item()])  # convert the 0-d tensor to a plain int
    print(f"  '{token}' → {prob.item():.4f}")
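
The softmax step inside `get_next_token_probabilities` can be looked at in isolation. The sketch below applies it to a made-up four-token vocabulary. The names `softmax`, `toy_vocab`, and `toy_logits` are illustrative assumptions, and the logit values are invented rather than taken from GPT-2.

```python
import math

def softmax(logits):
    """Convert raw scores (logits) into probabilities that sum to 1."""
    # Subtracting the max keeps exp() from overflowing; it does not change the result.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical logits for a four-token vocabulary (made-up values, not model output)
toy_vocab = [" Paris", " London", " the", " a"]
toy_logits = [4.0, 2.0, 1.0, 0.5]
toy_probs = softmax(toy_logits)

# Print tokens from most to least likely
for token, p in sorted(zip(toy_vocab, toy_probs), key=lambda pair: -pair[1]):
    print(f"  '{token}' -> {p:.4f}")
```

Only the differences between logits matter: adding a constant to every logit leaves the probabilities unchanged.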

### Exercise 2.1: Cloze Evaluation

Use token probabilities to evaluate factual knowledge.

In [None]:
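def get_token_probability(prompt, target_token):
    """Get the probability of a specific token following the prompt."""
    probs = get_next_token_probabilities(prompt)
    # Only the first sub-token of the target is scored, so a target that
    # tokenizes into several pieces is compared on its first piece only.
    target_id = tokenizer.encode(target_token, add_special_tokens=False)[0]
    return probs[target_id].item()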

# Test factual knowledge
test_cases = [
    ("The capital of France is", " Paris", " London"),
    ("The largest planet in our solar system is", " Jupiter", " Mars"),
    ("Water freezes at", " 0", " 100"),
]

print("Cloze Evaluation Results:")
for prompt, correct, incorrect in test_cases:
    prob_correct = get_token_probability(prompt, correct)
    prob_incorrect = get_token_probability(prompt, incorrect)
    
    print(f"\nPrompt: '{prompt}'")
    print(f"  P('{correct}') = {prob_correct:.4f}")
    print(f"  P('{incorrect}') = {prob_incorrect:.4f}")
    print(f"  Correct answer favored: {prob_correct > prob_incorrect}")
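
When an answer spans several tokens, one common approach is to score the whole answer with the chain rule: multiply the per-step probabilities, or equivalently sum their logs. The sketch below shows only that arithmetic. The function `sequence_log_prob` and all probabilities are hypothetical; in practice each value would come from a model call conditioned on the prompt plus the earlier answer tokens.

```python
import math

def sequence_log_prob(step_probs):
    """Log-probability of a multi-token answer: the sum of per-step log-probs."""
    return sum(math.log(p) for p in step_probs)

# Hypothetical per-step probabilities (made-up numbers, not GPT-2 outputs).
# Each entry stands for P(token_i | prompt + earlier answer tokens).
candidates = {
    " New York": [0.20, 0.90],     # " New", then " York"
    " Los Angeles": [0.05, 0.80],  # " Los", then " Angeles"
}

scores = {answer: sequence_log_prob(ps) for answer, ps in candidates.items()}
best = max(scores, key=scores.get)

for answer, lp in scores.items():
    print(f"log P('{answer}') = {lp:.3f}  (P = {math.exp(lp):.3f})")
print(f"Preferred answer: '{best}'")
```

Summing logs avoids numerical underflow for long answers. Longer answers tend to receive lower total probability, so comparisons between answers of different lengths may need length normalization.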

**Your Task:** Add 3 more test cases for a concept relevant to your project. Test whether the model assigns higher probability to correct answers.

In [None]:
# TODO: Add your test cases here
my_test_cases = [
    # ("prompt", " correct_answer", " incorrect_answer"),
]

# Test your cases
# for prompt, correct, incorrect in my_test_cases:
#     ...

## Part 3: Autoregressive Sampling Strategies

Different sampling methods produce different text characteristics.

In [None]:
def generate_text(prompt, strategy="greedy", max_length=50, **kwargs):
    """Generate text using different sampling strategies.

    Note: max_length counts the prompt tokens as well as the generated ones.
    """
    inputs = tokenizer(prompt, return_tensors="pt").to(device)
    
    if strategy == "greedy":
        outputs = model.generate(
            **inputs,
            max_length=max_length,
            do_sample=False,
            pad_token_id=tokenizer.eos_token_id
        )
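    elif strategy == "temperature":
        temp = kwargs.get("temperature", 1.0)
        outputs = model.generate(
            **inputs,
            max_length=max_length,
            do_sample=True,
            temperature=temp,
            # generate() applies top_k=50 by default in many transformers
            # versions; top_k=0 disables it so only temperature has an effect.
            top_k=0,
            pad_token_id=tokenizer.eos_token_id
        )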
    elif strategy == "top_k":
        k = kwargs.get("k", 50)
        outputs = model.generate(
            **inputs,
            max_length=max_length,
            do_sample=True,
            top_k=k,
            pad_token_id=tokenizer.eos_token_id
        )
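    elif strategy == "top_p":
        p = kwargs.get("p", 0.9)
        outputs = model.generate(
            **inputs,
            max_length=max_length,
            do_sample=True,
            top_p=p,
            # Disable the default top-k filter so only nucleus sampling applies.
            top_k=0,
            pad_token_id=tokenizer.eos_token_id
        )
    else:
        raise ValueError(f"Unknown strategy: {strategy!r}")
    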
    return tokenizer.decode(outputs[0], skip_special_tokens=True)

# Test different strategies
prompt = "Once upon a time, in a distant galaxy,"

print("Comparing Sampling Strategies:\n")
print("="*80)
print(f"GREEDY:\n{generate_text(prompt, 'greedy')}")
print("="*80)
print(f"TEMPERATURE (0.7):\n{generate_text(prompt, 'temperature', temperature=0.7)}")
print("="*80)
print(f"TOP-K (k=50):\n{generate_text(prompt, 'top_k', k=50)}")
print("="*80)
print(f"TOP-P (p=0.9):\n{generate_text(prompt, 'top_p', p=0.9)}")
print("="*80)
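
To see what each strategy does to the next-token distribution before a token is drawn, here is a minimal sketch on a made-up four-token distribution. The helper names (`apply_temperature`, `top_k_filter`, `top_p_filter`) are illustrative assumptions and are not the functions `model.generate` uses internally.

```python
import math
import random

def apply_temperature(probs, temperature):
    """Rescale a distribution: T < 1 sharpens it, T > 1 flattens it."""
    # Dividing log-probabilities by T is equivalent to dividing logits by T.
    scaled = [math.log(p) / temperature for p in probs]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def top_k_filter(probs, k):
    """Keep the k most likely tokens and renormalize."""
    keep = sorted(range(len(probs)), key=lambda i: -probs[i])[:k]
    total = sum(probs[i] for i in keep)
    return [probs[i] / total if i in keep else 0.0 for i in range(len(probs))]

def top_p_filter(probs, p):
    """Keep the smallest set of top tokens whose cumulative mass reaches p."""
    order = sorted(range(len(probs)), key=lambda i: -probs[i])
    keep, cumulative = [], 0.0
    for i in order:
        keep.append(i)
        cumulative += probs[i]
        if cumulative >= p:
            break
    total = sum(probs[i] for i in keep)
    return [probs[i] / total if i in keep else 0.0 for i in range(len(probs))]

# Hypothetical next-token distribution (made-up values)
toy_probs = [0.5, 0.3, 0.15, 0.05]

greedy_choice = max(range(len(toy_probs)), key=lambda i: toy_probs[i])
sharpened = apply_temperature(toy_probs, 0.5)
top2 = top_k_filter(toy_probs, 2)
nucleus = top_p_filter(toy_probs, 0.9)

rng = random.Random(0)  # seeded so the draw is repeatable
sampled = rng.choices(range(len(toy_probs)), weights=nucleus, k=1)[0]

print("Greedy picks index:", greedy_choice)
print("Temperature 0.5:   ", [round(x, 3) for x in sharpened])
print("Top-k (k=2):       ", [round(x, 3) for x in top2])
print("Top-p (p=0.9):     ", [round(x, 3) for x in nucleus])
print("Sampled index from the nucleus:", sampled)
```

Greedy decoding involves no random draw, while the other three strategies reshape the distribution and then sample from it.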

### Exercise 3.1: Sampling Reproducibility

For benchmarking, we need reproducible results. Test which strategies are deterministic.

In [None]:
# TODO: Run each sampling strategy multiple times
# Which ones produce identical outputs? Why?

test_prompt = "The meaning of life is"

# Test greedy
print("Greedy (run 1):", generate_text(test_prompt, "greedy", max_length=20))
print("Greedy (run 2):", generate_text(test_prompt, "greedy", max_length=20))

# TODO: Test temperature sampling with same temperature
# TODO: Why do results differ?

## Part 4: Three Prompting Strategies

Test instruction-following, cloze, and in-context learning.

### 4.1: Instruction-Following

Note: GPT-2 is a base model, not instruction-tuned, so this may not work well.

In [None]:
instruction_prompt = "Translate to French: Hello, how are you?\nTranslation:"
result = generate_text(instruction_prompt, "greedy", max_length=30)
print(f"Instruction prompt:\n{result}")

### 4.2: Cloze Prompts

Already covered in Part 2!

### 4.3: In-Context Learning

Provide examples to teach the task.

In [None]:
icl_prompt = """Classify the sentiment of each review:

Review: This movie was amazing! I loved every minute.
Sentiment: Positive

Review: Terrible waste of time. Very disappointing.
Sentiment: Negative

Review: It was okay, nothing special.
Sentiment: Neutral

Review: Absolutely brilliant performances and story!
Sentiment:"""

result = generate_text(icl_prompt, "greedy", max_length=len(tokenizer.encode(icl_prompt)) + 5)
print(result)
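
Writing few-shot prompts by hand becomes tedious once there are many queries to test. The sketch below assembles a prompt in the same layout as `icl_prompt` from a list of examples and pulls the first generated word out of a completion. `build_few_shot_prompt` and `extract_label` are hypothetical helpers, and `fake_output` is an invented string standing in for a real model generation.

```python
def build_few_shot_prompt(instruction, examples, query,
                          input_label="Review", output_label="Sentiment"):
    """Assemble a few-shot prompt: instruction, labeled examples, then the query."""
    blocks = [instruction]
    for text, label in examples:
        blocks.append(f"{input_label}: {text}\n{output_label}: {label}")
    # Leave the final label empty so the model completes it
    blocks.append(f"{input_label}: {query}\n{output_label}:")
    return "\n\n".join(blocks)

def extract_label(generated_text, prompt_text):
    """Return the first word generated after the prompt, or "" if there is none."""
    continuation = generated_text[len(prompt_text):].strip()
    return continuation.split()[0] if continuation else ""

examples = [
    ("This movie was amazing! I loved every minute.", "Positive"),
    ("Terrible waste of time. Very disappointing.", "Negative"),
    ("It was okay, nothing special.", "Neutral"),
]
few_shot_prompt = build_few_shot_prompt(
    "Classify the sentiment of each review:",
    examples,
    "Absolutely brilliant performances and story!",
)
print(few_shot_prompt)

# Stand-in for a model output (made up for illustration, not a real GPT-2 generation)
fake_output = few_shot_prompt + " Positive\n\nReview: Another"
print("Extracted label:", extract_label(fake_output, few_shot_prompt))
```

`extract_label` assumes the decoded output begins with the prompt text unchanged. If decoding alters whitespace, slicing by token count instead of character count may be more reliable.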

### Exercise 4.1: Design ICL for Your Concept

Create a few-shot prompt for a concept relevant to your project.

In [None]:
# TODO: Design a few-shot prompt for your concept
my_icl_prompt = """
# Your examples here
"""

# Test it
# result = generate_text(my_icl_prompt, "greedy", max_length=...)
# print(result)

## Part 5: Evaluation Metrics

Compute precision, recall, F1, and perplexity.

### 5.1: Classification Metrics

In [None]:
# Example: Binary classification task
# 1 = positive, 0 = negative

true_labels = np.array([1, 1, 0, 1, 0, 0, 1, 1, 0, 1])
predicted_labels = np.array([1, 1, 0, 0, 0, 1, 1, 1, 0, 1])

# Calculate metrics
precision = precision_score(true_labels, predicted_labels)
recall = recall_score(true_labels, predicted_labels)
f1 = f1_score(true_labels, predicted_labels)
accuracy = (true_labels == predicted_labels).mean()

print("Classification Metrics:")
print(f"  Accuracy:  {accuracy:.3f}")
print(f"  Precision: {precision:.3f}")
print(f"  Recall:    {recall:.3f}")
print(f"  F1 Score:  {f1:.3f}")

# Confusion matrix breakdown
tp = ((true_labels == 1) & (predicted_labels == 1)).sum()
fp = ((true_labels == 0) & (predicted_labels == 1)).sum()
fn = ((true_labels == 1) & (predicted_labels == 0)).sum()
tn = ((true_labels == 0) & (predicted_labels == 0)).sum()

print(f"\nConfusion Matrix:")
print(f"  True Positives:  {tp}")
print(f"  False Positives: {fp}")
print(f"  False Negatives: {fn}")
print(f"  True Negatives:  {tn}")
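
The scikit-learn scores can be checked by hand from the confusion-matrix counts. The sketch below recomputes them in plain Python for the same ten labels. `precision_recall_f1` is an illustrative helper, and returning 0.0 when a denominator is zero is an assumption made here for simplicity.

```python
def precision_recall_f1(tp, fp, fn):
    """Compute precision, recall, and F1 from confusion-matrix counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Same labels as the example above
y_true = [1, 1, 0, 1, 0, 0, 1, 1, 0, 1]
y_pred = [1, 1, 0, 0, 0, 1, 1, 1, 0, 1]

n_tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
n_fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
n_fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))

manual_precision, manual_recall, manual_f1 = precision_recall_f1(n_tp, n_fp, n_fn)
print(f"TP={n_tp}, FP={n_fp}, FN={n_fn}")
print(f"Precision = {manual_precision:.3f}")
print(f"Recall    = {manual_recall:.3f}")
print(f"F1        = {manual_f1:.3f}")
```

Precision is the share of predicted positives that are correct, and recall is the share of actual positives that were found. F1 is their harmonic mean.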

### 5.2: Perplexity

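Perplexity measures how "surprised" the model is by a sequence: it is the exponential of the average per-token negative log-likelihood, so lower values mean the text is more predictable to the model.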

In [None]:
def calculate_perplexity(text):
    """Calculate perplexity of text under the model."""
    encodings = tokenizer(text, return_tensors="pt").to(device)
    
    with torch.no_grad():
        outputs = model(**encodings, labels=encodings["input_ids"])
        # outputs.loss is the mean negative log-likelihood per predicted token
        nll = outputs.loss.item()
    
    # Perplexity = exp(average negative log likelihood)
    perplexity = np.exp(nll)
    return perplexity

# Test on different texts
texts = [
    "The cat sat on the mat.",
    "Colorless green ideas sleep furiously.",  # Grammatical but nonsensical
    "asdf qwer zxcv hjkl",  # Random characters
]

print("Perplexity Comparison:")
for text in texts:
    ppl = calculate_perplexity(text)
    print(f"  PPL = {ppl:8.2f} | '{text}'")
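
The perplexity formula can be worked through on per-token probabilities directly. The sketch below uses made-up probabilities; `perplexity_from_token_probs` is an illustrative helper, not part of any library.

```python
import math

def perplexity_from_token_probs(token_probs):
    """Perplexity = exp(mean negative log-probability of the observed tokens)."""
    mean_nll = -sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(mean_nll)

# Made-up probabilities the model assigned to each actual next token
uniform_guess = [0.25, 0.25, 0.25, 0.25]  # as uncertain as a fair 4-way choice
confident = [0.90, 0.80, 0.95]            # model mostly expected these tokens

ppl_uniform = perplexity_from_token_probs(uniform_guess)
ppl_confident = perplexity_from_token_probs(confident)
print(f"Uniform over 4 options: PPL = {ppl_uniform:.2f}")
print(f"Confident predictions:  PPL = {ppl_confident:.2f}")
```

A perplexity of 4 means the model was, on average, as uncertain as if it were choosing uniformly among four tokens. With a causal language model, the first token of a sequence has no preceding context, so the average is taken over the remaining tokens.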

### Exercise 5.1: Compare Perplexity

Calculate perplexity for text from your domain vs. out-of-domain text.

In [None]:
# TODO: Add examples from your concept domain
in_domain_text = "..."
out_domain_text = "..."

# Calculate and compare perplexities
# What does this tell you about the model's knowledge of your domain?

## Part 6: Statistical Significance

Determine if differences in performance are meaningful.

In [None]:
def calculate_confidence_interval(accuracies):
    """Calculate 95% confidence interval for accuracy."""
    mean = np.mean(accuracies)
    se = stats.sem(accuracies)  # Standard error
    ci = stats.t.interval(0.95, len(accuracies)-1, loc=mean, scale=se)
    return mean, ci

def bootstrap_ci(scores, n_bootstrap=1000):
    """Bootstrap confidence interval."""
    bootstrap_means = []
    for _ in range(n_bootstrap):
        sample = np.random.choice(scores, size=len(scores), replace=True)
        bootstrap_means.append(np.mean(sample))
    
    ci_lower = np.percentile(bootstrap_means, 2.5)
    ci_upper = np.percentile(bootstrap_means, 97.5)
    return np.mean(scores), (ci_lower, ci_upper)

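# Example: Two models on a benchmark (simulated results)
# Each entry is 1 (correct) or 0 (incorrect)
np.random.seed(0)  # fix the seed so the simulated results are reproducible
model_a_results = np.random.binomial(1, 0.85, 100)  # true accuracy 85%; the sampled accuracy will vary
model_b_results = np.random.binomial(1, 0.82, 100)  # true accuracy 82%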

# Bootstrap CIs
mean_a, ci_a = bootstrap_ci(model_a_results)
mean_b, ci_b = bootstrap_ci(model_b_results)

print("Model A:")
print(f"  Accuracy: {mean_a:.3f}")
print(f"  95% CI: [{ci_a[0]:.3f}, {ci_a[1]:.3f}]")
print(f"\nModel B:")
print(f"  Accuracy: {mean_b:.3f}")
print(f"  95% CI: [{ci_b[0]:.3f}, {ci_b[1]:.3f}]")

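# Do confidence intervals overlap?
# This is a conservative check: intervals can overlap even when a direct
# test on the difference between the models would be significant.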
overlap = not (ci_a[1] < ci_b[0] or ci_b[1] < ci_a[0])
print(f"\nConfidence intervals overlap: {overlap}")
if overlap:
    print("⚠️  Difference may not be statistically significant")
else:
    print("✓ Difference appears statistically significant")
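
A quick way to see how the interval width depends on the number of test examples is the normal-approximation (Wald) interval for a proportion. This is a different method from the bootstrap above and is used here as a rough planning tool. The helpers `accuracy_margin` and `required_n` are illustrative assumptions, and the approximation becomes unreliable for small samples or accuracies near 0 or 1.

```python
import math

def accuracy_margin(p, n, z=1.96):
    """Half-width of the approximate 95% interval for accuracy p on n examples."""
    return z * math.sqrt(p * (1 - p) / n)

def required_n(p, margin, z=1.96):
    """Smallest n whose approximate 95% interval half-width is at most margin."""
    return math.ceil(z ** 2 * p * (1 - p) / margin ** 2)

margin_100 = accuracy_margin(0.85, 100)
n_for_3pts = required_n(0.85, 0.03)

print(f"85% accuracy on 100 examples: +/- {margin_100:.3f}")
print(f"Examples needed for +/- 0.03 at 85% accuracy: {n_for_3pts}")
```

With 100 examples the margin is about 7 percentage points, which is larger than the 3-point gap between the two simulated models. Halving the margin requires roughly four times as many examples.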

### Exercise 6.1: Statistical Testing

Use McNemar's test to compare two models on paired examples.

In [None]:
from statsmodels.stats.contingency_tables import mcnemar

# Create contingency table
# model_a_correct, model_b_correct (both arrays of 1s and 0s)
model_a_correct = np.random.binomial(1, 0.85, 100)
model_b_correct = np.random.binomial(1, 0.82, 100)

# Build 2x2 table
both_correct = ((model_a_correct == 1) & (model_b_correct == 1)).sum()
a_correct_b_wrong = ((model_a_correct == 1) & (model_b_correct == 0)).sum()
a_wrong_b_correct = ((model_a_correct == 0) & (model_b_correct == 1)).sum()
both_wrong = ((model_a_correct == 0) & (model_b_correct == 0)).sum()

table = [[both_correct, a_correct_b_wrong],
         [a_wrong_b_correct, both_wrong]]

print("Contingency Table:")
print(f"  Both correct: {both_correct}")
print(f"  A correct, B wrong: {a_correct_b_wrong}")
print(f"  A wrong, B correct: {a_wrong_b_correct}")
print(f"  Both wrong: {both_wrong}")

# McNemar's test
result = mcnemar(table, exact=True)
print(f"\nMcNemar's test p-value: {result.pvalue:.4f}")

if result.pvalue < 0.05:
    print("✓ Statistically significant difference (p < 0.05)")
else:
    print("⚠️  No statistically significant difference (p >= 0.05)")
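
The exact McNemar test depends only on the two discordant cells of the table, the examples where exactly one model is correct. Under the null hypothesis each discordant example is equally likely to favor either model, so the p-value is a two-sided binomial tail. The sketch below computes it by hand. `mcnemar_exact_p` is an illustrative helper, the counts are made up, and the result should agree with `mcnemar(table, exact=True)` on the same discordant counts.

```python
import math

def mcnemar_exact_p(b, c):
    """Two-sided exact McNemar p-value from the two discordant counts."""
    n = b + c
    if n == 0:
        return 1.0  # no disagreements, so no evidence of a difference
    k = min(b, c)
    # P(X <= k) for X ~ Binomial(n, 0.5), doubled for a two-sided test
    tail = sum(math.comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)

# Made-up discordant counts: A right and B wrong 10 times, the reverse 2 times
p_unbalanced = mcnemar_exact_p(10, 2)
p_balanced = mcnemar_exact_p(5, 5)

print(f"b=10, c=2: p = {p_unbalanced:.4f}")
print(f"b=5,  c=5: p = {p_balanced:.4f}")
```

Examples that both models get right or both get wrong do not enter the calculation, which is why the test suits paired comparisons on the same benchmark items.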

## Part 7: Memorization vs. Generalization

Test whether the model reproduces a memorized string or has learned a pattern that survives rephrasing.

In [None]:
# Example: Test on exact phrasing vs. paraphrased

exact_prompt = "To be or not to be, that is the"
paraphrase_prompt = "The question is whether to exist or not exist, that is the"

# Probability of the expected continuation, "question", under each prompt
target_token = " question"
prob_exact = get_token_probability(exact_prompt, target_token)
prob_para = get_token_probability(paraphrase_prompt, target_token)

print("Memorization Test:")
print(f"  Exact phrase: P('{target_token}') = {prob_exact:.4f}")
print(f"  Paraphrased:  P('{target_token}') = {prob_para:.4f}")
print(f"\n  Probability ratio: {prob_exact / (prob_para + 1e-10):.2f}x")

if prob_exact > 10 * prob_para:
    print("  ⚠️ Likely memorized (much higher prob for exact phrasing)")
else:
    print("  ✓ May have generalized (similar probs for paraphrase)")
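
A single prompt pair gives a noisy signal, so it can help to summarize the exact-versus-paraphrase ratio over several pairs. The sketch below works in log space so that very large and very small ratios are comparable. `log10_ratio` is an illustrative helper, the probabilities are made up, and the 10x threshold simply mirrors the one used above.

```python
import math
import statistics

def log10_ratio(p_exact, p_para, eps=1e-10):
    """Base-10 log of P(exact) / P(paraphrase); 1.0 means a 10x gap."""
    return math.log10((p_exact + eps) / (p_para + eps))

# Hypothetical (P_exact, P_paraphrase) values for three prompt pairs (made up)
pair_probs = [(0.60, 0.004), (0.30, 0.20), (0.45, 0.05)]

log_ratios = [log10_ratio(e, p) for e, p in pair_probs]
median_log_ratio = statistics.median(log_ratios)
n_flagged = sum(r > 1.0 for r in log_ratios)  # pairs with more than a 10x gap

for (e, p), r in zip(pair_probs, log_ratios):
    print(f"  exact={e:.3f}  paraphrase={p:.3f}  log10 ratio={r:.2f}")
print(f"Median log10 ratio: {median_log_ratio:.2f}")
print(f"Pairs above the 10x threshold: {n_flagged} of {len(pair_probs)}")
```

The median is less sensitive than the mean to one famous quotation dominating the result.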

### Exercise 7.1: Design Generalization Tests

Create pairs of prompts to test memorization vs. understanding for your concept.

In [None]:
# TODO: Create test pairs for your concept
# Compare:
# 1. Common phrasing vs. novel phrasing
# 2. Standard examples vs. counterfactual examples
# 3. In-distribution vs. out-of-distribution cases

test_pairs = [
    # ("common_phrasing", "novel_phrasing", "expected_answer"),
]

# Test and compare probabilities

## Reflection Questions

Answer these in your project writeup:

1. **Prompting Strategy**: Which prompting strategy (instruction, cloze, ICL) works best for your concept? Why?

2. **Metrics**: Which evaluation metrics (accuracy, precision/recall, perplexity) are most appropriate for your concept? Why?

3. **Sample Size**: How many test examples do you need for statistical significance? Use the confidence interval calculations to justify your answer.

4. **Memorization**: How will you distinguish memorization from true understanding? What makes a prompt unlikely to have been in training data?

5. **Generalization**: What variations of your prompts will test for genuine concept understanding rather than surface pattern matching?

## Next Steps

Use this notebook as a foundation for your Week 1 assignment:
- Design your benchmark prompts
- Test them on GPT-2 or another model
- Calculate evaluation metrics
- Assess statistical significance
- Create generalization tests

Good luck!