# Evaluation Harness (CPU)

## Why Evaluate on CPU?

Before training, we want to set up three checks:
1. **Baseline perplexity:** Measure the base model's performance
2. **Sample generation:** See what the base model produces
3. **Post-training comparison:** Recompute the same metrics after finetuning

Running evaluation on CPU is slow, but it has advantages:
- No GPU is needed for initial checks
- It helps validate the pipeline
- It can run locally before GPU training

## Perplexity Approximation

Perplexity measures how "surprised" the model is by the data:
- Lower = better (model predicts data well)
- Formula: `exp(mean(negative_log_likelihood))`, where the mean is taken over tokens

We'll compute it on a small validation slice for speed.
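
To make the formula concrete, here is a minimal sketch in plain Python with made-up numbers (no model involved). The helper name `corpus_perplexity` and the sample values are illustrative assumptions. It also shows why the NLL should be summed over all tokens before dividing: averaging per-sample perplexities gives a different number when samples have different lengths.

```python
import math

def corpus_perplexity(mean_nlls, token_counts):
    """Token-weighted perplexity from per-sample mean NLLs (hypothetical helper).

    mean_nlls[i] is the average negative log-likelihood per predicted token
    of sample i; token_counts[i] is how many tokens that average covers.
    """
    total_nll = sum(m * n for m, n in zip(mean_nlls, token_counts))
    total_tokens = sum(token_counts)
    return math.exp(total_nll / total_tokens)

# Sanity check: a model that assigns probability 1/4 to every token
# has per-token NLL ln(4), so its perplexity should be about 4.
uniform_ppl = corpus_perplexity([math.log(4)], [10])

# Two made-up samples of different length.
mean_nlls = [2.0, 3.0]      # illustrative values, not real measurements
token_counts = [100, 300]
ppl = corpus_perplexity(mean_nlls, token_counts)

# Naive alternative: average the per-sample perplexities (not equivalent).
naive = sum(math.exp(m) for m in mean_nlls) / len(mean_nlls)

print(f"uniform model: {uniform_ppl:.3f}")
print(f"token-weighted: {ppl:.3f}, naive per-sample mean: {naive:.3f}")
```

With Hugging Face causal LMs, passing `labels=input_ids` typically returns a loss that is already a mean over the predicted positions (one fewer than the sequence length, because of the label shift), so it would be multiplied by that count before summing. Treat this as an assumption to verify against the model class you load.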


In [None]:
# === TODO (you code this) ===
# Compute perplexity for a small validation slice using the base model (CPU).
# Hints:
#   - Load model and tokenizer
#   - For each sample, compute negative log-likelihood of tokens
#   - Sum NLL, divide by total tokens, then exp() for perplexity
#   - model.generate() is NOT needed here; we're computing likelihood, not sampling
# Acceptance:
#   - prints ppl_base on ~100-200 samples

from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

def cpu_perplexity_estimate(base_model: str, tokenized_ds, n_samples: int = 200):
    """
    Estimate perplexity on CPU (slow but works without GPU).
    
    Args:
        base_model: Model name (Hugging Face model ID or local path)
        tokenized_ds: Tokenized validation dataset
        n_samples: Number of samples to evaluate
    """
    raise NotImplementedError

# Load dataset and compute baseline
# from datasets import load_dataset
# dset = load_dataset("YOURUSER/frankenstein-fanfic-snippets")  # or local
# cpu_perplexity_estimate("mistralai/Mistral-7B-Instruct-v0.2", dset['validation'], n_samples=200)


## Sample Generation (CPU)

Generating text on CPU is very slow, but useful for:
- Seeing base model outputs
- Validating the generation pipeline
- Comparing before/after training

We'll generate short continuations (60 tokens max) for a few prompts.
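
One possible shape for the generation loop is sketched below. It assumes the model and tokenizer are already loaded by the caller (for example with `AutoModelForCausalLM.from_pretrained` and `AutoTokenizer.from_pretrained`) and passed in. The function name `generate_continuations` and the choice of greedy decoding (`do_sample=False`) are illustrative assumptions, not the notebook's reference solution.

```python
import time
import warnings

def generate_continuations(model, tokenizer, prompts, max_new_tokens=60):
    """Generate one continuation per prompt and return the decoded strings.

    `model` and `tokenizer` are expected to behave like a Hugging Face
    causal LM and its tokenizer (assumption: loaded by the caller, on CPU).
    """
    warnings.warn("Generating on CPU; expect this to be slow.")
    outputs = []
    for prompt in prompts:
        start = time.perf_counter()
        # Tokenize a single prompt into a batch of size 1.
        inputs = tokenizer(prompt, return_tensors="pt")
        output_ids = model.generate(
            **inputs,
            max_new_tokens=max_new_tokens,  # caps the continuation length
            do_sample=False,  # greedy decoding, so runs are repeatable
        )
        # The decoded text contains the prompt followed by the continuation.
        text = tokenizer.decode(output_ids[0], skip_special_tokens=True)
        elapsed = time.perf_counter() - start
        print(f"[{elapsed:.1f}s] {text}")
        outputs.append(text)
    return outputs
```

Timing each prompt separately makes the CPU latency visible, which is useful when deciding how many prompts and how many new tokens are practical before moving to a GPU.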


In [None]:
# === TODO (you code this) ===
# Generation wrapper using base model on CPU for 1-2 short prompts.
# Hints:
#   - Load model and tokenizer
#   - Use model.generate() with max_new_tokens
#   - Decode and print outputs
#   - Warn about CPU latency
# Acceptance:
#   - prints 2 short continuations; warns about CPU latency

def cpu_sample_generate(base_model: str, prompts: list, max_new_tokens: int = 60):
    """
    Generate text samples on CPU (slow but works without GPU).
    
    Args:
        base_model: Model name (Hugging Face model ID or local path)
        prompts: List of prompt strings
        max_new_tokens: Maximum tokens to generate
    """
    raise NotImplementedError

# Test generation
prompts = [
    "It was on a dreary night of November that",
    "The monster gazed upon his creator with"
]
# cpu_sample_generate("mistralai/Mistral-7B-Instruct-v0.2", prompts, max_new_tokens=60)
