# Evaluation Harness (CPU)

## Why Evaluate on CPU?

Before training, we want to:
1. **Baseline perplexity:** Measure base model performance
2. **Sample generation:** See what the base model produces
3. **Compare post-training:** Same metrics after finetuning

Running evaluation on CPU is slow but:
- No GPU needed for initial checks
- Helps validate the pipeline
- Can run locally before GPU training

## Perplexity Approximation

Perplexity measures how "surprised" the model is by the data:
- Lower = better (model predicts data well)
- Formula: exp(mean(negative_log_likelihood))

We'll compute it on a small validation slice for speed.

### What You'll See When Running This:

1. **Dataset Loading:** Downloads/loads your dataset from Hugging Face Hub
   - Train split: 456 samples
   - Validation split: 25 samples

2. **Model Download (First Time Only):** 
   - Downloads Mistral-7B (~15GB total)
   - 3 model files (model-00001/02/03-of-00003.safetensors)
   - This only happens once - files are cached locally
   - **This is what you're seeing now!** ‚¨áÔ∏è

3. **Model Loading:**
   - Loads the 7B parameter model into RAM
   - Uses CPU (no GPU needed, but slower)
   - Takes a few minutes

4. **Perplexity Calculation:**
   - Processes validation samples one by one
   - Computes how well model predicts each token
   - Averages across all samples
   - **This will take 10-30 minutes on CPU** (be patient!)

### Why So Slow on CPU?
- 7B parameters = billions of calculations per sample
- CPU has fewer cores than GPU
- This is why we only do this once for baseline, then use GPU for training


In [1]:
# Check if GPU cuda is available
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch
from datasets import load_dataset

if torch.cuda.is_available():
    print("‚úÖ GPU detected: ", torch.cuda.get_device_name(0))
else:
    print("‚ö†Ô∏è  No GPU detected! This will be very slow or crash.")
    print("   Please enable GPU in Colab: Runtime ‚Üí Change runtime type ‚Üí GPU")
    

‚úÖ GPU detected:  Tesla T4


## Authentication for Private Dataset

**‚ö†Ô∏è IMPORTANT:** The dataset `Tuminha/frankenstein-fanfic-snippets` is **private**, so you must authenticate first!

### Option 1: Using .env file (Recommended for local development)

Create a `.env` file in your project root with:
```
HF_TOKEN=your_token_here
```

Or use any of these variable names:
- `HF_TOKEN`
- `HUGGINGFACE_HUB_TOKEN`
- `HUGGINGFACE_API_KEY`

**Note:** Make sure `.env` is in `.gitignore` (it already is)!

### Option 2: Interactive Login

Run the authentication cell below to login interactively:

```python
from huggingface_hub import login
login()  # Paste your token when prompted
```

### Option 3: Environment Variable (Colab/Cloud)

Set environment variable before running:
```python
import os
os.environ["HF_TOKEN"] = "your_token_here"
```

### Get Your Token
1. Go to: https://huggingface.co/settings/tokens
2. Click "New token"
3. Name it (e.g., "colab-access")
4. Select "Read" access
5. Copy the token

**Note:** The dataset uses Parquet format (not CSV) - this is fine! `load_dataset()` handles it automatically.


In [1]:
# Helper function to load dataset (Hub or local fallback)
def load_dataset_with_fallback(hub_id="Tuminha/frankenstein-fanfic-snippets", token=None):
    """
    Try to load dataset from Hub, with authentication if needed.
    Falls back to local CSV if Hub access fails.
    
    Args:
        hub_id: Hugging Face dataset ID
        token: Optional Hugging Face token (or use login() first)
        
    Returns:
        DatasetDict with 'train' and 'validation' splits
    """
    from datasets import Dataset, DatasetDict
    from sklearn.model_selection import train_test_split
    import pandas as pd
    import os
    
    # Try Hub first (with authentication if needed)
    try:
        # Get token from environment or cached login
        from huggingface_hub import HfFolder
        import os
        
        # Try to get token from various sources (in order of priority)
        hf_token = token  # 1. Use provided token first
        
        # 2. Try to load from .env file
        if not hf_token:
            try:
                from dotenv import load_dotenv
                load_dotenv()  # Load .env file if it exists
                hf_token = os.getenv("HF_TOKEN") or os.getenv("HUGGINGFACE_HUB_TOKEN") or os.getenv("HUGGINGFACE_API_KEY")
            except ImportError:
                # python-dotenv not installed, skip .env loading
                pass
            except Exception:
                # .env file doesn't exist or other error, continue
                pass
        
        # 3. Try environment variables (already loaded from system/env)
        if not hf_token:
            hf_token = os.getenv("HF_TOKEN") or os.getenv("HUGGINGFACE_HUB_TOKEN") or os.getenv("HUGGINGFACE_API_KEY")
        
        # 4. Try to get from HfFolder (cached login via login())
        if not hf_token:
            try:
                hf_token = HfFolder.get_token()
            except Exception:
                pass
        
        # Load dataset with explicit token
        if hf_token:
            token_source = "provided parameter"
            if not token:  # If not provided as parameter, figure out source
                if os.getenv("HF_TOKEN") or os.getenv("HUGGINGFACE_HUB_TOKEN") or os.getenv("HUGGINGFACE_API_KEY"):
                    token_source = "environment variable (.env or system)"
                else:
                    token_source = "cached login (HfFolder)"
            print(f"‚úÖ Using authentication token from: {token_source} (length: {len(hf_token)})")
            dset = load_dataset(hub_id, token=hf_token)
        else:
            print("‚ö†Ô∏è  No token found. Trying without explicit token (may use cached)...")
            dset = load_dataset(hub_id)  # Will try to use cached token
        
        print("‚úÖ Loaded dataset from Hugging Face Hub")
        print(f"   Train: {len(dset['train'])}, Validation: {len(dset['validation'])}")
        return dset
        
    except Exception as e:
        error_msg = str(e).lower()
        error_type = type(e).__name__
        
        print(f"‚ö†Ô∏è  Could not load from Hub: {error_type}")
        
        # Check if it's an authentication/access issue
        if "not found" in error_msg or "cannot be accessed" in error_msg or "401" in error_msg:
            print("\nüîê AUTHENTICATION REQUIRED")
            print("   The dataset is private. You need to authenticate first.")
            print("\n   Run this in a cell above:")
            print("   ```python")
            print("   from huggingface_hub import login")
            print("   login()  # Enter your token when prompted")
            print("   ```")
            print("\n   Get your token from: https://huggingface.co/settings/tokens")
            print("   (Create a token with 'read' access)")
            print("\n   After authenticating, run this cell again.")
            print("\n   Alternatively, you can pass a token directly:")
            print("   ```python")
            print("   dset = load_dataset_with_fallback(token='your_token_here')")
            print("   ```")
        
        print("\n   Falling back to local CSV...")
        
        # Fallback: Try multiple possible CSV paths
        possible_paths = [
            "../data/processed/frankenstein_cleaned.csv",  # Local relative
            "data/processed/frankenstein_cleaned.csv",     # Colab root
            "/content/data/processed/frankenstein_cleaned.csv",  # Colab content
            "./data/processed/frankenstein_cleaned.csv",   # Current dir
        ]
        
        df = None
        used_path = None
        
        for csv_path in possible_paths:
            try:
                if os.path.exists(csv_path):
                    df = pd.read_csv(csv_path)
                    used_path = csv_path
                    print(f"   ‚úÖ Found CSV at: {csv_path}")
                    break
            except Exception:
                continue
        
        if df is None:
            print(f"\n‚ùå CSV file not found in any of these locations:")
            for p in possible_paths:
                print(f"   - {p}")
            print("\n   üí° SOLUTIONS:")
            print("   1. Authenticate to Hub (recommended):")
            print("      from huggingface_hub import login")
            print("      login()")
            print("\n   2. Upload CSV to Colab:")
            print("      from google.colab import files")
            print("      files.upload()  # Upload frankenstein_cleaned.csv")
            print("\n   3. Download dataset files manually from Hub and load them")
            
            raise FileNotFoundError(
                "Could not access dataset. Please authenticate to Hub or provide CSV file.\n"
                "The dataset exists but is private - authentication is required."
            )
        
        print(f"   Loaded CSV: {len(df)} rows from {used_path}")
        
        # Create train/val split (same as notebook 02: 5% validation, seed=42)
        train_df, val_df = train_test_split(df, test_size=0.05, random_state=42)
        
        dset = DatasetDict({
            'train': Dataset.from_pandas(train_df),
            'validation': Dataset.from_pandas(val_df)
        })
        print(f"‚úÖ Created dataset from local CSV: {len(dset['train'])} train, {len(dset['validation'])} val")
        return dset

# Load dataset
# NOTE: If you get authentication error, run this first:
# from huggingface_hub import login
# login()  # Enter your token
# Then run this cell again
dset = load_dataset_with_fallback()


‚úÖ Using authentication token from: environment variable (.env or system) (length: 37)
‚ö†Ô∏è  Could not load from Hub: NameError

   Falling back to local CSV...
   ‚úÖ Found CSV at: ../data/processed/frankenstein_cleaned.csv
   Loaded CSV: 481 rows from ../data/processed/frankenstein_cleaned.csv
‚úÖ Created dataset from local CSV: 456 train, 25 val


In [None]:
# === TODO (you code this) ===
# Compute perplexity for a small validation slice using a CPU-friendly model.
# NOTE: Mistral-7B is too large for CPU! Use DistilGPT-2 instead.
# Hints:
#   - Load model and tokenizer (use "distilgpt2" for CPU)
#   - Tokenize the dataset first (it's not tokenized yet!)
#   - For each sample, compute negative log-likelihood of tokens
#   - Sum NLL, divide by total tokens, then exp() for perplexity
# Acceptance:
#   - prints ppl_base on ~25 validation samples (all of them)

from transformers import AutoModelForCausalLM, AutoTokenizer
import torch
from datasets import load_dataset
from dotenv import load_dotenv
import os
load_dotenv()
HF_TOKEN = os.getenv("HUGGINGFACE_API_KEY") or os.getenv("HF_TOKEN")

def cpu_perplexity_estimate(base_model: str, dataset, n_samples: int=25):
    """
    Estimate perplexity on CPU using a smaller model (DistilGPT-2).
    
    NOTE: Mistral-7B requires ~28GB RAM and will crash on most CPUs.
    We use DistilGPT-2 as a proxy baseline here.
    Real Mistral baseline will be computed on GPU in notebook 11.
    
    Args:
        base_model: Model name (use "distilgpt2" for CPU)
        dataset: Validation dataset (not tokenized yet)
        n_samples: Number of samples to evaluate
    """
    print(f"Loading model {base_model} on CPU...")
    print("‚ö†Ô∏è  Using DistilGPT-2 as proxy (Mistral-7B too large for CPU)")
    
    model = AutoModelForCausalLM.from_pretrained(base_model)
    tokenizer = AutoTokenizer.from_pretrained(base_model)
    tokenizer.pad_token = tokenizer.eos_token  # Set padding token
    model.eval()
    
    # Limit samples
    n_samples = min(n_samples, len(dataset))
    print(f"Computing perplexity on {n_samples} samples...")
    
    total_nll = 0.0
    total_tokens = 0
    
    with torch.no_grad():
        for i, sample in enumerate(dataset.select(range(n_samples))):
            # Tokenize the text
            text = sample['text']
            encoded = tokenizer(text, return_tensors="pt", truncation=True, max_length=512)
            input_ids = encoded['input_ids']
            
            # Forward pass to get logits
            outputs = model(input_ids, labels=input_ids)
            
            # Loss is already negative log-likelihood per token (averaged)
            nll = outputs.loss.item() * input_ids.size(1)
            
            total_nll += nll
            total_tokens += input_ids.size(1)
            
            if (i + 1) % 5 == 0:
                print(f"  Processed {i + 1}/{n_samples} samples...")
    
    # Compute perplexity
    avg_nll = total_nll / total_tokens
    perplexity = torch.exp(torch.tensor(avg_nll)).item()
    
    print(f"\n‚úÖ Baseline Perplexity (DistilGPT-2): {perplexity:.2f}")
    print(f"   (computed on {n_samples} samples, {total_tokens} tokens)")
    print(f"\nüìù Note: This is DistilGPT-2 baseline, not Mistral.")
    print(f"   Real Mistral baseline will be computed on GPU in notebook 11.")
    
    return perplexity

# Load dataset and compute baseline
dset = load_dataset("Tuminha/frankenstein-fanfic-snippets")
# Use DistilGPT-2 instead of Mistral (Mistral too large for CPU)
cpu_perplexity_estimate("mistralai/Mistral-7B-Instruct-v0.2", dset['validation'], n_samples=25)


Loading model mistralai/Mistral-7B-Instruct-v0.2 on CPU...
‚ö†Ô∏è  Using DistilGPT-2 as proxy (Mistral-7B too large for CPU)


Loading checkpoint shards:   0%|          | 0/3 [00:00<?, ?it/s]

## Mistral-7B Perplexity (GPU/Colab)

**‚ö†Ô∏è Run this cell in Google Colab with GPU enabled!**

This version uses Mistral-7B for the real baseline. It requires:
- GPU runtime in Colab (Runtime ‚Üí Change runtime type ‚Üí GPU)
- ~15GB GPU memory (T4 works fine)
- Much faster than CPU (minutes instead of hours)

Copy this cell to Colab and run it there.


In [None]:
# === Mistral-7B Perplexity for GPU/Colab ===
# Copy this cell to Google Colab and run with GPU enabled!
# Runtime ‚Üí Change runtime type ‚Üí GPU (T4)

from transformers import AutoModelForCausalLM, AutoTokenizer
import torch
from datasets import load_dataset

def gpu_perplexity_estimate_mistral(base_model: str, dataset, n_samples: int=25):
    """
    Estimate perplexity using Mistral-7B on GPU (for Colab).
    
    Args:
        base_model: Model name ("mistralai/Mistral-7B-Instruct-v0.2")
        dataset: Validation dataset (not tokenized)
        n_samples: Number of samples to evaluate
    """
    # Check for GPU
    device = "cuda" if torch.cuda.is_available() else "cpu"
    if device == "cpu":
        print("‚ö†Ô∏è  WARNING: No GPU detected! This will be very slow or crash.")
        print("   Please enable GPU in Colab: Runtime ‚Üí Change runtime type ‚Üí GPU")
        return None
    
    print(f"‚úÖ GPU detected: {torch.cuda.get_device_name(0)}")
    print(f"   GPU Memory: {torch.cuda.get_device_properties(0).total_memory / 1e9:.1f} GB")
    print(f"Loading model {base_model} on {device}...")
    
    # Load model on GPU with bfloat16 (saves memory)
    model = AutoModelForCausalLM.from_pretrained(
        base_model,
        torch_dtype=torch.bfloat16,  # Use bfloat16 for efficiency
        device_map="auto"  # Automatically place on GPU
    )
    tokenizer = AutoTokenizer.from_pretrained(base_model)
    tokenizer.pad_token = tokenizer.eos_token
    model.eval()
    
    # Limit samples
    n_samples = min(n_samples, len(dataset))
    print(f"Computing perplexity on {n_samples} samples...")
    
    total_nll = 0.0
    total_tokens = 0
    
    with torch.no_grad():
        for i, sample in enumerate(dataset.select(range(n_samples))):
            # Tokenize the text
            text = sample['text']
            encoded = tokenizer(text, return_tensors="pt", truncation=True, max_length=512)
            input_ids = encoded['input_ids'].to(device)  # Move to GPU
            
            # Forward pass to get logits
            outputs = model(input_ids, labels=input_ids)
            
            # Loss is already negative log-likelihood per token (averaged)
            nll = outputs.loss.item() * input_ids.size(1)
            
            total_nll += nll
            total_tokens += input_ids.size(1)
            
            if (i + 1) % 5 == 0:
                print(f"  Processed {i + 1}/{n_samples} samples...")
    
    # Compute perplexity
    avg_nll = total_nll / total_tokens
    perplexity = torch.exp(torch.tensor(avg_nll)).item()
    
    print(f"\n‚úÖ Baseline Perplexity (Mistral-7B): {perplexity:.2f}")
    print(f"   (computed on {n_samples} samples, {total_tokens} tokens)")
    print(f"   Device: {device}")
    
    return perplexity

# Load dataset and compute baseline
# Uncomment and run in Colab:
# dset = load_dataset("Tuminha/frankenstein-fanfic-snippets")
# gpu_perplexity_estimate_mistral("mistralai/Mistral-7B-Instruct-v0.2", dset['validation'], n_samples=25)


## Mistral-7B Generation (GPU/Colab)

**‚ö†Ô∏è Run this cell in Google Colab with GPU enabled!**

Generate samples with Mistral-7B on GPU. Much faster and better quality than DistilGPT-2.


In [None]:
# === Mistral-7B Generation for GPU/Colab ===
# Copy this cell to Google Colab and run with GPU enabled!

from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

def gpu_sample_generate_mistral(base_model: str, prompts: list, max_new_tokens: int=100):
    """
    Generate text samples using Mistral-7B on GPU (for Colab).
    
    Args:
        base_model: Model name ("mistralai/Mistral-7B-Instruct-v0.2")
        prompts: List of prompt strings
        max_new_tokens: Maximum tokens to generate
    """
    # Check for GPU
    device = "cuda" if torch.cuda.is_available() else "cpu"
    if device == "cpu":
        print("‚ö†Ô∏è  WARNING: No GPU detected! This will be very slow or crash.")
        print("   Please enable GPU in Colab: Runtime ‚Üí Change runtime type ‚Üí GPU")
        return None
    
    print(f"‚úÖ GPU detected: {torch.cuda.get_device_name(0)}")
    print(f"Loading model {base_model} on {device}...\n")
    
    # Load model on GPU with bfloat16
    model = AutoModelForCausalLM.from_pretrained(
        base_model,
        torch_dtype=torch.bfloat16,
        device_map="auto"
    )
    tokenizer = AutoTokenizer.from_pretrained(base_model)
    tokenizer.pad_token = tokenizer.eos_token
    model.eval()
    
    for i, prompt in enumerate(prompts, 1):
        print(f"Prompt {i}: {prompt}")
        print("Generating (GPU - this should be fast)...")
        
        # Tokenize prompt with attention_mask
        encoded = tokenizer(
            prompt, 
            return_tensors="pt",
            padding=False,
            truncation=True,
            max_length=512
        ).to(device)
        
        # Generate with attention_mask
        with torch.no_grad():
            outputs = model.generate(
                encoded['input_ids'],
                attention_mask=encoded['attention_mask'],  # Pass attention_mask
                max_new_tokens=max_new_tokens,
                do_sample=True,
                temperature=0.7,
                top_p=0.9,
                repetition_penalty=1.2,  # Reduce repetition
                pad_token_id=tokenizer.pad_token_id,
                eos_token_id=tokenizer.eos_token_id
            )
        
        # Decode
        generated_text = tokenizer.decode(outputs[0], skip_special_tokens=True)
        continuation = generated_text[len(prompt):].strip()
        
        print(f"Continuation: {continuation}\n")
        print("-" * 80 + "\n")

# Test generation
# Uncomment and run in Colab:
# prompts = [
#     "It was on a dreary night of November that",
#     "The monster gazed upon his creator with"
# ]
# gpu_sample_generate_mistral("mistralai/Mistral-7B-Instruct-v0.2", prompts, max_new_tokens=100)


## Sample Generation (CPU)

Generating text on CPU is very slow, but useful for:
- Seeing base model outputs
- Validating the generation pipeline
- Comparing before/after training

We'll generate short continuations (60 tokens max) for a few prompts.


In [3]:
# === TODO (you code this) ===
# Generation wrapper using DistilGPT-2 on CPU for 1-2 short prompts.
# NOTE: Using DistilGPT-2 instead of Mistral (Mistral too large for CPU).
# Hints:
#   - Load model and tokenizer (use "distilgpt2")
#   - Use model.generate() with max_new_tokens
#   - Decode and print outputs
#   - Warn about CPU latency
# Acceptance:
#   - prints 2 short continuations; warns about CPU latency

def cpu_sample_generate(base_model: str, prompts: list, max_new_tokens: int=60):
    """
    Generate text samples on CPU using DistilGPT-2 (Mistral too large for CPU).
    
    Args:
        base_model: Model name (use "distilgpt2" for CPU)
        prompts: List of prompt strings
        max_new_tokens: Maximum tokens to generate
    """
    print(f"‚ö†Ô∏è  Using {base_model} for CPU generation (Mistral-7B too large)")
    print("   Real Mistral generation will be done on GPU in notebook 11.")
    print("   Note: DistilGPT-2 wasn't trained on Frankenstein text, so outputs")
    print("   won't match the style. This is just for testing the pipeline.\n")
    
    model = AutoModelForCausalLM.from_pretrained(base_model)
    tokenizer = AutoTokenizer.from_pretrained(base_model)
    
    # Fix: Set pad_token properly (DistilGPT-2 doesn't have one by default)
    if tokenizer.pad_token is None:
        tokenizer.pad_token = tokenizer.eos_token
    
    model.eval()
    
    for i, prompt in enumerate(prompts, 1):
        print(f"Prompt {i}: {prompt}")
        print("Generating (CPU - this may take 30-60 seconds)...")
        
        # Fix: Tokenize with attention_mask (like notebook 3)
        # This explicitly creates attention_mask to avoid the warning
        encoded = tokenizer(
            prompt, 
            return_tensors="pt",
            padding=False,  # No padding needed for single prompt
            truncation=True,
            max_length=512
        )
        
        # Generate with attention_mask (fixes the warning)
        with torch.no_grad():
            outputs = model.generate(
                encoded['input_ids'],
                attention_mask=encoded['attention_mask'],  # Fix: Pass attention_mask
                max_new_tokens=max_new_tokens,
                do_sample=True,
                temperature=0.8,  # Slightly higher for more variety
                top_p=0.9,  # Nucleus sampling
                repetition_penalty=1.2,  # Reduce repetition
                pad_token_id=tokenizer.pad_token_id,
                eos_token_id=tokenizer.eos_token_id
            )
        
        # Decode
        generated_text = tokenizer.decode(outputs[0], skip_special_tokens=True)
        continuation = generated_text[len(prompt):].strip()
        
        print(f"Continuation: {continuation}\n")
        print("-" * 80 + "\n")

# Test generation
prompts = [
    "It was on a dreary night of November that",
    "The monster gazed upon his creator with"
]
cpu_sample_generate("distilgpt2", prompts, max_new_tokens=60)


‚ö†Ô∏è  Using distilgpt2 for CPU generation (Mistral-7B too large)
   Real Mistral generation will be done on GPU in notebook 11.
   Note: DistilGPT-2 wasn't trained on Frankenstein text, so outputs
   won't match the style. This is just for testing the pipeline.

Prompt 1: It was on a dreary night of November that
Generating (CPU - this may take 30-60 seconds)...
Continuation: it would not happen. This afternoon, my daughter and I sat down to take photos with her friends in the parking lot as she waited for them to be taken away from us by our own private plane.[7]
The photo posted at http://www2gw0u9rZd

--------------------------------------------------------------------------------

Prompt 2: The monster gazed upon his creator with
Generating (CPU - this may take 30-60 seconds)...
Continuation: a smile.
It was the most interesting thing to behold in my life: he had an almost supernatural ability and this one, which is more than anything I've ever seen before; it's hard to imagine h