# Evaluation: Base vs Finetuned (GPU)

## Comparison Goals

After training, we want to:
1. **Quantitative:** Compare perplexity (base vs finetuned)
2. **Qualitative:** Generate samples side-by-side
3. **Document:** Record hyperparameters, costs, latency

This notebook runs on GPU for speed, but you can adapt it for CPU if needed.
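
As a reference for the quantitative goal, perplexity is the exponential of the mean per-token negative log-likelihood. The sketch below is a minimal illustration in pure Python, not this notebook's solution. The helper names `perplexity` and `corpus_perplexity` are hypothetical. It assumes you have already collected one `(mean_loss, n_tokens)` pair per batch, for example from a causal LM's averaged cross-entropy loss. Weighting each batch by its token count keeps short batches from skewing the result.

```python
import math


def perplexity(token_nlls):
    """Perplexity from per-token negative log-likelihoods (natural log)."""
    if not token_nlls:
        raise ValueError("need at least one token")
    return math.exp(sum(token_nlls) / len(token_nlls))


def corpus_perplexity(batches):
    """Perplexity from (mean_loss, n_tokens) pairs, one pair per batch.

    Each mean loss is weighted by its token count, so the result equals
    the perplexity over all tokens rather than an average of averages.
    """
    total_nll = 0.0
    total_tokens = 0
    for mean_loss, n_tokens in batches:
        total_nll += mean_loss * n_tokens
        total_tokens += n_tokens
    if total_tokens == 0:
        raise ValueError("need at least one token")
    return math.exp(total_nll / total_tokens)
```

Running the same aggregation on the same validation slice, once for the base model and once with adapters attached, would give the `ppl_base` and `ppl_finetuned` numbers to compare.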


In [None]:
# === TODO (you code this) ===
# Load base model + attach LoRA adapters; run perplexity on validation slice.
# Hints:
#   - Load base model in 4-bit
#   - Use PeftModel.from_pretrained() to attach adapters
#   - Compute perplexity on validation set (similar to notebook 04, but on GPU)
# Acceptance:
#   - prints ppl_base vs ppl_finetuned

from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import PeftModel
import torch

def eval_perplexity_with_adapters(base_model: str, adapter_repo: str, tokenized_ds, n_samples: int = 500):
    """
    Evaluate perplexity with base and finetuned models.
    
    Args:
        base_model: Base model name
        adapter_repo: Hub repo with LoRA adapters
        tokenized_ds: Tokenized validation dataset
        n_samples: Number of samples to evaluate
    """
    raise NotImplementedError

# Evaluate
# eval_perplexity_with_adapters(
#     "mistralai/Mistral-7B-Instruct-v0.2",
#     "YOURUSER/mistral-frankenstein-qlora",
#     ds_val,
#     n_samples=500
# )


## Side-by-Side Generation

Generate text with both models using the same prompts. Compare style, coherence, and Frankenstein-like tone.
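
One way to lay out paired outputs is sketched below. It is only a formatting aid under stated assumptions. `format_side_by_side` and `compare_with` are hypothetical helpers, and `generate_base` and `generate_tuned` stand in for whatever callables you write that map a prompt string to generated text. No model is loaded here.

```python
import textwrap


def format_side_by_side(prompt, base_text, tuned_text, width=38):
    """Return a two-column text block: base output left, finetuned right."""
    left = textwrap.wrap(base_text, width) or [""]
    right = textwrap.wrap(tuned_text, width) or [""]
    # Pad the shorter column so both have the same number of rows.
    n_rows = max(len(left), len(right))
    left += [""] * (n_rows - len(left))
    right += [""] * (n_rows - len(right))
    header = f"{'BASE':<{width}} | {'FINETUNED':<{width}}"
    lines = [f"PROMPT: {prompt}", header, "-" * len(header)]
    lines += [f"{l:<{width}} | {r}" for l, r in zip(left, right)]
    return "\n".join(lines)


def compare_with(generate_base, generate_tuned, prompts):
    """Print one side-by-side block per prompt using the given callables."""
    for prompt in prompts:
        print(format_side_by_side(prompt, generate_base(prompt), generate_tuned(prompt)))
        print()
```

Passing the generators in as callables keeps the layout code independent of how the models are loaded, so it can be checked with stub functions before spending GPU time.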


In [None]:
# === TODO (you code this) ===
# Generate 3-5 short continuations with both models for side-by-side comparison.
# Hints:
#   - Load base model and finetuned (base + adapters)
#   - Use same prompts for both
#   - Print outputs side-by-side or in a table
#   - Use identical generation parameters for both models (temperature and top_p only take effect with do_sample=True)
# Acceptance:
#   - prints paired outputs with fixed prompts

def compare_samples(base_model: str, adapter_repo: str, prompts: list, max_new_tokens: int = 100):
    """
    Generate samples with base and finetuned models for comparison.
    
    Args:
        base_model: Base model name
        adapter_repo: Hub repo with LoRA adapters
        prompts: List of prompt strings
        max_new_tokens: Maximum tokens to generate
    """
    raise NotImplementedError

# Compare
prompts = [
    "It was on a dreary night of November that",
    "The monster gazed upon his creator with",
    "I beheld the wretch—the miserable monster",
    "Life and death appeared to me ideal bounds"
]
# compare_samples(
#     "mistralai/Mistral-7B-Instruct-v0.2",
#     "YOURUSER/mistral-frankenstein-qlora",
#     prompts,
#     max_new_tokens=100
# )
