# AlpacaEval 2 Benchmark Evaluation

## Purpose
This notebook runs the fine-tuned model on the AlpacaEval 2 instruction set and saves its responses for judging. It is a critical component of the evaluation framework required by [Assignment 7](../../tasks/Assignment7.md), which calls for running the model weights on different inputs.

## Evaluation Context
This notebook is the third step of the fine-tuning and evaluation pipeline:

1. **[Baseline Model Evaluation](baseline_model.ipynb)** - Established baseline performance metrics
2. **[Fine-tuning Process](finetune_model.ipynb)** - LoRA fine-tuning on Dolly-15K dataset  
3. **This Notebook** - AlpacaEval 2 evaluation of fine-tuned model
4. **MT-Bench Evaluation** - Multi-turn dialogue assessment (separate notebook)

## AlpacaEval 2 Framework
- **Repository**: https://github.com/tatsu-lab/alpaca_eval
- **Purpose**: Standardized evaluation dataset for instruction-following models
- **Dataset**: 805 diverse instruction-following examples
- **Method**: Generate responses with our model, then use automated judge (GPT-4) to compare against reference responses
- **Metrics**: Win rate (how often our model's response is preferred over reference)
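
To make the win-rate metric concrete, here is a minimal sketch of the arithmetic. It assumes a hypothetical list of per-instruction judgments encoded as 1.0 (our model preferred), 0.0 (reference preferred), or 0.5 (tie); the function name and this encoding are illustrative and not part of the `alpaca_eval` API. AlpacaEval's own scoring is more involved (it also reports a length-controlled win rate), so treat this only as an illustration of the basic idea.

```python
def win_rate(preferences):
    """Return the percentage of comparisons won by our model.

    `preferences` is a hypothetical list with one entry per instruction:
    1.0 if the judge preferred our model, 0.0 if it preferred the
    reference, and 0.5 for a tie.
    """
    if not preferences:
        raise ValueError("no comparisons to score")
    return 100.0 * sum(preferences) / len(preferences)


# Illustrative judgments only, not real evaluation results.
judgments = [1.0, 0.0, 1.0, 0.5]
print(f"win rate: {win_rate(judgments):.1f}%")  # prints "win rate: 62.5%"
```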

## Expected Improvements
After LoRA fine-tuning on Dolly-15K, we expect to see:
- **Higher win rates** against reference responses
- **Better instruction following** quality
- **More helpful and coherent** responses
- **Improved formatting** and structure

## Technical Implementation
- **Model**: Fine-tuned LLaMA-2-7B with LoRA adapters
- **Process**:
  1. Load our fine-tuned model
  2. Generate responses to AlpacaEval dataset instructions
  3. Use GPT-4 as automated judge to compare our responses vs. reference responses
  4. Calculate win rate (percentage of times our response is preferred)
- **Comparison**: Fine-tuned vs. baseline model win rates

## Workflow
1. **Load fine-tuned model** from saved LoRA adapters
2. **Load AlpacaEval dataset** (805 instruction examples)
3. **Generate responses** using our fine-tuned model
4. **Run automated evaluation** using GPT-4 judge to compare against reference responses
5. **Calculate win rate** and compare with baseline model results
6. **Document metrics** for final report
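
The hand-off between steps 3 and 4 could look roughly like the sketch below. The AlpacaEval judge reads model outputs as a JSON list of records with `instruction` and `output` keys (plus a `generator` label). The helper name `to_alpaca_eval_records` and the generator label are assumptions made for illustration; the notebook itself stores outputs in a parquet file.

```python
import json


def to_alpaca_eval_records(instructions, outputs, generator):
    """Pair each instruction with its generated output in AlpacaEval's record layout."""
    if len(instructions) != len(outputs):
        raise ValueError("instructions and outputs must have the same length")
    return [
        {"instruction": inst, "output": out, "generator": generator}
        for inst, out in zip(instructions, outputs)
    ]


# Hypothetical example data; in the notebook these would come from
# the `instruction` and `finetuned_output` columns.
records = to_alpaca_eval_records(
    ["What is the capital of France?"],
    ["The capital of France is Paris."],
    generator="llama2-7b-dolly-lora",  # illustrative label
)
print(json.dumps(records, indent=2))
```

Writing `records` to disk with `json.dump` would give a file that can be passed to the judge as the model outputs.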

## Success Criteria
- **Higher win rate** than baseline model on AlpacaEval dataset
- **Measurable improvement** in instruction-following quality
- **Consistent performance** across different instruction types
- **Clear evidence** of fine-tuning effectiveness
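
One way to check consistency across instruction types is to break the win rate down by the `dataset` column of the evaluation set. The sketch below assumes hypothetical `(dataset, preference)` pairs using the same 1.0 / 0.0 / 0.5 encoding of judge decisions; the function name and sample values are illustrative, not real results.

```python
from collections import defaultdict


def win_rate_by_dataset(rows):
    """Return {dataset_name: win rate in percent} for (dataset, preference) pairs."""
    totals = defaultdict(lambda: [0.0, 0])  # dataset -> [sum of preferences, count]
    for dataset, preference in rows:
        totals[dataset][0] += preference
        totals[dataset][1] += 1
    return {name: 100.0 * won / count for name, (won, count) in totals.items()}


# Illustrative rows only.
rows = [
    ("helpful_base", 1.0),
    ("helpful_base", 0.0),
    ("koala", 1.0),
    ("koala", 1.0),
]
for name, rate in sorted(win_rate_by_dataset(rows).items()):
    print(f"{name}: {rate:.1f}%")
```

A large gap between subsets would suggest that the improvement is concentrated in a few instruction types rather than being consistent.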

---
**Note**: This evaluation is essential for determining whether LoRA fine-tuning on the Dolly-15K dataset improves the model's instruction-following capabilities, as measured on AlpacaEval.


In [None]:
!pip install git+https://github.com/tatsu-lab/alpaca_eval.git

In [None]:
!pip install -U transformers peft bitsandbytes

In [None]:
import json
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from huggingface_hub import hf_hub_url
from peft import PeftModel
from tqdm.notebook import tqdm
import pandas as pd
import polars as pl
import os
from google.colab import drive

In [None]:
drive.mount('/content/drive', force_remount=True)

In [None]:
from huggingface_hub import login
login(new_session=False)

In [None]:
#Load the fine-tuned model
model_id = "meta-llama/Llama-2-7b-hf"
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,
)
base_model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map="auto",
    trust_remote_code=True
)

adapter_path = "/content/drive/MyDrive/LLaMA2-Dolly-Training/results/final_lora_adapter"
print(f"Loading Fine-Tuned LoRA adapter from: {adapter_path}...")
model = PeftModel.from_pretrained(base_model, adapter_path)

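tokenizer = AutoTokenizer.from_pretrained(adapter_path)
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token
# Decoder-only models should be left-padded for generation so that new tokens
# directly follow the prompt; this matters once BATCH_SIZE > 1.
tokenizer.padding_side = "left"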

print(model)



In [None]:
# Do a simple smoke-test run with the fine-tuned model
prompt = "What is the capital of France?"

inputs = tokenizer(prompt, return_tensors="pt")
inputs = inputs.to(model.device)
generate_ids = model.generate(
    **inputs,
    max_new_tokens=50,
    pad_token_id=tokenizer.eos_token_id
)
generated_text = tokenizer.decode(generate_ids[0], skip_special_tokens=True)
print(generated_text)

In [None]:
output_dir = "/content/drive/MyDrive/LLaMA2-Dolly-Training/outputs"
baseline_parquet_path = os.path.join(output_dir, "baseline_model_outputs.parquet")

eval_set_with_baseline = pl.read_parquet(baseline_parquet_path)

eval_set_with_baseline.head(10)

In [None]:
instructions = eval_set_with_baseline.get_column("instruction").to_list()
finetuned_outputs = []
BATCH_SIZE = 1
print(f"\nGenerating {len(instructions)} outputs from fine-tuned model in batches of {BATCH_SIZE}...")

PROMPT_TEMPLATE = """### Instruction:
{instruction}

### Response:
"""

for i in tqdm(range(0, len(instructions), BATCH_SIZE)):
    batch_instructions = instructions[i : i + BATCH_SIZE]
    prompts = [PROMPT_TEMPLATE.format(instruction=inst) for inst in batch_instructions]

    inputs = tokenizer(
        prompts,
        return_tensors="pt",
        padding=True,
        truncation=True,
        max_length=1024
    ).to(model.device)

    with torch.no_grad():
        generate_ids = model.generate(
            **inputs,
            max_new_tokens=1024,
            temperature=0.7,
            top_p=0.9,
            do_sample=True,
            pad_token_id=tokenizer.eos_token_id,
            eos_token_id=tokenizer.eos_token_id
        )

    output_tokens = generate_ids[:, inputs.input_ids.shape[1]:]
    batch_outputs = tokenizer.batch_decode(output_tokens, skip_special_tokens=True)
    finetuned_outputs.extend(batch_outputs)

eval_set_complete = eval_set_with_baseline.with_columns(
    pl.Series("finetuned_output", finetuned_outputs)
)

eval_set_complete.head()

In [None]:
random_row_with_output = eval_set_complete.sample(n=1)
print(f"DATASET: {random_row_with_output['dataset'][0]}")
print(f"INSTRUCTION: {random_row_with_output['instruction'][0]}")
print(f"GENERATOR: {random_row_with_output['generator'][0]}")
print(f"OUTPUT:\n {random_row_with_output['output'][0]}")

print(f"BASELINE:\n {random_row_with_output['baseline_output'][0]}")

print(f"FINE-TUNED:\n {random_row_with_output['finetuned_output'][0]}")

In [None]:
output_dir = "/content/drive/MyDrive/LLaMA2-Dolly-Training/outputs"
combined_parquet_path = os.path.join(output_dir, "eval_outputs_combined.parquet")
os.makedirs(output_dir, exist_ok=True)
print(f"Saving combined DataFrame (baseline + fine-tuned) to: {combined_parquet_path}")
eval_set_complete.write_parquet(combined_parquet_path)
!ls -lh "{combined_parquet_path}"