# AlpacaEval 2 Benchmark Evaluation

## Purpose
This notebook implements **AlpacaEval 2** evaluation to assess the instruction-following quality of our fine-tuned LLaMA-2-7B model. It is a required part of the evaluation framework for [Assignment 7](../../tasks/Assignment7.md), where it is used to demonstrate the effectiveness of our LoRA fine-tuning approach.

## Evaluation Context
This notebook is the third step of the complete fine-tuning pipeline:

1. **[Baseline Model Evaluation](baseline_model.ipynb)** - Established baseline performance metrics
2. **[Fine-tuning Process](finetune_model.ipynb)** - LoRA fine-tuning on the Dolly-15K dataset
3. **This Notebook** - AlpacaEval 2 evaluation of the fine-tuned model
4. **MT-Bench Evaluation** - Multi-turn dialogue assessment (separate notebook)

## AlpacaEval 2 Framework
- **Repository**: https://github.com/tatsu-lab/alpaca_eval
- **Purpose**: Standardized evaluation dataset for instruction-following models
- **Dataset**: 805 diverse instruction-following examples
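- **Method**: Generate responses with our model, then have an automated judge (GPT-4) compare them pairwise against reference responses. This notebook uses the `weighted_alpaca_eval_gpt4_turbo` annotator config and passes the baseline model's outputs as the reference.
- **Metrics**: Win rate, the percentage of comparisons in which the judge prefers our model's response over the reference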
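
To make the win-rate metric concrete, here is a minimal sketch of how it can be computed from per-example judge preferences. It assumes the convention that a preference of `1` means the reference was preferred, `2` means our model was preferred, and values in between (for example `1.5` for a tie, or a fractional value from a weighted annotator) count as partial wins. The function name `win_rate` is hypothetical and is not part of the `alpaca_eval` package.

```python
from typing import Sequence


def win_rate(preferences: Sequence[float]) -> float:
    """Return the win rate in percent for preferences in the range [1, 2].

    Assumed convention: 1 = reference preferred, 2 = our model preferred,
    intermediate values = partial credit.
    """
    if not preferences:
        raise ValueError("need at least one preference")
    for p in preferences:
        if not 1.0 <= p <= 2.0:
            raise ValueError(f"preference out of range: {p}")
    # Shift each preference to [0, 1] and average.
    return 100.0 * sum(p - 1.0 for p in preferences) / len(preferences)


# Two wins, one loss, one tie.
print(win_rate([2, 2, 1, 1.5]))  # 62.5
```

Under this convention a win rate of 50% means the judge has no overall preference between the two models.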

## Expected Improvements
After LoRA fine-tuning on Dolly-15K, we expect to see:
- **Higher win rates** against reference responses
- **Better instruction following** quality
- **More helpful and coherent** responses
- **Improved formatting** and structure

## Technical Implementation
- **Model**: Fine-tuned LLaMA-2-7B with LoRA adapters
- **Process**:
  1. Load our fine-tuned model
  2. Generate responses to AlpacaEval dataset instructions
  3. Use GPT-4 as automated judge to compare our responses vs. reference responses
  4. Calculate win rate (percentage of times our response is preferred)
- **Comparison**: Fine-tuned vs. baseline model win rates

## Workflow
1. **Load fine-tuned model** from saved LoRA adapters
2. **Load AlpacaEval dataset** (805 instruction examples)
3. **Generate responses** using our fine-tuned model
4. **Run automated evaluation** using GPT-4 judge to compare against reference responses
5. **Calculate win rate** and compare with baseline model results
6. **Document metrics** for final report

## Success Criteria
- **Higher win rate** than baseline model on AlpacaEval dataset
- **Measurable improvement** in instruction-following quality
- **Consistent performance** across different instruction types
- **Clear evidence** of fine-tuning effectiveness
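
Whether an improvement is "measurable" depends on how many examples were judged. The sketch below reports a win rate together with its sample standard error, under the same assumed preference convention (`1` = reference preferred, `2` = our model preferred). The helper name `win_rate_with_se` is hypothetical.

```python
import math
from typing import Sequence, Tuple


def win_rate_with_se(preferences: Sequence[float]) -> Tuple[float, float]:
    """Return (win rate, standard error), both in percent."""
    n = len(preferences)
    if n < 2:
        raise ValueError("need at least two preferences to estimate spread")
    wins = [p - 1.0 for p in preferences]  # shift to [0, 1]
    mean = sum(wins) / n
    # Sample variance (n - 1 in the denominator), then standard error of the mean.
    variance = sum((w - mean) ** 2 for w in wins) / (n - 1)
    return 100.0 * mean, 100.0 * math.sqrt(variance / n)


# Hypothetical judgments for ten examples: seven wins, three losses.
rate, se = win_rate_with_se([2, 2, 2, 1, 1, 2, 2, 1, 2, 2])
print(f"win rate = {rate:.1f}% +/- {se:.1f}%")
```

On a sample of only ten examples the standard error is roughly 15 percentage points, so a 70% win rate is weak evidence on its own. A larger evaluation set narrows the interval.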

---
**Note**: This evaluation is essential for demonstrating that LoRA fine-tuning on the Dolly-15K dataset improves the model's instruction-following capabilities.


In [1]:
!pip install git+https://github.com/tatsu-lab/alpaca_eval.git

Collecting git+https://github.com/tatsu-lab/alpaca_eval.git
  Cloning https://github.com/tatsu-lab/alpaca_eval.git to /tmp/pip-req-build-ve46jucq
  Running command git clone --filter=blob:none --quiet https://github.com/tatsu-lab/alpaca_eval.git /tmp/pip-req-build-ve46jucq
  Resolved https://github.com/tatsu-lab/alpaca_eval.git to commit cd543a149df89434d8a54582c0151c0b945c3d20
  Preparing metadata (setup.py) ... done
Collecting fire (from alpaca_eval==0.6.6)
  Downloading fire-0.7.1-py3-none-any.whl.metadata (5.8 kB)
Downloading fire-0.7.1-py3-none-any.whl (115 kB)
Building wheels for collected packages: alpaca_eval
  Building wheel for alpaca_eval (setup.py) ... done
  Created wheel for alpaca_eval: filename=alpaca_eval-0.6.6-py3-none-any.whl size=362273 sha256=83f909a3de2e6dd66d48bbe3c869cb53d0ceabb5f5988167e204c4ac5ab1f08a
  Stored in

In [6]:
import polars as pl
import json
import os
from google.colab import drive

In [9]:
from google.colab import drive
drive.mount('/content/drive', force_remount=True)

Drive not mounted, so nothing to flush and unmount.
Drive unmounted successfully.
Removed existing /content/drive directory.
Mounted at /content/drive


In [11]:
output_dir = "/content/drive/MyDrive/LLaMA2-Dolly-Training/outputs"
os.makedirs(output_dir, exist_ok=True)
combined_parquet_path = os.path.join(output_dir, "eval_outputs_combined.parquet")
eval_set_complete = pl.read_parquet(combined_parquet_path)
print(f"✓ Loaded {eval_set_complete.height} records.")


✓ Loaded 10 records.


In [32]:
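def save_for_alpaca_eval(df: pl.DataFrame, output_col_name: str, file_path: str, generator_name: str):
    """Write one model's outputs as a JSON list of AlpacaEval records.

    Each record has the keys "instruction", "output", and "generator".
    Problems are reported with print() and the function returns without raising.
    """
    # Both columns are required; stop early if either is missing.
    if output_col_name not in df.columns:
        print(f"Error: Column '{output_col_name}' not found in DataFrame.")
        return
    if "instruction" not in df.columns:
        print("Error: Column 'instruction' not found in DataFrame.")
        return

    # Keep the instruction, rename the model's column to "output",
    # and tag every row with the generator name.
    selected_df = df.select([
        "instruction",
        pl.col(output_col_name).alias("output"),
        pl.lit(generator_name).alias("generator")
    ])
    output_list = selected_df.to_dicts()

    # ensure_ascii=False keeps non-ASCII characters readable in the file.
    try:
        with open(file_path, 'w', encoding='utf-8') as f:
            json.dump(output_list, f, indent=4, ensure_ascii=False)
        print(f"✓ Saved outputs for AlpacaEval to: {file_path}")
    except Exception as e:
        print(f"Error saving JSON file '{file_path}': {e}")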

In [34]:
baseline_json_path = os.path.join(output_dir, "alpaca_eval_baseline_outputs.json")
finetuned_json_path = os.path.join(output_dir, "alpaca_eval_finetuned_outputs.json")

save_for_alpaca_eval(eval_set_complete, "baseline_output", baseline_json_path, "Llama2-7B-Baseline")
save_for_alpaca_eval(eval_set_complete, "finetuned_output", finetuned_json_path, "Llama2-7B-Dolly-QLoRA")

✓ Saved outputs for AlpacaEval to: /content/drive/MyDrive/LLaMA2-Dolly-Training/outputs/alpaca_eval_baseline_outputs.json
✓ Saved outputs for AlpacaEval to: /content/drive/MyDrive/LLaMA2-Dolly-Training/outputs/alpaca_eval_finetuned_outputs.json
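
Before spending judge API calls, it can help to confirm that the two saved files can be paired. The sketch below assumes that model and reference records are matched by their `instruction` text and that empty outputs are unwanted. `check_aligned` is a hypothetical helper, not part of `alpaca_eval`, and the records are tiny in-memory stand-ins for what `json.load()` would return from the two files.

```python
from typing import Dict, List


def check_aligned(model_records: List[Dict[str, str]],
                  reference_records: List[Dict[str, str]]) -> None:
    """Raise ValueError if the two record lists cannot be paired by instruction."""
    model_instructions = [r["instruction"] for r in model_records]
    reference_instructions = [r["instruction"] for r in reference_records]

    # Duplicate instructions would make the pairing ambiguous.
    if len(set(model_instructions)) != len(model_instructions):
        raise ValueError("model records contain duplicate instructions")
    if set(model_instructions) != set(reference_instructions):
        raise ValueError("model and reference records cover different instructions")

    # An empty response usually signals a generation failure upstream.
    for label, records in (("model", model_records), ("reference", reference_records)):
        for record in records:
            if not str(record.get("output", "")).strip():
                raise ValueError(f"{label} record has an empty output: {record['instruction']!r}")


# Illustrative records only; real ones would be loaded from the saved JSON files.
baseline = [{"instruction": "Say hi.", "output": "Hi.", "generator": "base"}]
finetuned = [{"instruction": "Say hi.", "output": "Hello!", "generator": "tuned"}]
check_aligned(finetuned, baseline)
print("records are aligned")
```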


In [35]:
eval_results_path = os.path.join(output_dir, "eval_results_finetuned_vs_baseline_WeightedGPT4t")
os.makedirs(eval_results_path, exist_ok=True)
# Replace the placeholder with a real key before running; never commit a real key to the notebook.
os.environ["OPENAI_API_KEY"] = "PLACEHOLDER"
!alpaca_eval \
    --model_outputs "{finetuned_json_path}" \
    --reference_outputs "{baseline_json_path}" \
    --annotators_config "weighted_alpaca_eval_gpt4_turbo" \
    --output_path "{eval_results_path}" \
    --name "Llama2-7B-Dolly-QLoRA_vs_Baseline_WeightedGPT4t"

INFO:root:Evaluating the Llama2-7B-Dolly-QLoRA_vs_Baseline_WeightedGPT4t outputs.
INFO:root:Creating the annotator from `weighted_alpaca_eval_gpt4_turbo`.
INFO:root:Saving annotations to `/usr/local/lib/python3.12/dist-packages/alpaca_eval/evaluators_configs/weighted_alpaca_eval_gpt4_turbo/annotations_seed0_configs.json`.
INFO:root:Loading all annotations from /usr/local/lib/python3.12/dist-packages/alpaca_eval/evaluators_configs/weighted_alpaca_eval_gpt4_turbo/annotations_seed0_configs.json.
Annotation chunk:   0% 0/1 [00:00<?, ?it/s]INFO:root:Annotating 0 examples with weighted_alpaca_eval_gpt4_turbo
INFO:root:Saving all annotations to /usr/local/lib/python3.12/dist-packages/alpaca_eval/evaluators_configs/weighted_alpaca_eval_gpt4_turbo/annotations_seed0_configs.json.
INFO:root:Loading all annotations from /usr/local/lib/python3.12/dist-packages/alpaca_eval/evaluators_configs/weighted_alpaca_eval_gpt4_turbo/annotations_seed0_configs.json.
Annotation chunk: 100% 1/1 [00:00<00:00, 34.4