# RLHF Evaluation Workflow
This notebook demonstrates an end-to-end evaluation workflow for models fine-tuned with RLHF, DPO, or other alignment techniques. It covers:

1. Generating outputs for a set of prompts
2. Running Preference Proxy Evaluation (PPE) with a reward model
3. Performing evaluation using Hugging Face's `evaluate` library
4. Optionally: scoring with a simulated LLM-as-a-Judge

We'll use the `transformers`, `trl`, `datasets`, and `evaluate` libraries.

In [None]:
# Install required packages (rouge_score is needed by the ROUGE metric in Step 5)
!pip install transformers datasets evaluate trl rouge_score --quiet

## Step 1: Load or Define Prompts

In [None]:
from datasets import Dataset

prompts = [
    {'prompt': 'Explain why the sky is blue.'},
    {'prompt': 'Summarize the key differences between Python and Java.'},
    {'prompt': 'What are the ethical considerations in AI alignment?'}
]
prompt_dataset = Dataset.from_list(prompts)
prompt_dataset
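
The heading mentions loading prompts, but the cell above only defines them inline. A minimal sketch of the loading path, assuming prompts are stored one JSON object per line with a `prompt` field. The helper name `load_prompts` and the file layout are illustrative assumptions, not part of the original notebook.

```python
import json

def load_prompts(lines):
    """Parse JSON Lines into the list-of-dicts shape used for `prompts`.

    `lines` is any iterable of strings, such as an open file handle.
    """
    records = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        record = json.loads(line)
        if 'prompt' not in record:
            raise ValueError(f"missing 'prompt' field: {line!r}")
        # Keep only the field the rest of the workflow reads.
        records.append({'prompt': record['prompt']})
    return records

# Hypothetical sample input standing in for the contents of a .jsonl file.
sample_lines = [
    '{"prompt": "Explain why the sky is blue."}',
    '',
    '{"prompt": "What are the ethical considerations in AI alignment?"}',
]
loaded_prompts = load_prompts(sample_lines)
print(loaded_prompts)
```

The returned list can be passed to `Dataset.from_list` in the same way as the inline list. To read from disk, pass an open file handle in place of the list of strings.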

## Step 2: Generate Outputs from Two Models (e.g., Base vs Aligned)

In [None]:
from transformers import pipeline

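# Stand-ins for illustration: neither GPT-2 checkpoint is alignment-tuned.
# Swap in your own base and aligned checkpoints here.
base_model = pipeline('text-generation', model='gpt2')
aligned_model = pipeline('text-generation', model='gpt2-medium')

def generate_outputs(example):
    # return_full_text=False keeps only the continuation, so the prompt text
    # does not leak into the reward scores or the ROUGE overlap computed later.
    example['base_output'] = base_model(example['prompt'], max_new_tokens=60, return_full_text=False)[0]['generated_text']
    example['aligned_output'] = aligned_model(example['prompt'], max_new_tokens=60, return_full_text=False)[0]['generated_text']
    return example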

generated_dataset = prompt_dataset.map(generate_outputs)
generated_dataset

## Step 3: Preference Proxy Evaluation (Mock Reward Model)

In [None]:
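def mock_reward_model(text):
    # Placeholder proxy: word count divided by 50. It rewards length only and
    # is not bounded to [0, 1]; replace it with a trained reward model.
    return len(text.split()) / 50.0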

def score_outputs(example):
    example['reward_base'] = mock_reward_model(example['base_output'])
    example['reward_aligned'] = mock_reward_model(example['aligned_output'])
    return example

scored_dataset = generated_dataset.map(score_outputs)
scored_dataset
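
Because the mock reward depends only on word count, it favors longer outputs regardless of content. The sketch below makes that bias concrete. The two example strings are invented for illustration, and the function is redefined here only so the snippet runs on its own.

```python
def mock_reward_model(text):
    # Same length-based proxy as in the cell above: word count divided by 50.
    return len(text.split()) / 50.0

# A short, informative answer versus 40 filler words with no content.
concise = 'Air molecules scatter blue light more than red light.'
padded = ' '.join(['well'] * 40)

reward_concise = mock_reward_model(concise)
reward_padded = mock_reward_model(padded)

# The padded text scores higher even though it says nothing.
print(f'concise: {reward_concise:.2f}, padded: {reward_padded:.2f}')
```

Any win rate computed from this proxy therefore reflects output length, not preference quality, until a real reward model is substituted.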

## Step 4: Compare Scores and Compute Win Rates

In [None]:
# Count prompts where the aligned output scores strictly higher; ties count as non-wins.
wins = sum(ex['reward_aligned'] > ex['reward_base'] for ex in scored_dataset)
print(f"Aligned model wins {wins} out of {len(scored_dataset)} prompts")
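
The count above folds ties into losses. A minimal sketch that reports wins, ties, and losses separately, along with a win rate over decided comparisons. The helper name `summarize_wins` and the example scores are hypothetical; the function accepts any sequence of rows with `reward_base` and `reward_aligned` fields.

```python
def summarize_wins(rows):
    """Tally wins, ties, and losses for the aligned model from scored rows."""
    wins = sum(r['reward_aligned'] > r['reward_base'] for r in rows)
    ties = sum(r['reward_aligned'] == r['reward_base'] for r in rows)
    losses = len(rows) - wins - ties
    decided = wins + losses
    return {
        'wins': wins,
        'ties': ties,
        'losses': losses,
        # Win rate over decided comparisons; None when every prompt tied.
        'win_rate': wins / decided if decided else None,
    }

# Made-up scores for illustration only.
example_rows = [
    {'reward_base': 0.5, 'reward_aligned': 0.9},
    {'reward_base': 0.7, 'reward_aligned': 0.7},
    {'reward_base': 0.8, 'reward_aligned': 0.4},
    {'reward_base': 0.2, 'reward_aligned': 0.6},
]
summary = summarize_wins(example_rows)
print(summary)
```

With only three prompts, as in this notebook, any win rate is too noisy to support conclusions; the breakdown is most useful on larger prompt sets.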

## Step 5: Evaluate with Hugging Face `evaluate` Library (e.g., BLEU, ROUGE)

In [None]:
import evaluate
rouge = evaluate.load('rouge')

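# No human-written references exist here, so the base outputs stand in as references.
# The score therefore measures lexical overlap between the two models, not answer quality.
results = rouge.compute(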
    predictions=[ex['aligned_output'] for ex in scored_dataset],
    references=[ex['base_output'] for ex in scored_dataset]
)
results
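
To give a feel for what an overlap score of this kind measures, here is a simplified unigram-overlap F1 in plain Python. It is an illustration only, not the implementation behind `evaluate.load('rouge')`, which applies its own tokenization and reports several variants, so the numbers will not match exactly. The function name `unigram_f1` is hypothetical.

```python
from collections import Counter

def unigram_f1(prediction, reference):
    """F1 over shared lowercase whitespace tokens, counting repeats at most
    as often as they appear in both texts."""
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    if not pred_tokens or not ref_tokens:
        return 0.0
    # Multiset intersection gives the clipped count of shared tokens.
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)

identical_score = unigram_f1('the sky is blue', 'the sky is blue')
partial_score = unigram_f1('the sky is blue', 'grass is green')
print(identical_score, partial_score)
```

A high score means the two texts share many words, which is why comparing aligned outputs against base outputs says how similar the models are, not which one is better.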