# RL-based Language Model Finetuning Demo

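This notebook demonstrates the pipeline for finetuning language models with Reinforcement Learning from Human Feedback (RLHF). It loads and evaluates a base model, then visualizes and compares results from models trained with the repository's scripts.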

## Overview
1. Setup and imports
2. Load base model
3. Test generation
4. Initialize reward model
5. Evaluate base model rewards
6. Visualize training progress
7. Compare base, SFT, and PPO models
8. Generate examples from trained models

## 1. Setup

In [None]:
import sys
sys.path.append('..')

import torch
import yaml
from src.models import PolicyLM, PolicyConfig, RewardModel
from src.ppo import PPOTrainer, PPOHyperParams
from src.utils import get_device

device = get_device()
print(f"Using device: {device}")

## 2. Load Base Model

In [None]:
# Load configuration
with open('../config/model_config.yaml', 'r') as f:
    model_config = yaml.safe_load(f)

# Initialize base model
base_policy = PolicyLM(PolicyConfig(
    model_name=model_config['model_name'],
    tokenizer_name=model_config['tokenizer_name'],
    max_length=model_config['max_length']
))

print("✓ Base model loaded successfully")

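The cell above indexes `model_config` by three keys, so a missing key surfaces only as a bare `KeyError`. A minimal sketch of an up-front check, assuming the YAML file is a flat mapping with exactly those keys. `REQUIRED_MODEL_KEYS` and `check_model_config` are hypothetical names that are not part of `src`, and the sample values are placeholders:

```python
# Hypothetical helper: not part of the src package.
REQUIRED_MODEL_KEYS = ("model_name", "tokenizer_name", "max_length")


def check_model_config(config):
    """Return the config unchanged, or raise a readable error listing missing keys."""
    if not isinstance(config, dict):
        raise TypeError(f"expected a mapping, got {type(config).__name__}")
    missing = [key for key in REQUIRED_MODEL_KEYS if key not in config]
    if missing:
        raise ValueError(f"model_config.yaml is missing keys: {', '.join(missing)}")
    if not isinstance(config["max_length"], int) or config["max_length"] <= 0:
        raise ValueError("max_length must be a positive integer")
    return config


# Placeholder values, for illustration only.
example_config = {"model_name": "gpt2", "tokenizer_name": "gpt2", "max_length": 64}
checked = check_model_config(example_config)
print(checked["max_length"])
```

Calling such a check right after `yaml.safe_load` would report every missing key at once instead of failing on the first lookup.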
## 3. Test Generation

In [None]:
# Test prompts
test_prompts = [
    "This movie was absolutely",
    "I really enjoyed the",
    "The acting in this film was"
]

# Generate completions
print("Base Model Generations:")
print("=" * 60)
responses = base_policy.generate(test_prompts, max_new_tokens=20)
for prompt, response in zip(test_prompts, responses):
    print(f"Prompt: {prompt}")
    print(f"Response: {response}")
    print("-" * 60)

## 4. Initialize Reward Model

In [None]:
# Load reward config
with open('../config/reward_config.yaml', 'r') as f:
    reward_config = yaml.safe_load(f)

# Initialize reward model
reward_model = RewardModel(
    model_name=reward_config['reward_model']['name'],
    w_sentiment=reward_config['weights']['sentiment'],
    w_repetition=reward_config['weights']['repetition'],
    w_length=reward_config['weights']['length'],
    min_tokens=reward_config['length_target']['min_tokens'],
    max_tokens=reward_config['length_target']['max_tokens']
)

print("✓ Reward model loaded successfully")

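The constructor arguments suggest that the reward combines a sentiment term, a repetition term, and a length term. The actual formula lives in `src.models` and is not shown in this notebook, so the following is only a sketch of one plausible weighted combination. `repetition_penalty`, `length_score`, and `combined_reward` are hypothetical names, and the sentiment score is passed in as a plain number instead of coming from a classifier:

```python
def repetition_penalty(tokens):
    """Fraction of tokens that duplicate another token (0.0 means no repeats)."""
    if not tokens:
        return 0.0
    return 1.0 - len(set(tokens)) / len(tokens)


def length_score(num_tokens, min_tokens, max_tokens):
    """1.0 inside the target range, decaying linearly toward 0.0 outside it."""
    if min_tokens <= num_tokens <= max_tokens:
        return 1.0
    if num_tokens < min_tokens:
        distance = min_tokens - num_tokens
    else:
        distance = num_tokens - max_tokens
    return max(0.0, 1.0 - distance / max(min_tokens, 1))


def combined_reward(sentiment, text, w_sentiment, w_repetition, w_length,
                    min_tokens, max_tokens):
    """Weighted sum: reward sentiment and on-target length, penalize repetition."""
    tokens = text.lower().split()  # whitespace split stands in for a real tokenizer
    return (w_sentiment * sentiment
            - w_repetition * repetition_penalty(tokens)
            + w_length * length_score(len(tokens), min_tokens, max_tokens))


# Assumed weights and sentiment scores, for illustration only.
varied = combined_reward(0.9, "a moving and beautifully shot film", 1.0, 0.5, 0.2, 5, 20)
looping = combined_reward(0.9, "great great great great great great", 1.0, 0.5, 0.2, 5, 20)
print(varied > looping)
```

With equal sentiment, the repetitive completion scores lower, which is the kind of behavior the repetition weight is presumably meant to encourage.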
## 5. Evaluate Base Model Rewards

In [None]:
# Compute rewards for base model generations
rewards = reward_model.compute_reward(responses)
sentiments = reward_model.sentiment_score(responses)

print("Base Model Evaluation:")
print("=" * 60)
for i, (response, reward, sentiment) in enumerate(zip(responses, rewards, sentiments)):
    print(f"Example {i+1}:")
    print(f"  Response: {response}")
    print(f"  Reward: {reward:.4f}")
    print(f"  Sentiment: {sentiment:.4f}")
    print("-" * 60)

print(f"\nAverage Reward: {rewards.mean():.4f}")
print(f"Average Sentiment: {sentiments.mean():.4f}")

## 6. Visualize Training Progress (if models are trained)

If you've already trained models, you can visualize the results:

In [None]:
import json
import matplotlib.pyplot as plt
import os

# Check if training stats exist
stats_path = '../models/policy_ppo/training_stats.json'
if os.path.exists(stats_path):
    with open(stats_path, 'r') as f:
        stats = json.load(f)
    
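    # Plot reward progression; use a separate name so the base-model
    # `rewards` tensor from section 5 is not overwritten
    ppo_rewards = [s['reward'] for s in stats]
    plt.figure(figsize=(10, 6))
    plt.plot(ppo_rewards, linewidth=2)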
    plt.xlabel('Update Step')
    plt.ylabel('Average Reward')
    plt.title('PPO Training: Reward Progression')
    plt.grid(True, alpha=0.3)
    plt.show()
else:
    print("No training stats found. Train the model first using:")
    print("  bash scripts/run_ppo.sh")

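Per-update PPO rewards tend to be noisy, so a smoothed curve is often easier to read than the raw one. A minimal sketch of a trailing moving average that could be plotted alongside the raw series. `moving_average` and its `window` parameter are hypothetical, and the sample values are made up:

```python
def moving_average(values, window=10):
    """Trailing mean over at most `window` points; output has the same length as input."""
    if window < 1:
        raise ValueError("window must be at least 1")
    smoothed = []
    running_sum = 0.0
    for i, value in enumerate(values):
        running_sum += value
        if i >= window:
            running_sum -= values[i - window]  # drop the point that left the window
        smoothed.append(running_sum / min(i + 1, window))
    return smoothed


# Made-up reward values, for illustration only.
raw = [0.1, 0.5, 0.2, 0.6, 0.4]
print(moving_average(raw, window=2))
```

Passing the list of per-step rewards from the training stats through such a function and plotting the result on the same axes would show the trend without hiding the raw variance.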
## 7. Compare Models (if trained)

Load saved evaluation results and compare the base, SFT, and PPO models:

In [None]:
# Load comparison results if available
comparison_path = '../results/comparison.json'
if os.path.exists(comparison_path):
    with open(comparison_path, 'r') as f:
        comparison = json.load(f)
    
    print("Model Comparison:")
    print("=" * 60)
    for model_name, metrics in comparison['summary'].items():
        print(f"\n{model_name.upper()} Model:")
        for metric, value in metrics.items():
            print(f"  {metric}: {value:.4f}")
    
    # Visualize comparison; use separate names so the per-example
    # `rewards` and `sentiments` from section 5 are not overwritten
    models = list(comparison['summary'].keys())
    mean_rewards = [comparison['summary'][m]['mean_reward'] for m in models]
    mean_sentiments = [comparison['summary'][m]['mean_sentiment'] for m in models]
    
    fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(14, 5))
    
    ax1.bar(models, mean_rewards, color=['blue', 'orange', 'green'])
    ax1.set_ylabel('Average Reward')
    ax1.set_title('Mean Reward Comparison')
    ax1.grid(True, alpha=0.3, axis='y')
    
    ax2.bar(models, mean_sentiments, color=['blue', 'orange', 'green'])
    ax2.set_ylabel('Average Sentiment')
    ax2.set_title('Mean Sentiment Comparison')
    ax2.grid(True, alpha=0.3, axis='y')
    
    plt.tight_layout()
    plt.show()
else:
    print("No comparison results found. Run evaluation first using:")
    print("  bash scripts/run_eval.sh --base --sft models/policy_sft --ppo models/policy_ppo/final")

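Raw means are often easier to interpret as differences from the base model. The sketch below assumes that `comparison['summary']` maps model names to metric dictionaries, as the cell above does, and that one of the keys is `'base'`. That key name, the helper `gains_over_base`, and the numbers are assumptions:

```python
def gains_over_base(summary, baseline="base"):
    """Return {model: {metric: value - baseline value}} for every non-baseline model."""
    if baseline not in summary:
        raise KeyError(f"baseline {baseline!r} not found in summary")
    base_metrics = summary[baseline]
    gains = {}
    for model, metrics in summary.items():
        if model == baseline:
            continue
        gains[model] = {
            name: value - base_metrics[name]
            for name, value in metrics.items()
            if name in base_metrics  # skip metrics the baseline does not report
        }
    return gains


# Made-up numbers, for illustration only.
example_summary = {
    "base": {"mean_reward": 0.25, "mean_sentiment": 0.5},
    "ppo": {"mean_reward": 0.75, "mean_sentiment": 0.875},
}
print(gains_over_base(example_summary))
```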
## 8. Generate Examples from Trained Models

Compare generations from the base and PPO models side by side:

In [None]:
# Load trained models if available
ppo_model_path = '../models/policy_ppo/final'

if os.path.exists(ppo_model_path):
    ppo_policy = PolicyLM(PolicyConfig(
        model_name=ppo_model_path,
        tokenizer_name=ppo_model_path,
        max_length=64
    ))
    
    print("Comparing Base vs PPO Model:")
    print("=" * 80)
    
    for prompt in test_prompts:
        base_gen = base_policy.generate([prompt], max_new_tokens=20)[0]
        ppo_gen = ppo_policy.generate([prompt], max_new_tokens=20)[0]
        
        base_reward = reward_model.compute_reward([base_gen])[0]
        ppo_reward = reward_model.compute_reward([ppo_gen])[0]
        
        print(f"\nPrompt: {prompt}")
        print("-" * 80)
        print(f"Base:  {base_gen}")
        print(f"       Reward: {base_reward:.4f}")
        print()
        print(f"PPO:   {ppo_gen}")
        print(f"       Reward: {ppo_reward:.4f}")
        print("=" * 80)
else:
    print("PPO model not found. Train it first using:")
    print("  bash scripts/run_ppo.sh")

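Three prompts give only an anecdotal picture, so it can help to summarize paired rewards over a larger prompt set as a single number. A sketch of a per-prompt win rate, assuming the rewards have already been converted to plain Python floats. `win_rate` is a hypothetical helper and the values are made up:

```python
def win_rate(base_rewards, tuned_rewards):
    """Share of prompts where the tuned model scores higher; ties count as half a win."""
    if len(base_rewards) != len(tuned_rewards):
        raise ValueError("reward lists must have the same length")
    if not base_rewards:
        raise ValueError("reward lists must not be empty")
    score = 0.0
    for base, tuned in zip(base_rewards, tuned_rewards):
        if tuned > base:
            score += 1.0
        elif tuned == base:
            score += 0.5
    return score / len(base_rewards)


# Made-up rewards for three prompts, for illustration only.
example_base = [0.25, 0.5, 0.75]
example_tuned = [0.5, 0.5, 0.5]
print(win_rate(example_base, example_tuned))
```

Because generation is typically sampled, a win rate computed over many prompts is a steadier signal than any single side-by-side pair.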
## Next Steps

To train the models, evaluate them, and generate plots:

```bash
# 1. Supervised finetuning (optional baseline)
bash scripts/run_sft.sh

# 2. PPO training
bash scripts/run_ppo.sh

# 3. Evaluate all models
bash scripts/run_eval.sh --base --sft models/policy_sft --ppo models/policy_ppo/final

# 4. Generate plots
python -m src.utils.plotting \
  --training_stats models/policy_ppo/training_stats.json \
  --comparison results/comparison.json \
  --output plots
```
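
For readers who want to see what the PPO step optimizes, here is a scalar sketch of the standard clipped surrogate objective. It illustrates the general technique only; the `PPOTrainer` in `src.ppo` may add a KL penalty, a value loss, or other terms not shown here, and `clipped_surrogate` and `clip_eps` are hypothetical names:

```python
import math


def clipped_surrogate(logp_new, logp_old, advantage, clip_eps=0.2):
    """PPO clipped objective for a single action (to be maximized)."""
    ratio = math.exp(logp_new - logp_old)  # pi_new(a|s) / pi_old(a|s)
    clipped_ratio = max(1.0 - clip_eps, min(1.0 + clip_eps, ratio))
    # Taking the minimum removes any incentive to move the ratio outside the clip range.
    return min(ratio * advantage, clipped_ratio * advantage)


# Unchanged policy: the objective equals the advantage.
print(clipped_surrogate(-1.0, -1.0, 2.0))
# Large ratio with a positive advantage: the gain is capped at (1 + clip_eps) * advantage.
print(clipped_surrogate(0.0, -1.0, 1.0))
```

With a negative advantage and a large ratio the unclipped term is the smaller one, so the objective still penalizes the update in full.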