In [None]:
# 🔧 Setup: Run this cell first!
# Check GPU availability and install dependencies

import torch
import sys

# Check GPU
if torch.cuda.is_available():
    device = torch.device('cuda')
    print(f"✅ GPU available: {torch.cuda.get_device_name(0)}")
    print(f"   Memory: {torch.cuda.get_device_properties(0).total_memory / 1e9:.1f} GB")
else:
    device = torch.device('cpu')
    print("⚠️ No GPU detected. Some cells may run slowly.")
    print("   Go to Runtime → Change runtime type → GPU")

print(f"\n📦 Python {sys.version.split()[0]}")
print(f"🔥 PyTorch {torch.__version__}")

# Set random seeds for reproducibility
import random
import numpy as np

SEED = 42
random.seed(SEED)
np.random.seed(SEED)
torch.manual_seed(SEED)
if torch.cuda.is_available():
    torch.cuda.manual_seed_all(SEED)

print(f"🎲 Random seed set to {SEED}")

%matplotlib inline

# End-to-End RLHF Pipeline — Sentiment Alignment with GPT-2

**Vizuara AI**

In this notebook, we bring everything together. We will take a pretrained GPT-2 model and align it to generate positive-sentiment text using RLHF. This combines the reward model from Notebook 1 and the PPO algorithm from Notebook 2 into a complete, working pipeline.

By the end, you will have a GPT-2 model that has been steered towards generating positive-sentiment completions through reinforcement learning from a reward signal.


## 1. Why Does This Matter?

This notebook is the capstone: it shows how the pieces of RLHF fit together in practice. We will use a real GPT-2 model (not a toy), a concrete reward function (keyword-based sentiment scoring), and a KL-penalized policy-gradient update that serves as a simplified stand-in for full PPO.

The task is deliberately simple (steer text towards positive sentiment) so we can focus on understanding the mechanics rather than fighting with scale. The same basic recipe underlies RLHF for much larger models such as ChatGPT and Claude; the main differences are model size, a learned reward model in place of keyword matching, a full PPO-style optimizer, and compute budget.

In [None]:
# Setup — this requires a GPU for reasonable speed
!pip install torch transformers datasets -q

import torch
import torch.nn as nn
import torch.nn.functional as F
from transformers import GPT2LMHeadModel, GPT2Tokenizer
import numpy as np
import matplotlib.pyplot as plt
from collections import deque
import copy
import warnings
warnings.filterwarnings('ignore')

torch.manual_seed(42)
np.random.seed(42)

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f"Using device: {device}")
if device.type == 'cuda':
    print(f"GPU: {torch.cuda.get_device_name(0)}")

## 2. Building Intuition

The RLHF pipeline has three stages that build on each other:

1. **Start with a pretrained LLM** (GPT-2) — it can generate coherent text but has no notion of "good" vs "bad"
2. **Define a reward signal** — we use sentiment analysis as a proxy for human preference (positive = good)
3. **Optimize with PPO + KL penalty** — nudge the model towards higher rewards while staying close to the original

Think of it like training a musician. Stage 1 is learning to play notes (language ability). Stage 2 is having a music critic who scores performances (reward model). Stage 3 is the musician practicing and improving based on the critic's feedback, while keeping their own musical style (KL penalty).

Let us start by loading our base model and seeing what it generates without any alignment.

In [None]:
# Load GPT-2 (small, 124M parameters)
model_name = "gpt2"
tokenizer = GPT2Tokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token

# Active model (will be optimized)
model = GPT2LMHeadModel.from_pretrained(model_name).to(device)

# Reference model (frozen — for KL penalty)
ref_model = GPT2LMHeadModel.from_pretrained(model_name).to(device)
ref_model.eval()
for param in ref_model.parameters():
    param.requires_grad = False

print(f"Model parameters: {sum(p.numel() for p in model.parameters()):,}")
print(f"Reference model frozen: {not any(p.requires_grad for p in ref_model.parameters())}")

In [None]:
# Generate some baseline completions (before RLHF)
def generate_text(model, tokenizer, prompts, max_new_tokens=30):
    """Generate completions for a list of prompts."""
    model.eval()
    completions = []
    with torch.no_grad():
        for prompt in prompts:
            inputs = tokenizer(prompt, return_tensors="pt").to(device)
            outputs = model.generate(
                **inputs,
                max_new_tokens=max_new_tokens,
                do_sample=True,
                temperature=0.7,
                top_k=50,
                pad_token_id=tokenizer.eos_token_id,
            )
            completion = tokenizer.decode(outputs[0], skip_special_tokens=True)
            completions.append(completion)
    return completions

test_prompts = [
    "Today I feel",
    "The best thing about",
    "My favorite memory is",
    "I am excited because",
]

print("=== Baseline GPT-2 Completions (Before RLHF) ===\n")
baseline_completions = generate_text(model, tokenizer, test_prompts)
for prompt, completion in zip(test_prompts, baseline_completions):
    print(f"Prompt: '{prompt}'")
    print(f"Completion: {completion}\n")

## 3. The Mathematics

The complete RLHF objective combines three components:

**1. Reward from the reward model:**
$$r_{\text{RM}}(x, y) = \text{SentimentScore}(x + y)$$

**2. KL penalty against the reference model:**
$$\text{KL}_t = \log \pi_\theta(y_t | s_t) - \log \pi_{\text{ref}}(y_t | s_t)$$

**3. Total per-token reward:**
$$r_t = \begin{cases} -\beta \cdot \text{KL}_t & \text{for } t < T \\ r_{\text{RM}}(x, y) - \beta \cdot \text{KL}_t & \text{for } t = T \end{cases}$$
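The quantity $\text{KL}_t$ is a log-ratio evaluated at the sampled token, so a single value can be negative; it is the average over tokens sampled from $\pi_\theta$ that approximates the true KL divergence. A minimal sketch with two made-up three-token distributions (`pi_theta` and `pi_ref` are illustrative values, not taken from GPT-2):

```python
import math
import random

# Toy next-token distributions over a 3-token vocabulary (illustrative values).
pi_theta = [0.6, 0.3, 0.1]
pi_ref = [0.4, 0.4, 0.2]

def exact_kl(p, q):
    """KL(p || q) = sum_i p_i * log(p_i / q_i)."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))

def sampled_log_ratio(p, q, n_samples, seed=0):
    """Average of log p(y) - log q(y) over tokens y sampled from p."""
    rng = random.Random(seed)
    tokens = rng.choices(range(len(p)), weights=p, k=n_samples)
    return sum(math.log(p[t]) - math.log(q[t]) for t in tokens) / n_samples

kl_true = exact_kl(pi_theta, pi_ref)
kl_est = sampled_log_ratio(pi_theta, pi_ref, n_samples=20_000)

# A single log-ratio can be negative: token 1 is less likely under pi_theta.
single_token_ratio = math.log(pi_theta[1]) - math.log(pi_ref[1])

print(f"exact KL:      {kl_true:.4f}")
print(f"sampled mean:  {kl_est:.4f}")
print(f"token-1 ratio: {single_token_ratio:.4f}")
```

The sampled mean should land close to the exact value, while the token-1 ratio is below zero. This is why logging code often reports an average or an absolute value rather than individual per-token terms.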

**Numerical example:** Suppose $r_{\text{RM}} = 2.5$, $\beta = 0.1$, and per-token KL values are $[0.1, 0.3, 0.2, 0.5]$:
- Total KL = $0.1 + 0.3 + 0.2 + 0.5 = 1.1$
- KL penalty = $0.1 \times 1.1 = 0.11$
- Effective reward = $2.5 - 0.11 = 2.39$

The KL penalty is small because the model has not diverged much from the reference. If the model started generating very different outputs (KL = 15.0), the penalty would be $0.1 \times 15.0 = 1.5$, significantly reducing the effective reward.
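The arithmetic above can be checked in a few lines. This is a sketch under the assumption that the completion has at least one token; `per_token_rewards` is a hypothetical helper name, not a function defined elsewhere in this notebook.

```python
def per_token_rewards(r_rm, beta, kl_per_token):
    """Return r_t: -beta * KL_t for every token, plus r_RM on the final token."""
    rewards = [-beta * kl for kl in kl_per_token]
    rewards[-1] += r_rm  # the reward-model score arrives only at t = T
    return rewards

# Values from the numerical example.
r_rm, beta = 2.5, 0.1
kl_values = [0.1, 0.3, 0.2, 0.5]

token_rewards = per_token_rewards(r_rm, beta, kl_values)
effective_reward = sum(token_rewards)

print([round(r, 2) for r in token_rewards])
print(round(effective_reward, 2))
```

Summing the per-token rewards recovers the effective reward of 2.39 from the example, with almost all of it concentrated on the last token.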

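The PPO objective then maximizes this reward through a clipped surrogate. To avoid a clash with the per-token reward $r_t$ above, we write the probability ratio as $\rho_t(\theta) = \pi_\theta(y_t | s_t) / \pi_{\theta_{\text{old}}}(y_t | s_t)$ and the advantage estimate as $A_t$:
$$L^{\text{CLIP}} = \mathbb{E}\left[\min\left(\rho_t(\theta) A_t, \text{clip}(\rho_t(\theta), 1-\epsilon, 1+\epsilon) A_t\right)\right]$$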
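To see what the clipping does, it helps to evaluate the term inside the expectation for a few cases. A minimal sketch for a single token, with made-up ratio and advantage values; `clipped_surrogate` is a hypothetical helper and works on plain floats rather than tensors.

```python
def clipped_surrogate(ratio, advantage, epsilon=0.2):
    """Per-token PPO term: min(ratio * A, clip(ratio, 1 - eps, 1 + eps) * A)."""
    clipped_ratio = max(1.0 - epsilon, min(1.0 + epsilon, ratio))
    return min(ratio * advantage, clipped_ratio * advantage)

# Illustrative (ratio, advantage) pairs; the numbers are made up.
cases = [
    (1.5, 1.0),   # good action, ratio already above 1 + eps: gain is capped
    (0.5, 1.0),   # good action, ratio below 1 - eps: unclipped term is kept
    (1.5, -1.0),  # bad action made more likely: full penalty is kept
    (0.5, -1.0),  # bad action, ratio already below 1 - eps: term is capped
]
for ratio, advantage in cases:
    print(ratio, advantage, clipped_surrogate(ratio, advantage))
```

In the capped cases the term no longer depends on the ratio, so it contributes no gradient. That is how the objective discourages moving the policy far from the one that generated the samples.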


## 4. Let's Build It — Component by Component

### Step 1: Reward Function (Sentiment Scoring)

We use a simple keyword-based sentiment scorer. In production RLHF, this would be a trained reward model.

In [None]:
# Simple sentiment reward function
# In real RLHF, this would be a trained reward model (see Notebook 1)

POSITIVE_WORDS = {
    'happy', 'great', 'wonderful', 'amazing', 'love', 'beautiful',
    'excellent', 'fantastic', 'joy', 'excited', 'good', 'best',
    'brilliant', 'awesome', 'delightful', 'cheerful', 'grateful',
    'blessed', 'fortunate', 'incredible', 'superb', 'perfect',
    'smile', 'laugh', 'fun', 'kind', 'warm', 'bright', 'hope',
}

NEGATIVE_WORDS = {
    'sad', 'terrible', 'awful', 'hate', 'ugly', 'worst',
    'horrible', 'disaster', 'pain', 'angry', 'bad', 'dead',
    'kill', 'die', 'fear', 'sick', 'miserable', 'depressed',
    'lonely', 'failure', 'broken', 'dark', 'cry', 'suffer',
}

def sentiment_reward(text):
    """
    Simple keyword-based sentiment reward.
    Returns a score between -2.0 and 2.0.
    """
    words = text.lower().split()
    pos_count = sum(1 for w in words if w.strip('.,!?') in POSITIVE_WORDS)
    neg_count = sum(1 for w in words if w.strip('.,!?') in NEGATIVE_WORDS)

    # Normalize by text length
    total = max(len(words), 1)
    score = (pos_count - neg_count) / total * 10
    return max(-2.0, min(2.0, score))  # Clip to [-2, 2]

# Test it
test_texts = [
    "I am so happy and grateful for this wonderful day",
    "The weather is okay nothing special",
    "This is terrible and I hate everything about it",
]

print("Sentiment Reward Examples:")
for text in test_texts:
    score = sentiment_reward(text)
    print(f"  '{text}' -> Reward: {score:.2f}")

### Step 2: KL Divergence Computation

In [None]:
def compute_kl_penalty(model, ref_model, input_ids, attention_mask):
    """
    Compute the per-token log-ratio between model and reference model,
    a single-sample estimate of the KL divergence at each position.

    KL_t = log pi_theta(y_t | s_t) - log pi_ref(y_t | s_t)
    """
    with torch.no_grad():
        ref_outputs = ref_model(input_ids, attention_mask=attention_mask)
        ref_logprobs = F.log_softmax(ref_outputs.logits, dim=-1)

    model_outputs = model(input_ids, attention_mask=attention_mask)
    model_logprobs = F.log_softmax(model_outputs.logits, dim=-1)

    # Get log probs for actual tokens (shift by 1 for autoregressive)
    # For token at position t, the prediction is from position t-1
    token_ids = input_ids[:, 1:]  # Target tokens
    model_token_logprobs = model_logprobs[:, :-1].gather(
        2, token_ids.unsqueeze(-1)
    ).squeeze(-1)
    ref_token_logprobs = ref_logprobs[:, :-1].gather(
        2, token_ids.unsqueeze(-1)
    ).squeeze(-1)

    # Per-token KL divergence
    kl_per_token = model_token_logprobs - ref_token_logprobs

    return kl_per_token, model_token_logprobs

# Quick test
test_input = tokenizer("Hello world", return_tensors="pt").to(device)
kl, logprobs = compute_kl_penalty(model, ref_model,
                                   test_input['input_ids'],
                                   test_input['attention_mask'])
print(f"KL per token: {kl.detach().cpu().numpy()}")
print("KL is approximately 0 because model and ref are identical (no training yet)")

### Step 3: RLHF Training Step

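Now we build the core training function that combines generation, reward computation, and a policy-gradient update. To keep the code short, the update is a REINFORCE-style step with KL-adjusted rewards rather than the full clipped PPO objective.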
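Because the sentiment reward is added only on the final token, how credit reaches earlier tokens depends on the advantage estimate. The training step in this section subtracts the mean of the per-token rewards, which typically leaves every token except the last with a negative advantage when the final reward is positive. A common alternative in policy-gradient methods is the reward-to-go, where each token is credited with the sum of rewards from its position onward. The sketch below compares the two on made-up numbers; `rewards_to_go` and `gamma` are hypothetical names, not part of this notebook's code.

```python
def rewards_to_go(token_rewards, gamma=1.0):
    """Return G_t = sum_{k >= t} gamma^(k - t) * r_k for each position t."""
    returns = [0.0] * len(token_rewards)
    running = 0.0
    for t in reversed(range(len(token_rewards))):
        running = token_rewards[t] + gamma * running
        returns[t] = running
    return returns

# Per-token rewards: small KL penalties, sentiment reward on the last token.
token_rewards = [-0.01, -0.03, -0.02, 2.45]

mean_baseline = sum(token_rewards) / len(token_rewards)
mean_centered = [r - mean_baseline for r in token_rewards]
returns = rewards_to_go(token_rewards)

print("mean-centered rewards:", [round(a, 3) for a in mean_centered])
print("rewards-to-go:        ", [round(g, 3) for g in returns])
```

With reward-to-go, the early tokens share in the final reward instead of being pushed down. A baseline can still be subtracted from the returns to reduce variance.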

In [None]:
def rlhf_training_step(model, ref_model, tokenizer, optimizer,
                       prompts, beta=0.1, epsilon=0.2,
                       max_new_tokens=20, ppo_epochs=2):
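    """
    One step of RLHF training:
    1. Generate completions from the current policy
    2. Score them with the reward function
    3. Compute the per-token KL penalty
    4. Update the policy with a REINFORCE-style policy gradient

    Note: `epsilon` and `ppo_epochs` are placeholders for a clipped PPO
    update; this simplified step does not use them.
    """
    # --- Step 1: Generate completions ---
    model.eval()  # sampling mode; training mode is restored after generation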
    all_input_ids = []
    all_completions = []
    prompt_lengths = []

    with torch.no_grad():
        for prompt in prompts:
            inputs = tokenizer(prompt, return_tensors="pt").to(device)
            prompt_len = inputs['input_ids'].shape[1]

            outputs = model.generate(
                **inputs,
                max_new_tokens=max_new_tokens,
                do_sample=True,
                temperature=0.8,
                top_k=50,
                pad_token_id=tokenizer.eos_token_id,
            )

            all_input_ids.append(outputs[0])
            completion = tokenizer.decode(outputs[0][prompt_len:],
                                         skip_special_tokens=True)
            all_completions.append(completion)
            prompt_lengths.append(prompt_len)

    model.train()

    # --- Step 2: Compute rewards ---
    rewards = []
    for prompt, completion in zip(prompts, all_completions):
        r = sentiment_reward(prompt + " " + completion)
        rewards.append(r)
    rewards = torch.tensor(rewards, dtype=torch.float32, device=device)

    # --- Step 3: Compute KL penalty and policy-gradient loss ---
    total_loss = torch.tensor(0.0, device=device)
    total_reward = rewards.mean().item()
    total_kl = 0.0

    for i, (input_ids, prompt_len) in enumerate(zip(all_input_ids, prompt_lengths)):
        input_ids = input_ids.unsqueeze(0)
        attention_mask = torch.ones_like(input_ids)

        kl_per_token, model_logprobs = compute_kl_penalty(
            model, ref_model, input_ids, attention_mask
        )

        # Per-token rewards: KL penalty on all tokens, sentiment reward on last
        completion_kl = kl_per_token[:, prompt_len-1:]
        token_rewards = -beta * completion_kl
        if completion_kl.shape[1] > 0:
            token_rewards[:, -1] += rewards[i]

        # Simple advantage: reward - mean
        advantages = token_rewards - token_rewards.mean()

        # Policy gradient loss (simplified REINFORCE with KL-adjusted rewards)
        completion_logprobs = model_logprobs[:, prompt_len-1:]
        pg_loss = -(completion_logprobs * advantages.detach()).mean()

        total_loss = total_loss + pg_loss
        # Log mean |log-ratio| as a rough, non-negative proxy for KL
        total_kl += completion_kl.abs().mean().item()

    total_loss = total_loss / len(prompts)

    # --- Step 4: Optimize ---
    optimizer.zero_grad()
    total_loss.backward()
    torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
    optimizer.step()

    return {
        'loss': total_loss.item(),
        'mean_reward': total_reward,
        'mean_kl': total_kl / len(prompts),
        'completions': list(zip(prompts, all_completions)),
    }

## 5. Your Turn

### TODO 1: Implement Reward Logging and Visualization

Track rewards over training and create a live-updating plot.

In [None]:
class TrainingLogger:
    """
    Log training metrics and create visualizations.

    TODO: Implement the plot method.
    """
    def __init__(self):
        self.rewards = []
        self.kl_values = []
        self.losses = []

    def log(self, metrics):
        """Log a training step's metrics."""
        self.rewards.append(metrics['mean_reward'])
        self.kl_values.append(metrics['mean_kl'])
        self.losses.append(metrics['loss'])

    def plot(self):
        """
        Create a 3-panel figure showing:
        1. Mean reward over training steps
        2. KL divergence over training steps
        3. Loss over training steps

        TODO: Implement this method
        Hint:
            - Use plt.subplots(1, 3, figsize=(18, 4))
            - Plot self.rewards, self.kl_values, self.losses
            - Add labels, titles, and grid
        """
        # TODO: Implement
        pass

# logger = TrainingLogger()

### TODO 2: Implement Adaptive KL Penalty

In production RLHF, beta is adjusted dynamically to keep KL within a target range.
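The exercise in this section uses a threshold rule. For comparison, another commonly described scheme nudges beta in proportion to how far the observed KL is from the target, with the error clipped so that one noisy batch cannot move beta much. A minimal sketch, assuming illustrative constants (`gain=0.1`, `max_error=0.2`) and a hypothetical function name; it is a different rule from the one the exercise asks for.

```python
def proportional_kl_update(beta, kl_value, kl_target, gain=0.1, max_error=0.2):
    """Scale beta by (1 + gain * error), error = clip(kl/target - 1, +/-max_error)."""
    error = kl_value / kl_target - 1.0
    error = max(-max_error, min(max_error, error))
    return beta * (1.0 + gain * error)

beta = 0.1
for observed_kl in [1.0, 1.0, 0.5, 0.2]:
    beta = proportional_kl_update(beta, observed_kl, kl_target=0.5)
    print(f"KL={observed_kl:.1f} -> beta={beta:.5f}")
```

Beta rises while the observed KL is above the target, stays put when it matches, and falls when it is below. The clipped error bounds each change to a few percent.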

In [None]:
def adaptive_kl_controller(beta, kl_value, kl_target=0.5):
    """
    Adjust beta to keep KL divergence near the target.

    If KL > 1.5 * target: increase beta (more penalty)
    If KL < target / 1.5: decrease beta (less penalty)

    Args:
        beta: current KL penalty coefficient
        kl_value: observed KL divergence
        kl_target: desired KL divergence

    Returns:
        Updated beta value

    TODO: Implement this function
    Hint:
        - If kl_value > 1.5 * kl_target, multiply beta by 1.5
        - If kl_value < kl_target / 1.5, multiply beta by (1 / 1.5)
        - Otherwise keep beta the same
        - Clip beta to [0.01, 10.0] range
    """
    # TODO: Implement
    pass

# Test (uncomment after implementing):
# print(adaptive_kl_controller(0.1, 1.0, 0.5))  # Should increase beta
# print(adaptive_kl_controller(0.1, 0.2, 0.5))  # Should decrease beta
# print(adaptive_kl_controller(0.1, 0.5, 0.5))  # Should stay same

## 6. Putting It All Together

Let us run the full RLHF training loop.

In [None]:
# Training prompts — we cycle through these during training
training_prompts = [
    "Today I feel", "The best part of", "I really enjoy",
    "My favorite thing is", "I am grateful for", "What makes me happy is",
    "The world is", "People are", "Life is",
    "I believe that", "The future looks", "Every day I",
    "My friends always", "I love when", "The sun makes me",
    "Music makes me feel", "Nature is so", "Kindness is",
]

# Setup
optimizer = torch.optim.Adam(model.parameters(), lr=1e-5)
beta = 0.1
num_steps = 40
batch_size = 4

all_metrics = {'rewards': [], 'kl': [], 'losses': []}

print("Starting RLHF training...")
print(f"Steps: {num_steps}, Batch size: {batch_size}, Beta: {beta}\n")

## 7. Training and Results

In [None]:
for step in range(num_steps):
    # Sample random prompts for this batch
    batch_prompts = [training_prompts[np.random.randint(len(training_prompts))]
                     for _ in range(batch_size)]

    # Run RLHF training step
    metrics = rlhf_training_step(
        model, ref_model, tokenizer, optimizer,
        batch_prompts, beta=beta, max_new_tokens=20,
    )

    all_metrics['rewards'].append(metrics['mean_reward'])
    all_metrics['kl'].append(metrics['mean_kl'])
    all_metrics['losses'].append(metrics['loss'])

    if (step + 1) % 10 == 0:
        print(f"Step {step+1}/{num_steps} — "
              f"Reward: {metrics['mean_reward']:.3f}, "
              f"KL: {metrics['mean_kl']:.4f}, "
              f"Loss: {metrics['loss']:.4f}")
        # Show a sample completion
        if metrics['completions']:
            p, c = metrics['completions'][0]
            print(f"  Sample: '{p}' -> '{c}'\n")

print("\nTraining complete!")

In [None]:
# Visualize training
fig, axes = plt.subplots(1, 3, figsize=(18, 4))

# Rewards
window = 5  # trailing moving average over the last `window` steps
smoothed_rewards = [np.mean(all_metrics['rewards'][max(0, i - window + 1):i + 1])
                    for i in range(len(all_metrics['rewards']))]
axes[0].plot(smoothed_rewards, 'b-', linewidth=2)
axes[0].set_xlabel('Step', fontsize=12)
axes[0].set_ylabel('Mean Reward', fontsize=12)
axes[0].set_title('Sentiment Reward During Training', fontsize=13)
axes[0].grid(True, alpha=0.3)

# KL divergence
axes[1].plot(all_metrics['kl'], 'r-', linewidth=2)
axes[1].set_xlabel('Step', fontsize=12)
axes[1].set_ylabel('KL Divergence', fontsize=12)
axes[1].set_title('KL Divergence from Reference', fontsize=13)
axes[1].grid(True, alpha=0.3)

# Loss
axes[2].plot(all_metrics['losses'], 'g-', linewidth=2)
axes[2].set_xlabel('Step', fontsize=12)
axes[2].set_ylabel('Loss', fontsize=12)
axes[2].set_title('Policy Gradient Loss', fontsize=13)
axes[2].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

## 8. Final Output

Let us compare the model's outputs before and after RLHF training.

In [None]:
# Generate aligned completions (after RLHF)
print("=" * 70)
print("COMPARISON: Before vs After RLHF")
print("=" * 70)

test_prompts = [
    "Today I feel",
    "The best thing about",
    "My favorite memory is",
    "I am excited because",
]

# Generate with aligned model
aligned_completions = generate_text(model, tokenizer, test_prompts)

# Generate with reference (original) model
ref_completions = generate_text(ref_model, tokenizer, test_prompts)

for prompt, ref_comp, aligned_comp in zip(test_prompts, ref_completions, aligned_completions):
    ref_reward = sentiment_reward(ref_comp)
    aligned_reward = sentiment_reward(aligned_comp)

    print(f"\nPrompt: '{prompt}'")
    print(f"  Before RLHF (reward={ref_reward:.2f}): {ref_comp}")
    print(f"  After RLHF  (reward={aligned_reward:.2f}): {aligned_comp}")
    print()

# Overall statistics
print("\n" + "=" * 70)
n_samples = 50
ref_rewards = []
aligned_rewards = []

for _ in range(n_samples):
    prompt = training_prompts[np.random.randint(len(training_prompts))]
    ref_comp = generate_text(ref_model, tokenizer, [prompt])[0]
    aligned_comp = generate_text(model, tokenizer, [prompt])[0]
    ref_rewards.append(sentiment_reward(ref_comp))
    aligned_rewards.append(sentiment_reward(aligned_comp))

print(f"\nStatistics over {n_samples} samples:")
print(f"  Reference model mean reward: {np.mean(ref_rewards):.3f}")
print(f"  Aligned model mean reward:   {np.mean(aligned_rewards):.3f}")
print(f"  Improvement: {np.mean(aligned_rewards) - np.mean(ref_rewards):.3f}")

plt.figure(figsize=(8, 5))
plt.hist(ref_rewards, bins=15, alpha=0.5, label='Before RLHF', color='red')
plt.hist(aligned_rewards, bins=15, alpha=0.5, label='After RLHF', color='blue')
plt.xlabel('Sentiment Reward', fontsize=12)
plt.ylabel('Count', fontsize=12)
plt.title('Reward Distribution: Before vs After RLHF', fontsize=13)
plt.legend(fontsize=11)
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()

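# Report the outcome from the measured means rather than assuming success
if np.mean(aligned_rewards) > np.mean(ref_rewards):
    print("\nThe aligned model scored higher on average in this sample.")
else:
    print("\nNo average improvement in this sample; try more steps or a different beta.")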

## 9. Reflection and Next Steps

**What we built:**
- A complete RLHF pipeline: generation, reward scoring, KL penalty, policy gradient optimization
- Applied it to a real GPT-2 model with a sentiment reward signal
- Measured the change in sentiment reward between the reference model and the trained model

**Key takeaways:**
1. RLHF combines reward signals with KL-penalized policy gradients
2. The KL penalty is critical: without it, the model tends to drift towards reward hacking, such as repeating positive keywords
3. Even simple reward functions can meaningfully steer model behavior
4. The same recipe carries over to much larger models, though production systems use learned reward models and full PPO or related algorithms
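
Takeaway 2 can be made concrete with the keyword reward itself. A minimal sketch, using a cut-down copy of the scorer (the `POSITIVE` subset and the `keyword_reward` name are illustrative, and negative words are omitted): a degenerate string of repeated keywords earns the maximum reward, which is the kind of output an unconstrained policy could drift towards.

```python
POSITIVE = {"happy", "great", "wonderful", "love", "grateful"}  # small subset

def keyword_reward(text):
    """Share of positive keywords, scaled by 10 and capped at 2.0."""
    words = [w.strip(".,!?") for w in text.lower().split()]
    score = sum(w in POSITIVE for w in words) / max(len(words), 1) * 10
    return min(2.0, score)

natural = "Today I feel happy about the long walk we took"
degenerate = "happy happy happy happy happy happy"

natural_score = keyword_reward(natural)
degenerate_score = keyword_reward(degenerate)

print(f"natural:    {natural_score:.2f}")
print(f"degenerate: {degenerate_score:.2f}")
```

The fluent sentence scores lower than the repeated keyword, so the reward alone prefers the degenerate text. The KL term is what makes such text costly, because the reference model assigns it low probability.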

**Think about:**
- What happens if you increase beta too much? (Hint: the model barely changes)
- What happens if you set beta to 0? (Hint: reward hacking)
- How would you replace the keyword-based reward with a trained reward model?
- How does this scale to models with billions of parameters?

**Congratulations!** You have built a complete RLHF pipeline from scratch. The same principles (reward modeling, policy optimization, KL regularization) underlie RLHF-based alignment of many large language models today.