# Day 30: Deep Reinforcement Learning from Human Feedback (RLHF)

> Christiano et al. (2017) — [Deep Reinforcement Learning from Human Preferences](https://arxiv.org/abs/1706.03741)

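This is the preference-based reward learning method that later RLHF pipelines for language models, including the one used to train ChatGPT, build on.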

## What You'll Learn
1. **The Bradley-Terry Model**: How to turn pairwise preferences into a probability.
2. **Reward Modeling**: Training a neural network to predict human preferences.
3. **Synthetic Oracle**: Simulating human feedback using ground-truth rewards.
4. **The RLHF Loop**: Collect -> Label -> Train Reward Model -> Train Policy.


## Setup
We need PyTorch, NumPy, Gymnasium, and Matplotlib.

In [None]:
import torch
import torch.nn as nn
import torch.nn.functional as F
import numpy as np
import matplotlib.pyplot as plt
import gymnasium as gym

# Set random seed for reproducibility
torch.manual_seed(42)
np.random.seed(42)
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print(f"Using device: {device}")

## 1. The Bradley-Terry Model

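The core modeling assumption in RLHF is that the probability of preferring segment $\sigma^1$ over $\sigma^2$ depends on the summed reward of each segment, where $\sum r(\sigma^i)$ denotes the sum of per-step rewards over the timesteps in $\sigma^i$: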

$$ P[\sigma^1 \succ \sigma^2] = \frac{\exp(\sum r(\sigma^1))}{\exp(\sum r(\sigma^1)) + \exp(\sum r(\sigma^2))} $$

Let's implement this probability function.

In [None]:
def preference_probability(r1_sum, r2_sum):
    """Computes P(1 > 2) given reward sums."""
    # exp(r1) / (exp(r1) + exp(r2)) == sigmoid(r1 - r2).
    # The sigmoid form avoids overflow in exp() when the reward sums are large.
    return torch.sigmoid(r1_sum - r2_sum)

# Example: Which is better?
r1 = torch.tensor(1.0) # Sum reward 1.0
r2 = torch.tensor(0.0) # Sum reward 0.0
prob = preference_probability(r1, r2)
print(f"P(1 > 2): {prob.item():.4f}")
# Expected: exp(1)/(exp(1)+1) ~= 0.7311
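
Dividing the numerator and denominator of the Bradley-Terry formula by $\exp(\sum r(\sigma^1))$ shows that it equals a logistic sigmoid of the difference in reward sums. The sketch below compares the two forms; the names `bt_naive` and `bt_sigmoid` are illustrative and not part of the original notebook. It suggests why the sigmoid form is the safer choice once reward sums grow large.

```python
import torch

def bt_naive(r1_sum, r2_sum):
    # Direct transcription of the formula; exp() overflows in float32 for large sums.
    e1, e2 = torch.exp(r1_sum), torch.exp(r2_sum)
    return e1 / (e1 + e2)

def bt_sigmoid(r1_sum, r2_sum):
    # Algebraically identical: 1 / (1 + exp(-(r1 - r2))).
    return torch.sigmoid(r1_sum - r2_sum)

small = (torch.tensor(1.0), torch.tensor(0.0))
large = (torch.tensor(100.0), torch.tensor(0.0))

p_naive_small = bt_naive(*small)
p_sig_small = bt_sigmoid(*small)
p_naive_large = bt_naive(*large)   # evaluates inf / inf
p_sig_large = bt_sigmoid(*large)

print(p_naive_small.item(), p_sig_small.item())
print(p_naive_large.item(), p_sig_large.item())
```

For small sums the two forms agree; for the large pair the direct formula produces `nan` while the sigmoid form stays finite.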

## 2. The Reward Model

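We need a neural network that takes an observation (state) and outputs a single scalar reward. This network will be trained to agree with the 'human' preferences. (In Christiano et al. the reward function also takes the action as input; this notebook uses observations only to keep things simple.)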

In [None]:
class RewardModel(nn.Module):
    def __init__(self, obs_dim, hidden_dim=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim, hidden_dim),
            nn.Tanh(),
            nn.Linear(hidden_dim, 1)
        )
        
    def forward(self, obs):
        return self.net(obs)

# Test dimensions
rm = RewardModel(obs_dim=4)
dummy_obs = torch.randn(1, 4)
print(f"Reward Output: {rm(dummy_obs).item():.4f}")
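
The preference loss in the next section scores whole segments rather than single observations. As a rough illustration, `nn.Linear` acts on the last dimension only, so the same network can score a `(batch, len, obs_dim)` tensor in one call. The sketch rebuilds the architecture inline so that it runs on its own; the shapes are arbitrary example values.

```python
import torch
import torch.nn as nn

# Same layers as RewardModel above, rebuilt here so the snippet is self-contained.
net = nn.Sequential(nn.Linear(4, 64), nn.Tanh(), nn.Linear(64, 1))

segments = torch.randn(3, 5, 4)                    # (batch=3, len=5, obs_dim=4)
per_step = net(segments)                           # (3, 5, 1): one reward per timestep
segment_return = per_step.squeeze(-1).sum(dim=1)   # (3,): one summed reward per segment

print(per_step.shape, segment_return.shape)
```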

## 3. The Preference Loss

We train the reward model by minimizing the cross-entropy loss between the predicted preference probabilities and the actual human labels.

If the human says "Segment 1 is better" (label=0), we want to maximize $P(\sigma^1 \succ \sigma^2)$.
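
Written out for a single labeled pair, with $y \in \{0, 1\}$ denoting the label (0 when segment 1 is preferred) and $\mathcal{L}$ the loss, both being notation introduced here for illustration, the cross-entropy is:

$$ \mathcal{L} = -\Big[ (1 - y)\,\log P[\sigma^1 \succ \sigma^2] + y\,\log P[\sigma^2 \succ \sigma^1] \Big] $$

For $y = 0$ this reduces to $-\log P[\sigma^1 \succ \sigma^2]$, so minimizing the loss pushes $P[\sigma^1 \succ \sigma^2]$ toward 1. Since $P[\sigma^2 \succ \sigma^1] = 1 - P[\sigma^1 \succ \sigma^2]$, the $y = 1$ case is symmetric.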

In [None]:
def compute_loss(reward_model, s1, s2, label):
    # s1, s2 are batches of observations: (batch, len, obs_dim)
    # label is (batch,) where 0 means s1 is better, 1 means s2 is better
    
    # 1. Get rewards for all steps
    r1 = reward_model(s1).squeeze(-1) # (batch, len)
    r2 = reward_model(s2).squeeze(-1)
    
    # 2. Sum rewards
    r1_sum = r1.sum(dim=1)
    r2_sum = r2.sum(dim=1)
    
    # 3. Stack as logits
    logits = torch.stack([r1_sum, r2_sum], dim=1)
    
    # 4. Cross Entropy
    loss = F.cross_entropy(logits, label)
    return loss

# Test Loss
s1 = torch.randn(2, 5, 4) # Batch 2, Len 5, Dim 4
s2 = torch.randn(2, 5, 4)
labels = torch.tensor([0, 1]) # First pair 1>2, Second pair 2>1
loss = compute_loss(rm, s1, s2, labels)
print(f"Loss: {loss.item():.4f}")
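
Stacking the two reward sums as logits and applying `F.cross_entropy` should give the same value as the Bradley-Terry negative log-likelihood written with a sigmoid. The sketch below checks this numerically on random stand-in reward sums; the variable names and the seed are arbitrary choices for illustration.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
r1_sum = torch.randn(6)             # stand-ins for the summed rewards of segment 1
r2_sum = torch.randn(6)             # ... and of segment 2
label = torch.randint(0, 2, (6,))   # 0: segment 1 preferred, 1: segment 2 preferred

# Form used in compute_loss: softmax cross-entropy over the two reward sums.
ce = F.cross_entropy(torch.stack([r1_sum, r2_sum], dim=1), label)

# Explicit Bradley-Terry negative log-likelihood.
# sign is +1 where segment 1 is preferred and -1 where segment 2 is preferred.
sign = 1.0 - 2.0 * label.float()
nll = -F.logsigmoid(sign * (r1_sum - r2_sum)).mean()

print(ce.item(), nll.item())
```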

## 4. The Synthetic Oracle

Since we can't ask you to click a button thousands of times, we simulate a 'perfect' human who prefers the segment with higher ground-truth reward.

In [None]:
class SyntheticOracle:
    def query(self, r1_sum, r2_sum):
        return 0 if r1_sum > r2_sum else 1

oracle = SyntheticOracle()
print(f"Oracle says: {oracle.query(10, 5)} (0 means first is better)")
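
To label many pairs at once, the same rule can be applied to tensors of ground-truth rewards. This is a minimal sketch: `oracle_labels` is a hypothetical helper, not part of the original notebook, and it keeps the tie behavior of `SyntheticOracle.query` (a tie yields label 1).

```python
import torch

def oracle_labels(true_r1, true_r2):
    """true_r1, true_r2: (batch, len) ground-truth per-step rewards for each segment pair."""
    r1_sum = true_r1.sum(dim=1)
    r2_sum = true_r2.sum(dim=1)
    # 0 if segment 1 has the strictly higher sum, else 1 (ties included).
    return (r1_sum <= r2_sum).long()

true_r1 = torch.tensor([[1.0, 1.0, 1.0], [0.0, 0.0, 0.0], [0.5, 0.5, 0.0]])
true_r2 = torch.tensor([[0.0, 0.0, 1.0], [1.0, 0.0, 0.0], [1.0, 0.0, 0.0]])
labels = oracle_labels(true_r1, true_r2)
print(labels)
```

The returned tensor has integer dtype, so it can be passed directly as the `label` argument of `compute_loss`.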

## 5. Putting It All Together

The full loop involves collecting data, getting labels, training the reward model, and then training the policy against that reward model. This notebook uses PPO for the policy update; Christiano et al. used A2C for Atari and TRPO for MuJoCo.

For the full implementation, run `python implementation.py` or `python train_minimal.py`.
Here is the high-level logic:

In [None]:
print("1. Initialize Policy and Reward Model")
print("2. Collect trajectories using Policy")
print("3. Ask Oracle to label pairs of trajectories")
print("4. Update Reward Model to minimize Preference Loss")
print("5. Update Policy (PPO) to maximize Reward Model output")
print("6. Repeat!")
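
Steps 2 to 4 can be sketched without an environment or a policy. The example below makes several simplifying assumptions that are not in the original notebook: random tensors stand in for policy rollouts, the ground-truth reward is a made-up linear function (`w_true`), and the whole dataset is trained on as one batch. It is meant only to illustrate how the oracle labels and the preference loss fit together, not to replace `implementation.py`.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
obs_dim, seg_len, n_pairs = 4, 5, 256

# Step 2 stand-in: random segments instead of policy rollouts (assumption).
s1 = torch.randn(n_pairs, seg_len, obs_dim)
s2 = torch.randn(n_pairs, seg_len, obs_dim)

# Hypothetical ground-truth reward, linear in the observation (assumption).
w_true = torch.tensor([1.0, -0.5, 0.25, 2.0])
true_sum1 = (s1 @ w_true).sum(dim=1)
true_sum2 = (s2 @ w_true).sum(dim=1)

# Step 3: oracle labels, 0 if segment 1 has the higher true reward sum, else 1.
labels = (true_sum1 <= true_sum2).long()

# Step 4: fit a reward model with the preference loss.
reward_model = nn.Sequential(nn.Linear(obs_dim, 64), nn.Tanh(), nn.Linear(64, 1))
optimizer = torch.optim.Adam(reward_model.parameters(), lr=1e-2)

def preference_loss(model, a, b, y):
    # Sum predicted per-step rewards over each segment, then treat the sums as logits.
    logits = torch.stack([model(a).squeeze(-1).sum(dim=1),
                          model(b).squeeze(-1).sum(dim=1)], dim=1)
    return F.cross_entropy(logits, y), logits

initial_loss, _ = preference_loss(reward_model, s1, s2, labels)
for step in range(300):
    loss, _ = preference_loss(reward_model, s1, s2, labels)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

with torch.no_grad():
    final_loss, logits = preference_loss(reward_model, s1, s2, labels)
    # Fraction of training pairs where the learned reward agrees with the oracle.
    accuracy = (logits.argmax(dim=1) == labels).float().mean()

print(f"loss {initial_loss.item():.3f} -> {final_loss.item():.3f}, "
      f"agreement {accuracy.item():.2f}")
```

In the full loop, step 5 would then optimize the policy against `reward_model` and feed fresh rollouts back into step 2.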

## Key Takeaways

1. **Preference Learning**: We can learn a reward function just by asking "which is better?"
2. **Bradley-Terry**: The probabilistic model that converts summed rewards into preference probabilities.
3. **Scalability**: This method scales to tasks where we can't write a reward function (e.g., summarization, driving, behaving helpfully).
4. **Alignment**: Variants of this technique are used to fine-tune pretrained LLMs into helpful assistants like ChatGPT.