In [None]:
# 🔧 Setup: Run this cell first!
# Check GPU availability and set random seeds for reproducibility

import torch
import sys

# Check GPU
if torch.cuda.is_available():
    device = torch.device('cuda')
    print(f"✅ GPU available: {torch.cuda.get_device_name(0)}")
    print(f"   Memory: {torch.cuda.get_device_properties(0).total_memory / 1e9:.1f} GB")
else:
    device = torch.device('cpu')
    print("⚠️ No GPU detected. Some cells may run slowly.")
    print("   Go to Runtime → Change runtime type → GPU")

print(f"\n📦 Python {sys.version.split()[0]}")
print(f"🔥 PyTorch {torch.__version__}")

# Set random seeds for reproducibility
import random
import numpy as np

SEED = 42
random.seed(SEED)
np.random.seed(SEED)
torch.manual_seed(SEED)
if torch.cuda.is_available():
    torch.cuda.manual_seed_all(SEED)

print(f"🎲 Random seed set to {SEED}")

%matplotlib inline

# Reward Modeling — Teaching a Neural Network Human Preferences

**Vizuara AI**

In this notebook, we will build a reward model from scratch. A reward model is the component of RLHF that learns to predict which response a human would prefer. We will implement the Bradley-Terry preference model, train it on pairwise comparisons, and visualize how it learns to rank responses.

By the end of this notebook, you will have a working reward model that assigns a scalar score to any token sequence, trained and checked on synthetic preference data.


## 1. Why Does This Matter?

Let us start with a simple question: How do you teach a language model what a "good" response looks like?

For math problems, this is easy — the answer is either right or wrong. But for open-ended tasks like "explain gravity to a child," there is no single correct answer. Different humans might prefer different explanations.

The key insight behind RLHF is that **humans are much better at comparing two things than scoring one thing absolutely**. Show a person two explanations and ask "which is better?" — they can answer immediately and consistently.

A reward model takes this insight and turns it into a neural network. It learns to assign a scalar score to any response, such that preferred responses get higher scores than rejected ones. This is the foundation of the entire RLHF pipeline.

In [None]:
# Setup and imports
!pip install torch transformers datasets -q

import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.utils.data import DataLoader, Dataset
import numpy as np
import matplotlib.pyplot as plt
from collections import defaultdict
import random

# Set random seeds for reproducibility
torch.manual_seed(42)
np.random.seed(42)
random.seed(42)

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f"Using device: {device}")

## 2. Building Intuition

Before we write any math, let us build intuition for what a reward model does.

Imagine you are a teacher grading essays. You have a stack of essays and you need to rank them from best to worst. Grading each essay on a 1-10 scale is hard — is this essay a 7 or an 8? But comparing two essays and saying "this one is better" is much easier.

A reward model works the same way. It does not directly learn absolute scores. Instead, it learns from **pairwise comparisons**: "For this prompt, Response A is better than Response B."

Let us create some synthetic data to see this in action.

In [None]:
# Let us create a few hand-written preference pairs to build intuition
# In each pair, the preferred response is a plain-language explanation and the rejected one is jargon-heavy

prompts = [
    "Explain machine learning simply.",
    "What is reinforcement learning?",
    "How do neural networks work?",
]

# Each prompt has a preferred (y_w, the "winner") and a rejected (y_l, the "loser") response
preference_data = [
    {
        "prompt": "Explain machine learning simply.",
        "preferred": "Machine learning is when computers learn patterns from data, like how you learn to recognize faces.",
        "rejected": "ML utilizes gradient-based optimization of parameterized function approximators.",
    },
    {
        "prompt": "What is reinforcement learning?",
        "preferred": "RL is like training a dog — good behavior gets treats (rewards), bad behavior gets nothing.",
        "rejected": "Reinforcement learning optimizes a policy via the Bellman optimality equation.",
    },
    {
        "prompt": "How do neural networks work?",
        "preferred": "A neural network passes information through layers, like a game of telephone, refining the message at each step.",
        "rejected": "Neural networks are differentiable computational graphs with backpropagation.",
    },
]

print("Sample preference pair:")
print(f"Prompt: {preference_data[0]['prompt']}")
print(f"Preferred: {preference_data[0]['preferred']}")
print(f"Rejected: {preference_data[0]['rejected']}")

## 3. The Mathematics

Now let us formalize this with the **Bradley-Terry model**. Given two responses $y_w$ (preferred) and $y_l$ (rejected) to a prompt $x$, the probability that a human prefers $y_w$ is:

$$P(y_w \succ y_l \mid x) = \sigma(r_\theta(x, y_w) - r_\theta(x, y_l))$$

Here, $r_\theta$ is our reward model (a neural network with parameters $\theta$) and $\sigma$ is the sigmoid function.

**Numerical example:** Suppose $r_\theta(x, y_w) = 3.0$ and $r_\theta(x, y_l) = 1.0$:

$$P = \sigma(3.0 - 1.0) = \sigma(2.0) = \frac{1}{1 + e^{-2.0}} \approx \frac{1}{1.135} \approx 0.881$$

In other words, a reward gap of 2.0 translates into an 88.1% modeled probability that a human prefers $y_w$ over $y_l$. A larger gap pushes this probability toward 1, and a gap of 0 gives exactly 50%.
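
To double-check the arithmetic, here is a minimal sketch in plain Python (no PyTorch needed). The helper names `sigmoid` and `preference_probability` are ours, introduced only for this illustration. It also shows two properties of the formula: shifting both rewards by the same constant leaves the probability unchanged, and swapping the two responses gives the complementary probability.

```python
import math


def sigmoid(z):
    """Logistic function, maps any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def preference_probability(r_preferred, r_rejected):
    """Bradley-Terry probability that the first response is preferred."""
    return sigmoid(r_preferred - r_rejected)


# The numerical example from the text: rewards 3.0 and 1.0
p = preference_probability(3.0, 1.0)

# Adding the same constant (here 100) to both rewards changes nothing
p_shifted = preference_probability(103.0, 101.0)

# Swapping the responses gives the complementary probability
p_reversed = preference_probability(1.0, 3.0)

print(f"P(y_w preferred)        = {p:.3f}")           # 0.881
print(f"Same after shifting     = {p_shifted:.3f}")   # 0.881
print(f"After swapping the pair = {p_reversed:.3f}")  # 0.119
```

Because only the difference enters the formula, reward values are meaningful relative to each other, not on an absolute scale.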

The **loss function** for training the reward model is:

$$\mathcal{L} = -\log \sigma(r_\theta(x, y_w) - r_\theta(x, y_l))$$

This loss is small when the reward model ranks correctly (preferred gets a higher score) and large when it ranks incorrectly.
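
To see what minimizing this loss does, consider a minimal sketch with no neural network at all: two responses, each with a single learnable score, and a hypothetical labeling result in which annotators preferred A over B in 3 out of 4 comparisons. All names and numbers here are made up for illustration. Plain gradient descent on the loss above drives the score gap to the value whose sigmoid matches the observed 75% preference rate.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


# Hypothetical toy data: (winner, loser) pairs, A wins 3 of 4 comparisons
comparisons = [("A", "B")] * 3 + [("B", "A")]

scores = {"A": 0.0, "B": 0.0}  # one learnable scalar per response
learning_rate = 1.0

for _ in range(200):
    grads = {name: 0.0 for name in scores}
    for winner, loser in comparisons:
        # Derivative of -log(sigmoid(delta)) w.r.t. delta is sigmoid(delta) - 1
        g = sigmoid(scores[winner] - scores[loser]) - 1.0
        grads[winner] += g / len(comparisons)
        grads[loser] -= g / len(comparisons)
    for name in scores:
        scores[name] -= learning_rate * grads[name]

fitted_gap = scores["A"] - scores["B"]
fitted_prob = sigmoid(fitted_gap)
print(f"score gap: {fitted_gap:.3f}, implied P(A preferred): {fitted_prob:.3f}")
# score gap: 1.099, implied P(A preferred): 0.750
```

The fitted gap is $\ln 3 \approx 1.099$, since $\sigma(\ln 3) = 0.75$. When the labels disagree like this, the loss settles at a finite gap instead of pushing the two scores apart forever.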

In [None]:
# Let us visualize the Bradley-Terry loss to build intuition

reward_diff = np.linspace(-5, 5, 200)
sigmoid_vals = 1 / (1 + np.exp(-reward_diff))
# -log(sigmoid(d)) = log(1 + exp(-d)); logaddexp computes this stably without an epsilon
loss_vals = np.logaddexp(0, -reward_diff)

fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Sigmoid
axes[0].plot(reward_diff, sigmoid_vals, 'b-', linewidth=2)
axes[0].axvline(x=0, color='gray', linestyle='--', alpha=0.5)
axes[0].axhline(y=0.5, color='gray', linestyle='--', alpha=0.5)
axes[0].scatter([2], [0.881], color='green', s=100, zorder=5, label='Case 1: correct (r_w > r_l)')
axes[0].scatter([-2], [0.119], color='red', s=100, zorder=5, label='Case 2: wrong (r_w < r_l)')
axes[0].set_xlabel('r(y_w) - r(y_l)', fontsize=12)
axes[0].set_ylabel('P(y_w preferred)', fontsize=12)
axes[0].set_title('Sigmoid: Preference Probability', fontsize=13)
axes[0].legend(fontsize=10)
axes[0].grid(True, alpha=0.3)

# Loss
axes[1].plot(reward_diff, loss_vals, 'r-', linewidth=2)
axes[1].axvline(x=0, color='gray', linestyle='--', alpha=0.5)
axes[1].scatter([2], [-np.log(0.881)], color='green', s=100, zorder=5, label='Case 1: low loss (0.13)')
axes[1].scatter([-2], [-np.log(0.119)], color='red', s=100, zorder=5, label='Case 2: high loss (2.13)')
axes[1].set_xlabel('r(y_w) - r(y_l)', fontsize=12)
axes[1].set_ylabel('Loss', fontsize=12)
axes[1].set_title('Bradley-Terry Loss', fontsize=13)
axes[1].legend(fontsize=10)
axes[1].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

print("\nKey insight: The loss heavily penalizes incorrect rankings!")
print(f"Correct ranking (diff=+2): Loss = {-np.log(0.881):.3f}")
print(f"Wrong ranking (diff=-2):   Loss = {-np.log(0.119):.3f}")

## 4. Let's Build It — Component by Component

Now let us build a reward model step by step. In real RLHF, the reward model starts from a pretrained LLM and replaces the output head with a scalar projection. We will build a simplified version that captures all the essential mechanics.

### Step 1: Text Encoder

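First, we need a way to convert text into a fixed-size vector. We will use a simple embedding + average pooling approach. Average pooling ignores token order and, as written here, assumes every sequence in a batch has the same length (no padding tokens), which holds for our synthetic data.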

In [None]:
class SimpleTextEncoder(nn.Module):
    """A simple text encoder that converts token IDs to a fixed-size vector."""

    def __init__(self, vocab_size, embed_dim, hidden_dim):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim)
        self.layers = nn.Sequential(
            nn.Linear(embed_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim),
            nn.ReLU(),
        )

    def forward(self, token_ids):
        # token_ids: (batch, seq_len)
        embedded = self.embedding(token_ids)    # (batch, seq_len, embed_dim)
        pooled = embedded.mean(dim=1)           # (batch, embed_dim) — average pooling
        return self.layers(pooled)              # (batch, hidden_dim)

# Test it
encoder = SimpleTextEncoder(vocab_size=1000, embed_dim=64, hidden_dim=128).to(device)
dummy_input = torch.randint(0, 1000, (2, 20)).to(device)
output = encoder(dummy_input)
print(f"Input shape: {dummy_input.shape}")
print(f"Encoded shape: {output.shape}")
print("Text encoder works!")

### Step 2: Reward Head

The reward head takes the encoded representation and projects it to a single scalar value.

In [None]:
class RewardModel(nn.Module):
    """
    Reward model: text encoder + scalar reward head.

    Architecture mirrors real RLHF reward models:
    - Shared encoder (analogous to LLM backbone)
    - Linear projection to scalar reward
    """

    def __init__(self, vocab_size, embed_dim, hidden_dim):
        super().__init__()
        self.encoder = SimpleTextEncoder(vocab_size, embed_dim, hidden_dim)
        self.reward_head = nn.Linear(hidden_dim, 1)  # Project to scalar

    def forward(self, token_ids):
        hidden = self.encoder(token_ids)      # (batch, hidden_dim)
        reward = self.reward_head(hidden)      # (batch, 1)
        return reward.squeeze(-1)              # (batch,) — scalar per example

# Test it
reward_model = RewardModel(vocab_size=1000, embed_dim=64, hidden_dim=128).to(device)
dummy_input = torch.randint(0, 1000, (4, 20)).to(device)
rewards = reward_model(dummy_input)
print(f"Input shape: {dummy_input.shape}")
print(f"Rewards shape: {rewards.shape}")
print(f"Reward values: {rewards.detach().cpu().numpy()}")
print("\nReward model assigns a scalar score to each input!")

### Step 3: Bradley-Terry Loss

Now let us implement the loss function that trains the reward model on preference pairs.

In [None]:
def bradley_terry_loss(reward_preferred, reward_rejected):
    """
    Bradley-Terry preference loss.

    L = -log(sigmoid(r_preferred - r_rejected))

    Args:
        reward_preferred: (batch,) rewards for preferred responses
        reward_rejected: (batch,) rewards for rejected responses

    Returns:
        Scalar loss value
    """
    return -F.logsigmoid(reward_preferred - reward_rejected).mean()

# Test with our intuition from earlier
r_pref = torch.tensor([3.0, 1.5, 4.0])
r_rej = torch.tensor([1.0, 2.0, 0.5])

loss = bradley_terry_loss(r_pref, r_rej)
print(f"Preferred rewards: {r_pref.numpy()}")
print(f"Rejected rewards:  {r_rej.numpy()}")
print(f"Differences:       {(r_pref - r_rej).numpy()}")
print(f"Loss: {loss.item():.4f}")
print("\nNote: Pair 2 has r_pref < r_rej (wrong ranking), increasing the loss!")

## 5. Your Turn

### TODO 1: Implement a Preference Accuracy Metric

The accuracy metric tells us what fraction of preference pairs the reward model ranks correctly.

In [None]:
def preference_accuracy(reward_preferred, reward_rejected):
    """
    Calculate the fraction of pairs where the reward model
    correctly assigns a higher reward to the preferred response.

    Args:
        reward_preferred: (batch,) tensor of rewards for preferred responses
        reward_rejected: (batch,) tensor of rewards for rejected responses

    Returns:
        Float: accuracy between 0.0 and 1.0

    Hint: A pair is correct if reward_preferred > reward_rejected
    """
    # TODO: Implement this function
    # Step 1: Compare reward_preferred to reward_rejected element-wise
    # Step 2: Calculate the fraction that are correct
    pass

# Test your implementation
r_pref = torch.tensor([3.0, 1.5, 4.0, 2.0])
r_rej = torch.tensor([1.0, 2.0, 0.5, 1.8])
# Expected: 3 out of 4 correct = 0.75 (pair 2 is wrong)

acc = preference_accuracy(r_pref, r_rej)
print(f"Accuracy: {acc}")
# Should print: Accuracy: 0.75

### TODO 2: Create a Synthetic Preference Dataset

Create a dataset where "quality" is determined by a known function, so we can verify our reward model learns the right thing.

In [None]:
class SyntheticPreferenceDataset(Dataset):
    """
    Synthetic dataset for testing reward model training.

    Each "text" is a random token sequence. Quality is determined by
    a hidden scoring function (average token value). The preferred
    response always has higher true quality.

    TODO: Implement the __getitem__ method.
    """

    def __init__(self, num_pairs, seq_len, vocab_size):
        self.num_pairs = num_pairs
        self.seq_len = seq_len
        self.vocab_size = vocab_size

    def __len__(self):
        return self.num_pairs

    def __getitem__(self, idx):
        """
        Generate a preference pair.

        Returns:
            dict with keys:
                'preferred': tensor of token IDs (seq_len,) — higher quality
                'rejected': tensor of token IDs (seq_len,) — lower quality

        Hint:
            1. Generate two random token sequences
            2. Calculate a "quality score" for each (e.g., mean token value)
            3. The one with higher quality becomes 'preferred'
        """
        # TODO: Implement this method
        # Step 1: Generate two random token sequences using torch.randint
        # Step 2: Calculate quality scores (try: mean token value / vocab_size)
        # Step 3: Assign preferred/rejected based on quality
        pass

# Test your implementation
# dataset = SyntheticPreferenceDataset(num_pairs=1000, seq_len=20, vocab_size=1000)
# sample = dataset[0]
# print(f"Preferred shape: {sample['preferred'].shape}")
# print(f"Rejected shape: {sample['rejected'].shape}")

## 6. Putting It All Together

Now let us train our reward model on synthetic preference data and see if it learns to rank correctly.

In [None]:
# Create synthetic dataset with a known quality function
class WorkingSyntheticDataset(Dataset):
    """Each sequence has quality = mean(token_ids) / vocab_size."""

    def __init__(self, num_pairs, seq_len, vocab_size):
        self.data = []
        for _ in range(num_pairs):
            seq_a = torch.randint(0, vocab_size, (seq_len,))
            seq_b = torch.randint(0, vocab_size, (seq_len,))
            quality_a = seq_a.float().mean().item()
            quality_b = seq_b.float().mean().item()

            if quality_a > quality_b:
                self.data.append({'preferred': seq_a, 'rejected': seq_b})
            else:
                self.data.append({'preferred': seq_b, 'rejected': seq_a})

    def __len__(self):
        return len(self.data)

    def __getitem__(self, idx):
        return self.data[idx]

# Create dataset and dataloader
dataset = WorkingSyntheticDataset(num_pairs=2000, seq_len=20, vocab_size=1000)
dataloader = DataLoader(dataset, batch_size=32, shuffle=True)

# Initialize model and optimizer
reward_model = RewardModel(vocab_size=1000, embed_dim=64, hidden_dim=128).to(device)
optimizer = torch.optim.Adam(reward_model.parameters(), lr=1e-3)

print(f"Dataset size: {len(dataset)} preference pairs")
print(f"Model parameters: {sum(p.numel() for p in reward_model.parameters()):,}")
print("Ready to train!")

## 7. Training and Results

In [None]:
# Training loop
losses = []
accuracies = []
num_epochs = 20

for epoch in range(num_epochs):
    epoch_loss = 0
    epoch_correct = 0
    epoch_total = 0

    for batch in dataloader:
        preferred = batch['preferred'].to(device)
        rejected = batch['rejected'].to(device)

        # Forward pass
        r_pref = reward_model(preferred)
        r_rej = reward_model(rejected)

        # Compute loss
        loss = bradley_terry_loss(r_pref, r_rej)

        # Backward pass
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

        # Track metrics
        epoch_loss += loss.item()
        epoch_correct += (r_pref > r_rej).sum().item()
        epoch_total += len(preferred)

    avg_loss = epoch_loss / len(dataloader)
    accuracy = epoch_correct / epoch_total
    losses.append(avg_loss)
    accuracies.append(accuracy)

    if (epoch + 1) % 5 == 0:
        print(f"Epoch {epoch+1}/{num_epochs} — Loss: {avg_loss:.4f}, Accuracy: {accuracy:.3f}")

print(f"\nFinal accuracy: {accuracies[-1]:.3f}")

In [None]:
# Visualize training progress
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

axes[0].plot(range(1, num_epochs+1), losses, 'b-', linewidth=2)
axes[0].set_xlabel('Epoch', fontsize=12)
axes[0].set_ylabel('Loss', fontsize=12)
axes[0].set_title('Reward Model Training Loss', fontsize=13)
axes[0].grid(True, alpha=0.3)

axes[1].plot(range(1, num_epochs+1), accuracies, 'g-', linewidth=2)
axes[1].axhline(y=0.5, color='red', linestyle='--', alpha=0.5, label='Random baseline')
axes[1].set_xlabel('Epoch', fontsize=12)
axes[1].set_ylabel('Preference Accuracy', fontsize=12)
axes[1].set_title('Reward Model Preference Accuracy', fontsize=13)
axes[1].legend(fontsize=10)
axes[1].grid(True, alpha=0.3)
axes[1].set_ylim(0.4, 1.05)

plt.tight_layout()
plt.show()

print("\nThe reward model learned to rank preferred responses higher!")
print("Starting from random (50%), it achieves significantly better accuracy.")

## 8. Final Output

Let us verify that our trained reward model assigns higher scores to higher-quality sequences.
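
The check in this section reports a Pearson correlation, which measures how linear the relationship is. Since the training labels only say which sequence of a pair is better, a reward model can order sequences perfectly while its scores are a curved function of quality. A rank (Spearman) correlation is a natural complement in that case. Here is a minimal sketch in plain Python; the helper names and the toy numbers are made up for illustration and do not come from the trained model.

```python
def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


def ranks(values):
    """Rank of each value (0 = smallest). Assumes there are no ties."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, index in enumerate(order):
        result[index] = rank
    return result


def spearman(xs, ys):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(ranks(xs), ranks(ys))


# Hypothetical scores: perfectly ordered, but saturating at the top
toy_qualities = [100.0, 300.0, 500.0, 700.0, 900.0]
toy_rewards = [-4.0, -1.0, 0.5, 0.9, 1.0]

pearson_value = pearson(toy_qualities, toy_rewards)
spearman_value = spearman(toy_qualities, toy_rewards)
print(f"Pearson:  {pearson_value:.2f}")   # about 0.89
print(f"Spearman: {spearman_value:.2f}")  # 1.00
```

In this toy case the ordering is perfect (Spearman 1.00) even though Pearson is below 1, so reporting both gives a fuller picture of what the reward model learned.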

In [None]:
# Generate sequences of varying quality and check if reward model scores them correctly
reward_model.eval()

qualities = []
rewards = []

with torch.no_grad():
    for target_quality in np.linspace(100, 900, 20):
        # Generate sequence with tokens centered around target_quality
        seq = torch.randint(
            max(0, int(target_quality - 100)),
            min(1000, int(target_quality + 100)),
            (1, 20)
        ).to(device)

        reward = reward_model(seq).item()
        true_quality = seq.float().mean().item()

        qualities.append(true_quality)
        rewards.append(reward)

plt.figure(figsize=(8, 5))
plt.scatter(qualities, rewards, c='blue', s=50, alpha=0.8)
plt.xlabel('True Quality (mean token value)', fontsize=12)
plt.ylabel('Predicted Reward', fontsize=12)
plt.title('Reward Model: True Quality vs Predicted Reward', fontsize=13)
plt.grid(True, alpha=0.3)

# Add correlation
correlation = np.corrcoef(qualities, rewards)[0, 1]
plt.text(0.05, 0.95, f'Correlation: {correlation:.3f}',
         transform=plt.gca().transAxes, fontsize=12,
         verticalalignment='top', bbox=dict(boxstyle='round', facecolor='lightblue'))
plt.tight_layout()
plt.show()

print(f"\nCorrelation between true quality and predicted reward: {correlation:.3f}")
print("A high positive correlation means the reward model learned the quality function!")

## 9. Reflection and Next Steps

**What we built:**
- A reward model that learns from pairwise preference comparisons
- The Bradley-Terry loss function for training on ranked pairs
- Visualization of how the model learns to rank correctly

**Key takeaways:**
1. Reward models learn from **comparisons**, not absolute scores
2. The Bradley-Terry loss pushes preferred responses to get higher rewards
3. Even a simple model can learn a latent quality function from comparisons

**Think about:**
- What happens if the preference data is noisy (humans disagree)?
- How would you handle ties (both responses are equally good)?
- In real RLHF, the reward model starts from a pretrained LLM — why is this important?
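
As a starting point for the first two questions, here is one possible approach, sketched as an assumption rather than something this notebook implements: replace the hard label "A is preferred" with a soft label `p_a_preferred` between 0 and 1. A value of 1.0 recovers the Bradley-Terry loss from Section 3, 0.5 encodes a tie, and something like 0.75 encodes annotators who agree only three times out of four. The function names are hypothetical.

```python
import math


def log_sigmoid(z):
    """Numerically stable log(sigmoid(z))."""
    if z >= 0:
        return -math.log1p(math.exp(-z))
    return z - math.log1p(math.exp(z))


def soft_preference_loss(r_a, r_b, p_a_preferred):
    """Cross-entropy between a soft label and the Bradley-Terry probability.

    p_a_preferred = 1.0 gives the standard loss -log(sigmoid(r_a - r_b)).
    p_a_preferred = 0.5 treats the pair as a tie.
    """
    delta = r_a - r_b
    return -(
        p_a_preferred * log_sigmoid(delta)
        + (1.0 - p_a_preferred) * log_sigmoid(-delta)
    )


hard_loss = soft_preference_loss(3.0, 1.0, 1.0)      # standard Bradley-Terry case
tie_loss_equal = soft_preference_loss(0.0, 0.0, 0.5)  # tie, equal rewards
tie_loss_apart = soft_preference_loss(2.0, 0.0, 0.5)  # tie, rewards far apart

print(f"Hard label, gap 2.0: {hard_loss:.3f}")       # 0.127
print(f"Tie, gap 0.0:        {tie_loss_equal:.3f}")  # 0.693
print(f"Tie, gap 2.0:        {tie_loss_apart:.3f}")  # 1.127
```

For a tie, the loss is lowest when both rewards are equal, so the model is no longer pushed to separate responses that humans consider equally good.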

**Next notebook:** We will use this reward model to actually improve a language model using policy gradients and PPO.