# Comprehensive Guide to Building Reward Models for RLHF

This notebook implements a GPT-2-based reward model for Reinforcement Learning from Human Feedback (RLHF). Methods such as GRPO (Group Relative Policy Optimization) simplify the RL stage by dropping PPO's learned value function, and they are often paired with rule-based rewards rather than a learned reward model. This implementation instead focuses on the traditional **PPO + Reward Model** paradigm, which has been used in RLHF systems such as ChatGPT and Claude.

## 🧠 Theoretical Foundation

### Why Reward Models?
GRPO shows promise by comparing outputs within a group, which removes the separate value model and pairs naturally with rule-based rewards. Training an explicit **reward model** nevertheless remains a widely used approach, for several reasons:

1. **Scalability**: Reward models can be trained once and reused across multiple policy training runs
2. **Interpretability**: Explicit reward scores provide clear signals for debugging and analysis  
3. **Flexibility**: Can incorporate diverse preference data from multiple sources and modalities
4. **Proven Track Record**: Used in the RLHF pipelines behind widely deployed language models

### Mathematical Framework

#### The Bradley-Terry Model for Preferences
Human preferences can be modeled using the **Bradley-Terry model**, which assumes that the probability of preferring completion $y_w$ (winner) over completion $y_l$ (loser) given prompt $x$ follows:

$P(y_w \succ y_l | x) = \frac{\exp(r_\theta(x, y_w))}{\exp(r_\theta(x, y_w)) + \exp(r_\theta(x, y_l))} = \sigma(r_\theta(x, y_w) - r_\theta(x, y_l))$

Where:
- $r_\theta(x, y)$ is our learned reward function parameterized by $\theta$
- $\sigma(\cdot)$ is the sigmoid function: $\sigma(z) = \frac{1}{1 + e^{-z}}$
- $y_w \succ y_l$ denotes that $y_w$ is preferred over $y_l$
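
To make the formula concrete, here is a minimal sketch that evaluates the Bradley-Terry probability for a few reward pairs. The function name and the reward values are illustrative assumptions, not outputs of the model trained later in this notebook.

```python
import math


def preference_probability(r_w: float, r_l: float) -> float:
    """P(y_w preferred over y_l | x) under Bradley-Terry, given two scalar rewards."""
    return 1.0 / (1.0 + math.exp(-(r_w - r_l)))


# Hypothetical reward values chosen only for illustration.
p_equal = preference_probability(0.3, 0.3)        # identical rewards
p_gap = preference_probability(2.0, 0.0)          # winner ahead by 2
p_shifted = preference_probability(102.0, 100.0)  # same gap, both shifted by 100

print(f"equal rewards: {p_equal:.4f}")
print(f"gap of 2:      {p_gap:.4f}")
print(f"shifted gap:   {p_shifted:.4f}")
```

Only the difference between the two rewards enters the probability, so adding a constant to every reward leaves the predicted preferences unchanged. This is why the absolute scale of a trained reward model's outputs carries no meaning on its own.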

#### Loss Function Derivation
The negative log-likelihood loss for the Bradley-Terry model becomes:

$\mathcal{L}_{RM}(\theta) = -\mathbb{E}_{(x,y_w,y_l) \sim \mathcal{D}} \left[ \log \sigma(r_\theta(x, y_w) - r_\theta(x, y_l)) \right]$

This can be rewritten as:
$\mathcal{L}_{RM}(\theta) = \mathbb{E}_{(x,y_w,y_l) \sim \mathcal{D}} \left[ \log(1 + \exp(r_\theta(x, y_l) - r_\theta(x, y_w))) \right]$
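
The rewritten form is the softplus of the negated margin, which can be evaluated without overflow. The sketch below compares both forms on plain floats. The function names are hypothetical, and the training code in this notebook operates on tensors rather than Python floats.

```python
import math


def bt_loss_direct(r_w: float, r_l: float) -> float:
    """-log(sigmoid(r_w - r_l)), written exactly as in the first form."""
    return -math.log(1.0 / (1.0 + math.exp(-(r_w - r_l))))


def bt_loss_softplus(r_w: float, r_l: float) -> float:
    """log(1 + exp(r_l - r_w)), evaluated in an overflow-safe way."""
    z = r_l - r_w
    # softplus(z) = max(z, 0) + log(1 + exp(-|z|)); the exponent is never positive.
    return max(z, 0.0) + math.log1p(math.exp(-abs(z)))


# Both forms agree for a moderate margin (illustrative values).
moderate_direct = bt_loss_direct(1.5, -0.5)
moderate_softplus = bt_loss_softplus(1.5, -0.5)

# For a badly misranked pair the softplus form stays finite, whereas the
# direct form would overflow inside math.exp.
extreme_softplus = bt_loss_softplus(0.0, 1000.0)

print(moderate_direct, moderate_softplus, extreme_softplus)
```

When both rewards are equal the loss is $\log 2 \approx 0.693$, which is a useful sanity check for the first training steps of a freshly initialized value head.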

#### Architecture Design
Our reward model $r_\theta(x, y)$ is constructed as:

$r_\theta(x, y) = \text{Linear}(\text{GPT-2}_\theta(\text{concat}(x, y))_{[\text{EOS}]})$

Where:
- $\text{concat}(x, y)$ represents the concatenation of prompt and completion
- $[\text{EOS}]$ indicates that we extract the hidden state at the final non-padding token of the sequence
- $\text{Linear}(\cdot)$ is a single linear layer: $\mathbb{R}^{d_{\text{model}}} \rightarrow \mathbb{R}$
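
Locating the "final token" depends on where padding sits. The sketch below works on a single attention-mask row, using plain lists instead of tensors, and contrasts a length-based index with a padding-agnostic one. Both helper names are hypothetical.

```python
def last_token_index_right_padded(mask_row):
    """Index of the last real token, assuming the row is 1s followed by 0s."""
    return sum(mask_row) - 1


def last_token_index_any_padding(mask_row):
    """Index of the last 1 in the mask, regardless of padding side."""
    return max(i for i, m in enumerate(mask_row) if m == 1)


right_padded = [1, 1, 1, 0, 0]  # three real tokens, then padding
left_padded = [0, 0, 1, 1, 1]   # padding first, then three real tokens

idx_right = last_token_index_right_padded(right_padded)
idx_left_wrong = last_token_index_right_padded(left_padded)
idx_left_ok = last_token_index_any_padding(left_padded)

print(idx_right, idx_left_wrong, idx_left_ok)
```

The length-based index is only correct for right-padded batches. If the tokenizer is configured to pad on the left, the reward would be read from the wrong position unless the padding-agnostic variant is used.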

### Comparison with Alternative Approaches

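#### GRPO vs PPO + Reward Model
**GRPO (Group Relative Policy Optimization)** removes PPO's learned value function (critic) by normalizing rewards within a group of $G$ completions sampled for the same prompt. It still needs a reward $r_i$ for each completion, which can come from a reward model or from a rule-based scorer:

$A_i = \frac{r_i - \text{mean}(\{r_1, r_2, \ldots, r_G\})}{\text{std}(\{r_1, r_2, \ldots, r_G\})}$

**Advantages of GRPO:**
- No separate value model to train alongside the policy
- Reduced computational overhead
- Built-in variance reduction through group normalization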

**Why We Still Use Reward Models:**
- **Reusability**: Train once, use for multiple policy iterations
- **Data Efficiency**: Can leverage large-scale preference datasets
- **Robustness**: More stable training dynamics for complex tasks
- **Interpretability**: Explicit reward values enable better analysis
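
For comparison, the group-normalized advantage $A_i$ defined above can be sketched in a few lines. The use of the population standard deviation and the small `eps` guard are assumptions made here for illustration, and the reward values are hypothetical.

```python
import statistics


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each reward by the mean and std of its own group."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # population std (an assumption of this sketch)
    # eps avoids division by zero when every completion gets the same reward.
    return [(r - mean) / (std + eps) for r in rewards]


# Hypothetical rewards for G = 4 completions of one prompt.
group_rewards = [1.0, 2.0, 3.0, 4.0]
advantages = group_relative_advantages(group_rewards)
print([round(a, 4) for a in advantages])
```

The rewards fed into this normalization can come from a model like the one built in this notebook, so the two approaches are not mutually exclusive.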

## 🏗️ Implementation Strategy

### Architecture Components
1. **Base Model**: GPT-2 transformer for sequence encoding
2. **Value Head**: Single linear layer for scalar reward prediction  
3. **Training Loop**: Bradley-Terry loss optimization with preference pairs
4. **Evaluation**: Ranking correlation and preference accuracy metrics
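
The preference accuracy mentioned in the evaluation component can be sketched as the fraction of held-out pairs in which the chosen completion scores higher than the rejected one. The helper names and reward values below are hypothetical, and this metric is not computed by the evaluation cell later in the notebook.

```python
def preference_accuracy(chosen_rewards, rejected_rewards):
    """Fraction of pairs where the chosen completion gets the higher reward."""
    pairs = list(zip(chosen_rewards, rejected_rewards))
    correct = sum(1 for chosen, rejected in pairs if chosen > rejected)
    return correct / len(pairs)


def mean_margin(chosen_rewards, rejected_rewards):
    """Average of r_chosen - r_rejected across pairs."""
    margins = [c - r for c, r in zip(chosen_rewards, rejected_rewards)]
    return sum(margins) / len(margins)


# Hypothetical held-out rewards for four preference pairs.
chosen = [1.2, 0.4, -0.3, 2.0]
rejected = [0.1, 0.9, -1.0, 1.5]

accuracy = preference_accuracy(chosen, rejected)
margin = mean_margin(chosen, rejected)
print(f"accuracy={accuracy:.2f}, mean margin={margin:.2f}")
```

An accuracy near 0.5 means the model ranks pairs no better than chance. Reporting the mean margin alongside it helps distinguish a confident model from one that is barely separating the pairs.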

### Training Pipeline
1. **Data Preparation**: Convert preference annotations to training pairs
2. **Model Training**: Optimize Bradley-Terry loss with Adam/AdamW
3. **Validation**: Test ranking ability on held-out preference data
4. **Integration**: Export for use in PPO policy training
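
The training step pairs AdamW with a linear warmup-then-decay learning-rate schedule. The sketch below reimplements the multiplier as understood here (linear ramp over the warmup steps, then linear decay to zero) and checks it against the settings used later: base rate `1e-5`, 750 total steps, 10% warmup. The function name is hypothetical.

```python
def linear_warmup_decay(step, warmup_steps, total_steps):
    """Learning-rate multiplier: ramp from 0 to 1, then decay linearly to 0."""
    if step < warmup_steps:
        return step / max(1, warmup_steps)
    return max(0.0, (total_steps - step) / max(1, total_steps - warmup_steps))


base_lr = 1e-5
total_steps = 750
warmup_steps = total_steps // 10  # 75 warmup steps

lr_after_epoch_1 = base_lr * linear_warmup_decay(250, warmup_steps, total_steps)
lr_after_epoch_2 = base_lr * linear_warmup_decay(500, warmup_steps, total_steps)
lr_after_epoch_3 = base_lr * linear_warmup_decay(750, warmup_steps, total_steps)

print(f"{lr_after_epoch_1:.2e} {lr_after_epoch_2:.2e} {lr_after_epoch_3:.2e}")
```

These three values match the `lr` figures shown in the progress bars of the recorded training run at the end of the notebook.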

This implementation provides a foundation for reward models that can be scaled to larger preference datasets and plugged into PPO-based policy optimization.


In [None]:
import torch
import torch.nn as nn
from transformers import (
    AutoTokenizer,
    GPT2Model,
    GPT2PreTrainedModel,
    GPT2Config,
    get_linear_schedule_with_warmup,
)
from datasets import Dataset, load_dataset
from torch.utils.data import DataLoader
from tqdm import tqdm
import random
import numpy as np



In [2]:
# Set random seeds for reproducibility
torch.manual_seed(42)
np.random.seed(42)
random.seed(42)

In [None]:
## 🧱 Step 1: Define GPT-2 Reward Model
class GPT2RewardModel(GPT2PreTrainedModel):
    def __init__(self, config):
        super().__init__(config)
        self.gpt2 = GPT2Model(config)
        self.value_head = nn.Linear(config.n_embd, 1)

        # Initialize the value head with small weights
        nn.init.normal_(self.value_head.weight, std=0.02)
        nn.init.zeros_(self.value_head.bias)

    def forward(self, input_ids, attention_mask=None):
        """
        Forward pass that returns a scalar reward for the input sequence.

        Args:
            input_ids: Token IDs of shape (batch_size, seq_len)
            attention_mask: Attention mask of shape (batch_size, seq_len)

        Returns:
            rewards: Scalar rewards of shape (batch_size,)
        """
        # Get GPT-2 outputs
        outputs = self.gpt2(input_ids=input_ids, attention_mask=attention_mask)
        last_hidden_states = (
            outputs.last_hidden_state
        )  # (batch_size, seq_len, hidden_size)

        # Get the hidden state of the last non-padding token for each sequence
        if attention_mask is not None:
            # Index of the last non-padding token. This assumes right-padding,
            # i.e. each mask row is a run of 1s followed by a run of 0s.
            seq_lengths = attention_mask.sum(dim=1) - 1  # -1 for 0-indexing
        else:
            seq_lengths = torch.full(
                (input_ids.shape[0],), input_ids.shape[1] - 1, device=input_ids.device
            )

        # Extract the final token's hidden state for each sequence in the batch
        batch_indices = torch.arange(
            last_hidden_states.shape[0], device=last_hidden_states.device
        )
        final_hidden_states = last_hidden_states[batch_indices, seq_lengths]

        # Pass through value head to get scalar reward
        rewards = self.value_head(final_hidden_states)
        return rewards.squeeze(-1)  # Shape: (batch_size,)

In [None]:
## 📚 Step 2: Initialize Model and Tokenizer


def initialize_model():
    """Initialize the reward model and tokenizer."""
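    # Load GPT-2 config and create reward model
    config = GPT2Config.from_pretrained("gpt2")
    model = GPT2RewardModel(config)

    # Building from a config alone leaves the backbone randomly initialized,
    # so load the pretrained GPT-2 weights into it explicitly.
    model.gpt2 = GPT2Model.from_pretrained("gpt2")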

    # Load tokenizer and set pad token
    tokenizer = AutoTokenizer.from_pretrained("gpt2")
    tokenizer.pad_token = tokenizer.eos_token

    return model, tokenizer

In [None]:
## 📊 Step 3: Create Preference Dataset


def create_preference_dataset(tokenizer, num_samples=1000, max_length=256):
    """
    Create a simulated preference dataset for training.
    In practice, you would load real preference data here.
    """
    # Load IMDB dataset as base text
    raw_dataset = load_dataset("imdb", split="train[:5%]")

    preference_data = []

    for i in tqdm(
        range(min(num_samples, len(raw_dataset))), desc="Creating preference pairs"
    ):
        base_text = raw_dataset[i]["text"][:200]  # Truncate for demo

        # Create two versions - one "better" than the other
        chosen_text = base_text + " This is a great movie with excellent acting."
        rejected_text = base_text + " This movie is terrible and boring."

        # Tokenize both versions
        chosen_encoded = tokenizer(
            chosen_text,
            truncation=True,
            padding="max_length",
            max_length=max_length,
            return_tensors="pt",
        )

        rejected_encoded = tokenizer(
            rejected_text,
            truncation=True,
            padding="max_length",
            max_length=max_length,
            return_tensors="pt",
        )

        preference_data.append(
            {
                "chosen_input_ids": chosen_encoded["input_ids"],
                "chosen_attention_mask": chosen_encoded["attention_mask"],
                "rejected_input_ids": rejected_encoded["input_ids"],
                "rejected_attention_mask": rejected_encoded["attention_mask"],
            }
        )

    return preference_data

## 🎯 Step 4: Fine-Tune GPT-2 with a Pairwise Preference Loss

We train the reward model to assign a higher score to the 'chosen' output than to the 'rejected' one:

$$
\mathcal{L}_{\text{RM}} = -\log\left(\frac{\exp(r_\text{chosen})}{\exp(r_\text{chosen}) + \exp(r_\text{rejected})}\right)
$$
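
For intuition about what this loss does during training, its gradient with respect to each reward follows from $\sigma'(z) = \sigma(z)(1 - \sigma(z))$. Writing $\Delta = r_\text{chosen} - r_\text{rejected}$ (a shorthand introduced here for brevity), the loss equals $-\log \sigma(\Delta)$ and:

$$
\frac{\partial \mathcal{L}_{\text{RM}}}{\partial r_\text{chosen}} = -(1 - \sigma(\Delta)), \qquad \frac{\partial \mathcal{L}_{\text{RM}}}{\partial r_\text{rejected}} = 1 - \sigma(\Delta)
$$

Gradient descent therefore pushes the two rewards apart with a strength of $1 - \sigma(\Delta)$. Pairs that the model already ranks correctly by a wide margin contribute almost no gradient, while misranked pairs contribute the most.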

In [None]:
## 🎯 Step 4: Define Loss Function


def pairwise_preference_loss(chosen_rewards, rejected_rewards):
    """
    Bradley-Terry model loss for preference learning.
    Loss = -log(sigmoid(r_chosen - r_rejected))
    """
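    # logsigmoid is numerically stable; log(sigmoid(x)) can underflow to log(0)
    # when the rejected reward is far above the chosen one.
    return -nn.functional.logsigmoid(chosen_rewards - rejected_rewards).mean()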

In [None]:
## 🔁 Step 5: Training Loop


def train_reward_model(
    model, preference_dataset, tokenizer, num_epochs=3, batch_size=4, lr=1e-5
):
    """Train the reward model on preference data."""

    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    model.to(device)
    model.train()

    # Create the DataLoader first so the scheduler can count actual batches
    dataloader = DataLoader(preference_dataset, batch_size=batch_size, shuffle=True)

    # Setup optimizer and scheduler (the scheduler is stepped once per batch)
    optimizer = torch.optim.AdamW(model.parameters(), lr=lr, weight_decay=0.01)
    total_steps = len(dataloader) * num_epochs
    scheduler = get_linear_schedule_with_warmup(
        optimizer, num_warmup_steps=total_steps // 10, num_training_steps=total_steps
    )

    print(f"Training on {device}")
    print(f"Total training steps: {total_steps}")

    for epoch in range(num_epochs):
        total_loss = 0.0
        progress_bar = tqdm(dataloader, desc=f"Epoch {epoch + 1}/{num_epochs}")

        for batch_idx, batch in enumerate(progress_bar):
            # Move batch to device
            chosen_input_ids = batch["chosen_input_ids"].squeeze(1).to(device)
            chosen_attention_mask = batch["chosen_attention_mask"].squeeze(1).to(device)
            rejected_input_ids = batch["rejected_input_ids"].squeeze(1).to(device)
            rejected_attention_mask = (
                batch["rejected_attention_mask"].squeeze(1).to(device)
            )

            # Forward pass
            chosen_rewards = model(chosen_input_ids, chosen_attention_mask)
            rejected_rewards = model(rejected_input_ids, rejected_attention_mask)

            # Calculate loss
            loss = pairwise_preference_loss(chosen_rewards, rejected_rewards)

            # Backward pass
            optimizer.zero_grad()
            loss.backward()
            torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
            optimizer.step()
            scheduler.step()

            total_loss += loss.item()

            # Update progress bar
            progress_bar.set_postfix(
                {
                    "loss": f"{loss.item():.4f}",
                    "avg_loss": f"{total_loss / (batch_idx + 1):.4f}",
                    "lr": f"{scheduler.get_last_lr()[0]:.2e}",
                }
            )

        avg_epoch_loss = total_loss / len(dataloader)
        print(f"Epoch {epoch + 1} completed. Average Loss: {avg_epoch_loss:.4f}")

In [None]:
## 🔬 Step 6: Evaluation and Testing


def evaluate_reward_model(model, tokenizer, test_texts):
    """Evaluate the reward model on test texts."""
    model.eval()
    device = next(model.parameters()).device

    with torch.no_grad():
        for text in test_texts:
            encoded = tokenizer(
                text,
                truncation=True,
                padding="max_length",
                max_length=256,
                return_tensors="pt",
            ).to(device)

            reward = model(encoded["input_ids"], encoded["attention_mask"])
            print(f"Text: {text[:50]}...")
            print(f"Reward: {reward.item():.4f}\n")

In [None]:
## 🚀 Step 7: Main Training Script


def main():
    """Main training pipeline."""
    print("🎯 Initializing GPT-2 Reward Model for PPO")

    # Initialize model and tokenizer
    model, tokenizer = initialize_model()
    print(
        f"Model initialized with {sum(p.numel() for p in model.parameters())} parameters"
    )

    # Create preference dataset
    print("📊 Creating preference dataset...")
    preference_dataset = create_preference_dataset(tokenizer, num_samples=500)
    print(f"Created {len(preference_dataset)} preference pairs")

    # Train the model
    print("🔁 Starting training...")
    train_reward_model(model, preference_dataset, tokenizer, num_epochs=3, batch_size=2)

    # Evaluate on test examples
    print("🔬 Evaluating model...")
    test_texts = [
        "This movie is absolutely fantastic with great acting and plot!",
        "This movie is terrible and boring with bad acting.",
        "The film has decent cinematography but lacks emotional depth.",
    ]
    evaluate_reward_model(model, tokenizer, test_texts)

    # Save the model
    print("💾 Saving model...")
    model.save_pretrained("./gpt2_reward_model")
    tokenizer.save_pretrained("./gpt2_reward_model")
    print("Model saved successfully!")

    return model, tokenizer

In [None]:
## 🎮 Step 8: Usage with PPO


def use_with_ppo_example(model, tokenizer, prompt, response):
    """
    Example of how to use the reward model with PPO.
    This function shows the expected interface for PPO integration.
    """
    model.eval()
    device = next(model.parameters()).device

    # Combine prompt and response
    full_text = prompt + response

    # Tokenize
    encoded = tokenizer(
        full_text,
        truncation=True,
        padding="max_length",
        max_length=256,
        return_tensors="pt",
    ).to(device)

    # Get reward
    with torch.no_grad():
        reward = model(encoded["input_ids"], encoded["attention_mask"])

    return reward.item()

In [None]:
if __name__ == "__main__":
    # Run the main training pipeline
    trained_model, trained_tokenizer = main()

    # Example PPO usage
    print("🎮 PPO Integration Example:")
    sample_prompt = "What do you think about this movie? "
    sample_response = "I think it's a great film with excellent storytelling and character development."

    reward_score = use_with_ppo_example(
        trained_model, trained_tokenizer, sample_prompt, sample_response
    )
    print(f"Reward for response: {reward_score:.4f}")

🎯 Initializing GPT-2 Reward Model for PPO
Model initialized with 124440577 parameters
📊 Creating preference dataset...


Creating preference pairs: 100%|██████████| 500/500 [00:00<00:00, 1957.24it/s]


Created 500 preference pairs
🔁 Starting training...
Training on cuda
Total training steps: 750


Epoch 1/3: 100%|██████████| 250/250 [00:16<00:00, 15.25it/s, loss=-0.0000, avg_loss=0.1960, lr=7.41e-06]


Epoch 1 completed. Average Loss: 0.1960


Epoch 2/3: 100%|██████████| 250/250 [00:12<00:00, 20.22it/s, loss=0.0000, avg_loss=0.0002, lr=3.70e-06]  


Epoch 2 completed. Average Loss: 0.0002


Epoch 3/3: 100%|██████████| 250/250 [00:15<00:00, 16.45it/s, loss=-0.0000, avg_loss=0.0000, lr=0.00e+00]


Epoch 3 completed. Average Loss: 0.0000
🔬 Evaluating model...
Text: This movie is absolutely fantastic with great acti...
Reward: 10.0591

Text: This movie is terrible and boring with bad acting....
Reward: -7.3091

Text: The film has decent cinematography but lacks emoti...
Reward: -1.1237

💾 Saving model...
Model saved successfully!
🎮 PPO Integration Example:
Reward for response: 10.3318
