# PyTorch Tutorial 19: RLHF and Alignment (The FAANG Standard)

**Author:** [Your Name/Organization]  
**Date:** 2025  

Training an LLM to predict the next token is only half the battle. The raw model (Base Model) is often chaotic, repetitive, or even toxic. To make it a helpful assistant (like ChatGPT or Claude), we need **Alignment**.

This tutorial covers the advanced techniques used to align models, focusing on the modern standard: **DPO (Direct Preference Optimization)**.

## Learning Objectives
1.  **Understand Alignment**: Why Supervised Fine-Tuning (SFT) isn't enough.
2.  **RLHF vs. DPO**: The evolution from a multi-stage Reinforcement Learning pipeline to a single classification-style loss on preference pairs.
3.  **Implement DPO Loss**: Write the DPO loss function from scratch in PyTorch.

---

## 1. Vocabulary First

This is a high-stakes interview topic. Know these terms cold.

-   **SFT (Supervised Fine-Tuning)**: Training the model on high-quality "Instruction -> Answer" pairs. This teaches the *format* but not necessarily the *preference*.
-   **Preference Data**: Data in the format `(Prompt, Chosen Response, Rejected Response)`. "Chosen" is better than "Rejected".
-   **Reward Model**: A separate neural network trained to output a scalar score indicating how "good" a response is. Used in PPO.
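-   **RLHF (Reinforcement Learning from Human Feedback)**: The classic pipeline: SFT -> Reward Model -> PPO. It requires training a separate reward model and then running an RL loop, which makes it complex and often unstable to tune.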
-   **DPO (Direct Preference Optimization)**: A newer method (2023) that optimizes the policy *directly* on preference data without a separate reward model. Stable and efficient.
-   **Reference Model**: A frozen copy of the SFT model. We want our new model to improve preferences *without drifting too far* from the reference (to prevent gibberish).
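
To make the preference-data format concrete, here is a minimal sketch of one record. The `PreferencePair` class and the example text are invented for illustration; real datasets use their own field names and prompt templates.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class PreferencePair:
    """One preference record (hypothetical container, not a library type)."""
    prompt: str
    chosen: str    # response the annotator preferred
    rejected: str  # response the annotator ranked lower

# Invented example record, for illustration only.
pair = PreferencePair(
    prompt="Explain what a learning rate is in one sentence.",
    chosen="The learning rate is the step size the optimizer uses when updating weights.",
    rejected="Learning rate learning rate learning rate.",
)

# Both the policy and the reference model score prompt + chosen and prompt + rejected.
# Joining with a newline is an assumption; a real pipeline applies its chat template.
chosen_text = pair.prompt + "\n" + pair.chosen
rejected_text = pair.prompt + "\n" + pair.rejected
print(chosen_text)
```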

## 2. The Mathematics of DPO

DPO is elegant because it derives a loss function directly from the optimal policy of the RLHF objective. 

The core idea is to raise the likelihood of the **Chosen** response ($y_w$) and lower the likelihood of the **Rejected** response ($y_l$), with both measured relative to the Reference Model rather than in absolute terms.

### The Formula

$$ L_{DPO}(\pi_\theta; \pi_{ref}) = -\mathbb{E}_{(x, y_w, y_l) \sim D} \left[ \log \sigma \left( \beta \log \frac{\pi_\theta(y_w|x)}{\pi_{ref}(y_w|x)} - \beta \log \frac{\pi_\theta(y_l|x)}{\pi_{ref}(y_l|x)} \right) \right] $$

Where:
-   $\pi_\theta$: The model we are training.
-   $\pi_{ref}$: The frozen reference model.
-   $y_w$: Winning (Chosen) response.
-   $y_l$: Losing (Rejected) response.
-   $\beta$: A hyperparameter (temperature) controlling deviation from the reference. Larger values keep the policy closer to it (0.1 is a common choice).
-   $\sigma$: The Sigmoid function.
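
### Where the Formula Comes From (Sketch)

The following is a condensed outline of the derivation, not a full proof. The symbols $r$ (a reward function), $\pi^*$ (the optimal policy), and $Z(x)$ (a normalizing constant) are notation introduced here for the sketch.

RLHF maximizes reward while penalizing divergence from the reference:

$$ \max_{\pi} \; \mathbb{E}_{x \sim D,\, y \sim \pi(\cdot|x)} \left[ r(x, y) \right] - \beta \, \mathrm{KL}\left( \pi(\cdot|x) \,\|\, \pi_{ref}(\cdot|x) \right) $$

This objective has a closed-form optimum:

$$ \pi^*(y|x) = \frac{1}{Z(x)} \, \pi_{ref}(y|x) \, \exp\left( \frac{r(x, y)}{\beta} \right) $$

Solving for the reward gives:

$$ r(x, y) = \beta \log \frac{\pi^*(y|x)}{\pi_{ref}(y|x)} + \beta \log Z(x) $$

Under the Bradley-Terry preference model, the probability that $y_w$ beats $y_l$ depends only on the reward difference, so the $\beta \log Z(x)$ term cancels:

$$ p(y_w \succ y_l \mid x) = \sigma\left( r(x, y_w) - r(x, y_l) \right) = \sigma\left( \beta \log \frac{\pi^*(y_w|x)}{\pi_{ref}(y_w|x)} - \beta \log \frac{\pi^*(y_l|x)}{\pi_{ref}(y_l|x)} \right) $$

Replacing $\pi^*$ with the trainable $\pi_\theta$ and minimizing the negative log-likelihood of the observed preferences yields $L_{DPO}$ above.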

In [1]:
import torch
import torch.nn as nn
import torch.nn.functional as F

# Setup
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f"Using device: {device}")

Using device: cpu


## 3. Implementing DPO Loss from Scratch

This is the "Whiteboard Coding" part. We will implement the loss function assuming we already have the log-probability of each chosen and rejected response (the sum of its per-token log-probs) under both the policy and the reference model.

In a real training loop, you would:
1.  Forward pass `Prompt + Chosen` through Policy Model -> Get LogProbs.
2.  Forward pass `Prompt + Rejected` through Policy Model -> Get LogProbs.
3.  Forward pass `Prompt + Chosen` through Reference Model -> Get LogProbs (No Grad).
4.  Forward pass `Prompt + Rejected` through Reference Model -> Get LogProbs (No Grad).
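
Each "Get LogProbs" step above turns a model's raw logits into one number per sequence. A minimal sketch of that conversion follows, assuming the model returns logits of shape `(batch, seq_len, vocab)` and that a mask marking the response tokens is available. The function name `sequence_logps` and the toy tensors are illustrative, not part of any library.

```python
import torch
import torch.nn.functional as F

def sequence_logps(logits, input_ids, response_mask):
    """Sum the log-probabilities of the response tokens in each sequence.

    logits:        (batch, seq_len, vocab) raw model outputs
    input_ids:     (batch, seq_len) token ids of prompt + response
    response_mask: (batch, seq_len) 1 for response tokens, 0 for prompt/padding
    """
    # Position t predicts token t+1, so drop the last logit and the first label.
    logits = logits[:, :-1, :]
    labels = input_ids[:, 1:]
    mask = response_mask[:, 1:].to(logits.dtype)

    # Log-probability the model assigned to each actual next token.
    token_logps = torch.gather(
        F.log_softmax(logits, dim=-1), dim=-1, index=labels.unsqueeze(-1)
    ).squeeze(-1)

    # Only response tokens count toward log pi(y | x).
    return (token_logps * mask).sum(dim=-1)

# Toy check with random logits standing in for a model's output.
torch.manual_seed(0)
batch, seq_len, vocab = 2, 6, 11
toy_logits = torch.randn(batch, seq_len, vocab)
toy_ids = torch.randint(0, vocab, (batch, seq_len))
toy_mask = torch.tensor([[0, 0, 0, 1, 1, 1],
                         [0, 0, 1, 1, 1, 0]])  # second row ends with padding
toy_logps = sequence_logps(toy_logits, toy_ids, toy_mask)
print(toy_logps.shape)  # torch.Size([2])
```

Masking out the prompt matters: the loss compares $\pi(y|x)$, so prompt tokens (shared by the chosen and rejected sequences) should not contribute to the sum.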

In [2]:
def dpo_loss(policy_chosen_logps, policy_rejected_logps, 
             ref_chosen_logps, ref_rejected_logps, 
             beta=0.1):
    """
    Computes the DPO loss for a batch of preferences.
    
    Args:
        policy_chosen_logps: Log-probs of chosen responses from the model being trained.
        policy_rejected_logps: Log-probs of rejected responses from the model being trained.
        ref_chosen_logps: Log-probs of chosen responses from the frozen reference model.
        ref_rejected_logps: Log-probs of rejected responses from the frozen reference model.
        beta: Temperature parameter (strength of the KL penalty).
        
    Returns:
        losses: The loss for each example in the batch.
        rewards_chosen: Implicit rewards for chosen examples (for logging).
        rewards_rejected: Implicit rewards for rejected examples (for logging).
    """
    
    # 1. The inputs are summed sequence log-probs:
    #    log pi_theta(y|x) for the policy and log pi_ref(y|x) for the reference.
    
    # 2. Subtracting them gives the log-ratio log( pi_theta(y|x) / pi_ref(y|x) ).
    
    # 3. The core DPO trick: Implicit Reward
    # The log-ratio, scaled by beta, acts as the "reward" for a response.
    # It is positive when the policy likes the response more than the reference does.
    logr_chosen = policy_chosen_logps - ref_chosen_logps
    logr_rejected = policy_rejected_logps - ref_rejected_logps
    
    # 4. The DPO objective maximizes the margin between chosen and rejected
    logits = beta * (logr_chosen - logr_rejected)
    
    # 5. The Loss is -log(sigmoid(logits))
    # F.logsigmoid is numerically more stable than log(sigmoid(x))
    losses = -F.logsigmoid(logits)
    
    # Optional: Calculate "Implicit Rewards" for visualization
    # This helps us see if the model is actually learning the preference
    with torch.no_grad():
        rewards_chosen = beta * logr_chosen
        rewards_rejected = beta * logr_rejected
        
    return losses, rewards_chosen, rewards_rejected

## 4. Testing the Loss Function

Let's verify this works with some dummy data. 

Imagine we have a batch of 2 examples.
-   **Example 1**: The policy assigns a lower probability to the rejected response than the reference model does, while scoring the chosen response the same. This is **GOOD**. Loss should be low.
-   **Example 2**: The policy assigns a higher probability to the rejected response than the reference model does. This is **BAD**. Loss should be high.

In [3]:
# Dummy Data (Batch Size = 2)

# Example 1: Good case (Policy prefers chosen more than Ref)
# Example 2: Bad case (Policy prefers rejected more than Ref)

policy_chosen = torch.tensor([-10.0, -10.0]) 
policy_rejected = torch.tensor([-15.0, -5.0]) # Ex 1: Rej is unlikely (-15). Ex 2: Rej is likely (-5).

ref_chosen = torch.tensor([-10.0, -10.0])
ref_rejected = torch.tensor([-10.0, -10.0])

print("--- Inputs ---")
print(f"Policy Chosen LogProbs:   {policy_chosen}")
print(f"Policy Rejected LogProbs: {policy_rejected}")

losses, r_chosen, r_rejected = dpo_loss(
    policy_chosen, policy_rejected,
    ref_chosen, ref_rejected,
    beta=0.5 # Higher beta = stronger constraint
)

print("\n--- Results ---")
print(f"Losses: {losses}")
print(f"Rewards Chosen: {r_chosen}")
print(f"Rewards Rejected: {r_rejected}")

# Interpretation
print("\n--- Interpretation ---")
print(f"Example 1 Loss: {losses[0]:.4f} (Low, because Policy correctly disliked the rejected response)")
print(f"Example 2 Loss: {losses[1]:.4f} (High, because Policy wrongly liked the rejected response)")

--- Inputs ---
Policy Chosen LogProbs:   tensor([-10., -10.])
Policy Rejected LogProbs: tensor([-15.,  -5.])

--- Results ---
Losses: tensor([0.0789, 2.5789])
Rewards Chosen: tensor([0., 0.])
Rewards Rejected: tensor([-2.5000,  2.5000])

--- Interpretation ---
Example 1 Loss: 0.0789 (Low, because Policy correctly disliked the rejected response)
Example 2 Loss: 2.5789 (High, because Policy wrongly liked the rejected response)
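
The printed losses can be checked by hand. With $\beta = 0.5$ and every reference log-prob equal to $-10$:

$$ \text{Example 1: } \beta \left[ (-10 - (-10)) - (-15 - (-10)) \right] = 0.5 \times 5 = 2.5, \qquad -\log \sigma(2.5) = \log\left(1 + e^{-2.5}\right) \approx 0.0789 $$

$$ \text{Example 2: } \beta \left[ (-10 - (-10)) - (-5 - (-10)) \right] = 0.5 \times (-5) = -2.5, \qquad -\log \sigma(-2.5) = \log\left(1 + e^{2.5}\right) \approx 2.5789 $$

The two losses differ by exactly $2.5$ because $-\log \sigma(-z) = z - \log \sigma(z)$.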


## 5. The Training Loop (Conceptual)

In a real scenario, you would integrate this into a PyTorch loop.

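```python
# Pseudocode for a DPO training loop.
# Assumes `model`, `ref_model`, `optimizer`, and `dataloader` already exist.
# `get_logps(m, ids)` is a hypothetical helper (not defined here) that runs a
# forward pass and returns the summed log-prob of the response tokens per sequence.
# A model's raw output is logits, so this conversion step is required.
ref_model.eval()  # the reference model stays frozen throughout

for batch in dataloader:
    optimizer.zero_grad()
    
    # 1. Forward Pass Policy (gradients flow through these)
    policy_logps_chosen = get_logps(model, batch['chosen_ids'])
    policy_logps_rejected = get_logps(model, batch['rejected_ids'])
    
    # 2. Forward Pass Reference (No Grad)
    with torch.no_grad():
        ref_logps_chosen = get_logps(ref_model, batch['chosen_ids'])
        ref_logps_rejected = get_logps(ref_model, batch['rejected_ids'])
        
    # 3. Compute per-example losses
    losses, _, _ = dpo_loss(
        policy_logps_chosen, policy_logps_rejected,
        ref_logps_chosen, ref_logps_rejected
    )
    
    # 4. Backprop on the batch mean
    losses.mean().backward()
    optimizer.step()
```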
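
To see the loop run end to end, here is a small self-contained sketch on synthetic data. Everything in it is a stand-in: `TinyLM` is a hypothetical toy model, the token ids are random, and the loss expression from Section 3 is inlined so the block runs on its own. It illustrates the mechanics only, not a realistic training setup.

```python
import copy
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
VOCAB, DIM, PROMPT_LEN, RESP_LEN = 20, 16, 3, 4

class TinyLM(nn.Module):
    """Hypothetical stand-in for a language model: token ids -> next-token logits."""
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(VOCAB, DIM)
        self.head = nn.Linear(DIM, VOCAB)

    def forward(self, input_ids):
        return self.head(self.embed(input_ids))  # (batch, seq_len, VOCAB)

def response_logps(model, input_ids):
    """Summed log-prob of the response tokens (everything after the prompt)."""
    logits = model(input_ids)[:, :-1, :]   # position t predicts token t+1
    labels = input_ids[:, 1:]
    token_logps = F.log_softmax(logits, dim=-1).gather(-1, labels.unsqueeze(-1)).squeeze(-1)
    return token_logps[:, PROMPT_LEN - 1:].sum(dim=-1)  # skip prompt positions

policy = TinyLM()
ref_model = copy.deepcopy(policy).eval()   # frozen copy of the starting policy
for p in ref_model.parameters():
    p.requires_grad_(False)

# One fixed toy batch: shared prompts with random "chosen" and "rejected" continuations.
prompts = torch.randint(0, VOCAB, (8, PROMPT_LEN))
chosen_ids = torch.cat([prompts, torch.randint(0, VOCAB, (8, RESP_LEN))], dim=1)
rejected_ids = torch.cat([prompts, torch.randint(0, VOCAB, (8, RESP_LEN))], dim=1)

# The reference never changes, so its log-probs can be computed once.
with torch.no_grad():
    ref_chosen = response_logps(ref_model, chosen_ids)
    ref_rejected = response_logps(ref_model, rejected_ids)

optimizer = torch.optim.Adam(policy.parameters(), lr=1e-2)
beta = 0.1
history = []

for step in range(20):
    optimizer.zero_grad()
    pol_chosen = response_logps(policy, chosen_ids)
    pol_rejected = response_logps(policy, rejected_ids)
    margin = beta * ((pol_chosen - ref_chosen) - (pol_rejected - ref_rejected))
    loss = -F.logsigmoid(margin).mean()
    loss.backward()
    optimizer.step()
    history.append(loss.item())

print(f"first loss: {history[0]:.4f}, last loss: {history[-1]:.4f}")
```

The first loss is $\log 2 \approx 0.6931$ because the policy starts as an exact copy of the reference, so every margin is zero. Precomputing the reference log-probs is a common cost saving when the dataset is fixed.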

## 6. Key Takeaways for Interviews

1.  **Why DPO?** It removes the need for a separate Reward Model and the unstable PPO loop. It optimizes the policy directly against the preference data.
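2.  **The Reference Model**: Crucial for preventing the model from "gaming" the preference data or outputting gibberish. Measuring every log-prob relative to it acts as an implicit KL-divergence regularizer.
3.  **Beta**: The hyperparameter that sets the strength of that regularizer. A higher beta keeps the policy closer to the reference model; a lower beta lets it move further to fit the preference data.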
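
One way to build intuition for beta is to hold the log-ratio gap fixed and vary beta. The sketch below assumes a hypothetical gap of 5 nats (the policy favors the chosen response by 5 nats more than the reference does) and prints the loss and its gradient with respect to that gap.

```python
import torch
import torch.nn.functional as F

# Hypothetical fixed gap: (logr_chosen - logr_rejected) = 5 nats.
log_ratio_gap = torch.tensor(5.0)

betas = [0.1, 0.5, 1.0]
beta_losses = {}
for beta in betas:
    gap = log_ratio_gap.clone().requires_grad_(True)
    loss = -F.logsigmoid(beta * gap)
    loss.backward()  # gradient of the loss with respect to the gap
    beta_losses[beta] = (loss.item(), gap.grad.item())
    print(f"beta={beta:<4} loss={loss.item():.4f}  dloss/dgap={gap.grad.item():.4f}")
```

For the same gap, the loss shrinks as beta grows. With a large beta the loss is already nearly saturated, so there is little pressure to move further from the reference. With a small beta the loss is still sizeable, so the policy keeps being pushed to widen the gap.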

You now have the code to implement the core of modern LLM alignment!