# Direct Preference Optimization (DPO): Simplified Alignment Without Reward Models

## 🎯 Overview

Direct Preference Optimization (DPO) aligns language models with human preferences without training an explicit reward model or running a reinforcement learning loop. Instead, it optimizes the policy directly on preference data.

**Key Innovation**: Treats the alignment problem as a classification task on preference pairs, bypassing the complex RLHF pipeline.

**Impact**: Widely adopted for alignment because it simplifies training while often achieving results comparable to PPO-based RLHF.

## 📚 Background & Motivation

### The RLHF Complexity Problem
Traditional RLHF requires:
1. **Supervised Fine-tuning (SFT)** on demonstrations
2. **Reward Model training** on preference data  
3. **Reinforcement Learning** against the reward model
4. **Complex optimization**, typically with PPO, which is sensitive to hyperparameters

### The DPO Solution
- **Direct optimization** on preference data
- **No reward model** needed
- **Simpler training** with standard supervised learning
- **Theoretical grounding** via the Bradley-Terry preference model and the KL-constrained reward maximization objective

In [None]:
import numpy as np
import matplotlib.pyplot as plt
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.utils.data import Dataset, DataLoader
import seaborn as sns
from typing import Tuple, Dict, List
import math

# Set style
plt.style.use('default')
sns.set_palette("husl")
np.random.seed(42)
torch.manual_seed(42)

print("📦 Libraries imported successfully!")
print(f"🔢 NumPy version: {np.__version__}")
print(f"🔥 PyTorch version: {torch.__version__}")

## 🧮 Mathematical Foundation

### DPO Core Mathematics

DPO is based on the insight that the optimal policy π* for RLHF has a closed form:

**π*(y|x) = 1/Z(x) × π_ref(y|x) × exp(r*(x,y) / β)**

Where:
- **π_ref**: Reference model (SFT model)
- **r***: Optimal reward function
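- **β**: Coefficient of the KL penalty toward π_ref; it acts like a temperature (smaller β lets the policy move further from the reference)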
- **Z(x)**: Partition function
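
To make the closed form concrete, the sketch below evaluates it for a single prompt with three candidate responses. The reference probabilities and rewards are made-up numbers, and `optimal_policy` is a hypothetical helper written for this illustration, not part of any library.

```python
import math

def optimal_policy(ref_probs, rewards, beta):
    """Return pi*(y|x) proportional to pi_ref(y|x) * exp(r(x, y) / beta) over a finite response set."""
    unnormalized = [p * math.exp(r / beta) for p, r in zip(ref_probs, rewards)]
    z = sum(unnormalized)  # partition function Z(x)
    return [u / z for u in unnormalized]

# Hypothetical toy values: three candidate responses for one prompt
ref_probs = [0.5, 0.3, 0.2]
rewards = [0.0, 1.0, 2.0]

for beta in (10.0, 1.0, 0.1):
    probs = optimal_policy(ref_probs, rewards, beta)
    print(f"beta={beta:5.1f}: " + ", ".join(f"{p:.3f}" for p in probs))
```

A large β keeps π* close to π_ref, while a small β concentrates probability on the highest-reward response.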

### Key Insight: Reparameterization

We can solve for the reward function:

**r*(x,y) = β × log(π*(y|x) / π_ref(y|x)) + β × log Z(x)**
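
The step that makes this reparameterization useful is worth writing out. Assuming preferences follow the Bradley-Terry model, the probability that a preferred response y_w beats a dispreferred response y_l depends only on the difference of their rewards, so the intractable β log Z(x) term cancels:

```latex
\begin{aligned}
p^*(y_w \succ y_l \mid x)
  &= \sigma\big(r^*(x, y_w) - r^*(x, y_l)\big) \\
  &= \sigma\Big(\beta \log \frac{\pi^*(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)} + \beta \log Z(x)
       - \beta \log \frac{\pi^*(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)} - \beta \log Z(x)\Big) \\
  &= \sigma\Big(\beta \log \frac{\pi^*(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)}
       - \beta \log \frac{\pi^*(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}\Big)
\end{aligned}
```

Replacing π* with the trainable policy π_θ and taking the negative log-likelihood of the observed preferences gives a loss that involves only policy and reference log-probabilities.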

### DPO Loss Function

For preference pairs (y_w, y_l) where y_w ≻ y_l (y_w is preferred over y_l), the per-example loss is:

**L_DPO = -log σ(β × log(π_θ(y_w|x) / π_ref(y_w|x)) - β × log(π_θ(y_l|x) / π_ref(y_l|x)))**

Where σ is the sigmoid function.
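
As a worked example, the sketch below evaluates the loss for one preference pair and checks the gradient with respect to the chosen log-probability against a finite difference. The log-probability values are hypothetical, and the helper names are chosen for this illustration only.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def dpo_loss(pi_w, pi_l, ref_w, ref_l, beta):
    """Per-pair DPO loss; all four inputs are sequence log-probabilities."""
    logit = beta * ((pi_w - ref_w) - (pi_l - ref_l))
    return -math.log(sigmoid(logit))

# Hypothetical log-probabilities for one preference pair
pi_w, pi_l = -12.0, -15.0    # policy: chosen, rejected
ref_w, ref_l = -13.0, -13.5  # reference: chosen, rejected
beta = 0.1

logit = beta * ((pi_w - ref_w) - (pi_l - ref_l))
loss = dpo_loss(pi_w, pi_l, ref_w, ref_l, beta)

# Analytic gradient with respect to the chosen log-prob: -beta * sigmoid(-logit)
analytic_grad = -beta * sigmoid(-logit)

# Central finite difference as a sanity check
eps = 1e-6
numeric_grad = (
    dpo_loss(pi_w + eps, pi_l, ref_w, ref_l, beta)
    - dpo_loss(pi_w - eps, pi_l, ref_w, ref_l, beta)
) / (2 * eps)

print(f"logit={logit:.3f}  loss={loss:.4f}")
print(f"analytic grad={analytic_grad:.6f}  numeric grad={numeric_grad:.6f}")
```

The factor σ(-logit) in the gradient means that pairs the policy already ranks correctly by a wide margin contribute small updates, while misranked pairs receive a weight close to β.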

In [None]:
class DPOLoss(nn.Module):
    """
    Direct Preference Optimization Loss.
    
    Implements the DPO loss function that directly optimizes preferences
    without requiring a separate reward model.
    """
    
    def __init__(self, beta: float = 0.1, label_smoothing: float = 0.0):
        super().__init__()
        self.beta = beta
        self.label_smoothing = label_smoothing
        
    def forward(
        self,
        policy_chosen_logps: torch.Tensor,
        policy_rejected_logps: torch.Tensor, 
        reference_chosen_logps: torch.Tensor,
        reference_rejected_logps: torch.Tensor
    ) -> Tuple[torch.Tensor, torch.Tensor, torch.Tensor]:
        """
        Compute DPO loss.
        
        Args:
            policy_chosen_logps: Log probabilities of chosen responses under policy
            policy_rejected_logps: Log probabilities of rejected responses under policy
            reference_chosen_logps: Log probabilities of chosen responses under reference
            reference_rejected_logps: Log probabilities of rejected responses under reference
        
        Returns:
            loss: DPO loss
            chosen_rewards: Implicit rewards for chosen responses
            rejected_rewards: Implicit rewards for rejected responses
        """
        
        # Compute implicit rewards
        policy_chosen_rewards = self.beta * (policy_chosen_logps - reference_chosen_logps)
        policy_rejected_rewards = self.beta * (policy_rejected_logps - reference_rejected_logps)
        
        # DPO loss: -log sigmoid(beta * (log(pi_chosen/pi_ref_chosen) - log(pi_rejected/pi_ref_rejected)))
        logits = policy_chosen_rewards - policy_rejected_rewards
        
        if self.label_smoothing == 0.0:
            loss = -F.logsigmoid(logits)
        else:
            # Label smoothing: treat each preference label as flipped with probability label_smoothing
            loss = -F.logsigmoid(logits) * (1 - self.label_smoothing) - F.logsigmoid(-logits) * self.label_smoothing
        
        return loss.mean(), policy_chosen_rewards.mean(), policy_rejected_rewards.mean()
    
    def get_implicit_reward(self, policy_logps: torch.Tensor, reference_logps: torch.Tensor) -> torch.Tensor:
        """
        Compute implicit reward from log probabilities.
        """
        return self.beta * (policy_logps - reference_logps)


def compute_log_probabilities(model, input_ids, attention_mask, labels):
    """
    Compute log probabilities for a sequence under a model.
    
    This is a simplified version: `model`, `input_ids`, and `attention_mask`
    are unused here, and random logits stand in for the model's forward pass.
    """
    # Simplified simulation of log probability computation
    batch_size, seq_len = labels.shape
    
    # Simulate model logits (in practice, this comes from model.forward())
    vocab_size = 32000
    logits = torch.randn(batch_size, seq_len, vocab_size)
    
    # Compute log probabilities
    log_probs = F.log_softmax(logits, dim=-1)
    
    # Build the mask first: the ignore index (-100) is not a valid index for gather
    mask = labels != -100  # Assuming -100 is the ignore index
    safe_labels = labels.masked_fill(~mask, 0)
    
    # Gather log probabilities for the actual labels
    gathered_log_probs = log_probs.gather(dim=-1, index=safe_labels.unsqueeze(-1)).squeeze(-1)
    
    # Zero out ignored positions and sum over the sequence
    sequence_log_prob = (gathered_log_probs * mask.float()).sum(dim=-1)
    
    return sequence_log_prob


# Test DPO loss computation
def test_dpo_loss():
    """
    Test the DPO loss computation with synthetic data.
    """
    print("🧪 Testing DPO Loss Computation")
    print("=" * 40)
    
    # Create synthetic preference data
    batch_size = 8
    
    # Simulate log probabilities
    # Policy model gives higher probability to chosen responses
    policy_chosen_logps = torch.randn(batch_size) + 1.0  # Higher values
    policy_rejected_logps = torch.randn(batch_size) - 1.0  # Lower values
    
    # Reference model is neutral
    reference_chosen_logps = torch.randn(batch_size)
    reference_rejected_logps = torch.randn(batch_size)
    
    # Test different beta values
    betas = [0.01, 0.1, 0.5, 1.0]
    
    results = []
    
    for beta in betas:
        dpo_loss_fn = DPOLoss(beta=beta)
        
        loss, chosen_rewards, rejected_rewards = dpo_loss_fn(
            policy_chosen_logps,
            policy_rejected_logps,
            reference_chosen_logps,
            reference_rejected_logps
        )
        
        reward_margin = chosen_rewards - rejected_rewards
        
        results.append({
            'beta': beta,
            'loss': loss.item(),
            'chosen_reward': chosen_rewards.item(),
            'rejected_reward': rejected_rewards.item(),
            'reward_margin': reward_margin.item()
        })
        
        print(f"Beta {beta:4.2f}: Loss={loss:.4f}, Margin={reward_margin:.4f}")
    
    return results

# Run the test
dpo_test_results = test_dpo_loss()

## 🏗️ DPO Training Implementation

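Let's implement the structure of a DPO training loop. Tokenization and the model forward passes are simulated, so the numbers it prints are not meaningful training results.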
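
The trainer below simulates its log-probabilities. For reference, here is a dependency-free sketch of the quantity a real implementation would compute: the sum of per-token log-probabilities over response tokens only, with prompt and padding positions masked out. The `IGNORE_INDEX = -100` convention and the helper names are assumptions made for this illustration.

```python
import math

IGNORE_INDEX = -100  # assumed marker for prompt/padding positions

def log_softmax(logits):
    """Numerically stable log-softmax over one position's vocabulary logits."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(v - m) for v in logits))
    return [v - log_z for v in logits]

def sequence_logprob(token_logits, labels):
    """Sum log p(label_t) over positions whose label is not IGNORE_INDEX.

    token_logits[t] holds the logits used to predict labels[t]; with a causal
    language model this assumes the logits were already shifted by one position.
    """
    total = 0.0
    for logits, label in zip(token_logits, labels):
        if label == IGNORE_INDEX:
            continue  # skip prompt and padding tokens
        total += log_softmax(logits)[label]
    return total

# Toy example: vocabulary of 3 tokens, first position belongs to the prompt
token_logits = [[2.0, 0.0, 0.0], [0.0, 0.0, 0.0], [0.0, 3.0, 0.0]]
labels = [IGNORE_INDEX, 2, 1]
print(f"response log-prob: {sequence_logprob(token_logits, labels):.4f}")
```

Calling this once per model (policy and reference) and per response (chosen and rejected) yields the four values that the DPO loss consumes.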

In [None]:
class PreferenceDataset(Dataset):
    """
    Dataset for preference pairs.
    
    Each sample contains:
    - prompt: Input text
    - chosen: Preferred response
    - rejected: Less preferred response
    """
    
    def __init__(self, preferences: List[Dict]):
        self.preferences = preferences
    
    def __len__(self):
        return len(self.preferences)
    
    def __getitem__(self, idx):
        return self.preferences[idx]


class DPOTrainer:
    """
    DPO Trainer for preference optimization.
    """
    
    def __init__(
        self,
        policy_model,
        reference_model,
        tokenizer,
        beta: float = 0.1,
        max_length: int = 512
    ):
        self.policy_model = policy_model
        self.reference_model = reference_model
        self.tokenizer = tokenizer
        self.beta = beta
        self.max_length = max_length
        
        # Freeze reference model
        for param in self.reference_model.parameters():
            param.requires_grad = False
        
        self.dpo_loss = DPOLoss(beta=beta)
    
    def compute_loss(self, batch):
        """
        Compute DPO loss for a batch of preference pairs.
        """
        # Extract prompts and responses
        prompts = batch['prompt']
        chosen_responses = batch['chosen']
        rejected_responses = batch['rejected']
        
        # Simulate tokenization and model forward pass
        batch_size = len(prompts)
        
        # In practice, you would:
        # 1. Tokenize prompts + responses
        # 2. Run forward pass through both models
        # 3. Compute log probabilities for the response tokens
        
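        # Simulate log probabilities from random 10-dim features (matches MockModel).
        # Policy values pass through policy_model so loss.backward() has a gradient path.
        chosen_feats = torch.randn(batch_size, 10)
        rejected_feats = torch.randn(batch_size, 10)
        policy_chosen_logps = self.policy_model(chosen_feats).mean(dim=-1)
        policy_rejected_logps = self.policy_model(rejected_feats).mean(dim=-1)
        with torch.no_grad():
            reference_chosen_logps = self.reference_model(chosen_feats).mean(dim=-1)
            reference_rejected_logps = self.reference_model(rejected_feats).mean(dim=-1)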
        
        # Compute DPO loss
        loss, chosen_rewards, rejected_rewards = self.dpo_loss(
            policy_chosen_logps,
            policy_rejected_logps,
            reference_chosen_logps,
            reference_rejected_logps
        )
        
        return {
            'loss': loss,
            'chosen_rewards': chosen_rewards,
            'rejected_rewards': rejected_rewards,
            'reward_margin': chosen_rewards - rejected_rewards,
            'policy_chosen_logps': policy_chosen_logps.mean(),
            'policy_rejected_logps': policy_rejected_logps.mean()
        }
    
    def train_step(self, batch, optimizer):
        """
        Single training step.
        """
        optimizer.zero_grad()
        
        loss_dict = self.compute_loss(batch)
        loss = loss_dict['loss']
        
        loss.backward()
        optimizer.step()
        
        return loss_dict


# Simulate DPO training
def simulate_dpo_training():
    """
    Simulate a DPO training process with synthetic data.
    """
    print("🚀 Simulating DPO Training")
    print("=" * 30)
    
    # Create synthetic preference data
    preferences = []
    for i in range(100):
        preferences.append({
            'prompt': f"Question {i}: What is the capital of France?",
            'chosen': "The capital of France is Paris.",
            'rejected': "I think it might be Lyon or something."
        })
    
    dataset = PreferenceDataset(preferences)
    dataloader = DataLoader(dataset, batch_size=8, shuffle=True)
    
    # Simulate models (in practice, these would be actual transformer models)
    class MockModel(nn.Module):
        def __init__(self):
            super().__init__()
            self.linear = nn.Linear(10, 10)
        
        def forward(self, x):
            return self.linear(x)
    
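    policy_model = MockModel()
    reference_model = MockModel()
    # In DPO the policy is initialized from the reference (SFT) weights
    reference_model.load_state_dict(policy_model.state_dict())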
    
    # Initialize trainer
    trainer = DPOTrainer(
        policy_model=policy_model,
        reference_model=reference_model,
        tokenizer=None,  # Would be actual tokenizer
        beta=0.1
    )
    
    optimizer = torch.optim.Adam(policy_model.parameters(), lr=1e-5)
    
    # Training loop
    training_stats = []
    num_epochs = 5
    
    for epoch in range(num_epochs):
        epoch_losses = []
        epoch_margins = []
        
        for batch in dataloader:
            loss_dict = trainer.train_step(batch, optimizer)
            
            epoch_losses.append(loss_dict['loss'].item())
            epoch_margins.append(loss_dict['reward_margin'].item())
        
        avg_loss = np.mean(epoch_losses)
        avg_margin = np.mean(epoch_margins)
        
        training_stats.append({
            'epoch': epoch,
            'loss': avg_loss,
            'reward_margin': avg_margin
        })
        
        print(f"Epoch {epoch+1}: Loss={avg_loss:.4f}, Margin={avg_margin:.4f}")
    
    return training_stats

# Run simulation
training_stats = simulate_dpo_training()

## 📊 DPO vs RLHF Comparison

In [None]:
def compare_dpo_vs_rlhf():
    """
    Compare DPO and RLHF approaches.
    """
    
    # Comparison metrics
    metrics = {
        'Training Steps': {'RLHF': 3, 'DPO': 1},  # RLHF: SFT + reward model + RL; DPO: one preference stage (on top of an SFT model)
        'Models Required': {'RLHF': 4, 'DPO': 2},  # Policy, Reference, Reward, Critic vs Policy, Reference
        'Hyperparameters': {'RLHF': 15, 'DPO': 3},  # Approximate complexity
        'Training Stability': {'RLHF': 6, 'DPO': 9},  # Subjective score 1-10
        'Implementation Complexity': {'RLHF': 8, 'DPO': 3},  # Subjective score 1-10
        'Memory Usage': {'RLHF': 8, 'DPO': 5},  # Relative scale
        'Training Time': {'RLHF': 10, 'DPO': 4},  # Relative scale
    }
    
    # Theoretical analysis of beta parameter
    def analyze_beta_parameter():
        betas = np.logspace(-3, 0, 20)  # 0.001 to 1.0
        
        # Simulate how different betas affect the reward margin
        logp_diff = 2.0  # Policy gives 2.0 higher log prob to chosen vs rejected
        reward_margins = betas * logp_diff
        
        # Simulate preference probability (how often model prefers chosen)
        preference_probs = torch.sigmoid(torch.tensor(reward_margins)).numpy()
        
        return betas, reward_margins, preference_probs
    
    betas, margins, prefs = analyze_beta_parameter()
    
    # Create comprehensive visualization
    fig = plt.figure(figsize=(20, 12))
    
    # 1. Training pipeline comparison
    ax1 = plt.subplot(2, 4, 1)
    methods = list(metrics.keys())
    rlhf_values = [metrics[m]['RLHF'] for m in methods]
    dpo_values = [metrics[m]['DPO'] for m in methods]
    
    x_pos = np.arange(len(methods))
    width = 0.35
    
    bars1 = ax1.bar(x_pos - width/2, rlhf_values, width, label='RLHF', alpha=0.8, color='red')
    bars2 = ax1.bar(x_pos + width/2, dpo_values, width, label='DPO', alpha=0.8, color='blue')
    
    ax1.set_xlabel('Metrics')
    ax1.set_ylabel('Complexity/Resource Score')
    ax1.set_title('RLHF vs DPO Comparison')
    ax1.set_xticks(x_pos)
    ax1.set_xticklabels(methods, rotation=45, ha='right')
    ax1.legend()
    ax1.grid(True, alpha=0.3, axis='y')
    
    # 2. Beta parameter analysis
    ax2 = plt.subplot(2, 4, 2)
    ax2.semilogx(betas, margins, 'o-', linewidth=2, markersize=6)
    ax2.set_xlabel('Beta Parameter')
    ax2.set_ylabel('Reward Margin')
    ax2.set_title('Beta vs Reward Margin')
    ax2.grid(True, alpha=0.3)
    
    # 3. Preference probability
    ax3 = plt.subplot(2, 4, 3)
    ax3.semilogx(betas, prefs, 'o-', color='green', linewidth=2, markersize=6)
    ax3.axhline(y=0.5, color='red', linestyle='--', alpha=0.7, label='Random')
    ax3.axhline(y=0.9, color='orange', linestyle='--', alpha=0.7, label='Strong Preference')
    ax3.set_xlabel('Beta Parameter')
    ax3.set_ylabel('Preference Probability')
    ax3.set_title('Beta vs Preference Strength')
    ax3.legend()
    ax3.grid(True, alpha=0.3)
    ax3.set_ylim(0.5, 1.0)
    
    # 4. Training dynamics simulation
    ax4 = plt.subplot(2, 4, 4)
    epochs = np.arange(1, 21)
    
    # Simulate RLHF vs DPO training curves
    rlhf_curve = 1.0 - 0.8 * np.exp(-epochs/8) + 0.1 * np.sin(epochs) * np.exp(-epochs/10)
    dpo_curve = 1.0 - 0.9 * np.exp(-epochs/5)
    
    ax4.plot(epochs, rlhf_curve, 'o-', label='RLHF', linewidth=2, alpha=0.8)
    ax4.plot(epochs, dpo_curve, 's-', label='DPO', linewidth=2, alpha=0.8)
    ax4.set_xlabel('Training Epochs')
    ax4.set_ylabel('Alignment Score')
    ax4.set_title('Training Dynamics')
    ax4.legend()
    ax4.grid(True, alpha=0.3)
    
    # 5. Loss landscape visualization
    ax5 = plt.subplot(2, 4, 5)
    
    # Loss as a function of the DPO logit (beta times the log-ratio difference)
    logp_ratios = np.linspace(-3, 3, 100)
    dpo_losses = np.logaddexp(0.0, -logp_ratios)  # equals -log(sigmoid(x)), computed stably
    
    ax5.plot(logp_ratios, dpo_losses, linewidth=3)
    ax5.set_xlabel('Implicit reward margin (logit)')
    ax5.set_ylabel('DPO Loss')
    ax5.set_title('DPO Loss Landscape')
    ax5.grid(True, alpha=0.3)
    ax5.axvline(x=0, color='red', linestyle='--', alpha=0.7, label='Equal Preference')
    ax5.legend()
    
    # 6. Computational complexity
    ax6 = plt.subplot(2, 4, 6)
    
    stages = ['SFT', 'Reward\nModel', 'RL\nTraining', 'Total']
    rlhf_times = [1, 1, 3, 5]  # Illustrative relative time units, not measurements
    dpo_times = [1, 0, 1, 2]   # DPO skips the reward model; its 'RL Training' slot is the supervised preference stage
    
    x_pos = np.arange(len(stages))
    
    bars1 = ax6.bar(x_pos - width/2, rlhf_times, width, label='RLHF', alpha=0.8, color='red')
    bars2 = ax6.bar(x_pos + width/2, dpo_times, width, label='DPO', alpha=0.8, color='blue')
    
    ax6.set_xlabel('Training Stage')
    ax6.set_ylabel('Relative Time')
    ax6.set_title('Training Time Breakdown')
    ax6.set_xticks(x_pos)
    ax6.set_xticklabels(stages)
    ax6.legend()
    ax6.grid(True, alpha=0.3, axis='y')
    
    # 7. Performance comparison
    ax7 = plt.subplot(2, 4, 7)
    
    tasks = ['Helpfulness', 'Harmlessness', 'Honesty', 'Overall']
    # Illustrative placeholder scores, not measured benchmark results
    rlhf_scores = [85, 92, 78, 85]
    dpo_scores = [87, 89, 82, 86]
    
    x_pos = np.arange(len(tasks))
    
    bars1 = ax7.bar(x_pos - width/2, rlhf_scores, width, label='RLHF', alpha=0.8, color='red')
    bars2 = ax7.bar(x_pos + width/2, dpo_scores, width, label='DPO', alpha=0.8, color='blue')
    
    ax7.set_xlabel('Evaluation Metric')
    ax7.set_ylabel('Score')
    ax7.set_title('Performance Comparison (illustrative)')
    ax7.set_xticks(x_pos)
    ax7.set_xticklabels(tasks)
    ax7.legend()
    ax7.grid(True, alpha=0.3, axis='y')
    ax7.set_ylim(70, 95)
    
    # 8. Memory usage over time
    ax8 = plt.subplot(2, 4, 8)
    
    time_steps = np.arange(1, 11)
    # Illustrative relative values, not measurements
    rlhf_memory = [2, 4, 6, 8, 8, 8, 8, 8, 8, 8]  # Ramps up, stays high
    dpo_memory = [2, 3, 3, 3, 3, 3, 3, 3, 3, 3]   # Stays roughly constant
    
    ax8.plot(time_steps, rlhf_memory, 'o-', label='RLHF', linewidth=2, markersize=6)
    ax8.plot(time_steps, dpo_memory, 's-', label='DPO', linewidth=2, markersize=6)
    ax8.set_xlabel('Training Progress')
    ax8.set_ylabel('Memory Usage (relative)')
    ax8.set_title('Memory Usage During Training')
    ax8.legend()
    ax8.grid(True, alpha=0.3)
    
    plt.tight_layout()
    plt.show()
    
    return metrics, betas, margins, prefs

# Run comparison
comparison_results = compare_dpo_vs_rlhf()

# Print summary
print("\n📊 DPO vs RLHF Summary:")
print("=" * 50)
print("DPO Advantages:")
print("  ✅ Simpler training pipeline (no separate reward-model or RL stage)")
print("  ✅ No reward model needed")
print("  ✅ More stable training")
print("  ✅ Lower memory requirements")
print("  ✅ Faster training time")
print("  ✅ Easier to implement")
print("\nRLHF Advantages:")
print("  ✅ More interpretable rewards")
print("  ✅ Can handle complex preferences")
print("  ✅ Established track record")
print("\nWhen to use DPO:")
print("  • Simple preference tasks")
print("  • Limited computational resources")
print("  • Quick iteration cycles")
print("  • Stable training requirements")

## 🔧 Advanced DPO Variants

Several improvements and variants of DPO have been developed. The two variants below are simplified sketches that add a regularizer to the standard loss.

In [None]:
class RobustDPOLoss(nn.Module):
    """
    Regularized DPO sketch aimed at robustness to distribution shift.
    
    Adds an L2 penalty on the implicit rewards to standard DPO. This simplified
    form is for illustration and may differ from published robust DPO formulations.
    """
    
    def __init__(self, beta: float = 0.1, eta: float = 0.1):
        super().__init__()
        self.beta = beta
        self.eta = eta  # Robustness parameter
    
    def forward(
        self,
        policy_chosen_logps: torch.Tensor,
        policy_rejected_logps: torch.Tensor,
        reference_chosen_logps: torch.Tensor,
        reference_rejected_logps: torch.Tensor
    ) -> Tuple[torch.Tensor, torch.Tensor]:
        """
        Compute Robust DPO loss.
        """
        # Standard DPO terms
        chosen_rewards = self.beta * (policy_chosen_logps - reference_chosen_logps)
        rejected_rewards = self.beta * (policy_rejected_logps - reference_rejected_logps)
        
        # Preference logits (same as standard DPO)
        logits = chosen_rewards - rejected_rewards
        
        # L2 penalty on the implicit rewards discourages large deviations from the reference
        robust_penalty = self.eta * (chosen_rewards.pow(2) + rejected_rewards.pow(2))
        
        # Combined loss
        dpo_loss = -F.logsigmoid(logits)
        total_loss = dpo_loss + robust_penalty
        
        return total_loss.mean(), logits.mean()


class KLDPOLoss(nn.Module):
    """
    KL-regularized DPO that maintains closer alignment with reference model.
    """
    
    def __init__(self, beta: float = 0.1, kl_weight: float = 0.01):
        super().__init__()
        self.beta = beta
        self.kl_weight = kl_weight
    
    def forward(
        self,
        policy_chosen_logps: torch.Tensor,
        policy_rejected_logps: torch.Tensor,
        reference_chosen_logps: torch.Tensor,
        reference_rejected_logps: torch.Tensor
    ) -> Tuple[torch.Tensor, torch.Tensor, torch.Tensor]:
        """
        Compute KL-regularized DPO loss.
        """
        # Standard DPO computation
        chosen_rewards = self.beta * (policy_chosen_logps - reference_chosen_logps)
        rejected_rewards = self.beta * (policy_rejected_logps - reference_rejected_logps)
        
        logits = chosen_rewards - rejected_rewards
        dpo_loss = -F.logsigmoid(logits)
        
        # Squared log-ratio penalty: a simple proxy for divergence from the reference, not an exact KL estimate
        kl_chosen = policy_chosen_logps - reference_chosen_logps
        kl_rejected = policy_rejected_logps - reference_rejected_logps
        kl_penalty = self.kl_weight * (kl_chosen.pow(2) + kl_rejected.pow(2))
        
        total_loss = dpo_loss + kl_penalty
        
        return total_loss.mean(), dpo_loss.mean(), kl_penalty.mean()


def compare_dpo_variants():
    """
    Compare different DPO variants.
    """
    print("🔬 Comparing DPO Variants")
    print("=" * 30)
    
    # Create test data
    batch_size = 16
    
    # Simulate preference data with some noise
    policy_chosen_logps = torch.randn(batch_size) + 0.5
    policy_rejected_logps = torch.randn(batch_size) - 0.5
    reference_chosen_logps = torch.randn(batch_size)
    reference_rejected_logps = torch.randn(batch_size)
    
    # Initialize different loss functions
    standard_dpo = DPOLoss(beta=0.1)
    robust_dpo = RobustDPOLoss(beta=0.1, eta=0.05)
    kl_dpo = KLDPOLoss(beta=0.1, kl_weight=0.01)
    
    # Compute losses
    standard_loss, standard_chosen, standard_rejected = standard_dpo(
        policy_chosen_logps, policy_rejected_logps,
        reference_chosen_logps, reference_rejected_logps
    )
    
    robust_loss, robust_logits = robust_dpo(
        policy_chosen_logps, policy_rejected_logps,
        reference_chosen_logps, reference_rejected_logps
    )
    
    kl_total_loss, kl_dpo_loss, kl_penalty = kl_dpo(
        policy_chosen_logps, policy_rejected_logps,
        reference_chosen_logps, reference_rejected_logps
    )
    
    # Compare results
    results = {
        'Standard DPO': {
            'loss': standard_loss.item(),
            'reward_margin': (standard_chosen - standard_rejected).item(),
            'properties': 'Simple, stable, widely used'
        },
        'Robust DPO': {
            'loss': robust_loss.item(),
            'reward_margin': robust_logits.item(),
            'properties': 'Handles distribution shift, more robust'
        },
        'KL-DPO': {
            'loss': kl_total_loss.item(),
            'dpo_component': kl_dpo_loss.item(),
            'kl_component': kl_penalty.item(),
            'properties': 'Maintains reference alignment, conservative'
        }
    }
    
    print("\nVariant Comparison:")
    print("=" * 60)
    for variant, metrics in results.items():
        print(f"\n{variant}:")
        for key, value in metrics.items():
            if isinstance(value, float):
                print(f"  {key}: {value:.4f}")
            else:
                print(f"  {key}: {value}")
    
    return results

# Compare variants
variant_results = compare_dpo_variants()

# Visualize beta sensitivity
def analyze_beta_sensitivity():
    """
    Analyze how different DPO variants respond to beta changes.
    """
    betas = np.logspace(-2, 0, 15)  # 0.01 to 1.0
    
    standard_losses = []
    robust_losses = []
    kl_losses = []
    
    # Fixed test data
    policy_chosen = torch.tensor([1.0])
    policy_rejected = torch.tensor([-1.0])
    ref_chosen = torch.tensor([0.0])
    ref_rejected = torch.tensor([0.0])
    
    for beta in betas:
        beta = float(beta)  # plain Python float avoids NumPy-scalar/tensor dtype surprises
        # Standard DPO
        standard_dpo = DPOLoss(beta=beta)
        loss_std, _, _ = standard_dpo(policy_chosen, policy_rejected, ref_chosen, ref_rejected)
        standard_losses.append(loss_std.item())
        
        # Robust DPO
        robust_dpo = RobustDPOLoss(beta=beta, eta=0.05)
        loss_rob, _ = robust_dpo(policy_chosen, policy_rejected, ref_chosen, ref_rejected)
        robust_losses.append(loss_rob.item())
        
        # KL-DPO
        kl_dpo = KLDPOLoss(beta=beta, kl_weight=0.01)
        loss_kl, _, _ = kl_dpo(policy_chosen, policy_rejected, ref_chosen, ref_rejected)
        kl_losses.append(loss_kl.item())
    
    # Plot sensitivity
    plt.figure(figsize=(10, 6))
    plt.semilogx(betas, standard_losses, 'o-', label='Standard DPO', linewidth=2)
    plt.semilogx(betas, robust_losses, 's-', label='Robust DPO', linewidth=2)
    plt.semilogx(betas, kl_losses, '^-', label='KL-DPO', linewidth=2)
    
    plt.xlabel('Beta Parameter')
    plt.ylabel('Loss Value')
    plt.title('DPO Variant Sensitivity to Beta')
    plt.legend()
    plt.grid(True, alpha=0.3)
    plt.show()
    
    return betas, standard_losses, robust_losses, kl_losses

# Analyze sensitivity
sensitivity_results = analyze_beta_sensitivity()

## 🎯 Practical Exercises

### Exercise 1: Implement Your Own DPO Loss
Create a custom DPO loss with additional features.

In [None]:
def exercise_custom_dpo_loss():
    """
    Exercise: Implement a custom DPO loss function.
    
    Your task: Implement a DPO loss that includes:
    1. Standard DPO loss
    2. Length normalization
    3. Confidence weighting
    4. Temperature scaling
    """
    print("🧪 Exercise: Custom DPO Loss Implementation")
    print("=" * 50)
    
    class CustomDPOLoss(nn.Module):
        """
        Custom DPO loss with additional features.
        
        A reference solution is filled in below; try writing your own version first.
        """
        
        def __init__(
            self, 
            beta: float = 0.1,
            temperature: float = 1.0,
            length_normalization: bool = True,
            confidence_weighting: bool = True
        ):
            super().__init__()
            self.beta = beta
            self.temperature = temperature
            self.length_normalization = length_normalization
            self.confidence_weighting = confidence_weighting
        
        def forward(
            self,
            policy_chosen_logps: torch.Tensor,
            policy_rejected_logps: torch.Tensor,
            reference_chosen_logps: torch.Tensor,
            reference_rejected_logps: torch.Tensor,
            chosen_lengths: torch.Tensor = None,
            rejected_lengths: torch.Tensor = None,
            confidence_scores: torch.Tensor = None
        ) -> Tuple[torch.Tensor, Dict[str, torch.Tensor]]:
            """
            Compute custom DPO loss.
            
            Features implemented in this reference solution:
            
            1. Length normalization: Divide log probabilities by sequence length
            2. Confidence weighting: Weight loss by confidence scores
            3. Temperature scaling: Divide logits by temperature
            4. Return detailed metrics
            """
            
            # 1. Length normalization
            if self.length_normalization and chosen_lengths is not None and rejected_lengths is not None:
                policy_chosen_logps = policy_chosen_logps / chosen_lengths
                policy_rejected_logps = policy_rejected_logps / rejected_lengths
                reference_chosen_logps = reference_chosen_logps / chosen_lengths
                reference_rejected_logps = reference_rejected_logps / rejected_lengths
            
            # 2. Compute rewards
            chosen_rewards = self.beta * (policy_chosen_logps - reference_chosen_logps)
            rejected_rewards = self.beta * (policy_rejected_logps - reference_rejected_logps)
            
            # 3. Temperature scaling
            logits = (chosen_rewards - rejected_rewards) / self.temperature
            
            # 4. Base DPO loss
            dpo_loss = -F.logsigmoid(logits)
            
            # 5. Confidence weighting
            if self.confidence_weighting and confidence_scores is not None:
                dpo_loss = dpo_loss * confidence_scores
            
            # Compute metrics
            metrics = {
                'loss': dpo_loss.mean(),
                'chosen_rewards': chosen_rewards.mean(),
                'rejected_rewards': rejected_rewards.mean(),
                'reward_margin': (chosen_rewards - rejected_rewards).mean(),
                'accuracy': (logits > 0).float().mean()
            }
            
            return dpo_loss.mean(), metrics
    
    # Test the custom loss
    batch_size = 8
    
    # Create test data
    policy_chosen_logps = torch.randn(batch_size) + 1.0
    policy_rejected_logps = torch.randn(batch_size) - 1.0
    reference_chosen_logps = torch.randn(batch_size)
    reference_rejected_logps = torch.randn(batch_size)
    
    # Additional features
    chosen_lengths = torch.randint(10, 50, (batch_size,)).float()
    rejected_lengths = torch.randint(10, 50, (batch_size,)).float()
    confidence_scores = torch.rand(batch_size)  # Random confidence [0, 1]
    
    # Test different configurations
    configs = [
        {'name': 'Standard', 'length_normalization': False, 'confidence_weighting': False},
        {'name': 'Length Normalized', 'length_normalization': True, 'confidence_weighting': False},
        {'name': 'Confidence Weighted', 'length_normalization': False, 'confidence_weighting': True},
        {'name': 'Full Custom', 'length_normalization': True, 'confidence_weighting': True}
    ]
    
    results = []
    
    for config in configs:
        custom_loss = CustomDPOLoss(
            beta=0.1,
            temperature=1.0,
            length_normalization=config['length_normalization'],
            confidence_weighting=config['confidence_weighting']
        )
        
        loss, metrics = custom_loss(
            policy_chosen_logps, policy_rejected_logps,
            reference_chosen_logps, reference_rejected_logps,
            chosen_lengths, rejected_lengths, confidence_scores
        )
        
        result = {'config': config['name'], **{k: v.item() for k, v in metrics.items()}}
        results.append(result)
        
        print(f"\n{config['name']} Configuration:")
        print(f"  Loss: {loss.item():.4f}")
        print(f"  Reward Margin: {metrics['reward_margin'].item():.4f}")
        print(f"  Accuracy: {metrics['accuracy'].item():.4f}")
    
    return results

# Run the exercise
exercise_results = exercise_custom_dpo_loss()

# Visualize results
if exercise_results:
    configs = [r['config'] for r in exercise_results]
    losses = [r['loss'] for r in exercise_results]
    margins = [r['reward_margin'] for r in exercise_results]
    accuracies = [r['accuracy'] for r in exercise_results]
    
    fig, (ax1, ax2, ax3) = plt.subplots(1, 3, figsize=(15, 5))
    
    # Loss comparison
    ax1.bar(configs, losses, alpha=0.7, color='skyblue')
    ax1.set_ylabel('Loss')
    ax1.set_title('Loss by Configuration')
    ax1.tick_params(axis='x', rotation=45)
    
    # Reward margin comparison
    ax2.bar(configs, margins, alpha=0.7, color='lightgreen')
    ax2.set_ylabel('Reward Margin')
    ax2.set_title('Reward Margin by Configuration')
    ax2.tick_params(axis='x', rotation=45)
    
    # Accuracy comparison
    ax3.bar(configs, accuracies, alpha=0.7, color='orange')
    ax3.set_ylabel('Accuracy')
    ax3.set_title('Accuracy by Configuration')
    ax3.tick_params(axis='x', rotation=45)
    
    plt.tight_layout()
    plt.show()

## 💡 Key Takeaways

### DPO Advantages:
1. **Simplicity**: A single preference-training stage after SFT, versus separate reward-model and RL stages in RLHF
2. **Stability**: More stable than RL-based approaches
3. **Efficiency**: Faster training and lower memory usage
4. **Performance**: Often reported to match RLHF results, and sometimes to exceed them
5. **Implementation**: Easier to implement and debug

### Core Insights:
1. **Reward Reparameterization**: Clever mathematical insight eliminates reward model
2. **Classification Framing**: Treats preferences as binary classification
3. **Direct Optimization**: Optimizes policy directly on preference data
4. **Theoretical Foundation**: Grounded in the Bradley-Terry preference model and the KL-constrained reward maximization objective

### Best Practices:
1. **Beta Selection**: Start with β = 0.1, adjust based on preference strength
2. **Data Quality**: High-quality preference data is crucial
3. **Reference Model**: Use well-trained SFT model as reference
4. **Evaluation**: Monitor both loss and implicit reward margins
5. **Variants**: Consider robust variants for challenging scenarios
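
The evaluation advice above can be sketched as a small monitoring helper that reports the mean implicit-reward margin and the fraction of pairs ranked correctly. The function name `preference_metrics` and the batch values are hypothetical, chosen only to illustrate the computation.

```python
def preference_metrics(policy_w, policy_l, ref_w, ref_l, beta=0.1):
    """Return (mean implicit-reward margin, reward accuracy) for a batch.

    Inputs are lists of sequence log-probabilities (w = chosen, l = rejected).
    """
    margins = [
        beta * ((pw - rw) - (pl - rl))
        for pw, pl, rw, rl in zip(policy_w, policy_l, ref_w, ref_l)
    ]
    mean_margin = sum(margins) / len(margins)
    # A pair counts as correct when the chosen response has the higher implicit reward
    accuracy = sum(m > 0 for m in margins) / len(margins)
    return mean_margin, accuracy

# Hypothetical batch of three preference pairs
policy_w = [-10.0, -12.0, -9.0]
policy_l = [-11.0, -11.5, -14.0]
ref_w = [-10.5, -11.0, -10.0]
ref_l = [-10.5, -12.0, -12.0]

margin, acc = preference_metrics(policy_w, policy_l, ref_w, ref_l, beta=0.1)
print(f"mean margin={margin:.4f}  reward accuracy={acc:.2f}")
```

Tracking both numbers helps separate two situations: a falling loss with flat accuracy suggests margins are growing on pairs that were already ranked correctly, rather than more pairs being ranked correctly.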

### When to Use DPO:
- **Resource Constraints**: When computational resources are limited
- **Quick Iteration**: When you need fast experimentation cycles
- **Stable Training**: When training stability is important
- **Simple Preferences**: When preference data is straightforward

### Limitations:
1. **Preference Complexity**: May struggle with complex multi-dimensional preferences
2. **Distribution Shift**: Sensitive to changes in preference distribution
3. **Reward Interpretability**: Implicit rewards are less interpretable
4. **Fine-grained Control**: Less control over specific reward components

## 🚀 Next Steps

1. **Try Variants**: Experiment with Robust DPO, KL-DPO, and other variants
2. **Real Data**: Apply DPO to actual preference datasets
3. **Integration**: Combine with other techniques like LoRA for efficiency
4. **Evaluation**: Develop better metrics for preference alignment
5. **Research**: Explore recent improvements like IPO, ORPO, and others

**DPO shows that a complex RLHF pipeline isn't always necessary. Its simplicity and effectiveness have made it a standard tool for language model alignment.** 🎯