# Lab 3.1.7: DPO Training - Direct Preference Optimization

**Module:** 3.1 - Large Language Model Fine-Tuning  
**Time:** 2 hours  
**Difficulty:** ⭐⭐⭐⭐☆

---

## Learning Objectives

By the end of this notebook, you will:
- [ ] Understand how DPO aligns models to human preferences
- [ ] Implement DPO training with TRL
- [ ] Compare DPO vs SFT-only results
- [ ] Choose appropriate hyperparameters (beta)

---

## Real-World Context

### The Alignment Problem

After supervised fine-tuning (SFT), your model can follow instructions. But it might still:
- Give overly verbose responses when you want concise ones
- Be too formal when you want casual
- Refuse things it shouldn't, or not refuse things it should

**DPO teaches the model your PREFERENCES** - not just what to say, but HOW to say it.

### The Traditional RLHF Pipeline (Complex)

1. Collect human preferences (A is better than B)
2. Train a reward model
3. Use PPO to optimize policy against reward model
4. Deal with training instability, reward hacking, etc.

### DPO: A Simpler Alternative

DPO **skips the reward model entirely** and directly optimizes the policy using preference data!

---

## ELI5: What is DPO?

> **Imagine you're training a dog with treats.** Traditional RLHF is like:
> 1. First, hire a judge to rate every trick (reward model)
> 2. Then, give treats based on the judge's scores (PPO)
>
> **DPO is simpler:** You just show the dog two tricks and reward whichever one YOU prefer. No judge needed!
>
> **The math magic:** DPO proves that you can implicitly learn the reward function just from preferences. Instead of:
> - Train reward model → Optimize policy with RL
>
> You get:
> - Directly optimize policy with supervised learning on preferences
>
> **Result:** In the DPO paper's experiments, alignment quality matched or exceeded PPO-based RLHF, with a much simpler training loop.

---

## Part 1: The DPO Algorithm

### The Math (Simplified)

Given:
- A prompt $x$
- A chosen response $y_w$ (preferred)
- A rejected response $y_l$ (not preferred)

DPO minimizes:

$$\mathcal{L}_{DPO} = -\log \sigma\left(\beta \left[ \log \frac{\pi_\theta(y_w|x)}{\pi_{ref}(y_w|x)} - \log \frac{\pi_\theta(y_l|x)}{\pi_{ref}(y_l|x)} \right]\right)$$

Where:
- $\pi_\theta$ is your policy (the model being trained)
- $\pi_{ref}$ is the reference model (the SFT model you started from)
- $\beta$ controls how strongly the policy is tied to the reference model (higher $\beta$ means less deviation from $\pi_{ref}$)
- $\sigma$ is the sigmoid function

### Intuition

The loss encourages:
- **Increase probability of chosen response** relative to reference
- **Decrease probability of rejected response** relative to reference

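Because the loss is written in terms of log-ratios against the reference model, the policy is implicitly penalized for drifting too far from the original model. This limits the over-optimization ("reward hacking") that RL-based pipelines are prone to.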
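
To make the formula concrete, here is a minimal sketch of the per-example loss in plain Python. It assumes you already have the summed sequence log-probabilities of each response under the policy and the reference model. The function name `dpo_loss` and the example numbers are illustrative and not part of TRL.

```python
import math

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Per-example DPO loss from summed sequence log-probabilities."""
    # Log-ratio of policy vs. reference for each response
    chosen_logratio = policy_chosen_logp - ref_chosen_logp
    rejected_logratio = policy_rejected_logp - ref_rejected_logp
    # Scaled margin that goes inside the sigmoid
    z = beta * (chosen_logratio - rejected_logratio)
    # -log(sigmoid(z)) == log(1 + exp(-z)), written in a numerically stable form
    return max(-z, 0.0) + math.log1p(math.exp(-abs(z)))

# Policy identical to the reference: the margin is 0, so the loss is log(2)
start = dpo_loss(-50.0, -60.0, -50.0, -60.0)
# Policy has raised the chosen response and lowered the rejected one
improved = dpo_loss(-45.0, -65.0, -50.0, -60.0)
print(f"start={start:.4f} improved={improved:.4f}")
```

Note that only the differences between policy and reference log-probabilities enter the loss, so training always starts at $\log 2 \approx 0.693$ when the policy is initialized from the reference.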

In [None]:
# Setup
import torch
import gc
from typing import Dict, List, Optional
import warnings
warnings.filterwarnings('ignore')

# Check GPU
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print(f"Using device: {device}")

if torch.cuda.is_available():
    print(f"GPU: {torch.cuda.get_device_name()}")
    print(f"Memory: {torch.cuda.get_device_properties(0).total_memory / 1e9:.1f} GB")

In [None]:
# Import libraries
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training
from trl import DPOTrainer, DPOConfig
from datasets import Dataset

print("Libraries imported!")

---

## Part 2: Prepare Preference Dataset

DPO needs triplets of (prompt, chosen, rejected). TRL's `DPOTrainer` expects them as dataset columns with exactly these names.

In [None]:
# Create sample preference dataset
# In practice, you'd use real preference data from human annotations

preference_data = [
    {
        "prompt": "Explain quantum computing in simple terms.",
        "chosen": "Quantum computing uses quantum bits (qubits) that can be both 0 and 1 simultaneously, unlike regular bits. This allows quantum computers to solve certain problems much faster by exploring many possibilities at once. Think of it like checking all paths in a maze simultaneously instead of one at a time.",
        "rejected": "Quantum computing is a type of computation that uses quantum mechanics phenomena such as superposition and entanglement to process information. It operates on quantum bits or qubits."
    },
    {
        "prompt": "What's the best way to learn programming?",
        "chosen": "Start with a beginner-friendly language like Python. Build small projects that interest you - a calculator, a todo app, or a simple game. Practice daily, even just 30 minutes. Join communities like Stack Overflow when you get stuck. Remember: every expert was once a beginner!",
        "rejected": "There are many ways to learn programming. You could take online courses, read books, or watch tutorials. Practice is important."
    },
    {
        "prompt": "How do I make my code run faster?",
        "chosen": "Here are key strategies: 1) Profile first - measure before optimizing. 2) Use appropriate data structures (sets for membership, dicts for lookups). 3) Avoid unnecessary loops - use vectorized operations. 4) Cache expensive computations. 5) Consider algorithm complexity (O(n) vs O(n²)).",
        "rejected": "You should optimize your code by making it more efficient."
    },
    {
        "prompt": "Write a haiku about programming.",
        "chosen": "Bugs hide in the code\nDebugging through midnight hours\nStack trace reveals truth",
        "rejected": "Here is a haiku: Code and programming, Computers do what we say, Technology works."
    },
    {
        "prompt": "What are the benefits of exercise?",
        "chosen": "Exercise benefits your whole life: physically (stronger heart, better sleep, more energy), mentally (reduced anxiety, sharper thinking, better mood), and socially (confidence, community). Even 20 minutes of walking daily can transform your health.",
        "rejected": "Exercise is good for health. It makes you stronger and healthier. You should exercise regularly."
    },
]

# Create more examples through variation
extended_data = []
for item in preference_data:
    extended_data.append(item)
    # Create variations with slight modifications
    extended_data.append({
        "prompt": "Please " + item["prompt"].lower(),
        "chosen": item["chosen"],
        "rejected": item["rejected"],
    })

preference_dataset = Dataset.from_list(extended_data)
print(f"Created preference dataset with {len(preference_dataset)} examples")
print(f"\nSample:")
print(f"  Prompt: {preference_dataset[0]['prompt'][:50]}...")
print(f"  Chosen: {preference_dataset[0]['chosen'][:50]}...")
print(f"  Rejected: {preference_dataset[0]['rejected'][:50]}...")
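
Before training on preference data it can help to sanity-check each record. The sketch below is one possible check, not a TRL feature: the helper name `validate_preference_record` and the rules it applies (three non-empty string fields, chosen different from rejected) are assumptions chosen for illustration.

```python
REQUIRED_KEYS = ("prompt", "chosen", "rejected")

def validate_preference_record(record):
    """Return a list of problems found in one preference record (empty if OK)."""
    problems = []
    for key in REQUIRED_KEYS:
        value = record.get(key)
        if not isinstance(value, str) or not value.strip():
            problems.append(f"missing or empty '{key}'")
    # Only compare the two responses once both are known to be present
    if not problems and record["chosen"].strip() == record["rejected"].strip():
        problems.append("chosen and rejected are identical")
    return problems

good = {
    "prompt": "What is Python?",
    "chosen": "A readable high-level programming language.",
    "rejected": "Python is programming.",
}
bad = {"prompt": "What is Python?", "chosen": "Same text", "rejected": "Same text"}

print(validate_preference_record(good))
print(validate_preference_record(bad))
```

In the notebook you could run the same check over `extended_data` before calling `Dataset.from_list`.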

---

## Part 3: Load Model for DPO

In [None]:
# Configuration
MODEL_NAME = "TinyLlama/TinyLlama-1.1B-Chat-v1.0"  # Small for demo
# For production:
# MODEL_NAME = "Qwen/Qwen3-8B"

# DPO hyperparameters
DPO_BETA = 0.1  # Strength of the pull toward the reference model (0.1-0.5 typical; higher = less deviation)
LEARNING_RATE = 5e-5
NUM_EPOCHS = 1
BATCH_SIZE = 2
MAX_LENGTH = 512

print(f"DPO Configuration:")
print(f"  Model: {MODEL_NAME}")
print(f"  Beta: {DPO_BETA}")
print(f"  Learning rate: {LEARNING_RATE}")

In [None]:
# 4-bit quantization config
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,
)

# Load tokenizer
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token

print("Tokenizer loaded!")

In [None]:
# Load model
print(f"Loading model {MODEL_NAME}...")

model = AutoModelForCausalLM.from_pretrained(
    MODEL_NAME,
    quantization_config=bnb_config,
    device_map="auto",
    trust_remote_code=True,
)

model = prepare_model_for_kbit_training(model)

print(f"Model loaded!")
print(f"Memory: {torch.cuda.memory_allocated()/1e9:.2f} GB")

In [None]:
# Add LoRA adapters
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
)

model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
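
As a rough cross-check on the trainable-parameter count, each adapted linear layer of shape `(d_in, d_out)` adds `r * (d_in + d_out)` LoRA parameters. The sketch below applies that formula. The layer shapes are assumptions about TinyLlama-1.1B (hidden size 2048, 4 key/value heads of dimension 64, 22 layers), and `lora_param_count` is a hypothetical helper, so compare the result with what `print_trainable_parameters()` reports rather than trusting it blindly.

```python
def lora_param_count(shapes, r, num_layers):
    """Trainable LoRA parameters: each adapted (d_in, d_out) linear adds r * (d_in + d_out)."""
    per_layer = sum(r * (d_in + d_out) for d_in, d_out in shapes.values())
    return per_layer * num_layers

# Assumed TinyLlama-1.1B projection shapes (grouped-query attention: k/v project to 256)
tinyllama_shapes = {
    "q_proj": (2048, 2048),
    "k_proj": (2048, 256),
    "v_proj": (2048, 256),
    "o_proj": (2048, 2048),
}

total = lora_param_count(tinyllama_shapes, r=16, num_layers=22)
print(f"Estimated LoRA parameters: {total:,}")
```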

---

## Part 4: DPO Training

In [None]:
# DPO training configuration
dpo_config = DPOConfig(
    output_dir="./dpo_output",
    
    # DPO-specific
    beta=DPO_BETA,
    
    # Training
    per_device_train_batch_size=BATCH_SIZE,
    gradient_accumulation_steps=4,
    num_train_epochs=NUM_EPOCHS,
    learning_rate=LEARNING_RATE,
    
    # Sequence length
    max_length=MAX_LENGTH,
    max_prompt_length=256,
    
    # Optimization
    optim="paged_adamw_8bit",
    bf16=True,
    
    # Logging
    logging_steps=5,
    
    # Saving
    save_strategy="no",  # Don't save for demo
    
    report_to="none",
    remove_unused_columns=False,
)

print("DPO config created!")
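
It is worth checking how many optimizer steps this configuration actually produces on such a small dataset. The sketch below is back-of-the-envelope arithmetic only: `approx_optimizer_steps` is a hypothetical helper, and the exact count depends on how the installed Trainer version rounds the last partial accumulation window.

```python
import math

def approx_optimizer_steps(num_examples, per_device_batch_size,
                           grad_accum_steps, num_epochs=1, num_devices=1):
    """Approximate number of optimizer updates for a training run."""
    # Number of forward/backward passes per epoch
    micro_batches = math.ceil(num_examples / (per_device_batch_size * num_devices))
    # Gradients are applied once every grad_accum_steps micro-batches
    steps_per_epoch = max(1, math.ceil(micro_batches / grad_accum_steps))
    return steps_per_epoch * num_epochs

effective_batch = 2 * 4  # per_device_train_batch_size * gradient_accumulation_steps
steps = approx_optimizer_steps(num_examples=10, per_device_batch_size=2, grad_accum_steps=4)
print(f"Effective batch size: {effective_batch}, approx. optimizer steps: {steps}")
```

With only about two optimizer steps, `logging_steps=5` may never produce an intermediate log line. For this demo dataset, `logging_steps=1` is likely more informative.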

In [None]:
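# Create DPO trainer
# Note: DPO needs a reference model (the model before DPO training).
# With ref_model=None and a PEFT/LoRA model, TRL computes reference log-probs
# from the same base weights with the adapters disabled, so no second copy is loaded.
# (For a non-PEFT model, TRL instead makes a frozen copy of the model.)

dpo_trainer = DPOTrainer(
    model=model,
    ref_model=None,  # Reference is derived automatically (see note above)
    args=dpo_config,
    train_dataset=preference_dataset,
    tokenizer=tokenizer,  # Newer TRL releases rename this argument to processing_class
)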

print("DPO Trainer created!")
print(f"\nMemory after trainer setup: {torch.cuda.memory_allocated()/1e9:.2f} GB")

In [None]:
# Train with DPO!
print("="*50)
print("STARTING DPO TRAINING")
print("="*50)
print(f"\nThis will teach the model to prefer better responses!")

dpo_result = dpo_trainer.train()

print("\n" + "="*50)
print("DPO TRAINING COMPLETE!")
print("="*50)

In [None]:
# Print metrics
print("\nTraining Metrics:")
for key, value in dpo_result.metrics.items():
    if isinstance(value, float):
        print(f"  {key}: {value:.4f}")
    else:
        print(f"  {key}: {value}")
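
Besides the loss, TRL's DPO trainer logs implicit-reward statistics during training under names such as `rewards/chosen`, `rewards/rejected`, `rewards/margins`, and `rewards/accuracies`. The sketch below shows how such quantities follow from the log-probabilities, assuming the implicit reward is $\beta$ times the policy-vs-reference log-ratio. The helper `implicit_reward_stats` and the numbers are illustrative, not TRL's internal code.

```python
def implicit_reward_stats(batch, beta=0.1):
    """Average implicit rewards, margin, and accuracy over a batch of log-probs."""
    # Implicit reward = beta * (policy log-prob - reference log-prob)
    chosen = [beta * (ex["policy_chosen"] - ex["ref_chosen"]) for ex in batch]
    rejected = [beta * (ex["policy_rejected"] - ex["ref_rejected"]) for ex in batch]
    margins = [c - r for c, r in zip(chosen, rejected)]
    n = len(batch)
    return {
        "rewards/chosen": sum(chosen) / n,
        "rewards/rejected": sum(rejected) / n,
        "rewards/margins": sum(margins) / n,
        # Fraction of pairs where the chosen response has the higher implicit reward
        "rewards/accuracies": sum(m > 0 for m in margins) / n,
    }

batch = [
    {"policy_chosen": -45.0, "ref_chosen": -50.0, "policy_rejected": -65.0, "ref_rejected": -60.0},
    {"policy_chosen": -52.0, "ref_chosen": -50.0, "policy_rejected": -58.0, "ref_rejected": -60.0},
]
stats = implicit_reward_stats(batch)
for name, value in stats.items():
    print(f"{name}: {value:.2f}")
```

A rising margin and an accuracy above 0.5 suggest the policy is separating chosen from rejected responses. On a dataset this small, treat the numbers as a smoke test rather than evidence of alignment.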

---

## Part 5: Evaluate DPO Results

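Let's sample responses from the DPO-trained model. To judge the effect of DPO, compare them against the same prompts run through the model as it was before training.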

In [None]:
def generate_response(model, tokenizer, prompt, max_new_tokens=128):
    """Generate a response from the model."""
    # Format prompt
    formatted = f"<|user|>\n{prompt}</s>\n<|assistant|>\n"
    
    inputs = tokenizer(formatted, return_tensors="pt").to(model.device)
    
    model.eval()
    with torch.no_grad():
        outputs = model.generate(
            **inputs,
            max_new_tokens=max_new_tokens,
            do_sample=True,
            temperature=0.7,
            top_p=0.9,
            pad_token_id=tokenizer.pad_token_id,
        )
    
    response = tokenizer.decode(outputs[0], skip_special_tokens=True)
    
    # Extract just the response
    if "<|assistant|>" in response:
        response = response.split("<|assistant|>")[-1].strip()
    
    return response

# Test prompts
test_prompts = [
    "Explain machine learning simply.",
    "What's the best programming language?",
    "Give me tips for better sleep.",
]

print("Testing DPO-trained model:")
print("="*50)

for prompt in test_prompts:
    print(f"\nPrompt: {prompt}")
    print("-"*40)
    response = generate_response(model, tokenizer, prompt)
    print(f"Response: {response[:300]}..." if len(response) > 300 else f"Response: {response}")
    print("="*50)
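
Because the policy here is a LoRA model, one way to get "before" outputs without reloading anything is PEFT's `disable_adapter()` context manager, which temporarily bypasses the adapter weights. The sketch below assumes the model is a `PeftModel`; `compare_before_after` and its `generate_fn` argument are hypothetical names introduced for illustration.

```python
def compare_before_after(peft_model, prompts, generate_fn):
    """Generate each prompt with LoRA adapters disabled (before) and enabled (after)."""
    results = []
    for prompt in prompts:
        # Inside this block the base weights are used, i.e. the pre-DPO model
        with peft_model.disable_adapter():
            before = generate_fn(peft_model, prompt)
        # Outside the block the DPO-trained adapters are active again
        after = generate_fn(peft_model, prompt)
        results.append({"prompt": prompt, "before": before, "after": after})
    return results

# Possible usage in this notebook (assumes model, tokenizer, test_prompts, generate_response exist):
# rows = compare_before_after(
#     model, test_prompts, lambda m, p: generate_response(m, tokenizer, p)
# )
```

Since `generate_response` samples with `do_sample=True`, some of the difference between the two outputs is sampling noise. Fixing the seed or using greedy decoding makes the comparison easier to read.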

---

## Part 6: Understanding Beta

The `beta` parameter controls how tightly the policy is tied to the reference model. Higher values keep the policy closer to the reference; lower values let it move further in order to satisfy the preferences.

In [None]:
# Beta guidelines
print("""
╔══════════════════════════════════════════════════════════════════╗
║                      DPO BETA GUIDELINES                          ║
╠══════════════════════════════════════════════════════════════════╣
║ Beta Value │ Effect                                              ║
╠════════════╪═════════════════════════════════════════════════════╣
║ 0.01-0.05  │ Weak anchor. Policy can drift far from reference.   ║
║ 0.1        │ TRL DEFAULT. Common starting point.                 ║
║ 0.2-0.3    │ Stronger anchor. Smaller, safer changes.            ║
║ 0.5+       │ Very strong anchor. Model may barely change.        ║
╠══════════════════════════════════════════════════════════════════╣
║ Recommendations:                                                 ║
║   - Start with beta=0.1                                          ║
║   - If outputs degrade or drift, increase beta (e.g. 0.2)        ║
║   - If model barely changes, decrease beta (e.g. 0.05)           ║
║   - Watch rewards/margins and eval generations, not only loss    ║
╚══════════════════════════════════════════════════════════════════╝
""")
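
One way to see the role of beta numerically is to ask how large the log-ratio margin between chosen and rejected must become before the per-example loss reaches a given value. The sketch below solves $\log(1 + e^{-\beta \Delta}) = L$ for the margin $\Delta$. The helper name `margin_for_target_loss` and the target loss of 0.1 are illustrative assumptions.

```python
import math

def margin_for_target_loss(target_loss, beta):
    """Log-ratio margin needed for the per-example DPO loss to equal target_loss."""
    # From log(1 + exp(-beta * margin)) = target_loss, solve for margin
    return -math.log(math.expm1(target_loss)) / beta

for beta in (0.05, 0.1, 0.5):
    margin = margin_for_target_loss(0.1, beta)
    print(f"beta={beta}: margin needed = {margin:.1f} nats")
```

The required margin scales as $1/\beta$, so a small beta asks the policy to move much further from the reference to reach the same loss, while a large beta is satisfied by a small shift.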

---

## Common Mistakes

### Mistake 1: Poor Preference Data

```python
# Wrong: Chosen and rejected are too similar
{
    "prompt": "What is Python?",
    "chosen": "Python is a programming language.",
    "rejected": "Python is a coding language."  # Too similar!
}

# Right: Clear quality difference
{
    "prompt": "What is Python?",
    "chosen": "Python is a high-level programming language known for readability. It's great for beginners and experts alike.",
    "rejected": "Python is programming."  # Clearly worse
}
```
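
A quick heuristic for catching near-duplicate pairs is a string-similarity ratio. The sketch below uses `difflib.SequenceMatcher` from the standard library. The function names and the 0.8 threshold are assumptions to tune on your own data, and character-level similarity says nothing about which response is actually better.

```python
from difflib import SequenceMatcher

def pair_similarity(chosen, rejected):
    """Character-level similarity ratio in [0, 1] between the two responses."""
    return SequenceMatcher(None, chosen, rejected).ratio()

def flag_similar_pairs(records, threshold=0.8):
    """Return the records whose chosen and rejected responses are nearly identical."""
    return [r for r in records
            if pair_similarity(r["chosen"], r["rejected"]) >= threshold]

records = [
    {"prompt": "What is Python?",
     "chosen": "Python is a programming language.",
     "rejected": "Python is a coding language."},
    {"prompt": "What is Python?",
     "chosen": "Python is a high-level programming language known for readability. "
               "It's great for beginners and experts alike.",
     "rejected": "Python is programming."},
]

for record in flag_similar_pairs(records):
    print("Too similar:", record["chosen"], "|", record["rejected"])
```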

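### Mistake 2: Misreading Beta

```python
# Wrong: Assuming a larger beta means "learn the preferences harder"
dpo_config = DPOConfig(beta=1.0)  # Strong pull toward the reference; model may barely change

# Right: Start at the TRL default
dpo_config = DPOConfig(beta=0.1)  # Raise if outputs degrade, lower if nothing changes
```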

### Mistake 3: Forgetting Reference Model

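```python
# Wrong: Optimizing preferences with nothing anchoring the policy
# (Training without the reference constraint invites reward hacking)

# Right: With ref_model=None, TRL derives the reference from the starting model
trainer = DPOTrainer(
    model=model,
    ref_model=None,
    args=dpo_config,
    train_dataset=preference_dataset,
    tokenizer=tokenizer,
)
```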

---

## Checkpoint

You've learned:
- ✅ How DPO aligns models to preferences without a reward model
- ✅ How to create preference datasets
- ✅ How to train with DPO using TRL
- ✅ How to tune the beta parameter
- ✅ Common mistakes to avoid

---

## Further Reading

- [DPO Paper](https://arxiv.org/abs/2305.18290) - Direct Preference Optimization
- [TRL DPO Documentation](https://huggingface.co/docs/trl/dpo_trainer)

---

## Cleanup

In [None]:
# Clear memory
del model, dpo_trainer
gc.collect()  # Drop Python references first so cached GPU blocks can be released
if torch.cuda.is_available():
    torch.cuda.empty_cache()

print("Cleanup complete!")

---

## Next Steps

Continue to:

**[Lab 3.1.8: SimPO vs ORPO](lab-3.1.8-simpo-vs-orpo.ipynb)** - Compare modern alternatives to DPO that are simpler and more memory-efficient!