# Lab 3.1.8: SimPO vs ORPO - Modern Preference Optimization

**Module:** 3.1 - Large Language Model Fine-Tuning  
**Time:** 2 hours  
**Difficulty:** ⭐⭐⭐⭐☆

---

## Learning Objectives

By the end of this notebook, you will:
- [ ] Understand SimPO and ORPO as DPO alternatives
- [ ] Know when to use each method
- [ ] Implement both with TRL
- [ ] Compare their memory and quality tradeoffs

---

## Why Move Beyond DPO?

DPO works well, but it has some limitations that SimPO and ORPO address:

| Issue | DPO | SimPO | ORPO |
|-------|-----|-------|------|
| Needs reference model | ✅ Yes (about 2x model weights) | ❌ No | ❌ No |
| Quality on AlpacaEval 2 | Baseline | **Up to +6.4 points** (reported in the SimPO paper) | Comparable |
| Memory usage | High (policy + reference) | Lower (policy only) | Lower (policy only) |
| Complexity | Medium | Low | Low |

**SimPO** = Better quality, simpler  
**ORPO** = Less memory, single-stage training

---

## ELI5: SimPO vs ORPO

> **Imagine you're grading essays:**
>
> **DPO** is like: "Compare this essay to the original student's average work, then decide if this is better or worse."
> - Requires keeping the "original" for comparison
>
> **SimPO (Simple Preference Optimization)** is like: "Just look at the two essays and pick the better one. Trust your instincts!"
> - No reference needed
> - Uses length-normalized scoring (longer isn't always better)
>
> **ORPO (Odds Ratio Preference Optimization)** is like: "Score both essays, then boost the good one and penalize the bad one in the same step."
> - Combines SFT + preference learning in one stage
> - Most memory-efficient

---

## Part 1: The Algorithms

### SimPO (Simple Preference Optimization)

SimPO removes the reference model and uses length-normalized log probabilities:

$$\mathcal{L}_{SimPO} = -\log \sigma\left(\frac{\beta}{|y_w|} \log \pi_\theta(y_w|x) - \frac{\beta}{|y_l|} \log \pi_\theta(y_l|x) - \gamma\right)$$

Key innovations:
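- **No reference model** - removes the second copy of the model weights that DPO keeps in memory
- **Length normalization** - dividing by the response length $|y|$ discourages favoring longer responses
- **Gamma margin** - pushes the chosen reward to exceed the rejected reward by at least γ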
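
To make the effect of length normalization concrete, here is a minimal sketch in plain Python. The per-token log-probabilities (-0.4 and -0.9) and the lengths (60 and 10) are invented for illustration, and `sequence_logp` and `simpo_margin` are hypothetical helper names, not part of any library.

```python
import math

def sequence_logp(per_token_logp: float, length: int) -> float:
    """Summed log-probability of a response whose tokens all share one log-prob."""
    return per_token_logp * length

def simpo_margin(logp_w, len_w, logp_l, len_l, beta=2.0, gamma=1.0):
    """Argument of the sigmoid in the SimPO loss above."""
    return beta * logp_w / len_w - beta * logp_l / len_l - gamma

# Hypothetical pair: the chosen answer is long and fairly confident per token,
# the rejected answer is short and less confident per token.
len_w, len_l = 60, 10
logp_w = sequence_logp(-0.4, len_w)
logp_l = sequence_logp(-0.9, len_l)

raw_gap = logp_w - logp_l                   # negative: raw sums favor the short answer
norm_gap = logp_w / len_w - logp_l / len_l  # positive: per-token averages favor the chosen answer
margin = simpo_margin(logp_w, len_w, logp_l, len_l)
loss = -math.log(1.0 / (1.0 + math.exp(-margin)))

print(f"raw gap: {raw_gap:.2f}, normalized gap: {norm_gap:.2f}")
print(f"margin: {margin:.2f}, loss: {loss:.4f}")
```

With these made-up numbers the raw sums would rank the short rejected answer higher, while the length-normalized scores rank the chosen answer higher. The margin γ then decides how large that normalized gap must be before the loss starts to flatten out.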

### ORPO (Odds Ratio Preference Optimization)

ORPO combines SFT and preference learning:

$$\mathcal{L}_{ORPO} = \mathcal{L}_{SFT}(y_w) + \lambda \cdot \mathcal{L}_{OR}$$

Where the odds ratio loss is (with $P(y|x)$ taken as the length-normalized likelihood of the response, as in the ORPO paper):

$$\mathcal{L}_{OR} = -\log \sigma\left(\log \frac{P(y_w|x)}{1-P(y_w|x)} - \log \frac{P(y_l|x)}{1-P(y_l|x)}\right)$$

Key innovations:
- **Single training stage** - no separate SFT stage followed by DPO
- **No reference model needed**
- **Uses odds ratios** - the ORPO paper argues that an odds ratio penalizes rejected responses less aggressively than a plain probability ratio
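
The two formulas above can be sketched for a single preference pair in plain Python. This is an illustration under stated assumptions, not TRL's implementation: the inputs are assumed to be per-token average log-probabilities (so that $P(y|x) < 1$), and `log_odds` and `orpo_loss` are hypothetical helper names.

```python
import math

def log_odds(avg_logp: float) -> float:
    """log(p / (1 - p)) for p = exp(avg_logp); requires avg_logp < 0."""
    return avg_logp - math.log1p(-math.exp(avg_logp))

def orpo_loss(avg_logp_w: float, avg_logp_l: float, lam: float = 0.1) -> float:
    """L_SFT(y_w) + lambda * L_OR for one (chosen, rejected) pair."""
    sft = -avg_logp_w  # negative log-likelihood of the chosen response
    log_odds_ratio = log_odds(avg_logp_w) - log_odds(avg_logp_l)
    # -log(sigmoid(z)) is the same as log(1 + exp(-z))
    l_or = math.log1p(math.exp(-log_odds_ratio))
    return sft + lam * l_or

# Hypothetical values: same chosen likelihood, two different rejected likelihoods
well_separated = orpo_loss(-0.5, -1.5)
barely_separated = orpo_loss(-0.5, -0.6)
print(f"well separated:   {well_separated:.4f}")
print(f"barely separated: {barely_separated:.4f}")
```

The SFT term is identical in both calls, so the difference comes entirely from the odds ratio term, which shrinks as the chosen response becomes more likely relative to the rejected one.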

In [None]:
# Setup
import torch
import gc
from typing import Dict, List
import warnings
warnings.filterwarnings('ignore')

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print(f"Using device: {device}")

if torch.cuda.is_available():
    print(f"GPU: {torch.cuda.get_device_name()}")
    print(f"Memory: {torch.cuda.get_device_properties(0).total_memory / 1e9:.1f} GB")

In [None]:
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training
from trl import ORPOTrainer, ORPOConfig
from datasets import Dataset

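# Note: TRL does not ship a dedicated SimPOTrainer. SimPO is available through
# CPOTrainer by setting loss_type="simpo" and cpo_alpha=0.0 in CPOConfig (see Part 4).
# Import paths can differ between TRL releases; check the docs for your installed version.
# from trl import CPOTrainer, CPOConfig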

print("Libraries imported!")

---

## Part 2: Preference Dataset

In [None]:
# Same preference data format as DPO
preference_data = [
    {
        "prompt": "Explain neural networks simply.",
        "chosen": "Neural networks are computing systems inspired by biological brains. They consist of layers of connected nodes that learn patterns from data. Like how you learn to recognize faces by seeing many examples, neural networks learn by processing thousands of examples until they can make accurate predictions.",
        "rejected": "Neural networks are machine learning models used in AI."
    },
    {
        "prompt": "What makes good code?",
        "chosen": "Good code is: 1) Readable - others can understand it. 2) Maintainable - easy to fix and extend. 3) Tested - has automated tests. 4) Simple - does one thing well. 5) Documented - explains the 'why'. Remember: code is read more often than written!",
        "rejected": "Good code works correctly and is efficient."
    },
    {
        "prompt": "How do I stay motivated?",
        "chosen": "Build sustainable motivation: Start small (2-minute rule), track your wins, connect tasks to bigger goals, celebrate progress not perfection, rest when needed (burnout kills motivation), and surround yourself with supportive people. Remember: motivation follows action, not the other way around.",
        "rejected": "Just push through and don't give up. Think positive thoughts."
    },
    {
        "prompt": "Explain recursion.",
        "chosen": "Recursion is when a function calls itself to solve smaller versions of the same problem. Example: To count people in a line, ask the person in front 'how many are ahead of you?' and add 1. They ask the next person, and so on until someone says 'zero'. That's the base case that stops the recursion.",
        "rejected": "Recursion is when a function calls itself. It needs a base case."
    },
    {
        "prompt": "What is overfitting?",
        "chosen": "Overfitting is when a model memorizes training data instead of learning general patterns. Imagine a student who memorizes test answers but can't solve new problems. Signs: perfect training score but poor test score. Solutions: more data, simpler models, regularization, or dropout.",
        "rejected": "Overfitting happens when a model is too complex for the data."
    },
]

# Expand dataset
expanded_data = []
for item in preference_data:
    expanded_data.append(item)
    expanded_data.append({
        "prompt": "Can you " + item["prompt"].lower(),
        "chosen": item["chosen"],
        "rejected": item["rejected"],
    })

preference_dataset = Dataset.from_list(expanded_data)
print(f"Dataset: {len(preference_dataset)} examples")
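
Before training on preference pairs, it can help to check how much the chosen and rejected responses differ in length, since a model can lower its loss simply by preferring longer text. The sketch below uses whitespace word counts as a crude stand-in for token counts. `length_stats` is a hypothetical helper, and the two sample rows are shortened illustrations rather than the lab's data.

```python
from statistics import mean

def length_stats(pairs):
    """Summarize word counts of chosen vs. rejected responses."""
    chosen = [len(p["chosen"].split()) for p in pairs]
    rejected = [len(p["rejected"].split()) for p in pairs]
    longer = sum(c > r for c, r in zip(chosen, rejected))
    return {
        "mean_chosen_words": mean(chosen),
        "mean_rejected_words": mean(rejected),
        "chosen_longer_fraction": longer / len(pairs),
    }

# Illustrative rows in the same prompt/chosen/rejected format
sample = [
    {
        "prompt": "Explain recursion.",
        "chosen": "A function that calls itself on a smaller input until a base case stops it.",
        "rejected": "A function calls itself.",
    },
    {
        "prompt": "What is overfitting?",
        "chosen": "Memorizing training data instead of learning general patterns.",
        "rejected": "The model is too complex.",
    },
]

stats = length_stats(sample)
for key, value in stats.items():
    print(f"{key}: {value}")
```

In the notebook, `length_stats(expanded_data)` would run the same check on the lab's dataset. A `chosen_longer_fraction` close to 1.0 suggests length is confounded with preference, which is the situation SimPO's length normalization is meant to handle.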

---

## Part 3: ORPO Training

ORPO is available in TRL as `ORPOTrainer`. Because it needs no reference model, only the policy model is held in memory during training.

In [None]:
# Configuration
MODEL_NAME = "TinyLlama/TinyLlama-1.1B-Chat-v1.0"

# 4-bit quantization
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,
)

# Load model
print("Loading model...")
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token

model = AutoModelForCausalLM.from_pretrained(
    MODEL_NAME,
    quantization_config=bnb_config,
    device_map="auto",
)

model = prepare_model_for_kbit_training(model)

# Add LoRA
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
)

model = get_peft_model(model, lora_config)

print("\nModel loaded!")
model.print_trainable_parameters()
print(f"Memory: {torch.cuda.memory_allocated()/1e9:.2f} GB")

In [None]:
# ORPO Configuration
orpo_config = ORPOConfig(
    output_dir="./orpo_output",
    
    # ORPO-specific
    beta=0.1,  # λ in the ORPO formula: weight of the odds ratio loss
    
    # Training
    per_device_train_batch_size=2,
    gradient_accumulation_steps=4,
    num_train_epochs=1,
    learning_rate=5e-5,
    
    # Sequence
    max_length=512,
    max_prompt_length=256,
    
    # Optimization
    optim="paged_adamw_8bit",
    bf16=True,
    
    # Logging
    logging_steps=1,  # log every step; this toy dataset yields only a couple of optimizer steps
    
    save_strategy="no",
    report_to="none",
    remove_unused_columns=False,
)

print("ORPO configuration created!")
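
The batch settings in the configuration determine how many optimizer steps a run performs, which matters for a dataset this small. Here is a minimal sketch of that arithmetic. `optimizer_steps` is a hypothetical helper, and the exact rounding of a trailing partial accumulation group may differ between `transformers` versions, so treat the result as an upper bound.

```python
import math

def optimizer_steps(num_examples, per_device_batch, grad_accum, num_devices=1, epochs=1):
    """Return (effective batch size, approximate number of optimizer steps)."""
    effective_batch = per_device_batch * grad_accum * num_devices
    batches_per_epoch = math.ceil(num_examples / (per_device_batch * num_devices))
    # Assumption: a trailing partial accumulation group still counts as one step
    steps_per_epoch = max(1, math.ceil(batches_per_epoch / grad_accum))
    return effective_batch, steps_per_epoch * epochs

# Values from this lab: 10 examples, batch size 2, 4 accumulation steps, 1 epoch
effective_batch, total_steps = optimizer_steps(10, 2, 4)
print(f"Effective batch size: {effective_batch}")
print(f"Optimizer steps (upper bound): {total_steps}")
```

Any `logging_steps` value larger than the total step count produces no intermediate log lines, so on toy datasets it is worth keeping that value small.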

In [None]:
# Create ORPO trainer
orpo_trainer = ORPOTrainer(
    model=model,
    args=orpo_config,
    train_dataset=preference_dataset,
    tokenizer=tokenizer,  # newer TRL releases name this argument processing_class
)

print("ORPO Trainer created!")
print(f"Memory after setup: {torch.cuda.memory_allocated()/1e9:.2f} GB")
print("\nNote: No reference model needed, so no second copy of the weights is loaded.")

In [None]:
# Train with ORPO
print("="*50)
print("STARTING ORPO TRAINING")
print("="*50)
print("\nORPO combines SFT + preference learning in one stage!")

orpo_result = orpo_trainer.train()

print("\n" + "="*50)
print("ORPO TRAINING COMPLETE!")
print("="*50)

In [None]:
# Print metrics
print("\nORPO Training Metrics:")
for key, value in orpo_result.metrics.items():
    if isinstance(value, float):
        print(f"  {key}: {value:.4f}")
    else:
        print(f"  {key}: {value}")

---

## Part 4: SimPO (Reference Implementation)

TRL does not ship a dedicated `SimPOTrainer`. SimPO is available through `CPOTrainer` with `loss_type="simpo"`. The template below is printed rather than executed:

In [None]:
# SimPO via TRL's CPOTrainer (template, not executed here)

simpo_pseudocode = """
# SimPO training with TRL's CPOTrainer

from trl import CPOTrainer, CPOConfig

simpo_config = CPOConfig(
    output_dir="./simpo_output",
    
    # SimPO-specific parameters
    loss_type="simpo",     # Length-normalized, reference-free loss
    cpo_alpha=0.0,         # Drop CPO's extra NLL term to get plain SimPO
    beta=2.0,              # Controls preference strength
    simpo_gamma=1.0,       # Target reward margin γ (γ / β = 0.5)
    
    # Note: No reference model needed!
    
    # Standard training params
    per_device_train_batch_size=2,
    learning_rate=5e-5,
    num_train_epochs=1,
    # ... other standard training arguments
)

trainer = CPOTrainer(
    model=model,
    args=simpo_config,
    train_dataset=preference_dataset,
    tokenizer=tokenizer,   # processing_class in newer TRL releases
)

trainer.train()
"""

print("SimPO template:")
print(simpo_pseudocode)

In [None]:
# Manual SimPO loss implementation for understanding
import torch.nn.functional as F

def simpo_loss(
    policy_chosen_logps: torch.Tensor,
    policy_rejected_logps: torch.Tensor,
    chosen_lengths: torch.Tensor,
    rejected_lengths: torch.Tensor,
    beta: float = 2.0,
    gamma: float = 1.0,
) -> torch.Tensor:
    """
    Compute SimPO loss.
    
    SimPO uses length-normalized log probabilities and a margin.
    No reference model needed!
    """
    # Length-normalize the log probabilities
    chosen_rewards = beta * policy_chosen_logps / chosen_lengths
    rejected_rewards = beta * policy_rejected_logps / rejected_lengths
    
    # Compute loss with margin
    logits = chosen_rewards - rejected_rewards - gamma
    loss = -F.logsigmoid(logits).mean()
    
    return loss

# Demo
chosen_logps = torch.tensor([-10.0, -15.0, -12.0])
rejected_logps = torch.tensor([-20.0, -25.0, -22.0])
chosen_lens = torch.tensor([50.0, 60.0, 55.0])
rejected_lens = torch.tensor([20.0, 25.0, 22.0])

loss = simpo_loss(chosen_logps, rejected_logps, chosen_lens, rejected_lens)
print(f"Example SimPO loss: {loss.item():.4f}")
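
The `policy_chosen_logps` and length arguments above are per-sequence quantities: the log-probabilities are summed over response tokens only, and the length counts those same tokens. The sketch below shows that reduction with plain Python lists so it runs anywhere. In practice these would be tensors, and `sequence_logps` plus the sample values are hypothetical.

```python
def sequence_logps(token_logps, response_mask):
    """Return (summed log-probs, response lengths), one entry per sequence.

    token_logps[i][t]   : log-prob the model assigned to token t of sequence i
    response_mask[i][t] : 1 for response tokens, 0 for prompt or padding tokens
    """
    sums, lengths = [], []
    for logps, mask in zip(token_logps, response_mask):
        sums.append(sum(lp * m for lp, m in zip(logps, mask)))
        lengths.append(sum(mask))
    return sums, lengths

# Two hypothetical sequences: the first token is the prompt,
# and the last token of the second sequence is padding
token_logps = [[-0.1, -0.2, -0.5, -0.3], [-0.4, -1.0, -0.6, 0.0]]
response_mask = [[0, 1, 1, 1], [0, 1, 1, 0]]

sums, lengths = sequence_logps(token_logps, response_mask)
for s, n in zip(sums, lengths):
    print(f"sum logp: {s:.2f}, length: {n}, per-token average: {s / n:.3f}")
```

Masking out the prompt matters here: if prompt tokens were counted in the length, the normalization would depend on how long the question was rather than on the response alone.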

---

## Part 5: When to Use Which Method

### Decision Guide

In [None]:
print("""
╔══════════════════════════════════════════════════════════════════════════════╗
║                    CHOOSING YOUR PREFERENCE METHOD                            ║
╠══════════════════════════════════════════════════════════════════════════════╣
║                                                                              ║
║  Use DPO when:                                                               ║
║    ✓ You have plenty of memory (can load 2x model)                           ║
║    ✓ You want a well-tested, proven method                                   ║
║    ✓ You already have an SFT model as starting point                         ║
║                                                                              ║
║  Use SimPO when:                                                             ║
║    ✓ You want top quality (up to +6.4 on AlpacaEval 2)                       ║
║    ✓ You want simpler training (no reference model)                          ║
║    ✓ Your responses vary significantly in length                             ║
║                                                                              ║
║  Use ORPO when:                                                              ║
║    ✓ Memory is constrained (no reference model)                              ║
║    ✓ You want single-stage training (no separate SFT)                        ║
║    ✓ You're on DGX Spark with 70B+ models                                    ║
║                                                                              ║
║  Use KTO when:                                                               ║
║    ✓ You only have binary feedback (thumbs up/down)                          ║
║    ✓ You don't have preference pairs                                         ║
║    ✓ Human-aligned loss function is important                                ║
║                                                                              ║
╠══════════════════════════════════════════════════════════════════════════════╣
║  DGX Spark Recommendations:                                                  ║
║    • 8B models: Any method works well                                        ║
║    • 70B models: Use ORPO for memory efficiency                              ║
║    • Maximum quality: Use SimPO                                              ║
╚══════════════════════════════════════════════════════════════════════════════╝
""")

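The decision guide can also be written as a small function, which makes the order of the checks explicit. This is a deliberately simplified sketch: `suggest_method` and its four boolean parameters are hypothetical, and real projects usually weigh more factors than these.

```python
def suggest_method(
    has_pairs: bool,
    has_sft_model: bool,
    memory_constrained: bool,
    prioritize_quality: bool,
) -> str:
    """Map the decision guide onto a single suggestion (simplified)."""
    if not has_pairs:
        return "KTO"    # only thumbs up/down feedback, no preference pairs
    if not has_sft_model:
        return "ORPO"   # single stage: SFT and preference learning together
    if memory_constrained:
        return "ORPO"   # no reference model to hold in memory
    if prioritize_quality:
        return "SimPO"  # reference-free with length normalization
    return "DPO"        # well-tested default when memory allows

print(suggest_method(has_pairs=True, has_sft_model=True,
                     memory_constrained=False, prioritize_quality=True))
print(suggest_method(has_pairs=False, has_sft_model=True,
                     memory_constrained=False, prioritize_quality=False))
```

The ordering is a design choice: data format is checked first because it rules methods out entirely, while memory and quality are preferences among the methods that remain.
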
---

## Part 6: Memory Comparison

In [None]:
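def estimate_memory(model_size_gb: float, method: str) -> Dict:
    """
    Roughly estimate memory requirements for different preference methods.

    Counts weights, optimizer states, and gradients only. Activations and
    framework overhead are ignored, and the 0.1 factors are coarse LoRA guesses.
    """
    estimates = {
        "dpo": {
            "policy_model": model_size_gb,
            # Separate frozen reference copy. With LoRA, TRL's DPOTrainer can instead
            # reuse the base weights (adapters disabled) when ref_model is None.
            "reference_model": model_size_gb,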
            "optimizer_states": model_size_gb * 0.1,  # LoRA params only
            "gradients": model_size_gb * 0.1,
        },
        "simpo": {
            "policy_model": model_size_gb,
            "reference_model": 0,  # No reference needed!
            "optimizer_states": model_size_gb * 0.1,
            "gradients": model_size_gb * 0.1,
        },
        "orpo": {
            "policy_model": model_size_gb,
            "reference_model": 0,  # No reference needed!
            "optimizer_states": model_size_gb * 0.1,
            "gradients": model_size_gb * 0.1,
        },
    }
    
    result = estimates[method]
    result["total"] = sum(result.values())
    return result

# Compare for 70B model (in 4-bit = ~35GB)
model_size = 35.0

print("Memory Comparison for 70B Model (4-bit):")
print("="*50)

for method in ["dpo", "simpo", "orpo"]:
    mem = estimate_memory(model_size, method)
    print(f"\n{method.upper()}:")
    print(f"  Policy model:    {mem['policy_model']:.1f} GB")
    print(f"  Reference model: {mem['reference_model']:.1f} GB")
    print(f"  Optimizer:       {mem['optimizer_states']:.1f} GB")
    print(f"  Gradients:       {mem['gradients']:.1f} GB")
    print(f"  ─────────────────────")
    print(f"  TOTAL:           {mem['total']:.1f} GB")

print("\n" + "="*50)
print(f"\nDGX Spark capacity: 128 GB")
print(f"DPO:   {'✅ Fits' if estimate_memory(model_size, 'dpo')['total'] < 128 else '❌ Too big'}")
print(f"SimPO: {'✅ Fits' if estimate_memory(model_size, 'simpo')['total'] < 128 else '❌ Too big'}")
print(f"ORPO:  {'✅ Fits' if estimate_memory(model_size, 'orpo')['total'] < 128 else '❌ Too big'}")
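
The `model_size = 35.0` figure used above comes from simple arithmetic on parameter count and bits per parameter. Here is a minimal sketch of that calculation. `weights_gb` is a hypothetical helper that counts weights only, ignoring quantization constants, activations, and any KV cache, and it uses 1 GB = 1e9 bytes.

```python
def weights_gb(num_params_billion: float, bits_per_param: int) -> float:
    """Approximate weight storage in GB (1 GB = 1e9 bytes)."""
    return num_params_billion * bits_per_param / 8

CAPACITY_GB = 128  # capacity figure used in the comparison above

for params in (8, 70):
    for bits in (16, 8, 4):
        size = weights_gb(params, bits)
        verdict = "fits" if size < CAPACITY_GB else "too big"
        print(f"{params}B @ {bits}-bit: {size:6.1f} GB ({verdict}, weights only)")
```

A second full copy of the weights doubles these numbers, which is why dropping the reference model matters most at the 70B scale.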

---

## Checkpoint

You've learned:
- ✅ SimPO eliminates reference model and adds length normalization
- ✅ ORPO combines SFT + preference in one stage
- ✅ When to use each method
- ✅ Memory tradeoffs for 70B models

---

## Further Reading

- [SimPO Paper](https://arxiv.org/abs/2405.14734) - Simple Preference Optimization
- [ORPO Paper](https://arxiv.org/abs/2403.07691) - Odds Ratio Preference Optimization
- [TRL Documentation](https://huggingface.co/docs/trl)

---

## Cleanup

In [None]:
# Clear memory
del model, orpo_trainer
torch.cuda.empty_cache()
gc.collect()

print("Cleanup complete!")

---

## Next Steps

Continue to:

**[Lab 3.1.9: KTO Binary Feedback](lab-3.1.9-kto-binary-feedback.ipynb)** - Learn to train with just thumbs up/down data!