# 🎯 DPO Training Guide

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/Gaurav14cs17/LLMs_Model/blob/main/Fine-Tuning-LLMs-Guide/notebooks/04_dpo_training.ipynb)

**Direct Preference Optimization - RLHF without Reward Models!**

### 🔥 Why DPO?
- **No reward model needed** (unlike PPO/RLHF)
- **Simpler training** - just preference pairs
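- **More stable** than PPO-based RLHF: one classification-style loss, with no sampling from the model during training
- **Competitive alignment**: the DPO paper reports matching or exceeding PPO-based RLHF on its sentiment, summarization, and dialogue experiments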

### 📊 Required Data Format
```python
{
    "prompt": "What is the capital of France?",
    "chosen": "Paris is the capital of France.",  # Preferred response
    "rejected": "France is in Europe."            # Less preferred
}
```
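
Before training, it can help to check that every record really has this shape. The sketch below is a minimal validator; `validate_preference_record` and `REQUIRED_KEYS` are hypothetical names introduced here for illustration, not part of TRL or `datasets`.

```python
REQUIRED_KEYS = ("prompt", "chosen", "rejected")  # hypothetical constant for this sketch


def validate_preference_record(record):
    """Return a list of problems found in one preference record (empty if none)."""
    problems = []
    for key in REQUIRED_KEYS:
        value = record.get(key)
        if not isinstance(value, str) or not value.strip():
            problems.append(f"'{key}' must be a non-empty string")
    # Identical responses carry no preference signal for DPO to learn from
    if not problems and record["chosen"].strip() == record["rejected"].strip():
        problems.append("'chosen' and 'rejected' are identical")
    return problems


good = {
    "prompt": "What is the capital of France?",
    "chosen": "Paris is the capital of France.",
    "rejected": "France is in Europe.",
}
bad = {"prompt": "Hi", "chosen": "Hello!", "rejected": "Hello!"}

print(validate_preference_record(good))  # []
print(validate_preference_record(bad))
```

Running a check like this over a sample of the dataset catches missing fields and degenerate pairs early, before any GPU time is spent.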

**⚠️ Requirements**: GPU with 16GB+ VRAM


In [None]:
# Install and import
!pip install -q transformers datasets accelerate peft bitsandbytes trl

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import DPOTrainer, DPOConfig
from datasets import load_dataset

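# Fail gracefully when no GPU is attached instead of raising an error
if torch.cuda.is_available():
    print(f"GPU: {torch.cuda.get_device_name(0)}")
else:
    print("No GPU detected; DPO training below needs a CUDA GPU.")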


In [None]:
# DPO requires preference pairs:
# - prompt: The input question
# - chosen: The preferred response  
# - rejected: The less preferred response

# Load preference dataset
dataset = load_dataset("Anthropic/hh-rlhf", split="train[:2000]")
print(f"Dataset: {len(dataset)} samples")

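# DPO config
# Recent TRL releases expect beta and the length limits on DPOConfig
# rather than as DPOTrainer keyword arguments.
dpo_config = DPOConfig(
    output_dir="./dpo-checkpoints",  # assumed path for checkpoints and logs
    beta=0.1,  # DPO temperature: strength of the pull toward the reference model
    learning_rate=5e-5,
    per_device_train_batch_size=4,
    max_steps=500,
    max_length=512,         # max tokens for prompt + response
    max_prompt_length=256,  # max tokens for the prompt alone
)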


In [None]:
# Load base model for DPO
MODEL_NAME = "microsoft/phi-2"

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
tokenizer.pad_token = tokenizer.eos_token

model = AutoModelForCausalLM.from_pretrained(
    MODEL_NAME,
    torch_dtype=torch.float16,
    device_map="auto"
)

# Reference model (frozen copy for DPO)
ref_model = AutoModelForCausalLM.from_pretrained(
    MODEL_NAME,
    torch_dtype=torch.float16,
    device_map="auto"
)

print("✅ Models loaded!")
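
To see why the 16GB+ VRAM requirement is not generous, here is a rough back-of-the-envelope estimate of the memory taken by the weights alone. It assumes phi-2's published size of roughly 2.7 billion parameters and 2 bytes per parameter for float16; `weight_memory_gb` is a hypothetical helper, and the estimate deliberately ignores gradients, optimizer state, and activations, which add substantially more during training.

```python
def weight_memory_gb(num_params, bytes_per_param):
    """Approximate memory for model weights only, in gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bytes_per_param / 1e9


PHI2_PARAMS = 2.7e9  # approximate parameter count of microsoft/phi-2
FP16_BYTES = 2       # float16 stores each parameter in 2 bytes

per_model_gb = weight_memory_gb(PHI2_PARAMS, FP16_BYTES)
both_models_gb = 2 * per_model_gb  # policy model + frozen reference model

print(f"One model:  {per_model_gb:.1f} GB")    # One model:  5.4 GB
print(f"Two models: {both_models_gb:.1f} GB")  # Two models: 10.8 GB
```

Holding a full reference copy roughly doubles the weight footprint, which is the main memory cost DPO adds over plain supervised fine-tuning.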


In [None]:
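# Prepare DPO dataset format
def format_hh_rlhf(sample):
    """Format an Anthropic HH-RLHF sample for DPO"""
    marker = "\n\nAssistant:"
    # Split at the LAST assistant turn so multi-turn dialogues keep their
    # earlier turns in the prompt. In HH-RLHF the chosen and rejected
    # transcripts are expected to share everything before the final reply.
    prompt, _, chosen = sample["chosen"].rpartition(marker)
    _, _, rejected = sample["rejected"].rpartition(marker)
    return {"prompt": prompt + marker, "chosen": chosen, "rejected": rejected}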

# Apply formatting
train_dataset = dataset.map(format_hh_rlhf)
print(f"Training samples: {len(train_dataset)}")
print(f"Sample prompt: {train_dataset[0]['prompt'][:100]}...")
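
HH-RLHF transcripts are often multi-turn, so where the transcript is split matters. The sketch below uses a made-up two-turn transcript and a hypothetical helper, `split_last_turn`, to show that splitting at the last `"\n\nAssistant:"` marker keeps the earlier turns in the prompt and isolates only the final reply.

```python
MARKER = "\n\nAssistant:"


def split_last_turn(transcript):
    """Split a transcript at its final assistant turn into (prompt, response)."""
    head, sep, tail = transcript.rpartition(MARKER)
    if not sep:
        raise ValueError("transcript contains no assistant turn")
    return head + MARKER, tail


# Made-up transcript in the HH-RLHF "Human:/Assistant:" style
transcript = (
    "\n\nHuman: Hi"
    "\n\nAssistant: Hello! How can I help?"
    "\n\nHuman: Name a fruit."
    "\n\nAssistant: An apple."
)

prompt, response = split_last_turn(transcript)
print(repr(prompt))
print(repr(response))  # ' An apple.'
```

Splitting at the first marker instead would cut the prompt off after "Human: Hi" and silently drop the context that the final reply depends on.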


In [None]:
# DPO Training
# beta and the length limits are DPOConfig fields on recent TRL releases,
# so they are not repeated here.
dpo_trainer = DPOTrainer(
    model=model,
    ref_model=ref_model,
    args=dpo_config,
    train_dataset=train_dataset,
    processing_class=tokenizer,  # on TRL < 0.12, pass tokenizer=tokenizer instead
)

print("🚀 Starting DPO training...")
dpo_trainer.train()
print("✅ DPO training complete!")


In [None]:
# Save DPO-trained model
OUTPUT_DIR = "./dpo-trained-model"
model.save_pretrained(OUTPUT_DIR)
tokenizer.save_pretrained(OUTPUT_DIR)

# Test the aligned model
def generate(prompt, max_tokens=150):
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    outputs = model.generate(
        **inputs, max_new_tokens=max_tokens, temperature=0.7, do_sample=True,
        pad_token_id=tokenizer.eos_token_id,  # pad token was set to EOS above
    )
    return tokenizer.decode(outputs[0], skip_special_tokens=True)

print("🤖 Testing DPO-aligned model:")
test_prompt = "Human: How can I be more productive?\n\nAssistant:"
print(generate(test_prompt))


## 📊 DPO vs Other RLHF Methods

| Method | Reward Model | Complexity | Stability | Memory |
|--------|-------------|------------|-----------|--------|
| **DPO** | ❌ No | ⭐ Simple | ⭐⭐⭐ High | ⭐⭐ Medium |
| PPO | ✅ Yes | ⭐⭐⭐ Complex | ⭐ Low | ⭐⭐⭐ High |
| RLHF | ✅ Yes | ⭐⭐⭐ Complex | ⭐⭐ Medium | ⭐⭐⭐ High |
| ORPO | ❌ No | ⭐ Simple | ⭐⭐⭐ High | ⭐ Low |

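## 🎯 Key DPO Hyperparameters

- **beta (β)**: Controls how strongly the policy is held to the reference model while it learns the preferences (0.1-0.5 typical)
  - Lower = more aggressive preference learning, with more drift from the reference
  - Higher = more conservative, stays closer to the reference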
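
To make the role of β concrete, here is a minimal sketch of the per-pair DPO loss from the DPO paper, written in plain Python. The function name `dpo_loss` and the log-probability values are illustrative assumptions; in practice the log-probabilities are summed token log-probs from the policy and reference models, and TRL computes this internally.

```python
import math


def dpo_loss(policy_chosen_lp, policy_rejected_lp, ref_chosen_lp, ref_rejected_lp, beta):
    """Per-pair DPO loss: -log(sigmoid(beta * margin)).

    margin is how much more the policy favors the chosen response over the
    rejected one, relative to the reference model.
    """
    margin = (policy_chosen_lp - ref_chosen_lp) - (policy_rejected_lp - ref_rejected_lp)
    # -log(sigmoid(x)) equals log(1 + exp(-x))
    return math.log1p(math.exp(-beta * margin))


# Illustrative sequence log-probabilities (made-up numbers)
example = dict(
    policy_chosen_lp=-12.0, policy_rejected_lp=-15.0,
    ref_chosen_lp=-13.0, ref_rejected_lp=-14.0,
)

for beta in (0.1, 0.5):
    print(f"beta={beta}: loss={dpo_loss(beta=beta, **example):.4f}")
```

Because β multiplies the margin inside the sigmoid, the same shift away from the reference lowers the loss more when β is larger. A larger β therefore needs less drift to satisfy a preference, which is one way to read "higher = stays closer to the reference".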

## 📚 References
- [DPO Paper](https://arxiv.org/abs/2305.18290)
- [A Comprehensive Guide to Fine-Tuning LLMs](https://arxiv.org/html/2408.13296v1)
