## Case Study: Preference-Driven Optimization in NLP

Modern AI models are often optimized against human preference or satisfaction signals rather than simple supervised targets. For example, dialog systems or summarizers may be tuned using human ratings or pairwise preferences (e.g. “response A is better than B”). The standard approach has been Reinforcement Learning from Human Feedback (RLHF): one first trains a reward model on pairwise human judgments, then uses policy optimization (e.g. PPO) to fine-tune a language model against that reward. However, RLHF can be complex and unstable to train. Recent methods like Direct Preference Optimization (DPO) sidestep explicit RL: DPO fits the language model to the preferences directly with a classification loss, while still implicitly solving the same KL-regularized reward-maximization objective.
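To make the reward-model step of RLHF concrete, here is a minimal sketch of the pairwise (Bradley-Terry) loss commonly used to fit a reward model on human judgments. The function name `pairwise_reward_loss` and the scalar rewards are hypothetical, made-up values; a real reward model would produce these scores from a neural network.

```python
import math

def pairwise_reward_loss(r_chosen: float, r_rejected: float) -> float:
    """Negative log-likelihood that the chosen response beats the rejected one
    under a Bradley-Terry model: P(chosen > rejected) = sigmoid(r_c - r_r)."""
    margin = r_chosen - r_rejected
    # Numerically stable -log(sigmoid(margin))
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))

# Illustrative scores only (not from a trained model)
loss_agree = pairwise_reward_loss(2.0, 0.5)     # reward model agrees with the label
loss_tie = pairwise_reward_loss(1.0, 1.0)       # no preference: loss equals log(2)
loss_disagree = pairwise_reward_loss(0.5, 2.0)  # reward model contradicts the label
```

The loss shrinks as the reward model scores the human-preferred response higher than the rejected one, which is the signal the later policy-optimization stage then tries to maximize.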

![image.png](images/dpo.png)

*Figure: RLHF vs DPO ([Rafailov et al., 2023](https://arxiv.org/pdf/2305.18290))*


## Direct Preference Optimization (DPO) for Language Models

DPO is designed to fine-tune a pretrained language model (LM) on pairwise preference data, where each example is a prompt plus two responses, one labeled “chosen” and the other “rejected”, without a separate reward-model or RL stage. Conceptually, DPO raises the log-likelihood of the preferred response relative to the inferior one. Suppose a prompt $x$ has two candidate completions $y^+$ (preferred) and $y^-$ (rejected). DPO posits a Bradley–Terry (logistic) model over the pair, and optimizes:

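$$
L(\theta) = -\log \sigma\!\left(\beta \left[ \log \frac{P_\theta(y^+ \mid x)}{P_{\mathrm{ref}}(y^+ \mid x)} - \log \frac{P_\theta(y^- \mid x)}{P_{\mathrm{ref}}(y^- \mid x)} \right]\right),
$$

where $\sigma$ is the sigmoid function, $P_{\mathrm{ref}}$ is a frozen reference model (typically the supervised fine-tuned LM that training starts from), and $\beta > 0$ is the KL-penalty coefficient of the underlying RLHF objective: larger $\beta$ keeps the policy closer to the reference. In words, this loss increases the LM’s probability of the preferred response, measured relative to the reference, and decreases it for the rejected one. Although it is only a binary cross-entropy on a difference of log-probability ratios, DPO’s derivation shows that minimizing it solves the same constrained RLHF objective (maximizing reward under a KL constraint), because the optimal policy for that objective has a closed form that lets the reward be expressed through the policy itself.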
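The closed-form argument can be sketched in a few steps, writing $r(x, y)$ for the reward and $P_{\mathrm{ref}}$ for the frozen reference model (notation assumed here, following the DPO paper). The KL-constrained RLHF objective is

$$
\max_{\theta}\; \mathbb{E}_{x,\; y \sim P_\theta(\cdot \mid x)}\big[ r(x, y) \big] - \beta\, \mathrm{KL}\big( P_\theta(\cdot \mid x) \,\|\, P_{\mathrm{ref}}(\cdot \mid x) \big),
$$

whose maximizer is

$$
P^*(y \mid x) = \frac{1}{Z(x)}\, P_{\mathrm{ref}}(y \mid x)\, \exp\!\big( r(x, y) / \beta \big).
$$

Solving for the reward gives

$$
r(x, y) = \beta \log \frac{P^*(y \mid x)}{P_{\mathrm{ref}}(y \mid x)} + \beta \log Z(x).
$$

Under the Bradley-Terry model, $P(y^+ \succ y^- \mid x) = \sigma\big(r(x, y^+) - r(x, y^-)\big)$, so the intractable $\beta \log Z(x)$ term cancels in the difference:

$$
P(y^+ \succ y^- \mid x) = \sigma\!\left( \beta \log \frac{P^*(y^+ \mid x)}{P_{\mathrm{ref}}(y^+ \mid x)} - \beta \log \frac{P^*(y^- \mid x)}{P_{\mathrm{ref}}(y^- \mid x)} \right).
$$

Replacing $P^*$ with the trainable $P_\theta$ and taking the negative log-likelihood of the observed preference yields the DPO loss, with each log-probability measured relative to the reference model.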

**Figure 1:** Workflow comparison of RLHF versus DPO. RLHF (top) first fits a reward model and then uses policy optimization (e.g. PPO) to tune the language model, whereas DPO (bottom) directly fine-tunes the model with a binary cross-entropy loss on preference pairs.

In practice, we gather or reuse a preference dataset (e.g. the Anthropic HH-RLHF data, which provides “chosen” and “rejected” responses). We first ensure the LM is in-distribution for this data (e.g. by supervised fine-tuning on the prompts with their preferred responses), then run DPO training by maximizing $\log \sigma(\Delta)$ with respect to $\theta$, where $\Delta$ is the $\beta$-scaled margin inside the sigmoid of the loss above. The DPO paper reports that this yields models that match or exceed PPO-trained models on tasks such as sentiment steering and summarization while being simpler to train; the Hugging Face TRL library provides an implementation.
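As a rough illustration of what one training step computes, the sketch below sums per-token log-probabilities over the response tokens to obtain each sequence log-probability, then forms the margin $\Delta$ and the loss $-\log \sigma(\Delta)$. It assumes the standard DPO formulation, in which policy log-probabilities are measured relative to a frozen reference model. The helper names and the token log-probabilities are hypothetical, made-up values, not output from a real model.

```python
import math

def sequence_logprob(token_logprobs, response_mask):
    """Sum token log-probs over response positions only (mask == 1);
    prompt tokens are excluded because the loss conditions on the prompt."""
    return sum(lp for lp, m in zip(token_logprobs, response_mask) if m)

def dpo_loss(policy_chosen, policy_rejected, ref_chosen, ref_rejected, beta=0.1):
    """-log sigmoid(delta), where delta is the beta-scaled difference of
    policy-vs-reference log-ratios for the chosen and rejected responses."""
    delta = beta * ((policy_chosen - ref_chosen) - (policy_rejected - ref_rejected))
    # Numerically stable -log(sigmoid(delta))
    if delta >= 0:
        return math.log1p(math.exp(-delta))
    return -delta + math.log1p(math.exp(delta))

# Made-up per-token log-probs: two prompt tokens followed by three response tokens
mask = [0, 0, 1, 1, 1]
policy_chosen = sequence_logprob([-1.0, -0.5, -0.2, -0.3, -0.1], mask)    # about -0.6
policy_rejected = sequence_logprob([-1.0, -0.5, -0.9, -1.1, -0.8], mask)  # about -2.8
ref_chosen = sequence_logprob([-1.0, -0.5, -0.4, -0.5, -0.3], mask)       # about -1.2
ref_rejected = sequence_logprob([-1.0, -0.5, -0.7, -0.9, -0.6], mask)     # about -2.2

loss = dpo_loss(policy_chosen, policy_rejected, ref_chosen, ref_rejected, beta=0.1)
# When the policy equals the reference, the margin is zero
loss_at_init = dpo_loss(ref_chosen, ref_rejected, ref_chosen, ref_rejected, beta=0.1)
```

When the policy equals the reference (as at the start of training), the margin is zero and the loss is $\log 2$. The loss falls below that value once the policy raises the chosen response's log-ratio above the rejected one's, as in the made-up numbers here.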

## Implementation (PyTorch / Hugging Face)

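We can use the Hugging Face `transformers` and `trl` libraries. Below is a schematic snippet (assumes `trl` and `transformers` are installed; `preference_dataset` is a placeholder for a dataset with `prompt`, `chosen`, and `rejected` fields):

```python
from trl import DPOConfig, DPOTrainer
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("gpt2")
tokenizer = AutoTokenizer.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token  # GPT-2 ships without a pad token

# In recent TRL releases, beta is set on DPOConfig rather than on the trainer
training_args = DPOConfig(output_dir="gpt2_dpo", beta=0.1)

trainer = DPOTrainer(
    model=model,
    args=training_args,
    train_dataset=preference_dataset,  # placeholder: prompt / chosen / rejected
    processing_class=tokenizer,  # named `tokenizer` in older TRL releases
)

trainer.train()
```

A fuller version that loads the Anthropic HH-RLHF helpfulness data:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import DPOTrainer, DPOConfig
from datasets import load_dataset

# Load a pretrained LM and tokenizer
model_name = "gpt2-medium"
model = AutoModelForCausalLM.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token  # GPT-2 has no pad token by default

# Load a preference dataset. HH-RLHF stores full dialogues in 'chosen' and
# 'rejected', so split off the shared prompt (everything up to the final
# "\n\nAssistant:") to get explicit 'prompt', 'chosen', 'rejected' fields.
prefs = load_dataset("Anthropic/hh-rlhf", data_dir="helpful-base", split="train")

def split_prompt(example):
    marker = "\n\nAssistant:"
    cut = example["chosen"].rfind(marker) + len(marker)
    return {
        "prompt": example["chosen"][:cut],
        "chosen": example["chosen"][cut:],
        "rejected": example["rejected"][cut:],
    }

prefs = prefs.map(split_prompt)

# Configure DPO training (output dir, batch size, beta, etc.)
dpo_config = DPOConfig(
    output_dir="gpt2_dpo_demo",
    num_train_epochs=3,
    per_device_train_batch_size=16,
    beta=0.5,  # weight of the implicit KL penalty toward the reference (tunable)
)

# Initialize the DPO trainer; with no ref_model given, TRL uses a frozen copy
# of the initial model as the reference
trainer = DPOTrainer(
    model=model,
    args=dpo_config,
    train_dataset=prefs,
    processing_class=tokenizer,  # named `tokenizer` in older TRL releases
)

# Run fine-tuning
trainer.train()
```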
