# Direct Preference Optimization (DPO) Training

## Overview
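This notebook fine-tunes a small instruction-tuned language model with Direct Preference Optimization (DPO).
DPO trains directly on pairs of preferred and rejected responses: it raises the likelihood of the preferred response relative to a frozen reference model, without training a separate reward model.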
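
To make the objective concrete, here is a minimal sketch of the per-pair DPO loss in plain Python. The function name `dpo_loss` and the log-probability values are illustrative assumptions, not part of TRL; the formula is the standard `-log(sigmoid(beta * margin))`. When the policy still equals the reference model the margin is zero and the loss is ln 2 ≈ 0.6931, which is why the training loss logged later in this notebook starts close to that value.

```python
import math

def dpo_loss(policy_chosen, policy_rejected, ref_chosen, ref_rejected, beta=0.2):
    """Per-pair DPO loss from summed sequence log-probabilities."""
    # Implicit reward margin: how much more the policy favors the chosen
    # response than the frozen reference model does.
    margin = (policy_chosen - ref_chosen) - (policy_rejected - ref_rejected)
    # -log(sigmoid(beta * margin)), written as log(1 + exp(-x)).
    return math.log1p(math.exp(-beta * margin))

# Before any update the policy equals the reference, so the margin is 0.
initial = dpo_loss(-12.0, -9.0, -12.0, -9.0)
# Hypothetical later state: the policy raised the chosen log-prob by 2 nats.
improved = dpo_loss(-10.0, -9.0, -12.0, -9.0)
print(f"initial={initial:.4f} improved={improved:.4f}")
```

A larger `beta` makes the same margin count for more, so the loss falls faster as the policy moves away from the reference.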

## Step 1: Install Required Packages

In [None]:
import os
os.environ["WANDB_DISABLED"] = "true"       # disable Weights & Biases logging (deprecated; report_to="none" in Step 8 does the same job)
os.environ["HF_HUB_DISABLE_TELEMETRY"] = "1"
os.environ["TOKENIZERS_PARALLELISM"] = "false"

In [None]:

import random, numpy as np, torch
random.seed(0); np.random.seed(0); torch.manual_seed(0);
if torch.cuda.is_available():
    torch.cuda.manual_seed_all(0)

In [None]:
# Install all necessary libraries
# This may take 2-3 minutes
!pip install -q transformers datasets accelerate trl peft bitsandbytes sentencepiece protobuf

## Step 2: Import Libraries and Check GPU

In [None]:
# ---------- Imports ----------
from datasets import Dataset
from transformers import (
    AutoTokenizer,
    AutoModelForCausalLM,
)
from trl import DPOConfig, DPOTrainer
from peft import LoraConfig, get_peft_model

# ---------- Device ----------
device = "cuda" if torch.cuda.is_available() else "cpu"
print("device:", device)
if device == "cuda":
    print("GPU:", torch.cuda.get_device_name(0))

Using the `WANDB_DISABLED` environment variable is deprecated and will be removed in v5. Use the --report_to flag to control the integrations used for logging result (for instance --report_to none).


device: cuda
GPU: Tesla T4


## Step 3: Choose Your Model


In [None]:
# ============================================
# SELECT MODEL
# ============================================
MODEL_NAME = "HuggingFaceTB/SmolLM2-360M-Instruct"

# Optional swaps for later. Gemma and Llama are gated (see Step 4) and need more VRAM;
# gpt2 is open but needs different LoRA target modules (see Step 7).
# MODEL_NAME = "gpt2"
# MODEL_NAME = "google/gemma-2-2b-it"
# MODEL_NAME = "meta-llama/Llama-3.2-3B-Instruct"

print("Selected model:", MODEL_NAME)

Selected model: HuggingFaceTB/SmolLM2-360M-Instruct


## Step 4: Hugging Face Authentication (Only for Gemma/Llama)

**Skip this step if you are using the default SmolLM2 model or GPT-2.**

For Gemma or Llama models:
1. Go to https://huggingface.co/settings/tokens
2. Create a token with read access
3. For Llama: Accept the license at https://huggingface.co/meta-llama/Llama-3.2-3B-Instruct
4. For Gemma: Accept the license at https://huggingface.co/google/gemma-2-2b-it
5. Run the cell below and paste your token

In [None]:
# Only run this cell for gated models (Gemma or Llama)
# Skip it if using SmolLM2 or GPT-2

from huggingface_hub import login

# Paste your Hugging Face token here or use the popup
login()  # This will prompt for your token


## Step 5: Create Preference Dataset

DPO needs three fields per example:
- **Prompt** (`prompt`): The question or instruction
- **Chosen** (`chosen`): The good/preferred response
- **Rejected** (`rejected`): The bad/non-preferred response
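
As an illustration of that record shape, the sketch below checks a single preference pair before it goes into a dataset. The helper `check_pair` and the constant `REQUIRED_KEYS` are hypothetical names introduced here; only the three field names come from this notebook.

```python
REQUIRED_KEYS = ("prompt", "chosen", "rejected")

def check_pair(row):
    """Return a list of problems found in one preference record (empty if none)."""
    problems = []
    for key in REQUIRED_KEYS:
        value = row.get(key)
        if not isinstance(value, str) or not value.strip():
            problems.append(f"missing or empty field: {key}")
    # A pair carries no preference signal if both responses are the same.
    if not problems and row["chosen"].strip() == row["rejected"].strip():
        problems.append("chosen and rejected are identical")
    return problems

example = {
    "prompt": "Give me a polite greeting.",
    "chosen": "Hello! How are you doing today?",
    "rejected": "Hi.",
}
print(check_pair(example))  # an empty list means the record is usable
```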

In [None]:
import random
random.seed(0)

# richer intents + many paraphrases
GOOD = {
  "greet": [
    "Hello! How are you doing today?",
    "Hi there! It’s nice to meet you.",
    "Good morning! Hope you’re well.",
    "Hey! Great to see you."
  ],
  "thanks": [
    "Thank you so much for your help!",
    "I really appreciate your support.",
    "Thanks a lot—this means a lot to me.",
    "Many thanks for your assistance!"
  ],
  "goodbye": [
    "Goodbye! Have a wonderful day!",
    "See you later—take care!",
    "Bye for now! Wishing you well.",
    "Farewell! Hope to see you soon."
  ],
  "compliment": [
    "You did an amazing job on this!",
    "That’s impressive work—well done!",
    "Fantastic effort—you nailed it!",
    "Great job! Your work really stands out."
  ],
  "apology": [
    "I sincerely apologize for that.",
    "I’m sorry—I take responsibility.",
    "My apologies for the mistake.",
    "I’m truly sorry for the inconvenience."
  ]
}

# “hard negatives” = plausible but weaker/flat/abrupt alternatives
BAD = {
  "greet": [
    "Hello.", "Hi.", "Hey.", "Yo."
  ],
  "thanks": [
    "Thanks.", "Thx.", "Ok thanks.", "K thx."
  ],
  "goodbye": [
    "Bye.", "Later.", "See ya.", "k bye."
  ],
  "compliment": [
    "Good work.", "Not bad.", "Nice.", "Decent."
  ],
  "apology": [
    "Sorry.", "My bad.", "Oops.", "Whatever, sorry."
  ]
}

PROMPTS = {
  "greet": [
    "How should I greet someone?",
    "Give me a polite greeting.",
    "What’s a friendly way to say hello?",
    "How do I start a conversation nicely?"
  ],
  "thanks": [
    "How do I say thanks?",
    "Suggest a polite way to express gratitude.",
    "Give me a warm thank-you message.",
    "What’s a heartfelt way to say thank you?"
  ],
  "goodbye": [
    "How do I say goodbye politely?",
    "Suggest a friendly farewell.",
    "What’s a nice way to end a conversation?",
    "Give me a positive farewell message."
  ],
  "compliment": [
    "How do I compliment someone?",
    "Give me a short, strong compliment.",
    "Suggest a nice compliment for good work.",
    "What’s a motivating compliment?"
  ],
  "apology": [
    "How should I apologize?",
    "Give me a sincere apology sentence.",
    "What’s a respectful way to say sorry?",
    "Suggest a heartfelt apology."
  ]
}

rows = []
for intent, prompts in PROMPTS.items():
    for p in prompts:
        for pos in GOOD[intent]:
            # pick a *hard* negative close in meaning
            neg = random.choice(BAD[intent])
            rows.append({"prompt": p, "chosen": pos, "rejected": neg})

# Shuffle, then cap the dataset size. 5 intents x 4 prompts x 4 good responses
# gives only 80 pairs, so the 600 cap has no effect here.
random.shuffle(rows)
rows = rows[:600]

train_dataset = Dataset.from_list(rows)
print("Preference pairs:", len(train_dataset))


Preference pairs: 80
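
The count of 80 follows from the loop structure: one sampled negative per (prompt, good response) combination. The sketch below reproduces that arithmetic with toy stand-in dictionaries of the same shape (the placeholder strings are assumptions, not the notebook's data) and compares it with pairing every good response against every bad one, which is one possible way to get more pairs from the same material.

```python
import itertools
import random

# Toy stand-ins with the same shape as PROMPTS / GOOD / BAD:
# 5 intents, 4 entries per intent in each dictionary.
intents = ["greet", "thanks", "goodbye", "compliment", "apology"]
prompts = {i: [f"{i}-prompt-{k}" for k in range(4)] for i in intents}
good = {i: [f"{i}-good-{k}" for k in range(4)] for i in intents}
bad = {i: [f"{i}-bad-{k}" for k in range(4)] for i in intents}

# One random negative per (prompt, good) combination, as in the cell above.
rng = random.Random(0)
sampled = [
    {"prompt": p, "chosen": g, "rejected": rng.choice(bad[i])}
    for i in intents
    for p, g in itertools.product(prompts[i], good[i])
]

# Every (prompt, good, bad) combination instead of one sampled negative.
exhaustive = [
    {"prompt": p, "chosen": g, "rejected": b}
    for i in intents
    for p, g, b in itertools.product(prompts[i], good[i], bad[i])
]

print(len(sampled), len(exhaustive))  # 80 320
```

Even the exhaustive variant stays below 600, so reaching that target would need more prompts or responses, not just a different pairing.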


In [None]:
# split 90/10
split = int(0.9 * len(train_dataset))
val_dataset = train_dataset.select(range(split, len(train_dataset)))
train_dataset = train_dataset.select(range(split))

print("Train:", len(train_dataset), "Val:", len(val_dataset))

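# helper: average per-token log-prob of a sequence under the model.
# NOTE: uses `tokenizer` and `model`, which are created in Step 6, so run
# Step 6 before this cell.
import torch

@torch.no_grad()
def seq_logprob(text):
    enc = tokenizer(text, return_tensors="pt").to(model.device)
    out = model(**enc, labels=enc.input_ids)
    # out.loss is the mean cross-entropy over all tokens (prompt included),
    # so its negative is the average log-prob per token
    return -out.loss.item()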

@torch.no_grad()
def dpo_pref_accuracy(dataset, n=50):
    # sample n pairs
    idxs = list(range(len(dataset)))
    random.shuffle(idxs); idxs = idxs[:n]
    correct = 0
    for i in idxs:
        row = dataset[i]
        lp_ch = seq_logprob(row["prompt"] + " " + row["chosen"])
        lp_rj = seq_logprob(row["prompt"] + " " + row["rejected"])
        if lp_ch > lp_rj:
            correct += 1
    return correct / max(1, len(idxs))

print("Pre-train val pref-acc (est):", dpo_pref_accuracy(val_dataset, n=80))


`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`.


Train: 72 Val: 8
Pre-train val pref-acc (est): 1.0
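
The accuracy above compares average log-probability over the whole sequence, prompt included. DPO's implicit reward is instead built from summed log-probabilities of the response tokens (relative to the reference model), and the two scoring rules can rank a pair differently when the responses differ in length. The sketch below uses made-up per-token log-probabilities and hypothetical helper names to show such a disagreement; it is not a replacement for the metric above.

```python
def mean_over_all_tokens(token_logprobs):
    """Average log-prob over prompt and response tokens.

    Roughly what `-out.loss` reports, ignoring the one-token shift.
    """
    return sum(token_logprobs) / len(token_logprobs)

def sum_over_response(token_logprobs, prompt_len):
    """Summed log-prob of the response tokens only."""
    return sum(token_logprobs[prompt_len:])

# Illustrative numbers only: 3 shared prompt tokens followed by the response.
prompt = [-2.0, -1.5, -1.0]
chosen = prompt + [-0.5] * 6     # longer reply, each token fairly likely
rejected = prompt + [-1.2]       # one-token reply

print("mean, all tokens:", mean_over_all_tokens(chosen), mean_over_all_tokens(rejected))
print("sum, response   :", sum_over_response(chosen, 3), sum_over_response(rejected, 3))
```

With these numbers the averaged score favors the longer chosen reply, while the summed response score favors the short rejected one, so the estimate should be read as a rough sanity check.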


## Step 6: Load Model and Tokenizer

We'll load the model with 4-bit NF4 quantization when a GPU is available. This saves memory and lets larger models fit on a free Colab GPU.
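
A rough back-of-the-envelope estimate shows what quantization buys. The sketch assumes about 360 million parameters (inferred from the model name, not measured) and counts weight storage only; it ignores quantization constants, activations, optimizer state, and any layers that stay unquantized.

```python
def weight_memory_gb(num_params, bits_per_param):
    """Approximate weight storage in gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9

# Assumption: roughly 360 million parameters, as the model name suggests.
NUM_PARAMS = 360e6

for label, bits in [("float32", 32), ("bfloat16", 16), ("4-bit NF4", 4)]:
    print(f"{label:>10}: {weight_memory_gb(NUM_PARAMS, bits):.2f} GB")
```

For a model this small the saving is modest in absolute terms; the same ratio matters much more for the larger optional models.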

In [None]:
from transformers import BitsAndBytesConfig

use_4bit = (device == "cuda")  # only quantize on GPU
quant_cfg = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,
) if use_4bit else None

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME, trust_remote_code=True)
# many small models lack pad token; set it safely
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token
    tokenizer.pad_token_id = tokenizer.eos_token_id

model = AutoModelForCausalLM.from_pretrained(
    MODEL_NAME,
    device_map="auto" if device == "cuda" else None,
    torch_dtype=torch.bfloat16 if device == "cuda" else torch.float32,
    quantization_config=quant_cfg,
    trust_remote_code=True,
)

## Step 7: Configure LoRA (Low-Rank Adaptation)

LoRA freezes the base weights and trains small low-rank adapter matrices added to selected layers, so only a small fraction of the parameters is updated.
This keeps memory use within what free Colab resources allow.

In [None]:
# For this architecture, these targets are standard. If you switch to GPT-2 later,
# you'll need gpt2-specific targets like ["c_attn", "c_proj"].
lora_cfg = LoraConfig(
    r=16, lora_alpha=32, lora_dropout=0.05,
    target_modules=["q_proj","k_proj","v_proj","o_proj","gate_proj","up_proj","down_proj"],
    bias="none", task_type="CAUSAL_LM"
)
model = get_peft_model(model, lora_cfg)
trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
total = sum(p.numel() for p in model.parameters())
print(f"Trainable params: {trainable:,} ({100*trainable/total:.2f}% of total)")


Trainable params: 8,683,520 (4.07% of total)
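
The trainable-parameter count can be checked by hand: LoRA with rank `r` on a linear layer with `d_in` inputs and `d_out` outputs adds `r * (d_in + d_out)` parameters. The sketch below applies that formula to the seven target modules. The layer shapes are assumptions about the SmolLM2-360M architecture (they are not stated in this notebook), and `lora_params` is a hypothetical helper.

```python
def lora_params(d_in, d_out, r):
    """LoRA adds A (r x d_in) and B (d_out x r) beside one linear layer."""
    return r * (d_in + d_out)

# Assumed SmolLM2-360M shapes (not stated in this notebook): hidden size 960,
# MLP size 2560, 32 layers, and 320-wide key/value projections.
HIDDEN, MLP, KV, LAYERS, RANK = 960, 2560, 320, 32, 16

shapes = {  # module name -> (input features, output features)
    "q_proj": (HIDDEN, HIDDEN), "k_proj": (HIDDEN, KV),
    "v_proj": (HIDDEN, KV),     "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, MLP), "up_proj": (HIDDEN, MLP),
    "down_proj": (MLP, HIDDEN),
}

per_layer = sum(lora_params(i, o, RANK) for i, o in shapes.values())
print(f"{per_layer * LAYERS:,}")  # 8,683,520
```

Under these assumed shapes the result matches the printed count. The reported percentage is probably overstated, because 4-bit weights are stored packed and so contribute fewer elements to the `total` parameter count.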


## Step 8: Configure DPO Training

Set the hyperparameters for DPO training. `beta` controls how strongly the policy is kept close to the reference model.

In [None]:
cfg = DPOConfig(
    output_dir="./dpo-output",
    report_to="none",
    remove_unused_columns=False,

    max_length=256,
    max_prompt_length=128,

    per_device_train_batch_size=1,
    gradient_accumulation_steps=8,   # slightly larger effective batch
    num_train_epochs=2,              # fewer epochs to avoid overfitting tiny data
    learning_rate=5e-5,
    lr_scheduler_type="cosine",
    warmup_ratio=0.1,

    bf16=(device=="cuda"),
    gradient_checkpointing=True,
    optim="paged_adamw_8bit" if device=="cuda" else "adamw_torch",
    logging_steps=10,
    save_strategy="epoch",
    save_total_limit=1,

    beta=0.2,   # a touch higher than 0.1; try 0.1–0.3 to see effect
)
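
It helps to work out how many optimizer steps this configuration produces on a dataset this small. The sketch below redoes the arithmetic in plain Python, assuming a single GPU and the 72 training pairs from Step 5; exact rounding inside the trainer may differ between library versions, although here the division is exact.

```python
import math

# Values taken from this notebook: 72 training pairs and the DPOConfig above.
num_examples = 72
per_device_batch = 1
grad_accum = 8
epochs = 2
warmup_ratio = 0.1
logging_steps = 10

effective_batch = per_device_batch * grad_accum   # single-GPU assumption
steps_per_epoch = math.ceil(num_examples / effective_batch)
total_steps = steps_per_epoch * epochs
warmup_steps = math.ceil(warmup_ratio * total_steps)
logged_at = list(range(logging_steps, total_steps + 1, logging_steps))

print(effective_batch, steps_per_epoch, total_steps, warmup_steps, logged_at)
# 8 9 18 2 [10]
```

With only 18 steps in total, `logging_steps=10` yields a single logged loss value; a smaller value such as 1 or 2 would give a usable loss curve.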


## Step 9: Initialize DPO Trainer

In [None]:
trainer = DPOTrainer(
    model=model,
    args=cfg,
    train_dataset=train_dataset,
)
print("DPO trainer ready.")


DPO trainer ready.


## Step 10: Train the Model

In [None]:
# Start training
print("Starting DPO training...\n")
trainer.train()
print("\n✓ Training completed!")

Starting DPO training...





| Step | Training Loss |
| --- | --- |
| 10 | 0.6732 |


Repo card metadata block was not found. Setting CardData to empty.



✓ Training completed!


In [None]:
print("Post-train val pref-acc (est):", dpo_pref_accuracy(val_dataset, n=80))




Post-train val pref-acc (est): 1.0


## Step 11: Save the Model

In [None]:
# Save model and tokenizer
output_dir = "./dpo-finetuned-model"
trainer.save_model(output_dir)
tokenizer.save_pretrained(output_dir)
print(f"✓ Model saved to: {output_dir}")

✓ Model saved to: ./dpo-finetuned-model


## Step 12: Test the Model

In [None]:
def generate_response(prompt, max_new_tokens=30, temperature=0.0):
    """
    For instruct models with a chat template (e.g., SmolLM2 Instruct), we format
    the prompt accordingly. We default to deterministic decoding (temperature=0.0)
    to avoid gibberish in a classroom demo.
    """
    if hasattr(tokenizer, "apply_chat_template") and tokenizer.chat_template:
        messages = [{"role": "user", "content": prompt}]
        text = tokenizer.apply_chat_template(
            messages, tokenize=False, add_generation_prompt=True
        )
    else:
        # Fallback for plain causal models (e.g., gpt2)
        text = prompt

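    inputs = tokenizer(text, return_tensors="pt").to(model.device)
    gen_kwargs = dict(
        max_new_tokens=max_new_tokens,
        pad_token_id=tokenizer.pad_token_id,
        eos_token_id=tokenizer.eos_token_id,
    )
    if temperature > 0.0:
        # Sampling options only apply when do_sample=True
        gen_kwargs.update(do_sample=True, temperature=temperature, top_p=0.9)
    else:
        gen_kwargs["do_sample"] = False  # greedy decoding
    with torch.no_grad():
        out = model.generate(**inputs, **gen_kwargs)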
    # Slice off the prompt tokens properly (use input length, not len(inputs[0]))
    gen_tokens = out[0, inputs.input_ids.shape[1]:]
    return tokenizer.decode(gen_tokens, skip_special_tokens=True).strip()

# ---------- Quick test on a few prompts ----------
test_prompts = [
    "Give me a polite greeting.",
    "What’s a heartfelt way to say thank you?",
    "Suggest a friendly farewell.",
    "Give me a short, strong compliment.",
    "How should I apologize sincerely?"
]
# Disable gradient checkpointing before generation
model.gradient_checkpointing_disable()
model.eval()

print("\n=== DPO Model Samples ===")
for q in test_prompts:
    print("Q:", q)
    print("A:", generate_response(q))
    print("-"*50)


=== DPO Model Samples ===
Q: Give me a polite greeting.
A: Hello, I'm happy to be your guide today. I'm here to assist you in navigating the world of data analysis and machine learning. I'm
--------------------------------------------------
Q: What’s a heartfelt way to say thank you?
A: "Thank you for your kind words and gestures. I hope you have a wonderful day ahead."
--------------------------------------------------
Q: Suggest a friendly farewell.
A: "Goodbye, and may the day be filled with love and joy!"
--------------------------------------------------
Q: Give me a short, strong compliment.
A: "You're a great person, and you're always willing to help others."
--------------------------------------------------
Q: How should I apologize sincerely?
A: Apologizing sincerely is a crucial part of maintaining a positive relationship. It shows that you care about the other person's feelings and are willing to make
--------------------------------------------------
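
The samples above use the default deterministic decoding. To illustrate why a low temperature behaves almost like greedy decoding, the sketch below applies a temperature-scaled softmax to three made-up logits; the function name and the numbers are illustrative assumptions, not values from the model.

```python
import math

def softmax_with_temperature(logits, temperature):
    """Convert logits to probabilities after dividing by the temperature."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)                      # subtract the max for stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

# Made-up logits for three candidate next tokens.
logits = [2.0, 1.0, 0.5]

for t in (1.0, 0.5, 0.1):
    probs = softmax_with_temperature(logits, t)
    print(t, [round(p, 3) for p in probs])
```

As the temperature drops, nearly all probability mass moves to the highest-scoring token, which is the token greedy decoding would pick.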
