# Direct Preference Optimization (DPO) for Language Model Alignment

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/1337-Artificial-Intelligence/hackai-2025/blob/main/new_notebooks/alignment_dpo_aragpt2_arabicpreference.ipynb)

This notebook demonstrates how to align a language model with human preferences using Direct Preference Optimization (DPO), a technique that simplifies traditional Reinforcement Learning from Human Feedback (RLHF) by training directly on preference pairs.

Estimated time needed: **1** hour

## Lab Objective

The goal of this lab is to give you practical experience with:
- Preparing a dataset formatted for DPO
- Fine-tuning a model using the DPO method
- Evaluating how much the model's behavior improves after training


### How DPO Works (Simple Explanation)

- You show the model two answers for the same question: one that humans prefer (the **chosen** one) and one that's less preferred (the **rejected** one).
- The model learns **directly** from this comparison by adjusting itself to favor the "chosen" answers over the "rejected" ones.
- It does this **without** needing a complex reward model like in traditional reinforcement learning.

Think of it like training a dog: you show it two actions (e.g., sit nicely vs jump on people) and **reward** it for the one you like better, over and over, until it consistently chooses the good one.


## DPO vs PPO: What's the Difference?

| Aspect | DPO (Direct Preference Optimization) | PPO (Proximal Policy Optimization) |
|:------|:--------------------------------------|:----------------------------------|
| How it works | Directly trains the model from comparisons (chosen vs rejected) | Needs a reward model first, then trains the model using rewards |
| Complexity | Simpler (no reward model needed) | More complex (2 steps: train reward model + policy optimization) |
| Stability | Very stable and efficient | Stable but more sensitive to hyperparameters |
| Training Type | Preference-based fine-tuning | Reinforcement learning fine-tuning |

![image](https://cdn.labellerr.com/1%201%201%20DPO/dpo-ppo-diagram.webp)

**In short:**
- **DPO** is **easier and faster** because it skips the "build a reward model" step.
- **PPO** is a **full reinforcement learning method**, needing more setup but offering more flexibility when rewards are tricky.


![texte](https://miro.medium.com/v2/resize:fit:1100/format:webp/0*adNPXsn8v1qXiy98.png)


In the DPO paper, the authors build the Bradley-Terry preference model into the loss function. Through some algebraic work, they show that the second step can be skipped because the language model implicitly acts as its own reward model. Once the second step is removed, the problem reduces to an optimization problem with a cross-entropy objective, as shown in the figure below.

![image](https://miro.medium.com/v2/resize:fit:1100/format:webp/0*zE6I3BBUDMN9lfwV.png)
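
To make the objective above more concrete, here is a minimal sketch of the per-example DPO loss in plain Python. It assumes you already have the summed log-probabilities of the chosen and rejected responses under the policy and under the frozen reference model. The function names and the numeric inputs are hypothetical and chosen only for illustration; this is not the implementation used by the `trl` library.

```python
import math


def log_sigmoid(x: float) -> float:
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Per-example DPO loss from summed sequence log-probabilities."""
    # How much more (or less) likely each response became relative to the reference
    chosen_logratio = policy_chosen_logp - ref_chosen_logp
    rejected_logratio = policy_rejected_logp - ref_rejected_logp
    # Binary cross-entropy on the scaled difference of the two log-ratios
    return -log_sigmoid(beta * (chosen_logratio - rejected_logratio))


# Hypothetical log-probabilities, for illustration only
untrained = dpo_loss(-50.0, -60.0, -50.0, -60.0)  # policy identical to reference
improved = dpo_loss(-45.0, -65.0, -50.0, -60.0)   # policy favors "chosen" more
print(f"loss when policy == reference: {untrained:.4f}")
print(f"loss after favoring chosen:    {improved:.4f}")
```

When the policy equals the reference, both log-ratios are zero and the loss is `log(2)`. The loss falls as the policy raises the chosen response's likelihood relative to the rejected one.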


#### Setup and Installation


- Installing required libraries

**Note**: These versions are specifically selected for compatibility

In [None]:
!pip install -q torch==2.3.1 trl==0.11.4 peft==0.14.0 pandas numpy==1.26.0 datasets==3.2.0 transformers==4.45.2

- Importing required libraries



In [None]:
import os
import torch
from datasets import load_dataset

from peft import LoraConfig, get_peft_model
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    TrainingArguments,
    set_seed,
    GenerationConfig
)
from trl import DPOConfig, DPOTrainer



#### Model and Tokenizer Setup

For this workshop, we'll use AraGPT2, a decoder-only Arabic language model based on the GPT-2 architecture.


In [None]:
# Check for GPU availability and set device accordingly
device = "cuda" if torch.cuda.is_available() else "cpu"
print(f"Using device: {device}")

# Model selection:  We're using AraGPT2, an Arabic language model based on the GPT-2 architecture
MODEL_NAME = "aubmindlab/aragpt2-base" # "unsloth/Qwen2.5-1.5B" 

# The model name for the fine-tuned version
FINETUNED_MODEL_NAME = "aragpt2-base-dpo"

Using device: cuda


- Set your Hugging Face token (found [here](https://huggingface.co/settings/tokens)) so the notebook can interact with the Hugging Face Hub

In [None]:
# Set Hugging Face token for accessing models 
os.environ["HF_TOKEN"] = "YOUR_HF_API_TOKEN" 

- Get your wandb API Key found [here](https://wandb.ai/authorize) and set it as an environment variable

In [None]:
os.environ["WANDB_API_KEY"] = "YOUR_API_TOKEN" 

In [None]:
# Load model for training
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME).to(device)

# Load reference model (used during training)
model_ref = AutoModelForCausalLM.from_pretrained(MODEL_NAME).to(device)

# Load and configure tokenizer
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)

# Configure padding token and padding side
tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "right"  # Padding on the right side preserves the beginning of sequences

# Disable cache during forward pass to save memory
model.config.use_cache = False

#### Data Preparation



-  Load the Arabic preference [dataset](https://huggingface.co/datasets/FreedomIntelligence/Arabic-preference-data-RLHF) for RLHF

In [None]:
# This dataset contains pairs of responses where one is preferred over the other
# We use only 10% for this demo to keep training time reasonable
print("Loading preference dataset...")
ds = load_dataset("FreedomIntelligence/Arabic-preference-data-RLHF", split="train[:10%]")


Examine the dataset structure to understand its format

In [None]:
print("\nDataset sample features:", ds[0].keys())
print("\nExample entry from dataset:")
print(ds[0])


Sample features: dict_keys(['id', 'instruction', 'chosen', 'rejected'])

Sample entry:
{'id': '10001_q', 'instruction': 'هل يمكنك تحقيق ارباح من تطبيق او لعبة فقط من خلال الاعلان داخل التطبيق؟', 'chosen': 'نعم، يمكن تحقيق أرباح من تطبيق أو لعبة من خلال الإعلانات داخل التطبيق. هذا يتم عن طريق استخدام شبكات الإعلانات مثل Google AdMob أو Facebook Audience Network، حيث تُظهر الإعلانات في التطبيق أو اللعبة ويتم تحقيق العائد بناءً على عدد الأشخاص الذين ينقرون على تلك الإعلانات. العائد يمكن أن يتراوح من صفر إلى عشرات الألاف من الدولارات بناءً على شعبية التطبيق أو اللعبة وكيفية تفاعل المستخدمين مع الإعلانات.', 'rejected': 'نعم، يمكن تحقيق أرباح من تطبيق أو لعبة من خلال الإعلانات الموجودة داخل التطبيق. يعتمد هذا على مجموعة من العوامل بما في ذلك شعبية التطبيق وعدد المستخدمين، ونوع ومحتوى الإعلانات، والاستراتيجيات التسويقية المستخدمة. يمكن للشركات أن تحقق أرباح أيضًا من الإعلانات التابعة أو الإعلانات المدفوعة المستندة إلى النقر أو الإعلانات التي تظهر عند توقف المستخدمين عن استخدام التطبيق.'}


- Transform the dataset into the format required by DPO:
    - `prompt`: The input query
    - `chosen`: The preferred response
    - `rejected`: The less preferred response
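
As a small illustration of the target format, the sketch below maps one raw record (with the `id`, `instruction`, `chosen`, `rejected` fields shown in the dataset sample) to the three fields DPO expects. The helper name `to_dpo_record` and the toy record are hypothetical. The notebook itself achieves the same result with `rename_column` and `remove_columns`.

```python
def to_dpo_record(example: dict) -> dict:
    """Keep only the fields DPO needs, renaming 'instruction' to 'prompt'."""
    return {
        "prompt": example["instruction"],
        "chosen": example["chosen"],
        "rejected": example["rejected"],
    }


# Toy record with made-up content, mirroring the dataset's column names
raw = {
    "id": "example_q",
    "instruction": "What is the capital of Morocco?",
    "chosen": "The capital of Morocco is Rabat.",
    "rejected": "Morocco is a country.",
}

record = to_dpo_record(raw)
print(sorted(record.keys()))
```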



In [None]:
print("\nPreparing dataset for DPO training...")
ds = ds.rename_column("instruction", "prompt").remove_columns(["id"])

# Split the dataset into training and evaluation sets
# We use a 90/10 split with a fixed random seed for reproducibility
ds = ds.train_test_split(0.1, shuffle=True, seed=42)
train_dataset, eval_dataset = ds["train"], ds["test"]
print(f"Training set size: {len(train_dataset)}, Evaluation set size: {len(eval_dataset)}")

| Chosen | Rejected | Prompt |
| --- | --- | --- |
 | نعم، يمكن تحقيق أرباح من تطبيق أو لعبة من خلال الإعلانات داخل التطبيق. هذا يتم عن طريق استخدام شبكات الإعلانات مثل Google AdMob أو Facebook Audience Network، حيث تُظهر الإعلانات في التطبيق أو اللعبة ويتم تحقيق العائد بناءً على عدد الأشخاص الذين ينقرون على تلك الإعلانات. العائد يمكن أن يتراوح من صفر إلى عشرات الألاف من الدولارات بناءً على شعبية التطبيق أو اللعبة وكيفية تفاعل المستخدمين مع الإعلانات.|نعم، يمكن تحقيق أرباح من تطبيق أو لعبة من خلال الإعلانات الموجودة داخل التطبيق. يعتمد هذا على مجموعة من العوامل بما في ذلك شعبية التطبيق وعدد المستخدمين، ونوع ومحتوى الإعلانات، والاستراتيجيات التسويقية المستخدمة. يمكن للشركات أن تحقق أرباح أيضًا من الإعلانات التابعة أو الإعلانات المدفوعة المستندة إلى النقر أو الإعلانات التي تظهر عند توقف المستخدمين عن استخدام التطبيق.| هل يمكنك تحقيق ارباح من تطبيق او لعبة فقط من خلال الاعلان داخل التطبيق



### Optional: Quantized Model Configuration (for GPUs)
For environments with GPU support, you can use quantization to reduce memory usage. Uncomment the following blocks if you are working with limited GPU memory.
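
To get a rough feel for why lower precision helps, the sketch below estimates the memory needed just to store model weights at different bit widths. The parameter count is a hypothetical round number, and the estimate ignores quantization constants, activations, gradients, and optimizer state, so treat it as a lower bound rather than a measurement.

```python
def weight_memory_gib(n_params: float, bits_per_param: int) -> float:
    """Approximate memory for the weights alone, in GiB."""
    return n_params * bits_per_param / 8 / 1024**3


N_PARAMS = 1.5e9  # hypothetical model size, for illustration only

for bits in (32, 16, 4):
    print(f"{bits:>2}-bit weights: ~{weight_memory_gib(N_PARAMS, bits):.2f} GiB")
```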


![lora](https://pytorch.org/torchtune/0.4/_images/lora_diagram.png)

In [None]:
# !pip install -U bitsandbytes # this package is required for quantization

**_Note:_**  _Restart the kernel after installing so that the new package is picked up._


In [None]:
# !pip install -U bitsandbytes  # Required for quantization

# from transformers import BitsAndBytesConfig

# # Configure quantization parameters
# quantization_config = BitsAndBytesConfig(
#     load_in_4bit=True,                    # Load model in 4-bit precision instead of 16/32-bit
#     bnb_4bit_use_double_quant=True,       # Also quantize the quantization constants to save extra memory
#     bnb_4bit_quant_type="nf4",            # Use normalized float 4-bit quantization
#     bnb_4bit_compute_dtype=torch.bfloat16 # Use bfloat16 for calculations
# )

# # Load models with quantization config
# model = AutoModelForCausalLM.from_pretrained(
#     MODEL_NAME, 
#     quantization_config=quantization_config,
#     device_map="auto"
# )

# model_ref = AutoModelForCausalLM.from_pretrained(
#     MODEL_NAME, 
#     quantization_config=quantization_config,
#     device_map="auto"
# )

# tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)  # Tokenizers do not need a device_map
# tokenizer.pad_token = tokenizer.eos_token
# tokenizer.padding_side = "right"
# model.config.use_cache = False

#### LoRA Configuration for Efficient Fine-tuning

LoRA allows us to train only a small number of adapter parameters instead of the full model.
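
To see how small the adapters are, the sketch below counts the trainable parameters LoRA adds to a single linear layer: two low-rank matrices of shapes `d_in x r` and `r x d_out`. The layer dimensions are assumptions based on the standard GPT-2 base configuration (hidden size 768, with a fused attention projection of width 2304). Check `model.config` for the actual values of the model you load.

```python
def lora_param_count(d_in: int, d_out: int, r: int) -> int:
    """Trainable parameters added by one LoRA adapter (matrices A and B)."""
    return r * (d_in + d_out)


# Assumed GPT-2-base-like dimensions for a fused attention projection
D_IN, D_OUT, RANK, ALPHA = 768, 2304, 4, 8

full = D_IN * D_OUT
adapter = lora_param_count(D_IN, D_OUT, RANK)
print(f"full weight matrix: {full:,} parameters")
print(f"LoRA adapter (r={RANK}): {adapter:,} parameters ({adapter / full:.2%} of the layer)")
print(f"LoRA update scaling (alpha / r): {ALPHA / RANK}")
```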

In [None]:
# PEFT (Parameter-Efficient Finetuning) configuration
print("Setting up LoRA configuration...")
peft_config = LoraConfig(
    r=4,                    # Rank of the low-rank decomposition matrices
    target_modules=[        # Which modules to apply LoRA to
        'c_proj',           # Projection layers in the transformer
        'c_attn'            # Attention layers in the transformer
    ],
    task_type="CAUSAL_LM",  # The type of task we're performing
    lora_alpha=8,           # Scaling factor for the LoRA parameters (typically 2x rank)
    lora_dropout=0.1,       # Dropout probability for LoRA layers
    bias="none",           # Whether to train bias parameters
)

####  DPO Training Configuration


In [None]:
# Configure DPO training parameters
print("Setting up DPO training configuration...")
training_args = DPOConfig(
    beta=0.1,                      # Controls how far the policy may drift from the reference model (typically 0.1-0.5)
                                   # Higher values keep the fine-tuned model closer to the reference
    output_dir="dpo",              # Directory to save model checkpoints
    num_train_epochs=5,            # Number of training passes through the data
    per_device_train_batch_size=2, # Batch size for training (adjust based on GPU memory)
    per_device_eval_batch_size=2,  # Batch size for evaluation
    remove_unused_columns=False,   # Keep all columns in the dataset
    logging_steps=10,              # Log training progress every 10 steps
    gradient_accumulation_steps=4, # Accumulate gradients over multiple batches
                                   # Effectively increases batch size to 2 * 4 = 8
    learning_rate=1e-4,            # Learning rate for the optimizer
    evaluation_strategy="epoch",   # Evaluate after each epoch
    warmup_steps=2,                # Number of warmup steps for learning rate scheduler
    save_steps=500,                # Save checkpoint every 500 steps
    report_to='wandb'              # Report training metrics to Weights & Biases
                                   # Use 'none' to disable reporting
)



####  DPO Trainer Setup

The next step is to create the trainer using the `DPOTrainer` class.


In [None]:
# Create the DPO trainer that will handle the training process
print("Setting up DPO trainer...")
trainer = DPOTrainer(
    model=model,              # The model to be fine-tuned
    ref_model=None,           # Reference model (None because we're using LoRA)
                              # When using LoRA, DPOTrainer will automatically handle the reference model
    args=training_args,       # Training arguments defined above
    train_dataset=train_dataset,  # Training data
    eval_dataset=eval_dataset,    # Evaluation data
    tokenizer=tokenizer,          # Tokenizer
    peft_config=peft_config,      # LoRA configuration
    max_length=512,               # Maximum sequence length for inputs and outputs
    
)


Deprecated positional argument(s) used in DPOTrainer, please use the DPOConfig to set these arguments instead.
Tokenizing train dataset: 100%|██████████| 6236/6236 [00:11<00:00, 534.04 examples/s]
Tokenizing eval dataset: 100%|██████████| 693/693 [00:01<00:00, 602.35 examples/s]


Please note that when using LoRA on the base model, it is efficient to leave the `ref_model` parameter as `None`, in which case the `DPOTrainer` disables the adapter to run reference inference with the original weights.


Now, you're all set for training the model.


#### Training Process



**Training can be time-consuming on CPU and may cause memory issues. If you encounter problems, skip to the next section to load a pre-trained model.**

In [None]:
# Start the training process
print("Starting DPO training...")
trainer.train()

[34m[1mwandb[0m: Currently logged in as: [33mafaf[0m to [32mhttps://api.wandb.ai[0m. Use [1m`wandb login --relogin`[0m to force relogin


Could not estimate the number of tokens of the input, floating-point operations will not be computed


| Epoch | Training Loss | Validation Loss | Rewards/chosen | Rewards/rejected | Rewards/accuracies | Rewards/margins | Logps/rejected | Logps/chosen | Logits/rejected | Logits/chosen |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 0.2648 | 0.351199 | 7.322288 | 4.321124 | 0.860231 | 3.001164 | -783.083191 | -1090.636475 | -3.003296 | -3.291013 |
| 2 | 0.2887 | 0.357454 | 8.366143 | 4.700235 | 0.864553 | 3.665909 | -779.292175 | -1080.197876 | -3.019666 | -3.315789 |
| 4 | 0.1721 | 0.377715 | 8.918534 | 4.731424 | 0.861671 | 4.18711 | -778.980347 | -1074.674072 | -3.005437 | -3.29856 |
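
One way to read these metrics: the reward margin is the gap between the chosen and rejected rewards, so a growing margin means the model separates the two responses more strongly. The sketch below recomputes the margin from the logged values above. It assumes, as the column names suggest, that `Rewards/margins` is the mean of `Rewards/chosen` minus `Rewards/rejected`. The variable names are illustrative.

```python
# (epoch, rewards/chosen, rewards/rejected, logged rewards/margins) from the training log
logged = [
    (0, 7.322288, 4.321124, 3.001164),
    (2, 8.366143, 4.700235, 3.665909),
    (4, 8.918534, 4.731424, 4.18711),
]

recomputed = []
for epoch, chosen, rejected, margin in logged:
    diff = chosen - rejected
    recomputed.append(diff)
    print(f"epoch {epoch}: chosen - rejected = {diff:.6f} (logged margin: {margin})")
```

Note that the validation loss rises slightly over the same epochs while the margin grows, which is worth keeping in mind when choosing a checkpoint.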


TrainOutput(global_step=3895, training_loss=0.2658155122525579, metrics={'train_runtime': 29344.701, 'train_samples_per_second': 1.063, 'train_steps_per_second': 0.133, 'total_flos': 0.0, 'train_loss': 0.2658155122525579, 'epoch': 4.996792815907633})
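
The `global_step` value can be checked against the training configuration. The sketch below assumes a single device and assumes the trainer counts one optimizer step per full group of accumulated batches, dropping any incomplete group at the end of an epoch. The function name is hypothetical. The inputs come from the configuration and the tokenization log above.

```python
import math


def total_optimizer_steps(n_examples, per_device_batch, grad_accum, epochs, n_devices=1):
    """Estimate the number of optimizer updates over a full training run."""
    batches_per_epoch = math.ceil(n_examples / (per_device_batch * n_devices))
    updates_per_epoch = max(batches_per_epoch // grad_accum, 1)
    return math.ceil(epochs * updates_per_epoch)


steps = total_optimizer_steps(
    n_examples=6236,      # training set size from the tokenization log
    per_device_batch=2,   # per_device_train_batch_size
    grad_accum=4,         # gradient_accumulation_steps
    epochs=5,             # num_train_epochs
)
print(steps)  # 3895, matching global_step in the output above
```

Each update therefore sees an effective batch of 2 x 4 = 8 examples, which is why roughly 780 updates cover one pass over the 6,236 training pairs.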

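**Note:** You can skip training and load the already fine-tuned model in the "Loading Pre-trained Model (Alternative)" section instead.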

In [None]:
# Save the trained model to Hugging Face Hub
# print("Pushing model to Hugging Face Hub...")
# trainer.push_to_hub(FINETUNED_MODEL_NAME, commit_message="DPO finetuning with LoRA")

In [None]:
# Load the trained model from the local checkpoint
# print("Loading trained model from checkpoint...")
# dpo_model = AutoModelForCausalLM.from_pretrained('./dpo/checkpoint-3895').to(device)
# dpo_tokenizer = AutoTokenizer.from_pretrained('./dpo/checkpoint-3895')

#### Loading Pre-trained Model (Alternative)


If training is too resource-intensive, you can load a pre-trained model instead. This section loads a model that has already been fine-tuned with DPO.

In [None]:
# Load the DPO-fine-tuned model from Hugging Face Hub
print("Loading pre-trained DPO model from Hub...")
dpo_model = AutoModelForCausalLM.from_pretrained(f"HackAI-2025/{FINETUNED_MODEL_NAME}").to(device)
tokenizer = AutoTokenizer.from_pretrained(f"HackAI-2025/{FINETUNED_MODEL_NAME}")

Loading pre-trained DPO model from Hub...


In [None]:
# Load reference (baseline) model for comparison
model_ref = AutoModelForCausalLM.from_pretrained(MODEL_NAME).to(device)

####  Text Generation and Comparison


In [None]:
# Set random seed for reproducible generation
set_seed(40)

# Configure generation parameters
print("Setting up text generation configuration...")
generation_config = GenerationConfig(
    max_new_tokens=70,         # Maximum number of tokens to generate
    do_sample=True,            # Use sampling instead of greedy decoding
    top_k=50,                  # Consider top 50 tokens at each step
    top_p=0.8,                 # Consider tokens with cumulative probability of 0.8
    temperature=0.8,           # Controls randomness (higher = more random)
    repetition_penalty=1.2,    # Penalize repetition of tokens
    pad_token_id=tokenizer.eos_token_id  # Use EOS token for padding
)

# Define a test prompt in Arabic
PROMPT = "كيف يمكنني التغلب على القلق والتوتر؟" # "How can I overcome anxiety and stress?"

# Tokenize the prompt and move to the appropriate device
inputs = tokenizer(PROMPT, return_tensors='pt').to(device)

# Generate text with the DPO-fine-tuned model
print("Generating response with DPO model...")
outputs = dpo_model.generate(**inputs, generation_config=generation_config).to(device)
print("DPO response:\t", tokenizer.decode(outputs[0], skip_special_tokens=True))

# Generate text with the baseline model for comparison
print("\nGenerating response with baseline model...")
outputs = model_ref.generate(**inputs, generation_config=generation_config).to(device)
print("Baseline response:\t", tokenizer.decode(outputs[0], skip_special_tokens=True))


Setting up text generation configuration...
Generating response with DPO model...
DPO response:	 كيف يمكنني التغلب على القلق والتوتر؟ القلق والاكتئاب والقلق هو أحد الاضطرابات النفسية التي تصيب الأشخاص الذين يعانون من التوتر ، ويمكن أن يكون الخوف والقلق هو أكثر ما يميز الشخص المصاب بالقلق عن الآخرين المصابين به .1 - الاكتئاب : وهو حالة نفسية تؤثر بشكل كبير في التفكير والتحليل والتصرف وفي الأحاسيس والمشاعر السلبية مثل الحزن والغضب والخوف والإحباط والقلق والخوف أو الغضب والحزن والخوف والقلق والخوف والإحباط وغيرها من المشاعر المختلفة .2

Generating response with baseline model...
Baseline response:	 كيف يمكنني التغلب على القلق والتوتر؟ ؟ ! ! ، من هو الشخص الذي يملك هذا الإحساس الرائع . . . هل هو شخص عادي أم مجنون ؟ . . . أم أنه إنسان فاشل ؟ . . . لا أدري . . لكن ما أعرفه أن كل منا يعاني من مشكلة . . . أما أنا فأعتقد أن الإنسان الناجح في حياته هو الإنسان الذي يمتلك هذا الشعور . . . فهو الذي يستطيع


Although the model was trained on a small dataset for only 5 epochs, the response generated by the DPO-tuned model stays more on topic and is more straightforward than the baseline's.


# Exercises



In [None]:
test_questions = ["ما هي فوائد الغذاء الصحي؟",
"كيف يمكنني التغلب على القلق والتوتر؟",
"اشرح لي كيفية استخدام الذكاء الاصطناعي في التعليم.",
"ما هي أفضل طريقة لتعلم لغة جديدة؟",
"هل يجب علي الاستثمار في العملات المشفرة؟",
"ما هي أخطر تهديدات البيئة في العالم اليوم؟",
"كيف يمكنني تحسين مهارات التواصل لدي؟",
"اقترح برنامجاً لتمارين رياضية لشخص مبتدئ.",
"ما هي الخطوات اللازمة لبدء مشروع تجاري ناجح؟",
"كيف يمكن للتكنولوجيا أن تساعد في حل مشكلة تغير المناخ؟"]

## Exercise 1: Experiment with Generation Parameters
Try different generation parameters (temperature, top_p, top_k) and compare their effects on:
1. The quality of the generated text
2. The diversity of responses
3. How closely they align with human preferences
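
Before experimenting, it can help to see what these three parameters do to a next-token distribution. The sketch below is a simplified, pure-Python illustration on a toy vocabulary: it applies temperature, then top-k, then top-p (nucleus) filtering. It is not the `transformers` implementation, and the function name and toy logits are hypothetical.

```python
import math


def filtered_distribution(logits, temperature=0.8, top_k=50, top_p=0.8):
    """Return {token_index: probability} after temperature, top-k, then top-p."""
    # 1. Temperature: values below 1 sharpen the distribution, above 1 flatten it
    scaled = [x / temperature for x in logits]
    # 2. Top-k: keep only the k highest-scoring tokens
    order = sorted(range(len(scaled)), key=lambda i: scaled[i], reverse=True)[:top_k]
    # 3. Softmax over the surviving tokens
    peak = scaled[order[0]]
    weights = [math.exp(scaled[i] - peak) for i in order]
    total = sum(weights)
    probs = [w / total for w in weights]
    # 4. Top-p: keep the smallest prefix whose cumulative probability reaches top_p
    kept, cumulative = [], 0.0
    for i, p in zip(order, probs):
        kept.append((i, p))
        cumulative += p
        if cumulative >= top_p:
            break
    # 5. Renormalize so the kept probabilities sum to 1
    mass = sum(p for _, p in kept)
    return {i: p / mass for i, p in kept}


toy_logits = [2.0, 1.0, 0.0, -1.0]  # hypothetical scores for a 4-token vocabulary
for t in (0.5, 1.0, 1.5):
    dist = filtered_distribution(toy_logits, temperature=t, top_k=3, top_p=0.9)
    print(t, {i: round(p, 3) for i, p in dist.items()})
```

Lower temperatures and smaller `top_k` or `top_p` values concentrate probability on fewer tokens, which usually trades diversity for consistency.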

## Exercise 2: Test with Different Prompts
Create 3-5 different prompts and compare the responses from:
1. The base model (model_ref)
2. The DPO fine-tuned model
Analyze the differences and explain how the DPO training has affected the outputs.
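
One possible way to organize this comparison is a small helper that runs every prompt through each model and collects the outputs side by side. The sketch below uses stand-in generator functions so it runs on its own. The names `compare_responses`, `fake_baseline`, and `fake_dpo` are hypothetical. In the notebook you would replace the stand-ins with functions that tokenize the prompt, call `model_ref.generate` or `dpo_model.generate`, and decode the result.

```python
from typing import Callable, Dict, List


def compare_responses(prompts: List[str],
                      generators: Dict[str, Callable[[str], str]]) -> List[Dict[str, str]]:
    """Run each prompt through every generator and collect results per prompt."""
    rows = []
    for prompt in prompts:
        row = {"prompt": prompt}
        for name, generate in generators.items():
            row[name] = generate(prompt)
        rows.append(row)
    return rows


# Stand-in generators (assumptions for illustration; swap in real model calls)
def fake_baseline(prompt: str) -> str:
    return f"[baseline output for: {prompt}]"


def fake_dpo(prompt: str) -> str:
    return f"[DPO output for: {prompt}]"


prompts = ["ما هي فوائد الغذاء الصحي؟", "ما هي أفضل طريقة لتعلم لغة جديدة؟"]
results = compare_responses(prompts, {"baseline": fake_baseline, "dpo": fake_dpo})
for row in results:
    print(row["prompt"])
    print("  baseline:", row["baseline"])
    print("  dpo:     ", row["dpo"])
```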

## Exercise 3: Error Analysis
Identify cases where the DPO model still produces suboptimal responses and suggest:
1. Possible reasons for these failures
2. How you might improve the training data to address these issues
3. Alternative training strategies that might help


## Exercises

Try these exercises to better understand DPO:

1. **Experiment with Generation Parameters**
   - Try different values for temperature, top_p, and top_k
   - How do they affect the responses?

2. **Test Different Prompts**
   - Try these Arabic prompts:
   ```python
   test_questions = [
       "ما هي فوائد الغذاء الصحي؟",
       "كيف يمكنني التغلب على القلق والتوتر؟",
       "اشرح لي كيفية استخدام الذكاء الاصطناعي في التعليم.",
       "ما هي أفضل طريقة لتعلم لغة جديدة؟",
       "هل يجب علي الاستثمار في العملات المشفرة؟"
   ]
   ```

3. **Compare Responses**
   - How do the DPO model's responses differ from the baseline?
   - What makes the DPO responses better or worse?