# [Direct Preference Optimization: Your Language Model is Secretly a Reward Model (DPO)](https://arxiv.org/pdf/2305.18290.pdf)

### Reference Code
- https://huggingface.co/docs/trl/main/en/dpo_trainer
- https://github.com/huggingface/trl/blob/main/examples/scripts/dpo.py

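The DPO trainer expects a preference dataset: for each prompt, one preferred response and one dispreferred response. If you use the default `DPODataCollatorWithPadding` data collator, the final dataset object should therefore contain three entries.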

The entries should be named:
- prompt
- chosen
- rejected
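
As a rough illustration of that layout, the sketch below builds a tiny preference dataset as a plain dictionary of lists and checks that the three columns line up. The example strings and the `validate_preference_columns` helper are made up for illustration; they are not part of TRL.

```python
from typing import Dict, List

REQUIRED_COLUMNS = ("prompt", "chosen", "rejected")


def validate_preference_columns(data: Dict[str, List[str]]) -> int:
    """Check that the three DPO columns exist and have equal length; return the row count.

    Hypothetical helper for illustration only.
    """
    missing = [name for name in REQUIRED_COLUMNS if name not in data]
    if missing:
        raise KeyError(f"missing columns: {missing}")
    lengths = {len(data[name]) for name in REQUIRED_COLUMNS}
    if len(lengths) != 1:
        raise ValueError("prompt, chosen and rejected must have the same number of rows")
    return lengths.pop()


# Invented toy rows: row i of each column belongs to the same preference pair.
toy_data = {
    "prompt": ["\n\nHuman: What is 2 + 2?\n\nAssistant:", "\n\nHuman: Name a primary color.\n\nAssistant:"],
    "chosen": [" 2 + 2 equals 4.", " Red is a primary color."],
    "rejected": [" I am not sure.", " Green, probably."],
}

num_rows = validate_preference_columns(toy_data)
print(f"{num_rows} preference pairs")
```

A dictionary in this shape can be passed to `datasets.Dataset.from_dict` to obtain a `Dataset` object.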

In [7]:
!pip install datasets
!pip install trl
!pip install huggingface_hub


In [39]:
import os
import torch
# Set GPU device
# os.environ["CUDA_VISIBLE_DEVICES"] = "1"

# os.environ['http_proxy']  = 'http://192.41.170.23:3128'
# os.environ['https_proxy'] = 'http://192.41.170.23:3128'
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
device

device(type='cuda')

In [2]:
import torch
from datasets import Dataset, load_dataset
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    HfArgumentParser,
    TrainingArguments
)

from typing import Dict, Optional
from trl import DPOTrainer, DPOConfig

## 1. Load a pretrained model and tokenizer

In [3]:
model_name_or_path = "gpt2"
ignore_bias_buffers = False

model = AutoModelForCausalLM.from_pretrained(model_name_or_path)
if ignore_bias_buffers:
    # torch distributed hack
    model._ddp_params_and_buffers_to_ignore = [
        name for name, buffer in model.named_buffers() if buffer.dtype == torch.bool
    ]

model_ref = AutoModelForCausalLM.from_pretrained(model_name_or_path)
tokenizer = AutoTokenizer.from_pretrained(model_name_or_path)
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token



The DPO trainer expects a plain `AutoModelForCausalLM`, whereas PPO expects an `AutoModelForCausalLMWithValueHead`, because PPO needs the extra value head to estimate the value function.
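
The reason no value head is needed is that DPO scores a response only through sequence log-probabilities under the policy and the frozen reference model. The sketch below computes the per-example DPO loss from the paper for one preference pair using plain Python floats. The function name `dpo_loss` and the log-probability values are illustrative assumptions, not TRL's implementation.

```python
import math


def _softplus(x: float) -> float:
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Return (loss, chosen_reward, rejected_reward) for one preference pair.

    Each *_logp argument is the summed log-probability of a full response.
    """
    # Implicit rewards: how much more likely the policy makes a response
    # than the reference model does, scaled by beta.
    chosen_reward = beta * (policy_chosen_logp - ref_chosen_logp)
    rejected_reward = beta * (policy_rejected_logp - ref_rejected_logp)
    margin = chosen_reward - rejected_reward
    # -log(sigmoid(margin)) == softplus(-margin)
    loss = _softplus(-margin)
    return loss, chosen_reward, rejected_reward


# Made-up log-probabilities. At the start of training the policy equals the reference.
loss_start, _, _ = dpo_loss(-50.0, -60.0, -50.0, -60.0)
# After some training the policy favors the chosen response more than the reference does.
loss_better, chosen_r, rejected_r = dpo_loss(-45.0, -65.0, -50.0, -60.0)

print(f"loss at start: {loss_start:.4f}")
print(f"loss after improvement: {loss_better:.4f}")
```

When the policy and the reference agree, the margin is zero and the loss is log 2; the loss falls as the policy raises the chosen response relative to the rejected one.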

## 2. Load the Dahoas RM Single Context dataset

In [14]:
def extract_anthropic_prompt(prompt_and_response):
    """Extract the anthropic prompt from a prompt and response pair."""
    search_term = "Assistant:"
    search_term_idx = prompt_and_response.rfind(search_term)
    assert search_term_idx != -1, f"Prompt and response does not contain '{search_term}'"
    return prompt_and_response[: search_term_idx + len(search_term)]

def get_hh(split: str, sanity_check: bool = False, silent: bool = False, cache_dir: Optional[str] = None) -> Dataset:
    """Load the Dahoas RM Single Context dataset from Hugging Face and convert it to the necessary format.

    The dataset is converted to a dictionary with the following structure:
    {
        'prompt': List[str],
        'chosen': List[str],
        'rejected': List[str],
    }

    Prompts should be structured as follows:
      \n\nHuman: <prompt>\n\nAssistant:
    Multiple turns are allowed, but the prompt should always start with \n\nHuman: and end with \n\nAssistant:.
    """

    dataset = load_dataset("Dahoas/rm-single-context", split=split, cache_dir=cache_dir)
    if sanity_check:
        dataset = dataset.select(range(min(len(dataset), 10))) # 10 as small subset due to CUDA out of memory issue

    def split_prompt_and_responses(sample) -> Dict[str, str]:
        prompt = extract_anthropic_prompt(sample["chosen"])
        return {
            "prompt": sample["prompt"],
            "chosen": sample["chosen"][len(prompt) :],
            "rejected": sample["rejected"][len(prompt) :],
        }

    return dataset.map(split_prompt_and_responses)
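
To make the string handling concrete, here is a small self-contained check of the `rfind`-based split on a two-turn conversation. The helper is re-declared so the snippet runs on its own, and the sample text is invented for illustration rather than taken from the dataset.

```python
def extract_anthropic_prompt(prompt_and_response: str) -> str:
    """Return everything up to and including the last 'Assistant:' marker."""
    search_term = "Assistant:"
    search_term_idx = prompt_and_response.rfind(search_term)
    assert search_term_idx != -1, f"Prompt and response does not contain '{search_term}'"
    return prompt_and_response[: search_term_idx + len(search_term)]


# Invented two-turn conversation in the "\n\nHuman: ... \n\nAssistant:" layout.
sample = (
    "\n\nHuman: Hi"
    "\n\nAssistant: Hello!"
    "\n\nHuman: What is DPO?"
    "\n\nAssistant: A preference-tuning method."
)

prompt = extract_anthropic_prompt(sample)
response = sample[len(prompt):]

print(repr(prompt))
print(repr(response))
```

Because `rfind` locates the last marker, earlier turns stay in the prompt and only the final assistant reply is split off, with its leading space preserved. If a reply itself contained the text `Assistant:`, the split would land inside the reply, so this approach relies on the marker appearing only as a turn delimiter.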

In [15]:
sanity_check = True
train_dataset = get_hh("train", sanity_check=sanity_check)
eval_dataset = get_hh("test", sanity_check=sanity_check)


In [16]:
print(train_dataset[0])

{'prompt': 'What does an electrician do?\n\nAssistant: An electrician’s job is to fix electrical problems in a variety of different places.  An electrician might fix the electrical wiring in a house, or to install the electric cables for new wiring in a commercial building, or help set up a piece of new electronic equipment.\n\nHuman: So they could fix the wiring in your home?\n\n', 'chosen': ' Yes, an electrician might fix the wiring in your home or office, or in a commercial building.  That is part of their job.  They may also provide design and planning for new electrical infrastructure.', 'rejected': ' Sure, in fact they could fix the wiring in my own house!  They could also change out old parts of my electronics to make them work better.'}


In [17]:
train_dataset

Dataset({
    features: ['prompt', 'chosen', 'rejected'],
    num_rows: 10
})

In [18]:
eval_dataset

Dataset({
    features: ['prompt', 'chosen', 'rejected'],
    num_rows: 10
})

## 3. Initialize training arguments

In [27]:
hyperparams = [
    {"learning_rate": 1e-3,
     "per_device_train_batch_size": 8,
     "gradient_accumulation_steps": 1,
     "max_steps": 1000
     }, # Baseline setting
    # {"learning_rate": 5e-4,
    #  "per_device_train_batch_size": 8,
    #  "gradient_accumulation_steps": 1,
    #  "max_steps": 1000
    #  }, # Lower LR
    # {"learning_rate": 1e-3,
    #  "per_device_train_batch_size": 4,
    #  "gradient_accumulation_steps": 4,
    #  "max_steps": 1000
    #  }, # Smaller batch + accumulation
    # {"learning_rate": 1e-4,
    #  "per_device_train_batch_size": 8,
    #  "gradient_accumulation_steps": 4,
    #  "max_steps": 1000
    #  }, # Lower LR + accumulation
    # {"learning_rate": 5e-4,
    #  "per_device_train_batch_size": 8,
    #  "gradient_accumulation_steps": 4,
    #  "max_steps": 2000
    #  } # More training steps
]

# instrumentation
sanity_check = True
report_to = None
gradient_checkpointing = None
beta = 0.1
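
The configurations in the grid differ mainly in learning rate and in how many examples contribute to each optimizer step. A quick way to compare the latter is the effective batch size: per-device batch size times gradient accumulation steps times the number of devices. The sketch below assumes a single device (`num_devices = 1` is an assumption, not something stated in the notebook) and re-lists the batch settings from the grid, including the commented-out entries.

```python
# (label, per_device_train_batch_size, gradient_accumulation_steps), copied from the grid.
grid = [
    ("baseline", 8, 1),
    ("lower LR", 8, 1),
    ("smaller batch + accumulation", 4, 4),
    ("lower LR + accumulation", 8, 4),
    ("more training steps", 8, 4),
]

num_devices = 1  # assumption: single-GPU run


def effective_batch_size(per_device: int, grad_accum: int, devices: int) -> int:
    """Number of examples contributing to one optimizer step."""
    return per_device * grad_accum * devices


effective = [effective_batch_size(bs, ga, num_devices) for _, bs, ga in grid]

for (label, _, _), size in zip(grid, effective):
    print(f"{label}: {size} examples per optimizer step")
```

With the 10-example sanity-check subset, any effective batch size above 10 already covers more than one pass over the data per optimizer step.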

## 4. Initialize and train the DPO trainer

In [28]:
best_loss = float("inf")  # Initialize with a high value
best_hyperparams = None

In [29]:
results = []

for params in hyperparams:
    print(f"\nTraining with batch_size={params['per_device_train_batch_size']}, lr={params['learning_rate']}, grad_accum={params['gradient_accumulation_steps']}, max_steps={params['max_steps']}\n")

    training_args = DPOConfig(
      per_device_train_batch_size=params['per_device_train_batch_size'],
      max_steps=params['max_steps'],
      remove_unused_columns=False,
      gradient_accumulation_steps=params['gradient_accumulation_steps'],
      learning_rate=params['learning_rate'],
      eval_strategy="steps",
      logging_first_step=True,
      logging_steps=5,  # match results in blog post
      eval_steps=500,
      output_dir="./test",
      optim="rmsprop",
      warmup_steps=150,
      report_to=report_to,
      bf16=True,
      gradient_checkpointing=gradient_checkpointing,
      # TODO: uncomment that on the next transformers release
      # gradient_checkpointing_kwargs=gradient_checkpointing_kwargs,
    )

    dpo_trainer = DPOTrainer(
      model=model,
      ref_model=model_ref,
      args=training_args,
      processing_class=tokenizer,
      train_dataset=train_dataset,
      eval_dataset=eval_dataset
    )

    train_output = dpo_trainer.train()

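    # Use the most recent logged training loss. The last log_history entry is
    # usually the end-of-training summary, which has no "loss" key.
    final_loss = next(
        (entry["loss"] for entry in reversed(dpo_trainer.state.log_history) if "loss" in entry),
        float("inf"),
    )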

    results.append({
        "per_device_train_batch_size": params["per_device_train_batch_size"],
        "learning_rate": params["learning_rate"],
        "gradient_accumulation_steps": params["gradient_accumulation_steps"],
        "final_loss": final_loss,
        "train_loss": dpo_trainer.state.log_history
    })

    # Store the model with the lowest loss
    if final_loss < best_loss:
        best_loss = final_loss
        best_hyperparams = params  # Store hyperparameters of best model


Training with batch_size=8, lr=0.001, grad_accum=1, max_steps=1000



| Step | Training Loss | Validation Loss | Rewards/chosen | Rewards/rejected | Rewards/accuracies | Rewards/margins | Logps/chosen | Logps/rejected | Logits/chosen | Logits/rejected |
|---|---|---|---|---|---|---|---|---|---|---|
| 500 | 0.0 | 3.377417 | -3.768539 | -3.538037 | 0.6875 | -0.230503 | -197.556702 | -188.285828 | -129.174255 | -128.991089 |
| 1000 | 0.0 | 3.54719 | -3.953837 | -3.553049 | 0.625 | -0.400788 | -199.409683 | -188.435944 | -129.325958 | -129.258514 |
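
One way to read the evaluation output above: the reward margin is the chosen reward minus the rejected reward, so a negative margin means that the policy, relative to the reference model, ranks the rejected responses higher on average. The sketch below recomputes the margins from the logged values; the `eval_rows` list simply re-types the numbers from the two evaluation steps.

```python
# Values copied from the evaluation output at steps 500 and 1000.
eval_rows = [
    {"step": 500, "rewards_chosen": -3.768539, "rewards_rejected": -3.538037, "logged_margin": -0.230503},
    {"step": 1000, "rewards_chosen": -3.953837, "rewards_rejected": -3.553049, "logged_margin": -0.400788},
]

# Margin = chosen reward - rejected reward.
margins = [row["rewards_chosen"] - row["rewards_rejected"] for row in eval_rows]

for row, margin in zip(eval_rows, margins):
    print(f"step {row['step']}: recomputed margin {margin:.6f}, logged {row['logged_margin']:.6f}")
```

With a training loss of 0.0 and a validation loss that rises from about 3.38 to 3.55, this run looks like it has overfit the 10-example sanity-check subset, so the numbers are best read as a smoke test of the pipeline rather than a meaningful result.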


In [22]:
# results = []

# for params in hyperparams:
#     print(f"\nTraining with batch_size={params['per_device_train_batch_size']}, lr={params['learning_rate']}, grad_accum={params['gradient_accumulation_steps']}, max_steps={params['max_steps']}\n")

#     training_args = DPOConfig(
#       num_train_epochs=params["num_train_epochs"],
#       learning_rate=params["learning_rate"],
#       per_device_train_batch_size=params["per_device_train_batch_size"],
#       do_eval=True,
#       per_device_eval_batch_size=params["per_device_eval_batch_size"],
#       adam_epsilon=1e-08,
#       lr_scheduler_type="linear",
#       warmup_ratio=0.1,
#       seed=42,
#       logging_steps=100,
#       save_steps=500,
#       save_strategy="steps",
#       output_dir="./test",
#       bf16=True,
#       remove_unused_columns=False,
#       gradient_checkpointing=gradient_checkpointing,
#       # TODO: uncomment that on the next transformers release
#       # gradient_checkpointing_kwargs=gradient_checkpointing_kwargs,
#     )
#     training_args = DPOConfig(
#         max_steps=params["max_steps"],
#         gradient_accumulation_steps=params["gradient_accumulation_steps"],
#         evaluation_strategy="steps",
#         logging_first_step=True,
#         logging_steps=5, # match results in blog post
#         eval_steps=500,
#         optim="rmsprop",
#         warmup_steps=150,
#         report_to=report_to,
#     )

#     dpo_trainer = DPOTrainer(
#         model=model,
#         ref_model=model_ref,
#         args=training_args,
#         tokenizer=tokenizer,
#         train_dataset=train_dataset,
#         eval_dataset=eval_dataset
#     )

#     train_output = dpo_trainer.train()

#     # Get final loss value
#     final_loss = dpo_trainer.state.log_history[-1]["loss"] if "loss" in dpo_trainer.state.log_history[-1] else float("inf")

#     results.append({
#         "per_device_train_batch_size": params["per_device_train_batch_size"],
#         "learning_rate": params["learning_rate"],
#         "gradient_accumulation_steps": params["gradient_accumulation_steps"],
#         "final_loss": final_loss,
#         "train_loss": dpo_trainer.state.log_history
#     })

#     # Store the model with the lowest loss
#     if final_loss < best_loss:
#         best_loss = final_loss
#         best_hyperparams = params  # Store hyperparameters of best model

In [None]:
# import matplotlib.pyplot as plt


# # Plot training loss trends
# plt.figure(figsize=(10, 5))
# for res in results:
#     losses = [x["loss"] for x in res["train_loss"] if "loss" in x]
#     steps = list(range(1, len(losses) + 1))
#     plt.plot(steps, losses, label=f"BS={res['per_device_train_batch_size']}, LR={res['learning_rate']}, GA={res['gradient_accumulation_steps']}")

# plt.xlabel("Training Steps")
# plt.ylabel("Loss")
# plt.title("Training Loss Across Different Hyperparameters")
# plt.legend()
# plt.show()

In [31]:
model.save_pretrained("dpo-gpt2-optimized-model")
tokenizer.save_pretrained("dpo-gpt2-optimized-model")

('dpo-gpt2-optimized-model/tokenizer_config.json',
 'dpo-gpt2-optimized-model/special_tokens_map.json',
 'dpo-gpt2-optimized-model/vocab.json',
 'dpo-gpt2-optimized-model/merges.txt',
 'dpo-gpt2-optimized-model/added_tokens.json',
 'dpo-gpt2-optimized-model/tokenizer.json')

### Upload Model to HuggingFace

In [32]:
from huggingface_hub import login

login()


In [33]:
from huggingface_hub import HfApi

repo_name = "myamjechal/dpo-gpt2-optimized-model"
api = HfApi()

# Create a new model repo if it doesn’t exist
api.create_repo(repo_name, exist_ok=True)

# Push model and tokenizer to Hugging Face Hub
model.push_to_hub(repo_name)
tokenizer.push_to_hub(repo_name)

print(f"Model uploaded! View it here: https://huggingface.co/{repo_name}")

No files have been modified since last commit. Skipping to prevent empty commit.


Model uploaded! View it here: https://huggingface.co/myamjechal/dpo-gpt2-optimized-model


### Load model from HuggingFace and Test

In [41]:
from transformers import AutoModelForCausalLM, AutoTokenizer

loaded_model = AutoModelForCausalLM.from_pretrained("myamjechal/dpo-gpt2-optimized-model")
loaded_tokenizer = AutoTokenizer.from_pretrained("myamjechal/dpo-gpt2-optimized-model")

In [43]:
prompt = "What does an electrician do?"
inputs = loaded_tokenizer(prompt, return_tensors="pt").to(device)
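# Generate with the model reloaded from the Hub, moved to the same device as the inputs
outputs = loaded_model.to(device).generate(**inputs, max_length=100)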
response = loaded_tokenizer.decode(outputs[0], skip_special_tokens=True)

print("Model Response:", response)

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Model Response: What does an electrician do?

An electrician is responsible for the operation of the electrical system, including the electrical system's electrical system maintenance, repair, and replacement, and for the maintenance of the electrical system, including the electrical system's electrical system network, and for the maintenance of the electrical system network, including the electrical system network's electrical system network maintenance, repair, and replacement, and for the maintenance of the electrical system, including the electrical system network, and for the maintenance of


In [None]:
# # download as zip in content
# !zip -r /content/dpo_project.zip /content