# RLHF in 2024 with DPO & Hugging Face

This blog post walks you through how to use DPO to improve open LLMs using Hugging Face [TRL](https://huggingface.co/docs/trl/index), [Transformers](https://huggingface.co/docs/transformers/index) & [datasets](https://huggingface.co/docs/datasets/index) in 2024. 

Research and experiments suggest that DPO should only be applied after SFT. This mens we need an already fine-tuned LLM, which can be aligned with DPO. In this example we will use [cognitivecomputations/dolphin-2.1-mistral-7b](https://huggingface.co/cognitivecomputations/dolphin-2.1-mistral-7b) a fine-tuned Mistral 7B with ChatML template. 

1. Setup development environment
2. Create and prepare the preference dataset
3. Align LLM with trl and the DPOTrainer
4. Test and evaluate the LLM on MT-Bench

## **1. Setup development environment**

Our first step is to install Hugging Face Libraries and Pytorch, including trl, transformers and datasets. If you haven't heard of trl yet, don't worry. It is a new library on top of transformers and datasets, which makes it easier to fine-tune, rlhf, align open LLMs.

In [None]:
# Install Pytorch & other libraries
!pip install "torch==2.1.2" tensorboard

# Install Hugging Face libraries
!pip install  --upgrade \
  "transformers[sentencepiece]==4.36.2" \
  "datasets==2.16.1" \
  "accelerate==0.26.1" \
  "evaluate==0.4.1" \
  "bitsandbytes==0.42.0" \
  # "trl==0.7.10" # \
  # "peft==0.7.1" \

# install peft & trl from github
!pip install git+https://github.com/huggingface/trl@a3c5b7178ac4f65569975efadc97db2f3749c65e --upgrade
!pip install git+https://github.com/huggingface/peft@4a1559582281fc3c9283892caea8ccef1d6f5a4f --upgrade

If you are using a GPU with Ampere architecture (e.g. NVIDIA A10G or RTX 4090/3090) or newer, you can use Flash Attention. **[FlashAttention](https://github.com/Dao-AILab/flash-attention/tree/main)** can accelerate training time up to 3x.

*Note: If your machine has less than 96GB of RAM and lots of CPU cores, reduce the number of **`MAX_JOBS`**. On the **`g5.2xlarge`** we used **`4`**.*

In [None]:
import torch; assert torch.cuda.get_device_capability()[0] >= 8, 'Hardware not supported for Flash Attention'
# install flash-attn
!pip install ninja packaging
!MAX_JOBS=4 pip install flash-attn --no-build-isolation

_Installing flash attention can take quite a bit of time (10-45 minutes)._

We will use the **[Hugging Face Hub](https://huggingface.co/models)** as a remote model storage and automatically push our model, logs and information to the Hub during training. You must register on the **[Hugging Face](https://huggingface.co/join)** for this. After you have an account, we will use the **`login`** util from the **`huggingface_hub`** package to log into our account and store our token (access key) on the disk.

In [None]:
from huggingface_hub import login

login(
  token="", # ADD YOUR TOKEN HERE
  add_to_git_credential=True
)


## 3. Create and prepare the dataset

Improving the helpfulness or quality of LLMs through Aligning methods like DPO doesn’t come for free. Compared to traditional supervised fine-tuning (SFT) alignment methods require preference data. Preference data is crucial as it serves as a proxy against which the model's outputs are evaluated and aligned. A typical DPO dataset includes a triplet out of prompt, chosen, and rejected response. There are several ways to create such a dataset, including:

- Using existing open-source datasets, e.g., [SHP](https://huggingface.co/datasets/stanfordnlp/SHP)
- Using LLMs to create synthetic preferences, e.g., [Ultrafeedback](https://www.notion.so/9de9ac96f0f94aa5aed96361a26e8bf0?pvs=21)
- Using Humans to create datasets, e.g., [HH](https://www.notion.so/SageMaker-bi-weekly-sync-0be2e6ba876a4599b4c0da2681dfb78f?pvs=21)
- Using a combination of the above methods, e.g., [Orca DPO](https://huggingface.co/datasets/Intel/orca_dpo_pairs)

Each method has advantages and disadvantages and depends on the budget, time, and quality requirements. 

*It's important to recognize that preference datasets can inherently reflect the biases of the human/AI they are based on. To ensure broader applicability and fairness, it's crucial to incorporate a diverse range of feedback in creating these datasets.*

In our example, we will use the [argilla/ultrafeedback-binarized-preferences-cleaned](https://huggingface.co/datasets/argilla/ultrafeedback-binarized-preferences-cleaned?row=0) dataset. The best DPO dataset represents the real-world preferences of your users or customers. If you don’t have collected preferences yet, start with your existing SFT data and use different sizes/quality LLMs to generate feedback. This method was used to create the Orca DPO dataset, where GPT-4 was used for the accepted responses and Llama 70B Chat for the rejected responses. A DPO dataset will have the following format

```json
{"chosen": "<prompt + good response>", "rejected": "<prompt + worse response>" }
{"chosen": "<prompt + good response>", "rejected": "<prompt + worse response>" }
{"chosen": "<prompt + good response>", "rejected": "<prompt + worse response>" }
```

The `<pompt + good response>` and `<prompt + worse response>` are representend in the `conversational` format as: 

```json
[{"role": "system", "content": "You are..."}, {"role": "user", "content": "..."}, {"role": "assistant", "content": "..."}]
```

_**Note:** If the dataset includes multiple turns you need to make sure that only the last turn between chosen and rejected is different. If not, you must reduce the conversation until only the last assistant turn is different._


The DPOTrainer expects the inputs as triples of (prompt, chosen, rejected), where `chosen` and `rejected` are the final turn of a dialogue and the `prompt` is N-1 turns. Those inputs also need to be already formated with the tempalte of the model, e.g. `<|im_start|>user\nINSTRUCTION\n<|im_end|>\n<|im_start|>assistant\n...`.

In our example we are going to load our open-source dataset using the 🤗 Datasets library and then convert it into the correct format. The  [argilla/ultrafeedback-binarized-preferences-cleaned](https://huggingface.co/datasets/argilla/ultrafeedback-binarized-preferences-cleaned?row=0) already comes with the DPO format (chosen/rejected). This means we can create our triplet and templetite it usng a `tokenizer` and the `apply_chat_template` methoh. We are randomly downsampling the dataset to 10,000 train samples and 2500 evaluation samples.

_Note: This step can be different for your use case. For example, if you might need to create the `conversational` format and concate the prompt and chosen/rejected response._



In [18]:
from transformers import AutoTokenizer
from datasets import load_dataset

# Load Tokenizer from the hub
model_id = "cognitivecomputations/dolphin-2.1-mistral-7b" # replace with your model id
tokenizer = AutoTokenizer.from_pretrained(model_id)

# Load dataset from the hub
dataset = load_dataset("argilla/ultrafeedback-binarized-preferences-cleaned", split="train")
dataset = dataset.shuffle().select(range(12500))

def rec_extract_assistant_messages(messages, index=-1):
  """Recursively extract the last assistant messages from the end of the conversation."""
  if messages[index]["role"] == "assistant":
    return [messages[index]]
  else:
    return rec_extract_assistant_messages(messages, index-1)
    
# System message used if there is no system message at the beginning of the conversation
# Can be repelaced and modified as needed
DEFAULT_SYSTEM_MESSAGE = "You are Dolphin, a helpful AI assistant."

def create_triplets(example, tokenizer, default_system_message=DEFAULT_SYSTEM_MESSAGE):
  """Create the triplets (prompt, chosen, rejected)"""
  # Extract the N-1 turns to form the prompt
  # Prepend a system message if the first message is not a system message
  prompt_messages = example["chosen"][:-1]
  if example["chosen"][0]["role"] != "system":
      prompt_messages.insert(0, {"role": "system", "content": default_system_message})
  # Now we extract the final assistant turn to define chosen/rejected responses 
  chosen_messages = rec_extract_assistant_messages(example["chosen"])
  rejected_messages = rec_extract_assistant_messages(example["rejected"])
  
  # apply template to the messages and return the triplets
  return {
    "prompt": tokenizer.apply_chat_template(prompt_messages, tokenize=False),
    "chosen": tokenizer.apply_chat_template(chosen_messages, tokenize=False),
    "rejected": tokenizer.apply_chat_template(rejected_messages, tokenize=False)
  }

dataset = dataset.map(create_triplets, remove_columns=dataset.features, fn_kwargs={"tokenizer": tokenizer})  
# split dataset into 10,000 training samples and 2,500 test samples
dataset = dataset.train_test_split(test_size=2500/12500)

# print sample cut of 
print(dataset["train"][0]["prompt"][:50])
print(dataset["train"][0]["chosen"][:50])
print(dataset["train"][0]["rejected"][:50])

# save datasets to disk
dataset["train"].to_json("train_dataset.json", orient="records")
dataset["test"].to_json("test_dataset.json", orient="records")

Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


Map:   0%|          | 0/12500 [00:00<?, ? examples/s]

<|im_start|>system
You are Dolphin, a helpful AI a
<|im_start|>assistant
The connecting word between 
<|im_start|>assistant
Your task is to identify the


Creating json from Arrow format:   0%|          | 0/10 [00:00<?, ?ba/s]

Creating json from Arrow format:   0%|          | 0/3 [00:00<?, ?ba/s]

8606974

## 4. Align LLM with trl and the DPOTrainer

TRL supports the DPO through a dedicated [DPOTrainer](https://huggingface.co/docs/trl/dpo_trainer) for alinging LLMs from preference data, as described in [Direct Preference Optimization: Your Language Model is Secretly a Reward Model](https://arxiv.org/abs/2305.18290). The `DPOTrainer` is a subclass of the `Trainer` from the `transformers` library and supports all the same features, including logging, evaluation, and checkpointing. 

One big difference to SFT is that for DPO we need an additional Reference Model, which is used for KL-Divergence to help stabilize the training. The Reference Model is normally the same model as the one we are training, but frozen. This means for DPO you need additional memory and compute resources. To keep our example efficient we will use PEFT and adatpers. We load your fine-tuned and then add a new trainable adapters. This means that we will only tune adapters and not the whole model using DPO. The origian model will be then used as reference model itself. If you want to train all parameter with DPO you need to provide a `model` and `reference_model, but this requires more memory and compute resources.

Lets start by loading our saved datasets from disk. 

In [19]:
from datasets import load_dataset

# Load jsonl data from disk
train_dataset = load_dataset("json", data_files="train_dataset.json", split="train")
eval_dataset = load_dataset("json", data_files="test_dataset.json", split="train")

Generating train split: 0 examples [00:00, ? examples/s]

Generating train split: 0 examples [00:00, ? examples/s]

In [20]:
# find p95 length of the prompt
from numpy import percentile
p95_prompt_length = percentile([len(tokenizer(x)["input_ids"]) for x in train_dataset["prompt"]], 99)
print(f"p99 prompt length: {p95_prompt_length}")

# find p99 of prompt + chosen
p99_prompt_chosen_length = percentile([len(tokenizer(x["prompt"] + x["chosen"])["input_ids"]) for x in train_dataset], 99)
print(f"p99 prompt + chosen length: {p99_prompt_chosen_length}")

p99 prompt length: 1328.0300000000007
p99 prompt + chosen length: 1722.0200000000004



In this example we are going to train [cognitivecomputations/dolphin-2.1-mistral-7b](https://huggingface.co/cognitivecomputations/dolphin-2.1-mistral-7b). Dolphin is a fine-tuned Mistral 7B with ChatML template support system messages. You can easily swap out the model for another model, e.g. [Mistral](https://huggingface.co/mistralai/Mistral-7B-Instruct-v0.2) or [Mixtral](https://huggingface.co/mistralai/Mixtral-8x7B-Instruct-v0.1) models, TII [Falcon](https://huggingface.co/tiiuae/falcon-40b), or any other LLMs by changing our `model_id` variable.

_Note: Be aware the bigger the model the more memory it will require. In our example we will use the 7B version, which can be tuned on 24GB GPUs. If you have a smaller GPU._

The first step is to load the model in int-4 using `bitsandbytes` and then add 

In [None]:
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig
from trl import setup_chat_format

# Hugging Face model id
model_id = "cognitivecomputations/dolphin-2.1-mistral-7b" # replace with your model id

# BitsAndBytesConfig int-4 config
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True, bnb_4bit_use_double_quant=True, bnb_4bit_quant_type="nf4", bnb_4bit_compute_dtype=torch.bfloat16
)

# Load model and tokenizer
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",
    attn_implementation="flash_attention_2",
    torch_dtype=torch.bfloat16,
    quantization_config=bnb_config
)
tokenizer = AutoTokenizer.from_pretrained(model_id)
tokenizer.padding_side = 'right' # to prevent warnings

The `DPOTrainer` supports a native integration with `peft`, which makes it super easy to efficiently align LLMs using, e.g. QLoRA. We only need to create our `LoraConfig` and provide it to the trainer. Our `LoraConfig` parameters are the same as for the SFT example.

In [None]:
from peft import LoraConfig

# LoRA config based on QLoRA paper & Sebastian Raschka experiment
peft_config = LoraConfig(
        lora_alpha=128,
        lora_dropout=0.05,
        r=256,
        bias="none",
        target_modules="all-linear",
        task_type="CAUSAL_LM", 
)

Before we can start our training we need to define the hyperparameters (`TrainingArguments`), DPO parameters and find the prompt and max sequence length. 

In [26]:
from transformers import TrainingArguments

args = TrainingArguments(
    output_dir="doplhin-dpo",               # directory to save and repository id
    num_train_epochs=3,                     # number of training epochs
    per_device_train_batch_size=3,          # batch size per device during training
    gradient_accumulation_steps=2,          # number of steps before performing a backward/update pass
    gradient_checkpointing=True,            # use gradient checkpointing to save memory
    optim="adamw_torch_fused",              # use fused adamw optimizer
    logging_steps=10,                       # log every 10 steps
    save_strategy="epoch",                  # save checkpoint every epoch
    learning_rate=2e-4,                     # learning rate, based on QLoRA paper
    bf16=True,                              # use bfloat16 precision
    tf32=True,                              # use tf32 precision
    max_grad_norm=0.3,                      # max gradient norm based on QLoRA paper
    warmup_ratio=0.1,                      # warmup ratio based on QLoRA paper
    lr_scheduler_type="cosine",             # use cosine learning rate scheduler
    push_to_hub=True,                       # push model to hub
    report_to="tensorboard",                # report metrics to tensorboard
)

dpo_args = {
    "beta": 0.05,                           # The beta factor in DPO loss. Higher beta means less divergence
    "loss_type": "sigmoid"                  # The loss type for DPO.
}

# lets find the p99 length of the prompt and go to next multiple of 8, why 8? Because 8 is great, it's the number of bits in a byte
prompt_length = int(percentile([len(tokenizer(x)["input_ids"]) for x in train_dataset["prompt"]], 99))
prompt_length = ((prompt_length + 7) // 8) * 8
print(f"p99 prompt length: {prompt_length}")

# lets find the p99 length of the prompt + chosen
max_seq_length = int(percentile([len(tokenizer(x["prompt"] + x["chosen"])["input_ids"]) for x in train_dataset], 99))
max_seq_length = ((max_seq_length + 7) // 8) * 8
print(f"p99 prompt + chosen length: {max_seq_length}")

p99 prompt length: 1328
p99 prompt + chosen length: 1728


We now have every building block we need to create our `SFTTrainer` to start then training our model.

In [None]:
from trl import DPOTrainer
from numpy import percentile


trainer = DPOTrainer(
    model,
    ref_model=None, # set to none since we use peft
    peft_config=peft_config,
    args=args,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    tokenizer=tokenizer,
    max_length=max_seq_length,
    max_prompt_length= prompt_length,
    beta=dpo_args["beta"],
    loss_type=dpo_args["loss_type"],
)


Start training our model by calling the `train()` method on our `Trainer` instance. This will start the training loop and train our model for 2 epochs. Since we are using a PEFT method, we will only save the adapted model weights and not the full model.

In [None]:
# start training, the model will be automatically saved to the hub and the output directory
trainer.train()

# save model 
trainer.save_model()

The training with Flash Attention for 3 epochs with a dataset of 10k samples took 01:29:58 on a `g5.2xlarge`. The instance costs `1,212$/h` which brings us to a total cost of only `1.8$`.

In [None]:
# free the memory again
del model
del trainer
torch.cuda.empty_cache()

### _Optional: Merge LoRA adapter in to the original model_

When using QLoRA, we only train adapters and not the full model. This means when saving the model during training we only save the adapter weights and not the full model. If you want to save the full model, which makes it easier to use with Text Generation Inference you can merge the adapter weights into the model weights using the `merge_and_unload` method and then save the model with the `save_pretrained` method. This will save a default model, which can be used for inference.

_Note: You might require > 30GB CPU Memory._

In [None]:

#### COMMENT IN TO MERGE PEFT AND BASE MODEL ####
# from peft import PeftModel, PeftConfig
# from transformers import AutoModelForCausalLM, AutoTokenizer
# from peft import AutoPeftModelForCausalLM

# # Load PEFT model on CPU
# model = AutoPeftModelForCausalLM.from_pretrained(
#     args.output_dir,
#     torch_dtype=torch.float16,
#     low_cpu_mem_usage=True,
# )  
# # Merge LoRA and base model and save
# merged_model = model.merge_and_unload()
# merged_model.save_pretrained(args.output_dir,safe_serialization=True, max_shard_size="2GB")

## 4. Test and evaluate the LLM on MT-Bench

After the training is done we want to evaluate and test our model. We will load different samples from the original dataset and evaluate the model on those samples, using a simple loop and accuracy as our metric. 

_Note: Evaluating Generative AI models is not a trivial task since 1 input can have multiple correct outputs. If you want to learn more about evaluating generative models, check out [Evaluate LLMs and RAG a practical example using Langchain and Hugging Face](https://www.philschmid.de/evaluate-llm) blog post._



In [None]:
import torch
from peft import AutoPeftModelForCausalLM
from transformers import AutoTokenizer, pipeline 

peft_model_id = "./code-llama-7b-text-to-sql"
# peft_model_id = args.output_dir

# Load Model with PEFT adapter
model = AutoPeftModelForCausalLM.from_pretrained(
  peft_model_id,
  device_map="auto",
  torch_dtype=torch.float16
)
# load into pipeline
pipe = pipeline("text-generation", model=model, tokenizer=tokenizer)

Let’s load our test dataset try to generate an instruction.

In [None]:
from datasets import load_dataset 
from random import randint


# Load our test dataset
eval_dataset = load_dataset("json", data_files="test_dataset.json", split="train")
rand_idx = randint(0, len(eval_dataset))

# Test on sample 
prompt = pipe.tokenizer.apply_chat_template(eval_dataset[rand_idx]["messages"][:2], tokenize=False, add_generation_prompt=True)
outputs = pipe(prompt, max_new_tokens=256, do_sample=False, temperature=0.1, top_k=50, top_p=0.1, eos_token_id=pipe.tokenizer.eos_token_id, pad_token_id=pipe.tokenizer.pad_token_id)

print(f"Query:\n{eval_dataset[rand_idx]['messages'][1]['content']}")
print(f"Original Answer:\n{eval_dataset[rand_idx]['messages'][2]['content']}")
print(f"Generated Answer:\n{outputs[0]['generated_text'][len(prompt):].strip()}")

Nice! Our model was able to generate a SQL query based on the natural language instruction. Lets evaluate our model on the full 2,500 samples of our test dataset. 
_Note: As mentioned above, evaluating generative models is not a trivial task. In our example we used the accuracy of the generated SQL based on the ground truth SQL query as our metric. An alternative way could be to automatically execute the generated SQL query and compare the results with the ground truth. This would be a more accurate metric but requires more work to setup._

In [None]:
from tqdm import tqdm


def evaluate(sample):
    prompt = pipe.tokenizer.apply_chat_template(sample["messages"][:2], tokenize=False, add_generation_prompt=True)
    outputs = pipe(prompt, max_new_tokens=256, do_sample=True, temperature=0.7, top_k=50, top_p=0.95, eos_token_id=pipe.tokenizer.eos_token_id, pad_token_id=pipe.tokenizer.pad_token_id)
    predicted_answer = outputs[0]['generated_text'][len(prompt):].strip()
    if predicted_answer == sample["messages"][2]["content"]:
        return 1 
    else:
        return 0

success_rate = []
number_of_eval_samples = 1000
# iterate over eval dataset and predict
for s in tqdm(eval_dataset.shuffle().select(range(number_of_eval_samples))):
    success_rate.append(evaluate(s))

# compute accuracy
accuracy = sum(success_rate)/len(success_rate)

print(f"Accuracy: {accuracy*100:.2f}%")  
        

We evaluated our model on 1000 samples from the evaluation dataset and got an accuracy of 79.50%, which took ~25 minutes. 
This is quite good, but as mentioned you need to take this metric with a grain of salt. It would be better if we could evaluate our model by running the qureies against a real database and compare the results. Since there might be different "correct" SQL queries for the same instruction. There are also several ways on how we could improve the performance by using few-shot learning, using RAG, Self-healing to generate the SQL query.