
by [Pere Martra](https://github.com/peremartra)

# Aligning a Gemma-2 2B model with DPO
This notebook demonstrates how to align a Gemma-2 model using DPO (Direct Preference Optimization).
<table align="left">
  <td>
    <a target="_blank" href="https://colab.research.google.com/github/google-gemini/gemma-cookbook/blob/main/Gemma/Aligning_DPO_Gemma_2b_it.ipynb"><img src="https://www.tensorflow.org/images/colab_logo_32px.png" />Run in Google Colab</a>
  </td>
</table>



## Setup

### Select the Colab runtime
To complete this tutorial, you'll need to have a Colab runtime with sufficient resources to run the Gemma model and load the dataset. In this case, you can use an L4 GPU:

1. In the upper-right of the Colab window, select **▾ (Additional connection options)**.
2. Select **Change runtime type**.
3. Under **Hardware accelerator**, select **L4 GPU**.

### Gemma setup

**Before we dive into the tutorial, let's get you set up with Gemma:**

1. **Hugging Face Account:**  If you don't already have one, you can create a free Hugging Face account by clicking [here](https://huggingface.co/join).
2. **Gemma Model Access:** Head over to the [Gemma-2 model page](https://huggingface.co/google/gemma-2-2b-it) and accept the usage conditions.
3. **Colab with Gemma Power:**  For this tutorial, you'll need a Colab runtime with enough resources to handle the Gemma-2 2B model. Choose an appropriate runtime when starting your Colab session.
4. **Hugging Face Token:**  Generate a Hugging Face access token (preferably with `write` permission) by clicking [here](https://huggingface.co/settings/tokens). You'll need this token later in the tutorial.

**Once you've completed these steps, you're ready to move on to the next section where we'll set up environment variables in your Colab environment.**

### Configure your HF token

Add your Hugging Face token to the Colab Secrets manager to securely store it.

1. Open your Google Colab notebook and click on the 🔑 Secrets tab in the left panel. <img src="https://storage.googleapis.com/generativeai-downloads/images/secrets.jpg" alt="The Secrets tab is found on the left panel." width=50%>
2. Create a new secret with the name `HF_TOKEN`.
3. Copy/paste your token key into the Value input box of `HF_TOKEN`.
4. Toggle the button on the left to allow notebook access to the secret.

In [3]:
import os
from google.colab import userdata
# Note: `userdata.get` is a Colab API. If you're not using Colab, set the env
# vars as appropriate for your system.
os.environ["HF_TOKEN"] = userdata.get("HF_TOKEN")

Since it’s necessary to save the model we create, the notebook mounts a disk on Google Drive. If you're running it locally on your computer, you don't need to run this line of code. You can also run it on Google Colab without mounting a disk in your Google Drive. However, if you do that, the saved model will be stored in a temporary directory, and you'll lose it every time you close the session.

In [4]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


## Introduction to DPO



Direct Preference Optimization (DPO) is a model alignment technique similar to Reinforcement Learning from Human Feedback (RLHF). Both methods are used to align a model with the preferences or needs of its users. However, DPO has become more popular in many projects because it achieves comparable results to RLHF while requiring significantly fewer resources.

Both techniques start with a dataset that contains examples of correct and incorrect responses to the same prompt.

Here is where the methods diverge. In RLHF, this dataset is used to train a second model, known as a reward model, which plays a crucial role in the alignment process. In contrast, DPO uses the dataset directly to train the final model. This is the primary difference between the two techniques.

As you might imagine, DPO is a more straightforward approach that demands fewer resources.

The implementation of DPO you will be using is developed by Hugging Face in their TRL (Transformer Reinforcement Learning) library. Although DPO does not train a separate reward model, it can be viewed as a close relative of reinforcement learning, where the model is implicitly "rewarded" during training for preferring the chosen responses over the rejected ones.
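
To make the idea of training directly on preference pairs more concrete, here is a minimal sketch of the per-example DPO loss in plain Python. The function name `dpo_loss` and the log-probability values are illustrative assumptions, not TRL's internal code; `beta` plays the same role as the `beta` argument passed to the trainer later in this notebook.

```python
import math


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Per-example DPO loss: -log(sigmoid(beta * margin)).

    Each argument is the summed log-probability of a full response under
    the model being trained (policy) or the frozen reference model.
    """
    # How much more the policy prefers each response than the reference does.
    chosen_logratio = policy_chosen_logp - ref_chosen_logp
    rejected_logratio = policy_rejected_logp - ref_rejected_logp
    margin = beta * (chosen_logratio - rejected_logratio)
    # -log(sigmoid(x)) == log(1 + exp(-x))
    return math.log1p(math.exp(-margin))


# Hypothetical log-probabilities, chosen only for illustration.
start = dpo_loss(-12.0, -30.0, -12.0, -30.0)   # policy == reference
better = dpo_loss(-8.0, -45.0, -12.0, -30.0)   # policy now favors "chosen"
print(round(start, 4), round(better, 4))
```

When the policy still equals the reference model, the margin is zero and the loss is ln 2 (about 0.693). The loss shrinks as the policy raises the chosen response relative to the rejected one, and no reward model appears anywhere in the computation.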



## Install dependencies
Run the cell below to install all the required dependencies.

In [5]:
!pip install -q torch==2.3.1+cu121
!pip install -q transformers==4.43.0
!pip install -q datasets==2.19.1
!pip install -q trl==0.8.6
!pip install -q peft==0.11.1
!pip install -q bitsandbytes==0.43.1
!pip install -q sentencepiece==0.1.99
!pip install -q accelerate==0.30.1
!pip install -q huggingface_hub==0.23.2


In [6]:
#Import necessary classes.
import gc
import torch
import transformers

from transformers import AutoModelForCausalLM, AutoTokenizer, TrainingArguments, BitsAndBytesConfig
from datasets import load_dataset
from peft import LoraConfig, get_peft_model, PeftModel
from trl import DPOTrainer

Another necessary step is to log in to Hugging Face.

In [7]:
from huggingface_hub import login

login(os.environ["HF_TOKEN"])

The token has not been saved to the git credentials helper. Pass `add_to_git_credential=True` in this function directly or `--add-to-git-credential` if using via `huggingface-cli` if you want to set the git credential as well.
Token is valid (permission: write).
Your token has been saved to /root/.cache/huggingface/token
Login successful


## Loading the dataset
The chosen dataset is [distilabel capybara](https://huggingface.co/datasets/argilla/distilabel-capybara-dpo-7k-binarized), which consists of prompts, each paired with a chosen (preferred) response and a rejected response.

Before using it for training, the dataset's content needs to be formatted correctly to ensure compatibility with the DPO alignment process.

To process the dataset, it's necessary to load the tokenizer.

In [8]:
model_name = "google/gemma-2-2b-it"
# Tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_name)

tokenizer_config.json:   0%|          | 0.00/47.0k [00:00<?, ?B/s]

tokenizer.model:   0%|          | 0.00/4.24M [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/17.5M [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/636 [00:00<?, ?B/s]

Before you begin aligning the model, it's necessary to load the dataset and transform it to fit the format required by the DPOTrainer class. This format consists of three fields: the prompt, the chosen answer, and the rejected answer.

In this example, I load all the rows of the dataset and filter them afterwards. If you want to reduce the time needed for alignment, or fit the process on a smaller GPU, you can load fewer rows by reducing the size of the split. If you prefer a more complete fine-tuning process, feel free to use the full dataset.

Using the full dataset, you may need close to an hour on an A100 GPU to train for 6 epochs.

In [9]:
# Load dataset
dataset_original =  load_dataset("argilla/distilabel-capybara-dpo-7k-binarized",
                                 split='train[:]')

# Save columns
original_columns = dataset_original.column_names
print(original_columns)

Downloading readme:   0%|          | 0.00/11.7k [00:00<?, ?B/s]

Downloading data:   0%|          | 0.00/156M [00:00<?, ?B/s]

Generating train split:   0%|          | 0/7563 [00:00<?, ? examples/s]

['source', 'conversation', 'original_response', 'generation_prompt', 'raw_generation_responses', 'new_generations', 'prompt', 'chosen', 'rejected', 'rating_chosen', 'rating_rejected', 'chosen_model', 'rejected_model']


In [10]:
dataset_original

Dataset({
    features: ['source', 'conversation', 'original_response', 'generation_prompt', 'raw_generation_responses', 'new_generations', 'prompt', 'chosen', 'rejected', 'rating_chosen', 'rating_rejected', 'chosen_model', 'rejected_model'],
    num_rows: 7563
})

The dataset contains more columns than are strictly necessary for the DPO process. However, I will use a couple of these additional columns to filter the data that will be used.

In [11]:
dataset_filtered = dataset_original.filter(
  lambda r: r["source"]=="GOAT" and (r["rating_chosen"]>=4.0 and r["rating_rejected"] >= 2)
)

Filter:   0%|          | 0/7563 [00:00<?, ? examples/s]

First, the records whose source is GOAT are selected. This part of the dataset deals with questions involving mathematical operations, and it tries to align the model to only return numbers instead of an explanation with the result.

I'll provide a couple of examples to illustrate the dataset and understand the change in the model's behavior once it's aligned.

| prompt | chosen | rejected |
| -- | ---- | ---- |
| Find 40 * 865. Exclude words; <br> show only the math. | 40 * 865 = 34600 | 34,600\n\n40 x 865 = 34,600 <br>\nNote: The response is purely numerical and does not contain any words or text that are not required for the calculation. <br>It is helpful, accurate, truthful, and harmless. <br>The final result is 34,600. |
| Evaluate the equation 5725841 multiplied by 5 and provide the answer.<br> Limit your response to mathematical expressions and symbols. | 5725841 * 5 = 28629205 | To provide the answer as requested, <br>we will simply calculate the multiplication:<br>\n5 × 5725841 = 28629205\nAnswer: 28629205 |

The same filter also keeps only the rows where the chosen response has a rating of at least 4.0 and the rejected response a rating of at least 2. This approach aims to facilitate the model's learning, although it may not be as helpful in the later epochs of training.

Next, I will apply a second filter that keeps only conversations with fewer than three messages, which keeps the prompt length under control.

In [12]:
dataset_filtered = dataset_filtered.map(lambda r: {"messages": len(r["chosen"])}).filter(lambda r: r["messages"]<3)

Map:   0%|          | 0/46 [00:00<?, ? examples/s]

Filter:   0%|          | 0/46 [00:00<?, ? examples/s]

In [13]:
dataset_filtered

Dataset({
    features: ['source', 'conversation', 'original_response', 'generation_prompt', 'raw_generation_responses', 'new_generations', 'prompt', 'chosen', 'rejected', 'rating_chosen', 'rating_rejected', 'chosen_model', 'rejected_model', 'messages'],
    num_rows: 46
})

The dataset still contains all the original columns, but the number of rows has been significantly reduced. I should warn you that 46 rows are too few for proper training; this reduction is intended to allow the notebook to execute in just a few minutes and still produce results. Even so, they are enough to shift the model's responses toward the content of the dataset.

Now, the next step is to create a function to adapt the dataset’s structure to meet the requirements of the **DPOTrainer** class.

In summary, the function will take a row from the dataset and extract only the three necessary columns. Additionally, it applies a minor formatting adjustment to the responses, adapting them to the model's required format by appending the `<end_of_turn>` token after each response.

In [14]:
def chatml_format(example):
    # get everything except the last message as input
    prompt = tokenizer.apply_chat_template(example["chosen"][:-1], tokenize=False,
                                           add_generation_prompt=True)
    # get the last assistant responses
    chosen = example["chosen"][-1]["content"] + "<end_of_turn>\n"
    rejected = example["rejected"][-1]["content"] + "<end_of_turn>\n"

    return {
        "prompt": prompt,
        "chosen": chosen,
        "rejected": rejected,
    }

I’ll use the dataset’s **map** function to apply the transformation to each row and remove the original columns.

In [15]:
# Format dataset
dataset = dataset_filtered.map(
    chatml_format,
    remove_columns=dataset_filtered.column_names
)

Map:   0%|          | 0/46 [00:00<?, ? examples/s]

In [16]:
# Print sample
dataset[3]

{'prompt': '<bos><start_of_turn>user\nEvaluate the equation 5725841 multiplied by 5 and provide the answer. Limit your response to mathematical expressions and symbols.<end_of_turn>\n<start_of_turn>model\n',
 'chosen': '5725841 * 5 = 28629205<end_of_turn>\n',
 'rejected': ' To provide the answer as requested, we will simply calculate the multiplication:\n5 × 5725841 = 28629205\nAnswer: 28629205<end_of_turn>\n'}

Now the dataset contains only the necessary columns, with the texts adapted to the format required by Gemma.
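
To see what this formatting amounts to, the following sketch rebuilds the same prompt layout by hand, without a tokenizer. The helper `format_gemma_prompt` is a hypothetical approximation written for illustration; in the notebook itself, `tokenizer.apply_chat_template` remains the source of truth.

```python
def format_gemma_prompt(messages):
    """Hand-written approximation of Gemma's chat template (illustrative only)."""
    # Gemma uses the role name "model" where other chat formats say "assistant".
    role_names = {"user": "user", "assistant": "model"}
    text = "<bos>"
    for message in messages:
        role = role_names[message["role"]]
        text += f"<start_of_turn>{role}\n{message['content'].strip()}<end_of_turn>\n"
    # Equivalent of add_generation_prompt=True: open the model's turn.
    return text + "<start_of_turn>model\n"


example_messages = [
    {"role": "user", "content": "Find 40 * 865. Exclude words; show only the math."}
]
example_prompt = format_gemma_prompt(example_messages)
# The completion is the raw answer plus the end-of-turn marker.
example_chosen = "40 * 865 = 34600" + "<end_of_turn>\n"
print(repr(example_prompt))
print(repr(example_chosen))
```

The prompt ends with an open model turn, and each completion closes that turn with `<end_of_turn>`, so prompt and completion concatenate into one well-formed conversation.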


> '\<bos>\<start_of_turn>user\ndetermine the ratio of the radius of a uranium-238 nucleus to the radius of a helium-4 nucleus.\<end_of_turn>\n\<start_of_turn>model\n'

## Train model with DPO

### Preparing configuration.

Now it's time to configure the necessary settings for alignment using DPO.

To perform a lighter fine-tuning, I will use LoRA (Low-Rank Adaptation), which significantly reduces the number of parameters that need to be trained. LoRA adds small low-rank matrices alongside the model's existing layers, and only the weights of these matrices are adjusted. In this case, since we want the alignment process to have a significant impact on the model's behavior, the values for **r** and **lora_alpha** are set on the high side for a model of this size.

The value of **r** is the rank of the added matrices; the higher the value, the more parameters are trained. A value of 16 is at the upper limit of what is recommended for small language models.

It’s generally recommended that **lora_alpha** be set to twice the value of **r**. However, since **r** can vary depending on the model size, this may lead to a very high **lora_alpha** value if you are fine-tuning a large model and, for example, specify an **r** of 64.
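
As a rough illustration of how **r** and **lora_alpha** interact, the sketch below counts the parameters a single adapter adds and computes the scaling factor applied to its update. The helper `lora_trainable_params` and the layer width of 2304 are assumptions made for the example, not values read from the model.

```python
def lora_trainable_params(d_in, d_out, r):
    """Parameters added by one LoRA adapter: A is (r x d_in), B is (d_out x r)."""
    return r * (d_in + d_out)


# Hypothetical square projection layer; 2304 is used only as an example width.
d_in = d_out = 2304
full = d_in * d_out

for rank in (4, 16, 64):
    added = lora_trainable_params(d_in, d_out, rank)
    print(f"r={rank:>2}: {added:>7} trainable vs {full} frozen ({added / full:.2%})")

# The adapter's update is multiplied by lora_alpha / r before being added.
r, lora_alpha = 16, 32
print("scaling factor:", lora_alpha / r)
```

The trainable parameter count grows linearly with **r**, while keeping **lora_alpha** at twice **r** holds the scaling factor at 2 regardless of the rank.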



In [17]:
# LoRA configuration
peft_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
    target_modules="all-linear"
)

The quantization configuration holds no secrets: you are reducing the precision of the model's weights to 4 bits.

In [18]:
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4"
)

This approach allows the model to occupy less memory, enabling the alignment process to be performed on a smaller GPU.

In [19]:
# Model to fine-tune
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=bnb_config,
    attn_implementation='eager',
    torch_dtype=torch.bfloat16
)
model.config.use_cache = False
model.gradient_checkpointing_enable()

config.json:   0%|          | 0.00/838 [00:00<?, ?B/s]

`low_cpu_mem_usage` was None, now set to True since model is quantized.


model.safetensors.index.json:   0%|          | 0.00/24.2k [00:00<?, ?B/s]

Downloading shards:   0%|          | 0/2 [00:00<?, ?it/s]

model-00001-of-00002.safetensors:   0%|          | 0.00/4.99G [00:00<?, ?B/s]

model-00002-of-00002.safetensors:   0%|          | 0.00/241M [00:00<?, ?B/s]

Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]

generation_config.json:   0%|          | 0.00/187 [00:00<?, ?B/s]
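
A back-of-the-envelope estimate shows why 4-bit loading helps. It assumes the two checkpoint shards in the download log above (about 4.99 GB and 241 MB) store 2 bytes per parameter. It also ignores quantization metadata, any layers left unquantized, activations, and optimizer state, so treat the numbers as orders of magnitude rather than measurements.

```python
# Shard sizes as reported in the download log above (decimal units).
shard_bytes = 4.99e9 + 241e6

# Assumption: the checkpoint stores each weight in 2 bytes (bf16).
approx_params = shard_bytes / 2

bytes_bf16 = approx_params * 2      # 16 bits per weight
bytes_4bit = approx_params * 0.5    # 4 bits per weight, overhead ignored

print(f"~{approx_params / 1e9:.2f}B parameters")
print(f"bf16 weights : ~{bytes_bf16 / 1e9:.2f} GB")
print(f"4-bit weights: ~{bytes_4bit / 1e9:.2f} GB")
```

Under these assumptions the weights alone shrink to about a quarter of their bf16 size, which leaves more GPU memory for activations and the LoRA optimizer state.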

The next step is to set up the training parameters.

In [20]:
#Name of the model you want to create.
new_model = "test_dpo_gemma_b"

# Training arguments
#I'm using a batch_size of just 1 to avoid problems with memory consumption.
training_args = TrainingArguments(
    per_device_train_batch_size=1,
    gradient_accumulation_steps=3,
    gradient_checkpointing_kwargs={'use_reentrant':False},
    gradient_checkpointing=True,
    remove_unused_columns=False,
    learning_rate=5.0e-06,
    logging_strategy="epoch",
    lr_scheduler_type="cosine",
    num_train_epochs=10,
    save_strategy="epoch",
    logging_steps=1,
    output_dir=new_model,
    optim="paged_adamw_32bit",
    warmup_steps=2,
    bf16=True,
    report_to="none",
)


I’ll explain the most important and specific training parameters:

**lr_scheduler_type**="cosine": The learning rate is adjusted according to a cosine schedule. After warmup, it starts at the value specified in **learning_rate** and then gradually decreases.

**warmup_steps**=2:  For the first two optimizer steps (not epochs), the learning rate increases toward the value in **learning_rate** instead of decreasing. The aim is to stabilize the learning process.

**gradient_accumulation_steps**=3: To save memory, I accumulate the gradients over three batches before updating the model weights.

With these parameters, I've tried to find a training setup with low memory requirements, thanks to the use of gradient accumulation, gradient checkpointing, a small batch size, and the use of bf16 along with the paged_adamw_32bit optimizer.
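
The interplay between these settings can be sketched in plain Python. The step arithmetic below assumes a single GPU and floor division per epoch, which agrees with the `global_step=150` reported by the training run later in this notebook. The `lr_at` helper is a hypothetical re-implementation of a linear-warmup, cosine-decay schedule, not the Trainer's own scheduler object.

```python
import math

num_rows = 46            # rows left after filtering
batch_size = 1           # per_device_train_batch_size
accumulation = 3         # gradient_accumulation_steps
epochs = 10              # num_train_epochs
warmup_steps = 2
peak_lr = 5.0e-06        # learning_rate

# Assumption: single GPU; one optimizer update per `accumulation` batches,
# counted with floor division per epoch.
steps_per_epoch = (num_rows // batch_size) // accumulation
total_steps = steps_per_epoch * epochs
print("effective batch size:", batch_size * accumulation)
print("optimizer steps:", total_steps)


def lr_at(step):
    """Linear warmup followed by a half-cosine decay to zero."""
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return peak_lr * 0.5 * (1.0 + math.cos(math.pi * progress))


for step in (0, 1, 2, total_steps // 2, total_steps):
    print(step, f"{lr_at(step):.2e}")
```

Warmup therefore covers only the first two of 150 optimizer updates, and each update averages gradients over an effective batch of three examples.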

Now you can create the trainer, passing it the training dataset, the newly created training arguments, the LoRA configuration, and the tokenizer as parameters.

In [21]:
# Create DPO trainer
trainer = DPOTrainer(
    model,
    args=training_args,
    train_dataset=dataset,
    #eval_dataset=dataset_eval,
    tokenizer=tokenizer,
    peft_config=peft_config,
    beta=0.1,
    max_prompt_length=2048,
    max_length=2048,
)

Map:   0%|          | 0/46 [00:00<?, ? examples/s]


The beta value used here, 0.1, is a common default that balances the new training against the model's base knowledge. If you want the new training to have more weight, perhaps because you're training for a very specific task, you could specify a lower beta value, which lets the model drift further from its original behavior.

In [22]:
# Fine-tune model with DPO
trainer.train()

Could not estimate the number of tokens of the input, floating-point operations will not be computed


Step,Training Loss
15,0.6624
30,0.4206
46,0.213
61,0.1387
76,0.0967
92,0.0695
107,0.0635
122,0.0587
138,0.0511
150,0.0644


TrainOutput(global_step=150, training_loss=0.1847989583015442, metrics={'train_runtime': 367.3221, 'train_samples_per_second': 1.252, 'train_steps_per_second': 0.408, 'total_flos': 0.0, 'train_loss': 0.1847989583015442, 'epoch': 9.782608695652174})

The training loss falls steadily from 0.66 to roughly 0.06, so training seems to have worked reasonably well. However, with only 46 rows and no evaluation dataset, there might be an overfitting issue, where the model adapts better to the training data than to unseen data. To mitigate overfitting, you could expand the dataset and try increasing the **lora_dropout** parameter in **LoraConfig**.


## Upload model to Hugging Face.

In [23]:
PATH_MODEL="/content/drive/MyDrive/final_checkpoint"

In [24]:
# Save artifacts
trainer.model.save_pretrained(PATH_MODEL)
tokenizer.save_pretrained(PATH_MODEL)



('/content/drive/MyDrive/final_checkpoint/tokenizer_config.json',
 '/content/drive/MyDrive/final_checkpoint/special_tokens_map.json',
 '/content/drive/MyDrive/final_checkpoint/tokenizer.model',
 '/content/drive/MyDrive/final_checkpoint/added_tokens.json',
 '/content/drive/MyDrive/final_checkpoint/tokenizer.json')

Execute this cell only if you are having memory issues.

In [25]:
#Flush memory
del trainer, model, tokenizer
gc.collect()
torch.cuda.empty_cache()

Now, you're going to load the original model again, but this time in its unquantized format.

In [26]:
base_model = AutoModelForCausalLM.from_pretrained(
    model_name,
    return_dict=True,
    torch_dtype=torch.bfloat16,
)
tokenizer = AutoTokenizer.from_pretrained(model_name,
                                          use_fast=False)

Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]

The base model and the saved LoRA adapter are now merged.

In [27]:
model = PeftModel.from_pretrained(base_model, PATH_MODEL)
model = model.merge_and_unload()
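
Conceptually, merging folds each adapter's low-rank update into the frozen weight it was attached to. The toy example below shows the arithmetic for one layer, W' = W + (lora_alpha / r) * B A. The helpers `matmul` and `merge_lora` and all matrix values are made up for illustration; this is a sketch of the idea, not PEFT's actual implementation.

```python
def matmul(x, y):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*y)] for row in x]


def merge_lora(weight, lora_a, lora_b, lora_alpha, r):
    """Return weight + (lora_alpha / r) * (lora_b @ lora_a)."""
    scale = lora_alpha / r
    delta = matmul(lora_b, lora_a)
    return [[w + scale * d for w, d in zip(w_row, d_row)]
            for w_row, d_row in zip(weight, delta)]


# Toy 2x2 layer with a rank-1 adapter; all values are made up.
weight = [[1.0, 0.0], [0.0, 1.0]]   # frozen base weight (d_out x d_in)
lora_b = [[0.5], [1.0]]             # d_out x r
lora_a = [[2.0, 0.0]]               # r x d_in
merged = merge_lora(weight, lora_a, lora_b, lora_alpha=2, r=1)
print(merged)  # [[3.0, 0.0], [4.0, 1.0]]
```

After merging, the layer is an ordinary weight matrix again, so the resulting model can be saved and loaded without the adapter files.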

 The model that you have in memory is now a combination of the base model and the adapter that you have trained. You can now save this new model and upload it to Hugging Face.

In [28]:
model.save_pretrained(new_model)
tokenizer.save_pretrained(new_model)

('test_dpo_gemma_b/tokenizer_config.json',
 'test_dpo_gemma_b/special_tokens_map.json',
 'test_dpo_gemma_b/tokenizer.model',
 'test_dpo_gemma_b/added_tokens.json')

In [29]:
model.push_to_hub(new_model,
                  private=True,
                  use_temp_dir=False)
tokenizer.push_to_hub(new_model,
                      private=True,
                      use_temp_dir=False)

README.md:   0%|          | 0.00/5.17k [00:00<?, ?B/s]

Upload 2 LFS files:   0%|          | 0/2 [00:00<?, ?it/s]

model-00001-of-00002.safetensors:   0%|          | 0.00/4.99G [00:00<?, ?B/s]

model-00002-of-00002.safetensors:   0%|          | 0.00/241M [00:00<?, ?B/s]

CommitInfo(commit_url='https://huggingface.co/oopere/test_dpo_gemma_b/commit/3b2c1892ee1475f263c0dc4035fcb7a6e9330406', commit_message='Upload tokenizer', commit_description='', oid='3b2c1892ee1475f263c0dc4035fcb7a6e9330406', pr_url=None, pr_revision=None, pr_num=None)

## Inference

Let's test the new model and compare it with the original.

In [30]:
#Original Gemma Model.
tokenizer = AutoTokenizer.from_pretrained(model_name)

In [31]:
# Format prompt
message = [
    {"role": "user", "content": "Solve 25000/2 step by step. \nLimit your response to mathematical expressions and symbols."}
]
prompt = tokenizer.apply_chat_template(message, add_generation_prompt=True, tokenize=False)


In [32]:
# Create pipeline
pipeline = transformers.pipeline(
    "text-generation",
    device="cuda",
    model=model_name,
    tokenizer=tokenizer
)

Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]

In [33]:
# Generate text
sequences = pipeline(
    prompt,
    do_sample=True,
    temperature=0.1,
    top_p=0.2,
    num_return_sequences=1,
    max_length=200,
)
print(sequences[0]['generated_text'])

Truncation was not explicitly activated but `max_length` is provided a specific value, please use `truncation=True` to explicitly truncate examples to max length. Defaulting to 'longest_first' truncation strategy. If you encode pairs of sequences (GLUE-style) with the tokenizer you can select this strategy more precisely by providing a specific strategy to `truncation`.


<bos><start_of_turn>user
Solve 25000/2 step by step. 
Limit your response to mathematical expressions and symbols.<end_of_turn>
<start_of_turn>model
25000 / 2 = 12500 

Here's how we get there:

* **Division:**  The division symbol (/) means we are splitting a number into equal parts.
* **25000:** This is the dividend (the number being divided).
* **2:** This is the divisor (the number we are dividing by). 
* **12500:** This is the result of the division. 



**The response obtained with the original model contains explanatory text, ignoring the instructions in the prompt.**

In [34]:
del pipeline, tokenizer
#Flush memory
gc.collect()
torch.cuda.empty_cache()

In [35]:
# Load the Aligned Model.
tokenizer_new_model = AutoTokenizer.from_pretrained(new_model)


In [36]:
# Create pipeline
pipeline_new = transformers.pipeline(
    "text-generation",
    device="cuda",
    model=new_model,
    tokenizer=tokenizer_new_model
)

Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]

In [37]:
# Generate text
prompt = tokenizer_new_model.apply_chat_template(message, add_generation_prompt=True, tokenize=False)

sequences = pipeline_new(
    prompt,
    do_sample=True,
    temperature=0.1,
    top_p=0.2,
    num_return_sequences=1,
    max_length=200,
)
print(sequences[0]['generated_text'])

Truncation was not explicitly activated but `max_length` is provided a specific value, please use `truncation=True` to explicitly truncate examples to max length. Defaulting to 'longest_first' truncation strategy. If you encode pairs of sequences (GLUE-style) with the tokenizer you can select this strategy more precisely by providing a specific strategy to `truncation`.


<bos><start_of_turn>user
Solve 25000/2 step by step. 
Limit your response to mathematical expressions and symbols.<end_of_turn>
<start_of_turn>model
25000 / 2 = 12500 



**The response of the DPO-aligned model contains only numbers and symbols, as requested in the prompt.**

PERFECT! The new model only returns numbers, aligned with the chosen answers present in the Dataset.


## Summary

The model alignment process has been a complete success. The truth is, with the Hugging Face libraries, everything is straightforward.

The real challenge lies in understanding the technique, knowing when to apply it, and having the necessary data.



