<a href="https://colab.research.google.com/github/peremartra/Large-Language-Model-Notebooks-Course/blob/main/P2-MHF/Aligning_DPO_open_gemma-2b-it.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

<div align="center">
<h1><a href="https://github.com/peremartra/Large-Language-Model-Notebooks-Course">Learn by Doing LLM Projects</a></h1>
    <h3>Understand And Apply Large Language Models</h3>
    <h2>Creating and Publishing Your Own LLM.</h2>
    <h3>Aligning with DPO a Gemma 2B model.</h3>
    by <b>Pere Martra</b>
</div>

<br>

<div align="center">
    &nbsp;
    <a target="_blank" href="https://www.linkedin.com/in/pere-martra/"><img src="https://img.shields.io/badge/style--5eba00.svg?label=LinkedIn&logo=linkedin&style=social"></a>
    
</div>

<br>
<hr>


In this notebook we are going to align a Google Gemma-2B model using DPO, and publish it to Hugging Face!

Base Model: https://huggingface.co/google/gemma-2b-it

Dataset Used: https://huggingface.co/datasets/argilla/distilabel-capybara-dpo-7k-binarized

Model obtained: https://huggingface.co/oopere/martra-open-gemma-2b-it-dpo

## To start, a brief introduction to DPO.

The revolution we're currently experiencing around Large Language Models began with the emergence of ChatGPT and its GPT-3.5 model.

Something different had been done with GPT-3.5, which was actually a derivative of GPT-3, a model that did not generate nearly as much excitement as its successor.

Many people, including myself, believe that the main difference was the use of alignment through RLHF (Reinforcement Learning from Human Feedback).

Nowadays RLHF has been displaced by a technique that achieves the same result in a much more efficient way: DPO - Direct Preference Optimization.

Both DPO and RLHF are alignment techniques that require a dataset containing, for each prompt, a preferred (chosen) response and a rejected one.

But from here, the differences begin. RLHF uses this dataset to train a second model, called a reward model, which will be used in the alignment process. DPO, on the other hand, uses the dataset directly to train the final model. This is the main difference between the two techniques.

As you can imagine, DPO is a more direct technique that requires fewer resources. When we're talking about models with tens of billions of parameters, any reduction in resource consumption can result in significant cost savings.

The implementation of DPO that we are going to use is the one developed by Hugging Face in their TRL library, which stands for Transformer Reinforcement Learning. DPO can be considered a reinforcement learning technique, where the model is rewarded during its training phase based on its responses.
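For reference, the objective that DPO minimizes can be written as below. The notation is not taken from this notebook: $\pi_\theta$ is the model being trained, $\pi_{\text{ref}}$ is a frozen copy of the starting model, $y_w$ and $y_l$ are the chosen and rejected responses to the prompt $x$, $\sigma$ is the logistic function, and $\beta$ is a scaling hyperparameter.

$$
\mathcal{L}_{\text{DPO}}(\pi_\theta; \pi_{\text{ref}}) = -\mathbb{E}_{(x, y_w, y_l) \sim \mathcal{D}} \left[ \log \sigma \left( \beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\text{ref}}(y_w \mid x)} - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\text{ref}}(y_l \mid x)} \right) \right]
$$

No separate reward model appears in the expression: the preference pairs drive the update directly.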

_______________________

Since it is necessary to save the created model, the notebook mounts a disk in Google Drive. If you run it locally, you don't need to execute this cell. You can also run it on Google Colab without mounting a disk in your Drive, but then you'll lose the saved model every time you close the session, as it will be saved in the temporary directory of the Google Colab session.
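If you want the same notebook to run both on Colab and locally, one possible approach is to mount Drive only when the `google.colab` package can be imported. This is a minimal sketch; the helper name `checkpoint_dir` and the default folder name are hypothetical and not part of the original notebook.

```python
import os


def checkpoint_dir(name="final_checkpoint"):
    """Return a directory path for saving the model (hypothetical helper)."""
    try:
        # google.colab is only importable inside a Colab session.
        from google.colab import drive
    except ImportError:
        # Local run: save next to the notebook instead of on Drive.
        return os.path.join(os.getcwd(), name)
    drive.mount("/content/drive")
    return os.path.join("/content/drive/MyDrive", name)


print(checkpoint_dir())
```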

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


Now that the disk is mounted, it's time to load the necessary libraries:

The only one that might be new to you is the trl library, which stands for Transformer Reinforcement Learning. You'll be importing the DPOTrainer class from this library, which you'll use to perform the DPO fine-tuning of the model.

In [None]:
!pip install -q datasets==2.19.1
!pip install -q trl==0.8.6
!pip install -q peft==0.11.1
!pip install -q transformers==4.41.0
!pip install -q bitsandbytes==0.43.1
!pip install -q sentencepiece==0.1.99
!pip install -q accelerate==0.30.1
!pip install -q huggingface_hub==0.23.1


In [None]:
#Import necessary classes.
import gc
import torch

import transformers
from transformers import AutoModelForCausalLM, AutoTokenizer
from transformers import TrainingArguments, BitsAndBytesConfig
from datasets import load_dataset
from peft import LoraConfig, get_peft_model, PeftModel
from trl import DPOTrainer
import bitsandbytes as bnb

from getpass import getpass


Another necessary step is to log in to Hugging Face.

In [None]:
hf_token = getpass("Hugging Face: ")

In [None]:
!huggingface-cli login --token $hf_token

usage: huggingface-cli <command> [<args>] login [-h] [--token TOKEN] [--add-to-git-credential]
huggingface-cli <command> [<args>] login: error: argument --token: expected one argument


## Format dataset

The model I've chosen is Gemma-2b-it, a 2.51B-parameter open model from Google, built from the same research and technology as the Gemini models.
I've chosen a small model so that it can be trained with few resources, on Google Colab or on a modest GPU.


In [None]:
model_name = "google/gemma-2b-it"
new_model = "martra-open-gemma-2b-it-dpo"

In [None]:
# Tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token

tokenizer_config.json:   0%|          | 0.00/34.2k [00:00<?, ?B/s]

tokenizer.model:   0%|          | 0.00/4.24M [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/17.5M [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/636 [00:00<?, ?B/s]

Before you begin training the model, it is necessary to load the dataset and transform it to fit the format required by the DPOTrainer class, which consists of three fields: the prompt, the chosen answer, and the rejected answer.

I'm loading just a few rows of the dataset; feel free to use the full dataset if you have enough time.

Training on the full dataset for 6 epochs should take less than one hour on an A100 GPU.

In [None]:
# Load dataset
dataset_original =  load_dataset("argilla/distilabel-capybara-dpo-7k-binarized",
                                 split='train[500:]')
dataset_eval = load_dataset("argilla/distilabel-capybara-dpo-7k-binarized",
                            split='train[:500]')

# Save columns
original_columns = dataset_original.column_names
print(original_columns)

Downloading readme:   0%|          | 0.00/11.7k [00:00<?, ?B/s]

Downloading data:   0%|          | 0.00/156M [00:00<?, ?B/s]

Generating train split:   0%|          | 0/7563 [00:00<?, ? examples/s]

['source', 'conversation', 'original_response', 'generation_prompt', 'raw_generation_responses', 'new_generations', 'prompt', 'chosen', 'rejected', 'rating_chosen', 'rating_rejected', 'chosen_model', 'rejected_model']


In [None]:
dataset_original

Dataset({
    features: ['source', 'conversation', 'original_response', 'generation_prompt', 'raw_generation_responses', 'new_generations', 'prompt', 'chosen', 'rejected', 'rating_chosen', 'rating_rejected', 'chosen_model', 'rejected_model'],
    num_rows: 7063
})

The dataset has many more columns than are actually necessary for the DPO process. However, I'm going to take advantage of a couple of them to filter the data to be used.

In [None]:
dataset_filtered = dataset_original.filter(
  lambda r: r["rating_chosen"]>=4.0 and r["rating_rejected"] <= 2.5
)

Filter:   0%|          | 0/7063 [00:00<?, ? examples/s]

This first filter only retrieves those rows where the rating of the chosen response is very high and the rating of the discarded responses is very low. This is a way to facilitate the model's learning, although it's also possible that it doesn't help in the last epochs of training.
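To make the effect of the thresholds concrete, here is a small sketch that applies the same predicate to a few made-up rows in plain Python. The rows and the `id` field are hypothetical; only the two rating fields and the thresholds come from the cell above.

```python
# Hypothetical rows: only the two rating fields matter for this filter.
rows = [
    {"id": "a", "rating_chosen": 5.0, "rating_rejected": 2.0},
    {"id": "b", "rating_chosen": 4.0, "rating_rejected": 3.0},
    {"id": "c", "rating_chosen": 3.5, "rating_rejected": 1.0},
    {"id": "d", "rating_chosen": 4.5, "rating_rejected": 2.5},
]


def keep(row):
    # Same condition as the lambda passed to dataset_original.filter().
    return row["rating_chosen"] >= 4.0 and row["rating_rejected"] <= 2.5


kept = [row["id"] for row in rows if keep(row)]
print(kept)
```

Row "b" is dropped because its rejected answer is rated too high, and row "c" because its chosen answer is rated too low: only pairs with a clear quality gap survive.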

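I'm going to perform a second filter that keeps only the conversations with fewer than three messages (a single user turn followed by the response), which keeps the prompt length under control.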

In [None]:
dataset_filtered = dataset_filtered.map(lambda r: {"messages": len(r["chosen"])}).filter(lambda r: r["messages"]<3) #and len(r["prompt"]) + len(r["chosen"]) + len(r["rejected"]) < 3800)


Map:   0%|          | 0/3328 [00:00<?, ? examples/s]

Filter:   0%|          | 0/3328 [00:00<?, ? examples/s]

In [None]:
dataset_filtered

Dataset({
    features: ['source', 'conversation', 'original_response', 'generation_prompt', 'raw_generation_responses', 'new_generations', 'prompt', 'chosen', 'rejected', 'rating_chosen', 'rating_rejected', 'chosen_model', 'rejected_model', 'messages'],
    num_rows: 957
})

The dataset still has all its columns, but the number of rows has been significantly reduced. Let me warn you that 957 rows are too few to perform proper training; again, this reduction is there so that the notebook can be executed quickly and still produce results.

In [None]:
#Repeat the same filters with the Validation Dataset.
dataset_eval_filtered = dataset_eval.filter(
  lambda r: r["rating_chosen"]>=4.0 and r["rating_rejected"] <= 2.5
)
dataset_eval_filtered = dataset_eval_filtered.map(lambda r: {"messages": len(r["chosen"])}).filter(lambda r: r["messages"]<3 )#and len(r["prompt"]) + len(r["chosen"]) + len(r["rejected"]) < 3800)
dataset_eval_filtered

Filter:   0%|          | 0/500 [00:00<?, ? examples/s]

Map:   0%|          | 0/239 [00:00<?, ? examples/s]

Filter:   0%|          | 0/239 [00:00<?, ? examples/s]

Dataset({
    features: ['source', 'conversation', 'original_response', 'generation_prompt', 'raw_generation_responses', 'new_generations', 'prompt', 'chosen', 'rejected', 'rating_chosen', 'rating_rejected', 'chosen_model', 'rejected_model', 'messages'],
    num_rows: 77
})

Now, it's a matter of creating a function to adapt the dataset's structure to what's required by the DPOTrainer class.

I have to confess that I've cheated a little bit. The function comes from the Hugging Face dataset card. I only had to remove an error that they had missed.

In summary, the function takes a row and retrieves only the three necessary columns. It also applies a small format to the responses, which I've adapted to the format required by the model, adding the label <end_of_turn> after the responses.


In [None]:
def chatml_format(example):
    # get everything except the last message as input
    prompt = tokenizer.apply_chat_template(example["chosen"][:-1], tokenize=False,
                                           add_generation_prompt=True)
    # get the last assistant responses
    chosen = example["chosen"][-1]["content"] + "<end_of_turn>\n"
    rejected = example["rejected"][-1]["content"] + "<end_of_turn>\n"

    return {
        "prompt": prompt,
        "chosen": chosen,
        "rejected": rejected,
    }

I'll use the dataset's map function to execute the transformation on each row, and also remove the original columns.

In [None]:
# Format dataset
dataset = dataset_filtered.map(
    chatml_format,
    remove_columns=dataset_filtered.column_names
)
# Print sample
dataset[12]

Map:   0%|          | 0/957 [00:00<?, ? examples/s]

{'prompt': '<bos><start_of_turn>user\nthe sum of 4076359 and 91404?\nLimit your response to mathematical expressions and symbols.<end_of_turn>\n<start_of_turn>model\n',
 'chosen': '4076359 + 91404 = 4167763<end_of_turn>\n',
 'rejected': 'The sum of two numbers, 4076359 and 91404, can be calculated using basic arithmetic operations. In this case, we will add them together. The mathematical expression can be represented in the following way:\n\n4076359 + 91404 = ?\n\nWhen you perform the addition operation, you get the following result:\n\n? = 4167763<end_of_turn>\n'}

In [None]:
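# Format the evaluation dataset.
# Note: original_columns was captured before the 'messages' helper column
# was added, so 'messages' is not removed here and stays in dataset_eval.
dataset_eval = dataset_eval_filtered.map(
    chatml_format,
    remove_columns=original_columns
)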

Map:   0%|          | 0/77 [00:00<?, ? examples/s]

In [None]:
# Print sample
dataset_eval[20]

{'prompt': '<bos><start_of_turn>user\n36765980697435*0?\nLimit your response to mathematical expressions and symbols.<end_of_turn>\n<start_of_turn>model\n',
 'chosen': '36765980697435 * 0 = 0<end_of_turn>\n',
 'rejected': "0 * 0/ (1 * 1) = 0, which represents the infinite number of possible combinations of multiplying nothing by nothing. This can be expressed with the following mathematical formula:\n\n0 * 0 = lim as n approaches infinity ((-1)^n) / (2^n) = 0\n\nWhere the limit refers to the process of adding more and more numbers together, and the mathematical ' signific 0 represents a real number that approaches, but never reaches, zero as n gets larger and larger.\n\nWhen we limit a calculation to only including whole numbers n=0,1,2,... (where n is a non-negative integer), we can simplify this formula to:\n\n0 * 0 = 0 ^ 0 = 1 * exp(0) = 1 * 1 = 1\n\nwhere 'exp' is the mathematical constant e, which is approximately equal to 2.71828. By comparing the values of all of these expressio

The format is adapted to chat with gemma:
https://huggingface.co/google/gemma-2b-it



\<bos>\<start_of_turn>user

Write a hello world program\<end_of_turn>

\<start_of_turn>model
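As an illustration of this layout, the sketch below builds the same kind of prompt string by hand. It is only an approximation written for this explanation: the function name `gemma_prompt` is hypothetical, and in the notebook the authoritative formatting is done by `tokenizer.apply_chat_template`.

```python
def gemma_prompt(messages):
    """Render a list of {"role", "content"} dicts as a Gemma-style prompt.

    Hand-written approximation of the chat template, for illustration only.
    """
    parts = ["<bos>"]
    for message in messages:
        # Gemma names the assistant role "model".
        role = "model" if message["role"] == "assistant" else message["role"]
        content = message["content"].strip()
        parts.append(f"<start_of_turn>{role}\n{content}<end_of_turn>\n")
    # Open the model turn so that generation continues from here.
    parts.append("<start_of_turn>model\n")
    return "".join(parts)


example = [{"role": "user", "content": "Write a hello world program"}]
print(gemma_prompt(example))
```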



In [None]:
dataset

Dataset({
    features: ['prompt', 'chosen', 'rejected'],
    num_rows: 957
})

In [None]:
dataset_eval

Dataset({
    features: ['prompt', 'chosen', 'rejected', 'messages'],
    num_rows: 77
})

## Train model with DPO

## Finetuning with DPOTrainer.

Time to start working with the necessary configurations to perform alignment using DPO.



In [None]:
# LoRA configuration
peft_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
    #target_modules=['o_proj', 'qkv_proj'] #phi-3
    target_modules="all-linear"
)

The value of **r** is the rank of the matrices that LoRA adds; the higher the value, the more parameters are trained. Values around 8 are often suggested as the upper limit for small models; here I use 16.

To further accentuate the weight of the new training, I use the **lora_alpha** value. It's a scaling factor applied to the layers inserted by LoRA: their update is multiplied by lora_alpha / r. In the case of DPO, I've seen values as high as 128.

The recommendation is that **lora_alpha** should be double the value of **r**. Since **r** varies depending on the model size, you may end up with a very high lora_alpha value if you want to fine-tune a large model and, for example, specify an **r** of 64.
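To get a feel for these two values, the following sketch counts the parameters LoRA trains for a single linear layer and computes the scaling factor lora_alpha / r used by standard LoRA. The 2048 x 2048 layer size is a hypothetical example, and the helper names are mine.

```python
def lora_trainable_params(d_in, d_out, r):
    # LoRA adds two matrices per layer: A with shape (r, d_in)
    # and B with shape (d_out, r).
    return r * (d_in + d_out)


def lora_scaling(lora_alpha, r):
    # Standard LoRA multiplies the update B @ A by lora_alpha / r.
    return lora_alpha / r


d_in = d_out = 2048  # hypothetical layer size
full = d_in * d_out
for r in (8, 16, 64):
    n = lora_trainable_params(d_in, d_out, r)
    print(f"r={r}: {n} trainable params ({100 * n / full:.2f}% of the layer)")

print("scaling:", lora_scaling(lora_alpha=32, r=16))
```

Keeping lora_alpha at twice the value of r keeps the scaling factor at 2 regardless of the rank you choose.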

In [None]:
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16
)

The quantization configuration holds no secrets: we are reducing the precision of the model's weights to 4 bits, using the NF4 format with double quantization.
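As a rough back-of-the-envelope check of what this saves, the sketch below estimates the memory taken by the weights alone at different precisions. It uses the 2.51B parameter count quoted earlier, and it deliberately ignores the quantization constants, any layers kept in higher precision, activations, and optimizer state, so treat the numbers as approximations.

```python
N_PARAMS = 2.51e9  # parameter count quoted earlier in this notebook


def weight_gb(n_params, bits_per_param):
    # bits -> bytes -> gigabytes (10^9 bytes)
    return n_params * bits_per_param / 8 / 1e9


for label, bits in [("float32", 32), ("bfloat16", 16), ("4-bit NF4", 4)]:
    print(f"{label:>10}: {weight_gb(N_PARAMS, bits):.2f} GB")
```

The bfloat16 estimate is roughly in line with the combined size of the two safetensors shards downloaded when the model is loaded.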

In [None]:
# Model to fine-tune
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=bnb_config,
    torch_dtype=torch.bfloat16
)
model.config.use_cache = False

config.json:   0%|          | 0.00/627 [00:00<?, ?B/s]

`low_cpu_mem_usage` was None, now set to True since model is quantized.


model.safetensors.index.json:   0%|          | 0.00/13.5k [00:00<?, ?B/s]

Downloading shards:   0%|          | 0/2 [00:00<?, ?it/s]

model-00001-of-00002.safetensors:   0%|          | 0.00/4.95G [00:00<?, ?B/s]

model-00002-of-00002.safetensors:   0%|          | 0.00/67.1M [00:00<?, ?B/s]

`config.hidden_act` is ignored, you should use `config.hidden_activation` instead.
Gemma's activation function will be set to `gelu_pytorch_tanh`. Please, use
`config.hidden_activation` if you want to override this behaviour.
See https://github.com/huggingface/transformers/pull/29402 for more details.


Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]

generation_config.json:   0%|          | 0.00/137 [00:00<?, ?B/s]

The next step is to create the training parameters.

In [None]:
# Training arguments
training_args = TrainingArguments(
    per_device_train_batch_size=1,
    per_device_eval_batch_size=1,
    gradient_accumulation_steps=3,
    gradient_checkpointing=True,
    remove_unused_columns=True,
    learning_rate=5.0e-06,
    eval_strategy="epoch",
    logging_strategy="epoch",
    lr_scheduler_type="cosine",
    num_train_epochs=6,
    save_strategy="epoch",
    logging_steps=1,
    output_dir=new_model,
    optim="paged_adamw_32bit",
    warmup_steps=2,
    bf16=True,
    report_to="none",
)


I'll try to explain the value of the most important and specific ones.

**lr_scheduler_type**="cosine": The learning rate is adjusted according to a cosine schedule. Once the warmup is over, it starts at the value specified in learning_rate and then gradually decreases.

**warmup_steps**=2: For the first two optimizer steps (not epochs), the learning rate is increased up to the value in learning_rate instead of being decreased. The aim is to stabilize the start of the learning process.

**gradient_accumulation_steps**=3: To save memory, the gradients are accumulated over three batches before the model weights are updated.

With these parameters, I've tried to find a training setup with low memory requirements, thanks to the use of gradient accumulation, gradient checkpointing, a small batch size, and the use of bf16 along with the paged_adamw_32bit optimizer.
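The interplay between batch size, gradient accumulation, warmup and the cosine schedule can be sketched in a few lines. This is a simplified re-implementation written for illustration, assuming a single GPU and the 957 training rows obtained above; it is not the code the Trainer runs internally.

```python
import math


def cosine_with_warmup(step, base_lr, warmup_steps, total_steps):
    """Learning rate at a given optimizer step (simplified sketch)."""
    if step < warmup_steps:
        # Linear ramp from 0 up to base_lr during warmup.
        return base_lr * step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    # Half a cosine wave: from base_lr down to 0.
    return base_lr * max(0.0, 0.5 * (1.0 + math.cos(math.pi * progress)))


train_rows = 957          # rows left after filtering
per_device_batch = 1
grad_accum = 3
epochs = 6

effective_batch = per_device_batch * grad_accum
steps_per_epoch = train_rows // effective_batch
total_steps = steps_per_epoch * epochs
print(effective_batch, steps_per_epoch, total_steps)

for step in (0, 1, 2, total_steps // 2, total_steps):
    print(step, cosine_with_warmup(step, 5.0e-6, 2, total_steps))
```

With these values the run consists of 1914 optimizer steps, which is the global_step figure that training reports at the end.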

In [None]:
# Create DPO trainer
trainer = DPOTrainer(
    model,
    args=training_args,
    train_dataset=dataset,
    eval_dataset=dataset_eval,
    tokenizer=tokenizer,
    peft_config=peft_config,
    beta=0.1,
    max_prompt_length=2048,
    max_length=2048,
)



Map:   0%|          | 0/957 [00:00<?, ? examples/s]

Map:   0%|          | 0/77 [00:00<?, ? examples/s]

The parameters for DPOTrainer are quite simple. You need to pass it the configurations you've created, the training and evaluation datasets, and the maximum length of the prompt and of the full sequence (prompt plus response).

The indicated beta value is a standard that balances the new training with the model's base knowledge. If you want the new training to have more weight, perhaps because you're training for a very specific task, you could specify a lower beta value.
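To see where beta enters, here is a simplified sketch of the sigmoid DPO loss for a single preference pair, computed from summed log-probabilities. The log-probability values are made up for the example, and the function is a plain-Python illustration, not TRL's implementation.

```python
import math


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Sigmoid DPO loss for one pair, plus the two implicit rewards."""
    # Implicit rewards: how far the policy moved away from the reference.
    chosen_reward = beta * (policy_chosen_logp - ref_chosen_logp)
    rejected_reward = beta * (policy_rejected_logp - ref_rejected_logp)
    margin = chosen_reward - rejected_reward
    # -log(sigmoid(margin)) written as log(1 + exp(-margin)).
    loss = math.log1p(math.exp(-margin))
    return loss, chosen_reward, rejected_reward


# Hypothetical summed log-probabilities for one pair.
for beta in (0.05, 0.1, 0.5):
    loss, chosen, rejected = dpo_loss(-500.0, -850.0, -480.0, -800.0, beta=beta)
    print(f"beta={beta}: loss={loss:.4f} chosen={chosen:.2f} rejected={rejected:.2f}")
```

For the same movement away from the reference model, a lower beta yields a smaller margin and therefore a larger remaining loss, so the optimizer keeps pushing the policy further from the reference.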

In [None]:
# Fine-tune model with DPO
trainer.train()

Could not estimate the number of tokens of the input, floating-point operations will not be computed


| Epoch | Training Loss | Validation Loss | Rewards/chosen | Rewards/rejected | Rewards/accuracies | Rewards/margins | Logps/rejected | Logps/chosen | Logits/rejected | Logits/chosen |
|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 0.4947 | 0.502502 | -1.918952 | -4.217087 | 0.792208 | 2.298135 | -830.933594 | -496.270508 | -23.468319 | -22.349073 |
| 2 | 0.2488 | 0.543632 | -2.726507 | -6.106202 | 0.805195 | 3.379694 | -849.824707 | -504.346039 | -23.491385 | -22.385691 |
| 3 | 0.1413 | 0.530881 | -2.963163 | -6.662573 | 0.779221 | 3.69941 | -855.388367 | -506.712555 | -23.423925 | -22.314487 |
| 4 | 0.1024 | 0.511403 | -2.887546 | -6.512612 | 0.779221 | 3.625065 | -853.888794 | -505.956421 | -23.405218 | -22.293304 |
| 5 | 0.0856 | 0.508095 | -2.881293 | -6.501287 | 0.779221 | 3.619995 | -853.775574 | -505.89386 | -23.40406 | -22.290236 |
| 6 | 0.0832 | 0.504025 | -2.866399 | -6.489909 | 0.779221 | 3.62351 | -853.661743 | -505.744934 | -23.400942 | -22.29084 |




TrainOutput(global_step=1914, training_loss=0.19268197226051004, metrics={'train_runtime': 2678.6839, 'train_samples_per_second': 2.144, 'train_steps_per_second': 0.715, 'total_flos': 0.0, 'train_loss': 0.19268197226051004, 'epoch': 6.0})

It seems to have worked reasonably well, although there might be a potential overfitting issue, where the model adapts better to the training data than to the evaluation data. To mitigate overfitting, you could expand the dataset and try increasing the **lora_dropout** parameter in **LoraConfig**.
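One way to read the table is to recompute a couple of derived quantities from it. The sketch below copies the per-epoch values from the training output above and checks that Rewards/margins is the difference between Rewards/chosen and Rewards/rejected, and it prints the gap between validation and training loss, which is the overfitting signal mentioned here.

```python
# (epoch, train loss, val loss, rewards/chosen, rewards/rejected, rewards/margins)
# Values copied from the training output above.
rows = [
    (1, 0.4947, 0.502502, -1.918952, -4.217087, 2.298135),
    (2, 0.2488, 0.543632, -2.726507, -6.106202, 3.379694),
    (3, 0.1413, 0.530881, -2.963163, -6.662573, 3.69941),
    (4, 0.1024, 0.511403, -2.887546, -6.512612, 3.625065),
    (5, 0.0856, 0.508095, -2.881293, -6.501287, 3.619995),
    (6, 0.0832, 0.504025, -2.866399, -6.489909, 3.62351),
]

margin_errors = []
for epoch, train_loss, val_loss, chosen, rejected, margin in rows:
    recomputed = chosen - rejected
    margin_errors.append(abs(recomputed - margin))
    gap = val_loss - train_loss
    print(f"epoch {epoch}: margin={recomputed:.6f} loss gap={gap:.4f}")
```

The training loss keeps falling while the validation loss stays around 0.5, so the gap widens with every epoch.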


## Upload model

In [None]:
#PATH_MODEL="/content/drive/MyDrive/final_checkpoint"
PATH_MODEL="/content/drive/MyDrive/apress_checkpoint"

In [None]:
# Save artifacts
trainer.model.save_pretrained(PATH_MODEL)
tokenizer.save_pretrained(PATH_MODEL)





('/content/drive/MyDrive/apress_checkpoint/tokenizer_config.json',
 '/content/drive/MyDrive/apress_checkpoint/special_tokens_map.json',
 '/content/drive/MyDrive/apress_checkpoint/tokenizer.model',
 '/content/drive/MyDrive/apress_checkpoint/added_tokens.json',
 '/content/drive/MyDrive/apress_checkpoint/tokenizer.json')

Execute this cell only if you are having memory issues. (Not you, of course, I mean your environment 🤗).

In [None]:
#Flush memory
#del trainer, model
gc.collect()
torch.cuda.empty_cache()

Now, you're going to load the original model again, but this time in its unquantized format.

In [None]:
base_model = AutoModelForCausalLM.from_pretrained(
    model_name,
    return_dict=True,
    torch_dtype=torch.bfloat16,
)
tokenizer = AutoTokenizer.from_pretrained(model_name,
                                          use_fast=False)

Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]

Next, the original model and the saved LoRA adapter are merged.

In [None]:
model = PeftModel.from_pretrained(base_model, PATH_MODEL)
model = model.merge_and_unload()

 The model that you have in memory is now a combination of the base model and the adapter that you have trained. You can now save this new model and upload it to Hugging Face.
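Conceptually, merging a LoRA adapter adds the scaled low-rank update to each adapted weight matrix: W' = W + (lora_alpha / r) * B @ A. The sketch below shows this on a tiny toy example with plain Python lists; the matrices and helper names are hypothetical and only meant to illustrate the arithmetic, not how PEFT implements it.

```python
def matmul(a, b):
    # Plain-Python matrix product for small nested lists.
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]


def merge_lora(W, A, B, lora_alpha, r):
    """Return W + (lora_alpha / r) * (B @ A)."""
    scale = lora_alpha / r
    delta = matmul(B, A)  # (d_out x r) @ (r x d_in) -> (d_out x d_in)
    return [[w + scale * d for w, d in zip(w_row, d_row)]
            for w_row, d_row in zip(W, delta)]


# Toy values: a 2x2 base weight and a rank-1 adapter.
W = [[1.0, 0.0], [0.0, 1.0]]
A = [[1.0, 2.0]]        # shape (r=1, d_in=2)
B = [[0.5], [-0.5]]     # shape (d_out=2, r=1)

merged = merge_lora(W, A, B, lora_alpha=2, r=1)
print(merged)
```

After the merge the adapter matrices are no longer needed: the result has the same shape as the original weight, which is why the merged model can be saved and loaded like any regular model.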

In [None]:
model.save_pretrained(new_model)
tokenizer.save_pretrained(new_model)

('martra-open-gemma-2b-it-dpo/tokenizer_config.json',
 'martra-open-gemma-2b-it-dpo/special_tokens_map.json',
 'martra-open-gemma-2b-it-dpo/tokenizer.model',
 'martra-open-gemma-2b-it-dpo/added_tokens.json')

In [None]:
model.push_to_hub(new_model,
                  private=True,
                  use_temp_dir=False)
tokenizer.push_to_hub(new_model,
                      private=True,
                      use_temp_dir=False)

model-00002-of-00002.safetensors:   0%|          | 0.00/67.1M [00:00<?, ?B/s]

model-00001-of-00002.safetensors:   0%|          | 0.00/4.95G [00:00<?, ?B/s]

Upload 2 LFS files:   0%|          | 0/2 [00:00<?, ?it/s]

README.md:   0%|          | 0.00/5.17k [00:00<?, ?B/s]

tokenizer.model:   0%|          | 0.00/4.24M [00:00<?, ?B/s]

CommitInfo(commit_url='https://huggingface.co/oopere/martra-open-gemma-2b-it-dpo/commit/eb36ff73dc67fcb18337b23f2c72ea0fa0ac1fcd', commit_message='Upload tokenizer', commit_description='', oid='eb36ff73dc67fcb18337b23f2c72ea0fa0ac1fcd', pr_url=None, pr_revision=None, pr_num=None)

## Inference

Let's test the new model and compare it with the original.

In [None]:
# Format prompt
message = [
    {"role": "user", "content": "3713841893836/4?\nLimit your response to mathematical expressions and symbols."}
]

In [None]:
tokenizer = AutoTokenizer.from_pretrained(model_name)
prompt = tokenizer.apply_chat_template(message, add_generation_prompt=True, tokenize=False)

# Create pipeline
pipeline = transformers.pipeline(
    "text-generation",
    model=model_name,
    tokenizer=tokenizer
)

Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]

In [None]:
# Generate text
sequences = pipeline(
    prompt,
    do_sample=True,
    temperature=0.1,
    top_p=0.2,
    num_return_sequences=1,
    max_length=200,
)
print(sequences[0]['generated_text'])

Truncation was not explicitly activated but `max_length` is provided a specific value, please use `truncation=True` to explicitly truncate examples to max length. Defaulting to 'longest_first' truncation strategy. If you encode pairs of sequences (GLUE-style) with the tokenizer you can select this strategy more precisely by providing a specific strategy to `truncation`.


<bos><start_of_turn>user
3713841893836/4?
Limit your response to mathematical expressions and symbols.<end_of_turn>
<start_of_turn>model
The result of 3713841893836 divided by 4 is 9234850425938.


**The response obtained with the original model contains text.**

In [None]:
new_model="oopere/martra-open-gemma-2b-it-dpo"
tokenizer_new_model = AutoTokenizer.from_pretrained(new_model)
prompt = tokenizer_new_model.apply_chat_template(message, add_generation_prompt=True, tokenize=False)

# Create pipeline
pipeline_new = transformers.pipeline(
    "text-generation",
    model=new_model,
    tokenizer=tokenizer_new_model
)

tokenizer_config.json:   0%|          | 0.00/34.2k [00:00<?, ?B/s]

tokenizer.model:   0%|          | 0.00/4.24M [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/636 [00:00<?, ?B/s]

config.json:   0%|          | 0.00/707 [00:00<?, ?B/s]

model.safetensors.index.json:   0%|          | 0.00/13.5k [00:00<?, ?B/s]

Downloading shards:   0%|          | 0/2 [00:00<?, ?it/s]

model-00001-of-00002.safetensors:   0%|          | 0.00/4.95G [00:00<?, ?B/s]

model-00002-of-00002.safetensors:   0%|          | 0.00/67.1M [00:00<?, ?B/s]

Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]

generation_config.json:   0%|          | 0.00/132 [00:00<?, ?B/s]

In [None]:
# Generate text
sequences = pipeline_new(
    prompt,
    do_sample=True,
    temperature=0.1,
    top_p=0.2,
    num_return_sequences=1,
    max_length=200,
)
print(sequences[0]['generated_text'])

Truncation was not explicitly activated but `max_length` is provided a specific value, please use `truncation=True` to explicitly truncate examples to max length. Defaulting to 'longest_first' truncation strategy. If you encode pairs of sequences (GLUE-style) with the tokenizer you can select this strategy more precisely by providing a specific strategy to `truncation`.


<bos><start_of_turn>user
3713841893836/4?
Limit your response to mathematical expressions and symbols.<end_of_turn>
<start_of_turn>model
3713841893836/4 = 9734850425930


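**The response of the DPO aligned model contains only a mathematical expression.**

The new model returns only numbers and symbols, which is the format we wanted! Note that the alignment changed the style of the answer, not its accuracy: neither model gets the division right (the exact result is 928460473459).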
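If you want to inspect the answer programmatically, a small helper can cut the model turn out of the generated text and compare the number with Python's exact integer arithmetic. This is a sketch: `extract_reply` is a hypothetical helper, and the generated text is copied from the output above.

```python
def extract_reply(generated_text):
    """Return only the text produced after the last model-turn marker."""
    marker = "<start_of_turn>model\n"
    return generated_text.rsplit(marker, 1)[-1].strip()


# Generated text copied from the output of the DPO-aligned model above.
generated = (
    "<bos><start_of_turn>user\n3713841893836/4?\n"
    "Limit your response to mathematical expressions and symbols.<end_of_turn>\n"
    "<start_of_turn>model\n3713841893836/4 = 9734850425930"
)

reply = extract_reply(generated)
claimed = int(reply.split("=")[-1].strip())
exact = 3713841893836 // 4

print(reply)
print("format only digits and symbols:", not any(ch.isalpha() for ch in reply))
print("arithmetic correct:", claimed == exact, "| exact value:", exact)
```

The check separates the two things being evaluated: the response format, which is what the preference data targeted, and the numeric result, which this kind of alignment does not guarantee.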

If you don't want to wait for the training, just test my model on Hugging Face. It has been trained with the same dataset for 2 hours on an A100 GPU on Colab.

# Test model from Hugging Face.

In [None]:
#Test DPO model on Hugging Face.
new_model="oopere/martra-open-gemma-2b-it-dpo"
tokenizer_new_model = AutoTokenizer.from_pretrained(new_model)
prompt = tokenizer_new_model.apply_chat_template(message, add_generation_prompt=True, tokenize=False)

# Create pipeline
pipeline_new = transformers.pipeline(
    "text-generation",
    model=new_model,
    tokenizer=tokenizer_new_model
)

Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]

In [None]:
# Generate text
sequences = pipeline_new(
    prompt,
    do_sample=True,
    temperature=0.1,
    top_p=0.2,
    num_return_sequences=1,
    max_length=200,
)
print(sequences[0]['generated_text'])

Truncation was not explicitly activated but `max_length` is provided a specific value, please use `truncation=True` to explicitly truncate examples to max length. Defaulting to 'longest_first' truncation strategy. If you encode pairs of sequences (GLUE-style) with the tokenizer you can select this strategy more precisely by providing a specific strategy to `truncation`.


<bos><start_of_turn>user
3713841893836/4?
Limit your response to mathematical expressions and symbols.<end_of_turn>
<start_of_turn>model
3713841893836/4 = 9734850425930


# Summary

The model alignment process has been a complete success. The truth is, with the Hugging Face libraries, everything is straightforward.

The challenge is knowing about the technique, when to apply it, and having the necessary data.

In this notebook, you've addressed the first two points.

I got a lot of inspiration from:

* [RLHF in 2024 with DPO & Hugging Face](https://www.philschmid.de/dpo-align-llms-in-2024-with-trl) by Phil Schmid.

* [Fine-tune a Mistral-7b model with Direct Preference Optimization](https://medium.com/towards-data-science/fine-tune-a-mistral-7b-model-with-direct-preference-optimization-708042745aac) by Maxime Labonne

