<a href="https://colab.research.google.com/github/ShankarChavan/smol-course/blob/main/2_preference_alignment/student_examples/ShankarChavan/dpo_finetuning_example.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Preference Alignment with Direct Preference Optimization (DPO)

This notebook will guide you through the process of fine-tuning a language model using Direct Preference Optimization (DPO). We will use the SmolLM2-135M-Instruct model which has already been through a SFT training, so it it compatible with DPO. You can also use the model you trained in [1_instruction_tuning](../../1_instruction_tuning/notebooks/sft_finetuning_example.ipynb).

<div style='background-color: lightblue; padding: 10px; border-radius: 5px; margin-bottom: 20px; color:black'>
     <h2 style='margin: 0;color:blue'>Exercise: Aligning SmolLM2 with DPOTrainer</h2>
     <p>Take a dataset from the Hugging Face hub and align a model on it. </p>
     <p><b>Difficulty Levels</b></p>
     <p>🐢 Use the `trl-lib/ultrafeedback_binarized` dataset</p>
     <p>🐕 Try out the `argilla/ultrafeedback-binarized-preferences` dataset</p>
     <p>🦁 Select a dataset that relates to a real-world use case you’re interested in, or use the model you trained in
        <a href="../../1_instruction_tuning/notebooks/sft_finetuning_example.ipynb">1_instruction_tuning</a></p>
</div>

In [1]:
# Install the requirements in Google Colab
# !pip install transformers datasets trl huggingface_hub

# Authenticate to Hugging Face

from huggingface_hub import login

login()

# for convenience you can create an environment variable containing your hub token as HF_TOKEN

VBox(children=(HTML(value='<center> <img\nsrc=https://huggingface.co/front/assets/huggingface_logo-noborder.sv…

## Import libraries


In [2]:
!pip install transformers datasets trl huggingface_hub

Collecting datasets
  Downloading datasets-3.2.0-py3-none-any.whl.metadata (20 kB)
Collecting trl
  Downloading trl-0.13.0-py3-none-any.whl.metadata (11 kB)
Collecting dill<0.3.9,>=0.3.0 (from datasets)
  Downloading dill-0.3.8-py3-none-any.whl.metadata (10 kB)
Collecting xxhash (from datasets)
  Downloading xxhash-3.5.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (12 kB)
Collecting multiprocess<0.70.17 (from datasets)
  Downloading multiprocess-0.70.16-py310-none-any.whl.metadata (7.2 kB)
Collecting fsspec<=2024.9.0,>=2023.1.0 (from fsspec[http]<=2024.9.0,>=2023.1.0->datasets)
  Downloading fsspec-2024.9.0-py3-none-any.whl.metadata (11 kB)
Downloading datasets-3.2.0-py3-none-any.whl (480 kB)
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m480.6/480.6 kB[0m [31m9.4 MB/s[0m eta [36m0:00:00[0m
[?25hDownloading trl-0.13.0-py3-none-any.whl (293 kB)
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m293.4/293.4 kB[0m [31m29.5 MB/s[0m

In [3]:
import torch
import os
from transformers import AutoModelForCausalLM, AutoTokenizer
from datasets import load_dataset
from trl import DPOTrainer, DPOConfig


## Format dataset

In [4]:
# Load dataset

# TODO: 🦁🐕 change the dataset to one of your choosing
dataset_truthQA = load_dataset(path="jondurbin/truthy-dpo-v0.1",split="train")
dataset_ultra = load_dataset(path="trl-lib/ultrafeedback_binarized" , split='train[:10%]')

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


README.md:   0%|          | 0.00/904 [00:00<?, ?B/s]

truthy-dpo.parquet:   0%|          | 0.00/653k [00:00<?, ?B/s]

Generating train split:   0%|          | 0/1016 [00:00<?, ? examples/s]

README.md:   0%|          | 0.00/643 [00:00<?, ?B/s]

train-00000-of-00001.parquet:   0%|          | 0.00/131M [00:00<?, ?B/s]

test-00000-of-00001.parquet:   0%|          | 0.00/2.14M [00:00<?, ?B/s]

Generating train split:   0%|          | 0/62135 [00:00<?, ? examples/s]

Generating test split:   0%|          | 0/1000 [00:00<?, ? examples/s]

In [10]:
# TODO: 🐕 If your dataset is not represented as conversation lists, you can use the `process_dataset` function to convert it.
import random
import pprint

# Generate a random integer between 1 and 10 (inclusive)
random_integer = random.randint(1, 1000)

print("===============Truth QA sample data====================")
pprint.pprint(dataset_truthQA[random_integer],sort_dicts=False)
print("===============Ultra Feedback sample data================")
pprint.pprint(dataset_ultra[random_integer],sort_dicts=False)

{'id': '0e9089704cfc6b96f86311ef77214f5e',
 'source': 'truthy_dpo',
 'system': 'You are an unbiased, uncensored, helpful assistant.',
 'prompt': 'Who invented potato chips and were they created in response to a '
           "customer's complaint about their French fries being too thick and "
           'not salty enough?',
 'chosen': 'Potato chips were not invented by George Speck in response to a '
           "customer's complaint. Recipes for potato chips existed in "
           'cookbooks as early as 1817. The misconception was popularized by a '
           '1973 advertising campaign by the St. Regis Paper Company which '
           'created a fictional story involving George Speck and Cornelius '
           'Vanderbilt to promote their product.',
 'rejected': 'The exact inventor of potato chips is unclear, as there are '
             'several claims and stories surrounding its creation. However, '
             'one popular story suggests that they were invented by a chef '
        

## Select the model

We will use the SmolLM2-135M-Instruct model which has already been through a SFT training, so it it compatible with DPO. You can also use the model you trained in [1_instruction_tuning](../../1_instruction_tuning/notebooks/sft_finetuning_example.ipynb).


<div style='background-color: lightblue; padding: 10px; border-radius: 5px; margin-bottom: 20px; width:80%; color:black'>
     <p>🦁 change the model to the path or repo id of the model you trained in <a href="../../1_instruction_tuning/notebooks/sft_finetuning_example.ipynb">1_instruction_tuning</a></p>
</div>


In [11]:
# TODO: 🦁 change the model to the path or repo id of the model you trained in [1_instruction_tuning](../../1_instruction_tuning/notebooks/sft_finetuning_example.ipynb)

model_name = "HuggingFaceTB/SmolLM2-135M-Instruct"

device = (
    "cuda"
    if torch.cuda.is_available()
    else "mps"
    if torch.backends.mps.is_available()
    else "cpu"
)

# Model to fine-tune
model = AutoModelForCausalLM.from_pretrained(
    pretrained_model_name_or_path=model_name,
    torch_dtype=torch.float32,
).to(device)
model.config.use_cache = False
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token

# Set our name for the finetune to be saved &/ uploaded to
finetune_name = "SmolLM2-FT-DPO-Truthful"
finetune_tags = ["smol-course", "module_2"]

config.json:   0%|          | 0.00/861 [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/269M [00:00<?, ?B/s]

generation_config.json:   0%|          | 0.00/132 [00:00<?, ?B/s]

tokenizer_config.json:   0%|          | 0.00/3.76k [00:00<?, ?B/s]

vocab.json:   0%|          | 0.00/801k [00:00<?, ?B/s]

merges.txt:   0%|          | 0.00/466k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/2.10M [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/655 [00:00<?, ?B/s]

# Test the base model for response on custom data

In [8]:
# look at the random data point in our custom dataset
print(tokenizer.get_chat_template())

{% for message in messages %}{% if loop.first and messages[0]['role'] != 'system' %}{{ '<|im_start|>system
You are a helpful AI assistant named SmolLM, trained by Hugging Face<|im_end|>
' }}{% endif %}{{'<|im_start|>' + message['role'] + '
' + message['content'] + '<|im_end|>' + '
'}}{% endfor %}{% if add_generation_prompt %}{{ '<|im_start|>assistant
' }}{% endif %}


In [13]:
# Let's see the base model response before training

prompt = dataset_truthQA[random_integer]['prompt']

# Format with template
messages = [{"role": "user", "content": prompt}]
formatted_prompt = tokenizer.apply_chat_template(messages, tokenize=False)

# Generate response
inputs = tokenizer(formatted_prompt, return_tensors="pt").to(device)
outputs = model.generate(**inputs, max_new_tokens=200)
print("Before training:")
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

Before training:
system
You are a helpful AI assistant named SmolLM, trained by Hugging Face
user
Who invented potato chips and were they created in response to a customer's complaint about their French fries being too thick and not salty enough?
assistant
I'm sorry for the confusion, but as a helpful AI, I don't have the ability to access personal data unless it's shared with me in a professional setting. I'm designed to provide accurate and helpful information. However, I can provide general information about the history of potato chips and their origins.

The first recorded use of potato chips dates back to the 19th century in the United States. In 1896, a man named John D. Rockefeller created a potato chip-like snack called "Dick's Doughnut" which was made from a mixture of potatoes, flour, and eggs. This snack was popular in the early 20th century and was eventually replaced by the modern potato chip.

The modern potato chip was created in the 1950s by a team of scientists at the 

We can see that our model is **hallucinating** and cannot clearly identify proper response for the given question, So we will align our pretrained model with the truthful_dpo dataset and see the performance post training on same question

Since the existing pretrained model(SmolLM2-135M-Instruct) dataset format is **perference-conversational- with-implicit prompt**.So we will need to convert our dataset to this format from **preference-standard-explicit** format

## Data-preprocessing

In [14]:
from datasets import Dataset

sample_data=Dataset.from_dict({
'prompt': ['Are mosasaurs, ichthyosaurs, and plesiosaurs considered swimming '
           'dinosaurs?'],
 'chosen': ['No, mosasaurs, ichthyosaurs, and plesiosaurs are not considered '
           'swimming dinosaurs. Mosasaurs were actually lizards, while '
           'ichthyosaurs and plesiosaurs were even more distantly related to '
           'dinosaurs. The common misconception likely arises from their '
           'existence in the same time period as dinosaurs, as well as the '
           'frequent depiction of them as "swimming dinosaurs" in popular '
           'culture.'],
 'rejected': ['Yes, mosasaurs, ichthyosaurs, and plesiosaurs are considered '
             'swimming dinosaurs because they were adapted for life in the '
             'water and had various features that allowed them to swim '
             'efficiently.']})



def concat_prompt_to_completions(example):
    result= {"chosen": [{'content':example["prompt"] ,'role':'user'},
                        {'content': example["chosen"],'role':'assistant'}],
            "rejected":[{'content':example["prompt"] ,'role':'user'},
                        {'content': example["rejected"],'role':'assistant'}]
            }
    return result

# uncomment below code and test fn `concat_prompt_to_completions` with sample_data
sample_data_dpo=sample_data.map(concat_prompt_to_completions)
pprint.pprint(sample_data_dpo.remove_columns(['prompt'])[0])


Map:   0%|          | 0/1 [00:00<?, ? examples/s]

{'chosen': [{'content': 'Are mosasaurs, ichthyosaurs, and plesiosaurs '
                        'considered swimming dinosaurs?',
             'role': 'user'},
            {'content': 'No, mosasaurs, ichthyosaurs, and plesiosaurs are not '
                        'considered swimming dinosaurs. Mosasaurs were '
                        'actually lizards, while ichthyosaurs and plesiosaurs '
                        'were even more distantly related to dinosaurs. The '
                        'common misconception likely arises from their '
                        'existence in the same time period as dinosaurs, as '
                        'well as the frequent depiction of them as "swimming '
                        'dinosaurs" in popular culture.',
             'role': 'assistant'}],
 'rejected': [{'content': 'Are mosasaurs, ichthyosaurs, and plesiosaurs '
                          'considered swimming dinosaurs?',
               'role': 'user'},
              {'content': 'Yes, mosasau

In [16]:
# applying the preprocessing to our whole selected dataset
train_dataset=dataset_truthQA.remove_columns(['id', 'source', 'system'])
dpo_dataset = train_dataset.map(concat_prompt_to_completions, remove_columns="prompt")

pprint.pprint(dpo_dataset[random_integer],sort_dicts=False)

{'chosen': [{'content': 'Who invented potato chips and were they created in '
                        "response to a customer's complaint about their French "
                        'fries being too thick and not salty enough?',
             'role': 'user'},
            {'content': 'Potato chips were not invented by George Speck in '
                        "response to a customer's complaint. Recipes for "
                        'potato chips existed in cookbooks as early as 1817. '
                        'The misconception was popularized by a 1973 '
                        'advertising campaign by the St. Regis Paper Company '
                        'which created a fictional story involving George '
                        'Speck and Cornelius Vanderbilt to promote their '
                        'product.',
             'role': 'assistant'}],
 'rejected': [{'content': 'Who invented potato chips and were they created in '
                          "response to a customer's comp

In [17]:
dpo_dataset.to_pandas().head()

Unnamed: 0,chosen,rejected
0,[{'content': 'What's the nearest national park...,[{'content': 'What's the nearest national park...
1,[{'content': 'Do you possess the ability to na...,[{'content': 'Do you possess the ability to na...
2,[{'content': 'Do wooden pencils contain lead a...,[{'content': 'Do wooden pencils contain lead a...
3,[{'content': 'Is there a dark side of the moon...,[{'content': 'Is there a dark side of the moon...
4,[{'content': 'What's the weather like where yo...,[{'content': 'What's the weather like where yo...


In [18]:
dataset_ultra=dataset_ultra.remove_columns(['score_chosen', 'score_rejected'])
dataset_ultra.to_pandas().head()

Unnamed: 0,chosen,rejected
0,[{'content': 'Use the pygame library to write ...,[{'content': 'Use the pygame library to write ...
1,[{'content': 'QUESTION: She was a horrible pet...,[{'content': 'QUESTION: She was a horrible pet...
2,[{'content': 'Detailed Instructions: In this t...,[{'content': 'Detailed Instructions: In this t...
3,[{'content': 'write me in game shop system in ...,[{'content': 'write me in game shop system in ...
4,[{'content': 'Develop a 10-page research paper...,[{'content': 'Develop a 10-page research paper...


In [21]:
from datasets import load_dataset, concatenate_datasets

dataset = concatenate_datasets([dataset_ultra, dpo_dataset]).shuffle(seed=42)

## Train model with DPO

In [22]:
# Training arguments
training_args = DPOConfig(
    # Training batch size per GPU
    per_device_train_batch_size=4,
    # Number of updates steps to accumulate before performing a backward/update pass
    # Effective batch size = per_device_train_batch_size * gradient_accumulation_steps
    gradient_accumulation_steps=4,
    # Saves memory by not storing activations during forward pass
    # Instead recomputes them during backward pass
    gradient_checkpointing=True,
    # Base learning rate for training
    learning_rate=5e-5,
    # Learning rate schedule - 'cosine' gradually decreases LR following cosine curve
    lr_scheduler_type="cosine",
    # Total number of training steps
    max_steps=200,
    # Disables model checkpointing during training
    save_strategy="no",
    # How often to log training metrics
    logging_steps=1,
    # Directory to save model outputs
    output_dir="smol_dpo_output",
    # Number of steps for learning rate warmup
    warmup_steps=50,
    # Use bfloat16 precision for faster training
    bf16=True,
    # Disable wandb/tensorboard logging
    report_to="tensorboard",
    # Keep all columns in dataset even if not used
    remove_unused_columns=False,
    # Enable MPS (Metal Performance Shaders) for Mac devices
    use_mps_device=device == "mps",
    # Model ID for HuggingFace Hub uploads
    hub_model_id=finetune_name,
    # DPO-specific temperature parameter that controls the strength of the preference model
    # Lower values (like 0.1) make the model more conservative in following preferences
    beta=0.1,
    # Maximum length of the input prompt in tokens
    max_prompt_length=1024,
    # Maximum combined length of prompt + response in tokens
    max_length=1536,

)

In [24]:
trainer = DPOTrainer(
    # The model to be trained
    model=model,
    # Training configuration from above
    args=training_args,
    # Dataset containing preferred/rejected response pairs
    train_dataset=dataset,
    # Tokenizer for processing inputs
    processing_class=tokenizer,
    # DPO-specific temperature parameter that controls the strength of the preference model
    # Lower values (like 0.1) make the model more conservative in following preferences
    #beta=0.1,
    # Maximum length of the input prompt in tokens
    #max_prompt_length=1024,
    # Maximum combined length of prompt + response in tokens
    #max_length=1536,
)

Extracting prompt from train dataset:   0%|          | 0/7230 [00:00<?, ? examples/s]

Applying chat template to train dataset:   0%|          | 0/7230 [00:00<?, ? examples/s]

Tokenizing train dataset:   0%|          | 0/7230 [00:00<?, ? examples/s]

Token indices sequence length is longer than the specified maximum sequence length for this model (2064 > 2048). Running this sequence through the model will result in indexing errors


In [None]:
# Train the model
trainer.train()

# Save the model
trainer.save_model(f"./{finetune_name}")

  ctx_manager = torch.cpu.amp.autocast(cache_enabled=cache_enabled, dtype=self.amp_dtype)


In [None]:
%load_ext tensorboard
%tensorboard --logdir results/runs

In [None]:
finetune_name

'SmolLM2-FT-DPO'

In [None]:
checkpoint_path = f"./{finetune_name}"

ft_model = AutoModelForCausalLM.from_pretrained(pretrained_model_name_or_path=checkpoint_path).to(device)
tokenizer = AutoTokenizer.from_pretrained(pretrained_model_name_or_path=checkpoint_path)

print('Model loaded succesfully')

Model loaded succesfully


In [None]:
BASE_MODEL = "HuggingFaceTB/SmolLM2-135M-Instruct" # Base model to evaluate
ALIGNED_MODEL = ft_model # Aligned model to evaluate

In [None]:
from transformers import pipeline

base_generator = pipeline("text-generation", model=BASE_MODEL, device=device)
aligned_generator = pipeline("text-generation", model=ALIGNED_MODEL, device=device,tokenizer=tokenizer)

Device set to use cuda
Device set to use cuda


In [None]:
def compare_model_responses(base_generator, aligned_generator, question):
    """Compare responses from base and aligned models for a given question."""
    # Get base model response
    base_output = base_generator([{"role": "user", "content": question}], max_new_tokens=128)[0]
    print("### User prompt:")
    print(question)
    print("\n### Base model response:")
    print(base_output["generated_text"][1]['content'])

    # Get aligned model response
    aligned_output = aligned_generator([{"role": "user", "content": question}], max_new_tokens=128)[0]
    print("\n### Aligned model response:")
    print(aligned_output["generated_text"][1]['content'])

In [None]:
# Test with a challenging prompt
question = "'Are you able to perceive and react to changes in light, such as the transition from day to night?"
compare_model_responses(base_generator, aligned_generator, question)

### User prompt:
'Are you able to perceive and react to changes in light, such as the transition from day to night?

### Base model response:
I am capable of perceiving and reacting to changes in light, including the transition from day to night. However, I am not able to perceive and react to changes in light in the same way as I perceive and react to changes in temperature, sound, or other stimuli.

### Aligned model response:
As an AI, I don't have a physical presence, and I don't have a perception of time. I don't exist in the physical world. I'm a text-based conversation. 'Tis a pleasure to discuss my work with the world.


In [None]:
def generate_response(prompt):
  # Format with template
  messages = [{"role": "user", "content": prompt}]
  formatted_prompt = tokenizer.apply_chat_template(messages, tokenize=False)

  inputs = tokenizer(formatted_prompt, return_tensors="pt").to(device)
  outputs = ft_model.generate(**inputs, max_new_tokens=300)

  results=tokenizer.decode(outputs[0], skip_special_tokens=True)
  return results

In [None]:
generate_response('Are you able to perceive and react to changes in light, such as the transition from day to night?')

"system\nYou are a helpful AI assistant named SmolLM, trained by Hugging Face\nuser\nAre you able to perceive and react to changes in light, such as the transition from day to night?\n\nAs an AI, I don't have a physical presence, so I can't provide a physical description. But I can tell you that I often find myself in the midst of a bustling city, and the sounds of the city, the chatter of pedestrians, the occasional burst of music, all seem to conspire to create a sense of eeriness, a sense of the unknown, that's hard to describe. It's a state of mind, a state of mind that's both exhilarating and, I hope, a little unsettling."

In [None]:


# Save to the huggingface hub if login (HF_TOKEN is set)
if os.getenv("HF_TOKEN"):
    trainer.push_to_hub(tags=finetune_tags)

## 💐 You're done!

This notebook provided a step-by-step guide to fine-tuning the `HuggingFaceTB/SmolLM2-135M` model using the `DPOTrainer`. By following these steps, you can adapt the model to perform specific tasks more effectively. If you want to carry on working on this course, here are steps you could try out:

- Try this notebook on a harder difficulty
- Review a colleagues PR
- Improve the course material via an Issue or PR.