# Installs

## Open notebook in:
| Colab | Gradient |
|:------|:---------|
| [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/Nicolepcx/Transformers-in-Action/blob/main/CH07/ch07_DPO.ipynb) | [![Gradient](https://assets.paperspace.io/img/gradient-badge.svg)](https://console.paperspace.com/github/Nicolepcx/Transformers-in-Action/blob/main/CH07/ch07_DPO.ipynb) |

In [1]:
# Clone repo, if it's not already cloned, to be sure all runs smoothly
# on Colab or Paperspace
import os

if not os.path.isdir('Transformers-in-Action'):
    !git clone https://github.com/Nicolepcx/Transformers-in-Action.git
else:
    print('Repository already exists. Skipping clone.')


current_path = %pwd
if '/Transformers-in-Action' in current_path:
    new_path = current_path + '/utils'
else:
    new_path = current_path + '/Transformers-in-Action/utils'
%cd $new_path


Repository already exists. Skipping clone.
/content/Transformers-in-Action/utils
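The path logic in the cell above can also be written without IPython magics. The following is a minimal sketch, assuming a POSIX-style path as on Colab or Paperspace; `resolve_utils_path` is a hypothetical helper, not part of the repository. It also covers the case where the cell is rerun from inside `utils`, where the original string concatenation would append a second `utils`.

```python
import os


def resolve_utils_path(current_path, repo_name="Transformers-in-Action"):
    """Return the repository's utils directory for a given working directory."""
    parts = current_path.rstrip("/").split("/")
    if repo_name in parts:
        # Already somewhere inside the repository: cut back to the repo root.
        repo_root = "/".join(parts[: parts.index(repo_name) + 1])
    else:
        # Outside the repository: the clone lives in the current directory.
        repo_root = os.path.join(current_path, repo_name)
    return os.path.join(repo_root, "utils")


print(resolve_utils_path("/content"))
print(resolve_utils_path("/content/Transformers-in-Action/utils"))
```

Both calls resolve to the same `utils` directory, so the result could be passed to `os.chdir` regardless of where the notebook kernel currently sits.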


# About this notebook


This notebook continues from `ch_07_SFT.ipynb`: starting from the supervised fine-tuned (SFT) model produced in that step, we train it further with Direct Preference Optimization (DPO).


The code in this notebook is inspired by [The Alignment Handbook](https://github.com/huggingface/alignment-handbook) from Hugging Face for the [trl library](https://huggingface.co/docs/trl/en/index) and by the [unsloth library](https://github.com/unslothai/unsloth).


# Install requirements

In [2]:
from requirements import *

In [3]:
install_base_packages()
install_required_packages_ch07()

Installing base requirements...
✅ transformers==4.26.1 installation completed successfully!

✅ datasets==2.10.1 installation completed successfully!

Installing chapter 7 requirements...
✅ accelerate==0.26.1 installation completed successfully!

✅ wandb installation completed successfully!

✅ peft==0.7.1 installation completed successfully!

✅ safetensors==0.4.1 installation completed successfully!

✅ trl==0.7.10 installation completed successfully!

✅ tree-of-thoughts-llm==0.1.0 installation completed successfully!



In [4]:
%%capture
import torch

# Pick the Unsloth install variant from the GPU's CUDA compute capability
def install_unsloth():
    major_version = torch.cuda.get_device_capability()[0]  # Major compute capability, e.g. 7 for a T4
    if major_version >= 8:
        # Ampere and newer GPUs (RTX 30xx, RTX 40xx, A100, H100, L40)
        !pip install "unsloth[colab_ampere] @ git+https://github.com/unslothai/unsloth.git"
    else:
        # Older GPUs (V100, Tesla T4, RTX 20xx)
        !pip install "unsloth[colab] @ git+https://github.com/unslothai/unsloth.git"

# Install Unsloth based on the GPU's compute capability
install_unsloth()
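To check which variant the cell above would pick without running `pip`, the selection rule can be isolated in a plain function. This is a minimal sketch; `unsloth_extra_for` and `unsloth_pip_spec` are hypothetical helper names, and the rule is simply the `>= 8` comparison used in the cell.

```python
def unsloth_extra_for(compute_capability_major):
    """Return the pip extra used above for a given major compute capability."""
    return "colab_ampere" if compute_capability_major >= 8 else "colab"


def unsloth_pip_spec(compute_capability_major):
    """Build the requirement string that the install cell hands to pip."""
    extra = unsloth_extra_for(compute_capability_major)
    return f"unsloth[{extra}] @ git+https://github.com/unslothai/unsloth.git"


# A Tesla T4 reports compute capability 7.5, an A100 reports 8.0.
print(unsloth_pip_spec(7))
print(unsloth_pip_spec(8))
```

On a real machine the argument would come from `torch.cuda.get_device_capability()[0]`, as in the install cell.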


In [5]:
!pip install -U transformers -q

# Imports

In [6]:
import wandb
from unsloth import FastLanguageModel

import os
import re
import pprint
import textwrap
from typing import List, Literal, Optional

from datasets import load_dataset, concatenate_datasets, DatasetDict
from transformers import TrainingArguments
from trl import DPOTrainer

# We have to patch the DPO Trainer
from unsloth import PatchDPOTrainer
PatchDPOTrainer()



In [7]:
wandb.login()


[34m[1mwandb[0m: Logging into wandb.ai. (Learn how to deploy a W&B server locally: https://wandb.me/wandb-server)
[34m[1mwandb[0m: You can find your API key in your browser here: https://wandb.ai/authorize
wandb: Paste an API key from your profile and hit enter, or press ctrl+c to quit:

 ··········


[34m[1mwandb[0m: Appending key for api.wandb.ai to your netrc file: /root/.netrc


True

In [8]:

model, model_tokenizer = FastLanguageModel.from_pretrained(
    model_name = "unsloth/zephyr-sft-bnb-4bit",
    max_seq_length = 4096,
    dtype = None, # Auto-detect the dtype
    load_in_4bit = True # Use 4-bit quantization to reduce memory usage
)


config.json:   0%|          | 0.00/1.04k [00:00<?, ?B/s]

==((====))==  Unsloth: Fast Mistral patching release 2024.2
   \\   /|    GPU: Tesla T4. Max memory: 14.748 GB. Platform = Linux.
O^O/ \_/ \    Pytorch: 2.1.0+cu121. CUDA = 7.5. CUDA Toolkit = 12.1.
\        /    Bfloat16 = FALSE. Xformers = 0.0.22.post7. FA = False.
 "-____-"     Apache 2 free license: http://github.com/unslothai/unsloth


You passed `quantization_config` to `from_pretrained` but the model you're loading already has a `quantization_config` attribute. The `quantization_config` attribute will be overwritten with the one you passed to `from_pretrained`.


model.safetensors:   0%|          | 0.00/4.13G [00:00<?, ?B/s]

generation_config.json:   0%|          | 0.00/116 [00:00<?, ?B/s]

tokenizer_config.json:   0%|          | 0.00/1.48k [00:00<?, ?B/s]

tokenizer.model:   0%|          | 0.00/493k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/1.80M [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/624 [00:00<?, ?B/s]

In [9]:

DEFAULT_CHAT_TEMPLATE = "{% for message in messages %}\n{% if message['role'] == 'user' %}\n{{ '<|user|>\n' + message['content'] + eos_token }}\n{% elif message['role'] == 'system' %}\n{{ '<|system|>\n' + message['content'] + eos_token }}\n{% elif message['role'] == 'assistant' %}\n{{ '<|assistant|>\n'  + message['content'] + eos_token }}\n{% endif %}\n{% if loop.last and add_generation_prompt %}\n{{ '<|assistant|>' }}\n{% endif %}\n{% endfor %}"

def apply_chat_template(example, tokenizer, assistant_prefix="\n"):
    def _strip_prefix(s, pattern):
        return re.sub(f"^{re.escape(pattern)}", "", s)

    def _concatenate_messages(messages):
        return ' '.join(msg['content'] for msg in messages)

    if all(key in example for key in ('chosen', 'rejected')):
        # Process 'chosen' field
        if isinstance(example['chosen'], list):
            example['chosen'] = _strip_prefix(_concatenate_messages(example['chosen'][1:]), assistant_prefix)

        # Process 'rejected' field
        if isinstance(example['rejected'], list):
            example['rejected'] = _strip_prefix(_concatenate_messages(example['rejected'][1:]), assistant_prefix)

        # Process 'prompt' field, if it's a list of messages
        if 'prompt' in example and isinstance(example['prompt'], list):
            example['prompt'] = _strip_prefix(_concatenate_messages(example['prompt']), assistant_prefix)

    return example


def get_sampled_datasets(dataset_name, splits, fraction, shuffle=True):
    """
    Loads and samples a fraction of the specified dataset splits.

    Args:
        dataset_name (str): The name of the dataset to load.
        splits (List[str]): The specific splits of the dataset to load.
        fraction (float): The fraction of the dataset to sample.
        shuffle (bool): Whether to shuffle the dataset.

    Returns:
        DatasetDict: A dictionary containing the sampled datasets.
    """
    raw_datasets = DatasetDict()
    for split in splits:
        dataset = load_dataset(dataset_name, split=split)
        if shuffle:
            dataset = dataset.shuffle(seed=42)
        sampled_dataset = dataset.select(range(int(fraction * len(dataset))))
        raw_datasets[split] = sampled_dataset
    return raw_datasets


def format_dataset_for_dpo(dataset, tokenizer):
    formatted_dataset = dataset.map(
        lambda example: apply_chat_template(example, tokenizer),
        remove_columns=[col for col in dataset.column_names if col not in ['chosen', 'rejected', 'prompt']],
        desc="Formatting dataset for DPO",
    )
    return formatted_dataset
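To see what the formatting step does to a single record, here is a minimal sketch that restates the core of `apply_chat_template` on a toy example. The record is made up for illustration and `to_dpo_strings` is a hypothetical name. It assumes the same behaviour as the cell above: the first message of `chosen` and `rejected` (the user turn) is dropped and the remaining message contents are joined into one string.

```python
import re


def strip_prefix(text, prefix):
    # Remove the prefix only when it appears at the very start of the text.
    return re.sub(f"^{re.escape(prefix)}", "", text)


def flatten(messages):
    # Join the contents of a list of chat messages into one string.
    return " ".join(message["content"] for message in messages)


def to_dpo_strings(example, assistant_prefix="\n"):
    out = dict(example)
    for key in ("chosen", "rejected"):
        if isinstance(out[key], list):
            # Skip the first message (the user turn), keep the reply text.
            out[key] = strip_prefix(flatten(out[key][1:]), assistant_prefix)
    return out


# Toy record shaped like a preference example (illustrative data only).
toy = {
    "prompt": "What is 2 + 2?",
    "chosen": [
        {"role": "user", "content": "What is 2 + 2?"},
        {"role": "assistant", "content": "2 + 2 equals 4."},
    ],
    "rejected": [
        {"role": "user", "content": "What is 2 + 2?"},
        {"role": "assistant", "content": "It is 5."},
    ],
}

result = to_dpo_strings(toy)
print(result["chosen"])
print(result["rejected"])
```

After this step each record holds three plain strings (`prompt`, `chosen`, `rejected`), which is the shape the DPO trainer is given later in the notebook.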





In [10]:
dataset_name = "HuggingFaceH4/ultrafeedback_binarized"
# We only use the preference modelling (prefs) splits of the dataset
splits = ["train_prefs", "test_prefs"]
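# The fraction of the dataset to sample. With 0.005 (305 training examples) the
# run shown below took about 42 minutes on a Tesla T4; larger fractions take
# proportionally longer, so adjust for your runtime budget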
fraction = 0.005
tokenizer = model_tokenizer

# Get sampled datasets
raw_datasets = get_sampled_datasets(dataset_name, splits, fraction)

# Format datasets for DPO
formatted_datasets = DatasetDict()
for split in raw_datasets.keys():
    formatted_datasets[split] = format_dataset_for_dpo(raw_datasets[split], tokenizer)



Downloading readme:   0%|          | 0.00/6.77k [00:00<?, ?B/s]

Downloading and preparing dataset None/None to /root/.cache/huggingface/datasets/HuggingFaceH4___parquet/HuggingFaceH4--ultrafeedback_binarized-3d6b85969d759989/0.0.0/2a3b91fbd88a2c90d1dbbb32b460cf621d31bd5b05b934492fdef7d8d6f236ec...


Downloading data files:   0%|          | 0/6 [00:00<?, ?it/s]

Downloading data:   0%|          | 0.00/3.72M [00:00<?, ?B/s]

Downloading data:   0%|          | 0.00/184M [00:00<?, ?B/s]

Downloading data:   0%|          | 0.00/226M [00:00<?, ?B/s]

Downloading data:   0%|          | 0.00/7.29M [00:00<?, ?B/s]

Downloading data:   0%|          | 0.00/3.02M [00:00<?, ?B/s]

Downloading data:   0%|          | 0.00/226M [00:00<?, ?B/s]

Extracting data files:   0%|          | 0/6 [00:00<?, ?it/s]

Generating test_sft split:   0%|          | 0/1000 [00:00<?, ? examples/s]

Generating train_gen split:   0%|          | 0/61135 [00:00<?, ? examples/s]

Generating train_prefs split:   0%|          | 0/61135 [00:00<?, ? examples/s]

Generating test_prefs split:   0%|          | 0/2000 [00:00<?, ? examples/s]

Generating test_gen split:   0%|          | 0/1000 [00:00<?, ? examples/s]

Generating train_sft split:   0%|          | 0/61135 [00:00<?, ? examples/s]

Dataset parquet downloaded and prepared to /root/.cache/huggingface/datasets/HuggingFaceH4___parquet/HuggingFaceH4--ultrafeedback_binarized-3d6b85969d759989/0.0.0/2a3b91fbd88a2c90d1dbbb32b460cf621d31bd5b05b934492fdef7d8d6f236ec. Subsequent calls will reuse this data.




Formatting dataset for DPO:   0%|          | 0/305 [00:00<?, ? examples/s]

Formatting dataset for DPO:   0%|          | 0/10 [00:00<?, ? examples/s]
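The two progress bars above report 305 and 10 examples. A short worked calculation, using the split sizes printed in the download log and the truncating `int(...)` from `get_sampled_datasets`, shows where those numbers come from:

```python
# Split sizes as reported in the "Generating ... split" lines above.
split_sizes = {"train_prefs": 61135, "test_prefs": 2000}
fraction = 0.005

# get_sampled_datasets keeps the first int(fraction * len(dataset)) rows
# of the shuffled split, so the count is rounded down.
sampled = {name: int(fraction * size) for name, size in split_sizes.items()}

print(sampled["train_prefs"])  # 305
print(sampled["test_prefs"])   # 10
```

Because the count is truncated rather than rounded, 0.005 of 61,135 rows (305.675) becomes 305.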

In [11]:

# Prepare the datasets for training
def prepare_dataset_for_training(dataset):
    # Keep only the necessary columns and drop the rest
    necessary_columns = ['chosen', 'rejected', 'prompt']
    return dataset.remove_columns([col for col in dataset.column_names if col not in necessary_columns])

for split in formatted_datasets.keys():
    formatted_datasets[split] = prepare_dataset_for_training(formatted_datasets[split])

print("Formatted datasets ready for DPO.")

Formatted datasets ready for DPO.


## Data Prep
You will use the [UltraFeedback dataset](https://huggingface.co/datasets/HuggingFaceH4/ultrafeedback_binarized) to train the model. Because you are performing DPO, you only need the `train_prefs` and `test_prefs` splits of the dataset.

Format the splits and print the datasets:

In [12]:
# Map the dataset and transform the fields
transformed_datasets = raw_datasets.map(
    lambda example: apply_chat_template(example, tokenizer),
    remove_columns=[col for col in raw_datasets["train_prefs"].column_names if col not in ['chosen', 'rejected', 'prompt']],
    desc="Formatting prompt template",
)

Formatting prompt template:   0%|          | 0/305 [00:00<?, ? examples/s]

Formatting prompt template:   0%|          | 0/10 [00:00<?, ? examples/s]

In [13]:
print(raw_datasets["train_prefs"])
print(raw_datasets["test_prefs"])

Dataset({
    features: ['prompt', 'prompt_id', 'chosen', 'rejected', 'messages', 'score_chosen', 'score_rejected'],
    num_rows: 305
})
Dataset({
    features: ['prompt', 'prompt_id', 'chosen', 'rejected', 'messages', 'score_chosen', 'score_rejected'],
    num_rows: 10
})


In [14]:
transformed_datasets

DatasetDict({
    train_prefs: Dataset({
        features: ['prompt', 'chosen', 'rejected'],
        num_rows: 305
    })
    test_prefs: Dataset({
        features: ['prompt', 'chosen', 'rejected'],
        num_rows: 10
    })
})

In [15]:
#@title Print out an example
row = transformed_datasets["train_prefs"][10]
print("Prompt: ")
print("===" *40, "\n")
pprint.pprint(row["prompt"])
print("Chosen: ")
print("\n","===" *40,"\n")
pprint.pprint(row["chosen"])
print("Rejected: ")
print("\n","===" *40,"\n")
pprint.pprint(row["rejected"])

Prompt: 

('Q: In September 2015, Amazon announced the release of the Fire 7, priced at '
 'US $49.99 for the 8GB version that displays advertisements on the lock '
 'screen. As of March 2016 it was the lowest-priced Amazon tablet. In June '
 '2016, its price was dropped briefly to US $39.99. This fifth generation '
 'tablet includes for the first time a micro SD card slot for extra storage.\n'
 '\n'
 'Answer this question: when did the amazon fire 7 come out?\n'
 'A: September 2015\n'
 'Explain how we arrive at this answer: ')
Chosen: 


('We arrive at the answer of September 2015 by looking at the specific '
 'information provided in the question. The question states that Amazon '
 'announced the release of the Fire 7 in September 2015, and also mentions '
 'that it was the lowest-priced Amazon tablet as of March 2016. This indicates '
 'that the tablet was released in September 2015 and was available for '
 'purchase by the public at that time. The mention of the price drop in June 

We now add LoRA adapters so we only need to update 1 to 10% of all parameters!

In [16]:
#@title Add LoRA adapters
model = FastLanguageModel.get_peft_model(
    model,
    r = 64, # LoRA rank; any integer > 0. Suggested values: 8, 16, 32, 64, 128
    target_modules = ["q_proj", "k_proj", "v_proj", "o_proj",
                      "gate_proj", "up_proj", "down_proj",],
    lora_alpha = 64,
    lora_dropout = 0, # Currently only supports dropout = 0
    bias = "none",    # Currently only supports bias = "none"
    use_gradient_checkpointing = True,
    random_state = 42,
    max_seq_length = 4096,
)

Unsloth 2024.2 patched 32 layers with 32 QKV layers, 32 O layers and 32 MLP layers.
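The claim that LoRA updates only a small share of the parameters can be checked with a short calculation. The sketch below counts the adapter parameters for `r = 64` on the seven target modules. The layer dimensions (hidden size 4096, key/value width 1024, MLP width 14336, 32 layers) and the base parameter count are assumptions taken from the published Mistral-7B configuration, not values printed by this notebook.

```python
# Assumed Mistral-7B dimensions (not printed in this notebook).
hidden, kv_dim, mlp, num_layers = 4096, 1024, 14336, 32
rank = 64  # the r value passed to get_peft_model above

# (input features, output features) of each adapted projection.
shapes = {
    "q_proj": (hidden, hidden),
    "k_proj": (hidden, kv_dim),
    "v_proj": (hidden, kv_dim),
    "o_proj": (hidden, hidden),
    "gate_proj": (hidden, mlp),
    "up_proj": (hidden, mlp),
    "down_proj": (mlp, hidden),
}


def lora_params(d_in, d_out, r):
    # LoRA adds two matrices: A with shape (r, d_in) and B with shape (d_out, r).
    return r * (d_in + d_out)


per_layer = sum(lora_params(d_in, d_out, rank) for d_in, d_out in shapes.values())
total_lora_params = per_layer * num_layers

# Assumed size of the base model, used only to express the share.
base_params = 7_241_732_096
fraction = total_lora_params / base_params

print(f"{total_lora_params:,} trainable LoRA parameters")
print(f"{fraction:.1%} of the base model")
```

Under these assumptions the adapters amount to a little over 2% of the base model, which sits inside the 1 to 10% range quoted above. Halving `rank` halves the adapter size.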


# Train the model

In [17]:
training_args = TrainingArguments(
                per_device_train_batch_size = 2,
                gradient_accumulation_steps = 4,
                warmup_ratio = 0.1,
                num_train_epochs = 2,
                learning_rate = 5e-6,
                fp16 = not torch.cuda.is_bf16_supported(),
                bf16 = torch.cuda.is_bf16_supported(),
                logging_steps = 1,
                report_to = "wandb",
                optim = "adamw_8bit",
                weight_decay = 0.0,
                lr_scheduler_type = "cosine",
                seed = 42,
                output_dir = "outputs",
)


dpo_trainer = DPOTrainer(
    model=model,
    ref_model=None,  # Or choose a reference model
    args=training_args,
    beta=0.1,
    train_dataset=transformed_datasets["train_prefs"],
    eval_dataset=transformed_datasets["test_prefs"],
    tokenizer=tokenizer,
    max_length=1024,
    max_prompt_length=512,
)

# Train the model
dpo_trainer.train()




Map:   0%|          | 0/305 [00:00<?, ? examples/s]

Map:   0%|          | 0/10 [00:00<?, ? examples/s]

[34m[1mwandb[0m: Currently logged in as: [33mnicolepcx[0m. Use [1m`wandb login --relogin`[0m to force relogin


Could not estimate the number of tokens of the input, floating-point operations will not be computed


| Step | Training Loss | rewards / chosen | rewards / rejected | rewards / accuracies | rewards / margins | logps / rejected | logps / chosen | logits / rejected | logits / chosen |
|---:|---:|---:|---:|---:|---:|---:|---:|---:|---:|
| 1 | 0.6931 | 0.0 | 0.0 | 0.0 | 0.0 | -293.761353 | -460.124146 | -2.721918 | -2.86528 |
| 2 | 0.6931 | 0.0 | 0.0 | 0.0 | 0.0 | -247.199371 | -346.906921 | -3.034508 | -3.144804 |
| 3 | 0.6945 | -0.001312 | 0.001425 | 0.5 | -0.002737 | -395.738098 | -414.268738 | -3.289375 | -3.210734 |
| 4 | 0.6936 | -0.001178 | -0.000198 | 0.5 | -0.000981 | -257.499023 | -376.355896 | -2.98802 | -3.020177 |
| 5 | 0.6921 | 0.00219 | 0.000162 | 0.75 | 0.002028 | -142.915558 | -201.646729 | -3.009515 | -3.037443 |
| 6 | 0.6928 | 0.003709 | 0.003083 | 0.375 | 0.000627 | -249.648911 | -250.648834 | -3.010854 | -3.104269 |
| 7 | 0.6928 | 0.006068 | 0.005412 | 0.5 | 0.000656 | -305.182861 | -256.79007 | -2.726028 | -2.723203 |
| 8 | 0.6951 | 0.014975 | 0.018775 | 0.375 | -0.0038 | -179.152374 | -211.35379 | -2.917761 | -3.122582 |
| 9 | 0.6867 | 0.022811 | 0.009837 | 0.875 | 0.012973 | -269.008331 | -301.085388 | -2.873059 | -3.086717 |
| 10 | 0.6913 | 0.027943 | 0.023929 | 0.5 | 0.004014 | -300.224243 | -278.391083 | -3.206071 | -3.160622 |
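The first logged steps report a loss of 0.6931 with all rewards at 0.0. That value is ln 2, which is what the sigmoid form of the DPO loss gives while the policy and the reference model still assign identical log-probabilities. The following is a minimal single-example sketch of that loss, assuming the standard formulation with `beta = 0.1` as passed to the trainer. `dpo_loss` is a hypothetical helper, not the trl implementation, and the shifted log-probabilities are illustrative.

```python
import math


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Sigmoid DPO loss and implicit rewards for one preference pair."""
    # Implicit rewards: scaled log-probability change relative to the reference.
    reward_chosen = beta * (policy_chosen_logp - ref_chosen_logp)
    reward_rejected = beta * (policy_rejected_logp - ref_rejected_logp)
    margin = reward_chosen - reward_rejected
    # -log(sigmoid(margin)) written in a numerically friendlier form.
    loss = math.log1p(math.exp(-margin))
    return loss, reward_chosen, reward_rejected


# Before any update the policy equals the reference, so the margin is zero.
initial_loss, _, _ = dpo_loss(-460.124146, -293.761353, -460.124146, -293.761353)
print(round(initial_loss, 4))  # 0.6931

# If the policy raises the chosen log-probability by 1.0, the loss drops.
improved_loss, reward_chosen, reward_rejected = dpo_loss(
    -459.124146, -293.761353, -460.124146, -293.761353
)
print(improved_loss < initial_loss)  # True
```

The `rewards / margins` column is the batch mean of `reward_chosen - reward_rejected`, and `rewards / accuracies` is the share of pairs in the batch where that difference is positive.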


TrainOutput(global_step=76, training_loss=0.5269594770905218, metrics={'train_runtime': 2511.2658, 'train_samples_per_second': 0.243, 'train_steps_per_second': 0.03, 'total_flos': 0.0, 'train_loss': 0.5269594770905218, 'epoch': 1.99})
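The `global_step=76` and `epoch: 1.99` values in the training output follow from the dataset size and the batch settings. The sketch below reproduces them under two assumptions: a single GPU, and a trainer that counts only complete gradient-accumulation groups per epoch.

```python
import math

num_examples = 305        # rows in train_prefs
per_device_batch = 2      # per_device_train_batch_size
grad_accum = 4            # gradient_accumulation_steps
epochs = 2                # num_train_epochs

# Batches the dataloader yields in one epoch (the last batch is partial).
batches_per_epoch = math.ceil(num_examples / per_device_batch)

# Optimizer updates per epoch, counting only complete accumulation groups.
updates_per_epoch = max(batches_per_epoch // grad_accum, 1)

total_steps = math.ceil(epochs * updates_per_epoch)

# Fraction of the data actually seen when training stops.
reported_epoch = total_steps * grad_accum / batches_per_epoch

print(total_steps)                 # 76
print(round(reported_epoch, 2))    # 1.99
print(per_device_batch * grad_accum)  # 8 examples per optimizer update
```

Dividing the reported `train_runtime` of about 2511 seconds by 76 steps gives roughly 33 seconds per optimizer step on the Tesla T4 used here.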

In [18]:
#@title Save trained model
model.save_pretrained("lora_dpo_model")

# Generate prompt outputs with DPO model

In [19]:
# Prepare the prompt template
alpaca_template = """Write a response that completes the task from below, following the instruction.

### Instruction:
{}

### Input:
{}

### Response:
{}"""


In [20]:
# Tokenize the prompt
prompt = model_tokenizer(
    [
        alpaca_template.format(
            "What is the iconic symbol of freedom at the US east coast?",  # instruction
            "",  # input
            "",  # output
        )
    ] * 1, return_tensors="pt").to("cuda")

# Model's generation settings
generation_parameters = {
    "max_new_tokens": 256,  # Maximum number of new tokens to generate
    "use_cache": True  # Whether to use past key values for attention
}

# Generate outputs using the model and the specified generation parameters
outputs = model.generate(**prompt, **generation_parameters)

# Decode the generated outputs
decoded_outputs = model_tokenizer.batch_decode(outputs, skip_special_tokens=True)


Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.


In [21]:
#@title Get outputs from model
# Define the maximum line width for the print function
max_line_width = 80

# Cleaning and formatting the output
cleaned_outputs = []
for output in decoded_outputs:
    # Splitting the text into sections based on '\n'
    sections = output.split('\n')

    # Find the index where the actual content starts (skipping the first line)
    start_idx = 1 if len(sections) > 1 and sections[0].startswith("Write a response") else 0

    # Rejoin the relevant sections
    relevant_content = "\n".join(sections[start_idx:])

    # Remove unwanted characters and replace '###' with '\n'
    relevant_content = relevant_content.replace("###", "\n").replace("[", "").replace("]", "").replace("'", "")

    # Split the text into sections based on '\n'
    sections = relevant_content.split('\n')

    # Wrap text for each section and join them back with single newlines
    wrapped_sections = [textwrap.fill(section, width=max_line_width) for section in sections]
    formatted_output = '\n'.join(wrapped_sections)

    # Add the cleaned and formatted text to the list
    cleaned_outputs.append(formatted_output)

# Print the cleaned and formatted output
for text in cleaned_outputs:
    print(text)



 Instruction:
What is the iconic symbol of freedom at the US east coast?


 Input:



 Response:
The iconic symbol of freedom at the US east coast is the Statue of Liberty.
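If only the model's answer is needed, rather than the full reformatted prompt, the text after the last `### Response:` marker can be cut out directly. This is a minimal sketch with a hypothetical helper, `extract_response`. The sample string mimics the decoded output above instead of calling the model.

```python
def extract_response(decoded_text, marker="### Response:"):
    """Return the text after the last marker, or the whole text if it is absent."""
    _, found, tail = decoded_text.rpartition(marker)
    return tail.strip() if found else decoded_text.strip()


# Sample string shaped like the decoded generation (not produced by the model here).
sample = (
    "Write a response that completes the task from below, following the instruction.\n\n"
    "### Instruction:\n"
    "What is the iconic symbol of freedom at the US east coast?\n\n"
    "### Input:\n\n\n"
    "### Response:\n"
    "The iconic symbol of freedom at the US east coast is the Statue of Liberty."
)

answer = extract_response(sample)
print(answer)
```

In the notebook this could be applied to each entry of `decoded_outputs`. The marker default matches the `alpaca_template` defined earlier.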
