# About this notebook

Reinforcement learning from human feedback (RLHF) is a method for refining the responses of chatbot systems such as ChatGPT, Alpaca, and StableVicuna. However, RLHF is complex: it requires a reward model that reflects human preferences as a foundational element. This model can be trained specifically for the task at hand or reused from a pre-trained version created by others. The next step fine-tunes the LLM with reinforcement learning (RL) so that its policy maximizes the reward assigned by this model.

Direct preference optimization (DPO) offers a streamlined alternative: it optimizes the LLM's policy directly, bypassing the need for an explicit reward model. DPO and RLHF share the same ultimate objective, which is to align the LLM's outputs with human preferences. However, DPO simplifies the approach by incorporating human preferences directly into the optimization process without first modeling them as a separate reward function.

The essence of DPO lies in directly adjusting the language model's parameters to favor preferred responses over less desired ones, based on direct feedback. This is achieved through a constrained optimization process in which the Kullback-Leibler (KL) divergence plays a crucial role. The KL divergence measures how far the probability distribution of the LLM's responses drifts from that of a reference model (typically the supervised fine-tuned model). By keeping this divergence small while raising the likelihood of preferred responses relative to rejected ones, DPO aligns the model's outputs with human preferences, and the optimization task comes to resemble a classification problem in which each response of a pair is classified as preferred or not. The process involves:

- Supervised fine-tuning (SFT) step (same as for RLHF)
- Annotating data with preference labels (same as for RLHF)
- DPO-Step

Thus, DPO directly optimizes the language model on preference data (pairs of preferred and rejected responses), streamlining the process by eliminating the intermediate reward-modeling step required in RLHF. The graphical comparison in the figure below illustrates how DPO simplifies the alignment of LLMs with human preferences by incorporating preference feedback directly into the optimization process.
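
To make the classification view concrete, the following minimal sketch computes the DPO loss for a single preference pair in plain Python. The function name `dpo_loss`, the example log-probabilities, and the value `beta = 0.1` are illustrative assumptions, not values used in this notebook. In practice the log-probabilities come from the policy being trained and from a frozen reference model (typically the SFT model).

```python
import math

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """DPO loss for one (chosen, rejected) response pair.

    All inputs are summed log-probabilities of a full response given the prompt.
    beta controls how strongly the policy is kept close to the reference model.
    """
    # Log-ratio of policy to reference for each response
    chosen_logratio = policy_chosen_logp - ref_chosen_logp
    rejected_logratio = policy_rejected_logp - ref_rejected_logp
    # Binary-classification logit: positive when the chosen response is favored
    logit = beta * (chosen_logratio - rejected_logratio)
    # Negative log-sigmoid of the logit, written in a numerically stable form
    if logit >= 0:
        return math.log1p(math.exp(-logit))
    return -logit + math.log1p(math.exp(logit))

# Hypothetical log-probabilities (illustrative only)
loss_same = dpo_loss(-12.0, -15.0, -12.0, -15.0)    # policy equals the reference
loss_better = dpo_loss(-10.0, -18.0, -12.0, -15.0)  # policy favors the chosen response more
print(loss_same, loss_better)
```

When the policy equals the reference model, the logit is zero and the loss is `log(2)`, the same as an undecided binary classifier. The loss falls as the policy raises the chosen response's likelihood relative to the rejected one.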

In this notebook we will do the SFT-step, and in notebook `ch07_DPO.ipynb`, the DPO-step.


The code in this notebook is inspired by [The Alignment Handbook](https://github.com/huggingface/alignment-handbook) from Hugging Face for the [trl library](https://huggingface.co/docs/trl/en/index) and by the [unsloth library](https://github.com/unslothai/unsloth).

# Install unsloth

In [4]:
%%capture
import torch

# Choose the appropriate Unsloth installation based on the GPU's CUDA compute capability
def install_unsloth():
    major_version = torch.cuda.get_device_capability()[0]  # Major compute capability (e.g. 8 for Ampere)
    if major_version >= 8:
        # For newer GPUs such as Ampere and Hopper (RTX 30xx, RTX 40xx, A100, H100, L40)
        !pip install "unsloth[colab_ampere] @ git+https://github.com/unslothai/unsloth.git"
    else:
        # For older GPUs (V100, Tesla T4, RTX 20xx)
        !pip install "unsloth[colab] @ git+https://github.com/unslothai/unsloth.git"

# Install Unsloth based on the GPU's major compute capability
install_unsloth()


# Imports

In [5]:
from peft import PeftModel
from unsloth import FastLanguageModel
import wandb
from trl import SFTTrainer
from transformers import TrainingArguments
from datasets import load_dataset
import textwrap



In [6]:
model, model_tokenizer = FastLanguageModel.from_pretrained(
    model_name = "unsloth/mistral-7b-bnb-4bit",
    max_seq_length = 4096,
    dtype = None, # Auto-detect the data type
    load_in_4bit = True # Use 4bit quantization to reduce memory usage.
)

==((====))==  Unsloth: Fast Mistral patching release 2024.2
   \\   /|    GPU: Tesla T4. Max memory: 14.748 GB. Platform = Linux.
O^O/ \_/ \    Pytorch: 2.1.0+cu121. CUDA = 7.5. CUDA Toolkit = 12.1.
\        /    Bfloat16 = FALSE. Xformers = 0.0.22.post7. FA = False.
 "-____-"     Apache 2 free license: http://github.com/unslothai/unsloth


You passed `quantization_config` to `from_pretrained` but the model you're loading already has a `quantization_config` attribute. The `quantization_config` attribute will be overwritten with the one you passed to `from_pretrained`.


Here, we use a so-called LoRA adapter, so that we only need to update a small fraction of all parameters.

In [7]:
model = FastLanguageModel.get_peft_model(
    model,
    r = 64,
    target_modules = ["q_proj", "k_proj", "v_proj", "o_proj",
                      "gate_proj", "up_proj", "down_proj",],
    lora_alpha = 64,
    lora_dropout = 0,
    bias = "none",
    use_gradient_checkpointing = True,
    random_state = 42,
    max_seq_length = 4096,
)

Unsloth 2024.2 patched 32 layers with 32 QKV layers, 32 O layers and 32 MLP layers.
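
To get a feel for how small the trainable fraction is, the sketch below counts the parameters that LoRA adds for `r = 64` across the seven target modules. The layer shapes (hidden size 4096, key/value projection size 1024, MLP size 14336, 32 layers) are assumed values for Mistral-7B and are not printed anywhere in this notebook. The helper `lora_param_count` is hypothetical.

```python
def lora_param_count(d_in, d_out, r):
    # LoRA adds two low-rank matrices: A (r x d_in) and B (d_out x r)
    return r * (d_in + d_out)

# Assumed Mistral-7B shapes (not taken from this notebook's output)
hidden, kv_dim, mlp_dim, n_layers, r = 4096, 1024, 14336, 32, 64

# (d_in, d_out) of each targeted projection in one transformer layer
shapes = {
    "q_proj": (hidden, hidden),
    "k_proj": (hidden, kv_dim),
    "v_proj": (hidden, kv_dim),
    "o_proj": (hidden, hidden),
    "gate_proj": (hidden, mlp_dim),
    "up_proj": (hidden, mlp_dim),
    "down_proj": (mlp_dim, hidden),
}

lora_per_layer = sum(lora_param_count(i, o, r) for i, o in shapes.values())
full_per_layer = sum(i * o for i, o in shapes.values())
lora_total = lora_per_layer * n_layers
full_total = full_per_layer * n_layers

print(f"LoRA parameters: {lora_total:,}")
print(f"Share of the adapted weights: {lora_total / full_total:.2%}")
```

Under these assumptions the adapter amounts to a few percent of the weights it modifies, while the frozen base model stays in 4-bit precision.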


In [8]:
#@title Alpaca dataset preparation code
alpaca_template = """Write a response that completes the task from below, following the instruction.

### Instruction:
{}

### Input:
{}

### Response:
{}"""

def prepare_prompts(data):
    texts = [alpaca_template.format(inst, inp, out) for inst, inp, out in zip(data["instruction"], data["input"], data["output"])]
    return {"text": texts}


# Loading and formatting the dataset
alpaca_dataset = load_dataset("yahma/alpaca-cleaned", split="train")
formatted_dataset = alpaca_dataset.map(prepare_prompts, batched=True)


Map:   0%|          | 0/51760 [00:00<?, ? examples/s]
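
As an illustration of what `prepare_prompts` produces, the sketch below renders the template for one made-up record. The record is hypothetical (not taken from `yahma/alpaca-cleaned`), and the template is repeated so the block runs on its own. The optional `eos_token` argument is an addition for illustration: many SFT recipes append the tokenizer's end-of-sequence token so the model learns where to stop, and the string `"</s>"` stands in for `model_tokenizer.eos_token` here as an assumption.

```python
alpaca_template = """Write a response that completes the task from below, following the instruction.

### Instruction:
{}

### Input:
{}

### Response:
{}"""

def prepare_prompts(data, eos_token=""):
    # Build one training text per record; optionally append an end-of-sequence token
    texts = [
        alpaca_template.format(inst, inp, out) + eos_token
        for inst, inp, out in zip(data["instruction"], data["input"], data["output"])
    ]
    return {"text": texts}

# Hypothetical batch with a single record, in the column layout that map(batched=True) passes in
batch = {
    "instruction": ["Translate to French."],
    "input": ["Good morning"],
    "output": ["Bonjour"],
}

rendered = prepare_prompts(batch, eos_token="</s>")["text"][0]
print(rendered)
```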

### Train the model
Here we use Hugging Face TRL's [`SFTTrainer`](https://huggingface.co/docs/trl/sft_trainer) to train the model.

In [9]:
wandb.login()

wandb: Logging into wandb.ai. (Learn how to deploy a W&B server locally: https://wandb.me/wandb-server)
wandb: You can find your API key in your browser here: https://wandb.ai/authorize
wandb: Paste an API key from your profile and hit enter, or press ctrl+c to quit:
wandb: Appending key for api.wandb.ai to your netrc file: /root/.netrc


True

In [10]:
training_args = TrainingArguments(
    per_device_train_batch_size = 2,
    gradient_accumulation_steps = 4,
    warmup_steps = 5,
    max_steps = 60,
    learning_rate = 2e-4,
    fp16 = not torch.cuda.is_bf16_supported(),
    bf16 = torch.cuda.is_bf16_supported(),
    logging_steps = 1,
    report_to = "wandb",
    optim = "adamw_8bit",
    weight_decay = 0.01,
    lr_scheduler_type = "cosine",
    seed = 42,
    output_dir = "outputs"
)

sft_trainer = SFTTrainer(
    model = model,
    train_dataset = formatted_dataset,
    dataset_text_field = "text",
    max_seq_length = 4096,
    args = training_args
)
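
The `warmup_steps`, `max_steps`, `learning_rate`, and `lr_scheduler_type = "cosine"` settings together define the learning-rate curve. The sketch below re-implements a linear warmup followed by cosine decay in plain Python as an approximation of what the trainer does. The function name `lr_at_step` is hypothetical, and the exact values used by the library may differ slightly.

```python
import math

def lr_at_step(step, base_lr=2e-4, warmup_steps=5, max_steps=60):
    # Linear warmup from 0 to base_lr over the first warmup_steps steps
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    # Cosine decay from base_lr to 0 over the remaining steps
    progress = (step - warmup_steps) / max(1, max_steps - warmup_steps)
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * progress))

# Learning rate at every optimizer step of the run
schedule = [lr_at_step(s) for s in range(61)]
for s in (0, 5, 30, 60):
    print(s, schedule[s])
```

With only 60 steps, the rate peaks at step 5 and then decays for the rest of the run, so most updates happen well below the nominal `2e-4`.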

In [15]:
#@title Train the model

sft_trainer.train()

Step,Training Loss
1,0.8861
2,0.7546
3,0.7592
4,0.7633
5,0.922
6,0.7884
7,0.7951
8,0.8905
9,0.8853
10,0.611


TrainOutput(global_step=60, training_loss=0.8482993344465891, metrics={'train_runtime': 512.8674, 'train_samples_per_second': 0.936, 'train_steps_per_second': 0.117, 'total_flos': 6046368357433344.0, 'train_loss': 0.8482993344465891, 'epoch': 0.01})
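
The reported `'epoch': 0.01` follows from the batch settings. The short calculation below reproduces it, along with the throughput figures, from the values in `TrainingArguments` and `TrainOutput`. It assumes a single GPU, and the variable names are chosen for illustration.

```python
per_device_train_batch_size = 2
gradient_accumulation_steps = 4
max_steps = 60
num_gpus = 1              # assumption: training runs on a single GPU
dataset_size = 51760      # examples in the training split used in this run
train_runtime = 512.8674  # seconds, from TrainOutput

# Number of examples that contribute to one optimizer step
effective_batch_size = per_device_train_batch_size * gradient_accumulation_steps * num_gpus
# Total examples processed over the whole run
samples_seen = effective_batch_size * max_steps

epochs = samples_seen / dataset_size
samples_per_second = samples_seen / train_runtime
steps_per_second = max_steps / train_runtime

print(effective_batch_size, samples_seen)
print(round(epochs, 2), round(samples_per_second, 3), round(steps_per_second, 3))
```

The run therefore sees only 480 of the 51,760 examples, which is enough for a quick demonstration but far from a full pass over the data.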

<a name="Inference"></a>
### Inference
Let's run the model! You can change the instruction and input - leave the output blank!

In [12]:
# Prepare the prompt
prompt = model_tokenizer(
    [
        alpaca_template.format(
            "What is the iconic symbol of freedom at the US east coast?",  # instruction
            "",  # input
            "",  # output
        )
    ] * 1, return_tensors="pt").to("cuda")

# Model's generation settings
generation_parameters = {
    "max_new_tokens": 256,  # Maximum number of new tokens to generate
    "use_cache": True  # Whether to use past key values for attention
}

# Generate outputs using the model and the specified generation parameters
outputs = model.generate(**prompt, **generation_parameters)

# Decode the generated outputs
decoded_outputs = model_tokenizer.batch_decode(outputs, skip_special_tokens=True)


Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.


In [13]:
# Define the maximum line width (number of characters per line)
max_line_width = 80

# Cleaning and formatting the output
cleaned_outputs = []
for output in decoded_outputs:
    # Splitting the text into sections based on '\n'
    sections = output.split('\n')

    # Find the index where the actual content starts (skipping the first line)
    start_idx = 1 if len(sections) > 1 and sections[0].startswith("Write a response") else 0

    # Rejoin the relevant sections
    relevant_content = "\n".join(sections[start_idx:])

    # Remove unwanted characters and replace '###' with '\n'
    relevant_content = relevant_content.replace("###", "\n").replace("[", "").replace("]", "").replace("'", "")

    # Split the text into sections based on '\n'
    sections = relevant_content.split('\n')

    # Wrap the text of each section and join the sections back with newlines
    wrapped_sections = [textwrap.fill(section, width=max_line_width) for section in sections]
    formatted_output = '\n'.join(wrapped_sections)

    # Add the cleaned and formatted text to the list
    cleaned_outputs.append(formatted_output)

# Print the cleaned and formatted output
for text in cleaned_outputs:
    print(text)



 Instruction:
What is the iconic symbol of freedom at the US east coast?


 Input:



 Response:
The Statue of Liberty is the iconic symbol of freedom at the US east coast.
Located on Liberty Island in New York Harbor, the statue was a gift from France
to the United States in 1886 to celebrate the centennial of American
independence. The statue, designed by French sculptor Frédéric Auguste
Bartholdi, represents the Roman goddess of freedom, Libertas, holding a torch in
her right hand and a tablet in her left hand, inscribed with the date of the
Declaration of Independence. The statue has become a symbol of freedom,
democracy, and immigration, and millions of visitors from around the world visit
the statue each year.
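
The cleanup cell above keeps the instruction and input headers in the printed text. If only the model's answer is needed, a simpler alternative is to split on the `### Response:` marker from `alpaca_template`, sketched here under the assumption that the decoded text still contains that marker. The helper name `extract_response` and the shortened example text are hypothetical.

```python
def extract_response(decoded_text, marker="### Response:"):
    # Keep only the text after the last response marker; return the whole text if it is absent
    _, found, tail = decoded_text.rpartition(marker)
    return tail.strip() if found else decoded_text.strip()

# Hypothetical decoded output in the Alpaca layout (shortened for illustration)
example = (
    "Write a response that completes the task from below, following the instruction.\n\n"
    "### Instruction:\nWhat is the iconic symbol of freedom at the US east coast?\n\n"
    "### Input:\n\n\n"
    "### Response:\nThe Statue of Liberty."
)

answer = extract_response(example)
print(answer)
```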


### Saving and loading fine-tuned models
To save your final model, you can either use Hugging Face's `push_to_hub` for an online save or `save_pretrained` for a local save.

In [14]:
model.save_pretrained("sft_model_with_lora")  # Save the LoRA adapter locally
# model.push_to_hub("your_name/lora_sft_model")  # Or push it to the Hugging Face Hub
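
To load the saved adapter again, one possible approach is sketched below: reload the 4-bit base model as at the start of this notebook and attach the LoRA weights with `PeftModel.from_pretrained`. The wrapper function `load_sft_model` is hypothetical, and the sketch assumes the adapter directory written by the cell above as well as a CUDA-capable environment with `unsloth` and `peft` installed.

```python
def load_sft_model(adapter_dir="sft_model_with_lora",
                   base_model_name="unsloth/mistral-7b-bnb-4bit",
                   max_seq_length=4096):
    # Imports are kept inside the function because they need a CUDA-capable environment
    from unsloth import FastLanguageModel
    from peft import PeftModel

    # Reload the 4-bit base model with the same settings used for training
    base_model, tokenizer = FastLanguageModel.from_pretrained(
        model_name=base_model_name,
        max_seq_length=max_seq_length,
        dtype=None,
        load_in_4bit=True,
    )
    # Attach the LoRA weights written by save_pretrained
    model = PeftModel.from_pretrained(base_model, adapter_dir)
    return model, tokenizer

# Example call (commented out because it downloads the base model and needs a GPU):
# model, model_tokenizer = load_sft_model()
```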