# Chapter 12 - Fine Tuning Generation Models

- Supervised Fine Tuning (SFT)
- Preference Tuning

```
<img src="imgs/nertags.png" alt="Patching" width="500" height="200">
```

# The Three LLM Training Steps: Pretraining, Supervised Fine-Tuning, and Preference Tuning

### 1. Language modeling

The first step in creating a high-quality LLM is to pretrain it on one or more massive text datasets.

During this step, the model learns to predict the next token in a sequence.

This produces a base model, also commonly referred to as a pretrained or foundation model.

### 2. Fine-tuning 1 (supervised fine-tuning)

LLMs are more useful if they respond well to instructions.

With supervised fine-tuning (SFT), we can adapt the base model to follow instructions.

The parameters of the base model are updated to be more in line with our target task.

Like pretraining, SFT is **trained using next-token prediction, but instead of simply continuing arbitrary text, the model predicts the tokens of a response conditioned on a user input**.

SFT can also be used for other tasks, like classification, but is often used to go from a base generative model to an instruction (or chat) generative model.

### 3. Fine-tuning 2 (preference tuning)

The final step further improves the quality of the model and aligns it more closely with expected behavior, such as AI safety guidelines or human preferences.

Preference tuning is a form of fine-tuning and, as the name implies, aligns the output of the model to our preferences, which are defined by the data that we give it. 


**In this chapter, we use a base model that was already trained on massive datasets and explore how we can fine-tune it using both fine-tuning strategies**

# Supervised Fine-Tuning (SFT)

- The most common fine-tuning process is full fine-tuning. Like pretraining, it updates all parameters of the model. The main difference is that we now use a smaller, labeled dataset, whereas pretraining was done on a large dataset without any labels.

<img src="imgs/sft.png" alt="Patching" width="500" height="200">

 To make our LLM follow instructions, we will need **question-response data**.

 - During full fine-tuning, the model takes the input (instructions) and applies next-token prediction on the output (response) 
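A minimal sketch of this idea, assuming the loss is averaged only over the response tokens (a common choice, though libraries differ in their defaults). The token probabilities below are hypothetical values chosen for illustration, not the output of a real model:

```python
import math

def response_only_loss(token_probs, is_response):
    """Mean negative log-likelihood over the response tokens only.

    token_probs: probability the model assigned to each correct next token.
    is_response: True for response tokens, False for instruction tokens.
    """
    losses = [-math.log(p) for p, keep in zip(token_probs, is_response) if keep]
    return sum(losses) / len(losses)

# Hypothetical 6-token sequence: 3 instruction tokens followed by 3 response tokens
token_probs = [0.10, 0.20, 0.05, 0.50, 0.25, 0.80]
is_response = [False, False, False, True, True, True]

loss = response_only_loss(token_probs, is_response)
print(f"Loss on response tokens: {loss:.4f}")
```

The instruction tokens still serve as context for the predictions, but in this sketch they do not contribute to the loss.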


## Parameter Efficient Fine-Tuning (PEFT)

Updating all parameters of a model has great potential to improve its performance but comes with several disadvantages:
- costly
- slow training
- high storage


### Adapters

Adapters are a core component of many PEFT-based techniques. The method proposes a set of additional modular components inside the Transformer that can be fine-tuned to improve the model’s performance on a specific task without having to fine-tune all the model weights. 

The paper that introduced adapters showed that fine-tuning 3.6% of the parameters of BERT for a task can yield performance comparable to fine-tuning all of the model’s weights.

<img src="imgs/adapters.png" alt="Patching" width="300" height="200">

 Adapters add a small number of weights in certain places in the network that can be fine-tuned efficiently while leaving the majority of model weights frozen.

 **These Adapters have to be placed in every Transformer block**

 Specialized adapters can be found on [AdapterHub](https://adapterhub.ml/).

### Low-Rank Adaptation (LoRA)

As an alternative to adapters, low-rank adaptation (LoRA) was introduced and is, at the time of writing, a widely used and effective technique for PEFT.

LoRA is a technique that (like adapters) only requires updating a small set of parameters.

<img src="imgs/lora.png" alt="Patching" width="300" height="200">

- Like adapters, this subset allows for much quicker fine-tuning since we only need to update a small part of the base model. 

-  We create this subset of parameters by **approximating large matrices that accompany the original LLM with smaller matrices**. 

<img src="imgs/lorarank.png" alt="Patching" width="300" height="200">

During training, we only need to update these smaller matrices instead of the full weight matrix. The updated change matrices (the smaller matrices) are then combined with the full (frozen) weights.
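A toy sketch of this combination, using plain Python lists and made-up values (the matrices `W`, `B`, and `A` are illustrative, not taken from any model). It shows that applying the small matrices alongside the frozen weights gives the same output as first merging them into the weights. Real LoRA implementations additionally scale the update by `lora_alpha / r`, which is omitted here:

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def add(a, b):
    """Element-wise sum of two matrices of the same shape."""
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]

# Frozen 4x4 weight matrix (toy values)
W = [[1.0, 0.0, 2.0, 0.0],
     [0.0, 1.0, 0.0, 2.0],
     [3.0, 0.0, 1.0, 0.0],
     [0.0, 3.0, 0.0, 1.0]]

# Rank-1 factors: B is 4x1 and A is 1x4, so B @ A is a 4x4 update
# described by only 8 trainable numbers instead of 16
B = [[0.5], [0.0], [-0.5], [1.0]]
A = [[1.0, 2.0, 0.0, -1.0]]

x = [[1.0], [2.0], [3.0], [4.0]]  # input as a column vector

# Option 1: keep the adapter separate and add its contribution to the output
y_adapter = add(matmul(W, x), matmul(B, matmul(A, x)))

# Option 2: merge the low-rank update into the frozen weights first
W_merged = add(W, matmul(B, A))
y_merged = matmul(W_merged, x)

print(y_adapter)
print(y_merged)
```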

<img src="imgs/loratune.png" alt="Patching" width="300" height="200">

Papers like “Intrinsic dimensionality explains the effectiveness of language model fine-tuning” demonstrate that language models “have a very low intrinsic dimension.”

This means that we can find small ranks that approximate even the massive matrices of an LLM. 

A 175B model like GPT-3, for example, would have a weight matrix of **12,288 × 12,288** inside each of its 96 Transformer blocks. **That’s 150 million parameters.**

 If we can successfully adapt that matrix using rank 8, that would only require two 12,288 × 8 matrices, resulting in roughly 197K parameters per block.

In [3]:
pct = (197 / 150_000) * 100
print(f'In LoRA we would only train {pct:.2f}% of the total weights')

In LoRA we would only train 0.13% of the total weights
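The rounded figures above can be reproduced exactly. The sketch below assumes a square 12,288 × 12,288 matrix and a rank of 8, so each of the two small matrices holds 12,288 × 8 values:

```python
d = 12_288  # width and height of the square weight matrix
r = 8       # LoRA rank

full_params = d * d      # parameters in the full matrix
lora_params = 2 * d * r  # two d x r matrices
pct = lora_params / full_params * 100

print(f"Full matrix:   {full_params:,}")
print(f"LoRA (rank {r}): {lora_params:,}")
print(f"Share trained: {pct:.2f}%")
```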


**This smaller representation is quite flexible in that you can select which parts of the base model to fine-tune.**  For instance, we can fine-tune only the Query and Value weight matrices in each Transformer layer.

#### Quantization: Compressing the model for (more) efficient training

We can make LoRA even more efficient by **reducing the memory requirements of the model’s original weights** before projecting them into smaller matrices. 

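The weights of an LLM are numeric values with a given precision, which is determined by the number of bits used to store each value. More bits allow a wider range of values to be represented more precisely.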

<img src="imgs/floats.png" alt="Patching" width="600" height="200">

However, if we lower the number of bits we also lower the memory requirements of that model.

**With quantization, we aim to lower the number of bits while still accurately representing the original weight values.**

The risk is that weights that are close to one another may be quantized to the same value and therefore reconstruct to the same weight, which removes any difference between them.
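A small sketch of this effect, assuming a simple uniform (absmax) quantizer with 4-bit signed integers. This is a simplification chosen for illustration, not the scheme used by QLoRA, and the weight values are made up:

```python
def quantize(values, bits):
    """Uniform absmax quantization: map floats to signed integer codes."""
    levels = 2 ** (bits - 1) - 1  # 7 positive levels for 4 bits
    scale = max(abs(v) for v in values) / levels
    return [round(v / scale) for v in values], scale

def dequantize(codes, scale):
    """Reconstruct approximate floats from the integer codes."""
    return [c * scale for c in codes]

weights = [0.70, 0.31, 0.33, -0.12, -0.69]

codes, scale = quantize(weights, bits=4)
restored = dequantize(codes, scale)

print("codes:   ", codes)
print("restored:", [round(v, 2) for v in restored])
```

Here 0.31 and 0.33 receive the same code, so after reconstruction they can no longer be told apart.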

<img src="imgs/reconstructweight.png" alt="Patching" width="600" height="200">

### QLoRA

The authors of QLoRA, a quantized version of LoRA, found a way to go from a higher number of bits to a lower number and back again without deviating too much from the original weights.

They used **blockwise quantization** to map certain blocks of higher precision values to lower precision values. 

Instead of directly mapping higher precision to lower precision values, **additional blocks are created that allow for quantizing similar weights**.

<img src="imgs/qlora.png" alt="Patching" width="600" height="200">

A nice property of neural networks is that their weights are generally normally distributed, with most values falling between –1 and 1.

**This property allows us to bin the original weights to lower bits based on their relative density**

The mapping between weights is more efficient as it takes into account the relative frequency of weights. This also reduces issues with outliers.

**Using distribution-aware blocks we can prevent values close to one another from being represented with the same quantized value.**
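To illustrate why density-based bins help, the sketch below compares equal-width bins with quantile-based bins on synthetic, normally distributed weights. It is a simplified illustration of the idea, not the NF4 data type itself, and the standard deviation of 0.3 is an arbitrary assumption:

```python
import random
from bisect import bisect_right

random.seed(0)
weights = sorted(random.gauss(0.0, 0.3) for _ in range(10_000))

n_bins = 16  # the number of values representable with 4 bits

# Equal-width bins between the smallest and largest weight
lo, hi = weights[0], weights[-1]
width = (hi - lo) / n_bins
uniform_counts = [0] * n_bins
for w in weights:
    idx = min(int((w - lo) / width), n_bins - 1)
    uniform_counts[idx] += 1

# Quantile bins: boundaries chosen so each bin holds the same share of weights
edges = [weights[i * len(weights) // n_bins] for i in range(1, n_bins)]
quantile_counts = [0] * n_bins
for w in weights:
    quantile_counts[bisect_right(edges, w)] += 1

print("equal-width:", uniform_counts)
print("quantile:   ", quantile_counts)
```

With equal-width bins, most weights crowd into a few central bins while the outer bins are nearly empty. With quantile bins, every bin is used equally, so dense regions get finer resolution.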

<img src="imgs/qloradist.png" alt="Patching" width="500" height="200">


**As a result, we can go from a 16-bit float representation to a measly 4-bit normalized float representation.**

*Note that the quantization of LLMs in general is also helpful for inference as quantized LLMs are smaller in size and therefore require less VRAM.*
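As a rough back-of-the-envelope check, the memory needed for the weights alone scales linearly with the number of bits. The sketch below assumes a model of about 1.1 billion parameters and ignores everything else (quantization constants, activations, optimizer state):

```python
def weight_memory_gb(n_params, bits):
    """Approximate memory for the weights alone, in gigabytes (1 GB = 1e9 bytes)."""
    return n_params * bits / 8 / 1e9

n_params = 1.1e9  # roughly a 1.1B-parameter model

for bits in (32, 16, 8, 4):
    print(f"{bits:>2}-bit: {weight_memory_gb(n_params, bits):.2f} GB")
```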

## Instruction Tuning with QLoRA

In this section, we will fine-tune a completely open source and smaller version of Llama, TinyLlama, to follow instructions using the QLoRA procedure.

Consider this model a base or pretrained model, one that was trained with language modeling but cannot yet follow instructions.

### 1. Prepare Instruction Dataset (UltraChat dataset)

**Chat Template**: To have the LLM follow instructions, we will need to prepare instruction data that follows a chat template.

<img src="imgs/chattemplate.png" alt="Patching" width="500" height="200">

**UltraChat dataset**: this dataset is a filtered version of the original UltraChat dataset that contains almost 200k conversations between a user and an LLM.

In [4]:
from transformers import AutoTokenizer
from datasets import load_dataset

In [5]:
# Load a tokenizer to use its chat template
template_tokenizer = AutoTokenizer.from_pretrained("TinyLlama/TinyLlama-1.1B-Chat-v1.0")

In [6]:
def format_prompt(example):
    """Format a conversation using the <|user|> chat template that TinyLlama uses."""

    # Format answers
    chat = example["messages"]
    prompt = template_tokenizer.apply_chat_template(chat, tokenize=False)

    return {"text": prompt}

In [7]:
# Load and format the data using the template TinyLLama is using
dataset = (
    load_dataset("HuggingFaceH4/ultrachat_200k",  split="test_sft")
      .shuffle(seed=42)
      .select(range(3_000))
)

In [8]:
dataset = dataset.map(format_prompt)

In [9]:
# Example of formatted prompt
print(dataset["text"][2576])

<|user|>
Given the text: Knock, knock. Who’s there? Hike.
Can you continue the joke based on the given text material "Knock, knock. Who’s there? Hike"?</s>
<|assistant|>
Sure! Knock, knock. Who's there? Hike. Hike who? Hike up your pants, it's cold outside!</s>
<|user|>
Can you tell me another knock-knock joke based on the same text material "Knock, knock. Who's there? Hike"?</s>
<|assistant|>
Of course! Knock, knock. Who's there? Hike. Hike who? Hike your way over here and let's go for a walk!</s>



In [11]:
print(dataset["prompt"][2576])

Given the text: Knock, knock. Who’s there? Hike.
Can you continue the joke based on the given text material "Knock, knock. Who’s there? Hike"?


In [13]:
for dd in dataset["messages"][2576]:
    print(dd)

{'content': 'Given the text: Knock, knock. Who’s there? Hike.\nCan you continue the joke based on the given text material "Knock, knock. Who’s there? Hike"?', 'role': 'user'}
{'content': "Sure! Knock, knock. Who's there? Hike. Hike who? Hike up your pants, it's cold outside!", 'role': 'assistant'}
{'content': 'Can you tell me another knock-knock joke based on the same text material "Knock, knock. Who\'s there? Hike"?', 'role': 'user'}
{'content': "Of course! Knock, knock. Who's there? Hike. Hike who? Hike your way over here and let's go for a walk!", 'role': 'assistant'}
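The relationship between the `messages` list and the formatted `text` can be mimicked by hand. The helper below (`to_chat_text`, a hypothetical name) only reproduces the layout visible in the printed example above. The tokenizer's real chat template may handle other cases, such as system messages, differently, so in practice `apply_chat_template` remains the safer choice:

```python
def to_chat_text(messages):
    """Render a list of {'role', 'content'} dicts in the <|role|> ... </s> layout."""
    return "".join(f"<|{m['role']}|>\n{m['content']}</s>\n" for m in messages)

# A shortened, made-up conversation in the same structure as the dataset
messages = [
    {"role": "user", "content": "Knock, knock."},
    {"role": "assistant", "content": "Who's there?"},
]

print(to_chat_text(messages))
```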


### 2. Model Quantization

Now that we have our data, we can start loading in our model. This is where we apply the Q in QLoRA, namely quantization. We use the bitsandbytes package to compress the pretrained model to a 4-bit representation.

In BitsAndBytesConfig, you can define the quantization scheme. We follow the steps used in the original QLoRA paper and load the model in 4-bit (load_in_4bit) with a normalized float representation (bnb_4bit_quant_type) and double quantization (bnb_4bit_use_double_quant):

In [14]:
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

In [15]:
model_name = "TinyLlama/TinyLlama-1.1B-intermediate-step-1431k-3T"

# 4-bit quantization configuration - Q in QLoRA
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,  # Use 4-bit precision model loading
    bnb_4bit_quant_type="nf4",  # Quantization type
    bnb_4bit_compute_dtype="float16",  # Compute dtype
    bnb_4bit_use_double_quant=True,  # Apply nested quantization
)


In [16]:
# Load the model to train on the GPU
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    device_map="auto",

    # Leave this out for regular SFT
    quantization_config=bnb_config,
)

In [17]:
model.config.use_cache = False
model.config.pretraining_tp = 1

# Load LLaMA tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
tokenizer.pad_token = "<PAD>"
tokenizer.padding_side = "left"


In [20]:
model.is_quantized

True

### 3. LoRA Configuration


In [21]:
from peft import LoraConfig, prepare_model_for_kbit_training, get_peft_model


In [22]:
# Prepare LoRA Configuration
peft_config = LoraConfig(
    lora_alpha=32,  # LoRA Scaling
    lora_dropout=0.1,  # Dropout for LoRA Layers
    r=64,  # Rank
    bias="none",
    task_type="CAUSAL_LM",
    target_modules=  # Layers to target
     ['k_proj', 'gate_proj', 'v_proj', 'up_proj', 'q_proj', 'o_proj', 'down_proj']
)


- r: the rank of the compressed matrices (recall this from Figure 12-13). Increasing this value increases the size of the compressed matrices, leading to less compression and thereby more representational power. Values typically range between 4 and 64.
- lora_alpha: controls the amount of change that is added to the original weights. In essence, it balances the knowledge of the original model with that of the new task. A rule of thumb is to choose a value twice the size of r; note that the configuration above instead uses r=64 with lora_alpha=32.
- target_modules: controls which layers to target. The LoRA procedure can choose to ignore specific layers, like specific projection layers. Targeting fewer layers can speed up training but reduce performance, and vice versa.
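To get a feel for how `r` and `target_modules` affect the number of trainable parameters, the sketch below counts them by hand. The layer shapes are assumptions: only the `q_proj` shape and the 22 decoder layers appear in the model printout later in this chapter, and the other shapes are filled in for illustration:

```python
# (in_features, out_features) per projection layer.
# Only q_proj is confirmed by the printed model; the rest are assumed.
SHAPES = {
    "q_proj": (2048, 2048),
    "k_proj": (2048, 256),
    "v_proj": (2048, 256),
    "o_proj": (2048, 2048),
    "gate_proj": (2048, 5632),
    "up_proj": (2048, 5632),
    "down_proj": (5632, 2048),
}
N_LAYERS = 22

def lora_param_count(r, target_modules):
    """Each targeted layer gains A (r x in) and B (out x r): r * (in + out) values."""
    per_block = sum(r * sum(SHAPES[name]) for name in target_modules)
    return per_block * N_LAYERS

all_modules = list(SHAPES)
print(f"r=64, all modules:   {lora_param_count(64, all_modules):,}")
print(f"r=8,  all modules:   {lora_param_count(8, all_modules):,}")
print(f"r=64, q_proj/v_proj: {lora_param_count(64, ['q_proj', 'v_proj']):,}")
```

Under these assumptions, both lowering the rank and targeting fewer layers shrink the set of trainable parameters considerably.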

In [24]:
# prepare model for training
model = prepare_model_for_kbit_training(model)
model = get_peft_model(model, peft_config)

### 4. Training Configuration

In [25]:
from transformers import TrainingArguments

In [26]:
output_dir = "./training_results"

# Training arguments
training_arguments = TrainingArguments(
    output_dir=output_dir,
    per_device_train_batch_size=2,
    gradient_accumulation_steps=4,
    optim="paged_adamw_32bit",
    learning_rate=2e-4,
    lr_scheduler_type="cosine",
    num_train_epochs=1,
    logging_steps=10,
    fp16=True,
    gradient_checkpointing=True
)

- **num_train_epochs**: The total number of passes over the training data. Higher values increase the risk of overfitting, so we generally like to keep this low.
- **learning_rate**: Determines the step size at each iteration of weight updates. The authors of QLoRA reported using a lower learning rate for their largest models (33B parameters and up).
- **lr_scheduler_type**: A cosine-based scheduler to adjust the learning rate dynamically. If warmup steps are configured, the learning rate first increases linearly from zero until it reaches the set value; after that, it decays following a cosine curve. No warmup is set here, so the decay starts right away.
- **optim**: The paged optimizers used in the original QLoRA paper.
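The cosine schedule can be sketched in a few lines. This is a hand-written approximation rather than the library's implementation. The function name `cosine_lr` is hypothetical, and the step count assumes 3,000 examples, a batch size of 2, and 4 gradient accumulation steps on a single device:

```python
import math

def cosine_lr(step, total_steps, base_lr, warmup_steps=0):
    """Learning rate after `step` updates: optional linear warmup, then cosine decay."""
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * progress))

# 3,000 examples / (batch size 2 * 4 accumulation steps) = 375 optimizer updates
total_steps = 3_000 // (2 * 4)

for step in (10, 20, 30):
    print(step, cosine_lr(step, total_steps, base_lr=2e-4))
```

Without warmup, these values should line up closely with the `learning_rate` entries in the training log of the next section.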

### 5. Training

In [27]:
from trl import SFTTrainer

In [28]:
# Set supervised fine-tuning parameters
trainer = SFTTrainer(
    model=model,
    train_dataset=dataset,
    dataset_text_field="text",
    tokenizer=tokenizer,
    args=training_arguments,
    max_seq_length=512,

    # Leave this out for regular SFT
    peft_config=peft_config,
)


Deprecated positional argument(s) used in SFTTrainer, please use the SFTConfig to set these arguments instead.


Map:   0%|          | 0/3000 [00:00<?, ? examples/s]



In [29]:
# Train model
trainer.train()

{'loss': 1.669, 'grad_norm': 0.27586522698402405, 'learning_rate': 0.00019964928592495045, 'epoch': 0.03}
{'loss': 1.4762, 'grad_norm': 0.263204962015152, 'learning_rate': 0.0001985996037070505, 'epoch': 0.05}
{'loss': 1.4508, 'grad_norm': 0.1895277500152588, 'learning_rate': 0.0001968583161128631, 'epoch': 0.08}
{'loss': 1.488, 'grad_norm': 0.19641973078250885, 'learning_rate': 0.00019443763702374812, 'epoch': 0.11}
{'loss': 1.4785, 'grad_norm': 0.19293968379497528, 'learning_rate': 0.0001913545457642601, 'epoch': 0.13}
{'loss': 1.3908, 'grad_norm': 0.20264646410942078, 'learning_rate': 0.00018763066800438636, 'epoch': 0.16}
{'loss': 1.4949, 'grad_norm': 0.22744004428386688, 'learning_rate': 0.00018329212407100994, 'epoch': 0.19}
{'loss': 1.4503, 'grad_norm': 0.19911472499370575, 'learning_rate': 0.000178369345732584, 'epoch': 0.21}
{'loss': 1.4273, 'grad_norm': 0.20271213352680206, 'learning_rate': 0.00017289686274214118, 'epoch': 0.24}
{'loss': 1.4047, 'grad_norm': 0.229094922542572

TrainOutput(global_step=375, training_loss=1.4168811581929526, metrics={'train_runtime': 232.0384, 'train_samples_per_second': 12.929, 'train_steps_per_second': 1.616, 'total_flos': 9994755938844672.0, 'train_loss': 1.4168811581929526, 'epoch': 1.0})

In [30]:
# Save QLoRA weights
trainer.model.save_pretrained("TinyLlama-1.1B-qlora")

### 6. Merge Weights

After we have trained our QLoRA weights, we still need to combine them with the original weights to use them. 

**We reload the model in 16 bits, instead of the quantized 4 bits, to merge the weights.**

Although the tokenizer was not updated during training, we save it to the same folder as the model for easier access:

In [31]:
from peft import AutoPeftModelForCausalLM

model = AutoPeftModelForCausalLM.from_pretrained(
    "TinyLlama-1.1B-qlora",
    low_cpu_mem_usage=True,
    device_map="auto",
)

In [32]:
model

PeftModelForCausalLM(
  (base_model): LoraModel(
    (model): LlamaForCausalLM(
      (model): LlamaModel(
        (embed_tokens): Embedding(32000, 2048)
        (layers): ModuleList(
          (0-21): 22 x LlamaDecoderLayer(
            (self_attn): LlamaSdpaAttention(
              (q_proj): lora.Linear(
                (base_layer): Linear(in_features=2048, out_features=2048, bias=False)
                (lora_dropout): ModuleDict(
                  (default): Dropout(p=0.1, inplace=False)
                )
                (lora_A): ModuleDict(
                  (default): Linear(in_features=2048, out_features=64, bias=False)
                )
                (lora_B): ModuleDict(
                  (default): Linear(in_features=64, out_features=2048, bias=False)
                )
                (lora_embedding_A): ParameterDict()
                (lora_embedding_B): ParameterDict()
              )
              (k_proj): lora.Linear(
                (base_layer): Linear(in_features=2

In [33]:
# Merge LoRA and base model
merged_model = model.merge_and_unload()

After merging the adapter with the base model, we can use it with the prompt template that we defined earlier:

In [34]:
from transformers import pipeline

# Use our predefined prompt template
prompt = """<|user|>
Tell me something about Large Language Models.</s>
<|assistant|>
"""

# Run our instruction-tuned model
pipe = pipeline(task="text-generation", model=merged_model, tokenizer=tokenizer)
print(pipe(prompt)[0]["generated_text"])

Starting from v4.46, the `logits` model output will have the same type as the model (except at train time, where it will always be FP32)


<|user|>
Tell me something about Large Language Models.</s>
<|assistant|>
Large Language Models (LLMs) are a type of artificial intelligence (AI) that can generate human-like language. They are trained on large amounts of data, including text, audio, and video, and are capable of generating complex sentences and phrases that are often difficult to create by humans.

LLMs are used in a variety of applications, including natural language processing (NLP), machine translation, and chatbots. They can be used to generate text in different languages, such as English, French, or German, and can also be used to generate images, videos, or other forms of content.

One of the most significant applications of LLMs is in the field of natural language generation (NLG). NLG is the process of generating human-like language from a computer program. LLMs can be used to generate text in a variety of fields, including marketing, finance, and medicine.

Another application of LLMs is in the field of chatb

The aggregate output shows that the model now closely follows our instructions, which is not possible with the base model.

In [35]:
model_name = "TinyLlama/TinyLlama-1.1B-intermediate-step-1431k-3T"

# Load the original base model for comparison
original_model = AutoModelForCausalLM.from_pretrained(
    model_name,
    device_map="auto",
)

In [37]:
# # Use our predefined prompt template
# prompt = """<|user|>
# Tell me something about Large Language Models.</s>
# <|assistant|>
# """

# # Run the original base model
# pipe_original = pipeline(task="text-generation", model=original_model, tokenizer=tokenizer)
# print(pipe_original(prompt)[0]["generated_text"])

# Evaluating Generative Models

### Word-Level Metrics

These classic techniques compare a reference dataset with the generated tokens on a token (or token-set) level.

Common word-level metrics include perplexity, ROUGE, BLEU, and BERTScore.

**Perplexity**: how well a language model predicts a text. Given input text, the model predicts how likely the next token is. With perplexity, we assume a model performs better if it gives the next token a high probability. In other words, the models should not be “perplexed” when presented with a well-written document.

**Word-level metrics such as these do not account for consistency, fluency, creativity, or even correctness of the generated text.**
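To make the perplexity definition above concrete, the sketch below computes it as the exponential of the average negative log-probability of the observed tokens. The probabilities are made-up values, not the output of a real model:

```python
import math

def perplexity(token_probs):
    """exp of the average negative log-probability assigned to the observed tokens."""
    nll = -sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(nll)

confident = [0.5, 0.5, 0.5, 0.5]  # the model gives each true token 50%
unsure = [0.1, 0.1, 0.1, 0.1]     # the model gives each true token 10%

print(f"confident model: {perplexity(confident):.1f}")
print(f"unsure model:    {perplexity(unsure):.1f}")
```

A lower perplexity means the model was less "perplexed" by the text. A value of 2 can be read as the model hesitating between two equally likely tokens at every step.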

### Benchmarks

A common method for evaluating generative models on language generation and understanding tasks is on well-known and public benchmarks, such as:

- **MMLU**: The Massive Multitask Language Understanding (MMLU) benchmark tests the model on multiple-choice questions across 57 different tasks, spanning subjects such as elementary mathematics, US history, computer science, and law.
- **GLUE**: The General Language Understanding Evaluation (GLUE) benchmark consists of language understanding tasks covering a wide degree of difficulty.
- **TruthfulQA**: TruthfulQA measures the truthfulness of a model’s generated text.
- **GSM8k**: The GSM8k dataset contains grade-school math word problems. It is linguistically diverse and created by human problem writers.
- **HellaSwag**: HellaSwag is a challenge dataset for evaluating common-sense inference. It consists of multiple-choice questions that the model needs to answer. It can select one of four answer choices for each question.

For coding:
- **HumanEval**:  The HumanEval benchmark is used for evaluating generated code based on 164 programming problems.

A downside to public benchmarks is that models can be overfitted to these benchmarks to generate the best responses.

Moreover, these are still broad benchmarks and might not cover very specific use cases. 

Lastly, another downside is that some benchmarks require strong GPUs with a long running time (over hours) to compute, which makes iteration difficult.


### Leaderboards

Whenever a model is released, you will often see it evaluated on several benchmarks to showcase how it performs across the board.

A common leaderboard is the [Open LLM Leaderboard](https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard#/), which, at the time of writing, comprises six benchmarks, among them HellaSwag, MMLU, TruthfulQA, and GSM8k.


### Automated Evaluation

Part of evaluating a generative output is the quality of its text. For instance, even if two models were to give the same correct answer to a question, the way they derived that answer might be different. 

Similarly, although two summaries might be similar, one could be significantly shorter than another, which is often important for a good summary.

To evaluate the quality of the generated text beyond the correctness of the final answer, **LLM-as-a-judge** was introduced.

An interesting variant of this method is **pairwise comparison**. Two different LLMs will generate an answer to a question and a third LLM will be the judge to declare which is better.


### Human Evaluation

The gold standard of evaluation is generally considered to be human evaluation

Even if an LLM scores well on broad benchmarks, it still might not score well on domain-specific tasks.

Moreover, benchmarks do not fully capture human preference and all methods discussed before are merely proxies for that.

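A great example of a human-based evaluation technique is the [Chatbot Arena](https://lmarena.ai/). Users interact with two anonymous LLMs side by side and vote for the output they prefer, and these votes are used to rank the models. For instance, if a low-ranked LLM beats a high-ranked LLM, its ranking changes significantly. In chess, this is referred to as the Elo rating system.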
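A sketch of a classic Elo update shows why an upset moves the ratings so much. The constants used here (a K-factor of 32 and a scale of 400) are conventional chess values taken as assumptions. The exact rating computation used by Chatbot Arena may differ:

```python
def expected_score(rating_a, rating_b):
    """Probability that A beats B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400))

def elo_update(rating_a, rating_b, score_a, k=32):
    """Return updated (rating_a, rating_b); score_a is 1 for a win, 0 for a loss."""
    delta = k * (score_a - expected_score(rating_a, rating_b))
    return rating_a + delta, rating_b - delta

low, high = 1000, 1400

upset = elo_update(low, high, score_a=1)    # the low-ranked model wins
routine = elo_update(high, low, score_a=1)  # the high-ranked model wins

print(f"upset:   low {upset[0]:.1f}, high {upset[1]:.1f}")
print(f"routine: high {routine[0]:.1f}, low {routine[1]:.1f}")
```

An expected win barely changes either rating, while an unexpected win shifts both by a large amount.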

# Preference Tuning / Alignment / RLHF

Although our model can now follow instructions, we can further improve its behavior by a final training phase that aligns it to how we expect it to behave in different scenarios. 

For instance, when asked “What is an LLM?” we might prefer an elaborate answer that describes the internals of an LLM compared to the answer “It is a large language model” without further explanations. 

**Preference evaluator**: We can ask a person (preference evaluator) to evaluate the quality of that model generation. 

<img src="imgs/pref.png" alt="Patching" width="500" height="200">

- If the score is high, the model is updated to encourage more generations of this type.
- If the score is low, the model is updated to discourage such generations.

## Automating Preference Evaluation Using Reward Models

We need a step before the preference-tuning step, namely to train a reward model.

<img src="imgs/reward.png" alt="Patching" width="600" height="150">

- We take a copy of the instruction-tuned model and slightly change it so that instead of generating text, it now outputs a single score.


<img src="imgs/rewardhead.png" alt="Patching" width="600" height="150">

## The Inputs and Outputs of a Reward Model


We cannot directly use the reward model. It needs to first be trained to properly score generations. So let’s get a **preference dataset** 


### Reward model training dataset

One common shape for preference datasets is for a training example to have a prompt, with one accepted generation and one rejected generation.

One way to generate preference data is to present a prompt to the LLM and have it generate two different generations.


### Reward model training step

Now that we have the preference training dataset, we can proceed to train the reward model.

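In a simple training step, we use the reward model to:

1. Score the accepted generation
2. Score the rejected generation

The reward model is then updated so that the accepted generation scores higher than the rejected one.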
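One common way to turn these two scores into a training signal is a pairwise loss that shrinks as the accepted score rises above the rejected score. The text does not name a specific loss, so the formula below is an assumption, and the scores are made-up numbers:

```python
import math

def pairwise_reward_loss(score_accepted, score_rejected):
    """-log(sigmoid(accepted - rejected)), written as log(1 + exp(-margin))."""
    margin = score_accepted - score_rejected
    return math.log(1.0 + math.exp(-margin))

print(f"accepted clearly higher: {pairwise_reward_loss(2.0, 0.0):.4f}")
print(f"scores tied:             {pairwise_reward_loss(1.0, 1.0):.4f}")
print(f"rejected scored higher:  {pairwise_reward_loss(0.0, 2.0):.4f}")
```

Minimizing this loss pushes the reward model to rank the accepted generation above the rejected one.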


<img src="imgs/qualityhead.png" alt="Patching" width="400" height="300">

### Final pipeline (3 stages)

- Collect preference data
- Train a reward model
- Use the reward model to fine-tune the LLM (operating as the preference evaluator)


<img src="imgs/rewardmodel.png" alt="Patching" width="600" height="300">


**Llama 2, for example, trains two reward models: one that scores helpfulness and another that scores safety**

### PPO

A common method to fine-tune the LLM with the trained reward model is Proximal Policy Optimization (PPO).

PPO is a popular reinforcement learning technique that optimizes the instruction-tuned LLM by making sure that the LLM does not deviate too much from the expected rewards.

**A disadvantage of PPO is that it is a complex method that needs to train at least two models, the reward model and the LLM, which can be more costly than perhaps necessary.**


### DPO

Direct Preference Optimization (DPO) is an alternative to PPO and does away with the reinforcement-based learning procedure

Instead of using the reward model to judge the quality of a generation, we let the LLM itself do that. 

We use a copy of the LLM as the reference model to judge the shift between the reference and trainable model in the quality of the accepted generation and rejected generation.


<img src="imgs/DPO.png" alt="Patching" width="600" height="300">

By calculating this shift during training, we can optimize the likelihood of accepted generations over rejected generations by tracking the difference in the reference model and the trainable model.


### DPO Calculations

To calculate this shift and its related scores, the **log probabilities of the rejected generations and accepted generations are extracted from both models**. 

This process is performed at a token level where the probabilities are combined to calculate the shift between the reference and trainable models:

<img src="imgs/dposcores.png" alt="Patching" width="600" height="300">

Using these scores, we can optimize the parameters of the trainable model to be more confident of generating the accepted generations and less confident of generating the rejected generations.

Compared to PPO, the authors found DPO to be more stable during training and more accurate.
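The shift described above can be sketched as a small loss function. The per-token log probabilities are hypothetical, and `beta=0.1` is an assumed value for the strength of the preference signal:

```python
import math

def dpo_loss(policy_acc, policy_rej, ref_acc, ref_rej, beta=0.1):
    """DPO-style loss from per-token log probabilities of both models.

    Each argument is a list of token log probabilities for one generation.
    """
    # Token-level log probabilities are summed to score a whole generation
    acc_shift = sum(policy_acc) - sum(ref_acc)  # shift on the accepted generation
    rej_shift = sum(policy_rej) - sum(ref_rej)  # shift on the rejected generation
    margin = beta * (acc_shift - rej_shift)
    return math.log(1.0 + math.exp(-margin))    # equals -log(sigmoid(margin))

# Hypothetical log probabilities for two-token generations
ref_acc, ref_rej = [-1.2, -0.9], [-1.5, -1.0]
policy_acc, policy_rej = [-1.0, -0.5], [-2.0, -1.5]

print(f"untrained (policy == reference): {dpo_loss(ref_acc, ref_rej, ref_acc, ref_rej):.4f}")
print(f"after a helpful update:          {dpo_loss(policy_acc, policy_rej, ref_acc, ref_rej):.4f}")
```

The loss drops when the trainable model becomes more confident in the accepted generation and less confident in the rejected one, relative to the reference model.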

# Preference Tuning with DPO

We will still be using TinyLlama but this time an instruction-tuned version that was first trained using full fine-tuning and then further aligned with DPO.

Compared to our initial instruction-tuned model, this LLM was trained on much larger datasets.

In this section, we will demonstrate how you can further align this model using DPO with reward-based datasets.

### Templating Alignment Data

We will use a dataset that for each prompt contains an accepted generation and a rejected generation. 

This dataset was in part generated by ChatGPT with scores on which output should be accepted and which rejected:

In [38]:
from datasets import load_dataset

def format_prompt(example):
    """Format a preference example using the <|system|>/<|user|> template that TinyLlama uses."""

    # Format answers
    system = "<|system|>\n" + example['system'] + "</s>\n"
    prompt = "<|user|>\n" + example['input'] + "</s>\n<|assistant|>\n"
    chosen = example['chosen'] + "</s>\n"
    rejected = example['rejected'] + "</s>\n"

    return {
        "prompt": system + prompt,
        "chosen": chosen,
        "rejected": rejected,
    }

In [39]:
# Load the preference dataset (filtering and formatting follow in the next cells)
dpo_dataset = load_dataset("argilla/distilabel-intel-orca-dpo-pairs", split="train")

In [40]:
dpo_dataset = dpo_dataset.filter(
    lambda r:
        r["status"] != "tie" and
        r["chosen_score"] >= 8 and
        not r["in_gsm8k_train"]
)

In [41]:
dpo_dataset = dpo_dataset.map(format_prompt, remove_columns=dpo_dataset.column_names)

In [42]:
dpo_dataset

Dataset({
    features: ['chosen', 'rejected', 'prompt'],
    num_rows: 5922
})

In [43]:
dpo_dataset[0]

{'chosen': 'Midsummer House is a moderately priced Chinese restaurant with a 3/5 customer rating, located near All Bar One.</s>\n',
 'rejected': ' Sure! Here\'s a sentence that describes all the data you provided:\n\n"Midsummer House is a moderately priced Chinese restaurant with a customer rating of 3 out of 5, located near All Bar One, offering a variety of delicious dishes."</s>\n',
 'prompt': '<|system|>\nYou are an AI assistant. You will be given a task. You must generate a detailed and long answer.</s>\n<|user|>\nGenerate an approximately fifteen-word sentence that describes all this data: Midsummer House eatType restaurant; Midsummer House food Chinese; Midsummer House priceRange moderate; Midsummer House customer rating 3 out of 5; Midsummer House near All Bar One</s>\n<|assistant|>\n'}

In [45]:
print(dpo_dataset[0]["prompt"])

<|system|>
You are an AI assistant. You will be given a task. You must generate a detailed and long answer.</s>
<|user|>
Generate an approximately fifteen-word sentence that describes all this data: Midsummer House eatType restaurant; Midsummer House food Chinese; Midsummer House priceRange moderate; Midsummer House customer rating 3 out of 5; Midsummer House near All Bar One</s>
<|assistant|>



## Model Quantization


We load the base model together with the LoRA adapter we trained previously. As before, we quantize the model to reduce the VRAM needed for training:

In [46]:
from peft import AutoPeftModelForCausalLM
from transformers import BitsAndBytesConfig, AutoTokenizer

In [47]:
# 4-bit quantization configuration - Q in QLoRA
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,  # Use 4-bit precision model loading
    bnb_4bit_quant_type="nf4",  # Quantization type
    bnb_4bit_compute_dtype="float16",  # Compute dtype
    bnb_4bit_use_double_quant=True,  # Apply nested quantization
)
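
As a rough sanity check on why 4-bit loading helps, the sketch below estimates the memory needed for the weights alone. It assumes roughly 1.1 billion parameters (inferred from the model name) and ignores quantization constants, activations, gradients, and optimizer state, so real VRAM usage will be higher.

```python
# Back-of-the-envelope estimate of weight memory at different precisions.
# Assumption: ~1.1e9 parameters, inferred from the "1.1B" in the model name.
NUM_PARAMS = 1.1e9

def weight_memory_gb(num_params, bits_per_param):
    """Memory for the weights only, in gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9

fp16_gb = weight_memory_gb(NUM_PARAMS, 16)
nf4_gb = weight_memory_gb(NUM_PARAMS, 4)

print(f"16-bit weights: {fp16_gb:.2f} GB")
print(f" 4-bit weights: {nf4_gb:.2f} GB")
```

Under these assumptions the 4-bit weights take about a quarter of the space of 16-bit weights.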

In [48]:
# Load the base model together with the SFT LoRA adapter
model = AutoPeftModelForCausalLM.from_pretrained(
    "training/TinyLlama-1.1B-qlora",
    low_cpu_mem_usage=True,
    device_map="auto",
    quantization_config=bnb_config,
)

In [49]:
# Fold the SFT adapter weights into the base model
merged_model = model.merge_and_unload()



In [50]:
# Load LLaMA tokenizer
model_name = "TinyLlama/TinyLlama-1.1B-intermediate-step-1431k-3T"
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
tokenizer.pad_token = "<PAD>"
tokenizer.padding_side = "left"

## LoRA Config

In [51]:
from peft import LoraConfig, prepare_model_for_kbit_training, get_peft_model

# Prepare LoRA Configuration
peft_config = LoraConfig(
    lora_alpha=32,  # LoRA Scaling
    lora_dropout=0.1,  # Dropout for LoRA Layers
    r=64,  # Rank
    bias="none",
    task_type="CAUSAL_LM",
    target_modules=  # Layers to target
     ['k_proj', 'gate_proj', 'v_proj', 'up_proj', 'q_proj', 'o_proj', 'down_proj']
)

# Prepare model for training
model = prepare_model_for_kbit_training(model)
model = get_peft_model(model, peft_config)
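
To get a feel for `r` and `lora_alpha`, the sketch below computes the LoRA scaling factor and the number of adapter parameters added to a single weight matrix. It assumes the standard LoRA formulation, in which a `d_out x d_in` matrix receives two low-rank factors of shapes `d_out x r` and `r x d_in` and the update is scaled by `lora_alpha / r`. The 2048 x 2048 matrix size is a hypothetical example, not a value read from the model.

```python
def lora_param_count(d_in, d_out, r):
    """Adapter parameters for one matrix: A is (r x d_in), B is (d_out x r)."""
    return r * (d_in + d_out)

r, lora_alpha = 64, 32          # values from the LoraConfig above
d_in = d_out = 2048             # hypothetical layer size for illustration

scaling = lora_alpha / r
adapter_params = lora_param_count(d_in, d_out, r)
full_params = d_in * d_out

print(f"scaling factor: {scaling}")
print(f"adapter params: {adapter_params:,} vs full matrix: {full_params:,}")
print(f"fraction trained: {adapter_params / full_params:.2%}")
```

Even at a fairly high rank such as 64, the adapter for this hypothetical matrix holds only a small fraction of the parameters of the full matrix.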

## DPO Trainer

For simplicity, we reuse the training arguments from before with one difference: instead of running for a full epoch (which can take up to two hours), we run for only 200 steps for illustration purposes.

We also add the `warmup_ratio` parameter, which linearly increases the learning rate from 0 to the `learning_rate` value we set over the first 10% of steps.

**Warmup**: By keeping the learning rate small at the start (the warmup period), we allow the model to adjust to the data before applying larger learning rates.
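
The schedule can be sketched in a few lines of plain Python. This is a minimal re-implementation of linear warmup followed by cosine decay, assuming 200 total steps, a peak learning rate of 1e-5, and 20 warmup steps (10% of 200). The function name `lr_at` is made up for this illustration, and the trainer's own scheduler may differ in details such as step offsets.

```python
import math

def lr_at(step, peak_lr=1e-5, total_steps=200, warmup_ratio=0.1):
    """Learning rate after `step` scheduler updates: linear warmup, then cosine decay."""
    warmup_steps = math.ceil(total_steps * warmup_ratio)
    if step < warmup_steps:
        # Warmup: grow linearly from 0 to peak_lr
        return peak_lr * step / warmup_steps
    # Decay: follow half a cosine wave from peak_lr down to 0
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    return peak_lr * 0.5 * (1.0 + math.cos(math.pi * progress))

for step in (0, 9, 19, 20, 29, 200):
    print(step, lr_at(step))
```

With these assumptions, `lr_at(9)`, `lr_at(19)`, and `lr_at(29)` give roughly 4.5e-06, 9.5e-06, and 9.938e-06, which line up with the learning rates logged at steps 10, 20, and 30 in the training output further down.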

In [None]:
from trl import DPOConfig

output_dir = "./training_results_ch12"

# Training arguments
training_arguments = DPOConfig(
    output_dir=output_dir,
    per_device_train_batch_size=2,
    gradient_accumulation_steps=4,
    optim="paged_adamw_32bit",
    learning_rate=1e-5,
    lr_scheduler_type="cosine",
    max_steps=200,
    logging_steps=10,
    fp16=True,
    gradient_checkpointing=True,
    warmup_ratio=0.1
)
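
These settings also determine how much data the run sees. As a rough calculation, assuming a single GPU (so the per-device batch size is the whole micro-batch) and the 5,922 filtered pairs from earlier:

```python
per_device_train_batch_size = 2
gradient_accumulation_steps = 4
max_steps = 200
num_examples = 5922   # rows in dpo_dataset after filtering
num_devices = 1       # assumption: training on a single GPU

# Examples consumed per optimizer step
effective_batch_size = (
    per_device_train_batch_size * gradient_accumulation_steps * num_devices
)

# Total examples seen and the corresponding fraction of one epoch
examples_seen = effective_batch_size * max_steps
epoch_fraction = examples_seen / num_examples

print(effective_batch_size, examples_seen, round(epoch_fraction, 4))
```

Under that assumption the run covers about 27% of one epoch, which is consistent with the `epoch` value of roughly 0.27 reported in the `TrainOutput` below.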

## Training

In [53]:
from trl import DPOTrainer

# Create DPO trainer
dpo_trainer = DPOTrainer(
    model,
    args=training_arguments,
    train_dataset=dpo_dataset,
    tokenizer=tokenizer,
    peft_config=peft_config,
    beta=0.1,
    max_prompt_length=512,
    max_length=512,
)

# Fine-tune model with DPO
dpo_trainer.train()


Deprecated positional argument(s) used in DPOTrainer, please use the DPOConfig to set these arguments instead.


max_steps is given, it will override any value given in num_train_epochs


Could not estimate the number of tokens of the input, floating-point operations will not be computed


{'loss': 0.6924, 'grad_norm': 1.8012945652008057, 'learning_rate': 4.5e-06, 'rewards/chosen': -0.0006841827416792512, 'rewards/rejected': -0.002232985571026802, 'rewards/accuracies': 0.25, 'rewards/margins': 0.0015488029457628727, 'logps/rejected': -113.9967041015625, 'logps/chosen': -106.90666198730469, 'logits/rejected': -2.9157962799072266, 'logits/chosen': -2.85026478767395, 'epoch': 0.01}
{'loss': 0.6777, 'grad_norm': 2.293241024017334, 'learning_rate': 9.5e-06, 'rewards/chosen': 0.002104945247992873, 'rewards/rejected': -0.02996930107474327, 'rewards/accuracies': 0.4625000059604645, 'rewards/margins': 0.03207424655556679, 'logps/rejected': -158.38888549804688, 'logps/chosen': -125.90263366699219, 'logits/rejected': -3.060880422592163, 'logits/chosen': -2.9251315593719482, 'epoch': 0.03}
{'loss': 0.6458, 'grad_norm': 2.260286569595337, 'learning_rate': 9.938441702975689e-06, 'rewards/chosen': 0.006185319274663925, 'rewards/rejected': -0.09871865808963776, 'rewards/accuracies': 0.4

TrainOutput(global_step=200, training_loss=0.6047369384765625, metrics={'train_runtime': 224.7541, 'train_samples_per_second': 7.119, 'train_steps_per_second': 0.89, 'total_flos': 0.0, 'train_loss': 0.6047369384765625, 'epoch': 0.2701789935832489})
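
The logged metrics are easier to read with the DPO objective in mind. The sketch below is a minimal, per-example version of the standard sigmoid DPO loss in plain Python, assuming the usual definition in which each implicit reward is `beta` times the difference between the policy and reference log-probabilities. The function and argument names, as well as the input numbers, are invented for illustration.

```python
import math

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Return (loss, chosen_reward, rejected_reward) for one preference pair."""
    # Implicit rewards: how much more likely the policy makes each answer
    # than the frozen reference model does, scaled by beta
    chosen_reward = beta * (policy_chosen_logp - ref_chosen_logp)
    rejected_reward = beta * (policy_rejected_logp - ref_rejected_logp)
    margin = chosen_reward - rejected_reward
    # -log(sigmoid(margin)), written as log(1 + exp(-margin))
    loss = math.log1p(math.exp(-margin))
    return loss, chosen_reward, rejected_reward

# Before any training the policy equals the reference, so the margin is 0
loss_start, _, _ = dpo_loss(-100.0, -110.0, -100.0, -110.0)
print(round(loss_start, 4))   # ln(2), about 0.6931

# Hypothetical later state: chosen answer slightly up, rejected answer down
loss_later, rc, rr = dpo_loss(-99.0, -113.0, -100.0, -110.0)
print(round(loss_later, 4), round(rc, 2), round(rr, 2))
```

A loss near ln(2) ≈ 0.693 therefore means the model does not yet prefer the chosen answers over the rejected ones. The first logged loss of 0.6924 sits just below that value, and the loss falls as `rewards/margins` grows.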

In [54]:
# Save adapter
dpo_trainer.model.save_pretrained("training/TinyLlama-1.1B-dpo-qlora")

We have now created a second adapter. To combine both adapters, we merge them into the base model one after the other:

In [55]:
from peft import PeftModel

# Merge SFT LoRA and base model
model = AutoPeftModelForCausalLM.from_pretrained(
    "training/TinyLlama-1.1B-qlora",
    low_cpu_mem_usage=True,
    device_map="auto",
)
sft_model = model.merge_and_unload()

# Merge DPO LoRA and SFT model
dpo_model = PeftModel.from_pretrained(
    sft_model,
    "training/TinyLlama-1.1B-dpo-qlora",
    device_map="auto",
)
dpo_model = dpo_model.merge_and_unload()
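
Conceptually, merging an adapter adds its low-rank update to the weights it was attached to, and merging a second adapter repeats this on the already-updated weights. The sketch below illustrates this with tiny hand-written matrices in plain Python, assuming the standard LoRA update `W + (alpha / r) * B @ A`. All numbers and helper names are made up, and the sketch ignores details such as quantization and dtype handling that the library takes care of.

```python
def matmul(X, Y):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)] for row in X]

def merge_lora(W, A, B, alpha, r):
    """Return W + (alpha / r) * (B @ A), the merged weight matrix."""
    delta = matmul(B, A)
    scale = alpha / r
    return [[w + scale * d for w, d in zip(w_row, d_row)]
            for w_row, d_row in zip(W, delta)]

# Toy 2x2 base weights and two rank-1 adapters (hypothetical values)
W_base = [[1.0, 0.0], [0.0, 1.0]]
A_sft, B_sft = [[1.0, 2.0]], [[1.0], [0.0]]   # A: (r x d_in), B: (d_out x r)
A_dpo, B_dpo = [[0.0, 1.0]], [[0.0], [2.0]]

# Step 1: merge the first (SFT) adapter into the base weights
W_sft = merge_lora(W_base, A_sft, B_sft, alpha=1, r=1)
# Step 2: merge the second (DPO) adapter into the result of step 1
W_dpo = merge_lora(W_sft, A_dpo, B_dpo, alpha=1, r=1)

print(W_sft)
print(W_dpo)
```

The second merge starts from `W_sft` rather than `W_base`, mirroring how the DPO adapter above is loaded on top of the already-merged SFT model.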

This combination of SFT+DPO is a great way to first fine-tune your model to perform basic chatting and then align its answers with human preference. However, it comes at a cost: we need to run two training loops and potentially tune the hyperparameters of both.

Since the release of DPO, new preference-alignment methods have been developed. Of note is **Odds Ratio Preference Optimization (ORPO), a process that combines SFT and DPO into a single training process**.

In [56]:
from transformers import pipeline

# Use our predefined prompt template
prompt = """<|user|>
Tell me something about Large Language Models.</s>
<|assistant|>
"""

# Run our SFT + DPO fine-tuned model
pipe = pipeline(task="text-generation", model=dpo_model, tokenizer=tokenizer)
print(pipe(prompt)[0]["generated_text"])

<|user|>
Tell me something about Large Language Models.</s>
<|assistant|>
Large Language Models (LLMs) are a type of artificial intelligence (AI) that can generate human-like language. They are trained on large amounts of data, including text, audio, and video, and are capable of generating complex sentences and phrases that are often difficult to create by humans.

LLMs are used in a variety of applications, including natural language processing (NLP), machine translation, and chatbots. They can be used to generate text in different languages, such as English, French, or German, and can also be used to generate images, videos, or other forms of content.

One of the most significant applications of LLMs is in the field of natural language generation (NLG). NLG is the process of generating human-like language from a computer program. LLMs can be used to generate text in a variety of fields, including marketing, finance, and medicine.

Another application of LLMs is in the field of chatb