# Different LLM Fine-Tuning Techniques

In this article, I will talk about the different fine-tuning techniques for LLMs. I will describe each technique and then show them in action using Hugging Face <img src="imgs/hf-logo.svg" alt="drawing" width="30"/> libraries. 

# Fine-tuning vs. Pre-training

## Pre-training

The main goal of pre-training is to develop a general understanding of language across a broad spectrum. This foundational knowledge is crucial for the model to perform a wide range of language tasks. Pre-training uses a massive and diverse corpus that includes books, articles, websites, and other written text from a wide array of subjects and domains. The model is trained using <em>unsupervised</em> (more precisely, self-supervised) learning, primarily by predicting the next token in a sequence, so the text itself supplies the training signal and no human labels are needed. This extensive training helps the model learn context, grammar, and a vast vocabulary. However, pre-training requires significant computational power and time, often involving hundreds of GPUs or TPUs running for weeks or months, because of the size of the dataset and the complexity of the model. As a result, the model is a generalist, capable of performing reasonably well across a wide range of tasks without further modification.

## Fine-tuning

Fine-tuning tailors the pre-trained model to specific tasks or domains. The aim is to optimize the model's performance for particular applications, such as legal document analysis, medical diagnosis, or customer service interactions. Fine-tuning needs a smaller, more specific dataset than pre-training, one that focuses on the language, style, or content relevant to the task or domain at hand. During fine-tuning, the model continues to learn, but this time under <em>supervised</em> conditions where it adjusts its parameters based on feedback specific to the task (e.g., correct answers in a Q&A task). Fine-tuning requires far fewer computational resources than pre-training and is usually quicker, because the model is already largely trained and only needs adjustments to its existing knowledge. The resulting model is an expert in its specific area and may lose some of its general applicability outside the targeted domain.

As mentioned above, fine-tuning is conducted in a supervised manner. Let's delve deeper into supervised fine-tuning.


## Supervised Fine-tuning (SFT)

This technique is called "supervised" because the new dataset used for fine-tuning is labeled, meaning that the correct answers (or labels) are provided for each input example. The fine-tuning process uses this labeled data to adjust the model's weights to better predict the correct labels for new inputs. Typically, a lower learning rate is used during fine-tuning than during initial training. The smaller weight updates make fine-tuning more stable and help the model retain previously learned features without drastic alterations.

SFT trains models to follow the instructions given in the labeled dataset, which enables them to perform well on the new task. However, these models might still produce harmful or unethical outputs despite their capacity to follow instructions. To better align these models with human values, further training with pairwise preference data is needed, using methods such as reinforcement learning from human feedback (RLHF) and direct preference optimization (DPO). Figure 1 shows the LLM training process.


<center><figure><img src="imgs/OPRO.png" alt="drawing" width="1000"/><figcaption>Fig. 1: Language Model Training</figcaption></figure></center> 

# Preference Alignment without Reinforcement Learning

Working with RLHF and developing a reward model has always been challenging, since it requires extensive hyperparameter search due to the instability of PPO and the sensitivity of reward models. For this reason, researchers have recently proposed several preference alignment approaches that eliminate the reward model and directly use preferred and rejected responses to steer the model toward the better responses. In this section, I will talk about two recent techniques, namely DPO and ORPO.

## Direct Preference Optimization (DPO)

DPO fine-tunes a language model to align with human preferences without explicit use of a reward model or reinforcement learning. DPO pursues the same goal as current RLHF algorithms but is easier to implement and train. The DPO update increases the relative log probability of preferred over dispreferred responses, and it incorporates a dynamic, per-example importance weight that prevents model degeneration.

### In simple terms, DPO works as follows:


At the beginning of fine-tuning, a reference policy (model) $ \pi_{ref}$ is created by freezing the trainable parameters of the initial model, which was fine-tuned in a supervised manner (i.e., with SFT).

A dataset of tuples $ (x, y_{w}, y_{l}) $ is generated, where $ x $ is the prompt, $ y_{w} $ is the preferred (chosen) completion, and $ y_{l} $ is the dispreferred (rejected) completion (see Listing 1 for a toy dataset).

For each prompt in the dataset, both the reference policy and the language model policy $ \pi_{\theta} $ (i.e., the new trainable model) score the chosen and rejected responses, and these scores are used in the DPO loss function:

$ L_{DPO}(\pi_{ref},\pi_{\theta}) = - \mathbb{E}_{(x,y_{w},y_{l})} \; \left[\log \sigma\left( \beta \log \frac {\pi_{\theta}(y_{w} | x)}{\pi_{ref}(y_{w} | x)} - \beta \log \frac {\pi_{\theta}(y_{l} | x)}{\pi_{ref}(y_{l} | x)} \right)\right]$

In this equation, $ \sigma $ is the sigmoid function and $ \beta $ is a temperature parameter (typically in the range 0.1 to 0.5) that controls how much we pay attention to the reference model: larger values penalize deviation from $ \pi_{ref} $ more heavily.


Intuitively, a gradient step on the loss function $L_{DPO}$ increases the likelihood of the preferred completions $y_{w}$ and decreases the likelihood of the dispreferred completions $y_{l}$.
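To make the loss concrete, here is a minimal sketch that evaluates it for a single $ (x, y_{w}, y_{l}) $ triple in plain Python. It assumes the four sequence log-probabilities have already been computed by the two models; the function names `log_sigmoid` and `dpo_loss` and the numeric values are illustrative, not taken from any library.

```python
import math


def log_sigmoid(z: float) -> float:
    """Numerically stable log(sigmoid(z))."""
    if z >= 0:
        return -math.log1p(math.exp(-z))
    return z - math.log1p(math.exp(z))


def dpo_loss(policy_logp_w, policy_logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """DPO loss for one triple, given log pi(y|x) under the policy and the reference."""
    chosen_logratio = policy_logp_w - ref_logp_w      # log pi_theta(y_w|x) / pi_ref(y_w|x)
    rejected_logratio = policy_logp_l - ref_logp_l    # log pi_theta(y_l|x) / pi_ref(y_l|x)
    margin = beta * (chosen_logratio - rejected_logratio)
    return -log_sigmoid(margin)


# At the start of training the policy equals the reference, so the margin is 0.
loss_start = dpo_loss(-12.0, -15.0, -12.0, -15.0)
# Later the policy has raised log p(y_w|x) and lowered log p(y_l|x) (made-up numbers).
loss_better = dpo_loss(-10.0, -18.0, -12.0, -15.0)
print(f"{loss_start:.4f} {loss_better:.4f}")
```

When the policy is identical to the reference, the loss equals $ \log 2 $, and it shrinks as the policy assigns relatively more probability to the chosen response than the reference does.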


In [3]:
dpo_dataset_dict = {
    "prompt": [
        "hello",
        "how are you",
        "What is your name?",
        "What is your name?",
        "Which is the best programming language?",
        "Which is the best programming language?",
        "Which is the best programming language?",
    ],
    "chosen": [
        "hi nice to meet you",
        "I am fine",
        "My name is Mary",
        "My name is Mary",
        "Python",
        "Python",
        "Java",
    ],
    "rejected": [
        "leave me alone",
        "I am not fine",
        "Whats it to you?",
        "I dont have a name",
        "Javascript",
        "C++",
        "C++",
    ],
}

Listing 1: An example of an ORPO/DPO dataset [2].
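The dictionary in Listing 1 is stored column by column: the i-th entries of the three lists together form one preference triple. The following sketch converts such a dictionary into one record per triple and checks that the columns line up. The helper name `to_records` and the two-row `toy_columns` dictionary are hypothetical and only serve the illustration.

```python
def to_records(columns: dict) -> list:
    """Turn a dict of equally long lists into a list of per-example dicts."""
    lengths = {name: len(values) for name, values in columns.items()}
    if len(set(lengths.values())) != 1:
        raise ValueError(f"columns have unequal lengths: {lengths}")
    names = list(columns)
    return [dict(zip(names, row)) for row in zip(*columns.values())]


# A two-row excerpt in the same layout as Listing 1.
toy_columns = {
    "prompt": ["hello", "What is your name?"],
    "chosen": ["hi nice to meet you", "My name is Mary"],
    "rejected": ["leave me alone", "I dont have a name"],
}

records = to_records(toy_columns)
for record in records:
    print(record)
```

Note that the same prompt may appear in several rows with different chosen/rejected pairs, as in Listing 1.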

## Odds Ratio Preference Optimization (ORPO)

DPO sometimes degrades the quality of the model. Recently, researchers have proposed the odds ratio preference optimization (ORPO) technique, which combines supervised fine-tuning with preference alignment and efficiently penalizes the model for learning undesired generation styles during SFT.

Unlike DPO, which uses a reference policy (model) to align the new model, ORPO does so in a single step without a reference model. It assigns a weak penalty to the rejected responses and a strong reward to the chosen responses through a simple log odds ratio term added to the negative log-likelihood (NLL) loss of SFT.

SFT plays a significant role in tailoring the pre-trained language models to the desired domain by increasing the log probabilities of pertinent tokens. However, this also increases the likelihood of generating undesirable tokens. ORPO reduces the generation of unwanted answers by adding an odds ratio-based penalty to the conventional negative log-likelihood (NLL):

$ L_{ORPO} = \mathbb{E}_{(x,y_{w},y_{l})} \; [L_{SFT} + \lambda \cdot L_{OR}]$

$L_{SFT}$ is the negative log-likelihood loss, which maximizes the likelihood of generating the reference tokens. $L_{OR}$ is given by:

$L_{OR} = -\log \sigma\left( \log \frac {odds_{\theta}(y_{w} | x)}{odds_{\theta}(y_{l} | x)} \right)$

where $ odds_{\theta}(y|x) = \frac{P_{\theta}(y|x)}{1 - P_{\theta}(y|x)} $ indicates how much more likely the model is to generate $y$ than not to generate it.

In the above, minimizing $L_{OR}$ maximizes the odds ratio between the likelihood of generating the favored response $y_{w}$ and the disfavored response $y_{l}$. Wrapping the log odds ratio in $ -\log \sigma $ turns this maximization into a loss that can be added to the original loss function, and $\lambda$ is a weight parameter that controls its contribution.

In general, $ L_{ORPO} $ steers the language model toward the specific subset of the desired domain while disfavoring generations from the rejected response set.
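The ORPO objective can be evaluated by hand for a single example. The sketch below assumes, following my reading of the ORPO paper, that $ \log P_{\theta}(y|x) $ is the average token log-probability of the response, so that $L_{SFT}$ is simply its negative for the chosen response. The names `log_odds` and `orpo_loss` and the probabilities 0.6 and 0.2 are made up for illustration.

```python
import math


def log_sigmoid(z: float) -> float:
    """Numerically stable log(sigmoid(z))."""
    if z >= 0:
        return -math.log1p(math.exp(-z))
    return z - math.log1p(math.exp(z))


def log_odds(logp: float) -> float:
    """log(p / (1 - p)) computed from log p, valid for log p < 0."""
    return logp - math.log1p(-math.exp(logp))


def orpo_loss(avg_logp_w: float, avg_logp_l: float, lam: float = 0.1):
    """Return (L_ORPO, L_SFT, L_OR) for one (x, y_w, y_l) triple."""
    l_sft = -avg_logp_w                                    # NLL of the chosen response
    log_odds_ratio = log_odds(avg_logp_w) - log_odds(avg_logp_l)
    l_or = -log_sigmoid(log_odds_ratio)                    # odds-ratio penalty
    return l_sft + lam * l_or, l_sft, l_or


# Chosen response has probability 0.6, rejected response 0.2 (illustrative values).
total, l_sft, l_or = orpo_loss(math.log(0.6), math.log(0.2))
print(f"L_SFT={l_sft:.4f}  L_OR={l_or:.4f}  L_ORPO={total:.4f}")
```

With these numbers the odds are 1.5 and 0.25, the odds ratio is 6, and $L_{OR} = \log(7/6)$. If the two probabilities were swapped, $L_{OR}$ would grow to $\log 7$, which is how the extra term penalizes a model that prefers the rejected response.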

SFT uses NLL to penalize the model if it doesn't predict the reference answers well. However, it focuses only on generating the chosen responses and has no mechanism to penalize the generation of rejected responses.

ORPO expects datasets in the same format as DPO, shown in Listing 1.

Figure 2 compares ORPO with DPO.

<center><figure><img src="imgs/ORPO-DPO.png" alt="drawing" width="600"/><figcaption>Fig. 2: DPO vs. ORPO</figcaption></figure></center> 

Now that we have a basic understanding of these fine-tuning techniques, let's see them in action. In the next section, I will demonstrate how to fine-tune a Llama 3 model using the techniques above.

# Fine-tuning Llama 3

First and foremost, we should prepare our preference dataset in an ORPO/DPO-compatible format. The entries should be named:

* prompt
* chosen
* rejected

Moreover, they need to follow the chat template of the underlying language model. A good preference dataset with more than 40k samples is available on the Hugging Face Hub: [orpo-dpo-mix-40k](https://huggingface.co/datasets/mlabonne/orpo-dpo-mix-40k). <code>apply_chat_template()</code> is a method of the tokenizer. It converts conversations, represented as lists of messages, into a single tokenizable string in the format that the model expects. Let's see it in action. The snippet below loads a small Blenderbot tokenizer for illustration; for Llama 3, load the tokenizer of the Llama 3 checkpoint instead:

In [None]:
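from transformers import AutoTokenizer

# This requires a tokenizer that defines a chat template
tokenizer = AutoTokenizer.from_pretrained("facebook/blenderbot-400M-distill")
# A conversation is a list of messages, each with a role and its content
chat = [
   {"role": "user", "content": "Hello, how are you?"},
   {"role": "assistant", "content": "I'm doing great. How can I help you today?"},
]
# tokenize=False returns the formatted string instead of token ids
print(tokenizer.apply_chat_template(chat, tokenize=False))

The exact output depends on the model's chat template. A ChatML-style template, for example, renders a conversation as follows:

<|im_start|>user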
Hello, how is the weather today?<|im_end|>
<|im_start|>assistant
It's currently cloudy and 55.4 F?<|im_end|>
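
To show what a chat template does under the hood, here is a hand-written sketch that reproduces the ChatML-style layout above with plain string formatting. The function `render_chatml` is hypothetical and is not part of Transformers; real templates are stored with the tokenizer, and Llama 3 uses its own special tokens rather than the ones shown here.

```python
def render_chatml(messages, add_generation_prompt=False):
    """Format a list of {"role", "content"} messages in ChatML style."""
    parts = []
    for message in messages:
        parts.append(
            f"<|im_start|>{message['role']}\n{message['content']}<|im_end|>\n"
        )
    if add_generation_prompt:
        # Open an assistant turn so the model continues with its reply.
        parts.append("<|im_start|>assistant\n")
    return "".join(parts)


chat = [
    {"role": "user", "content": "Hello, how are you?"},
    {"role": "assistant", "content": "I'm doing great. How can I help you today?"},
]
rendered = render_chatml(chat)
print(rendered)
```

For preference data, the prompt, chosen, and rejected entries all need to be rendered with the same template, so that the model sees the format it will be used with later.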

## Quantization

We can reduce the memory and computation cost by quantizing the model, that is, by representing weights and activations in a lower-precision format such as 8-bit integers. We use the <code>bitsandbytes</code> library for this purpose, configured through <code>BitsAndBytesConfig</code> from Transformers.

In [None]:
import torch
from transformers import BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True, #1
    bnb_4bit_quant_type="nf4", #2
    bnb_4bit_compute_dtype=torch.bfloat16, #3
    bnb_4bit_use_double_quant=True, #4
)

Here, #1 loads the model with 4-bit quantization, and #2 selects the NormalFloat4 (NF4) format, the 4-bit data type used in QLoRA. #3 sets torch.bfloat16 as the compute data type: the weights are stored in 4 bits but dequantized to bfloat16 for matrix operations, which can be faster than computing in float32. Finally, #4 enables nested (double) quantization, which also quantizes the quantization constants for better memory efficiency with little impact on quality.
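
To build some intuition for what quantization does, here is a minimal sketch of symmetric absmax quantization to 8-bit integers in plain Python. This is a deliberately simpler scheme than NF4, which uses non-uniform levels and block-wise scaling; the function names and the weight values are made up for illustration.

```python
def quantize_absmax(weights, bits=8):
    """Map floats to signed integers in [-qmax, qmax] with a single scale."""
    qmax = 2 ** (bits - 1) - 1                      # 127 for 8 bits
    scale = (max(abs(w) for w in weights) / qmax) or 1.0
    codes = [round(w / scale) for w in weights]
    return codes, scale


def dequantize(codes, scale):
    """Recover approximate float values from the integer codes."""
    return [c * scale for c in codes]


weights = [0.12, -0.5, 0.33, 1.0, -0.07]
codes, scale = quantize_absmax(weights)
restored = dequantize(codes, scale)
max_error = max(abs(w - r) for w, r in zip(weights, restored))
print(codes)
print(f"scale={scale:.6f}  max_error={max_error:.6f}")
```

Each weight is stored as a small integer plus one shared scale, and the reconstruction error is bounded by half a quantization step. Using fewer bits shrinks the memory footprint further at the price of a coarser grid.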

We also need a <code>LoraConfig</code> to train LoRA adapters on top of the 4-bit model; it will be passed to the trainer later:

In [None]:
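from peft import LoraConfig

peft_config = LoraConfig(
    r=16,  # rank of the low-rank update matrices
    lora_alpha=32,  # scaling factor; the update is scaled by lora_alpha / r
    lora_dropout=0.05,  # dropout applied on the LoRA branch
    bias="none",  # leave bias terms frozen
    task_type="CAUSAL_LM",
    # attach adapters to the attention and MLP projection layers
    target_modules=['up_proj', 'down_proj', 'gate_proj', 'k_proj', 'q_proj', 'v_proj', 'o_proj']
)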
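
The effect of `r` and `lora_alpha` can be sketched in a few lines of plain Python. LoRA keeps the original weight matrix $W$ frozen and learns a low-rank update, so the layer computes $ Wx + \frac{\alpha}{r} B(Ax) $. The helper names and the 4096 x 4096 layer size below are assumptions chosen for illustration, not values read from the model.

```python
def lora_trainable_params(d_in: int, d_out: int, r: int) -> int:
    """A has shape (r, d_in) and B has shape (d_out, r)."""
    return r * (d_in + d_out)


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


def lora_forward(x, W, A, B, alpha, r):
    """Compute W x + (alpha / r) * B (A x) with W frozen."""
    base = matvec(W, x)
    update = matvec(B, matvec(A, x))
    scale = alpha / r
    return [b + scale * u for b, u in zip(base, update)]


# Parameter count for a hypothetical 4096 x 4096 projection with r=16.
full_params = 4096 * 4096
lora_params = lora_trainable_params(4096, 4096, r=16)
print(full_params, lora_params, f"{lora_params / full_params:.2%}")

# Tiny forward pass with rank 1: B starts at zero, so the output is unchanged.
W = [[1.0, 0.0], [0.0, 1.0]]
A = [[1.0, 1.0]]
x = [2.0, 3.0]
out_initial = lora_forward(x, W, A, [[0.0], [0.0]], alpha=2, r=1)
out_trained = lora_forward(x, W, A, [[1.0], [0.0]], alpha=2, r=1)
print(out_initial, out_trained)
```

In this example the adapter trains well under one percent of the parameters of the full matrix, which is why LoRA combines so well with a quantized, frozen base model.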

## Fine-tuning using DPO

As we discussed above, DPO needs a curated dataset with "chosen" and "rejected" responses to a "prompt". Hugging Face has more than 50 DPO-compatible datasets [here](https://huggingface.co/datasets?other=dpo). 

## Fine-tuning using ORPO

The Hugging Face TRL library supports ORPO fine-tuning through <code>ORPOTrainer()</code>, which trains language models from a preference dataset.

First, we set the hyperparameters along with the other configuration options using <code>ORPOConfig()</code>. Here is a set of hyperparameters suggested in the original paper [].

In [None]:
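from trl import ORPOConfig

orpo_args = ORPOConfig(
    output_dir="./orpo-llama3",  # hypothetical path for checkpoints and logs
    learning_rate=8e-6,  # small learning rate for stable updates
    beta=0.1,  # weight of the odds-ratio term (lambda in the original paper)
    max_length=1024,  # maximum length of prompt plus completion, in tokens
    max_prompt_length=512,  # maximum prompt length, in tokens
    optim="paged_adamw_8bit",  # paged 8-bit AdamW, to save optimizer memory
    num_train_epochs=1,  # a single pass over the dataset
)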

Then, we define a trainer and train our model:

In [None]:
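from trl import ORPOTrainer

# base_model, dataset, and tokenizer are assumed to be loaded already
trainer = ORPOTrainer(
    model=base_model,  # the (quantized) model to fine-tune
    args=orpo_args,  # the ORPO configuration defined earlier
    train_dataset=dataset["train"],  # preference data with prompt/chosen/rejected
    peft_config=peft_config,  # LoRA configuration, so only adapters are trained
    tokenizer=tokenizer,  # newer TRL releases may expect processing_class instead
)

trainer.train()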

In the above code snippet, base_model is the model we are going to fine-tune and orpo_args is the configuration defined earlier. We use the "train" split of our DPO-compatible dataset for fine-tuning, and we pass peft_config so that only the LoRA adapters are trained, which reduces memory use and training time. Finally, trainer.train() performs the fine-tuning.