# Fine-tuning Models Using the DPO Algorithm

This tutorial demonstrates how to fine-tune a large language model with the DPO algorithm, using Llama-3.1-8B-Instruct as an example. You will learn how to configure training parameters and run preference-based training on preference-labeled data with DPO to improve the model's performance on alignment tasks.

## 1. What is the DPO Algorithm?

DPO (Direct Preference Optimization) is a method for training language models to better align with human preferences. It does not rely on an explicit reward model or on policy-gradient methods. Instead, it optimizes the model directly on human preference data so that, given two responses to the same prompt, the model favors the one humans preferred.
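As a minimal sketch of the objective described in the DPO paper, the per-pair loss can be computed from four sequence log-probabilities: those of the chosen and rejected responses under the model being trained, and under a frozen reference model. The function name, argument names, and `beta=0.1` below are illustrative assumptions, not values taken from this tutorial or from `align-anything`.

```python
import math


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Per-pair DPO loss: -log(sigmoid(beta * (chosen_logratio - rejected_logratio)))."""
    # Log-ratios measure how far the policy has moved away from the reference model.
    chosen_logratio = policy_chosen_logp - ref_chosen_logp
    rejected_logratio = policy_rejected_logp - ref_rejected_logp
    margin = beta * (chosen_logratio - rejected_logratio)
    # -log(sigmoid(m)) == log(1 + exp(-m)); branch on the sign of m to avoid overflow.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


# When the policy equals the reference model, the margin is 0 and the loss is log(2).
print(dpo_loss(-10.0, -12.0, -10.0, -12.0))
# Raising the chosen response's log-probability relative to the reference lowers the loss.
print(dpo_loss(-8.0, -12.0, -10.0, -12.0))
```

In real training these log-probabilities are summed over response tokens and the loss is averaged over a batch; the scalar version here only illustrates the shape of the objective.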

## 2. Environment Setup

Before starting, install the `align-anything` package by following the steps below.

```bash
# Clone the repository
git clone git@github.com:PKU-Alignment/align-anything.git
cd align-anything

# Create a virtual environment using conda
conda create -n align-anything python==3.11
conda activate align-anything
```

- **`[Optional]`** We recommend installing [CUDA](https://anaconda.org/nvidia/cuda) in the conda environment and setting the `CUDA_HOME` environment variable.

```bash
# We have tested this version of CUDA on the H800 computing cluster and it worked well.
# You can adjust this version according to your actual computing cluster.

conda install nvidia/label/cuda-12.2.0::cuda
export CUDA_HOME=$CONDA_PREFIX
```

> If CUDA is installed in a different location (for example, `nvcc` is at `/usr/local/cuda/bin/nvcc`), set the environment variable as follows:

```bash
export CUDA_HOME="/usr/local/cuda"
```
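To confirm that `CUDA_HOME` points at a toolkit that contains `nvcc`, a small check along these lines may help. It assumes the Linux layout used in the examples above (`$CUDA_HOME/bin/nvcc`); the helper name `find_nvcc` is hypothetical.

```python
import os
from pathlib import Path


def find_nvcc(cuda_home=None):
    """Return the path to nvcc under CUDA_HOME, or None if it is not there."""
    # Fall back to the CUDA_HOME environment variable when no path is given.
    cuda_home = cuda_home or os.environ.get("CUDA_HOME")
    if not cuda_home:
        return None
    nvcc = Path(cuda_home) / "bin" / "nvcc"
    return nvcc if nvcc.is_file() else None


print(find_nvcc() or "nvcc not found under CUDA_HOME")
```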

Finally, install `align-anything` using the following command:

```bash
# We have prepared a quick installation for training and evaluation.
# If you only need to use the training or evaluation module,
# you can install the corresponding dependencies.
# The quotes keep shells such as zsh from interpreting the square brackets.
pip install -e ".[train]" # Install training dependencies
pip install -e ".[evaluate]" # Install evaluation dependencies

# If you need to install all dependencies, you can use the following command:
pip install -e ".[all]"
```
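A quick way to check that the installation succeeded is to ask Python whether it can locate the relevant modules. This is only a sketch: `align_anything` is the module name used by the training command later in this tutorial, the other names are libraries that the tutorial's code and commands rely on, and the helper `is_importable` is hypothetical.

```python
import importlib.util


def is_importable(module_name):
    """Return True if Python can locate the given top-level module."""
    return importlib.util.find_spec(module_name) is not None


# Report which of the modules used in this tutorial are available.
for name in ("align_anything", "torch", "transformers", "deepspeed"):
    print(f"{name}: {'ok' if is_importable(name) else 'missing'}")
```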


## 3. Llama-3.1-8B-Instruct Model Output Example

Before training, let's check the zero-shot behavior of the Llama-3.1-8B-Instruct model.

### 3.1 Import Required Libraries


```python
import os

# Set offline mode before importing transformers so that it is picked up at import time.
os.environ["TRANSFORMERS_OFFLINE"] = "1"
os.environ["HF_DATASETS_OFFLINE"] = "1"

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
```



### 3.2 Load the Original Llama Model


```python
device = "cuda"  # Set device to "cuda" to use GPU
model_path = (
    "/PATH/TO/YOUR/Meta-Llama-3.1-8B-Instruct"  # Please replace with your actual model path
)
model = AutoModelForCausalLM.from_pretrained(model_path, local_files_only=True).to(device)
tokenizer = AutoTokenizer.from_pretrained(model_path, local_files_only=True)

# Set the model to evaluation mode
model.eval()
```

```text
LlamaForCausalLM(
  (model): LlamaModel(
    (embed_tokens): Embedding(128256, 4096)
    (layers): ModuleList(
      (0-31): 32 x LlamaDecoderLayer(
        (self_attn): LlamaAttention(
          (q_proj): Linear(in_features=4096, out_features=4096, bias=False)
          (k_proj): Linear(in_features=4096, out_features=1024, bias=False)
          (v_proj): Linear(in_features=4096, out_features=1024, bias=False)
          (o_proj): Linear(in_features=4096, out_features=4096, bias=False)
        )
        (mlp): LlamaMLP(
          (gate_proj): Linear(in_features=4096, out_features=14336, bias=False)
          (up_proj): Linear(in_features=4096, out_features=14336, bias=False)
          (down_proj): Linear(in_features=14336, out_features=4096, bias=False)
          (act_fn): SiLU()
        )
        (input_layernorm): LlamaRMSNorm((4096,), eps=1e-05)
        (post_attention_layernorm): LlamaRMSNorm((4096,), eps=1e-05)
      )
    )
    (norm): LlamaRMSNorm((4096,), eps=1e-05)
    ... (output truncated)
```

### 3.3 Test the Performance of the Original Model

Let's test the Llama-3.1-8B-Instruct model with a sample question.


```python
messages = [
    {"role": "system", "content": "You are a helpful assistant that answers user queries."},
    {
        "role": "user",
        "content": "Recently, a wild animal in the local area has become aggressive towards humans and caused several injuries. How should I handle this wild animal?",
    },
]

input_text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer([input_text], return_tensors="pt").to(device)

# Generate new tokens without tracking gradients
with torch.no_grad():
    output = model.generate(**inputs, max_new_tokens=2048)
# Drop the prompt tokens, then decode only the newly generated tokens to text
generated_text = tokenizer.decode(
    output[0][len(inputs['input_ids'][0]) :], skip_special_tokens=True
)
print("\nGenerated Text:", generated_text)
```




Generated Text: If a wild animal in your local area has become aggressive and caused injuries, it's essential to take precautions and follow the right steps to ensure your safety and the safety of others. Here's a step-by-step guide:

1.  **Stay calm**: Keep a safe distance from the animal and avoid direct confrontation. Panicking can escalate the situation, and you don't want to provoke the animal further.

2.  **Identify the animal**: If possible, try to determine the type of animal and its size, as this information will be helpful for wildlife experts or local authorities.

3.  **Contact local authorities**: Reach out to local animal control, wildlife services, or a professional wildlife removal service. They will send trained experts to handle the situation.

4.  **Keep children and pets indoors**: Ensure that children and pets are safely indoors, away from the area where the animal is present.

5.  **Do not approach or feed the animal**: Feeding or approaching the animal can make

This output shows that, although Llama 3.1 gives a detailed response, the answer is redundant in places and does not emphasize the key risks strongly enough.

For example, it suggests "identifying the animal" without explicitly warning the reader to stay away from the dangerous area. This could lead people to approach the animal to observe it, increasing the risk of injury in an emergency.


## 4. Aligning the Model Using the DPO Algorithm

**Note**: If you cannot access huggingface.co, set the Hugging Face endpoint to hf-mirror.com. You can do this with the following command:

`export HF_ENDPOINT="https://hf-mirror.com"`

Here, we take the PKU-SafeRLHF series dataset as an example. The PKU-SafeRLHF dataset is a preference dataset focused on safety alignment. Each data entry in this dataset contains two responses to the same question, along with their corresponding safety meta-tags and preference annotations.
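To make the data format concrete, the sketch below turns one preference record into the (prompt, chosen, rejected) triple that DPO trains on. The field names `prompt`, `response_0`, `response_1`, and `better_response_id` are assumptions about the dataset schema, and the sample record is invented for illustration; check the dataset card for the actual fields. The training script delegates dataset formatting to the template named by `TRAIN_TEMPLATE`, so this code only illustrates the idea and is not part of `align-anything`.

```python
def to_preference_pair(entry):
    """Map one raw record to the (prompt, chosen, rejected) triple that DPO consumes."""
    better = entry["better_response_id"]  # assumed field: index (0 or 1) of the preferred response
    if better not in (0, 1):
        raise ValueError(f"unexpected better_response_id: {better!r}")
    return {
        "prompt": entry["prompt"],
        "chosen": entry[f"response_{better}"],
        "rejected": entry[f"response_{1 - better}"],
    }


# Hypothetical record, written for illustration only (not taken from the dataset).
example = {
    "prompt": "How should I handle an aggressive wild animal?",
    "response_0": "Try to catch it yourself.",
    "response_1": "Keep your distance and contact local animal control.",
    "better_response_id": 1,
}
print(to_preference_pair(example))
```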

You can refer to the training script below:

```bash
MODEL_NAME_OR_PATH="meta-llama/Llama-3.1-8B-Instruct" # model path

TRAIN_DATASETS="PKU-Alignment/PKU-SafeRLHF-single-dimension" # dataset path
TRAIN_TEMPLATE="PKUSafeRLHF" # dataset template
TRAIN_SPLIT="train" # dataset split to train on

OUTPUT_DIR="../outputs/llama_dpo" # output dir

# For wandb online logging
export WANDB_API_KEY="YOUR_API_KEY"

# Source the setup script
source ./setup.sh

# Execute deepspeed command
deepspeed \
     --master_port ${MASTER_PORT} \
     --module align_anything.trainers.text_to_text.dpo \
     --model_name_or_path ${MODEL_NAME_OR_PATH} \
     --train_template ${TRAIN_TEMPLATE} \
     --train_datasets ${TRAIN_DATASETS} \
     --train_split ${TRAIN_SPLIT} \
     --output_dir ${OUTPUT_DIR}
```

After training completes, the trained model weights are saved under `OUTPUT_DIR`. Section 5.1 loads the final weights from its `slice_end` subdirectory.
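To locate loadable checkpoints after a run, one option is to search the output directory for folders that contain a `config.json`, which Hugging Face `from_pretrained` expects to find next to the weights. This is a hedged sketch: the helper name `find_checkpoints` is hypothetical, and the path mirrors the `OUTPUT_DIR` from the script above.

```python
from pathlib import Path


def find_checkpoints(output_dir):
    """Return directories at or below output_dir that contain a config.json."""
    root = Path(output_dir)
    if not root.is_dir():
        return []
    # Each config.json marks a directory that from_pretrained may be able to load.
    return sorted(p.parent for p in root.rglob("config.json"))


for checkpoint in find_checkpoints("../outputs/llama_dpo"):
    print(checkpoint)
```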


## 5. Test the Performance of the Model After DPO Training

With training complete, we can test whether the model's alignment has improved.

### 5.1 Load the New Model Weights


```python
model_path = "/PATH/TO/YOUR/llama_dpo/slice_end"  # Please replace with your actual model path
model = AutoModelForCausalLM.from_pretrained(model_path, local_files_only=True).to(device)
tokenizer = AutoTokenizer.from_pretrained(model_path, local_files_only=True)

# Set the model to evaluation mode
model.eval()
```

```text
LlamaForCausalLM(
  (model): LlamaModel(
    (embed_tokens): Embedding(128257, 4096, padding_idx=128256)
    (layers): ModuleList(
      (0-31): 32 x LlamaDecoderLayer(
        (self_attn): LlamaAttention(
          (q_proj): Linear(in_features=4096, out_features=4096, bias=False)
          (k_proj): Linear(in_features=4096, out_features=1024, bias=False)
          (v_proj): Linear(in_features=4096, out_features=1024, bias=False)
          (o_proj): Linear(in_features=4096, out_features=4096, bias=False)
        )
        (mlp): LlamaMLP(
          (gate_proj): Linear(in_features=4096, out_features=14336, bias=False)
          (up_proj): Linear(in_features=4096, out_features=14336, bias=False)
          (down_proj): Linear(in_features=14336, out_features=4096, bias=False)
          (act_fn): SiLU()
        )
        (input_layernorm): LlamaRMSNorm((4096,), eps=1e-05)
        (post_attention_layernorm): LlamaRMSNorm((4096,), eps=1e-05)
      )
    )
    ... (output truncated)
```

### 5.2 Test the Performance of the New Model


```python
messages = [
    {"role": "system", "content": "You are a helpful assistant that answers user queries."},
    {
        "role": "user",
        "content": "Recently, a wild animal in the local area has become aggressive towards humans and caused several injuries. How should I handle this wild animal?",
    },
]

input_text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer([input_text], return_tensors="pt").to(device)

# Generate new tokens without tracking gradients
with torch.no_grad():
    output = model.generate(**inputs, max_new_tokens=2048)
# Drop the prompt tokens, then decode only the newly generated tokens to text
generated_text = tokenizer.decode(
    output[0][len(inputs['input_ids'][0]) :], skip_special_tokens=True
)
print("\nGenerated Text:", generated_text)
```


Generated Text: If a wild animal has become aggressive and is causing injuries, it's essential to exercise extreme caution.  First, make sure all people and pets are kept away from the area.  Next, contact a professional wildlife removal expert or local animal control service to safely capture and relocate the animal.  In the meantime, try to determine what may have caused the animal's behavior change, such as habitat loss or food availability, and take steps to mitigate those factors. Finally, educate people in the area on how to peacefully coexist with the animal and what precautions should be taken when interacting with it.


This output shows that the trained model's response is more concise and focuses on key safety measures.

Its recommendations (keep people and pets away, contact professionals, analyze the cause, and educate the public) reflect a human-centered, risk-prevention approach that minimizes direct contact with the animal, which is more consistent with the principles of safety alignment.


## 6. Acknowledgements

- [Hugging Face Transformers Documentation](https://huggingface.co/docs/transformers/index)
- [DPO Paper](https://arxiv.org/abs/2305.18290)
