# Tutorial on Direct Preference Optimization (DPO) Using the ERNIE-4.5-0.3B Large Language Model

## 1. Introduction

Welcome to this advanced tutorial on using ERNIE Kit to perform Direct Preference Optimization (DPO) on the ERNIE-4.5-0.3B model! After pre-training (PT) and supervised fine-tuning (SFT), we have an ERNIE model with strong general language capabilities that follows instructions precisely. However, correct responses are not enough: we also want them to align with human preferences, for example by being more helpful, safer, and more stylistically appealing. This usually requires a further “alignment” training stage. This tutorial shows how to use DPO to bring your ERNIE model to a higher level of alignment.

### 1.1 What is Direct Preference Optimization (DPO)?

**Direct Preference Optimization (DPO)** is a model alignment technique that is simpler and more efficient than earlier approaches. Before DPO emerged, the mainstream alignment method in the industry was Reinforcement Learning from Human Feedback (RLHF), which typically involves several complex, resource-intensive stages:

1.  **Supervised Fine-Tuning (SFT)**: First, the pre-trained large language model is fine-tuned using instructions to enable it to understand and execute various commands. This is the first step in alignment, endowing the model with foundational “capabilities.”
2.  **Training the Reward Model (RM)**: Next, a large amount of human preference ranking data over the model's outputs must be collected. For example, for the same question, the model generates answers A, B, and C, and annotators indicate which is best, which is second-best, and which is worst. This ranking data is used to train a separate “reward model.” The reward model's task is to mimic human judgment by scoring any given answer, with higher scores indicating stronger human preference.
3.  **Reinforcement Learning (RL)**: Finally, the model from the SFT stage is used as an “agent” and trained using reinforcement learning under the guidance of the reward model (typically using the Proximal Policy Optimization (PPO) algorithm). The agent attempts to generate new answers, which are scored by the reward model. The agent continuously adjusts its strategy based on the scores to maximize rewards, thereby making its outputs increasingly aligned with human preferences.

Although RLHF has proven highly effective, its process is lengthy, implementation is complex, training is unstable, and it requires maintaining and coordinating multiple models (policy model, reward model, value model, reference model, etc.), placing significant demands on both technical expertise and computational resources.

**The core idea of DPO lies in its “directness”—it cleverly bypasses the need for explicit training of the reward model and complex reinforcement learning steps.** DPO posits that the language model itself can implicitly represent a reward function. It directly utilizes paired human preference data (e.g., given a prompt, response A is better than response B) and optimizes the language model directly through a uniquely designed loss function. The objective of this loss function is to increase the model's probability of generating “preferred” responses while decreasing its probability of generating “unpreferred” responses. Essentially, DPO transforms a complex reinforcement learning problem into a more manageable supervised learning problem with a specific objective.

**Key differences between DPO and RLHF/SFT:**

| Feature | Supervised fine-tuning (SFT) | Traditional RLHF (RM+PPO) | Direct preference optimization (DPO) |
| :--- | :--- | :--- | :--- |
| **Core Objective** | Learn to follow instructions and master specific task solutions | Learn to generate responses aligned with human preferences through reward model signals | Directly learn to generate responses aligned with human preferences based on preference data |
| **Required Data** | “Instruction-Ideal Output” pairs | “Instruction-Multiple Outputs-Human Ranking” (for RM), plus SFT data | “Instruction-preferred output-non-preferred output” triplets |
| **Learning Paradigm** | Supervised learning | RM: Supervised learning; PPO: Reinforcement learning | Supervised learning (preferences reflected through a special loss function) |
| **Implementation Complexity** | Relatively simple and direct | Very high (multi-stage training involving complex RL algorithms and hyperparameters) | Moderate (single-stage policy learning, but data format requires specific constraints) |
| **Number of Models Required** | 1 (policy model) | At least 2 core models (policy model, reward model), often accompanied by a value model and reference model | 1 core model (policy model), typically using a frozen SFT model as a reference |

In summary, while SFT teaches the model “what it can do,” DPO (and RLHF) teach the model “how to do it better, more in line with human expectations, and more safely and reliably.” DPO provides a more direct, concise, and stable technical path to achieve this goal.

### 1.2 Application Scenarios and Value of DPO

As an alignment technique, DPO applies to a broad range of scenarios. The table below groups them by goal:

|  | Application Area | Core Value and Far-Reaching Significance |
| :---: | :--- | :--- |
| 💡 | **Improving Answer Quality (Helpfulness)** | By finely learning human preferences, the model can generate more accurate, information-rich, and logically rigorous answers, transforming from an “instruction executor” into a true “intelligent assistant.” |
| 🛡️ | **Enhancing Model Safety (Harmlessness)** | By learning the boundaries between harmless and harmful content, the model can effectively suppress the generation of inappropriate, biased, or dangerous statements, ensuring its outputs align with social norms and ethical standards. |
| 🎨 | **Precise Style Control** | Based on specific preference data, the model can be guided to adopt specific communication styles, such as polite empathy in customer service scenarios, concise clarity in coding scenarios, and creative freedom in creative scenarios. |
| 🚀 | **Streamlined Alignment Process (Efficiency)** | As an efficient alternative to RLHF, DPO lowers the technical barriers and computational costs of achieving high-quality alignment, making it suitable for teams and individuals with limited resources to iterate quickly. |
| 🔬 | **Driving Cutting-Edge Innovation (Innovation)** | The success of DPO has inspired the development of a series of new algorithms such as IPO, KTO, SimPO, and ORPO, serving as the foundation for understanding and keeping pace with the latest trends in large-scale model technology. |

### 1.3 Why choose ERNIE-4.5-0.3B for DPO?

|  | Reason | Detailed explanation |
| :---: | :--- | :--- |
| 🏆 | **Exceptional SFT Foundation** | `ERNIE-4.5-0.3B` is built on ERNIE 4.5 (Wenxin 4.5) technology and has undergone high-quality SFT, which gives it robust instruction-following capabilities and world knowledge and provides a solid foundation for the “cherry on top” of DPO. |
| 💻 | **Appropriate Parameter Scale** | The 0.3B parameter count maintains strong capabilities while keeping computational resource requirements manageable, making it ideal for individual developers, academic researchers, and small and medium-sized enterprises to conduct DPO experiments and rapid iterations. |
| 🛠️ | **Full ERNIE Kit Support** | ERNIE Kit offers an end-to-end DPO solution (`run_dpo.py`) from data processing to training execution, enabling developers to efficiently apply DPO without needing to implement complex algorithms from scratch. |

### 1.4 Objectives and Benefits of This Tutorial

This tutorial is designed for developers and researchers who wish to advance their model alignment skills. By completing this tutorial, you will gain the following:

|  | Learning Outcomes | Description |
| :---: | :--- | :--- |
| 🧠 | **Deep Understanding of DPO Principles** | Gain a thorough understanding of the core concepts, mathematical principles, practical value, and fundamental differences between DPO and SFT/RLHF. |
| 🛠️ | **Proficient Mastery of the ERNIE Kit DPO Workflow** | Master the entire process from environment setup, data preparation, to model training and evaluation. |
| 📊 | **Expertise in DPO Data Processing** | Learn to process preference data formats (prompt, chosen, rejected) that meet ERNIE Kit requirements. |
| ⚙️ | **Flexible Configuration and Practice** | Learn to modify DPO configuration files, start, monitor, and debug training tasks, and evaluate optimized models. |

Now, let’s embark on this exciting journey together and explore how to use DPO technology to refine the powerful ERNIE model into a smarter, safer, and more human-centric AI!

## 2. Environment Preparation

Like SFT, Direct Preference Optimization (DPO) requires a well-configured development environment. This includes installing the core deep learning framework PaddlePaddle and the ERNIE Kit toolkit, and obtaining the latest source code so that the most recent DPO features are available. We assume you already have practical experience with SFT, so the environment setup process should be familiar. This section nevertheless provides detailed guidance to ensure everything is set up correctly.

### 2.1 Installing PaddlePaddle and ERNIE Kit

First, we need to ensure that we have installed the correct versions of PaddlePaddle and ERNIE Kit. For optimal training performance and compatibility, we strongly recommend using the GPU version of PaddlePaddle.

*If you are running on AI Studio, you do not need to run the code block below.*

In [None]:
# Step 1: Ensure that your pip tool is the latest version to avoid potential installation issues.
!python -m pip install --upgrade pip

# Step 2: Install the GPU version of PaddlePaddle.
# The following command is applicable to environments with CUDA 11.8 or higher. If your CUDA version is different,
# please visit the PaddlePaddle official website to obtain the exact installation instructions: https://www.paddlepaddle.org.cn/install/quick
!python -m pip install paddlepaddle-gpu -i https://mirror.baidu.com/pypi/simple

In [None]:
# Step 3: Install ERNIE Kit
# ERNIE Kit offers two installation methods. We recommend installing from source code to ensure you can use the latest DPO features and bug fixes.
# First, you need to clone the ERNIE Kit repository from GitHub.
# !git clone https://github.com/PaddlePaddle/ERNIE.git
# Then, install with pip in editable mode (-e) so that any changes you make to the code take effect immediately.
# Replace the following path with the actual path where you cloned the ERNIE Kit repository.
!python -m pip install -e ./ERNIE-develop

**Post-installation verification**:

Execute the following code to check whether PaddlePaddle and ERNIE Kit have been successfully installed and confirm that the GPU environment is available.

In [None]:
import paddle

print(f"PaddlePaddle Version: {paddle.__version__}")

# Run PaddlePaddle's built-in check tool, which will provide detailed environment information.
try:
    paddle.utils.run_check()
    if paddle.device.cuda.device_count() > 0:
        print(f"[SUCCESS] PaddlePaddle GPU is available! Found {paddle.device.cuda.device_count()} GPU(s).")
    else:
        print("[WARNING] PaddlePaddle GPU check passed, but no GPU found. Training will proceed on CPU, which will be very slow.")
except Exception as e:
    print(f"[ERROR] PaddlePaddle GPU check failed: {e}")
    print("If you intended to use GPU, please carefully check your CUDA environment, NVIDIA driver, and the PaddlePaddle version you installed.")

PaddlePaddle Version: 3.1.0
Running verify PaddlePaddle program ... 
PaddlePaddle works well on 1 GPU.
PaddlePaddle is installed successfully! Let's start deep learning with PaddlePaddle now.
[SUCCESS] PaddlePaddle GPU is available! Found 1 GPU(s).


I0718 09:54:48.481559  9362 pir_interpreter.cc:1524] New Executor is Running ...
W0718 09:54:48.483019  9362 gpu_resources.cc:114] Please NOTE: device: 0, GPU Compute Capability: 8.0, Driver API Version: 12.8, Runtime API Version: 12.6
I0718 09:54:48.483794  9362 pir_interpreter.cc:1547] pir interpreter is running by multi-thread mode ...


### 2.2 Download the ERNIE Kit repository code

The core DPO script of ERNIE Kit (`examples/long_text/run_dpo.py` or similar path) and all related model configuration files are located in its GitHub repository. Even if you have installed ERNIE Kit via pip, we strongly recommend that you clone the entire code repository, as all experiments and training will be based on this code.

In [None]:
# If you haven't cloned yet, run this command
# git clone https://github.com/PaddlePaddle/ERNIE.git

# After cloning, you will have a folder named ERNIE-develop (or ERNIE)
# All subsequent commands will assume that you are in the root directory of this folder
# For example: cd ./ERNIE-develop
%cd ./ERNIE-develop

/home/aistudio/ERNIE-develop


## 3. DPO Data Preparation

Unlike the “instruction-ideal output” data format used in the SFT stage, the essence of Direct Preference Optimization (DPO) lies in utilizing human **preference judgments** between different model outputs. DPO training data therefore records these paired preference choices directly, which is the basis for its “direct” optimization.

### 3.1 Characteristics and Format of DPO Data

The basic unit of DPO training data is a triplet consisting of: **“Prompt”**, **“Chosen Response”**, and **“Rejected Response”**. This data structure explicitly informs the model which type of response aligns more closely with human expectations when presented with the same prompt, and which does not.

**Conceptual DPO data format:**

DPO datasets are commonly stored as **JSON Lines (jsonl) files**, a widely used format in which each line is an independent, valid JSON object that corresponds to one preference sample. Conceptually, each sample carries the triplet shown below. Note that the sample files shipped with ERNIE Kit encode the same information with the `src`, `response`, and `sort` fields, as the data-viewing code later in this chapter shows:

```json
{
    "prompt": "Please explain what artificial intelligence is to me.",
    "chosen": "Artificial intelligence (AI) is a branch of computer science dedicated to creating machines that can mimic, extend, and surpass human intelligence. It encompasses multiple fields such as machine learning, natural language processing, and computer vision, aiming to enable computers to learn, reason, perceive, and solve problems like humans. For example, voice assistants on our smartphones, self-driving cars, and recommendation systems are all applications of artificial intelligence in daily life.",
    "rejected": "Artificial intelligence is just robots."
}
```

**Field-by-field breakdown**:

* `prompt`: `(string)`, required. This is the user instruction, question, or preceding dialogue input to the model. It serves as the context for preference judgments and the starting point for all comparisons.
* `chosen`: `(string)`, required. This is the response to `prompt` that was labeled by humans as **more preferred** or **higher quality**. It represents the direction the model should learn and emulate.
* `rejected`: `(string)`, required. This is the response to the `prompt` that was labeled by humans as **less preferred** or **lower quality**. It represents the direction the model should avoid and suppress.

**Why this “prompt-chosen-rejected” triplet format?**

The DPO loss function requires a direct comparison of the probabilities (or implicit preference scores) assigned by the model to the `chosen` and `rejected` responses. Therefore, each training data point must clearly provide this paired comparison of (chosen, rejected) outputs, along with the input prompt `prompt` they correspond to. This structure allows the loss function to directly calculate preference differences and update model parameters accordingly.
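As an illustration of the triplet requirement, the following minimal sketch checks that one JSONL line carries all three fields and that the two responses actually differ. The function name `validate_preference_line` and the specific checks are assumptions made for this example; they are not part of ERNIE Kit.

```python
import json

REQUIRED_FIELDS = ("prompt", "chosen", "rejected")


def validate_preference_line(line: str) -> dict:
    """Parse one JSONL line and check that it is a usable preference triplet."""
    record = json.loads(line)
    if not isinstance(record, dict):
        raise ValueError("each line must be a JSON object")
    for field in REQUIRED_FIELDS:
        value = record.get(field)
        if not isinstance(value, str) or not value.strip():
            raise ValueError(f"field '{field}' must be a non-empty string")
    if record["chosen"].strip() == record["rejected"].strip():
        # Identical responses give the loss nothing to compare.
        raise ValueError("chosen and rejected are identical")
    return record


example_line = json.dumps({
    "prompt": "Please explain what artificial intelligence is to me.",
    "chosen": "Artificial intelligence (AI) is a branch of computer science.",
    "rejected": "Artificial intelligence is just robots.",
})
record = validate_preference_line(example_line)
print(sorted(record))  # ['chosen', 'prompt', 'rejected']
```

Running a check like this over a whole file before training makes malformed lines fail early instead of partway through a training run.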

**Key elements of high-quality DPO data:**

Building or selecting a high-quality DPO dataset is critical to success. A good dataset should have the following characteristics:

*   **Clear and meaningful preference differences**: There should be clear, distinguishable quality differences between `chosen` and `rejected` responses. These differences should accurately reflect the specific preferences you want the model to learn, such as:
    *   **Factual accuracy**: `chosen` responses are more accurate, while `rejected` responses contain factual errors.
    *   **Harmlessness**: `chosen` responses are safe and polite, while `rejected` responses contain harmful, offensive, or biased content.
    *   **Helpfulness and detail**: `chosen` responses are more comprehensive and helpful, while `rejected` responses are too brief or off-topic.
    *   **Instruction adherence**: `chosen` responses strictly adhere to all constraints in the instructions (e.g., format, role), while `rejected` responses do not.
    *   **Style preference**: `chosen` responses match a specific style (e.g., professional, humorous), while `rejected` responses do not.
*   **Prompt diversity**: The `prompt` should broadly cover various types of instructions, topics, and difficulty levels that the model may encounter in real-world applications to ensure that the preferences learned by the model have good generalization capabilities.
*   **Authenticity and challenge of responses**: `chosen` and `rejected` responses should ideally be the model's actual outputs from the SFT phase, or carefully constructed examples that represent typical model errors or potential areas for improvement. The difference between the two should not be overly extreme (e.g., a perfect answer vs. a completely irrelevant answer); moderate difficulty better stimulates the model's learning potential.
*   **Consistent annotation standards**: If data is annotated by multiple people, a unified and clear set of preference judgment criteria must be established and followed to ensure the reliability of data quality.
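Some of these criteria can be screened automatically before any human review. The sketch below flags pairs with identical responses, very short responses, extreme length gaps, or repeated prompts. The function name `flag_weak_pairs` and both thresholds are hypothetical choices for illustration only; suitable values depend on your data, and such heuristics do not replace reading the samples.

```python
def flag_weak_pairs(samples, min_chars=10, max_length_ratio=20.0):
    """Return (index, reason) tuples for preference pairs that deserve a manual look."""
    issues = []
    seen_prompts = set()
    for index, sample in enumerate(samples):
        chosen = sample["chosen"].strip()
        rejected = sample["rejected"].strip()
        shorter, longer = sorted((len(chosen), len(rejected)))
        if chosen == rejected:
            issues.append((index, "identical responses"))
        elif shorter < min_chars:
            issues.append((index, "very short response"))
        elif longer > max_length_ratio * shorter:
            issues.append((index, "extreme length gap"))
        # Repeated prompts reduce prompt diversity.
        if sample["prompt"] in seen_prompts:
            issues.append((index, "duplicate prompt"))
        seen_prompts.add(sample["prompt"])
    return issues


samples = [
    {"prompt": "Explain DPO.",
     "chosen": "DPO optimizes a policy directly on preference pairs.",
     "rejected": "DPO is a kind of database."},
    {"prompt": "Explain DPO.",
     "chosen": "Same text here.",
     "rejected": "Same text here."},
    {"prompt": "Say hi.",
     "chosen": "Hello there, nice to meet you!",
     "rejected": "Hi"},
]
print(flag_weak_pairs(samples))
# [(1, 'identical responses'), (1, 'duplicate prompt'), (2, 'very short response')]
```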

### 3.2 Using the Sample DPO Dataset Included in ERNIE Kit

To help users get started quickly, the ERNIE Kit project has included a small sample DPO dataset in the `examples/data/` directory. This dataset can be used directly to run the DPO training script in the tutorial without downloading it from the internet.

**Dataset files:**

*   `examples/data/dpo-train.jsonl`: Preference dataset used for DPO training.
*   `examples/data/dpo-eval.jsonl`: Preference dataset used to evaluate model performance during training.

Both files follow the JSON Lines format. Each line stores the prompt in the `src` field, the two candidate responses in the `response` field, and their preference order in the `sort` field.

**Viewing data samples:**

Assuming your current working directory is the root directory of ERNIE Kit, you can use the following code to view the contents of the training data and gain an intuitive understanding of its structure and quality.

In [None]:
import os
import json

# ERNIE Kit's built-in DPO training data path
dpo_train_file = "./examples/data/dpo-train.jsonl"

if os.path.exists(dpo_train_file):
    print(f"--- View the included DPO training data samples: {dpo_train_file} ---")
    with open(dpo_train_file, 'r', encoding='utf-8') as f:
        for i, line in enumerate(f):
            if i < 3: # Only view the first 3 samples
                sample = json.loads(line.strip())
                print(f"\n--- sample {i+1} ---")
                # In the bundled files, 'src' is a list of user turns; the first element is used as the prompt here.
                print(f"Prompt: {sample['src'][0]}")
                # 'response' holds two candidate answers, each stored as a list, so the printed values appear in brackets.
                # A 'sort' value of [1, 0] marks the first answer as chosen and the second as rejected.
                print(f"\nChosen: {sample['response'][0]}")
                print(f"\nRejected: {sample['response'][1]}")
                print("-" * 20)
            else:
                break
else:
    print(f"[ERROR] DPO training data file not found: {dpo_train_file}. Please confirm that your working directory is the ERNIE Kit root directory and that the file actually exists.")


--- View the included DPO training data samples: ./examples/data/dpo-train.jsonl ---

--- sample 1 ---
Prompt: 请写一份关于支撑航空和飞行的物理基本原理的全面解释，包括升力、推力、重量和阻力等主题，以及伯努利定律、牛顿运动定律和空气性质等关键科学概念。使用清晰简洁的语言，并提供详细的例子和插图，以帮助读者理解这些概念。考虑塑造航空业的历史和技术发展，以及物理学在推动飞行能力方面发挥的作用。

Chosen: ['飞行物理学：航空原理全面指南\n\n**介绍**\n\n理解支撑航空和飞行的物理原理对于理解飞机在空中飞行的机制至关重要。在本指南中，我们将探讨飞行的基本因素，包括升力、推力、重量和阻力，并探索关键的科学概念，如伯努利定理、牛顿运动定律和空气的特性。通过使用清晰、简明的语言，并提供详细的例子和插图，本指南旨在使航空和飞行的概念对读者易于理解和吸引。\n\n**1. 飞行的四个力量**\n\n首先，认识到在飞行过程中作用于飞机的四个关键力量是很重要的：升力、推力、重量和阻力。这些力量之间的微妙平衡是飞机保持空中飞行和维持其所需飞行路径所必需的。\n\n1.1. 升力：升力是抵抗飞机重量并支撑其在空中的力量。升力是通过操纵飞机机翼周围的气压分布来产生的。机翼的形状，称为翼型，是为了优化升力而设计的。升力主要是通过伯努利定理和牛顿运动定律来解释的，我们稍后将详细探讨。\n\n1.2. 推力：推力是由飞机发动机产生的向前的力量，推动飞机穿过空气。喷气发动机、螺旋桨发动机甚至火箭发动机都可以提供必要的推力来克服阻力，这是另一个控制飞行的关键因素。\n\n1.3. 重量：重量是由于飞机质量所产生的重力。它向下作用并抵抗升力。为了保持水平飞行，飞机必须产生足够的升力来抵消其重量。\n\n1.4. 阻力：阻力是阻碍飞机在空气中运动的力量。它主要由两种类型组成：形状阻力，由于飞机的形状而产生，以及皮肤摩擦阻力，由于空气与飞机表面之间的摩擦而产生。最小化阻力对于实现高效飞行至关重要。\n\n**2. 伯努利定理和升力**\n\n伯努利定理是流体动力学中的基本概念，对于理解升力至关重要。该定理指出，当流体（在这种情况下是空气）的速度增加时，其压力降低。飞机机翼采

Running the code above prints the first few samples and shows how each record stores the prompt (`src`), the preferred and non-preferred responses (`response`), and their order (`sort`). This helps you understand the DPO training mechanism and serves as a reference when you build your own dataset. With the data ready, we can look at how it is preprocessed for training.

### 3.3 Key Points of Data Preprocessing and Tokenization

Similar to SFT and the pretraining stage, text data must be converted into a numerical sequence (Token IDs) that the model can understand before DPO training. The `run_dpo.py` script in ERNIE Kit automatically handles complex preprocessing logic in the background, but understanding its core mechanisms is crucial:

1.  **Tokenization**: The script uses a Tokenizer that is fully compatible with your base model (ERNIE-4.5-0.3B, typically the version after SFT). For each sample, the `prompt`, `chosen`, and `rejected` text segments are tokenized independently.

2.  **Sequence Construction and Loss Calculation**:
    *   The core of DPO is comparing the model's preference for the `chosen` and `rejected` responses. This is typically achieved by calculating the model's **log-probabilities** on these two response sequences.
    *   To calculate these probabilities, the script constructs two input sequences: one is the concatenation of `prompt` and `chosen` (`prompt + chosen`), and the other is the concatenation of `prompt` and `rejected` (`prompt + rejected`).
    *   **Intelligent handling of labels**: When calculating the loss, we are only concerned with the model's predictive ability for the **response part**, while the `prompt` part is just context and should not be included in the loss. This is achieved by **masking** the tokens in the non-response part of the `labels` (typically set to -100).
        *   For the `prompt + chosen` sequence, the corresponding `labels` have the `prompt` part masked, with only the `chosen` part being the true token ID.
        *   Similarly, for the `prompt + rejected` sequence, the `labels` have the `prompt` part masked, with only the `rejected` part being the true token ID.
    *   **DPO loss function**: Ultimately, the DPO loss function uses the difference in log probabilities calculated by the model on these two sets of sequences to guide the update of model parameters, with the goal of making the probability of the `chosen` response higher than that of the `rejected` response.

3.  **Length control: truncation and padding**:
    * In practical applications, text lengths vary. The script uses the `max_length` (or similar) parameter to handle this uniformly. Sequences that are too long (`prompt + response`) are truncated, while shorter ones are padded with padding tokens to reach the same length, enabling batch training.

4.  **Role of the Reference Model**:
    *   The standard DPO algorithm introduces a “reference model,” which is typically the model at the end of the SFT phase. During DPO training, its parameters are **completely frozen and not updated**.
    *   When calculating the log probabilities of the `chosen` and `rejected` responses, the DPO loss function subtracts the log probability of the reference model for the same response. This can be understood as a form of **regularization** aimed at preventing the policy model (i.e., the model we are training) from deviating too far from the general language capabilities it learned during the SFT phase by overly accommodating preference data, thereby causing “catastrophic forgetting.” It ensures that the model does not lose its foundational ability to “speak human language” while learning “preferences.”
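The sequence construction, label masking, truncation, and padding steps above can be sketched in a few lines of plain Python. This is a simplified illustration, not ERNIE Kit's actual implementation: the token IDs are made up, `build_example` is a hypothetical helper, truncation is assumed to cut from the right, and the one-position label shift that real next-token training applies is omitted for clarity.

```python
IGNORE_INDEX = -100  # label value that the loss skips


def build_example(prompt_ids, response_ids, max_length, pad_id=0):
    """Concatenate prompt and response, mask the prompt in labels, truncate, then pad."""
    input_ids = (prompt_ids + response_ids)[:max_length]
    # Only response tokens keep their real IDs as labels.
    labels = ([IGNORE_INDEX] * len(prompt_ids) + response_ids)[:max_length]
    pad = max_length - len(input_ids)
    return {
        "input_ids": input_ids + [pad_id] * pad,
        "labels": labels + [IGNORE_INDEX] * pad,  # padding is masked too
        "attention_mask": [1] * len(input_ids) + [0] * pad,
    }


prompt_ids = [11, 12, 13]  # made-up token IDs for the shared prompt
chosen = build_example(prompt_ids, [21, 22], max_length=8)
rejected = build_example(prompt_ids, [31, 32, 33, 34], max_length=8)

print(chosen["input_ids"])  # [11, 12, 13, 21, 22, 0, 0, 0]
print(chosen["labels"])     # [-100, -100, -100, 21, 22, -100, -100, -100]
print(rejected["labels"])   # [-100, -100, -100, 31, 32, 33, 34, -100]
```

Both sequences share the same prompt prefix and end up with the same length, which is what allows the `chosen` and `rejected` variants to be batched together.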

Fortunately, as users, we typically only need to correctly specify the dataset path, SFT model path, and related sequence length and DPO core hyperparameters (such as `beta`) in the configuration file. The `run_dpo.py` script in ERNIE Kit handles all the complex details of tokenization, sequence concatenation, label masking, and loss calculation for us.

At this point, we have a high-quality, properly formatted preference dataset and a deep understanding of the underlying processing logic. This is the solid foundation for successful DPO training.

### 3.4 ERNIE Kit DPO Data Format

Based on our analysis of the `ERNIE-develop` project, the DPO script in ERNIE Kit expects data in the form of **JSON Lines (jsonl) files**, where each line is a JSON object representing a preference sample. A typical sample structure is as follows:

```json
{
    "system": "System prompt",
    "src": ["User Question 1", "User Question 2"],
    "tgt": ["Model's response to User Question 1"],
    "response": [
        ["Preferred response to User Question 2"],
        ["Non-preferred response to User Question 2"]
    ],
    "sort": [1, 0]
}
```

**Field explanations**:

*   `system`: (optional) `str`. System-level prompt used to set the model's role or behavioral guidelines.
*   `src`: `List[str]`. User input sequence. In multi-round conversations, this will be a list containing multiple user inputs.
* `tgt`: `List[str]`. The model's responses in the previous rounds of dialogue.
* `response`: `List[List[str]]`. A list containing two lists. The first list is the **chosen response** (preferred response), and the second list is the **rejected response** (non-preferred response).
* `sort`: `List[int]`. A list containing two integers, `[1, 0]`, indicating that the first response in `response` is preferred and the second is not preferred.
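If your own data is stored as simple (prompt, chosen, rejected) triplets, a small converter can produce records in the structure described above. The sketch below covers the single-turn case only; the helper name `to_erniekit_record` is hypothetical, and the field layout is inferred from the field explanations in this section rather than from an official schema.

```python
import json


def to_erniekit_record(prompt, chosen, rejected, system=None):
    """Build a single-turn preference record with the src/tgt/response/sort fields."""
    record = {
        "src": [prompt],   # one user turn
        "tgt": [],         # no earlier model turns in a single-turn sample
        "response": [[chosen], [rejected]],
        "sort": [1, 0],    # first response preferred, second not preferred
    }
    if system is not None:
        record["system"] = system
    return record


record = to_erniekit_record(
    prompt="Please explain what artificial intelligence is to me.",
    chosen="Artificial intelligence (AI) is a branch of computer science.",
    rejected="Artificial intelligence is just robots.",
)
# One JSON object per line; ensure_ascii=False keeps non-ASCII text readable.
print(json.dumps(record, ensure_ascii=False))
```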

ERNIE Kit provides a sample dataset, which we can use as a reference to prepare our own data.

### 3.5 Example of Using the DPO Dataset

ERNIE Kit provides `dpo-train.jsonl` and `dpo-eval.jsonl` as examples in the `examples/data/` directory. We can use these data directly for DPO training.

**Viewing Data Samples:**

In [None]:
import json

# Assuming you are in the root directory of the ERNIE project
dpo_train_file = "./examples/data/dpo-train.jsonl"

with open(dpo_train_file, 'r', encoding='utf-8') as f:
    for i, line in enumerate(f):
        if i < 2:
            sample = json.loads(line.strip())
            print(json.dumps(sample, indent=4, ensure_ascii=False))
        else:
            break

{
    "src": [
        "请写一份关于支撑航空和飞行的物理基本原理的全面解释，包括升力、推力、重量和阻力等主题，以及伯努利定律、牛顿运动定律和空气性质等关键科学概念。使用清晰简洁的语言，并提供详细的例子和插图，以帮助读者理解这些概念。考虑塑造航空业的历史和技术发展，以及物理学在推动飞行能力方面发挥的作用。"
    ],
    "tgt": [],
    "response": [
        [
            "飞行物理学：航空原理全面指南\n\n**介绍**\n\n理解支撑航空和飞行的物理原理对于理解飞机在空中飞行的机制至关重要。在本指南中，我们将探讨飞行的基本因素，包括升力、推力、重量和阻力，并探索关键的科学概念，如伯努利定理、牛顿运动定律和空气的特性。通过使用清晰、简明的语言，并提供详细的例子和插图，本指南旨在使航空和飞行的概念对读者易于理解和吸引。\n\n**1. 飞行的四个力量**\n\n首先，认识到在飞行过程中作用于飞机的四个关键力量是很重要的：升力、推力、重量和阻力。这些力量之间的微妙平衡是飞机保持空中飞行和维持其所需飞行路径所必需的。\n\n1.1. 升力：升力是抵抗飞机重量并支撑其在空中的力量。升力是通过操纵飞机机翼周围的气压分布来产生的。机翼的形状，称为翼型，是为了优化升力而设计的。升力主要是通过伯努利定理和牛顿运动定律来解释的，我们稍后将详细探讨。\n\n1.2. 推力：推力是由飞机发动机产生的向前的力量，推动飞机穿过空气。喷气发动机、螺旋桨发动机甚至火箭发动机都可以提供必要的推力来克服阻力，这是另一个控制飞行的关键因素。\n\n1.3. 重量：重量是由于飞机质量所产生的重力。它向下作用并抵抗升力。为了保持水平飞行，飞机必须产生足够的升力来抵消其重量。\n\n1.4. 阻力：阻力是阻碍飞机在空气中运动的力量。它主要由两种类型组成：形状阻力，由于飞机的形状而产生，以及皮肤摩擦阻力，由于空气与飞机表面之间的摩擦而产生。最小化阻力对于实现高效飞行至关重要。\n\n**2. 伯努利定理和升力**\n\n伯努利定理是流体动力学中的基本概念，对于理解升力至关重要。该定理指出，当流体（在这种情况下是空气）的速度增加时，其压力降低。飞机机翼采用翼型形状，使空气在机翼上方移动得比下方快。这导致上表面的压力降

This will print out the first two samples in the dataset, helping us understand its specific structure and content.
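For inspection or analysis it can be handy to flatten such a record back into a plain triplet. The following sketch assumes that the larger value in `sort` marks the preferred response and that each entry of `response` is a one-element list, as in the printed sample; `to_triplet` is a hypothetical helper, not an ERNIE Kit function.

```python
def to_triplet(record: dict) -> dict:
    """Flatten a src/tgt/response/sort record into prompt, chosen, and rejected."""
    src = record["src"]
    tgt = record.get("tgt", [])
    first, second = record["response"]
    score_first, score_second = record["sort"]
    if score_first == score_second:
        raise ValueError("sort does not express a preference")
    if score_first > score_second:
        chosen, rejected = first, second
    else:
        chosen, rejected = second, first
    return {
        "history": list(zip(src[:-1], tgt)),  # earlier (user, model) turns
        "prompt": src[-1],                    # the turn being answered
        "chosen": chosen[0],
        "rejected": rejected[0],
    }


record = {
    "system": "System prompt",
    "src": ["User Question 1", "User Question 2"],
    "tgt": ["Model's response to User Question 1"],
    "response": [
        ["Preferred response to User Question 2"],
        ["Non-preferred response to User Question 2"],
    ],
    "sort": [1, 0],
}
triplet = to_triplet(record)
print(triplet["prompt"])  # User Question 2
print(triplet["chosen"])  # Preferred response to User Question 2
```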

## 4. Start DPO Training

Now that we have prepared the environment, model, and data, it is time to dive into the core of DPO training. This chapter will first analyze DPO in depth from a theoretical perspective, and then provide detailed guidance on how to use the powerful features of ERNIE Kit to configure and start training tasks.

### 4.1 In-depth Analysis of DPO Principles

Understanding how DPO works helps us better adjust parameters and analyze results. The core idea of DPO is to **directly convert human preferences into an optimizable loss function**, thereby bypassing the complex reward model training and reinforcement learning process in RLHF.

#### 4.1.1 From Preference to Policy: The Core Idea of DPO

1.  **Implicit Reward Model**: DPO cleverly assumes that there is an unknown, ideal reward model `r(x, y)` that can score any “prompt-response” pair `(x, y)`. Human preference data (i.e., which response is better) reflects the outcomes of this reward model. Specifically, if for prompt `x`, response `y_w` (chosen) is preferred over `y_l` (rejected), then we assume `r(x, y_w) > r(x, y_l)`.

2.  **Bradley-Terry model**: DPO borrows the Bradley-Terry model to model this preference probability. The model states that the probability `p*(y_w > y_l | x)` that humans prefer `y_w` over `y_l` can be expressed as:

    $p^*(y_w > y_l | x) = \frac{\exp(r(x, y_w))}{\exp(r(x, y_w)) + \exp(r(x, y_l))} = \sigma(r(x, y_w) - r(x, y_l))$
   

    where `σ` is the Sigmoid function. This formula intuitively represents that the greater the difference in reward scores between two answers, the closer the probability of human preference for one of them approaches 1.

3.  **Connecting the language model with the reward**: This is the most ingenious step in DPO. It establishes an analytical relationship between the language model policy `π_θ` (i.e., the model we are training) and the aforementioned implicit reward model `r`. Starting from the KL-constrained reward-maximization objective used in RLHF, it can be shown that the reward function can be expressed as:

    $r(x, y) = \beta \log \frac{\pi_\theta(y|x)}{\pi_{\text{ref}}(y|x)} + \beta \log Z(x)$

    Here:
    * `π_θ(y|x)` is the probability that the current policy model generates the response `y`.
    * `π_ref(y|x)` is the probability that the reference model (typically the model after SFT) generates the response `y`.
    * `Z(x)` is a normalization term that depends only on the prompt `x`. Because the Bradley-Terry model uses only the difference between two rewards for the same prompt, this term cancels and never needs to be computed.
    * `β` is a hyperparameter that controls how far the policy model may diverge from the reference model. A larger `β` keeps the policy closer to the reference model, while a smaller `β` lets it follow the preference data more aggressively. It balances the two objectives of “accommodating preferences” and “maintaining language capability.”
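To make the three steps above concrete, here is a minimal numeric sketch in plain Python. It turns the gap between policy and reference log-probabilities into an implicit reward and feeds the reward difference into the Bradley-Terry sigmoid. The function names, the log-probability values, and `beta = 0.1` are illustrative assumptions, not values taken from ERNIE Kit.

```python
import math


def implicit_reward(policy_logp: float, ref_logp: float, beta: float) -> float:
    """beta * log(pi_theta(y|x) / pi_ref(y|x)), computed from sequence log-probabilities."""
    return beta * (policy_logp - ref_logp)


def preference_probability(reward_w: float, reward_l: float) -> float:
    """Bradley-Terry probability that y_w is preferred over y_l."""
    return 1.0 / (1.0 + math.exp(-(reward_w - reward_l)))


# Hypothetical sequence log-probabilities (sums of response-token log-probs).
beta = 0.1
r_w = implicit_reward(policy_logp=-20.0, ref_logp=-30.0, beta=beta)  # policy favors y_w more than the reference
r_l = implicit_reward(policy_logp=-40.0, ref_logp=-30.0, beta=beta)  # policy favors y_l less than the reference

print(preference_probability(r_w, r_l))          # sigmoid(2.0), about 0.881
print(preference_probability(r_w + 5, r_l + 5))  # unchanged: an offset shared by both rewards cancels
```

The second call shows why any term that depends only on the prompt drops out of the preference probability: only the reward difference matters.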

#### 4.1.2 DPO Loss Function: Intuitive Understanding and Mathematical Expression

By substituting the above reward function expression into the Bradley-Terry model, we can represent human preference probabilities using language model probabilities. Then, by maximizing the log-likelihood of these preference probabilities, we ultimately obtain the DPO loss function:

$\mathcal{L}_{\text{DPO}}\left(\pi_\theta; \pi_{\text{ref}}\right) = -\mathbb{E}_{\left(x, y_w, y_l\right) \sim \mathcal{D}} \left[ \log \sigma\left( \beta \log \frac{\pi_\theta\left(y_w|x\right)}{\pi_{\text{ref}}\left(y_w|x\right)} - \beta \log \frac{\pi_\theta\left(y_l|x\right)}{\pi_{\text{ref}}\left(y_l|x\right)} \right) \right]$


**Intuitive understanding of this complex formula:**

*   **Core objective**: The objective of the loss function is to maximize the difference between the model's preference for the `chosen` answer and its preference for the `rejected` answer.
*   **How preference is measured**: The model's “preference” for a response is measured by the term `log(π_θ / π_ref)`. This indicates how much the current model's probability for this response has increased relative to the reference model.
*   **Training process**: At each training step, the model adjusts the parameters `θ` to raise `log(π_θ(y_w|x) / π_ref(y_w|x))` relative to `log(π_θ(y_l|x) / π_ref(y_l|x))`. Only the gap between the two terms enters the loss: as the gap widens, `log σ(...)` increases and the overall loss `L_DPO` decreases.
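
To make the intuition concrete, here is a minimal pure-Python sketch of the loss for a small batch. It is not the ERNIE Kit implementation: `dpo_loss` and the log-probability values are hypothetical, and each value is assumed to be the summed token log-probability of a full response.

```python
import math


def log_sigmoid(z: float) -> float:
    """Numerically stable log(sigmoid(z))."""
    if z >= 0:
        return -math.log1p(math.exp(-z))
    return z - math.log1p(math.exp(z))


def dpo_loss(policy_chosen, policy_rejected, ref_chosen, ref_rejected, beta=0.1):
    """Mean DPO loss over a batch; each argument is a list of sequence log-probs."""
    losses = []
    for pc, pr, rc, rr in zip(policy_chosen, policy_rejected, ref_chosen, ref_rejected):
        # Gap between the chosen and rejected log-ratios, scaled by beta.
        margin = beta * ((pc - rc) - (pr - rr))
        losses.append(-log_sigmoid(margin))
    return sum(losses) / len(losses)


# Policy identical to the reference: the margin is 0 and the loss is ln 2.
start = dpo_loss([-10.0], [-12.0], [-10.0], [-12.0])

# Policy now favors the chosen response more than the reference does.
improved = dpo_loss([-8.0], [-14.0], [-10.0], [-12.0])

# Same log-ratio gap, larger beta: the loss falls faster for the same shift.
improved_high_beta = dpo_loss([-8.0], [-14.0], [-10.0], [-12.0], beta=0.5)

print(f"start={start:.4f}, improved={improved:.4f}, high beta={improved_high_beta:.4f}")
```

The loss depends only on the gap between the two log-ratios, so raising the chosen term and lowering the rejected term by the same amount have the same effect.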

**DPO vs. RLHF: Why is DPO superior?**

*   **Simple and stable**: DPO merges the complex two-stage RLHF process (reward modeling + reinforcement learning) into a simple classification task with a new loss function. It does not require training an independent reward model and avoids the sampling and training instability issues in reinforcement learning.
*   **Data-efficient**: DPO directly utilizes preference data, eliminating the need for extensive sampling from the policy model during training, as in PPO. This results in lower computational costs and higher data utilization.
*   **Equivalent or better performance**: Multiple studies have shown that DPO can achieve performance equivalent to or even surpassing RLHF while being simpler and more cost-effective.

### 4.3 Starting DPO Training

Next, we use the `erniekit` tool to start DPO training. `erniekit` is a convenient command-line toolkit provided by the ERNIE project that helps us easily manage tasks such as model training, evaluation, and deployment.

**Single-GPU training command:**

We specify the DPO training task using the `--stage DPO` parameter. At the same time, we also configure core parameters such as `--model_name_or_path` (pre-trained model path), `--train_dataset_path` (training dataset), `--eval_dataset_path` (evaluation dataset), and `--output_dir` (model output path). Furthermore, hyperparameters such as `--max_seq_len` (maximum sequence length), `--learning_rate` (learning rate), and `--max_steps` (maximum training steps) are also set to precisely control the training process.

Please note:

- According to the call stack, `ernie/modeling_moe.py` passes a six-element tuple to `ernie/loss/dpo.py`, which expects only five elements. The extra element, `score_deltas`, causes an error, so it must be deleted manually. The relevant code is in `/home/aistudio/ERNIE-develop/ernie/modeling_moe.py`, around line 1709.
- To keep this demonstration quick, the command below limits training to 10 steps (`--max_steps 10`).

The leading `!` runs the command from a notebook cell; omit it when running in a terminal.

```
!erniekit train \
    --stage DPO \
    --model_name_or_path ../data/models/30656/ERNIE-4.5-0.3B-Paddle \
    --train_dataset_path ./examples/data/dpo-train.jsonl \
    --eval_dataset_path ./examples/data/dpo-eval.jsonl \
    --output_dir ./output/dpo_tutorial_checkpoint \
    --max_seq_len 8192 \
    --learning_rate 5.0e-7 \
    --warmup_steps 5 \
    --max_steps 10 \
    --save_steps 10 \
    --logging_steps 1 \
    --per_device_train_batch_size 1 \
    --gradient_accumulation_steps 36 \
    --bf16 True \
    --do_train \
    --fp16_opt_level O2
```

```
LAUNCH INFO 2025-07-18 12:04:39,160 -----------  Configuration  ----------------------
LAUNCH INFO 2025-07-18 12:04:39,160 auto_cluster_config: 0
LAUNCH INFO 2025-07-18 12:04:39,160 auto_parallel_config: None
LAUNCH INFO 2025-07-18 12:04:39,160 auto_tuner_json: None
LAUNCH INFO 2025-07-18 12:04:39,160 devices: 0
LAUNCH INFO 2025-07-18 12:04:39,160 elastic_level: -1
LAUNCH INFO 2025-07-18 12:04:39,160 elastic_timeout: 30
LAUNCH INFO 2025-07-18 12:04:39,160 enable_gpu_log: True
LAUNCH INFO 2025-07-18 12:04:39,160 gloo_port: 6767
LAUNCH INFO 2025-07-18 12:04:39,160 host: None
LAUNCH INFO 2025-07-18 12:04:39,160 ips: None
LAUNCH INFO 2025-07-18 12:04:39,160 job_id: default
LAUNCH INFO 2025-07-18 12:04:39,160 legacy: False
LAUNCH INFO 2025-07-18 12:04:39,160 log_dir: erniekit_dist_log
LAUNCH INFO 2025-07-18 12:04:39,160 log_level: INFO
LAUNCH INFO 2025-07-18 12:04:39,160 log_overwrite: False
LAUNCH INFO 2025-07-18 12:04:39,160 master: 127.0.0.1:8080
... (output truncated)
```
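
As a rough sanity check on the batch-related flags, the arithmetic below estimates how much data the run sees. It is a sketch under two assumptions that are not confirmed by the source: that `--max_steps` counts optimizer updates, and that the run uses a single device (suggested by `devices: 0` in the launch log). The helper name is hypothetical, and if the toolkit packs several samples into one sequence, the number of preference pairs per sequence can be larger than one.

```python
def sequences_per_update(per_device_batch_size: int, grad_accum_steps: int, num_devices: int) -> int:
    """Number of sequences that contribute to one optimizer update."""
    return per_device_batch_size * grad_accum_steps * num_devices


# Values copied from the training command; num_devices=1 is an assumption.
per_update = sequences_per_update(per_device_batch_size=1, grad_accum_steps=36, num_devices=1)
total = per_update * 10  # --max_steps 10
print(f"{per_update} sequences per update, {total} sequences over the whole run")
```

With these settings the demonstration run touches only a few hundred sequences, so large changes in the training metrics should not be expected.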

**Monitoring the training process:**

Once training begins, you will see a large amount of log output in the terminal. Pay attention to the following key information:

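*   **`loss`**: This is the DPO loss value. When the policy starts from the same weights as the reference model, the loss defined above begins at `ln 2 ≈ 0.693`, and it should trend downward as training progresses.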
*   **`rewards/chosen` and `rewards/rejected`**: These are the average implicit reward scores for chosen and rejected answers. You should observe the value of `rewards/chosen` gradually increasing, while the value of `rewards/rejected` gradually decreases or remains at a low level.
*   **`rewards/accuracies`**: This indicates the proportion of samples in a batch where the model correctly assigns higher rewards to `chosen` responses than to `rejected` responses. This value should tend toward 1.0.
*   **`rewards/margins`**: This is the average difference between `rewards/chosen` and `rewards/rejected`. The larger this value, the stronger the model's ability to distinguish between good and bad answers.
*   **`eval_loss`**: At each evaluation step (`eval_steps`), the loss is calculated on the validation set. Observing this value can help you determine if the model is overfitting.
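
The reward metrics in this list can be reproduced from per-sample implicit rewards. The sketch below follows the metric descriptions above rather than ERNIE Kit's actual logging code; `dpo_metrics` and the reward values are hypothetical.

```python
def dpo_metrics(chosen_rewards, rejected_rewards):
    """Batch-level reward metrics computed from per-sample implicit rewards."""
    n = len(chosen_rewards)
    margins = [c - r for c, r in zip(chosen_rewards, rejected_rewards)]
    return {
        "rewards/chosen": sum(chosen_rewards) / n,
        "rewards/rejected": sum(rejected_rewards) / n,
        # Fraction of pairs where the chosen response gets the higher reward.
        "rewards/accuracies": sum(1 for m in margins if m > 0) / n,
        "rewards/margins": sum(margins) / n,
    }


# Hypothetical implicit rewards for a batch of four preference pairs.
chosen = [0.30, 0.10, -0.05, 0.20]
rejected = [-0.20, 0.15, -0.30, 0.05]
metrics = dpo_metrics(chosen, rejected)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

In this batch the second pair is ranked the wrong way round, so the accuracy is below 1.0 even though the average margin is positive.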

After training is complete, the final model weights and configuration file are saved in the directory specified by `--output_dir` (`./output/dpo_tutorial_checkpoint` in the command above). The model in this directory is the result of DPO training and should now reflect human preferences better than the starting checkpoint.

## 5. Evaluation and Inference

After DPO training, our model theoretically understands human preferences better. But “the proof is in the pudding.” This chapter will guide you on how to scientifically evaluate the model's effectiveness and deploy it to actual interactive inference tasks to intuitively experience its performance improvement.

### 5.1 DPO Model Evaluation

DPO evaluation differs from traditional metrics such as accuracy and F1 score, as it focuses more on measuring **generation quality and preference alignment**. The core of the evaluation is to determine whether the content generated by the model is more in line with our expectations (e.g., safer, more detailed, more creative, etc.) than that generated by the SFT model when faced with the same prompt.

#### 5.1.1 Evaluation Strategy

1.  **Construct a high-quality evaluation set**:
    *   Designing a specialized evaluation set containing various challenging prompts is crucial. These prompts should be able to stimulate the model's capabilities in different dimensions, such as:
        *   **Safety**: Include some edge cases that may induce the model to generate unsafe or biased responses.
        *   **Instruction Compliance**: Design prompts with complex, multi-step, or constrained instructions.
        *   **Creativity and Open-Ended Questions**: Pose questions that require the model to exercise imagination or engage in deep thinking.
        *   **Factual and Knowledge-Based**: Include questions that require accurate knowledge reserves to answer.
    *   This evaluation set should not appear in the training or validation data to ensure the fairness of the evaluation.

2.  **Comparative Evaluation (A/B Test)**:
    *   The most effective evaluation method is to conduct a **head-to-head** comparison. For each prompt in the evaluation set, generate responses using both the **SFT model** and the **DPO model**.
    *   Present the responses generated by the two models (anonymized and in random order) to human evaluators, who then judge which response is better, or if they are equivalent/both poor.
    *   Collect a large number of evaluation results, calculate the win rate, draw rate, and loss rate of the DPO model, thereby quantifying its improvement relative to the SFT model.

3.  **Automated evaluation (using a stronger model as a judge)**:
    *   When there are insufficient human resources for large-scale manual evaluation, stronger models (such as GPT-4 or ERNIE-4.0) can be used as “judges.”
    *   Design a judge prompt template that feeds the user prompt, the SFT model's response, and the DPO model's response to the judge model, which then decides which response is better and gives its rationale.
    *   While automated evaluation cannot fully replace human assessment, it is a scalable, cost-effective method that quickly provides an initial impression of model performance.
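
The comparison workflow described in this list can be sketched as follows. Everything here is illustrative: the template wording, the function names, and the verdict list are hypothetical, and the call to an actual judge model (or the collection of human votes) is deliberately left out.

```python
import random
from collections import Counter

# Hypothetical judge prompt; adapt the wording to your own criteria.
JUDGE_TEMPLATE = """You are an impartial judge. Compare two responses to the same user prompt.

[User prompt]
{prompt}

[Response A]
{response_a}

[Response B]
{response_b}

Answer with exactly one of: A, B, TIE. Then give a one-sentence rationale."""


def build_judge_prompt(prompt, sft_response, dpo_response, rng):
    """Anonymize the two models and randomize their order to limit position bias."""
    if rng.random() < 0.5:
        a, b, mapping = sft_response, dpo_response, {"A": "sft", "B": "dpo"}
    else:
        a, b, mapping = dpo_response, sft_response, {"A": "dpo", "B": "sft"}
    text = JUDGE_TEMPLATE.format(prompt=prompt, response_a=a, response_b=b)
    return text, mapping


def resolve(verdict, mapping):
    """Map a judge answer ('A', 'B', or 'TIE') back to a model identity."""
    return "tie" if verdict == "TIE" else mapping[verdict]


def tally(verdicts):
    """Win, draw, and loss rates of the DPO model against the SFT model."""
    counts = Counter(verdicts)
    total = sum(counts.values())
    return {
        "win_rate": counts["dpo"] / total,
        "draw_rate": counts["tie"] / total,
        "loss_rate": counts["sft"] / total,
    }


rng = random.Random(0)  # fixed seed so the ordering is reproducible
text, mapping = build_judge_prompt(
    "Explain what DPO is in one sentence.",
    sft_response="DPO is a training method.",
    dpo_response="DPO aligns a model by training directly on preference pairs.",
    rng=rng,
)

# Hypothetical verdicts, already mapped back to model identities.
rates = tally(["dpo", "dpo", "tie", "sft", "dpo"])
print(rates)
```

Keeping the A/B mapping separate from the prompt text means neither a human evaluator nor a judge model sees which model produced which response.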

## 6. Summary and Outlook

Congratulations! You have now completed this tutorial on using the ERNIE Kit to perform direct preference optimization (DPO) on the ERNIE-4.5-0.3B model. Let's review this journey and look ahead to future possibilities.

### 6.1 Review of the Core Content of This Tutorial

In this tutorial, we systematically learned and practiced the following core content:

1.  **Core Principles of DPO**: We gained a deep understanding of how DPO cleverly transforms human preference data into a simple classification loss, thereby bypassing the complex reward modeling and reinforcement learning processes of RLHF, achieving more efficient and stable model alignment.

2. **Environment and Code Preparation**: We set up a development environment based on PaddlePaddle and ERNIE Kit, and learned how to install ERNIE Kit from source code to access the latest features and maximum flexibility.

3.  **Data Preparation and Analysis**: We mastered the `(prompt, chosen, rejected)` triplet data format required by DPO and used the `UltraFeedback Binarized` dataset as an example to download, unzip, and view data samples through code, gaining a deep understanding of the key elements of high-quality preference data.

4.  **DPO Training Configuration and Execution**: We thoroughly reviewed the DPO configuration file in ERNIE Kit, particularly focusing on the understanding and setup of key hyperparameters such as `model_name_or_path` (must be an SFT model), `beta`, and `learning_rate`. We also learned how to initiate single-GPU and multi-GPU training via the command line and how to interpret key metrics in the training logs to monitor the training process.

### 6.2 Limitations and Future Prospects of DPO

Although DPO is powerful, it is not a silver bullet and has its limitations:

*   **Dependence on preference data quality**: The effectiveness of DPO is highly dependent on the quality and consistency of preference data. Ambiguous, biased, or inconsistent data can seriously mislead the model's learning.
*   **Sensitivity to Beta Values**: The choice of the hyperparameter `beta` significantly impacts the final results and requires careful tuning through experimentation.
*   **Risk of Model Collapse**: Although more stable than RLHF, DPO may still lead to models sacrificing diversity and fluency in generation to overly accommodate preferences in certain scenarios.

**Cutting-Edge DPO Variants and Future Directions:**

Academia and industry are continuously exploring alignment algorithms superior to DPO. Once you have a deep understanding of DPO, consider the following cutting-edge directions:

*   **IPO (Identity Preference Optimization)**: A variant of DPO that claims to better prevent model overfitting on preference data by modifying the loss function. ERNIE Kit already supports this loss.
*   **KTO (Kahneman-Tversky Optimization)**: Inspired by human decision theory, KTO allows alignment using single samples labeled as “good” or “bad,” eliminating the strict requirement for paired preference data and significantly lowering the data annotation threshold.
*   **SimPO (Simple Preference Optimization)**: A more concise loss function design aimed at improving training efficiency and final performance.
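
For a feel of how these variants differ from DPO, the sketch below compares per-pair losses for DPO, IPO, and SimPO. It is an illustration based on the commonly published forms of these losses, not ERNIE Kit code: the function names, the input values, and the use of `beta` as the name of IPO's regularization strength are assumptions. KTO is omitted because it works on unpaired samples.

```python
import math


def log_sigmoid(z: float) -> float:
    """Numerically stable log(sigmoid(z))."""
    if z >= 0:
        return -math.log1p(math.exp(-z))
    return z - math.log1p(math.exp(z))


def dpo_pair_loss(pc, pr, rc, rr, beta):
    """DPO: negative log-sigmoid of the scaled log-ratio gap."""
    return -log_sigmoid(beta * ((pc - rc) - (pr - rr)))


def ipo_pair_loss(pc, pr, rc, rr, beta):
    """IPO: squared distance of the log-ratio gap from the target 1 / (2 * beta)."""
    gap = (pc - rc) - (pr - rr)
    return (gap - 1.0 / (2.0 * beta)) ** 2


def simpo_pair_loss(pc, pr, len_c, len_r, beta, gamma):
    """SimPO: no reference model, length-normalized log-probs, target margin gamma."""
    return -log_sigmoid(beta * (pc / len_c - pr / len_r) - gamma)


# Hypothetical sequence log-probs: policy (pc, pr) and reference (rc, rr).
pc, pr, rc, rr = -40.0, -50.0, -45.0, -50.0

dpo = dpo_pair_loss(pc, pr, rc, rr, beta=0.1)
ipo = ipo_pair_loss(pc, pr, rc, rr, beta=0.1)
simpo = simpo_pair_loss(pc, pr, len_c=20, len_r=20, beta=2.0, gamma=0.5)
print(f"DPO={dpo:.4f}, IPO={ipo:.4f}, SimPO={simpo:.4f}")
```

In this example the log-ratio gap equals IPO's target exactly, so the IPO loss is zero while the DPO loss would still push the gap wider; this bounded target is the mechanism by which IPO aims to limit overfitting.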

These new algorithms are continuously driving the development of large-scale model alignment technology, and they also foreshadow that we will have more and better tools in the future to enable AI to better serve humanity.

### 6.3 Conclusion

This tutorial has opened the door to the world of large-scale model preference alignment. True learning begins with practice. We strongly encourage you to:

*   **Experiment with different hyperparameters**: Adjust parameters such as `beta` and `learning_rate` to observe their impact on model behavior.
*   **Build your own dataset**: Attempt to create a small preference dataset tailored to your own business context or area of interest, and train it using the methods outlined in this tutorial.
*   **Explore other models**: Apply the methods from this tutorial to other models supported by ERNIE Kit.

Thank you for following this tutorial. We hope you continue to create smarter, more reliable AI that aligns with human values as you explore large models. Enjoy your exploration!

Feedback/Contact me: WeChat: G_Fuji