# Tutorial on Supervised Fine-Tuning (SFT) of Large Language Models Based on ERNIE-4.5-0.3B

## 1. Introduction

Welcome to this tutorial on supervised fine-tuning (SFT) of the ERNIE-4.5-0.3B model using ERNIEkit! In the previous pre-training tutorial, we explored how to build a language model from scratch, endowing it with the foundational ability to understand and generate text. However, a pre-trained general-purpose large model is like a well-informed but untrained generalist—it possesses vast knowledge but may not necessarily follow our specific instructions precisely to complete specific tasks.

### 1.1 What is Supervised Fine-Tuning (SFT) for Large Language Models?

**Supervised Fine-Tuning (SFT)** is a critical step in transforming general-purpose large models into specialized assistants. It continues training a pre-trained model on a labeled dataset of “instructions (Instruction/Prompt)” paired with “expected outputs (Response/Output).” This process is akin to providing “pre-service training” for a well-rounded generalist, teaching it how to understand and follow human instructions, generate high-quality responses to specific questions, or mimic a particular conversational style and format.

**Core differences between SFT and pre-training (PT):**

| Feature | Pre-training (PT) | Supervised Fine-Tuning (SFT) |
|---|---|---|
| **Objective** | Learn general language patterns, grammatical structures, and world knowledge | Learn to follow instructions, solve specific tasks, and align with human intentions and values |
| **Data** | Large-scale, **unlabeled** plain text data (e.g., web pages, books, code) | Relatively small-scale, **high-quality labeled** “instruction-output” pairs |
| **Learning Approach** | Unsupervised or self-supervised learning (e.g., predicting the next word) | Supervised learning (learning the mapping from input instructions to desired outputs) |
| **Computational Resources** | Typically requires massive resources, including vast amounts of data and extremely long training times | Relatively modest, enabling efficient and rapid iteration on pre-trained models |
| **Model Role** | Train a “base model” (Base Model) to establish foundational capabilities | Perform “instruction tuning” (Instruction Tuning) on the “base model” to specialize model behavior |

In summary, pre-training enables the model to “learn to speak,” while supervised fine-tuning teaches the model to “speak properly” and “follow instructions,” aligning its behavior with human expectations.

### 1.2 Application Scenarios and Significance of SFT

SFT is a core technology for unlocking the potential of large language models, with extremely broad significance and application scenarios. We can get a clear picture of this from the table below:

|  | Application Scenario | Core Significance | 
| :---: | :--- | :--- | 
| 🎯 | **Enhancing Task Performance** | By applying SFT on domain-specific instruction data, the model's professionalism, accuracy, and reliability can be significantly improved in tasks such as customer service Q&A, code generation, legal document summarization, and medical consultations. | 
| 🙏 | **Model Alignment** | SFT is a critical step in aligning the model with human intentions and values. By carefully designing prompts and desired responses (typically emphasizing helpfulness, honesty, and harmlessness), we can guide the model to generate responsible and helpful content. | 
| 💬 | **Building Robust Conversational AI** | All top-tier chatbots have undergone extensive SFT training. This enables them to engage in fluent, natural, and logically coherent multi-turn conversations, truly becoming users' reliable assistants. | 
| 🎭 | **Adhering to Specific Styles or Formats** | We can train models via SFT to output in fixed formats (e.g., JSON, Markdown), assume specific roles (e.g., Shakespeare, technical expert), or generate text with a particular emotional tone. |

### 1.3 Why choose ERNIE-4.5-0.3B for SFT?

We chose `ERNIE-4.5-0.3B` as the main model for our SFT practice based on the following key advantages:

|  | Advantage | Detailed explanation |
| :---: | :--- | :--- |
| 🏆 | **Exceptional pre-training foundation** | As the latest member of the Wenxin large model family, `ERNIE-4.5-0.3B` has undergone extensive pre-training with massive, high-quality data, endowing it with world-leading Chinese language processing capabilities and general language understanding abilities. This serves as an exceptionally robust starting point for SFT. |
| 💰 | **Moderate parameter scale with high cost-effectiveness** | With 0.3B (300 million) parameters, full-parameter SFT is now feasible on consumer-grade or mainstream computing resources. This significantly lowers the barrier for developers and researchers to conduct experiments and rapid iterations, making it one of the best choices for exploring SFT. |
| 🛠️ | **Comprehensive support from ERNIEkit** | ERNIEkit is a development kit tailored for the Wenxin large model by PaddlePaddle, offering end-to-end tools from pre-training, SFT, to inference deployment. Its SFT scripts are carefully optimized to support full-parameter fine-tuning, LoRA, and other parameter-efficient fine-tuning (PEFT) methods, with flexible configurations and user-friendly operation. |

### 1.4 Target Audience and Learning Outcomes of This Tutorial

This tutorial is designed to help readers from different backgrounds delve into the world of SFT. See which category you belong to and what you will gain:

|  | Target Audience | Learning Outcomes (You Will Be Able To) |
| :---: | :--- | :--- |
| 🧑‍🎓 | **Advanced Learners** | Gain a deep understanding of the core principles of SFT, its application value, and its fundamental differences from pre-training. |
| 👨‍💻 | **Developers/Engineers** | Master the complete process and best practices for conducting SFT on the ERNIE-4.5-0.3B model using ERNIEkit. |
| 🔬 | **Researchers** | Understand the standard format and construction methods for SFT data, and gain an initial understanding of parameter-efficient fine-tuning (PEFT) techniques such as LoRA. |
| 🚀 | **All Readers** | Learn how to interpret and configure ERNIEkit's SFT `yaml` files, launch training tasks, analyze the training process, and finally perform inference testing on the fine-tuned model to intuitively experience the enhanced capabilities brought by SFT. |

Now, let’s embark on this journey together and use SFT technology to “train” the powerful ERNIE-4.5-0.3B model into a more intelligent and capable AI assistant tailored to your needs!

## 2. Environment Preparation

The first step in performing supervised fine-tuning (SFT) is to set up a stable and efficient development environment. This includes installing the core deep learning framework PaddlePaddle, installing the model development kit ERNIEkit, and preparing the model at the center of this tutorial, ERNIE-4.5-0.3B.

### 2.1 Install PaddlePaddle and aistudio-sdk

Ensure that you have installed the latest or recommended version of PaddlePaddle. We recommend using the GPU version of PaddlePaddle for optimal training speed. Additionally, we need to install `aistudio-sdk`, a convenient tool that helps us easily download model resources from AI Studio.

*Students running this project on AI Studio do not need to run the environment installation code block below.*

In [None]:
# Ensure pip is the latest version
!python -m pip install --upgrade pip

# Install PaddlePaddle (GPU version, recommended to use the latest stable version)
# If your environment (such as CUDA version) is different, please visit the official website to obtain the corresponding instructions: https://www.paddlepaddle.org.cn/install/quick
!python -m pip install paddlepaddle-gpu -i https://mirror.baidu.com/pypi/simple

# Install aistudio-sdk
!pip install --upgrade aistudio-sdk

**Verify installation**:

In [None]:
import paddle

print(f"PaddlePaddle Version: {paddle.__version__}")

# Check if GPU is available

try:
    paddle.utils.run_check()
    if paddle.device.cuda.device_count() > 0:
        print(f"PaddlePaddle GPU is available! Found {paddle.device.cuda.device_count()} GPU(s).")
    else:
        print("PaddlePaddle GPU check passed, but no GPU found. Will use CPU.")
except Exception as e:
    print(f"PaddlePaddle GPU check failed: {e}")
    print("If you intended to use GPU, please check your CUDA setup and PaddlePaddle installation.")



PaddlePaddle Version: 3.1.0
Running verify PaddlePaddle program ... 
PaddlePaddle works well on 1 GPU.
PaddlePaddle is installed successfully! Let's start deep learning with PaddlePaddle now.
PaddlePaddle GPU is available! Found 1 GPU(s).


I0717 18:57:50.384768   265 pir_interpreter.cc:1524] New Executor is Running ...
W0717 18:57:50.386147   265 gpu_resources.cc:114] Please NOTE: device: 0, GPU Compute Capability: 8.0, Driver API Version: 12.8, Runtime API Version: 12.6
I0717 18:57:50.386685   265 pir_interpreter.cc:1547] pir interpreter is running by multi-thread mode ...


### 2.2 Download the ERNIEkit repository code and the ERNIE-4.5-0.3B model

The ERNIEkit SFT script (`train.py`) and related models and training configuration files are located in its official repository. Additionally, we need to use the `aistudio` command-line tool to download the ERNIE-4.5-0.3B model weights.

*Students running this project on AI Studio do not need to run the code block below; the repository archive and the model are already provided at the following paths:*

- ERNIE: /home/aistudio/ERNIE-develop.zip
- model: /home/aistudio/data/models/30654/ERNIE-4.5-0.3B-Base-Paddle

In [None]:
# 1. Clone the ERNIEkit repository
# We clone the `develop` branch into a directory named ERNIE-develop to get the latest features and optimizations.
!git clone https://github.com/PaddlePaddle/ERNIE.git -b develop ERNIE-develop

# 2. Download the ERNIE-4.5-0.3B model weights.
# Use the command line tool provided by aistudio-sdk to download.
# The model will be downloaded to the baidu/ERNIE-4.5-0.3B-Paddle directory.
!aistudio download --model PaddlePaddle/ERNIE-4.5-0.3B-Paddle --local_dir baidu/ERNIE-4.5-0.3B-Paddle

**Important directory structure explanation**:

After executing the above commands, your working directory should have a structure similar to the following:

```
.
├── ERNIE-develop/
│   ├── examples/
│   │   ├── configs/
│   │   │   └── ERNIE-4.5-0.3B/
│   │   │       └── sft/
│   │   │           └── run_sft_8k.yaml  <-- This is our SFT configuration file
│   │   └── data/
│   │       └── sft-train.jsonl        <-- This is the example SFT data
│   ├── train.py                     <-- This is the ERNIEkit training script
│   └── ... (other ERNIEkit code)
└── baidu/
    └── ERNIE-4.5-0.3B-Paddle/       <-- This is where the model weights and configuration files are stored
        ├── model_state.pdparams
        ├── tokenizer_config.json
        └── ... (other model files)
```

For AI Studio users, the model directory is located in the `data` folder instead.

In the rest of this tutorial, we will mainly work in the `ERNIE-develop` directory and use the `train.py` script, combined with the `examples/configs/ERNIE-4.5-0.3B/sft/run_sft_8k.yaml` configuration file, to load the model from the `baidu/ERNIE-4.5-0.3B-Paddle` directory for fine-tuning.

Please ensure that your directory structure matches this to proceed smoothly with the subsequent steps.
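
As a quick sanity check, the layout above can be verified with a few lines of Python. This is a minimal sketch: the paths are copied from the directory tree shown above, while the helper name `find_missing_paths` is purely illustrative. AI Studio users would need to adjust the model path to the `data` folder.

```python
from pathlib import Path

# Paths copied from the directory tree above, relative to the working directory.
EXPECTED_PATHS = [
    "ERNIE-develop/examples/configs/ERNIE-4.5-0.3B/sft/run_sft_8k.yaml",
    "ERNIE-develop/examples/data/sft-train.jsonl",
    "baidu/ERNIE-4.5-0.3B-Paddle",
]


def find_missing_paths(root=".", expected=EXPECTED_PATHS):
    """Return the expected paths that do not exist under `root` (illustrative helper)."""
    root = Path(root)
    return [p for p in expected if not (root / p).exists()]


if __name__ == "__main__":
    missing = find_missing_paths()
    if missing:
        print("Missing paths:")
        for p in missing:
            print(f"  {p}")
    else:
        print("Directory structure looks complete.")
```

An empty result means every listed file or folder was found.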

The environment and resources are now ready. Let’s delve into the core of SFT—the data.

## 3. SFT Data Preparation

“Garbage in, garbage out.” This famous saying is perfectly illustrated in SFT. The effectiveness of SFT depends largely on the quality of the instructional data used. High-quality data can guide the model to learn correct behavioral patterns, while low-quality data may cause the model to produce biased, inaccurate, or even harmful outputs.

### 3.1 Characteristics and Format of SFT Data

SFT data consists of paired “instructions (Instruction/Prompt)” and “desired outputs (Response/Output)”. ERNIEkit uses the JSON Lines (jsonl) format, where each line represents an independent JSON object, corresponding to a training sample. This format is clear, scalable, and highly suitable for processing large-scale datasets.

**Detailed explanation of ERNIEkit SFT data format:**

A standard ERNIEkit SFT sample structure is as follows:

```json
{"src": "Hello", "tgt": "Hello! It's a pleasure to serve you."}
{"src": "Please write a poem about the moon.", "tgt": "Bright moonlight before my bed, I wonder if it's frost on the ground. I lift my head to gaze at the bright moon, then lower my head to think of my hometown."}
```

*   `src` (Source): Represents the instructions or questions input to the model. This content undergoes templating processing and serves as the model's input prompt (Prompt).
*   `tgt` (Target): Represents the expected response generated by the model. This is the target output that the model needs to learn and emulate.
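
Before training, it can help to check that every line of a data file parses as JSON and carries both keys. The sketch below is one possible validator, not part of ERNIEkit; the function name `validate_sft_line` is illustrative. It accepts either a plain string or a list of strings for each field, since the sample file printed later in this tutorial stores `src` and `tgt` as lists.

```python
import json


def validate_sft_line(line):
    """Parse one JSONL line and check that `src` and `tgt` are usable (illustrative helper)."""
    record = json.loads(line)
    if not isinstance(record, dict):
        raise ValueError("each line must be a JSON object")
    for key in ("src", "tgt"):
        if key not in record:
            raise ValueError(f"missing key: {key}")
        value = record[key]
        # Assumption: a field is either one string or a list of strings.
        items = [value] if isinstance(value, str) else value
        if not isinstance(items, list) or not items:
            raise ValueError(f"{key} must be a non-empty string or list")
        if not all(isinstance(x, str) and x.strip() for x in items):
            raise ValueError(f"{key} contains an empty or non-string entry")
    return record
```

Calling it on each line of a file and collecting the line numbers that raise `ValueError` gives a quick report of malformed samples.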

**The Importance of Dialogue Templates:**

Dialogue models like ERNIE-4.5 typically follow a specific dialogue template during training. This template defines the roles of the user and assistant and uses special tokens to separate different rounds. The ERNIEkit training script automatically wraps `src` and `tgt` into the template format used during model pre-training. For example, a simplified template might look like this:

```
<|im_start|>user
{src}<|im_end|>
<|im_start|>assistant
{tgt}<|im_end|>
```

During training, the model's task is to generate the `assistant` portion based on the `user` content. Through this process, the model learns to assume the role of the assistant in the dialogue.
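
To make this concrete, here is a toy sketch that fills the simplified template above and marks which tokens belong to the assistant. It rests on two assumptions: a whitespace split stands in for the real subword tokenizer, and the loss is computed only on the response tokens, which is a common SFT practice but is not confirmed here as ERNIEkit's exact behavior.

```python
# The simplified template from above, split at the point where the response begins.
TEMPLATE_USER = "<|im_start|>user\n{src}<|im_end|>\n<|im_start|>assistant\n"
TEMPLATE_ASSISTANT = "{tgt}<|im_end|>"


def build_example(src, tgt):
    """Return (tokens, loss_mask) for one sample using a toy whitespace tokenizer."""
    prompt_tokens = TEMPLATE_USER.format(src=src).split()
    response_tokens = TEMPLATE_ASSISTANT.format(tgt=tgt).split()
    tokens = prompt_tokens + response_tokens
    # 0 = prompt token, ignored by the loss; 1 = response token, contributes to the loss.
    loss_mask = [0] * len(prompt_tokens) + [1] * len(response_tokens)
    return tokens, loss_mask


tokens, loss_mask = build_example("Hello", "Hi there")
for token, flag in zip(tokens, loss_mask):
    print(flag, token)
```

Every token coming from the `user` part gets mask `0`, and every token from the `assistant` part gets mask `1`.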

**Key Elements of High-Quality SFT Data:**

*   **Diversity**: Instructions should cover as many topics and skills as possible, such as question-answering, creative writing, summarization, translation, code generation, etc.
*   **Complexity**: Instructions should include tasks of varying difficulty levels, from simple fact-based questions to complex tasks requiring deep reasoning and creativity.
*   **Accuracy**: The content of `tgt` must be accurate, high-quality, and harmless.
*   **Consistency**: The style, format, and level of detail of the output should remain consistent unless you intentionally train the model to master multiple styles.

### 3.2 Using the sample SFT dataset provided by ERNIEkit

ERNIEkit provides a sample dataset named `sft-train.jsonl` in the `ERNIE-develop/examples/data/` directory. This is a small, cleaned dataset that is ideal for quickly running through the SFT process and verifying functionality.

Let's take a look at the contents of this sample data:

In [None]:
import json

# Define sample data path
sample_data_path = "./ERNIE-develop/examples/data/sft-train.jsonl"

# Read and print the first 5 samples
print(f"View sample data: {sample_data_path}\n---")
with open(sample_data_path, 'r', encoding='utf-8') as f:
    for i, line in enumerate(f):
        if i >= 5:
            break
        data_sample = json.loads(line.strip())
        print(f"【sample {i+1}】")
        print(f"  SRC: {data_sample['src']}")
        print(f"  TGT: {data_sample['tgt']}\n")

View sample data: ./ERNIE-develop/examples/data/sft-train.jsonl
---
【sample 1】
  SRC: ['我们如何在日常生活中减少用水？']
  TGT: ['1. 使用节水装置，如节水淋浴喷头和水龙头。 \n2. 使用水箱或水桶收集家庭废水，例如洗碗和洗浴。 \n3. 在社区中提高节水意识。 \n4. 检查水管和灌溉系统的漏水情况，并及时修复它们。 \n5. 洗澡时间缩短，使用低流量淋浴头节约用水。 \n6. 收集雨水，用于园艺或其他非饮用目的。 \n7. 刷牙或擦手时关掉水龙头。 \n8. 减少浇水草坪的时间。 \n9. 尽可能多地重复使用灰水（来自洗衣机、浴室水槽和淋浴的水）。 \n10. 只购买能源效率高的洗碗机和洗衣机。']

【sample 2】
  SRC: ['政府可以采取哪些策略来减少空气污染？']
  TGT: ['1. 实施强制的车辆排放标准和基于激励的计划，以降低车辆的碳足迹。\n2. 增加公共交通工具，减少公众对车辆的依赖。\n3. 增加对空气污染的影响的认识，鼓励市民减少污染物的生成。\n4. 投资于可再生能源的研究和开发，如太阳能和风能。\n5. 在工厂和发电厂安装空气污染控制装置，例如洗涤器。\n6. 对车辆和工厂使用清洁燃料。\n7. 实施更好的城市规划和控制拓展。\n8. 改善农业效率，减少化肥和杀虫剂的使用。\n9. 种植更多的树木以减少空气污染。\n10. 减少木材、煤炭和生物质的燃烧。']

【sample 3】
  SRC: ['可再生能源的存在对环境有什么影响？']
  TGT: ['可再生能源的存在可以帮助减少空气污染和温室气体排放，因为它们几乎不会排放二氧化碳、二氧化硫等空气污染物。此外，使用可再生能源可以促进能源效率的进一步提高和能源利用的改善。可再生能源也可以帮助减少对化石燃料的依赖，这不仅可以减少排放，而且还可以帮助减少全球气候变化的风险。最后，可再生能源可以帮助保护自然资源，减少为了能源生产而开发更多土地和资源的需要。']

【sample 4】
  SRC: ['解释神经网络如何学习。']
  TGT: ['神经网络是一种机器学习算法，它使用连接的节点集合来近似可以将输入变量映射到输出的函数。为了学习神经网

### 3.3 How to Prepare Your Own SFT Dataset

In practical applications, you typically need to build a dataset tailored to your specific needs. The following are the general steps for constructing a custom SFT dataset:

1.  **Define the Target Task**: Clearly identify the capabilities you want the model to possess. Is it for intelligent customer service in a vertical field, or for generating marketing copy in a specific style?
2.  **Collect or generate raw data**:
    *   **Collect**: Extract “question-answer” pairs from existing documents, tickets, FAQs, or user logs.
    *   **Manual annotation**: Hire domain experts or annotators to write high-quality prompts and responses based on task requirements.
    *   **Model generation**: Use more powerful models (such as GPT-4) to generate preliminary instructions and answers, then have them screened and refined by humans. This is one of the mainstream methods for building large-scale SFT datasets.
3.  **Cleaning and formatting**:
    *   Remove low-quality, duplicate, or irrelevant data.
    *   Perform data augmentation, e.g., phrase the same instruction in several different ways.
    *   Organize the data into the `jsonl` format required by ERNIEkit, i.e., each line contains a JSON object with `src` and `tgt` keys.
4.  **Split the dataset**: Split the dataset into a training set (train) and an evaluation set (eval/dev). The evaluation set is used to monitor model performance during training and prevent overfitting. The typical ratio is 95%:5% or 98%:2%.
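
Steps 3 and 4 above can be sketched with the standard library alone. The helper names (`dedupe`, `split_dataset`, `write_jsonl`) and the output file names in the usage note are illustrative, and real cleaning pipelines usually need task-specific quality filters on top of this.

```python
import json
import random


def dedupe(samples):
    """Drop empty samples and exact duplicates of (src, tgt), keeping the original order."""
    seen, unique = set(), []
    for sample in samples:
        if not sample.get("src") or not sample.get("tgt"):
            continue  # skip samples with a missing or empty field
        key = json.dumps([sample["src"], sample["tgt"]], ensure_ascii=False)
        if key not in seen:
            seen.add(key)
            unique.append(sample)
    return unique


def split_dataset(samples, eval_ratio=0.05, seed=23):
    """Shuffle with a fixed seed and return (train, eval) lists."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)
    n_eval = max(1, round(len(shuffled) * eval_ratio))
    return shuffled[n_eval:], shuffled[:n_eval]


def write_jsonl(samples, path):
    """Write one JSON object per line, keeping non-ASCII text readable."""
    with open(path, "w", encoding="utf-8") as f:
        for sample in samples:
            f.write(json.dumps(sample, ensure_ascii=False) + "\n")
```

A typical call sequence would be `train, dev = split_dataset(dedupe(raw))` followed by `write_jsonl(train, "my-sft-train.jsonl")` and `write_jsonl(dev, "my-sft-eval.jsonl")`, where both file names are placeholders.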

Preparing high-quality data is the cornerstone of SFT success. Now that we have covered data preparation, the next step is to delve into the ERNIEkit SFT configuration file to understand how to control the entire fine-tuning process through it.

## 4. SFT Configuration Analysis (`run_sft_8k.yaml`)

The power of ERNIEkit lies in its flexibility and configurability. All training parameters are centralized in a `yaml` configuration file, and by modifying this file, we can precisely control every aspect of the SFT. Let’s take `ERNIE-develop/examples/configs/ERNIE-4.5-0.3B/sft/run_sft_8k.yaml` as an example and analyze the key configuration items one by one.

### 4.1 Data Configuration (Data)

This section defines the data sets used for training and evaluation.

```yaml
### data
train_dataset_type: "erniekit"
eval_dataset_type: "erniekit"
train_dataset_path: "./examples/data/sft-train.jsonl"
train_dataset_prob: "1.0"
eval_dataset_path: "./examples/data/sft-eval.jsonl"
eval_dataset_prob: "1.0"
max_seq_len: 8192
num_samples_each_epoch: 6000000
```

* `train_dataset_type`, `eval_dataset_type`: Dataset format type, here it is `erniekit`, indicating the use of the previously discussed `{"src": ..., "tgt": ...}` format.
*   `train_dataset_path`, `eval_dataset_path`: Paths to the training and evaluation datasets. **You need to modify these to your own data paths.**
* `train_dataset_prob`, `eval_dataset_prob`: Dataset sampling probability. When there are multiple datasets, you can set different probabilities for mixed training.
* `max_seq_len`: Maximum sequence length processed by the model. `8192` (8K) is a common setting for ERNIE-4.5, which can handle very long contexts. If your GPU memory is limited, you can reduce this value appropriately, but this may affect the model's ability to process long texts.
*   `num_samples_each_epoch`: The number of samples included in each epoch. This value is typically set to a large number to ensure the model sees enough data in each epoch.
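
To illustrate what a sampling probability means when several datasets are mixed, here is a small sketch of probability-weighted sampling. It only demonstrates the idea: how ERNIEkit implements `train_dataset_prob` internally is not shown in this tutorial, and the dataset names and contents below are hypothetical.

```python
import random


def sample_mixture(datasets, probs, num_samples, seed=23):
    """Draw `num_samples` items, choosing the source dataset by probability each time."""
    rng = random.Random(seed)
    names = list(datasets)
    picked = []
    for _ in range(num_samples):
        name = rng.choices(names, weights=probs, k=1)[0]
        picked.append((name, rng.choice(datasets[name])))
    return picked


# Hypothetical datasets: a general one sampled 80% of the time, a domain one 20%.
datasets = {
    "general": ["g1", "g2", "g3"],
    "domain": ["d1", "d2"],
}
mixture = sample_mixture(datasets, probs=[0.8, 0.2], num_samples=1000)
share = sum(1 for name, _ in mixture if name == "general") / len(mixture)
print(f"share of general samples: {share:.2f}")
```

With enough draws, the share of each dataset approaches its configured probability, regardless of how many samples each file contains.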

### 4.2 Model Configuration (Model)

This section defines the base model to be fine-tuned and its related settings.

```yaml
### model
model_name_or_path: baidu/ERNIE-4.5-0.3B-Paddle
fine_tuning: Full
fuse_rope: True
use_sparse_head_and_loss_fn: True
```

* `model_name_or_path`: **Core parameter**. Specifies the path to the base model. Here, we use the previously downloaded `baidu/ERNIE-4.5-0.3B-Paddle`.
* `fine_tuning`: Fine-tuning method. `Full` indicates full parameter fine-tuning, i.e., updating all model weights. This is the most thorough but also the most resource-intensive method. Other options such as `lora` will be introduced later.
* `fuse_rope`: Whether to use the optimized Rotary Position Embedding. `True` can improve computational efficiency.
* `use_sparse_head_and_loss_fn`: Whether to use sparse head and loss functions. This is an optimization for specific model structures; keep the default setting.

### 4.3 Fine-tuning Core Configuration (Finetuning)

This is the core part that controls the entire training process.

```yaml
### finetuning
# base
stage: SFT
seed: 23
do_train: True
do_eval: True
distributed_dataloader: False
dataloader_num_workers: 1
batch_size: 1
num_train_epochs: 1
max_steps: 100
max_evaluate_steps: 10000
eval_steps: 10000
evaluation_strategy: steps
save_steps: 10000000
save_total_limit: 5
save_strategy: steps
logging_steps: 1
release_grads: True
gradient_accumulation_steps: 8
logging_dir: ./vdl_log
output_dir: ./output
disable_tqdm: True
```

* `stage`: Training stage, which is `SFT` here.
* `seed`: Random seed, used to ensure the reproducibility of experiments.
* `do_train`, `do_eval`: Whether to perform training and evaluation.
* `batch_size`: **Important**. Batch size on each device. Since large models consume a lot of memory, `1` is a common starting value.
* `gradient_accumulation_steps`: **Important**. Number of gradient accumulation steps. This technique simulates a larger batch without extra memory: gradients from several small batches are accumulated before each weight update. The effective batch size is `batch_size` * `gradient_accumulation_steps` * `num_gpus`; here, on a single GPU, it is `1 * 8 * 1 = 8`. Increasing this value helps stabilize training, but each weight update takes proportionally longer.
* `num_train_epochs`: The total number of training epochs. For SFT, typically 1 to 3 epochs are sufficient.
* `max_steps`: The maximum number of training steps. If this value is set, it will override `num_train_epochs`. For quick experiments, it can be set to a smaller value (e.g., `100`).
* `evaluation_strategy`, `eval_steps`: Evaluation strategy. `steps` indicates that an evaluation is performed every `eval_steps` steps.
* `save_strategy`, `save_steps`: Save strategy. `steps` indicates that a checkpoint is saved every `save_steps` steps.
* `logging_steps`: Interval, in steps, between log outputs.
* `output_dir`: Output directory for training artifacts (such as model checkpoints).
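
The batch-size arithmetic described above can be written out as a short calculation. This sketch assumes that one training step corresponds to one weight update after accumulation and that multiple GPUs are used in a data-parallel fashion; the function names are illustrative.

```python
def effective_batch_size(batch_size, gradient_accumulation_steps, num_gpus=1):
    """Number of samples that contribute to a single weight update."""
    return batch_size * gradient_accumulation_steps * num_gpus


def samples_seen(max_steps, batch_size, gradient_accumulation_steps, num_gpus=1):
    """Total samples consumed after `max_steps` weight updates."""
    return max_steps * effective_batch_size(batch_size, gradient_accumulation_steps, num_gpus)


# Values from the configuration above, on a single GPU.
print(effective_batch_size(batch_size=1, gradient_accumulation_steps=8))          # 8
print(samples_seen(max_steps=100, batch_size=1, gradient_accumulation_steps=8))   # 800
```

Under these assumptions, the quick experiment with `max_steps: 100` consumes about 800 training samples.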

### 4.4 Training and Optimizer Configuration (Train & Optimizer)

This section controls hyperparameters such as learning rate and optimizer.

```yaml
# train
warmup_steps: 20
learning_rate: 1.0e-5
lr_scheduler_type: cosine
min_lr: 1.0e-6

# optimizer
weight_decay: 0.1
adam_epsilon: 1.0e-8
adam_beta1: 0.9
adam_beta2: 0.95
offload_optim: True
```

* `learning_rate`: **Core hyperparameter**. The size of the learning rate directly affects the convergence speed and final performance of the model. `1.0e-5` is a commonly used value for full-parameter fine-tuning. For PEFT methods such as LoRA, a larger learning rate may be required.
* `lr_scheduler_type`: Learning rate scheduling strategy. `cosine` indicates the use of the cosine annealing strategy, which is a highly effective and commonly used strategy.
* `warmup_steps`: Warmup steps. During the initial training phase, the learning rate linearly increases from a very small value to the specified `learning_rate`, which helps stabilize the training process.
* `weight_decay`: Weight decay, a regularization technique to prevent overfitting.
* `offload_optim`: Whether to offload the optimizer state to CPU memory. This is a memory-saving technique but may slightly reduce training speed.
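
The combination of linear warmup and cosine annealing described above can be sketched as a small function. The formula below is the textbook form using the values from this configuration (`warmup_steps`, `learning_rate`, `min_lr`, and `max_steps: 100`); the exact schedule implemented inside ERNIEkit may differ in details such as the starting value of the warmup.

```python
import math


def lr_at_step(step, max_steps=100, warmup_steps=20, peak_lr=1.0e-5, min_lr=1.0e-6):
    """Learning rate at a given step: linear warmup, then cosine decay down to `min_lr`."""
    if step < warmup_steps:
        # Linear ramp from 0 up to the peak learning rate.
        return peak_lr * step / warmup_steps
    # Fraction of the decay phase completed, clamped to [0, 1].
    progress = min(1.0, (step - warmup_steps) / max(1, max_steps - warmup_steps))
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))
    return min_lr + (peak_lr - min_lr) * cosine


for step in (0, 10, 20, 60, 100):
    print(f"step {step:3d}: lr = {lr_at_step(step):.2e}")
```

The rate climbs to `1.0e-5` at step 20, then follows half a cosine wave down to `1.0e-6` at the final step.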


### 4.5 Performance and Parallel Configuration

This section is used to configure distributed training and mixed precision training to improve efficiency and save resources.

```yaml
# performance
tensor_parallel_degree: 1
pipeline_parallel_degree: 1
sharding_parallel_degree: 1
sharding: stage1
sequence_parallel: True
recompute: False
compute_type: bf16
fp16_opt_level: O2
```

* `tensor_parallel_degree`, `pipeline_parallel_degree`, `sharding_parallel_degree`: Parallelism degree for distributed training. For single-card training, these should all be set to `1`.
* `sharding`: ZeRO optimization strategy. `stage1` can significantly save memory when training with multiple cards.
* `sequence_parallel`: Sequence parallelism, a memory optimization technique for long sequences.
* `recompute` (Gradient Checkpointing): **Important**. A technique that trades computation for memory. Setting this to `True` recomputes intermediate results from the forward pass during backpropagation instead of storing them. This can significantly reduce memory consumption but increases training time by approximately 20-30%. **Enabling this option is recommended when memory is insufficient.**
* `compute_type`: Computation precision. `bf16` (BFloat16) is a half-precision floating-point format designed specifically for deep learning. It significantly reduces memory usage and accelerates computation while maintaining good numerical stability. It is recommended for use on supported hardware (such as NVIDIA A100/H100, RTX 30/40 series).
* `fp16_opt_level`: Mixed-precision training level. `O2` is a commonly used setting that can be used in conjunction with `bf16`.
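
To get a feel for why options such as `bf16`, `offload_optim`, and sharding matter, here is a back-of-the-envelope memory estimate for full-parameter fine-tuning. The byte counts are assumptions based on a common accounting for mixed-precision Adam-style training (2 bytes per weight and per gradient in bf16, plus 12 bytes per parameter for an fp32 master copy and two fp32 optimizer moments). They are not measured from ERNIEkit, and activation memory, which grows with `max_seq_len` and `batch_size`, is excluded.

```python
def estimate_training_memory_gb(num_params, weight_bytes=2, grad_bytes=2, optimizer_bytes=12):
    """Rough memory (in GB) for weights, gradients, and optimizer state, excluding activations."""
    parts = {
        "weights": num_params * weight_bytes,
        "gradients": num_params * grad_bytes,
        "optimizer_state": num_params * optimizer_bytes,
    }
    return {name: size / 1e9 for name, size in parts.items()}


# 0.3B parameters, as stated for ERNIE-4.5-0.3B in this tutorial.
estimate = estimate_training_memory_gb(0.3e9)
for name, gb in estimate.items():
    print(f"{name:16s} {gb:.1f} GB")
print(f"{'total':16s} {sum(estimate.values()):.1f} GB")
```

Under these assumptions the optimizer state is the largest share, which is why moving it to CPU memory with `offload_optim: True` frees a noticeable amount of GPU memory.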

With an understanding of these configuration options, you now have the ability to freely customize the SFT process. In the next section, we will integrate all this knowledge and formally launch the SFT training task.

## 5. Start SFT training

With the theory and configuration ready, it's time to start SFT training and witness the transformation of the model's capabilities. ERNIEkit can start the entire training process with a simple command.

### 5.1 Start the training command

We will launch training from the `ERNIE-develop` directory with the `erniekit train` command, passing our configuration file `run_sft_8k.yaml`.

**Execute in the terminal or command line:**

In [None]:
# First, make sure you are in the ERNIE-develop directory.
%cd ./ERNIE-develop

/home/aistudio/ERNIE-develop


In [4]:
!erniekit train examples/configs/ERNIE-4.5-0.3B/sft/run_sft_8k.yaml

LAUNCH INFO 2025-07-17 19:05:48,655 -----------  Configuration  ----------------------
LAUNCH INFO 2025-07-17 19:05:48,656 auto_cluster_config: 0
LAUNCH INFO 2025-07-17 19:05:48,656 auto_parallel_config: None
LAUNCH INFO 2025-07-17 19:05:48,656 auto_tuner_json: None
LAUNCH INFO 2025-07-17 19:05:48,656 devices: 0
LAUNCH INFO 2025-07-17 19:05:48,656 elastic_level: -1
LAUNCH INFO 2025-07-17 19:05:48,656 elastic_timeout: 30
LAUNCH INFO 2025-07-17 19:05:48,656 enable_gpu_log: True
LAUNCH INFO 2025-07-17 19:05:48,656 gloo_port: 6767
LAUNCH INFO 2025-07-17 19:05:48,656 host: None
LAUNCH INFO 2025-07-17 19:05:48,656 ips: None
LAUNCH INFO 2025-07-17 19:05:48,656 job_id: default
LAUNCH INFO 2025-07-17 19:05:48,656 legacy: False
LAUNCH INFO 2025-07-17 19:05:48,656 log_dir: erniekit_dist_log
LAUNCH INFO 2025-07-17 19:05:48,656 log_level: INFO
LAUNCH INFO 2025-07-17 19:05:48,656 log_overwrite: False
LAUNCH INFO 2025-07-17 19:05:48,656 master: 127.0.0.1:8080
LAUNCH INFO 2025-07-17 1

### 5.2 Interpreting Training Logs

Once training begins, you will see a large amount of log information output. Learning to interpret these logs is key to monitoring the training status and determining whether the model is converging normally.

Let's break down the key information:

* `global_step`: The current total number of training steps. `10/100` indicates that the current step is the 10th step, with a total of 100 steps required for training (defined by `max_steps`).
* `epoch`: The current number of training epochs.
* `loss`: **The most important metric**. It measures the gap between the model's predicted output and the actual `tgt`. **We expect to see the `loss` value steadily decrease as training progresses**. If the `loss` remains high, fluctuates wildly, or becomes `NaN`, it indicates that there may be a problem with the training (e.g., learning rate is too high, data is problematic, etc.).
* `learning_rate`: The current learning rate. You can see it reaches the set value after `warmup_steps`, then changes according to the strategy of `lr_scheduler_type`.
* `speed` / `ips`: Training speed. These represent the number of steps and samples processed per second, respectively. They can be used to estimate the total training time.
* `eta`: Estimated remaining training time (Estimated Time of Arrival).

**Monitoring points:**

1.  **Is the loss decreasing?**: This is the primary health indicator. A steadily decreasing loss curve is a sign of successful training.
2.  **GPU utilization and memory**: Use the `nvidia-smi` command to view the GPU's working status in real time. Ensure that GPU utilization is as high as possible and that memory usage is within a reasonable range. If memory overflow (Out of Memory, OOM) occurs, you need to reduce `batch_size`, `max_seq_len`, or enable `recompute`.
3.  **Evaluation Results (Eval Loss)**: When `eval_steps` is reached, the model is tested on the evaluation set. We also expect `eval_loss` to decrease, indicating that the model's generalization ability is improving. If `train_loss` decreases while `eval_loss` increases, it indicates that the model may be overfitting.
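
The checks above can be partly automated. The sketch below extracts `key: value` metrics from a log line and flags two warning signs: a `NaN`/`inf` loss and a loss that is not decreasing. The log line in the example is hypothetical, since the exact layout of ERNIEkit's training log lines is not shown in this tutorial; adapt the pattern to what you actually see.

```python
import math
import re

# Matches pairs such as "loss: 2.5" or "learning_rate: 1e-05"; keys must start with a letter.
PAIR = re.compile(
    r"([A-Za-z_]\w*):\s*([-+]?(?:\d+\.?\d*(?:[eE][-+]?\d+)?|nan|inf))",
    re.IGNORECASE,
)


def parse_metrics(line):
    """Return all numeric `key: value` pairs found in one log line."""
    return {key: float(value) for key, value in PAIR.findall(line)}


def check_losses(losses, window=5):
    """Return warnings for a sequence of training losses."""
    warnings = []
    if any(not math.isfinite(x) for x in losses):
        warnings.append("loss became NaN or inf")
    finite = [x for x in losses if math.isfinite(x)]
    if len(finite) >= 2 * window:
        first = sum(finite[:window]) / window
        last = sum(finite[-window:]) / window
        if last >= first:
            warnings.append("loss is not decreasing")
    return warnings


# Hypothetical log line, for illustration only.
print(parse_metrics("loss: 2.5, learning_rate: 1e-05, global_step: 10"))
```

Comparing the average of the first and last few losses is a deliberately crude trend test; plotting the curve from `logging_dir` remains the more reliable way to judge convergence.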

After training is complete, all fine-tuned model weights and configuration files will be saved in the directory specified by `output_dir` in your `yaml` file. By default, there will be a folder named `checkpoint-xxx`, where `xxx` is the step number at the time of saving.
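
If several `checkpoint-xxx` folders exist, a small helper can pick the most recent one by step number. This is a minimal sketch that relies only on the folder naming described above; the function name `latest_checkpoint` is illustrative, and it returns `None` when no such folder is present.

```python
import re
from pathlib import Path


def latest_checkpoint(output_dir="./output"):
    """Return the `checkpoint-<step>` directory with the highest step, or None."""
    root = Path(output_dir)
    if not root.is_dir():
        return None
    best_step, best_path = -1, None
    for child in root.iterdir():
        match = re.fullmatch(r"checkpoint-(\d+)", child.name)
        if match and child.is_dir() and int(match.group(1)) > best_step:
            best_step, best_path = int(match.group(1)), child
    return best_path


print(latest_checkpoint("./output"))
```

Sorting by the numeric step rather than by name avoids ranking `checkpoint-90` above `checkpoint-100`.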

## 6. Summary and Outlook

Congratulations! Through studying and practicing this tutorial, you have successfully mastered the core techniques of supervised fine-tuning (SFT) of the ERNIE-4.5-0.3B large language model using ERNIEkit. Starting from the basic concepts of SFT, we have completed the entire process of environment preparation, data processing, configuration parsing, model training, and inference evaluation step by step.

### 6.1 Core Review of This Tutorial

Let’s recap the key learning points from this journey:

1.  **Understood the essence of SFT**: We clarified that SFT is the process of teaching a model to follow human intent and complete specific tasks by using labeled “instruction-output” data on top of a pre-trained model. It serves as a bridge connecting general-purpose large models with specific application scenarios.
2.  **Mastered the ERNIEkit SFT workflow**: We learned to use the powerful ERNIEkit toolchain to form a complete closed-loop process, from downloading models and preparing data, to configuring and starting training, and finally to inference verification.
3.  **Mastered SFT configuration**: We thoroughly analyzed the parameters in the `run_sft_8k.yaml` configuration file, understood how to control and optimize the SFT process by adjusting parameters such as learning rate, batch size, sequence length, and optimization strategy, and learned key memory optimization techniques such as gradient accumulation, recomputation, and mixed precision.
4.  **Emphasized the importance of data**: We recognized that high-quality, diverse SFT data is the cornerstone of fine-tuning success and learned how to build datasets that meet ERNIEkit requirements.

Through SFT, we have endowed the ERNIE-4.5-0.3B model with new, customized capabilities, taking a solid step toward transforming it from a “well-rounded generalist” into a “specialized assistant.”

### 6.2 Future Learning and Exploration Directions

SFT has opened the door to the world of customized large models, but this is just the beginning. The technical stack of large language models is still evolving rapidly, and here are some directions worth exploring further:

*   **Parameter-Efficient Fine-Tuning (PEFT)**:
    *   **LoRA (Low-Rank Adaptation)**: While full-parameter fine-tuning yields good results, it consumes significant resources and requires storing a complete model copy for each task. LoRA is a PEFT method that freezes the original weights and injects small, trainable low-rank “adapter” matrices into certain layers of the model. This way, we only need to train and store a very small portion (typically less than 1%) of the parameters to achieve results close to full-parameter fine-tuning. ERNIEkit supports LoRA. You can try it by changing `fine_tuning: Full` to `fine_tuning: lora` in the configuration file and configuring the relevant parameters.

*   **Reinforcement Learning from Human Feedback (RLHF)**:
    *   SFT teaches the model to “obey,” but how can we make the model's responses more aligned with human preferences (e.g., more useful, less harmful, more interesting)? RLHF is the key technology to address this challenge. It collects data on human preferences for different model responses to train a “reward model,” then uses reinforcement learning algorithms (such as PPO) to optimize the language model so that the content it generates achieves higher reward scores. This is the necessary path toward more advanced and aligned AI.

*   **More advanced models and technologies**:  
    *   Keep an eye on the latest developments in Wenxin large models and ERNIEkit, exploring models with larger parameter scales, more advanced algorithms, and more efficient training frameworks.

*   **Model deployment and serviceization**:  
    *   Deploying your fine-tuned model as an online service is the final step in realizing its value. Explore tools like Paddle Serving and FastDeploy to learn how to perform model quantization and compression, and provide high-performance inference APIs.
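
As a first taste of the LoRA direction mentioned above, the sketch below shows the core idea in plain Python: a frozen weight matrix `W` is combined with a trainable low-rank update `B @ A`. The layer size (4096 x 4096) and rank (8) are hypothetical values chosen for illustration, not ERNIE-4.5-0.3B's actual dimensions or ERNIEkit's defaults.

```python
def lora_trainable_fraction(d_out, d_in, rank):
    """Trainable share when a d_out x d_in weight gets a rank-`rank` update."""
    full = d_out * d_in
    lora = rank * (d_out + d_in)  # B is d_out x rank, A is rank x d_in
    return lora / full


def lora_forward(W, A, B, x, scale=1.0):
    """Compute (W + scale * B @ A) @ x with plain lists; W itself is never modified."""
    def matvec(M, v):
        return [sum(m * vi for m, vi in zip(row, v)) for row in M]

    base = matvec(W, x)               # output of the frozen layer
    update = matvec(B, matvec(A, x))  # low-rank correction
    return [b + scale * u for b, u in zip(base, update)]


print(f"{lora_trainable_fraction(4096, 4096, 8):.4%}")

W = [[1, 2], [3, 4]]       # frozen weights
A = [[1, 1]]               # rank 1, trainable
B_zero = [[0], [0]]        # with B = 0 the adapter changes nothing
print(lora_forward(W, A, B_zero, [1, 1]))
```

Starting with `B` at zero means the adapted layer initially behaves exactly like the original one, and only the small `A` and `B` matrices need to be trained and stored.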

The era of large language models is filled with endless opportunities and challenges. We hope this tutorial serves as a solid starting point for your exploration of this exciting field. Through continuous learning and practice, you will be able to build increasingly powerful AI applications. Wishing you steady progress on your AI exploration journey!

# Feedback/Contact me: WeChat: G_Fuji