# Tutorial on pre-training large language models based on ERNIE 4.5-0.3B

## 1. Introduction

Welcome to this tutorial on pre-training the `ERNIE 4.5-0.3B` model using `ERNIEKit`! In the current AI wave, large language models (LLMs) are undoubtedly one of the brightest stars. As a leading large model in China, the Baidu ERNIE series of models has attracted widespread attention for its outstanding performance and continuous technological innovation.

### 1.1 What is large language model pre-training (PT)?

Imagine how we learn language. Before formally learning grammar rules and vocabulary meanings, we first extensively expose ourselves to and imitate the way people around us speak. Large language model pre-training (Pre-training, PT) is similar to this.

In simple terms, **pre-training refers to the process of training a model on large-scale unlabeled text data to enable it to learn the intrinsic patterns, grammatical structures, semantic information, and world knowledge of language.** At this stage, the model is not tailored for a specific task (such as downstream question-answering or translation) but focuses on building a foundational capability for general language understanding and generation. This is akin to laying a solid linguistic foundation for the model, enabling it to better adapt to various specific NLP tasks in the future.

The core task of pre-training is typically “next token prediction.” Given a text segment, the model must predict the most likely next word. Through repeated prediction exercises on massive text datasets, the model gradually learns word collocation patterns, syntactic structures, and even common sense and logical reasoning.

### 1.2 Why choose ERNIE 4.5-0.3B and ERNIEKit?

Before diving into practical applications, let's take some time to compare the `ERNIE 4.5` model and the `ERNIEKit` toolkit with other mainstream open-source solutions currently available. This will help us better understand their respective features and advantages.

#### 1.2.1 Overview of the ERNIE 4.5 Model Family

ERNIE 4.5 is not a single model but a family of models launched by Baidu, ranging from a lightweight 0.3B dense model to MoE models with hundreds of billions of parameters. To understand the positioning of the 0.3B model used in this tutorial, let's first look at the ERNIE 4.5 family as a whole:

| Model Category | Model Name (Partial) | Parameter Scale (Total Parameters/Activated Parameters) | Key Features | Application Scenarios |
| :--- | :--- | :--- | :--- | :--- |
| MoE Large Language Model | `ERNIE-4.5-300B-A47B` | ~300B / 47B | Flagship pure language model with top-tier performance | Complex tasks requiring extreme performance |
| MoE Large Language Model | `ERNIE-4.5-21B-A3B` | ~21B / 3B | High-performance, cost-effective MoE model | Scenarios requiring strong capabilities but with limited resources |
| Multimodal large model | `ERNIE-4.5-VL-424B-A47B`| ~424B / 47B | Supports text, image, and video input | Complex text-image-video understanding and generation |
| **Dense Language Model** | **`ERNIE-4.5-0.3B`** | **300 million (0.3B)** | **Lightweight, resource-friendly, standard architecture** | **Learning, experimentation, rapid validation, low-resource deployment** |

*Note: MoE (Mixture-of-Experts) is an advanced model architecture with a large total number of parameters, but only a portion of the parameters are activated when processing each input, enabling efficient inference. Dense models, on the other hand, utilize all parameters during each computation.*
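To make the note above concrete, here is a minimal sketch of generic top-k expert routing in NumPy. It is not ERNIE 4.5's actual router: the sizes, the single-matrix "experts", and the helper name `top_k_route` are all hypothetical and chosen only to show that each input touches a fraction of the expert parameters.

```python
import numpy as np

def top_k_route(router_logits, k):
    """Return the indices of the k highest-scoring experts and their softmax weights."""
    top_idx = np.argsort(router_logits)[::-1][:k]
    top_logits = router_logits[top_idx]
    weights = np.exp(top_logits - top_logits.max())
    weights /= weights.sum()
    return top_idx, weights

rng = np.random.default_rng(0)
num_experts, k, d_model = 8, 2, 4  # hypothetical toy sizes

# Each "expert" is a single weight matrix here; real experts are feed-forward networks.
experts = rng.normal(size=(num_experts, d_model, d_model))
router = rng.normal(size=(d_model, num_experts))

x = rng.normal(size=d_model)          # one token's hidden state
idx, w = top_k_route(x @ router, k)   # pick k experts for this token

# Only the selected experts are evaluated; the others are never touched.
y_moe = sum(wi * (x @ experts[i]) for i, wi in zip(idx, w))

active_fraction = k / num_experts
print(f"experts used: {sorted(idx.tolist())}, active expert fraction: {active_fraction:.2f}")
```

A dense model corresponds to the degenerate case where every parameter takes part in every forward pass.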

#### 1.2.2 Why use ERNIE 4.5-0.3B as an example?

From the table above, we can clearly see the unique positioning of `ERNIE 4.5-0.3B` within the family. In this tutorial, we have chosen it as the main focus based on the following considerations:

1.  **Designed for learning and experimentation**: Unlike MoE giant models that often require large clusters for training, the `0.3B` (300 million) parameter count is specifically designed for developers to learn, research, and conduct rapid experiments. Its resource consumption is “user-friendly,” allowing us to complete the entire process from pre-training to fine-tuning on a personal computer or a single server.
2. **Excellent resource efficiency**: We can run the entire process of this tutorial on a single consumer-grade GPU (such as RTX 3090/4090 with 24GB VRAM). This significantly lowers the hardware barrier for learning large-scale model technology.
3.  **Representative of standard architecture**: As a standard dense Transformer model, the internal structure of `ERNIE 4.5-0.3B` is easier to understand, without complex components like MoE. Mastering its training process means grasping the core principles of most standard large models, laying a solid foundation for understanding more complex models in the future.
4.  **Full Support from Official Toolchains**: `ERNIEKit` provides consistent, industrial-grade toolchain support for the entire series of models, including `0.3B`. Using it to learn the `0.3B` model is akin to driving a “training vehicle” in a “professional driving school,” enabling smooth transfer of learned skills to future operations of larger, more complex models.

Therefore, `ERNIE 4.5-0.3B` is the ideal “first stop” for entering the world of large language models.

#### 1.2.3 Comparison of mainstream development toolkits

With a good model, you also need the right tools. Large model development toolkits are responsible for handling data, driving training, executing inference, and a series of other complex processes.

| Features | **ERNIEKit** | **Hugging Face (transformers+trl)** | **LLaMA-Factory** |
| :--- | :--- | :--- | :--- |
| **Developer/Community** | Baidu | Hugging Face Community | hiyouga (Community) |
| **Core Framework** | PaddlePaddle | PyTorch, TensorFlow, JAX | PyTorch |
| **Main Features** | Full lifecycle: pre-training, SFT, DPO, LoRA, quantization (QAT), deployment | Core component library, supports almost all models and training methods, requires writing a lot of code | One-stop fine-tuning framework, configuration-driven, supports multiple efficient fine-tuning algorithms |
| **Usability** | Industrial-grade, command line + configuration files, provides WebUI, moderate learning curve | Highly flexible but requires strong coding skills, steep learning curve | Extremely user-friendly for beginners, provides WebUI, simple configuration, gentle learning curve |
| **Model Support** | Focuses on ERNIE series models, provides deep optimization | Supports the widest range of open-source models (Model Hub) | Supports mainstream models such as LLaMA, Qwen, ChatGLM, etc. |
| **Application Scenarios** | ERNIE model deep development, industrial-grade applications, PaddlePaddle technology stack | General model research, algorithm experiments, tasks requiring high customization | Quick setup, individual developers, education, efficient fine-tuning experiments |

#### 1.2.4 Why choose ERNIEKit?

As its name suggests, `ERNIEKit` is the official toolkit tailored for the ERNIE model family. Choosing it means:

*   **Optimal compatibility and performance**: ERNIEKit offers native and in-depth support for the ERNIE model series. It provides targeted optimizations to maximize the training and inference performance of the models within the PaddlePaddle framework.
*   **Industrial-grade end-to-end workflow**: It is not just a fine-tuning tool but covers the entire workflow from data processing, pre-training, supervised fine-tuning (SFT), preference alignment (DPO), parameter-efficient adjustment (LoRA), to quantization-aware training (QAT) and deployment. Learning ERNIEKit helps understand the complete picture of large-scale model industrialization.
* **Official maintenance and support**: As an official tool, its iterative updates are synchronized with the latest developments in ERNIE models, providing the most timely and authoritative technical support.

In this tutorial, we use `ERNIE 4.5-0.3B` as the example and work through pre-training with the low-level pre-training scripts provided in `ERNIEKit`. This combination (**official model + official tool**) is a direct route to understanding Baidu's ERNIE (Wenxin) large-model technology stack. Although the model and dataset used here are small, the core principles and workflow carry over to the pre-training of larger models.

### 1.3 Target Audience and Learning Outcomes of This Tutorial

This tutorial is primarily intended for:

* Beginners interested in pre-training large language models.
* Developers who wish to learn how to pre-train models using the ERNIEKit (PaddlePaddle) framework.
* Learners with a basic understanding of Python and deep learning who wish to gain hands-on experience with LLMs.

By the end of this tutorial, you will be able to:

* Understand the basic concepts and significance of large language model pre-training.
* Master the complete process of pre-training the ERNIE 4.5 model using ERNIEKit.
* Understand methods for preparing pre-training data.
* Learn how to configure the `yaml` file and launch the pre-training task.
* Conduct an initial analysis of pre-training results.
* Lay the groundwork for subsequent model fine-tuning and specific task applications.

Let’s embark on this exciting learning journey together!

## 2. Environment Preparation

Before embarking on our pre-training journey, we need to set up the correct development environment. This mainly involves installing the core deep learning framework PaddlePaddle and the ERNIEKit toolkit.

### 2.1 Installing PaddlePaddle and ERNIEKit

ERNIEKit has certain environment requirements. **We strongly recommend using the officially recommended Docker image** to avoid issues caused by inconsistent environments. If Docker is unavailable, ensure your local environment meets the prerequisites outlined in the official documentation (https://github.com/PaddlePaddle/ERNIE/blob/develop/docs/erniekit.md#21-prerequisites), such as CUDA >= 12.3, Python 3.10+, etc.

**1. Clone the ERNIE-develop repository**

The ERNIEKit code and all scripts are contained in the `ERNIE-develop` repository. First, we need to clone it to our local machine.

In [1]:
# Uncomment to clone into the directory name used throughout this tutorial:
# !git clone https://github.com/PaddlePaddle/ERNIE.git ERNIE-develop
%cd ERNIE-develop

/home/aistudio/ERNIE-develop


**2. Install dependencies**

All dependencies required by ERNIEKit are listed in the `requirements/gpu/requirements.txt` file.

In [None]:
!python -m pip install -r requirements/gpu/requirements.txt

**Verify installation**:

After installation, you can run the following code to verify that PaddlePaddle has been successfully installed and can correctly identify your GPU.

In [None]:
import paddle

# Check if GPU is available
try:
    paddle.utils.run_check()
    print("PaddlePaddle GPU is available!")
except Exception as e:
    print(f"PaddlePaddle GPU check failed: {e}")
    print("If you intended to use GPU, please check your CUDA setup and PaddlePaddle installation.")



Running verify PaddlePaddle program ... 
PaddlePaddle works well on 1 GPU.
PaddlePaddle is installed successfully! Let's start deep learning with PaddlePaddle now.
PaddlePaddle GPU is available!


I0717 10:30:30.392132   294 pir_interpreter.cc:1524] New Executor is Running ...
W0717 10:30:30.393514   294 gpu_resources.cc:114] Please NOTE: device: 0, GPU Compute Capability: 8.0, Driver API Version: 12.8, Runtime API Version: 12.6
I0717 10:30:30.394114   294 pir_interpreter.cc:1547] pir interpreter is running by multi-thread mode ...


At this point, our environment setup is complete! Next, we will prepare the data for pre-training.

## 3. Data Preparation

Large language models are “large” not only in terms of the number of parameters, but also in terms of the massive amount of data on which their training relies. The data used in the pre-training stage is the source from which the model learns language knowledge.

### 3.1 Pre-training Data Format (.bin / .idx)

To efficiently load and process large-scale text data, `ERNIEKit` and `PaddleNLP` use the standard binary `mmap` (memory-mapped file) format.

*   **.bin file**: This is a binary file that stores the sequence of token IDs obtained after all text data has been processed by the tokenizer. In simple terms, all text has been converted into numbers.
*   **.idx file**: This is an index file that records the starting position and length of each training sample (usually an article or document) in the `.bin` file. This allows for quick access to any data during training.

This data organization format has the following advantages:

* **Efficient reading**: The file content can be directly mapped in memory, avoiding a large number of disk I/O operations.
* **Memory saving**: It is not necessary to load all data into memory at once; data can be read on demand.
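The idea behind the `.bin` / `.idx` pair can be illustrated with a small sketch. The layout below (raw `uint16` token IDs in `.bin`, `int64` document lengths in `.idx`) is a simplified assumption for illustration; the real PaddleNLP files use their own layout, and the helper names are hypothetical.

```python
import os
import tempfile
import numpy as np

def write_dataset(prefix, documents, dtype=np.uint16):
    """Write all token IDs back to back into <prefix>.bin and document lengths into <prefix>.idx."""
    lengths = np.array([len(doc) for doc in documents], dtype=np.int64)
    np.concatenate([np.asarray(doc, dtype=dtype) for doc in documents]).tofile(prefix + ".bin")
    lengths.tofile(prefix + ".idx")

def read_document(prefix, doc_id, dtype=np.uint16):
    """Memory-map the .bin file and copy out only the slice that belongs to one document."""
    lengths = np.fromfile(prefix + ".idx", dtype=np.int64)
    starts = np.concatenate([[0], np.cumsum(lengths)[:-1]])
    tokens = np.memmap(prefix + ".bin", dtype=dtype, mode="r")  # nothing is loaded yet
    start = int(starts[doc_id])
    return np.array(tokens[start:start + int(lengths[doc_id])])

with tempfile.TemporaryDirectory() as tmp:
    prefix = os.path.join(tmp, "toy_corpus")
    write_dataset(prefix, [[5, 6, 7], [8, 9], [10, 11, 12, 13]])
    second_doc = read_document(prefix, 1)
    print(second_doc)
```

Because the index gives the start and length of every document, any sample can be fetched without scanning or loading the rest of the file.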

### 3.2 Obtaining pre-training data

To help users get started quickly, PaddleNLP officially provides a processed subset of the OpenWebTextCorpus dataset, containing approximately 100,000 articles. We will use this dataset as an example in this tutorial.

In [None]:
# Create a data folder in the ERNIE-develop directory
!mkdir -p data

In [None]:
# Download data file
!wget https://bj.bcebos.com/paddlenlp/models/transformers/llama/data/llama_openwebtext_100k.bin -P ./data/
!wget https://bj.bcebos.com/paddlenlp/models/transformers/llama/data/llama_openwebtext_100k.idx -P ./data/

--2025-07-17 10:32:02--  https://bj.bcebos.com/paddlenlp/models/transformers/llama/data/llama_openwebtext_100k.bin
Resolving bj.bcebos.com (bj.bcebos.com)... 100.67.184.196, 100.64.80.160, 100.67.184.48
Connecting to bj.bcebos.com (bj.bcebos.com)|100.67.184.196|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 246736546 (235M) [application/octet-stream]
Saving to: './data/llama_openwebtext_100k.bin'


2025-07-17 10:32:04 (123 MB/s) - './data/llama_openwebtext_100k.bin' saved [246736546/246736546]

--2025-07-17 10:32:04--  https://bj.bcebos.com/paddlenlp/models/transformers/llama/data/llama_openwebtext_100k.idx
Resolving bj.bcebos.com (bj.bcebos.com)... 100.67.184.48, 100.64.80.160, 100.67.184.196
Connecting to bj.bcebos.com (bj.bcebos.com)|100.67.184.48|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 2000042 (1.9M) [application/octet-stream]
Saving to: './data/llama_openwebtext_100k.idx'


2025-07-17 10:32:04 (126 MB/s

After the download is complete, there should be two files in your `./data/` directory:
*   `llama_openwebtext_100k.bin`
*   `llama_openwebtext_100k.idx`
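A quick way to confirm both files arrived intact is to check that they exist and compare their sizes with the `wget` output above (246736546 bytes for `.bin`, 2000042 bytes for `.idx`). The helper name `check_mmap_dataset` is hypothetical; the sketch assumes it is run from the `ERNIE-develop` directory.

```python
import os

def check_mmap_dataset(prefix):
    """Return {suffix: size in bytes, or None if missing} for the .bin/.idx pair sharing `prefix`."""
    sizes = {}
    for suffix in (".bin", ".idx"):
        path = prefix + suffix
        sizes[suffix] = os.path.getsize(path) if os.path.isfile(path) else None
    return sizes

sizes = check_mmap_dataset("./data/llama_openwebtext_100k")
for suffix, size in sizes.items():
    status = f"{size} bytes" if size is not None else "MISSING"
    print(f"llama_openwebtext_100k{suffix}: {status}")
```

A missing or much smaller file usually means the download was interrupted and should be repeated.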

**Important Note:** These data files were originally prepared for the LLaMA model. The `mmap` container format itself is model-agnostic, so ERNIEKit can read the files, which is sufficient for demonstrating the training pipeline. Keep in mind that the stored token IDs were produced by the LLaMA tokenizer; for real training, preprocess your corpus with the ERNIE 4.5 tokenizer (see section 3.3) so that the IDs match the model's vocabulary.

### 3.3 (Optional) Using Custom Data

If you want to use your own text data, you need to convert it to `.bin` and `.idx` formats. The `PaddleNLP` repository (note that this is not the ERNIE-develop repository) provides a data preprocessing script `llm/tools/preprocess_data.py`.

The general process is as follows:
1.  **Collect** your raw text data (e.g., `.txt` or `.jsonl` files).
2.  **Run the script**: Use the `preprocess_data.py` script, specifying the `tokenizer_name_or_path` corresponding to the model (e.g., use the tokenizer from `PaddlePaddle/ERNIE-4.5-0.3B-Paddle`), to convert your text files into `.bin` and `.idx` files.
    ```bash
    # Example command (must be run in the PaddleNLP repository and its dependencies must be installed)
    # python llm/tools/preprocess_data.py \
    #     --model_name_or_path PaddlePaddle/ERNIE-4.5-0.3B-Paddle \
    #     --input_path /path/to/your/text_files.list \
    #     --output_prefix /path/to/your/output/my_data
    ```
For detailed usage, please refer to the relevant documentation in the PaddleNLP repository. For this tutorial, we will directly use the pre-downloaded data.

Now that the data is ready, we will briefly introduce the basic principles of large model pre-training.

## 4. Introduction to the Principles of Model Pre-training

Before we start running the code, it is helpful to take some time to understand the basic principles behind pre-training.

### 4.1 Core Idea: Next Token Prediction

The most core and classic task in large language model pre-training is **Next Token Prediction**, also known as Causal Language Modeling (CLM).

**In simple terms, given a text sequence, the model's objective is to predict the next token in the sequence.**  

**For example:**  

Suppose we have the following sentence: "The weather is really nice today, let's go to the park together."

1. When the model sees: `["The"]` -> It needs to predict: `"weather"`
2. When the model sees: `["The", "weather"]` -> It needs to predict: `"is"`
3. When the model sees: `["The", "weather", "is"]` -> It needs to predict: `"really"`
4. When the model sees: `["The", "weather", "is", "really"]` -> It needs to predict: `"nice"`
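The steps above can be reproduced in a few lines of Python. This is a simplified sketch that treats whole words as tokens; a real tokenizer splits text into subword units and works with integer IDs.

```python
def next_token_pairs(tokens):
    """Yield (context, target) pairs: each prefix predicts the token that follows it."""
    for i in range(1, len(tokens)):
        yield tokens[:i], tokens[i]

sentence = ["The", "weather", "is", "really", "nice"]
pairs = list(next_token_pairs(sentence))
for context, target in pairs:
    print(f"{context} -> {target!r}")

# In practice all pairs are trained in a single pass: the inputs are the sequence
# without its last token, and the labels are the same sequence shifted left by one.
inputs, labels = sentence[:-1], sentence[1:]
```

The shifted `inputs` / `labels` form is why one sequence of length N yields N-1 training signals at once.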

The model is exposed to a massive amount of text and plays this “word prediction game” over and over on that data. By comparing its prediction with the actual next token in the text, the model computes an error (the loss) and then adjusts its internal parameters (weights) to make better predictions.

Although this process is simple, when the amount of data is large enough and the model has enough parameters, the only way for the model to keep reducing its prediction error is to learn:

*   **Vocabulary knowledge**: which words frequently appear together.
*   **Grammatical structure**: subject-verb-object, tense, etc.
*   **Semantic coherence**: how the context should flow.
*   **Common sense knowledge**: for example, “the sky is blue.”

### 4.2 Model Architecture: Transformer

Currently, almost all mainstream large language models, including ERNIE 4.5, are based on the **Transformer** architecture. For the Next Token Prediction task, the **Decoder-Only** architecture of the Transformer is typically used.

Its core components include:
1.  **Embedding Layer**: Converts input token IDs into vectors that capture semantic meaning.
2.  **Positional Encoding**: Injects positional information about tokens within the sequence into the model.
3.  **Multi-Head Self-Attention**: The core of the Transformer. It lets the model dynamically weigh information from other tokens in the sequence while processing each token; in a decoder-only model, a causal mask restricts this to the current and preceding tokens.
4.  **Feed-Forward Network**: Performs further nonlinear transformations to enhance the model's expressive capabilities.
5.  **Layer Normalization and Residual Connections**: Stabilize the training process, enabling the training of deeper networks.

By stacking multiple such Transformer blocks (e.g., ERNIE 4.5-0.3B has 18 layers), the model can learn highly complex language patterns.
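To illustrate how self-attention works in a decoder-only model, here is a minimal single-head sketch in NumPy. It omits multiple heads, positional encoding, and everything specific to ERNIE; the toy sizes and random weights are assumptions for illustration only.

```python
import numpy as np

def causal_self_attention(x, w_q, w_k, w_v):
    """Single-head self-attention where position i may only attend to positions <= i."""
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    scores = q @ k.T / np.sqrt(q.shape[-1])
    future = np.triu(np.ones_like(scores, dtype=bool), k=1)  # True above the diagonal
    scores = np.where(future, -np.inf, scores)               # hide future tokens
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v, weights

rng = np.random.default_rng(0)
seq_len, d_model = 5, 8  # hypothetical toy sizes
x = rng.normal(size=(seq_len, d_model))
w_q, w_k, w_v = (rng.normal(size=(d_model, d_model)) for _ in range(3))

out, attn = causal_self_attention(x, w_q, w_k, w_v)

# Changing the last token must not change the outputs at earlier positions.
x_changed = x.copy()
x_changed[-1] += 1.0
out_changed, _ = causal_self_attention(x_changed, w_q, w_k, w_v)
print(np.allclose(out[:-1], out_changed[:-1]))
```

The causal mask is what makes next token prediction well defined: the prediction at each position cannot peek at the token it is supposed to predict.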

### 4.3 Loss Function: Cross-Entropy

After the model predicts the probability distribution of the next word, we use **cross-entropy loss** to measure the accuracy of the prediction. It compares the **probability distribution predicted by the model** with the **actual next word** to determine the “difference” between them. The goal of training is to continuously adjust the model's parameters to minimize the total cross-entropy loss across the entire training dataset.  
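As a small worked example, the sketch below computes the cross-entropy loss from raw logits in NumPy. The vocabulary size and target IDs are hypothetical. It also shows a useful reference point: a model that guesses uniformly over a vocabulary of size V has a loss of ln V.

```python
import numpy as np

def cross_entropy(logits, targets):
    """Mean negative log-probability assigned to the correct next token."""
    logits = logits - logits.max(axis=-1, keepdims=True)  # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=-1, keepdims=True))
    return float(-log_probs[np.arange(len(targets)), targets].mean())

vocab_size = 1000                  # hypothetical vocabulary size
targets = np.array([3, 17, 256])   # hypothetical correct next-token IDs

# A model with no knowledge: every token is equally likely.
uniform_logits = np.zeros((3, vocab_size))
loss_uniform = cross_entropy(uniform_logits, targets)

# A model that puts most of its probability on the correct token.
confident_logits = np.zeros((3, vocab_size))
confident_logits[np.arange(3), targets] = 10.0
loss_confident = cross_entropy(confident_logits, targets)

print(f"uniform guess:         {loss_uniform:.3f} (ln {vocab_size} = {np.log(vocab_size):.3f})")
print(f"confident and correct: {loss_confident:.3f}")
```

Training pushes the model from the first situation toward the second, which is why a falling loss indicates learning.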

With an understanding of these basic principles, we can now proceed with greater confidence into practical implementation!  

## 5. Start pre-training

With the theoretical knowledge in place and the data ready, it's time to put theory into practice and launch our first ERNIE 4.5 pre-training task!

**Important prerequisites:**
1.  Please ensure that you have installed all dependencies according to the instructions in the **“2. Environment Preparation”** section.
2. Please ensure that you have downloaded the data files and placed them in the `./data/` directory according to the instructions in the **“3. Data Preparation”** section.
3. **Working Directory**: All subsequent commands assume that your current terminal working directory is `ERNIE-develop`.

### 5.1 Pre-training Script and Configuration File

The core pre-training script in `ERNIEKit` is `./examples/pre-training/ernie/pretrain.py`. We use a `yaml` configuration file to tell this script which model to use, what data to use, and how to train.

The `pretrain_96_gpus.yaml` file in the original repository is designed for large MoE models on multi-node clusters and cannot be used directly. We need to create a simplified configuration file specifically for training the `ERNIE 4.5-0.3B` model on a single GPU.

**Create the configuration file `pretrain_ernie_0.3b_demo.yaml`**

Create a new `yaml` file named `pretrain_ernie_0.3b_demo.yaml` in the `ERNIE-develop` directory and copy the following content into it:

```yaml
# Example configuration file for ERNIE 4.5-0.3B single-card pre-training

# ---------------------------model args---------------------------#
model_args:
    model_name_or_path: "../../../data/models/30654/ERNIE-4.5-0.3B-Base-Paddle" # Model name or local path; the mounted local copy is used here
    tokenizer_name: "../../../data/models/30654/ERNIE-4.5-0.3B-Base-Paddle" # Specify the tokenizer
    output_dir: ./output_pretrain/ # Output directory for model weights, logs, etc. during training
    max_seq_length: 1024 # Maximum sequence length processed by the model, adjustable based on GPU memory

# ---------------------------trainer args---------------------------#
trainer_args:
    input_dir: "1.0 ../../data/llama_openwebtext_100k 1.0 ../../data/llama_openwebtext_100k" # Weights and paths of training and validation datasets
    split: "998,1,1" # Split ratio for training, validation, and test sets

    do_train: True
    dataloader_num_workers: 0 # Recommended to set to 0 on Windows; can be increased appropriately on Linux
    disable_tqdm: True
    logging_steps: 10 # Print logs every 10 steps
    eval_steps: 100 # Evaluate every 100 steps
    save_steps: 200 # Save model weights (checkpoint) every 200 steps
    max_steps: 400 # Maximum number of training steps, for quick demonstration

    # --- Learning rate and optimizer ---
    adam_beta1: 0.9
    adam_beta2: 0.95
    adam_epsilon: 1e-8
    learning_rate: 1e-4
    min_lr: 1e-5
    lr_scheduler: "wsd_cosine" # Use a cosine annealing learning rate strategy with warmup
    max_grad_norm: 1.0
    weight_decay: 0.1
    warmup_steps: 50 # Learning rate warmup steps

    # --- Batch size and memory ---
    gradient_accumulation_steps: 8 # Gradient accumulation steps
    per_device_train_batch_size: 1 # Batch size per card, actual batch_size = per_device_train_batch_size * gradient_accumulation_steps
    per_device_eval_batch_size: 2
    head_dim: 128 # Attention head dimension size, must be consistent with the pre-trained model

    # --- Performance and accuracy ---
    bf16: False # Whether to enable BF16. Set to False on GPUs without BF16 support (generally pre-Ampere cards)
    fp16: True  # Whether to enable FP16 (mixed precision) training, which works on consumer GPUs such as the 30/40 series
    fp16_opt_level: "O1" # FP16 optimization level, O1 is more stable

    # --- Distributed parameters (set to 1 or default for single-card training) ---
    moe_group: "dummy" # For non-MoE models, set to dummy to disable MoE-related logic
    pipeline_parallel_degree: 1
    tensor_parallel_degree: 1
    sharding: "" # Do not use sharding for single-card training

    # --- Other ---
    seed: 42
    save_total_limit: 2 # Maximum number of checkpoints to save
    overwrite_output_dir: true # Overwrite the output directory
```

**Key Parameter Explanation**:
*   `model_name_or_path`: Specifies the model we want to train. ERNIEKit will automatically handle the download.
*   `output_dir`: The location where all training products (model weights, logs) are saved.
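*   `input_dir`: Points to the downloaded `.bin` and `.idx` data files, written as space-separated `weight path` pairs. **Note**: Omit the file extensions; give only the path prefix shared by the two files (e.g., `../../data/llama_openwebtext_100k`).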
* `max_seq_length`: The maximum text length the model can process. The larger this value, the higher the GPU memory usage. 1024 is safe for most GPUs with 24GB of memory.
* `max_steps`, `save_steps`, `logging_steps`: Control the total number of training steps, save frequency, and logging frequency. We set these to smaller values to see results quickly.
* `per_device_train_batch_size` & `gradient_accumulation_steps`: These are key parameters for controlling the **effective batch size** and **GPU memory usage**. `per_device_train_batch_size` specifies the number of samples per forward pass, directly affecting GPU memory usage. By accumulating gradients over `gradient_accumulation_steps` steps before updating the model, we can achieve an effective batch size of `1 * 8 = 8` without increasing memory usage.
*   `fp16: True`: Enables mixed-precision training, significantly reducing memory usage and accelerating training, which is critical for consumer-grade GPUs.
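The batch-size arithmetic above can be checked directly. The first part of this sketch uses the values from the demo configuration; `num_devices = 1` reflects the single-GPU setup. The second part is an illustrative toy (a linear model with squared error, not the ERNIE trainer) showing that averaging micro-batch gradients reproduces the full-batch gradient.

```python
import numpy as np

# Values from the demo configuration
per_device_train_batch_size = 1
gradient_accumulation_steps = 8
max_seq_length = 1024
max_steps = 400
num_devices = 1

effective_batch = per_device_train_batch_size * gradient_accumulation_steps * num_devices
tokens_per_step = effective_batch * max_seq_length  # upper bound per optimizer step
total_tokens = tokens_per_step * max_steps          # upper bound for the whole run
print(effective_batch, tokens_per_step, total_tokens)

# Toy check: accumulated micro-batch gradients equal the full-batch gradient.
rng = np.random.default_rng(0)
w = rng.normal(size=4)
X = rng.normal(size=(effective_batch, 4))
y = rng.normal(size=effective_batch)

def grad(w, X, y):
    """Gradient of 0.5 * mean((Xw - y)^2) with respect to w."""
    return X.T @ (X @ w - y) / len(y)

full_batch_grad = grad(w, X, y)
accumulated = np.zeros_like(w)
for i in range(gradient_accumulation_steps):
    # One micro-batch of per_device_train_batch_size samples
    sl = slice(i * per_device_train_batch_size, (i + 1) * per_device_train_batch_size)
    accumulated += grad(w, X[sl], y[sl]) / gradient_accumulation_steps
print(np.allclose(full_batch_grad, accumulated))
```

Only one micro-batch is held in memory at a time, which is why accumulation raises the effective batch size without raising peak activation memory.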

### 5.2 Downloading the model

Before starting training, we need to download the `ERNIE-4.5-0.3B-Paddle` model file to our local machine. You can use `aistudio-sdk` (recommended) or `huggingface-cli`.

**First, install aistudio-sdk:**

In [None]:
!pip install --upgrade aistudio-sdk

**Then download the model:**

In [None]:
!aistudio download --model PaddlePaddle/ERNIE-4.5-0.3B-Paddle --local_dir baidu/ERNIE-4.5-0.3B-Paddle

After the download completes, the model is stored in `baidu/ERNIE-4.5-0.3B-Paddle`. To train from this copy, set `model_name_or_path` and `tokenizer_name` in the configuration file to that directory (relative to the directory from which training is launched). If the download times out, retry it or use the model mounted in [this project](https://aistudio.baidu.com/project/edit/9382861): `/home/aistudio/data/models/30654/ERNIE-4.5-0.3B-Base-Paddle`.


This project will use the mounted model for demonstration purposes!

### 5.3 Starting Single-Card Pre-training

Once everything is ready, a few source patches are needed before single-card pre-training can run. Apply them first, then execute the launch command at the end of this section.

**Comment out the Fp8 imports**

In each of the following files, comment out the import of Fp8:

- `/home/aistudio/ERNIE-develop/examples/pre-training/models/moe/token_dispatcher/fp8_utils.py`
- `/home/aistudio/ERNIE-develop/examples/pre-training/models/fp8_linear.py`
- `/home/aistudio/ERNIE-develop/examples/pre-training/models/ernie/modeling.py`

**Fix MoE parameter error (if encountered)**

If you encounter errors such as `KeyError: 'moe_group'` while running the script, it is because the script assumes that the configuration file contains the `moe_group` parameter (which is designed for MoE models), whereas a simplified configuration for a dense model such as `ERNIE-4.5-0.3B` may not include it.

**Quick fix steps**:
1. Open the file `ERNIE-develop/examples/pre-training/ernie/pretrain.py` (using a text editor).
2. Find this line (approximately line 394): `if trainer_args["moe_group"].lower() in {"mp", "tp", "model", "dummy"}:`
3. Replace it with the following two lines, which add a check to avoid the `KeyError`: `moe_group = trainer_args.get("moe_group", None)` followed by `if moe_group and moe_group.lower() in {"mp", "tp", "model", "dummy"}:`
4. Add `self.head_dim = None` to the ErnieMoEConfig class in `/home/aistudio/ERNIE-develop/examples/pre-training/models/ernie/configuration.py` (approximately line 247).
5. Add the following to approximately line 1502 in `/home/aistudio/ERNIE-develop/examples/pre-training/models/ernie/modeling_moe.py` (reason: the official code uses the float64 type when calculating the sin and cos caches for RoPE, while the example model expects float32):
```
            cos_cached = np.cos(emb)[:, :].astype("float32")
            sin_cached = np.sin(emb)[:, :].astype("float32")
```
6. In `/home/aistudio/ERNIE-develop/examples/pre-training/ernie/src/trainers/pretraining_trainer.py`, guard the distributed calls so that they only run in a distributed environment. Around line 1096, remove:
```python
dist.all_reduce(tr_loss, dist.ReduceOp.SUM)
tr_loss_scalar = tr_loss.item() / dist.get_world_size()
```
Add:
```python
            if self.args.world_size > 1:
                dist.all_reduce(tr_loss, dist.ReduceOp.SUM)
                tr_loss_scalar = tr_loss.item() / dist.get_world_size()
            else:
                tr_loss_scalar = tr_loss_single_dp_scalar

```

Remove (approximately line 1150):
```python
dist.all_reduce(numel_tensor)
```
Add:

```python
                if self.args.world_size > 1:
                    dist.all_reduce(numel_tensor)
```

Remove (approximately line 1212):
```python
paddle.distributed.barrier()
```

Add:
```python
if self.args.world_size > 1:
    paddle.distributed.barrier()
```
7. Save the files and rerun the startup command.

This modification is safe and will not affect our training because we do not use the MoE feature.
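The two defensive patterns used in these patches can be shown in isolation. The function names below are hypothetical, and `all_reduce_sum` is only a stand-in for a collective such as `dist.all_reduce`; this is a sketch of the idea, not the ERNIEKit code.

```python
def resolve_moe_group(trainer_args):
    """Return the lower-cased moe_group if it is set, or None when the key is absent or empty."""
    moe_group = trainer_args.get("moe_group", None)  # .get avoids KeyError
    return moe_group.lower() if moe_group else None

def average_loss(local_loss, world_size, all_reduce_sum=None):
    """Average the loss across processes, skipping communication on a single device."""
    if world_size > 1:
        return all_reduce_sum(local_loss) / world_size
    return local_loss

print(resolve_moe_group({"learning_rate": 1e-4}))  # key missing, no exception
print(resolve_moe_group({"moe_group": "Dummy"}))
print(average_loss(2.5, world_size=1))             # no collective call needed
```

Both guards leave multi-GPU behaviour unchanged and only add a safe path for the single-GPU, dense-model case.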

In [2]:
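# 1. First, change the working directory to the pre-training folder
%cd ./examples/pre-training/

# 2. Then launch the training script from this directory
#    Note: because the working directory has changed, the config file path uses ../../ to get back to the project root
!python -u ernie/pretrain.py --config ../../pretrain_ernie_0.3b_demo.yaml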

/home/aistudio/ERNIE-develop/examples/pre-training
W0717 18:20:34.138628  2289 gpu_resources.cc:114] Please NOTE: device: 0, GPU Compute Capability: 8.0, Driver API Version: 12.8, Runtime API Version: 12.6
[INFO] 2025-07-17 18:20:34,169 [trainer_utils.py:   71]:    The Training Main Process Started Successfully. time: 2025-07-17 18:20:34, pid: 2289
[INFO] 2025-07-17 18:20:34,633 [process_utils.py:  180]:    Check affinity before setting: {0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99, 100, 101, 102, 103, 104, 105, 106, 107, 108, 109, 110, 111, 112, 113, 114, 115, 116, 117, 118, 119, 120, 121, 122, 123, 124, 125, 126, 127}
[INFO] 2025-07-17 18:20

### 5.4 Observing the Training Process

Once training begins, log lines are printed to the terminal every `logging_steps` steps (specific values will vary between runs).

**Key Log Interpretation:**
*   `global_step`: The current total number of training steps.
*   `loss`: The loss value calculated for the current step. **This is a very important metric!** We expect to see `loss` gradually decrease as training progresses.
*   `learning_rate`: The current learning rate. You can observe it gradually increasing within the `warmup_steps`.
*   `speed`: Training speed, indicating the number of steps processed per second.
*   `vram`: VRAM usage during training (MB).
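To get a feel for how `learning_rate` evolves in the log, the sketch below implements a generic linear-warmup plus cosine-decay schedule using the values from the demo configuration. It is an assumption about the general shape only; the actual `wsd_cosine` scheduler in ERNIEKit may differ (for example, by including a constant phase), and `lr_at` is a hypothetical helper.

```python
import math

def lr_at(step, peak_lr=1e-4, min_lr=1e-5, warmup_steps=50, max_steps=400):
    """Linear warmup to peak_lr, then cosine decay to min_lr at max_steps."""
    if step < warmup_steps:
        return peak_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, max_steps - warmup_steps)
    progress = min(progress, 1.0)
    return min_lr + 0.5 * (peak_lr - min_lr) * (1.0 + math.cos(math.pi * progress))

for step in (0, 25, 49, 50, 225, 400):
    print(f"step {step:3d}: lr = {lr_at(step):.2e}")
```

Comparing the logged `learning_rate` against such a curve is a quick sanity check that `warmup_steps`, `learning_rate`, and `min_lr` were picked up from the configuration.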

**Expected Phenomena:**
*   **Loss Decrease**: This is the most critical indicator, signifying that the model is learning from the data.
*   **VRAM usage**: For the `ERNIE-4.5-0.3B` model, under the above configuration, VRAM usage is expected to be around 10GB. If a `CUDA out of memory` error occurs, VRAM is insufficient and the configuration needs to be adjusted (see the “7. Advanced Topics and Common Issues” section below).
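A rough back-of-the-envelope estimate helps interpret the VRAM figure. The sketch below assumes FP32 weights, FP32 gradients, and two FP32 Adam moment buffers (16 bytes per parameter) and uses the nominal 0.3B parameter count; activations, temporary buffers, and framework overhead are deliberately excluded, so this is a lower bound rather than a prediction.

```python
def training_state_gib(num_params, bytes_weights=4, bytes_grads=4, bytes_adam=8):
    """Memory for weights, gradients, and Adam moments in GiB (activations excluded)."""
    return num_params * (bytes_weights + bytes_grads + bytes_adam) / 1024**3

nominal_params = 0.3e9  # nominal "0.3B" figure; the exact parameter count may differ
state_gib = training_state_gib(nominal_params)
print(f"weights + gradients + Adam moments: ~{state_gib:.1f} GiB")
```

The remainder of the observed usage comes mostly from activations, which grow with `max_seq_length` and `per_device_train_batch_size`; those are therefore the settings to reduce when memory runs out.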

## 6. Result Analysis and Interpretation

When training reaches the `max_steps` set in the configuration file, the pre-training process is complete. All training outputs are saved in the directory given by `output_dir` in the configuration file, i.e., `./output_pretrain/`.

After training is complete, you can view the contents of this directory:

In [3]:
ls -lh ./output_pretrain/

total 1.6G
-rw-r--r-- 1 aistudio aistudio  23K Jul 17 18:24 added_tokens.json
-rw-r--r-- 1 aistudio aistudio  179 Jul 17 18:24 all_results.json
drwxr-xr-x 1 aistudio aistudio 4.0K Jul 17 17:25 checkpoint-200/
drwxr-xr-x 2 aistudio aistudio 4.0K Jul 17 18:24 checkpoint-400/
-rw-r--r-- 1 aistudio aistudio 4.1K Jul 17 18:24 config.json
-rw-r--r-- 1 aistudio aistudio  167 Jul 17 18:24 generation_config.json
drwxr-xr-x 1 aistudio aistudio 4.0K Jul 17 15:47 log/
-rw-r--r-- 1 aistudio aistudio 1.6G Jul 17 18:24 model_state.pdparams
drwxr-xr-x 1 aistudio aistudio 4.0K Jul 17 18:20 runs/
-rw-r--r-- 1 aistudio aistudio  16K Jul 17 18:24 special_tokens_map.json
-rw-r--r-- 1 aistudio aistudio 9.8K Jul 17 18:24 static_name_to_dyg_name.json
-rw-r--r-- 1 aistudio aistudio 1.6M Jul 17 18:24 tokenizer.model
-rw-r--r-- 1 aistudio aistudio 132K Jul 17 18:24 tokenizer_config.json
-rw-r--r-- 1 aistudio aistudio  179 Jul 17 18:24 train_result

**Key files and directories explained:**
*   **Root directory files**: These contain the final model and configuration at the end of training (when `max_steps` is reached).
*   **`checkpoint-<step_number>/`**: These are checkpoint directories saved periodically according to the `save_steps` parameter. Each checkpoint contains the complete model state at that point in time, which can be used to resume training from that point or load the model at that stage for evaluation.
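*   **`visualdl/`**: VisualDL log files. You can start a web service with `visualdl --logdir ./output_pretrain/visualdl` to view metrics such as the loss curve. Note that the listing above contains no `visualdl/` directory; the logging directory there appears to be `runs/`, so if `visualdl/` is missing from your output, point `--logdir` at that directory instead.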
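When resuming training or evaluating an intermediate model, you usually want the most recent checkpoint. The sketch below only relies on the `checkpoint-<step_number>/` naming shown above; the helper name `latest_checkpoint` is hypothetical and is not part of `ERNIEKit`.

```python
import re
from pathlib import Path
from typing import Optional

# Matches directory names such as "checkpoint-200" and captures the step.
CHECKPOINT_RE = re.compile(r"^checkpoint-(\d+)$")


def latest_checkpoint(output_dir: str) -> Optional[Path]:
    """Return the checkpoint directory with the highest step, or None."""
    root = Path(output_dir)
    if not root.is_dir():
        return None
    best_step, best_path = -1, None
    for child in root.iterdir():
        match = CHECKPOINT_RE.match(child.name)
        if match and child.is_dir():
            step = int(match.group(1))
            # Compare steps numerically so checkpoint-1000 beats checkpoint-400.
            if step > best_step:
                best_step, best_path = step, child
    return best_path


print(latest_checkpoint("./output_pretrain"))
```

For the directory listing above, this would select `checkpoint-400`. Comparing the step numbers as integers matters: a plain string sort would rank `checkpoint-400` above `checkpoint-1000`.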

**Preliminary evaluation of pre-training effectiveness:**
For this brief tutorial, the primary focus is on whether the **`loss` shows a clear downward trend**. For example, a decrease from an initial range of 9-10 to 6-7 can be considered a successful basic pre-training process, indicating that the model has learned some statistical patterns of the language. To achieve a truly powerful model, longer training on a larger dataset is required.
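A single loss value is noisy, so it helps to compare averages rather than individual steps. The following sketch checks for a downward trend by comparing the mean of the first and last few logged losses. The function name, the `window` and `min_drop` defaults, and the sample values are all hypothetical; collecting the loss values from your own log is left out because the exact log format is not shown here.

```python
from statistics import mean
from typing import Sequence


def loss_is_decreasing(
    losses: Sequence[float], window: int = 5, min_drop: float = 0.5
) -> bool:
    """True if the mean of the last `window` losses is at least `min_drop`
    below the mean of the first `window` losses."""
    if len(losses) < 2 * window:
        raise ValueError("need at least 2 * window loss values")
    start = mean(losses[:window])
    end = mean(losses[-window:])
    return start - end >= min_drop


# Hypothetical values for illustration only, not taken from a real run.
example_losses = [9.8, 9.5, 9.6, 9.1, 8.9, 8.2, 7.9, 7.5, 7.1, 6.9, 6.8, 6.6]
print(loss_is_decreasing(example_losses))
```

Averaging over a window keeps one unusually easy or hard batch from flipping the verdict.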

## 7. Advanced Topics and Common Issues

### 7.1 Insufficient Memory (CUDA Out of Memory)

This is the most common issue. Solutions:
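1.  **Trade batch size for gradient accumulation**: Lower `per_device_train_batch_size` and raise `gradient_accumulation_steps` in the configuration file (e.g., to 16 or 32) by the same factor. The effective batch size stays the same while fewer samples are held in VRAM at once. Note that raising `gradient_accumulation_steps` alone does not reduce VRAM usage; it only compensates for the smaller per-device batch.
2.  **Reduce the batch size**: Set `per_device_train_batch_size` to a smaller value. It is usually already set to 1, in which case this option is exhausted.
3.  **Reduce the sequence length**: Reduce `max_seq_length` (e.g., to 512), but this will limit the model's ability to learn long-range dependencies.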
4.  **Use more aggressive optimization**: `ERNIEKit` supports memory optimization techniques such as recomputation (Recompute), which discards intermediate activations and recomputes them during the backward pass. It can be enabled in the configuration file by setting `recompute: True` (possibly under `model_args`), at the cost of longer computation time.
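The relationship between these settings is simple arithmetic. Peak VRAM is driven mainly by how many tokens are processed in one forward/backward pass, while gradient accumulation only determines how many such passes contribute to one optimizer update. The sketch below illustrates this; the parameter names mirror the configuration keys mentioned above, while the function names and all numeric values are hypothetical.

```python
def effective_batch_size(
    per_device_train_batch_size: int,
    gradient_accumulation_steps: int,
    num_devices: int = 1,
) -> int:
    """Number of samples that contribute to one optimizer update."""
    return per_device_train_batch_size * gradient_accumulation_steps * num_devices


def tokens_per_forward_pass(per_device_train_batch_size: int, max_seq_length: int) -> int:
    """Tokens held on one device in a single forward/backward pass.
    This, not the accumulation count, is what mostly drives activation memory."""
    return per_device_train_batch_size * max_seq_length


# Hypothetical settings: same effective batch size, very different memory pressure.
before = dict(per_device_train_batch_size=4, gradient_accumulation_steps=8)
after = dict(per_device_train_batch_size=1, gradient_accumulation_steps=32)

print(effective_batch_size(**before), effective_batch_size(**after))
print(tokens_per_forward_pass(4, 1024), tokens_per_forward_pass(1, 1024))
```

Both settings update the model with the same number of samples per step, but the second holds a quarter of the tokens in memory at a time.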

## 8. Summary and Outlook

Congratulations! By completing this tutorial, you have:

*  Gained a comprehensive understanding of the process of pre-training the `ERNIE 4.5` model using `ERNIEKit`.
*  Mastered how to prepare data, configure `yaml` files, and start and monitor training.
*  Gained an in-depth understanding of the core concepts, key parameters, and results analysis of pre-training.

The core of this tutorial is that by analyzing the underlying scripts of `ERNIEKit`, we have successfully simplified a complex pre-training process designed for large clusters into a mini-process that can run on a single consumer-grade GPU, making it suitable for learning and experimentation.

**Looking ahead:**
The pre-trained model serves as a general-purpose “language foundation.” Moving forward, you can explore:

*   **Larger-scale pre-training**: Train on more data for longer periods to obtain more powerful foundational models.
*   **Model fine-tuning**: This is typically the next step after pre-training. Fine-tuning a pre-trained model on supervised data (such as question-answer pairs) makes it better at following instructions and completing specific tasks. `ERNIEKit` provides the `erniekit train` command to support fine-tuning methods such as SFT and DPO.

We hope this tutorial opens the door to exploring the world of the ERNIE large model. Thank you for reading!

# Feedback/Contact me: WeChat: G_Fuji