# EuroBERT - Continuous Pre-training with Optimus Library

This tutorial guides you through training a EuroBERT model using the Optimus library, extending its language support by adding Finnish. We cover installation, data preprocessing, model training, and loading the trained model.

This tutorial is divided into two parts: training in pure Python, and running the Optimus library directly from the command line (useful for distributed settings and server training).

**Table of contents**
- 🐍 [Python](#python)
- 💻 [Command Line](#command-line)

**Resources:**

- 🤖 [EuroBERT](https://huggingface.co/EuroBERT)
- 🚀 [Optimus training library](https://github.com/Nicolas-BZRD/EuroBERT)
- 📄 [Paper](https://arxiv.org/pdf/2503.05500)
- 📚 [Data we will use](https://huggingface.co/datasets/Finnish-NLP/wikipedia_20230501_fi_cleaned)

## Installing Optimus

Before training EuroBERT, install the Optimus library.

⚠️ **On Google Colab, you may encounter dependency conflicts, and the runtime may need to restart after the first installation. Simply rerun the cell once the installation is complete to load Optimus and continue the tutorial.**

In [None]:
try:
  import optimus
  print("\033[92mOptimus already installed\033[0m")
except ImportError:
  !pip install git+https://github.com/Nicolas-BZRD/EuroBERT.git

---

## Python

## Data Preparation

For this tutorial we use the Finnish Wikipedia dataset from [Hugging Face](https://huggingface.co/datasets/Finnish-NLP/wikipedia_20230501_fi_cleaned). The preprocessing steps ensure that the data is formatted correctly for training.

**Steps:**
1.   Download the dataset.
2.   Tokenize the text using the EuroBERT tokenizer.
3.   Pack the data **(optional)**
4.   Datamix

To perform these steps, import the `dataprocess` module as follows:

In [None]:
from optimus import dataprocess

## 1. Dataset
For efficient tokenization, Optimus uses raw data that can be downloaded from Hugging Face using git clone.

In [None]:
!git clone https://huggingface.co/datasets/Finnish-NLP/wikipedia_20230501_fi_cleaned

## 2. Tokenize

We download the Llama tokenizer to use with `tiktoken`, which makes tokenization significantly faster (approximately 2x). Alternatively, you can skip this step, set the `tokenizer` argument to the Hugging Face model ID `EuroBERT/EuroBERT-210m`, and set `tiktoken=False`.

In [None]:
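# Use the raw file URL: a github.com/.../tree/... link returns an HTML page, not the tokenizer file.
!wget https://raw.githubusercontent.com/Nicolas-BZRD/EuroBERT/tuto-continuous-pretraining/exemples/tokenizer/tokenizer.model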

The Finnish Wikipedia dataset consists of 11 columns, including "text", "id", and "url". These three columns match the format expected by our processing script for [Wikipedia dumps](https://github.com/Nicolas-BZRD/EuroBERT/blob/main/optimus/dataprocess/dataset/wikipedia.py).

If you need to work with other datasets, refer to the [existing dataset](https://github.com/Nicolas-BZRD/EuroBERT/blob/main/optimus/dataprocess/dataset) scripts as examples to create a compatible processing script.

In [None]:
dataprocess.tokenize_dataset(
    input_dir="/content/wikipedia_20230501_fi_cleaned",  # Path to the raw dataset
    tokenizer="/content/tokenizer.model",  # Path to the EuroBERT tokenizer model or HuggingFace model ID
    dataset="wikipedia",  # Dataset format (e.g., 'wikipedia')
    output_dir="/content/tokenized",  # Directory where the tokenized data will be saved
    num_workers="max",  # Use the maximum available workers for parallel processing
    head=1,  # Sample only 1 record (~8134444 tokens)
    tiktoken=True  # Enable TikToken for efficient tokenization
)

As you can see, Optimus tokenization is quite fast. You should be able to tokenize 8 million Wikipedia tokens in less than 8 seconds using a single CPU on Google Colab.

In [None]:
import json
with open("/content/tokenized/metadata.json", "r") as f:
    data = json.load(f)
    print(json.dumps(data, indent=4))

We can also inspect the first sample of the tokenized data to verify that everything looks correct.

In [None]:
dataprocess.inspect_dataset(input_dir="/content/tokenized", tokenizer="EuroBERT/EuroBERT-210m", num_samples=1)

## 3. Pack **(optional)**

Packing ensures that every training sample has the same length, which gives a consistent effective batch size. If you prefer not to pack, skip this step and pass the `config.data.var_len=True` argument during training. For this tutorial, we pack samples to a length of 2048 tokens.

In [None]:
dataprocess.pack_dataset(input_dir="/content/tokenized", output_dir="/content/packed", block_size=2048, num_workers=1)

Inspecting the data now reveals that every sample has a length of 2048 tokens, as expected.

In [None]:
dataprocess.inspect_dataset(input_dir="/content/packed/train", tokenizer="/content/tokenizer.model", num_samples=2, tiktoken=True)
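
To make the idea of packing concrete, here is a minimal pure-Python sketch. It is not the Optimus implementation: the function name `pack_tokens` is hypothetical, and dropping the incomplete final block is an assumption made to keep the example short. It concatenates tokenized documents into one stream and cuts that stream into blocks of equal size.

```python
from typing import Iterable, List


def pack_tokens(documents: Iterable[List[int]], block_size: int) -> List[List[int]]:
    """Concatenate tokenized documents and cut the stream into equal-size blocks."""
    buffer: List[int] = []
    blocks: List[List[int]] = []
    for tokens in documents:
        buffer.extend(tokens)
        # Emit as many full blocks as the buffer currently holds.
        while len(buffer) >= block_size:
            blocks.append(buffer[:block_size])
            buffer = buffer[block_size:]
    # Assumption: leftover tokens shorter than block_size are dropped.
    return blocks


# Toy token IDs; the tutorial uses block_size=2048 instead of 4.
docs = [[1, 2, 3], [4, 5, 6, 7, 8], [9, 10]]
blocks = pack_tokens(docs, block_size=4)
print(blocks)  # [[1, 2, 3, 4], [5, 6, 7, 8]]
```

Note that a block can span document boundaries, which is why every packed sample ends up with exactly the same length.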

## 4. Create the Datamix

### Data Mix Creation

With the data processed, we can now create the data mix: a JSON file containing a list of the processed datasets to use during training. You can set how much to include from each dataset individually; Optimus then builds the training mix automatically and shuffles across datasets. Each entry specifies the amount with one of these keys:

- `proportion`: the amount to include, as a ratio (float)
- `choose`: the amount to include, as a number of samples (int)

```json
[
  {
    "local": "dataset_processed_path",
    "choose": 200
  },
  {
    "local": "dataset2_processed_path",
    "proportion": 1.5
  }
]
```

In [None]:
import os

train = [
    {
        "local": "/content/packed/train",
        "choose": 200
    },
]

os.makedirs("/content/datamix", exist_ok=True)
with open("/content/datamix/train.json", "w") as f:
    json.dump(train, f)

**The mix must be named `train.json`; otherwise, Optimus will not find it during training.**
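
Because a malformed mix only surfaces once training starts, it can help to validate the entries before writing them. The sketch below is a hypothetical helper, not part of Optimus: `write_datamix` is an illustrative name, and the rule that each entry carries exactly one of `choose` or `proportion` is an assumption based on the examples in this tutorial. It always writes the file as `train.json`.

```python
import json
import os
import tempfile


def write_datamix(entries, directory):
    """Validate data mix entries and write them to <directory>/train.json."""
    for entry in entries:
        if "local" not in entry:
            raise ValueError(f"missing 'local' path: {entry}")
        # Assumption: each entry carries exactly one of 'choose' or 'proportion'.
        if ("choose" in entry) == ("proportion" in entry):
            raise ValueError(f"expected exactly one of 'choose' or 'proportion': {entry}")
    os.makedirs(directory, exist_ok=True)
    path = os.path.join(directory, "train.json")  # file name expected during training
    with open(path, "w") as f:
        json.dump(entries, f, indent=2)
    return path


mix_dir = tempfile.mkdtemp()  # stand-in for /content/datamix
mix_path = write_datamix([{"local": "/content/packed/train", "choose": 200}], mix_dir)
with open(mix_path) as f:
    loaded = json.load(f)
print(os.path.basename(mix_path), loaded)
```

Using `json.dump` rather than writing the text by hand also avoids invalid JSON such as trailing commas.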

---

## 5. Training

After processing the data, we can start training the model. This section uses Python only. Alternatively, you can run `python -m optimus.train` with the configuration passed as command-line arguments, which behaves similarly. For example:

```bash
python -m optimus.train --huggingface_id EuroBERT/EuroBERT-210m --output_dir "/content/model" --lr_scheduler "OneCycleLR" --div_factor 10 --pct_start 0.3 --final_div_factor 100 --save_step 100 --data_mix_path "/content/datamix" --batch_size 1 --gpu
```

In [None]:
from optimus.trainer.configuration.configs import Config
from optimus.trainer.data import Data
from optimus.trainer.model.load import load_model, load_tokenizer
from optimus.trainer.pretrain import Pretrain

Let's configure our training. As you can see, we specify the model name, learning rate configuration, and data mix.

In [None]:
config = Config()

config.model.huggingface_id = "EuroBERT/EuroBERT-210m"
config.model.gpu = True  # Set to False if you do not have a GPU.

config.train.output_dir = "/content/model"
config.train.lr_scheduler = "OneCycleLR"
config.train.div_factor = 10
config.train.pct_start = 0.3
config.train.final_div_factor = 100
config.train.save_step = 100

config.data.data_mix_path = "/content/datamix"
config.data.batch_size = 1
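
To get a feel for what `div_factor`, `pct_start`, and `final_div_factor` do, the sketch below computes a one-cycle schedule in pure Python. It assumes the `OneCycleLR` option follows the convention of the PyTorch scheduler of the same name (the parameter names match, but this is an assumption about Optimus), where the initial rate is `max_lr / div_factor` and the final rate is `initial_lr / final_div_factor`, with cosine annealing in both phases. The values of `max_lr` and `total_steps` are hypothetical and chosen only for illustration.

```python
import math


def cosine_interp(start, end, pct):
    """Cosine-anneal from start (pct=0) to end (pct=1)."""
    return end + (start - end) / 2.0 * (math.cos(math.pi * pct) + 1.0)


def one_cycle_lr(step, total_steps, max_lr, div_factor, pct_start, final_div_factor):
    """Learning rate at a given step of a two-phase one-cycle schedule."""
    initial_lr = max_lr / div_factor
    min_lr = initial_lr / final_div_factor
    peak_step = pct_start * total_steps - 1  # last step of the warm-up phase
    last_step = total_steps - 1
    if step <= peak_step:
        return cosine_interp(initial_lr, max_lr, step / peak_step)
    return cosine_interp(max_lr, min_lr, (step - peak_step) / (last_step - peak_step))


total_steps = 100  # hypothetical
max_lr = 1e-4      # hypothetical
schedule = [
    one_cycle_lr(s, total_steps, max_lr, div_factor=10, pct_start=0.3, final_div_factor=100)
    for s in range(total_steps)
]
print(f"start={schedule[0]:.1e} peak={max(schedule):.1e} end={schedule[-1]:.1e}")
```

With the tutorial's settings, the rate starts at a tenth of the peak, reaches the peak after roughly 30% of the steps, and ends a thousand times below the peak.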

We recommend that you check the training documentation in Optimus for the complete list of configuration options. Alternatively, you can run the following Python code to print the different configuration sections:

```python
import json
from dataclasses import asdict

print("Model")
print(json.dumps(asdict(config.model), indent=4))
print("Data")
print(json.dumps(asdict(config.data), indent=4))
print("Train")
print(json.dumps(asdict(config.train), indent=4))
```

We can then launch the training. In this example, we do not use distributed training, so we set the distributed object responsible for training supervision to `None`.

In [None]:
distributed = None

model = load_model(config)
tokenizer = load_tokenizer(config)

data = Data(config, tokenizer)

pretrain = Pretrain(model, data, distributed, config)
pretrain.train()

config.log_print("Training completed successfully.")

---

## Command Line

Following the steps described in the Python section, you can also launch training with only a few commands, as follows:

In [None]:
# Remove the work done previously
import torch
import gc
del model, tokenizer, data, pretrain, distributed
gc.collect()
torch.cuda.empty_cache()
!rm -r /content/*

In [None]:
# Download resources
!git clone https://huggingface.co/datasets/Finnish-NLP/wikipedia_20230501_fi_cleaned
!wget https://raw.githubusercontent.com/Nicolas-BZRD/EuroBERT/tuto-continuous-pretraining/exemples/tokenizer/tokenizer.model

In [None]:
# Dataset Mix
!mkdir -p /content/datamix && echo '[{"local": "/content/packed/train", "choose": 200}]' > /content/datamix/train.json

The whole pipeline, from tokenization to training, then runs with three commands.

In [None]:
!python -m optimus.dataprocess.tokenize_dataset --input_dir "/content/wikipedia_20230501_fi_cleaned" --tokenizer "/content/tokenizer.model" --dataset "wikipedia" --output_dir "/content/tokenized" --num_workers "max" --head 1 --tiktoken
!python -m optimus.dataprocess.pack_dataset --input_dir "/content/tokenized" --output_dir "/content/packed" --block_size 2048 --num_workers 1
!python -m optimus.train --huggingface_id EuroBERT/EuroBERT-210m --output_dir "/content/model" --lr_scheduler "OneCycleLR" --div_factor 10 --pct_start 0.3 --final_div_factor 100 --save_step 100 --data_mix_path "/content/datamix" --batch_size 1 --gpu
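
When these commands are launched from a script rather than a notebook (for example on a server), the argument list can be assembled programmatically. The sketch below is one possible approach: `build_command` is a hypothetical helper, and it only reuses the flags already shown above without adding any.

```python
import shlex


def build_command(module, options, flags=()):
    """Build an argv list for `python -m <module>` from options and boolean flags."""
    argv = ["python", "-m", module]
    for name, value in options.items():
        argv += [f"--{name}", str(value)]
    argv += [f"--{flag}" for flag in flags]  # flags that take no value, e.g. --gpu
    return argv


train_argv = build_command(
    "optimus.train",
    {
        "huggingface_id": "EuroBERT/EuroBERT-210m",
        "output_dir": "/content/model",
        "lr_scheduler": "OneCycleLR",
        "div_factor": 10,
        "pct_start": 0.3,
        "final_div_factor": 100,
        "save_step": 100,
        "data_mix_path": "/content/datamix",
        "batch_size": 1,
    },
    flags=["gpu"],
)
train_command = shlex.join(train_argv)
print(train_command)
```

The resulting list can be passed to `subprocess.run(train_argv, check=True)`, which avoids shell quoting issues.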

For large-scale training that requires further optimization, feel free to reach out to us at `nicolas(dot)boizard[at]centralesupelec(dot)fr`.