![a](res/banner.jpg)

<h1 style="text-align: center;">Getting into Modalities in 15mins</h1>

<hr/>

**Let's train a dense model with Modalities involving the following steps:**

1. Data Preprocessing (Indexation, Tokenization)
2. Model Pretraining (GPT Model)
3. Monitoring (Weights&Biases)


**Folder structure:**

Throughout the tutorial, we will use the Jupyter Notebook `modalities_demo.ipynb` to guide us through the process. The notebook is located in the root directory of the tutorial, along with the `configs` and `data` directories. The `configs` directory contains configuration files for the model pretraining and tokenization, while the `data` directory contains subdirectories for storing checkpoints, preprocessed data, raw data, and tokenizer-related files.

```text
└── getting_started_15mins                          # Root directory for the tutorial
    ├── modalities_demo.ipynb                       # The Jupyter Notebook which we will be using for the tutorial.
    ├── configs                      
    │   ├── pretraining_config.yaml                 # Configuration file for the model pretraining, containing a full represe.
    │   └── tokenization_config.yaml                # Configuration file for tokenization, specifying settings like vocabulary size, token types, etc.
    └── data                         
        ├── checkpoints                             # Directory where checkpoints (model and optimizer states saved during training) are stored.
        │   └── <checkpoints>        
        ├── preprocessed                            # Directory containing preprocessed data that is ready for training or analysis.
        │   └── <files>              
        ├── raw                      
        │   └── fineweb_edu_num_docs_483606.jsonl   # JSONL file containing raw data for training or testing.
        └── tokenizer                
            ├── tokenizer.json                      # JSON file defining the tokenizer model, including token mappings.
            └── tokenizer_config.json               # Configuration file specifying additional settings for the tokenizer (e.g., special tokens, padding).
    
```

## Preparation steps

Firstly, we need to install Modalities via pip

```bash
pip install modalities
```

and download the raw training data. 
We are going to use a subset (roughly 500k documents) of the FineWeb-Edu dataset, as it is already cleaned, filtered, and deduplicated.

```bash
cd data/raw
wget "https://huggingface.co/datasets/ModalitiesTeam/FW_EDU_SUBSET_500k_docs/resolve/main/fineweb_edu_num_docs_483606.jsonl?download=true" -O fineweb_edu_num_docs_483606.jsonl
```

**Disclaimer:**

Don't run Modalities in Jupyter notebooks!


But this time for demonstration purposes:

<img src="res/notebooks_1.png" alt="Alt text" style="width:30%;"/>

<small> credits: Joel Grus - I don't like Notebooks</small>

# Data Preprocessing


Before training the model, we will preprocess the raw data. In the first step, we will create an index of the data that stores the starting byte position and byte length of every document. In the second step, the tokenization uses this index to access the documents in the JSONL file efficiently.

The raw JSONL dataset has the following properties:

* Subset of FineWeb-Edu (~500k documents) encoded as a JSONL file
* Already cleaned, filtered, and deduplicated

Each line in the JSONL is a proper JSON object containing a single document. 
```json
{
   "text":"What is the difference between 50 Ohm and 75 Ohm Coax? [...]",
   "id":"<urn:uuid:57e09efe-1c29-49f8-a086-e1bb5dd552c9>",
   "dump":"CC-MAIN-2021-39",
   "url":"http://cablesondemandblog.com/wordpress1/2014/03/",
   "file_path":"s3://commoncrawl/crawl-data/[...]20210918002307-00380.warc.gz",
   "language":"en",
   "language_score":0.9309850335121155,
   "token_count":2355,
   "score":3.625,
   "int_score":4
}
```

While the metadata is generally interesting and can be used to further filter the dataset, we are only interested in the `text` field for now, which provides the actual training data.
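As a rough illustration of how the metadata could be used for filtering, the sketch below keeps only the `text` of documents that match a language and a minimum `int_score`. The function name `iter_texts` and the threshold parameter are hypothetical and not part of Modalities; the field names are taken from the example document above.

```python
import json


def iter_texts(jsonl_lines, min_int_score=0, language="en"):
    """Yield the `text` field of every document that passes the metadata filter.

    Hypothetical helper for illustration only; the thresholds are assumptions.
    """
    for line in jsonl_lines:
        line = line.strip()
        if not line:
            continue  # skip empty lines
        doc = json.loads(line)
        if doc.get("language") == language and doc.get("int_score", 0) >= min_int_score:
            yield doc["text"]
```

Passing an open file handle works as well, since iterating over a text file yields one line at a time.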

## Indexation



The goal of the indexation process is to determine the starting byte position and length of each document in the raw data file.

Architecturally, as shown in the diagram below, a reader process reads the raw data file line by line and writes the starting byte position and length of each document to the queue. For each line in the queue, the processor first validates the JSON object and then writes the starting byte position and length of the document to the index file.


<img src="res/modalities_indexation_bright.svg" alt="Alt text" style="width:80%;"/>
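The idea behind the index can be sketched in a few lines of Python. This is a simplified, single-process illustration and not the Modalities implementation: it omits the reader/processor queue shown in the diagram, and the function names `build_raw_index` and `read_document` are assumptions made for this example.

```python
import json


def build_raw_index(jsonl_path):
    """Return a list of (start_byte, length_in_bytes) tuples, one per document."""
    index = []
    offset = 0
    with open(jsonl_path, "rb") as f:  # binary mode, so offsets are byte offsets
        for line in f:
            payload = line.rstrip(b"\n")
            if payload.strip():
                json.loads(payload)  # validate that the line is a JSON object
                index.append((offset, len(payload)))
            offset += len(line)  # advance by the full line, including the newline
    return index


def read_document(jsonl_path, index, i):
    """Random access to document i: seek to its start and read exactly its bytes."""
    start, length = index[i]
    with open(jsonl_path, "rb") as f:
        f.seek(start)
        return json.loads(f.read(length))
```

With the index at hand, any document can be loaded without scanning the file from the beginning, which is what makes the later tokenization step efficient.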

We run the indexation with the command shown below. 

The `modalities data create_raw_index` command triggers the process of creating the index from the raw data.
The `--index_path` argument specifies the location where the generated index file will be saved. In this example, the index will be stored at `data/preprocessed/fineweb_edu_num_docs_483606.idx`.
The last part, i.e., `data/raw/fineweb_edu_num_docs_483606.jsonl`, is the input file in JSONL (JSON Lines) format containing the raw data. The command will process this file to create the index.


In [1]:
!modalities data create_raw_index --index_path data/preprocessed/fineweb_edu_num_docs_483606.idx \
                                               data/raw/fineweb_edu_num_docs_483606.jsonl

reading raw data from data/raw/fineweb_edu_num_docs_483606.jsonl
writing index to data/preprocessed/fineweb_edu_num_docs_483606.idx
Processed Lines: 483606it [00:18, 26703.57it/s]
Created index of length 483606


## Throughput optimized tokenization


Now that we have the raw JSONL dataset indexed, we can proceed with the tokenization.

In Modalities, tokenization is the process of converting raw text data into a sequence of tokens that can be used as input to the model. To maximize throughput, we scale up the number of processors that tokenize batches of documents in parallel, as shown in the diagram below. Typically, we use one processor per CPU core and tune the queue sizes and batch sizes for optimal throughput.

The processors place the tokenized documents as byte streams in the queue from which the writer reads and writes the tokenized documents to the output file.

<img src="res/modalities_tokenization_bright.svg" alt="Alt text" style="width:100%;"/>
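The reader, processor, and writer pattern from the diagram can be sketched as follows. This is a toy illustration under several assumptions: it uses threads instead of processes to stay self-contained, a byte-level stand-in (`toy_tokenize`) instead of a real tokenizer, and a hypothetical end-of-document id `EOD_ID`. The parameter defaults mirror the batch size and queue sizes used in the tokenization config of this tutorial.

```python
import queue
import threading

EOD_ID = 0  # hypothetical end-of-document token id


def toy_tokenize(text):
    # Byte-level stand-in for a real tokenizer: ids 1..256, 0 is reserved for EOD.
    return [b + 1 for b in text.encode("utf-8")]


def tokenize_parallel(documents, num_workers=4, batch_size=10, queue_size=300):
    raw_q = queue.Queue(maxsize=queue_size)        # reader -> processors
    processed_q = queue.Queue(maxsize=queue_size)  # processors -> writer

    def reader():
        for start in range(0, len(documents), batch_size):
            raw_q.put((start, documents[start:start + batch_size]))
        for _ in range(num_workers):
            raw_q.put(None)  # one stop signal per processor

    def processor():
        while True:
            item = raw_q.get()
            if item is None:
                processed_q.put(None)  # tell the writer this processor is done
                return
            start, batch = item
            processed_q.put((start, [toy_tokenize(doc) + [EOD_ID] for doc in batch]))

    threads = [threading.Thread(target=reader)]
    threads += [threading.Thread(target=processor) for _ in range(num_workers)]
    for t in threads:
        t.start()

    # The writer runs in the calling thread and restores the document order.
    results, finished = {}, 0
    while finished < num_workers:
        item = processed_q.get()
        if item is None:
            finished += 1
        else:
            results[item[0]] = item[1]
    for t in threads:
        t.join()
    return [doc for start in sorted(results) for doc in results[start]]
```

The bounded queues provide backpressure: a fast reader blocks once `queue_size` batches are waiting, so memory usage stays limited regardless of the dataset size.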

The tokenized dataset file is heavily optimized for efficient indexing. As laid out in the diagram below, the header specifies the size of the data segment and the size of a single token in bytes. With this information at hand, the file format is self-contained and does not need any additional information to be read. The data segment contains the concatenated byte streams of the tokenized documents.
The documents are indexed by their starting byte position and length stored in the index segment. This allows for efficient random access to the tokenized documents in O(1) time complexity.

Additionally, the data can be shuffled independently of the actual documents: only the index needs to be shuffled, which has a much lower memory footprint. Internally, we implemented a numpy array-like view on top of the data segment.

<img src="res/modalities_file_format_bright.svg" alt="Alt text" style="width:70%;"/>
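To make the layout more tangible, here is a minimal sketch of a file with a header, a data segment, and a separate index. The concrete byte layout is an assumption made for this example (an 8-byte data segment length followed by a 4-byte token size, little-endian) and may differ from the actual `.pbin` format. The index is kept in memory here instead of being serialized into an index segment, and the names `pack_documents` and `read_tokens` are hypothetical.

```python
import random
import struct

HEADER_FMT = "<QI"  # assumed layout: data segment length (8 bytes), token size (4 bytes)
HEADER_SIZE = struct.calcsize(HEADER_FMT)


def pack_documents(tokenized_docs, token_size=2):
    """Concatenate all documents into one data segment and record (start, length)."""
    data = bytearray()
    index = []
    for tokens in tokenized_docs:
        chunk = b"".join(t.to_bytes(token_size, "little") for t in tokens)
        index.append((len(data), len(chunk)))  # byte offsets relative to the data segment
        data += chunk
    header = struct.pack(HEADER_FMT, len(data), token_size)
    return header + bytes(data), index


def read_tokens(blob, index, i):
    """O(1) lookup: jump directly to document i using its index entry."""
    _, token_size = struct.unpack_from(HEADER_FMT, blob, 0)
    start, length = index[i]
    raw = blob[HEADER_SIZE + start:HEADER_SIZE + start + length]
    return [int.from_bytes(raw[j:j + token_size], "little")
            for j in range(0, length, token_size)]


# Shuffling touches only the document order, never the token bytes.
blob, index = pack_documents([[1, 2, 3], [4], [5, 6]])
order = list(range(len(index)))
random.Random(0).shuffle(order)
shuffled_docs = [read_tokens(blob, index, i) for i in order]
```

Because the token size is stored in the header, a reader can decode the data segment without knowing the tokenizer's vocabulary size in advance.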


We define the tokenization config as printed below. It defines the tokenizer component, including all the settings necessary to make it fully reproducible. Under `settings`, we additionally define the performance optimization settings, such as the number of CPUs to use and the queue sizes, as well as the input and output file paths.

In [2]:
from IPython.display import Markdown, display

def display_markdown(file_path):
    with open(file_path, 'r') as file:
        code = file.read()
    display(Markdown(f'```yaml\n{code}\n```'))


In [3]:
tokenization_config_path = "configs/tokenization_config.yaml"
display_markdown(tokenization_config_path)

```yaml
settings:
  src_path: data/raw/fineweb_edu_num_docs_483606.jsonl
  dst_path: data/preprocessed/fineweb_edu_num_docs_483606.pbin
  index_path: data/preprocessed/fineweb_edu_num_docs_483606.idx
  jq_pattern: .text
  num_cpus: ${node_env:num_cpus}
  eod_token: <|endoftext|>
  processing_batch_size: 10
  raw_samples_queue_size: 300
  processed_samples_queue_size: 300

tokenizer:
  component_key: tokenizer
  variant_key: pretrained_hf_tokenizer
  config:
    pretrained_model_name_or_path: data/tokenizer
    padding: false
    truncation: false
```

In [4]:
!modalities data pack_encoded_data configs/tokenization_config.yaml

Instantiated <class 'modalities.tokenization.tokenizer_wrapper.PreTrainedHFTokenizer'>: tokenizer
Processed batches: 100%|█████████████| 483606/483606 [00:17<00:00, 27330.58it/s]


# Training

In Modalities, we scale up the training via Fully Sharded Data Parallel (FSDP), as described in the paper [Zhao, Yanli, et al. "Pytorch fsdp: experiences on scaling fully sharded data parallel." arXiv preprint arXiv:2304.11277 (2023).](https://arxiv.org/pdf/2304.11277)

**Goal:** Maximizing the token throughput during training by trading off communication overhead for a lower memory footprint. 

* Before training, the model is split into FSDP units, and each FSDP unit is sharded across all ranks
* Each rank is a data parallel process receiving only a subset of the data
* Each rank materializes one FSDP unit at a time during the forward pass by receiving the sharded weights from its peers

<img src="res/fsdp_bright.svg" alt="Alt text" style="width:90%;"/>


Adapted from Zhao, Yanli, et al. "Pytorch fsdp: experiences on scaling fully sharded data parallel." arXiv preprint arXiv:2304.11277 (2023).
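The three bullet points above can be mimicked with a toy simulation in plain Python. This is not PyTorch's FSDP and involves no real communication; `shard_unit`, `all_gather`, and `forward` are hypothetical helpers that only illustrate why at most one full unit needs to be held in memory at a time.

```python
def shard_unit(flat_params, world_size):
    """Split one unit's flattened parameters into equally sized shards, one per rank."""
    shard_len = -(-len(flat_params) // world_size)  # ceiling division
    padded = flat_params + [0.0] * (shard_len * world_size - len(flat_params))
    return [padded[r * shard_len:(r + 1) * shard_len] for r in range(world_size)]


def all_gather(shards, numel):
    """Stand-in for the collective: rebuild the full unit and drop the padding."""
    return [p for shard in shards for p in shard][:numel]


def forward(sharded_units, unit_sizes, x):
    """Materialize one unit at a time, use it, and free it before the next one."""
    peak_materialized = 0
    for shards, numel in zip(sharded_units, unit_sizes):
        full = all_gather(shards, numel)
        peak_materialized = max(peak_materialized, len(full))
        x = sum(w * x for w in full)  # placeholder for the unit's actual computation
        del full  # the full weights are released again after use
    return x, peak_materialized


units = [[1.0, 2.0, 3.0], [0.5, 0.5]]
sharded = [shard_unit(u, world_size=2) for u in units]
result, peak = forward(sharded, [len(u) for u in units], x=1.0)
```

In this toy setup the model has five parameters in total, but `peak` never exceeds the size of the largest unit, which is the memory saving that FSDP trades against extra communication.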

While FSDP happens under the hood of Modalities, the user can still parameterize the training process via the `pretraining_config.yaml` file. In fact, the training config is specified such that every component used during training, e.g., dataset, dataloader, and model, is fully reproducible. On the one hand, this leads to larger, somewhat more complex config files; on the other hand, it allows the training process to be fully reproduced. Especially in the field of LLMs, where training is expensive, complex, and involves a large number of ablations, keeping track of the entire system configuration in a reproducible manner is crucial.

The config file is shown in the printout below.


In [None]:
pretraining_config_path = "configs/pretraining_config.yaml"
display_markdown(pretraining_config_path)

Below you find the command for running the distributed training with Modalities across 4 GPUs on a single node. Let's break it down into its components:

* `CUDA_VISIBLE_DEVICES=0,1,2,3`: This environment variable specifies which GPUs will be used for the job. In this case, GPUs with IDs 0, 1, 2, 3 are selected for training.

* `torchrun`: This is a utility from PyTorch used to launch distributed training. It automatically manages multiple processes for distributed training.

* `--rdzv-endpoint localhost:29515`: Specifies the rendezvous endpoint. Here, localhost is the machine's address, and 29515 is the port. The rendezvous endpoint coordinates the processes involved in distributed training.

* `--nnodes 1`: Specifies the number of nodes to be used in the distributed setup. Since this is a single-node setup, 1 is used.

* `--nproc_per_node 4`: This argument tells torchrun how many processes to launch on each node. In this case, 4 processes are launched per node, corresponding to the 4 GPUs (IDs 0, 1, 2, 3) specified by `CUDA_VISIBLE_DEVICES`.

* `$(which modalities) run`: This part dynamically finds the path to the modalities executable and runs it. The run command triggers the main process to start the training.

* `--config_file_path configs/pretraining_config.yaml`: The `--config_file_path` argument provides the path to the configuration file for the training job. In this example, the configuration is provided in `configs/pretraining_config.yaml`, which includes settings like model architecture, optimizer, dataset, dataloader and other training components.
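When launching from a script instead of a notebook, the components above could be assembled programmatically, for example to adapt the number of processes to the selected GPUs. The helper `build_launch_command` below is a hypothetical sketch and not part of Modalities; it only uses the flags described in the list above.

```python
import shutil


def build_launch_command(gpu_ids, config_path, rdzv_port=29515, modalities_bin=None):
    """Return (extra environment variables, argument list) for a single-node run."""
    if modalities_bin is None:
        # Python counterpart of `$(which modalities)`; falls back to the bare name.
        modalities_bin = shutil.which("modalities") or "modalities"
    env = {"CUDA_VISIBLE_DEVICES": ",".join(str(g) for g in gpu_ids)}
    args = [
        "torchrun",
        "--rdzv-endpoint", f"localhost:{rdzv_port}",
        "--nnodes", "1",
        "--nproc_per_node", str(len(gpu_ids)),  # one process per selected GPU
        modalities_bin, "run",
        "--config_file_path", config_path,
    ]
    return env, args
```

The result can be passed to `subprocess.run(args, env={**os.environ, **env})`, which keeps the number of processes consistent with the GPUs listed in `CUDA_VISIBLE_DEVICES`.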


Once executed, the training process will start, and you will see the training logs in the terminal. The logs include information about the training progress, such as the loss values, learning rate, and other metrics. Additionally, you can monitor the training process using Weights & Biases, to which Modalities logs automatically. Make sure that you are logged into your Weights & Biases account to track the training metrics.

In [7]:
! CUDA_VISIBLE_DEVICES=0,1,2,3 torchrun --rdzv-endpoint localhost:29515 \
                                        --nnodes 1 \
                                        --nproc_per_node 4 \
                                        $(which modalities) run --config_file_path configs/pretraining_config.yaml

W0906 16:14:45.871000 139806215406656 torch/distributed/run.py:757] 
W0906 16:14:45.871000 139806215406656 torch/distributed/run.py:757] *****************************************
W0906 16:14:45.871000 139806215406656 torch/distributed/run.py:757] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed. 
W0906 16:14:45.871000 139806215406656 torch/distributed/run.py:757] *****************************************
Instantiated <class 'modalities.models.components.layer_norms.RMSLayerNorm'>: model_raw -> config -> attention_norm
Instantiated <class 'modalities.models.components.layer_norms.RMSLayerNorm'>: model_raw -> config -> ffn_norm
Instantiated <class 'modalities.models.components.layer_norms.RMSLayerNorm'>: model_raw -> config -> lm_head_norm
Instantiated <class 'modalities.models.gpt2.gpt2_model.GPT2LLM'>: model_raw
Instantiated <cla