![a](../docs/source/banner.jpg)

<h1 style="text-align: center;">Live Demo @ AI24</h1>

<hr/>

# Scope of the next 15 minutes:
**Let's train a dense model with Modalities!**

* Data Preprocessing (Indexation, Tokenization)
* Model Pretraining (GPT Model)
* Monitoring (Weights & Biases)

**Assumptions:**
* Modalities is already installed
* Raw data is already downloaded, cleaned and filtered (FineWeb-Edu)
* Tokenizer is already trained and available (we use the GPT2 tokenizer)


**Folder structure:**

```text
└── ai_24_demo
    ├── modalities_demo.ipynb
    ├── configs
    │   ├── pretraining_config.yaml
    │   └── tokenization_config.yaml
    └── data
        ├── checkpoints
        │   └── <checkpoints>
        ├── preprocessed
        │   └── <files>
        ├── raw
        │   └── fineweb_edu_num_docs_483606.jsonl
        └── tokenizer
            ├── tokenizer.json
            └── tokenizer_config.json        
```

**Disclaimer:**

Don't run Modalities in Jupyter notebooks!


This time we make an exception for demonstration purposes:

<img src="res/notebooks_1.png" alt="Alt text" style="width:50%;"/>

<small>Credits: Joel Grus, "I Don't Like Notebooks"</small>

# Data Preprocessing


Dataset: 
* FineWeb-Edu (~500k documents) stored as a JSONL file
* cleaned, filtered and deduplicated

Example line (pretty-printed here; in the file each document occupies a single line):
```json
{
   "text":"What is the difference between 50 Ohm and 75 Ohm Coax? [...]",
   "id":"<urn:uuid:57e09efe-1c29-49f8-a086-e1bb5dd552c9>",
   "dump":"CC-MAIN-2021-39",
   "url":"http://cablesondemandblog.com/wordpress1/2014/03/",
   "file_path":"s3://commoncrawl/crawl-data/[...]20210918002307-00380.warc.gz",
   "language":"en",
   "language_score":0.9309850335121155,
   "token_count":2355,
   "score":3.625,
   "int_score":4
}
```


## Indexation



**Goal:** Find the starting byte position and length of each document in the raw data file.

![a.png](res/modalities_indexation_bright.svg)

In [2]:
!modalities data create_raw_index --index_path data/preprocessed/fineweb_edu_num_docs_483606.idx \
                                               data/raw/fineweb_edu_num_docs_483606.jsonl

reading raw data from data/raw/fineweb_edu_num_docs_483606.jsonl
writing index to data/preprocessed/fineweb_edu_num_docs_483606.idx
Processed Lines: 483606it [00:12, 39468.36it/s]
Created index of length 483606
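
The indexation goal can be illustrated with a minimal sketch. This is not the Modalities implementation, and the on-disk layout of the `.idx` file is not shown here; the helpers `build_line_index` and `read_document` and the three-line demo file are hypothetical. The sketch only shows the idea: record `(start_byte, length)` for every line once, then fetch any document with a single seek.

```python
import os
import tempfile


def build_line_index(path):
    """Return a list of (start_byte, length_in_bytes) tuples, one per non-empty line."""
    index = []
    offset = 0
    with open(path, "rb") as f:  # binary mode, so offsets are true byte positions
        for line in f:
            length = len(line.rstrip(b"\n"))
            if length > 0:
                index.append((offset, length))
            offset += len(line)  # advance past the line, including its newline
    return index


def read_document(path, index, i):
    """Fetch document i with one seek and one read, without scanning the file."""
    start, length = index[i]
    with open(path, "rb") as f:
        f.seek(start)
        return f.read(length).decode("utf-8")


# Tiny stand-in for the raw JSONL file (hypothetical content).
tmp_dir = tempfile.mkdtemp()
demo_path = os.path.join(tmp_dir, "demo.jsonl")
with open(demo_path, "w", encoding="utf-8", newline="\n") as f:
    f.write('{"text": "first document"}\n')
    f.write('{"text": "zweites Dokument, äöü"}\n')
    f.write('{"text": "third"}\n')

demo_index = build_line_index(demo_path)
print(demo_index)
print(read_document(demo_path, demo_index, 1))
```

Offsets are counted in bytes rather than characters, which is why the file is opened in binary mode: the second demo line contains multi-byte UTF-8 characters, and a character-based offset would point at the wrong position.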


## Throughput-optimized tokenization



**Goal:** Tokenize the raw data and save the tokenized data in an indexing-optimized binary file.

#### Tokenization Pipeline
<img src="res/modalities_tokenization_bright.svg" alt="Alt text" style="width:90%;"/>


#### Tokenized dataset format optimized for indexing

<img src="res/modalities_file_format_bright.svg" alt="Alt text" style="width:70%;"/>

**Advantages:**
* self-contained binary file format
* Index tuples allow O(1) access to any document; implemented as a NumPy memmap view of the file
* Dataset is loaded into RAM on demand, minimizing the memory footprint
* Shuffling of data can be done by shuffling the index tuples instead of the actual data
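
The advantages listed above can be illustrated with a small sketch. It is a simplification under stated assumptions, not the actual `.pbin` layout: the real format is self-contained, whereas here the index tuples are kept in a plain Python list next to a flat token file, and the token ids, the `uint16` dtype, and the helper `get_document` are hypothetical.

```python
import os
import tempfile

import numpy as np

# Hypothetical token ids for three documents of different lengths.
documents = [[11, 12, 13], [21, 22], [31, 32, 33, 34]]

# Write all tokens back to back into one flat binary file.
tmp_dir = tempfile.mkdtemp()
data_path = os.path.join(tmp_dir, "tokens.bin")
flat = np.array([t for doc in documents for t in doc], dtype=np.uint16)
flat.tofile(data_path)

# Index tuples: (start position, length) of each document, counted in tokens.
index = []
start = 0
for doc in documents:
    index.append((start, len(doc)))
    start += len(doc)

# Memory-map the file: data is only paged in when a slice is accessed.
tokens = np.memmap(data_path, dtype=np.uint16, mode="r")


def get_document(i):
    """O(1) lookup: one index access plus one slice of the memmap."""
    offset, length = index[i]
    return tokens[offset:offset + length]


# Shuffle the dataset by permuting the small index, not the token file.
rng = np.random.default_rng(seed=0)
order = rng.permutation(len(index))
shuffled_docs = [get_document(i).tolist() for i in order]
print(shuffled_docs)
```

Because only the permutation of index positions changes, the token file itself is never rewritten, no matter how often the data is reshuffled.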

In [3]:
from IPython.display import Markdown, display

def display_markdown(file_path):
    with open(file_path, 'r') as file:
        code = file.read()
    display(Markdown(f'```yaml\n{code}\n```'))


In [4]:
tokenization_config_path = "configs/tokenization_config.yaml"
display_markdown(tokenization_config_path)

```yaml
settings:
  src_path: data/raw/fineweb_edu_num_docs_483606.jsonl
  dst_path: data/preprocessed/fineweb_edu_num_docs_483606.pbin
  index_path: data/preprocessed/fineweb_edu_num_docs_483606.idx
  jq_pattern: .text
  num_cpus: ${node_env:num_cpus}
  eod_token: <|endoftext|>
  processing_batch_size: 10
  raw_samples_queue_size: 300
  processed_samples_queue_size: 300

tokenizer:
  component_key: tokenizer
  variant_key: pretrained_hf_tokenizer
  config:
    pretrained_model_name_or_path: data/tokenizer
    padding: false
    truncation: false
```

In [5]:
!modalities data pack_encoded_data configs/tokenization_config.yaml

Instantiated <class 'modalities.tokenization.tokenizer_wrapper.PreTrainedHFTokenizer'>: tokenizer
Processed batches: 100%|█████████████| 483606/483606 [00:17<00:00, 27944.25it/s]


# Training

* Before training, the model is split into FSDP units, and each FSDP unit is sharded across all ranks
* Each rank is a data-parallel process that receives only a subset of the data
* During the forward pass, each rank materializes one FSDP unit at a time by receiving the sharded weights from its peers
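
The sharding and materialization steps above can be mimicked in a single process. The following is a toy simulation under loud assumptions: there is no real communication, `all_gather` is a plain Python stand-in for the collective operation, and the two units with eight weights each are hypothetical. It is meant to show the memory trade-off, not how PyTorch implements FSDP.

```python
WORLD_SIZE = 4

# A hypothetical model with two FSDP units, each a flat list of weights.
units = {
    "block_0": [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8],
    "block_1": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0],
}


def shard(weights, world_size):
    """Split a flat weight list into equally sized, contiguous shards, one per rank."""
    assert len(weights) % world_size == 0, "real FSDP pads; this sketch does not"
    size = len(weights) // world_size
    return [weights[r * size:(r + 1) * size] for r in range(world_size)]


def all_gather(shards):
    """Stand-in for the collective: a rank obtains the concatenation of all shards."""
    return [w for s in shards for w in s]


# Before training: every unit is sharded across all ranks.
sharded = {name: shard(w, WORLD_SIZE) for name, w in units.items()}

# Forward pass as seen by one rank: materialize one unit at a time, use it, free it.
peak_resident = 0
for name, shards in sharded.items():
    full_weights = all_gather(shards)      # gather the unit's weights from all peers
    peak_resident = max(peak_resident, len(full_weights))
    assert full_weights == units[name]     # the unit is complete for the computation
    del full_weights                       # drop it again, keeping only the local shard

total_params = sum(len(w) for w in units.values())
per_rank_params = total_params // WORLD_SIZE
print(total_params, per_rank_params, peak_resident)
```

In this toy setup each rank permanently stores only a quarter of the parameters, and at any moment holds at most one fully materialized unit on top of that. The price is one gather per unit, which is the communication-for-memory trade-off.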

### Scaling up model training with Fully Sharded Data Parallel (FSDP)
**Goal:** Maximizing the token throughput during training by trading off communication for memory. 
<img src="res/fsdp_bright.svg" alt="Alt text" style="width:80%;"/>


Adapted from Zhao, Yanli, et al. "PyTorch FSDP: Experiences on Scaling Fully Sharded Data Parallel." arXiv preprint arXiv:2304.11277 (2023).

In [6]:
pretraining_config_path = "configs/pretraining_config.yaml"
display_markdown(pretraining_config_path)

```yaml
settings:  
  experiment_id: ${modalities_env:experiment_id}
  config_file_path: ${modalities_env:config_file_path}
  referencing_keys:
    sample_key: input_ids
    target_key: target_ids
  training:
    training_log_interval_in_steps: 5
    checkpointing_interval_in_steps: 50
    evaluation_interval_in_steps: 50
    global_num_seen_tokens: 0
    activation_checkpointing_modules: [GPT2Block]
    gradient_acc_steps: 1
    local_train_micro_batch_size: 64
    sequence_length: 256
  cuda_env:
    local_rank: ${cuda_env:LOCAL_RANK}
    global_rank: ${cuda_env:RANK}
    world_size: ${cuda_env:WORLD_SIZE}
  paths:
    checkpointing_path: data/checkpoints

collate_fn:  
  component_key: collate_fn
  variant_key: gpt_2_llm_collator
  config:
    sample_key: ${settings.referencing_keys.sample_key}
    target_key: ${settings.referencing_keys.target_key}

train_dataset:
  component_key: dataset
  variant_key: packed_mem_map_dataset_continuous
  config:
    raw_data_path: data/preprocessed/fineweb_edu_num_docs_483606.pbin
    sequence_length: ${settings.training.sequence_length}
    sample_key:  ${settings.referencing_keys.sample_key}

train_dataloader:
  component_key: data_loader
  variant_key: default
  config:
    num_workers: 2
    pin_memory: true
    shuffle: false
    fixed_num_batches: 1000
    dataloader_tag: train
    dataset:
      instance_key: train_dataset
      pass_type: BY_REFERENCE
    batch_sampler:
      component_key: batch_sampler
      variant_key: default
      config:
        batch_size: ${settings.training.local_train_micro_batch_size}
        drop_last: true
        sampler:
          component_key: sampler
          variant_key: distributed_sampler
          config:
            rank: ${settings.cuda_env.global_rank}
            num_replicas: ${settings.cuda_env.world_size}
            shuffle: true
            dataset:
              instance_key: train_dataset
              pass_type: BY_REFERENCE
    collate_fn:
      instance_key: collate_fn
      pass_type: BY_REFERENCE


eval_dataloaders: []

checkpoint_saving:
  component_key: checkpoint_saving
  variant_key: default
  config:
    checkpoint_saving_strategy:
      component_key: checkpoint_saving_strategy
      variant_key: save_k_most_recent_checkpoints_strategy
      config:
        k: -1   # -1 to save all checkpoints
    checkpoint_saving_execution:
      component_key: checkpoint_saving_execution
      variant_key: fsdp
      config:
        checkpoint_path: ${settings.paths.checkpointing_path}
        global_rank: ${settings.cuda_env.global_rank}
        experiment_id: ${settings.experiment_id}
        get_num_tokens_from_num_steps_callable:
          component_key: number_conversion
          variant_key: num_tokens_from_num_steps_callable
          config:
            num_ranks: ${settings.cuda_env.world_size}
            local_micro_batch_size: ${settings.training.local_train_micro_batch_size}
            sequence_length: ${settings.training.sequence_length} 

loss_fn:
  component_key: loss
  variant_key: clm_cross_entropy_loss
  config:
    target_key: target_ids
    prediction_key: logits

wrapped_model:
  component_key: model
  variant_key: fsdp_wrapped
  config:
    model:
      instance_key: model
      pass_type: BY_REFERENCE
    sync_module_states: true
    mixed_precision_settings: BF_16
    sharding_strategy: FULL_SHARD
    block_names: [GPT2Block]

model: 
  component_key: model
  variant_key: model_initialized
  config:
    model:
      instance_key: model_raw
      pass_type: BY_REFERENCE
    model_initializer:
      component_key: model_initialization
      variant_key: composed
      config:
        model_type: gpt2
        weight_init_type: scaled
        mean: 0.0
        std: 0.02
        num_layers: ${model_raw.config.n_layer}

model_raw:
  component_key: model
  variant_key: gpt2
  config:
    sample_key: ${settings.referencing_keys.sample_key}
    poe_type: NOPE
    sequence_length: ${settings.training.sequence_length}
    prediction_key: ${loss_fn.config.prediction_key}
    vocab_size: 50304 # GPT-2 vocab_size of 50257, padded up to nearest multiple of 64 for efficiency
    n_layer: 2
    n_head_q: 8
    n_head_kv: 4
    ffn_hidden: 128
    n_embd: 128
    dropout: 0.0
    bias: true # True: bias in Linears and LayerNorms, like GPT-2. False: a bit better and faster
    attention_config:
      qkv_transforms:
        - type_hint: RotaryTransform
          config:
            n_embd: ${model_raw.config.n_embd}
            n_head: ${model_raw.config.n_head_q} #it has to be head_q here
            seq_length_dim: -2
    attention_implementation: manual
    activation_type: swiglu
    attention_norm:
      component_key: layer_norm
      variant_key: rms_norm
      config:
        ndim: ${model_raw.config.n_embd}
        bias: true
        epsilon: 1e-5
    ffn_norm:
      component_key: layer_norm
      variant_key: rms_norm
      config:
        ndim: ${model_raw.config.n_embd}
        bias: true
        epsilon: 1e-5
    lm_head_norm:
      component_key: layer_norm
      variant_key: rms_norm
      config:
        ndim: ${model_raw.config.n_embd}
        bias: true
        epsilon: 1e-5

scheduler:
  component_key: scheduler
  variant_key: onecycle_lr
  config:
    optimizer:
      instance_key: optimizer
      pass_type: BY_REFERENCE
    max_lr: 6e-4
    div_factor: 10
    final_div_factor: 1
    total_steps: 1000
    pct_start: 0.01
    anneal_strategy: cos

optimizer:  
  component_key: optimizer
  variant_key: adam_w
  config:
    lr: 0.0001
    betas: [0.9, 0.95]
    eps: 1e-8
    weight_decay: 1e-1
    weight_decay_groups_excluded: [embedding, layernorm]
    wrapped_model: 
      instance_key: wrapped_model
      pass_type: BY_REFERENCE

gradient_clipper:
  component_key: gradient_clipper
  variant_key: fsdp
  config:
    wrapped_model:
      instance_key: wrapped_model
      pass_type: BY_REFERENCE
    norm_type: P2_NORM
    max_norm: 1.0

batch_progress_subscriber:
  component_key: progress_subscriber
  variant_key: dummy
  config: {}

evaluation_subscriber:
  component_key: results_subscriber
  variant_key: wandb
  config:
    global_rank: ${settings.cuda_env.global_rank}
    project: ai_24_demo
    mode: ONLINE
    experiment_id: ${settings.experiment_id}
    directory: wandb_storage
    config_file_path: ${settings.config_file_path}

```
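
To get a feeling for the scale of this run, the batch settings in the config can be turned into token counts. This is a back-of-the-envelope sketch, not the Modalities `num_tokens_from_num_steps_callable` itself. It assumes a world size of 4 (the demo launches four processes on one node) and one optimizer step per batch, since `gradient_acc_steps` is 1.

```python
# Values copied from the config above; world_size = 4 is an assumption that
# matches a launch with four processes on a single node.
world_size = 4
local_train_micro_batch_size = 64
sequence_length = 256
gradient_acc_steps = 1
num_steps = 1000                 # fixed_num_batches / total_steps
checkpointing_interval_in_steps = 50

# Tokens consumed across all ranks per optimizer step.
tokens_per_step = (
    world_size * local_train_micro_batch_size * sequence_length * gradient_acc_steps
)
total_tokens = tokens_per_step * num_steps
tokens_per_checkpoint = tokens_per_step * checkpointing_interval_in_steps

print(f"tokens per optimizer step:   {tokens_per_step:,}")        # 65,536
print(f"tokens between checkpoints:  {tokens_per_checkpoint:,}")  # 3,276,800
print(f"tokens after {num_steps} steps:     {total_tokens:,}")    # 65,536,000
```

Under these assumptions the demo run processes roughly 65.5 million tokens in total.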

In [17]:
! CUDA_VISIBLE_DEVICES=0,1,2,3 torchrun --rdzv-endpoint localhost:29515 \
                                        --nnodes 1 \
                                        --nproc_per_node 4 \
                                        $(which modalities) run --config_file_path configs/pretraining_config.yaml

W0829 13:26:50.011000 140682708366400 torch/distributed/run.py:757] 
W0829 13:26:50.011000 140682708366400 torch/distributed/run.py:757] *****************************************
W0829 13:26:50.011000 140682708366400 torch/distributed/run.py:757] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed. 
W0829 13:26:50.011000 140682708366400 torch/distributed/run.py:757] *****************************************
Instantiated <class 'modalities.models.components.layer_norms.RMSLayerNorm'>: model_raw -> config -> attention_norm
Instantiated <class 'modalities.models.components.layer_norms.RMSLayerNorm'>: model_raw -> config -> ffn_norm
Instantiated <class 'modalities.models.components.layer_norms.RMSLayerNorm'>: model_raw -> config -> lm_head_norm
Instantiated <class 'modalities.models.gpt2.gpt2_model.GPT2LLM'>: model_raw
Instantiated <cla