## Environment Setup

This section installs the libraries and dependencies needed to train our Mamba model, so that everything is in place before we load the data and build the model.

First, we'll clone the `causal-conv1d` repository from GitHub. It provides an optimized CUDA implementation of 1D depthwise causal convolutions, a key component of the Mamba architecture, which can significantly speed up training. After cloning, we'll check out the `v1.5.2` tag to keep the build reproducible.

In [None]:
!git clone https://github.com/Dao-AILab/causal-conv1d.git
%cd causal-conv1d
!git checkout v1.5.2

Now, we will install the `causal-conv1d` library from the source code we just cloned. Setting the environment variable `CAUSAL_CONV1D_FORCE_BUILD=TRUE` forces the CUDA extension to be compiled locally rather than reusing a prebuilt binary, so the installed build matches the version we checked out and the current environment.

In [4]:
!CAUSAL_CONV1D_FORCE_BUILD=TRUE pip install .

Processing /content/causal-conv1d
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Preparing metadata (pyproject.toml) ... done
Collecting ninja (from causal_conv1d==1.5.2)
  Using cached ninja-1.13.0-py3-none-manylinux2014_x86_64.manylinux_2_17_x86_64.whl.metadata (5.1 kB)
Using cached ninja-1.13.0-py3-none-manylinux2014_x86_64.manylinux_2_17_x86_64.whl (180 kB)
Building wheels for collected packages: causal_conv1d
  Building wheel for causal_conv1d (pyproject.toml) ... done
  Created wheel for causal_conv1d: filename=causal_conv1d-1.5.2-cp312-cp312-linux_x86_64.whl size=102969996 sha256=5c7d4f44ffa50cf9932b8a76d79f21b95a740f2a21c4e64467db54a476c4a6b5
  Stored in directory: /root/.cache/pip/wheels/ff/46/f6/494a696282daefdb5ace79c1dd44012d2d496de2279db3022b
Successfully built causal_conv1d
Installing collected packages: ninja, causal_conv1d
Successfully installed causal_conv1d-1.5.2 ninja-1.13.0
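
If the build above fails, a quick look at whether the CUDA toolchain is visible can help narrow down the cause. The following is a minimal sketch that assumes the compiler is exposed as `nvcc` and the driver utility as `nvidia-smi` on `PATH`; the helper name `find_tools` is hypothetical and not part of any library.

```python
import shutil


def find_tools(names=("nvcc", "nvidia-smi")):
    """Map each tool name to its resolved path, or None if it is not on PATH."""
    return {name: shutil.which(name) for name in names}


# Report which CUDA-related executables the current environment can see.
for tool, path in find_tools().items():
    print(f"{tool}: {path or 'not found on PATH'}")
```

A missing `nvcc` does not prove the build cannot work (the toolkit may live outside `PATH`), so treat the result as a hint rather than a verdict.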


Next, we install the `mamba-ssm` library, which contains the core implementation of the Mamba model. We use the `-q` flag for a quiet installation (less verbose output) and the `-U` flag to upgrade the package to the latest version if it's already installed.

In [5]:
!pip install mamba-ssm -q -U

  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Preparing metadata (pyproject.toml) ... done
  Building wheel for mamba-ssm (pyproject.toml) ... done


We will now install three essential libraries: `datasets`, `accelerate`, and `wandb`.
- The `datasets` library from Hugging Face provides an easy way to load and process a wide variety of datasets.
- `accelerate` simplifies running PyTorch code across different hardware setups, such as CPUs, multiple GPUs, and TPUs, with minimal modifications to your code.
- `wandb` (Weights & Biases) is a powerful tool for experiment tracking. It allows us to log and visualize training metrics, which is crucial for monitoring the performance of our model and comparing different experiments.

In [6]:
!pip install datasets accelerate wandb

Finally, we install `transformers` directly from the `main` branch of its GitHub repository, which gives us the current development version (reported as `4.56.0.dev0` in the installation check further down).



In [7]:
!pip install -q git+https://github.com/huggingface/transformers@main

  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Preparing metadata (pyproject.toml) ... done
  Building wheel for transformers (pyproject.toml) ... done


## Platform Logins

In this section, we will log in to the Hugging Face Hub and Weights & Biases (Wandb). This is necessary to access pre-trained models and datasets, as well as to track our training experiments effectively.

We'll now log in to the Hugging Face Hub. This step is crucial for downloading pre-trained models and datasets, and for uploading your trained models to the Hub. When you execute the following cell, you will be prompted to enter your Hugging Face authentication token.

In [8]:
from huggingface_hub import notebook_login

notebook_login()


Next, we'll log in to Weights & Biases (Wandb). This will enable us to track the progress of our model's training in real-time. We can monitor key metrics such as loss and accuracy, and keep a detailed record of our experiments. Upon running the cell, you might need to provide your API key or authenticate through your web browser.

In [None]:
import wandb

wandb.login()

Here, we set up some environment variables to configure how Wandb operates.
- `WANDB_START_METHOD=thread`: This makes Wandb start its background service in a thread instead of a separate process, which can avoid multiprocessing-related problems in notebook environments.
- `WANDB_PROJECT=canarim-mamba-110m`: We are naming our Wandb project "canarim-mamba-110m". All the training runs from this notebook will be organized and logged under this project name, making it easy to compare and analyze them.

In [10]:
%env WANDB_START_METHOD=thread
%env WANDB_PROJECT=canarim-mamba-110m

env: WANDB_START_METHOD=thread
env: WANDB_PROJECT=canarim-mamba-110m
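
The `%env` magic only works inside IPython or Jupyter. If the notebook is ever converted to a plain script, the same configuration can be sketched with the standard library, assuming it runs before training starts:

```python
import os

# Same settings as the %env cell, expressed in plain Python.
os.environ["WANDB_START_METHOD"] = "thread"
os.environ["WANDB_PROJECT"] = "canarim-mamba-110m"

print(os.environ["WANDB_START_METHOD"], os.environ["WANDB_PROJECT"])
```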


## Installation Check

In this section, we will verify the versions of the key libraries we have installed. This is a good practice to ensure that all dependencies are correctly installed and to help with debugging if any issues arise.

In [11]:
!pip show transformers

Name: transformers
Version: 4.56.0.dev0
Summary: State-of-the-art Machine Learning for JAX, PyTorch and TensorFlow
Home-page: https://github.com/huggingface/transformers
Author: The Hugging Face team (past and future) with the help of all our contributors (https://github.com/huggingface/transformers/graphs/contributors)
Author-email: transformers@huggingface.co
License: Apache 2.0 License
Location: /usr/local/lib/python3.12/dist-packages
Requires: filelock, huggingface-hub, numpy, packaging, pyyaml, regex, requests, safetensors, tokenizers, tqdm
Required-by: mamba_ssm, peft, sentence-transformers


In [12]:
!pip show datasets

Name: datasets
Version: 4.0.0
Summary: HuggingFace community-driven open-source library of datasets
Home-page: https://github.com/huggingface/datasets
Author: HuggingFace Inc.
Author-email: thomas@huggingface.co
License: Apache 2.0
Location: /usr/local/lib/python3.12/dist-packages
Requires: dill, filelock, fsspec, huggingface-hub, multiprocess, numpy, packaging, pandas, pyarrow, pyyaml, requests, tqdm, xxhash
Required-by: torchtune


In [13]:
!pip show mamba-ssm

Name: mamba_ssm
Version: 2.2.5
Summary: Mamba state-space model
Home-page: https://github.com/state-spaces/mamba
Author: Tri Dao, Albert Gu
Author-email: Tri Dao <tri@tridao.me>, Albert Gu <agu@cs.cmu.edu>
License: Apache License, Version 2.0

In [14]:
!pip show causal-conv1d

Name: causal_conv1d
Version: 1.5.2
Summary: Causal depthwise conv1d in CUDA, with a PyTorch interface
Home-page: https://github.com/Dao-AILab/causal-conv1d
Author: Tri Dao
Author-email: tri@tridao.me
License: 
Location: /usr/local/lib/python3.12/dist-packages
Requires: ninja, packaging, torch
Required-by: 


In [15]:
!pip show torch

Name: torch
Version: 2.8.0+cu126
Summary: Tensors and Dynamic neural networks in Python with strong GPU acceleration
Home-page: https://pytorch.org/
Author: PyTorch Team
Author-email: packages@pytorch.org
License: BSD-3-Clause
Location: /usr/local/lib/python3.12/dist-packages
Requires: filelock, fsspec, jinja2, networkx, nvidia-cublas-cu12, nvidia-cuda-cupti-cu12, nvidia-cuda-nvrtc-cu12, nvidia-cuda-runtime-cu12, nvidia-cudnn-cu12, nvidia-cufft-cu12, nvidia-cufile-cu12, nvidia-curand-cu12, nvidia-cusolver-cu12, nvidia-cusparse-cu12, nvidia-cusparselt-cu12, nvidia-nccl-cu12, nvidia-nvjitlink-cu12, nvidia-nvtx-cu12, setuptools, sympy, triton, typing-extensions
Required-by: accelerate, causal_conv1d, fastai, mamba_ssm, peft, sentence-transformers, timm, torchaudio, torchdata, torchvision
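
The individual `pip show` cells above can also be condensed into a single check. This is a minimal sketch using the standard library's `importlib.metadata`; the helper name `installed_version` and the package list are illustrative choices, not part of the original notebook.

```python
from importlib.metadata import PackageNotFoundError, version

# Distribution names as used with pip in the cells above.
PACKAGES = ["transformers", "datasets", "mamba-ssm", "causal-conv1d", "torch"]


def installed_version(name: str) -> str:
    """Return the installed version of a distribution, or a marker if it is absent."""
    try:
        return version(name)
    except PackageNotFoundError:
        return "not installed"


for name in PACKAGES:
    print(f"{name:15s} {installed_version(name)}")
```

Compared with `pip show`, this prints only the version numbers, which keeps the notebook output short and avoids dumping license text into the cell output.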


## Library Imports

Now, let's import the necessary modules and classes from the libraries we've installed. These imports are essential for loading the dataset, setting up the model architecture, and executing the training process.

In [16]:
import json
import os
import torch

import datasets

from datasets import load_dataset
from transformers import default_data_collator
from transformers import AutoTokenizer, MambaConfig, MambaForCausalLM
from transformers import Trainer, TrainingArguments

## Dataset Loading

In this section, we will load the dataset that will be used for both training and evaluating our Mamba model. A high-quality dataset is fundamental for training a capable language model.

We will load the "nicholasKluge/Pt-Corpus-Instruct-tokenized-large" dataset from the Hugging Face Hub. This dataset is pre-tokenized, which means the text has already been converted into a numerical format that the model can understand. We'll use `num_proc=12` to specify that 12 processes should be used to load the data in parallel, which will significantly speed up the loading time. The result will be a `DatasetDict` object, which conveniently holds both the training and testing splits of the data.

In [20]:
ds = load_dataset("nicholasKluge/Pt-Corpus-Instruct-tokenized-large", num_proc=12)
ds

Setting num_proc from 12 to 2 for the test split as it only contains 2 shards.

DatasetDict({
    train: Dataset({
        features: ['input_ids', 'attention_mask', 'labels'],
        num_rows: 3033690
    })
    test: Dataset({
        features: ['input_ids', 'attention_mask', 'labels'],
        num_rows: 30000
    })
})

## Tokenizer Configuration

Now, we will set up the tokenizer. A tokenizer is responsible for converting raw text into a sequence of tokens (numerical representations) that the model can process. It's a crucial component in any natural language processing pipeline.

In [21]:
tokenizer_model_id = "nicholasKluge/TeenyTinyLlama-460m"

We will load a pre-trained tokenizer using `AutoTokenizer.from_pretrained`. The `tokenizer_model_id` specifies which tokenizer to load from the Hugging Face Hub. We set `use_fast=True` to load a "fast" tokenizer, which is implemented in Rust and is generally more performant. This tokenizer will handle the conversion of text into numerical IDs that our Mamba model can understand.

In [22]:
tokenizer = AutoTokenizer.from_pretrained(tokenizer_model_id, use_fast=True)

Let's inspect the length of the `input_ids` for the first example in our training set. This will give us an idea of the sequence length we are working with.

In [23]:
len(ds["train"][0]["input_ids"])

2048

Now, let's decode the `input_ids` of the first training example back into human-readable text. This is a good way to verify that our tokenizer is working as expected and to get a feel for the data we are training on.

In [24]:
tokenizer.decode(ds["train"][0]["input_ids"])

'tempos, iniciado por George Lucas em 1977. A Disney investiu US$ 4,4 bilhões pelos direitos de fazer as próximas sequências.\nAbrams prometeu que "Despertar da Força" não vai ser uma "viagem nostálgica" sobre as trajetórias dos irmãos Leia e Luke e a transformação do pai da dupla, Anakin Skywalker no vilão Darth Vader.\nFilmagem foi mantida em segredo\nPoucos detalhes foram liberados sobre o novo filme; sabe-se que o episódio acontece 30 anos após o final de "O Retorno do Jedi" (1983), terceira parte da trilogia inicial. Outra trilogia surgiu entre 1999 e 2005, contando cronologicamente o que teria passado antes dos primeiros episódios. O segundo trio de aventuras, no entanto, não fez tanto sucesso quanto o primeiro.\nOutros personagens igualmente míticos estiveram presentes no evento: o robô C-3PO e Chewbacca, o gigante peludo, amigo fiel de Han Solo.\nPara viver os novos personagens, foram convidados atores como Oscar Isaac e Lupita Nyong\'o, além de novatos como John Boyega e Daisy

## Model Hyperparameter Definition

In this section, we will define the key hyperparameters that determine the architecture of our Mamba model. These parameters control the size and complexity of the model.

Here, we set the core hyperparameters for our Mamba model:
- `vocab_size`: This is the total number of unique tokens in our tokenizer's vocabulary.
- `hidden_size`: This defines the dimensionality of the model's internal representations and embeddings. It's a key factor in determining the model's capacity.
- `expand`: An expansion factor used within the Mamba blocks to determine the size of the intermediate layers.
- `num_hidden_layers`: This specifies the number of Mamba blocks to stack, which determines the depth of the model.
- `intermediate_size`: The size of the intermediate layer within each Mamba block, calculated by multiplying `hidden_size` by the `expand` factor.

In [25]:
vocab_size = 32000
hidden_size = 768
expand = 2
num_hidden_layers = 12
intermediate_size = hidden_size * expand

## Model Parameter Calculation

In this section, we'll estimate the total number of parameters in our Mamba model based on the hyperparameters we've defined. It's important to note that the following calculations are based on a Transformer-like architecture for estimation purposes. The actual parameter count in a Mamba model is determined by its unique components, including the State Space Model (SSM) blocks and their projection matrices. This estimation will still give us a good sense of the model's size.

First, let's calculate the number of parameters in the embedding layer. This layer is responsible for mapping each token ID from our vocabulary to a dense vector of size `hidden_size`. The number of parameters is simply the product of the vocabulary size and the hidden size.

In [26]:
# Embedding parameters
embedding_params = vocab_size * hidden_size
embedding_params

24576000

Here, we estimate the number of parameters for the attention component within each decoder layer; the MLP component is estimated in the next cell. **Important Note:** This calculation is based on a standard Transformer architecture (with 4 linear projections in the attention mechanism). However, the Mamba architecture does not use a traditional attention mechanism. Instead, it employs a State Space Model (SSM) block. This calculation serves as a rough estimation and might not precisely reflect the parameter count of a Mamba block.

In [27]:
# Attention parameters per decoder layer (Transformer-style estimate):
# 4 linear projections, no bias. The MLP's 3 projections are counted in the next cell.
attention_params_per_layer = 4 * (hidden_size * hidden_size)
attention_params_per_layer

2359296

This cell estimates the number of parameters in the MLP (Multi-Layer Perceptron) part of each decoder layer. As with the attention calculation, this is based on a Transformer-like structure and should be considered an approximation for our Mamba model.

In [28]:
mlp_params_per_layer = (hidden_size * intermediate_size) + \
    (hidden_size * intermediate_size) + \
    (intermediate_size * hidden_size)
mlp_params_per_layer

3538944

Now, we calculate the total number of parameters across all the decoder layers (in our case, the Mamba blocks). We do this by multiplying our estimated number of parameters per layer by the total number of hidden layers.

In [29]:
decoder_layer_params = (attention_params_per_layer + mlp_params_per_layer) * num_hidden_layers
decoder_layer_params

70778880

The `lm_head` (language model head) is the final layer of the model. It takes the model's final hidden state and projects it back to the vocabulary size to produce a probability distribution over all possible next tokens. The number of parameters in this layer is the product of the hidden size and the vocabulary size.

In [30]:
# Language model head (lm_head) parameters
lm_head_params = hidden_size * vocab_size
lm_head_params

24576000

Finally, we sum up the parameters from the embedding layer, the decoder layers (using our Transformer-based estimation), and the language model head to get an estimate of the total number of parameters in our model.

In [31]:
total_params = embedding_params + decoder_layer_params + lm_head_params
print(f"Total model parameters: {total_params:,}")

Total model parameters: 119,930,880
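
For comparison, a Mamba-specific count can be sketched from the module shapes that `MambaForCausalLM` prints once it is instantiated later in this notebook. Several values here are assumptions rather than settings made in the notebook: `state_size = 16` and `conv_kernel = 4` are inferred from the printed `x_proj` and `conv1d` shapes, and the `A_log` and `D` tensors do not appear in that printout, so their shapes follow the usual Mamba mixer layout. Whether `lm_head` shares its weights with the embedding depends on the configuration, so both totals are shown.

```python
import math

vocab_size = 32000
hidden_size = 768
expand = 2
num_hidden_layers = 12
intermediate_size = hidden_size * expand  # 1536

# Assumed values, inferred from the printed module shapes (not set in the notebook).
state_size = 16                        # x_proj out_features: 80 = dt_rank + 2 * state_size
conv_kernel = 4                        # Conv1d kernel_size
dt_rank = math.ceil(hidden_size / 16)  # 48, matches dt_proj in_features


def mamba_block_params() -> int:
    """Parameter count of one Mamba block (RMSNorm + mixer)."""
    in_proj = hidden_size * 2 * intermediate_size                 # bias-free
    conv1d = intermediate_size * conv_kernel + intermediate_size  # depthwise weights + bias
    x_proj = intermediate_size * (dt_rank + 2 * state_size)       # bias-free
    dt_proj = dt_rank * intermediate_size + intermediate_size     # weights + bias
    a_log = intermediate_size * state_size                        # assumed SSM state matrix
    d = intermediate_size                                         # assumed skip parameter
    out_proj = intermediate_size * hidden_size                    # bias-free
    norm = hidden_size                                            # RMSNorm weight
    return in_proj + conv1d + x_proj + dt_proj + a_log + d + out_proj + norm


embedding = vocab_size * hidden_size
blocks = mamba_block_params() * num_hidden_layers
final_norm = hidden_size

total_tied = embedding + blocks + final_norm          # lm_head shares the embedding matrix
total_untied = total_tied + hidden_size * vocab_size  # lm_head has its own matrix

print(f"Per block:         {mamba_block_params():,}")
print(f"Total (tied head): {total_tied:,}")
print(f"Total (untied):    {total_untied:,}")
```

Once the model exists, `sum(p.numel() for p in model.parameters())` gives the authoritative number to check this sketch against.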


## Training Configuration

In this section, we will define the hyperparameters and settings for the training process. We will be using the `Trainer` class from the Hugging Face `transformers` library to handle the training loop.

Here, we set several important hyperparameters for our training run:
- `bs` (batch size): The number of training examples to process in a single forward and backward pass on each device.
- `ga_steps` (gradient accumulation steps): The number of forward and backward passes whose gradients are accumulated before a weight update. This increases the effective batch size without increasing peak memory usage. The effective batch size per device is `bs * ga_steps`.
- `epochs`: The total number of times the model will iterate over the entire training dataset.
- `steps_per_epoch`: Despite its name, this value does not limit the length of an epoch. It is used further down as the interval, in training steps, between evaluations (`eval_steps`) and checkpoint saves (`save_steps`).
- `lr` (learning rate): The initial learning rate for the optimizer.

In [32]:
bs=16        # batch size
ga_steps=2   # gradient acc. steps
epochs=1
steps_per_epoch=1000
lr=0.00005
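
To make these numbers concrete, the arithmetic behind the effective batch size can be sketched as follows. It assumes a single device and that every example has the 2048-token length observed for the first training example; `num_devices` and `seq_len` are illustrative names, and the exact step count reported by the `Trainer` may differ slightly because of how it rounds.

```python
import math

bs = 16
ga_steps = 2
num_devices = 1             # assumption: single-GPU run
seq_len = 2048              # assumption: all examples match the first one
num_train_rows = 3_033_690  # size of the train split shown earlier

sequences_per_update = bs * ga_steps * num_devices
tokens_per_update = sequences_per_update * seq_len
updates_per_epoch = math.ceil(num_train_rows / sequences_per_update)

print(f"Sequences per optimizer step: {sequences_per_update}")
print(f"Tokens per optimizer step:    {tokens_per_update:,}")
print(f"Optimizer steps per epoch:    ~{updates_per_epoch:,}")
```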

We specify the directory where all the outputs of the training process, such as model checkpoints and logs, will be saved.

In [33]:
output_dir = "./canarim-mamba-110m"

A `data_collator` is a function that takes a list of samples from the dataset and prepares them into a batch. Here, we use `default_data_collator`, which is suitable for our pre-tokenized dataset as it will simply batch the samples together.

In [34]:
data_collator = default_data_collator

We will now create a `MambaConfig` object. This object holds all the configuration details for our Mamba model's architecture. We will pass the hyperparameters we defined earlier, such as `vocab_size`, `hidden_size`, and `num_hidden_layers`, as well as the special token IDs from our tokenizer (padding, beginning of sequence, and end of sequence).

In [35]:
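config_model = MambaConfig(
    # Reuse the hyperparameters defined earlier instead of repeating literals
    vocab_size=vocab_size,
    hidden_size=hidden_size,
    expand=expand,
    num_hidden_layers=num_hidden_layers,
    # Special token IDs come from the tokenizer
    pad_token_id=tokenizer.pad_token_id,
    bos_token_id=tokenizer.bos_token_id,
    eos_token_id=tokenizer.eos_token_id,
)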

With the configuration object created, we can now instantiate our Mamba model. We will use the `MambaForCausalLM` class, which is specifically designed for causal language modeling tasks (i.e., predicting the next token in a sequence).

In [36]:
model = MambaForCausalLM(
    config_model,
)
model

MambaForCausalLM(
  (backbone): MambaModel(
    (embeddings): Embedding(32000, 768)
    (layers): ModuleList(
      (0-11): 12 x MambaBlock(
        (norm): MambaRMSNorm(768, eps=1e-05)
        (mixer): MambaMixer(
          (conv1d): Conv1d(1536, 1536, kernel_size=(4,), stride=(1,), padding=(3,), groups=1536)
          (act): SiLU()
          (in_proj): Linear(in_features=768, out_features=3072, bias=False)
          (x_proj): Linear(in_features=1536, out_features=80, bias=False)
          (dt_proj): Linear(in_features=48, out_features=1536, bias=True)
          (out_proj): Linear(in_features=1536, out_features=768, bias=False)
        )
      )
    )
    (norm_f): MambaRMSNorm(768, eps=1e-05)
  )
  (lm_head): Linear(in_features=768, out_features=32000, bias=False)
)

The `TrainingArguments` class allows us to specify a wide range of settings and hyperparameters for our training process. Here, we configure things like the output directory, batch sizes, evaluation strategy, logging frequency, gradient accumulation, number of epochs, learning rate scheduler, and enabling `bf16` for mixed-precision training to speed up computation and reduce memory usage. We also configure it to report the training metrics to Wandb.

In [38]:
args = TrainingArguments(
    output_dir=output_dir,
    per_device_train_batch_size=bs,
    per_device_eval_batch_size=bs,
    eval_strategy="steps",
    logging_steps=1,
    eval_steps=steps_per_epoch,
    save_steps=steps_per_epoch,
    save_total_limit=3,
    gradient_accumulation_steps=ga_steps,
    num_train_epochs=epochs,
    lr_scheduler_type="cosine",
    learning_rate=lr,
    bf16=True,
    ddp_find_unused_parameters=False,
    save_safetensors=False,
    report_to="wandb"
)
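
As an illustration of what `lr_scheduler_type="cosine"` implies for the learning rate, here is a minimal pure-Python sketch of a half-cosine decay. It assumes no warmup steps (none are configured above) and decay to zero at the final step; the function `cosine_lr` is a hypothetical helper written for this example, not the library's implementation.

```python
import math


def cosine_lr(step: int, total_steps: int, base_lr: float = 5e-5, warmup_steps: int = 0) -> float:
    """Learning rate at `step` for linear warmup followed by half-cosine decay."""
    if step < warmup_steps:
        # Linear ramp from 0 to base_lr during warmup.
        return base_lr * step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return base_lr * max(0.0, 0.5 * (1.0 + math.cos(math.pi * progress)))


total = 1000  # illustrative number of optimizer steps
for step in (0, 250, 500, 750, 1000):
    print(f"step {step:4d}: lr = {cosine_lr(step, total):.2e}")
```

The rate starts at `lr`, passes through half of it at the midpoint, and approaches zero at the end of training.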

## Trainer Initialization

Now that we have our model, training arguments, data collator, and datasets ready, we can initialize the `Trainer` object. The `Trainer` will bring all these components together to manage the training and evaluation process.

We create an instance of the `Trainer` class, passing in our model, tokenizer, training arguments, data collator, and our training and evaluation datasets. The `Trainer` provides a high-level API that abstracts away much of the boilerplate code typically required for training a model in PyTorch.

In [None]:
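trainer = Trainer(
    model=model,
    # `processing_class` replaces the deprecated `tokenizer` argument in recent transformers releases
    processing_class=tokenizer,
    args=args,
    data_collator=data_collator,
    train_dataset=ds["train"],
    eval_dataset=ds["test"],
)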

## Model Training

It's time to start training our Mamba model! With everything set up, a single call to the `train()` method of our `Trainer` object will kick off the training process.

By calling `trainer.train()`, we initiate the training loop. The `Trainer` will handle all the details, including iterating through the training data, performing forward and backward passes, updating the model's weights, and periodically evaluating the model on the test set. It will also log metrics to Wandb and save model checkpoints at the specified intervals. Let's start the training!

In [40]:
trainer.train()

Step,Training Loss,Validation Loss
