# Train Msingi1: A Swahili Language Model

This notebook trains the Msingi1 language model on Google Colab using TPU/GPU acceleration.

## 1. Setup Environment

First, let's set up our environment and install dependencies.

In [None]:
# Clone the repository
!git clone https://github.com/YOUR_USERNAME/msingi1.git
%cd msingi1

In [None]:
# Install dependencies
!pip install -r requirements.txt
!pip install 'torch_xla[tpu]' -f https://storage.googleapis.com/libtpu-releases/index.html  # Only needed on TPU runtimes
!pip install wandb  # For experiment tracking

## 2. Check Available Hardware

Let's verify what hardware accelerator we have access to.

In [None]:
import torch
import os

if torch.cuda.is_available():
    print(f"GPU available: {torch.cuda.get_device_name(0)}")
    device = 'cuda'
elif os.environ.get('COLAB_TPU_ADDR'):
    print("TPU available")
    device = 'tpu'
else:
    print("No GPU/TPU found, using CPU")
    device = 'cpu'

## 3. Mount Google Drive

Mount Google Drive so the dataset can be read from it and the trained checkpoints can be copied back to it.

In [None]:
from google.colab import drive
drive.mount('/content/drive')

# Create local working directories on the Colab VM (not on Drive)
!mkdir -p checkpoints
!mkdir -p data

## 4. Prepare Dataset

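Copy your dataset into the local `data/` directory. The later cells expect the training file at `data/Swahili data/Swahili data/train.txt`, so adjust either the copy command or those paths to match your layout.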

In [None]:
# Copy dataset from Drive if needed
!cp -r "/content/drive/MyDrive/path/to/swahili/data" data/

# Or download dataset directly
# !wget -O data/swahili_data.zip URL_TO_YOUR_DATASET
# !unzip data/swahili_data.zip -d data/
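
Before training a tokenizer on the file, it can help to confirm that the copy succeeded and the file is not empty. The helper below is a minimal sketch; `summarize_text_file` is a hypothetical name and not part of the Msingi1 repository, and it assumes the training file is UTF-8 text with one sample per line.

```python
from pathlib import Path


def summarize_text_file(path):
    """Return (non_empty_line_count, size_in_bytes) for a text file.

    Hypothetical helper for a quick sanity check; raises if the file is missing.
    """
    path = Path(path)
    if not path.is_file():
        raise FileNotFoundError(f"Training file not found: {path}")

    non_empty = 0
    # errors="replace" keeps the check from failing on stray invalid bytes.
    with path.open(encoding="utf-8", errors="replace") as handle:
        for line in handle:
            if line.strip():
                non_empty += 1
    return non_empty, path.stat().st_size
```

In the notebook, calling `summarize_text_file("data/Swahili data/Swahili data/train.txt")` should return a non-zero line count; a `FileNotFoundError` usually means the Drive path in the copy command needs adjusting.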

## 5. Train Tokenizer

Next, train the custom ByteLevelBPE tokenizer on the training text.

In [None]:
from src.train_tokenizer import train_tokenizer

# Train tokenizer
train_tokenizer(
    data_path="data/Swahili data/Swahili data/train.txt",
    save_dir="tokenizer",
    vocab_size=50000,
    min_frequency=2
)
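
To give a feel for what `vocab_size` and `min_frequency` control, here is a toy byte-pair-encoding trainer in plain Python. It is a simplified sketch, not the code inside `src.train_tokenizer`: it splits on whitespace, starts from raw UTF-8 bytes, and stops merging as soon as the most frequent pair occurs fewer than `min_frequency` times. All function names are hypothetical, and the tie-breaking rule is an arbitrary choice made for determinism.

```python
from collections import Counter


def count_pairs(words):
    """Count adjacent symbol pairs over a {symbol_tuple: frequency} table."""
    pairs = Counter()
    for symbols, freq in words.items():
        for left, right in zip(symbols, symbols[1:]):
            pairs[(left, right)] += freq
    return pairs


def merge_pair(words, pair):
    """Replace every occurrence of `pair` with one concatenated symbol."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        key = tuple(out)
        merged[key] = merged.get(key, 0) + freq
    return merged


def learn_merges(corpus, num_merges, min_frequency=2):
    """Learn up to `num_merges` merges; stop early below `min_frequency`."""
    word_counts = Counter(corpus.split())
    # Start from single bytes so any UTF-8 input can be represented.
    words = {
        tuple(bytes([b]) for b in word.encode("utf-8")): freq
        for word, freq in word_counts.items()
    }
    merges = []
    for _ in range(num_merges):
        pairs = count_pairs(words)
        if not pairs:
            break
        # Highest frequency wins; ties are broken by the pair itself.
        best, freq = max(pairs.items(), key=lambda kv: (kv[1], kv[0]))
        if freq < min_frequency:
            break  # No remaining pair is frequent enough to merge.
        merges.append(best)
        words = merge_pair(words, best)
    return merges
```

With `min_frequency=2`, a corpus in which every pair occurs only once yields no merges at all, which is why a very small corpus may produce a vocabulary well below the requested `vocab_size=50000`.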

## 6. Initialize Training

Load the training text and define the model configuration. Training itself starts in the next section.

In [None]:
import torch
from src.model import MsingiConfig
from src.train import train
from src.data_processor import load_dataset

# Load dataset
train_texts = load_dataset('data/Swahili data/Swahili data/train.txt')
print(f"Loaded {len(train_texts)} text samples")

# Initialize config
config = MsingiConfig(
    vocab_size=50000,  # Must match the vocab_size used to train the tokenizer
    max_position_embeddings=2048,
    hidden_size=768,
    num_hidden_layers=12,
    num_attention_heads=12,
    gradient_checkpointing=True  # Enable for memory efficiency
)
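
A rough parameter count helps judge whether this configuration fits on the available accelerator. The estimate below is a sketch that assumes a GPT-2-style decoder (learned position embeddings, a 4x MLP expansion, biases on all linear layers, and an output projection tied to the token embeddings). The actual `MsingiConfig` architecture may differ, so treat the number as an order-of-magnitude guide; `estimate_gpt_params` is a hypothetical helper.

```python
def estimate_gpt_params(vocab_size, max_positions, hidden_size, num_layers, mlp_ratio=4):
    """Estimate trainable parameters for a GPT-2-style decoder (assumed layout)."""
    h = hidden_size
    # Token and learned position embeddings (output head assumed tied).
    embeddings = vocab_size * h + max_positions * h
    # Q, K, V and output projections, each h x h plus a bias of size h.
    attention = 4 * h * h + 4 * h
    # Two linear layers: h -> mlp_ratio*h and mlp_ratio*h -> h, with biases.
    mlp = 2 * mlp_ratio * h * h + mlp_ratio * h + h
    # Two layer norms per block, each with a scale and a shift vector.
    layer_norms = 2 * 2 * h
    per_layer = attention + mlp + layer_norms
    final_norm = 2 * h
    return embeddings + num_layers * per_layer + final_norm


total = estimate_gpt_params(
    vocab_size=50000, max_positions=2048, hidden_size=768, num_layers=12
)
print(f"Estimated parameters: {total:,}")
```

Under these assumptions the configuration above comes to roughly 125 million parameters, with each of the 12 attention heads operating on 768 / 12 = 64 dimensions.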

## 7. Start Training

With the configuration in place, start the training run.

In [None]:
# Start training
train(
    config=config,
    train_texts=train_texts,
    val_texts=None,  # We'll use a portion of train data for validation
    num_epochs=100,
    batch_size=4,
    learning_rate=3e-4,
    warmup_steps=1000,
    grad_acc_steps=16,
    save_steps=1000,
    checkpoint_dir='checkpoints',
    device=device
)
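
The interaction between `batch_size`, `grad_acc_steps`, and `warmup_steps` can be worked out with a little arithmetic. The sketch below assumes that `train` performs one optimizer update per `grad_acc_steps` micro-batches and ramps the learning rate linearly during warmup; neither behavior is confirmed by this notebook, and the helper names are hypothetical.

```python
import math


def effective_batch_size(batch_size, grad_acc_steps):
    """Samples contributing to each optimizer update."""
    return batch_size * grad_acc_steps


def optimizer_steps_per_epoch(num_samples, batch_size, grad_acc_steps):
    """Optimizer updates per epoch, assuming partial batches are kept."""
    micro_batches = math.ceil(num_samples / batch_size)
    return math.ceil(micro_batches / grad_acc_steps)


def linear_warmup_lr(step, base_lr, warmup_steps):
    """Learning rate at `step` under an assumed linear warmup schedule."""
    if warmup_steps <= 0:
        return base_lr
    return base_lr * min(1.0, step / warmup_steps)


print(effective_batch_size(batch_size=4, grad_acc_steps=16))
print(optimizer_steps_per_epoch(num_samples=10_000, batch_size=4, grad_acc_steps=16))
print(linear_warmup_lr(step=500, base_lr=3e-4, warmup_steps=1000))
```

With the settings above, each update sees 4 x 16 = 64 samples. For a hypothetical dataset of 10,000 samples that is 157 updates per epoch, so a 1000-step warmup would span about six epochs; whether `warmup_steps` counts optimizer updates or micro-batches depends on how `train` is implemented.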

## 8. Save Model to Drive

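Checkpoints are written to the Colab VM's local disk, which is discarded when the session ends, so copy them and the tokenizer to Google Drive once training finishes.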

In [None]:
# Create model directory in Drive
!mkdir -p "/content/drive/MyDrive/msingi1/models"

# Copy checkpoints to Drive
!cp -r checkpoints/* "/content/drive/MyDrive/msingi1/models/"
!cp -r tokenizer "/content/drive/MyDrive/msingi1/"
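
The same copy can be done from Python, which makes it easy to confirm what actually landed on Drive. This is a minimal sketch using only the standard library; `backup_directory` is a hypothetical helper, and it requires Python 3.8 or newer for `dirs_exist_ok`.

```python
import shutil
from pathlib import Path


def backup_directory(src, dst):
    """Copy `src` into `dst` (merging with existing content) and list the files in `dst`."""
    src, dst = Path(src), Path(dst)
    if not src.is_dir():
        raise NotADirectoryError(f"Nothing to back up at: {src}")
    dst.mkdir(parents=True, exist_ok=True)
    # dirs_exist_ok=True lets repeated backups overwrite earlier copies.
    shutil.copytree(src, dst, dirs_exist_ok=True)
    return sorted(p.relative_to(dst).as_posix() for p in dst.rglob("*") if p.is_file())
```

For example, `backup_directory("checkpoints", "/content/drive/MyDrive/msingi1/models")` mirrors the first `cp` command above and returns the relative paths of the files now present in the destination.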