# main.ipynb
This notebook contains some template code to help you with loading/preprocessing the data.

We start with some imports and constants.
The training data is found in the `data` subfolder.
There is also a tokenizer I've trained for you which you can use for the project.

The following cell only needs to be run once. It does not depend on the execution environment (CPU or GPU).

In [None]:
# ========================================
# 1. Clone the Repository (if required)
# ========================================
!git clone https://github.com/KatsuhitoArasaka/BabyLM-Tiny.git
%cd BabyLM-Tiny

# ============================
# 2. Set Paths
# ============================
TRAIN_PATH = './data/train.txt'       # Path to your training data
DEV_PATH = './data/dev.txt'           # Path to your validation data
SPM_PATH = './data/tokenizer.model'   # Path to your tokenizer model

Run the following cell after every kernel restart. Switching the device (CPU ↔ GPU) requires restarting the kernel.

In [None]:
# ==========================
# 3. Install Dependencies
# ==========================
!pip install transformers datasets wandb --quiet

# ===============================
# 4. Authentication with wandb
# ===============================
import wandb
wandb.login()  # Enter your API key when prompted

# ==========================
# 5. Import Libraries and Set Device
# ==========================
import torch
import datasets
from functools import partial
from datasets import load_dataset
from transformers import DebertaV2Tokenizer as Tokenizer
# from transformers.models.deberta_v2.tokenization_deberta_v2 import DebertaV2Tokenizer as Tokenizer
from transformers import Trainer, TrainingArguments
from transformers import DataCollatorForLanguageModeling

DEVICE = 'cuda' if torch.cuda.is_available() else 'cpu'  # Check for GPU availability
print(f"Using device: {DEVICE}")


Here we load the dataset and tokenizer:

In [None]:
# ============================
# 6. Load Dataset and Tokenizer
# ============================
dataset = datasets.load_dataset('text', data_files={'train': TRAIN_PATH, 'validation': DEV_PATH}) # loads the dataset
tokenizer = Tokenizer.from_pretrained(SPM_PATH) # loads the tokenizer

Next we need to tokenize the data. I've split this into two functions: one for an encoder model and one for a decoder model.
Both functions take a dictionary of lists and return a dictionary of lists. This is what lets us use the `map()` function from the `datasets` library, which we apply further below.
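
To make the dictionary-of-lists contract concrete, here is a minimal sketch that uses a plain whitespace split in place of the real tokenizer. The name `toy_tokenize` and the sample sentences are made up for illustration. With `batched=True`, `map()` passes in a batch of this shape and expects the same shape back.

```python
# Toy stand-in for a batched preprocessing function (hypothetical name).
def toy_tokenize(examples):
    batch = {"input_ids": [], "labels": []}
    for text in examples["text"]:
        tokens = text.split()  # whitespace split, NOT the real tokenizer
        batch["input_ids"].append(tokens)
        batch["labels"].append(tokens)
    return batch

# A batch is a dictionary mapping column names to lists, one entry per example.
toy_examples = {"text": ["The sky is blue.", "Hello world"]}
toy_batch = toy_tokenize(toy_examples)
print(toy_batch["input_ids"])
```

Each output list has one entry per input example, which is what `map()` needs in order to line the new columns up with the rows of the dataset.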

### Tokenization (Encoder)
For encoder tokenization, we will be doing masked language modeling (MLM), where masking is applied randomly and dynamically during training. The function below is a preprocessing step that runs only once, so it skips masking and simply sets the labels equal to the inputs.
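
As a rough picture of what "dynamic" masking means, the sketch below samples a fresh random set of masked positions on every call, so the same sentence gets a different mask each epoch. It is a simplified stand-in and not the data collator's actual implementation: the name `toy_dynamic_mask`, the mask id `4`, and the token ids are all made up, every selected position is simply replaced by the mask id, and special tokens are not exempted. The label `-100` is the default `ignore_index` of PyTorch's cross-entropy loss, so those positions do not contribute to the loss.

```python
import random

MASK_ID = 4          # hypothetical id of the [MASK] token
IGNORE_INDEX = -100  # positions with this label do not contribute to the loss

def toy_dynamic_mask(input_ids, mlm_probability=0.15, rng=random):
    """Return (masked_inputs, labels) with a freshly sampled mask."""
    masked, labels = [], []
    for token in input_ids:
        if rng.random() < mlm_probability:
            masked.append(MASK_ID)       # hide the token from the model
            labels.append(token)         # and ask it to predict the original
        else:
            masked.append(token)         # token stays visible
            labels.append(IGNORE_INDEX)  # no loss at this position
    return masked, labels

toy_ids = [1, 17, 42, 8, 99, 23, 2]  # made-up token ids
toy_rng = random.Random(0)           # seeded only to make the demo repeatable
for epoch in range(2):
    masked, labels = toy_dynamic_mask(toy_ids, mlm_probability=0.3, rng=toy_rng)
    print(epoch, masked, labels)
```

Because the mask is drawn at batch time rather than stored in the dataset, the preprocessing step can leave the inputs untouched.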

The argument `add_special_tokens` is set to `True` (it is also true by default, but I put it here for clarity), which means that special tokens that go at the start and end of the text will be added. [CLS] and [SEP] are special tokens marking the beginning and end of the text. So, e.g. "The sky is blue." will become "[CLS] The sky is blue. [SEP]"

### Tokenization (Decoder)
For decoder tokenization, we will be doing causal language modeling (CLM), i.e. predicting the next token. We therefore offset the labels from the inputs by one position, so that the 1st token is used to predict the 2nd, and so on. See the encoder section above for an explanation of [CLS] and [SEP].
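
The offset can be checked on a small made-up sequence. The ids below are invented (with `1` and `2` standing in for [CLS] and [SEP]); the point is only that each input position is paired with the token that follows it.

```python
toy_tokens = [1, 10, 11, 12, 2]  # made-up ids: 1 = [CLS], 2 = [SEP]

toy_inputs = toy_tokens[:-1]  # drop the last token: nothing follows it
toy_labels = toy_tokens[1:]   # drop the first token: nothing predicts it

# Each input position is paired with the next token in the original sequence.
for position, (seen, target) in enumerate(zip(toy_inputs, toy_labels)):
    print(f"position {position}: input {seen} -> label {target}")
```

If you plan to train one of the Hugging Face causal LM classes rather than your own model, check its documentation first: several of them (for example `GPT2LMHeadModel`) shift the labels inside the model, in which case the labels passed in should equal the inputs.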

In [None]:
# ============================
# 7. Tokenization Function (Encoder/Decoder)
# ============================
def tokenize_encoder(examples, tokenizer):
    batch = {
        "input_ids": [],
        "labels": [],
    }

    for example in examples["text"]:
        tokens = tokenizer.encode(example, add_special_tokens=True) # will add [CLS] and [SEP]
        batch["input_ids"].append(tokens)
        batch["labels"].append(tokens)

    return batch

def tokenize_decoder(examples, tokenizer):
    batch = {
        "input_ids": [],
        "labels": [],
    }

    for example in examples["text"]:
        tokens = tokenizer.encode(example, add_special_tokens=True) # will add [CLS] and [SEP]
        batch["input_ids"].append(tokens[:-1])
        batch["labels"].append(tokens[1:])

    return batch


Now we can apply the tokenization. I show it for `tokenize_encoder`, but it works the same way for `tokenize_decoder`. `map()` expects a function of a single argument (the batch). Since our tokenizer stays fixed, we can bind it with `functools.partial`. We then call `map()` with `batched = True` and `num_proc = 4`, which preprocesses the dataset in batches across 4 processes.
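
For readers who have not used `functools.partial` before, here is a small sketch of the binding step on its own. The names `toy_tokenize_with` and `toy_fn` are made up, and `str.split` stands in for the real tokenizer.

```python
from functools import partial

def toy_tokenize_with(examples, tokenizer):
    """Two-argument function: a batch plus the tokenizer to apply."""
    return {"input_ids": [tokenizer(text) for text in examples["text"]]}

# Bind the tokenizer once; the result takes only the batch.
toy_fn = partial(toy_tokenize_with, tokenizer=str.split)

print(toy_fn({"text": ["a b", "c"]}))  # → {'input_ids': [['a', 'b'], ['c']]}
```

The bound function behaves like the original with `tokenizer` filled in, which is the one-argument shape `map()` calls.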

In [None]:
# Apply the tokenization to the dataset
tokenize_fn = partial(tokenize_encoder, tokenizer=tokenizer)  # make the function unary for map()
# Store the result as `tokenized_dataset`, the name the Trainer cell below expects
tokenized_dataset = dataset.map(tokenize_fn, batched=True, num_proc=4, remove_columns=['text'])  # map works with functions that return a dictionary

In [None]:
# ============================
# 8. Setup Model, Training, and Logging (This is the part that may change per model)
# ============================

# Specific Settings for the Model you choose
# Replace this block with model-specific settings
from transformers import AutoModelForMaskedLM  # not imported in the cell above

model_name = "microsoft/deberta-v3-small"  # Example: DeBERTa model for MLM
model = AutoModelForMaskedLM.from_pretrained(model_name)  # Load a Masked Language Model (MLM)

# Data collator to support masked language modeling
data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=True, mlm_probability=0.15)

# ============================
# 9. Setup Training Arguments
# ============================
training_args = TrainingArguments(
    output_dir="./results",             # Directory to save the model
    eval_strategy="steps",               # Evaluation strategy ("steps" or "epoch"); older transformers releases call this `evaluation_strategy`
    eval_steps=500,                      # Steps between evaluations
    logging_dir="./logs",                # Directory to save logs
    logging_steps=100,                   # Steps between logging
    report_to="wandb",                   # Log to wandb
    run_name="deberta-mlm-run",          # Run name in wandb
    per_device_train_batch_size=8,       # Batch size for training
    per_device_eval_batch_size=8,        # Batch size for evaluation
    num_train_epochs=3,                  # Number of training epochs
    save_steps=1000,                     # Save model after every 1000 steps
    save_total_limit=2,                  # Limit the number of saved models
    load_best_model_at_end=True,         # Load the best model at the end of training
)

# ============================
# 10. Train the Model
# ============================
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_dataset["train"],
    eval_dataset=tokenized_dataset["validation"],
    tokenizer=tokenizer,                 # newer transformers releases prefer `processing_class=tokenizer`
    data_collator=data_collator,
)

trainer.train()

In [None]:
# ============================
# 11. Save the Model
# ============================
trainer.save_model("trained-deberta")

# ============================
# 12. Log Progress in wandb
# ============================
wandb.finish()  # log the final metrics and mark the run as complete