# template.ipynb
This notebook contains some template code to help you with loading/preprocessing the data.

We start with some imports and constants.
The training data is found in the `data` subfolder.
There is also a tokenizer I've trained for you which you can use for the project.

The next cell only needs to be run once; it does not depend on the execution environment (CPU or GPU).

In [2]:
# ========================================
# 1. Clone the Repository (if required)
# ========================================
from getpass import getpass
TOKEN = getpass("GitHub Personal Access Token (PAT): ")  # prompt for the PAT; never hard-code it in the notebook
!git clone https://{TOKEN}@github.com/KatsuhitoArasaka/BabyLM-Tiny.git
%cd BabyLM-Tiny

# ============================
# 2. Set Paths
# ============================
TRAIN_PATH = './data/train.txt'       # Path to your training data
DEV_PATH = './data/dev.txt'           # Path to your validation data
SPM_PATH = './data/tokenizer.model'   # Path to your tokenizer model

Cloning into 'BabyLM-Tiny'...
remote: Enumerating objects: 33, done.
remote: Counting objects: 100% (33/33), done.
remote: Compressing objects: 100% (33/33), done.
remote: Total 33 (delta 13), reused 7 (delta 0), pack-reused 0 (from 0)
Receiving objects: 100% (33/33), 2.50 MiB | 7.61 MiB/s, done.
Resolving deltas: 100% (13/13), done.
/content/BabyLM-Tiny


Run the cells below again after every restart of the environment. Changing the device (CPU ↔ GPU) requires restarting the kernel.

In [10]:
# ==========================
# 3. Install Dependencies
# ==========================
!pip install transformers datasets wandb --quiet

# ===============================
# 4. Authentication with wandb
# ===============================
import wandb
wandb.login()  # Enter your API key when prompted

# ==========================
# 5. Import Libraries and Set Device
# ==========================
import torch
import datasets
from functools import partial
from datasets import load_dataset
from transformers import DebertaV2Tokenizer as Tokenizer
# from transformers.models.deberta_v2.tokenization_deberta_v2 import DebertaV2Tokenizer as Tokenizer
from transformers import Trainer, TrainingArguments
from transformers import DataCollatorForLanguageModeling
from transformers import AutoModelForMaskedLM

import random
import numpy as np
from transformers import set_seed
set_seed(42)
random.seed(42)
np.random.seed(42)
torch.manual_seed(42)


DEVICE = 'cuda' if torch.cuda.is_available() else 'cpu'  # Check for GPU availability
print(f"Using device: {DEVICE}")



Using device: cpu


In [None]:
# ================================================
# 6. Start a new wandb run to track this script
# ================================================
run = wandb.init(
    # Set the wandb entity where your project will be logged (generally your team name).
    entity="Low-Resource_Pretraining",
    # Set the wandb project where this run will be logged.
    project="NLP_LRP_BabyLM",
    # Track hyperparameters and run metadata.
    config={
        "learning_rate": 0.02,  # logged as run metadata only; the Trainer does not read this value
        "architecture": "DeBERTa",  # which model family was used in this run
        "epochs": 10,  # logged as run metadata only; keep in sync with num_train_epochs in TrainingArguments

        # "dataset": "CIFAR-100",
        # "batch_size": 8,
    },
    name="deberta-mlm-run-test1"  # display name for this run (wandb.init takes `name`, not `run_name`)
)

Here we load the dataset and tokenizer:

In [18]:
# ============================
# 7. Load Dataset and Tokenizer
# ============================

#dataset = datasets.load_dataset('text', data_files={'train': TRAIN_PATH, 'validation': DEV_PATH}) # loads the dataset

# loading datasets
with open(TRAIN_PATH, 'r', encoding='utf-8') as f:
    train_data = [{"text": line.strip()} for line in f if line.strip()]

with open(DEV_PATH, 'r', encoding='utf-8') as f:
    val_data = [{"text": line.strip()} for line in f if line.strip()]

# create DatasetDict object
dataset = datasets.DatasetDict({
    "train": datasets.Dataset.from_list(train_data),
    "validation": datasets.Dataset.from_list(val_data)
})


tokenizer = Tokenizer.from_pretrained(SPM_PATH) # loads the tokenizer

Next we need to tokenize the data. I've split this into two functions: one for an encoder model and one for a decoder model.
Both functions take a dictionary of lists and return a dictionary of lists. This is what the `map()` function in the `datasets` library expects in batched mode, and we apply it further below.

### Tokenization (Encoder)
For encoder tokenization, we will be doing masked language modeling (MLM), where masking is applied randomly and dynamically during training. The function below is a preprocessing step that runs only once, so it skips masking and simply sets the labels equal to the inputs. The data collator applies the masks later, at training time.
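
To make "dynamic masking" concrete, here is a minimal pure-Python sketch of what happens to one sequence at training time. It is a simplification, not the collator's actual code: the real `DataCollatorForLanguageModeling` also sometimes replaces a selected token with a random token or leaves it unchanged. The names `mask_tokens`, `MASK_ID` and the token ids are hypothetical and chosen only for illustration.

```python
import random

MASK_ID = 4          # hypothetical id of the [MASK] token
IGNORE_INDEX = -100  # label value that PyTorch's cross-entropy loss ignores by default


def mask_tokens(input_ids, special_ids, mlm_probability=0.15, rng=random):
    """Return (masked_inputs, labels) for one sequence of token ids."""
    masked, labels = [], []
    for tok in input_ids:
        if tok not in special_ids and rng.random() < mlm_probability:
            masked.append(MASK_ID)       # hide the token from the model
            labels.append(tok)           # the model must recover the original id
        else:
            masked.append(tok)           # token stays visible
            labels.append(IGNORE_INDEX)  # position does not contribute to the loss
    return masked, labels


rng = random.Random(0)
ids = [1, 17, 23, 42, 8, 2]  # hypothetical ids; 1 = [CLS], 2 = [SEP]

# Two calls on the same sequence can select different positions,
# which is why masking is done per batch rather than once in preprocessing.
first_pass = mask_tokens(ids, special_ids={1, 2}, rng=rng)
second_pass = mask_tokens(ids, special_ids={1, 2}, rng=rng)
print(first_pass)
print(second_pass)
```

Because the masks are redrawn every time a batch is built, the model sees different masked positions for the same sentence across epochs.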

The argument `add_special_tokens` is set to `True` (it is also true by default, but I put it here for clarity), which means that special tokens that go at the start and end of the text will be added. [CLS] and [SEP] are special tokens marking the beginning and end of the text. So, e.g. "The sky is blue." will become "[CLS] The sky is blue. [SEP]"
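
The effect of `add_special_tokens`, `truncation` and `padding="max_length"` can be illustrated with a toy stand-in. This sketch assumes a whitespace tokenizer and a hypothetical helper `toy_encode`; the project's real tokenizer is a SentencePiece model and returns integer ids rather than strings, so only the overall shape of the result carries over.

```python
def toy_encode(text, max_length, add_special_tokens=True):
    """Whitespace 'tokenizer' that mimics special tokens, truncation and padding."""
    tokens = text.split()
    if add_special_tokens:
        tokens = tokens[: max_length - 2]          # leave room for [CLS] and [SEP]
        tokens = ["[CLS]"] + tokens + ["[SEP]"]
    else:
        tokens = tokens[:max_length]
    tokens += ["[PAD]"] * (max_length - len(tokens))  # pad up to a fixed length
    return tokens


print(toy_encode("The sky is blue.", max_length=8))
# ['[CLS]', 'The', 'sky', 'is', 'blue.', '[SEP]', '[PAD]', '[PAD]']

print(toy_encode("The sky is blue.", max_length=4))
# ['[CLS]', 'The', 'sky', '[SEP]']
```

Every sequence comes out with exactly `max_length` entries, which is what allows examples to be stacked into a batch.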

### Tokenization (Decoder)
For decoder tokenization, we will be doing causal language modeling (CLM), i.e. predicting the next token. Therefore we need to offset the labels from the inputs so that the 1st token is used to predict the 2nd, the 2nd to predict the 3rd, and so on. See the paragraph above for an explanation of [CLS] and [SEP].
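
The offset can be seen on a small example. This is a sketch using readable string tokens instead of integer ids; `shift_for_clm` is a hypothetical helper name that performs the same slicing as `tokenize_decoder` below.

```python
def shift_for_clm(tokens):
    """Split one token sequence into next-token-prediction inputs and labels."""
    inputs = tokens[:-1]  # everything except the last token
    labels = tokens[1:]   # everything except the first token
    return inputs, labels


tokens = ["[CLS]", "The", "sky", "is", "blue", ".", "[SEP]"]
inputs, labels = shift_for_clm(tokens)

# Position i of the input is paired with the token that follows it.
for x, y in zip(inputs, labels):
    print(f"{x:>6} -> {y}")
```

Both lists are one element shorter than the original sequence, so with `max_length=128` each example ends up with 127 inputs and 127 labels.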

In [26]:
# ============================
# 8. Tokenization Function (Encoder/Decoder)
# ============================
def tokenize_encoder(examples, tokenizer, max_length):
    batch = {
        "input_ids": [],
        "labels": [],
    }

    for example in examples["text"]:
        tokens = tokenizer.encode(example,
                                  add_special_tokens=True,
                                  padding="max_length",
                                  truncation=True,
                                  max_length=max_length) # will add [CLS] and [SEP]
        batch["input_ids"].append(tokens)
        batch["labels"].append(tokens)

    return batch

def tokenize_decoder(examples, tokenizer, max_length):
    batch = {
        "input_ids": [],
        "labels": [],
    }

    for example in examples["text"]:
        tokens = tokenizer.encode(example,
                                  add_special_tokens=True,
                                  padding="max_length",
                                  truncation=True,
                                  max_length=max_length) # will add [CLS] and [SEP]
        batch["input_ids"].append(tokens[:-1])
        batch["labels"].append(tokens[1:])

    return batch

Now we can apply the tokenization. I show it for `tokenize_encoder`, but it works the same way for `tokenize_decoder`. First, the function passed to `map()` may only take one parameter (the batch). Since the tokenizer and the maximum length are constant, we can fix them in advance with `functools.partial`. Next, we apply `map()`, which allows fast parallel preprocessing of the dataset.
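
The two ideas in this step, fixing arguments with `functools.partial` and working on a dictionary of lists, can be tried without loading any data. The sketch below uses a hypothetical function `clip_lengths` and a hand-written batch; it only mirrors the calling convention that `map()` uses in batched mode and does not call the `datasets` library.

```python
from functools import partial


def clip_lengths(examples, max_length):
    """Takes a dict of lists and returns a dict of lists, like our tokenize functions."""
    return {"n_chars": [min(len(text), max_length) for text in examples["text"]]}


# partial() fixes max_length, leaving a function of one argument (the batch).
clip_fn = partial(clip_lengths, max_length=5)

batch = {"text": ["hi", "hello world"]}  # one batch in dict-of-lists form
result = clip_fn(batch)
print(result)  # {'n_chars': [2, 5]}
```

The returned dictionary has one list entry per input example, which is the contract the tokenization functions above follow as well.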

In [27]:
# Apply the tokenization to the dataset
tokenize_fn = partial(tokenize_encoder, tokenizer = tokenizer, max_length = 128) # need to make the function unary for map()
tokenized_dataset = dataset.map(tokenize_fn, batched = True, num_proc = 4, remove_columns = ['text']) # map works with functions that return a dictionary


Map (num_proc=4):   0%|          | 0/116191 [00:00<?, ? examples/s]



Map (num_proc=4):   0%|          | 0/11570 [00:00<?, ? examples/s]



In [29]:
# ============================
# 9. Setup Model, Training, and Logging (This is the part that may change per model)
# ============================

# Specific Settings for the Model you choose
# Replace this block with model-specific settings
model_name = "microsoft/deberta-v3-small"  # Example: DeBERTa model for MLM
model = AutoModelForMaskedLM.from_pretrained(model_name)  # Load a Masked Language Model (MLM)

# Data collator to support masked language modeling
data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=True, mlm_probability=0.15)

# ============================
# 10. Setup Training Arguments
# ============================
training_args = TrainingArguments(
    output_dir="./results",              # Directory to save the model
    eval_strategy="steps",               # Evaluation strategy ("steps" or "epoch")
    eval_steps=5000,                     # Steps between evaluations
    logging_dir="./logs",                # Directory to save logs
    logging_steps=1000,                  # Steps between logging
    per_device_train_batch_size=8,       # Batch size for training
    per_device_eval_batch_size=8,        # Batch size for evaluation
    num_train_epochs=2,                  # Number of training epochs
    save_steps=5000,                     # Save a checkpoint every 5000 steps (matches eval_steps)
    save_total_limit=2,                  # Limit the number of saved checkpoints
    load_best_model_at_end=True,         # Load the best checkpoint at the end of training
    fp16=torch.cuda.is_available(),      # Mixed precision needs a GPU, so it is disabled on CPU
)

# ============================
# 11. Train the Model
# ============================
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_dataset["train"],
    eval_dataset=tokenized_dataset["validation"],
    tokenizer=tokenizer,
    data_collator=data_collator,
)

trainer.train()

Some weights of DebertaV2ForMaskedLM were not initialized from the model checkpoint at microsoft/deberta-v3-small and are newly initialized: ['cls.predictions.bias', 'cls.predictions.decoder.bias', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.transform.dense.bias', 'cls.predictions.transform.dense.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Step,Training Loss,Validation Loss


KeyboardInterrupt: 

In [None]:
# ============================
# 12. Save the Model
# ============================
trainer.save_model("trained-deberta")

# ============================
# 13. Log Progress in wandb
# ============================
wandb.finish()  # log the final metrics and mark the run as complete