# template.ipynb
This notebook contains some template code to help you with loading/preprocessing the data.

We start with some imports and constants.
The training data is found in the `data` subfolder.
There is also a tokenizer I've trained for you which you can use for the project.

This cell only needs to be executed once; it does not depend on the execution environment.

In [1]:
# ========================================
# 1. Clone the Repository (if required)
# ========================================
TOKEN = "YOUR_GITHUB_PAT"  # ← GitHub Personal Access Token (PAT); never commit a real token to a shared notebook  ⚠️
!git clone https://{TOKEN}@github.com/KatsuhitoArasaka/BabyLM-Tiny.git  # your repository link  ⚠️
%cd BabyLM-Tiny

# ===============
# 2. Set Paths
# ===============
TRAIN_PATH = './data/train.txt'       # Path to your training data  ⚠️
DEV_PATH = './data/dev.txt'           # Path to your validation data  ⚠️
SPM_PATH = './data/tokenizer.model'   # Path to your tokenizer model  ⚠️
# Path to evaluating scripts
BLIMP_SCRIPT = "./evaluate_blimp.py"
GLUE_SCRIPT = "./evaluate_glue.py"

Cloning into 'BabyLM-Tiny'...
remote: Enumerating objects: 56, done.
remote: Counting objects: 100% (18/18), done.
remote: Compressing objects: 100% (10/10), done.
remote: Total 56 (delta 16), reused 8 (delta 8), pack-reused 38 (from 1)
Receiving objects: 100% (56/56), 2.52 MiB | 12.75 MiB/s, done.
Resolving deltas: 100% (27/27), done.
/content/BabyLM-Tiny


Execute this cell after every restart of the environment. Changing the device (CPU ↔ GPU) requires restarting the kernel.

In [2]:
# ==========================
# 3. Install Dependencies
# ==========================
!pip install transformers datasets wandb --quiet

# ===============================
# 4. Authentication with wandb
# ===============================
import wandb
wandb.login()  # Enter your API key when prompted ⚠️

# =====================================
# 5. Import Libraries and Set Device
# =====================================
import subprocess
import json

import torch
import datasets
from functools import partial
from datasets import load_dataset
# from transformers import DebertaV2Tokenizer as Tokenizer
# from transformers.models.deberta_v2.tokenization_deberta_v2 import DebertaV2Tokenizer as Tokenizer
from transformers import AutoTokenizer # for custom tokenizer
from transformers import Trainer, TrainingArguments
from transformers import DataCollatorForLanguageModeling
from transformers import AutoModelForMaskedLM

import random
import numpy as np
from transformers import set_seed
set_seed(42)
random.seed(42)
np.random.seed(42)
torch.manual_seed(42)


DEVICE = 'cuda' if torch.cuda.is_available() else 'cpu'  # Check for GPU availability
print(f"Using device: {DEVICE}")

wandb: Logging into wandb.ai. (Learn how to deploy a W&B server locally: https://wandb.me/wandb-server)
wandb: You can find your API key in your browser here: https://wandb.ai/authorize
wandb: Paste an API key from your profile and hit enter:
wandb: No netrc file found, creating one.
wandb: Appending key for api.wandb.ai to your netrc file: /root/.netrc
wandb: Currently logged in as: nikitagorety (Low-Resource_Pretraining) to https://api.wandb.ai. Use `wandb login --relogin` to force relogin


Using device: cuda


In [3]:
# ================================================
# 6. Start a new wandb run to track this script
# ================================================
run = wandb.init(
    # Set the wandb entity where your project will be logged (generally your team name).
    entity="Low-Resource_Pretraining",
    # Set the wandb project where this run will be logged.
    project="NLP_LRP_BabyLM",
    # Track hyperparameters and run metadata.
    config={
        "learning_rate": 3e-5,  # keep in sync with `learning_rate` in TrainingArguments (step 10)
        "architecture": "DeBERTa",  # model architecture used in this run
        "epochs": 3,  # keep in sync with `num_train_epochs` in TrainingArguments (step 10)

        # "dataset": "CIFAR-100",
        # "batch_size": 8,
    },
)

Here we load the dataset and tokenizer:

In [4]:
# ================================
# 7. Load Dataset and Tokenizer
# ================================

#dataset = datasets.load_dataset('text', data_files={'train': TRAIN_PATH, 'validation': DEV_PATH}) # loads the dataset

# loading datasets
with open(TRAIN_PATH, 'r', encoding='utf-8') as f:
    train_data = [{"text": line.strip()} for line in f if line.strip()]

with open(DEV_PATH, 'r', encoding='utf-8') as f:
    val_data = [{"text": line.strip()} for line in f if line.strip()]

# create DatasetDict object
dataset = datasets.DatasetDict({
    "train": datasets.Dataset.from_list(train_data),
    "validation": datasets.Dataset.from_list(val_data)
})


# tokenizer = Tokenizer.from_pretrained(SPM_PATH) # loads the tokenizer
tokenizer = AutoTokenizer.from_pretrained(SPM_PATH) # loads custom tokenizer



Next we need to tokenize the data. I've split this into two functions: one for an encoder model and one for a decoder model.
Both functions take a dictionary of lists and return a dictionary of lists. This lets us use them with the `map()` function of the `datasets` library, which is shown further below.

### Tokenization (Encoder)
For encoder tokenization we will be doing masked language modeling (MLM), where masking is applied randomly and dynamically during training. The function below is a preprocessing step that is applied only once, so we skip masking for now and simply set the labels equal to the inputs.
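To make "dynamic" masking concrete, here is a minimal pure-Python sketch of what happens to one sequence each time it is drawn during training. It is an illustration, not the code the collator actually runs: the function name `mask_tokens`, the token ids, and the vocabulary size are made up, and the 80/10/10 split (mask token / random token / unchanged) is the usual BERT-style scheme that I assume matches the default behaviour of `DataCollatorForLanguageModeling`.

```python
import random

IGNORE_INDEX = -100  # label value that PyTorch's cross-entropy loss ignores by default


def mask_tokens(input_ids, mask_id, vocab_size, special_ids, mlm_probability=0.15, rng=random):
    """Return (masked_inputs, labels) for one sequence of token ids (hypothetical helper)."""
    masked, labels = [], []
    for tok in input_ids:
        # Special tokens are never masked; other tokens are selected with mlm_probability.
        if tok in special_ids or rng.random() >= mlm_probability:
            masked.append(tok)
            labels.append(IGNORE_INDEX)  # no loss on unselected positions
            continue
        labels.append(tok)  # the model has to predict the original token here
        roll = rng.random()
        if roll < 0.8:
            masked.append(mask_id)                    # 80%: replace with the mask token
        elif roll < 0.9:
            masked.append(rng.randrange(vocab_size))  # 10%: replace with a random token
        else:
            masked.append(tok)                        # 10%: keep the token unchanged
    return masked, labels


# Made-up ids: 1 = [CLS], 2 = [SEP], 4 = [MASK]
ids = [1, 17, 42, 99, 8, 2]
rng = random.Random(0)  # seeded only so the demo is repeatable
for epoch in range(2):
    masked, labels = mask_tokens(ids, mask_id=4, vocab_size=100, special_ids={1, 2},
                                 mlm_probability=0.5, rng=rng)
    print(epoch, masked, labels)
```

Because the masking is redrawn on every call, the same sentence can be masked differently in each epoch, which is why the preprocessing function itself does not mask anything.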

The argument `add_special_tokens` is set to `True` (this is also the default, but I pass it explicitly for clarity), so the special tokens that mark the start and end of the text are added: [CLS] at the beginning and [SEP] at the end. For example, "The sky is blue." becomes "[CLS] The sky is blue. [SEP]".
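The toy encoder below sketches how special tokens, truncation, and padding to `max_length` interact. It is only an illustration under stated assumptions: `toy_encode`, the whitespace splitting, the vocabulary, and all ids are hypothetical, and the real tokenizer works on subwords rather than whole words.

```python
def toy_encode(text, vocab, max_length, cls_id, sep_id, pad_id, unk_id):
    """Whitespace 'tokenizer' that mimics add_special_tokens, truncation and padding."""
    ids = [vocab.get(word, unk_id) for word in text.split()]
    ids = ids[: max_length - 2]                # truncation: leave room for [CLS] and [SEP]
    ids = [cls_id] + ids + [sep_id]            # add_special_tokens=True
    ids += [pad_id] * (max_length - len(ids))  # padding="max_length" (pads on the right)
    return ids


# Made-up vocabulary and ids: 0 = [PAD], 1 = [CLS], 2 = [SEP], 3 = [UNK]
vocab = {"The": 5, "sky": 6, "is": 7, "blue.": 8}

padded = toy_encode("The sky is blue.", vocab, max_length=8,
                    cls_id=1, sep_id=2, pad_id=0, unk_id=3)
print(padded)     # [1, 5, 6, 7, 8, 2, 0, 0]

truncated = toy_encode("The sky is blue.", vocab, max_length=4,
                       cls_id=1, sep_id=2, pad_id=0, unk_id=3)
print(truncated)  # [1, 5, 6, 2]
```

Every sequence comes out with exactly `max_length` ids, which is what lets the examples be stacked into a batch later.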

### Tokenization (Decoder)
For decoder tokenization we will be doing causal language modeling (CLM), i.e. predicting the next token. We therefore need to offset the labels from the inputs by one position, so that the 1st token is used to predict the 2nd, the 2nd to predict the 3rd, and so on. See the paragraph above for an explanation of [CLS] and [SEP].
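The one-position offset can be seen on a small example. This is just a sketch with made-up token ids; `shift_for_clm` is a hypothetical helper that does the same slicing as `tokenize_decoder`.

```python
def shift_for_clm(tokens):
    """Split one token sequence into (inputs, labels) offset by one position."""
    inputs = tokens[:-1]  # everything except the last token
    labels = tokens[1:]   # everything except the first token
    return inputs, labels


# Made-up ids for "[CLS] The sky is blue. [SEP]"
tokens = [1, 5, 6, 7, 8, 2]
inputs, labels = shift_for_clm(tokens)

for position, (seen, target) in enumerate(zip(inputs, labels)):
    print(f"position {position}: input {seen} -> predict {target}")
```

Note that both lists are one token shorter than the original sequence. Also, as far as I know, Hugging Face `...ForCausalLM` models shift the labels internally, so this manual offset is only appropriate for a model that does not do so itself.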

In [5]:
# =============================================
# 8. Tokenization Function (Encoder/Decoder)
# =============================================
def tokenize_encoder(examples, tokenizer, max_length):
    batch = {
        "input_ids": [],
        "labels": [],
    }

    for example in examples["text"]:
        tokens = tokenizer.encode(example,
                                  add_special_tokens=True,
                                  padding="max_length",
                                  truncation=True,
                                  max_length=max_length) # will add [CLS] and [SEP]
        batch["input_ids"].append(tokens)
        batch["labels"].append(tokens)

    return batch

def tokenize_decoder(examples, tokenizer, max_length):
    batch = {
        "input_ids": [],
        "labels": [],
    }

    for example in examples["text"]:
        tokens = tokenizer.encode(example,
                                  add_special_tokens=True,
                                  padding="max_length",
                                  truncation=True,
                                  max_length=max_length) # will add [CLS] and [SEP]
        batch["input_ids"].append(tokens[:-1])
        batch["labels"].append(tokens[1:])

    return batch

Now we can apply the tokenization. I show it for `tokenize_encoder`, but it works the same way for `tokenize_decoder`. First, `map()` expects a function of a single parameter (the batch). Since the tokenizer and the maximum length are constant, we can bind them with `functools.partial`. Then we apply `map()`, which with `batched=True` and `num_proc` preprocesses the dataset quickly and in parallel.
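If `functools.partial` is new to you, the sketch below shows the idea without any library dependencies. The names `toy_tokenize` and `char_tokenizer` are hypothetical stand-ins; the point is only that the bound function takes a single dictionary of lists and returns a dictionary of lists, which is the shape a batched `map()` call works with.

```python
from functools import partial


def char_tokenizer(text):
    """Hypothetical stand-in for a real tokenizer: one id per character."""
    return [ord(ch) for ch in text]


def toy_tokenize(examples, tokenizer, max_length):
    """Takes a dict of lists (one batch) and returns a dict of lists."""
    batch = {"input_ids": [], "labels": []}
    for text in examples["text"]:
        ids = tokenizer(text)[:max_length]
        batch["input_ids"].append(ids)
        batch["labels"].append(ids)
    return batch


# Bind the constant arguments so that only `examples` is left as a parameter.
tokenize_fn = partial(toy_tokenize, tokenizer=char_tokenizer, max_length=4)

# A batched map() would call the function with a batch shaped like this.
out = tokenize_fn({"text": ["sky", "blue sky"]})
print(out["input_ids"])  # [[115, 107, 121], [98, 108, 117, 101]]
```

A lambda would work as well, but `partial` keeps the bound arguments explicit, which I find easier to read.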

In [6]:
# Apply the tokenization to the dataset
tokenize_fn = partial(tokenize_encoder, tokenizer = tokenizer, max_length = 128) # need to make the function unary for map()
tokenized_dataset = dataset.map(tokenize_fn, batched = True, num_proc = 4, remove_columns = ['text']) # map works with functions that return a dictionary

Map (num_proc=4):   0%|          | 0/116191 [00:00<?, ? examples/s]
Map (num_proc=4):   0%|          | 0/11570 [00:00<?, ? examples/s]

In [None]:
# =====================================================================================
# 9. Setup Model, Training, and Logging (This is the part that may change per model)
# =====================================================================================

# Specific Settings for the Model you choose            ⚠️
# Replace this block with model-specific settings       ⚠️
model_name = "microsoft/deberta-v3-small"  # Example: DeBERTa model for MLM
model = AutoModelForMaskedLM.from_pretrained(model_name)  # Load a Masked Language Model (MLM)

# Data collator to support masked language modeling
data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=True, mlm_probability=0.15)

# ===============================
# 10. Setup Training Arguments
# ===============================
training_args = TrainingArguments(
    output_dir="./results",             # Directory to save the model
    eval_strategy="steps",         # Evaluation strategy (can be "steps" or "epoch")
    eval_steps=2000,                      # Steps between evaluations
    logging_dir="./logs",                # Directory to save logs
    logging_steps=1000,                   # Steps between logging
    per_device_train_batch_size=8,       # Batch size for training
    per_device_eval_batch_size=8,        # Batch size for evaluation
    num_train_epochs=3,                  # Number of training epochs
    save_steps=2000,                     # Save model after every n steps
    save_total_limit=3,                  # Limit the number of saved models
    load_best_model_at_end=True,         # Load the best model at the end of training
    fp16=True,
    learning_rate=3e-5, # you can experiment with 3e-5 to 1e-4
    weight_decay=0.01, # regularization, helps to avoid overfitting
    warmup_steps=500, # warm-up for a smooth start
)

# ======================
# 11. Train the Model
# ======================
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_dataset["train"],
    eval_dataset=tokenized_dataset["validation"],
    tokenizer=tokenizer,
    data_collator=data_collator,
)

trainer.train()

In [None]:
# ============================================================
# 12. Save the trained model locally (for later evaluation)
# ============================================================

# This will save the model to a folder inside your Colab environment
# The folder will be deleted when the session ends unless uploaded elsewhere (e.g., Hugging Face, Google Drive)
save_path = "./trained_models/model_name"  # ⚠️ Change `model_name` if training multiple models in one session

trainer.save_model(save_path)        # Save model weights, config, etc.
tokenizer.save_pretrained(save_path) # Save tokenizer (required for evaluation)

In [None]:
# ========================================
# 13. Evaluate the model (BLiMP + GLUE)
# ========================================

MODEL_TYPE = "encoder"  # or "decoder" if you train an autoregressive model       ⚠️

# ---- Run BLiMP Evaluation ----
blimp_output = subprocess.run(
    ["python", BLIMP_SCRIPT,
     "--model_type", MODEL_TYPE,
     "--model_path", save_path,
     "--batch_size", "16"],  # For Colab (T4/P100): usually 16 or 32.
    capture_output=True, text=True
)

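# Parse BLiMP results from stdout (expects lines of the form "name: score")
blimp_results = {}
for line in blimp_output.stdout.strip().split("\n"):
    name, sep, value = line.partition(":")
    if not sep:
        continue  # no colon, so not a result line
    try:
        blimp_results[name.strip(" -")] = float(value)
    except ValueError:
        continue  # the part after the colon is not a number
if not blimp_results:
    raise RuntimeError(f"No BLiMP scores could be parsed. stderr:\n{blimp_output.stderr}")
blimp_avg = blimp_results.get("Average", sum(blimp_results.values()) / len(blimp_results))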

# ---- Log BLiMP to wandb ----
wandb.log({"blimp_avg": blimp_avg, "blimp_details": wandb.Html(blimp_output.stdout.replace('\n', '<br>'))})

# ---- Save BLiMP results ----
# Replace `modelname` in 'modelname_results' and `dataset_date` in 'blimp_dataset_date',            ⚠️
# otherwise you may accidentally overwrite the results from another model.
import os
os.makedirs("models_evaluation_results/modelname_results", exist_ok=True)  # create the results folder if it is missing
with open("models_evaluation_results/modelname_results/blimp_dataset_date.json", "w") as f:
    json.dump(blimp_results, f, indent=2)


# ---- Run GLUE Evaluation (one subset at a time) ----
glue_subsets = ['cola','sst2', 'mrpc', 'qnli', 'rte', 'boolq', 'multirc']
glue_scores = {}

for subset in glue_subsets:
    glue_eval = subprocess.run(
        ["python", GLUE_SCRIPT,
         "--subset", subset,
         "--model_type", MODEL_TYPE,
         "--model_path", save_path],
        capture_output=True, text=True
    )

    # Scan this subset's stdout from the end and take the last line containing "Best result:"
    for line in glue_eval.stdout.strip().split("\n")[::-1]:
        if "Best result:" in line:
            glue_scores[subset] = float(line.split(":")[1])
            break

# ---- Log GLUE to wandb ----
wandb.log({"glue_avg": sum(glue_scores.values()) / len(glue_scores), **{f"glue_{k}": v for k, v in glue_scores.items()}})

# ---- Save GLUE results ----
# Don't forget to replace `modelname` in 'modelname_results' and `dataset_date` in 'glue_dataset_date',        ⚠️
# otherwise you may accidentally overwrite the results from another model.
with open("models_evaluation_results/modelname_results/glue_dataset_date.json", "w") as f:
    json.dump(glue_scores, f, indent=2)


In [None]:
# ===============================================
# 14. Upload Trained Model to Hugging Face Hub
# ===============================================

# Install and login (only needs to be done once)
!pip install -q huggingface_hub
from huggingface_hub import login
login()  # Paste your HF token from https://huggingface.co/settings/tokens

from huggingface_hub import create_repo, upload_folder

hf_repo_name = "deberta-small-babylm"  # Change this to a unique name for your model       ⚠️
hf_username = "your_username"          # Replace with your actual Hugging Face username    ⚠️
repo_id = f"{hf_username}/{hf_repo_name}"

# Create a private repo (set private=False to make it public)
create_repo(repo_id, exist_ok=True, private=True)

# Upload entire trained model folder
upload_folder(
    repo_id=repo_id,
    folder_path=save_path,
    path_in_repo=".",  # Upload everything from the folder
    repo_type="model"
)

print(f"Model uploaded to: https://huggingface.co/{repo_id}")

In [None]:
# ===========================
# 15. Finish wandb Logging
# ===========================
wandb.finish()  # log the final metrics and mark the run as complete