This notebook demonstrates how to fine-tune BERT on the Masked Language Modeling (MLM) task using a subset of the WikiText-2 dataset. The workflow loads a pre-trained BERT model and tokenizer, tokenizes the text, randomly masks 15% of the input tokens, and trains the model with cross-entropy loss on the masked positions. To keep the demonstration fast, the dataset is reduced to the first 100 rows of the training split and training runs for a single epoch. After training, the fine-tuned model is saved and tested with a fill-mask pipeline that predicts the masked token in a sentence, and its predictions are compared with those of the original pre-trained model.
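
To make the random masking step concrete, here is a minimal pure-Python sketch of the BERT-style masking rule. It assumes the commonly described 80/10/10 split (80% of selected tokens become `[MASK]`, 10% become a random token, 10% stay unchanged) and the `-100` "ignore" label. The function name `mask_tokens`, the toy vocabulary, and the use of token strings instead of token IDs are illustrative assumptions, not the library's implementation.

```python
import random

MASK, CLS, SEP = "[MASK]", "[CLS]", "[SEP]"
IGNORE_INDEX = -100  # label value that the loss skips

def mask_tokens(tokens, vocab, mlm_probability=0.15, seed=0):
    """Return (masked_tokens, labels) following an 80/10/10 masking rule."""
    rng = random.Random(seed)
    masked, labels = list(tokens), []
    for i, tok in enumerate(tokens):
        # Special tokens are never selected; other tokens are selected
        # with probability mlm_probability.
        if tok in (CLS, SEP) or rng.random() >= mlm_probability:
            labels.append(IGNORE_INDEX)
            continue
        labels.append(tok)  # the model must recover the original token here
        roll = rng.random()
        if roll < 0.8:
            masked[i] = MASK               # 80%: replace with [MASK]
        elif roll < 0.9:
            masked[i] = rng.choice(vocab)  # 10%: replace with a random token
        # remaining 10%: keep the original token unchanged
    return masked, labels

# Hypothetical toy input; a higher probability is used so the effect is visible.
tokens = [CLS, "the", "capital", "of", "japan", "is", "tokyo", ".", SEP]
vocab = ["the", "capital", "of", "japan", "is", "tokyo", ".", "city"]
masked, labels = mask_tokens(tokens, vocab, mlm_probability=0.5, seed=1)
print(masked)
print(labels)
```

Positions labeled `-100` do not contribute to the loss, so the model is trained only on the tokens that were selected for masking.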

In [None]:
!pip install transformers torch datasets


Collecting datasets
  Downloading datasets-3.2.0-py3-none-any.whl.metadata (20 kB)
Collecting dill<0.3.9,>=0.3.0 (from datasets)
  Downloading dill-0.3.8-py3-none-any.whl.metadata (10 kB)
Collecting xxhash (from datasets)
  Downloading xxhash-3.5.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (12 kB)
Collecting multiprocess<0.70.17 (from datasets)
  Downloading multiprocess-0.70.16-py310-none-any.whl.metadata (7.2 kB)
Collecting fsspec (from torch)
  Downloading fsspec-2024.9.0-py3-none-any.whl.metadata (11 kB)
Downloading datasets-3.2.0-py3-none-any.whl (480 kB)

In [None]:
# Import necessary libraries
import torch
from torch.optim import AdamW   # Optimizer (the AdamW class in transformers is deprecated)
from torch.utils.data import DataLoader
from transformers import (
    BertTokenizer,              # Tokenizer for BERT
    BertForMaskedLM,            # BERT model for Masked Language Modeling
    DataCollatorForLanguageModeling,  # Handles masking of input tokens
)
from datasets import load_dataset  # To load text datasets

# ---------------------------
# 1. Load a Smaller Dataset
# ---------------------------
# Load the WikiText-2 dataset and select a small subset
print("Loading dataset...")
dataset = load_dataset("wikitext", "wikitext-2-raw-v1", split="train")
small_dataset = dataset.select(range(100))  # Select the first 100 examples

# ---------------------------
# 2. Load BERT Tokenizer and Model
# ---------------------------
print("Loading BERT tokenizer and model...")
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForMaskedLM.from_pretrained("bert-base-uncased")

# ---------------------------
# 3. Tokenize Dataset
# ---------------------------
# Function to tokenize text and truncate to a maximum length
def tokenize_function(examples):
    return tokenizer(
        examples["text"],                   # Input text
        truncation=True,                    # Truncate text to max length
        max_length=128,                     # Maximum sequence length
        return_special_tokens_mask=True     # Indicate special tokens like [CLS], [SEP]
    )

print("Tokenizing dataset...")
tokenized_datasets = small_dataset.map(tokenize_function, batched=True, remove_columns=["text"])

# ---------------------------
# 4. Prepare Data Loader with Masking
# ---------------------------
data_collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer,
    mlm=True,
    mlm_probability=0.15
)

# DataLoader for batching
print("Preparing DataLoader...")
dataloader = DataLoader(
    tokenized_datasets,       # Tokenized dataset
    batch_size=8,             # Batch size for faster runs
    shuffle=True,             # Shuffle the data at each epoch
    collate_fn=data_collator  # Apply masking during batching
)

# ---------------------------
# 5. Setup Training Components
# ---------------------------
optimizer = AdamW(model.parameters(), lr=5e-5)  # Define optimizer
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")  # Use GPU if available
model.to(device)

# ---------------------------
# 6. Training Loop
# ---------------------------
print("Starting training...")
model.train()
num_epochs = 1  # Reduced number of epochs for demo purposes

for epoch in range(num_epochs):
    total_loss = 0  # Track total loss for the epoch

    for batch in dataloader:
        # Move the batch to the same device as the model (GPU or CPU)
        batch = {k: v.to(device) for k, v in batch.items()}

        # Forward pass
        outputs = model(**batch)
        loss = outputs.loss  # Masked LM loss

        # Backward pass
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

        # Accumulate loss
        total_loss += loss.item()

    # Print average loss for the epoch
    print(f"Epoch {epoch + 1}/{num_epochs}, Loss: {total_loss / len(dataloader):.4f}")

# ---------------------------
# 7. Save the Trained Model
# ---------------------------
print("Saving model and tokenizer...")
model.save_pretrained("bert_masked_lm_demo")
tokenizer.save_pretrained("bert_masked_lm_demo")

print("Training complete. Model saved as 'bert_masked_lm_demo'.")


Loading dataset...




README.md:   0%|          | 0.00/10.5k [00:00<?, ?B/s]

test-00000-of-00001.parquet:   0%|          | 0.00/733k [00:00<?, ?B/s]

train-00000-of-00001.parquet:   0%|          | 0.00/6.36M [00:00<?, ?B/s]

validation-00000-of-00001.parquet:   0%|          | 0.00/657k [00:00<?, ?B/s]

Generating test split:   0%|          | 0/4358 [00:00<?, ? examples/s]

Generating train split:   0%|          | 0/36718 [00:00<?, ? examples/s]

Generating validation split:   0%|          | 0/3760 [00:00<?, ? examples/s]

Loading BERT tokenizer and model...


BertForMaskedLM has generative capabilities, as `prepare_inputs_for_generation` is explicitly overwritten. However, it doesn't directly inherit from `GenerationMixin`. From 👉v4.50👈 onwards, `PreTrainedModel` will NOT inherit from `GenerationMixin`, and this model will lose the ability to call `generate` and other related functions.
  - If you are the owner of the model architecture code, please modify your model class such that it inherits from `GenerationMixin` (after `PreTrainedModel`, otherwise you'll get an exception).
  - If you are not the owner of the model architecture class, please contact the model code owner to update it.
Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertForMaskedLM: ['bert.pooler.dense.bias', 'bert.pooler.dense.weight', 'cls.seq_relationship.bias', 'cls.seq_relationship.weight']
- This IS expected if you are initializing BertForMaskedLM from the checkpoint of a model trained on another task or with another archite

Tokenizing dataset...


Map:   0%|          | 0/100 [00:00<?, ? examples/s]

Preparing DataLoader...
Starting training...




Epoch 1/1, Loss: 2.9378
Saving model and tokenizer...
Training complete. Model saved as 'bert_masked_lm_demo'.
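
The loss printed above is a cross-entropy averaged over masked positions only. As a rough sketch of that computation, the pure-Python function below averages the negative log-likelihood over positions whose label is not `-100`. The logits are hypothetical toy values and `masked_cross_entropy` is an illustrative name, not the model's internal code. The last lines convert the reported epoch loss into a perplexity, treating the printed average of per-batch losses as an approximation of the per-token loss.

```python
import math

IGNORE_INDEX = -100  # positions with this label are skipped

def masked_cross_entropy(logits, labels):
    """Mean negative log-likelihood over positions whose label is not IGNORE_INDEX."""
    losses = []
    for row, label in zip(logits, labels):
        if label == IGNORE_INDEX:
            continue
        # Log of the softmax normalizer, computed stably by subtracting the max logit.
        m = max(row)
        log_norm = m + math.log(sum(math.exp(x - m) for x in row))
        losses.append(log_norm - row[label])
    return sum(losses) / len(losses)

# Hypothetical logits for 3 positions over a 4-token vocabulary.
logits = [[2.0, 0.5, 0.1, -1.0],
          [0.0, 0.0, 0.0, 0.0],
          [0.3, 1.2, -0.4, 0.9]]
labels = [0, IGNORE_INDEX, 3]  # the middle position was not masked
loss = masked_cross_entropy(logits, labels)
print(f"toy loss: {loss:.4f}")

# Approximate perplexity corresponding to the epoch loss reported above.
reported_loss = 2.9378
perplexity = math.exp(reported_loss)
print(f"perplexity: {perplexity:.2f}")
```

A loss of 2.9378 corresponds to a perplexity of about 18.9, meaning the model is on average roughly as uncertain as if it were choosing uniformly among 19 candidate tokens at each masked position.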


Step 3: Testing the Model
After training, test the model's ability to predict masked tokens:

In [None]:
from transformers import pipeline

# ---------------------------
# 1. Load Pre-trained BERT Model (Before Training)
# ---------------------------
print("Inference using Pre-trained BERT (Before Training):")
pretrained_pipeline = pipeline("fill-mask", model="bert-base-uncased")

# Example sentence with a [MASK] token. Alternative prompts to try:
#   "The Declaration of Independence was signed in [MASK]."
#   "Water boils at [MASK] degrees Celsius at sea level."
#   "The movie 'Inception' was directed by [MASK]."
sentence = "The capital of Japan is [MASK]."
pretrained_result = pretrained_pipeline(sentence)

# Display top predictions before training
print("\n--- Predictions Before Training ---")
for prediction in pretrained_result:
    print(f"Prediction: {prediction['token_str']}, Score: {prediction['score']:.4f}")

# ---------------------------
# 2. Load Fine-Tuned BERT Model (After Training)
# ---------------------------
print("\nInference using Fine-Tuned BERT (After Training):")
fine_tuned_pipeline = pipeline("fill-mask", model="bert_masked_lm_demo")

# Perform inference with fine-tuned model
fine_tuned_result = fine_tuned_pipeline(sentence)

# Display top predictions after training
print("\n--- Predictions After Training ---")
for prediction in fine_tuned_result:
    print(f"Prediction: {prediction['token_str']}, Score: {prediction['score']:.4f}")



Inference using Pre-trained BERT (Before Training):


Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertForMaskedLM: ['bert.pooler.dense.bias', 'bert.pooler.dense.weight', 'cls.seq_relationship.bias', 'cls.seq_relationship.weight']
- This IS expected if you are initializing BertForMaskedLM from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForMaskedLM from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).



--- Predictions Before Training ---
Prediction: tokyo, Score: 0.5642
Prediction: osaka, Score: 0.0945
Prediction: kyoto, Score: 0.0914
Prediction: nara, Score: 0.0457
Prediction: kobe, Score: 0.0405

Inference using Fine-Tuned BERT (After Training):

--- Predictions After Training ---
Prediction: tokyo, Score: 0.3358
Prediction: nara, Score: 0.0992
Prediction: osaka, Score: 0.0968
Prediction: kyoto, Score: 0.0951
Prediction: kobe, Score: 0.0495
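
The scores printed above are probabilities. As far as I understand the fill-mask pipeline, it applies a softmax over the vocabulary logits at the `[MASK]` position and returns the highest-scoring tokens (five by default). The sketch below reproduces that idea in pure Python on a tiny vocabulary. The logit values and the function name `top_k_predictions` are hypothetical and chosen only for illustration.

```python
import math

def top_k_predictions(logits_by_token, k=5):
    """Softmax over the vocabulary logits, then the k most probable tokens."""
    m = max(logits_by_token.values())  # subtract the max for numerical stability
    exps = {tok: math.exp(x - m) for tok, x in logits_by_token.items()}
    total = sum(exps.values())
    probs = {tok: e / total for tok, e in exps.items()}
    return sorted(probs.items(), key=lambda kv: kv[1], reverse=True)[:k]

# Hypothetical logits at the [MASK] position for a 6-token vocabulary.
toy_logits = {"tokyo": 4.0, "osaka": 2.2, "kyoto": 2.1,
              "nara": 1.5, "kobe": 1.4, "paris": -1.0}

for token, score in top_k_predictions(toy_logits, k=5):
    print(f"Prediction: {token}, Score: {score:.4f}")
```

In the run above, both models rank `tokyo` first, but its probability drops from 0.5642 to 0.3358 after fine-tuning. With only 100 rows and one epoch, the comparison illustrates the workflow rather than a measurable improvement.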
