
# BERT Overview

### What you'll learn
- How tokenization works with BERT (wordpiece, padding strategies, and the **attention mask**).
- How to inspect BERT: configuration, architecture, vocabulary, special tokens, and **[unused]** tokens.
- How to use BERT for its original tasks: Masked Language Modelling and Next Sentence Prediction

## 1) Setup

In [1]:
# If transformers/torch aren't installed in your environment, uncomment and run:
# !pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121
# !pip install transformers>=4.41.0 datasets==2.19.0

import os
import math
import random
import json
from dataclasses import dataclass
from typing import List, Dict, Tuple
import torch
from torch import nn
from torch.utils.data import Dataset, DataLoader
import pandas as pd

from transformers import BertTokenizerFast, BertConfig, BertModel, AutoConfig
from transformers.optimization import get_linear_schedule_with_warmup

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print("Device:", device)


  from .autonotebook import tqdm as notebook_tqdm


Device: cpu


## 2) Tokenizer: effects, padding, attention mask

In [2]:
# Load the tokenizer
MODEL_NAME = "bert-base-uncased"  
tokenizer = BertTokenizerFast.from_pretrained(MODEL_NAME, use_fast=True)

In [3]:
# Inspect vocab & special tokens
print("Max length:", tokenizer.model_max_length)
print("Vocab size:", tokenizer.vocab_size)
print("Special tokens:", tokenizer.all_special_tokens)
for token_id in tokenizer.all_special_ids:
    print(tokenizer.decode(token_id), "->", token_id)

Max length: 512
Vocab size: 30522
Special tokens: ['[UNK]', '[SEP]', '[PAD]', '[CLS]', '[MASK]']
[UNK] -> 100
[SEP] -> 102
[PAD] -> 0
[CLS] -> 101
[MASK] -> 103


In [4]:
# Show a few [unused] tokens (BERT ships many)
unused = [t for t in tokenizer.get_vocab().keys() if t.startswith("[unused")]
unused_digits = [int(t.replace("unused", "").replace("[", "").replace("]", "")) for t in unused]
unused_digits = sorted(unused_digits)
unused = [f"[unused{i}]" for i in unused_digits]
print("First 10 [unused] tokens:", unused[:10])
print("Last 10 [unused] tokens:", unused[-10:])

First 10 [unused] tokens: ['[unused0]', '[unused1]', '[unused2]', '[unused3]', '[unused4]', '[unused5]', '[unused6]', '[unused7]', '[unused8]', '[unused9]']
Last 10 [unused] tokens: ['[unused984]', '[unused985]', '[unused986]', '[unused987]', '[unused988]', '[unused989]', '[unused990]', '[unused991]', '[unused992]', '[unused993]']


In [5]:
#sample = "I absolutely loved the cinematography, but the acting was so-so."
sample = "Tokenization and padding are fundamental steps in most NLP pipelines."
enc = tokenizer(sample, return_tensors="pt")

print("Original text:", sample)
tokens = []
for token_id in enc["input_ids"][0].tolist():
    tokens.append(tokenizer.decode(token_id))
print("Tokenized:", tokens)
print("Num of tokens:", len(tokens))

print("\n")
for k,v in enc.items():
    print(k, v.shape, v[0])

tokens = []
for token_id, attention_mask_id in zip(enc["input_ids"][0], enc["attention_mask"][0]):
    tokens.append({
        "token": tokenizer.decode(token_id),
        "token_id": token_id,
        "attention_mask_id": attention_mask_id
    })

print("\n\n")
print(pd.DataFrame.from_dict(tokens).to_markdown())

Original text: Tokenization and padding are fundamental steps in most NLP pipelines.
Tokenized: ['[CLS]', 'token', '##ization', 'and', 'pad', '##ding', 'are', 'fundamental', 'steps', 'in', 'most', 'nl', '##p', 'pipeline', '##s', '.', '[SEP]']
Num of tokens: 17


input_ids torch.Size([1, 17]) tensor([  101, 19204,  3989,  1998, 11687,  4667,  2024,  8050,  4084,  1999,
         2087, 17953,  2361, 13117,  2015,  1012,   102])
token_type_ids torch.Size([1, 17]) tensor([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0])
attention_mask torch.Size([1, 17]) tensor([1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1])



|    | token       |   token_id |   attention_mask_id |
|---:|:------------|-----------:|--------------------:|
|  0 | [CLS]       |        101 |                   1 |
|  1 | token       |      19204 |                   1 |
|  2 | ##ization   |       3989 |                   1 |
|  3 | and         |       1998 |                   1 |
|  4 | pad         |      11687 |         

In [6]:
import pandas as pd

# Padding strategies and attention mask
batch = [
    "This movie was amazing!",
    "Bad.",
    "I would not watch it again, honestly.",
]


try:
    enc_nopad = tokenizer(batch, padding=False, truncation=True, return_tensors="pt")
except Exception as e:
    print("Encoding strings with different lengths without padding gives error!")
    print("Error:", e)

print("\n--- Padding to a fixed `max_length` (e.g., 16) ---")
print("Purpose: Ensures all sequences have the exact same length, often used for fixed-size model inputs.")
enc_pad_left = tokenizer(batch, padding="max_length", max_length=16, truncation=True, return_tensors="pt") # if we don't specify the value for max length, it will take by default the one of the model
print("Input IDs shape:", enc_pad_left["input_ids"].shape)
print("Attention Mask shape:", enc_pad_left["attention_mask"].shape)
print("First sequence input_ids (padded):")
tokens = []
for token_id, attention_mask_id in zip(enc_pad_left["input_ids"][0], enc_pad_left["attention_mask"][0]):
    tokens.append({
        "token": tokenizer.decode(token_id),
        "token_id": token_id,
        "attention_mask_id": attention_mask_id
    })
print(pd.DataFrame.from_dict(tokens).to_markdown())

print("\n\n--- Padding to the `longest` sequence in the current batch ---")
print("Purpose: Minimizes padding by only matching the longest sequence in the batch, saving computation.")
enc_pad_longest = tokenizer(batch, padding=True, truncation=True, return_tensors="pt")
print("Input IDs shape:", enc_pad_longest["input_ids"].shape)
print("Attention Mask shape:", enc_pad_longest["attention_mask"].shape)
tokens = []
for token_id, attention_mask_id in zip(enc_pad_longest["input_ids"][2], enc_pad_longest["attention_mask"][2]):
    tokens.append({
        "token": tokenizer.decode(token_id),
        "token_id": token_id,
        "attention_mask_id": attention_mask_id
    })
print(pd.DataFrame.from_dict(tokens).to_markdown())

Encoding strings with different lengths without padding gives error!
Error: Unable to create tensor, you should probably activate truncation and/or padding with 'padding=True' 'truncation=True' to have batched tensors with the same length. Perhaps your features (`input_ids` in this case) have excessive nesting (inputs type `list` where type `int` is expected).

--- Padding to a fixed `max_length` (e.g., 16) ---
Purpose: Ensures all sequences have the exact same length, often used for fixed-size model inputs.
Input IDs shape: torch.Size([3, 16])
Attention Mask shape: torch.Size([3, 16])
First sequence input_ids (padded):
|    | token   |   token_id |   attention_mask_id |
|---:|:--------|-----------:|--------------------:|
|  0 | [CLS]   |        101 |                   1 |
|  1 | this    |       2023 |                   1 |
|  2 | movie   |       3185 |                   1 |
|  3 | was     |       2001 |                   1 |
|  4 | amazing |       6429 |                   1 |
|  5 | !


### Why attention masks matter

Let's pass two padded sequences through BERT **with** and **without** the attention mask and see how the outputs change.  
Padding tokens should **not** influence the representation of real tokens.


In [7]:
config = BertConfig.from_pretrained(MODEL_NAME, output_hidden_states=True)
bert = BertModel.from_pretrained(MODEL_NAME, config=config).to(device)
bert.eval()

with torch.no_grad():
    padded = tokenizer(["hello world", "hello"], padding=True, return_tensors="pt").to(device)
    #padded = tokenizer(["hello world", "hello"], padding="max_length", max_length=50, return_tensors="pt").to(device)

    # With mask
    out_with = bert(input_ids=padded["input_ids"], attention_mask=padded["attention_mask"])
    # Without mask (pretend padding is real content)
    out_without = bert(input_ids=padded["input_ids"], attention_mask=None)

# Compare [CLS] embeddings difference
cls_with = out_with.last_hidden_state[:,0,:]
cls_without = out_without.last_hidden_state[:,0,:]
diff = (cls_with - cls_without).pow(2).sum(dim=-1).sqrt().cpu().tolist()
print("L2 distance between CLS with/without mask per sequence:", diff)


We strongly recommend passing in an `attention_mask` since your input_ids may be padded. See https://huggingface.co/docs/transformers/troubleshooting#incorrect-output-when-padding-tokens-arent-masked.


L2 distance between CLS with/without mask per sequence: [0.0, 2.567479133605957]


## 3) BERT architecture: modules, shapes, and parameters

In [8]:
total_params = sum(p.numel() for p in bert.parameters())
trainable_params = sum(p.numel() for p in bert.parameters() if p.requires_grad)
print(f"Total params in {MODEL_NAME}: {total_params:,} (trainable: {trainable_params:,})")

print("\nHigh-level modules:")
for name, module in bert.named_children():
    print(" -", name, ":", module.__class__.__name__)

print("\nEncoder layer stack depth:", bert.config.num_hidden_layers)
print("Hidden size:", bert.config.hidden_size)
print("Intermediate FF size:", bert.config.intermediate_size)
print("Attention heads:", bert.config.num_attention_heads)


Total params in bert-base-uncased: 109,482,240 (trainable: 109,482,240)

High-level modules:
 - embeddings : BertEmbeddings
 - encoder : BertEncoder
 - pooler : BertPooler

Encoder layer stack depth: 12
Hidden size: 768
Intermediate FF size: 3072
Attention heads: 12


## 4) BERT's original pre-training tasks

BERT was originally pre-trained on two tasks:
1. **Masked Language Modeling (MLM)**: Predict masked tokens in a sentence.
2. **Next Sentence Prediction (NSP)**: Determine if sentence B follows sentence A.

Let's demonstrate both tasks using a pre-trained BERT model.

In [9]:
# Masked Language Modeling (MLM) Demo
# We need BertForMaskedLM for this task
from transformers import AutoTokenizer, BertForMaskedLM
import torch

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = BertForMaskedLM.from_pretrained("bert-base-uncased")

print(tokenizer.mask_token, "->", tokenizer.mask_token_id)
inputs = tokenizer("The capital of France is [MASK].", return_tensors="pt")

with torch.no_grad():
    # Get model predictions (logits) for all tokens in the vocabulary
    # logits shape: [batch_size=1, sequence_length, vocab_size=30522]
    logits = model(**inputs).logits

# Step 1: Find the position of [MASK] in the input sequence
mask_token_index = (inputs.input_ids == tokenizer.mask_token_id)[0].nonzero(as_tuple=True)[0]

# Step 2: Get the predicted token for the [MASK] position
# logits[0, mask_token_index] extracts predictions at the mask position (shape: [vocab_size])
# .argmax(axis=-1) finds the token ID with the highest score (most likely prediction)
predicted_token_id = logits[0, mask_token_index].argmax(axis=-1)

print("Predicted token id:", predicted_token_id.item(), "->", tokenizer.decode(predicted_token_id))  

# Step 3: Prepare labels for loss calculation
# Tokenize the correct sentence to get ground truth token IDs
labels = tokenizer("The capital of France is Paris.", return_tensors="pt")["input_ids"]

# Set all non-[MASK] positions to -100 (tells PyTorch to ignore them in loss calculation)
# Only the token at the [MASK] position will be used to compute the loss
# Result: [-100, -100, -100, -100, -100, -100, 3000, -100] where 3000 is "paris"
print("Tokens:", tokenizer.convert_ids_to_tokens(labels[0]))
print("Labels before masking non-[MASK]:", labels)
labels = torch.where(inputs.input_ids == tokenizer.mask_token_id, labels, -100)
print("Labels after masking non-[MASK]:", labels)

# Step 4: Calculate loss (how wrong was the prediction?)
# The model compares its prediction at [MASK] position with the correct token "Paris"
# Lower loss = better prediction (0 would mean perfect prediction)
outputs = model(**inputs, labels=labels)
print(f"Loss: {round(outputs.loss.item(), 2)}")  

Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertForMaskedLM: ['bert.pooler.dense.bias', 'bert.pooler.dense.weight', 'cls.seq_relationship.bias', 'cls.seq_relationship.weight']
- This IS expected if you are initializing BertForMaskedLM from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForMaskedLM from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


[MASK] -> 103
Predicted token id: 3000 -> paris
Tokens: ['[CLS]', 'the', 'capital', 'of', 'france', 'is', 'paris', '.', '[SEP]']
Labels before masking non-[MASK]: tensor([[ 101, 1996, 3007, 1997, 2605, 2003, 3000, 1012,  102]])
Labels after masking non-[MASK]: tensor([[-100, -100, -100, -100, -100, -100, 3000, -100, -100]])
Loss: 0.88


### Understanding the Loss Value

Loss measures **confidence**, not just correctness!

- **Loss = -log(probability)** for the correct token
- Loss of 0.88 means BERT gave "paris" a probability of ~41% (exp(-0.88) ≈ 0.41)
- Even though "paris" was the top prediction, BERT wasn't 100% confident
- BERT must choose from **30,522 possible tokens** in the vocabulary!

**Loss interpretation:**
- Loss = 0.0 → 100% confident (probability = 1.0) - perfect!
- Loss = 0.69 → 50% confident (probability = 0.5)
- Loss = 1.0 → 37% confident (probability = 0.37)
- Loss = 2.0 → 14% confident (probability = 0.14)

Let's see the actual probabilities for the top predictions:

In [10]:
# Let's look at the actual probabilities for top-10 predictions
import torch.nn.functional as F

# Get probabilities (not just logits)
probs_at_mask = F.softmax(logits[0, mask_token_index], dim=-1)

# Get top 10 predictions with their probabilities
top_k = 10
top_probs, top_indices = torch.topk(probs_at_mask, top_k, dim=-1)

print(f"Top {top_k} predictions for 'The capital of France is [MASK].':\n")
print(f"{'Rank':<6} {'Token':<15} {'Probability':<15} {'Loss if correct':<20}")
print("-" * 60)

for rank, (prob, idx) in enumerate(zip(top_probs[0], top_indices[0]), 1):
    token = tokenizer.decode([idx])
    prob_value = prob.item()
    loss_value = -torch.log(prob).item()
    
    # Highlight the actual prediction (paris)
    marker = " ← PREDICTED" if rank == 1 else ""
    print(f"{rank:<6} {token:<15} {prob_value:<15.4f} {loss_value:<20.4f}{marker}")

print("\n" + "="*60)
print(f"Notice: Even though 'paris' is #1, it has ~{top_probs[0][0].item()*100:.1f}% probability")
print(f"The model spreads probability across other plausible tokens.")
print(f"This is why the loss is {outputs.loss.item():.2f}, not 0.0!")

Top 10 predictions for 'The capital of France is [MASK].':

Rank   Token           Probability     Loss if correct     
------------------------------------------------------------
1      paris           0.4168          0.8752               ← PREDICTED
2      lille           0.0714          2.6392              
3      lyon            0.0634          2.7584              
4      marseille       0.0444          3.1134              
5      tours           0.0303          3.4967              
6      toulouse        0.0288          3.5489              
7      orleans         0.0254          3.6717              
8      nantes          0.0228          3.7809              
9      brest           0.0226          3.7915              
10     bordeaux        0.0212          3.8526              

Notice: Even though 'paris' is #1, it has ~41.7% probability
The model spreads probability across other plausible tokens.
This is why the loss is 0.88, not 0.0!


In [11]:
# Next Sentence Prediction (NSP) Demo
# We need BertForNextSentencePrediction for this task
from transformers import BertForNextSentencePrediction

nsp_model = BertForNextSentencePrediction.from_pretrained(MODEL_NAME).to(device)
nsp_model.eval()

# Test sentences
starting_sentence = "The weather is beautiful today."
true_next = "I think I'll go for a walk in the park."
false_next = "Deep Learning is a branch of Machine Learning."

sentence_pairs = [
    (starting_sentence, true_next, "True continuation"),
    (starting_sentence, false_next, "False continuation"),
]

print("Next Sentence Prediction Demo\n")
print(f"Starting sentence: '{starting_sentence}'\n")

for sent_a, sent_b, description in sentence_pairs:
    # Tokenize the sentence pair
    # BERT uses [CLS] sent_a [SEP] sent_b [SEP] format
    inputs = tokenizer(sent_a, sent_b, return_tensors="pt").to(device)
    
    # Get predictions
    with torch.no_grad():
        outputs = nsp_model(**inputs)
        logits = outputs.logits
    
    # logits shape: [batch_size, 2]
    # logits[:, 0] = score for "is next sentence"
    # logits[:, 1] = score for "is NOT next sentence"
    probs = F.softmax(logits, dim=-1)
    is_next_prob = probs[0, 0].item()
    not_next_prob = probs[0, 1].item()
    
    prediction = "IS the next sentence" if is_next_prob > not_next_prob else "IS NOT the next sentence"
    
    print(f"--- {description} ---")
    print(f"Candidate sentence: '{sent_b}'")
    print(f"Prediction: {prediction}")
    print(f"  P(is next)     = {is_next_prob:.4f}")
    print(f"  P(is NOT next) = {not_next_prob:.4f}")
    print()


Next Sentence Prediction Demo

Starting sentence: 'The weather is beautiful today.'

--- True continuation ---
Candidate sentence: 'I think I'll go for a walk in the park.'
Prediction: IS the next sentence
  P(is next)     = 0.9999
  P(is NOT next) = 0.0001

--- False continuation ---
Candidate sentence: 'Deep Learning is a branch of Machine Learning.'
Prediction: IS NOT the next sentence
  P(is next)     = 0.0000
  P(is NOT next) = 1.0000



## 5) References
- Many other BertFor** are supported by transformers. You can check the full list at https://huggingface.co/docs/transformers/model_doc/bert
- Check out also the "Auto" classes, which allows for more flexible pipelines integrating BERT and non-BERT models without changing the code: https://huggingface.co/docs/transformers/model_doc/auto

## 6) Exercises 


### Exercise 1: Tokenization Analysis

**Task:** Tokenize the following sentence and analyze its components:
```
"The pre-trained BERT model uses WordPiece tokenization with 30,522 tokens."
```

**Requirements:**
1. Tokenize the sentence and print the tokens
2. Calculate the total number of tokens (including special tokens)
3. Identify which words got split into subwords
4. Create a DataFrame showing: token, token_id, and whether it's a subword (starts with ##)

**Bonus:** Try with a sentence containing a rare word like "antidisestablishmentarianism" and observe the tokenization.

In [12]:
# Solution to Exercise 1
sentence = "The pre-trained BERT model uses WordPiece tokenization with 30,522 tokens."
#sentence = "Antidisestablishmentarianism is a political position that opposes the withdrawal of state support from an established church."

# 1. Tokenize and print tokens
enc = tokenizer(sentence, return_tensors="pt")
tokens = [tokenizer.decode([token_id]) for token_id in enc["input_ids"][0]]
print("Tokens:", tokens)

# 2. Calculate total number of tokens
print(f"\nTotal number of tokens: {len(tokens)}")

# 3. Identify split words
print("\nWords split into subwords:")
for i, token in enumerate(tokens):
    if token.startswith("##"):
        print(f"  '{tokens[i-1]}{token[2:]}' was split into '{tokens[i-1]}' + '{token}'")

# 4. Create DataFrame
import pandas as pd
token_data = []
for token_id in enc["input_ids"][0]:
    token = tokenizer.decode([token_id])
    is_subword = token.startswith("##")
    token_data.append({
        "token": token,
        "token_id": token_id.item(),
        "is_subword": is_subword
    })

df = pd.DataFrame(token_data)
print("\n" + df.to_markdown(index=False))

# Bonus: Rare word tokenization
print("\n--- BONUS: Rare word ---")
rare_sentence = "Antidisestablishmentarianism is a very long word."
enc_rare = tokenizer(rare_sentence, return_tensors="pt")
rare_tokens = [tokenizer.decode([token_id]) for token_id in enc_rare["input_ids"][0]]
print("Tokens:", rare_tokens)
print(f"The word 'antidisestablishmentarianism' was split into {len([t for t in rare_tokens if 'anti' in t or '##' in t])} subword tokens!")

Tokens: ['[CLS]', 'the', 'pre', '-', 'trained', 'bert', 'model', 'uses', 'word', '##piece', 'token', '##ization', 'with', '30', ',', '52', '##2', 'token', '##s', '.', '[SEP]']

Total number of tokens: 21

Words split into subwords:
  'wordpiece' was split into 'word' + '##piece'
  'tokenization' was split into 'token' + '##ization'
  '522' was split into '52' + '##2'
  'tokens' was split into 'token' + '##s'

| token     |   token_id | is_subword   |
|:----------|-----------:|:-------------|
| [CLS]     |        101 | False        |
| the       |       1996 | False        |
| pre       |       3653 | False        |
| -         |       1011 | False        |
| trained   |       4738 | False        |
| bert      |      14324 | False        |
| model     |       2944 | False        |
| uses      |       3594 | False        |
| word      |       2773 | False        |
| ##piece   |      11198 | True         |
| token     |      19204 | False        |
| ##ization |       3989 | True         |

### Exercise 2: Padding Strategy Comparison

**Task:** Compare different padding strategies.

Given the following batch of sentences:
```python
sentences = [
    "Great!",
    "This is a medium length sentence about machine learning.",
    "Short one.",
    "This is the longest sentence in our batch and it talks about natural language processing and transformers."
]
```

**Requirements:**
1. Tokenize with `padding="longest"` and report the shape
2. Tokenize with `padding="max_length"` (max_length=64) and report the shape
3. Calculate the percentage of padding tokens for each strategy
4. Discuss: Which strategy would be better for training? Why?

In [13]:
# Solution to Exercise 2
import matplotlib.pyplot as plt
import numpy as np

sentences = [
    "Great!",
    "This is a medium length sentence about machine learning.",
    "Short one.",
    "This is the longest sentence in our batch and it talks about natural language processing and transformers."
]

# 1. Padding to longest
enc_longest = tokenizer(sentences, padding="longest", truncation=True, return_tensors="pt")
print("--- Padding='longest' ---")
print(f"Shape: {enc_longest['input_ids'].shape}")

# 2. Padding to max_length
enc_maxlen = tokenizer(sentences, padding="max_length", max_length=64, truncation=True, return_tensors="pt")
print("\n--- Padding='max_length' (64) ---")
print(f"Shape: {enc_maxlen['input_ids'].shape}")

# 3. Calculate percentage of padding
def calculate_padding_percentage(attention_mask):
    total_tokens = attention_mask.numel()
    padding_tokens = (attention_mask == 0).sum().item()
    return (padding_tokens / total_tokens) * 100

padding_pct_longest = calculate_padding_percentage(enc_longest['attention_mask'])
padding_pct_maxlen = calculate_padding_percentage(enc_maxlen['attention_mask'])

print(f"\nPadding percentage (longest): {padding_pct_longest:.2f}%")
print(f"Padding percentage (max_length=64): {padding_pct_maxlen:.2f}%")
print(f"Efficiency gain: {padding_pct_maxlen - padding_pct_longest:.2f}% less padding with 'longest'")

# 4. Discussion
print("\n--- Discussion ---")
print("For training, 'longest' is generally better because:")
print("  • Reduces wasted computation on padding tokens")
print("  • Each batch is optimally sized for its contents")
print("  • Faster training with dynamic batching")
print("\nHowever, 'max_length' might be preferred when:")
print("  • Hardware requires fixed input sizes")
print("  • Comparing models fairly across different implementations")
print("  • Debugging (consistent shapes are easier to trace)")

--- Padding='longest' ---
Shape: torch.Size([4, 20])

--- Padding='max_length' (64) ---
Shape: torch.Size([4, 64])

Padding percentage (longest): 48.75%
Padding percentage (max_length=64): 83.98%
Efficiency gain: 35.23% less padding with 'longest'

--- Discussion ---
For training, 'longest' is generally better because:
  • Reduces wasted computation on padding tokens
  • Each batch is optimally sized for its contents
  • Faster training with dynamic batching

However, 'max_length' might be preferred when:
  • Hardware requires fixed input sizes
  • Comparing models fairly across different implementations
  • Debugging (consistent shapes are easier to trace)


### Exercise 3: Exploring BERT's Architecture

**Task:** Investigate BERT's parameter distribution across different components.

**Requirements:**
1. Calculate the number of parameters in the embedding layer (word embeddings, position embeddings, token type embeddings)
2. Calculate the number of parameters in a single encoder layer
3. Calculate the number of parameters in the pooler layer
4. Verify your calculations sum to the total number of parameters

**Hint:** Use `named_parameters()` to iterate through all parameters and their shapes.

In [14]:
# Solution to Exercise 3
import matplotlib.pyplot as plt

# Helper function to count parameters
def count_params(module):
    return sum(p.numel() for p in module.parameters())

# 1. Embedding layer parameters
embedding_params = count_params(bert.embeddings)
print(f"Embedding layer parameters: {embedding_params:,}")
print(f"  - Word embeddings: {bert.embeddings.word_embeddings.weight.numel():,}")
print(f"  - Position embeddings: {bert.embeddings.position_embeddings.weight.numel():,}")
print(f"  - Token type embeddings: {bert.embeddings.token_type_embeddings.weight.numel():,}")
print(f"  - LayerNorm + others: {embedding_params - bert.embeddings.word_embeddings.weight.numel() - bert.embeddings.position_embeddings.weight.numel() - bert.embeddings.token_type_embeddings.weight.numel():,}")

# 2. Single encoder layer parameters
single_encoder_params = count_params(bert.encoder.layer[0])
print(f"\nSingle encoder layer parameters: {single_encoder_params:,}")

# Break down a single layer
layer0 = bert.encoder.layer[0]
attn_params = count_params(layer0.attention)
intermediate_params = count_params(layer0.intermediate)
output_params = count_params(layer0.output)
print(f"  - Self-attention: {attn_params:,}")
print(f"  - Intermediate (FFN): {intermediate_params:,}")
print(f"  - Output projection: {output_params:,}")

# 3. Pooler layer parameters
pooler_params = count_params(bert.pooler)
print(f"\nPooler layer parameters: {pooler_params:,}")

# All encoder layers
all_encoder_params = count_params(bert.encoder)
print(f"\nAll encoder layers (12 layers) parameters: {all_encoder_params:,}")

# 4. Verify calculations
param_distribution = {
    'Embeddings': embedding_params,
    'Encoder Layers (12x)': all_encoder_params,
    'Pooler': pooler_params
}

calculated_total = sum(param_distribution.values())
actual_total = count_params(bert)
print(f"\n--- Verification ---")
print(f"Sum of calculated parameters: {calculated_total:,}")
print(f"Actual total parameters: {actual_total:,}")
print(f"Match: {calculated_total == actual_total}" if calculated_total == actual_total else f"Difference: {abs(calculated_total - actual_total):,}")

# Additional insight: parameters per encoder layer
print(f"\n--- Additional Insights ---")
print(f"Encoder layers contain {(all_encoder_params/actual_total)*100:.1f}% of all parameters")
print(f"Each encoder layer has {single_encoder_params:,} parameters")
print(f"Average parameters per layer (including embeddings): {actual_total // (bert.config.num_hidden_layers + 2):,}")

Embedding layer parameters: 23,837,184
  - Word embeddings: 23,440,896
  - Position embeddings: 393,216
  - Token type embeddings: 1,536
  - LayerNorm + others: 1,536

Single encoder layer parameters: 7,087,872
  - Self-attention: 2,363,904
  - Intermediate (FFN): 2,362,368
  - Output projection: 2,361,600

Pooler layer parameters: 590,592

All encoder layers (12 layers) parameters: 85,054,464

--- Verification ---
Sum of calculated parameters: 109,482,240
Actual total parameters: 109,482,240
Match: True

--- Additional Insights ---
Encoder layers contain 77.7% of all parameters
Each encoder layer has 7,087,872 parameters
Average parameters per layer (including embeddings): 7,820,160


### Exercise 4: Masked Language Modeling with Multiple Masks

**Task:** Implement a function that predicts multiple masked tokens in a sentence and evaluates the predictions.

**Requirements:**
1. Create a function `predict_masked_tokens(sentence, mask_positions)` that:
   - Takes a sentence and a list of word positions to mask (0-indexed, excluding special tokens)
   - Replaces those positions with [MASK]
   - Returns the top-3 predictions for each masked position with their probabilities
2. Test your function with: `"The quick brown fox jumps over the lazy dog"` 
   - Mask positions: 1 ("quick"), 4 ("jumps"), and 7 ("lazy")
3. Calculate the average confidence (probability) of the top prediction across all masks
4. Discuss: Why might BERT struggle with certain masked words more than others?

In [15]:
# Solution to Exercise 4
import torch
import torch.nn.functional as F
import matplotlib.pyplot as plt
from transformers import BertForMaskedLM

mlm_model = BertForMaskedLM.from_pretrained("bert-base-uncased")
mlm_model.eval()

# 1. Create prediction function
def predict_masked_tokens(sentence, mask_positions, top_k=3):
    """
    Predict masked tokens in a sentence.
    
    Args:
        sentence: Input sentence (string)
        mask_positions: List of word positions to mask (0-indexed, excluding [CLS])
        top_k: Number of top predictions to return
    
    Returns:
        Dictionary with predictions for each masked position
    """
    # Tokenize to get word positions
    words = sentence.split()
    
    # Replace specified positions with [MASK]
    masked_words = words.copy()
    original_words = {}
    for pos in mask_positions:
        original_words[pos] = masked_words[pos]
        masked_words[pos] = "[MASK]"
    
    masked_sentence = " ".join(masked_words)
    print(f"Original: {sentence}")
    print(f"Masked:   {masked_sentence}\n")
    
    # Tokenize (note: some words might be split into subwords)
    inputs = tokenizer(masked_sentence, return_tensors="pt")
    
    # Get predictions
    with torch.no_grad():
        outputs = mlm_model(**inputs)
        logits = outputs.logits
    
    # Find [MASK] positions in tokenized input
    mask_token_indices = (inputs.input_ids == tokenizer.mask_token_id)[0].nonzero(as_tuple=True)[0]
    
    # Get predictions for each mask
    results = {}
    for i, mask_idx in enumerate(mask_token_indices):
        # Get probabilities for this mask position
        mask_logits = logits[0, mask_idx]
        probs = F.softmax(mask_logits, dim=-1)
        
        # Get top-k predictions
        top_probs, top_indices = torch.topk(probs, top_k)
        
        predictions = []
        for prob, idx in zip(top_probs, top_indices):
            token = tokenizer.decode([idx]).strip()
            predictions.append({
                'token': token,
                'probability': prob.item()
            })
        
        word_pos = mask_positions[i]
        results[word_pos] = {
            'original': original_words[word_pos],
            'predictions': predictions
        }
    
    return results

# 2. Test the function
sentence = "The quick brown fox jumps over the lazy dog"
mask_positions = [1, 4, 7]  # "quick", "jumps", "lazy"

top_k = 5
predictions = predict_masked_tokens(sentence, mask_positions, top_k=top_k)

# Display results
print("=" * 70)
print("PREDICTIONS")
print("=" * 70)
for pos, data in predictions.items():
    print(f"\nPosition {pos} - Original word: '{data['original']}'")
    print("-" * 50)
    for i, pred in enumerate(data['predictions'], 1):
        marker = " ✓" if pred['token'] == data['original'].lower() else ""
        print(f"  {i}. {pred['token']:<15} (confidence: {pred['probability']*100:5.2f}%){marker}")

# 3. Calculate average confidence
top_confidences = [data['predictions'][0]['probability'] for data in predictions.values()]
avg_confidence = sum(top_confidences) / len(top_confidences)
print(f"\n{'=' * 70}")
print(f"Average confidence of top predictions: {avg_confidence*100:.2f}%")

# 4. Discussion
print(f"\n{'=' * 70}")
print("DISCUSSION: Why BERT struggles with certain words")
print("=" * 70)
print("""
BERT's prediction confidence varies based on:

1. **Context informativeness**: 
   - "lazy" before "dog" is more predictable (common phrase)
   - "quick" could be many adjectives (fast, small, big, etc.)

2. **Word frequency in training data**:
   - Common words are easier to predict
   - Rare words have fewer training examples

3. **Part of speech**:
   - Content words (nouns, verbs) often harder than function words
   - Adjectives can be very context-dependent

4. **Semantic ambiguity**:
   - Multiple valid options reduce confidence
   - "jumps" could be runs, leaps, moves, etc.
""")

Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertForMaskedLM: ['bert.pooler.dense.bias', 'bert.pooler.dense.weight', 'cls.seq_relationship.bias', 'cls.seq_relationship.weight']
- This IS expected if you are initializing BertForMaskedLM from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForMaskedLM from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


Original: The quick brown fox jumps over the lazy dog
Masked:   The [MASK] brown fox [MASK] over the [MASK] dog

PREDICTIONS

Position 1 - Original word: 'quick'
--------------------------------------------------
  1. little          (confidence: 23.19%)
  2. great           (confidence:  7.49%)
  3. big             (confidence:  5.30%)
  4. new             (confidence:  2.98%)
  5. large           (confidence:  2.06%)

Position 4 - Original word: 'jumps'
--------------------------------------------------
  1. takes           (confidence: 74.10%)
  2. took            (confidence:  9.22%)
  3. wins            (confidence:  7.18%)
  4. taking          (confidence:  1.04%)
  5. take            (confidence:  1.00%)

Position 7 - Original word: 'lazy'
--------------------------------------------------
  1. white           (confidence:  5.71%)
  2. big             (confidence:  5.31%)
  3. black           (confidence:  4.23%)
  4. red             (confidence:  4.01%)
  5. little          (co

### Exercise 5: Next Sentence Prediction Evaluation

**Task:** Create a small dataset to evaluate BERT's Next Sentence Prediction capabilities and analyze its performance.

**Requirements:**
1. Create 10 sentence pairs: 5 that are actual continuations and 5 that are not
   - Use diverse topics (news, stories, technical content, etc.)
2. For each pair, get BERT's NSP prediction and confidence scores
3. Calculate accuracy, precision, and recall for the predictions
4. Identify which pair had the highest confidence (regardless of correctness)
5. Create a confusion matrix visualization

**Hint:** Use `BertForNextSentencePrediction` and apply softmax to get probabilities.

In [16]:
# Solution to Exercise 6
import torch
import torch.nn.functional as F
from transformers import BertForNextSentencePrediction
from sklearn.metrics import accuracy_score, precision_score, recall_score

# Load NSP model
nsp_model = BertForNextSentencePrediction.from_pretrained("bert-base-uncased").to(device)
nsp_model.eval()

# 1. Create dataset of sentence pairs
sentence_pairs = [
    # True continuations (label = 1 for "is next")
    {
        'sent_a': "The weather forecast predicts heavy rain tomorrow.",
        'sent_b': "Don't forget to bring an umbrella when you go out.",
        'is_next': True,
        'topic': 'Weather'
    },
    {
        'sent_a': "She studied hard for her final exams all week.",
        'sent_b': "Her efforts paid off when she received excellent grades.",
        'is_next': True,
        'topic': 'Education'
    },
    {
        'sent_a': "The company announced record profits in Q4.",
        'sent_b': "Shareholders were pleased with the financial results.",
        'is_next': True,
        'topic': 'Business'
    },
    {
        'sent_a': "Neural networks learn by adjusting their weights.",
        'sent_b': "This process is called backpropagation and uses gradient descent.",
        'is_next': True,
        'topic': 'Technical'
    },
    {
        'sent_a': "The movie had stunning visual effects and great acting.",
        'sent_b': "It went on to win several Academy Awards that year.",
        'is_next': True,
        'topic': 'Entertainment'
    },
    
    # False continuations (label = 0 for "is NOT next")
    {
        'sent_a': "Quantum computing uses qubits instead of classical bits.",
        'sent_b': "Pizza is one of the most popular foods in Italy.",
        'is_next': False,
        'topic': 'Random'
    },
    {
        'sent_a': "The patient showed symptoms of seasonal allergies.",
        'sent_b': "The stock market reached an all-time high today.",
        'is_next': False,
        'topic': 'Random'
    },
    {
        'sent_a': "Python is a versatile programming language.",
        'sent_b': "Pythons are the largest snakes in the world.",
        'is_next': False,
        'topic': 'Random'
    },
    {
        'sent_a': "Climate change affects global weather patterns.",
        'sent_b': "The restaurant serves authentic Japanese cuisine.",
        'is_next': False,
        'topic': 'Random'
    },
    {
        'sent_a': "The team celebrated their championship victory.",
        'sent_b': "Photosynthesis converts light energy into chemical energy.",
        'is_next': False,
        'topic': 'Random'
    }
]

In [17]:
# 2. Get predictions for each pair
print("=" * 90)
print("NEXT SENTENCE PREDICTION RESULTS")
print("=" * 90)

results = []
for i, pair in enumerate(sentence_pairs):
    # Tokenize the pair
    inputs = tokenizer(pair['sent_a'], pair['sent_b'], return_tensors="pt").to(device)
    
    # Get prediction
    with torch.no_grad():
        outputs = nsp_model(**inputs)
        logits = outputs.logits
    
    # Get probabilities
    probs = F.softmax(logits, dim=-1)
    is_next_prob = probs[0, 0].item()  # Probability that sent_b IS next
    not_next_prob = probs[0, 1].item()  # Probability that sent_b is NOT next
    
    # Make prediction
    predicted_is_next = is_next_prob > not_next_prob
    correct = predicted_is_next == pair['is_next']
    confidence = max(is_next_prob, not_next_prob)
    
    results.append({
        'pair_id': i + 1,
        'sent_a': pair['sent_a'],
        'sent_b': pair['sent_b'],
        'true_label': pair['is_next'],
        'predicted_label': predicted_is_next,
        'is_next_prob': is_next_prob,
        'not_next_prob': not_next_prob,
        'confidence': confidence,
        'correct': correct,
        'topic': pair['topic']
    })
    
    # Display
    status = "✓ CORRECT" if correct else "✗ WRONG"
    print(f"\nPair {i+1} ({pair['topic']}) - {status}")
    print(f"  A: {pair['sent_a'][:65]}...")
    print(f"  B: {pair['sent_b'][:65]}...")
    print(f"  True: {'IS next' if pair['is_next'] else 'NOT next'} | "
          f"Predicted: {'IS next' if predicted_is_next else 'NOT next'}")
    print(f"  Confidence: {confidence*100:.2f}% | P(is_next)={is_next_prob:.3f}, P(not_next)={not_next_prob:.3f}")

# 3. Calculate metrics
y_true = [1 if r['true_label'] else 0 for r in results]
y_pred = [1 if r['predicted_label'] else 0 for r in results]

accuracy = accuracy_score(y_true, y_pred)
precision = precision_score(y_true, y_pred, zero_division=0)
recall = recall_score(y_true, y_pred, zero_division=0)
f1 = 2 * (precision * recall) / (precision + recall) if (precision + recall) > 0 else 0

print("\n" + "=" * 90)
print("PERFORMANCE METRICS")
print("=" * 90)
print(f"Accuracy:  {accuracy*100:.2f}% ({sum([r['correct'] for r in results])}/{len(results)} correct)")
print(f"Precision: {precision*100:.2f}% (of predicted 'IS next', how many were correct)")
print(f"Recall:    {recall*100:.2f}% (of true 'IS next', how many were found)")
print(f"F1-Score:  {f1*100:.2f}%")

# 4. Identify highest confidence pair
highest_conf_result = max(results, key=lambda x: x['confidence'])
print("\n" + "=" * 90)
print("HIGHEST CONFIDENCE PREDICTION")
print("=" * 90)
print(f"Pair {highest_conf_result['pair_id']} with {highest_conf_result['confidence']*100:.2f}% confidence")
print(f"  Prediction: {'IS next' if highest_conf_result['predicted_label'] else 'NOT next'}")
print(f"  Actually: {'IS next' if highest_conf_result['true_label'] else 'NOT next'}")
print(f"  Status: {'Correct ✓' if highest_conf_result['correct'] else 'Wrong ✗'}")

# Confidence distribution
true_positives = [r for r in results if r['true_label'] and r['predicted_label']]
true_negatives = [r for r in results if not r['true_label'] and not r['predicted_label']]
false_positives = [r for r in results if not r['true_label'] and r['predicted_label']]
false_negatives = [r for r in results if r['true_label'] and not r['predicted_label']]

NEXT SENTENCE PREDICTION RESULTS

Pair 1 (Weather) - ✓ CORRECT
  A: The weather forecast predicts heavy rain tomorrow....
  B: Don't forget to bring an umbrella when you go out....
  True: IS next | Predicted: IS next
  Confidence: 100.00% | P(is_next)=1.000, P(not_next)=0.000

Pair 2 (Education) - ✓ CORRECT
  A: She studied hard for her final exams all week....
  B: Her efforts paid off when she received excellent grades....
  True: IS next | Predicted: IS next
  Confidence: 100.00% | P(is_next)=1.000, P(not_next)=0.000

Pair 3 (Business) - ✓ CORRECT
  A: The company announced record profits in Q4....
  B: Shareholders were pleased with the financial results....
  True: IS next | Predicted: IS next
  Confidence: 100.00% | P(is_next)=1.000, P(not_next)=0.000

Pair 4 (Technical) - ✓ CORRECT
  A: Neural networks learn by adjusting their weights....
  B: This process is called backpropagation and uses gradient descent....
  True: IS next | Predicted: IS next
  Confidence: 100.00% | P(is_n