# Module 07: BERT and Masked Language Modeling

**Difficulty**: ⭐⭐⭐ Advanced  
**Estimated Time**: 120 minutes  
**Prerequisites**: [Module 06: Transformer Architecture](06_transformer_architecture.ipynb)

## Learning Objectives

1. Understand BERT's bidirectional pre-training approach
2. Implement masked language modeling (MLM)
3. Understand next sentence prediction (NSP)
4. Use pre-trained BERT from Hugging Face
5. Fine-tune BERT for downstream tasks
6. Compare BERT with other pre-training approaches

## BERT: Bidirectional Encoder Representations from Transformers

**Key Innovation**: Pre-train bidirectional representations by masking random tokens.

### Why BERT Matters:

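- **Before BERT**: Models were either left-to-right (GPT) or only shallowly bidirectional (ELMo concatenates independently trained left-to-right and right-to-left LSTMs)
- **BERT**: Every layer conditions jointly on left and right context
- **Result**: State-of-the-art results on 11 NLP tasks when the paper was published (2018)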

### Architecture:

- **Encoder-only** Transformer (no decoder)
- **BERT-Base**: 12 layers, hidden size 768, 12 attention heads, about 110M parameters
- **BERT-Large**: 24 layers, hidden size 1024, 16 attention heads, about 340M parameters
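
The parameter totals above follow from the layer sizes. The sketch below is a rough count in plain Python, assuming the standard BERT layout (token, position, and segment embeddings; per-layer attention and feed-forward blocks with LayerNorm; a pooler on top). The vocabulary size of 30,522, the 512 positions, and the 4x feed-forward width are assumptions taken from the commonly used uncased checkpoints, not from this module, and the helper name `bert_param_count` is made up for illustration. The number of heads does not enter the count, since heads only split the hidden dimension.

```python
def bert_param_count(vocab_size=30522, hidden=768, layers=12,
                     intermediate=3072, max_positions=512, type_vocab=2):
    """Count parameters of an encoder-only BERT (embeddings + layers + pooler)."""
    layer_norm = 2 * hidden  # one scale and one bias vector

    # Token, position, and segment embeddings, followed by one LayerNorm.
    embeddings = (vocab_size + max_positions + type_vocab) * hidden + layer_norm

    # Self-attention: query, key, value, and output projections (weights + biases).
    attention = 4 * (hidden * hidden + hidden) + layer_norm

    # Feed-forward: hidden -> intermediate -> hidden (weights + biases).
    feed_forward = ((hidden * intermediate + intermediate)
                    + (intermediate * hidden + hidden)
                    + layer_norm)

    # Pooler: one dense layer applied to the [CLS] vector.
    pooler = hidden * hidden + hidden

    return embeddings + layers * (attention + feed_forward) + pooler


base = bert_param_count()
large = bert_param_count(hidden=1024, layers=24, intermediate=4096)
print(f"BERT-Base:  {base:,}")   # 109,482,240
print(f"BERT-Large: {large:,}")  # 335,141,888
```

Both totals land close to the rounded figures quoted above; the exact number depends on which task heads are counted.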

## Setup and Imports

In [None]:
import numpy as np
import torch
import torch.nn as nn
import torch.nn.functional as F
from transformers import BertTokenizer, BertModel, BertForMaskedLM, BertConfig
from transformers import AutoTokenizer, AutoModel
import matplotlib.pyplot as plt
import warnings
warnings.filterwarnings('ignore')

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print(f'Using device: {device}')
print('✓ Libraries imported!')

## 1. Masked Language Modeling (MLM)

**Training objective**: Predict randomly masked tokens.

**Example**:
- Input: "The [MASK] sat on the [MASK]"
- Target: "cat", "mat"

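**Masking strategy**: 15% of token positions are selected for prediction. Each selected token is then:
- 80% of the time: replaced with [MASK]
- 10% of the time: replaced with a random token
- 10% of the time: left unchanged

**Why?** Predicting a hidden token from both its left and right context forces bidirectional understanding. The random and unchanged cases exist because [MASK] never appears during fine-tuning; mixing them in reduces the mismatch between pre-training and fine-tuning inputs.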
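
The 80/10/10 rule above can be sketched in a few lines of plain Python. This is a simplified illustration, not the original pre-training code: it selects each position independently with probability `mask_prob` rather than picking an exact 15% of tokens, it does not protect special tokens, and the function name `mask_tokens` and the toy vocabulary are made up for the example.

```python
import random

MASK = "[MASK]"


def mask_tokens(tokens, vocab, mask_prob=0.15, seed=None):
    """Apply BERT-style 80/10/10 corruption to a list of tokens.

    Returns (corrupted, labels). labels[i] holds the original token at
    positions selected for prediction and None elsewhere, meaning no
    loss would be computed at those positions.
    """
    rng = random.Random(seed)
    corrupted = list(tokens)
    labels = [None] * len(tokens)
    for i, token in enumerate(tokens):
        if rng.random() >= mask_prob:
            continue  # position not selected for prediction
        labels[i] = token
        roll = rng.random()
        if roll < 0.8:
            corrupted[i] = MASK               # 80%: replace with [MASK]
        elif roll < 0.9:
            corrupted[i] = rng.choice(vocab)  # 10%: replace with a random token
        # remaining 10%: keep the original token in place
    return corrupted, labels


tokens = "the cat sat on the mat".split()
toy_vocab = ["dog", "ran", "under", "a", "rug"]  # hypothetical toy vocabulary
# mask_prob is raised to 0.5 here only so a six-token example shows some masking.
corrupted, labels = mask_tokens(tokens, toy_vocab, mask_prob=0.5, seed=0)
print(corrupted)
print(labels)
```

The model sees `corrupted` as input and is trained to recover the non-`None` entries of `labels`.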

In [None]:
# Load BERT tokenizer
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

# Example sentence
text = "The quick brown fox jumps over the lazy dog"

# Tokenize
tokens = tokenizer.tokenize(text)
print(f'Tokens: {tokens}')

# Convert to IDs
input_ids = tokenizer.encode(text, add_special_tokens=True)
print(f'\nInput IDs: {input_ids}')

# Decode back
decoded = tokenizer.decode(input_ids)
print(f'Decoded: {decoded}')

In [None]:
# Demonstrate masking
from transformers import pipeline

# Load masked LM pipeline
mlm = pipeline('fill-mask', model='bert-base-uncased')

# Test MLM
test_sentences = [
    "The cat [MASK] on the mat.",
    "Paris is the [MASK] of France.",
    "I love [MASK] learning."
]

for sent in test_sentences:
    results = mlm(sent)
    print(f"\nSentence: {sent}")
    print("Top predictions:")
    for i, result in enumerate(results[:3], 1):
        print(f"  {i}. {result['token_str']:15} (score: {result['score']:.3f})")
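
The `score` that `fill-mask` reports is a softmax probability over the vocabulary at the [MASK] position. A minimal sketch of that last step follows, using made-up logits and a four-word toy vocabulary instead of real model output; the helper name `top_k_predictions` is hypothetical.

```python
import math


def top_k_predictions(logits, id_to_token, k=3):
    """Turn raw vocabulary logits at a [MASK] position into ranked (token, prob) pairs."""
    peak = max(logits)  # subtract the max before exp for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    probs = [e / total for e in exps]  # softmax over the whole vocabulary
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    return [(id_to_token[i], probs[i]) for i in ranked[:k]]


toy_vocab = ["capital", "city", "heart", "banana"]  # hypothetical vocabulary
toy_logits = [6.0, 3.0, 2.0, -4.0]                  # made-up scores, not model output
predictions = top_k_predictions(toy_logits, toy_vocab)
for token, prob in predictions:
    print(f"{token:10} {prob:.3f}")
```

Because the softmax runs over every vocabulary entry, the top-3 scores printed by the pipeline generally sum to less than 1.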

## 2. Using Pre-trained BERT

The **Hugging Face** `transformers` library provides pre-trained BERT checkpoints that load with a single `from_pretrained` call.

In [None]:
# Load pre-trained BERT
model = BertModel.from_pretrained('bert-base-uncased')
model.to(device)
model.eval()

print('✓ BERT loaded!')
print(f'Parameters: {sum(p.numel() for p in model.parameters()):,}')

In [None]:
# Extract embeddings
text = "BERT provides contextualized word embeddings"

# Tokenize
inputs = tokenizer(text, return_tensors='pt', padding=True).to(device)

# Get BERT outputs
with torch.no_grad():
    outputs = model(**inputs)

# Extract embeddings
last_hidden_state = outputs.last_hidden_state  # (batch, seq_len, hidden_dim)
cls_embedding = last_hidden_state[:, 0, :]  # [CLS] token embedding

print(f'Last hidden state shape: {last_hidden_state.shape}')
print(f'[CLS] embedding shape: {cls_embedding.shape}')
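
The `[CLS]` vector is one way to summarize a sentence. Another common option is to average the token vectors while skipping padded positions, which matters once several sentences of different lengths are batched with `padding=True`. Below is a minimal sketch on toy nested lists with a hypothetical helper `mean_pool`; with real outputs the same arithmetic would apply to `last_hidden_state` and `inputs['attention_mask']`.

```python
def mean_pool(hidden_states, attention_mask):
    """Average the token vectors of one sequence, skipping padded positions.

    hidden_states: list of seq_len vectors, each a list of hidden_dim floats
    attention_mask: list of seq_len ints, 1 for real tokens and 0 for padding
    """
    kept = [vec for vec, keep in zip(hidden_states, attention_mask) if keep]
    if not kept:
        raise ValueError("attention_mask selects no tokens")
    dim = len(kept[0])
    return [sum(vec[d] for vec in kept) / len(kept) for d in range(dim)]


# Toy data: three positions, hidden_dim = 2; the last position is padding.
toy_hidden = [[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]]
toy_mask = [1, 1, 0]
print(mean_pool(toy_hidden, toy_mask))  # [2.0, 3.0]
```

Without the mask, the padded vector would be averaged in and distort the result.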

## 3. Fine-Tuning BERT

**Transfer learning workflow**:
1. Load pre-trained BERT
2. Add task-specific head
3. Fine-tune on target task
4. Reach strong task performance with far less labeled data than training from scratch

In [None]:
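# AdamW comes from PyTorch: transformers.AdamW is deprecated and absent from recent releases
from torch.optim import AdamW
from transformers import BertForSequenceClassification, get_linear_schedule_with_warmup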

# Load BERT for classification
model = BertForSequenceClassification.from_pretrained(
    'bert-base-uncased',
    num_labels=2,  # Binary classification
    output_attentions=False,
    output_hidden_states=False
)

model.to(device)
print('✓ BERT classifier loaded!')
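
`get_linear_schedule_with_warmup`, imported above, scales the optimizer's learning rate by a factor that ramps linearly from 0 to 1 during warmup and then decays linearly back to 0. The sketch below reproduces that factor in plain Python so its shape is easy to inspect. The function name `linear_warmup_decay`, the base learning rate of 2e-5, and the step counts are illustrative assumptions, not values from this module.

```python
def linear_warmup_decay(step, warmup_steps, total_steps):
    """Learning-rate multiplier: linear ramp 0 -> 1, then linear decay 1 -> 0."""
    if step < warmup_steps:
        return step / max(1, warmup_steps)
    return max(0.0, (total_steps - step) / max(1, total_steps - warmup_steps))


base_lr = 2e-5      # assumed fine-tuning learning rate
total_steps = 1000  # assumed number of optimizer steps
warmup_steps = 100  # assumed warmup length (10% of training)

for step in (0, 50, 100, 550, 1000):
    lr = base_lr * linear_warmup_decay(step, warmup_steps, total_steps)
    print(f"step {step:4d}: lr = {lr:.2e}")
```

The learning rate peaks at `base_lr` exactly when warmup ends and reaches 0 at the final step.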

**Exercise 1**: Fine-tune BERT for sentiment analysis

1. Load IMDB or SST dataset
2. Tokenize with BERT tokenizer
3. Fine-tune BERT classifier
4. Evaluate performance
5. Compare with RNN baseline

In [None]:
# YOUR CODE HERE
# Fine-tune BERT for sentiment analysis

## 4. BERT Variants

### Popular BERT derivatives:

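- **RoBERTa**: Optimized training (no NSP, larger batches, more data, dynamic masking)
- **ALBERT**: Cross-layer parameter sharing and factorized embeddings for fewer parameters
- **DistilBERT**: Distilled from BERT; about 40% smaller and 60% faster while retaining about 97% of BERT's language-understanding performance, as reported in the DistilBERT paper
- **ELECTRA**: Discriminative pre-training (detect replaced tokens instead of predicting masked ones)
- **DeBERTa**: Disentangled attention over separate content and position representations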

In [None]:
# Compare different BERT models
models_to_compare = [
    'bert-base-uncased',
    'distilbert-base-uncased',
    'roberta-base'
]

for model_name in models_to_compare:
    # Use a separate name so the tokenizer and model loaded in earlier cells are not overwritten
    candidate = AutoModel.from_pretrained(model_name)

    params = sum(p.numel() for p in candidate.parameters())
    print(f'{model_name:30} Parameters: {params:,}')

## Summary

### Key Concepts:

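1. **Masked Language Modeling**: Pre-train by predicting masked tokens from their context
2. **Bidirectional Context**: Every layer sees both left and right context
3. **Transfer Learning**: Pre-train once, then fine-tune on each downstream task
4. **Contextualized Embeddings**: The same word gets a different vector in each context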

### BERT Impact:

✅ State-of-the-art on many tasks at release  
✅ Efficient transfer learning  
✅ Spawned many variants  
✅ Foundation for modern NLP  

### What's Next?

In **Module 08: GPT**, we'll learn about autoregressive (decoder-only) models.

### Resources:

- **BERT Paper**: [BERT: Pre-training of Deep Bidirectional Transformers](https://arxiv.org/abs/1810.04805)
- **Illustrated BERT**: [Jay Alammar's Blog](http://jalammar.github.io/illustrated-bert/)