Let's dive into the **heart of BERT** — where words hide in masks and context reveals everything.  
Time to build a **Masked Language Model (MLM)** from scratch.

---

# 🧪 `08_lab_masked_language_modeling_from_scratch.ipynb`  
### 📁 `03_natural_language_processing`  
> Pretrain a **mini BERT-style model** on a small text corpus using **masked token prediction**.  
No black-box — you’ll learn **what BERT learns**, and **how**.

---

## 🎯 Learning Goals

- Understand **Masked Language Modeling (MLM)** as used in BERT  
- Train a **tiny Transformer encoder** on a small dataset  
- Implement training loop, masking logic, loss function  
- Check token recovery on a masked sentence after training

---

## ⚙️ Runtime Design

| Feature         | Spec              |
|-----------------|-------------------|
| Target device   | ✅ Colab / Laptop (CPU or GPU) |
| Dataset size    | ✅ <10k lines (custom text or dataset) |
| Model size      | ✅ ~4.5M params (tiny encoder; most weights sit in the 30k-token embedding and output layers) |
| Epochs          | 🔁 5–10 fast epochs |
| Libraries       | ✅ PyTorch, Hugging Face Transformers (tokenizer only) |

---

## 🔧 Section 1: Install & Imports

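```python
!pip install transformers  # also install torch if it is not already available

import random

import numpy as np
import torch
from torch import nn
from torch.utils.data import Dataset, DataLoader
from transformers import BertTokenizerFast

# Fix seeds so masking and weight initialization are reproducible
random.seed(0)
np.random.seed(0)
torch.manual_seed(0)
```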

---

## 📄 Section 2: Create Sample Dataset

```python
# Toy corpus (repeat to expand)
text_data = [
    "The quick brown fox jumps over the lazy dog",
    "All your base are belong to us",
    "The cake is a lie",
    "To be or not to be",
    "I am the one who knocks"
] * 200  # expand for training
```

---

## 🧠 Section 3: Tokenizer + Masking

```python
tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")

def mask_tokens(inputs, tokenizer, mlm_probability=0.15):
    """Mask a 1D or 2D tensor of token ids; returns (masked_inputs, labels)."""
    inputs = inputs.clone()  # leave the caller's tensor untouched
    labels = inputs.clone()
    # Sample tokens to mask, never choosing [CLS], [SEP], [PAD], ...
    probability_matrix = torch.full(labels.shape, mlm_probability)
    special_tokens_mask = torch.isin(labels, torch.tensor(tokenizer.all_special_ids))
    probability_matrix.masked_fill_(special_tokens_mask, value=0.0)
    masked_indices = torch.bernoulli(probability_matrix).bool()
    labels[~masked_indices] = -100  # only compute loss on masked tokens

    # Simplification: every selected token becomes [MASK]
    inputs[masked_indices] = tokenizer.mask_token_id
    return inputs, labels
```
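
The masking in this lab always substitutes `[MASK]`. BERT's original recipe is slightly different: of the tokens selected for prediction, 80% become `[MASK]`, 10% become a random token, and 10% are left unchanged. Here is a minimal sketch of that rule. The function name `mask_tokens_80_10_10` and its parameters are made up for illustration, and it takes plain ids instead of a tokenizer so it runs on its own:

```python
import torch

def mask_tokens_80_10_10(input_ids, special_ids, mask_token_id, vocab_size,
                         mlm_probability=0.15, generator=None):
    """Return (masked_inputs, labels) using the 80/10/10 replacement rule."""
    inputs = input_ids.clone()
    labels = input_ids.clone()

    # Choose prediction targets, never picking special tokens
    probs = torch.full(inputs.shape, mlm_probability)
    is_special = torch.isin(inputs, torch.tensor(special_ids))
    probs.masked_fill_(is_special, 0.0)
    selected = torch.bernoulli(probs, generator=generator).bool()
    labels[~selected] = -100  # ignored by nn.CrossEntropyLoss

    # 80% of the selected positions become [MASK]
    coin = torch.bernoulli(torch.full(inputs.shape, 0.8), generator=generator).bool()
    replace = coin & selected
    inputs[replace] = mask_token_id

    # Half of the remaining selected positions (10% overall) become a random token
    coin = torch.bernoulli(torch.full(inputs.shape, 0.5), generator=generator).bool()
    randomize = coin & selected & ~replace
    random_ids = torch.randint(vocab_size, inputs.shape, generator=generator)
    inputs[randomize] = random_ids[randomize]

    # The rest keep their original token but are still prediction targets
    return inputs, labels

# Example ids: 101/102/0/103 are [CLS]/[SEP]/[PAD]/[MASK] in bert-base-uncased;
# the ids 2000-2003 are arbitrary stand-ins for ordinary tokens.
ids = torch.tensor([[101, 2000, 2001, 2002, 2003, 102, 0, 0]])
masked, labels = mask_tokens_80_10_10(
    ids, special_ids=[0, 100, 101, 102, 103], mask_token_id=103, vocab_size=30522
)
```

In the notebook you could pass `tokenizer.all_special_ids`, `tokenizer.mask_token_id`, and `tokenizer.vocab_size` for the three id arguments.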

---

## 🧱 Section 4: Dataset + Dataloader

```python
class MLMDataset(Dataset):
    def __init__(self, texts, tokenizer, max_length=32):
        self.tokenizer = tokenizer
        self.examples = tokenizer(texts, truncation=True, padding='max_length',
                                  max_length=max_length, return_tensors='pt')['input_ids']

    def __len__(self):
        return len(self.examples)

    def __getitem__(self, idx):
        # Masking runs on every fetch, so each epoch sees different masks.
        # mask_tokens gets a batch dimension, which is dropped again afterwards.
        input_ids = self.examples[idx].clone().unsqueeze(0)
        masked, labels = mask_tokens(input_ids, self.tokenizer)
        return masked.squeeze(0), labels.squeeze(0)

dataset = MLMDataset(text_data, tokenizer)
loader = DataLoader(dataset, batch_size=8, shuffle=True)
```
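
Because every example is padded to `max_length`, most positions in a short sentence are `[PAD]`, and a Transformer encoder will attend to them unless told otherwise. One way to stop that is the `src_key_padding_mask` argument of `nn.TransformerEncoder`. The sketch below is standalone, with toy sizes and made-up token ids; only `PAD_ID = 0` matches `bert-base-uncased`:

```python
import torch
from torch import nn

PAD_ID = 0  # [PAD] id in bert-base-uncased

# Toy-sized embedding and encoder, just to show the call
emb = nn.Embedding(50, 16)
encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=16, nhead=2, batch_first=True),
    num_layers=1,
)

input_ids = torch.tensor([[5, 6, 7, PAD_ID, PAD_ID]])  # made-up ids plus padding
padding_mask = input_ids == PAD_ID  # True marks positions attention should ignore

hidden = encoder(emb(input_ids), src_key_padding_mask=padding_mask)
print(hidden.shape)  # (batch, sequence, d_model)
```

The loss already skips padded positions because their labels are `-100`; the mask only keeps real tokens from attending to padding.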

---

## 🧠 Section 5: Define a Mini BERT-style Encoder

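```python
class MiniBert(nn.Module):
    def __init__(self, vocab_size, d_model=64, max_len=32):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, d_model)
        # Learned position embeddings: without them the encoder cannot see word order
        self.pos = nn.Embedding(max_len, d_model)
        self.encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model=d_model, nhead=4, batch_first=True),
            num_layers=2
        )
        self.out = nn.Linear(d_model, vocab_size)

    def forward(self, x):
        positions = torch.arange(x.size(1), device=x.device)
        x = self.emb(x) + self.pos(positions)  # (B, T, D)
        x = self.encoder(x)  # (B, T, D), batch_first avoids the permutes
        return self.out(x)  # (B, T, V)
```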
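
It helps to know where the parameters of this model actually live. The back-of-the-envelope count below assumes the `bert-base-uncased` vocabulary (30,522 tokens) and PyTorch's default `dim_feedforward=2048` for `nn.TransformerEncoderLayer`. The helper name `count_mini_bert_params` is made up for this sketch:

```python
def count_mini_bert_params(vocab_size=30522, d_model=64, num_layers=2,
                           dim_feedforward=2048, max_len=0):
    """Estimate parameter counts per component; max_len=0 means no position embeddings."""
    embedding = vocab_size * d_model
    positions = max_len * d_model
    # Q, K, V and output projections, each with a bias
    attention = 4 * (d_model * d_model + d_model)
    # Two linear layers of the feed-forward block, each with a bias
    feed_forward = (d_model * dim_feedforward + dim_feedforward) \
                 + (dim_feedforward * d_model + d_model)
    layer_norms = 2 * (2 * d_model)  # two LayerNorms, weight and bias each
    encoder = num_layers * (attention + feed_forward + layer_norms)
    output_head = d_model * vocab_size + vocab_size
    return {
        "embedding": embedding,
        "positions": positions,
        "encoder": encoder,
        "output_head": output_head,
        "total": embedding + positions + encoder + output_head,
    }

counts = count_mini_bert_params()
for name, value in counts.items():
    print(f"{name:12s} {value:>10,}")
```

With these defaults the total is about 4.5 million, and roughly 3.9 million of that is the embedding table plus the output layer. To check against the real model, run `sum(p.numel() for p in model.parameters())` in the notebook.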

---

## 🔁 Section 6: Train Loop (MLM)

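```python
device = "cuda" if torch.cuda.is_available() else "cpu"
model = MiniBert(tokenizer.vocab_size).to(device)
optimizer = torch.optim.Adam(model.parameters(), lr=3e-4)
criterion = nn.CrossEntropyLoss(ignore_index=-100)  # skip unmasked positions

for epoch in range(5):
    model.train()
    total_loss, num_batches = 0.0, 0
    for inputs, labels in loader:
        inputs, labels = inputs.to(device), labels.to(device)
        if (labels != -100).sum() == 0:
            continue  # nothing was masked in this batch; the loss would be NaN
        outputs = model(inputs)  # (B, T, V)
        loss = criterion(outputs.reshape(-1, outputs.size(-1)), labels.reshape(-1))
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        total_loss += loss.item()
        num_batches += 1
    # Report the mean loss per batch, not the sum over batches
    print(f"Epoch {epoch+1}, mean loss: {total_loss / max(num_batches, 1):.4f}")
```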
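
A quick sanity check on the numbers this loop prints: a model that guesses uniformly over the vocabulary has a cross-entropy of `ln(vocab_size)`, so the mean per-batch loss (total divided by number of batches) should start near that value and fall from there. A small sketch, assuming the 30,522-token `bert-base-uncased` vocabulary; both helper names are made up for illustration:

```python
import math

def uniform_baseline_loss(vocab_size):
    """Cross-entropy in nats of a model that gives every token equal probability."""
    return math.log(vocab_size)

def perplexity(mean_cross_entropy):
    """Convert a mean cross-entropy in nats into perplexity."""
    return math.exp(mean_cross_entropy)

baseline = uniform_baseline_loss(30522)
print(f"Uniform-guess loss: {baseline:.2f}")  # about 10.33
print(f"Uniform-guess perplexity: {perplexity(baseline):.0f}")
```

On a corpus of five repeated sentences the loss can drop far below this baseline quickly, which mostly reflects memorization.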

---

## 🧪 Section 7: Try Masked Prediction

```python
def predict_masked(text):
    tokens = tokenizer(text, return_tensors='pt', max_length=32, truncation=True, padding='max_length')
    input_ids = tokens['input_ids'].to(device)
    # Find the [MASK] token(s) already in the text instead of hard-coding a position
    mask_positions = (input_ids[0] == tokenizer.mask_token_id).nonzero(as_tuple=True)[0]

    model.eval()  # turn off dropout for inference
    with torch.no_grad():
        output = model(input_ids)
    print("Input:", text)
    for pos in mask_positions.tolist():
        top_pred = output[0, pos].argmax().item()
        print(f"Predicted token at position {pos}:", tokenizer.decode([top_pred]))

predict_masked("The quick brown [MASK] jumps over")
```
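
A single `argmax` hides how confident the model is. One option is to list the top few candidates with their probabilities. This sketch uses made-up logits over a 10-token vocabulary so it runs on its own; `top_k_tokens` is a hypothetical helper, not part of the lab code:

```python
import torch

def top_k_tokens(logits, k=5):
    """Return (token_id, probability) pairs for the k most likely tokens at one position."""
    probs = torch.softmax(logits, dim=-1)
    top_probs, top_ids = torch.topk(probs, k)
    return list(zip(top_ids.tolist(), top_probs.tolist()))

# Stand-in logits (hypothetical values); in the notebook this would be output[0, pos]
logits = torch.tensor([0.1, 2.0, 0.3, 4.0, 0.0, 1.0, 0.2, 0.5, 3.0, 0.4])
for token_id, prob in top_k_tokens(logits, k=3):
    print(token_id, round(prob, 3))
```

In the notebook, each `token_id` can be turned back into text with `tokenizer.decode([token_id])`.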

---

## ✅ Wrap-Up Recap

| What You Did               | ✅ |
|----------------------------|----|
| Created a toy corpus       | ✅ |
| Built masking logic        | ✅ |
| Trained MLM from scratch   | ✅ |
| Tested masked recovery     | ✅ |
| CPU/GPU friendly           | ✅ |

---

## 🧠 What You Learned

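- How **BERT learns by unmasking tokens**  
- How **contextual embeddings** let the model use the words on both sides of a mask  
- Why MLM pretraining on a large corpus captures **semantics** that this toy corpus can only hint at  
- Which pieces (masking, encoder, loss) carry over to **full-scale pretraining pipelines**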

---

Ready for `09_lab_attention_visualization.ipynb` next?  
We’ll visualize **real attention heads** from a transformer — and see how models "look" at words.