# BERT: Part 1

## Why BERT Changed Everything

---

**Paper:** [BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding](https://arxiv.org/abs/1810.04805)

**Authors:** Jacob Devlin, Ming-Wei Chang, Kenton Lee, Kristina Toutanova (Google AI Language)

**Published:** October 2018

---

Before BERT, getting state-of-the-art results on NLP tasks required task-specific architectures. You needed different models for sentiment analysis, question answering, and named entity recognition.

BERT changed that. Take one pre-trained model, fine-tune it for an hour or less on a single Cloud TPU, and beat the previous best on 11 tasks.

Let me show you why this was such a big deal.

---

## The State of NLP Before BERT

To understand why BERT mattered, you need to know what we were doing before.

### The Word Embedding Era (2013-2017)

In 2013, Word2Vec showed us something remarkable: you could represent words as vectors, and similar words would have similar vectors.

```
king - man + woman ≈ queen
```

This was genuinely exciting. We went from treating words as arbitrary symbols to having meaningful representations.
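To make the arithmetic concrete, here is a minimal sketch with hand-made 2-d vectors. The values in `toy_vectors` are hypothetical (one axis loosely stands for "royalty", the other for "gender"); real Word2Vec vectors are learned and have hundreds of dimensions.

```python
import math

# Hypothetical 2-d embeddings: axis 0 ~ "royalty", axis 1 ~ "gender".
toy_vectors = {
    "king":  (1.0,  1.0),
    "queen": (1.0, -1.0),
    "man":   (0.0,  1.0),
    "woman": (0.0, -1.0),
    "apple": (-1.0, 0.2),
}

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

def analogy(a, b, c, vectors):
    """Return the word closest to vec(a) - vec(b) + vec(c), excluding the inputs."""
    target = tuple(x - y + z for x, y, z in zip(vectors[a], vectors[b], vectors[c]))
    candidates = [w for w in vectors if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine(target, vectors[w]))

print(analogy("king", "man", "woman", toy_vectors))  # queen
```

Excluding the three input words from the candidates is the usual convention for analogy queries; without it the nearest neighbour is often one of the inputs.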

**Word2Vec, GloVe, FastText** - these became the standard. You'd download pre-trained embeddings and use them as the first layer of your model.

But there was a fundamental problem.

### The Problem: One Word, One Vector

Consider the word **"bank"**:

- "I deposited money in the **bank**"
- "I sat by the river **bank**"

With Word2Vec, both sentences use the exact same vector for "bank". The embedding is **static** - it doesn't change based on context.

This is obviously wrong. The word means completely different things in these sentences.

Same problem with "apple":
- "I ate an **apple**" (fruit)
- "I bought **Apple** stock" (company)

Or "play":
- "Let's **play** basketball" (verb - activity)
- "We watched a **play**" (noun - theater)

Static embeddings can't handle this. They give you one vector per word, regardless of how it's used.
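In code, a static embedding is nothing more than a dictionary lookup. The sketch below uses a made-up table (`static_table` and its values are hypothetical) to show that the neighbouring words never enter the computation.

```python
# Hypothetical static embedding table: one fixed vector per word type.
static_table = {
    "i": (0.1, 0.3), "deposited": (0.7, 0.2), "money": (0.9, 0.1),
    "in": (0.0, 0.0), "the": (0.0, 0.1), "bank": (0.5, 0.8),
    "sat": (0.2, 0.6), "by": (0.1, 0.1), "river": (0.3, 0.9),
}

def embed_static(sentence, table):
    """Look each token up independently; neighbouring words are never consulted."""
    return [table[token] for token in sentence.lower().split()]

finance = embed_static("I deposited money in the bank", static_table)
nature = embed_static("I sat by the river bank", static_table)

# "bank" is the last token of both sentences and gets the identical vector.
print(finance[-1] == nature[-1])  # True
```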

In [None]:
# Let's visualize this problem
import numpy as np
import matplotlib.pyplot as plt

fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Left: Static embeddings (Word2Vec style)
ax = axes[0]
ax.set_xlim(-2, 2)
ax.set_ylim(-2, 2)
ax.set_title('Static Embeddings (Word2Vec)', fontsize=12, fontweight='bold')

# "bank" has ONE position regardless of context
ax.scatter([0.5], [0.8], s=200, c='#e74c3c', zorder=5)
ax.annotate('"bank"\n(one vector for all uses)', (0.5, 0.8), 
            xytext=(0.5, 0.2), fontsize=10, ha='center',
            arrowprops=dict(arrowstyle='->', color='gray'))

# Context sentences
ax.text(-1.8, 1.7, '"river bank"', fontsize=9, color='#3498db')
ax.text(-1.8, 1.4, '"bank account"', fontsize=9, color='#3498db')
ax.text(-1.8, 1.1, '"bank robbery"', fontsize=9, color='#3498db')
ax.text(-1.8, 0.8, '"steep bank"', fontsize=9, color='#3498db')

# Arrow showing they all map to same point
for y in [1.7, 1.4, 1.1, 0.8]:
    ax.annotate('', xy=(0.3, 0.8), xytext=(-0.5, y),
                arrowprops=dict(arrowstyle='->', color='#bdc3c7', lw=0.8))

ax.text(0, -1.5, 'Problem: Same vector for\ncompletely different meanings!', 
        fontsize=10, ha='center', color='#e74c3c', fontweight='bold')
ax.set_xticks([])
ax.set_yticks([])

# Right: Contextualized embeddings (BERT style)
ax = axes[1]
ax.set_xlim(-2, 2)
ax.set_ylim(-2, 2)
ax.set_title('Contextualized Embeddings (BERT)', fontsize=12, fontweight='bold')

# Different positions based on context
# Same-sense uses sit close together, so the two meanings form separate clusters
positions = [(-1.0, 1.3), (1.0, 0.3), (0.8, -0.4), (-0.6, 0.6)]
contexts = ['"river bank"', '"bank account"', '"bank robbery"', '"steep bank"']
colors = ['#27ae60', '#e74c3c', '#e74c3c', '#27ae60']

for (x, y), ctx, c in zip(positions, contexts, colors):
    ax.scatter([x], [y], s=150, c=c, zorder=5)
    ax.annotate(ctx, (x, y), xytext=(x-0.3, y+0.3), fontsize=9)

ax.text(0, -1.5, 'Solution: Different vectors\nbased on context!', 
        fontsize=10, ha='center', color='#27ae60', fontweight='bold')

# Add legend
ax.scatter([], [], c='#27ae60', s=100, label='Geography meaning')
ax.scatter([], [], c='#e74c3c', s=100, label='Financial meaning')
ax.legend(loc='lower right', fontsize=9)
ax.set_xticks([])
ax.set_yticks([])

plt.tight_layout()
plt.show()

---

## Enter ELMo: The First Contextualized Embeddings (2018)

A few months before BERT, a paper called **ELMo** (Embeddings from Language Models) tackled this problem.

The idea: instead of one fixed embedding per word, run the sentence through a bidirectional LSTM and use the hidden states as embeddings.

```
Forward LSTM:  The → cat → sat → on → the → bank
Backward LSTM: bank ← the ← on ← sat ← cat ← The
```

Combine both directions → contextualized embedding for each word.

This worked! ELMo improved results across many tasks.

**But there was still a problem...**

### ELMo's Limitation: Shallow Bidirectionality

ELMo runs two separate LSTMs:
- Forward: reads left-to-right
- Backward: reads right-to-left

Then it **concatenates** them:

```
ELMo("bank") = [forward_hidden; backward_hidden]
```

The problem? Each direction is computed **independently**. The forward LSTM doesn't know what the backward LSTM learned, and vice versa.

It's like having two people read a sentence from opposite ends and then combining their notes. They never actually discussed what they found.
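Here is a toy version of that "two readers" setup. It is only a sketch: a one-unit `tanh` recurrence stands in for the LSTM, and the scalar values in `token_value` are hypothetical. What it preserves is the structure: two independent passes whose states are concatenated per position.

```python
import math

# Hypothetical scalar "embeddings"; a real ELMo is far richer than this.
token_value = {"the": 0.1, "river": 0.9, "bank": 0.5, "was": 0.2,
               "steep": 0.8, "closed": -0.7}

def run_rnn(tokens):
    """Toy one-unit recurrent pass: each state depends only on tokens seen so far."""
    state, states = 0.0, []
    for token in tokens:
        state = math.tanh(0.5 * state + token_value[token])
        states.append(state)
    return states

def shallow_bidirectional(tokens):
    forward = run_rnn(tokens)                # left-to-right
    backward = run_rnn(tokens[::-1])[::-1]   # right-to-left, re-aligned to token order
    # Concatenate per position: [forward_hidden; backward_hidden]
    return list(zip(forward, backward))

a = shallow_bidirectional("the river bank was steep".split())
b = shallow_bidirectional("the river bank was closed".split())

bank = 2  # index of "bank" in both sentences
print(a[bank][0] == b[bank][0])  # True: the forward half ignores the changed right context
print(a[bank][1] == b[bank][1])  # False: only the backward half reacts
```

Changing the last word changes only the backward half of the vector for "bank". The forward half never finds out, which is exactly the "they never discussed what they found" problem.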

In [None]:
# ELMo vs BERT comparison
fig, axes = plt.subplots(1, 2, figsize=(14, 6))

# ELMo
ax = axes[0]
ax.set_xlim(0, 10)
ax.set_ylim(0, 8)
ax.axis('off')
ax.set_title('ELMo: Concatenate Two Directions', fontsize=12, fontweight='bold')

words = ['The', 'river', 'bank', 'was', 'steep']
for i, word in enumerate(words):
    ax.text(1 + i*1.7, 1, word, fontsize=10, ha='center', 
            bbox=dict(boxstyle='round,pad=0.3', facecolor='#ecf0f1', edgecolor='#bdc3c7'))

# Forward arrows
for i in range(4):
    ax.annotate('', xy=(2.2 + i*1.7, 3.5), xytext=(1.3 + i*1.7, 3.5),
                arrowprops=dict(arrowstyle='->', color='#3498db', lw=2))
ax.text(5, 4, 'Forward LSTM', fontsize=10, ha='center', color='#3498db', fontweight='bold')

# Backward arrows
for i in range(4):
    ax.annotate('', xy=(1.3 + i*1.7, 5.5), xytext=(2.2 + i*1.7, 5.5),
                arrowprops=dict(arrowstyle='->', color='#e74c3c', lw=2))
ax.text(5, 6.2, 'Backward LSTM', fontsize=10, ha='center', color='#e74c3c', fontweight='bold')

# Concatenation
ax.text(5, 7.2, 'Concatenate (no interaction)', fontsize=9, ha='center', 
        bbox=dict(boxstyle='round', facecolor='#f9e79f'))

# BERT
ax = axes[1]
ax.set_xlim(0, 10)
ax.set_ylim(0, 8)
ax.axis('off')
ax.set_title('BERT: True Bidirectional Attention', fontsize=12, fontweight='bold')

for i, word in enumerate(words):
    ax.text(1 + i*1.7, 1, word, fontsize=10, ha='center', 
            bbox=dict(boxstyle='round,pad=0.3', facecolor='#ecf0f1', edgecolor='#bdc3c7'))

# Bidirectional attention (all-to-all)
# Show attention from "bank" to all other words
bank_x = 1 + 2*1.7  # position of "bank"
for i, word in enumerate(words):
    other_x = 1 + i*1.7
    if word != 'bank':
        ax.annotate('', xy=(other_x, 2.5), xytext=(bank_x, 2.5),
                    arrowprops=dict(arrowstyle='<->', color='#27ae60', lw=1.5,
                                   connectionstyle='arc3,rad=0.3'))

ax.text(5, 4.5, 'Self-Attention:\nEvery word sees every other word\nAT THE SAME TIME', 
        fontsize=10, ha='center', color='#27ae60', fontweight='bold')

ax.text(5, 7.2, 'Deep bidirectional (true fusion)', fontsize=9, ha='center', 
        bbox=dict(boxstyle='round', facecolor='#d5f5e3'))

plt.tight_layout()
plt.show()

---

## Meanwhile: GPT (June 2018)

OpenAI released GPT a few months before BERT. GPT used the Transformer architecture (which we covered in the previous series), but only the **decoder** part.

GPT was trained to predict the next word:

```
Input:  "The cat sat on the"
Target: "mat"
```

This worked great for text generation. But there's a limitation for understanding tasks.

### GPT's Problem: Left-to-Right Only

Because GPT predicts the next word, it can only look at previous words. It's **unidirectional**.

```
"The cat sat on the ___ near the door"
                     ↑
    Can see:    "The cat sat on the"
    Cannot see: "near the door"
```

For many tasks (sentiment analysis, question answering), you want to look at the **entire** sentence, not just what came before.
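One way to picture the difference is as a visibility mask over positions. The helper names below are my own, and the sketch only models who can look at whom, not the attention computation itself.

```python
def visibility_mask(num_tokens, causal):
    """mask[i][j] is True when position i is allowed to look at position j."""
    return [[(j <= i) if causal else True for j in range(num_tokens)]
            for i in range(num_tokens)]

def visible_tokens(tokens, position, causal):
    """List the tokens that the given position is allowed to see."""
    mask = visibility_mask(len(tokens), causal)
    return [tok for tok, allowed in zip(tokens, mask[position]) if allowed]

tokens = "the movie was terrible but fun".split()
position = tokens.index("terrible")

print(visible_tokens(tokens, position, causal=True))
# ['the', 'movie', 'was', 'terrible']
print(visible_tokens(tokens, position, causal=False))
# ['the', 'movie', 'was', 'terrible', 'but', 'fun']
```

With the causal (left-to-right) mask, the representation of "terrible" is built without ever seeing "but fun".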

---

## The BERT Insight: Masked Language Modeling

The Google team asked a simple question:

> "What if we could train a Transformer to look at the whole sentence?"

The problem: if you train a model to predict the next word, it can't see future words (that would be cheating). But if you just show it the whole sentence... what do you train it to predict?

**Their solution: Mask some words and predict them.**

```
Original:  "The cat sat on the mat"
Masked:    "The [MASK] sat on the mat"
Task:      Predict that [MASK] = "cat"
```

Now the model can see words on **both sides** of the masked word. It has full bidirectional context.

This is the key insight of BERT: **Masked Language Modeling (MLM)**.
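A minimal sketch of the data preparation side, assuming whitespace tokens and a 15% masking rate. The function name and the `seed` argument are mine. The paper also replaces some of the selected tokens with a random or unchanged token instead of `[MASK]`; this sketch skips that refinement.

```python
import random

MASK_TOKEN = "[MASK]"

def mask_tokens(tokens, mask_fraction=0.15, seed=0):
    """Hide a random subset of tokens and return (masked_input, labels).

    labels[i] holds the original token at masked positions and None elsewhere,
    so a training loss would only be computed where a token was hidden.
    """
    rng = random.Random(seed)
    num_to_mask = max(1, round(len(tokens) * mask_fraction))
    positions = set(rng.sample(range(len(tokens)), num_to_mask))
    masked = [MASK_TOKEN if i in positions else tok for i, tok in enumerate(tokens)]
    labels = [tok if i in positions else None for i, tok in enumerate(tokens)]
    return masked, labels

masked, labels = mask_tokens("the cat sat on the mat".split())
print(masked)
print(labels)
```

The model receives `masked` as input and is trained to recover the non-`None` entries of `labels`, using the words on both sides of each gap.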

In [None]:
# MLM visualization
fig, ax = plt.subplots(figsize=(14, 7))
ax.set_xlim(0, 14)
ax.set_ylim(0, 7)
ax.axis('off')

ax.text(7, 6.5, 'Masked Language Modeling: The Key Innovation', fontsize=14, 
        ha='center', fontweight='bold')

# Original sentence
ax.text(1, 5, 'Original:', fontsize=11, fontweight='bold')
words = ['The', 'cat', 'sat', 'on', 'the', 'mat']
for i, word in enumerate(words):
    ax.text(2.5 + i*1.5, 5, word, fontsize=12, ha='center',
            bbox=dict(boxstyle='round,pad=0.3', facecolor='#3498db', edgecolor='none'),
            color='white')

# Masked sentence
ax.text(1, 3.5, 'Masked:', fontsize=11, fontweight='bold')
masked_words = ['The', '[MASK]', 'sat', 'on', 'the', 'mat']
for i, word in enumerate(masked_words):
    color = '#e74c3c' if word == '[MASK]' else '#3498db'
    ax.text(2.5 + i*1.5, 3.5, word, fontsize=12, ha='center',
            bbox=dict(boxstyle='round,pad=0.3', facecolor=color, edgecolor='none'),
            color='white')

# Arrows showing context: every unmasked word points at the prediction box
mask_idx = masked_words.index('[MASK]')
mask_x = 2.5 + mask_idx*1.5  # position of [MASK]
for i in range(len(masked_words)):
    if i == mask_idx:
        continue
    # Words left of the mask aim at the box's left edge, the rest at its right edge
    offset = -0.5 if i < mask_idx else 0.5
    ax.annotate('', xy=(mask_x + offset, 2.8), xytext=(2.5 + i*1.5, 3.2),
                arrowprops=dict(arrowstyle='->', color='#27ae60', lw=2))

# Prediction
ax.text(mask_x, 2.3, 'Predict: "cat"', fontsize=12, ha='center', fontweight='bold',
        bbox=dict(boxstyle='round', facecolor='#f39c12', edgecolor='none'),
        color='white')

# Key point
ax.text(7, 1, 'The model sees BOTH "The" (left) AND "sat on the mat" (right)\nto predict the masked word!', 
        fontsize=11, ha='center', 
        bbox=dict(boxstyle='round', facecolor='#d5f5e3', edgecolor='#27ae60'))

# Comparison
ax.text(12, 4.5, 'GPT:', fontsize=10, fontweight='bold')
ax.text(12, 4, 'Can only see left', fontsize=9, color='#e74c3c')
ax.text(12, 3.3, 'BERT:', fontsize=10, fontweight='bold')
ax.text(12, 2.8, 'Sees both sides!', fontsize=9, color='#27ae60')

plt.tight_layout()
plt.show()

---

## Why This Matters: Real Examples

Let me show you why bidirectional context is so important.

### Example 1: Pronoun Resolution

```
"The trophy didn't fit in the suitcase because it was too big."
```

What does "it" refer to? The trophy or the suitcase?

- If "it was too **big**" → "it" = trophy
- If "it was too **small**" → "it" = suitcase

You need to see "big" (which comes **after** "it") to understand what "it" means.

GPT, reading left-to-right, hasn't seen "big" yet when it processes "it".

BERT sees everything at once.
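The effect can be reproduced with a few lines of toy self-attention. This is a sketch under loud assumptions: the 2-d token vectors are hypothetical, and the attention uses identity projections (queries = keys = values) with a single head, which is far simpler than a real Transformer layer.

```python
import math

def self_attention(vectors, causal=False):
    """Single-head attention with identity projections (queries = keys = values)."""
    outputs = []
    for i, query in enumerate(vectors):
        # Causal attention may only look at positions <= i; full attention sees all.
        visible = range(i + 1) if causal else range(len(vectors))
        scores = [sum(q * k for q, k in zip(query, vectors[j])) for j in visible]
        peak = max(scores)  # subtract the max for numerical stability
        weights = [math.exp(s - peak) for s in scores]
        total = sum(weights)
        mixed = [sum(w / total * vectors[j][d] for w, j in zip(weights, visible))
                 for d in range(len(query))]
        outputs.append(mixed)
    return outputs

# Hypothetical 2-d token vectors for "it was too big" vs "it was too small".
it, was, too = [1.0, 0.0], [0.2, 0.2], [0.1, 0.3]
big, small = [0.9, 0.4], [-0.8, 0.5]

for causal in (True, False):
    rep_big = self_attention([it, was, too, big], causal)[0]
    rep_small = self_attention([it, was, too, small], causal)[0]
    print("causal" if causal else "bidirectional", rep_big == rep_small)
# causal True          (the vector for "it" cannot react to the word that follows)
# bidirectional False  (the vector for "it" differs for "big" vs "small")
```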

### Example 2: Sentiment Analysis

```
"I thought the movie would be terrible, but it was actually amazing."
```

Is this positive or negative? You need to read the **whole sentence**. The beginning sounds negative, but the ending flips it.

### Example 3: Question Answering

```
Context: "The Eiffel Tower is located in Paris, France."
Question: "Where is the Eiffel Tower?"
```

The model needs to match "Where" in the question to "located in" in the context. That means relating question tokens and context tokens in both directions.

---

## The Second Pre-training Task: Next Sentence Prediction

Masked LM teaches the model about words. But many tasks (like question answering) involve understanding **relationships between sentences**.

BERT adds a second pre-training task: **Next Sentence Prediction (NSP)**.

Given two sentences, predict if the second one actually follows the first:

```
Sentence A: "The man went to the store."
Sentence B: "He bought some milk."
Label: IsNext (yes, B follows A)

Sentence A: "The man went to the store."
Sentence B: "Penguins are flightless birds."
Label: NotNext (random sentence, doesn't follow)
```

50% of training examples are real consecutive sentences (IsNext).
50% are random pairs (NotNext).

This teaches the model to understand discourse and sentence relationships.
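Building NSP training pairs takes only a few lines. The sketch below is a simplification: it alternates positive and negative pairs to get an exact 50/50 split, whereas a real pipeline would flip a coin per example. The function name and the tiny `corpus` are hypothetical, and it assumes at least two documents.

```python
import random

def make_nsp_examples(documents, seed=0):
    """Build (sentence_a, sentence_b, label) triples, alternating IsNext / NotNext.

    documents is a list of documents, each a list of sentences in order.
    """
    rng = random.Random(seed)
    examples = []
    for doc_index, sentences in enumerate(documents):
        for i in range(len(sentences) - 1):
            if len(examples) % 2 == 0:
                # Positive pair: the sentence that really follows.
                examples.append((sentences[i], sentences[i + 1], "IsNext"))
            else:
                # Negative pair: a random sentence from a different document.
                other_docs = [d for j, d in enumerate(documents) if j != doc_index]
                random_sentence = rng.choice(rng.choice(other_docs))
                examples.append((sentences[i], random_sentence, "NotNext"))
    return examples

corpus = [
    ["The man went to the store.", "He bought some milk.", "Then he walked home."],
    ["Penguins are flightless birds.", "They live in the Southern Hemisphere."],
]
for example in make_nsp_examples(corpus):
    print(example)
```

Drawing the negative from a different document keeps a "random" sentence from accidentally being the true continuation.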

*(Note: Later research showed NSP isn't that important. Models like RoBERTa dropped it. But it was part of the original BERT.)*

---

## Pre-training + Fine-tuning: The New Paradigm

Here's the workflow BERT introduced:

### Step 1: Pre-train (done once, by Google)

- Train on massive unlabeled text (Wikipedia + Books)
- Tasks: Masked LM + Next Sentence Prediction
- Takes about 4 days on 16 TPU chips (BERT-Base) or 64 TPU chips (BERT-Large)
- Results in a general-purpose language understanding model

### Step 2: Fine-tune (done by you, for your task)

- Start from pre-trained weights
- Add a simple output layer for your task
- Train on your labeled data
- Takes minutes to hours on one GPU

This is **transfer learning** for NLP. Similar to how ImageNet pre-training revolutionized computer vision.
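The "simple output layer" really is simple. Here is a sketch of a classification head: a linear layer plus softmax on top of a sentence vector. All numbers are hypothetical; in practice the sentence vector comes from the pre-trained encoder, and fine-tuning updates both the head and the encoder weights.

```python
import math

def softmax(logits):
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def classify(sentence_vector, weights, bias):
    """Linear output layer: one row of weights (and one bias) per class."""
    logits = [sum(w * x for w, x in zip(row, sentence_vector)) + b
              for row, b in zip(weights, bias)]
    return softmax(logits)

# Hypothetical values: a 4-d sentence vector and a 2-class head (negative, positive).
sentence_vector = [0.2, -0.4, 0.9, 0.1]
weights = [[0.5, 0.1, -0.3, 0.0],
           [-0.2, 0.4, 0.8, 0.1]]
bias = [0.0, 0.1]

probs = classify(sentence_vector, weights, bias)
print(probs)  # two probabilities that sum to 1; the second class wins here
```

Swapping the task mostly means swapping this head: more rows in `weights` for more classes, or one head per token for tagging tasks like NER.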

In [None]:
# Pre-train + Fine-tune visualization
from matplotlib.patches import FancyBboxPatch

fig, ax = plt.subplots(figsize=(14, 8))
ax.set_xlim(0, 14)
ax.set_ylim(0, 8)
ax.axis('off')

ax.text(7, 7.5, 'The Pre-train + Fine-tune Paradigm', fontsize=14, 
        ha='center', fontweight='bold')

# Pre-training phase
pretrain_box = FancyBboxPatch((0.5, 3), 5.5, 3.5, boxstyle="round,pad=0.1",
                                facecolor='#ebf5fb', edgecolor='#3498db', linewidth=2)
ax.add_patch(pretrain_box)
ax.text(3.25, 6, 'PRE-TRAINING', fontsize=12, ha='center', fontweight='bold', color='#3498db')

ax.text(3.25, 5.2, 'Data: Wikipedia + Books\n(3.3 billion words)', fontsize=9, ha='center')
ax.text(3.25, 4.2, 'Tasks:\n- Masked LM (15% of words)\n- Next Sentence Prediction', fontsize=9, ha='center')
ax.text(3.25, 3.3, 'Time: ~4 days on 64 TPU chips', fontsize=9, ha='center', color='#e74c3c')

# Arrow
ax.annotate('', xy=(7.5, 4.75), xytext=(6, 4.75),
            arrowprops=dict(arrowstyle='->', color='#333', lw=2))
ax.text(6.75, 5.2, 'Pre-trained\nweights', fontsize=9, ha='center')

# Fine-tuning phase
finetune_box = FancyBboxPatch((7.5, 3), 6, 3.5, boxstyle="round,pad=0.1",
                                facecolor='#fef9e7', edgecolor='#f39c12', linewidth=2)
ax.add_patch(finetune_box)
ax.text(10.5, 6, 'FINE-TUNING', fontsize=12, ha='center', fontweight='bold', color='#f39c12')

ax.text(10.5, 5.2, 'Data: Your task-specific dataset\n(could be just 1000 examples!)', fontsize=9, ha='center')
ax.text(10.5, 4.2, 'Task: Classification, NER, QA,\nwhatever you need', fontsize=9, ha='center')
ax.text(10.5, 3.3, 'Time: Minutes to hours on 1 GPU', fontsize=9, ha='center', color='#27ae60')

# Output tasks
tasks = ['Sentiment', 'NER', 'QA', 'Similarity']
for i, task in enumerate(tasks):
    ax.text(8 + i*1.5, 2.3, task, fontsize=8, ha='center',
            bbox=dict(boxstyle='round,pad=0.2', facecolor='#d5f5e3', edgecolor='#27ae60'))

# Key insight
ax.text(7, 1, 'Key: Pre-training captures general language knowledge.\nFine-tuning adapts it to your specific task.', 
        fontsize=10, ha='center', style='italic',
        bbox=dict(boxstyle='round', facecolor='#fadbd8', edgecolor='#e74c3c'))

plt.tight_layout()
plt.show()

---

## The Results: BERT Destroyed Everything

When BERT was released, it immediately set new state-of-the-art results on 11 NLP tasks.

Not by a little bit. By a lot.

In [None]:
# Results comparison
import matplotlib.pyplot as plt
import numpy as np

fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# GLUE benchmark
ax = axes[0]
tasks = ['MNLI', 'QQP', 'QNLI', 'SST-2', 'CoLA', 'MRPC', 'RTE']
# The paper's "Pre-OpenAI SOTA" row; OpenAI GPT had already raised several of these
previous_sota = [80.6, 66.1, 82.3, 93.2, 35.0, 86.0, 61.7]
bert_base = [84.6, 71.2, 90.5, 93.5, 52.1, 88.9, 66.4]
bert_large = [86.7, 72.1, 92.7, 94.9, 60.5, 89.3, 70.1]

x = np.arange(len(tasks))
width = 0.25

ax.bar(x - width, previous_sota, width, label='Pre-GPT SOTA', color='#bdc3c7')
ax.bar(x, bert_base, width, label='BERT-Base', color='#3498db')
ax.bar(x + width, bert_large, width, label='BERT-Large', color='#e74c3c')

ax.set_ylabel('Accuracy / Score')
ax.set_title('GLUE Benchmark Results', fontweight='bold')
ax.set_xticks(x)
ax.set_xticklabels(tasks, rotation=45, ha='right')
ax.legend()
ax.set_ylim(30, 100)
ax.grid(True, alpha=0.3, axis='y')

# SQuAD
ax = axes[1]
# First bar: best published single model before BERT (BiDAF+ELMo, dev F1)
models = ['Best published\nsingle model', 'BERT-Base', 'BERT-Large', 'Human']
squad_f1 = [85.6, 88.5, 90.9, 91.2]
colors = ['#bdc3c7', '#3498db', '#e74c3c', '#27ae60']

bars = ax.bar(models, squad_f1, color=colors)
ax.set_ylabel('F1 Score')
ax.set_title('SQuAD 1.1 (Question Answering)', fontweight='bold')
ax.set_ylim(80, 95)
ax.grid(True, alpha=0.3, axis='y')

for bar, score in zip(bars, squad_f1):
    ax.text(bar.get_x() + bar.get_width()/2, bar.get_height() + 0.3, 
            f'{score}', ha='center', fontsize=10, fontweight='bold')

ax.text(2.5, 81.5, 'BERT-Large nearly matches\nhuman performance!', 
        fontsize=9, ha='center', color='#e74c3c', style='italic')

plt.tight_layout()
plt.show()

### The Key Numbers

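| Benchmark | Metric | Previous SOTA | BERT-Large | Improvement |
|-----------|--------|---------------|------------|-------------|
| GLUE | Average | 75.1 (OpenAI GPT) | 82.1 | **+7.0 points** |
| SQuAD 1.1 | F1 (dev, single model) | 85.6 (BiDAF+ELMo) | 90.9 | **+5.3 points** |
| SQuAD 2.0 | F1 (test) | 78.0 | 83.1 | **+5.1 points** |
| SWAG | Accuracy (test) | 78.0 (OpenAI GPT) | 86.3 | **+8.3 points** |

On SQuAD 2.0, BERT beat the top leaderboard system by more than **5 points**. On benchmarks where progress usually came a fraction of a point at a time, that's not incremental progress - that's a paradigm shift.

On SWAG (commonsense inference), the improvement over OpenAI GPT was more than **8 points**, and about 27 points over the ESIM+ELMo baseline published with the dataset.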

---

## Why BERT Worked So Well

Let me summarize the key innovations:

### 1. True Bidirectionality

Unlike GPT (left-to-right) or ELMo (separate directions), BERT sees the full context at every layer through self-attention.

### 2. Masked Language Modeling

Clever pre-training task that forces the model to understand context deeply.

### 3. Transformer Architecture

The encoder-only Transformer (which we studied in the previous series) is well suited to understanding tasks, because every layer can attend to the whole input.

### 4. Scale

BERT was trained on a lot of data:
- BooksCorpus: 800 million words
- English Wikipedia: 2.5 billion words
- Total: 3.3 billion words

And the model was large:
- BERT-Base: 110 million parameters
- BERT-Large: 340 million parameters
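Those parameter counts can be roughly reconstructed from the architecture. The sketch below assumes hyperparameters that this part has not introduced yet: 12 layers with hidden size 768 for BERT-Base, 24 layers with hidden size 1024 for BERT-Large, a feed-forward width of 4x the hidden size, a 30,522-token vocabulary (the released uncased English checkpoint), 512 positions, and 2 segment types. Treat it as back-of-the-envelope arithmetic, not an official breakdown.

```python
def bert_parameter_count(layers, hidden, vocab_size=30522, max_positions=512, segment_types=2):
    """Count weights in a BERT-style encoder (feed-forward width assumed to be 4 * hidden)."""
    layer_norm = 2 * hidden                                   # scale + shift
    embeddings = (vocab_size + max_positions + segment_types) * hidden + layer_norm
    attention = 4 * (hidden * hidden + hidden) + layer_norm   # Q, K, V, output projections
    ffn_width = 4 * hidden
    feed_forward = ((hidden * ffn_width + ffn_width)
                    + (ffn_width * hidden + hidden) + layer_norm)
    pooler = hidden * hidden + hidden                         # dense layer on the first token
    return embeddings + layers * (attention + feed_forward) + pooler

print(f"BERT-Base:  {bert_parameter_count(12, 768) / 1e6:.0f}M")
print(f"BERT-Large: {bert_parameter_count(24, 1024) / 1e6:.0f}M")
```

Under these assumptions the totals come out at roughly 109M and 335M, in line with the rounded 110M and 340M figures above. The number of attention heads does not appear because splitting the projections into heads does not change the weight count.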

### 5. The Pre-train + Fine-tune Paradigm

Training once on lots of data, then adapting to specific tasks with minimal effort.

This combination was incredibly powerful.

---

## BERT's Impact

BERT changed how we do NLP:

**Before BERT:** Design task-specific architectures, train from scratch.

**After BERT:** Download pre-trained model, add output layer, fine-tune.

This made state-of-the-art NLP accessible. You didn't need massive compute or clever architectures anymore. Just fine-tune BERT.

### The Family Tree

BERT spawned an entire family of models:

```
BERT (Oct 2018)
    ├── RoBERTa (July 2019) - Better training recipe
    ├── ALBERT (Sept 2019) - Parameter efficient
    ├── DistilBERT (Oct 2019) - Smaller, faster
    ├── ELECTRA (March 2020) - More efficient pre-training
    └── Many more...
```

Even GPT-2 and GPT-3 built on the same lesson BERT helped establish: pre-training at scale works.

---

## Summary: The Timeline

| Year | Development | Context handling |
|------|------------|------------|
| 2013 | Word2Vec | Static embeddings |
| 2014 | GloVe | Static embeddings |
| 2018 Feb | ELMo | Shallow bidirectional |
| 2018 June | GPT | Unidirectional only |
| **2018 Oct** | **BERT** | **Deep bidirectional** |

BERT combined the best ideas:
- Transformer architecture (from "Attention is All You Need")
- Contextualized embeddings (from ELMo)
- Pre-training at scale (from GPT)
- True bidirectionality (new!)

---

## What's Next: Part 2

Now you understand WHY BERT was created and what problems it solved.

In Part 2, we'll look at HOW it works:

- The architecture (encoder-only Transformer)
- Input representation ([CLS], [SEP], segments)
- WordPiece tokenization
- BERT-Base vs BERT-Large

---

*Paper:* [BERT: Pre-training of Deep Bidirectional Transformers](https://arxiv.org/abs/1810.04805)

*Original Transformer paper:* [Attention Is All You Need](https://arxiv.org/abs/1706.03762)