# BERT: Part 2

## The Architecture - How BERT Actually Works

---

**Paper:** [BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding](https://arxiv.org/abs/1810.04805)

---

In Part 1, we learned why BERT was created. Now let's look at how it's built.

If you followed the Transformer series, you already know most of this. BERT is just the **encoder** part of the Transformer. But there are some important details about input representation that we need to cover.

In [None]:
import numpy as np
import matplotlib.pyplot as plt
from matplotlib.patches import FancyBboxPatch, Rectangle
np.random.seed(42)

---

## BERT = Encoder-Only Transformer

Remember the original Transformer had two parts:
- **Encoder**: Processes input, can see all positions
- **Decoder**: Generates output, can only see past positions

BERT uses **only the encoder**. No decoder.

Why? Because BERT is for **understanding**, not generation. It needs to see the full input to understand it.
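
One way to make "can see all positions" concrete is the attention mask. The snippet below is a minimal sketch, assuming a 4-token input: it builds the all-ones mask an encoder uses and the lower-triangular (causal) mask a decoder uses. The variable names are ours, for illustration only.

```python
import numpy as np

seq_len = 4  # e.g. [CLS] The cat [SEP]

# Encoder (BERT): every position may attend to every position.
encoder_mask = np.ones((seq_len, seq_len), dtype=int)

# Decoder (GPT-style): position i may attend only to positions <= i.
decoder_mask = np.tril(np.ones((seq_len, seq_len), dtype=int))

print("Encoder mask (bidirectional):")
print(encoder_mask)
print("Decoder mask (causal):")
print(decoder_mask)
```

Row `i` of each mask lists which positions token `i` is allowed to look at. In the encoder mask every row is full, which is what "bidirectional" means here.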

In [None]:
# BERT vs Original Transformer
fig, axes = plt.subplots(1, 2, figsize=(14, 8))

# Original Transformer
ax = axes[0]
ax.set_xlim(0, 10)
ax.set_ylim(0, 10)
ax.axis('off')
ax.set_title('Original Transformer\n(Translation)', fontsize=12, fontweight='bold')

# Encoder
enc = FancyBboxPatch((1, 3), 3, 5, boxstyle="round,pad=0.1",
                      facecolor='#3498db', edgecolor='#2980b9', linewidth=2, alpha=0.8)
ax.add_patch(enc)
ax.text(2.5, 5.5, 'ENCODER', fontsize=11, ha='center', va='center', color='white', fontweight='bold')

# Decoder
dec = FancyBboxPatch((6, 3), 3, 5, boxstyle="round,pad=0.1",
                      facecolor='#e74c3c', edgecolor='#c0392b', linewidth=2, alpha=0.8)
ax.add_patch(dec)
ax.text(7.5, 5.5, 'DECODER', fontsize=11, ha='center', va='center', color='white', fontweight='bold')

# Arrow
ax.annotate('', xy=(6, 5.5), xytext=(4, 5.5),
            arrowprops=dict(arrowstyle='->', color='#333', lw=2))

ax.text(2.5, 2.3, 'Input:\n"The cat sat"', fontsize=9, ha='center')
ax.text(7.5, 2.3, 'Output:\n"Le chat s\'est assis"', fontsize=9, ha='center')

# BERT
ax = axes[1]
ax.set_xlim(0, 10)
ax.set_ylim(0, 10)
ax.axis('off')
ax.set_title('BERT\n(Understanding)', fontsize=12, fontweight='bold')

# Just encoder
enc = FancyBboxPatch((2.5, 3), 5, 5, boxstyle="round,pad=0.1",
                      facecolor='#3498db', edgecolor='#2980b9', linewidth=2, alpha=0.8)
ax.add_patch(enc)
ax.text(5, 5.5, 'ENCODER\nONLY', fontsize=12, ha='center', va='center', color='white', fontweight='bold')

ax.text(5, 2.3, 'Input:\n"The [MASK] sat"', fontsize=9, ha='center')
ax.text(5, 8.7, 'Output:\nContextualized embeddings\n(one per token)', fontsize=9, ha='center')

# Callout: BERT has no decoder
ax.text(8, 5.5, 'No decoder!', fontsize=10, color='#e74c3c', fontweight='bold', rotation=-20)

plt.tight_layout()
plt.show()

---

## The Two Model Sizes

BERT comes in two sizes:

| | BERT-Base | BERT-Large |
|---|---|---|
| Layers (L) | 12 | 24 |
| Hidden size (H) | 768 | 1024 |
| Attention heads (A) | 12 | 16 |
| Total parameters | 110M | 340M |

BERT-Base was designed to match GPT's size (for fair comparison).

BERT-Large was made bigger to see if scale helps (it does).

The architecture is the same - just different dimensions.
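
As a sanity check on the table, here is a rough parameter count built from L and H alone. It is a sketch under stated assumptions, not the official accounting: the vocabulary size of 30,522 (the released uncased English vocabulary), the LayerNorm after the embeddings, and the pooler layer on top of [CLS] come from the released implementation rather than from this article, and `bert_param_count` is a name made up for this example.

```python
def bert_param_count(num_layers, hidden, vocab_size=30522,
                     max_positions=512, num_segments=2):
    """Rough parameter count for a BERT encoder (weights + biases)."""
    ffn = 4 * hidden          # feed-forward inner size
    layer_norm = 2 * hidden   # scale + shift

    # Token, position and segment tables, plus one LayerNorm (assumed).
    embeddings = (vocab_size + max_positions + num_segments) * hidden + layer_norm

    attention = 4 * (hidden * hidden + hidden)  # Q, K, V and output projections
    feed_forward = (hidden * ffn + ffn) + (ffn * hidden + hidden)
    per_layer = attention + feed_forward + 2 * layer_norm

    pooler = hidden * hidden + hidden  # dense layer on top of [CLS] (assumed)
    return embeddings + num_layers * per_layer + pooler

base = bert_param_count(num_layers=12, hidden=768)
large = bert_param_count(num_layers=24, hidden=1024)
print(f"BERT-Base:  {base / 1e6:.1f}M parameters, head size {768 // 12}")
print(f"BERT-Large: {large / 1e6:.1f}M parameters, head size {1024 // 16}")
```

Both counts should land close to the 110M and 340M figures in the table. Note that the per-head size H / A is 64 in both models.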

In [None]:
# Model size comparison
fig, ax = plt.subplots(figsize=(10, 6))

models = ['GPT-1', 'BERT-Base', 'BERT-Large']
params = [117, 110, 340]
colors = ['#95a5a6', '#3498db', '#e74c3c']

bars = ax.bar(models, params, color=colors, edgecolor='white', linewidth=2)

ax.set_ylabel('Parameters (Millions)', fontsize=11)
ax.set_title('Model Size Comparison', fontsize=12, fontweight='bold')
ax.set_ylim(0, 400)
ax.grid(True, alpha=0.3, axis='y')

for bar, param in zip(bars, params):
    ax.text(bar.get_x() + bar.get_width()/2, bar.get_height() + 10, 
            f'{param}M', ha='center', fontsize=12, fontweight='bold')

ax.text(1, 50, 'Roughly the same size\nas GPT-1 (fair comparison)', fontsize=9, ha='center', 
        bbox=dict(boxstyle='round', facecolor='#fef9e7'))

plt.tight_layout()
plt.show()

---

## Input Representation: The Key Details

This is where BERT adds to a standard Transformer encoder: it defines a specific input format, with special tokens and an extra embedding that marks which sentence a token belongs to.

### The Input Format

Every BERT input looks like this:

```
[CLS] tokens of sentence A [SEP] tokens of sentence B [SEP]
```

Or for single sentence:

```
[CLS] tokens of sentence [SEP]
```

Let's break this down.

### Special Token: [CLS]

**[CLS]** stands for "classification".

It's always the first token. The final hidden state of [CLS] is used as the "sentence representation" for classification tasks.

Why? Because self-attention lets [CLS] attend to every other token in every layer. By the final layer (12 or 24), its hidden state summarizes the entire input.

```
Input:  [CLS] The movie was great [SEP]
                  ↓ self-attention ↓
Output: [CLS'] contains info about entire sentence
                  ↓
        Classify: POSITIVE
```
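
In code, "use [CLS] for classification" just means taking the hidden state at position 0 and passing it through a small classifier. The sketch below uses random numbers in place of a real encoder output and hypothetical classifier weights `W` and `b` (in practice these are learned during fine-tuning).

```python
import numpy as np

rng = np.random.default_rng(0)
seq_len, hidden, num_classes = 6, 768, 2  # [CLS] The movie was great [SEP]

# Stand-in for the encoder's final hidden states (random, not from a real model).
final_hidden = rng.normal(size=(seq_len, hidden))

# Hypothetical classifier parameters; learned during fine-tuning in practice.
W = rng.normal(scale=0.02, size=(hidden, num_classes))
b = np.zeros(num_classes)

cls_vector = final_hidden[0]            # position 0 is always [CLS]
logits = cls_vector @ W + b
probs = np.exp(logits - logits.max())   # numerically stable softmax
probs /= probs.sum()

print("class probabilities:", probs)
```

The other five hidden states are simply ignored for a sentence-level task. Token-level tasks use them instead.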

### Special Token: [SEP]

**[SEP]** stands for "separator".

It marks the end of each segment. For tasks with two segments (like question answering or entailment), it separates them. In the paper, a "sentence" can be any contiguous span of text, not just a linguistic sentence:

```
[CLS] Do cats meow ? [SEP] Yes , cats make meowing sounds . [SEP]
      ← Question →        ← Answer →
```
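
The format is easy to assemble by hand. Here is a minimal sketch, assuming the text is already split into tokens. `build_bert_input` is a helper invented for this example. It also returns a 0/1 id per token marking sentence A or B.

```python
def build_bert_input(tokens_a, tokens_b=None):
    """Return (tokens, segment_ids) laid out in BERT's input format."""
    tokens = ["[CLS]"] + list(tokens_a) + ["[SEP]"]
    segment_ids = [0] * len(tokens)              # [CLS], sentence A, first [SEP]
    if tokens_b is not None:
        tokens += list(tokens_b) + ["[SEP]"]
        segment_ids += [1] * (len(tokens_b) + 1)  # sentence B and final [SEP]
    return tokens, segment_ids

tokens, segment_ids = build_bert_input(
    ["Do", "cats", "meow", "?"],
    ["Yes", ",", "cats", "make", "meowing", "sounds", "."],
)
for tok, seg in zip(tokens, segment_ids):
    print(f"{tok:10s} segment {'A' if seg == 0 else 'B'}")
```

Notice that the first [SEP] is counted as part of sentence A and the last one as part of sentence B.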

In [None]:
# Input format visualization
fig, ax = plt.subplots(figsize=(14, 6))
ax.set_xlim(0, 14)
ax.set_ylim(0, 6)
ax.axis('off')

ax.text(7, 5.5, 'BERT Input Format (Two Sentences)', fontsize=13, ha='center', fontweight='bold')

# Tokens
tokens = ['[CLS]', 'The', 'cat', 'sat', '[SEP]', 'It', 'was', 'tired', '[SEP]']
colors = ['#9b59b6', '#3498db', '#3498db', '#3498db', '#9b59b6', 
          '#e74c3c', '#e74c3c', '#e74c3c', '#9b59b6']

for i, (token, color) in enumerate(zip(tokens, colors)):
    x = 1 + i * 1.4
    box = FancyBboxPatch((x-0.5, 3.5), 1.1, 0.8, boxstyle="round,pad=0.05",
                          facecolor=color, edgecolor='none', alpha=0.8)
    ax.add_patch(box)
    ax.text(x, 3.9, token, fontsize=10, ha='center', va='center', 
            color='white', fontweight='bold')

# Segment labels (the first [SEP] belongs to segment A, the last one to segment B)
ax.plot([0.5, 7.2], [2.8, 2.8], color='#3498db', lw=3)
ax.text(3.85, 2.5, 'Segment A', fontsize=10, ha='center', color='#3498db', fontweight='bold')

ax.plot([7.5, 12.8], [2.8, 2.8], color='#e74c3c', lw=3)
ax.text(10.15, 2.5, 'Segment B', fontsize=10, ha='center', color='#e74c3c', fontweight='bold')

# Legend
ax.text(7, 1.5, '[CLS] = Classification token (used for sentence-level tasks)', fontsize=9, ha='center')
ax.text(7, 1, '[SEP] = Separator (marks sentence boundaries)', fontsize=9, ha='center')

plt.tight_layout()
plt.show()

---

## The Three Embeddings

For each token, BERT combines three embeddings:

1. **Token Embedding**: What word/subword is this?
2. **Segment Embedding**: Is this sentence A or sentence B?
3. **Position Embedding**: What position in the sequence?

These three are added together to create the input representation.

```
Input = Token_Embedding + Segment_Embedding + Position_Embedding
```
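
The sum is three table lookups followed by an element-wise addition. The sketch below uses a toy vocabulary of 1,000 entries, randomly initialised tables in place of learned ones, and made-up token ids, so only the shapes and the mechanics carry over to the real model.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, max_len, hidden = 1000, 512, 768  # toy vocabulary size

# Randomly initialised tables standing in for learned parameters.
token_table = rng.normal(scale=0.02, size=(vocab_size, hidden))
segment_table = rng.normal(scale=0.02, size=(2, hidden))
position_table = rng.normal(scale=0.02, size=(max_len, hidden))

# Made-up ids for a 7-token input: [CLS] a a [SEP] b b [SEP]
token_ids = np.array([1, 57, 912, 2, 340, 778, 2])
segment_ids = np.array([0, 0, 0, 0, 1, 1, 1])
position_ids = np.arange(len(token_ids))

inputs = (token_table[token_ids]
          + segment_table[segment_ids]
          + position_table[position_ids])
print(inputs.shape)  # (7, 768): one vector per token
```

The two [SEP] tokens share the same token embedding but end up with different input vectors, because their segment and position embeddings differ. The released implementation also applies layer normalization and dropout to this sum, which is omitted here.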

In [None]:
# Three embeddings visualization (BERT paper Figure 2 style)
fig, ax = plt.subplots(figsize=(14, 10))
ax.set_xlim(0, 14)
ax.set_ylim(0, 10)
ax.axis('off')

ax.text(7, 9.5, 'BERT Input Representation', fontsize=14, ha='center', fontweight='bold')

# Tokens
tokens = ['[CLS]', 'my', 'dog', 'is', 'cute', '[SEP]', 'he', 'likes', 'play', '##ing', '[SEP]']
n = len(tokens)
x_positions = np.linspace(1, 13, n)

# Input tokens row
ax.text(0.3, 7.5, 'Input', fontsize=10, ha='right', fontweight='bold')
for i, (x, token) in enumerate(zip(x_positions, tokens)):
    ax.text(x, 7.5, token, fontsize=9, ha='center', 
            bbox=dict(boxstyle='round,pad=0.2', facecolor='#ecf0f1', edgecolor='#bdc3c7'))

# Token embeddings row
ax.text(0.3, 6, 'Token\nEmbed', fontsize=9, ha='right', fontweight='bold')
for i, x in enumerate(x_positions):
    color = '#9b59b6' if tokens[i] in ['[CLS]', '[SEP]'] else '#3498db'
    box = FancyBboxPatch((x-0.45, 5.6), 0.9, 0.8, boxstyle="round,pad=0.02",
                          facecolor=color, edgecolor='none', alpha=0.7)
    ax.add_patch(box)
    ax.text(x, 6, f'E_{tokens[i]}', fontsize=7, ha='center', va='center', color='white')

# Segment embeddings row
ax.text(0.3, 4.5, 'Segment\nEmbed', fontsize=9, ha='right', fontweight='bold')
segments = ['A', 'A', 'A', 'A', 'A', 'A', 'B', 'B', 'B', 'B', 'B']
for i, (x, seg) in enumerate(zip(x_positions, segments)):
    color = '#27ae60' if seg == 'A' else '#e74c3c'
    box = FancyBboxPatch((x-0.45, 4.1), 0.9, 0.8, boxstyle="round,pad=0.02",
                          facecolor=color, edgecolor='none', alpha=0.7)
    ax.add_patch(box)
    ax.text(x, 4.5, f'E_{seg}', fontsize=8, ha='center', va='center', color='white')

# Position embeddings row
ax.text(0.3, 3, 'Position\nEmbed', fontsize=9, ha='right', fontweight='bold')
for i, x in enumerate(x_positions):
    box = FancyBboxPatch((x-0.45, 2.6), 0.9, 0.8, boxstyle="round,pad=0.02",
                          facecolor='#f39c12', edgecolor='none', alpha=0.7)
    ax.add_patch(box)
    ax.text(x, 3, f'E_{i}', fontsize=8, ha='center', va='center', color='white')

# Plus signs
for y in [5.2, 3.7]:
    for x in x_positions:
        ax.text(x, y, '+', fontsize=12, ha='center', va='center', fontweight='bold')

# Equals
ax.text(0.3, 1.5, 'Input\nRepresent.', fontsize=9, ha='right', fontweight='bold')
for i, x in enumerate(x_positions):
    box = FancyBboxPatch((x-0.45, 1.1), 0.9, 0.8, boxstyle="round,pad=0.02",
                          facecolor='#2c3e50', edgecolor='none', alpha=0.9)
    ax.add_patch(box)

ax.text(7, 2.2, '=', fontsize=20, ha='center', va='center', fontweight='bold')

# Legend
ax.text(7, 0.3, 'Final input = Token Embedding + Segment Embedding + Position Embedding', 
        fontsize=10, ha='center', style='italic')

plt.tight_layout()
plt.show()

### About Position Embeddings

Unlike the original Transformer (which used sinusoidal functions), BERT uses **learned** position embeddings.

- Maximum sequence length: 512 tokens
- Position embedding matrix: 512 × 768 (for BERT-Base)

The original Transformer paper reported nearly identical results for learned and sinusoidal position encodings. BERT simply uses learned ones, which also means it has no embedding for positions beyond 512.
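
To make the two options concrete, here is a minimal sketch that builds both kinds of position table at BERT-Base size. The sinusoidal formula is the one from the original Transformer paper. The "learned" table is just a random matrix here, standing in for parameters that training would update.

```python
import numpy as np

max_len, hidden = 512, 768

def sinusoidal_positions(max_len, hidden):
    """Fixed encodings: sine on even dimensions, cosine on odd dimensions."""
    positions = np.arange(max_len)[:, None]           # (max_len, 1)
    dims = np.arange(0, hidden, 2)[None, :]           # 0, 2, 4, ...
    angles = positions / np.power(10000.0, dims / hidden)
    table = np.zeros((max_len, hidden))
    table[:, 0::2] = np.sin(angles)
    table[:, 1::2] = np.cos(angles)
    return table

fixed_table = sinusoidal_positions(max_len, hidden)

# Learned embeddings: a trainable matrix of the same shape,
# randomly initialised here and updated during pre-training.
rng = np.random.default_rng(0)
learned_table = rng.normal(scale=0.02, size=(max_len, hidden))

print(fixed_table.shape, learned_table.shape)
```

Both tables are used the same way (row `i` is added to the token at position `i`). The difference is that the sinusoidal one can be computed for any position, while the learned one only has the 512 rows it was trained with.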

---

## WordPiece Tokenization

BERT doesn't use word-level tokenization. It uses **WordPiece** - a subword tokenization method.

### Why Subwords?

Word-level tokenization has problems:
- Vocabulary gets huge (every word needs an entry)
- Rare words are poorly represented
- Out-of-vocabulary words can't be handled

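Subword tokenization splits rare words into smaller pieces. The splits below are illustrative; the exact pieces depend on the vocabulary learned from the training corpus:

```
"playing"     → ["play", "##ing"]
"unhappiness" → ["un", "##hap", "##pi", "##ness"]
"TensorFlow"  → ["Tensor", "##Fl", "##ow"]
```

The `##` prefix means "this piece continues the current word" rather than starting a new one.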

Common words stay whole. Rare words get split into recognizable pieces.
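
At tokenization time, WordPiece splits each word greedily: take the longest piece from the vocabulary that matches the start of the word, then repeat on the remainder. Here is a minimal sketch of that loop. The vocabulary is a toy one invented for this example (not BERT's real vocabulary), and `wordpiece_tokenize` is our own helper name.

```python
def wordpiece_tokenize(word, vocab, unk_token="[UNK]"):
    """Greedy longest-match-first split of a single word into pieces."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first, then shrink it.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # word-internal piece
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            return [unk_token]  # no piece fits: the whole word becomes [UNK]
        pieces.append(match)
        start = end
    return pieces

# Toy vocabulary, invented for this example.
toy_vocab = {"play", "##ing", "un", "##hap", "##pi", "##ness", "the", "cat"}

for word in ["cat", "playing", "unhappiness", "xyz"]:
    print(word, "->", wordpiece_tokenize(word, toy_vocab))
```

With a real vocabulary that contains every single character, the `[UNK]` branch is rarely hit, because any word can fall back to character-level pieces.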

In [None]:
# WordPiece tokenization examples (hand-written, illustrative splits)
print("WordPiece Tokenization Examples")
print("=" * 50)

examples = [
    ("The cat sat", ["The", "cat", "sat"]),
    ("playing", ["play", "##ing"]),
    ("unhappiness", ["un", "##hap", "##pi", "##ness"]),
    ("embeddings", ["em", "##bed", "##ding", "##s"]),
    ("TensorFlow", ["Tensor", "##Fl", "##ow"]),
    ("transformer", ["transform", "##er"]),
]

for text, tokens in examples:
    print(f"\n'{text}'")
    print(f"  → {tokens}")
    print(f"  ({len(tokens)} tokens)")

In [None]:
# Visual comparison
fig, axes = plt.subplots(2, 1, figsize=(12, 6))

# Word-level
ax = axes[0]
ax.set_xlim(0, 12)
ax.set_ylim(0, 2)
ax.axis('off')
ax.set_title('Word-Level Tokenization', fontsize=11, fontweight='bold', loc='left')

words = ['I', 'love', 'playing', 'basketball', 'unhappily']
for i, word in enumerate(words):
    ax.text(1 + i*2.2, 1, word, fontsize=11, ha='center',
            bbox=dict(boxstyle='round,pad=0.3', facecolor='#e74c3c', edgecolor='none', alpha=0.7),
            color='white')

ax.text(11.5, 1, '5 tokens', fontsize=10, ha='center', color='#666')
ax.text(11.5, 0.4, 'Problem: "unhappily"\nmight be OOV', fontsize=8, ha='center', color='#e74c3c')

# WordPiece
ax = axes[1]
ax.set_xlim(0, 12)
ax.set_ylim(0, 2)
ax.axis('off')
ax.set_title('WordPiece Tokenization', fontsize=11, fontweight='bold', loc='left')

pieces = ['I', 'love', 'play', '##ing', 'basket', '##ball', 'un', '##hap', '##pi', '##ly']
positions = [0.5, 1.5, 2.5, 3.3, 4.3, 5.2, 6.2, 7, 7.8, 8.6]
for pos, piece in zip(positions, pieces):
    color = '#27ae60' if not piece.startswith('##') else '#3498db'
    ax.text(pos + 0.5, 1, piece, fontsize=9, ha='center',
            bbox=dict(boxstyle='round,pad=0.2', facecolor=color, edgecolor='none', alpha=0.7),
            color='white')

ax.text(11.5, 1, '10 tokens', fontsize=10, ha='center', color='#666')
ax.text(11.5, 0.4, 'All pieces are\nin vocabulary!', fontsize=8, ha='center', color='#27ae60')

plt.tight_layout()
plt.show()

### BERT's Vocabulary

- Size: ~30,000 tokens
- Includes whole words, subwords, and characters
- Special tokens: [CLS], [SEP], [MASK], [PAD], [UNK]

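With ~30K WordPiece tokens, BERT can represent virtually any text, even words it has never seen before: an unseen word is split into pieces that are in the vocabulary, down to single characters if needed. [UNK] is the fallback when no split works.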

---

## The Architecture Diagram

Let's draw the complete BERT architecture:

In [None]:
# Complete BERT architecture
fig, ax = plt.subplots(figsize=(12, 14))
ax.set_xlim(0, 12)
ax.set_ylim(0, 14)
ax.axis('off')

ax.text(6, 13.5, 'BERT Architecture', fontsize=16, ha='center', fontweight='bold')

# Input tokens
tokens = ['[CLS]', 'The', 'cat', 'sat', '[SEP]']
x_positions = [2, 4, 6, 8, 10]

for x, token in zip(x_positions, tokens):
    ax.text(x, 1, token, fontsize=10, ha='center',
            bbox=dict(boxstyle='round,pad=0.3', facecolor='#ecf0f1', edgecolor='#bdc3c7'))

ax.text(6, 0.3, 'Input Tokens', fontsize=10, ha='center', color='#666')

# Embedding layer
embed_box = FancyBboxPatch((1, 1.8), 10, 1.2, boxstyle="round,pad=0.05",
                            facecolor='#f39c12', edgecolor='#e67e22', linewidth=2, alpha=0.8)
ax.add_patch(embed_box)
ax.text(6, 2.4, 'Token + Segment + Position Embeddings', fontsize=10, 
        ha='center', va='center', color='white', fontweight='bold')

# Arrows from input to embedding
for x in x_positions:
    ax.annotate('', xy=(x, 1.8), xytext=(x, 1.4),
                arrowprops=dict(arrowstyle='->', color='#333', lw=1))

# Transformer layers
layer_colors = plt.cm.Blues(np.linspace(0.3, 0.8, 6))
for i in range(6):  # Show 6 layers (representing 12)
    y = 3.5 + i * 1.4
    layer_box = FancyBboxPatch((1, y), 10, 1.2, boxstyle="round,pad=0.05",
                                facecolor=layer_colors[i], edgecolor='#2980b9', linewidth=1)
    ax.add_patch(layer_box)
    
    if i == 2:
        ax.text(6, y + 0.6, 'Transformer Encoder Layer\n(Self-Attention + FFN)', 
                fontsize=9, ha='center', va='center', color='white', fontweight='bold')
    if i == 5:
        ax.text(6, y + 0.6, '× 12 (Base) or × 24 (Large)', 
                fontsize=9, ha='center', va='center', color='white', fontweight='bold')

# Output representations
output_y = 12
for i, x in enumerate(x_positions):
    box = FancyBboxPatch((x-0.5, output_y), 1, 0.8, boxstyle="round,pad=0.05",
                          facecolor='#27ae60', edgecolor='none', alpha=0.8)
    ax.add_patch(box)
    label = 'C' if i == 0 else f'T{i}'
    ax.text(x, output_y + 0.4, label, fontsize=10, ha='center', va='center', 
            color='white', fontweight='bold')

ax.text(6, 13, 'Output: Contextualized Representations', fontsize=10, ha='center', color='#666')

# Labels
ax.text(0.5, output_y + 0.4, 'C = [CLS]\nrepresentation', fontsize=8, ha='center', color='#27ae60')

# Arrows
for x in x_positions:
    ax.annotate('', xy=(x, 12), xytext=(x, 11.7),
                arrowprops=dict(arrowstyle='->', color='#333', lw=1))

plt.tight_layout()
plt.show()

---

## What Each Layer Does

Each Transformer encoder layer (from Part 3 of our Transformer series) contains:

1. **Multi-Head Self-Attention**
   - Each token attends to every token in the input, itself included
   - 12 attention heads (BERT-Base) or 16 heads (BERT-Large)
   - Different heads can capture different types of relationships

2. **Feed-Forward Network**
   - Two linear layers with GELU activation
   - Inner size: 4× the hidden size (3072 for BERT-Base, 4096 for BERT-Large)
   - Applied to each position independently

3. **Residual Connections + Layer Normalization**
   - After each sub-layer
   - Helps with training deep networks
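
The three pieces above fit in a few lines of NumPy. This is a simplified sketch of one encoder layer at BERT-Base dimensions, not the released implementation: weights are random, and biases, dropout, and the learned LayerNorm scale/shift are left out to keep it short. All function and parameter names are ours.

```python
import numpy as np

def layer_norm(x, eps=1e-12):
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)  # learned scale/shift omitted

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def gelu(x):
    return 0.5 * x * (1 + np.tanh(np.sqrt(2 / np.pi) * (x + 0.044715 * x**3)))

def encoder_layer(x, p, num_heads):
    seq_len, hidden = x.shape
    head_dim = hidden // num_heads

    def split_heads(t):  # (seq, hidden) -> (heads, seq, head_dim)
        return t.reshape(seq_len, num_heads, head_dim).transpose(1, 0, 2)

    # Multi-head self-attention: every token attends to every token.
    q, k, v = (split_heads(x @ p[name]) for name in ("Wq", "Wk", "Wv"))
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(head_dim)  # (heads, seq, seq)
    context = softmax(scores) @ v                           # (heads, seq, head_dim)
    context = context.transpose(1, 0, 2).reshape(seq_len, hidden)
    x = layer_norm(x + context @ p["Wo"])                   # residual + norm

    # Position-wise feed-forward network: hidden -> 4*hidden -> hidden.
    ffn_out = gelu(x @ p["W1"]) @ p["W2"]
    return layer_norm(x + ffn_out)                          # residual + norm

rng = np.random.default_rng(0)
hidden, num_heads, seq_len = 768, 12, 5
shapes = {"Wq": (hidden, hidden), "Wk": (hidden, hidden), "Wv": (hidden, hidden),
          "Wo": (hidden, hidden), "W1": (hidden, 4 * hidden), "W2": (4 * hidden, hidden)}
params = {name: rng.normal(scale=0.02, size=shape) for name, shape in shapes.items()}

x = rng.normal(size=(seq_len, hidden))
out = encoder_layer(x, params, num_heads)
print(out.shape)  # same shape as the input, so layers can be stacked
```

Because the output has the same shape as the input, stacking 12 or 24 of these layers is just a loop that feeds each layer's output into the next.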

In [None]:
# Single layer detail
fig, ax = plt.subplots(figsize=(10, 10))
ax.set_xlim(0, 10)
ax.set_ylim(0, 10)
ax.axis('off')

ax.text(5, 9.5, 'Inside One Transformer Encoder Layer', fontsize=13, ha='center', fontweight='bold')

# Input
ax.text(5, 8.8, 'Input from previous layer', fontsize=9, ha='center', color='#666')
ax.annotate('', xy=(5, 8.2), xytext=(5, 8.5),
            arrowprops=dict(arrowstyle='->', color='#333', lw=1.5))

# Multi-head attention
attn_box = FancyBboxPatch((2, 6.5), 6, 1.5, boxstyle="round,pad=0.05",
                           facecolor='#e74c3c', edgecolor='#c0392b', linewidth=2, alpha=0.8)
ax.add_patch(attn_box)
ax.text(5, 7.25, 'Multi-Head Self-Attention', fontsize=11, ha='center', va='center',
        color='white', fontweight='bold')

# Add & Norm 1
an1_box = FancyBboxPatch((2, 5), 6, 0.8, boxstyle="round,pad=0.05",
                          facecolor='#f39c12', edgecolor='none', alpha=0.8)
ax.add_patch(an1_box)
ax.text(5, 5.4, 'Add & Layer Norm', fontsize=10, ha='center', va='center',
        color='white', fontweight='bold')

# Feed-forward
ff_box = FancyBboxPatch((2, 3), 6, 1.5, boxstyle="round,pad=0.05",
                         facecolor='#9b59b6', edgecolor='#8e44ad', linewidth=2, alpha=0.8)
ax.add_patch(ff_box)
ax.text(5, 3.75, 'Feed-Forward Network\n(768 → 3072 → 768)', fontsize=10, ha='center', va='center',
        color='white', fontweight='bold')

# Add & Norm 2
an2_box = FancyBboxPatch((2, 1.5), 6, 0.8, boxstyle="round,pad=0.05",
                          facecolor='#f39c12', edgecolor='none', alpha=0.8)
ax.add_patch(an2_box)
ax.text(5, 1.9, 'Add & Layer Norm', fontsize=10, ha='center', va='center',
        color='white', fontweight='bold')

# Residual connections
ax.annotate('', xy=(1.5, 5.4), xytext=(1.5, 8),
            arrowprops=dict(arrowstyle='-', color='#3498db', lw=2))
ax.annotate('', xy=(2, 5.4), xytext=(1.5, 5.4),
            arrowprops=dict(arrowstyle='->', color='#3498db', lw=2))

ax.annotate('', xy=(1.5, 1.9), xytext=(1.5, 4.5),
            arrowprops=dict(arrowstyle='-', color='#3498db', lw=2))
ax.annotate('', xy=(2, 1.9), xytext=(1.5, 1.9),
            arrowprops=dict(arrowstyle='->', color='#3498db', lw=2))

# Output
ax.annotate('', xy=(5, 0.8), xytext=(5, 1.5),
            arrowprops=dict(arrowstyle='->', color='#333', lw=1.5))
ax.text(5, 0.5, 'Output to next layer', fontsize=9, ha='center', color='#666')

# Vertical arrows (data flows top to bottom; the arrowhead is drawn at xy)
ax.annotate('', xy=(5, 8), xytext=(5, 8.2),
            arrowprops=dict(arrowstyle='->', color='#333', lw=1.5))
ax.annotate('', xy=(5, 5.8), xytext=(5, 6.5),
            arrowprops=dict(arrowstyle='->', color='#333', lw=1.5))
ax.annotate('', xy=(5, 4.5), xytext=(5, 5),
            arrowprops=dict(arrowstyle='->', color='#333', lw=1.5))
ax.annotate('', xy=(5, 2.3), xytext=(5, 3),
            arrowprops=dict(arrowstyle='->', color='#333', lw=1.5))

ax.text(0.8, 6.5, 'Residual\nconnection', fontsize=8, ha='center', color='#3498db')

plt.tight_layout()
plt.show()

---

## GELU Activation

BERT uses **GELU** (Gaussian Error Linear Unit) instead of ReLU in the feed-forward layers.

$$\text{GELU}(x) = x \cdot \Phi(x)$$

Where Φ(x) is the cumulative distribution function of the standard normal distribution.

In practice, it's approximated as:

$$\text{GELU}(x) \approx 0.5x\left(1 + \tanh\left(\sqrt{2/\pi}\,(x + 0.044715x^3)\right)\right)$$

GELU is smooth around zero, where ReLU has a sharp corner. The BERT paper adopts it following OpenAI GPT.
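
To see how close the approximation is, we can compare it with the exact definition. The sketch below writes Φ(x) using the error function, Φ(x) = 0.5 · (1 + erf(x / √2)), and measures the largest gap between the two versions on [-4, 4]. The function names are ours.

```python
import math
import numpy as np

def gelu_exact(x):
    """x * Phi(x), with Phi written in terms of the error function."""
    erf = np.vectorize(math.erf)
    return 0.5 * x * (1 + erf(x / math.sqrt(2)))

def gelu_tanh(x):
    """The tanh approximation given above."""
    return 0.5 * x * (1 + np.tanh(np.sqrt(2 / np.pi) * (x + 0.044715 * x**3)))

x = np.linspace(-4, 4, 801)
max_gap = np.max(np.abs(gelu_exact(x) - gelu_tanh(x)))
print(f"largest gap on [-4, 4]: {max_gap:.2e}")
```

The gap is small enough that the two curves would be indistinguishable on a plot.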

In [None]:
# GELU vs ReLU
def gelu(x):
    return 0.5 * x * (1 + np.tanh(np.sqrt(2 / np.pi) * (x + 0.044715 * x**3)))

def relu(x):
    return np.maximum(0, x)

x = np.linspace(-4, 4, 200)

fig, ax = plt.subplots(figsize=(10, 5))
ax.plot(x, relu(x), label='ReLU', lw=2, color='#e74c3c')
ax.plot(x, gelu(x), label='GELU', lw=2, color='#3498db')
ax.axhline(y=0, color='#333', lw=0.5)
ax.axvline(x=0, color='#333', lw=0.5)
ax.set_xlabel('x', fontsize=11)
ax.set_ylabel('f(x)', fontsize=11)
ax.set_title('GELU vs ReLU Activation Functions', fontsize=12, fontweight='bold')
ax.legend(fontsize=11)
ax.grid(True, alpha=0.3)
ax.set_ylim(-1, 4)

ax.annotate('GELU is smooth\n(no sharp corner)', xy=(-0.5, gelu(-0.5)), 
            xytext=(-2.5, 1.5), fontsize=9,
            arrowprops=dict(arrowstyle='->', color='#3498db'))

plt.tight_layout()
plt.show()

---

## Summary: BERT Architecture

| Component | BERT-Base | BERT-Large |
|-----------|-----------|------------|
| Layers | 12 | 24 |
| Hidden size | 768 | 1024 |
| Attention heads | 12 | 16 |
| Feed-forward size | 3072 | 4096 |
| Max sequence length | 512 | 512 |
| Vocabulary size | ~30,000 | ~30,000 |
| Total parameters | 110M | 340M |

### Input Format
- [CLS] + tokens + [SEP] (+ more tokens + [SEP] for pairs)
- WordPiece tokenization
- Three embeddings summed: token + segment + position

### Architecture
- Encoder-only Transformer
- GELU activation in feed-forward
- Learned position embeddings

---

## What's Next: Part 3

Now you know BERT's architecture. In Part 3, we'll cover:

- **Pre-training in detail**: Masked LM and Next Sentence Prediction
- **The masking strategy**: Why 15%? Why not always [MASK]?
- **Training data and compute**

---

*Paper:* [BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding](https://arxiv.org/abs/1810.04805)