# Week 7: Build BERT from Scratch

## Objective
Implement the BERT (Bidirectional Encoder Representations from Transformers) architecture from first principles.

**Goals**:
- Understand transformer encoder architecture
- Implement multi-head self-attention
- Build positional encodings
- Create masked language model (MLM) pretraining task
- Fine-tune BERT for classification

---

## Why BERT?

**Breakthrough**: Bidirectional context understanding
- **GPT**: Left-to-right (autoregressive)
- **BERT**: Bidirectional (masked language modeling)

**Key Innovation**: Masked Language Model (MLM)
- Mask 15% of tokens
- Predict masked tokens using bidirectional context
- Pre-train once, fine-tune for many tasks
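
The difference between the two directions comes down to which positions each token may attend to. The sketch below is only an illustration, using the convention from later in this notebook that 1 means "may attend" and 0 means "blocked"; `seq_len` and the variable names are chosen for the example.

```python
import numpy as np

seq_len = 5

# GPT-style causal mask: position i may attend only to positions <= i.
causal_mask = np.tril(np.ones((seq_len, seq_len), dtype=int))

# BERT-style mask: every position may attend to every other position.
bidirectional_mask = np.ones((seq_len, seq_len), dtype=int)

print("Causal (GPT):")
print(causal_mask)
print("Bidirectional (BERT):")
print(bidirectional_mask)
```

Because BERT's mask hides nothing, a plain next-token objective would be trivial for it, which is why the pretraining task hides tokens in the input instead.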

---

In [None]:
# Imports
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import sys
import os

sys.path.append(os.path.abspath('../../'))

from src.llm.attention import MultiHeadAttention, PositionalEncoding
from src.ml.deep_learning import Dense, Activation, LayerNorm, Dropout, NeuralNetwork

sns.set_style('whitegrid')
print("✓ Imports successful")

## Step 1: Multi-Head Self-Attention

### Core Formula

```
Attention(Q, K, V) = softmax(QK^T / √d_k) V

MultiHead(Q, K, V) = Concat(head_1, ..., head_h) W^O
where head_i = Attention(QW_i^Q, KW_i^K, VW_i^V)
```
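
One way to see why the scores are divided by √d_k: for queries and keys with independent, unit-variance components, the variance of a raw dot product grows roughly in proportion to d_k, which would push the softmax towards saturation. The following is a small numerical check under that assumption (random Gaussian vectors, not learned projections).

```python
import numpy as np

rng = np.random.default_rng(0)
d_k = 64
num_pairs = 10_000

# Random queries and keys with unit-variance components (an assumption
# made for this illustration, not a property of trained models).
q = rng.standard_normal((num_pairs, d_k))
k = rng.standard_normal((num_pairs, d_k))

raw_scores = np.sum(q * k, axis=-1)        # unscaled dot products
scaled_scores = raw_scores / np.sqrt(d_k)  # scaled as in the formula

print(f"Variance of raw scores:    {raw_scores.var():.2f}")
print(f"Variance of scaled scores: {scaled_scores.var():.2f}")
```

The raw variance should come out near d_k and the scaled variance near 1, which keeps the softmax inputs in a comparable range regardless of head size.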

In [None]:
class ScaledDotProductAttention:
    """Scaled dot-product attention mechanism."""
    
    def __init__(self, dropout=0.1):
        self.dropout = dropout
        self.attention_weights = None
    
    def forward(self, Q, K, V, mask=None):
        """
        Args:
            Q: Queries (batch_size, seq_len, d_k)
            K: Keys (batch_size, seq_len, d_k)
            V: Values (batch_size, seq_len, d_v)
            mask: Attention mask broadcastable to (batch_size, seq_len, seq_len);
                  0 marks positions that must not be attended to
        
        Returns:
            output: (batch_size, seq_len, d_v)
            attention_weights: (batch_size, seq_len, seq_len)
        """
        d_k = Q.shape[-1]
        
        # Compute attention scores
        scores = np.matmul(Q, K.transpose(0, 2, 1)) / np.sqrt(d_k)
        
        # Apply mask (for padding or future tokens)
        if mask is not None:
            scores = np.where(mask == 0, -1e9, scores)
        
        # Softmax to get attention weights
        attention_weights = self.softmax(scores)
        
        # Apply dropout (always active here; a full implementation would disable it at inference)
        if self.dropout > 0:
            dropout_mask = np.random.binomial(1, 1-self.dropout, attention_weights.shape)
            attention_weights = attention_weights * dropout_mask / (1 - self.dropout)
        
        # Weighted sum of values
        output = np.matmul(attention_weights, V)
        
        self.attention_weights = attention_weights
        return output, attention_weights
    
    def softmax(self, x):
        """Numerically stable softmax."""
        exp_x = np.exp(x - np.max(x, axis=-1, keepdims=True))
        return exp_x / np.sum(exp_x, axis=-1, keepdims=True)

# Test attention (dropout disabled so each row of weights sums to exactly 1)
attention = ScaledDotProductAttention(dropout=0.0)

# Create dummy input
batch_size, seq_len, d_model = 2, 10, 64
Q = K = V = np.random.randn(batch_size, seq_len, d_model)

output, weights = attention.forward(Q, K, V)

print(f"Output shape: {output.shape}")
print(f"Attention weights shape: {weights.shape}")
print(f"Attention weights sum (should be 1.0): {weights[0, 0, :].sum():.4f}")
print("✓ Scaled dot-product attention works!")

## Step 2: Multi-Head Attention Layer

In [None]:
class MultiHeadAttentionLayer:
    """Multi-head self-attention with learned projections."""
    
    def __init__(self, d_model=512, num_heads=8, dropout=0.1):
        assert d_model % num_heads == 0, "d_model must be divisible by num_heads"
        
        self.d_model = d_model
        self.num_heads = num_heads
        self.d_k = d_model // num_heads
        
        # Learnable projection matrices
        self.W_Q = np.random.randn(d_model, d_model) * 0.01
        self.W_K = np.random.randn(d_model, d_model) * 0.01
        self.W_V = np.random.randn(d_model, d_model) * 0.01
        self.W_O = np.random.randn(d_model, d_model) * 0.01
        
        self.attention = ScaledDotProductAttention(dropout)
    
    def split_heads(self, x):
        """Split into multiple attention heads."""
        batch_size, seq_len, d_model = x.shape
        # Reshape: (batch, seq_len, d_model) -> (batch, seq_len, num_heads, d_k)
        x = x.reshape(batch_size, seq_len, self.num_heads, self.d_k)
        # Transpose: (batch, num_heads, seq_len, d_k)
        return x.transpose(0, 2, 1, 3)
    
    def combine_heads(self, x):
        """Combine attention heads back."""
        batch_size, num_heads, seq_len, d_k = x.shape
        # Transpose: (batch, seq_len, num_heads, d_k)
        x = x.transpose(0, 2, 1, 3)
        # Reshape: (batch, seq_len, d_model)
        return x.reshape(batch_size, seq_len, self.d_model)
    
    def forward(self, x, mask=None):
        """
        Args:
            x: Input (batch_size, seq_len, d_model)
            mask: Attention mask
        """
        batch_size = x.shape[0]
        
        # Linear projections
        Q = np.matmul(x, self.W_Q)
        K = np.matmul(x, self.W_K)
        V = np.matmul(x, self.W_V)
        
        # Split into heads
        Q = self.split_heads(Q)  # (batch, num_heads, seq_len, d_k)
        K = self.split_heads(K)
        V = self.split_heads(V)
        
        # Apply attention for each head
        attn_outputs = []
        for i in range(self.num_heads):
            output, _ = self.attention.forward(Q[:, i], K[:, i], V[:, i], mask)
            attn_outputs.append(output)
        
        # Stack heads: (batch, num_heads, seq_len, d_k)
        attn_output = np.stack(attn_outputs, axis=1)
        
        # Combine heads
        combined = self.combine_heads(attn_output)
        
        # Final linear projection
        output = np.matmul(combined, self.W_O)
        
        return output

# Test multi-head attention
mha = MultiHeadAttentionLayer(d_model=512, num_heads=8)
x = np.random.randn(2, 20, 512)  # (batch=2, seq_len=20, d_model=512)
output = mha.forward(x)

print(f"Input shape: {x.shape}")
print(f"Output shape: {output.shape}")
print("✓ Multi-head attention works!")

## Step 3: Positional Encoding

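Since attention has no notion of sequence order, we add positional information. The original BERT learns its position embeddings; this notebook uses the fixed sinusoidal encoding from the original Transformer instead, which needs no training.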

**Sinusoidal Encoding**:
```
PE(pos, 2i)   = sin(pos / 10000^(2i/d_model))
PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model))
```
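A useful property of this encoding is that the dot product between two position vectors depends only on their offset, not on the absolute positions. The sketch below checks this numerically; `sinusoidal_pe` is a helper written for this example that follows the formula above.

```python
import numpy as np

def sinusoidal_pe(max_len, d_model):
    """Build the (max_len, d_model) sinusoidal encoding; d_model must be even."""
    position = np.arange(max_len).reshape(-1, 1)
    div_term = np.exp(np.arange(0, d_model, 2) * -(np.log(10000.0) / d_model))
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(position * div_term)
    pe[:, 1::2] = np.cos(position * div_term)
    return pe

pe = sinusoidal_pe(max_len=100, d_model=64)

# Dot product between position p and position p + offset, for many p.
offset = 7
dots = np.array([pe[p] @ pe[p + offset] for p in range(90)])

print(f"Same value for every start position: {np.allclose(dots, dots[0])}")
```

This follows from the identity sin(a)sin(b) + cos(a)cos(b) = cos(a - b), applied to each sine/cosine pair.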

In [None]:
class PositionalEncodingLayer:
    """Sinusoidal positional encoding."""
    
    def __init__(self, d_model=512, max_len=5000):
        self.d_model = d_model
        
        # Create positional encoding matrix
        pe = np.zeros((max_len, d_model))
        position = np.arange(0, max_len).reshape(-1, 1)
        div_term = np.exp(np.arange(0, d_model, 2) * -(np.log(10000.0) / d_model))
        
        pe[:, 0::2] = np.sin(position * div_term)
        pe[:, 1::2] = np.cos(position * div_term)
        
        self.pe = pe
    
    def forward(self, x):
        """Add positional encoding to input."""
        seq_len = x.shape[1]
        return x + self.pe[:seq_len]

# Visualize positional encoding
pe_layer = PositionalEncodingLayer(d_model=128, max_len=100)

plt.figure(figsize=(15, 5))
plt.imshow(pe_layer.pe.T, aspect='auto', cmap='RdBu')
plt.xlabel('Position', fontsize=12)
plt.ylabel('Dimension', fontsize=12)
plt.title('Sinusoidal Positional Encoding', fontsize=14, fontweight='bold')
plt.colorbar()
plt.tight_layout()
plt.show()

print("✓ Positional encoding visualized")

## Step 4: Feed-Forward Network

**Architecture**: Linear → ReLU → Linear (the original BERT uses GELU as the activation; ReLU keeps this implementation simple)

```
FFN(x) = ReLU(xW_1 + b_1)W_2 + b_2
```

Typical: d_ff = 4 × d_model
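
To get a feel for what the 4× expansion costs, the sketch below counts parameters for one feed-forward network and compares it with one attention layer. The helper names are illustrative, and the attention count assumes four bias-free d_model × d_model projections, as in `MultiHeadAttentionLayer` above.

```python
def ffn_param_count(d_model, d_ff):
    """Weights and biases of Linear(d_model -> d_ff) plus Linear(d_ff -> d_model)."""
    return d_model * d_ff + d_ff + d_ff * d_model + d_model

def attention_param_count(d_model):
    """Four bias-free projections: W_Q, W_K, W_V, W_O."""
    return 4 * d_model * d_model

d_model = 512
d_ff = 4 * d_model

ffn_params = ffn_param_count(d_model, d_ff)
attn_params = attention_param_count(d_model)

print(f"FFN parameters:       {ffn_params:,}")
print(f"Attention parameters: {attn_params:,}")
```

With these settings the feed-forward network holds roughly twice as many parameters as the attention layer in the same block.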

In [None]:
class FeedForwardNetwork:
    """Position-wise feed-forward network."""
    
    def __init__(self, d_model=512, d_ff=2048, dropout=0.1):
        self.W1 = np.random.randn(d_model, d_ff) * np.sqrt(2.0 / d_model)
        self.b1 = np.zeros(d_ff)
        self.W2 = np.random.randn(d_ff, d_model) * np.sqrt(2.0 / d_ff)
        self.b2 = np.zeros(d_model)
        self.dropout = dropout
    
    def forward(self, x):
        # First linear + ReLU
        hidden = np.maximum(0, np.matmul(x, self.W1) + self.b1)
        
        # Dropout
        if self.dropout > 0:
            mask = np.random.binomial(1, 1-self.dropout, hidden.shape)
            hidden = hidden * mask / (1 - self.dropout)
        
        # Second linear
        output = np.matmul(hidden, self.W2) + self.b2
        
        return output

# Test FFN
ffn = FeedForwardNetwork(d_model=512, d_ff=2048)
x = np.random.randn(2, 20, 512)
output = ffn.forward(x)

print(f"Input shape: {x.shape}")
print(f"Output shape: {output.shape}")
print("✓ Feed-forward network works!")

## Step 5: Transformer Encoder Block

**Architecture**:
```
x → Multi-Head Attention → Add & Norm
  → Feed-Forward         → Add & Norm
```

**Residual connections** give gradients a direct path through the stack, which helps mitigate vanishing gradients in deep networks.
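
The "Add & Norm" step in the diagram can be sketched as a small wrapper around any sublayer. This is a simplified illustration: `add_and_norm` is a name made up for the example, and the learnable scale and shift of layer normalization are omitted for brevity.

```python
import numpy as np

def layer_norm(x, eps=1e-6):
    """Normalize each token's feature vector to zero mean and unit variance."""
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def add_and_norm(x, sublayer):
    """Post-norm residual step: LayerNorm(x + sublayer(x))."""
    return layer_norm(x + sublayer(x))

rng = np.random.default_rng(0)
x = rng.standard_normal((2, 4, 8))      # (batch, seq_len, d_model)
W = rng.standard_normal((8, 8)) * 0.1   # stand-in for a sublayer's weights

out = add_and_norm(x, lambda h: h @ W)

# With a sublayer that outputs zeros, only the residual path contributes.
identity_out = add_and_norm(x, lambda h: np.zeros_like(h))

print(f"Output shape: {out.shape}")
print(f"Residual path alone equals LayerNorm(x): {np.allclose(identity_out, layer_norm(x))}")
```

The zero-sublayer case shows the point of the residual connection: the input always reaches the next block, whatever the sublayer does.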

In [None]:
class TransformerEncoderBlock:
    """Single transformer encoder block."""
    
    def __init__(self, d_model=512, num_heads=8, d_ff=2048, dropout=0.1):
        self.attention = MultiHeadAttentionLayer(d_model, num_heads, dropout)
        self.ffn = FeedForwardNetwork(d_model, d_ff, dropout)
        
        # Layer normalization parameters
        self.gamma1 = np.ones(d_model)
        self.beta1 = np.zeros(d_model)
        self.gamma2 = np.ones(d_model)
        self.beta2 = np.zeros(d_model)
        
        self.dropout = dropout
    
    def layer_norm(self, x, gamma, beta, eps=1e-6):
        """Layer normalization."""
        mean = x.mean(axis=-1, keepdims=True)
        std = x.std(axis=-1, keepdims=True)
        return gamma * (x - mean) / (std + eps) + beta
    
    def forward(self, x, mask=None):
        # Multi-head self-attention with residual
        attn_output = self.attention.forward(x, mask)
        
        # Dropout
        if self.dropout > 0:
            dropout_mask = np.random.binomial(1, 1-self.dropout, attn_output.shape)
            attn_output = attn_output * dropout_mask / (1 - self.dropout)
        
        # Add & Norm
        x = self.layer_norm(x + attn_output, self.gamma1, self.beta1)
        
        # Feed-forward with residual
        ffn_output = self.ffn.forward(x)
        
        # Dropout
        if self.dropout > 0:
            dropout_mask = np.random.binomial(1, 1-self.dropout, ffn_output.shape)
            ffn_output = ffn_output * dropout_mask / (1 - self.dropout)
        
        # Add & Norm
        output = self.layer_norm(x + ffn_output, self.gamma2, self.beta2)
        
        return output

# Test encoder block
encoder_block = TransformerEncoderBlock()
x = np.random.randn(2, 20, 512)
output = encoder_block.forward(x)

print(f"Input shape: {x.shape}")
print(f"Output shape: {output.shape}")
print("✓ Transformer encoder block works!")

## Step 6: Complete BERT Model

In [None]:
class BERT:
    """BERT: Bidirectional Encoder Representations from Transformers."""
    
    def __init__(self, vocab_size=30000, d_model=512, num_heads=8, 
                 num_layers=6, d_ff=2048, max_len=512, dropout=0.1):
        self.vocab_size = vocab_size
        self.d_model = d_model
        
        # Token embeddings
        self.token_embedding = np.random.randn(vocab_size, d_model) * 0.01
        
        # Positional encoding
        self.pos_encoding = PositionalEncodingLayer(d_model, max_len)
        
        # Encoder blocks
        self.encoder_blocks = [
            TransformerEncoderBlock(d_model, num_heads, d_ff, dropout)
            for _ in range(num_layers)
        ]
        
        print("BERT Model initialized:")
        print(f"  Vocabulary: {vocab_size:,}")
        print(f"  Model dimension: {d_model}")
        print(f"  Attention heads: {num_heads}")
        print(f"  Encoder layers: {num_layers}")
        print(f"  Feed-forward dim: {d_ff}")
    
    def forward(self, input_ids, attention_mask=None):
        """
        Args:
            input_ids: Token indices (batch_size, seq_len)
            attention_mask: Optional mask for padding tokens, broadcastable to
                            (batch_size, seq_len, seq_len); 0 = do not attend
        
        Returns:
            contextualized_embeddings: (batch_size, seq_len, d_model)
        """
        # Token embeddings
        x = self.token_embedding[input_ids]  # (batch, seq_len, d_model)
        
        # Add positional encoding
        x = self.pos_encoding.forward(x)
        
        # Pass through encoder blocks
        for encoder_block in self.encoder_blocks:
            x = encoder_block.forward(x, attention_mask)
        
        return x

# Build BERT
bert = BERT(
    vocab_size=30000,
    d_model=512,
    num_heads=8,
    num_layers=6,
    d_ff=2048
)

# Test forward pass
input_ids = np.random.randint(0, 30000, size=(2, 50))  # batch=2, seq_len=50
embeddings = bert.forward(input_ids)

print(f"\nInput shape: {input_ids.shape}")
print(f"Output embeddings shape: {embeddings.shape}")
print("\n✅ BERT model built successfully!")

## Step 7: Masked Language Model (MLM) Training

**Objective**: Predict masked tokens using bidirectional context
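
In loss terms, the objective is a cross-entropy computed only at the positions selected for prediction. Below is a minimal sketch, assuming labels use -100 for positions to ignore (the convention used in this notebook's masking code); `mlm_cross_entropy` is a hypothetical helper, not part of the notebook.

```python
import numpy as np

def mlm_cross_entropy(logits, labels, ignore_index=-100):
    """Mean cross-entropy over positions whose label is not ignore_index.

    logits: (batch, seq_len, vocab_size), labels: (batch, seq_len)
    """
    # Numerically stable log-softmax over the vocabulary axis.
    shifted = logits - logits.max(axis=-1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))

    active = labels != ignore_index
    if not active.any():
        return 0.0

    active_log_probs = log_probs[active]   # (num_active, vocab_size)
    active_labels = labels[active]         # (num_active,)
    picked = active_log_probs[np.arange(len(active_labels)), active_labels]
    return float(-picked.mean())

# Uniform logits over a 4-token vocabulary, one position to predict.
example_logits = np.zeros((1, 3, 4))
example_labels = np.array([[-100, 2, -100]])
example_loss = mlm_cross_entropy(example_logits, example_labels)

print(f"Loss with uniform predictions: {example_loss:.4f}")
```

With uniform logits the loss equals the log of the vocabulary size, a handy sanity check for an untrained model.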

**Masking Strategy**:
- 80% → [MASK] token
- 10% → Random token
- 10% → Keep original
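
The 80/10/10 split can also be written without Python loops. The sketch below is one possible vectorized version; `mask_tokens` and its parameters are names chosen for this example, and the [MASK] id of 1 is an assumption for illustration.

```python
import numpy as np

def mask_tokens(input_ids, vocab_size, mask_token_id, mask_prob=0.15, rng=None):
    """Return (masked_ids, labels); labels are -100 where no prediction is made."""
    rng = np.random.default_rng() if rng is None else rng
    masked_ids = input_ids.copy()
    labels = np.full_like(input_ids, -100)

    # Each token is selected for prediction independently with probability mask_prob.
    selected = rng.random(input_ids.shape) < mask_prob
    labels[selected] = input_ids[selected]

    # A second draw decides what happens to each selected token.
    roll = rng.random(input_ids.shape)
    use_mask = selected & (roll < 0.8)                    # 80%: [MASK]
    use_random = selected & (roll >= 0.8) & (roll < 0.9)  # 10%: random token
    # Remaining 10% of selected tokens keep their original id.

    masked_ids[use_mask] = mask_token_id
    masked_ids[use_random] = rng.integers(0, vocab_size, size=int(use_random.sum()))
    return masked_ids, labels

rng = np.random.default_rng(0)
sample_ids = rng.integers(2, 1000, size=(64, 128))
masked_ids, labels = mask_tokens(sample_ids, vocab_size=1000, mask_token_id=1, rng=rng)

selected = labels != -100
print(f"Fraction selected for prediction: {selected.mean():.3f}")
print(f"Fraction of those set to [MASK]:  {(masked_ids[selected] == 1).mean():.3f}")
```

Keeping some selected tokens unchanged or random means the model cannot assume that every non-[MASK] input token is correct.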

In [None]:
class MaskedLanguageModel:
    """MLM head for BERT pretraining."""
    
    def __init__(self, bert_model, mask_token_id=1, mask_prob=0.15):
        self.bert = bert_model
        self.mask_token_id = mask_token_id
        self.mask_prob = mask_prob
        
        # Prediction head
        d_model = bert_model.d_model
        vocab_size = bert_model.vocab_size
        
        self.W_pred = np.random.randn(d_model, vocab_size) * 0.01
        self.b_pred = np.zeros(vocab_size)
    
    def create_masked_input(self, input_ids):
        """Create masked version of input for MLM."""
        masked_ids = input_ids.copy()
        labels = np.full_like(input_ids, -100)  # -100 = ignore in loss
        
        # Select each token for prediction independently with probability mask_prob (15% by default)
        mask_indices = np.random.rand(*input_ids.shape) < self.mask_prob
        labels[mask_indices] = input_ids[mask_indices]
        
        # Apply masking strategy
        for i in range(len(mask_indices)):
            for j in range(len(mask_indices[i])):
                if mask_indices[i, j]:
                    rand = np.random.rand()
                    if rand < 0.8:
                        masked_ids[i, j] = self.mask_token_id  # [MASK]
                    elif rand < 0.9:
                        masked_ids[i, j] = np.random.randint(0, self.bert.vocab_size)  # Random
                    # else: keep original (10%)
        
        return masked_ids, labels
    
    def forward(self, input_ids):
        """Forward pass with MLM."""
        # Get BERT embeddings
        embeddings = self.bert.forward(input_ids)
        
        # Project to vocabulary logits (unnormalized scores; apply softmax for probabilities)
        logits = np.matmul(embeddings, self.W_pred) + self.b_pred
        
        return logits

# Test MLM
mlm = MaskedLanguageModel(bert)

# Create sample input
sample_ids = np.random.randint(2, 1000, size=(4, 30))  # Avoid special tokens

# Create masked version
masked_ids, labels = mlm.create_masked_input(sample_ids)

print(f"Original tokens (first sequence): {sample_ids[0, :10]}")
print(f"Masked tokens (first sequence):   {masked_ids[0, :10]}")
print(f"Labels (first sequence):          {labels[0, :10]}")

# Forward pass
logits = mlm.forward(masked_ids)
print(f"\nLogits shape: {logits.shape}  (batch, seq_len, vocab_size)")
print("✓ MLM masking and forward pass work!")

## Step 8: Fine-Tuning for Classification

**Transfer Learning**: Use pre-trained BERT + add task-specific head
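
As a rough illustration of what training the task-specific head involves, the sketch below fits a softmax classifier on top of fixed feature vectors with plain gradient descent. It assumes the encoder is frozen and uses random vectors as stand-ins for [CLS] embeddings; real fine-tuning typically also updates the encoder weights, which requires backpropagation through the encoder.

```python
import numpy as np

rng = np.random.default_rng(0)
num_samples, d_model, num_classes = 32, 16, 3

# Stand-ins for frozen [CLS] embeddings and their class labels (random data).
features = rng.standard_normal((num_samples, d_model))
targets = rng.integers(0, num_classes, size=num_samples)

W = np.zeros((d_model, num_classes))
b = np.zeros(num_classes)

def softmax(z):
    e = np.exp(z - z.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

def cross_entropy(probs, y):
    return float(-np.log(probs[np.arange(len(y)), y]).mean())

learning_rate = 0.1
losses = []
for step in range(100):
    probs = softmax(features @ W + b)
    losses.append(cross_entropy(probs, targets))

    # Gradient of mean cross-entropy w.r.t. the logits: (probs - one_hot) / N.
    grad_logits = probs.copy()
    grad_logits[np.arange(num_samples), targets] -= 1.0
    grad_logits /= num_samples

    W -= learning_rate * (features.T @ grad_logits)
    b -= learning_rate * grad_logits.sum(axis=0)

print(f"Loss at step 0:  {losses[0]:.4f}")
print(f"Loss at step 99: {losses[-1]:.4f}")
```

The initial loss equals log(num_classes) because a zero-initialized head predicts a uniform distribution.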

In [None]:
class BERTForSequenceClassification:
    """BERT + Classification head."""
    
    def __init__(self, bert_model, num_classes=2):
        self.bert = bert_model
        self.num_classes = num_classes
        
        # Classification head
        d_model = bert_model.d_model
        self.W_cls = np.random.randn(d_model, num_classes) * 0.01
        self.b_cls = np.zeros(num_classes)
    
    def forward(self, input_ids):
        """Forward pass for classification."""
        # Get BERT embeddings
        embeddings = self.bert.forward(input_ids)
        
        # Use [CLS] token representation (first token)
        cls_embedding = embeddings[:, 0, :]  # (batch, d_model)
        
        # Classification logits
        logits = np.matmul(cls_embedding, self.W_cls) + self.b_cls
        
        # Softmax
        exp_logits = np.exp(logits - np.max(logits, axis=1, keepdims=True))
        probs = exp_logits / np.sum(exp_logits, axis=1, keepdims=True)
        
        return probs

# Build classifier
classifier = BERTForSequenceClassification(bert, num_classes=3)

# Test classification
input_ids = np.random.randint(0, 30000, size=(4, 50))
probs = classifier.forward(input_ids)

print(f"Input shape: {input_ids.shape}")
print(f"Class probabilities shape: {probs.shape}")
print("\nSample predictions:")
for i, prob in enumerate(probs):
    print(f"  Sample {i+1}: {prob} → Class {np.argmax(prob)}")

print("\n✓ Classification head forward pass works!")

## Conclusion

### Key Achievements

1. ✅ **Built BERT from scratch** - All components implemented
2. ✅ **Multi-head self-attention** - Core mechanism understood
3. ✅ **Positional encoding** - Sinusoidal implementation
4. ✅ **Transformer encoder** - 6-layer stack
5. ✅ **MLM pretraining** - Masked language modeling
6. ✅ **Fine-tuning** - Classification head for downstream tasks

### Interview Discussion Points

**Q: Why is BERT bidirectional?**
> "BERT uses masked language modeling where we mask 15% of tokens and predict them using both left AND right context. Unlike GPT which is autoregressive (left-to-right), BERT sees the full sentence, giving richer representations."

**Q: What's the role of [CLS] token?**
> "The [CLS] token is prepended to every sequence. During pre-training, its final hidden state feeds the next-sentence prediction (NSP) objective, which pushes it to summarize sequence-level information. For classification, we use its final hidden state as the sequence representation."
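
A small sketch of how such inputs might be assembled, including a padding mask. The special-token ids and the `build_inputs` helper are hypothetical and chosen only for illustration.

```python
import numpy as np

# Hypothetical special-token ids, for illustration only.
PAD_ID, CLS_ID, SEP_ID = 0, 2, 3

def build_inputs(token_id_lists, max_len):
    """Wrap each sequence as [CLS] tokens [SEP], pad to max_len, build a mask."""
    input_ids = np.full((len(token_id_lists), max_len), PAD_ID, dtype=int)
    for row, tokens in enumerate(token_id_lists):
        seq = [CLS_ID] + list(tokens)[: max_len - 2] + [SEP_ID]
        input_ids[row, : len(seq)] = seq

    # 1 = real token, 0 = padding. Shape (batch, 1, seq_len) broadcasts
    # against attention scores of shape (batch, seq_len, seq_len).
    attention_mask = (input_ids != PAD_ID).astype(int)[:, None, :]
    return input_ids, attention_mask

ids, mask = build_inputs([[11, 12, 13], [21]], max_len=6)

print(ids)
print(mask[:, 0, :])
```

Every row starts with the [CLS] id, so `embeddings[:, 0, :]` picks out its hidden state for the whole batch.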

**Q: How does attention scale?**
> "Attention is O(n²) in sequence length, in both time and memory, due to the QK^T multiplication. For long sequences, alternatives include sparse attention patterns (Longformer) or low-rank linear approximations (Linformer)."
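
A back-of-the-envelope calculation makes the quadratic growth concrete. The sketch assumes 8 heads, a batch of one, and 4-byte floats, and counts only the attention weight matrices of a single layer; `attention_matrix_bytes` is a helper made up for this example.

```python
def attention_matrix_bytes(seq_len, num_heads=8, batch_size=1, bytes_per_value=4):
    """Memory for one layer's attention weights: one (seq_len x seq_len) matrix per head."""
    return batch_size * num_heads * seq_len * seq_len * bytes_per_value

for n in (512, 1024, 2048, 4096):
    mib = attention_matrix_bytes(n) / 2**20
    print(f"seq_len={n:5d}  attention weights: {mib:7.1f} MiB")
```

Doubling the sequence length quadruples the memory, which is the behaviour sparse and low-rank variants aim to avoid.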

**Q: Why layer normalization instead of batch normalization?**
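> "Layer norm normalizes across the features of each token rather than across the batch, so its output does not depend on batch size or on the other sequences in the batch, and it behaves the same at training and inference time. That makes it a better fit than batch norm for variable-length sequences."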
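
The batch-independence claim can be checked directly. In the sketch below, the same sample is placed in two different batches; the simplified `layer_norm` and `batch_norm` functions omit learnable parameters, and `batch_norm` uses training-mode batch statistics without running averages.

```python
import numpy as np

def layer_norm(x, eps=1e-6):
    """Statistics over the feature axis of each token."""
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def batch_norm(x, eps=1e-6):
    """Statistics per feature, computed over the batch and sequence axes."""
    mean = x.mean(axis=(0, 1), keepdims=True)
    var = x.var(axis=(0, 1), keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

rng = np.random.default_rng(0)
sample = rng.standard_normal((1, 4, 8))            # (batch=1, seq_len, d_model)
others_a = rng.standard_normal((3, 4, 8))
others_b = rng.standard_normal((3, 4, 8)) * 5 + 2  # very different statistics

batch_a = np.concatenate([sample, others_a])
batch_b = np.concatenate([sample, others_b])

ln_same = np.allclose(layer_norm(batch_a)[0], layer_norm(batch_b)[0])
bn_same = np.allclose(batch_norm(batch_a)[0], batch_norm(batch_b)[0])

print(f"Layer norm output unchanged by batch-mates: {ln_same}")
print(f"Batch norm output unchanged by batch-mates: {bn_same}")
```

Layer norm gives the same result for the sample in both batches, while batch norm's output shifts with the statistics of the other sequences.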

### Real-World Applications

- **Text Classification**: Sentiment analysis, spam detection
- **Named Entity Recognition**: Extract entities from text
- **Question Answering**: SQuAD, conversational AI
- **Embeddings**: Semantic search, document clustering

---

**✅ Week 7 Complete**: Built BERT transformer from first principles!

---

*Next: Week 8 - Load Pre-trained GPT-2 Weights*