# Transformer Encoder — Full Implementation in PyTorch

This notebook implements the **Transformer Encoder** from scratch using PyTorch.

## Architecture Overview
The encoder consists of **N stacked identical layers**, each containing:
1. **Multi-Head Self-Attention** — allows each token to attend to every other token in the sequence
2. **Add & Layer Normalization** — residual connection followed by normalization
3. **Position-wise Feed-Forward Network (FFN)** — two linear layers with ReLU activation
4. **Add & Layer Normalization** — another residual + normalization

## Key Hyperparameters
| Parameter | Value | Description |
|-----------|-------|-------------|
| `d_model` | 512 | Embedding dimension for each token |
| `num_heads` | 8 | Number of parallel attention heads |
| `head_dim` | 64 | Dimension per head (512 / 8) |
| `ffn_hidden` | 2048 | Hidden layer size in the FFN |
| `num_layers` | 5 | Number of stacked encoder layers |
| `drop_prob` | 0.1 | Dropout probability (regularization) |
| `batch_size` | 30 | Number of sentences per mini-batch |
| `max_sequence_length` | 200 | Fixed sequence length (padded) |

## Tensor Shape Throughout the Encoder
Every encoder layer takes and returns a tensor of shape **[batch_size, max_sequence_length, d_model]** = **[30, 200, 512]**.
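
The shape invariance can be checked quickly with a compact post-norm block assembled from PyTorch's stock modules (`nn.MultiheadAttention`, `nn.LayerNorm`) rather than the from-scratch classes defined later in this notebook. This is only a sketch: the class name `TinyEncoderBlock` and the small sizes are assumptions chosen for speed, and `batch_first=True` assumes a reasonably recent PyTorch release.

```python
import torch
from torch import nn


class TinyEncoderBlock(nn.Module):
    """Hypothetical post-norm encoder block built from stock PyTorch modules."""

    def __init__(self, d_model, num_heads, ffn_hidden, drop_prob=0.1):
        super().__init__()
        # batch_first=True makes the module accept [batch, seq_len, d_model]
        self.attention = nn.MultiheadAttention(d_model, num_heads, batch_first=True)
        self.norm1 = nn.LayerNorm(d_model)
        self.ffn = nn.Sequential(
            nn.Linear(d_model, ffn_hidden),
            nn.ReLU(),
            nn.Dropout(drop_prob),
            nn.Linear(ffn_hidden, d_model),
        )
        self.norm2 = nn.LayerNorm(d_model)
        self.dropout = nn.Dropout(drop_prob)

    def forward(self, x):
        # Sub-layer 1: self-attention, dropout, residual add, layer norm
        attn_out, _ = self.attention(x, x, x, need_weights=False)
        x = self.norm1(x + self.dropout(attn_out))
        # Sub-layer 2: feed-forward, dropout, residual add, layer norm
        x = self.norm2(x + self.dropout(self.ffn(x)))
        return x


# Small illustrative sizes (the notebook itself uses [30, 200, 512])
block = TinyEncoderBlock(d_model=16, num_heads=4, ffn_hidden=64)
sample = torch.randn(2, 5, 16)
result = block(sample)
print(result.shape)  # same shape as `sample`
```

Because the output shape equals the input shape, any number of such blocks can be chained without adapters between them.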

## Step 1: Import Libraries
- **torch**: The core PyTorch library for tensor operations and automatic differentiation.
- **math**: Python's built-in math module — used here for `sqrt` in scaled dot-product attention.
- **nn (torch.nn)**: Provides neural network building blocks like `Linear`, `Module`, `Parameter`, `Dropout`, etc.
- **F (torch.nn.functional)**: Functional API for operations like `softmax` that don't have learnable parameters.

In [None]:
import torch                        # Core PyTorch library for tensor computation & GPU acceleration
import math                         # Python math library — we need sqrt() for scaling in attention
from torch import nn                # Neural network module: layers (Linear, Dropout), base class (Module), etc.
import torch.nn.functional as F     # Functional API: stateless ops like softmax, relu (no learnable params)

## Step 2: Define All Encoder Components

This cell defines **6 components** that build up the Transformer Encoder:

1. **`scaled_dot_product()`** — Core attention formula: $\text{Attention}(Q,K,V) = \text{softmax}\!\left(\dfrac{QK^T}{\sqrt{d_k}}\right)V$
2. **`MultiHeadAttention`** — Splits attention into multiple parallel heads for richer representations
3. **`LayerNormalization`** — Normalizes across the embedding dimension with learnable γ and β
4. **`PositionwiseFeedForward`** — Two-layer MLP (512→2048→512) applied independently to each position
5. **`EncoderLayer`** — One complete encoder block (attention → add & norm → FFN → add & norm)
6. **`Encoder`** — Stacks N encoder layers sequentially
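
The scaling term in the attention formula can be motivated numerically. The sketch below assumes query and key components drawn independently from a standard normal distribution (an assumption for illustration; trained projections need not behave this way). It compares the variance of raw and scaled dot products, then checks that softmax turns each row of scores into a probability distribution. Variable names such as `raw_scores` are illustrative.

```python
import math

import torch
import torch.nn.functional as F

torch.manual_seed(0)
d_k = 64

# 10,000 random query/key pairs with unit-variance components (assumption)
q = torch.randn(10_000, d_k)
k = torch.randn(10_000, d_k)

raw_scores = (q * k).sum(dim=-1)            # one dot product per pair
scaled_scores = raw_scores / math.sqrt(d_k)

var_raw = raw_scores.var().item()           # close to d_k
var_scaled = scaled_scores.var().item()     # close to 1
print(f"variance before scaling: {var_raw:.1f}, after: {var_scaled:.2f}")

# Attention weights for one tiny sequence: 4 tokens, d_k-dim queries and keys
q_seq = torch.randn(4, d_k)
k_seq = torch.randn(4, d_k)
weights = F.softmax(q_seq @ k_seq.T / math.sqrt(d_k), dim=-1)
row_sums = weights.sum(dim=-1)              # each query's weights sum to 1
print(weights.shape, row_sums)
```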

In [None]:
# ============================================================================
# 1. SCALED DOT-PRODUCT ATTENTION (the core math behind attention)
# ============================================================================
# Formula:  Attention(Q, K, V) = softmax( (Q @ K^T) / sqrt(d_k) ) @ V
#
# Why scale by sqrt(d_k)?
#   - If the components of q and k are independent with mean 0 and variance 1,
#     the dot product q·k has variance d_k, so scores grow in magnitude with d_k.
#   - Large scores push softmax into saturated regions with tiny gradients.
#   - Dividing by sqrt(d_k) brings the variance back to ≈ 1, giving healthier gradients.
#
# Parameters:
#   q : Query tensor  — shape [batch_size, num_heads, seq_len, head_dim]  (what am I looking for?)
#   k : Key tensor    — shape [batch_size, num_heads, seq_len, head_dim]  (what do I contain?)
#   v : Value tensor  — shape [batch_size, num_heads, seq_len, head_dim]  (what information do I carry?)
#   mask : Optional additive tensor that blocks certain positions (e.g. a decoder look-ahead mask or a padding mask); this encoder passes None
#
# Returns:
#   values    — weighted sum of V vectors  [batch_size, num_heads, seq_len, head_dim]
#   attention — attention weight matrix     [batch_size, num_heads, seq_len, seq_len]

def scaled_dot_product(q, k, v, mask=None):
    # d_k = dimension of each head (e.g., 64 when d_model=512 and num_heads=8)
    d_k = q.size()[-1]

    # Step A: Compute raw attention scores by taking dot product of Q and K^T
    # q shape:        [batch, heads, seq_len, head_dim]   e.g. [30, 8, 200, 64]
    # k.transpose:    [batch, heads, head_dim, seq_len]   e.g. [30, 8, 64, 200]
    # Result "scaled": [batch, heads, seq_len, seq_len]   e.g. [30, 8, 200, 200]
    # Each entry (i, j) = how much token i should attend to token j
    scaled = torch.matmul(q, k.transpose(-1, -2)) / math.sqrt(d_k)
    print(f"scaled.size() : {scaled.size()}")

    # Step B: (Optional) Apply mask, e.g. to hide future tokens in a decoder or padded positions
    # This notebook's encoder passes mask=None, so every token can attend to every other token (padding included)
    if mask is not None:
        print(f"-- ADDING MASK of shape {mask.size()} --")
        # Mask positions are typically filled with -inf so that softmax maps them to ~0
        # Broadcasting add: only the last N dimensions need to match
        scaled += mask

    # Step C: Apply softmax along the last dimension (across keys for each query)
    # Converts raw scores into probabilities that sum to 1 for each query position
    # Shape stays [batch, heads, seq_len, seq_len]
    attention = F.softmax(scaled, dim=-1)

    # Step D: Multiply attention weights by V to get the weighted combination of values
    # attention: [batch, heads, seq_len, seq_len]   e.g. [30, 8, 200, 200]
    # v:         [batch, heads, seq_len, head_dim]   e.g. [30, 8, 200, 64]
    # Result:    [batch, heads, seq_len, head_dim]   e.g. [30, 8, 200, 64]
    values = torch.matmul(attention, v)
    return values, attention


# ============================================================================
# 2. MULTI-HEAD ATTENTION
# ============================================================================
# Instead of performing a single attention function with d_model-dimensional keys,
# values and queries, it is beneficial to split them into 'num_heads' parallel
# attention operations, each with dimension head_dim = d_model / num_heads.
#
# Why multiple heads?
#   - Each head can learn to attend to DIFFERENT types of relationships:
#     e.g., one head might focus on syntactic relations, another on semantic similarity.
#   - The outputs of all heads are concatenated and linearly projected back.
#
# Tensor flow through this class:
#   Input x:           [batch_size, seq_len, d_model]         e.g. [30, 200, 512]
#   After qkv_layer:   [batch_size, seq_len, 3 * d_model]    e.g. [30, 200, 1536]
#   Reshape:           [batch_size, seq_len, num_heads, 3*head_dim] e.g. [30, 200, 8, 192]
#   Permute:           [batch_size, num_heads, seq_len, 3*head_dim] e.g. [30, 8, 200, 192]
#   Split into Q,K,V:  each [batch_size, num_heads, seq_len, head_dim] e.g. [30, 8, 200, 64]
#   Attention output:  [batch_size, num_heads, seq_len, head_dim]      e.g. [30, 8, 200, 64]
#   Reshape (concat):  [batch_size, seq_len, d_model]                  e.g. [30, 200, 512]
#   Final linear:      [batch_size, seq_len, d_model]                  e.g. [30, 200, 512]

class MultiHeadAttention(nn.Module):

    def __init__(self, d_model, num_heads):
        super().__init__()
        self.d_model = d_model        # Total embedding dimension (e.g., 512)
        self.num_heads = num_heads     # Number of parallel attention heads (e.g., 8)
        self.head_dim = d_model // num_heads  # Dimension per head: 512 / 8 = 64

        # Single linear layer that projects input into Q, K, and V simultaneously
        # Input:  [batch, seq_len, 512]
        # Output: [batch, seq_len, 1536]  (512*3 = 1536, for Q, K, V concatenated)
        # This is more efficient than having 3 separate linear layers
        self.qkv_layer = nn.Linear(d_model, 3 * d_model)

        # Final linear projection after concatenating all heads back together
        # Input:  [batch, seq_len, 512]  (concatenated heads)
        # Output: [batch, seq_len, 512]
        self.linear_layer = nn.Linear(d_model, d_model)

    def forward(self, x, mask=None):
        # x shape: [batch_size, max_sequence_length, d_model] = [30, 200, 512]
        batch_size, max_sequence_length, d_model = x.size()
        print(f"x.size(): {x.size()}")

        # Project input into Q, K, V all at once
        # [30, 200, 512] → [30, 200, 1536]
        qkv = self.qkv_layer(x)
        print(f"qkv.size(): {qkv.size()}")

        # Reshape to separate the heads:
        # [30, 200, 1536] → [30, 200, 8, 192]
        # 192 = 3 * head_dim = 3 * 64 (still contains Q, K, V concatenated per head)
        qkv = qkv.reshape(batch_size, max_sequence_length, self.num_heads, 3 * self.head_dim)
        print(f"qkv.size(): {qkv.size()}")

        # Move the heads dimension before sequence length for parallel computation
        # [30, 200, 8, 192] → [30, 8, 200, 192]
        # Now each head can independently process all 200 sequence positions
        qkv = qkv.permute(0, 2, 1, 3)
        print(f"qkv.size(): {qkv.size()}")

        # Split the last dimension into 3 equal chunks: Q, K, V
        # Each: [30, 8, 200, 64]  — 8 heads, each attending with 64-dim vectors
        q, k, v = qkv.chunk(3, dim=-1)
        print(f"q size: {q.size()}, k size: {k.size()}, v size: {v.size()}, ")

        # Compute scaled dot-product attention for all heads in parallel
        # values:    [30, 8, 200, 64]  — contextualized representations per head
        # attention: [30, 8, 200, 200] — attention weight matrices per head
        values, attention = scaled_dot_product(q, k, v, mask)
        print(f"values.size(): {values.size()}, attention.size:{ attention.size()} ")

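        # Concatenate all heads back together:
        # [30, 8, 200, 64] → permute → [30, 200, 8, 64] → reshape → [30, 200, 512]  (8 heads × 64 dims = 512)
        # The head axis must sit next to head_dim before reshaping; reshaping [30, 8, 200, 64]
        # directly would mix different sequence positions instead of concatenating heads per token
        values = values.permute(0, 2, 1, 3).reshape(batch_size, max_sequence_length, self.num_heads * self.head_dim)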
        print(f"values.size(): {values.size()}")

        # Final linear projection: mix information across heads
        # [30, 200, 512] → [30, 200, 512]
        out = self.linear_layer(values)
        print(f"out.size(): {out.size()}")
        return out


# ============================================================================
# 3. LAYER NORMALIZATION
# ============================================================================
# Normalizes the input across the LAST dimension(s) (the embedding dimension).
# This is different from Batch Normalization, which normalizes across the batch dimension.
#
# Why Layer Norm?
#   - Stabilizes training by ensuring each layer receives inputs with consistent
#     mean (~0) and variance (~1), reducing "internal covariate shift".
#   - Works well with variable-length sequences (unlike Batch Norm).
#
# Formula:
#   y = (x - mean) / sqrt(var + eps)
#   out = gamma * y + beta
#
#   - gamma (γ): learnable scale parameter, initialized to 1  (shape: [512])
#   - beta  (β): learnable shift parameter, initialized to 0  (shape: [512])
#   - These allow the model to undo the normalization if that's beneficial.
#
# Input shape:  [batch_size, seq_len, d_model]  e.g. [30, 200, 512]
# Output shape: [batch_size, seq_len, d_model]  e.g. [30, 200, 512]  (unchanged)

class LayerNormalization(nn.Module):
    def __init__(self, parameters_shape, eps=1e-5):
        super().__init__()
        self.parameters_shape = parameters_shape  # Shape of the dimension(s) to normalize over, e.g. [512]
        self.eps = eps  # Small constant to prevent division by zero in std calculation

        # Learnable parameters:
        # gamma (scale): initialized to ones — starts as identity transform
        self.gamma = nn.Parameter(torch.ones(parameters_shape))
        # beta (shift): initialized to zeros — starts with no bias
        self.beta = nn.Parameter(torch.zeros(parameters_shape))

    def forward(self, inputs):
        # inputs shape: [30, 200, 512]

        # Build the list of dimensions to normalize over
        # For parameters_shape=[512], dims=[-1], meaning normalize over the last (embedding) dimension
        # For each token independently, we compute mean and variance across all 512 features
        dims = [-(i + 1) for i in range(len(self.parameters_shape))]

        # Compute mean across the embedding dimension, keeping dims for broadcasting
        # mean shape: [30, 200, 1] — one mean value per token
        mean = inputs.mean(dim=dims, keepdim=True)
        print(f"Mean ({mean.size()})")

        # Compute variance: E[(x - mean)^2]
        # var shape: [30, 200, 1]
        var = ((inputs - mean) ** 2).mean(dim=dims, keepdim=True)

        # Standard deviation (with epsilon for numerical stability)
        # std shape: [30, 200, 1]
        std = (var + self.eps).sqrt()
        print(f"Standard Deviation  ({std.size()})")

        # Normalize: zero mean, unit variance
        # y shape: [30, 200, 512]
        y = (inputs - mean) / std
        print(f"y: {y.size()}")

        # Scale and shift with learnable parameters
        # gamma [512] broadcasts across batch and seq dimensions
        # beta  [512] broadcasts across batch and seq dimensions
        # out shape: [30, 200, 512]
        out = self.gamma * y + self.beta
        print(f"self.gamma: {self.gamma.size()}, self.beta: {self.beta.size()}")
        print(f"out: {out.size()}")
        return out


# ============================================================================
# 4. POSITION-WISE FEED-FORWARD NETWORK (FFN)
# ============================================================================
# A simple two-layer fully connected network applied to each position INDEPENDENTLY.
# "Position-wise" means the same MLP is applied to every token in the sequence separately.
#
# Architecture:
#   Linear1:  512 → 2048     (expand to a higher-dimensional space)
#   ReLU:     activation      (introduce non-linearity)
#   Dropout:  regularization  (randomly zero out neurons during training)
#   Linear2:  2048 → 512     (project back to original dimension)
#
# Why expand then contract?
#   - The wider hidden layer (2048) gives the network more capacity to learn
#     complex feature transformations, while projecting back to 512 keeps
#     the representation size consistent for the next encoder layer.
#
# Input shape:  [30, 200, 512]
# Output shape: [30, 200, 512]  (same!)

class PositionwiseFeedForward(nn.Module):

    def __init__(self, d_model, hidden, drop_prob=0.1):
        super().__init__()
        self.linear1 = nn.Linear(d_model, hidden)   # 512 → 2048 (expansion)
        self.linear2 = nn.Linear(hidden, d_model)    # 2048 → 512 (compression)
        self.relu = nn.ReLU()                         # Non-linear activation function
        self.dropout = nn.Dropout(p=drop_prob)         # Dropout for regularization (10% neurons zeroed)

    def forward(self, x):
        # x: [30, 200, 512]

        # Expand: [30, 200, 512] → [30, 200, 2048]
        x = self.linear1(x)
        print(f"x after first linear layer: {x.size()}")

        # Apply ReLU: max(0, x) — introduces non-linearity so the network
        # can learn more complex functions than just linear transformations
        # Shape unchanged: [30, 200, 2048]
        x = self.relu(x)
        print(f"x after activation: {x.size()}")

        # Dropout: during training, zero each value with probability drop_prob (0.1 here)
        # and scale the survivors by 1 / (1 - drop_prob); this discourages reliance on any single neuron
        # Shape unchanged: [30, 200, 2048]
        x = self.dropout(x)
        print(f"x after dropout: {x.size()}")

        # Compress back: [30, 200, 2048] → [30, 200, 512]
        x = self.linear2(x)
        print(f"x after 2nd linear layer: {x.size()}")
        return x


# ============================================================================
# 5. ENCODER LAYER (one complete block)
# ============================================================================
# Each encoder layer performs the following sequence:
#
#   ┌─────────────────────────────────────────────────────┐
#   │  Input x  [30, 200, 512]                            │
#   │     │                                               │
#   │     ├──────── (save as residual_x) ────────┐        │
#   │     ▼                                      │        │
#   │  Multi-Head Self-Attention                  │        │
#   │     ▼                                      │        │
#   │  Dropout                                   │        │
#   │     ▼                                      │        │
#   │  Add (x + residual_x)  ← residual connection       │
#   │     ▼                                               │
#   │  Layer Normalization                                │
#   │     │                                               │
#   │     ├──────── (save as residual_x) ────────┐        │
#   │     ▼                                      │        │
#   │  Feed-Forward Network                      │        │
#   │     ▼                                      │        │
#   │  Dropout                                   │        │
#   │     ▼                                      │        │
#   │  Add (x + residual_x)  ← residual connection       │
#   │     ▼                                               │
#   │  Layer Normalization                                │
#   │     ▼                                               │
#   │  Output [30, 200, 512]                              │
#   └─────────────────────────────────────────────────────┘
#
# Residual connections:
#   - The original input is ADDED to the output of each sub-layer before normalization.
#   - This helps gradients flow directly through the network during backpropagation,
#     mitigating the "vanishing gradient" problem in deep networks.
#   - Intuition: the layer only needs to learn the RESIDUAL (what to add/change),
#     not the entire transformation from scratch.

class EncoderLayer(nn.Module):

    def __init__(self, d_model, ffn_hidden, num_heads, drop_prob):
        super().__init__()
        # Sub-layer 1: Multi-Head Self-Attention
        self.attention = MultiHeadAttention(d_model=d_model, num_heads=num_heads)
        # Layer Norm after attention (normalizes over embedding dim [512])
        self.norm1 = LayerNormalization(parameters_shape=[d_model])
        # Dropout after attention
        self.dropout1 = nn.Dropout(p=drop_prob)

        # Sub-layer 2: Position-wise Feed-Forward Network
        self.ffn = PositionwiseFeedForward(d_model=d_model, hidden=ffn_hidden, drop_prob=drop_prob)
        # Layer Norm after FFN
        self.norm2 = LayerNormalization(parameters_shape=[d_model])
        # Dropout after FFN
        self.dropout2 = nn.Dropout(p=drop_prob)

    def forward(self, x):
        # x shape: [30, 200, 512]

        # ---- Sub-layer 1: Multi-Head Self-Attention + Add & Norm ----
        residual_x = x  # Save input for residual connection
        print("------- ATTENTION 1 ------")
        x = self.attention(x, mask=None)  # Self-attention (mask=None for encoder)
        print("------- DROPOUT 1 ------")
        x = self.dropout1(x)              # Apply dropout for regularization
        print("------- ADD AND LAYER NORMALIZATION 1 ------")
        x = self.norm1(x + residual_x)    # Add residual, then normalize

        # ---- Sub-layer 2: Feed-Forward Network + Add & Norm ----
        residual_x = x  # Save input for second residual connection
        print("------- ATTENTION 2 ------")
        x = self.ffn(x)                   # Feed-Forward Network (NOT attention — label is misleading)
        print("------- DROPOUT 2 ------")
        x = self.dropout2(x)              # Apply dropout
        print("------- ADD AND LAYER NORMALIZATION 2 ------")
        x = self.norm2(x + residual_x)    # Add residual, then normalize

        return x  # Output shape: [30, 200, 512] — same as input!


# ============================================================================
# 6. ENCODER (stack of N encoder layers)
# ============================================================================
# The full encoder simply stacks N identical EncoderLayer modules sequentially.
# Data flows through each layer one after another:
#   Input → EncoderLayer_1 → EncoderLayer_2 → ... → EncoderLayer_N → Output
#
# nn.Sequential: PyTorch container that chains modules — calling forward()
# on the Sequential passes input through each child module in order.
#
# Input shape:  [30, 200, 512]
# Output shape: [30, 200, 512]  (unchanged through all N layers)

class Encoder(nn.Module):
    def __init__(self, d_model, ffn_hidden, num_heads, drop_prob, num_layers):
        super().__init__()
        # Create a list of `num_layers` identical encoder layers and wrap in Sequential
        # The * unpacks the list so each EncoderLayer is a separate argument to Sequential
        self.layers = nn.Sequential(*[EncoderLayer(d_model, ffn_hidden, num_heads, drop_prob)
                                     for _ in range(num_layers)])

    def forward(self, x):
        # Pass input through all encoder layers sequentially
        # Each layer: attention → add&norm → FFN → add&norm
        x = self.layers(x)
        return x
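
One detail worth checking in isolation is how per-head outputs of shape `[batch, heads, seq_len, head_dim]` are merged into `[batch, seq_len, heads * head_dim]`. The toy tensor below (sizes and names are illustrative assumptions) tags every element with its token index. This makes it easy to see that moving the head axis next to `head_dim` before reshaping keeps each token's data together, whereas reshaping directly mixes different tokens into one row.

```python
import torch

batch, heads, seq_len, head_dim = 1, 2, 3, 2

# Fill values[b, h, s, :] with the token index s, so every element records
# which sequence position it belongs to.
token_ids = torch.arange(seq_len, dtype=torch.float32)
values = token_ids.view(1, 1, seq_len, 1).expand(batch, heads, seq_len, head_dim)

# Merge with the head axis moved next to head_dim first
merged = values.permute(0, 2, 1, 3).reshape(batch, seq_len, heads * head_dim)

# Merge by reshaping directly, without the permute
naive = values.reshape(batch, seq_len, heads * head_dim)

print(merged[0])  # row s contains only the value s
print(naive[0])   # rows mix values from different tokens
```

Both results have the same shape, so a shape printout alone cannot tell the two merges apart.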

## Step 3: Set Hyperparameters & Instantiate the Encoder

These are the key settings that control the architecture:

| Setting | Meaning | Why this value? |
|---------|---------|-----------------|
| `d_model = 512` | Embedding dimension | Standard from "Attention Is All You Need" paper |
| `num_heads = 8` | Parallel attention heads | 512/8 = 64 dims per head — good balance |
| `drop_prob = 0.1` | 10% dropout | Mild regularization to prevent overfitting |
| `batch_size = 30` | Mini-batch size | Trade-off between training speed and memory |
| `max_sequence_length = 200` | Padded sequence length | All sentences padded/truncated to same length |
| `ffn_hidden = 2048` | FFN hidden layer size | 4× the embedding dim (standard ratio) |
| `num_layers = 5` | Encoder depth | 5 stacked identical layers |

In [None]:
# ---- Hyperparameters (d_model, num_heads, ffn_hidden, drop_prob match the base model in "Attention Is All You Need"; the others are this notebook's choices) ----
d_model = 512              # Embedding dimension: each token is a 512-dimensional vector
num_heads = 8              # Number of attention heads (each head works with 512/8 = 64-dimensional Q, K, V vectors)
drop_prob = 0.1            # Dropout rate: 10% of neurons randomly turned off during training
batch_size = 30            # Number of sentences processed together in one forward pass
max_sequence_length = 200  # All input sequences padded/truncated to this fixed length
ffn_hidden = 2048          # Hidden size in feed-forward network (4x expansion of d_model)
num_layers = 5             # Number of stacked encoder layers (deeper = more capacity)

# ---- Instantiate the Encoder ----
# This creates 5 encoder layers, each containing:
#   - MultiHeadAttention (8 heads, 512 dims)
#   - LayerNormalization (over 512 dims)
#   - PositionwiseFeedForward (512 → 2048 → 512)
#   - Dropout layers and residual connections
encoder = Encoder(d_model, ffn_hidden, num_heads, drop_prob, num_layers)
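
As a sanity check on these settings, the number of learnable parameters can be worked out by hand. The arithmetic below assumes the layer layout described in this notebook (one fused Q/K/V projection, one output projection, a two-layer FFN, and two layer norms per encoder layer, all with biases). The helper name `encoder_layer_param_count` is hypothetical.

```python
def encoder_layer_param_count(d_model, ffn_hidden):
    """Learnable parameters in one encoder layer (hypothetical helper)."""
    qkv = d_model * 3 * d_model + 3 * d_model        # fused Q/K/V projection
    out_proj = d_model * d_model + d_model           # projection after merging heads
    ffn = (d_model * ffn_hidden + ffn_hidden) + (ffn_hidden * d_model + d_model)
    norms = 2 * (d_model + d_model)                  # gamma and beta, two norms
    return qkv + out_proj + ffn + norms


per_layer = encoder_layer_param_count(d_model=512, ffn_hidden=2048)
total = 5 * per_layer  # num_layers = 5
print(per_layer, total)  # 3152384 15761920
```

With the classes from Step 2 defined, the total should agree with `sum(p.numel() for p in encoder.parameters())`.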

## Step 4: Run a Forward Pass Through the Encoder

We create a **random input tensor** simulating a batch of 30 sentences, each with 200 tokens, each token represented as a 512-dimensional embedding (which would normally include positional encoding).

**Expected output shape: [30, 200, 512]** — the encoder preserves the input dimensions exactly, but the values are now contextualized (each token's representation incorporates information from all other tokens via self-attention).
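
The claim that outputs are contextualized can be illustrated without the full encoder. The sketch below uses a single unprojected attention step with queries, keys, and values all set to the input, which is a simplification and not the notebook's `MultiHeadAttention`. Changing only the last token alters the attention output at the first position, while a position-wise linear layer leaves the first position untouched.

```python
import math

import torch
import torch.nn.functional as F
from torch import nn

torch.manual_seed(0)


def simple_self_attention(x):
    """Unprojected self-attention: Q = K = V = x (simplifying assumption)."""
    scores = x @ x.transpose(-1, -2) / math.sqrt(x.size(-1))
    return F.softmax(scores, dim=-1) @ x


x = torch.randn(1, 5, 16)          # [batch, seq_len, d_model], toy sizes
x_changed = x.clone()
x_changed[0, -1] += 1.0            # perturb only the last token

attn_shift = (simple_self_attention(x_changed) - simple_self_attention(x))[0, 0]

linear = nn.Linear(16, 16)         # applied to each position independently
with torch.no_grad():
    linear_shift = (linear(x_changed) - linear(x))[0, 0]

print(attn_shift.abs().max().item())    # non-zero: position 0 sees the change
print(linear_shift.abs().max().item())  # ≈ 0: position 0 is unaffected
```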

In [None]:
# Create a random input tensor to simulate token embeddings + positional encodings
# Shape: [batch_size, max_sequence_length, d_model] = [30, 200, 512]
# In a real scenario, this would come from:
#   1. Token embeddings (learned lookup table mapping word IDs → 512-dim vectors)
#   2. + Positional encodings (sinusoidal signals encoding each token's position)
x = torch.randn((batch_size, max_sequence_length, d_model))

# Pass through all 5 encoder layers
# The input flows: Layer1 → Layer2 → Layer3 → Layer4 → Layer5
# Each layer applies: Attention → Add&Norm → FFN → Add&Norm
# Output shape: [30, 200, 512] — same as input, but now CONTEXTUALIZED
# (each token's vector now contains information about all other tokens in the sequence)
out = encoder(x)

------- ATTENTION 1 ------
x.size(): torch.Size([30, 200, 512])
qkv.size(): torch.Size([30, 200, 1536])
qkv.size(): torch.Size([30, 200, 8, 192])
qkv.size(): torch.Size([30, 8, 200, 192])
q size: torch.Size([30, 8, 200, 64]), k size: torch.Size([30, 8, 200, 64]), v size: torch.Size([30, 8, 200, 64]), 
scaled.size() : torch.Size([30, 8, 200, 200])
values.size(): torch.Size([30, 8, 200, 64]), attention.size:torch.Size([30, 8, 200, 200]) 
values.size(): torch.Size([30, 200, 512])
out.size(): torch.Size([30, 200, 512])
------- DROPOUT 1 ------
------- ADD AND LAYER NORMALIZATION 1 ------
Mean (torch.Size([30, 200, 1]))
Standard Deviation  (torch.Size([30, 200, 1]))
y: torch.Size([30, 200, 512])
self.gamma: torch.Size([512]), self.beta: torch.Size([512])
out: torch.Size([30, 200, 512])
------- ATTENTION 2 ------
x after first linear layer: torch.Size([30, 200, 2048])
x after activation: torch.Size([30, 200, 2048])
x after dropout: torch.Size([30, 200, 2048])
x after 2nd linear layer: torch.