# Understanding Transformers by Building One from Scratch

This notebook is a learning-oriented walkthrough of the Transformer
architecture using a minimal PyTorch implementation.

The goal is to understand *why* each component exists and how information
flows through the model, not to train a real language model.


## Why Transformers?

Sequence models like RNNs and LSTMs process tokens sequentially, which:
- limits parallelism
- makes long-range dependencies hard to learn

Transformers remove recurrence entirely and rely on attention mechanisms
to model relationships between tokens.


The notebook breaks the architecture into several key parts. Think of them like steps in a factory line:

- **Token Embeddings:** Computers don't understand words like "Apple." They need numbers. This step turns each word into a list of numbers (a vector).
- **Positional Encoding:** Transformers process all words at once (unlike humans who read left-to-right). This step adds a "time stamp" to each word so the model knows where it sits in the sentence.
- **Multi-Head Attention:** This is the "brain." It allows the model to look at the word "bank" and decide if it means a "river bank" or a "money bank" based on the surrounding words.
- **Feedforward Layer:** After the model "pays attention" to the context, this layer processes that information independently for each word to refine its meaning.

## 🌀 The Transformer Mind Map: A Layman's Guide
Imagine the Transformer is a high-speed translation agency filled with workers. Instead of reading a book page by page, they tear the pages out and look at every word at the exact same time to understand the "big picture" instantly.

1. The Input Stage: "The Passport Office"
Before words enter the brain, they need two things: Identity and Location.

Embeddings: We turn words into lists of numbers because computers can't "read" text.

Positional Encoding: Since the model looks at all words at once, it loses the sense of order. We add a "timestamp" to each word so the model knows which word came first and which came last.

2. The Interaction Stage: "The Cocktail Party" (Attention)
This is the "magic" of the Transformer. Imagine all words in a sentence are at a party.

Queries, Keys, and Values:

- Query: A word asking a question (e.g., "Which word here describes me?").
- Key: A word's "ID badge" (e.g., "I am an adjective").
- Value: The actual information the word holds.

Self-Attention: Each word compares its Query against everyone else's Key to see who is relevant. If there's a match, it grabs that word's Value to update its own meaning.

3. The Processing Stage: "The Individual Thinking" (Feedforward)
Feedforward Network: After "talking" to other words in the Attention step, each word goes into a private booth to think.

The Goal: It processes the new context it just learned (e.g., "I now know that 'bank' refers to money, not a river") independently of the other words.

4. The Safety Net: "The Quality Control"
Residual Connections: To make sure the original meaning of the word doesn't get "garbled" through too many layers, we keep a copy of the input and add it back to the output (like a "shortcut").

Layer Normalization: This keeps the math from getting too wild or "exploding," keeping the numbers in a healthy range for the next layer.

5. The Output Stage: "The Crystal Ball"
The Linear Layer: After several layers of "talking" and "thinking," the model produces a final score for every word in its dictionary.

Generation: It picks the word with the highest score, appends it to the sentence so far, and then feeds the extended sentence back into the start to figure out what the next word should be.

In [1]:
import torch
import torch.nn as nn
import math


## Token Embeddings

Language models operate on token IDs (integers), not raw text.
Embeddings map each token ID to a dense vector representation.

At this stage:
- tokens have *no order*
- tokens do *not* interact with each other


## Positional Encoding

Self-attention alone has no notion of sequence order.
Positional encoding injects information about token positions.

This implementation uses fixed sinusoidal encodings.
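
For reference, the sinusoidal scheme is usually written as follows, where `pos` is the token position and `i` indexes pairs of embedding dimensions. This is the standard formulation from the original Transformer paper; the `div_term` expression in the class below appears to compute the same quantity, since exp(2i · (−ln 10000 / d_model)) equals 10000^(−2i/d_model).

```latex
\begin{aligned}
PE(pos,\, 2i)   &= \sin\!\left(\frac{pos}{10000^{\,2i/d_{\text{model}}}}\right) \\
PE(pos,\, 2i+1) &= \cos\!\left(\frac{pos}{10000^{\,2i/d_{\text{model}}}}\right)
\end{aligned}
```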


In [2]:
class PositionalEncoding(nn.Module):
    def __init__(self, d_model, max_len=5000):
        super(PositionalEncoding, self).__init__()
        pe = torch.zeros(max_len, d_model)
        position = torch.arange(0, max_len, dtype=torch.float).unsqueeze(1)
        div_term = torch.exp(torch.arange(0, d_model, 2).float() * (-math.log(10000.0) / d_model))
        pe[:, 0::2] = torch.sin(position * div_term)  # even dimensions
        pe[:, 1::2] = torch.cos(position * div_term)  # odd dimensions
        pe = pe.unsqueeze(0)  # (1, max_len, d_model), matching batch-first inputs
        self.register_buffer('pe', pe)

    def forward(self, x):
        # x: (batch, seq_len, d_model); slice along the sequence dimension
        return x + self.pe[:, :x.size(1), :]

In [3]:
pe = PositionalEncoding(d_model=16, max_len=10)
dummy = torch.zeros(1, 10, 16)
out = pe(dummy)

print(out.shape)


torch.Size([1, 10, 16])


We confirm that positional encodings preserve shape:
(batch, sequence_length, embedding_dim)
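
A shape check alone cannot tell whether each position actually received its own encoding, because broadcasting can hide indexing mistakes. The sketch below rebuilds the table in a standalone helper (the name `sinusoidal_table` is hypothetical, not part of the notebook) and checks the values as well, assuming batch-first inputs of shape (batch, sequence_length, embedding_dim).

```python
import math
import torch

def sinusoidal_table(max_len, d_model):
    """Return a (max_len, d_model) table of fixed sinusoidal encodings."""
    position = torch.arange(max_len, dtype=torch.float).unsqueeze(1)   # (max_len, 1)
    div_term = torch.exp(
        torch.arange(0, d_model, 2).float() * (-math.log(10000.0) / d_model)
    )                                                                   # (d_model / 2,)
    table = torch.zeros(max_len, d_model)
    table[:, 0::2] = torch.sin(position * div_term)
    table[:, 1::2] = torch.cos(position * div_term)
    return table

table = sinusoidal_table(max_len=10, d_model=16)
x = torch.zeros(1, 10, 16)                        # batch-first dummy input
out = x + table.unsqueeze(0)[:, :x.size(1), :]    # slice along the sequence axis

print(out.shape)                                  # torch.Size([1, 10, 16])
print(out[0, 0, :4])                              # tensor([0., 1., 0., 1.])
print(torch.allclose(out[0, 0], out[0, 1]))       # False: positions differ
```

Position 0 alternates sin(0) = 0 and cos(0) = 1, and later positions should differ from it; if every position printed the same vector, the encoding would carry no order information.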


## Self-Attention: Intuition

Self-attention allows each token to decide which other tokens
are relevant when building its representation.

Conceptually:
- Query: what I'm looking for
- Key: what I offer
- Value: what information I provide
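
The three roles above can be sketched for a single head with plain tensors. This is a minimal illustration with random inputs and hypothetical toy sizes, leaving out the learned projections and the head splitting used in the full class.

```python
import math
import torch

torch.manual_seed(0)

seq_len, d_k = 4, 8                       # hypothetical toy sizes
Q = torch.randn(seq_len, d_k)             # one query per token
K = torch.randn(seq_len, d_k)             # one key per token
V = torch.randn(seq_len, d_k)             # one value per token

scores = Q @ K.T / math.sqrt(d_k)         # (seq_len, seq_len) query-key similarity
weights = torch.softmax(scores, dim=-1)   # each row becomes a distribution
output = weights @ V                      # each token: weighted average of values

print(weights.shape, output.shape)        # torch.Size([4, 4]) torch.Size([4, 8])
print(weights.sum(dim=-1))                # every row sums to 1 (up to rounding)
```

Row `i` of `weights` says how much token `i` draws from every other token when building its new representation.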


In [4]:
class MultiHeadAttention(nn.Module):
    def __init__(self, d_model, num_heads):
        super(MultiHeadAttention, self).__init__()
        assert d_model % num_heads == 0, "d_model must be divisible by num_heads"
        self.d_model = d_model
        self.num_heads = num_heads
        self.d_k = d_model // num_heads

        self.q_linear = nn.Linear(d_model, d_model)
        self.k_linear = nn.Linear(d_model, d_model)
        self.v_linear = nn.Linear(d_model, d_model)
        self.fc_out = nn.Linear(d_model, d_model)

    def forward(self, query, key, value, mask=None):
        batch_size = query.size(0)

        # Linear projections
        Q = self.q_linear(query).view(batch_size, -1, self.num_heads, self.d_k).transpose(1, 2)
        K = self.k_linear(key).view(batch_size, -1, self.num_heads, self.d_k).transpose(1, 2)
        V = self.v_linear(value).view(batch_size, -1, self.num_heads, self.d_k).transpose(1, 2)

        # Scaled Dot-Product Attention
        scores = torch.matmul(Q, K.transpose(-2, -1)) / math.sqrt(self.d_k)
        if mask is not None:
            scores = scores.masked_fill(mask == 0, float('-inf'))
        attention = torch.softmax(scores, dim=-1)

        # Combine heads
        output = torch.matmul(attention, V).transpose(1, 2).contiguous()
        output = output.view(batch_size, -1, self.d_model)
        return self.fc_out(output)

In [5]:
batch = 2
seq_len = 5
d_model = 16
heads = 4

x = torch.randn(batch, seq_len, d_model)
attn = MultiHeadAttention(d_model, heads)
out = attn(x, x, x)

print(out.shape)


torch.Size([2, 5, 16])


Attention preserves the input shape:
(batch, sequence_length, d_model)

Internally, the attention scores have shape
(batch, heads, sequence_length, sequence_length),
which explains the O(n²) cost in the sequence length n.
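
To make the quadratic growth concrete, the sketch below builds the score tensor for a few sequence lengths and counts its entries. The helper name `attention_scores` and the sizes are hypothetical, chosen only for this illustration.

```python
import torch

def attention_scores(batch, heads, seq_len, d_k):
    """Return raw (unscaled) scores of shape (batch, heads, seq_len, seq_len)."""
    Q = torch.randn(batch, heads, seq_len, d_k)
    K = torch.randn(batch, heads, seq_len, d_k)
    return Q @ K.transpose(-2, -1)

for seq_len in (5, 10, 20):
    scores = attention_scores(batch=2, heads=4, seq_len=seq_len, d_k=4)
    print(seq_len, tuple(scores.shape), scores.numel())
# 5 (2, 4, 5, 5) 200
# 10 (2, 4, 10, 10) 800
# 20 (2, 4, 20, 20) 3200
```

Doubling the sequence length quadruples the number of score entries, which is why long sequences become expensive.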

## Feedforward Layer: Why Attention Alone Isn't Enough

Self-attention allows tokens to exchange information with other tokens
in the sequence. However, once the attention weights are computed, each
output is a weighted average of value vectors, which is a **linear mixing**
operation across tokens.

Without an additional non-linear transformation, stacking attention
layers would not significantly increase the model's expressive power.

The feedforward network (FFN) addresses this by:
- applying a non-linear transformation
- operating independently on each token
- increasing representational capacity after attention

Conceptually:
- Attention answers *"Which tokens matter?"*
- Feedforward layers answer *"How should this information be processed?"*

Each Transformer layer includes a position-wise feedforward network
that is applied to every token in the same way.
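
"Position-wise" can be checked directly: running the network on the whole sequence should match running it on each token separately. The sketch below uses a hypothetical stand-in built from `nn.Sequential` with no dropout, so the comparison is deterministic.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

d_model, d_ff = 8, 32                     # hypothetical toy sizes
ffn = nn.Sequential(                      # Linear -> ReLU -> Linear, no dropout
    nn.Linear(d_model, d_ff),
    nn.ReLU(),
    nn.Linear(d_ff, d_model),
)

x = torch.randn(1, 5, d_model)            # (batch, seq_len, d_model)
all_at_once = ffn(x)
# Feed each token through the same network on its own, then restack
one_by_one = torch.stack([ffn(x[:, t]) for t in range(x.size(1))], dim=1)

print(torch.allclose(all_at_once, one_by_one, atol=1e-5))  # True
```

No information moves between tokens in this step; any mixing across positions has to come from attention.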


In [6]:

class FeedForward(nn.Module):
    def __init__(self, d_model, d_ff, dropout=0.1):
        super(FeedForward, self).__init__()
        self.linear1 = nn.Linear(d_model, d_ff)
        self.dropout = nn.Dropout(dropout)
        self.linear2 = nn.Linear(d_ff, d_model)

    def forward(self, x):
        return self.linear2(self.dropout(torch.relu(self.linear1(x))))

## Transformer Block

A Transformer layer combines:
1. Self-attention (token interactions)
2. Feedforward network (token-wise processing)
3. Residual connections
4. Layer normalization
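
Items 3 and 4 follow a single pattern: add the sublayer output back to its input, then normalize. The sketch below shows that pattern with a hypothetical `nn.Linear` standing in for attention or the feedforward network; it assumes the "post-norm" ordering (normalize after the residual add).

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

d_model = 8                               # hypothetical toy size
sublayer = nn.Linear(d_model, d_model)    # stand-in for attention or the FFN
norm = nn.LayerNorm(d_model)

x = torch.randn(2, 5, d_model)            # (batch, seq_len, d_model)
y = norm(x + sublayer(x))                 # residual add, then layer normalization

print(y.shape)                            # torch.Size([2, 5, 8])
print(y.mean(dim=-1).abs().max())         # close to 0: each token is re-centred
print(y.var(dim=-1, unbiased=False))      # close to 1 for every token
```

With freshly initialized `LayerNorm` parameters, each token vector comes out with roughly zero mean and unit variance, which keeps activations in a similar range from layer to layer.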


In [7]:
class TransformerLayer(nn.Module):
    def __init__(self, d_model, num_heads, d_ff, dropout=0.1):
        super(TransformerLayer, self).__init__()
        self.attention = MultiHeadAttention(d_model, num_heads)
        self.norm1 = nn.LayerNorm(d_model)
        self.ff = FeedForward(d_model, d_ff, dropout)
        self.norm2 = nn.LayerNorm(d_model)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x, mask=None):
        # Multi-Head Attention + Residual
        attn_output = self.attention(x, x, x, mask)
        x = self.norm1(x + self.dropout(attn_output))

        # Feedforward + Residual
        ff_output = self.ff(x)
        x = self.norm2(x + self.dropout(ff_output))
        return x

In [8]:
class Transformer(nn.Module):
    def __init__(self, d_model, num_heads, d_ff, num_layers, vocab_size, max_len, dropout=0.1):
        super(Transformer, self).__init__()
        self.embedding = nn.Embedding(vocab_size, d_model)
        self.positional_encoding = PositionalEncoding(d_model, max_len)
        self.layers = nn.ModuleList([
            TransformerLayer(d_model, num_heads, d_ff, dropout) for _ in range(num_layers)
        ])
        self.fc_out = nn.Linear(d_model, vocab_size)

    def forward(self, x, mask=None):
        x = self.embedding(x) * math.sqrt(self.embedding.embedding_dim)
        x = self.positional_encoding(x)
        for layer in self.layers:
            x = layer(x, mask)
        return self.fc_out(x)

In [11]:
# Example Usage
vocab_size = 10000
max_len = 100
d_model = 512
num_heads = 8
d_ff = 2048
num_layers = 6

model = Transformer(d_model, num_heads, d_ff, num_layers, vocab_size, max_len)
input_seq = torch.randint(0, vocab_size, (32, max_len))  # Batch size of 32
output = model(input_seq)

print(output.shape)  # Expected: (32, max_len, vocab_size)


torch.Size([32, 100, 10000])


The model outputs one logit per vocabulary item at every token position, with shape:
(batch, sequence_length, vocab_size)
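
Logits are unnormalized scores. As a minimal sketch, assuming a random tensor as a stand-in for the model output, a softmax over the last dimension turns them into a probability distribution over the vocabulary at each position.

```python
import torch

torch.manual_seed(0)

batch, seq_len, vocab_size = 2, 5, 10     # hypothetical toy sizes
logits = torch.randn(batch, seq_len, vocab_size)   # stand-in for model output

probs = torch.softmax(logits, dim=-1)     # one distribution per position
next_token = probs[:, -1, :].argmax(dim=-1)        # greedy pick at the last position

print(probs.sum(dim=-1))                  # all ones (up to rounding)
print(next_token.shape)                   # torch.Size([2])
```

Since softmax preserves ordering, taking the argmax of the probabilities selects the same token as taking the argmax of the raw logits.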


## Autoregressive Generation

Language models generate text by repeatedly predicting
the next token given previous tokens.


In [12]:
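def generate_text(model, start_token, max_len, vocab_size, device='cpu'):
    """Greedy decoding: repeatedly append the highest-scoring next token."""
    model.eval()  # disable dropout
    generated = torch.tensor([start_token], dtype=torch.long, device=device).unsqueeze(0)
    with torch.no_grad():  # inference only, so skip gradient tracking
        for _ in range(max_len):
            output = model(generated)              # (1, current_len, vocab_size)
            next_token_logits = output[:, -1, :]   # logits at the last position
            next_token = torch.argmax(next_token_logits, dim=-1)
            generated = torch.cat((generated, next_token.unsqueeze(0)), dim=1)
            if next_token.item() == vocab_size - 1:  # last ID is treated as the end token
                break
    return generated.squeeze(0).tolist()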

## Scope and Limitations

This implementation:
- is not trained, so generated token IDs are arbitrary
- does not apply a causal mask (the `mask` argument exists but is never passed)
- is not a production LLM

Modern LLMs build on the same core components, adding optimized attention,
causal masks, large-scale training, and pretrained weights.
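
For illustration, a causal mask can be sketched as a lower-triangular matrix applied to the scores before the softmax. The example uses all-zero scores as a stand-in so the effect of the mask is easy to read; the `mask == 0` convention is assumed to match the `masked_fill` call in `MultiHeadAttention`.

```python
import torch

seq_len = 4
# Lower-triangular matrix: position i may attend to positions 0..i only
mask = torch.tril(torch.ones(seq_len, seq_len))

scores = torch.zeros(seq_len, seq_len)    # stand-in for raw attention scores
scores = scores.masked_fill(mask == 0, float('-inf'))
weights = torch.softmax(scores, dim=-1)

print(weights)
# Row 0 attends only to itself; row 3 spreads 0.25 over positions 0..3.
# Entries above the diagonal are exactly 0, so no token sees the future.
```

A (seq_len, seq_len) mask like this should broadcast against scores of shape (batch, heads, seq_len, seq_len), so it could plausibly be passed as the `mask` argument of the model's `forward`, though that path is not exercised in this notebook.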


In [17]:
vocab_size = 100  # Example vocabulary size (including <END>)
d_model = 128
num_heads = 4
d_ff = 256
num_layers = 3
max_len = 50

# Instantiate and move model to device
device = 'cpu'
model = Transformer(d_model, num_heads, d_ff, num_layers, vocab_size, max_len).to(device)

# Example usage
start_token = 1  # Assume 1 is the start token
output = generate_text(model, start_token, max_len=20, vocab_size=vocab_size, device=device)
print("Generated Text:", output)

Generated Text: [1, 7, 25, 8, 16, 23, 54, 22, 18, 24, 17, 35, 46, 43, 6, 35, 45, 77, 46, 43, 92]
