
Transformer — "Attention Is All You Need"

A clean, faithful PyTorch implementation of the original Transformer architecture from:

Attention Is All You Need
Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, Illia Polosukhin
NeurIPS 2017 — https://arxiv.org/abs/1706.03762

Every component maps directly to a numbered section of the paper. The base-model hyperparameters from Table 3 are used as defaults throughout.


Table of Contents

  • Architecture Overview
  • Project Structure
  • Quick Start
  • Installation
  • Module Reference
  • Hyperparameters
  • Usage Examples
  • Design Decisions
  • Scope & Limitations
  • License

Architecture Overview

The Transformer is a sequence-to-sequence model built entirely on attention — no recurrence and no convolutions.

src tokens ──► Embedding × √d_model ──► + Positional Encoding
                                               │
                                    ┌──────────▼───────────┐
                                    │   Encoder Layer × N  │
                                    │  ┌────────────────┐  │
                                    │  │ Self-Attention │  │
                                    │  ├────────────────┤  │
                                    │  │   Add & Norm   │  │
                                    │  ├────────────────┤  │
                                    │  │  Feed-Forward  │  │
                                    │  ├────────────────┤  │
                                    │  │   Add & Norm   │  │
                                    │  └────────────────┘  │
                                    └──────────┬───────────┘
                                     encoder memory (B, S, d_model)
                                               │
tgt tokens ──► Embedding × √d_model ──► + Positional Encoding
                                               │
                                    ┌──────────▼───────────┐
                                    │   Decoder Layer × N  │
                                    │  ┌────────────────┐  │
                                    │  │Masked Self-Attn│  │
                                    │  ├────────────────┤  │
                                    │  │   Add & Norm   │  │
                                    │  ├────────────────┤  │
                                    │  │Cross-Attention │◄─┼── encoder memory
                                    │  ├────────────────┤  │
                                    │  │   Add & Norm   │  │
                                    │  ├────────────────┤  │
                                    │  │  Feed-Forward  │  │
                                    │  ├────────────────┤  │
                                    │  │   Add & Norm   │  │
                                    │  └────────────────┘  │
                                    └──────────┬───────────┘
                                               │
                                    Linear (d_model → vocab_size)
                                               │
                                          logits (B, T, V)

Project Structure

transformer/
├── transformer/              # Python package — one file per component
│   ├── __init__.py           # Public API (all classes re-exported here)
│   ├── masks.py              # make_causal_mask, make_padding_mask
│   ├── attention.py          # ScaledDotProductAttention, MultiHeadAttention
│   ├── blocks.py             # PositionwiseFeedForward, PositionalEncoding,
│   │                         #   SublayerConnection
│   ├── encoder.py            # EncoderLayer, Encoder
│   ├── decoder.py            # DecoderLayer, Decoder
│   └── model.py              # Transformer (full model)
├── transformer.py            # Original single-file implementation (reference)
├── smoke_test.py             # Shape + correctness checks
├── requirements.txt          # PyTorch ≥ 2.0
└── README.md

Quick Start

# 1. Create and activate the virtual environment
python3 -m venv venv
source venv/bin/activate

# 2. Install dependencies
pip install -r requirements.txt

# 3. Run the smoke test
python smoke_test.py

Expected output:

Device       : cpu
Output shape : (2, 12, 10000)  ✓
Causal mask  : future tokens masked  ✓
Head dim     : d_k = d_v = 64  ✓
Parameters   : 59,463,680  (~60-65 M expected for base model)
Param count  : within expected range  ✓

All checks passed.

Installation

Prerequisites

  • Python 3.9+
  • pip

Steps

# Clone / navigate to the project folder
cd "path/to/transformer"

# Create a virtual environment
python3 -m venv venv

# Activate it
source venv/bin/activate          # Linux / macOS
# venv\Scripts\activate           # Windows

# Install PyTorch (CPU-only; see https://pytorch.org for GPU builds)
pip install -r requirements.txt

# Optionally install an editable local package
pip install -e .

GPU support: replace the torch line in requirements.txt with the appropriate CUDA wheel from https://pytorch.org/get-started/locally/.


Module Reference

Masks

transformer.masks — make_causal_mask, make_padding_mask

| Function | Signature | Description |
|---|---|---|
| `make_causal_mask` | `(size, device) → (1, 1, T, T)` | Upper-triangular bool mask; `True` = future position, blocked. |
| `make_padding_mask` | `(seq, pad_idx=0) → (B, 1, 1, T)` | `True` at pad token positions. |

Both masks broadcast over (B, h, T_q, T_k) and can be combined with |.

from transformer import make_causal_mask, make_padding_mask

causal   = make_causal_mask(T, device)   # (1, 1, T, T)
pad_mask = make_padding_mask(tgt)        # (B, 1, 1, T)
tgt_mask = causal | pad_mask             # (B, 1, T, T)

Scaled Dot-Product Attention

transformer.attention.ScaledDotProductAttention — §3.2.1

$$\text{Attention}(Q, K, V) = \text{softmax}\!\left(\frac{QK^\top}{\sqrt{d_k}}\right)V$$

  • Scales scores by $1/\sqrt{d_k}$ to stabilise gradients.
  • Fills masked positions with $-\infty$ before softmax → zero weight.
  • Applies dropout to attention weights.
attn = ScaledDotProductAttention(dropout=0.1)
output, weights = attn(q, k, v, mask=tgt_mask)
# output:  (..., T_q, d_v)
# weights: (..., T_q, T_k)
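For reference, the three bullets above can be collapsed into a short standalone function. This is a minimal sketch of the computation, not the repo's `ScaledDotProductAttention` class:

```python
import math
import torch

def scaled_dot_product_attention(q, k, v, mask=None, dropout=None):
    # Scores (..., T_q, T_k), scaled by 1/sqrt(d_k) to stabilise gradients.
    d_k = q.size(-1)
    scores = q @ k.transpose(-2, -1) / math.sqrt(d_k)
    if mask is not None:
        # Convention: True = blocked. -inf scores get zero softmax weight.
        scores = scores.masked_fill(mask, float("-inf"))
    weights = scores.softmax(dim=-1)
    if dropout is not None:
        weights = dropout(weights)  # dropout on attention weights
    return weights @ v, weights
```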

Multi-Head Attention

transformer.attention.MultiHeadAttention — §3.2.2

$$\text{MultiHead}(Q,K,V) = \text{Concat}(\text{head}_1, \ldots, \text{head}_h)\,W^O$$ $$\text{head}_i = \text{Attention}(QW^Q_i,\; KW^K_i,\; VW^V_i)$$

  • Projection matrices $W^Q, W^K, W^V, W^O$ are all Linear with no bias.
  • Per-head projections are implemented as a single batched Linear(d_model → d_model) then reshaped for efficiency.
  • $d_k = d_v = d_\text{model} / h = 64$ for the base model.
| Parameter | Default | Paper |
|---|---|---|
| `d_model` | 512 | 512 |
| `h` | 8 | 8 |
| `dropout` | 0.1 | 0.1 |

mha = MultiHeadAttention(d_model=512, h=8, dropout=0.1)
out = mha(query, key, value, mask=None)  # (B, T_q, d_model)

Three uses in the full model:

| Location | Q source | K, V source | Mask |
|---|---|---|---|
| Encoder self-attention | encoder input | encoder input | padding mask |
| Decoder masked self-attention | decoder input | decoder input | causal + padding mask |
| Decoder cross-attention | decoder hidden | encoder output | source padding mask |
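The batched head-projection trick described above can be shown in isolation. The shapes below assume the base model's d_model = 512 and h = 8; variable names are illustrative:

```python
import torch
from torch import nn

B, T, d_model, h = 2, 5, 512, 8
d_k = d_model // h  # 64 per head

x = torch.randn(B, T, d_model)
w_q = nn.Linear(d_model, d_model, bias=False)  # one GEMM covers all h heads

# Project once, split the last dim into (h, d_k), and move heads forward
# so attention can run per head: (B, T, d_model) -> (B, h, T, d_k).
q = w_q(x).view(B, T, h, d_k).transpose(1, 2)

# The inverse reshape (used after attention) merges heads back losslessly.
merged = q.transpose(1, 2).contiguous().view(B, T, d_model)
```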

Position-wise Feed-Forward Network

transformer.blocks.PositionwiseFeedForward — §3.3

$$\text{FFN}(x) = \max(0,\, xW_1 + b_1)\,W_2 + b_2$$

Applied identically and independently to each position (like a 1×1 convolution over the sequence dimension).

| Parameter | Default | Paper |
|---|---|---|
| `d_model` | 512 | 512 |
| `d_ff` | 2048 | 2048 |
| `dropout` | 0.1 | 0.1 |

The hidden dimension $d_\text{ff} = 2048$ is 4× the model dimension — this is where the bulk of per-token computation happens.
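The two-layer expand-then-contract structure fits in a few lines. A minimal sketch; the repo's class may differ in details such as dropout placement:

```python
import torch
from torch import nn

class PositionwiseFeedForwardSketch(nn.Module):
    """FFN(x) = max(0, xW1 + b1)W2 + b2, applied independently per position."""
    def __init__(self, d_model=512, d_ff=2048, dropout=0.1):
        super().__init__()
        self.w1 = nn.Linear(d_model, d_ff)   # expand: 512 -> 2048
        self.w2 = nn.Linear(d_ff, d_model)   # contract: 2048 -> 512
        self.dropout = nn.Dropout(dropout)

    def forward(self, x):
        # ReLU between the two projections; dropout on the hidden activations.
        return self.w2(self.dropout(self.w1(x).relu()))
```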


Positional Encoding

transformer.blocks.PositionalEncoding — §3.5

$$PE_{(pos, 2i)} = \sin\!\left(\frac{pos}{10000^{2i/d_\text{model}}}\right) \qquad PE_{(pos, 2i+1)} = \cos\!\left(\frac{pos}{10000^{2i/d_\text{model}}}\right)$$

  • Non-learned — computed once at construction and stored as a buffer.
  • Allows the model to attend to relative positions via linear combinations of sin/cos.
  • Added to embeddings that are first scaled by $\sqrt{d_\text{model}}$ (§3.4).
  • Division terms are computed in log-space for numerical stability.
  • Supports sequences up to max_len = 5000 by default.
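The log-space construction of the encoding table can be sketched as follows (the function name is illustrative, not the repo's API):

```python
import math
import torch

def sinusoidal_pe(max_len, d_model):
    position = torch.arange(max_len).unsqueeze(1)  # (max_len, 1)
    # exp(2i * -ln(10000)/d_model) == 1 / 10000^(2i/d_model),
    # computed in log-space to avoid large intermediate powers.
    div_term = torch.exp(torch.arange(0, d_model, 2) * (-math.log(10000.0) / d_model))
    pe = torch.zeros(max_len, d_model)
    pe[:, 0::2] = torch.sin(position * div_term)  # even dims: sine
    pe[:, 1::2] = torch.cos(position * div_term)  # odd dims: cosine
    return pe
```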

Sublayer Connection

transformer.blocks.SublayerConnection — §3.1

$$\text{output} = \text{LayerNorm}\bigl(x + \text{Dropout}(\text{sublayer}(x))\bigr)$$

Wraps every attention and FFN sub-layer in both the encoder and decoder. The post-norm formulation matches the original paper (pre-norm variants are popular in practice but not in this implementation).
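The post-norm wrapper is small enough to show whole. An illustrative sketch, not necessarily identical to the repo's SublayerConnection:

```python
import torch
from torch import nn

class PostNormSublayer(nn.Module):
    """Post-norm residual: LayerNorm(x + Dropout(sublayer(x)))."""
    def __init__(self, d_model=512, dropout=0.1):
        super().__init__()
        self.norm = nn.LayerNorm(d_model)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x, sublayer):
        # sublayer is any callable mapping (B, T, d_model) -> (B, T, d_model),
        # e.g. a bound attention or FFN forward.
        return self.norm(x + self.dropout(sublayer(x)))
```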


Encoder

transformer.encoder — §3.1

Encoder stacks N = 6 identical EncoderLayer blocks:

src → Embedding × √d_model → + PositionalEncoding → EncoderLayer × N → LayerNorm → memory

Each EncoderLayer contains:

x → [Self-Attention] → SublayerConnection → [FFN] → SublayerConnection → x'

Decoder

transformer.decoder — §3.1

Decoder stacks N = 6 identical DecoderLayer blocks:

tgt → Embedding × √d_model → + PositionalEncoding → DecoderLayer × N → LayerNorm → hidden

Each DecoderLayer contains three sub-layers:

x → [Masked Self-Attention] → SublayerConnection
  → [Cross-Attention (K,V from encoder)] → SublayerConnection
  → [FFN] → SublayerConnection → x'

The masked self-attention uses the causal mask to prevent position $i$ from attending to any position $j > i$.
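Under the mask convention used throughout this repo (True = blocked), the blocking pattern is easy to inspect on a small example:

```python
import torch

T = 4
# Row i has True at every column j > i: those positions are blocked,
# so position i can only attend to itself and earlier positions.
causal = torch.triu(torch.ones(T, T, dtype=torch.bool), diagonal=1)
```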


Transformer

transformer.model.Transformer — §3

The top-level model:

model = Transformer(
    src_vocab_size=32_000,
    tgt_vocab_size=32_000,
    d_model=512,   # model dimension
    h=8,           # attention heads
    d_ff=2048,     # FFN inner dim
    N=6,           # encoder / decoder layers
    dropout=0.1,
    max_len=5000,
)

logits = model(src, tgt, src_mask=src_mask, tgt_mask=tgt_mask)
# logits: (B, T, tgt_vocab_size) — raw pre-softmax scores

Weight initialisation (§5.3): Xavier uniform for all Linear layers; scaled normal ($\sigma = d_\text{model}^{-0.5}$) for Embedding layers.
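Under that scheme, an initialisation helper would look roughly like this (a sketch; the repo may apply it inside the model's constructor instead):

```python
import torch
from torch import nn

def init_weights(model, d_model=512):
    # Xavier uniform for Linear layers; N(0, d_model^-0.5) for Embeddings.
    for module in model.modules():
        if isinstance(module, nn.Linear):
            nn.init.xavier_uniform_(module.weight)
            if module.bias is not None:
                nn.init.zeros_(module.bias)
        elif isinstance(module, nn.Embedding):
            nn.init.normal_(module.weight, mean=0.0, std=d_model ** -0.5)
```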

Convenience methods for stepwise inference:

memory = model.encode(src, src_mask)             # (B, S, d_model)
hidden = model.decode(tgt, memory, tgt_mask, src_mask)  # (B, T, d_model)

Hyperparameters

Paper base model defaults (Table 3 of the paper):

| Parameter | Symbol | Value | Notes |
|---|---|---|---|
| Model dimension | $d_\text{model}$ | 512 | Embedding and hidden size |
| Feed-forward dim | $d_\text{ff}$ | 2048 | FFN inner layer; 4× $d_\text{model}$ |
| Attention heads | $h$ | 8 | |
| Head dimension | $d_k = d_v$ | 64 | $= d_\text{model} / h$ |
| Encoder/decoder layers | $N$ | 6 | |
| Dropout | $p$ | 0.1 | Applied after attention, FFN, embedding |
| Max sequence length | — | 5000 | PE buffer size |
| ~Parameters | — | ~60 M | Depends on vocab size / weight tying |
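The ~60 M figure can be reproduced by simple arithmetic under this implementation's apparent choices (bias-free attention and output projections, biased FFN layers, post-norm LayerNorms plus a final one per stack, separate untied embeddings). With a 10,000-token vocabulary this matches the smoke test's 59,463,680:

```python
def transformer_param_count(vocab, d_model=512, d_ff=2048, N=6):
    """Estimate base-model parameters under the assumptions stated above."""
    attn = 4 * d_model * d_model                            # W_Q, W_K, W_V, W_O (no bias)
    ffn = d_model * d_ff + d_ff + d_ff * d_model + d_model  # two Linears with bias
    ln = 2 * d_model                                        # LayerNorm gamma + beta
    enc_layer = attn + ffn + 2 * ln
    dec_layer = 2 * attn + ffn + 3 * ln                     # extra cross-attention
    encoder = N * enc_layer + ln                            # + final LayerNorm
    decoder = N * dec_layer + ln
    embeddings = 2 * vocab * d_model                        # separate src + tgt
    output = d_model * vocab                                # output projection, no bias
    return encoder + decoder + embeddings + output
```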

All parameters are constructor arguments so the "big model" (or any other variant) is easy to configure:

# Paper "big" model
big = Transformer(
    src_vocab_size=37_000,
    tgt_vocab_size=37_000,
    d_model=1024,
    h=16,
    d_ff=4096,
    N=6,
    dropout=0.3,
)

Usage Examples

Building masks

import torch
from transformer import make_causal_mask, make_padding_mask

PAD = 0
src = torch.tensor([[5, 3, 7, PAD, PAD]])  # (1, 5)
tgt = torch.tensor([[2, 8, PAD]])          # (1, 3)

src_mask = make_padding_mask(src, PAD)         # (1, 1, 1, 5)
causal   = make_causal_mask(tgt.size(1), src.device)  # (1, 1, 3, 3)
tgt_mask = causal | make_padding_mask(tgt, PAD)        # (1, 1, 3, 3)

Forward pass

from transformer import Transformer

model = Transformer(src_vocab_size=10_000, tgt_vocab_size=10_000)
model.eval()

with torch.no_grad():
    logits = model(src, tgt, src_mask=src_mask, tgt_mask=tgt_mask)
    # logits: (1, 3, 10000)

probs = logits.softmax(dim=-1)
predicted_ids = logits.argmax(dim=-1)  # (1, 3)

Greedy autoregressive decoding (example sketch)

model.eval()
with torch.no_grad():
    memory = model.encode(src, src_mask)              # encode once

    # Start with <BOS> token
    ys = torch.full((1, 1), BOS_IDX, dtype=torch.long)

    for _ in range(max_len):
        T = ys.size(1)
        tgt_mask = make_causal_mask(T, src.device)
        hidden = model.decode(ys, memory, tgt_mask, src_mask)
        next_logits = model.output_projection(hidden[:, -1])  # (1, vocab)
        next_token  = next_logits.argmax(dim=-1, keepdim=True)  # (1, 1)
        ys = torch.cat([ys, next_token], dim=1)
        if next_token.item() == EOS_IDX:
            break

Using individual components

from transformer import MultiHeadAttention, PositionalEncoding

# Stand-alone multi-head attention
mha = MultiHeadAttention(d_model=512, h=8, dropout=0.1)
out = mha(query, key, value, mask=None)

# Stand-alone positional encoding
pe = PositionalEncoding(d_model=512, dropout=0.1, max_len=5000)
x_with_pe = pe(x)  # x: (B, T, 512)

Design Decisions

| Decision | Rationale |
|---|---|
| Post-norm (LayerNorm after residual) | Matches the original paper §3.1; pre-norm is more stable to train but changes the architecture. |
| No bias on projection matrices | The paper uses pure linear projections $W^Q, W^K, W^V, W^O$; biases are absent. |
| Batched head projections | Linear(d_model → d_model) then reshape is equivalent to $h$ separate Linear(d_model → d_k) calls but faster due to a single GEMM. |
| Sinusoidal PE as a buffer | Non-learned; matches §3.5 exactly. Can be swapped for learned PE by replacing PositionalEncoding. |
| Division terms in log-space | exp(arange * -log(10000) / d_model) avoids computing large intermediate powers. |
| Xavier uniform init | Standard for Transformer-like models; the paper describes a warm-up schedule (§5.3) but does not specify an exact init. |
| Separate src/tgt vocabularies | Allows different source and target languages. The paper optionally shares weights — easy to add by passing the same Embedding. |
| mask: True = block | Consistent convention across all mask functions; positions where the mask is True receive a $-\infty$ score. |

Scope & Limitations

This implementation covers every architectural component from the paper. The following are intentionally out of scope:

  • No training loop / optimizer / learning rate scheduler — the paper's schedule with warm-up steps is non-trivial; adding one would conflate architecture with training code.
  • No tokenization or dataset loading — task-agnostic.
  • No beam search — the greedy decoding sketch above is included as an example only.
  • No weight tying — the paper optionally ties the src embedding, tgt embedding, and output projection weights (§3.4); straightforward to add.
  • No label smoothing — used in the paper's training (§5.4) but belongs in the loss function, not the model.
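For reference, the weight tying mentioned above is essentially a one-liner in PyTorch. The standalone modules below are hypothetical; adapt the names to the actual model's attributes:

```python
import torch
from torch import nn

# Hypothetical standalone modules — in the real model these would be the
# target embedding and the final output projection.
vocab, d_model = 10_000, 512
embedding = nn.Embedding(vocab, d_model)
output_projection = nn.Linear(d_model, vocab, bias=False)

# Tie: the output projection reuses the embedding matrix (§3.4), so both
# point at the same (vocab, d_model) tensor and share gradients.
output_projection.weight = embedding.weight
```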

License

This implementation is released for educational and research use.
Paper copyright belongs to the original authors.
