This repository contains a complete PyTorch implementation of the Transformer architecture introduced in the groundbreaking paper "Attention is All You Need" by Vaswani et al. (2017).
Architecture

```
Input → Embedding → Positional Encoding
                  ↓
Encoder Stack (N layers)
  ├── Multi-Head Attention
  ├── Add & Norm
  ├── Feed Forward
  └── Add & Norm
                  ↓
Decoder Stack (N layers)
  ├── Masked Multi-Head Attention
  ├── Add & Norm
  ├── Cross Multi-Head Attention
  ├── Add & Norm
  ├── Feed Forward
  └── Add & Norm
                  ↓
Linear → Softmax → Output
```
- Multi-Head Attention - Parallel attention mechanisms for capturing different types of relationships
- Positional Encoding - Sinusoidal position embeddings for sequence order understanding
- Encoder-Decoder Architecture - Complete transformer stack with residual connections
- Masking Support - Proper attention masking for causal and padding tokens (a mask-building sketch follows this list)
- Optimized Implementation - Efficient PyTorch operations with proper tensor shapes
- Bug Fixes - Issues from the original implementation have been corrected
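The exact mask shapes expected by this implementation are not spelled out here, but a common convention is a boolean padding mask for the source and a combined padding-plus-causal mask for the target. The sketch below assumes padding token id 0 and broadcastable `(batch, 1, len, len)` masks; adjust it to match whatever shapes `model(src, tgt)` actually consumes.

```python
import torch

def make_masks(src, tgt, pad_idx=0):
    # Padding mask: True where a source token is real, False where it is padding.
    src_mask = (src != pad_idx).unsqueeze(1).unsqueeze(2)            # (batch, 1, 1, src_len)

    # Causal mask: lower-triangular, so position i can only attend to positions <= i.
    tgt_len = tgt.size(1)
    causal = torch.tril(torch.ones(tgt_len, tgt_len, dtype=torch.bool, device=tgt.device))

    # Combine padding and causal constraints for decoder self-attention.
    tgt_mask = (tgt != pad_idx).unsqueeze(1).unsqueeze(2) & causal   # (batch, 1, tgt_len, tgt_len)
    return src_mask, tgt_mask
```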
```bash
# Clone the repository
git clone https://github.com/yourusername/Transformer-Model.git
cd Transformer-Model

# Install dependencies
pip install torch torchvision torchaudio
pip install numpy matplotlib
```
```python
import torch
from model import Transformer

# Initialize model parameters
src_vocab_size = 10000
tgt_vocab_size = 10000
d_model = 512
num_heads = 8
num_layers = 6
d_ff = 2048
max_seq_length = 100
dropout = 0.1

# Create the model
model = Transformer(
    src_vocab_size=src_vocab_size,
    tgt_vocab_size=tgt_vocab_size,
    d_model=d_model,
    num_heads=num_heads,
    num_layers=num_layers,
    d_ff=d_ff,
    max_seq_length=max_seq_length,
    dropout=dropout
)

# Example forward pass
batch_size = 32
src_seq_len = 50
tgt_seq_len = 40

src = torch.randint(1, src_vocab_size, (batch_size, src_seq_len))
tgt = torch.randint(1, tgt_vocab_size, (batch_size, tgt_seq_len))

output = model(src, tgt)
print(f"Output shape: {output.shape}")  # [batch_size, tgt_seq_len, tgt_vocab_size]
```
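At inference time there is no ground-truth target to feed in, so decoding is typically done autoregressively. The greedy-decoding sketch below assumes `model(src, tgt)` returns logits of shape `(batch, tgt_len, tgt_vocab_size)` as above, and uses token id 1 as a hypothetical start-of-sequence marker; substitute whatever special token ids your vocabulary actually uses.

```python
# Greedy autoregressive decoding sketch (token id 1 as a hypothetical BOS marker)
model.eval()
with torch.no_grad():
    generated = torch.full((batch_size, 1), 1, dtype=torch.long)  # start with BOS
    for _ in range(max_seq_length - 1):
        logits = model(src, generated)                    # (batch, cur_len, tgt_vocab_size)
        next_token = logits[:, -1, :].argmax(dim=-1, keepdim=True)
        generated = torch.cat([generated, next_token], dim=1)

print(generated.shape)  # (batch_size, max_seq_length)
```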
Multi-Head Attention

The core of the Transformer: it allows the model to attend to different positions simultaneously.

```python
class MultiHeadAttention(nn.Module):
    def __init__(self, d_model, num_heads):
        # Creates multiple attention heads
        # Each head learns different relationships
        ...
```

Key Features:
- Parallel attention computation
- Scaled dot-product attention
- Proper head splitting and combining
- Optional masking support
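The heart of each head is scaled dot-product attention. The repository's exact implementation is not reproduced here, but a minimal sketch of the computation looks like this (shapes assume the head dimension has already been split out):

```python
import math
import torch

def scaled_dot_product_attention(q, k, v, mask=None):
    # q, k, v: (batch, num_heads, seq_len, d_k)
    scores = q @ k.transpose(-2, -1) / math.sqrt(q.size(-1))   # similarity of queries and keys
    if mask is not None:
        scores = scores.masked_fill(mask == 0, float('-inf'))  # block disallowed positions
    attn = torch.softmax(scores, dim=-1)                       # attention weights sum to 1
    return attn @ v                                            # weighted sum of values
```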
Positional Encoding

Since Transformers have no built-in notion of sequence order, positional information is added to the embeddings:

```python
class PositionalEncoding(nn.Module):
    def __init__(self, d_model, max_seq_length):
        # Uses sinusoidal functions:
        # PE(pos, 2i)   = sin(pos / 10000^(2i/d_model))
        # PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model))
        ...
```
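For reference, here is a minimal sketch of how such a sinusoidal table can be built and added to the input embeddings. This follows the standard formulation from the paper; the repository's own module may differ in details such as dropout or buffer naming.

```python
import math
import torch
import torch.nn as nn

class SinusoidalPositionalEncoding(nn.Module):
    def __init__(self, d_model, max_seq_length):
        super().__init__()
        pe = torch.zeros(max_seq_length, d_model)
        position = torch.arange(0, max_seq_length, dtype=torch.float).unsqueeze(1)
        div_term = torch.exp(torch.arange(0, d_model, 2).float() * (-math.log(10000.0) / d_model))
        pe[:, 0::2] = torch.sin(position * div_term)  # even dimensions
        pe[:, 1::2] = torch.cos(position * div_term)  # odd dimensions
        self.register_buffer('pe', pe.unsqueeze(0))   # (1, max_seq_length, d_model)

    def forward(self, x):
        # x: (batch, seq_len, d_model)
        return x + self.pe[:, :x.size(1)]
```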
Encoder & Decoder Layers
Encoder Layer:
- Self-attention mechanism
- Position-wise feed-forward network
- Residual connections and layer normalization
Decoder Layer:
- Masked self-attention
- Cross-attention with encoder output
- Position-wise feed-forward network
- Residual connections and layer normalization
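To make the "Add & Norm" pattern concrete, here is a hedged sketch of an encoder layer built from PyTorch's stock `nn.MultiheadAttention` rather than this repository's `MultiHeadAttention` class; the real `EncoderLayer` may order dropout and normalization slightly differently.

```python
import torch.nn as nn

class EncoderLayerSketch(nn.Module):
    def __init__(self, d_model, num_heads, d_ff, dropout=0.1):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(d_model, num_heads, dropout=dropout, batch_first=True)
        self.feed_forward = nn.Sequential(
            nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model)
        )
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x, key_padding_mask=None):
        # Self-attention sub-layer with residual connection and layer norm
        attn_out, _ = self.self_attn(x, x, x, key_padding_mask=key_padding_mask)
        x = self.norm1(x + self.dropout(attn_out))
        # Position-wise feed-forward sub-layer with residual connection and layer norm
        x = self.norm2(x + self.dropout(self.feed_forward(x)))
        return x
```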
| Parameter | Default | Description |
|---|---|---|
| `d_model` | 512 | Model dimension |
| `num_heads` | 8 | Number of attention heads |
| `num_layers` | 6 | Number of encoder/decoder layers |
| `d_ff` | 2048 | Feed-forward dimension |
| `dropout` | 0.1 | Dropout rate |
| `max_seq_length` | 100 | Maximum sequence length |
```python
import torch.optim as optim
import torch.nn as nn

# Initialize model and optimizer
model = Transformer(...)
optimizer = optim.Adam(model.parameters(), lr=0.0001, betas=(0.9, 0.98), eps=1e-9)
criterion = nn.CrossEntropyLoss(ignore_index=0)  # ignore padding tokens in the loss

# Training loop
model.train()
for epoch in range(num_epochs):
    for batch in dataloader:
        src, tgt = batch

        # Forward pass (teacher forcing: decoder input is the target shifted right)
        output = model(src, tgt[:, :-1])

        # Calculate loss against the target shifted left
        loss = criterion(
            output.reshape(-1, tgt_vocab_size),
            tgt[:, 1:].reshape(-1)
        )

        # Backward pass
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

    print(f'Epoch: {epoch}, Loss: {loss.item():.4f}')
```
- Batch Size: Use larger batch sizes for better GPU utilization
- Learning Rate: Start with 0.0001 and use learning rate scheduling
- Warmup: Implement learning rate warmup for better convergence (see the scheduler sketch after this list)
- Memory: Use gradient checkpointing for very large models
- Mixed Precision: Enable automatic mixed precision for faster training
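One way to combine the learning-rate and warmup points is the schedule from the original paper: the rate grows linearly for `warmup_steps` steps and then decays with the inverse square root of the step number. Below is a sketch using `LambdaLR`; the `warmup_steps=4000` value is the paper's default, not something this repository prescribes.

```python
import torch.optim as optim
from torch.optim.lr_scheduler import LambdaLR

def noam_lr(step, d_model=512, warmup_steps=4000):
    # lrate = d_model^-0.5 * min(step^-0.5, step * warmup_steps^-1.5)
    step = max(step, 1)
    return (d_model ** -0.5) * min(step ** -0.5, step * warmup_steps ** -1.5)

optimizer = optim.Adam(model.parameters(), lr=1.0, betas=(0.9, 0.98), eps=1e-9)  # lr is scaled by the lambda
scheduler = LambdaLR(optimizer, lr_lambda=noam_lr)

# Inside the training loop, call scheduler.step() after each optimizer.step()
```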
Contributions are welcome! Please feel free to submit a Pull Request. For major changes, please open an issue first to discuss what you would like to change.
- Fork the Project
- Create your Feature Branch (`git checkout -b feature/AmazingFeature`)
- Commit your Changes (`git commit -m 'Add some AmazingFeature'`)
- Push to the Branch (`git push origin feature/AmazingFeature`)
- Open a Pull Request
- Attention Is All You Need - Original Transformer paper
- The Annotated Transformer - Detailed explanation
- PyTorch Transformer Tutorial
This project is licensed under the MIT License - see the LICENSE file for details.