This repository contains a complete PyTorch implementation of the Transformer architecture introduced in the groundbreaking paper "Attention is All You Need" by Vaswani et al. (2017).
Architecture

```
Input → Embedding → Positional Encoding
                  ↓
Encoder Stack (N layers)
  ├── Multi-Head Attention
  ├── Add & Norm
  ├── Feed Forward
  └── Add & Norm
                  ↓
Decoder Stack (N layers)
  ├── Masked Multi-Head Attention
  ├── Add & Norm
  ├── Cross Multi-Head Attention
  ├── Add & Norm
  ├── Feed Forward
  └── Add & Norm
                  ↓
Linear → Softmax → Output
```
- Multi-Head Attention - Parallel attention mechanisms for capturing different types of relationships
- Positional Encoding - Sinusoidal position embeddings for sequence order understanding
- Encoder-Decoder Architecture - Complete transformer stack with residual connections
- Masking Support - Proper attention masking for causal and padding tokens (a mask-building sketch follows this list)
- Optimized Implementation - Efficient PyTorch operations with proper tensor shapes
- Bug Fixes - Issues from the original implementation have been corrected
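The exact mask shapes expected by this implementation are not spelled out here, but a common convention is a boolean padding mask for the source and a combined padding-plus-causal mask for the target. The sketch below assumes padding token id 0 and broadcastable `(batch, 1, len, len)` masks; adjust it to match whatever shapes `model(src, tgt)` actually consumes.

```python
import torch

def make_masks(src, tgt, pad_idx=0):
    # Padding mask: True where a source token is real, False where it is padding.
    src_mask = (src != pad_idx).unsqueeze(1).unsqueeze(2)            # (batch, 1, 1, src_len)

    # Causal mask: lower-triangular, so position i can only attend to positions <= i.
    tgt_len = tgt.size(1)
    causal = torch.tril(torch.ones(tgt_len, tgt_len, dtype=torch.bool, device=tgt.device))

    # Combine padding and causal constraints for decoder self-attention.
    tgt_mask = (tgt != pad_idx).unsqueeze(1).unsqueeze(2) & causal   # (batch, 1, tgt_len, tgt_len)
    return src_mask, tgt_mask
```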
```bash
# Clone the repository
git clone https://github.com/yourusername/Transformer-Model.git
cd Transformer-Model

# Install dependencies
pip install torch torchvision torchaudio
pip install numpy matplotlib
```
```python
import torch
from model import Transformer

# Initialize model parameters
src_vocab_size = 10000
tgt_vocab_size = 10000
d_model = 512
num_heads = 8
num_layers = 6
d_ff = 2048
max_seq_length = 100
dropout = 0.1

# Create the model
model = Transformer(
    src_vocab_size=src_vocab_size,
    tgt_vocab_size=tgt_vocab_size,
    d_model=d_model,
    num_heads=num_heads,
    num_layers=num_layers,
    d_ff=d_ff,
    max_seq_length=max_seq_length,
    dropout=dropout
)

# Example forward pass
batch_size = 32
src_seq_len = 50
tgt_seq_len = 40

src = torch.randint(1, src_vocab_size, (batch_size, src_seq_len))
tgt = torch.randint(1, tgt_vocab_size, (batch_size, tgt_seq_len))

output = model(src, tgt)
print(f"Output shape: {output.shape}")  # [batch_size, tgt_seq_len, tgt_vocab_size]
```
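At inference time there is no ground-truth target to feed in, so decoding is typically done autoregressively. The greedy-decoding sketch below assumes `model(src, tgt)` returns logits of shape `(batch, tgt_len, tgt_vocab_size)` as above, and uses token id 1 as a hypothetical start-of-sequence marker; substitute whatever special token ids your vocabulary actually uses.

```python
# Greedy autoregressive decoding sketch (token id 1 as a hypothetical BOS marker)
model.eval()
with torch.no_grad():
    generated = torch.full((batch_size, 1), 1, dtype=torch.long)  # start with BOS
    for _ in range(max_seq_length - 1):
        logits = model(src, generated)                    # (batch, cur_len, tgt_vocab_size)
        next_token = logits[:, -1, :].argmax(dim=-1, keepdim=True)
        generated = torch.cat([generated, next_token], dim=1)

print(generated.shape)  # (batch_size, max_seq_length)
```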
Multi-Head Attention

The core of the Transformer: it allows the model to attend to different positions simultaneously.

```python
class MultiHeadAttention(nn.Module):
    def __init__(self, d_model, num_heads):
        # Creates multiple attention heads
        # Each head learns different relationships
        ...
```

Key Features:
- Parallel attention computation
- Scaled dot-product attention
- Proper head splitting and combining
- Optional masking support
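The heart of each head is scaled dot-product attention. The repository's exact implementation is not reproduced here, but a minimal sketch of the computation looks like this (shapes assume the head dimension has already been split out):

```python
import math
import torch

def scaled_dot_product_attention(q, k, v, mask=None):
    # q, k, v: (batch, num_heads, seq_len, d_k)
    scores = q @ k.transpose(-2, -1) / math.sqrt(q.size(-1))   # similarity of queries and keys
    if mask is not None:
        scores = scores.masked_fill(mask == 0, float('-inf'))  # block disallowed positions
    attn = torch.softmax(scores, dim=-1)                       # attention weights sum to 1
    return attn @ v                                            # weighted sum of values
```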
Positional Encoding

Since Transformers have no built-in notion of sequence order, positional information is added to the embeddings:

```python
class PositionalEncoding(nn.Module):
    def __init__(self, d_model, max_seq_length):
        # Uses sinusoidal functions:
        # PE(pos, 2i)   = sin(pos / 10000^(2i/d_model))
        # PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model))
        ...
```
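For reference, here is a minimal sketch of how such a sinusoidal table can be built and added to the input embeddings. This follows the standard formulation from the paper; the repository's own module may differ in details such as dropout or buffer naming.

```python
import math
import torch
import torch.nn as nn

class SinusoidalPositionalEncoding(nn.Module):
    def __init__(self, d_model, max_seq_length):
        super().__init__()
        pe = torch.zeros(max_seq_length, d_model)
        position = torch.arange(0, max_seq_length, dtype=torch.float).unsqueeze(1)
        div_term = torch.exp(torch.arange(0, d_model, 2).float() * (-math.log(10000.0) / d_model))
        pe[:, 0::2] = torch.sin(position * div_term)  # even dimensions
        pe[:, 1::2] = torch.cos(position * div_term)  # odd dimensions
        self.register_buffer('pe', pe.unsqueeze(0))   # (1, max_seq_length, d_model)

    def forward(self, x):
        # x: (batch, seq_len, d_model)
        return x + self.pe[:, :x.size(1)]
```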
Encoder & Decoder Layers
Encoder Layer:
- Self-attention mechanism
- Position-wise feed-forward network
- Residual connections and layer normalization
Decoder Layer:
- Masked self-attention
- Cross-attention with encoder output
- Position-wise feed-forward network
- Residual connections and layer normalization
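To make the "Add & Norm" pattern concrete, here is a hedged sketch of an encoder layer built from PyTorch's stock `nn.MultiheadAttention` rather than this repository's `MultiHeadAttention` class; the real `EncoderLayer` may order dropout and normalization slightly differently.

```python
import torch.nn as nn

class EncoderLayerSketch(nn.Module):
    def __init__(self, d_model, num_heads, d_ff, dropout=0.1):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(d_model, num_heads, dropout=dropout, batch_first=True)
        self.feed_forward = nn.Sequential(
            nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model)
        )
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x, key_padding_mask=None):
        # Self-attention sub-layer with residual connection and layer norm
        attn_out, _ = self.self_attn(x, x, x, key_padding_mask=key_padding_mask)
        x = self.norm1(x + self.dropout(attn_out))
        # Position-wise feed-forward sub-layer with residual connection and layer norm
        x = self.norm2(x + self.dropout(self.feed_forward(x)))
        return x
```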
| Parameter | Default | Description |
|---|---|---|
| `d_model` | 512 | Model dimension |
| `num_heads` | 8 | Number of attention heads |
| `num_layers` | 6 | Number of encoder/decoder layers |
| `d_ff` | 2048 | Feed-forward dimension |
| `dropout` | 0.1 | Dropout rate |
| `max_seq_length` | 100 | Maximum sequence length |
```python
import torch.optim as optim
import torch.nn as nn

# Initialize model and optimizer
model = Transformer(...)
optimizer = optim.Adam(model.parameters(), lr=0.0001, betas=(0.9, 0.98), eps=1e-9)
criterion = nn.CrossEntropyLoss(ignore_index=0)  # ignore padding tokens in the loss

# Training loop
model.train()
for epoch in range(num_epochs):
    for batch in dataloader:
        src, tgt = batch

        # Forward pass (teacher forcing: decoder input is the target shifted right)
        output = model(src, tgt[:, :-1])

        # Calculate loss against the target shifted left
        loss = criterion(
            output.reshape(-1, tgt_vocab_size),
            tgt[:, 1:].reshape(-1)
        )

        # Backward pass
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

    print(f'Epoch: {epoch}, Loss: {loss.item():.4f}')
```
- Batch Size: Use larger batch sizes for better GPU utilization
- Learning Rate: Start with 0.0001 and use learning rate scheduling
- Warmup: Implement learning rate warmup for better convergence (see the scheduler sketch after this list)
- Memory: Use gradient checkpointing for very large models
- Mixed Precision: Enable automatic mixed precision for faster training
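One way to combine the learning-rate and warmup points is the schedule from the original paper: the rate grows linearly for `warmup_steps` steps and then decays with the inverse square root of the step number. Below is a sketch using `LambdaLR`; the `warmup_steps=4000` value is the paper's default, not something this repository prescribes.

```python
import torch.optim as optim
from torch.optim.lr_scheduler import LambdaLR

def noam_lr(step, d_model=512, warmup_steps=4000):
    # lrate = d_model^-0.5 * min(step^-0.5, step * warmup_steps^-1.5)
    step = max(step, 1)
    return (d_model ** -0.5) * min(step ** -0.5, step * warmup_steps ** -1.5)

optimizer = optim.Adam(model.parameters(), lr=1.0, betas=(0.9, 0.98), eps=1e-9)  # lr is scaled by the lambda
scheduler = LambdaLR(optimizer, lr_lambda=noam_lr)

# Inside the training loop, call scheduler.step() after each optimizer.step()
```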
Contributions are welcome! Please feel free to submit a Pull Request. For major changes, please open an issue first to discuss what you would like to change.
- Fork the Project
- Create your Feature Branch (`git checkout -b feature/AmazingFeature`)
- Commit your Changes (`git commit -m 'Add some AmazingFeature'`)
- Push to the Branch (`git push origin feature/AmazingFeature`)
- Open a Pull Request
- Attention Is All You Need - Original Transformer paper
- The Annotated Transformer - Detailed explanation
- PyTorch Transformer Tutorial
This project is licensed under the MIT License - see the LICENSE file for details.