🤖 Transformer Model Implementation

🌟 Overview

This repository contains a complete PyTorch implementation of the Transformer architecture introduced in the groundbreaking paper "Attention is All You Need" by Vaswani et al. (2017).

┌────────────────────────────────────────────────────────┐
│                    🏗️ Architecture                     │
├────────────────────────────────────────────────────────┤
│  Input → Embedding → Positional Encoding               │
│    ↓                                                   │
│  Encoder Stack (N layers)                              │
│    ├── Multi-Head Attention                            │
│    ├── Add & Norm                                      │
│    ├── Feed Forward                                    │
│    └── Add & Norm                                      │
│    ↓                                                   │
│  Decoder Stack (N layers)                              │
│    ├── Masked Multi-Head Attention                     │
│    ├── Add & Norm                                      │
│    ├── Cross Multi-Head Attention                      │
│    ├── Add & Norm                                      │
│    ├── Feed Forward                                    │
│    └── Add & Norm                                      │
│    ↓                                                   │
│  Linear → Softmax → Output                             │
└────────────────────────────────────────────────────────┘

✨ Features

  • 🎯 Multi-Head Attention - Parallel attention mechanisms for capturing different types of relationships
  • 🔄 Positional Encoding - Sinusoidal position embeddings that encode sequence order
  • 🏗️ Encoder-Decoder Architecture - Complete Transformer stack with residual connections
  • 🎭 Masking Support - Proper attention masking for causal decoding and padding tokens
  • ⚡ Optimized Implementation - Efficient PyTorch operations with proper tensor shapes
  • 🛡️ Bug-Free Code - Fixed all issues from the original implementation

🚀 Quick Start

Installation

# Clone the repository
git clone https://github.com/NeuroCoder47/Transformer-Model.git
cd Transformer-Model

# Install dependencies
pip install torch torchvision torchaudio
pip install numpy matplotlib

Basic Usage

import torch
from model import Transformer

# Initialize model parameters
src_vocab_size = 10000
tgt_vocab_size = 10000
d_model = 512
num_heads = 8
num_layers = 6
d_ff = 2048
max_seq_length = 100
dropout = 0.1

# Create the model
model = Transformer(
    src_vocab_size=src_vocab_size,
    tgt_vocab_size=tgt_vocab_size,
    d_model=d_model,
    num_heads=num_heads,
    num_layers=num_layers,
    d_ff=d_ff,
    max_seq_length=max_seq_length,
    dropout=dropout
)

# Example forward pass
batch_size = 32
src_seq_len = 50
tgt_seq_len = 40

src = torch.randint(1, src_vocab_size, (batch_size, src_seq_len))
tgt = torch.randint(1, tgt_vocab_size, (batch_size, tgt_seq_len))

output = model(src, tgt)
print(f"Output shape: {output.shape}")  # [batch_size, tgt_seq_len, tgt_vocab_size]

πŸ—οΈ Architecture Components

🔍 Multi-Head Attention

The core of the Transformer: this module lets the model attend to different positions of the sequence simultaneously:

class MultiHeadAttention(nn.Module):
    def __init__(self, d_model, num_heads):
        super().__init__()
        # d_model is split across num_heads parallel attention heads;
        # each head learns a different type of relationship
        self.d_k = d_model // num_heads
        self.num_heads = num_heads

Key Features:

  • ✅ Parallel attention computation
  • ✅ Scaled dot-product attention
  • ✅ Proper head splitting and combining
  • ✅ Optional masking support
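
For reference, the scaled dot-product step that every head performs can be sketched as follows. This is an illustrative snippet, not this repository's exact code; the function name and the mask convention (0 = blocked position) are assumptions.

import math
import torch

def scaled_dot_product_attention(q, k, v, mask=None):
    # q, k, v: [batch, heads, seq_len, d_k]
    scores = torch.matmul(q, k.transpose(-2, -1)) / math.sqrt(q.size(-1))
    if mask is not None:
        # Blocked positions (padding or future tokens) get -inf so that
        # softmax assigns them (near-)zero attention weight.
        scores = scores.masked_fill(mask == 0, float('-inf'))
    weights = torch.softmax(scores, dim=-1)
    return torch.matmul(weights, v)

# A causal (look-ahead) mask is just a lower-triangular matrix:
causal_mask = torch.tril(torch.ones(10, 10)).bool()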

📏 Positional Encoding

Since the Transformer has no built-in notion of sequence order, positional information is added to the embeddings:

class PositionalEncoding(nn.Module):
    def __init__(self, d_model, max_seq_length):
        super().__init__()
        # Uses sinusoidal functions:
        # PE(pos, 2i)   = sin(pos / 10000^(2i/d_model))
        # PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model))
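
Those two formulas can be implemented directly. The sketch below is a minimal reference version, computed via exp/log for numerical stability; the class name and buffer handling are illustrative rather than copied from this repository.

import math
import torch
import torch.nn as nn

class SinusoidalPositionalEncoding(nn.Module):
    def __init__(self, d_model, max_seq_length):
        super().__init__()
        position = torch.arange(max_seq_length, dtype=torch.float).unsqueeze(1)
        # 1 / 10000^(2i/d_model), evaluated in log space
        div_term = torch.exp(torch.arange(0, d_model, 2).float() * (-math.log(10000.0) / d_model))
        pe = torch.zeros(max_seq_length, d_model)
        pe[:, 0::2] = torch.sin(position * div_term)  # even dimensions
        pe[:, 1::2] = torch.cos(position * div_term)  # odd dimensions
        self.register_buffer('pe', pe.unsqueeze(0))   # [1, max_seq_length, d_model]

    def forward(self, x):
        # x: [batch, seq_len, d_model]; add the encoding for the first seq_len positions
        return x + self.pe[:, :x.size(1)]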
πŸ—οΈ Encoder & Decoder Layers

Encoder Layer:

  • Self-attention mechanism
  • Position-wise feed-forward network
  • Residual connections and layer normalization

Decoder Layer:

  • Masked self-attention
  • Cross-attention with encoder output
  • Position-wise feed-forward network
  • Residual connections and layer normalization
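
To show how these sublayers fit together, here is a simplified encoder-layer sketch. It uses PyTorch's built-in nn.MultiheadAttention instead of this repository's custom module, so treat it as a reference pattern rather than the actual implementation; a decoder layer follows the same pattern but adds masked self-attention and cross-attention over the encoder output.

import torch.nn as nn

class EncoderLayerSketch(nn.Module):
    # Self-Attention -> Add & Norm -> Feed Forward -> Add & Norm
    def __init__(self, d_model, num_heads, d_ff, dropout=0.1):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(d_model, num_heads, dropout=dropout, batch_first=True)
        self.ff = nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model))
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x, key_padding_mask=None):
        # Sublayer 1: self-attention, residual connection, layer norm
        attn_out, _ = self.self_attn(x, x, x, key_padding_mask=key_padding_mask)
        x = self.norm1(x + self.dropout(attn_out))
        # Sublayer 2: position-wise feed-forward, residual connection, layer norm
        return self.norm2(x + self.dropout(self.ff(x)))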

📊 Model Parameters

Parameter        Default   Description
d_model          512       Model dimension
num_heads        8         Number of attention heads
num_layers       6         Number of encoder/decoder layers
d_ff             2048      Feed-forward dimension
dropout          0.1       Dropout rate
max_seq_length   100       Maximum sequence length
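
As a rough sanity check of what these defaults amount to, the snippet below counts trainable parameters for a model built with the values above (it assumes the model object created in the Basic Usage section).

# Count trainable parameters of the model instantiated earlier
num_params = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(f"Trainable parameters: {num_params:,}")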

🛠️ Training Example

import torch.optim as optim
import torch.nn as nn

# Initialize model and optimizer
model = Transformer(...)
optimizer = optim.Adam(model.parameters(), lr=0.0001, betas=(0.9, 0.98), eps=1e-9)
criterion = nn.CrossEntropyLoss(ignore_index=0)

# Training loop
model.train()
for epoch in range(num_epochs):
    for batch in dataloader:
        src, tgt = batch
        
        # Forward pass
        output = model(src, tgt[:, :-1])  # Teacher forcing
        
        # Calculate loss
        loss = criterion(
            output.reshape(-1, tgt_vocab_size),
            tgt[:, 1:].reshape(-1)
        )
        
        # Backward pass
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        
        print(f'Epoch: {epoch}, Loss: {loss.item():.4f}')

📈 Performance Tips

  • 🚀 Batch Size: Use larger batch sizes for better GPU utilization
  • 🎯 Learning Rate: Start with 0.0001 and use learning rate scheduling
  • 🔥 Warmup: Implement learning rate warmup for better convergence (see the scheduler sketch after this list)
  • 💾 Memory: Use gradient checkpointing for very large models
  • ⚡ Mixed Precision: Enable automatic mixed precision for faster training
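
For the warmup tip, a minimal sketch of the schedule used in the original paper is shown below: the learning rate follows d_model^-0.5 * min(step^-0.5, step * warmup_steps^-1.5). Here the optimizer's base lr is set to 1.0 so the lambda alone determines the effective rate; this is an illustration, not code from this repository.

import torch.optim as optim
from torch.optim.lr_scheduler import LambdaLR

d_model, warmup_steps = 512, 4000  # values used in the original paper

optimizer = optim.Adam(model.parameters(), lr=1.0, betas=(0.9, 0.98), eps=1e-9)

def noam_lr(step):
    # lr = d_model^-0.5 * min(step^-0.5, step * warmup_steps^-1.5)
    step = max(step, 1)  # avoid division by zero on the very first call
    return d_model ** -0.5 * min(step ** -0.5, step * warmup_steps ** -1.5)

scheduler = LambdaLR(optimizer, lr_lambda=noam_lr)

# In the training loop, step the scheduler right after optimizer.step():
# optimizer.step()
# scheduler.step()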

🤝 Contributing

Contributions are welcome! Please feel free to submit a Pull Request. For major changes, please open an issue first to discuss what you would like to change.

  1. Fork the Project
  2. Create your Feature Branch (git checkout -b feature/AmazingFeature)
  3. Commit your Changes (git commit -m 'Add some AmazingFeature')
  4. Push to the Branch (git push origin feature/AmazingFeature)
  5. Open a Pull Request

📚 References

  • Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, Ł., and Polosukhin, I. "Attention Is All You Need." NeurIPS 2017. arXiv:1706.03762

📄 License

This project is licensed under the MIT License - see the LICENSE file for details.

⭐ Star History

Star History Chart


🎉 Thank you for using this Transformer implementation!

Made with ❤️ and lots of ☕
