This repository contains my personal journey and hands-on, from-the-ground-up implementations of the Transformer, one of the most revolutionary architectures in modern Deep Learning and Natural Language Processing.
The concepts and structure are inspired by the book "Building Transformers From Scratch" by Jason Brownlee, available at machinelearningmastery.com, which provides a clear and progressive path to mastering this essential architecture in AI.
- Introduction to Attention and Transformer Models
- Understanding Encoders and Decoders in Transformers
- Tokenizers in Language Models (BPE sketch below):
  - Byte-Pair Encoding (BPE)
  - WordPiece
  - SentencePiece and Unigram
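As a taste of the tokenizer chapters, here is a minimal sketch of training a BPE tokenizer with the Hugging Face `tokenizers` library; the toy corpus and `vocab_size` are illustrative, not values from the notebooks:

```python
from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.pre_tokenizers import Whitespace
from tokenizers.trainers import BpeTrainer

# Toy corpus and vocabulary size, purely for illustration
corpus = ["the cat sat on the mat", "the dog sat on the log"]

tokenizer = Tokenizer(BPE(unk_token="[UNK]"))
tokenizer.pre_tokenizer = Whitespace()
trainer = BpeTrainer(vocab_size=100, special_tokens=["[UNK]", "[PAD]"])
tokenizer.train_from_iterator(corpus, trainer)

encoding = tokenizer.encode("the cat sat on the log")
print(encoding.tokens)  # learned subword pieces
print(encoding.ids)     # their integer ids
```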
- Word Embeddings (sketch below):
  - Word2Vec implementations with Gensim and PyTorch
  - Embeddings in Transformer models
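Inside the Transformer, embeddings are just a learned lookup table. A minimal PyTorch sketch (sizes are illustrative), including the √d_model scaling used in the original paper:

```python
import math

import torch
import torch.nn as nn

vocab_size, d_model = 10_000, 512              # illustrative sizes
embedding = nn.Embedding(vocab_size, d_model)  # one learned vector per token id

token_ids = torch.tensor([[5, 42, 7, 0]])      # (batch, seq_len) of token ids
x = embedding(token_ids) * math.sqrt(d_model)  # scaling from "Attention Is All You Need"
print(x.shape)                                 # torch.Size([1, 4, 512])
```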
- Positional Encodings (sinusoidal sketch below):
  - Sinusoidal Positional Encodings
  - Learned Positional Encodings
  - Rotary Positional Encodings (RoPE)
  - Relative Positional Encodings
  - YaRN for larger context windows
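A minimal sketch of the sinusoidal variant (the other schemes get their own notebooks); the sequence length and model dimension below are arbitrary:

```python
import math

import torch

def sinusoidal_positional_encoding(seq_len: int, d_model: int) -> torch.Tensor:
    """PE[pos, 2i] = sin(pos / 10000^(2i/d_model)), PE[pos, 2i+1] = cos(...)."""
    position = torch.arange(seq_len).unsqueeze(1)  # (seq_len, 1)
    div_term = torch.exp(torch.arange(0, d_model, 2) * (-math.log(10000.0) / d_model))
    pe = torch.zeros(seq_len, d_model)
    pe[:, 0::2] = torch.sin(position * div_term)   # even dimensions
    pe[:, 1::2] = torch.cos(position * div_term)   # odd dimensions
    return pe

pe = sinusoidal_positional_encoding(seq_len=128, d_model=512)
print(pe.shape)  # torch.Size([128, 512])
```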
- Attention Mechanisms (scaled dot-product sketch below):
  - Multi-Head Attention (MHA)
  - Grouped-Query Attention (GQA)
  - Multi-Query Attention (MQA)
  - Multi-Head Latent Attention (MLA)
  - Attention Masking
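All of these variants build on scaled dot-product attention. A minimal sketch with an optional causal mask (the shapes are illustrative):

```python
import math

import torch

def scaled_dot_product_attention(q, k, v, causal: bool = False):
    """q, k, v: (batch, heads, seq_len, head_dim). Returns output and attention weights."""
    scores = q @ k.transpose(-2, -1) / math.sqrt(q.size(-1))  # (batch, heads, q_len, k_len)
    if causal:
        q_len, k_len = scores.shape[-2:]
        mask = torch.triu(torch.ones(q_len, k_len, dtype=torch.bool), diagonal=1)
        scores = scores.masked_fill(mask, float("-inf"))       # hide future positions
    weights = torch.softmax(scores, dim=-1)
    return weights @ v, weights

q = k = v = torch.randn(1, 8, 16, 64)  # illustrative: 8 heads, seq_len 16, head_dim 64
out, attn = scaled_dot_product_attention(q, k, v, causal=True)
print(out.shape)  # torch.Size([1, 8, 16, 64])
```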
- Normalization Techniques (RMSNorm sketch below):
  - Layer Normalization
  - RMS Normalization
  - Adaptive Layer Norm
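A minimal RMS normalization sketch (the `eps` value and tensor sizes are illustrative):

```python
import torch
import torch.nn as nn

class RMSNorm(nn.Module):
    """RMS normalization: rescale by the root mean square, with no mean subtraction."""
    def __init__(self, d_model: int, eps: float = 1e-6):
        super().__init__()
        self.eps = eps
        self.weight = nn.Parameter(torch.ones(d_model))  # learned per-feature gain

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        rms = torch.rsqrt(x.pow(2).mean(dim=-1, keepdim=True) + self.eps)
        return self.weight * x * rms

x = torch.randn(2, 16, 512)
print(RMSNorm(512)(x).shape)  # torch.Size([2, 16, 512])
```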
- Feed-Forward Networks (SwiGLU sketch below):
  - Linear layers and activation functions
  - SwiGLU and variants
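A minimal SwiGLU feed-forward sketch in the LLaMA style; the projection names `w1`/`w2`/`w3` and the hidden size are illustrative:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SwiGLUFeedForward(nn.Module):
    """Gated feed-forward block: (SiLU(x W1) * (x W3)) W2."""
    def __init__(self, d_model: int, d_hidden: int):
        super().__init__()
        self.w1 = nn.Linear(d_model, d_hidden, bias=False)  # gate projection
        self.w3 = nn.Linear(d_model, d_hidden, bias=False)  # value projection
        self.w2 = nn.Linear(d_hidden, d_model, bias=False)  # output projection

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.w2(F.silu(self.w1(x)) * self.w3(x))

x = torch.randn(2, 16, 512)
print(SwiGLUFeedForward(512, 1536)(x).shape)  # torch.Size([2, 16, 512])
```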
- Advanced Architectures (pre-norm sketch below):
  - Mixture of Experts (MoE)
  - Skip Connections
  - Pre-norm vs Post-norm architectures
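A minimal pre-norm block sketch using `nn.MultiheadAttention`, showing where the skip connections and norms sit; post-norm would instead apply each norm after the residual addition. Sizes are illustrative:

```python
import torch
import torch.nn as nn

class PreNormBlock(nn.Module):
    """Pre-norm Transformer block: normalize *before* each sub-layer, then add the residual."""
    def __init__(self, d_model: int, n_heads: int, d_ff: int):
        super().__init__()
        self.norm1 = nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm2 = nn.LayerNorm(d_model)
        self.ff = nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        h = self.norm1(x)
        x = x + self.attn(h, h, h, need_weights=False)[0]  # skip connection around attention
        x = x + self.ff(self.norm2(x))                     # skip connection around the FFN
        return x

x = torch.randn(2, 16, 512)
print(PreNormBlock(512, 8, 2048)(x).shape)  # torch.Size([2, 16, 512])
```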
- Plain Seq2Seq Model for Language Translation (LSTM-based)
- Seq2Seq Model with Attention for Language Translation
- Full Transformer Model for Language Translation (Encoder-Decoder)
- Decoder-Only Transformer Model for Text Generation (toy greedy-decoding sketch below)
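The notebooks build these models in full; as a toy illustration of the decoder-only setup only, here is a compact sketch with a greedy generation loop. Positional encodings are omitted for brevity and all sizes are illustrative, so this is not the notebook implementation:

```python
import torch
import torch.nn as nn

class TinyDecoderLM(nn.Module):
    """Toy decoder-only LM: embedding + causally masked Transformer layers + LM head.
    Positional encodings are omitted to keep the sketch short."""
    def __init__(self, vocab_size: int, d_model: int = 128, n_heads: int = 4, n_layers: int = 2):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        layer = nn.TransformerEncoderLayer(
            d_model, n_heads, dim_feedforward=4 * d_model, batch_first=True, norm_first=True
        )
        self.blocks = nn.TransformerEncoder(layer, n_layers)
        self.lm_head = nn.Linear(d_model, vocab_size)

    def forward(self, ids: torch.Tensor) -> torch.Tensor:
        causal = nn.Transformer.generate_square_subsequent_mask(ids.size(1))
        h = self.blocks(self.embed(ids), mask=causal)  # each position sees only the past
        return self.lm_head(h)                         # (batch, seq_len, vocab_size)

@torch.no_grad()
def greedy_generate(model: nn.Module, ids: torch.Tensor, max_new_tokens: int) -> torch.Tensor:
    """Append the most likely next token, one step at a time."""
    for _ in range(max_new_tokens):
        logits = model(ids)[:, -1, :]                  # logits for the last position only
        next_id = logits.argmax(dim=-1, keepdim=True)
        ids = torch.cat([ids, next_id], dim=1)
    return ids

model = TinyDecoderLM(vocab_size=100)
print(greedy_generate(model, torch.tensor([[1, 2, 3]]), max_new_tokens=5))
```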
Transformers have revolutionized AI by:
- Enabling large language models like GPT, BERT, LLaMA, and Claude
- Processing sequences in parallel (unlike RNNs)
- Capturing long-range dependencies through self-attention
- Scaling efficiently to billions of parameters
- Transferring across domains: NLP, Computer Vision, Audio, and more
Understanding Transformers from scratch is essential for:
- Building custom AI models
- Fine-tuning pre-trained models
- Understanding state-of-the-art architectures
- Optimizing inference and training performance
- PyTorch - Deep Learning framework
- tokenizers - Hugging Face tokenizers library
- torch.nn.functional - Functional neural network operations (included with PyTorch)
- requests - Data downloading
- tqdm - Progress bars
- matplotlib - Visualization
Each notebook is self-contained and can be run independently. Start with the overview chapters to understand the fundamentals, then explore the building blocks before diving into complete model implementations.
✅ From-scratch implementations - No black boxes, understand every component
✅ Progressive learning path - Build from basics to complex architectures
✅ Modern techniques - RoPE, GQA, MoE, RMSNorm, and more
✅ Complete working models - Translation and text generation systems
✅ Well-commented code - Clear explanations in Spanish and English
✅ Production-ready patterns - Best practices for model design
- Start with the Overview - Understand what Transformers are and why they work
- Master the Building Blocks - Learn each component in isolation
- Build Complete Models - Combine components into working systems
- Experiment and Extend - Modify architectures and explore variations
Book: Building Transformers From Scratch
Author: Jason Brownlee
Website: https://machinelearningmastery.com
Paper: "Attention Is All You Need" (Vaswani et al., 2017)
This repository is for educational purposes. The implementations are based on the book by Jason Brownlee and academic papers.
Feel free to open issues or submit pull requests if you find any bugs or have suggestions for improvements!
- Jason Brownlee for the excellent book and structured approach
- The original Transformer paper authors (Vaswani et al.)
- The open-source ML community

Happy Learning! 🚀
Building AI models one transformer layer at a time...