
Foundational Concepts & Further Reading

This page lists key research papers that are foundational to understanding the concepts and architectures implemented in this project. While not mandatory for running the code, exploring these can provide a deeper understanding of why things are built the way they are.

Core Transformer and GPT Architecture

  1. "Attention Is All You Need" (Vaswani et al., 2017)

    • Link: PDF
    • Relevance: Introduced the original Transformer architecture, which is the basis for GPT models. Understanding self-attention, multi-head attention, and positional encoding from this paper is crucial for grasping the model in this repository.
  2. "Language Models are Unsupervised Multitask Learners" (Radford et al., 2019, the GPT-2 paper)

    • Link: PDF
    • Relevance: Details the GPT-2 model, a direct predecessor and significant inspiration for the architecture and training approach used in this project. It highlights the power of large-scale unsupervised pre-training for language models.
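The core operation introduced in "Attention Is All You Need" is scaled dot-product attention, softmax(QKᵀ/√d_k)·V. Below is a minimal NumPy sketch of that formula for intuition only; it is not the repository's implementation, which will differ (multi-head splitting, causal masking, and the FlashAttention kernel discussed below).

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Single-head attention: softmax(Q K^T / sqrt(d_k)) V.

    Q: (seq_q, d_k), K: (seq_k, d_k), V: (seq_k, d_v).
    Illustrative only -- no masking, batching, or multi-head logic.
    """
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                  # similarity scores, (seq_q, seq_k)
    scores -= scores.max(axis=-1, keepdims=True)     # subtract row max for numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)   # row-wise softmax
    return weights @ V                               # convex combination of value rows
```

Because each output row is a convex combination of the rows of V, every output stays within the per-column range of V.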

Training Techniques and Components

  1. "GLU Variants Improve Transformer" (Shazeer, 2020)

    • Link: arXiv PDF
    • Relevance: This paper explores variants of Gated Linear Units (GLU), including SwiGLU, which is used in the feed-forward network of this project's Transformer model. It demonstrates performance improvements over standard ReLU or GELU activations.
  2. "FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness" (Dao et al., 2022)

    • Link: arXiv PDF
    • Relevance: Describes FlashAttention, an optimized attention algorithm that significantly speeds up computation and reduces memory usage by being IO-aware. This project incorporates FlashAttention for improved training and inference efficiency on compatible hardware. (A follow-up, FlashAttention-2, further improves upon this: arXiv PDF)
  3. "Adam: A Method for Stochastic Optimization" (Kingma & Ba, 2014)

    • Link: arXiv PDF
    • Relevance: Introduces the Adam optimization algorithm. This project uses AdamW, a variant of Adam that decouples weight decay from the gradient-based update, which is a standard choice for training Transformers.
  4. "Why Warmup the Learning Rate? Underlying Mechanisms and Improvements" (Xiao et al., 2024)

    • Link: arXiv PDF
    • Relevance: Explores the learning rate warmup strategy, a common technique used in training large neural networks (including the one in this project) to improve stability and overall performance, especially in the early stages of training.
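The SwiGLU feed-forward variant from Shazeer (2020) replaces the single activation of a standard FFN with a gated product: FFN(x) = (Swish(xW_gate) ⊙ xW_up)W_down, where Swish(x) = x·σ(x). A minimal NumPy sketch of that formula (the weight names here are illustrative, not the repository's parameter names):

```python
import numpy as np

def swish(x):
    # Swish / SiLU activation: x * sigmoid(x)
    return x / (1.0 + np.exp(-x))

def swiglu_ffn(x, W_gate, W_up, W_down):
    """SwiGLU feed-forward block: (Swish(x W_gate) * (x W_up)) W_down.

    The Swish-activated gate modulates the linear "up" projection
    element-wise before projecting back down to the model dimension.
    """
    return (swish(x @ W_gate) * (x @ W_up)) @ W_down
```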
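The key difference between Adam and AdamW is where weight decay enters: AdamW applies it directly to the parameters rather than folding an L2 term into the gradient. A single-step sketch of that decoupled update, using textbook defaults (assumptions, not this project's hyperparameters):

```python
import numpy as np

def adamw_step(theta, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999,
               eps=1e-8, weight_decay=0.01):
    """One AdamW update for parameters theta at step t (1-indexed)."""
    m = beta1 * m + (1 - beta1) * grad           # first-moment (mean) estimate
    v = beta2 * v + (1 - beta2) * grad ** 2      # second-moment estimate
    m_hat = m / (1 - beta1 ** t)                 # bias correction
    v_hat = v / (1 - beta2 ** t)
    # Decoupled weight decay: applied to theta directly, outside the
    # adaptive gradient term (this is what distinguishes AdamW from Adam+L2).
    theta = theta - lr * (m_hat / (np.sqrt(v_hat) + eps) + weight_decay * theta)
    return theta, m, v
```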
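Warmup is typically combined with a decay schedule; a common pattern for Transformer pre-training is linear warmup followed by cosine decay. The sketch below shows the general shape of such a schedule; the function name and default values are illustrative and not taken from this project's configuration.

```python
import math

def lr_at_step(step, max_lr=3e-4, warmup_steps=2000,
               total_steps=100_000, min_lr=3e-5):
    """Learning rate at a given step: linear warmup, then cosine decay.

    Hyperparameter values are placeholders for illustration.
    """
    if step < warmup_steps:
        # Linear ramp from ~0 up to max_lr over the warmup phase.
        return max_lr * (step + 1) / warmup_steps
    # Cosine decay from max_lr down to min_lr over the remaining steps.
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return min_lr + 0.5 * (max_lr - min_lr) * (1 + math.cos(math.pi * progress))
```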