A PyTorch implementation of a GPT-style transformer language model trained from scratch, featuring modern training optimizations and both custom BPE tokenization and inference capabilities.
This repo hosts a cleaner version of our initial attempts from michaelgathara/gpt_f, one that actually outputs coherent text. GPT_F holds a variety of other models and experiments.
- Modern Transformer Architecture: Implementation based on the GPT architecture with SwiGLU activation functions (see the sketch after this list)
- Advanced Optimizations:
- Flash Attention for improved performance on compatible hardware
- Mixed precision training (FP16)
- Gradient checkpointing (optional)
- BPE Tokenization: The main model uses the GPT-2 tokenizer (~52k vocabulary); a fully custom BPE tokenizer implementation and training scripts are also included under `tokenization/`
- Performance Monitoring: TensorBoard integration for tracking training metrics (deprecated for now)
- Dataset: Supports dataset options like FineWeb-Edu
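
For readers unfamiliar with SwiGLU, the snippet below is a minimal, illustrative sketch of a SwiGLU feed-forward block; the class and dimension names are placeholders, not the exact code in `models/transformer_setup/`.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SwiGLUFeedForward(nn.Module):
    """Illustrative SwiGLU MLP: SiLU(x @ W_gate) * (x @ W_up), projected back down."""
    def __init__(self, dim: int, hidden_dim: int):
        super().__init__()
        self.w_gate = nn.Linear(dim, hidden_dim, bias=False)  # gate projection
        self.w_up = nn.Linear(dim, hidden_dim, bias=False)    # value projection
        self.w_down = nn.Linear(hidden_dim, dim, bias=False)  # back to model dim

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.w_down(F.silu(self.w_gate(x)) * self.w_up(x))

# Example sizing: SwiGLUFeedForward(dim=1536, hidden_dim=4 * 1536)
```

Compared to a plain GELU MLP, SwiGLU gates one projection with another, which tends to improve quality at a similar parameter count.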
Note
You can change the size of the model simply by editing the config file, without touching the transformer code itself.
The main model trained under this repo has the following parameters:
- Embedding dimension: 1536
- Number of attention heads: 12
- Number of transformer layers: 24
- Context size: 1024 tokens
- Total parameters: ~1.1B (1,062,501,457)
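
As a concrete illustration of the config-driven sizing mentioned in the note above, a config along these lines is all you would edit to scale the model up or down; the field names here are hypothetical, not the repo's actual schema.

```python
from dataclasses import dataclass

@dataclass
class ModelConfig:
    # Hypothetical field names; see the actual config used by models/transformer_setup/
    n_embd: int = 1536      # embedding dimension
    n_head: int = 12        # attention heads
    n_layer: int = 24       # transformer layers
    block_size: int = 1024  # context size in tokens
```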
This project is structured to guide you from the foundational concepts of Large Language Models (LLMs) to training, evaluating, and using your own GPT-style model from scratch. We recommend navigating through the documentation modules in the following order:
- Foundational Concepts & Further Reading: (Optional) Understand the key research behind LLMs.
- Setting Up Your Data: Learn how to acquire and prepare the datasets.
- Understanding Tokenization: Discover how text is converted into a format models can understand and how to train/use a tokenizer.
- The Model Architecture: Dive into the components of our Transformer model (defined in `models/transformer_setup/`).
- Training the Model: Step-by-step guide to training your LLM (using `models/gpt_training.py`).
- Using Your Trained Model - Inference: Generate text with your trained model (using `models/inference.py`).
- Evaluating Model Performance: Assess the quality of your trained LLM.
- Working with Hugging Face: Make your model compatible with and share it on the Hugging Face Hub.
Each linked `README.md` file (or section) within the respective directories serves as a chapter, explaining the purpose and usage of the code within that module.
- Install `uv` from here.
- Create a virtual environment and sync dependencies:

```bash
uv venv  # Create .venv

# Activate the environment
# On Linux/macOS:
source .venv/bin/activate
# On Windows (PowerShell):
# .\.venv\Scripts\Activate.ps1
# On Windows (CMD):
# .\.venv\Scripts\activate.bat

uv sync
```

- If `flash-attn` did not install correctly via `uv sync` (common on some platforms), you might need to install it separately. Ensure you have a compatible NVIDIA CUDA toolkit installed and that your GPU supports FlashAttention (Ampere architecture or newer).

```bash
# Example; adjust for your PyTorch and CUDA version if necessary
uv pip install flash-attn --no-build-isolation
```

- Log in to the Hugging Face CLI (required for downloading some datasets like FineWeb-Edu and for uploading models):

```bash
huggingface-cli login
```
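
Before launching a long run, a quick sanity check like the sketch below (plain PyTorch plus an optional `flash_attn` import; nothing repo-specific) confirms that CUDA and FlashAttention are actually usable:

```python
import torch

print(f"PyTorch {torch.__version__}, CUDA available: {torch.cuda.is_available()}")
if torch.cuda.is_available():
    print("GPU:", torch.cuda.get_device_name(0))

try:
    import flash_attn
    print("flash-attn version:", getattr(flash_attn, "__version__", "unknown"))
except ImportError:
    # If this prints, retry the flash-attn install step above (if your GPU supports it).
    print("flash-attn is not importable in this environment.")
```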
This project is organized into several key directories:
- `data/`: Modules for acquiring, preprocessing, and loading datasets.
  - `preprocessing/`: Scripts for cleaning text data.
  - `tokenization/`: BPE tokenizer implementation, training scripts (for a fully custom tokenizer), and configuration.
- `models/`: Core model architecture (`transformer_setup/`), main training script (`gpt_training.py`), and inference script (`inference.py`).
- `evaluation/`: Scripts for evaluating trained models.
- `hugging_face/`: Modules for Hugging Face Hub integration.
- `PAPERS.md`: Curated list of influential research papers.
- `.gitignore`, `LICENSE`, `pyproject.toml`, `requirements.txt`, `STRUCTURE.md`: Standard project and configuration files.
Check out the READMEs above for a detailed usage reference.
To train the model from scratch:
```bash
cd models/

# For background training and logging:
nohup python3 -u gpt_training.py > train.log 2>&1 &

# In a separate terminal, follow the log:
tail -f train.log
# Or use the provided script from the project root:
./print_res.sh

# For foreground training:
python3 gpt_training.py
```
This will:
- Download and preprocess the FineWeb-Edu dataset
- Load the GPT-2 tokenizer
- Start streaming and tokenizing the dataset
- Train the transformer model
- Save checkpoints to the `checkpoints_1B/` directory
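
For intuition, the first three of those steps look roughly like the sketch below, built on the `datasets` and `transformers` libraries; the dataset config name `sample-10BT` and the mapping function are illustrative choices, not necessarily what `gpt_training.py` uses.

```python
from datasets import load_dataset
from transformers import GPT2TokenizerFast

# Stream FineWeb-Edu so the full dataset never has to be downloaded at once
dataset = load_dataset("HuggingFaceFW/fineweb-edu", name="sample-10BT",
                       split="train", streaming=True)

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")

def tokenize(example):
    # Append the end-of-text token so documents stay separated in the token stream
    ids = tokenizer(example["text"])["input_ids"] + [tokenizer.eos_token_id]
    return {"input_ids": ids}

tokenized = dataset.map(tokenize)

# Peek at a few streamed, tokenized documents
for i, sample in enumerate(tokenized):
    print(len(sample["input_ids"]), "tokens")
    if i == 2:
        break
```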
To generate text with a trained model:
```bash
python3 inference.py checkpoints_1B/best_model.pt
```
This will:
- Load the checkpoint (`.pt`) file
- Allow you to enter prompts and generate continuations
- Exit when you type 'exit'
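
Conceptually, `inference.py` boils down to the standard autoregressive sampling loop. Here is a hedged sketch of that loop; checkpoint loading, the model's forward signature, and the sampling parameters in the real script will differ.

```python
import torch

@torch.no_grad()
def generate(model, tokenizer, prompt: str, max_new_tokens: int = 100,
             temperature: float = 0.8, device: str = "cuda") -> str:
    """Sample one token at a time, feeding each prediction back into the model."""
    model.eval()
    ids = torch.tensor([tokenizer.encode(prompt)], device=device)
    for _ in range(max_new_tokens):
        logits = model(ids[:, -1024:])           # crop to the 1024-token context;
                                                 # assumes forward returns (B, T, vocab) logits
        logits = logits[:, -1, :] / temperature  # keep only the last position
        probs = torch.softmax(logits, dim=-1)
        next_id = torch.multinomial(probs, num_samples=1)
        ids = torch.cat([ids, next_id], dim=1)
    return tokenizer.decode(ids[0].tolist())
```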
- Optimizer: AdamW with weight decay and cosine learning rate scheduling
- Batch size: 72 per GPU (configurable)
- Learning rate: 6e-4 with warmup
- Gradient accumulation: 4 steps
- Mixed precision: FP16 training enabled
- Evaluation: Every 1000 iterations on validation set
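
Taken together, those settings correspond to a training step roughly like the sketch below. The peak learning rate, accumulation steps, and FP16 autocast mirror the list above; the warmup/max iteration counts and the `(logits, loss)` forward signature are assumptions, and the function names are illustrative rather than those in `gpt_training.py`.

```python
import math
import torch

LR = 6e-4               # peak learning rate (listed above)
GRAD_ACCUM_STEPS = 4    # gradient accumulation steps (listed above)
WARMUP_ITERS = 1_000    # illustrative value, not taken from the repo
MAX_ITERS = 100_000     # illustrative value, not taken from the repo

def lr_at(it: int) -> float:
    """Linear warmup followed by cosine decay to 10% of the peak rate."""
    if it < WARMUP_ITERS:
        return LR * (it + 1) / WARMUP_ITERS
    progress = min((it - WARMUP_ITERS) / (MAX_ITERS - WARMUP_ITERS), 1.0)
    return 0.1 * LR + 0.45 * LR * (1 + math.cos(math.pi * progress))

def train_step(model, micro_batches, optimizer, scaler, iteration):
    """One optimizer step with FP16 autocast and gradient accumulation."""
    for group in optimizer.param_groups:
        group["lr"] = lr_at(iteration)
    for x, y in micro_batches:  # GRAD_ACCUM_STEPS micro-batches
        with torch.autocast(device_type="cuda", dtype=torch.float16):
            _, loss = model(x, y)  # assumes the model returns (logits, loss)
        scaler.scale(loss / GRAD_ACCUM_STEPS).backward()
    scaler.step(optimizer)
    scaler.update()
    optimizer.zero_grad(set_to_none=True)

# Typical setup:
# optimizer = torch.optim.AdamW(model.parameters(), lr=LR, weight_decay=0.1, betas=(0.9, 0.95))
# scaler = torch.cuda.amp.GradScaler()
```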
This implementation draws inspiration from:
- The GPT architecture by OpenAI
- "Attention Is All You Need" (Vaswani et al., 2017)
- The Flash Attention implementation
- Hugging Face's tokenizers and datasets libraries
- Andrej Karpathy for his invaluable LLM videos
MIT