# ScratchGPT

ScratchGPT is a Python project that implements a small-scale transformer-based language model from scratch. It provides functionality for training the model on custom datasets and generating text based on prompts. The purpose of this repo is educational, so the aim is to keep the code as legible as possible.
## Features

- Custom transformer architecture implementation
- Training on user-provided text data
- Text generation using the trained model
- Flexible tokenization using TikToken
- Command-line interfaces for training and inference
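At the heart of any such architecture is causal scaled dot-product self-attention. The following is an illustrative NumPy sketch of that mechanism, not the repo's actual code (function and weight names are assumptions for the example):

```python
import numpy as np

def causal_self_attention(x: np.ndarray, Wq: np.ndarray, Wk: np.ndarray, Wv: np.ndarray) -> np.ndarray:
    """Hypothetical single attention head. x: (seq_len, d_model); Wq/Wk/Wv: (d_model, d_head)."""
    q, k, v = x @ Wq, x @ Wk, x @ Wv
    scores = q @ k.T / np.sqrt(k.shape[-1])            # (seq_len, seq_len) similarity scores
    mask = np.triu(np.ones_like(scores), k=1).astype(bool)
    scores[mask] = -np.inf                             # causal mask: no attending to future tokens
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)     # row-wise softmax
    return weights @ v                                 # (seq_len, d_head)

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))
Wq, Wk, Wv = (rng.normal(size=(8, 8)) for _ in range(3))
out = causal_self_attention(x, Wq, Wk, Wv)
print(out.shape)  # (4, 8)
```

Because of the causal mask, the first output position depends only on the first input token, which is what makes autoregressive training and generation possible.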
## Roadmap

- [x] Switch to uv
- [x] Make it easy to modify with a config file
- [ ] Extract the loss calculation from the model
- [ ] Rename main to train
- [ ] Create an easy-to-use interface
- [ ] Create or check tokenizer interface
- [ ] Make it into a package
- [ ] Apply SOTA optimizations
## Requirements

- Python 3.12+
- `uv` for dependency management
## Installation

- Clone the repository:

  ```bash
  git clone https://github.com/LabStrangeLoop/scratchgpt.git
  cd scratchgpt
  ```
- Install dependencies using uv:

  ```bash
  uv sync --all-groups
  ```
## Training

To train the model on your custom dataset:

```bash
uv run train -t <path_to_training_data> -e <experiment_folder>
```
- `-t, --train_source`: Path to the training data file or folder
- `-e, --experiment`: Path to the folder where experiment checkpoints will be saved
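Training a language model like this one minimizes next-token cross-entropy over the dataset. A minimal NumPy sketch of that objective (illustrative only; the function name is an assumption, and per the roadmap the repo currently computes the loss inside the model):

```python
import numpy as np

def next_token_cross_entropy(logits: np.ndarray, targets: np.ndarray) -> float:
    """Hypothetical standalone loss. logits: (seq_len, vocab_size); targets: (seq_len,) token ids."""
    shifted = logits - logits.max(axis=-1, keepdims=True)   # subtract max for numerical stability
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))
    # Mean negative log-probability assigned to each correct next token.
    return float(-log_probs[np.arange(len(targets)), targets].mean())

logits = np.zeros((3, 5))           # uniform predictions over a 5-token vocabulary
targets = np.array([0, 2, 4])
loss = next_token_cross_entropy(logits, targets)
print(round(loss, 4))  # uniform predictions give loss = ln(5) ≈ 1.6094
```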
## Inference

To generate text using a trained model:

```bash
uv run infer -e <experiment_folder> [-d <device>] [-m <max_tokens>]
```
- `-e, --experiment`: Path to the folder containing the trained model
- `-d, --device`: Device to run the model on (default: `cuda`)
- `-m, --max_tokens`: Maximum number of tokens to generate (default: 512)
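Conceptually, `--max_tokens` bounds an autoregressive loop: the model's output is appended to the context and fed back in until an end token appears or the budget runs out. A sketch with a stand-in model (the function and callback here are assumptions, not the repo's `infer` implementation):

```python
from typing import Callable

def generate(next_token: Callable[[list[int]], int], prompt: list[int],
             max_tokens: int, eos: int = 0) -> list[int]:
    """Feed the growing context back into the model until EOS or the token budget is hit."""
    tokens = list(prompt)
    for _ in range(max_tokens):
        tok = next_token(tokens)
        tokens.append(tok)
        if tok == eos:
            break
    return tokens

# Stand-in "model": decrements the last token toward EOS (id 0).
out = generate(lambda ctx: max(ctx[-1] - 1, 0), prompt=[3], max_tokens=512)
print(out)  # [3, 2, 1, 0]
```

A real model would replace the lambda with a forward pass plus sampling over the output logits.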
To explore the TikToken tokenizer:

```bash
uv run tiktoken
```
## Project Structure

- `scratchgpt/main.py`: Main training script
- `scratchgpt/infer.py`: Inference script for text generation
- `scratchgpt/model_io.py`: Utilities for saving and loading models
- `scratchgpt/tokenizer/`: Tokenizer implementations
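A tokenizer package like this typically revolves around a small encode/decode interface. A hypothetical sketch of such an interface with a trivial byte-level implementation (all names here are assumptions, not the repo's actual classes):

```python
from typing import Protocol

class Tokenizer(Protocol):
    # Hypothetical interface; the repo's real implementations live in scratchgpt/tokenizer/.
    def encode(self, text: str) -> list[int]: ...
    def decode(self, ids: list[int]) -> str: ...

class ByteTokenizer:
    """Simplest possible tokenizer: one token id per UTF-8 byte."""
    vocab_size = 256

    def encode(self, text: str) -> list[int]:
        return list(text.encode("utf-8"))

    def decode(self, ids: list[int]) -> str:
        return bytes(ids).decode("utf-8")

tok = ByteTokenizer()
ids = tok.encode("hi")
print(ids, tok.decode(ids))  # [104, 105] hi
```

Any class satisfying `encode`/`decode` round-tripping could then be swapped in, which is what a tokenizer interface (a roadmap item above) would formalize.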
## Development

This project uses the following development tools:

- `mypy` for static type checking
- `ruff` for formatting and linting
- `pytest` for testing
Run the following commands to ensure code quality:

```bash
uv run ruff check --fix .
uv run mypy .
uv run pytest
```
## Contributing

Contributions are welcome! Please feel free to submit a Pull Request.
## Authors

- Aleksandr Yeganov
- Dario Cazzani