# ScratchGPT

ScratchGPT is a Python project that implements a small-scale transformer-based language model from scratch. It provides functionality for training the model on custom datasets and generating text based on prompts. The purpose of this repo is educational, so the aim is to keep the code as legible as possible.
## Features

- Custom transformer architecture implementation
- Training on user-provided text data
- Text generation using the trained model
- Flexible tokenization using TikToken
- Command-line interfaces for training and inference
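At the heart of any such architecture is causal scaled dot-product self-attention. The following is an illustrative NumPy sketch of that mechanism, not the repo's actual code (function and weight names are assumptions for the example):

```python
import numpy as np

def causal_self_attention(x: np.ndarray, Wq: np.ndarray, Wk: np.ndarray, Wv: np.ndarray) -> np.ndarray:
    """Hypothetical single attention head. x: (seq_len, d_model); Wq/Wk/Wv: (d_model, d_head)."""
    q, k, v = x @ Wq, x @ Wk, x @ Wv
    scores = q @ k.T / np.sqrt(k.shape[-1])            # (seq_len, seq_len) similarity scores
    mask = np.triu(np.ones_like(scores), k=1).astype(bool)
    scores[mask] = -np.inf                             # causal mask: no attending to future tokens
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)     # row-wise softmax
    return weights @ v                                 # (seq_len, d_head)

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))
Wq, Wk, Wv = (rng.normal(size=(8, 8)) for _ in range(3))
out = causal_self_attention(x, Wq, Wk, Wv)
print(out.shape)  # (4, 8)
```

Because of the causal mask, the first output position depends only on the first input token, which is what makes autoregressive training and generation possible.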
## Roadmap

- [x] Switch to uv
- [x] Make it easy to modify with a config file
- [ ] Extract the loss calculation from the model
- [ ] Rename main to train
- [ ] Create an easy-to-use interface
- [ ] Create or check tokenizer interface
- [ ] Make it into a package
- [ ] Apply SOTA optimizations
## Requirements

- Python 3.12+
- `uv` for dependency management
## Installation

- Clone the repository:

  ```bash
  git clone https://github.com/LabStrangeLoop/scratchgpt.git
  cd scratchgpt
  ```
- Install dependencies using uv:

  ```bash
  uv sync --all-groups
  ```
## Training

To train the model on your custom dataset:

```bash
uv run train -t <path_to_training_data> -e <experiment_folder>
```
- `-t, --train_source`: Path to the training data file or folder
- `-e, --experiment`: Path to the folder where experiment checkpoints will be saved
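Training a language model like this one minimizes next-token cross-entropy over the dataset. A minimal NumPy sketch of that objective (illustrative only; the function name is an assumption, and per the roadmap the repo currently computes the loss inside the model):

```python
import numpy as np

def next_token_cross_entropy(logits: np.ndarray, targets: np.ndarray) -> float:
    """Hypothetical standalone loss. logits: (seq_len, vocab_size); targets: (seq_len,) token ids."""
    shifted = logits - logits.max(axis=-1, keepdims=True)   # subtract max for numerical stability
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))
    # Mean negative log-probability assigned to each correct next token.
    return float(-log_probs[np.arange(len(targets)), targets].mean())

logits = np.zeros((3, 5))           # uniform predictions over a 5-token vocabulary
targets = np.array([0, 2, 4])
loss = next_token_cross_entropy(logits, targets)
print(round(loss, 4))  # uniform predictions give loss = ln(5) ≈ 1.6094
```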
## Inference

To generate text using a trained model:

```bash
uv run infer -e <experiment_folder> [-d <device>] [-m <max_tokens>]
```
- `-e, --experiment`: Path to the folder containing the trained model
- `-d, --device`: Device to run the model on (default: `cuda`)
- `-m, --max_tokens`: Maximum number of tokens to generate (default: 512)
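Conceptually, `--max_tokens` bounds an autoregressive loop: the model's output is appended to the context and fed back in until an end token appears or the budget runs out. A sketch with a stand-in model (the function and callback here are assumptions, not the repo's `infer` implementation):

```python
from typing import Callable

def generate(next_token: Callable[[list[int]], int], prompt: list[int],
             max_tokens: int, eos: int = 0) -> list[int]:
    """Feed the growing context back into the model until EOS or the token budget is hit."""
    tokens = list(prompt)
    for _ in range(max_tokens):
        tok = next_token(tokens)
        tokens.append(tok)
        if tok == eos:
            break
    return tokens

# Stand-in "model": decrements the last token toward EOS (id 0).
out = generate(lambda ctx: max(ctx[-1] - 1, 0), prompt=[3], max_tokens=512)
print(out)  # [3, 2, 1, 0]
```

A real model would replace the lambda with a forward pass plus sampling over the output logits.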
To explore the TikToken tokenizer:

```bash
uv run tiktoken
```
## Project Structure

- `scratchgpt/main.py`: Main training script
- `scratchgpt/infer.py`: Inference script for text generation
- `scratchgpt/model_io.py`: Utilities for saving and loading models
- `scratchgpt/tokenizer/`: Tokenizer implementations
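A tokenizer package like this typically revolves around a small encode/decode interface. A hypothetical sketch of such an interface with a trivial byte-level implementation (all names here are assumptions, not the repo's actual classes):

```python
from typing import Protocol

class Tokenizer(Protocol):
    # Hypothetical interface; the repo's real implementations live in scratchgpt/tokenizer/.
    def encode(self, text: str) -> list[int]: ...
    def decode(self, ids: list[int]) -> str: ...

class ByteTokenizer:
    """Simplest possible tokenizer: one token id per UTF-8 byte."""
    vocab_size = 256

    def encode(self, text: str) -> list[int]:
        return list(text.encode("utf-8"))

    def decode(self, ids: list[int]) -> str:
        return bytes(ids).decode("utf-8")

tok = ByteTokenizer()
ids = tok.encode("hi")
print(ids, tok.decode(ids))  # [104, 105] hi
```

Any class satisfying `encode`/`decode` round-tripping could then be swapped in, which is what a tokenizer interface (a roadmap item above) would formalize.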
## Development

This project uses the following development tools:

- `mypy` for static type checking
- `ruff` for formatting and linting
- `pytest` for testing
Run the following commands to ensure code quality:

```bash
uv run ruff check --fix .
uv run mypy .
uv run pytest
```
## Contributing

Contributions are welcome! Please feel free to submit a Pull Request.
## Authors

- Aleksandr Yeganov
- Dario Cazzani