A minimal end-to-end pipeline demonstrating how to train a tokenizer and a language model (LM) from scratch.
This project is intended for educational purposes and demonstrates the core components of modern LLM systems.
Pipeline:
Corpus → Tokenizer Training
GPT2 Tokenizer → Transformer Training
- Train a BPE tokenizer from raw text
- Train a Transformer LM using the GPT2 tokenizer
- Minimal, easy-to-understand code
This project uses a Python venv.
python3 -m venv venv
Mac / Linux:
source venv/bin/activate

Windows:
venv\Scripts\activate
pip install -r requirements.txt
Train tokenizer from raw corpus:
Download the data from https://huggingface.co/datasets/Skylion007/openwebtext/tree/main/plain_text
python tokenizer/train_tokenizer.py
Output:
tokenizer.json
Tokenizer type:
- Byte-level BPE
- Vocabulary size: 16000
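
As a rough sketch of what train_tokenizer.py does (the corpus path and special token here are assumptions, and the actual script may differ), a byte-level BPE tokenizer with this vocabulary size can be trained using the HuggingFace tokenizers library:

```python
# Minimal byte-level BPE training sketch with the HuggingFace `tokenizers`
# library. The corpus path and special token are assumptions.
from tokenizers import ByteLevelBPETokenizer

tokenizer = ByteLevelBPETokenizer()
tokenizer.train(
    files=["data/openwebtext.txt"],    # assumed corpus location
    vocab_size=16000,                  # matches the vocabulary size above
    min_frequency=2,
    special_tokens=["<|endoftext|>"],  # assumed special token
)
tokenizer.save("tokenizer.json")       # produces the output listed above
```

The saved tokenizer.json can then be reloaded with tokenizers.Tokenizer.from_file for encoding text.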
Train a small transformer language model:
The basic training script is provided as train.py, but the data must first be preprocessed into tokenized .pt format (see the sketch below). I trained the model on an NVIDIA A800 GPU on AutoDL, and the checkpoint is uploaded to https://huggingface.co/Hippocrene/MiniLLM-0.1B
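
A rough sketch of that preprocessing step, assuming the GPT2 tokenizer from the pipeline above, a plain-text corpus at data/openwebtext.txt, and a flat tensor of token ids as the output format (train.py's actual expectations may differ):

```python
# Minimal preprocessing sketch: encode raw text with the GPT2 tokenizer
# and save a flat tensor of token ids as a .pt file. File paths and the
# exact format expected by train.py are assumptions.
import torch
from transformers import GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")

ids = []
with open("data/openwebtext.txt", "r", encoding="utf-8") as f:  # assumed corpus path
    for line in f:
        ids.extend(tokenizer.encode(line))

torch.save(torch.tensor(ids, dtype=torch.long), "data/train.pt")  # assumed output path
```

A training script can then load data/train.pt once and draw fixed-length windows of token ids as input/target batches for next-token prediction.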
This project is a minimal demonstration and not intended for production use.
Real LLM training typically requires:
- billions of tokens
- larger transformer models
- distributed GPU training
MIT License