A complete, end-to-end implementation of a small language model trained from scratch, featuring modern Transformer architecture, efficient data pipelines, and instruction fine-tuning. This project demonstrates the full ML lifecycle: data preparation, pretraining, fine-tuning, and evaluation.
TinyLLM is a compact language model (512 hidden dim, 12 layers, 8 heads) designed to generate CLI commands from natural language instructions. The model achieves 93.94% exact-match accuracy on a held-out test set, demonstrating that small models can be highly effective for domain-specific tasks.
- Pretraining: 50,000 steps (~204M tokens seen) from 133 Wikipedia shards
- Fine-tuning: 2,000 steps on 2,302 instruction-command pairs
- Test Accuracy: 93.94% (93/99 exact matches)
- Model Size: 66.73M parameters (266.91 MB FP32)
- Training Time: ~13 hours pretraining + ~4 minutes fine-tuning on Apple Silicon
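The reported checkpoint size follows directly from the parameter count, since an FP32 checkpoint stores 4 bytes per parameter. A quick back-of-envelope check:

```python
# Sanity check on the reported checkpoint size:
# FP32 stores 4 bytes per parameter.
params = 66.73e6                 # reported parameter count
fp32_mb = params * 4 / 1e6       # size in megabytes (decimal)
print(f"{fp32_mb:.2f} MB")       # ~266.9 MB, matching the reported checkpoint size
```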
```
 <instruction> text
         │
         ▼
 [ Tokenizer (SentencePiece) ]
         │
         ▼
 ┌─────────────────────────────────┐
 │   TinyLLM Transformer (12L)     │
 │     • RoPE                      │
 │     • Multi-head Attention      │
 │     • SwiGLU FFN                │
 │     • RMSNorm                   │
 └─────────────────────────────────┘
         │
         ▼
 <command> tokens
```
- Transformer Architecture: 12 layers, 512 hidden dimension, 8 attention heads
- RoPE (Rotary Position Embedding): Modern positional encoding
- RMSNorm: Root Mean Square Layer Normalization
- SwiGLU: Swish-Gated Linear Unit activation
- Weight Tying: Shared embeddings between input and output layers
- SentencePiece Tokenizer: 32K vocabulary with special tokens
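For reference, RMSNorm and SwiGLU are typically defined as below; this is a minimal numpy sketch of the math, not the repository's exact `layers.py` code, which may differ in details such as weight shapes and epsilon.

```python
import numpy as np

def rms_norm(x, weight, eps=1e-6):
    # RMSNorm: scale by the root-mean-square of the features
    # (no mean-centering, unlike LayerNorm).
    rms = np.sqrt(np.mean(x * x, axis=-1, keepdims=True) + eps)
    return x / rms * weight

def swiglu(x, w_gate, w_up, w_down):
    # SwiGLU FFN: SiLU-gated linear unit, as used in LLaMA-style blocks.
    silu = lambda z: z / (1.0 + np.exp(-z))
    return (silu(x @ w_gate) * (x @ w_up)) @ w_down
```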
- `<instruction>`: Marks the start of natural language instructions
- `<command>`: Marks the start of CLI command output
- Standard tokens: `<pad>`, `<bos>`, `<eos>`, `<unk>`
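A hypothetical serialization of one instruction-command pair using these tokens (the repository's exact template is not shown here and may differ, e.g. in whitespace handling):

```python
# Hypothetical SFT example serialization; the repository's template may differ.
instruction = "list all files in current directory"
command = "ls -la"
text = f"<bos><instruction> {instruction} <command> {command} <eos>"
print(text)
```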
| Stage | Metric | Value |
|---|---|---|
| Pretraining | Final loss | 3.59 |
| Pretraining | Tokens processed | 204.8M |
| Fine-tuning (SFT) | Steps | 2,000 |
| Held-out evaluation | Exact match | 93.94% (93/99) |
| Model size | Parameters | 66.73M |
| Model size | FP32 checkpoint size | 266.91 MB |
| Hardware | Training device | 24GB M4 Mac Mini (MPS) |
| Inference | Generation speed | 103.4 tok/s (MPS) |
Tested on 99 held-out instruction-command pairs:
```
Total examples: 99
Exact matches:  93
Accuracy:       93.94%
```
The 6 failure cases (6.06%) fall into specific categories:
- Special characters/escape sequences (1 case): Complex regex patterns
- Email addresses (1 case): Domain completion
- Input redirections (1 case): File redirection syntax
- Version specifiers (1 case): Package version syntax
- Complex pipe chains (1 case): Multi-stage pipelines
- Regex patterns (1 case): Character class syntax
All six failures involve the model emitting the EOS token too early rather than producing malformed commands, suggesting that targeted training-data augmentation for these categories could close the gap.
TinyLLM is designed to run efficiently on consumer hardware. Here is the measured generation throughput on a 24GB M4 Mac Mini (MPS):
```
Total tokens generated: 1280
Total time:             12.38 s
Tokens per second:      103.4 tok/s
Time per token:         9.67 ms
```
Inference throughput: ~103 tokens/second
Hardware: Apple Silicon M4 (24GB unified memory)
At under 10 ms per token, TinyLLM can generate commands nearly instantly, making it suitable for interactive CLI agents, local assistants, or embedded tools.
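The summary figures follow from the raw measurement:

```python
# Reproducing the throughput summary from the raw measurement above.
tokens, seconds = 1280, 12.38
tok_per_s = tokens / seconds          # ~103.4 tok/s
ms_per_tok = 1000 * seconds / tokens  # ~9.67 ms per token
```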
TinyLLM demonstrates that:
- You don't need massive GPUs to train an LLM end-to-end – trained entirely on a Mac Mini with 24GB unified memory
- Small, domain-specific LLMs can achieve high accuracy – 93.94% exact-match accuracy with just 66.73M parameters
- Training infrastructure matters as much as model size – efficient data pipelines enable training on consumer hardware
- Modern LLM techniques scale down effectively – RoPE, RMSNorm, and SwiGLU work well even at small scales
- Building your own stack teaches you more than using off-the-shelf models – deep understanding of every component, from tokenization to inference
This project proves that with careful engineering, domain-specific language models can be trained from scratch on accessible hardware while maintaining production-grade quality.
- Custom tokenizer training – SentencePiece with domain-specific special tokens
- Streaming dataset pipelines – HuggingFace datasets → efficient numpy shards
- Transformer architecture implementation – built from scratch with modern components
- Training loop engineering – MPS optimization, checkpointing, learning rate scheduling
- Instruction tuning (SFT) – supervised fine-tuning with proper loss masking
- Loss masking and formatting strategies – only the command portion contributes to loss
- Evaluation harness development – comprehensive accuracy testing framework
- Error analysis and model debugging – systematic failure case analysis
- End-to-end ML pipeline – from raw data to deployed model
- Production-ready code – type hints, error handling, modular design
```bash
# Clone repository
git clone https://github.com/Geddydukes/tiny_llm.git
cd tiny_llm

# Create virtual environment
python -m venv .venv
source .venv/bin/activate  # On Windows: .venv\Scripts\activate

# Install dependencies
pip install --upgrade pip
pip install torch datasets sentencepiece tqdm
pip install -e .
```

```bash
python scripts/create_tokenizer_corpus.py \
  --out data/tokenizer_corpus.txt \
  --articles 50000
```

```bash
python scripts/train_tokenizer.py \
  --input data/tokenizer_corpus.txt \
  --model_prefix tokenizer
```

This creates `tokenizer.model` with special tokens including `<instruction>` and `<command>`.
```bash
mkdir -p data/wiki_shards
python scripts/stream_wiki_to_shards.py \
  --tokenizer tokenizer.model \
  --out_dir data/wiki_shards \
  --seq_len 512 \
  --shard_tokens 1000000 \
  --max_articles 500000
```

```bash
mkdir -p checkpoints/wiki_pretrain
python scripts/train_pretrain.py \
  --tokenizer tokenizer.model \
  --data_dir data/wiki_shards \
  --out_dir checkpoints/wiki_pretrain \
  --batch_size 8 \
  --max_steps 50000 \
  --warmup_steps 1000
```

```bash
python scripts/convert_raw_to_jsonl.py \
  --input data/raw_cli_pairs.txt \
  --output data/cli_sft.jsonl
```

```bash
mkdir -p checkpoints/cli_sft
python scripts/train_cli_sft.py \
  --tokenizer tokenizer.model \
  --jsonl data/cli_sft.jsonl \
  --out_dir checkpoints/cli_sft \
  --from_ckpt checkpoints/wiki_pretrain/pretrain_step_050000.pt \
  --batch_size 8 \
  --max_steps 2000 \
  --warmup_steps 200 \
  --max_seq_len 256
```

```bash
python scripts/generate_command.py \
  --tokenizer tokenizer.model \
  --ckpt checkpoints/cli_sft/cli_sft_step_002000.pt
```

Then type instructions:
```
> find all .log files not accessed in the last 30 days
[command] find . -type f -name '*.log' -atime +30 -print
```
```bash
python scripts/generate_command.py \
  --tokenizer tokenizer.model \
  --ckpt checkpoints/cli_sft/cli_sft_step_002000.pt \
  --instruction "list all files in current directory"
```

```bash
python scripts/eval_cli_accuracy.py \
  --tokenizer tokenizer.model \
  --ckpt checkpoints/cli_sft/cli_sft_step_002000.pt \
  --instructions data/test_instructions_heldout.txt \
  --gold data/test_commands_heldout.txt \
  --max_new_tokens 64 \
  --limit 100 \
  --output_jsonl results/cli_eval_heldout.jsonl
```

```
tiny_llm/
├── src/tiny_llm/        # Core model implementation
│   ├── __init__.py
│   ├── config.py        # Model configuration
│   ├── tokenizer.py     # SentencePiece wrapper
│   ├── data.py          # Dataset classes
│   ├── rope.py          # RoPE implementation
│   ├── layers.py        # Transformer blocks
│   └── model.py         # Main model
├── scripts/             # Training and utility scripts
│   ├── train_tokenizer.py
│   ├── create_tokenizer_corpus.py
│   ├── stream_wiki_to_shards.py
│   ├── train_pretrain.py
│   ├── convert_raw_to_jsonl.py
│   ├── train_cli_sft.py
│   ├── generate_command.py
│   └── eval_cli_accuracy.py
├── data/                # Data files
│   ├── raw_cli_pairs.txt
│   ├── test_instructions_heldout.txt
│   └── test_commands_heldout.txt
└── checkpoints/         # Model checkpoints (gitignored)
```
Pretraining:
- Optimizer: AdamW (lr=3e-4, β₁=0.9, β₂=0.95, weight_decay=0.01)
- Learning Rate: Cosine schedule with 1,000 step warmup
- Gradient Clipping: 1.0
- Sequence Length: 512 tokens
- Batch Size: 8
- Total Steps: 50,000
Fine-tuning:
- Optimizer: AdamW (lr=1e-4, same betas)
- Learning Rate: Cosine schedule with 200 step warmup
- Sequence Length: 256 tokens
- Batch Size: 8
- Total Steps: 2,000
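The warmup-plus-cosine schedule used in both stages can be sketched as below; this is a common formulation (linear warmup, then cosine decay to zero), and the repository's scheduler may differ in details such as a minimum-LR floor.

```python
import math

def lr_at(step, max_lr=3e-4, warmup=1000, total=50000, min_lr=0.0):
    # Linear warmup to max_lr, then cosine decay toward min_lr.
    if step < warmup:
        return max_lr * step / warmup
    progress = (step - warmup) / (total - warmup)
    return min_lr + 0.5 * (max_lr - min_lr) * (1 + math.cos(math.pi * progress))
```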
During fine-tuning, only the command portion (the tokens after `<command>`) contributes to the loss. Instruction tokens are assigned the label -100, PyTorch's cross-entropy ignore index.
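The masking step can be sketched as follows; token ids and the helper name are illustrative, not the repository's exact code.

```python
# Sketch of SFT loss masking: everything up to and including <command>
# gets label -100 so cross-entropy ignores it; only command tokens
# (and EOS) contribute to the loss. Token ids here are toy values.
IGNORE_INDEX = -100

def build_labels(input_ids, command_token_id):
    cut = input_ids.index(command_token_id) + 1  # first position after <command>
    return [IGNORE_INDEX] * cut + input_ids[cut:]

ids = [2, 901, 15, 27, 902, 88, 99, 3]  # toy ids; 902 stands in for <command>
labels = build_labels(ids, 902)
```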
- Streaming: Wikipedia data is streamed directly from HuggingFace datasets
- Efficient Storage: Tokenized sequences stored as numpy arrays (`.npy` shards)
- Memory-Mapped: Shards loaded with memory mapping for efficient access
- Low Disk Usage: ~300MB for 133 shards (~133M tokens)
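The memory-mapped access pattern can be sketched as below; paths and shapes are illustrative rather than the repository's exact shard layout.

```python
import numpy as np
import os
import tempfile

# Write a toy token shard, then read it back memory-mapped so training
# batches can be sliced without loading the whole shard into RAM.
path = os.path.join(tempfile.mkdtemp(), "shard_000.npy")
np.save(path, np.arange(1000, dtype=np.uint16))  # pretend tokenized stream

shard = np.load(path, mmap_mode="r")   # memory-mapped, read-only
seq_len = 512
batch = np.asarray(shard[:seq_len])    # copies only this slice into memory
```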
- Initial loss: ~60.67 (step 100)
- Final loss: ~3.59 (step 50,000)
- ~94% reduction over training
Loss progression:
- Step 1,000: 10.97
- Step 5,000: 6.31
- Step 10,000: 5.39
- Step 20,000: 4.56
- Step 30,000: 4.23
- Step 50,000: 3.59
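For intuition, cross-entropy losses in nats per token map to perplexities via `ppl = exp(loss)`:

```python
import math

# Cross-entropy loss (nats/token) corresponds to perplexity exp(loss):
# roughly the model's effective number of vocabulary choices per token.
ppl_step_1k = math.exp(10.97)   # step 1,000
ppl_final = math.exp(3.59)      # step 50,000: ~36 effective choices per token
```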
- ✅ Device Agnostic: Automatic MPS/CUDA/CPU detection
- ✅ Efficient Data Loading: Memory-mapped numpy shards
- ✅ Streaming Data: Low-memory Wikipedia streaming
- ✅ Checkpointing: Regular model saves during training
- ✅ Evaluation Suite: Comprehensive accuracy testing
- ✅ Error Handling: Robust parsing and validation
- Type hints throughout
- Comprehensive error messages
- Modular, reusable components
- Clean separation of concerns
This project demonstrates:
- End-to-end ML pipeline: From raw data to deployed model
- Modern architectures: RoPE, RMSNorm, SwiGLU
- Efficient training: Streaming data, memory-mapped shards
- Instruction tuning: Supervised fine-tuning for specific tasks
- Evaluation methodology: Held-out test sets, exact-match accuracy
If you use this code in your research, please cite:
```bibtex
@software{tiny_llm,
  title  = {TinyLLM: A Minimal Yet Production-Grade Language Model Stack},
  author = {Dukes, Geddy},
  year   = {2025},
  url    = {https://github.com/Geddydukes/tiny_llm}
}
```

MIT License - see LICENSE file for details.
- SentencePiece for tokenization
- HuggingFace datasets for Wikipedia streaming
- PyTorch for the deep learning framework
- Inspired by modern LLM architectures (LLaMA, GPT, etc.)
- Add more training examples for failure cases (email addresses, pipes, etc.)
- Implement beam search for generation
- Add support for multi-turn conversations
- Experiment with larger model sizes
- Add support for other command types (PowerShell, etc.)
Built with ❤️ for learning and research