Train, evaluate, fine-tune, and serve compact language models on one GPU.
ForgeLM is an end-to-end small-language-model workshop built around a Llama-style decoder, a packed-data pipeline, resumable training, SFT, DPO, evaluation, inference, and a mobile-friendly web dashboard. It is designed for experiments that can be understood and operated by one person on a single consumer GPU.
Project status: active development. The 120M reference model is currently training; published quality results remain pending. See Current status and MODEL_CARD.md.
- One coherent pipeline: tokenizer → packed dataset → pretraining → SFT → DPO → evaluation → chat.
- Single-GPU operation: shipped model sizes range from roughly 32M to 571M parameters, with the 120M configuration as the recommended default.
- Recoverable runs: background jobs, atomic checkpoints, explicit resume controls, and numbered checkpoints every 500 steps.
- Visible training: live GPU telemetry, loss curves, throughput, learning rate, token count, VRAM, ETA, logs, and run controls.
- Honest reporting: measured engineering data is separated from unfinished or unmeasured model-quality results.
- Multiple interfaces: Python package, forge CLI, REST/SSE API, and a responsive web UI.
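To give a feel for where a given model size comes from, here is a rough parameter-count sketch for a Llama-style decoder with grouped-query attention, SwiGLU, RMSNorm, and tied embeddings. The `DecoderShape` fields and the example dimensions are hypothetical and are not taken from any shipped ForgeLM config; only the 32,000-token vocabulary matches the tokenizer command in this README. Bias-free linear layers are assumed.

```python
from dataclasses import dataclass


@dataclass
class DecoderShape:
    """Hypothetical decoder shape, for illustration only (not a ForgeLM config)."""
    vocab_size: int
    d_model: int
    n_layers: int
    n_heads: int
    n_kv_heads: int
    d_ff: int


def count_parameters(s: DecoderShape) -> int:
    """Count learned weights, assuming no biases anywhere."""
    head_dim = s.d_model // s.n_heads
    kv_dim = s.n_kv_heads * head_dim
    # Tied input/output embeddings are counted once.
    embedding = s.vocab_size * s.d_model
    # Grouped-query attention: full-width Q and O, narrower K and V.
    attention = 2 * s.d_model * s.d_model + 2 * s.d_model * kv_dim
    # SwiGLU uses three projections: gate, up, and down.
    feed_forward = 3 * s.d_model * s.d_ff
    # Two RMSNorm weight vectors per block; rotary embeddings add no weights.
    norms = 2 * s.d_model
    per_layer = attention + feed_forward + norms
    final_norm = s.d_model
    return embedding + s.n_layers * per_layer + final_norm


# Example dimensions are assumptions chosen to land near the 100M range.
shape = DecoderShape(vocab_size=32000, d_model=768, n_layers=12,
                     n_heads=12, n_kv_heads=4, d_ff=2048)
total = count_parameters(shape)
print(f"{total:,} parameters (~{total / 1e6:.0f}M)")
```

At this scale the embedding table is a large share of the total, which is one reason tying input and output embeddings matters for small models.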
| Dashboard | Training detail | Checkpoint actions |
|---|---|---|
| Placeholder: live dashboard | Placeholder: curves and metrics | Placeholder: resume, SFT, and DPO |
| Dataset pipeline | Model chat | Benchmark report |
|---|---|---|
| Placeholder: tokenizer and preparation | Placeholder: streaming chat | Placeholder: evaluation results |
The capture checklist, filenames, framing, and redaction guidance are in docs/SCREENSHOTS.md.
flowchart LR
A[Text sources] --> B[Byte-level BPE tokenizer]
A --> C[Packed uint16 token dataset]
B --> C
C --> D[Llama-style decoder pretraining]
D --> E[Numbered checkpoints]
E --> F[Evaluation]
E --> G[Supervised fine-tuning]
G --> H[Direct preference optimization]
E --> I[Inference and chat]
G --> I
H --> I
J[Web UI] --> K[FastAPI server]
K --> L[Background job manager]
L --> B
L --> C
L --> D
L --> F
L --> G
L --> H
K --> I
The model backbone uses:
- decoder-only transformer blocks;
- pre-normalization with RMSNorm;
- rotary position embeddings;
- grouped-query attention through PyTorch SDPA;
- SwiGLU feed-forward layers;
- tied input and output embeddings;
- bf16 training, AdamW, gradient accumulation, warmup, and cosine decay;
- KV-cached generation with streaming output.
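The warmup and cosine decay mentioned in the list above can be sketched as a pure function of the step number. This is a minimal illustration, not the trainer's actual code: the function name `lr_at`, the linear warmup shape, and the learning-rate values are assumptions. Only the 30,000-step total appears elsewhere in this README.

```python
import math


def lr_at(step: int, max_lr: float, min_lr: float,
          warmup_steps: int, total_steps: int) -> float:
    """Linear warmup to max_lr, then cosine decay to min_lr."""
    if step < warmup_steps:
        # Ramp up linearly; step 0 already gets a small non-zero rate.
        return max_lr * (step + 1) / warmup_steps
    if step >= total_steps:
        return min_lr
    # progress runs from 0.0 at the end of warmup to 1.0 at total_steps.
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))
    return min_lr + (max_lr - min_lr) * cosine


# Hypothetical hyperparameters, for illustration only.
MAX_LR, MIN_LR, WARMUP, TOTAL = 6e-4, 6e-5, 1_000, 30_000
for step in (0, 999, 15_500, 29_999):
    print(step, f"{lr_at(step, MAX_LR, MIN_LR, WARMUP, TOTAL):.2e}")
```

Because the schedule depends only on the step number, a resumed run can recompute the correct learning rate from the checkpointed step without storing scheduler state.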
Long-running tasks execute as background CLI subprocesses. Metrics are written to JSONL, run state is persisted to JSON, and the dashboard polls the API without owning the training process. Closing or refreshing the browser does not stop a run.
See docs/architecture.md for the design rationale.
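The separation between the training process and the dashboard can be pictured with a small sketch: the writer appends metrics to a JSONL file and replaces a JSON state file atomically, and the reader only ever reads files. The file names `metrics.jsonl` and `state.json`, the record fields, and all function names here are hypothetical; ForgeLM's real layout may differ.

```python
import json
import os
import tempfile
from pathlib import Path


def append_metric(path: Path, record: dict) -> None:
    """Append one metrics record as a single JSON line."""
    with path.open("a", encoding="utf-8") as f:
        f.write(json.dumps(record) + "\n")


def write_state_atomic(path: Path, state: dict) -> None:
    """Write to a temp file in the same directory, then rename over the target."""
    fd, tmp = tempfile.mkstemp(dir=str(path.parent), suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as f:
            json.dump(state, f)
            f.flush()
            os.fsync(f.fileno())
        os.replace(tmp, path)  # readers see the old file or the new one
    except BaseException:
        if os.path.exists(tmp):
            os.unlink(tmp)
        raise


def read_metrics(path: Path) -> list[dict]:
    """Read every complete line; safe to call repeatedly from a poller."""
    records: list[dict] = []
    if not path.exists():
        return records
    with path.open(encoding="utf-8") as f:
        for line in f:
            if not line.endswith("\n"):
                break  # writer is mid-line; pick it up on the next poll
            if line.strip():
                records.append(json.loads(line))
    return records


with tempfile.TemporaryDirectory() as tmp_dir:
    run_dir = Path(tmp_dir)
    metrics_path = run_dir / "metrics.jsonl"
    state_path = run_dir / "state.json"
    append_metric(metrics_path, {"step": 1, "loss": 3.52})
    append_metric(metrics_path, {"step": 2, "loss": 3.41})
    write_state_atomic(state_path, {"status": "running", "step": 2})
    metrics = read_metrics(metrics_path)
    state = json.loads(state_path.read_text(encoding="utf-8"))

print(metrics[-1], state)
```

Since the reader holds no handle on the training process, killing or restarting it has no effect on the run, which matches the behavior described above.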
- Python 3.10 or newer
- PyTorch built for the installed CUDA stack
- NVIDIA GPU recommended; the reference workflow targets a 24GB RTX 4090
- NVIDIA Container Toolkit when using Docker
git clone <your-repository-url> forgelm
cd forgelm
# Install the correct PyTorch build for your CUDA environment first.
pip install torch --index-url https://download.pytorch.org/whl/cu124
pip install -e .
# Validate the complete pipeline on synthetic data.
bash scripts/run_smoke.sh
# Start the dashboard and API.
forge serve --host 0.0.0.0 --port 8080

Open http://localhost:8080. If the machine is reachable through Tailscale or
another private network, open http://<machine-address>:8080 from a phone or
another computer.
docker compose up --build

Then open http://localhost:8080.
forge tokenizer \
--sources configs/sources_fineweb_edu.json \
--vocab 32000 \
--out data/tokenizer.json
forge prepare \
--sources configs/sources_fineweb_edu.json \
--tokenizer data/tokenizer.json \
--out data/tokenized/train
forge train --config configs/120m.yaml

The web UI exposes the same workflow and adds checkpoint selection, run controls, SFT/DPO launch forms, benchmarks, reports, playground inference, and chat.
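As an illustration of what a packed uint16 token dataset looks like on disk, the sketch below concatenates tokenized documents into one flat little-endian file and reads a fixed-length window back. It uses only the standard library. The function names, the `EOS_ID` separator value, and the `train.bin` file name are assumptions; the real `forge prepare` output format and its memmap-based reader may differ.

```python
import array
import sys
import tempfile
from pathlib import Path

EOS_ID = 0  # hypothetical end-of-document token id


def pack_documents(docs, out_path) -> int:
    """Concatenate token-id lists, EOS-separated, into a flat uint16 file."""
    buf = array.array("H")  # unsigned 16-bit; ids above 65535 raise OverflowError
    assert buf.itemsize == 2
    for ids in docs:
        buf.extend(ids)
        buf.append(EOS_ID)
    if sys.byteorder != "little":
        buf.byteswap()  # store little-endian regardless of host
    with open(out_path, "wb") as f:
        buf.tofile(f)
    return len(buf)


def read_window(path, start: int, length: int) -> list:
    """Read `length` tokens beginning at token offset `start`."""
    buf = array.array("H")
    with open(path, "rb") as f:
        f.seek(start * 2)  # two bytes per token
        buf.fromfile(f, length)
    if sys.byteorder != "little":
        buf.byteswap()
    return buf.tolist()


with tempfile.TemporaryDirectory() as tmp_dir:
    path = Path(tmp_dir) / "train.bin"
    n_tokens = pack_documents([[5, 6, 7], [8, 9]], path)
    window = read_window(path, start=2, length=4)

print(n_tokens, window)  # 7 [7, 0, 8, 9]
```

A 32,000-entry vocabulary fits comfortably in 16 bits, so each token costs two bytes and a training window can cross document boundaries without padding.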
Pretraining checkpoints are saved every 500 steps. Resume from the dashboard or
with a config whose train.resume points to a checkpoint:
train:
  resume: checkpoints/120m/last.pt
  ckpt_interval: 500

Instruction-tune any compatible checkpoint:
forge sft \
--base checkpoints/120m/best.pt \
--data data/sft/train.jsonl \
--tokenizer data/tokenizer.json \
--out checkpoints/120m-sft

Optionally apply DPO after SFT:
forge dpo \
--sft checkpoints/120m-sft/best.pt \
--data data/dpo/train.jsonl \
--tokenizer data/tokenizer.json \
--out checkpoints/120m-dpo

Snapshot captured June 22, 2026. These are operational measurements, not final release claims:
| Item | Value |
|---|---|
| 120M pretraining | Active |
| Training step | 3,820 / 30,000 |
| Tokens processed | 2,002,780,160 |
| Latest train loss | 3.0188 |
| Best recorded validation loss | 3.1166 |
| Training throughput | ~126k tokens/s |
| Model VRAM reported by trainer | 11.65GB |
| Prepared pretraining corpus | 3.0M documents; 2.811B train tokens; 1.495M validation tokens |
| Latest numbered checkpoints | ckpt_2500.pt, ckpt_3000.pt, ckpt_3500.pt |
| Quality benchmarks | Pending; no final scores published |
The snapshot will become stale as training continues. Live values are available
in the dashboard and checkpoints/120m/status.json.
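A few derived quantities follow from the snapshot numbers by simple arithmetic. The sketch below uses only values from the table and assumes that tokens per step and throughput stay constant for the rest of the run, which is an approximation (the throughput figure itself is rounded).

```python
# Values copied from the snapshot table above.
steps_done = 3_820
total_steps = 30_000
tokens_done = 2_002_780_160
train_tokens = 2.811e9      # prepared training corpus
throughput = 126_000        # approximate tokens per second

# Tokens consumed per optimizer step (micro-batches x sequence length).
tokens_per_step = tokens_done // steps_done
assert tokens_per_step * steps_done == tokens_done  # divides exactly

planned_tokens = tokens_per_step * total_steps
remaining_tokens = planned_tokens - tokens_done
eta_hours = remaining_tokens / throughput / 3600
epochs = planned_tokens / train_tokens

print(f"tokens per step:   {tokens_per_step:,}")      # 524,288 (2**19)
print(f"planned tokens:    {planned_tokens:,}")       # 15,728,640,000
print(f"remaining time:    ~{eta_hours:.1f} hours")   # ~30.3 hours
print(f"passes over corpus: ~{epochs:.1f}")           # ~5.6
```

Under these assumptions the full 30,000-step schedule would pass over the prepared corpus several times, so the single-epoch intuition common in large-scale pretraining does not apply here.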
ForgeLM supports:
- held-out token-level perplexity;
- length-normalized multiple-choice accuracy;
- inference throughput;
- fixed-prompt generation samples;
- a categorized assistant benchmark with an offline heuristic or optional LLM judge;
- base-versus-SFT-versus-DPO comparisons.
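Two of the metrics above reduce to short formulas once per-token log-probabilities are available. The sketch below assumes losses are mean negative log-likelihoods in nats and that length normalization means dividing by token count (some harnesses normalize by bytes or characters instead). Function names are hypothetical and the inputs are toy values, not model outputs.

```python
import math


def perplexity(token_nlls) -> float:
    """Perplexity is the exponential of the mean per-token negative log-likelihood."""
    return math.exp(sum(token_nlls) / len(token_nlls))


def pick_choice(choice_logprobs) -> int:
    """Return the index of the candidate with the highest mean token log-prob.

    choice_logprobs holds one list of per-token log-probs per candidate answer.
    """
    scores = [sum(lp) / len(lp) for lp in choice_logprobs]
    return max(range(len(scores)), key=scores.__getitem__)


# Toy example: a long answer with confident tokens versus a short, unlikely one.
candidates = [[-1.0, -1.0, -1.0, -1.0], [-2.5]]
raw_sums = [sum(lp) for lp in candidates]
raw_pick = max(range(len(raw_sums)), key=raw_sums.__getitem__)
normalized_pick = pick_choice(candidates)

print(raw_pick, normalized_pick)  # 1 0
print(round(perplexity([3.1166]), 2))
```

The toy case shows why normalization matters: summed log-probabilities favor the shorter candidate simply because it has fewer tokens, while the per-token mean prefers the answer the model finds more likely at each position.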
Use immutable numbered checkpoints for comparisons. Do not run GPU evaluation on the same GPU while pretraining is active; it can reduce throughput or exhaust VRAM. The complete schedule and command templates are in docs/BENCHMARK_PLAN.md.
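When scripting comparisons, the numbered checkpoints need to be sorted by step rather than by file name, because a plain string sort places ckpt_2500.pt before ckpt_500.pt. The helper below is a sketch that assumes the ckpt_<step>.pt naming seen in the snapshot table; the function name is hypothetical.

```python
import re

CKPT_RE = re.compile(r"^ckpt_(\d+)\.pt$")


def numbered_checkpoints(names):
    """Return (step, name) pairs sorted by step, skipping last.pt and best.pt."""
    found = []
    for name in names:
        match = CKPT_RE.match(name)
        if match:
            found.append((int(match.group(1)), name))
    return sorted(found)


names = ["last.pt", "ckpt_3000.pt", "best.pt",
         "ckpt_500.pt", "ckpt_3500.pt", "ckpt_2500.pt"]
ordered = numbered_checkpoints(names)
print([name for _, name in ordered])
# ['ckpt_500.pt', 'ckpt_2500.pt', 'ckpt_3000.pt', 'ckpt_3500.pt']

# With a real run directory (location is an assumption):
#   from pathlib import Path
#   numbered_checkpoints(p.name for p in Path("checkpoints/120m").iterdir())
```

Moving pointers such as last.pt and best.pt are excluded on purpose, since they can change between two evaluation runs.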
forgelm/
  model/     transformer architecture and generation
  data/      tokenizer, source iteration, packing, memmap dataset
  train/     trainer, checkpoint manager, metrics logging
  sft/       supervised fine-tuning and DPO
  eval/      perplexity, multiple choice, throughput, assistant benchmark
  infer/     completion, chat, and streaming inference
  server/    FastAPI application, job manager, telemetry, web UI
  cli.py     forge command-line entry point
configs/     model, dataset, and source configurations
docs/        architecture, datasets, model guidance, and launch documentation
scripts/     smoke test, dataset fetch, and sweep utilities
tests/       model, data, training, evaluation, inference, and API tests
- 120M model card
- Benchmark plan
- Screenshot plan
- Architecture rationale
- Dataset curriculum
- Recommended model recipe
- Detailed quickstart
- The 120M model is still training and is not a finished release.
- No final downstream or assistant-quality scores are published yet.
- Models at this scale have limited knowledge, reasoning, context retention, and instruction-following capacity.
- Training and evaluation currently assume a single process and are optimized for one NVIDIA GPU.
- Checkpoints include optimizer state and are substantially larger than inference-only weights.
- Data sources carry their own licenses, terms, biases, and quality limitations.
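The checkpoint-size limitation above can be made concrete with back-of-the-envelope arithmetic. AdamW keeps two moment buffers per parameter; the sketch assumes a training checkpoint stores fp32 weights plus both moments in fp32, and that inference-only weights are exported in bf16. The storage precisions and the nominal 120M parameter count are assumptions, so actual file sizes will differ.

```python
def size_gb(n_params: int, bytes_per_param: int) -> float:
    """Storage in decimal gigabytes for a given per-parameter byte cost."""
    return n_params * bytes_per_param / 1e9


n_params = 120_000_000  # nominal size of the reference model

# Inference-only export: bf16 weights, 2 bytes each.
inference_bf16 = size_gb(n_params, 2)

# Assumed training checkpoint: fp32 weights + AdamW first and second moments.
training_ckpt = size_gb(n_params, 4 + 4 + 4)

print(f"inference weights:   ~{inference_bf16:.2f} GB")
print(f"training checkpoint: ~{training_ckpt:.2f} GB")
print(f"ratio:               ~{training_ckpt / inference_bf16:.0f}x")
```

With numbered checkpoints written every 500 steps, this ratio is worth keeping in mind when budgeting disk space for a full run.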
The Python package metadata declares MIT. Add a root LICENSE file before a
public release so the repository contains the full license text.