ForgeLM

Train, evaluate, fine-tune, and serve compact language models on one GPU.

ForgeLM is an end-to-end small-language-model workshop built around a Llama-style decoder, a packed-data pipeline, resumable training, SFT, DPO, evaluation, inference, and a mobile-friendly web dashboard. It is designed for experiments that can be understood and operated by one person on a single consumer GPU.

Project status: active development. The 120M reference model is currently training; published quality results remain pending. See Current status and MODEL_CARD.md.

Why ForgeLM

One coherent pipeline: tokenizer → packed dataset → pretraining → SFT → DPO → evaluation → chat.
Single-GPU operation: shipped model sizes range from roughly 32M to 571M parameters, with the 120M configuration as the recommended default.
Recoverable runs: background jobs, atomic checkpoints, explicit resume controls, and numbered checkpoints every 500 steps.
Visible training: live GPU telemetry, loss curves, throughput, learning rate, token count, VRAM, ETA, logs, and run controls.
Honest reporting: measured engineering data is separated from unfinished or unmeasured model-quality results.
Multiple interfaces: Python package, forge CLI, REST/SSE API, and a responsive web UI.
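
The "atomic checkpoints" item above can be illustrated with a common write-then-rename pattern. This is a minimal sketch, not ForgeLM's checkpoint manager: the function name `atomic_write_bytes` is hypothetical, and the assumption is that the checkpoint has already been serialized to bytes.

```python
import os
import tempfile
from pathlib import Path


def atomic_write_bytes(path: Path, payload: bytes) -> None:
    """Write payload so readers see either the old file or the complete new one."""
    path.parent.mkdir(parents=True, exist_ok=True)
    # Create the temp file next to the target so the rename stays on one filesystem.
    fd, tmp_name = tempfile.mkstemp(dir=path.parent, prefix=path.name, suffix=".tmp")
    try:
        with os.fdopen(fd, "wb") as tmp:
            tmp.write(payload)
            tmp.flush()
            os.fsync(tmp.fileno())  # push the bytes to disk before renaming
        os.replace(tmp_name, path)  # swaps the new file in as a single step
    except BaseException:
        # Remove the partial temp file; the previous checkpoint is untouched.
        if os.path.exists(tmp_name):
            os.unlink(tmp_name)
        raise
```

If the process is killed mid-write, only the `.tmp` file is incomplete, so a resume can still load the previous checkpoint.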

Screenshots

Dashboard	Training detail	Checkpoint actions
Placeholder: live dashboard	Placeholder: curves and metrics	Placeholder: resume, SFT, and DPO

Dataset pipeline	Model chat	Benchmark report
Placeholder: tokenizer and preparation	Placeholder: streaming chat	Placeholder: evaluation results

The capture checklist, filenames, framing, and redaction guidance are in docs/SCREENSHOTS.md.

Architecture

flowchart LR
    A[Text sources] --> B[Byte-level BPE tokenizer]
    A --> C[Packed uint16 token dataset]
    B --> C
    C --> D[Llama-style decoder pretraining]
    D --> E[Numbered checkpoints]
    E --> F[Evaluation]
    E --> G[Supervised fine-tuning]
    G --> H[Direct preference optimization]
    E --> I[Inference and chat]
    G --> I
    H --> I

    J[Web UI] --> K[FastAPI server]
    K --> L[Background job manager]
    L --> B
    L --> C
    L --> D
    L --> F
    L --> G
    L --> H
    K --> I

The model backbone uses:

decoder-only transformer blocks;
pre-normalization with RMSNorm;
rotary position embeddings;
grouped-query attention through PyTorch SDPA;
SwiGLU feed-forward layers;
tied input and output embeddings;
bf16 training, AdamW, gradient accumulation, warmup, and cosine decay;
KV-cached generation with streaming output.
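
The "warmup, and cosine decay" schedule in the list above can be sketched as a pure function of the step number. This is one common formulation and may differ from the trainer's own; the function name `lr_at` and the learning-rate and warmup values in the usage note are illustrative assumptions. Only the 30,000-step total comes from this README.

```python
import math


def lr_at(step: int, max_lr: float, min_lr: float,
          warmup_steps: int, total_steps: int) -> float:
    """Linear warmup to max_lr, then cosine decay to min_lr at total_steps."""
    if step < warmup_steps:
        # Ramp linearly; step 0 already gets a small non-zero rate.
        return max_lr * (step + 1) / warmup_steps
    if step >= total_steps:
        return min_lr
    # progress runs from 0.0 (end of warmup) to 1.0 (end of training).
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))
    return min_lr + (max_lr - min_lr) * cosine
```

For example, `lr_at(15_500, 6e-4, 6e-5, 1_000, 30_000)` lands halfway between the two rates, because step 15,500 is the midpoint of the decay phase under these assumed settings.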

Long-running tasks execute as background CLI subprocesses. Metrics are written to JSONL, run state is persisted to JSON, and the dashboard polls the API without owning the training process. Closing or refreshing the browser does not stop a run.
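
The split between a writer process that appends JSONL and a reader that polls it can be sketched as follows. This is a simplified illustration, not the ForgeLM job manager; the function names and the `step` and `loss` fields are hypothetical.

```python
import json
from pathlib import Path


def append_metric(path: Path, record: dict) -> None:
    """Append one metrics record as a single JSON line (writer side)."""
    with path.open("a", encoding="utf-8") as f:
        f.write(json.dumps(record) + "\n")


def read_metrics(path: Path) -> list[dict]:
    """Return all complete records (reader side); tolerate a half-written last line."""
    records: list[dict] = []
    if not path.exists():
        return records
    with path.open("r", encoding="utf-8") as f:
        for line in f:
            if not line.endswith("\n"):
                break  # the writer is mid-line; the next poll will pick it up
            line = line.strip()
            if line:
                records.append(json.loads(line))
    return records
```

Because the reader only opens the file, it can start, stop, or crash without affecting the training process, which matches the behavior described above.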

See docs/architecture.md for the design rationale.

Quickstart

Requirements

Python 3.10 or newer
PyTorch built for the installed CUDA stack
NVIDIA GPU recommended; the reference workflow targets a 24 GB RTX 4090
NVIDIA Container Toolkit when using Docker

Local installation

git clone <your-repository-url> forgelm
cd forgelm

# Install the PyTorch build that matches your CUDA environment first.
# The cu124 index below is one example; substitute the index for your CUDA version.
pip install torch --index-url https://download.pytorch.org/whl/cu124
pip install -e .

# Validate the complete pipeline on synthetic data.
bash scripts/run_smoke.sh

# Start the dashboard and API.
forge serve --host 0.0.0.0 --port 8080

Open http://localhost:8080. If the machine is reachable through Tailscale or another private network, open http://<machine-address>:8080 from a phone or another computer.

Docker

docker compose up --build

Then open http://localhost:8080.

Train the 120M reference configuration

forge tokenizer \
  --sources configs/sources_fineweb_edu.json \
  --vocab 32000 \
  --out data/tokenizer.json

forge prepare \
  --sources configs/sources_fineweb_edu.json \
  --tokenizer data/tokenizer.json \
  --out data/tokenized/train

forge train --config configs/120m.yaml
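
With a 32,000-entry vocabulary, every token id fits in an unsigned 16-bit integer, which is why a packed uint16 dataset is possible. The sketch below shows the general idea of packing and slicing training windows with the standard library only. It is not the `forge prepare` implementation: the function names, the end-of-document separator id, and the non-overlapping windowing are all assumptions.

```python
from array import array


def pack_documents(docs: list[list[int]], eos_id: int) -> array:
    """Concatenate tokenized documents into one flat buffer, separated by eos_id."""
    packed = array("H")  # unsigned short: 16 bits on common platforms, ids up to 65,535
    for tokens in docs:
        packed.extend(tokens)  # raises OverflowError if an id does not fit
        packed.append(eos_id)
    return packed


def training_windows(packed: array, seq_len: int):
    """Yield (inputs, targets) pairs; targets are the inputs shifted by one token."""
    for start in range(0, len(packed) - seq_len, seq_len):
        chunk = packed[start : start + seq_len + 1]
        yield list(chunk[:-1]), list(chunk[1:])
```

Packing avoids padding: documents are laid end to end, so every position in a window carries a real token.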

The web UI exposes the same workflow and adds checkpoint selection, run controls, SFT/DPO launch forms, benchmarks, reports, playground inference, and chat.

Resume and fine-tune

Pretraining checkpoints are saved every 500 steps. Resume from the dashboard or with a config whose train.resume points to a checkpoint:

train:
  resume: checkpoints/120m/last.pt
  ckpt_interval: 500
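
When choosing a numbered checkpoint to resume from, compare step numbers numerically; a plain string sort places `ckpt_500.pt` after `ckpt_3500.pt`. A small helper along these lines may be useful. It assumes the `ckpt_<step>.pt` naming shown in the status snapshot, and the function name is hypothetical.

```python
import re
from pathlib import Path
from typing import Optional

CKPT_PATTERN = re.compile(r"^ckpt_(\d+)\.pt$")


def latest_numbered_checkpoint(ckpt_dir: Path) -> Optional[Path]:
    """Return the ckpt_<step>.pt file with the highest step, or None if there is none."""
    best_step, best_path = -1, None
    for path in ckpt_dir.iterdir():
        match = CKPT_PATTERN.match(path.name)
        # Files such as last.pt and best.pt do not match and are ignored.
        if match and int(match.group(1)) > best_step:
            best_step, best_path = int(match.group(1)), path
    return best_path
```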

Instruction-tune any compatible checkpoint:

forge sft \
  --base checkpoints/120m/best.pt \
  --data data/sft/train.jsonl \
  --tokenizer data/tokenizer.json \
  --out checkpoints/120m-sft

Optionally apply DPO after SFT:

forge dpo \
  --sft checkpoints/120m-sft/best.pt \
  --data data/dpo/train.jsonl \
  --tokenizer data/tokenizer.json \
  --out checkpoints/120m-dpo
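
For orientation, the standard DPO objective for one preference pair can be written in a few lines. This sketch follows the published formulation, not necessarily ForgeLM's code: it assumes the frozen SFT checkpoint acts as the reference model, that each argument is the summed log-probability of a full response, and that `beta=0.1` is an illustrative default.

```python
import math


def dpo_loss(policy_chosen_logp: float, policy_rejected_logp: float,
             ref_chosen_logp: float, ref_rejected_logp: float,
             beta: float = 0.1) -> float:
    """Per-pair DPO loss: -log(sigmoid(beta * (chosen log-ratio - rejected log-ratio)))."""
    chosen_reward = beta * (policy_chosen_logp - ref_chosen_logp)
    rejected_reward = beta * (policy_rejected_logp - ref_rejected_logp)
    margin = chosen_reward - rejected_reward
    # -log(sigmoid(m)) equals softplus(-m); this form avoids overflow for large |m|.
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))
```

When the policy and the reference agree exactly, the margin is zero and the loss equals `log(2)`. The loss drops below that value only as the policy raises the chosen response relative to the rejected one more than the reference does.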

Current status

Snapshot captured June 22, 2026. These are operational measurements, not final release claims:

Item	Status
120M pretraining	Active
Training step	3,820 / 30,000
Tokens processed	2,002,780,160
Latest train loss	3.0188
Best recorded validation loss	3.1166
Training throughput	~126k tokens/s
Model VRAM reported by trainer	11.65 GB
Prepared pretraining corpus	3.0M documents; 2.811B train tokens; 1.495M validation tokens
Latest numbered checkpoints	`ckpt_2500.pt`, `ckpt_3000.pt`, `ckpt_3500.pt`
Quality benchmarks	Pending; no final scores published

The snapshot will become stale as training continues. Live values are available in the dashboard and checkpoints/120m/status.json.
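
The snapshot figures can be cross-checked with simple arithmetic. The sketch below uses only numbers from the table and assumes that tokens per step and throughput stay constant for the rest of the run. The derived values are estimates, not measurements.

```python
step, total_steps = 3_820, 30_000
tokens_processed = 2_002_780_160
throughput_tok_s = 126_000        # "~126k tokens/s", so time estimates are approximate
train_tokens_in_corpus = 2.811e9

# The token count divides evenly by the step count: 524,288 (2**19) tokens per step.
tokens_per_step = tokens_processed // step
assert tokens_per_step * step == tokens_processed

planned_tokens = tokens_per_step * total_steps          # 15,728,640,000
remaining_tokens = planned_tokens - tokens_processed    # 13,725,859,840
eta_hours = remaining_tokens / throughput_tok_s / 3600  # roughly 30 hours
corpus_passes = planned_tokens / train_tokens_in_corpus # roughly 5.6 passes

print(f"{tokens_per_step:,} tokens/step, ~{eta_hours:.1f} h remaining, "
      f"~{corpus_passes:.1f} passes over the train split")
```

Under these assumptions, the full 30,000-step schedule would read the 2.811B-token train split several times, so the validation loss is worth watching for signs of overfitting.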

Evaluation

ForgeLM supports:

held-out token-level perplexity;
length-normalized multiple-choice accuracy;
inference throughput;
fixed-prompt generation samples;
a categorized assistant benchmark with an offline heuristic or optional LLM judge;
base-versus-SFT-versus-DPO comparisons.

Use immutable numbered checkpoints for comparisons. Do not run evaluation on the GPU that is running pretraining; it can reduce training throughput or exhaust VRAM. The complete schedule and command templates are in docs/BENCHMARK_PLAN.md.
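
Two of the metrics listed above reduce to short formulas over per-token log-probabilities. The sketch below is a minimal illustration with hypothetical function names. It assumes natural-log probabilities and normalization by token count, which may differ from the choices made in forgelm/eval.

```python
import math


def perplexity(token_logprobs: list[float]) -> float:
    """Token-level perplexity: exp of the mean negative log-likelihood."""
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


def pick_choice(choice_logprobs: list[list[float]]) -> int:
    """Return the index of the answer with the highest mean per-token log-probability."""
    scores = [sum(lp) / len(lp) for lp in choice_logprobs]
    return max(range(len(scores)), key=scores.__getitem__)
```

If the validation loss in the status snapshot (3.1166) is a mean per-token cross-entropy in nats, it corresponds to a perplexity of about 22.6. Length normalization matters for multiple choice because summed log-probabilities favor short answers: in `pick_choice([[-1.0] * 4, [-1.5] * 2])` the second option has the higher sum, but the first has the better per-token score and is selected.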

Repository map

forgelm/
  model/       transformer architecture and generation
  data/        tokenizer, source iteration, packing, memmap dataset
  train/       trainer, checkpoint manager, metrics logging
  sft/         supervised fine-tuning and DPO
  eval/        perplexity, multiple choice, throughput, assistant benchmark
  infer/       completion, chat, and streaming inference
  server/      FastAPI application, job manager, telemetry, web UI
  cli.py       forge command-line entry point
configs/       model, dataset, and source configurations
docs/          architecture, datasets, model guidance, and launch documentation
scripts/       smoke test, dataset fetch, and sweep utilities
tests/         model, data, training, evaluation, inference, and API tests

Documentation

Known limitations

The 120M model is still training and is not a finished release.
No final downstream or assistant-quality scores are published yet.
Models at this scale have limited knowledge, reasoning, context retention, and instruction-following capacity.
Training and evaluation currently assume a single process and are optimized for one NVIDIA GPU.
Checkpoints include optimizer state and are substantially larger than inference-only weights.
Data sources carry their own licenses, terms, biases, and quality limitations.

License

The Python package metadata declares MIT. Add a root LICENSE file before a public release so the repository contains the full license text.
