Linutesto/forgelm

ForgeLM

Train, evaluate, fine-tune, and serve compact language models on one GPU.

ForgeLM is an end-to-end small-language-model workshop built around a Llama-style decoder, a packed-data pipeline, resumable training, SFT, DPO, evaluation, inference, and a mobile-friendly web dashboard. It is designed for experiments that can be understood and operated by one person on a single consumer GPU.

Project status: active development. The 120M reference model is still training, and no model-quality results have been published yet. See Current status and MODEL_CARD.md.

Why ForgeLM

  • One coherent pipeline: tokenizer → packed dataset → pretraining → SFT → DPO → evaluation → chat.
  • Single-GPU operation: shipped model sizes range from roughly 32M to 571M parameters, with the 120M configuration as the recommended default.
  • Recoverable runs: background jobs, atomic checkpoints, explicit resume controls, and numbered checkpoints every 500 steps.
  • Visible training: live GPU telemetry, loss curves, throughput, learning rate, token count, VRAM, ETA, logs, and run controls.
  • Honest reporting: measured engineering data is separated from unfinished or unmeasured model-quality results.
  • Multiple interfaces: Python package, forge CLI, REST/SSE API, and a responsive web UI.

Screenshots

  • Dashboard (placeholder: live dashboard)
  • Training detail (placeholder: curves and metrics)
  • Checkpoint actions (placeholder: resume, SFT, and DPO)
  • Dataset pipeline (placeholder: tokenizer and preparation)
  • Model chat (placeholder: streaming chat)
  • Benchmark report (placeholder: evaluation results)

The capture checklist, filenames, framing, and redaction guidance are in docs/SCREENSHOTS.md.

Architecture

flowchart LR
    A[Text sources] --> B[Byte-level BPE tokenizer]
    A --> C[Packed uint16 token dataset]
    B --> C
    C --> D[Llama-style decoder pretraining]
    D --> E[Numbered checkpoints]
    E --> F[Evaluation]
    E --> G[Supervised fine-tuning]
    G --> H[Direct preference optimization]
    E --> I[Inference and chat]
    G --> I
    H --> I

    J[Web UI] --> K[FastAPI server]
    K --> L[Background job manager]
    L --> B
    L --> C
    L --> D
    L --> F
    L --> G
    L --> H
    K --> I

The model backbone uses:

  • decoder-only transformer blocks;
  • pre-normalization with RMSNorm;
  • rotary position embeddings;
  • grouped-query attention through PyTorch SDPA;
  • SwiGLU feed-forward layers;
  • tied input and output embeddings;
  • bf16 training, AdamW, gradient accumulation, warmup, and cosine decay;
  • KV-cached generation with streaming output.
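
To get a rough feel for how the components listed above add up to a model size, the following sketch counts parameters for a decoder with grouped-query attention, SwiGLU, RMSNorm, and tied embeddings. It assumes no bias terms, and every dimension in `DecoderConfig` is a hypothetical value chosen for illustration; none of them is taken from a shipped ForgeLM config.

```python
from dataclasses import dataclass


@dataclass
class DecoderConfig:
    # Hypothetical dimensions for illustration only, not a ForgeLM config.
    vocab_size: int = 32000
    d_model: int = 768
    n_layers: int = 12
    n_heads: int = 12
    n_kv_heads: int = 4  # grouped-query attention: fewer K/V heads than Q heads
    d_ff: int = 2048


def count_parameters(cfg: DecoderConfig) -> int:
    """Count weights in a bias-free Llama-style decoder with tied embeddings."""
    head_dim = cfg.d_model // cfg.n_heads
    kv_dim = cfg.n_kv_heads * head_dim
    # Q and O projections are d_model x d_model; K and V shrink under GQA.
    attention = 2 * cfg.d_model * cfg.d_model + 2 * cfg.d_model * kv_dim
    # SwiGLU uses three projections: gate, up, and down.
    feed_forward = 3 * cfg.d_model * cfg.d_ff
    # Two RMSNorm weight vectors per block (before attention, before the MLP).
    norms = 2 * cfg.d_model
    per_layer = attention + feed_forward + norms
    # Tied embeddings: the token matrix is shared with the output head.
    embedding = cfg.vocab_size * cfg.d_model
    final_norm = cfg.d_model
    return cfg.n_layers * per_layer + embedding + final_norm


total = count_parameters(DecoderConfig())
print(f"{total:,} parameters")
```

With these made-up dimensions the count comes to about 100M. Reducing `n_kv_heads` shrinks only the K and V projections, which is why grouped-query attention saves more KV-cache memory at inference time than it saves parameters.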

Long-running tasks execute as background CLI subprocesses. Metrics are written to JSONL, run state is persisted to JSON, and the dashboard polls the API without owning the training process. Closing or refreshing the browser does not stop a run.
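
The persistence pattern described above can be sketched with the standard library alone. This is a minimal illustration, not ForgeLM's actual code: the file names `metrics.jsonl` and `state.json`, the record fields, and the helper names are all assumptions, and the loss values are made up.

```python
import json
import os
import tempfile


def append_metric(path: str, record: dict) -> None:
    """Append one JSON object per line (JSONL)."""
    with open(path, "a", encoding="utf-8") as f:
        f.write(json.dumps(record) + "\n")


def write_state_atomically(path: str, state: dict) -> None:
    """Write JSON to a temp file, then rename it over the target."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as f:
            json.dump(state, f)
            f.flush()
            os.fsync(f.fileno())
        # os.replace is atomic when source and target share a filesystem,
        # so a reader sees either the old file or the new one, never half.
        os.replace(tmp_path, path)
    except BaseException:
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise


def read_metrics(path: str) -> list[dict]:
    """Read every non-empty JSONL line back into a list of dicts."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


with tempfile.TemporaryDirectory() as run_dir:
    metrics_path = os.path.join(run_dir, "metrics.jsonl")  # hypothetical name
    state_path = os.path.join(run_dir, "state.json")  # hypothetical name
    append_metric(metrics_path, {"step": 1, "loss": 4.2})  # made-up values
    append_metric(metrics_path, {"step": 2, "loss": 4.1})
    write_state_atomically(state_path, {"status": "running", "step": 2})
    metrics = read_metrics(metrics_path)
    with open(state_path, encoding="utf-8") as f:
        state = json.load(f)
```

Because the writer only appends lines and renames whole files, a polling reader such as a dashboard can open the files at any moment without coordinating with the training process.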

See docs/architecture.md for the design rationale.

Quickstart

Requirements

  • Python 3.10 or newer
  • PyTorch built for the installed CUDA stack
  • NVIDIA GPU recommended; the reference workflow targets a 24GB RTX 4090
  • NVIDIA Container Toolkit when using Docker

Local installation

git clone <your-repository-url> forgelm
cd forgelm

# Install the correct PyTorch build for your CUDA environment first.
pip install torch --index-url https://download.pytorch.org/whl/cu124
pip install -e .

# Validate the complete pipeline on synthetic data.
bash scripts/run_smoke.sh

# Start the dashboard and API.
forge serve --host 0.0.0.0 --port 8080

Open http://localhost:8080. If the machine is reachable through Tailscale or another private network, open http://<machine-address>:8080 from a phone or another computer.

Docker

docker compose up --build

Then open http://localhost:8080.

Train the 120M reference configuration

forge tokenizer \
  --sources configs/sources_fineweb_edu.json \
  --vocab 32000 \
  --out data/tokenizer.json

forge prepare \
  --sources configs/sources_fineweb_edu.json \
  --tokenizer data/tokenizer.json \
  --out data/tokenized/train

forge train --config configs/120m.yaml
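
As a rough illustration of what a packed uint16 token dataset looks like on disk, the sketch below concatenates tokenized documents into one flat binary file and reads a window back by offset. A 32,000-entry vocabulary fits in 16 bits, which is what makes this layout possible. The separator id, the little-endian byte order, and the function names are assumptions made for the example; the format that `forge prepare` actually writes may differ.

```python
import os
import sys
import tempfile
from array import array

EOS_ID = 0  # hypothetical end-of-document token id


def pack_documents(docs: list[list[int]], path: str) -> int:
    """Concatenate tokenized documents into one flat uint16 file."""
    flat = array("H")  # unsigned short
    assert flat.itemsize == 2, "this sketch needs a 2-byte unsigned short"
    for ids in docs:
        flat.extend(ids)  # raises OverflowError for ids outside 0..65535
        flat.append(EOS_ID)
    if sys.byteorder == "big":
        flat.byteswap()  # store little-endian regardless of host
    with open(path, "wb") as f:
        flat.tofile(f)
    return len(flat)


def read_window(path: str, start: int, length: int) -> list[int]:
    """Read `length` tokens starting at token index `start`."""
    window = array("H")
    with open(path, "rb") as f:
        f.seek(start * 2)  # two bytes per token
        window.frombytes(f.read(length * 2))
    if sys.byteorder == "big":
        window.byteswap()
    return window.tolist()


with tempfile.TemporaryDirectory() as data_dir:
    packed_path = os.path.join(data_dir, "train.bin")  # hypothetical name
    n_tokens = pack_documents([[5, 6, 7], [8, 9]], packed_path)
    window = read_window(packed_path, start=2, length=4)
```

Since every token occupies exactly two bytes, a training window is a plain byte range, so a loader can seek or memory-map to any position without parsing the file.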

The web UI exposes the same workflow and adds checkpoint selection, run controls, SFT/DPO launch forms, benchmarks, reports, playground inference, and chat.

Resume and fine-tune

Pretraining checkpoints are saved every 500 steps. Resume from the dashboard or with a config whose train.resume points to a checkpoint:

train:
  resume: checkpoints/120m/last.pt
  ckpt_interval: 500
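
When choosing a checkpoint to resume from by hand, the numbered files need to be compared by step rather than by name, because `ckpt_500.pt` sorts after `ckpt_3500.pt` alphabetically. The helper below is a hypothetical sketch that relies only on the `ckpt_<step>.pt` naming shown in this README; it is not part of the `forge` CLI.

```python
import re
import tempfile
from pathlib import Path
from typing import Optional

CKPT_PATTERN = re.compile(r"^ckpt_(\d+)\.pt$")


def latest_numbered_checkpoint(directory: str) -> Optional[Path]:
    """Return the ckpt_<step>.pt file with the highest step, if any."""
    best_step, best_path = -1, None
    for path in Path(directory).iterdir():
        match = CKPT_PATTERN.match(path.name)
        if match and int(match.group(1)) > best_step:
            best_step, best_path = int(match.group(1)), path
    return best_path


# Demo on empty placeholder files in a temporary directory.
with tempfile.TemporaryDirectory() as ckpt_dir:
    for name in ["ckpt_500.pt", "ckpt_1000.pt", "ckpt_3500.pt", "last.pt"]:
        Path(ckpt_dir, name).touch()
    newest = latest_numbered_checkpoint(ckpt_dir)
    newest_name = newest.name if newest else None
```

Files such as `last.pt` are ignored by the pattern, so the result is always an immutable numbered checkpoint.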

Instruction-tune any compatible checkpoint:

forge sft \
  --base checkpoints/120m/best.pt \
  --data data/sft/train.jsonl \
  --tokenizer data/tokenizer.json \
  --out checkpoints/120m-sft

Optionally apply DPO after SFT:

forge dpo \
  --sft checkpoints/120m-sft/best.pt \
  --data data/dpo/train.jsonl \
  --tokenizer data/tokenizer.json \
  --out checkpoints/120m-dpo
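
For readers unfamiliar with DPO, the standard per-pair objective can be written in a few lines. This sketch works on already-summed sequence log-probabilities and is not ForgeLM's implementation; the `beta` value is a hypothetical default, and treating the SFT checkpoint as the frozen reference model is an assumption.

```python
import math


def log_sigmoid(x: float) -> float:
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def dpo_loss(
    policy_chosen: float,
    policy_rejected: float,
    ref_chosen: float,
    ref_rejected: float,
    beta: float = 0.1,  # hypothetical value, not a ForgeLM default
) -> float:
    """DPO loss for one preference pair, given sequence log-probabilities."""
    # How much more the policy favors each response than the reference does.
    chosen_gain = policy_chosen - ref_chosen
    rejected_gain = policy_rejected - ref_rejected
    return -log_sigmoid(beta * (chosen_gain - rejected_gain))


# At initialization the policy equals the reference, so the loss is log(2).
initial_loss = dpo_loss(-10.0, -12.0, -10.0, -12.0)
# Raising the chosen log-probability relative to the reference lowers the loss.
improved_loss = dpo_loss(-8.0, -12.0, -10.0, -12.0)
```

The loss depends only on how far the policy has moved from the reference on each response, which is why a frozen copy of the starting checkpoint has to be evaluated alongside the model being trained.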

Current status

Snapshot captured June 22, 2026. These are operational measurements, not final release claims:

Item                             Status
120M pretraining                 Active
Training step                    3,820 / 30,000
Tokens processed                 2,002,780,160
Latest train loss                3.0188
Best recorded validation loss    3.1166
Training throughput              ~126k tokens/s
Model VRAM reported by trainer   11.65GB
Prepared pretraining corpus      3.0M documents; 2.811B train tokens; 1.495M validation tokens
Latest numbered checkpoints      ckpt_2500.pt, ckpt_3000.pt, ckpt_3500.pt
Quality benchmarks               Pending; no final scores published

The snapshot will become stale as training continues. Live values are available in the dashboard and checkpoints/120m/status.json.
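
The snapshot figures are enough to derive the tokens consumed per optimizer step and a rough time remaining. The arithmetic below uses only the numbers in the table and assumes throughput stays near the reported ~126k tokens/s, which it may not.

```python
# Values copied from the June 22, 2026 snapshot.
step, total_steps = 3_820, 30_000
tokens_processed = 2_002_780_160
throughput = 126_000  # tokens/s, approximate

# Tokens per optimizer step (micro-batch x sequence length x accumulation).
tokens_per_step = tokens_processed // step
remaining_tokens = (total_steps - step) * tokens_per_step
eta_hours = remaining_tokens / throughput / 3600

print(f"{tokens_per_step:,} tokens per step")
print(f"{remaining_tokens:,} tokens remaining")
print(f"about {eta_hours:.0f} hours remaining")
```

The division is exact and gives 524,288 (2^19) tokens per step, which puts the remaining work at roughly 30 hours at the snapshot throughput.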

Evaluation

ForgeLM supports:

  • held-out token-level perplexity;
  • length-normalized multiple-choice accuracy;
  • inference throughput;
  • fixed-prompt generation samples;
  • a categorized assistant benchmark with an offline heuristic or optional LLM judge;
  • base-versus-SFT-versus-DPO comparisons.
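
Two of the metrics above reduce to short formulas. The sketch below shows token-level perplexity as the exponential of the mean negative log-likelihood, and length-normalized multiple choice as picking the option with the highest mean per-token log-probability. The function names and the toy numbers are illustrative assumptions, not ForgeLM's evaluation code.

```python
import math


def perplexity(token_nlls: list[float]) -> float:
    """exp(mean negative log-likelihood) over held-out tokens."""
    return math.exp(sum(token_nlls) / len(token_nlls))


def pick_choice(choice_logprobs: list[list[float]]) -> int:
    """Index of the option with the highest mean per-token log-probability."""
    scores = [sum(lp) / len(lp) for lp in choice_logprobs]
    return max(range(len(scores)), key=scores.__getitem__)


# A model that assigns probability 1/4 to every token has perplexity 4.
uniform_ppl = perplexity([math.log(4)] * 3)

# Toy per-token log-probabilities for two answer options of different length.
options = [
    [-1.0, -1.0],  # short option: total -2.0, mean -1.0
    [-0.6, -0.6, -0.6, -0.6],  # long option: total -2.4, mean -0.6
]
best_by_total = max(range(len(options)), key=lambda i: sum(options[i]))
best_by_mean = pick_choice(options)
```

In the toy example the raw total favors the short option while the per-token mean favors the long one, which is the bias that length normalization is meant to remove.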

Use immutable numbered checkpoints for comparisons. Do not run evaluation on the GPU that is running pretraining; it can reduce training throughput or exhaust VRAM. The complete schedule and command templates are in docs/BENCHMARK_PLAN.md.

Repository map

forgelm/
  model/       transformer architecture and generation
  data/        tokenizer, source iteration, packing, memmap dataset
  train/       trainer, checkpoint manager, metrics logging
  sft/         supervised fine-tuning and DPO
  eval/        perplexity, multiple choice, throughput, assistant benchmark
  infer/       completion, chat, and streaming inference
  server/      FastAPI application, job manager, telemetry, web UI
  cli.py       forge command-line entry point
configs/       model, dataset, and source configurations
docs/          architecture, datasets, model guidance, and launch documentation
scripts/       smoke test, dataset fetch, and sweep utilities
tests/         model, data, training, evaluation, inference, and API tests

Documentation

Known limitations

  • The 120M model is still training and is not a finished release.
  • No final downstream or assistant-quality scores are published yet.
  • Models at this scale have limited knowledge, reasoning, context retention, and instruction-following capacity.
  • Training and evaluation currently assume a single process and are optimized for one NVIDIA GPU.
  • Checkpoints include optimizer state and are substantially larger than inference-only weights.
  • Data sources carry their own licenses, terms, biases, and quality limitations.

License

The Python package metadata declares MIT. Add a root LICENSE file before a public release so the repository contains the full license text.

About

ForgeLM — local LLM training system for a single RTX 4090 (30M–500M models, web UI, Docker)
