A local-first project to train a small GPT-style language model on WikiText-2 with:
- CUDA-first device selection (CPU fallback)
- Background training runs
- Status tracking
- Resume training for more epochs
- Prompt-based text generation
- Textual TUI monitoring
- Train on WikiText-2 (`datasets`)
- Tokenize + dataloader pipeline
- GPT-style model in PyTorch
- Checkpointing (`latest.pt` + periodic epoch files)
- Background run mode + run metadata in `runs/<run_id>/`
- Status command for quick run inspection
- Live GPU telemetry in `status` + TUI (`n/a`-safe on CPU/no-NVML)
- Resume from checkpoint with `--more-epochs`
- Generate text from trained checkpoints
- TUI monitor + actions via Textual (start/resume/generate)
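The CUDA-first selection with CPU fallback can be sketched as a small resolver. This is a hypothetical helper, not the project's actual API; `resolve_device` and its arguments are illustrative names:

```python
def resolve_device(preference: str, cuda_available: bool) -> str:
    """Hypothetical resolver for the CUDA-first / CPU-fallback policy.

    `preference` mirrors the documented values: "auto", "cpu", "cuda",
    "cuda:N", or a GPU name hint such as "A30".
    """
    if preference == "cpu":
        return "cpu"
    if not cuda_available:
        return "cpu"  # non-strict fallback for every non-CPU preference
    if preference.startswith("cuda"):
        return preference  # "cuda" or an explicit "cuda:N" index
    return "cuda"  # "auto" or a GPU name hint resolves to the default device
```

With `device.strict` enabled, a real implementation would raise instead of silently falling back to CPU.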
OpenClawTest/
├── configs/
│ └── default.toml
├── src/
│ └── llm_trainer/
│ ├── cli.py
│ ├── data.py
│ ├── tokenization.py
│ ├── dataloader.py
│ ├── model.py
│ ├── trainer.py
│ ├── background.py
│ └── run_metadata.py
├── tests/
├── data/ # local dataset artifacts (gitignored)
├── checkpoints/ # model checkpoints (gitignored)
├── runs/ # run state/logs (gitignored)
└── TASKS.md
- Python 3.11+
- Optional CUDA-capable GPU for acceleration
cd /home/theo/OpenClawTest
python -m venv .venv
source .venv/bin/activate
pip install -U pip
pip install -e .
pip install torch datasets textual

If `llm-trainer` is not found, use module mode:

PYTHONPATH=src python -m llm_trainer --help

Default config: `configs/default.toml`
Key fields:
- `training.epochs` (default: 3)
- `training.batch_size`
- `training.seq_length`
- `training.learning_rate`
- `training.precision` (`off`/`fp16`/`bf16`)
- `training.grad_accum_steps`
- `training.dataloader_workers`
- `training.dataloader_prefetch_factor`
- `training.dataloader_pin_memory`
- `data.dataset_name` (`wikitext-2`)
- `device.preference` (`auto`, `cuda`, `cuda:N`, or a GPU hint like `A30`)
- `device.strict` (`false` by default)
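The fields above map onto a TOML layout roughly like this sketch. Only `epochs = 3`, `precision = "off"`, `dataset_name = "wikitext-2"`, `preference = "auto"`, and `strict = false` are documented defaults; the other values are illustrative — check `configs/default.toml` for the real ones:

```toml
[training]
epochs = 3
batch_size = 32            # illustrative value
seq_length = 256           # illustrative value
learning_rate = 3e-4       # illustrative value
precision = "off"          # off | fp16 | bf16
grad_accum_steps = 1
dataloader_workers = 4
dataloader_prefetch_factor = 2
dataloader_pin_memory = true

[data]
dataset_name = "wikitext-2"

[device]
preference = "auto"        # auto, cuda, cuda:N, or a GPU hint like "A30"
strict = false
```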
PYTHONPATH=src llm-trainer train --config configs/default.toml
# Optional overrides:
# --device auto|cpu|cuda|cuda:0|A30
# --strict-device
# --precision off|fp16|bf16
# --grad-accum-steps 2
# --dataloader-workers 4 --dataloader-prefetch-factor 2 --dataloader-pin-memory

This creates a run ID and writes run metadata under `runs/<run_id>/`.
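`training.seq_length` controls how the tokenized corpus is cut into training examples. A minimal sketch of the usual next-token chunking — a hypothetical helper, not the project's `dataloader.py`:

```python
def make_examples(token_ids: list[int], seq_length: int) -> list[tuple[list[int], list[int]]]:
    """Cut a flat token stream into non-overlapping (input, target) pairs.

    Targets are the inputs shifted one position right: the standard
    next-token-prediction setup for GPT-style models.
    """
    examples = []
    for start in range(0, len(token_ids) - seq_length, seq_length):
        window = token_ids[start : start + seq_length + 1]
        examples.append((window[:-1], window[1:]))
    return examples
```

Each pair then becomes one row of a batch; `training.batch_size` rows are stacked per step.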
PYTHONPATH=src llm-trainer status
# or
PYTHONPATH=src llm-trainer status --run-id <run_id>
# includes GPU util/memory/temp/power when available

PYTHONPATH=src llm-trainer resume --run-id <run_id> --more-epochs 3

PYTHONPATH=src llm-trainer generate \
--run-id <run_id> \
--prompt "Once upon a time" \
--max-new-tokens 80 \
--temperature 0.8 \
--top-k 40

PYTHONPATH=src llm-trainer tui
# keys: n start new run | u resume selected | g generate from selected
# tuning shortcuts: [ / ] epochs, d device cycle, p prompt cycle

- Run state: `runs/<run_id>/state.json`
- Run metadata: `runs/<run_id>/meta.json`
- Worker logs: `runs/<run_id>/worker.log`
- Training logs: `runs/<run_id>/train.log`
- Checkpoints: `checkpoints/<run_id>/latest.pt`, `epoch-<n>.pt`
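The `--temperature` and `--top-k` flags on `generate` are standard sampling controls: logits are divided by the temperature, all but the `k` highest are discarded, and the next token is drawn from the renormalized distribution. A self-contained sketch of that scheme — not the project's actual generation code:

```python
import math
import random

def sample_next(logits: list[float], temperature: float = 0.8, top_k: int = 40) -> int:
    """Draw one token id from temperature-scaled, top-k-filtered logits."""
    scaled = [x / temperature for x in logits]
    # Keep only the top_k highest-scoring token ids.
    candidates = sorted(range(len(scaled)), key=lambda i: scaled[i], reverse=True)[:top_k]
    peak = max(scaled[i] for i in candidates)
    # Softmax numerators, shifted by the max for numerical stability.
    weights = [math.exp(scaled[i] - peak) for i in candidates]
    return random.choices(candidates, weights=weights, k=1)[0]
```

Lower temperature sharpens the distribution; `top_k=1` degenerates to greedy decoding.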
Use this command before committing:
PYTHONPATH=src ./.venv/bin/ruff check . && PYTHONPATH=src ./.venv/bin/pytest -q

- `--more-epochs` is supported on resume, not on train.
- Data/checkpoints/runs are intentionally gitignored.
- CUDA is used automatically when available; otherwise CPU is used.
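The GPU telemetry in `status` degrades to `n/a` on CPU-only or NVML-less machines. One way to get that behavior is a catch-all wrapper like the following sketch; `safe_gpu_stats` is a hypothetical name, not the project's code:

```python
def safe_gpu_stats(query) -> dict:
    """Return GPU telemetry from `query()`, or "n/a" placeholders on failure.

    `query` would normally call into NVML (e.g. via pynvml); any exception,
    including an absent driver or library, yields the CPU-safe fallback.
    """
    try:
        return query()
    except Exception:
        return {"util": "n/a", "memory": "n/a", "temp": "n/a", "power": "n/a"}
```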
See TASKS.md for completed tasks and upcoming work (including ETA timer and improved TUI).