An LLM Pretraining Framework Built End-to-End by an Autonomous Agent Loop + a matching Harness scaffolding (coming soon)
🤖 100% AI-Authored · 🚀 44.13% MFU on H100 · 📈 +10% over Megatron-LM · ✅ Production-Validated
An LLM pretraining framework written end-to-end by an AI Agent Loop with zero human edits — plus the Harness that produced the pretraining framework (coming soon).
Current release: v0.1.0 (NVIDIA H100 · MiniCPM4-0.5B / MiniCPM4-8B training frameworks; matching Harness coming soon)
- 🤖 100% Agent-Loop Authored — the entire framework produced by an AI Agent running in auto-loop mode, with zero manual edits
- 🔄 Self-Diagnosing Agent Loop — read reference → implement → launch job → parse logs → root-cause → patch → pass gate → commit, fully autonomous
- 🚀 44.13% MFU on H100 — ~10% above the Megatron-LM baseline (~40%), validated on 64× H100 with BF16, DP-only
- ✅ Production-Validated — MiniCPM4-0.5B fully pretrained, real model weights produced (not a demo)
- 🛠️ GEMM + Attention kernels authored by the agent loop — per-op MFU up to 90%; FlashAttention written from scratch, outperforms Transformer Engine / FA3, on par with FA4
- Reproduction live demo
- Huawei MiniCPM5-1B training framework
- Training framework self-generates the Harness scaffolding
| Feature | ForgeTrain | Megatron-LM |
|---|---|---|
| MFU on H100 (MiniCPM4-0.5B, BF16, DP) | 44.13% | ~40% |
| 100% AI-Authored Code | ✅ | ❌ |
| CuTeDSL custom GEMMs (AOT C-export) | ✅ (5 GEMMs) | ❌ |
| Custom FlashAttention (on par with FA4) | ✅ (self-built CuTeDSL impl) | ❌ (uses upstream TE / FA) |
| Checkpoint → HuggingFace export | ✅ (one script) | Manual |
Also supports CUDA Graph, Triton fused kernels, and comm-compute overlap out of the box.
Comparison based on Megatron-LM v0.15 on the same hardware (H100, SM90). ForgeTrain v0.1.0 is scoped to MiniCPM4-0.5B (DP-only) and MiniCPM4-8B (TP=2) × BF16; Megatron-LM supports broader model families and parallelism strategies.
- 📌 [2026-05] ForgeTrain v0.1.0 released — first public release of the training engine; the Harness that produced it is coming soon. MiniCPM4-0.5B pretrained on 64× H100, achieving 44.13% MFU.
- Highlights
- Roadmap
- Feature Comparison
- News
- Agent-Friendly Quick Deploy
- Repository Layout
- Quick Start
- Core Technology
- Performance
- Contributing
- License
- Acknowledgments
- Citation
This repo was produced by an AI Agent and is easiest to operate through one. Paste the prompt below into Cursor / Claude Code / Codex / Cline: the agent will read the README, install dependencies, run the smoke test, and report the MFU, so you do not have to type commands one at a time.
🟢 5-step minimal pretraining demo (paste into your Coding Agent)
Following this project's exports/train_engine_0.5B/README.md,
run a 5-step minimal pretraining demo on the current node:
1. Check the environment (Python ≥ 3.11, CUDA 12.x, H100, PyTorch ≥ 2.4)
and install anything missing;
2. Install the repo: pip install -e . and HF deps: pip install datasets transformers;
3. Import smoke test:
PYTHONPATH=src python -c "from training_engine_tensor import config; print('OK')"
4. Run 5 steps on HF GSM8K:
torchrun --standalone --nproc-per-node=1 \
-m training_engine_tensor pretrain \
--num-steps 5 --global-batch-size 1 --micro-batch-size 1 \
--seq-length 4096 \
--hf-dataset openai/gsm8k --hf-dataset-config main \
--hf-text-template "Question: {question}\nAnswer: {answer}" \
--tokenizer-path openbmb/MiniCPM4-0.5B \
--save-dir ./checkpoints/demo
5. Print the final loss, step time, and MFU.
If anything fails, dig into the source on your own — do not ask me.
Full single-node 8× H100 and multi-node commands are in the Quick Start section below.
This repo bundles a family of subprojects in a strict producer / product relationship:
| Subdirectory | Role |
|---|---|
| harness/ (coming soon) | Harness: the scaffolding that drives an Agent Loop to autonomously build a training framework |
| exports/train_engine_0.5B/ | TrainingEngine (0.5B): produced end-to-end by harness/ (coming soon); targets MiniCPM4-0.5B at 44.13% MFU on 8× H100 |
| exports/train_engine_8b/ | TrainingEngine (8B): also produced by harness/ (coming soon); targets MiniCPM4-8B with TP=2 / DP=4 at 50.9% MFU on a single 8× H100 host |
harness/  ──(bash agent-loop.sh, zero human input)──▶  exports/train_engine_0.5B/
Harness (coming soon)                                  exports/train_engine_8b/
producer (gates + prompts + control plane)             product (a runnable training framework)
Each subdirectory has its own README with full CLI docs, config reference, layout, performance baselines, and limitations.
Environment: Python ≥ 3.11 · CUDA 12.x · PyTorch ≥ 2.4 · NVIDIA H100 80GB (SM90). Full pretraining requires 8× H100; early alignment stages run on a single GPU.
Use the training framework directly → exports/train_engine_0.5B/
Goal: take the ready-made framework and run pretraining on your H100s.
git clone https://github.com/OpenBMB/ForgeTrain.git
cd ForgeTrain/exports/train_engine_0.5B
pip install -e .
pip install datasets transformers # HuggingFace data path (required)

PYTHONPATH=src python -c "from training_engine_tensor import config; print('OK')"

Expected output: OK

PYTHONPATH=src CUSTOM_GEMM=1 OP_ATTENTION=v1 \
python scripts/precompile_ops.py

Warms up AOT export + cpp_extension builds for the 5 CuTeDSL GEMMs, persisting under ${ENGINE_ROOT}/.persist_cache/. Subsequent jobs reuse the cache; only a few seconds of dlopen cost remains.

torchrun --standalone --nproc-per-node=8 \
-m training_engine_tensor pretrain \
--num-steps 200 \
--global-batch-size 1280 --micro-batch-size 10 \
--seq-length 4096 \
--hf-dataset openai/gsm8k \
--hf-dataset-config main \
--hf-text-template "Question: {question}\nAnswer: {answer}" \
--tokenizer-path openbmb/MiniCPM4-0.5B \
--save-dir ./checkpoints/run1

--hf-dataset accepts a HuggingFace Hub name (e.g. openai/gsm8k) or a local Parquet / Arrow / JSON directory.
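
To illustrate how a text template like the one passed to `--hf-text-template` can map dataset columns to training text, here is a minimal sketch. It assumes the template is filled with Python `str.format`-style placeholders named after dataset columns (the `main` config of `openai/gsm8k` has `question` and `answer` columns), and that a literal `\n` from the shell is turned into a newline. The `render_row` helper and the sample row are hypothetical; the engine's actual rendering code may differ.

```python
# Hypothetical helper, NOT the engine's real code: shows str.format-style
# template rendering over a single dataset row.

def render_row(template: str, row: dict) -> str:
    # Inside a double-quoted shell string, "\n" arrives as backslash + "n".
    # Assumption: the consumer converts it to a real newline before formatting.
    template = template.replace("\\n", "\n")
    return template.format(**row)

# Made-up row laid out like a GSM8K record (columns: question, answer).
row = {
    "question": "Tom has 3 apples and buys 2 more. How many apples does he have?",
    "answer": "3 + 2 = 5\n#### 5",
}

text = render_row("Question: {question}\\nAnswer: {answer}", row)
print(text)
```

A template that names a column missing from the row raises `KeyError`, which is a quick way to catch a mismatch between the template and the dataset schema before a long job starts.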
Expected output (200 steps on 8× H100)
Each step logs a line like:
[STEP 200] loss=X.XXX | step_time=XXXms | mfu=44.XX%
On 8× H100 with BF16, expect MFU ~44% and step time ~XXXms for GBS=1280 / MBS=10 / seq=4096.
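
For orientation, the batch flags in the command above imply a fixed amount of gradient accumulation and a fixed token count per optimizer step. The sketch below works through that arithmetic under the usual data-parallel convention (global batch = micro batch × DP ranks × accumulation steps). The function name is illustrative, and the engine may account for batches differently.

```python
def batch_plan(global_batch: int, micro_batch: int, dp_ranks: int, seq_len: int) -> dict:
    """Derive gradient-accumulation steps and tokens per optimizer step."""
    per_pass = micro_batch * dp_ranks  # sequences consumed per forward/backward pass
    if global_batch % per_pass != 0:
        raise ValueError("global batch must be divisible by micro_batch * dp_ranks")
    return {
        "grad_accum_steps": global_batch // per_pass,
        "tokens_per_step": global_batch * seq_len,
    }

# GBS=1280, MBS=10, 8 data-parallel ranks, seq=4096 (values from the command above).
plan = batch_plan(global_batch=1280, micro_batch=10, dp_ranks=8, seq_len=4096)
print(plan)  # {'grad_accum_steps': 16, 'tokens_per_step': 5242880}
```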
Full documentation, including multi-node training, checkpoint resume, and HuggingFace export → exports/train_engine_0.5B/README.md · exports/train_engine_8b/README.md
- Zero-touch Agent Loop: `bash agent-loop.sh` runs a Coding Agent in a loop against the prompt files, with no human in the loop
- Two-stage gate-driven convergence:
  - Stage 1 (M1-M6): bitwise forward / backward alignment → DP=8 multi-step training → long-train statistical gate (loss rel diff < 1%, MFU(standard) ≥ 36%)
  - Stage 2: per-operator CUDA kernel optimization, 30 rounds per op with best-MFU election, with a DP=8 long-train integration gate on every merge
- Portable to any reference training stack: Megatron-LM v0.15 / torch in this reproduction, but any working stack (DeepSpeed, custom, …) honoring the same env contract drops in unchanged
- CUDA Graph × 5 capture granularities: forward / step / step_full / step_optimizer / step_nccl_opt, freely composable with BucketedGradReducer / sharded optimizer / wgrad-overlap
- Triton fused kernels: one kernel each for CE fwd+bwd / SwiGLU / RMSNorm+residual / RoPE / fused Adam+param sync
- Self-explored optimization space: the Agent enumerated and benchmarked CuTeDSL / cuBLAS / Triton / TransformerEngine operator variants and dozens of comm + CUDA-Graph capture combinations in real distributed jobs, scoring both MFU and loss alignment; production defaults are the optimum the Agent picked
- MiniCPM4-8B on a single 8× H100 host: `tensor_model_parallel_size=2`, 50.9% MFU, ~8% above the Megatron-LM baseline
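
The Stage 1 long-train gate quoted above (loss rel diff < 1%, MFU ≥ 36%) can be pictured as a check over training logs. The sketch below assumes the `[STEP n] loss=... | step_time=...ms | mfu=...%` line format shown in the expected output, and assumes the gate compares means over the last few steps. That aggregation rule, the function names, and the sample log lines are all hypothetical; the real gate lives in the Harness, which is not yet released, and its definition of MFU(standard) may differ.

```python
import re
from statistics import mean

# Matches lines such as: [STEP 200] loss=2.345 | step_time=900ms | mfu=44.10%
STEP_RE = re.compile(
    r"\[STEP (\d+)\] loss=([\d.]+) \| step_time=([\d.]+)ms \| mfu=([\d.]+)%"
)

def parse_steps(log_text: str) -> list[dict]:
    """Extract step index, loss, step time (ms) and MFU (%) from log text."""
    return [
        {"step": int(s), "loss": float(l), "step_time_ms": float(t), "mfu": float(m)}
        for s, l, t, m in STEP_RE.findall(log_text)
    ]

def long_train_gate(candidate, reference, tail=2, max_rel_diff=0.01, min_mfu=36.0):
    """Pass if tail-mean loss is within 1% of the reference and tail-mean MFU >= 36%.

    Assumption: averaging over the last `tail` steps is an illustrative choice.
    """
    cand_loss = mean(r["loss"] for r in candidate[-tail:])
    ref_loss = mean(r["loss"] for r in reference[-tail:])
    rel_diff = abs(cand_loss - ref_loss) / ref_loss
    mfu = mean(r["mfu"] for r in candidate[-tail:])
    return rel_diff < max_rel_diff and mfu >= min_mfu

# Made-up log lines, for illustration only (not measurements).
cand = parse_steps(
    "[STEP 1] loss=2.510 | step_time=900ms | mfu=41.20%\n"
    "[STEP 2] loss=2.490 | step_time=890ms | mfu=41.80%\n"
)
ref = parse_steps(
    "[STEP 1] loss=2.500 | step_time=990ms | mfu=37.00%\n"
    "[STEP 2] loss=2.480 | step_time=985ms | mfu=37.30%\n"
)
print(long_train_gate(cand, ref))  # True
```

Keeping the loss check relative to a reference run, rather than to a fixed number, is what lets the same gate be reused when the reference stack changes.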
| Metric | ForgeTrain | Megatron-LM (baseline) |
|---|---|---|
| MFU | 44.13% | ~40% |
| MFU improvement (relative) | +10% | n/a |
Test conditions: MiniCPM4-0.5B · 64× H100 · BF16 · DP-only.
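
To make the table concrete: MFU is the ratio of achieved model FLOP/s to the hardware's peak FLOP/s. The sketch below uses the common `6 × parameters × tokens` approximation for training FLOPs (which ignores the attention term) and assumes a dense BF16 peak of 989 TFLOP/s per H100 SXM. Both are assumptions; the formula behind this repo's MFU(standard) is not given here and may differ. The numbers in the `estimate_mfu` call are hypothetical round values, not measurements.

```python
H100_BF16_PEAK_FLOPS = 989e12  # assumption: dense BF16 peak per H100 SXM

def estimate_mfu(n_params, tokens_per_step, step_time_s, n_gpus,
                 peak=H100_BF16_PEAK_FLOPS):
    """Approximate MFU as (6 * N * tokens / step_time) / (n_gpus * peak)."""
    achieved_flops_per_s = 6.0 * n_params * tokens_per_step / step_time_s
    return achieved_flops_per_s / (n_gpus * peak)

def relative_gain(new, baseline):
    """Relative improvement of `new` over `baseline` (0.10 means +10%)."""
    return (new - baseline) / baseline

# Relative improvement implied by the table: 44.13% vs ~40%.
print(f"{relative_gain(44.13, 40.0):.1%}")  # 10.3%

# Hypothetical inputs to show the call shape only.
print(f"{estimate_mfu(n_params=1e9, tokens_per_step=1e6, step_time_s=1.0, n_gpus=16):.1%}")
```

Note that the +10% in the table is a relative gain; in absolute terms the difference is about 4 MFU percentage points.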
Full performance guide → exports/train_engine_0.5B/README.md · exports/train_engine_8b/README.md
Contributions are welcome! Here are some ways to help:
- 🐛 Bug reports: open an issue
- 💡 Feature requests: open an issue with `[feature]` in the title
- 📝 Reproducibility reports: share your experience reproducing the Agent Loop on different setups
- 🔧 Pull requests: code improvements, documentation fixes, and new model support
Licensed under the Apache License 2.0.
The vendored exports/train_engine_0.5B/src/quack/ and exports/train_engine_8b/src/quack/ snapshots retain their upstream copyright headers; see exports/train_engine_0.5B/src/quack/NOTICE.md and exports/train_engine_8b/src/quack/NOTICE.md for provenance. Built on a reference training stack (Megatron-LM v0.15 in this reproduction) and the Cursor Coding Agent; data and tokenizers follow MiniCPM4-0.5B / MiniCPM4-8B upstream.
ForgeTrain builds on the work of several outstanding open-source projects:
- CUTLASS / CuTeDSL — CuTeDSL GEMMs and helper utilities
- FlashAttention-4 (Dao-AILab) — FA4 CuTeDSL SM90 attention kernels
- TransformerEngine (NVIDIA) — reference operator implementations
- Megatron-LM (NVIDIA) — reference training stack for gate verification
- Cursor — the Coding Agent that authored the engine
- MiniCPM4 (OpenBMB) — target model architecture and tokenizer
If you find this project useful, please consider citing:
@software{forgetrain_2026,
title = {ForgeTrain: An LLM Pretraining Framework Built End-to-End by an Autonomous Agent Loop},
year = {2026},
url = {https://github.com/OpenBMB/ForgeTrain}
}

| Item | Requirement |
|---|---|
| GPU | NVIDIA H100 80GB (SM90, Hopper) |
| GPU count | 8× H100 for full gates / pretraining; 1× H100 for early alignment |
| CUDA | 12.x |
| PyTorch | ≥ 2.4 |
| Python | ≥ 3.11 |
| Validated scope | MiniCPM4-0.5B (DP-only) / MiniCPM4-8B (TP=2) × H100 × BF16 |
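
The requirements in the table above can be checked before launching a job. The following is a minimal sketch, assuming PyTorch is the only dependency worth probing and that an SM90 device reports compute capability `(9, 0)`. The helper names are illustrative and are not part of the repository.

```python
import re
import sys

def version_tuple(text: str) -> tuple:
    """'2.4.0+cu121' -> (2, 4, 0); local/build suffixes are ignored."""
    return tuple(int(p) for p in re.match(r"\d+(?:\.\d+)*", text).group().split("."))

def check_environment() -> list:
    """Return a list of problems; an empty list means the requirements are met."""
    problems = []
    if sys.version_info < (3, 11):
        problems.append("Python >= 3.11 required")
    try:
        import torch
    except ImportError:
        return problems + ["PyTorch >= 2.4 not installed"]
    if version_tuple(torch.__version__) < (2, 4):
        problems.append("PyTorch >= 2.4 required")
    # torch.version.cuda is None on CPU-only builds.
    if torch.version.cuda is None or version_tuple(torch.version.cuda)[0] != 12:
        problems.append("CUDA 12.x build of PyTorch required")
    if not torch.cuda.is_available() or torch.cuda.get_device_capability(0) != (9, 0):
        problems.append("No SM90 (Hopper) GPU detected")
    return problems

if __name__ == "__main__":
    print(check_environment() or "environment OK")
```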
