An execution-gated framework that forces AI to ground its work in reality before it ships anything.
What it is · How it works · EXECUTE · Demos · Quickstart · Full docs
Status: pre-release. The architecture is locked. Public packaging is being rebuilt against the Claude Code plugin layout (see framework/FORGER_FLAWS.md for the open list).
You ask an AI to design a system. It scrapes a snippet, fills the gaps from memory, and writes a beautiful architecture with citations. The APIs were renamed two years ago. The benchmarks were retracted. Three days later you are still chasing a hallucination wrapped in confidence. The output passed every soft review because it was internally coherent. It just did not match the world.
FORGER exists to stop that specific failure mode. It is a harness around the agent, not a smarter agent. The harness blocks completion until the work has touched the real internet, survived an adversarial reviewer from a different model family, and actually run.
FORGER is a seven-phase pipeline that runs on top of an LLM agent (currently Claude Code; the plugin layout is portable). Each phase has a deterministic gate. You cannot skip phases. You cannot self-grade. The only way to get an artifact out of FORGER is to satisfy every gate.
The three things FORGER guarantees on a successful run:
| Guarantee | Mechanism |
|---|---|
| Grounded in live sources | playwright-cli + cloakbrowser open real pages; no search-API snippets |
| Reviewed by an outsider | The GRILL phase invokes a model from a different family that tries to break the proposal |
| Ran | EXECUTE refuses to mark the task done until the artifact has been executed and acceptance tests pass |
The three things FORGER explicitly is not:
- A correctness oracle. It catches the failure modes it is built for. It will not catch every bug.
- A creativity suppressor. Speculative ideas are allowed; they live in a separate tier and cannot reach production without explicit promotion.
- A lightweight wrapper for trivial tasks. Use `forger init --mode quick` for those, or skip the framework entirely.
If any of these sound familiar, FORGER is meant for you:
- You have asked an agent to build something non-trivial and watched it ship a hallucinated architecture.
- You spend more time verifying agent output than you would have spent writing the code.
- You want to run agents overnight and trust the artifact in the morning.
- You are doing research synthesis and need every claim to trace to a real source.
If you are pair-programming small refactors with an agent, FORGER is overkill. Use it when the cost of being subtly wrong is higher than the cost of running a longer pipeline.
FORGER v2 pipeline:

CONTRACT → FIND → OBSERVE → RECOMBINE → GRILL → (gate: hypotheses resolved) → EXECUTE → RETAIN

- CONTRACT: clarify intent, Socratic
- FIND: ground in live sources
- OBSERVE: internal probes, risk map
- RECOMBINE: tiered blending + firewall
- GRILL: cross-model attack
- EXECUTE: build, run, prove
- RETAIN: persist wins + failures

Single Source of Truth: Definition of Works (YAML, produced in CONTRACT)
Each phase resolves one question:
| Phase | The question it answers |
|---|---|
| 0. CONTRACT | What are we actually building, and how will we know it works? |
| 1. FIND | What does the world already know about this, in writing, today? |
| 2. OBSERVE | Which of our beliefs survive a real test? |
| 3. RECOMBINE | What is the best idea we can assemble from verified parts? |
| 4. GRILL | Why might this idea be wrong? Can another model break it? |
| 5. EXECUTE | Does the artifact actually run and pass the acceptance suite? |
| 6. RETAIN | What did we learn that future runs in this domain can skip? |
For the full philosophy and per-phase reference, see docs/02-how-it-works.md and docs/03-phases.md.
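To make the gating idea concrete, here is a minimal Python sketch of a strictly ordered pipeline in which each phase must pass its gate before the next one starts. The phase names come from the table above; `run_pipeline`, `GateFailed`, and the `work` and `gates` dictionaries are hypothetical names chosen for illustration and are not FORGER's actual API.

```python
from typing import Callable, Dict, List

# Phase names come from the table above; everything else is illustrative.
PHASES: List[str] = [
    "CONTRACT", "FIND", "OBSERVE", "RECOMBINE", "GRILL", "EXECUTE", "RETAIN",
]


class GateFailed(Exception):
    """Raised when a phase's gate rejects the current state."""


def run_pipeline(state: dict,
                 work: Dict[str, Callable[[dict], dict]],
                 gates: Dict[str, Callable[[dict], bool]]) -> dict:
    """Run every phase in order; a phase counts only if its gate passes."""
    for phase in PHASES:
        if phase not in work or phase not in gates:
            # A missing phase is treated as a failure, never as a skip.
            raise GateFailed(f"{phase}: no work or gate registered")
        state = work[phase](state)
        if not gates[phase](state):
            raise GateFailed(f"{phase}: gate not satisfied")
        state.setdefault("passed", []).append(phase)
    return state
```

In this sketch the gate is a separate callable from the phase's own work function, which mirrors the "you cannot self-grade" rule: the code that does the work is not the code that decides whether it counts.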
EXECUTE is the phase that fails the most and matters the most. It runs its own micro-loop, separate from the outer pipeline. Most agents skip this loop entirely and announce success after writing code that compiled.
EXECUTE PHASE (Phase 5)

Enter with: validated Tier-1 proposal + Definition of Works + tests.

Acceptance suite (auto-generated + hand-curated), compiled from the Definition of Works in CONTRACT.

TDD micro-cycle (loop):

1. Write failing test.
2. Write minimal code to pass.
3. Run tests + linter.
   - pass: go to step 4.
   - fail (≤2): patch + retry.
   - fail again (≥3): web-search exact error → fix → retry.
   - still failing: ESCALATE → human or abort.
4. Summarise delta to log.

Next behaviour? yes → continue the cycle from step 1. no → run the full acceptance suite.

Full acceptance suite:

- pass: DONE MEANS RAN → RETAIN.
- fail: re-enter the TDD micro-cycle with the failure as the next test.

Invariants enforced here:

- No "complete" claim without a green acceptance suite
- No silent skip of failing tests
- Tier-2/3 ideas blocked from this loop unless promoted
Two rules govern this loop:
- Done Means Ran. The artifact must have executed and passed acceptance, or the agent must document why execution was impossible. "I'm confident it works" is not a valid exit.
- Failures become tests. Every test failure that gets fixed leaves behind a regression test. The acceptance suite grows during EXECUTE; it never shrinks.
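The retry ladder inside the micro-cycle (patch and retry on the first two failures, look up the exact error on the third, escalate if that still fails) can be sketched in Python as follows. This is one reading of the loop, not FORGER's implementation; `run_behaviour` and its three callables are hypothetical names.

```python
from typing import Callable


def run_behaviour(run_tests: Callable[[], bool],
                  patch: Callable[[], None],
                  search_and_fix: Callable[[], None],
                  max_plain_retries: int = 2) -> str:
    """Return 'pass' or 'escalate' for one behaviour under test."""
    if run_tests():
        return "pass"
    # Failures 1 and 2: patch from local context and retry.
    for _ in range(max_plain_retries):
        patch()
        if run_tests():
            return "pass"
    # Third failure: look up the exact error, apply the fix, retry once.
    search_and_fix()
    if run_tests():
        return "pass"
    # Still failing: hand over to a human or abort.
    return "escalate"
```

The function never returns a third outcome such as "skipped", which is the point of the "no silent skip of failing tests" invariant: every behaviour ends in either a green run or an explicit escalation.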
For research and design artifacts the loop is the same, but the acceptance criteria differ. A research report passes EXECUTE when every claim is ledger-verified with directly_supported entailment. A design passes when the rubric is satisfied and screenshots match the spec.
Four short walkthroughs are in docs/04-examples.md. Quick previews below.
You type:
"Build me an emotion recognition thing for my app."
FORGER asks back:
- What runs the inference: device, server, or both?
- Real-time video, single images, or audio?
- Which emotions, and what counts as right?
- What is the worst acceptable latency at p95?
- Is "good" measured against FER+, AffectNet, or your own labels?
You answer in one paragraph. FORGER writes a definition_of_works.yaml with measurable acceptance criteria and a reframe memo that asks whether you actually need deep learning or whether facial-action-unit detection would do. The reframe is recorded even if you stick with the original framing.
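As an illustration of what "measurable acceptance criteria" could mean in practice, the Python sketch below models each criterion as a metric, a comparison, and a threshold, then reports which criteria are unmet. The schema of `definition_of_works.yaml` is not shown in this README, so `Criterion`, `unmet`, and the metric names are assumptions made for the example.

```python
import operator
from dataclasses import dataclass
from typing import Callable, Dict, List

# Comparison operators a criterion may use.
OPS: Dict[str, Callable[[float, float], bool]] = {
    "<": operator.lt, "<=": operator.le, ">": operator.gt, ">=": operator.ge,
}


@dataclass(frozen=True)
class Criterion:
    metric: str       # hypothetical metric name, e.g. "p95_latency_ms"
    op: str           # one of the keys in OPS
    threshold: float


def unmet(criteria: List[Criterion], measured: Dict[str, float]) -> List[str]:
    """Return the metrics that are missing or fail their threshold."""
    failures = []
    for c in criteria:
        value = measured.get(c.metric)
        # An unmeasured metric counts as a failure, not as a pass.
        if value is None or not OPS[c.op](value, c.threshold):
            failures.append(c.metric)
    return failures
```

Treating a missing measurement as a failure is a deliberate choice in this sketch: a criterion that was never measured has not been shown to hold.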
Instead of a search API summary, FIND opens GitHub Issues, scrolls JS-rendered docs, follows code references, and produces:
- url: https://github.com/serengil/deepface/issues/1207
  domain: github.com
  authority: high # repo with active CI, named maintainers
  recency: 2025-11 # within TTL for ML libs
  reproducibility: verified # repro steps in the issue
  quote: "FER backend returns logits, not softmaxed probabilities."
  severity_for_dow: critical # contradicts assumption in CONTRACT
  entailment: directly_supported

Every claim ties back to one of the rows in the Source Ledger. Anything weakly_supported cannot be load-bearing in EXECUTE.
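The rule that only directly supported evidence may carry weight in EXECUTE can be expressed as a small filter. The Python sketch below reuses the `url` and `entailment` fields from the ledger entry; the claim fields `id`, `source_url`, and `load_bearing`, and the function name `blocked_claims`, are hypothetical.

```python
from typing import Dict, List


def blocked_claims(claims: List[dict], ledger: Dict[str, dict]) -> List[str]:
    """Return ids of load-bearing claims lacking directly_supported evidence.

    `ledger` maps a source URL to its ledger entry.
    """
    blocked = []
    for claim in claims:
        if not claim.get("load_bearing", False):
            continue  # non-load-bearing claims are not gated in this sketch
        entry = ledger.get(claim["source_url"])
        # A claim with no ledger row is blocked just like a weak one.
        if entry is None or entry.get("entailment") != "directly_supported":
            blocked.append(claim["id"])
    return blocked
```

An empty return value would mean every load-bearing claim traces to a directly supported ledger row; anything else is a list of claims to fix or demote before EXECUTE.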
The proposal goes to a reviewer in a different model family. It does not score. It hunts.
FAILURE HYPOTHESIS #2
claim: "MobileNetV3 small fits in <100ms on Pixel 7"
what would make it fail: thermal throttling under sustained video at 30fps
evidence required: 5-minute sustained run on Pixel 7 with thermal logging
severity if wrong: critical (kills latency invariant in DoW)
confidence: medium-high
EXECUTE cannot proceed until every open failure hypothesis is either accepted (with a test added), rejected (with counter-evidence from the Source Ledger), or escalated.
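The gate described above reduces to a check over the list of failure hypotheses. The following Python sketch assumes each hypothesis is a dictionary with a `status` plus, where relevant, a `test` or `counter_evidence` field; those field names and the function `unresolved` are illustrative assumptions, not FORGER's schema.

```python
from typing import List


def unresolved(hypotheses: List[dict]) -> List[int]:
    """Return the numbers of hypotheses that still block EXECUTE."""
    open_ids = []
    for h in hypotheses:
        status = h.get("status", "open")
        ok = (
            # Accepted only counts if a test was actually added.
            (status == "accepted" and bool(h.get("test")))
            # Rejected only counts if counter-evidence is attached.
            or (status == "rejected" and bool(h.get("counter_evidence")))
            or status == "escalated"
        )
        if not ok:
            open_ids.append(h["number"])
    return open_ids
```

Under these assumptions EXECUTE would start only when `unresolved(...)` returns an empty list. Note that marking a hypothesis "accepted" without attaching a test does not clear it.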
[cycle 1] test: model loads in <2s on cold start → FAIL (load = 4.3s)
[cycle 1] fix: switch to TFLite quantised int8 → PASS (load = 1.1s)
[cycle 2] test: p95 latency <100ms over 30s at 30fps → FAIL (p95 = 142ms)
[cycle 2] fix: drop input resolution 240→192 → PASS (p95 = 78ms)
[cycle 3] test: sustained 5-minute run, no thermal throttle → PASS
[cycle 4] acceptance suite: 17/17 → DONE
The fail-then-fix transcript is stored. If you re-run the task tomorrow, RETAIN serves these as known-good shortcuts.
Prerequisites: Node.js 20+, Python 3.10+, Playwright with cloakbrowser, Claude Code (or another supported host).
# 1. install
npm install -g forger-framework
# 2. cd into your project
cd my-project
# 3. initialise the workspace (writes .forger/ and registers the plugin)
forger init --mode standard
# 4. run a task
forger run "Build a real-time facial emotion classifier for iOS"

You will be dropped into CONTRACT first. From there the pipeline runs to RETAIN, asking for your input only when the contract is ambiguous or when a Tier-3 transformational idea wants promotion.
| Mode | When to pick it | Token budget (cold) | Differences |
|---|---|---|---|
| `quick` | You have built this kind of thing in this domain before | ~5k | FIND reads cached KB, OBSERVE trims probes, GRILL optional, Tier 1 only |
| `standard` | The default | ~22k | Full pipeline, single cross-model reviewer |
| `deep` | High stakes, novel domain, overnight autonomy | ~35k+ | Dual reviewer in GRILL, no waivers on high/critical probes, extended acceptance |
The token budgets in the table assume a cold run with no cached domain knowledge, and they are currently underestimates: the actual full-pipeline budget on first runs is closer to 700k tokens for standard. The budgets are being recalibrated. Tracking issue: FORGER_FLAWS #6.
These do not get bent for convenience:
- Done Means Ran. No completion claim without execution and a green acceptance suite, or a documented reason execution was impossible.
- No claim without evidence or label. Every claim is either ledger-supported, test-supported, or tagged `speculative`.
- Test critical, skip trivial. Only `high` and `critical` assumptions require runtime probes.
- Tier 2/3 firewall. Speculative ideas live in their own files. They do not enter EXECUTE without explicit human promotion.
- Cross-model review for standard and deep. GRILL reviewer must be from a different model family than the executor.
- Mechanism-fit before recombination. Every combination passes a "does the causal mechanism transfer?" check.
- Knowledge self-evolves. Successful runs update the KB. After three runs in a domain, shortcuts become available.
- Scripts are mechanical. Hooks check liveness, schema, quote-existence. Judgment is LLM only.
- Definition of Works is the single source of truth. Every gate traces back to it.
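Two of these invariants are mechanical enough to sketch directly: "test critical, skip trivial" and the cross-model review rule. The Python below is a minimal illustration under assumed data shapes; `needs_probe`, `reviewer_allowed`, and the `name`, `severity`, and model-family strings are hypothetical and not taken from FORGER's code.

```python
from typing import List

# Severities that require a runtime probe, per the invariant above.
PROBED_SEVERITIES = {"high", "critical"}
# Modes in which the GRILL reviewer must come from a different model family.
REVIEWED_MODES = {"standard", "deep"}


def needs_probe(assumptions: List[dict]) -> List[str]:
    """Return names of assumptions severe enough to need a runtime probe."""
    return [a["name"] for a in assumptions if a["severity"] in PROBED_SEVERITIES]


def reviewer_allowed(mode: str, executor_family: str, reviewer_family: str) -> bool:
    """Check the cross-model rule for the given mode."""
    if mode not in REVIEWED_MODES:
        # The rule is stated for standard and deep only.
        return True
    return executor_family != reviewer_family
```

Both checks are pure functions of their inputs, which fits the "scripts are mechanical" invariant: they verify structure and leave judgment to the model.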
| Doc | What's in it |
|---|---|
| docs/01-what-is-forger.md | Plain-language framing, the failure mode it targets, the philosophy in short |
| docs/02-how-it-works.md | Architecture, harness vs agent, BDI split, epistemic triangle map |
| docs/03-phases.md | Per-phase reference, including the EXECUTE sub-pipeline in full |
| docs/04-examples.md | Full demo walkthroughs for CONTRACT, FIND, OBSERVE, RECOMBINE, GRILL, EXECUTE, RETAIN |
| docs/05-usage.md | Install, init, modes, CLI reference, troubleshooting, common workflows |
| docs/06-architecture.md | Plugin filesystem, ledger schemas, audit hooks, KB structure |
| framework/FORGER.md | Long-form philosophical foundation (legacy; being merged into /docs) |
The current packaging has 8 critical flaws documented in framework/FORGER_FLAWS.md. The pipeline ships artifacts that pass acceptance, but the install path requires an external shim while the plugin layout is restructured. If you want to try it before the v0.2 packaging lands, follow the install instructions in docs/05-usage.md and use the shim at forger-local-marketplace.
What we are working on next:
- Restructure the plugin to the standard Claude Code layout (`.claude-plugin/`, `agents/`, `skills/<slug>/`)
- Move FIND fan-out to the orchestrator level (Claude Code strips `Task` from subagent toolsets)
- Recalibrate the token budgets against real telemetry
- Real reviewer adapters for OpenAI, Gemini, and cross-account Anthropic
- Discord: coming with the v0.2 release
- GitHub Issues: bug reports, especially when the framework lets a hallucination through
- Contributing: see `CONTRIBUTING.md` (in progress)