Stanford's Meta-Harness paper had a linear loop. We mapped it onto LangGraph and made it a tree.
LangGraph-native substrate for self-improving agent harnesses. Applies the Stanford Meta-Harness paradigm (arXiv:2603.28052, yoonholee.com/meta-harness) to a coding-agent domain — expressed as two LangGraph state machines with Postgres-backed checkpointing, time-travel forking, and cross-run memory.
A creative reinterpretation of the work at yoonholee.com/meta-harness.
The Stanford paper showed that an outer-loop agent reading raw execution traces and rewriting an inner-loop harness beats every prior text optimizer — +7.7 points over ACE with 4× fewer context tokens, top-2 on TerminalBench-2.
But their loop is linear: iter 1 → 2 → 3 → 4. Real harness optimization
needs branching: rewind to iter 2, try a different proposer prior, fork,
compare. By mapping the loop onto LangGraph state machines, three properties
fall out by construction:
| Property | Mechanism |
|---|---|
| Secure | Each candidate is a sandboxed subgraph — a buggy candidate cannot corrupt the run |
| Consistent | Every state transition is checkpointed via AsyncPostgresSaver; replays are deterministic |
| Reversible | Time-travel via get_state_history + update_state + ainvoke(None, ckpt_id) |
The substrate IS the contribution.
```
OUTER STATE MACHINE (4 nodes, checkpointed via AsyncPostgresSaver)
──────────────────────────────────────────────────────────────────
propose ──► validate ──► benchmark ──► update_frontier
   │            │            │
   │            │            └─ loop while budget > 0
   ▼            ▼
spawns `claude` CLI      spawns inner
subprocess + SKILL.md    subgraph per
(proposer writes a       candidate
 new agents/<name>.py)       │
                             ▼
INNER STATE MACHINE (5 nodes, sandboxed subgraph per candidate)
────────────────────────────────────────────────────────────────
orient ─► plan ─► act ─► verify ─► submit
   │
   │  ▸ 6 fixed tools (read_file, apply_patch, write_file,
   │    run_bash, grep_search, task_complete) — the contract
   │  ▸ 11 override points (system prompt, plan template, turn
   │    budget, retry policy, tool-result formatting, ...)
   │    — the search space
   │
   ▼  traces, scores, file diffs streamed via SSE
DASHBOARD (Next.js 16)
───────────────────────
 ▸ outer state graph (ReactFlow) — live nodes lighting up per iteration
 ▸ candidate trajectory tree (D3) — branches when you fork a checkpoint
 ▸ code diff viewer (Monaco) — agents/<n>.py vs parent, live
 ▸ score chart + Pareto frontier — accuracy × context tokens
 ▸ cross-run memory panel — patterns learned by prior runs
 ▸ right-click any checkpoint → fork modal → resume on a new branch
```
Baseline harness, 5 coding-agent tasks × 5 trials each, on Haiku 4.5:
```
Iter 1: retry on schema_drift errors           → 0.70 (+0.08) ✓
Iter 2: stricter tool-description hashing      → 0.66 (-0.04) ✗
Iter 3: early-exit on auth failures            → 0.74 (+0.04) ✓
Iter 4: more specific tool descriptions        → 0.80 (+0.06) ✓  NEW BEST

 ┌─ right-click iter 2 → "Fork from here" → edit proposer prior ┐
 │                                                              │
 ▼                                                              ▼
Iter 2': rewrite tool descriptions w/ examples → 0.78 (+0.16) ✓
Iter 3': add few-shot demos to descriptions    → 0.85 (+0.07) ✓  GLOBAL BEST
```
Two branches. Both Pareto-optimal at different (accuracy, tokens) tradeoffs.
The meta-harness loop is no longer a sequence — it's a search tree.
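The frontier bookkeeping behind "both Pareto-optimal" can be sketched in a few lines. This is an illustrative non-domination filter over (accuracy up, context tokens down), not the repo's `frontier.py`; the candidate names echo the trace above but the token counts are made up:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Candidate:
    name: str
    accuracy: float       # higher is better
    context_tokens: int   # lower is better

def pareto_frontier(cands: list[Candidate]) -> list[Candidate]:
    """Keep every candidate that no other candidate strictly dominates."""
    frontier = []
    for c in cands:
        dominated = any(
            o.accuracy >= c.accuracy
            and o.context_tokens <= c.context_tokens
            and (o.accuracy > c.accuracy or o.context_tokens < c.context_tokens)
            for o in cands
        )
        if not dominated:
            frontier.append(c)
    return frontier

cands = [
    Candidate("iter4", 0.80, 9_000),    # cheaper but less accurate
    Candidate("iter3'", 0.85, 14_000),  # more accurate but pricier
    Candidate("iter2", 0.66, 10_000),   # dominated by iter4 on both axes
]
print([c.name for c in pareto_frontier(cands)])  # → ['iter4', "iter3'"]
```

Both surviving candidates stay on the frontier because neither beats the other on both axes at once, which is exactly the situation the two branches above end in.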
Prerequisites
- Python 3.11+ and uv
- Docker (for local Postgres)
- Node.js 20+ and npm (for the dashboard; optional until step 11)
- The Claude Code CLI (`claude`) for the real proposer
- An Anthropic API key (`ANTHROPIC_API_KEY`)
Get running
```bash
git clone https://github.com/ManagementMO/Meta-Harness.git
cd Meta-Harness
cp .env.example .env   # add ANTHROPIC_API_KEY
uv sync
docker compose -f infra/docker-compose.yml up -d postgres

# Run the backend test suite (live LLM test skips without ANTHROPIC_API_KEY)
cd backend && uv run pytest tests/ -q

# Smoke-test the inner loop end-to-end on one task (~24 s, ~$0.05)
uv run meta-harness inner --task task-001-fix-typo --candidate baseline

# Run the meta-harness with the mock proposer (no LLM, completes in <1 s)
uv run meta-harness loop --proposer mock --mock-bench --budget 2 --fresh

# Run with the real claude CLI proposer (~3 min on subscription auth)
uv run meta-harness loop --proposer claude --budget 1 --fresh --mock-bench

# Resume an interrupted run from its last Postgres checkpoint
uv run meta-harness resume <run-name>
```

The implementation is tracked as a topological sequence of 13 verified steps in docs/BUILD_ORDER.md, each with a literal definition-of-done command that proves it works. Backend tests currently pass, with the live LLM test skipped when ANTHROPIC_API_KEY is unavailable.
| Step | Goal | Status |
|---|---|---|
| 1 | Repo skeleton + Postgres + first eval task | ✓ |
| 2 | Sandbox + 6 fixed inner-loop tools | ✓ |
| 3 | Inner StateGraph end-to-end on one task (live LLM) | ✓ |
| 4 | 5 eval tasks + multi-trial benchmark | ✓ |
| 5 | Outer StateGraph + mock proposer + Pareto frontier | ✓ |
| 6 | Real proposer (claude CLI subprocess) + SKILL.md | ✓ |
| 7 | AsyncPostgresSaver + full async refactor | ✓ |
| 8 | Cross-run memory (PostgresStore) | next |
| 9 | Time-travel + concurrent branches | next |
| 10 | FastAPI REST + SSE with closed-set event registry | next |
| 11 | Frontend dashboard (Next.js + ReactFlow + D3 + Monaco) | next |
| 12 | CLI completeness + holdout evaluation | next |
| 13 | End-to-end demo dry-run (formal acceptance) | final |
Run `cd backend && uv run pytest tests/ -q` at any commit to confirm the test floor.
- Two LangGraph state machines, not one. The outer machine evolves the inner machine's source code. Both are checkpointed; both will support time-travel.
- The "meta-harness tool" is a SKILL.md, not a framework feature. ~150 lines of Markdown injected via `--append-system-prompt` when the proposer's `claude` subprocess is spawned. Anti-overfitting and anti-parameter-tuning rules live there; they're load-bearing per the paper's Section 5 ablations.
- The inner loop has a fixed contract and an evolvable shape. Six tools (`read_file`, `apply_patch`, `write_file`, `run_bash`, `grep_search`, `task_complete`) are the contract with the evaluator and cannot be modified by candidates. Eleven override points define the search space.
- `apply_patch` returns `context_echo` on mismatch. When a unified diff fails to apply, the tool surfaces the file's actual current content at the failed range so the model can fix the patch without re-reading the file.
- Forks are concurrent, not sequential. Per Appendix A, `asyncio.create_task` wrappers over `graph.ainvoke` calls share a single `AsyncPostgresSaver`; both branches grow on the dashboard at once.
- Cross-run memory persists across runs. A pattern learned in run A flows into run B's proposer system prompt, so each new run starts smarter than cold.
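The context-echo idea can be sketched as a toy line-based patcher — this is an illustration of the behavior described above, not the repo's actual `apply_patch` implementation (function name and return shape are assumptions):

```python
def apply_hunk(lines: list[str], start: int,
               expected: list[str], replacement: list[str]) -> dict:
    """Replace lines[start : start+len(expected)] if they match `expected`.

    On mismatch, echo the file's actual content at that range back to the
    caller so it can re-anchor the patch without re-reading the file.
    """
    actual = lines[start:start + len(expected)]
    if actual != expected:
        return {"ok": False, "context_echo": actual}
    return {"ok": True,
            "lines": lines[:start] + replacement + lines[start + len(expected):]}

file_lines = ["def add(a, b):", "    return a - b"]

# Stale hunk: the model believes line 1 still reads `a + b`.
res = apply_hunk(file_lines, 1, ["    return a + b"], ["    return a + b  # fixed"])
print(res["context_echo"])  # → ['    return a - b']  (the actual current line)
```

One failed call hands the model exactly the lines it mis-remembered, which is why the inner loop can repair a bad diff in a single extra turn.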
```
meta-harness/
├── backend/                     # FastAPI + LangGraph
│   ├── app/
│   │   ├── cli.py               # `meta-harness` CLI (typer)
│   │   ├── main.py              # FastAPI app entry (step 10)
│   │   └── meta_harness/        # internal namespace
│   │       ├── outer.py         # outer 4-node StateGraph
│   │       ├── inner.py         # inner 5-phase StateGraph
│   │       ├── state.py         # MetaHarnessState + CodingAgentState
│   │       ├── harness.py       # CodingAgentHarness (11 override points)
│   │       ├── proposer.py      # claude_propose + mock_propose
│   │       ├── tools.py         # 6 fixed inner-loop tools
│   │       ├── sandbox.py       # /tmp/meta-harness-task-{uuid}/
│   │       ├── frontier.py      # Pareto on (accuracy × tokens)
│   │       ├── persistence.py   # AsyncPostgresSaver
│   │       ├── runs.py          # filesystem lifecycle
│   │       ├── memory.py        # cross-run patterns (step 8)
│   │       └── branches.py      # time-travel forks (step 9)
│   └── tests/                   # backend pytest suite
├── frontend/                    # Next.js 16 dashboard (step 11)
├── sdk/meta_harness/            # public Python library
├── skills/meta-harness-coding-agent/SKILL.md  # the proposer's workflow
├── eval/
│   ├── tasks/                   # 5 frozen calibration tasks
│   ├── holdout/                 # 2 unseen test tasks (step 12)
│   └── score.py                 # multi-task pytest scorer
├── agents/
│   ├── baseline.py              # immutable starting harness
│   └── (...)                    # proposer-generated candidates (gitignored)
├── infra/docker-compose.yml     # postgres:16 service
└── docs/                        # phase-0 contracts (read these first)
```
The reference docs are layered so a new contributor can read in order and end up oriented:
| Doc | When to read |
|---|---|
| ARCHITECTURE_SECTION_1.md | First — the locked architecture |
| docs/PROJECT_LAYOUT.md | First — repo tree + naming rules |
| docs/INTERFACES.md | Always — every cross-component contract |
| docs/BUILD_ORDER.md | When picking a step — DoD per step |
| docs/DEFINITION_OF_DONE.md | Before demo — the formal acceptance test |
| docs/TEAM_HANDOFF.md | When joining the build — 4-person coordination |
| relay_metaharness_v7.md | For the why — canonical design doc |
| relay_v7_appendix_a_worktrees.md | For step 9 — concurrent branches via asyncio |
| relay_v7_appendix_b_metaharness_internals.md | For step 6+ — Stanford repo deep-dive |
| relay_v7_appendix_c_inner_loop.md | For inner-loop work — 5-phase agent design |
| skills/meta-harness-coding-agent/SKILL.md | When debugging the proposer — what it actually reads |
The single most important rule: docs/INTERFACES.md is the contract.
Every change touching a state schema, JSON shape, REST endpoint, SSE
event, tool I/O, override point, or SKILL.md section updates that doc
in the same commit.
| Component | Choice |
|---|---|
| State machines | LangGraph 0.2+ |
| Checkpointer | AsyncPostgresSaver (langgraph-checkpoint-postgres) |
| Database | Postgres 16 (Docker; infra/docker-compose.yml) |
| Backend API | FastAPI 0.115+ + Uvicorn |
| Inner-loop LLM | Claude Haiku 4.5 (default; rate-limit-friendly + ~10× cheaper than Sonnet) |
| Proposer LLM | Claude Code CLI subprocess (subscription auth) |
| CLI | Typer + python-dotenv |
| Frontend | Next.js 16, Tailwind 4, ReactFlow, D3, Monaco |
| Workspace tooling | uv (workspace mode: sdk/ + backend/) |
| Testing | pytest, pytest-asyncio (asyncio_mode = "auto"), Playwright |
The `META_HARNESS_INNER_MODEL` env var overrides the inner-loop model if a higher API tier is available (e.g. `claude-sonnet-4-6`).
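Reading that override typically looks like the sketch below. The fallback model id shown is an assumption for illustration, not pulled from the repo's config:

```python
import os

# Assumption: an illustrative default id, not necessarily the repo's.
DEFAULT_INNER_MODEL = "claude-haiku-4-5"

def inner_model() -> str:
    """Resolve the inner-loop model, letting the env var win when set."""
    return os.environ.get("META_HARNESS_INNER_MODEL", DEFAULT_INNER_MODEL)

os.environ["META_HARNESS_INNER_MODEL"] = "claude-sonnet-4-6"
print(inner_model())  # → claude-sonnet-4-6
```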
Built on, and grateful for, the work of:
- Yoonho Lee, Roshen Nair, Qizheng Zhang, Kangwook Lee, Omar Khattab, and Chelsea Finn — Meta-Harness: End-to-End Optimization of Model Harnesses, arXiv:2603.28052, project page, and the reference framework at stanford-iris-lab/meta-harness.
- The LangChain team for LangGraph's time-travel primitives, which make the linear-to-tree mapping possible without a bespoke orchestration layer.
- Anthropic for the Claude Code CLI's `--append-system-prompt` flag and stream-json output format, which let us reproduce the paper's filesystem-mediated proposer pattern verbatim.
MIT — see LICENSE.
Time-travel for Meta-Harness. Built on LangGraph state machines. Secure, consistent, reversible — by construction. Open source. One spark.