Agents.Code — Step-by-Step Autonomous Coding Harness

An autonomous coding harness built with the Claude Agent SDK. Takes a short prompt, scaffolds a feature spec you can edit, derives an ordered step plan, then builds the application one step at a time with a planner → generator → evaluator loop per step. The harness itself is stack-agnostic — the tech stack is chosen in spec.md.

More details are in my blog.

Architecture

(a) short prompt ──[Initializer]──▶ artifacts/<slug>/spec.md   (you edit)
(b) spec.md path ──▶ deterministic parse → steps.json          (auto; you may edit)
                ──▶ for each step:
                      [Planner] ─▶ contract.md
                      [Generator] ─▶ app code + build-status.md   ◀─┐
                      [Evaluator] ─▶ feedback.md (PASS / FAIL) ─────┘ retry on FAIL

Agent	Role	Tools
Initializer	Expands 1–4 sentence prompt into a draft `spec.md` scaffold	File I/O
Planner (per step)	Reads spec + prior build-status, writes the step's `contract.md`	File I/O
Generator (per step)	Implements the step against the contract, commits to git, writes `build-status.md`	File, Bash, Git, Playwright MCP
Evaluator (per step)	Verifies acceptance criteria against the running app, writes `feedback.md` ending in `OVERALL_RESULT: PASS` or `FAIL`	File, Bash, Playwright MCP

A run halts on the first step that fails after --max-step-rounds retries. Re-invoking with the same spec path resumes from the first non-passing step.

Quick Start

# Install dependencies
npm install

# Install Playwright (used by generator + evaluator)
npx playwright install chromium

# (a) Scaffold a draft spec.md from a short prompt.
# Writes to ./kanban/artifacts/<auto-slug>/spec.md
npx tsx src/index.ts "Build a task management app with kanban boards" --output-dir ./kanban

# ... edit ./kanban/artifacts/<auto-slug>/spec.md by hand ...

# (b) Build from the spec. steps.json is derived on first run, then reused on resume.
npx tsx src/index.ts ./kanban/artifacts/<auto-slug>/spec.md

# ... edit steps.json alongside spec.md if needed, then re-run the same command ...

Configuration via `.env`

Any run option can be set in .env instead of passed on the CLI. CLI flags take precedence when both are set.

`.env` variable	CLI flag	Default
`MODEL`	`--model`	Claude Code's default
`MAX_STEP_FIX_ROUNDS`	`--max-step-rounds`	`10`

Authentication

Signed into Claude Code locally — nothing to configure. The SDK subprocess inherits your existing session via your settings.json.
Explicit auth — set one of the auth variables in .env (API key, OAuth token, Bedrock / Vertex / Foundry, or an apiKeyHelper). See .env.example for the full list.

CLI

# Scaffold a spec from a short prompt
npx tsx src/index.ts "<short prompt>" --output-dir <dir> [--name <slug>] [--force]

# Build from a spec (derives steps.json if missing, then runs)
npx tsx src/index.ts <path/to/spec.md> [--output-dir <dir>] [--force] [options]

Options

--name <slug>            Override the auto-derived feature-bucket slug (scaffold only)
--output-dir <dir>       Application root. For the build flow, defaults to the
                         ancestor of the spec's 'artifacts/' dir.
--force                  Scaffold: overwrite an existing spec.md.
                         Build:    re-derive steps.json even if it exists.
--model <model>          Claude model to use
--max-step-rounds <n>    Per-step generator→evaluator retry budget (default: 10)
--max-budget <usd>       Max total budget in USD (default: 50)
--debug                  Enable debug logging

Artifacts Layout

Everything the harness produces lives under --output-dir:

<output-dir>/
├── artifacts/
│   ├── run.log.txt.<ts>                      # structured run log (per invocation)
│   └── <feature-slug>/                       # the "feature bucket"
│       ├── spec.md                           # user-editable spec (from initializer)
│       ├── steps.json                        # ordered step plan (status tracked here)
│       └── 01-project-setup/
│           ├── contract.md                   # planner's definition of done
│           ├── build-status.md               # generator's report
│           ├── feedback.md                   # evaluator's verdict (PASS/FAIL)
│           └── mcp/                          # Playwright MCP side artifacts
│               ├── generator-attempt-1/
│               └── evaluator-round-1/
└── <your application code>                   # built here, committed to git by the generator

steps.json is the source of truth for progress. Statuses: pending → in_progress → passing | failed. Edit it between runs to skip, retry, or re-order steps.

Multiple feature buckets can coexist under artifacts/ — each spec path targets its own bucket.

How It Works

Scaffold — The initializer takes your short prompt and drafts a spec.md with overview, tech stack, design direction, and 4–8 implementation steps. The bucket slug is auto-derived from your prompt (override with --name). Guesses are flagged (reviewer: confirm). You refine it.
Derive + build — Passing the spec path kicks off the harness. On first invocation it deterministically parses spec.md into steps.json; on subsequent invocations it reuses (or resumes against) the existing plan. Pass --force to re-derive.
Per-step loop —
- Planner reads spec.md and prior steps' build-status.md files, then writes a contract.md grounded in what already exists.
- Generator implements against the contract, runs typecheck/build, commits to git, and writes build-status.md.
- Evaluator independently verifies every acceptance criterion against the running app (using Playwright where applicable) and writes feedback.md ending in OVERALL_RESULT: PASS or FAIL.
- On FAIL, the generator gets the feedback and retries, up to --max-step-rounds times.
Resume — If a step fails out or you kill the run, re-invoking with the same spec path picks up from the first non-passing step.

Skills & Plugins

# Frontend design skill
npx skills add vercel-labs/agent-skills

# Next.js best practices
npx skills add vercel-labs/next-skills --skill next-best-practices

# Playwright MCP (used by generator + evaluator)
npx playwright install chromium

References

Harness Design for Long-Running Apps — Anthropic Engineering
Claude Agent SDK — Official Docs

Name		Name	Last commit message	Last commit date
Latest commit History 80 Commits
docs		docs
examples/weather		examples/weather
results		results
skills		skills
src		src
.env.example		.env.example
.gitignore		.gitignore
README.md		README.md
package-lock.json		package-lock.json
package.json		package.json
tsconfig.json		tsconfig.json

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Agents.Code — Step-by-Step Autonomous Coding Harness

Architecture

Quick Start

Configuration via `.env`

Authentication

CLI

Options

Artifacts Layout

How It Works

Skills & Plugins

References

About

Uh oh!

Releases

Packages

Uh oh!

Contributors 1

Languages

Folders and files

Latest commit

History

Repository files navigation

Agents.Code — Step-by-Step Autonomous Coding Harness

Architecture

Quick Start

Configuration via .env

Authentication

CLI

Options

Artifacts Layout

How It Works

Skills & Plugins

References

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors 1

Languages

Configuration via `.env`

Packages