Generator-Evaluator multi-agent harness for building applications autonomously with any LLM.
Inspired by Anthropic's engineering publication "Harness design for long-running application development", agent-bober implements the Generator-Evaluator multi-agent pattern as a reusable, installable workflow. It orchestrates AI agents in a structured loop: a Planner decomposes your idea into sprint contracts, a Generator writes the code, and an Evaluator independently verifies each sprint against its contract before moving on. The result is autonomous, high-quality software development with built-in guardrails, context resets, and brutally honest evaluation.
Works with Claude, GPT, Gemini, Ollama, and any OpenAI-compatible endpoint. Mix and match providers per agent role.
```
You describe a feature
          |
          v
    +-----------+
    |  Planner  |   Asks clarifying questions, produces a PlanSpec
    +-----------+   with sprint contracts and acceptance criteria.
          |
          v
    +-----------+       +-----------+
    | Generator |  -->  | Evaluator |   Writes code, then verifies it:
    +-----------+       +-----------+   typecheck, lint, build, tests.
          ^                   |
          |      (rework)     |
          +-------------------+
          |
          v                 Repeats per sprint until all
    [Next Sprint]           contracts are satisfied.
```
```bash
# Install globally
npm install -g agent-bober

# Or use directly with npx
npx agent-bober init
```

agent-bober works in multiple environments:
- Claude Code -- Plugin with 10 slash commands (`/bober-plan`, `/bober-run`, etc.)
- Cursor / Windsurf -- MCP server with 10 tools in the chat interface
- Any MCP-compatible IDE -- MCP server via stdio transport
- Any terminal -- CLI commands (`npx agent-bober run "feature"`)
```bash
npx agent-bober init
```

Interactive setup -- pick your AI provider, choose a preset, describe what you want to build.

```bash
npx agent-bober init nextjs      # Next.js full-stack app
npx agent-bober init react-vite  # React + Vite
npx agent-bober init solidity    # EVM smart contracts (Hardhat)
npx agent-bober init anchor      # Solana programs (Anchor)
npx agent-bober init api-node    # Node.js API
npx agent-bober init python-api  # Python API (FastAPI)
```

For an existing project:

```bash
cd your-existing-project
npx agent-bober init brownfield
```

Then in Claude Code:
```
/bober-principles   # Define project standards (optional but recommended)
/bober-plan         # Describe your feature, get a structured plan
/bober-sprint       # Execute the next sprint
/bober-eval         # Evaluate the sprint output
/bober-run          # Full autonomous pipeline
```

Specialized workflows:

```
/bober-react        # React web app workflow
/bober-solidity     # EVM smart contract workflow
/bober-anchor       # Solana program workflow
/bober-brownfield   # Existing codebase workflow
/bober-playwright   # Set up and generate E2E tests
```
agent-bober is provider-agnostic. Use any LLM provider for any agent role. Mix and match -- Opus for planning, GPT-4.1 for generation, local Ollama for evaluation.
| Provider | Models | API Key |
|---|---|---|
| Anthropic (default) | opus, sonnet, haiku | `ANTHROPIC_API_KEY` |
| OpenAI | gpt-4.1, gpt-4.1-mini, o3, o4-mini | `OPENAI_API_KEY` |
| Google Gemini | gemini-pro, gemini-flash | `GOOGLE_API_KEY` or `GEMINI_API_KEY` |
| OpenAI-Compatible | Any model (Ollama, LM Studio, Groq, DeepSeek, etc.) | Optional |
Set providers per agent role in `bober.config.json`. Model shorthands auto-resolve to the correct provider:

- `"opus"` / `"sonnet"` / `"haiku"` -- Anthropic
- `"gpt-4.1"` / `"o3"` / `"o4-mini"` -- OpenAI
- `"gemini-pro"` / `"gemini-flash"` -- Google
- `"ollama/llama3"` -- OpenAI-compatible at `localhost:11434`
Override the provider for all roles from the CLI:

```bash
npx agent-bober run "feature" --provider openai
```

Provider SDKs (`openai`, `@google/generative-ai`) are optional peer dependencies -- install only what you use. Only `@anthropic-ai/sdk` is required by default.
agent-bober includes an MCP (Model Context Protocol) server that exposes all functionality as tools in any MCP-compatible IDE.
Add to `.cursor/mcp.json`:

```json
{
  "mcpServers": {
    "bober": {
      "command": "npx",
      "args": ["agent-bober", "mcp"]
    }
  }
}
```

Add to your Windsurf MCP configuration:
```json
{
  "mcpServers": {
    "bober": {
      "command": "npx",
      "args": ["agent-bober", "mcp"]
    }
  }
}
```

| Tool | Type | Description |
|---|---|---|
| `bober_init` | sync | Initialize project config and `.bober/` directory |
| `bober_plan` | sync | Plan a feature, create sprint contracts |
| `bober_sprint` | sync | Execute the next sprint (generator + evaluator loop) |
| `bober_eval` | sync | Evaluate a sprint independently |
| `bober_run` | async | Full autonomous pipeline (returns immediately, poll with status) |
| `bober_status` | poll | Check pipeline progress or read current status |
| `bober_contracts` | read | List all sprint contracts or read a specific one |
| `bober_spec` | read | Read the current PlanSpec |
| `bober_principles` | read/write | Read or set project principles |
| `bober_config` | read/write | Read or update `bober.config.json` |
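The async `bober_run` / `bober_status` pairing implies a poll loop on the client side. A minimal sketch of that pattern -- `callTool` here is a hypothetical stand-in for whatever tool-call method your MCP client exposes, not an agent-bober API:

```typescript
// Illustrative polling pattern for the async bober_run tool.
// `ToolCall` is a hypothetical stand-in for an MCP client's tool-call method.
type ToolCall = (name: string, args?: Record<string, unknown>) => Promise<{ status: string }>;

async function runAndWait(callTool: ToolCall, feature: string, intervalMs = 5_000): Promise<string> {
  await callTool("bober_run", { feature }); // returns immediately
  for (;;) {
    const { status } = await callTool("bober_status");
    if (status !== "running") return status; // e.g. a terminal state
    await new Promise((resolve) => setTimeout(resolve, intervalMs)); // wait before next poll
  }
}
```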
| Command | Description |
|---|---|
| `/bober-principles` | Define project principles -- AI expands your rough notes into standards |
| `/bober-plan` | Plan any feature -- stack-agnostic, sprint-decomposed |
| `/bober-sprint` | Execute the next sprint contract |
| `/bober-eval` | Evaluate current sprint output |
| `/bober-run` | Full autonomous pipeline (plan + sprint + eval loop) |
| `/bober-react` | React web application workflow |
| `/bober-solidity` | EVM smart contract workflow |
| `/bober-anchor` | Solana program workflow |
| `/bober-brownfield` | Existing codebase workflow |
| `/bober-playwright` | Set up Playwright E2E testing, generate tests, debug failures |
```bash
npx agent-bober init [preset]   # Initialize project (with provider selection)
npx agent-bober plan "feature"  # Run the planner
npx agent-bober sprint          # Execute next sprint
npx agent-bober eval            # Evaluate current sprint
npx agent-bober run "feature"   # Full autonomous loop
npx agent-bober mcp             # Start MCP server (Cursor/Windsurf)
```

Option A: Claude Code (recommended)
Launch Claude Code with auto-accept permissions, then run the pipeline:
```bash
cd your-project
agent-bober init nextjs
claude --dangerously-skip-permissions

# Inside Claude Code:
/bober-run Build a complete dashboard with auth, CRUD, and charts
```

Claude will plan, build, evaluate, rework, and iterate without asking you anything. Come back to a finished project.
Option B: CLI with API key
```bash
export ANTHROPIC_API_KEY=sk-ant-...
cd your-project
agent-bober init nextjs
agent-bober run "Build a complete dashboard with auth, CRUD, and charts"
```

The CLI uses the Anthropic SDK directly -- no approval prompts at all.
Option C: With a different provider
```bash
export OPENAI_API_KEY=sk-...
cd your-project
agent-bober init nextjs
agent-bober run "Build a complete dashboard with auth, CRUD, and charts" --provider openai
```

All configuration lives in `bober.config.json` at your project root. The `init` command creates this file from a template, and you can customize it afterward.
```jsonc
{
  // -- Project -----------------------------------------
  "project": {
    "name": "my-app",            // Project name
    "mode": "greenfield",        // "greenfield" | "brownfield"
    "preset": "nextjs",          // Optional: "nextjs" | "react-vite" | "solidity" | "anchor" | "api-node" | "python-api"
    "description": "A task management app with real-time collaboration"
  },

  // -- Planner -----------------------------------------
  "planner": {
    "provider": "anthropic",     // "anthropic" | "openai" | "google" | "openai-compat"
    "model": "opus",             // Any model string or shorthand
    "endpoint": null,            // Custom base URL (for openai-compat)
    "providerConfig": {},        // Provider-specific settings
    "maxClarifications": 5,      // Max clarifying questions (0 to skip)
    "contextFiles": [            // Extra files the planner should read
      "docs/architecture.md"
    ]
  },

  // -- Generator ---------------------------------------
  "generator": {
    "provider": "anthropic",     // "anthropic" | "openai" | "google" | "openai-compat"
    "model": "sonnet",           // Any model string or shorthand
    "endpoint": null,            // Custom base URL (for openai-compat)
    "providerConfig": {},        // Provider-specific settings
    "maxTurnsPerSprint": 50,     // Max tool-use turns per sprint
    "autoCommit": true,          // Auto-commit after each sprint
    "branchPattern": "bober/{feature-name}" // Git branch naming
  },

  // -- Evaluator ---------------------------------------
  "evaluator": {
    "provider": "anthropic",     // "anthropic" | "openai" | "google" | "openai-compat"
    "model": "sonnet",           // Any model string or shorthand
    "endpoint": null,            // Custom base URL (for openai-compat)
    "providerConfig": {},        // Provider-specific settings
    "strategies": [              // Evaluation strategies to run
      { "type": "typecheck", "required": true },
      { "type": "lint", "required": true },
      { "type": "build", "required": true },
      { "type": "unit-test", "required": true },
      { "type": "playwright", "required": false }
    ],
    "maxIterations": 3,          // Max rework cycles per sprint
    "plugins": []                // Custom evaluator plugin paths
  },

  // -- Sprint ------------------------------------------
  "sprint": {
    "maxSprints": 10,            // Max sprints per plan
    "requireContracts": true,    // Require contract agreement before coding
    "sprintSize": "medium"       // "small" | "medium" | "large"
  },

  // -- Pipeline ----------------------------------------
  "pipeline": {
    "maxIterations": 20,         // Max total iterations across all sprints
    "requireApproval": false,    // Pause for user approval between sprints
    "contextReset": "always"     // "always" | "on-threshold" | "never"
  },

  // -- Commands ----------------------------------------
  "commands": {
    "install": "npm install",
    "build": "npm run build",
    "test": "npm test",
    "lint": "npm run lint",
    "dev": "npm run dev",
    "typecheck": "npx tsc --noEmit"
  }
}
```

| Size | Generator Effort | Files Changed | Scope |
|---|---|---|---|
| `small` | 30-60 min | 1-2 files | Single concern |
| `medium` | 1-3 hours | 3-8 files | One cohesive feature slice |
| `large` | 3-5 hours | 5-15 files | Full feature vertical |
| Mode | Behavior |
|---|---|
| `always` | Fresh context for every sprint (recommended for long plans) |
| `on-threshold` | Reset when context usage exceeds 80% |
| `never` | Carry context across sprints (only for short plans) |
| Strategy | What It Does |
|---|---|
| `typecheck` | Runs the configured typecheck command (e.g., `tsc --noEmit`) |
| `lint` | Runs the configured lint command (e.g., `eslint .`) |
| `build` | Runs the configured build command and checks for success |
| `unit-test` | Runs the configured test command |
| `playwright` | Runs Playwright E2E tests |
| `api-check` | Validates API endpoints respond correctly |
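Every command-backed strategy boils down to the same contract: run the configured command and treat exit code 0 as a pass. A minimal sketch of that semantics (illustrative only -- not agent-bober's actual evaluator code):

```typescript
// Sketch of command-backed strategy semantics: run the command, exit code 0 = pass.
// Illustrative only -- not agent-bober's actual implementation.
import { execSync } from "node:child_process";

interface StrategyResult {
  type: string;
  passed: boolean;
  output: string; // fed back to the generator as rework feedback on failure
}

function runCommandStrategy(type: string, command: string, timeoutMs = 120_000): StrategyResult {
  try {
    const output = execSync(command, { encoding: "utf8", stdio: "pipe", timeout: timeoutMs });
    return { type, passed: true, output };
  } catch (err) {
    // A non-zero exit code (or timeout) throws; capture stdout/stderr for feedback.
    const e = err as { stdout?: string; stderr?: string; message: string };
    return { type, passed: false, output: `${e.stdout ?? ""}${e.stderr ?? ""}` || e.message };
  }
}
```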
The strategy `type` is open -- you can use any name and provide a shell command directly. No plugin file needed:

```json
{
  "evaluator": {
    "strategies": [
      { "type": "typecheck", "required": true },
      { "type": "lint", "required": true },
      { "type": "k6", "command": "k6 run load-test.js", "required": false, "label": "Load Test" },
      { "type": "slither", "command": "slither .", "required": true, "label": "Security Audit" },
      { "type": "anchor-verify", "command": "anchor verify", "required": true },
      { "type": "cargo-test", "command": "cargo test", "required": true },
      { "type": "pytest", "command": "pytest --tb=short", "required": true },
      { "type": "mypy", "command": "mypy . --strict", "required": false }
    ]
  }
}
```

Any strategy with a `command` field runs that command and checks the exit code (0 = pass). Error output is parsed and included in the evaluator feedback. You can set a custom timeout in the config:

```json
{ "type": "k6", "command": "k6 run load.js", "required": false, "config": { "timeout": 300000 } }
```

For more complex evaluation logic, write a plugin that implements the `EvaluatorPlugin` interface:
```typescript
import type { EvaluatorPlugin, EvalContext, EvalResult } from "agent-bober";

const myPlugin: EvaluatorPlugin = {
  name: "My Custom Check",
  description: "Validates something specific to my project",

  async canRun(_projectRoot, _config) {
    return true;
  },

  async evaluate(context: EvalContext): Promise<EvalResult> {
    return {
      evaluator: "my-custom-check",
      passed: true,
      score: 100,
      details: [],
      summary: "All checks passed",
      feedback: "Everything looks good.",
      timestamp: new Date().toISOString(),
    };
  },
};

export default () => myPlugin;
```

Register plugins in `bober.config.json`:
```json
{
  "evaluator": {
    "strategies": [
      { "type": "custom", "plugin": "./my-evaluator.ts", "required": true }
    ]
  }
}
```

**nextjs** -- Next.js full-stack (App Router, API routes, Prisma). Includes:
- Next.js with TypeScript, Tailwind CSS, ESLint
- API routes for backend logic
- Prisma ORM for database access
- Vitest for unit tests, Playwright for E2E
**react-vite** -- React + Vite + any backend. Includes:
- Vite dev server with React and TypeScript
- Vitest for unit tests, Playwright for E2E
- ESLint configured for TypeScript + React
- Flexible backend pairing (Express, Fastify, etc.)
**solidity** -- EVM smart contracts (Hardhat/Foundry). Includes:
- Hardhat or Foundry project setup
- OpenZeppelin Contracts integration
- Solhint for linting
- Hardhat tests or Forge tests
- Deployment and verification scripts
**anchor** -- Solana programs (Anchor/Rust). Includes:
- Anchor project setup with program scaffold
- TypeScript integration tests
- Cargo clippy for Rust linting
- IDL generation and client SDK
- Deployment scripts for devnet/mainnet
**api-node** -- Node.js API (Express/NestJS/Fastify). Includes:
- TypeScript API project structure
- Testing with Vitest or Jest
- ESLint and TypeScript strict mode
- Database integration (Prisma/Drizzle)
**python-api** -- Python API (FastAPI/Django). Includes:
- FastAPI or Django project structure
- pytest for testing
- Ruff/Black for linting and formatting
- SQLAlchemy or Django ORM for database access
**brownfield** -- Existing codebase (conservative defaults). No scaffold files -- just configuration:

- Conservative sprint sizes (`small`)
- Higher evaluator iteration limit (5 rework cycles)
- Requires user approval between sprints
- Emphasizes reading existing patterns before making changes
Minimal config, planner decides everything. Just a bober.config.json with build as the only required evaluator strategy. Intended as a starting point for any tech stack not covered by other presets.
If your evaluation strategies include `playwright`, the generator will automatically:

- Add `data-testid` attributes to all interactive UI elements
- Write Playwright test files in `e2e/` alongside UI code
- Verify tests pass before completing each sprint
To set up Playwright in your project:
```
/bober-playwright setup
```
This installs `@playwright/test`, creates `playwright.config.ts` with a `webServer` block that auto-starts your dev server, scaffolds an `e2e/` directory with an example smoke test, and configures JSON reporting for structured feedback.
To generate tests for a specific feature:
```
/bober-playwright "test the login flow"
```
The evaluator runs Playwright tests automatically during evaluation and feeds failures back to the generator for rework. Failed tests include the test name, file location, error message, and screenshot paths when available.
To debug failing E2E tests:
```
/bober-playwright debug
```
```
           bober.config.json
                  |
        +---------+---------+
        |                   |
  .bober/specs/      .bober/contracts/
        |                   |
        v                   v
User Idea --> [Planner] --> PlanSpec + SprintContracts
                                |
                                v
                           [Generator]
                             |    ^
                             v    | (rework feedback)
                           [Evaluator]
                                |
                       pass? ---+--- fail?
                         |             |
                  [Next Sprint]  [Rework Loop]
                         |
                         v
                 All sprints done
                         |
                         v
                 Feature Complete
```
This architecture implements the patterns described in Anthropic's "Harness design for long-running application development" by Prithvi Rajasekaran. The key insight from that research: separating code generation from code evaluation creates a feedback loop that catches errors early and dramatically improves output quality. In their tests, a solo agent produced broken output in 20 minutes, while the full harness produced a polished, working application -- demonstrating that multi-agent orchestration with honest evaluation is worth the investment.
Each agent runs as a multi-turn agentic loop with tool access via the unified `LLMClient` interface. The provider layer abstracts away the differences between Anthropic, OpenAI, Google, and OpenAI-compatible APIs. System prompts are loaded from the detailed agent definitions in `agents/bober-*.md` (300-600 lines of role-specific instructions, anti-leniency protocols, and evaluation criteria).
- Planner (default: Claude Opus): Explores the codebase via read-only tools (`read_file`, `glob`, `grep`), then produces sprint-decomposed plans. Thinks about scope, dependencies, and risk.
- Generator (default: Claude Sonnet): Full tool access (`bash`, `read_file`, `write_file`, `edit_file`, `glob`, `grep`). Reads existing code, writes implementation, runs tests, and commits -- all autonomously within the sprint contract boundaries.
- Evaluator (default: Claude Sonnet): Read-only + bash tools (`bash`, `read_file`, `glob`, `grep` -- deliberately NO write/edit). Independently verifies by running the dev server, taking Playwright screenshots, executing tests, and inspecting code. Cannot fix bugs -- only report them with precise feedback.
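The role-based tool separation above can be pictured as a simple allowlist. This is only a sketch using the tool names from the roles above; the real role-based tool sets live in `src/orchestrator/tools/`:

```typescript
// Sketch of role-based tool allowlists (illustrative, not the actual implementation).
const TOOL_SETS: Record<"planner" | "generator" | "evaluator", readonly string[]> = {
  planner: ["read_file", "glob", "grep"],                                      // read-only
  generator: ["bash", "read_file", "write_file", "edit_file", "glob", "grep"], // full access
  evaluator: ["bash", "read_file", "glob", "grep"],                            // deliberately no write/edit
};

function canUse(role: keyof typeof TOOL_SETS, tool: string): boolean {
  return TOOL_SETS[role].includes(tool);
}
```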
The separation ensures that:
- The Generator cannot "mark its own homework" -- an independent evaluation step with its own tool access catches issues through actual runtime verification, not just reading the generator's self-report.
- Sprint contracts provide clear scope boundaries, preventing feature creep.
- Automated checks (programmatic evaluators) + agent-based qualitative evaluation run after every sprint.
- Context resets between sprints keep the Generator focused and prevent context degradation.
- The Evaluator's anti-leniency protocol ensures passing on the first iteration is rare for non-trivial work.
All bober state lives in the `.bober/` directory:

```
.bober/
  specs/          PlanSpec JSON files
  contracts/      SprintContract JSON files
  eval-results/   Evaluation result logs
  handoffs/       Context handoff documents
  progress.md     Human-readable progress tracker
  history.jsonl   Machine-readable event log
```
For environments where you need to run bober operations outside of Claude Code:

| Script | Purpose |
|---|---|
| `scripts/init-project.sh` | Initialize a project with a template |
| `scripts/detect-stack.sh` | Auto-detect tech stack (outputs JSON) |
| `scripts/run-eval.sh` | Run evaluation strategies from config |

```bash
# Initialize a new project
bash scripts/init-project.sh nextjs

# Detect an existing project's stack
bash scripts/detect-stack.sh /path/to/project

# Run evaluations
bash scripts/run-eval.sh /path/to/project
```

Contributions are welcome. To set up the development environment:
```bash
git clone https://github.com/BOBER3r/agent-bober.git
cd agent-bober
npm install
npm run build
npm run typecheck
npm test
```

```
agent-bober/
  src/
    cli/           CLI entry point (commander)
    config/        Config schema, loader, defaults
    contracts/     Sprint contract and eval result types
    evaluators/    Built-in evaluator plugins
    mcp/           MCP server and tool definitions
      tools/       10 MCP tools (init, plan, sprint, eval, run, status, etc.)
    orchestrator/  Agent runners, agentic loop, tool infrastructure
      tools/       Tool schemas, sandboxed handlers, role-based sets
    providers/     LLM provider adapters (Anthropic, OpenAI, Google, OpenAI-compat)
    state/         State management for .bober/ directory
    utils/         Shared utilities
  agents/          Agent system prompts (.md files, loaded at runtime)
  skills/          Claude Code slash command definitions
  templates/       Project templates and scaffolds
  hooks/           Claude Code hooks
  scripts/         Shell scripts for init, detect, eval
```
- TypeScript strict mode, no `any`.
- ESM only (`"type": "module"`).
- All evaluator plugins implement the `EvaluatorPlugin` interface.
- Sprint contracts are validated against Zod schemas.
- Test with `vitest`. Run `npm test` before submitting.
This project is inspired by and implements the patterns from Anthropic's "Harness design for long-running application development" by Prithvi Rajasekaran. The paper demonstrated that separating generation from evaluation, using sprint contracts, and applying context resets between agents dramatically improves the quality of autonomously built software. agent-bober packages these patterns into a reusable tool.
Example `bober.config.json` fragment mixing providers per agent role:

```json
{
  "planner": { "provider": "anthropic", "model": "opus" },
  "generator": { "provider": "openai", "model": "gpt-4.1" },
  "evaluator": { "provider": "openai-compat", "model": "llama3.1:70b", "endpoint": "http://localhost:11434/v1" }
}
```

MIT -- Copyright (c) 2026 BOBER3r