node-agent-eval

Compare coding agents head-to-head. Pass rate, cost, time, consistency — one command.

The problem

Every "which coding agent is best?" discussion runs on vibes. There's no lightweight tool to systematically compare agents on your tasks, with your codebase, tracking what actually matters: does it work, how long did it take, and what did it cost?

Quickstart

npm i -g node-agent-eval

node-agent-eval run --tasks examples/tasks.yaml --agents claude-code,mock --runs 3
node-agent-eval report --input report.json
node-agent-eval list-agents

Example output

Sample Evaluation Suite
-------------------------------------------------------------------
Agent            Pass Rate   Avg Time   Avg Cost   Consistency
-------------------------------------------------------------------
claude-code           80%      45.2s    $0.1200         0.95
aider                 60%      30.1s    $0.0300         0.85
-------------------------------------------------------------------
Best pass rate: claude-code (80%) | Fastest: aider (30.1s avg) | Cheapest: aider ($0.0300 avg)

How it works

1. Define tasks in YAML: description, repo, test command, timeout
2. Isolate execution: each agent run gets a fresh git worktree (no Docker needed)
3. Run agents: Claude Code, Aider, or any CLI agent via adapters
4. Judge results: run the test command (exit 0 = pass); LLM-as-judge via `judge_prompt` is marked as future
5. Report: pass rate, avg time, avg cost, consistency across runs
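
The isolation and judging steps can be sketched roughly as follows. This is an illustration, not node-agent-eval's actual internals: the helper names `worktreeAddArgs` and `judgeByExitCode` are hypothetical, and the sketch assumes the test command is run through a shell inside the worktree directory.

```javascript
const { spawnSync } = require('child_process');

// Hypothetical helper: arguments for `git worktree add`, which checks out
// `ref` into a fresh directory without touching the main working copy.
function worktreeAddArgs(worktreePath, ref = 'HEAD') {
  return ['worktree', 'add', '--detach', worktreePath, ref];
}

// Hypothetical helper: run the task's test command in `cwd`; exit 0 = pass.
function judgeByExitCode(testCmd, cwd, timeoutSeconds = 300) {
  const result = spawnSync(testCmd, {
    cwd,
    shell: true,
    timeout: timeoutSeconds * 1000,
    encoding: 'utf8',
  });
  // status is null when the process was killed (for example on timeout).
  return { passed: result.status === 0, exitCode: result.status };
}

// Usage (assumes a git repo at ./my-flask-app):
//   spawnSync('git', worktreeAddArgs('/tmp/eval-run-1'), { cwd: './my-flask-app' });
//   judgeByExitCode('python -m pytest tests/ -v', '/tmp/eval-run-1', 180);
```

Because each run works in its own worktree, an agent's edits never leak into the source repository or into another agent's run.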

Supported agents

Agent	CLI	Status
Claude Code	`claude`	Supported
Aider	`aider`	Supported
Mock	built-in	For testing
Custom	Extend `AgentAdapter`	Extensible

Task format

name: "My Eval Suite"
description: "Example tasks"
tasks:
  - id: add-auth
    description: "Add JWT auth middleware to the Flask app"
    repo: "./my-flask-app"
    test_cmd: "python -m pytest tests/ -v"
    timeout_seconds: 180
    tags: [python, auth]

  - id: fix-csv-parser
    description: "Fix the CSV parser to handle empty input gracefully"
    repo: "./my-flask-app"
    test_cmd: "python test_csv.py"
    timeout_seconds: 120
    tags: [python, bugfix]

Field reference

Field	Required	Default	Description
`id`	yes	—	Unique task identifier
`description`	yes	—	Prompt given to the agent
`repo`	yes	—	Path to the git repository
`ref`	no	`HEAD`	Git ref to check out
`test_cmd`	no	—	Shell command; exit 0 = pass
`judge_prompt`	no	—	LLM-as-judge prompt (future)
`timeout_seconds`	no	`300`	Max execution time
`tags`	no	`[]`	Arbitrary string tags
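
To make the required fields and defaults concrete, here is a minimal sketch of how one parsed task entry might be validated and completed. `normalizeTask` is a hypothetical helper written for this example, not part of the package's documented API; the default values are the ones listed in the table.

```javascript
// Defaults taken from the field reference table.
const TASK_DEFAULTS = { ref: 'HEAD', timeout_seconds: 300 };
const REQUIRED_FIELDS = ['id', 'description', 'repo'];

// Hypothetical helper: validate one parsed task and fill in defaults.
function normalizeTask(task) {
  for (const field of REQUIRED_FIELDS) {
    if (typeof task[field] !== 'string' || task[field] === '') {
      throw new Error(`Task is missing required field "${field}"`);
    }
  }
  // A fresh tags array per task, so tasks never share the default.
  return { ...TASK_DEFAULTS, tags: [], ...task };
}

const task = normalizeTask({
  id: 'fix-csv-parser',
  description: 'Fix the CSV parser to handle empty input gracefully',
  repo: './my-flask-app',
  test_cmd: 'python test_csv.py',
});
console.log(task.ref, task.timeout_seconds, task.tags); // HEAD 300 []
```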

CLI reference

node-agent-eval <command> [options]

Commands:
  run          Run evaluation
  report       Display report from JSON
  list-agents  List available agent adapters

Run options:
  --tasks <path>         Path to tasks YAML file (required)
  --agents <names>       Comma-separated agent names (required)
  --runs <number>        Runs per task per agent (default: 1)
  --concurrency <number> Concurrent agent runs (default: 1)
  --output <path>        Output JSON path

Report options:
  --input <path>         Path to report JSON (required)
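
As a rough illustration of what `--runs` implies for total work, the sketch below expands tasks, agents, and runs into individual jobs. `expandJobs` is a hypothetical helper, and the assumption that every (task, agent, run) combination becomes one job is inferred from the "runs per task per agent" wording, not taken from the source code.

```javascript
// Hypothetical helper: one job per (task, agent, run) combination.
function expandJobs(taskIds, agentNames, runs = 1) {
  const jobs = [];
  for (const taskId of taskIds) {
    for (const agent of agentNames) {
      for (let run = 1; run <= runs; run += 1) {
        jobs.push({ taskId, agent, run });
      }
    }
  }
  return jobs;
}

// --agents is comma-separated, so split it before expanding.
const agents = 'claude-code,mock'.split(',');
const jobs = expandJobs(['add-auth', 'fix-csv-parser'], agents, 3);
console.log(jobs.length); // 2 tasks x 2 agents x 3 runs = 12
```

With paid agents, cost grows with the job count, so it may be worth starting with a low `--runs` value.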

Metrics

Metric	What it measures
Pass Rate	Share of runs in which the agent's code passes the judge
Avg Time	Average wall-clock seconds to completion
Avg Cost	Average API spend per task (when available)
Consistency	1 - std_dev of per-task pass rates across repeated runs
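
The sketch below shows one plausible reading of these metrics for a single agent. It rests on assumptions: the result shape and the `summarize` helper are hypothetical, the standard deviation is the population form, and consistency is taken as 1 minus the standard deviation of each task's pass rate over its repeated runs. The tool's actual formula may differ in these details.

```javascript
// Mean of a non-empty array of numbers.
function mean(values) {
  return values.reduce((sum, v) => sum + v, 0) / values.length;
}

// Population standard deviation (an assumption; the tool may use the sample form).
function stdDev(values) {
  const m = mean(values);
  return Math.sqrt(mean(values.map((v) => (v - m) ** 2)));
}

// results: [{ taskId, passed, durationSeconds }] for a single agent.
function summarize(results) {
  const byTask = new Map();
  for (const r of results) {
    if (!byTask.has(r.taskId)) byTask.set(r.taskId, []);
    byTask.get(r.taskId).push(r.passed ? 1 : 0);
  }
  const perTaskPassRates = [...byTask.values()].map((runs) => mean(runs));
  return {
    passRate: mean(results.map((r) => (r.passed ? 1 : 0))),
    avgTime: mean(results.map((r) => r.durationSeconds)),
    consistency: 1 - stdDev(perTaskPassRates),
  };
}

const summary = summarize([
  { taskId: 'add-auth', passed: true, durationSeconds: 40 },
  { taskId: 'add-auth', passed: true, durationSeconds: 50 },
  { taskId: 'fix-csv-parser', passed: true, durationSeconds: 20 },
  { taskId: 'fix-csv-parser', passed: false, durationSeconds: 30 },
]);
console.log(summary); // { passRate: 0.75, avgTime: 35, consistency: 0.75 }
```

Here the per-task pass rates are 1.0 and 0.5, so their standard deviation is 0.25 and consistency comes out at 0.75.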

Extending with custom adapters

const { AgentAdapter } = require('node-agent-eval/lib/adapters/base');
const { spawn } = require('child_process');

class MyAgent extends AgentAdapter {
  name = 'my-agent';

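  async run(task, workdir) {
    // 'my-agent' is a placeholder CLI name; substitute your agent's command and flags.
    const started = Date.now();
    return new Promise((resolve, reject) => {
      const child = spawn('my-agent', [task.description], { cwd: workdir });
      let stdout = '';
      let stderr = '';
      child.stdout.on('data', (chunk) => { stdout += chunk; });
      child.stderr.on('data', (chunk) => { stderr += chunk; });
      child.on('error', reject); // e.g. the CLI is not installed
      child.on('close', (exitCode) => resolve({
        exitCode, stdout, stderr,
        durationSeconds: (Date.now() - started) / 1000,
        tokensUsed: null, // assumption: null when the agent reports no usage
        costUsd: null,
      }));
    });
  }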

  isAvailable() {
    return true;
  }
}

module.exports = { MyAgent };

vs SWE-bench

	SWE-bench	node-agent-eval
Tasks	Fixed dataset (GitHub issues)	Your custom tasks
Setup	Docker, heavy infra	Git worktrees, no Docker
Cost tracking	No	Yes
Consistency measurement	No (1 run)	Yes (N runs, std dev)
Time to first result	Hours	Minutes

License

MIT
