AgentTaskBench is a lightweight open-source harness for checking whether an AI coding agent actually completed a task correctly instead of only sounding confident.
It gives you a small, readable structure for:
- writing a task specification
- stating agent rules and acceptance criteria
- running local validation
- recording the outcome as pass or fail
Coding agents often produce plausible answers without fully satisfying the task. That creates a gap between output quality and task completion quality.
AgentTaskBench exists to close that gap with simple, local, inspectable examples. It is designed to make failure modes obvious:
- scope drift
- partial fixes
- missed validation
- hidden behavior regressions
- overconfident but unverified output
This repo helps a human give an agent a well-scoped task and then verify the result with a repeatable local check.
Instead of asking, "Did the model give a good-looking response?", you can ask:
- Did it change only what was allowed?
- Did the tests pass?
- Did the result match the acceptance criteria?
- Can I see the evidence in a consistent format?
Run the bundled examples from the repo root:
./scripts/run_validation.sh examples/python_bugfix
./scripts/run_validation.sh examples/spec_drift_guard
./scripts/run_all_validations.shIf you want to inspect a benchmark example, open the files in its folder:
TASK.mdAGENT_RULES.mdACCEPTANCE_CRITERIA.mdRESULT.mdsrc/tests/
agent-taskbench/
├── docs/
├── examples/
├── scripts/
├── templates/
├── logs/
├── out/
├── PROJECT_STATUS.md
├── CONTRIBUTING.md
├── SECURITY.md
└── CHANGELOG.md
- Choose an example benchmark.
- Read
TASK.mdandAGENT_RULES.md. - Hand the task to a coding agent.
- Run the example validator.
- Compare the validator output with
ACCEPTANCE_CRITERIA.md. - Record the result in
RESULT.md.
That workflow keeps the evaluation anchored in behavior, not just narrative.
This project is useful as a portfolio piece because it shows:
- task decomposition
- evaluation design
- test-driven validation
- careful scoping
- failure-mode awareness
- readable documentation
- cross-platform scripting basics
- It is intentionally simple and does not include a database, web app, or orchestration layer.
- It validates a handful of local examples rather than running a large benchmark suite.
- It does not measure model reasoning internally; it measures task outcomes externally.
- It is designed for clarity first, not scale.
Possible next steps after v0.1:
- add more example tasks in different languages
- standardize machine-readable results
- add a simple summary generator
- add richer batch reporting across all examples
- add scoring metadata for benchmark comparison
AgentTaskBench is meant to read like a practical systems-and-evaluation project, not a toy demo. It shows that the author can define success criteria, build validation around them, and design for failure detection instead of demo theater. For hiring managers and AI engineering teams, that signals good judgment, strong workflow thinking, and a bias toward measurable outcomes.