Status: In progress — results will be published here as they are completed.
All token counts in the main README and demo are currently estimated from manual analysis. This directory will contain measured results from live API calls.
Each benchmark run:
- Runs a workflow in slow mode (English coordination) N times
- Runs the identical workflow in fast mode (AACP coordination) N times
- The task prompt is identical in both modes — only the coordination message differs
- Logs real token counts from `usage_metadata` in the API response
- Scores output accuracy using a separate evaluator call
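
The steps above can be sketched as a small loop. This is a minimal illustration, not the code in `benchmark/run_test.py`: the `call_model` callable, the `RunRecord` type, and the response shape (a dict with `text` and a `usage_metadata` entry holding `input_tokens` and `output_tokens`) are assumptions made for the example, and the evaluator call is omitted.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class RunRecord:
    mode: str           # "slow" (English) or "fast" (AACP)
    input_tokens: int   # as reported by the API, not estimated
    output_tokens: int
    output_text: str


# Hypothetical signature: (coordination_message, task_prompt) -> response dict.
ModelCall = Callable[[str, str], dict]


def run_benchmark(call_model: ModelCall, task_prompt: str,
                  coordination: dict[str, str], runs: int) -> list[RunRecord]:
    """Run the same task prompt `runs` times per mode, logging reported usage."""
    records = []
    for mode, message in coordination.items():
        for _ in range(runs):
            response = call_model(message, task_prompt)
            usage = response["usage_metadata"]  # assumed response shape
            records.append(RunRecord(mode, usage["input_tokens"],
                                     usage["output_tokens"], response["text"]))
    return records


def fake_call_model(coordination_message: str, task_prompt: str) -> dict:
    """Offline stand-in for a live API call; counts words instead of tokens."""
    n_in = len(coordination_message.split()) + len(task_prompt.split())
    return {"text": "ok",
            "usage_metadata": {"input_tokens": n_in, "output_tokens": 1}}


records = run_benchmark(
    fake_call_model,
    task_prompt="Compute net pay",
    coordination={
        "slow": "Please hand this task to the payroll agent",
        "fast": "H:payroll",  # placeholder string, not real AACP syntax
    },
    runs=3,
)
```

Passing the model call in as a parameter keeps the task prompt fixed across modes and lets the loop be exercised offline with a stand-in before any live calls are made.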
| Metric | Why |
|---|---|
| Coordination tokens in | The primary AACP claim |
| Task tokens in | Should be ~equal — validates isolation |
| Tokens out | Both modes should produce comparable output |
| Total cost USD | Real dollar impact |
| Latency ms | Speed benefit |
| Accuracy score | Outputs must be equivalent or savings are meaningless |
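
As a rough sketch of how these metrics might be aggregated per mode and compared, assuming each run is stored as a dict with the hypothetical keys shown below. The per-million-token prices are parameters, not real rates, and the numbers in the usage lines are made up purely to exercise the functions.

```python
from statistics import mean


def summarize(runs, price_in_per_mtok, price_out_per_mtok):
    """Average each metric over the runs of a single mode."""
    tokens_in = mean(r["coordination_tokens_in"] + r["task_tokens_in"] for r in runs)
    tokens_out = mean(r["tokens_out"] for r in runs)
    return {
        "coordination_tokens_in": mean(r["coordination_tokens_in"] for r in runs),
        "task_tokens_in": mean(r["task_tokens_in"] for r in runs),
        "tokens_out": tokens_out,
        "cost_usd": (tokens_in * price_in_per_mtok
                     + tokens_out * price_out_per_mtok) / 1_000_000,
        "latency_ms": mean(r["latency_ms"] for r in runs),
        "accuracy": mean(r["accuracy"] for r in runs),
    }


def percent_saving(slow_value, fast_value):
    """Relative reduction of fast mode versus slow mode, in percent."""
    return (slow_value - fast_value) / slow_value * 100


# Synthetic values for illustration only; these are not benchmark results.
slow_runs = [{"coordination_tokens_in": 200, "task_tokens_in": 1000,
              "tokens_out": 300, "latency_ms": 900.0, "accuracy": 1.0}]
fast_runs = [{"coordination_tokens_in": 50, "task_tokens_in": 1000,
              "tokens_out": 300, "latency_ms": 800.0, "accuracy": 1.0}]

slow = summarize(slow_runs, price_in_per_mtok=3.0, price_out_per_mtok=15.0)
fast = summarize(fast_runs, price_in_per_mtok=3.0, price_out_per_mtok=15.0)
coordination_saving = percent_saving(slow["coordination_tokens_in"],
                                     fast["coordination_tokens_in"])
```

Reporting the coordination saving separately from the total cost saving matters because a large task prompt dilutes the effect of a shorter coordination message on the overall bill.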
- `payroll`: 10 runs × 2 modes × Claude Sonnet
- `it_provisioning`: 10 runs × 2 modes × Claude Sonnet
- `contract_review`: 10 runs × 2 modes × Claude Sonnet (expect low total saving)
- `payroll` cross-model: 5 runs × 2 modes × GPT-4o
- Edge cases: ambiguous instructions, missing fields, novel domains

```bash
export ANTHROPIC_API_KEY=your_key_here
python benchmark/run_test.py --workflow payroll --runs 10
python benchmark/evaluate.py --input benchmarks/payroll_TIMESTAMP_raw.json
```

- Accuracy evaluator uses Claude to score Claude outputs, so evaluator bias is possible
- Task token isolation is approximate (we split input tokens by coordination length estimate)
- N=10 runs per condition — small sample, treat as directional not definitive
- Cross-model evaluation (GPT-4o, Gemini) planned for v0.2
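
To make the approximate task-token isolation concrete, here is one possible reading of "split input tokens by coordination length estimate": divide the reported input token total in proportion to character length. The function name and the proportional rule are assumptions for illustration; the actual benchmark code may estimate the split differently.

```python
def split_input_tokens(total_input_tokens: int,
                       coordination_message: str,
                       task_prompt: str) -> tuple[int, int]:
    """Apportion reported input tokens between coordination and task text.

    Assumes tokens are spread roughly evenly over characters, which is only
    approximately true and is the source of the isolation error noted above.
    """
    coord_chars = len(coordination_message)
    total_chars = coord_chars + len(task_prompt)
    if total_chars == 0:
        return 0, total_input_tokens
    coord_tokens = round(total_input_tokens * coord_chars / total_chars)
    # The two parts always sum back to the reported total.
    return coord_tokens, total_input_tokens - coord_tokens


coord_tokens, task_tokens = split_input_tokens(100, "a" * 20, "b" * 80)
```

Because the task prompt is identical in both modes, any error in this estimate shows up as an apparent difference in task tokens between modes, which is why that metric is expected to be only approximately equal.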