AACP Benchmarks

Status: In progress — results will be published here as they are completed.

All token counts in the main README and demo are currently estimated from manual analysis. This directory will contain measured results from live API calls.

Methodology

Each benchmark run:

Runs a workflow in slow mode (English coordination) N times
Runs the identical workflow in fast mode (AACP coordination) N times
Keeps the task prompt identical in both modes, so only the coordination message differs
Logs real token counts from usage_metadata in the API response
Scores output accuracy using a separate evaluator call
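
The steps above can be sketched as a small loop. This is a minimal illustration, not the contents of benchmark/run_test.py: the call_model signature, the RunRecord fields, and the "PAY>NEXT" coordination string are all hypothetical placeholders (the latter is not real AACP syntax). It assumes usage_metadata is dict-like with input_tokens and output_tokens keys, and it uses a word-counting stand-in instead of a live API call so that it runs offline.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class RunRecord:
    mode: str          # "slow" (English coordination) or "fast" (AACP coordination)
    run_index: int
    input_tokens: int
    output_tokens: int


# Hypothetical signature: (coordination_message, task_prompt) -> response dict
ModelCall = Callable[[str, str], Dict]


def run_benchmark(call_model: ModelCall, task_prompt: str,
                  coordination: Dict[str, str], n_runs: int) -> List[RunRecord]:
    """Run the same task prompt n_runs times per mode and log token usage."""
    records = []
    for mode in ("slow", "fast"):
        for i in range(n_runs):
            # Only the coordination message changes between modes.
            response = call_model(coordination[mode], task_prompt)
            usage = response["usage_metadata"]  # assumed dict-like
            records.append(RunRecord(mode, i, usage["input_tokens"],
                                     usage["output_tokens"]))
    return records


def fake_call_model(coordination_message: str, task_prompt: str) -> Dict:
    """Offline stand-in: counts whitespace-separated words, not real tokens."""
    n_in = len(coordination_message.split()) + len(task_prompt.split())
    return {"text": "ok",
            "usage_metadata": {"input_tokens": n_in, "output_tokens": 1}}


records = run_benchmark(
    fake_call_model,
    task_prompt="Compute net pay for employee 17.",
    coordination={
        "slow": "Please hand the payroll task to the next agent and wait for the result.",
        "fast": "PAY>NEXT",  # made-up placeholder, not real AACP syntax
    },
    n_runs=3,
)
```

Swapping fake_call_model for a function that wraps a real client would leave the loop unchanged, which is the point of passing the call in as a parameter. The evaluator step is omitted here.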

What we measure

Metric	Why it matters
Coordination tokens in	The primary AACP claim
Task tokens in	Should be roughly equal across modes, which validates that the task prompt is isolated from the coordination message
Tokens out	Both modes should produce comparable output
Total cost (USD)	Real dollar impact
Latency (ms)	Speed benefit
Accuracy score	Outputs must be equivalent or savings are meaningless
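
As a rough sketch of how these metrics might be compared across modes, the snippet below averages each metric per mode and reports the relative saving in coordination tokens. The field names and the summarize and percent_saving helpers are assumptions made for illustration, and the numbers are placeholders chosen for easy arithmetic, not measured results.

```python
from statistics import mean

# Hypothetical field names for one logged run.
METRICS = ("coordination_tokens_in", "task_tokens_in", "tokens_out", "latency_ms")


def summarize(runs):
    """Return the per-mode mean of each metric."""
    summary = {}
    for mode in ("slow", "fast"):
        rows = [r for r in runs if r["mode"] == mode]
        summary[mode] = {m: mean(r[m] for r in rows) for m in METRICS}
    return summary


def percent_saving(slow_value, fast_value):
    """Relative reduction of fast mode versus slow mode, in percent."""
    if slow_value == 0:
        raise ValueError("slow_value must be non-zero")
    return 100.0 * (slow_value - fast_value) / slow_value


# Placeholder numbers only; they are NOT benchmark results.
runs = [
    {"mode": "slow", "coordination_tokens_in": 110, "task_tokens_in": 500,
     "tokens_out": 200, "latency_ms": 1000},
    {"mode": "slow", "coordination_tokens_in": 90, "task_tokens_in": 500,
     "tokens_out": 200, "latency_ms": 1000},
    {"mode": "fast", "coordination_tokens_in": 45, "task_tokens_in": 500,
     "tokens_out": 200, "latency_ms": 900},
    {"mode": "fast", "coordination_tokens_in": 35, "task_tokens_in": 500,
     "tokens_out": 200, "latency_ms": 900},
]

summary = summarize(runs)
saving = percent_saving(summary["slow"]["coordination_tokens_in"],
                        summary["fast"]["coordination_tokens_in"])
```

With these placeholder inputs the task tokens are identical in both modes, which is the isolation check the table describes; a large gap there would suggest the two modes are not running the same task prompt.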

Planned runs

payroll — 10 runs × 2 modes × Claude Sonnet
it_provisioning — 10 runs × 2 modes × Claude Sonnet
contract_review — 10 runs × 2 modes × Claude Sonnet (expect low total saving)
payroll cross-model — 5 runs × 2 modes × GPT-4o
Edge cases — ambiguous instructions, missing fields, novel domains

Reproducing results

export ANTHROPIC_API_KEY=your_key_here
python benchmark/run_test.py --workflow payroll --runs 10
python benchmark/evaluate.py --input benchmarks/payroll_TIMESTAMP_raw.json

Known limitations

Accuracy evaluator uses Claude to score Claude outputs — potential evaluator bias
Task token isolation is approximate (we split the reported input token count into coordination and task portions using an estimate of the coordination message length)
N=10 runs per condition — small sample, treat as directional not definitive
Cross-model evaluation (GPT-4o, Gemini) planned for v0.2
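
To make the task token isolation caveat concrete, here is one possible way such a split could work. The repository's actual estimator is not shown in this document, so this sketch assumes a simple proportional split by character count; the split_input_tokens helper is hypothetical.

```python
def split_input_tokens(total_input_tokens, coordination_text, task_text):
    """Split a single input token count into (coordination, task) estimates.

    Assumption: tokens are distributed in proportion to character length.
    Real tokenizers do not behave this uniformly, which is why the result
    is only approximate.
    """
    total_chars = len(coordination_text) + len(task_text)
    if total_chars == 0:
        return 0, total_input_tokens
    coordination_share = len(coordination_text) / total_chars
    coordination_tokens = round(total_input_tokens * coordination_share)
    # Assign the remainder to the task so the two parts always sum to the total.
    return coordination_tokens, total_input_tokens - coordination_tokens


# Synthetic strings chosen for easy arithmetic, not real prompts.
coordination_tokens, task_tokens = split_input_tokens(100, "a" * 10, "b" * 30)
```

Because compact coordination messages may tokenize less efficiently per character than English prose, a proportional split like this can misattribute tokens between the two portions, so the split figures should be read as estimates.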
