MackayAndrew/aacp

AACP Benchmarks

Status: In progress — results will be published here as they are completed.

All token counts in the main README and demo are currently estimated from manual analysis. This directory will contain measured results from live API calls.

Methodology

Each benchmark run:

  1. Runs a workflow in slow mode (English coordination) N times
  2. Runs the identical workflow in fast mode (AACP coordination) N times
  3. Keeps the task prompt identical in both modes, so only the coordination message differs
  4. Logs real token counts from usage_metadata in the API response
  5. Scores output accuracy using a separate evaluator call
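The steps above can be sketched as a small loop. This is a minimal illustration and not the contents of `benchmark/run_test.py`: the `Usage` type, the `ModelCall` signature, `run_benchmark`, and the `fake_model` stand-in are all hypothetical names, and the "fast" message is a made-up placeholder rather than real AACP syntax. A live run would replace `fake_model` with a function that calls the API and reads the reported token counts.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Usage:
    input_tokens: int
    output_tokens: int


# Hypothetical signature: takes (coordination_message, task_prompt) and
# returns (output_text, usage). A real run would wrap a live API call here.
ModelCall = Callable[[str, str], tuple[str, Usage]]


def run_benchmark(call_model: ModelCall, task_prompt: str,
                  coordination: dict[str, str], n_runs: int) -> list[dict]:
    """Run every mode n_runs times with the same task prompt and log usage."""
    records = []
    for mode, coordination_message in coordination.items():
        for run in range(n_runs):
            output, usage = call_model(coordination_message, task_prompt)
            records.append({
                "mode": mode,
                "run": run,
                "tokens_in": usage.input_tokens,
                "tokens_out": usage.output_tokens,
                "output": output,  # scored later by a separate evaluator call
            })
    return records


def fake_model(coordination_message: str, task_prompt: str) -> tuple[str, Usage]:
    """Stand-in for the API: counts whitespace-separated words as tokens."""
    tokens_in = len(coordination_message.split()) + len(task_prompt.split())
    return "ok", Usage(input_tokens=tokens_in, output_tokens=1)


records = run_benchmark(
    fake_model,
    task_prompt="Compute net pay for employee 17",
    coordination={
        "slow": "Please hand the payroll task to the payroll agent and wait",
        "fast": "PAY>17",  # made-up compressed message, not real AACP syntax
    },
    n_runs=3,
)
```

Passing the model call in as a function keeps the loop identical for both modes and makes it easy to dry-run the harness without spending tokens.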

What we measure

| Metric | Why |
|---|---|
| Coordination tokens in | The primary AACP claim |
| Task tokens in | Should be ~equal; validates isolation |
| Tokens out | Both modes should produce comparable output |
| Total cost (USD) | Real dollar impact |
| Latency (ms) | Speed benefit |
| Accuracy score | Outputs must be equivalent or savings are meaningless |
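One way to turn per-run records into a headline figure for any of these metrics is to average each mode and report the fast-mode saving relative to slow mode. The sketch below assumes records are dictionaries with a `mode` key of `"slow"` or `"fast"`; the `summarise` helper and the `coordination_tokens_in` field name are hypothetical, and the numbers are invented for illustration only, not measured results.

```python
from statistics import mean


def summarise(records: list[dict], metric: str) -> dict[str, float]:
    """Mean of one metric per mode, plus the fast-mode saving relative to slow."""
    by_mode: dict[str, list[float]] = {}
    for record in records:
        by_mode.setdefault(record["mode"], []).append(record[metric])
    slow, fast = mean(by_mode["slow"]), mean(by_mode["fast"])
    return {
        "slow_mean": slow,
        "fast_mean": fast,
        "saving_pct": 100.0 * (slow - fast) / slow,
    }


# Made-up numbers for illustration only; these are not measured results.
example = [
    {"mode": "slow", "coordination_tokens_in": 40},
    {"mode": "slow", "coordination_tokens_in": 60},
    {"mode": "fast", "coordination_tokens_in": 10},
    {"mode": "fast", "coordination_tokens_in": 15},
]
summary = summarise(example, "coordination_tokens_in")
```

The same helper applies to task tokens, where a saving near zero would support the isolation check described in the table.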

Planned runs

  • payroll — 10 runs × 2 modes × Claude Sonnet
  • it_provisioning — 10 runs × 2 modes × Claude Sonnet
  • contract_review — 10 runs × 2 modes × Claude Sonnet (expect low total saving)
  • payroll cross-model — 5 runs × 2 modes × GPT-4o
  • Edge cases — ambiguous instructions, missing fields, novel domains
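For budgeting, the planned matrix can be tallied as below. This is only arithmetic on the list above, assuming each run is counted as one workflow execution; evaluator calls and the edge-case runs, whose counts are not stated, are excluded.

```python
planned_runs = [
    # (workflow, runs per mode, modes, model)
    ("payroll", 10, 2, "Claude Sonnet"),
    ("it_provisioning", 10, 2, "Claude Sonnet"),
    ("contract_review", 10, 2, "Claude Sonnet"),
    ("payroll cross-model", 5, 2, "GPT-4o"),
]

# Total workflow executions across all planned runs and both modes.
workflow_executions = sum(runs * modes for _, runs, modes, _ in planned_runs)
print(workflow_executions)  # 70
```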

Reproducing results

export ANTHROPIC_API_KEY=your_key_here
python benchmark/run_test.py --workflow payroll --runs 10
python benchmark/evaluate.py --input benchmarks/payroll_TIMESTAMP_raw.json

Known limitations

  • Accuracy evaluator uses Claude to score Claude outputs — potential evaluator bias
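  • Task token isolation is approximate: the API reports a single input token count per call, so we split it into coordination and task tokens using an estimate of the coordination message length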
  • N=10 runs per condition — small sample, treat as directional not definitive
  • Cross-model evaluation (GPT-4o, Gemini) planned for v0.2
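To make the token-isolation limitation concrete, here is a minimal sketch of one way such a split could work. It is not the project's actual method: `split_input_tokens` is a hypothetical helper, and the default of four characters per token is a rough heuristic assumed purely for illustration. Any error in the length estimate shifts tokens between the two buckets, which is why the task-token figure is approximate.

```python
def split_input_tokens(total_input_tokens: int, coordination_text: str,
                       chars_per_token: float = 4.0) -> tuple[int, int]:
    """Return (coordination_tokens, task_tokens) from one reported input total.

    The coordination share is estimated from character length (an assumed
    heuristic) and clamped so it never exceeds the reported total.
    """
    estimate = round(len(coordination_text) / chars_per_token)
    coordination_tokens = min(estimate, total_input_tokens)
    return coordination_tokens, total_input_tokens - coordination_tokens


# 80 characters at ~4 characters per token gives an estimate of 20 tokens.
coordination, task = split_input_tokens(120, "x" * 80)
```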

About

Agent Action Compression Protocol (v0.1 draft)
