feat(benchmarks): token efficiency metric + scatter plot #104
Merged
George-iam merged 2 commits into main on Apr 13, 2026
Conversation
Introduces `tokens_per_correct` as a model-agnostic efficiency metric: total LLM tokens / correct answers. Complements accuracy with a metric that doesn't depend on pricing — AXME would consume ~9K tokens/question on any model (Sonnet, gpt-4o, local Llama).

Results:
- AXME Code: ~10K tokens/correct (measured, 500 questions)
- Supermemory: ~29K tokens/correct (est)
- Mem0: ~31K tokens/correct (est)
- Zep: ~70K tokens/correct (est, Graphiti graph construction)
- Mastra OM: ~105K-119K tokens/correct (est, Observer+Reflector per turn)

AXME is ~10x more token-efficient than Mastra at 89% accuracy. Mastra trades 10x tokens for +5.7pp accuracy via its continuous Observer/Reflector pipeline; AXME runs only 2 LLM calls (reader + judge) at query time.

Adds:
- benchmarks/token-performance.py — script that generates the scatter plot
- benchmarks/token-performance.svg/.png — dark-themed efficiency chart
- benchmarks/README.md — Token efficiency section with breakdown + methodology
- tokens/correct row in main README and benchmarks/README comparison tables
- Token efficiency section in main README with embedded chart

Token counts for competitors are estimated from their published methodology (Observer/Reflector calls, graph construction, fact extraction). AXME's count is measured from the 500-question run.

#!axme pr=none repo=AxmeAI/axme-code
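The metric itself is a single division over the run's results. A minimal sketch, assuming a hypothetical per-question result structure (the benchmark harness's actual schema may differ):

```python
# tokens_per_correct: total LLM tokens divided by the number of correctly
# answered questions. The {"tokens", "correct"} dicts below are a
# hypothetical schema for illustration, not the harness's real format.

def tokens_per_correct(results):
    """results: iterable of {"tokens": int, "correct": bool}, one per question."""
    total_tokens = sum(r["tokens"] for r in results)
    n_correct = sum(1 for r in results if r["correct"])
    if n_correct == 0:
        return float("inf")  # no correct answers: efficiency is undefined
    return total_tokens / n_correct

# Example: 500 questions at ~9K tokens each, 445 correct (89% accuracy)
results = [{"tokens": 9_000, "correct": i < 445} for i in range(500)]
print(round(tokens_per_correct(results)))  # → 10112, i.e. ~10K tokens/correct
```

Note how ~9K tokens/question becomes ~10K tokens/correct at 89% accuracy: the metric charges the tokens spent on wrong answers to the correct ones.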
- Swap axes: X = accuracy, Y = tokens/correct
- Invert Y (log scale) so fewer tokens = higher on the plot
- AXME Code now sits in the top-right corner (high accuracy + low tokens = best)
- Callout positioned next to the AXME point without crossing its label
- Add "↗ Top-right = best" hint in bottom-left corner

#!axme pr=104 repo=AxmeAI/axme-code
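The axis setup above can be sketched in a few lines of matplotlib. This is not the actual `token-performance.py` script; it uses only the two accuracy figures stated in this PR (AXME at 89%, Mastra at 89 + 5.7pp) and the Mastra midpoint token estimate, and the styling is a placeholder:

```python
# Sketch of the plot layout: X = accuracy, Y = tokens/correct on a log
# scale, with the Y axis inverted so fewer tokens (better) plots higher.
import matplotlib
matplotlib.use("Agg")  # headless backend, no display needed
import matplotlib.pyplot as plt

points = {
    "AXME Code": (89.0, 10_000),    # measured
    "Mastra OM": (94.7, 112_000),   # est.; 89 + 5.7pp, midpoint of 105K-119K
}

fig, ax = plt.subplots(figsize=(6, 4))
for name, (acc, tok) in points.items():
    ax.scatter(acc, tok)
    ax.annotate(name, (acc, tok), textcoords="offset points", xytext=(6, 4))

ax.set_yscale("log")
ax.invert_yaxis()  # low tokens/correct now sits at the top of the chart
ax.set_xlabel("Accuracy (%)")
ax.set_ylabel("Tokens per correct answer (log, inverted)")
ax.text(0.02, 0.02, "↗ Top-right = best", transform=ax.transAxes)
fig.savefig("token-performance-sketch.png", dpi=150)
```

Inverting a log-scaled axis keeps the logarithmic spacing while flipping the direction, which is what puts the efficient/accurate quadrant in the top right.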
Summary
Introduces tokens per correct answer as a model-agnostic efficiency metric, complementing accuracy with a measure that doesn't depend on LLM pricing.
AXME is ~10× more token-efficient than Mastra at 89% accuracy. Mastra trades 10× the tokens for +5.7pp accuracy via its continuous Observer/Reflector pipeline; AXME runs only two LLM calls (reader + judge) at query time.
Why tokens instead of dollars?
Dollar costs depend on provider pricing and change over time; token counts don't. AXME consumes ~9K tokens/question regardless of the model it runs on (Sonnet, gpt-4o, a local Llama), so tokens/correct stays comparable across deployments.
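A quick bit of arithmetic shows why a dollar metric wouldn't transfer across models. The per-token prices below are illustrative placeholders, not current rates for any provider:

```python
# The same ~9K-token question costs very different amounts depending on
# the model's pricing, while the token count itself is model-agnostic.
# Prices are hypothetical, for illustration only.
PRICE_PER_MTOK = {"frontier_model": 3.00, "small_model": 0.15}  # $/1M tokens

tokens_per_question = 9_000
for model, price in PRICE_PER_MTOK.items():
    cost = tokens_per_question / 1_000_000 * price
    print(f"{model}: {tokens_per_question} tokens -> ${cost:.4f}")
```

The dollar figure swings 20× with the (hypothetical) price list while the token figure is unchanged, so tokens/correct is the number that survives a model swap.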
What's in this PR
- benchmarks/token-performance.py — script generating the scatter plot
- benchmarks/token-performance.svg/.png — dark-themed efficiency chart (log-scale X axis, accuracy Y axis)
- benchmarks/README.md — new Token efficiency section with breakdown table + methodology notes
- README.md — new tokens/correct row in comparison table + embedded scatter plot

Test plan
🤖 Generated with Claude Code