feat(benchmarks): token efficiency metric + scatter plot #104
Merged
George-iam merged 2 commits into main on Apr 13, 2026
Conversation
Introduces `tokens_per_correct` as a model-agnostic efficiency metric: total LLM tokens / correct answers. Complements accuracy with a metric that doesn't depend on pricing — AXME would consume ~9K tokens/question on any model (Sonnet, gpt-4o, local Llama).

Results:
- AXME Code: ~10K tokens/correct (measured, 500 questions)
- Supermemory: ~29K tokens/correct (est)
- Mem0: ~31K tokens/correct (est)
- Zep: ~70K tokens/correct (est, Graphiti graph construction)
- Mastra OM: ~105K-119K tokens/correct (est, Observer+Reflector per turn)

AXME is ~10x more token-efficient than Mastra at 89% accuracy. Mastra trades 10x tokens for +5.7pp accuracy via its continuous Observer/Reflector pipeline; AXME runs only 2 LLM calls (reader + judge) at query time.

Adds:
- benchmarks/token-performance.py — script that generates the scatter plot
- benchmarks/token-performance.svg/.png — dark-themed efficiency chart
- benchmarks/README.md — Token efficiency section with breakdown + methodology
- tokens/correct row in main README and benchmarks/README comparison tables
- Token efficiency section in main README with embedded chart

Token counts for competitors are estimated from their published methodology (Observer/Reflector calls, graph construction, fact extraction). AXME's count is measured from the 500-question run.

#!axme pr=none repo=AxmeAI/axme-code
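The metric itself is a single division over the run's results. A minimal sketch, assuming a hypothetical per-question result structure (the benchmark harness's actual schema may differ):

```python
# tokens_per_correct: total LLM tokens divided by the number of correctly
# answered questions. The {"tokens", "correct"} dicts below are a
# hypothetical schema for illustration, not the harness's real format.

def tokens_per_correct(results):
    """results: iterable of {"tokens": int, "correct": bool}, one per question."""
    total_tokens = sum(r["tokens"] for r in results)
    n_correct = sum(1 for r in results if r["correct"])
    if n_correct == 0:
        return float("inf")  # no correct answers: efficiency is undefined
    return total_tokens / n_correct

# Example: 500 questions at ~9K tokens each, 445 correct (89% accuracy)
results = [{"tokens": 9_000, "correct": i < 445} for i in range(500)]
print(round(tokens_per_correct(results)))  # → 10112, i.e. ~10K tokens/correct
```

Note how ~9K tokens/question becomes ~10K tokens/correct at 89% accuracy: the metric charges the tokens spent on wrong answers to the correct ones.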
- Swap axes: X = accuracy, Y = tokens/correct
- Invert Y (log scale) so fewer tokens = higher on the plot
- AXME Code now sits in the top-right corner (high accuracy + low tokens = best)
- Callout positioned next to the AXME point without crossing its label
- Add "↗ Top-right = best" hint in bottom-left corner

#!axme pr=104 repo=AxmeAI/axme-code
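The axis setup above can be sketched in a few lines of matplotlib. This is not the actual `token-performance.py` script; it uses only the two accuracy figures stated in this PR (AXME at 89%, Mastra at 89 + 5.7pp) and the Mastra midpoint token estimate, and the styling is a placeholder:

```python
# Sketch of the plot layout: X = accuracy, Y = tokens/correct on a log
# scale, with the Y axis inverted so fewer tokens (better) plots higher.
import matplotlib
matplotlib.use("Agg")  # headless backend, no display needed
import matplotlib.pyplot as plt

points = {
    "AXME Code": (89.0, 10_000),    # measured
    "Mastra OM": (94.7, 112_000),   # est.; 89 + 5.7pp, midpoint of 105K-119K
}

fig, ax = plt.subplots(figsize=(6, 4))
for name, (acc, tok) in points.items():
    ax.scatter(acc, tok)
    ax.annotate(name, (acc, tok), textcoords="offset points", xytext=(6, 4))

ax.set_yscale("log")
ax.invert_yaxis()  # low tokens/correct now sits at the top of the chart
ax.set_xlabel("Accuracy (%)")
ax.set_ylabel("Tokens per correct answer (log, inverted)")
ax.text(0.02, 0.02, "↗ Top-right = best", transform=ax.transAxes)
fig.savefig("token-performance-sketch.png", dpi=150)
```

Inverting a log-scaled axis keeps the logarithmic spacing while flipping the direction, which is what puts the efficient/accurate quadrant in the top right.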
Summary
Introduces tokens per correct answer as a model-agnostic efficiency metric, complementing accuracy with a measure that doesn't depend on LLM pricing.
AXME is ~10× more token-efficient than Mastra at 89% accuracy. Mastra trades 10× the tokens for +5.7pp accuracy via its continuous Observer/Reflector pipeline; AXME runs only two LLM calls (reader + judge) at query time.
Why tokens instead of dollars?
Dollar costs depend on provider pricing and change over time; token counts don't. AXME consumes ~9K tokens/question regardless of the model it runs on (Sonnet, gpt-4o, a local Llama), so tokens/correct stays comparable across deployments.
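A quick bit of arithmetic shows why a dollar metric wouldn't transfer across models. The per-token prices below are illustrative placeholders, not current rates for any provider:

```python
# The same ~9K-token question costs very different amounts depending on
# the model's pricing, while the token count itself is model-agnostic.
# Prices are hypothetical, for illustration only.
PRICE_PER_MTOK = {"frontier_model": 3.00, "small_model": 0.15}  # $/1M tokens

tokens_per_question = 9_000
for model, price in PRICE_PER_MTOK.items():
    cost = tokens_per_question / 1_000_000 * price
    print(f"{model}: {tokens_per_question} tokens -> ${cost:.4f}")
```

The dollar figure swings 20× with the (hypothetical) price list while the token figure is unchanged, so tokens/correct is the number that survives a model swap.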
What's in this PR
- benchmarks/token-performance.py — script generating the scatter plot
- benchmarks/token-performance.svg/.png — dark-themed efficiency chart (log-scale X axis, accuracy Y axis)
- benchmarks/README.md — new Token efficiency section with breakdown table + methodology notes
- README.md — new tokens/correct row in comparison table + embedded scatter plot

Test plan
🤖 Generated with Claude Code