feat(benchmarks): token efficiency metric + scatter plot#104

Merged
George-iam merged 2 commits into main from feat/token-efficiency-20260413
Apr 13, 2026

Conversation

@George-iam
Contributor

Summary

Introduces tokens per correct answer as a model-agnostic efficiency metric, complementing accuracy with a measure that doesn't depend on LLM pricing.

  • AXME Code: ~10K tokens/correct (measured, 500-question run)
  • Supermemory: ~29K tokens/correct (est)
  • Mem0: ~31K tokens/correct (est)
  • Zep: ~70K tokens/correct (est, Graphiti graph construction)
  • Mastra OM: ~105K–119K tokens/correct (est, Observer+Reflector per turn)

At 89% accuracy, AXME is ~10× more token-efficient than Mastra. Mastra trades ~10× the tokens for +5.7pp accuracy via its continuous Observer/Reflector pipeline; AXME runs only two LLM calls (reader + judge) at query time.
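As a sanity check, the metric is a single division: total LLM tokens over correct answers. A minimal sketch in Python — the 500-question count and 89% accuracy are from this PR, while the total-token figure is back-solved from the ~10K tokens/correct result and is illustrative only:

```python
def tokens_per_correct(total_llm_tokens: int, correct_answers: int) -> float:
    """Token efficiency: total LLM tokens spent divided by correct answers."""
    if correct_answers == 0:
        raise ValueError("no correct answers; metric undefined")
    return total_llm_tokens / correct_answers

# AXME Code, measured run: 500 questions at 89% accuracy -> 445 correct.
correct = round(500 * 0.89)   # 445
total_tokens = 4_450_000      # illustrative total consistent with ~10K/correct
print(tokens_per_correct(total_tokens, correct))  # 10000.0
```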

Why tokens instead of dollars?

  • Model-agnostic — AXME would consume ~9K tokens/question on Sonnet, gpt-4o, or a local Llama
  • Measures architecture efficiency independent of LLM provider
  • Pricing changes over time; token counts don't
  • Hard to dispute — avoids "your pricing estimate is wrong" objections

What's in this PR

  • benchmarks/token-performance.py — script generating the scatter plot
  • benchmarks/token-performance.svg/.png — dark-themed efficiency chart (accuracy on the X axis, tokens/correct on an inverted log-scale Y axis)
  • benchmarks/README.md — new Token efficiency section with breakdown table + methodology notes
  • Main README.md — new tokens/correct row in comparison table + embedded scatter plot

Test plan

  • Python script runs and produces SVG + PNG
  • Both READMEs updated with new row + section
  • AXME value is ✓ measured; all competitor values footnoted as estimates with methodology
  • Chart rendered in dark theme consistent with architecture diagram

🤖 Generated with Claude Code

Introduces tokens_per_correct as a model-agnostic efficiency metric:
total LLM tokens / correct answers. Complements accuracy with a metric
that doesn't depend on pricing — AXME would consume ~9K tokens/question
on any model (Sonnet, gpt-4o, local Llama).

Results:
- AXME Code:   ~10K tokens/correct (measured, 500 questions)
- Supermemory: ~29K tokens/correct (est)
- Mem0:        ~31K tokens/correct (est)
- Zep:         ~70K tokens/correct (est, Graphiti graph construction)
- Mastra OM:   ~105K-119K tokens/correct (est, Observer+Reflector per turn)

AXME is ~10x more token-efficient than Mastra at 89% accuracy. Mastra
trades 10x tokens for +5.7pp accuracy via continuous Observer/Reflector
pipeline; AXME runs only 2 LLM calls (reader + judge) at query time.

Adds:
- benchmarks/token-performance.py — script that generates the scatter plot
- benchmarks/token-performance.svg/.png — dark-themed efficiency chart
- benchmarks/README.md Token efficiency section with breakdown + methodology
- tokens/correct row in main README and benchmarks/README comparison tables
- Token efficiency section in main README with embedded chart

Token counts for competitors are estimated from their published
methodology (Observer/Reflector calls, graph construction, fact
extraction). AXME count is measured from the 500-question run.
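The per-turn estimation approach can be sketched as follows. Every per-call token count below is a hypothetical placeholder, not a published number; only the 500-question count and the 94.7% accuracy (89% + 5.7pp, from this PR) are grounded in the text above:

```python
def estimate_tokens_per_correct(turns: int, tokens_per_turn_calls: list[int],
                                questions: int, accuracy: float,
                                query_tokens_per_question: int) -> float:
    """Estimate tokens/correct for an architecture that runs LLM calls on
    every conversation turn (e.g. an Observer+Reflector pair) plus some
    query-time tokens per question."""
    ingest = turns * sum(tokens_per_turn_calls)    # continuous per-turn cost
    query = questions * query_tokens_per_question  # answer-time cost
    correct = round(questions * accuracy)
    return (ingest + query) / correct

# Hypothetical inputs: 5,000 turns with 4K + 5K tokens per Observer/Reflector
# call, 500 questions at 94.7% accuracy, 8K query tokens per question.
est = estimate_tokens_per_correct(5_000, [4_000, 5_000], 500, 0.947, 8_000)
```

With these made-up inputs the estimate lands near ~103K tokens/correct, i.e. the same order of magnitude as the 105K–119K range quoted above — the point is the shape of the calculation, not the specific numbers.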

- Swap axes: X = accuracy, Y = tokens/correct
- Invert Y (log scale) so fewer tokens = higher on plot
- AXME Code now sits in the top-right corner (high accuracy + low tokens = best)
- Callout positioned next to the AXME point without crossing its label
- Add "↗ Top-right = best" hint in bottom-left corner

@George-iam George-iam merged commit e05de98 into main Apr 13, 2026