
feat(benchmarks): competitive benchmarks + multi-client support#103

Merged
George-iam merged 1 commit into main from feat/benchmarks-20260413 on Apr 13, 2026

Conversation

@George-iam
Contributor

Summary

Adds a full competitive benchmark suite comparing AXME Code against 5 memory systems (MemPalace, Mastra, Zep, Mem0, Supermemory), plus multi-client support documentation.

Results:

  • ToolEmu (safety): 100.00% accuracy, 0.00% FPR on 90 scenarios across 12 categories
  • LongMemEval: 89.20% E2E + 97.80% R@5 on 500 questions (beats MemPalace's 96.60% R@5)
  • Feature matrix: AXME 9/9 capabilities, 5 unique (decisions, safety hooks, handoff, oracle, multi-repo)
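The R@5 number above is a standard Recall@K retrieval metric. A minimal sketch of how it is typically computed (names are illustrative, not from the benchmark code): a question counts as a hit if any of its gold evidence ids appears in the top-K retrieved ids.

```typescript
// Hypothetical Recall@K sketch: per-question hit if any gold evidence
// id is among the top-K retrieved ids, averaged over all questions.
function recallAtK(
  retrieved: string[][], // per-question retrieved ids, ranked best-first
  gold: string[][],      // per-question gold evidence ids
  k: number,
): number {
  let hits = 0;
  for (let i = 0; i < retrieved.length; i++) {
    const topK = new Set(retrieved[i].slice(0, k));
    if (gold[i].some((id) => topK.has(id))) hits++;
  }
  return hits / retrieved.length;
}
```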

Positioning: AXME leads on features, safety, and retrieval quality. On LongMemEval E2E it places a strong 2nd: ahead of Supermemory (85.4%), Mastra on gpt-4o (84.2%), and Zep (71.2%); behind only Mastra on gpt-5-mini (94.87%).

What's in this PR

benchmarks/ (new, self-contained)

  • Separate package.json with its own deps — zero impact on product
  • lib/search.ts — MiniLM-L6-v2 + HNSW (shared)
  • longmemeval/ — adapter + runner with type-aware top-K (multi-session=50, temporal/knowledge-update=20), type-aware prompts, checkpoint/resume every 10 questions
  • toolemu/ — 90 scenarios across 12 categories
  • README.md — single source of truth: comparison table + per-benchmark details + reproduction
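The type-aware top-K described above can be sketched as follows. This is a hypothetical stand-in, not the actual lib/search.ts: the real pipeline embeds with MiniLM-L6-v2 and searches an HNSW index, while here brute-force cosine similarity stands in and all names (including the single-session default K, which the PR does not state) are illustrative.

```typescript
type QuestionType =
  | "multi-session"
  | "temporal-reasoning"
  | "knowledge-update"
  | "single-session";

// Hypothetical mapping of the type-aware top-K: multi-session = 50,
// temporal / knowledge-update = 20 (per the PR); the rest is assumed.
const TOP_K: Record<QuestionType, number> = {
  "multi-session": 50,
  "temporal-reasoning": 20,
  "knowledge-update": 20,
  "single-session": 10, // assumed default, not stated in the PR
};

function cosine(a: number[], b: number[]): number {
  let dot = 0, na = 0, nb = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    na += a[i] * a[i];
    nb += b[i] * b[i];
  }
  return dot / (Math.sqrt(na) * Math.sqrt(nb));
}

// Brute-force stand-in for the HNSW index: rank all chunks by
// similarity to the query embedding, keep the type-dependent top-K.
function search(
  query: number[],
  chunks: { id: string; vec: number[] }[],
  qType: QuestionType,
): string[] {
  return chunks
    .map((c) => ({ id: c.id, score: cosine(query, c.vec) }))
    .sort((a, b) => b.score - a.score)
    .slice(0, TOP_K[qType])
    .map((c) => c.id);
}
```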

docs/MULTI_CLIENT.md (new)

Setup instructions for Cursor, Windsurf, Cline, Claude Desktop, and generic MCP clients. Hooks remain Claude Code-specific.
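For generic MCP clients, the configuration typically follows the common `mcpServers` shape. The snippet below is entirely illustrative (server name, command, and args are placeholders, not the actual invocation); see docs/MULTI_CLIENT.md for the real per-client instructions.

```json
{
  "mcpServers": {
    "axme": {
      "command": "<command-from-MULTI_CLIENT.md>",
      "args": [],
      "env": { "ANTHROPIC_API_KEY": "<your-key>" }
    }
  }
}
```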

Main README updates

  • Replaces summary Competitive Benchmarks block with full Comparison table (capabilities + benchmarks × 6 products)
  • Collapses Telemetry into <details>
  • Removes internal Releasing section
  • Removes footer microcopy
  • Architecture diagram switched to dark theme

Test plan

  • ToolEmu passes 100% (90/90)
  • LongMemEval full 500 completed with checkpoint/resume (89.20% E2E, 97.80% R@5)
  • tsx loads run.ts without errors
  • No secrets in code (only process.env.ANTHROPIC_API_KEY + docs placeholders)
  • results/*.json and data/*.json gitignored
  • Dead code removed (entity-extractor, failed reflector experiments)
  • Architecture diagram regenerated with dark theme

🤖 Generated with Claude Code

Adds full benchmark suite in benchmarks/ comparing AXME Code against 5 memory
systems (MemPalace, Mastra, Zep, Mem0, Supermemory):

- ToolEmu safety (100% accuracy, 0% FPR on 90 scenarios across 12 categories)
- LongMemEval E2E 89.20% + R@5 97.80% on 500 questions (Sonnet 4.6 reader + judge)
- Feature matrix 9/9 capabilities, 5 unique to AXME

Results: AXME leads on features, safety, and retrieval quality (R@5 beats
MemPalace's 96.60%). LongMemEval E2E places a strong 2nd: ahead of Supermemory
(85.4%), Mastra on gpt-4o (84.2%), and Zep (71.2%); below only Mastra on
gpt-5-mini (94.87%).

Pipeline: MiniLM-L6-v2 + HNSW vector search + type-aware top-K + type-aware
reader prompts + checkpoint/resume every 10 questions. Fully self-contained
in benchmarks/ with its own package.json — zero changes to product src/.

Also adds docs/MULTI_CLIENT.md documenting setup for Cursor, Windsurf, Cline,
Claude Desktop, and generic MCP clients (hooks remain Claude Code-specific).

README: replaces summary Competitive Benchmarks block with full Comparison
table; collapses Telemetry into <details>; removes internal Releasing section
and footer microcopy; architecture diagram switched to dark theme.

#!axme pr=none repo=AxmeAI/axme-code
George-iam merged commit 2ffe50a into main on Apr 13, 2026
