feat(benchmarks): competitive benchmarks + multi-client support #103
Merged
George-iam merged 1 commit into main on Apr 13, 2026
Adds a full benchmark suite in benchmarks/ comparing AXME Code against 5 memory systems (MemPalace, Mastra, Zep, Mem0, Supermemory):

- ToolEmu safety: 100% accuracy, 0% FPR on 90 scenarios across 12 categories
- LongMemEval: E2E 89.20% + R@5 97.80% on 500 questions (Sonnet 4.6 reader + judge)
- Feature matrix: 9/9 capabilities, 5 unique to AXME

Results: AXME leads on features, safety, and retrieval quality (R@5 beats MemPalace's 96.60%). LongMemEval E2E places a strong 2nd, ahead of Supermemory (85.4%), Mastra on gpt-4o (84.2%), and Zep (71.2%); below Mastra on gpt-5-mini (94.87%).

Pipeline: MiniLM-L6-v2 + HNSW vector search + type-aware top-K + type-aware reader prompts + checkpoint/resume every 10 questions. Fully self-contained in benchmarks/ with its own package.json; zero changes to product src/.

Also adds docs/MULTI_CLIENT.md documenting setup for Cursor, Windsurf, Cline, Claude Desktop, and generic MCP clients (hooks remain Claude Code-specific).

README: replaces the summary Competitive Benchmarks block with the full Comparison table; collapses Telemetry into <details>; removes the internal Releasing section and footer microcopy; switches the architecture diagram to a dark theme.

#!axme pr=none repo=AxmeAI/axme-code
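The type-aware top-K step described above can be sketched roughly as follows. This is an illustrative reconstruction, not the runner's actual code; the function name and question-type labels are assumptions, while the K values (multi-session=50, temporal/knowledge-update=20) come from the PR's longmemeval description.

```typescript
// Hypothetical sketch of the type-aware top-K rule; names are illustrative.
type QuestionType =
  | "single-session-user"
  | "single-session-assistant"
  | "single-session-preference"
  | "multi-session"
  | "temporal-reasoning"
  | "knowledge-update";

/** Choose how many chunks to retrieve based on the LongMemEval question type. */
function topKFor(type: QuestionType, defaultK = 10): number {
  switch (type) {
    case "multi-session":
      return 50; // answers span many sessions, so cast a wide net
    case "temporal-reasoning":
    case "knowledge-update":
      return 20; // need extra context to order or supersede facts
    default:
      return defaultK;
  }
}
```

The idea is that a single retrieval budget penalizes question types whose evidence is spread across sessions, so the budget is widened only where the question type predicts scattered evidence.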
Summary
Adds a full competitive benchmark suite comparing AXME Code against 5 memory systems (MemPalace, Mastra, Zep, Mem0, Supermemory), plus multi-client support documentation.
Results:

- ToolEmu safety: 100% accuracy, 0% false-positive rate on 90 scenarios across 12 categories
- LongMemEval: 89.20% E2E accuracy and 97.80% R@5 on 500 questions (Sonnet 4.6 reader + judge)
- Feature matrix: 9/9 capabilities, 5 unique to AXME
Positioning: AXME leads on features, safety, and retrieval quality. LongMemEval E2E places a strong 2nd: ahead of Supermemory (85.4%), Mastra on gpt-4o (84.2%), and Zep (71.2%); behind only Mastra on gpt-5-mini (94.87%).
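For reference, the R@5 retrieval metric reported here counts a question as recalled when any gold evidence item appears among the top 5 retrieved items. A minimal sketch, with hypothetical names (this is not the benchmark runner's actual API):

```typescript
// Illustrative Recall@K over per-question ranked retrieval results.
// `retrieved[i]` is the ranked list of retrieved ids for question i;
// `gold[i]` is the set of gold evidence ids for that question.
function recallAtK(
  retrieved: string[][],
  gold: string[][],
  k = 5,
): number {
  let hits = 0;
  for (let i = 0; i < retrieved.length; i++) {
    const goldSet = new Set(gold[i]);
    // A hit if any of the first k retrieved ids is gold evidence.
    if (retrieved[i].slice(0, k).some((id) => goldSet.has(id))) hits++;
  }
  return hits / retrieved.length;
}
```

Note this "any gold item in top K" convention is one common reading of R@K; stricter variants require all gold evidence to be retrieved.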
What's in this PR
benchmarks/ (new, self-contained)
- package.json with its own deps — zero impact on product
- lib/search.ts — MiniLM-L6-v2 + HNSW (shared)
- longmemeval/ — adapter + runner with type-aware top-K (multi-session=50, temporal/knowledge-update=20), type-aware prompts, checkpoint/resume every 10 questions
- toolemu/ — 90 scenarios across 12 categories
- README.md — single source of truth: comparison table + per-benchmark details + reproduction

docs/MULTI_CLIENT.md (new)
Setup instructions for Cursor, Windsurf, Cline, Claude Desktop, and generic MCP clients. Hooks remain Claude Code-specific.
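For generic MCP clients, registration typically looks like the following. This is a hedged sketch: the `mcpServers` key is the common Claude Desktop/Cursor config shape, but the server command and package name shown here are assumptions, not taken from this PR; see docs/MULTI_CLIENT.md for the actual per-client instructions.

```json
{
  "mcpServers": {
    "axme-code": {
      "command": "npx",
      "args": ["-y", "axme-code"]
    }
  }
}
```

Clients that follow this convention (Cursor, Windsurf, Cline, Claude Desktop) differ mainly in where the config file lives, which is why the doc covers them individually.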
Main README updates

- Replaces the summary Competitive Benchmarks block with the full Comparison table
- Collapses Telemetry into <details>
- Removes the internal Releasing section and footer microcopy
- Switches the architecture diagram to a dark theme
Test plan
- API key read from process.env.ANTHROPIC_API_KEY (docs use placeholders)
- results/*.json and data/*.json gitignored

🤖 Generated with Claude Code