# Evensong

A hackable, verifiable agent-system workbench built from a reverse-engineered Claude Code core.
Research Vault MCP module, public handoff dashboard, and four formal retrieval artifacts (648 + 972 + 72 + 72 calls).
📊 Headline benchmark · ⚡ Quick start · 🏗 Architecture · 💬 Discussions
- What this is
- 📊 The headline result
- 🏗 Architecture
- ⚡ Quick start
- 📁 Directory layout
- 🧩 Retrieval API
- 📚 Research notes
- 🤝 Contributing
- 📄 License
- 🙏 Attribution
## What this is

Evensong is not just a benchmark and not just an MCP package. It is a public workbench around a runnable, modifiable CCR CLI: the core can be read and changed, Research Vault MCP ships as a module, and the retrieval claims are backed by committed evidence files instead of vibes.
This repository exists for four jobs:
| Purpose | What that means |
|---|---|
| Study | Read the Claude Code source without the closed binary |
| Modify | Swap agent tools, retrieval pipelines, and telemetry without treating the system as a sealed box |
| Package | Ship Research Vault MCP as an Evensong module/dependency instead of treating it as the whole product |
| Validate | Measure Retrieve-and-Rerank (RaR) architectures against the EverMemOS §3.4 direction with our own reproducible numbers |
## 📊 The headline result

Four formal retrieval artifacts are committed under `benchmarks/runs/`: Wave 3+F/G cover the original 108-query cross-LLM design; Wave 3+I is the newer 24-query adversarial suite for dense stage 1 + RaR. The Wave 3+I claim is scoped to that hard suite and does not replace the broader 108-query F/G comparisons.

**Wave 3+F — 648 calls (3 runs × 108 queries × 2 pipelines):**
| Pipeline | Top-1 accuracy | p50 latency | p90 latency | Prompt token cost |
|---|---|---|---|---|
| LLM-only judge | 76.9% (249/324) | 2056 ms | 3595 ms | 100% (200 entries) |
| Hybrid BM25 + LLM rerank | 79.3% (257/324) | 1509 ms | 2725 ms | 25% (50 entries) |
Raw: benchmarks/runs/wave3d-hybrid-scale-2026-04-19T1220.md. Per-run stddev 0.00–0.44pp.
**Wave 3+G — 972 calls (3 runs × 108 queries × 3 pipelines):**

| Pipeline | Top-1 | p50 | p90 | Avg latency | LLM calls (prompt entries) |
|---|---|---|---|---|---|
| LLM-only judge | 77.8% (252/324) | 3861 ms | 6404 ms | 4139 ms | 100% (200 entries) |
| Hybrid BM25 + LLM rerank | 77.5% (251/324) | 2919 ms | 4669 ms | 3248 ms | 100% (50 entries) |
| Adaptive Hybrid | 73.1% (237/324) | 2519 ms | 4376 ms | 2365 ms | 73% (27% skip) |
Raw: benchmarks/runs/wave3g-pipelines-2026-04-19T1652.md. Wall time: 10.5 min. Per-run stddev: llm-only 0.76pp · hybrid 1.15pp · adaptive 0.76pp.
| Property | Wave 3+F (648 calls) | Wave 3+G (972 calls) | Verdict |
|---|---|---|---|
| Hybrid vs LLM-only latency (p50) | −27% | −24% | ✅ consistent latency edge |
| Hybrid prompt token cost | −75% | −75% | ✅ identical |
| Hybrid vs LLM-only top-1 accuracy | +2.5pp | −0.3pp (parity) | ⚠️ within per-run noise |
| Adaptive skip rate | — (not run) | 26.9% | ✅ dead-on internal prelim 27% |
| Adaptive top-1 | — | 73.1% | ✅ dead-on internal prelim |
- The hard claim is latency and token cost. Both formal runs agree: BM25 stage 1 narrows the LLM's pool and saves 24-27% p50 latency plus 75% prompt tokens. That is the part to ship.
- Accuracy should not be oversold. Wave 3+F saw Hybrid at +2.5pp over LLM-only; Wave 3+G's 972-call re-run saw practical parity (−0.3pp). Per-run stddev (0.8-1.2pp) and API-load-window variance cover that movement. Treat Hybrid as accuracy-parity-to-slight-edge, with stable latency and token savings.
- Adaptive is the product-shaped contribution. It trades −4.7pp top-1 for −43% average latency and makes 27% of queries use zero LLM calls. The knob is explicit, measurable, and useful.
Reproduce with one command (produces Wave 3+G's artifact):

```bash
bun run scripts/benchmark-hybrid-scale.ts \
  --runs=3 --with-body \
  --pipelines=llm-only,hybrid,adaptive \
  --queries-file=benchmarks/wave3f-generated-queries-2026-04-19.json
```

Generator prompt is committed too — reviewers can audit exactly how queries were produced.
### Adaptive tier

The always-rerank Hybrid pays 1 LLM call per query. For a large fraction of queries, BM25 alone is already confidently correct — paying for the LLM adds latency without changing the top-1. `createAdaptiveHybridProvider` adds a gap-ratio gate: if BM25 `scores[0] / scores[1] >= 1.5`, trust stage 1 and skip the LLM entirely; otherwise fall through to stage 2.
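The gate itself is a few lines. An illustrative sketch of the decision — the `Scored` shape and function name are ours, not the provider's exact internals:

```ts
// Illustrative gap-ratio gate, not the exact internals of
// createAdaptiveHybridProvider. `Scored` is a hypothetical result shape.
type Scored = { id: string; score: number }

function shouldSkipStage2(bm25: Scored[], gapRatioThreshold = 1.5): boolean {
  if (bm25.length < 2) return bm25.length === 1 // a lone hit is trivially confident
  if (bm25[1].score <= 0) return bm25[0].score > 0 // guard the divide-by-zero edge
  return bm25[0].score / bm25[1].score >= gapRatioThreshold
}
```

When the gate returns true, stage 1's ranking is served as-is; otherwise the candidates go to the stage-2 judge.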
Formal 972-call numbers (3 runs × 108 queries, benchmarks/runs/wave3g-pipelines-2026-04-19T1652.md):
- Skip rate: 26.9% (87/324 queries) — stage 2 LLM call avoided when BM25 is confident
- Top-1 on skipped: 58.6% (51/87) — BM25 alone is right ~59% of the time on its confident picks
- Top-1 on invoked: 78.5% (186/237) — LLM resolves the ambiguous BM25 cases
- Overall Adaptive top-1: 73.1% — matches internal preliminary dogfood exactly
- Latency: avg 2365 ms / p90 4376 ms (vs always-rerank hybrid 3248 / 4669, vs llm-only 4139 / 6404)
- Per-run stddev: 0.76pp — the most stable of the three pipelines
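(Cross-check: 26.9% × 58.6% + 73.1% × 78.5% ≈ 15.8% + 57.4% = 73.1%, and 51 + 186 = 237 of 324 — the skip/invoke split reproduces the overall top-1 exactly.)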
Trade-off: −4.7pp top-1 accuracy vs llm-only buys −43% avg latency. The gate is a tuning knob: gapRatioThreshold: 1.3 raises skip rate and drops accuracy; 2.0 reverts toward Hybrid parity.
Positioning vs EverOS: this fills the gap between EverOS's published Fast tier (0 LLM calls, 200-600 ms) and Agentic tier (1-3 LLM calls, 2-5 s) — Adaptive Hybrid is 0 or 1 conditional LLM call, with a user-tunable gating knob. Not covered by any published EverOS / EverMemOS / HyperMem design.
See src/services/retrieval/providers/adaptiveHybridProvider.ts and the 7 unit tests in adaptiveHybridProvider.test.ts. Shipped at 86bb4ee. 66/66 retrieval-domain tests pass.
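The gate logic lends itself to table-style unit tests. A minimal `bun:test` sketch against the illustrative `shouldSkipStage2` from above — not the repo's actual `adaptiveHybridProvider.test.ts`:

```ts
import { describe, expect, it } from 'bun:test'
// Assumes the shouldSkipStage2 sketch from the Adaptive tier section is in scope.

describe('gap-ratio gate (illustrative)', () => {
  it('skips stage 2 on a confident BM25 gap', () => {
    expect(shouldSkipStage2([{ id: 'a', score: 3 }, { id: 'b', score: 1 }], 1.5)).toBe(true)
  })
  it('falls through to the LLM when the top two are close', () => {
    expect(shouldSkipStage2([{ id: 'a', score: 1.2 }, { id: 'b', score: 1.0 }], 1.5)).toBe(false)
  })
})
```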
### Dense RAR on the 24-query hard suite

The Dense RAR path now has a clean, formal retrieval result: 24/24 Top-1 and 24/24 Top-5 on a 24-query adversarial hard suite. The suite uses a 200-entry manifest (18 real vault documents + 182 adversarial junk distractors), BGE-M3 for dense Stage 1, and deepseek-v4-flash as the Stage 2 judge with thinking disabled.
Canonical run: dense-rar-2026-04-24T0854 · mode formal · clean metadata commit 9148853 · Stage-1 TopK 50.
Operator handoff: evensong.zonicdesign.art/handoff · dashboard: /handoff/dashboard.
| Pipeline | Top-1 | Top-5 | Valid | Errors | p50 | p90 |
|---|---|---|---|---|---|---|
| Dense BGE-M3 only | 17/24 (70.8%) | 18/24 (75.0%) | 24/24 | 0 | 526 ms | 576 ms |
| Dense RAR | 24/24 (100.0%) | 24/24 (100.0%) | 24/24 | 0 | 1703 ms | 1842 ms |
| Dense Adaptive RAR | 24/24 (100.0%) | 24/24 (100.0%) | 24/24 | 0 | 1615 ms | 1854 ms |
Evidence: summary, metadata, raw JSONL, and the formal ledger.
Boundary: dense-rar-2026-04-24T0801 remains the Stage-1 TopK 20 formal baseline at 23/24; its q113 miss was candidate recall, not reranker failure. dense-rar-2026-04-24T0644 remains internal/probe-only evidence. Stage-1 TopK 50 fixes that blind spot, but it also increases rerank candidate exposure, so the 24/24 claim is limited to this verified hard suite.
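For orientation, dense stage 1 is plain vector similarity over precomputed embeddings; the pipeline shape stays identical to the BM25 hybrid. A minimal sketch — the types and function names here are illustrative, not the repo's dense provider wiring:

```ts
// Hypothetical dense stage-1 scorer: cosine similarity between a query
// embedding and precomputed entry embeddings (e.g. from BGE-M3).
type Embedded = { id: string; vector: number[] }

function cosine(a: number[], b: number[]): number {
  let dot = 0, na = 0, nb = 0
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i]
    na += a[i] * a[i]
    nb += b[i] * b[i]
  }
  return dot / (Math.sqrt(na) * Math.sqrt(nb) || 1)
}

function denseStage1(queryVec: number[], entries: Embedded[], topK = 50) {
  return entries
    .map((e) => ({ id: e.id, score: cosine(queryVec, e.vector) }))
    .sort((a, b) => b.score - a.score)
    .slice(0, topK) // Stage-1 TopK 50, per the canonical run above
}
```

Stage 2 then reranks those candidates, exactly as in the BM25 hybrid.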
## 🏗 Architecture

The diagram shows the public release shape, not every internal mechanism.
| Layer | Owns |
|---|---|
| Core | CCR runtime: Bun entrypoint, CLI loop, and Ink REPL |
| Modules | Research Vault MCP, hybrid retrieval, and the Atomic Chat-compatible judge path |
| Evidence | Vault fixtures, formal benchmark artifacts, and the public handoff/dashboard surface |
Diagram source lives in docs/assets/evensong-architecture.svg and was exported as PNG with fireworks-tech-graph style 6.
See AGENTS.md and CLAUDE.md for detailed developer notes.
## ⚡ Quick start

Prerequisites: Bun 1.3+ (Node.js is not supported for the Evensong repo). Atomic Chat running on 127.0.0.1:1337 is optional and only needed for local retrieval features (docs).
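A quick way to confirm the optional endpoint is up — the `/v1/models` path is an assumption based on the client being OpenAI-compatible, not a verified route:

```ts
// Ping the optional Atomic Chat endpoint (run with Bun).
// The /v1/models route is assumed from the OpenAI-compat claim.
const res = await fetch('http://127.0.0.1:1337/v1/models')
console.log(res.ok ? 'Atomic Chat reachable ✓' : `Atomic Chat returned HTTP ${res.status}`)
```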
If you only want the Research Vault MCP module, install the package directly:

```bash
npx @syndash/research-vault-mcp --transport=stdio
# or: bunx @syndash/research-vault-mcp --transport=stdio
```

For the full Evensong repo:
```bash
# 1. Install
bun install
# 2. Dev-mode REPL
bun run dev
# 3. Build single-file bundle (~27 MB)
bun run build   # → dist/cli.js
# 4. Run the retrieval test suite
bun test src/services/retrieval src/services/api
# 5. Fire an ad-hoc vault retrieval
bun run scripts/vault-recall.ts "hypergraph memory for conversations"
# 6. Replay the 648-call hybrid vs LLM-only benchmark
bun run scripts/benchmark-hybrid-scale.ts --runs=3 --with-body \
  --queries-file=benchmarks/wave3f-generated-queries-2026-04-19.json
```

## 📁 Directory layout

```
src/                               Reverse-engineered CCR core (CLI, REPL, tools, state)
src/services/
  api/localGemma.ts                Atomic Chat OpenAI-compat client + model registry
  retrieval/                       Hybrid RaR + manifest builder + BM25 + providers
packages/
  research-vault-mcp/              MCP server for the research vault (npx-ready)
scripts/
  benchmark-hybrid-scale.ts        Scale benchmark harness (--runs, --concurrency)
  benchmark-judge.ts               Single-pipeline judge benchmark
  generate-benchmark-queries.ts    Cross-LLM query generator (grok-3)
  vault-recall.ts                  CLI entrypoint for ad-hoc retrieval
  dogfood-wave2b.ts                Model-comparison harness
benchmarks/
  wave3f-generated-queries-*.json  108-query set, committed for reproducibility
  wave3-judge-queries.json         Original 20-query manual set
  runs/                            Raw JSONL + Markdown summaries
docs/                              Design specs, plans, debug notes
tests/                             Regression + integration suites
services/                          8-service microservice suite used inside benchmarks
api/                               HTTP relay / provider fallback chain
```
## 🧩 Retrieval API

Library usage — compose the hybrid pipeline manually for custom flows:

```ts
import { createLocalGemmaClient, ATOMIC_MODELS } from 'src/services/api/localGemma'
import { createAtomicProvider } from 'src/services/retrieval/providers/atomicProvider'
import { createBM25Provider } from 'src/services/retrieval/providers/bm25Provider'
import { createHybridProvider } from 'src/services/retrieval/providers/hybridProvider'
import { createAdaptiveHybridProvider } from 'src/services/retrieval/providers/adaptiveHybridProvider'
import { buildVaultManifest } from 'src/services/retrieval/manifestBuilder'
import { vaultRetrieve } from 'src/services/retrieval/vaultRetrieve'
const manifest = await buildVaultManifest({ vaultRoot: '_vault', withBody: true })
// Always-rerank Hybrid — pays 1 LLM call per query for max accuracy.
const hybrid = createHybridProvider({
stage1: createBM25Provider(),
stage2: createAtomicProvider(
createLocalGemmaClient({ model: ATOMIC_MODELS.DEEPSEEK_V32 })
),
stage1TopK: 50,
})
// Adaptive variant — skips the LLM when BM25 is confidently top-1.
// Trade-off: −4.7pp top-1 for −43% avg latency. See Adaptive tier above.
const adaptive = createAdaptiveHybridProvider({
stage2: createAtomicProvider(
createLocalGemmaClient({ model: ATOMIC_MODELS.DEEPSEEK_V32 })
),
gapRatioThreshold: 1.5, // skip stage 2 when scores[0] / scores[1] >= 1.5
})
const result = await vaultRetrieve(
{ query: 'hypergraph memory for long-term conversations', manifest, topK: 5 },
{ providers: [hybrid] }, // or [adaptive] for the gated variant
)
```

Available providers: `createAtomicProvider`, `createBM25Provider`, `createHybridProvider`, `createAdaptiveHybridProvider`. All implement the `VaultRetrievalProvider` contract — compose or swap freely.
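Anything satisfying the same contract can slot into the `providers` array. A hedged sketch of a custom stage — the field names below are assumptions; check `src/services/retrieval/providers/` for the real `VaultRetrievalProvider` shape:

```ts
// Hypothetical toy provider: naive substring match over manifest titles.
// `name`, `retrieve`, and the manifest/result field names are assumptions,
// not the repo's verified interface.
const substringProvider = {
  name: 'substring-demo',
  async retrieve({ query, manifest, topK }: { query: string; manifest: any; topK: number }) {
    const q = query.toLowerCase()
    return manifest.entries
      .filter((e: any) => String(e.title ?? '').toLowerCase().includes(q))
      .slice(0, topK)
      .map((e: any, i: number) => ({ id: e.id, score: 1 / (i + 1) }))
  },
}

// Drop it in wherever a provider is accepted:
// await vaultRetrieve({ query, manifest, topK: 5 }, { providers: [substringProvider] })
```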
## 📚 Research notes

This project is in dialogue with recent published work on agentic memory systems:
| Work | Reference | How we use it |
|---|---|---|
| EverMemOS | arXiv 2601.02163 (EverMind / Shanda) | We adopt the §3.4 two-stage pattern; stage 2 is simplified to a direct listwise judge (no verifier loop). |
| HyperMem | arXiv 2604.08256 | Three-layer hypergraph memory — cited as related art. |
| MemGPT | arXiv 2310.08560 | LLM-as-OS paging; the benchmark includes a MemGPT query category. |
| MSA | arXiv 2604.08256 | Memory Sparse Attention — not integrated, benchmarked for comparison. |
| Reflexion | arXiv 2303.11366 | Self-reflective agents — the query set includes Reflexion-style tasks. |
| Extended Mind | Clark & Chalmers 1998 | Philosophical grounding for external-memory-as-cognition. |
The full 108-query test corpus with provenance is at benchmarks/wave3f-generated-queries-2026-04-19.json.
## 🤝 Contributing

We welcome PRs — especially around:
- Dense-vector stage 1 providers (BGE-M3 integration, RRF fusion with BM25) 🔴
- Adaptive gating threshold auto-calibration — the 1.5× gap-ratio default is a conservative hand-pick; PRs welcome to sweep thresholds against live query distributions (the gate itself already shipped at `86bb4ee` ✅ — see the sweep sketch after this list)
- Additional model connectors via the `atomicProvider` factory
- New benchmark categories (adversarial queries, multi-intent, negation traps)
- Vault-size scaling experiments (100 / 500 / 1000+ entries)
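A hypothetical threshold-sweep harness, built from the factories shown in the Retrieval API section. The query-record fields (`query`, `expectedId`) and the result shape (`result.entries[0].id`) are assumptions — check the committed JSON and `vaultRetrieve`'s return type for the real names:

```ts
import { createLocalGemmaClient, ATOMIC_MODELS } from 'src/services/api/localGemma'
import { createAtomicProvider } from 'src/services/retrieval/providers/atomicProvider'
import { createAdaptiveHybridProvider } from 'src/services/retrieval/providers/adaptiveHybridProvider'
import { buildVaultManifest } from 'src/services/retrieval/manifestBuilder'
import { vaultRetrieve } from 'src/services/retrieval/vaultRetrieve'

const manifest = await buildVaultManifest({ vaultRoot: '_vault', withBody: true })
const queries = await Bun.file('benchmarks/wave3f-generated-queries-2026-04-19.json').json()

for (const gapRatioThreshold of [1.2, 1.3, 1.5, 1.8, 2.0]) {
  const adaptive = createAdaptiveHybridProvider({
    stage2: createAtomicProvider(createLocalGemmaClient({ model: ATOMIC_MODELS.DEEPSEEK_V32 })),
    gapRatioThreshold,
  })
  let top1 = 0
  for (const q of queries) {
    const result = await vaultRetrieve(
      { query: q.query, manifest, topK: 1 },
      { providers: [adaptive] },
    )
    if (result.entries?.[0]?.id === q.expectedId) top1++ // field names assumed
  }
  console.log(`gapRatioThreshold=${gapRatioThreshold}: top-1 ${((100 * top1) / queries.length).toFixed(1)}%`)
}
```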
Templates are in place for:
- 🐛 Bug reports
- ✨ Feature requests
- 📊 Benchmark reports
- 🔀 Pull request template
- 💬 Discussions — ideas, Q&A, show-and-tell, benchmarks
File an issue before non-trivial PRs to align on shape.
## 📄 License

Dual-licensed. See LICENSING.md for per-directory mapping and compatibility matrix.
| Applies to | License | File |
|---|---|---|
| Source code, tests, benchmarks, scripts, configs, developer docs | Apache License 2.0 | LICENSE-APACHE |
| Research papers, long-form narrative | CC BY-NC-ND 4.0 | LICENSE-CC-BY-NC-ND |
All code is Apache 2.0 and can be freely incorporated into other Apache-compatible open-source projects (including EverMind-AI/EverOS).
## 🙏 Attribution

Created by Fearvox / 0xVox (Hengyuan Zhu).
The CCR runtime is a clean-room reverse-engineered study of Anthropic's Claude Code CLI. All identifying strings, telemetry endpoints, and internal APIs have been stubbed or removed. This repository makes no claims over Anthropic's trademarks or original binary design.
Upstream lineage: derived from the community reverse-engineered baseline at github.com/claude-code-best/claude-code (CCB). CCR continues from that foundation with additional infrastructure, retrieval pipeline, benchmark harness, and packaging work.
The hybrid retrieval architecture, benchmark harness, and all original code in src/services/retrieval/, scripts/benchmark-*.ts, and benchmarks/wave3* are original work, independently inspired by the EverMemOS published design.
If you build an agent memory system on top of this, we'd love to hear about it.
Open an issue · Start a discussion · ⭐ Star the repo · Fork and ship
