# Evensong

A hackable, verifiable agent-system workbench built from a reverse-engineered Claude Code core.
Research Vault MCP module, public handoff dashboard, and four formal retrieval artifacts (648 + 972 + 72 + 72 calls).
📊 Headline benchmark · ⚡ Quick start · 🏗 Architecture · 💬 Discussions
- What this is
- 📊 The headline result
- 🏗 Architecture
- ⚡ Quick start
- 📁 Directory layout
- 🧩 Retrieval API
- 📚 Research notes
- 🤝 Contributing
- 📄 License
- 🙏 Attribution
## What this is

Evensong is not just a benchmark and not just an MCP package. It is a public workbench around a runnable, modifiable CCR CLI: the core can be read and changed, Research Vault MCP ships as a module, and the retrieval claims are backed by committed evidence files instead of vibes.
This repository exists for four jobs:
| Purpose | What that means |
|---|---|
| Study | Read the Claude Code source without the closed binary |
| Modify | Swap agent tools, retrieval pipelines, and telemetry without treating the system as a sealed box |
| Package | Ship Research Vault MCP as an Evensong module/dependency instead of treating it as the whole product |
| Validate | Measure Retrieve-and-Rerank (RaR) architectures against the EverMemOS §3.4 direction with our own reproducible numbers |
## 📊 The headline result

Four formal retrieval artifacts are committed under `benchmarks/runs/`: Wave 3+F/G cover the original 108-query cross-LLM design; Wave 3+I is the newer 24-query adversarial suite for dense stage 1 + RaR. The Wave 3+I claim is scoped to that hard suite and does not replace the broader 108-query F/G comparisons.

**Wave 3+F — 648 calls (3 runs × 108 queries × 2 pipelines):**
| Pipeline | Top-1 accuracy | p50 latency | p90 latency | Prompt token cost |
|---|---|---|---|---|
| LLM-only judge | 76.9% (249/324) | 2056 ms | 3595 ms | 100% (200 entries) |
| Hybrid BM25 + LLM rerank | 79.3% (257/324) | 1509 ms | 2725 ms | 25% (50 entries) |
Raw: benchmarks/runs/wave3d-hybrid-scale-2026-04-19T1220.md. Per-run stddev 0.00–0.44pp.
**Wave 3+G — 972 calls (3 runs × 108 queries × 3 pipelines):**

| Pipeline | Top-1 | p50 | p90 | Avg latency | LLM calls (prompt entries) |
|---|---|---|---|---|---|
| LLM-only judge | 77.8% (252/324) | 3861 ms | 6404 ms | 4139 ms | 100% (200 entries) |
| Hybrid BM25 + LLM rerank | 77.5% (251/324) | 2919 ms | 4669 ms | 3248 ms | 100% (50 entries) |
| Adaptive Hybrid | 73.1% (237/324) | 2519 ms | 4376 ms | 2365 ms | 73% (27% skip) |
Raw: benchmarks/runs/wave3g-pipelines-2026-04-19T1652.md. Wall time: 10.5 min. Per-run stddev: llm-only 0.76pp · hybrid 1.15pp · adaptive 0.76pp.
| Property | Wave 3+F (648 calls) | Wave 3+G (972 calls) | Verdict |
|---|---|---|---|
| Hybrid vs LLM-only latency (p50) | −27% | −24% | ✅ consistent latency edge |
| Hybrid prompt token cost | −75% | −75% | ✅ identical |
| Hybrid vs LLM-only top-1 accuracy | +2.5pp | −0.3pp (parity) | ⚠️ within per-run noise |
| Adaptive skip rate | — (not run) | 26.9% | ✅ dead-on internal prelim 27% |
| Adaptive top-1 | — | 73.1% | ✅ dead-on internal prelim |
- The hard claim is latency and token cost. Both formal runs agree: BM25 stage 1 narrows the LLM's pool and saves 24-27% p50 latency plus 75% prompt tokens. That is the part to ship.
- Accuracy should not be oversold. Wave 3+F saw Hybrid at +2.5pp over LLM-only; Wave 3+G's 972-call re-run saw practical parity (−0.3pp). Per-run stddev (0.8-1.2pp) and API-load-window variance cover that movement. Treat Hybrid as accuracy-parity-to-slight-edge, with stable latency and token savings.
- Adaptive is the product-shaped contribution. It trades −4.7pp top-1 for −43% average latency and makes 27% of queries use zero LLM calls. The knob is explicit, measurable, and useful.
Reproduce with one command (produces Wave 3+G's artifact):

```bash
bun run scripts/benchmark-hybrid-scale.ts \
  --runs=3 --with-body \
  --pipelines=llm-only,hybrid,adaptive \
  --queries-file=benchmarks/wave3f-generated-queries-2026-04-19.json
```

Generator prompt is committed too — reviewers can audit exactly how queries were produced.
### Adaptive tier

The always-rerank Hybrid pays 1 LLM call per query. For a large fraction of queries, BM25 alone is already confidently correct — paying for the LLM adds latency without changing the top-1. `createAdaptiveHybridProvider` adds a gap-ratio gate: if BM25 `scores[0] / scores[1] >= 1.5`, trust stage 1 and skip the LLM entirely; otherwise fall through to stage 2.
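The gate itself is a few lines. An illustrative sketch of the decision — the `Scored` shape and function name are ours, not the provider's exact internals:

```ts
// Illustrative gap-ratio gate, not the exact internals of
// createAdaptiveHybridProvider. `Scored` is a hypothetical result shape.
type Scored = { id: string; score: number }

function shouldSkipStage2(bm25: Scored[], gapRatioThreshold = 1.5): boolean {
  if (bm25.length < 2) return bm25.length === 1 // a lone hit is trivially confident
  if (bm25[1].score <= 0) return bm25[0].score > 0 // guard the divide-by-zero edge
  return bm25[0].score / bm25[1].score >= gapRatioThreshold
}
```

When the gate returns true, stage 1's ranking is served as-is; otherwise the candidates go to the stage-2 judge.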
Formal 972-call numbers (3 runs × 108 queries, benchmarks/runs/wave3g-pipelines-2026-04-19T1652.md):
- Skip rate: 26.9% (87/324 queries) — stage 2 LLM call avoided when BM25 is confident
- Top-1 on skipped: 58.6% (51/87) — BM25 alone is right ~59% of the time on its confident picks
- Top-1 on invoked: 78.5% (186/237) — LLM resolves the ambiguous BM25 cases
- Overall Adaptive top-1: 73.1% — matches internal preliminary dogfood exactly
- Latency: avg 2365 ms / p90 4376 ms (vs always-rerank hybrid 3248 / 4669, vs llm-only 4139 / 6404)
- Per-run stddev: 0.76pp — the most stable of the three pipelines
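(Cross-check: 26.9% × 58.6% + 73.1% × 78.5% ≈ 15.8% + 57.4% = 73.1%, and 51 + 186 = 237 of 324 — the skip/invoke split reproduces the overall top-1 exactly.)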
Trade-off: −4.7pp top-1 accuracy vs llm-only buys −43% avg latency. The gate is a tuning knob: gapRatioThreshold: 1.3 raises skip rate and drops accuracy; 2.0 reverts toward Hybrid parity.
Positioning vs EverOS: this fills the gap between EverOS's published Fast tier (0 LLM calls, 200-600 ms) and Agentic tier (1-3 LLM calls, 2-5 s) — Adaptive Hybrid is 0 or 1 conditional LLM call, with a user-tunable gating knob. Not covered by any published EverOS / EverMemOS / HyperMem design.
See src/services/retrieval/providers/adaptiveHybridProvider.ts and the 7 unit tests in adaptiveHybridProvider.test.ts. Shipped at 86bb4ee. 66/66 retrieval-domain tests pass.
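The gate logic lends itself to table-style unit tests. A minimal `bun:test` sketch against the illustrative `shouldSkipStage2` from above — not the repo's actual `adaptiveHybridProvider.test.ts`:

```ts
import { describe, expect, it } from 'bun:test'
// Assumes the shouldSkipStage2 sketch from the Adaptive tier section is in scope.

describe('gap-ratio gate (illustrative)', () => {
  it('skips stage 2 on a confident BM25 gap', () => {
    expect(shouldSkipStage2([{ id: 'a', score: 3 }, { id: 'b', score: 1 }], 1.5)).toBe(true)
  })
  it('falls through to the LLM when the top two are close', () => {
    expect(shouldSkipStage2([{ id: 'a', score: 1.2 }, { id: 'b', score: 1.0 }], 1.5)).toBe(false)
  })
})
```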
### Dense RAR on the 24-query hard suite

The Dense RAR path now has a clean, formal retrieval result: 24/24 Top-1 and 24/24 Top-5 on a 24-query adversarial hard suite. The suite uses a 200-entry manifest (18 real vault documents + 182 adversarial junk distractors), BGE-M3 for dense Stage 1, and deepseek-v4-flash as the Stage 2 judge with thinking disabled.
Canonical run: dense-rar-2026-04-24T0854 · mode formal · clean metadata commit 9148853 · Stage-1 TopK 50.
Operator handoff: evensong.zonicdesign.art/handoff · dashboard: /handoff/dashboard.
| Pipeline | Top-1 | Top-5 | Valid | Errors | p50 | p90 |
|---|---|---|---|---|---|---|
| Dense BGE-M3 only | 17/24 (70.8%) | 18/24 (75.0%) | 24/24 | 0 | 526 ms | 576 ms |
| Dense RAR | 24/24 (100.0%) | 24/24 (100.0%) | 24/24 | 0 | 1703 ms | 1842 ms |
| Dense Adaptive RAR | 24/24 (100.0%) | 24/24 (100.0%) | 24/24 | 0 | 1615 ms | 1854 ms |
Evidence: summary, metadata, raw JSONL, and the formal ledger.
Boundary: dense-rar-2026-04-24T0801 remains the Stage-1 TopK 20 formal baseline at 23/24; its q113 miss was candidate recall, not reranker failure. dense-rar-2026-04-24T0644 remains internal/probe-only evidence. Stage-1 TopK 50 fixes that blind spot, but it also increases rerank candidate exposure, so the 24/24 claim is limited to this verified hard suite.
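For orientation, dense stage 1 is plain vector similarity over precomputed embeddings; the pipeline shape stays identical to the BM25 hybrid. A minimal sketch — the types and function names here are illustrative, not the repo's dense provider wiring:

```ts
// Hypothetical dense stage-1 scorer: cosine similarity between a query
// embedding and precomputed entry embeddings (e.g. from BGE-M3).
type Embedded = { id: string; vector: number[] }

function cosine(a: number[], b: number[]): number {
  let dot = 0, na = 0, nb = 0
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i]
    na += a[i] * a[i]
    nb += b[i] * b[i]
  }
  return dot / (Math.sqrt(na) * Math.sqrt(nb) || 1)
}

function denseStage1(queryVec: number[], entries: Embedded[], topK = 50) {
  return entries
    .map((e) => ({ id: e.id, score: cosine(queryVec, e.vector) }))
    .sort((a, b) => b.score - a.score)
    .slice(0, topK) // Stage-1 TopK 50, per the canonical run above
}
```

Stage 2 then reranks those candidates, exactly as in the BM25 hybrid.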
## 🏗 Architecture

The diagram shows the public release shape, not every internal mechanism.
| Layer | Owns |
|---|---|
| Core | CCR runtime: Bun entrypoint, CLI loop, and Ink REPL |
| Modules | Research Vault MCP, hybrid retrieval, and the Atomic Chat-compatible judge path |
| Evidence | Vault fixtures, formal benchmark artifacts, and the public handoff/dashboard surface |
Diagram source lives in docs/assets/evensong-architecture.svg and was exported as PNG with fireworks-tech-graph style 6.
See AGENTS.md and CLAUDE.md for detailed developer notes.
## ⚡ Quick start

Prerequisites: Bun 1.3+ (Node.js is not supported for the Evensong repo). Atomic Chat running on 127.0.0.1:1337 is optional and only needed for local retrieval features (docs).
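A quick way to confirm the optional endpoint is up — the `/v1/models` path is an assumption based on the client being OpenAI-compatible, not a verified route:

```ts
// Ping the optional Atomic Chat endpoint (run with Bun).
// The /v1/models route is assumed from the OpenAI-compat claim.
const res = await fetch('http://127.0.0.1:1337/v1/models')
console.log(res.ok ? 'Atomic Chat reachable ✓' : `Atomic Chat returned HTTP ${res.status}`)
```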
If you only want the Research Vault MCP module, install the package directly:

```bash
npx @syndash/research-vault-mcp --transport=stdio
# or: bunx @syndash/research-vault-mcp --transport=stdio
```

For the full Evensong repo:
```bash
# 1. Install
bun install
# 2. Dev-mode REPL
bun run dev
# 3. Build single-file bundle (~27 MB)
bun run build   # → dist/cli.js
# 4. Run the retrieval test suite
bun test src/services/retrieval src/services/api
# 5. Fire an ad-hoc vault retrieval
bun run scripts/vault-recall.ts "hypergraph memory for conversations"
# 6. Replay the 648-call hybrid vs LLM-only benchmark
bun run scripts/benchmark-hybrid-scale.ts --runs=3 --with-body \
  --queries-file=benchmarks/wave3f-generated-queries-2026-04-19.json
```

## 📁 Directory layout

```
src/                               Reverse-engineered CCR core (CLI, REPL, tools, state)
src/services/
  api/localGemma.ts                Atomic Chat OpenAI-compat client + model registry
  retrieval/                       Hybrid RaR + manifest builder + BM25 + providers
packages/
  research-vault-mcp/              MCP server for the research vault (npx-ready)
scripts/
  benchmark-hybrid-scale.ts        Scale benchmark harness (--runs, --concurrency)
  benchmark-judge.ts               Single-pipeline judge benchmark
  generate-benchmark-queries.ts    Cross-LLM query generator (grok-3)
  vault-recall.ts                  CLI entrypoint for ad-hoc retrieval
  dogfood-wave2b.ts                Model-comparison harness
benchmarks/
  wave3f-generated-queries-*.json  108-query set, committed for reproducibility
  wave3-judge-queries.json         Original 20-query manual set
  runs/                            Raw JSONL + Markdown summaries
docs/                              Design specs, plans, debug notes
tests/                             Regression + integration suites
services/                          8-service microservice suite used inside benchmarks
api/                               HTTP relay / provider fallback chain
```
## 🧩 Retrieval API

Library usage — compose the hybrid pipeline manually for custom flows:

```ts
import { createLocalGemmaClient, ATOMIC_MODELS } from 'src/services/api/localGemma'
import { createAtomicProvider } from 'src/services/retrieval/providers/atomicProvider'
import { createBM25Provider } from 'src/services/retrieval/providers/bm25Provider'
import { createHybridProvider } from 'src/services/retrieval/providers/hybridProvider'
import { createAdaptiveHybridProvider } from 'src/services/retrieval/providers/adaptiveHybridProvider'
import { buildVaultManifest } from 'src/services/retrieval/manifestBuilder'
import { vaultRetrieve } from 'src/services/retrieval/vaultRetrieve'
const manifest = await buildVaultManifest({ vaultRoot: '_vault', withBody: true })
// Always-rerank Hybrid — pays 1 LLM call per query for max accuracy.
const hybrid = createHybridProvider({
stage1: createBM25Provider(),
stage2: createAtomicProvider(
createLocalGemmaClient({ model: ATOMIC_MODELS.DEEPSEEK_V32 })
),
stage1TopK: 50,
})
// Adaptive variant — skips the LLM when BM25 is confidently top-1.
// Trade-off: −4.7pp top-1 for −43% avg latency. See Adaptive tier above.
const adaptive = createAdaptiveHybridProvider({
stage2: createAtomicProvider(
createLocalGemmaClient({ model: ATOMIC_MODELS.DEEPSEEK_V32 })
),
gapRatioThreshold: 1.5, // skip stage 2 when scores[0] / scores[1] >= 1.5
})
const result = await vaultRetrieve(
{ query: 'hypergraph memory for long-term conversations', manifest, topK: 5 },
{ providers: [hybrid] }, // or [adaptive] for the gated variant
)
```

Available providers: `createAtomicProvider`, `createBM25Provider`, `createHybridProvider`, `createAdaptiveHybridProvider`. All implement the `VaultRetrievalProvider` contract — compose or swap freely.
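Anything satisfying the same contract can slot into the `providers` array. A hedged sketch of a custom stage — the field names below are assumptions; check `src/services/retrieval/providers/` for the real `VaultRetrievalProvider` shape:

```ts
// Hypothetical toy provider: naive substring match over manifest titles.
// `name`, `retrieve`, and the manifest/result field names are assumptions,
// not the repo's verified interface.
const substringProvider = {
  name: 'substring-demo',
  async retrieve({ query, manifest, topK }: { query: string; manifest: any; topK: number }) {
    const q = query.toLowerCase()
    return manifest.entries
      .filter((e: any) => String(e.title ?? '').toLowerCase().includes(q))
      .slice(0, topK)
      .map((e: any, i: number) => ({ id: e.id, score: 1 / (i + 1) }))
  },
}

// Drop it in wherever a provider is accepted:
// await vaultRetrieve({ query, manifest, topK: 5 }, { providers: [substringProvider] })
```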
## 📚 Research notes

This project is in dialogue with recent published work on agentic memory systems:
| Work | Reference | How we use it |
|---|---|---|
| EverMemOS | arXiv 2601.02163 (EverMind / Shanda) | We adopt the §3.4 two-stage pattern; stage 2 is simplified to a direct listwise judge (no verifier loop). |
| HyperMem | arXiv 2604.08256 | Three-layer hypergraph memory — cited as related art. |
| MemGPT | arXiv 2310.08560 | LLM-as-OS paging; the benchmark includes a MemGPT query category. |
| MSA | arXiv 2604.08256 | Memory Sparse Attention — not integrated, benchmarked for comparison. |
| Reflexion | arXiv 2303.11366 | Self-reflective agents — the query set includes Reflexion-style tasks. |
| Extended Mind | Clark & Chalmers 1998 | Philosophical grounding for external-memory-as-cognition. |
The full 108-query test corpus with provenance is at benchmarks/wave3f-generated-queries-2026-04-19.json.
## 🤝 Contributing

We welcome PRs — especially around:
- Dense-vector stage 1 providers (BGE-M3 integration, RRF fusion with BM25) 🔴
- Adaptive gating threshold auto-calibration — the 1.5× gap-ratio default is a conservative hand-pick; PRs welcome to sweep thresholds against live query distributions (the gate itself already shipped at `86bb4ee` ✅ — see the sweep sketch after this list)
- Additional model connectors via the `atomicProvider` factory
- New benchmark categories (adversarial queries, multi-intent, negation traps)
- Vault-size scaling experiments (100 / 500 / 1000+ entries)
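A hypothetical threshold-sweep harness, built from the factories shown in the Retrieval API section. The query-record fields (`query`, `expectedId`) and the result shape (`result.entries[0].id`) are assumptions — check the committed JSON and `vaultRetrieve`'s return type for the real names:

```ts
import { createLocalGemmaClient, ATOMIC_MODELS } from 'src/services/api/localGemma'
import { createAtomicProvider } from 'src/services/retrieval/providers/atomicProvider'
import { createAdaptiveHybridProvider } from 'src/services/retrieval/providers/adaptiveHybridProvider'
import { buildVaultManifest } from 'src/services/retrieval/manifestBuilder'
import { vaultRetrieve } from 'src/services/retrieval/vaultRetrieve'

const manifest = await buildVaultManifest({ vaultRoot: '_vault', withBody: true })
const queries = await Bun.file('benchmarks/wave3f-generated-queries-2026-04-19.json').json()

for (const gapRatioThreshold of [1.2, 1.3, 1.5, 1.8, 2.0]) {
  const adaptive = createAdaptiveHybridProvider({
    stage2: createAtomicProvider(createLocalGemmaClient({ model: ATOMIC_MODELS.DEEPSEEK_V32 })),
    gapRatioThreshold,
  })
  let top1 = 0
  for (const q of queries) {
    const result = await vaultRetrieve(
      { query: q.query, manifest, topK: 1 },
      { providers: [adaptive] },
    )
    if (result.entries?.[0]?.id === q.expectedId) top1++ // field names assumed
  }
  console.log(`gapRatioThreshold=${gapRatioThreshold}: top-1 ${((100 * top1) / queries.length).toFixed(1)}%`)
}
```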
Templates are in place for:
- 🐛 Bug reports
- ✨ Feature requests
- 📊 Benchmark reports
- 🔀 Pull request template
- 💬 Discussions — ideas, Q&A, show-and-tell, benchmarks
File an issue before non-trivial PRs to align on shape.
## 📄 License

Dual-licensed. See LICENSING.md for per-directory mapping and compatibility matrix.
| Applies to | License | File |
|---|---|---|
| Source code, tests, benchmarks, scripts, configs, developer docs | Apache License 2.0 | LICENSE-APACHE |
| Research papers, long-form narrative | CC BY-NC-ND 4.0 | LICENSE-CC-BY-NC-ND |
All code is Apache 2.0 and can be freely incorporated into other Apache-compatible open-source projects (including EverMind-AI/EverOS).
## 🙏 Attribution

Created by Fearvox / 0xVox (Hengyuan Zhu).
The CCR runtime is a clean-room reverse-engineered study of Anthropic's Claude Code CLI. All identifying strings, telemetry endpoints, and internal APIs have been stubbed or removed. This repository makes no claims over Anthropic's trademarks or original binary design.
Upstream lineage: derived from the community reverse-engineered baseline at github.com/claude-code-best/claude-code (CCB). CCR continues from that foundation with additional infrastructure, retrieval pipeline, benchmark harness, and packaging work.
The hybrid retrieval architecture, benchmark harness, and all original code in src/services/retrieval/, scripts/benchmark-*.ts, and benchmarks/wave3* are original work, independently inspired by the EverMemOS published design.
If you build an agent memory system on top of this, we'd love to hear about it.
Open an issue · Start a discussion · ⭐ Star the repo · Fork and ship
