Fearvox/Evensong

Evensong

A hackable, verifiable agent-system workbench built from a reverse-engineered Claude Code core.
Research Vault MCP module, public handoff dashboard, and four formal retrieval artifacts (648 + 972 + 72 + 72 calls).

🇺🇸 English · 🇨🇳 中文

Evensong Code: Apache 2.0 Research: CC BY-NC-ND 4.0 Bilingual

Bun TypeScript 972+648+72+72-call benchmark Dialogs with EverOS

📊 Headline benchmark  ·  ⚡ Quick start  ·  🏗 Architecture  ·  💬 Discussions


What this is

Evensong is not just a benchmark and not just an MCP package. It is a public workbench around a runnable, modifiable CCR CLI: the core can be read and changed, Research Vault MCP is shipped as a module, and the retrieval claims are backed by committed evidence files instead of vibes.

This repository exists for four jobs:

Purpose What that means
Study Read the Claude Code source without the closed binary
Modify Swap agent tools, retrieval pipelines, and telemetry without treating the system as a sealed box
Package Ship Research Vault MCP as an Evensong module/dependency instead of treating it as the whole product
Validate Measure Retrieve-and-Rerank (RaR) architectures against the EverMemOS §3.4 direction with our own reproducible numbers

↑ back to top


📊 The headline result

Four formal retrieval artifacts are committed under benchmarks/runs/: Wave 3+F/G cover the original 108-query cross-LLM design; Wave 3+I is the newer 24-query adversarial suite for dense stage-1 + RAR. The Wave 3+I claim is scoped to that hard suite and does not replace the broader 108-query F/G comparisons.

Wave 3+F — 648-call two-pipeline run ✅

Pipeline Top-1 accuracy p50 latency p90 latency Prompt token cost
LLM-only judge 76.9% (249/324) 2056 ms 3595 ms 100% (200 entries)
Hybrid BM25 + LLM rerank 79.3% (257/324) 1509 ms 2725 ms 25% (50 entries)

Raw: benchmarks/runs/wave3d-hybrid-scale-2026-04-19T1220.md. Per-run stddev 0.00–0.44pp.

Wave 3+G — 972-call three-pipeline formal re-run ✅

Pipeline Top-1 p50 p90 Avg latency LLM calls
LLM-only judge 77.8% (252/324) 3861 ms 6404 ms 4139 ms 100% (200 entries)
Hybrid BM25 + LLM rerank 77.5% (251/324) 2919 ms 4669 ms 3248 ms 100% (50 entries)
Adaptive Hybrid 73.1% (237/324) 2519 ms 4376 ms 2365 ms 73% (27% skip)

Raw: benchmarks/runs/wave3g-pipelines-2026-04-19T1652.md. Wall time: 10.5 min. Per-run stddev: llm-only 0.76pp · hybrid 1.15pp · adaptive 0.76pp.

What holds across both runs

Property Wave 3+F (648 calls) Wave 3+G (972 calls) Verdict
Hybrid vs LLM-only latency (p50) −27% −24% ✅ consistent latency edge
Hybrid prompt token cost −75% −75% ✅ identical
Hybrid vs LLM-only top-1 accuracy +2.5pp −0.3pp (parity) ⚠️ run-to-run variance — within 2σ of per-run stddev
Adaptive skip rate — (not run) 26.9% ✅ matches internal prelim (27%)
Adaptive top-1 — (not run) 73.1% ✅ matches internal prelim
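
The p50 latency deltas in the table follow directly from the p50 numbers reported in the Wave 3+F and 3+G tables above; a quick arithmetic check:

```typescript
// Recompute the hybrid-vs-LLM-only p50 latency deltas from the run tables above.
const pctDelta = (hybrid: number, llmOnly: number): number =>
  Math.round(((hybrid - llmOnly) / llmOnly) * 100)

console.log(pctDelta(1509, 2056)) // Wave 3+F: -27
console.log(pctDelta(2919, 3861)) // Wave 3+G: -24
```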

Honest read

  • The hard claim is latency and token cost. Both formal runs agree: BM25 stage 1 narrows the LLM's pool and saves 24-27% p50 latency plus 75% prompt tokens. That is the part to ship.
  • Accuracy should not be oversold. Wave 3+F saw Hybrid at +2.5pp over LLM-only; Wave 3+G's 972-call re-run saw practical parity (−0.3pp). Per-run stddev (0.8-1.2pp) and API-load-window variance cover that movement. Treat Hybrid as accuracy-parity-to-slight-edge, with stable latency and token savings.
  • Adaptive is the product-shaped contribution. It trades −4.7pp top-1 for −43% average latency and makes 27% of queries use zero LLM calls. The knob is explicit, measurable, and useful.

Reproduce with one command (produces Wave 3+G's artifact):

bun run scripts/benchmark-hybrid-scale.ts \
  --runs=3 --with-body \
  --pipelines=llm-only,hybrid,adaptive \
  --queries-file=benchmarks/wave3f-generated-queries-2026-04-19.json

Generator prompt is committed too — reviewers can audit exactly how queries were produced.

Adaptive tier (Wave 3+G, shipped 2026-04-19) ✅

The always-rerank Hybrid pays 1 LLM call per query. For a large fraction of queries, BM25 alone is already confidently correct — paying for the LLM adds latency without changing the top-1. The createAdaptiveHybridProvider adds a gap-ratio gate: if BM25 scores[0] / scores[1] >= 1.5, trust stage 1 and skip the LLM entirely. Else fall through to stage 2.
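
The gate itself reduces to a few lines. A minimal sketch of the gap-ratio check described above — the helper name and score shape here are illustrative, not the actual createAdaptiveHybridProvider internals:

```typescript
// Illustrative sketch of the gap-ratio gate: skip the stage-2 LLM call
// when BM25's top score dominates the runner-up by the threshold ratio.
type Scored = { id: string; score: number }

function shouldSkipStage2(bm25: Scored[], gapRatioThreshold = 1.5): boolean {
  // With fewer than two candidates (or a non-positive runner-up score)
  // there is no meaningful gap, so fall through to the reranker.
  if (bm25.length < 2 || bm25[1].score <= 0) return false
  return bm25[0].score / bm25[1].score >= gapRatioThreshold
}

// Confident stage 1: 12 / 6 = 2.0 >= 1.5 → trust BM25, skip the LLM.
const confident: Scored[] = [{ id: 'a', score: 12 }, { id: 'b', score: 6 }]
// Ambiguous stage 1: 10 / 9 ≈ 1.11 < 1.5 → invoke the stage-2 judge.
const ambiguous: Scored[] = [{ id: 'a', score: 10 }, { id: 'b', score: 9 }]

console.log(shouldSkipStage2(confident)) // true
console.log(shouldSkipStage2(ambiguous)) // false
```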

Formal 972-call numbers (3 runs × 108 queries, benchmarks/runs/wave3g-pipelines-2026-04-19T1652.md):

  • Skip rate: 26.9% (87/324 queries) — stage 2 LLM call avoided when BM25 is confident
  • Top-1 on skipped: 58.6% (51/87) — BM25-alone is right ~59% on its confident picks
  • Top-1 on invoked: 78.5% (186/237) — LLM resolves the ambiguous BM25 cases
  • Overall Adaptive top-1: 73.1% — matches internal preliminary dogfood exactly
  • Latency: avg 2365 ms / p90 4376 ms (vs always-rerank hybrid 3248 / 4669, vs llm-only 4139 / 6404)
  • Per-run stddev: 0.76pp — the most stable of the three pipelines

Trade-off: −4.7pp top-1 accuracy vs llm-only buys −43% avg latency. The gate is a tuning knob: gapRatioThreshold: 1.3 raises skip rate and drops accuracy; 2.0 reverts toward Hybrid parity.

Positioning vs EverOS: this fills the gap between EverOS's published Fast tier (0 LLM calls, 200-600 ms) and Agentic tier (1-3 LLM calls, 2-5 s) — Adaptive Hybrid is 0 or 1 conditional LLM call, with a user-tunable gating knob. Not covered by any published EverOS / EverMemOS / HyperMem design.

See src/services/retrieval/providers/adaptiveHybridProvider.ts and the 7 unit tests in adaptiveHybridProvider.test.ts. Shipped at 86bb4ee. 66/66 retrieval-domain tests pass.

Wave 3+I — Dense RAR hard-suite formal evidence ✅

The Dense RAR path now has a clean, formal retrieval result: 24/24 Top-1 and 24/24 Top-5 on a 24-query adversarial hard suite. The suite uses a 200-entry manifest (18 real vault documents + 182 adversarial junk distractors), BGE-M3 for dense Stage 1, and deepseek-v4-flash as the Stage 2 judge with thinking disabled.

Canonical run: dense-rar-2026-04-24T0854 · mode formal · clean metadata commit 9148853 · Stage-1 TopK 50.

Operator handoff: evensong.zonicdesign.art/handoff · dashboard: /handoff/dashboard.

Pipeline Top-1 Top-5 Valid Errors p50 p90
Dense BGE-M3 only 17/24 (70.8%) 18/24 (75.0%) 24/24 0 526 ms 576 ms
Dense RAR 24/24 (100.0%) 24/24 (100.0%) 24/24 0 1703 ms 1842 ms
Dense Adaptive RAR 24/24 (100.0%) 24/24 (100.0%) 24/24 0 1615 ms 1854 ms

Evidence: summary, metadata, raw JSONL, and the formal ledger.

Boundary: dense-rar-2026-04-24T0801 remains the Stage-1 TopK 20 formal baseline at 23/24; its q113 miss was candidate recall, not reranker failure. dense-rar-2026-04-24T0644 remains internal/probe-only evidence. Stage-1 TopK 50 fixes that blind spot, but it also increases rerank candidate exposure, so the 24/24 claim is limited to this verified hard suite.

↑ back to top


🏗 Architecture

Evensong architecture tree showing CCR runtime, Research Vault MCP, Hybrid Retrieval, Atomic Gateway, and public evidence outputs

The diagram shows the public release shape, not every internal mechanism.

Layer Owns
Core CCR runtime: Bun entrypoint, CLI loop, and Ink REPL
Modules Research Vault MCP, hybrid retrieval, and the Atomic Chat-compatible judge path
Evidence Vault fixtures, formal benchmark artifacts, and the public handoff/dashboard surface

Diagram source lives in docs/assets/evensong-architecture.svg and was exported as PNG with fireworks-tech-graph style 6.

See AGENTS.md and CLAUDE.md for detailed developer notes.

↑ back to top


⚡ Quick start

Prerequisites: Bun 1.3+ (Node.js not supported for the Evensong repo). Atomic Chat running on 127.0.0.1:1337 is optional and only needed for local retrieval features (docs).

If you only want the Research Vault MCP module, install the package directly:

npx @syndash/research-vault-mcp --transport=stdio
# or: bunx @syndash/research-vault-mcp --transport=stdio

For the full Evensong repo:

# 1. Install
bun install

# 2. Dev mode REPL
bun run dev

# 3. Build single-file bundle (~27MB)
bun run build      # → dist/cli.js

# 4. Run the retrieval test suite
bun test src/services/retrieval src/services/api

# 5. Fire an ad-hoc vault retrieval
bun run scripts/vault-recall.ts "hypergraph memory for conversations"

# 6. Replay the 648-trial hybrid vs LLM-only benchmark
bun run scripts/benchmark-hybrid-scale.ts --runs=3 --with-body \
  --queries-file=benchmarks/wave3f-generated-queries-2026-04-19.json

↑ back to top


📁 Directory layout

src/                 Reverse-engineered CCR core (CLI, REPL, tools, state)
src/services/
  api/localGemma.ts           Atomic Chat OpenAI-compat client + model registry
  retrieval/                  Hybrid RaR + manifest builder + BM25 + providers
packages/
  research-vault-mcp/         MCP server for the research vault (npx-ready)
scripts/
  benchmark-hybrid-scale.ts       Scale benchmark harness (--runs, --concurrency)
  benchmark-judge.ts              Single-pipeline judge benchmark
  generate-benchmark-queries.ts   Cross-LLM query generator (grok-3)
  vault-recall.ts                 CLI entrypoint for ad-hoc retrieval
  dogfood-wave2b.ts               Model-comparison harness
benchmarks/
  wave3f-generated-queries-*.json  108-query set, committed for reproducibility
  wave3-judge-queries.json         Original 20-query manual set
  runs/                            Raw JSONL + Markdown summaries
docs/                 Design specs, plans, debug notes
tests/                Regression + integration suites
services/             8-service microservice suite used inside benchmarks
api/                  HTTP relay / provider fallback chain

↑ back to top


🧩 Retrieval API

Library usage — compose the hybrid pipeline manually for custom flows:

import { createLocalGemmaClient, ATOMIC_MODELS } from 'src/services/api/localGemma'
import { createAtomicProvider } from 'src/services/retrieval/providers/atomicProvider'
import { createBM25Provider } from 'src/services/retrieval/providers/bm25Provider'
import { createHybridProvider } from 'src/services/retrieval/providers/hybridProvider'
import { createAdaptiveHybridProvider } from 'src/services/retrieval/providers/adaptiveHybridProvider'
import { buildVaultManifest } from 'src/services/retrieval/manifestBuilder'
import { vaultRetrieve } from 'src/services/retrieval/vaultRetrieve'

const manifest = await buildVaultManifest({ vaultRoot: '_vault', withBody: true })

// Always-rerank Hybrid — pays 1 LLM call per query for max accuracy.
const hybrid = createHybridProvider({
  stage1: createBM25Provider(),
  stage2: createAtomicProvider(
    createLocalGemmaClient({ model: ATOMIC_MODELS.DEEPSEEK_V32 })
  ),
  stage1TopK: 50,
})

// Adaptive variant — skips the LLM when BM25 is confidently top-1.
// Trade-off: −4.7pp top-1 for −43% avg latency. See Adaptive tier above.
const adaptive = createAdaptiveHybridProvider({
  stage2: createAtomicProvider(
    createLocalGemmaClient({ model: ATOMIC_MODELS.DEEPSEEK_V32 })
  ),
  gapRatioThreshold: 1.5,  // skip stage 2 when scores[0] / scores[1] >= 1.5
})

const result = await vaultRetrieve(
  { query: 'hypergraph memory for long-term conversations', manifest, topK: 5 },
  { providers: [hybrid] },  // or [adaptive] for the gated variant
)

Available providers: createAtomicProvider, createBM25Provider, createHybridProvider, createAdaptiveHybridProvider. All implement the VaultRetrievalProvider contract — compose or swap freely.
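
Because every provider satisfies the same contract, writing a custom stage-1 provider is straightforward. A minimal sketch — the interface and field names below are assumptions for illustration; the real VaultRetrievalProvider contract lives in src/services/retrieval:

```typescript
// Hypothetical shapes standing in for the real retrieval contract.
interface ManifestEntry { id: string; title: string; body?: string }
interface RetrievalResult { id: string; score: number }
interface VaultRetrievalProvider {
  retrieve(query: string, entries: ManifestEntry[], topK: number): Promise<RetrievalResult[]>
}

// Trivial keyword-overlap provider: scores entries by how many query
// terms appear in the title. Useful as a cheap stage-1 baseline in tests.
const keywordProvider: VaultRetrievalProvider = {
  async retrieve(query, entries, topK) {
    const terms = query.toLowerCase().split(/\s+/)
    return entries
      .map(e => ({
        id: e.id,
        score: terms.filter(t => e.title.toLowerCase().includes(t)).length,
      }))
      .sort((a, b) => b.score - a.score)
      .slice(0, topK)
  },
}
```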

↑ back to top


📚 Research notes

This project dialogs with recent published work on agentic memory systems:

Work Reference How we use it
EverMemOS arxiv 2601.02163 (EverMind / Shanda) We adopt the §3.4 two-stage pattern, simplifying stage 2 to a direct listwise judge (no verifier loop).
HyperMem arxiv 2604.08256 Three-layer hypergraph memory — cited as related art.
MemGPT arxiv 2310.08560 LLM-as-OS paging; benchmark includes MemGPT query category.
MSA arxiv 2604.08256 Memory Sparse Attention — not integrated, benchmarked for comparison.
Reflexion arxiv 2303.11366 Self-reflective agents — query set includes Reflexion-style tasks.
Extended Mind Clark & Chalmers 1998 Philosophical grounding for external-memory-as-cognition.

The full 108-query test corpus with provenance is at benchmarks/wave3f-generated-queries-2026-04-19.json.

↑ back to top


🤝 Contributing

We welcome PRs — especially around:

  • Dense-vector stage 1 providers (BGE-M3 integration, RRF fusion with BM25) 🔴
  • Adaptive gating threshold auto-calibration — the 1.5× gap-ratio default is a conservative hand-pick; PRs welcome to sweep thresholds against live query distributions (gate itself already shipped at 86bb4ee ✅)
  • Additional model connectors via the atomicProvider factory
  • New benchmark categories (adversarial queries, multi-intent, negation traps)
  • Vault-size scaling experiments (100 / 500 / 1000+ entries)
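
For contributors eyeing the RRF-fusion item: Reciprocal Rank Fusion combines rankings by summing 1/(k + rank) per document, with k = 60 as the conventional constant. A self-contained sketch (function name and shapes are illustrative, not existing repo code):

```typescript
// Reciprocal Rank Fusion: each ranking contributes 1 / (k + rank) per doc,
// so documents that place well across multiple rankers rise to the top.
function rrfFuse(rankings: string[][], k = 60): { id: string; score: number }[] {
  const scores = new Map<string, number>()
  for (const ranking of rankings) {
    ranking.forEach((id, rank) => {
      // rank is 0-based here, so 1-based rank = rank + 1.
      scores.set(id, (scores.get(id) ?? 0) + 1 / (k + rank + 1))
    })
  }
  return [...scores.entries()]
    .map(([id, score]) => ({ id, score }))
    .sort((a, b) => b.score - a.score)
}

// 'b' is only ranked 2nd by both the BM25 and dense lists, yet its
// cross-ranker consensus beats 'a' and 'c', which each top one list.
const fused = rrfFuse([['a', 'b'], ['c', 'b']])
console.log(fused[0].id) // 'b'
```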

Issue and PR templates are in place.

File an issue before non-trivial PRs to align on shape.

↑ back to top


📄 License

Dual-licensed. See LICENSING.md for per-directory mapping and compatibility matrix.

Applies to License File
Source code, tests, benchmarks, scripts, configs, developer docs Apache License 2.0 LICENSE-APACHE
Research papers, long-form narrative CC BY-NC-ND 4.0 LICENSE-CC-BY-NC-ND

All code is Apache 2.0 and can be freely incorporated into other Apache-compatible open-source projects (including EverMind-AI/EverOS).

↑ back to top


🙏 Attribution

Created by Fearvox / 0xVox (Hengyuan Zhu).

The CCR runtime is a clean-room reverse-engineered study of Anthropic's Claude Code CLI. All identifying strings, telemetry endpoints, and internal APIs have been stubbed or removed. This repository makes no claims over Anthropic's trademarks or original binary design.

Upstream lineage: derived from the community reverse-engineered baseline at github.com/claude-code-best/claude-code (CCB). CCR continues from that foundation with additional infrastructure, retrieval pipeline, benchmark harness, and packaging work.

The hybrid retrieval architecture, benchmark harness, and all original code in src/services/retrieval/, scripts/benchmark-*.ts, and benchmarks/wave3* are original work, independently inspired by the EverMemOS published design.

↑ back to top


If you build an agent memory system on top of this, we'd love to hear about it.

Open an issue  ·  Start a discussion  ·  ⭐ Star the repo  ·  Fork and ship
