Public, reproducible benchmarks for Agent Brain. Companion code to the paper Agent Brain: A Biologically Inspired Memory System for Autonomous AI Agents — LongMemEval-M Evaluation (Sritharan, 2026, v3).
500 QA pairs across 510 multi-turn workspaces, GPT-4o judge.
| Configuration | Accuracy | Reproducible here |
|---|---|---|
| Agent Brain — Test 0 (no consolidation) | 71.7% | Yes |
| Agent Brain — Test 1 (with Dream Cycle) | 69.8% | Yes |
| Baseline pgvector (our control) | 72.2% – 73.9% | Yes |
We transparently report a 1.9 pp regression when the Dream Cycle is enabled, and a 2.2 pp gap between Test 0 and the best run of our own pgvector-only control. See §15.4 of the paper for discussion.
Aggregated per-run numbers: `results/SUMMARY.md`.
**Important caveat on cross-system comparisons (April 2026).**
To the best of our knowledge we are the first to publish numbers specifically on the `weaviate/longmemeval-m-cleaned` variant, so no strict apples-to-apples peer comparison exists yet. Published peer numbers from Zep, Mem0, LangMem, and OpenAI Memory that we cited in earlier preprint versions refer to the LongMemEval-S (Small) variant or to third-party evaluations, not to LongMemEval-M. They are therefore not directly comparable to our 71.7% figure. In particular:

- The 63.8% number we cited for Zep in preprint v1/v2 is the baseline row of Rasmussen et al. 2025 Table 2 (full-context gpt-4o-mini), not Zep itself. Zep's own reported score on LongMemEval-S with gpt-4o-mini is 71.2%.
- The 49% number we cited for Mem0 originated in third-party evaluations from 2024, when Mem0 had not yet published official LongMemEval numbers (their own benchmark focus at the time was LoCoMo). Mem0 v2 (released ~17 April 2026) reports ~92% on LongMemEval; our paper v2 (21 April 2026) did not yet reflect this, and it is corrected in v3.

The cleanest way to compare systems on this variant is to re-evaluate each under identical judging conditions on `m-cleaned`. We welcome PRs adding such runs to this repo.
- Python 3.10+
- A Brain instance (self-hosted or the hosted evaluation endpoint)
- OpenAI API key (for GPT-4o answer generation and judge)
- ~8 h wall clock for a full 500-tenant run
- ~USD 22 in API cost (USD 18 judge + USD 4 embeddings; the embedding cost drops to USD 0 with self-hosted MiniLM)
```bash
git clone https://github.com/AgentBrainHQ/agentbrain-benchmarks
cd agentbrain-benchmarks
python -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt
cp .env.example .env
# edit .env with your Brain DB + OpenAI keys
set -a; source .env; set +a
```

```bash
# Faster: parquet-based loader (single file, ~1 min)
python download_parquet.py

# Or: HuggingFace rows API (resumable, slower)
python download_data.py
```

Both variants write `data/docs.jsonl` (237,655 sessions) and `data/queries.jsonl` (500 QA pairs).
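If you're curious what the parquet path does under the hood, the gist is roughly the following. This is a sketch that assumes the dataset loads via the `datasets` library; the split name and row schema are assumptions, and `download_parquet.py` is the authoritative version.

```python
# Hedged sketch of the parquet loader: pull the dataset and
# dump each row as one JSON line. Split name ("train") and the
# pass-through row schema are assumptions, not the repo's exact logic.
import json
from datasets import load_dataset

ds = load_dataset("weaviate/longmemeval-m-cleaned", split="train")
with open("data/docs.jsonl", "w", encoding="utf-8") as f:
    for row in ds:
        f.write(json.dumps(row, ensure_ascii=False) + "\n")
```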
```bash
# One-shot orchestrator: ingest + query both systems + judge
python run_full.py

# Smoke test with 10 tenants
python run_full.py --limit 10

# Or step-by-step (async variants)
python ingest.py     # Ingest via Brain API
python query.py      # Brain recall + GPT-4o answer
python baseline.py   # pgvector-only control
python evaluate.py   # GPT-4o judge

# Or via shell wrapper
./run_benchmark.sh           # full run
./run_benchmark.sh --small   # 100-doc smoke run
```

Results land in `results/`.
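For orientation, the ingest step boils down to POSTing each turn-level message to the Brain `/memory/store` endpoint. The sketch below assumes an `httpx`-based async client; `BRAIN_URL`, `BRAIN_API_KEY`, the bearer-token auth, and the payload/JSONL field names are illustrative assumptions, not the actual schema in `ingest.py`.

```python
# Hedged sketch of the ingest step: one POST per turn-level message.
import asyncio
import json
import os

import httpx

BRAIN_URL = os.environ["BRAIN_URL"]          # assumed env var
BRAIN_API_KEY = os.environ["BRAIN_API_KEY"]  # assumed auth scheme

async def store_message(client: httpx.AsyncClient,
                        workspace_id: str, content: str) -> None:
    # Write one message to the Brain memory store
    resp = await client.post(
        f"{BRAIN_URL}/memory/store",
        headers={"Authorization": f"Bearer {BRAIN_API_KEY}"},
        json={"workspace_id": workspace_id, "content": content},
    )
    resp.raise_for_status()

async def main() -> None:
    async with httpx.AsyncClient(timeout=30.0) as client:
        with open("data/docs.jsonl", encoding="utf-8") as f:
            for line in f:
                doc = json.loads(line)
                # "workspace_id" / "messages" are assumed field names
                for msg in doc.get("messages", []):
                    await store_message(client, doc["workspace_id"],
                                        msg["content"])

if __name__ == "__main__":
    asyncio.run(main())
```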
```
agentbrain-benchmarks/
├── config.py                # env-based config (no secrets committed)
├── .env.example             # template for your environment
│
├── download_data.py         # HuggingFace Rows API loader (resumable)
├── download_parquet.py      # HuggingFace Parquet loader (faster)
│
├── ingest.py                # async ingest via Brain API /memory/store
├── query.py                 # async Brain recall + GPT-4o answer
├── baseline.py              # async pgvector-only control (RPC match_memories)
├── evaluate.py              # async GPT-4o judge (rubric-graded)
├── run_full.py              # one-shot sync orchestrator (ingest+query+eval)
├── run_benchmark.sh         # shell wrapper for the modular flow
│
├── prompts/                 # exact prompts used in the paper
│   ├── answer_prompt.md
│   └── judge_prompt.md
│
├── docs/
│   ├── METHODOLOGY.md       # ingestion, retrieval, judging details
│   ├── REPRODUCIBILITY.md   # step-by-step reproduction guide
│   └── LIMITATIONS.md       # what LongMemEval does and does not measure
│
├── results/
│   ├── SUMMARY.md           # aggregated verdict table
│   └── eval_report_aggregated.json  # per-run counts (no raw LongMemEval content)
│
├── CITATION.cff             # cite this work
├── LICENSE                  # MIT
├── requirements.txt         # pinned versions
└── README.md                # you are here
```
- Dataset: `weaviate/longmemeval-m-cleaned` (500 queries × 510 workspaces, ~115k tokens each).
- Ingestion: one user × one session per workspace. Turn-level messages are written to the `memories` table via the Brain `/memory/store` endpoint.
- Embedding: `sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2` (384 dimensions, normalized).
- Retrieval: hybrid vector (0.7) + PostgreSQL tsvector (0.3) via Reciprocal Rank Fusion (k = 60), top-5 returned; see the fusion sketch after this list. For the baseline: pure pgvector cosine top-10 via Supabase RPC.
- Answer generation: GPT-4o with the top-5 memories as context, single forward pass, no chain-of-thought, temperature 0.
- Judge: GPT-4o with the rubric in `prompts/judge_prompt.md`. Verdicts: `CORRECT`, `PARTIAL`, `WRONG`, `ABSTAIN_CORRECT`. Accuracy = (`CORRECT` + `ABSTAIN_CORRECT`) / total (computed as in the snippet further below).
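To make the fusion step concrete, here is a minimal sketch of weighted Reciprocal Rank Fusion with the weights and k from the retrieval bullet; the function name and input shape (two ranked lists of memory IDs, best first) are illustrative, not this repo's actual retrieval code.

```python
# Minimal sketch of weighted Reciprocal Rank Fusion.
# Weights (0.7 vector / 0.3 tsvector) and k = 60 come from the bullet above;
# the function signature and ID-list inputs are assumptions.
from collections import defaultdict

def rrf_fuse(vector_hits: list[str], tsvector_hits: list[str],
             k: int = 60, w_vec: float = 0.7, w_ts: float = 0.3,
             top_n: int = 5) -> list[str]:
    """Fuse two ranked ID lists (best first) into one top-n ranking."""
    scores: dict[str, float] = defaultdict(float)
    for rank, mem_id in enumerate(vector_hits, start=1):
        scores[mem_id] += w_vec / (k + rank)
    for rank, mem_id in enumerate(tsvector_hits, start=1):
        scores[mem_id] += w_ts / (k + rank)
    # Highest fused score first, truncated to the top-n returned to the LLM
    return sorted(scores, key=scores.get, reverse=True)[:top_n]
```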
Full details: `docs/METHODOLOGY.md`.
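And the accuracy formula from the judge bullet as a tiny reference implementation; the flat list of verdict strings is an assumed input shape, not necessarily how `evaluate.py` stores them.

```python
# Accuracy as defined above: (CORRECT + ABSTAIN_CORRECT) / total.
def accuracy(verdicts: list[str]) -> float:
    hits = sum(v in ("CORRECT", "ABSTAIN_CORRECT") for v in verdicts)
    return hits / len(verdicts)

# accuracy(["CORRECT", "WRONG", "PARTIAL", "ABSTAIN_CORRECT"]) == 0.5
```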
A clean Brain deployment plus the Quickstart steps above reproduces the
paper's Test 1 (69.8%) run end-to-end in ~8 hours for ~USD 22. Test 0 requires
disabling the Dream Cycle between ingest and query (see
`docs/REPRODUCIBILITY.md`).
If you can't reproduce a number within ±1 pp, please open an issue with your
`results/eval_report.json`. We take reproducibility seriously.
LongMemEval measures quiz-style factual recall from long conversational context.
It does not measure cross-session continuity, relational reasoning across
workspaces, temporal reasoning, or creative-connection synthesis, which are the
workloads Agent Brain is primarily designed for. See `docs/LIMITATIONS.md`
and §15.5 of the paper.
```bibtex
@techreport{sritharan2026agentbrain,
  title       = {Agent Brain: A Biologically Inspired Memory System for Autonomous
                 AI Agents --- LongMemEval-M Evaluation},
  author      = {Sritharan, Theshoth},
  year        = {2026},
  month       = {4},
  address     = {Sachseln OW, Switzerland},
  institution = {Valtis},
  doi         = {10.5281/zenodo.19673132},
  url         = {https://doi.org/10.5281/zenodo.19673132},
  note        = {Version 3 (current); Concept DOI resolves to the latest version.}
}
```

The DOI above is the Concept DOI, which always resolves to the latest version. To cite a specific version: v3 is 10.5281/zenodo.19673331; v2 (superseded) is 10.5281/zenodo.19673133.
Or use the `CITATION.cff` file directly (GitHub renders a
"Cite this repository" button).
MIT; see `LICENSE`. The LongMemEval dataset itself is distributed
under its own license (see `weaviate/longmemeval-m-cleaned`).
This repository contains only evaluation code and aggregated scores; no raw
LongMemEval content is committed.
Paper and benchmark: Theshoth Sritharan · t.sritharan@valtis.ch · ORCID 0009-0006-4400-3352
Issues, bug reports, reproduction problems: please open a GitHub issue.