
Agent Brain Benchmarks

Public, reproducible benchmarks for Agent Brain. Companion code to the paper Agent Brain: A Biologically Inspired Memory System for Autonomous AI Agents — LongMemEval-M Evaluation (Sritharan, 2026, v3).



Results: LongMemEval-m-cleaned

500 QA pairs across 510 multi-turn workspaces, GPT-4o judge.

| Configuration | Accuracy | Reproducible here |
| --- | --- | --- |
| Agent Brain — Test 0 (no consolidation) | 71.7% | Yes |
| Agent Brain — Test 1 (with Dream Cycle) | 69.8% | Yes |
| Baseline pgvector (our control) | 72.2% – 73.9% | Yes |

We transparently report a 1.9 pp regression when the Dream Cycle is enabled and a 2.2 pp gap versus our own pgvector-only control. See §15.4 of the paper for discussion.

Aggregated per-run numbers: results/SUMMARY.md.

Important caveat on cross-system comparisons (April 2026).

To the best of our knowledge we are the first to publish numbers specifically on the weaviate/longmemeval-m-cleaned variant, so no strict apples-to-apples peer comparison exists yet.

Published peer numbers from Zep, Mem0, LangMem, and OpenAI Memory that we cited in earlier preprint versions refer to the LongMemEval-S (Small) variant or to third-party evaluations, not to LongMemEval-M. They are therefore not directly comparable to our 71.7% figure. In particular:

  • The 63.8% number we cited for Zep in preprint v1/v2 is the baseline row of Rasmussen et al. 2025 Table 2 (full-context gpt-4o-mini), not Zep itself. Zep's own reported score on LongMemEval-S with gpt-4o-mini is 71.2%.
  • The 49% number we cited for Mem0 originated in third-party evaluations from 2024 when Mem0 had not yet published official LongMemEval numbers (their own benchmark focus at the time was LoCoMo). Mem0 v2 (released ~17 April 2026) reports ~92% on LongMemEval; our paper v2 (21 April 2026) did not yet reflect this — corrected in v3.

The cleanest way to compare systems on this variant is to re-evaluate each under identical judging conditions on m-cleaned. We welcome PRs in this repo adding such runs.


Quickstart

Requirements

  • Python 3.10+
  • A Brain instance (self-hosted or the hosted evaluation endpoint)
  • OpenAI API key (for GPT-4o answer generation and judge)
  • ~8 h of wall-clock time for a full 500-tenant run
  • ~USD 22 in API cost (USD 18 judge + USD 4 embeddings; the embedding cost drops to USD 0 with self-hosted MiniLM)

Setup

git clone https://github.com/AgentBrainHQ/agentbrain-benchmarks
cd agentbrain-benchmarks

python -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt

cp .env.example .env
# edit .env with your Brain DB + OpenAI keys
set -a; source .env; set +a

Download the dataset

# Faster: parquet-based loader (single file, ~1 min)
python download_parquet.py

# Or: HuggingFace rows API (resumable, slower)
python download_data.py

Both loaders write data/docs.jsonl (237,655 sessions) and data/queries.jsonl (500 QA pairs).
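
To sanity-check the download before committing to a long run, a quick line count against the sizes above (paths and expected counts as listed) is enough:

```python
# Optional sanity check: count lines in the downloaded JSONL files.
from pathlib import Path

expected = {"data/docs.jsonl": 237_655, "data/queries.jsonl": 500}

for path, want in expected.items():
    with Path(path).open(encoding="utf-8") as f:
        got = sum(1 for _ in f)
    status = "OK" if got == want else f"expected {want}"
    print(f"{path}: {got} lines ({status})")
```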

Run the full benchmark

# One-shot orchestrator — ingest + query both systems + judge
python run_full.py

# Smoke test with 10 tenants
python run_full.py --limit 10

# Or step-by-step (async variants)
python ingest.py          # Ingest via Brain API
python query.py           # Brain recall + GPT-4o answer
python baseline.py        # pgvector-only control
python evaluate.py        # GPT-4o judge

# Or via shell wrapper
./run_benchmark.sh        # full run
./run_benchmark.sh --small   # 100-doc smoke run

Results land in results/.
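
Accuracy is computed from the judge's verdict counts (see Methodology below). A minimal scoring sketch, assuming illustrative JSON key names rather than the actual schema of eval_report_aggregated.json:

```python
# Illustrative scoring sketch. The key names below are assumptions;
# check results/eval_report_aggregated.json for the real schema.
import json

with open("results/eval_report_aggregated.json", encoding="utf-8") as f:
    report = json.load(f)

counts = report["verdict_counts"]  # hypothetical key holding per-verdict counts
total = sum(counts.values())

# Accuracy = (correct + abstain) / total, per the judging rubric below.
accuracy = (counts.get("CORRECT", 0) + counts.get("ABSTAIN_CORRECT", 0)) / total
print(f"accuracy = {accuracy:.1%} over {total} questions")
```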


Repository Layout

agentbrain-benchmarks/
├── config.py              # env-based config (no secrets committed)
├── .env.example           # template for your environment
│
├── download_data.py       # HuggingFace Rows API loader (resumable)
├── download_parquet.py    # HuggingFace Parquet loader (faster)
│
├── ingest.py              # async ingest via Brain API /memory/store
├── query.py               # async Brain recall + GPT-4o answer
├── baseline.py            # async pgvector-only control (RPC match_memories)
├── evaluate.py            # async GPT-4o judge (rubric-graded)
├── run_full.py            # one-shot sync orchestrator (ingest+query+eval)
├── run_benchmark.sh       # shell wrapper for the modular flow
│
├── prompts/               # exact prompts used in the paper
│   ├── answer_prompt.md
│   └── judge_prompt.md
│
├── docs/
│   ├── METHODOLOGY.md     # ingestion, retrieval, judging — details
│   ├── REPRODUCIBILITY.md # step-by-step reproduction guide
│   └── LIMITATIONS.md     # what LongMemEval does and does not measure
│
├── results/
│   ├── SUMMARY.md                      # aggregated verdict table
│   └── eval_report_aggregated.json     # per-run counts (no raw LongMemEval content)
│
├── CITATION.cff           # cite this work
├── LICENSE                # MIT
├── requirements.txt       # pinned versions
└── README.md              # you are here

Methodology (at a glance)

  • Dataset: weaviate/longmemeval-m-cleaned (500 queries across 510 workspaces, ~115k tokens each).
  • Ingestion: one user × one session per workspace. Turn-level messages written to memories table via the Brain /memory/store endpoint.
  • Embedding: sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2 (384 dimensions, normalized).
  • Retrieval: hybrid vector (0.7) + PostgreSQL tsvector (0.3) via Reciprocal Rank Fusion (k = 60), top-5 returned (see the fusion sketch after this list). For the baseline: pure pgvector cosine top-10 via Supabase RPC.
  • Answer generation: GPT-4o with top-5 memories as context, single forward pass, no chain-of-thought, temperature 0.
  • Judge: GPT-4o with the rubric in prompts/judge_prompt.md. Verdicts: CORRECT, PARTIAL, WRONG, ABSTAIN_CORRECT. Accuracy = (correct + abstain) / total.
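
For intuition, the hybrid fusion step can be sketched in a few lines. This is an illustrative reimplementation of weighted RRF with the parameters above, not the code in query.py:

```python
# Weighted Reciprocal Rank Fusion (illustrative, not query.py itself).
from collections import defaultdict

def rrf_fuse(vector_ranked, keyword_ranked, k=60, w_vector=0.7, w_keyword=0.3, top_n=5):
    """Fuse two ranked lists of memory ids (best first) into one top_n list."""
    scores = defaultdict(float)
    for weight, ranked in ((w_vector, vector_ranked), (w_keyword, keyword_ranked)):
        for rank, doc_id in enumerate(ranked, start=1):
            scores[doc_id] += weight / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)[:top_n]

# Toy example: ids found by both retrievers float to the top of the fused list.
print(rrf_fuse(["m1", "m7", "m3", "m9"], ["m7", "m2", "m1", "m5"]))
# -> ['m1', 'm7', 'm3', 'm9', 'm2']
```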

Full details: docs/METHODOLOGY.md.
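
Similarly, the pgvector baseline reduces to normalized MiniLM embeddings plus cosine top-10. A minimal local sketch (the real control runs this inside Postgres via the match_memories RPC in baseline.py):

```python
# Gist of the pgvector-only control: normalized MiniLM embeddings + cosine top-k.
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2")

def cosine_top_k(query: str, memories: list[str], k: int = 10) -> list[str]:
    # With normalize_embeddings=True, the dot product equals cosine similarity.
    mem_vecs = model.encode(memories, normalize_embeddings=True)
    query_vec = model.encode([query], normalize_embeddings=True)[0]
    order = np.argsort(mem_vecs @ query_vec)[::-1][:k]
    return [memories[i] for i in order]
```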


Reproducing our numbers

A clean Brain deployment plus the Quickstart steps above reproduces the paper's Test 1 run (69.8%) end-to-end in ~8 hours for ~USD 22. Test 0 requires disabling the Dream Cycle between ingest and query (see docs/REPRODUCIBILITY.md).

If you can't reproduce a number within ±1 pp, please open an issue with your results/eval_report.json. We take reproducibility seriously.


What this benchmark does not measure

LongMemEval measures quiz-style factual recall from long conversational context. It does not measure cross-session continuity, relational reasoning across workspaces, temporal reasoning, or creative-connection synthesis — workloads Agent Brain is primarily designed for. See docs/LIMITATIONS.md and §15.5 of the paper.


Citation

@techreport{sritharan2026agentbrain,
  title   = {Agent Brain: A Biologically Inspired Memory System for Autonomous
             AI Agents --- LongMemEval-M Evaluation},
  author  = {Sritharan, Theshoth},
  year    = {2026},
  month   = {4},
  address = {Sachseln OW, Switzerland},
  institution = {Valtis},
  doi     = {10.5281/zenodo.19673132},
  url     = {https://doi.org/10.5281/zenodo.19673132},
  note    = {Version 3 (current); Concept DOI resolves to the latest version.}
}

The DOI above is the Concept DOI which always resolves to the latest version. To cite a specific version: v3 is 10.5281/zenodo.19673331; v2 (superseded) is 10.5281/zenodo.19673133.

Or use the CITATION.cff file directly (GitHub renders a "Cite this repository" button).


License

MIT — see LICENSE. The LongMemEval dataset itself is distributed under its own license (see weaviate/longmemeval-m-cleaned). This repository contains only evaluation code and aggregated scores; no raw LongMemEval content is committed.


Contact

Paper and benchmark: Theshoth Sritharan · t.sritharan@valtis.ch · ORCID 0009-0006-4400-3352

Issues, bug reports, reproduction problems: please open a GitHub issue.
