
MedMemoryBench: Benchmarking Agent Memory in Personalized Healthcare

Figure: MedMemoryBench overview — four key challenges in medical memory evaluation

Tackling the challenges of agent memory evaluation in healthcare scenarios.

🤗 HuggingFace Dataset | 📄 arXiv Preprint | 🌐 Chinese (中文)



MedMemoryBench is a benchmark framework for evaluating agent memory methods, with a focus on assessing memory capabilities in medical dialogue scenarios. The framework provides unified evaluation interfaces, implementations of multiple baseline methods, and a flexible configuration management system, and it also supports importing and evaluating other datasets.

📰 News

  • [2026.05] MedMemoryBench v1.0 is officially released — dataset, evaluation framework, and 16 memory method baselines.
  • [2026.05] Dataset available on HuggingFace.

✨ Features

Comprehensive Medical Dataset

  • 20 longitudinal patient personas with background, life events, and trap events
  • ~2,020 multi-session doctor–patient dialogue sessions
  • ~1,986 evaluation queries across 6 clinically motivated types
  • Bilingual support: Chinese (~598 MB) + English (~443 MB)

Rich Baseline Coverage

  • 3 classic baselines: Long Context, Embedding RAG, BM25 RAG
  • 7 agentic memory systems: Mem0, Letta, MemOS, A-MEM, MIRIX, MemRL, LightMem
  • 4 graph-based systems: GraphRAG, HippoRAG-v2, ReMem, Zep

Unified Evaluation Framework

  • Plug-and-play method integration via BaseAgent (see the sketch after this list)
  • Multi-metric evaluation: string match + LLM-as-a-Judge
  • Checkpoint & resume for long-running experiments
  • Dry-run mode for fast pipeline validation
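
New methods plug in by subclassing BaseAgent (methods/base.py). The exact abstract interface is defined there; the following is only a rough sketch with hypothetical hook and attribute names:

# Hypothetical adapter sketch — the real abstract interface lives in
# methods/base.py; method and attribute names here are illustrative only.
from methods.base import BaseAgent

class MyMemoryAgent(BaseAgent):
    def memorize(self, session):
        # Ingest one dialogue session into this method's memory store.
        self.store.add(session)

    def answer(self, query):
        # Retrieve relevant memories, then answer the evaluation query.
        context = self.store.search(query, top_k=5)
        return self.llm.generate(query, context=context)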

Flexible Configuration

  • YAML-driven method & dataset configs
  • Multi-provider LLM support (OpenAI / BigModel / Azure)
  • Local & remote embedding models
  • Cross-benchmark evaluation (MedMemoryBench + LoCoMo)

📁 Project Structure

MedMemoryBench/
├── main.py                       # Evaluation entry point
├── requirements.txt              # Python dependencies
├── LICENSE                       # Apache License 2.0
├── LEGAL.md                      # Comment-language legal notice
├── .env.example                  # Environment variable template
│
├── configs/                      # Configuration files
│   ├── method_config/            # Per-method YAML configs (gpt-5.1 / qwen3 variants)
│   │   ├── long_context_gpt-5.1.yaml
│   │   ├── embedding_rag_gpt-5.1.yaml
│   │   ├── bm25_rag_gpt-5.1.yaml
│   │   ├── graph_rag_gpt-5.1.yaml
│   │   ├── mem0_gpt-5.1.yaml
│   │   ├── memos_gpt-5.1.yaml
│   │   ├── memrl_gpt-5.1.yaml
│   │   ├── amem_gpt-5.1.yaml
│   │   ├── hipporag_gpt-5.1.yaml
│   │   ├── lightmem_gpt-5.1.yaml
│   │   ├── letta_gpt-5.1.yaml
│   │   ├── mirix_gpt-5.1.yaml
│   │   ├── remem_gpt-5.1.yaml
│   │   ├── zep_gpt-5.1-chat.yaml
│   │   └── ...                   # + qwen3 variants
│   └── dataset_config/
│       ├── medmemorybench.yaml
│       └── locomo.yaml
│
├── methods/                      # Memory method implementations
│   ├── base.py                   # BaseAgent abstract class
│   ├── long_context.py           # Long-context baseline
│   ├── embedding_rag.py          # Dense embedding RAG
│   ├── bm25_rag.py               # BM25 sparse RAG
│   ├── graph_rag.py              # Graph-based RAG
│   ├── self_rag.py               # Self-RAG
│   ├── mem0_agent.py             # Mem0 adapter
│   ├── memos_agent.py            # MemOS adapter
│   ├── memrl_agent.py            # MemRL adapter
│   ├── amem_agent.py             # A-MEM adapter
│   ├── hipporag_agent.py         # HippoRAG adapter
│   ├── lightmem_agent.py         # LightMem adapter
│   ├── letta_agent.py            # Letta adapter
│   ├── mirix_agent.py            # MIRIX adapter
│   ├── remem_agent.py            # ReMem adapter
│   ├── zep_agent.py              # Zep Cloud adapter
│   └── <vendored repos>          # mem0/, memOS/, MemRL/, amem/, HippoRAG/,
│                                 # LightMem/, letta/, MIRIX/, REMem/, MEM1/,
│                                 # cognee/, memorag/  (third-party sources)
│
├── benchmarks/                   # Dataset evaluation implementations
│   ├── base.py                   # BaseDataset abstract class
│   ├── medmemorybench/           # MedMemoryBench dataset
│   │   ├── dataset.py
│   │   ├── evaluator.py
│   │   └── checkpoint.py
│   └── locomo/                   # LoCoMo dataset
│       ├── dataset.py
│       └── evaluator.py
│
├── metrics/                      # Evaluation metrics
│   ├── base.py                   # BaseMetric abstract class
│   ├── string_match.py           # String matching metrics
│   ├── llm_judge.py              # LLM-as-a-Judge metrics
│   └── locomo_metrics.py         # LoCoMo-specific metrics
│
├── src/                          # Core orchestration modules
│   ├── config.py                 # Configuration loader
│   ├── agent.py                  # AgentManager
│   ├── evaluator.py              # Evaluation dispatcher
│   └── result.py                 # Result collection & reporting
│
├── utils/                        # Utility modules
│   ├── llm_client.py             # Unified LLM client
│   ├── tokenizer.py              # Tokenizer helpers
│   ├── templates.py              # Prompt templates
│   ├── prompts_qa.py             # QA prompts
│   ├── prompts_judge.py          # Judge prompts
│   ├── prompts_memorize.py       # Memorization prompts
│   ├── langchain_callback.py     # LangChain callback hooks
│   └── logger.py                 # Logger
│
├── docker/                       # Optional service compose files
│   ├── mirix-init.sql
│   └── mirix-services.yml
│
├── scripts/                      # Helper scripts
│   ├── run_eval.sh
│   └── mirix-services.sh
│
├── data/                         # Datasets (Git LFS)
│   ├── MedMemoryBench/           # Chinese, ~598 MB
│   ├── MedMemoryBench_EN/        # English, ~443 MB
│   └── locomo/                   # LoCoMo, ~18 MB
│
├── generation/                   # Dataset generation pipeline (sub-project)
├── outputs/                      # Evaluation outputs (gitignored)
├── exp_results/                  # Curated experiment reports
├── logs/                         # Runtime logs (gitignored)
└── results/                      # Method-side caches (gitignored)

🚀 Quick Start

1. Clone the Repository

Note: This repository ships datasets via Git LFS. Please install it before cloning.

# Install Git LFS (skip if already installed)
brew install git-lfs                  # macOS
sudo apt-get install git-lfs          # Ubuntu/Debian
# Windows: https://git-lfs.github.com/

git lfs install
git clone https://github.com/AQ-MedAI/MedMemoryBench.git
cd MedMemoryBench

2. Environment Setup

Using uv (recommended)
curl -LsSf https://astral.sh/uv/install.sh | sh

uv venv
source .venv/bin/activate          # Linux/macOS
# .venv\Scripts\activate           # Windows

uv pip install -r requirements.txt
Using conda
conda create -n medmemorybench python=3.10
conda activate medmemorybench
pip install -r requirements.txt

Method-specific dependencies: Some memory methods vendor upstream packages under methods/ (e.g. methods/mem0/, methods/memOS/). If a method has its own requirements.txt or README, follow those instructions to enable it.

Embedding models: Method configs reference either local embedding models or an embedding API. For local models, download the embedding model before running.
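
For example, a local model can be fetched from the Hugging Face Hub before the first run (the model ID below is only a placeholder; use whatever your method config points to):

# Download an embedding model for local use; "BAAI/bge-m3" is just an example.
from huggingface_hub import snapshot_download

snapshot_download(repo_id="BAAI/bge-m3", local_dir="models/bge-m3")
# Then point embedding.model in the method config to "models/bge-m3".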

3. Configure Environment Variables

cp .env.example .env

Edit .env and fill in the API keys you intend to use:

# BigModel (OpenAI-compatible, primary endpoint used in this project)
BIGMODEL_API_KEY=your_bigmodel_api_key
BIGMODEL_BASE_URL=https://open.bigmodel.cn/api/paas/v4

# OpenAI (optional)
OPENAI_API_KEY=your_openai_api_key
OPENAI_BASE_URL=https://api.openai.com/v1

# Azure OpenAI (optional)
AZURE_OPENAI_API_KEY=your_azure_key
AZURE_OPENAI_ENDPOINT=https://your-endpoint.openai.azure.com/

# Zep Cloud (optional, only needed for the Zep agent)
ZEP_API_KEY=your_zep_api_key

# Default model selection
DEFAULT_LLM_MODEL=gpt-4o-mini
DEFAULT_EMBEDDING_MODEL=text-embedding-3-small
EMBEDDING_PROVIDER=openai

# Optional: isolate Letta local runtime data (defaults to ~/.letta)
LETTA_DIR=.tmp/letta_runtime

Tips:

  • For BigModel, set BIGMODEL_API_KEY / BIGMODEL_BASE_URL first; the framework maps them to OpenAI-compatible settings internally.
  • Setting LETTA_DIR is recommended to avoid stale SQLite metadata from previous Letta runs.
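
To verify what the framework will see, the variables can be read the usual way with python-dotenv (a quick sanity check, not part of the evaluation pipeline):

# Quick sanity check of .env values; not part of the evaluation pipeline.
import os
from dotenv import load_dotenv

load_dotenv()  # reads .env from the current directory
for key in ("BIGMODEL_API_KEY", "OPENAI_API_KEY", "DEFAULT_LLM_MODEL"):
    print(key, "set" if os.getenv(key) else "missing")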

4. Run Evaluation

Via shell script:

./scripts/run_eval.sh bm25_rag_gpt-5.1 medmemorybench

Via Python:

# Standard run
python main.py -m bm25_rag_gpt-5.1 -d medmemorybench

# Dry run (no real LLM/API calls)
python main.py -m embedding_rag_gpt-5.1 -d medmemorybench --dry-run

# Resume from checkpoint
python main.py -m embedding_rag_gpt-5.1 -d medmemorybench --resume
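
For reference, the documented flags map onto a small CLI surface. A sketch assuming -m/--method and -d/--dataset long forms (the real parser in main.py may accept more options):

# Sketch of the CLI options documented above; see main.py for the real parser.
import argparse

parser = argparse.ArgumentParser(description="MedMemoryBench evaluation")
parser.add_argument("-m", "--method", required=True,
                    help="Method config name, e.g. bm25_rag_gpt-5.1")
parser.add_argument("-d", "--dataset", required=True,
                    help="Dataset config name, e.g. medmemorybench")
parser.add_argument("--dry-run", action="store_true",
                    help="Validate the pipeline without real LLM/API calls")
parser.add_argument("--resume", action="store_true",
                    help="Resume from the latest checkpoint")
args = parser.parse_args()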

🔧 Configuration

Method Configuration

Each method is driven by a YAML file under configs/method_config/:

# configs/method_config/embedding_rag_gpt-5.1.yaml

method_name: "embedding_rag"
method_type: "rag"                  # baseline / rag / agentic_memory
description: "Embedding RAG Agent - Dense vector retrieval based RAG method"

model:
  provider: "openai"
  name: "gpt-5.1"
  temperature: 0.3
  max_completion_tokens: 100000

agent_params:
  top_k: 5                          # Number of documents to retrieve
  chunk_size: 512                   # Text chunk size
  chunk_overlap: 50                 # Chunk overlap

embedding:
  provider: "local"                 # openai / local / huggingface
  model: "/path/to/local/model"
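
Configs are plain YAML, so they can be inspected programmatically. A minimal check with PyYAML (the framework's own loader in src/config.py may do more, e.g. environment-variable substitution):

# Inspect a method config; src/config.py is the framework's real loader.
import yaml

with open("configs/method_config/embedding_rag_gpt-5.1.yaml") as f:
    cfg = yaml.safe_load(f)

print(cfg["model"]["name"])          # "gpt-5.1"
print(cfg["agent_params"]["top_k"])  # 5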

Dataset Configuration

Dataset configs live under configs/dataset_config/:

# configs/dataset_config/medmemorybench.yaml

dataset_name: "medmemorybench"
description: "Medical dialogue memory evaluation dataset"
language: "zh"

data:
  root_dir: "data/MedMemoryBench"
  sessions_pattern: "persona_{id}/eval/generated_dialogues.json"
  queries_pattern: "persona_{id}/eval/generated_queries.json"

evaluation:
  mode: "independent"               # independent / merged
  evaluation_interval: 10           # Evaluate every N sessions

query_types:
  - name: "entity_exact_match"
    metric: "string_contain"
  - name: "temporal_localization"
    metric: "llm_judge"
  # ... more types
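
The metric names map to implementations under metrics/. As a rough illustration, string_contain can be thought of as a case-insensitive containment check (the real implementation in metrics/string_match.py may apply extra normalization):

# Illustration only — the real metric lives in metrics/string_match.py.
def string_contain(prediction: str, gold: str) -> bool:
    return gold.strip().lower() in prediction.strip().lower()

assert string_contain("The patient takes Metformin daily.", "metformin")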

📄 Output

Evaluation results are saved under outputs/<method>_<model>/:

outputs/
└── bm25_rag_gpt-5.1/
    ├── eval_medmemorybench_20260330_181703.json    # Detailed results (JSON)
    ├── report_medmemorybench_20260330_181703.txt   # Human-readable report
    └── memory_builds_20260330_181703.json          # Memory build logs
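
The detailed JSON can be post-processed directly. A minimal example that just locates and loads the latest run's file (the inner schema is defined by the evaluator and is not documented here):

# Load one run's detailed results; the inner schema comes from the evaluator.
import json
from pathlib import Path

results_dir = Path("outputs/bm25_rag_gpt-5.1")
latest = max(results_dir.glob("eval_medmemorybench_*.json"))  # timestamped names sort lexicographically
with open(latest) as f:
    results = json.load(f)
print(latest.name)
print(type(results).__name__)  # dict or list, depending on the evaluator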

📝 Citations

If you find MedMemoryBench useful in your research, please consider citing our work:

@article{wang2026medmemorybench,
  title={MedMemoryBench: Benchmarking Agent Memory in Personalized Healthcare},
  author={Yihao Wang and Haoran Xu and Renjie Gu and Yixuan Ye and Xinyi Chen and Xinyu Mu and Yuan Gao and Chunxiao Guo and Peng Wei and Jinjie Gu and Huan Li and Ke Chen and Lidan Shou},
  journal={arXiv preprint arXiv:2605.11814},
  year={2026}
}

📜 License

  • Code — Apache License 2.0
  • Dataset (data/MedMemoryBench/, data/MedMemoryBench_EN/) — CC BY 4.0
  • Vendored third-party sources under methods/ retain their original upstream licenses.
  • See LEGAL.md for the source-comment language clause.