Addressing the challenges of agent memory evaluation in healthcare scenarios.

| 🤗 HuggingFace Dataset | 📄 arXiv Preprint | 🌐 Chinese README |
MedMemoryBench is a benchmark framework for evaluating Agent memory methods, with a focus on memory capability assessment in medical dialogue scenarios. This framework provides unified evaluation interfaces, multiple baseline method implementations, and a flexible configuration management system, while also supporting the import and evaluation of other datasets.
- [2026.05] MedMemoryBench v1.0 is officially released — dataset, evaluation framework, and 16 memory method baselines.
- [2026.05] Dataset available on HuggingFace.
Key features:

- Comprehensive Medical Dataset
- Rich Baseline Coverage
- Unified Evaluation Framework
- Flexible Configuration
Full directory tree:
MedMemoryBench/
├── main.py # Evaluation entry point
├── requirements.txt # Python dependencies
├── LICENSE # Apache License 2.0
├── LEGAL.md # Comment-language legal notice
├── .env.example # Environment variable template
│
├── configs/ # Configuration files
│ ├── method_config/ # Per-method YAML configs (gpt-5.1 / qwen3 variants)
│ │ ├── long_context_gpt-5.1.yaml
│ │ ├── embedding_rag_gpt-5.1.yaml
│ │ ├── bm25_rag_gpt-5.1.yaml
│ │ ├── graph_rag_gpt-5.1.yaml
│ │ ├── mem0_gpt-5.1.yaml
│ │ ├── memos_gpt-5.1.yaml
│ │ ├── memrl_gpt-5.1.yaml
│ │ ├── amem_gpt-5.1.yaml
│ │ ├── hipporag_gpt-5.1.yaml
│ │ ├── lightmem_gpt-5.1.yaml
│ │ ├── letta_gpt-5.1.yaml
│ │ ├── mirix_gpt-5.1.yaml
│ │ ├── remem_gpt-5.1.yaml
│ │ ├── zep_gpt-5.1-chat.yaml
│ │ └── ... # + qwen3 variants
│ └── dataset_config/
│ ├── medmemorybench.yaml
│ └── locomo.yaml
│
├── methods/ # Memory method implementations
│ ├── base.py # BaseAgent abstract class
│ ├── long_context.py # Long-context baseline
│ ├── embedding_rag.py # Dense embedding RAG
│ ├── bm25_rag.py # BM25 sparse RAG
│ ├── graph_rag.py # Graph-based RAG
│ ├── self_rag.py # Self-RAG
│ ├── mem0_agent.py # Mem0 adapter
│ ├── memos_agent.py # MemOS adapter
│ ├── memrl_agent.py # MemRL adapter
│ ├── amem_agent.py # A-MEM adapter
│ ├── hipporag_agent.py # HippoRAG adapter
│ ├── lightmem_agent.py # LightMem adapter
│ ├── letta_agent.py # Letta adapter
│ ├── mirix_agent.py # MIRIX adapter
│ ├── remem_agent.py # ReMem adapter
│ ├── zep_agent.py # Zep Cloud adapter
│ └── <vendored repos> # mem0/, memOS/, MemRL/, amem/, HippoRAG/,
│ # LightMem/, letta/, MIRIX/, REMem/, MEM1/,
│ # cognee/, memorag/ (third-party sources)
│
├── benchmarks/ # Dataset evaluation implementations
│ ├── base.py # BaseDataset abstract class
│ ├── medmemorybench/ # MedMemoryBench dataset
│ │ ├── dataset.py
│ │ ├── evaluator.py
│ │ └── checkpoint.py
│ └── locomo/ # LoCoMo dataset
│ ├── dataset.py
│ └── evaluator.py
│
├── metrics/ # Evaluation metrics
│ ├── base.py # BaseMetric abstract class
│ ├── string_match.py # String matching metrics
│ ├── llm_judge.py # LLM-as-a-Judge metrics
│ └── locomo_metrics.py # LoCoMo-specific metrics
│
├── src/ # Core orchestration modules
│ ├── config.py # Configuration loader
│ ├── agent.py # AgentManager
│ ├── evaluator.py # Evaluation dispatcher
│ └── result.py # Result collection & reporting
│
├── utils/ # Utility modules
│ ├── llm_client.py # Unified LLM client
│ ├── tokenizer.py # Tokenizer helpers
│ ├── templates.py # Prompt templates
│ ├── prompts_qa.py # QA prompts
│ ├── prompts_judge.py # Judge prompts
│ ├── prompts_memorize.py # Memorization prompts
│ ├── langchain_callback.py # LangChain callback hooks
│ └── logger.py # Logger
│
├── docker/ # Optional service compose files
│ ├── mirix-init.sql
│ └── mirix-services.yml
│
├── scripts/ # Helper scripts
│ ├── run_eval.sh
│ └── mirix-services.sh
│
├── data/ # Datasets (Git LFS)
│ ├── MedMemoryBench/ # Chinese, ~598 MB
│ ├── MedMemoryBench_EN/ # English, ~443 MB
│ └── locomo/ # LoCoMo, ~18 MB
│
├── generation/ # Dataset generation pipeline (sub-project)
├── outputs/ # Evaluation outputs (gitignored)
├── exp_results/ # Curated experiment reports
├── logs/ # Runtime logs (gitignored)
└── results/ # Method-side caches (gitignored)
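New memory methods plug into the framework by subclassing the `BaseAgent` abstract class in `methods/base.py`. The sketch below illustrates the pattern only; the method names (`memorize`, `answer`) and the toy keyword-overlap logic are assumptions for illustration, not the repository's actual interface:

```python
# Minimal sketch of a custom memory method. The BaseAgent interface
# shown here (memorize/answer) is an assumption -- see methods/base.py
# for the real abstract class.
from abc import ABC, abstractmethod


class BaseAgent(ABC):
    """Stand-in for methods/base.py: store sessions, then answer queries."""

    @abstractmethod
    def memorize(self, session: str) -> None: ...

    @abstractmethod
    def answer(self, query: str) -> str: ...


class KeywordMemoryAgent(BaseAgent):
    """Toy method: remember sessions, answer with the best keyword match."""

    def __init__(self) -> None:
        self.sessions: list[str] = []

    def memorize(self, session: str) -> None:
        self.sessions.append(session)

    def answer(self, query: str) -> str:
        # Score each stored session by word overlap with the query.
        words = set(query.lower().split())
        best = max(
            self.sessions,
            key=lambda s: len(words & set(s.lower().split())),
            default="",
        )
        return best


agent = KeywordMemoryAgent()
agent.memorize("Patient reports penicillin allergy since 2019.")
agent.memorize("Blood pressure stable at follow-up.")
print(agent.answer("What allergy does the patient have?"))
```

A real method would also need a matching YAML file under `configs/method_config/` so the evaluator can instantiate it.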
Note: This repository ships datasets via Git LFS. Please install it before cloning.
# Install Git LFS (skip if already installed)
brew install git-lfs # macOS
sudo apt-get install git-lfs # Ubuntu/Debian
# Windows: https://git-lfs.github.com/
git lfs install
git clone https://github.com/AQ-MedAI/MedMemoryBench.git
cd MedMemoryBench

Using uv (recommended):
curl -LsSf https://astral.sh/uv/install.sh | sh
uv venv
source .venv/bin/activate # Linux/macOS
# .venv\Scripts\activate # Windows
uv pip install -r requirements.txt

Using conda:
conda create -n medmemorybench python=3.10
conda activate medmemorybench
pip install -r requirements.txt

Method-specific dependencies: Some memory methods vendor upstream packages under methods/ (e.g. methods/mem0/, methods/memOS/). If a method ships its own requirements.txt or README, follow those instructions to enable it.

Embedding models: Method configs reference either local embedding models or an embedding API. For local models, download the embedding model before running.
cp .env.example .env

Edit .env and fill in the API keys you intend to use:
# BigModel (OpenAI-compatible, primary endpoint used in this project)
BIGMODEL_API_KEY=your_bigmodel_api_key
BIGMODEL_BASE_URL=https://open.bigmodel.cn/api/paas/v4
# OpenAI (optional)
OPENAI_API_KEY=your_openai_api_key
OPENAI_BASE_URL=https://api.openai.com/v1
# Azure OpenAI (optional)
AZURE_OPENAI_API_KEY=your_azure_key
AZURE_OPENAI_ENDPOINT=https://your-endpoint.openai.azure.com/
# Zep Cloud (optional, only needed for the Zep agent)
ZEP_API_KEY=your_zep_api_key
# Default model selection
DEFAULT_LLM_MODEL=gpt-4o-mini
DEFAULT_EMBEDDING_MODEL=text-embedding-3-small
EMBEDDING_PROVIDER=openai
# Optional: isolate Letta local runtime data (defaults to ~/.letta)
LETTA_DIR=.tmp/letta_runtime

Tips:
- For BigModel, set BIGMODEL_API_KEY / BIGMODEL_BASE_URL first; the framework maps them to OpenAI-compatible settings internally.
- Setting LETTA_DIR is recommended to avoid stale SQLite metadata from previous Letta runs.
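The mapping described in the tips can be sketched as below. `resolve_openai_settings` is a hypothetical helper written for illustration, not the framework's actual code:

```python
# Hypothetical sketch of mapping BIGMODEL_* variables to OpenAI-compatible
# client settings; the framework's real logic may differ.
import os


def resolve_openai_settings(env: dict) -> dict:
    """Prefer BigModel credentials, falling back to plain OpenAI ones."""
    if env.get("BIGMODEL_API_KEY"):
        return {
            "api_key": env["BIGMODEL_API_KEY"],
            "base_url": env.get("BIGMODEL_BASE_URL", "https://open.bigmodel.cn/api/paas/v4"),
        }
    return {
        "api_key": env.get("OPENAI_API_KEY", ""),
        "base_url": env.get("OPENAI_BASE_URL", "https://api.openai.com/v1"),
    }


# Resolve from the current process environment (populated from .env).
settings = resolve_openai_settings(os.environ)
```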
Via shell script:
./scripts/run_eval.sh bm25_rag_gpt-5.1 medmemorybench

Via Python:
# Standard run
python main.py -m bm25_rag_gpt-5.1 -d medmemorybench
# Dry run (no real LLM/API calls)
python main.py -m embedding_rag_gpt-5.1 -d medmemorybench --dry-run
# Resume from checkpoint
python main.py -m embedding_rag_gpt-5.1 -d medmemorybench --resume

Each method is driven by a YAML file under configs/method_config/:
# configs/method_config/embedding_rag_gpt-5.1.yaml
method_name: "embedding_rag"
method_type: "rag"          # baseline / rag / agentic_memory
description: "Embedding RAG Agent - Dense vector retrieval based RAG method"

model:
  provider: "openai"
  name: "gpt-5.1"
  temperature: 0.3
  max_completion_tokens: 100000

agent_params:
  top_k: 5                  # Number of documents to retrieve
  chunk_size: 512           # Text chunk size
  chunk_overlap: 50         # Chunk overlap

embedding:
  provider: "local"         # openai / local / huggingface
  model: "/path/to/local/model"

Dataset configs live under configs/dataset_config/:
# configs/dataset_config/medmemorybench.yaml
dataset_name: "medmemorybench"
description: "Medical dialogue memory evaluation dataset"
language: "zh"

data:
  root_dir: "data/MedMemoryBench"
  sessions_pattern: "persona_{id}/eval/generated_dialogues.json"
  queries_pattern: "persona_{id}/eval/generated_queries.json"

evaluation:
  mode: "independent"         # independent / merged
  evaluation_interval: 10     # Evaluate every N sessions
  query_types:
    - name: "entity_exact_match"
      metric: "string_contain"
    - name: "temporal_localization"
      metric: "llm_judge"
    # ... more types

Evaluation results are saved under outputs/<method>_<model>/:
outputs/
└── bm25_rag_gpt-5.1/
├── eval_medmemorybench_20260330_181703.json # Detailed results (JSON)
├── report_medmemorybench_20260330_181703.txt # Human-readable report
└── memory_builds_20260330_181703.json # Memory build logs
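The detailed JSON results can be post-processed with the standard library. The record schema below (a list of per-query records with `query_type` and `correct` fields) is an assumption for illustration, not the framework's documented output format:

```python
# Sketch of summarizing a detailed results file under outputs/.
# The record schema (query_type/correct) is assumed for illustration.
import json
from collections import defaultdict
from pathlib import Path
from tempfile import TemporaryDirectory


def accuracy_by_query_type(records: list) -> dict:
    """Aggregate per-query records into accuracy per query type."""
    totals = defaultdict(lambda: [0, 0])
    for rec in records:
        totals[rec["query_type"]][0] += int(rec["correct"])
        totals[rec["query_type"]][1] += 1
    return {qt: hits / n for qt, (hits, n) in totals.items()}


# Write and read back a mock file mimicking
# outputs/<method>_<model>/eval_medmemorybench_<timestamp>.json.
with TemporaryDirectory() as tmp:
    path = Path(tmp) / "eval_medmemorybench_20260330_181703.json"
    path.write_text(json.dumps([
        {"query_type": "entity_exact_match", "correct": True},
        {"query_type": "entity_exact_match", "correct": False},
        {"query_type": "temporal_localization", "correct": True},
    ]))
    records = json.loads(path.read_text())
    print(accuracy_by_query_type(records))
```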
If you find MedMemoryBench useful in your research, please consider citing our work:
@article{wang2026medmemorybench,
title={MedMemoryBench: Benchmarking Agent Memory in Personalized Healthcare},
author={Yihao Wang and Haoran Xu and Renjie Gu and Yixuan Ye and Xinyi Chen and Xinyu Mu and Yuan Gao and Chunxiao Guo and Peng Wei and Jinjie Gu and Huan Li and Ke Chen and Lidan Shou},
journal={arXiv preprint arXiv:2605.11814},
year={2026}
}

- Code — Apache License 2.0
- Dataset (data/MedMemoryBench/, data/MedMemoryBench_EN/) — CC BY 4.0
- Vendored third-party sources under methods/ retain their original upstream licenses.
- See LEGAL.md for the source-comment language clause.
