Long-term agent memory is increasingly multimodal, yet existing evaluations rarely test whether agents preserve the visual evidence needed for later reasoning. MemEye is a diagnostic framework that evaluates multimodal agent memory through a two-axis taxonomy:
- X-axis (Visual Evidence Granularity): from scene-level (X1) to pixel-level (X4) evidence
- Y-axis (Memory Reasoning Depth): from atomic retrieval (Y1) to relational association (Y2) and evolutionary synthesis (Y3)
The benchmark includes 371 mirrored MCQ + open-ended questions across 8 life-scenario tasks, with annotated clue rounds and validation gates for answerability, shortcut resistance, visual necessity, and reasoning structure.
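For orientation, each question is tagged with one cell of this matrix via a `point` field (see the dialogue JSON format further below). Here is a minimal sketch of reading that tag, assuming the first sub-list holds the X tag and the second the Y tag; the sample question text and helper name are hypothetical:

```python
# Minimal sketch: locate a question on the MemEye X/Y matrix.
# Assumes point[0] holds the X (visual evidence granularity) tag and
# point[1] the Y (memory reasoning depth) tag, matching the data-format
# example later in this README. Sample question and helper are hypothetical.
sample_qa = {
    "point": [["X2"], ["Y1"]],   # X2: granularity between scene (X1) and pixel (X4); Y1: atomic retrieval
    "question": "What color was the mug on the desk in the first session?",
}

def matrix_cell(qa: dict) -> tuple[str, str]:
    """Return the (X, Y) cell this question is annotated with."""
    x_tags, y_tags = qa["point"]
    return x_tags[0], y_tags[0]

print(matrix_cell(sample_qa))   # -> ('X2', 'Y1')
```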
Key findings:
- Captions remain competitive for scene- and region-level evidence but leave gaps at the instance and pixel levels
- Semantic retrieval can confuse relevance with temporal authority, ranking stale evidence above valid updates
- Native visual evidence helps high-X questions but does not by itself solve evolutionary synthesis
MemEye exhibits stronger visual irreplaceability than prior long-term memory benchmarks — the gap between caption-only and multimodal settings is significantly larger.
*Representative method performance across the MemEye matrix (gpt-5.4-mini). Left: Open-ended LLM-as-a-Judge; Right: Multiple-choice EM.*
| Category | Method | Config | Modality |
|---|---|---|---|
| Full Context | FC-Text | `full_context_text_only` | Text |
| | FC-Multimodal | `full_context_multimodal` | Visual |
| Retrieval | SRAG-Text | `semantic_rag_text_only` | Text |
| | SRAG-Multimodal | `semantic_rag_multimodal` | Visual |
| Summarization | SimpleMem | `simplemem` | Text |
| | SimpleMem-MM | `simplemem_multimodal` | Visual |
| Agentic Memory | A-MEM | `a_mem` | Text |
| | Reflexion | `reflexion` | Text |
| | Gen. Agents* | `gen_agents` | Text |
| | MemoryOS | `memoryos` | Text |
| | M2A | `m2a` | Visual |
| | MMA | `mma` | Visual |
| | MIRIX | `mirix` | Visual |
* Gen. Agents requires MemEngine (not bundled). See benchmark/gen_agents/SETUP.md.
Set up the environment and install dependencies:

```bash
conda create -n memeye python=3.10 -y
conda activate memeye
pip install -r requirements.txt
```

The benchmark data (dialogue JSONs + images) is hosted on HuggingFace:

```bash
git lfs install
git clone https://huggingface.co/datasets/MemEyeBench/MemEye data
```
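If git-lfs is inconvenient, the same dataset can also be fetched with `huggingface_hub`; a sketch, where only the repo id comes from the clone URL above:

```python
# Sketch: download the MemEye dataset without git-lfs.
# The repo id is taken from the HuggingFace URL above; everything else is
# standard huggingface_hub usage.
from huggingface_hub import snapshot_download

snapshot_download(
    repo_id="MemEyeBench/MemEye",
    repo_type="dataset",
    local_dir="data",   # same location the register step below expects
)
```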
Then generate task configs pointing to your local data:

```bash
python register_external_data.py --data-root ./data --overwrite
```

This creates task configs under `config/tasks_external/`.
Set API keys for the model backends you plan to use:

```bash
export OPENAI_API_KEY=<your_key>    # for GPT models
export GEMINI_API_KEY=<your_key>    # for Gemini models
```

Run a single task with one method:

```bash
python run_benchmark.py \
  --task-config config/tasks_external/brand_memory_test.yaml \
  --model-config config/models/gpt_4_1_nano.yaml \
  --method-config config/methods/full_context_multimodal.yaml
```

Run several methods on the same task with the matrix runner:

```bash
python run_matrix.py \
  --task-config config/tasks_external/brand_memory_test.yaml \
  --model-config config/models/gpt_4_1_nano.yaml \
  --method-config config/methods/full_context_multimodal.yaml \
  --method-config config/methods/semantic_rag_multimodal.yaml \
  --method-config config/methods/m2a.yaml
```

Score open-ended runs post hoc with the locked LLM judge:

```bash
python score_locked_llm_judge.py \
  --root runs/<model>/open \
  --judge-model gpt-5.2
```

Each task ships two variants:
| Mode | File Pattern | Scoring |
|---|---|---|
| MCQ | `Task_Name.json` | Exact match on extracted choice (A/B/C) |
| Open | `Task_Name_Open.json` | F1, BLEU, BERTScore, LLM-as-a-judge |
The runner auto-detects the variant. LLM-as-a-judge is the recommended primary metric for open-ended evaluation.
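For reference, MCQ scoring reduces to extracting the predicted letter and exact-matching it against the gold choice. A minimal illustrative sketch (not the repo's `benchmark/evaluator.py`; the extraction regex is an assumption):

```python
import re

def extract_choice(text: str) -> str | None:
    """Pull the first standalone A/B/C choice out of a model response."""
    match = re.search(r"\b([ABC])\b", text.strip().upper())
    return match.group(1) if match else None

def mcq_exact_match(prediction: str, gold: str) -> bool:
    """Exact match on the extracted choice, as described in the table above."""
    return extract_choice(prediction) == gold.strip().upper()

assert mcq_exact_match("The answer is B.", "B")
```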
Enable rich metrics:
```bash
python run_benchmark.py \
  --task-config config/tasks_external/brand_memory_test_open.yaml \
  --model-config config/models/gpt_4_1_nano.yaml \
  --method-config config/methods/full_context_text_only.yaml \
  --enable-bert-score \
  --enable-llm-judge
```

Each run writes to `runs/`:
- `config.json` — resolved run configuration
- `metrics.json` — aggregate metrics with breakdowns by X/Y axes
- `predictions.jsonl` — per-question predictions and scores
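A sketch for inspecting these outputs after a run; only the file names come from this README, so the run directory and the keys inside each file are assumptions to adjust to what your run actually wrote:

```python
# Sketch: inspect run outputs. Field names inside the files are assumptions;
# only config.json / metrics.json / predictions.jsonl come from this README.
import json
from pathlib import Path

run_dir = Path("runs") / "my_run"          # hypothetical run directory

metrics = json.loads((run_dir / "metrics.json").read_text())
print(json.dumps(metrics, indent=2))       # aggregate metrics + X/Y breakdowns

with (run_dir / "predictions.jsonl").open() as f:
    predictions = [json.loads(line) for line in f]
print(f"{len(predictions)} per-question predictions loaded")
```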
Dialogue JSON format:

```json
{
  "character_profile": { "...": "..." },
  "multi_session_dialogues": [
    {
      "session_id": "D1",
      "date": "2024-03-10",
      "dialogues": [
        {
          "round": "D1:1",
          "user": "...",
          "assistant": "...",
          "input_image": ["image/Task_Name/IMG.png"]
        }
      ]
    }
  ],
  "human-annotated QAs": [
    {
      "point": [["X2"], ["Y1"]],
      "question": "...",
      "answer": "...",
      "session_id": ["D1"],
      "clue": ["D1:1"]
    }
  ]
}
```
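A sketch for sanity-checking a dialogue file against this format before registering it as a task; the field names follow the JSON above, while the file path is illustrative:

```python
# Sketch: load a MemEye-format dialogue file and check that every QA's clue
# rounds actually exist in the dialogues. Field names follow the JSON example
# above; the file path is illustrative.
import json
from pathlib import Path

data = json.loads(Path("data/dialog/My_Task.json").read_text())

rounds = {
    turn["round"]
    for session in data["multi_session_dialogues"]
    for turn in session["dialogues"]
}

for qa in data["human-annotated QAs"]:
    missing = [clue for clue in qa["clue"] if clue not in rounds]
    if missing:
        print(f"QA {qa['question'][:40]!r} references missing rounds: {missing}")
```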
To add your own task:

- Prepare dialogue JSON + images in the MemEye format above
- Create a task config under `config/tasks/`:
```yaml
name: my_task
dataset:
  dialog_json: data/dialog/My_Task.json
  image_root: data/image
eval:
  mode: mcq          # or "open"
  max_questions: 0
```

- Run:
```bash
python run_benchmark.py \
  --task-config config/tasks/my_task.yaml \
  --model-config config/models/gpt_4_1_nano.yaml \
  --method-config config/methods/full_context_multimodal.yaml
```
Repository structure:

```text
.
├── run_benchmark.py              # Main benchmark entry point
├── run_matrix.py                 # Model x method matrix runner
├── score_locked_llm_judge.py     # Post-hoc LLM judge scoring
├── register_external_data.py    # Generate task configs from external data
├── benchmark/                    # Core modules
│   ├── dataset.py                # Data loading
│   ├── methods.py                # Method registry & history construction
│   ├── retrieval.py              # TF-IDF & dense retrieval
│   ├── embeddings.py             # Text & multimodal embeddings
│   ├── evaluator.py              # MCQ & open-ended scoring
│   ├── runner.py                 # Run orchestration
│   ├── m2a/                      # M2A agentic memory
│   ├── mma/                      # MMA confidence-aware memory
│   ├── mirix/                    # MIRIX multi-layer agent
│   └── ...                       # Other method implementations
├── router/                       # Model routers (OpenAI, Gemini, local)
├── config/
│   ├── methods/                  # Method configs
│   ├── models/                   # Model configs
│   └── tasks/                    # Task configs (examples)
├── tools/                        # Utility scripts (HF sync, caption preprocessing)
└── docs/                         # Documentation
```
Coming soon.
This project is licensed under the Apache License 2.0; see `LICENSE` for details.