Benchmark and evaluation of LLM memory systems on the LOCOMO dataset.
| Directory | Description |
|---|---|
| `main.py` | Demo script for Huawei Cloud AgentArts Memory SDK |
| `memory-locomo-benchmark/` | LOCOMO benchmark for AgentArts Memory (BLEU, F1, LLM Judge) |
| `memobase/` | Forked Memobase with reference LOCOMO benchmark |
Evaluates AgentArts Memory on the LOCOMO dataset using dual-perspective sessions and memory search, scored with BLEU-1, token F1, and LLM Judge.

    cd memory-locomo-benchmark

    # Configure environment
    cp .env.example .env  # Edit with your credentials

    # Run full pipeline
    uv run python run.py add --max_samples 10          # Write conversations to memory
    uv run python run.py search --output results.json  # Retrieve & answer questions
    uv run python run.py eval --input results.json     # Score (BLEU/F1/LLM Judge)
    uv run python run.py score --input evals.json      # Aggregate by category

| Variable | Description |
|---|---|
| `AGENTARTS_MEMORY_REGION` | AgentArts Memory region (e.g. `cn-southwest-2`) |
| `AGENTARTS_MEMORY_API_KEY` | AgentArts Memory API key |
| `AGENTARTS_MEMORY_SPACE_ID` | AgentArts Memory space ID |
| `ANSWER_LLM_BASE_URL` | OpenAI-compatible endpoint for answer generation |
| `ANSWER_LLM_API_KEY` | API key for answer LLM |
| `ANSWER_LLM_MODEL` | Model name for answer generation |
| `JUDGE_LLM_BASE_URL` | OpenAI-compatible endpoint for LLM Judge |
| `JUDGE_LLM_API_KEY` | API key for judge LLM |
| `JUDGE_LLM_MODEL` | Model name for LLM Judge |
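
Before starting a long run, it can help to confirm that every setting is present. The following is a minimal sketch, not part of the benchmark: the variable names come from the table above, while the `missing_vars` helper is hypothetical. It only inspects the process environment, so it assumes the values from `.env` have already been loaded or exported.

```python
import os

# Variable names are taken from the configuration table;
# the helper below is a hypothetical illustration.
REQUIRED_VARS = [
    "AGENTARTS_MEMORY_REGION",
    "AGENTARTS_MEMORY_API_KEY",
    "AGENTARTS_MEMORY_SPACE_ID",
    "ANSWER_LLM_BASE_URL",
    "ANSWER_LLM_API_KEY",
    "ANSWER_LLM_MODEL",
    "JUDGE_LLM_BASE_URL",
    "JUDGE_LLM_API_KEY",
    "JUDGE_LLM_MODEL",
]


def missing_vars(env=None):
    """Return the required names that are unset or blank in env."""
    env = os.environ if env is None else env
    return [name for name in REQUIRED_VARS if not env.get(name, "").strip()]
```

Calling `missing_vars()` with no arguments checks the current process environment; an empty list means all nine settings are defined.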
- add -- Parse LOCOMO conversations, create dual-perspective sessions (one per speaker), write messages to AgentArts Memory, wait for memory extraction.
- search -- For each QA pair, search memories using the question as query, format retrieved memories, call LLM to generate an answer.
- eval -- Compute BLEU-1, token F1, and binary LLM Judge (CORRECT/WRONG) for each question.
- score -- Aggregate metrics by LOCOMO category (single_hop, temporal, multi_hop, open_domain).
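
The two lexical metrics in the eval step can be sketched as follows. This is an illustration under stated assumptions, not the benchmark's own code: tokenization here is lowercase alphanumeric splitting, BLEU-1 is clipped unigram precision with the standard brevity penalty and no smoothing, and the function names `tokenize`, `token_f1`, and `bleu1` are hypothetical. The actual implementation may normalize text differently or rely on a library.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase and split into alphanumeric tokens (an assumed normalization)."""
    return re.findall(r"[a-z0-9]+", text.lower())


def token_f1(prediction, reference):
    """Harmonic mean of token-level precision and recall."""
    pred, ref = tokenize(prediction), tokenize(reference)
    if not pred or not ref:
        # Both empty counts as a match; one empty counts as a miss.
        return float(pred == ref)
    overlap = sum((Counter(pred) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)


def bleu1(prediction, reference):
    """Clipped unigram precision scaled by the brevity penalty."""
    pred, ref = tokenize(prediction), tokenize(reference)
    if not pred or not ref:
        return 0.0
    clipped = sum((Counter(pred) & Counter(ref)).values())
    precision = clipped / len(pred)
    # Penalize predictions shorter than the reference.
    bp = 1.0 if len(pred) >= len(ref) else math.exp(1 - len(ref) / len(pred))
    return bp * precision
```

For the prediction "May 2023" against the reference "7 May 2023", `token_f1` gives 0.8 (precision 1, recall 2/3), while `bleu1` is lower because the brevity penalty applies to the shorter prediction.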
| Category | BLEU | F1 | LLM Judge | Count |
|---|---|---|---|---|
| single_hop | 29.03% | 38.61% | 60.64% | 282 |
| temporal | 16.61% | 20.09% | 19.63% | 321 |
| multi_hop | 18.22% | 23.46% | 41.67% | 96 |
| open_domain | 38.30% | 44.50% | 63.26% | 841 |
| Overall | 30.83% | 37.02% | 52.34% | 1540 |
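
The Overall row is consistent with a count-weighted mean of the four category rows, which is what a plain average over all 1540 questions would produce. The sketch below re-derives it from the table values; whether the `score` command computes it exactly this way is an assumption, and `weighted_overall` is a hypothetical helper.

```python
# Category rows copied from the results table: (BLEU %, F1 %, LLM Judge %, count).
ROWS = {
    "single_hop": (29.03, 38.61, 60.64, 282),
    "temporal": (16.61, 20.09, 19.63, 321),
    "multi_hop": (18.22, 23.46, 41.67, 96),
    "open_domain": (38.30, 44.50, 63.26, 841),
}


def weighted_overall(rows):
    """Return count-weighted means of the three metrics plus the total count."""
    total = sum(row[3] for row in rows.values())
    means = tuple(
        round(sum(row[i] * row[3] for row in rows.values()) / total, 2)
        for i in range(3)
    )
    return means + (total,)


print(weighted_overall(ROWS))  # (30.83, 37.02, 52.34, 1540)
```

Because open_domain accounts for 841 of the 1540 questions, it dominates the overall figures; the low temporal score (19.63% LLM Judge over 321 questions) is the main drag on the total.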
Reference: Memobase v0.0.37 achieves an LLM Judge score of 75.78% (with gpt-4o as the judge).