Memory LOCOMO

Benchmark and evaluation of LLM memory systems on the LOCOMO dataset.

Projects

Directory	Description
`main.py`	Demo script for Huawei Cloud AgentArts Memory SDK
`memory-locomo-benchmark/`	LOCOMO benchmark for AgentArts Memory (BLEU, F1, LLM Judge)
`memobase/`	Forked Memobase with reference LOCOMO benchmark

AgentArts LOCOMO Benchmark

Evaluates AgentArts Memory on the LOCOMO dataset using dual-perspective sessions and memory search, scored with BLEU-1, token F1, and LLM Judge.

Quick Start

cd memory-locomo-benchmark

# Configure environment
cp .env.example .env   # Edit with your credentials

# Run full pipeline
uv run python run.py add --max_samples 10         # Write conversations to memory
uv run python run.py search --output results.json  # Retrieve & answer questions
uv run python run.py eval --input results.json     # Score (BLEU/F1/LLM Judge)
uv run python run.py score --input evals.json      # Aggregate by category

Environment Variables

Variable	Description
`AGENTARTS_MEMORY_REGION`	AgentArts Memory region (e.g. `cn-southwest-2`)
`AGENTARTS_MEMORY_API_KEY`	AgentArts Memory API key
`AGENTARTS_MEMORY_SPACE_ID`	AgentArts Memory space ID
`ANSWER_LLM_BASE_URL`	OpenAI-compatible endpoint for answer generation
`ANSWER_LLM_API_KEY`	API key for answer LLM
`ANSWER_LLM_MODEL`	Model name for answer generation
`JUDGE_LLM_BASE_URL`	OpenAI-compatible endpoint for LLM Judge
`JUDGE_LLM_API_KEY`	API key for judge LLM
`JUDGE_LLM_MODEL`	Model name for LLM Judge

Pipeline

add -- Parse LOCOMO conversations, create dual-perspective sessions (one per speaker), write messages to AgentArts Memory, wait for memory extraction.
search -- For each QA pair, search memories using the question as query, format retrieved memories, call LLM to generate an answer.
eval -- Compute BLEU-1, token F1, and binary LLM Judge (CORRECT/WRONG) for each question.
score -- Aggregate metrics by LOCOMO category (single_hop, temporal, multi_hop, open_domain).

Results (10 conversations, glm-4.7 as judge)

Category	BLEU	F1	LLM Judge	Count
single_hop	29.03%	38.61%	60.64%	282
temporal	16.61%	20.09%	19.63%	321
multi_hop	18.22%	23.46%	41.67%	96
open_domain	38.30%	44.50%	63.26%	841
Overall	30.83%	37.02%	52.34%	1540

Reference: Memobase v0.0.37 achieves 75.78% LLM Judge (with gpt-4o as judge).

Name		Name	Last commit message	Last commit date
Latest commit History 3 Commits
docs		docs
examples		examples
memory-locomo-benchmark		memory-locomo-benchmark
.gitignore		.gitignore
.python-version		.python-version
CLAUDE.md		CLAUDE.md
README.md		README.md
pyproject.toml		pyproject.toml
uv.lock		uv.lock

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Memory LOCOMO

Projects

AgentArts LOCOMO Benchmark

Quick Start

Environment Variables

Pipeline

Results (10 conversations, glm-4.7 as judge)

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

Memory LOCOMO

Projects

AgentArts LOCOMO Benchmark

Quick Start

Environment Variables

Pipeline

Results (10 conversations, glm-4.7 as judge)

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages