Skip to content

HScarb/agentarts-benchmark

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

3 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

Memory LOCOMO

Benchmark and evaluation of LLM memory systems on the LOCOMO dataset.

Projects

Directory Description
main.py Demo script for Huawei Cloud AgentArts Memory SDK
memory-locomo-benchmark/ LOCOMO benchmark for AgentArts Memory (BLEU, F1, LLM Judge)
memobase/ Forked Memobase with reference LOCOMO benchmark

AgentArts LOCOMO Benchmark

Evaluates AgentArts Memory on the LOCOMO dataset using dual-perspective sessions and memory search, scored with BLEU-1, token F1, and LLM Judge.

Quick Start

cd memory-locomo-benchmark

# Configure environment
cp .env.example .env   # Edit with your credentials

# Run full pipeline
uv run python run.py add --max_samples 10         # Write conversations to memory
uv run python run.py search --output results.json  # Retrieve & answer questions
uv run python run.py eval --input results.json     # Score (BLEU/F1/LLM Judge)
uv run python run.py score --input evals.json      # Aggregate by category

Environment Variables

Variable Description
AGENTARTS_MEMORY_REGION AgentArts Memory region (e.g. cn-southwest-2)
AGENTARTS_MEMORY_API_KEY AgentArts Memory API key
AGENTARTS_MEMORY_SPACE_ID AgentArts Memory space ID
ANSWER_LLM_BASE_URL OpenAI-compatible endpoint for answer generation
ANSWER_LLM_API_KEY API key for answer LLM
ANSWER_LLM_MODEL Model name for answer generation
JUDGE_LLM_BASE_URL OpenAI-compatible endpoint for LLM Judge
JUDGE_LLM_API_KEY API key for judge LLM
JUDGE_LLM_MODEL Model name for LLM Judge

Pipeline

  1. add -- Parse LOCOMO conversations, create dual-perspective sessions (one per speaker), write messages to AgentArts Memory, wait for memory extraction.
  2. search -- For each QA pair, search memories using the question as query, format retrieved memories, call LLM to generate an answer.
  3. eval -- Compute BLEU-1, token F1, and binary LLM Judge (CORRECT/WRONG) for each question.
  4. score -- Aggregate metrics by LOCOMO category (single_hop, temporal, multi_hop, open_domain).

Results (10 conversations, glm-4.7 as judge)

Category BLEU F1 LLM Judge Count
single_hop 29.03% 38.61% 60.64% 282
temporal 16.61% 20.09% 19.63% 321
multi_hop 18.22% 23.46% 41.67% 96
open_domain 38.30% 44.50% 63.26% 841
Overall 30.83% 37.02% 52.34% 1540

Reference: Memobase v0.0.37 achieves 75.78% LLM Judge (with gpt-4o as judge).

About

No description, website, or topics provided.

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors