LoCoMo Benchmark Results - Significant Accuracy Gap
Issue: EverMemOS achieves 38.38% accuracy on the LoCoMo benchmark vs. the 93% claimed in the paper
Environment
- OS: Windows 10
- Python: 3.12.7
- Docker: 28.1.1 (MongoDB, Elasticsearch, Milvus, Redis)
- Dependencies: uv sync --group evaluation
Configuration
Models:
- LLM: openai/gpt-4.1-mini (OpenRouter, temp=0.3)
- Embedding: Qwen/Qwen3-Embedding-4B (DeepInfra, dim=1024)
- Reranker: Qwen/Qwen3-Reranker-4B (DeepInfra)
- Search mode: agentic
Commands Run
# Start services
docker-compose up -d
# Smoke test (30 questions, 10 messages/conv)
uv run python -m evaluation.cli --dataset locomo --system evermemos --smoke
# Full conv-26 (152 questions, 419 messages)
uv run python -m evaluation.cli --dataset locomo --system evermemos --from-conv 0 --to-conv 1
Results
| Test | Messages | Questions | Accuracy | vs Paper (percentage points) |
|---|---|---|---|---|
| Paper (LoCoMo) | All | 1,986 | 93.0% | - |
| Smoke test | 10/conv | 30 | 52.22% | -40.78 |
| Conv-26 (full) | 419 | 152 | 38.38% | -54.62 |
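The "vs Paper" column is a plain difference in percentage points. A minimal sketch of that arithmetic, using only the figures from the table (the names `runs` and `gap_in_points` are illustrative and not part of the EverMemOS evaluation code):

```python
# Accuracy figures copied from the results table, in percent.
PAPER_ACCURACY = 93.0
runs = {
    "Smoke test": 52.22,
    "Conv-26 (full)": 38.38,
}


def gap_in_points(measured: float, reference: float) -> float:
    """Return measured minus reference, in percentage points."""
    return round(measured - reference, 2)


# Gap of each local run against the paper's reported accuracy.
gaps = {name: gap_in_points(acc, PAPER_ACCURACY) for name, acc in runs.items()}

for name, gap in gaps.items():
    print(f"{name}: {gap:+.2f} points vs paper")
```

The gaps are absolute differences between two percentages, not relative changes, which is why they are labeled as points.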
Category Breakdown (Smoke Test)
- Single-hop: 41.67% (vs 96.08% in paper)
- Multi-hop: 54.76% (vs 91.13% in paper)
- Temporal: 66.67% (vs 89.72% in paper)
- Open domain: 100% (vs 70.83% in paper)
Key Findings
- Accuracy drops with more context:
  - 10 messages: 52.22%
  - 419 messages: 38.38% (-13.84 percentage points)
- Only 1 of 10 conversations tested (152 of 1,986 questions, 7.65% of the full benchmark)
- Possible causes:
  - Retrieval struggles with large memory banks
  - Memory consolidation losing information
  - Different evaluation methodology or configuration
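Since only one conversation was evaluated, it is worth checking whether sampling noise alone could explain the gap. The sketch below computes the benchmark coverage and a 95% Wilson score interval for the conv-26 accuracy. It assumes the 38.38% figure can be treated as a single binomial proportion over 152 independent questions, which is a simplification (the reported accuracy may be averaged over several judge runs). The function name `wilson_interval` is illustrative.

```python
import math


def wilson_interval(p_hat: float, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (z=1.96 gives ~95%)."""
    denom = 1 + z**2 / n
    center = (p_hat + z**2 / (2 * n)) / denom
    half_width = z * math.sqrt(p_hat * (1 - p_hat) / n + z**2 / (4 * n**2)) / denom
    return center - half_width, center + half_width


# Figures from this report: 152 of the 1,986 benchmark questions were run.
QUESTIONS_RUN = 152
QUESTIONS_TOTAL = 1986
CONV26_ACCURACY = 0.3838  # assumption: treated as one binomial proportion
PAPER_ACCURACY = 0.93

coverage = QUESTIONS_RUN / QUESTIONS_TOTAL
low, high = wilson_interval(CONV26_ACCURACY, QUESTIONS_RUN)

print(f"Coverage: {coverage:.2%} of the benchmark")
print(f"95% interval for conv-26 accuracy: {low:.2%} to {high:.2%}")
print(f"Paper accuracy inside interval: {low <= PAPER_ACCURACY <= high}")
```

Under these assumptions the upper bound of the interval stays far below 93%, so the small sample by itself is unlikely to account for the difference. Conversations can still differ in difficulty, which this interval does not capture.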
Questions for Authors
Could you share more details on how to reproduce the paper's results, in particular the exact model, retrieval, and evaluation configuration used? Thanks!