Cannot Reproduce LoCoMo Benchmark Results

# LoCoMo Benchmark Results - Significant Accuracy Gap

**Issue**: EverMemOS reaches 38.38% accuracy on the full conv-26 run, versus the 93.0% reported in the paper for the LoCoMo benchmark.

## Environment

- **OS**: Windows 10
- **Python**: 3.12.7
- **Docker**: 28.1.1 (MongoDB, Elasticsearch, Milvus, Redis)
- **Dependencies**: `uv sync --group evaluation`

## Configuration

**Models**:
- LLM: `openai/gpt-4.1-mini` (OpenRouter, temp=0.3)
- Embedding: `Qwen/Qwen3-Embedding-4B` (DeepInfra, dim=1024)
- Reranker: `Qwen/Qwen3-Reranker-4B` (DeepInfra)
- Search mode: `agentic`

## Commands Run

```bash
# Start services
docker-compose up -d

# Smoke test (30 questions, 10 messages/conv)
uv run python -m evaluation.cli --dataset locomo --system evermemos --smoke

# Full conv-26 (152 questions, 419 messages)
uv run python -m evaluation.cli --dataset locomo --system evermemos --from-conv 0 --to-conv 1
```
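To extend the run beyond conv-26, the same command could be repeated once per conversation. The sketch below only builds and prints the command lines. It assumes there are 10 conversations and that `--to-conv` is an exclusive upper bound (suggested by `--from-conv 0 --to-conv 1` covering a single conversation); neither assumption has been checked against the CLI source, and `per_conversation_commands` is a hypothetical helper, not part of the repository.

```python
import shlex

# Base command copied from the runs above.
BASE = ["uv", "run", "python", "-m", "evaluation.cli",
        "--dataset", "locomo", "--system", "evermemos"]


def per_conversation_commands(num_convs: int = 10) -> list[list[str]]:
    """Build one command per conversation.

    Assumption: --to-conv is exclusive, so conversation i is the range [i, i + 1).
    """
    return [BASE + ["--from-conv", str(i), "--to-conv", str(i + 1)]
            for i in range(num_convs)]


for cmd in per_conversation_commands():
    # Print only; pass each list to subprocess.run(cmd, check=True) to execute.
    print(shlex.join(cmd))
```

Running one conversation per invocation keeps a failure in one conversation from discarding the results of the others.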

## Results

| Test | Messages | Questions | Accuracy | Gap vs paper (pp) |
|------|----------|-----------|----------|-------------------|
| **Paper (LoCoMo)** | All | 1,986 | **93.0%** | - |
| **Smoke test** | 10/conv | 30 | 52.22% | -40.78 |
| **Conv-26 (full)** | 419 | 152 | **38.38%** | **-54.62** |
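For anyone checking the arithmetic, here is a minimal sketch that recomputes the gap to the paper (in percentage points) and the share of benchmark questions each run covers. All figures are copied from the table above; the helper names are my own.

```python
# Figures copied from the results table.
PAPER_ACCURACY = 93.0
TOTAL_QUESTIONS = 1986

runs = {
    "Smoke test": {"accuracy": 52.22, "questions": 30},
    "Conv-26 (full)": {"accuracy": 38.38, "questions": 152},
}


def gap_vs_paper(accuracy: float) -> float:
    """Difference from the paper's figure, in percentage points."""
    return round(accuracy - PAPER_ACCURACY, 2)


def coverage(questions: int) -> float:
    """Share of the benchmark's questions covered by a run, in percent."""
    return round(100 * questions / TOTAL_QUESTIONS, 2)


for name, run in runs.items():
    print(f"{name}: {gap_vs_paper(run['accuracy']):+.2f} pp, "
          f"{coverage(run['questions']):.2f}% of questions")
```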

### Category Breakdown (Smoke Test)
- Single-hop: 41.67% (vs 96.08% in paper)
- Multi-hop: 54.76% (vs 91.13% in paper)
- Temporal: 66.67% (vs 89.72% in paper)
- Open domain: 100% (vs 70.83% in paper)
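
A small sketch ranking the categories by their gap to the paper, using only the numbers listed above. Note that the smoke test has just 30 questions in total, so each category rests on a handful of questions and the ordering should be read with caution.

```python
# (this run, paper) accuracy in percent, copied from the category list.
categories = {
    "Single-hop": (41.67, 96.08),
    "Multi-hop": (54.76, 91.13),
    "Temporal": (66.67, 89.72),
    "Open domain": (100.0, 70.83),
}

# Gap in percentage points; negative means below the paper.
gaps = {name: round(ours - paper, 2) for name, (ours, paper) in categories.items()}

# Largest shortfall first.
for name, gap in sorted(gaps.items(), key=lambda kv: kv[1]):
    print(f"{name}: {gap:+.2f} pp")
```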

## Key Findings

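1. **Accuracy is lower on the run with more context** (the two runs also use different question sets, so this is not a controlled comparison):
   - 10 messages: 52.22%
   - 419 messages: 38.38% (-13.84 percentage points)

2. **Only 1 of 10 conversations tested** (152 of 1,986 questions, 7.65% of the full benchmark)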

3. **Possible causes**:
   - Retrieval may struggle with large memory banks
   - Memory consolidation may be losing information
   - My evaluation methodology or configuration may differ from the paper's
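
Because both runs are small, it may help to put rough error bars on the two accuracies before reading too much into the difference between them. The sketch below computes a 95% Wilson score interval. It treats each reported accuracy as a proportion over `n` independent questions, which is an assumption on my part and ignores any averaging over repeated judge runs.

```python
import math


def wilson_interval(p_hat: float, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a proportion p_hat observed over n trials."""
    denom = 1 + z**2 / n
    centre = (p_hat + z**2 / (2 * n)) / denom
    half = z * math.sqrt(p_hat * (1 - p_hat) / n + z**2 / (4 * n**2)) / denom
    return centre - half, centre + half


# Accuracies and question counts copied from the results table.
for label, acc, n in [("Smoke test", 0.5222, 30), ("Conv-26 (full)", 0.3838, 152)]:
    lo, hi = wilson_interval(acc, n)
    print(f"{label}: {acc:.2%} (95% CI {lo:.1%} to {hi:.1%}, n={n})")
```

Under these assumptions the 30-question interval is wide enough to overlap the conv-26 interval, so the drop between the two runs is suggestive rather than conclusive. Both intervals still sit far below 93%.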

## Questions for Authors

Could you share more details on how the paper's results were produced, in particular the models, search mode, and evaluation settings used? Thanks!
