Find out why your AI agent forgot.
A customer told your agent they're vegan -- then it recommended a steak restaurant. Your support bot asked for the account ID the customer had already given. Your personal assistant lost the project deadline mentioned three messages ago.
These are memory failures. They happen in production every day. And until now, there was no way to test for them before they reach your users.
memeval detects memory corruption, measures retrieval quality, and catches context loss in multi-turn conversations -- across any memory provider.
pip install memoryeval
memeval run --adapter in_memory

Session: Mid-Conversation Correction
[turn 1] User: "Book a flight to Tokyo"
[turn 3] User: "Actually, change that to Seoul"
[assert] Query: "Where does the user want to fly?"
[result] Retrieved: "Seoul" -- PASS
Preference Update
[setup] Stored: "User is vegetarian"
[step 1] Stored: "User switched to vegan"
[assert] Query: "dietary preferences"
[result] Retrieved: "vegetarian" AND "vegan" -- FAIL: old fact not replaced
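The failing check above can be sketched in a few lines of Python. This is a minimal illustration, not memeval's implementation: `ToyMemory` and `check_update_propagation` are hypothetical names, and the plain substring match is a simplification chosen to keep the example self-contained.

```python
from dataclasses import dataclass, field


@dataclass
class ToyMemory:
    """Hypothetical append-only store; not a memeval class."""
    facts: list[str] = field(default_factory=list)

    def add(self, fact: str) -> None:
        self.facts.append(fact)

    def search(self, query: str) -> list[str]:
        # Toy retrieval: return every stored fact, ignoring the query.
        return list(self.facts)


def check_update_propagation(results: list[str], new: str, old: str) -> bool:
    """Pass only if the new value is retrieved and the old one is not."""
    text = " ".join(results).lower()
    return new.lower() in text and old.lower() not in text


memory = ToyMemory()
memory.add("User is vegetarian")
memory.add("User switched to vegan")
results = memory.search("dietary preferences")
passed = check_update_propagation(results, new="vegan", old="vegetarian")
print("PASS" if passed else "FAIL: old fact not replaced")
```

Because the toy store never overwrites anything, both facts come back and the check fails, which mirrors the behaviour the scenario is designed to catch.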
Contradiction retention -- your memory stores "User earns $80K" and "User earns $120K" side by side. memeval detects this using embedding similarity analysis, not keyword matching.
Context loss in conversations -- a fact shared in turn 1 disappears by turn 10. memeval tests recall depth across 10+ turn conversations.
Stale data -- user corrected a preference but the old value still appears in search results. memeval verifies that updates propagate.
Cross-user data leakage -- user A's private data shows up in user B's session. memeval plants sentinel values and probes for leaks.
Silent forgetting -- deleted facts are gone, but so are facts that should have survived. memeval measures both forgetting precision and retention rate.
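The sentinel idea behind the cross-user leakage check can be illustrated with a toy store. This is a sketch under stated assumptions: `ToySessionStore` and `probe_for_leak` are hypothetical and exist only for this example, and the `isolate` flag simulates a correct versus a buggy provider.

```python
import uuid


class ToySessionStore:
    """Hypothetical store used only for this sketch; not a memeval class."""

    def __init__(self, isolate: bool = True) -> None:
        self.isolate = isolate
        self._facts: list[tuple[str, str]] = []  # (user_id, fact)

    def add(self, user_id: str, fact: str) -> None:
        self._facts.append((user_id, fact))

    def search(self, user_id: str, query: str) -> list[str]:
        # A correct store filters by user; a buggy one returns everything.
        return [f for u, f in self._facts if not self.isolate or u == user_id]


def probe_for_leak(store: ToySessionStore) -> bool:
    """Plant a unique sentinel for user A, then search as user B.

    Returns True if the sentinel shows up in user B's results.
    """
    sentinel = f"SENTINEL-{uuid.uuid4().hex}"
    store.add("user_a", f"My recovery code is {sentinel}")
    store.add("user_b", "I prefer email over phone calls")
    leaked = store.search("user_b", "recovery code")
    return any(sentinel in fact for fact in leaked)


print(probe_for_leak(ToySessionStore(isolate=True)))   # isolated store
print(probe_for_leak(ToySessionStore(isolate=False)))  # leaky store
```

A random sentinel cannot appear in another user's results by coincidence, so any hit is an unambiguous leak rather than a fuzzy-match false positive.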
pip install memoryeval
# Run 30 built-in scenarios against the test adapter
memeval run --adapter in_memory
# Test against real Mem0 (requires OPENAI_API_KEY)
pip install memoryeval[mem0]
memeval run --adapter mem0
# Compare providers side-by-side
memeval benchmark --adapters in_memory --adapters mem0

memeval tests what users actually care about -- does the agent remember what was said earlier in this conversation?
name: "Customer Support Multi-Turn"
steps:
  - create_session:
      session_id: "ticket_789"
  - add_message:
      session_id: "ticket_789"
      role: "user"
      content: "I was charged $99 but my plan is Basic at $29"
  - add_message:
      session_id: "ticket_789"
      role: "user"
      content: "My account email is frank@email.com"
  - add_message:
      session_id: "ticket_789"
      role: "user"
      content: "Please refund the difference"
  # 3 turns later, does the agent still know the issue?
  - assert_context:
      session_id: "ticket_789"
      query: "What is the billing issue?"
      expected_contains: ["99"]

Each adapter maps sessions to the provider's native concept:
- Mem0: run_id
- Zep: threads
- Letta: agent message sequence
| Category | Scenarios | What they test |
|---|---|---|
| Session (6) | Basic recall, correction, 10-turn depth, isolation, support, preferences | Multi-turn conversation memory |
| Core (7) | Basic recall, adversarial, multi-hop, entity resolution, negation, high-volume, scale | Fact storage and retrieval |
| Lifecycle (6) | Preference update, contradiction, rapid updates, stale data, forgetting, GDPR deletion | Memory evolution over time |
| Governance (3) | Privacy isolation, multi-user isolation, cross-session recall | Data boundaries |
| Edge cases (2) | UTF-8 characters, empty/boundary conditions | Robustness |
| Operations (6) | Cascading deletion, consolidation, support handoff, context restoration, soft contradictions, scale stress | Memory management |
7 evaluation dimensions, each with concrete metrics:
| Dimension | What it catches | How |
|---|---|---|
| Recall | "Agent forgot what I said" | Store facts, search for them, measure hit rate |
| Relevance | "Agent returned the wrong memory" | MRR and NDCG@k on ranked results |
| Consistency | "Agent has contradictory facts" | Embedding similarity to detect same-topic divergence |
| Update propagation | "Old value still appears after correction" | Update fact, verify old value is gone |
| Forgetting quality | "Deleted the wrong things" | Selective deletion precision + retention rate |
| Latency | "Memory operations are too slow" | p50/p95/p99 for reads vs writes separately |
| Privacy | "User A's data leaked to User B" | Sentinel-based cross-session probing |
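The Relevance row names MRR and NDCG@k. For readers unfamiliar with them, here is a minimal sketch of the standard definitions. The function names and input shapes are assumptions made for this example and may differ from how memeval computes these metrics internally.

```python
import math


def reciprocal_rank(ranked_ids: list[str], relevant: set[str]) -> float:
    """1 / rank of the first relevant result, or 0.0 if none is relevant."""
    for rank, item in enumerate(ranked_ids, start=1):
        if item in relevant:
            return 1.0 / rank
    return 0.0


def mean_reciprocal_rank(runs: list[tuple[list[str], set[str]]]) -> float:
    """Average reciprocal rank over (ranked results, relevant set) pairs."""
    if not runs:
        return 0.0
    return sum(reciprocal_rank(r, rel) for r, rel in runs) / len(runs)


def ndcg_at_k(ranked_ids: list[str], gains: dict[str, float], k: int) -> float:
    """NDCG@k with graded relevance; ids missing from `gains` count as 0."""
    def dcg(values: list[float]) -> float:
        # Position i (0-based) is discounted by log2(i + 2).
        return sum(g / math.log2(i + 2) for i, g in enumerate(values))

    actual = dcg([gains.get(item, 0.0) for item in ranked_ids[:k]])
    ideal = dcg(sorted(gains.values(), reverse=True)[:k])
    return actual / ideal if ideal > 0 else 0.0


# Two queries: the relevant memory is ranked 2nd, then 1st.
runs = [(["m2", "m1", "m3"], {"m1"}), (["m4", "m5"], {"m4"})]
print(mean_reciprocal_rank(runs))  # 0.75
print(ndcg_at_k(["m2", "m1"], {"m1": 1.0}, k=2))
```

MRR only rewards the position of the first correct memory, while NDCG@k also accounts for graded relevance across the top k results, so the two can disagree on the same ranking.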
# JSON reports for pipeline integration
memeval run --adapter mem0 --output report.json
# Reproducible multi-run benchmarks
python scripts/run_benchmark.py --adapter mem0 --runs 3 --output results/

# .github/workflows/memeval.yml
- name: Memory evaluation
  run: memeval run --adapter mem0 --output report.json
  env:
    OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}

pytest --memeval-adapter=mem0

memeval auto-discovers YAML scenario files. Write your own or use the 30 built-in ones.
import asyncio

from memeval import evaluate, InMemoryAdapter


async def main():
    adapter = InMemoryAdapter()
    results = await evaluate(adapter=adapter, scenarios="builtin")
    for r in results:
        status = "PASS" if r.passed else "FAIL"
        print(f"[{status}] {r.scenario.name}")


asyncio.run(main())

| Provider | Session model | Install |
|---|---|---|
| In-Memory (testing) | dict-based | Built-in |
| Mem0 | run_id | pip install memoryeval[mem0] |
| Zep | threads | pip install memoryeval[zep] |
| Letta | agent state | pip install memoryeval[letta] |
| Custom | You define it | See writing adapters |
Single-run results from 2026-05-27 (memeval 0.1.2, Python 3.14, macOS ARM64, Mem0 self-hosted with gpt-4o-mini). Not statistically significant -- see methodology for reproducible multi-run benchmarks.
| Finding | Detail |
|---|---|
| Mem0 recall: 1.000 | LLM fact extraction makes semantic retrieval excellent |
| Mem0 consistency: 0.917 | Stores both old and new facts -- does not auto-resolve contradictions |
| Mem0 write latency: p95 ~3,500ms | Every write calls OpenAI for extraction |
| Mem0 search latency: p95 ~500ms | Retrieval is fast once facts are indexed |
| Zep graph: async indexing | Facts not searchable immediately after write |
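Percentile figures such as the p95 values above can be derived from raw per-call timings. The sketch below uses the nearest-rank method, which is an assumption, since the percentile method memeval uses is not stated here. `percentile` and `time_calls` are hypothetical helpers, and the timed lambdas are stand-ins for real write and search calls.

```python
import math
import time


def percentile(samples: list[float], p: float) -> float:
    """Nearest-rank percentile: smallest sample with >= p% of samples at or below it."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


def time_calls(fn, n: int) -> list[float]:
    """Call fn n times and return each call's duration in milliseconds."""
    durations = []
    for _ in range(n):
        start = time.perf_counter()
        fn()
        durations.append((time.perf_counter() - start) * 1000)
    return durations


# Reads and writes are timed separately, as in the Latency dimension.
write_ms = time_calls(lambda: sum(range(1000)), n=200)  # stand-in for a write
read_ms = time_calls(lambda: sum(range(100)), n=200)    # stand-in for a search
for name, samples in (("write", write_ms), ("read", read_ms)):
    p50, p95, p99 = (percentile(samples, p) for p in (50, 95, 99))
    print(f"{name}: p50={p50:.3f}ms p95={p95:.3f}ms p99={p99:.3f}ms")
```

Keeping write and read samples in separate lists matters here: with a write p95 near 3,500ms and a search p95 near 500ms, a pooled percentile would describe neither operation.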
# Reproduce
python scripts/run_benchmark.py --adapter in_memory --adapter mem0 --output results/

Apache 2.0