Hi MemOS team,
The three-state memory model (Activation / Plaintext / Parameter) is one of the most ambitious memory architectures I've seen. As the system matures, I think there's a growing need for standardized benchmarking of cross-state retrieval quality.
Specifically: when MemScheduler migrates memories between states, how do we verify that the right memories are accessible at the right time? This is hard to evaluate with ad-hoc tests because:
- Activation memory (KV-Cache) has different access patterns than Plaintext memory (vector search)
- The scheduler's migration decisions (what gets promoted/demoted) directly impact retrieval quality
- Parameter memory (LoRA) effectiveness is difficult to measure quantitatively
I've been working on MemTest, a system for designing benchmark databases for AI memory evaluation. It provides:
- Controlled test databases with known ground truth (21,793 memories from Chinese classical literature, 750 queries)
- 6 evaluation dimensions including temporal retrieval, forgetting curves, and multi-hop reasoning
- Corpus-driven builder that generates test data from any text corpus
- Adapter pattern: implement 3 methods and get a full evaluation report
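
To make the adapter idea concrete, here is a minimal sketch of what a three-method adapter and a precision-at-k run could look like. The method names (`reset`, `add`, `search`), the `KeywordAdapter` toy backend, and `precision_at_k` are all hypothetical and chosen for illustration; they are not MemTest's actual API, and a MemOS adapter would wrap the real retrieval path instead of word overlap.

```python
from typing import Protocol


class MemoryAdapter(Protocol):
    """Hypothetical three-method interface (illustrative names only)."""

    def reset(self) -> None: ...
    def add(self, memory_id: str, text: str) -> None: ...
    def search(self, query: str, k: int) -> list[str]: ...


class KeywordAdapter:
    """Toy backend that ranks memories by word overlap with the query."""

    def __init__(self) -> None:
        self._store: dict[str, str] = {}

    def reset(self) -> None:
        self._store.clear()

    def add(self, memory_id: str, text: str) -> None:
        self._store[memory_id] = text

    def search(self, query: str, k: int) -> list[str]:
        q = set(query.lower().split())
        scored = [
            (len(q & set(text.lower().split())), mid)
            for mid, text in self._store.items()
        ]
        # Highest overlap first; ties broken by id so results are deterministic.
        scored.sort(key=lambda s: (-s[0], s[1]))
        return [mid for score, mid in scored[:k] if score > 0]


def precision_at_k(adapter: MemoryAdapter, memories, queries, k: int = 3) -> float:
    """Mean fraction of returned ids that appear in the ground-truth set."""
    adapter.reset()
    for mid, text in memories.items():
        adapter.add(mid, text)
    total = 0.0
    for query, relevant in queries:
        hits = adapter.search(query, k)
        if hits:  # a query with no results contributes 0
            total += sum(1 for h in hits if h in relevant) / len(hits)
    return total / len(queries)


memories = {
    "m1": "the scheduler promotes hot memories",
    "m2": "lora adapters store parameter memory",
    "m3": "vector search over plaintext",
}
queries = [("scheduler promotes", {"m1"}), ("vector search", {"m3"})]
score = precision_at_k(KeywordAdapter(), memories, queries, k=1)
```

Because the harness only touches the three methods, the same ground-truth set could in principle be run against an Activation-backed and a Plaintext-backed adapter and the scores compared directly.
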
For MemOS, benchmarks could help:
- Evaluate whether MemScheduler's state transitions preserve retrieval quality
- Compare retrieval accuracy across Activation vs Plaintext states
- Track regression as the scheduler logic evolves
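
As a rough sketch of the regression-tracking point, per-query scores from a baseline scheduler build can be compared against a candidate build, flagging any query whose score drops by more than a tolerance. The function name `find_regressions`, the score dictionaries, and the 0.05 default are assumptions made for illustration, not part of MemTest or MemOS.

```python
def find_regressions(baseline, candidate, tolerance=0.05):
    """Return {query_id: (old, new)} for queries whose score fell by more than tolerance.

    A query missing from the candidate run is treated as scoring 0.0.
    """
    return {
        qid: (old, candidate.get(qid, 0.0))
        for qid, old in baseline.items()
        if old - candidate.get(qid, 0.0) > tolerance
    }


# Hypothetical per-query scores before and after a scheduler change.
baseline_scores = {"q1": 0.9, "q2": 0.8, "q3": 0.5}
candidate_scores = {"q1": 0.9, "q2": 0.6}
regressions = find_regressions(baseline_scores, candidate_scores)
```

Reporting per-query drops rather than a single aggregate makes it easier to see which kinds of memories a given promotion or demotion policy is hurting.
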
We found that retrieval method choice matters enormously: TF-IDF + LLM reranking achieves 87% precision vs 9.1% for sentence-transformers on Chinese text, a roughly 10x difference that would be invisible without systematic evaluation.
Would standardized benchmarking align with your roadmap? Happy to discuss how MemTest could adapt to MemOS's unique three-state architecture.