Observable long-term memory infrastructure for AI systems.
Mimir prioritizes measurable retrieval behavior, transparent tradeoffs, and observable memory operations over opaque “magic memory” claims.
Mimir is an open-source memory and retrieval orchestration platform for MCP-compatible agents. It provides lifecycle-aware memory management, multi-provider retrieval with inspectable pipelines, trust-weighted ranking, and comparative benchmarks — so recall quality is measurable, not marketed.
Self-hostable. No cloud dependency. OAuth or API-key auth for Cursor, Claude Code, and other MCP clients.
Mimir stores episodic, semantic, and procedural memories, orchestrates retrieval across six providers, applies lifecycle and trust policies, and returns token-budgeted context with a debug trace explaining what was ranked and why.
It is infrastructure for long-horizon agent memory — not a black-box RAG wrapper and not a simulation of human cognition.
| Theme | What you get |
|---|---|
| Transparency | Per-recall debug: providers used, exclusions, agreement scores, token cost |
| Observability | Telemetry dashboard: provider usefulness, drift, retrieval heatmaps |
| Benchmarkability | Fixture harness comparing naive_rag, vector_only, conversational, and mimir |
| Lifecycle | Active → aging → stale → archived; stale suppression at retrieval time |
| Trust | Trust scores, verification status, quarantine — retrieval is trust-weighted |
| Token efficiency | Context builder enforces budgets; benchmarks report token cost per query |
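The token-efficiency row can be illustrated with a minimal sketch of budget enforcement. This is not Mimir's context builder: `build_context`, the `(text, score)` pair shape, and the whitespace-based token estimate are all assumptions made for illustration.

```python
from typing import List, Tuple


def estimate_tokens(text: str) -> int:
    """Crude estimate: one token per whitespace-separated word (assumption)."""
    return len(text.split())


def build_context(ranked: List[Tuple[str, float]], token_budget: int) -> Tuple[str, int]:
    """Greedily pack the highest-scoring memories without exceeding the budget.

    `ranked` holds hypothetical (text, score) pairs; returns (context, tokens_used).
    """
    chosen: List[str] = []
    used = 0
    for text, _score in sorted(ranked, key=lambda item: item[1], reverse=True):
        cost = estimate_tokens(text)
        if used + cost > token_budget:
            continue  # does not fit; a smaller, lower-ranked item still might
        chosen.append(text)
        used += cost
    return "\n".join(chosen), used


context, used = build_context(
    [
        ("user prefers dark mode", 0.9),
        ("deploy runs on fridays at noon", 0.7),
        ("ok", 0.1),
    ],
    token_budget=5,
)
print(used)  # 5
```

A real builder would use the tokenizer of the target model; the greedy skip-and-continue policy is one of several reasonable choices.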
Comparative retrieval evaluation lives under benchmarks/retrieval/.
| Resource | Description |
|---|---|
| benchmarks/retrieval/README.md | Systems, metrics, honesty policy |
| benchmarks/retrieval/reports/latest.md | Latest local run (regenerate with make bench-retrieval) |
| benchmarks/retrieval/reports/sample_v1.md | Sample output for GitHub visitors (fixture run, timestamped) |
| docs/BENCHMARK_WALKTHROUGH.md | How to run, read reports, and interpret weak spots |
| docs/TOKEN_EFFICIENCY.md | Measured token cost vs MRR (fixture data) |
| docs/RETRIEVAL_TRACE_GUIDE.md | Export and interpret retrieval traces |
| docs/COLD_START_VALIDATION.md | Clean-machine Docker validation |
make bench-retrieval
# or: python -m benchmarks.retrieval.runners.cli --seed 42

Policy: Fixture-based comparisons with documented weak spots (e.g. orchestration latency). No fabricated superiority claims. See the sample report disclaimer before citing numbers externally.
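For readers unfamiliar with MRR, the rank-quality metric named in docs/TOKEN_EFFICIENCY.md, here is a minimal sketch of the standard definition. The harness's own metric code is not shown here, so the function names and the input shape are assumptions.

```python
from typing import List, Sequence, Set, Tuple


def reciprocal_rank(ranked_ids: Sequence[str], relevant: Set[str]) -> float:
    """1 / position of the first relevant ID, or 0.0 if none was retrieved."""
    for position, memory_id in enumerate(ranked_ids, start=1):
        if memory_id in relevant:
            return 1.0 / position
    return 0.0


def mean_reciprocal_rank(runs: List[Tuple[Sequence[str], Set[str]]]) -> float:
    """Average reciprocal rank over (ranked_ids, relevant_ids) pairs."""
    if not runs:
        return 0.0
    return sum(reciprocal_rank(ranked, relevant) for ranked, relevant in runs) / len(runs)


# Three hypothetical queries: hit at rank 2, hit at rank 1, and a miss.
mrr = mean_reciprocal_rank(
    [
        (["m3", "m1"], {"m1"}),
        (["m2"], {"m2"}),
        (["m9"], {"m4"}),
    ]
)
print(mrr)  # 0.5
```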
Honest scope for v0.1.0-rc — not a production-scale evaluation platform yet.
- Benchmarks are fixture-based (standard_v1 / standard_v2), not production corpora at scale
- Retrieval latency is often higher than naive RAG / vector-only due to multi-provider orchestration
- Token efficiency varies by scenario; Mimir may inject more tokens than baselines while improving rank quality (see docs/TOKEN_EFFICIENCY.md)
- standard_v2 is synthetic but noisier than v1; it is still deterministic, not real user traffic
- Experimental lifecycle / consolidation subsystems exist in-tree but are not required for core OSS memory + retrieval
- Dashboard trace UX is improving; full per-provider candidate replay is richest on live POST /api/events/recall with token_budget
git clone https://github.com/SketchOTP/mimir
cd mimir
cp .env.example .env
docker compose up -d
./scripts/doctor.sh

| Resource | URL |
|---|---|
| API health | http://127.0.0.1:8787/health |
| Dashboard | http://127.0.0.1:5173 |
| Telemetry | http://127.0.0.1:5173/telemetry (after recalls) |
| API docs | http://127.0.0.1:8787/api/docs |
Timing: ~20–40s startup with pre-built images; first docker compose build may take several minutes (embeddings stack).
RAM: ~4 GB minimum, ~8 GB comfortable. CPU-only is supported.
Details: docs/COLD_START_VALIDATION.md
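Because startup can take 20 to 40 seconds, a script that depends on the API may want to poll the health endpoint first. The sketch below uses only the standard library and checks the HTTP status code alone; the shape of the response body is not assumed. The helper names are hypothetical.

```python
import time
import urllib.error
import urllib.request
from typing import Callable

HEALTH_URL = "http://127.0.0.1:8787/health"  # API health URL from the table above


def http_status(url: str) -> int:
    """Return the HTTP status of a GET, or 0 if the server is unreachable."""
    try:
        with urllib.request.urlopen(url, timeout=2) as response:
            return response.status
    except urllib.error.HTTPError as exc:
        return exc.code  # server answered, but with an error status
    except (urllib.error.URLError, OSError):
        return 0  # connection refused, timeout, DNS failure, ...


def wait_until_healthy(
    url: str = HEALTH_URL,
    attempts: int = 20,
    delay_s: float = 2.0,
    fetch: Callable[[str], int] = http_status,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Poll `url` until it returns 200 or the attempts run out."""
    for attempt in range(attempts):
        if fetch(url) == 200:
            return True
        if attempt < attempts - 1:
            sleep(delay_s)
    return False


# Example (after `docker compose up -d`): wait_until_healthy()
```

The `fetch` and `sleep` parameters exist only so the loop can be exercised without a running server.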
Optional API-key owner (multi-user deployments):
docker compose exec api python -m mimir.auth.create_owner \
  --email you@example.com --display-name "Your Name"

Health checks:
curl -s http://127.0.0.1:8787/health
curl -s http://127.0.0.1:8787/api/telemetry/retrieval/stats
make bench-retrieval   # or: python3 -m benchmarks.retrieval.runners.cli

Add to Cursor (Settings → MCP → Add Server):
{
"mcpServers": {
"mimir": {
"url": "http://127.0.0.1:8787/mcp"
}
}
}

For SSH or headless setups, use Bearer API-key auth; see docs/CURSOR_MCP_SETUP.md.
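For clients other than Cursor, a Bearer-authenticated MCP call can be sketched as a plain JSON-RPC 2.0 request. The `tools/call` method and its `name` / `arguments` parameters come from the MCP specification; the argument names `query` and `token_budget` and the helper functions are assumptions, and a real session also needs the MCP `initialize` handshake before any tool call.

```python
import json
import urllib.request
from typing import Any, Dict

MCP_URL = "http://127.0.0.1:8787/mcp"  # same URL as the Cursor config above


def build_tool_call(tool: str, arguments: Dict[str, Any], request_id: int = 1) -> Dict[str, Any]:
    """Build a JSON-RPC 2.0 `tools/call` payload."""
    return {
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "tools/call",
        "params": {"name": tool, "arguments": arguments},
    }


def build_request(payload: Dict[str, Any], api_key: str) -> urllib.request.Request:
    """Wrap the payload in a POST request carrying Bearer API-key auth."""
    return urllib.request.Request(
        MCP_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
            # Streamable HTTP clients must accept both response types.
            "Accept": "application/json, text/event-stream",
        },
        method="POST",
    )


# Hypothetical arguments; check the tool schema your server reports.
payload = build_tool_call("memory.recall", {"query": "deploy schedule", "token_budget": 800})
request = build_request(payload, api_key="YOUR_API_KEY")
# urllib.request.urlopen(request) would send it once the server is running.
```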
python3 -m venv .venv && source .venv/bin/activate
pip install -e ".[dev]"
cp .env.example .env
alembic upgrade head
make dev # API :8787
make web   # UI :5173 (optional, second terminal)

The first recall may load the embedding model (~30s on CPU). Use examples/retrieval_debugger/ to inspect traces.
| Example | Path |
|---|---|
| OpenAI chat + memory | examples/openai_chat_memory/ |
| Claude + memory | examples/claude_memory/ |
| Local LLM (Ollama) | examples/local_llm_memory/ |
| Agent recall loop | examples/agent_memory_loop/ |
| Trace debugger | examples/retrieval_debugger/ |
| Token budgeting | examples/token_budgeting/ |
High-level map (detail: ARCHITECTURE.md):
┌──────────────┐ ┌─────────────────────────────────────────────┐
│ MCP / REST │────▶│ FastAPI — auth, MCP HTTP, memory API │
│ clients │ └──────────────┬──────────────────────────────┘
└──────────────┘ │
┌───────────────┼───────────────┐
▼ ▼ ▼
┌──────────┐ ┌─────────────┐ ┌──────────┐
│ Ingestion│ │ Retrieval │ │ Graph │
│ + layers │ │ orchestrator│ │ memory │
└────┬─────┘ └──────┬──────┘ └────┬─────┘
│ │ │
└────────────────┼──────────────┘
▼
┌──────────────────────────────┐
│ SQLite/Postgres + ChromaDB │
│ lifecycle · telemetry · API │
└──────────────────────────────┘
▲
┌──────────────┴──────────────┐
│ worker — consolidate, │
│ lifecycle, reflect, graph │
└─────────────────────────────┘
query → task categorization → adaptive provider weights
→ 6 providers (async) → merge / dedupe → trust & lifecycle filter
→ confidence scoring → token budget → context string + debug trace
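The middle of this pipeline (async fan-out, merge / dedupe, trust and lifecycle filtering, debug trace) can be sketched as follows. This is an illustration under stated assumptions, not Mimir's orchestrator: the `Candidate` shape, the provider names `vector` and `keyword`, and the rule of summing weighted scores per memory ID are all hypothetical.

```python
import asyncio
from dataclasses import dataclass
from typing import Awaitable, Callable, Dict, List


@dataclass
class Candidate:
    memory_id: str
    score: float  # provider-local relevance in [0, 1]
    trust: float  # trust score in [0, 1]
    state: str    # "active", "aging", "stale", or "archived"


Provider = Callable[[str], Awaitable[List[Candidate]]]


async def recall(query: str, providers: Dict[str, Provider], weights: Dict[str, float]) -> dict:
    names = list(providers)
    # Fan out to every provider concurrently.
    results = await asyncio.gather(*(providers[name](query) for name in names))

    merged: Dict[str, float] = {}
    meta: Dict[str, Candidate] = {}
    for name, candidates in zip(names, results):
        for cand in candidates:
            # Merge / dedupe: sum weighted scores for the same memory ID.
            merged[cand.memory_id] = merged.get(cand.memory_id, 0.0) + weights.get(name, 1.0) * cand.score
            meta[cand.memory_id] = cand

    ranked, excluded = [], []
    for memory_id, score in merged.items():
        cand = meta[memory_id]
        if cand.state in ("stale", "archived"):
            excluded.append({"id": memory_id, "reason": cand.state})  # lifecycle filter
            continue
        ranked.append((memory_id, score * cand.trust))  # trust-weighted score
    ranked.sort(key=lambda item: item[1], reverse=True)
    return {
        "ranked_ids": [memory_id for memory_id, _ in ranked],
        "debug": {"providers": names, "excluded": excluded},
    }


async def vector(query: str) -> List[Candidate]:
    return [Candidate("m1", 0.9, 0.5, "active"), Candidate("m2", 0.8, 1.0, "active")]


async def keyword(query: str) -> List[Candidate]:
    return [Candidate("m2", 0.6, 1.0, "active"), Candidate("m3", 0.9, 1.0, "stale")]


result = asyncio.run(
    recall("deploy schedule", {"vector": vector, "keyword": keyword}, {"vector": 1.0, "keyword": 0.5})
)
print(result["ranked_ids"])  # ['m2', 'm1']
```

Here `m2` outranks `m1` because two providers agree on it and its trust is higher, while `m3` is suppressed as stale and recorded in the debug trace rather than silently dropped.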
Mimir surfaces retrieval behavior in the API and dashboard — not only final context text.
| Signal | Where |
|---|---|
| Retrieval traces | debug on recall (providers, exclusions, ranked IDs) |
| Confidence / agreement | Cross-provider agreement in debug + telemetry |
| Token usage | Per-session token cost vs budget |
| Provider weighting | Adaptive weights by task category; dashboard provider stats |
| Lifecycle metadata | Memory state, trust, verification on browse views |
| Weak spots | Auto-listed in benchmark reports when Mimir underperforms baselines |
Open Telemetry in the web UI after a few recall operations, or inspect recall debug in API responses.
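One simple way to think about cross-provider agreement is the fraction of providers that returned a given memory. The definition below is an assumption for illustration; the formula Mimir reports in its debug output may differ, and the provider names are hypothetical.

```python
from typing import Dict, List


def agreement_scores(provider_hits: Dict[str, List[str]]) -> Dict[str, float]:
    """Fraction of providers that returned each memory ID (hypothetical definition)."""
    total = len(provider_hits)
    if total == 0:
        return {}
    counts: Dict[str, int] = {}
    for ids in provider_hits.values():
        for memory_id in set(ids):  # count each provider at most once per memory
            counts[memory_id] = counts.get(memory_id, 0) + 1
    return {memory_id: n / total for memory_id, n in counts.items()}


scores = agreement_scores(
    {
        "vector": ["m1", "m2"],
        "keyword": ["m2"],
        "graph": ["m2", "m3"],
        "recency": [],
    }
)
print(sorted(scores.items()))  # [('m1', 0.25), ('m2', 0.75), ('m3', 0.25)]
```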
| Capability | What it means |
|---|---|
| Three memory layers | Episodic, semantic, procedural — auto-classified at write time |
| Knowledge graph | Entities and relationships; graph-aware retrieval |
| Multi-source retrieval | Six providers fused with adaptive, task-aware weights |
| Trust scoring | Trust, confidence, verification status — retrieval is trust-weighted |
| Adversarial quarantine | Seven pattern classes blocked before storage |
| Memory lifecycle | Four-stage state machine with recency and retrieval boosts |
| Offline consolidation | Nightly dedup, chain compression, trust updates from feedback |
| Reflection + contradictions | Async contradiction detection and improvement proposals |
| Skills + approvals | Reusable procedures and human gates for high-risk actions |
| OAuth 2.1 / PKCE + API keys | Browser OAuth or Bearer for SSH/headless |
| Multi-user isolation | user_id scoped across stores and workers |
| React PWA dashboard | Memories, telemetry, approvals, timeline |
| REST + MCP + Python SDK | HTTP, MCP Streamable HTTP, programmatic SDK |
| Tool | What it does |
|---|---|
| memory.remember | Store an event or fact; layer auto-classified |
| memory.recall | Retrieve relevant memories: token-budgeted context + debug |
| memory.search | Semantic search with optional layer filter |
| memory.record_outcome | Record task outcome; feeds trust and reflection |
| skill.list | List approved procedures for the project |
| approval.request / approval.status | Human gate for high-risk actions |
| reflection.log | Log observations for offline analysis |
| improvement.propose | Propose system-level changes (approval required) |
Technical influences — stated as engineering choices, not biological claims:
- Layered stores — separate write/retrieval paths for events, facts, and procedures (Tulving-style taxonomy as data model, not cognition simulation).
- Offline consolidation — nightly integration of episodic traces into durable knowledge without silent deletion of high-trust items.
- Multi-provider retrieval — mixture-of-experts style routing; weights adapt from task outcome feedback.
- Lifecycle + forgetting — recency, retrieval frequency, and trust modulate active vs stale vs archived.
- Quarantine at write time — injection, credential, and policy-overwrite patterns blocked before storage.
Experimental subsystems (consolidation research, architecture governor) are documented under docs/architecture/ and are not required for core OSS memory + retrieval.
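To make the lifecycle bullet concrete, here is a minimal sketch of how recency, retrieval frequency, and trust could jointly pick one of the four states. The thresholds, the boost formula, and the `MemoryStats` shape are invented for illustration and are not Mimir's actual policy.

```python
from dataclasses import dataclass

# (upper bound on effective age in days, state); values are illustrative only.
THRESHOLDS = [(7, "active"), (30, "aging"), (90, "stale")]


@dataclass
class MemoryStats:
    days_since_access: float
    retrieval_count: int
    trust: float  # in [0, 1]


def lifecycle_state(stats: MemoryStats) -> str:
    """Map usage statistics to active / aging / stale / archived."""
    # Frequent retrieval and high trust slow down aging.
    boost = 1.0 + 0.1 * min(stats.retrieval_count, 10) + stats.trust
    effective_age = stats.days_since_access / boost
    for limit, state in THRESHOLDS:
        if effective_age < limit:
            return state
    return "archived"


print(lifecycle_state(MemoryStats(60, 10, 1.0)))  # aging
print(lifecycle_state(MemoryStats(60, 0, 0.0)))   # stale
```

The two calls show the intended effect: the same 60-day-old memory stays in `aging` when it is trusted and frequently retrieved, but becomes `stale` when it is neither.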
| Layer | Use |
|---|---|
| Episodic | Session events, outcomes, temporal logs |
| Semantic | Facts, preferences, rules, identity |
| Procedural | Workflows, runbooks, promoted patterns |
| Graph | Entity relationships and multi-hop context |
| Worker | Schedule | Role |
|---|---|---|
| consolidator | Nightly | Dedup, chain compression, trust from feedback |
| reflector | Every 30 min | Contradictions, improvement proposals |
| lifecycle | Nightly | Aging, decay, supersession, deletion |
| procedural_promoter | Nightly | Promote validated episodic patterns |
| graph_builder | Nightly | Extract graph from corpus |
- Quarantine is permanent — updates cannot reactivate quarantined memories
- System mutation endpoints off by default in production
- High-trust memories not silently deleted by consolidator
- Cross-user isolation at the DB layer
See docs/SECURITY.md · docs/MULTI_USER_SECURITY.md
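The first guarantee, permanent quarantine, can be sketched with a toy in-memory store. The three regexes stand in for the injection, credential, and policy-overwrite classes mentioned earlier; they are illustrative patterns, not the seven classes Mimir ships, and `MemoryStore` is a hypothetical name.

```python
import re
from typing import Dict, Optional

# Illustrative patterns only; real pattern classes live in the project itself.
PATTERNS = {
    "injection": re.compile(r"ignore (all )?previous instructions", re.IGNORECASE),
    "credential": re.compile(r"\b(api[_-]?key|password|secret)\s*[:=]\s*\S+", re.IGNORECASE),
    "policy_overwrite": re.compile(r"(override|disable) (the )?(safety|security) polic(y|ies)", re.IGNORECASE),
}


def quarantine_reason(text: str) -> Optional[str]:
    """Return the name of the first matching pattern class, or None."""
    for name, pattern in PATTERNS.items():
        if pattern.search(text):
            return name
    return None


class MemoryStore:
    def __init__(self) -> None:
        self.records: Dict[str, dict] = {}

    def write(self, memory_id: str, text: str) -> str:
        existing = self.records.get(memory_id)
        if existing and existing["status"] == "quarantined":
            return "quarantined"  # permanent: later updates cannot reactivate it
        reason = quarantine_reason(text)
        status = "quarantined" if reason else "active"
        # Flagged content is not kept; only the status and reason are recorded.
        self.records[memory_id] = {"text": None if reason else text, "status": status, "reason": reason}
        return status


store = MemoryStore()
print(store.write("m1", "password = hunter2"))      # quarantined
print(store.write("m1", "harmless update"))         # quarantined
print(store.write("m2", "user prefers dark mode"))  # active
```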
| Doc | Purpose |
|---|---|
| ARCHITECTURE.md | Subsystems, retrieval pipeline, lifecycle, benchmarks |
| docs/BENCHMARK_WALKTHROUGH.md | Run and interpret retrieval benchmarks |
| CONTRIBUTING.md | Setup, tests, PR expectations |
| ROADMAP.md | Near-term OSS priorities |
| BENCHMARK_RESULTS.md | Eval harness + retrieval numbers (evidence policy) |
| docs/SELF_HOSTING.md | Local and production deployment |
| docs/CURSOR_MCP_SETUP.md | Cursor MCP configuration |
make dev # API hot-reload :8787
make web # Vite dev :5173
make test # pytest
make evals # 8-suite eval harness
make bench-retrieval # comparative retrieval benchmarks
make gate            # release gate

| Layer | Technology |
|---|---|
| API | Python 3.12, FastAPI, Uvicorn |
| Storage | SQLAlchemy 2 async, SQLite / Postgres, ChromaDB |
| Jobs | APScheduler |
| Frontend | React 18, TypeScript, Vite, PWA |
| Auth | OAuth 2.1 / PKCE, API keys |
| MCP | Streamable HTTP (JSON-RPC 2.0) |
Apache-2.0 — see LICENSE.
Security: GitHub Security Advisory · docs/SECURITY.md
