Skip to content

SketchOTP/mimir

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

22 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

Mimir

Observable long-term memory infrastructure for AI systems.

Mimir prioritizes measurable retrieval behavior, transparent tradeoffs, and observable memory operations over opaque “magic memory” claims.

Mimir is an open-source memory and retrieval orchestration platform for MCP-compatible agents. It provides lifecycle-aware memory management, multi-provider retrieval with inspectable pipelines, trust-weighted ranking, and comparative benchmarks — so recall quality is measurable, not marketed.

Self-hostable. No cloud dependency. OAuth or API-key auth for Cursor, Claude Code, and other MCP clients.


What Mimir is

Mimir stores episodic, semantic, and procedural memories, orchestrates retrieval across six providers, applies lifecycle and trust policies, and returns token-budgeted context with a debug trace explaining what was ranked and why.

It is infrastructure for long-horizon agent memory — not a black-box RAG wrapper and not a simulation of human cognition.


What makes it different

Theme What you get
Transparency Per-recall debug: providers used, exclusions, agreement scores, token cost
Observability Telemetry dashboard: provider usefulness, drift, retrieval heatmaps
Benchmarkability Fixture harness comparing naive_rag, vector_only, conversational, and mimir
Lifecycle Active → aging → stale → archived; stale suppression at retrieval time
Trust Trust scores, verification status, quarantine — retrieval is trust-weighted
Token efficiency Context builder enforces budgets; benchmarks report token cost per query

Benchmarks

Comparative retrieval evaluation lives under benchmarks/retrieval/.

Resource Description
benchmarks/retrieval/README.md Systems, metrics, honesty policy
benchmarks/retrieval/reports/latest.md Latest local run (regenerate with make bench-retrieval)
benchmarks/retrieval/reports/sample_v1.md Sample output for GitHub visitors (fixture run, timestamped)
docs/BENCHMARK_WALKTHROUGH.md How to run, read reports, and interpret weak spots
docs/TOKEN_EFFICIENCY.md Measured token cost vs MRR (fixture data)
docs/RETRIEVAL_TRACE_GUIDE.md Export and interpret retrieval traces
docs/COLD_START_VALIDATION.md Clean-machine Docker validation
make bench-retrieval
# or: python -m benchmarks.retrieval.runners.cli --seed 42

Policy: Fixture-based comparisons with documented weak spots (e.g. orchestration latency). No fabricated superiority claims. See sample report disclaimer before citing numbers externally.


Current limitations

Honest scope for v0.1.0-rc — not a production-scale evaluation platform yet.

  • Benchmarks are fixture-based (standard_v1 / standard_v2), not production corpora at scale
  • Retrieval latency is often higher than naive RAG / vector-only due to multi-provider orchestration
  • Token efficiency varies by scenario; Mimir may inject more tokens than baselines while improving rank quality (see docs/TOKEN_EFFICIENCY.md)
  • standard_v2 is synthetic but noisier than v1 — still deterministic, not real user traffic
  • Experimental lifecycle / consolidation subsystems exist in-tree but are not required for core OSS memory + retrieval
  • Dashboard trace UX is improving; full per-provider candidate replay is richest on live POST /api/events/recall with token_budget

Quick start (Docker) — recommended

git clone https://github.com/SketchOTP/mimir
cd mimir
cp .env.example .env
docker compose up -d
./scripts/doctor.sh
Resource URL
API health http://127.0.0.1:8787/health
Dashboard http://127.0.0.1:5173
Telemetry http://127.0.0.1:5173/telemetry (after recalls)
API docs http://127.0.0.1:8787/api/docs

Timing: ~20–40s startup with pre-built images; first docker compose build may take several minutes (embeddings stack).
RAM: ~4 GB minimum, ~8 GB comfortable. CPU-only is supported.
Details: docs/COLD_START_VALIDATION.md

Optional API-key owner (multi-user deployments):

docker compose exec api python -m mimir.auth.create_owner \
  --email you@example.com --display-name "Your Name"

Healthchecks

curl -s http://127.0.0.1:8787/health
curl -s http://127.0.0.1:8787/api/telemetry/retrieval/stats
make bench-retrieval   # or: python3 -m benchmarks.retrieval.runners.cli

Add to Cursor — Settings → MCP → Add Server:

{
  "mcpServers": {
    "mimir": {
      "url": "http://127.0.0.1:8787/mcp"
    }
  }
}

For SSH or headless setups, use Bearer API-key auth — see docs/CURSOR_MCP_SETUP.md.

Quick start (local dev)

python3 -m venv .venv && source .venv/bin/activate
pip install -e ".[dev]"
cp .env.example .env
alembic upgrade head
make dev    # API :8787
make web    # UI :5173 (optional, second terminal)

First recall may load embedding model (~30s on CPU). Use examples/retrieval_debugger/ to inspect traces.

Integration examples

Example Path
OpenAI chat + memory examples/openai_chat_memory/
Claude + memory examples/claude_memory/
Local LLM (Ollama) examples/local_llm_memory/
Agent recall loop examples/agent_memory_loop/
Trace debugger examples/retrieval_debugger/
Token budgeting examples/token_budgeting/

Architecture

High-level map (detail: ARCHITECTURE.md):

┌──────────────┐     ┌─────────────────────────────────────────────┐
│ MCP / REST   │────▶│ FastAPI — auth, MCP HTTP, memory API        │
│ clients      │     └──────────────┬──────────────────────────────┘
└──────────────┘                    │
                    ┌───────────────┼───────────────┐
                    ▼               ▼               ▼
             ┌──────────┐   ┌─────────────┐  ┌──────────┐
             │ Ingestion│   │ Retrieval   │  │ Graph    │
             │ + layers │   │ orchestrator│  │ memory   │
             └────┬─────┘   └──────┬──────┘  └────┬─────┘
                  │                │              │
                  └────────────────┼──────────────┘
                                   ▼
                    ┌──────────────────────────────┐
                    │ SQLite/Postgres + ChromaDB   │
                    │ lifecycle · telemetry · API  │
                    └──────────────────────────────┘
                                   ▲
                    ┌──────────────┴──────────────┐
                    │ worker — consolidate,       │
                    │ lifecycle, reflect, graph   │
                    └─────────────────────────────┘

Retrieval flow

query → task categorization → adaptive provider weights
     → 6 providers (async) → merge / dedupe → trust & lifecycle filter
     → confidence scoring → token budget → context string + debug trace

Observability

Mimir surfaces retrieval behavior in the API and dashboard — not only final context text.

Signal Where
Retrieval traces debug on recall (providers, exclusions, ranked IDs)
Confidence / agreement Cross-provider agreement in debug + telemetry
Token usage Per-session token cost vs budget
Provider weighting Adaptive weights by task category; dashboard provider stats
Lifecycle metadata Memory state, trust, verification on browse views
Weak spots Auto-listed in benchmark reports when Mimir underperforms baselines

Telemetry and provider observability

Open Telemetry in the web UI after a few recall operations, or inspect recall debug in API responses.


Capabilities

Capability What it means
Three memory layers Episodic, semantic, procedural — auto-classified at write time
Knowledge graph Entities and relationships; graph-aware retrieval
Multi-source retrieval Six providers fused with adaptive, task-aware weights
Trust scoring Trust, confidence, verification status — retrieval is trust-weighted
Adversarial quarantine Seven pattern classes blocked before storage
Memory lifecycle Four-stage state machine with recency and retrieval boosts
Offline consolidation Nightly dedup, chain compression, trust updates from feedback
Reflection + contradictions Async contradiction detection and improvement proposals
Skills + approvals Reusable procedures and human gates for high-risk actions
OAuth 2.1 / PKCE + API keys Browser OAuth or Bearer for SSH/headless
Multi-user isolation user_id scoped across stores and workers
React PWA dashboard Memories, telemetry, approvals, timeline
REST + MCP + Python SDK HTTP, MCP Streamable HTTP, programmatic SDK

MCP tools

Tool What it does
memory.remember Store an event or fact; layer auto-classified
memory.recall Retrieve relevant memories — token-budgeted context + debug
memory.search Semantic search with optional layer filter
memory.record_outcome Record task outcome; feeds trust and reflection
skill.list List approved procedures for the project
approval.request / approval.status Human gate for high-risk actions
reflection.log Log observations for offline analysis
improvement.propose Propose system-level changes (approval required)

Design rationale (selected)

Technical influences — stated as engineering choices, not biological claims:

  • Layered stores — separate write/retrieval paths for events, facts, and procedures (Tulving-style taxonomy as data model, not cognition simulation).
  • Offline consolidation — nightly integration of episodic traces into durable knowledge without silent deletion of high-trust items.
  • Multi-provider retrieval — mixture-of-experts style routing; weights adapt from task outcome feedback.
  • Lifecycle + forgetting — recency, retrieval frequency, and trust modulate active vs stale vs archived.
  • Quarantine at write time — injection, credential, and policy-overwrite patterns blocked before storage.

Experimental subsystems (consolidation research, architecture governor) are documented under docs/architecture/ and are not required for core OSS memory + retrieval.


Memory layers (reference)

Layer Use
Episodic Session events, outcomes, temporal logs
Semantic Facts, preferences, rules, identity
Procedural Workflows, runbooks, promoted patterns
Graph Entity relationships and multi-hop context

Background workers

Worker Schedule Role
consolidator Nightly Dedup, chain compression, trust from feedback
reflector Every 30 min Contradictions, improvement proposals
lifecycle Nightly Aging, decay, supersession, deletion
procedural_promoter Nightly Promote validated episodic patterns
graph_builder Nightly Extract graph from corpus

Safety

  • Quarantine is permanent — updates cannot reactivate quarantined memories
  • System mutation endpoints off by default in production
  • High-trust memories not silently deleted by consolidator
  • Cross-user isolation at the DB layer

See docs/SECURITY.md · docs/MULTI_USER_SECURITY.md


Documentation

Doc Purpose
ARCHITECTURE.md Subsystems, retrieval pipeline, lifecycle, benchmarks
docs/BENCHMARK_WALKTHROUGH.md Run and interpret retrieval benchmarks
CONTRIBUTING.md Setup, tests, PR expectations
ROADMAP.md Near-term OSS priorities
BENCHMARK_RESULTS.md Eval harness + retrieval numbers (evidence policy)
docs/SELF_HOSTING.md Local and production deployment
docs/CURSOR_MCP_SETUP.md Cursor MCP configuration

Development

make dev              # API hot-reload :8787
make web              # Vite dev :5173
make test             # pytest
make evals            # 8-suite eval harness
make bench-retrieval  # comparative retrieval benchmarks
make gate             # release gate

Tech stack

Layer Technology
API Python 3.12, FastAPI, Uvicorn
Storage SQLAlchemy 2 async, SQLite / Postgres, ChromaDB
Jobs APScheduler
Frontend React 18, TypeScript, Vite, PWA
Auth OAuth 2.1 / PKCE, API keys
MCP Streamable HTTP (JSON-RPC 2.0)

License

Apache-2.0 — see LICENSE.

Security: GitHub Security Advisory · docs/SECURITY.md

About

Self-hosted MCP memory server for AI coding assistants. Adds long-term memory, retrieval, procedural learning, approvals, rollback, graph reasoning, telemetry, and multi-user auth for Cursor and other MCP clients.

Resources

License

Contributing

Security policy

Stars

Watchers

Forks

Packages

 
 
 

Contributors