Production-grade RAG backend for game lore — Aethelgard Online
Hybrid retrieval · SSE streaming · Multi-user · Security layers · MCP integration
Oracle LoreKeeper is a retrieval-augmented generation (RAG) system that answers player questions about the lore of Aethelgard Online. It retrieves grounded, source-backed answers from indexed game documents, streams them in real time, and integrates directly into Minecraft via a Paper plugin.
Key design principles:
- Single, clean answer pipeline (POST /api/ask)
- No PyTorch runtime — all embeddings/reranking via ONNX (FastEmbed)
- Config-driven: every provider, model, and feature flag is an environment variable
- Production-ready: rate limiting, semantic cache, security validation, monitoring dashboard
| Layer | Technology | Notes |
|---|---|---|
| API | FastAPI + Gunicorn/Uvicorn | Async, streaming, self-documenting |
| Vector DB | Qdrant | Cloud or local mode |
| Embeddings | FastEmbed ONNX | Multilingual MiniLM (384-dim) — no GPU required |
| Reranker | FastEmbed ONNX | Xenova/ms-marco cross-encoder — smart activation |
| Lexical search | BM25 (rank-bm25) | French stopword support |
| LLM Primary | Cerebras (llama3.1-8b) | ~500 ms, fast inference |
| LLM Fallback | Groq (llama-3.3-70b-versatile) | Auto-activated on 429s |
| Auth & events | Supabase | PostgreSQL-backed, JWT auth |
| Cache & limits | Redis + SlowAPI | Semantic cache + per-user rate limiting |
| Frontend | React + Vite + Tailwind | Served statically by FastAPI |
| Observability | Langfuse (optional) | Full pipeline tracing |
| Security | Lakera Guard (optional) + regex chunk checks | Injection detection, PII masking |
| Quality | Langfuse | LLM-as-Judge, pipeline tracing, evaluation scores |
Client → POST /api/ask
│
├── Auth (JWT or guest)
├── PII masking
├── Security validation (Lakera Guard, optional)
├── Semantic cache lookup ──► cached? return immediately
│
├── Parallel context fetch
│ ├── Conversation history (last 5 exchanges)
│ ├── User summary (150-word LLM-generated profile)
│ └── Vector memories (semantic similarity)
│
├── Query reformulation (optional, makes question self-contained)
│
├── Hybrid retrieval
│ ├── Vector search (Qdrant)
│ ├── BM25 fallback (if vector signal weak)
│ ├── RRF fusion
│ ├── Smart rerank (cross-encoder, conditionally activated)
│ └── HyDE fallback (hypothetical doc embeddings, if low score)
│
├── LLM generation → SSE stream to client
│
└── Background: persist · cache · track · update memory
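The RRF fusion step merges the vector and BM25 result lists using only ranks, so the two retrievers' incomparable scores never need normalizing. A self-contained sketch of standard Reciprocal Rank Fusion — the k=60 constant is the common default from the literature, not necessarily what this project configures:

```python
def rrf_fuse(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Merge several ranked lists of doc IDs with Reciprocal Rank Fusion.

    Each document scores sum(1 / (k + rank)) over every list it appears in,
    so items ranked highly by multiple retrievers float to the top.
    """
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# Illustrative chunk IDs: a doc both retrievers like wins.
vector_hits = ["lore_12", "lore_07", "lore_33"]
bm25_hits = ["lore_07", "lore_90"]
fused = rrf_fuse([vector_hits, bm25_hits])
# lore_07 ranks first: it appears near the top of both lists.
```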
- Python 3.11+
- Node.js 18+ and npm (required to build the frontend)
- Docker (recommended for production-like local run)
- API keys: Cerebras (LLM), Qdrant (vector DB), Supabase (auth/data)
git clone <repo-url>
cd Oracle-LoreKeeper
python -m venv venv
# Windows:
venv\Scripts\activate
# Linux/macOS:
source venv/bin/activate
pip install -r requirements.txt
# Build the frontend (requires Node.js 18+)
cd src/frontend-react && npm install && npm run build && cd ../..
cp .env.example .env
# Edit .env — minimum required keys:
# LLM_API_KEY, QDRANT_URL, QDRANT_API_KEY, SUPABASE_URL, SUPABASE_SERVICE_ROLE_KEY
See docs/DOCUMENTATION.md for full environment variable reference.
Development:
python main.py
# → API + frontend at http://localhost:8000
Production (Docker):
docker compose up --build
# → API at http://localhost:8000
# → MCP server at http://localhost:8001
Makefile shortcuts:
make setup # First-time setup (venv + .env + index)
make run # Dev server
make docker-up # Docker start
make test # Run unit tests
make index # Force reindex

| Endpoint | Method | Auth | Description |
|---|---|---|---|
| /health | GET | — | Component health check |
| /api/ask | POST | JWT / guest | Main RAG endpoint (SSE stream) |
| /api/feedback | POST | JWT / guest | Submit rating (1–5) |
| /api/auth/config | GET | — | Supabase public config |
| /api/auth/me | GET | JWT | Current user ID |
| /api/swagger | GET | — | OpenAPI Swagger UI |
| /api/redoc | GET | — | OpenAPI ReDoc UI |
| /api/feedback/vote | POST | JWT / guest | Submit thumbs up/down (-1 / +1) by trace_id |
| /api/conversations | GET | JWT / guest | Get conversation history for a session |
| /api/conversations/list | GET | JWT / guest | List all conversations for current user |
| /api/conversations/messages | GET | JWT / guest | Get raw messages for a session |
| /api/conversations | DELETE | JWT / guest | Delete conversation history |
| /api/reindex | POST | monitoring key | Force reindex of data/sample/ |
| /api/monitoring/stats | GET | monitoring key | Global usage statistics |
| /api/monitoring/pipeline | GET | monitoring key | Retrieval pipeline details |
| /api/monitoring/features | GET | monitoring key | Full feature health dashboard |
| /api/monitoring/reformulation | GET / POST | monitoring key | Read / toggle reformulation |
| /api/monitoring/reformulation/history | GET | monitoring key | Last 20 reformulations |
| /api/monitoring/search-switches | GET | monitoring key | Current search config (read-only) |
| /api/monitoring/runtime-profile | GET | monitoring key | Active profile (fast / balanced / quality) |
| /api/monitoring/contextual-retrieval | GET | monitoring key | % of chunks with doc_summary enrichment |
| /api/monitoring/user-memories | GET | monitoring key | Last 20 user memory summaries |
| /api/monitoring/feedbacks | GET | monitoring key | Recent feedback events |
| /api/monitoring/pii | GET | monitoring key | PII masking history |
| /api/monitoring/logs | GET | monitoring key | In-memory system log buffer |
| /api/cache/stats | GET | monitoring key | Semantic cache statistics |
| /api/admin/sources | GET | monitoring key | Indexed source files |
| /api/admin/delete | DELETE | monitoring key | Remove a source file |
Request body for /api/ask:
{
"question": "Who is Alaric the Fallen?",
"session_id": "uuid-v4",
"user_id": "guest_uuid-v4"
}

Response: Server-Sent Events stream
data: {"type": "text", "text": "Alaric was..."}
data: {"type": "text", "text": " a general who..."}
data: {"type": "done", "trace_id": "...", "model": "llama3.1-8b"}
Unit suite:
python -m pytest src/test-unitaires -q

Targeted run after retrieval changes:
python -m pytest src/test-unitaires/test_search.py src/test-unitaires/test_routes.py -q

Load tests (opt-in):
# Windows:
set RUN_LOAD_TESTS=true
# Linux/macOS:
export RUN_LOAD_TESTS=true
python -m pytest src/test-unitaires/test_load.py -q

Locust load testing:
set LOCUST_BEARER_TOKEN=<your_jwt>
locust -f src/test-unitaires/locustfile.py --host http://localhost:8000

Oracle LoreKeeper exposes a Model Context Protocol server, allowing any MCP-compatible client (Claude Desktop, Cursor, etc.) to query the lore knowledge base directly as a tool.
Start the server:
# Stdio mode (local, embedded in client)
python mcp_server.py
# SSE mode (remote, network-accessible)
export MCP_TRANSPORT=sse
export MCP_PORT=8001
python mcp_server.py

Claude Desktop config (%APPDATA%\Claude\claude_desktop_config.json):
{
"mcpServers": {
"lorekeeper": {
"command": "python",
"args": ["C:/path/to/mcp_server.py"],
"env": {
"LLM_API_KEY": "csk_...",
"QDRANT_URL": "https://...",
"QDRANT_API_KEY": "..."
}
}
}
}

Once configured, Claude can invoke the Oracle knowledge base as a tool during any conversation.
Recommended .env settings for deployment:
APP_ENV=production
ENV=production
RAG_PROFILE=balanced
WEB_CONCURRENCY=2
BACKGROUND_MAX_WORKERS=8
GUNICORN_TIMEOUT=120
GUNICORN_GRACEFUL_TIMEOUT=30
GUNICORN_KEEPALIVE=10
REDIS_URL=redis://redis:6379
EMBEDDING_MODEL=sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2
QDRANT_VECTOR_SIZE=384
FASTEMBED_CACHE_PATH=/app/fastembed_cache
HF_TOKEN=hf_xxx
RERANKER_ENABLED=true
SMART_RERANK_ENABLED=true
RERANKER_MODEL=Xenova/ms-marco-MiniLM-L-6-v2
RERANKER_MAX_INPUT=4
HYDE_ENABLED=true
HYDE_TIMEOUT_SECONDS=3.5
MAX_RESPONSE_SECONDS=10
QDRANT_AUTO_RECREATE_ON_DIM_MISMATCH=true

For stable model downloads and startup speed in production:
- Set HF_TOKEN in your deployment environment.
- Keep FASTEMBED_CACHE_PATH on a persistent volume (example path: /app/fastembed_cache).
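Since every provider, model, and feature flag is an environment variable, a pair of small typed accessors keeps flag parsing consistent. An illustrative sketch using variable names from the settings above (the helper names are hypothetical, not the project's actual config module):

```python
import os

def env_bool(name: str, default: bool = False) -> bool:
    """Parse a boolean feature flag such as RERANKER_ENABLED=true."""
    return os.getenv(name, str(default)).strip().lower() in {"1", "true", "yes", "on"}

def env_float(name: str, default: float) -> float:
    """Parse a numeric setting, falling back to a default on missing/bad values."""
    try:
        return float(os.getenv(name, ""))
    except ValueError:
        return default

# Example: read the HyDE settings shown in the block above.
os.environ["HYDE_ENABLED"] = "true"
os.environ["HYDE_TIMEOUT_SECONDS"] = "3.5"
hyde_on = env_bool("HYDE_ENABLED")
hyde_timeout = env_float("HYDE_TIMEOUT_SECONDS", 3.0)
```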
- .env and .env.* are git-ignored
- .env.example is versioned with all variables documented
- No hardcoded API keys or base URLs anywhere in the codebase
No CI/CD workflow is currently configured in this repository.
See docs/DOCUMENTATION.md for:
- Detailed architecture per component
- Full environment variable reference
- Deployment checklist (Coolify)
- Ingestion pipeline deep-dive
- Security layers explained
- Troubleshooting guide
- MCP server setup