Local document RAG with hybrid retrieval, cross-encoder reranking, and inline clickable citations. Single-user, fully containerised.
Stack: Docling · Crawl4AI · faster-whisper · BGE-M3 · Qdrant · BGE-reranker-v2-m3 · OpenAI · FastAPI · Next.js · Celery · Postgres · Redis · Ragas
    apps/
      api/        FastAPI: ingest, chat (SSE), library, jobs (SSE), originals
      worker/     Celery: parse → chunk → embed → upsert
      web/        Next.js + Tailwind: Sources, Library, Chat, Settings
    packages/
      rag-core/   Reusable Python library: parsers, chunking, embedding,
                  vectorstore, reranker, retrieval, generator, pipeline
    infra/
      docker-compose.yml + Dockerfile.gpu + Dockerfile.web
    eval/         Ragas golden-set runner + run differ
    data/
      originals/  Uploaded source files (one dir per document)
      models/     Embedder/reranker/Whisper weights cache
    docs/         architecture, ingestion, retrieval, eval, runbook
Prereqs: Docker (for infra only), uv, pnpm, make, an OpenAI API key,
~16 GB VRAM on the host for the embedder + reranker (+ Whisper for audio).
No NVIDIA Container Toolkit needed — api/worker run on the host venv with
direct GPU access.
    cp .env.example .env   # set OPENAI_API_KEY
    make install           # uv sync + pnpm install
    make up                # qdrant + postgres + redis (in Docker)
    make migrate           # apply Postgres schema (once)

Then run the app on the host (three terminals):
    make api      # uvicorn on :8000 (GPU)
    make worker   # celery worker (GPU)
    make web      # next dev on :3000

Open http://localhost:3000.
Fully containerised stack (optional): if you have the NVIDIA Container
Toolkit and prefer everything in Docker, run `make up-full` instead. That
builds the api/worker/web containers and starts them alongside the infra.
- One model, two views. BGE-M3 produces dense + sparse vectors in a single forward pass. Qdrant uses both with RRF fusion — no separate BM25 service.
- Cross-encoder rerank. BGE-reranker-v2-m3 reorders the top-50 to top-8. Toggleable for ablation.
- Citations are real. The generator is forced to cite `[n]` from the passed sources; the UI parses these markers as the answer streams and turns them into chips that open a side drawer with the chunk preview and a link to the original file.
- Multilingual. EN / RU / HE supported by BGE-M3 + reranker; Hebrew renders RTL automatically.
- Audio + video + images + websites. Docling handles documents and OCR; Crawl4AI fetches single URLs or recursively crawls; faster-whisper transcribes.
- Eval from day one. `make eval` runs Ragas (faithfulness, context precision/recall, answer relevancy) plus a deterministic citation-recall check against `eval/golden.jsonl`.
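
To make the hybrid retrieval and rerank bullets above concrete, here is a minimal pure-Python sketch of Reciprocal Rank Fusion followed by an optional rerank and a top-k cut. It is an illustration only: in this project the fusion happens inside Qdrant, and the function names `rrf_fuse` and `select_passages`, the constant `k=60` (a value commonly used in the literature, not necessarily Qdrant's), and the point at which the prefetch limit is applied are all assumptions rather than the repository's actual code. The `rerank_score` callable stands in for the cross-encoder.

```python
from collections import defaultdict
from typing import Callable, Optional, Sequence


def rrf_fuse(rankings: Sequence[Sequence[str]], k: int = 60) -> list[str]:
    """Fuse ranked id lists: each list adds 1 / (k + rank) to an id's score."""
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties are broken by id to keep output stable.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


def select_passages(
    dense: Sequence[str],
    sparse: Sequence[str],
    rerank_score: Optional[Callable[[str], float]] = None,
    prefetch: int = 50,   # mirrors RETRIEVAL_PREFETCH
    top_k: int = 8,       # mirrors RETRIEVAL_TOP_K
) -> list[str]:
    """Fuse dense + sparse results, optionally rerank, and keep top_k ids."""
    # Assumption: the prefetch cut is applied after fusion for simplicity.
    candidates = rrf_fuse([dense, sparse])[:prefetch]
    if rerank_score is not None:
        # Hypothetical stand-in for the cross-encoder: higher score is better.
        candidates = sorted(candidates, key=rerank_score, reverse=True)
    return candidates[:top_k]


if __name__ == "__main__":
    dense_hits = ["a", "b", "c"]    # ids ranked by dense similarity
    sparse_hits = ["c", "a", "d"]   # ids ranked by sparse (lexical) score
    print(select_passages(dense_hits, sparse_hits, top_k=2))
```

Passing `rerank_score=None` corresponds to switching the rerank stage off, which is how an ablation run could compare fused-only results with reranked ones.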
- Architecture — components and data flow
- Ingestion — parsers and chunking
- Retrieval — hybrid search, rerank, citations
- Eval — golden set + Ragas + diff workflow
- Runbook — operations, failure modes, backups
All runtime knobs live in .env. See .env.example for the full list. Common
toggles:
| Variable | Default | What it does |
|---|---|---|
| `OPENAI_MODEL` | `gpt-5` | Generator |
| `RETRIEVAL_TOP_K` | `8` | Final passages sent to the LLM |
| `RETRIEVAL_PREFETCH` | `50` | Hybrid candidates before rerank |
| `RERANK_ENABLED` | `true` | Cross-encoder rerank stage |
| `QUERY_REWRITING_ENABLED` | `true` | Optional query rewrite for retrieval |
| `EMBEDDING_MODEL` | `BAAI/bge-m3` | |
| `RERANKER_MODEL` | `BAAI/bge-reranker-v2-m3` | |
| `WHISPER_MODEL` | `distil-large-v3` | Audio/video transcription |
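
As a rough illustration of how these variables might be consumed, the sketch below reads the retrieval toggles from the environment and falls back to the defaults in the table. The names `RetrievalSettings` and `load_settings` are hypothetical, the sanity check on `top_k` versus `prefetch` is an assumption, and the code presumes the values from `.env` have already been exported into the process environment; the project's real settings module may look quite different.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional

# Strings treated as "on" for boolean toggles (an assumed convention).
TRUTHY = {"1", "true", "yes", "on"}


@dataclass(frozen=True)
class RetrievalSettings:  # hypothetical name, not taken from the repository
    openai_model: str
    top_k: int
    prefetch: int
    rerank_enabled: bool
    query_rewriting_enabled: bool


def _as_bool(raw: str) -> bool:
    """Parse a boolean environment value case-insensitively."""
    return raw.strip().lower() in TRUTHY


def load_settings(env: Optional[Mapping[str, str]] = None) -> RetrievalSettings:
    """Build settings from env vars, using the documented defaults."""
    env = os.environ if env is None else env
    settings = RetrievalSettings(
        openai_model=env.get("OPENAI_MODEL", "gpt-5"),
        top_k=int(env.get("RETRIEVAL_TOP_K", "8")),
        prefetch=int(env.get("RETRIEVAL_PREFETCH", "50")),
        rerank_enabled=_as_bool(env.get("RERANK_ENABLED", "true")),
        query_rewriting_enabled=_as_bool(env.get("QUERY_REWRITING_ENABLED", "true")),
    )
    # Assumed sanity check: the final cut cannot exceed the candidate pool.
    if settings.top_k > settings.prefetch:
        raise ValueError("RETRIEVAL_TOP_K must not exceed RETRIEVAL_PREFETCH")
    return settings


if __name__ == "__main__":
    print(load_settings())
```

Accepting an explicit mapping keeps the loader easy to test: `load_settings({"RERANK_ENABLED": "false"})` builds an ablation configuration without touching the real environment.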
    make fmt    # ruff format
    make lint   # ruff check + mypy
    make test   # pytest (rag-core unit tests)

API: `apps/api/app/main.py`. Worker: `apps/worker/worker/celery_app.py`. Web
dev server: `pnpm --filter web dev`.