Artemnikov/simple_rag_engine

Document RAG

Local document RAG with hybrid retrieval, cross-encoder reranking, and inline clickable citations. Single-user; infrastructure runs in Docker, with an optional fully containerised stack.

Stack: Docling · Crawl4AI · faster-whisper · BGE-M3 · Qdrant · BGE-reranker-v2-m3 · OpenAI · FastAPI · Next.js · Celery · Postgres · Redis · Ragas

Layout

apps/
  api/        FastAPI: ingest, chat (SSE), library, jobs (SSE), originals
  worker/     Celery: parse → chunk → embed → upsert
  web/        Next.js + Tailwind: Sources, Library, Chat, Settings
packages/
  rag-core/   Reusable Python library: parsers, chunking, embedding,
              vectorstore, reranker, retrieval, generator, pipeline
infra/
  docker-compose.yml + Dockerfile.gpu + Dockerfile.web
eval/         Ragas golden-set runner + run differ
data/
  originals/  Uploaded source files (one dir per document)
  models/     Embedder/reranker/Whisper weights cache
docs/         architecture, ingestion, retrieval, eval, runbook

Quickstart

Prereqs: Docker (for infra only), uv, pnpm, make, an OpenAI API key, and ~16 GB of VRAM on the host for the embedder and reranker (plus Whisper for audio). The NVIDIA Container Toolkit is not needed: api and worker run in the host venv with direct GPU access.

cp .env.example .env       # set OPENAI_API_KEY
make install               # uv sync + pnpm install
make up                    # qdrant + postgres + redis (in Docker)
make migrate               # apply Postgres schema (once)

Then run the app on the host (three terminals):

make api          # uvicorn on :8000 (GPU)
make worker       # celery worker (GPU)
make web          # next dev on :3000

Open http://localhost:3000.

Fully containerised stack (optional): if you have the NVIDIA Container Toolkit and prefer everything in Docker, run make up-full instead. It builds the api, worker, and web containers and starts them alongside the infra.

Highlights

  • One model, two views. BGE-M3 produces dense + sparse vectors in a single forward pass. Qdrant uses both with RRF fusion — no separate BM25 service.
  • Cross-encoder rerank. BGE-reranker-v2-m3 rescores the top 50 hybrid candidates and keeps the best 8. Toggleable for ablation.
  • Citations are real. The generator is forced to cite [n] from the passed sources; the UI parses these markers as the answer streams and turns them into chips that open a side drawer with the chunk preview and a link to the original file.
  • Multilingual. EN / RU / HE supported by BGE-M3 + reranker; Hebrew renders RTL automatically.
  • Audio + video + images + websites. Docling handles documents and OCR; Crawl4AI fetches single URLs or recursively crawls; faster-whisper transcribes.
  • Eval from day one. make eval runs Ragas (faithfulness, context precision/recall, answer relevancy) plus a deterministic citation-recall check against eval/golden.jsonl.
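
The dense + sparse fusion mentioned in the first highlight happens inside Qdrant, so the project does not need code like this. Still, a minimal sketch of reciprocal rank fusion (RRF) may help show what the fused ranking means. The function name rrf_fuse, the chunk ids, and the constant k=60 are assumptions for illustration and are not taken from the repository or from Qdrant's implementation.

```python
from collections import defaultdict


def rrf_fuse(rankings, k=60, limit=None):
    """Fuse several ranked id lists with reciprocal rank fusion.

    Each id scores sum(1 / (k + rank)) over the lists it appears in,
    with rank starting at 1. k=60 is an assumed constant for this sketch.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by id for determinism.
    fused = sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))
    return fused if limit is None else fused[:limit]


# Hypothetical chunk ids ranked by the dense and sparse views of one query.
dense_hits = ["c3", "c1", "c7"]
sparse_hits = ["c1", "c9", "c3"]
print(rrf_fuse([dense_hits, sparse_hits], limit=3))  # ['c1', 'c3', 'c9']
```

Chunks that rank well in both views (c1, c3) rise above chunks that appear in only one. This is why hybrid retrieval can work without score normalisation between the dense and sparse vectors.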

Documentation

  • Architecture — components and data flow
  • Ingestion — parsers and chunking
  • Retrieval — hybrid search, rerank, citations
  • Eval — golden set + Ragas + diff workflow
  • Runbook — operations, failure modes, backups

Configuration

All runtime knobs live in .env. See .env.example for the full list. Common toggles:

Variable                  Default                   What it does
OPENAI_MODEL              gpt-5                     Generator
RETRIEVAL_TOP_K           8                         Final passages sent to the LLM
RETRIEVAL_PREFETCH        50                        Hybrid candidates before rerank
RERANK_ENABLED            true                      Cross-encoder rerank stage
QUERY_REWRITING_ENABLED   true                      Optional query rewrite for retrieval
EMBEDDING_MODEL           BAAI/bge-m3
RERANKER_MODEL            BAAI/bge-reranker-v2-m3
WHISPER_MODEL             distil-large-v3           Audio/video transcription
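
To illustrate how the retrieval knobs above relate to each other, here is a hedged sketch that reads them from the environment with the documented defaults. How the project actually loads its settings is not shown in this README. The names RetrievalSettings and load_retrieval_settings, the accepted boolean spellings, and the check that top-k does not exceed the prefetch size are all assumptions.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class RetrievalSettings:
    top_k: int                     # RETRIEVAL_TOP_K
    prefetch: int                  # RETRIEVAL_PREFETCH
    rerank_enabled: bool           # RERANK_ENABLED
    query_rewriting_enabled: bool  # QUERY_REWRITING_ENABLED


def _as_bool(value):
    # Assumed convention: a few common truthy spellings, case-insensitive.
    return value.strip().lower() in {"1", "true", "yes", "on"}


def load_retrieval_settings(env=None):
    """Read retrieval knobs from a mapping (os.environ by default)."""
    env = os.environ if env is None else env
    settings = RetrievalSettings(
        top_k=int(env.get("RETRIEVAL_TOP_K", "8")),
        prefetch=int(env.get("RETRIEVAL_PREFETCH", "50")),
        rerank_enabled=_as_bool(env.get("RERANK_ENABLED", "true")),
        query_rewriting_enabled=_as_bool(env.get("QUERY_REWRITING_ENABLED", "true")),
    )
    # Assumed sanity check: the final passages are drawn from the prefetched
    # candidates, so asking for more than were fetched makes no sense.
    if settings.top_k > settings.prefetch:
        raise ValueError("RETRIEVAL_TOP_K must not exceed RETRIEVAL_PREFETCH")
    return settings


# Ablation run: the same candidate pool, but without the rerank stage.
print(load_retrieval_settings({"RERANK_ENABLED": "false"}))
```

Passing a plain dict instead of os.environ keeps the sketch easy to test and makes ablation runs explicit.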

Development

make fmt        # ruff format
make lint       # ruff check + mypy
make test       # pytest (rag-core unit tests)

API: apps/api/app/main.py. Worker: apps/worker/worker/celery_app.py. Web dev server: pnpm --filter web dev.
