A production-quality Retrieval-Augmented Generation (RAG) system with hybrid search, cross-encoder reranking, and benchmarked chunking strategies.
Athena ingests PDF, DOCX, TXT, and HTML documents, chunks them with a configurable strategy, embeds the chunks with bge-m3, and stores the vectors in PostgreSQL via pgvector. Natural-language questions are answered through hybrid dense + BM25 retrieval fused with RRF, reranked by a cross-encoder, and generated by Anthropic Claude, ZhipuAI GLM-4, or any OpenRouter model.
```
┌────────────────────────────────────────────────────────────────┐
│                           Ingestion                            │
│                                                                │
│  Document → Loader → Chunker                                   │
│  (PDF/DOCX/          (Fixed / Recursive / Semantic)            │
│   TXT/HTML)             │                                      │
│                         ▼                                      │
│                      Embedder                                  │
│                      (bge-m3)                                  │
│                         │                                      │
│                         ▼                                      │
│                      pgvector                                  │
│                       + BM25                                   │
└────────────────────────────────────────────────────────────────┘

┌────────────────────────────────────────────────────────────────┐
│                             Query                              │
│                                                                │
│  Question → Embedder → Dense Search ──┐                        │
│                        (HNSW/ANN)     ├─→ RRF Fusion           │
│             → BM25 Search ────────────┘        │               │
│                                                ▼               │
│                                         Cross-Encoder          │
│                                           Reranker             │
│                                                │               │
│                                                ▼               │
│                                         LLM → Answer           │
│                                (Claude / GLM-4 / OpenRouter)   │
└────────────────────────────────────────────────────────────────┘
```
- Hybrid retrieval — dense ANN search and BM25 sparse search combined with Reciprocal Rank Fusion
- Three chunking strategies — fixed-size, recursive character, and semantic; benchmarked against RAGAS metrics
- Cross-encoder reranking — two-stage retrieval pipeline that trades broad recall for final precision
- bge-m3 embeddings — 1024-dimensional multilingual embeddings, 8192-token context, runs locally
- pgvector storage — vectors, metadata, and BM25 statistics in a single PostgreSQL database; HNSW index
- RAGAS evaluation — reproducible benchmark harness with faithfulness, answer relevance, context precision, and context recall
- Multi-agent research pipeline — LangGraph supervisor → researcher → analyst → fact-checker → writer with streaming SSE output
- Multi-tenant API — per-tenant API key authentication, rate limiting, and document isolation
- Embeddable widget — drop-in `<script>`-tag search widget using Shadow DOM for any docs site
- MCP server — exposes the research pipeline as Claude-compatible tools via Model Context Protocol
- Knowledge graph (optional) — Neo4j entity extraction for graph-augmented retrieval
- Async FastAPI backend — fully async with SQLAlchemy, connection pooling, and background ingestion tasks
- Streamlit UI — browser interface for uploads and queries with source chunk inspection
- Kubernetes-ready — K8s manifests, Prometheus metrics, health checks, and Docker Compose for local dev
| Component | Technology | Why |
|---|---|---|
| API server | FastAPI 0.111 + Uvicorn | Async-native, automatic OpenAPI docs, production-grade |
| Vector store | PostgreSQL 16 + pgvector | ACID, SQL joins, HNSW index, no extra service |
| ORM | SQLAlchemy 2 (async) | Mature, type-safe, supports pgvector column types |
| Embeddings | BAAI/bge-m3 | SOTA open model, 1024 dims, multilingual, local |
| Sparse retrieval | BM25 (in-Postgres) | Exact keyword matching complementing dense retrieval |
| Reranker | cross-encoder/ms-marco | Fine-tuned relevance model, significant precision lift |
| LLM | Anthropic Claude / ZhipuAI GLM-4 | Configurable provider — Claude for deep reasoning, GLM-4 for EN/ZH |
| Evaluation | RAGAS | Industry-standard RAG evaluation framework |
| Frontend | Streamlit | Fast iteration, no frontend build tooling |
| Containerisation | Docker Compose | Reproducible local and CI environment |
| Linting | Ruff | Fast, opinionated, replaces flake8 + isort |
| Type checking | mypy (strict) | Catches integration errors before runtime |
- Docker and Docker Compose
- Python 3.12 (for local development)
- An Anthropic, ZhipuAI, or OpenRouter API key
```shell
git clone https://github.com/yourusername/athena.git
cd athena
cp backend/.env.example backend/.env
# Edit backend/.env — set ATHENA_ANTHROPIC_API_KEY or another provider key

make up     # docker compose up --build
make demo   # seed demo documents and open the UI
```

Or manually:
```shell
docker compose up --build
# wait for backend to start, then:
python -m scripts.seed_demo --host http://localhost:8000
```

| Service | URL |
|---|---|
| FastAPI (docs) | http://localhost:8000/docs |
| Streamlit UI | http://localhost:8501 |
| Prometheus metrics | http://localhost:8000/metrics |
| PostgreSQL | localhost:5432 |
After make demo, three documents are pre-loaded. Try these queries:
```shell
# Upload a document
curl -X POST http://localhost:8000/api/documents/upload \
  -F "file=@paper.pdf" \
  -F "chunking_strategy=recursive"

# Query with hybrid retrieval
curl -X POST http://localhost:8000/api/query \
  -H "Content-Type: application/json" \
  -d '{"question": "What is Reciprocal Rank Fusion?", "strategy": "hybrid", "top_k": 5}'

# Multi-agent research pipeline
curl -X POST http://localhost:8000/api/research \
  -H "Content-Type: application/json" \
  -d '{"question": "Compare chunking strategies for RAG systems"}'

# Streaming research (Server-Sent Events)
curl -N -X POST http://localhost:8000/api/research/stream \
  -H "Content-Type: application/json" \
  -d '{"question": "What are HNSW index trade-offs?"}'
```

Open http://localhost:8501 in your browser.
- Upload tab: drag-and-drop or file picker; select chunking strategy.
- Query tab: enter a question; toggle Deep Research for the multi-agent pipeline; view streaming answers with expandable source citations.
- Evaluate tab: run RAGAS benchmarks and compare strategies.
By default, the API is open (suitable for local development). To enable API key authentication, set ATHENA_API_KEYS to a comma-separated list of keys:
```shell
# backend/.env
ATHENA_API_KEYS=key-abc123,key-def456
```

All non-exempt endpoints then require the X-API-Key header:

```shell
curl -H "X-API-Key: key-abc123" http://localhost:8000/api/query ...
```

The /api/health, /docs, and /metrics endpoints are always exempt.
Rate limiting is configurable via ATHENA_RATE_LIMIT_PER_MINUTE (default: 60 req/min; 0 = disabled).
The table below shows RAGAS scores for each chunking strategy measured on the included benchmark dataset (eval/data/benchmark.jsonl, 50 questions over mixed document types).
| Strategy | Faithfulness | Answer Relevance | Context Precision | Context Recall |
|---|---|---|---|---|
| Fixed | 0.71 | 0.74 | 0.68 | 0.72 |
| Recursive | 0.84 | 0.87 | 0.81 | 0.83 |
| Semantic | 0.79 | 0.82 | 0.77 | 0.80 |
Recursive chunking achieves the best results across all metrics on this benchmark. Semantic chunking narrows the gap on longer narrative documents where topic coherence matters more. Fixed chunking is fastest to ingest and acceptable for highly structured documents.
Dedicated vector databases like ChromaDB, Qdrant, and Weaviate require operating an additional service and offer less query expressiveness than SQL. pgvector keeps everything in PostgreSQL: metadata filtering is just a SQL WHERE clause, multi-table joins work as expected, and document ingestion is fully ACID. pgvector's HNSW index delivers query latency competitive with standalone ANN services for corpora into the millions of vectors.
Dense retrieval misses exact keyword matches for proper nouns and technical terms; BM25 misses semantically equivalent phrasings. Combining both with Reciprocal Rank Fusion avoids tuning an interpolation weight between two incompatible score scales: the formula 1/(k + rank) with k=60 is empirically robust across domains, and it held up on the included benchmark.
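Concretely, RRF needs nothing but the rank-ordered id lists from each retriever; a minimal sketch with hypothetical document ids:

```python
def rrf_fuse(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse ranked lists with Reciprocal Rank Fusion.

    A document's score is the sum of 1/(k + rank) over every list it
    appears in; documents absent from a list contribute nothing there.
    """
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)


dense = ["d3", "d1", "d7", "d2"]   # hypothetical dense ANN ranking
sparse = ["d1", "d9", "d3"]        # hypothetical BM25 ranking
fused = rrf_fuse([dense, sparse])  # d1 and d3 rise: both lists agree on them
```

Because only ranks are used, the dense cosine scores and BM25 scores never need to be normalised against each other.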
Bi-encoder retrieval is fast but scores query and document independently, missing interaction signals. A cross-encoder reads both jointly, producing more accurate relevance scores. The two-stage design (ANN retrieval of top-50, cross-encoder reranking to top-5) limits the expensive reranker to a small candidate set, keeping end-to-end latency under 500 ms for typical queries.
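The two-stage shape can be sketched with stand-in scorers — word overlap for the cheap first stage, bigram overlap standing in for a real cross-encoder (which would jointly encode query and document); both scorers are illustrative, not Athena's models:

```python
def cheap_score(query: str, doc: str) -> int:
    # Stage-one stand-in for ANN similarity: shared-word count.
    return len(set(query.lower().split()) & set(doc.lower().split()))


def cross_encoder_score(query: str, doc: str) -> int:
    # Stand-in for a cross-encoder relevance model: shared bigrams,
    # a crude proxy for the interaction signals a real model captures.
    def bigrams(text: str) -> set[tuple[str, str]]:
        words = text.lower().split()
        return set(zip(words, words[1:]))
    return len(bigrams(query) & bigrams(doc))


def two_stage(query: str, corpus: list[str],
              recall_k: int = 50, final_k: int = 5) -> list[str]:
    """Stage 1: broad, cheap retrieval of recall_k candidates.
    Stage 2: the expensive reranker scores only those candidates."""
    candidates = sorted(corpus, key=lambda d: cheap_score(query, d),
                        reverse=True)[:recall_k]
    reranked = sorted(candidates, key=lambda d: cross_encoder_score(query, d),
                      reverse=True)
    return reranked[:final_k]
```

The reranker's cost is bounded by recall_k regardless of corpus size, which is what keeps the latency budget fixed as the document set grows.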
BAAI/bge-m3 ranks near the top of multilingual retrieval benchmarks such as MTEB. The 1024-dimensional output balances capacity and index size. The 8192-token context window accommodates larger chunks without truncation. Running the model locally avoids per-token embedding costs and eliminates a latency-sensitive external API call on the hot path.
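Dense retrieval itself reduces to nearest-neighbour search over these embedding vectors. A brute-force sketch with toy low-dimensional vectors (bge-m3 actually produces 1024-dim vectors, and pgvector's HNSW index approximates this ranking without a full scan):

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)


def dense_top_k(query_vec: list[float],
                chunks: list[tuple[str, list[float]]],
                k: int = 3) -> list[str]:
    """Exact dense retrieval: rank chunk ids by cosine similarity
    to the query embedding (what HNSW approximates at scale)."""
    ranked = sorted(chunks, key=lambda item: cosine(query_vec, item[1]),
                    reverse=True)
    return [chunk_id for chunk_id, _ in ranked[:k]]
```

At corpus scale the full scan is replaced by the ANN index; the ranking semantics stay the same.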
Optimal chunk size and boundary strategy are document-type dependent. Fixed-size chunking is fast and predictable but ignores linguistic boundaries. Recursive splitting respects paragraph and sentence boundaries, reducing mid-sentence cuts. Semantic chunking groups topically coherent sentences using embedding similarity, at higher ingestion cost. Exposing all three and benchmarking them allows selecting the best strategy per corpus rather than committing to a single approach.
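A minimal sketch of the recursive idea — not Athena's actual chunker; the separator ladder and the greedy merge-back are illustrative:

```python
def recursive_split(text: str, max_len: int = 200,
                    seps: tuple[str, ...] = ("\n\n", "\n", ". ", " ")) -> list[str]:
    """Split on the coarsest separator present, recursing into any
    piece still over max_len, then greedily merge small neighbours
    back up toward max_len so chunks stay close to the target size."""
    if len(text) <= max_len:
        return [text]
    sep = next((s for s in seps if s in text), None)
    if sep is None:
        # No linguistic boundary left: hard cut as a last resort.
        return [text[i:i + max_len] for i in range(0, len(text), max_len)]
    parts: list[str] = []
    for piece in text.split(sep):
        parts.extend(recursive_split(piece, max_len, seps))
    chunks: list[str] = []
    for part in parts:
        if chunks and len(chunks[-1]) + len(sep) + len(part) <= max_len:
            chunks[-1] = chunks[-1] + sep + part
        else:
            chunks.append(part)
    return chunks
```

Preferring paragraph breaks over sentence breaks over spaces is what keeps chunks from ending mid-sentence, which is the failure mode of fixed-size chunking.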
The benchmark harness runs the full pipeline end-to-end and scores each response with RAGAS.
```shell
# From the backend directory
python -m eval.runner --benchmark
```

Results are written to eval/results/<timestamp>.json and printed as a summary table.
To evaluate a single question:
```shell
python -m eval.runner --query "What is the notice period for contract termination?"
```

The benchmark dataset format (eval/data/benchmark.jsonl) is one JSON object per line:
```json
{"question": "...", "ground_truth": "...", "source_file": "doc.pdf"}
```

Core
| Method | Path | Description |
|---|---|---|
| GET | /api/health | Health check with document and model counts |
| POST | /api/documents/upload | Upload and ingest a document (multipart) |
| GET | /api/documents | List all ingested documents |
| DELETE | /api/documents/{id} | Delete a document and its chunks |
| POST | /api/query | RAG query — returns answer + cited sources |
| POST | /api/query/stream | Streaming RAG query (Server-Sent Events) |
| POST | /api/search | Pure retrieval — returns ranked chunks |
| POST | /api/research | Multi-agent research pipeline |
| POST | /api/research/stream | Streaming research with per-agent events |
| POST | /api/eval/run | Trigger async RAGAS evaluation |
| GET | /api/eval/results | List evaluation runs and metrics |
Multi-tenant
| Method | Path | Description |
|---|---|---|
| POST | /api/tenants | Create a new tenant + API key |
| GET | /api/tenants/me | Get current tenant profile |
| POST | /api/tenants/me/api-keys | Create a new API key |
| GET | /api/tenants/me/api-keys | List API keys |
| DELETE | /api/tenants/me/api-keys/{id} | Revoke an API key |
| GET | /api/tenants/me/usage | Usage metrics |
Embeddable widget (publishable key pk_* required)
| Method | Path | Description |
|---|---|---|
| POST | /api/widget/query | Widget search query |
| POST | /api/widget/feedback | Submit thumbs up/down feedback |
Full interactive documentation: http://localhost:8000/docs
See CONTRIBUTING.md.