A production-quality Retrieval-Augmented Generation (RAG) system with hybrid search, cross-encoder reranking, and benchmarked chunking strategies.
Athena ingests PDF, DOCX, TXT, and HTML documents, chunks them with a configurable strategy, embeds the chunks with bge-m3, and stores the vectors in PostgreSQL via pgvector. Natural-language questions are answered through hybrid dense + BM25 retrieval fused with RRF, reranked by a cross-encoder, and generated by Anthropic Claude, ZhipuAI GLM-4, or any OpenRouter model.
```
┌────────────────────────────────────────────────────────────────┐
│                           Ingestion                            │
│                                                                │
│  Document → Loader → Chunker                                   │
│  (PDF/DOCX/          (Fixed / Recursive / Semantic)            │
│   TXT/HTML)             │                                      │
│                         ▼                                      │
│                      Embedder                                  │
│                      (bge-m3)                                  │
│                         │                                      │
│                         ▼                                      │
│                      pgvector                                  │
│                       + BM25                                   │
└────────────────────────────────────────────────────────────────┘

┌────────────────────────────────────────────────────────────────┐
│                             Query                              │
│                                                                │
│  Question → Embedder → Dense Search ──┐                        │
│                        (HNSW/ANN)     ├─→ RRF Fusion           │
│             → BM25 Search ────────────┘        │               │
│                                                ▼               │
│                                         Cross-Encoder          │
│                                           Reranker             │
│                                                │               │
│                                                ▼               │
│                                         LLM → Answer           │
│                                (Claude / GLM-4 / OpenRouter)   │
└────────────────────────────────────────────────────────────────┘
```
- Hybrid retrieval — dense ANN search and BM25 sparse search combined with Reciprocal Rank Fusion
- Three chunking strategies — fixed-size, recursive character, and semantic; benchmarked against RAGAS metrics
- Cross-encoder reranking — two-stage retrieval pipeline that trades broad recall for final precision
- bge-m3 embeddings — 1024-dimensional multilingual embeddings, 8192-token context, runs locally
- pgvector storage — vectors, metadata, and BM25 statistics in a single PostgreSQL database; HNSW index
- RAGAS evaluation — reproducible benchmark harness with faithfulness, answer relevance, context precision, and context recall
- Multi-agent research pipeline — LangGraph supervisor → researcher → analyst → fact-checker → writer with streaming SSE output
- Multi-tenant API — per-tenant API key authentication, rate limiting, and document isolation
- Embeddable widget — drop-in `<script>`-tag search widget using Shadow DOM for any docs site
- MCP server — exposes the research pipeline as Claude-compatible tools via Model Context Protocol
- Knowledge graph (optional) — Neo4j entity extraction for graph-augmented retrieval
- Async FastAPI backend — fully async with SQLAlchemy, connection pooling, and background ingestion tasks
- Streamlit UI — browser interface for uploads and queries with source chunk inspection
- Kubernetes-ready — K8s manifests, Prometheus metrics, health checks, and Docker Compose for local dev
| Component | Technology | Why |
|---|---|---|
| API server | FastAPI 0.111 + Uvicorn | Async-native, automatic OpenAPI docs, production-grade |
| Vector store | PostgreSQL 16 + pgvector | ACID, SQL joins, HNSW index, no extra service |
| ORM | SQLAlchemy 2 (async) | Mature, type-safe, supports pgvector column types |
| Embeddings | BAAI/bge-m3 | SOTA open model, 1024 dims, multilingual, local |
| Sparse retrieval | BM25 (in-Postgres) | Exact keyword matching complementing dense retrieval |
| Reranker | cross-encoder/ms-marco | Fine-tuned relevance model, significant precision lift |
| LLM | Anthropic Claude / ZhipuAI GLM-4 | Configurable provider — Claude for deep reasoning, GLM-4 for EN/ZH |
| Evaluation | RAGAS | Industry-standard RAG evaluation framework |
| Frontend | Streamlit | Fast iteration, no frontend build tooling |
| Containerisation | Docker Compose | Reproducible local and CI environment |
| Linting | Ruff | Fast, opinionated, replaces flake8 + isort |
| Type checking | mypy (strict) | Catches integration errors before runtime |
- Docker and Docker Compose
- Python 3.12 (for local development)
- An Anthropic, ZhipuAI, or OpenRouter API key
```shell
git clone https://github.com/yourusername/athena.git
cd athena
cp backend/.env.example backend/.env
# Edit backend/.env — set ATHENA_ANTHROPIC_API_KEY or another provider key

make up     # docker compose up --build
make demo   # seed demo documents and open the UI
```

Or manually:
```shell
docker compose up --build
# wait for backend to start, then:
python -m scripts.seed_demo --host http://localhost:8000
```

| Service | URL |
|---|---|
| FastAPI (docs) | http://localhost:8000/docs |
| Streamlit UI | http://localhost:8501 |
| Prometheus metrics | http://localhost:8000/metrics |
| PostgreSQL | localhost:5432 |
After make demo, three documents are pre-loaded. Try these queries:
```shell
# Upload a document
curl -X POST http://localhost:8000/api/documents/upload \
  -F "file=@paper.pdf" \
  -F "chunking_strategy=recursive"

# Query with hybrid retrieval
curl -X POST http://localhost:8000/api/query \
  -H "Content-Type: application/json" \
  -d '{"question": "What is Reciprocal Rank Fusion?", "strategy": "hybrid", "top_k": 5}'

# Multi-agent research pipeline
curl -X POST http://localhost:8000/api/research \
  -H "Content-Type: application/json" \
  -d '{"question": "Compare chunking strategies for RAG systems"}'

# Streaming research (Server-Sent Events)
curl -N -X POST http://localhost:8000/api/research/stream \
  -H "Content-Type: application/json" \
  -d '{"question": "What are HNSW index trade-offs?"}'
```

Open http://localhost:8501 in your browser.
- Upload tab: drag-and-drop or file picker; select chunking strategy.
- Query tab: enter a question; toggle Deep Research for the multi-agent pipeline; view streaming answers with expandable source citations.
- Evaluate tab: run RAGAS benchmarks and compare strategies.
By default, the API is open (suitable for local development). To enable API key authentication, set ATHENA_API_KEYS to a comma-separated list of keys:
```shell
# backend/.env
ATHENA_API_KEYS=key-abc123,key-def456
```

All non-exempt endpoints then require the X-API-Key header:

```shell
curl -H "X-API-Key: key-abc123" http://localhost:8000/api/query ...
```

The /api/health, /docs, and /metrics endpoints are always exempt.
Rate limiting is configurable via ATHENA_RATE_LIMIT_PER_MINUTE (default: 60 req/min; 0 = disabled).
The table below shows RAGAS scores for each chunking strategy measured on the included benchmark dataset (eval/data/benchmark.jsonl, 50 questions over mixed document types).
| Strategy | Faithfulness | Answer Relevance | Context Precision | Context Recall |
|---|---|---|---|---|
| Fixed | 0.71 | 0.74 | 0.68 | 0.72 |
| Recursive | 0.84 | 0.87 | 0.81 | 0.83 |
| Semantic | 0.79 | 0.82 | 0.77 | 0.80 |
Recursive chunking achieves the best results across all metrics on this benchmark. Semantic chunking narrows the gap on longer narrative documents where topic coherence matters more. Fixed chunking is fastest to ingest and acceptable for highly structured documents.
Dedicated vector databases like ChromaDB, Qdrant, and Weaviate require operating an additional service and offer less query expressiveness than SQL. pgvector keeps everything in PostgreSQL: metadata filtering is just a SQL WHERE clause, multi-table joins work as expected, and document ingestion is fully ACID. pgvector's HNSW index delivers query latency competitive with standalone ANN services for corpora into the millions of vectors.
Dense retrieval misses exact keyword matches for proper nouns and technical terms; BM25 misses semantically equivalent phrasings. Combining both with Reciprocal Rank Fusion avoids tuning an interpolation weight between two incompatible score scales: the formula 1/(k + rank) with k=60 is empirically robust across domains, and it held up on the included benchmark.
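Concretely, RRF needs nothing but the rank-ordered id lists from each retriever; a minimal sketch with hypothetical document ids:

```python
def rrf_fuse(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse ranked lists with Reciprocal Rank Fusion.

    A document's score is the sum of 1/(k + rank) over every list it
    appears in; documents absent from a list contribute nothing there.
    """
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)


dense = ["d3", "d1", "d7", "d2"]   # hypothetical dense ANN ranking
sparse = ["d1", "d9", "d3"]        # hypothetical BM25 ranking
fused = rrf_fuse([dense, sparse])  # d1 and d3 rise: both lists agree on them
```

Because only ranks are used, the dense cosine scores and BM25 scores never need to be normalised against each other.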
Bi-encoder retrieval is fast but scores query and document independently, missing interaction signals. A cross-encoder reads both jointly, producing more accurate relevance scores. The two-stage design (ANN retrieval of top-50, cross-encoder reranking to top-5) limits the expensive reranker to a small candidate set, keeping end-to-end latency under 500 ms for typical queries.
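The two-stage shape can be sketched with stand-in scorers — word overlap for the cheap first stage, bigram overlap standing in for a real cross-encoder (which would jointly encode query and document); both scorers are illustrative, not Athena's models:

```python
def cheap_score(query: str, doc: str) -> int:
    # Stage-one stand-in for ANN similarity: shared-word count.
    return len(set(query.lower().split()) & set(doc.lower().split()))


def cross_encoder_score(query: str, doc: str) -> int:
    # Stand-in for a cross-encoder relevance model: shared bigrams,
    # a crude proxy for the interaction signals a real model captures.
    def bigrams(text: str) -> set[tuple[str, str]]:
        words = text.lower().split()
        return set(zip(words, words[1:]))
    return len(bigrams(query) & bigrams(doc))


def two_stage(query: str, corpus: list[str],
              recall_k: int = 50, final_k: int = 5) -> list[str]:
    """Stage 1: broad, cheap retrieval of recall_k candidates.
    Stage 2: the expensive reranker scores only those candidates."""
    candidates = sorted(corpus, key=lambda d: cheap_score(query, d),
                        reverse=True)[:recall_k]
    reranked = sorted(candidates, key=lambda d: cross_encoder_score(query, d),
                      reverse=True)
    return reranked[:final_k]
```

The reranker's cost is bounded by recall_k regardless of corpus size, which is what keeps the latency budget fixed as the document set grows.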
BAAI/bge-m3 ranks near the top of multilingual retrieval benchmarks such as MTEB. The 1024-dimensional output balances capacity and index size. The 8192-token context window accommodates larger chunks without truncation. Running the model locally avoids per-token embedding costs and eliminates a latency-sensitive external API call on the hot path.
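Dense retrieval itself reduces to nearest-neighbour search over these embedding vectors. A brute-force sketch with toy low-dimensional vectors (bge-m3 actually produces 1024-dim vectors, and pgvector's HNSW index approximates this ranking without a full scan):

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)


def dense_top_k(query_vec: list[float],
                chunks: list[tuple[str, list[float]]],
                k: int = 3) -> list[str]:
    """Exact dense retrieval: rank chunk ids by cosine similarity
    to the query embedding (what HNSW approximates at scale)."""
    ranked = sorted(chunks, key=lambda item: cosine(query_vec, item[1]),
                    reverse=True)
    return [chunk_id for chunk_id, _ in ranked[:k]]
```

At corpus scale the full scan is replaced by the ANN index; the ranking semantics stay the same.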
Optimal chunk size and boundary strategy are document-type dependent. Fixed-size chunking is fast and predictable but ignores linguistic boundaries. Recursive splitting respects paragraph and sentence boundaries, reducing mid-sentence cuts. Semantic chunking groups topically coherent sentences using embedding similarity, at higher ingestion cost. Exposing all three and benchmarking them allows selecting the best strategy per corpus rather than committing to a single approach.
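A minimal sketch of the recursive idea — not Athena's actual chunker; the separator ladder and the greedy merge-back are illustrative:

```python
def recursive_split(text: str, max_len: int = 200,
                    seps: tuple[str, ...] = ("\n\n", "\n", ". ", " ")) -> list[str]:
    """Split on the coarsest separator present, recursing into any
    piece still over max_len, then greedily merge small neighbours
    back up toward max_len so chunks stay close to the target size."""
    if len(text) <= max_len:
        return [text]
    sep = next((s for s in seps if s in text), None)
    if sep is None:
        # No linguistic boundary left: hard cut as a last resort.
        return [text[i:i + max_len] for i in range(0, len(text), max_len)]
    parts: list[str] = []
    for piece in text.split(sep):
        parts.extend(recursive_split(piece, max_len, seps))
    chunks: list[str] = []
    for part in parts:
        if chunks and len(chunks[-1]) + len(sep) + len(part) <= max_len:
            chunks[-1] = chunks[-1] + sep + part
        else:
            chunks.append(part)
    return chunks
```

Preferring paragraph breaks over sentence breaks over spaces is what keeps chunks from ending mid-sentence, which is the failure mode of fixed-size chunking.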
The benchmark harness runs the full pipeline end-to-end and scores each response with RAGAS.
```shell
# From the backend directory
python -m eval.runner --benchmark
```

Results are written to eval/results/<timestamp>.json and printed as a summary table.
To evaluate a single question:
```shell
python -m eval.runner --query "What is the notice period for contract termination?"
```

The benchmark dataset format (eval/data/benchmark.jsonl) is one JSON object per line:
```json
{"question": "...", "ground_truth": "...", "source_file": "doc.pdf"}
```

Core
| Method | Path | Description |
|---|---|---|
| GET | /api/health | Health check with document and model counts |
| POST | /api/documents/upload | Upload and ingest a document (multipart) |
| GET | /api/documents | List all ingested documents |
| DELETE | /api/documents/{id} | Delete a document and its chunks |
| POST | /api/query | RAG query — returns answer + cited sources |
| POST | /api/query/stream | Streaming RAG query (Server-Sent Events) |
| POST | /api/search | Pure retrieval — returns ranked chunks |
| POST | /api/research | Multi-agent research pipeline |
| POST | /api/research/stream | Streaming research with per-agent events |
| POST | /api/eval/run | Trigger async RAGAS evaluation |
| GET | /api/eval/results | List evaluation runs and metrics |
Multi-tenant
| Method | Path | Description |
|---|---|---|
| POST | /api/tenants | Create a new tenant + API key |
| GET | /api/tenants/me | Get current tenant profile |
| POST | /api/tenants/me/api-keys | Create a new API key |
| GET | /api/tenants/me/api-keys | List API keys |
| DELETE | /api/tenants/me/api-keys/{id} | Revoke an API key |
| GET | /api/tenants/me/usage | Usage metrics |
Embeddable widget (publishable key pk_* required)
| Method | Path | Description |
|---|---|---|
| POST | /api/widget/query | Widget search query |
| POST | /api/widget/feedback | Submit thumbs up/down feedback |
Full interactive documentation: http://localhost:8000/docs
See CONTRIBUTING.md.