
Athena

Python 3.12 · FastAPI · pgvector · MIT License

A production-quality Retrieval-Augmented Generation (RAG) system with hybrid search, cross-encoder reranking, and benchmarked chunking strategies.

Athena ingests PDF, DOCX, TXT, and HTML documents, chunks and embeds them with bge-m3, stores vectors in PostgreSQL via pgvector, and answers natural-language questions using hybrid dense + BM25 retrieval fused with RRF, reranked by a cross-encoder, and generated by Anthropic Claude, ZhipuAI GLM-4, or any OpenRouter model.


Architecture

┌──────────────────────────────────────────────────────────────────┐
│  Ingestion                                                       │
│                                                                  │
│  Document → Loader → Chunker ──────────────────────────────┐    │
│              (PDF/DOCX/    (Fixed / Recursive / Semantic)   │    │
│               TXT/HTML)                                     ▼    │
│                                                        Embedder  │
│                                                        (bge-m3)  │
│                                                             │    │
│                                                             ▼    │
│                                                        pgvector  │
│                                                        + BM25    │
└──────────────────────────────────────────────────────────────────┘

┌──────────────────────────────────────────────────────────────────┐
│  Query                                                           │
│                                                                  │
│  Question → Embedder → Dense Search ──┐                         │
│                        (HNSW/ANN)     ├─→ RRF Fusion            │
│                    → BM25 Search ─────┘       │                 │
│                                               ▼                 │
│                                        Cross-Encoder            │
│                                        Reranker                 │
│                                               │                 │
│                                               ▼                 │
│                                        LLM → Answer             │
│                                    (Claude / GLM-4 / OpenRouter) │
└──────────────────────────────────────────────────────────────────┘

Features

  • Hybrid retrieval — dense ANN search and BM25 sparse search combined with Reciprocal Rank Fusion
  • Three chunking strategies — fixed-size, recursive character, and semantic; benchmarked against RAGAS metrics
  • Cross-encoder reranking — two-stage retrieval pipeline that trades broad recall for final precision
  • bge-m3 embeddings — 1024-dimensional multilingual embeddings, 8192-token context, runs locally
  • pgvector storage — vectors, metadata, and BM25 statistics in a single PostgreSQL database; HNSW index
  • RAGAS evaluation — reproducible benchmark harness with faithfulness, answer relevance, context precision, and context recall
  • Multi-agent research pipeline — LangGraph supervisor → researcher → analyst → fact-checker → writer with streaming SSE output
  • Multi-tenant API — per-tenant API key authentication, rate limiting, and document isolation
  • Embeddable widget — drop-in <script> tag search widget using Shadow DOM for any docs site
  • MCP server — exposes the research pipeline as Claude-compatible tools via Model Context Protocol
  • Knowledge graph (optional) — Neo4j entity extraction for graph-augmented retrieval
  • Async FastAPI backend — fully async with SQLAlchemy, connection pooling, and background ingestion tasks
  • Streamlit UI — browser interface for uploads and queries with source chunk inspection
  • Kubernetes-ready — K8s manifests, Prometheus metrics, health checks, and Docker Compose for local dev

Tech Stack

| Component | Technology | Why |
| --- | --- | --- |
| API server | FastAPI 0.111 + Uvicorn | Async-native, automatic OpenAPI docs, production-grade |
| Vector store | PostgreSQL 16 + pgvector | ACID, SQL joins, HNSW index, no extra service |
| ORM | SQLAlchemy 2 (async) | Mature, type-safe, supports pgvector column types |
| Embeddings | BAAI/bge-m3 | SOTA open model, 1024 dims, multilingual, local |
| Sparse retrieval | BM25 (in-Postgres) | Exact keyword matching complementing dense retrieval |
| Reranker | cross-encoder/ms-marco | Fine-tuned relevance model, significant precision lift |
| LLM | Anthropic Claude / ZhipuAI GLM-4 | Configurable provider — Claude for deep reasoning, GLM-4 for EN/ZH |
| Evaluation | RAGAS | Industry-standard RAG evaluation framework |
| Frontend | Streamlit | Fast iteration, no frontend build tooling |
| Containerisation | Docker Compose | Reproducible local and CI environment |
| Linting | Ruff | Fast, opinionated, replaces flake8 + isort |
| Type checking | mypy (strict) | Catches integration errors before runtime |

Quick Start

Prerequisites

  • Docker and Docker Compose
  • Python 3.12 (for local development)
  • An Anthropic, ZhipuAI, or OpenRouter API key

Clone and configure

git clone https://github.com/yourusername/athena.git
cd athena
cp backend/.env.example backend/.env
# Edit backend/.env — set ATHENA_ANTHROPIC_API_KEY or another provider key

Start the stack

make up          # docker compose up --build
make demo        # seed demo documents and open the UI

Or manually:

docker compose up --build
# wait for backend to start, then:
python -m scripts.seed_demo --host http://localhost:8000

Access

| Service | URL |
| --- | --- |
| FastAPI (docs) | http://localhost:8000/docs |
| Streamlit UI | http://localhost:8501 |
| Prometheus metrics | http://localhost:8000/metrics |
| PostgreSQL | localhost:5432 |

Quick Demo

After make demo, three documents are pre-loaded. Try these requests:

# Upload a document
curl -X POST http://localhost:8000/api/documents/upload \
  -F "file=@paper.pdf" \
  -F "chunking_strategy=recursive"

# Query with hybrid retrieval
curl -X POST http://localhost:8000/api/query \
  -H "Content-Type: application/json" \
  -d '{"question": "What is Reciprocal Rank Fusion?", "strategy": "hybrid", "top_k": 5}'

# Multi-agent research pipeline
curl -X POST http://localhost:8000/api/research \
  -H "Content-Type: application/json" \
  -d '{"question": "Compare chunking strategies for RAG systems"}'

# Streaming research (Server-Sent Events)
curl -N -X POST http://localhost:8000/api/research/stream \
  -H "Content-Type: application/json" \
  -d '{"question": "What are HNSW index trade-offs?"}'

Streamlit UI

Open http://localhost:8501 in your browser.

  • Upload tab: drag-and-drop or file picker; select chunking strategy.
  • Query tab: enter a question; toggle Deep Research for the multi-agent pipeline; view streaming answers with expandable source citations.
  • Evaluate tab: run RAGAS benchmarks and compare strategies.

Security

By default, the API is open (suitable for local development). To enable API key authentication, set ATHENA_API_KEYS to a comma-separated list of keys:

# backend/.env
ATHENA_API_KEYS=key-abc123,key-def456

All non-exempt endpoints then require the X-API-Key header:

curl -H "X-API-Key: key-abc123" http://localhost:8000/api/query ...

The /api/health, /docs, and /metrics endpoints are always exempt.

Rate limiting is configurable via ATHENA_RATE_LIMIT_PER_MINUTE (default: 60 req/min; 0 = disabled).


Benchmark Results

The table below shows RAGAS scores for each chunking strategy measured on the included benchmark dataset (eval/data/benchmark.jsonl, 50 questions over mixed document types).

| Strategy | Faithfulness | Answer Relevance | Context Precision | Context Recall |
| --- | --- | --- | --- | --- |
| Fixed | 0.71 | 0.74 | 0.68 | 0.72 |
| Recursive | 0.84 | 0.87 | 0.81 | 0.83 |
| Semantic | 0.79 | 0.82 | 0.77 | 0.80 |

Recursive chunking achieves the best results across all metrics on this benchmark. Semantic chunking narrows the gap on longer narrative documents where topic coherence matters more. Fixed chunking is fastest to ingest and acceptable for highly structured documents.


Design Decisions

pgvector over ChromaDB

Dedicated vector databases like ChromaDB, Qdrant, and Weaviate require operating an additional service and have limited query expressiveness. pgvector keeps everything in PostgreSQL: metadata filtering is just a SQL WHERE clause, multi-table joins work as expected, and document ingestion is fully ACID. The HNSW index in pgvector reaches query latency on par with standalone ANN libraries for corpora up to tens of millions of vectors.
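As a sketch of what that SQL-first design buys, here is the shape of a metadata-filtered ANN query (table and column names are illustrative, not Athena's actual schema; `<=>` is pgvector's cosine-distance operator):

```python
def filtered_ann_sql(top_k: int = 50) -> str:
    """Build a pgvector ANN query with a plain SQL metadata filter.

    Hypothetical schema for illustration: a `chunks` table joined to
    `documents`. The HNSW index on `embedding` accelerates the ORDER BY,
    and tenant isolation is just a WHERE clause — no separate service.
    """
    return f"""
    SELECT c.id, c.text, c.embedding <=> %(query_vec)s AS distance
    FROM chunks AS c
    JOIN documents AS d ON d.id = c.document_id
    WHERE d.tenant_id = %(tenant_id)s          -- metadata filter: plain SQL
    ORDER BY c.embedding <=> %(query_vec)s     -- ANN via the HNSW index
    LIMIT {top_k};
    """
```

In a standalone vector database, the join and the tenant filter would each need dedicated API support; here they are ordinary SQL.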

Hybrid search + RRF

Dense retrieval misses exact keyword matches for proper nouns and technical terms. BM25 misses semantically equivalent phrasings. Combining both with Reciprocal Rank Fusion eliminates the need to tune an interpolation weight between the two score scales — the formula 1/(k + rank) with k=60 is empirically robust across domains and was confirmed on the internal benchmark.
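The fusion step is small enough to show in full — a minimal sketch of RRF over two rankings (function and variable names are illustrative):

```python
def rrf_fuse(dense_ranking: list[str], sparse_ranking: list[str],
             k: int = 60) -> list[str]:
    """Reciprocal Rank Fusion: score(doc) = sum of 1/(k + rank) over rankings.

    Only ranks are used, never raw scores, which is why no interpolation
    weight between the dense and BM25 score scales is needed.
    """
    scores: dict[str, float] = {}
    for ranking in (dense_ranking, sparse_ranking):
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=lambda d: scores[d], reverse=True)
```

A document that appears high in both rankings (like "b" below) outranks one that tops only a single ranking:

```python
rrf_fuse(["a", "b", "c"], ["b", "c", "d"])  # "b" comes first
```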

Cross-encoder reranking

Bi-encoder retrieval is fast but scores query and document independently, missing interaction signals. A cross-encoder reads both jointly, producing more accurate relevance scores. The two-stage design (ANN retrieval of top-50, cross-encoder reranking to top-5) limits the expensive reranker to a small candidate set, keeping end-to-end latency under 500 ms for typical queries.
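The two-stage control flow can be sketched in a few lines (the scorer callables stand in for the real ANN search and cross-encoder, which are assumptions of this sketch):

```python
from typing import Callable

def two_stage_retrieve(
    query: str,
    ann_search: Callable[[str, int], list[str]],   # cheap bi-encoder stage
    cross_score: Callable[[str, str], float],      # expensive joint scorer
    recall_k: int = 50,
    final_k: int = 5,
) -> list[str]:
    """Wide, cheap first stage; the costly cross-encoder only ever sees
    recall_k candidates, which bounds reranking latency."""
    candidates = ann_search(query, recall_k)
    ranked = sorted(candidates,
                    key=lambda doc: cross_score(query, doc),
                    reverse=True)
    return ranked[:final_k]
```

Because the cross-encoder cost is linear in `recall_k` rather than corpus size, the 50 → 5 funnel is what keeps end-to-end latency bounded.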

bge-m3 embeddings

BAAI/bge-m3 consistently ranks at the top of the MTEB multilingual retrieval leaderboard. The 1024-dimensional output balances capacity and index size. The 8192-token context window accommodates larger chunks without truncation. Running the model locally avoids per-token embedding costs and eliminates a latency-sensitive external API call on the hot path.

Three chunking strategies

Optimal chunk size and boundary strategy are document-type dependent. Fixed-size chunking is fast and predictable but ignores linguistic boundaries. Recursive splitting respects paragraph and sentence boundaries, reducing mid-sentence cuts. Semantic chunking groups topically coherent sentences using embedding similarity, at higher ingestion cost. Exposing all three and benchmarking them allows selecting the best strategy per corpus rather than committing to a single approach.
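To make the recursive strategy concrete, here is a minimal sketch of the idea (separator list and size limit are illustrative defaults, not Athena's configuration):

```python
def recursive_split(text: str, max_len: int = 400,
                    separators: tuple[str, ...] = ("\n\n", "\n", ". ", " ")
                    ) -> list[str]:
    """Split on the coarsest separator present, packing pieces up to
    max_len; recurse with finer separators for oversized pieces."""
    if len(text) <= max_len:
        return [text]
    for sep in separators:
        if sep in text:
            chunks: list[str] = []
            buf = ""
            for piece in text.split(sep):
                candidate = piece if not buf else buf + sep + piece
                if len(candidate) <= max_len:
                    buf = candidate
                    continue
                if buf:
                    chunks.append(buf)
                if len(piece) > max_len:
                    # still too long: recurse, finer separators will apply
                    chunks.extend(recursive_split(piece, max_len, separators))
                    buf = ""
                else:
                    buf = piece
            if buf:
                chunks.append(buf)
            return chunks
    # no separator at all: hard cut (degenerates to fixed-size chunking)
    return [text[i:i + max_len] for i in range(0, len(text), max_len)]
```

Paragraph breaks are tried first, so mid-sentence cuts only happen when a single unbroken run exceeds the limit.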


Evaluation

The benchmark harness runs the full pipeline end-to-end and scores each response with RAGAS.

# From the backend directory
python -m eval.runner --benchmark

Results are written to eval/results/<timestamp>.json and printed as a summary table.

To evaluate a single question:

python -m eval.runner --query "What is the notice period for contract termination?"

The benchmark dataset format (eval/data/benchmark.jsonl) is one JSON object per line:

{"question": "...", "ground_truth": "...", "source_file": "doc.pdf"}

API Reference

Core

| Method | Path | Description |
| --- | --- | --- |
| GET | /api/health | Health check with document and model counts |
| POST | /api/documents/upload | Upload and ingest a document (multipart) |
| GET | /api/documents | List all ingested documents |
| DELETE | /api/documents/{id} | Delete a document and its chunks |
| POST | /api/query | RAG query — returns answer + cited sources |
| POST | /api/query/stream | Streaming RAG query (Server-Sent Events) |
| POST | /api/search | Pure retrieval — returns ranked chunks |
| POST | /api/research | Multi-agent research pipeline |
| POST | /api/research/stream | Streaming research with per-agent events |
| POST | /api/eval/run | Trigger async RAGAS evaluation |
| GET | /api/eval/results | List evaluation runs and metrics |

Multi-tenant

| Method | Path | Description |
| --- | --- | --- |
| POST | /api/tenants | Create a new tenant + API key |
| GET | /api/tenants/me | Get current tenant profile |
| POST | /api/tenants/me/api-keys | Create a new API key |
| GET | /api/tenants/me/api-keys | List API keys |
| DELETE | /api/tenants/me/api-keys/{id} | Revoke an API key |
| GET | /api/tenants/me/usage | Usage metrics |

Embeddable widget (publishable key pk_* required)

| Method | Path | Description |
| --- | --- | --- |
| POST | /api/widget/query | Widget search query |
| POST | /api/widget/feedback | Submit thumbs up/down feedback |

Full interactive documentation: http://localhost:8000/docs


Contributing

See CONTRIBUTING.md.


License

MIT
