Sub-200ms time-to-first-token. Hybrid search + cross-encoder reranking + streaming answers from the fastest LLM providers on the planet. Written in Rust.
Benchmarked on 3,045 questions from Open RAG Bench (academic papers, 1,000 PDFs, 57,347 chunks). LLM-as-judge evaluation using qwen-3-235b. Infrastructure running on VastAI RTX 4090 (USA region).
With Reranker Without Reranker
┌───────────────────────────────────┐ ┌───────────────────────────────────┐
│ │ │ │
│ Pass Rate: 94.5% │ │ Pass Rate: 91.6% │
│ ████████████████████████░░░ │ │ ██████████████████████░░░░░ │
│ 2,857 / 3,024 judged │ │ 2,767 / 3,020 judged │
│ │ │ │
└───────────────────────────────────┘ └───────────────────────────────────┘
Time to First Token (TTFT)
──────────────────────────────────────────────────────────
With Reranker ████████████████████████████ 279ms
Without Reranker ████████████████████ 181ms
Total Response Time
──────────────────────────────────────────────────────────
With Reranker ████████████████████████████████████████████████████████████ 883ms
Without Reranker ██████████████████████████████████████ 511ms
| Metric | With Reranker | Without Reranker |
|---|---|---|
| Pass rate | 94.5% | 91.6% |
| Avg TTFT | 279ms | 181ms |
| Avg total response | 883ms | 511ms |
| Failed | 167 / 3,024 | 253 / 3,020 |
| LLM | Cerebras qwen-3-235b | Cerebras qwen-3-235b |
| Embeddings | Jina v5 nano (768-dim) | Jina v5 nano (768-dim) |
| Search | Hybrid HNSW + BM25 + RRF | Hybrid HNSW + BM25 + RRF |
| Reranker | Jina Reranker v3 | Disabled |
| Retrieval | 20 → top 3 after reranking | Top 3 from Milvus |
| Infra | VastAI RTX 4090 (USA) | VastAI RTX 4090 (USA) |
Reranking adds ~100ms TTFT and ~370ms total but improves accuracy by +2.9 percentage points and cuts failures by 34%. Without reranking, you get sub-200ms TTFT — choose your tradeoff.
- Sub-200ms TTFT — first tokens stream back before most APIs even respond
- Hybrid search — dense HNSW vectors + BM25 sparse keywords fused with Reciprocal Rank Fusion
- Cross-encoder reranking — Jina Reranker v3 scores each candidate against the query for precise relevance
- SSE streaming — real-time token streaming with source citations and server-side timing
- Multi-format ingestion — PDF, DOCX, PPTX, XLSX, HTML, TXT, ZIP via Docling (tables, OCR, layout-aware)
- Image understanding — embedded images described by Llama 4 Scout vision model
- Contextual retrieval — optional LLM-generated chunk prefixes for better search recall
- Multiple LLM providers — Groq (LPU) and Cerebras (wafer-scale) with OpenAI-compatible APIs
- Built-in eval framework — 3,045-question benchmark with LLM-as-judge scoring
- Single binary — no Python, no Node.js, no runtime dependencies on the server
- Swagger UI — auto-generated interactive API docs
- Collection management — multiple document collections with backup/export
RustyRAG is built for scenarios where both speed and accuracy matter:
- Voice AI / Conversational Agents — sub-200ms TTFT means RAG-grounded answers start streaming before the user notices a pause, critical for natural voice interactions
- AI Agents — give your agents a fast, reliable knowledge backend; they can call RustyRAG as a tool to ground their reasoning in real documents instead of hallucinating
- Legal & Compliance — search across contracts, filings, and policy documents with table-aware extraction and precise reranking
- Research & Academic — query thousands of papers with hybrid search that catches both semantic meaning and exact terminology
- Internal Knowledge Bases — drop in PDFs, docs, and spreadsheets — get an instant Q&A system with no training required
- Real-time Copilots — embed RustyRAG behind IDE plugins, dashboards, or Slack bots where latency kills adoption
If your RAG pipeline adds 2-3 seconds of latency, users won't use it. RustyRAG keeps the full pipeline — retrieval, reranking, and generation — under 1 second.
User Question
│
▼
┌──────────┐ ┌─────────────────────────────────────┐ ┌──────────────┐
│ Embed │────▶│ Milvus Hybrid Search │────▶│ Reranker │
│ (Jina v5)│ │ Dense HNSW + BM25 Sparse + RRF │ │ (Jina v3) │
└──────────┘ │ → 20 candidates │ │ → Top 3 │
└─────────────────────────────────────┘ └──────┬───────┘
│
▼
┌──────────────┐
│ LLM (SSE) │
│ Groq/Cerebras│
└──────┬───────┘
│
▼
Sources + Tokens
+ Timing (ms)
File (.pdf, .docx, .pptx, .xlsx, .html, .txt, .zip)
│
├─ 1. Extract ──────────────────► Docling (layout-aware, tables, OCR)
│
├─ 2. Describe images ──────────► Groq Llama 4 Scout (vision, 17B)
│
├─ 3. Chunk ────────────────────► Sentence-boundary-aware splitting
│
├─ 4. Contextual prefix (opt-in) ► LLM generates 1-2 sentence context
│
├─ 5. Embed ────────────────────► Jina v5 nano (768-dim vectors)
│
└─ 6. Insert ───────────────────► Milvus (dense HNSW + BM25 index)
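Step 3 relies on the text-splitter crate (named in the tech-stack table further down). A minimal sketch of sentence-aware chunking, assuming the text-splitter 0.13+ API; the actual logic in src/services/document.rs may differ:

```rust
use text_splitter::TextSplitter;

/// Sentence-boundary-aware chunking sketch. CHUNK_SIZE defaults to 2000
/// characters; the splitter packs whole sentences into each chunk
/// without exceeding that budget. (API shown is text-splitter 0.13+.)
fn chunk_document(text: &str) -> Vec<String> {
    let splitter = TextSplitter::new(2000);
    splitter.chunks(text).map(str::to_owned).collect()
}

fn main() {
    let chunks = chunk_document("First sentence. Second sentence. Third.");
    for (i, c) in chunks.iter().enumerate() {
        println!("chunk {i}: {c}");
    }
}
```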
Hybrid search retrieves 20 candidates using two complementary signals — dense vectors (semantic meaning) and BM25 (exact keyword matching). But vector similarity and keyword overlap are only rough proxies for relevance. The cross-encoder reranker reads each candidate together with the query as a single sequence and produces a precise relevance score. This promotes the most relevant chunks and demotes false positives, improving answer quality by 2.9 percentage points without increasing LLM context size.
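For intuition, here is a minimal, generic sketch of how Reciprocal Rank Fusion merges the two ranked lists. Illustration only: RustyRAG delegates fusion to Milvus, which performs RRF server-side, and k = 60 is the conventional constant, not necessarily what Milvus uses here.

```rust
use std::collections::HashMap;

/// Generic Reciprocal Rank Fusion: merge two ranked candidate lists
/// (dense-vector order and BM25 order) into one fused ranking.
/// Illustration only: RustyRAG lets Milvus do this server-side.
fn rrf(dense: &[u64], sparse: &[u64], k: f64) -> Vec<(u64, f64)> {
    let mut scores: HashMap<u64, f64> = HashMap::new();
    for list in [dense, sparse] {
        for (rank, id) in list.iter().enumerate() {
            // Each list contributes 1 / (k + rank); items ranked highly
            // in either list float to the top of the fused order.
            *scores.entry(*id).or_insert(0.0) += 1.0 / (k + rank as f64 + 1.0);
        }
    }
    let mut fused: Vec<(u64, f64)> = scores.into_iter().collect();
    fused.sort_by(|a, b| b.1.partial_cmp(&a.1).unwrap());
    fused
}

fn main() {
    let dense = [42, 7, 99];  // chunk IDs ranked by cosine similarity
    let sparse = [7, 13, 42]; // chunk IDs ranked by BM25
    for (id, score) in rrf(&dense, &sparse, 60.0) {
        println!("chunk {id}: {score:.4}");
    }
}
```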
- Rust 1.70+ — install via rustup
- Docker & Docker Compose — for Docling, embeddings, reranker, and Milvus
- Groq API key — get one at console.groq.com (also powers vision model)
- Cerebras API key — get one at cloud.cerebras.ai
git clone https://github.com/AlphaCorp-AI/RustyRAG
cd rustyrag
cp .env.example .env

Edit .env with your API keys:
GROQ_API_KEY=gsk_your-groq-key-here
CEREBRAS_API_KEY=csk_your-cerebras-key-here

docker compose up -d

This starts Docling (document extraction), Jina embeddings (TEI), the cross-encoder reranker (TEI), and Milvus 2.5 locally.
cargo build --release
cargo run --release

The server starts at http://127.0.0.1:8080.
- Chat UI — http://localhost:8080/static/chat.html
- Swagger UI — http://localhost:8080/swagger-ui/
- Docling UI — http://localhost:5001/ui
Upload a PDF, ask a question, and watch tokens stream back with source citations.
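You can also drive the API directly. Below is a hedged sketch of a programmatic upload, assuming reqwest with the blocking and multipart features enabled; the multipart part name "file" is an assumption, so check the Swagger UI for the exact schema.

```rust
use std::error::Error;

// Hypothetical programmatic upload to the documented
// /api/v1/documents/upload route. The multipart part name ("file")
// is an assumption; verify the request schema in the Swagger UI.
fn main() -> Result<(), Box<dyn Error>> {
    let form = reqwest::blocking::multipart::Form::new()
        .file("file", "paper.pdf")?; // any supported format works
    let resp = reqwest::blocking::Client::new()
        .post("http://127.0.0.1:8080/api/v1/documents/upload")
        .multipart(form)
        .send()?;
    println!("status: {}", resp.status());
    Ok(())
}
```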
| Format | Extension | Extraction Method |
|---|---|---|
| Plain text | `.txt` | Direct UTF-8 read |
| PDF | `.pdf` | Docling (layout-aware, tables, OCR) |
| Word | `.docx` | Docling |
| PowerPoint | `.pptx` | Docling |
| Excel | `.xlsx` | Docling |
| HTML | `.html` | Docling |
| ZIP archive | `.zip` | Unpacked, each entry processed individually |
Docling provides:
- Table extraction — 97.9% accuracy on complex tables via TableFormer
- Layout analysis — multi-column, headers/footers handled correctly
- OCR — automatic for scanned pages (skipped for digital text = fast)
- Image descriptions — embedded images described by Llama 4 Scout vision model
All endpoints live under /api/v1. Interactive docs at /swagger-ui/.
| Method | Endpoint | Description |
|---|---|---|
| `POST` | `/documents/upload` | Upload files — extract, chunk, embed, store in Milvus |
| `POST` | `/documents/search` | Semantic search across embedded documents |
| `GET` | `/documents/backup` | Download full collection as gzip-compressed JSON |
| Method | Endpoint | Description |
|---|---|---|
| `GET` | `/llms` | List supported models with provider names |
| `POST` | `/chat-rag` | RAG: hybrid search, rerank, generate answer |
| `POST` | `/chat-rag/stream` | SSE-streamed RAG (sources event + LLM tokens + timing) |
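For example, a minimal Rust client for the non-streaming /chat-rag route (a sketch assuming reqwest's blocking and json features plus serde_json; the JSON field names "question" and "model" are assumptions, so consult the Swagger UI for the real request schema):

```rust
use std::error::Error;

// Hypothetical client for POST /api/v1/chat-rag. The JSON field names
// ("question", "model") are assumptions; the actual request schema is
// documented in the Swagger UI at /swagger-ui/.
fn main() -> Result<(), Box<dyn Error>> {
    let body = serde_json::json!({
        "question": "What is Reciprocal Rank Fusion?",
        "model": "qwen-3-235b-a22b-instruct-2507"
    });
    let resp = reqwest::blocking::Client::new()
        .post("http://127.0.0.1:8080/api/v1/chat-rag")
        .json(&body)
        .send()?;
    println!("{}", resp.text()?);
    Ok(())
}
```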
| Method | Endpoint | Description |
|---|---|---|
| `POST` | `/evals/run` | Run evaluation benchmark (3,045 questions) and download results as JSON |
| Method | Endpoint | Description |
|---|---|---|
| `GET` | `/health` | Liveness check |
Every search hit and RAG source returns:
| Field | Type | Description |
|---|---|---|
| `id` | i64 | Milvus document ID |
| `file_name` | string | Original uploaded filename |
| `file_size` | i64 | File size in bytes |
| `chunk_index` | i64 | Position of chunk within the file |
| `page_number` | i64 | Page the chunk came from (0 if unknown) |
| `score` | f32 | Relevance score (from reranker or cosine similarity) |
| `text` | string | Chunk content |
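In Rust terms, a deserializable struct mirroring the table might look like the sketch below; the project's actual DTO lives in src/schemas/responses.rs and may differ in naming or derives.

```rust
use serde::Deserialize;

/// Sketch of a source-metadata struct mirroring the fields above; the
/// real definition lives in src/schemas/responses.rs and may differ.
#[derive(Debug, Deserialize)]
struct SourceChunk {
    id: i64,           // Milvus document ID
    file_name: String, // original uploaded filename
    file_size: i64,    // file size in bytes
    chunk_index: i64,  // position of chunk within the file
    page_number: i64,  // page the chunk came from (0 if unknown)
    score: f32,        // relevance: reranker or cosine similarity
    text: String,      // chunk content
}

fn main() {
    let json = r#"{"id":1,"file_name":"paper.pdf","file_size":1024,
        "chunk_index":0,"page_number":3,"score":0.97,"text":"..."}"#;
    let chunk: SourceChunk = serde_json::from_str(json).unwrap();
    println!("{chunk:?}");
}
```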
RustyRAG uses Groq and Cerebras for their low-latency inference hardware. Both expose OpenAI-compatible APIs.
- Groq — `llama-3.1-8b-instant`, `llama-3.3-70b-versatile`, `openai/gpt-oss-120b`, `openai/gpt-oss-20b`
- Cerebras — `llama3.1-8b`, `gpt-oss-120b`, `qwen-3-235b-a22b-instruct-2507`, `zai-glm-4.7`
- Vision — `meta-llama/llama-4-scout-17b-16e-instruct` (Groq) — describes images found inside documents
| Variable | Description |
|---|---|
| `HOST` | Server bind address |
| `PORT` | Server port |
| `GROQ_API_KEY` | Groq API key (also used for vision model) |
| `CEREBRAS_API_KEY` | Cerebras API key |
| Variable | Default | Description |
|---|---|---|
| `MILVUS_URL` | `http://localhost:19530` | Milvus REST API |
| `EMBEDDING_API_URL` | `http://localhost:7997/v1/embeddings` | Embedding endpoint |
| `EMBEDDING_MODEL` | `jinaai/jina-embeddings-v5-text-nano-retrieval` | Embedding model |
| `RERANKER_API_URL` | (empty = disabled) | TEI reranker endpoint |
| `DOCLING_URL` | `http://localhost:5001` | Docling extraction API |
| `VISION_MODEL` | `meta-llama/llama-4-scout-17b-16e-instruct` | Vision model for images |
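For example, the two benchmark configurations differ only in whether the reranker is reachable. A sketch of the toggle in .env (port 7998 comes from the services table below, and /rerank is TEI's standard rerank route; treat the exact URL as an assumption):

```
# Fast path (sub-200ms TTFT): leave empty to disable reranking
RERANKER_API_URL=

# Accurate path: point at the TEI reranker container
# (port 7998 per the services table; /rerank is TEI's rerank route)
# RERANKER_API_URL=http://localhost:7998/rerank
```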
| Variable | Default | Description |
|---|---|---|
| `RETRIEVAL_LIMIT` | `20` | Chunks fetched from Milvus before reranking |
| `RERANK_TOP_N` | `3` | Chunks kept after reranking |
| `CHUNK_SIZE` | `2000` | Max characters per chunk |
| `CHUNK_OVERLAP` | `200` | Overlap between chunks |
| `EMBEDDING_DIMENSION` | `768` | Vector dimensionality |
| `CORS_ALLOWED_ORIGINS` | (empty = permissive) | Comma-separated allowed origins |
See .env.example for the full list including Milvus index tuning and contextual retrieval options.
All services run via Docker Compose:
| Service | Image | Port | Role |
|---|---|---|---|
| Embeddings | `text-embeddings-inference:cuda-1.9` | 7997 | Jina v5 nano (768-dim) |
| Reranker | `text-embeddings-inference:cuda-1.9` | 7998 | Jina Reranker v3 |
| Docling | `docling-serve-cu128:v1.14.3` | 5001 | Document extraction (PDF/DOCX/PPTX) |
| Milvus | `milvus:v2.5.27` | 19530 | Vector DB (HNSW + BM25) |
GPU is used by embeddings, reranker, and Docling. The Rust server itself runs on CPU only.
src/
├── main.rs # Entry point, server bootstrap
├── config.rs # Env-based config (required/optional/tuning)
├── routes.rs # Route registration
├── errors.rs # AppError → HTTP response mapping
├── handlers/
│ ├── chat.rs # /llms, /chat-rag + streaming variants
│ ├── documents.rs # /documents/upload, /documents/search, /documents/backup
│ ├── evals.rs # /evals/run (benchmark runner)
│ └── health.rs # /health
├── schemas/
│ ├── requests.rs # Request DTOs (serde + validator)
│ └── responses.rs # Response DTOs (utoipa OpenAPI schemas)
├── services/
│ ├── docling.rs # Docling API client + vision image descriptions
│ ├── document.rs # Document extraction routing, semantic chunking
│ ├── embeddings.rs # Jina/OpenAI-compatible embedding client
│ ├── llm.rs # Groq + Cerebras LLM client
│ ├── milvus.rs # Milvus v2 REST client (hybrid search, BM25, backup)
│ └── reranker.rs # TEI cross-encoder reranker client
├── prompts/
│ ├── mod.rs # Prompt builder functions
│ ├── rag_system_prompt.txt
│ └── contextual_prompt.txt
scripts/
└── judge_evals.py # LLM-as-judge evaluation script
static/
└── chat.html # Built-in SSE chat + RAG frontend
docs/
├── eval_data.csv # 3,045 evaluation questions with expected answers
├── eval_judged_reranker.json # Benchmark results (with reranker)
└── eval_judged_no_rerank.json # Benchmark results (without reranker)
docker-compose.yml # Docling + Jina TEI + Reranker + Milvus 2.5
| Layer | Technology | Role |
|---|---|---|
| Runtime | Rust + Tokio + Actix-web 4 | Async web server |
| LLM | Groq (LPU) + Cerebras (wafer-scale) | Low-latency chat completions + SSE streaming |
| Vision | Llama 4 Scout 17B (Groq) | Image descriptions inside documents |
| Embeddings | Jina v5 text nano retrieval (TEI) | Local vectorization, 768-dim |
| Reranker | Jina Reranker v3 (TEI) | Cross-encoder reranking |
| Document extraction | Docling (IBM) | PDF/DOCX/PPTX tables, OCR, layout analysis |
| Vector DB | Milvus 2.5 (HNSW + BM25) | Hybrid dense + sparse search with RRF |
| Chunking | text-splitter crate | Semantic sentence-boundary-aware splitting |
| Docs | utoipa + Swagger UI | Auto-generated interactive API documentation |
| Infra | Docker Compose | One-command local stack |
# Development mode
cargo run
# Debug logging
RUST_LOG=debug cargo run
# Production build
cargo build --release
./target/release/rustyrag
# Run evaluation benchmark
curl -X POST http://localhost:8080/api/v1/evals/run -H 'Content-Type: application/json' -d '{}' -o eval_results.json
# Judge evaluation results
python3 scripts/judge_evals.py docs/eval_results.json

- 95%+ accuracy without reranker — improve retrieval quality so the fast path (sub-200ms TTFT) matches reranker-level accuracy
- 99%+ accuracy with reranker — push the reranker path toward near-perfect on Open RAG Bench through better chunking, contextual retrieval, and prompt tuning
- Hosted version — deploy a managed RustyRAG instance so you can get started with a single API key, no infrastructure required