Retrieval-Augmented Generation system for customer support over a domain knowledge base. Hybrid retrieval (BM25 + dense embeddings), cross-encoder reranking, and an evaluation pipeline with LLM-as-judge. Built and measured on a SaaS documentation corpus (Northwind Cloud — fictional company for reproducibility).
- Real measured evaluation: 30-question test set, Recall@5 = 96.7%, Correctness = 93.3% (LLM-as-judge), avg total latency 1.1 s
- Production architecture: hybrid retrieval (BM25 + dense) → cross-encoder rerank → calibrated generation
- Ablation study: vector-only vs hybrid vs hybrid+rerank — recall and latency trade-offs documented
- End-to-end runnable: `python demo.py "your question"` works after a single ingestion command
- REST API (FastAPI) + CLI entry points
- 12 unit tests passing in CI
Per-category breakdown across the 12 question types: see the generated charts in `assets/`.

Architecture:
```mermaid
flowchart TB
subgraph ingest["📥 Ingestion (offline)"]
A1[Markdown documents] --> A2["RecursiveCharacter<br/>chunker (512/50)"]
A2 --> A3["all-MiniLM-L6-v2<br/>(384-dim embeddings)"]
A3 --> A4[(ChromaDB<br/>cosine similarity)]
end
subgraph query["🔍 Query (online)"]
B1[User question] --> B2{Hybrid retrieval}
B2 --> B3["Dense vector<br/>top-20"]
B2 --> B4["BM25<br/>top-20"]
B3 --> B5[RRF fusion<br/>k=60]
B4 --> B5
B5 --> B6["Cross-encoder rerank<br/>ms-marco-MiniLM-L-6-v2"]
B6 --> B7["Top-5 chunks"]
B7 --> B8["Llama 3.3 70B<br/>via Groq"]
B8 --> B9[Answer + sources]
end
A4 -.-> B3
subgraph eval["📊 Evaluation"]
C1[30 Q&A test set] --> C2[Retrieval eval<br/>Recall@5]
C1 --> C3[Generation eval]
C3 --> C4[LLM-as-judge<br/>correctness 0.0–1.0]
end
```
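The hybrid step in the diagram merges the two top-20 candidate lists with Reciprocal Rank Fusion (k=60). A minimal sketch of that fusion, with illustrative names rather than the actual code in `src/retrieval/hybrid.py`:

```python
from collections import defaultdict

def rrf_fuse(dense_ids: list[str], bm25_ids: list[str], k: int = 60) -> list[str]:
    """Reciprocal Rank Fusion: score(doc) = sum over lists of 1 / (k + rank)."""
    scores: dict[str, float] = defaultdict(float)
    for ranked in (dense_ids, bm25_ids):
        for rank, doc_id in enumerate(ranked, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first
    return sorted(scores, key=scores.get, reverse=True)

# Example: fuse the two top-20 candidate lists from the diagram
fused = rrf_fuse(dense_ids=["c7", "c2", "c9"], bm25_ids=["c2", "c4", "c7"])
print(fused)  # chunks ranked high in both lists win, e.g. ['c2', 'c7', ...]
```

RRF only needs ranks, not raw scores, so BM25 scores and cosine similarities can be combined without any score normalization.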
| Layer | Tool / Model |
|---|---|
| LLM | Llama 3.3 70B via Groq (primary), Gemini 2.0 Flash (fallback) |
| Embeddings | sentence-transformers/all-MiniLM-L6-v2 (384-dim) |
| Vector DB | ChromaDB (persistent, cosine similarity) |
| Lexical retrieval | BM25 (rank-bm25) |
| Reranker | cross-encoder/ms-marco-MiniLM-L-6-v2 |
| Chunking | langchain-text-splitters RecursiveCharacterTextSplitter |
| Agent | LangGraph (graph + tools — backend) |
| API | FastAPI + Uvicorn |
| Eval | Custom retrieval + LLM-as-judge pipeline |
| Testing | pytest |
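For orientation, the offline ingestion path reduces to three of the calls in the table. A rough sketch assuming the settings above (512/50 chunking, MiniLM embeddings, a cosine ChromaDB collection); the file name, collection name, and ID scheme are placeholders, not necessarily what `src/ingestion/` uses:

```python
import chromadb
from langchain_text_splitters import RecursiveCharacterTextSplitter
from sentence_transformers import SentenceTransformer

# 1. Chunk: recursive character splitting, 512 chars with 50 overlap
splitter = RecursiveCharacterTextSplitter(chunk_size=512, chunk_overlap=50)
chunks = splitter.split_text(open("data/raw/pricing.md").read())  # placeholder file

# 2. Embed: 384-dim MiniLM vectors
encoder = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")
embeddings = encoder.encode(chunks).tolist()

# 3. Store: persistent ChromaDB collection with cosine similarity
client = chromadb.PersistentClient(path="data/chroma_db")
collection = client.get_or_create_collection("docs", metadata={"hnsw:space": "cosine"})
collection.add(
    ids=[f"pricing-{i}" for i in range(len(chunks))],
    documents=chunks,
    embeddings=embeddings,
)
```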
Test set: 30 questions across 12 categories (api, billing, pricing, security, sla, integrations, migration, policy, getting_started, limits, technical, troubleshooting). Each question has a ground-truth answer and an expected source document.
Metrics:
- Recall@5 — does the expected source document appear in the top-5 retrieved chunks?
- Correctness (0.0–1.0) — Llama 3.3 70B judges each generated answer against the ground truth on a fixed rubric.
- Latency — end-to-end time from question to answer (retrieval + generation).
The full pipeline including the LLM judge is reproducible: `python -m src.evaluation.evaluate full`. The test set is committed at `data/eval/test_set.json`; reports are written to `data/eval/report.json`.
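To make the two automatic metrics concrete, this is roughly the shape of the checks. A hedged sketch, not the code in `src/evaluation/evaluate.py`; the judge prompt, Groq model ID, and metadata field names are assumptions:

```python
from groq import Groq

def recall_at_5(expected_source: str, retrieved_chunks: list[dict]) -> bool:
    """Hit if the expected source document appears among the top-5 retrieved chunks."""
    return any(chunk["source"] == expected_source for chunk in retrieved_chunks[:5])

def judge_correctness(question: str, ground_truth: str, answer: str) -> float:
    """Ask the judge LLM to score the generated answer against the ground truth."""
    client = Groq()  # reads GROQ_API_KEY from the environment
    prompt = (
        "Score the ANSWER against the GROUND TRUTH for factual correctness.\n"
        "Reply with a single number between 0.0 and 1.0.\n\n"
        f"QUESTION: {question}\nGROUND TRUTH: {ground_truth}\nANSWER: {answer}"
    )
    resp = client.chat.completions.create(
        model="llama-3.3-70b-versatile",  # assumed Groq model ID
        messages=[{"role": "user", "content": prompt}],
        temperature=0.0,
    )
    return float(resp.choices[0].message.content.strip())

# Averaging over the 30 questions gives Recall@5 and Correctness as reported above.
```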
| Configuration | Recall@5 | Avg retrieval latency | Notes |
|---|---|---|---|
| Vector only (dense embeddings) | 100.0% | 9 ms | Sufficient on this clean small corpus |
| + BM25 hybrid (RRF fusion) | 96.7% | 7 ms | Adds keyword precision; slightly noisier on small N |
| + Cross-encoder reranker | 96.7% | 310 ms | Reorders for relevance; pays in latency |
Honest finding: on a 100-chunk well-curated corpus, dense vector search alone is sufficient. Hybrid and reranking show their value on larger, noisier production corpora where keyword matching catches what semantic search misses (e.g., exact product names, error codes, version numbers). The reference architecture keeps the full stack because that was the production configuration; for this synthetic eval it's a documented trade-off.
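For reference, the 310 ms reranking cost corresponds to jointly scoring each (query, chunk) pair with the cross-encoder named in the tech stack. A minimal sketch with illustrative names, not the code in `src/retrieval/reranker.py`:

```python
from sentence_transformers import CrossEncoder

reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

def rerank(query: str, chunks: list[str], top_k: int = 5) -> list[str]:
    """Score every (query, chunk) pair jointly, then keep the top_k highest."""
    scores = reranker.predict([(query, chunk) for chunk in chunks])
    ranked = sorted(zip(chunks, scores), key=lambda pair: pair[1], reverse=True)
    return [chunk for chunk, _ in ranked[:top_k]]

# The fused RRF candidates go in; the 5 chunks passed to the LLM come out.
```

Unlike the bi-encoder embeddings, the cross-encoder reads query and chunk together, which is what buys the relevance gain and costs the latency.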
```bash
# 1. Install
pip install -r requirements.txt
# 2. Set GROQ_API_KEY (free tier at https://console.groq.com)
cp .env.example .env
# edit .env to add your key
# 3. Build the knowledge base (loads markdown → chunks → ChromaDB)
python -m src.ingestion.build_knowledge_base
# 4. Ask a question
python demo.py "How much does the Business plan cost?"

# 5. Or serve the REST API
uvicorn src.api.main:app --port 8000
# in another terminal:
curl -X POST http://localhost:8000/chat \
  -H "Content-Type: application/json" \
  -d '{"query": "What payment methods do you accept?"}'

# 6. Or run everything with Docker
docker-compose up
```

`docker-compose up` builds the image, runs ingestion at build-time, and exposes the API on :8000 with a health-check.
Project Structure
```
.
├── demo.py # CLI entry point
├── data/
│ ├── raw/ # 14 markdown documents (synthetic SaaS docs)
│ ├── eval/test_set.json # 30-question test set with ground truths
│ └── chroma_db/ # built locally, gitignored
├── src/
│ ├── ingestion/
│ │ ├── chunker.py # recursive markdown chunking
│ │ ├── embedder.py # ChromaDB + sentence-transformers
│ │ └── build_knowledge_base.py # full ingestion pipeline
│ ├── retrieval/
│ │ ├── hybrid.py # BM25 + dense + RRF fusion
│ │ ├── reranker.py # cross-encoder reranking
│ │ └── rag.py # full RAG: query → context → LLM
│ ├── evaluation/
│ │ └── evaluate.py # retrieval / ablation / full eval modes
│ └── api/
│ └── main.py # FastAPI service
├── notebooks/
│ └── 01_retrieval_walkthrough.ipynb # walk-through of vector / BM25 / hybrid / rerank
├── tests/ # 12 unit tests
└── assets/                        # generated charts for this README
```
License: MIT


