A Retrieval Augmented Generation (RAG) system built from scratch. By implementing hybrid retrieval, cross-encoder reranking, hallucination guards, and Ragas-based evaluation.
Upload any PDF and ask questions in plain English. The system retrieves the most relevant chunks from your document, passes them to an LLM, and returns a grounded answer with source citations. If the retrieved context isn't strong enough to answer confidently, the system says so no hallucination.
PDF / Markdown
│
▼
┌─────────────────────┐
│ Document Ingestion │ PyMuPDF + RecursiveCharacterTextSplitter
│ 500-800 token │ 100 token overlap at boundaries
│ chunks │
└────────┬────────────┘
│
▼
┌─────────────────────┐
│ Vector Store │ sentence-transformers (all-MiniLM-L6-v2)
│ ChromaDB │ 384-dim embeddings, persisted to disk
└────────┬────────────┘
│
User Query
│
▼
┌─────────────────────────────────────────┐
│ Hybrid Retrieval │
│ BM25 keyword search + Vector search │
│ (exact term match) (semantic intent) │
└────────┬────────────────────────────────┘
│
▼
┌─────────────────────┐
│ Cross-Encoder │ ms-marco-MiniLM-L-6-v2
│ Reranker │ rescores top-10 → top-3
└────────┬────────────┘
│
▼
┌─────────────────────┐
│ Citation Enforcement│ declines if evidence score < threshold
│ + Hallucination │ forces explicit source citation
│ Guard │
└────────┬────────────┘
│
▼
┌─────────────────────┐
│ Groq LLM │ LLaMA-3 (free API)
│ (LLaMA-3) │ grounded answer generation
└────────┬────────────┘
│
▼
┌─────────────────────┐
│ Streamlit UI │ upload → ask → cited answer
└─────────────────────┘
| Metric | Score |
|---|---|
| Faithfulness | — |
| Answer Relevancy | — |
| Context Precision | — |
| Golden Dataset Size | 20 QA pairs |
| Evaluation Framework | Ragas |
| Component | Tool | Why |
|---|---|---|
| PDF parsing | PyMuPDF | Fast, handles complex PDFs |
| Chunking | LangChain text splitters | Recursive, context-aware |
| Embeddings | sentence-transformers | Free, runs locally, no API needed |
| Vector DB | ChromaDB | Local persistent storage |
| Keyword search | rank-bm25 | Exact term matching |
| Reranker | cross-encoder/ms-marco-MiniLM-L-6-v2 | Boosts precision significantly |
| LLM | Groq API (LLaMA-3) | Free tier, fast inference |
| Evaluation | Ragas | Industry-standard RAG metrics |
| UI | Streamlit | Fast to build, easy to demo |
production-rag/
│
├── notebooks/
│ ├── 01_document_ingestion.ipynb # load, chunk, embed PDFs
│ ├── 02_vector_store_retrieval.ipynb # ChromaDB storage + top-K search
│ ├── 03_groq_rag_pipeline.ipynb # connect Groq LLM, generate answers
│ ├── 04_hybrid_retrieval.ipynb # BM25 + vector hybrid search
│ ├── 05_reranker.ipynb # cross-encoder reranking
│ ├── 06_citation_hallucination_guard.ipynb # grounding + refusal logic
│ └── 07_ragas_evaluation.ipynb # faithfulness + relevancy scoring
│
├── docs/ # put your PDF files here
│ └── sample.pdf
│
├── data/
│ ├── golden_dataset.csv # 20 manually verified QA pairs
│ └── ragas_results.json # auto-generated evaluation scores
│
├── app/
│ └── app.py # Streamlit UI
│
├── requirements.txt
├── .env # GROQ_API_KEY (never commit this)
├── .gitignore
└── README.md
1. Clone the repo
git clone https://github.com/Neogenxx/DocMind.git
cd DocMind2. Install dependencies
pip install -r requirements.txt3. Get your free Groq API key
Sign up at console.groq.com → API Keys → Create Key. It's free.
4. Add your key to .env
GROQ_API_KEY=your_key_here
5. Add a PDF to the docs/ folder
Any PDF works — research papers, textbooks, your college notes.
6. Run the notebooks in order
jupyter notebookOpen notebooks/01_document_ingestion.ipynb and run each notebook sequentially.
7. Launch the Streamlit app
streamlit run app/app.pyPure vector search misses exact keyword matches (like model names, author names, specific terms). Pure BM25 misses semantic meaning. This system combines both with a weighted merge, then reranks — getting the best of both worlds.
The initial retriever returns top-10 candidates cheaply. A cross-encoder then reads each chunk alongside the query and produces a precise relevancy score. Only the top-3 reranked chunks reach the LLM — significantly improving answer quality.
If no retrieved chunk scores above the relevancy threshold, the system responds with "I don't have enough information in the provided documents to answer this confidently" rather than fabricating an answer. This is what separates a production system from a demo.
A golden dataset of 20 manually verified question-answer pairs is used to measure:
- Faithfulness — is every claim in the answer supported by the retrieved chunks?
- Answer Relevancy — does the answer actually address the question?
- Context Precision — were the retrieved chunks actually relevant?
# Inside 07_ragas_evaluation.ipynb
from ragas import evaluate
from ragas.metrics import faithfulness, answer_relevancy, context_precision
results = evaluate(
dataset=golden_dataset,
metrics=[faithfulness, answer_relevancy, context_precision]
)
print(results)Building this project beyond a basic demo required understanding:
- Why chunking strategy matters chunk too large and retrieval loses precision, too small and you lose context
- Why hybrid search outperforms pure vector search on technical documents
- Why cross-encoder reranking exists bi-encoders are fast but imprecise, cross-encoders are slow but accurate, so you combine them
- Why citation enforcement matters an LLM that says "I don't know" is more trustworthy than one that hallucinates confidently
- How to measure RAG quality quantitatively with Ragas rather than just eyeballing answers