A production-oriented Retrieval-Augmented Generation (RAG) system built with LangChain, FAISS, and the Gemini API. Point it at any collection of documents (.pdf, .txt, .md) and ask questions in natural language — single-shot or as a multi-turn conversation.
Stack: Python 3.10+ · LangChain 0.3 · FAISS · Google text-embedding-004 · Gemini 1.5 Flash
## Table of Contents

- Features
- Architecture
- Prerequisites
- Installation
- Configuration
- Usage
- Design Decisions
- Evaluation
- Known Limitations
- Extending the Project
## Features

- Ingest PDF, plain text, and Markdown documents from a file or directory tree.
- Chunk documents with overlap using `RecursiveCharacterTextSplitter` for semantically coherent segments.
- Embed chunks with Google's `text-embedding-004` model (768-dim vectors).
- Store and query vectors locally with FAISS (no cloud dependency for retrieval).
- Single-turn QA with source attribution.
- Multi-turn conversational chat with automatic question condensation.
- Incremental index updates without rebuilding from scratch.
- Retrieval quality evaluation with two heuristic metrics (coverage + groundedness).
- JSON evaluation reports for offline analysis.
## Architecture

```
User Question
     |
     v
[main.py CLI]
     |
     +-- ingest --> [document_loader.py] --> raw chunks
     |                       |
     |                [embeddings.py]
     |                       |
     |               [vector_store.py]
     |                       |
     |               FAISS index (disk)
     |
     +-- query/chat --> [rag_pipeline.py]
                              |
                    +---------+---------+
                    |                   |
           [vector_store.py]     [embeddings.py]
                    |
         FAISS similarity search
                    |
          top-k chunks (context)
                    |
         [Gemini LLM via LangChain]
                    |
         Answer + source attribution
```
- `document_loader.ingest()` scans the source path.
- Files are loaded with type-appropriate loaders (`PyPDFLoader` for PDF, `TextLoader` for .txt/.md).
- `RecursiveCharacterTextSplitter` splits documents into overlapping chunks (default 1000 chars, 200-char overlap).
- `FAISS.from_documents()` embeds every chunk with `text-embedding-004` and stores vectors in a flat L2 index.
- The index is serialized to `faiss_index/` (`.faiss` + `.pkl`).
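The overlap arithmetic in the chunking step can be illustrated with a dependency-free sketch. This is illustrative only; the real pipeline delegates to `RecursiveCharacterTextSplitter`, which also prefers paragraph and sentence boundaries rather than cutting at fixed offsets:

```python
def chunk_text(text: str, chunk_size: int = 1000, overlap: int = 200) -> list[str]:
    """Fixed-size chunking with overlap: each chunk starts stride = chunk_size - overlap
    characters after the previous one, so the last 200 chars of a chunk repeat at the
    head of the next chunk."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    stride = chunk_size - overlap
    chunks = []
    for start in range(0, len(text), stride):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break
    return chunks

doc = "".join(str(i % 10) for i in range(2500))
chunks = chunk_text(doc)
print(len(chunks))                          # 3 chunks, starting at 0, 800, 1600
print(chunks[0][-200:] == chunks[1][:200])  # True: 200-char overlap
```

With the defaults, a 2,500-character document yields three chunks of 1000, 1000, and 900 characters.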
- The question is embedded with `text-embedding-004` (`retrieval_query` task type).
- FAISS returns the top-k chunks by L2 distance (equivalent to cosine similarity for normalized vectors).
- Retrieved chunks are concatenated into the QA prompt.
- Gemini generates an answer grounded in the retrieved context.
- Source filenames and page numbers are returned alongside the answer.
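The L2/cosine equivalence noted above holds because for unit vectors ‖q − d‖² = 2 − 2(q·d), so ascending L2 distance and descending cosine similarity produce the same ranking. A quick stdlib check:

```python
import math

def normalize(v):
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]

def l2(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def cosine(a, b):
    # For unit vectors the dot product IS the cosine similarity.
    return sum(x * y for x, y in zip(a, b))

q = normalize([0.2, 0.9, 0.1])
docs = [normalize(v) for v in ([0.1, 1.0, 0.0], [1.0, 0.1, 0.3], [0.3, 0.8, 0.2])]

by_l2 = sorted(range(3), key=lambda i: l2(q, docs[i]))        # ascending distance
by_cos = sorted(range(3), key=lambda i: -cosine(q, docs[i]))  # descending similarity
print(by_l2 == by_cos)  # True
```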
Adds one step before retrieval: a second Gemini call uses the conversation history to rewrite the follow-up question into a fully standalone query. This prevents the retriever from receiving vague pronoun references that would return irrelevant chunks.
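The condensation step amounts to formatting the history and follow-up into a rewrite prompt. A minimal sketch with a hypothetical template (the project's actual wording in rag_pipeline.py may differ):

```python
# Hypothetical condense prompt; illustrative wording only.
CONDENSE_PROMPT = (
    "Given the conversation below, rewrite the follow-up question as a "
    "fully standalone question that needs no prior context.\n\n"
    "Conversation:\n{history}\n\n"
    "Follow-up question: {question}\n"
    "Standalone question:"
)

def build_condense_prompt(history: list[tuple[str, str]], question: str) -> str:
    """Render (role, text) turns into the condense prompt sent to the LLM."""
    lines = [f"{role}: {text}" for role, text in history]
    return CONDENSE_PROMPT.format(history="\n".join(lines), question=question)

prompt = build_condense_prompt(
    [("You", "What was revenue in 2024?"), ("Assistant", "Revenue was 4.2 billion.")],
    "Which segment underperformed?",
)
```

The LLM's completion (e.g. "Which business segment underperformed in 2024?") then replaces the raw follow-up before it reaches the retriever.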
## Prerequisites

- Python 3.10 or later
- A Google Gemini API key (free at https://aistudio.google.com/app/apikey)
- pip
## Installation

```bash
git clone https://github.com/Nerd-coderZero/rag-chatbot.git
cd rag-chatbot
python -m venv .venv
source .venv/bin/activate   # Linux / macOS
# .venv\Scripts\activate    # Windows
pip install -r requirements.txt
```

## Configuration

```bash
cp .env.example .env
```

Edit `.env`:
```
GEMINI_API_KEY=your_key_here
GEMINI_MODEL=gemini-1.5-flash
```
The `.env` file is loaded automatically. Never commit it — it is in `.gitignore`.
## Usage

```bash
python main.py [--index-path PATH] <command> [options]
```

`--index-path` (default: `faiss_index`) is global across all subcommands.
```bash
python main.py ingest --source docs/
python main.py ingest --source paper.pdf
python main.py ingest --source docs/ --chunk-size 800 --chunk-overlap 150
python main.py ingest --source docs/ --no-recursive
python main.py --index-path product_docs ingest --source manuals/
```

Supported: `.pdf`, `.txt`, `.md`
```bash
python main.py query "What are the main findings?"
python main.py query "Summarise the methodology" --show-sources
python main.py query "What datasets were used?" --show-sources --verbose
```

Example output:
```
Answer:
The report identifies three primary risk factors: regulatory uncertainty,
supply chain fragility, and talent shortages. Mitigation strategies are
detailed in Section 4.2.

Sources:
[1] docs/annual_report_2024.pdf (page 7)
[2] docs/executive_brief.md
```
```bash
python main.py chat
python main.py chat --show-sources
python main.py chat --k 6
```

Example session:
```
RAG Chatbot ready. Type 'exit' or press Ctrl+C to quit.

You: What was the company revenue for 2024?
Assistant: Based on the financial statements, total revenue for 2024 was
4.2 billion, representing 12% year-over-year growth driven by cloud services.

You: Which segment underperformed?
Assistant: The hardware division underperformed expectations, with revenue
flat at 0.8 billion against a projected 1.1 billion, as noted in Section 3.1.

You: exit
Session ended.
```
```bash
# Interactive input
python main.py evaluate

# Questions from file
python main.py evaluate --questions-file tests/questions.txt

# Save full JSON report
python main.py evaluate --questions-file tests/questions.txt --output eval_report.json
```

`tests/questions.txt` format — one question per line:
```
What is the executive summary?
What were the key findings in 2024?
How does the methodology differ from prior work?
```
Example console output:

```
==================================================
Evaluation Summary
==================================================
Total questions         : 10
Successful              : 10
Failed                  : 0
Mean retrieval coverage : 82.3%
Mean groundedness       : 74.6%
Mean latency            : 1.847s / question
==================================================
```
```bash
python main.py stats
```

Output:

```
Index path : /home/user/rag-chatbot/faiss_index
Vectors    : 1,842
```
## Design Decisions

FAISS runs entirely in-process with no server, no Docker container, and no network calls. For a single-user RAG system over documents up to ~100k chunks, a flat FAISS index queries in under 5ms. ChromaDB adds value when you need metadata filtering, multi-user access, or a persistent HTTP API. Pinecone is appropriate for production systems that need managed infrastructure and sub-millisecond search at millions of vectors. Neither is necessary at this scale.
text-embedding-004 produces 768-dimensional vectors with competitive retrieval
quality on MTEB benchmarks. It is Google's current recommended embedding model,
pairs naturally with the Gemini LLM (same API key, same SDK), and its free tier
is sufficient for most project-scale corpora. Sentence-transformers would require
local GPU or slower CPU inference. OpenAI embeddings would introduce a second
vendor and a second billing relationship.
text-embedding-004 accepts a task_type parameter: retrieval_document for
indexing and retrieval_query for search. Google trains these to produce vectors
optimized for asymmetric retrieval — the document and query live in slightly
different sub-spaces that maximize recall. This setting is exposed in
embeddings.py and used transparently by LangChain's retriever.
1000 characters is roughly 150-200 words — enough context for the LLM to produce
a coherent answer from a single chunk, and small enough that most questions will
retrieve topically focused content rather than a broad summary. 200-character
overlap (20%) prevents answers from being missed because a relevant sentence
straddles a chunk boundary. These defaults are a reasonable starting point;
adjust --chunk-size and --chunk-overlap for your specific corpus.
For k <= 6 retrieved chunks, 'stuff' (concatenate all context into one prompt)
produces the best answer quality because the LLM sees all evidence at once with
full attention. Gemini 1.5 Flash has a 1M-token context window, so context
length is rarely a bottleneck. Switch to map_reduce if you need to synthesize
answers from very large numbers of chunks (k > 20) where stuffing all context
would degrade quality.
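Conceptually, 'stuff' reduces to building one prompt out of all retrieved chunks. A hypothetical sketch (the project's real prompt lives in rag_pipeline.py and may be worded differently):

```python
def build_stuff_prompt(chunks: list[str], question: str) -> str:
    """'Stuff' chain: concatenate every retrieved chunk into a single QA prompt,
    numbering each chunk so the answer can cite its evidence."""
    context = "\n\n".join(f"[{i}] {chunk}" for i, chunk in enumerate(chunks, start=1))
    return (
        "Answer the question using only the context below. "
        "If the context is insufficient, say so.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"
    )

prompt = build_stuff_prompt(
    ["Revenue grew 12% in 2024.", "Hardware revenue was flat at 0.8 billion."],
    "Which segment underperformed?",
)
```

Because every chunk lands in one prompt, the LLM attends to all evidence simultaneously, which is why 'stuff' outperforms map_reduce at small k.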
Returning source documents enables citation, retrieval inspection, and evaluation. Without them, the pipeline is a black box — you cannot determine whether a wrong answer resulted from retrieval failure or generation failure.
## Evaluation

Two heuristic metrics are computed without requiring labeled ground-truth pairs.
Measures what fraction of non-trivial tokens in the question appear in the combined text of the retrieved chunks. A score of 1.0 means every keyword from the question was present in the retrieved context. Low scores indicate that the retriever is surfacing tangentially related chunks rather than topically relevant ones.
Interpretation:
- `> 0.80`: retrieval is surfacing relevant context
- `0.50–0.80`: retrieval is partially relevant; consider adjusting k or chunk size
- `< 0.50`: retrieval may be failing; check document coverage or embedding quality
Measures what fraction of non-trivial tokens in the generated answer appear in the retrieved context. High groundedness suggests the model is drawing from the documents rather than generating from parametric memory.
Limitation: this is a token-overlap proxy metric. It will undercount groundedness when the model paraphrases the source rather than quoting it directly. Rigorous faithfulness evaluation requires NLI models (SummaC, TRUE) or an LLM-as-judge setup.
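Both heuristics reduce to the same token-overlap ratio applied to different inputs. A simplified stdlib sketch (the stopword list here is illustrative; evaluator.py's actual list may differ):

```python
import re

# Illustrative stopword list; the evaluator's real list may differ.
STOPWORDS = {"the", "a", "an", "is", "are", "was", "were", "what", "how",
             "in", "of", "to", "and", "or", "for", "on", "does", "did"}

def tokens(text: str) -> set[str]:
    """Lowercased word tokens with stopwords removed ('non-trivial tokens')."""
    return {t for t in re.findall(r"[a-z0-9]+", text.lower()) if t not in STOPWORDS}

def overlap_fraction(source: str, reference: str) -> float:
    """Fraction of source's non-trivial tokens that also appear in reference."""
    src = tokens(source)
    return len(src & tokens(reference)) / len(src) if src else 0.0

question = "What were the key findings in 2024?"
context = "The 2024 report's key findings include revenue growth and new risks."
answer = "The key findings were revenue growth and emerging risks."

coverage = overlap_fraction(question, context)    # question tokens found in context
groundedness = overlap_fraction(answer, context)  # answer tokens found in context
print(coverage, round(groundedness, 3))           # 1.0 0.833
```

Here groundedness drops below 1.0 because "emerging" paraphrases "new" — exactly the paraphrase blind spot described above.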
```json
{
  "summary": {
    "total_questions": 10,
    "successful": 10,
    "failed": 0,
    "mean_retrieval_coverage": 0.823,
    "mean_groundedness": 0.746,
    "mean_latency_seconds": 1.847
  },
  "results": [
    {
      "question": "What is the return policy?",
      "answer": "Returns are accepted within 30 days...",
      "sources": ["docs/policy.pdf"],
      "retrieval_coverage_score": 0.875,
      "groundedness_score": 0.791,
      "latency_seconds": 1.623,
      "error": null
    }
  ]
}
```

## Known Limitations

Hallucination on gaps: when retrieved chunks do not contain the answer, the model sometimes generates plausible-sounding text rather than declining. The QA prompt instructs the model to decline, but this is not guaranteed. Groundedness evaluation helps identify these cases.
Single-language only: the current prompt and evaluation logic assume English. Multi-language support requires language-aware tokenization in the evaluator and potentially a multilingual embedding model.
No re-ranking: the pipeline uses raw FAISS cosine similarity scores for retrieval ordering. A cross-encoder re-ranker (e.g., Cohere Rerank, BGE-Reranker) would improve precision by re-scoring candidate chunks after retrieval.
No metadata filtering: FAISS does not support filtered search natively. If you need to restrict retrieval to specific documents, dates, or categories, migrate the vector store to ChromaDB or Weaviate, which support metadata predicates.
Index not thread-safe: loading and querying FAISS from multiple threads without locks can cause race conditions. Add a threading.Lock or move to a server-mode vector store for concurrent access.
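The lock-based workaround can be sketched as a thin wrapper. `LockedStore` and its `search`/`add` methods are hypothetical names, not part of the project:

```python
import threading

class LockedStore:
    """Serialize access to a non-thread-safe vector store with a single lock.
    (Sketch: 'store' stands in for the project's FAISS wrapper.)"""

    def __init__(self, store):
        self._store = store
        self._lock = threading.Lock()

    def search(self, query: str, k: int = 4):
        with self._lock:  # one thread at a time touches the underlying index
            return self._store.search(query, k)

    def add(self, documents):
        with self._lock:  # writes are serialized against reads
            return self._store.add(documents)
```

This trades throughput for safety; for real concurrency, a server-mode store is the better fix.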
## Extending the Project

Add a web UI: wrap `rag_pipeline.query()` with a FastAPI route and serve a simple HTML front-end or integrate with Streamlit.
Add re-ranking: insert a cross-encoder model between FAISS retrieval and LLM generation to re-score and filter the top-k candidates.

Add metadata filtering: replace FAISS with ChromaDB and pass `where={"source": "docs/policy.pdf"}` to scope retrieval to specific files.

Add hybrid search: combine dense vector retrieval (FAISS) with BM25 sparse retrieval (`rank_bm25`) and fuse scores with Reciprocal Rank Fusion (RRF) for better coverage of keyword-heavy queries.

Add LLM-as-judge evaluation: use a separate Gemini call to rate each answer for faithfulness and relevance on a 1-5 scale, replacing the token-overlap heuristics.

Support DOCX and HTML: add `Docx2txtLoader` and `BSHTMLLoader` from `langchain-community` in `document_loader.py`.
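The RRF fusion mentioned under hybrid search can be sketched in a few lines (chunk ids are hypothetical):

```python
def rrf_fuse(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Reciprocal Rank Fusion: score(d) = sum over rankings of 1 / (k + rank).
    Documents that rank well in several lists float to the top; k=60 is the
    constant from the original RRF paper."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

dense = ["c2", "c7", "c1"]   # FAISS order (hypothetical chunk ids)
sparse = ["c7", "c3", "c2"]  # BM25 order
fused = rrf_fuse([dense, sparse])
print(fused[0])  # c7: ranked highly in both lists
```

RRF needs only rank positions, not raw scores, which sidesteps the problem that cosine similarities and BM25 scores live on incompatible scales.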
```
rag-chatbot/
├── main.py              Entry point and CLI (ingest, query, chat, evaluate, stats)
├── rag_pipeline.py      Single-turn QA and multi-turn conversational chains
├── document_loader.py   Document ingestion, loading, and chunking
├── embeddings.py        Google text-embedding-004 model setup
├── vector_store.py      FAISS index build, load, update, and search
├── evaluator.py         Retrieval coverage and groundedness evaluation
├── requirements.txt     Python dependencies
├── .env.example         Environment variable template
└── README.md            This file
```
Kushagra Jaiswal — ML Engineer & Research Lead
- GitHub: https://github.com/Nerd-coderZero
- LinkedIn: https://www.linkedin.com/in/kushagra356
- Portfolio: https://nerd-coderzero.github.io