
Domain-Adaptive RAG Chatbot

A production-oriented Retrieval-Augmented Generation (RAG) system built with LangChain, FAISS, and the Gemini API. Point it at any collection of documents (.pdf, .txt, .md) and ask questions in natural language — single-shot or as a multi-turn conversation.

Stack: Python 3.10+ · LangChain 0.3 · FAISS · Google text-embedding-004 · Gemini 1.5 Flash


Table of Contents

  1. Features
  2. Architecture
  3. Prerequisites
  4. Installation
  5. Configuration
  6. Usage
  7. Design Decisions
  8. Evaluation
  9. Known Limitations
  10. Extending the Project

Features

  • Ingest PDF, plain text, and Markdown documents from a file or directory tree.
  • Chunk documents with overlap using RecursiveCharacterTextSplitter for semantically coherent segments.
  • Embed chunks with Google's text-embedding-004 model (768-dim vectors).
  • Store and query vectors locally with FAISS (no cloud dependency for retrieval).
  • Single-turn QA with source attribution.
  • Multi-turn conversational chat with automatic question condensation.
  • Incremental index updates without rebuilding from scratch.
  • Retrieval quality evaluation with two heuristic metrics (coverage + groundedness).
  • JSON evaluation reports for offline analysis.

Architecture

User Question
      |
      v
 [main.py CLI]
      |
      +-- ingest --> [document_loader.py] --> raw chunks
      |                                           |
      |                                    [embeddings.py]
      |                                           |
      |                                    [vector_store.py]
      |                                           |
      |                                    FAISS index (disk)
      |
      +-- query/chat --> [rag_pipeline.py]
                               |
                     +---------+---------+
                     |                   |
              [vector_store.py]    [embeddings.py]
                     |
              FAISS similarity search
                     |
              top-k chunks (context)
                     |
              [Gemini LLM via LangChain]
                     |
              Answer + source attribution

Data flow — ingestion

  1. document_loader.ingest() scans the source path.
  2. Files are loaded with type-appropriate loaders (PyPDFLoader for PDF, TextLoader for .txt/.md).
  3. RecursiveCharacterTextSplitter splits documents into overlapping chunks (default 1000 chars, 200-char overlap).
  4. FAISS.from_documents() embeds every chunk with text-embedding-004 and stores vectors in a flat L2 index.
  5. The index is serialized to faiss_index/ (.faiss + .pkl).
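Step 3 can be illustrated with a minimal fixed-size splitter with overlap. This is a simplified stand-in for RecursiveCharacterTextSplitter, which additionally prefers to break on paragraph and sentence boundaries rather than at a hard character count:

```python
def chunk_text(text: str, chunk_size: int = 1000, chunk_overlap: int = 200) -> list[str]:
    """Split text into fixed-size chunks, where each chunk repeats the last
    `chunk_overlap` characters of the previous one so no sentence straddling
    a boundary is lost from both chunks."""
    if chunk_overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break
    return chunks
```

With the defaults, a 2,500-character document yields three chunks, and the first 200 characters of each chunk repeat the last 200 of the previous one.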

Data flow — query

  1. The question is embedded with text-embedding-004 (retrieval_query task type).
  2. FAISS returns the top-k chunks by L2 distance (for unit-normalized vectors, this ranks results identically to cosine similarity).
  3. Retrieved chunks are concatenated into the QA prompt.
  4. Gemini generates an answer grounded in the retrieved context.
  5. Source filenames and page numbers are returned alongside the answer.
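The equivalence claimed in step 2 follows from the identity ||q − d||² = 2 − 2(q · d) for unit vectors: smaller L2 distance means larger dot product. A small self-contained check in plain Python (no FAISS required; the example vectors are arbitrary):

```python
import math

def normalize(v):
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]

def l2(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def dot(a, b):
    # For unit vectors the dot product equals cosine similarity.
    return sum(x * y for x, y in zip(a, b))

query = normalize([0.2, 0.9, 0.1])
docs = [normalize(v) for v in ([0.1, 1.0, 0.0], [0.9, 0.1, 0.3], [0.5, 0.5, 0.5])]

# Rank by ascending L2 distance vs. descending cosine similarity.
by_l2 = sorted(range(len(docs)), key=lambda i: l2(query, docs[i]))
by_cos = sorted(range(len(docs)), key=lambda i: -dot(query, docs[i]))
```

Both orderings come out identical, which is why a flat L2 index is sufficient here.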

Data flow — conversational chat

Adds one step before retrieval: a second Gemini call rewrites the follow-up question using conversation history into a fully standalone query. This prevents the retriever from receiving vague pronoun references that would return irrelevant chunks.
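The condensation step is essentially prompt construction. A minimal sketch of the idea — the exact prompt wording and history format in rag_pipeline.py may differ:

```python
def build_condense_prompt(history: list[tuple[str, str]], follow_up: str) -> str:
    """Format conversation history plus a follow-up question into a prompt
    asking the LLM to rewrite it as a standalone, self-contained question."""
    transcript = "\n".join(f"Human: {q}\nAssistant: {a}" for q, a in history)
    return (
        "Given the conversation below, rewrite the follow-up question as a "
        "standalone question that can be understood without the history.\n\n"
        f"{transcript}\n\nFollow-up question: {follow_up}\nStandalone question:"
    )
```

Given the history ("What was revenue in 2024?", "Revenue was 4.2 billion...") and the follow-up "Which segment underperformed?", the LLM would be expected to return something like "Which business segment underperformed in 2024?" — a query the retriever can embed meaningfully.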


Prerequisites

  • Python 3.10+
  • A Gemini API key (the free tier from Google AI Studio is sufficient to start)

Installation

git clone https://github.com/Nerd-coderZero/rag-chatbot.git
cd rag-chatbot

python -m venv .venv
source .venv/bin/activate        # Linux / macOS
# .venv\Scripts\activate       # Windows

pip install -r requirements.txt

Configuration

cp .env.example .env

Edit .env:

GEMINI_API_KEY=your_key_here
GEMINI_MODEL=gemini-1.5-flash

The .env file is loaded automatically. Never commit it — it is in .gitignore.


Usage

python main.py [--index-path PATH] <command> [options]

--index-path (default: faiss_index) is global across all subcommands.


1. Ingest documents

python main.py ingest --source docs/
python main.py ingest --source paper.pdf
python main.py ingest --source docs/ --chunk-size 800 --chunk-overlap 150
python main.py ingest --source docs/ --no-recursive
python main.py --index-path product_docs ingest --source manuals/

Supported: .pdf, .txt, .md


2. Single-turn query

python main.py query "What are the main findings?"
python main.py query "Summarise the methodology" --show-sources
python main.py query "What datasets were used?" --show-sources --verbose

Example output:

Answer:
The report identifies three primary risk factors: regulatory uncertainty,
supply chain fragility, and talent shortages. Mitigation strategies are
detailed in Section 4.2.

Sources:
  [1] docs/annual_report_2024.pdf (page 7)
  [2] docs/executive_brief.md

3. Interactive chat

python main.py chat
python main.py chat --show-sources
python main.py chat --k 6

Example session:

RAG Chatbot ready. Type 'exit' or press Ctrl+C to quit.

You: What was the company revenue for 2024?
Assistant: Based on the financial statements, total revenue for 2024 was
4.2 billion, representing 12% year-over-year growth driven by cloud services.

You: Which segment underperformed?
Assistant: The hardware division underperformed expectations, with revenue
flat at 0.8 billion against a projected 1.1 billion, as noted in Section 3.1.

You: exit
Session ended.

4. Evaluate retrieval quality

# Interactive input
python main.py evaluate

# Questions from file
python main.py evaluate --questions-file tests/questions.txt

# Save full JSON report
python main.py evaluate --questions-file tests/questions.txt --output eval_report.json

tests/questions.txt format — one question per line:

What is the executive summary?
What were the key findings in 2024?
How does the methodology differ from prior work?

Example console output:

==================================================
Evaluation Summary
==================================================
  Total questions         : 10
  Successful              : 10
  Failed                  : 0
  Mean retrieval coverage : 82.3%
  Mean groundedness       : 74.6%
  Mean latency            : 1.847s / question
==================================================

5. Index statistics

python main.py stats

Output:

Index path   : /home/user/rag-chatbot/faiss_index
Vectors      : 1,842

Design Decisions

Why FAISS instead of ChromaDB or Pinecone?

FAISS runs entirely in-process with no server, no Docker container, and no network calls. For a single-user RAG system over documents up to ~100k chunks, a flat FAISS index queries in under 5ms. ChromaDB adds value when you need metadata filtering, multi-user access, or a persistent HTTP API. Pinecone is appropriate for production systems that need managed infrastructure and sub-millisecond search at millions of vectors. Neither is necessary at this scale.

Why text-embedding-004 instead of OpenAI or sentence-transformers?

text-embedding-004 produces 768-dimensional vectors with competitive retrieval quality on MTEB benchmarks. It is Google's current recommended embedding model, pairs naturally with the Gemini LLM (same API key, same SDK), and its free tier is sufficient for most project-scale corpora. Sentence-transformers would require local GPU or slower CPU inference. OpenAI embeddings would introduce a second vendor and a second billing relationship.

Why separate task types for documents vs. queries?

text-embedding-004 accepts a task_type parameter: retrieval_document for indexing and retrieval_query for search. Google trains these to produce vectors optimized for asymmetric retrieval — the document and query live in slightly different sub-spaces that maximize recall. This setting is exposed in embeddings.py and used transparently by LangChain's retriever.

Why chunk_size=1000, chunk_overlap=200?

1000 characters is roughly 150-200 words — enough context for the LLM to produce a coherent answer from a single chunk, and small enough that most questions will retrieve topically focused content rather than a broad summary. 200-character overlap (20%) prevents answers from being missed because a relevant sentence straddles a chunk boundary. These defaults are a reasonable starting point; adjust --chunk-size and --chunk-overlap for your specific corpus.

Why the 'stuff' chain type instead of map-reduce or refine?

For k <= 6 retrieved chunks, 'stuff' (concatenate all context into one prompt) produces the best answer quality because the LLM sees all evidence at once with full attention. Gemini 1.5 Flash has a 1M-token context window, so context length is rarely a bottleneck. Switch to map_reduce if you need to synthesize answers from very large numbers of chunks (k > 20) where stuffing all context would degrade quality.
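Schematically, "stuffing" is nothing more than concatenating every retrieved chunk into one prompt. A sketch of the idea (not the exact template LangChain's chain uses internally):

```python
def build_stuff_prompt(question: str, chunks: list[str]) -> str:
    """Concatenate all retrieved chunks into a single QA prompt
    (the 'stuff' strategy), with numbered context for citation."""
    context = "\n\n".join(f"[{i + 1}] {c}" for i, c in enumerate(chunks))
    return (
        "Answer the question using only the context below. "
        "If the context does not contain the answer, say so.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"
    )
```

Because every chunk is visible in one forward pass, the model can attend across all evidence at once — the reason 'stuff' beats map_reduce at small k.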

Why return_source_documents=True?

Returning source documents enables citation, retrieval inspection, and evaluation. Without them, the pipeline is a black box — you cannot determine whether a wrong answer resulted from retrieval failure or generation failure.


Evaluation

Two heuristic metrics are computed without requiring labeled ground-truth pairs.

Retrieval Coverage Score

Measures what fraction of non-trivial tokens in the question appear in the combined text of the retrieved chunks. A score of 1.0 means every keyword from the question was present in the retrieved context. Low scores indicate that the retriever is surfacing tangentially related chunks rather than topically relevant ones.

Interpretation:

  • > 0.80: retrieval is surfacing relevant context
  • 0.50–0.80: retrieval is partially relevant; consider adjusting k or chunk size
  • < 0.50: retrieval may be failing; check document coverage or embedding quality
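A token-overlap coverage score of this kind fits in a few lines. This is a sketch; the stopword list and tokenizer in evaluator.py may differ:

```python
import re

STOPWORDS = {"the", "a", "an", "is", "are", "was", "were", "what", "which",
             "how", "in", "of", "for", "to", "and", "or", "does", "do"}

def content_tokens(text: str) -> set[str]:
    """Lowercase alphanumeric tokens with common stopwords removed."""
    return {t for t in re.findall(r"[a-z0-9]+", text.lower()) if t not in STOPWORDS}

def retrieval_coverage(question: str, retrieved_chunks: list[str]) -> float:
    """Fraction of non-trivial question tokens present in the retrieved text."""
    q = content_tokens(question)
    if not q:
        return 0.0
    context = content_tokens(" ".join(retrieved_chunks))
    return len(q & context) / len(q)
```

For "What were the key findings in 2024?" against a chunk containing "key findings for 2024", every surviving question token appears in context, so the score is 1.0.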

Answer Groundedness Score

Measures what fraction of non-trivial tokens in the generated answer appear in the retrieved context. High groundedness suggests the model is drawing from the documents rather than generating from parametric memory.

Limitation: this is a token-overlap proxy metric. It will undercount groundedness when the model paraphrases the source rather than quoting it directly. Rigorous faithfulness evaluation requires NLI models (SummaC, TRUE) or an LLM-as-judge setup.
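Groundedness is the same overlap computation with the roles swapped: answer tokens checked against the retrieved context. Again a sketch of the idea, not the exact code in evaluator.py:

```python
import re

def answer_tokens(text: str) -> set[str]:
    stop = {"the", "a", "an", "is", "are", "was", "were", "in", "of", "and", "to"}
    return {t for t in re.findall(r"[a-z0-9]+", text.lower()) if t not in stop}

def groundedness(answer: str, retrieved_chunks: list[str]) -> float:
    """Fraction of non-trivial answer tokens found in the retrieved context.
    Undercounts when the model paraphrases instead of quoting."""
    a = answer_tokens(answer)
    if not a:
        return 0.0
    context = answer_tokens(" ".join(retrieved_chunks))
    return len(a & context) / len(a)
```

An answer quoting "revenue was 4.2 billion" against a chunk containing those figures scores 1.0; a faithful paraphrase using different vocabulary scores lower — the limitation noted above.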

JSON report structure

{
  "summary": {
    "total_questions": 10,
    "successful": 10,
    "failed": 0,
    "mean_retrieval_coverage": 0.823,
    "mean_groundedness": 0.746,
    "mean_latency_seconds": 1.847
  },
  "results": [
    {
      "question": "What is the return policy?",
      "answer": "Returns are accepted within 30 days...",
      "sources": ["docs/policy.pdf"],
      "retrieval_coverage_score": 0.875,
      "groundedness_score": 0.791,
      "latency_seconds": 1.623,
      "error": null
    }
  ]
}

Known Limitations

Hallucination on gaps: when retrieved chunks do not contain the answer, the model sometimes generates plausible-sounding text rather than declining. The QA prompt instructs the model to decline, but this is not guaranteed. Groundedness evaluation helps identify these cases.

Single-language only: the current prompt and evaluation logic assume English. Multi-language support requires language-aware tokenization in the evaluator and potentially a multilingual embedding model.

No re-ranking: the pipeline orders candidates by raw FAISS L2 distance alone. A cross-encoder re-ranker (e.g., Cohere Rerank, BGE-Reranker) would improve precision by re-scoring candidate chunks after retrieval.

No metadata filtering: FAISS does not support filtered search natively. If you need to restrict retrieval to specific documents, dates, or categories, migrate the vector store to ChromaDB or Weaviate, which support metadata predicates.

Index not thread-safe: loading and querying FAISS from multiple threads without locks can cause race conditions. Add a threading.Lock or move to a server-mode vector store for concurrent access.


Extending the Project

Add a web UI: wrap rag_pipeline.query() with a FastAPI route and serve a simple HTML front-end or integrate with Streamlit.

Add re-ranking: insert a cross-encoder model between FAISS retrieval and LLM generation to re-score and filter the top-k candidates.

Add metadata filtering: replace FAISS with ChromaDB and pass where={"source": "docs/policy.pdf"} to scope retrieval to specific files.

Add hybrid search: combine dense vector retrieval (FAISS) with BM25 sparse retrieval (rank_bm25) and fuse scores with Reciprocal Rank Fusion (RRF) for better coverage of keyword-heavy queries.
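RRF needs only the two ranked ID lists — no score calibration between the dense and sparse retrievers. A minimal sketch with the conventional smoothing constant k = 60:

```python
def rrf_fuse(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse multiple ranked result lists with Reciprocal Rank Fusion.
    Each document scores sum(1 / (k + rank)) over the lists it appears in,
    so agreement between retrievers outweighs a single high rank."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)
```

Fusing a dense ranking ["d1", "d2", "d3"] with a sparse ranking ["d3", "d1", "d4"] promotes d1 and d3 (found by both retrievers) above d2 and d4 (found by only one).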

Add LLM-as-judge evaluation: use a separate Gemini call to rate each answer for faithfulness and relevance on a 1-5 scale, replacing the token-overlap heuristics.

Support DOCX and HTML: add Docx2txtLoader and BSHTMLLoader from langchain-community in document_loader.py.


Project Structure

rag-chatbot/
├── main.py             Entry point and CLI (ingest, query, chat, evaluate, stats)
├── rag_pipeline.py     Single-turn QA and multi-turn conversational chains
├── document_loader.py  Document ingestion, loading, and chunking
├── embeddings.py       Google text-embedding-004 model setup
├── vector_store.py     FAISS index build, load, update, and search
├── evaluator.py        Retrieval coverage and groundedness evaluation
├── requirements.txt    Python dependencies
├── .env.example        Environment variable template
└── README.md           This file

Author

Kushagra Jaiswal — ML Engineer & Research Lead
