AnkitG7/financial-document-rag

HDFC Hybrid RAG System

A highly resilient, production-ready Retrieval-Augmented Generation (RAG) system built in Python. This system extracts knowledge from local HDFC documents page-by-page, builds a hybrid local search index, and streams responses to user queries using the OpenRouter API with conversational memory and strict domain-adherence parameters.


🚀 Key Production Features

  1. Hybrid Retrieval (Dense + Sparse): Combines semantic vector lookup via FAISS and lexical keyword matching via BM25 using LangChain's EnsembleRetriever (weighted 50/50). This is intended to keep retrieval accurate across structured financial tables, annual reports, and narrative policies.
  2. Local Embedding Computation: Runs BAAI/bge-small-en-v1.5 locally using Hugging Face embeddings. It automatically leverages CUDA GPUs when available or runs on a highly optimized local CPU fallback, consuming zero cloud API tokens for document ingestion.
  3. Smart Hash-Based Caching: Aggregates PDF file attributes (name, size, and last modified date) into a SHA-256 signature hash stored in faiss_index/cache_metadata.json.
    • Sub-Second Boot: Subsequent runs load FAISS and BM25 databases instantly, skipping heavy text extraction and embeddings computation.
    • Auto-Detect Changes: If a PDF is added, modified, or deleted, the system automatically detects the signature mismatch and re-indexes the project.
  4. PyMuPDF Ingestion Engine: Parses pages fast, handling complex multi-column reports and large annual publications. Features page-level exception isolation: corrupt pages are gracefully skipped without halting the indexing process.
  5. Resilient OpenRouter Client: Direct HTTP stream parsing with built-in exponential backoff retries (up to 5 attempts) to absorb transient connection drops, rate limits (429), or model timeouts.
  6. Windows UTF-8 Stream Safety: Automatically reconfigures the standard stdout/stderr streams to UTF-8 with character replacement, preventing common Windows CP1252 charmap encoding crashes when rendering rich Markdown borders, tables, or Unicode punctuation.
  7. Conversational Memory Buffer: Implements a sliding conversation window (retains the last 5 exchanges). Conversational references (e.g., pronouns) are resolved via a fast, deterministic Query Reformulation Step before the search runs.
  8. Supporting Excerpts & Confidence Scores: Each query outputs distinct sources with page numbers, exact textual supporting excerpts, and a computed confidence score mapped from FAISS cosine similarity values.
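
The hash-based caching described in feature 3 can be sketched as follows. This is a minimal illustration using only the standard library, not the project's actual code: the function names (compute_signature, index_is_stale, save_signature) and the "signature" JSON key are assumptions made for the example.

```python
import hashlib
import json
from pathlib import Path


def compute_signature(pdf_dir: Path) -> str:
    """Hash the name, size, and modification time of every PDF in pdf_dir."""
    digest = hashlib.sha256()
    # Sort so the signature does not depend on directory listing order.
    for pdf in sorted(pdf_dir.glob("*.pdf")):
        stat = pdf.stat()
        record = f"{pdf.name}|{stat.st_size}|{stat.st_mtime_ns}\n"
        digest.update(record.encode("utf-8"))
    return digest.hexdigest()


def index_is_stale(pdf_dir: Path, metadata_path: Path) -> bool:
    """Return True when no usable cached signature exists or it no longer matches."""
    if not metadata_path.exists():
        return True
    try:
        cached = json.loads(metadata_path.read_text(encoding="utf-8"))
    except (json.JSONDecodeError, OSError):
        return True  # Treat an unreadable cache file as a cache miss.
    if not isinstance(cached, dict):
        return True
    # "signature" is an assumed key name, not taken from the project.
    return cached.get("signature") != compute_signature(pdf_dir)


def save_signature(pdf_dir: Path, metadata_path: Path) -> None:
    """Persist the current signature after a successful (re)index."""
    metadata_path.parent.mkdir(parents=True, exist_ok=True)
    payload = {"signature": compute_signature(pdf_dir)}
    metadata_path.write_text(json.dumps(payload), encoding="utf-8")
```

Because the signature covers file names, sizes, and timestamps rather than file contents, adding, deleting, or modifying a PDF changes the hash without the cost of reading every document on startup.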

📂 Project Structure

C:\Users\admin\Desktop\Projects\RAG_ORBK\
├── .env                       # Active API key and LLM Model configuration
├── .env.example               # Reference template for configuration setup
├── requirements.txt           # Pinned dependencies for reproducible builds
├── app.py                     # Main CLI REPL shell & stream parser
├── config.py                  # Environment config parsing and validation
├── README.md                  # System documentation
└── src/
    ├── __init__.py
    ├── document_processor.py  # Fast PyMuPDF (fitz) page extraction
    ├── chunker.py             # Page splitter with ID metadata preservation
    ├── embeddings.py          # BGE-small-en-v1.5 local embedding provider
    ├── vector_store.py        # Local indices manager with change-detection
    ├── openrouter_client.py   # HTTP client with retries and SSE stream parser
    ├── rag_engine.py          # RAG Coordinator, Hybrid Search, Memory, and scoring
    └── utils.py               # ASCII Art logs and console theme formatting
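
The retry behaviour attributed to openrouter_client.py (exponential backoff, up to 5 attempts) might look roughly like the sketch below. It is transport-agnostic and uses only the standard library; TransientError and call_with_backoff are hypothetical names, and the base delay and jitter values are assumptions rather than the project's settings.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class TransientError(Exception):
    """Hypothetical marker for retryable failures (dropped connection, HTTP 429, timeout)."""


def call_with_backoff(
    request: Callable[[], T],
    max_attempts: int = 5,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call request(), retrying on TransientError with exponentially growing delays."""
    for attempt in range(1, max_attempts + 1):
        try:
            return request()
        except TransientError:
            if attempt == max_attempts:
                raise  # Out of attempts: surface the last error to the caller.
            # Delay doubles each attempt (base, 2x, 4x, ...) plus up to 10% jitter.
            delay = base_delay * 2 ** (attempt - 1)
            sleep(delay + random.uniform(0, 0.1 * delay))
    # Only reached when max_attempts is less than 1.
    raise ValueError("max_attempts must be at least 1")
```

Passing sleep as a parameter keeps the helper easy to test without real waiting; a caller would wrap its HTTP request in a function that raises TransientError for retryable conditions.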

🛠️ Prerequisites & Setup

1. Requirements

Ensure you have Python 3.11+ installed on your system.

2. Environment Configuration

Verify that the .env file in the project root contains your OpenRouter API key and the model to use. A template is available in .env.example.

OPENROUTER_API_KEY=sk-or-v1-...
OPENROUTER_MODEL=google/gemini-2.5-flash:free
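
A validation step of the kind config.py is described as performing could be sketched like this. The Settings class and load_settings function are hypothetical names, and the sketch assumes the two variables above have already been loaded into the process environment.

```python
import os
from dataclasses import dataclass
from typing import Mapping


@dataclass(frozen=True)
class Settings:
    """Validated configuration values (hypothetical container for this example)."""
    api_key: str
    model: str


def load_settings(env: Mapping[str, str] = os.environ) -> Settings:
    """Read and validate the two variables shown in the .env example."""
    api_key = env.get("OPENROUTER_API_KEY", "").strip()
    model = env.get("OPENROUTER_MODEL", "").strip()
    if not api_key:
        raise ValueError("OPENROUTER_API_KEY is not set; see .env.example.")
    if not model:
        raise ValueError("OPENROUTER_MODEL is not set; see .env.example.")
    return Settings(api_key=api_key, model=model)
```

Failing fast at startup with a clear message is generally easier to debug than an authentication error surfacing on the first query.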

3. Installation

Open a terminal in the project directory C:\Users\admin\Desktop\Projects\RAG_ORBK and run:

# Create a virtual environment
python -m venv venv

# Activate the virtual environment
.\venv\Scripts\activate

# Install all pinned dependencies
pip install -r requirements.txt

🏃 Run Instructions

Start the interactive terminal shell using the active virtual environment:

python app.py

Command Line Flags:

  • Force Rebuild: Forces the system to re-extract text and compile a fresh index, even if no files have changed:
    python app.py --rebuild
  • Retrieval Debug Mode: Enables detailed logging. Before answering, the console prints the standalone reformulated query, similarity scores, and matching excerpts:
    python app.py --debug
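
For reference, the two flags above could be declared with the standard argparse module as in the sketch below. Only the flag names come from this README; the build_parser helper and the help strings are assumptions.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Declare the --rebuild and --debug switches described above."""
    parser = argparse.ArgumentParser(prog="app.py")
    parser.add_argument(
        "--rebuild",
        action="store_true",
        help="Re-extract text and build a fresh index even if no files changed.",
    )
    parser.add_argument(
        "--debug",
        action="store_true",
        help="Print reformulated queries, similarity scores, and excerpts.",
    )
    return parser


if __name__ == "__main__":
    args = build_parser().parse_args()
    print(f"rebuild={args.rebuild} debug={args.debug}")
```

Both flags are independent booleans that default to False, so they can be combined in a single invocation.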

💬 CLI REPL Commands

Once inside the interactive chat loop, the following system commands are available:

  • /clear - Wipes the conversational memory buffer clean (starts a fresh dialogue thread).
  • /debug - Toggles the Retrieval Debugging Mode on or off in real time.
  • /help - Prints a reference sheet of all CLI options.
  • exit or quit - Safely exits the application.
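
The commands above, together with the five-exchange sliding memory window, can be illustrated with a small dispatcher. This is a sketch under assumptions: ChatSession, remember, and handle are hypothetical names, and the returned status strings exist only for this example.

```python
from collections import deque
from typing import Deque, Tuple


class ChatSession:
    """Tracks recent exchanges and the debug toggle for a REPL loop."""

    def __init__(self, max_exchanges: int = 5) -> None:
        # A bounded deque drops the oldest exchange once the window is full.
        self.history: Deque[Tuple[str, str]] = deque(maxlen=max_exchanges)
        self.debug = False

    def remember(self, question: str, answer: str) -> None:
        """Store one question/answer exchange in the sliding window."""
        self.history.append((question, answer))

    def handle(self, line: str) -> str:
        """Classify one input line and apply any command side effects."""
        command = line.strip().lower()
        if command in {"exit", "quit"}:
            return "exit"
        if command == "/clear":
            self.history.clear()
            return "cleared"
        if command == "/debug":
            self.debug = not self.debug
            return "debug on" if self.debug else "debug off"
        if command == "/help":
            return "help"
        return "query"  # Anything else is treated as a user question.
```

The REPL loop would call handle() on each input line, stop on "exit", and route "query" lines to retrieval before recording the result with remember().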
