<a href="https://colab.research.google.com/github/LashawnFofung/RAG-Pipelines/blob/main/src/Optimized_RAG_Pipeline_with_Simple_Chatbot_for_Document_Retrieval.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **📑 Build And Optimize A RAG Pipeline For Document Retrieval**

**Data:** *LenderFeesWorksheetNew.pdf*
<br><br>


This Colab notebook sets up a Retrieval-Augmented Generation (RAG) system designed to answer questions about the uploaded LenderFeesWorksheetNew.pdf.
<br><br>

It works by first using PyMuPDF to extract (parse) the text from the PDF. This text is then split into smaller chunks (✂️) and converted into numerical vectors (🔢) using the `all-MiniLM-L6-v2` HuggingFace embedding model. These vectors are stored in a vector index. When you ask a question, the system searches the index to retrieve (🔍) the most relevant text chunks (vector retrieval) and passes both the question and those chunks to the Gemini 2.5 Flash LLM. The LLM then reads the retrieved context and generates an answer, such as the total monthly payment or the amount of a specific fee.
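
The last step, passing the question together with the retrieved chunks to the LLM, can be sketched as plain string assembly. This is a minimal illustration, not the prompt template LlamaIndex uses internally: the function name `build_augmented_prompt`, the prompt wording, and the example chunk strings are all assumptions made for this sketch.

```python
from typing import List


def build_augmented_prompt(question: str, retrieved_chunks: List[str]) -> str:
    """Combine retrieved chunks and the user question into one prompt string."""
    # Number each chunk so the chunks stay distinguishable inside the prompt.
    context = "\n\n".join(
        f"[Chunk {i}]\n{chunk.strip()}"
        for i, chunk in enumerate(retrieved_chunks, start=1)
    )
    return (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )


# Hypothetical chunk texts, standing in for what the retriever would return.
example_chunks = [
    "Principal & Interest 1,869.37",
    "Hazard Insurance 39.58  Real Estate Taxes 400.00",
]
prompt = build_augmented_prompt(
    "What is the total estimated monthly payment?", example_chunks
)
print(prompt)
```

The LLM only sees what is in this string, so the quality of the answer depends on whether retrieval surfaced the right chunks.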

<br><br>


**Table of Contents**
* [1. RAG Pipeline](#scrollTo=Bfr8FNjnSvW8&line=1&uniqifier=1)
  * [1.1 Installation](#scrollTo=CVLftbBrTtJD&line=1&uniqifier=1)
  * [1.2 Setup Environment and Imports](#scrollTo=A-7Tha4LT_Wp&line=1&uniqifier=1)
  * [1.3 API Key Setup](#scrollTo=p43QyjfpUZ-k&line=1&uniqifier=1)
  * [1.4 Document Upload](#scrollTo=zF41iRd_UmH7&line=1&uniqifier=1)
  * [1.5 Custom PyMuPDF Loader Function (PDF Parsing)](#scrollTo=bh0AOQB7nvmp&line=1&uniqifier=1)
  * [1.6 Configure RAG Pipeline (LLM, Embeddings, Chunking)](#scrollTo=YqYMeoCroEGh&line=1&uniqifier=1)
  * [1.7 Indexing and Index Creation](#scrollTo=7F7-so82oU_z&line=1&uniqifier=1)
* [2. Reasons for Methods: Embedding, Chunking Strategy, & Retrieval Method](#scrollTo=2lo9ffUJSzJE&line=1&uniqifier=1)
* [3. Simple Chatbot](#scrollTo=KZt_12xPS5hx)
  * [3.1 Create Chat Engine](#scrollTo=B_9DrGkNxFQr&line=1&uniqifier=1)
  * [3.2 Interactive Chat Loop](#scrollTo=zxAruhrJyeHa&line=1&uniqifier=1)
  * [3.3 Run the Chatbot](#scrollTo=wIfn7bGazZ7k&line=1&uniqifier=1)
  

# **1. RAG Pipeline**

## 1.1 Installation

In [1]:
# Install LlamaIndex core, Gemini LLM connector, PyMuPDF, and HuggingFace Embedding integration
!pip install -q llama-index llama-index-llms-gemini pymupdf llama-index-embeddings-huggingface nest_asyncio sentence-transformers

## 1.2 Setup Environment and Imports

In [2]:
import os
import fitz # PyMuPDF
from google.colab import files, userdata
from llama_index.core import VectorStoreIndex, Document, Settings
from llama_index.llms.gemini import Gemini
from llama_index.embeddings.huggingface import HuggingFaceEmbedding
from llama_index.core.node_parser import SentenceSplitter
from typing import List
import nest_asyncio

# Fix event loop conflicts in Colab
nest_asyncio.apply()


## 1.3 API Key Setup

In [3]:
# Gemini API Key setup in Colab secret
try:
    API_KEY = userdata.get('GEMINI_API_KEY')
    if not API_KEY:
        raise ValueError("GEMINI_API_KEY not found in Colab Secrets. Please set it.")
    # Set the official environment variable name required by the Google GenAI SDK
    os.environ["GOOGLE_API_KEY"] = API_KEY
    print("✅ API Key successfully loaded and set as GOOGLE_API_KEY.")
except Exception as e:
    print(f"⚠️ Warning: Could not load API Key from Colab Secrets: {e}")
    print("Please ensure your API Key is set as a Colab secret named 'GEMINI_API_KEY' or set the environment variable manually.")
    # Fallback/Manual setting (Uncomment and replace if Colab Secrets is not used)
    # os.environ["GOOGLE_API_KEY"] = "YOUR_MANUAL_API_KEY"

✅ API Key successfully loaded and set as GOOGLE_API_KEY.


## 1.4 Document Upload

In [4]:
print("\n--- Uploading Document: 'LenderFeesWorksheetNew.pdf' ---")
uploaded = files.upload()
pdf_path = None
if uploaded:
    pdf_path = list(uploaded.keys())[0]
    print(f"Successfully uploaded: {pdf_path}")
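else:
    # Raise instead of calling exit(), which can terminate the notebook kernel.
    raise FileNotFoundError("No file was uploaded. Re-run this cell and select a PDF.")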



--- Uploading Document: 'LenderFeesWorksheetNew.pdf' ---


Saving LenderFeesWorksheetNew.pdf to LenderFeesWorksheetNew (1).pdf
Successfully uploaded: LenderFeesWorksheetNew (1).pdf


## 1.5 Custom PyMuPDF Loader Function (PDF Parsing)

In [5]:
def load_pdf_with_pymupdf(pdf_path: str) -> List[Document]:
    """Load a PDF and convert it to LlamaIndex Document format using PyMuPDF."""
    documents = []
    file_name = os.path.basename(pdf_path)
    # The context manager closes the PDF even if text extraction raises.
    with fitz.open(pdf_path) as doc:
        for i, page in enumerate(doc):
            text = page.get_text()
            if not text.strip():
                continue  # Skip empty pages
            documents.append(
                Document(text=text, metadata={"file_name": file_name, "page_number": i + 1})
            )
    print(f"Processed {pdf_path}: Extracted {len(documents)} pages with content.")
    return documents

## 1.6 Configure RAG Pipeline (LLM, Embeddings, Chunking)

In [6]:
print("\n--- Configuring LlamaIndex Settings ---")
# LLM: Gemini 2.5 Flash
llm = Gemini(model="models/gemini-2.5-flash")
Settings.llm = llm

# Embedding Model: HuggingFace all-MiniLM-L6-v2 (efficient and runs locally)
# The model is public, so no Hugging Face token is needed; the HF_TOKEN notice in the output is informational.
embed_model = HuggingFaceEmbedding(model_name="sentence-transformers/all-MiniLM-L6-v2")
Settings.embed_model = embed_model

# Chunking Strategy: SentenceSplitter with 512-token chunks and a 20-token overlap
Settings.text_splitter = SentenceSplitter(chunk_size=512, chunk_overlap=20)
Settings.chunk_size = 512
Settings.chunk_overlap = 20


--- Configuring LlamaIndex Settings ---


  llm = Gemini(model="models/gemini-2.5-flash")
The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


## 1.7 Indexing and Index Creation

In [7]:
documents = load_pdf_with_pymupdf(pdf_path)
index = VectorStoreIndex.from_documents(documents)
print("✅ Indexing complete. RAG VectorStoreIndex created.")

Processed LenderFeesWorksheetNew (1).pdf: Extracted 1 pages with content.
✅ Indexing complete. RAG VectorStoreIndex created.


# **2. Reasons for Methods: Embedding, Chunking Strategy, & Retrieval Method**

## **🔢 Embedding Model**

**Model:** `sentence-transformers/all-MiniLM-L6-v2`
<br><br>

**Justification:** This is a small, fast open-source embedding model (roughly 22M parameters, 384-dimensional output vectors). It offers a good balance of accuracy and speed, which suits rapid RAG development in a Colab environment. Running the HuggingFace model locally also avoids the per-call API costs and network latency that a remote embedding service would add.
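
To make the size of the resulting index concrete, here is a back-of-the-envelope estimate of raw vector storage. It is only a rough sketch: it assumes vectors are stored as 32-bit floats and ignores metadata and chunk text, and the helper name `index_size_bytes` is made up for this example. The 384 dimensions are the output size of `all-MiniLM-L6-v2`.

```python
EMBEDDING_DIM = 384      # output dimension of all-MiniLM-L6-v2
BYTES_PER_FLOAT32 = 4    # assumption: vectors are stored as float32


def index_size_bytes(num_chunks: int, dim: int = EMBEDDING_DIM) -> int:
    """Approximate raw storage needed for `num_chunks` embedding vectors."""
    return num_chunks * dim * BYTES_PER_FLOAT32


for n in (1, 1_000, 100_000):
    print(f"{n:>7} chunks -> {index_size_bytes(n) / 1024:.1f} KiB")
```

For a single-page worksheet the index is tiny, so an in-memory vector store is more than sufficient.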


---



## **✂️ Chunking Strategy**

**Strategy:** `SentenceSplitter` with `chunk_size = 512` and `chunk_overlap = 20`.
<br><br>

**Justification:** The `SentenceSplitter` breaks text primarily at logical sentence boundaries, which is ideal for a semi-structured document like a financial worksheet.

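  * **Chunk Size (512 tokens):** Provides enough context for the LLM to perform calculations or detailed analysis from the retrieved text. Note that `all-MiniLM-L6-v2` truncates inputs longer than 256 word pieces by default, so the tail of a long chunk may not be reflected in its embedding.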
  
  * **Overlap (20 tokens):** A small overlap helps preserve context across split points, reducing the chance that related information is separated into different chunks.
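
The idea of sentence-aware chunking with overlap can be sketched in a few lines. This is a simplified illustration, not LlamaIndex's `SentenceSplitter` implementation: it counts whitespace-separated words instead of tokenizer tokens, splits sentences with a naive regex, and the function names and sample text are hypothetical.

```python
import re
from typing import List


def split_sentences(text: str) -> List[str]:
    """Naive sentence split: break after '.', '!' or '?' followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def chunk_sentences(text: str, chunk_size: int, chunk_overlap: int) -> List[str]:
    """Pack whole sentences into chunks of at most `chunk_size` words.

    Trailing sentences totalling at most `chunk_overlap` words are repeated
    at the start of the next chunk. A single sentence longer than
    `chunk_size` is kept whole rather than cut.
    """
    chunks: List[str] = []
    current: List[str] = []  # sentences in the chunk being built
    current_len = 0          # word count of `current`
    for sentence in split_sentences(text):
        n = len(sentence.split())
        if current and current_len + n > chunk_size:
            chunks.append(" ".join(current))
            # Carry over whole trailing sentences that fit the overlap budget.
            carried: List[str] = []
            carried_len = 0
            for prev in reversed(current):
                prev_len = len(prev.split())
                if carried_len + prev_len > chunk_overlap:
                    break
                carried.insert(0, prev)
                carried_len += prev_len
            # Drop the overlap if it would push the new chunk over the limit.
            if carried_len + n > chunk_size:
                carried, carried_len = [], 0
            current, current_len = carried, carried_len
        current.append(sentence)
        current_len += n
    if current:
        chunks.append(" ".join(current))
    return chunks


# Hypothetical sample text; small limits make the overlap easy to see.
sample = (
    "Loan amount is listed first. Fees follow in section A. "
    "Taxes are in section B. Totals appear last."
)
chunks = chunk_sentences(sample, chunk_size=10, chunk_overlap=5)
for i, chunk in enumerate(chunks, start=1):
    print(i, chunk)
```

Each chunk after the first starts with the last sentence of the previous one, which is how the overlap keeps neighbouring context together.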


---



## **🔍 Retrieval Method**

**Method:** **Vector Retrieval** (Semantic Search)
<br><br>

**Justification:** Vector retrieval finds relevant document segments based on the semantic meaning of the query rather than exact keyword matches. This matters for documents that use specific financial terms (e.g., "lender's title insurance") that a user might describe with a different phrase (e.g., "fee to protect the lender's interest").
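
A minimal sketch of top-k vector retrieval using cosine similarity follows. The three-dimensional vectors are hand-made stand-ins for real embeddings (which would come from the embedding model), and `cosine_similarity`, `top_k`, and `toy_index` are names chosen for this illustration, not part of the LlamaIndex API.

```python
import math
from typing import List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(
    query_vec: Sequence[float],
    indexed: List[Tuple[Sequence[float], str]],
    k: int = 3,
) -> List[Tuple[float, str]]:
    """Return the k (score, text) pairs most similar to the query vector."""
    scored = [(cosine_similarity(query_vec, vec), text) for vec, text in indexed]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)[:k]


# Hypothetical toy "embeddings", invented for illustration only.
toy_index = [
    ([0.9, 0.1, 0.0], "Lender's title insurance"),
    ([0.1, 0.9, 0.0], "Real estate taxes"),
    ([0.0, 0.2, 0.9], "Hazard insurance"),
]
# Stand-in vector for the query "fee to protect the lender's interest".
query = [0.8, 0.2, 0.1]
results = top_k(query, toy_index, k=2)
for score, text in results:
    print(f"{score:.3f}  {text}")
```

The query shares no keywords with "Lender's title insurance", yet it ranks first because its vector points in nearly the same direction.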


---



# **3. Simple Chatbot**

## 3.1 Create Chat Engine

In [8]:
# Use the ChatEngine to maintain conversation history while retrieving context
# from your index for each turn.
chat_engine = index.as_chat_engine(
    chat_mode="condense_plus_context", # A common mode for RAG chat
    similarity_top_k=3
)
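
Conceptually, a condense-plus-context chat turn does three things: it rewrites the follow-up message as a standalone question using the chat history, retrieves context for that question, and then generates an answer from the message, the context, and the history. The sketch below illustrates that flow with stub functions in place of the LLM and retriever. It is an assumption-laden outline, not LlamaIndex's implementation, and every name in it (`chat_turn`, `fake_condense`, `fake_retrieve`, `fake_generate`) is hypothetical.

```python
from typing import Callable, List, Tuple

History = List[Tuple[str, str]]  # (role, message) pairs


def chat_turn(
    user_message: str,
    history: History,
    condense: Callable[[History, str], str],
    retrieve: Callable[[str], List[str]],
    generate: Callable[[str, List[str], History], str],
) -> str:
    """Run one chat turn: condense, retrieve, generate, then record the turn."""
    # 1. Rewrite a follow-up as a standalone question (skipped on the first turn).
    standalone = condense(history, user_message) if history else user_message
    # 2. Retrieve context for the standalone question.
    context = retrieve(standalone)
    # 3. Generate the answer from the message, context, and prior history.
    answer = generate(user_message, context, history)
    history.append(("user", user_message))
    history.append(("assistant", answer))
    return answer


seen_queries: List[str] = []  # records what the retriever was asked


def fake_condense(history: History, message: str) -> str:
    # Stub: a real engine would ask the LLM to rewrite the question.
    last_user = [m for role, m in history if role == "user"][-1]
    return f"{message} (regarding: {last_user})"


def fake_retrieve(question: str) -> List[str]:
    seen_queries.append(question)
    return [f"chunk relevant to: {question}"]


def fake_generate(message: str, context: List[str], history: History) -> str:
    return f"Answer to '{message}' using {len(context)} chunk(s)"


history: History = []
first = chat_turn("What is the hazard insurance?", history,
                  fake_condense, fake_retrieve, fake_generate)
second = chat_turn("And the taxes?", history,
                   fake_condense, fake_retrieve, fake_generate)
print(seen_queries[1])
```

The second retrieval query carries the topic of the first question, which is why a follow-up like "And the taxes?" can still retrieve useful chunks.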

## 3.2 Interactive Chat Loop

In [12]:
def interactive_rag_chatbot():
    """
    An interactive chatbot that grounds its answers using the RAG pipeline.
    """
    print("\n--- 💬 Interactive RAG Chatbot 💬 ---")
    print(f"Document: {pdf_path}")
    print("Ask me questions about the document (e.g., 'What is the total estimated monthly payment?').")
    print("Type 'exit' to end the conversation.")
    print("-" * 60)

    while True:
        try:
            # Use input() within a try/except to handle unexpected Colab issues
            user_input = input("You: ")
        except EOFError:
            # This often catches issues when input is expected but not provided interactively
            print("\nExiting due to non-interactive environment.")
            break

        # Strip whitespace so inputs like " exit " are still recognized
        if user_input.strip().lower() in ["exit", "quit", "bye"]:
            print("\nChatbot: Goodbye! Feel free to upload a new document next time.")
            break

        try:
            # Use the chat_engine which handles history and retrieval simultaneously
            response = chat_engine.chat(user_input)

            # Print the response
            print(f"\nChatbot: {response.response}")

        except Exception as e:
            print(f"\nError processing query: {e}")
            print("Please try again.")

## 3.3 Run the Chatbot

In [13]:
interactive_rag_chatbot()


--- 💬 Interactive RAG Chatbot 💬 ---
Document: LenderFeesWorksheetNew (1).pdf
Ask me questions about the document (e.g., 'What is the total estimated monthly payment?').
Type 'exit' to end the conversation.
------------------------------------------------------------
You: What is the total estimated monthly payment

Chatbot: The total estimated monthly payment is **$2,308.95**.

This amount is broken down into several components, as detailed in the Fees Worksheet:
*   **Principal & Interest:** $1,869.37
*   **Hazard Insurance:** $39.58
*   **Real Estate Taxes:** $400.00
*   **Mortgage Insurance:** $0.00 (This item is listed but shows $0.00 for this specific loan)
*   **Homeowner Assn. Dues:** $0.00 (This item is listed but shows $0.00 for this specific loan)
*   **Other Financing (P & I):** $0.00 (This item is listed but shows $0.00 for this specific loan)
*   **Other:** $0.00 (This item is listed but shows $0.00 for this specific loan)

The sum of these components ($1,869.37 + $