<a href="https://colab.research.google.com/github/LashawnFofung/RAG-Pipelines/blob/main/src/Task_Route_Queries.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Route Queries**

🔎 **Metadata-Driven RAG: Fast, Explainable PDF Retrieval Without Embeddings**

<br>

**Data:** *PayStatement-Nov_1__2024.pdf*

<br>

This Colab notebook demonstrates a cost-effective and high-speed retrieval system for multi-page documents like financial records, resumes, or contracts. Instead of relying on computationally expensive vector embeddings, we use metadata filtering to route a user's query to the exact, relevant PDF page.

<br>

The core idea is to use an LLM (Gemini) for high-level classification to narrow down the search space, and then perform simple, explainable keyword matching on the smaller set of documents.

<br>

**🌟 What You Will Learn**

- **PDF Loading:** How to load a multi-page PDF, treating each page as a separate document using llama-index.

- **Metadata Creation:** How to attach useful, structured metadata (file_id, doc_type, page_number) to each document page.

- **LLM-Powered Classification:** Using the Gemini API to classify the user's intent (e.g., "pay_stub," "contract") from a query like "How much money did I make?".

- **Rapid Metadata Filtering:** Implementing a fast retrieval step to fetch only the documents matching the predicted doc_type.

- **Keyword Fallback:** An optional, fast keyword search method that only runs on the small subset of retrieved documents, eliminating the need for complex, resource-intensive vector database lookups.

<br>

**⚙️ How It Works (The 4 Steps)**

**1. Extract & Store:** Load the PDF and store the page text along with essential metadata (e.g., file_id, page_number).

**2. Classify Documents:** Each page is classified by an LLM, which assigns it a doc_type (e.g., "pay_stub").

**3. Classify Query:** The user's natural language query (e.g., "What is my monthly salary?") is classified by an LLM to determine the doc_type most likely to answer it.

**4. Retrieve:** Only pages whose doc_type matches the query's predicted doc_type are retrieved, which shrinks the search space before any keyword matching or answer generation runs.

<br>

This approach is well suited to RAG systems where document types are easy to categorize (e.g., personal records, legal forms, invoices) and retrieval speed is critical.
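
The four steps above can be sketched end to end without any API calls. This is a minimal illustration, not the notebook's implementation: `stub_classify` is a hypothetical keyword stub standing in for the Gemini classifier, and the two sample pages are made up.

```python
# Minimal sketch of the four steps, with a keyword stub standing in for the LLM.
# `stub_classify` and the sample pages are hypothetical, not part of the notebook.

def stub_classify(text):
    """Stand-in for an LLM call: map text to a doc_type label."""
    text = text.lower()
    if any(word in text for word in ("salary", "net pay", "earnings")):
        return "pay_stub"
    if any(word in text for word in ("experience", "education")):
        return "resume"
    return "unknown"

# Step 1: extract and store page text with metadata
pages = ["NET PAY $ 1,201.21 Earnings", "Education: BSc. Experience: 5 years"]
store = [
    {"file_id": "demo", "page_number": i + 1, "doc_type": "unknown", "text": text}
    for i, text in enumerate(pages)
]

# Step 2: classify each page
for record in store:
    record["doc_type"] = stub_classify(record["text"])

# Step 3: classify the query
query_type = stub_classify("What is my monthly salary?")

# Step 4: retrieve only the pages whose doc_type matches
matches = [r for r in store if r["doc_type"] == query_type]
print([(r["page_number"], r["doc_type"]) for r in matches])
```

Swapping the stub for a real LLM call changes only the classifier; the store, the label comparison, and the filter stay the same.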


Notebook Structure
- [Step 1: Installation and File Upload](#scrollTo=GEn270-MsCsV)
- [Step 2: Store File-Level Metadata](#scrollTo=EcQi7IxosMIW)
- [Step 3: Classify the User Query](#scrollTo=O-IrycW0sgj4)
- [Step 4: Classify Each Page & Assign A Doc_Type](#scrollTo=lQ5BkxyetjhD)
- [Step 5: Retrieve Files Matching That Doc Type](#scrollTo=OTuEi0KSuZ6I)
- [Step 6: Fall Back to Keyword Search](#scrollTo=WMNJpnowurLA)


# **Step 1: Installation and File Upload**

In [None]:
# Install the required packages
!pip install -q llama-index jedi
!pip install -q llama-index-readers-file



In [None]:
from llama_index.readers.file import PDFReader
import time
# Load the PDF
loader = PDFReader()
documents = loader.load_data("/content/PayStatement-Nov_1__2024.pdf")  # Returns one Document per page

# Preview to confirm it worked
print(f"Loaded {len(documents)} pages")
print(documents[0].text[:300])

Loaded 1 pages
PAY DATE:
Nov 1, 2024
NET PAY $ 1,201.21
YEAR TO DATE $ 25,712.38
Description Current YTD Rate Current YTD
Hours/Units Hours/Units Earnings Earnings
Referral Fee 250.00
Overtime $ 1.90 1.90 51.87 51.87
Arrears 682.46
1X Payment 58.00
Canada Holiday 64.00 1,137.20
Home Hourly 79.99 1,608.45 18.200 1,


# **Step 2: Store File-Level Metadata**

When users upload PDFs, don't just save the raw file; attach useful metadata that helps identify what the file contains.

In [None]:
import uuid

file_id = str(uuid.uuid4())  # Assigns Unique ID
user_id = "xyz"
year = "2024"
filename = "PayStatement-Nov_1__2024.pdf"

pdf_metadata_store = []

for i, doc in enumerate(documents):
    metadata = {
        "file_id": file_id,
        "user_id": user_id,
        "doc_type": "unknown",  # We'll classify it later
        "year": year,
        "filename": filename,
        "page_number": i + 1,
        "text": doc.text
    }
    pdf_metadata_store.append(metadata)

print("Stored Metadata:")
print(pdf_metadata_store[0])

Stored Metadata:
{'file_id': '8400258d-d62d-4de3-9939-69f423964a61', 'user_id': 'xyz', 'doc_type': 'unknown', 'year': '2024', 'filename': 'PayStatement-Nov_1__2024.pdf', 'page_number': 1, 'text': 'PAY DATE:\nNov 1, 2024\nNET PAY $ 1,201.21\nYEAR TO DATE $ 25,712.38\nDescription Current YTD Rate Current YTD\nHours/Units Hours/Units Earnings Earnings\nReferral Fee 250.00\nOvertime $ 1.90 1.90 51.87 51.87\nArrears 682.46\n1X Payment 58.00\nCanada Holiday 64.00 1,137.20\nHome Hourly 79.99 1,608.45 18.200 1,455.82 29,112.35\nStat Hrs Wrk Py 28.68 769.16\nSick Pay 16.00 282.00\nTotal: 1,507.69 32,343.04\nDescription Current YTD\n \nFederal Tax 198.94 4,165.83\nCPP 82.51 1,757.13\nEI 25.03 536.90\nInsurance Prem 170.80\nTotal: 306.48 6,630.66\nDescription Current YTD\nLife Insurance 8.30 91.30\nDep Life BC & Q 1.37 15.07\nAD&D 4.00 44.00\nDental 72.44 796.84\nExt Health 142.90 1,571.90\nPay Period: Oct 13, 2024 to Oct 26, 2024\nPeriod Number: 22\nPayroll Number: M03715\nEmployee Number: 99535
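
Because the full page text lives inside each metadata record, printing a record is noisy. One option, shown here as a sketch rather than the notebook's approach, is a small typed record with a preview helper. `PageRecord` and `preview` are hypothetical names; the fields mirror the dictionary keys used above.

```python
from dataclasses import dataclass, asdict
import uuid

@dataclass
class PageRecord:
    file_id: str
    user_id: str
    filename: str
    page_number: int
    text: str
    year: str = ""
    doc_type: str = "unknown"  # filled in later by the classifier

    def preview(self, max_chars=40):
        """Return the metadata with the page text truncated, for readable logging."""
        summary = asdict(self)
        summary["text"] = self.text[:max_chars]
        return summary

record = PageRecord(
    file_id=str(uuid.uuid4()),
    user_id="xyz",
    filename="example.pdf",  # placeholder filename
    page_number=1,
    text="PAY DATE:\nNov 1, 2024\n" * 10,
    year="2024",
)
print(record.preview())
```

A dataclass also catches misspelled field names at construction time, which a plain dictionary would silently accept.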

# **Step 3: Classify the User Query**

Before you search through documents, figure out what kind of file the question is about. This helps you focus only on the most relevant files.

In [None]:
# Import required modules
import google.generativeai as genai
from google.colab import userdata  # Colab utility for accessing secrets

def gemini_model(prompt):
    """Send a prompt to Gemini and return the response text."""
    # Retrieve the API key from Colab Secrets
    # NOTE: You must have a secret named 'GEMINI_API_KEY' set in the Colab sidebar
    try:
        api_key = userdata.get('GEMINI_API_KEY')
    except Exception as e:
        # Fail loudly: callers invoke .strip() on the result, so returning None would crash later
        raise RuntimeError(
            "Error accessing Colab Secret. Make sure a secret named 'GEMINI_API_KEY' is set."
        ) from e

    # Configure the Gemini client with the secure API key
    genai.configure(api_key=api_key)

    model = genai.GenerativeModel("models/gemini-2.0-flash")
    response = model.generate_content(prompt)

    return response.text

In [None]:
def classify_query_llm(query, metadata_store):
    doc_list = "\n".join(
        [f"{i+1}. {doc['filename']} - doc_type: {doc['doc_type']}" for i, doc in enumerate(metadata_store)]
    )

    prompt = f"""
  You are an intelligent assistant that routes user queries to the most relevant document.

  Available documents:
  {doc_list}

  User query: "{query}"

  Which document(s) are most likely to contain the answer?
  Respond with one of the following types:
  ["pay_stub", "loan_form", "resume", "contract", "w2", "unknown"]

  """

    print(prompt)
    response = gemini_model(prompt)
    # Lowercase so the label compares equal to the lowercased page labels in Step 5
    return response.strip().lower()

In [None]:
query = "What is my monthly salary?"
predicted_doc_type = classify_query_llm(query, pdf_metadata_store)
print(predicted_doc_type)


  You are an intelligent assistant that routes user queries to the most relevant document.

  Available documents:
  1. PayStatement-Nov_1__2024.pdf - doc_type: unknown

  User query: "What is my monthly salary?"

  Which document(s) are most likely to contain the answer?
  Respond with one of the following types:
  ["pay_stub", "loan_form", "resume", "contract", "w2", "unknown"]

  
pay_stub
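
Retrieval in Step 5 compares labels with exact string equality, so a reply such as `"Pay_Stub"` or `The best match is pay_stub.` would match nothing. The sketch below shows one way to map a free-form reply onto the allowed label set. `normalize_label` and `ALLOWED_DOC_TYPES` are hypothetical names, not part of the notebook; the label list is copied from the prompt.

```python
ALLOWED_DOC_TYPES = ["pay_stub", "loan_form", "resume", "contract", "w2", "unknown"]

def normalize_label(raw_response, allowed=ALLOWED_DOC_TYPES):
    """Map a free-form model reply onto one allowed label, else 'unknown'."""
    if not raw_response:
        return "unknown"
    cleaned = raw_response.strip().lower()
    # Exact match after stripping quotes, brackets, and trailing punctuation
    stripped = cleaned.strip("\"'`[]. \n")
    if stripped in allowed:
        return stripped
    # Otherwise accept the reply only if it mentions exactly one real label
    mentioned = [label for label in allowed if label != "unknown" and label in cleaned]
    return mentioned[0] if len(mentioned) == 1 else "unknown"

print(normalize_label(' "Pay_Stub"\n'))
print(normalize_label("The best match is pay_stub."))
print(normalize_label("pay_stub or contract"))
```

Falling back to "unknown" on an ambiguous reply is a deliberate choice here: an empty result set is easier to notice than a confident route to the wrong document type.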


# **Step 4: Classify Each Page & Assign A Doc_Type**
Before files can be retrieved, each PDF page needs to be classified
based on its content and assigned a doc_type. Here the classification is done by the LLM, using the same label set as the query classifier.

In [None]:
# Define a function to classify each document using the LLM

def classify_doc_type_llm(text, max_chars=1000):
    # Truncate text to avoid token overflow
    truncated_text = text[:max_chars]

    prompt = f"""
You are classifying a document into one of the following types:
["pay_stub", "loan_form", "resume", "contract", "w2", "unknown"]

Document content:
\"\"\"
{truncated_text}
\"\"\"

What is the best doc_type label for this document? Respond with only one of the labels above.
"""
    try:
        response = gemini_model(prompt)
        return response.strip().lower()
    except Exception as e:
        print("LLM failed:", e)
        return "unknown"

In [None]:
for i, doc in enumerate(pdf_metadata_store):
    print(f"Classifying doc {i+1}...")
    doc["doc_type"] = classify_doc_type_llm(doc["text"])
    time.sleep(1)  # Optional: avoid rate limiting

Classifying doc 1...
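
A fixed `time.sleep(1)` spaces out requests but does not recover from a call that fails. A common alternative is to retry with exponential backoff. The sketch below is generic Python, not the notebook's code: `call_with_retries` and `flaky_classifier` are hypothetical. Since `classify_doc_type_llm` catches exceptions itself, a wrapper like this would go around the underlying model call rather than around the classifier.

```python
import time

def call_with_retries(fn, *args, max_attempts=3, base_delay=1.0, sleep=time.sleep):
    """Call fn(*args); on failure wait base_delay * 2**attempt, then retry."""
    for attempt in range(max_attempts):
        try:
            return fn(*args)
        except Exception as exc:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the last error
            delay = base_delay * (2 ** attempt)
            print(f"Attempt {attempt + 1} failed ({exc}); retrying in {delay:.1f}s")
            sleep(delay)

# Demo with a hypothetical flaky function that fails twice, then succeeds
calls = {"count": 0}

def flaky_classifier(text):
    calls["count"] += 1
    if calls["count"] < 3:
        raise RuntimeError("rate limited")
    return "pay_stub"

delays = []  # record the delays instead of actually sleeping
label = call_with_retries(flaky_classifier, "NET PAY", sleep=delays.append)
print(label, delays)
```

Passing `sleep` as a parameter keeps the demo instant; in a real run the default `time.sleep` applies.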


# **Step 5: Retrieve Files Matching That Doc Type**

Once you've classified the query, use metadata to fetch only the relevant files.

In [None]:
# Function to retrieve documents by doc_type
def retrieve_files_by_doc_type(doc_type, metadata_store):
    return [doc for doc in metadata_store if doc["doc_type"] == doc_type]

# Call the function using the predicted type
matched_files = retrieve_files_by_doc_type(predicted_doc_type, pdf_metadata_store)

# Display matched documents
print("Matched Documents:")
for doc in matched_files:
    print(doc)

Matched Documents:
{'file_id': '8400258d-d62d-4de3-9939-69f423964a61', 'user_id': 'xyz', 'doc_type': 'pay_stub', 'year': '2024', 'filename': 'PayStatement-Nov_1__2024.pdf', 'page_number': 1, 'text': 'PAY DATE:\nNov 1, 2024\nNET PAY $ 1,201.21\nYEAR TO DATE $ 25,712.38\nDescription Current YTD Rate Current YTD\nHours/Units Hours/Units Earnings Earnings\nReferral Fee 250.00\nOvertime $ 1.90 1.90 51.87 51.87\nArrears 682.46\n1X Payment 58.00\nCanada Holiday 64.00 1,137.20\nHome Hourly 79.99 1,608.45 18.200 1,455.82 29,112.35\nStat Hrs Wrk Py 28.68 769.16\nSick Pay 16.00 282.00\nTotal: 1,507.69 32,343.04\nDescription Current YTD\n \nFederal Tax 198.94 4,165.83\nCPP 82.51 1,757.13\nEI 25.03 536.90\nInsurance Prem 170.80\nTotal: 306.48 6,630.66\nDescription Current YTD\nLife Insurance 8.30 91.30\nDep Life BC & Q 1.37 15.07\nAD&D 4.00 44.00\nDental 72.44 796.84\nExt Health 142.90 1,571.90\nPay Period: Oct 13, 2024 to Oct 26, 2024\nPeriod Number: 22\nPayroll Number: M03715\nEmployee Number: 99
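
The stored metadata also includes `year` and `user_id`, so the same filter idea extends to several fields at once. The following is a sketch under that assumption; `retrieve_by_metadata` and `sample_store` are hypothetical and not part of the notebook.

```python
def retrieve_by_metadata(metadata_store, **filters):
    """Return records whose metadata equals every key/value pair in filters."""
    return [
        doc for doc in metadata_store
        if all(doc.get(key) == value for key, value in filters.items())
    ]

# Hypothetical records, with the text field omitted for brevity
sample_store = [
    {"doc_type": "pay_stub", "year": "2024", "user_id": "xyz", "page_number": 1},
    {"doc_type": "pay_stub", "year": "2023", "user_id": "xyz", "page_number": 1},
    {"doc_type": "contract", "year": "2024", "user_id": "xyz", "page_number": 2},
]
hits = retrieve_by_metadata(sample_store, doc_type="pay_stub", year="2024")
print(hits)
```

Calling it with no filters returns every record, and a filter on a missing key simply matches nothing.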

# **Step 6: Fall Back to Keyword Search**

If your metadata filter still leaves too many results, or the user's query is vague, you can run a second-pass search, but only within the smaller set of matching files. Here that second pass is simple keyword matching, standing in for a deeper semantic search.

In [None]:
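def semantic_search(query, documents):
    # NOTE: despite the name, this is plain keyword matching, not embedding search
    query = query.lower()
    keywords = ["salary", "money", "pay", "net pay", "earnings", "basic pay", "income", "total"]

    # Only search when the query itself is about pay; otherwise there is nothing to match
    if not any(keyword in query for keyword in keywords):
        return None

    # Return the first document that mentions any pay-related keyword
    for doc in documents:
        text = doc["text"].lower()
        if any(keyword in text for keyword in keywords):
            return doc
    return None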

user_query = "How much money did I make?"
print("Best matching result (fallback):")
print(semantic_search(user_query, matched_files))

Best matching result (fallback):
{'file_id': '8400258d-d62d-4de3-9939-69f423964a61', 'user_id': 'xyz', 'doc_type': 'pay_stub', 'year': '2024', 'filename': 'PayStatement-Nov_1__2024.pdf', 'page_number': 1, 'text': 'PAY DATE:\nNov 1, 2024\nNET PAY $ 1,201.21\nYEAR TO DATE $ 25,712.38\nDescription Current YTD Rate Current YTD\nHours/Units Hours/Units Earnings Earnings\nReferral Fee 250.00\nOvertime $ 1.90 1.90 51.87 51.87\nArrears 682.46\n1X Payment 58.00\nCanada Holiday 64.00 1,137.20\nHome Hourly 79.99 1,608.45 18.200 1,455.82 29,112.35\nStat Hrs Wrk Py 28.68 769.16\nSick Pay 16.00 282.00\nTotal: 1,507.69 32,343.04\nDescription Current YTD\n \nFederal Tax 198.94 4,165.83\nCPP 82.51 1,757.13\nEI 25.03 536.90\nInsurance Prem 170.80\nTotal: 306.48 6,630.66\nDescription Current YTD\nLife Insurance 8.30 91.30\nDep Life BC & Q 1.37 15.07\nAD&D 4.00 44.00\nDental 72.44 796.84\nExt Health 142.90 1,571.90\nPay Period: Oct 13, 2024 to Oct 26, 2024\nPeriod Number: 22\nPayroll Number: M03715\nEmplo
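
Returning the first document that contains any keyword works for a single page, but it cannot order several candidates. One lightweight alternative is to score each document by how many distinct query terms it contains. This is a sketch, not the notebook's method: `rank_by_keyword_overlap`, the `STOPWORDS` set, and the sample documents are all hypothetical.

```python
import re

# Small illustrative stopword list; a real one would be longer
STOPWORDS = {"how", "much", "did", "i", "my", "the", "is", "what", "a", "of"}

def rank_by_keyword_overlap(query, documents):
    """Score each document by how many distinct query terms its text contains."""
    terms = set(re.findall(r"[a-z0-9]+", query.lower())) - STOPWORDS
    scored = []
    for doc in documents:
        words = set(re.findall(r"[a-z0-9]+", doc["text"].lower()))
        score = len(terms & words)
        if score > 0:
            scored.append((score, doc))
    # Highest overlap first; ties keep their original order
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored

docs = [
    {"page_number": 1, "text": "NET PAY $ 1,201.21 Total Earnings"},
    {"page_number": 2, "text": "Dental and Life Insurance deductions"},
]
ranked = rank_by_keyword_overlap("What is my net pay total?", docs)
for score, doc in ranked:
    print(score, doc["page_number"])
```

Unlike the fixed keyword list, the terms here come from the query itself, so the same function serves any document type that survives the metadata filter.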