<a href="https://colab.research.google.com/github/LashawnFofung/RAG-Pipelines/blob/main/src/Task_Route_Queries.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Route Queries**

🔎 **Metadata-Driven RAG: Fast, Explainable PDF Retrieval Without Embeddings**

<br>

**Data:**
- PayStatement-Nov_1__2024.pdf
- COE-Sample.pdf
- LenderFeesWorksheetNew.pdf
- functionalsample.pdf
- payslip-1752803610.pdf
- payslip-1752804713.pdf
- SampleContract-Shuttle.pdf

<br>

This Colab notebook demonstrates a cost-effective and high-speed retrieval system for multi-page documents like financial records, resumes, or contracts. Instead of relying on computationally expensive vector embeddings, we use metadata filtering to route a user's query to the exact, relevant PDF page.

<br>

The core idea is to use an LLM (Gemini) for high-level classification to narrow down the search space, and then perform simple, explainable keyword matching on the smaller set of documents.

<br>

**🌟 What You Will Learn**

- **PDF Loading:** How to load a multi-page PDF, treating each page as a separate document using llama-index.

- **Metadata Creation:** How to attach useful, structured metadata (file_id, doc_type, page_number) to each document page.

- **LLM-Powered Classification:** Using the Gemini API to classify the user's intent (e.g., "pay_stub," "contract") from a query like "How much money did I make?".

- **Rapid Metadata Filtering:** Implementing a fast retrieval step to fetch only the documents matching the predicted doc_type.

- **Keyword Fallback:** An optional, fast keyword search method that only runs on the small subset of retrieved documents, eliminating the need for complex, resource-intensive vector database lookups.

<br>

**⚙️ How It Works (The 4 Steps)**

**1. Extract & Store:** Load the PDF and store the page text along with essential metadata (e.g., file_id, page_number).

**2. Classify Documents:** Each stored page is classified by an LLM, which assigns it a doc_type (e.g., "pay_stub").

**3. Classify Query:** The user's natural language query (e.g., "What is my monthly salary?") is classified by an LLM to determine the most relevant doc_type needed to answer it.

**4. Retrieve:** Only documents where the document's doc_type matches the query's predicted doc_type are retrieved, drastically reducing the search space and speeding up the final answer generation.

<br>

This approach works well for RAG systems where document types are easy to categorize (e.g., personal records, legal forms, invoices) and retrieval speed is critical.
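The four steps above can be sketched end to end without any API call. This is a minimal illustration, not the notebook's implementation: `toy_classifier` is a hypothetical rule-based stand-in for the Gemini call, and the two page records are made-up data that only mirror the metadata fields used later.

```python
# Toy in-memory store: one record per page (made-up data for illustration).
pages = [
    {"filename": "contract.pdf", "page_number": 1, "doc_type": "unknown",
     "text": "This agreement is made between the Employer and the Employee."},
    {"filename": "payslip.pdf", "page_number": 1, "doc_type": "unknown",
     "text": "NET PAY $ 1,201.21 for the pay period ending Nov 1."},
]

def toy_classifier(text):
    """Hypothetical stand-in for the LLM: a crude rule-based labeler."""
    lowered = text.lower()
    if "pay" in lowered or "money" in lowered or "salary" in lowered:
        return "pay_stub"
    if "agreement" in lowered or "contract" in lowered:
        return "contract"
    return "unknown"

def route(query, store, classify):
    # Step 2: label every page that has not been classified yet.
    for page in store:
        if page["doc_type"] == "unknown":
            page["doc_type"] = classify(page["text"])
    # Step 3: label the query using the same label set.
    wanted = classify(query)
    # Step 4: keep only pages whose label matches the query's label.
    return wanted, [p for p in store if p["doc_type"] == wanted]

wanted, hits = route("How much money did I make?", pages, toy_classifier)
print(wanted, [(p["filename"], p["page_number"]) for p in hits])
```

Because the classifier is passed in as an argument, the same routing function would work unchanged if the rule-based stub were swapped for an LLM-backed function.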


Notebook Structure
- [Step 1: Installation and File Upload](#scrollTo=GEn270-MsCsV)
- [Step 2: Store File-Level Metadata](#scrollTo=EcQi7IxosMIW)
- [Step 3: Classify the User Query](#scrollTo=O-IrycW0sgj4)
- [Step 4: Classify Each Page & Assign a doc_type](#scrollTo=lQ5BkxyetjhD)
- [Step 5: Retrieve Files Matching That doc_type](#scrollTo=OTuEi0KSuZ6I)
- [Step 6: Fall Back to Keyword Search](#scrollTo=WMNJpnowurLA)


# **Step 1: Installation**

In [1]:
#Download the necessary packages
!pip install -q llama-index jedi
!pip install -q llama-index-readers-file


# **Step 2: Load Multiple PDF Files**

In [2]:
from llama_index.readers.file import PDFReader
import time

# List of all PDF file paths
pdf_files = [
    "/content/COE-Sample.pdf",
    "/content/LenderFeesWorksheetNew.pdf",
    "/content/PayStatement-Nov_1__2024.pdf",
    "/content/SampleContract-Shuttle.pdf",
    "/content/functionalsample.pdf",
    "/content/payslip-1752803610.pdf",
    "/content/payslip-1752804713.pdf"
]

# Initialize the loader and an empty list to store ALL documents
loader = PDFReader()
all_documents = []

# Loop through the list of files and append the loaded documents from each file
for file_path in pdf_files:
    # load_data returns a list of Document objects (one per page)
    new_documents = loader.load_data(file_path)
    all_documents.extend(new_documents) # Add the new pages to the main list

# Preview to confirm it worked
print(f"Loaded {len(all_documents)} pages from all files")
print("First page preview:")
print(all_documents[0].text[:300])

Loaded 20 pages from all files
First page preview:
SAMPLE CONTRACT OF EMPLOYMENT 
 
This agreement, made on the ……  day of the  …………….month of the year…… …………    
Between: 
………………………………………………………… (hereinafter referred to as "the Employer")  
and 
…………………………………………………………  (hereinafter referred to as "the Employee")  
WHEREAS the Employee and the Employe


# **Step 3: Store File-Level Metadata**

When users upload PDFs, don't just save the raw file; attach metadata that identifies what the file contains.

In [3]:
import uuid
import os

pdf_metadata_store = []
user_id = "xyz"
year = "2024" # Static year for this user's files

# --- Iterate through each file path to load and assign file-level metadata ---
for file_path in pdf_files:
    # 1. Generate unique file-level attributes
    current_file_id = str(uuid.uuid4()) # Unique ID for THIS file
    current_filename = os.path.basename(file_path) # Get just the filename

    # 2. Load documents for the current file (using the same loader approach)
    loader = PDFReader()
    file_documents = loader.load_data(file_path)

    print(f"Processing: {current_filename} ({len(file_documents)} pages)")

    # 3. Iterate through pages and assign page-level metadata
    for i, doc in enumerate(file_documents):
        metadata = {
            "file_id": current_file_id,    # Unique per file
            "user_id": user_id,
            "doc_type": "unknown",      # To be classified next
            "year": year,
            "filename": current_filename,
            "page_number": i + 1,         # Page number within the file
            "text": doc.text
        }
        pdf_metadata_store.append(metadata)

print("\nStored Metadata for all files:")
print(f"Total pages processed: {len(pdf_metadata_store)}")
print("\nSample Page Metadata:")
print(pdf_metadata_store[0])

# Also print a recognizable sample: the first page of the 'PayStatement' file
sample_doc = next(
    (doc for doc in pdf_metadata_store if doc['filename'] == 'PayStatement-Nov_1__2024.pdf'),
    None,  # Avoid StopIteration if the file was not uploaded
)
print("\nSample PayStatement Page Metadata:")
print(sample_doc)

Processing: COE-Sample.pdf (5 pages)
Processing: LenderFeesWorksheetNew.pdf (1 pages)
Processing: PayStatement-Nov_1__2024.pdf (1 pages)
Processing: SampleContract-Shuttle.pdf (10 pages)
Processing: functionalsample.pdf (1 pages)
Processing: payslip-1752803610.pdf (1 pages)
Processing: payslip-1752804713.pdf (1 pages)

Stored Metadata for all files:
Total pages processed: 20

Sample Page Metadata:
{'file_id': '098e0252-c8e1-4fe9-a99f-1c0147b9596a', 'user_id': 'xyz', 'doc_type': 'unknown', 'year': '2024', 'filename': 'COE-Sample.pdf', 'page_number': 1, 'text': 'SAMPLE CONTRACT OF EMPLOYMENT \n \nThis agreement, made on the ……  day of the  …………….month of the year…… …………    \nBetween: \n………………………………………………………… (hereinafter referred to as "the Employer")  \nand \n…………………………………………………………  (hereinafter referred to as "the Employee")  \nWHEREAS the Employee and the Employer wish to enter 
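
The per-page record above is a plain dictionary. As one possible alternative, the same fields could be held in a small dataclass so that a misspelled key fails loudly. The names `PageRecord` and `build_page_records` are hypothetical and not part of the notebook; the page texts below are placeholders.

```python
import uuid
from dataclasses import dataclass, asdict

@dataclass
class PageRecord:
    """One page of one uploaded file (hypothetical structure)."""
    file_id: str
    user_id: str
    filename: str
    page_number: int
    text: str
    year: str
    doc_type: str = "unknown"  # Filled in later by the classifier

def build_page_records(filename, page_texts, user_id, year):
    # All pages of the same file share a single file_id.
    file_id = str(uuid.uuid4())
    return [
        PageRecord(file_id=file_id, user_id=user_id, filename=filename,
                   page_number=number, text=text, year=year)
        for number, text in enumerate(page_texts, start=1)
    ]

records = build_page_records("example.pdf", ["page one text", "page two text"], "xyz", "2024")
print(asdict(records[0]))  # Convert back to a dict when a plain mapping is needed
```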

# **Step 4: LLM Functions (Secure API Key)**


In [4]:
# Import required modules
import google.generativeai as genai
from google.colab import userdata  # Colab utility for accessing secrets

def gemini_model(prompt):
    """Send a prompt to Gemini and return the response text (None if the API key cannot be read)."""
    # Retrieve the API key from Colab Secrets
    # NOTE: You must have a secret named 'GEMINI_API_KEY' set in the Colab sidebar
    try:
        api_key = userdata.get('GEMINI_API_KEY')
    except Exception as e:
        print("Error accessing Colab Secret. Make sure a secret named 'GEMINI_API_KEY' is set.")
        print(f"Details: {e}")
        return None  # Callers must handle a None response

    # Configure the Gemini client with the secure API key
    genai.configure(api_key=api_key)

    model = genai.GenerativeModel("models/gemini-2.0-flash")
    response = model.generate_content(prompt)

    return response.text

In [5]:
# Define the function to classify each document's type
def classify_doc_type_llm(text, max_chars=1000):
    # Truncate text to avoid token overflow
    truncated_text = text[:max_chars]

    prompt = f"""
You are classifying a document into one of the following types:
["pay_stub", "loan_form", "resume", "contract", "w2", "unknown"]

Document content:
\"\"\"
{truncated_text}
\"\"\"

What is the best doc_type label for this document? Respond with only one of the labels above.
"""
    try:
        response = gemini_model(prompt)
        return response.strip().lower().replace('[', '').replace(']', '').replace('"', '').replace("'", "")
    except Exception as e:
        print("LLM failed:", e)
        return "unknown"
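
The cleanup above strips brackets and quotes but does not check that the result is one of the allowed labels, so a reply such as "The label is contract." would be stored verbatim. A minimal sketch of a stricter normalizer follows; `normalize_label` and `ALLOWED_DOC_TYPES` are hypothetical names, and the sample replies are invented rather than real model output.

```python
ALLOWED_DOC_TYPES = ("pay_stub", "loan_form", "resume", "contract", "w2", "unknown")

def normalize_label(raw_response, allowed=ALLOWED_DOC_TYPES):
    """Map a free-form LLM reply onto one allowed label, else 'unknown'."""
    if not raw_response:
        return "unknown"
    cleaned = raw_response.strip().lower()
    for ch in "[]\"'`.":
        cleaned = cleaned.replace(ch, "")
    cleaned = cleaned.replace(" ", "_").replace("-", "_")
    if cleaned in allowed:
        return cleaned
    # Accept a longer reply only if it mentions exactly one known label.
    found = [label for label in allowed if label != "unknown" and label in cleaned]
    return found[0] if len(found) == 1 else "unknown"

for reply in ['["Pay_Stub"]\n', "Pay stub", "The label is contract.", "invoice", None]:
    print(repr(reply), "->", normalize_label(reply))
```

Falling back to "unknown" for anything ambiguous keeps the later equality filter on doc_type predictable, at the cost of occasionally discarding a usable answer.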

# **Step 5: Classify Documents**


In [6]:
# This step updates the 'doc_type' in pdf_metadata_store from 'unknown'
# to a meaningful label (e.g., 'pay_stub').

for i, doc in enumerate(pdf_metadata_store):
    print(f"Classifying doc {i+1}...")
    doc["doc_type"] = classify_doc_type_llm(doc["text"])
    # Optional print for verification
    print(f"Doc {i+1} classified as: {doc['doc_type']}")
    time.sleep(1) # Optional: avoid rate limiting

print("Document classifying complete")

Classifying doc 1...
Doc 1 classified as: contract
Classifying doc 2...
Doc 2 classified as: contract
Classifying doc 3...
Doc 3 classified as: contract
Classifying doc 4...
Doc 4 classified as: contract
Classifying doc 5...
Doc 5 classified as: contract
Classifying doc 6...
Doc 6 classified as: loan_form
Classifying doc 7...
Doc 7 classified as: pay_stub
Classifying doc 8...
Doc 8 classified as: contract
Classifying doc 9...
Doc 9 classified as: contract
Classifying doc 10...
Doc 10 classified as: contract
Classifying doc 11...
Doc 11 classified as: contract
Classifying doc 12...
Doc 12 classified as: contract
Classifying doc 13...
Doc 13 classified as: contract
Classifying doc 14...
Doc 14 classified as: contract
Classifying doc 15...
Doc 15 classified as: contract
Classifying doc 16...
Doc 16 classified as: contract
Classifying doc 17...
Doc 17 classified as: contract
Classifying doc 18...
Doc 18 classified as: resume
Classifying doc 19...
Doc 19 classified as: pay_stub
Classifying 
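
The loop above sleeps for a fixed second between calls. If rate limits still cause failures, one common option is to retry with exponential backoff. The helper below is a sketch under assumptions: `call_with_retries` is a hypothetical name, the failing classifier is simulated, and the `sleep` parameter is injectable so the example runs without waiting. Note that `classify_doc_type_llm` catches exceptions itself, so a wrapper like this would need to sit around the raw model call instead.

```python
import time

def call_with_retries(fn, *args, max_attempts=3, base_delay=1.0, sleep=time.sleep):
    """Call fn(*args); on failure wait base_delay, 2*base_delay, ... and retry."""
    for attempt in range(1, max_attempts + 1):
        try:
            return fn(*args)
        except Exception as exc:
            if attempt == max_attempts:
                raise  # Out of attempts: surface the last error
            delay = base_delay * 2 ** (attempt - 1)
            print(f"Attempt {attempt} failed ({exc}); retrying in {delay:.1f}s")
            sleep(delay)

# Simulated classifier that fails twice before succeeding.
calls = {"n": 0}

def flaky_classifier(text):
    calls["n"] += 1
    if calls["n"] < 3:
        raise RuntimeError("simulated rate limit")
    return "pay_stub"

delays = []  # Record requested delays instead of actually sleeping
label = call_with_retries(flaky_classifier, "NET PAY $ 1,201.21", sleep=delays.append)
print(label, delays)
```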

# **Step 6: Classify User Query**

In [7]:
# Define the function to classify the user's intent based on available documents
def classify_query_llm(query, metadata_store):
    # The doc_list shows the doc_type labels assigned in the previous step (5)
    doc_list = "\n".join(
        [f"{i+1}. {doc['filename']} — doc_type: {doc['doc_type']}" for i, doc in enumerate(metadata_store)]
    )

    prompt = f"""
  You are an intelligent assistant that routes user queries to the most relevant document.

  Available documents (labeled by type):
  {doc_list}

  User query: "{query}"

  Which document(s) are most likely to contain the answer?
  Respond with one of the following types:
  ["pay_stub", "loan_form", "resume", "contract", "w2", "unknown"]

  """

    print("--- Prompt sent to LLM for Query Classification ---")
    print(prompt)
    response = gemini_model(prompt)
    if response is None:
        # gemini_model returns None when the API key cannot be read
        return "unknown"
    return response.strip().lower().replace('[', '').replace(']', '').replace('"', '').replace("'", "")

query = "How much money did I make?"
predicted_doc_type = classify_query_llm(query, pdf_metadata_store)


print("--- Predicted Document Type ---")
print(predicted_doc_type) # Should output 'pay_stub'

--- Prompt sent to LLM for Query Classification ---

  You are an intelligent assistant that routes user queries to the most relevant document.

  Available documents (labeled by type):
  1. COE-Sample.pdf — doc_type: contract
2. COE-Sample.pdf — doc_type: contract
3. COE-Sample.pdf — doc_type: contract
4. COE-Sample.pdf — doc_type: contract
5. COE-Sample.pdf — doc_type: contract
6. LenderFeesWorksheetNew.pdf — doc_type: loan_form
7. PayStatement-Nov_1__2024.pdf — doc_type: pay_stub
8. SampleContract-Shuttle.pdf — doc_type: contract
9. SampleContract-Shuttle.pdf — doc_type: contract
10. SampleContract-Shuttle.pdf — doc_type: contract
11. SampleContract-Shuttle.pdf — doc_type: contract
12. SampleContract-Shuttle.pdf — doc_type: contract
13. SampleContract-Shuttle.pdf — doc_type: contract
14. SampleContract-Shuttle.pdf — doc_type: contract
15. SampleContract-Shuttle.pdf — doc_type: contract
16. SampleContract-Shuttle.pdf — doc_type: contract
17. SampleCont
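
The prompt above lists every page, so a 10-page contract contributes 10 nearly identical lines. One way to shorten it is to list each (filename, doc_type) pair once with a page count. This is only a sketch: `summarize_doc_list` is a hypothetical helper and the store below is toy data shaped like `pdf_metadata_store`.

```python
from collections import Counter

def summarize_doc_list(metadata_store):
    """Build one prompt line per (filename, doc_type) pair instead of one per page."""
    # Counter keeps first-seen order, so files appear in upload order.
    counts = Counter((doc["filename"], doc["doc_type"]) for doc in metadata_store)
    return "\n".join(
        f"{i}. {filename} | doc_type: {doc_type} | pages: {n}"
        for i, ((filename, doc_type), n) in enumerate(counts.items(), start=1)
    )

# Toy store: five contract pages and one pay stub page.
toy_store = (
    [{"filename": "COE-Sample.pdf", "doc_type": "contract"} for _ in range(5)]
    + [{"filename": "payslip.pdf", "doc_type": "pay_stub"}]
)
summary = summarize_doc_list(toy_store)
print(summary)
```

If pages within one file receive different labels, each distinct pair gets its own line, so no classification result is hidden by the grouping.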

# **Step 7: Retrieve Files Matching That Doc Type**


In [8]:
# Function to retrieve documents by doc_type
def retrieve_files_by_doc_type(doc_type, metadata_store):
    # This filter is fast and only keeps relevant pages
    return [doc for doc in metadata_store if doc["doc_type"] == doc_type]

# Call the function using the predicted type ('pay_stub')
matched_files = retrieve_files_by_doc_type(predicted_doc_type, pdf_metadata_store)

# Display matched documents
print(f"Matched Documents (Filtered by doc_type: {predicted_doc_type}):")
for doc in matched_files:
    print(f"- File: {doc['filename']} | Page: {doc['page_number']} | Doc Type: {doc['doc_type']}")

print(f"\nTotal Matched Pages: {len(matched_files)}")

Matched Documents (Filtered by doc_type: pay_stub):
- File: PayStatement-Nov_1__2024.pdf | Page: 1 | Doc Type: pay_stub
- File: payslip-1752803610.pdf | Page: 1 | Doc Type: pay_stub
- File: payslip-1752804713.pdf | Page: 1 | Doc Type: pay_stub

Total Matched Pages: 3
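
The stored metadata also carries `user_id` and `year`, so the same filtering idea extends to several fields at once. The following is a sketch with a hypothetical `filter_by_metadata` helper and a made-up three-record store; it is not part of the notebook.

```python
def filter_by_metadata(metadata_store, **criteria):
    """Keep records whose metadata equals every given field=value pair."""
    return [
        doc for doc in metadata_store
        if all(doc.get(field) == value for field, value in criteria.items())
    ]

# Made-up records shaped like the notebook's page metadata.
toy_store = [
    {"filename": "a.pdf", "doc_type": "pay_stub", "year": "2024", "user_id": "xyz"},
    {"filename": "b.pdf", "doc_type": "pay_stub", "year": "2023", "user_id": "xyz"},
    {"filename": "c.pdf", "doc_type": "contract", "year": "2024", "user_id": "xyz"},
]

matches = filter_by_metadata(toy_store, doc_type="pay_stub", year="2024")
print([doc["filename"] for doc in matches])
```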


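# **Step 8: Fall Back to Keyword Search**

Keyword search over the matched pages. Despite the function name `semantic_search`, no embeddings are involved: the first matched page containing any pay-related keyword is returned.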

In [9]:
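def semantic_search(query, documents):
    # NOTE: despite its name, this is a plain keyword match, not an embedding search.
    # The query is accepted for interface consistency but is not used for matching;
    # the keyword list below is hard-coded for pay-related questions.
    keywords = ["salary", "money", "pay", "net pay", "earnings", "basic pay", "income", "total"]

    # Only search within the small, matched list of documents
    for doc in documents:
        text = doc["text"].lower()
        if any(keyword in text for keyword in keywords):
            return doc  # First page containing any keyword
    return None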

user_query = "How much money did I make?"
print(f"Running fallback keyword search on the {len(matched_files)} matched page(s):")
best_result = semantic_search(user_query, matched_files)

print("\nBest matching result (fallback):")
if best_result:
    print(f"File: {best_result['filename']} | Page: {best_result['page_number']}")
    print(best_result['text'][:200] + "...")
else:
    print(None)

Running fallback keyword search on the 3 matched page(s):

Best matching result (fallback):
File: PayStatement-Nov_1__2024.pdf | Page: 1
PAY DATE:
Nov 1, 2024
NET PAY $ 1,201.21
YEAR TO DATE $ 25,712.38
Description Current YTD Rate Current YTD
Hours/Units Hours/Units Earnings Earnings
Referral Fee 250.00
Overtime $ 1.90 1.90 51.87 51.8...
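
The fallback above returns the first page that contains any keyword from a fixed list. A query-aware variant could instead score each matched page by how often the query's own terms (plus a few synonyms) appear, then sort by score. The sketch below rests on assumptions: `rank_by_query_terms`, the stopword set, and the synonym map are all hypothetical, and the three documents are toy text.

```python
import re

# Hypothetical stopword list; a real one would be larger.
STOPWORDS = {"how", "much", "did", "i", "the", "a", "is", "my", "what", "do"}

def rank_by_query_terms(query, documents, synonyms=None):
    """Return (score, doc) pairs sorted by how often query terms occur in each page."""
    terms = {t for t in re.findall(r"[a-z0-9]+", query.lower()) if t not in STOPWORDS}
    for term in list(terms):
        terms.update((synonyms or {}).get(term, []))
    scored = []
    for doc in documents:
        words = re.findall(r"[a-z0-9]+", doc["text"].lower())
        score = sum(words.count(term) for term in terms)
        if score > 0:
            scored.append((score, doc))
    # Sort on the score only, so dicts are never compared with each other.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored

toy_docs = [
    {"filename": "b.pdf", "text": "Basic pay 500"},
    {"filename": "a.pdf", "text": "NET PAY $ 1,201.21 Earnings Earnings"},
    {"filename": "c.pdf", "text": "Employment agreement"},
]
toy_synonyms = {"money": ["pay", "earnings", "salary"], "make": ["earnings"]}

ranked = rank_by_query_terms("How much money did I make?", toy_docs, toy_synonyms)
print([(score, doc["filename"]) for score, doc in ranked])
```

Matching on whole tokens rather than substrings avoids accidental hits such as "pay" inside "payable", though it also means multi-word phrases like "net pay" must be handled as separate words.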
