<a href="https://colab.research.google.com/github/LashawnFofung/RAG-Pipelines/blob/main/src/Task_End_to_End_RAG_Pipeline_with_Page_Level.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **End-to-End RAG Pipeline with Page-Level Classification**

**Data:** *Blob File Sample.pdf*

<br>

This notebook demonstrates how to handle a single PDF containing multiple different document types (e.g., a Resume followed by a Payslip). We use an LLM to identify document boundaries and then route user queries only to the relevant sections.

<br>

**Table of Contents**
- [Setup: Install Dependencies](#scrollTo=aC0LO8G6xsHq&line=3&uniqifier=1)
- [Step 1: Load a Blob-Style PDF and Extract Pages](#scrollTo=JFp9RnDlxvsn&line=1&uniqifier=1)
- [Step 2: Use LLM to Classify Document Type and Boundaries](#scrollTo=gyvg734oy1I9)
- [Step 3: Group Pages into Logical Documents](#scrollTo=uxoWt96qzSqF)
- [Step 4: Chunk Each Logical Document with Metadata](#scrollTo=uJAtrxbrzfWx)
- [Step 5: Embed and Store in Vector DB with Metadata](#scrollTo=t0tDlA4Fzr4P)
- [Step 6: Predict Query Routing and Retrieve with Metadata Filter](#scrollTo=96J8M9z4z2FV)

# **Setup: Install Dependencies**

Install the LlamaIndex framework, PDF parsers, and embedding model utilities.

In [1]:
# Install the required packages
!pip install -q llama-index
!pip install -q llama-index-readers-file
!pip install -q transformers sentence-transformers
!pip install -q llama-index-embeddings-huggingface
!pip install -q PyPDF2



# **📥 Step 1: Load a Blob-Style PDF and Extract Pages**

In this step, we read the PDF and store each page as a separate dictionary entry to prepare for classification.

In [2]:
from PyPDF2 import PdfReader

reader = PdfReader("/content/Blob File Sample.pdf") # Initialize the PDF reader with the file path
pages = [page.extract_text() for page in reader.pages] # Extract text from every page in the PDF

# Create a structured list of dictionaries containing page number and text
doc_pages = [{"page_num": i, "text": p} for i, p in enumerate(pages)]

doc_pages # Display the extracted data

[{'page_num': 0,
  'text': 'Functional Resume Sample \n \nJohn W. Smith   \n2002 Front Range Way Fort Collins, CO 80525  \njwsmith@colostate.edu  \n \nCareer Summary \n \nFour years experience in early childhood development with a di verse background in the care of \nspecial needs children and adults.  \n  \nAdult Care Experience  \n \n• Determined work placement for 150 special needs adult clients.  \n• Maintained client databases and records.  \n• Coordinated client contact with local health care professionals on a monthly basis.     \n• Managed 25 volunteer workers.     \n \nChildcare Experience  \n \n• Coordinated service assignments for 20 part -time counselors and 100 client families. \n• Oversaw daily activity and outing planning for 100 clients.  \n• Assisted families of special needs clients with researching financial assistance and \nhealthcare. \n• Assisted teachers with managing daily classroom activities.    \n• Oversaw daily and special st udent activiti
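
Some PDF pages (for example, scanned images) can yield no extractable text, and an empty page gives the classifier nothing to work with. The sketch below shows one way to flag such pages before classification. The helper name `build_page_records` and the `is_empty` key are assumptions introduced for illustration; they are not part of the notebook.

```python
def build_page_records(page_texts):
    """Turn raw per-page strings into records, flagging pages with no text."""
    records = []
    for page_num, raw_text in enumerate(page_texts):
        text = (raw_text or "").strip()  # treat None like an empty page
        records.append({"page_num": page_num, "text": text, "is_empty": not text})
    return records


# Toy input standing in for the list produced by page.extract_text()
sample_pages = ["Functional Resume Sample", "", None, "Fee Details and Summary"]
records = build_page_records(sample_pages)
empty_pages = [r["page_num"] for r in records if r["is_empty"]]
print(empty_pages)
```

Pages listed in `empty_pages` could be skipped or sent through OCR before the classification step, depending on the data.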

# **🧠 Step 2: Use LLM to Classify Document Type and Boundaries**

We use Google's Gemini both to classify documents and, later, to answer questions.

<br>

For each page after the first, we ask the LLM: "Is this page a continuation of the previous one, or a new document type?" Whenever a new document starts, a second prompt classifies its type.


In [3]:
# Import required modules
import os
from google.colab import userdata
import google.generativeai as genai
from llama_index.embeddings.huggingface import HuggingFaceEmbedding


# 1. Load and set the Gemini API key
try:
    API_KEY = userdata.get('GEMINI_API_KEY')
    if not API_KEY:
        raise ValueError("GEMINI_API_KEY not found in Colab Secrets. Please set it.")

    # Configure the Gemini library globally
    genai.configure(api_key=API_KEY)
    print("✅ Gemini API Key successfully loaded and configured.")

    # 2. Load the Hugging Face API token (stored under a custom secret name)
    HF_TOKEN = userdata.get('HFACE_API_KEY')

    if HF_TOKEN:
        # Assign the token to the environment variable Hugging Face checks for
        os.environ["HF_TOKEN"] = HF_TOKEN
        print("✅ Hugging Face API Token loaded and set to os.environ['HF_TOKEN'].")
    else:
        print("⚠️ Warning: HFACE_API_KEY not found. Hugging Face models requiring auth may fail.")

except Exception as e:
    # Broad catch: a missing or inaccessible Colab secret may raise an error other than ValueError
    print(f"⚠️ Warning: Configuration failed: {e}. Please ensure Colab secrets are set correctly.")
    # Later cells that call Gemini will fail if the key was not configured

# 3. Define the embedding model globally
# Note: Step 5 rebinds embed_model to the model actually used for indexing
print("\nLoading Embedding Model (BAAI/bge-small-en-v1.5)...")
embed_model = HuggingFaceEmbedding(model_name="BAAI/bge-small-en-v1.5")
print("✅ Embedding Model Loaded.")



def gemini_model(prompt):
    """
    Call the Gemini model using the globally configured client.
    Returns the response text, or None if the API call fails.
    """
    try:
        # genai.configure(...) was called earlier in this cell
        model = genai.GenerativeModel("models/gemini-2.0-flash")
        response = model.generate_content(prompt)
        return response.text
    except Exception as e:
        print(f"Gemini API call failed: {e}")
        return None




✅ Gemini API Key successfully loaded and configured.
✅ Hugging Face API Token loaded and set to os.environ['HF_TOKEN'].

Loading Embedding Model (BAAI/bge-small-en-v1.5)...


The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


✅ Embedding Model Loaded.
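
Because `gemini_model` returns `None` when the API call fails, a transient error would otherwise flow straight into the classification logic. The following is a minimal sketch of a retry wrapper with exponential backoff. The name `call_with_retries` and its parameters are assumptions made for this example, and the flaky stand-in model exists only so the sketch runs without an API key.

```python
import time


def call_with_retries(fn, prompt, max_attempts=3, base_delay=1.0, sleep=time.sleep):
    """Call fn(prompt) until it returns a non-None value or attempts run out."""
    for attempt in range(1, max_attempts + 1):
        result = fn(prompt)
        if result is not None:
            return result
        if attempt < max_attempts:
            # Wait 1x, 2x, 4x, ... the base delay between attempts
            sleep(base_delay * 2 ** (attempt - 1))
    return None


# Stand-in for gemini_model: fails twice, then succeeds
attempts = []

def flaky_model(prompt):
    attempts.append(prompt)
    return None if len(attempts) < 3 else "Yes"


recorded_delays = []
# Passing recorded_delays.append as `sleep` records the delays instead of waiting
answer = call_with_retries(flaky_model, "Same document?", sleep=recorded_delays.append)
print(answer, recorded_delays)
```

In the notebook, the call would look like `call_with_retries(gemini_model, prompt)`, which uses real sleeping between attempts.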


In [4]:
def is_same_document(prev_text, curr_text, doc_type=None):
    prompt = f"""
    You are checking whether two pages belong to the same document.
    Previous page type: {doc_type or 'unknown'}

    Previous Page:
    {prev_text}

    Current Page:
    {curr_text}

    Answer ONLY 'Yes' or 'No'. Do NOT explain.
    """
    response = gemini_model(prompt)
    # gemini_model returns None on failure; treat that as "not the same document"
    return (response or "").strip().lower().startswith("yes")

In [5]:
def classify_document_type(text):
    prompt = f"""
    This is the start of a new document. Based on the content, classify it.

    Page Content:
    {text}

    Choose from: Resume, Contract, Lender Fee Sheet, ID, PaySlip, Other.
    Just respond with the type.
    """
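    # gemini_model returns None on failure; fall back to "Other" in that case
    response = gemini_model(prompt) or "Other"
    # Normalize to Title Case without periods, e.g. "lender fee sheet." -> "Lender Fee Sheet"
    return response.strip().lower().replace(".", "").title()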
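
LLM output does not always match the requested label exactly (extra punctuation, different casing, or a full sentence). One possible safeguard is to map the raw response onto the allowed label set and fall back to `Other` otherwise. The helper `normalize_doc_type` and the constant `ALLOWED_DOC_TYPES` are assumptions for this sketch. Note that it returns the canonical spellings from the prompt (`PaySlip`, `ID`), so it would need to be applied both when storing metadata and when routing queries.

```python
ALLOWED_DOC_TYPES = ["Resume", "Contract", "Lender Fee Sheet", "ID", "PaySlip", "Other"]


def normalize_doc_type(raw_label, allowed=ALLOWED_DOC_TYPES, default="Other"):
    """Map a raw LLM response onto one of the allowed labels."""
    if not raw_label:
        return default
    # Drop surrounding whitespace, quotes, and periods, then compare case-insensitively
    cleaned = raw_label.strip().strip(" .'\"").lower()
    for label in allowed:
        if cleaned == label.lower():
            return label
    return default


print(normalize_doc_type("resume."))
print(normalize_doc_type(" PAYSLIP\n"))
print(normalize_doc_type("I think it's a contract"))
```

Unrecognized responses map to `Other` rather than creating a new, unexpected metadata value.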

In [6]:
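metadata_store = []
current_doc_type = None
doc_counter = 0   # index of the current logical document
page_counter = 0  # position of the current page within its logical document
for i, page in enumerate(doc_pages):
    if i == 0:
        # The first page always starts a new document, so classify it directly
        current_doc_type = classify_document_type(page["text"])
    else:
        # Compare with the previous page to detect a document boundary
        prev_text = doc_pages[i - 1]["text"]
        same = is_same_document(prev_text, page["text"], current_doc_type)
        if not same:
            # Boundary found: start a new document and classify its first page
            doc_counter += 1
            current_doc_type = classify_document_type(page["text"])
            page_counter = 0
        else:
            page_counter += 1

    metadata_store.append({
        "page": i,
        "text": page["text"],
        "doc_type": current_doc_type,
        "page_in_doc": page_counter
    })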


metadata_store

[{'page': 0,
  'text': 'Functional Resume Sample \n \nJohn W. Smith   \n2002 Front Range Way Fort Collins, CO 80525  \njwsmith@colostate.edu  \n \nCareer Summary \n \nFour years experience in early childhood development with a di verse background in the care of \nspecial needs children and adults.  \n  \nAdult Care Experience  \n \n• Determined work placement for 150 special needs adult clients.  \n• Maintained client databases and records.  \n• Coordinated client contact with local health care professionals on a monthly basis.     \n• Managed 25 volunteer workers.     \n \nChildcare Experience  \n \n• Coordinated service assignments for 20 part -time counselors and 100 client families. \n• Oversaw daily activity and outing planning for 100 clients.  \n• Assisted families of special needs clients with researching financial assistance and \nhealthcare. \n• Assisted teachers with managing daily classroom activities.    \n• Oversaw daily and special st udent activities. 
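
To check the boundary logic without spending API calls, the same loop can be run with stand-in functions in place of the LLM. This is a sketch under that assumption: `assign_doc_types`, `fake_is_same`, and `fake_classify` are hypothetical names, and the toy pages are invented for the example.

```python
def assign_doc_types(pages, is_same, classify):
    """Label each page with a doc type and its position inside that document."""
    labeled = []
    current_type = None
    page_in_doc = 0
    for i, page in enumerate(pages):
        starts_new_doc = i == 0 or not is_same(pages[i - 1]["text"], page["text"], current_type)
        if starts_new_doc:
            current_type = classify(page["text"])
            page_in_doc = 0
        else:
            page_in_doc += 1
        labeled.append({"page": i, "doc_type": current_type, "page_in_doc": page_in_doc})
    return labeled


# Stand-ins for the two LLM-backed functions
def fake_is_same(prev_text, curr_text, doc_type):
    return prev_text.split()[0] == curr_text.split()[0]

def fake_classify(text):
    return "Resume" if text.startswith("resume") else "Lender Fee Sheet"


toy_pages = [{"text": "resume page 1"}, {"text": "resume page 2"}, {"text": "fee sheet page 1"}]
labeled = assign_doc_types(toy_pages, fake_is_same, fake_classify)
for row in labeled:
    print(row)
```

Passing `is_same_document` and `classify_document_type` as the two callables would reproduce the LLM-driven behavior.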

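# **📚 Step 3: Group Pages into Logical Documents**

A page whose `page_in_doc` is 0 marks the start of a new logical document. The loop closes the current document at each such boundary and records its start and end pages.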

In [7]:
logical_docs = []
current_doc = {"text": "", "doc_type": None, "page_start": 0}

for page in metadata_store:
    if page["page_in_doc"] == 0 and current_doc["text"]:
        current_doc["page_end"] = page["page"] - 1
        logical_docs.append(current_doc)
        current_doc = {"text": "", "doc_type": None, "page_start": page["page"]}

    current_doc["text"] += "\n\n" + page["text"]
    current_doc["doc_type"] = page["doc_type"]

# Append last document
current_doc["page_end"] = metadata_store[-1]["page"]
logical_docs.append(current_doc)
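
The grouping step can be illustrated on a small hand-made input. The sketch below is an alternative formulation written for illustration, not the notebook's exact code: the function name `group_pages` is an assumption, and it joins page texts without the leading blank lines that string concatenation produces.

```python
def group_pages(metadata_store):
    """Group consecutive pages into logical documents using page_in_doc == 0 as the boundary."""
    groups = []
    for page in metadata_store:
        if page["page_in_doc"] == 0 or not groups:
            # Start a new logical document at this page
            groups.append({
                "doc_type": page["doc_type"],
                "page_start": page["page"],
                "page_end": page["page"],
                "texts": [],
            })
        groups[-1]["texts"].append(page["text"])
        groups[-1]["page_end"] = page["page"]
    return [
        {
            "doc_type": g["doc_type"],
            "page_start": g["page_start"],
            "page_end": g["page_end"],
            "text": "\n\n".join(g["texts"]),
        }
        for g in groups
    ]


toy_metadata = [
    {"page": 0, "text": "resume p1", "doc_type": "Resume", "page_in_doc": 0},
    {"page": 1, "text": "resume p2", "doc_type": "Resume", "page_in_doc": 1},
    {"page": 2, "text": "fees p1", "doc_type": "Lender Fee Sheet", "page_in_doc": 0},
]
toy_docs = group_pages(toy_metadata)
print([(d["doc_type"], d["page_start"], d["page_end"]) for d in toy_docs])
```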

# **✂️ Step 4: Chunk Each Logical Document with Metadata**

Split each logical document into fixed-size word chunks and attach the document type and page range to every chunk as metadata for the vector database.

In [8]:
from llama_index.core.schema import Document

def chunk_text(text, chunk_size=500):
    """Split text into consecutive chunks of at most chunk_size whitespace-separated words."""
    words = text.split()
    return [' '.join(words[i:i+chunk_size]) for i in range(0, len(words), chunk_size)]

chunked_documents = []

for idx, doc in enumerate(logical_docs):
    chunks = chunk_text(doc["text"])
    for chunk_idx, chunk in enumerate(chunks):
        chunked_documents.append(
            Document(
                text=chunk,
                metadata={
                    "doc_type": doc["doc_type"],
                    "chunk_index": chunk_idx,
                    "page_start": doc["page_start"],
                    "page_end": doc["page_end"],
                    "source_file": "Blob File Sample.pdf"  # replace with your document name
                }
            )
        )
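
Fixed-size chunks can split a sentence or a fee line across two chunks. A common variation is to let neighboring chunks share a few words. The sketch below shows that idea; `chunk_text_with_overlap` and its `overlap` parameter are assumptions for illustration and are not used elsewhere in this notebook.

```python
def chunk_text_with_overlap(text, chunk_size=500, overlap=50):
    """Split text into word chunks where each chunk repeats the last `overlap` words of the previous one."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be >= 0 and smaller than chunk_size")
    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # the last chunk already reaches the end of the text
    return chunks


toy_text = " ".join(f"w{i}" for i in range(10))
toy_chunks = chunk_text_with_overlap(toy_text, chunk_size=4, overlap=1)
print(toy_chunks)
```

With `overlap=0` the function produces the same chunks as a plain fixed-size split.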


# **🧠 Step 5: Embed and Store in Vector DB with Metadata**

Embed the chunks and store them in a vector index. The metadata attached in Step 4 is stored with each chunk, which makes filtered retrieval possible in Step 6.

In [9]:
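from llama_index.embeddings.huggingface import HuggingFaceEmbedding
from llama_index.core import VectorStoreIndex

# Note: this rebinds embed_model, replacing the BAAI/bge-small-en-v1.5 model loaded in Step 2
embed_model = HuggingFaceEmbedding(model_name="sentence-transformers/all-MiniLM-L6-v2")
# Embed every chunk and build the vector index; chunk metadata is kept for filtering
index = VectorStoreIndex.from_documents(chunked_documents, embed_model=embed_model)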

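# **🎯 Step 6: Predict Query Routing and Retrieve with Metadata Filter**

The LLM first predicts which document type is most likely to contain the answer. Retrieval is then restricted to chunks whose `doc_type` metadata matches that prediction.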

In [10]:
def predict_doc_type_for_query(query):
    """Ask the LLM which document type is most likely to contain the answer."""
    prompt = f"""
    You are an intelligent assistant that routes user queries to the most relevant document.

    User query: "{query}"

    Which document type is most likely to contain the answer?
    Choose ONLY ONE from: Resume, Contract, Lender Fee Sheet, ID, PaySlip, Other.
    """

    response = gemini_model(prompt)
    # gemini_model returns None on failure; fall back to "Other" in that case
    return (response or "Other").strip()
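
If the predicted type does not exist in the index (for instance, the model answers `Contract` but the PDF holds only a resume and a fee sheet), an equality filter returns nothing. One option is to check the prediction against the types actually stored and skip the filter when there is no match. The helper `resolve_filter_value` is an assumption for this sketch.

```python
def resolve_filter_value(predicted_type, available_types):
    """Return the stored label matching the prediction, or None to signal 'do not filter'."""
    lookup = {t.lower(): t for t in available_types}
    key = (predicted_type or "").strip().rstrip(".").lower()
    return lookup.get(key)


# In the notebook this set could be built as {doc["doc_type"] for doc in logical_docs}
available = {"Resume", "Lender Fee Sheet"}
print(resolve_filter_value("lender fee sheet.", available))
print(resolve_filter_value("Contract", available))
```

When the helper returns `None`, calling `index.as_retriever()` without `filters` searches all chunks instead of returning an empty result.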

In [11]:
from llama_index.core.vector_stores import MetadataFilters, FilterOperator, MetadataFilter

# Get doc_type filter
user_query = "What is the total lender fee?"
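# Normalize the predicted doc type the same way classify_document_type does,
# so it matches the stored metadata (lowercase, drop periods, then Title Case)
predicted_doc_type = predict_doc_type_for_query(user_query).lower().replace(".", "").title()
print(predicted_doc_type)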
# Retrieve
retriever = index.as_retriever(
    filters=MetadataFilters(
        filters=[
            MetadataFilter(key="doc_type", value=predicted_doc_type, operator=FilterOperator.EQ)
        ]
    )
)
results = retriever.retrieve(user_query)

for res in results:
    print(f"[{res.metadata['doc_type']} - p{res.metadata['page_start']}–{res.metadata['page_end']}]")
    print(res.text[:200])
    print("-" * 50)

Lender Fee Sheet
[Lender Fee Sheet - p1–1]
Your actual rate, payment, and cost could be higher. Get an official Loan Estimate before choosing a loan. Fee Details and Summary Applicants: Application No: Date Prepared: Loan Program:Prepared By: 
--------------------------------------------------
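
Conceptually, metadata-filtered retrieval narrows the candidate set before ranking by vector similarity. The toy sketch below illustrates that idea in plain Python with made-up two-dimensional vectors. It is not how LlamaIndex implements filtering internally, and `filtered_top_k` and the record layout are assumptions for this example.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def filtered_top_k(query_vec, records, doc_type, k=2):
    """Keep only records of the given doc_type, then rank them by similarity to the query."""
    candidates = [r for r in records if r["doc_type"] == doc_type]
    ranked = sorted(
        candidates,
        key=lambda r: cosine_similarity(query_vec, r["vector"]),
        reverse=True,
    )
    return ranked[:k]


toy_records = [
    {"id": "resume-0", "doc_type": "Resume", "vector": [0.9, 0.1]},
    {"id": "fees-0", "doc_type": "Lender Fee Sheet", "vector": [0.2, 0.8]},
    {"id": "fees-1", "doc_type": "Lender Fee Sheet", "vector": [0.6, 0.4]},
]
hits = filtered_top_k([1.0, 0.0], toy_records, "Lender Fee Sheet")
print([h["id"] for h in hits])
```

The resume chunk is the closest vector to the query here, yet it is excluded because the filter removes it before ranking.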


In [12]:
# Extract the answer from the retrieved text
if results:
    # We'll take the text from the first and most relevant result
    retrieved_text = results[0].text

    # Create a new prompt to ask the model for the specific answer
    prompt = f"""
    Based on the following text, please answer the user's question.

    Text:
    {retrieved_text}

    User Question: {user_query}

    Provide only the specific value or amount as the answer.
    """

    # Get the answer from the model
    answer = gemini_model(prompt)

    print(f"The answer to your question '{user_query}' is: {answer}")
else:
    print("Could not find an answer as no relevant documents were retrieved.")

The answer to your question 'What is the total lender fee?' is: $1,070.00
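
The cell above passes only the first retrieved chunk to the model. When the answer might span several chunks, one variation is to concatenate the top results up to a size budget. The sketch below assumes a hypothetical helper `build_answer_prompt` and an arbitrary `max_chars` limit; the prompt wording mirrors the one used in this notebook.

```python
def build_answer_prompt(chunks, query, max_chars=4000):
    """Build an answer prompt from as many chunks as fit within max_chars (always at least one)."""
    selected = []
    used = 0
    for i, chunk in enumerate(chunks, start=1):
        block = f"[Chunk {i}]\n{chunk.strip()}"
        if selected and used + len(block) > max_chars:
            break  # budget exhausted; keep what we have
        selected.append(block)
        used += len(block)
    context = "\n\n".join(selected)
    return (
        "Based on the following text, please answer the user's question.\n\n"
        f"Text:\n{context}\n\n"
        f"User Question: {query}\n\n"
        "Provide only the specific value or amount as the answer."
    )


toy_prompt = build_answer_prompt(["a" * 10, "b" * 10, "c" * 10], "What is the total lender fee?", max_chars=40)
print(toy_prompt)
```

In the notebook, the call could look like `build_answer_prompt([r.text for r in results], user_query)`, with the returned string passed to `gemini_model`.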

