<a href="https://colab.research.google.com/github/LashawnFofung/RAG-Pipelines/blob/main/src/Task_Designing_a_Page_Level_Detection_Strategy_Using_RAG.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **🤖 Designing a Page-Level Detection Strategy Using RAG**


This notebook implements a foundational document processing pipeline:
<br><br>

**Extraction → Classification → Separation → Structured Extraction.**
<br><br>

The code extracts text page by page, then uses a Gemini large language model (LLM) to separate the page stream into individual documents and classify each one, as groundwork for RAG (Retrieval-Augmented Generation) tasks.
<br><br>

**Notebook Structure**
- [Step 1: Extract Page-Level Content from PDF](#scrollTo=lPuFFd1y8piI)
- [Step 2: Configure the Gemini Model](#scrollTo=Y7__F3sr8t0R)
- [Step 3: Write the "Same Document?" Function with RAG](#scrollTo=EtJ7ob2U80OR)
- [Step 4: Write the Document Type Classifier](#scrollTo=7Ylcu8Gy87Wu)
- [Step 5: Loop Through Pages and Generate Page-Level Metadata](#scrollTo=4Of5KI4V9B_F)
- [Step 6: Visualize Results](#scrollTo=9_jBK1HX9JVK)




# **Step 1: Extract Page-Level Content from PDF**

This step uses the PyPDF2 library to extract raw text page by page from a PDF stored at a fixed path in the Colab runtime.



In [1]:
# Install PyPDF2 library for reading and extracting text from PDFs
!pip install PyPDF2









In [2]:
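from PyPDF2 import PdfReader

# Read the PDF and extract the raw text of each page
reader = PdfReader("/Blob File Sample.pdf")
pages = [page.extract_text() or "" for page in reader.pages]

# Keep the zero-based page index alongside each page's text
doc_pages = [{"page_num": i, "text": p} for i, p in enumerate(pages)]
doc_pages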

[{'page_num': 0,
  'text': 'Functional Resume Sample \n \nJohn W. Smith   \n2002 Front Range Way Fort Collins, CO 80525  \njwsmith@colostate.edu  \n \nCareer Summary \n \nFour years experience in early childhood development with a di verse background in the care of \nspecial needs children and adults.  \n  \nAdult Care Experience  \n \n• Determined work placement for 150 special needs adult clients.  \n• Maintained client databases and records.  \n• Coordinated client contact with local health care professionals on a monthly basis.     \n• Managed 25 volunteer workers.     \n \nChildcare Experience  \n \n• Coordinated service assignments for 20 part -time counselors and 100 client families. \n• Oversaw daily activity and outing planning for 100 clients.  \n• Assisted families of special needs clients with researching financial assistance and \nhealthcare. \n• Assisted teachers with managing daily classroom activities.    \n• Oversaw daily and special st udent activiti
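The extracted text above carries stray spaces and blank lines from the PDF layout. The following is a minimal sketch of a cleanup pass that could run before the pages reach the LLM; `build_doc_pages` and the `is_empty` field are hypothetical names introduced for illustration and are not part of the notebook.

```python
import re

def build_doc_pages(page_texts):
    """Turn raw per-page strings into page records with cleaned text.

    Hypothetical helper: collapses runs of spaces and tabs, drops blank
    lines, and flags pages that contain no text (for example scanned images).
    """
    records = []
    for i, raw in enumerate(page_texts):
        text = raw or ""  # treat a missing extraction result as empty text
        lines = [re.sub(r"[ \t]+", " ", line).strip() for line in text.splitlines()]
        cleaned = "\n".join(line for line in lines if line)
        records.append({"page_num": i, "text": cleaned, "is_empty": not cleaned})
    return records

# Small illustrative input; the first string is a prefix of the output above
sample = ["Functional Resume Sample \n \nJohn W. Smith   \n", None, "   "]
pages_clean = build_doc_pages(sample)
```

Flagging empty pages separately makes it possible to skip the LLM calls for them, although how such pages should be assigned to a document is a design decision this sketch leaves open.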

# **Step 2: Configure the Gemini Model**

This step loads the Gemini API key from Colab Secrets and exposes it to the SDK through the `GOOGLE_API_KEY` environment variable.

In [9]:
# Import required modules
import os
from google.colab import userdata # Colab utility for accessing secrets

# --- SECURE API KEY SETUP ---

# 1. Gemini API Key setup in Colab secret
try:
    # Attempt to retrieve the Gemini API key from Colab Secrets
    API_KEY = userdata.get('GEMINI_API_KEY')
    if not API_KEY:
        raise ValueError("GEMINI_API_KEY not found in Colab Secrets. Please set it.")

    # Set the official environment variable name required by the Google GenAI SDK
    os.environ["GOOGLE_API_KEY"] = API_KEY
    print("✅ API Key successfully loaded and set as GOOGLE_API_KEY.")


except Exception as e:
    print(f"⚠️ Warning: Could not load API Key from Colab Secrets: {e}")
    print("Please ensure your API Key is set as a Colab secret named 'GEMINI_API_KEY'.")



✅ API Key successfully loaded and set as GOOGLE_API_KEY.
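The cell above only works inside Colab. As a rough sketch of how the same lookup could fall back to ordinary environment variables when the notebook runs elsewhere, the helper below accepts any secret getter; `resolve_api_key` and its parameters are hypothetical names, not part of the notebook or of any SDK.

```python
import os

def resolve_api_key(secret_getter=None, env=None, name="GEMINI_API_KEY"):
    """Return an API key from a secret store, falling back to the environment.

    Hypothetical helper. `secret_getter` is any callable that maps a secret
    name to its value (in Colab, `userdata.get` could be passed in).
    """
    env = os.environ if env is None else env
    if secret_getter is not None:
        try:
            value = secret_getter(name)
            if value:
                return value
        except Exception:
            pass  # secret store unavailable; try the environment instead
    value = env.get(name) or env.get("GOOGLE_API_KEY")
    if not value:
        raise ValueError(f"{name} not found in the secret store or the environment.")
    return value

# Example with an in-memory environment instead of real secrets
key = resolve_api_key(secret_getter=None, env={"GEMINI_API_KEY": "demo-key"})
```

Passing the getter as an argument keeps the function importable outside Colab, where `google.colab` does not exist.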


# **Step 3: Write the "Same Document?" Function with RAG**

This function asks the LLM whether two pages belong to the same document. A "No" answer for two consecutive pages marks the boundary where a new document starts.

In [4]:
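import google.generativeai as genai

def gemini_model(prompt):
    """Send a prompt to Gemini and return the response text."""
    # Relies on the GOOGLE_API_KEY environment variable set in Step 2
    model = genai.GenerativeModel("gemini-3-pro-preview")
    response = model.generate_content(prompt)

    return response.text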
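Every page triggers at least one model call, so a transient API error can interrupt the whole run. The sketch below shows one possible way to wrap any prompt-to-text callable with retries and exponential backoff; `call_with_retries` and its parameters are hypothetical and are not something the notebook or the Gemini SDK provides.

```python
import time

def call_with_retries(llm_call, prompt, attempts=3, base_delay=1.0, sleep=time.sleep):
    """Call `llm_call(prompt)`, retrying on any exception.

    Hypothetical helper. Waits base_delay, then 2 * base_delay, and so on
    between attempts. `sleep` is injectable so the wait can be skipped in tests.
    """
    last_error = None
    for attempt in range(attempts):
        try:
            return llm_call(prompt)
        except Exception as exc:
            last_error = exc
            if attempt < attempts - 1:
                sleep(base_delay * (2 ** attempt))
    raise RuntimeError(f"LLM call failed after {attempts} attempts") from last_error

# Demonstration with a stand-in model that fails twice before answering
calls = {"count": 0}
waits = []

def flaky_model(prompt):
    calls["count"] += 1
    if calls["count"] < 3:
        raise ConnectionError("temporary failure")
    return "Yes"

answer = call_with_retries(flaky_model, "Same document?", sleep=waits.append)
```

In the notebook, `gemini_model` could be passed as `llm_call`; whether retrying on every exception type is appropriate depends on the errors the API actually raises.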


In [5]:
def is_same_document(prev_text, curr_text, doc_type=None):
    prompt = f"""
    You are checking whether two pages belong to the same document.
    Previous page type: {doc_type or 'unknown'}

    Previous Page:
    {prev_text}

    Current Page:
    {curr_text}

    Answer ONLY 'Yes' or 'No'. Do NOT explain.
    """
    response = gemini_model(prompt)  # Ask the LLM; any prompt-to-text function can be swapped in here
    return response.strip().lower().startswith("yes")


# Quick check on two non-adjacent pages that are expected to come from different documents
prev_text = doc_pages[2]["text"]
curr_text = doc_pages[0]["text"]
doc_type = "Resume"  # Optional, can be "unknown" or None

is_same_document(prev_text, curr_text, doc_type)


False
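`is_same_document` treats any reply that does not start with "yes" as a "No", so an unexpected reply silently creates a document boundary. A minimal sketch of a stricter parser that separates "No" from "unclear" is shown below; `parse_yes_no` is a hypothetical helper, not part of the notebook.

```python
import re

def parse_yes_no(response):
    """Map an LLM reply to True (yes), False (no), or None (unclear).

    Hypothetical helper. Leading punctuation or quotes are ignored, and the
    answer must be a whole word, so "Not sure" is reported as unclear.
    """
    match = re.match(r"\W*(yes|no)\b", (response or "").strip().lower())
    if match is None:
        return None
    return match.group(1) == "yes"

examples = ["Yes.", " 'No' ", "Not sure", ""]
parsed = [parse_yes_no(e) for e in examples]
```

A caller could retry the prompt or log the page pair when the result is `None`, rather than starting a new document by default.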

# **Step 4: Write the Document Type Classifier**

This function is called when a new document is detected to categorize its type.

In [6]:
def classify_document_type(text):
    prompt = f"""
    This is the start of a new document. Based on the content, classify it.

    Page Content:
    {text}

    Choose from: Resume, Contract, Lender Fee Sheet, ID, PaySlip, Other.
    Just respond with the type.
    """
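    # Normalize the reply: trim whitespace, drop periods, and title-case each word.
    # Note: title() yields "Payslip" and "Id" rather than the prompt's "PaySlip" and "ID".
    response = gemini_model(prompt).strip().lower().replace(".", "")
    result = response.title()
    return result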

classify_document_type(doc_pages[3]["text"])

'Payslip'
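The classifier returns whatever string the model produces after title-casing, which is why the output above reads `'Payslip'` rather than the `PaySlip` label offered in the prompt. One possible way to map replies back onto the prompt's label list is sketched below; `ALLOWED_TYPES` mirrors the list in the prompt, while `normalize_doc_type` is a hypothetical helper.

```python
# Labels copied from the classification prompt
ALLOWED_TYPES = ["Resume", "Contract", "Lender Fee Sheet", "ID", "PaySlip", "Other"]

def _key(label):
    """Lowercase a label and keep only letters and digits for comparison."""
    return "".join(ch for ch in label.lower() if ch.isalnum())

def normalize_doc_type(raw, allowed=ALLOWED_TYPES, fallback="Other"):
    """Return the canonical label matching `raw`, or `fallback` if none matches.

    Hypothetical helper: matching ignores case, spaces, and punctuation.
    """
    lookup = {_key(label): label for label in allowed}
    return lookup.get(_key(raw or ""), fallback)

labels = [normalize_doc_type(r) for r in ["payslip.", "Id", " lender fee sheet ", "Invoice"]]
```

Replies outside the list fall back to `Other` here, which keeps downstream grouping predictable at the cost of hiding unexpected model output.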

# **Step 5: Loop Through Pages and Generate Page-Level Metadata**

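The loop visits every page and, for each one:
- Checks whether the page belongs to the same document as the previous page
- Assigns a document ID, incrementing it when a new document starts
- Classifies the document type at the start of each new document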


In [7]:
results = []
current_doc_type = None
doc_counter = 0

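for i, page in enumerate(doc_pages):
    if i == 0:
        # The first page always starts the first document
        current_doc_type = classify_document_type(page["text"])
    else:
        # Compare each page with the one immediately before it
        prev_text = doc_pages[i - 1]["text"]
        same = is_same_document(prev_text, page["text"], current_doc_type)
        if not same:
            # Boundary detected: start a new document and classify it
            doc_counter += 1
            current_doc_type = classify_document_type(page["text"])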

    results.append({
        "page": i,
        "doc_id": doc_counter,
        "doc_type": current_doc_type
    })


for r in results:
    print(r)

{'page': 0, 'doc_id': 0, 'doc_type': 'Resume'}
{'page': 1, 'doc_id': 1, 'doc_type': 'Lender Fee Sheet'}
{'page': 2, 'doc_id': 2, 'doc_type': 'Payslip'}
{'page': 3, 'doc_id': 3, 'doc_type': 'Payslip'}
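The page-level records can be rolled up into one entry per document, which is the form a later separation or extraction step would likely consume. The sketch below assumes records shaped like the ones printed above; `group_pages` is a hypothetical helper and `sample_results` is made-up data chosen to include a two-page document.

```python
from itertools import groupby

def group_pages(results):
    """Collapse page-level records into one record per document.

    Hypothetical helper. Assumes `results` is ordered by page, so pages
    of the same document are adjacent and share a doc_id.
    """
    documents = []
    for doc_id, rows in groupby(results, key=lambda r: r["doc_id"]):
        rows = list(rows)
        documents.append({
            "doc_id": doc_id,
            "doc_type": rows[0]["doc_type"],
            "pages": [r["page"] for r in rows],
        })
    return documents

# Illustrative data only, not the notebook's actual output
sample_results = [
    {"page": 0, "doc_id": 0, "doc_type": "Resume"},
    {"page": 1, "doc_id": 0, "doc_type": "Resume"},
    {"page": 2, "doc_id": 1, "doc_type": "Payslip"},
]
documents = group_pages(sample_results)
```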


# **Step 6: Visualize Results**


In [8]:
import pandas as pd

df = pd.DataFrame(results)
df.head()

|   | page | doc_id | doc_type |
|---|------|--------|----------|
| 0 | 0 | 0 | Resume |
| 1 | 1 | 1 | Lender Fee Sheet |
| 2 | 2 | 2 | Payslip |
| 3 | 3 | 3 | Payslip |
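Beyond listing one row per page, the same DataFrame can be summarized per document. The following sketch rebuilds the DataFrame from the rows shown above and uses pandas named aggregation to report each document's type, page range, and page count; the column names `first_page`, `last_page`, and `n_pages` are chosen here for illustration.

```python
import pandas as pd

# Rows copied from the page-level results shown above
results = [
    {"page": 0, "doc_id": 0, "doc_type": "Resume"},
    {"page": 1, "doc_id": 1, "doc_type": "Lender Fee Sheet"},
    {"page": 2, "doc_id": 2, "doc_type": "Payslip"},
    {"page": 3, "doc_id": 3, "doc_type": "Payslip"},
]
df = pd.DataFrame(results)

# One row per detected document
summary = (
    df.groupby("doc_id")
      .agg(
          doc_type=("doc_type", "first"),
          first_page=("page", "min"),
          last_page=("page", "max"),
          n_pages=("page", "size"),
      )
      .reset_index()
)
print(summary)
```

With this sample every document spans a single page, so the summary has as many rows as the page-level table; multi-page documents would collapse into one row each.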
