<a href="https://colab.research.google.com/github/RDGopal/IB9AU-2026/blob/main/RAG_4_Multimodal_Chunking_Gemini.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

This notebook demonstrates a **Page-Wise Multimodal Retrieval-Augmented Generation (RAG)** system built with Gemini Flash and LlamaIndex. Even very large context windows fall short for massive documents or document collections, so the approach switches from text-chunk retrieval to **visual page retrieval**.

### The Challenge: Beyond Long Context
Traditional RAG systems retrieve text chunks. Large Language Models (LLMs) now offer very large context windows (Gemini's reaches 2 million tokens), but real-world enterprise workloads often span thousands of pages (e.g., all S&P 500 reports). A corpus of that size does not fit in even the largest window, so a retrieval step must first select what the model sees.

### The Solution: Page-Wise Visual Retrieval
Instead of sending just text, this approach extracts and sends *entire relevant pages* (including their visual layout, charts, and images) to a Vision Model. This allows the model to 'see' the information exactly as a human would, leading to more accurate and context-rich responses, especially for questions requiring interpretation of visual data.

### Architecture Overview
1.  **Index**: We create a standard text index (using LlamaIndex) but crucially store the **page number as metadata** for each text chunk.
2.  **Search**: A vector similarity search over the text index identifies the text snippets most relevant to the query.
3.  **Router**: Based on the metadata, the system identifies the exact page(s) where the relevant information is located.
4.  **Extract & Solve**: The identified page is then extracted as a mini-PDF (preserving its original visual integrity) and sent exclusively to a multimodal LLM (Gemini Flash) along with the user's query.
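
The router step (3) can be illustrated in isolation. The following is a minimal sketch, not the notebook's implementation: `pages_from_hits` and the `hits` structure are hypothetical stand-ins for retrieved nodes, assuming each hit carries a similarity score and a 1-based `page_label` string in its metadata.

```python
from typing import Dict, List


def pages_from_hits(hits: List[Dict], max_pages: int = 3) -> List[int]:
    """Map retrieved chunks to unique zero-based page indices, best score first."""
    ranked = sorted(hits, key=lambda h: h["score"], reverse=True)
    pages: List[int] = []
    for hit in ranked:
        label = hit["metadata"].get("page_label")
        if label is None or not str(label).isdecimal():
            continue  # skip chunks without a usable numeric page label
        index = int(label) - 1  # labels are 1-based, page indices are 0-based
        if index not in pages:
            pages.append(index)
        if len(pages) == max_pages:
            break
    return pages


# Hypothetical hits: two chunks from page 7, one from page 4, one without metadata.
hits = [
    {"score": 0.71, "metadata": {"page_label": "4"}},
    {"score": 0.83, "metadata": {"page_label": "7"}},
    {"score": 0.65, "metadata": {"page_label": "7"}},
    {"score": 0.40, "metadata": {}},
]
print(pages_from_hits(hits))  # [6, 3]
```

Deduplicating by page matters once more than one chunk is retrieved, because several chunks often point at the same page.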

This notebook walks through the setup, data indexing, and query process for this advanced RAG architecture.

First, we need to install all the necessary Python libraries for this project. This includes `google-genai` for interacting with the Gemini API, `llama-index-core` and related LlamaIndex packages for indexing and retrieval, and `pypdf` for PDF manipulation.

In [None]:
# Install the Google GenAI SDK for the multimodal part
!pip install -qU google-genai

# Install LlamaIndex & Dependencies
!pip install -qU llama-index-core llama-index-readers-file llama-index-llms-gemini llama-index-embeddings-gemini pypdf

# Fix jedi version conflict flagged by pip's dependency resolver
# Quote the requirement so the shell does not treat ">" as an output redirect
!pip install -qU "jedi>=0.16"

print("\u2705 Installation Complete.")

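Next, we import the required modules and define `GenAIEmbedding`, a custom LlamaIndex embedding class that calls the Gemini embedding API through the `google-genai` client and throttles requests to stay under the per-minute quota. This sets up our environment for building the index, configuring the LLM and embedding models, and handling PDF files.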

In [None]:
# @title 1. Imports & Custom Embedding Class
import os
import time
from google.colab import userdata
from llama_index.core import SimpleDirectoryReader, VectorStoreIndex, Settings
from llama_index.core.embeddings import BaseEmbedding
from llama_index.llms.gemini import Gemini
from google import genai
from typing import List
from pydantic import PrivateAttr

class GenAIEmbedding(BaseEmbedding):
    _client: genai.Client = PrivateAttr()
    _model_name: str = PrivateAttr()
    _requests_per_minute: int = PrivateAttr()
    _request_count: int = PrivateAttr()
    _window_start: float = PrivateAttr()

    def __init__(self, model_name: str = "gemini-embedding-001", requests_per_minute: int = 10, **kwargs):
        super().__init__(**kwargs)
        self._client = genai.Client(api_key=os.environ["GOOGLE_API_KEY"])
        self._model_name = model_name
        self._requests_per_minute = requests_per_minute
        self._request_count = 0
        self._window_start = time.time()

    def _rate_limit(self):
        """Simple rate limiter: pause if we've hit the per-minute request cap."""
        elapsed = time.time() - self._window_start
        if elapsed >= 60:
            # The 60-second window expired on its own: start a new one
            self._request_count = 0
            self._window_start = time.time()
        elif self._request_count >= self._requests_per_minute:
            # Cap reached: wait out the rest of the 60-second window
            sleep_time = 60 - elapsed
            print(f"   \u23f3 Rate limit: sleeping {sleep_time:.1f}s...")
            time.sleep(sleep_time)
            # Reset window
            self._request_count = 0
            self._window_start = time.time()
        self._request_count += 1

    def _get_embedding(self, text: str) -> List[float]:
        self._rate_limit()
        result = self._client.models.embed_content(
            model=self._model_name,
            contents=text
        )
        return result.embeddings[0].values

    def _get_text_embedding(self, text: str) -> List[float]:
        return self._get_embedding(text)

    def _get_query_embedding(self, query: str) -> List[float]:
        return self._get_embedding(query)

    async def _aget_query_embedding(self, query: str) -> List[float]:
        return self._get_embedding(query)

    async def _aget_text_embedding(self, text: str) -> List[float]:
        return self._get_embedding(text)

print("\u2705 Imports and GenAIEmbedding class defined.")
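
The throttling idea inside `GenAIEmbedding` is a fixed-window rate limiter. As a standalone sketch (the class name `FixedWindowLimiter`, its parameters, and the `FakeClock` helper are all hypothetical and not part of LlamaIndex or the Gemini SDK), the same logic can be written with an injectable clock so it can be exercised without real waiting:

```python
import time
from typing import Callable


class FixedWindowLimiter:
    """Allow at most `max_calls` per `window_seconds`; block until the window rolls over."""

    def __init__(self, max_calls: int, window_seconds: float = 60.0,
                 clock: Callable[[], float] = time.monotonic,
                 sleep: Callable[[float], None] = time.sleep):
        self.max_calls = max_calls
        self.window_seconds = window_seconds
        self._clock = clock
        self._sleep = sleep
        self._window_start = clock()
        self._count = 0

    def acquire(self) -> float:
        """Reserve one call slot; return the number of seconds slept (0.0 if none)."""
        now = self._clock()
        if now - self._window_start >= self.window_seconds:
            # The previous window expired on its own: start a fresh one.
            self._window_start, self._count = now, 0
        waited = 0.0
        if self._count >= self.max_calls:
            # Budget used up: wait out the remainder of the window.
            waited = self.window_seconds - (now - self._window_start)
            self._sleep(waited)
            self._window_start, self._count = self._clock(), 0
        self._count += 1
        return waited


class FakeClock:
    """Test double: 'sleeping' just advances the clock."""

    def __init__(self) -> None:
        self.now = 0.0

    def __call__(self) -> float:
        return self.now

    def sleep(self, seconds: float) -> None:
        self.now += seconds


fake = FakeClock()
limiter = FixedWindowLimiter(max_calls=2, clock=fake, sleep=fake.sleep)
waits = [limiter.acquire() for _ in range(3)]
print(waits)  # [0.0, 0.0, 60.0]
```

With the default `clock` and `sleep` arguments the limiter blocks for real, so calling `acquire()` before each API request would reproduce the throttling behaviour.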

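To call Gemini through both the `google-genai` SDK and LlamaIndex, you need a Google API key. The cell below reads it from a Colab secret named `GEMINI_API_KEY` and exports it as the `GOOGLE_API_KEY` environment variable, so the key never appears in the notebook itself.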

In [None]:
# @title 2. Set API Key
from google.colab import userdata
import os

# Set API key in environment for LlamaIndex to pick up automatically
os.environ["GOOGLE_API_KEY"] = userdata.get('GEMINI_API_KEY')
print("\u2705 API Key loaded.")

Here, we configure LlamaIndex to use Gemini for both language generation (the `Gemini` LLM class) and embeddings (the custom `GenAIEmbedding` class). Assigning them to the global `Settings` object makes every subsequent LlamaIndex operation use them.

In [None]:
# @title 3. Configure LlamaIndex with Gemini Flash + Rate-Limited Embeddings
# `model` is the current keyword; `model_name` is kept only as a deprecated alias
Settings.llm = Gemini(model="models/gemini-2.0-flash")

# requests_per_minute=10 is safe for the free tier (limit is ~15 RPM).
# If you're on a paid plan, you can raise this to 50 or higher.
Settings.embed_model = GenAIEmbedding(model_name="gemini-embedding-001", requests_per_minute=10)
print("\u2705 LlamaIndex configured.")

Now, we load the earnings report PDF (the file `GS-2024-q4-earnings.pdf` must be present in the working directory). `SimpleDirectoryReader` splits the PDF into pages and returns one `Document` per page, each carrying a `page_label` in its metadata. The page-wise retrieval strategy depends on that label.

In [None]:
# @title 4. Load & Index Data
print("\U0001f4da Loading Goldman Sachs Report (This automatically separates pages)...")
# SimpleDirectoryReader reads the file and creates a 'Document' for every page
documents = SimpleDirectoryReader(input_files=["GS-2024-q4-earnings.pdf"]).load_data()

print(f"   Loaded {len(documents)} pages.")
print(f"   Sample Metadata: {documents[0].metadata}")
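
Page labels are strings and need not be plain numbers (front matter is sometimes labelled `i`, `ii`, and so on). The sketch below shows one way to translate a label into a physical page position. It assumes the loader returns exactly one document per page in page order, and the helper names `build_label_lookup` and `resolve_page_index` are hypothetical:

```python
from typing import Dict, List, Optional


def build_label_lookup(metadatas: List[Dict]) -> Dict[str, int]:
    """Map each page_label to the zero-based position of its document in load order."""
    lookup: Dict[str, int] = {}
    for position, meta in enumerate(metadatas):
        label = meta.get("page_label")
        if label is None:
            continue
        lookup.setdefault(str(label), position)  # keep the first occurrence
    return lookup


def resolve_page_index(label: Optional[str], lookup: Dict[str, int]) -> Optional[int]:
    """Return the page position for a label, or None if the label is unknown."""
    if label is None:
        return None
    return lookup.get(str(label))


# Made-up metadata for a PDF with two roman-numbered pages before page "1".
metadatas = [{"page_label": "i"}, {"page_label": "ii"},
             {"page_label": "1"}, {"page_label": "2"}]
lookup = build_label_lookup(metadatas)
print(resolve_page_index("1", lookup))  # 2
```

In this notebook the input would be `[d.metadata for d in documents]`. For a PDF whose labels are simply 1, 2, 3, ... the lookup reduces to `int(label) - 1`.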

In [None]:
# @title 5. Build Vector Index
print("\U0001f9e0 Building Vector Index...")
index = VectorStoreIndex.from_documents(documents)
retriever = index.as_retriever(similarity_top_k=1)  # Retrieve the single best page

print("\u2705 Index Ready!")
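
Conceptually, `similarity_top_k=1` means "embed the query, score it against every stored page embedding, keep the best one". The following is a toy sketch of that ranking, assuming cosine similarity as the scoring function. The two-dimensional vectors are made up for illustration, and the helper names are hypothetical rather than LlamaIndex internals:

```python
import math
from typing import List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors (0.0 if either has zero length)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(query_vec: Sequence[float],
          page_vecs: List[Sequence[float]],
          k: int = 1) -> List[Tuple[int, float]]:
    """Return (page_position, score) pairs for the k most similar pages."""
    scored = [(i, cosine_similarity(query_vec, v)) for i, v in enumerate(page_vecs)]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Toy embeddings for three pages and one query.
page_vecs = [[1.0, 0.0], [0.6, 0.8], [0.0, 1.0]]
best = top_k([0.0, 2.0], page_vecs, k=1)
print(best)  # [(2, 1.0)]
```

Raising `k` returns more candidates, which is useful when an answer may be spread across several pages.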

This is the core of our **Visual RAG Orchestrator**. The `query_visual_rag` function performs the following steps:
1.  **Retrieve**: Uses LlamaIndex to find the most relevant text snippet.
2.  **Locate**: Extracts the page number metadata from the retrieved snippet.
3.  **Extract**: Uses `pypdf` to create a mini-PDF containing only the identified page.
4.  **Vision**: Uploads this mini-PDF to Gemini Flash, enabling the model to 'see' the entire page visually.
5.  **Generate**: Prompts Gemini Flash to answer the user's question based on the visual and textual content of that single page.

In [None]:
# @title 6. The Visual RAG Orchestrator
# Text parts are built with genai.types.Part(text=...).
# In newer google-genai SDK versions, Part.from_text() no longer accepts positional args.

from pypdf import PdfReader, PdfWriter
import time

# Initialise the google-genai client
client = genai.Client(api_key=os.environ["GOOGLE_API_KEY"])

def query_visual_rag(query_text):
    print(f"\n\U0001f50e Querying LlamaIndex for: '{query_text}'...")

    # 1. RETRIEVE: Ask LlamaIndex to find the best text match
    nodes = retriever.retrieve(query_text)

    if not nodes:
        return "\u274c No relevant information found in the index."

    # 2. LOCATE: Extract metadata from the best node
    best_node = nodes[0]
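    page_label = best_node.metadata.get('page_label')

    # page_label is a string and may be missing or non-numeric (e.g. "iv")
    if page_label and str(page_label).isdecimal():
        page_index = int(page_label) - 1
    else:
        print("\u26a0\ufe0f Page metadata missing or non-numeric, defaulting to Page 1")
        page_index = 0
        page_label = "1"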

    print(f"Found answer on Page {page_label} (Score: {best_node.score:.4f})")
    print(f"   Context Snippet: {best_node.text[:100]}...")

    # 3. EXTRACT: Slice that specific page as a mini-PDF
    reader = PdfReader("GS-2024-q4-earnings.pdf")
    writer = PdfWriter()
    writer.add_page(reader.pages[page_index])

    temp_filename = "temp_visual_context.pdf"
    with open(temp_filename, "wb") as f:
        writer.write(f)

    # 4. VISION: Upload the page PDF with the google-genai SDK
    print("\U0001f680 Sending Page Visuals to Gemini Flash...")
    with open(temp_filename, "rb") as f:
        upload_file = client.files.upload(
            file=f,
            config=genai.types.UploadFileConfig(mime_type="application/pdf")
        )

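    # Wait for processing, then make sure the upload did not fail
    while upload_file.state.name == "PROCESSING":
        time.sleep(1)
        upload_file = client.files.get(name=upload_file.name)
    if upload_file.state.name == "FAILED":
        raise RuntimeError("Gemini could not process the uploaded page PDF.")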

    # 5. GENERATE: Ask the model to look at the page
    prompt_text = (
        f"You are an expert financial analyst. You are looking at Page {page_label} "
        "of the GS Earnings Report. "
        "Answer the user's question based on the visual charts, tables, and text on this page. "
        "If the answer is in a chart, explicitly describe the visual trend."
    )

    # Build the request from explicit Part objects:
    # the instructions, the uploaded page PDF, then the user's question.
    response = client.models.generate_content(
        model="gemini-2.0-flash",
        contents=[
            genai.types.Part(text=prompt_text),
            genai.types.Part(
                file_data=genai.types.FileData(
                    file_uri=upload_file.uri,
                    mime_type="application/pdf"
                )
            ),
            genai.types.Part(text=query_text)
        ]
    )
    return response.text

print("\u2705 query_visual_rag() function defined and ready.")
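
The upload step polls the file state in an open-ended loop. A bounded variant is sketched below under stated assumptions: `wait_until_ready` is a hypothetical helper, `get_state` is any callable returning the current state name, and the state names `PROCESSING`, `ACTIVE`, and `FAILED` are assumed to match what the Files API reports.

```python
import time
from typing import Callable


def wait_until_ready(get_state: Callable[[], str],
                     timeout_s: float = 60.0,
                     poll_s: float = 1.0,
                     clock: Callable[[], float] = time.monotonic,
                     sleep: Callable[[float], None] = time.sleep) -> str:
    """Poll `get_state` until it leaves "PROCESSING", failing on timeout or error."""
    deadline = clock() + timeout_s
    state = get_state()
    while state == "PROCESSING":
        if clock() >= deadline:
            raise TimeoutError(f"still PROCESSING after {timeout_s}s")
        sleep(poll_s)
        state = get_state()
    if state != "ACTIVE":
        raise RuntimeError(f"upload ended in state {state}")
    return state


# Simulated state sequence instead of a real API call.
states = iter(["PROCESSING", "PROCESSING", "ACTIVE"])
print(wait_until_ready(lambda: next(states), poll_s=0.0))  # ACTIVE
```

Inside `query_visual_rag`, the callable could be `lambda: client.files.get(name=upload_file.name).state.name`, which is the same lookup the polling loop already performs.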

Let's test our Visual RAG system with a query that specifically benefits from visual information. This query asks about AUS by region, which requires interpreting a pie chart on the page.

In [None]:
# @title 7. Run Tests
from IPython.display import display, Markdown

# Test 1: AUS by Region
q1 = "Describe the AUS by region."
display(Markdown(f"### Q1: {q1}"))
display(Markdown(query_visual_rag(q1)))

This second test query challenges the model to extract structured information from a complex table. A text-only RAG pipeline might struggle with flattened table text, but with the page sent visually, Gemini Flash can read the table layout directly. Because the retriever returns a single page (`similarity_top_k=1`), the summary covers only that page.

In [None]:
# Test 2: Summarise financial results
q2 = "Summarize the financial results"
display(Markdown(f"### Q2: {q2}"))
display(Markdown(query_visual_rag(q2)))

Finally, we run an 'Audit Mode' query designed for verification. This prompt specifically asks for revenue figures and requires citing the exact page numbers, including whether the information was derived from a chart. This helps confirm the system's accuracy and ability to pinpoint information within the visual context.

In [None]:
# @title 8. Audit Mode (Verification)

audit_prompt = """
Extract the exact revenue figures for FY24/25.
For every figure you provide, cite the PAGE NUMBER where it appears.
If the number comes from a chart, state "Derived from Chart on Page X".
"""
display(Markdown(f"### Q: {audit_prompt}"))
display(Markdown(query_visual_rag(audit_prompt)))
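
The audit answer can also be checked mechanically. The sketch below is a hypothetical post-processing step, not part of the notebook's pipeline: it extracts every "Page N" citation from an answer and flags any page that was not actually sent to the model. The sample answer string is made up for illustration.

```python
import re
from typing import Set

# Matches citations such as "Page 4" or "page 12" (case-insensitive).
PAGE_CITATION = re.compile(r"\bpage\s+(\d+)\b", re.IGNORECASE)


def cited_pages(answer: str) -> Set[int]:
    """Return the set of page numbers cited in the answer text."""
    return {int(number) for number in PAGE_CITATION.findall(answer)}


def unsupported_citations(answer: str, pages_sent: Set[int]) -> Set[int]:
    """Return cited pages that were never shown to the model."""
    return cited_pages(answer) - pages_sent


# Made-up answer text; only page 1 was sent in this example.
answer = "Net revenues are reported on Page 1. Derived from Chart on Page 9."
print(sorted(unsupported_citations(answer, pages_sent={1})))  # [9]
```

Since `query_visual_rag` sends exactly one page per call, any citation of a different page is a signal that the model may be guessing rather than reading.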