<a href="https://colab.research.google.com/github/RDGopal/IB9AU-2026/blob/main/RAG_5_Multimodal_Chunking_OpenSourced.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

This notebook demonstrates a **Page-Wise Multimodal Retrieval Augmented Generation (RAG)** system using open-source HuggingFace models and LlamaIndex — no API keys or rate limits required.

### The Challenge: Beyond Long Context
Traditional RAG systems retrieve text chunks. Even with large context windows, processing thousands of pages (e.g., entire document collections) requires a smarter retrieval strategy.

### The Solution: Page-Wise Visual Retrieval
Instead of sending just text, this approach extracts and sends *entire relevant pages* as images to a Vision-Language Model (VLM). This allows the model to 'see' charts, tables, and layout exactly as a human analyst would.

### Architecture:
1. **Index** — LlamaIndex embeds each page's text with a local sentence-transformer model.
2. **Search** — Semantic search identifies the best matching page for a query.
3. **Render** — `pdf2image` converts that page to a PIL image.
4. **VLM** — Qwen2.5-VL-3B-Instruct reads the image and answers the question.
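
The search, render and VLM stages above can be sketched as one small function with pluggable parts. This is only an illustration of the control flow: `answer_from_best_page`, `word_overlap` and the lambda stand-ins are hypothetical names, and the word-overlap score is a toy substitute for embedding similarity, not what LlamaIndex does.

```python
from typing import Callable, List


def answer_from_best_page(
    query: str,
    page_texts: List[str],
    score: Callable[[str, str], float],
    render: Callable[[int], object],
    ask_vlm: Callable[[object, str], str],
) -> str:
    """Run search -> render -> VLM over pre-extracted page texts."""
    # Search: pick the page whose text scores highest against the query.
    best_index = max(range(len(page_texts)), key=lambda i: score(query, page_texts[i]))
    # Render: turn that page into an image (any object the VLM stage accepts).
    image = render(best_index)
    # VLM: answer the question from the rendered page.
    return ask_vlm(image, query)


def word_overlap(query: str, text: str) -> float:
    """Toy relevance score: number of lowercase words shared by query and text."""
    return float(len(set(query.lower().split()) & set(text.lower().split())))


# Toy stand-ins so the sketch runs without any models installed.
pages = ["net revenues by segment", "balance sheet and capital ratios"]
result = answer_from_best_page(
    "what are the capital ratios",
    pages,
    score=word_overlap,
    render=lambda i: f"<image of page {i + 1}>",
    ask_vlm=lambda image, q: f"answer to '{q}' read from {image}",
)
print(result)
```

In the notebook itself, the three callables correspond to the LlamaIndex retriever, `pdf2image`, and the Qwen VLM.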

> ⚠️ **IMPORTANT**: This notebook requires a **T4 GPU** runtime.  
> Go to **Runtime → Change runtime type → T4 GPU** before running any cells.  
> Running on CPU will cause generation to take 5–10 minutes per query.

## ⚠️ Step 1: Upgrade Pillow and Restart Runtime

Colab ships with an outdated version of Pillow that causes an `ImportError: cannot import name '_Ink'` when loading the HuggingFace embedding model. **You must upgrade Pillow first and then restart the runtime before running any other cells.**

**Instructions:**
1. Run the cell below.
2. Go to **Runtime → Restart session**.
3. Then continue running cells from Step 2 onwards — do **not** re-run this cell.
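
After the restart, one possible way to see which Pillow version is installed is to read the package metadata from the standard library. This is a minimal sketch; `installed_version` is a hypothetical helper, and it reports what is installed on disk, which can differ from what a not-yet-restarted session has already imported.

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Optional


def installed_version(package: str) -> Optional[str]:
    """Return the installed version string of a package, or None if absent."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


print("Pillow version:", installed_version("Pillow"))
```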

In [None]:
# Step 1: Upgrade Pillow FIRST — restart runtime after this cell
!pip install -qU Pillow
print("\u2705 Pillow upgraded. Now go to Runtime \u2192 Restart session, then continue from Step 2.")

## Step 2: Verify GPU and Install Dependencies

After restarting the runtime, run this cell. It first **verifies that a GPU is available**; if none is found, it raises an error and stops before any time is spent loading a 3B model onto the CPU.

Then it installs:
- `transformers` + `accelerate` — for loading Qwen2.5-VL
- `llama-index-core`, `llama-index-readers-file`, `llama-index-embeddings-huggingface` — for indexing
- `pdf2image` + `poppler-utils` — for rendering PDF pages as images
- `pypdf` — for reading PDF metadata

In [None]:
# FIX 1: Check GPU is available BEFORE loading the model.
# On CPU, Qwen2.5-VL-3B generates ~1-2 tokens/sec = 5-10 min per query.
# On T4 GPU, it generates ~30-50 tokens/sec = ~15 sec per query.
import torch

if torch.cuda.is_available():
    gpu_name = torch.cuda.get_device_name(0)
    vram_gb = torch.cuda.get_device_properties(0).total_memory / 1e9
    print(f"\u2705 GPU detected: {gpu_name} ({vram_gb:.1f} GB VRAM)")
else:
    print("\u274c NO GPU DETECTED!")
    print("   Go to Runtime \u2192 Change runtime type \u2192 T4 GPU")
    print("   Running on CPU will make each query take 5-10 minutes.")
    raise RuntimeError("GPU required. Please change runtime type to T4 GPU and re-run.")

!pip install -qU transformers accelerate
!pip install -qU llama-index-core llama-index-readers-file llama-index-embeddings-huggingface pypdf
!pip install -qU pdf2image
!apt-get install -y -q poppler-utils  # -y answers the confirmation prompt, which a notebook cell cannot do interactively

print("\u2705 Installation Complete.")

## Step 3: Load the Embedding Model

`all-MiniLM-L6-v2` is a compact sentence-transformer that converts text into 384-dimensional embeddings for semantic similarity search. It is small enough to run on CPU and requires no API key.

In [None]:
import os
from llama_index.core import SimpleDirectoryReader, VectorStoreIndex, Settings
from llama_index.embeddings.huggingface import HuggingFaceEmbedding

embed_model = HuggingFaceEmbedding(model_name="sentence-transformers/all-MiniLM-L6-v2")
print("\u2705 Embedding model loaded.")

## Step 4: Load the Vision-Language Model (VLM)

`Qwen2.5-VL-3B-Instruct` is a 3B open-source vision-language model. At `float16` it fits on a T4 GPU (~7.5 GB VRAM) and generates responses in ~15 seconds per query.

The key fix here is explicitly setting `device_map='cuda'` to ensure the model always loads onto the GPU.
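
The VRAM figure can be sanity-checked with simple arithmetic: weights take roughly (number of parameters) × (bytes per parameter). The sketch below assumes about 3.75 billion parameters, a value chosen here to be consistent with the ~7.5 GB figure quoted above rather than one taken from this notebook; `weight_memory_gb` is a hypothetical helper.

```python
def weight_memory_gb(num_params: float, bytes_per_param: int) -> float:
    """Approximate memory needed for model weights alone, in GB (1e9 bytes)."""
    return num_params * bytes_per_param / 1e9


ASSUMED_PARAMS = 3.75e9  # assumption: approximate parameter count

fp16_gb = weight_memory_gb(ASSUMED_PARAMS, 2)  # float16 = 2 bytes per weight
fp32_gb = weight_memory_gb(ASSUMED_PARAMS, 4)  # float32 = 4 bytes per weight
print(f"float16: {fp16_gb:.1f} GB, float32: {fp32_gb:.1f} GB")
```

This counts weights only. Activations and the image tokens for each page need additional memory on top, which is why half precision matters on a 16 GB card such as the T4.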

In [None]:
import torch
from transformers import Qwen2_5_VLForConditionalGeneration, AutoProcessor

VLM_MODEL_ID = "Qwen/Qwen2.5-VL-3B-Instruct"

print(f"\u23f3 Loading {VLM_MODEL_ID}...")
print(f"   Device: {'CUDA - ' + torch.cuda.get_device_name(0) if torch.cuda.is_available() else 'CPU (WARNING: will be very slow)'}")

# FIX 2: Explicitly use 'cuda' instead of 'auto' to guarantee GPU placement.
# 'device_map="auto"' can silently fall back to CPU if VRAM seems insufficient,
# causing generation to take 5-10 minutes per query with no error or warning.
vlm_model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    VLM_MODEL_ID,
    torch_dtype=torch.float16,
    device_map="cuda"   # <-- FIXED: was 'auto', now explicitly 'cuda'
)
vlm_processor = AutoProcessor.from_pretrained(VLM_MODEL_ID)

# Confirm where the model actually landed
model_device = next(vlm_model.parameters()).device
print(f"\u2705 VLM loaded on: {model_device}")
if str(model_device) == 'cpu':
    print("\u26a0\ufe0f  WARNING: Model is on CPU. Each query will take 5-10 minutes!")

## Step 5: Configure LlamaIndex Settings

We set `Settings.llm = None` because LlamaIndex only needs an LLM for response synthesis, which this notebook skips: visual QA is handled directly by the Qwen VLM.

In [None]:
Settings.embed_model = embed_model
Settings.llm = None
print("\u2705 LlamaIndex settings configured.")

## Step 6: Load the PDF

`SimpleDirectoryReader` reads the PDF and creates one `Document` per page, each carrying page number metadata (e.g. `{'page_label': '5'}`). This page label is how we know which page to render for the VLM.
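
Because the label is a string, it has to be converted before it can be used as a page position. PDFs can define custom page labels (for example Roman numerals for front matter), so a defensive conversion may be worth having. A minimal sketch, where `page_label_to_index` is a hypothetical helper and not part of LlamaIndex:

```python
from typing import Optional


def page_label_to_index(page_label: Optional[str]) -> Optional[int]:
    """Convert a 1-based decimal page label to a 0-based index, else None."""
    if page_label is None:
        return None
    label = str(page_label).strip()
    # isdecimal() rejects Roman numerals, empty strings and signs.
    if not label.isdecimal() or int(label) < 1:
        return None
    return int(label) - 1


print(page_label_to_index("5"))   # numeric label
print(page_label_to_index("iv"))  # Roman-numeral label, cannot be converted
print(page_label_to_index(None))  # metadata key missing
```

Returning `None` instead of silently defaulting to the first page lets the caller decide how to handle a page it cannot locate.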

In [None]:
PDF_FILE = "GS-2024-q4-earnings.pdf"

print(f"\U0001f4da Loading {PDF_FILE}...")
documents = SimpleDirectoryReader(input_files=[PDF_FILE]).load_data()

print(f"   Loaded {len(documents)} pages.")
print(f"   Sample Metadata: {documents[0].metadata}")

## Step 7: Build the Vector Index

LlamaIndex embeds each page's text using `all-MiniLM-L6-v2` and stores the vectors in-memory. The retriever is configured to return the single best-matching page per query.
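
Conceptually, retrieving the best page is a nearest-neighbour lookup over the stored vectors. The sketch below assumes cosine similarity as the scoring function and uses hand-written 3-dimensional vectors in place of real 384-dimensional embeddings; `cosine_similarity` and `top_k_pages` are illustrative names, not LlamaIndex APIs.

```python
import math
from typing import List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k_pages(
    query_vec: Sequence[float],
    page_vecs: Sequence[Sequence[float]],
    k: int = 1,
) -> List[Tuple[int, float]]:
    """Return (page_index, score) pairs for the k most similar pages."""
    scored = [(i, cosine_similarity(query_vec, v)) for i, v in enumerate(page_vecs)]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Toy vectors standing in for page embeddings.
page_vectors = [[1.0, 0.0, 0.0], [0.6, 0.8, 0.0], [0.0, 0.0, 1.0]]
query_vector = [0.0, 1.0, 0.0]
best = top_k_pages(query_vector, page_vectors, k=1)
print(best)
```

With `k=1` only the second page (index 1) is returned, mirroring `similarity_top_k=1`. A larger `k` would return several candidate pages at the cost of more VLM calls.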

In [None]:
print("\U0001f9e0 Building Vector Index...")
index = VectorStoreIndex.from_documents(documents)
retriever = index.as_retriever(similarity_top_k=1)
print("\u2705 Index Ready!")

## Step 8: The Visual RAG Orchestrator

The `query_visual_rag` function ties everything together:
1. **Retrieve** — LlamaIndex finds the best matching page by text similarity.
2. **Locate** — The page number is extracted from metadata.
3. **Render** — `pdf2image` renders that PDF page to a PIL image at 150 DPI.
4. **Prompt** — Image + question are formatted into a Qwen2.5-VL chat prompt.
5. **Generate** — The VLM generates an answer, with a token limit to prevent runaway generation.
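
The DPI chosen in the render step determines how large the page image is, and therefore how much visual input the VLM has to process. PDF page sizes are measured in points (72 per inch), so the approximate pixel size follows directly. The sketch assumes a US Letter page (612 × 792 points), since the actual page size of the report is not stated here; `rendered_size_px` is a hypothetical helper, and a real renderer may differ by a pixel due to rounding.

```python
from typing import Tuple

POINTS_PER_INCH = 72  # PDF page sizes are expressed in points


def rendered_size_px(width_pt: float, height_pt: float, dpi: int) -> Tuple[int, int]:
    """Approximate pixel dimensions of a page rendered at the given DPI."""
    width_px = round(width_pt / POINTS_PER_INCH * dpi)
    height_px = round(height_pt / POINTS_PER_INCH * dpi)
    return (width_px, height_px)


LETTER_PT = (612, 792)  # assumption: US Letter page size

print(rendered_size_px(*LETTER_PT, dpi=150))
print(rendered_size_px(*LETTER_PT, dpi=300))
```

Doubling the DPI quadruples the pixel count, so 150 DPI is a compromise between legibility of small table text and memory use.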

**Key fixes applied:**
- `device_map='cuda'` ensures GPU is always used (see Step 4).
- `max_new_tokens=256` keeps generation fast — increase if you need longer answers.
- A device check prints a warning if somehow CPU is being used.
- The prompt now names the Goldman Sachs Q4 2024 Earnings Report; it was previously mislabelled 'JLR Annual Report'.
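
The effect of the token limit on latency is simple division: worst-case decode time is the token budget divided by throughput. The sketch below reuses the rough throughput ranges quoted in this notebook's code comments (about 30-50 tokens/sec on a T4, 1-2 tokens/sec on CPU). These are ballpark figures, not benchmarks, and the estimate ignores the time spent encoding the prompt and image; `generation_seconds` is a hypothetical helper.

```python
def generation_seconds(max_new_tokens: int, tokens_per_second: float) -> float:
    """Upper-bound decode time if the model emits every allowed token."""
    return max_new_tokens / tokens_per_second


# Throughput ranges quoted in this notebook (rough figures, not benchmarks).
RATES = {"T4 GPU": (30.0, 50.0), "CPU": (1.0, 2.0)}

for device, (slow, fast) in RATES.items():
    for budget in (256, 512):
        quickest = generation_seconds(budget, fast)
        slowest = generation_seconds(budget, slow)
        print(f"{device}, {budget} tokens: {quickest:.0f}-{slowest:.0f} s")
```

Halving the budget halves the worst case, which is why the default of 256 tokens keeps GPU answers under about ten seconds of decoding.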

In [None]:
from pypdf import PdfReader, PdfWriter
from pdf2image import convert_from_path
from PIL import Image
import torch

def query_visual_rag(query_text, max_new_tokens=256):
    print(f"\n\U0001f50e Querying LlamaIndex for: '{query_text}'...")

    # FIX 3: Warn if model is on CPU before spending time on generation.
    model_device = next(vlm_model.parameters()).device
    if str(model_device) == 'cpu':
        print("\u26a0\ufe0f  WARNING: VLM is running on CPU. This will be very slow (5-10 min).")
        print("   Change runtime to T4 GPU and restart for fast inference.")

    # 1. RETRIEVE: semantic search over page text
    nodes = retriever.retrieve(query_text)
    if not nodes:
        return "\u274c No relevant information found in the index."

    # 2. LOCATE: get page number from metadata
    best_node = nodes[0]
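    page_label = best_node.metadata.get('page_label')
    # page_label is a string; fall back to the first page if it is missing
    # or not a plain decimal number (e.g. a Roman-numeral label).
    is_numeric = page_label is not None and str(page_label).isdecimal()
    page_index = max(int(page_label) - 1, 0) if is_numeric else 0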

    print(f"\U0001f4cd Found answer on Page {page_label} (Score: {best_node.score:.4f})")
    print(f"   Context Snippet: {best_node.text[:100]}...")

    # 3. RENDER: convert that PDF page to an image
    print("\U0001f5bc\ufe0f  Rendering page as image...")
    pages = convert_from_path(
        PDF_FILE,
        first_page=page_index + 1,
        last_page=page_index + 1,
        dpi=150
    )
    page_image = pages[0]

    # 4. VISION: build the prompt and run Qwen2.5-VL
    print(f"\U0001f680 Sending page image to Qwen2.5-VL (max {max_new_tokens} tokens)...")
    messages = [
        {
            "role": "user",
            "content": [
                {"type": "image", "image": page_image},
                {"type": "text", "text": (
                    f"You are an expert financial analyst reviewing Page {page_label} "
                    # FIX 4: Corrected report name — was 'JLR Annual Report'
                    "of the Goldman Sachs Q4 2024 Earnings Report. "
                    "Answer the following question based on the text, tables, and "
                    "charts visible on this page. "
                    "If the answer involves a chart, describe the visual trend.\n\n"
                    f"Question: {query_text}"
                )}
            ]
        }
    ]

    text_prompt = vlm_processor.apply_chat_template(
        messages, tokenize=False, add_generation_prompt=True
    )
    inputs = vlm_processor(
        text=[text_prompt],
        images=[page_image],
        return_tensors="pt"
    ).to(vlm_model.device)

    with torch.no_grad():
        output_ids = vlm_model.generate(
            **inputs,
            # FIX 5: Reduced max_new_tokens from 512 to 256.
            # 512 tokens on T4 GPU takes ~15-20 sec; on CPU it takes ~5-10 MIN.
            # 256 tokens covers all typical financial QA answers in ~8-10 sec on GPU.
            # Caller can pass max_new_tokens=512 if longer answers are needed.
            max_new_tokens=max_new_tokens,
            do_sample=False,
            temperature=None,   # FIX 6: Remove temperature when do_sample=False
            top_p=None,         # to suppress the 'not valid' warning
        )

    # Decode only the newly generated tokens
    generated = output_ids[:, inputs['input_ids'].shape[1]:]
    answer = vlm_processor.batch_decode(generated, skip_special_tokens=True)[0]
    return answer

print("\u2705 query_visual_rag() function defined and ready.")

## Step 9: Test 1

This query requires the VLM to read a table and extract a value — something text-only RAG cannot do reliably.

In [None]:
from IPython.display import display, Markdown

q1 = "What are the net revenues from Equity underwriting?"
display(Markdown(f"### Q1: {q1}"))
display(Markdown(query_visual_rag(q1)))

## Step 10: Test 2

This query asks the model to extract structured financial data from a complex table. Tables are notoriously difficult for text-extraction-based RAG — column alignment is often lost. By passing the rendered page image, the VLM can interpret the table layout visually.
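
To see why column alignment matters, consider what happens when a row with an empty cell is flattened to plain text. The sketch below mimics a naive extractor with a hypothetical `flatten_row` helper; the header and values are made up for illustration and do not come from the report.

```python
from typing import List, Optional


def flatten_row(cells: List[Optional[str]]) -> str:
    """Mimic naive text extraction: drop empty cells, join the rest with spaces."""
    return " ".join(cell for cell in cells if cell)


# Illustrative table: the same numbers sit in different columns in each row.
header = ["Item", "Period 1", "Period 2", "Period 3"]
row_a = ["Fees", "10", None, "7"]   # value in Period 1, Period 2 empty
row_b = ["Fees", None, "10", "7"]   # Period 1 empty, value in Period 2

print(flatten_row(row_a))
print(flatten_row(row_b))
print(flatten_row(row_a) == flatten_row(row_b))
```

Both rows flatten to the same string, so a text-only pipeline cannot tell which period the first figure belongs to, whereas the rendered image preserves the column positions.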

In [None]:
q2 = "Give me the details for Investment banking fees"
display(Markdown(f"### Q2: {q2}"))
display(Markdown(query_visual_rag(q2)))

# Required Task 14
Build a Page-Wise Visual RAG (Retrieval-Augmented Generation) system to analyse AstraZeneca's FY and Q4 2025 earnings report. Rather than simply reading the PDF as text, your system will identify the most relevant page for a given query, render it as an image, and use a Vision-Language Model (VLM) to extract and interpret the information visually — just as a financial analyst would.

This mirrors a real-world analyst workflow: locate the right section of a report, then read it carefully to extract structured insights.

## Setup
Use the notebook structure from the lab session. Your system should use:

**Embeddings**: sentence-transformers/all-MiniLM-L6-v2 (local, CPU)

**VLM**: Qwen/Qwen2.5-VL-3B-Instruct (local, T4 GPU)

**PDF**: AstraZeneca-Q4-2025-earnings.pdf

**Runtime**: Google Colab with T4 GPU


## Tasks
### Task 1 — Revenue Table Extraction
Use your Visual RAG system to answer the following query:

"What were AstraZeneca's total Product Sales and Alliance Revenue for FY 2025, and how did each change compared to FY 2024?"

### Task 2 — Regional Revenue Breakdown
Issue the following query:

"Which geographic region had the highest Total Revenue growth in FY 2025, and what was the growth rate at constant exchange rates?"

### Task 3 — R&D Pipeline Interpretation
Issue the following query:

"Which medicines received regulatory approvals in the US between November 2025 and February 2026, and for what indications?"

### Task 4 — Audit Mode Query
Design your own audit-style prompt — one that demands precise numerical figures with an explicit instruction to cite the page or table the number came from. Run it through your system and report:

- Your chosen query
- The page retrieved
- The VLM's response
- A verification of at least two specific figures against the source document

Your prompt should target a different section of the report from Tasks 1–3. Good candidates include the cash flow statement (Table 12), the Reported-to-Core reconciliation (Table 10), or the currency sensitivity table (Table 16).