# Multi-Modal RAG with Image Captioning

In this notebook, I implement a Multi-Modal RAG system that extracts both text and images from documents, generates captions for images, and uses both content types to respond to queries. This approach enhances traditional RAG by incorporating visual information into the knowledge base.

Traditional RAG systems only work with text, but many documents contain crucial information in images, charts, and tables. By captioning these visual elements and incorporating them into our retrieval system, we can:

- Access information locked in figures and diagrams
- Understand tables and charts that complement the text
- Create a more comprehensive knowledge base
- Answer questions that rely on visual data

## Setting Up the Environment
We begin by importing the necessary libraries and configuring the Gemini client.

In [1]:
import os
import io
import numpy as np
import json
import fitz
from PIL import Image
import google.generativeai as genai
import base64
import re
import tempfile
import shutil

# --- Gemini API Configuration ---
# Your GOOGLE_API_KEY should be set in your environment
GOOGLE_API_KEY = os.getenv("GOOGLE_API_KEY")
if not GOOGLE_API_KEY:
    raise ValueError("GOOGLE_API_KEY environment variable is not set. Please set it.")
genai.configure(api_key=GOOGLE_API_KEY)

## Document Processing Functions

In [2]:
def extract_content_from_pdf(pdf_path, output_dir=None):
    """
    Extract both text and images from a PDF file.
    
    Args:
        pdf_path (str): Path to the PDF file
        output_dir (str, optional): Directory to save extracted images
        
    Returns:
        Tuple[List[Dict], List[Dict]]: Text data and image data
    """
    # Create a temporary directory for images if not provided
    temp_dir = None
    if output_dir is None:
        temp_dir = tempfile.mkdtemp()
        output_dir = temp_dir
    else:
        os.makedirs(output_dir, exist_ok=True)
        
    text_data = []  # List to store extracted text data
    image_paths = []  # List to store paths of extracted images
    
    print(f"Extracting content from {pdf_path}...")
    
    try:
        with fitz.open(pdf_path) as pdf_file:
            # Loop through every page in the PDF
            for page_number in range(len(pdf_file)):
                page = pdf_file[page_number]
                
                # Extract text from the page
                text = page.get_text().strip()
                if text:
                    text_data.append({
                        "content": text,
                        "metadata": {
                            "source": pdf_path,
                            "page": page_number + 1,
                            "type": "text"
                        }
                    })
                
                # Extract images from the page
                image_list = page.get_images(full=True)
                for img_index, img in enumerate(image_list):
                    xref = img[0]  # XREF of the image
                    base_image = pdf_file.extract_image(xref)
                    
                    if base_image:
                        image_bytes = base_image["image"]
                        image_ext = base_image["ext"]
                        
                        # Save the image to the output directory
                        img_filename = f"page_{page_number+1}_img_{img_index+1}.{image_ext}"
                        img_path = os.path.join(output_dir, img_filename)
                        
                        with open(img_path, "wb") as img_file:
                            img_file.write(image_bytes)
                        
                        image_paths.append({
                            "path": img_path,
                            "metadata": {
                                "source": pdf_path,
                                "page": page_number + 1,
                                "image_index": img_index + 1,
                                "type": "image"
                            }
                        })
        
        print(f"Extracted {len(text_data)} text segments and {len(image_paths)} images")
        return text_data, image_paths
    
    except Exception as e:
        print(f"Error extracting content: {e}")
        if temp_dir and os.path.exists(temp_dir):
            shutil.rmtree(temp_dir)
        raise

## Chunking Text Content

In [3]:
def chunk_text(text_data, chunk_size=1000, overlap=200):
    """
    Split text data into overlapping chunks.
    
    Args:
        text_data (List[Dict]): Text data extracted from PDF
        chunk_size (int): Size of each chunk in characters
        overlap (int): Overlap between chunks in characters
        
    Returns:
        List[Dict]: Chunked text data
    """
    chunked_data = []  # Initialize an empty list to store chunked data
    
    for item in text_data:
        text = item["content"]  # Extract the text content
        metadata = item["metadata"]  # Extract the metadata
        
        # Keep short text as a single chunk instead of splitting it
        if len(text) < chunk_size / 2:
            chunked_data.append({
                "content": text,
                "metadata": metadata
            })
            continue
        
        # Create chunks with overlap
        chunks = []
        for i in range(0, len(text), chunk_size - overlap):
            chunk = text[i:i + chunk_size]  # Extract a chunk of the specified size
            if chunk:  # Ensure we don't add empty chunks
                chunks.append(chunk)
        
        # Add each chunk with updated metadata
        for i, chunk in enumerate(chunks):
            chunk_metadata = metadata.copy()  # Copy the original metadata
            chunk_metadata["chunk_index"] = i  # Add chunk index to metadata
            chunk_metadata["chunk_count"] = len(chunks)  # Add total chunk count to metadata
            
            chunked_data.append({
                "content": chunk,  # The chunk text
                "metadata": chunk_metadata  # The updated metadata
            })
    
    print(f"Created {len(chunked_data)} text chunks")  # Print the number of created chunks
    return chunked_data  # Return the list of chunked data
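
To make the window arithmetic concrete, here is a minimal sketch of the same sliding-window logic on a toy string. `sliding_chunks` is a hypothetical helper, not part of the notebook. It also adds a guard that `chunk_text` does not have: it assumes `0 <= overlap < chunk_size`, because a step of zero or less would break the `range` call.

```python
def sliding_chunks(text: str, chunk_size: int, overlap: int) -> list:
    """Return windows of chunk_size characters that start every (chunk_size - overlap) characters."""
    if not 0 <= overlap < chunk_size:
        # A non-positive step would make range() raise or walk backwards.
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")
    step = chunk_size - overlap
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]


chunks = sliding_chunks("abcdefghij", chunk_size=4, overlap=2)
print(chunks)
```

With a step of 2, the windows start at offsets 0, 2, 4, 6, and 8, so the final chunk (`"ij"`) is fully contained in the one before it. The same thing can happen with `chunk_text`: the last chunk of a page may consist only of overlap.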

## Image Captioning with Gemini

In [4]:
def encode_image(image_path):
    """
    Encode an image file as base64.
    
    Args:
        image_path (str): Path to the image file
        
    Returns:
        str: Base64 encoded image
    """
    # Open the image file in binary read mode
    with open(image_path, "rb") as image_file:
        # Read the image file and encode it to base64
        encoded_image = base64.b64encode(image_file.read())
        # Decode the base64 bytes to a string and return
        return encoded_image.decode('utf-8')
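
As a quick sanity check of what `encode_image` produces, the sketch below runs the same two steps (`b64encode`, then `decode`) on a few in-memory bytes and confirms the round trip is lossless. The 8-byte PNG signature stands in for real image data; it is only an illustrative input.

```python
import base64

# The 8-byte PNG file signature, used here as stand-in image bytes.
raw = b"\x89PNG\r\n\x1a\n"

# Same steps as encode_image: bytes -> base64 bytes -> UTF-8 string.
encoded = base64.b64encode(raw).decode("utf-8")

# Decoding the string recovers the original bytes exactly.
decoded = base64.b64decode(encoded)

print(encoded)
print(decoded == raw)
```

Base64 strings are useful for APIs that expect images embedded in a JSON payload. The Gemini captioning function in this notebook passes a PIL image object directly, so it does not call `encode_image`.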

In [6]:
from typing import Optional

# --- 1. Gemini API Configuration ---
# `os`, `genai`, and `Image` were imported, and the API key configured, in the setup cell (In [1]).

# --- 2. The main image caption generation function (revised for Gemini) ---
def generate_image_caption(image_path: str, model: str = "gemini-1.5-flash") -> str:
    """
    Generate a caption for an image using Gemini's multi-modal capabilities.
    
    Args:
        image_path (str): Path to the image file
        model (str): The Gemini multi-modal model to use
        
    Returns:
        str: Generated caption
    """
    if not os.path.exists(image_path):
        return "Error: Image file not found"
        
    try:
        # Define the system prompt to guide the AI's behavior
        system_prompt = (
            "You are an assistant specialized in describing images from academic papers. "
            "Provide detailed captions for the image that capture key information. "
            "If the image contains charts, tables, or diagrams, describe their content and purpose clearly. "
            "Your caption should be optimized for future retrieval when people ask questions about this content."
        )
        
        # Open the image file using PIL
        img = Image.open(image_path)
        
        # Create the Gemini model instance with the system prompt
        gemini_model = genai.GenerativeModel(model, system_instruction=system_prompt)
        
        # The prompt is a list of parts: a text string and the image object
        prompt_parts = [
            "Describe this image in detail, focusing on its academic content:",
            img
        ]
        
        # Generate the caption using the specified model
        response = gemini_model.generate_content(prompt_parts, stream=False)
        
        # Return the generated caption
        return response.text
        
    except Exception as e:
        return f"Error generating caption: {str(e)}"

# --- 3. Main Logic (Re-implemented for a runnable example) ---
if __name__ == "__main__":
    # Simulate a local image file. Replace with a valid path for a real run.
    image_file = "/Users/kekunkoya/Desktop/770 Google /reward_plot.png"
    
    print("Generating image caption with Gemini...")
    caption = generate_image_caption(image_file)
    
    print("\nGenerated Caption:")
    print(caption)

Generating image caption with Gemini...

Generated Caption:
Here's a caption describing the image's academic content:

**Figure: Reward History During Reinforcement Learning (RL) Training**

This line graph illustrates the reward history observed during the training of a reinforcement learning agent across five episodes.  The x-axis represents the episode number (0 through 4), and the y-axis represents the reward obtained by the agent in each episode, ranging from 0 to approximately 0.9. The graph shows a fluctuating reward pattern.  The reward increases sharply in the first episode then peaks in the second episode before dropping to a near-zero value. Subsequently, there is a rapid increase in the reward again by episode 3, followed by a more gradual improvement in the final episode. This suggests an RL agent learning process that's neither consistently smooth nor consistently poor, instead displaying periods of rapid learning interspersed with dips.  The overall trend shows an improv

In [7]:
def process_images(image_paths):
    """
    Process all images and generate captions.
    
    Args:
        image_paths (List[Dict]): Paths to extracted images
        
    Returns:
        List[Dict]: Image data with captions
    """
    image_data = []  # Initialize an empty list to store image data with captions
    
    print(f"Generating captions for {len(image_paths)} images...")  # Print the number of images to process
    for i, img_item in enumerate(image_paths):
        print(f"Processing image {i+1}/{len(image_paths)}...")  # Print the current image being processed
        img_path = img_item["path"]  # Get the image path
        metadata = img_item["metadata"]  # Get the image metadata
        
        # Generate caption for the image
        caption = generate_image_caption(img_path)
        
        # Add the image data with caption to the list
        image_data.append({
            "content": caption,  # The generated caption
            "metadata": metadata,  # The image metadata
            "image_path": img_path  # The path to the image
        })
    
    return image_data  # Return the list of image data with captions
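
Because `generate_image_caption` reports failures by returning an error string rather than raising, those strings end up in `image_data` and would be embedded like real captions. One possible way to handle this, sketched below, is to separate failed items before embedding. `split_failed_captions` and the sample data are hypothetical; the prefixes are the two error messages used by `generate_image_caption`.

```python
# Prefixes of the error strings returned by generate_image_caption.
ERROR_PREFIXES = ("Error: Image file not found", "Error generating caption:")


def split_failed_captions(image_data):
    """Separate items with usable captions from items whose caption is an error message."""
    ok, failed = [], []
    for item in image_data:
        if item["content"].startswith(ERROR_PREFIXES):
            failed.append(item)
        else:
            ok.append(item)
    return ok, failed


# Hypothetical sample mirroring the structure returned by process_images.
sample = [
    {"content": "Error bars show the variance of reward per episode.", "metadata": {"page": 1}},
    {"content": "Error: Image file not found", "metadata": {"page": 2}},
    {"content": "Error generating caption: timeout", "metadata": {"page": 3}},
]
ok, failed = split_failed_captions(sample)
print(len(ok), len(failed))
```

Matching the full prefixes, rather than just the word "Error", keeps legitimate captions such as the first sample item from being discarded.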

## Simple Vector Store Implementation

In [8]:
class MultiModalVectorStore:
    """
    A simple vector store implementation for multi-modal content.
    """
    def __init__(self):
        # Initialize lists to store vectors, contents, and metadata
        self.vectors = []
        self.contents = []
        self.metadata = []
    
    def add_item(self, content, embedding, metadata=None):
        """
        Add an item to the vector store.
        
        Args:
            content (str): The content (text or image caption)
            embedding (List[float]): The embedding vector
            metadata (Dict, optional): Additional metadata
        """
        # Append the embedding vector, content, and metadata to their respective lists
        self.vectors.append(np.array(embedding))
        self.contents.append(content)
        self.metadata.append(metadata or {})
    
    def add_items(self, items, embeddings):
        """
        Add multiple items to the vector store.
        
        Args:
            items (List[Dict]): List of content items
            embeddings (List[List[float]]): List of embedding vectors
        """
        # Loop through items and embeddings and add each to the vector store
        for item, embedding in zip(items, embeddings):
            self.add_item(
                content=item["content"],
                embedding=embedding,
                metadata=item.get("metadata", {})
            )
    
    def similarity_search(self, query_embedding, k=5):
        """
        Find the most similar items to a query embedding.
        
        Args:
            query_embedding (List[float]): Query embedding vector
            k (int): Number of results to return
            
        Returns:
            List[Dict]: Top k most similar items
        """
        # Return an empty list if there are no vectors in the store
        if not self.vectors:
            return []
        
        # Convert query embedding to numpy array
        query_vector = np.array(query_embedding)
        
        # Calculate similarities using cosine similarity
        similarities = []
        for i, vector in enumerate(self.vectors):
            similarity = np.dot(query_vector, vector) / (np.linalg.norm(query_vector) * np.linalg.norm(vector))
            similarities.append((i, similarity))
        
        # Sort by similarity (descending)
        similarities.sort(key=lambda x: x[1], reverse=True)
        
        # Return top k results
        results = []
        for i in range(min(k, len(similarities))):
            idx, score = similarities[i]
            results.append({
                "content": self.contents[idx],
                "metadata": self.metadata[idx],
                "similarity": float(score)  # Convert to float for JSON serialization
            })
        
        return results
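
The ranking in `similarity_search` depends only on the angle between vectors, not their length. The sketch below works through cosine similarity on three tiny vectors in plain Python. The function name, the sample vectors, and the zero-norm guard are illustrative assumptions; the store above does not include such a guard.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        # Assumption: treat a zero vector as having no similarity instead of dividing by zero.
        return 0.0
    return dot / (norm_a * norm_b)


query = [1.0, 0.0]
docs = {
    "same direction": [2.0, 0.0],
    "orthogonal": [0.0, 3.0],
    "diagonal": [1.0, 1.0],
}
scores = {name: cosine_similarity(query, vec) for name, vec in docs.items()}
ranked = sorted(scores, key=scores.get, reverse=True)
print(ranked)
```

The vector pointing the same way as the query scores 1.0 despite being twice as long, the diagonal vector scores about 0.707, and the orthogonal vector scores 0.0.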

## Creating Embeddings

In [9]:
from typing import List, Any

# --- 1. Gemini API Configuration ---
# `os` and `genai` were imported, and the API key configured, in the setup cell (In [1]).

# --- 2. The main embedding function (revised for Gemini) ---
def create_embeddings(texts: List[str], model: str = "models/embedding-001") -> Any:
    """
    Creates embeddings for the given texts using the Gemini API.

    Args:
        texts (List[str]): Input texts
        model (str): Embedding model name

    Returns:
        List[List[float]]: Embedding vectors
    """
    if not texts:
        return []

    # embed_content accepts a list of texts and returns one embedding per text,
    # so no manual loop over individual texts is needed here.
    try:
        response = genai.embed_content(
            model=model,
            content=texts
        )
        # The embedding list is directly under the 'embedding' key
        return response['embedding']
    except Exception as e:
        print(f"An error occurred during embedding: {e}")
        return []

# --- 3. Main Logic (Re-implemented for a runnable example) ---
if __name__ == "__main__":
    # Simulate a list of text chunks
    text_chunks = [
        "Homelessness is a complex social problem.",
        "A lack of affordable housing is a key contributing factor.",
        "Social factors like family breakdown can also lead to homelessness."
    ]

    print("Creating embeddings with Gemini...")
    # Create embeddings for the text chunks
    embeddings = create_embeddings(text_chunks)

    if embeddings:
        print("\nEmbeddings created successfully.")
        print(f"Number of embeddings: {len(embeddings)}")
        print(f"Embedding dimensions: {len(embeddings[0])}")
    else:
        print("\nFailed to create embeddings.")

Creating embeddings with Gemini...

Embeddings created successfully.
Number of embeddings: 3
Embedding dimensions: 768
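
If a document yields more texts than you want to send in one request, explicit batching is one option. The sketch below is a generic helper under stated assumptions: `embed_in_batches` and `fake_embed` are hypothetical names, the batch size is arbitrary, and the embedding call is injected as `embed_fn` so the example runs without an API key. In the notebook, `embed_fn` could be `create_embeddings`.

```python
def embed_in_batches(texts, embed_fn, batch_size=100):
    """Call embed_fn on consecutive slices of texts and concatenate the results in order."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    embeddings = []
    for start in range(0, len(texts), batch_size):
        batch = texts[start:start + batch_size]
        embeddings.extend(embed_fn(batch))
    return embeddings


batch_sizes = []  # records how many texts each call received


def fake_embed(batch):
    """Stand-in for a real embedding call: one 1-dimensional vector per text."""
    batch_sizes.append(len(batch))
    return [[float(len(text))] for text in batch]


vectors = embed_in_batches(["a", "bb", "ccc"], fake_embed, batch_size=2)
print(vectors)
print(batch_sizes)
```

Three texts with a batch size of 2 produce two calls (of sizes 2 and 1), and the vectors come back in the same order as the inputs.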


## Complete Processing Pipeline

In [10]:
def process_document(pdf_path, chunk_size=1000, chunk_overlap=200):
    """
    Process a document for multi-modal RAG.
    
    Args:
        pdf_path (str): Path to the PDF file
        chunk_size (int): Size of each chunk in characters
        chunk_overlap (int): Overlap between chunks in characters
        
    Returns:
        Tuple[MultiModalVectorStore, Dict]: Vector store and document info
    """
    # Create a directory for extracted images
    image_dir = "extracted_images"
    os.makedirs(image_dir, exist_ok=True)
    
    # Extract text and images from the PDF
    text_data, image_paths = extract_content_from_pdf(pdf_path, image_dir)
    
    # Chunk the extracted text
    chunked_text = chunk_text(text_data, chunk_size, chunk_overlap)
    
    # Process the extracted images to generate captions
    image_data = process_images(image_paths)
    
    # Combine all content items (text chunks and image captions)
    all_items = chunked_text + image_data
    
    # Extract content for embedding
    contents = [item["content"] for item in all_items]
    
    # Create embeddings for all content
    print("Creating embeddings for all content...")
    embeddings = create_embeddings(contents)
    
    # Build the vector store and add items with their embeddings
    vector_store = MultiModalVectorStore()
    vector_store.add_items(all_items, embeddings)
    
    # Prepare document info with counts of text chunks and image captions
    doc_info = {
        "text_count": len(chunked_text),
        "image_count": len(image_data),
        "total_items": len(all_items),
    }
    
    # Print summary of added items
    print(f"Added {len(all_items)} items to vector store ({len(chunked_text)} text chunks, {len(image_data)} image captions)")
    
    # Return the vector store and document info
    return vector_store, doc_info
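
One subtlety in this pipeline: `add_items` pairs items and embeddings with `zip`, which silently stops at the shorter input. If the embedding call fails and returns an empty list, the store ends up empty with no error. A minimal sketch of a guard follows; `pair_items_with_embeddings` is a hypothetical helper, not part of the notebook.

```python
def pair_items_with_embeddings(items, embeddings):
    """Pair each item with its embedding, refusing mismatched lengths."""
    if len(items) != len(embeddings):
        raise ValueError(f"Expected {len(items)} embeddings, got {len(embeddings)}")
    return list(zip(items, embeddings))


items = [{"content": "chunk A"}, {"content": "chunk B"}]

# Plain zip drops every item when the embedding list is empty.
silently_paired = list(zip(items, []))
print(len(silently_paired))

# The guarded version surfaces the problem instead.
try:
    pair_items_with_embeddings(items, [])
except ValueError as err:
    message = str(err)
    print(message)
```

Calling a check like this before `vector_store.add_items(all_items, embeddings)` would turn a silent empty index into an explicit failure.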

## Query Processing and Response Generation

In [11]:
def query_multimodal_rag(query, vector_store, k=5):
    """
    Query the multi-modal RAG system.
    
    Args:
        query (str): User query
        vector_store (MultiModalVectorStore): Vector store with document content
        k (int): Number of results to retrieve
        
    Returns:
        Dict: Query results and generated response
    """
    print(f"\n=== Processing query: {query} ===\n")
    
    # Generate embedding for the query (create_embeddings expects a list of texts)
    query_embedding = create_embeddings([query])[0]
    
    # Retrieve relevant content from the vector store
    results = vector_store.similarity_search(query_embedding, k=k)
    
    # Separate text and image results
    text_results = [r for r in results if r["metadata"].get("type") == "text"]
    image_results = [r for r in results if r["metadata"].get("type") == "image"]
    
    print(f"Retrieved {len(results)} relevant items ({len(text_results)} text, {len(image_results)} image captions)")
    
    # Generate a response using the retrieved content
    response = generate_response(query, results)
    
    return {
        "query": query,
        "results": results,
        "response": response,
        "text_results_count": len(text_results),
        "image_results_count": len(image_results)
    }

In [12]:
from typing import List, Dict

# --- 1. Gemini API Configuration ---
# `os` and `genai` were imported, and the API key configured, in the setup cell (In [1]).

# --- 2. Define the response generator for Gemini ---
def generate_response(query: str, results: List[Dict], model: str = "gemini-1.5-flash") -> str:
    """
    Generate a response based on the query and retrieved results using Gemini.

    Args:
        query (str): User query
        results (List[Dict]): Retrieved content
        model (str): LLM model to use
        
    Returns:
        str: Generated response
    """
    # Format the context from the retrieved results
    context = ""
    for i, result in enumerate(results):
        # Determine the type of content (text or image caption)
        content_type = "Text" if result["metadata"].get("type") == "text" else "Image caption"
        # Get the page number from the metadata
        page_num = result["metadata"].get("page", "unknown")
        
        # Append the content type and page number to the context
        context += f"[{content_type} from page {page_num}]\n"
        # Append the actual content to the context
        context += result["content"]
        context += "\n\n"
        
    # System message to guide the AI assistant
    system_message = """You are an AI assistant specializing in answering questions about documents
that contain both text and images. You have been given relevant text passages and image captions
from the document. Use this information to provide a comprehensive, accurate response to the query.
If information comes from an image or chart, mention this in your answer.
If the retrieved information doesn't fully answer the query, acknowledge the limitations."""

    # User message containing the query and the formatted context
    user_message = f"""Query: {query}

Retrieved content:
{context}

Please answer the query based on the retrieved content."""
    
    try:
        # Pass the system message to the GenerativeModel's system_instruction parameter
        gemini_model = genai.GenerativeModel(model, system_instruction=system_message)
        
        # Generate the response using the specified model
        response = gemini_model.generate_content(user_message, generation_config={"temperature": 0.1})
        
        # Return the generated response
        return response.text
    except Exception as e:
        print(f"An error occurred during response generation: {e}")
        return "I could not generate a response due to an error."

# --- 3. Main Logic (Re-implemented for a runnable example) ---
if __name__ == "__main__":
    # Simulate a query and results from a previous step
    query = "What is the capital of France?"
    results = [
        {"content": "Paris is the capital of France.", "metadata": {"type": "text", "page": 1}},
        {"content": "This is a map of France with Paris marked on it.", "metadata": {"type": "image", "page": 2}}
    ]
    
    print("Generating AI response with Gemini...")
    ai_response = generate_response(query, results)
    
    print("\nAI Response:")
    print(ai_response)

Generating AI response with Gemini...

AI Response:
Based on the provided text, Paris is the capital of France.  The image caption further supports this by mentioning that a map of France shows Paris marked on it.
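
To see what the model actually receives, the sketch below reproduces the context-formatting step of `generate_response` as a standalone function and prints the result for the two sample items. `format_context` is a hypothetical name. Unlike the loop in `generate_response`, it joins blocks with blank lines and so omits the trailing blank lines after the last block.

```python
def format_context(results):
    """Label each retrieved item with its type and page, then join the blocks with blank lines."""
    blocks = []
    for result in results:
        meta = result.get("metadata", {})
        # Anything that is not tagged as text is treated as an image caption.
        label = "Text" if meta.get("type") == "text" else "Image caption"
        page = meta.get("page", "unknown")
        blocks.append(f"[{label} from page {page}]\n{result['content']}")
    return "\n\n".join(blocks)


sample_results = [
    {"content": "Paris is the capital of France.", "metadata": {"type": "text", "page": 1}},
    {"content": "This is a map of France with Paris marked on it.", "metadata": {"type": "image", "page": 2}},
]
context = format_context(sample_results)
print(context)
```

The bracketed labels are what allow the system message's instruction ("If information comes from an image or chart, mention this in your answer") to work: the model can tell which passages originated as captions.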



## Evaluation Against Text-Only RAG

In [13]:
def build_text_only_store(pdf_path, chunk_size=1000, chunk_overlap=200):
    """
    Build a text-only vector store for comparison.
    
    Args:
        pdf_path (str): Path to the PDF file
        chunk_size (int): Size of each chunk in characters
        chunk_overlap (int): Overlap between chunks in characters
        
    Returns:
        MultiModalVectorStore: Text-only vector store
    """
    # Extract text from PDF (reuse function but ignore images)
    text_data, _ = extract_content_from_pdf(pdf_path, None)
    
    # Chunk text
    chunked_text = chunk_text(text_data, chunk_size, chunk_overlap)
    
    # Extract content for embedding
    contents = [item["content"] for item in chunked_text]
    
    # Create embeddings
    print("Creating embeddings for text-only content...")
    embeddings = create_embeddings(contents)
    
    # Build vector store
    vector_store = MultiModalVectorStore()
    vector_store.add_items(chunked_text, embeddings)
    
    print(f"Added {len(chunked_text)} text items to text-only vector store")
    return vector_store

In [14]:
def evaluate_multimodal_vs_textonly(pdf_path, test_queries, reference_answers=None):
    """
    Compare multi-modal RAG with text-only RAG.
    
    Args:
        pdf_path (str): Path to the PDF file
        test_queries (List[str]): Test queries
        reference_answers (List[str], optional): Reference answers
        
    Returns:
        Dict: Evaluation results
    """
    print("=== EVALUATING MULTI-MODAL RAG VS TEXT-ONLY RAG ===\n")
    
    # Process document for multi-modal RAG
    print("\nProcessing document for multi-modal RAG...")
    mm_vector_store, mm_doc_info = process_document(pdf_path)
    
    # Build text-only store
    print("\nProcessing document for text-only RAG...")
    text_vector_store = build_text_only_store(pdf_path)
    
    # Run evaluation for each query
    results = []
    
    for i, query in enumerate(test_queries):
        print(f"\n\n=== Evaluating Query {i+1}: {query} ===")
        
        # Get reference answer if available
        reference = None
        if reference_answers and i < len(reference_answers):
            reference = reference_answers[i]
        
        # Run multi-modal RAG
        print("\nRunning multi-modal RAG...")
        mm_result = query_multimodal_rag(query, mm_vector_store)
        
        # Run text-only RAG
        print("\nRunning text-only RAG...")
        text_result = query_multimodal_rag(query, text_vector_store)
        
        # Compare responses
        comparison = compare_responses(query, mm_result["response"], text_result["response"], reference)
        
        # Add to results
        results.append({
            "query": query,
            "multimodal_response": mm_result["response"],
            "textonly_response": text_result["response"],
            "multimodal_results": {
                "text_count": mm_result["text_results_count"],
                "image_count": mm_result["image_results_count"]
            },
            "reference_answer": reference,
            "comparison": comparison
        })
    
    # Generate overall analysis
    overall_analysis = generate_overall_analysis(results)
    
    return {
        "results": results,
        "overall_analysis": overall_analysis,
        "multimodal_doc_info": mm_doc_info
    }
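
The evaluation loop tolerates a reference list that is shorter than the query list: queries without a reference are compared with `reference=None`. The sketch below shows the same pairing with `itertools.zip_longest`. `pair_queries_with_references` is a hypothetical helper, and rejecting surplus references is an added assumption (the loop above simply ignores them).

```python
from itertools import zip_longest


def pair_queries_with_references(queries, references=None):
    """Pair each query with its reference answer, or None when no reference exists."""
    references = references or []
    if len(references) > len(queries):
        # Assumption: extra references indicate a mistake in the test set.
        raise ValueError("more reference answers than queries")
    return list(zip_longest(queries, references, fillvalue=None))


pairs = pair_queries_with_references(
    ["What does the reward plot show?", "How many episodes were run?"],
    ["Reward per episode during RL training."],
)
print(pairs)
```

The second query is paired with `None`, which matches how `compare_responses` omits the "Reference Answer" section when no reference is supplied.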

In [15]:
from typing import Optional

# --- 1. Gemini API Configuration ---
# `os` and `genai` were imported, and the API key configured, in the setup cell (In [1]).

# --- 2. The main comparison function (revised for Gemini) ---
def compare_responses(query: str, mm_response: str, text_response: str, reference: Optional[str] = None) -> str:
    """
    Compare multi-modal and text-only responses using Gemini.

    Args:
        query (str): User query
        mm_response (str): Multi-modal response
        text_response (str): Text-only response
        reference (str, optional): Reference answer

    Returns:
        str: Comparison analysis
    """
    system_prompt = """You are an expert evaluator comparing two RAG systems:
1. Multi-modal RAG: Retrieves from both text and image captions
2. Text-only RAG: Retrieves only from text

Evaluate which response better answers the query based on:
- Accuracy and correctness
- Completeness of information
- Relevance to the query
- Unique information from visual elements (for multi-modal)"""

    user_prompt = f"""Query: {query}

Multi-modal RAG Response:
{mm_response}

Text-only RAG Response:
{text_response}
"""

    if reference:
        user_prompt += f"""
Reference Answer:
{reference}
"""

    user_prompt += """
Compare these responses and explain which one better answers the query and why.
Note any specific information that came from images in the multi-modal response.
"""

    try:
        # Pass the system prompt to the GenerativeModel's system_instruction parameter
        gemini_model = genai.GenerativeModel("gemini-1.5-flash", system_instruction=system_prompt)
        
        # Generate the comparison using the specified model
        response = gemini_model.generate_content(user_prompt, generation_config={"temperature": 0.0})
        
        # Return the generated response content
        return response.text
    except Exception as e:
        print(f"An error occurred during comparison: {e}")
        return "Comparison failed due to an error."

# --- 3. Main Logic (Re-implemented for a runnable example) ---
if __name__ == "__main__":
    # Simulate responses and a query
    query = "What is the capital of France?"
    mm_response = "Paris is the capital of France, which is also a major European city. A map on page 2 shows its location."
    text_response = "Paris is the capital of France, which is a major European city."

    print("Comparing responses with Gemini...")
    analysis = compare_responses(query, mm_response, text_response)
    
    print("\nComparison Analysis:")
    print(analysis)

Comparing responses with Gemini...

Comparison Analysis:
Both responses correctly identify Paris as the capital of France.  However, the **Text-only RAG response is slightly better** for this specific query. Here's why:

* **Accuracy and Correctness:** Both are accurate.

* **Completeness of Information:** Both provide the same core information.  The additional sentence about Paris being a major European city is relevant but not crucial to answering the question about the capital.

* **Relevance to the Query:** Both are highly relevant.

* **Unique Information from Visual Elements (Multi-modal):** The multi-modal response mentions a map showing Paris' location. While this is helpful context, it's not *necessary* to answer the question "What is the capital of France?".  The inclusion of the map is more of a tangential addition than a crucial piece of information directly answering the query.  In fact, the reference to a specific page ("page 2") implies a reliance on a particular documen

In [16]:
from typing import List, Dict, Any

# --- 1. Gemini API Configuration ---
# `os` and `genai` were imported, and the API key configured, in the setup cell (In [1]).

# --- 2. The main analysis function (revised for Gemini) ---
def generate_overall_analysis(results: List[Dict], model: str = "gemini-1.5-flash") -> str:
    """
    Generate an overall analysis of multi-modal vs text-only RAG using Gemini.
    
    Args:
        results (List[Dict]): Evaluation results for each query
        model (str): The model to be used for the analysis.
        
    Returns:
        str: Overall analysis
    """
    # System prompt for the evaluator
    system_prompt = """You are an expert evaluator of RAG systems. Provide an overall analysis comparing 
multi-modal RAG (text + images) versus text-only RAG based on multiple test queries.

Focus on:
1. Types of queries where multi-modal RAG outperforms text-only
2. Specific advantages of incorporating image information
3. Any disadvantages or limitations of the multi-modal approach
4. Overall recommendation on when to use each approach"""

    # Create summary of evaluations
    evaluations_summary = ""
    for i, result in enumerate(results):
        evaluations_summary += f"Query {i+1}: {result['query']}\n"
        evaluations_summary += f"Multi-modal retrieved {result['multimodal_results']['text_count']} text chunks and {result['multimodal_results']['image_count']} image captions\n"
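        # Truncate long comparisons; add an ellipsis only when text was actually cut
        comparison = result['comparison']
        snippet = comparison[:200] + ("..." if len(comparison) > 200 else "")
        evaluations_summary += f"Comparison summary: {snippet}\n\n"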

    # User prompt with evaluations summary
    user_prompt = f"""Based on the following evaluations of multi-modal vs text-only RAG across {len(results)} queries, 
provide an overall analysis comparing these two approaches:

{evaluations_summary}

Please provide a comprehensive analysis of the relative strengths and weaknesses of multi-modal RAG 
compared to text-only RAG, with specific attention to how image information contributed (or didn't contribute) to response quality."""

    try:
        # Create a Gemini model instance with the system prompt
        gemini_model = genai.GenerativeModel(model, system_instruction=system_prompt)
        
        # Generate overall analysis
        response = gemini_model.generate_content(user_prompt, generation_config={"temperature": 0.0})
        
        return response.text
    except Exception as e:
        print(f"An error occurred during analysis generation: {e}")
        return "Analysis failed due to an error."

# --- 3. Main Logic (Re-implemented for a runnable example) ---
if __name__ == "__main__":
    # Simulate a results list from a previous evaluation pipeline
    simulated_results = [
        {
            "query": "What are the key differences in homelessness data collection methods?",
            "multimodal_results": {"text_count": 2, "image_count": 1},
            "comparison": "Multi-modal RAG was superior. It retrieved text on data collection methods and an image caption of a chart that visually represented the different methodologies, leading to a more complete answer.",
        },
        {
            "query": "Describe the main causes of homelessness.",
            "multimodal_results": {"text_count": 3, "image_count": 0},
            "comparison": "Both RAG systems performed similarly. Since no relevant images were retrieved, the multi-modal approach offered no additional value. Both systems provided a good text-based answer.",
        }
    ]

    print("Generating overall analysis with Gemini...")
    overall_analysis = generate_overall_analysis(simulated_results)
    
    print("\n=== OVERALL ANALYSIS ===")
    print(overall_analysis)



Generating overall analysis with Gemini...

=== OVERALL ANALYSIS ===
## Multi-modal vs. Text-only RAG: A Comparative Analysis

Based on the provided evaluations of two queries, a clear picture emerges regarding the strengths and weaknesses of multi-modal RAG compared to its text-only counterpart.  The analysis reveals that multi-modal RAG's effectiveness is highly dependent on the query's nature and the availability of relevant and informative visual data.

**1. Types of Queries Where Multi-modal RAG Outperforms Text-only:**

Multi-modal RAG significantly outperforms text-only RAG when the query involves:

* **Data visualization:**  As demonstrated in Query 1, queries requiring the interpretation of data presented visually (charts, graphs, diagrams) benefit immensely from multi-modal retrieval.  Text alone might describe the data, but the visual representation provides immediate understanding and context, leading to a richer and more complete answer.  This is particularly true for comp
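
The prompt sent to the evaluator in `generate_overall_analysis` is built from a short per-query summary. As a minimal sketch, assuming each result follows the same dictionary shape as `simulated_results`, that step could be factored into a standalone helper so it can be inspected without calling the API. The name `summarize_evaluations`, the `max_chars` parameter, and the fallback to zero counts are hypothetical and not part of the original notebook.

```python
from typing import Any, Dict, List


def summarize_evaluations(results: List[Dict[str, Any]], max_chars: int = 200) -> str:
    """Build a plain-text summary of per-query evaluation results.

    Assumes each result may contain 'query', 'multimodal_results'
    (with 'text_count' and 'image_count'), and 'comparison'.
    Missing keys fall back to empty or zero values (an assumption of this sketch).
    """
    lines = []
    for i, result in enumerate(results, start=1):
        counts = result.get("multimodal_results", {})
        comparison = result.get("comparison", "")
        # Add an ellipsis only when the comparison is actually truncated
        if len(comparison) > max_chars:
            comparison = comparison[:max_chars] + "..."
        lines.append(f"Query {i}: {result.get('query', '')}")
        lines.append(
            f"Multi-modal retrieved {counts.get('text_count', 0)} text chunks "
            f"and {counts.get('image_count', 0)} image captions"
        )
        lines.append(f"Comparison summary: {comparison}")
        lines.append("")  # blank line between queries
    return "\n".join(lines)


# Example with made-up placeholder data (not real evaluation results)
example_results = [
    {
        "query": "placeholder query",
        "multimodal_results": {"text_count": 2, "image_count": 1},
        "comparison": "placeholder comparison text",
    }
]
print(summarize_evaluations(example_results))
```

Printing the summary before sending it to the model makes it easier to see what the evaluator will actually receive.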

## Evaluating Multi-Modal RAG vs Text-Only RAG
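
The evaluation in this section supplies an optional reference answer containing specific BLEU scores. As a rough sanity check that is independent of the LLM-based comparison, one could measure how many of the reference's decimal figures appear in a generated response. The sketch below is not how `evaluate_multimodal_vs_textonly` uses reference answers. The helper name `numeric_recall` and the sample response are hypothetical, and the sketch assumes scores are written as decimals such as `27.3`.

```python
import re


def numeric_recall(reference: str, response: str) -> float:
    """Fraction of decimal numbers in the reference that also appear in the response.

    Only decimal figures (e.g. '27.3') are considered, so years such as
    '2014' are ignored. This is an assumption of the sketch.
    """
    decimal_pattern = r"\d+\.\d+"
    ref_numbers = set(re.findall(decimal_pattern, reference))
    if not ref_numbers:
        return 1.0  # nothing to check against
    resp_numbers = set(re.findall(decimal_pattern, response))
    return len(ref_numbers & resp_numbers) / len(ref_numbers)


reference = (
    "The Transformer (base model) achieves a BLEU score of 27.3 on the WMT 2014 "
    "English-to-German translation task and 38.1 on the WMT 2014 "
    "English-to-French translation task."
)
# Hypothetical response used only to illustrate the check
response = "The base model reaches 27.3 BLEU on English-to-German."

print(numeric_recall(reference, response))
```

A low score suggests the response omitted a figure from the reference, though exact string matching will miss values that are rounded or formatted differently.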

In [17]:
# Path to your PDF document
pdf_path = "/Users/kekunkoya/Desktop/ISEM 770 Class Project/attention_is_all_you_need.pdf"

# Define test queries (ideally covering both text and visual content)
test_queries = [
    "What is the BLEU score of the Transformer (base model)?",
]

# Optional reference answers for evaluation
reference_answers = [
    "The Transformer (base model) achieves a BLEU score of 27.3 on the WMT 2014 English-to-German translation task and 38.1 on the WMT 2014 English-to-French translation task.",
]

# Run evaluation
evaluation_results = evaluate_multimodal_vs_textonly(
    pdf_path=pdf_path,
    test_queries=test_queries,
    reference_answers=reference_answers
)

# Print overall analysis
print("\n=== OVERALL ANALYSIS ===\n")
print(evaluation_results["overall_analysis"])

=== EVALUATING MULTI-MODAL RAG VS TEXT-ONLY RAG ===


Processing document for multi-modal RAG...
Extracting content from /Users/kekunkoya/Desktop/ISEM 770 Class Project/attention_is_all_you_need.pdf...
Extracted 15 text segments and 3 images
Created 56 text chunks
Generating captions for 3 images...
Processing image 1/3...
Processing image 2/3...
Processing image 3/3...
Creating embeddings for all content...
Added 59 items to vector store (56 text chunks, 3 image captions)

Processing document for text-only RAG...
Extracting content from /Users/kekunkoya/Desktop/ISEM 770 Class Project/attention_is_all_you_need.pdf...
Extracted 15 text segments and 3 images
Created 56 text chunks
Creating embeddings for text-only content...
Added 56 text items to text-only vector store


=== Evaluating Query 1: What is the BLEU score of the Transformer (base model)? ===

Running multi-modal RAG...

=== Processing query: What is the BLEU score of the Transformer (base model)? ===

Retrieved 5 relevant it