In [None]:
# Copyright 2024 Google LLC
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     https://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

# Building a DIY Healthcare Multimodal Question Answering System with Vertex AI (A Beginner's Guide - Multimodal RAG)

<table align="left">
  <td style="text-align: center">
    <a href="https://colab.research.google.com/github/GoogleCloudPlatform/generative-ai/blob/main/gemini/qa-ops/building_DIY_multimodal_qa_healthcare_system_with_mrag.ipynb">
      <img width="32px" src="https://www.gstatic.com/pantheon/images/bigquery/welcome_page/colab-logo.svg" alt="Google Colaboratory logo"><br> Run in Colab
    </a>
  </td>
  <td style="text-align: center">
    <a href="https://console.cloud.google.com/vertex-ai/colab/import/https:%2F%2Fraw.githubusercontent.com%2FGoogleCloudPlatform%2Fgenerative-ai%2Fmain%2Fgemini%2Fqa-ops%2Fbuilding_DIY_multimodal_qa_healthcare_system_with_mrag.ipynb">
      <img width="32px" src="https://lh3.googleusercontent.com/JmcxdQi-qOpctIvWKgPtrzZdJJK-J3sWE1RsfjZNwshCFgE_9fULcNpuXYTilIR2hjwN" alt="Google Cloud Colab Enterprise logo"><br> Run in Colab Enterprise
    </a>
  </td>
  <td style="text-align: center">
    <a href="https://github.com/GoogleCloudPlatform/generative-ai/blob/main/gemini/qa-ops/building_DIY_multimodal_qa_healthcare_system_with_mrag.ipynb">
      <img width="32px" src="https://upload.wikimedia.org/wikipedia/commons/9/91/Octicons-mark-github.svg" alt="GitHub logo"><br> View on GitHub
    </a>
  </td>
  <td style="text-align: center">
    <a href="https://console.cloud.google.com/vertex-ai/workbench/deploy-notebook?download_url=https://raw.githubusercontent.com/GoogleCloudPlatform/generative-ai/main/gemini/qa-ops/building_DIY_multimodal_qa_healthcare_system_with_mrag.ipynb">
      <img src="https://www.gstatic.com/images/branding/gcpiconscolors/vertexai/v1/32px.svg" alt="Vertex AI logo"><br> Open in Vertex AI Workbench
    </a>
  </td>    
</table>

| | |
|-|-|
|Author(s) | [Lavi Nigam](https://github.com/lavinigam-gcp), [Ayo Adedeji]()  |

## Overview

Welcome to this hands-on workshop where we explore the cutting-edge field of multimodal Retrieval-Augmented Generation (RAG). In this session, we'll focus on utilizing both textual and visual data to perform complex question answering (Q&A) over CVS and Medicare documents. This approach not only enriches the interaction with the data but also enhances the decision-making process by leveraging a broader spectrum of information.

## Introduction to Retrieval-Augmented Generation (RAG)

Retrieval-Augmented Generation has revolutionized the way Large Language Models (LLMs) interact with information by enabling them to access external data. This not only enriches their knowledge base but also grounds their responses, significantly reducing the likelihood of generating inaccurate information, commonly referred to as "hallucinations."

<img src="https://storage.googleapis.com/cvs-era-of-gemini-workshop/images/multimodal-rag.png" alt="Multimodal RAG Diagram" height="500">

## Understanding Gemini: A Multimodal AI by DeepMind

Gemini is a suite of generative AI models developed by Google DeepMind tailored for multimodal applications. With Gemini, you can process and analyze data that includes text, images, audio, and even videos. The API provides seamless access to various versions of the Gemini models, each designed for specific capabilities:

- **Gemini 1.0 Pro Vision**: Supports multimodal prompts. You can include text, images, and video in your prompt requests and get text or code responses.
- **Gemini 1.0 Pro**: Designed to handle natural language tasks, multi-turn text and code chat, and code generation.
- **Gemini 1.5 Pro**: Created to be multimodal (text, images, audio, PDFs, code, videos) and to scale across a wide range of tasks with up to 1M input tokens.

## Advantages of Multimodal RAG over Text-Based RAG

Multimodal RAG extends the capabilities of text-only RAG by integrating visual processing, which offers several advantages:

- **Enhanced Knowledge Access**: By analyzing both text and visual content, the model taps into a richer knowledge base, providing more comprehensive insights.
- **Improved Reasoning Capabilities**: Visual cues enable the model to make more informed inferences, improving accuracy and relevance in responses.

## Objectives and Learning Outcomes

In this workshop, you will learn to build a robust document search engine using multimodal RAG. Here are the key steps we will cover:

- **Document Processing**: Extract and store metadata from documents containing text and images.
- **Embedding Generation**: Produce text and multimodal embeddings to facilitate efficient search capabilities.
- **Metadata Searching**:
    - Use text queries to locate similar text or images within the document metadata.
    - Use image queries to find related images.
    - Employ text queries to search for contextual answers using both text and image data.

## Hands-on Examples

Throughout this workshop, you'll engage in practical exercises that illustrate how to:

- **Construct a Multimedia Metadata Repository**: This repository will serve as the backbone of your document search engine, enabling sophisticated search, comparison, and reasoning across various types of data.
- **Use Gemini API for Multimodal Queries**: Learn how to send mixed media prompts to Gemini models and interpret the outputs effectively.


## Conclusion

By the end of this workshop, you'll have a foundational understanding of how to implement and utilize multimodal RAG for enhanced data interaction and retrieval. This knowledge will empower you to handle more complex data sets and contribute to advancements in the field of generative AI.

## Costs

This tutorial uses billable components of Google Cloud:

- Vertex AI

Learn about [Vertex AI pricing](https://cloud.google.com/vertex-ai/pricing) and use the [Pricing Calculator](https://cloud.google.com/products/calculator/) to generate a cost estimate based on your projected usage.


## Getting Started


### Install Vertex AI SDK for Python and other dependencies


In [None]:
! pip3 install --upgrade --user google-cloud-aiplatform pymupdf

### Restart current runtime

To use the newly installed packages in this Jupyter runtime, you must restart the runtime. You can do this by running the cell below, which will restart the current kernel.

In [None]:
# Restart kernel after installs so that your environment can access the new packages
import IPython
import time

app = IPython.Application.instance()
app.kernel.do_shutdown(True)

<div class="alert alert-block alert-warning">
<b>⚠️ The kernel is going to restart. Please wait until it is finished before continuing to the next step. ⚠️</b>
</div>



### Authenticate your notebook environment (Colab only)

If you are running this notebook on Google Colab, run the following cell to authenticate your environment. This step is not required if you are using [Vertex AI Workbench](https://cloud.google.com/vertex-ai-workbench).


In [None]:
import sys

# Additional authentication is required for Google Colab
if "google.colab" in sys.modules:
    # Authenticate user to Google Cloud
    from google.colab import auth

    auth.authenticate_user()

### Define Google Cloud project information


In [None]:
# Define project information

import sys

PROJECT_ID = ""  # @param {type:"string"}
LOCATION = "us-central1"  # @param {type:"string"}

# if not running on colab, try to get the PROJECT_ID automatically
if "google.colab" not in sys.modules:
    import subprocess

    PROJECT_ID = subprocess.check_output(
        ["gcloud", "config", "get-value", "project"], text=True
    ).strip()

print(f"Your project ID is: {PROJECT_ID}")
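
On Colab, `PROJECT_ID` stays empty unless you fill it in, and an empty value only surfaces later as a harder-to-read error. The sketch below shows one way to fail early. The helper name `require_project_id` is hypothetical and is not part of the Vertex AI SDK or this notebook's utilities.

```python
def require_project_id(project_id: str) -> str:
    """Return the stripped project ID, or raise if it is empty."""
    cleaned = (project_id or "").strip()
    if not cleaned:
        # Fail here rather than later, inside a Vertex AI API call
        raise ValueError("PROJECT_ID is empty; set it before initializing Vertex AI.")
    return cleaned
```

You could call `PROJECT_ID = require_project_id(PROJECT_ID)` right before initializing Vertex AI.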

In [None]:
import sys

# Initialize Vertex AI
import vertexai

vertexai.init(project=PROJECT_ID, location=LOCATION)

### Import libraries


In [None]:
from rich import print as rich_print
from rich.markdown import Markdown as rich_Markdown
from IPython.display import Markdown, display
from vertexai.generative_models import (
    Content,
    GenerationConfig,
    GenerationResponse,
    GenerativeModel,
    HarmCategory,
    HarmBlockThreshold,
    Image,
    Part,
)
from vertexai.language_models import TextEmbeddingModel
from vertexai.vision_models import MultiModalEmbeddingModel

### Load the Gemini 1.0 Pro, Gemini 1.5 Pro, and Gemini 1.0 Pro Vision models


Learn more about each model and their differences [here](https://cloud.google.com/vertex-ai/generative-ai/docs/multimodal/send-multimodal-prompts).

Learn about the quotas: [here](https://cloud.google.com/vertex-ai/generative-ai/docs/quotas)

In [None]:
# Instantiate text model with appropriate name and version
text_model = GenerativeModel("gemini-1.0-pro")  # works with text, code

# Multimodal models: Choose based on your performance/cost needs
multimodal_model_15 = GenerativeModel(
    "gemini-1.5-pro-preview-0409"
)  # works with text, code, images, video(with or without audio) and audio(mp3) with 1M input context

multimodal_model_10 = GenerativeModel(
    "gemini-1.0-pro-vision-001"
)  # works with text, code, video(without audio) and images with 16k input context

# Load text embedding model from pre-trained source
text_embedding_model = TextEmbeddingModel.from_pretrained("textembedding-gecko@003")

# Load multimodal embedding model from pre-trained source
multimodal_embedding_model = MultiModalEmbeddingModel.from_pretrained(
    "multimodalembedding"
)  # works with image, image with caption(~32 words), video, video with caption(~32 words)

### Download custom Python modules and utilities

The cell below will download some helper functions needed for this notebook, to improve readability. You can also view the code (`multimodal_qa_with_rag_utils`) [directly](https://storage.googleapis.com/github-repo/rag/intro_multimodal_rag/utils/multimodal_qa_with_rag_utils.py).

In [None]:
import requests
import os

url = "https://storage.googleapis.com/github-repo/rag/intro_multimodal_rag/utils/multimodal_qa_with_rag_utils.py"
folder_name = "utils"
filename = "multimodal_qa_with_rag_utils.py"

# Create the folder if it doesn't exist
os.makedirs(folder_name, exist_ok=True)

# Construct the full file path within the folder
file_path = os.path.join(folder_name, filename)

response = requests.get(url)
response.raise_for_status()

with open(file_path, "wb") as f:
    f.write(response.content)

print(f"Downloaded {filename} to {folder_name} successfully.")

#### Get documents and images from GCS

In [None]:
# download documents and images used in this notebook
!gsutil -m rsync -r gs://cvs-era-of-gemini-workshop/materials/multimodal-rag .
print("Download completed")

## Building metadata of documents containing text and images

### The data

The source data that you will use in this notebook consists of a collection of CVS benefit plan and Medicare benefit documents. These documents provide detailed information about over-the-counter costs, services covered, and related healthcare benefits.

### Import helper functions to build metadata

Before building the multimodal RAG system, it's important to have metadata for all the text and images in the documents. For reference and citation purposes, the metadata should contain essential elements, including page number, file name, image counter, and so on. In the next step, you will also generate embeddings from this metadata, which are required to perform similarity search when querying the data.

In [None]:
from utils.multimodal_qa_with_rag_utils import (
    get_document_metadata,
    set_global_variable,
)

set_global_variable("text_embedding_model", text_embedding_model)
set_global_variable("multimodal_embedding_model", multimodal_embedding_model)

### Extract and store metadata of text and images from a document

You just imported a function called `get_document_metadata()`. This function extracts text and image metadata from a document, and returns two dataframes, namely *text_metadata* and *image_metadata*, as outputs. If you want to find out more about how the `get_document_metadata()` function is implemented using Gemini and the embedding models, you can take a look at the [source code](https://raw.githubusercontent.com/GoogleCloudPlatform/generative-ai/main/gemini/use-cases/retrieval-augmented-generation/utils/intro_multimodal_rag_utils.py) directly.

The reason for extracting and storing both text metadata and image metadata is that either one alone is not sufficient to produce a relevant answer. For example, the relevant answer could be in visual form within a document, and text-based RAG cannot take the images into account. You will explore this example later in this notebook.


In the next step, you will use the function to extract and store metadata of text and images from a document. Please note that the following cell may take a few minutes to complete:

Note:

The current implementation works best:

1) if your documents are a combination of text and images.
2) if the tables in your documents are available as images.
3) if the images in the document don't require too much context.

Additionally,

1) If you want to run this on text-only documents, use normal RAG
2) If your documents contain particular domain knowledge, pass that information in the prompt below.

In [None]:
# Specify the folder that contains the PDF files

# pdf_folder_path = "/content/data/" # if running in Google Colab/Colab Enterprise
pdf_folder_path = "data/"  # if running in Vertex AI Workbench.

# Specify the image description prompt. Change it
# image_description_prompt = """Explain what is going on in the image.
# If it's a table, extract all elements of the table.
# If it's a graph, explain the findings in the graph.
# Do not include any numbers that are not mentioned in the image.
# """

image_description_prompt = """You are a technical image analysis expert. You will be provided with various types of images extracted from documents like research papers, technical blogs, and more.
Your task is to generate concise, accurate descriptions of the images without adding any information you are not confident about.
Focus on capturing the key details, trends, or relationships depicted in the image.

Important Guidelines:
* Prioritize accuracy:  If you are uncertain about any detail, state "Unknown" or "Not visible" instead of guessing.
* Avoid hallucinations: Do not add information that is not directly supported by the image.
* Be specific: Use precise language to describe shapes, colors, textures, and any interactions depicted.
* Consider context: If the image is a screenshot or contains text, incorporate that information into your description.
"""

# Extract text and image metadata from the PDF document
text_metadata_df, image_metadata_df = get_document_metadata(
    multimodal_model_10,  # we are passing gemini 1.0 pro vision model
    pdf_folder_path,
    image_save_dir="images",
    image_description_prompt=image_description_prompt,
    embedding_size=1408,
    # add_sleep_after_page = True, # Uncomment this if you are running into API quota issues
    # sleep_time_after_page = 5,
    add_sleep_after_document=True,  # Pause after each document to reduce API quota issues
    sleep_time_after_document=5,
    # generation_config = # see next cell
    # safety_settings =  # see next cell
)

print("\n\n --- Completed processing. ---")
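
The sleep parameters above space out requests with a fixed pause. If you still run into quota errors, another common pattern is to retry a failing call with exponentially growing waits. The following is a minimal sketch of that pattern, not how `get_document_metadata()` is implemented. The name `call_with_backoff` and its parameters are hypothetical, and a real version would likely catch only quota-related exceptions rather than every `Exception`.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def call_with_backoff(
    fn: Callable[[], T],
    max_attempts: int = 4,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call fn(), retrying with exponentially growing waits when it raises."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except Exception:
            # Out of attempts: re-raise the last error to the caller
            if attempt == max_attempts - 1:
                raise
            # Wait 1x, 2x, 4x, ... the base delay before the next attempt
            sleep(base_delay * (2**attempt))
    raise RuntimeError("max_attempts must be at least 1")
```

The `sleep` argument is injectable so that you can test the retry logic without actually waiting.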

In [None]:
# # Parameters for Gemini API call.
# # reference for parameters: https://cloud.google.com/vertex-ai/docs/generative-ai/model-reference/gemini

# generation_config=  GenerationConfig(temperature=0.2, max_output_tokens=2048)

# # Set the safety settings if Gemini is blocking your content or you are facing "ValueError("Content has no parts")" error or "Exception occured" in your data.
# # ref for settings and thresholds: https://cloud.google.com/vertex-ai/docs/generative-ai/multimodal/configure-safety-attributes

# safety_settings = {
#                   HarmCategory.HARM_CATEGORY_HARASSMENT: HarmBlockThreshold.BLOCK_NONE,
#                   HarmCategory.HARM_CATEGORY_HATE_SPEECH: HarmBlockThreshold.BLOCK_NONE,
#                   HarmCategory.HARM_CATEGORY_SEXUALLY_EXPLICIT: HarmBlockThreshold.BLOCK_NONE,
#                   HarmCategory.HARM_CATEGORY_DANGEROUS_CONTENT: HarmBlockThreshold.BLOCK_NONE,
#                   }

# # You can also pass parameters and safety_setting to "get_gemini_response" function

#### Inspect the processed text metadata


The following cell will produce a metadata table which describes the different parts of text metadata, including:

- **text**: the original text from the page
- **text_embedding_page**: the embedding of the original text from the page
- **chunk_text**: the original text divided into smaller chunks
- **chunk_number**: the index of each text chunk
- **text_embedding_chunk**: the embedding of each text chunk

In [None]:
text_metadata_df.head()
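
To make the relationship between **text**, **chunk_text**, and **chunk_number** more concrete, here is a minimal sketch of splitting page text into overlapping character chunks. It is an illustration only: the function name `split_into_chunks`, the chunk size, and the overlap are assumptions, and the notebook's helper may chunk text differently.

```python
def split_into_chunks(
    text: str, chunk_size: int = 1000, overlap: int = 100
) -> dict[int, str]:
    """Split text into overlapping character chunks keyed by a 1-based chunk number."""
    if chunk_size <= 0 or not 0 <= overlap < chunk_size:
        raise ValueError("need chunk_size > 0 and 0 <= overlap < chunk_size")
    # Each chunk starts `step` characters after the previous one
    step = chunk_size - overlap
    chunks = {}
    for number, start in enumerate(range(0, len(text), step), start=1):
        chunks[number] = text[start : start + chunk_size]
    return chunks
```

The overlap keeps a sentence that straddles a chunk boundary visible in both neighboring chunks, which tends to help retrieval.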

#### Inspect the processed image metadata

The following cell will produce a metadata table which describes the different parts of image metadata, including:
* **img_desc**: Gemini-generated textual description of the image.
* **mm_embedding_from_text_desc_and_img**: Combined embedding of image and its description, capturing both visual and textual information.
* **mm_embedding_from_img_only**: Image embedding without description, for comparison with description-based analysis.
* **text_embedding_from_image_description**: Separate text embedding of the generated description, enabling textual analysis and comparison.

In [None]:
image_metadata_df.head()

### Import the helper functions to implement RAG

You will be importing the following functions which will be used in the remainder of this notebook to implement RAG:

* **get_similar_text_from_query():** Given a text query, finds relevant text in the documents using cosine similarity over the text embeddings stored in the metadata. The results can be filtered by top score, page/chunk number, or embedding size.
* **print_text_to_text_citation():** Prints the source (citation) and details of the retrieved text from the `get_similar_text_from_query()` function.
* **get_similar_image_from_query():** Given an image path or an image, finds images from the document which are relevant. It uses image embeddings from the metadata.
* **print_text_to_image_citation():** Prints the source (citation) and the details of retrieved images from the `get_similar_image_from_query()` function.
* **get_gemini_response():** Interacts with a Gemini model to answer questions based on a combination of text and image inputs.
* **display_images():**  Displays a series of images provided as paths or PIL Image objects.

In [None]:
from utils.multimodal_qa_with_rag_utils import (
    get_similar_text_from_query,
    print_text_to_text_citation,
    get_similar_image_from_query,
    print_text_to_image_citation,
    get_gemini_response,
    display_images,
    get_answer_from_qa_system,
)
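
The search helpers rank stored embeddings by cosine similarity to the query embedding. The sketch below shows that ranking step in plain Python on toy vectors. The names `cosine_similarity` and `top_n_matches`, and the row layout, are illustrative assumptions and not the helpers' actual code.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_n_matches(
    query_embedding: list[float],
    rows: list[dict],
    embedding_key: str,
    top_n: int = 3,
) -> list[tuple[float, dict]]:
    """Score every row against the query and return the best top_n as (score, row)."""
    scored = [
        (cosine_similarity(query_embedding, row[embedding_key]), row) for row in rows
    ]
    # Highest similarity first
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:top_n]
```

The same ranking applies whether the stored vectors are text embeddings, image embeddings, or embeddings of image descriptions, as long as the query is embedded with the same model.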

Before implementing a multimodal RAG, let's take a step back and explore what you can achieve with just text or image embeddings alone. It will help to set the foundation for implementing a multimodal RAG, which you will be doing in the later part of the notebook. You can also use these essential elements together to build applications for multimodal use cases for extracting meaningful information from the document.

## Text Search

Let's start with a simple question and see whether text search over the text embeddings can answer it. The expected answer is the price of Fexofenadine tablets, which appears in the documents only inside an image.


In [None]:
# query = "When can I receive my over-the-counter quarterly benefit?" # Answer present only in text

query = "What is the price of Fexofenadine tablets?"  # Answer present only in images

### Search similar text with text query

In [None]:
# Matching user text query with "chunk_embedding" to find relevant chunks.
matching_results_text = get_similar_text_from_query(
    query,
    text_metadata_df,
    column_name="text_embedding_chunk",
    top_n=3,
    chunk_text=True,
)

# Print the matched text citations
print_text_to_text_citation(
    matching_results_text, print_top=True, chunk_text=True
)  # print_top=False to see all text matches

In this exercise, you'll notice that the high-scoring match initially seems to contain the information we need, but a detailed review reveals it lacks specific pricing details for Fexofenadine tablets. This omission occurs because the pricing information is presented in an image format within the document, not as searchable text. Consequently, without the ability to process and interpret image data, critical details like this could be overlooked.

To confirm this, let's pass the retrieved text chunks to Gemini 1.5 Pro and Gemini 1.0 Pro and see whether either model can answer the question from text context alone. This is a basic text-only RAG implementation: the model sees only the retrieved text, so anything that exists only in images stays out of reach.

### Get answer with text-RAG

In [None]:
# All relevant text chunk found across documents based on user query
context = "\n".join(
    [value["chunk_text"] for key, value in matching_results_text.items()]
)

prompt = f"""Answer the question with the given context.
Question: {query}
Context: {context}
Answer:
"""
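
The context above is a plain concatenation of chunks. If you want the model to be able to point back to its sources, one option is to label each chunk before joining and to cap the total length. The sketch below assumes each match is a dict with a `chunk_text` key (as used above) and optionally `file_name` and `page_num` keys. Those two key names and the function `build_cited_context` are assumptions for illustration.

```python
def build_cited_context(results: dict, max_chars: int = 6000) -> str:
    """Join retrieved chunks, labelling each with its source, up to max_chars."""
    parts = []
    used = 0
    for match in results.values():
        # Fall back to placeholders when source details are missing
        label = f"[{match.get('file_name', 'unknown file')}, page {match.get('page_num', '?')}]"
        entry = f"{label}\n{match['chunk_text']}"
        # Stop before the context grows past the character budget
        if used + len(entry) > max_chars:
            break
        parts.append(entry)
        used += len(entry)
    return "\n\n".join(parts)
```

You could then pass `build_cited_context(matching_results_text)` as the context in the prompt instead of the plain join.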

In [None]:
safety_settings = {
    HarmCategory.HARM_CATEGORY_HATE_SPEECH: HarmBlockThreshold.BLOCK_NONE,
    HarmCategory.HARM_CATEGORY_SEXUALLY_EXPLICIT: HarmBlockThreshold.BLOCK_NONE,
    HarmCategory.HARM_CATEGORY_DANGEROUS_CONTENT: HarmBlockThreshold.BLOCK_NONE,
    HarmCategory.HARM_CATEGORY_HARASSMENT: HarmBlockThreshold.BLOCK_NONE,
}

In [None]:
%%time
# Generate response with Gemini 1.5 Pro
print("\n **** Result: ***** \n")

Markdown(
    get_gemini_response(
        multimodal_model_15,
        model_input=prompt,
        stream=True,
        safety_settings=safety_settings,
        generation_config=GenerationConfig(temperature=1, max_output_tokens=8192),
    )
)

In [None]:
%%time
# Generate response with Gemini 1.0 Pro
Markdown(
    get_gemini_response(
        text_model,
        model_input=prompt,
        stream=True,
        safety_settings=safety_settings,
        generation_config=GenerationConfig(temperature=0.4),
    )
)

You can expect a response like the one below:

*"I'm sorry, but the price of Fexofenadine tablets is not listed in the provided context."*

This outcome aligns with our previous discussions. None of the text sections contain the pricing information you're looking for, primarily because it is presented in image form within the document, rather than as text. To tackle this issue, let's explore how we can effectively utilize the capabilities of Gemini 1.0 Pro Vision along with Multimodal Embeddings to extract and interpret the data embedded in images.

### Search similar images with text query

Since a plain text search didn't yield the desired results, and the information may be visually represented in a table or another image format, you will leverage the multimodal capabilities of the Gemini 1.0 Pro Vision model for this task. The goal is to find an image that correlates with your text query about the pricing information. Additionally, you can print the citations to validate the accuracy of the retrieved images.

In [None]:
query = "What is the price of Fexofenadine tablets?"

In [None]:
matching_results_image = get_similar_image_from_query(
    text_metadata_df,
    image_metadata_df,
    query=query,
    column_name="text_embedding_from_image_description",  # Use image description text embedding
    image_emb=False,  # Use text embedding instead of image embedding
    top_n=5,
    embedding_size=1408,
)

# Markdown(print_text_to_image_citation(matching_results_image, print_top=True))
print("\n **** Result: ***** \n")

# Display the top matching image
display_images(
    [
        matching_results_image[0]["img_path"],
    ],
    resize_ratio=0.8,
)

Bingo! It found exactly what you were looking for. You wanted the details on the pricing for Fexofenadine tablets, and this image fits the bill thanks to its Gemini-generated description stored in the metadata.

You can also submit the image along with its description to a Gemini multimodal model (the cell below uses Gemini 1.5 Pro) and ask it to answer the pricing question and explain the result:

In [None]:
%%time
print("\n **** Result: ***** \n")

instruction = f"""Answer the question and explain results with the given Image:
Question: {query}
Image:
"""

# Prepare the model input
model_input = [
    instruction,
    # passing the top matched image and its description to Gemini
    "Image:",
    matching_results_image[0]["image_object"],
    "Description:",
    matching_results_image[0]["image_description"],
]

# Generate Gemini response with streaming output
Markdown(
    get_gemini_response(
        multimodal_model_15,  # we are passing Gemini 1.5 Pro
        model_input=model_input,
        stream=True,
        safety_settings=safety_settings,
        generation_config=GenerationConfig(temperature=1),
    )
)

In [None]:
## you can check the citations to probe further.
## check the "image description:" which is a description extracted through gemini which helped search our query.
Markdown(print_text_to_image_citation(matching_results_image, print_top=True))

## Image Search

### Search similar image with image input [using multimodal image embeddings]

Imagine using an image as your search query instead of text. For instance, you have a table detailing the cost of revenue for two years and you want to find other images that resemble it, whether they are in the same document or across multiple documents.

Think of it as navigating with a visual map rather than a written address. It's a unique way to request, "Show me more like this." Instead of typing out "cost of revenue 2020-2021 table," you simply present a picture of that table and say, "Find me more like this."

For demonstration purposes in this exercise, we will focus on finding similar images within a single document—specifically, images that depict over-the-counter labels or something similar. However, this method can be expanded to locate relevant images across multiple documents, showcasing the scalability of this visual search approach.

In [None]:
# You can find a similar image as per the images you have in the metadata.
# In this case, you have an over-the-counter label and you would like to find similar labels in the documents.
image_query_path = "images/Catalog Wellcare.pdf_image_3_3_22.jpeg"

# Print a message indicating the input image
print("***Input image from user:***")

# Display the input image
Image.load_from_file(image_query_path)

You expect to find images in the documents that resemble the over-the-counter label used as input.

In [None]:
# Search for Similar Images Based on Input Image and Image Embedding

matching_results_image = get_similar_image_from_query(
    text_metadata_df,
    image_metadata_df,
    query=query,  # Use query text for additional filtering (optional)
    column_name="mm_embedding_from_img_only",  # Use image embedding for similarity calculation
    image_emb=True,
    image_query_path=image_query_path,  # Use input image for similarity calculation
    top_n=3,  # Retrieve top 3 matching images
    embedding_size=1408,  # Use embedding size of 1408
)

print("\n **** Result: ***** \n")

# Display the Top Matching Image
display(
    matching_results_image[0]["image_object"]
)  # Display the top matching image object (Pillow Image)

The search successfully identified a label that closely resembles the one provided, which lists pricing and details for various over-the-counter allergy relief medications. More importantly, both labels feature the "OTCH Eligible" tag followed by a SKU number, which is crucial for matching products to your health plan's catalog.

You can also view the citation to understand how the match was determined and to verify the accuracy of the information retrieved from the image data. This process demonstrates the capability of multimodal search to not only recognize similar visual patterns but also to contextualize the information within your specific healthcare framework.

In [None]:
# Display citation details for the top matching image
print_text_to_image_citation(
    matching_results_image, print_top=True
)  # Print citation details for the top matching image

The ability to identify similar text and images based on user input, powered by Gemini and embeddings, forms a crucial foundation for developing multimodal RAG systems, which you will explore later in this notebook.

### Comparative reasoning

Next, let's apply what you have done so far to comparative reasoning.

For this example:

* **Step 1:** You will search all the images for a specific query

* **Step 2:** Send those images to Gemini 1.5 Pro to ask multiple questions, where it has to compare among those images and provide you with answers.

In [None]:
matching_results_image_query_1 = get_similar_image_from_query(
    text_metadata_df,
    image_metadata_df,
    query="Show me images in the documents of a Doctor or Pharmacist speaking directly to a patient",
    column_name="text_embedding_from_image_description",  # Use image description text embedding # mm_embedding_from_img_only text_embedding_from_image_description
    image_emb=False,  # Use text embedding instead of image embedding
    top_n=3,
    embedding_size=1408,
)

In [None]:
# Check the matched images
# Display the three matched images (indices 0, 1, and 2)

print("---------------Matched Images------------------\n")
display_images(
    [
        matching_results_image_query_1[0]["img_path"],
        matching_results_image_query_1[1]["img_path"],
        matching_results_image_query_1[2]["img_path"],
    ],
    resize_ratio=0.2,
)

In [None]:
prompt = f"""Task: Answer the following questions in detail, providing clear reasoning and evidence from the images in bullet points.
Question:
 - What are the differences in settings? Are they all clinical settings?
 - What are the differences in average ages across all images?
"""

In [None]:
%%time
# Generate response with Gemini 1.5 Pro
print("\n **** Result: ***** \n")
Markdown(
    get_gemini_response(
        multimodal_model_15,
        model_input=[
            prompt,
            "Images:",
            matching_results_image_query_1[0]["image_object"],
            matching_results_image_query_1[1]["image_object"],
            matching_results_image_query_1[2]["image_object"],
        ],
        stream=True,
        safety_settings=safety_settings,
        generation_config=GenerationConfig(temperature=1, max_output_tokens=8192),
    )
)
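
The cells above index the three matches by hand. If you change `top_n`, a small helper can assemble the model input for however many matches you want. This is a sketch under assumptions: the name `build_image_model_input` is hypothetical, and it expects the results to be either a dict keyed by rank (0, 1, 2, ...) or a list, with each match holding an `image_object` entry as in the cells above.

```python
from typing import Any


def build_image_model_input(prompt: str, results: Any, top_n: int = 3) -> list:
    """Return [prompt, "Images:", image_0, image_1, ...] for the top_n matches."""
    if isinstance(results, dict):
        # Visit matches in rank order: 0, 1, 2, ...
        ordered = [results[key] for key in sorted(results)]
    else:
        ordered = list(results)
    model_input = [prompt, "Images:"]
    for match in ordered[:top_n]:
        model_input.append(match["image_object"])
    return model_input
```

The returned list has the same shape as the `model_input` argument built by hand above.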

## Multimodal retrieval augmented generation (RAG)

Let's bring everything together to implement multimodal RAG, using the elements that you've explored in the previous sections. These are the steps:

* **Step 1:** The user submits a text query whose answer is contained in the documents, in either the text or the images.
* **Step 2:** Find the most relevant text chunks from the document pages, using a method similar to the one you explored in `Text Search`.
* **Step 3:** Find the most relevant images from the pages by matching the user query against `image_description`, using the same method you explored in `Image Search`.
* **Step 4:** Combine the text and images found in steps 2 and 3 into `context_text` and `context_images`.
* **Step 5:** Pass the user query to Gemini together with the text and image context assembled in step 4. You can also add a specific instruction that the model should follow while answering the user query.
* **Step 6:** Gemini produces the answer, and you can print the citations to check all relevant text and images used to address the query.
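
To make steps 2 and 3 more concrete, here is a minimal sketch of embedding-based retrieval. It assumes a cosine-similarity ranking over precomputed embeddings; the notebook's own helpers (`get_similar_text_from_query` and `get_similar_image_from_query`) may work differently. The names `cosine_similarity`, `top_n_matches`, `toy_chunks`, and `toy_query_embedding` are hypothetical and use toy three-dimensional vectors rather than real model embeddings.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_n_matches(query_embedding, items, top_n=3):
    """Rank items (dicts with an 'embedding' key) by similarity to the query."""
    scored = [
        {**item, "score": cosine_similarity(query_embedding, item["embedding"])}
        for item in items
    ]
    scored.sort(key=lambda item: item["score"], reverse=True)
    # Key the results by rank (0 = best match). This is an assumption that
    # mirrors how the notebook indexes its results, e.g. results[0]["img_path"].
    return dict(enumerate(scored[:top_n]))


# Toy 3-dimensional embeddings; real embeddings come from an embedding model.
toy_chunks = [
    {"chunk_text": "Loratadine 10 mg price list", "embedding": [0.9, 0.1, 0.0]},
    {"chunk_text": "Medicare Part B premium", "embedding": [0.1, 0.9, 0.1]},
    {"chunk_text": "Pharmacy opening hours", "embedding": [0.0, 0.2, 0.9]},
]
toy_query_embedding = [0.8, 0.2, 0.1]

matches = top_n_matches(toy_query_embedding, toy_chunks, top_n=2)
for rank, match in matches.items():
    print(rank, match["chunk_text"], round(match["score"], 3))
```

The same ranking applies to images in step 3 if each image is represented by the embedding of its generated description, which is what the `text_embedding_from_image_description` column name suggests.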

### Step 1: User query

In [None]:
# This time, pass only a plain text query (no images).

query = """\
 - What is the price of Loratadine 10 mg?
 - What is the standard Part B premium amount in 2024?
 - What is the Medicare number for John L Smith?
 - What conditions are CVS Specialty services available for?
 - What is the price of Vick's Nyquil Liquicap - 16 CT?
 """

### Step 2: Get all relevant text chunks

In [None]:
# Retrieve relevant chunks of text based on the query
matching_results_chunks_data = get_similar_text_from_query(
    query,
    text_metadata_df,
    column_name="text_embedding_chunk",
    top_n=30,
    chunk_text=True,
)

### Step 3: Get all relevant images

In [None]:
# Get all relevant images based on user query
matching_results_image_fromdescription_data = get_similar_image_from_query(
    text_metadata_df,
    image_metadata_df,
    query=query,
    column_name="text_embedding_from_image_description",
    image_emb=False,
    top_n=30,
    embedding_size=1408,
)

### Step 4: Create context_text and context_images

In [None]:
instruction = """Task: Answer the following questions in detail one by one, providing clear reasoning and evidence from the images and text in bullet points.
Instructions:

1. **Analyze:** Carefully examine the provided images and text context.
2. **Synthesize:** Integrate information from both the visual and textual elements.
3. **Reason:**  Deduce logical connections and inferences to address the question.
4. **Format:** Please format the response as plain text, removing any unintended formatting or mathematical symbols. Ensure all characters are displayed as they appear in the original text.
5. **Respond:** Provide a bulleted, concise, accurate answer in the following format:

   * **Question:** [Question]
   * **Answer:** [Direct response to the question]
   * **Explanation:** [Bullet-point reasoning steps if applicable]
   * **Source:** [Image and Text citation]

6. **Ambiguity:** If the context is insufficient to answer, respond "Not enough context to answer."

"""

# combine all the selected relevant text chunks
context_text = ["Text Context: "]
for value in matching_results_chunks_data.values():
    context_text.extend(
        [
            "Text Source: ",
            # Single-quoted f-string so both quoted fields are closed properly.
            f'file_name: "{value["file_name"]}" Page: "{value["page_num"]}"',
            "Text",
            value["chunk_text"],
        ]
    )

# combine all the selected relevant images
gemini_content = [
    instruction,
    "Questions: ",
    query,
    "Image Context: ",
]
for value in matching_results_image_fromdescription_data.values():
    gemini_content.extend(
        [
            "Image Path: ",
            value["img_path"],
            "Image Description: ",
            value["image_description"],
            "Image:",
            value["image_object"],
        ]
    )
gemini_content.extend(context_text)
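
The assembly above can also be wrapped in a reusable function, which makes the ordering of the prompt parts explicit: instruction, questions, image context, then text context. The sketch below is illustrative only. `build_gemini_content` is a hypothetical helper, the toy dictionaries stand in for the real retrieval results, and a placeholder string replaces the real image object.

```python
def build_gemini_content(instruction, query, image_matches, text_matches):
    """Assemble one flat list: instruction, questions, image context, text context."""
    content = [instruction, "Questions: ", query, "Image Context: "]
    for match in image_matches.values():
        content.extend(
            [
                "Image Path: ",
                match["img_path"],
                "Image Description: ",
                match["image_description"],
                "Image:",
                match["image_object"],
            ]
        )
    content.append("Text Context: ")
    for match in text_matches.values():
        content.extend(
            [
                "Text Source: ",
                f'file_name: "{match["file_name"]}" Page: "{match["page_num"]}"',
                "Text",
                match["chunk_text"],
            ]
        )
    return content


# Toy stand-ins for the retrieval results (hypothetical values).
toy_image_matches = {
    0: {
        "img_path": "images/page_1.jpeg",
        "image_description": "A price table for over-the-counter medication.",
        "image_object": "<image placeholder>",  # a real image object in the notebook
    }
}
toy_text_matches = {
    0: {"file_name": "sample.pdf", "page_num": 1, "chunk_text": "Example chunk."}
}

toy_content = build_gemini_content(
    "Answer using the context.", "What is the price?", toy_image_matches, toy_text_matches
)
for part in toy_content:
    print(part)
```

Keeping each image next to its path and description lets the model cite a specific source when it answers.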

### Step 5: Pass context to Gemini

In [None]:
# Generate final response using Gemini 1.5 Pro
rich_Markdown(
    get_gemini_response(
        multimodal_model_15,
        model_input=gemini_content,
        stream=True,
        safety_settings=safety_settings,
        generation_config=GenerationConfig(temperature=1, max_output_tokens=8192),
    )
)

### Step 6: Print citations and references

In [None]:
print("---------------Matched Images------------------\n")
display_images(
    [
        matching_results_image_fromdescription_data[0]["img_path"],
        matching_results_image_fromdescription_data[1]["img_path"],
        matching_results_image_fromdescription_data[2]["img_path"],
        matching_results_image_fromdescription_data[3]["img_path"],
    ],
    resize_ratio=0.5,
)

In [None]:
# Image citations. Check how the Gemini-generated metadata helped ground the answer.

print_text_to_image_citation(
    matching_results_image_fromdescription_data, print_top=True
)

In [None]:
# Text citations

print_text_to_text_citation(
    matching_results_chunks_data,
    print_top=True,
    chunk_text=True,
)

## Conclusions

Congratulations on making it through this multimodal RAG notebook!

While multimodal RAG can be quite powerful, it has some limitations:

* **Data dependency:** Needs high-quality paired text and visuals.
* **Computationally demanding:** Processing multimodal data is resource-intensive.
* **Domain specific:** Models trained on general data may not shine in specialized fields like medicine.
* **Black box:** Understanding how these models work can be tricky, hindering trust and adoption.


Despite these challenges, multimodal RAG represents a significant step towards search and retrieval systems that can handle diverse, multimodal data.