<h1 style="background: linear-gradient(to right, #ff6b6b, #4ecdc4); 
           color: white; 
           padding: 20px; 
           border-radius: 10px; 
           text-align: center; 
           font-family: Arial, sans-serif; 
           text-shadow: 2px 2px 4px rgba(0,0,0,0.5);">
  Multimodal RAG with Amazon Bedrock, Amazon Nova and LangChain
</h1>

This notebook demonstrates how to implement a multimodal Retrieval-Augmented Generation (RAG) system using **Amazon Bedrock with Amazon Nova and LangChain**. Many documents mix content types such as text and images, and text-only RAG pipelines lose the information carried by the images. With Multimodal Large Language Models (MLLMs), a RAG system can draw on both text and image data.

In this notebook, we'll explore one approach to multimodal RAG (`Option 1`):

1. Use multimodal embeddings (such as Amazon Titan) to embed both images and text
2. Retrieve relevant information using similarity search
3. Pass raw images and text chunks to a multimodal LLM for answer synthesis using Amazon Nova

We'll use the following tools and technologies:

- [LangChain](https://python.langchain.com/v0.2/docs/introduction/) to build a multimodal RAG system
- [faiss](https://github.com/facebookresearch/faiss) for similarity search
- [Amazon Nova](https://docs.aws.amazon.com/nova/latest/userguide/what-is-nova.html) for answer synthesis
- [Amazon Titan Multimodal Embeddings](https://docs.aws.amazon.com/bedrock/latest/userguide/titan-multiemb-models.html) for image embeddings
- [Amazon Bedrock](https://aws.amazon.com/bedrock/) for accessing powerful AI models, like the ones above
- [pymupdf](https://pymupdf.readthedocs.io/en/latest/) to parse images, text, and tables from documents (PDFs)
- [boto3](https://boto3.amazonaws.com/v1/documentation/api/latest/index.html) for interacting with Amazon Bedrock

This approach allows us to create a more comprehensive RAG system that can understand and utilize both textual and visual information from our documents.

## Prerequisites

Before running this notebook, ensure you have the following packages and dependencies installed:

- Python 3.10 or later
- langchain
- boto3
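- faiss (installed as `faiss-cpu`)
- pymupdf
- tabula (installed as `tabula-py`; a different package named `tabula` does not provide `read_pdf`)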
- tesseract
- requests

Let's get started with building our multimodal RAG system using Amazon Bedrock!

![Multimodal RAG with Amazon Bedrock](imgs/multimodal-rag1.png)

<h2 style="background: linear-gradient(to right, #ff6b6b, #4ecdc4, #1e90ff); 
            color: white; 
            padding: 15px; 
            border-radius: 10px; 
            text-align: center; 
            font-family: 'Comic Sans MS', cursive, sans-serif; 
            text-shadow: 2px 2px 4px rgba(0,0,0,0.5);">
   Importing the libs
</h2>

In [1]:
# !pip install --upgrade jpype1 tabula-py PyMuPDF
# !pip install --upgrade boto3 requests numpy tqdm botocore langchain langchain-text-splitters ipython
# !pip install --upgrade faiss-cpu sentence-transformers transformers torch pillow huggingface_hub

In [21]:
import base64
import io
import json
import logging
import os
import warnings

import boto3
import faiss
import numpy as np
import pymupdf
import requests
import tabula
from botocore.exceptions import ClientError
from IPython import display
from langchain_text_splitters import RecursiveCharacterTextSplitter
from PIL import Image
from tqdm import tqdm

logger = logging.getLogger(__name__)
logger.setLevel(logging.ERROR)

warnings.filterwarnings("ignore")

<h2 style="background: linear-gradient(to right, #ff6b6b, #4ecdc4, #1e90ff); 
            color: white; 
            padding: 15px; 
            border-radius: 10px; 
            text-align: center; 
            font-family: 'Comic Sans MS', cursive, sans-serif; 
            text-shadow: 2px 2px 4px rgba(0,0,0,0.5);">
   Data Loading
</h2>

In [23]:
# Download a sample PDF: the "Attention Is All You Need" paper.
# Replace the URL and file path with those of the PDF you want to use.
url = "https://arxiv.org/pdf/1706.03762.pdf"
filepath = r"E:\rag_edubot\data\attention_paper.pdf"
# Create the directory that will hold the downloaded file
os.makedirs(os.path.dirname(filepath), exist_ok=True)
response = requests.get(url, timeout=60)
if response.status_code == 200:
    with open(filepath, 'wb') as file:
        file.write(response.content)
    print(f"File downloaded successfully: {filepath}")
else:
    print(f"Failed to download the file. Status code: {response.status_code}")

File downloaded successfully: E:\rag_edubot\data\attention_paper.pdf


In [24]:
# Open and display the PDF
doc = pymupdf.open(filepath)
num_pages = len(doc)
display.IFrame(filepath, width=1000, height=600)

<h2 style="background: linear-gradient(to right, #ff6b6b, #4ecdc4, #1e90ff); 
            color: white; 
            padding: 15px; 
            border-radius: 10px; 
            text-align: center; 
            font-family: 'Comic Sans MS', cursive, sans-serif; 
            text-shadow: 2px 2px 4px rgba(0,0,0,0.5);">
   Data Extraction
</h2>

In [26]:
# Create directories for outputs
def create_directories(base_dir):
    os.makedirs(f"{base_dir}/text", exist_ok=True)
    os.makedirs(f"{base_dir}/images", exist_ok=True)
    os.makedirs(f"{base_dir}/tables", exist_ok=True)
    os.makedirs(f"{base_dir}/page_images", exist_ok=True)
base_dir = "data"
create_directories(base_dir)

# Process tables (requires tabula-py, which provides tabula.read_pdf)
def process_tables(doc, page_num, base_dir, items):
    try:
        tables = tabula.read_pdf(filepath, pages=page_num + 1, multiple_tables=True)
        for i, table in enumerate(tables):
            table_file_name = f"{base_dir}/tables/{os.path.basename(filepath)}_table_{page_num}_{i}.csv"
            table.to_csv(table_file_name, index=False)
            # Save table text as a string (for embedding)
            table_text = table.to_csv(index=False)
            items.append({"page": page_num, "type": "table", "text": table_text, "path": table_file_name})
    except Exception as e:
        print(f"Error extracting tables from page {page_num}: {e}")


# Process text chunks
def process_text_chunks(text, text_splitter, page_num, base_dir, items):
    chunks = text_splitter.split_text(text)
    for i, chunk in enumerate(chunks):
        text_file_name = f"{base_dir}/text/{os.path.basename(filepath)}_text_{page_num}_{i}.txt"
        with open(text_file_name, 'w', encoding='utf-8') as f:
            f.write(chunk)
        items.append({"page": page_num, "type": "text", "text": chunk, "path": text_file_name})

# Process images
def process_images(page, page_num, base_dir, items):
    images = page.get_images()
    for idx, image in enumerate(images):
        xref = image[0]
        pix = pymupdf.Pixmap(doc, xref)
        # PNG cannot store CMYK data, so convert such pixmaps to RGB before saving
        if pix.n - pix.alpha > 3:
            pix = pymupdf.Pixmap(pymupdf.csRGB, pix)
        image_name = f"{base_dir}/images/{os.path.basename(filepath)}_image_{page_num}_{idx}_{xref}.png"
        pix.save(image_name)
        with open(image_name, 'rb') as f:
            encoded_image = base64.b64encode(f.read()).decode('utf8')
        items.append({"page": page_num, "type": "image", "path": image_name, "image": encoded_image})

# Process page images
def process_page_images(page, page_num, base_dir, items):
    pix = page.get_pixmap()
    page_path = os.path.join(base_dir, f"page_images/page_{page_num:03d}.png")
    pix.save(page_path)
    with open(page_path, 'rb') as f:
        page_image = base64.b64encode(f.read()).decode('utf8')
    items.append({"page": page_num, "type": "page", "path": page_path, "image": page_image})
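
`process_text_chunks` relies on a text splitter that cuts each page into overlapping chunks (the notebook configures it with `chunk_size=700` and `chunk_overlap=200`). To make those two settings concrete, here is a simplified fixed-window chunker. It is not how `RecursiveCharacterTextSplitter` works internally, since that class first tries to split on separators such as paragraph and line breaks; the function name `sliding_window_chunks` and the toy sizes are assumptions for illustration only.

```python
def sliding_window_chunks(text, chunk_size, chunk_overlap):
    """Split text into fixed-size windows that overlap by chunk_overlap characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the window already reached the end of the text
    return chunks


sample = "abcdefghijklmnopqrstuvwxyz"
chunks = sliding_window_chunks(sample, chunk_size=10, chunk_overlap=4)
for chunk in chunks:
    print(chunk)
```

Each chunk repeats the last four characters of the previous one, which is the role the overlap plays in the real splitter: a sentence that straddles a chunk boundary still appears whole in at least one chunk.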

In [14]:
doc = pymupdf.open(filepath)
num_pages = len(doc)
base_dir = "data"
print(f"Number of pages in the PDF: {num_pages}")
# Creating the directories
create_directories(base_dir)
text_splitter = RecursiveCharacterTextSplitter(chunk_size=700, chunk_overlap=200, length_function=len)
items = []

# Process each page of the PDF
for page_num in tqdm(range(num_pages), desc="Processing PDF pages"):
    page = doc[page_num]
    text = page.get_text()
    process_tables(doc, page_num, base_dir, items)
    process_text_chunks(text, text_splitter, page_num, base_dir, items)
    process_images(page, page_num, base_dir, items)
    process_page_images(page, page_num, base_dir, items)

Number of pages in the PDF: 15

Error extracting tables from page 0: module 'tabula' has no attribute 'read_pdf'
Error extracting tables from page 1: module 'tabula' has no attribute 'read_pdf'
Error extracting tables from page 2: module 'tabula' has no attribute 'read_pdf'
Error extracting tables from page 3: module 'tabula' has no attribute 'read_pdf'
Error extracting tables from page 4: module 'tabula' has no attribute 'read_pdf'
Error extracting tables from page 5: module 'tabula' has no attribute 'read_pdf'
Error extracting tables from page 6: module 'tabula' has no attribute 'read_pdf'
Error extracting tables from page 7: module 'tabula' has no attribute 'read_pdf'
Error extracting tables from page 8: module 'tabula' has no attribute 'read_pdf'
Error extracting tables from page 9: module 'tabula' has no attribute 'read_pdf'
Error extracting tables from page 10: module 'tabula' has no attribute 'read_pdf'
Error extracting tables from page 11: module 'tabula' has no attribute 'read_pdf'
Error extracting tables from page 12: module 'tabula' has no attribute 'read_pdf'
Error extracting tables from page 13: module 'tabula' has no attribute 'read_pdf'
Error extracting tables from page 14: module 'tabula' has no attribute 'read_pdf'

Processing PDF pages: 100%|██████████| 15/15 [00:00<00:00, 19.92it/s]
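
The `module 'tabula' has no attribute 'read_pdf'` errors above usually mean that a different package named `tabula` is being imported instead of `tabula-py`. A small diagnostic, sketched under that assumption, checks whether an importable module exposes a callable attribute; the helper name `module_provides` is hypothetical.

```python
import importlib


def module_provides(module_name, attribute):
    """Return True if module_name can be imported and exposes a callable attribute."""
    try:
        module = importlib.import_module(module_name)
    except ImportError:
        return False
    return callable(getattr(module, attribute, None))


if module_provides("tabula", "read_pdf"):
    print("tabula.read_pdf is available; table extraction can run.")
else:
    print("tabula.read_pdf is missing; try 'pip uninstall tabula' followed by 'pip install tabula-py'.")
```

Until the check passes, `process_tables` adds no table items, so retrieval works on text chunks, embedded images, and page renders only.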





In [29]:
# 1. Text Embeddings using Sentence Transformers
from sentence_transformers import SentenceTransformer
text_model = SentenceTransformer('all-MiniLM-L6-v2')

def generate_text_embedding(text):
    embedding = text_model.encode(text)
    return embedding.tolist()

In [30]:
# 2. Image Embeddings using CLIP from Hugging Face
from transformers import CLIPProcessor, CLIPModel
import torch

clip_model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
clip_processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

In [31]:
def generate_image_embedding(image_base64):
    # Decode base64 image
    image_data = base64.b64decode(image_base64)
    image = Image.open(io.BytesIO(image_data)).convert("RGB")
    # Process the image for CLIP
    inputs = clip_processor(images=image, return_tensors="pt")
    with torch.no_grad():
        image_features = clip_model.get_image_features(**inputs)
    # Normalize the embedding vector
    image_features = image_features / image_features.norm(p=2, dim=-1, keepdim=True)
    return image_features.squeeze(0).cpu().numpy().tolist()
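
`generate_image_embedding` scales each CLIP vector to unit length. For unit-length vectors, squared L2 distance and cosine similarity carry the same ranking information, because the squared distance between `a` and `b` equals `2 - 2 * cos(a, b)`. The following pure-Python sketch checks that identity on toy vectors; the helper names are illustrative and not part of the notebook.

```python
import math


def l2_normalize(vector):
    """Scale a vector so that its Euclidean norm is 1."""
    norm = math.sqrt(sum(x * x for x in vector))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vector]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def squared_l2(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


a = l2_normalize([3.0, 4.0])
b = l2_normalize([4.0, 3.0])
cosine = dot(a, b)  # for unit vectors the dot product is the cosine similarity

print("cosine similarity:", cosine)
print("squared L2 distance:", squared_l2(a, b))
print("2 - 2 * cosine:", 2 - 2 * cosine)
```

This is why an L2-based index can stand in for cosine-similarity search once the vectors are normalized.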

In [32]:
# Set embedding vector dimension (for consistency, e.g., 384 for text and 512 for CLIP)
# Note: The dimensions might differ between models. You might consider mapping to a unified dimension if needed.
text_embedding_dimension = 384  # all-MiniLM-L6-v2 outputs 384-dim vectors
image_embedding_dimension = 512  # CLIP-vit-base-patch32 outputs 512-dim vectors

In [33]:
with tqdm(total=len(items), desc="Generating embeddings") as pbar:
    for item in items:
        if item['type'] in ['text', 'table']:
            # Use text embedding model
            item['embedding'] = generate_text_embedding(item['text'])
        elif item['type'] in ['image', 'page']:
            # Use image embedding model
            item['embedding'] = generate_image_embedding(item['image'])
        else:
            item['embedding'] = None  # in case of unknown type
        pbar.update(1)

Generating embeddings: 100%|██████████| 101/101 [00:03<00:00, 27.89it/s]


In [34]:
# Text embeddings (384-dim) and image embeddings (512-dim) cannot share one FAISS index,
# so this cell indexes only the text and table embeddings.

# Collect the text/table embeddings
text_embeddings = [np.array(item['embedding'], dtype=np.float32) 
                   for item in items if item['type'] in ['text', 'table'] and item['embedding'] is not None]

if len(text_embeddings) > 0:
    # Create a FAISS index (using L2 distance)
    index = faiss.IndexFlatL2(text_embedding_dimension)
    index.reset()
    index.add(np.vstack(text_embeddings))
    print(f"FAISS index created with {index.ntotal} embeddings.")
else:
    print("No text embeddings available to index.")


FAISS index created with 83 embeddings.
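
`faiss.IndexFlatL2` performs an exact, brute-force search and reports squared Euclidean distances. As a rough mental model (not FAISS's actual implementation), the search behaves like the following pure-Python function; `flat_l2_search` and the toy vectors are assumptions made for this sketch.

```python
def flat_l2_search(vectors, query, k):
    """Return (distances, positions) of the k stored vectors closest to query."""
    scored = []
    for position, vector in enumerate(vectors):
        distance = sum((v - q) ** 2 for v, q in zip(vector, query))  # squared L2
        scored.append((distance, position))
    scored.sort()  # smallest distance first
    top = scored[:k]
    return [d for d, _ in top], [p for _, p in top]


stored = [[0.0, 0.0], [1.0, 0.0], [0.0, 2.0], [3.0, 3.0]]
toy_distances, toy_positions = flat_l2_search(stored, query=[0.9, 0.1], k=2)
print(toy_positions, toy_distances)
```

Every query is compared against every stored vector, which is perfectly adequate for the 83 vectors indexed here; approximate index types only start to matter for much larger collections.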


In [35]:
# Now you can perform retrieval using the appropriate modality.
# For example, to search using a text query:
query = "Which optimizer was used when training the models?"
query_embedding = np.array(generate_text_embedding(query), dtype=np.float32).reshape(1, -1)
k = 5

In [36]:
distances, indices = index.search(query_embedding, k=k)
print("Nearest neighbors indices:", indices.flatten())

Nearest neighbors indices: [43 44 47 45 25]
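
The indices returned by the search are row positions in the FAISS index, which here holds only the text and table embeddings. They therefore do not line up with positions in `items`, which also contains embedded images and page renders. One way to translate them back, sketched with a toy `items` list and a hypothetical helper `build_position_map`, is to record each indexed item's original position in the same order the vectors were added.

```python
def build_position_map(items, indexed_types=("text", "table")):
    """List the positions in items of every entry added to the index, in insertion order."""
    return [
        position
        for position, item in enumerate(items)
        if item["type"] in indexed_types and item.get("embedding") is not None
    ]


# Toy data standing in for the notebook's `items` list
toy_items = [
    {"type": "text", "text": "chunk A", "embedding": [0.1]},
    {"type": "image", "image": "BASE64_PLACEHOLDER", "embedding": [0.2]},
    {"type": "text", "text": "chunk B", "embedding": [0.3]},
    {"type": "page", "image": "BASE64_PLACEHOLDER", "embedding": [0.4]},
    {"type": "table", "text": "a,b", "embedding": [0.5]},
]

position_map = build_position_map(toy_items)
faiss_rows = [2, 0]  # pretend these came back from index.search
matched = [toy_items[position_map[row]] for row in faiss_rows]
print([item["text"] for item in matched])
```

With the notebook's data, the same map would be built from `items` using the filter that selected `text_embeddings`, and each value in `indices.flatten()` would be looked up through it.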


In [38]:
import requests
import os

url = "https://arxiv.org/pdf/1706.03762.pdf"
filepath = r"E:\rag_edubot\data\attention_paper.pdf"
os.makedirs(os.path.dirname(filepath), exist_ok=True)

def download_file(url, filepath, max_retries=3):
    retries = 0
    while retries < max_retries:
        try:
            with requests.get(url, stream=True, timeout=(10, 60)) as response:
                response.raise_for_status()  # Raise an HTTPError if the HTTP request returned an unsuccessful status code.
                with open(filepath, 'wb') as f:
                    for chunk in response.iter_content(chunk_size=8192):
                        if chunk:  # filter out keep-alive new chunks
                            f.write(chunk)
            print(f"File downloaded successfully: {filepath}")
            return
        except requests.exceptions.ChunkedEncodingError as e:
            print(f"ChunkedEncodingError encountered: {e}. Retrying ({retries + 1}/{max_retries})...")
        except requests.exceptions.ReadTimeout as e:
            print(f"ReadTimeout encountered: {e}. Retrying ({retries + 1}/{max_retries})...")
        except requests.exceptions.RequestException as e:
            print(f"Request error: {e}. Retrying ({retries + 1}/{max_retries})...")
        retries += 1
    print("Failed to download the file after several retries.")

download_file(url, filepath)


File downloaded successfully: E:\rag_edubot\data\attention_paper.pdf
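
`download_file` retries immediately after a failure. If the server is rate limiting or briefly unavailable, waiting a little longer before each retry can help. The following sketch shows one possible exponential backoff wrapper; `retry_with_backoff`, its parameters, and the delays are assumptions, and the `sleep` argument is injectable so the example runs without real waiting.

```python
import time


def retry_with_backoff(action, max_retries=3, base_delay=1.0, sleep=time.sleep):
    """Call action() until it succeeds, doubling the wait after each failure."""
    last_error = None
    for attempt in range(max_retries):
        try:
            return action()
        except Exception as error:  # in practice, narrow this to requests.exceptions.RequestException
            last_error = error
            if attempt < max_retries - 1:
                sleep(base_delay * (2 ** attempt))
    raise RuntimeError(f"all {max_retries} attempts failed") from last_error


# Demonstration with a fake action that fails twice and then succeeds
calls = {"count": 0}
waits = []


def flaky_download():
    calls["count"] += 1
    if calls["count"] < 3:
        raise ConnectionError("temporary failure")
    return "ok"


outcome = retry_with_backoff(flaky_download, sleep=waits.append)
print(outcome, waits)
```

To combine this with the notebook's code, the body of the `try` block in `download_file` could be moved into a function and passed as `action`.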


In [None]:
from huggingface_hub import InferenceApi

# Initialize the inference client with your API token
inference = InferenceApi(repo_id="EleutherAI/gpt-j-6B", token="YOUR_API_TOKEN_HERE")

prompt = """
You are a helpful assistant for question answering.
The following context is retrieved from a set of documents:
The optimizer used was AdamW.
Results show that AdamW outperforms SGD in our experiments.
[IMAGE CONTENT]
Based on the above context, answer the question: Which optimizer was used when training the models?
Answer:
"""

# Create a payload with the prompt and parameters
payload = {
    "inputs": prompt,
    "parameters": {
        "max_new_tokens": 300,
        "do_sample": True,
        "top_p": 0.9,
        "top_k": 20
    }
}

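# Invoke the model: the prompt is passed as `inputs` and the generation settings as `params`
response = inference(inputs=payload["inputs"], params=payload["parameters"])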
print("Generated Response:\n", response)




Generated Response:
 {'error': '401 Unauthorized'}


In [None]:
# -----------------------------------
# Open-Source RAG Response Generation
# -----------------------------------
# This function calls the hosted Hugging Face Inference API, using GPT-J-6B as an example;
# you can substitute another model if desired. No local model files or GPU are needed,
# but the request only succeeds with a valid API token.
def generate_response_hf(prompt, matched_items):
    # Prepare context by concatenating retrieved text
    context_parts = []
    for item in matched_items:
        if item['type'] in ['text', 'table']:
            context_parts.append(item.get("text", ""))
        elif item['type'] in ['image', 'page']:
            # A text-only model cannot read images, so a placeholder marks where one was retrieved
            context_parts.append("[IMAGE CONTENT]")
    context = "\n".join(context_parts)
    
    full_prompt = f"""You are a helpful assistant for question answering.
The following context is retrieved from a set of documents:
{context}

Based on the above context, answer the question: {prompt}
Answer:"""
    
    # Call Hugging Face Inference API
    API_URL = "https://api-inference.huggingface.co/models/EleutherAI/gpt-j-6B"
    headers = {"Authorization": "Bearer YOUR_API_TOKEN_HERE"}
    response = requests.post(API_URL, headers=headers, timeout=60, json={
        "inputs": full_prompt,
        "parameters": {"max_new_tokens": 300, "do_sample": True, "top_p": 0.9, "top_k": 20}
    })
    result = response.json()
    # On failure the API returns a dict with an "error" key (for example when the token is invalid)
    if isinstance(result, dict) and "error" in result:
        print(f"Hugging Face Inference API error: {result['error']}")
        return ""
    # On success, extract the generated text from the list or dict response
    generated_text = result[0].get('generated_text', '') if isinstance(result, list) else result.get('generated_text', '')
    return generated_text


In [45]:
query = "Which optimizer was used when training the models?"
matched_items = [
    {"type": "text", "text": "The optimizer used was AdamW."},
    {"type": "table", "text": "Results show that AdamW outperforms SGD in our experiments."},
    {"type": "image", "image": "[Base64ImageData]"}
]

In [46]:
response_text = generate_response_hf(query, matched_items)
print("Generated Response:\n", response_text)

Generated Response:
 


<h2 style="background: linear-gradient(to right, #ff6b6b, #4ecdc4, #1e90ff); 
            color: white; 
            padding: 15px; 
            border-radius: 10px; 
            text-align: center; 
            font-family: 'Comic Sans MS', cursive, sans-serif; 
            text-shadow: 2px 2px 4px rgba(0,0,0,0.5);">
  Creating Vector Database/Index
</h2>

In [11]:
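# Stack all embeddings into one matrix. This requires every item to have an embedding of the
# same length, i.e. all items embedded with a single model (for example Titan Multimodal Embeddings);
# the 384-dim text vectors and 512-dim CLIP vectors generated earlier cannot be mixed in one index.
all_embeddings = np.array([item['embedding'] for item in items], dtype=np.float32)
embedding_vector_dimension = all_embeddings.shape[1]

# Create FAISS Index
index = faiss.IndexFlatL2(embedding_vector_dimension)

# Clear any pre-existing index
index.reset()

# Add embeddings to the index
index.add(all_embeddings)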
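
The query cells in the next section call `generate_multimodal_embeddings`, which this notebook does not define. A minimal sketch of such a function is shown here, assuming the Amazon Titan Multimodal Embeddings G1 model (`amazon.titan-embed-image-v1`) is enabled for your account in Amazon Bedrock and that AWS credentials and a region are configured. The helper `build_titan_request`, the optional `client` parameter, and the default output length of 384 are assumptions rather than the author's original code.

```python
import json

TITAN_MODEL_ID = "amazon.titan-embed-image-v1"  # assumed model ID


def build_titan_request(prompt=None, image=None, output_embedding_length=384):
    """Build the request body for text, a base64-encoded image, or both."""
    if not prompt and not image:
        raise ValueError("Provide a text prompt, a base64-encoded image, or both.")
    body = {"embeddingConfig": {"outputEmbeddingLength": output_embedding_length}}
    if prompt:
        body["inputText"] = prompt
    if image:
        body["inputImage"] = image
    return body


def generate_multimodal_embeddings(prompt=None, image=None, output_embedding_length=384, client=None):
    """Return one embedding vector (a list of floats) from Titan Multimodal Embeddings."""
    if client is None:
        import boto3  # imported lazily so the request builder works without AWS access
        client = boto3.client("bedrock-runtime")
    body = build_titan_request(prompt, image, output_embedding_length)
    response = client.invoke_model(
        modelId=TITAN_MODEL_ID,
        body=json.dumps(body),
        accept="application/json",
        contentType="application/json",
    )
    return json.loads(response["body"].read())["embedding"]


print(build_titan_request(prompt="Which optimizer was used when training the models?"))
```

For the query and the indexed items to be comparable, every entry in `items` would need to be embedded with this same function and the same `output_embedding_length` (text chunks via `prompt=`, images via `image=`) before the index is built.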

In [12]:
# Generating RAG response with Amazon Nova
def invoke_nova_multimodal(prompt, matched_items):
    """
    Invoke the Amazon Nova model with the retrieved text chunks and images.
    """

    # Define your system prompt(s).
    system_msg = [
                        { "text": """You are a helpful assistant for question answering. 
                                    The text context is relevant information retrieved. 
                                    The provided image(s) are relevant information retrieved."""}
                 ]

    # Build the content of a single user message: retrieved context first, question last.
    message_content = []

    for item in matched_items:
        if item['type'] == 'text' or item['type'] == 'table':
            message_content.append({"text": item['text']})
        else:
            # Images are passed as base64-encoded strings
            message_content.append({"image": {
                                                "format": "png",
                                                "source": {"bytes": item['image']},
                                            }
                                    })

    # Add the user's question after the retrieved context
    message_content.append({"text": prompt})

    # Configure the inference parameters.
    inf_params = {"max_new_tokens": 300, 
                "top_p": 0.9, 
                "top_k": 20}

    # Define the final message list
    message_list = [
        {"role": "user", "content": message_content}
    ]

    native_request = {
        "messages": message_list,
        "system": system_msg,
        "inferenceConfig": inf_params,
    }

    # Initialize the Amazon Bedrock runtime client
    model_id = "amazon.nova-pro-v1:0"
    client = boto3.client("bedrock-runtime")

    # Invoke the model with the native request and parse the JSON response body.
    response = client.invoke_model(modelId=model_id, body=json.dumps(native_request))
    model_response = json.loads(response["body"].read())
    
    # Return the generated text
    return model_response["output"]["message"]["content"][0]["text"]

<h2 style="background: linear-gradient(to right, #ff6b6b, #4ecdc4, #1e90ff); 
            color: white; 
            padding: 15px; 
            border-radius: 10px; 
            text-align: center; 
            font-family: 'Comic Sans MS', cursive, sans-serif; 
            text-shadow: 2px 2px 4px rgba(0,0,0,0.5);">
  Test the RAG Pipeline
</h2>

In [13]:
# User Query
query = "Which optimizer was used when training the models?"

# Generate embeddings for the query
query_embedding = generate_multimodal_embeddings(prompt=query,output_embedding_length=embedding_vector_dimension)

# Search for the nearest neighbors in the vector database
distances, result = index.search(np.array(query_embedding, dtype=np.float32).reshape(1,-1), k=5)

In [None]:
# Check the result (matched chunks)
result.flatten()

In [None]:
# Retrieve the matched items
matched_items = [{k: v for k, v in items[idx].items() if k != 'embedding'} for idx in result.flatten()]

# Generate RAG response with Amazon Nova
response = invoke_nova_multimodal(query, matched_items)

display.Markdown(response)

<h2 style="background: linear-gradient(to right, #ff6b6b, #4ecdc4, #1e90ff); 
            color: white; 
            padding: 15px; 
            border-radius: 10px; 
            text-align: center; 
            font-family: 'Comic Sans MS', cursive, sans-serif; 
            text-shadow: 2px 2px 4px rgba(0,0,0,0.5);">
  Your Turn: Test the RAG Pipeline
</h2>

In [17]:
# List of queries (Replace with any query of your choice)
other_queries = ["How long were the base and big models trained?",
                 "Which optimizer was used when training the models?",
                 "What is the position-wise feed-forward neural network mentioned in the paper?",
                 "What is the BLEU score of the model in English to German translation (EN-DE)?",
                 "How is the scaled dot-product attention calculated?",
                 ]


In [None]:
query = other_queries[0] # Replace with any query from the list above

# Generate embeddings for the query
query_embedding = generate_multimodal_embeddings(prompt=query,output_embedding_length=embedding_vector_dimension)

# Search for the nearest neighbors in the vector database
distances, result = index.search(np.array(query_embedding, dtype=np.float32).reshape(1,-1), k=5)

# Retrieve the matched items
matched_items = [{k: v for k, v in items[idx].items() if k != 'embedding'} for idx in result.flatten()]

# Generate RAG response with Amazon Nova
response = invoke_nova_multimodal(query, matched_items)

# Display the response
display.Markdown(response)
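
The query cells above repeat the same four steps: embed the query, search the index, collect the matched items, and call the model. One way to bundle them, sketched here with injectable callables so that it runs on its own with toy stand-ins, is a small helper; `answer_query`, its parameter names, and all `toy_*` objects are hypothetical.

```python
def answer_query(query, embed, search, items, generate, k=5):
    """Run embed -> search -> collect -> generate and return (answer, matched_items)."""
    query_vector = embed(query)
    positions = search(query_vector, k)
    matched = [
        {key: value for key, value in items[position].items() if key != "embedding"}
        for position in positions
    ]
    return generate(query, matched), matched


# Toy stand-ins for the embedding model, the vector index, and the LLM
toy_items = [
    {"type": "text", "text": "chunk about optimizers", "embedding": [1.0, 0.0]},
    {"type": "text", "text": "chunk about BLEU scores", "embedding": [0.0, 1.0]},
]


def toy_embed(text):
    return [1.0, 0.0] if "optimizer" in text.lower() else [0.0, 1.0]


def toy_search(vector, k):
    ranked = sorted(
        range(len(toy_items)),
        key=lambda i: sum((a - b) ** 2 for a, b in zip(toy_items[i]["embedding"], vector)),
    )
    return ranked[:k]


def toy_generate(query, matched):
    return f"{len(matched)} context item(s); top: {matched[0]['text']}"


answer, matched = answer_query(
    "Which optimizer was used?", toy_embed, toy_search, toy_items, toy_generate, k=1
)
print(answer)
```

With the notebook's objects, `embed` would wrap `generate_multimodal_embeddings`, `search` would wrap `index.search(...)` and return the flattened index array, and `generate` would be `invoke_nova_multimodal`, which makes it straightforward to loop over `other_queries`.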

<h2 style="background: linear-gradient(to right, #ff6b6b, #4ecdc4, #1e90ff); 
            color: white; 
            padding: 15px; 
            border-radius: 10px; 
            text-align: center; 
            font-family: 'Comic Sans MS', cursive, sans-serif; 
            text-shadow: 2px 2px 4px rgba(0,0,0,0.5);">
  Thank you!
</h2>