# **RAG_Multimodal**

# Pipeline Overview: Multimodal RAG for PDF Documents

This pipeline implements a **Retrieval-Augmented Generation (RAG)** system that combines **textual content** and **image-derived context** extracted from PDF documents to support multimodal question answering.

---

## 1. Document Extraction and Image Description Generation

- Text and images were extracted from the PDF files (`Multimodal.pdf`, `multimodal_sample.pdf`) using **PyMuPDF**.
- For each extracted image, a **generic textual description** was generated (for example:  
  *“This image from [filename.pdf] shows a figure from the document.”*).
- These descriptions act as **placeholders** for image understanding and represent where a **Vision-Language Model (VLM)** would be integrated in a full multimodal pipeline.
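
The placeholder step can be sketched as a function with an optional captioning hook. This is a minimal illustration, not the notebook's code: the `captioner` callable and the `fake_captioner` stand-in are hypothetical names, and a real captioner would call a VLM. Without a captioner, the function falls back to the generic description used in this pipeline.

```python
from typing import Callable, Optional


def describe_image(
    image_path: str,
    source: str,
    image_index: int,
    captioner: Optional[Callable[[str], str]] = None,  # hypothetical VLM hook
) -> str:
    """Return a VLM caption if a captioner is given, else the generic placeholder."""
    if captioner is not None:
        caption = captioner(image_path).strip()
        if caption:
            return f"Image {image_index} from {source}: {caption}"
    # Fallback: the generic description used in this notebook
    return (
        f"This image from {source} (image index {image_index}) "
        "shows a figure from the document."
    )


# Stand-in captioner for illustration only; a real one would call a VLM.
def fake_captioner(path: str) -> str:
    return "Bar chart of quarterly revenue"


print(describe_image("img.png", "multimodal_sample.pdf", 0))
print(describe_image("img.png", "multimodal_sample.pdf", 0, fake_captioner))
```

Keeping the fallback in one place means the rest of the pipeline does not change when a real captioner is added later.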

---

## 2. Unified Document Representation

- Text chunks extracted from the PDFs were converted into **LangChain `Document` objects**.
- Image descriptions were also wrapped as `Document` objects, with metadata identifying:
  - Source PDF
  - Page number
  - Image index
- Both text and image documents were merged into a **single unified collection** (`all_documents`), enabling joint retrieval.

---

## 3. Vector Store and Retriever Setup

- A **FAISS vector store** was created from the unified document list.
- **HuggingFaceEmbeddings** (`sentence-transformers/all-MiniLM-L6-v2`) were used to generate embeddings.
- A **retriever** was initialized from the FAISS index to fetch the most relevant documents for a given query, regardless of whether the source was text or an image description.

---

## 4. Multimodal RAG Prompt Design

- A **ChatPromptTemplate** was designed to explicitly inform the **DeepSeek LLM** (served through NVIDIA's API endpoints) that:
  - Context may include both raw text and image descriptions.
  - The model should **synthesize information across modalities**.
  - Sources must be cited clearly, distinguishing:
    - Text sources (PDF name and page number)
    - Image sources (PDF name and image index)
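
The text-versus-image distinction can be derived from metadata alone. The sketch below assumes that image records carry an `image_index` key and text records do not, which matches the metadata used in this notebook. The `citation_label` helper and its bracketed output format are illustrative assumptions; in the notebook the citation wording is left to the LLM.

```python
def citation_label(metadata: dict) -> str:
    """Build a citation string; image items are identified by an 'image_index' key."""
    source = metadata.get("source", "unknown source")
    if "image_index" in metadata:
        return f"[image {metadata['image_index']} in {source}]"
    if "page_number" in metadata:
        return f"[{source}, page {metadata['page_number']}]"
    return f"[{source}]"


text_meta = {"source": "multimodal_sample.pdf", "page_number": 1}
image_meta = {"source": "Multimodal.pdf", "page_number": 1, "image_index": 0}

print(citation_label(text_meta))
print(citation_label(image_meta))
```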

---

## 5. Multimodal RAG Chain Construction

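- A LangChain **Runnable pipeline** (`multimodal_rag_chain`) was assembled with the following flow:

  `retriever` → `format_docs` → `multimodal_rag_prompt_template` → `deepseek_llm_test` → `StrOutputParser`
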
- This chain ensures that retrieved multimodal context is structured and grounded before being passed to the LLM.
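
The data flow through the chain can be illustrated without LangChain by composing plain functions. This is a sketch under stated assumptions: `build_chain` and the lambda stubs are hypothetical stand-ins for the retriever, the document formatter, the prompt template, and the LLM, and they are not the notebook's objects.

```python
from typing import Callable, List


def build_chain(
    retrieve: Callable[[str], List[str]],
    format_context: Callable[[List[str]], str],
    build_prompt: Callable[[str, str], str],
    call_llm: Callable[[str], str],
) -> Callable[[str], str]:
    """Compose retrieve -> format -> prompt -> LLM into one callable."""

    def chain(question: str) -> str:
        docs = retrieve(question)                 # fetch relevant documents
        context = format_context(docs)            # flatten them into one string
        prompt = build_prompt(context, question)  # fill the prompt template
        return call_llm(prompt)                   # generate the answer text

    return chain


# Stubs standing in for the retriever, formatter, prompt template, and LLM.
chain = build_chain(
    retrieve=lambda q: ["Q3 had exponential growth due to global expansion."],
    format_context=lambda docs: "\n\n".join(docs),
    build_prompt=lambda ctx, q: f"Context: {ctx}\n\nQuestion: {q}",
    call_llm=lambda prompt: f"(stub answer based on {len(prompt)} prompt characters)",
)

print(chain("What drove Q3 growth?"))
```

The question is used twice, once for retrieval and once in the prompt, which is the role `RunnablePassthrough` plays in the LangChain version.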

---

## 6. Demonstrated Capabilities and Observations

- For multimodal queries, the system:
  - Correctly retrieved and summarized **textual information**, with accurate citations (PDF and page numbers).
  - Included **image-related context** when available.
- The LLM appropriately **declined to infer detailed visual information** from generic image descriptions:
  - It explicitly stated the limitation of the provided image context.
  - In these tests, this behavior indicates that the prompt helps **prevent hallucination**.
- The results highlight that **rich image captions** (ideally generated by a VLM) are critical for deeper multimodal reasoning.

---

## Summary

This pipeline demonstrates a **foundational multimodal RAG architecture**:
- Text and image contexts are unified at the retrieval layer.
- The system is robust against hallucination when image information is insufficient.
- It clearly shows the path for future enhancement by integrating **VLM-based image captioning** for truly effective multimodal understanding.


In [1]:
pip install -q --upgrade langchain langchain-core langchain-community

In [2]:
pip install -U --q pypdf unstructured tiktoken

In [3]:
!pip install -q uvicorn langserve

In [4]:
import langchain
print(langchain.__version__)


1.2.7


In [5]:
pip install -q fastapi


In [6]:
pip install --q langchain-groq

In [7]:
pip install -q pymupdf

In [8]:
pip install -q streamlit

In [9]:
pip install --q neo4j

## RAG dependencies

In [10]:
pip install -q pypdf arxiv wikipedia faiss-cpu sentence-transformers langchain-nvidia-ai-endpoints


In [11]:
pip install -U --q python-docx beautifulsoup4

In [12]:
pip install -U --q msoffcrypto-tool unstructured[all]

In [13]:
# Google Colab-compatible environment setup with sanity checks

import os
from google.colab import userdata
from google.colab.userdata import SecretNotFoundError # Import SecretNotFoundError

# Fetch secrets from Colab userdata
LANGCHAIN_API_KEY = userdata.get("LANGCHAIN_API_KEY")
try:
    LANGCHAIN_PROJECT = userdata.get("LANGCHAIN_PROJECT")
except SecretNotFoundError:
    print("Warning: LANGCHAIN_PROJECT secret not found in Colab userdata.")
    print("Please add 'LANGCHAIN_PROJECT' to your Colab secrets if you intend to use Langsmith project tracking.")
    LANGCHAIN_PROJECT = None # Set to None if not found

# Set environment variables
if LANGCHAIN_API_KEY:
    os.environ["LANGCHAIN_API_KEY"] = LANGCHAIN_API_KEY

os.environ["LANGCHAIN_TRACING_V2"] = "true"

if LANGCHAIN_PROJECT:
    os.environ["LANGCHAIN_PROJECT"] = LANGCHAIN_PROJECT

# -------- Sanity Checks --------
def sanity_check():
    checks = {
        "LANGCHAIN_API_KEY": os.environ.get("LANGCHAIN_API_KEY"),
        "LANGCHAIN_TRACING_V2": os.environ.get("LANGCHAIN_TRACING_V2"),
        "LANGCHAIN_PROJECT": os.environ.get("LANGCHAIN_PROJECT"), # Check if it's set in env
    }

    print("\n--- Sanity Checks ---")
    for key, value in checks.items():
        if value:
            print(f"[OK] {key} is set")
        else:
            print(f"[MISSING] {key} is NOT set")

sanity_check()

Please add 'LANGCHAIN_PROJECT' to your Colab secrets if you intend to use Langsmith project tracking.

--- Sanity Checks ---
[OK] LANGCHAIN_API_KEY is set
[OK] LANGCHAIN_TRACING_V2 is set
[MISSING] LANGCHAIN_PROJECT is NOT set


# **All models available in GROQ**

In [14]:
import requests
import os
import json
from google.colab import userdata

# Ensure GROQ_API_KEY is fetched directly from Colab secrets or environment
api_key = userdata.get("GROQ_API_KEY")

# If the API key is still not found, raise an error or inform the user
if not api_key:
    raise ValueError("GROQ_API_KEY not found in Colab secrets. Please ensure it is added.")

url = "https://api.groq.com/openai/v1/models"

headers = {
    "Authorization": f"Bearer {api_key}",
    "Content-Type": "application/json"
}

response = requests.get(url, headers=headers)
response.raise_for_status() # This will raise an HTTPError for bad responses (4xx or 5xx)

print(json.dumps(response.json(), indent=2))


{
  "object": "list",
  "data": [
    {
      "id": "groq/compound",
      "object": "model",
      "created": 1756949530,
      "owned_by": "Groq",
      "active": true,
      "context_window": 131072,
      "public_apps": null,
      "max_completion_tokens": 8192
    },
    {
      "id": "moonshotai/kimi-k2-instruct",
      "object": "model",
      "created": 1752435491,
      "owned_by": "Moonshot AI",
      "active": true,
      "context_window": 131072,
      "public_apps": null,
      "max_completion_tokens": 16384
    },
    {
      "id": "llama-3.1-8b-instant",
      "object": "model",
      "created": 1693721698,
      "owned_by": "Meta",
      "active": true,
      "context_window": 131072,
      "public_apps": null,
      "max_completion_tokens": 131072
    },
    {
      "id": "canopylabs/orpheus-arabic-saudi",
      "object": "model",
      "created": 1765926439,
      "owned_by": "Canopy Labs",
      "active": true,
      "context_window": 4000,
      "public_apps": null,

# Model Selection Guide (Purpose-Based)

This guide maps each available model to its best use case so you can quickly choose the right one.

---

## General Natural Language Generation / Chat

Suitable for chatbots, summaries, reasoning, coding help, and general text generation.

| Model | Notes | Best For |
|-----|-----|-----|
| **llama-3.3-70b-versatile** | Large, high-quality | Deep reasoning, complex tasks, long contexts |
| **llama-3.1-8b-instant** | Small, very fast | General chat, Q&A, lightweight apps |
| **openai/gpt-oss-20b** | Open-source GPT-style | Strong general text generation |
| **openai/gpt-oss-120b** | Very large OSS model | Highest-quality OSS reasoning & generation |

---

## Lightweight / Fast / Cost-Efficient

Optimized for speed and lower resource usage.

| Model | Notes | Best For |
|-----|-----|-----|
| **groq/compound-mini** | Lightweight | Fast throughput, low cost |
| **groq/compound** | Balanced | Speed + quality |
| **allam-2-7b** | 7B model | Very lightweight text generation |
| **moonshotai/kimi-k2-instruct** | Instruction-tuned | Fast assistant-style tasks |

---

## Long-Context Processing

Designed for very large documents and multi-file inputs.

| Model | Context Size | Best For |
|-----|-----|-----|
| **moonshotai/kimi-k2-instruct-0905** | 262k tokens | Books, long documents, multi-doc reasoning |
| **llama-3.1 / 3.3 variants** | 131k tokens | Long-context chat and analysis |

---

## Speech-to-Text (Not Text Generation)

| Model | Best For |
|-----|-----|
| **whisper-large-v3** | High-quality transcription |
| **whisper-large-v3-turbo** | Faster speech-to-text |

---

## Safety / Guard Models (Not for Generation)

Used only for moderation, safety checks, or filtering.

| Model | Purpose |
|-----|-----|
| **meta-llama/llama-guard-4-12b** | Safety classification |
| **meta-llama/llama-prompt-guard-2-22m / 86m** | Prompt risk detection |

---

## Language / Region-Specific

| Model | Best For |
|-----|-----|
| **canopylabs/orpheus-v1-english** | English text-to-speech |
| **canopylabs/orpheus-arabic-saudi** | Arabic (Saudi dialect) text-to-speech |
| **allam-2-7b** | Arabic-centric lightweight tasks |

---

## Quick Recommendations

- **Best overall (small + free):** `llama-3.1-8b-instant`
- **Best quality:** `llama-3.3-70b-versatile`
- **Fastest / cheapest:** `groq/compound-mini`
- **Very long documents:** `moonshotai/kimi-k2-instruct-0905`
- **Speech recognition:** `whisper-large-v3`

---
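
To shortlist models programmatically instead of reading the tables, the `/models` response can be filtered on its `active` and `context_window` fields. The sketch below is an illustration only: `chat_candidates` is a hypothetical helper, the `min_context` threshold of 8192 is an arbitrary assumption, and the sample is trimmed from the response shown earlier. A large context window does not by itself mean a model is suited to chat.

```python
from typing import Dict, List


def chat_candidates(models_response: Dict, min_context: int = 8192) -> List[str]:
    """Return IDs of active models whose context window is at least min_context."""
    return sorted(
        m["id"]
        for m in models_response.get("data", [])
        if m.get("active") and m.get("context_window", 0) >= min_context
    )


# Sample trimmed from the /models response shown earlier in the notebook.
sample = {
    "object": "list",
    "data": [
        {"id": "groq/compound", "active": True, "context_window": 131072},
        {"id": "llama-3.1-8b-instant", "active": True, "context_window": 131072},
        {"id": "canopylabs/orpheus-arabic-saudi", "active": True, "context_window": 4000},
    ],
}

print(chat_candidates(sample))
```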


In [15]:
from langchain_groq import ChatGroq
from google.colab import userdata
import os

# Set Groq API key (must exist in Colab secrets)
os.environ["GROQ_API_KEY"] = userdata.get("GROQ_API_KEY")

# Initialize Groq LLM
llm = ChatGroq(
    model="llama-3.1-8b-instant",
    temperature=0
)

print(llm)


profile={'max_input_tokens': 131072, 'max_output_tokens': 8192, 'image_inputs': False, 'audio_inputs': False, 'video_inputs': False, 'image_outputs': False, 'audio_outputs': False, 'video_outputs': False, 'reasoning_output': False, 'tool_calling': True} client=<groq.resources.chat.completions.Completions object at 0x7b9dc32f93a0> async_client=<groq.resources.chat.completions.AsyncCompletions object at 0x7b9dc27f1070> model_name='llama-3.1-8b-instant' temperature=1e-08 model_kwargs={} groq_api_key=SecretStr('**********')


## **Sanity check: verify the Groq LLM is working**

In [16]:
from langchain_core.messages import HumanMessage

response = llm.invoke([HumanMessage(content="Reply with the single word: OK")])

print("LLM response:", response.content)


LLM response: OK


In [17]:
# Send a prompt and get a response from the LLM

result=llm.invoke("What is generative AI?")

In [18]:
import os
from google.colab import userdata

# Read from Colab Secrets first, then env vars
NEO4J_URI = userdata.get("NEO4J_URI") or os.environ.get("NEO4J_URI")
NEO4J_USERNAME = userdata.get("NEO4J_USERNAME") or os.environ.get("NEO4J_USERNAME")
NEO4J_PASSWORD = userdata.get("NEO4J_PASSWORD") or os.environ.get("NEO4J_PASSWORD")

if not all([NEO4J_URI, NEO4J_USERNAME, NEO4J_PASSWORD]):
    raise RuntimeError("❌ Neo4j credentials not found in Colab Secrets or environment variables")

# Export for LangChain / Neo4j drivers
os.environ["NEO4J_URI"] = NEO4J_URI.strip()
os.environ["NEO4J_USERNAME"] = NEO4J_USERNAME.strip()
os.environ["NEO4J_PASSWORD"] = NEO4J_PASSWORD.strip()

print(" Neo4j credentials loaded successfully.")


 Neo4j credentials loaded successfully.


In [None]:
from langchain_community.graphs import Neo4jGraph

graph = Neo4jGraph(
    url=os.environ["NEO4J_URI"],
    username=os.environ["NEO4J_USERNAME"],
    password=os.environ["NEO4J_PASSWORD"]
)

graph.refresh_schema()
print(" Connected to Neo4j and schema loaded.")


# **NVIDIA API and DEEPSEEK**

In [23]:
import os
from google.colab import userdata
from langchain_core.messages import HumanMessage
from langchain_nvidia_ai_endpoints import ChatNVIDIA

# ---- Retrieve NVIDIA API key ----
api_key = userdata.get("NVIDIA_API_KEY") or os.environ.get("NVIDIA_API_KEY")

if not api_key:
    raise ValueError(
        "NVIDIA_API_KEY not found. Set it in Colab Secrets or as an environment variable."
    )

# ---- Initialize ChatNVIDIA LLM ----
deepseek_llm_test = ChatNVIDIA(
    model="deepseek-ai/deepseek-v3.2",
    temperature=0,
    max_completion_tokens=100,
    api_key=api_key
)

print("ChatNVIDIA (DeepSeek v3.2) initialized successfully.")



ChatNVIDIA (DeepSeek v3.2) initialized successfully.


In [24]:
import fitz  # PyMuPDF
from langchain_core.documents import Document
from transformers import CLIPProcessor, CLIPModel
from PIL import Image
import torch
import numpy as np
from langchain.chat_models import init_chat_model
from langchain_core.prompts import PromptTemplate # updated import path
from langchain_core.messages import HumanMessage # updated import path
from sklearn.metrics.pairwise import cosine_similarity
import os
import base64
import io
from langchain_text_splitters import RecursiveCharacterTextSplitter # updated import path
from langchain_community.vectorstores import FAISS



# **RAG_Multimodal**

## Data Loading and Chunking

In [25]:
from pathlib import Path
import os
import fitz  # PyMuPDF

def extract_text_images_and_chunks(
    pdf_paths,
    image_dir="/content/pdf_images",
    chunk_size=500,
    overlap=50
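):
    """
    Extract text and images from PDFs and create overlapping text chunks with metadata.

    Returns a dict with "items" (page-level text and image records) and
    "text_chunks" (fixed-size character chunks).
    """
    if not 0 <= overlap < chunk_size:
        # Otherwise `start = end - overlap` never advances and the loop never ends
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    os.makedirs(image_dir, exist_ok=True)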
    all_items = []
    text_chunks = []

    for pdf_path in pdf_paths:
        doc = fitz.open(pdf_path)

        for page_index in range(len(doc)):
            page = doc[page_index]
            page_number = page_index + 1

            # ---------- TEXT ----------
            text = page.get_text().strip()
            if text:
                all_items.append({
                    "type": "text",
                    "content": text,
                    "metadata": {
                        "source": Path(pdf_path).name,
                        "page_number": page_number
                    }
                })

                # Chunk text immediately
                start = 0
                while start < len(text):
                    end = start + chunk_size
                    text_chunks.append({
                        "type": "text",
                        "content": text[start:end],
                        "metadata": {
                            "source": Path(pdf_path).name,
                            "page_number": page_number,
                            "chunk_start": start,
                            "chunk_end": end
                        }
                    })
                    start = end - overlap

            # ---------- IMAGES ----------
            for img_index, img in enumerate(page.get_images(full=True)):
                xref = img[0]
                base_image = doc.extract_image(xref)
                image_bytes = base_image["image"]
                image_ext = base_image["ext"]

                image_name = f"{Path(pdf_path).stem}_p{page_number}_{img_index}.{image_ext}"
                image_path = os.path.join(image_dir, image_name)

                with open(image_path, "wb") as f:
                    f.write(image_bytes)

                all_items.append({
                    "type": "image",
                    "content": image_path,
                    "metadata": {
                        "source": Path(pdf_path).name,
                        "page_number": page_number,
                        "image_index": img_index
                    }
                })

        doc.close()  # release the PDF file handle before opening the next file

    return {
        "items": all_items,      # text + image records
        "text_chunks": text_chunks  # chunked text with metadata
    }


# ------------------ USAGE ------------------

pdf_files = [
    "/content/Multimodal.pdf",
    "/content/multimodal_sample.pdf"
]

data = extract_text_images_and_chunks(pdf_files)

print(data["text_chunks"][0])
print(f"Total chunks: {len(data['text_chunks'])}")
print(f"Total items (text + images): {len(data['items'])}")


{'type': 'text', 'content': 'Annual Revenue Overview\nThis document summarizes the revenue trends across Q1, Q2, and Q3. As illustrated in the chart\nbelow, revenue grew steadily with the highest growth recorded in Q3.\nQ1 showed a moderate increase in revenue as new product lines were introduced. Q2 outperformed\nQ1 due to marketing campaigns. Q3 had exponential growth due to global expansion.', 'metadata': {'source': 'multimodal_sample.pdf', 'page_number': 1, 'chunk_start': 0, 'chunk_end': 500}}
Total chunks: 1
Total items (text + images): 3
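
The sliding-window chunking can be seen more easily on a short string. The sketch below is a standalone variant, not the notebook's function: the name `chunk_text` is hypothetical, and unlike the code above it clamps `chunk_end` to the text length and stops once the end of the text is reached, so the recorded offsets always match the chunk content.

```python
from typing import Dict, List


def chunk_text(text: str, chunk_size: int = 500, overlap: int = 50) -> List[Dict]:
    """Split text into fixed-size character chunks that overlap by `overlap` chars."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")

    chunks = []
    start = 0
    while start < len(text):
        end = min(start + chunk_size, len(text))  # clamp to the text length
        chunks.append(
            {"content": text[start:end], "chunk_start": start, "chunk_end": end}
        )
        if end == len(text):
            break  # last chunk reached; avoid emitting a redundant tail
        start = end - overlap  # step back so consecutive chunks share context
    return chunks


for chunk in chunk_text("abcdefghijkl", chunk_size=5, overlap=2):
    print(chunk)
```

Each chunk starts `chunk_size - overlap` characters after the previous one, so the last `overlap` characters of one chunk reappear at the start of the next.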


In [26]:
# -------- EMBEDDINGS + FAISS INDEX  --------

from sentence_transformers import SentenceTransformer
import faiss
import numpy as np

def build_faiss_index(text_chunks, model_name="sentence-transformers/all-MiniLM-L6-v2"):
    """
    Creates embeddings using Hugging Face and stores them in a FAISS index.
    """

    # Load HF embedding model
    model = SentenceTransformer(model_name)

    # Extract texts
    texts = [chunk["content"] for chunk in text_chunks]

    # Generate embeddings
    embeddings = model.encode(
        texts,
        batch_size=32,
        show_progress_bar=True,
        normalize_embeddings=True
    )

    embeddings = np.array(embeddings).astype("float32")

    # Create FAISS index (cosine similarity via inner product)
    dim = embeddings.shape[1]
    index = faiss.IndexFlatIP(dim)
    index.add(embeddings)

    return index, embeddings


# -------- USAGE WITH YOUR EXISTING OUTPUT --------

# data["text_chunks"] comes from the previous extraction step
faiss_index, embeddings = build_faiss_index(data["text_chunks"])

print("FAISS index size:", faiss_index.ntotal)


# -------- OPTIONAL: QUERY SEARCH --------

def search(query, model, index, text_chunks, top_k=3):
    query_embedding = model.encode(
        [query],
        normalize_embeddings=True
    ).astype("float32")

    scores, indices = index.search(query_embedding, top_k)

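    results = []
    for score, idx in zip(scores[0], indices[0]):
        # FAISS pads results with -1 when the index holds fewer than top_k vectors;
        # skip those so a negative index does not silently wrap around the list
        if idx < 0:
            continue
        results.append({
            "score": float(score),
            "content": text_chunks[idx]["content"],
            "metadata": text_chunks[idx]["metadata"]
        })
    return results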


# Example search
model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")
results = search(
    query="revenue growth in Q3",
    model=model,
    index=faiss_index,
    text_chunks=data["text_chunks"]
)

print(results[0])


FAISS index size: 1
{'score': 0.8086128830909729, 'content': 'Annual Revenue Overview\nThis document summarizes the revenue trends across Q1, Q2, and Q3. As illustrated in the chart\nbelow, revenue grew steadily with the highest growth recorded in Q3.\nQ1 showed a moderate increase in revenue as new product lines were introduced. Q2 outperformed\nQ1 due to marketing campaigns. Q3 had exponential growth due to global expansion.', 'metadata': {'source': 'multimodal_sample.pdf', 'page_number': 1, 'chunk_start': 0, 'chunk_end': 500}}
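
The score above is a cosine similarity because the embeddings were L2-normalized before being added to an inner-product index. The following standard-library sketch checks that identity on two small vectors. The helper names and the example vectors are illustrative and not taken from the notebook.

```python
import math
from typing import List


def dot(a: List[float], b: List[float]) -> float:
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def l2_normalize(v: List[float]) -> List[float]:
    """Scale a vector to unit length."""
    norm = math.sqrt(dot(v, v))
    if norm == 0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in v]


def cosine(a: List[float], b: List[float]) -> float:
    """Cosine similarity computed directly from its definition."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


a = [1.0, 2.0, 3.0]
b = [2.0, 0.0, 1.0]

inner_product_of_unit_vectors = dot(l2_normalize(a), l2_normalize(b))
print(math.isclose(inner_product_of_unit_vectors, cosine(a, b)))
```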


## Generate Textual Descriptions for Images



In [27]:
from langchain_core.documents import Document

image_description_documents = []

for item in data['items']:
    if item['type'] == 'image':
        source = item['metadata']['source']
        image_index = item['metadata']['image_index']
        # Create a simple textual description for the image
        description = f"This image from {source} (image index {image_index}) shows a figure from the document."

        # Create a LangChain Document object
        doc = Document(
            page_content=description,
            metadata=item['metadata']
        )
        image_description_documents.append(doc)

print(f"Generated {len(image_description_documents)} image description documents.")
if image_description_documents:
    print("First image description document:")
    print(image_description_documents[0])

Generated 2 image description documents.
First image description document:
page_content='This image from Multimodal.pdf (image index 0) shows a figure from the document.' metadata={'source': 'Multimodal.pdf', 'page_number': 1, 'image_index': 0}


## Combine All Documents (Text Chunks + Image Descriptions)



In [28]:
from langchain_core.documents import Document

# 1. Convert text_chunks into LangChain Document objects
text_document_chunks = []
for chunk in data['text_chunks']:
    doc = Document(
        page_content=chunk['content'],
        metadata=chunk['metadata']
    )
    text_document_chunks.append(doc)

# 2. Concatenate text documents with image description documents
all_documents = text_document_chunks + image_description_documents

# 3. Print the total number and first few documents to verify
print(f"Total combined documents: {len(all_documents)}")
print("\nFirst 3 combined documents:")
for i, doc in enumerate(all_documents[:3]):
    print(f"Document {i+1}:")
    print(f"  Page Content: {doc.page_content[:100]}...") # Truncate for display
    print(f"  Metadata: {doc.metadata}")
    print("\n")

Total combined documents: 3

First 3 combined documents:
Document 1:
  Page Content: Annual Revenue Overview
This document summarizes the revenue trends across Q1, Q2, and Q3. As illust...
  Metadata: {'source': 'multimodal_sample.pdf', 'page_number': 1, 'chunk_start': 0, 'chunk_end': 500}


Document 2:
  Page Content: This image from Multimodal.pdf (image index 0) shows a figure from the document....
  Metadata: {'source': 'Multimodal.pdf', 'page_number': 1, 'image_index': 0}


Document 3:
  Page Content: This image from multimodal_sample.pdf (image index 0) shows a figure from the document....
  Metadata: {'source': 'multimodal_sample.pdf', 'page_number': 1, 'image_index': 0}




## Rebuild FAISS Index and Retriever


In [29]:
pip install -U --q langchain-huggingface

In [30]:
from langchain_huggingface import HuggingFaceEmbeddings
from langchain_community.vectorstores import FAISS

# 1. Initialize HuggingFaceEmbeddings
hf_embeddings = HuggingFaceEmbeddings(model_name="sentence-transformers/all-MiniLM-L6-v2")

# 2. Create a new FAISS vector store from all_documents
# This automatically handles extracting page_content and generating embeddings
vectorstore = FAISS.from_documents(documents=all_documents, embedding=hf_embeddings)

# 3. Create a retriever from the FAISS vector store
retriever = vectorstore.as_retriever()

print("New FAISS vector store and retriever created successfully.")
print(f"Retriever type: {type(retriever)}")
print(f"Number of documents in vectorstore: {vectorstore.index.ntotal}")

New FAISS vector store and retriever created successfully.
Retriever type: <class 'langchain_core.vectorstores.base.VectorStoreRetriever'>
Number of documents in vectorstore: 3


## Define Multimodal RAG Prompt


In [31]:
from langchain_core.prompts import ChatPromptTemplate

# Define the multimodal RAG prompt template (system instructions + human message)
multimodal_rag_prompt_template = ChatPromptTemplate.from_messages([
    ("system",
     "You are an expert assistant for question-answering tasks. You will be provided with context from various sources,\n" \
     "including raw text content and descriptions of images. Your goal is to synthesize information from all relevant\n" \
     "context sources to answer the user's question accurately. It is crucial that you cite your sources for every piece\n" \
     "of information you provide. The sources will be indicated in the metadata of each context item (e.g., 'source': 'filename.pdf', 'page_number': X, 'image_index': Y).\n" \
     "If you use information derived from an image description, cite it as an image from that source. If you use textual content, cite the PDF and page number.\n" \
     "If the question cannot be answered from the provided context, state that clearly."
    ),
    ("human", "Context: {context}\n\nQuestion: {question}")
])

print("Multimodal RAG prompt template created successfully.")
print(multimodal_rag_prompt_template.messages)


Multimodal RAG prompt template created successfully.
[SystemMessagePromptTemplate(prompt=PromptTemplate(input_variables=[], input_types={}, partial_variables={}, template="You are an expert assistant for question-answering tasks. You will be provided with context from various sources,\nincluding raw text content and descriptions of images. Your goal is to synthesize information from all relevant\ncontext sources to answer the user's question accurately. It is crucial that you cite your sources for every piece\nof information you provide. The sources will be indicated in the metadata of each context item (e.g., 'source': 'filename.pdf', 'page_number': X, 'image_index': Y).\nIf you use information derived from an image description, cite it as an image from that source. If you use textual content, cite the PDF and page number.\nIf the question cannot be answered from the provided context, state that clearly."), additional_kwargs={}), HumanMessagePromptTemplate(prompt=PromptTemplate(input_

## Construct Multimodal RAG Chain


In [32]:
from langchain_core.runnables import RunnablePassthrough
from langchain_core.output_parsers import StrOutputParser
from langchain_core.documents import Document

# 1. Define the format_docs function
def format_docs(docs: list[Document]) -> str:
    formatted_strings = []
    for i, doc in enumerate(docs):
        # Extract metadata for citation
        source_info = []
        if 'source' in doc.metadata:
            source_info.append(f"Source: {doc.metadata['source']}")
        if 'page_number' in doc.metadata:
            source_info.append(f"Page: {doc.metadata['page_number']}")
        if 'image_index' in doc.metadata:
            source_info.append(f"Image Index: {doc.metadata['image_index']}")

        source_str = ", ".join(source_info) if source_info else "Unknown Source"

        formatted_strings.append(
            f"<doc id={i+1}>\n"
            f"{doc.page_content}\n"
            f"</doc id={i+1}>\n({source_str})"
        )

    return "\n\n".join(formatted_strings)

# 2. Construct the multimodal RAG chain
# Ensure deepseek_llm_test (ChatNVIDIA) and multimodal_rag_prompt_template are defined
# and retriever is available from previous steps
multimodal_rag_chain = (
    {"context": retriever | format_docs, "question": RunnablePassthrough()}
    | multimodal_rag_prompt_template
    | deepseek_llm_test
    | StrOutputParser()
)

print("Multimodal RAG chain constructed successfully.")

Multimodal RAG chain constructed successfully.
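
To make the data flow of the chain explicit, here is a plain-Python sketch of what the pipe expression computes for a single question. It is an illustration only, not how LangChain implements runnables: `make_chain`, `fake_retrieve`, `fake_format`, `fake_prompt`, and `fake_llm` are hypothetical stand-ins for the retriever, `format_docs`, the prompt template, and the LLM.

```python
def make_chain(retrieve, format_docs, render_prompt, llm):
    """Compose the four stages in the same order as the pipe expression."""
    def invoke(question: str) -> str:
        # "context" branch: retriever | format_docs
        context = format_docs(retrieve(question))
        # "question" branch: the input passes through unchanged
        prompt = render_prompt(context, question)
        # The model reply is returned as a plain string
        return llm(prompt)
    return invoke


# Hypothetical stand-ins so the sketch runs without a vector store or an API key
def fake_retrieve(question: str) -> list[str]:
    return ["chunk about revenue"]


def fake_format(chunks: list[str]) -> str:
    return "\n\n".join(chunks)


def fake_prompt(context: str, question: str) -> str:
    return f"Context:\n{context}\n\nQuestion: {question}"


def fake_llm(prompt: str) -> str:
    return f"ANSWER based on: {prompt}"


chain = make_chain(fake_retrieve, fake_format, fake_prompt, fake_llm)
answer = chain("What happened in Q3?")
print(answer)
```

The same question string feeds both the retrieval branch and the prompt, which is the role `RunnablePassthrough()` plays in the real chain.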


## Ask 3 Multimodal Queries


In [36]:
print("Retrieving documents for the last query...")
last_query = "What is the primary content described in the image from Multimodal.pdf and what information is provided about Q3 revenue growth?"

# Retrieve documents using the existing retriever
retrieved_docs = retriever.invoke(last_query)

# Format the retrieved documents using the existing format_docs function
formatted_context = format_docs(retrieved_docs)

print("\n--- Retrieved and Formatted Documents (Context for LLM) ---")
print(formatted_context)

Retrieving documents for the last query...

--- Retrieved and Formatted Documents (Context for LLM) ---
<doc id=1>
Annual Revenue Overview
This document summarizes the revenue trends across Q1, Q2, and Q3. As illustrated in the chart
below, revenue grew steadily with the highest growth recorded in Q3.
Q1 showed a moderate increase in revenue as new product lines were introduced. Q2 outperformed
Q1 due to marketing campaigns. Q3 had exponential growth due to global expansion.
</doc id=1>
(Source: multimodal_sample.pdf, Page: 1)

<doc id=2>
This image from multimodal_sample.pdf (image index 0) shows a figure from the document.
</doc id=2>
(Source: multimodal_sample.pdf, Page: 1, Image Index: 0)

<doc id=3>
This image from Multimodal.pdf (image index 0) shows a figure from the document.
</doc id=3>
(Source: Multimodal.pdf, Page: 1, Image Index: 0)


The retriever returned the text chunk that describes Q3 revenue growth along with the image descriptions from both PDFs, even though those descriptions are generic. This explains why the LLM could report Q3 revenue growth from the text but could only acknowledge that figures exist, without any specific visual detail.
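
One way to confirm the mix of modalities in a retrieval result is to group the documents by their metadata. The sketch below assumes, based on the output above, that only image descriptions carry an `image_index` key; `split_by_modality` is a hypothetical helper, and the metadata dictionaries are copied from the citation lines printed above.

```python
def split_by_modality(metadatas: list[dict]) -> tuple[list[dict], list[dict]]:
    """Separate text-chunk metadata from image-description metadata."""
    text_items, image_items = [], []
    for metadata in metadatas:
        # Assumption: image descriptions are the only documents with 'image_index'
        if "image_index" in metadata:
            image_items.append(metadata)
        else:
            text_items.append(metadata)
    return text_items, image_items


# Metadata as shown in the formatted context above
retrieved_metadata = [
    {"source": "multimodal_sample.pdf", "page_number": 1},
    {"source": "multimodal_sample.pdf", "page_number": 1, "image_index": 0},
    {"source": "Multimodal.pdf", "page_number": 1, "image_index": 0},
]

text_items, image_items = split_by_modality(retrieved_metadata)
print(f"text chunks: {len(text_items)}, image descriptions: {len(image_items)}")
```

With a real retriever, the input list could be built as `[doc.metadata for doc in retrieved_docs]`.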

## Summary

### Q&A
1.  **Effectiveness of integration and evaluation of coherence, accuracy, and completeness:** The multimodal RAG chain only partially integrated textual and image-derived context. The LLM's responses were coherent and accurate for the textual content, summarizing revenue trends and product lines from the text. Completeness was limited for image content: the LLM often noted that the image descriptions were generic (e.g., "a figure from the document"), so it could not extract specific visual details.
2.  **Acknowledgment of different types of sources:** The LLM appropriately acknowledged sources. It cited textual information with document names and page numbers (e.g., "multimodal_sample.pdf") and attributed general insights to "textual content." For image references, it cited them as images from specific sources (e.g., "image from Multimodal.pdf"), reflecting the generic nature of the provided image descriptions.
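
The text-versus-image citation convention described above can be expressed as a small helper. This is a minimal sketch under assumptions: `citation_label` is a hypothetical function, and the exact wording of the labels is illustrative rather than the format the LLM produced.

```python
def citation_label(metadata: dict) -> str:
    """Build a citation string that distinguishes image sources from text sources."""
    source = metadata.get("source", "unknown source")
    if "image_index" in metadata:
        # Image descriptions are cited by PDF name and image index
        return f"image {metadata['image_index']} from {source}"
    if "page_number" in metadata:
        # Text chunks are cited by PDF name and page number
        return f"{source}, page {metadata['page_number']}"
    return source


text_label = citation_label({"source": "multimodal_sample.pdf", "page_number": 1})
image_label = citation_label(
    {"source": "Multimodal.pdf", "page_number": 1, "image_index": 0}
)
print(text_label)
print(image_label)
```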

### Data Analysis Key Findings
*   Two generic textual descriptions for images were generated and incorporated into the document set. For instance, an image from "Multimodal.pdf (image index 0)" was described as "This image from Multimodal.pdf (image index 0) shows a figure from the document."
*   A total of 3 documents (1 text chunk and 2 image descriptions) were combined into a unified list and indexed in a new FAISS vector store.
*   A `ChatPromptTemplate` was defined that explicitly instructs the NVIDIA DeepSeek LLM to synthesize information from both text and image descriptions and to cite sources, distinguishing text (PDF name and page number) from images (image from source).
*   A multimodal RAG chain was constructed from the retriever, a `format_docs` helper function, the defined prompt, and the NVIDIA DeepSeek LLM.
*   When executing multimodal queries, the LLM extracted and summarized information from the textual content, citing sources such as "multimodal_sample.pdf."
*   However, the LLM's ability to provide detailed answers from image contexts was limited because the generated image descriptions were generic and lacked specific content details, leading the LLM to explicitly state this limitation in its responses.
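
The last finding identifies the placeholder descriptions as the limiting factor. A minimal sketch of how the description step could accept a pluggable captioner is shown below. `describe_image` and the `captioner` parameter are hypothetical, and no particular Vision-Language Model API is assumed: the captioner is any callable that maps image bytes to a string. Without one, the function falls back to the generic wording used in this notebook.

```python
from typing import Callable, Optional


def describe_image(
    image_bytes: bytes,
    source: str,
    image_index: int,
    captioner: Optional[Callable[[bytes], str]] = None,
) -> str:
    """Return a description for one extracted image."""
    prefix = f"This image from {source} (image index {image_index})"
    if captioner is None:
        # Generic placeholder, matching the descriptions used in this notebook
        return f"{prefix} shows a figure from the document."
    # A real captioner would be a wrapper around a VLM call (assumption)
    return f"{prefix} shows: {captioner(image_bytes)}"


# Stub captioner for illustration only; it does not inspect the image content
def stub_captioner(image_bytes: bytes) -> str:
    return f"<caption for {len(image_bytes)} bytes of image data>"


generic = describe_image(b"\x89PNG", "Multimodal.pdf", 0)
captioned = describe_image(b"\x89PNG", "Multimodal.pdf", 0, captioner=stub_captioner)
print(generic)
print(captioned)
```

Because the description text is what gets embedded and retrieved, richer captions would flow through the rest of the pipeline without changes to the vector store or the chain.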
