# Multi-modal RAG with LangChain

## Setup

Install the dependencies you need to run the notebook.

In [5]:
# for linux
!sudo apt-get install poppler-utils tesseract-ocr libmagic-dev

# for mac
# !brew install poppler tesseract libmagic



In [87]:
%pip install -Uq "unstructured[all-docs]" pillow lxml
%pip install -Uq chromadb tiktoken
%pip install -Uq langchain langchain-community langchain-openai langchain-groq langchain-chroma langchain-classic
%pip install -Uq python_dotenv


Note: you may need to restart the kernel to use updated packages.


In [None]:
from dotenv import load_dotenv
load_dotenv()

## Extract the data

Extract the elements of the PDF that we will be able to use in the retrieval process. These elements can be: Text, Images, Tables, etc.

### Partition PDF tables, text, and images

In [89]:
from unstructured.partition.pdf import partition_pdf

output_path = "./document/"
file_path = output_path + '10_destinasi_klungkung.pdf'

# Reference: https://docs.unstructured.io/open-source/core-functionality/chunking
chunks = partition_pdf(
    filename=file_path,
    infer_table_structure=True,            # extract tables
    strategy="hi_res",                     # mandatory to infer tables

    extract_image_block_types=["Image"],   # Add 'Table' to list to extract image of tables
    # image_output_dir_path=output_path,   # if None, images and tables will be saved in base64

    extract_image_block_to_payload=True,   # if true, will extract base64 for API usage

    chunking_strategy="by_title",          # or 'basic'
    max_characters=10000,                  # defaults to 500
    combine_text_under_n_chars=2000,       # defaults to 0
    new_after_n_chars=6000,

    # extract_images_in_pdf=True,          # deprecated
)



In [100]:
len(chunks)

2

In [101]:
chunks[1].metadata.orig_elements

[<unstructured.documents.elements.Title at 0x70c6da936d50>,
 <unstructured.documents.elements.NarrativeText at 0x70c6da935880>,
 <unstructured.documents.elements.NarrativeText at 0x70c6da935a00>,
 <unstructured.documents.elements.NarrativeText at 0x70c6da936060>,
 <unstructured.documents.elements.NarrativeText at 0x70c6da937290>,
 <unstructured.documents.elements.NarrativeText at 0x70c6da937410>,
 <unstructured.documents.elements.NarrativeText at 0x70c6da937590>,
 <unstructured.documents.elements.Title at 0x70c6da937830>,
 <unstructured.documents.elements.NarrativeText at 0x70c6da9379b0>,
 <unstructured.documents.elements.NarrativeText at 0x70c6da937b30>,
 <unstructured.documents.elements.NarrativeText at 0x70c6da937d40>,
 <unstructured.documents.elements.NarrativeText at 0x70c6da937ce0>,
 <unstructured.documents.elements.NarrativeText at 0x70c6da9350a0>,
 <unstructured.documents.elements.NarrativeText at 0x70c6da934770>,
 <unstructured.documents.elements.Title at 0x70c6da9367e0>,
 <un

In [102]:
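# With chunking enabled, partition_pdf returns CompositeElement chunks
# (plus Table chunks when the PDF contains tables). This PDF yields only CompositeElement.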
set([str(type(el)) for el in chunks])

{"<class 'unstructured.documents.elements.CompositeElement'>"}
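Since every chunk here is a `CompositeElement`, the interesting detail is what each chunk wraps. The sketch below counts the class names found in `metadata.orig_elements` across all chunks. The classes in it are hypothetical stand-ins (not the real `unstructured` classes) so that it runs on its own; with the real library you would pass `chunks` directly.

```python
from collections import Counter


# Hypothetical stand-ins for unstructured's element classes (assumption:
# only the attribute layout chunk.metadata.orig_elements matters here).
class Title: ...
class NarrativeText: ...


class _Metadata:
    def __init__(self, orig_elements):
        self.orig_elements = orig_elements


class CompositeElement:
    def __init__(self, orig_elements):
        self.metadata = _Metadata(orig_elements)


def count_orig_element_types(chunks):
    """Count the class names of the original elements inside every chunk."""
    counts = Counter()
    for chunk in chunks:
        # orig_elements can be missing or None, so fall back to an empty list
        for el in getattr(chunk.metadata, "orig_elements", None) or []:
            counts[type(el).__name__] += 1
    return counts


demo_chunks = [
    CompositeElement([Title(), NarrativeText(), NarrativeText()]),
    CompositeElement([Title(), NarrativeText()]),
]
print(count_orig_element_types(demo_chunks))
```

If no `Image` key shows up in the counts, the PDF produced no image elements and the image steps later in the notebook will have nothing to work on.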

In [103]:
# Each CompositeElement contains a group of related elements.
# This makes it easy to use these elements together in a RAG pipeline.

chunks[0].metadata.orig_elements

[<unstructured.documents.elements.Title at 0x70c6daf16510>,
 <unstructured.documents.elements.Title at 0x70c77e33b290>,
 <unstructured.documents.elements.NarrativeText at 0x70c6da936270>,
 <unstructured.documents.elements.NarrativeText at 0x70c6da935e50>,
 <unstructured.documents.elements.NarrativeText at 0x70c6da935550>,
 <unstructured.documents.elements.NarrativeText at 0x70c6da9353d0>,
 <unstructured.documents.elements.NarrativeText at 0x70c6da9348c0>,
 <unstructured.documents.elements.NarrativeText at 0x70c6da935040>,
 <unstructured.documents.elements.Title at 0x70c6da935bb0>,
 <unstructured.documents.elements.NarrativeText at 0x70c6da934620>,
 <unstructured.documents.elements.NarrativeText at 0x70c6da934a10>,
 <unstructured.documents.elements.NarrativeText at 0x70c6da934950>,
 <unstructured.documents.elements.NarrativeText at 0x70c6da934380>,
 <unstructured.documents.elements.NarrativeText at 0x70c6da9352e0>,
 <unstructured.documents.elements.NarrativeText at 0x70c6da9370b0>,
 <un

In [104]:
# This is what an extracted element looks like (here, the first Title of the first chunk).
# Image elements additionally carry their base64 data in metadata.image_base64,
# because we set the param extract_image_block_to_payload=True.

elements = chunks[0].metadata.orig_elements
chunk_elements = list(elements)
chunk_elements[0].to_dict()

{'type': 'Title',
 'element_id': 'cfc0fcc3-3106-47fe-9944-259a1c39ff2e',
 'text': 'Destinasi Wisata Klungkung (5W+1H)',
 'metadata': {'detection_class_prob': 0.8271118998527527,
  'is_extracted': 'true',
  'coordinates': {'points': ((np.float64(249.23828125),
     np.float64(274.8055555555557)),
    (np.float64(249.23828125), np.float64(313.8055555555556)),
    (np.float64(857.9528888888889), np.float64(313.8055555555556)),
    (np.float64(857.9528888888889), np.float64(274.8055555555557))),
   'system': 'PixelSpace',
   'layout_width': 1700,
   'layout_height': 2200},
  'last_modified': '2025-12-08T07:24:00',
  'filetype': 'PPM',
  'languages': ['eng'],
  'page_number': 1}}

### Separate extracted elements into tables, text, and images

In [105]:
# separate tables from texts
tables = []
texts = []

for chunk in chunks:
    if "Table" in str(type(chunk)):
        tables.append(chunk)
    elif "CompositeElement" in str(type(chunk)):
        texts.append(chunk)

In [106]:
# Get the images from the CompositeElement objects
def get_images_base64(chunks):
    images_b64 = []
    for chunk in chunks:
        if "CompositeElement" in str(type(chunk)):
            chunk_els = chunk.metadata.orig_elements
            for el in chunk_els:
                if "Image" in str(type(el)):
                    images_b64.append(el.metadata.image_base64)
    return images_b64

images = get_images_base64(chunks)

#### Check what the images look like

In [19]:
import base64
from IPython.display import Image, display

def display_base64_image(base64_code):
    # Decode the base64 string to binary
    image_data = base64.b64decode(base64_code)
    # Display the image
    display(Image(data=image_data))

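# This PDF yielded no images, so guard against an empty list
if images:
    display_base64_image(images[0])
else:
    print("No images were extracted from this PDF.")

No images were extracted from this PDF.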
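Later cells label every image as `image/jpeg`. When a PDF does yield images, it can be worth checking that assumption before building a data URL. The sketch below guesses the MIME type from the first decoded bytes. The helper names `sniff_image_mime` and `to_data_url` are hypothetical, and the signature table covers only a few common formats.

```python
import base64

# Leading "magic" bytes of a few common image formats.
_MAGIC = {
    b"\xff\xd8\xff": "image/jpeg",
    b"\x89PNG\r\n\x1a\n": "image/png",
    b"GIF87a": "image/gif",
    b"GIF89a": "image/gif",
}


def sniff_image_mime(image_b64: str, default: str = "image/jpeg") -> str:
    """Guess the MIME type from the first decoded bytes of a base64 image."""
    # 16 base64 characters decode to 12 bytes, enough for the signatures above.
    head = base64.b64decode(image_b64[:16])
    for magic, mime in _MAGIC.items():
        if head.startswith(magic):
            return mime
    return default  # unknown signature: fall back to the caller's default


def to_data_url(image_b64: str) -> str:
    """Build a data URL whose MIME type matches the image bytes."""
    return f"data:{sniff_image_mime(image_b64)};base64,{image_b64}"
```

For example, `to_data_url(images[0])` could replace a hard-coded `data:image/jpeg;base64,...` prefix once `images` is non-empty.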

## Summarize the data

Create a summary of each element extracted from the PDF. This summary will be vectorized and used in the retrieval process.

### Text and Table summaries

We don't need a multimodal model to summarize the tables and the text; any text-only chat model will do. This notebook uses Gemini (`gemini-2.5-flash`) through `langchain-google-genai`.

In [84]:
%pip install langchain-google-genai



In [None]:
from langchain_groq import ChatGroq
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser


In [119]:
import os
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser
from langchain_google_genai import ChatGoogleGenerativeAI

prompt_text = """
You are an assistant tasked with summarizing tables and text.
Give a concise summary of the table or text.

Respond only with the summary, no additional comment.
Do not start your message by saying "Here is a summary" or anything like that.
Just give the summary as it is, and respond in the same language as the text.

Table or text chunk: {element}
"""

prompt = ChatPromptTemplate.from_template(prompt_text)

# Gemini model configuration
model = ChatGoogleGenerativeAI(
    model="gemini-2.5-flash",
    temperature=0.5,
    api_key=os.getenv("GEMINI_API_KEY"),
)

# Summary chain
summarize_chain = (
    {"element": lambda x: x}
    | prompt
    | model
    | StrOutputParser()
)


In [120]:
# Summarize text
text_summaries = summarize_chain.batch(texts, {"max_concurrency": 3})

# Summarize tables
tables_html = [table.metadata.text_as_html for table in tables]
table_summaries = summarize_chain.batch(tables_html, {"max_concurrency": 3})
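The `max_concurrency` setting caps how many summarization requests are in flight at once, which helps stay within provider rate limits. As a conceptual sketch only (this is not LangChain's implementation), the same idea can be expressed with the standard library. `bounded_batch` and `fake_summarize` are hypothetical names, and the fake summarizer stands in for a real model call.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List, Sequence, TypeVar

T = TypeVar("T")
R = TypeVar("R")


def bounded_batch(
    fn: Callable[[T], R], inputs: Sequence[T], max_concurrency: int = 3
) -> List[R]:
    """Apply fn to every input with at most max_concurrency calls running at once.

    Results are returned in the same order as the inputs.
    """
    with ThreadPoolExecutor(max_workers=max_concurrency) as pool:
        return list(pool.map(fn, inputs))


def fake_summarize(text: str) -> str:
    # Hypothetical stand-in for an LLM call: truncate and upper-case the text
    return text[:20].upper()


summaries = bounded_batch(
    fake_summarize, ["pantai kelingking", "crystal bay"], max_concurrency=2
)
print(summaries)
```

Keeping results in input order matters here, because the summaries are later paired with their source chunks by position.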

In [117]:
texts[0].text

'Destinasi Wisata Klungkung (5W+1H)\n\n1. Pantai Kelingking\n\nWhat: Pantai dengan tebing ikonik berbentuk T-Rex yang menjadi simbol pariwisata Nusa Penida. Pemandangannya dramatis dengan kontras antara tebing curam dan laut biru toska.\n\nWhere: Desa Bunga Mekar, Nusa Penida, Kabupaten Klungkung.\n\nWhen: Paling ideal dikunjungi pada pagi hingga siang hari ketika cuaca cerah dan gelombang relatif tenang.\n\nWhy: Keindahan panoramanya menjadikan pantai ini salah satu spot foto terbaik di Asia, sekaligus tujuan wajib wisatawan.\n\nWho: Wisatawan lokal, mancanegara, fotografer landscape, dan pencinta alam.\n\nHow: Dari Pelabuhan Banjar Nyuh atau Toya Pakeh, perjalanan darat 25–35 menit. Trek menurun menuju pantai cukup curam dan membutuhkan stamina.\n\n2. Pantai Crystal Bay\n\nWhat: Pantai teluk dengan air sangat jernih dan pasir lembut, ideal untuk snorkeling dan diving.\n\nWhere: Desa Sakti, Nusa Penida.\n\nWhen: Waktu terbaik sore hari untuk menikmati sunset.\n\nWhy: Crystal Bay terke

In [122]:
text_summaries[0]

'Destinasi wisata Klungkung meliputi Pantai Kelingking di Nusa Penida dengan tebing ikonik T-Rex dan pemandangan dramatis, ideal untuk fotografi. Pantai Crystal Bay di Desa Sakti populer untuk snorkeling dan diving berkat airnya yang jernih dan habitat mola-mola, cocok dinikmati sore hari. Pantai Atuh dan Pantai Diamond di Desa Pejukutan menawarkan pemandangan tebing menjulang, batu karang unik, dan tangga tebing yang eksotis, paling baik dikunjungi pagi hari. Terakhir, Pulau Nusa Lembongan, dengan pantai seperti Jungut Batu dan Mushroom Bay, menjadi pusat aktivitas dengan suasana tenang, air jernih, serta fasilitas lengkap, dapat dinikmati sepanjang tahun.'

### Image summaries

We will use gpt-4o-mini to produce the image summaries.

In [23]:
%pip install -Uq langchain_openai

Note: you may need to restart the kernel to use updated packages.


In [25]:
import os

from langchain_openai import ChatOpenAI
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser

# 1. Model configuration (adjust base_url if you use a different gateway)
model = ChatOpenAI(
    model="gpt-4o-mini",  # this model accepts image input (vision)
    temperature=0.5,
    max_tokens=1024,
    api_key=os.getenv('AI_GATEWAY_API_KEY'),
    base_url='https://ai-gateway.vercel.sh/v1'
)

# 2. Prompt template definition
# The {image} placeholder is filled into the data URL below
prompt = ChatPromptTemplate.from_messages([
    (
        "user",
        [
            {
                "type": "text",
                "text": "Describe the image in detail. For context, the image is part of a document describing tourist destinations in Klungkung, Bali. Be specific about graphs, such as bar plots."
            },
            {
                "type": "image_url",
                "image_url": {"url": "data:image/jpeg;base64,{image}"},
            },
        ],
    )
])

# 3. Build the chain
chain = prompt | model | StrOutputParser()


batch_input = [{"image": img_str} for img_str in images]

# Run the batch
image_summaries = chain.batch(batch_input)

In [26]:
image_summaries

[]

In [None]:
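# image_summaries is empty for this PDF, so only print when the entry exists
if len(image_summaries) > 1:
    print(image_summaries[1])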

## Load data and summaries to vectorstore

### Create the vectorstore

In [123]:
import uuid
import os

from langchain_chroma import Chroma
from langchain_core.documents import Document
from langchain_core.stores import InMemoryStore
from langchain_classic.retrievers.multi_vector import MultiVectorRetriever
from langchain_google_genai import GoogleGenerativeAIEmbeddings

# Embeddings Gemini
embeddings = GoogleGenerativeAIEmbeddings(
    model="models/text-embedding-004",
    google_api_key=os.getenv("GEMINI_API_KEY")
)

# Vectorstore
vectorstore = Chroma(
    collection_name="multi_modal_rag",
    embedding_function=embeddings
)

# Parent document store
store = InMemoryStore()
id_key = "doc_id"

# Retriever
retriever = MultiVectorRetriever(
    vectorstore=vectorstore,
    docstore=store,
    id_key=id_key,
)


In [126]:
# Option 1: embed_documents (input MUST be a list)
try:
    # Note the square brackets around the inputs
    test_embed = retriever.vectorstore.embeddings.embed_documents(["asepe","nskskks"])
    print("✅ Embeddings API Check (Batch): Success")
    print(f"Vector dimension: {len(test_embed[0])}")
except Exception as e:
    print(f"❌ Embeddings API Check (Batch): FAILED.\nError: {e}")

# Option 2: embed_query (input is a plain string)
try:
    # No square brackets
    test_query = retriever.vectorstore.embeddings.embed_query("kekeke")
    print("✅ Embeddings API Check (Query): Success")
except Exception as e:
    print(f"❌ Embeddings API Check (Query): FAILED.\nError: {e}")

✅ Embeddings API Check (Batch): Success
Vector dimension: 768
✅ Embeddings API Check (Query): Success
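The check above confirms that each text becomes a 768-dimensional vector. Retrieval then comes down to comparing the query vector with the stored summary vectors. Cosine similarity is one common comparison, sketched below in plain Python. This is an illustration only: the vector store may be configured with a different distance metric.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Return the cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # convention chosen here: a zero vector is similar to nothing
    return dot / (norm_a * norm_b)
```

With the embeddings from the check above, `cosine_similarity(test_embed[0], test_query)` would give a score between -1 and 1.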


In [75]:
retriever.vectorstore.add_documents

<bound method VectorStore.add_documents of <langchain_chroma.vectorstores.Chroma object at 0x70c89fcd8710>>

### Load the summaries and link them to the original data

In [129]:
from typing import List, Tuple

def build_valid_docs(
    summaries: List[str],
    parents: List,
    id_key: str
) -> Tuple[List[Document], List[Tuple[str, object]]]:
    """
    - Drop empty / None summaries
    - Keep each summary paired with its parent element
    """
    docs = []
    parent_pairs = []

    for summary, parent in zip(summaries, parents):
        if isinstance(summary, str) and summary.strip():
            doc_id = str(uuid.uuid4())
            docs.append(
                Document(
                    page_content=summary.strip(),
                    metadata={id_key: doc_id}
                )
            )
            parent_pairs.append((doc_id, parent))

    return docs, parent_pairs


In [None]:
import uuid
from langchain_core.documents import Document
# texts: List[CompositeElement] (the original chunks)
# text_summaries: List[str]

text_docs, text_pairs = build_valid_docs(
    summaries=text_summaries,
    parents=texts,
    id_key=id_key
)

if text_docs:
    retriever.vectorstore.add_documents(text_docs)
    retriever.docstore.mset(text_pairs)



table_docs, table_pairs = build_valid_docs(
    summaries=table_summaries,
    parents=tables,
    id_key=id_key
)

if table_docs:
    retriever.vectorstore.add_documents(table_docs)
    retriever.docstore.mset(table_pairs)


image_docs, image_pairs = build_valid_docs(
    summaries=image_summaries,
    parents=images,
    id_key=id_key
)

if image_docs:
    retriever.vectorstore.add_documents(image_docs)
    retriever.docstore.mset(image_pairs)
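The loading step relies on one idea: summaries are what gets searched, and a shared ID leads from each summary back to its full parent. The toy sketch below mimics that linkage in plain Python. `ToyMultiVectorStore` is a hypothetical class, and word overlap stands in for embedding similarity, so it illustrates only the data flow, not `MultiVectorRetriever` itself.

```python
import uuid
from typing import Dict, List, Set, Tuple


def tokenize(text: str) -> Set[str]:
    """Lower-case the text and split it on whitespace."""
    return set(text.lower().split())


class ToyMultiVectorStore:
    """Search over summaries, return the linked parent objects."""

    def __init__(self) -> None:
        self.summaries: List[Tuple[str, str]] = []  # (doc_id, summary)
        self.parents: Dict[str, object] = {}        # doc_id -> original element

    def add(self, summary: str, parent: object) -> str:
        doc_id = str(uuid.uuid4())
        self.summaries.append((doc_id, summary))
        self.parents[doc_id] = parent
        return doc_id

    def retrieve(self, query: str, k: int = 1) -> List[object]:
        q = tokenize(query)
        # Rank summaries by word overlap with the query (stand-in for vector search)
        ranked = sorted(
            self.summaries,
            key=lambda pair: len(q & tokenize(pair[1])),
            reverse=True,
        )
        # Follow each doc_id from the matched summary to its parent
        return [self.parents[doc_id] for doc_id, _ in ranked[:k]]


toy = ToyMultiVectorStore()
toy.add("kelingking beach cliff summary", "FULL TEXT about Kelingking")
toy.add("crystal bay snorkeling summary", "FULL TEXT about Crystal Bay")
print(toy.retrieve("where is kelingking beach"))
```

The query never touches the full text; it matches a summary, and the ID lookup supplies the parent that is handed to the model.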

### Check retrieval

In [135]:
# Retrieve
docs = retriever.invoke(
    "Apa itu pantai Kelingking?"
)

In [139]:
print(docs)

[<unstructured.documents.elements.CompositeElement object at 0x70c6daf14890>, <unstructured.documents.elements.CompositeElement object at 0x70c6daf14890>, <unstructured.documents.elements.CompositeElement object at 0x70c704812150>, <unstructured.documents.elements.CompositeElement object at 0x70c704812150>]
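The printed list shows each `CompositeElement` twice, at the same memory address. A likely cause, though this is an assumption, is that the summaries were added to the vector store more than once (for example by re-running the loading cell), so several summary IDs point to the same parent object. A minimal sketch for dropping such repeats while keeping the ranking order, with the hypothetical helper name `dedupe_by_identity`:

```python
from typing import Iterable, List


def dedupe_by_identity(items: Iterable[object]) -> List[object]:
    """Keep the first occurrence of each object, comparing by identity."""
    seen = set()
    unique = []
    for item in items:
        key = id(item)  # same object in memory -> same key
        if key not in seen:
            seen.add(key)
            unique.append(item)
    return unique
```

Applied to the result above, `dedupe_by_identity(docs)` would leave two chunks instead of four. Re-creating the vector store before reloading avoids the duplicates at the source.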


In [138]:
for doc in docs:
    print(str(doc) + "\n\n" + "-" * 80)

Destinasi Wisata Klungkung (5W+1H)

1. Pantai Kelingking

What: Pantai dengan tebing ikonik berbentuk T-Rex yang menjadi simbol pariwisata Nusa Penida. Pemandangannya dramatis dengan kontras antara tebing curam dan laut biru toska.

Where: Desa Bunga Mekar, Nusa Penida, Kabupaten Klungkung.

When: Paling ideal dikunjungi pada pagi hingga siang hari ketika cuaca cerah dan gelombang relatif tenang.

Why: Keindahan panoramanya menjadikan pantai ini salah satu spot foto terbaik di Asia, sekaligus tujuan wajib wisatawan.

Who: Wisatawan lokal, mancanegara, fotografer landscape, dan pencinta alam.

How: Dari Pelabuhan Banjar Nyuh atau Toya Pakeh, perjalanan darat 25–35 menit. Trek menurun menuju pantai cukup curam dan membutuhkan stamina.

2. Pantai Crystal Bay

What: Pantai teluk dengan air sangat jernih dan pasir lembut, ideal untuk snorkeling dan diving.

Where: Desa Sakti, Nusa Penida.

When: Waktu terbaik sore hari untuk menikmati sunset.

Why: Crystal Bay terkenal sebagai habitat ikan 

## RAG pipeline

In [142]:
from base64 import b64decode
from typing import Dict, List

from langchain_core.runnables import RunnablePassthrough, RunnableLambda
from langchain_core.output_parsers import StrOutputParser
from langchain_google_genai import ChatGoogleGenerativeAI
from langchain_core.messages import HumanMessage


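def parse_docs(docs: List) -> Dict[str, List]:
    """Split retrieved parents into base64 images and text elements."""
    images = []
    texts = []

    for doc in docs:
        # Images are stored as base64 strings; text parents are CompositeElement objects
        if isinstance(doc, str):
            try:
                b64decode(doc, validate=True)
                images.append(doc)
                continue
            except Exception:
                pass
        texts.append(doc)

    return {"images": images, "texts": texts}


def build_prompt(kwargs):
    context = kwargs["context"]
    question = kwargs["question"]

    messages = []

    # Join the context text; elements expose .text, plain strings are used as-is
    if context["texts"]:
        context_text = "\n".join(getattr(t, "text", str(t)) for t in context["texts"])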
        messages.append(
            HumanMessage(
                content=f"""
Answer the question using ONLY the following context.

Context:
{context_text}

Question:
{question}
"""
            )
        )
    else:
        messages.append(
            HumanMessage(
                content=f"Answer the question. Question: {question}"
            )
        )

    # Add images (Gemini style)
    for img_b64 in context["images"]:
        messages.append(
            HumanMessage(
                content=[
                    {
                        "type": "image",
                        "data": img_b64,
                        "mime_type": "image/jpeg",
                    }
                ]
            )
        )

    return messages

model = ChatGoogleGenerativeAI(
    model="gemini-2.5-flash",   
    temperature=0.2,
)

chain = (
    {
        "context": retriever | RunnableLambda(parse_docs),
        "question": RunnablePassthrough(),
    }
    | RunnableLambda(build_prompt)
    | model
    | StrOutputParser()
)

chain_with_response = {
    "context": retriever | RunnableLambda(parse_docs),
    "question": RunnablePassthrough(),
} | RunnablePassthrough().assign(
    response=(
        RunnableLambda(build_prompt)
        | model
        | StrOutputParser()
    )
)
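One caveat about `parse_docs`: it treats any string that decodes as base64 as an image. Short plain words made only of letters and digits, with a length that is a multiple of four, also decode successfully, so a text parent stored as a bare string could be misclassified. The sketch below shows the effect and one possible stricter check. The function names and the `min_length` threshold are assumptions chosen for illustration.

```python
import base64
import binascii


def decodes_as_base64(s: str) -> bool:
    """True if the string passes strict base64 decoding."""
    try:
        base64.b64decode(s, validate=True)
        return True
    except (binascii.Error, ValueError):
        return False


def looks_like_image_payload(s: str, min_length: int = 100) -> bool:
    """Stricter heuristic: long enough to be an image and strictly decodable."""
    return len(s) >= min_length and decodes_as_base64(s)


print(decodes_as_base64("Bali"))               # a 4-letter word is valid base64
print(decodes_as_base64("Pantai Kelingking"))  # the space makes strict decoding fail
print(looks_like_image_payload("Bali"))        # rejected by the length check
```

In this notebook the text parents are element objects rather than strings, so the issue only arises if plain strings are stored in the docstore.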


In [145]:
response = chain.invoke(
    "Saya ingin ke pantai kelingking, gimana caranya?"
)

print(response)

Pantai Kelingking adalah salah satu ikon paling terkenal di Nusa Penida, Bali. Untuk sampai ke sana, ada beberapa langkah yang harus Anda ikuti:

**Langkah 1: Dari Bali Daratan ke Nusa Penida**

1.  **Pergi ke Sanur, Bali:** Sebagian besar perahu cepat (fast boat) ke Nusa Penida berangkat dari Pelabuhan Sanur di Denpasar, Bali. Anda bisa menggunakan taksi, taksi online (Gojek/Grab), atau menyewa mobil/motor untuk sampai ke sana.
2.  **Pesan Tiket Fast Boat:**
    *   **Online:** Ini cara terbaik untuk memastikan Anda mendapatkan tempat, terutama di musim ramai. Banyak operator fast boat memiliki situs web sendiri atau Anda bisa memesan melalui agen perjalanan online (misalnya Traveloka, Klook, GetYourGuide, dll.).
    *   **Langsung di Pelabuhan:** Anda juga bisa membeli tiket langsung di konter-konter operator fast boat di Pelabuhan Sanur, tetapi ada risiko kehabisan tiket atau harga yang lebih tinggi.
    *   **Operator Populer:** Maruti Group, El Rey Junior, Angel Billabong, Idola E

In [144]:
response = chain_with_response.invoke(
    "Ada apa saja destinasi wisata klungkung?"
)

print("Response:", response['response'])

print("\n\nContext:")
for text in response['context']['texts']:
    print(text.text)
    print("Page number: ", text.metadata.page_number)
    print("\n" + "-"*50 + "\n")
for image in response['context']['images']:
    display_base64_image(image)

Response: Kabupaten Klungkung, Bali, terkenal dengan keindahan alamnya yang memukau, terutama gugusan pulau-pulau kecilnya (Nusa Islands) yang menjadi daya tarik utama. Selain itu, ada juga destinasi menarik di daratan Klungkung.

Berikut adalah beberapa destinasi wisata populer di Klungkung:

**I. Gugusan Nusa Islands (Nusa Penida, Nusa Lembongan, Nusa Ceningan)**
Ini adalah daya tarik terbesar Klungkung dan seringkali menjadi alasan utama wisatawan datang. Akses ke pulau-pulau ini biasanya menggunakan perahu cepat (fast boat) dari Sanur, Kusamba, atau Padang Bai.

1.  **Nusa Penida**
    Pulau terbesar dan paling populer dengan pemandangan tebing-tebing dramatis dan pantai-pantai eksotis.
    *   **Kelingking Beach:** Tebing ikonik berbentuk T-Rex, sangat populer untuk berfoto.
    *   **Broken Beach (Pasih Uug):** Tebing melingkar dengan lubang besar di tengahnya, membentuk kolam alami.
    *   **Angel's Billabong:** Kolam alami dengan air jernih di antara batuan karang, menyerupai 

## References

- [LangChain Inspiration](https://github.com/langchain-ai/langchain/blob/master/cookbook/Semi_structured_and_multi_modal_RAG.ipynb?ref=blog.langchain.dev)
- [Multivector Storage](https://python.langchain.com/docs/how_to/multi_vector/)