# Multimodal Pipeline for RAG

## Phase 0: Setting up the Project

In [1]:
from unstructured.partition.pdf import partition_pdf
import pytest
from services.categorizer import categorize



## Phase 1: Indexing

Indexing starts with the cleaning and extraction of raw data in diverse formats like PDF, HTML, Word, and Markdown, which is then converted into a uniform plain text format.

### Extraction

In [28]:
file_path = "./assets/MCP9808.pdf"

pdf_elements = partition_pdf(filename=file_path,
                             strategy='hi_res',
                             infer_table_structure=True,
                             hi_res_model_name='yolox',
                             extract_image_block_types=['Image'],
                             extract_image_block_to_payload=True,   #If True, will extract base64 for API usage
                             chunking_strategy='by_title',          # splitting strategy for the document (related elements are now grouped together)
                             max_characters=10000,                  # defaults to 500
                             combine_text_under_n_chars=2000,       # defaults to 0
                             new_after_n_chars=6000)

The PDF <_io.BufferedReader name='./assets/MCP9808.pdf'> contains a metadata field indicating that it should not allow text extraction. Ignoring this field and proceeding. Use the check_extractable if you want to raise an error in this case


With recent versions of the unstructured library (especially >=0.11.x), using chunking_strategy="by_title" wraps the output elements in CompositeElement objects. Each CompositeElement groups the content under a heading, which can include Table, Text, Image, and other elements; the individual elements remain available in metadata.orig_elements.

🔍 When to Use Raw Access (No Chunking)

✅ Use this when:
	•	Your primary goal is to extract specific elements, like tables, without worrying about their surrounding context.
	•	You want to classify, transform, or analyze tables or text independently.
	•	You’re building a pipeline where you process each element individually (e.g., sending them to LLMs, storing in a vector DB, etc.).

✅ Pros:
	•	Simple and straightforward.
	•	Full visibility into all content types.
	•	Easier debugging and testing.

❌ Cons:
	•	No semantic grouping — loses the logical structure (e.g., which section the table belongs to).

🧩 When to Use Chunking (e.g. by_title)

✅ Use this when:
	•	You want to preserve the document’s logical structure — e.g., sections, headings, context.
	•	You’re building a retrieval system, summarizer, or LLM pipeline that benefits from cohesive, meaningful chunks.
	•	You want to preserve the relationship between paragraphs and tables/images under a specific section.

✅ Pros:
	•	More semantically meaningful.
	•	Better input for language models.
	•	Maintains context between related elements.

❌ Cons:
	•	More complex to work with: the original elements have to be read from CompositeElement.metadata.orig_elements.
	•	Slightly harder to extract just tables.
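
The trade-off above can be made concrete with a small sketch. It uses hypothetical stand-in classes (`Metadata`, `Element` and its subclasses) instead of the real unstructured classes so that it runs without the library; only the `metadata.orig_elements` attribute mirrors what the notebook actually uses. The helper `group_by_type` is likewise an illustrative name, not part of any API.

```python
from collections import defaultdict
from dataclasses import dataclass, field


# Hypothetical stand-ins for unstructured's element classes (assumption:
# only the .metadata.orig_elements attribute matters for this sketch).
@dataclass
class Metadata:
    orig_elements: list = field(default_factory=list)


@dataclass
class Element:
    text: str
    metadata: Metadata = field(default_factory=Metadata)


class Title(Element):
    pass


class NarrativeText(Element):
    pass


class Image(Element):
    pass


class CompositeElement(Element):
    pass


def group_by_type(chunks):
    """Map element type name -> list of (chunk index, element) pairs."""
    grouped = defaultdict(list)
    for index, chunk in enumerate(chunks):
        for el in chunk.metadata.orig_elements:
            grouped[type(el).__name__].append((index, el))
    return dict(grouped)


chunks = [
    CompositeElement(
        "Features ...",
        Metadata([Title("Features"), NarrativeText("Accuracy ..."), Image("figure")]),
    ),
    CompositeElement(
        "General Description ...",
        Metadata([Title("General Description"), NarrativeText("The MCP9808 ...")]),
    ),
]

grouped = group_by_type(chunks)
print(sorted(grouped))  # ['Image', 'NarrativeText', 'Title']
```

Keeping the chunk index next to each element preserves the link back to its section, which is the context that raw access would lose.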

In [85]:
# pdf_elements is a list of chunks. The original elements of each chunk are stored
# in its metadata (metadata.orig_elements), so the following cells loop over them.

#pdf_elements[0].metadata.orig_elements
chunks = pdf_elements

In [4]:
# The sub-elements of a chunk hold the actual elements; list the element types in the first chunk
sub_elements = pdf_elements[0].metadata.orig_elements
set(str(type(el)) for el in sub_elements)

{"<class 'unstructured.documents.elements.Image'>",
 "<class 'unstructured.documents.elements.ListItem'>",
 "<class 'unstructured.documents.elements.NarrativeText'>",
 "<class 'unstructured.documents.elements.Title'>"}

### Separate extracted elements into tables, text and images

Separate tables from texts

In [87]:
# separate tables from texts
tables = []
texts = []

for chunk in chunks:
    if "CompositeElement" in str(type(chunk)):
        # iterate over a copy: removing from the list being iterated would skip elements
        for el in list(chunk.metadata.orig_elements):
            if "Table" in str(type(el)):
                tables.append(el)
                # remove table from chunk
                chunk.metadata.orig_elements.remove(el)
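
When separating elements this way, note that removing items from a list while looping over that same list skips the item that follows each removal. The following library-free sketch shows the pitfall and one safe alternative; the element names are placeholder strings rather than unstructured objects, and both function names are illustrative.

```python
def remove_in_place_buggy(items, unwanted):
    """Remove matches while iterating the same list (skips neighbours)."""
    removed = []
    for item in items:  # the list shrinks underneath the iterator
        if item == unwanted:
            removed.append(item)
            items.remove(item)
    return removed


def split_safely(items, unwanted):
    """Build two new lists instead of mutating the input."""
    kept = [item for item in items if item != unwanted]
    removed = [item for item in items if item == unwanted]
    return kept, removed


elements = ["text", "table", "table", "text"]

buggy = list(elements)
removed = remove_in_place_buggy(buggy, "table")
print(removed)  # ['table']
print(buggy)    # ['text', 'table', 'text']

print(split_safely(elements, "table"))  # (['text', 'text'], ['table', 'table'])
```

The second of two adjacent tables survives the buggy loop, which is exactly the case that matters when a section contains several tables or images in a row.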

In [89]:
# For Testing only!
# Search for Table in chunk
for chunk in chunks:
    if "Table" in str(type(chunk)):
        print("Found Table in chunk")
        print(chunk.metadata.orig_elements)
        #print(chunk.metadata

Get the images from the CompositeElement objects

In [90]:
def get_images_base64(chunks):
    """Collect the base64 payload of every image and remove the image from its chunk."""
    images_b64 = []
    for chunk in chunks:
        if "CompositeElement" in str(type(chunk)):
            # iterate over a copy: removing from the list being iterated would skip elements
            for el in list(chunk.metadata.orig_elements):
                if "Image" in str(type(el)):
                    images_b64.append(el.metadata.image_base64)
                    chunk.metadata.orig_elements.remove(el)
    return images_b64

images = get_images_base64(chunks)
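
The collected payloads are base64 strings, so they need decoding before they can be saved or inspected. A minimal sketch, assuming each payload is standard base64 of the raw image bytes; `decode_image` and `guess_image_format` are hypothetical helper names, and the sample payload is only the 8-byte PNG signature, not a real image from the PDF.

```python
import base64


def decode_image(image_b64: str) -> bytes:
    """Turn a base64 string into raw image bytes."""
    return base64.b64decode(image_b64)


def guess_image_format(data: bytes) -> str:
    """Identify JPEG or PNG by their leading signature bytes."""
    if data.startswith(b"\xff\xd8\xff"):
        return "jpeg"
    if data.startswith(b"\x89PNG\r\n\x1a\n"):
        return "png"
    return "unknown"


# Placeholder payload (assumption): just the PNG file signature.
sample_b64 = base64.b64encode(b"\x89PNG\r\n\x1a\n").decode("ascii")

data = decode_image(sample_b64)
print(guess_image_format(data))  # png
```

With real payloads, the decoded bytes could be written to disk with `open(path, "wb")` to check visually that the extraction picked up the right figures.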

In [91]:
#Check if images are still in the chunks
for chunk in chunks:
    if "Image" in str(type(chunk)):
        print("Found Image in chunk")
        print(chunk.metadata.orig_elements)
        #print(chunk.metadata.orig_elements

In [92]:
# The rest of the chunks are text
texts = []
for chunk in chunks:
    texts.append(chunk)

In [94]:
print(texts[0])

MicROcHIP

MCP9808

±0.5°C Maximum Accuracy Digital Temperature Sensor

Features

General Description

• Accuracy:

- ±0.25 (typical) from –40°C to +125°C

- ±0.5°C (maximum) from –20°C to 100°C

Microchip Technology Inc.’s MCP9808 digital temperature sensor converts temperatures between –20°C and +100°C to a digital word with ±0.25°C/±0.5°C (typi- cal/maximum) accuracy.

- ±1°C (maximum) from –40°C to +125°C

- ±0.0625°C or ±1 LSb (typical) repeatability

• User-Selectable Measurement Resolution:

- +0.5°C, +0.25°C, +0.125°C, +0.0625°C

• User-Programmable Temperature Limits:

- Temperature Window Limit

- Critical Temperature Limit

• User-Programmable Temperature Alert Output

• Operating Voltage Range: 2.7V-5.5V

• Operating Current: 200 μA (typical)

• Shutdown Current: 0.1 μA (typical)

The MCP9808 comes with user-programmable registers that provide flexibility for temperature sensing applications. The registers allow user-selectable settings such as Shutdown or Low-Power modes a

## Phase 2: Summarization

In [118]:
%pip install --upgrade --quiet  langchain-openai
%pip install --upgrade --quiet python-dotenv

Note: you may need to restart the kernel to use updated packages.
Note: you may need to restart the kernel to use updated packages.


In [147]:
from langchain_openai import AzureChatOpenAI
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser
import os
from dotenv import load_dotenv
from langchain_core.rate_limiters import InMemoryRateLimiter

rate_limiter = InMemoryRateLimiter(
    requests_per_second=0.5,
    check_every_n_seconds=1,
    max_bucket_size=500000,
)
load_dotenv()
azure_api_key = os.environ.get("AZURE_OPENAI_API_KEY")
azure_endpoint = os.environ.get("AZURE_OPENAI_ENDPOINT")

### Summarization of Tables

In [149]:
# Prompt
prompt_text = """
You are a helpful assistant tasked with summarizing tables precisely.
Give a concise summary of the table.

Respond only with the summary, no additional comment.
Do not start your message by saying "Here is a summary" or anything like that.
Just give the summary as it is.

Table: {element}

"""
prompt = ChatPromptTemplate.from_template(prompt_text)

# Summary chain
model = AzureChatOpenAI(
    azure_deployment="gpt-4o",  # must match the deployment name of the Azure resource
    azure_endpoint=azure_endpoint,
    openai_api_key=azure_api_key,
    api_version="2024-12-01-preview",
    temperature=1.0,
    model="gpt-4o",
    # rate_limiter=rate_limiter,  # optional: throttle requests with the limiter defined above
)
summarize_chain = {"element": lambda x: x} | prompt | model | StrOutputParser()
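
The InMemoryRateLimiter defined earlier is based on the token-bucket idea: tokens accumulate at a fixed rate up to a maximum, and each request spends one. The class below is a simplified illustration written for this walkthrough, not LangChain's implementation; `TokenBucket` and `try_acquire` are hypothetical names, and the current time is passed in explicitly so the behaviour is deterministic.

```python
from dataclasses import dataclass


@dataclass
class TokenBucket:
    rate: float          # tokens added per second
    capacity: float      # maximum number of stored tokens
    tokens: float = 0.0  # the bucket starts empty in this sketch
    last: float = 0.0    # timestamp of the previous call

    def try_acquire(self, now: float) -> bool:
        """Refill according to elapsed time, then spend one token if available."""
        elapsed = now - self.last
        self.last = now
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False


bucket = TokenBucket(rate=0.5, capacity=2)

results = [
    bucket.try_acquire(0.0),    # empty bucket
    bucket.try_acquire(2.0),    # 2 s * 0.5 = 1 token
    bucket.try_acquire(3.0),    # only 0.5 tokens so far
    bucket.try_acquire(4.0),    # back to 1 token
    bucket.try_acquire(100.0),  # long idle: refill is capped at capacity (2)
    bucket.try_acquire(100.0),  # second token of the burst
    bucket.try_acquire(100.0),  # burst exhausted
]
print(results)  # [False, True, False, True, True, True, False]
```

Under these semantics a rate of 0.5 allows one request every two seconds, while the capacity bounds how large a burst can be after an idle period, so a very large bucket size effectively permits long bursts.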

In [150]:
tables_html = [table.metadata.text_as_html for table in tables]
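# batch() runs the chain once per table; invoke() would pass the whole list as a single input
table_summaries = summarize_chain.batch(tables_html, {"max_concurrency": 3})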

NotFoundError: Error code: 404 - {'error': {'code': 'DeploymentNotFound', 'message': 'The API deployment for this resource does not exist. If you created the deployment within the last 5 minutes, please wait a moment and try again.'}}
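
Producing one summary per table means running the chain once per table rather than once for the whole list. A minimal sketch of that fan-out, assuming a hypothetical `summarize` function in place of the real chain call so that it runs without Azure credentials; `summarize_all` and the sample HTML strings are also illustrative.

```python
from concurrent.futures import ThreadPoolExecutor


def summarize(table_html: str) -> str:
    """Hypothetical stand-in for a single chain call on one table."""
    return f"summary of a table with {len(table_html)} characters"


def summarize_all(tables_html, max_concurrency=3):
    """Summarize each table separately, with a bounded number of parallel calls."""
    with ThreadPoolExecutor(max_workers=max_concurrency) as pool:
        # map() returns results in the same order as the inputs
        return list(pool.map(summarize, tables_html))


sample_tables = ["<table><tr><td>a</td></tr></table>", "<table></table>"]
summaries = summarize_all(sample_tables)
print(len(summaries))  # 2
```

LangChain runnables provide `batch` for the same purpose; the thread pool here only illustrates the idea of one call per input with limited concurrency.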

### Summarization of Images