# Multimodal Pipeline for RAG

## Phase 0: Setting up the Project

In [89]:
!brew install poppler tesseract libmagic
%pip install "unstructured[pdf]" pillow

[34m==>[0m [1mAuto-updating Homebrew...[0m
Adjust how often this is run with HOMEBREW_AUTO_UPDATE_SECS or disable with
HOMEBREW_NO_AUTO_UPDATE. Hide these hints with HOMEBREW_NO_ENV_HINTS (see `man brew`).
[34m==>[0m [1mDownloading https://ghcr.io/v2/homebrew/portable-ruby/portable-ruby/blobs/sha256:7645e2d653a335798030f6502e7834dfdbeeec5629429a1a34da5dbb2c57d63e[0m
######################################################################### 100.0%                                50.0%
[34m==>[0m [1mPouring portable-ruby-3.3.8.arm64_big_sur.bottle.tar.gz[0m
[34m==>[0m [1mAuto-updated Homebrew![0m
Updated 2 taps (homebrew/core and homebrew/cask).
[34m==>[0m [1mNew Formulae[0m
anubis              fedify              intermodal          mob
apache-flink@1      ftxui               is-fast             otterdog
chdig               gama                ktexttemplate       pyp
dblab               gdown               libpg_query         tfcmt
dish                geesefs          

## Phase 1: Indexing

Indexing starts with the cleaning and extraction of raw data in diverse formats like PDF, HTML, Word, and Markdown, which is then converted into a uniform plain text format.

### Extraction

In [75]:
import os
os.environ["UNSTRUCTURED_HI_RES_MODEL_NAME"] = "detectron2_onnx"

In [None]:
from unstructured.partition.pdf import partition_pdf
file_path = "./assets/MTS2916A.pdf"

pdf_elements = partition_pdf(
    filename=file_path,
    infer_table_structure=True,
    strategy="hi_res",                      # 'hi_res' or 'ocr_only'
    #hi_res_model_name='detectron2_onnx',   # 'yolox' has problems identifying tables! other options are 'detectron2_onnx' 'pytesseract',
    #extract_image_block_types=["Image"],
    #extract_image_block_to_payload=True,   # If True, will extract base64 for API usage
    #chunking_strategy='by_title',          # splitting strategy for the document (related elements are now grouped together) other options are 'basic'or 'by_title,
    #max_characters=10000,                  # defaults to 500
    #combine_text_under_n_chars=2000,       # defaults to 0
    #new_after_n_chars=6000
    )

AttributeError: 'list' object has no attribute 'element_coords'

With the recent updates to the unstructured library (especially >=0.11.x), when using chunking_strategy="by_title", the output elements are wrapped as CompositeElement, grouping together content under headings — which can include Table, Text, Image, etc.

🔍 When to Use Raw Access (No Chunking)

✅ Use this when:
	•	Your primary goal is to extract specific elements, like tables, without worrying about their surrounding context.
	•	You want to classify, transform, or analyze tables or text independently.
	•	You’re building a pipeline where you process each element individually (e.g., sending them to LLMs, storing in a vector DB, etc.).

✅ Pros:
	•	Simple and straightforward.
	•	Full visibility into all content types.
	•	Easier debugging and testing.

❌ Cons:
	•	No semantic grouping — loses the logical structure (e.g., which section the table belongs to).

🧩 When to Use Chunking (e.g. by_title)

✅ Use this when:
	•	You want to preserve the document’s logical structure — e.g., sections, headings, context.
	•	You’re building a retrieval system, summarizer, or LLM pipeline that benefits from cohesive, meaningful chunks.
	•	You want to preserve the relationship between paragraphs and tables/images under a specific section.

✅ Pros:
	•	More semantically meaningful.
	•	Better input for language models.
	•	Maintains context between related elements.

❌ Cons:
	•	More complex to work with — requires digging into CompositeElement.elements.
	•	Slightly harder to extract just tables.

“Basic” chunking strategy: This method allows you to combine sequential elements to maximally fill each chunk while respecting the maximum chunk size limit. If a single isolated element exceeds the hard-max, it will be divided into two or more chunks.

“By title” chunking strategy: This strategy leverages the document element types identified during partitioning to understand the document structure, and preserves section boundaries. This means that a single chunk will never contain text that occurred in two different sections, ensuring that topics remain self-contained for enhanced retrieval precision. 

In [65]:
#pdf_elements is a list full of chunks. the different elements are stored in the metadata attribute of each chunk so next we have to extract the elements from the metadata with a loop

#pdf_elements[0].metadata.orig_elements
chunks = pdf_elements
chunks

[<unstructured.documents.elements.CompositeElement at 0x3c0a6b700>,
 <unstructured.documents.elements.CompositeElement at 0x3c0a6bb60>,
 <unstructured.documents.elements.CompositeElement at 0x3c0a69be0>,
 <unstructured.documents.elements.CompositeElement at 0x3c0a69d30>,
 <unstructured.documents.elements.CompositeElement at 0x3c0a6b7e0>,
 <unstructured.documents.elements.CompositeElement at 0x3c0a6ba10>,
 <unstructured.documents.elements.CompositeElement at 0x3c0a6b380>,
 <unstructured.documents.elements.CompositeElement at 0x3c0a69da0>,
 <unstructured.documents.elements.CompositeElement at 0x3c0a6ac10>,
 <unstructured.documents.elements.CompositeElement at 0x3c0a6b310>,
 <unstructured.documents.elements.CompositeElement at 0x3c0a6ab30>,
 <unstructured.documents.elements.CompositeElement at 0x3c0a6a740>]

In [72]:
#Subelements contain the actual elements

# test_chunks = pdf_elements[0].metadata.orig_elements
# set([str(type(el)) for el in chunks])

chunks[1].metadata.orig_elements

[<unstructured.documents.elements.Title at 0x3bf509c50>,
 <unstructured.documents.elements.Text at 0x3bf8a5cc0>,
 <unstructured.documents.elements.Title at 0x3bf8a5ef0>,
 <unstructured.documents.elements.Title at 0x3bf8a5fd0>,
 <unstructured.documents.elements.Title at 0x3c0d02120>,
 <unstructured.documents.elements.Title at 0x3bf8a5f60>,
 <unstructured.documents.elements.NarrativeText at 0x3c084a970>,
 <unstructured.documents.elements.Title at 0x3c0849da0>,
 <unstructured.documents.elements.Title at 0x3c0d008a0>,
 <unstructured.documents.elements.Title at 0x3c0849cc0>,
 <unstructured.documents.elements.Title at 0x3c084a120>,
 <unstructured.documents.elements.NarrativeText at 0x3c084a430>,
 <unstructured.documents.elements.Title at 0x3c0d00a60>,
 <unstructured.documents.elements.Title at 0x3c0d00b40>,
 <unstructured.documents.elements.Title at 0x3c0d00ad0>,
 <unstructured.documents.elements.Title at 0x3c0d00bb0>,
 <unstructured.documents.elements.Title at 0x3c0d00c90>,
 <unstructured.d

### Separate extracted elements into tables, text and images

Separate tables from texts

In [61]:
# separate tables from texts
tables = []
texts = []

for chunk in chunks:
    if "CompositeElement" in str(type(chunk)):
        for el in chunk.metadata.orig_elements:
            if "Table" in str(type(el)):
                tables.append(el)
                #remove table from chunk
                chunk.metadata.orig_elements.remove(el)

IndexError: list index out of range

In [63]:
# For Testing only!
# Search for Table in chunk
for chunk in chunks:
    if "Table" in str(type(chunk)):
        print("Found Table in chunk")
        print(chunk.metadata.orig_elements)
        #print(chunk.metadata

Get the images from the CompositeElement objects

In [38]:
def get_images_base64(chunks):
    images_b64 = []
    for chunk in chunks:
        if "CompositeElement" in str(type(chunk)):
            chunk_els = chunk.metadata.orig_elements
            for el in chunk_els:
                if "Image" in str(type(el)):
                    images_b64.append(el.metadata.image_base64)
                    chunk.metadata.orig_elements.remove(el)
    return images_b64

images = get_images_base64(chunks)

In [9]:
#Check if images are still in the chunks
for chunk in chunks:
    if "Image" in str(type(chunk)):
        print("Found Image in chunk")
        print(chunk.metadata.orig_elements)
        #print(chunk.metadata.orig_elements

In [39]:
# The rest of the chunks are text
texts = []
for chunk in chunks:
    texts.append(chunk)

In [11]:
#print(texts[0])

## Phase 2: Summarization

In [40]:
%pip install --upgrade --quiet  langchain-openai
%pip install dotenv

Note: you may need to restart the kernel to use updated packages.
Note: you may need to restart the kernel to use updated packages.


In [41]:
from langchain_openai import AzureChatOpenAI
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser
import os
from dotenv import load_dotenv
from langchain_core.rate_limiters import InMemoryRateLimiter

rate_limiter = InMemoryRateLimiter(
    requests_per_second=0.5,
    check_every_n_seconds=1,
    max_bucket_size=500000,
)
load_dotenv()
azure_api_key = os.environ.get("AZURE_OPENAI_API_KEY")
azure_endpoint = os.environ.get("AZURE_OPENAI_ENDPOINT")

### Summarization of Tables

In [43]:
# Prompt
prompt_text ="""
You are a helpful assistant tasked with summarizing tables precisely.
Give a concise summary of the table.

Respond only with the summary, no additional comment.
Do not start your message by saying "Here is a summary" or anything like that.
Just give the summary as it is.

Table: {element}

"""
prompt = ChatPromptTemplate.from_template(prompt_text)
#rate_limiter=rate_limiter,

# Summary chain
model = AzureChatOpenAI(
    azure_deployment="gpt-4o",
    azure_endpoint=azure_endpoint,
    openai_api_key=azure_api_key,
    api_version="2024-12-01-preview",
    temperature=1.0,
    model="gpt-4o"
)
#summarize_chain = {"element": lambda x: x} | prompt | model | StrOutputParser()
summarize_chain = prompt | model | StrOutputParser()

In [46]:
tables_html = [table.metadata.text_as_html for table in tables]
tables_html[2]

'<table><thead><tr><th>Parameters</th><th>Sym</th><th>Min</th><th>Typ</th><th>Max</th><th>Units</th><th>Conditions</th></tr></thead><tbody><tr><td colspan="7">DC Characteristics</td></tr><tr><td>Logic Supply Voltage</td><td>Viocic</td><td>4.5</td><td>5.0</td><td>5.5</td><td>Vv</td><td></td></tr><tr><td>Load Supply Voltage</td><td>ViLoaD</td><td>10</td><td>30</td><td>40</td><td>Vv</td><td></td></tr><tr><td>Logic Supply Current</td><td>lvLocic</td><td>_</td><td>0.8</td><td>1.0</td><td>mA</td><td></td></tr><tr><td>Veer Voltage Range</td><td>VREF</td><td>1.5</td><td>5.0</td><td>7.0</td><td>Vv</td><td></td></tr><tr><td rowspan="2">Driver Supply Current</td><td>IVLOAD ON</td><td>_</td><td>0.55</td><td>1.0</td><td>mA</td><td>| Both Bridges ON, No Load</td></tr><tr><td>IVLOAD_OFF</td><td>_</td><td>0.55</td><td>1.0</td><td>mA _</td><td>| Both Bridges OFF</td></tr><tr><td>Control Logic Input Current (Vin = OV)</td><td>lin</td><td>—_</td><td>—_</td><td>-70</td><td>HA</td><td>101, 111 ,102, 112, P

In [None]:

#table_summaries = summarize_chain.invoke(tables_html)
table_summaries = summarize_chain.batch(tables_html, {"max_concurrency": 3})

In [33]:
print(table_summaries[2])

The table lists key performance parameters of a temperature sensor:

1. **Temperature Sensor Accuracy**:
   - For -20°C to 100°C: Accuracy range is -0.5°C to +0.5°C at 3.3V (Vpp).
   - For 40°C to 125°C: Accuracy range is -1.0°C to +1.0°C at 3.3V (Vpp).
   - Drift: Typical value of +0.05°C at 3.3V (Vpp).
   - Repeatability: Typical value of +0.0625°C after 48 hours at +55°C and 3.3V (Vpp).

2. **Temperature Conversion Time (typical)**:
   - 0.5°C/bit: 30 ms (33 samples/sec).
   - 0.25°C/bit: 65 ms (15 samples/sec).
   - 0.125°C/bit: 130 ms (7 samples/sec).
   - 0.0625°C/bit: 250 ms (4 samples/sec).

3. **Power Supply**:
   - Operating Voltage: 2.7V to 5.5V.
   - Current Consumption: 200 µA (typical), 400 µA (max).
   - Shutdown Current: 0.1 pA (typical), 2 pA (max).
   - Reset Voltage: 2.2V (typical threshold for Vpp drop).
   - Supply Rejection: -0.1°C/V from 2.7V to 5.5V at 25°C.

4. **Alert Output**:
   - High-Level Leakage Current: Up to 1 µA.
   - Low-Level Voltage: Maximum of 0.4

In [19]:
from lxml import etree

def validate_html_tables(table_list):
    """
    Given a list of HTML snippets (strings) each containing a <table> element,
    returns a list of dicts with validation results.
    """
    results = []
    for idx, html in enumerate(table_list, start=1):
        parser = etree.HTMLParser()  # collects parse errors
        try:
            # Try to parse the snippet
            etree.fromstring(html, parser)
            errors = parser.error_log
            valid = len(errors) == 0
        except etree.XMLSyntaxError as e:
            # Fatal syntax error
            valid = False
            errors = [e]
        
        # Record result
        results.append({
            'table_index': idx,
            'valid': valid,
            'errors': [str(err) for err in errors]
        })
    return results

In [20]:
for res in validate_html_tables(tables_html):
        print(f"Table #{res['table_index']}:",
              "Valid" if res['valid'] else "INVALID")
        if not res['valid']:
            print("  Errors:")
            for err in res['errors']:
                print("   -", err)

Table #1: Valid
Table #2: Valid
Table #3: Valid
Table #4: Valid
Table #5: Valid
Table #6: Valid
Table #7: Valid
Table #8: Valid
Table #9: Valid
Table #10: Valid
Table #11: Valid
Table #12: Valid
Table #13: Valid
Table #14: Valid
Table #15: Valid
Table #16: Valid
Table #17: Valid
Table #18: Valid
Table #19: Valid
Table #20: Valid
Table #21: Valid
Table #22: Valid
Table #23: Valid
Table #24: Valid
Table #25: Valid
Table #26: Valid
Table #27: Valid


In [25]:
tables_html[26]

'<table><thead><tr><th rowspan="2">Device</th><th colspan="2">TapeandReel</th><th>Temperature</th><th>Package</th><th>a)</th><th>MCP9808-E/MC:</th><th>Extended Temperature 8LD DFN package.</th></tr><tr><th colspan="2">and/or Alternate Pinout</th><th>Range</th><th></th><th>b)</th><th>MCP9808-E/MS:</th><th>Extended Temperature 8LD MSOP package.</th></tr></thead><tbody><tr><td>Device:</td><td></td><td>MCP9808: MCP9808T:</td><td>Digital Digital</td><td>Temperature Sensor Temperature Sensor (Tape and Reel)</td><td>c)</td><td>MCP9808T-E/MC:</td><td>Tape and Reel, Extended Temperature 8LD DFN</td></tr><tr><td>Temperature</td><td>Range: E</td><td></td><td>-40°C to +125°C</td><td></td><td>d)</td><td>MCP9808T-E/MS:</td><td>package. Tape and Reel,</td></tr><tr><td></td><td></td><td></td><td></td><td></td><td></td><td></td><td>Extended Temperature 8LD MSOP package.</td></tr><tr><td rowspan="2">Package:</td><td></td><td>MC</td><td>Plastic Dual Flat</td><td>No-Lead (DFN) 2x3, 8-lead</td><td></td><td

### Summarization of Images

In [None]:
prompt_template = """You are a helpful assistant tasked with describing a image in detail. For context, the image is part of a design specification explaining the design of a digital temperature sensor.
Respond only with the description, no additional comment. Do not start your message by saying "Here is a description" or anything like that. Just give the description as it is."""
messages = [
    (
        "user",
        [
            {"type": "text", "text": prompt_template},
            {
                "type": "image_url",
                "image_url": {"url": "data:image/jpeg;base64,{image}"},
            },
        ],
    )
]

prompt = ChatPromptTemplate.from_messages(messages)

chain = prompt | model | StrOutputParser()

image_summaries = chain.batch(images)

## Phase 3: Vectorization

### Create Vectorstore

In [None]:
import uuid
from langchain.vectorstores import Chroma
from langchain.storage import InMemoryStore
from langchain.schema.document import Document
from langchain.embeddings import OpenAIEmbeddings
from langchain.retrievers.multi_vector import MultiVectorRetriever

# The vectorstore to use to index the child chunks
vectorstore = Chroma(collection_name="DesignSpecsRAG", embedding_function=OpenAIEmbeddings())

# The storage layer for the parent documents
store = InMemoryStore()
id_key = "doc_id"

# The retriever (empty to start)
retriever = MultiVectorRetriever(
    vectorstore=vectorstore,
    docstore=store,
    id_key=id_key,
)

### Load Data

In [None]:
# Add texts
retriever.vectorstore.add_texts(texts)

# Add tables
table_ids = [str(uuid.uuid4()) for _ in tables]
summary_tables = [
    Document(page_content=summary, metadata={id_key: table_ids[i]}) for i, summary in enumerate(table_summaries)
]
retriever.vectorstore.add_documents(summary_tables)
retriever.docstore.mset(list(zip(table_ids, tables)))

# Add image summaries
img_ids = [str(uuid.uuid4()) for _ in images]
summary_img = [
    Document(page_content=summary, metadata={id_key: img_ids[i]}) for i, summary in enumerate(image_summaries)
]
retriever.vectorstore.add_documents(summary_img)
retriever.docstore.mset(list(zip(img_ids, images)))

## Phase 4: Retrieval

In [None]:
# Retrieve
docs = retriever.invoke(
    "What is MTS2916A?"
)

for doc in docs:
    print(str(doc) + "\n\n" + "-" * 80)