# Real-Time Multimodal Use Case: Extract Images, Tables, and Text from a Document

### Unstructured Library: Taming the Chaos of Unstructured Data

The **unstructured** library is a powerful open-source Python toolkit designed to tackle the pervasive challenge of handling unstructured data. In essence, it provides a streamlined and efficient way to extract and preprocess text from a wide array of file formats that lack a predefined data model, such as PDFs, HTML pages, Word documents, and even images. This process transforms messy, difficult-to-use data into a clean, structured format, making it readily available for a variety of downstream tasks, particularly in the realm of natural language processing (NLP) and machine learning.

The core strength of the `unstructured` library lies in its ability to not only extract raw text but also to understand and preserve the inherent structure of a document. For instance, it can differentiate between titles, paragraphs, lists, and other document elements. This "structural awareness" is crucial for providing context to large language models (LLMs) and other analytical tools.

### Key Use Cases and Functionality

The `unstructured` library is instrumental in a range of applications that rely on processing and understanding textual data. Some of its primary uses include:

  * **Retrieval-Augmented Generation (RAG):** A common application is in RAG systems, where the library is used to ingest a corpus of documents (like a company's internal knowledge base). The extracted and cleaned text is then used to provide relevant context to an LLM, enabling it to answer questions based on that specific information.

  * **Fine-tuning Machine Learning Models:** To adapt a machine learning model to a specific domain, it needs to be trained on relevant data. The `unstructured` library can efficiently preprocess large volumes of domain-specific documents, preparing the text for the fine-tuning process.

  * **Traditional ETL (Extract, Transform, Load) Pipelines:** In any data-driven workflow, the initial step often involves gathering data from various sources. `unstructured` serves as a critical component in ETL pipelines by handling the "extract" and "transform" stages for unstructured data, making it suitable for loading into databases or data warehouses.

To achieve this, the library offers several key functionalities:

  * **Partitioning:** This is the core function that breaks down a raw document into a sequence of structured elements.
  * **Cleaning:** It provides utilities to remove unwanted artifacts from the extracted text, such as HTML tags or repetitive headers and footers.
  * **Chunking:** For long documents, the library can intelligently split the text into smaller, more manageable chunks, which is often necessary for feeding into models with limited context windows.
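
To make the chunking idea concrete, here is a toy sketch in plain Python. It is not the library's algorithm (the library ships its own chunking functions); the name `greedy_chunks` and the `max_characters` budget are assumptions chosen for illustration. The sketch packs consecutive element texts into a chunk until adding the next one would exceed the budget.

```python
def greedy_chunks(texts, max_characters=500):
    """Pack consecutive texts into chunks of at most max_characters (toy sketch)."""
    chunks, current = [], ""
    for text in texts:
        candidate = f"{current}\n\n{text}" if current else text
        if len(candidate) <= max_characters or not current:
            # Either the text fits, or the chunk is empty and the text is
            # oversized on its own (it is kept whole rather than split).
            current = candidate
        else:
            chunks.append(current)
            current = text
    if current:
        chunks.append(current)
    return chunks


sections = ["Title", "First paragraph.", "Second paragraph.", "Third paragraph."]
for chunk in greedy_chunks(sections, max_characters=30):
    print(repr(chunk))
```

A real pipeline would more likely respect section boundaries as well as length, so treat this only as a picture of the size constraint.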

Here is a simple Python code snippet demonstrating how to use the `unstructured` library to extract text from a PDF file:

```python
from unstructured.partition.pdf import partition_pdf

elements = partition_pdf("example.pdf")

for element in elements:
    print(element)
```

This code prints the text of each element found in `example.pdf`, one element per line. Each element represents a distinct part of the document, such as a title, a paragraph, or a list item.

In conclusion, the `unstructured` library is an essential tool for developers and data scientists working with real-world data. By simplifying the complex process of ingesting and preprocessing unstructured information, it accelerates the development of powerful AI and data analytics applications.

In [1]:
! pip install "unstructured[all-docs]" pillow pydantic lxml matplotlib



In [2]:
!sudo apt-get update

Hit:1 https://cloud.r-project.org/bin/linux/ubuntu jammy-cran40/ InRelease
Hit:2 http://archive.ubuntu.com/ubuntu jammy InRelease
Hit:3 http://security.ubuntu.com/ubuntu jammy-security InRelease
Hit:4 https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64  InRelease
Hit:5 https://cli.github.com/packages stable InRelease
Hit:6 https://r2u.stat.illinois.edu/ubuntu jammy InRelease
Hit:7 http://archive.ubuntu.com/ubuntu jammy-updates InRelease
Hit:8 http://archive.ubuntu.com/ubuntu jammy-backports InRelease
Hit:9 https://ppa.launchpadcontent.net/deadsnakes/ppa/ubuntu jammy InRelease
Hit:10 https://ppa.launchpadcontent.net/graphics-drivers/ppa/ubuntu jammy InRelease
Hit:11 https://ppa.launchpadcontent.net/ubuntugis/ppa/ubuntu jammy InRelease
Reading package lists... Done
W: Skipping acquire of configured file 'main/source/Sources' as repository 'https://r2u.stat.illinois.edu/ubuntu jammy InRelease' does not seem to provide it (sources.list entry misspelt?)


In [3]:
!sudo apt-get install poppler-utils

Reading package lists... Done
Building dependency tree... Done
Reading state information... Done
poppler-utils is already the newest version (22.02.0-2ubuntu0.9).
0 upgraded, 0 newly installed, 0 to remove and 62 not upgraded.


In [4]:
!sudo apt-get install libleptonica-dev tesseract-ocr libtesseract-dev python3-pil tesseract-ocr-eng tesseract-ocr-script-latn

Reading package lists... Done
Building dependency tree... Done
Reading state information... Done
libleptonica-dev is already the newest version (1.82.0-3build1).
libtesseract-dev is already the newest version (4.1.1-2.1build1).
tesseract-ocr is already the newest version (4.1.1-2.1build1).
tesseract-ocr-eng is already the newest version (1:4.00~git30-7274cfa-1.1).
tesseract-ocr-script-latn is already the newest version (1:4.00~git30-7274cfa-1.1).
python3-pil is already the newest version (9.0.1-1ubuntu0.3).
0 upgraded, 0 newly installed, 0 to remove and 62 not upgraded.


In [5]:
!pip install unstructured-pytesseract
!pip install tesseract-ocr



In [6]:
from unstructured.partition.pdf import partition_pdf

In [None]:
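raw_pdf_elements = partition_pdf(
    filename="/content/Attention is all you need.pdf",
    strategy="hi_res",  # layout-detection strategy; needed to find images and tables
    extract_images_in_pdf=True,
    extract_image_block_types=["Image", "Table"],  # element types to save as image files
    extract_image_block_to_payload=False,  # write files to disk instead of base64 in metadata
    extract_image_block_output_dir="extracted_data",
)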

In [8]:
raw_pdf_elements

[<unstructured.documents.elements.Text at 0x7e6031fd7230>,
 <unstructured.documents.elements.Text at 0x7e6031f9dbe0>,
 <unstructured.documents.elements.Text at 0x7e6031fd7470>,
 <unstructured.documents.elements.Text at 0x7e6031fd6090>,
 <unstructured.documents.elements.Text at 0x7e6031fd5820>,
 <unstructured.documents.elements.Header at 0x7e6031fd52b0>,
 <unstructured.documents.elements.Text at 0x7e6032104320>,
 <unstructured.documents.elements.Text at 0x7e6032107fb0>,
 <unstructured.documents.elements.Text at 0x7e6032107500>,
 <unstructured.documents.elements.Text at 0x7e6032107890>,
 <unstructured.documents.elements.Text at 0x7e6031f9c050>,
 <unstructured.documents.elements.Text at 0x7e6032104380>,
 <unstructured.documents.elements.Text at 0x7e6032104e90>,
 <unstructured.documents.elements.Text at 0x7e6032104e30>,
 <unstructured.documents.elements.Text at 0x7e6032105f10>,
 <unstructured.documents.elements.NarrativeText at 0x7e6031fd7e60>,
 <unstructured.documents.elements.Title at 0x
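
The default repr above shows only class names and memory addresses. A quick way to get an overview is to count elements by class name. The sketch below uses stand-in classes so that it runs without a parsed PDF; the helper name `count_element_types` is an assumption, not part of the library.

```python
from collections import Counter


def count_element_types(elements):
    """Return a Counter mapping each element's class name to its frequency."""
    return Counter(type(element).__name__ for element in elements)


# Stand-in classes so the sketch runs on its own; in the notebook,
# pass raw_pdf_elements instead of demo_elements.
class Title:
    pass


class NarrativeText:
    pass


demo_elements = [Title(), NarrativeText(), NarrativeText()]
print(count_element_types(demo_elements).most_common())
```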

In [69]:
# Collect the text of each element, grouped by its exact element class.
Header = []
Footer = []
Title = []
NarrativeText = []
Text = []
ListItem = []

buckets = {
    "Header": Header,
    "Footer": Footer,
    "Title": Title,
    "NarrativeText": NarrativeText,
    "Text": Text,
    "ListItem": ListItem,
}

for element in raw_pdf_elements:
    # Match on the exact class name: most element classes subclass Text,
    # so isinstance(element, Text) would match them as well.
    bucket = buckets.get(type(element).__name__)
    if bucket is not None:
        bucket.append(str(element))



In [10]:
NarrativeText

['Provided proper attribution is provided, Google hereby grants permission to reproduce the tables and figures in this paper solely for use in journalistic or scholarly works.',
 'Noam Shazeer∗ Google Brain noam@google.com',
 'Niki Parmar∗ Google Research nikip@google.com',
 'Google Research usz@google.com',
 'Google Research llion@google.com',
 'Aidan N. Gomez∗ † University of Toronto aidan@cs.toronto.edu',
 'Łukasz Kaiser∗ Google Brain lukaszkaiser@google.com',
 'illia.polosukhin@gmail.com',
 'The dominant sequence transduction models are based on complex recurrent or convolutional neural networks that include an encoder and a decoder. The best performing models also connect the encoder and decoder through an attention mechanism. We propose a new simple network architecture, the Transformer, based solely on attention mechanisms, dispensing with recurrence and convolutions entirely. Experiments on two machine translation tasks show these models to be superior in quality while being mo

In [11]:
ListItem

['• In "encoder-decoder attention" layers, the queries come from the previous decoder layer, and the memory keys and values come from the output of the encoder. This allows every position in the decoder to attend over all positions in the input sequence. This mimics the typical encoder-decoder attention mechanisms in sequence-to-sequence models such as [38, 2, 9].',
 '• The encoder contains self-attention layers. In a self-attention layer all of the keys, values and queries come from the same place, in this case, the output of the previous layer in the encoder. Each position in the encoder can attend to all positions in the previous layer of the encoder.',
 '• Similarly, self-attention layers in the decoder allow each position in the decoder to attend to all positions in the decoder up to and including that position. We need to prevent leftward information flow in the decoder to preserve the auto-regressive property. We implement this inside of scaled dot-product attention by masking o

In [12]:
img = []
for element in raw_pdf_elements:
    if "unstructured.documents.elements.Image" in str(type(element)):
        # str(element) is the OCR text found inside the figure, not the picture itself
        img.append(str(element))

In [13]:
img[0]

'Output Probabilities Add & Norm Feed Forward Add & Norm Multi-Head Attention yo Add & Norm Add & Norm Feed Forward Nx | Cag Norm) Add & Norm Masked Multi-Head Multi-Head Attention Attention Lt Lt Positional Positional Encoding EQ © OY Encoding Input Output Embedding Embedding Inputs Outputs (shifted right)'
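
The string above is the OCR text read from the figure, which is why it looks scrambled. The picture itself was written to `extracted_data`. As far as I know, each saved element records its file location in `element.metadata.image_path` when `extract_image_block_to_payload=False`; treat that attribute name as an assumption to check against your installed version. The helper name `collect_image_paths` and the demo file name are hypothetical.

```python
from types import SimpleNamespace


def collect_image_paths(elements):
    """Return the saved file path of every element that records one."""
    paths = []
    for element in elements:
        metadata = getattr(element, "metadata", None)
        # Assumption: saved images expose their location as metadata.image_path.
        image_path = getattr(metadata, "image_path", None)
        if image_path:
            paths.append(image_path)
    return paths


# Stand-in objects so the sketch runs on its own; in the notebook,
# pass raw_pdf_elements instead of demo.
demo = [
    SimpleNamespace(metadata=SimpleNamespace(image_path="extracted_data/figure-1-1.jpg")),
    SimpleNamespace(metadata=SimpleNamespace(image_path=None)),
    SimpleNamespace(metadata=None),
]
print(collect_image_paths(demo))
```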

In [14]:
tab = []
for element in raw_pdf_elements:
    if "unstructured.documents.elements.Table" in str(type(element)):
        # str(element) flattens the table into one line of text
        tab.append(str(element))

In [15]:
tab[1]

'Model BLEU EN-DE EN-FR Training Cost (FLOPs) EN-DE EN-FR ByteNet [18] 23.75 Deep-Att + PosUnk [39] 39.2 1.0 · 1020 GNMT + RL [38] 24.6 39.92 2.3 · 1019 1.4 · 1020 ConvS2S [9] 25.16 40.46 9.6 · 1018 1.5 · 1020 MoE [32] 26.03 40.56 2.0 · 1019 1.2 · 1020 Deep-Att + PosUnk Ensemble [39] 40.4 8.0 · 1020 GNMT + RL Ensemble [38] 26.30 41.16 1.8 · 1020 1.1 · 1021 ConvS2S Ensemble [9] 26.36 41.29 7.7 · 1019 1.2 · 1021 Transformer (base model) 27.3 38.1 3.3 · 1018 Transformer (big) 28.4 41.8 2.3 · 1019'
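
Converting a table element to a string loses the row and column boundaries, as the output above shows. One option, assuming `partition_pdf` is called with `infer_table_structure=True`, is to read the HTML rendering that the library stores in `element.metadata.text_as_html`. The sketch below parses such HTML into rows with the standard library only. The sample HTML is hypothetical (only its values come from the table above), and `html_table_to_rows` is an illustrative name.

```python
from html.parser import HTMLParser


class TableRows(HTMLParser):
    """Collect the cell text of an HTML table as a list of rows."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._cell = None  # text pieces of the cell being read, or None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.rows.append([])
        elif tag in ("td", "th"):
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            if not self.rows:
                self.rows.append([])
            self.rows[-1].append("".join(self._cell).strip())
            self._cell = None


def html_table_to_rows(html):
    """Parse an HTML table string into a list of rows of cell text."""
    parser = TableRows()
    parser.feed(html)
    parser.close()
    return parser.rows


# Hypothetical HTML in the shape text_as_html might take.
sample_html = (
    "<table><tr><th>Model</th><th>EN-DE</th><th>EN-FR</th></tr>"
    "<tr><td>Transformer (big)</td><td>28.4</td><td>41.8</td></tr></table>"
)
for row in html_table_to_rows(sample_html):
    print(row)
```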

In [26]:
!pip install langchain_core



In [27]:
!pip install langchain_openai langchain_google_genai



In [28]:
len(tab)

4

In [29]:
tab[0]

'Layer Type Complexity per Layer Sequential Maximum Path Length Operations Self-Attention O(n2 · d) O(1) O(1) Recurrent O(n · d2) O(n) O(n) Convolutional O(k · n · d2) O(1) O(logk(n)) Self-Attention (restricted) O(r · n · d) O(1) O(n/r)'

In [30]:
len(img)

7

In [31]:
from langchain_core.output_parsers import StrOutputParser
from langchain_core.prompts import ChatPromptTemplate
from langchain_openai import ChatOpenAI
from langchain_google_genai import ChatGoogleGenerativeAI

In [32]:
# Prompt
prompt_text = """You are an assistant tasked with summarizing tables for retrieval. \
    These summaries will be embedded and used to retrieve the raw table elements. \
    Give a concise summary of the table that is well optimized for retrieval. Table {element} """

In [33]:
prompt = ChatPromptTemplate.from_template(prompt_text)

In [40]:
import os
from google.colab import userdata

OPENAI_API_TOKEN=userdata.get('OPENAI_API_KEY')
os.environ["OPENAI_API_KEY"] = OPENAI_API_TOKEN

GEMINI_API_TOKEN = userdata.get('GEMINI_API_KEY')
os.environ["GEMINI_API_KEY"] = GEMINI_API_TOKEN
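
The cell above only works inside Google Colab. A minimal sketch of a fallback for other environments follows; `load_secret` is a hypothetical helper that tries Colab's `userdata` first, then the process environment, and finally prompts for the value.

```python
import os
from getpass import getpass


def load_secret(name):
    """Return a secret from Colab userdata, the environment, or a prompt."""
    try:
        from google.colab import userdata  # only importable inside Colab

        value = userdata.get(name)
        if value:
            return value
    except Exception:
        # Not running in Colab, or the secret is not configured there.
        pass
    value = os.environ.get(name)
    if value:
        return value
    return getpass(f"Enter {name}: ")


# Usage (commented out because it may prompt for input):
# os.environ["OPENAI_API_KEY"] = load_secret("OPENAI_API_KEY")
```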

In [55]:
# Text summary chain
# Pick one chat model; uncomment the Gemini line to use it instead of OpenAI.
# model = ChatGoogleGenerativeAI(model="gemini-2.5-flash")
model = ChatOpenAI(model="gpt-4o-mini")

In [56]:
summarize_chain = {"element": lambda x: x} | prompt | model | StrOutputParser()

In [57]:
table_summaries = []

In [58]:
# Summarize all tables, running at most five requests concurrently
table_summaries = summarize_chain.batch(tab, {"max_concurrency": 5})

In [59]:
tab[0]

'Layer Type Complexity per Layer Sequential Maximum Path Length Operations Self-Attention O(n2 · d) O(1) O(1) Recurrent O(n · d2) O(n) O(n) Convolutional O(k · n · d2) O(1) O(logk(n)) Self-Attention (restricted) O(r · n · d) O(1) O(n/r)'

In [60]:
table_summaries[0]

'**Table Summary: Layer Types and Complexity**\n\nThis table compares different layer types in terms of their computational complexity and operational characteristics:\n\n1. **Self-Attention**\n   - Complexity: O(n² · d)\n   - Sequential Max Path Length: O(1)\n   - Operations: O(1)\n\n2. **Recurrent**\n   - Complexity: O(n · d²)\n   - Sequential Max Path Length: O(n)\n   - Operations: O(n)\n\n3. **Convolutional**\n   - Complexity: O(k · n · d²)\n   - Sequential Max Path Length: O(1)\n   - Operations: O(logk(n))\n\n4. **Self-Attention (Restricted)**\n   - Complexity: O(r · n · d)\n   - Sequential Max Path Length: O(1)\n   - Operations: O(n/r)\n\nKey parameters include layer type (Self-Attention, Recurrent, Convolutional), complexity in terms of input size (n), depth (d), and additional parameters (k, r).'
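
The prompt says the summaries will be embedded and used to retrieve the raw tables, which requires keeping a link between each summary and its source. Here is a minimal sketch of that bookkeeping, with no embedding model or vector store involved; `link_summaries` and the `doc_id` field are illustrative names.

```python
import uuid


def link_summaries(raw_items, summaries):
    """Pair each summary with its raw item through a shared id."""
    if len(raw_items) != len(summaries):
        raise ValueError("raw_items and summaries must have the same length")
    docstore = {}          # id -> raw item, handed to the LLM after retrieval
    summary_records = []   # what would be embedded and searched
    for raw, summary in zip(raw_items, summaries):
        doc_id = str(uuid.uuid4())
        docstore[doc_id] = raw
        summary_records.append({"doc_id": doc_id, "summary": summary})
    return docstore, summary_records


# Placeholder strings; in the notebook, pass tab and table_summaries.
docstore, records = link_summaries(
    ["raw table A", "raw table B"], ["summary A", "summary B"]
)
hit = records[1]  # pretend a similarity search matched the second summary
print(docstore[hit["doc_id"]])
```

Searching over summaries while returning the raw element keeps the retrieval text short without losing the original table content.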

In [61]:
img[0]

'Output Probabilities Add & Norm Feed Forward Add & Norm Multi-Head Attention yo Add & Norm Add & Norm Feed Forward Nx | Cag Norm) Add & Norm Masked Multi-Head Multi-Head Attention Attention Lt Lt Positional Positional Encoding EQ © OY Encoding Input Output Embedding Embedding Inputs Outputs (shifted right)'

In [62]:
import base64
import os
from langchain_core.messages import HumanMessage

In [63]:
def encode_image(image_path):
    """Return the contents of the image file as a base64-encoded string."""
    with open(image_path, "rb") as image_file:
        return base64.b64encode(image_file.read()).decode("utf-8")

In [64]:
def image_summarize(img_base64, prompt):
    """Make image summary"""


    chat = ChatOpenAI(model="gpt-5", max_tokens=1024)

    msg = chat.invoke(
        [
            HumanMessage(
                content=[
                    {"type": "text", "text": prompt},

                     {
                        "type": "image_url",
                        "image_url": {"url": f"data:image/jpeg;base64,{img_base64}"},
                    },
                ]
            )
        ]
    )
    return msg.content
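
The function above hard-codes `image/jpeg` in the data URL, which matches the `.jpg` files used in this notebook. If other image formats are involved, one option is to derive the MIME type from the file extension. This is a sketch using only the standard library; `to_data_url` is a hypothetical helper.

```python
import base64
import mimetypes
import os
import tempfile


def to_data_url(image_path):
    """Read an image file and return it as a base64 data URL."""
    mime_type, _ = mimetypes.guess_type(image_path)
    if mime_type is None or not mime_type.startswith("image/"):
        raise ValueError(f"Not a recognized image file: {image_path}")
    with open(image_path, "rb") as image_file:
        encoded = base64.b64encode(image_file.read()).decode("utf-8")
    return f"data:{mime_type};base64,{encoded}"


# Demo with a throwaway file; the bytes are not a real image.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "demo.png")
    with open(demo_path, "wb") as demo_file:
        demo_file.write(b"abc")
    demo_url = to_data_url(demo_path)
print(demo_url)
```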

In [65]:
def generate_img_summaries(path):
    """
    Generate the summary and base64-encoded string for one image.
    path: Path to a single .jpg file extracted by Unstructured
    Returns two single-item lists: the base64 strings and the summaries.
    """

    # Store base64 encoded images
    img_base64_list = []

    # Store image summaries
    image_summaries = []

    # Prompt
    prompt = """You are an assistant tasked with summarizing images for retrieval. \
    These summaries will be embedded and used to retrieve the raw image. \
    Give a concise summary of the image that is well optimized for retrieval."""


    base64_image = encode_image(path)
    img_base64_list.append(base64_image)
    image_summaries.append(image_summarize(base64_image, prompt))

    return img_base64_list, image_summaries

In [66]:
fpath="/content/extracted_data/figure-3-1.jpg"

In [67]:
img_base64_list,image_summaries=generate_img_summaries(fpath)

In [68]:
image_summaries[0]

'Transformer architecture diagram (Attention Is All You Need): encoder–decoder stack with N× repeated layers. Encoder: input embedding + positional encoding → multi-head self-attention → add & norm → feed-forward → add & norm (residuals). Decoder: output embedding + positional encoding → masked multi-head self-attention → add & norm → cross-attention to encoder → add & norm → feed-forward → add & norm. Final linear + softmax to output probabilities. Keywords: residual connections, positional encodings, multi-head attention, masked attention, feed-forward, encoder-decoder.'
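
The cells above summarize a single file. To cover every image that was written to the output directory, a loop along these lines could work. It is a sketch: `summarize_image_dir` is a hypothetical helper, and it takes the summarizer as a callable so that it can be tried without an API key.

```python
import base64
import tempfile
from pathlib import Path


def summarize_image_dir(directory, summarize, pattern="*.jpg"):
    """Return {file name: summary} for every matching image in directory."""
    summaries = {}
    for image_path in sorted(Path(directory).glob(pattern)):
        encoded = base64.b64encode(image_path.read_bytes()).decode("utf-8")
        summaries[image_path.name] = summarize(encoded)
    return summaries


# Demo with throwaway files and a stub summarizer that makes no API call.
with tempfile.TemporaryDirectory() as tmp:
    for name in ("figure-2.jpg", "figure-1.jpg", "notes.txt"):
        Path(tmp, name).write_bytes(b"abc")
    demo_summaries = summarize_image_dir(tmp, lambda b64: f"{len(b64)} base64 chars")
print(demo_summaries)
```

In the notebook, the stub could be replaced with a call such as `lambda b64: image_summarize(b64, image_prompt)`, where `image_prompt` is a prompt string you define.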