### System Dependencies
To get started with Unstructured.io, we need a few system-wide dependencies:

### Poppler (poppler-utils)
Handles PDF processing. It's a library that can extract text, images, and metadata from PDFs. Unstructured uses it to parse PDF documents and convert them into processable text.

### Tesseract (tesseract-ocr)
Optical Character Recognition (OCR) engine. When you have scanned documents, images with text, or PDFs that are essentially pictures, Tesseract reads the text from these images and converts it to machine-readable text.

### libmagic
File type detection library. It identifies what type of file you're dealing with (PDF, Word doc, image, etc.) by analyzing the file's content, not just the extension. This helps Unstructured choose the right processing method for each document.
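
Before installing the Python packages, it can help to confirm that these tools are visible from the current environment. The following is a minimal sketch using only the standard library. The helper name `check_system_dependencies` is hypothetical, and the sketch assumes that Poppler exposes a `pdftoppm` executable and Tesseract a `tesseract` executable on `PATH`, which is typical but can differ by platform or install method.

```python
import shutil
from ctypes.util import find_library


def check_system_dependencies() -> dict:
    """Report whether each system dependency appears to be installed.

    Assumption: Poppler and Tesseract are detected by looking for their
    command-line executables; libmagic is detected as a shared library.
    """
    return {
        "poppler": shutil.which("pdftoppm") is not None,
        "tesseract": shutil.which("tesseract") is not None,
        "libmagic": find_library("magic") is not None,
    }


for name, found in check_system_dependencies().items():
    print(f"{name}: {'found' if found else 'missing'}")
```

A "missing" result is only a hint: a tool may be installed in a location that is not on `PATH`, or bundled inside a Python package.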

In [None]:
%pip install -Uq "unstructured[all-docs]" 
%pip install -Uq langchain_chroma 
%pip install -Uq langchain langchain-community
%pip install -Uq python_dotenv

In [None]:
%pip install -Uq langchain-google-genai langchain_huggingface

In [2]:
import json
from typing import List

from unstructured.partition.pdf import partition_pdf
from unstructured.chunking.title import chunk_by_title

from langchain_google_genai import ChatGoogleGenerativeAI
from langchain_huggingface import HuggingFaceEmbeddings
from langchain_chroma import Chroma
from langchain_core.messages import HumanMessage
from langchain_core.documents import Document
from dotenv import load_dotenv



In [3]:
load_dotenv()

True

In [4]:
def partition_documents(file_path: str):
    """
    Extract elements from a PDF document.

    Args:
        file_path (str): The file path to the PDF document.
    """
    print(f"Partitioning document: {file_path}")

    elements = partition_pdf(
        filename=file_path, # Path to your PDF file
        strategy="hi_res",  # Most accurate (but slowest) extraction strategy
        infer_table_structure=True,  # Keep tables as structured HTML instead of jumbled text
        extract_image_block_types=["Image"],  # Extract the image blocks found in the PDF
        extract_image_block_to_payload=True,  # Store each image as Base64 data in the element metadata
    )

    print(f"✅ Extracted {len(elements)} elements")

    return elements

# Testing
file_path = "./docs/Attention_is_all_you_need.pdf"
elements = partition_documents(file_path)

Partitioning document: ./docs/Attention_is_all_you_need.pdf


The `max_size` parameter is deprecated and will be removed in v4.26. Please specify in `size['longest_edge'] instead`.


✅ Extracted 220 elements


In [52]:
elements[0]

<unstructured.documents.elements.Text at 0x2837ad496a0>

In [53]:
elements[49]

<unstructured.documents.elements.Image at 0x2837b2e0390>

In [54]:
set([str(type(el)) for el in elements])

{"<class 'unstructured.documents.elements.FigureCaption'>",
 "<class 'unstructured.documents.elements.Footer'>",
 "<class 'unstructured.documents.elements.Formula'>",
 "<class 'unstructured.documents.elements.Header'>",
 "<class 'unstructured.documents.elements.Image'>",
 "<class 'unstructured.documents.elements.ListItem'>",
 "<class 'unstructured.documents.elements.NarrativeText'>",
 "<class 'unstructured.documents.elements.Table'>",
 "<class 'unstructured.documents.elements.Text'>",
 "<class 'unstructured.documents.elements.Title'>"}

In [55]:
elements[30].to_dict()

{'type': 'Title',
 'element_id': 'cb2dd736a07d5643061be24e9e9364c3',
 'text': 'Abstract',
 'metadata': {'detection_class_prob': 0.80176842212677,
  'is_extracted': 'true',
  'coordinates': {'points': ((np.float64(788.2166666666665),
     np.float64(1070.3456577777777)),
    (np.float64(788.2166666666665), np.float64(1103.5545466666665)),
    (np.float64(913.0672607421875), np.float64(1103.5545466666665)),
    (np.float64(913.0672607421875), np.float64(1070.3456577777777))),
   'system': 'PixelSpace',
   'layout_width': 1700,
   'layout_height': 2200},
  'last_modified': '2025-12-05T17:35:32',
  'filetype': 'application/pdf',
  'languages': ['eng'],
  'page_number': 1,
  'file_directory': './docs',
  'filename': 'Attention_is_all_you_need.pdf',
  'parent_id': '8fb6c0a7d81753e6ad6378121866d944'}}
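
Beyond inspecting single elements, a per-category count gives a quick picture of what the partitioning step produced. This is a small sketch, not part of the original notebook: `count_by_category` is a hypothetical helper, and the stand-in objects only mimic the `category` attribute that the notebook reads from real elements.

```python
from collections import Counter
from types import SimpleNamespace


def count_by_category(elements) -> Counter:
    """Count elements per category (e.g. 'Title', 'Table', 'Image')."""
    return Counter(el.category for el in elements)


# Stand-in objects so the sketch runs on its own; in the notebook,
# pass the `elements` list returned by partition_documents() instead.
sample = [
    SimpleNamespace(category="Title"),
    SimpleNamespace(category="NarrativeText"),
    SimpleNamespace(category="NarrativeText"),
    SimpleNamespace(category="Image"),
]

for category, count in count_by_category(sample).most_common():
    print(f"{category}: {count}")
```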

In [5]:
# Extract all Image elements
images = [el for el in elements if el.category == "Image"]

print(f"There are {len(images)} image elements")

images[0].to_dict()

#for el in elements:
    #if el.category == "Image":
        #print(el.to_dict())

# https://codebeautify.org/base64-to-image-converter

There are 7 image elements


{'type': 'Image',
 'element_id': '8ececc354de606443bee7854621a450e',
 'text': 'Output Probabilities Add & Norm Feed Forward Add & Norm Multi-Head Attention a, Add & Norm Add & Norm Feed Forward Nx | -+CAgc8 Norm) Add & Norm Masked Multi-Head Multi-Head Attention Attention Se a, ee a, Positional Positional Encoding @ © @ Encoding Input Output Embedding Embedding Inputs Outputs (shifted right)',
 'metadata': {'coordinates': {'points': ((np.float64(545.9972222222221),
     np.float64(200.00555555555542)),
    (np.float64(545.9972222222221), np.float64(1095.6055555555556)),
    (np.float64(1153.997222222222), np.float64(1095.6055555555556)),
    (np.float64(1153.997222222222), np.float64(200.00555555555542))),
   'system': 'PixelSpace',
   'layout_width': 1700,
   'layout_height': 2200},
  'last_modified': '2025-12-05T17:35:32',
  'filetype': 'application/pdf',
  'languages': ['eng'],
  'page_number': 3,
  'image_base64': '/9j/4AAQSkZJRgABAQAAAQABAAD/2wBDAAgGBgcGBQgHBwcJCQgKDBQNDAsLDBkSEw8
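
As an alternative to pasting the Base64 string into an online converter, the payload can be decoded locally. The sketch below is an illustration under assumptions: `save_base64_image` is a hypothetical helper, and the file type is guessed from the leading bytes for JPEG and PNG only. The `/9j/` prefix in the output above is how a JPEG signature appears in Base64.

```python
import base64
import tempfile
from pathlib import Path

# Leading "magic" bytes for two common image formats.
_MAGIC = {b"\xff\xd8\xff": "jpg", b"\x89PNG\r\n\x1a\n": "png"}


def save_base64_image(image_base64: str, stem) -> Path:
    """Decode a Base64 string and write it to `<stem>.<ext>`.

    The extension is guessed from the decoded bytes; unknown formats
    are written with a generic `.bin` extension.
    """
    raw = base64.b64decode(image_base64)
    ext = next((e for magic, e in _MAGIC.items() if raw.startswith(magic)), "bin")
    path = Path(f"{stem}.{ext}")
    path.write_bytes(raw)
    return path


# Stand-in payload: a JPEG signature plus padding, not a real image.
# In the notebook, pass `images[0].metadata.image_base64` instead.
out_dir = Path(tempfile.gettempdir())
fake_jpeg = base64.b64encode(b"\xff\xd8\xff\xe0" + b"\x00" * 16).decode("ascii")
saved = save_base64_image(fake_jpeg, out_dir / "figure_1")
print(saved.name)
```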

In [57]:
# Extract all Table elements
table = [el for el in elements if el.category == "Table"]

print(f"There are {len(table)} table elements")

table[0].to_dict()

# https://jsfiddle.net/

There are 4 table elements


{'type': 'Table',
 'element_id': '86cad8924ae868aa0c78ffcb29747897',
 'text': 'Layer Type Complexity per Layer Sequential Maximum Path Length Operations Self-Attention O(n2 · d) O(1) O(1) Recurrent O(n · d2) O(n) O(n) Convolutional O(k · n · d2) O(1) O(logk(n)) Self-Attention (restricted) O(r · n · d) O(1) O(n/r)',
 'metadata': {'detection_class_prob': 0.928255021572113,
  'is_extracted': 'true',
  'coordinates': {'points': ((np.float64(320.3291931152344),
     np.float64(312.45477294921875)),
    (np.float64(320.3291931152344), np.float64(519.1640014648438)),
    (np.float64(1363.98291015625), np.float64(519.1640014648438)),
    (np.float64(1363.98291015625), np.float64(312.45477294921875))),
   'system': 'PixelSpace',
   'layout_width': 1700,
   'layout_height': 2200},
  'last_modified': '2025-12-05T17:35:32',
  'text_as_html': '<table><thead><tr><th>Layer Type</th><th>Complexity per Layer</th><th>Sequential Operations</th><th>Maximum Path Length</th></tr></thead><tbody><tr><td>Self-
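
The `text_as_html` value can also be inspected in code rather than in a browser sandbox. The following sketch converts such an HTML string into a list of rows with the standard library's `html.parser`. `TableRows` and `html_table_to_rows` are hypothetical names, the sample HTML is a shortened stand-in, and the parser assumes a simple table without nested tables or merged cells.

```python
from html.parser import HTMLParser


class TableRows(HTMLParser):
    """Collect the cell text of each <tr> into a list of rows."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._in_cell = False

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.rows.append([])
        elif tag in ("td", "th"):
            if not self.rows:  # tolerate a cell that appears before any <tr>
                self.rows.append([])
            self._in_cell = True
            self.rows[-1].append("")

    def handle_endtag(self, tag):
        if tag in ("td", "th"):
            self._in_cell = False

    def handle_data(self, data):
        if self._in_cell:
            self.rows[-1][-1] += data


def html_table_to_rows(table_html: str) -> list:
    """Return the table as a list of rows, each a list of cell strings."""
    parser = TableRows()
    parser.feed(table_html)
    parser.close()
    return parser.rows


# Shortened stand-in for a `text_as_html` value.
sample_html = (
    "<table><thead><tr><th>Layer Type</th><th>Sequential Operations</th></tr></thead>"
    "<tbody><tr><td>Self-Attention</td><td>O(1)</td></tr></tbody></table>"
)
for row in html_table_to_rows(sample_html):
    print(" | ".join(row))
```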

In [6]:
def create_chunks_by_title(elements):
    """
    Create intelligent chunks using a title-based strategy.
    """
    chunks = chunk_by_title(
        elements=elements,
        max_characters=3000,    # Hard limit - never exceed 3000 characters per chunk
        new_after_n_chars=2400, # Try to start a new chunk after 2400 characters
        combine_text_under_n_chars=500  # Merge tiny chunks under 500 chars with neighbors
    )

    print(f"Number of chunks is: {len(chunks)}")

    return chunks

chunks = create_chunks_by_title(elements)

Number of chunks is: 25
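
A quick way to sanity-check the chunking parameters is to look at the resulting chunk lengths. This is a minimal sketch with a hypothetical helper, `chunk_length_report`, run here on stand-in strings; the default of 3000 simply mirrors the `max_characters` value used above.

```python
def chunk_length_report(texts, max_characters=3000) -> dict:
    """Summarise chunk lengths and count chunks above the hard limit."""
    lengths = [len(t) for t in texts]
    return {
        "count": len(lengths),
        "min": min(lengths, default=0),
        "max": max(lengths, default=0),
        "over_limit": sum(1 for n in lengths if n > max_characters),
    }


# Stand-in texts; in the notebook, use `[c.text for c in chunks]` instead.
sample_texts = ["a" * 500, "b" * 2400, "c" * 3000]
print(chunk_length_report(sample_texts))
```

If `over_limit` is ever non-zero on real chunks, that would be worth investigating, since `max_characters` is described above as a hard limit.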


In [7]:
chunks[0]

<unstructured.documents.elements.CompositeElement at 0x2794c12d550>

In [8]:
chunks[2].to_dict()

{'type': 'CompositeElement',
 'element_id': '2763cd90-e766-4691-992c-22eb86955df6',
 'text': '1 Introduction\n\nRecurrent neural networks, long short-term memory [13] and gated recurrent [7] neural networks in particular, have been firmly established as state of the art approaches in sequence modeling and transduction problems such as language modeling and machine translation [35, 2, 5]. Numerous efforts have since continued to push the boundaries of recurrent language models and encoder-decoder architectures [38, 24, 15].\n\nRecurrent models typically factor computation along the symbol positions of the input and output sequences. Aligning the positions to steps in computation time, they generate a sequence of hidden states ht, as a function of the previous hidden state ht−1 and the input for position t. This inherently sequential nature precludes parallelization within training examples, which becomes critical at longer sequence lengths, as memory constraints limit batching across ex

In [9]:
chunks[4].to_dict()

{'type': 'CompositeElement',
 'element_id': '78f13176-d678-4424-9e49-1e2ef6916684',
 'text': '3 Model Architecture\n\nMost competitive neural sequence transduction models have an encoder-decoder structure [5, 2, 35]. Here, the encoder maps an input sequence of symbol representations (x1,...,xn) to a sequence of continuous representations z = (z1,...,zn). Given z, the decoder then generates an output sequence (y1,...,ym) of symbols one element at a time. At each step the model is auto-regressive [10], consuming the previously generated symbols as additional input when generating the next.\n\n2\n\nOutput Probabilities Add & Norm Feed Forward Add & Norm Multi-Head Attention a, Add & Norm Add & Norm Feed Forward Nx | -+CAgc8 Norm) Add & Norm Masked Multi-Head Multi-Head Attention Attention Se a, ee a, Positional Positional Encoding @ © @ Encoding Input Output Embedding Embedding Inputs Outputs (shifted right)\n\nFigure 1: The Transformer - model architecture.\n\nThe Transformer follows thi

In [10]:
# View the original elements inside a chunk:
chunks[4].metadata.orig_elements

[<unstructured.documents.elements.Title at 0x2794caac130>,
 <unstructured.documents.elements.NarrativeText at 0x2794caac2f0>,
 <unstructured.documents.elements.Footer at 0x2794e99af90>,
 <unstructured.documents.elements.Image at 0x2794c12d940>,
 <unstructured.documents.elements.FigureCaption at 0x2795cdacd70>,
 <unstructured.documents.elements.NarrativeText at 0x2795cd772a0>]

In [24]:
def seperate_content_types(chunk):
    """
    Analyses what types of content are in chunks

    Args:
        chunk (Chunk): A chunk of document content.

    Returns:
        dict: A dictionary containing the text, tables, images, and types of content.
    """

    content_data = {
        'text': chunk.text,
        'tables': [],
        'images': [],
        'types': ['text']
    }

    # Check the original elements for tables and images:
    if hasattr(chunk, 'metadata') and hasattr(chunk.metadata, 'orig_elements'):     # <=
        for element in chunk.metadata.orig_elements:
            element_type = type(element).__name__

            # Handle tables
            if element_type == "Table":
                content_data['types'].append('table')
                table_html = getattr(element.metadata, 'text_as_html', element.text)    # <=
                content_data['tables'].append(table_html)

            # Handle images
            elif element_type == "Image":
                if hasattr(element, 'metadata') and hasattr(element.metadata, 'image_base64'):
                    content_data['types'].append('image')
                    content_data['images'].append(element.metadata.image_base64)


    content_data['types'] = list(set(content_data['types']))

    return content_data


def create_ai_enhanced_summary(text: str, tables: List[str], images: List[str]):
    """
    Create an AI-enhanced summary for mixed content.
    
    Args:
        text (str): The textual content to analyze.
        tables (List[str]): List of HTML strings representing tables.
        images (List[str]): List of base64-encoded image strings.

    Returns:
        str: AI-generated enhanced summary.
    """

    try:
        llm = ChatGoogleGenerativeAI(model="gemini-2.5-flash")

        prompt_text = f"""You are creating a searchable description for document content retrieval.

        CONTENT TO ANALYZE:
        TEXT CONTENT:
        {text}

        """

        # Add tables if present
        if tables:
            prompt_text += "TABLES:\n"
            for i, table in enumerate(tables):
                prompt_text += f"Table {i+1}:\n{table}\n\n"

        # Append the task instructions once, whether or not tables are present
        prompt_text += """
        YOUR TASK:
        Generate a comprehensive, searchable description that covers:

        1. Key facts, numbers, and data points from text and tables
        2. Main topics and concepts discussed
        3. Questions this content could answer
        4. Visual content analysis (charts, diagrams, patterns in images)
        5. Alternative search terms users might use

        Make it detailed and searchable - prioritize findability over brevity.

        SEARCHABLE DESCRIPTION:"""

        # Build message content starting with the text:
        message_content = [{'type': 'text', 'text': prompt_text}]

        # Add images to the message
        for image_base64 in images:
            message_content.append({
                "type": "image_url",
                "image_url": {"url": f"data:image/jpeg;base64,{image_base64}"}
            })
        
        # Send to AI and get response
        message = HumanMessage(content=message_content)
        response = llm.invoke([message])
        
        return response.content


    except Exception as e:
        print(f"     ❌ AI summary failed: {e}")
        # Fallback to simple summary
        summary = f"{text[:300]}..."
        if tables:
            summary += f" [Contains {len(tables)} table(s)]"
        if images:
            summary += f" [Contains {len(images)} image(s)]"
        return summary


def summarise_chunks(chunks):
    """
    Process all chunks with AI Summaries
    
    Args:
        chunks (List[Chunk]): List of document chunks to process.

    Returns:
        List[Document]: List of LangChain Documents with enhanced summaries.
    """
    print("🧠 Processing chunks with AI Summaries...")
    
    langchain_documents = []
    total_chunks = len(chunks)
    
    for i, chunk in enumerate(chunks):
        current_chunk = i + 1
        print(f"   Processing chunk {current_chunk}/{total_chunks}")
        
        # Analyze chunk content
        content_data = seperate_content_types(chunk)
        
        # Debug prints
        print(f"     Types found: {content_data['types']}")
        print(f"     Tables: {len(content_data['tables'])}, Images: {len(content_data['images'])}")
        
        # Create AI-enhanced summary if chunk has tables/images
        if content_data['tables'] or content_data['images']:
            print(f"     → Creating AI summary for mixed content...")
            try:
                enhanced_content = create_ai_enhanced_summary(
                    content_data['text'],
                    content_data['tables'], 
                    content_data['images']
                )
                print(f"     → AI summary created successfully")
                print(f"     → Enhanced content preview: {enhanced_content[:200]}...")
            except Exception as e:
                print(f"     ❌ AI summary failed: {e}")
                enhanced_content = content_data['text']
        else:
            print(f"     → Using raw text (no tables/images)")
            enhanced_content = content_data['text']
        
        # Create LangChain Document with rich metadata
        doc = Document(
            page_content=enhanced_content,
            metadata={
                "original_content": json.dumps({
                    "raw_text": content_data['text'],
                    "tables_html": content_data['tables'],
                    "images_base64": content_data['images']
                })
            }
        )
        
        langchain_documents.append(doc)
    
    print(f"✅ Processed {len(langchain_documents)} chunks")
    return langchain_documents

process_chunks = summarise_chunks(chunks)

🧠 Processing chunks with AI Summaries...
   Processing chunk 1/25
     Types found: ['text']
     Tables: 0, Images: 0
     → Using raw text (no tables/images)
   Processing chunk 2/25
     Types found: ['text']
     Tables: 0, Images: 0
     → Using raw text (no tables/images)
   Processing chunk 3/25
     Types found: ['text']
     Tables: 0, Images: 0
     → Using raw text (no tables/images)
   Processing chunk 4/25
     Types found: ['text']
     Tables: 0, Images: 0
     → Using raw text (no tables/images)
   Processing chunk 5/25
     Types found: ['image', 'text']
     Tables: 0, Images: 1
     → Creating AI summary for mixed content...
     → AI summary created successfully
     → Enhanced content preview: **Searchable Description:**

Diagram illustrating the **Transformer model architecture**, a core **encoder-decoder neural network** designed for **sequence transduction** tasks, commonly used in **Nat...
   Processing chunk 6/25
     Types found: ['text']
     Tables: 0, Imag

In [25]:
process_chunks[4]

Document(metadata={'original_content': '{"raw_text": "3 Model Architecture\\n\\nMost competitive neural sequence transduction models have an encoder-decoder structure [5, 2, 35]. Here, the encoder maps an input sequence of symbol representations (x1,...,xn) to a sequence of continuous representations z = (z1,...,zn). Given z, the decoder then generates an output sequence (y1,...,ym) of symbols one element at a time. At each step the model is auto-regressive [10], consuming the previously generated symbols as additional input when generating the next.\\n\\n2\\n\\nOutput Probabilities Add & Norm Feed Forward Add & Norm Multi-Head Attention a, Add & Norm Add & Norm Feed Forward Nx | -+CAgc8 Norm) Add & Norm Masked Multi-Head Multi-Head Attention Attention Se a, ee a, Positional Positional Encoding @ \\u00a9 @ Encoding Input Output Embedding Embedding Inputs Outputs (shifted right)\\n\\nFigure 1: The Transformer - model architecture.\\n\\nThe Transformer follows this overall architecture u
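
Because `original_content` is stored as a JSON string, the raw `Document` repr is hard to read. The sketch below shows one way to summarise it; `describe_original_content` is a hypothetical helper, and the sample metadata is a stand-in that follows the key names used in `summarise_chunks`.

```python
import json


def describe_original_content(metadata: dict) -> dict:
    """Decode the JSON string in `original_content` and summarise it."""
    original = json.loads(metadata.get("original_content", "{}"))
    return {
        "raw_text_chars": len(original.get("raw_text", "")),
        "tables": len(original.get("tables_html", [])),
        "images": len(original.get("images_base64", [])),
    }


# Stand-in metadata; in the notebook, pass `process_chunks[4].metadata`.
sample_metadata = {
    "original_content": json.dumps(
        {
            "raw_text": "3 Model Architecture",
            "tables_html": [],
            "images_base64": ["<base64 omitted>"],
        }
    )
}
print(describe_original_content(sample_metadata))
```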

In [26]:
def export_chunks_to_json(chunks, filename="chunks_export.json"):
    """Export processed chunks to clean JSON format"""
    export_data = []
    
    for i, doc in enumerate(chunks):
        chunk_data = {
            "chunk_id": i + 1,
            "enhanced_content": doc.page_content,
            "metadata": {
                "original_content": json.loads(doc.metadata.get("original_content", "{}"))
            }
        }
        export_data.append(chunk_data)
    
    # Save to file
    with open(filename, 'w', encoding='utf-8') as f:
        json.dump(export_data, f, indent=2, ensure_ascii=False)
    
    print(f"✅ Exported {len(export_data)} chunks to {filename}")
    return export_data

# Export your chunks
json_data = export_chunks_to_json(process_chunks)

✅ Exported 25 chunks to chunks_export.json


In [28]:
def create_vector_store(documents, persist_directory = "db2/chroma_db"):
    """
    Create a Chroma vector store from the provided documents.

    Args:
        documents (List[Document]): List of LangChain Documents to store.
        persist_directory (str): Directory to persist the Chroma database.
    """

    embedding_model = HuggingFaceEmbeddings(model_name="sentence-transformers/all-MiniLM-L6-v2")

    vector_store = Chroma.from_documents(
        documents=documents,
        embedding=embedding_model,
        persist_directory=persist_directory,
        collection_metadata={"hnsw:space": "cosine"},
    )

    print("Finished creating vector")

    print(f"✅ Vector store persisted at: {persist_directory}")
    return vector_store

db = create_vector_store(documents=process_chunks)

Finished creating vector
✅ Vector store persisted at: db2/chroma_db
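
The `"hnsw:space": "cosine"` setting asks the index to compare embeddings by angle rather than by magnitude. As a worked illustration (plain Python, not Chroma's implementation), the sketch below computes cosine similarity and the corresponding distance, assuming the usual convention that cosine distance is one minus cosine similarity.

```python
import math


def cosine_similarity(a, b) -> float:
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for zero vectors")
    return dot / (norm_a * norm_b)


def cosine_distance(a, b) -> float:
    """Assumed convention: distance = 1 - similarity."""
    return 1.0 - cosine_similarity(a, b)


print(cosine_distance([1.0, 0.0], [2.0, 0.0]))  # same direction, different length
print(cosine_distance([1.0, 0.0], [0.0, 3.0]))  # orthogonal vectors
```

Vectors that point in the same direction have a distance of zero regardless of their length, which is why this metric is a common choice for sentence embeddings.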


In [29]:
query = "How many layers does the base Transformer model use in both encoder and decoder?"
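# Note: the keyword must be spelled `search_kwargs`. A misspelled keyword is not applied,
# so the retriever falls back to its default number of results, which is why the
# recorded output below reports 4 chunks.
retriever = db.as_retriever(search_kwargs={"k": 3})
relevant_chunks = retriever.invoke(query)

# Export to JSON
export_chunks_to_json(relevant_chunks, "rag_results.json")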

✅ Exported 4 chunks to rag_results.json


[{'chunk_id': 1,
  'enhanced_content': "**Searchable Description:**\n\nDiagram illustrating the **Transformer model architecture**, a core **encoder-decoder neural network** designed for **sequence transduction** tasks, commonly used in **Natural Language Processing (NLP)**.\n\nThe architecture details:\n*   **Input Stage:** **Inputs** are transformed by **Input Embedding** and combined with **Positional Encoding**.\n*   **Encoder Stack (Nx layers):** Each encoder layer contains a **Multi-Head Attention** mechanism followed by an **Add & Norm** layer, and then a **Feed Forward Network** also followed by an **Add & Norm** layer. This stack maps input sequences to continuous representations.\n*   **Output Stage:** **Outputs (shifted right)** are processed by **Output Embedding** and combined with **Positional Encoding**.\n*   **Decoder Stack (Nx layers):** Each decoder layer includes a **Masked Multi-Head Attention** (for auto-regressive generation, consuming previously generated symbols

In [30]:
def run_complete_ingestion_pipeline(pdf_path: str):
    """Run the complete RAG ingestion pipeline."""

    elements = partition_documents(file_path=pdf_path)

    chunks = create_chunks_by_title(elements)

    summarised_chunks = summarise_chunks(chunks)

    db = create_vector_store(summarised_chunks, persist_directory = "db/chroma_db")

    print("Pipeline completed successfully!")
    return db

In [31]:
db = run_complete_ingestion_pipeline(pdf_path="./docs/Attention_is_all_you_need.pdf")

Partitioning document: ./docs/Attention_is_all_you_need.pdf
✅ Extracted 220 elements
Number of chunks is: 25
🧠 Processing chunks with AI Summaries...
   Processing chunk 1/25
     Types found: ['text']
     Tables: 0, Images: 0
     → Using raw text (no tables/images)
   Processing chunk 2/25
     Types found: ['text']
     Tables: 0, Images: 0
     → Using raw text (no tables/images)
   Processing chunk 3/25
     Types found: ['text']
     Tables: 0, Images: 0
     → Using raw text (no tables/images)
   Processing chunk 4/25
     Types found: ['text']
     Tables: 0, Images: 0
     → Using raw text (no tables/images)
   Processing chunk 5/25
     Types found: ['image', 'text']
     Tables: 0, Images: 1
     → Creating AI summary for mixed content...
     ❌ AI summary failed: Error calling model 'gemini-2.5-flash' (RESOURCE_EXHAUSTED): 429 RESOURCE_EXHAUSTED. {'error': {'code': 429, 'message': 'You exceeded your current quota, please check your plan and billing details. For more inform

In [36]:
# Query the vector store
query = "According to Table 1, what are the main advantages of self-attention layers compared to recurrent and convolutional layers in terms of computational complexity and parallelization?"

retriever = db.as_retriever(search_kwargs={"k": 3})
chunks = retriever.invoke(query)

def generate_final_answer(chunks, query):
    """Generate final answer using multimodal content"""
    
    try:
        # Initialize LLM (needs vision model for images)
        llm = ChatGoogleGenerativeAI(model="gemini-2.5-flash-lite", temperature=0)
        
        # Build the text prompt
        prompt_text = f"""Based on the following documents, please answer this question: {query}

CONTENT TO ANALYZE:
"""
        
        for i, chunk in enumerate(chunks):
            prompt_text += f"--- Document {i+1} ---\n"
            
            if "original_content" in chunk.metadata:
                original_data = json.loads(chunk.metadata["original_content"])
                
                # Add raw text
                raw_text = original_data.get("raw_text", "")
                if raw_text:
                    prompt_text += f"TEXT:\n{raw_text}\n\n"
                
                # Add tables as HTML
                tables_html = original_data.get("tables_html", [])
                if tables_html:
                    prompt_text += "TABLES:\n"
                    for j, table in enumerate(tables_html):
                        prompt_text += f"Table {j+1}:\n{table}\n\n"
            
            prompt_text += "\n"
        
        prompt_text += """
Please provide a clear, comprehensive answer using the text, tables, and images above. If the documents don't contain sufficient information to answer the question, say "I don't have enough information to answer that question based on the provided documents."

ANSWER:"""

        # Build message content starting with text
        message_content = [{"type": "text", "text": prompt_text}]
        
        # Add all images from all chunks
        for chunk in chunks:
            if "original_content" in chunk.metadata:
                original_data = json.loads(chunk.metadata["original_content"])
                images_base64 = original_data.get("images_base64", [])
                
                for image_base64 in images_base64:
                    message_content.append({
                        "type": "image_url",
                        "image_url": {"url": f"data:image/jpeg;base64,{image_base64}"}
                    })
        
        # Send to AI and get response
        message = HumanMessage(content=message_content)
        response = llm.invoke([message])
        
        return response.content
        
    except Exception as e:
        print(f"❌ Answer generation failed: {e}")
        return "Sorry, I encountered an error while generating the answer."

# Usage
final_answer = generate_final_answer(chunks, query)
print(final_answer)

According to Table 1 (as referenced in the text), the main advantages of self-attention layers compared to recurrent and convolutional layers in terms of computational complexity and parallelization are:

*   **Parallelization:** Self-attention layers connect all positions with a **constant number of sequentially executed operations**. In contrast, recurrent layers require **O(n) sequential operations**. This means self-attention layers are significantly more parallelizable.

*   **Computational Complexity:** Self-attention layers are **faster than recurrent layers when the sequence length (n) is smaller than the representation dimensionality (d)**. This condition is frequently met in modern machine translation models. While the document mentions convolutional layers are generally more expensive than recurrent layers, it notes that separable convolutions can significantly reduce complexity, and even with a large kernel (k=n), the complexity of a separable convolution is comparable to a