System Dependencies

To get started with Unstructured.io, we need a few system-wide dependencies:

Poppler (poppler-utils)

A PDF rendering library that can extract text, images, and metadata from PDFs. Unstructured uses it to parse PDF documents and convert them into processable text.

Tesseract (tesseract-ocr)

Optical Character Recognition (OCR) engine. When you have scanned documents, images with text, or PDFs that are essentially pictures, Tesseract reads the text from these images and converts it to machine-readable text.

libmagic

File type detection library. It identifies what type of file you're dealing with (PDF, Word doc, image, etc.) by analyzing the file's content, not just the extension. This helps Unstructured choose the right processing method for each document.
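Before installing, it can help to see which of these are already present. The following is a minimal sketch using only the standard library. It assumes the usual executable names (`pdftoppm`, which ships with Poppler, and `tesseract`), and the helper name `check_system_dependencies` is made up for this notebook:

```python
import shutil
from ctypes.util import find_library


def check_system_dependencies() -> dict:
    """Report whether each system dependency appears to be installed."""
    return {
        # Poppler ships command-line tools such as pdftoppm
        "poppler": shutil.which("pdftoppm") is not None,
        # Tesseract installs a `tesseract` executable
        "tesseract": shutil.which("tesseract") is not None,
        # libmagic is a shared library, so look it up by name instead
        "libmagic": find_library("magic") is not None,
    }


status = check_system_dependencies()
for name, found in status.items():
    print(f"{name}: {'found' if found else 'missing'}")
```

A "missing" result only means the lookup failed on the current PATH or library search path; the package may still be installed somewhere non-standard.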

In [1]:
!brew install poppler tesseract libmagic

[34m==>[0m [1mAuto-updating Homebrew...[0m
Adjust how often this is run with `$HOMEBREW_AUTO_UPDATE_SECS` or disable with
`$HOMEBREW_NO_AUTO_UPDATE=1`. Hide these hints with `$HOMEBREW_NO_ENV_HINTS=1` (see `man brew`).
[34m==>[0m [1mDownloading https://ghcr.io/v2/homebrew/core/portable-ruby/blobs/sha256:80c194381e990a4967a1ae44b8242b688e6a17ab590865a38671137677411469[0m
######################################################################### 100.0%
[34m==>[0m [1mPouring portable-ruby-3.4.8.catalina.bottle.tar.gz[0m
[32m==>[0m [1mFetching downloads for: [32mpoppler[39m, [32mtesseract[39m and [32mlibmagic[39m[0m
[?25l[K[34m⠋[0m Bottle Manifest poppler (26.01.0)
[K[34m⠋[0m Bottle Manifest tesseract (5.5.2)
[K[34m⠋[0m Bottle Manifest libmagic (5.46)[2F[K[34m⠋[0m Bottle Manifest poppler (26.01.0)
[K[34m⠋[0m Bottle Manifest tesseract (5.5.2)
[K[34m⠋[0m Bottle Manifest libmagic (5.46)[2F[K[34m⠙[0m Bottle Manifest poppler (26.01.0)
[K[34m⠙[0m B

In [1]:
%pip install -Uq "unstructured[all-docs]" 
%pip install -Uq langchain_chroma 
%pip install -Uq langchain langchain-community langchain-openai 
%pip install -Uq python_dotenv

Note: you may need to restart the kernel to use updated packages.


In [3]:
import json
from typing import List

# Unstructured for document parsing
from unstructured.partition.pdf import partition_pdf
from unstructured.chunking.title import chunk_by_title

# LangChain components
from langchain_core.documents import Document
from langchain_openai import ChatOpenAI, OpenAIEmbeddings
from langchain_chroma import Chroma
from langchain_core.messages import HumanMessage
from dotenv import load_dotenv

load_dotenv()

True

In [5]:
def partition_document(file_path:str):
    print(f"partitioning document:{file_path}")
    elements = partition_pdf(
        filename=file_path,  # Path to the PDF file
        strategy="hi_res",  # Use the most accurate (high-resolution) strategy
        infer_table_structure=True,  # Keep tables as structured HTML, not just flat text
        extract_image_block_types=["Image"],  # Extract images found in the PDF
        extract_image_block_to_payload=True  # Store each image as base64 data in the element metadata
    )

    print(f" Extracted {len(elements)} elements")
    return elements


#test partitioning
file_path = "./docs/attention-is-all-you-need.pdf"
elements = partition_document(file_path)

partitioning document:./docs/attention-is-all-you-need.pdf
 Extracted 186 elements


In [6]:

set([str(type(el)) for el in elements])

{"<class 'unstructured.documents.elements.FigureCaption'>",
 "<class 'unstructured.documents.elements.Footer'>",
 "<class 'unstructured.documents.elements.Formula'>",
 "<class 'unstructured.documents.elements.Header'>",
 "<class 'unstructured.documents.elements.Image'>",
 "<class 'unstructured.documents.elements.ListItem'>",
 "<class 'unstructured.documents.elements.NarrativeText'>",
 "<class 'unstructured.documents.elements.Table'>",
 "<class 'unstructured.documents.elements.Text'>",
 "<class 'unstructured.documents.elements.Title'>"}

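Beyond listing which element types occur, it is often useful to know how many of each there are. A small sketch with `collections.Counter`, keyed on the same `category` attribute used in the cells that follow. The `sample_elements` list is a stand-in so the snippet runs on its own; it is not real output from the PDF:

```python
from collections import Counter
from types import SimpleNamespace

# Stand-in objects so the snippet runs on its own; in the notebook,
# use the `elements` list returned by partition_document() instead.
sample_elements = [
    SimpleNamespace(category="Title"),
    SimpleNamespace(category="NarrativeText"),
    SimpleNamespace(category="NarrativeText"),
    SimpleNamespace(category="Table"),
]

# Count how many elements fall into each category, most frequent first
category_counts = Counter(el.category for el in sample_elements)
for category, count in category_counts.most_common():
    print(f"{category}: {count}")
```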
In [None]:
images = [element for element in elements if element.category == "Image"]
print(f"Found {len(images)} images")
images[0].to_dict()
# Use https://codebeautify.org/base64-to-image-converter to view the base64 text

Found 6 images


{'type': 'Image',
 'element_id': 'dc2d6abeac67f068e466418609467a2a',
 'text': 'Probabilities Add & Norm Feed Forward Add & Norm Multi-Head Attention a, Add & Norm Add & Norm Feed Forward Nx | Cag Norm) Add & Norm Masked Multi-Head Multi-Head Attention Attention Se a, Lt Positional Positional Encoding @ © OY Encoding Input Output Embedding Embedding Inputs Outputs (shifted right)',
 'metadata': {'detection_class_prob': 0.8299219608306885,
  'coordinates': {'points': ((np.float64(531.2772216796875),
     np.float64(230.8849334716797)),
    (np.float64(531.2772216796875), np.float64(1091.02734375)),
    (np.float64(1149.029541015625), np.float64(1091.02734375)),
    (np.float64(1149.029541015625), np.float64(230.8849334716797))),
   'system': 'PixelSpace',
   'layout_width': 1700,
   'layout_height': 2200},
  'last_modified': '2026-01-14T15:37:16',
  'filetype': 'application/pdf',
  'languages': ['eng'],
  'page_number': 3,
  'image_base64': '/9j/4AAQSkZJRgABAQAAAQABAAD/2wBDAAgGBgcGBQgHBw
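As an alternative to pasting the base64 text into an online converter, the payload can be decoded locally with the standard library. This is a sketch only: `save_base64_image` is a hypothetical helper, and `sample_payload` is just the first four bytes of a JPEG header, used so the snippet runs without the PDF:

```python
import base64
import tempfile
from pathlib import Path


def save_base64_image(image_base64: str, output_path: Path) -> Path:
    """Decode a base64 string and write the raw bytes to output_path."""
    output_path.write_bytes(base64.b64decode(image_base64))
    return output_path


# Stand-in payload: the first bytes of a JPEG header, not a viewable image.
# In the notebook, pass images[0].metadata.image_base64 instead.
sample_payload = base64.b64encode(b"\xff\xd8\xff\xe0").decode("ascii")
saved = save_base64_image(sample_payload, Path(tempfile.gettempdir()) / "element_image.jpg")
print(f"Wrote {saved.stat().st_size} bytes to {saved}")
```

With a real payload, the resulting file can be opened in any image viewer.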

In [None]:
tables = [element for element in elements if element.category == "Table"]
print(f"Found {len(tables)} tables")
tables[0].to_dict()
# Use https://jsfiddle.net/ to view the table html 

Found 4 tables


{'type': 'Table',
 'element_id': 'b58d62ba961c75e028eb51dbab3a5c49',
 'text': 'Layer Type Complexity per Layer Sequential Maximum Path Length Operations Self-Attention O(n? - d) O(1) O(1) Recurrent O(n - d?) O(n) O(n) Convolutional O(k-n-d?) O(1) O(logx(n)) Self-Attention (restricted) O(r-n-d) ol) O(n/r)',
 'metadata': {'detection_class_prob': 0.9282549619674683,
  'coordinates': {'points': ((np.float64(320.3291931152344),
     np.float64(312.45477294921875)),
    (np.float64(320.3291931152344), np.float64(519.1640014648438)),
    (np.float64(1363.98291015625), np.float64(519.1640014648438)),
    (np.float64(1363.98291015625), np.float64(312.45477294921875))),
   'system': 'PixelSpace',
   'layout_width': 1700,
   'layout_height': 2200},
  'last_modified': '2026-01-14T15:37:16',
  'text_as_html': '<table><thead><tr><th>Layer Type</th><th>Complexity per Layer</th><th>Sequential Operations</th><th>Maximum Path Length</th></tr></thead><tbody><tr><td>Self-Attention</td><td>O(n? - d)</td><t
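The `text_as_html` field can also be inspected without a browser. The sketch below uses the standard library's `html.parser` to turn a table into a list of rows. `TableRowParser` is a hypothetical helper, and `sample_html` is a shortened stand-in built from two columns of the table above, not the full extracted HTML:

```python
from html.parser import HTMLParser


class TableRowParser(HTMLParser):
    """Collect the cell text of every <tr> into a list of rows."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None   # cells of the row currently being read
        self._cell = None  # text fragments of the cell currently being read

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("th", "td"):
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("th", "td") and self._cell is not None:
            if self._row is not None:
                self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


# Shortened stand-in for tables[0].metadata.text_as_html
sample_html = (
    "<table><thead><tr><th>Layer Type</th><th>Sequential Operations</th></tr></thead>"
    "<tbody><tr><td>Self-Attention</td><td>O(1)</td></tr>"
    "<tr><td>Recurrent</td><td>O(n)</td></tr></tbody></table>"
)

parser = TableRowParser()
parser.feed(sample_html)
for row in parser.rows:
    print(" | ".join(row))
```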

In [16]:
def create_chunk_by_title(elements):

    print("creating chunks")

    chunks = chunk_by_title(
        elements,      # the parsed pdf elements
        max_characters=3000,     # hard limit: never exceed 3000 characters
        new_after_n_chars=2400,  # soft limit: try to start a new chunk after 2400 characters
        combine_text_under_n_chars=500  # merge tiny chunks under 500 characters with their neighbors
    )

    print(f" created {len(chunks)} chunks")
    return chunks

# Testing
chunks = create_chunk_by_title(elements)


creating chunks
 created 25 chunks
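A quick sanity check after chunking is to look at the chunk lengths and confirm none exceeds the hard limit. A minimal sketch; `sample_chunks` holds stand-in objects with made-up lengths so the snippet runs on its own:

```python
from types import SimpleNamespace

MAX_CHARACTERS = 3000  # same hard limit passed to chunk_by_title above

# Stand-in chunks so the snippet runs on its own; in the notebook,
# use the `chunks` list returned by create_chunk_by_title() instead.
sample_chunks = [
    SimpleNamespace(text="a" * 2400),
    SimpleNamespace(text="b" * 650),
    SimpleNamespace(text="c" * 2950),
]

lengths = [len(chunk.text) for chunk in sample_chunks]
oversized = [n for n in lengths if n > MAX_CHARACTERS]

print(f"min={min(lengths)}, max={max(lengths)}, mean={sum(lengths) / len(lengths):.0f}")
print(f"chunks over the hard limit: {len(oversized)}")
```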


In [17]:
def separate_content_types(chunk):
    """Analyze what types of content are in a chunk"""
    content_data = {
        'text': chunk.text,
        'tables': [],
        'images': [],
        'types': ['text']
    }
    
    # Check for tables and images in original elements
    if hasattr(chunk, 'metadata') and hasattr(chunk.metadata, 'orig_elements'):
        for element in chunk.metadata.orig_elements:
            element_type = type(element).__name__
            
            # Handle tables
            if element_type == 'Table':
                content_data['types'].append('table')
                table_html = getattr(element.metadata, 'text_as_html', element.text)
                content_data['tables'].append(table_html)
            
            # Handle images
            elif element_type == 'Image':
                if hasattr(element, 'metadata') and hasattr(element.metadata, 'image_base64'):
                    content_data['types'].append('image')
                    content_data['images'].append(element.metadata.image_base64)
    
    content_data['types'] = list(set(content_data['types']))
    return content_data

def create_ai_enhanced_summary(text: str, tables: List[str], images: List[str]) -> str:
    """Create AI-enhanced summary for mixed content"""
    
    try:
        # Initialize LLM (needs vision model for images)
        llm = ChatOpenAI(model="gpt-4o", temperature=0)
        
        # Build the text prompt
        prompt_text = f"""You are creating a searchable description for document content retrieval.

        CONTENT TO ANALYZE:
        TEXT CONTENT:
        {text}

        """
        
        # Add tables if present
        if tables:
            prompt_text += "TABLES:\n"
            for i, table in enumerate(tables):
                prompt_text += f"Table {i+1}:\n{table}\n\n"

        # Append the task instructions exactly once, whether or not tables exist
        # (previously this sat inside the table loop, so image-only chunks got no task)
        prompt_text += """
        YOUR TASK:
        Generate a comprehensive, searchable description that covers:

        1. Key facts, numbers, and data points from text and tables
        2. Main topics and concepts discussed
        3. Questions this content could answer
        4. Visual content analysis (charts, diagrams, patterns in images)
        5. Alternative search terms users might use

        Make it detailed and searchable - prioritize findability over brevity.

        SEARCHABLE DESCRIPTION:"""

        # Build message content starting with text
        message_content = [{"type": "text", "text": prompt_text}]
        
        # Add images to the message
        for image_base64 in images:
            message_content.append({
                "type": "image_url",
                "image_url": {"url": f"data:image/jpeg;base64,{image_base64}"}
            })
        
        # Send to AI and get response
        message = HumanMessage(content=message_content)
        response = llm.invoke([message])
        
        return response.content
        
    except Exception as e:
        print(f"      AI summary failed: {e}")
        # Fallback to simple summary
        summary = f"{text[:300]}..."
        if tables:
            summary += f" [Contains {len(tables)} table(s)]"
        if images:
            summary += f" [Contains {len(images)} image(s)]"
        return summary

def summarise_chunks(chunks):
    """Process all chunks with AI Summaries"""
    print(" Processing chunks with AI Summaries...")
    
    langchain_documents = []
    total_chunks = len(chunks)
    
    for i, chunk in enumerate(chunks):
        current_chunk = i + 1
        print(f"   Processing chunk {current_chunk}/{total_chunks}")
        
        # Analyze chunk content
        content_data = separate_content_types(chunk)
        
        # Debug prints
        print(f"     Types found: {content_data['types']}")
        print(f"     Tables: {len(content_data['tables'])}, Images: {len(content_data['images'])}")
        
        # Create AI-enhanced summary if chunk has tables/images
        if content_data['tables'] or content_data['images']:
            print(f"     → Creating AI summary for mixed content...")
            try:
                enhanced_content = create_ai_enhanced_summary(
                    content_data['text'],
                    content_data['tables'], 
                    content_data['images']
                )
                print(f"     → AI summary created successfully")
                print(f"     → Enhanced content preview: {enhanced_content[:200]}...")
            except Exception as e:
                print(f"      AI summary failed: {e}")
                enhanced_content = content_data['text']
        else:
            print(f"     → Using raw text (no tables/images)")
            enhanced_content = content_data['text']
        
        # Create LangChain Document with rich metadata
        doc = Document(
            page_content=enhanced_content,
            metadata={
                "original_content": json.dumps({
                    "raw_text": content_data['text'],
                    "tables_html": content_data['tables'],
                    "images_base64": content_data['images']
                })
            }
        )
        
        langchain_documents.append(doc)
    
    print(f"Processed {len(langchain_documents)} chunks")
    return langchain_documents


# Process chunks with AI
processed_chunks = summarise_chunks(chunks)

 Processing chunks with AI Summaries...
   Processing chunk 1/25
     Types found: ['text']
     Tables: 0, Images: 0
     → Using raw text (no tables/images)
   Processing chunk 2/25
     Types found: ['text']
     Tables: 0, Images: 0
     → Using raw text (no tables/images)
   Processing chunk 3/25
     Types found: ['text']
     Tables: 0, Images: 0
     → Using raw text (no tables/images)
   Processing chunk 4/25
     Types found: ['text']
     Tables: 0, Images: 0
     → Using raw text (no tables/images)
   Processing chunk 5/25
     Types found: ['text', 'image']
     Tables: 0, Images: 1
     → Creating AI summary for mixed content...
     → AI summary created successfully
     → Enhanced content preview: **Searchable Description:**

The document discusses the architecture of neural sequence transduction models, focusing on the encoder-decoder structure. It highlights the Transformer model, which uses ...
   Processing chunk 6/25
     Types found: ['text']
     Tables: 0, Image

In [34]:
def create_vector_store(documents,persist_directory="db/chroma_db"):
    print(f" creating embeddings and storing in chroma db")

    embedding_model=OpenAIEmbeddings(model='text-embedding-3-small')

    print("creating vector store")
    vectorstore=Chroma.from_documents(
        documents=documents,
        embedding=embedding_model,
        persist_directory=persist_directory,
        collection_metadata={"hnsw:space": "cosine"}  # use cosine distance for the HNSW index
    )

    print("vector store created")

    return vectorstore
db = create_vector_store(processed_chunks)

 creating embeddings and storing in chroma db
creating vector store
vector store created
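The `hnsw:space` value of `cosine` means results are ranked by cosine distance, which is one minus the cosine similarity of two embedding vectors. The sketch below computes it by hand on toy 2-dimensional vectors to show the idea; it is an illustration, not Chroma's internal implementation:

```python
import math


def cosine_distance(a, b):
    """Return 1 - cosine similarity for two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


query_vec = [1.0, 0.0]
print(cosine_distance(query_vec, [2.0, 0.0]))  # same direction, different length
print(cosine_distance(query_vec, [0.0, 3.0]))  # orthogonal vectors
```

Because the vectors are normalised by their lengths, only direction matters: a vector pointing the same way as the query has distance 0 regardless of its magnitude.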


In [25]:
def export_chunks_to_json(chunks, filename="chunks_export.json"):
    """Export processed chunks to clean JSON format"""
    export_data = []
    
    for i, doc in enumerate(chunks):
        chunk_data = {
            "chunk_id": i + 1,
            "enhanced_content": doc.page_content,
            "metadata": {
                "original_content": json.loads(doc.metadata.get("original_content", "{}"))
            }
        }
        export_data.append(chunk_data)
    
    # Save to file
    with open(filename, 'w', encoding='utf-8') as f:
        json.dump(export_data, f, indent=2, ensure_ascii=False)
    
    print(f" Exported {len(export_data)} chunks to {filename}")

    return export_data

# Export your chunks
json_data = export_chunks_to_json(processed_chunks)


 Exported 25 chunks to chunks_export.json
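Note that `original_content` is stored as a JSON string rather than a nested dict. Vector store metadata is typically limited to flat scalar values (an assumption about Chroma worth checking against its documentation), so the nested structure is serialised on the way in and parsed on the way out. A minimal round-trip sketch with stand-in values:

```python
import json

# Stand-in values; in the notebook these come from separate_content_types()
original = {
    "raw_text": "4 Why Self-Attention",
    "tables_html": ["<table><tr><td>O(1)</td></tr></table>"],
    "images_base64": [],
}

# Serialise to a single string so it can sit in a flat metadata field
metadata = {"original_content": json.dumps(original)}

# Later, e.g. after retrieval, parse it back into Python objects
restored = json.loads(metadata.get("original_content", "{}"))
print(restored["tables_html"][0])
```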


In [27]:
query = "According to Table 1, what are the main advantages of self-attention layers compared to recurrent and convolutional layers in terms of computational complexity and parallelization?"
retriever = db.as_retriever(search_kwargs={"k":5})
chunks = retriever.invoke(query)
export_chunks_to_json(chunks,"rag_results.json")

 Exported 5 chunks to rag_results.json


[{'chunk_id': 1,
  'enhanced_content': '4 Why Self-Attention\n\nIn this section we compare various aspects of self-attention layers to the recurrent and convolu- tional layers commonly used for mapping one variable-length sequence of symbol representations (x1,...,%n) to another sequence of equal length (21,...,2n), with x;, 2; € R%, such as a hidden layer in a typical sequence transduction encoder or decoder. Motivating our use of self-attention we consider three desiderata.\n\nOne is the total computational complexity per layer. Another is the amount of computation that can be parallelized, as measured by the minimum number of sequential operations required.\n\nThe third is the path length between long-range dependencies in the network. Learning long-range dependencies is a key challenge in many sequence transduction tasks. One key factor affecting the ability to learn such dependencies is the length of the paths forward and backward signals have to traverse in the network. The short

In [30]:

def run_complete_ingestion_pipeline(pdf_path: str):
    """Run the complete RAG ingestion pipeline"""
    print("Starting RAG Ingestion Pipeline")
    print("=" * 50)
    
    # Step 1: Partition
    elements = partition_document(pdf_path)
    
    # Step 2: Chunk
    chunks = create_chunk_by_title(elements)
    
    # Step 3: AI Summarisation
    summarised_chunks = summarise_chunks(chunks)
    
    # Step 4: Vector Store
    db = create_vector_store(summarised_chunks, persist_directory="dbv2/chroma_db")
    
    print(" Pipeline completed successfully!")
    return db

In [35]:
db = run_complete_ingestion_pipeline("./docs/attention-is-all-you-need.pdf")

Starting RAG Ingestion Pipeline
partitioning document:./docs/attention-is-all-you-need.pdf
 Extracted 186 elements
creating chunks
 created 25 chunks
 Processing chunks with AI Summaries...
   Processing chunk 1/25
     Types found: ['text']
     Tables: 0, Images: 0
     → Using raw text (no tables/images)
   Processing chunk 2/25
     Types found: ['text']
     Tables: 0, Images: 0
     → Using raw text (no tables/images)
   Processing chunk 3/25
     Types found: ['text']
     Tables: 0, Images: 0
     → Using raw text (no tables/images)
   Processing chunk 4/25
     Types found: ['text']
     Tables: 0, Images: 0
     → Using raw text (no tables/images)
   Processing chunk 5/25
     Types found: ['text', 'image']
     Tables: 0, Images: 1
     → Creating AI summary for mixed content...
     → AI summary created successfully
     → Enhanced content preview: **Searchable Description:**

This document section describes the architecture of neural sequence transduction models, focusing 

In [36]:
# Query the vector store
query = "How many attention heads does the Transformer use, and what is the dimension of each head? "

retriever = db.as_retriever(search_kwargs={"k": 3})
chunks = retriever.invoke(query)

def generate_final_answer(chunks, query):
    """Generate final answer using multimodal content"""
    
    try:
        # Initialize LLM (needs vision model for images)
        llm = ChatOpenAI(model="gpt-4o", temperature=0)
        
        # Build the text prompt
        prompt_text = f"""Based on the following documents, please answer this question: {query}

CONTENT TO ANALYZE:
"""
        
        for i, chunk in enumerate(chunks):
            prompt_text += f"--- Document {i+1} ---\n"
            
            if "original_content" in chunk.metadata:
                original_data = json.loads(chunk.metadata["original_content"])
                
                # Add raw text
                raw_text = original_data.get("raw_text", "")
                if raw_text:
                    prompt_text += f"TEXT:\n{raw_text}\n\n"
                
                # Add tables as HTML
                tables_html = original_data.get("tables_html", [])
                if tables_html:

                    prompt_text += "TABLES:\n"
                    for j, table in enumerate(tables_html):
                        prompt_text += f"Table {j+1}:\n{table}\n\n"
            
            prompt_text += "\n"
        
        prompt_text += """
Please provide a clear, comprehensive answer using the text, tables, and images above. If the documents don't contain sufficient information to answer the question, say "I don't have enough information to answer that question based on the provided documents."

ANSWER:"""

        # Build message content starting with text
        message_content = [{"type": "text", "text": prompt_text}]
        
        # Add all images from all chunks
        for chunk in chunks:
            if "original_content" in chunk.metadata:
                original_data = json.loads(chunk.metadata["original_content"])
                images_base64 = original_data.get("images_base64", [])
                
                for image_base64 in images_base64:
                    message_content.append({
                        "type": "image_url",
                        "image_url": {"url": f"data:image/jpeg;base64,{image_base64}"}
                    })
        
        # Send to AI and get response
        message = HumanMessage(content=message_content)
        response = llm.invoke([message])
        
        return response.content
        
    except Exception as e:
        print(f" Answer generation failed: {e}")
        return "Sorry, I encountered an error while generating the answer."

# Usage
final_answer = generate_final_answer(chunks, query)
print(final_answer)


The Transformer uses 8 attention heads, and the dimension of each head is 64.
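This answer can be cross-checked against the base configuration reported in the paper, where the model dimension of 512 is split evenly across the attention heads:

```python
d_model = 512   # model dimension of the base Transformer in the paper
num_heads = 8   # number of parallel attention heads
head_dim = d_model // num_heads  # each head works on an equal slice of d_model
print(head_dim)
```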
