# Basic Multimodal RAG Pipeline Tutorial

This notebook implements a simple, beginner-friendly multimodal RAG pipeline that:
- Parses PDFs into atomic elements (text, titles, tables, images)
- Chunks content by title
- Handles hybrid chunks (text + tables/images)
- Builds a vector store for retrieval
- Demonstrates basic RAG retrieval

---


## Task 1: Environment Setup & Installation

First, install the required packages. Run this in your terminal:

```bash
pip install "unstructured[pdf]"
pip install langchain-core langchain-chroma langchain-openai
pip install python-dotenv
```

**System Dependencies (install separately):**
- `poppler` (for PDF processing)
- `tesseract` (for OCR)
- `libmagic` (for file type detection)

On Ubuntu/Debian:
```bash
sudo apt-get install poppler-utils tesseract-ocr libmagic1
```

On macOS:
```bash
brew install poppler tesseract libmagic
```


---

## 🧩 System Dependencies (All OS)

This project needs three native tools:

* **poppler** - PDF processing (e.g., `pdftotext`, `pdfimages`)
* **tesseract** - OCR for scanned PDFs
* **libmagic** - file type detection (`python-magic` uses this)

Below are OS-specific install steps.

---

### 🪟 Windows

#### 1. Poppler

1. Download latest Poppler build (ZIP) from:
   [https://github.com/oschwartz10612/poppler-windows/releases](https://github.com/oschwartz10612/poppler-windows/releases)
2. Extract to a permanent folder, e.g.:
   `C:\Tools\poppler`
3. Add this to your **System PATH**:
   `C:\Tools\poppler\Library\bin`

Verify in **PowerShell**:

```powershell
pdftotext -v
```

---

#### 2. Tesseract OCR

1. Download Windows installer (UB Mannheim build recommended):
   [https://github.com/UB-Mannheim/tesseract/wiki](https://github.com/UB-Mannheim/tesseract/wiki)
2. Install (default path):
   `C:\Program Files\Tesseract-OCR`
3. Add to **PATH**:
   `C:\Program Files\Tesseract-OCR`

Verify:

```powershell
tesseract --version
```

---

#### 3. libmagic

On Windows, install the `python-magic-bin` package, which bundles the libmagic binaries:

```bash
pip install python-magic-bin
```

Quick test:

```bash
python -c "import magic; print(magic.from_buffer(b'hello'))"
```

---

### 🐧 Ubuntu / Debian / WSL2 (Ubuntu)

From your **WSL/Ubuntu terminal**:

```bash
sudo apt-get update
sudo apt-get install -y \
  poppler-utils \
  tesseract-ocr \
  libmagic1
```

> Optional language packs for Tesseract (e.g., English):
>
> ```bash
> sudo apt-get install -y tesseract-ocr-eng
> ```

Verify:

```bash
pdftotext -v
tesseract --version
python - << 'EOF'
import magic
print(magic.from_buffer(b'hello'))
EOF
```

*(If `magic` is missing, run: `pip install python-magic` in your env.)*

---

### 🍏 macOS (with Homebrew)

Make sure you have **Homebrew** installed first: [https://brew.sh](https://brew.sh)

Then:

```bash
brew update
brew install poppler tesseract libmagic
```

Verify:

```bash
pdftotext -v
tesseract --version
python - << 'EOF'
import magic
print(magic.from_buffer(b'hello'))
EOF
```

---

### 🔎 Final Sanity Check (All Platforms)

In your activated Python/conda env:

```bash
pip install "unstructured[pdf]" langchain-core langchain-chroma langchain-openai python-dotenv
python - << 'EOF'
import unstructured, langchain_core, dotenv
print("Python deps OK")
EOF
```

If all verification commands succeed, your **PDF/OCR/file-type stack** is ready on Windows, WSL, and macOS.
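
The command-line checks above can also be run from inside the notebook. The following is a minimal sketch that looks up the executables on `PATH` with the standard library; the helper name `check_native_tools` and the choice of `pdftotext` and `tesseract` as the executables to probe are assumptions taken from the verification commands in this section.

```python
import shutil

# Executable names are assumptions based on the verification commands above.
REQUIRED_TOOLS = ["pdftotext", "tesseract"]


def check_native_tools(tools=REQUIRED_TOOLS):
    """Return {tool_name: resolved_path_or_None} for each executable."""
    return {tool: shutil.which(tool) for tool in tools}


for tool, path in check_native_tools().items():
    status = f"found at {path}" if path else "NOT FOUND on PATH"
    print(f"{tool}: {status}")
```

A `None` entry usually means the install directory was not added to `PATH`, or the terminal or kernel was not restarted after installation.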


In [1]:
# Import all required libraries
import os
from pathlib import Path
from typing import List, Dict, Any
from dotenv import load_dotenv

# Unstructured library for PDF parsing
from unstructured.partition.pdf import partition_pdf
from unstructured.chunking.title import chunk_by_title

# LangChain components
from langchain_core.documents import Document
from langchain_openai import OpenAIEmbeddings
from langchain_chroma import Chroma

# Load environment variables (for API keys)
load_dotenv()

print("✅ All imports successful!")


✅ All imports successful!


## Task 2: Partition PDF into Atomic Elements

This function extracts all elements from a PDF: text, titles, tables, and images.


In [2]:
def partition_document(file_path: str) -> List[Any]:
    """
    Partition a PDF into atomic elements (text, titles, tables, images).
    
    Args:
        file_path: Path to the PDF file
        
    Returns:
        List of unstructured elements
    """
    print(f"📄 Partitioning PDF: {file_path}")
    
    # Partition PDF with high-resolution strategy
    # This extracts text, tables, and images
    elements = partition_pdf(
        filename=file_path,
        strategy="hi_res",  # High resolution for better accuracy
        infer_table_structure=True,  # Extract tables as structured data
        extract_image_block_types=["Image"],  # Extract images
        extract_image_block_to_payload=True,  # Include image data
    )
    
    print(f"✅ Found {len(elements)} elements in the PDF")
    return elements


In [3]:
def inspect_element_types(elements: List[Any]) -> None:
    """
    Helper function to inspect what types of elements were found.
    """
    # Get unique element types
    element_types = set()
    for elem in elements:
        element_types.add(elem.__class__.__name__)
    
    print(f"\n📊 Unique element types found: {sorted(element_types)}")
    
    # Find examples of each type
    examples = {}
    for elem in elements:
        elem_type = elem.__class__.__name__
        if elem_type not in examples:
            examples[elem_type] = elem
    
    # Print examples
    print("\n📝 Example elements:")
    for elem_type, elem in examples.items():
        print(f"\n--- {elem_type} ---")
        elem_dict = elem.to_dict()
        # Show a preview (first 200 chars)
        if 'text' in elem_dict:
            text_preview = elem_dict['text'][:200]
            print(f"Text preview: {text_preview}...")
        print(f"Full dict keys: {list(elem_dict.keys())}")
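
For a per-type count rather than one example per type, the same class-name lookup can feed a `Counter`. This is a small sketch; `count_element_types` is an illustrative helper name, and the two stand-in classes are hypothetical placeholders so the example runs without `unstructured` installed.

```python
from collections import Counter


def count_element_types(elements):
    """Tally elements by class name, most common first."""
    return Counter(type(elem).__name__ for elem in elements).most_common()


# Stand-in classes (hypothetical) in place of real unstructured elements
class Title: ...
class NarrativeText: ...


sample = [Title(), NarrativeText(), NarrativeText()]
print(count_element_types(sample))
# [('NarrativeText', 2), ('Title', 1)]
```

With real data, pass the list returned by `partition_document` instead of `sample`.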


## Task 3: Chunk Elements by Title

This function groups elements into chunks based on titles, following the pattern:
"title + related paragraphs + any tables/images"


In [13]:
def chunk_elements_by_title(elements: List[Any]) -> List[Any]:
    """
    Chunk elements by title, grouping related content together.
    
    Args:
        elements: List of unstructured elements
        
    Returns:
        List of chunks
    """
    print(f"\n🔪 Chunking {len(elements)} elements by title...")
    
    # Chunk by title with reasonable size limits
    chunks = chunk_by_title(
        elements,
        max_characters=3000,  # Maximum characters per chunk
        new_after_n_chars=2400,  # Start new chunk after this many chars
        combine_text_under_n_chars=500,  # Combine small chunks under this size
    )
    
    print(f"✅ Created {len(chunks)} chunks")
    return chunks
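
To build intuition for what title-based chunking does, here is a deliberately simplified sketch in plain Python. It is not the algorithm `chunk_by_title` uses: it only models two rules (a title starts a new chunk, and a chunk is closed before it would exceed `max_characters`) and ignores `new_after_n_chars` and `combine_text_under_n_chars`. The `Elem` class and `toy_chunk_by_title` are hypothetical names for illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Elem:
    kind: str  # "Title" or "Text"
    text: str


def toy_chunk_by_title(elements: List[Elem], max_characters: int = 3000) -> List[str]:
    """A Title starts a new chunk; so does any element that would push
    the current chunk past max_characters."""
    chunks: List[str] = []
    current: List[str] = []
    size = 0
    for elem in elements:
        too_big = size + len(elem.text) > max_characters
        if current and (elem.kind == "Title" or too_big):
            chunks.append("\n".join(current))
            current, size = [], 0
        current.append(elem.text)
        size += len(elem.text)
    if current:
        chunks.append("\n".join(current))
    return chunks


elements = [
    Elem("Title", "Intro"),
    Elem("Text", "a" * 40),
    Elem("Title", "Methods"),
    Elem("Text", "b" * 40),
    Elem("Text", "c" * 40),
]
chunks = toy_chunk_by_title(elements, max_characters=60)
print(len(chunks))  # 3: Intro section, Methods section, overflow text
```

The third chunk appears because the last paragraph would push the "Methods" chunk past the 60-character limit, which mirrors the role `max_characters` plays in the real call.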


In [5]:
def inspect_chunk(chunk: Any, chunk_idx: int = 0) -> None:
    """
    Helper function to inspect a single chunk.
    """
    print(f"\n📦 Inspecting chunk {chunk_idx}:")
    print(f"Text preview (first 300 chars): {chunk.text[:300]}...")
    
    # Check what types of elements are in this chunk.
    # orig_elements may be None if the chunker did not keep the original elements.
    orig_elements = getattr(getattr(chunk, 'metadata', None), 'orig_elements', None)
    if orig_elements:
        element_types = [elem.__class__.__name__ for elem in orig_elements]
        print(f"Element types in chunk: {element_types}")
    else:
        print("Note: orig_elements metadata not available")


## Task 4: Separate Content Types

This function separates text, tables, and images from a chunk.


In [6]:
def separate_content_types(chunk: Any) -> Dict[str, Any]:
    """
    Separate a chunk into text, tables, and images.
    
    Args:
        chunk: A chunk from chunk_by_title
        
    Returns:
        Dictionary with 'text', 'tables', 'images', and 'types'
    """
    result = {
        "text": chunk.text,
        "tables": [],
        "images": [],
        "types": ["text"]  # Always has text
    }
    
    # orig_elements can be missing or None, so fall back to an empty list
    orig_elements = getattr(getattr(chunk, 'metadata', None), 'orig_elements', None) or []
    for elem in orig_elements:
        elem_type = elem.__class__.__name__
        elem_meta = getattr(elem, 'metadata', None)

        # Extract tables (skip tables that have no HTML rendering)
        if elem_type == "Table":
            table_html = getattr(elem_meta, 'text_as_html', None)
            if table_html:
                result["tables"].append(table_html)
                if "table" not in result["types"]:
                    result["types"].append("table")

        # Extract images (skip images that have no base64 payload)
        elif elem_type == "Image":
            image_b64 = getattr(elem_meta, 'image_base64', None)
            if image_b64:
                result["images"].append(image_b64)
                if "image" not in result["types"]:
                    result["types"].append("image")
    
    return result
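
The image entries collected above are base64 strings. To view one, it has to be decoded back to bytes first. The sketch below shows that step with the standard library; `save_base64_image` is a hypothetical helper, and the demo uses a tiny fake payload rather than a real image.

```python
import base64
import tempfile
from pathlib import Path


def save_base64_image(image_b64: str, out_path: Path) -> int:
    """Decode a base64 string, write the bytes to out_path,
    and return the number of bytes written."""
    data = base64.b64decode(image_b64)
    Path(out_path).write_bytes(data)
    return len(data)


# Demo with a fake payload; with real data, pass an entry of result["images"]
fake_b64 = base64.b64encode(b"not-a-real-image").decode("ascii")
out_file = Path(tempfile.gettempdir()) / "demo_image.bin"
n_bytes = save_base64_image(fake_b64, out_file)
print(f"Wrote {n_bytes} bytes to {out_file}")
```

For real extracted images, choose a file extension that matches the actual image format before opening the file in a viewer.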


## Task 5: Simple Summarization for Hybrid Chunks

This is a basic rule-based summarizer (no LLM required).


In [7]:
def summarize_chunk_basically(text: str, tables: list, images: list) -> str:
    """
    Create a simple text summary for hybrid chunks (text + tables/images).
    
    This is a basic rule-based summarizer - no LLM call needed.
    
    Args:
        text: The text content of the chunk
        tables: List of HTML table strings
        images: List of base64 image strings
        
    Returns:
        Summary string suitable for embedding
    """
    # Take first 500 characters of text as summary
    summary = text[:500]
    
    # Add notes about tables and images
    notes = []
    if tables:
        notes.append(f"{len(tables)} table(s)")
    if images:
        notes.append(f"{len(images)} image(s)")
    
    if notes:
        summary += f" [contains {', '.join(notes)}]"
    
    return summary
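
The summarizer above only notes that a table exists, so the table's contents never reach the embedding. One possible refinement, sketched here under the assumption that the table HTML is simple (`<tr>`, `<th>`, `<td>` without nesting), is to flatten the HTML into plain text and append it to the summary. `TableTextExtractor` and `table_html_to_text` are hypothetical names, and the sample table values are made up.

```python
from html.parser import HTMLParser


class TableTextExtractor(HTMLParser):
    """Collect cell text from an HTML table, one list per row."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._in_cell = False

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.rows.append([])
        elif tag in ("td", "th"):
            if not self.rows:  # tolerate cells outside a <tr>
                self.rows.append([])
            self.rows[-1].append("")
            self._in_cell = True

    def handle_endtag(self, tag):
        if tag in ("td", "th"):
            self._in_cell = False

    def handle_data(self, data):
        if self._in_cell:
            self.rows[-1][-1] += data.strip()


def table_html_to_text(table_html: str) -> str:
    """Render each table row as 'cell | cell | ...' on its own line."""
    parser = TableTextExtractor()
    parser.feed(table_html)
    parser.close()
    return "\n".join(" | ".join(row) for row in parser.rows if row)


html = "<table><tr><th>Model</th><th>F1</th></tr><tr><td>A</td><td>0.9</td></tr></table>"
print(table_html_to_text(html))
# Model | F1
# A | 0.9
```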


## Task 6: Build LangChain Documents

Convert chunks into LangChain Document objects with proper metadata.


In [8]:
def build_documents_from_chunks(chunks: List[Any], file_path: str) -> List[Document]:
    """
    Convert chunks into LangChain Document objects.
    
    Args:
        chunks: List of chunks from chunk_by_title
        file_path: Path to the original PDF file
        
    Returns:
        List of LangChain Document objects
    """
    print(f"\n📚 Building LangChain Documents from {len(chunks)} chunks...")
    
    documents = []
    
    for idx, chunk in enumerate(chunks):
        # Separate content types
        content_data = separate_content_types(chunk)
        
        # Determine page_content
        if content_data["tables"] or content_data["images"]:
            # Hybrid chunk: use summary
            page_content = summarize_chunk_basically(
                content_data["text"],
                content_data["tables"],
                content_data["images"]
            )
        else:
            # Pure text chunk: use raw text
            page_content = content_data["text"]
        
        # Build metadata
        metadata = {
            "source": file_path,
            "chunk_index": idx,
            "types": content_data["types"],
            "raw_text": content_data["text"],
            "raw_tables_html": content_data["tables"],
            "raw_images_b64": content_data["images"],
        }
        
        # Try to get page number if available
        if hasattr(chunk, 'metadata') and hasattr(chunk.metadata, 'page_number'):
            metadata["page_number"] = chunk.metadata.page_number
        elif hasattr(chunk, 'metadata') and hasattr(chunk.metadata, 'page'):
            metadata["page_number"] = chunk.metadata.page
        
        # Create LangChain Document
        doc = Document(
            page_content=page_content,
            metadata=metadata
        )
        
        documents.append(doc)
    
    print(f"✅ Created {len(documents)} LangChain Documents")
    
    # Print some stats
    pure_text = sum(1 for doc in documents if doc.metadata["types"] == ["text"])
    hybrid = len(documents) - pure_text
    print(f"   - Pure text chunks: {pure_text}")
    print(f"   - Hybrid chunks (with tables/images): {hybrid}")
    
    return documents
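
One thing to watch before indexing: the metadata built above contains lists (`types`, `raw_tables_html`, `raw_images_b64`), and vector stores such as Chroma have generally expected metadata values to be plain scalars. If your version rejects list values, one workaround is to JSON-encode them before storage and decode them after retrieval. The sketch below assumes that restriction applies; `flatten_metadata` is a hypothetical helper, not part of LangChain or Chroma.

```python
import json
from typing import Any, Dict

SCALAR_TYPES = (str, int, float, bool)


def flatten_metadata(metadata: Dict[str, Any]) -> Dict[str, Any]:
    """JSON-encode non-scalar values (lists, dicts) and drop None values."""
    flat = {}
    for key, value in metadata.items():
        if value is None:
            continue
        flat[key] = value if isinstance(value, SCALAR_TYPES) else json.dumps(value)
    return flat


example = {
    "source": "doc.pdf",
    "chunk_index": 0,
    "types": ["text", "table"],
    "page_number": None,
}
flat = flatten_metadata(example)
print(flat)

# Decode a list-valued field again after retrieval
print(json.loads(flat["types"]))
```

Applied to this pipeline, the call would wrap the `metadata` dict just before each `Document` is created.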


## Task 7: Basic Vector Store + Retrieval Demo

Build a vector store and demonstrate retrieval.


In [9]:
# def build_vector_store(documents: List[Document], persist_directory: str = "./chroma_db") -> Chroma:
#     """
#     Build a Chroma vector store from documents.
    
#     Args:
#         documents: List of LangChain Documents
#         persist_directory: Directory to persist the vector store
        
#     Returns:
#         Chroma vector store
#     """
#     print(f"\n🔨 Building vector store from {len(documents)} documents...")
    
#     # Initialize embeddings
#     # Option 1: OpenAI embeddings (requires OPENAI_API_KEY in .env)
#     # embeddings = OpenAIEmbeddings()
    
#     # Option 2: Use a small local model (uncomment if you prefer)
#     # from langchain_community.embeddings import HuggingFaceEmbeddings
#     # embeddings = HuggingFaceEmbeddings(model_name="sentence-transformers/all-MiniLM-L6-v2")
    
#     # For this demo, we'll use OpenAI (make sure you have OPENAI_API_KEY set)
#     embeddings = OpenAIEmbeddings()
    
#     # Create vector store
#     vectorstore = Chroma.from_documents(
#         documents=documents,
#         embedding=embeddings,
#         persist_directory=persist_directory
#     )
    
#     print(f"✅ Vector store created and persisted to {persist_directory}")
#     return vectorstore


In [10]:
# def demo_retrieval(vectorstore: Chroma, question: str, k: int = 3) -> None:
#     """
#     Demonstrate retrieval from the vector store.
    
#     Args:
#         vectorstore: The Chroma vector store
#         question: The query/question to search for
#         k: Number of documents to retrieve
#     """
#     print(f"\n🔍 Query: {question}")
#     print(f"Retrieving top {k} documents...\n")
    
#     # Perform similarity search
#     results = vectorstore.similarity_search(question, k=k)
    
#     # Print results
#     for idx, doc in enumerate(results, 1):
#         print(f"--- Result {idx} ---")
#         print(f"Page number: {doc.metadata.get('page_number', 'N/A')}")
#         print(f"Content types: {doc.metadata.get('types', [])}")
#         print(f"Page content (first 300 chars): {doc.page_content[:300]}...")
#         print()
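
Since the vector store cells above are commented out, here is a self-contained sketch of the idea behind `similarity_search` that runs without an API key. It swaps the embedding model for a plain bag-of-words vector and ranks texts by cosine similarity, so it illustrates the ranking step only and will not match the quality of real embeddings. All function names and sample texts here are hypothetical.

```python
import math
import re
from collections import Counter
from typing import List, Tuple


def bag_of_words(text: str) -> Counter:
    """Lowercase the text and count its alphanumeric tokens."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two token-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def toy_similarity_search(
    texts: List[str], question: str, k: int = 3
) -> List[Tuple[float, str]]:
    """Return the k texts most similar to the question, best first."""
    q = bag_of_words(question)
    scored = [(cosine(q, bag_of_words(t)), t) for t in texts]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)[:k]


texts = [
    "Chunking splits long documents into smaller passages.",
    "Tables are stored as HTML in the metadata.",
    "Images are stored as base64 strings.",
]
results = toy_similarity_search(texts, "How are tables stored?", k=2)
for score, text in results:
    print(f"{score:.2f}  {text}")
```

The sentence about tables ranks first because it shares the most tokens with the question. A real embedding model plays the same role but also captures meaning beyond exact word overlap.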


## Task 8: Orchestrator & Main Pipeline

This is the main function that ties everything together.


In [11]:
def main(pdf_path: str = None):
    """
    Main orchestrator function that runs the entire pipeline.
    
    Args:
        pdf_path: Path to the PDF file (default: looks for sample PDF)
    """
    # Set default PDF path if not provided
    if pdf_path is None:
        # Try to find a PDF in the data/raw directory
        default_paths = [
            "../data/raw/Text Chunking.pdf",
            "./data/raw/Text Chunking.pdf",
            "../rag-pipeline/data/raw/Text Chunking.pdf",
        ]
        
        pdf_path = None
        for path in default_paths:
            if os.path.exists(path):
                pdf_path = path
                break
        
        if pdf_path is None:
            print("❌ No PDF found. Please provide a pdf_path argument.")
            print("Example: main(pdf_path='./docs/attention-is-all-you-need.pdf')")
            return
    
    print("=" * 60)
    print("🚀 Starting Multimodal RAG Pipeline")
    print("=" * 60)
    
    # Step 1: Partition PDF into elements
    print("\n[Step 1] Partitioning PDF...")
    elements = partition_document(pdf_path)
    
    # Inspect element types (optional, for debugging)
    inspect_element_types(elements)
    
    # Step 2: Chunk elements by title
    print("\n[Step 2] Chunking by title...")
    chunks = chunk_elements_by_title(elements)
    
    # Inspect first chunk (optional, for debugging)
    if chunks:
        inspect_chunk(chunks[0], chunk_idx=0)
    
    # Step 3: Build LangChain Documents
    print("\n[Step 3] Building LangChain Documents...")
    documents = build_documents_from_chunks(chunks, pdf_path)
    
    # Step 4: Build vector store
    # Disabled for now: uncomment build_vector_store (Task 7) and the call below to enable it.
    print("\n[Step 4] Vector store step is disabled (see Task 7)")
    # vectorstore = build_vector_store(documents)
    
    # Step 5: Demo retrieval
    # Disabled for now: requires the vector store from Step 4.
    print("\n[Step 5] Retrieval demo is disabled (see Task 7)")
    
    # Example questions (customize these for your PDF)
    questions = [
        "What is the main topic of this document?",
        "What are the key concepts discussed?",
    ]
    
    # for question in questions:
    #     demo_retrieval(vectorstore, question, k=3)
    
    print("\n" + "=" * 60)
    print("✅ Pipeline completed successfully!")
    print("=" * 60)
    
    # return vectorstore, documents  # use this once the vector store step is enabled
    return documents


## Run the Pipeline

Execute the main function to run the entire pipeline. Make sure you have:
1. A PDF file (update the path below)
2. `OPENAI_API_KEY` set in your `.env` file or environment variables (needed only once the vector store step is enabled)
3. All required packages installed
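
A quick way to confirm the second prerequisite is to check the environment after `load_dotenv()` has run. This is a minimal sketch; `has_openai_key` is a hypothetical helper, and it only checks that the variable is non-empty, not that the key is valid.

```python
import os


def has_openai_key(env=os.environ) -> bool:
    """True if OPENAI_API_KEY is set to a non-empty value."""
    return bool(env.get("OPENAI_API_KEY", "").strip())


if not has_openai_key():
    print("OPENAI_API_KEY is not set; the OpenAI embedding step would fail.")
```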


In [None]:
# Run the pipeline!
# Update the path to your PDF file
# pdf_path = "../data/raw/Text Chunking.pdf"  # Change this to your PDF path
pdf_path = r"C:\Users\SuryaDeva\Documents\Certifications_202k\Coding\RAG_Mini\rag-pipeline\data\raw\Text Chunking.pdf"
# Uncomment to run (main currently returns only the documents):
# documents = main(pdf_path=pdf_path)


🚀 Starting Multimodal RAG Pipeline

[Step 1] Partitioning PDF...
📄 Partitioning PDF: C:\Users\SuryaDeva\Documents\Certifications_202k\Coding\RAG_Mini\rag-pipeline\data\raw\Text Chunking.pdf


## Optional: Test Individual Components

You can also test individual functions separately:


In [15]:
# Example: Test partitioning only
elements = partition_document(pdf_path)



📄 Partitioning PDF: C:\Users\SuryaDeva\Documents\Certifications_202k\Coding\RAG_Mini\rag-pipeline\data\raw\Text Chunking.pdf
✅ Found 196 elements in the PDF


In [16]:
inspect_element_types(elements)


📊 Unique element types found: ['FigureCaption', 'Header', 'Image', 'ListItem', 'NarrativeText', 'Table', 'Text', 'Title']

📝 Example elements:

--- Text ---
Text preview: 5...
Full dict keys: ['type', 'element_id', 'text', 'metadata']

--- Header ---
Text preview: r a M 1 3 ] L C . s c [ 1 v 4 7 2 0 0 . 4 0 5 2...
Full dict keys: ['type', 'element_id', 'text', 'metadata']

--- NarrativeText ---
Text preview: Text Chunking for Document Classification for Urban Systems Management using Large Language Models...
Full dict keys: ['type', 'element_id', 'text', 'metadata']

--- ListItem ---
Text preview: * Corresponding author; email: steve.conrad@colostate.edu...
Full dict keys: ['type', 'element_id', 'text', 'metadata']

--- Title ---
Text preview: Abstract...
Full dict keys: ['type', 'element_id', 'text', 'metadata']

--- Table ---
Text preview: ments, taking as input StudySet, Codebook Algorithm 1 Whole Paper analysis of documents, 1: for Text ∈ StudySet do taking as input StudyS

In [17]:
# Step 2: Chunk elements by title
print("\n[Step 2] Chunking by title...")
chunks = chunk_elements_by_title(elements)
    
   


[Step 2] Chunking by title...

🔪 Chunking 196 elements by title...
✅ Created 24 chunks


In [18]:
 # Inspect first chunk (optional, for debugging)
if chunks:
    inspect_chunk(chunks[0], chunk_idx=0)
    


📦 Inspecting chunk 0:
Text preview (first 300 chars): 5

2025

2

0

2

r a M 1 3 ] L C . s c [ 1 v 4 7 2 0 0 . 4 0 5 2

:

v

i

X

r

a

Text Chunking for Document Classification for Urban Systems Management using Large Language Models

Joshua Rodriguez1†, Om Sanan2†, Guillermo Vizarreta-Luna1, Steven A. Conrad1*

1 Department of Systems Engineering,...
Element types in chunk: ['Text', 'Text', 'Text', 'Text', 'Text', 'Header', 'Text', 'Text', 'Text', 'Text', 'Text', 'Text', 'NarrativeText', 'NarrativeText', 'NarrativeText', 'ListItem', 'Text', 'Title', 'NarrativeText', 'Text']


In [19]:
# Step 3: Build LangChain Documents
print("\n[Step 3] Building LangChain Documents...")
documents = build_documents_from_chunks(chunks, pdf_path)


[Step 3] Building LangChain Documents...

📚 Building LangChain Documents from 24 chunks...
✅ Created 24 LangChain Documents
   - Pure text chunks: 17
   - Hybrid chunks (with tables/images): 7


---

## Summary

This notebook implements a basic multimodal RAG pipeline. Steps 1-6 and 9 run as shown; the vector store and retrieval steps (7 and 8) are included as commented-out code and require an `OPENAI_API_KEY` once enabled:

1. ✅ **Environment setup** - Installation instructions
2. ✅ **PDF partitioning** - Extract atomic elements (text, tables, images)
3. ✅ **Title-based chunking** - Group related content
4. ✅ **Content type separation** - Identify text, tables, images
5. ✅ **Basic summarization** - Create summaries for hybrid chunks
6. ✅ **LangChain Documents** - Convert to standard format
7. **Vector store** - Build Chroma index with embeddings (commented out)
8. **Retrieval demo** - Query and retrieve relevant chunks (commented out)
9. ✅ **Orchestrator** - Main pipeline function

**Next Steps:**
- Add an LLM to generate final answers from retrieved chunks
- Experiment with different chunking strategies
- Try different embedding models
- Add more sophisticated summarization for hybrid chunks
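
As a starting point for the first next step, the retrieved chunks need to be assembled into a prompt before any LLM is called. The sketch below covers only that assembly step and makes no API call; `build_prompt` and the prompt wording are assumptions, not a prescribed format.

```python
from typing import List


def build_prompt(question: str, contexts: List[str]) -> str:
    """Assemble a grounded-answer prompt from retrieved chunk texts."""
    numbered = "\n\n".join(f"[{i}] {text}" for i, text in enumerate(contexts, 1))
    return (
        "Answer the question using only the context below. "
        "If the context is insufficient, say so.\n\n"
        f"Context:\n{numbered}\n\n"
        f"Question: {question}\nAnswer:"
    )


prompt = build_prompt("How are tables stored?", ["Tables are stored as HTML."])
print(prompt)
```

In this pipeline, `contexts` would typically come from the `page_content` (or `raw_text` metadata) of the retrieved documents.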
