# Document Parsing with Docling for RAG Systems

## A Comprehensive Guide to Document Conversion and Processing

This notebook demonstrates the document parsing capabilities of **Docling** (v2.68.0), a Python library developed by IBM for converting various document formats into structured representations suitable for AI/ML workflows, particularly Retrieval-Augmented Generation (RAG) systems.

### What You'll Learn

1. **Basic Document Conversion** - Convert PDFs and other formats to Markdown, JSON, HTML
2. **Multiple File Formats** - PDF, DOCX, XLSX, PPTX, HTML, Markdown, Images, Audio
3. **Pipeline Configuration** - OCR engines, table extraction, layout analysis, VLM
4. **LangChain Integration** - DoclingLoader and RAG pipeline with Chroma
5. **Advanced Topics** - Enrichment, error handling

### Prerequisites

- Python 3.12 (recommended for full compatibility)
- OpenAI API key (for RAG examples)
- Sufficient disk space for model downloads (~2-4GB)
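
The prerequisites above can be checked before installing anything. The following is a minimal sketch using only the standard library; the function name `check_prerequisites` and its default thresholds are illustrative assumptions (the thresholds mirror the list above), not part of Docling.

```python
import shutil
import sys


def check_prerequisites(min_python=(3, 12), min_free_gb=4.0, path="."):
    """Return a list of human-readable problems; an empty list means all checks passed."""
    problems = []

    # Compare the running interpreter against the recommended version
    if sys.version_info < min_python:
        found = ".".join(str(part) for part in sys.version_info[:3])
        wanted = ".".join(str(part) for part in min_python)
        problems.append(f"Python {wanted}+ recommended, found {found}")

    # Model downloads need free disk space on the volume that holds `path`
    free_gb = shutil.disk_usage(path).free / 1024**3
    if free_gb < min_free_gb:
        problems.append(f"Only {free_gb:.1f} GB free, {min_free_gb} GB recommended")

    return problems


for problem in check_prerequisites():
    print("Warning:", problem)
```

An empty result only means the two checks passed; it does not verify that the API key is configured, which is covered in section 1.3.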

---

## 1. Installation & Setup

### 1.1 Create Python 3.12 Virtual Environment

```bash
# Create virtual environment with Python 3.12
python3.12 -m venv .venv

# Activate the environment
source .venv/bin/activate  # On macOS/Linux
# .venv\Scripts\activate  # On Windows
```

### 1.2 Install Dependencies

Run the following cell, or run the same `uv pip install` commands (without the leading `!`) in your terminal:

In [13]:
# Install Docling and its optional dependencies
# Uncomment and run these lines if you haven't installed the packages yet

!uv pip install docling==2.68.0 langchain-docling langchain-openai python-dotenv
# !uv pip install 'docling[easyocr,vlm,asr]'
# !uv pip install 'docling-core[chunking]'
# !uv pip install chromadb transformers sentence-transformers
# !uv pip install pandas openpyxl  # For table export examples

Audited 4 packages in 1.65s


In [None]:
# Verify installation
import docling 
from importlib.metadata import version

print(version("docling"))


2.68.0


### 1.3 Environment Configuration

In [None]:
from dotenv import load_dotenv, dotenv_values

# Load environment variables from .env into the process environment
load_dotenv()

# Also read .env as a plain dict (useful for checking which keys are defined)
config = dotenv_values(".env")

# Never print the values themselves: they may contain secrets such as API keys
print("Environment variables loaded.")

Environment variables loaded.


In [7]:
# Load environment variables from .env file
import os
from dotenv import load_dotenv

# Load environment variables
load_dotenv()

# Verify OpenAI API key is set (for RAG examples later)
if os.getenv("OPENAI_API_KEY"):
    print("OpenAI API key is configured")
else:
    print("Warning: OpenAI API key not found. Some RAG examples will not work.")
    print("Create a .env file with: OPENAI_API_KEY=your-key-here")

OpenAI API key is configured
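
When checking configuration in a notebook, it is safer to print a masked form of a secret than the secret itself, because cell outputs are saved with the notebook. The helper below is a minimal sketch; `mask_secret` is a hypothetical name introduced here, not a function from `python-dotenv` or Docling.

```python
import os


def mask_secret(value, visible=4):
    """Return a masked form of a secret that is safe to show in notebook output."""
    if not value:
        return "<not set>"
    # Very short values are hidden completely so nothing useful leaks
    if len(value) <= visible * 2:
        return "*" * len(value)
    return f"{value[:visible]}...{value[-visible:]}"


print("OPENAI_API_KEY:", mask_secret(os.getenv("OPENAI_API_KEY")))
```

For example, `mask_secret("sk-proj-abcdefghijkl")` returns `sk-p...ijkl`, which is enough to tell keys apart without exposing them.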


In [8]:
# Import core modules that we'll use throughout the notebook
from pathlib import Path
import warnings
warnings.filterwarnings('ignore')  # Suppress warnings for cleaner output

# Docling imports
from docling.document_converter import DocumentConverter
from docling.datamodel.base_models import InputFormat, ConversionStatus

# Set up paths
SAMPLE_DIR = Path("sample_documents")
OUTPUT_DIR = Path("output")
OUTPUT_DIR.mkdir(exist_ok=True)

print("Core imports loaded successfully!")

Core imports loaded successfully!


---

## 2. Basic Document Conversion

The `DocumentConverter` class is the main entry point for document conversion in Docling. It handles format detection, backend selection, and pipeline execution automatically.

### Key Concepts:
- **ConversionResult**: Contains the converted document, status, and any errors
- **DoclingDocument**: The unified internal representation of any document
- **Export Formats**: Markdown, JSON, HTML, Text, DocTags

### 2.1 Simple PDF Conversion

In [16]:
# Basic PDF conversion example
# Using the Docling paper from arXiv as an example

from docling.document_converter import DocumentConverter

# Initialize the converter with default settings
converter = DocumentConverter()

# Convert a PDF from URL
# The Docling paper: "Docling Technical Report"
pdf_url = "https://arxiv.org/pdf/2408.09869"

print(f"Converting PDF from: {pdf_url}")
print("This may take a minute for the first run as models are downloaded...")

# Perform conversion
result = converter.convert(pdf_url)


print(f"Result type: {type(result)}")
# Check conversion status
print(f"\nConversion Status: {result.status}")
print(f"Document Name: {result.input.file.name}")
print(f"Number of Pages: {len(result.pages) if result.pages else 'N/A'}")

Converting PDF from: https://arxiv.org/pdf/2408.09869
This may take a minute for the first run as models are downloaded...


[INFO] 2026-01-19 17:39:18,762 [RapidOCR] base.py:22: Using engine_name: onnxruntime
[INFO] 2026-01-19 17:39:18,767 [RapidOCR] download_file.py:60: File exists and is valid: C:\learning_notes\RAGG\simple-rag-langchain-1\Data Parsers\document-parsers-rag\.venv\Lib\site-packages\rapidocr\models\ch_PP-OCRv4_det_infer.onnx
[INFO] 2026-01-19 17:39:18,768 [RapidOCR] main.py:53: Using C:\learning_notes\RAGG\simple-rag-langchain-1\Data Parsers\document-parsers-rag\.venv\Lib\site-packages\rapidocr\models\ch_PP-OCRv4_det_infer.onnx
[INFO] 2026-01-19 17:39:18,840 [RapidOCR] base.py:22: Using engine_name: onnxruntime
[INFO] 2026-01-19 17:39:18,842 [RapidOCR] download_file.py:60: File exists and is valid: C:\learning_notes\RAGG\simple-rag-langchain-1\Data Parsers\document-parsers-rag\.venv\Lib\site-packages\rapidocr\models\ch_ppocr_mobile_v2.0_cls_infer.onnx
[INFO] 2026-01-19 17:39:18,843 [RapidOCR] main.py:53: Using C:\learning_notes\RAGG\simple-ra

Result type: <class 'docling.datamodel.document.ConversionResult'>

Conversion Status: ConversionStatus.SUCCESS
Document Name: 2408.09869v5.pdf
Number of Pages: 9


In [None]:
# Access the converted document
doc = result.document

# Display document structure information
print(f"Document Type: {type(doc).__name__}")
print(f"Number of Tables: {len(doc.tables) if hasattr(doc, 'tables') else 0}")
print(f"Number of Pictures: {len(doc.pictures) if hasattr(doc, 'pictures') else 0}")

Document Type: DoclingDocument
Number of Tables: 3
Number of Pictures: 5


In [21]:
print(f"doc: {doc}")

doc: schema_name='DoclingDocument' version='1.9.0' name='2408.09869v5' origin=DocumentOrigin(mimetype='application/pdf', binary_hash=11465328351749295394, filename='2408.09869v5.pdf', uri=None) furniture=GroupItem(self_ref='#/furniture', parent=None, children=[], content_layer=<ContentLayer.FURNITURE: 'furniture'>, meta=None, name='_root_', label=<GroupLabel.UNSPECIFIED: 'unspecified'>) body=GroupItem(self_ref='#/body', parent=None, children=[RefItem(cref='#/texts/0'), RefItem(cref='#/pictures/0'), RefItem(cref='#/texts/1'), RefItem(cref='#/texts/2'), RefItem(cref='#/texts/3'), RefItem(cref='#/texts/4'), RefItem(cref='#/texts/5'), RefItem(cref='#/texts/6'), RefItem(cref='#/texts/7'), RefItem(cref='#/texts/8'), RefItem(cref='#/texts/9'), RefItem(cref='#/texts/10'), RefItem(cref='#/texts/11'), RefItem(cref='#/texts/12'), RefItem(cref='#/groups/0'), RefItem(cref='#/texts/19'), RefItem(cref='#/texts/20'), RefItem(cref='#/texts/21'), RefItem(cref='#/texts/22'), RefItem(cref='#/texts/23'), R

In [None]:
# Display tables
if doc.tables:
    for i, table in enumerate(doc.tables):
        print(f"\n--- Table {i+1} ---")
        # Pass the parent document: calling export_to_markdown() without it is deprecated
        print(table.export_to_markdown(doc=doc))
else:
    print("No tables found")

# Display pictures
print("\n" + "=" * 50)
print("PICTURES")
print("=" * 50)

if doc.pictures:
    for i, picture in enumerate(doc.pictures):
        print(f"\n--- Picture {i+1} ---")
        # Captions are stored as references, so resolve them through the document
        caption = picture.caption_text(doc)
        if caption:
            print(f"Caption: {caption}")
        # Provenance records where the picture sits in the source (page, bounding box)
        if picture.prov:
            print(f"Provenance: {picture.prov}")
else:
    print("No pictures found")



--- Table 1 ---
| CPU                     | Thread budget   | native backend   | native backend   | native backend   | pypdfium backend   | pypdfium backend   | pypdfium backend   |
|-------------------------|-----------------|------------------|------------------|------------------|--------------------|--------------------|--------------------|
|                         |                 | TTS              | Pages/s          | Mem              | TTS                | Pages/s            | Mem                |
| Apple M3 Max (16 cores) | 4 16            | 177 s 167 s      | 1.27 1.34        | 6.20 GB          | 103 s 92 s         | 2.18 2.45          | 2.56 GB            |
| Intel(R) Xeon E5-2690   | 4 16            | 375 s 244 s      | 0.60 0.92        | 6.16 GB          | 239 s 143 s        | 0.94 1.57          | 2.42 GB            |

--- Table 2 ---
|                                                                                                        | human                        

### 2.2 Export Formats

Docling supports multiple export formats:

| Method | Output | Use Case |
|--------|--------|----------|
| `export_to_markdown()` | Markdown text | LLM input, readable output |
| `export_to_dict()` | Python dict | Programmatic access |
| `save_as_json()` | JSON file | Persistence, API responses |
| `save_as_html()` | HTML file | Web display |
| `export_to_text()` | Plain text | Simple text extraction |
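
The table above maps naturally onto a small dispatch helper that writes one file per export method. The sketch below is illustrative only: `FakeDoc` is a hypothetical stand-in that exposes two of the methods from the table so the block runs without Docling installed, and `EXPORTERS` and `export_all` are names introduced here. With a real `DoclingDocument`, the same helper would call the real methods.

```python
import tempfile
from pathlib import Path


class FakeDoc:
    """Hypothetical stand-in exposing two of the export methods from the table."""

    def export_to_markdown(self):
        return "# Title\n\nBody text."

    def export_to_text(self):
        return "Title\n\nBody text."


# Map an output file suffix to the name of the export method that produces it
EXPORTERS = {".md": "export_to_markdown", ".txt": "export_to_text"}


def export_all(doc, out_dir, stem):
    """Write one file per entry in EXPORTERS and return the written paths."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    written = []
    for suffix, method_name in EXPORTERS.items():
        content = getattr(doc, method_name)()
        path = out_dir / f"{stem}{suffix}"
        # An explicit encoding avoids platform-dependent defaults
        path.write_text(content, encoding="utf-8")
        written.append(path)
    return written


with tempfile.TemporaryDirectory() as tmp:
    names = sorted(p.name for p in export_all(FakeDoc(), tmp, "example"))
    print(names)
```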

In [8]:
# Export to Markdown
markdown_content = doc.export_to_markdown()

# Display first 2000 characters
print("=" * 80)
print("MARKDOWN OUTPUT (first 2000 chars)")
print("=" * 80)
print(markdown_content[:2000])
print("\n... [truncated] ...")

MARKDOWN OUTPUT (first 2000 chars)
<!-- image -->

## Docling Technical Report

## Version 1.0

Christoph Auer Maksym Lysak Ahmed Nassar Michele Dolfi Nikolaos Livathinos Panos Vagenas Cesar Berrospi Ramis Matteo Omenetti Fabian Lindlbauer Kasper Dinkla Lokesh Mishra Yusik Kim Shubham Gupta Rafael Teixeira de Lima Valery Weber Lucas Morin Ingmar Meijer Viktor Kuropiatnyk Peter W. J. Staar

AI4K Group, IBM Research R¨ uschlikon, Switzerland

## Abstract

This technical report introduces Docling , an easy to use, self-contained, MITlicensed open-source package for PDF document conversion. It is powered by state-of-the-art specialized AI models for layout analysis (DocLayNet) and table structure recognition (TableFormer), and runs efficiently on commodity hardware in a small resource budget. The code interface allows for easy extensibility and addition of new features and models.

## 1 Introduction

Converting PDF documents back into a machine-processable format has been a major challenge

In [10]:
# Export to JSON (save to file)
json_output_path = OUTPUT_DIR / "docling_paper.json"
doc.save_as_json(json_output_path)
print(f"JSON saved to: {json_output_path}")

# Export to HTML
html_output_path = OUTPUT_DIR / "docling_paper.html"
doc.save_as_html(html_output_path)
print(f"HTML saved to: {html_output_path}")

# Export to Markdown file
md_output_path = OUTPUT_DIR / "docling_paper.md"
with open(md_output_path, "w", encoding="utf-8") as f:
    f.write(markdown_content)
print(f"Markdown saved to: {md_output_path}")

JSON saved to: output/docling_paper.json
HTML saved to: output/docling_paper.html
Markdown saved to: output/docling_paper.md
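
For RAG, exported Markdown is usually split into smaller pieces before embedding. The block below is a minimal, standard-library sketch of heading-based splitting; `split_by_headings` is a hypothetical helper, not a Docling API, and it does not handle `#` lines inside fenced code blocks. The chunkers installed by the `docling-core[chunking]` extra are likely the better choice for real pipelines.

```python
import re


def split_by_headings(markdown):
    """Split Markdown into (heading, body) pairs at ATX heading lines ('# ' to '###### ')."""
    sections = []
    heading, lines = "", []

    def flush():
        # Keep a section only if it has a heading or some non-blank body text
        if heading or any(part.strip() for part in lines):
            sections.append((heading, "\n".join(lines).strip()))

    for line in markdown.splitlines():
        if re.match(r"#{1,6} ", line):
            flush()
            heading, lines = line.lstrip("#").strip(), []
        else:
            lines.append(line)
    flush()
    return sections


sample = (
    "## Abstract\n\nThis report introduces Docling.\n\n"
    "## 1 Introduction\n\nConverting PDF documents is hard."
)
sections = split_by_headings(sample)
for title, body in sections:
    print(f"{title!r}: {len(body)} characters")
```

Keeping the heading next to each body lets it be stored as chunk metadata, which tends to help retrieval when section titles carry meaning.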


In [9]:
# Export to dictionary for programmatic access
doc_dict = doc.export_to_dict()

# Explore the structure
print("Document Dictionary Keys:")
for key in doc_dict.keys():
    print(f"  - {key}")

Document Dictionary Keys:
  - schema_name
  - version
  - name
  - origin
  - furniture
  - body
  - groups
  - texts
  - pictures
  - tables
  - key_value_items
  - form_items
  - pages
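
The `body` entry does not hold content directly; its children point into the other collections with references such as `#/texts/0` (visible in the `print(doc)` output earlier). The sketch below resolves such references on a small hand-written dictionary. The toy data and the `$ref` key name are assumptions made for illustration, so inspect `doc_dict["body"]["children"][0]` to confirm the key used by your version.

```python
def resolve_ref(doc_dict, ref):
    """Follow a reference such as '#/texts/0' into the exported dictionary."""
    node = doc_dict
    for part in ref.lstrip("#/").split("/"):
        # Lists are indexed by position, dictionaries by key
        node = node[int(part)] if isinstance(node, list) else node[part]
    return node


# Toy structure that mimics the layout of the exported dictionary (not real output)
toy = {
    "body": {"children": [{"$ref": "#/texts/0"}, {"$ref": "#/texts/1"}]},
    "texts": [
        {"label": "section_header", "text": "Abstract"},
        {"label": "text", "text": "This report introduces Docling."},
    ],
}

body_texts = [resolve_ref(toy, child["$ref"])["text"] for child in toy["body"]["children"]]
print(body_texts)
```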


### 2.3 ConversionResult Structure

The `ConversionResult` object contains valuable metadata about the conversion process.

In [10]:
# Examine the ConversionResult structure
print("ConversionResult Attributes:")
print(f"  status: {result.status}")
print(f"  input.file: {result.input.file}")
print(f"  input.format: {result.input.format}")
print(f"  input.document_hash: {result.input.document_hash[:16]}...")

# Check for errors
if result.errors:
    print(f"\nErrors ({len(result.errors)}):")
    for error in result.errors:
        print(f"  - {error.component_type}: {error.error_message}")
else:
    print("\nNo errors during conversion!")

ConversionResult Attributes:
  status: ConversionStatus.SUCCESS
  input.file: 2408.09869v5.pdf
  input.format: InputFormat.PDF
  input.document_hash: 82dd470712ce8389...

No errors during conversion!
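
In a pipeline it is often more convenient to turn a non-successful status into an exception than to print it. The following sketch works on any object with `status`, `errors`, and `document` attributes, as shown above; `document_or_raise` is a hypothetical helper, and the `SimpleNamespace` objects are stand-ins so the block runs without Docling. It compares `status.name`, assuming the enum members are named `SUCCESS` and `PARTIAL_SUCCESS`.

```python
from types import SimpleNamespace


def document_or_raise(result, allow_partial=False):
    """Return result.document, or raise if the conversion did not succeed."""
    accepted = {"SUCCESS"}
    if allow_partial:
        accepted.add("PARTIAL_SUCCESS")
    if result.status.name not in accepted:
        details = "; ".join(e.error_message for e in result.errors) or "no details"
        raise RuntimeError(f"Conversion ended with {result.status.name}: {details}")
    return result.document


# Stand-ins that mimic the attributes used above (not real ConversionResult objects)
ok = SimpleNamespace(status=SimpleNamespace(name="SUCCESS"), errors=[], document="doc")
failed = SimpleNamespace(
    status=SimpleNamespace(name="FAILURE"),
    errors=[SimpleNamespace(error_message="page 3 could not be parsed")],
    document=None,
)

print(document_or_raise(ok))
try:
    document_or_raise(failed)
except RuntimeError as exc:
    print(exc)
```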


---

## 3. Supported File Formats

Docling supports a wide variety of input formats, each handled by specialized backends:

| Format | Extensions | Backend | Pipeline |
|--------|-----------|---------|----------|
| PDF | `.pdf` | DoclingParseV4Backend | StandardPdfPipeline |
| Word | `.docx` | MsWordDocumentBackend | SimplePipeline |
| Excel | `.xlsx` | MsExcelDocumentBackend | SimplePipeline |
| PowerPoint | `.pptx` | MsPowerpointDocumentBackend | SimplePipeline |
| HTML | `.html`, `.htm` | HTMLDocumentBackend | SimplePipeline |
| Markdown | `.md` | MarkdownDocumentBackend | SimplePipeline |
| Images | `.png`, `.jpg`, `.tiff` | ImageDocumentBackend | StandardPdfPipeline |
| Audio | `.wav`, `.mp3` | AudioBackend | AsrPipeline |
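
Before converting a folder of mixed files, it can help to group them by the formats in the table so that unsupported files are caught early. This is a standard-library sketch; the mapping copies the extensions listed above, and the labels are plain illustrative strings rather than Docling's `InputFormat` values.

```python
from collections import defaultdict
from pathlib import Path

# Extensions taken from the table above; labels are illustrative strings
EXTENSION_TO_FORMAT = {
    ".pdf": "PDF",
    ".docx": "Word",
    ".xlsx": "Excel",
    ".pptx": "PowerPoint",
    ".html": "HTML",
    ".htm": "HTML",
    ".md": "Markdown",
    ".png": "Image",
    ".jpg": "Image",
    ".tiff": "Image",
    ".wav": "Audio",
    ".mp3": "Audio",
}


def group_by_format(paths):
    """Group file names by format label; unknown extensions go under 'unsupported'."""
    groups = defaultdict(list)
    for path in map(Path, paths):
        # Lower-case the suffix so 'report.PDF' and 'report.pdf' are treated alike
        label = EXTENSION_TO_FORMAT.get(path.suffix.lower(), "unsupported")
        groups[label].append(path.name)
    return dict(groups)


groups = group_by_format(["report.PDF", "notes.md", "slides.pptx", "archive.zip"])
print(groups)
```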

### 3.1 PDF Documents

PDF is the most feature-rich format with support for:
- Layout analysis (headers, paragraphs, lists)
- Table structure extraction
- OCR for scanned pages
- Image/figure extraction
- Reading order determination

In [11]:
# PDF with detailed options
from docling.document_converter import DocumentConverter, PdfFormatOption
from docling.datamodel.base_models import InputFormat
from docling.datamodel.pipeline_options import PdfPipelineOptions

# Configure PDF pipeline with specific options
pdf_options = PdfPipelineOptions(
    do_ocr=False,              # Disable OCR for native PDFs (faster)
    do_table_structure=True,   # Enable table structure extraction
    generate_page_images=True, # Generate page images for HTML export
)

# Create converter with custom options
converter = DocumentConverter(
    format_options={
        InputFormat.PDF: PdfFormatOption(pipeline_options=pdf_options)
    }
)

# Convert the PDF
result = converter.convert(pdf_url)
print(f"Conversion status: {result.status}")

2025-12-06 10:28:46,794 - INFO - detected formats: [<InputFormat.PDF: 'pdf'>]
2025-12-06 10:28:46,799 - INFO - Going to convert document batch...
2025-12-06 10:28:46,800 - INFO - Initializing pipeline for StandardPdfPipeline with options hash c5864bf7476cef264d361ef9410e72bc
2025-12-06 10:28:46,801 - INFO - Accelerator device: 'mps'
2025-12-06 10:28:48,521 - INFO - Accelerator device: 'mps'
2025-12-06 10:28:49,091 - INFO - Processing document 2408.09869v5.pdf
2025-12-06 10:28:55,147 - INFO - Finished converting document 2408.09869v5.pdf in 8.68 sec.


Conversion status: ConversionStatus.SUCCESS


In [12]:
# Access tables from the converted document
doc = result.document

if doc.tables:
    print(f"Found {len(doc.tables)} tables in the document\n")

    # Display the first two tables
    for i, table in enumerate(doc.tables[:2]):
        print(f"Table {i+1}:")
        print("-" * 40)

        # Export to a DataFrame (requires pandas); fall back to Markdown on failure
        try:
            df = table.export_to_dataframe(doc=doc)
            print(df.head())
        except Exception as exc:
            print(f"DataFrame export failed ({exc}); table markdown:")
            print(table.export_to_markdown(doc=doc)[:500])
        print()
else:
    print("No tables found in the document")



Found 3 tables in the document

Table 1:
----------------------------------------
                      CPU. Thread budget. native backend.TTS  \
0  Apple M3 Max (16 cores)           4 16        177 s 167 s   
1    Intel(R) Xeon E5-2690           4 16        375 s 244 s   

  native backend.Pages/s native backend.Mem pypdfium backend.TTS  \
0              1.27 1.34            6.20 GB           103 s 92 s   
1              0.60 0.92            6.16 GB          239 s 143 s   

  pypdfium backend.Pages/s pypdfium backend.Mem  
0                2.18 2.45              2.56 GB  
1                0.94 1.57              2.42 GB  

Table 2:
----------------------------------------
                                                      \
0  Caption Footnote Formula List-item Page-footer...   

                                               human  \
0  84-89 83-91 83-85 87-88 93-94 85-89 69-71 83-8...   

                                      MRCNN R50 R101  \
0  68.4 71.5 70.9 71.8 60.1 63.4 81.2

### 3.2 Microsoft Office Documents

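Docling supports the Office Open XML formats (DOCX, XLSX, PPTX). These go through `SimplePipeline`, as do HTML and Markdown, so the cells below first convert an HTML and a Markdown sample and then one file of each Office format.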

In [13]:
# Convert HTML document (from our sample files)
from docling.document_converter import DocumentConverter

converter = DocumentConverter()

# Convert the sample HTML file
html_path = SAMPLE_DIR / "sample.html"

if html_path.exists():
    result = converter.convert(str(html_path))
    print(f"HTML Conversion Status: {result.status}")
    
    # Display converted content
    html_markdown = result.document.export_to_markdown()
    print("\nConverted HTML to Markdown:")
    print("=" * 60)
    print(html_markdown[:1500])
else:
    print(f"Sample HTML file not found at {html_path}")

2025-12-06 10:31:09,252 - INFO - detected formats: [<InputFormat.HTML: 'html'>]
2025-12-06 10:31:09,255 - INFO - Going to convert document batch...
2025-12-06 10:31:09,256 - INFO - Initializing pipeline for SimplePipeline with options hash 995a146ad601044538e6a923bea22f4e
2025-12-06 10:31:09,256 - INFO - Processing document sample.html
2025-12-06 10:31:09,265 - INFO - Finished converting document sample.html in 0.01 sec.


HTML Conversion Status: ConversionStatus.SUCCESS

Converted HTML to Markdown:
# Introduction to Document Parsing

A comprehensive guide to understanding document parsing techniques

## 1. Overview

Document parsing is the process of analyzing and extracting structured information from various document formats. This includes PDFs, Word documents, HTML pages, and more.

### 1.1 Key Benefits

- Automated data extraction
- Structured content analysis
- Integration with AI/ML pipelines
- Support for multiple formats

## 2. Core Features

Modern document parsers offer a variety of features:

| Feature          | Description                                         | Use Case                         |
|------------------|-----------------------------------------------------|----------------------------------|
| OCR Support      | Optical Character Recognition for scanned documents | Scanned PDFs, Images             |
| Table Extraction | Structured table data extraction                    | Fi

In [17]:
# Convert Markdown document
md_path = SAMPLE_DIR / "sample.md"

if md_path.exists():
    result = converter.convert(str(md_path))
    print(f"Markdown Conversion Status: {result.status}")
    
    # Markdown to Markdown (demonstrates parsing and re-export)
    output_md = result.document.export_to_markdown()
    print("\nParsed and re-exported Markdown:")
    print("=" * 60)
    print(output_md[:1500])
else:
    print(f"Sample Markdown file not found at {md_path}")

2025-12-04 11:30:55,622 - INFO - detected formats: [<InputFormat.MD: 'md'>]
2025-12-04 11:30:55,623 - INFO - Going to convert document batch...
2025-12-04 11:30:55,624 - INFO - Processing document sample.md
2025-12-04 11:30:55,705 - INFO - Finished converting document sample.md in 0.08 sec.


Markdown Conversion Status: ConversionStatus.SUCCESS

Parsed and re-exported Markdown:
# Document Parsing Best Practices

A comprehensive guide to document parsing for RAG systems.

## Table of Contents

1. [Introduction](#introduction)
2. [Supported Formats](#supported-formats)
3. [Parsing Strategies](#parsing-strategies)
4. [Integration Guide](#integration-guide)

## Introduction

Document parsing is a critical component in modern AI applications. It enables the extraction of structured information from unstructured documents, making it possible to:

- Build searchable knowledge bases
- Create training datasets for machine learning
- Enable semantic search and retrieval
- Power question-answering systems

**Note:** The quality of document parsing directly impacts the performance of downstream AI applications.

## Supported Formats

### Primary Formats

`.pdf` `.docx` `.xlsx` `.pptx` `.html` `.md`

| Format   | Extension   | Description   |
|----------|-------------|---------------|
|

In [None]:
# Example: converting a DOCX file
# The same pattern applies to the other Office formats

from docling.datamodel.base_models import InputFormat
from docling.document_converter import DocumentConverter

# Restrict the converter to the Office formats
converter = DocumentConverter(
    allowed_formats=[InputFormat.DOCX, InputFormat.PPTX, InputFormat.XLSX],
)

# Word conversion pattern
print("Word (DOCX) Conversion:")
print("-" * 40)
result = converter.convert("sample_documents/sample.docx")
docx = result.document
docx_markdown = docx.export_to_markdown()

print("Word document conversion pattern demonstrated.")
print("To convert a Word document, use: converter.convert('your_document.docx')")
print(docx_markdown)

2025-12-06 10:33:36,265 - INFO - detected formats: [<InputFormat.DOCX: 'docx'>]
2025-12-06 10:33:36,271 - INFO - Going to convert document batch...
2025-12-06 10:33:36,272 - INFO - Initializing pipeline for SimplePipeline with options hash 995a146ad601044538e6a923bea22f4e
2025-12-06 10:33:36,272 - INFO - Processing document sample.docx
2025-12-06 10:33:36,299 - INFO - Finished converting document sample.docx in 0.03 sec.


Word (DOCX) Conversion:
----------------------------------------
Word document conversion pattern demonstrated.
To convert a Word document, use: converter.convert('your_document.docx')
# Deepfake Technology: A Technical Analysis of GAN-Based Synthetic Media

Technical Report

## Introduction

Deepfake technology represents one of the most significant applications of artificial intelligence. The term "deepfake" combines "deep learning" and "fake," referring to synthetic media where a person's face, body, or voice is digitally altered (Sharma &amp; Kaur, 2022).

**Key Statistics:**

- Projected 8 million deepfakes will be shared in 2025
- Significant increase from 500,000 in 2023 (European Parliament, 2025)
- 49% of companies experienced audio/video deepfakes in 2024

## Technical Foundation: Generative Adversarial Networks

The core technology behind deepfakes is Generative Adversarial Networks (GANs), introduced by Goodfellow et al. in 2014. GANs use two competing neural networks:

- 

In [32]:
from docling.document_converter import DocumentConverter  
from docling.datamodel.base_models import InputFormat  
  
# Initialize converter with office document support  
converter = DocumentConverter(  
    allowed_formats=[InputFormat.DOCX, InputFormat.XLSX, InputFormat.PPTX]  
)  
  
# Convert any office document  
result = converter.convert("sample_documents/sample.xlsx")  
print(result.document.export_to_markdown())

2025-12-04 12:11:46,551 - INFO - detected formats: [<InputFormat.XLSX: 'xlsx'>]
2025-12-04 12:11:46,568 - INFO - Going to convert document batch...
2025-12-04 12:11:46,569 - INFO - Initializing pipeline for SimplePipeline with options hash 995a146ad601044538e6a923bea22f4e
2025-12-04 12:11:46,569 - INFO - Processing document sample.xlsx
2025-12-04 12:11:46,570 - INFO - Processing sheet: plan
2025-12-04 12:11:46,572 - INFO - Processing sheet: black
2025-12-04 12:11:46,582 - INFO - Finished converting document sample.xlsx in 0.03 sec.


| Dates                                                                                                                                                                                                                                                                           | Modules                                                                                                      |
|---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|--------------------------------------------------------------------------------------------------------------|
| =IFERROR(__xludf.DUMMYFUNCTION("ARRAYFORMULA(   TEXT(     FILTER(       SEQUENCE(DATE(2026,4,14)-DATE(2025,11,1)+1,1,DATE(2025,11,1),1),       WEEKDAY(SEQUENCE(DATE(2026,4,14)-DATE(2025,11,1)+1,1,DATE(2025,11,1),1))=1     ),

In [None]:
from docling.document_converter import DocumentConverter  
from docling.datamodel.base_models import InputFormat  

  
# Initialize converter with office document support  
converter = DocumentConverter(  
    allowed_formats=[InputFormat.DOCX, InputFormat.XLSX, InputFormat.PPTX]  
)  
print("\nPowerPoint (PPTX) Conversion:")
print("-" * 40)
# Convert any office document 
# Each slide becomes a section in the document
result = converter.convert("sample_documents/dl.pptx")  
print(result.document.export_to_markdown())

2025-12-06 10:40:46,466 - INFO - detected formats: [<InputFormat.PPTX: 'pptx'>]
2025-12-06 10:40:46,494 - INFO - Going to convert document batch...
2025-12-06 10:40:46,498 - INFO - Initializing pipeline for SimplePipeline with options hash 995a146ad601044538e6a923bea22f4e
2025-12-06 10:40:46,501 - INFO - Processing document dl.pptx



PowerPoint (PPTX) Conversion:
----------------------------------------


2025-12-06 10:40:46,766 - INFO - Finished converting document dl.pptx in 0.30 sec.


# Introduction to Deep Learning

1

Date: 12 Nov, 2015

# A Motivational Task: Percepts  Concepts

<!-- image -->

- Create algorithms
- that can understand scenes and describe 	them in natural language
- that can infer semantic concepts to allow 	machines to interact with humans using these 	concepts
- Requires creating a series of abstractions
- Image (Pixel Intensities)  Objects in Image  Object
- Deep learning aims to automatically learn these 	abstractions with little supervision

Interactions  Scene Description

Courtesy: Yoshua Bengio, Learning Deep Architectures for AI

2

# Deep Visual-Semantic Alignments for Generating Image Descriptions (Karpathy, Fei-Fei; CVPR 2015)

"boy is doing backflip

on wakeboard."

<!-- image -->

<!-- image -->

“two young girls are playing with lego toy.”

<!-- image -->

"man in black shirt is playing guitar."

<!-- image -->

"construction worker in orange safety vest is working on road."

3

http://cs.stanford.edu/people/karpathy/deepimages

### 3.3 Image Files with OCR

Images are processed through the same pipeline as PDFs, with OCR enabled to extract text.

In [16]:
# OCR conversion for scanned documents and images
from docling.document_converter import DocumentConverter, ImageFormatOption, PdfFormatOption
from docling.datamodel.base_models import InputFormat
from docling.datamodel.pipeline_options import PdfPipelineOptions

# Images and PDFs share the same pipeline options class
image_pipeline_options = PdfPipelineOptions(
    do_ocr=True,  # Enable OCR to extract text from scanned pages and images
)

# Register the options for both formats: the sample below is a scanned PDF,
# so options registered only under InputFormat.IMAGE would not apply to it
converter = DocumentConverter(
    format_options={
        InputFormat.PDF: PdfFormatOption(pipeline_options=image_pipeline_options),
        InputFormat.IMAGE: ImageFormatOption(pipeline_options=image_pipeline_options),
    }
)

# Conversion pattern (the same call accepts an image path such as a PNG or JPEG)
result = converter.convert("sample_documents/scan.pdf")
text = result.document.export_to_markdown()

print("Image OCR conversion pattern:")
print("-" * 40)
print(text)
print("Supported formats: PNG, JPEG, TIFF, BMP, WEBP")
print("Multi-page TIFF files are automatically handled.")

2025-12-06 10:59:07,457 - INFO - detected formats: [<InputFormat.PDF: 'pdf'>]
2025-12-06 10:59:07,461 - INFO - Going to convert document batch...
2025-12-06 10:59:07,461 - INFO - Initializing pipeline for StandardPdfPipeline with options hash e647edf348883bed75367b22fbe60347
2025-12-06 10:59:07,462 - INFO - Accelerator device: 'mps'
2025-12-06 10:59:10,757 - INFO - Accelerator device: 'mps'
2025-12-06 10:59:12,677 - INFO - Accelerator device: 'mps'
2025-12-06 10:59:13,127 - INFO - Processing document scan.pdf
2025-12-06 10:59:33,940 - INFO - Finished converting document scan.pdf in 26.48 sec.


Image OCR conversion pattern:
----------------------------------------
Cosine

Similarity

<!-- formula-not-decoded -->

<!-- formula-not-decoded -->

<!-- image -->

```
Action Rating Romance Rating Vector Persona 5 3 5,3 Person B 10 6 10,6 Person C 1 5 1,5 Step 1 Dot Product A B 5 10 3 6 68 Step 2 Calc the Magnitudes IAI V73 MY 5.83 11131 N F6 11.6 Step 3 Apply the formula Cosine Similarity 7.6 68.03 1 ~ 2 = 2 N
```

Casel

A

B

Cosine

Person

A

B

are

Visual

1

Value to

<!-- formula-not-decoded -->

<!-- formula-not-decoded -->

A

and

Vetore nearly

Case identical

2

a

AB

Perpendicular

Cosine

Range

I

Person C

0

Cabe 3

Â

opposite

Cosine

1

<!-- formula-not-decoded -->

<!-- formula-not-decoded -->

<!-- formula-not-decoded -->

<!-- formula-not-decoded -->

## Word 2 Vue

word embedding technique

Two

Architectures

<!-- image -->

<!-- formula-not-decoded -->

<!-- image -->

Benefits

1

of

Rare

Con

Slowwe to

Skiquam words

train as

compared to

CBOW

Wor

### 3.4 Audio Files (ASR Pipeline)

Docling can transcribe audio files using Automatic Speech Recognition (ASR).

### Run on a GPU

https://colab.research.google.com/drive/1EemOQ8V5BeGz1v7W2xjD6YUC3eZdJLOU?usp=sharing

In [None]:
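# Audio transcription with the ASR pipeline (requires the 'asr' extra)
from docling.document_converter import DocumentConverter, AudioFormatOption
from docling.datamodel.base_models import InputFormat
from docling.datamodel.pipeline_options import AsrPipelineOptions
from docling.datamodel import asr_model_specs
from docling.pipeline.asr_pipeline import AsrPipeline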
  
# Initialize converter with ASR support  
converter = DocumentConverter(  
    allowed_formats=[InputFormat.AUDIO],  
    format_options={  
        InputFormat.AUDIO: AudioFormatOption(  
            pipeline_cls=AsrPipeline,  
            pipeline_options=AsrPipelineOptions(  
                asr_options=asr_model_specs.WHISPER_TINY  
            )  
        )  
    }  
)  
  
# Convert audio file  
result = converter.convert("sample_documents/sample.mp3")  
print(result.document.export_to_markdown())

In [None]:
# Audio transcription example (requires 'asr' extra)
from docling.document_converter import DocumentConverter, AudioFormatOption
from docling.datamodel.base_models import InputFormat
from docling.pipeline.asr_pipeline import AsrPipeline
from docling.datamodel.pipeline_options import AsrPipelineOptions
from docling.datamodel import asr_model_specs

print("Audio Transcription (ASR) Pattern:")
print("-" * 40)


# Configure ASR pipeline
asr_options = AsrPipelineOptions(
    asr_options=asr_model_specs.WHISPER_TINY,  # or WHISPER_BASE, WHISPER_SMALL
)

converter = DocumentConverter(
    format_options={
        InputFormat.AUDIO: AudioFormatOption(
            pipeline_cls=AsrPipeline,
            pipeline_options=asr_options,
        )
    }
)

result = converter.convert("sample_documents/sample.mp3")  # or .wav
transcript = result.document.export_to_markdown()
print(transcript)
print("\nSupported formats: WAV, MP3")
print("Requires: pip install 'docling[asr]'")

### 3.5 VLM Pipeline

Instead of the standard layout and table models, `VlmPipeline` converts each page with a vision-language model. The cell below uses the GraniteDocling MLX model spec, which targets Apple Silicon.


---

## 4. Pipeline Options & Configuration

Docling provides extensive configuration options for customizing the document processing pipeline.

### 4.1 OCR Configuration

Multiple OCR engines are available, each with different strengths:

| Engine | Best For | Installation |
|--------|----------|-------------|
| RapidOCR | General use (default) | Included |
| EasyOCR | Multi-language | `pip install 'docling[easyocr]'` |
| Tesseract | Production | System install + `pip install 'docling[tesserocr]'` |
| OcrMac | macOS native | `pip install 'docling[ocrmac]'` |
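
The engines do not share a language-code convention: EasyOCR takes two-letter codes such as `en`, while Tesseract takes three-letter codes such as `eng`. Below is a small sketch of a translation helper, assuming a hand-maintained mapping. The `to_tesseract_langs` function and the `TESSERACT_CODES` table are illustrative and not part of Docling.

```python
# Illustrative mapping from two-letter codes to Tesseract language codes.
# Extend it for the languages you actually need.
TESSERACT_CODES = {"en": "eng", "fr": "fra", "de": "deu"}


def to_tesseract_langs(langs):
    """Translate two-letter codes to Tesseract codes, failing loudly on gaps."""
    missing = [code for code in langs if code not in TESSERACT_CODES]
    if missing:
        raise ValueError(f"No Tesseract code known for: {missing}")
    return [TESSERACT_CODES[code] for code in langs]
```

Raising on unknown codes is a deliberate choice here: a silently dropped language would only surface later as poor OCR quality.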

In [4]:
# OCR Configuration Examples
from docling.document_converter import DocumentConverter, PdfFormatOption
from docling.datamodel.base_models import InputFormat
from docling.datamodel.pipeline_options import (
    PdfPipelineOptions,
    EasyOcrOptions,
    RapidOcrOptions,
    TesseractOcrOptions,
)

# Option 1: RapidOCR (default, fast)
rapid_ocr_options = PdfPipelineOptions(
    do_ocr=True,
    ocr_options=RapidOcrOptions(),
)

# Option 2: EasyOCR (multi-language support)
easy_ocr_options = PdfPipelineOptions(
    do_ocr=True,
    ocr_options=EasyOcrOptions(
        lang=["en", "fr", "de"],  # English, French, German
        use_gpu=True,  # Use GPU if available
    ),
)

# Option 3: Tesseract (production-ready)
tesseract_options = PdfPipelineOptions(
    do_ocr=True,
    ocr_options=TesseractOcrOptions(
        lang=["eng", "fra"],  # Tesseract language codes
    ),
)

print("OCR configurations created successfully!")
print("\nAvailable OCR options:")
print("  - RapidOcrOptions: Fast, general-purpose")
print("  - EasyOcrOptions: Multi-language, GPU support")
print("  - TesseractOcrOptions: Production, requires system Tesseract")
print("  - OcrMacOptions: macOS Vision framework (macOS only)")

OCR configurations created successfully!

Available OCR options:
  - RapidOcrOptions: Fast, general-purpose
  - EasyOcrOptions: Multi-language, GPU support
  - TesseractOcrOptions: Production, requires system Tesseract
  - OcrMacOptions: macOS Vision framework (macOS only)


In [20]:
# Using EasyOCR with custom language support
# This example shows how to set up OCR for scanned documents

from docling.document_converter import DocumentConverter, PdfFormatOption
from docling.datamodel.base_models import InputFormat
from docling.datamodel.pipeline_options import PdfPipelineOptions, EasyOcrOptions
from docling.datamodel.accelerator_options import AcceleratorOptions, AcceleratorDevice

# Configure EasyOCR with accelerator options
pipeline_options = PdfPipelineOptions(
    do_ocr=True,
    do_table_structure=True,
    ocr_options=EasyOcrOptions(
        lang=["en"],
    ),
    accelerator_options=AcceleratorOptions(
        device=AcceleratorDevice.AUTO,  # AUTO, CPU, CUDA, or MPS
        num_threads=4,
    ),
)

converter = DocumentConverter(
    format_options={
        InputFormat.PDF: PdfFormatOption(pipeline_options=pipeline_options)
    }
)

# Conversion pattern:
result = converter.convert("sample_documents/scan.pdf")
text = result.document.export_to_markdown()
print("Converter configured with EasyOCR and accelerator options.")
print(f"Accelerator device: {AcceleratorDevice.AUTO}")

2025-12-06 11:45:31,653 - INFO - detected formats: [<InputFormat.PDF: 'pdf'>]
2025-12-06 11:45:31,657 - INFO - Going to convert document batch...
2025-12-06 11:45:31,658 - INFO - Initializing pipeline for StandardPdfPipeline with options hash 154138fdf3f99bf9804fd62accd0504f
2025-12-06 11:45:31,658 - INFO - Accelerator device: 'mps'
2025-12-06 11:45:32,833 - INFO - Download complete.
2025-12-06 11:45:36,638 - INFO - Accelerator device: 'mps'
2025-12-06 11:45:38,205 - INFO - Accelerator device: 'mps'
2025-12-06 11:45:38,715 - INFO - Processing document scan.pdf
2025-12-06 11:45:59,345 - INFO - Finished converting document scan.pdf in 27.69 sec.


Converter configured with EasyOCR and accelerator options.
Accelerator device: AcceleratorDevice.AUTO


In [21]:
text

'Cosine\n\nSimilarity\n\n<!-- formula-not-decoded -->\n\n<!-- formula-not-decoded -->\n\n<!-- image -->\n\n```\nAction Rating Romance Rating Vector Persona 5 3 5,3 Person B 10 6 10,6 Person C 1 5 1,5 Step 1 Dot Product A B 5 10 3 6 68 Step 2 Calc the Magnitudes IAI V73 MY 5.83 11131 N F6 11.6 Step 3 Apply the formula Cosine Similarity 7.6 68.03 1 ~ = = 2 N\n```\n\nCasel\n\nA\n\nB\n\nCosine\n\nPerson\n\nA\n\nB\n\nare\n\nVisual\n\n1\n\nValue to\n\n<!-- formula-not-decoded -->\n\n<!-- formula-not-decoded -->\n\nA\n\nand\n\nVetore nearly\n\nCase identical\n\n2\n\na\n\nAB\n\nPerpendicular\n\nCosine\n\nRange\n\nI\n\nPerson C\n\n0\n\nCabe 3\n\nopposite\n\nCosine\n\n1\n\n<!-- formula-not-decoded -->\n\n<!-- formula-not-decoded -->\n\n<!-- formula-not-decoded -->\n\n<!-- formula-not-decoded -->\n\n## Word 2 Vue\n\nword embedding technique\n\nTwo\n\nArchitectures\n\n<!-- image -->\n\n<!-- formula-not-decoded -->\n\n<!-- image -->\n\nBenefits\n\n1\n\nof\n\nRare\n\nCon\n\nSlowwe to\n\nSkiquam word

### 4.2 Table Structure Options

Configure table extraction with TableFormer model settings.

In [23]:
# Table structure configuration
from docling.document_converter import DocumentConverter, PdfFormatOption
from docling.datamodel.base_models import InputFormat
from docling.datamodel.pipeline_options import (
    PdfPipelineOptions,
    TableStructureOptions,
    TableFormerMode,
)

# Configure table extraction
table_options = TableStructureOptions(
    do_cell_matching=True,  # Match cells with text content
    mode=TableFormerMode.ACCURATE,  # ACCURATE or FAST
)

pipeline_options = PdfPipelineOptions(
    do_table_structure=True,
    table_structure_options=table_options,
)

converter = DocumentConverter(
    format_options={
        InputFormat.PDF: PdfFormatOption(pipeline_options=pipeline_options)
    }
)


# Conversion pattern:
pdf_url = "https://arxiv.org/pdf/2408.09869v1"
result = converter.convert(pdf_url)
text = result.document.export_to_markdown()
print("Table extraction configured:")
print(f"  - Cell matching: {table_options.do_cell_matching}")
print(f"  - Mode: {table_options.mode}")

2025-12-06 11:56:30,572 - INFO - detected formats: [<InputFormat.PDF: 'pdf'>]
2025-12-06 11:56:30,580 - INFO - Going to convert document batch...
2025-12-06 11:56:30,581 - INFO - Initializing pipeline for StandardPdfPipeline with options hash e647edf348883bed75367b22fbe60347
2025-12-06 11:56:30,582 - INFO - Accelerator device: 'mps'
2025-12-06 11:56:34,601 - INFO - Accelerator device: 'mps'
2025-12-06 11:56:36,252 - INFO - Accelerator device: 'mps'
2025-12-06 11:56:36,748 - INFO - Processing document 2408.09869v1.pdf
2025-12-06 11:56:51,934 - INFO - Finished converting document 2408.09869v1.pdf in 26.11 sec.


Table extraction configured:
  - Cell matching: True
  - Mode: TableFormerMode.ACCURATE


In [24]:
text

'<!-- image -->\n\n## Docling Technical Report\n\nVersion 1.0\n\nChristoph Auer Maksym Lysak Ahmed Nassar Michele Dolfi Nikolaos Livathinos Panos Vagenas Cesar Berrospi Ramis Matteo Omenetti Fabian Lindlbauer Kasper Dinkla Valery Weber Lucas Morin Ingmar Meijer Viktor Kuropiatnyk Peter W. J. Staar\n\nAI4K Group, IBM Research R¨ uschlikon, Switzerland\n\n## Abstract\n\nThis technical report introduces Docling , an easy to use, self-contained, MITlicensed open-source package for PDF document conversion. It is powered by state-of-the-art specialized AI models for layout analysis (DocLayNet) and table structure recognition (TableFormer), and runs efficiently on commodity hardware in a small resource budget. The code interface allows for easy extensibility and addition of new features and models.\n\n## 1 Introduction\n\nConverting PDF documents back into a machine-processable format has been a major challenge for decades due to their huge variability in formats, weak standardization and pri

### 4.4 VLM Pipeline (Vision-Language Models)

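For complex documents, a Vision-Language Model (VLM) can convert each page end to end instead of relying on the standard pipeline's separate layout, OCR, and table models. The trade-off is speed: in the runs recorded in this notebook, the VLM pipeline took about 250 seconds on the Docling technical report, versus roughly 25 seconds for the standard pipeline.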

**Available VLM Models:**
- `GRANITEDOCLING_TRANSFORMERS` - IBM GraniteDocling with Transformers
- `GRANITEDOCLING_MLX` - GraniteDocling optimized for Apple Silicon
- `SMOLDOCLING_TRANSFORMERS` - Smaller, faster model
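
Choosing between the MLX and Transformers variants can be automated from the host platform. The sketch below returns only the constant's name as a string, so it runs without Docling installed. The `pick_vlm_spec_name` helper is hypothetical, and the rule it encodes (MLX on macOS with an arm64 CPU, Transformers everywhere else) is an assumption about a sensible default rather than anything Docling prescribes.

```python
import platform


def pick_vlm_spec_name(system=None, machine=None):
    """Return the name of a vlm_model_specs constant suited to this machine."""
    system = system or platform.system()
    machine = machine or platform.machine()
    # Assumption: MLX is only useful on Apple Silicon (macOS on arm64).
    if system == "Darwin" and machine == "arm64":
        return "GRANITEDOCLING_MLX"
    return "GRANITEDOCLING_TRANSFORMERS"
```

The returned name can be resolved with `getattr(vlm_model_specs, pick_vlm_spec_name())` and passed as `vlm_options`.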

In [None]:
# VLM Pipeline Configuration
from docling.document_converter import DocumentConverter, PdfFormatOption
from docling.datamodel.base_models import InputFormat
from docling.pipeline.vlm_pipeline import VlmPipeline
from docling.datamodel import vlm_model_specs
from docling.datamodel.pipeline_options import VlmPipelineOptions

print("VLM Pipeline Configuration:")
print("=" * 60)

# Option 1: GraniteDocling with Transformers (cross-platform)
print("\n1. GraniteDocling with Transformers (GPU/CPU):")
print("-" * 40)
print("""pipeline_options = VlmPipelineOptions(
    vlm_options=vlm_model_specs.GRANITEDOCLING_TRANSFORMERS,
)

converter = DocumentConverter(
    format_options={
        InputFormat.PDF: PdfFormatOption(
            pipeline_cls=VlmPipeline,
            pipeline_options=pipeline_options,
        ),
    }
)""")

# Option 2: GraniteDocling MLX (Apple Silicon optimized)
print("\n2. GraniteDocling MLX (Apple Silicon M1/M2/M3/M4):")
print("-" * 40)
print("""pipeline_options = VlmPipelineOptions(
    vlm_options=vlm_model_specs.GRANITEDOCLING_MLX,
)

converter = DocumentConverter(
    format_options={
        InputFormat.PDF: PdfFormatOption(
            pipeline_cls=VlmPipeline,
            pipeline_options=pipeline_options,
        ),
    }
)""")

In [9]:
# Option 2: GraniteDocling MLX (Apple Silicon optimized)

from docling.document_converter import DocumentConverter, PdfFormatOption
from docling.datamodel.base_models import InputFormat
from docling.pipeline.vlm_pipeline import VlmPipeline
from docling.datamodel import vlm_model_specs
from docling.datamodel.pipeline_options import VlmPipelineOptions

print("\n2. GraniteDocling MLX (Apple Silicon M1/M2/M3/M4):")
print("-" * 40)
pipeline_options = VlmPipelineOptions(
    vlm_options=vlm_model_specs.GRANITEDOCLING_MLX,
)

converter = DocumentConverter(
    format_options={
        InputFormat.PDF: PdfFormatOption(
            pipeline_cls=VlmPipeline,
            pipeline_options=pipeline_options,
        ),
    }
)
# Convert with VLM
pdf_url = "https://arxiv.org/pdf/2408.09869"
result = converter.convert(pdf_url)
vlm_markdown = result.document.export_to_markdown()
print(vlm_markdown[:2000])


2. GraniteDocling MLX (Apple Silicon M1/M2/M3/M4):
----------------------------------------


2025-12-04 13:15:23,664 - INFO - detected formats: [<InputFormat.PDF: 'pdf'>]
2025-12-04 13:15:23,716 - INFO - Going to convert document batch...
2025-12-04 13:15:23,717 - INFO - Initializing pipeline for VlmPipeline with options hash e58bc69b0ac7e3ef286a71274276cf73
2025-12-04 13:15:24,048 - INFO - Loading plugin 'docling_defaults'
2025-12-04 13:15:24,048 - INFO - Registered picture descriptions: ['vlm', 'api']
2025-12-04 13:15:54,456 - INFO - Processing document 2408.09869v5.pdf
2025-12-04 13:19:32,775 - INFO - Finished converting document 2408.09869v5.pdf in 249.58 sec.


<!-- image -->

## Docling Technical Report

Version 1.0

Christoph Auer Maksym Lysak Ahmed Nassar Michele Dolfi Nikolaos Livathinos Panos Vagenas Cesar Berrospi Ramis Matteo Omenetti Fabian Lindlbauer Kasper Dinkla Lokesh Mishra Yusik Kim Shubham Gupta Rafael Teixeira de Lima Valery Weber Lucas Morin Ingmar Meijer Viktor Kuropiatnyk Peter W. J. Staar

AI4K Group, IBM Research Rüschlikon, Switzerland

## Abstract

This technical report introduces Docling , an easy to use, self-contained, MITlicensed open-source package for PDF document conversion. It is powered by state-of-the-art specialized AI models for layout analysis (DocLayNet) and table structure recognition (TableFormer), and runs efficiently on commodity hardware in a small resource budget. The code interface allows for easy extensibility and addition of new features and models.

## 1 Introduction

Converting PDF documents back into a machine-processable format has been a major challenge for decades due to their huge variabili

In [None]:
# VLM Pipeline - Live Example (requires significant GPU/memory)
# Uncomment to run if you have sufficient resources

"""from docling.document_converter import DocumentConverter, PdfFormatOption
from docling.datamodel.base_models import InputFormat
from docling.pipeline.vlm_pipeline import VlmPipeline
from docling.datamodel import vlm_model_specs
from docling.datamodel.pipeline_options import VlmPipelineOptions

# Configure VLM pipeline
pipeline_options = VlmPipelineOptions(
    vlm_options=vlm_model_specs.GRANITEDOCLING_TRANSFORMERS,
)

vlm_converter = DocumentConverter(
    format_options={
        InputFormat.PDF: PdfFormatOption(
            pipeline_cls=VlmPipeline,
            pipeline_options=pipeline_options,
        ),
    }
)

# Convert with VLM
result = vlm_converter.convert(pdf_url)
vlm_markdown = result.document.export_to_markdown()
print(vlm_markdown[:2000])
"""

print("VLM example is commented out to avoid resource issues.")
print("Uncomment and run if you have GPU/sufficient memory.")

---

## 6. LangChain Integration

Docling integrates with LangChain through the `langchain-docling` package.

### 6.1 DoclingLoader

The `DoclingLoader` provides a LangChain-compatible document loader.
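
With `ExportType.MARKDOWN` the loader returns one LangChain document per source file, so any splitting is left to you. With `ExportType.DOC_CHUNKS` the loader chunks the document itself. To illustrate what the Markdown route implies, here is a rough stdlib sketch of heading-based splitting. The `split_by_heading` helper is hypothetical, and LangChain's own Markdown splitters are likely a better fit in practice.

```python
def split_by_heading(markdown, prefix="## "):
    """Split Markdown text into sections that each start at a heading line."""
    sections, current = [], []
    for line in markdown.splitlines():
        # A new heading closes the section collected so far.
        if line.startswith(prefix) and current:
            sections.append("\n".join(current).strip())
            current = []
        current.append(line)
    if current:
        sections.append("\n".join(current).strip())
    # Drop sections that turned out to be empty after stripping.
    return [s for s in sections if s]
```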

In [2]:
# DoclingLoader Basic Usage
from langchain_docling import DoclingLoader
from langchain_docling.loader import ExportType

pdf_url = "https://arxiv.org/pdf/2408.09869"

# Create loader with MARKDOWN export (one LangChain document per source file)
loader = DoclingLoader(
    file_path=pdf_url,
    export_type=ExportType.MARKDOWN,  # Use ExportType.DOC_CHUNKS for chunked output
)

print("Loading documents with DoclingLoader...")
docs = loader.load()

print(f"\nLoaded {len(docs)} document chunks")
print("\nFirst document chunk:")
print("=" * 60)
print(f"Content: {docs[0].page_content[:500]}...")
print(f"\nMetadata: {docs[0].metadata}")



Loading documents with DoclingLoader...


The plugin langchain_docling will not be loaded because Docling is being executed with allow_external_plugins=false.
The plugin langchain_docling will not be loaded because Docling is being executed with allow_external_plugins=false.
[INFO] 2026-01-23 16:50:23,681 [RapidOCR] base.py:22: Using engine_name: onnxruntime
[INFO] 2026-01-23 16:50:23,744 [RapidOCR] download_file.py:60: File exists and is valid: C:\learning_notes\RAGG\simple-rag-langchain-1\Data Parsers\document-parsers-rag\.venv\Lib\site-packages\rapidocr\models\ch_PP-OCRv4_det_infer.onnx
[INFO] 2026-01-23 16:50:23,747 [RapidOCR] main.py:53: Using C:\learning_notes\RAGG\simple-rag-langchain-1\Data Parsers\document-parsers-rag\.venv\Lib\site-packages\rapidocr\models\ch_PP-OCRv4_det_infer.onnx
[INFO] 2026-01-23 16:50:24,097 [RapidOCR] base.py:22: Using engine_name: onnxruntime
[INFO] 2026-01-23 16:50:24,123 [RapidOCR] download_file.py:60: File exists and is valid: C:\learning_notes\RAGG\


Loaded 1 document chunks

First document chunk:
Content: ## Docling Technical Report

## Version 1.0

Christoph Auer Maksym Lysak Ahmed Nassar Michele Dolfi Nikolaos Livathinos Panos Vagenas Cesar Berrospi Ramis Matteo Omenetti Fabian Lindlbauer Kasper Dinkla Lokesh Mishra Yusik Kim Shubham Gupta Rafael Teixeira de Lima Valery Weber Lucas Morin Ingmar Meijer Viktor Kuropiatnyk Peter W. J. Staar

AI4K Group, IBM Research R¨ uschlikon, Switzerland

## Abstract

This technical report introduces Docling , an easy to use, self-contained, MITlicensed open-s...

Metadata: {'source': 'https://arxiv.org/pdf/2408.09869'}


In [1]:
# DoclingLoader with MARKDOWN export
from langchain_docling import DoclingLoader
from langchain_docling.loader import ExportType

pdf_url = "https://arxiv.org/pdf/2408.09869"

loader_md = DoclingLoader(
    file_path=pdf_url,
    export_type=ExportType.MARKDOWN,  # Returns full document as Markdown
)

docs_md = loader_md.load()

print(f"Loaded {len(docs_md)} document(s) as Markdown")
print(f"\nDocument length: {len(docs_md[0].page_content)} characters")
print("\nFirst 500 characters:")
print(docs_md[0].page_content[:500])

In [27]:
# DoclingLoader with custom converter
from langchain_docling import DoclingLoader
from langchain_docling.loader import ExportType
from docling.document_converter import DocumentConverter, PdfFormatOption
from docling.datamodel.base_models import InputFormat
from docling.datamodel.pipeline_options import PdfPipelineOptions

# Create custom converter with specific options
custom_pipeline = PdfPipelineOptions(
    do_ocr=False,
    do_table_structure=True,
)

custom_converter = DocumentConverter(
    format_options={
        InputFormat.PDF: PdfFormatOption(pipeline_options=custom_pipeline)
    }
)

# Use custom converter with DoclingLoader
loader_custom = DoclingLoader(
    file_path=pdf_url,
    converter=custom_converter,  # Pass custom converter
    export_type=ExportType.DOC_CHUNKS,
)

docs_custom = loader_custom.load()
print(f"Loaded {len(docs_custom)} chunks with custom converter")

2025-12-06 12:12:03,791 - INFO - detected formats: [<InputFormat.PDF: 'pdf'>]
2025-12-06 12:12:03,797 - INFO - Going to convert document batch...
2025-12-06 12:12:03,798 - INFO - Initializing pipeline for StandardPdfPipeline with options hash 60c8066c482b9239b869b997da3fb1da
2025-12-06 12:12:03,798 - INFO - Accelerator device: 'mps'
2025-12-06 12:12:05,287 - INFO - Accelerator device: 'mps'
2025-12-06 12:12:05,832 - INFO - Processing document 2408.09869v5.pdf
2025-12-06 12:12:12,155 - INFO - Finished converting document 2408.09869v5.pdf in 8.61 sec.


Loaded 40 chunks with custom converter


### 6.2 RAG Pipeline with LangChain

Build a complete RAG pipeline using Docling, LangChain, and Chroma.
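
The retrieval step at the heart of this pipeline ranks stored chunk embeddings by similarity to the query embedding and keeps the top `k`. The stdlib sketch below shows that idea with cosine similarity on toy vectors. It is conceptual only: Chroma uses its own index and distance metric, and the `cosine` and `top_k` helpers are illustrative names.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(query_vec, chunk_vecs, k=5):
    """Return indices of the k chunks most similar to the query."""
    ranked = sorted(
        range(len(chunk_vecs)),
        key=lambda i: cosine(query_vec, chunk_vecs[i]),
        reverse=True,
    )
    return ranked[:k]
```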

In [28]:
# Complete RAG Pipeline Setup
import os
from langchain_docling import DoclingLoader
from langchain_docling.loader import ExportType
from langchain_openai import OpenAIEmbeddings, ChatOpenAI
from langchain_chroma import Chroma
from langchain_classic.chains import create_retrieval_chain
from langchain_community.vectorstores.utils import filter_complex_metadata

# Check for OpenAI API key
if not os.getenv("OPENAI_API_KEY"):
    print("Warning: OPENAI_API_KEY not set. RAG example will not work.")
    print("Set your API key: os.environ['OPENAI_API_KEY'] = 'your-key'")
else:
    print("OpenAI API key found. Proceeding with RAG setup...")

OpenAI API key found. Proceeding with RAG setup...


In [29]:
# Step 1: Load and chunk documents

pdf_url = "https://arxiv.org/pdf/2408.09869"

if os.getenv("OPENAI_API_KEY"):
    print("Step 1: Loading documents...")
    
    loader = DoclingLoader(
        file_path=pdf_url,
        export_type=ExportType.DOC_CHUNKS,
    )
    
    documents = loader.load()
    print(f"Loaded {len(documents)} document chunks")

Step 1: Loading documents...


2025-12-06 12:14:05,621 - INFO - detected formats: [<InputFormat.PDF: 'pdf'>]
2025-12-06 12:14:05,626 - INFO - Going to convert document batch...
2025-12-06 12:14:05,627 - INFO - Initializing pipeline for StandardPdfPipeline with options hash e647edf348883bed75367b22fbe60347
2025-12-06 12:14:05,628 - INFO - Accelerator device: 'mps'
2025-12-06 12:14:09,487 - INFO - Accelerator device: 'mps'
2025-12-06 12:14:11,039 - INFO - Accelerator device: 'mps'
2025-12-06 12:14:11,546 - INFO - Processing document 2408.09869v5.pdf
2025-12-06 12:14:28,580 - INFO - Finished converting document 2408.09869v5.pdf in 23.24 sec.


Loaded 41 document chunks


In [30]:
# Step 2: Create embeddings and vector store
if os.getenv("OPENAI_API_KEY"):
    print("Step 2: Creating embeddings and vector store...")
    
    # Initialize embeddings
    embeddings = OpenAIEmbeddings(
        model="text-embedding-3-small"
    )
    # Filter complex metadata from documents
    filtered_documents = filter_complex_metadata(documents)
    
    # Create Chroma vector store
    vectorstore = Chroma.from_documents(
        documents=filtered_documents,
        embedding=embeddings,
        persist_directory="./chroma_db",  # Persist to disk
        collection_name="docling_demo",
    )
    
    print(f"Vector store created with {len(filtered_documents)} documents")
    print("Persisted to: ./chroma_db")

Step 2: Creating embeddings and vector store...


2025-12-06 12:15:45,710 - INFO - Anonymized telemetry enabled. See                     https://docs.trychroma.com/telemetry for more information.
2025-12-06 12:15:47,632 - INFO - HTTP Request: POST https://api.openai.com/v1/embeddings "HTTP/1.1 200 OK"


Vector store created with 41 documents
Persisted to: ./chroma_db
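
The metadata filtering in Step 2 matters because vector stores such as Chroma expect flat metadata with scalar values, while chunk metadata can carry nested structures. The sketch below approximates the filtering idea on a plain dictionary. It is a stdlib stand-in rather than LangChain's implementation, which works on `Document` objects, and the `drop_complex_metadata` name and the sample keys in the usage note are hypothetical.

```python
# Value types assumed to be safe to store as vector-store metadata.
ALLOWED_TYPES = (str, bool, int, float)


def drop_complex_metadata(metadata):
    """Keep only metadata entries whose values are simple scalars."""
    return {k: v for k, v in metadata.items() if isinstance(v, ALLOWED_TYPES)}
```

For example, `drop_complex_metadata({"source": "a.pdf", "nested": {"x": 1}})` keeps only the `source` entry.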


In [31]:
from langchain_classic.chains.combine_documents import create_stuff_documents_chain
from langchain_core.prompts import ChatPromptTemplate

# Step 3: Create RAG chain
if os.getenv("OPENAI_API_KEY"):
    print("Step 3: Creating RAG chain...")
    
    # Initialize LLM
    llm = ChatOpenAI(
        model="gpt-4o-mini",
        temperature=0,
    )

    # Create prompt template
    prompt = ChatPromptTemplate.from_messages([
        ("system", "Answer the question based only on the following context:\n\n{context}"),
        ("human", "{input}"),
    ])
    
    # Create retriever
    retriever = vectorstore.as_retriever(
        search_type="similarity",
        search_kwargs={"k": 5},  # Return top 5 relevant chunks
    )
    
    # Create QA chain
    question_answer_chain = create_stuff_documents_chain(llm, prompt)
    qa_chain = create_retrieval_chain(retriever, question_answer_chain)
    
    print("RAG chain created successfully!")

Step 3: Creating RAG chain...
RAG chain created successfully!
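
`create_stuff_documents_chain` is named for its strategy: it "stuffs" every retrieved chunk into the `{context}` slot of the prompt. Here is a simplified stdlib sketch of that idea, assuming the chunks are plain strings joined by blank lines. The `stuff_context` helper is illustrative and is not LangChain's implementation.

```python
def stuff_context(chunks, question, separator="\n\n"):
    """Build chat messages by joining all chunk texts into one context block."""
    context = separator.join(chunks)
    system = f"Answer the question based only on the following context:\n\n{context}"
    return [("system", system), ("human", question)]
```

Because every chunk lands in a single prompt, the retriever's `k` directly controls prompt size and therefore cost.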


In [32]:
response = qa_chain.invoke({"input": "What is this document about?"})
response

2025-12-06 12:16:59,201 - INFO - HTTP Request: POST https://api.openai.com/v1/embeddings "HTTP/1.1 200 OK"
2025-12-06 12:17:02,734 - INFO - HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"


{'input': 'What is this document about?',
 'context': [Document(id='c3670689-c337-484c-8c27-c8528b6926f4', metadata={'source': 'https://arxiv.org/pdf/2408.09869'}, page_content='References\n- [3] C. Auer, M. Dolfi, A. Carvalho, C. B. Ramis, and P. W. Staar. Delivering document conversion as a cloud service with high throughput and responsiveness. In 2022 IEEE 15th International Conference on Cloud Computing (CLOUD) , pages 363-373. IEEE, 2022.\n- [4] J. Berkenbilt. Qpdf: A content-preserving pdf document transformer, 2024. URL https: //github.com/qpdf/qpdf .\n- [5] O. R. developers. Onnx runtime. https://onnxruntime.ai/ , 2024. Version: 1.18.1.\n- [6] IBM. Data Prep Kit: a community project to democratize and accelerate unstructured data preparation for LLM app developers, 2024. URL https://github.com/IBM/ data-prep-kit .\n- [7] A. S. Inc. PyMuPDF, 2024. URL https://github.com/pymupdf/PyMuPDF .'),
  Document(id='b0c19398-9e1a-49b7-8bcc-cbe90d0d95a9', metadata={'source': 'https://arxiv.

In [35]:
# Step 4: Query the RAG system
if os.getenv("OPENAI_API_KEY"):
    print("Step 4: Querying the RAG system...")
    print("=" * 60)
    
    # Example questions about Docling
    questions = [
        "What is Docling and what are its main features?",
        "What file formats does Docling support?",
        "How does Docling handle table extraction?",
    ]
    
    for question in questions:
        print(f"\nQ: {question}")
        print("-" * 40)
        
        response = qa_chain.invoke({"input": question})
        
        # The chain returns a dict with 'input', 'context' (retrieved chunks), and 'answer'
        print("=" * 60)
        print(response['answer'])

Step 4: Querying the RAG system...

Q: What is Docling and what are its main features?
----------------------------------------


2025-12-06 12:31:15,874 - INFO - HTTP Request: POST https://api.openai.com/v1/embeddings "HTTP/1.1 200 OK"
2025-12-06 12:31:24,504 - INFO - HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"


Docling is an easy-to-use, self-contained, MIT-licensed open-source package for PDF document conversion. It is powered by advanced AI models for layout analysis (DocLayNet) and table structure recognition (TableFormer), allowing it to efficiently convert documents while maintaining high-quality, richly structured output. 

Main features of Docling include:

1. **High-Quality Document Conversion**: It provides detailed and structured document conversion suitable for various applications.
2. **Support for Downstream Applications**: Its output can be used for enterprise document search, passage retrieval, classification, and knowledge extraction pipelines.
3. **Table Structure Recognition**: It has powerful capabilities for recognizing table structures, which aids in automated knowledge-base construction.
4. **Integration with Generative AI**: Docling supports generative AI application patterns, such as retrieval-augmented generation (RAG), through the open-source package quackling, which

2025-12-06 12:31:24,958 - INFO - HTTP Request: POST https://api.openai.com/v1/embeddings "HTTP/1.1 200 OK"
2025-12-06 12:31:26,067 - INFO - HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"


Docling supports converting PDF documents. The output can be retrieved in either JSON or Markdown format.

Q: How does Docling handle table extraction?
----------------------------------------


2025-12-06 12:31:27,469 - INFO - HTTP Request: POST https://api.openai.com/v1/embeddings "HTTP/1.1 200 OK"
2025-12-06 12:31:34,385 - INFO - HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"


Docling handles table extraction using the TableFormer model, which is a vision-transformer model designed for table structure recovery. The process involves the following steps:

1. **Layout Analysis**: The Docling pipeline first performs layout analysis to detect all table objects within the document.

2. **Image Cropping**: For each detected table, an image crop of the table along with the included text cells is provided to the TableFormer model.

3. **Structure Prediction**: The TableFormer model predicts the logical row and column structure of the table, identifying which cells belong to column headers, row headers, or the table body. It can handle various complexities of tables, such as partial or no borderlines, empty cells, cell spans, and inconsistent indentation or alignment.

4. **Post-Processing**: After the structure predictions are made, they are matched back to the original PDF cells in a post-processing step to avoid the need for expensive re-transcription of text in th

---

## 7. Export & Serialization

### 7.1 Export Methods

Docling provides multiple export methods for different use cases.

In [None]:
# Comprehensive export examples
converter = DocumentConverter()
result = converter.convert(pdf_url)
doc = result.document

# 1. Export to Markdown
markdown = doc.export_to_markdown()
print(f"Markdown export: {len(markdown)} characters")

# 2. Export to Text (plain text, no formatting)
text = doc.export_to_text()
print(f"Text export: {len(text)} characters")

# 3. Export to Dictionary
doc_dict = doc.export_to_dict()
print(f"Dict export: {len(doc_dict.keys())} top-level keys")

# 4. Save as JSON
json_path = OUTPUT_DIR / "export_demo.json"
doc.save_as_json(json_path)
print(f"JSON saved: {json_path}")

# 5. Save as HTML
html_path = OUTPUT_DIR / "export_demo.html"
doc.save_as_html(html_path)
print(f"HTML saved: {html_path}")

### 7.2 Table Export

Export tables to pandas DataFrames or CSV.

In [None]:
# Table export to DataFrame
import pandas as pd

# Access tables from the document
if hasattr(doc, 'tables') and doc.tables:
    print(f"Found {len(doc.tables)} tables\n")
    
    for i, table in enumerate(doc.tables[:3]):  # First 3 tables
        print(f"Table {i+1}:")
        print("-" * 40)
        
        try:
            # Export to DataFrame
            df = table.export_to_dataframe(doc=doc)
            print(df.head())
            
            # Save to CSV
            csv_path = OUTPUT_DIR / f"table_{i+1}.csv"
            df.to_csv(csv_path, index=False)
            print(f"Saved to: {csv_path}")
        except Exception as e:
            print(f"Error exporting table: {e}")
        
        print()
else:
    print("No tables found in the document")

---

## 8. Advanced Topics

### 8.1 Batch Processing

Process multiple documents efficiently with `convert_all()`.
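
When a batch is large, a per-status tally is often more useful than per-file log lines. The sketch below counts outcomes with `collections.Counter`. The `Status` enum is a stand-in defined here for illustration and is not Docling's `ConversionStatus`, and the `summarize` helper is a hypothetical name.

```python
from collections import Counter
from enum import Enum


class Status(Enum):
    """Stand-in for a conversion status enum, for illustration only."""
    SUCCESS = "success"
    PARTIAL_SUCCESS = "partial_success"
    FAILURE = "failure"


def summarize(statuses):
    """Count how many conversions ended in each status."""
    counts = Counter(statuses)
    # Report every status, including those that never occurred.
    return {status.name: counts.get(status, 0) for status in Status}
```

With real results, the same idea applies to the `status` attribute of each item yielded by `convert_all()`.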

In [None]:
# Batch processing example
from docling.document_converter import DocumentConverter
from docling.datamodel.base_models import ConversionStatus
from pathlib import Path

# Define sources (can be paths, URLs, or streams)
sources = [
    str(SAMPLE_DIR / "sample.html"),
    str(SAMPLE_DIR / "sample.md"),
]

# Filter to existing files only
existing_sources = [s for s in sources if Path(s).exists()]

if existing_sources:
    converter = DocumentConverter()
    
    # Batch convert with error handling
    results = {
        "success": [],
        "partial": [],
        "failed": [],
    }
    
    print(f"Processing {len(existing_sources)} documents...")
    
    for result in converter.convert_all(existing_sources, raises_on_error=False):
        if result.status == ConversionStatus.SUCCESS:
            results["success"].append(result)
            print(f"  SUCCESS: {result.input.file.name}")
        elif result.status == ConversionStatus.PARTIAL_SUCCESS:
            results["partial"].append(result)
            print(f"  PARTIAL: {result.input.file.name}")
        else:
            results["failed"].append(result)
            print(f"  FAILED: {result.input.file.name}")
    
    print(f"\nSummary: {len(results['success'])} success, "
          f"{len(results['partial'])} partial, "
          f"{len(results['failed'])} failed")
else:
    print("No sample files found for batch processing demo.")

### 8.2 Document Enrichment

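Enrichment steps add model-derived metadata to the converted document. The configuration below enables picture classification and picture image generation, and leaves VLM-based picture description disabled because it is resource intensive.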

In [3]:
# Document enrichment configuration
from docling.document_converter import DocumentConverter, PdfFormatOption
from docling.datamodel.base_models import InputFormat
from docling.datamodel.pipeline_options import PdfPipelineOptions

# Enable enrichment features
enrichment_options = PdfPipelineOptions(
    do_table_structure=True,
    do_picture_classification=True,   # Classify pictures (chart, diagram, etc.)
    do_picture_description=False,     # Disable VLM description (resource intensive)
    generate_picture_images=True,     # Save picture images
)

converter = DocumentConverter(
    format_options={
        InputFormat.PDF: PdfFormatOption(pipeline_options=enrichment_options)
    }
)

print("Enrichment features configured:")
print(f"  - Picture classification: {enrichment_options.do_picture_classification}")
print(f"  - Picture description: {enrichment_options.do_picture_description}")
print(f"  - Generate picture images: {enrichment_options.generate_picture_images}")

Enrichment features configured:
  - Picture classification: True
  - Picture description: False
  - Generate picture images: True
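
Once a PDF has been converted with picture classification enabled, the predicted classes can be summarized per document. The sketch below is a rough illustration under stated assumptions: `count_picture_classes` is a hypothetical helper, and it assumes that each picture in `document.pictures` has an `annotations` list whose classification entries expose `predicted_classes`, each with `class_name` and `confidence`. That layout may differ between docling-core releases, so check it against your installed version. The stand-in document and the labels `bar_chart` and `logo` are made up so the example runs without Docling or model downloads.

```python
from collections import Counter
from types import SimpleNamespace


def count_picture_classes(document):
    """Tally the top predicted class of each picture (hypothetical helper)."""
    counts = Counter()
    for picture in document.pictures:
        for annotation in picture.annotations:
            # Assumption: classification annotations carry `predicted_classes`,
            # a list of objects with `class_name` and `confidence`.
            predicted = getattr(annotation, "predicted_classes", None)
            if predicted:
                top = max(predicted, key=lambda p: p.confidence)
                counts[top.class_name] += 1
    return counts


def _fake_picture(*classes):
    """Build a stand-in picture with one classification annotation."""
    predicted = [SimpleNamespace(class_name=n, confidence=c) for n, c in classes]
    return SimpleNamespace(annotations=[SimpleNamespace(predicted_classes=predicted)])


# Stand-in document: two classified pictures and one without annotations.
fake_document = SimpleNamespace(
    pictures=[
        _fake_picture(("bar_chart", 0.9), ("logo", 0.1)),
        _fake_picture(("logo", 0.8)),
        SimpleNamespace(annotations=[]),
    ]
)

picture_counts = count_picture_classes(fake_document)
print(dict(picture_counts))
```

The helper skips annotations that lack `predicted_classes`, so it tolerates pictures that carry other annotation types or none at all.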


### 8.3 Error Handling

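With `raises_on_error=False`, `convert()` returns a result whose `status` and `errors` describe what went wrong instead of raising. The `try`/`except` around the call is kept as a guard against unexpected exceptions.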

In [None]:
# Error handling patterns
from docling.document_converter import DocumentConverter
from docling.datamodel.base_models import ConversionStatus

converter = DocumentConverter()

def safe_convert(source):
    """Safely convert a document with proper error handling."""
    try:
        result = converter.convert(source, raises_on_error=False)
        
        if result.status == ConversionStatus.SUCCESS:
            print(f"Conversion successful: {result.input.file.name}")
            return result.document
        
        elif result.status == ConversionStatus.PARTIAL_SUCCESS:
            print(f"Partial success: {result.input.file.name}")
            print(f"  Errors: {len(result.errors)}")
            for error in result.errors:
                print(f"    - {error.component_type}: {error.error_message}")
            return result.document  # Still usable
        
        else:
            print(f"Conversion failed: {result.input.file.name}")
            for error in result.errors:
                print(f"  - {error.component_type}: {error.error_message}")
            return None
            
    except Exception as e:
        print(f"Unexpected error: {e}")
        return None

# Example usage
doc = safe_convert(pdf_url)
if doc is not None:
    print(f"\nDocument ready with {len(doc.export_to_markdown())} characters")
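
In a larger pipeline it can be more convenient to return a structured value than to print inside the conversion function. The following is one possible sketch, not a Docling API: `ConversionOutcome` and `to_outcome` are hypothetical names, and the code assumes only the attributes already used above (`status`, `errors`, `document`, and `component_type` / `error_message` on each error). The `FakeStatus` enum and the `SimpleNamespace` results are stand-ins so the example runs without Docling.

```python
from dataclasses import dataclass, field
from enum import Enum
from types import SimpleNamespace
from typing import Any, Optional


@dataclass
class ConversionOutcome:
    """Hypothetical container for what a caller needs after a conversion."""

    status: str
    document: Optional[Any] = None
    error_messages: list = field(default_factory=list)

    @property
    def usable(self) -> bool:
        """True when a document is available for downstream steps."""
        return self.document is not None


def to_outcome(result, usable_statuses=("SUCCESS", "PARTIAL_SUCCESS")):
    """Translate a result object into a ConversionOutcome (not a Docling API)."""
    status = result.status.name
    messages = [f"{e.component_type}: {e.error_message}" for e in result.errors]
    # Keep the document only for statuses the caller considers usable.
    document = result.document if status in usable_statuses else None
    return ConversionOutcome(status, document, messages)


# Stand-ins for ConversionStatus and conversion results.
class FakeStatus(Enum):
    PARTIAL_SUCCESS = "partial_success"
    FAILURE = "failure"


partial = SimpleNamespace(
    status=FakeStatus.PARTIAL_SUCCESS,
    document="<document>",
    errors=[
        SimpleNamespace(
            component_type="document_backend", error_message="page 3 skipped"
        )
    ],
)
failed = SimpleNamespace(status=FakeStatus.FAILURE, document=None, errors=[])

partial_outcome = to_outcome(partial)
failed_outcome = to_outcome(failed)
print(partial_outcome.usable, partial_outcome.error_messages)
print(failed_outcome.usable, failed_outcome.error_messages)
```

Returning a value like this leaves the decision about logging, retrying, or skipping to the caller, and keeps partial results distinguishable from clean ones.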

---

## Summary

In this notebook, we covered:

1. **Installation & Setup** - Installing Docling 2.68.0 with its dependencies
2. **Basic Conversion** - Converting documents to Markdown, JSON, HTML
3. **File Formats** - PDF, Office (DOCX, XLSX, PPTX), HTML, Markdown, Images, Audio
4. **Pipeline Options** - OCR engines, table extraction, layout analysis, VLM
5. **Chunking** - HybridChunker and HierarchicalChunker for RAG
6. **LangChain Integration** - DoclingLoader and RAG pipeline
7. **Export Methods** - Multiple output formats and table export
8. **Advanced Topics** - Batch processing, document enrichment, error handling

### Key Takeaways

- **Docling** provides unified document parsing across multiple formats
- **DocumentConverter** is the main entry point for all conversions
- **Pipeline options** allow fine-tuned control over processing
- **Native chunking** is optimized for RAG applications
- **LangChain integration** enables seamless RAG pipeline creation

### Resources

- [Docling Documentation](https://docling-project.github.io/docling/)
- [Docling GitHub](https://github.com/docling-project/docling)
- [LangChain Docling Integration](https://docs.langchain.com/oss/python/integrations/document_loaders/docling)
- [Docling Examples](https://docling-project.github.io/docling/examples/)

In [None]:
# Cleanup (optional)
import shutil

# Uncomment to clean up generated files
# if OUTPUT_DIR.exists():
#     shutil.rmtree(OUTPUT_DIR)
# if Path("./chroma_db").exists():
#     shutil.rmtree("./chroma_db")
# if Path("./chroma_rag_demo").exists():
#     shutil.rmtree("./chroma_rag_demo")

print("Notebook completed successfully!")
print(f"Output files saved to: {OUTPUT_DIR.absolute()}")