# Data Processing: Document Conversion with VLM Docling

This notebook uses [Docling](https://docling-project.github.io/docling/) with a **vision‑language model (VLM)** pipeline to convert PDF documents into Markdown and the [Docling Document](https://docling-project.github.io/docling/concepts/docling_document/) format, a structured representation of the original document that can be exported as JSON.

VLM conversion uses multimodal models to interpret complex layouts, figures, and image‑only pages. It is especially helpful when standard (non‑VLM) or OCR pipelines miss content embedded in charts, diagrams, screenshots, handwritten notes, or dense tables.

You can run VLM in two ways:

- **Remote VLM service**: route processing through a VLM [model service](https://github.com/rh-aiservices-bu/models-aas) API by providing an endpoint URL, model name, and API key.
- **Local VLM**: run a lightweight model locally.

This notebook walks you through configuration and conversion and produces both Markdown and Docling JSON outputs for each input PDF.

## 📦 Installation

Install the [Docling](https://docling-project.github.io/docling/) package into this notebook environment. Run this once per session; it may take a minute. If you restart the kernel or change runtimes, re-run this cell before continuing.

In [None]:
!pip install -qq docling

## ⚠️ Important Notes

**Exception Handling**: This notebook demonstrates the core workflow with minimal error handling for clarity. When using your own data or deploying to production:

- Add try-except blocks around file I/O operations
- Handle network errors for URL-based document loading
- Validate document formats and sizes before processing
- Implement timeouts for long-running operations
- Add proper logging for debugging and monitoring
- Handle cases where documents fail to convert or chunk

Example of adding exception handling:
```python
# `files` and `converter` are defined in later cells of this notebook
for file_path in files:
    try:
        result = converter.convert(file_path)
        document = result.document
    except Exception as e:
        print(f"❌ Failed to convert {file_path}: {e}")
        continue  # Skip to the next document
```
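
The validation item in the list above can be sketched for local files as follows. This is a minimal sketch rather than anything Docling provides: `check_local_pdf` and the 50 MB `MAX_PDF_BYTES` limit are hypothetical, and the extension check is only a cheap first filter because it does not inspect file contents.

```python
from pathlib import Path

MAX_PDF_BYTES = 50 * 1024 * 1024  # Hypothetical 50 MB limit; tune for your setup


def check_local_pdf(path, max_bytes=MAX_PDF_BYTES):
    """Return a list of problems with a local file; an empty list means it looks usable."""
    p = Path(path)
    problems = []
    if p.suffix.lower() != ".pdf":
        problems.append(f"unexpected extension: {p.suffix or '(none)'}")
    if not p.is_file():
        problems.append("file does not exist")
    elif p.stat().st_size > max_bytes:
        problems.append(f"file is larger than {max_bytes} bytes")
    return problems
```

Returning a list of problems instead of raising lets the caller decide whether to skip the file, log a warning, or stop the run.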

## 🔧 Configuration

### Set files to convert

Set the list of PDF files to convert. You can mix public web URLs and local file paths; entries are processed in order. Replace the examples with your own documents as needed.

In [None]:
files = [
    "https://raw.githubusercontent.com/py-pdf/sample-files/refs/heads/main/001-trivial/minimal-document.pdf",
    "https://raw.githubusercontent.com/py-pdf/sample-files/refs/heads/main/003-pdflatex-image/pdflatex-image.pdf"
]

### Set output directory

Choose where to save results. This notebook creates the folder if it doesn’t exist and writes one `json` ([Docling Document](https://docling-project.github.io/docling/concepts/docling_document/)) and one `md` file per source file, using the source’s base name.

In [None]:
from pathlib import Path

output_dir_name = "document-conversion-vlm/output"

output_dir = Path(output_dir_name)
output_dir.mkdir(parents=True, exist_ok=True)
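
The conversion cell later in this notebook derives each output name with `Path(file).stem`, which works for the sample URLs above. For URLs that carry a query string or fragment, a more defensive variant might look like the sketch below. `source_stem` is a hypothetical helper, not part of Docling or of this notebook's cells.

```python
from pathlib import Path, PurePosixPath
from urllib.parse import unquote, urlparse


def source_stem(source):
    """Return the base name (without extension) of a URL or local path."""
    parsed = urlparse(str(source))
    if parsed.scheme in ("http", "https"):
        # Use only the URL path, so query strings and fragments are ignored
        return PurePosixPath(unquote(parsed.path)).stem
    return Path(source).stem


print(source_stem("https://example.com/docs/report.pdf?version=1.2"))  # report
print(source_stem("reports/summary.pdf"))  # summary
```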

### Choose a VLM backend

Select how the VLM will run:

- Set `remote = True` to use a hosted VLM endpoint (configure URL, model name, and API key).
- Set `remote = False` to run the lightweight SmolDocling model locally.

If you’re unsure, start with the local option to test the results. Depending on your hardware and the size and complexity of your documents, VLM conversion can take a significant amount of time.

In [None]:
import os

# Set to True to use a VLM model hosted on a remote server
remote = False

# If remote = True, set the remote model endpoint URL (can be overridden via VLM_SERVICE_ENDPOINT_URL)
remote_model_endpoint_url = os.getenv("VLM_SERVICE_ENDPOINT_URL", "https://path.to.your.vlm.endpoint/v1/chat/completions")

# If remote = True, set the remote model name (can be overridden via VLM_SERVICE_MODEL_NAME)
remote_model_name = os.getenv("VLM_SERVICE_MODEL_NAME", "granite-vision-3-2")

# If remote = True, set the remote model API key (can be overridden via VLM_SERVICE_API_KEY)
remote_model_api_key = os.getenv("VLM_SERVICE_API_KEY", "your.api.key.here")
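
Because the defaults above are placeholders, it can help to check the remote settings before any document is sent. The following is a minimal sketch under that assumption; `remote_config_problems` is a hypothetical helper, and the two placeholder constants simply mirror the default values in the cell above.

```python
from urllib.parse import urlparse

# Placeholder defaults used in the configuration cell above
PLACEHOLDER_HOST = "path.to.your.vlm.endpoint"
PLACEHOLDER_API_KEY = "your.api.key.here"


def remote_config_problems(url, model_name, api_key):
    """Return a list of problems with the remote VLM settings; empty means none found."""
    problems = []
    parsed = urlparse(url)
    if parsed.scheme not in ("http", "https") or not parsed.netloc:
        problems.append("endpoint URL must be an absolute http(s) URL")
    elif parsed.hostname == PLACEHOLDER_HOST:
        problems.append("endpoint URL is still the placeholder")
    if not model_name:
        problems.append("model name is empty")
    if not api_key or api_key == PLACEHOLDER_API_KEY:
        problems.append("API key is missing or still the placeholder")
    return problems
```

When `remote = True`, one option is to call `remote_config_problems(remote_model_endpoint_url, remote_model_name, remote_model_api_key)` and raise a `ValueError` if the returned list is not empty.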

### Configure the VLM conversion pipeline

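Next, create the configuration options for the VLM conversion pipeline. The cell below configures either the remote API backend or the local SmolDocling model, depending on the `remote` flag set above.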

For additional customization and a complete reference of Docling's conversion pipeline configuration, check the [official documentation](https://docling-project.github.io/docling/) and [examples](https://docling-project.github.io/docling/examples/).

In [None]:
from docling.datamodel.accelerator_options import AcceleratorDevice, AcceleratorOptions
from docling.document_converter import DocumentConverter, PdfFormatOption
from docling.datamodel.base_models import InputFormat
from docling.datamodel.pipeline_options import (
    VlmPipelineOptions,
    smoldocling_vlm_conversion_options,
)
from docling.datamodel.pipeline_options_vlm_model import ApiVlmOptions, ResponseFormat

pipeline_options = VlmPipelineOptions()
pipeline_options.generate_picture_images = True
pipeline_options.document_timeout = 600

if remote:
    pipeline_options.enable_remote_services = True
    pipeline_options.vlm_options = ApiVlmOptions(
        url=remote_model_endpoint_url,
        params=dict(
            model_id=remote_model_name,
            parameters=dict(
                max_new_tokens=400,
            ),
        ),
        prompt="Convert the full page to markdown. Do not miss any text.",
        timeout=600,
        response_format=ResponseFormat.MARKDOWN,
        headers={
            "Authorization": f"Bearer {remote_model_api_key}",
        },
    )

else:
    pipeline_options.vlm_options = smoldocling_vlm_conversion_options
    pipeline_options.accelerator_options = AcceleratorOptions(
        num_threads=4, device=AcceleratorDevice.AUTO
    )

### Configure enrichments

Depending on your documents, you may benefit from optional enrichments. These add specialized processing for specific content types and can increase processing time.

- `do_picture_description`: Generates captions for pictures with a vision model.
- `do_picture_classification`: Classifies pictures (e.g., charts, flow diagrams, logos, signatures).

All enrichments are disabled by default; enable the ones you need below. See the [enrichments docs](https://docling-project.github.io/docling/usage/enrichments/) for details.

In [None]:
# Enable or disable picture description and classification
pipeline_options.do_picture_description = False
pipeline_options.do_picture_classification = False

# If you enable enrichments, you may benefit from increasing the image scale (e.g. to 2)
pipeline_options.images_scale = 1
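
If you prefer to toggle enrichments without editing the notebook, the same environment-variable pattern used for the remote settings can be applied here. This is a sketch only: `env_flag` is a hypothetical helper, and the two variable names are made up for illustration and are not read by Docling itself.

```python
import os


def env_flag(name, default=False):
    """Interpret an environment variable as a boolean flag."""
    value = os.getenv(name)
    if value is None:
        return default
    return value.strip().lower() in ("1", "true", "yes", "on")


# Hypothetical variable names; they are not read by Docling itself
do_picture_description = env_flag("DOCLING_DO_PICTURE_DESCRIPTION")
do_picture_classification = env_flag("DOCLING_DO_PICTURE_CLASSIFICATION")
```

The resulting booleans could then be assigned to `pipeline_options.do_picture_description` and `pipeline_options.do_picture_classification` in place of the hard-coded `False` values.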

## ✨ Conversion

Finally, use the configured pipeline options to convert every document into one `json` file ([Docling Document](https://docling-project.github.io/docling/concepts/docling_document/)) and one `md` (Markdown) file, both saved to the output directory configured earlier.

In [None]:
import json
from docling_core.types.doc import ImageRefMode
from docling.pipeline.vlm_pipeline import VlmPipeline

# Create the document converter
converter = DocumentConverter(
    format_options={
        InputFormat.PDF: PdfFormatOption(
            pipeline_options=pipeline_options,
            pipeline_cls=VlmPipeline,
        )
    }
)

if not files:
    raise ValueError("No input files specified. Please set the 'files' list above.")

for file in files:
    # Convert the file
    print(f"Converting {file}...")

    result = converter.convert(file)
    document = result.document
    dictionary = document.export_to_dict()

    file_path = Path(file)

    # Export the document to JSON
    json_output_path = (output_dir / f"{file_path.stem}.json")
    with open(json_output_path, "w", encoding="utf-8") as f:
        json.dump(dictionary, f)
        print(f"✓ Path of JSON output is: {json_output_path.resolve()}")

    # Export the document to Markdown
    md_output_path = output_dir / f"{file_path.stem}.md"
    with open(md_output_path, "w", encoding="utf-8") as f:
        markdown = document.export_to_markdown(image_mode=ImageRefMode.EMBEDDED)
        f.write(markdown)
        print(f"✓ Path of markdown output is: {md_output_path.resolve()}")
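
The loop above stops at the first document that fails to convert. For larger batches, one possible pattern is to record failures and keep going, then report a summary at the end. The sketch below is library-agnostic and makes no claims about Docling's API: `convert_all` and `fake_convert` are hypothetical names, and `fake_convert` is only a stand-in for a real conversion call.

```python
def convert_all(sources, convert_one):
    """Run convert_one on each source; return (results, failures) without stopping early."""
    results = {}
    failures = {}
    for source in sources:
        try:
            results[source] = convert_one(source)
        except Exception as exc:  # Broad on purpose: a failure should skip only this source
            failures[source] = f"{type(exc).__name__}: {exc}"
    return results, failures


# Stand-in converter for illustration; in this notebook you could pass
# lambda source: converter.convert(source).document instead
def fake_convert(source):
    if not source.endswith(".pdf"):
        raise ValueError("not a PDF")
    return f"converted:{source}"


results, failures = convert_all(["a.pdf", "notes.txt"], fake_convert)
print(f"{len(results)} converted, {len(failures)} failed")
for source, reason in failures.items():
    print(f"  {source}: {reason}")
```

Passing the conversion step as a callable keeps the bookkeeping separate from the converter, so the same helper can be exercised with a stub before running a slow VLM pipeline.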

## 🍩 Additional resources

For additional example notebooks related to Data Processing, check the [Open Data Hub Data Processing](https://github.com/opendatahub-io/odh-data-processing/) repository on GitHub.

### Any Feedback?

We'd love to hear if you have any feedback on this or any other notebook in this series! Please [open an issue](https://github.com/opendatahub-io/odh-data-processing/issues) and help us improve our demos.