## Docling converter

The Docling converter works with multiple file formats.
The API is simple:
- Define the converter with per-format options
- Convert the source file (to a Docling document)
- Export to a suitable format (JSON, Markdown)
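
The convert and export steps above can be sketched as one small helper. This is a minimal sketch, not part of Docling itself: `convert_and_export` is a hypothetical name, the `converter` argument is assumed to be an already configured `DocumentConverter`, and `export_to_dict()` is assumed to return a JSON-serializable dictionary.

```python
import json
from typing import Any


def convert_and_export(converter: Any, source: str, fmt: str = "markdown") -> str:
    """Convert `source` and return the result as a Markdown or JSON string.

    Hypothetical helper: `converter` is assumed to behave like a configured
    docling DocumentConverter (step 1 of the list above).
    """
    # Step 2: convert the source file into a Docling document.
    doc = converter.convert(source=source).document

    # Step 3: export to the requested format.
    if fmt == "markdown":
        return doc.export_to_markdown()
    if fmt == "json":
        # Assumption: export_to_dict() yields a JSON-serializable dict.
        return json.dumps(doc.export_to_dict(), indent=2)
    raise ValueError(f"unsupported format: {fmt!r}")
```

With a converter in hand, `convert_and_export(converter, "sample_invoice.jpg")` would return the Markdown text, and passing `fmt="json"` would return the JSON form instead.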

In [1]:
from docling.datamodel import vlm_model_specs
from docling.datamodel.base_models import InputFormat
from docling.datamodel.pipeline_options import VlmPipelineOptions
from docling.document_converter import (
    DocumentConverter,
    PdfFormatOption,
    ImageFormatOption,
)
from docling.pipeline.vlm_pipeline import VlmPipeline

In [2]:
source = "sample_invoice.jpg"

In [None]:
converter = DocumentConverter(
    format_options={
        # VlmPipeline uses IBM Granite Docling 258M by default.
        # This configuration does not yet work for the image format,
        # so images fall back to the default models.
        # InputFormat.IMAGE: ImageFormatOption(pipeline_cls=VlmPipeline),
        InputFormat.PDF: PdfFormatOption(
            pipeline_cls=VlmPipeline,
        ),
    }
)

In [4]:
doc = converter.convert(source=source).document

2025-10-22 21:53:55,376 - INFO - detected formats: [<InputFormat.IMAGE: 'image'>]
2025-10-22 21:53:57,066 - INFO - Going to convert document batch...
2025-10-22 21:53:57,067 - INFO - Initializing pipeline for StandardPdfPipeline with options hash 4f2edc0f7d9bb60b38ebfecf9a2609f5
2025-10-22 21:53:57,075 - INFO - Loading plugin 'docling_defaults'
2025-10-22 21:53:57,077 - INFO - Registered picture descriptions: ['vlm', 'api']
2025-10-22 21:53:57,087 - INFO - Loading plugin 'docling_defaults'
2025-10-22 21:53:57,087 - INFO - Registered ocr engines: ['auto', 'easyocr', 'ocrmac', 'rapidocr', 'tesserocr', 'tesseract']
2025-10-22 21:53:57,202 - INFO - rapidocr cannot be used because onnxruntime is not installed.
2025-10-22 21:53:57,202 - INFO - easyocr cannot be used because it is not installed.
2025-10-22 21:53:57,737 - INFO - Accelerator device: 'cpu'
[INFO] 2025-10-22 21:53:57,767 [RapidOCR] base.py:22: Using engine_name: torch
[INFO] 2025-10-22 21:53:57,803 [RapidOCR] downlo

In [5]:
doc

DoclingDocument(schema_name='DoclingDocument', version='1.7.0', name='sample_invoice', origin=DocumentOrigin(mimetype='application/pdf', binary_hash=1179296745110519636, filename='sample_invoice.jpg', uri=None), furniture=GroupItem(self_ref='#/furniture', parent=None, children=[], content_layer=<ContentLayer.FURNITURE: 'furniture'>, name='_root_', label=<GroupLabel.UNSPECIFIED: 'unspecified'>), body=GroupItem(self_ref='#/body', parent=None, children=[RefItem(cref='#/texts/0'), RefItem(cref='#/texts/1'), RefItem(cref='#/texts/2'), RefItem(cref='#/texts/3'), RefItem(cref='#/texts/4'), RefItem(cref='#/tables/0'), RefItem(cref='#/tables/1'), RefItem(cref='#/texts/5'), RefItem(cref='#/texts/6'), RefItem(cref='#/texts/7'), RefItem(cref='#/texts/8')], content_layer=<ContentLayer.BODY: 'body'>, name='_root_', label=<GroupLabel.UNSPECIFIED: 'unspecified'>), groups=[], texts=[SectionHeaderItem(self_ref='#/texts/0', parent=RefItem(cref='#/body'), children=[], content_layer=<ContentLayer.BODY: 'bo

In [6]:
print(doc.export_to_markdown())

## FactureFA04/2015/085324

Azure Interior

4557DeSilvaSt

FremontCA94538

EtatsUnis

| Description           | Quantite    | Prix Taxes    | Soustotal   |
|-----------------------|-------------|---------------|-------------|
| Bureaupersonnalisable | 31,00Unites | 500,00TVA20%  | 15500.00    |
| Combinaisondebureau   | 66.00Unites | 300.00TVA10%  | 19800.00    |
| Canape troisplaces    | 42,00Unites | 1000.00TVA20% | 42000.00    |
| Poubellea pedale      | 77,00Unites | 10,00TVA20%   | 770,00      |

| Montant HT   | 78070,00   |
|--------------|------------|
| Taxes        | 13634,00   |
| MontantTTC   | 91704,00   |

Date

Echeacnce

N°BC

2015-04-12 2015-05-27BC06159


In [7]:
doc_ = converter.convert(source="JD.pdf").document

2025-10-22 21:54:19,054 - INFO - detected formats: [<InputFormat.PDF: 'pdf'>]
2025-10-22 21:54:19,063 - INFO - Going to convert document batch...
2025-10-22 21:54:19,063 - INFO - Initializing pipeline for VlmPipeline with options hash 14b35a24912cc09d5c7735b8ff9d88c1
2025-10-22 21:54:19,070 - INFO - Accelerator device: 'cpu'
2025-10-22 21:54:21,744 - INFO - Processing document JD.pdf
2025-10-22 21:57:38,605 - INFO - Finished converting document JD.pdf in 199.55 sec.
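
The log above reports the conversion time for `JD.pdf` on CPU. To compare pipelines or devices without reading the logs, the call can be wrapped in a timer. This is a minimal sketch: `timed_convert` is a hypothetical helper, and it only assumes the `converter.convert(source=...).document` call already used in this notebook.

```python
import time
from typing import Any, Tuple


def timed_convert(converter: Any, source: str) -> Tuple[Any, float]:
    """Convert `source` and return (document, elapsed wall-clock seconds).

    Hypothetical helper: `converter` is assumed to behave like the
    DocumentConverter defined earlier in this notebook.
    """
    start = time.perf_counter()
    doc = converter.convert(source=source).document
    elapsed = time.perf_counter() - start
    return doc, elapsed
```

For example, `doc_, seconds = timed_convert(converter, "JD.pdf")` returns the same document as the cell above, plus the measured duration. The measured value includes any model loading that happens on the first call.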


In [8]:
doc_.save_as_html("JD.html")