# Data Processing: Document Conversion with Standard Docling

This notebook uses **standard (non-VLM)** [Docling](https://docling-project.github.io/docling/) techniques to convert PDF documents into markdown and the [Docling Document](https://docling-project.github.io/docling/concepts/docling_document/) format, a structured representation of the original document that can be exported as JSON.

The standard pipeline options generally yield good results quickly for most documents. In some cases, however, alternative conversion pipelines can lead to better outcomes. For instance, forcing OCR is effective for scanned documents or images that contain text to be extracted and analyzed. When relevant information is contained in formulas, code, or pictures, code and formula enrichment together with picture description and classification can help. This notebook supports all of these use cases.

## 📦 Installation

Install the [Docling](https://docling-project.github.io/docling/) package into this notebook environment. Run this once per session; it may take a minute. If you restart the kernel or change runtimes, re-run this cell before continuing.

In [None]:
!pip install -qq docling

## 🔧 Configuration

### Set files to convert

Set the list of PDF files to convert. You can mix public web URLs and local file paths; each entry is processed in order. Replace the examples with your own documents as needed.

In [None]:
files = [
    "https://raw.githubusercontent.com/py-pdf/sample-files/refs/heads/main/001-trivial/minimal-document.pdf",
    "https://github.com/docling-project/docling/raw/v2.43.0/tests/data/pdf/2203.01017v2.pdf"
]
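
Because the list can mix URLs and local paths, it can help to catch mistyped local paths before starting a long conversion run. The following is a minimal sketch using only the standard library; the helper names `is_url` and `missing_local_files` and the example entries are hypothetical and not part of Docling.

```python
from pathlib import Path
from urllib.parse import urlparse


def is_url(source: str) -> bool:
    """Return True when the entry looks like an http(s) URL."""
    return urlparse(source).scheme in ("http", "https")


def missing_local_files(sources: list[str]) -> list[str]:
    """Return the local-path entries that do not exist on disk."""
    return [s for s in sources if not is_url(s) and not Path(s).is_file()]


# Hypothetical entries: URLs are skipped, local paths are checked on disk.
example_sources = [
    "https://example.com/report.pdf",
    "definitely-missing-file.pdf",
]
print(missing_local_files(example_sources))
```

Calling `missing_local_files(files)` on your own list before the conversion step gives you the entries to fix; URLs are not fetched or validated by this check.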

### Set output directory

Choose where to save results. This notebook creates the folder if it doesn’t exist and writes one `json` ([Docling Document](https://docling-project.github.io/docling/concepts/docling_document/)) and one `md` file per source file, using the source's base name.

In [None]:
from pathlib import Path

output_dir = Path("document-conversion-standard/output")
output_dir.mkdir(parents=True, exist_ok=True)
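
Since output files are named after each source's base name, two sources that share a base name would write to the same `json` and `md` files. The sketch below shows one way to derive the base name and spot such collisions up front. It is an illustration under assumptions: `source_stem` and `duplicate_stems` are hypothetical helpers, and dropping the URL query string is a choice made here, not something the notebook does.

```python
from collections import Counter
from pathlib import Path, PurePosixPath
from urllib.parse import urlparse


def source_stem(source: str) -> str:
    """Base name without extension; for URLs, only the path part is used."""
    parsed = urlparse(source)
    if parsed.scheme in ("http", "https"):
        return PurePosixPath(parsed.path).stem
    return Path(source).stem


def duplicate_stems(sources: list[str]) -> list[str]:
    """Return base names that occur more than once, sorted alphabetically."""
    counts = Counter(source_stem(s) for s in sources)
    return sorted(stem for stem, n in counts.items() if n > 1)


# Hypothetical sources: the first two would both produce report.json and report.md.
examples = [
    "https://example.com/docs/report.pdf?download=1",
    "local/report.pdf",
    "local/summary.pdf",
]
print(duplicate_stems(examples))
```

If `duplicate_stems(files)` returns anything, rename the sources or choose distinct output names before converting.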

### Configure conversion pipeline

Next we set the configuration options for our conversion pipeline. 

The next cell contains three combinations of pipeline options: the default (standard) options, a variant that forces OCR on the entire document, and another one which enables code, formula, and picture enrichments. Later in the *Conversion* section, you'll set the converter to either `standard_converter`, `ocr_converter`, or `enrichment_converter` depending on which conversion technique you'd like to use.

Note: OCR requires the Tesseract binary to run. Please refer to the Docling [installation](https://docling-project.github.io/docling/installation/) docs if you're not running this notebook from a Workbench image that has it installed already. 

For additional customization and a complete reference of Docling's conversion pipeline configuration, check the [official documentation](https://docling-project.github.io/docling/examples/).

In [None]:
from docling.backend.docling_parse_v4_backend import DoclingParseV4DocumentBackend
from docling.datamodel.accelerator_options import AcceleratorDevice, AcceleratorOptions
from docling.datamodel.base_models import InputFormat
from docling.datamodel.pipeline_options import (
    PdfPipelineOptions,
    TesseractOcrOptions,
)
from docling.document_converter import DocumentConverter, PdfFormatOption

# Standard pipeline options
standard_pipeline_options = PdfPipelineOptions()
standard_pipeline_options.generate_picture_images = True
standard_pipeline_options.do_table_structure = True
standard_pipeline_options.table_structure_options.do_cell_matching = True
standard_converter = DocumentConverter(
    format_options={
        InputFormat.PDF: PdfFormatOption(
            pipeline_options=standard_pipeline_options,
            backend=DoclingParseV4DocumentBackend,
        )
    }
)

# Force OCR on the entire page
# TESSDATA_PREFIX points Tesseract at its language data; adjust the path for your environment
%env TESSDATA_PREFIX=/usr/share/tesseract/tessdata
ocr_pipeline_options = PdfPipelineOptions()
ocr_pipeline_options.generate_picture_images = True
ocr_pipeline_options.do_table_structure = True
ocr_pipeline_options.table_structure_options.do_cell_matching = True
ocr_pipeline_options.do_ocr = True
ocr_pipeline_options.ocr_options = TesseractOcrOptions(force_full_page_ocr=True)
ocr_pipeline_options.accelerator_options = AcceleratorOptions(
    num_threads=4, device=AcceleratorDevice.AUTO
)
ocr_converter = DocumentConverter(
    format_options={
        InputFormat.PDF: PdfFormatOption(
            pipeline_options=ocr_pipeline_options,
            backend=DoclingParseV4DocumentBackend,
        )
    }
)

# Code and formula enrichments and picture description and classification
enrichment_pipeline_options = PdfPipelineOptions()
enrichment_pipeline_options.generate_picture_images = True
enrichment_pipeline_options.do_table_structure = True
enrichment_pipeline_options.table_structure_options.do_cell_matching = True
enrichment_pipeline_options.do_code_enrichment = True
enrichment_pipeline_options.do_formula_enrichment = True
enrichment_pipeline_options.do_picture_description = True
enrichment_pipeline_options.images_scale = 2
enrichment_pipeline_options.do_picture_classification = True
enrichment_pipeline_options.accelerator_options = AcceleratorOptions(
    num_threads=4, device=AcceleratorDevice.AUTO
)
enrichment_converter = DocumentConverter(
    format_options={
        InputFormat.PDF: PdfFormatOption(
            pipeline_options=enrichment_pipeline_options,
            backend=DoclingParseV4DocumentBackend,
        )
    }
)
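
With three converters defined, one option is to pick among them by a short name instead of editing a variable assignment. The following is a minimal sketch of that idea. `select_converter` is a hypothetical helper, and the dictionary uses placeholder strings so the example runs on its own; in the notebook the values would be the `standard_converter`, `ocr_converter`, and `enrichment_converter` objects created above.

```python
def select_converter(name: str, converters: dict):
    """Look up a converter by name, failing with a readable message."""
    try:
        return converters[name]
    except KeyError:
        options = ", ".join(sorted(converters))
        raise ValueError(f"Unknown converter {name!r}; choose one of: {options}") from None


# Placeholder values (assumption): replace the strings with the real converter objects.
available = {
    "standard": "standard_converter",
    "ocr": "ocr_converter",
    "enrichment": "enrichment_converter",
}
chosen = select_converter("ocr", available)
print(chosen)
```

A misspelled name then fails immediately with the list of valid choices, rather than later during conversion.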

## ✨ Conversion

Finally, convert every document into one `json` ([Docling Document](https://docling-project.github.io/docling/concepts/docling_document/)) and one `md` (markdown) file. If you'd like to change the conversion technique, set `converter` to either `standard_converter`, `ocr_converter`, or `enrichment_converter`.

In [None]:
import json
from docling_core.types.doc import ImageRefMode

confidence_reports = {}

# Set the converter to use (standard_converter, ocr_converter, or enrichment_converter)
converter = standard_converter

for file in files:
    # Convert the file
    print(f"Converting {file}...")
    result = converter.convert(file)
    document = result.document
    dictionary = document.export_to_dict()

    file_path = Path(file)

    # Store the conversion confidence report for the next section
    confidence_reports[file] = result.confidence

    # Export the document to JSON
    json_output_path = output_dir / f"{file_path.stem}.json"
    with open(json_output_path, "w", encoding="utf-8") as f:
        json.dump(dictionary, f)
    print(f"Path of JSON output is: {json_output_path.resolve()}")

    # Export the document to Markdown
    md_output_path = output_dir / f"{file_path.stem}.md"
    markdown = document.export_to_markdown(image_mode=ImageRefMode.EMBEDDED)
    with open(md_output_path, "w", encoding="utf-8") as f:
        f.write(markdown)
    print(f"Path of markdown output is: {md_output_path.resolve()}")
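
When converting many documents, a single failing source would otherwise stop the whole loop. The sketch below shows one possible pattern for continuing past failures and reporting them at the end. It is generic by design: `convert_all` and `fake_convert` are hypothetical names, and `fake_convert` merely stands in for a call such as `converter.convert(source)`.

```python
from typing import Callable


def convert_all(sources: list[str], convert_one: Callable[[str], object]):
    """Run convert_one on each source, collecting failures instead of stopping."""
    results, failures = {}, {}
    for source in sources:
        try:
            results[source] = convert_one(source)
        except Exception as exc:  # broad on purpose: record the error and continue
            failures[source] = f"{type(exc).__name__}: {exc}"
    return results, failures


def fake_convert(source: str) -> str:
    """Hypothetical stand-in for a real conversion call."""
    if source.endswith(".txt"):
        raise ValueError("unsupported format")
    return f"converted {source}"


results, failures = convert_all(["a.pdf", "notes.txt"], fake_convert)
print(results)
print(failures)
```

Catching every exception is a deliberate trade-off for batch runs; for a single document, letting the error surface directly is usually easier to debug.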

### Conversion confidence

When converting a document, Docling can calculate how confident it is in the quality of the conversion. This *confidence* is expressed as both a *score* and a *grade*. The score is a numeric value between 0 and 1, and the grade is a label that can be **poor**, **fair**, **good**, or **excellent**. If Docling is unable to calculate a confidence grade, the value will be marked as *unspecified*.

If your document receives a low score (for example, below 0.8) and a grade of *poor* or *fair*, you'll probably benefit from using a different conversion technique. In that case, go back to the *Conversion* section and try selecting a different approach (e.g. forcing OCR) and compare the results.

In [None]:
for file, confidence_report in confidence_reports.items():
    print(f"Conversion confidence for {file}:")

    # \033[1m and \033[0m are ANSI escape codes that turn bold text on and off
    print(f"Average confidence: \033[1m{confidence_report.mean_grade.name}\033[0m (score {confidence_report.mean_score:.3f})")

    # Collect the pages whose mean score is below the document's mean score
    low_score_pages = [
        page
        for page, page_confidence_report in confidence_report.pages.items()
        if page_confidence_report.mean_score < confidence_report.mean_score
    ]

    print(f"Pages that scored lower than average: {', '.join(str(x + 1) for x in low_score_pages) or 'none'}")

    print()
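
The rule of thumb above (a score below 0.8 together with a grade of *poor* or *fair*) can be written as a small check. This is a minimal sketch, not part of Docling: `should_retry` is a hypothetical helper that takes the score as a float and the grade as a plain string, and treating a missing (NaN) score as "no recommendation" is an assumption made here.

```python
import math

LOW_GRADES = {"poor", "fair"}


def should_retry(score: float, grade: str, threshold: float = 0.8) -> bool:
    """True when the score is below the threshold and the grade is poor or fair."""
    if math.isnan(score):
        return False  # assumption: without a score, make no recommendation
    return score < threshold and grade.lower() in LOW_GRADES


print(should_retry(0.55, "POOR"))
print(should_retry(0.95, "EXCELLENT"))
```

In the notebook, the inputs could come from `confidence_report.mean_score` and `confidence_report.mean_grade.name`; the 0.8 threshold is only the example value mentioned above and may need tuning for your documents.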

## 🍩 Additional resources

For additional example notebooks related to Data Processing, check the [Open Data Hub Data Processing](https://github.com/opendatahub-io/odh-data-processing/) repository on GitHub.

### Any Feedback?

We'd love to hear if you have any feedback on this or any other notebook in this series! Please [open an issue](https://github.com/opendatahub-io/odh-data-processing/issues) and help us improve our demos.