## Docling Conversion Error Reproduction - Testing Multiple Conversion Settings

### Point to PDF files

Modify the `FILES_SOURCE` variable below to point to a directory containing your PDF files.

In [1]:
from pathlib import Path

# --- --- ---

# Change this line to modify the path
FILES_SOURCE = 'files'

# --- --- ---

FILES_DIR = Path(FILES_SOURCE)

# Only PDFs directly inside FILES_DIR are picked up (subdirectories are skipped)
files = sorted(FILES_DIR.glob('*.pdf'))

print("Files to convert:\n")
for file in files: print(f'- {file}')

Files to convert:

- files/BofA_CoreChecking_en_ADA.pdf


### Install Docling

In [2]:
!pip install -q docling
!pip install -q rapidocr_onnxruntime
!pip install -q ocrmac
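
Several of the OCR engines configured later depend on optional packages, so it can help to confirm what is importable before building any converters. The following is a minimal sketch: the helper name `check_modules` is hypothetical, and the listed module names are assumed to be the import names of the packages used in this notebook (`easyocr` and `tesserocr` are assumptions about the EasyOCR and Tesseract bindings).

```python
import importlib.util

def check_modules(names):
    """Return {module_name: bool} indicating whether each top-level module can be found."""
    return {name: importlib.util.find_spec(name) is not None for name in names}

# Assumed import names; adjust if your environment uses different packages
availability = check_modules(['docling', 'rapidocr_onnxruntime', 'ocrmac', 'easyocr', 'tesserocr'])

for name, found in availability.items():
    print(f"- {name}: {'available' if found else 'missing'}")
```

A missing module only matters for the converters that select the corresponding engine; for example, `ocrmac` is only relevant on macOS.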

### Create Converter Utility

The following imports and code provide a utility function that creates document converters from a Python dictionary of settings, so that several differently configured converters can be built easily.

In [3]:
# Imports

from docling.datamodel.base_models import InputFormat
from docling.datamodel.pipeline_options import (
    AcceleratorDevice,
    AcceleratorOptions,
    EasyOcrOptions,
    TesseractCliOcrOptions,
    TesseractOcrOptions,
    RapidOcrOptions,
    OcrMacOptions,
    PdfPipelineOptions,
    TableFormerMode,
)
from docling.document_converter import DocumentConverter, PdfFormatOption
from docling.backend.docling_parse_v4_backend import DoclingParseV4DocumentBackend
from docling.backend.docling_parse_v2_backend import DoclingParseV2DocumentBackend
from docling.backend.docling_parse_backend import DoclingParseDocumentBackend
from docling.backend.pypdfium2_backend import PyPdfiumDocumentBackend
from docling_core.types.doc import (
    ImageRefMode,
    PictureClassificationData,
)
from docling.pipeline.vlm_pipeline import VlmPipeline

In [4]:
# Structure of function parameters

'''
options {
    vlm: bool (False),
    pdf_backend: 'pypdfium2' | 'dlparse' | 'dlparse_v2' | 'dlparse_v4' (default),
    ocr: bool (True),
    ocr_engine: 'EasyOCR' (default) | 'Tesseract' | 'TesseractCLI' | 'RapidOCR' | 'OCRMac',
    force_ocr: bool (False),
    table_structure: bool (True),
    cell_matching: bool (True),
    table_mode: 'Accurate' (default) | 'Fast',
    images: bool (True),
    picture_classification: bool (False),
    picture_description: bool (False),
    formula_enrichment: bool (False),
    code_enrichment: bool (False)
}
'''

def createConverter(options):
    # Combine provided options and defaults
    
    default_options = {
        'vlm': False,
        'pdf_backend': 'dlparse_v4',
        'ocr': True,
        'ocr_engine': 'EasyOCR',
        'force_ocr': False,
        'table_structure': True,
        'cell_matching': True,
        'table_mode': 'Accurate',
        'images': True,
        'picture_classification': False,
        'picture_description': False,
        'formula_enrichment': False,
        'code_enrichment': False
    }

    newOptions = {**default_options, **options}
    
    # Create pipeline_options object

    if (newOptions['vlm']):
        converter = DocumentConverter(
            format_options={
                InputFormat.PDF: PdfFormatOption(
                    pipeline_cls=VlmPipeline
                )
            }
        )
    else:
        pipeline_options = PdfPipelineOptions()
        pipeline_options.do_ocr = newOptions['ocr']
        
        match newOptions['ocr_engine'].replace(" ", "").lower():
            case 'tesseract':
                pipeline_options.ocr_options = TesseractOcrOptions()
            case 'tesseractcli':
                pipeline_options.ocr_options = TesseractCliOcrOptions()
            case 'rapidocr':
                pipeline_options.ocr_options = RapidOcrOptions()
            case 'ocrmac':
                pipeline_options.ocr_options = OcrMacOptions()
            case _:
                pipeline_options.ocr_options = EasyOcrOptions()
                
        pipeline_options.ocr_options.force_full_page_ocr = newOptions['force_ocr']
        pipeline_options.do_table_structure = newOptions['table_structure']
        pipeline_options.table_structure_options.do_cell_matching = newOptions['cell_matching']
        pipeline_options.table_structure_options.mode = TableFormerMode.ACCURATE if newOptions['table_mode'].lower() == 'accurate' else TableFormerMode.FAST
        pipeline_options.generate_page_images = newOptions['images']
        pipeline_options.do_picture_classification = newOptions['picture_classification']
        pipeline_options.do_picture_description = newOptions['picture_description']
        pipeline_options.do_formula_enrichment = newOptions['formula_enrichment']
        pipeline_options.do_code_enrichment = newOptions['code_enrichment']
        pipeline_options.accelerator_options = AcceleratorOptions(
            num_threads=4, device=AcceleratorDevice.AUTO
        )
    
        match newOptions['pdf_backend']:
            case 'dlparse':
                backend = DoclingParseDocumentBackend
            case 'dlparse_v2':
                backend = DoclingParseV2DocumentBackend
            case 'pypdfium2':
                backend = PyPdfiumDocumentBackend
            case _:
                backend = DoclingParseV4DocumentBackend
    
        # Create converter object
        
        converter = DocumentConverter(
            format_options = {
                InputFormat.PDF: PdfFormatOption(
                    pipeline_options=pipeline_options,
                    backend=backend,
                )
            }
        )

    return converter
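
Because `createConverter` merges the supplied dictionary over the defaults, a misspelled key (for example `'table_mod'`) is silently ignored, and an unrecognized `ocr_engine` or `pdf_backend` falls through to the default branch. When comparing many configurations, that can hide a setup mistake. A minimal sketch of a stricter merge follows; `merge_options` and `ALLOWED_VALUES` are hypothetical names and are not part of Docling or of the function above.

```python
# Accepted values, mirroring the branches handled in createConverter (normalized to lowercase)
ALLOWED_VALUES = {
    'pdf_backend': {'pypdfium2', 'dlparse', 'dlparse_v2', 'dlparse_v4'},
    'ocr_engine': {'easyocr', 'tesseract', 'tesseractcli', 'rapidocr', 'ocrmac'},
    'table_mode': {'accurate', 'fast'},
}

def merge_options(defaults, overrides):
    """Merge overrides into defaults, rejecting unknown keys and unknown string values."""
    unknown = set(overrides) - set(defaults)
    if unknown:
        raise KeyError(f"Unknown option(s): {sorted(unknown)}")

    merged = {**defaults, **overrides}
    for key, allowed in ALLOWED_VALUES.items():
        if key not in merged:
            continue
        normalized = str(merged[key]).replace(' ', '').lower()
        if normalized not in allowed:
            raise ValueError(f"Unsupported value for {key!r}: {merged[key]!r}")
    return merged

# Small demonstration with a subset of the defaults
demo_defaults = {'ocr_engine': 'EasyOCR', 'table_mode': 'Accurate', 'force_ocr': False}
print(merge_options(demo_defaults, {'ocr_engine': 'Rapid OCR', 'force_ocr': True}))
```

Calling `merge_options(demo_defaults, {'table_mod': 'Fast'})` raises a `KeyError` instead of quietly keeping the default table mode.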

### Define converters

Create multiple converters with different settings to determine which settings result in omissions and which don't.

In [5]:
# VLM is not included by default, as it takes much longer to run, depending on your hardware.
# However, it can be added to the converterOptions list to be enabled, as well as any custom conversion settings

'''
{
    'alias': 'VLM',
    'options': 
        {
            'vlm': True
        }
},
'''

converterOptions = [
    {
        'alias': 'Default',
        'options': 
            {}
    },
    {
        'alias': 'Accurate Tables (EasyOCR)',
        'options': 
            {
                'table_mode': 'Accurate'
            },
    },
    {
        'alias': 'Pypdfium',
        'options': 
            {
                'table_mode': 'Accurate',
                'pdf_backend': 'pypdfium2'
            },
    },
    {
        'alias': 'Dlparse',
        'options': 
            {
                'table_mode': 'Accurate',
                'pdf_backend': 'dlparse'
            },
    },
    {
        'alias': 'Dlparse_v2',
        'options': 
            {
                'table_mode': 'Accurate',
                'pdf_backend': 'dlparse_v2'
            },
    },
    {
        'alias': 'Dlparse_v4',
        'options': 
            {
                'table_mode': 'Accurate',
                'pdf_backend': 'dlparse_v4'
            },
    },
    {
        'alias': 'Tesseract',
        'options': 
            {
                'table_mode': 'Accurate',
                'ocr_engine': 'Tesseract'
            },
    },
    {
        'alias': 'TesseractCLI',
        'options': 
            {
                'table_mode': 'Accurate',
                'ocr_engine': 'TesseractCLI'
            },
    },
    {
        'alias': 'RapidOCR',
        'options': 
            {
                'table_mode': 'Accurate',
                'ocr_engine': 'RapidOCR'
            },
    },
    {
        'alias': 'OCRMac',
        'options': 
            {
                'table_mode': 'Accurate',
                'ocr_engine': 'OCRMac'
            },
    },
    {
        'alias': 'Tesseract Force',
        'options': 
            {
                'table_mode': 'Accurate',
                'ocr_engine': 'Tesseract',
                'force_ocr': True
            },
    },
    {
        'alias': 'TesseractCLI Force',
        'options': 
            {
                'table_mode': 'Accurate',
                'ocr_engine': 'TesseractCLI',
                'force_ocr': True
            },
    },
    {
        'alias': 'EasyOCR Force',
        'options': 
            {
                'table_mode': 'Accurate',
                'force_ocr': True
            },
    },
    {
        'alias': 'RapidOCR Force',
        'options': 
            {
                'table_mode': 'Accurate',
                'ocr_engine': 'RapidOCR',
                'force_ocr': True
            },
    },
    {
        'alias': 'OCRMac Force',
        'options': 
            {
                'table_mode': 'Accurate',
                'ocr_engine': 'OCRMac',
                'force_ocr': True
            },
    },
] 

In [6]:
converters = [{"alias": converterOption['alias'], "converter": createConverter(converterOption['options'])} for converterOption in converterOptions]
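
The OCR entries in the list above follow a regular pattern: each engine appears once without and once with `force_ocr`. If more engines or flags are added, the same entries could be generated rather than written by hand. This is only a sketch; `build_ocr_matrix` and `OCR_ENGINES` are hypothetical names, and unlike the hand-written list it sets `ocr_engine` explicitly for EasyOCR instead of relying on the default.

```python
from itertools import product

OCR_ENGINES = ['Tesseract', 'TesseractCLI', 'EasyOCR', 'RapidOCR', 'OCRMac']

def build_ocr_matrix(engines=OCR_ENGINES, table_mode='Accurate'):
    """Return alias/options entries for every engine, without and with forced OCR."""
    entries = []
    for force, engine in product([False, True], engines):
        options = {'table_mode': table_mode, 'ocr_engine': engine}
        if force:
            options['force_ocr'] = True
        alias = f"{engine} Force" if force else engine
        entries.append({'alias': alias, 'options': options})
    return entries

for entry in build_ocr_matrix():
    print(entry['alias'], entry['options'])
```

The resulting list has the same shape as `converterOptions`, so it could be concatenated with the backend-specific entries before the converters are created.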

### Convert files

For demonstration purposes, files are converted directly to Markdown. Each file is converted with every converter.

**If you run into errors with conversion settings that use Tesseract**, check how you are running this notebook. Launch it from a terminal shell (rather than a GUI such as Anaconda Navigator), and make sure that shell has Tesseract installed and the `TESSDATA_PREFIX` environment variable set. See the [Docling documentation on OCR settings](https://docling-project.github.io/docling/installation/) for more installation instructions.
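
One quick way to see what the notebook kernel itself can find is to inspect its environment from Python. The sketch below is a rough diagnostic, not an official Docling check; `tesseract_status` is a hypothetical helper name.

```python
import os
import shutil

def tesseract_status(env=None):
    """Report where the tesseract binary is found and what TESSDATA_PREFIX is set to."""
    env = os.environ if env is None else env
    return {
        'binary': shutil.which('tesseract'),          # None if not on the kernel's PATH
        'tessdata_prefix': env.get('TESSDATA_PREFIX'),  # None if the variable is unset
    }

status = tesseract_status()
print(f"tesseract binary: {status['binary'] or 'not found on PATH'}")
print(f"TESSDATA_PREFIX:  {status['tessdata_prefix'] or 'not set'}")
```

If either value is missing here but present in your terminal, the kernel was probably started from an environment that does not inherit your shell configuration.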

In [7]:
import time

converted = []

print('Beginning conversion\n')

start = time.time()

totalFiles = len(files)
totalConverters = len(converters)

for fileIndex, file in enumerate(files):
    print(f'Beginning conversion for file: {file} ({fileIndex + 1}/{totalFiles} files)\n')

    for converterIndex, converter in enumerate(converters):
        # Convert the current file (not files[0]) so every file in the list is processed
        result = converter['converter'].convert(file).document.export_to_markdown(image_mode=ImageRefMode.EMBEDDED)
        converted.append({ 'alias': converter['alias'], 'document': result})
        # Double-quoted f-string so the nested ['alias'] lookup also parses on Python versions before 3.12
        print(f"Document converted with {converter['alias']} ({round(time.time() - start, 2)} seconds) (Conversion {totalConverters * fileIndex + converterIndex + 1}/{totalFiles * totalConverters} overall)")

    print(f'\nConversion completed for file: {file} ({round(time.time() - start, 2)} seconds)\n')

Beginning conversion

Beginning conversion for file: files/BofA_CoreChecking_en_ADA.pdf (1/1 files)

Document converted with Default (6.13 seconds) (Conversion 1/15 overall)
Document converted with Accurate Tables (EasyOCR) (11.11 seconds) (Conversion 2/15 overall)
Document converted with Pypdfium (15.33 seconds) (Conversion 3/15 overall)
Document converted with Dlparse (19.96 seconds) (Conversion 4/15 overall)
Document converted with Dlparse_v2 (24.28 seconds) (Conversion 5/15 overall)
Document converted with Dlparse_v4 (29.21 seconds) (Conversion 6/15 overall)
Document converted with Tesseract (32.36 seconds) (Conversion 7/15 overall)
Document converted with TesseractCLI (35.75 seconds) (Conversion 8/15 overall)
Document converted with RapidOCR (38.8 seconds) (Conversion 9/15 overall)
Document converted with OCRMac (42.2 seconds) (Conversion 10/15 overall)
Document converted with Tesseract Force (49.05 seconds) (Conversion 11/15 overall)
Document converted with TesseractCLI Force (56.94 seconds) (Conversion 12/15 overall)
Document converted with EasyOCR Force (71.4 seconds) (Conversion 13/15 overall)
Document converted with RapidOCR Force (83.29 seconds) (Conversion 14/15 overall)
Document converted with OCRMac Force (89.58 seconds) (Conversion 15/15 overall)

Conversion completed for file: files/BofA_CoreChecking_en_ADA.pdf (89.58 seconds)
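
Note that the times printed above are cumulative since the start of the run, so the cost of a single converter is the difference between consecutive lines. In addition, one failing engine (for example a missing Tesseract install) would abort the whole loop. A minimal sketch of a loop that records per-conversion time and keeps going after an error is shown below; `run_conversions` and the `convert` callable are hypothetical, and stand-in converters are used so the example runs without Docling.

```python
import time

def run_conversions(files, converters, convert):
    """Run convert(converter, file) for every pair, recording duration and any error."""
    records = []
    for file in files:
        for entry in converters:
            record = {'file': file, 'alias': entry['alias'], 'document': None, 'error': None}
            started = time.perf_counter()
            try:
                record['document'] = convert(entry['converter'], file)
            except Exception as exc:  # keep going so the other converters still run
                record['error'] = f"{type(exc).__name__}: {exc}"
            record['seconds'] = round(time.perf_counter() - started, 2)
            records.append(record)
    return records

# Stand-in for a real conversion call; 'broken' simulates an engine that fails
def fake_convert(converter, file):
    if converter == 'broken':
        raise RuntimeError('engine not installed')
    return f"# {file}"

demo = run_conversions(
    ['a.pdf'],
    [{'alias': 'OK', 'converter': 'ok'}, {'alias': 'Broken', 'converter': 'broken'}],
    fake_convert,
)
for record in demo:
    print(record['alias'], record['seconds'], record['error'] or 'converted')
```

With the objects defined in this notebook, `convert` could wrap the same call used in the loop above, that is `converter.convert(file).document.export_to_markdown(image_mode=ImageRefMode.EMBEDDED)`.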



### Review conversion

Here, I specifically check for a known omission error in the BofA document. In the original source file, the "Overdraft settings and fees" section contains an embedded part of the table below "Option 1" that reads "Overdraft Item Fee", followed by descriptive text and bullets. With every combination of OCR option and PDF backend tested, this section is omitted (note the lack of content between "Option 1:" and "Option 2:").
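
Reading fifteen excerpts by eye is error-prone, so the comparison can also be reduced to a number per converter. The sketch below counts how often a phrase occurs between two markers; `count_in_section` is a hypothetical helper, and the sample text is copied from the Default output in this notebook. Because "Overdraft Item Fee" also appears in one of the retained bullets, a count of 1 is consistent with the embedded block being missing; a more distinctive phrase from the source PDF would make a stronger check.

```python
def count_in_section(text, start_marker, end_marker, phrase):
    """Count case-insensitive occurrences of phrase between two markers.

    Returns None if the start marker is absent; runs to the end of the text
    if the end marker is absent.
    """
    lowered = text.lower()
    start = lowered.find(start_marker.lower())
    if start < 0:
        return None
    end = lowered.find(end_marker.lower(), start + len(start_marker))
    if end < 0:
        end = len(lowered)
    return lowered[start:end].count(phrase.lower())

# Excerpt taken from the Default conversion output
sample = """Option 1: Standard - This setting will be automatically applied to your account.

- · Your checks and scheduled payments may be paid, causing an overdraft.
- · You may be charged an Overdraft Item Fee if you overdraw your account.
- · If we return an item unpaid, we won't charge a fee, but the payee may.

Option 2: Decline All"""

print(count_in_section(sample, 'Option 1', 'Option 2', 'Overdraft Item Fee'))
```

Applied to the notebook's results, the same function could be called on each `doc['document']` in `converted` and printed next to `doc['alias']`.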

In [8]:
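for doc in converted:
    text = doc['document']
    # Locate the start of the "Option 1" section (some engines drop the space)
    sectionStart = text.lower().find('option 1')
    if sectionStart < 0: sectionStart = text.lower().find('option1')
    # The excerpt ends at the first embedded image reference
    sectionEnd = text.lower().find('![image]')

    # Double-quoted f-string so the nested ['alias'] lookup also parses on Python versions before 3.12
    print(f"Converted with {doc['alias']}:\n\n{text[sectionStart : sectionEnd]}\n\n")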

Converted with Default:

Option 1: Standard - This setting will be automatically applied to your account.

- · Your checks and scheduled payments may be paid, causing an overdraft.
- · You may be charged an Overdraft Item Fee if you overdraw your account.
- · If we return an item unpaid, we won't charge a fee, but the payee may.

Option 2: Decline All - You can choose this setting if you would like to have your transactions declined or returned unpaid when you don't have enough money. With this setting you can avoid Overdraft Item Fees.

- · Checks or scheduled payments will be returned unpaid if you don't have enough money in your account.
- · If your account becomes overdrawn for any reason, we won't charge you an Overdraft Item Fee.
- · When we decline or return a transaction, we won't charge a fee, but the payee may.

Keep in mind, regardless of your overdraft setting, if you set up Balance Connect  for overdraft protection,  we'll automatically ® 2 transfer available funds from on