# Parsing PDFs on a laptop (or on-premise)

# Reminder: PDF parsing workflow
<div style="background-color:white;text-align: center;">
    <img src="../data/presentation/pdf_parsing_flow.svg" alt="pdf_parsing_flow" style="width:800px;"/>
</div>


## Common stuff

In [None]:
from pathlib import Path

ROOT_PATH = Path(".").absolute().parent
DATA_PATH = ROOT_PATH / "data"
OUTPUT_PATH = DATA_PATH / "outputs"
NOTEBOOKS_PATH = ROOT_PATH / "pdf_parsing"
EXAMPLES_PATH = DATA_PATH / "examples"

In [None]:
import logging
import torch
import transformers
from IPython.display import IFrame, Markdown
from pdf_parsing.logging_utils import set_loggers_if_needed
import docling
import camelot
import marker

logger = logging.getLogger(__name__)
set_loggers_if_needed(
    [
        transformers.__name__,
        logger.name,
        torch.__name__,
        docling.__name__,
        camelot.__name__,
        marker.__name__,
    ]
)

In [None]:
SAMPLE_ARTICLE = "sample_article.pdf"
SAMPLE_INVOICE = "sample_invoice.pdf"
SAMPLE_SCANNED_TABLE = "sample_scanned_table.pdf"

ARTICLE_PATH = EXAMPLES_PATH / SAMPLE_ARTICLE
INVOICE_PATH = EXAMPLES_PATH / SAMPLE_INVOICE
SCANNED_TABLE_PATH = EXAMPLES_PATH / SAMPLE_SCANNED_TABLE

# IFrame paths must be relative to the current HTML page, i.e. this document
REL_EXAMPLES_PATH = Path("..", "data", "examples")

## [Camelot](https://camelot-py.readthedocs.io/en/master/)

In [None]:
import pandas as pd


def camelot_tables(pdf_path: Path) -> list[pd.DataFrame]:
    # str() keeps this working with Camelot releases that only accept string paths
    tables = camelot.read_pdf(str(pdf_path))
    return [t.df for t in tables]

## [MarkerPDF](https://github.com/datalab-to/marker)

Marker works very well with the base config.

Documents are easy to export as Markdown. However, there is no native way to get tables as CSV or DataFrames, so we parse them out of the Markdown output.

In [None]:
import re

from typing import Any
from marker.output import text_from_rendered
from marker.models import create_model_dict
from marker.config.parser import ConfigParser
from marker.converters.pdf import PdfConverter

MD_HEADER_SEP_RE = re.compile(r"^\|[\s\-\|:]+\|$")
MD_TABLE_RE = re.compile(r"(\|.*\|(?:\n\|.*\|)*)")


def marker_markdown(pdf_path: Path, config: dict[str, Any] = None) -> tuple[str, dict]:
    # Copy the config so the caller's dict is not mutated
    config = {**(config or {}), "output_format": "markdown"}
    config_parser = ConfigParser(config)
    renderer = config_parser.get_renderer()
    converter = PdfConverter(
        config=config_parser.generate_config_dict(),
        artifact_dict=create_model_dict(),
        processor_list=config_parser.get_processors(),
        renderer=renderer,
    )
    parsed = converter(str(pdf_path))
    content, _, images = text_from_rendered(parsed)
    return content, images


def md_to_dfs(md_content: str) -> list[pd.DataFrame]:
    dfs = [
        # group(0) is the matched table; Match.string would be the whole document
        _md_table_to_df(md_table.group(0))
        for md_table in MD_TABLE_RE.finditer(md_content)
    ]
    return dfs


def _md_table_to_df(md_table: str) -> pd.DataFrame:
    lines = (line.strip() for line in md_table.strip().split("\n"))
    lines = (line for line in lines if not MD_HEADER_SEP_RE.match(line))
    rows = []
    for line in lines:
        if line.startswith("|") and line.endswith("|"):
            cells = [cell.strip() for cell in line[1:-1].split("|")]
            rows.append(cells)
    df = pd.DataFrame(rows[1:], columns=rows[0])
    return df


def save_marker_markdown(content: str, images: dict, *, path: Path):
    if not images:
        path.write_text(content)
    else:
        # If the Markdown contains images, create a directory and save
        # them inside. They will be referenced in the Markdown
        markdown_dir = path.with_name(path.with_suffix("").name)
        markdown_dir.mkdir(parents=True, exist_ok=True)
        content_path = markdown_dir / path.name
        content_path.write_text(content)
        for im_name, im in images.items():
            im.save(str(markdown_dir / im_name))
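
To make the regex-based table extraction above concrete, here is a minimal, dependency-free sketch run on a made-up Markdown snippet. The two patterns are copied from the cell above; the sample text and the `split_md_tables` helper are assumptions for illustration and are not part of Marker.

```python
import re

# Same patterns as in the cell above
MD_HEADER_SEP_RE = re.compile(r"^\|[\s\-\|:]+\|$")
MD_TABLE_RE = re.compile(r"(\|.*\|(?:\n\|.*\|)*)")


def split_md_tables(md_content: str) -> list[list[list[str]]]:
    """Return one list of rows (header first) per Markdown table found."""
    tables = []
    for match in MD_TABLE_RE.finditer(md_content):
        rows = []
        # group(0) is the matched table only; match.string is the whole input
        for line in match.group(0).split("\n"):
            line = line.strip()
            if MD_HEADER_SEP_RE.match(line):
                continue  # skip the |---|---| separator row
            rows.append([cell.strip() for cell in line[1:-1].split("|")])
        tables.append(rows)
    return tables


# Hypothetical sample, for illustration only
SAMPLE_MD = """Intro text

| name | qty |
|------|-----|
| a    | 1   |

Some prose between tables.

| k | v |
|---|---|
| x | y |
"""

tables = split_md_tables(SAMPLE_MD)
print(len(tables))
print(tables[0])
```

Each table comes back as its own list of rows, which is the shape `pd.DataFrame(rows[1:], columns=rows[0])` expects.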

## [Docling](https://docling-project.github.io/docling/)

With Docling we can get the parsed PDF as a `DoclingDocument` and then easily convert it to Markdown or to DataFrames:

In [None]:
from docling_core.types import DoclingDocument
from docling.datamodel.base_models import InputFormat
from docling.document_converter import DocumentConverter, PdfFormatOption
from docling.datamodel.pipeline_options import PdfPipelineOptions


DOCLING_DEFAULT_OPTS = {
    InputFormat.PDF: PdfFormatOption(
        pipeline_options=PdfPipelineOptions(generate_picture_images=True)
    )
}


def docling_parsing(
    pdf_path: Path, format_options: dict[InputFormat, PdfFormatOption] = None
) -> DoclingDocument:
    if format_options is None:
        format_options = DOCLING_DEFAULT_OPTS
    converter = DocumentConverter(format_options=format_options)
    result = converter.convert(pdf_path)
    return result.document


def docling_doc_to_dfs(doc: DoclingDocument) -> list[pd.DataFrame]:
    return [table.export_to_dataframe() for table in doc.tables]

Docling's base config uses EasyOCR. Using the default config above is equivalent to `docling my_doc.pdf`.

Docling is very configurable, so we can define a few different configs.

### Docling + [tesseract](https://tesseract-ocr.github.io/tessdoc/) as OCR (installation required)

Docling runs with EasyOCR by default. To improve results and speed things up, we can switch to Tesseract.
Using the following config is equivalent to `docling --ocr-engine tesseract --ocr-lang auto my_doc.pdf`:

In [None]:
from docling.datamodel.pipeline_options import TesseractOcrOptions
from docling.datamodel.base_models import InputFormat

DOCLING_TESSERACT_OPTS = {
    InputFormat.PDF: PdfFormatOption(
        pipeline_options=PdfPipelineOptions(
            generate_picture_images=True,
            ocr_options=TesseractOcrOptions(lang=["auto"]),
        ),
    )
}

### Docling + VLMs

We can also use Docling with a small SmolDocling VLM. Using the following config is equivalent to `docling --pipeline vlm --vlm-model smoldocling my_doc.pdf`:

In [None]:
import platform
from docling.datamodel.vlm_model_specs import (
    QWEN25_VL_3B_MLX,
    SMOLDOCLING_MLX,
    SMOLDOCLING_TRANSFORMERS,
)
from docling.datamodel.pipeline_options import VlmPipelineOptions
from docling.pipeline.vlm_pipeline import VlmPipeline

if platform.system() == "Darwin":  # Additional speedups on macOS
    smol_vlm_options = SMOLDOCLING_MLX
else:
    smol_vlm_options = SMOLDOCLING_TRANSFORMERS

DOCLING_SMOL_OPTS = {
    InputFormat.PDF: PdfFormatOption(
        pipeline_cls=VlmPipeline,
        pipeline_options=VlmPipelineOptions(vlm_options=smol_vlm_options),
    )
}

We can also use a larger VLM such as Qwen2.5-VL 3B on macOS. Using the following config is equivalent to `docling --pipeline vlm --vlm-model qwen25_vl_3b_mlx my_doc.pdf`:

In [None]:
# the accelerated/MLX Qwen model is only available on macOS
if platform.system() == "Darwin":
    DOCLING_QWEN_OPTS = {
        InputFormat.PDF: PdfFormatOption(
            pipeline_cls=VlmPipeline,
            pipeline_options=VlmPipelineOptions(vlm_options=QWEN25_VL_3B_MLX),
        )
    }

# Examples

## Example 1: article

In [None]:
IFrame(REL_EXAMPLES_PATH / SAMPLE_ARTICLE, width=1200, height=1200)

The PDF is computer-generated and has no complex layout elements, so we can go for level-0 or level-1 tools.

## Example 1 - level 1: Camelot

In [None]:
ARTICLE_OUTPUT_PATH = OUTPUT_PATH / "article"
ARTICLE_OUTPUT_PATH.mkdir(parents=True, exist_ok=True)

Let's try to extract the table using Camelot first:


In [None]:
article_camelot_tables = camelot_tables(ARTICLE_PATH)

In [None]:
len(article_camelot_tables)

In [None]:
article_camelot_df = article_camelot_tables[0]

In [None]:
article_camelot_df

After some post-processing, we can easily get the proper table:

In [None]:
article_camelot_df = article_camelot_df.replace("\n", "", regex=True)
article_camelot_df.iloc[0, 3] = ""
article_camelot_df.iloc[0] += article_camelot_df.iloc[1]
article_camelot_df = article_camelot_df.set_axis(
    article_camelot_df.iloc[0].tolist(), axis="columns"
)
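# Keep the result so the saved CSV no longer contains the two header rows
article_camelot_df = article_camelot_df.drop([0, 1])
article_camelot_df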

In [None]:
article_camelot_df.to_csv(ARTICLE_OUTPUT_PATH / "camelot_table.csv")
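
The header fix above boils down to concatenating a top header row with its sub-header row, cell by cell. Here is a minimal sketch of that idea on plain Python lists. The data and the `merge_header_rows` name are hypothetical and do not reproduce the article's actual table.

```python
def merge_header_rows(rows: list[list[str]]) -> tuple[list[str], list[list[str]]]:
    """Merge the first two rows into one header and return (header, body)."""
    top, sub, *body = rows
    header = [f"{a} {b}".strip() for a, b in zip(top, sub)]
    return header, body


# Hypothetical extraction result: "Score" spans two sub-columns
raw_rows = [
    ["Model", "Score", ""],
    ["", "dev", "test"],
    ["A", "0.81", "0.79"],
]
# Blank the spanning label so it does not prefix a sub-column,
# similar in spirit to `iloc[0, 3] = ""` above
raw_rows[0][1] = ""
header, body = merge_header_rows(raw_rows)
print(header)  # ['Model', 'dev', 'test']
```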

Camelot performs nicely on this table, but it can only output tables, not the full document. Let's see how level-2 tools perform.

## Example 1 - level 2: Marker and Docling

In [None]:
article_marker_md, article_marker_images = marker_markdown(ARTICLE_PATH)

In [None]:
Markdown(article_marker_md)

In [None]:
article_marker_dfs = md_to_dfs(article_marker_md)

In [None]:
len(article_marker_dfs)

In [None]:
article_marker_dfs[0]

In [None]:
article_marker_df = article_marker_dfs[0]
new_columns = (
    article_marker_df.columns[:3].tolist() + article_marker_df.iloc[0, 3:].tolist()
)
new_columns = [c.replace("<br>", " ") for c in new_columns]
article_marker_df = article_marker_df.set_axis(new_columns, axis="columns")
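# Keep the result so the saved CSV no longer contains the sub-header row
article_marker_df = article_marker_df.drop([0])
article_marker_df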

In [None]:
article_marker_df.to_csv(ARTICLE_OUTPUT_PATH / "marker_table.csv")

In [None]:
save_marker_markdown(
    article_marker_md, article_marker_images, path=ARTICLE_OUTPUT_PATH / "marker.md"
)

This looks great. Let's see what Docling does:

In [None]:
article_docling_doc = docling_parsing(ARTICLE_PATH)

In [None]:
article_docling_md = article_docling_doc.export_to_markdown()
Markdown(article_docling_md)

In [None]:
article_docling_dfs = docling_doc_to_dfs(article_docling_doc)

In [None]:
len(article_docling_dfs)

In [None]:
article_docling_df = article_docling_dfs[0]
article_docling_df

In [None]:
article_docling_df.to_csv(ARTICLE_OUTPUT_PATH / "docling_table.csv")

In [None]:
(ARTICLE_OUTPUT_PATH / "docling.md").write_text(article_docling_md)

### Conclusions
- Camelot, Marker and Docling all provide decent table parsing results
- Docling properly handles subcolumns, and requires no post-processing
- Marker and Docling additionally allow parsing the full document (not only the table)


## Example 2: invoice

In [None]:
IFrame(REL_EXAMPLES_PATH / SAMPLE_INVOICE, width=1200, height=1200)

The invoice is computer-generated. Its layout looks simple, but **the document structure is not trivial and the table is quite implicit: we need at least level-2 libraries.**

In [None]:
INVOICE_OUTPUT_PATH = OUTPUT_PATH / "invoice"
INVOICE_OUTPUT_PATH.mkdir(parents=True, exist_ok=True)

## Example 2: Marker

In [None]:
invoice_marker_md, invoice_marker_md_images = marker_markdown(INVOICE_PATH)

In [None]:
Markdown(invoice_marker_md)

In [None]:
save_marker_markdown(
    invoice_marker_md, invoice_marker_md_images, path=INVOICE_OUTPUT_PATH / "marker.md"
)
md_to_dfs(invoice_marker_md)[0].to_csv(INVOICE_OUTPUT_PATH / "marker_table.csv")

If we look at the actual [markdown output](../data/outputs/invoice/marker/marker.md), it's almost perfect!
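
The link above points inside a `marker/` directory because `save_marker_markdown` nests the Markdown file in a directory named after it whenever images are present. Here is a small sketch of that path arithmetic, using `PurePosixPath` so nothing touches the disk. The `nested_markdown_path` helper name is an assumption for illustration.

```python
from pathlib import PurePosixPath


def nested_markdown_path(path: PurePosixPath) -> PurePosixPath:
    """Mirror the directory layout used when the Markdown has images."""
    markdown_dir = path.with_name(path.with_suffix("").name)
    return markdown_dir / path.name


target = PurePosixPath("data/outputs/invoice/marker.md")
print(nested_markdown_path(target))  # data/outputs/invoice/marker/marker.md
```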

## Example 2: Docling

In [None]:
invoice_docling_doc = docling_parsing(INVOICE_PATH)

In [None]:
from docling_core.types.doc import ImageRefMode

invoice_docling_doc_md = invoice_docling_doc.export_to_markdown(
    image_mode=ImageRefMode.EMBEDDED
)

In [None]:
Markdown(invoice_docling_doc_md)

In [None]:
invoice_docling_doc.save_as_markdown(
    INVOICE_OUTPUT_PATH / "docling.md", image_mode=ImageRefMode.EMBEDDED
)
invoice_docling_doc.tables[0].export_to_dataframe().to_csv(
    INVOICE_OUTPUT_PATH / "docling_table.csv"
)

If we look at the actual [markdown output](../data/outputs/invoice/docling.md), the table is perfectly extracted. The document layout, however, is not as clean as with Marker.

### Conclusions
- both Marker and Docling (base configuration) get the table right
- Marker does a better job at preserving the document content



## Example 3: scanned table

In [None]:
IFrame(REL_EXAMPLES_PATH / SAMPLE_SCANNED_TABLE, width=1200, height=600)

This is a scanned table, so level-3 tools are recommended.

In [None]:
SCANNED_TABLE_OUTPUT_PATH = OUTPUT_PATH / "scanned_table"
SCANNED_TABLE_OUTPUT_PATH.mkdir(parents=True, exist_ok=True)

## Example 3: Docling VLMs


In [None]:
scanned_table_docling_smol_doc = docling_parsing(SCANNED_TABLE_PATH, DOCLING_SMOL_OPTS)

In [None]:
scanned_table_docling_smol_md = scanned_table_docling_smol_doc.export_to_markdown()
Markdown(scanned_table_docling_smol_md)

In [None]:
(SCANNED_TABLE_OUTPUT_PATH / "docling_smol.md").write_text(
    scanned_table_docling_smol_md
)
scanned_table_docling_smol_doc.tables[0].export_to_dataframe().to_csv(
    SCANNED_TABLE_OUTPUT_PATH / "docling_smol_table.csv"
)

Some columns get mixed up. Let's try a larger VLM:

In [None]:
# DOCLING_QWEN_OPTS is only defined on macOS (see the config cell above)
scanned_table_docling_qwen_doc = docling_parsing(SCANNED_TABLE_PATH, DOCLING_QWEN_OPTS)

In [None]:
scanned_table_docling_qwen_md = scanned_table_docling_qwen_doc.export_to_markdown()
Markdown(scanned_table_docling_qwen_md)

In [None]:
(SCANNED_TABLE_OUTPUT_PATH / "docling_qwen.md").write_text(
    scanned_table_docling_qwen_md
)
scanned_table_docling_qwen_doc.tables[0].export_to_dataframe().to_csv(
    SCANNED_TABLE_OUTPUT_PATH / "docling_qwen_table.csv"
)

## Example 3: [OlmOCR](https://olmocr.allenai.org/)

Let's upload our document to [OlmOCR](https://olmocr.allenai.org/). We get the following output:


<table>
<thead>
<tr>
<th>LÉGUMINEUSE</th>
<th>TREMPAGE</th>
<th>CUISSON (à partir de l&#39;ébullition)</th>
<th>VOLUME D&#39;EAU pour 1 volume de légumineuses à ajouter à la cuisson</th>
<th>QUANTITÉ par personne</th>
<th>CUISSON sans trempage</th>
</tr>
</thead>
<tbody><tr>
<td>Haricots azukis</td>
<td>12 h</td>
<td>1 h</td>
<td>2,5</td>
<td>60 g</td>
<td>1 h 30</td>
</tr>
<tr>
<td>Haricots (cocos, noirs, rouges, blancs...)</td>
<td>12 h</td>
<td>1 h</td>
<td>2,5</td>
<td>60 g</td>
<td></td>
</tr>
<tr>
<td>Haricots mungos</td>
<td></td>
<td></td>
<td>2,5</td>
<td>60 g</td>
<td>2 h</td>
</tr>
<tr>
<td>Flageolets</td>
<td></td>
<td></td>
<td>2,5</td>
<td>60 g</td>
<td>1 h 30</td>
</tr>
<tr>
<td>Lentilles vertes</td>
<td>4 h</td>
<td>30 min</td>
<td>2,5</td>
<td>60 g</td>
<td>45 min</td>
</tr>
<tr>
<td>Lentilles corail</td>
<td></td>
<td></td>
<td></td>
<td>60 g</td>
<td>10 à 15 min</td>
</tr>
<tr>
<td>Pois cassés</td>
<td>2 h</td>
<td>30 min</td>
<td>2</td>
<td>100 g (purée) 80 g (soupe)</td>
<td>1 h</td>
</tr>
<tr>
<td>Pois chiches</td>
<td>12 h</td>
<td>1 h</td>
<td>2,5</td>
<td>60 g</td>
<td></td>
</tr>
</tbody></table>
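
To compare this output with the CSV files saved earlier, the HTML table can be flattened into rows. Here is a minimal sketch using only the standard library's `html.parser`. The `TableRowParser` class and the shortened sample are assumptions for illustration, and nested tables or `colspan`/`rowspan` are not handled.

```python
from html.parser import HTMLParser


class TableRowParser(HTMLParser):
    """Collect the text of <th>/<td> cells, one list per <tr>."""

    def __init__(self) -> None:
        super().__init__()  # convert_charrefs=True decodes entities like &#39;
        self.rows = []
        self._cell = None  # text chunks of the cell being read, if any

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.rows.append([])
        elif tag in ("th", "td"):
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("th", "td") and self._cell is not None:
            self.rows[-1].append("".join(self._cell).strip())
            self._cell = None


# Shortened excerpt of the table above
SAMPLE_HTML = """<table>
<thead><tr><th>LÉGUMINEUSE</th><th>CUISSON (à partir de l&#39;ébullition)</th></tr></thead>
<tbody><tr><td>Pois chiches</td><td>1 h</td></tr>
<tr><td>Haricots mungos</td><td></td></tr>
</tbody></table>"""

parser = TableRowParser()
parser.feed(SAMPLE_HTML)
parser.close()
print(parser.rows)
```

Empty cells are kept as empty strings, so shifted or missing values remain visible when the rows are compared with the other tools' output.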


### Conclusions
- Docling-Smol mixes up some columns
- Docling-Qwen gets the table right
- OlmOCR also mixes up some columns

