# OCR with Docling & Pixtral

### Todos

- [ ] Add fake orders to qdrant based on one of the PDFs
- [ ] Make sure the OCR works on the PDFs
- [ ] Deploy a Web-Server + API which can take a PDF and return the results

In [None]:
# !pip install docling pytesseract pdf2image pandas tabulate openai

## Docling

In [1]:
from docling.datamodel.base_models import InputFormat
from docling.document_converter import DocumentConverter, PdfFormatOption
from docling.datamodel.pipeline_options import (
    PdfPipelineOptions,
)

import io
import base64
import pandas as pd
import json

from PIL import Image
from openai import OpenAI

from pdf2image import convert_from_path

In [2]:
source = "data/ab-4.pdf"

In [3]:
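# Enable OCR and table-structure recognition for the PDF pipeline.
pipeline_options = PdfPipelineOptions()
pipeline_options.do_ocr = True
pipeline_options.do_table_structure = True
# Match the predicted table cells back to the text cells found in the PDF.
pipeline_options.table_structure_options.do_cell_matching = True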


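# Optional: force full-page OCR through the Tesseract CLI. This also needs
# `TesseractCliOcrOptions` imported from docling.datamodel.pipeline_options.
# ocr_options = TesseractCliOcrOptions(force_full_page_ocr=True)
# pipeline_options.ocr_options = ocr_options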

In [4]:
converter = DocumentConverter(
    format_options={
        InputFormat.PDF: PdfFormatOption(
            pipeline_options=pipeline_options,
        )
    }
)

In [5]:
result = converter.convert(source)
print(result.document.export_to_markdown())

Firmenname - Musterstraße 50 - 12345 Musterstadt

Mustermann GmbH

Herrn Max Mustermann Musterstraße 12 12345 Musterstadt

## Auftragsbestätigung Nr. XXXXX

Sehr geehrte Damen und Herren,

wir bedanken uns für Ihren Auftrag. Gemäß unserem Angebot XXXXXX vom TT.MM.JJJJ erbringen wir im Einzelnen die folgenden Leistungen:

| Pos.   | Menge    | Bezeichnung                       | Einzelpreis   | Gesamtpreis   |
|--------|----------|-----------------------------------|---------------|---------------|
| 1      | 1 Stk.   | Fernseher 40 Zoll | Musterartikel | 1000,00 EUR   | 1000,00 EUR   |
| 2      | Pauschal | Anfahrt und Aufbau                | 120,00 EUR    | 120,00 EUR    |
|        |          |                                   | Zwischensumme | 1120,00 EUR   |
|        |          |                                   | 19% MwSt.     | 212,80 EUR    |
|        |          |                                   | Gesamtbetrag  | 1332,80 EUR   |

Sie haben noch Fragen? Sie erreichen uns tägli

In [6]:
for table_ix, table in enumerate(result.document.tables):
    table_df: pd.DataFrame = table.export_to_dataframe()
    print(f"## Table {table_ix}")
    print(table_df.to_markdown())
    if table_ix > 1:
        print("\n\n")

## Table 0
|    | Pos.   | Menge    | Bezeichnung                       | Einzelpreis   | Gesamtpreis   |
|---:|:-------|:---------|:----------------------------------|:--------------|:--------------|
|  0 | 1      | 1 Stk.   | Fernseher 40 Zoll | Musterartikel | 1000,00 EUR   | 1000,00 EUR   |
|  1 | 2      | Pauschal | Anfahrt und Aufbau                | 120,00 EUR    | 120,00 EUR    |
|  2 |        |          |                                   | Zwischensumme | 1120,00 EUR   |
|  3 |        |          |                                   | 19% MwSt.     | 212,80 EUR    |
|  4 |        |          |                                   | Gesamtbetrag  | 1332,80 EUR   |
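
The extracted table carries its amounts as German-formatted strings (`1000,00 EUR`), so a quick arithmetic check helps confirm that the OCR read the numbers correctly. The following is a minimal sketch, not part of the original notebook: `parse_eur` is a hypothetical helper, and the amounts are copied by hand from the table output above rather than read from `table_df`.

```python
from decimal import Decimal


def parse_eur(text: str) -> Decimal:
    """Convert a German-formatted amount such as '1.332,80 EUR' to a Decimal."""
    number = text.replace("EUR", "").strip()
    # German notation: '.' groups thousands, ',' marks the decimal point.
    number = number.replace(".", "").replace(",", ".")
    return Decimal(number)


# Amounts copied from the extracted table (assumption: OCR output as printed above).
positions = [parse_eur("1000,00 EUR"), parse_eur("120,00 EUR")]
subtotal = parse_eur("1120,00 EUR")
vat = parse_eur("212,80 EUR")
total = parse_eur("1332,80 EUR")

# The line items must add up to the subtotal, and 19% VAT on top gives the total.
subtotal_ok = sum(positions) == subtotal
vat_ok = (subtotal * Decimal("0.19")).quantize(Decimal("0.01")) == vat
total_ok = subtotal + vat == total

print(subtotal_ok, vat_ok, total_ok)  # True True True
```

`Decimal` is used instead of `float` so that cent amounts compare exactly.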


In [28]:
ordered_products = [
    {"Name": "Fernseher Smart TV QLED 4K", "Price": 1000, "SKU": "123456"},
    {"Name": "Aufbauservice", "Price": 120, "SKU": "123457"},
]

In [None]:
product_strs = []

for product in ordered_products:
    product_string = ""
    for key, value in product.items():
        product_string += f"{key}: {value}\n"

    product_strs.append(product_string)

print(product_strs[0])

In [None]:
client = OpenAI(base_url="http://localhost:1234/v1", api_key="lm-studio")

In [45]:
schema = {
    "type": "json_schema",
    "json_schema": {
        "name": "product",
        "schema": {
            "type": "object",
            "properties": {
                "SKU": {"type": "string"},
                "In_Table": {"type": "boolean"},
            },
            "required": ["SKU", "In_Table"],
        },
    },
}

In [52]:
schema["json_schema"]["schema"]

{'type': 'object',
 'properties': {'SKU': {'type': 'string'}, 'In_Table': {'type': 'boolean'}},
 'required': ['SKU', 'In_Table']}

In [57]:
results = []

for product_str in product_strs:
    completion = client.chat.completions.create(
        model="phi-4",
        messages=[
            {"role": "system", "content": "Extract the product information."},
            {
                "role": "user",
                "content": f"""Please check that the ordered product is in the table.
                Here is the table:
                {table_df.to_json()}

                Here is the ordered product:
                {product_str}

                The product information may differ, e.g. the SKU may be missing or the product name may be different. However, the price must be the same. If there is a product that almost matches the ordered product, please select it.

                Please return a JSON object with the SKU of the ordered product and whether it is in the table or not.""",
            },
        ],
        response_format=schema,
        temperature=0.3,
        seed=42,
    )

    results.append(json.loads(completion.choices[0].message.content))

print(results)

[{'SKU': '123456', 'In_Table': True}, {'SKU': '123457', 'In_Table': True}]
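
Since the prompt requires the price to match exactly, the model's answers can be cross-checked without an LLM. The sketch below is an assumption-laden illustration rather than part of the original pipeline: `table_rows` stands in for `table_df.to_dict("records")` with rows copied from the output above, `llm_results` repeats the printed `results`, and `parse_eur` and `price_in_table` are hypothetical helpers.

```python
from decimal import Decimal

# Stand-in for table_df.to_dict("records"); rows copied from the table output above.
table_rows = [
    {"Pos.": "1", "Bezeichnung": "Fernseher 40 Zoll | Musterartikel", "Einzelpreis": "1000,00 EUR"},
    {"Pos.": "2", "Bezeichnung": "Anfahrt und Aufbau", "Einzelpreis": "120,00 EUR"},
    {"Pos.": "", "Bezeichnung": "", "Einzelpreis": "Zwischensumme"},
    {"Pos.": "", "Bezeichnung": "", "Einzelpreis": "19% MwSt."},
    {"Pos.": "", "Bezeichnung": "", "Einzelpreis": "Gesamtbetrag"},
]
ordered_products = [
    {"Name": "Fernseher Smart TV QLED 4K", "Price": 1000, "SKU": "123456"},
    {"Name": "Aufbauservice", "Price": 120, "SKU": "123457"},
]
# Model output as printed above.
llm_results = [
    {"SKU": "123456", "In_Table": True},
    {"SKU": "123457", "In_Table": True},
]


def parse_eur(text: str) -> Decimal:
    """Convert a German-formatted amount such as '120,00 EUR' to a Decimal."""
    number = text.replace("EUR", "").strip()
    return Decimal(number.replace(".", "").replace(",", "."))


def price_in_table(price: int, rows: list[dict]) -> bool:
    """True if any line item (a row with a position number) has this unit price."""
    line_items = [row for row in rows if row["Pos."]]
    return any(parse_eur(row["Einzelpreis"]) == Decimal(price) for row in line_items)


# SKUs where the model's verdict differs from the price-based check.
disagreements = [
    result["SKU"]
    for result, product in zip(llm_results, ordered_products)
    if result["In_Table"] != price_in_table(product["Price"], table_rows)
]
print(disagreements)  # []
```

An empty list means the model and the price check agree for every ordered product. A non-empty list would flag SKUs worth reviewing by hand.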
