# Data Extraction - Marker/Surya + Azure OpenAI GPT-4o

This sample demonstrates how to use Marker with Surya OCR to convert a document to Markdown, and then use Azure OpenAI's GPT-4o model to extract structured output from that Markdown.

## Objectives

By the end of this sample, you will have learned how to:

- Convert a document to Markdown format using the self-hosted Surya OCR model in combination with the Marker library.
- Use the [Structured Outputs feature](https://learn.microsoft.com/en-us/azure/ai-services/openai/how-to/structured-outputs?tabs=python-secure) to extract structured data from the content using Azure OpenAI's GPT-4o model.

## Setup

In [1]:
import sys
sys.path.append('../../')

from IPython.display import display, Markdown

import os
import pandas as pd
from dotenv import dotenv_values
from marker.convert import convert_single_pdf
from marker.models import load_all_models
from openai import AzureOpenAI
from azure.identity import DefaultAzureCredential, get_bearer_token_provider
from modules.app_settings import AppSettings
from modules.data_extraction_result import DataExtractionResult
from modules.invoice import Invoice, InvoiceProduct, InvoiceSignature, InvoiceEvaluator
from modules.stopwatch import Stopwatch

In [2]:
# Set the working directory to the root of the repo
working_dir = os.path.abspath('../../../')
settings = AppSettings(dotenv_values(f"{working_dir}/.env"))

# Configure the default credential for accessing Azure services using Azure CLI credentials
credential = DefaultAzureCredential(
    exclude_workload_identity_credential=True,
    exclude_developer_cli_credential=True,
    exclude_environment_credential=True,
    exclude_managed_identity_credential=True,
    exclude_powershell_credential=True,
    exclude_shared_token_cache_credential=True,
    exclude_interactive_browser_credential=True
)

openai_token_provider = get_bearer_token_provider(credential, 'https://cognitiveservices.azure.com/.default')

openai_client = AzureOpenAI(
    azure_endpoint=settings.openai_endpoint,
    azure_ad_token_provider=openai_token_provider,
    api_version="2024-08-01-preview"
)

## Establish the expected output

The following code block contains the expected output of the sample based on the details of the [Invoice](../../assets/Invoice.pdf). The expected output has been defined by a human evaluating the document.

In [3]:
pdf_path = f"{working_dir}/samples/assets/"
pdf_file_name = "Invoice.pdf"

expected = Invoice(
    invoice_number='3847193',
    purchase_order_number='15931',
    customer_name='Sharp Consulting',
    customer_address='73 Regal Way, Leeds, LS1 5AB, UK',
    delivery_date='2024-05-16',
    payable_by='2024-05-24',
    products=[
        InvoiceProduct(
            id='MA197',
            description='STRETCHWRAP ROLL',
            unit_price=16.62,
            quantity=5,
            total=83.10,
            reason=None
        ),
        InvoiceProduct(
            id='ST4086',
            description='BALLPOINT PEN MED.',
            unit_price=2.49,
            quantity=10,
            total=24.90,
            reason=None
        ),
        InvoiceProduct(
            id='JF9912413BF',
            description='BUBBLE FILM ROLL CL.',
            unit_price=15.46,
            quantity=12,
            total=185.52,
            reason=None
        ),
    ],
    returns=[
        InvoiceProduct(
            id='MA145',
            description='POSTAL TUBE BROWN',
            unit_price=None,
            quantity=1,
            total=None,
            reason='This item was provided in previous order as a replacement'
        ),
        InvoiceProduct(
            id='JF7902',
            description='MAILBOX 25PK',
            unit_price=None,
            quantity=1,
            total=None,
            reason='Not required'
        ),
    ],
    total_product_quantity=27,
    total_product_price=293.52,
    product_signatures=[
        InvoiceSignature(
            type='Customer',
            name='Sarah H',
            is_signed=True
        ),
        InvoiceSignature(
            type='Driver',
            name='James T',
            is_signed=True
        )
    ],
    returns_signatures=[
        InvoiceSignature(
            type='Customer',
            name='Sarah H',
            is_signed=True
        ),
        InvoiceSignature(
            type='Driver',
            name='James T',
            is_signed=True
        )
    ]
)

invoice_evaluator = InvoiceEvaluator(expected)
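
`InvoiceEvaluator` comes from `modules.invoice`, which this sample does not show. As a rough illustration of the idea only, the sketch below scores an extraction by comparing every leaf field against the expected value. It is not the repository's implementation: the names `flatten` and `field_accuracy`, the dotted-path scheme, and the equal weighting of fields are assumptions made for this example. Only the `overall` key mirrors what the later cells read from the result.

```python
from typing import Any, Dict

# Sentinel so that a missing field never "matches" an expected None.
_MISSING = object()


def flatten(value: Any, prefix: str = "") -> Dict[str, Any]:
    """Flatten nested dicts and lists into {dotted_path: leaf_value}."""
    if isinstance(value, dict):
        items = value.items()
    elif isinstance(value, list):
        items = ((str(i), v) for i, v in enumerate(value))
    else:
        return {prefix: value}
    flat: Dict[str, Any] = {}
    for key, child in items:
        path = f"{prefix}.{key}" if prefix else str(key)
        flat.update(flatten(child, path))
    return flat


def field_accuracy(expected: dict, actual: dict) -> Dict[str, Any]:
    """Hypothetical scorer: fraction of expected leaf fields that match exactly."""
    expected_flat = flatten(expected)
    actual_flat = flatten(actual)
    matches = {
        path: actual_flat.get(path, _MISSING) == value
        for path, value in expected_flat.items()
    }
    overall = sum(matches.values()) / len(matches) if matches else 1.0
    return {"overall": overall, "fields": matches}


# Made-up fragment: one of three leaf fields differs.
expected_fragment = {"invoice_number": "3847193", "products": [{"id": "MA197", "quantity": 5}]}
actual_fragment = {"invoice_number": "3847193", "products": [{"id": "MA197", "quantity": 6}]}

result = field_accuracy(expected_fragment, actual_fragment)
print(f"Accuracy: {result['overall'] * 100:.2f}%")
```

Exact equality is the simplest possible rule. A real evaluator may well normalize whitespace, casing, or numeric precision before comparing.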

## Extract data from the document

The following code block executes the data extraction process using Surya OCR + Marker and Azure OpenAI's GPT-4o model.

It performs the following steps:

1. Load the models required for Marker into memory.
2. Locate the document at the provided file path. _Note: This example processes a local document, but you can use any document storage location of your choice, such as Azure Blob Storage._
3. Use Marker to analyze the structure of the document and convert it to Markdown format.
4. Using Azure OpenAI's GPT-4o model and its [Structured Outputs feature](https://learn.microsoft.com/en-us/azure/ai-services/openai/how-to/structured-outputs?tabs=python-secure), extract a structured data transfer object (DTO) from the content of the Markdown.

In [None]:
marker_models = load_all_models()

In [None]:
fname = f"{pdf_path}{pdf_file_name}"

stopwatch = Stopwatch()
stopwatch.start()

markdown, images, out_meta = convert_single_pdf(fname, marker_models, langs=["English"], batch_multiplier=2, start_page=None)

completion = openai_client.beta.chat.completions.parse(
    model=settings.gpt4o_model_deployment_name,
    messages=[
        {
            "role": "system",
            "content": "You are an AI assistant that extracts data from documents.",
        },
        {
            "role": "user",
            "content": """Extract the data from this invoice.
            - If a value is not present, provide null.
            - Dates should be in the format YYYY-MM-DD.""",
        },
        {
            "role": "user",
            "content": markdown,
        }
    ],
    response_format=Invoice,
    max_tokens=4096,
    temperature=0.1,
    top_p=0.1
)

stopwatch.stop()
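
With Structured Outputs, the response message can carry a `refusal` string instead of a parsed object, so it can be worth checking for that before reading `message.parsed`. The helper below is a minimal sketch of such a check; the name `get_parsed_or_raise` and the choice of `ValueError` are assumptions for illustration, and the `SimpleNamespace` objects are stand-ins for a real completion so the example runs without calling the service.

```python
from types import SimpleNamespace
from typing import Any


def get_parsed_or_raise(completion: Any) -> Any:
    """Return the parsed object from the first choice, or raise if the model refused."""
    message = completion.choices[0].message
    refusal = getattr(message, "refusal", None)
    if refusal:
        raise ValueError(f"The model refused the request: {refusal}")
    if message.parsed is None:
        raise ValueError("The response did not contain a parsed object.")
    return message.parsed


def fake_completion(parsed: Any, refusal: Any) -> SimpleNamespace:
    """Stand-in with the same attribute shape as a parsed chat completion."""
    message = SimpleNamespace(parsed=parsed, refusal=refusal)
    return SimpleNamespace(choices=[SimpleNamespace(message=message)])


ok = fake_completion({"invoice_number": "3847193"}, None)
refused = fake_completion(None, "I can't help with that.")

print(get_parsed_or_raise(ok))
try:
    get_parsed_or_raise(refused)
except ValueError as error:
    print(error)
```

In this notebook, the call would be `invoice = get_parsed_or_raise(completion)` in place of reading `completion.choices[0].message.parsed` directly.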

## Visualize the outputs

To provide context for the execution of the code, the following code blocks visualize the outputs of the data extraction process.

This includes:

- The Markdown representation of the document structure as determined by Marker and Surya OCR.
- The accuracy of the structured data extraction, comparing the expected output with the output generated by Azure OpenAI's GPT-4o model.
- The execution time of the end-to-end process, covering the Marker conversion and the GPT-4o call. Loading the Marker models is not included in the timing.
- The number of prompt and completion tokens consumed by the GPT-4o model.
- A side-by-side comparison of the expected output and the output generated by Azure OpenAI's GPT-4o model.
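
The execution time comes from the `Stopwatch` helper in `modules.stopwatch`, which is not shown in this sample. A minimal sketch of a class with the same surface used here (`start()`, `stop()`, and an `elapsed` value in seconds) might look like the following. The internals, including the use of `time.perf_counter`, are assumptions and not the repository's code.

```python
import time
from typing import Optional


class Stopwatch:
    """Hypothetical timer exposing start(), stop(), and elapsed (seconds)."""

    def __init__(self) -> None:
        self._started_at: Optional[float] = None
        self.elapsed: float = 0.0

    def start(self) -> None:
        # perf_counter is a monotonic, high-resolution clock suited to intervals.
        self._started_at = time.perf_counter()

    def stop(self) -> float:
        if self._started_at is None:
            raise RuntimeError("Stopwatch.stop() called before start()")
        self.elapsed = time.perf_counter() - self._started_at
        self._started_at = None
        return self.elapsed


stopwatch_demo = Stopwatch()
stopwatch_demo.start()
time.sleep(0.01)  # stand-in for the conversion and model call
stopwatch_demo.stop()
print(f"Execution time: {stopwatch_demo.elapsed:.2f} seconds")
```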

In [None]:
# Displays the output of the Marker analysis in Markdown format.
display(Markdown(markdown))

In [None]:
# Gets the parsed Invoice object from the completion response.
invoice = completion.choices[0].message.parsed

# Determines the accuracy of the extracted data against the expected values.
accuracy = invoice_evaluator.evaluate(invoice)

# Gets the prompt tokens and completion tokens from the completion response.
prompt_tokens = completion.usage.prompt_tokens
completion_tokens = completion.usage.completion_tokens

# Save the output of the data extraction result.
extraction_result = DataExtractionResult(invoice.to_dict(), accuracy, prompt_tokens, completion_tokens, stopwatch.elapsed)

with open(f"{working_dir}/samples/extraction/text-based/marker-surya-openai.{pdf_file_name}.json", "w") as f:
    f.write(extraction_result.to_json(indent=4))
    
# Display the outputs of the data extraction process.
print(f"Accuracy: {accuracy['overall'] * 100:.2f}%")
print(f"Execution time: {stopwatch.elapsed:.2f} seconds")
print(f"Prompt tokens: {prompt_tokens}")
print(f"Completion tokens: {completion_tokens}")

def display_invoice_comparison(expected, extracted):
    def flatten_dict(d, parent_key='', sep='_'):
        items = []
        for k, v in d.items():
            new_key = f"{parent_key}{sep}{k}" if parent_key else k
            if isinstance(v, dict):
                items.extend(flatten_dict(v, new_key, sep=sep).items())
            elif isinstance(v, list):
                for i, item in enumerate(v):
                    items.extend(flatten_dict({f"{new_key}_{i}": item}, '', sep=sep).items())
            else:
                items.append((new_key, v))
        return dict(items)

    def highlight_comparison(actual_value, expected_value):
        if isinstance(actual_value, dict) and isinstance(expected_value, dict):
            return {k: highlight_comparison(actual_value.get(k), expected_value.get(k)) for k in expected_value.keys()}
        elif isinstance(actual_value, list) and isinstance(expected_value, list):
            return [highlight_comparison(v, ev) for v, ev in zip(actual_value, expected_value)]
        else:
            if actual_value == expected_value:
                return f"<span style='color: green'>{actual_value}</span>"
            else:
                return f"<span style='color: red'>{actual_value}</span>"

    expected_flat = flatten_dict(expected)
    extracted_flat = flatten_dict(extracted)
    rows = []
    for key in expected_flat.keys():
        rows.append({
            "Field": key,
            "Expected": expected_flat[key],
            "Extracted": highlight_comparison(extracted_flat.get(key), expected_flat[key])
        })
    df = pd.DataFrame(rows)
    display(Markdown(df.to_markdown(index=False, tablefmt="unsafehtml")))

display_invoice_comparison(expected.to_dict(), invoice.to_dict())
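
To make the `Field` column of the comparison table concrete, the snippet below runs a standalone copy of the `flatten_dict` helper on a small, made-up invoice fragment. The sample data is invented for illustration; the function body matches the nested helper above.

```python
def flatten_dict(d, parent_key='', sep='_'):
    """Flatten nested dicts and lists into a single-level dict with joined keys."""
    items = []
    for k, v in d.items():
        new_key = f"{parent_key}{sep}{k}" if parent_key else k
        if isinstance(v, dict):
            items.extend(flatten_dict(v, new_key, sep=sep).items())
        elif isinstance(v, list):
            # Each list element gets its index appended to the key.
            for i, item in enumerate(v):
                items.extend(flatten_dict({f"{new_key}_{i}": item}, '', sep=sep).items())
        else:
            items.append((new_key, v))
    return dict(items)


sample = {
    "invoice_number": "3847193",
    "products": [
        {"id": "MA197", "quantity": 5},
        {"id": "ST4086", "quantity": 10},
    ],
}

flat = flatten_dict(sample)
for key, value in flat.items():
    print(f"{key} = {value}")
```

Because `display_invoice_comparison` iterates over the keys of the flattened expected output only, any additional items the model returns (for example, an extra product) do not appear in the table.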