# Data Extraction - Azure OpenAI GPT-4o with Vision

This sample demonstrates how to use the vision capabilities of Azure OpenAI's GPT-4o model to analyze a document's pages as images and extract structured data without OCR pre-processing.

## Objectives

By the end of this sample, you will have learned how to:

- Convert document pages into images.
- Use the [Structured Outputs feature](https://learn.microsoft.com/en-us/azure/ai-services/openai/how-to/structured-outputs?tabs=python-secure) to extract structured data from the document page images using Azure OpenAI's GPT-4o model.

## Setup

In [1]:
import sys
sys.path.append('../../')

from IPython.display import display, Markdown

import os
import pandas as pd
from dotenv import dotenv_values
from pdf2image import convert_from_bytes
import base64
import io
from openai import AzureOpenAI
from azure.identity import DefaultAzureCredential, get_bearer_token_provider
from modules.app_settings import AppSettings
from modules.data_extraction_result import DataExtractionResult
from modules.vehicle_insurance_policy import (
    VehicleInsurancePolicy,
    VehicleInsuranceCostDetails,
    VehicleInsuranceRenewalDetails,
    VehicleInsurancePersonDetails,
    VehicleInsuranceVehicleDetails,
    VehicleInsuranceExcessDetails,
    VehicleInsurancePolicyEvaluator,
)
from modules.stopwatch import Stopwatch

In [2]:
# Set the working directory to the root of the repo
working_dir = os.path.abspath('../../../')
settings = AppSettings(dotenv_values(f"{working_dir}/.env"))

# Configure the default credential for accessing Azure services using Azure CLI credentials
credential = DefaultAzureCredential(
    exclude_workload_identity_credential=True,
    exclude_developer_cli_credential=True,
    exclude_environment_credential=True,
    exclude_managed_identity_credential=True,
    exclude_powershell_credential=True,
    exclude_shared_token_cache_credential=True,
    exclude_interactive_browser_credential=True
)

openai_token_provider = get_bearer_token_provider(credential, 'https://cognitiveservices.azure.com/.default')

openai_client = AzureOpenAI(
    azure_endpoint=settings.openai_endpoint,
    azure_ad_token_provider=openai_token_provider,
    api_version="2024-08-01-preview"
)

## Establish the expected output

The following code block contains the expected output of the sample based on the details of the [Vehicle Insurance Policy](../../assets/VehicleInsurancePolicy.pdf). The expected output has been defined by a human evaluating the document.

In [3]:
pdf_path = f"{working_dir}/samples/assets/"
pdf_file_name = "VehicleInsurancePolicy.pdf"

expected = VehicleInsurancePolicy(
    policy_number='GB20246717948',
    cost=VehicleInsuranceCostDetails(
        annual_total=532.19,
        payable_by_date='2024-06-13'
    ),
    renewal=VehicleInsuranceRenewalDetails(
        renewal_notification_date='2025-05-12',
        renewal_due_date='2025-05-26'
    ),
    effective_from='2024-06-03',
    effective_to='2025-06-02',
    last_date_to_cancel='2024-06-17',
    policyholder=VehicleInsurancePersonDetails(
        first_name='Joe',
        last_name='Bloggs',
        date_of_birth='1990-01-05',
        address='73 Regal Way, LEEDS, West Yorkshire, LS1 5AB',
        email_address='Joe.Bloggs@me.com',
        total_years_of_residence_in_uk=34,
        driving_license_number='BLOGGS901050JJ1AB'
    ),
    vehicle=VehicleInsuranceVehicleDetails(
        registration_number='VS24DMC',
        make='Hyundai',
        model='IONIQ 5 Premium 73 kWh RWD',
        year=2024,
        value=40000
    ),
    accident_excess=VehicleInsuranceExcessDetails(
        compulsory=250,
        voluntary=250,
        unapproved_repair_penalty=250
    ),
    fire_and_theft_excess=VehicleInsuranceExcessDetails(
        compulsory=250,
        voluntary=250,
        unapproved_repair_penalty=250
    ),
)

insurance_policy_evaluator = VehicleInsurancePolicyEvaluator(expected)

## Extract data from the document

The following code block runs the data extraction process using the vision capabilities of Azure OpenAI's GPT-4o model.

It performs the following steps:

1. Get the document bytes from the provided file path. _Note: This example processes a local document, but you can use any document storage location of your choice, such as Azure Blob Storage._
2. Use `pdf2image` to convert each page of the document into an image, encoded as a base64 string.
3. Using Azure OpenAI's GPT-4o model and its [Structured Outputs feature](https://learn.microsoft.com/en-us/azure/ai-services/openai/how-to/structured-outputs?tabs=python-secure), extract a structured data transfer object (DTO) from the content of the images.
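
The packaging of a page image into a request content part can be sketched in isolation, using only the standard library. This is a minimal illustration, not the sample's own code: the helper name `to_image_content` and the placeholder bytes are hypothetical. The optional `detail` field is part of the chat completions image input format; whether a lower setting still gives acceptable extraction quality for a given document is an assumption to verify.

```python
import base64


def to_image_content(image_bytes: bytes, mime_type: str = "image/png", detail: str = "auto") -> dict:
    """Wrap raw image bytes as an image_url content part (hypothetical helper)."""
    encoded = base64.b64encode(image_bytes).decode("utf-8")
    return {
        "type": "image_url",
        "image_url": {
            # The image travels inline as a base64 data URL rather than a hosted link.
            "url": f"data:{mime_type};base64,{encoded}",
            "detail": detail,
        },
    }


# Placeholder bytes stand in for a rendered page; a real call would pass PNG data.
sample_part = to_image_content(b"\x89PNG-placeholder")
print(sample_part["image_url"]["url"][:22])
```

Keeping this step in a small function makes it easy to reuse for documents loaded from other sources, since it only depends on the image bytes.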

In [9]:
fname = f"{pdf_path}{pdf_file_name}"

stopwatch = Stopwatch()
stopwatch.start()

user_content = []
user_content.append({
    "type": "text",
    "text": """Extract the data from this insurance policy.
    - If a value is not present, provide null.
    - Some values must be inferred based on the rules defined in the policy.
    - Dates should be in the format YYYY-MM-DD."""
})

# Read the document bytes, closing the file handle once done.
with open(fname, "rb") as document_file:
    document_bytes = document_file.read()

# Render each PDF page as an image, then encode it as a base64 PNG for the request.
page_images = convert_from_bytes(document_bytes)
for page_image in page_images:
    image_buffer = io.BytesIO()
    page_image.save(image_buffer, format='PNG')
    base64_data = base64.b64encode(image_buffer.getvalue()).decode('utf-8')
    
    user_content.append({
        "type": "image_url",
        "image_url": {
            "url": f"data:image/png;base64,{base64_data}"
        }
    })
    
completion = openai_client.beta.chat.completions.parse(
    model=settings.gpt4o_model_deployment_name,
    messages=[
        {
            "role": "system",
            "content": "You are an AI assistant that extracts data from documents.",
        },
        {
            "role": "user",
            "content": user_content
        }
    ],
    response_format=VehicleInsurancePolicy,
    max_tokens=4096,
    temperature=0.1,
    top_p=0.1
)

stopwatch.stop()

## Visualize the outputs

To provide context for the execution of the code, the following code blocks visualize the outputs of the data extraction process.

This includes:

- The accuracy of the structured data extraction, comparing the expected output with the output generated by Azure OpenAI's GPT-4o model.
- The execution time of the end-to-end process.
- The total number of tokens consumed by the GPT-4o model.
- The side-by-side comparison of the expected output and the output generated by Azure OpenAI's GPT-4o model.
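
The accuracy figure comes from `VehicleInsurancePolicyEvaluator`, whose implementation lives in `modules` and is not shown here. As a rough illustration only, a field-level comparison over flat dictionaries could be sketched as follows. The function name `field_accuracy`, the case-insensitive string rule, and the sample values are assumptions for this sketch, not the evaluator's actual logic.

```python
def field_accuracy(expected: dict, extracted: dict) -> dict:
    """Score each expected field as 1 (match) or 0 (mismatch), plus an overall mean."""
    def matches(expected_value, actual_value):
        # Assumed rule: strings compare case-insensitively, everything else by equality.
        if isinstance(expected_value, str) and isinstance(actual_value, str):
            return expected_value.lower() == actual_value.lower()
        return expected_value == actual_value

    scores = {key: int(matches(value, extracted.get(key))) for key, value in expected.items()}
    overall = sum(scores.values()) / len(scores) if scores else 0.0
    return {"fields": scores, "overall": overall}


# Illustrative values only: one field differs in case, one is missing from the extraction.
expected_sample = {"policy_number": "GB20246717948", "make": "Hyundai", "year": 2024, "value": 40000}
extracted_sample = {"policy_number": "GB20246717948", "make": "HYUNDAI", "year": 2024, "value": None}

result = field_accuracy(expected_sample, extracted_sample)
print(f"Accuracy: {result['overall'] * 100:.2f}%")  # prints "Accuracy: 75.00%"
```

Nested structures such as the policyholder or vehicle details would need to be flattened or compared recursively before applying a scheme like this.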

In [None]:
# Gets the parsed VehicleInsurancePolicy object from the completion response.
insurance_policy = completion.choices[0].message.parsed

# Determines the accuracy of the extracted data against the expected values.
accuracy = insurance_policy_evaluator.evaluate(insurance_policy)

# Gets the prompt tokens and completion tokens from the completion response.
prompt_tokens = completion.usage.prompt_tokens
completion_tokens = completion.usage.completion_tokens

# Save the output of the data extraction result.
extraction_result = DataExtractionResult(insurance_policy.to_dict(), accuracy, prompt_tokens, completion_tokens, stopwatch.elapsed)

with open(f"{working_dir}/samples/extraction/vision-based/openai.{pdf_file_name}.json", "w") as f:
    f.write(extraction_result.to_json(indent=4))
    
# Display the outputs of the data extraction process.
print(f"Accuracy: {accuracy['overall'] * 100:.2f}%")
print(f"Execution time: {stopwatch.elapsed:.2f} seconds")
print(f"Prompt tokens: {prompt_tokens}")
print(f"Completion tokens: {completion_tokens}")

def display_insurance_policy_comparison(expected, extracted):
    def flatten_dict(d, parent_key='', sep='_'):
        items = []
        for k, v in d.items():
            new_key = f"{parent_key}{sep}{k}" if parent_key else k
            if isinstance(v, dict):
                items.extend(flatten_dict(v, new_key, sep=sep).items())
            elif isinstance(v, list):
                for i, item in enumerate(v):
                    items.extend(flatten_dict({f"{new_key}_{i}": item}, '', sep=sep).items())
            else:
                items.append((new_key, v))
        return dict(items)

    def highlight_comparison(actual_value, expected_value):
        if isinstance(actual_value, dict) and isinstance(expected_value, dict):
            return {k: highlight_comparison(actual_value.get(k), expected_value.get(k)) for k in expected_value.keys()}
        elif isinstance(actual_value, list) and isinstance(expected_value, list):
            return [highlight_comparison(v, ev) for v, ev in zip(actual_value, expected_value)]
        else:
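            # Compare strings case-insensitively, but only when both values are strings.
            if isinstance(actual_value, str) and isinstance(expected_value, str) and actual_value.lower() == expected_value.lower():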
                return f"<span style='color: green'>{actual_value}</span>"
            elif actual_value == expected_value:
                return f"<span style='color: green'>{actual_value}</span>"
            else:
                return f"<span style='color: red'>{actual_value}</span>"

    expected_flat = flatten_dict(expected)
    extracted_flat = flatten_dict(extracted)
    rows = []
    for key in expected_flat.keys():
        rows.append({
            "Field": key,
            "Expected": expected_flat[key],
            "Extracted": highlight_comparison(extracted_flat.get(key), expected_flat[key])
        })
    df = pd.DataFrame(rows)
    display(Markdown(df.to_markdown(index=False, tablefmt="unsafehtml")))

display_insurance_policy_comparison(expected.to_dict(), insurance_policy.to_dict())