# Document Extraction with Azure OpenAI GPT-4o (Files input only)

**Before running this notebook, ensure you have selected the correct Python kernel. If running in the `devcontainer` environment, this is likely to be 3.12.11 at `/usr/local/python/current/bin/python`.**

![Example devcontainer notebook kernel](../../../../images/python-notebook-kernel.png)

This sample demonstrates how to extract structured data from any document using Azure OpenAI's GPT-4o model with Files input.

This is achieved by the following process:

- Construct a system prompt that defines the instruction for extracting structured data from documents.
- Construct a user prompt that includes the specific extraction instruction for the type of document, and each document as a base64 file input.
- Use the Azure OpenAI chat completions API with the GPT-4o model to generate a structured output from the content.
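
As a minimal sketch of the message structure described above, the following standalone snippet builds the system and user messages with a base64 file input. The helper names `build_file_input` and `build_messages` are hypothetical and exist only for illustration. The shape of the `file` content part mirrors the request built later in this notebook; whether a deployment accepts it depends on the model and API version in use.

```python
import base64


def build_file_input(document_bytes: bytes, filename: str) -> dict:
    """Wrap raw PDF bytes as a base64 data URI inside a `file` content part."""
    encoded = base64.b64encode(document_bytes).decode("utf-8")
    return {
        "type": "file",
        "file": {
            "filename": filename,
            "file_data": f"data:application/pdf;base64,{encoded}",
        },
    }


def build_messages(system_prompt: str, user_prompt: str, file_input: dict) -> list[dict]:
    """Combine the system prompt, the extraction instruction, and the file input."""
    return [
        {"role": "system", "content": system_prompt},
        {
            "role": "user",
            "content": [{"type": "text", "text": user_prompt}, file_input],
        },
    ]


# Placeholder bytes stand in for a real PDF read from disk.
messages = build_messages(
    "You are an AI assistant that extracts data from documents.",
    "Extract the data from this insurance policy.",
    build_file_input(b"%PDF-1.4 placeholder", "policy.pdf"),
)
print(messages[1]["content"][1]["file"]["filename"])
```

Keeping the text instruction and the file in the same user message lets the model read the instruction alongside the document it applies to.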

## Objectives

By the end of this sample, you will have learned how to:

- Convert a document into a base64 encoded file for processing by GPT-4o.
- Use prompt engineering techniques to instruct GPT-4o to extract structured data from a type of document.
- Use the [Structured Outputs feature](https://learn.microsoft.com/en-us/azure/ai-services/openai/how-to/structured-outputs?tabs=python-secure) to extract structured data from the document file using Azure OpenAI's GPT-4o model.
- Use the [logprobs](https://learn.microsoft.com/en-us/azure/ai-services/openai/reference#request-body:~:text=False-,logprobs,-integer) parameter in an OpenAI request to determine the confidence of the extracted structured output.

## Useful Tips

- Combine this technique with a [page classification](../../classification/README.md) approach to reduce the number of pages to extract from to only those that match your criteria for extraction.

## Setup

### Import modules

This sample takes advantage of the following Python dependencies:

- **pdf2image** for converting a PDF file into a set of images per page. It is imported below but not called in this Files input variant, which sends the PDF directly.
- **openai** to interface with the Azure OpenAI chat completions API to generate structured extraction outputs using the GPT-4o model.
- **azure-identity** to securely authenticate with deployed Azure Services using Microsoft Entra ID credentials.

The following local components are also used:

- [**vehicle_insurance_policy**](../../modules/samples/models/vehicle_insurance_policy.py) to provide the expected structured output JSON schema for vehicle insurance policy documents.
- [**accuracy_evaluator**](../../modules/samples/evaluation/accuracy_evaluator.py) to evaluate the output of the extraction process with expected results.
- [**openai_confidence**](../../modules/samples/confidence/openai_confidence.py) to calculate the confidence of the extraction process based on the `logprobs` response from the OpenAI API request.
- [**document_processing_result**](../../modules/samples/models/document_processing_result.py) to store the results of the extraction process as a file.
- [**stopwatch**](../../modules/samples/utils/stopwatch.py) to measure the end-to-end execution time for the extraction process.
- [**app_settings**](../../modules/samples/app_settings.py) to access environment variables from the `.env` file.

In [1]:
import sys
sys.path.append('../../modules/') # Import local modules

from IPython.display import display
import os
import pandas as pd
from dotenv import dotenv_values
import base64
import io
import json
from openai import AzureOpenAI
from azure.identity import DefaultAzureCredential, get_bearer_token_provider
from concurrent.futures import ThreadPoolExecutor
from pdf2image import convert_from_bytes

from samples.app_settings import AppSettings
from samples.utils.stopwatch import Stopwatch
from samples.utils.storage_utils import create_json_file
from samples.models.document_processing_result import DataExtractionResult

from samples.models.vehicle_insurance_policy import VehicleInsurancePolicy
from samples.confidence.confidence_utils import merge_confidence_values
from samples.confidence.openai_confidence import evaluate_confidence as evaluate_openai_confidence
from samples.evaluation.accuracy_evaluator import AccuracyEvaluator
from samples.evaluation.comparison import get_extraction_comparison

### Configure the Azure services

To use Azure OpenAI, the SDK creates a client instance from a deployed endpoint and authentication credentials.

For this sample, the credentials of the Azure CLI are used to authenticate with the deployed services.

In [2]:
# Set the working directory to the root of the repo
working_dir = os.path.abspath('../../../../')
settings = AppSettings(dotenv_values(f"{working_dir}/.env"))
sample_path = f"{working_dir}/samples/python/extraction/vision"
sample_name = "document-extraction-gpt-vision"

# Configure the default credential for accessing Azure services using Azure CLI credentials
credential = DefaultAzureCredential(
    exclude_workload_identity_credential=True,
    exclude_developer_cli_credential=True,
    exclude_environment_credential=True,
    exclude_managed_identity_credential=True,
    exclude_powershell_credential=True,
    exclude_shared_token_cache_credential=True,
    exclude_interactive_browser_credential=True
)

openai_token_provider = get_bearer_token_provider(credential, 'https://cognitiveservices.azure.com/.default')

openai_client = AzureOpenAI(
    azure_endpoint=settings.azure_openai_endpoint,
    azure_ad_token_provider=openai_token_provider,
    api_version=settings.azure_openai_api_version
)

### Establish the expected output

To evaluate the accuracy of the extraction process, the following code block loads the expected output from the JSON metadata file associated with a vehicle insurance policy, such as this [Vehicle Insurance Policy](../../../assets/vehicle_insurance/policy_1.pdf).

> **Note**: More insurance policy examples can be found in the [assets folder](../../../assets/vehicle_insurance). These examples include the PDF file and an associated JSON metadata file that provides the expected structured output. You can add your own scenarios by following the same structure.

The expected output has been defined by a human evaluating the document.
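
To illustrate the metadata structure, the sketch below writes and reads back a metadata file with the two keys that the next cell relies on, `fname` and `expected`. The file names and the fields inside `expected` (`policy_number`, `effective_date`) are hypothetical placeholders, not the actual `VehicleInsurancePolicy` schema.

```python
import json
import tempfile
from pathlib import Path

# Hypothetical metadata: `fname` names the PDF, `expected` holds the ground truth.
sample_metadata = {
    "fname": "my_policy.pdf",
    "expected": {"policy_number": "ABC-123", "effective_date": "2024-01-31"},
}

with tempfile.TemporaryDirectory() as tmp_dir:
    metadata_path = Path(tmp_dir) / "my_policy.json"
    metadata_path.write_text(json.dumps(sample_metadata, indent=2), encoding="utf-8")

    # Read the metadata back and resolve the PDF path next to the JSON file.
    loaded = json.loads(metadata_path.read_text(encoding="utf-8"))
    pdf_path = metadata_path.parent / loaded["fname"]

print(pdf_path.name)
print(sorted(loaded["expected"]))
```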

In [3]:
path = f"{working_dir}/samples/assets/vehicle_insurance/"
metadata_fname = "policy_5.json" # Change this to the file you want to evaluate
metadata_fpath = f"{path}{metadata_fname}"

with open(metadata_fpath, 'r') as f:
    data = json.load(f)
    
expected = VehicleInsurancePolicy(**data['expected'])
pdf_fname = data['fname']
pdf_fpath = f"{path}{pdf_fname}"

insurance_policy_evaluator = AccuracyEvaluator(match_keys=[])

## Extract data from the document

The following code blocks execute the data extraction process using Azure OpenAI's GPT-4o model with Files input.

It performs the following steps:

1. Get the document bytes from the provided file path. _Note: In this example, we are processing a local document, however, you can use any document storage location of your choice, such as Azure Blob Storage._
2. Encode the document bytes as a base64 string and wrap them in a `file` content part, so that the PDF is sent as-is rather than converted into page images.
3. Using Azure OpenAI's GPT-4o model and its [Structured Outputs feature](https://learn.microsoft.com/en-us/azure/ai-services/openai/how-to/structured-outputs?tabs=python-secure), extract a structured data transfer object (DTO) from the content of the document.

In [4]:
system_prompt = "You are an AI assistant that extracts data from documents."

In [5]:
# Prepare the user content for the OpenAI API, including any specific details for processing this type of document and the document file itself.
user_content = []

In [6]:
user_text_prompt = """Extract the data from this insurance policy. 
- If a value is not present, provide null.
- Some values must be inferred based on the rules defined in the policy.
- Dates should be in the format YYYY-MM-DD."""

user_content.append({
    "type": "text",
    "text": user_text_prompt
})

In [7]:
with Stopwatch() as image_stopwatch:
    with open(pdf_fpath, "rb") as f:
        document_bytes = f.read()
        
    # Encode the PDF bytes to base64
    pdf_base64 = base64.b64encode(document_bytes).decode('utf-8')
    
    file_input = {
        "type": "file",
        "file": {
            "filename": pdf_fname,
            "file_data": f"data:application/pdf;base64,{pdf_base64}"
        }
    }
    
    user_content.append(file_input)

In [8]:
with Stopwatch() as oai_stopwatch:
    completion = openai_client.beta.chat.completions.parse(
        model=settings.azure_openai_chat_deployment,
        messages=[
            {
                "role": "system",
                "content": system_prompt,
            },
            {
                "role": "user",
                "content": user_content
            }
        ],
        response_format=VehicleInsurancePolicy,
        max_tokens=4096,
        temperature=0.1,
        top_p=0.1,
        logprobs=True # Enabled to determine the confidence of the response.
    )

BadRequestError: Error code: 400 - {'error': {'message': "Invalid Value: 'file'. This model does not support file content types.", 'type': 'invalid_request_error', 'param': 'messages[1].content[1].type', 'code': 'invalid_value'}}

### Understanding the Structured Outputs JSON schema

Using [Pydantic's JSON schema feature](https://docs.pydantic.dev/latest/concepts/json_schema/), the [Insurance Policy](../../modules/samples/models/vehicle_insurance_policy.py) data model is automatically converted to a JSON schema when applied to the `response_format` parameter of the OpenAI chat completions request.

The JSON schema is used to instruct the GPT-4o model to generate a strict output that adheres to the structure defined. The approach using Pydantic makes it easier for developers to manage the data structure in code, with helpful descriptions and examples that will be included in the final JSON schema.
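
As a small illustration of how field descriptions flow into the generated schema, the sketch below defines a hypothetical `ExamplePolicy` model (not the sample's `VehicleInsurancePolicy`) and prints one of its descriptions from the JSON schema. It assumes Pydantic v2 is installed.

```python
from typing import Optional

from pydantic import BaseModel, Field


class ExamplePolicy(BaseModel):
    """Hypothetical model used only to illustrate schema generation."""

    policy_number: Optional[str] = Field(
        default=None, description="The policy number, e.g. ABC-123."
    )
    effective_date: Optional[str] = Field(
        default=None, description="The policy start date in YYYY-MM-DD format."
    )


# Field descriptions are carried into the `properties` section of the schema.
schema = ExamplePolicy.model_json_schema()
print(schema["properties"]["policy_number"]["description"])
```

Descriptions written this way give the model per-field guidance without lengthening the prompt text.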

The following cell prints the JSON schema that is generated from the Insurance Policy data model for the OpenAI request:

In [None]:
# Highlight the schema sent to the OpenAI model
print(json.dumps(VehicleInsurancePolicy.model_json_schema(), indent=2))

## Visualize the outputs

To provide context for the execution of the code, the following code blocks visualize the outputs of the data extraction process.

This includes:

- The accuracy of the structured data extraction comparing the expected output with the output generated by Azure OpenAI's GPT-4o model.
- The confidence score of the structured data extraction based on the log probability of the output generated by Azure OpenAI's GPT-4o model.
- The execution time of the end-to-end process.
- The total number of tokens consumed by the GPT-4o model.
- The side-by-side comparison of the expected output and the output generated by Azure OpenAI's GPT-4o model.
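
To illustrate how a log probability relates to a confidence score, the sketch below converts token log probabilities to linear probabilities with `exp` and averages them. This is a simplified assumption for illustration only: the `openai_confidence` module used by this sample may aggregate token log probabilities differently, and the values here are made up.

```python
import math


def logprob_to_probability(logprob: float) -> float:
    """Convert a natural-log probability into a probability between 0 and 1."""
    return math.exp(logprob)


def average_confidence(token_logprobs: list[float]) -> float:
    """Average the linear probabilities of the tokens; 0.0 when there are none."""
    if not token_logprobs:
        return 0.0
    probabilities = [logprob_to_probability(lp) for lp in token_logprobs]
    return sum(probabilities) / len(probabilities)


# Made-up values: one token the model was certain of, one at 50% probability.
example_logprobs = [0.0, math.log(0.5)]
print(f"{average_confidence(example_logprobs):.2f}")
```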

### Understanding Accuracy vs Confidence

When using AI to extract structured data, both confidence and accuracy are essential for different but complementary reasons.

- **Accuracy** measures how close the AI model's output is to a ground truth or expected output. It reflects how well the model's predictions align with reality.
  - Accuracy ensures consistency in the extraction process, which is crucial for downstream tasks using the data.
- **Confidence** represents the AI model's internal assessment of how certain it is about its predictions.
  - Confidence indicates how certain the model is about its predictions. A low confidence score is a useful signal for human reviewers to step in for manual verification.

High accuracy and high confidence are ideal, but in practice, there is often a trade-off between the two. While accuracy cannot always be self-assessed, confidence scores can and should be used to prioritize manual verification of low-confidence predictions.
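
The distinction can be sketched with two small helpers: one that scores accuracy as the fraction of expected fields matched exactly, and one that lists fields whose confidence falls below a review threshold. Both helpers, the flat dictionaries, and the `0.8` threshold are hypothetical simplifications; the sample's `AccuracyEvaluator` and confidence output handle nested structures and may score differently.

```python
def field_accuracy(expected_values: dict, extracted_values: dict) -> float:
    """Fraction of expected fields whose extracted value matches exactly."""
    if not expected_values:
        return 1.0
    matches = sum(
        1 for key, value in expected_values.items()
        if extracted_values.get(key) == value
    )
    return matches / len(expected_values)


def fields_needing_review(field_confidence: dict, threshold: float = 0.8) -> list[str]:
    """Names of fields whose confidence is below the (assumed) threshold."""
    return sorted(
        field for field, score in field_confidence.items() if score < threshold
    )


# Made-up example: the date was extracted incorrectly, with low confidence.
expected_values = {"policy_number": "ABC-123", "effective_date": "2024-01-31"}
extracted_values = {"policy_number": "ABC-123", "effective_date": "2024-01-13"}
field_confidence = {"policy_number": 0.99, "effective_date": 0.62}

print(field_accuracy(expected_values, extracted_values))
print(fields_needing_review(field_confidence))
```

Accuracy needs a ground truth to compare against, whereas the review list can be produced for any document, which is why confidence is the practical signal in production.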

In [None]:
# Gets the parsed VehicleInsurancePolicy object from the completion response.
insurance_policy = completion.choices[0].message.parsed

expected_dict = expected.model_dump()
insurance_policy_dict = insurance_policy.model_dump()

In [None]:
# Determines the accuracy of the extracted data against the expected values.
accuracy = insurance_policy_evaluator.evaluate(expected=expected_dict, actual=insurance_policy_dict)

In [None]:
# Determines the confidence of the extracted data using the log probabilities of the OpenAI completion response.
confidence = evaluate_openai_confidence(insurance_policy_dict, completion.choices[0])

In [None]:
# Gets the total execution time of the data extraction process.
total_elapsed = image_stopwatch.elapsed + oai_stopwatch.elapsed

# Gets the prompt tokens and completion tokens from the completion response.
prompt_tokens = completion.usage.prompt_tokens
completion_tokens = completion.usage.completion_tokens

In [None]:
# Save the output of the data extraction result.
extraction_result = DataExtractionResult(insurance_policy_dict, confidence, accuracy, prompt_tokens, completion_tokens, total_elapsed)

create_json_file(f"{sample_path}/{sample_name}.{pdf_fname}.json", extraction_result)

In [None]:
# Display the outputs of the data extraction process.
df = pd.DataFrame([
    {
        "Accuracy": f"{accuracy['overall'] * 100:.2f}%",
        "Confidence": f"{confidence['_overall'] * 100:.2f}%",
        "Execution Time": f"{total_elapsed:.2f} seconds",
        # image_stopwatch times reading and base64-encoding the PDF file in this variant.
        "File Encoding Execution Time": f"{image_stopwatch.elapsed:.2f} seconds",
        "OpenAI Execution Time": f"{oai_stopwatch.elapsed:.2f} seconds",
        "Prompt Tokens": prompt_tokens,
        "Completion Tokens": completion_tokens
    }
])

display(df)
display(get_extraction_comparison(expected_dict, insurance_policy_dict, confidence, accuracy['accuracy']))