# Data Extraction - Azure AI Document Intelligence + Azure OpenAI GPT-4o

This sample demonstrates how to extract structured data from any document using Azure AI Document Intelligence and Azure OpenAI GPT models.

![Data Extraction](../../../images/extraction-document-intelligence-openai.png)

This is achieved by the following process:

- Analyze a document using Azure AI Document Intelligence's `prebuilt-layout` model to extract the structure as Markdown.
- Construct a system prompt that defines the instructions for extracting structured data from documents.
- Construct a user prompt that includes the specific extraction instructions for the type of document, and the Markdown content of the document.
- Use the Azure OpenAI chat completions API with the GPT-4o model to generate a structured output from the content.
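
The prompt construction steps above can be sketched as a small helper. This is a minimal illustration, not part of the sample's modules: the function name `build_messages` and its parameters are hypothetical, and the prompt text mirrors the wording used later in this sample.

```python
def build_messages(document_markdown: str, extraction_instructions: list[str]) -> list[dict]:
    """Build chat messages: a generic system prompt plus document-specific user prompts.

    Hypothetical helper for illustration; it is not part of the sample's modules.
    """
    system_prompt = "You are an AI assistant that extracts data from documents."
    # Join the document-type instructions into a bulleted user prompt.
    user_prompt = "Extract the data from this invoice.\n" + "\n".join(
        f"- {rule}" for rule in extraction_instructions
    )
    return [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": user_prompt},
        # The Markdown produced by the layout analysis is sent as its own message.
        {"role": "user", "content": document_markdown},
    ]


messages = build_messages(
    "# INVOICE\n\nInvoice Number: 3847193",
    [
        "If a value is not present, provide null.",
        "Dates should be in the format YYYY-MM-DD.",
    ],
)
```

Keeping the system prompt generic and the user prompt specific to the document type means the same helper could be reused for other document types by swapping the instruction list.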

## Objectives

By the end of this sample, you will have learned how to:

- Convert a document to Markdown format using Azure AI Document Intelligence.
- Use prompt engineering techniques to instruct GPT-4o to extract structured data from a type of document.
- Use the [Structured Outputs feature](https://learn.microsoft.com/en-us/azure/ai-services/openai/how-to/structured-outputs?tabs=python-secure) to extract structured data from a document using Azure OpenAI's GPT-4o model.
- Use the analysis result from Azure AI Document Intelligence to determine the confidence of the extracted structured output.
- Use the [logprobs](https://learn.microsoft.com/en-us/azure/ai-services/openai/reference#request-body:~:text=False-,logprobs,-integer) parameter in an OpenAI request to determine the confidence of the extracted structured output.

## Setup

### Import modules

This sample takes advantage of the following Python dependencies:

- **azure-ai-documentintelligence** to interface with the Azure AI Document Intelligence API for analyzing documents.
- **openai** to interface with the Azure OpenAI chat completions API to generate structured extraction outputs using the GPT-4o model.
- **azure-identity** to securely authenticate with deployed Azure Services using Microsoft Entra ID credentials.

The following local modules are also used:

- **modules.app_settings** to access environment variables from the `.env` file.
- **modules.accuracy_evaluator** to evaluate the accuracy of the extracted output against the expected output.
- **modules.comparison** to compare the output of the extraction process with expected results.
- **modules.confidence** to merge the confidence values from Azure AI Document Intelligence and Azure OpenAI.
- **modules.document_intelligence_confidence** to evaluate the confidence of the extraction process based on the extracted structured output and the analysis result from Azure AI Document Intelligence.
- **modules.document_processing_result** to store the results of the extraction process as a file.
- **modules.openai_confidence** to calculate the confidence of the extraction process based on the `logprobs` response from the API request.
- **modules.invoice** to provide the expected structured output JSON schema for invoice documents.
- **modules.utils** `Stopwatch` to measure the end-to-end execution time for the extraction process.

```python
import sys
sys.path.append('../../') # Import local modules

from IPython.display import display, Markdown
import os
import pandas as pd
import json
from dotenv import dotenv_values
from azure.ai.documentintelligence import DocumentIntelligenceClient
from azure.ai.documentintelligence.models import AnalyzeResult, DocumentContentFormat
from openai import AzureOpenAI
from azure.identity import DefaultAzureCredential, get_bearer_token_provider

from modules.app_settings import AppSettings
from modules.utils import Stopwatch
from modules.accuracy_evaluator import AccuracyEvaluator
from modules.comparison import get_extraction_comparison
from modules.confidence import merge_confidence_values
from modules.document_intelligence_confidence import evaluate_confidence as di_evaluate_confidence
from modules.openai_confidence import evaluate_confidence as oai_evaluate_confidence
from modules.invoice import Invoice
from modules.document_processing_result import DataExtractionResult
```

### Configure the Azure services

To use Azure AI Document Intelligence and Azure OpenAI, their SDKs are used to create client instances using a deployed endpoint and authentication credentials.

For this sample, the credentials of the Azure CLI are used to authenticate with the deployed services.

```python
# Set the working directory to the root of the repo
working_dir = os.path.abspath('../../../')
settings = AppSettings(dotenv_values(f"{working_dir}/.env"))

# Configure the default credential for accessing Azure services using Azure CLI credentials
credential = DefaultAzureCredential(
    exclude_workload_identity_credential=True,
    exclude_developer_cli_credential=True,
    exclude_environment_credential=True,
    exclude_managed_identity_credential=True,
    exclude_powershell_credential=True,
    exclude_shared_token_cache_credential=True,
    exclude_interactive_browser_credential=True
)

openai_token_provider = get_bearer_token_provider(credential, 'https://cognitiveservices.azure.com/.default')

openai_client = AzureOpenAI(
    azure_endpoint=settings.openai_endpoint,
    azure_ad_token_provider=openai_token_provider,
    api_version="2024-12-01-preview" # Requires the latest API version for structured outputs.
)

document_intelligence_client = DocumentIntelligenceClient(
    endpoint=settings.ai_services_endpoint,
    credential=credential
)
```

### Establish the expected output

To evaluate the accuracy of the extraction process, the expected output for an [Invoice](../../assets/invoices/invoice_1.pdf) is defined in a JSON metadata file and loaded in the code block below.

> **Note**: More invoice examples can be found in the [assets folder](../../assets/invoices). These examples include the PDF file and an associated JSON metadata file that provides the expected structured output. You can add your own scenarios by following the same structure.

```json
{
    "fname": "<name of the invoice file>",
    "expected": {
        "customer_name": "",
        "customer_address": {
            "street": "",
            "city": "",
            "state": "",
            "postal_code": "",
            "country": ""
        },
        "customer_tax_id": "",
        "shipping_address": "",
        "purchase_order": "",
        "invoice_id": "",
        "invoice_date": "",
        "payable_by": "",
        "vendor_name": "",
        "vendor_address": "",
        "vendor_tax_id": "",
        "remittance_address": "",
        "subtotal": 0,
        "total_discount": 0,
        "total_tax": 0,
        "invoice_total": 0,
        "payment_terms": "",
        "items": [
            {
                "product_code": "",
                "description": "",
                "quantity": 0,
                "tax": 0,
                "tax_rate": "",
                "unit_price": 0,
                "total": 0,
                "reason": null
            }
        ],
        "total_item_quantity": 0,
        "items_customer_signature": {
            "signatory": "",
            "is_signed": true
        },
        "items_vendor_signature": {
            "signatory": "",
            "is_signed": true
        },
        "returns": [
            {
                "product_code": "",
                "description": "",
                "quantity": 0,
                "tax": null,
                "tax_rate": null,
                "unit_price": null,
                "total": null,
                "reason": ""
            }
        ],
        "total_return_quantity": 0,
        "returns_customer_signature": {
            "signatory": "",
            "is_signed": true
        },
        "returns_vendor_signature": {
            "signatory": "",
            "is_signed": true
        }
    }
}
```

The expected output has been defined by a human evaluating the document.
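
When adding your own scenarios, it can help to check a metadata file before building the expected output from it. The following is a minimal sketch under the assumption that only the two top-level keys shown above (`fname` and `expected`) are required; the `load_metadata` helper and the sample values are hypothetical.

```python
import json

# Top-level keys taken from the metadata structure shown above.
REQUIRED_TOP_LEVEL_KEYS = {"fname", "expected"}


def load_metadata(raw_json: str) -> tuple[str, dict]:
    """Parse a metadata document and return (file name, expected fields).

    Hypothetical helper: raises ValueError when the structure is incomplete.
    """
    data = json.loads(raw_json)
    missing = REQUIRED_TOP_LEVEL_KEYS - data.keys()
    if missing:
        raise ValueError(f"Metadata is missing keys: {sorted(missing)}")
    if not isinstance(data["expected"], dict):
        raise ValueError("'expected' must be a JSON object")
    return data["fname"], data["expected"]


# Illustrative metadata; the values are made up for this sketch.
sample = json.dumps(
    {
        "fname": "my_invoice.pdf",
        "expected": {"invoice_id": "INV-001", "invoice_total": 10.5, "items": []},
    }
)
pdf_name, expected_fields = load_metadata(sample)
```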

```python
path = f"{working_dir}/samples/assets/invoices/"
metadata_fname = "invoice_1.json" # Change this to the file you want to evaluate
metadata_fpath = f"{path}{metadata_fname}"

with open(metadata_fpath, "r") as f:
    data = json.load(f)

expected = Invoice(**data['expected'])
pdf_fname = data['fname']
pdf_fpath = f"{path}{pdf_fname}"

invoice_evaluator = AccuracyEvaluator(match_keys=['product_code', 'description'])
```

## Extract data from the document

The following code block executes the data extraction process using Azure AI Document Intelligence and Azure OpenAI's GPT-4o model.

It performs the following steps:

1. Get the document bytes from the provided file path. _Note: In this example, we are processing a local document. However, you can use any document storage location of your choice, such as Azure Blob Storage._
2. Use Azure AI Document Intelligence to analyze the structure of the document and convert it to Markdown format using the pre-built layout model.
3. Using Azure OpenAI's GPT-4o model and its [Structured Outputs feature](https://learn.microsoft.com/en-us/azure/ai-services/openai/how-to/structured-outputs?tabs=python-secure), extract a structured data transfer object (DTO) from the content of the Markdown.

```python
with Stopwatch() as di_stopwatch:
    with open(pdf_fpath, "rb") as f:
        poller = document_intelligence_client.begin_analyze_document(
            model_id="prebuilt-layout",
            body=f,
            output_content_format=DocumentContentFormat.MARKDOWN,
            content_type="application/pdf"
        )

    result: AnalyzeResult = poller.result()

markdown = result.content
```

```python
with Stopwatch() as oai_stopwatch:
    completion = openai_client.beta.chat.completions.parse(
        model=settings.gpt4o_model_deployment_name,
        messages=[
            {
                "role": "system",
                "content": "You are an AI assistant that extracts data from documents.",
            },
            {
                "role": "user",
                # A plain string is enough here: the prompt has no placeholders to interpolate.
                "content": """Extract the data from this invoice. 
                - If a value is not present, provide null.
                - Dates should be in the format YYYY-MM-DD.""",
            },
            {
                "role": "user",
                "content": markdown,
            }
        ],
        response_format=Invoice,
        max_tokens=4096,
        temperature=0.1,
        top_p=0.1,
        logprobs=True # Enabled to determine the confidence of the response.
    )
```
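
The prompt asks for dates in `YYYY-MM-DD` format, and the model is not guaranteed to follow a formatting instruction given only in the prompt. As a hedged sketch, a post-extraction check such as the following could flag values that do not comply. The `is_iso_date` helper is hypothetical and is not part of the sample's modules.

```python
import re
from datetime import datetime

# Strict pattern: four-digit year, two-digit month, two-digit day.
ISO_DATE_PATTERN = re.compile(r"\d{4}-\d{2}-\d{2}")


def is_iso_date(value):
    """Return True when value is None (absent) or a valid YYYY-MM-DD calendar date."""
    if value is None:
        # The prompt allows null when a value is not present.
        return True
    if not ISO_DATE_PATTERN.fullmatch(value):
        return False
    try:
        # strptime rejects impossible dates such as 2024-02-30.
        datetime.strptime(value, "%Y-%m-%d")
        return True
    except ValueError:
        return False


checks = {
    "2024-05-16": is_iso_date("2024-05-16"),
    "5/16/2024": is_iso_date("5/16/2024"),
    "2024-02-30": is_iso_date("2024-02-30"),
}
```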

### Understanding the Structured Outputs JSON schema

Using [Pydantic's JSON schema feature](https://docs.pydantic.dev/latest/concepts/json_schema/), the [Invoice](../../modules/invoice.py) data model is automatically converted to a JSON schema when applied to the `response_format` parameter of the OpenAI chat completions request.

The JSON schema is used to instruct the GPT-4o model to generate a strict output that adheres to the structure defined. The approach using Pydantic makes it easier for developers to manage the data structure in code, with helpful descriptions and examples that will be included in the final JSON schema.

Demonstrated below, you can see how the Invoice data model is understood by the OpenAI request:

```python
# Highlight the schema sent to the OpenAI model
print(json.dumps(Invoice.model_json_schema(), indent=2))
```

```json
{
  "$defs": {
    "InvoiceAddress": {
      "description": "A class representing an address in an invoice.\n\nAttributes:\n    street: Street address\n    city: City, e.g. New York\n    state: State, e.g. NY\n    postal_code: Postal code, e.g. 10001\n    country: Country, e.g. USA",
      "properties": {
        "street": {
          "anyOf": [
            {
              "type": "string"
            },
            {
              "type": "null"
            }
          ],
          "description": "Street address, e.g. 123 Main St.",
          "title": "Street"
        },
        "city": {
          "anyOf": [
            {
              "type": "string"
            },
            {
              "type": "null"
            }
          ],
          "description": "City, e.g. New York",
          "title": "City"
        },
        "state": {
          "anyOf": [
            {
              "type": "string"
            },
            {
              "type": "null"
            }
```
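
In the schema output, each optional field appears as an `anyOf` pair of its value type and `null`, which is what allows the model to return null for missing values. The following sketch walks a schema fragment of that shape and reports which properties are nullable. The `nullable_fields` helper and the `line_count` property are hypothetical and exist only for illustration.

```python
def nullable_fields(schema_object: dict) -> dict[str, bool]:
    """Map each property name to whether its schema allows null."""
    result = {}
    for name, spec in schema_object.get("properties", {}).items():
        # A property either lists alternatives under "anyOf" or declares a single type.
        variants = spec.get("anyOf", [spec])
        result[name] = any(variant.get("type") == "null" for variant in variants)
    return result


# Fragment shaped like the address definition printed by the sample.
address_schema = {
    "properties": {
        "street": {"anyOf": [{"type": "string"}, {"type": "null"}], "title": "Street"},
        "city": {"anyOf": [{"type": "string"}, {"type": "null"}], "title": "City"},
        # Hypothetical non-nullable property, added only to show the contrast.
        "line_count": {"type": "integer", "title": "Line Count"},
    }
}

nullable = nullable_fields(address_schema)
print(nullable)  # {'street': True, 'city': True, 'line_count': False}
```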
          

## Visualize the outputs

To provide context for the execution of the code, the following code blocks visualize the outputs of the data extraction process.

This includes:

- The Markdown representation of the document structure as determined by Azure AI Document Intelligence.
- The accuracy of the structured data extraction comparing the expected output with the output generated by Azure OpenAI's GPT-4o model.
- The confidence score of the structured data extraction, determined from the Azure AI Document Intelligence analysis and the `logprobs` returned by Azure OpenAI.
- The execution time of the end-to-end process.
- The total number of tokens consumed by the GPT-4o model.
- The side-by-side comparison of the expected output and the output generated by Azure OpenAI's GPT-4o model.

### Understanding Accuracy vs Confidence

When using AI to extract structured data, both confidence and accuracy are essential for different but complementary reasons.

- **Accuracy** measures how close the AI model's output is to a ground truth or expected output. It reflects how well the model's predictions align with reality.
  - High accuracy means that the extracted data can be relied upon by downstream tasks that consume it.
- **Confidence** represents the AI model's internal assessment of how certain it is about its predictions.
  - A low confidence score is a useful signal for human reviewers to step in for manual verification.

High accuracy and high confidence are ideal, but in practice, there is often a trade-off between the two. While accuracy cannot always be self-assessed, confidence scores can and should be used to prioritize manual verification of low-confidence predictions.
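
To make the idea of confidence-driven review concrete, the sketch below converts token log probabilities into linear probabilities and flags fields that fall below a threshold. It is an illustration only: averaging token probabilities per field and the `0.9` threshold are assumptions, not the method used by `modules.openai_confidence`, and the log probability values are made up.

```python
import math


def field_confidence(token_logprobs: list[float]) -> float:
    """Assumption: average the linear probabilities of the tokens that make up a value."""
    if not token_logprobs:
        # No tokens means there is no evidence to score.
        return 0.0
    return sum(math.exp(logprob) for logprob in token_logprobs) / len(token_logprobs)


def fields_needing_review(confidence_by_field: dict, threshold: float = 0.9) -> list[str]:
    """Return the field names whose confidence is below the review threshold."""
    return sorted(name for name, score in confidence_by_field.items() if score < threshold)


# Made-up log probabilities for two extracted fields.
scores = {
    "invoice_id": field_confidence([-0.001, -0.003]),
    "customer_name": field_confidence([-0.7, -0.2]),
}
review = fields_needing_review(scores)
print(review)  # ['customer_name']
```

A log probability of `0.0` corresponds to a probability of 1, so values close to zero indicate tokens the model was nearly certain about.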

```python
# Displays the output of the Azure AI Document Intelligence pre-built layout analysis in Markdown format.
display(Markdown(markdown))
```

NEXGEN

Innovation drives progress


# INVOICE


<table>
<tr>
<th>Customer:</th>
<th colspan="2">Sharp Consulting</th>
<th colspan="2">Address:</th>
<th colspan="3">73 Regal Way, Leeds, LS1 5AB, UK</th>
</tr>
<tr>
<td>Delv. Date:</td>
<td>5/16/2024</td>
<td colspan="2">Invoice Number:</td>
<td colspan="2">3847193</td>
<td>Purchase Order:</td>
<td>15931</td>
</tr>
</table>


<table>
<tr>
<th>Item Code</th>
<th>Item Desc.</th>
<th>Unit Price</th>
<th>Quantity</th>
<th>Total</th>
</tr>
<tr>
<td>MA197</td>
<td>STRETCHWRAP ROLL</td>
<td>16.62</td>
<td>5</td>
<td>83.10</td>
</tr>
<tr>
<td>ST4086</td>
<td>BALLPOINT PEN MED.</td>
<td>2.49</td>
<td>10</td>
<td>24.90</td>
</tr>
<tr>
<td>JF9912413BF</td>
<td>BUBBLE FILM ROLL CL.</td>
<td>15.46</td>
<td>12</td>
<td>185.52</td>
</tr>
<tr>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
</tr>
<tr>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
</tr>
<tr>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
</tr>
<tr>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
</tr>
<tr>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
</tr>
<tr>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
</tr>
</table>


NOTES


<table>
<tr>
<th>Total Pcs .:</th>
<th>27</th>
<th>Total Price:</th>
<th>293.52</th>
</tr>
<tr>
<td colspan="4">Payable on or before 5/24/2024</td>
</tr>
<tr>
<td>Cust. Sig.</td>
<td>Sfr</td>
<td>Drivr. Sig.</td>
<td></td>
</tr>
<tr>
<td>Cust. Name:</td>
<td>Sarah H</td>
<td>Drivr. Name:</td>
<td>James T</td>
</tr>
</table>


<!-- PageBreak -->


<table>
<tr>
<td colspan="6">NEXGEN Innovation drives progress</td>
</tr>
<tr>
<td></td>
<td colspan="5">RETURNS</td>
</tr>
<tr>
<td>Customer:</td>
<td>Sharp Consulting</td>
<td colspan="2">Address:</td>
<td colspan="2">73 Regal Way, Leeds, LS1 5AB, UK</td>
</tr>
<tr>
<td>Coll. Date:</td>
<td>5/16/2024</td>
<td></td>
<td></td>
<td colspan="2"></td>
</tr>
<tr>
<td>Item Code</td>
<td>Item Desc.</td>
<td>Pcs.</td>
<td colspan="3">Reason</td>
</tr>
<tr>
<td>MA145</td>
<td>POSTAL TUBE BROWN</td>
<td>1</td>
<td colspan="3">This item was provided in previous order</td>
</tr>
<tr>
<td>JF7902</td>
<td>MAILBOX 25PK</td>
<td>1 7</td>
<td colspan="3">as replacement</td>
</tr>
<tr>
<td></td>
<td></td>
<td></td>
<td colspan="3">Not required</td>
</tr>
<tr>
<td></td>
<td></td>
<td></td>
<td colspan="3"></td>
</tr>
<tr>
<td colspan="6"></td>
</tr>
<tr>
<td>NOTES</td>
<td></td>
<td colspan="4"></td>
</tr>
<tr>
<td></td>
<td></td>
<td colspan="4"></td>
</tr>
<tr>
<td></td>
<td colspan="5"></td>
</tr>
<tr>
<td></td>
<td colspan="5"></td>
</tr>
<tr>
<td colspan="2"></td>
<td></td>
<td colspan="3"></td>
</tr>
<tr>
<td></td>
<td></td>
<td colspan="4"></td>
</tr>
<tr>
<td>Total Pcs .:</td>
<td>2</td>
<td></td>
<td colspan="3"></td>
</tr>
<tr>
<td colspan="6"></td>
</tr>
<tr>
<td>Cust. Sig.</td>
<td>8</td>
<td></td>
<td colspan="2">Drivr. Sig.</td>
<td>&amp;</td>
</tr>
<tr>
<td>Cust. Name:</td>
<td>Sarah H</td>
<td></td>
<td colspan="2">Drivr. Name:</td>
<td>James T</td>
</tr>
</table>


```python
# Gets the parsed Invoice object from the completion response.
invoice = completion.choices[0].message.parsed

expected_dict = expected.to_dict()
invoice_dict = invoice.to_dict()
```

```python
# Determines the accuracy of the extracted data against the expected values.
accuracy = invoice_evaluator.evaluate(expected=expected_dict, actual=invoice_dict)
```
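
The internals of `AccuracyEvaluator` are not shown in this sample. As a rough illustration of field-level accuracy, the sketch below flattens nested dictionaries into underscore-joined keys (such as `customer_address_city`) and reports the share of expected fields that match exactly. The `flatten` and `field_accuracy` helpers are hypothetical, and list values are compared as a whole rather than matched item by item using keys.

```python
def flatten(record: dict, prefix: str = "") -> dict:
    """Flatten nested dicts into underscore-joined keys, e.g. customer_address_city."""
    flat = {}
    for key, value in record.items():
        name = f"{prefix}{key}"
        if isinstance(value, dict):
            flat.update(flatten(value, prefix=f"{name}_"))
        else:
            flat[name] = value
    return flat


def field_accuracy(expected: dict, actual: dict) -> dict:
    """Compare every expected field with the extracted value using exact equality."""
    expected_flat, actual_flat = flatten(expected), flatten(actual)
    matches = {key: expected_flat[key] == actual_flat.get(key) for key in expected_flat}
    overall = sum(matches.values()) / len(matches) if matches else 1.0
    return {"fields": matches, "overall": overall}


expected_sample = {
    "customer_name": "Sharp Consulting",
    "customer_address": {"city": "Leeds", "country": "UK"},
    "invoice_total": 293.52,
}
# Illustrative extraction with one deliberate mismatch in the country field.
actual_sample = {
    "customer_name": "Sharp Consulting",
    "customer_address": {"city": "Leeds", "country": "GB"},
    "invoice_total": 293.52,
}
report = field_accuracy(expected_sample, actual_sample)
print(report["overall"])  # 0.75
```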

```python
# Determines the confidence of the extracted data using both the OpenAI and Azure Document Intelligence responses.
di_confidence = di_evaluate_confidence(invoice_dict, result)
oai_confidence = oai_evaluate_confidence(invoice_dict, completion.choices[0])

confidence = merge_confidence_values(di_confidence, oai_confidence)
```
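
How `merge_confidence_values` combines the two sources is not shown in this sample. One plausible strategy, sketched here purely as an assumption, is to keep the lower of the two scores for each field so that doubt from either source is preserved. The `merge_confidences` helper and the scores are hypothetical.

```python
def merge_confidences(first: dict, second: dict) -> dict:
    """Assumption: keep the lowest available score per field across both sources."""
    merged = {}
    for field in sorted(first.keys() | second.keys()):
        # A field may be scored by only one of the two sources.
        scores = [source[field] for source in (first, second) if field in source]
        merged[field] = min(scores)
    return merged


# Made-up per-field scores from two confidence sources.
layout_scores = {"invoice_id": 0.998, "customer_name": 0.95}
logprob_scores = {"invoice_id": 0.97, "invoice_date": 0.9999}

merged_scores = merge_confidences(layout_scores, logprob_scores)
```

Taking the minimum is a conservative choice. Averaging the two scores would be an alternative when neither source should dominate.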

```python
# Gets the total execution time of the data extraction process.
total_elapsed = di_stopwatch.elapsed + oai_stopwatch.elapsed

# Gets the prompt tokens and completion tokens from the completion response.
prompt_tokens = completion.usage.prompt_tokens
completion_tokens = completion.usage.completion_tokens
```
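
The `Stopwatch` used in this sample comes from `modules.utils`, and its implementation is not shown here. A context manager with an `elapsed` attribute can be sketched as follows. `SimpleStopwatch` is a hypothetical stand-in, not the module's actual class.

```python
import time


class SimpleStopwatch:
    """Hypothetical stand-in that measures the wall-clock time of a with-block."""

    def __enter__(self):
        self.elapsed = 0.0
        self._start = time.perf_counter()
        return self

    def __exit__(self, exc_type, exc_value, traceback):
        # Record the duration even when the block raises, and do not swallow the error.
        self.elapsed = time.perf_counter() - self._start
        return False


# Time two steps separately, then add them for an end-to-end figure.
with SimpleStopwatch() as first_step:
    time.sleep(0.01)

with SimpleStopwatch() as second_step:
    time.sleep(0.01)

total = first_step.elapsed + second_step.elapsed
```

Timing each service call separately, as the sample does, makes it possible to see which step dominates the end-to-end execution time.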

```python
# Save the output of the data extraction result.
extraction_result = DataExtractionResult(invoice_dict, confidence, accuracy, prompt_tokens, completion_tokens, total_elapsed)

with open(f"{working_dir}/samples/extraction/text-based/document-intelligence-openai.{pdf_fname}.json", "w") as f:
    f.write(extraction_result.to_json(indent=4))
```

```python
# Display the outputs of the data extraction process.
df = pd.DataFrame([
    {
        "Accuracy": f"{accuracy['overall'] * 100:.2f}%",
        "Confidence": f"{confidence['_overall'] * 100:.2f}%",
        "Execution Time": f"{total_elapsed:.2f} seconds",
        "Document Intelligence Execution Time": f"{di_stopwatch.elapsed:.2f} seconds",
        "OpenAI Execution Time": f"{oai_stopwatch.elapsed:.2f} seconds",
        "Prompt Tokens": prompt_tokens,
        "Completion Tokens": completion_tokens
    }
])

display(df)
display(get_extraction_comparison(expected_dict, invoice_dict, confidence, accuracy['accuracy']))
```

| | Accuracy | Confidence | Execution Time | Document Intelligence Execution Time | OpenAI Execution Time | Prompt Tokens | Completion Tokens |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 93.69% | 93.54% | 14.58 seconds | 5.79 seconds | 8.79 seconds | 3391 | 445 |

| | Field | Expected | Extracted | Confidence | Accuracy |
| --- | --- | --- | --- | --- | --- |
| 0 | customer_address_city | Leeds | Leeds | 98.70% | Match |
| 1 | customer_address_country | UK | UK | 98.70% | Match |
| 2 | customer_address_postal_code | LS1 5AB | LS1 5AB | 98.70% | Match |
| 3 | customer_address_state | | | 0.00% | Match |
| 4 | customer_address_street | 73 Regal Way | 73 Regal Way | 98.70% | Match |
| 5 | customer_name | Sharp Consulting | Sharp Consulting | 99.40% | Match |
| 6 | customer_tax_id | | | 0.00% | Match |
| 7 | invoice_date | 2024-05-16 | 2024-05-16 | 99.99% | Match |
| 8 | invoice_id | 3847193 | 3847193 | 99.80% | Match |
| 9 | invoice_total | 293.520000 | 293.520000 | 99.40% | Match |
