# Data Extraction - Azure AI Document Intelligence + Phi-3.5-mini

This sample demonstrates how to extract structured data from any document using Azure AI Document Intelligence and small language models, such as Microsoft's Phi-3.5-mini.

This is achieved by the following process:

- Analyze a document using Azure AI Document Intelligence's `prebuilt-layout` model to extract the structure as Markdown.
- Construct a system prompt that defines the instructions for extracting structured data from documents.
- Construct a user prompt that includes specific extraction instructions for the type of document, the expected JSON schema, and the Markdown content of the document.
- Use the chat completions API with the Phi-3.5-mini model to generate a structured output from the content.
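
The prompt construction steps above can be sketched as a small helper. This is a minimal illustration, not the sample's own code: `build_messages` and `SYSTEM_PROMPT` are hypothetical names, and the trimmed-down schema passed in at the end is an assumption used only to keep the example short. The prompt wording mirrors the extraction cell later in this sample.

```python
import json

# Prompt text mirrors the extraction cell in this sample; the constant name is hypothetical.
SYSTEM_PROMPT = (
    "You are an AI assistant that extracts data from documents "
    "and returns them as structured JSON objects."
)


def build_messages(schema_example: dict, markdown: str) -> list[dict]:
    """Assemble the chat messages: system prompt, extraction instructions, document content."""
    instructions = (
        "Extract the data from this invoice.\n"
        "- If a value is not present, provide null.\n"
        "- **Do not return as a JSON code block.**\n"
        f"- Use the following structure: {json.dumps(schema_example)}"
    )
    return [
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": instructions},
        {"role": "user", "content": markdown},
    ]


# Hypothetical, trimmed-down schema and document content for illustration only.
messages = build_messages(
    {"invoice_number": "", "customer_name": ""},
    "# Invoice\n\nInvoice number: 3847193",
)
print(json.dumps(messages, indent=2))
```

Keeping the document content in its own message separates the reusable instructions from the per-document input, so the same instructions can be applied to other invoices unchanged.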

## Objectives

By the end of this sample, you will have learned how to:

- Convert a document to Markdown format using Azure AI Document Intelligence.
- Use prompt engineering techniques to instruct Phi-3.5-mini to extract structured data from a type of document using example JSON structures.
- Use the analysis result from Azure AI Document Intelligence to determine the confidence of the extracted structured output.

## Setup

### Import modules

This sample takes advantage of the following Python dependencies:

- **azure-ai-documentintelligence** to interface with the Azure AI Document Intelligence API for analyzing documents.
- **openai** to use the default OpenAI client implementation to interface with a deployed Phi-3.5-mini serverless endpoint in Azure AI Studio. _Note: The serverless endpoint for Phi-3.5-mini uses the same API schema as the OpenAI API._
- **azure-identity** to securely authenticate with deployed Azure Services using Microsoft Entra ID credentials.

The following local modules are also used:

- **modules.app_settings** to access environment variables from the `.env` file.
- **modules.comparison** to compare the output of the extraction process with expected results.
- **modules.document_intelligence_confidence** to evaluate the confidence of the extraction process based on the extracted structured output and the analysis result from Azure AI Document Intelligence.
- **modules.document_processing_result** to store the results of the extraction process as a file.
- **modules.invoice** to provide the expected structured output JSON schema for invoice documents.
- **modules.stopwatch** to measure the end-to-end execution time for the extraction process.
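
The local modules are not reproduced in this sample. As a rough illustration of what a timing helper of this kind might look like, the following is a minimal sketch of a context manager that exposes an `elapsed` attribute in seconds. `SimpleStopwatch` is a hypothetical name, and the code is an assumption made for illustration, not the actual implementation of `modules.stopwatch`.

```python
import time


class SimpleStopwatch:
    """Hypothetical timing helper; not the actual modules.stopwatch implementation."""

    def __init__(self) -> None:
        self.elapsed = 0.0

    def __enter__(self) -> "SimpleStopwatch":
        # perf_counter is a monotonic clock suited to measuring short durations.
        self._start = time.perf_counter()
        return self

    def __exit__(self, exc_type, exc_value, traceback) -> None:
        # Record the elapsed time even if the timed block raised an exception.
        self.elapsed = time.perf_counter() - self._start


with SimpleStopwatch() as stopwatch:
    sum(range(100_000))

print(f"{stopwatch.elapsed:.4f} seconds")
```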

In [1]:
import sys
sys.path.append('../../') # Import local modules

from IPython.display import display, Markdown
import os
import pandas as pd
import json
from dotenv import dotenv_values
from azure.ai.documentintelligence import DocumentIntelligenceClient
from azure.ai.documentintelligence.models import AnalyzeResult, ContentFormat
from openai import OpenAI
from azure.identity import DefaultAzureCredential

from modules.app_settings import AppSettings
from modules.comparison import extraction_comparison
from modules.document_intelligence_confidence import evaluate_confidence
from modules.document_processing_result import DataExtractionResult
from modules.invoice import Invoice, InvoiceProduct, InvoiceSignature, InvoiceEvaluator
from modules.stopwatch import Stopwatch

### Configure the Azure services

Client instances for Azure AI Document Intelligence and the serverless model endpoint are created with their respective SDKs, using each service's deployed endpoint and authentication credentials.

For this sample, the credentials of the Azure CLI are used to authenticate with Azure AI Document Intelligence. The OpenAI client for the Phi-3.5-mini model uses an API key to authenticate with the serverless endpoint.

In [2]:
# Set the working directory to the root of the repo
working_dir = os.path.abspath('../../../')
settings = AppSettings(dotenv_values(f"{working_dir}/.env"))

# Configure the default credential for accessing Azure services using Azure CLI credentials
credential = DefaultAzureCredential(
    exclude_workload_identity_credential=True,
    exclude_developer_cli_credential=True,
    exclude_environment_credential=True,
    exclude_managed_identity_credential=True,
    exclude_powershell_credential=True,
    exclude_shared_token_cache_credential=True,
    exclude_interactive_browser_credential=True
)

# The serverless endpoint for Phi-3.5-mini uses the same API as OpenAI so we can use the OpenAI client
openai_client = OpenAI(
    base_url=settings.phi35_mini_endpoint,
    api_key=settings.phi35_mini_primary_key,
)

document_intelligence_client = DocumentIntelligenceClient(
    endpoint=settings.ai_services_endpoint,
    credential=credential
)

### Establish the expected output

To evaluate the accuracy of the extraction process, the expected output is defined in the following code block based on the details of the [Invoice](../../assets/Invoice.pdf).

The expected output has been defined by a human evaluating the document.

In [3]:
pdf_path = f"{working_dir}/samples/assets/"
pdf_file_name = "Invoice.pdf"
fname = f"{pdf_path}{pdf_file_name}"

expected = Invoice(
    invoice_number='3847193',
    purchase_order_number='15931',
    customer_name='Sharp Consulting',
    customer_address='73 Regal Way, Leeds, LS1 5AB, UK',
    delivery_date='2024-05-16',
    payable_by='2024-05-24',
    products=[
        InvoiceProduct(
            id='MA197',
            description='STRETCHWRAP ROLL',
            unit_price=16.62,
            quantity=5,
            total=83.10,
            reason=None
        ),
        InvoiceProduct(
            id='ST4086',
            description='BALLPOINT PEN MED.',
            unit_price=2.49,
            quantity=10,
            total=24.90,
            reason=None
        ),
        InvoiceProduct(
            id='JF9912413BF',
            description='BUBBLE FILM ROLL CL.',
            unit_price=15.46,
            quantity=12,
            total=185.52,
            reason=None
        ),
    ],
    returns=[
        InvoiceProduct(
            id='MA145',
            description='POSTAL TUBE BROWN',
            unit_price=None,
            quantity=1,
            total=None,
            reason='This item was provided in previous order as a replacement'
        ),
        InvoiceProduct(
            id='JF7902',
            description='MAILBOX 25PK',
            unit_price=None,
            quantity=1,
            total=None,
            reason='Not required'
        ),
    ],
    total_product_quantity=27,
    total_product_price=293.52,
    product_signatures=[
        InvoiceSignature(
            type='Customer',
            name='Sarah H',
            is_signed=True
        ),
        InvoiceSignature(
            type='Driver',
            name='James T',
            is_signed=True
        )
    ],
    returns_signatures=[
        InvoiceSignature(
            type='Customer',
            name='Sarah H',
            is_signed=True
        ),
        InvoiceSignature(
            type='Driver',
            name='James T',
            is_signed=True
        )
    ]   
)

invoice_evaluator = InvoiceEvaluator(expected)

## Extract data from the document

The following code block executes the data extraction process using Azure AI Document Intelligence and Phi-3.5-mini.

It performs the following steps:

1. Get the document bytes from the provided file path. _Note: This example processes a local document, but you can use any document storage location of your choice, such as Azure Blob Storage._
2. Use Azure AI Document Intelligence to analyze the structure of the document and convert it to Markdown format using the pre-built layout model.
3. Using a Phi-3.5-mini Serverless Endpoint model deployment in Azure AI Studio and prompt engineering techniques, extract a structured data transfer object (DTO) from the content of the Markdown.

In [4]:
with Stopwatch() as di_stopwatch:
    with open(fname, "rb") as f:
        poller = document_intelligence_client.begin_analyze_document(
            "prebuilt-layout",
            analyze_request=f,
            output_content_format=ContentFormat.MARKDOWN,
            content_type="application/pdf"
        )
        
    result: AnalyzeResult = poller.result()

markdown = result.content

with Stopwatch() as oai_stopwatch:
    completion = openai_client.chat.completions.create(
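        # Note: this setting is named for GPT-4o; the request itself is sent to the
        # Phi-3.5-mini endpoint configured as the client's base_url above.
        model=settings.gpt4o_model_deployment_name,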
        messages=[
            {
                "role": "system",
                "content": "You are an AI assistant that extracts data from documents and returns them as structured JSON objects.",
            },
            {
                "role": "user",
                "content": f"""Extract the data from this invoice. 
                - If a value is not present, provide null.
                - **Do not return as a JSON code block.**
                - Use the following structure: {json.dumps(Invoice.example().to_dict())}""",
            },
            {
                "role": "user",
                "content": markdown,
            }
        ],
        max_tokens=4096,
        temperature=0.1,
        top_p=0.1
    )

## Visualize the outputs

To provide context for the execution of the code, the following code blocks visualize the outputs of the data extraction process.

This includes:

- The Markdown representation of the document structure as determined by Azure AI Document Intelligence.
- The accuracy of the structured data extraction comparing the expected output with the output generated by Phi-3.5-mini.
- The confidence score of the structured data extraction by comparing against the Azure AI Document Intelligence analysis.
- The execution time of the end-to-end process.
- The total number of tokens consumed by the Phi-3.5-mini model.
- The side-by-side comparison of the expected output and the output generated by Phi-3.5-mini.

### Understanding Accuracy vs Confidence

When using AI to extract structured data, both confidence and accuracy are essential for different but complementary reasons.

- **Accuracy** measures how close the AI model's output is to a ground truth or expected output. It reflects how well the model's predictions align with reality.
  - Accuracy ensures consistency in the extraction process, which is crucial for downstream tasks using the data.
- **Confidence** represents the AI model's internal assessment of how certain it is about its predictions.
  - Confidence indicates how certain the model is about each prediction, so low confidence is a useful signal for human reviewers to step in for manual verification.

High accuracy and high confidence are ideal, but in practice the two do not always coincide. Accuracy can only be measured where a ground truth exists, whereas confidence scores are available for every prediction and should be used to prioritize manual verification of low-confidence predictions.
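
To make the distinction concrete, the following minimal sketch scores a few fields for accuracy against a ground truth and separately flags low-confidence fields for review. The helper names, the `0.8` threshold, the confidence values, and the deliberately wrong `payable_by` value are all assumptions for illustration; this is not how `InvoiceEvaluator` or `evaluate_confidence` are implemented.

```python
def field_accuracy(expected: dict, actual: dict) -> dict:
    """Score each field 1.0 on an exact match, else 0.0, and add an 'overall' mean."""
    scores = {key: float(actual.get(key) == value) for key, value in expected.items()}
    scores["overall"] = sum(scores.values()) / len(expected) if expected else 0.0
    return scores


def fields_needing_review(confidence: dict, threshold: float = 0.8) -> list[str]:
    """List fields whose confidence is below the threshold, skipping '_'-prefixed summary keys."""
    return sorted(
        key
        for key, value in confidence.items()
        if not key.startswith("_") and value < threshold
    )


# Ground truth taken from this sample's expected invoice values.
expected = {
    "invoice_number": "3847193",
    "customer_name": "Sharp Consulting",
    "payable_by": "2024-05-24",
}
# Hypothetical model output with one wrong field, plus hypothetical confidence scores.
actual = {
    "invoice_number": "3847193",
    "customer_name": "Sharp Consulting",
    "payable_by": "2024-05-25",
}
confidence = {"invoice_number": 0.99, "customer_name": 0.62, "payable_by": 0.97, "_overall": 0.86}

accuracy = field_accuracy(expected, actual)
review = fields_needing_review(confidence)

print(f"Overall accuracy: {accuracy['overall']:.2%}")
print(f"Fields to review: {review}")
```

In this made-up case, the wrong field (`payable_by`) has high confidence while a correct field (`customer_name`) has low confidence, which is why the two measures are reported separately.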

In [None]:
# Displays the output of the Azure AI Document Intelligence pre-built layout analysis in Markdown format.
display(Markdown(markdown))

In [None]:
# Gets the JSON response from the completion and converts it to an Invoice object.
response_json = completion.choices[0].message.content
invoice = Invoice.from_json(response_json)

# Determines the accuracy of the extracted data against the expected values.
accuracy = invoice_evaluator.evaluate(invoice)

# Determines the confidence of the extracted data by matching it against the result of the Azure AI Document Intelligence pre-built layout analysis.
confidence = evaluate_confidence(invoice.to_dict(), result)

# Gets the total execution time of the data extraction process.
total_elapsed = di_stopwatch.elapsed + oai_stopwatch.elapsed

# Gets the prompt tokens and completion tokens from the completion response.
prompt_tokens = completion.usage.prompt_tokens
completion_tokens = completion.usage.completion_tokens

# Save the output of the data extraction result.
extraction_result = DataExtractionResult(invoice.to_dict(), confidence, accuracy, prompt_tokens, completion_tokens, total_elapsed)

with open(f"{working_dir}/samples/extraction/text-based/document-intelligence-phi.{pdf_file_name}.json", "w") as f:
    f.write(extraction_result.to_json(indent=4))
    
# Display the outputs of the data extraction process.
df = pd.DataFrame([
    {
        "Accuracy": f"{accuracy['overall'] * 100:.2f}%",
        "Confidence": f"{confidence['_overall'] * 100:.2f}%",
        "Execution Time": f"{total_elapsed:.2f} seconds",
        "Document Intelligence Execution Time": f"{di_stopwatch.elapsed:.2f} seconds",
        "Phi-3.5-mini Execution Time": f"{oai_stopwatch.elapsed:.2f} seconds",
        "Prompt Tokens": prompt_tokens,
        "Completion Tokens": completion_tokens
    }
])

display(Markdown(df.to_markdown(index=False, tablefmt='unsafehtml')))
display(Markdown(extraction_comparison(expected.to_dict(), invoice.to_dict(), confidence)))
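
The prompt asks the model not to wrap its answer in a JSON code block, but a model may not always follow that instruction, in which case parsing the raw response would fail. The following is a minimal sketch of a more defensive parsing step that could run before `Invoice.from_json`. `parse_json_response` is a hypothetical helper and is not part of this sample's modules.

```python
import json
import re

# Three backticks, built programmatically so the marker is easy to embed in this example.
FENCE = "`" * 3

# Matches a response that consists solely of one fenced block, optionally tagged "json".
_FENCED = re.compile(rf"^{FENCE}(?:json)?\s*\n(.*?)\n?{FENCE}\s*$", re.DOTALL)


def parse_json_response(content: str) -> dict:
    """Parse a model response as JSON, unwrapping a surrounding code fence if present."""
    text = content.strip()
    match = _FENCED.match(text)
    if match:
        text = match.group(1)
    # Raises json.JSONDecodeError if the remaining text is still not valid JSON.
    return json.loads(text)


plain = '{"invoice_number": "3847193"}'
fenced = FENCE + 'json\n{"invoice_number": "3847193"}\n' + FENCE

print(parse_json_response(plain))
print(parse_json_response(fenced))
```

Letting `json.JSONDecodeError` propagate keeps malformed responses visible instead of silently producing an empty result.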