
# Azure Form Recognizer - PDF Data Extraction

## Overview
This script demonstrates how to use Azure Form Recognizer’s custom and prebuilt models to extract specific fields and their confidence scores from an invoice PDF. The extracted data includes the invoice number, invoice date, PO number, a phone number, and seller and buyer details (names, addresses, and tax IDs).

## Dependencies
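- `azure-ai-formrecognizer`: Azure SDK for Form Recognizer (provides `DocumentAnalysisClient`)
- `azure-core`: Azure core SDK (provides `AzureKeyCredential` for key-based authentication)
- `pandas`: Python data manipulation library (imported by the script, though the code shown here does not use it)
- `json`: Part of the Python standard library, so no installation is needed; used to save the extracted data in JSON format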

## Code Explanation

### 1. **Initialize the Azure Form Recognizer Client**
The script uses the `DocumentAnalysisClient` from Azure to connect to the Form Recognizer service. The endpoint and API key are used to authenticate the client.
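
The setup can be sketched as follows, assuming the endpoint and key are supplied through environment variables rather than written into the script. The variable names `FORM_RECOGNIZER_ENDPOINT` and `FORM_RECOGNIZER_KEY` and the helpers `load_settings` and `make_client` are hypothetical; only `DocumentAnalysisClient` and `AzureKeyCredential` come from the SDK.

```python
import os


def load_settings(env=os.environ):
    """Return (endpoint, key) from an environment-style mapping.

    The two variable names are assumptions made for this sketch.
    """
    try:
        return env["FORM_RECOGNIZER_ENDPOINT"], env["FORM_RECOGNIZER_KEY"]
    except KeyError as missing:
        raise RuntimeError(f"Missing setting: {missing.args[0]}") from None


def make_client(env=os.environ):
    """Build a DocumentAnalysisClient from the settings above."""
    # Imported here so load_settings can be used without the SDK installed.
    from azure.ai.formrecognizer import DocumentAnalysisClient
    from azure.core.credentials import AzureKeyCredential

    endpoint, key = load_settings(env)
    return DocumentAnalysisClient(endpoint=endpoint, credential=AzureKeyCredential(key))
```

Keeping the key out of the source file means the notebook can be shared without exposing the credential.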



### 2. **Extract Fields Using Custom Model**
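The `extract_fields_from_pdf` function analyzes the provided PDF with the custom model identified by `model_id`. It reads fields such as `invoice_number` and `invoice_date`, renames them to display labels through `key_mapping` (field names without a mapping are kept as-is), and returns two dictionaries: the field values and their confidence scores.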
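
The renaming step can be tried without calling the service. The sketch below uses `SimpleNamespace` objects as stand-ins for the analysis result; the helper name `collect_fields` and the sample values are invented for illustration.

```python
from types import SimpleNamespace


def collect_fields(documents, key_mapping):
    """Collect values and confidences, renaming fields via key_mapping."""
    values, confidences = {}, {}
    for document in documents:
        for name, field in document.fields.items():
            # Fall back to the original name when no label is mapped.
            label = key_mapping.get(name, name)
            values[label] = field.value
            confidences[label] = field.confidence
    return values, confidences


# Stand-in for one analyzed document (made-up data, not real service output).
fake_document = SimpleNamespace(fields={
    "invoice_number": SimpleNamespace(value="INV-001", confidence=0.9),
    "Country": SimpleNamespace(value="UAE", confidence=0.8),
})

values, confidences = collect_fields([fake_document], {"invoice_number": "Invoice Number"})
print(values)       # {'Invoice Number': 'INV-001', 'Country': 'UAE'}
print(confidences)  # {'Invoice Number': 0.9, 'Country': 0.8}
```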


### 3. **Extract Fields Using Prebuilt Invoice Model**
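The `pdf_to_json` function extracts fields from invoices using Azure’s prebuilt invoice model (`prebuilt-invoice`). It reads the `VendorName`, `InvoiceDate`, and `CustomerName` fields and returns them under the labels `Seller Name`, `Invoice Date`, and `Client Company Name`. A field the model does not return is reported as `None`, for both its value and its confidence.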
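
A small sketch of the label-to-field lookup, again with stand-in objects instead of a real service response. The helper `pick_fields` and the sample values are hypothetical; the three prebuilt field names are the ones the script reads.

```python
from types import SimpleNamespace

# Output label -> field name in the prebuilt invoice result.
PREBUILT_FIELDS = {
    "Seller Name": "VendorName",
    "Invoice Date": "InvoiceDate",
    "Client Company Name": "CustomerName",
}


def pick_fields(fields, wanted=PREBUILT_FIELDS):
    """Return (data, confidences) for the wanted labels; missing fields give None."""
    data, confidences = {}, {}
    for label, source_name in wanted.items():
        field = fields.get(source_name)  # None when the field was not detected
        data[label] = str(field.value) if field else None
        confidences[label] = field.confidence if field else None
    return data, confidences


# Made-up fields; CustomerName is deliberately absent.
fake_fields = {
    "VendorName": SimpleNamespace(value="Example Vendor LLC", confidence=0.88),
    "InvoiceDate": SimpleNamespace(value="2024-01-31", confidence=0.95),
}

data, confidences = pick_fields(fake_fields)
print(data)
print(confidences)
```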



### 4. **Combine Data and Confidence Scores**
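The `extract_and_combine_data` function merges the results from the custom and prebuilt models into one dictionary of values and one dictionary of confidence scores, then writes them to `combined_data.json` and `combined_confidence.json`. When both models return the same label (for example `Seller Name` or `Invoice Date`), the prebuilt model's entry overwrites the custom model's entry.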
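
The merge relies on dictionary unpacking, where the right-hand dictionary wins on duplicate keys. The sketch below shows that behavior with made-up values and, as one possible alternative, a hypothetical helper `merge_keep_non_null` that keeps the first value when the second is `None`. The helper is not part of the script.

```python
import json


def merge_keep_non_null(first, second):
    """Merge two dicts; a None in `second` does not overwrite a value in `first`."""
    merged = dict(first)
    for key, value in second.items():
        if value is not None or key not in merged:
            merged[key] = value
    return merged


# Made-up results standing in for the two models' outputs.
custom = {"Invoice Number": "INV-001", "Seller Name": "Example Vendor"}
prebuilt = {"Seller Name": None, "Client Company Name": "Example Client"}

plain = {**custom, **prebuilt}                    # right-hand side wins, even when None
careful = merge_keep_non_null(custom, prebuilt)   # keeps "Example Vendor"

print(json.dumps(plain, indent=4))
print(json.dumps(careful, indent=4))
```

Which behavior is preferable depends on which model is trusted more for the shared fields.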



### 5. **Helper Function for Field Extraction**
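The `extract_field` function returns a field's value converted to a string, so that values such as dates can be written to JSON. When the field is missing, it returns `default_value` (`None` unless specified).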
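
The helper's behavior can be checked with stand-in field objects. The function body is the one from the script; the `SimpleNamespace` inputs are made up for illustration.

```python
from datetime import date
from types import SimpleNamespace


def extract_field(field, default_value=None):
    """Return the field's value as a string, or default_value if the field is missing."""
    return str(field.value) if field else default_value


date_field = SimpleNamespace(value=date(2024, 9, 24), confidence=0.958)

print(extract_field(date_field))  # 2024-09-24
print(extract_field(None))        # None
print(extract_field(None, ""))    # empty string
```

Note that a field that is present but whose value is `None` comes back as the string `"None"`, because `str(None)` is applied.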



### 6. **Example Usage**
Finally, the script calls the `extract_and_combine_data` function with a sample PDF file path to demonstrate how the data is processed and printed.


---


```python
import json
import os

from azure.ai.formrecognizer import DocumentAnalysisClient
from azure.core.credentials import AzureKeyCredential
import pandas as pd

# Service settings. The key is read from the environment instead of being
# hardcoded; both environment variable names are assumptions, not Azure conventions.
endpoint = os.environ.get("FORM_RECOGNIZER_ENDPOINT", "https://shivi1.cognitiveservices.azure.com/")
api_key = os.environ["FORM_RECOGNIZER_KEY"]
model_id = "trail2"

# Function 1: Extract fields and confidence scores using custom model
def extract_fields_from_pdf(pdf_file_path):
    document_analysis_client = DocumentAnalysisClient(
        endpoint=endpoint,
        credential=AzureKeyCredential(api_key)
    )
    with open(pdf_file_path, "rb") as f:
        poller = document_analysis_client.begin_analyze_document(model_id, f)
        result = poller.result()

    extracted_fields = {}
    confidence_scores = {}
    # Custom-model field name -> display label used in the output
    key_mapping = {
        "invoice_number": "Invoice Number",
        "invoice_date": "Invoice Date",
        "Country": "Country",
        "seller_address": "Seller Address",
        "phone_number": "Phone No",
        "seller_taxid": "Seller TaxId",
        "po_number": "PO Number",
        "buyer_taxif": "Buyer TaxId",
        "seller_name": "Seller Name",
        "buyer_address": "Buyer Address"
    }
    for document in result.documents:
        for field_name, field_value in document.fields.items():
            # Fields without a mapping keep their original name
            new_field_name = key_mapping.get(field_name, field_name)
            extracted_fields[new_field_name] = field_value.value
            confidence_scores[new_field_name] = field_value.confidence

    return extracted_fields, confidence_scores

# Function 2: Extract fields and confidence scores using prebuilt-invoice model
def pdf_to_json(file_path):
    document_analysis_client = DocumentAnalysisClient(endpoint=endpoint, credential=AzureKeyCredential(api_key))
    with open(file_path, "rb") as file:
        poller = document_analysis_client.begin_analyze_document("prebuilt-invoice", file)
        result = poller.result()

    data = {}
    confidence_scores = {}
    for invoice in result.documents:
        fields_to_extract = {
            "Seller Name": invoice.fields.get("VendorName"),
            "Invoice Date": invoice.fields.get("InvoiceDate"),
            "Client Company Name": invoice.fields.get("CustomerName"),
        }

        for key, field in fields_to_extract.items():
            data[key] = extract_field(field)
            confidence_scores[key] = field.confidence if field else None

    return data, confidence_scores

# Helper function for field extraction
def extract_field(field, default_value=None):
    return str(field.value) if field else default_value

# Main function to extract data and confidence scores, merge them, and write to JSON files
def extract_and_combine_data(pdf_file_path):
    data1, confidence1 = extract_fields_from_pdf(pdf_file_path)
    data2, confidence2 = pdf_to_json(pdf_file_path)

    # On duplicate keys, the prebuilt-invoice entries (data2, confidence2) win
    combined_data = {**data1, **data2}
    combined_confidence = {**confidence1, **confidence2}

    # Write combined data to JSON file
    with open("combined_data.json", "w") as json_file:
        json.dump(combined_data, json_file, indent=4)

    # Write combined confidence scores to JSON file
    with open("combined_confidence.json", "w") as json_file:
        json.dump(combined_confidence, json_file, indent=4)

    return [combined_data, combined_confidence]

# Example usage
pdf_file_path = "/content/Copy of AUH Invoice - WO 1.pdf"
result = extract_and_combine_data(pdf_file_path)

# Print the results
print("Combined Data:")
print(json.dumps(result[0], indent=4))
print("\nCombined Confidence Scores:")
print(json.dumps(result[1], indent=4))
```


```text
Combined Data:
{
    "Invoice Number": "No:INV/MAG/3859/2024",
    "Invoice Date": "2024-09-24",
    "PO Number": "202450195",
    "Country": "DHABI-UAE",
    "Seller TaxId": "100381552700003",
    "Buyer Address": "M/S. HAVELOCK ONE INTERIORS LLC PO BOX: 30096",
    "Buyer TaxId": "100316509700003",
    "Phone No": "+971 2041005",
    "Seller Address": null,
    "Seller Name": "MANAR AL GHARB ELECTROMECHANICAL INST LLC",
    "Client Company Name": "M/S. HAVELOCK ONE INTERIORS LLC"
}

Combined Confidence Scores:
{
    "Invoice Number": 0.772,
    "Invoice Date": 0.958,
    "PO Number": 0.983,
    "Country": 0.948,
    "Seller TaxId": 0.947,
    "Buyer Address": 0.395,
    "Buyer TaxId": 0.92,
    "Phone No": 0.918,
    "Seller Address": 0.912,
    "Seller Name": 0.885,
    "Client Company Name": 0.915
}
```
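
The confidence scores in this output vary widely, from 0.983 for `PO Number` down to 0.395 for `Buyer Address`. One way to use them, sketched here under the assumption that a cutoff of 0.8 is acceptable, is to list the fields that may need manual review. The helper `fields_needing_review` and the threshold are illustrative choices, not part of the script; the sample scores are copied from the output.

```python
def fields_needing_review(confidences, threshold=0.8):
    """Return (label, score) pairs whose score is missing or below threshold."""
    flagged = [
        (label, score)
        for label, score in confidences.items()
        if score is None or score < threshold
    ]
    # Missing scores first, then lowest confidence upward.
    return sorted(flagged, key=lambda item: -1.0 if item[1] is None else item[1])


scores = {
    "Invoice Number": 0.772,
    "PO Number": 0.983,
    "Buyer Address": 0.395,
    "Seller Name": 0.885,
}

for label, score in fields_needing_review(scores):
    print(f"{label}: {score}")
# Buyer Address: 0.395
# Invoice Number: 0.772
```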



---

## Conclusion
This script demonstrates how to extract key fields and their respective confidence scores from a PDF invoice using Azure’s Form Recognizer service. It combines results from both a custom model and a prebuilt model, providing a comprehensive view of the extracted data and their confidence levels.

