# Conduct complex analysis with Pro mode

> #################################################################################
>
> **Note:** Pro mode is currently available only for `document` data.  
> [Supported file types](https://learn.microsoft.com/en-us/azure/ai-services/content-understanding/service-limits#document-and-text): pdf, tiff, jpg, jpeg, png, bmp, heif
>
> #################################################################################

This notebook demonstrates how to use **Pro mode** in Azure AI Content Understanding to enhance your analyzer with multiple inputs and optional reference data. Pro mode is designed for advanced use cases, particularly those requiring multi-step reasoning and complex decision-making (for instance, identifying inconsistencies, drawing inferences, and making sophisticated decisions). It accepts multiple content files as input and lets you optionally provide reference data at analyzer creation time.

In this walkthrough, you'll learn how to:
1. Create an analyzer with a schema and reference data.
2. Analyze your files using Pro mode.

For more details on Pro mode, see the [Azure AI Content Understanding: Standard and Pro Modes](https://learn.microsoft.com/en-us/azure/ai-services/content-understanding/concepts/standard-pro-modes) documentation.

## Prerequisites
1. Ensure your Azure AI service is configured by following [these steps](../README.md#configure-azure-ai-service-resource).
1. If you use reference documents, follow [Set env for reference doc](../docs/set_env_for_training_data_and_reference_doc.md) to set up `REFERENCE_DOC_SAS_URL` and `REFERENCE_DOC_PATH` in the [.env](./.env) file.
    - `REFERENCE_DOC_SAS_URL`: SAS URL for your Azure Blob container.
    - `REFERENCE_DOC_PATH`: Folder path within the container for uploading reference docs.
    > ⚠️ Note: Reference documents are optional in Pro mode. You can run Pro mode using just input documents. For example, the service can reason across two or more input files even without any reference data.
1. Install the required packages to run the sample.

In [None]:
%pip install -r ../requirements.txt
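
Before moving on, it can help to confirm that the settings from the prerequisites are actually present. The following is a minimal sketch, not part of the sample's utility code: `missing_env_vars` is a hypothetical helper, and it assumes the values from `.env` have already been loaded into the process environment.

```python
import os


def missing_env_vars(names, env=None):
    """Return the names that are unset or empty in the given environment mapping."""
    env = os.environ if env is None else env
    return [name for name in names if not env.get(name)]


# AZURE_AI_ENDPOINT is always needed; the reference-doc settings are only
# needed when reference documents are used.
required = ["AZURE_AI_ENDPOINT"]
optional_reference = ["REFERENCE_DOC_SAS_URL", "REFERENCE_DOC_PATH"]

print("Missing required settings:", missing_env_vars(required))
print("Missing reference-doc settings:", missing_env_vars(optional_reference))
```

An empty list for the reference-doc settings is only necessary if you plan to supply reference documents.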

## Analyzer template and local files setup
- **analyzer_template**: In this sample, we define an analyzer template for invoice-contract verification.
- **input_docs**: We can place multiple input document files in one folder or point to a single document file.
- **reference_docs (optional)**: During analyzer creation, we can provide documents that give the analyzer additional context to reference at inference time. We will get OCR results for these files if needed, generate a reference JSONL file, and upload these files to a designated Azure blob storage.

> For example, if you're looking to analyze invoices to ensure they're consistent with a contractual agreement, you can supply the invoice and other relevant documents (for example, a purchase order) as inputs, and supply the contract files as reference data. The service applies reasoning to validate the input documents according to your schema, which might be to identify discrepancies to flag for further review.

In [None]:
# Define paths for analyzer template, input documents, and reference documents
analyzer_template = "../analyzer_templates/invoice_contract_verification_pro_mode.json"
input_docs = "../data/field_extraction_pro_mode/invoice_contract_verification/input_docs"

# NOTE: Reference documents are optional in Pro mode. Comment out the line below if you are not using reference documents.
reference_docs = "../data/field_extraction_pro_mode/invoice_contract_verification/reference_docs"
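
Since `input_docs` may be either a folder or a single file, a quick check of what will be picked up can be useful. The sketch below is illustrative only: `list_supported_documents` is a hypothetical helper (not part of the sample client), and the extension set is copied from the supported file types listed in the note at the top of this notebook.

```python
from pathlib import Path

# Extensions taken from the supported file types listed at the top of this notebook.
SUPPORTED_EXTENSIONS = {".pdf", ".tiff", ".jpg", ".jpeg", ".png", ".bmp", ".heif"}


def list_supported_documents(location):
    """Return supported files for a single file path or a folder (non-recursive)."""
    path = Path(location)
    if path.is_file():
        candidates = [path]
    elif path.is_dir():
        candidates = sorted(path.iterdir())
    else:
        candidates = []  # Path does not exist
    return [p for p in candidates if p.is_file() and p.suffix.lower() in SUPPORTED_EXTENSIONS]


input_docs = "../data/field_extraction_pro_mode/invoice_contract_verification/input_docs"
for document in list_supported_documents(input_docs):
    print(document.name)
```

The helper returns an empty list when the path does not exist, so it is safe to run before the sample data is in place.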

> Let's take a look at the analyzer template of Pro mode

In [None]:
import json
with open(analyzer_template, "r") as file:
    print(json.dumps(json.load(file), indent=2))

> In the analyzer template, `"mode"` must be set to `"pro"`. The field `PaymentTermsInconsistencies` is a `"generate"` field that asks the service to reason about inconsistencies, and it can draw on the reference documents uploaded from [reference docs](../data/field_extraction_pro_mode/invoice_contract_verification/reference_docs).
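
The two properties mentioned above can also be checked programmatically. This is a rough sketch under stated assumptions: `summarize_template` is a hypothetical helper, and the key layout (`fieldSchema` → `fields` → `<name>` → `method`) and the trimmed example dictionary are assumptions for illustration. Compare them against the template printed by the previous cell before relying on them.

```python
def summarize_template(template):
    """Return the template's mode and the names of fields whose method is 'generate'."""
    # Assumed layout: template["fieldSchema"]["fields"][name]["method"]
    fields = template.get("fieldSchema", {}).get("fields", {})
    generated = [name for name, spec in fields.items() if spec.get("method") == "generate"]
    return template.get("mode"), generated


# Hypothetical, trimmed-down template used only to exercise the helper.
example_template = {
    "mode": "pro",
    "fieldSchema": {
        "fields": {
            "PaymentTermsInconsistencies": {"type": "array", "method": "generate"},
        }
    },
}

mode, generated_fields = summarize_template(example_template)
print(mode, generated_fields)
```

With the real template, you could pass the result of `json.load(file)` instead of `example_template`.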

## Create Azure content understanding client
> The [AzureContentUnderstandingClient](../python/content_understanding_client.py) is a utility class that contains helper functions for calling the service. Until the Content Understanding SDK is released, consider it a lightweight SDK. Fill in the constants **AZURE_AI_ENDPOINT**, **AZURE_AI_API_VERSION**, and **AZURE_AI_API_KEY** with the information from your Azure AI Service.

> ⚠️ Important:
You must update the code below to match your Azure authentication method.
Look for the `# IMPORTANT` comments and modify those sections accordingly.
If you skip this step, the sample may not run correctly.

> ⚠️ Note: Using a subscription key works, but using a token provider with Azure Active Directory (AAD) is much safer and is highly recommended for production environments.

In [None]:
import logging
import os
import sys
from pathlib import Path
from dotenv import find_dotenv, load_dotenv
from azure.identity import DefaultAzureCredential, get_bearer_token_provider

# import utility package from python samples root directory
parent_dir = Path(Path.cwd()).parent
sys.path.append(str(parent_dir))
from python.content_understanding_client import AzureContentUnderstandingClient

load_dotenv(find_dotenv())
logging.basicConfig(level=logging.INFO)

# For authentication, you can use either token-based auth or subscription key, and only one of them is required
AZURE_AI_ENDPOINT = os.getenv("AZURE_AI_ENDPOINT")
# IMPORTANT: Replace with your actual subscription key or set up in ".env" file if not using token auth
AZURE_AI_API_KEY = os.getenv("AZURE_AI_API_KEY")
AZURE_AI_API_VERSION = os.getenv("AZURE_AI_API_VERSION", "2025-05-01-preview")

credential = DefaultAzureCredential()
token_provider = get_bearer_token_provider(credential, "https://cognitiveservices.azure.com/.default")

client = AzureContentUnderstandingClient(
    endpoint=AZURE_AI_ENDPOINT,
    api_version=AZURE_AI_API_VERSION,
    # IMPORTANT: Comment out token_provider if using subscription key
    token_provider=token_provider,
    # IMPORTANT: Uncomment this if using subscription key
    # subscription_key=AZURE_AI_API_KEY,
    x_ms_useragent="azure-ai-content-understanding-python/pro_mode", # This header is used for sample usage telemetry, please comment out this line if you want to opt out.
)

## Prepare reference data
In this step, we will:
- Use Azure AI service to extract OCR results from the reference documents (if needed).
- Generate a reference `.jsonl` file.
- Upload these files to the designated Azure blob storage.

We use the **REFERENCE_DOC_SAS_URL** and **REFERENCE_DOC_PATH** values that were set in the Prerequisites step.



In [None]:
# Load reference storage configuration from environment
REFERENCE_DOC_SAS_URL = os.getenv("REFERENCE_DOC_SAS_URL")
REFERENCE_DOC_PATH = os.getenv("REFERENCE_DOC_PATH")

> ⚠️ Note: Reference documents are optional in Pro mode. You can run Pro mode using just input documents. For example, the service can reason across two or more input files even without any reference data. If you are not using reference documents, skip or comment out the cell below.

In [None]:
# Set skip_analyze to True if you already have OCR results for the documents in the reference_docs folder
# Please name the OCR result files with the same name as the original document files including its extension, and add the suffix ".result.json"
# For example, if the original document is "invoice.pdf", the OCR result file should be named "invoice.pdf.result.json"
# NOTE: Please comment out the following line if you don't have any reference documents.
await client.generate_knowledge_base_on_blob(reference_docs, REFERENCE_DOC_SAS_URL, REFERENCE_DOC_PATH, skip_analyze=False)
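
If you plan to set `skip_analyze=True`, it is worth confirming that every reference document has an OCR result file that follows the naming rule described in the comments above. The sketch below is one way to do that; `expected_ocr_result_path` and `documents_missing_ocr` are hypothetical helpers, and the check treats every file in the folder that does not end in `.result.json` as a document, which is an assumption about the folder's contents.

```python
from pathlib import Path

OCR_SUFFIX = ".result.json"


def expected_ocr_result_path(document_path):
    """Return the OCR result path for a document, e.g. invoice.pdf -> invoice.pdf.result.json."""
    document_path = Path(document_path)
    return document_path.with_name(document_path.name + OCR_SUFFIX)


def documents_missing_ocr(folder):
    """List documents in the folder that have no matching OCR result file next to them."""
    folder = Path(folder)
    if not folder.is_dir():
        return []
    documents = [
        p for p in sorted(folder.iterdir())
        if p.is_file() and not p.name.endswith(OCR_SUFFIX)
    ]
    return [p for p in documents if not expected_ocr_result_path(p).exists()]


reference_docs = "../data/field_extraction_pro_mode/invoice_contract_verification/reference_docs"
for document in documents_missing_ocr(reference_docs):
    print(f"No OCR result found for: {document.name}")
```

If the loop prints anything, those documents would need `skip_analyze=False` (or matching result files) before the knowledge base is generated.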

## Create analyzer with defined schema for Pro mode
Before creating the analyzer, set the constant `CUSTOM_ANALYZER_ID` to a name that is relevant to your task. Here, we append a unique suffix so this cell can be run multiple times to create different analyzers.

We use the **REFERENCE_DOC_SAS_URL** and **REFERENCE_DOC_PATH** values that were set up in the [.env](./.env) file and used in the previous step.

In [None]:
import uuid
CUSTOM_ANALYZER_ID = "pro-mode-sample-" + str(uuid.uuid4())

response = client.begin_create_analyzer(
    CUSTOM_ANALYZER_ID,
    analyzer_template_path=analyzer_template,
    pro_mode_reference_docs_storage_container_sas_url=REFERENCE_DOC_SAS_URL,
    pro_mode_reference_docs_storage_container_path_prefix=REFERENCE_DOC_PATH,
)
result = client.poll_result(response)
if result is not None and "status" in result and result["status"] == "Succeeded":
    logging.info(f"Analyzer details for {result['result']['analyzerId']}")
    logging.info(json.dumps(result, indent=2))
else:
    logging.warning(
        "An issue was encountered when trying to create the analyzer. "
        "Please double-check your deployment and configurations for potential problems."
    )

## Use created analyzer to analyze the input documents
After the analyzer is successfully created, we can use it to analyze our input files.
> NOTE: Pro mode does multi-step reasoning and may take a longer time to analyze.

In [None]:
from IPython.display import FileLink, display

response = client.begin_analyze(CUSTOM_ANALYZER_ID, file_location=input_docs)
result_json = client.poll_result(response, timeout_seconds=600)  # set a longer timeout for pro mode

# Create the output directory if it doesn't exist
output_dir = "output"
os.makedirs(output_dir, exist_ok=True)

output_path = os.path.join(output_dir, f"{CUSTOM_ANALYZER_ID}_result.json")
with open(output_path, "w", encoding="utf-8") as file:
    json.dump(result_json, file, indent=2)

logging.info(f"Full analyzer result saved to: {output_path}")
display(FileLink(output_path))
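
The `timeout_seconds=600` argument above bounds how long the notebook waits for a long-running Pro mode operation. For readers unfamiliar with the pattern, here is a generic polling-with-deadline sketch. It is not the implementation of `client.poll_result`; `poll_until_done` and `fetch_status` are hypothetical names, and the `"Running"` and `"Failed"` status strings are assumptions (only `"Succeeded"` appears in this notebook).

```python
import time


def poll_until_done(fetch_status, timeout_seconds=600, interval_seconds=2.0):
    """Call fetch_status() until it reports a terminal status or the timeout elapses."""
    deadline = time.monotonic() + timeout_seconds
    while True:
        result = fetch_status()
        status = str(result.get("status", "")).lower()
        if status in ("succeeded", "failed"):  # Assumed terminal statuses
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError(f"Operation did not finish within {timeout_seconds} seconds")
        time.sleep(interval_seconds)


# Hypothetical status source that reports success on the third call.
responses = iter([{"status": "Running"}, {"status": "Running"}, {"status": "Succeeded"}])
final = poll_until_done(lambda: next(responses), timeout_seconds=5, interval_seconds=0.01)
print(final["status"])
```

Using `time.monotonic()` rather than wall-clock time keeps the deadline unaffected by system clock adjustments.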

> Let's take a look at the extracted fields with Pro mode 

In [None]:
fields = result_json["result"]["contents"][0]["fields"]
print(json.dumps(fields, indent=2))
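
The raw field output can be verbose. As a rough sketch, the helper below collapses a field into plain Python values. It assumes each field is a dictionary that carries its payload under a key starting with `value` (for example `valueString`, `valueArray`, or `valueObject`). That layout, the helper name `simplify_field`, and the sample data are assumptions for illustration, so verify them against the JSON printed above.

```python
def simplify_field(field):
    """Recursively reduce a field to plain Python values by following its 'value*' key."""
    if not isinstance(field, dict):
        return field
    value_key = next((key for key in field if key.startswith("value")), None)
    if value_key is None:
        return None  # Field carries no value
    value = field[value_key]
    if isinstance(value, list):
        return [simplify_field(item) for item in value]
    if isinstance(value, dict):
        return {name: simplify_field(item) for name, item in value.items()}
    return value


# Hypothetical field data, shaped according to the assumption described above.
example_fields = {
    "PaymentTermsInconsistencies": {
        "type": "array",
        "valueArray": [
            {
                "type": "object",
                "valueObject": {
                    "Evidence": {"type": "string", "valueString": "example evidence text"},
                },
            }
        ],
    }
}

simplified = {name: simplify_field(field) for name, field in example_fields.items()}
print(simplified)
```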

> As seen in the field `PaymentTermsInconsistencies`, for example, the purchase contract has detailed payment terms that were agreed to prior to the service. However, the implied payment terms on the invoice conflict with this. Pro mode was able to identify the corresponding contract for this invoice from the reference documents and then analyze the contract together with the invoice to discover this inconsistency.

## Delete the existing analyzer in Content Understanding Service
This step is optional. It prevents the test analyzer from lingering in your service. If you skip it, the analyzer remains in your service and can be reused later.

In [None]:
client.delete_analyzer(CUSTOM_ANALYZER_ID)

## Bonus sample
We would like to introduce another sample to highlight how Pro mode supports multi-document input and advanced reasoning. Unlike Document Standard Mode, which processes one document at a time, Pro mode can analyze multiple documents within a single analysis call. With Pro mode, the service not only processes each document independently, but also cross-references the documents to perform reasoning across them, enabling deeper insights and validation.

### First, we need to set up variables for the second sample

In [None]:
# Define paths for analyzer template, input documents, and reference documents of the second sample
analyzer_template_2 = "../analyzer_templates/insurance_claims_review_pro_mode.json"
input_docs_2 = "../data/field_extraction_pro_mode/insurance_claims_review/input_docs"
reference_docs_2 = "../data/field_extraction_pro_mode/insurance_claims_review/reference_docs"

# Load reference storage configuration from environment
REFERENCE_DOC_SAS_URL_2 = os.getenv("REFERENCE_DOC_SAS_URL")  # Reuse the same blob container
REFERENCE_DOC_PATH_2 = os.getenv("REFERENCE_DOC_PATH").rstrip("/") + "_2/"  # NOTE: Use a different path for the second sample
CUSTOM_ANALYZER_ID_2 = "pro-mode-sample-" + str(uuid.uuid4())
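
The path expression above fails with an `AttributeError` if `REFERENCE_DOC_PATH` is not set, because `os.getenv` returns `None`. A small helper can make the derivation explicit and fail with a clearer message. This is a sketch only; `sibling_prefix` is a hypothetical name and is not part of the sample client.

```python
def sibling_prefix(base_prefix, suffix="_2"):
    """Derive a sibling folder prefix, e.g. 'reference_docs/' -> 'reference_docs_2/'."""
    if not base_prefix:
        raise ValueError("REFERENCE_DOC_PATH is not set; a base prefix is required")
    # Drop any trailing slash, append the suffix, then restore a single trailing slash.
    return base_prefix.rstrip("/") + suffix + "/"


print(sibling_prefix("reference_docs/"))  # reference_docs_2/
```

In the cell above, this would be called as `sibling_prefix(os.getenv("REFERENCE_DOC_PATH"))`.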

### Generate knowledge base for the second sample
Let's upload the [reference documents](../data/field_extraction_pro_mode/insurance_claims_review/reference_docs/) with existing OCR results for the second sample. These documents contain driver coverage policies that are useful when reviewing insurance claims.

In [None]:
logging.info("Start generating knowledge base for the second sample...")
await client.generate_knowledge_base_on_blob(reference_docs_2, REFERENCE_DOC_SAS_URL_2, REFERENCE_DOC_PATH_2, skip_analyze=True)

### Create analyzer for the second sample
We can reuse the `AzureContentUnderstandingClient` instance created earlier.

In [None]:
response = client.begin_create_analyzer(
    CUSTOM_ANALYZER_ID_2,
    analyzer_template_path=analyzer_template_2,
    pro_mode_reference_docs_storage_container_sas_url=REFERENCE_DOC_SAS_URL_2,
    pro_mode_reference_docs_storage_container_path_prefix=REFERENCE_DOC_PATH_2,
)
result = client.poll_result(response)
if result is not None and "status" in result and result["status"] == "Succeeded":
    logging.info(f"Analyzer details for {result['result']['analyzerId']}")
    logging.info(json.dumps(result, indent=2))
else:
    logging.warning(
        "An issue was encountered when trying to create the analyzer. "
        "Please double-check your deployment and configurations for potential problems."
    )

### Analyze the multiple input documents with the second analyzer
Please note that the [input_docs_2](../data/field_extraction_pro_mode/insurance_claims_review/input_docs/) directory contains two PDF files as input: one is a car accident report, and the other is a repair estimate.

The first document includes details such as the car’s license plate number, vehicle model, and other incident-related information.
The second document provides a breakdown of the estimated repair costs.

Due to the complexity of this multi-document scenario and the processing involved, it may take a few minutes to generate the results.

In [None]:
logging.info("Start analyzing input documents for the second sample...")
response = client.begin_analyze(CUSTOM_ANALYZER_ID_2, file_location=input_docs_2)
result_json = client.poll_result(response, timeout_seconds=600)  # set a longer timeout for pro mode

# Save the result to a JSON file
# Create the output directory if it doesn't exist
output_dir = "output"
os.makedirs(output_dir, exist_ok=True)
output_path = os.path.join(output_dir, f"{CUSTOM_ANALYZER_ID_2}_result.json")
with open(output_path, "w", encoding="utf-8") as file:
    json.dump(result_json, file, indent=2)

logging.info(f"Full analyzer result saved to: {output_path}")
display(FileLink(output_path))

### Let's take a look at the analyze result

In [None]:
result_json["result"]["contents"][0]["fields"]

### Let's take a deeper look at `LineItemCorroboration` field in the result

> We can see that the field `ReportingOfficer` is only available in the car accident report, while fields like `VIN` come solely from the repair estimate document. This shows that information is extracted from both documents to generate a single result. It also illustrates the N:1 relationship between the inputs and the analysis result.  

> Multiple input documents are combined to produce one unified output. There is always one analysis result, and this is not a batch model where N input documents would yield N outputs.

In [None]:
fields = result_json["result"]["contents"][0]["fields"]["LineItemCorroboration"]
print(json.dumps(fields, indent=2))

> In the `LineItemCorroboration` field, we see that each line item from the *repair estimate document* is extracted with its corresponding information, claim status, and evidence. Items that are not covered by the policy, such as the Starbucks drink and hotel stay, are marked as suspicious, while damage repairs that are supported by the supplied documents in the claim and permitted by the policy are confirmed.

### [Optional] Delete the analyzer for the second sample after use

In [None]:
client.delete_analyzer(CUSTOM_ANALYZER_ID_2)