# Conduct Complex Analysis with Pro mode

> #################################################################################
>
> **Note:** Pro mode is currently available only for `document` data.  
> [Supported file types](https://learn.microsoft.com/en-us/azure/ai-services/content-understanding/service-limits#document-and-text): pdf, tiff, jpg, jpeg, png, bmp, heif
>
> #################################################################################

This notebook demonstrates how to use **Pro mode** in Azure AI Content Understanding to enhance your analyzer with multiple inputs and optional reference data. Pro mode is designed for advanced use cases that require multi-step reasoning and complex decision-making, such as identifying inconsistencies and drawing inferences. It accepts multiple content files as input to a single analysis and lets you optionally provide reference data at analyzer creation time.

In this walkthrough, you will learn how to:
1. Create an analyzer with a schema and reference data.
2. Analyze your files using Pro mode.

For more details on Pro mode, see the [Azure AI Content Understanding: Standard and Pro Modes](https://learn.microsoft.com/en-us/azure/ai-services/content-understanding/concepts/standard-pro-modes) documentation.

## Prerequisites
1. Ensure the Azure AI service is configured by following the [configuration steps](../README.md#configure-azure-ai-service-resource).
2. If you plan to use reference documents, follow the [Set env for reference doc](../docs/set_env_for_training_data_and_reference_doc.md) instructions to set `REFERENCE_DOC_SAS_URL` and `REFERENCE_DOC_PATH` in the [.env](./.env) file.
    - `REFERENCE_DOC_SAS_URL`: SAS URL for your Azure Blob container.
    - `REFERENCE_DOC_PATH`: Folder path within the container for uploading reference documents.
    > ⚠️ Note: Reference documents are optional in Pro mode. You can run Pro mode using only input documents. For example, the service can reason across two or more input files even without any reference data.
3. Install the required packages to run the sample.

In [None]:
%pip install -r ../requirements.txt
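Before running the later cells, it can help to confirm that the settings described above are actually present. The following is a minimal sketch, assuming the values are exported as environment variables (for example after loading the `.env` file); the helper name `missing_settings` is hypothetical and not part of the sample utilities.

```python
import os


def missing_settings(names, env=None):
    """Return the names from `names` that are unset or empty in `env`."""
    env = os.environ if env is None else env
    return [name for name in names if not env.get(name)]


# The endpoint is always needed; the reference settings only matter
# when you plan to use reference documents.
required = ["AZURE_AI_ENDPOINT"]
optional_reference = ["REFERENCE_DOC_SAS_URL", "REFERENCE_DOC_PATH"]

print("Missing required settings:", missing_settings(required))
print("Missing reference settings:", missing_settings(optional_reference))
```

An empty list for the reference settings means both values are available; a non-empty list is fine if you intend to run Pro mode with input documents only.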

## Analyzer Template and Local Files Setup
- **analyzer_template**: In this sample, we define an analyzer template for invoice-contract verification.
- **input_docs**: You can provide multiple input document files in one folder or specify a single document file location.
- **reference_docs (Optional)**: During analyzer creation, you can provide documents that give the analyzer additional context at inference time. The sample uses the service to generate OCR results for these files if needed, produces a reference `.jsonl` file, and uploads these files to the specified Azure Blob storage location.

> For example, if you want to analyze invoices for consistency with contractual agreements, you can supply the invoice and other relevant documents (e.g., purchase orders) as inputs, and supply contract files as reference data. The service applies reasoning to validate the input documents according to your schema, such as identifying discrepancies to flag for further review.

In [None]:
# Define paths for analyzer template, input documents, and reference documents
analyzer_template = "../analyzer_templates/invoice_contract_verification_pro_mode.json"
input_docs = "../data/field_extraction_pro_mode/invoice_contract_verification/input_docs"

# NOTE: Reference documents are optional in Pro mode. Comment out the below line if not using reference documents.
reference_docs = "../data/field_extraction_pro_mode/invoice_contract_verification/reference_docs"
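Because `input_docs` may point at either a folder or a single file, a quick local check of what will be picked up can be useful. This is a rough sketch that filters by the file extensions listed in the note at the top of this notebook; `list_supported_docs` is a hypothetical helper, and the service-side validation may differ from this simple extension check.

```python
from pathlib import Path

# Extensions taken from the supported file types listed at the top of this notebook.
SUPPORTED_EXTENSIONS = {".pdf", ".tiff", ".jpg", ".jpeg", ".png", ".bmp", ".heif"}


def list_supported_docs(location):
    """Return supported document files for a single file or a folder (non-recursive)."""
    path = Path(location)
    if path.is_file():
        candidates = [path]
    elif path.is_dir():
        candidates = sorted(path.iterdir())
    else:
        candidates = []  # Path does not exist
    return [p for p in candidates if p.is_file() and p.suffix.lower() in SUPPORTED_EXTENSIONS]
```

Calling `list_supported_docs(input_docs)` should then return the files that match; an empty list suggests the path is wrong or contains no supported documents.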

> Let's examine the analyzer template used in Pro mode.

In [None]:
import json

# Load the analyzer template and pretty-print it for inspection
with open(analyzer_template, "r", encoding="utf-8") as file:
    print(json.dumps(json.load(file), indent=2))

> In the analyzer, the `"mode"` must be set to `"pro"`. The defined field `"PaymentTermsInconsistencies"` is a `"generate"` field, which is designed to reason about inconsistencies. It will utilize the reference documents uploaded to the [reference docs](../data/field_extraction_pro_mode/invoice_contract_verification/reference_docs) folder.
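The two points above (the `"mode"` value and the presence of `"generate"` fields) can also be checked programmatically. The sketch below is illustrative only: it assumes that fields live under `fieldSchema.fields` and that each field carries a `method` key, which you should confirm against the template printed above. `check_pro_template` and the trimmed-down example template are hypothetical.

```python
def check_pro_template(template):
    """Return (is_pro, generate_field_names) for an analyzer template dict."""
    is_pro = template.get("mode") == "pro"
    # ASSUMPTION: fields are stored under fieldSchema.fields and use a "method" key.
    fields = template.get("fieldSchema", {}).get("fields", {})
    generate_fields = [
        name for name, spec in fields.items() if spec.get("method") == "generate"
    ]
    return is_pro, generate_fields


# Hypothetical, trimmed-down template used only to exercise the helper.
example_template = {
    "mode": "pro",
    "fieldSchema": {
        "fields": {
            "PaymentTermsInconsistencies": {"type": "array", "method": "generate"},
        }
    },
}
print(check_pro_template(example_template))
```

To apply it to the real template, load the JSON file into a dict first and pass that dict to the helper.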

## Create Azure Content Understanding Client
> The [AzureContentUnderstandingClient](../python/content_understanding_client.py) is a utility class containing the functions this sample needs. Until the official Content Understanding SDK is released, you can treat it as a lightweight SDK.
Fill in the values for the constants **AZURE_AI_ENDPOINT**, **AZURE_AI_API_VERSION**, **AZURE_AI_API_KEY** with the information from your Azure AI Service.

> ⚠️ Important:
You must update the code below to match your Azure authentication method.
Look for the `# IMPORTANT` comments and modify those sections accordingly.
If you skip this step, the sample may not run correctly.

> ⚠️ Note: Although a subscription key works, using a token provider with Microsoft Entra ID (formerly Azure Active Directory, AAD) is more secure and strongly recommended for production environments.

In [None]:
import logging
import os
import sys
from pathlib import Path
from dotenv import find_dotenv, load_dotenv
from azure.identity import DefaultAzureCredential, get_bearer_token_provider

# Import utility package from python samples root directory
parent_dir = Path(Path.cwd()).parent
sys.path.append(str(parent_dir))
from python.content_understanding_client import AzureContentUnderstandingClient

load_dotenv(find_dotenv())
logging.basicConfig(level=logging.INFO)

# For authentication, you can use either token-based auth or subscription key; only one is required
AZURE_AI_ENDPOINT = os.getenv("AZURE_AI_ENDPOINT")
# IMPORTANT: Replace with your actual subscription key or set up in the ".env" file if not using token auth
AZURE_AI_API_KEY = os.getenv("AZURE_AI_API_KEY")
AZURE_AI_API_VERSION = os.getenv("AZURE_AI_API_VERSION", "2025-05-01-preview")

credential = DefaultAzureCredential()
token_provider = get_bearer_token_provider(credential, "https://cognitiveservices.azure.com/.default")

client = AzureContentUnderstandingClient(
    endpoint=AZURE_AI_ENDPOINT,
    api_version=AZURE_AI_API_VERSION,
    # IMPORTANT: Comment out token_provider if using subscription key
    token_provider=token_provider,
    # IMPORTANT: Uncomment this if using subscription key
    # subscription_key=AZURE_AI_API_KEY,
    x_ms_useragent="azure-ai-content-understanding-python/pro_mode",  # This header is used for sample usage telemetry; comment out if opting out.
)
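Instead of commenting lines in and out, the credential choice could be made in one place. The sketch below is one possible approach, not the sample's own mechanism: the hypothetical helper `build_auth_kwargs` returns exactly one of the two keyword arguments used above (`subscription_key` or `token_provider`). It prefers the subscription key when one is set, on the assumption that a key in `.env` signals an explicit opt-in; reverse the order if you prefer token auth to always win.

```python
def build_auth_kwargs(subscription_key=None, token_provider=None):
    """Pick exactly one credential keyword argument for the client constructor."""
    if subscription_key:
        # A non-empty key is treated as an explicit opt-in to key-based auth.
        return {"subscription_key": subscription_key}
    if token_provider is not None:
        return {"token_provider": token_provider}
    raise ValueError("Provide either a token provider or a subscription key.")


# Example with placeholder values (no Azure call is made here).
print(build_auth_kwargs(subscription_key=None, token_provider="<token provider>"))
```

The returned dict can then be unpacked into the constructor call, for example `AzureContentUnderstandingClient(endpoint=..., api_version=..., **build_auth_kwargs(AZURE_AI_API_KEY, token_provider))`.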

## Prepare Reference Data
In this step, we will:
- Use the Azure AI service to extract OCR results from reference documents (if needed).
- Generate a reference `.jsonl` file.
- Upload these files to the designated Azure Blob storage.

We utilize **REFERENCE_DOC_SAS_URL** and **REFERENCE_DOC_PATH**, which are set in the Prerequisites step.

In [None]:
# Load reference storage configuration from environment
REFERENCE_DOC_SAS_URL = os.getenv("REFERENCE_DOC_SAS_URL")
REFERENCE_DOC_PATH = os.getenv("REFERENCE_DOC_PATH")

> ⚠️ Note: Reference documents are optional in Pro mode. You can run Pro mode using only input documents. The service can reason across two or more input files even without any reference data. To skip preparation of reference documents, comment out or omit the following section.

In [None]:
# Set skip_analyze to True if you already have OCR results for the documents in the reference_docs folder
# Ensure OCR result files are named with the original document file name including its extension plus the suffix ".result.json"
# For example, for "invoice.pdf", the OCR result should be named "invoice.pdf.result.json"
# NOTE: Comment out the following line if you do not have any reference documents.
await client.generate_knowledge_base_on_blob(reference_docs, REFERENCE_DOC_SAS_URL, REFERENCE_DOC_PATH, skip_analyze=False)
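Before switching `skip_analyze` to `True`, you may want to verify that every reference document already has an OCR result following the naming rule described in the comments above (`invoice.pdf` pairs with `invoice.pdf.result.json`). This is a small sketch under the assumption that every file in the folder that is not a `.result.json` file is a reference document; `docs_missing_ocr_results` is a hypothetical helper.

```python
from pathlib import Path

OCR_RESULT_SUFFIX = ".result.json"


def docs_missing_ocr_results(folder):
    """Return names of documents in `folder` that lack a "<name>.result.json" file."""
    folder = Path(folder)
    files = [p for p in folder.iterdir() if p.is_file()]
    # ASSUMPTION: anything that is not an OCR result file is a reference document.
    documents = [p for p in files if not p.name.endswith(OCR_RESULT_SUFFIX)]
    return sorted(
        p.name for p in documents
        if not (folder / (p.name + OCR_RESULT_SUFFIX)).exists()
    )
```

If `docs_missing_ocr_results(reference_docs)` returns an empty list, every document has a matching result file; otherwise the listed documents still need OCR.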

## Create Analyzer with Defined Schema for Pro mode
Before creating the analyzer, assign a relevant name to the constant `CUSTOM_ANALYZER_ID`. Here, we generate a unique suffix so this cell can be executed multiple times to create different analyzers.

We use **REFERENCE_DOC_SAS_URL** and **REFERENCE_DOC_PATH** configured in the [.env](./.env) file and utilized in the previous step.

In [None]:
import uuid
CUSTOM_ANALYZER_ID = "pro-mode-sample-" + str(uuid.uuid4())

response = client.begin_create_analyzer(
    CUSTOM_ANALYZER_ID,
    analyzer_template_path=analyzer_template,
    pro_mode_reference_docs_storage_container_sas_url=REFERENCE_DOC_SAS_URL,
    pro_mode_reference_docs_storage_container_path_prefix=REFERENCE_DOC_PATH,
)
result = client.poll_result(response)
if result is not None and "status" in result and result["status"] == "Succeeded":
    logging.info(f"Analyzer details for {result['result']['analyzerId']}")
    logging.info(json.dumps(result, indent=2))
else:
    logging.warning(
        "An issue was encountered when creating the analyzer. "
        "Please verify your deployment and configuration for potential issues."
    )
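The cell above only logs a warning when creation fails, so later cells would still run against an analyzer that may not exist. If you prefer to stop early, a stricter check could look like the following sketch. It relies only on the `"status"` key already inspected above; `require_succeeded` is a hypothetical helper.

```python
def require_succeeded(result, action="operation"):
    """Return `result` if its status is "Succeeded"; otherwise raise RuntimeError."""
    status = result.get("status") if isinstance(result, dict) else None
    if status != "Succeeded":
        raise RuntimeError(f"{action} did not succeed (status: {status!r})")
    return result


# Example with a placeholder result dict (no service call is made here).
print(require_succeeded({"status": "Succeeded"}, action="analyzer creation"))
```

In the notebook, this could wrap the polling call, for example `result = require_succeeded(client.poll_result(response), "analyzer creation")`.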

## Use Created Analyzer to Analyze the Input Documents
After the analyzer is created successfully, it can be used to analyze your input files.
> NOTE: Pro mode performs multi-step reasoning and may require a longer analysis time.

In [None]:
from IPython.display import FileLink, display

response = client.begin_analyze(CUSTOM_ANALYZER_ID, file_location=input_docs)
result_json = client.poll_result(response, timeout_seconds=600)  # Extended timeout for Pro mode

# Ensure the output directory exists
output_dir = "output"
os.makedirs(output_dir, exist_ok=True)

output_path = os.path.join(output_dir, f"{CUSTOM_ANALYZER_ID}_result.json")
with open(output_path, "w", encoding="utf-8") as file:
    json.dump(result_json, file, indent=2)

logging.info(f"Full analyzer result saved to: {output_path}")
display(FileLink(output_path))

> Let's review the extracted fields produced by Pro mode.

In [None]:
fields = result_json["result"]["contents"][0]["fields"]
print(json.dumps(fields, indent=2))

> As shown in the `PaymentTermsInconsistencies` field, the purchase contract includes detailed payment terms agreed upon before the service. However, the invoice contains implied payment terms that conflict with the contract. Pro mode was able to identify the corresponding contract for this invoice from the reference documents and analyze both together to reveal this inconsistency.
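The full field dump can be long, so a compact overview is sometimes easier to scan. The sketch below assumes that each field is a dict whose payload is stored under a key starting with `value` (for example `valueString` or `valueArray`); verify this against the JSON printed above before relying on it. The helper names and the sample data are hypothetical and do not reflect real service output.

```python
def field_value(field):
    """Return the payload of a field dict by looking for its "value*" key."""
    for key, value in field.items():
        if key.startswith("value"):
            return value
    return None


def summarize_fields(fields):
    """Map each field name to a short description of its extracted value."""
    summary = {}
    for name, field in fields.items():
        value = field_value(field)
        if isinstance(value, list):
            summary[name] = f"{len(value)} item(s)"
        elif isinstance(value, dict):
            summary[name] = f"object with {len(value)} key(s)"
        else:
            summary[name] = value
    return summary


# Hypothetical sample shaped like the assumed field layout.
sample_fields = {
    "ExampleText": {"type": "string", "valueString": "example"},
    "ExampleList": {"type": "array", "valueArray": [{}, {}]},
    "ExampleEmpty": {"type": "string"},
}
print(summarize_fields(sample_fields))
```

Passing the `fields` dict from the previous cell to `summarize_fields` would give a one-line-per-field overview.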

## Delete Existing Analyzer in Content Understanding Service
This step is optional but recommended; it prevents test analyzers from remaining in your service. Without deletion, the analyzer will persist and could be used in subsequent analyses.

In [None]:
client.delete_analyzer(CUSTOM_ANALYZER_ID)
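If a cell fails between creation and deletion, the test analyzer is left behind. One way to make cleanup harder to forget is a context manager that always calls `delete_analyzer` on exit. This is only a sketch: `temporary_analyzer` and the stub client are hypothetical, and the stub exists solely so the example runs without Azure access.

```python
from contextlib import contextmanager


@contextmanager
def temporary_analyzer(client, analyzer_id):
    """Yield `analyzer_id` and delete the analyzer when the block exits."""
    try:
        yield analyzer_id
    finally:
        # Runs on normal exit and when the block raises.
        client.delete_analyzer(analyzer_id)


class RecordingClient:
    """Stand-in for the real client; it only records deletions."""

    def __init__(self):
        self.deleted = []

    def delete_analyzer(self, analyzer_id):
        self.deleted.append(analyzer_id)


stub = RecordingClient()
with temporary_analyzer(stub, "pro-mode-sample-demo") as analyzer_id:
    pass  # Create and use the analyzer here.
print(stub.deleted)
```

With the real `client`, the creation and analysis calls would go inside the `with` block.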

## Bonus Sample
Here we present another example highlighting how Pro mode supports multi-document input and advanced reasoning.
Unlike Document Standard Mode, which processes one document at a time, Pro mode can analyze multiple documents within a single analysis call. Pro mode not only processes each document independently, but also cross-references them to perform reasoning across documents, enabling deeper insights and validation.

### First, Set Up Variables for the Second Sample

In [None]:
# Define paths for analyzer template, input documents, and reference documents for the second sample
analyzer_template_2 = "../analyzer_templates/insurance_claims_review_pro_mode.json"
input_docs_2 = "../data/field_extraction_pro_mode/insurance_claims_review/input_docs"
reference_docs_2 = "../data/field_extraction_pro_mode/insurance_claims_review/reference_docs"

# Load reference storage configuration from environment
REFERENCE_DOC_SAS_URL_2 = os.getenv("REFERENCE_DOC_SAS_URL")  # Reusing the same blob container
REFERENCE_DOC_PATH_2 = os.getenv("REFERENCE_DOC_PATH").rstrip("/") + "_2/"  # NOTE: Use a different path for the second sample
CUSTOM_ANALYZER_ID_2 = "pro-mode-sample-" + str(uuid.uuid4())
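Note that `os.getenv("REFERENCE_DOC_PATH")` returns `None` when the variable is unset, in which case the `.rstrip("/")` call above raises an `AttributeError`. A slightly more defensive way to derive the second prefix is sketched below; `derive_path_prefix` is a hypothetical helper that applies the same `rstrip("/") + "_2/"` rule and fails with a clearer message.

```python
def derive_path_prefix(base, suffix="_2"):
    """Append `suffix` to a blob folder prefix and keep one trailing slash."""
    if not base:
        raise ValueError("REFERENCE_DOC_PATH is not set; cannot derive a second prefix.")
    return base.rstrip("/") + suffix + "/"


# Example with a placeholder prefix.
print(derive_path_prefix("reference_docs/"))
```

In the cell above, this would replace the inline expression with `derive_path_prefix(os.getenv("REFERENCE_DOC_PATH"))`.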

### Generate Knowledge Base for the Second Sample
We will upload [reference documents](../data/field_extraction_pro_mode/insurance_claims_review/reference_docs/) with existing OCR results for the second sample. These documents contain driver coverage policies useful for reviewing insurance claims.

In [None]:
logging.info("Start generating knowledge base for the second sample...")
await client.generate_knowledge_base_on_blob(reference_docs_2, REFERENCE_DOC_SAS_URL_2, REFERENCE_DOC_PATH_2, skip_analyze=True)

### Create Analyzer for the Second Sample
We will reuse the existing AzureContentUnderstandingClient.

In [None]:
response = client.begin_create_analyzer(
    CUSTOM_ANALYZER_ID_2,
    analyzer_template_path=analyzer_template_2,
    pro_mode_reference_docs_storage_container_sas_url=REFERENCE_DOC_SAS_URL_2,
    pro_mode_reference_docs_storage_container_path_prefix=REFERENCE_DOC_PATH_2,
)
result = client.poll_result(response)
if result is not None and "status" in result and result["status"] == "Succeeded":
    logging.info(f"Analyzer details for {result['result']['analyzerId']}")
    logging.info(json.dumps(result, indent=2))
else:
    logging.warning(
        "An issue occurred while creating the analyzer. "
        "Please verify your deployment and configurations for potential issues."
    )

### Analyze Multiple Input Documents Using the Second Analyzer
Note that the [input_docs_2](../data/field_extraction_pro_mode/insurance_claims_review/input_docs/) directory contains two PDF files as input: one is a car accident report, and the other is a repair estimate.

The first document includes details such as the car’s license plate number, vehicle model, and other incident-related information.
The second document provides a breakdown of the estimated repair costs.

Due to the complexity of this multi-document scenario and the processing involved, it may take a few minutes to generate results.

In [None]:
logging.info("Start analyzing input documents for the second sample...")
response = client.begin_analyze(CUSTOM_ANALYZER_ID_2, file_location=input_docs_2)
result_json = client.poll_result(response, timeout_seconds=600)  # Extended timeout for Pro mode

# Save the results to a JSON file
# Ensure output directory exists
output_dir = "output"
os.makedirs(output_dir, exist_ok=True)
output_path = os.path.join(output_dir, f"{CUSTOM_ANALYZER_ID_2}_result.json")
with open(output_path, "w", encoding="utf-8") as file:
    json.dump(result_json, file, indent=2)

logging.info(f"Full analyzer result saved to: {output_path}")
display(FileLink(output_path))

### Review the Analysis Result

In [None]:
result_json["result"]["contents"][0]["fields"]

### Examine the `LineItemCorroboration` Field in Detail

> The field `ReportingOfficer` appears only in the car accident report, while fields such as `VIN` come exclusively from the repair estimate document. This illustrates that information is extracted from both documents to produce a single unified result, demonstrating an N:1 relationship between inputs and the analysis output.

> Multiple input documents are combined to generate one comprehensive output. This is not a batch model where N input documents yield N outputs; instead, all inputs are reasoned about jointly.

In [None]:
fields = result_json["result"]["contents"][0]["fields"]["LineItemCorroboration"]
print(json.dumps(fields, indent=2))

> The `LineItemCorroboration` field shows that each line item, extracted from the *repair estimate document*, includes corresponding information, claim status, and evidence.
> Items not covered by the policy, such as a Starbucks drink and hotel stay, are flagged as suspicious, while repair damages supported by supplied claim documents and permitted by the policy are confirmed.

### [Optional] Delete the Analyzer for the Second Sample After Use

In [None]:
client.delete_analyzer(CUSTOM_ANALYZER_ID_2)