# Conduct Complex Analysis with Pro Mode

> #################################################################################
>
> **Note:** Pro mode is currently available only for `document` data.  
> [Supported file types](https://learn.microsoft.com/en-us/azure/ai-services/content-understanding/service-limits#document-and-text): pdf, tiff, jpg, jpeg, png, bmp, heif
>
> #################################################################################

This notebook demonstrates how to use **Pro mode** in Azure AI Content Understanding to enhance your analyzer with multiple inputs and optional reference data. Pro mode is designed for advanced use cases that require multi-step reasoning, such as identifying inconsistencies across documents, drawing inferences, and making decisions based on them. It accepts multiple content files as input and lets you provide reference data at analyzer creation time.

In this walkthrough, you'll learn how to:
1. Create an analyzer using a schema and reference data.
2. Analyze your files using Pro mode.

For more details on Pro mode, see the [Azure AI Content Understanding: Standard and Pro Modes](https://learn.microsoft.com/en-us/azure/ai-services/content-understanding/concepts/standard-pro-modes) documentation.

## Prerequisites
1. Ensure your Azure AI service is configured by following the [configuration steps](../README.md#configure-azure-ai-service-resource).
2. If using reference documents, set up `REFERENCE_DOC_SAS_URL` and `REFERENCE_DOC_PATH` in the [.env](./.env) file by following the instructions in [Set env for reference doc](../docs/set_env_for_training_data_and_reference_doc.md).
    - `REFERENCE_DOC_SAS_URL`: SAS URL for your Azure Blob container.
    - `REFERENCE_DOC_PATH`: Folder path within the container for uploading reference documents.
    > ⚠️ Note: Reference documents are optional in Pro mode. You can run Pro mode using only input documents. For example, the service can reason across two or more input files without any reference data.
3. Install the required packages to run the sample.

In [None]:
%pip install -r ../requirements.txt

## Analyzer Template and Local File Setup
- **analyzer_template**: In this sample, we define an analyzer template for invoice-contract verification.
- **input_docs**: You can provide multiple input document files in one folder or specify a single document file location.
- **reference_docs (Optional)**: During analyzer creation, you can provide documents that serve as context for the analyzer during inference. If needed, OCR results will be generated for these files, a reference JSONL file will be produced, and these files will be uploaded to a designated Azure Blob storage container.

> For example, if you want to analyze invoices to ensure consistency with a contractual agreement, you can supply the invoice and other relevant documents (such as a purchase order) as inputs, and provide the contract files as reference data. The service uses reasoning to validate the input documents against your schema, identifying discrepancies to flag for further review.

In [None]:
# Define paths for analyzer template, input documents, and reference documents
analyzer_template = "../analyzer_templates/invoice_contract_verification_pro_mode.json"
input_docs = "../data/field_extraction_pro_mode/invoice_contract_verification/input_docs"

# NOTE: Reference documents are optional in Pro mode. Comment out the line below if not using reference documents.
reference_docs = "../data/field_extraction_pro_mode/invoice_contract_verification/reference_docs"
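
Because Pro mode currently accepts only the document file types listed at the top of this notebook, it can be useful to check a folder before submitting it. The following is a minimal sketch; `is_supported` and `unsupported_files` are hypothetical helpers, not part of the sample's utility class, and the extension list is copied from the note above.

```python
from pathlib import Path

# Extensions copied from the supported file types listed at the top of this notebook.
SUPPORTED_EXTENSIONS = {".pdf", ".tiff", ".jpg", ".jpeg", ".png", ".bmp", ".heif"}


def is_supported(file_name):
    """Return True if the file name ends with a supported extension (case-insensitive)."""
    return Path(file_name).suffix.lower() in SUPPORTED_EXTENSIONS


def unsupported_files(location):
    """Return the files at `location` (a single file or a folder) that Pro mode would not accept."""
    path = Path(location)
    candidates = [path] if path.is_file() else sorted(p for p in path.iterdir() if p.is_file())
    return [p for p in candidates if not is_supported(p.name)]
```

For example, calling `unsupported_files(input_docs)` should return an empty list for the sample data if every input file is one of the listed types.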

> Let's examine the analyzer template used for Pro mode.

In [None]:
import json
with open(analyzer_template, "r", encoding="utf-8") as file:
    print(json.dumps(json.load(file), indent=2))

> In the analyzer, the `"mode"` must be set to `"pro"`. The defined field `"PaymentTermsInconsistencies"` is a `"generate"` field designed to reason about inconsistencies. It can also utilize the referenced documents uploaded to the [reference docs](../data/field_extraction_pro_mode/invoice_contract_verification/reference_docs) folder.
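
The two properties described above can also be checked programmatically. This is a sketch under assumptions: the key names `fieldSchema`, `fields`, and `method` are assumed from the layout of the printed template and should be confirmed against your own file, and `example_template` is a hypothetical, heavily trimmed stand-in rather than the real template.

```python
def is_pro_mode(template):
    """Return True if the analyzer template dict declares "mode": "pro"."""
    return template.get("mode") == "pro"


def generate_fields(template):
    """Return the names of fields whose method is "generate".

    Assumes fields live under template["fieldSchema"]["fields"]; adjust if your template differs.
    """
    fields = template.get("fieldSchema", {}).get("fields", {})
    return [name for name, spec in fields.items() if spec.get("method") == "generate"]


# Hypothetical, trimmed template used only to exercise the helpers.
example_template = {
    "mode": "pro",
    "fieldSchema": {
        "fields": {
            "PaymentTermsInconsistencies": {"type": "array", "method": "generate"},
        }
    },
}

print(is_pro_mode(example_template), generate_fields(example_template))
```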

## Create Azure Content Understanding Client
> The [AzureContentUnderstandingClient](../python/content_understanding_client.py) is a utility class containing the helper functions used in this notebook. Until the official Content Understanding SDK is released, treat it as a lightweight SDK.
Fill in the constants **AZURE_AI_ENDPOINT**, **AZURE_AI_API_VERSION**, and **AZURE_AI_API_KEY** with your Azure AI Service information.

> ⚠️ Important:
Update the code below to match your Azure authentication method. 
Look for the `# IMPORTANT` comments and adjust those sections accordingly.
Skipping this step may cause the sample to fail.

> ⚠️ Note: Using a subscription key is supported, but using a token provider with Microsoft Entra ID (formerly Azure Active Directory, AAD) is more secure and highly recommended for production.

In [None]:
import logging
import os
import sys
from pathlib import Path
from dotenv import find_dotenv, load_dotenv
from azure.identity import DefaultAzureCredential, get_bearer_token_provider

# Import utility package from python samples root directory
parent_dir = Path.cwd().parent
sys.path.append(str(parent_dir))
from python.content_understanding_client import AzureContentUnderstandingClient

load_dotenv(find_dotenv())
logging.basicConfig(level=logging.INFO)

# For authentication, you can use either token-based auth or subscription key; only one is required
AZURE_AI_ENDPOINT = os.getenv("AZURE_AI_ENDPOINT")
# IMPORTANT: Replace with your actual subscription key or set it in the ".env" file if not using token auth
AZURE_AI_API_KEY = os.getenv("AZURE_AI_API_KEY")
AZURE_AI_API_VERSION = os.getenv("AZURE_AI_API_VERSION", "2025-05-01-preview")

credential = DefaultAzureCredential()
token_provider = get_bearer_token_provider(credential, "https://cognitiveservices.azure.com/.default")

client = AzureContentUnderstandingClient(
    endpoint=AZURE_AI_ENDPOINT,
    api_version=AZURE_AI_API_VERSION,
    # IMPORTANT: Comment out token_provider if using subscription key
    token_provider=token_provider,
    # IMPORTANT: Uncomment this line if using subscription key
    # subscription_key=AZURE_AI_API_KEY,
    x_ms_useragent="azure-ai-content-understanding-python/pro_mode",  # Used for sample usage telemetry; comment out to opt out.
)
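
Instead of commenting lines in and out, the choice between the two authentication methods could be made in code. The sketch below is one possible approach, not the sample's own: `build_auth_kwargs` is a hypothetical helper, and it assumes that a non-empty subscription key signals the intent to use key-based auth. The keyword names `subscription_key` and `token_provider` are the ones used in the constructor call above.

```python
def build_auth_kwargs(subscription_key, token_provider):
    """Pick exactly one authentication argument for the client constructor.

    Uses the subscription key when one is provided, otherwise the token provider.
    """
    if subscription_key:
        return {"subscription_key": subscription_key}
    if token_provider is not None:
        return {"token_provider": token_provider}
    raise ValueError("Provide either a subscription key or a token provider.")
```

The result could then be unpacked into the constructor, for example `AzureContentUnderstandingClient(endpoint=AZURE_AI_ENDPOINT, api_version=AZURE_AI_API_VERSION, **build_auth_kwargs(AZURE_AI_API_KEY, token_provider))`.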

## Prepare Reference Data
In this step, we will:
- Use Azure AI service to extract OCR results from reference documents (if needed).
- Generate a reference `.jsonl` file.
- Upload these files to the designated Azure Blob storage.

This process uses **REFERENCE_DOC_SAS_URL** and **REFERENCE_DOC_PATH** set during the Prerequisites step.


In [None]:
# Load reference storage configuration from environment
REFERENCE_DOC_SAS_URL = os.getenv("REFERENCE_DOC_SAS_URL")
REFERENCE_DOC_PATH = os.getenv("REFERENCE_DOC_PATH")
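
If you plan to use reference documents, both values need to be present before the upload step. The following is a minimal sketch for reporting which ones are missing; `missing_settings` is a hypothetical helper, not part of the sample's utility class.

```python
import os


def missing_settings(names, environ=None):
    """Return the names that are unset or empty in the given environment mapping."""
    env = os.environ if environ is None else environ
    return [name for name in names if not env.get(name)]


# The two settings this notebook reads for reference documents.
reference_settings = ["REFERENCE_DOC_SAS_URL", "REFERENCE_DOC_PATH"]
print("Missing reference-doc settings:", missing_settings(reference_settings))
```

An empty list means both settings were found; otherwise, revisit the Prerequisites step or skip the reference-document cells.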

> ⚠️ Note: Reference documents are optional in Pro mode. You can run Pro mode using only input documents. For example, the service can reason across two or more input files without any reference data. 
> If you do not have reference documents, please skip or comment out the following section.

In [None]:
# Set skip_analyze to True if you already have OCR results for the documents in the reference_docs folder.
# OCR result files must be named with the original document file name plus the suffix ".result.json".
# For example, if the original file is "invoice.pdf", the OCR result should be named "invoice.pdf.result.json".
# NOTE: Comment out this line if you do not have reference documents.
await client.generate_knowledge_base_on_blob(reference_docs, REFERENCE_DOC_SAS_URL, REFERENCE_DOC_PATH, skip_analyze=False)
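
The naming rule for OCR result files described in the comments above can be turned into a quick check that helps decide whether `skip_analyze=True` is safe. This is a sketch only; `expected_result_name` and `documents_missing_results` are hypothetical helpers that operate on plain file names.

```python
RESULT_SUFFIX = ".result.json"


def expected_result_name(document_name):
    """Return the OCR result file name expected for a document, e.g. invoice.pdf.result.json."""
    return document_name + RESULT_SUFFIX


def documents_missing_results(file_names):
    """Given the file names in a reference folder, return documents that lack an OCR result file."""
    names = set(file_names)
    documents = [name for name in names if not name.endswith(RESULT_SUFFIX)]
    return sorted(name for name in documents if expected_result_name(name) not in names)
```

For example, `documents_missing_results(os.listdir(reference_docs))` returning an empty list suggests every reference document already has a matching result file.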

## Create Analyzer with Defined Schema for Pro Mode
Before creating the analyzer, assign a relevant name to the constant `CUSTOM_ANALYZER_ID`. Here, we generate a unique suffix so this cell can be run multiple times to create different analyzers.

We use **REFERENCE_DOC_SAS_URL** and **REFERENCE_DOC_PATH** set in the [.env](./.env) file, as used in the previous step.

In [None]:
import uuid
CUSTOM_ANALYZER_ID = "pro-mode-sample-" + str(uuid.uuid4())

response = client.begin_create_analyzer(
    CUSTOM_ANALYZER_ID,
    analyzer_template_path=analyzer_template,
    pro_mode_reference_docs_storage_container_sas_url=REFERENCE_DOC_SAS_URL,
    pro_mode_reference_docs_storage_container_path_prefix=REFERENCE_DOC_PATH,
)
result = client.poll_result(response)
if result is not None and "status" in result and result["status"] == "Succeeded":
    logging.info(f"Analyzer details for {result['result']['analyzerId']}")
    logging.info(json.dumps(result, indent=2))
else:
    logging.warning(
        "An issue was encountered when trying to create the analyzer. "
        "Please double-check your deployment and configurations for potential problems."
    )
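
The cell above only logs a warning when creation fails, so later cells would still run against an analyzer that does not exist. One alternative, sketched here with a hypothetical helper named `require_succeeded`, is to raise an error instead. It relies only on the `"status"` key and the `"Succeeded"` value used in the check above.

```python
def require_succeeded(result, action):
    """Return the result unchanged if its status is "Succeeded"; otherwise raise RuntimeError."""
    status = result.get("status") if isinstance(result, dict) else None
    if status != "Succeeded":
        raise RuntimeError(f"{action} did not succeed (status: {status!r}).")
    return result
```

It could wrap the polling call, for example `result = require_succeeded(client.poll_result(response), "Analyzer creation")`.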

## Analyze Input Documents Using the Created Analyzer
Once the analyzer is successfully created, you can use it to analyze your input files.
> NOTE: Pro mode involves multi-step reasoning and may take longer to complete the analysis.

In [None]:
from IPython.display import FileLink, display

response = client.begin_analyze(CUSTOM_ANALYZER_ID, file_location=input_docs)
result_json = client.poll_result(response, timeout_seconds=600)  # Increased timeout for Pro mode

# Create the output directory if it doesn't exist
output_dir = "output"
os.makedirs(output_dir, exist_ok=True)

output_path = os.path.join(output_dir, f"{CUSTOM_ANALYZER_ID}_result.json")
with open(output_path, "w", encoding="utf-8") as file:
    json.dump(result_json, file, indent=2)

logging.info(f"Full analyzer result saved to: {output_path}")
display(FileLink(output_path))

> Let's examine the extracted fields from Pro mode.

In [None]:
fields = result_json["result"]["contents"][0]["fields"]
print(json.dumps(fields, indent=2))

> For example, the field `PaymentTermsInconsistencies` shows that the purchase contract contains detailed payment terms agreed upon prior to the service. However, the implied payment terms on the invoice conflict with this. Pro mode was able to identify the corresponding contract for this invoice from the reference documents and analyze both together to uncover this inconsistency.
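
For a quicker overview than the full JSON dump, the fields can be reduced to their type and value. The sketch below assumes each field is a dictionary with a `"type"` key and a single key starting with `"value"` (such as `"valueString"`); this shape is an assumption to verify against your own output, and `sample_fields` is hypothetical data, not a real service response.

```python
def summarize_fields(fields):
    """Map each field name to a (type, value) pair.

    Assumes every field is a dict with a "type" key and at most one "value*" key.
    """
    summary = {}
    for name, field in fields.items():
        value_keys = [key for key in field if key.startswith("value")]
        value = field[value_keys[0]] if value_keys else None
        summary[name] = (field.get("type"), value)
    return summary


# Hypothetical data shaped like the assumption above.
sample_fields = {
    "ExampleField": {"type": "string", "valueString": "example value"},
    "EmptyField": {"type": "string"},
}
print(summarize_fields(sample_fields))
```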

## Delete Existing Analyzer from Content Understanding Service
This step is optional but recommended to prevent unnecessary analyzers from accumulating in your service. Without deletion, the analyzer will remain in your service and may affect subsequent usage.

In [None]:
client.delete_analyzer(CUSTOM_ANALYZER_ID)

## Bonus Sample
This additional sample demonstrates how Pro mode supports multi-document input and advanced reasoning. Unlike Document Standard Mode, which processes one document at a time, Pro mode can analyze multiple documents in a single analysis call. Pro mode processes each document independently and cross-references them to perform reasoning across documents, providing deeper insights and validation.

### Setting Up Variables for the Second Sample

In [None]:
# Define paths for analyzer template, input documents, and reference documents for the second sample
analyzer_template_2 = "../analyzer_templates/insurance_claims_review_pro_mode.json"
input_docs_2 = "../data/field_extraction_pro_mode/insurance_claims_review/input_docs"
reference_docs_2 = "../data/field_extraction_pro_mode/insurance_claims_review/reference_docs"

# Load reference storage configuration from environment
REFERENCE_DOC_SAS_URL_2 = os.getenv("REFERENCE_DOC_SAS_URL")  # Reuse the same blob container
REFERENCE_DOC_PATH_2 = os.getenv("REFERENCE_DOC_PATH").rstrip("/") + "_2/"  # Use a different path for the second sample
CUSTOM_ANALYZER_ID_2 = "pro-mode-sample-" + str(uuid.uuid4())
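
The path for the second sample is derived by stripping any trailing slash from the first path and appending `_2/`, so both forms of the original setting lead to the same result. A small sketch of that derivation, using a hypothetical helper named `sibling_prefix`:

```python
def sibling_prefix(base_path, suffix):
    """Strip trailing slashes from base_path, append the suffix, and end with a single slash."""
    return base_path.rstrip("/") + suffix + "/"


for base in ("reference_docs", "reference_docs/", "kb/reference_docs/"):
    print(repr(base), "->", repr(sibling_prefix(base, "_2")))
```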

### Generate Knowledge Base for the Second Sample
Upload [reference documents](../data/field_extraction_pro_mode/insurance_claims_review/reference_docs/) with existing OCR results for this sample. These documents contain driver coverage policies useful for reviewing insurance claims.

In [None]:
logging.info("Starting to generate knowledge base for the second sample...")
await client.generate_knowledge_base_on_blob(reference_docs_2, REFERENCE_DOC_SAS_URL_2, REFERENCE_DOC_PATH_2, skip_analyze=True)

### Create Analyzer for the Second Sample
We reuse the previous AzureContentUnderstandingClient.

In [None]:
response = client.begin_create_analyzer(
    CUSTOM_ANALYZER_ID_2,
    analyzer_template_path=analyzer_template_2,
    pro_mode_reference_docs_storage_container_sas_url=REFERENCE_DOC_SAS_URL_2,
    pro_mode_reference_docs_storage_container_path_prefix=REFERENCE_DOC_PATH_2,
)
result = client.poll_result(response)
if result is not None and "status" in result and result["status"] == "Succeeded":
    logging.info(f"Analyzer details for {result['result']['analyzerId']}")
    logging.info(json.dumps(result, indent=2))
else:
    logging.warning(
        "An issue was encountered when trying to create the analyzer. "
        "Please double-check your deployment and configurations for potential problems."
    )

### Analyze Multiple Input Documents with the Second Analyzer
The [input_docs_2](../data/field_extraction_pro_mode/insurance_claims_review/input_docs/) directory contains two PDF files as input: a car accident report and a repair estimate.

The first document includes details such as the vehicle's license plate number, vehicle model, and other incident-related information.
The second document provides a breakdown of the estimated repair costs.

Due to the complexity of this multi-document scenario and the processing involved, generating the results may take a few minutes.

In [None]:
logging.info("Starting analysis of input documents for the second sample...")
response = client.begin_analyze(CUSTOM_ANALYZER_ID_2, file_location=input_docs_2)
result_json = client.poll_result(response, timeout_seconds=600)  # Increased timeout for Pro mode

# Save the result to a JSON file, creating the output directory if it doesn't exist
output_dir = "output"
os.makedirs(output_dir, exist_ok=True)
output_path = os.path.join(output_dir, f"{CUSTOM_ANALYZER_ID_2}_result.json")
with open(output_path, "w", encoding="utf-8") as file:
    json.dump(result_json, file, indent=2)

logging.info(f"Full analyzer result saved to: {output_path}")
display(FileLink(output_path))

### Examine the Analysis Result

In [None]:
result_json["result"]["contents"][0]["fields"]

### Deeper Look at the `LineItemCorroboration` Field

> The field `ReportingOfficer` is present only in the car accident report, while fields like `VIN` are found only in the repair estimate document. This demonstrates that information is extracted from both documents to produce a single consolidated result. 
> This also illustrates the many-to-one relationship between input documents and the analysis output: multiple input documents contribute to one unified analysis result. This is not a batch model where each input document yields a separate output.

In [None]:
fields = result_json["result"]["contents"][0]["fields"]["LineItemCorroboration"]
print(json.dumps(fields, indent=2))

> In the `LineItemCorroboration` field, each line item from the *repair estimate document* is listed along with its corresponding information, claim status, and evidence. Items not covered by the policy, such as a Starbucks drink and a hotel stay, are marked as suspicious, while damage repairs that are supported by the supplied claim documents and permitted by the policy are confirmed.
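
To pull out only the items flagged for review, the array can be filtered by claim status. This is a sketch under stated assumptions: the nested `"valueArray"` / `"valueObject"` / `"valueString"` layout is assumed rather than confirmed here, and the sub-field names (`ItemName`, `ClaimStatus`) and status values in `sample_field` are hypothetical. Check the printed output above for the actual names before reusing it.

```python
def line_items_with_status(field, status_key, wanted_status):
    """Return line items (as plain dicts of strings) whose status sub-field equals wanted_status.

    Assumes field["valueArray"] is a list of {"valueObject": {name: {"valueString": ...}}}.
    """
    items = []
    for entry in field.get("valueArray", []):
        obj = entry.get("valueObject", {})
        status = obj.get(status_key, {}).get("valueString")
        if status == wanted_status:
            items.append({name: sub.get("valueString") for name, sub in obj.items()})
    return items


# Hypothetical data shaped like the assumption above; not a real service response.
sample_field = {
    "type": "array",
    "valueArray": [
        {"valueObject": {
            "ItemName": {"valueString": "Bumper repair"},
            "ClaimStatus": {"valueString": "confirmed"},
        }},
        {"valueObject": {
            "ItemName": {"valueString": "Hotel stay"},
            "ClaimStatus": {"valueString": "suspicious"},
        }},
    ],
}

print(line_items_with_status(sample_field, "ClaimStatus", "suspicious"))
```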

### [Optional] Delete the Analyzer for the Second Sample After Use

In [None]:
client.delete_analyzer(CUSTOM_ANALYZER_ID_2)