# Azure AI Content Understanding - Classifier and Analyzer Demo

This notebook demonstrates how to use the Azure AI Content Understanding service to:
1. Create a classifier for document categorization
2. Create a custom analyzer to extract specific fields
3. Combine the classifier and analyzers to classify, optionally split, and analyze documents within a flexible processing pipeline

For more detailed information before getting started, please refer to the official documentation:
[Understanding Classifiers in Azure AI Services](https://learn.microsoft.com/en-us/azure/ai-services/content-understanding/concepts/classifier)

## Prerequisites
1. Ensure the Azure AI service is configured by following the [setup steps](../README.md#configure-azure-ai-service-resource).
2. Install the required packages to run this sample.

In [None]:
%pip install -r ../requirements.txt

## Create Azure AI Content Understanding Client

> The code below uses `ContentUnderstandingClient` from the `azure-ai-contentunderstanding` package to interact with the Content Understanding API.
>
> Set the environment variable **AZURE_CONTENT_UNDERSTANDING_ENDPOINT** (for example in a `.env` file) to the endpoint of your Azure AI Service. Optionally set **AZURE_CONTENT_UNDERSTANDING_KEY** to use key-based authentication; if it is not set, `DefaultAzureCredential` is used.

> ⚠️ Important:
> You must update the code below to use your preferred Azure authentication method.
> Look for the `# IMPORTANT` comments in the code and modify those sections accordingly.
> Skipping this step may prevent the sample from running correctly.

> ⚠️ Note: A subscription key is supported, but for production environments we strongly recommend token-based authentication with Microsoft Entra ID (formerly Azure Active Directory) for enhanced security.
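
As a minimal sketch of the choice described in the note above, the helper below decides between key-based and token-based authentication from environment variables. `resolve_auth_mode` is a hypothetical name and not part of any SDK. The variable names mirror those read in the next cell, and the function only reports which mode would be used rather than constructing a credential.

```python
from typing import Mapping, Tuple


def resolve_auth_mode(env: Mapping[str, str]) -> Tuple[str, str]:
    """Return (endpoint, mode), where mode is "key" or "token".

    Hypothetical helper: it fails fast when the endpoint is missing instead of
    letting a None endpoint reach the client constructor.
    """
    endpoint = env.get("AZURE_CONTENT_UNDERSTANDING_ENDPOINT", "").strip()
    if not endpoint:
        raise ValueError("AZURE_CONTENT_UNDERSTANDING_ENDPOINT is not set")
    # A non-empty key selects key-based auth; otherwise fall back to a token credential.
    mode = "key" if env.get("AZURE_CONTENT_UNDERSTANDING_KEY") else "token"
    return endpoint, mode
```

Calling `resolve_auth_mode(os.environ)` before creating the client surfaces a missing endpoint as a clear error rather than a failure on the first request.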

In [None]:
%pip install python-dotenv azure-ai-contentunderstanding azure-identity

import logging
import json
import os
import sys
from datetime import datetime
import uuid
from dotenv import load_dotenv
from azure.core.credentials import AzureKeyCredential
from azure.identity.aio import DefaultAzureCredential
from azure.ai.contentunderstanding.aio import ContentUnderstandingClient
from azure.ai.contentunderstanding.models import (
    ContentClassifier,
    ContentAnalyzer,
    ClassifierCategory,
    DocumentContent,
    FieldSchema,
    FieldDefinition,
    FieldType,
    ContentAnalyzerConfig,
)

# Add the parent directory to the Python path to import the sample_helper module
sys.path.append(os.path.join(os.path.dirname(os.getcwd()), 'python'))
from extension.sample_helper import extract_operation_id_from_poller, save_json_to_file, PollerType
from typing import Dict, Optional

load_dotenv()
logging.basicConfig(level=logging.INFO)

endpoint = os.environ.get("AZURE_CONTENT_UNDERSTANDING_ENDPOINT")
# Use AzureKeyCredential if AZURE_CONTENT_UNDERSTANDING_KEY is set, otherwise fall back to DefaultAzureCredential
key = os.getenv("AZURE_CONTENT_UNDERSTANDING_KEY")
credential = AzureKeyCredential(key) if key else DefaultAzureCredential()
# Create the ContentUnderstandingClient
client = ContentUnderstandingClient(endpoint=endpoint, credential=credential)

## Create a Basic Classifier
Classify a document from a local file using the `begin_classify_binary` API.

High-level steps:
1. Create a custom classifier
2. Classify a local PDF document
3. Save the classification result to a file
4. Clean up the created classifier
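
Step 1 needs an ID that does not collide with classifiers left over from earlier runs. The sketch below shows one way to build such an ID from a prefix, a timestamp, and a short random suffix. `make_resource_id` is a hypothetical helper, and the exact ID format is an illustrative choice rather than a service requirement.

```python
import uuid
from datetime import datetime
from typing import Optional


def make_resource_id(prefix: str, now: Optional[datetime] = None) -> str:
    """Build an ID shaped like '<prefix>-YYYYMMDD-HHMMSS-<8 hex chars>'."""
    # Read the clock once so the date and time parts cannot straddle midnight.
    moment = now or datetime.now()
    return f"{prefix}-{moment.strftime('%Y%m%d-%H%M%S')}-{uuid.uuid4().hex[:8]}"


# A fixed timestamp makes the deterministic part of the ID easy to inspect.
example_id = make_resource_id("classifier-sample", datetime(2025, 1, 31, 9, 5, 7))
```

The random suffix keeps two IDs created within the same second distinct, which matters when several cells or notebooks create resources in quick succession.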

In [None]:
def create_classifier_schema(description: Optional[str] = None, tags: Optional[Dict[str, str]] = None) -> ContentClassifier:
    """Create a ContentClassifier with three document categories and automatic splitting.

    Args:
        description: Optional description for the classifier
        tags: Optional tags for the classifier

    Returns:
        ContentClassifier: A configured ContentClassifier object
    """
    categories = {
        "Loan application": ClassifierCategory(
            description="Documents submitted by individuals or businesses to request funding, typically including personal or business details, financial history, loan amount, purpose, and supporting documentation."
        ),
        "Invoice": ClassifierCategory(
            description="Billing documents issued by sellers or service providers to request payment for goods or services, detailing items, prices, taxes, totals, and payment terms."
        ),
        "Bank_Statement": ClassifierCategory(
            description="Official statements issued by banks that summarize account activity over a period, including deposits, withdrawals, fees, and balances."
        ),
    }

    classifier = ContentClassifier(
        categories=categories,
        split_mode="auto",
        description=description,
        tags=tags,
    )

    return classifier

# Generate a unique classifier ID
classifier_id = f"classifier-sample-{datetime.now().strftime('%Y%m%d-%H%M%S')}-{uuid.uuid4().hex[:8]}"

# Create a custom classifier using object model
print(f"🔧 Creating custom classifier '{classifier_id}'...")

classifier_schema: ContentClassifier = create_classifier_schema(
    description=f"Custom classifier for URL classification demo: {classifier_id}",
    tags={"demo_type": "url_classification"},
)

# Start the classifier creation operation
poller = await client.content_classifiers.begin_create_or_replace(
    classifier_id=classifier_id,
    resource=classifier_schema,
)

# Wait for the classifier to be created
print(f"⏳ Waiting for classifier creation to complete...")
await poller.result()
print(f"✅ Classifier '{classifier_id}' created successfully!")


## Classify Your Document

Now, use the classifier to categorize your document.

In [None]:
# Read the mixed financial docs PDF file
pdf_path = "../data/mixed_financial_docs.pdf"
print(f"📄 Reading document file: {pdf_path}")
with open(pdf_path, "rb") as pdf_file:
    pdf_content = pdf_file.read()

# Begin binary classification operation
print(f"🔍 Starting binary classification with classifier '{classifier_id}'...")
classification_poller = await client.content_classifiers.begin_classify_binary(
    classifier_id=classifier_id,
    input=pdf_content,
    content_type="application/pdf",
)

# Wait for classification completion
print(f"⏳ Waiting for classification to complete...")
classification_result = await classification_poller.result()
print(f"✅ Classification completed successfully!")

# Extract operation ID for get_result
classification_operation_id = extract_operation_id_from_poller(
    classification_poller, PollerType.CLASSIFY_CALL
)
print(
    f"📋 Extracted classification operation ID: {classification_operation_id}"
)

# Get the classification result using the operation ID
print(
    f"🔍 Getting classification result using operation ID '{classification_operation_id}'..."
)
operation_status = await client.content_classifiers.get_result(
    operation_id=classification_operation_id,
)

print(f"✅ Classification result retrieved successfully!")
print(f"   Operation ID: {getattr(operation_status, 'id', 'N/A')}")
print(f"   Status: {getattr(operation_status, 'status', 'N/A')}")

# The actual classification result is in operation_status.result
operation_result = getattr(operation_status, "result", None)
if operation_result is not None:
    print(
        f"   Result contains {len(getattr(operation_result, 'contents', []))} contents"
    )

# Save the classification result to a file
saved_file_path = save_json_to_file(
    result=operation_status.as_dict(),
    filename_prefix="content_classifiers_get_result",
)
print(f"💾 Classification result saved to: {saved_file_path}")

## View Classification Results

Review the classification results generated for your document.
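
When a file contains several documents, it can help to see which page ranges landed in each category. The sketch below groups page ranges by category from a result dictionary such as the one produced by `as_dict()`. The camelCase key names (`contents`, `category`, `startPageNumber`, `endPageNumber`) are an assumption about the serialized form, and the sample data is made up, so check both against the JSON file written by this notebook.

```python
from collections import defaultdict
from typing import Any, Dict, List, Tuple


def pages_by_category(result: Dict[str, Any]) -> Dict[str, List[Tuple[int, int]]]:
    """Group (start_page, end_page) ranges under each predicted category."""
    grouped: Dict[str, List[Tuple[int, int]]] = defaultdict(list)
    for segment in result.get("contents", []):
        # Key names are assumed; segments without a category are filed under "unknown".
        category = segment.get("category") or "unknown"
        grouped[category].append((segment["startPageNumber"], segment["endPageNumber"]))
    return dict(grouped)


# Hypothetical result for a 4-page file; not real service output.
sample_result = {
    "contents": [
        {"category": "Invoice", "startPageNumber": 1, "endPageNumber": 1},
        {"category": "Bank_Statement", "startPageNumber": 2, "endPageNumber": 3},
        {"category": "Loan application", "startPageNumber": 4, "endPageNumber": 4},
    ]
}
summary = pages_by_category(sample_result)
```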

In [None]:
# Display classification results
print(f"📊 Classification Results:")
for content in classification_result.contents:
    document_content: DocumentContent = content
    print(f"   Category: {document_content.category}")
    print(f"       Start Page Number: {document_content.start_page_number}")
    print(f"       End Page Number: {document_content.end_page_number}")

## Saving Classification Results
The classification result is saved to a JSON file for later analysis.
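
The `save_json_to_file` helper comes from the sample's helper module, and its implementation is not shown in this notebook. As a rough stand-in, the sketch below writes a dictionary to a timestamped JSON file. `write_result_json` and its parameters are hypothetical and may differ from what the real helper does.

```python
import json
import tempfile
from datetime import datetime
from pathlib import Path
from typing import Any, Dict


def write_result_json(result: Dict[str, Any], filename_prefix: str, output_dir: str) -> Path:
    """Write `result` to '<output_dir>/<filename_prefix>_<timestamp>.json'."""
    target_dir = Path(output_dir)
    target_dir.mkdir(parents=True, exist_ok=True)
    timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
    path = target_dir / f"{filename_prefix}_{timestamp}.json"
    # ensure_ascii=False keeps non-ASCII text from documents readable in the file.
    path.write_text(json.dumps(result, indent=2, ensure_ascii=False), encoding="utf-8")
    return path


# Write into a scratch directory and read the file back to confirm the round trip.
with tempfile.TemporaryDirectory() as scratch_dir:
    saved = write_result_json({"status": "Succeeded"}, "demo_result", scratch_dir)
    round_trip = json.loads(saved.read_text(encoding="utf-8"))
```

Putting a timestamp in the file name keeps results from repeated runs side by side instead of overwriting each other.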

In [None]:
# Save the classification result to a file

saved_file_path = save_json_to_file(
    result=classification_result.as_dict(),
    filename_prefix="content_classifiers_classify",
)
print(f"💾 Classification result saved to: {saved_file_path}")


## Clean up the created classifier
Delete the classifier once the demo completes to prevent resource accumulation.
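
If an earlier cell fails, a separate cleanup cell may never run and the classifier stays behind. One common way to guard against that is a `try`/`finally` block, sketched below with a stand-in object instead of the real client. `FakeClassifiers`, `run_with_cleanup`, and `demo` are hypothetical names used only for illustration.

```python
import asyncio  # only needed to run demo() from a script
from typing import List


class FakeClassifiers:
    """Stand-in for the classifier operations; it only records deletions."""

    def __init__(self) -> None:
        self.deleted: List[str] = []

    async def delete(self, classifier_id: str) -> None:
        self.deleted.append(classifier_id)


async def run_with_cleanup(classifiers: FakeClassifiers, classifier_id: str, fail: bool) -> None:
    try:
        if fail:
            raise RuntimeError("simulated classification failure")
    finally:
        # Runs on success and on failure, so the classifier never lingers.
        await classifiers.delete(classifier_id=classifier_id)


async def demo() -> List[str]:
    fake = FakeClassifiers()
    await run_with_cleanup(fake, "demo-ok", fail=False)
    try:
        await run_with_cleanup(fake, "demo-fail", fail=True)
    except RuntimeError:
        pass  # the failure still propagates; cleanup has already run
    return fake.deleted
```

In a notebook, `await demo()` runs the sketch directly; in a script, use `asyncio.run(demo())`.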

In [None]:
# Clean up the created classifier (demo cleanup)
print(f"🗑️  Deleting classifier '{classifier_id}' (demo cleanup)...")
await client.content_classifiers.delete(classifier_id=classifier_id)
print(f"✅ Classifier '{classifier_id}' deleted successfully!")

## Create a Custom Analyzer (Advanced)

Create a custom analyzer to extract specific fields from documents.
This example extracts common fields from loan application documents and generates document excerpts.
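
Before sending a schema to the service, a quick local check can catch typos in field definitions. The sketch below validates a plain-dictionary version of a field schema. The sets of allowed types and methods are assumptions made for illustration, and `find_schema_problems` is a hypothetical helper, so confirm the supported values in the service documentation.

```python
from typing import Any, Dict, List

# Assumed value sets for illustration; confirm them against the service documentation.
ASSUMED_TYPES = {"string", "date", "time", "number", "integer", "boolean", "array", "object"}
ASSUMED_METHODS = {"generate", "extract", "classify"}


def find_schema_problems(fields: Dict[str, Dict[str, Any]]) -> List[str]:
    """Return one human-readable message per problem found in a plain-dict field schema."""
    problems: List[str] = []
    for name, definition in fields.items():
        field_type = definition.get("type")
        method = definition.get("method")
        if field_type not in ASSUMED_TYPES:
            problems.append(f"{name}: unsupported type {field_type!r}")
        if method not in ASSUMED_METHODS:
            problems.append(f"{name}: unsupported method {method!r}")
        if not str(definition.get("description", "")).strip():
            problems.append(f"{name}: description is empty")
    return problems


# Hypothetical draft with two deliberate mistakes in the second field.
draft_fields = {
    "ApplicantName": {"type": "string", "method": "generate", "description": "Full name of the applicant."},
    "CreditScore": {"type": "score", "method": "generate", "description": ""},
}
problems = find_schema_problems(draft_fields)
```

An empty list means the draft passed these checks; it does not guarantee that the service will accept the schema.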

In [None]:
# Define the custom analyzer, including its field schema
custom_analyzer = ContentAnalyzer(
    base_analyzer_id="prebuilt-documentAnalyzer",  # Built on top of the general document analyzer
    description="Loan application analyzer - extracts key information from loan applications",
    config=ContentAnalyzerConfig(
        return_details=True,
        enable_layout=True,          # Extract layout details
        enable_formula=False,        # Disable formula detection
        estimate_field_source_and_confidence=True, # Enable estimation of field location and confidence
        disable_content_filtering=False
    ),
    field_schema=FieldSchema(
        fields={
            "ApplicationDate": FieldDefinition(
                type=FieldType.DATE,
                method="generate",
                description="The date when the loan application was submitted."
            ),
            "ApplicantName": FieldDefinition(
                type=FieldType.STRING,
                method="generate",
                description="Full name of the loan applicant or company."
            ),
            "LoanAmountRequested": FieldDefinition(
                type=FieldType.NUMBER,
                method="generate",
                description="The total loan amount requested by the applicant."
            ),
            "LoanPurpose": FieldDefinition(
                type=FieldType.STRING,
                method="generate",
                description="The stated purpose or reason for the loan."
            ),
            "CreditScore": FieldDefinition(
                type=FieldType.NUMBER,
                method="generate",
                description="Credit score of the applicant, if available."
            ),
            "Summary": FieldDefinition(
                type=FieldType.STRING,
                method="generate",
                description="A brief summary overview of the loan application details."
            )
        }
    ),
    tags={"demo": "loan-application"}
)

# Generate a unique analyzer ID
analyzer_id = f"analyzer-sample-{datetime.now().strftime('%Y%m%d-%H%M%S')}-{uuid.uuid4().hex[:8]}"

# Create the custom analyzer
print(f"🔧 Creating custom analyzer '{analyzer_id}'...")
poller = await client.content_analyzers.begin_create_or_replace(
    analyzer_id=analyzer_id,
    resource=custom_analyzer,
)
result = await poller.result()
print(f"✅ Analyzer '{analyzer_id}' created successfully!")


## Create an Enhanced Classifier with Custom Analyzer

Now create a new classifier that uses the prebuilt invoice analyzer for invoices and the custom analyzer for loan application documents.
This combines document classification with field extraction in one operation.
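
Conceptually, the enhanced classifier behaves like a routing table from category to analyzer. The service applies this routing on its side; the sketch below only mirrors the idea locally so the mapping is easy to inspect. `ROUTES`, `plan_analysis`, and the `<custom-analyzer-id>` placeholder are illustrative names, not part of the API.

```python
from typing import Dict, List, Optional, Tuple

# Category -> analyzer ID. None means "classify only, no field extraction".
# "<custom-analyzer-id>" is a placeholder for the ID generated earlier.
ROUTES: Dict[str, Optional[str]] = {
    "Loan application": "<custom-analyzer-id>",
    "Invoice": "prebuilt-invoice",
    "Bank_Statement": None,
}


def plan_analysis(segments: List[Tuple[str, int, int]]) -> List[str]:
    """Describe which analyzer each (category, start_page, end_page) segment would go to."""
    plan = []
    for category, start, end in segments:
        analyzer = ROUTES.get(category)
        action = f"analyze with {analyzer}" if analyzer else "classify only"
        plan.append(f"pages {start}-{end} [{category}]: {action}")
    return plan


# Hypothetical segments for a 4-page file.
plan = plan_analysis([("Invoice", 1, 1), ("Bank_Statement", 2, 3), ("Loan application", 4, 4)])
```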

In [None]:
def create_enhanced_classifier_schema(analyzer_id: str, description: Optional[str] = None, tags: Optional[Dict[str, str]] = None) -> ContentClassifier:
    categories = {
        "Loan application": {  # Both spaces and underscores allowed
            "description": "Documents submitted by individuals or businesses requesting funding, including personal/business details, financial history, and supporting documents.",
            "analyzerId": analyzer_id  # IMPORTANT: Use the custom analyzer created previously for loan applications
        },
        "Invoice": {
            "description": "Billing documents issued by sellers or service providers requesting payment for goods or services, detailing items, prices, taxes, totals, and payment terms.",
            "analyzerId": "prebuilt-invoice"  # Use prebuilt invoice analyzer for invoices
        },
        "Bank_Statement": {  # Both spaces and underscores allowed
            "description": "Official bank statements summarizing account activity over a period, including deposits, withdrawals, fees, and balances."
            # No analyzer specified - uses default processing
        }
    }

    classifier = ContentClassifier(
        categories=categories,
        split_mode="auto",
        description=description,
        tags=tags,
    )

    return classifier

# Generate a unique enhanced classifier ID
classifier_id = f"enhanced-classifier-sample-{datetime.now().strftime('%Y%m%d-%H%M%S')}-{uuid.uuid4().hex[:8]}"

# Create the enhanced classifier schema
enhanced_classifier_schema = create_enhanced_classifier_schema(
    analyzer_id=analyzer_id,
    description=f"Custom classifier for URL classification demo: {classifier_id}",
    tags={"demo_type": "url_classification"}
)

# Create the enhanced classifier only if the custom analyzer was created successfully
if analyzer_id:
    poller = await client.content_classifiers.begin_create_or_replace(
        classifier_id=classifier_id,
        resource=enhanced_classifier_schema
    )

    # Wait for the classifier to be created
    print(f"⏳ Waiting for classifier creation to complete...")
    await poller.result()
    print(f"✅ Classifier '{classifier_id}' created successfully!")


## Process Document with Enhanced Classifier

Process the document again using the enhanced classifier.
Invoices and loan applications will now have additional fields extracted.

In [None]:
if classifier_id and analyzer_id:
    pdf_path = "../data/mixed_financial_docs.pdf"
    print(f"📄 Reading document file: {pdf_path}")
    with open(pdf_path, "rb") as pdf_file:
        pdf_content = pdf_file.read()

    # Begin binary classification operation
    print(f"🔍 Starting binary classification with classifier '{classifier_id}'...")
    classification_poller = await client.content_classifiers.begin_classify_binary(
        classifier_id=classifier_id,
        input=pdf_content,
        content_type="application/pdf",
    )

    # Wait for classification completion
    print(f"⏳ Waiting for classification to complete...")
    classification_result = await classification_poller.result()
    print(f"✅ Classification completed successfully!")
else:
    print("⚠️  Skipping enhanced classification - enhanced classifier was not created.")

## View Enhanced Results with Extracted Fields

Review the classification results alongside the fields extracted from invoices and loan application documents.
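
The full JSON output can be verbose, so a flattened view of the extracted fields is often easier to scan. The sketch below assumes each content entry carries a `fields` dictionary whose entries hold their value under a key starting with `value` (for example `valueString` or `valueNumber`). That shape is an assumption, the sample data is made up, and `flatten_fields` is a hypothetical helper, so compare it with the saved result file before relying on it.

```python
from typing import Any, Dict


def flatten_fields(content: Dict[str, Any]) -> Dict[str, Any]:
    """Map each field name to its first 'value*' entry, or None if the field has no value."""
    flat: Dict[str, Any] = {}
    for name, field in content.get("fields", {}).items():
        value_keys = [key for key in field if key.startswith("value")]
        flat[name] = field[value_keys[0]] if value_keys else None
    return flat


# Hypothetical content entry; not real service output.
sample_content = {
    "category": "Loan application",
    "fields": {
        "ApplicantName": {"type": "string", "valueString": "Contoso Ltd."},
        "LoanAmountRequested": {"type": "number", "valueNumber": 50000},
        "CreditScore": {"type": "number"},
    },
}
flat_fields = flatten_fields(sample_content)
```

Fields the analyzer could not fill come back as `None`, which makes missing values easy to spot.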

In [None]:
# Display classification results
print(f"📊 Classification Results: {json.dumps(classification_result.as_dict(), indent=2)}")
for content in classification_result.contents:
    if hasattr(content, "classifications") and content.classifications:
        for classification in content.classifications:
            print(f"   Category: {classification.category}")
            print(f"   Confidence: {classification.confidence}")
            print(f"   Score: {classification.score}")

## Saving Classification Results
The classification result is saved to a JSON file for later analysis.

In [None]:
# Save the classification result to a file
saved_file_path = save_json_to_file(
    result=classification_result.as_dict(),
    filename_prefix="content_classifiers_classify_binary",
)
print(f"💾 Classification result saved to: {saved_file_path}")

## Clean up the created analyzer
Delete the analyzer once the demo completes to prevent resource accumulation.

In [None]:
# Clean up the created analyzer (demo cleanup)
print(f"🗑️  Deleting analyzer '{analyzer_id}' (demo cleanup)...")
await client.content_analyzers.delete(analyzer_id=analyzer_id)
print(f"✅ Analyzer '{analyzer_id}' deleted successfully!")

## Clean up the created classifier
Delete the enhanced classifier once the demo completes to prevent resource accumulation.

In [None]:
# Clean up the created classifier (demo cleanup)
print(f"🗑️  Deleting classifier '{classifier_id}' (demo cleanup)...")
await client.content_classifiers.delete(classifier_id=classifier_id)
print(f"✅ Classifier '{classifier_id}' deleted successfully!")