# Azure AI Content Understanding - Classifier and Analyzer Demo

This notebook demonstrates how to use the Azure AI Content Understanding service to:
1. Create a classifier for document categorization
2. Create a custom analyzer to extract specific fields
3. Combine the classifier and analyzers to classify, optionally split, and analyze documents within a flexible processing pipeline

For more detailed information before getting started, please refer to the official documentation:
[Understanding Classifiers in Azure AI Services](https://learn.microsoft.com/en-us/azure/ai-services/content-understanding/concepts/classifier)

## Prerequisites
1. Ensure the Azure AI service is configured by following the [setup steps](../README.md#configure-azure-ai-service-resource).
2. Install the required packages to run this sample.

In [None]:
%pip install -r ../requirements.txt

## 1. Import Required Libraries

In [None]:
import json
import logging
import os
import sys
import uuid
from pathlib import Path

from dotenv import find_dotenv, load_dotenv
from azure.identity import DefaultAzureCredential, get_bearer_token_provider

load_dotenv(find_dotenv())
logging.basicConfig(level=logging.INFO)

print("✅ Libraries imported successfully!")

## 2. Import Azure Content Understanding Client

The `AzureContentUnderstandingClient` class manages all API interactions with the Azure AI service.

In [None]:
# Add the parent directory to the system path to access shared modules
parent_dir = Path.cwd().parent
sys.path.append(str(parent_dir))
try:
    from python.content_understanding_client import AzureContentUnderstandingClient
    print("✅ Azure Content Understanding Client imported successfully!")
except ImportError:
    print("❌ Error: Ensure 'python/content_understanding_client.py' exists in the parent directory of this notebook's folder.")
    raise

## 3. Configure Azure AI Service Settings and Prepare the Sample

Update the following settings to match your Azure environment:

- **AZURE_AI_ENDPOINT**: Your Azure AI service endpoint URL, or configure it in the ".env" file
- **AZURE_AI_API_VERSION**: Azure AI API version to use. Defaults to "2025-05-01-preview"
- **AZURE_AI_API_KEY**: Your Azure AI API key (optional if using token-based authentication)
- **ANALYZER_SAMPLE_FILE**: Path to the PDF document you want to process
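
The settings above can be checked before any request is sent. The following is a minimal sketch, assuming the same environment variable name and sample path used in this notebook; `check_settings` is a hypothetical helper and is not part of the sample client.

```python
import os
from pathlib import Path


def check_settings(env, sample_file):
    """Return a list of readable problems; an empty list means the settings look usable."""
    problems = []
    if not env.get("AZURE_AI_ENDPOINT"):
        problems.append("AZURE_AI_ENDPOINT is not set")
    if not Path(sample_file).is_file():
        problems.append(f"Sample file not found: {sample_file}")
    return problems


# Hypothetical usage with this notebook's default sample path
for problem in check_settings(os.environ, "../data/mixed_financial_docs.pdf"):
    print(f"Warning: {problem}")
```

Returning a list instead of raising on the first problem lets every misconfiguration be reported in one pass.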

In [None]:
# Authentication supports either token-based or subscription key methods; only one is required
AZURE_AI_ENDPOINT = os.getenv("AZURE_AI_ENDPOINT")
# IMPORTANT: Substitute with your subscription key or configure in ".env" if not using token auth
AZURE_AI_API_KEY = os.getenv("AZURE_AI_API_KEY")
AZURE_AI_API_VERSION = os.getenv("AZURE_AI_API_VERSION", "2025-05-01-preview")
ANALYZER_SAMPLE_FILE = "../data/mixed_financial_docs.pdf"  # Update this path to your PDF file

# Use DefaultAzureCredential for token-based authentication
credential = DefaultAzureCredential()
token_provider = get_bearer_token_provider(credential, "https://cognitiveservices.azure.com/.default")

file_location = Path(ANALYZER_SAMPLE_FILE)

print("📋 Configuration Summary:")
print(f"   Endpoint: {AZURE_AI_ENDPOINT}")
print(f"   API Version: {AZURE_AI_API_VERSION}")
print(f"   Document: {file_location.name if file_location.exists() else '❌ File not found'}")

## 4. Define Classifier Schema

The classifier schema defines:
- **Categories**: Document types to classify (e.g., Legal, Medical)
  - **description (Optional)**: Provides additional context or hints for categorizing or splitting documents. Useful when the category name alone is not sufficiently descriptive. Omit if the category name is self-explanatory.
- **splitMode Options**: Determines how multi-page documents are split before classification or analysis.
  - `"auto"`: Automatically split based on content.  
    For example, given categories “invoice” and “application form”:
      - A PDF with one invoice will be classified as a single document.
      - A PDF containing two invoices and one application form will be automatically split into three classified sections.
  - `"none"`: No splitting.  
    The entire multi-page document is treated as one unit for classification and analysis.
  - `"perPage"`: Split by page.  
    Treats each page as a separate document, which is useful when custom analyzers are designed to operate at the page level.
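
To compare the three modes on the same categories, one schema can be copied with only `splitMode` changed. This is a sketch under that assumption; `with_split_mode` is a hypothetical helper, and the allowed values are the three listed above.

```python
import copy

# The three splitMode values described above
SPLIT_MODES = ("auto", "none", "perPage")


def with_split_mode(schema, mode):
    """Return a deep copy of a classifier schema with a different splitMode."""
    if mode not in SPLIT_MODES:
        raise ValueError(f"splitMode must be one of {SPLIT_MODES}, got {mode!r}")
    updated = copy.deepcopy(schema)  # leave the original schema untouched
    updated["splitMode"] = mode
    return updated


# Illustrative schema; descriptions are omitted because they are optional
base_schema = {
    "categories": {"Invoice": {}, "Application form": {}},
    "splitMode": "auto",
}
per_page_schema = with_split_mode(base_schema, "perPage")
print(per_page_schema["splitMode"])  # perPage
print(base_schema["splitMode"])      # auto
```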

In [None]:
# Define document categories and their descriptions
classifier_schema = {
    "categories": {
        "Loan application": {  # Both spaces and underscores are supported in category names
            "description": "Documents submitted by individuals or businesses to request funding, typically including personal or business details, financial history, loan amount, purpose, and supporting documentation."
        },
        "Invoice": {
            "description": "Billing documents issued by sellers or service providers to request payment for goods or services, detailing items, prices, taxes, totals, and payment terms."
        },
        "Bank_Statement": {  # Both spaces and underscores are supported
            "description": "Official statements issued by banks summarizing account activity over a period, including deposits, withdrawals, fees, and balances."
        },
    },
    "splitMode": "auto"  # IMPORTANT: Automatically detect document boundaries; adjust as needed.
}

print("📄 Classifier Categories:")
for category, details in classifier_schema["categories"].items():
    print(f"   • {category}: {details['description'][:60]}...")

## 5. Initialize Content Understanding Client

Create the client to interact with Azure AI services.

⚠️ Important:
Please update the authentication details below to match your Azure setup.
Look for the `# IMPORTANT` comments and modify those sections accordingly.
Skipping this step may result in runtime errors.

⚠️ Note: Subscription key authentication works, but a Microsoft Entra ID (formerly Azure Active Directory) token provider is more secure and is recommended for production.
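
One way to avoid commenting lines in and out is to pick the authentication argument in a small helper. The sketch below assumes only the two keyword names that appear in the client call in this notebook (`token_provider` and `subscription_key`); `build_auth_kwargs` itself is hypothetical.

```python
def build_auth_kwargs(token_provider=None, subscription_key=None):
    """Pick exactly one authentication keyword argument, preferring the token provider."""
    if token_provider is not None:
        return {"token_provider": token_provider}
    if subscription_key:
        return {"subscription_key": subscription_key}
    raise ValueError("Provide either a token provider or a subscription key")


# Stand-in token provider for illustration only; the notebook builds the real
# one with get_bearer_token_provider.
def fake_provider():
    return "<token>"


print(sorted(build_auth_kwargs(token_provider=fake_provider)))   # ['token_provider']
print(sorted(build_auth_kwargs(subscription_key="<your-key>")))  # ['subscription_key']
```

The returned dictionary could then be unpacked into the client constructor with `**`. To use key authentication with this helper, pass only `subscription_key`.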

In [None]:
# Initialize the Azure Content Understanding client
try:
    content_understanding_client = AzureContentUnderstandingClient(
        endpoint=AZURE_AI_ENDPOINT,
        api_version=AZURE_AI_API_VERSION,
        # IMPORTANT: Comment out token_provider if using subscription key
        token_provider=token_provider,
        # IMPORTANT: Uncomment this if using subscription key
        # subscription_key=AZURE_AI_API_KEY,
    )
    print("✅ Content Understanding client initialized successfully!")
    print("   Ready to create classifiers and analyzers.")
except Exception as e:
    print(f"❌ Failed to initialize client: {e}")
    raise

## 6. Create a Basic Classifier

First, create a simple classifier that categorizes documents without performing additional analysis.

In [None]:
# Generate a unique classifier ID
classifier_id = "classifier-sample-" + str(uuid.uuid4())

try:
    # Create the classifier
    print(f"🔨 Creating classifier: {classifier_id}")
    print("   This may take a few seconds...")
    
    response = content_understanding_client.begin_create_classifier(classifier_id, classifier_schema)
    result = content_understanding_client.poll_result(response)
    
    print("\n✅ Classifier created successfully!")
    print(f"   Status: {result.get('status')}")
    print(f"   Resource Location: {result.get('resourceLocation')}")
    
except Exception as e:
    print(f"\n❌ Error creating classifier: {e}")
    if "already exists" in str(e):
        print("\n💡 Tip: The classifier already exists. You can:")
        print("   1. Use a different classifier ID")
        print("   2. Delete the existing classifier first")
        print("   3. Skip to document classification")

## 7. Classify Your Document

Now, use the classifier to categorize your document.

In [None]:
result = None  # Reset so a failed run does not reuse the classifier-creation result
try:
    # Verify that the document exists
    if not file_location.exists():
        raise FileNotFoundError(f"Document not found at {file_location}")
    
    # Classify the document
    print(f"📄 Classifying document: {file_location.name}")
    print("\n⏳ Processing... This may take several minutes for large documents.")
    
    response = content_understanding_client.begin_classify(classifier_id, file_location=str(file_location))
    result = content_understanding_client.poll_result(response, timeout_seconds=360)
    
    print("\n✅ Classification completed successfully!")
    
except FileNotFoundError:
    print(f"\n❌ Document not found: {file_location}")
    print("   Please update 'ANALYZER_SAMPLE_FILE' to point to your PDF file.")
except Exception as e:
    print(f"\n❌ Error classifying document: {e}")

## 8. View Classification Results

Review the classification results generated for your document.
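
When a file is split into many sections, grouping page ranges by category can be easier to read than a flat list. The sketch below assumes the response shape that this notebook reads (`result` → `contents`, each entry with `category`, `startPageNumber`, and `endPageNumber`); `pages_by_category` is a hypothetical helper and the sample data is made up.

```python
from collections import defaultdict


def pages_by_category(result):
    """Map each category to a list of (start_page, end_page) tuples."""
    contents = result.get("result", {}).get("contents", [])
    grouped = defaultdict(list)
    for content in contents:
        category = content.get("category", "Unknown")
        grouped[category].append(
            (content.get("startPageNumber"), content.get("endPageNumber"))
        )
    return dict(grouped)


# Made-up response used only to illustrate the shape; not real service output
sample_result = {
    "result": {
        "contents": [
            {"category": "Invoice", "startPageNumber": 1, "endPageNumber": 1},
            {"category": "Bank_Statement", "startPageNumber": 2, "endPageNumber": 3},
            {"category": "Invoice", "startPageNumber": 4, "endPageNumber": 4},
        ]
    }
}

for category, ranges in pages_by_category(sample_result).items():
    print(f"{category}: {ranges}")
```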

In [None]:
# Display classification results
if 'result' in locals() and result:
    result_data = result.get("result", {})
    contents = result_data.get("contents", [])
    
    print("📊 CLASSIFICATION RESULTS")
    print("=" * 50)
    print(f"\nTotal sections found: {len(contents)}")
    
    # Summarize each classified section
    print("\n📋 Document Sections:")
    for i, content in enumerate(contents, 1):
        print(f"\n   Section {i}:")
        print(f"   • Category: {content.get('category', 'Unknown')}")
        print(f"   • Pages: {content.get('startPageNumber', '?')} - {content.get('endPageNumber', '?')}")
        
    print("\nFull result output:")
    print(json.dumps(result, indent=2))
else:
    print("❌ No results available. Please run the classification step first.")

## 9. Create a Custom Analyzer (Advanced)

Create a custom analyzer to extract specific fields from documents.
This example extracts common fields from loan application documents and generates a brief summary of each application.
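
Every field in this example uses the same `"method": "generate"` setting, so the field definitions can be built from a compact list. This is a sketch of that idea only; `make_generated_fields` is a hypothetical helper, and the keys it emits (`type`, `method`, `description`) are the ones used in the schema in this section.

```python
def make_generated_fields(specs):
    """Build a fieldSchema 'fields' mapping from (name, type, description) tuples."""
    return {
        name: {"type": field_type, "method": "generate", "description": description}
        for name, field_type, description in specs
    }


fields = make_generated_fields([
    ("ApplicantName", "string", "Full name of the loan applicant or company."),
    ("LoanAmountRequested", "number", "The total loan amount requested by the applicant."),
])
print(list(fields))                     # ['ApplicantName', 'LoanAmountRequested']
print(fields["ApplicantName"]["type"])  # string
```

The resulting mapping could be placed under `"fieldSchema": {"fields": ...}` in an analyzer schema.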

In [None]:
# Define the analyzer schema with custom fields
analyzer_schema = {
    "description": "Loan application analyzer - extracts key information from loan applications",
    "baseAnalyzerId": "prebuilt-documentAnalyzer",  # Built on top of the general document analyzer
    "config": {
        "returnDetails": True,
        "enableLayout": True,          # Extract layout details
        "enableBarcode": False,        # Disable barcode detection
        "enableFormula": False,        # Disable formula detection
        "estimateFieldSourceAndConfidence": True, # Enable estimation of field location and confidence
        "disableContentFiltering": False
    },
    "fieldSchema": {
        "fields": {
            "ApplicationDate": {
                "type": "date",
                "method": "generate",
                "description": "The date when the loan application was submitted."
            },
            "ApplicantName": {
                "type": "string",
                "method": "generate",
                "description": "Full name of the loan applicant or company."
            },
            "LoanAmountRequested": {
                "type": "number",
                "method": "generate",
                "description": "The total loan amount requested by the applicant."
            },
            "LoanPurpose": {
                "type": "string",
                "method": "generate",
                "description": "The stated purpose or reason for the loan."
            },
            "CreditScore": {
                "type": "number",
                "method": "generate",
                "description": "Credit score of the applicant, if available."
            },
            "Summary": {
                "type": "string",
                "method": "generate",
                "description": "A brief summary overview of the loan application details."
            }
        }
    }
}

# Generate a unique analyzer ID
analyzer_id = "analyzer-loan-application-" + str(uuid.uuid4())

# Create the custom analyzer
try:
    print(f"🔨 Creating custom analyzer: {analyzer_id}")
    print("\n📋 The analyzer will extract the following fields:")
    for field_name, field_info in analyzer_schema["fieldSchema"]["fields"].items():
        print(f"   • {field_name}: {field_info['description']}")
    
    response = content_understanding_client.begin_create_analyzer(analyzer_id, analyzer_schema)
    result = content_understanding_client.poll_result(response)
    
    print("\n✅ Analyzer created successfully!")
    print(f"   Analyzer ID: {analyzer_id}")
    
except Exception as e:
    print(f"\n❌ Error creating analyzer: {e}")
    analyzer_id = None  # Set to None if creation failed

## 10. Create an Enhanced Classifier with Custom Analyzer

Now create a new classifier that uses the prebuilt invoice analyzer for invoices and the custom analyzer for loan application documents.
This combines document classification with field extraction in one operation.
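
Before creating the classifier, it can help to list which analyzer each category routes to and to spot categories with no analyzer attached. The sketch below assumes the schema layout used in this notebook (`categories` → category name → optional `analyzerId`); `routing_table` is a hypothetical helper and `my-loan-analyzer` is a placeholder ID.

```python
def routing_table(schema):
    """Map each category name to its analyzerId, or None when no analyzer is attached."""
    return {
        name: details.get("analyzerId")
        for name, details in schema["categories"].items()
    }


# Illustrative schema; "my-loan-analyzer" is a placeholder, not a real analyzer ID
sample_schema = {
    "categories": {
        "Loan application": {"analyzerId": "my-loan-analyzer"},
        "Invoice": {"analyzerId": "prebuilt-invoice"},
        "Bank_Statement": {},
    },
    "splitMode": "auto",
}

for category, analyzer in routing_table(sample_schema).items():
    print(f"{category} -> {analyzer or 'no analyzer (classification only)'}")
```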

In [None]:
# Generate a unique enhanced classifier ID
enhanced_classifier_id = "classifier-enhanced-" + str(uuid.uuid4())

# Define the enhanced classifier schema
enhanced_classifier_schema = {
    "categories": {
        "Loan application": {  # Both spaces and underscores allowed
            "description": "Documents submitted by individuals or businesses requesting funding, including personal/business details, financial history, and supporting documents.",
            "analyzerId": analyzer_id  # IMPORTANT: Use the custom analyzer created previously for loan applications
        },
        "Invoice": {
            "description": "Billing documents issued by sellers or service providers requesting payment for goods or services, detailing items, prices, taxes, totals, and payment terms.",
            "analyzerId": "prebuilt-invoice"  # Use prebuilt invoice analyzer for invoices
        },
        "Bank_Statement": {  # Both spaces and underscores allowed
            "description": "Official bank statements summarizing account activity over a period, including deposits, withdrawals, fees, and balances."
            # No analyzer specified - uses default processing
        }
    },
    "splitMode": "auto"
}

# Create the enhanced classifier only if the custom analyzer was created successfully
if analyzer_id:
    try:
        print(f"🔨 Creating enhanced classifier: {enhanced_classifier_id}")
        print("\n📋 Configuration:")
        print("   • Loan application documents → Custom analyzer with field extraction")
        print("   • Invoice documents → Prebuilt invoice analyzer")
        print("   • Bank_Statement documents → Standard processing")
        
        response = content_understanding_client.begin_create_classifier(enhanced_classifier_id, enhanced_classifier_schema)
        result = content_understanding_client.poll_result(response)
        
        print("\n✅ Enhanced classifier created successfully!")
        
    except Exception as e:
        print(f"\n❌ Error creating enhanced classifier: {e}")
else:
    print("⚠️  Skipping enhanced classifier creation - custom analyzer was not created successfully.")

## 11. Process Document with Enhanced Classifier

Process the document again using the enhanced classifier.
Invoices and loan applications will now have additional fields extracted.

In [None]:
if 'enhanced_classifier_id' in locals() and analyzer_id:
    try:
        # Verify the document exists
        if not file_location.exists():
            raise FileNotFoundError(f"Document not found at {file_location}")
        
        # Process document with enhanced classifier
        print("📄 Processing document with enhanced classifier")
        print(f"   Document: {file_location.name}")
        print("\n⏳ Processing with classification and field extraction...")
        
        response = content_understanding_client.begin_classify(enhanced_classifier_id, file_location=str(file_location))
        enhanced_result = content_understanding_client.poll_result(response, timeout_seconds=360)
        
        print("\n✅ Enhanced processing completed!")
        
    except Exception as e:
        print(f"\n❌ Error processing document: {e}")
else:
    print("⚠️  Skipping enhanced classification - enhanced classifier was not created.")

## 12. View Enhanced Results with Extracted Fields

Review the classification results alongside the fields extracted from loan application and invoice documents.
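
Each entry under `fields` is typically a dictionary and not a bare value. The sketch below assumes that a field carries its content under a key beginning with `value` (for example `valueString` or `valueNumber`); that shape is an assumption here and should be checked against the full JSON output. `field_value` is a hypothetical helper and the sample data is made up.

```python
def field_value(field_data):
    """Return the first 'value*' entry of a field dict, or None if there is none."""
    if not isinstance(field_data, dict):
        return field_data  # already a plain value
    for key, value in field_data.items():
        if key.startswith("value"):
            return value
    return None


# Made-up fields used only to illustrate the assumed shape; not real service output
sample_fields = {
    "ApplicantName": {"type": "string", "valueString": "Jane Doe"},
    "LoanAmountRequested": {"type": "number", "valueNumber": 25000},
    "CreditScore": {"type": "number"},  # nothing extracted for this field
}

for name, data in sample_fields.items():
    print(f"{name}: {field_value(data)}")
```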

In [None]:
# Display enhanced classification results
if 'enhanced_result' in locals() and enhanced_result:
    result_data = enhanced_result.get("result", {})
    contents = result_data.get("contents", [])
    
    print("📊 ENHANCED CLASSIFICATION RESULTS")
    print("=" * 70)
    print(f"\nTotal sections found: {len(contents)}")
    
    # Iterate through each document section
    for i, content in enumerate(contents, 1):
        print(f"\n{'='*70}")
        print(f"SECTION {i}")
        print(f"{'='*70}")
        
        category = content.get('category', 'Unknown')
        print(f"\n📁 Category: {category}")
        print(f"📄 Pages: {content.get('startPageNumber', '?')} - {content.get('endPageNumber', '?')}")
        
        # Display extracted fields if available
        fields = content.get('fields', {})
        if fields:
            print("\n🔍 Extracted Information:")
            for field_name, field_data in fields.items():
                print(f"\n   {field_name}:")
                print(f"   • Value: {field_data}")
else:
    print("❌ No enhanced results available. Please run the enhanced classification step first.")

You can also view the full JSON result below.

In [None]:
print(json.dumps(enhanced_result, indent=2))

## Summary and Next Steps

Congratulations! You have successfully:
1. ✅ Created a basic classifier to categorize documents
2. ✅ Created a custom analyzer to extract specific fields
3. ✅ Combined them into an enhanced classifier for intelligent document processing