# Azure AI Content Understanding - Classifier and Analyzer Demo

This notebook demonstrates how to use the Azure AI Content Understanding service to:
1. Create a classifier to categorize documents
2. Create a custom analyzer to extract specific fields
3. Combine classifier and analyzer for intelligent document processing

## Prerequisites
- Azure subscription with access to Azure AI services
- Python 3.8 or higher
- A PDF document for testing (sample included)


In [None]:
%pip install -r requirements.txt

## 1. Import Required Libraries

In [None]:
import json
import logging
import os
import sys
import uuid
from pathlib import Path

from dotenv import find_dotenv, load_dotenv
from azure.identity import DefaultAzureCredential, get_bearer_token_provider

load_dotenv(find_dotenv())
logging.basicConfig(level=logging.INFO)

print("✅ Libraries imported successfully!")

## 2. Import Azure Content Understanding Client

The `AzureContentUnderstandingClient` class handles all API interactions with the Azure AI service.

In [None]:
# Add the parent directory to the path to use shared modules
parent_dir = Path.cwd().parent
sys.path.append(str(parent_dir))
try:
    from python.content_understanding_client import AzureContentUnderstandingClient
    print("✅ Azure Content Understanding Client imported successfully!")
except ImportError:
    print("❌ Error: Make sure 'python/content_understanding_client.py' exists in the parent directory of this notebook.")
    raise

## 3. Configure Azure AI Service Settings

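Update these settings to match your Azure environment:

- **AZURE_AI_ENDPOINT**: Your Azure AI service endpoint URL, read from the environment (for example, from a `.env` file)
- **AZURE_AI_API_VERSION**: The API version to call; defaults to `2025-05-01-preview`
- **subscription_key**: Your subscription key (optional if using token authentication)
- **ANALYZER_SAMPLE_FILE**: Path to the PDF document you want to process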
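
A missing endpoint otherwise only surfaces later as a failed request, so it can help to validate the settings up front. The following is a minimal sketch, assuming the environment variable names used in the cell below; `load_settings` is a hypothetical helper for illustration and is not part of the sample client.

```python
import os


def load_settings(env=None):
    """Read endpoint settings from a mapping (defaults to os.environ).

    Hypothetical helper for illustration; not part of the sample client.
    """
    env = os.environ if env is None else env
    endpoint = (env.get("AZURE_AI_ENDPOINT") or "").strip()
    if not endpoint:
        raise ValueError("AZURE_AI_ENDPOINT is not set; add it to your .env file.")
    return {
        # Strip a trailing slash so URL joins stay predictable
        "endpoint": endpoint.rstrip("/"),
        "api_version": env.get("AZURE_AI_API_VERSION", "2025-05-01-preview"),
    }
```

Calling `load_settings()` before creating the client fails fast with a readable message instead of an HTTP error several cells later.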

In [None]:
AZURE_AI_ENDPOINT = os.getenv("AZURE_AI_ENDPOINT")
AZURE_AI_API_VERSION = os.getenv("AZURE_AI_API_VERSION", "2025-05-01-preview")
ANALYZER_SAMPLE_FILE = '../data/MS_Annual_Report_2024.pdf' # Update this path to your PDF file

file_location = Path(ANALYZER_SAMPLE_FILE)

# Authenticate with either token-based auth or a subscription key; only one is required

# Authentication - Using DefaultAzureCredential for token-based auth
credential = DefaultAzureCredential()
token_provider = get_bearer_token_provider(credential, "https://cognitiveservices.azure.com/.default")

# IMPORTANT: Replace with your actual subscription key if not using token auth
subscription_key = "dummy_key"

print("📋 Configuration Summary:")
print(f"   Endpoint: {AZURE_AI_ENDPOINT}")
print(f"   API Version: {AZURE_AI_API_VERSION}")
print(f"   Document: {file_location.name if file_location.exists() else '❌ File not found'}")

## 4. Define Classifier Schema

The classifier schema defines:
- **Categories**: Document types to classify (in this notebook, the sections of an annual report)
- **Split Mode**: How to split multi-page documents
  - `"auto"`: Automatically split based on content
  - `"none"`: Don't split
  - `"perPage"`: Split every page
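
A quick local check can catch typos in the schema before it is sent to the service. This is a minimal sketch based only on the structure described above; `validate_classifier_schema` is a hypothetical helper, and treating `splitMode` as a required key is an assumption made for this notebook rather than a documented service rule.

```python
# The three split modes listed above
VALID_SPLIT_MODES = {"auto", "none", "perPage"}


def validate_classifier_schema(schema):
    """Return a list of problems found in a classifier schema (empty if none).

    Hypothetical helper for illustration; the service performs its own validation.
    """
    problems = []
    categories = schema.get("categories")
    if not isinstance(categories, dict) or not categories:
        problems.append("'categories' must be a non-empty dict")
        categories = {}
    for name, details in categories.items():
        # Each category needs a description for the classifier to work from
        if not isinstance(details, dict) or not details.get("description"):
            problems.append(f"category '{name}' has no description")
    split_mode = schema.get("splitMode")
    if split_mode not in VALID_SPLIT_MODES:
        problems.append(f"unknown splitMode: {split_mode!r}")
    return problems
```

An empty list means the schema has the expected shape; anything else can be printed before calling the service.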

In [None]:
# Define document categories and their descriptions
classifier_schema = {
    "categories": {
        "Executive Summary & Strategy": {
            "description": "Leadership messages, strategic vision, and future outlook."
        },
        "Financial Performance & Segment Reporting": {
            "description": "Overall financial results and detailed performance by business units."
        },
        "Operations & Corporate Governance": {
            "description": "Business operations, governance structure, and risk management."
        },
        "Shareholder Information & Relations": {
            "description": "Annual meeting details, stock information, and shareholder services."
        }
    },
    "splitMode": "auto"  # IMPORTANT: Automatically detect document boundaries
}

print("📄 Classifier Categories:")
for category, details in classifier_schema["categories"].items():
    print(f"   • {category}: {details['description'][:60]}...")

## 5. Initialize Content Understanding Client

Create the client that will communicate with Azure AI services.

In [None]:
# Initialize the Azure Content Understanding client
try:
    content_understanding_client = AzureContentUnderstandingClient(
        endpoint=AZURE_AI_ENDPOINT,
        api_version=AZURE_AI_API_VERSION,
        # IMPORTANT: Comment out token_provider if using subscription key
        token_provider=token_provider,
        # IMPORTANT: Uncomment this if using subscription key
        # subscription_key=subscription_key
    )
    print("✅ Content Understanding client initialized successfully!")
    print("   Ready to create classifiers and analyzers.")
except Exception as e:
    print(f"❌ Failed to initialize client: {e}")
    raise

## 6. Create a Basic Classifier

First, we'll create a simple classifier that categorizes documents without additional analysis.
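
Creation is a long-running operation: the request starts the work, and the result is fetched by polling. The sketch below shows the general polling pattern under stated assumptions. `poll_until_done` and its parameters are hypothetical, the status strings `succeeded` and `failed` are assumed, and this is not the implementation of the sample client's `poll_result`.

```python
import time


def poll_until_done(get_status, timeout_seconds=120, interval_seconds=2,
                    sleep=time.sleep, clock=time.monotonic):
    """Call get_status() until it reports a terminal state or the timeout passes.

    get_status is any callable returning a dict with a 'status' key.
    The status names checked here are assumptions for illustration.
    """
    deadline = clock() + timeout_seconds
    while True:
        result = get_status()
        status = str(result.get("status", "")).lower()
        if status == "succeeded":
            return result
        if status == "failed":
            raise RuntimeError(f"Operation failed: {result}")
        if clock() >= deadline:
            raise TimeoutError(f"Operation did not finish within {timeout_seconds} seconds")
        # Wait before asking again to avoid hammering the service
        sleep(interval_seconds)
```

Passing `sleep` and `clock` as parameters keeps the loop testable without real waiting.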

In [None]:
# Generate unique classifier ID
classifier_id = "classifier-sample-" + str(uuid.uuid4())

try:
    # Create classifier
    print(f"🔨 Creating classifier: {classifier_id}")
    print("   This may take a few seconds...")
    
    response = content_understanding_client.begin_create_classifier(classifier_id, classifier_schema)
    result = content_understanding_client.poll_result(response)
    
    print("\n✅ Classifier created successfully!")
    print(f"   Status: {result.get('status')}")
    print(f"   Resource Location: {result.get('resourceLocation')}")
    
except Exception as e:
    print(f"\n❌ Error creating classifier: {e}")
    if "already exists" in str(e):
        print("\n💡 Tip: The classifier already exists. You can:")
        print("   1. Use a different classifier ID")
        print("   2. Delete the existing classifier first")
        print("   3. Skip to document classification")

## 7. Classify Your Document

Now let's use the classifier to categorize your document.

In [None]:
try:
    # Check if document exists
    if not file_location.exists():
        raise FileNotFoundError(f"Document not found at {file_location}")
    
    # Classify document
    print(f"📄 Classifying document: {file_location.name}")
    print("\n⏳ Processing... This may take a few minutes for large documents.")
    
    response = content_understanding_client.begin_classify(classifier_id, file_location=str(file_location))
    result = content_understanding_client.poll_result(response, timeout_seconds=360)
    
    print("\n✅ Classification completed successfully!")
    
except FileNotFoundError:
    print(f"\n❌ Document not found: {file_location}")
    print("   Please update file_location to point to your PDF file.")
except Exception as e:
    print(f"\n❌ Error classifying document: {e}")

## 8. View Classification Results

Let's examine what the classifier found in your document.
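
When a document is split into many sections, grouping the page ranges by category gives a more compact view. A minimal sketch, assuming the result shape read by the cell below (`result` → `contents`, each with `category`, `startPageNumber`, and `endPageNumber`); `pages_by_category` is a hypothetical helper and the sample input is illustrative, not real service output.

```python
from collections import defaultdict


def pages_by_category(result):
    """Map each category to a list of (start_page, end_page) tuples."""
    contents = result.get("result", {}).get("contents", [])
    grouped = defaultdict(list)
    for content in contents:
        category = content.get("category", "Unknown")
        grouped[category].append(
            (content.get("startPageNumber"), content.get("endPageNumber"))
        )
    return dict(grouped)


# Illustrative input only; not real service output
sample_result = {
    "result": {
        "contents": [
            {"category": "Executive Summary & Strategy", "startPageNumber": 1, "endPageNumber": 3},
            {"category": "Financial Performance & Segment Reporting", "startPageNumber": 4, "endPageNumber": 10},
            {"category": "Financial Performance & Segment Reporting", "startPageNumber": 15, "endPageNumber": 18},
        ]
    }
}

for category, ranges in pages_by_category(sample_result).items():
    print(category, ranges)
```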

In [None]:
# Display classification results
if 'result' in locals() and result:
    result_data = result.get("result", {})
    contents = result_data.get("contents", [])
    
    print("📊 CLASSIFICATION RESULTS")
    print("=" * 50)
    print(f"\nTotal sections found: {len(contents)}")
    
    # Show summary of each classified section
    print("\n📋 Document Sections:")
    for i, content in enumerate(contents, 1):
        print(f"\n   Section {i}:")
        print(f"   • Category: {content.get('category', 'Unknown')}")
        print(f"   • Pages: {content.get('startPageNumber', '?')} - {content.get('endPageNumber', '?')}")
        
    print("\nFull result:")
    print(json.dumps(result, indent=2))
else:
    print("❌ No results available. Please run the classification step first.")

## 9. Create a Custom Analyzer (Advanced)

Now let's create a custom analyzer that can extract specific fields from documents.
This analyzer will:
- Extract the report date, company name, fiscal year, and net income from annual reports
- Generate a brief summary of the report
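
Every field in the schema below repeats the same three keys (`type`, `method`, `description`), so a small builder can cut the repetition. This is a sketch under the assumption that all fields share one `method`; `build_field_schema` is a hypothetical helper, not part of the sample client.

```python
def build_field_schema(specs, method="generate"):
    """Build a fieldSchema dict from (name, type, description) tuples.

    Hypothetical helper for illustration; it mirrors the structure of the
    analyzer schema defined in this notebook.
    """
    fields = {}
    for name, field_type, description in specs:
        if name in fields:
            raise ValueError(f"duplicate field name: {name}")
        fields[name] = {
            "type": field_type,
            "method": method,
            "description": description,
        }
    return {"fields": fields}


field_schema = build_field_schema([
    ("CompanyName", "string", "The name of the company issuing the report."),
    ("NetIncome", "number", "Net income or profit reported for the fiscal year."),
])
```

The returned dict can be assigned to the `fieldSchema` key of an analyzer schema.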

In [None]:
# Define analyzer schema with custom fields
analyzer_schema = {
    "description": "Annual report analyzer - extracts key information from company annual reports",
    "baseAnalyzerId": "prebuilt-documentAnalyzer",  # Built on top of the general document analyzer
    "config": {
        "returnDetails": True,
        "enableLayout": True,          # Extract layout information
        "enableBarcode": False,        # Skip barcode detection
        "enableFormula": False,        # Skip formula detection
        "estimateFieldSourceAndConfidence": False, # Set to True if you want to estimate the field location (aka grounding) and confidence
        "disableContentFiltering": False,
    },
    "fieldSchema": {
        "fields": {
            "ReportDate": {
                "type": "date",
                "method": "generate",
                "description": "The publication or filing date of the annual report."
            },
            "CompanyName": {
                "type": "string",
                "method": "generate",
                "description": "The name of the company issuing the report."
            },
            "FiscalYear": {
                "type": "string",
                "method": "generate",
                "description": "The fiscal year the report covers."
            },
            "NetIncome": {
                "type": "number",
                "method": "generate",
                "description": "Net income or profit reported for the fiscal year."
            },
            "Summary": {
                "type": "string",
                "method": "generate",
                "description": "Brief summary of the annual report"
            }
        }
    }
}

# Generate unique analyzer ID
analyzer_id = "analyzer-report-" + str(uuid.uuid4())

# Create the analyzer
try:
    print(f"🔨 Creating custom analyzer: {analyzer_id}")
    print("\n📋 Analyzer will extract:")
    for field_name, field_info in analyzer_schema["fieldSchema"]["fields"].items():
        print(f"   • {field_name}: {field_info['description']}")
    
    response = content_understanding_client.begin_create_analyzer(analyzer_id, analyzer_schema)
    result = content_understanding_client.poll_result(response)
    
    print("\n✅ Analyzer created successfully!")
    print(f"   Analyzer ID: {analyzer_id}")
    
except Exception as e:
    print(f"\n❌ Error creating analyzer: {e}")
    analyzer_id = None  # Set to None if creation failed

## 10. Create an Enhanced Classifier with Custom Analyzer

Now we'll create a new classifier that uses our custom analyzer for annual reports.
This combines classification with field extraction in one operation.
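
Because only some categories are routed to an analyzer, it is worth checking the routing before creating the classifier, especially since `analyzer_id` is set to `None` when analyzer creation fails. A minimal sketch, assuming the `categories` and `analyzerId` structure used in this notebook; both helper names are hypothetical.

```python
def analyzer_routing(classifier_schema):
    """Map each category name to its analyzerId (None when none is configured)."""
    categories = classifier_schema.get("categories", {})
    return {name: details.get("analyzerId") for name, details in categories.items()}


def categories_missing_analyzer(classifier_schema):
    """List categories that declare an 'analyzerId' key but have no usable value."""
    categories = classifier_schema.get("categories", {})
    return [
        name
        for name, details in categories.items()
        if "analyzerId" in details and not details["analyzerId"]
    ]


# Illustrative schema; the analyzer ID is a placeholder
example_schema = {
    "categories": {
        "Legal": {"description": "Legal documents."},
        "Annual Report": {"description": "Company annual reports.", "analyzerId": None},
    },
    "splitMode": "auto",
}

print(analyzer_routing(example_schema))
print(categories_missing_analyzer(example_schema))
```

A non-empty list from `categories_missing_analyzer` signals that the analyzer step should be rerun before the enhanced classifier is created.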

In [None]:
# Generate unique enhanced classifier ID
enhanced_classifier_id = "classifier-enhanced-" + str(uuid.uuid4())

# Create enhanced classifier schema
enhanced_classifier_schema = {
    "categories": {
        "Legal": {
            "description": "Legal documents including subpoenas, declarations, contracts, and other legal paperwork."
            # No analyzer specified - uses default processing
        },
        "Annual Report": {
            "description": "Company annual reports covering financial results, strategy, governance, and shareholder information.",
            "analyzerId": analyzer_id  # IMPORTANT: Use our custom analyzer for annual reports
        },
        "Declaration_of_custodian": {
            "description": "Declarations of custodian documents, often used in legal contexts."
        }
    },
    "splitMode": "auto"
}

# Create the enhanced classifier
if analyzer_id:  # Only create if analyzer was successfully created
    try:
        print(f"🔨 Creating enhanced classifier: {enhanced_classifier_id}")
        print("\n📋 Configuration:")
        print("   • Legal documents → Standard processing")
        print("   • Annual reports → Custom analyzer with field extraction")
        
        response = content_understanding_client.begin_create_classifier(enhanced_classifier_id, enhanced_classifier_schema)
        result = content_understanding_client.poll_result(response)
        
        print("\n✅ Enhanced classifier created successfully!")
        
    except Exception as e:
        print(f"\n❌ Error creating enhanced classifier: {e}")
else:
    print("⚠️  Skipping enhanced classifier creation - analyzer was not created successfully.")

## 11. Process Document with Enhanced Classifier

Let's process the document again using our enhanced classifier.
Sections classified as annual reports will now have additional fields extracted.

In [None]:
if 'enhanced_classifier_id' in locals() and analyzer_id:
    try:
        # Check if document exists
        if not file_location.exists():
            raise FileNotFoundError(f"Document not found at {file_location}")
        
        # Process with enhanced classifier
        print("📄 Processing document with enhanced classifier")
        print(f"   Document: {file_location.name}")
        print("\n⏳ Processing with classification + field extraction...")
        
        response = content_understanding_client.begin_classify(enhanced_classifier_id, file_location=str(file_location))
        enhanced_result = content_understanding_client.poll_result(response, timeout_seconds=360)
        
        print("\n✅ Enhanced processing completed!")
        
    except Exception as e:
        print(f"\n❌ Error processing document: {e}")
else:
    print("⚠️  Skipping enhanced classification - enhanced classifier was not created.")

## 12. View Enhanced Results with Extracted Fields

Let's see the classification results along with the fields extracted from annual report sections.
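
The cell below prints each field as a raw dict. If only the values are needed, they can be flattened first. This sketch assumes each field is a dict that stores its value under a key starting with `value` (for example `valueString` or `valueNumber`); that shape is an assumption about the response, so check it against your own output. `field_value` and `flatten_fields` are hypothetical helpers.

```python
def field_value(field_data):
    """Return the first 'value*' entry of a field dict, or None if absent.

    Assumes values live under keys such as 'valueString' or 'valueNumber';
    verify this against the actual response before relying on it.
    """
    if not isinstance(field_data, dict):
        return field_data
    for key, value in field_data.items():
        if key.startswith("value"):
            return value
    return None


def flatten_fields(fields):
    """Map each field name to its plain value."""
    return {name: field_value(data) for name, data in fields.items()}


# Illustrative input only; not real service output
example_fields = {
    "CompanyName": {"type": "string", "valueString": "Example Corp"},
    "NetIncome": {"type": "number", "valueNumber": 123.0},
    "ReportDate": {"type": "date"},
}

print(flatten_fields(example_fields))
```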

In [None]:
# Display enhanced results
if 'enhanced_result' in locals() and enhanced_result:
    result_data = enhanced_result.get("result", {})
    contents = result_data.get("contents", [])
    
    print("📊 ENHANCED CLASSIFICATION RESULTS")
    print("=" * 70)
    print(f"\nTotal sections found: {len(contents)}")
    
    # Process each section
    for i, content in enumerate(contents, 1):
        print(f"\n{'='*70}")
        print(f"SECTION {i}")
        print(f"{'='*70}")
        
        category = content.get('category', 'Unknown')
        print(f"\n📁 Category: {category}")
        print(f"📄 Pages: {content.get('startPageNumber', '?')} - {content.get('endPageNumber', '?')}")
        
        # Show extracted fields for sections processed by the custom analyzer
        fields = content.get('fields', {})
        if fields:
            print("\n🔍 Extracted Information:")
            for field_name, field_data in fields.items():
                print(f"\n   {field_name}:")
                print(f"   • Value: {field_data}")
else:
    print("❌ No enhanced results available. Please run the enhanced classification step first.")

You can also see the full JSON result below.

In [None]:
if 'enhanced_result' in locals() and enhanced_result:
    print(json.dumps(enhanced_result, indent=2))

## Summary and Next Steps

Congratulations! You've successfully:
1. ✅ Created a basic classifier to categorize documents
2. ✅ Created a custom analyzer to extract specific fields
3. ✅ Combined them into an enhanced classifier for intelligent document processing