# Extract Custom Fields from Your Pre-transcribed File

This notebook demonstrates how to use analyzers to extract custom fields from your pre-transcribed input files.

## Prerequisites
1. Ensure your Azure AI service is configured by following the [configuration steps](../README.md#configure-azure-ai-service-resource).
2. Install the required packages to run the sample.

In [None]:
%pip install -r ../requirements.txt

## Analyzer Templates

Each entry below pairs an analyzer template with a sample pre-transcribed input file. All three entries use the same call recording template and differ only in the transcription source of the sample file (batch, fast, or Content Understanding).

These templates are customizable, so you can adapt them to your specific requirements. For additional verified templates provided by Microsoft, see the [analyzer_templates](../analyzer_templates/) directory.

In [None]:
extraction_templates = {
    "call_recording_pretranscribe_batch": ('../analyzer_templates/call_recording_analytics_text.json', '../data/batch_pretranscribed.json'),
    "call_recording_pretranscribe_fast": ('../analyzer_templates/call_recording_analytics_text.json', '../data/fast_pretranscribed.json'),
    "call_recording_pretranscribe_cu": ('../analyzer_templates/call_recording_analytics_text.json', '../data/cu_pretranscribed.json')
}

Select the analyzer template to use. The lookup returns both the template path and the matching sample file path. A unique analyzer name is generated later, when the analyzer is created.

In [None]:
analyzer_template = "call_recording_pretranscribe_batch"
(analyzer_template_path, analyzer_sample_file_path) = extraction_templates[analyzer_template]
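
Because the paths above are relative, they only resolve when the notebook runs from its own directory. The following is a minimal sketch of a pre-flight check; `missing_paths` is a hypothetical helper and not part of the sample repository.

```python
from pathlib import Path


def missing_paths(paths):
    """Return the entries of `paths` that do not exist on disk (hypothetical helper)."""
    return [str(p) for p in paths if not Path(p).exists()]


# The literal values mirror the "call_recording_pretranscribe_batch" entry above.
selected = (
    "../analyzer_templates/call_recording_analytics_text.json",
    "../data/batch_pretranscribed.json",
)
missing = missing_paths(selected)
if missing:
    print(f"Missing files: {missing}")
else:
    print("All template and sample files found")
```

In the notebook you could pass `(analyzer_template_path, analyzer_sample_file_path)` instead of the literal tuple.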

## Create Azure AI Content Understanding Client

> The [AzureContentUnderstandingClient](../python/content_understanding_client.py) is a utility class providing functions to interact with the Content Understanding API. Before the official release of the Content Understanding SDK, this class can be considered a lightweight SDK.

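> Set the environment variable **AZURE_CONTENT_UNDERSTANDING_ENDPOINT** and, if you use key authentication, **AZURE_CONTENT_UNDERSTANDING_KEY** (for example in a `.env` file). The code below reads both with `load_dotenv()` and `os.environ`.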

> ⚠️ Important:
Make sure to update the code below to match your chosen Azure authentication method.
Look for the `# IMPORTANT` comments and modify those sections accordingly.
Skipping this step may prevent the sample from running correctly.

> ⚠️ Note: Subscription key authentication works, but a token provider with Microsoft Entra ID (formerly Azure Active Directory) is strongly recommended for improved security in production environments.

In [None]:
import logging
import json
import os
import sys
import uuid
from dotenv import load_dotenv
from azure.storage.blob import ContainerSasPermissions
from azure.core.credentials import AzureKeyCredential
from azure.identity import DefaultAzureCredential
from azure.ai.contentunderstanding.aio import ContentUnderstandingClient
from azure.ai.contentunderstanding.models import (
    ContentAnalyzer,
    ContentAnalyzerConfig,
    FieldSchema,
    FieldDefinition,
    FieldType,
    GenerationMethod,
    AnalysisMode,
    ProcessingLocation,
)
from datetime import datetime

# Add the sibling 'python' directory to the Python path so the extension package can be imported
sys.path.append(os.path.join(os.path.dirname(os.getcwd()), 'python'))
from extension.document_processor import DocumentProcessor
from extension.sample_helper import extract_operation_id_from_poller, PollerType, save_json_to_file

load_dotenv()
logging.basicConfig(level=logging.INFO)

endpoint = os.environ.get("AZURE_CONTENT_UNDERSTANDING_ENDPOINT")
# Use AzureKeyCredential if AZURE_CONTENT_UNDERSTANDING_KEY is set, otherwise DefaultAzureCredential
key = os.getenv("AZURE_CONTENT_UNDERSTANDING_KEY")
credential = AzureKeyCredential(key) if key else DefaultAzureCredential()
# Create the ContentUnderstandingClient
client = ContentUnderstandingClient(endpoint=endpoint, credential=credential)
print("✅ ContentUnderstandingClient created successfully")

try:
    processor = DocumentProcessor(client)
    print("✅ DocumentProcessor created successfully")
except Exception as e:
    print(f"❌ Failed to create DocumentProcessor: {e}")
    raise
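
The cell above reads the endpoint with `os.environ.get`, which returns `None` when the variable is missing, so a misconfigured environment may only surface later as a less obvious client error. A minimal sketch of a fail-fast check follows; `require_env` is a hypothetical helper, not part of the sample or the SDK.

```python
import os


def require_env(name, environ=os.environ):
    """Return the value of `name`, or raise if it is unset or blank (hypothetical helper)."""
    value = environ.get(name, "").strip()
    if not value:
        raise RuntimeError(
            f"Environment variable {name} is not set; add it to your .env file"
        )
    return value


# Usage sketch: only the endpoint is required. The key stays optional because
# the cell above falls back to DefaultAzureCredential when no key is provided.
# endpoint = require_env("AZURE_CONTENT_UNDERSTANDING_ENDPOINT")
```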

## Create Analyzer from the Template

In [None]:
# Take a single timestamp so the date and time parts cannot straddle midnight
analyzer_id = f"conversational_field_extraction-sample-{datetime.now().strftime('%Y%m%d-%H%M%S')}-{uuid.uuid4().hex[:8]}"

# Create a custom analyzer using object model
print(f"🔧 Creating custom analyzer '{analyzer_id}'...")

content_analyzer = ContentAnalyzer(
    base_analyzer_id="prebuilt-audioAnalyzer",
    config=ContentAnalyzerConfig(
        return_details=True,
    ),
    description="Sample call recording analytics",
    field_schema=FieldSchema(
        fields={
            "Summary": FieldDefinition(
                description="A one-paragraph summary",
                method=GenerationMethod.GENERATE,
                type=FieldType.STRING,
            ),
            "Topics": FieldDefinition(
                description="Top 5 topics mentioned",
                type=FieldType.ARRAY,
                method=GenerationMethod.GENERATE,
                items_property={
                    "type": "string",
                }
            ),
            "Companies": FieldDefinition(
                description="List of companies mentioned",
                type=FieldType.ARRAY,
                method=GenerationMethod.GENERATE,
                items_property={
                    "type": "string"
                }
            ),
            "People": FieldDefinition(
                description="List of people mentioned",
                type=FieldType.ARRAY,
                method=GenerationMethod.GENERATE,
                items_property=FieldDefinition(
                    type=FieldType.OBJECT,
                    properties={
                        "Name": FieldDefinition(
                            type=FieldType.STRING,
                            description="Person's name"
                        ),
                        "Role": FieldDefinition(
                            type=FieldType.STRING,
                            description="Person's title/role"
                        )
                    }
                )
            ),
            "Sentiment": FieldDefinition(
                type=FieldType.STRING,
                method=GenerationMethod.CLASSIFY,
                description="Overall sentiment",
                enum=[
                    "Positive",
                    "Neutral",
                    "Negative"
                ]
            ),
            "Categories": FieldDefinition(
                type=FieldType.ARRAY,
                method=GenerationMethod.CLASSIFY,
                description="List of relevant categories",
                items_property=FieldDefinition(
                    type=FieldType.STRING,
                    enum=[
                        "Agriculture",
                        "Business",
                        "Finance",
                        "Health",
                        "Insurance",
                        "Mining",
                        "Pharmaceutical",
                        "Retail",
                        "Technology",
                        "Transportation"
                    ]
                )
            )
        }
    )
)

# Start the analyzer creation operation
poller = await client.content_analyzers.begin_create_or_replace(
    analyzer_id=analyzer_id,
    resource=content_analyzer,
    content_type="application/json"
)

# Extract operation ID from the poller
operation_id = extract_operation_id_from_poller(
    poller, PollerType.ANALYZER_CREATION
)
print(f"📋 Extracted creation operation ID: {operation_id}")

# Wait for the analyzer to be created
print(f"⏳ Waiting for analyzer creation to complete...")
await poller.result()
print(f"✅ Analyzer '{analyzer_id}' created successfully!")

## Extract Fields Using the Analyzer

Once the analyzer is successfully created, you can use it to analyze your input files.
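
The analysis cell in this section accepts the converted transcript if the string `WEBVTT` appears anywhere in it. A WebVTT file begins with that signature, optionally preceded by a byte order mark, so a stricter check is possible. The sketch below shows one way to do it; `looks_like_webvtt` is a hypothetical helper and not part of `TranscriptsProcessor`.

```python
def looks_like_webvtt(text):
    """Return True if `text` begins with the WebVTT signature (hypothetical helper)."""
    # Drop an optional byte order mark before checking the signature.
    if text.startswith("\ufeff"):
        text = text[1:]
    if not text.startswith("WEBVTT"):
        return False
    # The signature must be followed by end of input, a space, a tab, or a line break.
    rest = text[len("WEBVTT"):]
    return rest == "" or rest[0] in " \t\r\n"
```

With this helper, the condition `"WEBVTT" not in webvtt_output` could be replaced by `not looks_like_webvtt(webvtt_output)`.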

In [None]:
from extension.transcripts_processor import TranscriptsProcessor

test_file_path = analyzer_sample_file_path

transcripts_processor = TranscriptsProcessor()
webvtt_output, webvtt_output_file_path = transcripts_processor.convert_file(test_file_path)

if "WEBVTT" not in webvtt_output:
    print("Error: The output is not in WebVTT format.")
else:
    # Read the converted WebVTT transcript
    with open(webvtt_output_file_path, 'r', encoding='utf-8') as f:
        webvtt_content = f.read()

    print(f"✅ Sample WebVTT file read successfully from {webvtt_output_file_path}")
    # Begin document analysis operation
    print(f"🔍 Starting document analysis with analyzer '{analyzer_id}'...")
    analysis_poller = await client.content_analyzers.begin_analyze_binary(
        analyzer_id=analyzer_id,
        input=webvtt_content,
        content_type="application/octet-stream",
    )

    # Wait for analysis completion
    print(f"⏳ Waiting for document analysis to complete...")
    analysis_result = await analysis_poller.result()
    print(f"✅ Document analysis completed successfully!")

    # Extract operation ID for get_result
    analysis_operation_id = extract_operation_id_from_poller(
        analysis_poller, PollerType.ANALYZE_CALL
    )
    print(f"📋 Extracted analysis operation ID: {analysis_operation_id}")

    # Get the analysis result using the operation ID
    print(
        f"🔍 Getting analysis result using operation ID '{analysis_operation_id}'..."
    )
    operation_status = await client.content_analyzers.get_result(
        operation_id=analysis_operation_id,
    )

    print(f"✅ Analysis result retrieved successfully!")
    print(f"   Operation ID: {operation_status.id}")
    print(f"   Status: {operation_status.status}")

    # The actual analysis result is in operation_status.result
    operation_result = operation_status.result
    if operation_result is None:
        print("⚠️  No analysis result available")
    else:
        print(f"   Result contains {len(operation_result.contents)} contents")

        # Save the analysis result to a file
        saved_file_path = save_json_to_file(
            result=operation_result.as_dict(),
            filename_prefix="conversational_field_extraction_get_result",
        )
        print(f"💾 Analysis result saved to: {saved_file_path}")
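
To inspect the extracted fields without opening the saved JSON, the result dictionary can be flattened into plain Python values. The sketch below rests on an assumption about the layout: each entry under `contents[i]["fields"]` carries a `type` plus a matching `value<Type>` key (for example `valueString`, `valueArray`, `valueObject`). Check the saved file to confirm the actual key names. `field_value` and `summarize_fields` are hypothetical helpers.

```python
def field_value(field):
    """Convert one field entry into a plain Python value (hypothetical helper).

    Assumes the entry looks like {"type": "string", "valueString": "..."}.
    """
    field_type = field.get("type")
    if field_type == "array":
        return [field_value(item) for item in field.get("valueArray", [])]
    if field_type == "object":
        return {k: field_value(v) for k, v in field.get("valueObject", {}).items()}
    if field_type:
        # Scalar types: look up e.g. "valueString" for type "string".
        return field.get("value" + field_type[0].upper() + field_type[1:])
    return None


def summarize_fields(result_dict):
    """Collect {field name: plain value} across all entries in result_dict["contents"]."""
    summary = {}
    for content in result_dict.get("contents", []):
        for name, field in (content.get("fields") or {}).items():
            summary[name] = field_value(field)
    return summary


# Usage sketch inside the notebook:
# print(summarize_fields(operation_result.as_dict()))
```

A field that the service could not populate has no value key under this assumed layout, so it maps to `None` rather than raising.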


## Clean Up
Optionally, delete the sample analyzer from your Azure resource. In typical usage scenarios, you would analyze multiple files using the same analyzer.

In [None]:
# Clean up the created analyzer (demo cleanup)
print(f"🗑️  Deleting analyzer '{analyzer_id}' (demo cleanup)...")
await client.content_analyzers.delete(analyzer_id=analyzer_id)
print(f"✅ Analyzer '{analyzer_id}' deleted successfully!")