# Enhance Your Analyzer with Labeled Data


> #################################################################################
>
> Note: Currently, this feature is only available when the analyzer scenario is set to `document`.
>
> #################################################################################

Labeled data consists of samples that have been tagged with one or more labels to add context or meaning. This additional information is used to improve the analyzer's performance.

In your own projects, you can use [Azure AI Foundry](https://learn.microsoft.com/en-us/azure/ai-services/content-understanding/quickstart/use-ai-foundry) to annotate your data with the labeling tool.

This notebook demonstrates how to create an analyzer using your labeled data and how to analyze your files afterward.


## Prerequisites
1. Ensure your Azure AI service is configured by following the [configuration steps](../README.md#configure-azure-ai-service-resource).
2. Set environment variables related to training data by following the steps in [Set env for training data](../docs/set_env_for_training_data_and_reference_doc.md) and adding them to the [.env](./.env) file.
   - You can either set `TRAINING_DATA_SAS_URL` directly with the SAS URL for your Azure Blob container,
   - Or set both `TRAINING_DATA_STORAGE_ACCOUNT_NAME` and `TRAINING_DATA_CONTAINER_NAME` to generate the SAS URL automatically during later steps.
   - Also set `TRAINING_DATA_PATH` to specify the folder path within the container where the training data will be uploaded.
3. Install the packages required to run the sample:


In [None]:
%pip install -r ../requirements.txt

## Analyzer Template and Local Training Folder Setup
In this sample, we define a template for receipts.

The training folder should contain a flat (one-level) directory of labeled receipt documents. Each document includes:
- The original file (e.g., PDF or image).
- A corresponding `.labels.json` file with labeled fields.
- A corresponding `.result.json` file with OCR results.
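The layout above can be checked locally before anything is uploaded. The following is a minimal sketch, not part of the sample's helper code: `find_incomplete_documents` is a hypothetical function, and it assumes the companion files are named by appending `.labels.json` and `.result.json` to the full document file name (for example `receipt1.png.labels.json`). Adjust the suffixes if your labeling tool names files differently.

```python
from pathlib import Path

# Assumed naming convention: "<document>.labels.json" and "<document>.result.json"
LABEL_SUFFIX = ".labels.json"
RESULT_SUFFIX = ".result.json"


def find_incomplete_documents(folder):
    """Return {document_name: [missing companion file names]} for a flat folder."""
    folder = Path(folder)
    names = {p.name for p in folder.iterdir() if p.is_file()}
    incomplete = {}
    for name in sorted(names):
        # Skip the companion files themselves
        if name.endswith(LABEL_SUFFIX) or name.endswith(RESULT_SUFFIX):
            continue
        missing = [
            name + suffix
            for suffix in (LABEL_SUFFIX, RESULT_SUFFIX)
            if name + suffix not in names
        ]
        if missing:
            incomplete[name] = missing
    return incomplete
```

Calling `find_incomplete_documents("../data/document_training")` returns an empty dictionary when every document has both companion files.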

In [None]:
training_docs_folder = "../data/document_training"

## Create Azure Content Understanding Client
> The cell below creates a `ContentUnderstandingClient` from the `azure.ai.contentunderstanding.aio` package, together with a `DocumentProcessor` helper from the sample's `extension` module.
>
> Set the environment variable `AZURE_CONTENT_UNDERSTANDING_ENDPOINT` and, if you authenticate with a key, `AZURE_CONTENT_UNDERSTANDING_KEY`, using the information from your Azure AI Service.

> ⚠️ Important:
> Make sure the code below matches your Azure authentication method.
> The cell uses `AzureKeyCredential` when `AZURE_CONTENT_UNDERSTANDING_KEY` is set and falls back to `DefaultAzureCredential` otherwise.
> If the credential does not match your setup, the sample may not run correctly.

> ⚠️ Note: A subscription key works, but token-based authentication with Microsoft Entra ID (formerly Azure Active Directory, AAD) is safer and strongly recommended for production environments.

In [None]:
import logging
import json
import os
import sys
import uuid
from dotenv import load_dotenv
from azure.storage.blob import ContainerSasPermissions
from azure.core.credentials import AzureKeyCredential
from azure.identity import DefaultAzureCredential
from azure.ai.contentunderstanding.aio import ContentUnderstandingClient
from azure.ai.contentunderstanding.models import (
    ContentAnalyzer,
    FieldSchema,
    FieldDefinition,
    FieldType,
    GenerationMethod,
    AnalysisMode,
    ProcessingLocation,
)

# Add the sibling `python` directory to the Python path so the `extension` helper package can be imported
sys.path.append(os.path.join(os.path.dirname(os.getcwd()), 'python'))
from extension.document_processor import DocumentProcessor
from extension.sample_helper import extract_operation_id_from_poller, PollerType, save_json_to_file

load_dotenv()
logging.basicConfig(level=logging.INFO)

endpoint = os.environ.get("AZURE_CONTENT_UNDERSTANDING_ENDPOINT")
# Use AzureKeyCredential if AZURE_CONTENT_UNDERSTANDING_KEY is set, otherwise DefaultAzureCredential
key = os.getenv("AZURE_CONTENT_UNDERSTANDING_KEY")
credential = AzureKeyCredential(key) if key else DefaultAzureCredential()
# Create the ContentUnderstandingClient
client = ContentUnderstandingClient(endpoint=endpoint, credential=credential)
print("✅ ContentUnderstandingClient created successfully")

try:
    processor = DocumentProcessor(client)
    print("✅ DocumentProcessor created successfully")
except Exception as e:
    print(f"❌ Failed to create DocumentProcessor: {e}")
    raise

## Prepare Labeled Data
In this step, we will:
- Use the environment variables `TRAINING_DATA_PATH` and SAS URL related variables set in the Prerequisites step.
- Attempt to get the SAS URL from the environment variable `TRAINING_DATA_SAS_URL`.
- If `TRAINING_DATA_SAS_URL` is not set, try generating it automatically using `TRAINING_DATA_STORAGE_ACCOUNT_NAME` and `TRAINING_DATA_CONTAINER_NAME` environment variables.
- Verify that each document file in the local folder has corresponding `.labels.json` and `.result.json` files.
- Upload these files to the Azure Blob storage container specified by the environment variables.
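The fallback order in the steps above can be sketched as a small pure function. This is an illustration only: `resolve_training_storage` is a hypothetical helper, not part of the sample's `DocumentProcessor`, and it only decides which path to take without contacting Azure.

```python
def resolve_training_storage(env):
    """Decide how to obtain the container SAS URL from a mapping of environment variables.

    Returns ("use_sas_url", url) when TRAINING_DATA_SAS_URL is set,
    ("generate_sas_url", (account, container)) when both fallback variables are set,
    and raises ValueError otherwise.
    """
    sas_url = env.get("TRAINING_DATA_SAS_URL")
    if sas_url:
        return ("use_sas_url", sas_url)

    account = env.get("TRAINING_DATA_STORAGE_ACCOUNT_NAME")
    container = env.get("TRAINING_DATA_CONTAINER_NAME")
    if account and container:
        return ("generate_sas_url", (account, container))

    raise ValueError(
        "Set TRAINING_DATA_SAS_URL, or both TRAINING_DATA_STORAGE_ACCOUNT_NAME "
        "and TRAINING_DATA_CONTAINER_NAME."
    )
```

Passing `os.environ` as `env` makes a missing configuration fail early with a clear message instead of surfacing later during the upload.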

In [None]:
# Load training data storage configuration from environment
training_data_path = os.getenv("TRAINING_DATA_PATH") or f"training_data_{uuid.uuid4().hex[:8]}"
training_data_sas_url = os.getenv("TRAINING_DATA_SAS_URL")

if not training_data_path.endswith("/"):
    training_data_path += "/"

if not training_data_sas_url:
    TRAINING_DATA_STORAGE_ACCOUNT_NAME = os.getenv("TRAINING_DATA_STORAGE_ACCOUNT_NAME")
    TRAINING_DATA_CONTAINER_NAME = os.getenv("TRAINING_DATA_CONTAINER_NAME")
    print(f"TRAINING_DATA_STORAGE_ACCOUNT_NAME: {TRAINING_DATA_STORAGE_ACCOUNT_NAME}")
    print(f"TRAINING_DATA_CONTAINER_NAME: {TRAINING_DATA_CONTAINER_NAME}")

    if TRAINING_DATA_STORAGE_ACCOUNT_NAME and TRAINING_DATA_CONTAINER_NAME:
        # We require "Write" permission to upload, modify, or append blobs
        training_data_sas_url = processor.generate_container_sas_url(
            account_name=TRAINING_DATA_STORAGE_ACCOUNT_NAME,
            container_name=TRAINING_DATA_CONTAINER_NAME,
            permissions=ContainerSasPermissions(read=True, write=True, list=True),
            expiry_hours=1,
        )

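# Fail early if no SAS URL was provided and none could be generated
if not training_data_sas_url:
    raise ValueError(
        "Set TRAINING_DATA_SAS_URL, or both TRAINING_DATA_STORAGE_ACCOUNT_NAME "
        "and TRAINING_DATA_CONTAINER_NAME, before running this cell."
    )

await processor.generate_training_data_on_blob(training_docs_folder, training_data_sas_url, training_data_path)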

## Create Analyzer with Defined Schema
Before creating the analyzer, set `analyzer_id` to a name relevant to your task. In this example, the ID combines a timestamp with a random suffix, so this cell can be run multiple times to create different analyzers.

We use **TRAINING_DATA_SAS_URL** and **TRAINING_DATA_PATH** as set in the [.env](./.env) file and used in the previous step.

In [None]:
from datetime import datetime

# Timestamp plus a random suffix keeps the analyzer ID unique across runs
analyzer_id = f"analyzer-training-sample-{datetime.now().strftime('%Y%m%d-%H%M%S')}-{uuid.uuid4().hex[:8]}"

content_analyzer = ContentAnalyzer(
    base_analyzer_id="prebuilt-documentAnalyzer",
    description="Extract useful information from receipt",
    field_schema=FieldSchema(
        name="receipt schema",
        description="Schema for receipt",
        fields={
            "MerchantName": FieldDefinition(
                type=FieldType.STRING,
                method=GenerationMethod.EXTRACT,
                description=""
            ),
            "Items": FieldDefinition(
                type=FieldType.ARRAY,
                method=GenerationMethod.GENERATE,
                description="",
                items_property={
                    "type": "object",
                    "method": "extract",
                    "properties": {
                        "Quantity": {
                            "type": "string",
                            "method": "extract",
                            "description": ""
                        },
                        "Name": {
                            "type": "string",
                            "method": "extract",
                            "description": ""
                        },
                        "Price": {
                            "type": "string",
                            "method": "extract",
                            "description": ""
                        }
                    }
                }
            ),
            "TotalPrice": FieldDefinition(
                type=FieldType.STRING,
                method=GenerationMethod.EXTRACT,
                description=""
            )
        }
    ),
    mode=AnalysisMode.STANDARD,
    processing_location=ProcessingLocation.GEOGRAPHY,
    tags={"demo_type": "get_result"},
    training_data={
        "kind": "blob",
        "containerUrl": training_data_sas_url,
        "prefix": training_data_path
    },
)
print(f"🔧 Creating custom analyzer '{analyzer_id}'...")
poller = await client.content_analyzers.begin_create_or_replace(
    analyzer_id=analyzer_id,
    resource=content_analyzer,
)

# Extract operation ID from the poller
operation_id = extract_operation_id_from_poller(
    poller, PollerType.ANALYZER_CREATION
)
print(f"📋 Extracted creation operation ID: {operation_id}")

# Wait for the analyzer to be created
print(f"⏳ Waiting for analyzer creation to complete...")
await poller.result()
print(f"✅ Analyzer '{analyzer_id}' created successfully!")

## Use Created Analyzer to Extract Document Content
After the analyzer is successfully created, you can use it to analyze your input files.

In [None]:
file_path = "../data/receipt.png"
print(f"📄 Reading document file: {file_path}")
with open(file_path, "rb") as f:
    data_content = f.read()

# Begin document analysis operation
print(f"🔍 Starting document analysis with analyzer '{analyzer_id}'...")
analysis_poller = await client.content_analyzers.begin_analyze_binary(
    analyzer_id=analyzer_id,
    input=data_content,
    content_type="application/octet-stream",
)

# Wait for analysis completion
print(f"⏳ Waiting for document analysis to complete...")
analysis_result = await analysis_poller.result()
print(f"✅ Document analysis completed successfully!")

# Extract operation ID for get_result
analysis_operation_id = extract_operation_id_from_poller(
    analysis_poller, PollerType.ANALYZE_CALL
)
print(f"📋 Extracted analysis operation ID: {analysis_operation_id}")

# Get the analysis result using the operation ID
print(
    f"🔍 Getting analysis result using operation ID '{analysis_operation_id}'..."
)
operation_status = await client.content_analyzers.get_result(
    operation_id=analysis_operation_id,
)

print(f"✅ Analysis result retrieved successfully!")
print(f"   Operation ID: {operation_status.id}")
print(f"   Status: {operation_status.status}")

# The actual analysis result is in operation_status.result
operation_result = operation_status.result
if operation_result is None:
    print("⚠️  No analysis result available")
else:
    result_dict = operation_result.as_dict()
    print(f"📄 Analysis Result: {json.dumps(result_dict)}")

    # Save the analysis result to a file
    saved_file_path = save_json_to_file(
        result=result_dict,
        filename_prefix="analyzer_training_get_result",
    )
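The printed JSON can be hard to read, so it may help to reduce it to plain field values. The sketch below is an illustration under stated assumptions, not an SDK feature: `simplify_field` and `simplify_result` are hypothetical helpers, and the dictionary shape (a `contents` list whose first entry has a `fields` mapping, with values stored under keys such as `valueString`, `valueArray`, and `valueObject`) is assumed from the service's JSON response format. Verify it against your own saved output before relying on it.

```python
def simplify_field(field):
    """Reduce one field dictionary to a plain Python value (assumed response shape)."""
    field_type = field.get("type")
    if field_type == "array":
        return [simplify_field(item) for item in field.get("valueArray", [])]
    if field_type == "object":
        return {
            name: simplify_field(value)
            for name, value in field.get("valueObject", {}).items()
        }
    # Scalar types: return the first "value*" entry, for example "valueString"
    for key, value in field.items():
        if key.startswith("value"):
            return value
    # The field was defined in the schema but nothing was extracted
    return None


def simplify_result(result_dict):
    """Map each field name of the first content entry to its simplified value."""
    contents = result_dict.get("contents") or []
    if not contents:
        return {}
    fields = contents[0].get("fields") or {}
    return {name: simplify_field(field) for name, field in fields.items()}
```

For the receipt schema in this notebook, `simplify_result(operation_result.as_dict())` would yield a dictionary keyed by `MerchantName`, `Items`, and `TotalPrice`, provided the assumed shape holds.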

## Delete Existing Analyzer in Content Understanding Service
This snippet is optional and is included to prevent test analyzers from remaining in your service. Without deletion, the analyzer will stay in your service and may be reused in subsequent operations.

In [None]:
print(f"🗑️  Deleting analyzer '{analyzer_id}' (demo cleanup)...")
await client.content_analyzers.delete(analyzer_id=analyzer_id)
print(f"✅ Analyzer '{analyzer_id}' deleted successfully!")