# Enhance Your Analyzer with Labeled Data


> Note: Currently, this feature is only available when the analyzer scenario is set to `document`.

Labeled data consists of samples that have been tagged with one or more labels to provide additional context or meaning. This enriched information helps improve the analyzer's performance.

For your own projects, you can use [Azure AI Foundry](https://learn.microsoft.com/en-us/azure/ai-services/content-understanding/quickstart/use-ai-foundry) to annotate your data using the labeling tool.

This notebook demonstrates how to create an analyzer using your labeled data and how to analyze your files afterward.


## Prerequisites
1. Please ensure your Azure AI service is configured by following the [configuration steps](../README.md#configure-azure-ai-service-resource).
2. Set environment variables related to training data by following the instructions in [Set env for training data](../docs/set_env_for_training_data_and_reference_doc.md) and add them to the [.env](./.env) file.
   - You can either set `TRAINING_DATA_SAS_URL` directly with the SAS URL for your Azure Blob container,
   - Or set both `TRAINING_DATA_STORAGE_ACCOUNT_NAME` and `TRAINING_DATA_CONTAINER_NAME` to generate the SAS URL automatically in later steps.
   - Also, set `TRAINING_DATA_PATH` to specify the folder path within the container where the training data will be uploaded.
3. Please install the packages required to run the sample:

In [None]:
%pip install -r ../requirements.txt
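
The two ways of supplying the training data location described in step 2 can be checked up front. The following is a minimal sketch, not part of the sample's utility class: the helper name `resolve_training_data_source` is hypothetical, and it only inspects a mapping of environment variables, so it makes no calls to Azure.

```python
import os


def resolve_training_data_source(env=os.environ):
    """Return which configuration route is usable (hypothetical helper).

    "sas_url"               -> TRAINING_DATA_SAS_URL is set
    "account_and_container" -> both the account and container names are set
    """
    if env.get("TRAINING_DATA_SAS_URL"):
        return "sas_url"
    if env.get("TRAINING_DATA_STORAGE_ACCOUNT_NAME") and env.get("TRAINING_DATA_CONTAINER_NAME"):
        return "account_and_container"
    raise ValueError(
        "Set TRAINING_DATA_SAS_URL, or set both TRAINING_DATA_STORAGE_ACCOUNT_NAME "
        "and TRAINING_DATA_CONTAINER_NAME."
    )
```

Call it only after the `.env` file has been loaded, for example `resolve_training_data_source()` once `load_dotenv` has run, so that a missing variable is reported before any upload is attempted.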

## Analyzer Template and Local Training Folder Setup
In this sample, we define a template for receipts.

The training folder should contain a flat (one-level) directory of labeled receipt documents. Each document includes:
- The original file (e.g., PDF or image).
- A corresponding `.labels.json` file containing the labeled fields.
- A corresponding `.result.json` file containing the OCR results.

In [None]:
analyzer_template = "../analyzer_templates/receipt.json"
training_docs_folder = "../data/document_training"
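
Before uploading, it can help to confirm that the folder really follows this layout. The sketch below is illustrative only: the helper `find_incomplete_documents` is hypothetical, and it assumes that companion files are named by appending `.labels.json` and `.result.json` to the full document filename (for example `receipt1.jpg.labels.json`). Adjust the suffix logic if your files are named differently.

```python
from pathlib import Path

LABEL_SUFFIX = ".labels.json"
RESULT_SUFFIX = ".result.json"


def find_incomplete_documents(folder):
    """Map each document filename to the companion suffixes it is missing."""
    names = {p.name for p in Path(folder).iterdir() if p.is_file()}
    # Every file that is not itself a companion file is treated as a document.
    documents = sorted(n for n in names if not n.endswith((LABEL_SUFFIX, RESULT_SUFFIX)))
    missing = {}
    for doc in documents:
        absent = [s for s in (LABEL_SUFFIX, RESULT_SUFFIX) if doc + s not in names]
        if absent:
            missing[doc] = absent
    return missing
```

An empty dictionary from `find_incomplete_documents(training_docs_folder)` means every document has both companion files under the assumed naming scheme.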

## Create Azure Content Understanding Client
> The [AzureContentUnderstandingClient](../python/content_understanding_client.py) is a utility class containing helper functions. Before the official release of the Content Understanding SDK, please consider it a lightweight SDK.
>
> Please set the environment variables `AZURE_AI_ENDPOINT` and `AZURE_AI_API_VERSION`, plus `AZURE_AI_API_KEY` if you authenticate with a subscription key, using the information from your Azure AI Service.

> ⚠️ Important:
Please update the code below to match your Azure authentication method.
Look for the `# IMPORTANT` comments and modify those sections accordingly.
If you skip this step, the sample may not run correctly.

> ⚠️ Note: While using a subscription key works, using a token provider with Microsoft Entra ID (formerly Azure Active Directory, AAD) is safer and highly recommended for production environments.

In [None]:
import logging
import json
import os
import sys
from pathlib import Path
from dotenv import find_dotenv, load_dotenv
from azure.identity import DefaultAzureCredential, get_bearer_token_provider

# Import utility package from the Python samples root directory
parent_dir = Path.cwd().parent
sys.path.append(str(parent_dir))
from python.content_understanding_client import AzureContentUnderstandingClient

load_dotenv(find_dotenv())
logging.basicConfig(level=logging.INFO)

credential = DefaultAzureCredential()
token_provider = get_bearer_token_provider(credential, "https://cognitiveservices.azure.com/.default")

client = AzureContentUnderstandingClient(
    endpoint=os.getenv("AZURE_AI_ENDPOINT"),
    api_version=os.getenv("AZURE_AI_API_VERSION", "2025-05-01-preview"),
    # IMPORTANT: Comment out token_provider if using subscription key
    token_provider=token_provider,
    # IMPORTANT: Uncomment this if using subscription key
    # subscription_key=os.getenv("AZURE_AI_API_KEY"),
    # This header is used for sample usage telemetry; comment out the next line to opt out.
    x_ms_useragent="azure-ai-content-understanding-python/analyzer_training",
)
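
As an alternative to commenting and uncommenting the `# IMPORTANT` lines by hand, the authentication argument could be chosen from the environment. This is only a sketch: the helper `build_auth_kwargs` is hypothetical, and it assumes the client accepts either the `subscription_key` or the `token_provider` keyword, as the cell above suggests.

```python
import os


def build_auth_kwargs(env=os.environ, token_provider=None):
    """Pick one authentication keyword argument for the client (hypothetical helper)."""
    key = env.get("AZURE_AI_API_KEY")
    if key:
        # A key is used only when it has been set explicitly.
        return {"subscription_key": key}
    if token_provider is None:
        raise ValueError("AZURE_AI_API_KEY is not set and no token provider was supplied.")
    return {"token_provider": token_provider}
```

The result could then be unpacked into the constructor, for example `AzureContentUnderstandingClient(endpoint=..., api_version=..., **build_auth_kwargs(token_provider=token_provider))`. Leaving `AZURE_AI_API_KEY` unset keeps the recommended token-based route.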

## Prepare Labeled Data
In this step, we will:
- Use the environment variables `TRAINING_DATA_PATH` and SAS URL-related variables set in the Prerequisites step.
- Attempt to retrieve the SAS URL from the environment variable `TRAINING_DATA_SAS_URL`.
- If `TRAINING_DATA_SAS_URL` is not set, generate it automatically using `TRAINING_DATA_STORAGE_ACCOUNT_NAME` and `TRAINING_DATA_CONTAINER_NAME` environment variables.
- Verify that each document file in the local folder has corresponding `.labels.json` and `.result.json` files.
- Upload these files to the Azure Blob storage container specified by the environment variables.

In [None]:
training_data_sas_url = os.getenv("TRAINING_DATA_SAS_URL")
if not training_data_sas_url:
    TRAINING_DATA_STORAGE_ACCOUNT_NAME = os.getenv("TRAINING_DATA_STORAGE_ACCOUNT_NAME")
    TRAINING_DATA_CONTAINER_NAME = os.getenv("TRAINING_DATA_CONTAINER_NAME")
    # Both values are needed to generate a SAS URL, so fail if either one is missing.
    if not TRAINING_DATA_STORAGE_ACCOUNT_NAME or not TRAINING_DATA_CONTAINER_NAME:
        raise ValueError(
            "Please set either TRAINING_DATA_SAS_URL or both TRAINING_DATA_STORAGE_ACCOUNT_NAME and TRAINING_DATA_CONTAINER_NAME environment variables."
        )
    from azure.storage.blob import ContainerSasPermissions
    # Requires "Write" (critical for upload/modify/append) along with "Read" and "List" for viewing and listing blobs.
    training_data_sas_url = AzureContentUnderstandingClient.generate_temp_container_sas_url(
        account_name=TRAINING_DATA_STORAGE_ACCOUNT_NAME,
        container_name=TRAINING_DATA_CONTAINER_NAME,
        permissions=ContainerSasPermissions(read=True, write=True, list=True),
        expiry_hours=1,
    )

training_data_path = os.getenv("TRAINING_DATA_PATH")

await client.generate_training_data_on_blob(training_docs_folder, training_data_sas_url, training_data_path)
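
If the upload fails, a common cause is a SAS URL that has expired or lacks the Read, Write, and List permissions mentioned above. The sketch below inspects a SAS URL locally by reading the `sp` (signed permissions) and `se` (signed expiry) query parameters of the Azure Storage SAS format. The helper name `inspect_sas_url` is hypothetical, and the check does not contact the storage service, so it cannot detect a revoked or otherwise invalid signature.

```python
from datetime import datetime, timezone
from urllib.parse import parse_qs, urlsplit


def inspect_sas_url(sas_url, required="rwl", now=None):
    """Report missing permission letters and whether the SAS has expired."""
    query = parse_qs(urlsplit(sas_url).query)
    permissions = query.get("sp", [""])[0]
    expiry_text = query.get("se", [""])[0]
    missing = sorted(set(required) - set(permissions))
    expired = None  # stays None when the URL carries no expiry parameter
    if expiry_text:
        expiry = datetime.fromisoformat(expiry_text.replace("Z", "+00:00"))
        if expiry.tzinfo is None:
            expiry = expiry.replace(tzinfo=timezone.utc)
        expired = expiry <= (now or datetime.now(timezone.utc))
    return {"missing_permissions": missing, "expired": expired}
```

For example, `inspect_sas_url(training_data_sas_url)` should report an empty `missing_permissions` list and `expired` equal to `False` for a URL generated with the permissions used in the cell above.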

## Create Analyzer with Defined Schema
Before creating the analyzer, set the constant `CUSTOM_ANALYZER_ID` to a name relevant to your task. In this example, we append a unique UUID suffix so that this cell can be run multiple times to create different analyzers.

We use **training_data_sas_url** and **training_data_path**, which were resolved in the previous step from the values in the [.env](./.env) file.

In [None]:
import uuid
CUSTOM_ANALYZER_ID = "train-sample-" + str(uuid.uuid4())

response = client.begin_create_analyzer(
    CUSTOM_ANALYZER_ID,
    analyzer_template_path=analyzer_template,
    training_storage_container_sas_url=training_data_sas_url,
    training_storage_container_path_prefix=training_data_path,
)
result = client.poll_result(response)
if result is not None and "status" in result and result["status"] == "Succeeded":
    logging.info(f"Analyzer details for {result['result']['analyzerId']}")
    logging.info(json.dumps(result, indent=2))
else:
    logging.warning(
        "An issue was encountered when trying to create the analyzer. "
        "Please double-check your deployment and configurations for potential issues."
    )

## Use Created Analyzer to Extract Document Content
After the analyzer is successfully created, you can use it to analyze your input files.

In [None]:
response = client.begin_analyze(CUSTOM_ANALYZER_ID, file_location='../data/receipt.png')
result_json = client.poll_result(response)

logging.info(json.dumps(result_json, indent=2))
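
The full JSON can be verbose, so a compact view of the extracted fields is often easier to read. The following sketch assumes a response layout that this notebook does not confirm: a `result` object holding a `contents` list, where each entry has a `fields` dictionary whose values carry a `type` and a `value...` key such as `valueString`. The helper `summarize_fields` is hypothetical; compare it with the logged output above and adjust the keys if your response differs.

```python
def summarize_fields(result_json):
    """Collect field names with their type and first 'value*' entry (assumed layout)."""
    contents = (result_json or {}).get("result", {}).get("contents", [])
    summary = {}
    for content in contents:
        for name, field in content.get("fields", {}).items():
            # Take the first key starting with "value" (e.g. valueString, valueNumber).
            value = next((v for k, v in field.items() if k.startswith("value")), None)
            summary[name] = {"type": field.get("type"), "value": value}
    return summary
```

Calling `summarize_fields(result_json)` returns an empty dictionary when the response is `None` or carries no fields under the assumed keys.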

## Delete Existing Analyzer in Content Understanding Service
This snippet is optional and is included to prevent test analyzers from persisting in your service. Without deletion, the analyzer will remain in your service and may be reused in subsequent operations.

In [None]:
client.delete_analyzer(CUSTOM_ANALYZER_ID)