# Enhance Your Analyzer with Labeled Data

> Note: Currently, this feature is available only when the analyzer scenario is set to `document`.

Labeled data consists of samples that have been tagged with one or more labels to provide context or meaning. This labeling is used to improve the analyzer's performance.

In your own projects, you can use [Azure AI Foundry](https://learn.microsoft.com/en-us/azure/ai-services/content-understanding/quickstart/use-ai-foundry) to annotate your data with the labeling tool.

This notebook demonstrates how to create an analyzer using labeled data and how to analyze your files after the labeled data is prepared.


## Prerequisites
1. Ensure your Azure AI service is configured by following the [configuration steps](../README.md#configure-azure-ai-service-resource).
2. Follow the steps in [Set environment variables for training data](../docs/set_env_for_training_data_and_reference_doc.md) to add the training data environment variables `TRAINING_DATA_SAS_URL` and `TRAINING_DATA_PATH` to the [.env](./.env) file:
    - `TRAINING_DATA_SAS_URL`: SAS URL for your Azure Blob container.
    - `TRAINING_DATA_PATH`: Folder path within the container where training data will be uploaded.
3. Install the necessary packages to run the sample.



In [None]:
%pip install -r ../requirements.txt

## Analyzer Template and Local Training Folder Setup
In this sample, we define a template for receipts.

The training folder should contain a flat (single-level) directory of labeled receipt documents. Each document should include:
- The original file (e.g., PDF or image).
- A corresponding `.labels.json` file containing the labeled fields.
- A corresponding `.result.json` file containing the OCR results.

In [None]:
analyzer_template = "../analyzer_templates/receipt.json"
training_docs_folder = "../data/document_training"
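
The folder layout described above can be checked locally before anything is uploaded. The following is a minimal sketch, not part of the sample's utility class: `find_incomplete_documents` is a hypothetical helper, and it assumes the sidecar files are named by appending `.labels.json` and `.result.json` to the full document file name (for example `receipt1.pdf.labels.json`).

```python
from pathlib import Path

# Assumption: sidecar files are named "<document file name>.labels.json" and
# "<document file name>.result.json", e.g. "receipt1.pdf.labels.json".
SIDECAR_SUFFIXES = (".labels.json", ".result.json")


def find_incomplete_documents(folder):
    """Return {document name: [missing sidecar names]} for a flat training folder."""
    folder = Path(folder)
    missing = {}
    for path in sorted(folder.iterdir()):
        # Skip subdirectories and the sidecar files themselves.
        if not path.is_file() or path.name.endswith(SIDECAR_SUFFIXES):
            continue
        absent = [
            path.name + suffix
            for suffix in SIDECAR_SUFFIXES
            if not (folder / (path.name + suffix)).exists()
        ]
        if absent:
            missing[path.name] = absent
    return missing
```

Calling `find_incomplete_documents(training_docs_folder)` returns an empty dictionary when every document has both companion files. Any other result lists the files that still need to be produced.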

## Create Azure Content Understanding Client

> The [AzureContentUnderstandingClient](../python/content_understanding_client.py) is a utility class that encapsulates necessary functions. Before the official release of the Content Understanding SDK, consider this a lightweight SDK.

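Set the environment variables **AZURE_AI_ENDPOINT** and **AZURE_AI_API_VERSION** with your Azure AI Service details; the code below reads them with `os.getenv`, and the API version falls back to `2025-05-01-preview` when it is not set. Set **AZURE_AI_API_KEY** only if you authenticate with a subscription key.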

> ⚠️ Important:
>
> You must update the code below to match your Azure authentication method.
> Look for the `# IMPORTANT` comments and modify those sections accordingly.
> If you omit this step, the sample may not run correctly.

> ⚠️ Note: A subscription key works, but a Microsoft Entra ID (formerly Azure Active Directory, AAD) token provider is more secure and is strongly recommended for production environments.

In [None]:
import logging
import json
import os
import sys
from pathlib import Path
from dotenv import find_dotenv, load_dotenv
from azure.identity import DefaultAzureCredential, get_bearer_token_provider

# Import utility package from python samples root directory
parent_dir = Path(Path.cwd()).parent
sys.path.append(str(parent_dir))
from python.content_understanding_client import AzureContentUnderstandingClient

load_dotenv(find_dotenv())
logging.basicConfig(level=logging.INFO)

credential = DefaultAzureCredential()
token_provider = get_bearer_token_provider(credential, "https://cognitiveservices.azure.com/.default")

client = AzureContentUnderstandingClient(
    endpoint=os.getenv("AZURE_AI_ENDPOINT"),
    api_version=os.getenv("AZURE_AI_API_VERSION", "2025-05-01-preview"),
    # IMPORTANT: Comment out token_provider if using subscription key
    token_provider=token_provider,
    # IMPORTANT: Uncomment this if using subscription key
    # subscription_key=os.getenv("AZURE_AI_API_KEY"),
    # This header is used for sample usage telemetry.
    # Comment out the next line if you want to opt out.
    x_ms_useragent="azure-ai-content-understanding-python/analyzer_training",
)
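
Instead of commenting lines in and out, the authentication choice could be derived from the environment. This is a sketch under one assumption: that the client constructor accepts exactly the `subscription_key` and `token_provider` keywords shown in the cell above. `select_auth_kwargs` is a hypothetical helper, not part of the utility class.

```python
import os


def select_auth_kwargs(token_provider, env=None):
    """Pick the authentication keyword argument for the client constructor.

    Uses the subscription key when AZURE_AI_API_KEY is set and non-empty,
    otherwise falls back to the supplied token provider.
    """
    env = os.environ if env is None else env
    key = env.get("AZURE_AI_API_KEY")
    if key:
        return {"subscription_key": key}
    return {"token_provider": token_provider}
```

The returned dictionary can then be unpacked into the constructor call, for example `AzureContentUnderstandingClient(endpoint=..., api_version=..., **select_auth_kwargs(token_provider))`, where the elided arguments stay as in the cell above.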

## Prepare Labeled Data

In this step, we will:
- Verify that each document file in the local folder has corresponding `.labels.json` and `.result.json` files.
- Upload these files to the configured Azure Blob storage.

This process uses the **TRAINING_DATA_SAS_URL** and **TRAINING_DATA_PATH** environment variables set up during the Prerequisites step.

In [None]:
TRAINING_DATA_SAS_URL = os.getenv("TRAINING_DATA_SAS_URL")
TRAINING_DATA_PATH = os.getenv("TRAINING_DATA_PATH")

await client.generate_training_data_on_blob(training_docs_folder, TRAINING_DATA_SAS_URL, TRAINING_DATA_PATH)
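
Because `os.getenv` returns `None` for an unset variable, a missing setting only surfaces once the upload call runs. A fail-fast check placed before the upload makes the cause easier to spot. The helper below is a hypothetical sketch and is not part of the sample code.

```python
import os


def require_env(names, env=None):
    """Return {name: value} for the given variables, or raise if any is unset or empty."""
    env = os.environ if env is None else env
    missing = [name for name in names if not env.get(name)]
    if missing:
        raise RuntimeError("Missing environment variables: " + ", ".join(missing))
    return {name: env[name] for name in names}
```

For this notebook, the call would be `require_env(["TRAINING_DATA_SAS_URL", "TRAINING_DATA_PATH"])`, run after `load_dotenv` has loaded the `.env` file.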

## Create Analyzer with Defined Schema

Before creating the analyzer, set the constant `CUSTOM_ANALYZER_ID` to a meaningful name for your task. Here, we append a random UUID so that this cell can be run multiple times to create different analyzers.

This step uses the **TRAINING_DATA_SAS_URL** and **TRAINING_DATA_PATH** variables configured in the [.env](./.env) file and referenced in the previous step.

In [None]:
import uuid
CUSTOM_ANALYZER_ID = "train-sample-" + str(uuid.uuid4())

response = client.begin_create_analyzer(
    CUSTOM_ANALYZER_ID,
    analyzer_template_path=analyzer_template,
    training_storage_container_sas_url=TRAINING_DATA_SAS_URL,
    training_storage_container_path_prefix=TRAINING_DATA_PATH,
)
result = client.poll_result(response)
if result is not None and "status" in result and result["status"] == "Succeeded":
    logging.info(f"Analyzer details for {result['result']['analyzerId']}")
    logging.info(json.dumps(result, indent=2))
else:
    logging.warning(
        "An issue was encountered while creating the analyzer. "
        "Please double-check your deployment and configurations for potential issues."
    )
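
Analyzer creation is a long-running operation, which is why the cell above waits on `client.poll_result` before inspecting the status. The internals of that method live in the utility class and are not reproduced here. The general polling pattern can be sketched as follows, where `fetch_status` is a hypothetical callable standing in for one status request, and treating `Failed` as a terminal status is an assumption (only `Succeeded` appears in this notebook).

```python
import time


def poll_until_done(fetch_status, timeout_seconds=120.0, interval_seconds=2.0, sleep=time.sleep):
    """Call fetch_status() until it reports a terminal status or the timeout expires."""
    deadline = time.monotonic() + timeout_seconds
    while True:
        result = fetch_status()
        status = str(result.get("status", "")).lower()
        # Assumption: "succeeded" and "failed" are the terminal states.
        if status in ("succeeded", "failed"):
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError(f"Operation still '{status}' after {timeout_seconds} seconds")
        sleep(interval_seconds)
```

Passing `sleep` as a parameter keeps the loop easy to exercise in tests without real waiting.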

## Use the Created Analyzer to Extract Document Content

Once the analyzer is successfully created, you can use it to analyze your input files.

In [None]:
response = client.begin_analyze(CUSTOM_ANALYZER_ID, file_location='../data/receipt.png')
result_json = client.poll_result(response)

logging.info(json.dumps(result_json, indent=2))
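
The full JSON dump is verbose, so it can help to pull out only the extracted field values. The sketch below assumes a response shape that this notebook does not confirm: a `result.contents` list whose entries carry a `fields` mapping, with each field storing its value under a key that starts with `value` (such as `valueString`). Compare it against the logged output above before relying on it; `extract_field_values` is a hypothetical helper.

```python
def extract_field_values(result_json):
    """Flatten extracted fields into {field name: value}.

    Assumed shape: result_json["result"]["contents"][i]["fields"][name],
    where each field holds its value under a key such as "valueString".
    """
    fields = {}
    contents = (result_json or {}).get("result", {}).get("contents", [])
    for content in contents:
        for name, field in content.get("fields", {}).items():
            # Take the first key that starts with "value", if there is one.
            value_key = next((key for key in field if key.startswith("value")), None)
            fields[name] = field.get(value_key) if value_key else None
    return fields
```

With a response of that shape, `extract_field_values(result_json)` yields a plain dictionary that is easier to log or compare against the labels. Fields that carry no value map to `None`.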

## Delete Existing Analyzer in Content Understanding Service

This step is optional but recommended to prevent test analyzers from accumulating in your service. If you skip it, the analyzer remains in the service and can be reused later.

In [None]:
client.delete_analyzer(CUSTOM_ANALYZER_ID)