# Extract Custom Fields from Your Pretranscribed File

This notebook demonstrates how to use analyzers to extract custom fields from your transcription input files.

## Prerequisites
1. Ensure your Azure AI service is configured by following these [steps](../README.md#configure-azure-ai-service-resource).
2. Install the required packages to run the sample.

In [1]:
%pip install -r ../requirements.txt

Defaulting to user installation because normal site-packages is not writeable
Note: you may need to restart the kernel to use updated packages.


## Analyzer Templates

Below is a collection of analyzer templates designed to extract fields from various input file types.

These templates are customizable, so you can modify them to suit your specific needs. For additional verified templates from Microsoft, see the [analyzer templates README](../analyzer_templates/README.md).

In [2]:
extraction_templates = {
    "call_recording_pretranscribe_batch": ('../analyzer_templates/call_recording_analytics_text.json', '../data/batch_pretranscribed.json'),
    "call_recording_pretranscribe_fast": ('../analyzer_templates/call_recording_analytics_text.json', '../data/fast_pretranscribed.json'),
    "call_recording_pretranscribe_cu": ('../analyzer_templates/call_recording_analytics_text.json', '../data/cu_pretranscribed.json')
}

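Specify the analyzer template you want to use and the ID of the analyzer to run. This notebook uses the prebuilt analyzer `prebuilt-callCenter`; the selected template entry also supplies the path to the sample pretranscribed file.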

In [7]:
import uuid

ANALYZER_TEMPLATE = "call_recording_pretranscribe_batch"
ANALYZER_ID = "prebuilt-callCenter"

(analyzer_template_path, analyzer_sample_file_path) = extraction_templates[ANALYZER_TEMPLATE]

## Create Azure AI Content Understanding Client

> The [AzureContentUnderstandingClient](../python/content_understanding_client.py) is a utility class containing functions to interact with the Content Understanding API. Until the official Content Understanding SDK is released, it can serve as a lightweight SDK.


In [8]:
import logging
import json
import os
import sys
from pathlib import Path
from dotenv import find_dotenv, load_dotenv
from azure.identity import DefaultAzureCredential, get_bearer_token_provider

load_dotenv(find_dotenv())
logging.basicConfig(level=logging.INFO)

AZURE_AI_ENDPOINT = os.getenv("AZURE_AI_ENDPOINT")
AZURE_AI_API_VERSION = os.getenv("AZURE_AI_API_VERSION", "2025-05-01-preview")

# Add the parent directory to the path to use shared modules
parent_dir = Path(Path.cwd()).parent
sys.path.append(str(parent_dir))
from python.content_understanding_client import AzureContentUnderstandingClient

credential = DefaultAzureCredential()
token_provider = get_bearer_token_provider(credential, "https://cognitiveservices.azure.com/.default")

client = AzureContentUnderstandingClient(
    endpoint=AZURE_AI_ENDPOINT,
    api_version=AZURE_AI_API_VERSION,
    token_provider=token_provider,
    # x_ms_useragent="azure-ai-content-understanding-python/field_extraction", # This header is used for sample usage telemetry, please comment out this line if you want to opt out.
)

INFO:azure.identity._credentials.environment:No environment configuration found.
INFO:azure.identity._credentials.managed_identity:ManagedIdentityCredential will use IMDS
INFO:azure.core.pipeline.policies.http_logging_policy:Request URL: 'http://169.254.169.254/metadata/identity/oauth2/token?api-version=REDACTED&resource=REDACTED'
Request method: 'GET'
Request headers:
    'User-Agent': 'azsdk-python-identity/1.23.0 Python/3.11.12 (Linux-5.15.167.4-microsoft-standard-WSL2-x86_64-with-glibc2.36)'
No body was attached to the request
INFO:azure.identity._credentials.chained:DefaultAzureCredential acquired a token from AzureDeveloperCliCredential


## Extract Fields Using the Analyzer

With the client ready, convert the pretranscribed file to WebVTT and analyze the converted file with the analyzer specified above.

In [9]:
from python.extension.transcripts_processor import TranscriptsProcessor

test_file_path = analyzer_sample_file_path

# Convert the pretranscribed file to WebVTT before sending it to the analyzer
transcripts_processor = TranscriptsProcessor()
webvtt_output, webvtt_output_file_path = transcripts_processor.convert_file(test_file_path)

if "WEBVTT" not in webvtt_output:
    print("Error: The output is not in WebVTT format.")
else:
    response = client.begin_analyze(ANALYZER_ID, file_location=webvtt_output_file_path)
    print("Response:", response)
    result_json = client.poll_result(response)
    # Print inside the else branch so result_json is always defined here
    print(json.dumps(result_json, indent=2))


Load transcription completed.
processing a fast transcription file.
Fast to WebVTT Conversion completed.
Conversion completed. The result has been saved to '../data/transcripts_processor_output/batch_pretranscribed.json.convertedTowebVTT.txt'


INFO:python.content_understanding_client:Analyzing file ../data/transcripts_processor_output/batch_pretranscribed.json.convertedTowebVTT.txt with analyzer: prebuilt-callCenter


Response: <Response [202]>


INFO:python.content_understanding_client:Request 4c2c7f32-0b29-4be4-b4a5-240feeb85109 in progress ...
INFO:python.content_understanding_client:Request 4c2c7f32-0b29-4be4-b4a5-240feeb85109 in progress ...
INFO:python.content_understanding_client:Request 4c2c7f32-0b29-4be4-b4a5-240feeb85109 in progress ...
INFO:python.content_understanding_client:Request 4c2c7f32-0b29-4be4-b4a5-240feeb85109 in progress ...
INFO:python.content_understanding_client:Request 4c2c7f32-0b29-4be4-b4a5-240feeb85109 in progress ...
INFO:python.content_understanding_client:Request 4c2c7f32-0b29-4be4-b4a5-240feeb85109 in progress ...
INFO:python.content_understanding_client:Request result is ready after 17.41 seconds.


{
  "id": "4c2c7f32-0b29-4be4-b4a5-240feeb85109",
  "status": "Succeeded",
  "result": {
    "analyzerId": "prebuilt-callCenter",
    "apiVersion": "2025-05-01-preview",
    "createdAt": "2025-06-06T05:39:22Z",
    "stringEncoding": "utf8",
    "contents": [
      {
        "markdown": "# Audio: 00:00.000 => 00:00.000\n\nTranscript\n```\nWEBVTT\n\n\n```",
        "fields": {
          "Summary": {
            "type": "string",
            "valueString": "The transcript does not contain any meaningful content or conversation."
          },
          "Sentiment": {
            "type": "string",
            "valueString": "Neutral"
          }
        },
        "kind": "audioVisual",
        "startTimeMs": 0,
        "endTimeMs": 0,
        "transcriptPhrases": []
      }
    ]
  }
}
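The extracted fields sit under `result.contents[0].fields`, each wrapped in a typed object. As a minimal sketch, the helper below flattens them into a plain dictionary. The function name `extract_field_values` is hypothetical, and the lookup assumes that every field stores its value under a key starting with `value` (only `valueString` appears in the output above); the sample dictionary is a shortened copy of that output.

```python
from typing import Any, Dict


def extract_field_values(result_json: Dict[str, Any]) -> Dict[str, Any]:
    """Flatten the fields of the first content item into {field_name: value}."""
    contents = result_json.get("result", {}).get("contents", [])
    if not contents:
        return {}
    values: Dict[str, Any] = {}
    for name, field in contents[0].get("fields", {}).items():
        # Assumption: the typed value is stored under a key such as "valueString".
        value_key = next((key for key in field if key.startswith("value")), None)
        values[name] = field[value_key] if value_key else None
    return values


# Shortened copy of the analyzer output shown above
sample_result = {
    "status": "Succeeded",
    "result": {
        "contents": [
            {
                "fields": {
                    "Summary": {
                        "type": "string",
                        "valueString": "The transcript does not contain any meaningful content or conversation.",
                    },
                    "Sentiment": {"type": "string", "valueString": "Neutral"},
                }
            }
        ]
    },
}

field_values = extract_field_values(sample_result)
print(field_values["Sentiment"])
```

In the notebook, the same helper could be called as `extract_field_values(result_json)`.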


> The markdown output contains layout information, which is very useful for Retrieval-Augmented Generation (RAG) scenarios. You can paste the markdown into a viewer such as Visual Studio Code and preview the layout structure.

In [10]:
print(result_json["result"]["contents"][0]["markdown"])

# Audio: 00:00.000 => 00:00.000

Transcript
```
WEBVTT


```
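To preview the markdown in an editor such as Visual Studio Code, it may be easier to write it to a `.md` file than to paste it. The following is a minimal sketch; the helper name `save_markdown` and the output location under the system temp directory are assumptions for illustration, and the sample string is a shortened stand-in for `result_json["result"]["contents"][0]["markdown"]`.

```python
import tempfile
from pathlib import Path


def save_markdown(markdown: str, output_path: Path) -> Path:
    """Write the markdown text to output_path, creating parent folders if needed."""
    output_path.parent.mkdir(parents=True, exist_ok=True)
    output_path.write_text(markdown, encoding="utf-8")
    return output_path


# Shortened stand-in for the markdown returned by the analyzer
sample_markdown = "# Audio: 00:00.000 => 00:00.000\n\nTranscript\n"

# Hypothetical output location; any writable path works
target = Path(tempfile.gettempdir()) / "analyzer_output" / "result.md"
saved_path = save_markdown(sample_markdown, target)
print(f"Markdown saved to {saved_path}")
```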


> You can get the layout information, including `words` and `lines` in the `pages` node, paragraph information in `paragraphs`, and tables in `tables`.

In [12]:
print(json.dumps(result_json["result"]["contents"][0], indent=2))

{
  "markdown": "# Audio: 00:00.000 => 00:00.000\n\nTranscript\n```\nWEBVTT\n\n\n```",
  "fields": {
    "Summary": {
      "type": "string",
      "valueString": "The transcript does not contain any meaningful content or conversation."
    },
    "Sentiment": {
      "type": "string",
      "valueString": "Neutral"
    }
  },
  "kind": "audioVisual",
  "startTimeMs": 0,
  "endTimeMs": 0,
  "transcriptPhrases": []
}
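In the output above, `markdown` holds only the `WEBVTT` header and `transcriptPhrases` is empty, which may indicate that the converted file contained no cues. The check `"WEBVTT" not in webvtt_output` does not catch that case. The sketch below shows one way to detect it before calling the analyzer, relying on the WebVTT rule that every cue has a timing line containing `-->`. The helper names and the sample cue are hypothetical and not part of `TranscriptsProcessor`.

```python
def has_webvtt_header(webvtt_text: str) -> bool:
    """Return True if the text starts with the WEBVTT signature (an optional BOM is allowed)."""
    return webvtt_text.lstrip("\ufeff").startswith("WEBVTT")


def count_webvtt_cues(webvtt_text: str) -> int:
    """Count cue timing lines, identified by the '-->' separator."""
    return sum(1 for line in webvtt_text.splitlines() if "-->" in line)


# Header only, as in the transcript shown in the output above
empty_webvtt = "WEBVTT\n\n\n"

# Hypothetical example with a single cue
webvtt_with_cue = (
    "WEBVTT\n"
    "\n"
    "00:00:00.000 --> 00:00:02.500\n"
    "<v Agent>Hello, how can I help you today?\n"
)

for label, text in [("empty", empty_webvtt), ("with cue", webvtt_with_cue)]:
    print(label, has_webvtt_header(text), count_webvtt_cues(text))
```

A notebook could skip the analysis call, or print a warning, when `count_webvtt_cues(webvtt_output)` returns zero.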
