# Document Classification with Azure AI Document Intelligence and Text Embeddings

This sample demonstrates how to classify a document using Azure AI Document Intelligence and text embeddings.

![Data Classification](../../../images/classification-embeddings.png)

This is achieved by the following process:

- Define a list of classifications, with descriptions and keywords.
- Create text embeddings for each of the classifications.
- Analyze a document using Azure AI Document Intelligence's `prebuilt-layout` model to extract the text from each page.
- For each page:
  - Create text embeddings.
  - Compare the embeddings with the embeddings of each classification.
  - Assign the page to the classification with the highest similarity, provided that similarity meets a given threshold.

## Objectives

By the end of this sample, you will have learned how to:

- Convert text to embeddings using Azure OpenAI's `text-embedding-3-large` model.
- Convert a document's pages to Markdown format using Azure AI Document Intelligence.
- Use cosine similarity to compare embeddings of classifications with document pages to classify them.

## Useful Tips

- Combine this technique with a [page extraction](../extraction/README.md) approach to ensure that you extract the most relevant data from a document's pages.

## Setup

### Import modules

This sample takes advantage of the following Python dependencies:

- **numpy** and **sklearn** for determining the cosine similarity between embeddings.
- **azure-ai-documentintelligence** to interface with the Azure AI Document Intelligence API for analyzing documents.
- **openai** to interface with the Azure OpenAI API for generating text embeddings.
- **azure-identity** to securely authenticate with deployed Azure Services using Microsoft Entra ID credentials.

The following local components are also used:

- [**classification**](../modules/samples/models/classification.py) to define the classifications.
- [**accuracy_evaluator**](../modules/samples/evaluation/accuracy_evaluator.py) to evaluate the output of the classification process against expected results.
- [**document_processing_result**](../modules/samples/models/document_processing_result.py) to store the results of the classification process as a file.
- [**stopwatch**](../modules/samples/utils/stopwatch.py) to measure the end-to-end execution time for the classification process.
- [**app_settings**](../modules/samples/app_settings.py) to access environment variables from the `.env` file.

In [1]:
import sys
sys.path.append('../modules/') # Import local modules

from IPython.display import display
import os
import pandas as pd
from dotenv import dotenv_values
from azure.ai.documentintelligence import DocumentIntelligenceClient
from azure.ai.documentintelligence.models import AnalyzeResult, DocumentContentFormat
from openai import AzureOpenAI
from azure.identity import DefaultAzureCredential, get_bearer_token_provider
from concurrent.futures import ThreadPoolExecutor, as_completed
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

from samples.app_settings import AppSettings
from samples.utils.stopwatch import Stopwatch
from samples.utils.storage_utils import create_json_file
from samples.models.document_processing_result import DataClassificationResult

from samples.models.classification import Classifications, Classification
from samples.evaluation.accuracy_evaluator import AccuracyEvaluator
from samples.evaluation.comparison import get_classification_comparison

### Configure the Azure services

Client instances for Azure AI Document Intelligence and Azure OpenAI are created with their SDKs, using a deployed endpoint and authentication credentials.

For this sample, the credentials of the Azure CLI are used to authenticate with the deployed services.

In [2]:
# Set the working directory to the root of the repo
working_dir = os.path.abspath('../../../')
settings = AppSettings(dotenv_values(f"{working_dir}/.env"))
sample_path = f"{working_dir}/samples/python/classification/"
sample_name = "document-classification-text-embeddings"

# Configure the default credential for accessing Azure services using Azure CLI credentials
credential = DefaultAzureCredential(
    exclude_workload_identity_credential=True,
    exclude_developer_cli_credential=True,
    exclude_environment_credential=True,
    exclude_managed_identity_credential=True,
    exclude_powershell_credential=True,
    exclude_shared_token_cache_credential=True,
    exclude_interactive_browser_credential=True
)

openai_token_provider = get_bearer_token_provider(credential, 'https://cognitiveservices.azure.com/.default')

openai_client = AzureOpenAI(
    azure_endpoint=settings.openai_endpoint,
    azure_ad_token_provider=openai_token_provider,
    api_version="2024-12-01-preview" # API version used for the Azure OpenAI requests in this sample.
)

document_intelligence_client = DocumentIntelligenceClient(
    endpoint=settings.ai_services_endpoint,
    credential=credential
)

### Establish the expected output

To evaluate the accuracy of the classification process, the expected classification for each page of a [Vehicle Insurance Policy](../../assets/vehicle_insurance/policy_1.pdf) is defined in the following code block.

The expected output has been defined by a human evaluating the document.

> **Note**: Only the `page_number` and `classification` are used in the accuracy evaluation.

In [3]:
path = f"{working_dir}/samples/assets/vehicle_insurance/"
pdf_fname = "policy_1.pdf"
pdf_fpath = f"{path}{pdf_fname}"

expected = Classifications(classifications=[
    Classification(page_number=0, classification="Insurance Policy", similarity=1.0),
    Classification(page_number=1, classification="Insurance Policy", similarity=1.0),
    Classification(page_number=2, classification="Insurance Policy", similarity=1.0),
    Classification(page_number=3, classification="Insurance Policy", similarity=1.0),
    Classification(page_number=4, classification="Insurance Policy", similarity=1.0),
    Classification(page_number=5, classification="Insurance Certificate", similarity=1.0),
    Classification(page_number=6, classification="Terms and Conditions", similarity=1.0),
    Classification(page_number=7, classification="Terms and Conditions", similarity=1.0),
    Classification(page_number=8, classification="Terms and Conditions", similarity=1.0),
    Classification(page_number=9, classification="Terms and Conditions", similarity=1.0),
    Classification(page_number=10, classification="Terms and Conditions", similarity=1.0),
    Classification(page_number=11, classification="Terms and Conditions", similarity=1.0),
    Classification(page_number=12, classification="Terms and Conditions", similarity=1.0)
])

classification_evaluator = AccuracyEvaluator(match_keys=["page_number"], ignore_keys=["page_number", "similarity"])

## Define classifications

The following code block defines the classifications for a document. Each classification has a name, a description, and a list of keywords. In this sample, the keywords are the text that is embedded and compared with each page of the document.

> **Note**: The classifications have been defined based on the expected content of a specific type of document. In this example, that is [a Vehicle Insurance Policy](../../assets/vehicle_insurance/policy_1.pdf).

In [4]:
classifications = [
    {
        "classification": "Insurance Policy",
        "description": "Specific information related to an insurance policy, such as coverage, limits, premiums, and terms, often used for reference or clarification purposes.",
        "keywords": [
            "welcome letter",
            "personal details",
            "vehicle details",
            "insured driver details",
            "policy details",
            "incident/conviction history",
            "schedule of insurance",
            "vehicle damage excesses"
        ]
    },
    {
        "classification": "Insurance Certificate",
        "description": "A document that serves as proof of insurance coverage, often required for legal, regulatory, or contractual purposes.",
        "keywords": [
            "certificate of vehicle insurance",
            "effective date of insurance",
            "entitlement to drive",
            "limitations of use"
        ]
    },
    {
        "classification": "Terms and Conditions",
        "description": "The rules, requirements, or obligations that govern an agreement or contract, often related to insurance policies, financial products, or legal documents.",
        "keywords": [
            "terms and conditions",
            "legal statements",
            "payment instructions",
            "legal obligations",
            "covered for",
            "claim settlement",
            "costs to pay",
            "legal responsibility",
            "personal accident coverage",
            "medical expense coverage",
            "personal liability coverage",
            "windscreen damage coverage",
            "uninsured motorist protection",
            "renewal instructions",
            "cancellation instructions"
        ]
    }
]

## Convert the document pages to Markdown

To classify the document pages using embeddings, the text from each page must first be extracted.

The following code block converts the document pages to Markdown format using Azure AI Document Intelligence's `prebuilt-layout` model.

For the purposes of this sample, we will be classifying each page. The benefit of using Azure AI Document Intelligence for this extraction is that it provides a page-by-page analysis result of the document.

In [5]:
with Stopwatch() as di_stopwatch:
    with open(pdf_fpath, "rb") as f:
        poller = document_intelligence_client.begin_analyze_document(
            model_id="prebuilt-layout",
            body=f,
            output_content_format=DocumentContentFormat.MARKDOWN,
            content_type="application/pdf"
        )
        
    result: AnalyzeResult = poller.result()

In [6]:
pages_content = []
for page in result.pages:
    # Extract the entire content for each page of the document based on the span offsets and lengths
    content = result.content[page.spans[0]['offset']: page.spans[0]['offset'] + page.spans[0]['length']]
    pages_content.append(content)
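
The cell above reads only the first span of each page. As a hedged sketch, assuming a page can carry more than one span and that spans expose `offset` and `length` as in the cell above, a hypothetical helper `slice_page_content` (not part of the sample or the SDK) could join all of them. Plain dictionaries stand in for the SDK objects here.

```python
from typing import Mapping, Sequence


def slice_page_content(content: str, spans: Sequence[Mapping[str, int]]) -> str:
    """Join the text covered by every span of a page, in span order."""
    return "".join(
        content[span["offset"]: span["offset"] + span["length"]]
        for span in spans
    )


# Toy stand-ins for `result.content` and `page.spans`; values are illustrative only.
full_content = "# Page one\nHello\n# Page two\nWorld\n"
page_one_spans = [{"offset": 0, "length": 17}]
page_two_spans = [{"offset": 17, "length": 11}, {"offset": 28, "length": 6}]

page_one = slice_page_content(full_content, page_one_spans)
page_two = slice_page_content(full_content, page_two_spans)
```

When every page has exactly one span, this returns the same text as the single-span slice used in the sample.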

## Create embeddings

With the text extracted from the document and the classifications defined, the next step is to create embeddings for each page and classification.

### Retrieving embeddings for text

The following helper function retrieves embeddings for a given piece of text using Azure OpenAI's `text-embedding-3-large` model.

In [7]:
def get_embedding(text: str):
    response = openai_client.embeddings.create(
        input=text,
        model=settings.text_embedding_model_deployment_name
    )
    embedding = response.data[0].embedding
    return embedding

### Convert the classifications to embeddings

The following code block takes each classification and generates the embeddings for the keywords.

In [8]:
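def process_classification(classification):
    # Embed the comma-separated keywords and store the vector on the classification.
    combined_text = ', '.join(classification['keywords'])
    classification['embedding'] = get_embedding(combined_text)

with Stopwatch() as ce_stopwatch:
    with ThreadPoolExecutor() as executor:
        # Consume the iterator so that any exception raised in a worker is surfaced here.
        list(executor.map(process_classification, classifications))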
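
The cell above embeds only the keywords. As an illustration, and not the sample's method, a hypothetical helper `build_classification_text` could also fold the description into the text that gets embedded. Whether this improves the similarity scores is an assumption that would need to be measured against the expected output.

```python
def build_classification_text(classification: dict, include_description: bool = False) -> str:
    """Build the text to embed for one classification entry."""
    keywords_text = ", ".join(classification["keywords"])
    if include_description and classification.get("description"):
        # Hypothetical variant: description first, then the keyword list.
        return f"{classification['description']} Keywords: {keywords_text}"
    return keywords_text


# Shortened example entry; the description is abbreviated for illustration.
example = {
    "classification": "Insurance Certificate",
    "description": "A document that serves as proof of insurance coverage.",
    "keywords": ["certificate of vehicle insurance", "entitlement to drive"],
}

keywords_only = build_classification_text(example)
with_description = build_classification_text(example, include_description=True)
```

With the default arguments, the helper produces the same comma-separated keyword string that the sample embeds.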

### Convert the document pages to embeddings

The following code block takes each page of the document and generates the embeddings for the text.

In [9]:
page_embeddings = [None] * len(pages_content)

with Stopwatch() as de_stopwatch:
    with ThreadPoolExecutor() as executor:
        future_to_idx = {executor.submit(get_embedding, text): idx for idx, text in enumerate(pages_content)}
        for future in as_completed(future_to_idx):
            idx = future_to_idx[future]
            page_embeddings[idx] = future.result()
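
The cell above keeps results in page order by mapping each future back to its index. A minimal alternative sketch: `ThreadPoolExecutor.map` yields results in input order, and consuming its iterator re-raises any exception from a worker. `fake_embedding` is a hypothetical stand-in for `get_embedding` so that the example runs without a service call.

```python
from concurrent.futures import ThreadPoolExecutor


def fake_embedding(text: str) -> list[float]:
    """Hypothetical stand-in for get_embedding: a 2-value vector derived from the text."""
    return [float(len(text)), float(text.count(" "))]


pages = ["first page", "second page of text", "third"]

with ThreadPoolExecutor(max_workers=3) as executor:
    # list(...) consumes the iterator, so worker exceptions surface here,
    # and the results come back in the same order as `pages`.
    ordered_embeddings = list(executor.map(fake_embedding, pages))
```

The `as_completed` approach in the sample handles each result as soon as it is ready, while `map` trades that for simpler code.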

## Classify the document pages

The following code block runs the classification process, using cosine similarity to compare the embeddings of the document pages with the embeddings of the predefined classifications.

It performs the following steps iteratively for each page in the document:

1. Calculates the cosine similarity between the embedding of the page and the matrix of embeddings of the predefined classifications.
2. Finds the best match for the page based on the maximum cosine similarity score.
3. If the cosine similarity score meets the threshold, the page is assigned to the best-match classification. Otherwise, the page is labeled "Unclassified", with the best match shown in parentheses.
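
Steps 2 and 3 can be sketched in isolation on precomputed scores. `pick_classification` is a hypothetical helper, and the scores are illustrative rather than taken from a real run. The `Unclassified (...)` label format follows the cell below.

```python
def pick_classification(similarities: list[float], labels: list[str], threshold: float) -> tuple[str, float]:
    """Return the best-matching label, or an 'Unclassified (...)' label below the threshold."""
    best_idx = max(range(len(similarities)), key=lambda i: similarities[i])
    best_score = similarities[best_idx]
    if best_score >= threshold:
        return labels[best_idx], best_score
    # Below the threshold: keep the best match visible for later review.
    return f"Unclassified ({labels[best_idx]})", best_score


labels = ["Insurance Policy", "Insurance Certificate", "Terms and Conditions"]

confident = pick_classification([0.41, 0.67, 0.38], labels, threshold=0.6)
uncertain = pick_classification([0.58, 0.44, 0.51], labels, threshold=0.6)
```

Keeping the best match inside the label makes it easier to see whether a page narrowly missed the threshold or matched the wrong classification.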

In [10]:
similarity_threshold = 0.6 # Minimum similarity threshold for classification

In [11]:
classification_embeddings = [cls['embedding'] for cls in classifications]
classification_matrix = np.array(classification_embeddings)

with Stopwatch() as classify_stopwatch:
    document_classifications = Classifications(classifications=[])
    for idx, page_emb in enumerate(page_embeddings):
        if not page_emb:
            # No embedding was produced for this page, so there is nothing to compare.
            classification = "Unclassified"
            best_similarity = 0.0
        else:
            page_vector = np.array(page_emb).reshape(1, -1)
            similarities = cosine_similarity(page_vector, classification_matrix)[0]
            best_match_idx = np.argmax(similarities)
            best_similarity = similarities[best_match_idx]

            if best_similarity >= similarity_threshold:
                classification = classifications[best_match_idx]['classification']
            else:
                classification = f"Unclassified ({classifications[best_match_idx]['classification']})"

        document_classifications.classifications.append(
            Classification(
                page_number=idx,
                classification=classification,
                similarity=best_similarity
            )
        )

## Calculate the accuracy

The following code block calculates the accuracy of the classification process by comparing the expected classifications with the predicted classifications.

In [12]:
expected_dict = expected.model_dump()
classifications_dict = document_classifications.model_dump()

accuracy = classification_evaluator.evaluate(expected=expected_dict, actual=classifications_dict)
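
The internals of `AccuracyEvaluator` are not shown here. As a rough, hypothetical approximation of a per-page accuracy, pages can be matched on `page_number` and compared on `classification` only, as the note in the expected output section describes. `page_accuracy` and the toy data are illustrative assumptions.

```python
def page_accuracy(expected: list[dict], actual: list[dict]) -> float:
    """Fraction of expected pages whose predicted classification matches exactly."""
    if not expected:
        return 0.0
    actual_by_page = {item["page_number"]: item["classification"] for item in actual}
    matches = sum(
        1 for item in expected
        if actual_by_page.get(item["page_number"]) == item["classification"]
    )
    return matches / len(expected)


expected_pages = [
    {"page_number": 0, "classification": "Insurance Policy"},
    {"page_number": 1, "classification": "Insurance Certificate"},
    {"page_number": 2, "classification": "Terms and Conditions"},
    {"page_number": 3, "classification": "Terms and Conditions"},
]
predicted_pages = [
    {"page_number": 0, "classification": "Unclassified (Insurance Policy)"},
    {"page_number": 1, "classification": "Insurance Certificate"},
    {"page_number": 2, "classification": "Terms and Conditions"},
    {"page_number": 3, "classification": "Terms and Conditions"},
]

toy_accuracy = page_accuracy(expected_pages, predicted_pages)  # 3 of 4 pages match
```

Under this exact-match rule, an `Unclassified (...)` label counts as a miss even when the best match in parentheses is correct.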

## Visualize the outputs

To provide context for the execution of the code, the following code blocks visualize the outputs of the classification process.

This includes:

- The accuracy of the classification process, measured by comparing the expected output with the classifications derived from the embeddings.
- The execution time of the end-to-end process.
- The classification results for each page in the document.

### Understanding Similarity

Cosine similarity is a metric used to measure how similar two vectors are. Embeddings are numerical representations of text. By converting a document page and classification keywords to embeddings, we can compare the similarity between the two using this technique.

Cosine similarity ranges from -1 to 1. Scores close to 1 indicate that the two vectors point in a similar direction, which suggests that the texts share similar characteristics. Scores near 0 or below indicate that the two vectors are dissimilar.
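
To make the score concrete, the sketch below computes cosine similarity in plain Python on toy 2-dimensional vectors. The sample itself uses `cosine_similarity` from `sklearn`. `cosine` here is a hypothetical helper for illustration, and real embeddings have far more dimensions.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity: dot(a, b) / (|a| * |b|). Assumes non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


same_direction = cosine([1.0, 2.0], [2.0, 4.0])   # parallel vectors, close to 1
orthogonal = cosine([1.0, 0.0], [0.0, 3.0])       # perpendicular vectors, 0
opposite = cosine([1.0, 2.0], [-1.0, -2.0])       # opposite vectors, close to -1
```

Because the score depends only on direction, scaling a vector (as in the first example) does not change its similarity to another vector.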

In [13]:
# Gets the total execution time of the classification process.
total_elapsed = di_stopwatch.elapsed + ce_stopwatch.elapsed + de_stopwatch.elapsed + classify_stopwatch.elapsed

In [14]:
# Save the output of the data classification result.
classification_result = DataClassificationResult(classifications_dict, accuracy, total_elapsed)

create_json_file(f"{sample_path}/{sample_name}.{pdf_fname}.json", classification_result)

In [15]:
# Display the outputs of the classification process.
df = pd.DataFrame([
    {
        "Accuracy": f"{accuracy['overall'] * 100:.2f}%",
        "Execution Time": f"{total_elapsed:.2f} seconds",
        "Document Intelligence Execution Time": f"{di_stopwatch.elapsed:.2f} seconds",
        "Classification Embedding Execution Time": f"{ce_stopwatch.elapsed:.2f} seconds",
        "Document Embedding Execution Time": f"{de_stopwatch.elapsed:.2f} seconds",
        "Classification Execution Time": f"{classify_stopwatch.elapsed:.2f} seconds"
    }
])

display(df)
display(get_classification_comparison(expected, document_classifications))

| Accuracy | Execution Time | Document Intelligence Execution Time | Classification Embedding Execution Time | Document Embedding Execution Time | Classification Execution Time |
| --- | --- | --- | --- | --- | --- |
| 61.54% | 9.28 seconds | 7.37 seconds | 1.27 seconds | 0.63 seconds | 0.02 seconds |

| Page | Expected | Extracted | Similarity |
| --- | --- | --- | --- |
| 0 | Insurance Policy | Unclassified (Insurance Policy) | 0.588678 |
| 1 | Insurance Policy | Unclassified (Insurance Policy) | 0.5812 |
| 2 | Insurance Policy | Insurance Policy | 0.661986 |
| 3 | Insurance Policy | Unclassified (Insurance Policy) | 0.522973 |
| 4 | Insurance Policy | Unclassified (Insurance Policy) | 0.596999 |
| 5 | Insurance Certificate | Insurance Certificate | 0.669797 |
| 6 | Terms and Conditions | Terms and Conditions | 0.607947 |
| 7 | Terms and Conditions | Terms and Conditions | 0.634872 |
| 8 | Terms and Conditions | Unclassified (Terms and Conditions) | 0.562734 |
| 9 | Terms and Conditions | Terms and Conditions | 0.60846 |
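
Several pages in the results above fall just below the 0.6 threshold while still pointing at the expected classification. The sketch below counts how many of the ten pages shown would pass at different thresholds, using the similarity scores copied from those results. It is an exploration aid rather than a recommendation for a specific value, since a lower threshold can also admit wrong matches on other documents.

```python
# Similarity scores for pages 0-9, copied from the results above.
observed_scores = [
    0.588678, 0.5812, 0.661986, 0.522973, 0.596999,
    0.669797, 0.607947, 0.634872, 0.562734, 0.60846,
]


def count_classified(scores: list[float], threshold: float) -> int:
    """Number of pages whose best similarity meets the threshold."""
    return sum(1 for score in scores if score >= threshold)


# Candidate thresholds are illustrative assumptions, not tuned values.
classified_by_threshold = {
    threshold: count_classified(observed_scores, threshold)
    for threshold in (0.5, 0.55, 0.6, 0.65)
}
```

For these ten pages, the count of classified pages moves from 5 at a threshold of 0.6 to 9 at 0.55, which shows how sensitive the result is to this setting.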
