# Classification - Azure AI Document Intelligence + Embeddings

This sample demonstrates how to classify a document by comparing the embeddings of its pages with the embeddings of a predefined set of categories.

## Objectives

By the end of this sample, you will have learned how to:

- Convert a predefined set of categories to embeddings using Azure OpenAI's `text-embedding-3-large` model.
- Convert a document's pages to Markdown format using Azure AI Document Intelligence.
- Compare the embeddings of the document's pages with the embeddings of the predefined categories to classify the document.

## Setup

In [94]:
import sys
sys.path.append('../')

from IPython.display import display, Markdown

import numpy as np
from sklearn.metrics.pairwise import cosine_similarity
import os
from dotenv import dotenv_values
from azure.ai.documentintelligence import DocumentIntelligenceClient
from azure.ai.documentintelligence.models import AnalyzeResult, ContentFormat
import json
from openai import AzureOpenAI
from azure.identity import DefaultAzureCredential, get_bearer_token_provider
from modules.app_settings import AppSettings
from modules.stopwatch import Stopwatch

In [95]:
# Set the working directory to the root of the repo
working_dir = os.path.abspath('../../')
settings = AppSettings(dotenv_values(f"{working_dir}/.env"))

# Configure the default credential for accessing Azure services using Azure CLI credentials
credential = DefaultAzureCredential(
    exclude_workload_identity_credential=True,
    exclude_developer_cli_credential=True,
    exclude_environment_credential=True,
    exclude_managed_identity_credential=True,
    exclude_powershell_credential=True,
    exclude_shared_token_cache_credential=True,
    exclude_interactive_browser_credential=True
)

openai_token_provider = get_bearer_token_provider(credential, 'https://cognitiveservices.azure.com/.default')

openai_client = AzureOpenAI(
    azure_endpoint=settings.openai_endpoint,
    azure_ad_token_provider=openai_token_provider,
    api_version="2024-08-01-preview"
)

document_intelligence_client = DocumentIntelligenceClient(
    endpoint=settings.ai_services_endpoint,
    credential=credential
)

## Establish the classifications

The following code block contains the classification definitions for a document. Each definition has a name, a description, and a list of keywords, chosen based on the content expected in a specific type of document (in this example, insurance documents).

In [96]:
pdf_path = f"{working_dir}/samples/assets/"
pdf_file_name = "VehicleInsurancePolicy.pdf"

classifications = [
    {
        "classification": "Insurance Correspondence",
        "description": "An insurance communication exchanged between individuals, organizations, or parties, typically in written or electronic form, often used for record-keeping or official purposes.",
        "keywords": [
            "letter",
            "communication",
            "email",
            "fax",
            "letterhead",
        ]
    },
    {
        "classification": "Contact Information",
        "description": "Personal or organizational details that can be used to contact or identify individuals or entities, often used for communication or reference purposes.",
        "keywords": [
            "policyholder",
            "your address",
            "email address",
            "phone number",
        ]
    },
    {
        "classification": "Policy Details",
        "description": "Specific information related to an insurance policy, such as coverage, limits, premiums, and terms, often used for reference or clarification purposes.",
        "keywords": [
            "cover type",
            "effective date",
            "excesses",
            "schedule",
        ]
    },
    {
        "classification": "Insurance Certificate",
        "description": "A document that serves as proof of insurance coverage, often required for legal, regulatory, or contractual purposes.",
        "keywords": [
            "certificate",
            "proof",
            "coverage",
            "liability",
            "endorsement",
            "declaration",
        ]
    },
    {
        "classification": "Terms and Conditions",
        "description": "The rules, requirements, or obligations that govern an agreement or contract, often related to insurance policies, financial products, or legal documents.",
        "keywords": [
            "legal",
            "statements",
            "terms",
            "conditions",
            "rules",
            "requirements",
            "obligations",
            "agreement",
            "responsibilities",
            "payment",
            "renewal",
            "cancellation",
            "what's covered",
        ]
    }
]

## Convert the document pages to Markdown

The following code block converts the document pages to Markdown format using Azure AI Document Intelligence. 

In this example, embeddings are created per page. Azure AI Document Intelligence suits this approach because its analysis result reports, for each page, the span of the overall content that belongs to that page.

In [97]:
fname = f"{pdf_path}{pdf_file_name}"

with open(fname, "rb") as f:
    poller = document_intelligence_client.begin_analyze_document(
        "prebuilt-layout",
        analyze_request=f,
        output_content_format=ContentFormat.MARKDOWN,
        content_type="application/pdf"
    )
    
result: AnalyzeResult = poller.result()

pages = []
for page in result.pages:
    # Slice this page's Markdown out of the full content using its first span
    span = page.spans[0]
    content = result.content[span['offset']: span['offset'] + span['length']]
    pages.append(content)
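
The slicing in the cell above depends on each page's span, which is an offset and a length into the full Markdown content. The following is a minimal sketch of that idea using toy data rather than a real analysis result. The `page_text` helper and the sample values are hypothetical, and joining every span (instead of only the first) is an assumption made in case a page reports more than one span.

```python
# Toy stand-ins for an analysis result. The real SDK returns model objects,
# but each span still carries an offset and a length into the content string.
content = "# Page one\nHello\n# Page two\nWorld\n"
pages_spans = [
    [{"offset": 0, "length": 17}],
    [{"offset": 17, "length": 17}],
]


def page_text(content: str, spans: list) -> str:
    """Concatenate the slices of `content` described by a page's spans."""
    return "".join(
        content[span["offset"]: span["offset"] + span["length"]]
        for span in spans
    )


page_texts = [page_text(content, spans) for spans in pages_spans]
for number, text in enumerate(page_texts, start=1):
    print(f"Page {number}: {text!r}")
```

Because the spans in this toy data cover the content without gaps, joining the page texts reproduces the original string.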

## Prepare the embeddings

The following code blocks prepare the embeddings for the classifications and the document pages using Azure OpenAI's `text-embedding-3-large` model.

In [98]:
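def get_embedding(text: str) -> list[float]:
    """Return the embedding vector for the given text."""
    response = openai_client.embeddings.create(
        input=text,
        model=settings.text_embedding_model_deployment_name
    )
    # A single input produces a single entry in response.data
    return response.data[0].embedding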

### Convert the classifications to embeddings

In [99]:
for classification in classifications:
    combined_text = f"{classification['classification']} {' '.join(classification['keywords'])}"
    classification['embedding'] = get_embedding(combined_text)
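
The cell above builds each category's text from its name and keywords; the `description` field is not used. As one possible variation, the sketch below also includes the description. The `build_category_text` helper is hypothetical, and whether the extra text improves the classification results has not been measured here.

```python
def build_category_text(category: dict, include_description: bool = True) -> str:
    """Join a category's name, optional description, and keywords into one string."""
    parts = [category["classification"]]
    if include_description and category.get("description"):
        parts.append(category["description"])
    parts.extend(category.get("keywords", []))
    return " ".join(parts)


# Shortened, illustrative category definition
example = {
    "classification": "Policy Details",
    "description": "Specific information related to an insurance policy.",
    "keywords": ["cover type", "schedule"],
}

text = build_category_text(example)
print(text)
```

Passing `include_description=False` yields the same text as the cell above, which makes it straightforward to compare both variants on the same document.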

### Convert the document pages to embeddings

In [100]:
page_embeddings = []
for text in pages:
    # Skip blank pages; an empty embedding is treated as "Unclassified" later
    embedding = get_embedding(text) if text.strip() else []
    page_embeddings.append(embedding)
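
The loop above sends one request per page. The embeddings endpoint also accepts a list of inputs, so a batched variant is possible. The sketch below assumes all texts are non-empty and fit within the service's per-request limits. The `get_embeddings_batch` helper is hypothetical, and the client is passed in as a parameter so that the function can be exercised without a live service.

```python
from typing import List, Sequence


def get_embeddings_batch(client, texts: Sequence[str], model: str) -> List[List[float]]:
    """Embed several texts in a single request and return them in input order."""
    response = client.embeddings.create(input=list(texts), model=model)
    # Each returned item carries the index of the input it belongs to;
    # sorting by it keeps the results aligned with `texts`.
    ordered = sorted(response.data, key=lambda item: item.index)
    return [item.embedding for item in ordered]
```

With the names used in this sample, a call might look like `get_embeddings_batch(openai_client, pages, settings.text_embedding_model_deployment_name)`.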

## Classify the document pages

The following code block executes the classification process using cosine similarity to compare the embeddings of the document pages with the embeddings of the predefined categories.

It performs the following steps iteratively for each page in the document:

1. Calculates the cosine similarity between the embeddings of the page and the matrix of embeddings of the predefined categories.
2. Finds the best match for the page based on the maximum cosine similarity score.
3. If the cosine similarity score is above a certain threshold, the page is classified under the best match category. Otherwise, the page is classified as "Unclassified".
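
To make the three steps concrete, here is a toy sketch that uses two-dimensional vectors in place of real embeddings. Cosine similarity is the dot product of two vectors divided by the product of their norms. The `classify` helper, the labels, and the vectors are illustrative assumptions, and the sketch computes the similarity with NumPy directly and assumes no vector has zero length.

```python
import numpy as np


def classify(page_vec, category_matrix, labels, threshold=0.4):
    """Return the best-matching label (or "Unclassified") and all similarities."""
    page = np.asarray(page_vec, dtype=float)
    categories = np.asarray(category_matrix, dtype=float)
    # Step 1: cosine similarity between the page and every category row
    norms = np.linalg.norm(categories, axis=1) * np.linalg.norm(page)
    similarities = categories @ page / norms
    # Step 2: pick the category with the highest similarity
    best = int(np.argmax(similarities))
    # Step 3: accept the match only if it meets the threshold
    label = labels[best] if similarities[best] >= threshold else "Unclassified"
    return label, similarities


labels = ["A", "B"]
category_matrix = [[1.0, 0.0], [0.0, 1.0]]

label, similarities = classify([3.0, 4.0], category_matrix, labels)
print(label, similarities)
```

In this toy case the page vector is closer to category `B` than to category `A`. Raising the threshold above the best score makes the same page fall back to "Unclassified".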

In [101]:
similarity_threshold = 0.4

classification_names = [cls['classification'] for cls in classifications]
classification_matrix = np.array([cls['embedding'] for cls in classifications])

document_classifications = []
for idx, page_emb in enumerate(page_embeddings):
    # Defaults cover pages without an embedding (for example, empty pages)
    classification = "Unclassified"
    best_similarity = 0.0
    all_similarities = []

    if page_emb:
        page_vector = np.array(page_emb).reshape(1, -1)
        # Similarity of this page against every classification embedding
        similarities = cosine_similarity(page_vector, classification_matrix)[0]
        best_match_idx = int(np.argmax(similarities))
        best_similarity = similarities[best_match_idx]
        all_similarities = list(zip(classification_names, similarities))
        # Accept the best match only if it meets the threshold
        if best_similarity >= similarity_threshold:
            classification = classification_names[best_match_idx]

    document_classifications.append({
        "page_number": idx + 1,
        "classification": classification,
        "similarity": f"{round(best_similarity * 100)}%",
        "all_similarities": [(name, f"{round(score * 100)}%") for name, score in all_similarities]
    })

## Visualize the outputs

The following code block displays the results of the classification process. For each page in the document, it shows the assigned classification and the page's similarity to every category.

In [None]:
# Display the outputs of the classification process.
display(Markdown("### Document Classifications:"))
for page in document_classifications:
    display(Markdown(f"#### Page {page['page_number']}"))
    display(Markdown(f"**Classification:** {page['classification']}"))
    display(Markdown(f"**All Similarities:** {page['all_similarities']}"))