# Document Classification with Azure AI Document Intelligence and Text Embeddings

This sample demonstrates how to classify a document using Azure AI Document Intelligence and text embeddings.

![Data Classification](../../../images/classification-embeddings.png)

This is achieved by the following process:

- Define a list of classifications, with descriptions and keywords.
- Create text embeddings for each of the classifications.
- Analyze a document using Azure AI Document Intelligence's `prebuilt-layout` model to extract the text from each page.
- For each page:
  - Create text embeddings.
  - Compare the embeddings with the embeddings of each classification.
  - Assign the page to the classification with the highest similarity that exceeds a given threshold.

## Objectives

By the end of this sample, you will have learned how to:

- Convert text to embeddings using Azure OpenAI's `text-embedding-3-large` model.
- Convert a document's pages to Markdown format using Azure AI Document Intelligence.
- Use cosine similarity to compare embeddings of classifications with document pages to classify them.

## Useful Tips

- Combine this technique with a [page extraction](../extraction/README.md) approach to ensure that you extract the most relevant data from a document's pages.

## Setup

### Import modules

This sample takes advantage of the following .NET dependencies:

- **System.Numerics.Tensors** to calculate the cosine similarity between embeddings.
- **Azure.AI.DocumentIntelligence** to interface with the Azure AI Document Intelligence API for analyzing documents.
- **Azure.AI.OpenAI** to interface with the Azure OpenAI embeddings API to generate text embeddings using the `text-embedding-3-large` model.
- **Azure.Identity** to securely authenticate with deployed Azure Services using Microsoft Entra ID credentials.
- **DotNetEnv** to load environment variables from the `.env` file.

The following local components are also used:

- [**Classification**](../modules/samples/models/Classification.csx) to define the classifications.
- [**AccuracyEvaluator**](../modules/samples/evaluation/AccuracyEvaluator.csx) to evaluate the output of the classification process with expected results.
- [**DocumentProcessingResult**](../modules/samples/models/DocumentProcessingResult.csx) to store the results of the classification process as a file.
- [**AppSettings**](../modules/samples/AppSettings.csx) to access environment variables from the `.env` file.
- [**StopwatchContext**](../modules/samples/helpers/StopwatchContext.csx) to measure the execution time of each step.

In [1]:
#r "nuget: System.Numerics.Tensors, 9.0.3"
#r "nuget: Azure.Identity, 1.13.2"
#r "nuget: Azure.AI.OpenAI, 2.1.0"
#r "nuget: Azure.AI.DocumentIntelligence, 1.0.0"
#r "nuget: DotNetEnv, 3.1.1"

#!import ../modules/samples/AppSettings.csx
#!import ../modules/samples/helpers/StopwatchContext.csx
#!import ../modules/samples/models/Classification.csx
#!import ../modules/samples/models/DocumentProcessingResult.csx
#!import ../modules/samples/evaluation/AccuracyEvaluator.csx

using System;
using System.IO;
using System.Numerics.Tensors;
using System.Text.Json;
using Azure;
using Azure.Core;
using Azure.Identity;
using Azure.AI.OpenAI;
using Azure.AI.DocumentIntelligence;
using DotNetEnv;

### Configure the Azure services

The Azure AI Document Intelligence and Azure OpenAI SDKs each create a client instance from a deployed endpoint and an authentication credential.

For this sample, `DefaultAzureCredential` is configured to exclude most credential sources so that the Azure CLI credentials are used to authenticate with the deployed services.

In [2]:
string workingDir = Path.GetFullPath("../../../");
AppSettings settings = new AppSettings(new Dictionary<string, string>(Env.Load(Path.Combine(workingDir, ".env"))));
string samplePath = Path.Combine(workingDir, "samples/dotnet/classification/");
string sampleName = "document-classification-text-embeddings";

DefaultAzureCredential credential = new DefaultAzureCredential(
    new DefaultAzureCredentialOptions { 
        ExcludeWorkloadIdentityCredential = true,
        ExcludeAzureDeveloperCliCredential = true,
        ExcludeEnvironmentCredential = true,
        ExcludeManagedIdentityCredential = true,
        ExcludeAzurePowerShellCredential = true,
        ExcludeSharedTokenCacheCredential = true,
        ExcludeInteractiveBrowserCredential = true
    }
);

AzureOpenAIClient openaiClient = new AzureOpenAIClient(
    new Uri(settings.OpenAIEndpoint),
    credential
);

var documentIntelligenceClient = new DocumentIntelligenceClient(
    new Uri(settings.AIServicesEndpoint),
    credential
);

### Establish the expected output

To measure the accuracy of the classification process, the following code block defines the expected output for each page range of a [Vehicle Insurance Policy](../../assets/vehicle_insurance/policy_1.pdf).

The expected output was defined by a human evaluating the document.

> **Note**: Only `Classification` and `ImageRangeStart` are used as match keys in the accuracy evaluation.

In [3]:
string path = Path.Combine(workingDir, "samples/assets/vehicle_insurance/");
string pdfFName = "policy_1.pdf";
string pdfFPath = Path.Combine(path, pdfFName);

ClassificationsModel expected = new ClassificationsModel()
{
    Classifications = new List<ClassificationModel>()
    {
        new ClassificationModel() { Classification = "Insurance Policy", ImageRangeStart = 1, ImageRangeEnd = 5 },
        new ClassificationModel() { Classification = "Insurance Certificate", ImageRangeStart = 6, ImageRangeEnd = 6 },
        new ClassificationModel() { Classification = "Terms and Conditions", ImageRangeStart = 7, ImageRangeEnd = 13 }
    }
};

AccuracyEvaluator<ClassificationsModel> classificationEvaluator = new AccuracyEvaluator<ClassificationsModel>(matchKeys: new List<string>() { "Classification", "ImageRangeStart" }, ignoreKeys: new List<string>() { });

## Define classifications

The following code block defines the classifications for a document. Each classification has a name, description, and keywords that will be used to classify the document's pages.

> **Note**: The classifications are based on the content expected in a specific type of document, in this example [a Vehicle Insurance Policy](../../assets/vehicle_insurance/policy_1.pdf).

In [4]:
List<ClassificationDefinitionModel> classifications = new List<ClassificationDefinitionModel>()
{
    new ClassificationDefinitionModel() { 
        Classification = "Insurance Policy", 
        Description = "Specific information related to an insurance policy, such as coverage, limits, premiums, and terms, often used for reference or clarification purposes.",
        Keywords = new List<string>() {
            "welcome letter",
            "personal details",
            "vehicle details",
            "insured driver details",
            "policy details",
            "incident/conviction history",
            "schedule of insurance",
            "vehicle damage excesses"
        }
    },
    new ClassificationDefinitionModel() { 
        Classification = "Insurance Certificate", 
        Description = "A document that serves as proof of insurance coverage, often required for legal, regulatory, or contractual purposes.",
        Keywords = new List<string>() {
            "certificate of vehicle insurance",
            "effective date of insurance",
            "entitlement to drive",
            "limitations of use"
        }
    },
    new ClassificationDefinitionModel() { 
        Classification = "Terms and Conditions", 
        Description = "The rules, requirements, or obligations that govern an agreement or contract, often related to insurance policies, financial products, or legal documents.",
        Keywords = new List<string>() {
            "terms and conditions",
            "legal statements",
            "payment instructions",
            "legal obligations",
            "covered for",
            "claim settlement",
            "costs to pay",
            "legal responsibility",
            "personal accident coverage",
            "medical expense coverage",
            "personal liability coverage",
            "windscreen damage coverage",
            "uninsured motorist protection",
            "renewal instructions",
            "cancellation instructions"
        }
    }
};

## Convert the document pages to Markdown

To classify the document pages using embeddings, the text from each page must first be extracted.

The following code block converts the document pages to Markdown format using Azure AI Document Intelligence's `prebuilt-layout` model.

For the purposes of this sample, we will be classifying each page. The benefit of using Azure AI Document Intelligence for this extraction is that it provides a page-by-page analysis result of the document.

In [5]:
AnalyzeResult result;

StopwatchContext diSw;

using (diSw = new StopwatchContext())
{
    var pollerResult = await documentIntelligenceClient.AnalyzeDocumentAsync(
        WaitUntil.Completed,
        modelId: "prebuilt-layout",
        bytesSource: BinaryData.FromBytes(File.ReadAllBytes(pdfFPath))
    );

    result = pollerResult.Value;
}

In [6]:
var pagesContent = new List<string>();

foreach (var page in result.Pages)
{
    // Extract the entire content for each page of the document based on the span offsets and lengths
    var pageContent = result.Content.Substring(page.Spans[0].Offset, page.Spans[0].Length);
    pagesContent.Add(pageContent);
}
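
The span arithmetic in the cell above can be illustrated without calling the service. The following is a minimal sketch using a made-up content string and hypothetical `(Offset, Length)` tuples standing in for `page.Spans`. It also joins every span of a page rather than only the first, which is an assumption about how multi-span pages could be handled and not something the sample does.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical stand-in for AnalyzeResult.Content: one string holding the text of all pages.
string content = "Page one text.Page two text.";

// Hypothetical stand-in for page.Spans: each page lists (Offset, Length) pairs into the content.
var pageSpans = new List<(int Offset, int Length)[]>
{
    new[] { (Offset: 0, Length: 14) },
    new[] { (Offset: 14, Length: 14) }
};

// Join every span of a page so that no text is dropped when a page has more than one span.
List<string> pages = pageSpans
    .Select(spans => string.Concat(spans.Select(s => content.Substring(s.Offset, s.Length))))
    .ToList();

foreach (var pageText in pages)
{
    Console.WriteLine(pageText);
}
```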

## Create embeddings

With the text extracted from the document and the classifications defined, the next step is to create embeddings for each page and classification.

### Retrieving embeddings for text

The following helper function retrieves embeddings for a given piece of text using Azure OpenAI's `text-embedding-3-large` model.

In [7]:
private async Task<float[]> GetEmbeddingAsync(string text)
{
    var embeddingClient = openaiClient.GetEmbeddingClient(settings.TextEmbeddingModelDeploymentName);
    var embedding = await embeddingClient.GenerateEmbeddingAsync(text);
    return embedding.Value.ToFloats().ToArray();
}

### Convert the classifications to embeddings

The following code block joins the keywords of each classification into a single comma-separated string and generates one embedding for it.

In [8]:
StopwatchContext ceSw;

using (ceSw = new StopwatchContext())
{
    foreach (var classification in classifications)
    {
        var classificationEmbeddings = await GetEmbeddingAsync(string.Join(", ", classification.Keywords));
        classification.Embedding = classificationEmbeddings;
    }
}

### Convert the document pages to embeddings

The following code block takes each page of the document and generates the embeddings for the text.

In [9]:
var pageEmbeddings = new List<float[]>();

StopwatchContext deSw;

using (deSw = new StopwatchContext())
{
    foreach (var pageContent in pagesContent)
    {
        var pageEmbedding = await GetEmbeddingAsync(pageContent);
        pageEmbeddings.Add(pageEmbedding);
    }
}

## Classify the document pages

The following code block runs the classification process using cosine similarity to compare the embeddings of the document pages with the embeddings of the predefined categories.

It performs the following steps iteratively for each page in the document:

1. Calculates the cosine similarity between the embeddings of the page and the matrix of embeddings of the predefined categories.
2. Finds the best match for the page based on the maximum cosine similarity score.
3. If the cosine similarity score is above a certain threshold, the page is classified under the best match category. Otherwise, the page is classified as "Unclassified".
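
The three steps above can be sketched with hard-coded scores. The similarity values, the `ClassifyPage` helper, and the 0.5 threshold below are illustrative assumptions, not output from the sample.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper: returns the best-scoring label, or "Unclassified"
// when the best score does not exceed the threshold.
string ClassifyPage(IReadOnlyList<string> labels, IReadOnlyList<float> similarities, float threshold)
{
    int bestIndex = 0;
    for (int i = 1; i < similarities.Count; i++)
    {
        if (similarities[i] > similarities[bestIndex])
        {
            bestIndex = i;
        }
    }

    return similarities[bestIndex] > threshold ? labels[bestIndex] : "Unclassified";
}

var labels = new[] { "Insurance Policy", "Insurance Certificate", "Terms and Conditions" };

// Made-up similarity scores for two pages, one score per label.
string confident = ClassifyPage(labels, new[] { 0.31f, 0.72f, 0.44f }, 0.5f);
string uncertain = ClassifyPage(labels, new[] { 0.21f, 0.18f, 0.33f }, 0.5f);

Console.WriteLine(confident); // Insurance Certificate
Console.WriteLine(uncertain); // Unclassified
```

The threshold trades coverage for precision: a higher value leaves more pages unclassified, while a lower value accepts weaker matches.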

In [10]:
var similarityThreshold = 0.5f; // Minimum similarity threshold for classification

In [11]:
ClassificationsModel documentClassifications;

StopwatchContext classifySw;

using (classifySw = new StopwatchContext())
{
    documentClassifications = new ClassificationsModel()
    {
        Classifications = new List<ClassificationModel>()
    };

    for (int i = 0; i < pageEmbeddings.Count; i++)
    {
        var pageEmbedding = pageEmbeddings[i];
        var pageContent = pagesContent[i];

        var similarities = new List<float>();

        foreach (var c in classifications)
        {
            var similarity = TensorPrimitives.CosineSimilarity(new ReadOnlySpan<float>(pageEmbedding), new ReadOnlySpan<float>(c.Embedding));
            similarities.Add(similarity);
        }

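        var maxSimilarity = similarities.Max();
        var classificationIndex = similarities.IndexOf(maxSimilarity);

        // Fall back to "Unclassified" when the best match does not exceed the similarity threshold
        var classification = maxSimilarity > similarityThreshold
            ? classifications[classificationIndex]
            : new ClassificationDefinitionModel() { Classification = "Unclassified" };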

        // Check the last appended classification, and if it matches the current one, update the range end, otherwise add as new

        if (documentClassifications.Classifications.Count > 0 && documentClassifications.Classifications.Last().Classification == classification.Classification)
        {
            documentClassifications.Classifications.Last().ImageRangeEnd = i + 1;
        }
        else
        {
            documentClassifications.Classifications.Add(new ClassificationModel()
            {
                Classification = classification.Classification,
                ImageRangeStart = i + 1,
                ImageRangeEnd = i + 1,
            });
        }
    }
}
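
The range-merging logic at the end of the loop above can be isolated as follows. This is a minimal sketch operating on a hard-coded list of per-page labels; the `(Label, Start, End)` tuple is a hypothetical stand-in for `ClassificationModel`.

```csharp
using System;
using System.Collections.Generic;

// Made-up per-page labels, in page order.
var pageLabels = new[] { "A", "A", "B", "C", "C", "C" };

var ranges = new List<(string Label, int Start, int End)>();

for (int i = 0; i < pageLabels.Length; i++)
{
    int pageNumber = i + 1; // Pages are 1-based in the output.
    int lastIndex = ranges.Count - 1;

    if (lastIndex >= 0 && ranges[lastIndex].Label == pageLabels[i])
    {
        // Same label as the previous page: extend the current range.
        var last = ranges[lastIndex];
        ranges[lastIndex] = (last.Label, last.Start, pageNumber);
    }
    else
    {
        // Label changed: start a new range on this page.
        ranges.Add((pageLabels[i], pageNumber, pageNumber));
    }
}

foreach (var r in ranges)
{
    Console.WriteLine($"{r.Label}: pages {r.Start}-{r.End}");
}
```

Because only consecutive pages are merged, a label that reappears later in the document starts a new range instead of extending the earlier one.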

## Calculate the accuracy

The following code block calculates the accuracy of the classification process by comparing the expected classifications with the predicted classifications.

In [12]:
var accuracy = classificationEvaluator.Evaluate(expected, documentClassifications);
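
The internals of `AccuracyEvaluator` are not shown in this sample. As a rough illustration only, key-based matching could work along the following lines, assuming that an expected entry counts as correct when a predicted entry has the same `Classification` and `ImageRangeStart`. The tuples and the made-up prediction are hypothetical.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Expected match keys, taken from the expected output defined earlier.
var expectedKeys = new List<(string Classification, int ImageRangeStart)>
{
    ("Insurance Policy", 1),
    ("Insurance Certificate", 6),
    ("Terms and Conditions", 7)
};

// Made-up prediction in which the last range starts one page late.
var predictedKeys = new List<(string Classification, int ImageRangeStart)>
{
    ("Insurance Policy", 1),
    ("Insurance Certificate", 6),
    ("Terms and Conditions", 8)
};

// Count expected entries that have a prediction with identical match keys.
int matched = expectedKeys.Count(e => predictedKeys.Contains(e));
double overall = (double)matched / expectedKeys.Count;

Console.WriteLine($"{matched} of {expectedKeys.Count} expected entries matched");
```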

## Visualize the outputs

To provide context for the execution of the code, the following code blocks visualize the outputs of the classification process.

This includes:

- The accuracy of the classification process, comparing the expected output with the embedding-based classification result.
- The execution time of the end-to-end process.
- The classification results for each page in the document.

### Understanding Similarity

Cosine similarity is a metric used to measure how similar two vectors are. Embeddings are numerical representations of text. By converting a document page and classification keywords to embeddings, we can compare the similarity between the two using this technique.

Similarity scores close to 1 indicate that the two vectors share similar characteristics, while scores closer to 0 or negative values indicate that the two vectors are dissimilar.
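
To make the score ranges concrete, the following sketch computes cosine similarity by hand for three pairs of small vectors. The sample itself uses `TensorPrimitives.CosineSimilarity`; the `CosineSimilarity` function and the vectors here are illustrative only and assume equal-length, non-zero inputs.

```csharp
using System;

// Manual cosine similarity: dot(a, b) / (|a| * |b|).
float CosineSimilarity(float[] a, float[] b)
{
    float dot = 0f, normA = 0f, normB = 0f;
    for (int i = 0; i < a.Length; i++)
    {
        dot += a[i] * b[i];
        normA += a[i] * a[i];
        normB += b[i] * b[i];
    }

    return dot / (MathF.Sqrt(normA) * MathF.Sqrt(normB));
}

float same = CosineSimilarity(new[] { 1f, 2f }, new[] { 2f, 4f });       // same direction, close to 1
float orthogonal = CosineSimilarity(new[] { 1f, 0f }, new[] { 0f, 1f }); // unrelated directions, 0
float opposite = CosineSimilarity(new[] { 1f, 0f }, new[] { -1f, 0f });  // opposite directions, -1

Console.WriteLine($"{same:0.00} {orthogonal:0.00} {opposite:0.00}");
```

Cosine similarity depends only on direction, not magnitude, which is why the first pair scores close to 1 even though the second vector is twice as long.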

In [13]:
// Gets the total execution time of the classification process.
var totalElapsed = diSw.Elapsed + ceSw.Elapsed + deSw.Elapsed + classifySw.Elapsed;

In [14]:
// Save the output of the data classification result
var classificationResult = new DataProcessingResult<ClassificationsModel>(
    documentClassifications,
    accuracy,
    null,
    null,
    null,
    totalElapsed
);

var classificationResultJson = JsonSerializer.Serialize(classificationResult, new JsonSerializerOptions { WriteIndented = true });
var classificationResultFPath = Path.Combine(samplePath, $"{sampleName}.{pdfFName}.json");

await File.WriteAllTextAsync(classificationResultFPath, classificationResultJson);

In [15]:
// Display the outputs of the classification process.
var output = new
{
    Accuracy = $"{float.Parse(accuracy["overall"].ToString()) * 100:0.00}%",
    ExecutionTime = $"{totalElapsed.TotalSeconds:0.00} seconds",
    DocumentIntelligenceExecutionTime = $"{diSw.Elapsed.TotalSeconds:0.00} seconds",
    ClassificationEmbeddingExecutionTime = $"{ceSw.Elapsed.TotalSeconds:0.00} seconds",
    DocumentEmbeddingExecutionTime = $"{deSw.Elapsed.TotalSeconds:0.00} seconds",
    ClassificationExecutionTime = $"{classifySw.Elapsed.TotalSeconds:0.00} seconds",
};

display(output);
display(documentClassifications);

| Metric | Value |
| --- | --- |
| Accuracy | 100.00% |
| ExecutionTime | 12.44 seconds |
| DocumentIntelligenceExecutionTime | 7.50 seconds |
| ClassificationEmbeddingExecutionTime | 1.97 seconds |
| DocumentEmbeddingExecutionTime | 2.97 seconds |
| ClassificationExecutionTime | 0.01 seconds |

| Classification | ImageRangeStart | ImageRangeEnd |
| --- | --- | --- |
| Insurance Policy | 1 | 5 |
| Insurance Certificate | 6 | 6 |
| Terms and Conditions | 7 | 13 |
