# Document Classification with Azure OpenAI's GPT-4o Vision Capabilities

This sample demonstrates how to classify a document using Azure OpenAI's GPT-4o model with vision capabilities.

![Data Classification](../../../images/classification-openai.png)

This is achieved by the following process:

- Define a list of classifications, with descriptions and keywords.
- Construct a system prompt that defines the instruction for classifying document pages.
- Construct a user prompt that includes the defined classifications, and each document page as a base64 encoded image.
- Use the Azure OpenAI chat completions API with the GPT-4o model to generate a classification for each document page as a structured output.

## Objectives

By the end of this sample, you will have learned how to:

- Convert a document into a set of base64 encoded images for processing by GPT-4o.
- Use prompt engineering techniques to instruct GPT-4o to classify a document's pages into predefined categories.

## Useful Tips

- Combine this technique with a [page extraction](../extraction/README.md) approach to ensure that you extract the most relevant data from a document's pages.

## Setup

### Import modules

This sample takes advantage of the following .NET dependencies:

- **pdf2image-dotnet** for converting a PDF file into a set of images per page.
- **Azure.AI.OpenAI** to interface with the Azure OpenAI chat completions API to generate structured classification outputs using the GPT-4o model.
- **Azure.Identity** to securely authenticate with deployed Azure Services using Microsoft Entra ID credentials.

The following local components are also used:

- [**Classification**](../modules/samples/models/Classification.csx) to define the classifications.
- [**OpenAIStructuredOutputsHelpers**](../modules/samples/helpers/OpenAIStructuredOutputsHelpers.csx) to generate valid OpenAI JSON schemas for structured outputs, and use them with the OpenAI API `ResponseFormat` parameter to return a structured output in the response.
- [**AccuracyEvaluator**](../modules/samples/evaluation/AccuracyEvaluator.csx) to evaluate the output of the classification process with expected results.
- [**OpenAIConfidence**](../modules/samples/confidence/OpenAIConfidence.csx) to calculate the confidence of the classification process from the `logprobs` in the OpenAI API response. It relies on the `Microsoft.ML.Tokenizers` and `Microsoft.ML.Tokenizers.Data.O200kBase` libraries; the latter provides the tokenizer data required for the GPT-4o model.
- [**DocumentProcessingResult**](../modules/samples/models/DocumentProcessingResult.csx) to store the results of the classification process as a file.
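- [**AppSettings**](../modules/samples/AppSettings.csx) to access environment variables from the `.env` file.
- [**StopwatchContext**](../modules/samples/helpers/StopwatchContext.csx) to measure the execution time of the image conversion and Azure OpenAI steps.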

In [1]:
#r "nuget: Azure.Identity, 1.13.2"
#r "nuget: Azure.AI.OpenAI, 2.1.0"
#r "nuget: DotNetEnv, 3.1.1"
#r "nuget: Microsoft.ML.Tokenizers, 1.0.2"
#r "nuget: Microsoft.ML.Tokenizers.Data.O200kBase, 1.0.2"
#r "nuget: pdf2image-dotnet, 1.0.0"

#!import ../modules/samples/AppSettings.csx
#!import ../modules/samples/helpers/OpenAIStructuredOutputsHelpers.csx
#!import ../modules/samples/helpers/StopwatchContext.csx
#!import ../modules/samples/models/Classification.csx
#!import ../modules/samples/models/DocumentProcessingResult.csx
#!import ../modules/samples/evaluation/AccuracyEvaluator.csx
#!import ../modules/samples/confidence/OpenAIConfidence.csx

using System;
using System.Collections.Generic;
using System.IO;
using System.Text;
using System.Text.Json;
using Azure.Core;
using Azure.Identity;
using Azure.AI.OpenAI;
using Azure.AI.OpenAI.Chat;
using OpenAI;
using OpenAI.Chat;
using DotNetEnv;
using Pdf2Image;

### Configure the Azure services

The Azure OpenAI SDK creates a client instance from a deployed endpoint and authentication credentials.

For this sample, `DefaultAzureCredential` is configured to exclude most other credential sources so that the Azure CLI credentials (from `az login`) are used to authenticate with the deployed services.

In [2]:
string workingDir = Path.GetFullPath("../../../");
AppSettings settings = new AppSettings(new Dictionary<string, string>(Env.Load(Path.Combine(workingDir, ".env"))));
string samplePath = Path.Combine(workingDir, "samples/dotnet/classification/");
string sampleName = "document-classification-gpt-vision";

DefaultAzureCredential credential = new DefaultAzureCredential(
    new DefaultAzureCredentialOptions { 
        ExcludeWorkloadIdentityCredential = true,
        ExcludeAzureDeveloperCliCredential = true,
        ExcludeEnvironmentCredential = true,
        ExcludeManagedIdentityCredential = true,
        ExcludeAzurePowerShellCredential = true,
        ExcludeSharedTokenCacheCredential = true,
        ExcludeInteractiveBrowserCredential = true
    }
);

AzureOpenAIClient openaiClient = new AzureOpenAIClient(
    new Uri(settings.OpenAIEndpoint),
    credential
);

### Establish the expected output

To evaluate the accuracy of the classification process, the expected output is defined in the following code block based on each page of a [Vehicle Insurance Policy](../../assets/vehicle_insurance/policy_1.pdf).

The expected output has been defined by a human evaluating the document.

> **Note**: The `Classification` and `ImageRangeStart` fields are passed to the accuracy evaluator as its match keys.

In [3]:
string path = Path.Combine(workingDir, "samples/assets/vehicle_insurance/");
string pdfFName = "policy_1.pdf";
string pdfFPath = Path.Combine(path, pdfFName);

ClassificationsModel expected = new ClassificationsModel()
{
    Classifications = new List<ClassificationModel>()
    {
        new ClassificationModel() { Classification = "Insurance Policy", ImageRangeStart = 1, ImageRangeEnd = 5 },
        new ClassificationModel() { Classification = "Insurance Certificate", ImageRangeStart = 6, ImageRangeEnd = 6 },
        new ClassificationModel() { Classification = "Terms and Conditions", ImageRangeStart = 7, ImageRangeEnd = 13 }
    }
};

AccuracyEvaluator<ClassificationsModel> classificationEvaluator = new AccuracyEvaluator<ClassificationsModel>(matchKeys: new List<string>() { "Classification", "ImageRangeStart" }, ignoreKeys: new List<string>() { });

## Define classifications

The following code block defines the classifications for a document. Each classification has a name, description, and keywords that will be used to classify the document's pages.

> **Note**: The classifications have been defined based on the expected content of a specific type of document, in this example, [a Vehicle Insurance Policy](../../assets/vehicle_insurance/policy_1.pdf).

In [4]:
List<ClassificationDefinitionModel> classifications = new List<ClassificationDefinitionModel>()
{
    new ClassificationDefinitionModel() { 
        Classification = "Insurance Policy", 
        Description = "Specific information related to an insurance policy, such as coverage, limits, premiums, and terms, often used for reference or clarification purposes.",
        Keywords = new List<string>() {
            "welcome letter",
            "personal details",
            "vehicle details",
            "insured driver details",
            "policy details",
            "incident/conviction history",
            "schedule of insurance",
            "vehicle damage excesses"
        }
    },
    new ClassificationDefinitionModel() { 
        Classification = "Insurance Certificate", 
        Description = "A document that serves as proof of insurance coverage, often required for legal, regulatory, or contractual purposes.",
        Keywords = new List<string>() {
            "certificate of vehicle insurance",
            "effective date of insurance",
            "entitlement to drive",
            "limitations of use"
        }
    },
    new ClassificationDefinitionModel() { 
        Classification = "Terms and Conditions", 
        Description = "The rules, requirements, or obligations that govern an agreement or contract, often related to insurance policies, financial products, or legal documents.",
        Keywords = new List<string>() {
            "terms and conditions",
            "legal statements",
            "payment instructions",
            "legal obligations",
            "covered for",
            "claim settlement",
            "costs to pay",
            "legal responsibility",
            "personal accident coverage",
            "medical expense coverage",
            "personal liability coverage",
            "windscreen damage coverage",
            "uninsured motorist protection",
            "renewal instructions",
            "cancellation instructions"
        }
    }
};

## Classify the document pages

The following code blocks run the classification process using Azure OpenAI's GPT-4o model with vision capabilities.

It performs the following steps:

1. Get the document bytes from the provided file path. _Note: This example processes a local document; however, you can use any document storage location of your choice, such as Azure Blob Storage._
2. Use `pdf2image-dotnet` to convert the document into one image per page. The page bytes are handed to the SDK, which sends them to the model as base64 encoded images.
3. Use Azure OpenAI's GPT-4o model and the classification definitions to provide a classification for each page of the document.

In [5]:
StringBuilder systemPromptBuilder = new StringBuilder();
systemPromptBuilder.AppendLine("You are an AI assistant that helps detect the boundaries of sub-section or sub-documents using the provided classifications.");
systemPromptBuilder.AppendLine("## Classifications");
systemPromptBuilder.AppendLine(JsonSerializer.Serialize(classifications));

string systemPrompt = systemPromptBuilder.ToString();

In [6]:
List<ChatMessageContentPart> userContent = new List<ChatMessageContentPart>();

In [7]:
StringBuilder userTextPromptBuilder = new StringBuilder();
userTextPromptBuilder.AppendLine("Classify insurance documents in the provided page images.");
userTextPromptBuilder.AppendLine("- A single classification may span multiple page images.");
userTextPromptBuilder.AppendLine("- A single page image may contain multiple classifications.");
userTextPromptBuilder.AppendLine("- If a page image does not contain a classification, ignore it.");

string userTextPrompt = userTextPromptBuilder.ToString();

userContent.Add(ChatMessageContentPart.CreateTextPart(userTextPrompt));

In [8]:
StopwatchContext imageSw;

using (imageSw = new StopwatchContext())
{
    var pdfBytes = File.ReadAllBytes(pdfFPath);
    var pages = await Pdf2ImageConverter.FromBytesAsync(pdfBytes);

    for (int i = 0; i < pages.Count; i++)
    {
        var pageData = BinaryData.FromBytes(pages[i]);
        userContent.Add(ChatMessageContentPart.CreateTextPart($"Page {i + 1}"));
        userContent.Add(ChatMessageContentPart.CreateImagePart(pageData, "image/png"));
    }
}
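
`CreateImagePart` accepts the raw page bytes directly, so no manual encoding is needed in the cell above. Where the base64 form is needed explicitly (for example, to build a data URI for a raw REST request), the conversion can be sketched as follows. `ToDataUri` is a hypothetical helper, not part of this sample's modules, and the four-byte input merely stands in for a real page image.

```csharp
using System;

// Hypothetical helper: wraps raw image bytes in a base64 data URI.
string ToDataUri(byte[] imageBytes, string mediaType = "image/png")
{
    return $"data:{mediaType};base64,{Convert.ToBase64String(imageBytes)}";
}

// The first four bytes of the PNG file signature stand in for a real page image.
byte[] samplePage = { 0x89, 0x50, 0x4E, 0x47 };

Console.WriteLine(ToDataUri(samplePage)); // prints "data:image/png;base64,iVBORw=="
```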

In [9]:
ParsedChatCompletion<ClassificationsModel> completion;

StopwatchContext oaiSw;

using (oaiSw = new StopwatchContext())
{
    completion = await openaiClient
        .GetChatClient(settings.GPT4OModelDeploymentName)
        .CompleteChatAsync(
            [
                new SystemChatMessage(systemPrompt),
                new UserChatMessage(userContent)
            ],
            new ChatCompletionOptions
            {   
                ResponseFormat = CreateJsonSchemaFormat<ClassificationsModel>("classificationsModel", jsonSchemaIsStrict: true),
                MaxOutputTokenCount = 4096,
                Temperature = 0.1f,
                TopP = 0.1f,
                IncludeLogProbabilities = true
            }
        );
}

## Calculate the accuracy

The following code block calculates the accuracy of the classification process by comparing the expected classifications with the classifications predicted by the model.

In [10]:
var documentClassifications = completion.Parsed;

var accuracy = classificationEvaluator.Evaluate(expected, documentClassifications);
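
The internals of `AccuracyEvaluator` are not shown in this sample. As a rough illustration of the idea only, the sketch below pairs expected and predicted entries on the same two match keys (`Classification` and the start page) and counts an entry as correct when its end page also agrees. `ExactMatchAccuracy` and the tuple shapes are assumptions made for this example, and the predicted list is invented to show a partial match.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical simplification of an accuracy check: the share of expected entries
// for which a prediction with the same match keys also has the same end page.
double ExactMatchAccuracy(
    IReadOnlyList<(string Classification, int Start, int End)> expectedItems,
    IReadOnlyList<(string Classification, int Start, int End)> predictedItems)
{
    if (expectedItems.Count == 0) return 1.0;

    // Index predictions by the match keys (Classification, Start).
    var byKey = new Dictionary<(string, int), (string Classification, int Start, int End)>();
    foreach (var p in predictedItems) byKey[(p.Classification, p.Start)] = p;

    int correct = expectedItems.Count(e =>
        byKey.TryGetValue((e.Classification, e.Start), out var match) && match.End == e.End);

    return (double)correct / expectedItems.Count;
}

var expectedRanges = new List<(string Classification, int Start, int End)>
{
    ("Insurance Policy", 1, 5),
    ("Insurance Certificate", 6, 6),
    ("Terms and Conditions", 7, 13)
};

// Illustrative prediction with one wrong end page (not real model output).
var predictedRanges = new List<(string Classification, int Start, int End)>
{
    ("Insurance Policy", 1, 4),
    ("Insurance Certificate", 6, 6),
    ("Terms and Conditions", 7, 13)
};

// Two of the three expected entries match exactly.
Console.WriteLine(ExactMatchAccuracy(expectedRanges, predictedRanges));
```

The sample's evaluator may score each field separately rather than each entry as a whole, so treat this as a mental model rather than a description of its behavior.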

## Visualize the outputs

To provide context for the execution of the code, the following code blocks visualize the outputs of the classification process.

This includes:

- The accuracy of the classification process comparing the expected output with the output generated by Azure OpenAI's GPT-4o model.
- The confidence score of the classification process based on the log probability of the predicted classification.
- The execution time of the end-to-end process.
- The total number of tokens consumed by the GPT-4o model.
- The classification results for each page of the document.

### Understanding Accuracy vs Confidence

When using AI to classify data, both confidence and accuracy are essential for different but complementary reasons.

- **Accuracy** measures how close the AI model's output is to a ground truth or expected output. It reflects how well the model's predictions align with reality.
  - Accuracy ensures consistency in the classification process, which is crucial for downstream tasks using the data.
- **Confidence** represents the AI model's internal assessment of how certain it is about its predictions.
  - Confidence indicates how certain the model is about a prediction; a low confidence score is a useful signal for human reviewers to step in for manual verification.

High accuracy and high confidence are ideal, but in practice the two do not always go together: a model can be confident and wrong, or correct with low confidence. Because accuracy can only be measured when an expected output is available, confidence scores can and should be used to prioritize manual verification of low-confidence predictions.
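
As a small illustration of using confidence to prioritize review, the sketch below flags fields whose confidence falls under a threshold. The confidence values are copied from the output recorded at the end of this sample; the `reviewThreshold` of 0.9 and the field path strings are assumptions chosen for the example.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Field-level confidence values copied from this sample's recorded output.
var fieldConfidences = new List<(string Field, double Confidence)>
{
    ("Classifications[0].Classification", 0.9991068948863359),
    ("Classifications[0].ImageRangeStart", 0.9998931248191417),
    ("Classifications[0].ImageRangeEnd", 0.6936893779323026),
    ("Classifications[1].Classification", 0.998090270354434),
    ("Classifications[1].ImageRangeStart", 0.9999534372472745),
    ("Classifications[1].ImageRangeEnd", 0.9899174946204887),
    ("Classifications[2].Classification", 0.9999549495338071),
    ("Classifications[2].ImageRangeStart", 0.999909215726114),
    ("Classifications[2].ImageRangeEnd", 0.9988262991267038)
};

// Hypothetical review threshold; tune it to the risk tolerance of the workload.
const double reviewThreshold = 0.9;

var needsReview = fieldConfidences
    .Where(f => f.Confidence < reviewThreshold)
    .Select(f => f.Field)
    .ToList();

foreach (var field in needsReview)
{
    Console.WriteLine($"Review: {field}"); // prints "Review: Classifications[0].ImageRangeEnd"
}
```

Only the end page of the first classification is flagged, even though the run scored 100% accuracy. The mean of these nine values is about 0.9644, which is consistent with the 96.44% overall confidence reported in the output, although the helper's exact aggregation is not shown here.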

In [11]:
var confidence = OpenAIConfidence<ClassificationsModel>.EvaluateConfidence(documentClassifications, completion.Origin);
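
The `OpenAIConfidence` helper is not reproduced here. The underlying idea can be sketched under one assumption: each token's log probability (a natural logarithm) is converted to a linear probability with `Math.Exp`, and a value's confidence is taken as the mean over the tokens that produced it. The function names, the averaging rule, and the sample log probabilities below are all illustrative; the helper may aggregate tokens differently.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Converts a natural-log probability into a linear probability in (0, 1].
double ToProbability(double logProbability) => Math.Exp(logProbability);

// Assumption: a value's confidence is the mean probability of the tokens that produced it.
double ValueConfidence(IEnumerable<double> logProbabilities) =>
    logProbabilities.Select(lp => ToProbability(lp)).Average();

// Illustrative log probabilities (not taken from a real response).
double[] sampleLogProbs = { -0.001, -0.05, -0.37 };

Console.WriteLine(ValueConfidence(sampleLogProbs));
```

A log probability of 0 corresponds to a probability of 1, and more negative values correspond to lower confidence, so a single uncertain token pulls down the score of the value it belongs to.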

In [12]:
// Gets the total execution time of the classification process.
var totalElapsed = imageSw.Elapsed + oaiSw.Elapsed;

// Gets the prompt tokens and completion tokens from the completion response.
var promptTokens = completion.Usage.InputTokenCount;
var completionTokens = completion.Usage.OutputTokenCount;

In [13]:
// Save the output of the data classification result
var classificationResult = new DataProcessingResult<ClassificationsModel>(
    documentClassifications,
    accuracy,
    confidence,
    promptTokens,
    completionTokens,
    totalElapsed
);

var classificationResultJson = JsonSerializer.Serialize(classificationResult, new JsonSerializerOptions { WriteIndented = true });
var classificationResultFPath = Path.Combine(samplePath, $"{sampleName}.{pdfFName}.json");

await File.WriteAllTextAsync(classificationResultFPath, classificationResultJson);

In [14]:
// Display the outputs of the classification process.
var output = new
{
    Accuracy = $"{float.Parse(accuracy["overall"].ToString()) * 100:0.00}%",
    Confidence = $"{float.Parse(confidence["_overall"].ToString()) * 100:0.00}%",
    ExecutionTime = $"{totalElapsed.TotalSeconds:0.00} seconds",
    ImagePreprocessingTime = $"{imageSw.Elapsed.TotalSeconds:0.00} seconds",
    OpenAIExecutionTime = $"{oaiSw.Elapsed.TotalSeconds:0.00} seconds",
    PromptTokens = promptTokens,
    CompletionTokens = completionTokens,
};

display(output);
display(confidence);

| Output | Value |
| --- | --- |
| Accuracy | 100.00% |
| Confidence | 96.44% |
| ExecutionTime | 22.03 seconds |
| ImagePreprocessingTime | 8.27 seconds |
| OpenAIExecutionTime | 13.76 seconds |
| PromptTokens | 10496 |
| CompletionTokens | 58 |


Confidence per field (output of `display(confidence)`):

| Index | Field | Value | Confidence |
| --- | --- | --- | --- |
| 0 | Classification | Insurance Policy | 0.9991068948863359 |
| 0 | ImageRangeStart | 1 | 0.9998931248191417 |
| 0 | ImageRangeEnd | 5 | 0.6936893779323026 |
| 1 | Classification | Insurance Certificate | 0.998090270354434 |
| 1 | ImageRangeStart | 6 | 0.9999534372472745 |
| 1 | ImageRangeEnd | 6 | 0.9899174946204887 |
| 2 | Classification | Terms and Conditions | 0.9999549495338071 |
| 2 | ImageRangeStart | 7 | 0.999909215726114 |
| 2 | ImageRangeEnd | 13 | 0.9988262991267038 |