# Using Azure OpenAI GPT-4 Vision to extract structured JSON data from PDF documents

This notebook demonstrates [how to use GPT-4 Vision](https://learn.microsoft.com/en-us/azure/ai-services/openai/how-to/gpt-with-vision?tabs=rest) to extract structured JSON data from PDF documents, such as invoices, using the [Azure OpenAI Service](https://learn.microsoft.com/en-us/azure/ai-services/openai/overview).

## Pre-requisites

The notebook uses [PowerShell](https://learn.microsoft.com/powershell/scripting/install/installing-powershell) and [Azure CLI](https://learn.microsoft.com/cli/azure/install-azure-cli) to deploy all necessary Azure resources. Both tools are available on Windows, macOS and Linux environments. It also uses [.NET 8](https://dotnet.microsoft.com/download/dotnet/8.0) to run the C# code that interacts with the Azure OpenAI Service.

Running this notebook will deploy the following resources in your Azure subscription:
- Azure Resource Group
- Azure OpenAI Service (Sweden Central)
- GPT-4 Vision model deployment (50K capacity)

**Note**: The GPT-4 with Vision model is not available in all Azure OpenAI regions. For more information, see the [Azure OpenAI Service documentation](https://learn.microsoft.com/en-us/azure/ai-services/openai/concepts/models#standard-deployment-model-availability).

## Deploy infrastructure with Az CLI & Bicep

The following cell prompts you to log in to Azure if you are not already logged in. Once you are logged in, your current default subscription is used for the deployment.

> **Note:** If you have multiple subscriptions, you can change the default subscription by running `az account set --subscription <subscription_id>`.

The Azure resources listed above are then deployed using [Azure Bicep](https://learn.microsoft.com/en-us/azure/azure-resource-manager/bicep/).

The deployment runs at the subscription level and creates a new resource group. The location is set to **Sweden Central**. You can change it to another region that supports the GPT-4 Turbo with Vision model, along with other parameters, in the [`./infra/main.bicepparam`](./infra/main.bicepparam) file.

Once deployed, the Azure OpenAI Service endpoint and key will be stored in the [`./config.env`](./config.env) file for use in the .NET code.

### Understanding the deployment

#### Managed Identity

A [user-assigned Managed Identity](https://learn.microsoft.com/en-us/entra/identity/managed-identities-azure-resources/overview) is created so that authentication with Azure OpenAI relies on role-based access control (RBAC) permissions instead of API keys.

Read more about [how to configure Azure OpenAI Service with managed identities](https://learn.microsoft.com/en-us/azure/ai-services/openai/how-to/managed-identity).

#### OpenAI Services

An [Azure OpenAI Service](https://learn.microsoft.com/en-us/azure/ai-services/openai/overview) instance is deployed in the Sweden Central region. This is deployed with the `gpt-4 (turbo-2024-04-09)` model to be used for inference.

In [None]:
$loggedIn = az account show --query "name" -o tsv

if ($null -ne $loggedIn) {
    Write-Host "Already logged in as $loggedIn"
} else {
    Write-Host "Logging in..."
    az login
}

# Retrieve the default subscription ID
$subscriptionId = (
    (
        az account list -o json `
            --query "[?isDefault]"
    ) | ConvertFrom-Json
).id

# Set the subscription
az account set --subscription $subscriptionId
Write-Host "Subscription set to $subscriptionId"

$deploymentName = 'gpt-4-document-extraction'
$location = 'swedencentral'

$deploymentOutputs = (.\Setup-Environment.ps1 -DeploymentName $deploymentName -Location $location -SkipInfrastructure $true)

## Install .NET dependencies

This notebook uses .NET to interact with the Azure OpenAI Service. It takes advantage of the following NuGet packages:

### PDFtoImage

The [PDFtoImage](https://github.com/sungaila/PDFtoImage) library is used to render PDF pages as images. It provides a simple layer over PDF rendering through the static `PDFtoImage.Conversion` class. Given the bytes of a PDF, the library returns the rendered pages, which this notebook then encodes as JPEG files under a given file name.

### DotNetEnv

The [DotNetEnv](https://github.com/tonerdo/dotnet-env) library is used to load environment variables from a `.env` file which can be accessed via the `Environment.GetEnvironmentVariable(string)` method. This library is used to load the Azure OpenAI Service endpoint, key and model deployment name from the [`./config.env`](./config.env) file.
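
To make the idea concrete, here is a minimal sketch of what loading a `.env` file involves, using only the .NET base class library. It is a simplified stand-in and not DotNetEnv's actual implementation. The `ParseEnv` helper, the sample endpoint, and the sample deployment name are hypothetical and exist only for illustration.

```csharp
using System;
using System.Collections.Generic;

// Illustrative only: a simplified KEY=VALUE parser, not DotNetEnv's implementation.
static Dictionary<string, string> ParseEnv(string text)
{
    var values = new Dictionary<string, string>();
    foreach (var rawLine in text.Split('\n'))
    {
        var line = rawLine.Trim();

        // Skip blank lines and comments
        if (line.Length == 0 || line.StartsWith("#", StringComparison.Ordinal)) continue;

        var separator = line.IndexOf('=');
        if (separator <= 0) continue; // not a KEY=VALUE line

        var key = line.Substring(0, separator).Trim();
        var value = line.Substring(separator + 1).Trim().Trim('"');
        values[key] = value;
    }
    return values;
}

// Hypothetical file content; the endpoint and deployment name are placeholders.
var sample =
    "# Azure OpenAI settings\n" +
    "AZURE_OPENAI_ENDPOINT=\"https://example.openai.azure.com/\"\n" +
    "AZURE_OPENAI_COMPLETION_MODEL_DEPLOYMENT_NAME=gpt-4-vision\n";

// Copy each parsed entry into the process environment
foreach (var (name, setting) in ParseEnv(sample))
{
    Environment.SetEnvironmentVariable(name, setting);
}

Console.WriteLine(Environment.GetEnvironmentVariable("AZURE_OPENAI_ENDPOINT"));
```

In the notebook itself, `Env.Load("config.env")` performs this step, so a hand-written parser like this is not needed.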

In [None]:
#r "nuget:System.Text.Json, 8.0.1"
#r "nuget:Azure.AI.OpenAI, 1.0.0-beta.17"
#r "nuget:Azure.Identity, 1.11.3"
#r "nuget:DotNetEnv, 3.0.0"
#r "nuget:PDFtoImage, 4.0.1"

In [None]:
using System.IO;
using System.Linq;
using System.Net;
using System.Net.Http;
using System.Text;
using System.Text.Json;
using System.Text.Json.Nodes;

using Azure;
using Azure.AI.OpenAI;
using Azure.Core;
using Azure.Identity;
using DotNetEnv;
using PDFtoImage;
using SkiaSharp;

In [None]:
Env.Load("config.env");

var endpoint = Environment.GetEnvironmentVariable("AZURE_OPENAI_ENDPOINT");
var modelDeployment = Environment.GetEnvironmentVariable("AZURE_OPENAI_COMPLETION_MODEL_DEPLOYMENT_NAME");
var apiVersion = "2024-03-01-preview";

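// Trim any trailing slash so the URL is well-formed whether or not the configured endpoint ends with one
string visionEndpoint = $"{endpoint?.TrimEnd('/')}/openai/deployments/{modelDeployment}/chat/completions?api-version={apiVersion}";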

var credential = new DefaultAzureCredential(new DefaultAzureCredentialOptions { 
    ExcludeEnvironmentCredential = true,
    ExcludeManagedIdentityCredential = true,
    ExcludeSharedTokenCacheCredential = true,
    ExcludeInteractiveBrowserCredential = true,
    ExcludeAzurePowerShellCredential = true,
    ExcludeVisualStudioCodeCredential = false,
    ExcludeAzureCliCredential = false
});

var bearerToken = credential.GetToken(new TokenRequestContext(new[] { "https://cognitiveservices.azure.com/.default" })).Token;

var pdfName = "Invoice_1.pdf";
var pdfJsonExtractionName = $"{pdfName}.Extraction.json";

## Convert PDF to image

For the GPT-4 with Vision model to extract structured JSON data from a PDF document, the document must first be converted to an image. The following code demonstrates how to convert a PDF document to a JPEG image using the `PDFtoImage` library.

### Important notes for image analysis with the GPT-4 with Vision model

- The maximum size for images is restricted to 20MB.
- The `image_url` parameter in the message body has a `detail` property that can be set to `low` to enable a lower resolution image analysis for faster results with fewer tokens. However, this could impact the accuracy of the result.
- When providing images, there is a limit of 10 images per call.
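
As a minimal sketch of the first two notes, the following builds a single `image_url` content part with the `detail` property set, and rejects images over the size limit. The `BuildImagePart` helper is hypothetical, and treating the 20MB limit as applying to the raw image bytes (rather than the base64-encoded payload) is an assumption.

```csharp
using System;
using System.Text.Json.Nodes;

// Hypothetical helper: builds one "image_url" content part for the chat payload.
static JsonObject BuildImagePart(byte[] imageBytes, string detail = "auto")
{
    // Assumption: the 20MB limit is checked against the raw image size
    const int maxImageBytes = 20 * 1024 * 1024;
    if (imageBytes.Length > maxImageBytes)
    {
        throw new ArgumentException("Image exceeds the 20MB limit; compress or downscale it first.");
    }

    return new JsonObject
    {
        ["type"] = "image_url",
        ["image_url"] = new JsonObject
        {
            ["url"] = $"data:image/jpeg;base64,{Convert.ToBase64String(imageBytes)}",
            ["detail"] = detail // "low" trades accuracy for speed and fewer tokens
        }
    };
}

// Three placeholder bytes stand in for real JPEG data
var part = BuildImagePart(new byte[] { 1, 2, 3 }, detail: "low");
Console.WriteLine(part.ToJsonString());
```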

Based on these notes, you may need to perform pre-processing of your PDF when converting it to images to ensure that the images are within the size limits and that the resolution is appropriate for the analysis. This may include:

- Reducing the resolution of the images.
- Splitting the PDF into one image per page, if it contains 10 pages or fewer.
- Stitching multiple pages together into each image, if the PDF contains more than 10 pages.
- Compressing the images to reduce the file size.

Experiment with different pre-processing techniques to find the best approach for your specific use case.
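
The splitting and stitching decision above comes down to simple arithmetic, sketched below without any imaging library. The `PlanPageGroups` helper is hypothetical: it only works out which consecutive pages would share an image so that the total never exceeds the limit.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper: splits pageCount pages into at most maxImages consecutive groups.
static List<(int Start, int Count)> PlanPageGroups(int pageCount, int maxImages = 10)
{
    var groups = new List<(int Start, int Count)>();
    if (pageCount <= 0) return groups;

    // Pages per stitched image, rounded up so that no more than maxImages groups are produced
    int pagesPerImage = (int)Math.Ceiling(pageCount / (double)maxImages);
    for (int start = 0; start < pageCount; start += pagesPerImage)
    {
        groups.Add((start, Math.Min(pagesPerImage, pageCount - start)));
    }
    return groups;
}

// A 25-page document is planned as 9 images: 8 with three pages and 1 with a single page
foreach (var (firstPage, pagesInImage) in PlanPageGroups(25))
{
    Console.WriteLine($"Image covers pages {firstPage + 1} to {firstPage + pagesInImage}");
}
```

With 10 pages or fewer, every group contains exactly one page, which matches the "one image per page" case.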

The following code uses .NET to convert a PDF document into at most 10 JPEG images. When the document has more than 10 pages, consecutive pages are stitched together vertically into a single image.

In [None]:
var pdf = await File.ReadAllBytesAsync(pdfName);
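// Render the pages once. ToImages returns an IEnumerable, so enumerating it repeatedly could re-render the PDF
var pageImages = PDFtoImage.Conversion.ToImages(pdf).ToList();

var totalPageCount = pageImages.Count;

// The API accepts at most 10 images per call, so for longer documents several pages are stitched into each image.
// maxSize is the number of pages per stitched image (1 when the PDF has 10 pages or fewer).
int maxSize = (int)Math.Ceiling(totalPageCount / 10.0);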
var pageImageGroups = new List<List<SKBitmap>>();
for (int i = 0; i < totalPageCount; i += maxSize)
{
    var pageImageGroup = pageImages.Skip(i).Take(maxSize).ToList();
    pageImageGroups.Add(pageImageGroup);
}

var pdfImageFiles = new List<string>();

var count = 0;
foreach (var pageImageGroup in pageImageGroups)
{
    var pdfImageName = $"{pdfName}.Part_{count}.jpg";

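    int totalHeight = pageImageGroup.Sum(image => image.Height);
    int width = pageImageGroup.Max(image => image.Width);
    using var stitchedImage = new SKBitmap(width, totalHeight);
    using (var canvas = new SKCanvas(stitchedImage))
    {
        // Fill with white first: JPEG has no alpha channel, so areas not covered by a narrower page would otherwise render black
        canvas.Clear(SKColors.White);

        // Draw each page below the previous one
        int currentHeight = 0;
        foreach (var pageImage in pageImageGroup)
        {
            canvas.DrawBitmap(pageImage, 0, currentHeight);
            currentHeight += pageImage.Height;
        }
    }
    using (var stitchedFileStream = new FileStream(pdfImageName, FileMode.Create, FileAccess.Write))
    {
        // Quality 100 keeps text sharp but produces larger files; lower it if an image approaches the 20MB limit
        stitchedImage.Encode(stitchedFileStream, SKEncodedImageFormat.Jpeg, 100);
    }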
    pdfImageFiles.Add(pdfImageName);
    count++;

    Console.WriteLine($"Saved image to {pdfImageName}");
}

## Use GPT-4 with Vision to extract the data from the image

Now that the PDF document has been converted to an image, the GPT-4 with Vision model can be used to extract structured JSON data from the image. The following code demonstrates how to use the deployed Azure OpenAI Service directly via the API to extract structured JSON data from the image.

In this example, the payload for the Chat completion endpoint is a JSON object with the following details:

### System Prompt

The system prompt is the instruction that prescribes the model's behavior. It lets you constrain the model to a specific task, such as extracting structured JSON data from documents.

In this case, the instruction is to extract structured JSON data from the image. Here is what we have provided:

**You are an AI assistant that extracts data from documents and returns them as structured JSON objects. Do not return as a code block.**

> **Note:** To avoid the response being returned as a code block, we have included the instruction to not return as a code block. 

Learn more about [system prompts](https://learn.microsoft.com/en-us/azure/ai-services/openai/concepts/system-message).
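
A prompt instruction reduces the chance of a fenced response but does not guarantee it, so a defensive fallback can be useful. The sketch below is one possible approach and not part of the original notebook: the `StripCodeFence` helper is hypothetical and removes a surrounding Markdown code fence, if one is present, before the content is parsed as JSON.

```csharp
using System;

// Hypothetical fallback: unwraps content that arrives inside a Markdown code fence.
static string StripCodeFence(string content)
{
    var fence = new string('`', 3); // three backticks
    var trimmed = content.Trim();
    if (!trimmed.StartsWith(fence, StringComparison.Ordinal)) return trimmed;

    // Drop the opening fence line, which may carry a language tag such as "json"
    int firstNewline = trimmed.IndexOf('\n');
    if (firstNewline < 0) return trimmed;
    var body = trimmed.Substring(firstNewline + 1);

    // Drop the closing fence, if present
    int closingFence = body.LastIndexOf(fence, StringComparison.Ordinal);
    if (closingFence >= 0) body = body.Substring(0, closingFence);

    return body.Trim();
}

var marker = new string('`', 3);
var wrapped = marker + "json\n{\"InvoiceNumber\":\"3847193\"}\n" + marker;

// Prints the JSON object without the surrounding fence
Console.WriteLine(StripCodeFence(wrapped));
```

Content that is not fenced passes through unchanged apart from trimming, so the helper is safe to apply to every response.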

### User Prompt

The user prompt is the input that the model responds to. It supplies the content and context the model needs to generate its response.

In this case, it is the image of the document plus some additional text context to help the model understand the task.

In order to extract structured JSON data, we need to provide the expected structure of the response. We can do this by creating our data transfer object (DTO) and providing a serialized, empty version of it in the user prompt.

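Here is a representative version of the prompt. The code cell further below builds it by serializing `InvoiceData.Empty` and adds a hint about how return reasons are laid out, so the exact text differs slightly: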

**Extract the data from this invoice. If a value is not present, provide null. Use the following structure: {"InvoiceNumber":"","PurchaseOrderNumber":"","CustomerName":"","CustomerAddress":"","DeliveryDate":"","PayableBy":"","Products":[{"Id":"","Description":"","UnitPrice":0,"Quantity":0,"Total":0}],"TotalQuantity":0,"TotalPrice":0,"Returns":[{"Id":"","Description":"","Quantity":0,"Reason":""}],"ProductsSignatures":[{"Type":"","Name":"","IsSigned":false}],"ReturnsSignatures":[{"Type":"","Name":"","IsSigned":false}]}**

> **Note:** For the user prompt, it is ideal to provide a structure for the JSON response. Without one, the model will determine this for you and you may not get consistency across responses. 
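
As a minimal sketch of generating the structure rather than writing it by hand, the following serializes an empty template and appends it to the instruction. An anonymous type with a small, illustrative subset of the invoice fields stands in for a full DTO here. That subset is an assumption made for brevity.

```csharp
using System;
using System.Text.Json;

// Illustrative subset of the invoice fields; an anonymous type stands in for a full DTO.
var template = new
{
    InvoiceNumber = "",
    CustomerName = "",
    Products = new[] { new { Id = "", Description = "", Quantity = 0 } },
    TotalPrice = 0
};

// Serialize the empty template so the model sees the expected property names and value types
var structure = JsonSerializer.Serialize(template);
var prompt = $"Extract the data from this invoice. If a value is not present, provide null. Use the following structure: {structure}";

Console.WriteLine(prompt);
```

Generating the structure from a type keeps the prompt and the deserialization target in sync when fields are added or renamed.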

Combining the instruction with the expected structure helps the model understand the task and return consistently shaped output. This approach would result in a response similar to the following:

```json
{
  "InvoiceNumber": "3847193",
  "PurchaseOrderNumber": "15931",
  "CustomerName": "Sharp Consulting",
  "CustomerAddress": "73 Regal Way, Leeds, LS1 5AB, UK",
  "DeliveryDate": "2024-05-16T00:00:00",
  "PayableBy": "2024-05-24",
  "Products": [
    {
      "Id": "PPR006",
      "Description": "UNIT A4 80gsm",
      "UnitPrice": 25.86,
      "Quantity": 5,
      "Total": 129.30,
      "Reason": null
    },
    {
      "Id": "3M12",
      "Description": "NOTES 51x76mm Y",
      "UnitPrice": 3.00,
      "Quantity": 12,
      "Total": 36.00,
      "Reason": null
    }
  ],
  "Returns": [
    {
      "Id": "MA145",
      "Description": "POSTAL TUBE BROWN",
      "UnitPrice": null,
      "Quantity": 1,
      "Total": null,
      "Reason": "Previous order has sufficient stock, no replacement required."
    },
    {
      "Id": "JF7902",
      "Description": "MAILBOX 25PK",
      "UnitPrice": null,
      "Quantity": 1,
      "Total": null,
      "Reason": "Not required."
    }
  ],
  "TotalQuantity": 66,
  "TotalPrice": 1075.70,
  "ProductsSignatures": [
    {
      "Type": "Customer",
      "Name": "Sarah H.",
      "IsSigned": true
    },
    {
      "Type": "Driver",
      "Name": "James T",
      "IsSigned": true
    }
  ],
  "ReturnsSignatures": [
    {
      "Type": "Customer",
      "Name": "Sarah H.",
      "IsSigned": true
    },
    {
      "Type": "Driver",
      "Name": "James T",
      "IsSigned": true
    }
  ]
}
```
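
Because the model's output is not guaranteed to follow the requested structure, it can help to sanity-check a response before deserializing it. The sketch below is one possible check, not part of the original notebook: the `FindMissingProperties` helper is hypothetical and reports which expected top-level properties are absent. The sample response is shortened for illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.Json;

// Hypothetical check: lists expected top-level properties that are absent from the response.
// JsonDocument.Parse throws a JsonException if the text is not valid JSON.
static List<string> FindMissingProperties(string json, IEnumerable<string> expected)
{
    using var document = JsonDocument.Parse(json);
    var root = document.RootElement;

    // Anything other than a JSON object cannot contain the expected properties
    if (root.ValueKind != JsonValueKind.Object)
    {
        return expected.ToList();
    }

    return expected.Where(name => !root.TryGetProperty(name, out _)).ToList();
}

var expectedProperties = new[] { "InvoiceNumber", "Products", "Returns", "TotalPrice" };

// Shortened sample response in which "Returns" is missing
var sampleResponse = "{\"InvoiceNumber\":\"3847193\",\"Products\":[],\"TotalPrice\":1075.70}";

var missing = FindMissingProperties(sampleResponse, expectedProperties);
var summary = string.Join(", ", missing);
Console.WriteLine(missing.Count == 0 ? "All expected properties are present." : "Missing: " + summary);
```

A property that is present with a `null` value counts as present here, which matches the prompt's instruction to return null for values that are not on the document.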

In [None]:
public class InvoiceData
{
    public string? InvoiceNumber { get; set; }

    public string? PurchaseOrderNumber { get; set; }

    public string? CustomerName { get; set; }

    public string? CustomerAddress { get; set; }

    public DateTime? DeliveryDate { get; set; }

    public DateTime? PayableBy { get; set; }

    public IEnumerable<InvoiceDataProduct>? Products { get; set; }

    public IEnumerable<InvoiceDataProduct>? Returns { get; set; }

    public double? TotalQuantity { get; set; }

    public double? TotalPrice { get; set; }

    public IEnumerable<InvoiceDataSignature>? ProductsSignatures { get; set; }

    public IEnumerable<InvoiceDataSignature>? ReturnsSignatures { get; set; }

    public static InvoiceData Empty => new()
    {
        InvoiceNumber = string.Empty,
        PurchaseOrderNumber = string.Empty,
        CustomerName = string.Empty,
        CustomerAddress = string.Empty,
        DeliveryDate = DateTime.MinValue,
        PayableBy = DateTime.MinValue,
        Products =
            new List<InvoiceDataProduct> { new() { Id = string.Empty, Description = string.Empty, UnitPrice = 0.0, Quantity = 0.0, Total = 0.0 } },
        Returns =
            new List<InvoiceDataProduct> { new() { Id = string.Empty, Description = string.Empty, Quantity = 0.0, Reason = string.Empty } },
        TotalQuantity = 0,
        TotalPrice = 0,
        ProductsSignatures = new List<InvoiceDataSignature>
        {
            new()
            {
                Type = string.Empty,
                Name = string.Empty,
                IsSigned = false
            }
        },
        ReturnsSignatures = new List<InvoiceDataSignature>
        {
            new()
            {
                Type = string.Empty,
                Name = string.Empty,
                IsSigned = false
            }
        }
    };

    public class InvoiceDataProduct
    {
        public string? Id { get; set; }

        public string? Description { get; set; }

        public double? UnitPrice { get; set; }

        // Nullable, because the prompt asks the model to return null for values that are not present
        public double? Quantity { get; set; }

        public double? Total { get; set; }

        public string? Reason { get; set; }
    }

    public class InvoiceDataSignature
    {
        public string? Type { get; set; }

        public string? Name { get; set; }

        public bool? IsSigned { get; set; }
    }
}


In [None]:
var userPromptParts = new List<JsonNode>{
    new JsonObject
    {
        { "type", "text" },
        { "text", $"Extract the data from this invoice. If a value is not present, provide null. Reasons may overlap multiple lines, arrows indicate which reason relates to which line item. Use the following structure: {JsonSerializer.Serialize(InvoiceData.Empty)}" }
    }
};

foreach (var pdfImageFile in pdfImageFiles)
{
    var imageBytes = await File.ReadAllBytesAsync(pdfImageFile);
    var base64Image = Convert.ToBase64String(imageBytes);
    userPromptParts.Add(new JsonObject
    {
        { "type", "image_url" },
        { "image_url", new JsonObject { { "url", $"data:image/jpeg;base64,{base64Image}" } } }
    });
}

JsonObject jsonPayload = new JsonObject
{
    {
        "messages", new JsonArray 
        {
            new JsonObject
            {
                { "role", "system" },
                { "content", "You are an AI assistant that extracts data from documents and returns them as structured JSON objects. Do not return as a code block." }
            },
            new JsonObject
            {
                { "role", "user" },
                { "content", new JsonArray(userPromptParts.ToArray())}
            }
        }
    },
    { "model", modelDeployment },
    { "max_tokens", 4096 },
    { "temperature", 0.1 },
    { "top_p", 0.1 },
};

string payload = JsonSerializer.Serialize(jsonPayload, new JsonSerializerOptions
{
    WriteIndented = true
});

In [None]:
var invoiceData = InvoiceData.Empty;

using (HttpClient httpClient = new HttpClient())
{
    // The request below posts to the absolute URL, so no BaseAddress is needed
    httpClient.DefaultRequestHeaders.Authorization = new System.Net.Http.Headers.AuthenticationHeaderValue("Bearer", bearerToken);
    httpClient.DefaultRequestHeaders.Accept.Add(new System.Net.Http.Headers.MediaTypeWithQualityHeaderValue("application/json"));

    var stringContent = new StringContent(payload, System.Text.Encoding.UTF8, "application/json");

    var response = await httpClient.PostAsync(visionEndpoint, stringContent);

    if (response.IsSuccessStatusCode)
    {
        using (var responseStream = await response.Content.ReadAsStreamAsync())
        {
            // Parse the JSON response using JsonDocument
            using (var jsonDoc = await JsonDocument.ParseAsync(responseStream))
            {
                // Access the message content dynamically
                JsonElement jsonElement = jsonDoc.RootElement;
                string messageContent = jsonElement.GetProperty("choices")[0].GetProperty("message").GetProperty("content").GetString();

                // Output the message content
                File.WriteAllText(pdfJsonExtractionName, messageContent);
                Console.WriteLine($"{pdfJsonExtractionName} has been created with the content from the response from the OpenAI API.");

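                // Deserialize returns null for a JSON "null" payload, so fall back to the empty template
                invoiceData = JsonSerializer.Deserialize<InvoiceData>(messageContent) ?? InvoiceData.Empty;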
            }
        }
    }
    else
    {
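        // Print the status and the error body; the body usually explains why the request failed
        Console.WriteLine($"Request failed: {(int)response.StatusCode} {response.ReasonPhrase}");
        Console.WriteLine(await response.Content.ReadAsStringAsync());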
    }
}