Quickstart: Extract invoice data using the Form Recognizer REST API with Python

In this quickstart, you'll use the Azure Form Recognizer REST API with Python to extract and identify relevant information in invoices.

If you don't have an Azure subscription, create a free account before you begin.

Prerequisites

To complete this quickstart, you must have:

  • Python installed (if you want to run the sample locally).
  • An invoice document. You can use the sample invoice for this quickstart.

Note

This quickstart uses a local file. To use an invoice document accessed by URL instead, see the reference documentation.

Create a Form Recognizer resource

Go to the Azure portal and create a new Form Recognizer resource. In the Create pane, provide the following information:

  • Name: A descriptive name for your resource, for example MyNameFormRecognizer.
  • Subscription: Select the Azure subscription that has been granted access.
  • Location: The location of your cognitive service instance. Different locations may introduce latency, but they have no impact on the runtime availability of your resource.
  • Pricing tier: The cost of your resource depends on the pricing tier you choose and your usage. For more information, see the API pricing details.
  • Resource group: The Azure resource group that will contain your resource. You can create a new group or add the resource to an existing group.

Note

Normally, when you create a Cognitive Service resource in the Azure portal, you have the option to create a multi-service subscription key (used across multiple cognitive services) or a single-service subscription key (used only with a specific cognitive service). However, Form Recognizer is currently not included in the multi-service subscription.

When your Form Recognizer resource finishes deploying, find and select it from the All resources list in the portal. Your key and endpoint will be located on the resource's key and endpoint page, under resource management. Save both of these to a temporary location before going forward.
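
One way to keep the key and endpoint out of the script itself is to read them from environment variables. The following is a minimal sketch, not part of the official sample. The variable names FORM_RECOGNIZER_ENDPOINT and FORM_RECOGNIZER_KEY and the helper load_credentials are hypothetical names chosen for illustration; the service does not define them.

```python
import os


def load_credentials(env=os.environ):
    """Return (endpoint, key) read from environment variables.

    The variable names are hypothetical, chosen for this sketch only.
    """
    # Drop a trailing slash so the endpoint can be joined with "/formrecognizer/...".
    endpoint = env.get("FORM_RECOGNIZER_ENDPOINT", "").strip().rstrip("/")
    key = env.get("FORM_RECOGNIZER_KEY", "").strip()
    if not endpoint or not key:
        raise RuntimeError(
            "Set FORM_RECOGNIZER_ENDPOINT and FORM_RECOGNIZER_KEY first."
        )
    return endpoint, key
```

With both variables set, endpoint, apim_key = load_credentials() could stand in for the hard-coded placeholders used in the scripts in this quickstart.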

Analyze an invoice

To start analyzing an invoice, call the Analyze Invoice API using the Python script below. Before you run the script, make these changes:

  1. Replace <Endpoint> with the endpoint that you obtained with your Form Recognizer subscription.

  2. Replace <subscription key> with the subscription key you copied from the previous step.

  3. Replace <path to your invoice> with the local path where you have a sample invoice saved.

        ########### Python Form Recognizer Async Invoice #############
    
        import json
        import time
        from requests import get, post
    
        # Endpoint URL
        endpoint = r"<Endpoint>"
        apim_key = "<subscription key>"
        post_url = endpoint + "/formrecognizer/v2.1/prebuilt/invoice/analyze"
        source = r"<path to your invoice>"
    
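        headers = {
            # Request headers
            # Replace <file type> with the MIME type of your invoice file,
            # for example application/pdf or image/jpeg.
            'Content-Type': '<file type>',
            'Ocp-Apim-Subscription-Key': apim_key,
        }
    
        params = {
            "includeTextDetails": True,
            "locale": "en-US"
        }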
    
        with open(source, "rb") as f:
            data_bytes = f.read()
    
        try:
            resp = post(url = post_url, data = data_bytes, headers = headers, params = params)
            if resp.status_code != 202:
                print("POST analyze failed:\n%s" % resp.text)
                quit()
            print("POST analyze succeeded:\n%s" % resp.headers)
            get_url = resp.headers["operation-location"]
        except Exception as e:
            print("POST analyze failed:\n%s" % str(e))
            quit()
  4. Save the code in a file with a .py extension. For example, form-recognizer-invoice.py.

  5. Open a command prompt window.

  6. At the prompt, use the python command to run the sample. For example, python form-recognizer-invoice.py.
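
The script above leaves the Content-Type header as the <file type> placeholder. A small helper can fill it in from the file extension. This is only a sketch: the helper name content_type_for and the mapping are assumptions for illustration. The mapping covers the extensions that the batch sample later in this article scans for, so check it against the reference documentation before relying on it.

```python
import os

# Extension-to-MIME-type map (assumption: verify against the reference docs).
CONTENT_TYPES = {
    ".pdf": "application/pdf",
    ".jpg": "image/jpeg",
    ".jpeg": "image/jpeg",
    ".png": "image/png",
    ".bmp": "image/bmp",
    ".tif": "image/tiff",
    ".tiff": "image/tiff",
}


def content_type_for(path):
    """Return the MIME type for an invoice file, based on its extension."""
    ext = os.path.splitext(path)[1].lower()
    if ext not in CONTENT_TYPES:
        raise ValueError("Unsupported invoice file type: %r" % ext)
    return CONTENT_TYPES[ext]
```

In the script, 'Content-Type': content_type_for(source) would then replace the placeholder string.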

You'll receive a 202 (Accepted) response that includes an Operation-Location header, which the script will print to the console. This header contains an operation ID that you can use to query the status of the asynchronous operation and get the results. In the following example value, the string after operations/ is the operation ID.
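
If you need the operation ID on its own, it can be pulled out of the header value. The sketch below assumes the ID is the final path segment of the URL. The helper name operation_id_from and the URL are hypothetical and only illustrate the shape; they are not real service output.

```python
from urllib.parse import urlparse


def operation_id_from(operation_location):
    """Return the last path segment of an Operation-Location URL."""
    path = urlparse(operation_location).path
    return path.rstrip("/").rsplit("/", 1)[-1]


# Hypothetical header value, for illustration only.
example = "https://example.invalid/formrecognizer/operations/1234-abcd"
print(operation_id_from(example))  # prints 1234-abcd
```

The scripts in this quickstart don't need this step, because they pass the full Operation-Location URL to the follow-up GET request.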

Get the invoice results

After you've called the Analyze Invoice API, call the Get Analyze Invoice Result API to get the status of the operation and the extracted data. Add the following code to the bottom of your Python script. The code sends a GET request to the Operation-Location URL, which contains the operation ID, and repeats the request at regular intervals until the results are available. We recommend an interval of one second or more.

n_tries = 10
n_try = 0
wait_sec = 6
while n_try < n_tries:
    try:
        resp = get(url = get_url, headers = {"Ocp-Apim-Subscription-Key": apim_key})
        resp_json = json.loads(resp.text)
        if resp.status_code != 200:
            print("GET Invoice results failed:\n%s" % resp_json)
            quit()
        status = resp_json["status"]
        if status == "succeeded":
            print("Invoice Analysis succeeded:\n%s" % resp_json)
            quit()
        if status == "failed":
            print("Invoice Analysis failed:\n%s" % resp_json)
            quit()
        # Analysis still running. Wait and retry.
        time.sleep(wait_sec)
        n_try += 1
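    except Exception as e:
        msg = "GET analyze results failed:\n%s" % str(e)
        print(msg)
        quit()

# All tries were used without reaching a final status.
print("Invoice analysis is not complete after %d seconds." % (n_try * wait_sec))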
  1. Save the script.
  2. Again use the python command to run the sample. For example, python form-recognizer-invoice.py.
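
The polling loop above can also be written as a reusable function that returns the result instead of calling quit(). This is a minimal sketch under stated assumptions: poll_until_done and its fetch_status parameter are hypothetical names, and fetch_status stands for any function you supply that performs the GET request and returns the decoded JSON body.

```python
import time


def poll_until_done(fetch_status, n_tries=10, wait_sec=6, sleep=time.sleep):
    """Call fetch_status() until it reports a final state or tries run out.

    fetch_status is a caller-supplied function (hypothetical here) that
    returns the decoded JSON body of the Get Analyze Invoice Result call.
    Returns the final body, or None if the analysis never finished.
    """
    for attempt in range(n_tries):
        body = fetch_status()
        if body.get("status") in ("succeeded", "failed"):
            return body
        # Still in progress: wait before the next try, but not after the last.
        if attempt < n_tries - 1:
            sleep(wait_sec)
    return None
```

Passing the fetch function in as a parameter keeps the waiting logic separate from the HTTP call, which makes the loop easy to exercise with a stub before you point it at the service.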

Examine the response

The script will print responses to the console until the Analyze Invoice operation completes. Then, it will print the extracted text data in JSON format. The "readResults" field contains every line of text that was extracted from the invoice. The "pageResults" field includes the tables and selection marks extracted from the invoice. The "documentResults" field contains key/value information for the most relevant parts of the invoice.
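
The following sketch shows one way to walk these fields once you have the JSON body. The helper name summarize is hypothetical, and the sample dictionary is hand-written to resemble the response. It is not real service output, and the "lines" list inside "readResults" is an assumption about the layout that you should confirm in the reference documentation.

```python
def summarize(resp_json):
    """Collect text lines and field text from an analyze result body."""
    result = resp_json["analyzeResult"]
    # Every extracted line of text, page by page.
    lines = [line["text"]
             for page in result.get("readResults", [])
             for line in page.get("lines", [])]
    # Field name -> raw text, for each recognized document.
    fields = {name: field.get("text")
              for doc in result.get("documentResults", [])
              for name, field in doc.get("fields", {}).items()}
    return lines, fields


# Hand-written sample shaped like the response, for illustration only.
sample = {
    "status": "succeeded",
    "analyzeResult": {
        "readResults": [
            {"page": 1, "lines": [{"text": "INVOICE"}, {"text": "Total $110.00"}]}
        ],
        "pageResults": [{"page": 1, "tables": []}],
        "documentResults": [
            {"fields": {"InvoiceTotal": {"type": "number",
                                         "valueNumber": 110.0,
                                         "text": "$110.00"}}}
        ],
    },
}

lines, fields = summarize(sample)
print(lines)   # ['INVOICE', 'Total $110.00']
print(fields)  # {'InvoiceTotal': '$110.00'}
```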

See the following invoice document and its corresponding JSON output:

Sample Python script to extract invoice or a batch of invoices into a CSV file

This sample Python script shows you how to get started using the Invoice API. It accepts either a single invoice or a folder as its parameter. For each invoice, it saves the API response in a JSON file with the suffix ".invoice.json" and collects the extracted values in a CSV file with the suffix "-invoiceResults.csv". When run on a folder, it scans for all "pdf", "jpg", "jpeg", "png", "bmp", "tif", and "tiff" files and sends each one to the API.

########### Python Form Recognizer Async Invoice #############

import json
import time
import os
import ntpath
import sys
from requests import get, post
import csv
from tabulate import tabulate


def analyzeInvoice(filename):
    invoiceResultsFilename = filename + ".invoice.json"

    # do not run analyze if .invoice.json file is present on disk
    if os.path.isfile(invoiceResultsFilename):
        with open(invoiceResultsFilename) as json_file:
            return json.load(json_file)

    # Endpoint URL
    endpoint = r"<Endpoint>"
    apim_key = "<subscription key>"
    post_url = endpoint + "/formrecognizer/v2.1/prebuilt/invoice/analyze"
    headers = {
        # Request headers
        'Content-Type': 'application/octet-stream',
        'Ocp-Apim-Subscription-Key': apim_key,
    }

    params = {
        "includeTextDetails": True
    }

    with open(filename, "rb") as f:
        data_bytes = f.read()

    try:
        resp = post(url = post_url, data = data_bytes, headers = headers, params = params)
        if resp.status_code != 202:
            print("POST analyze failed:\n%s" % resp.text)
            return None
        print("POST analyze succeeded: %s" % resp.headers["operation-location"])
        get_url = resp.headers["operation-location"]
    except Exception as e:
        print("POST analyze failed:\n%s" % str(e))
        return None

    n_tries = 50
    n_try = 0
    wait_sec = 6

    while n_try < n_tries:
        try:
            resp = get(url = get_url, headers = {"Ocp-Apim-Subscription-Key": apim_key})
            resp_json = json.loads(resp.text)
            if resp.status_code != 200:
                print("GET Invoice results failed:\n%s" % resp_json)
                return None
            status = resp_json["status"]
            if status == "succeeded":
                print("Invoice analysis succeeded.")
                with open(invoiceResultsFilename, 'w') as outfile:
                    json.dump(resp_json, outfile, indent=4)
                return resp_json
            if status == "failed":
                print("Analysis failed:\n%s" % resp_json)
                return None
            # Analysis still running. Wait and retry.
            time.sleep(wait_sec)
            n_try += 1
        except Exception as e:
            msg = "GET analyze results failed:\n%s" % str(e)
            print(msg)
            return None

    print("Invoice analyze is not complete after {0} seconds:\n{1}".format(n_try*wait_sec,resp_json))
    return None

def parseInvoiceResults(resp_json):
    docResults = resp_json["analyzeResult"]["documentResults"]
    invoiceResult = {}
    lineItems = []
    lineItemsHeadersOrder = ["Date","ProductCode", "Description", "UnitPrice", "Quantity", "Unit", "Tax", "Amount", "FullText"]
    for docResult in docResults:
        for fieldName, fieldValue in sorted(docResult["fields"].items()):
            if fieldName != "Items":
                valueFields = list(filter(lambda item: ("value" in item[0]) and ("valueString" not in item[0]), fieldValue.items()))
                invoiceResult[fieldName] = fieldValue["text"]
                if len(valueFields) == 1:
                    print("{0:26} : {1:50}      NORMALIZED VALUE: {2}".format(fieldName , fieldValue["text"], valueFields[0][1]))
                    invoiceResult[fieldName + "_normalized"] = valueFields[0][1]
                else:
                    print("{0:26} : {1}".format(fieldName , fieldValue["text"]))
            else:
                for item in fieldValue["valueArray"]:
                    itemValue = {}
                    itemValue["FullText"] = item["text"]
                    if "valueObject" in item:
                        for lineDetailName, lineDetailValue in sorted(item["valueObject"].items()):
                            itemValue[lineDetailName] = lineDetailValue["text"]
                            
                    lineItems.append(itemValue)
                   
    if lineItems:
        lineItemsPretty = []
        presentHeaders = list(set(val for dic in lineItems for val in dic.keys()))
        sortedheaders = [x for x in lineItemsHeadersOrder if x in presentHeaders]
        
        #Fill values with empty values to sort columns correctly by tabulate
        for item in lineItems:
            itemValue = {}
            for header in sortedheaders:
                if (header in item):
                    itemValue[header] = item[header]
                else:
                    itemValue[header] = ""

            lineItemsPretty.append(itemValue)
        
        print("")
        print("Line Items:")
        print(tabulate(lineItemsPretty, headers="keys", showindex=True, tablefmt = "pretty"))
        
    print("")
    return invoiceResult

def main(argv):
    if (len(argv)  != 2):
        print("ERROR: Please provide invoice filename or root directory with invoice PDFs/images as an argument to the python script")
        return

    # list of invoice to analyze
    invoiceFiles = []
    csvPostfix = '-invoiceResults.csv'
    if os.path.isfile(argv[1]):
        # Single invoice
        invoiceFiles.append(argv[1])
        csvFileName = argv[1] + csvPostfix
    else:
        # Folder of invoices
        supportedExt = ['.pdf', '.jpg','.jpeg','.tif','.tiff','.png','.bmp']
        invoiceDirectory = argv[1]
        csvFileName = os.path.join(invoiceDirectory, os.path.basename(os.path.abspath(invoiceDirectory)) + csvPostfix)
        for root, directories, filenames in os.walk(invoiceDirectory):
            for invoiceFilename in filenames:
                ext = os.path.splitext(invoiceFilename)[-1].lower()
                if ext in supportedExt:
                    fullname = os.path.join(root, invoiceFilename)
                    invoiceFiles.append(fullname)

    with open(csvFileName, mode='w', newline='\n', encoding='utf-8') as csv_file:
        fieldnames = ['Filename',
                      'FullFilename','InvoiceTotal','InvoiceTotal_normalized','AmountDue','AmountDue_normalized','SubTotal','SubTotal_normalized','TotalTax','TotalTax_normalized','CustomerName','VendorName',
                      'InvoiceId','CustomerId','PurchaseOrder','InvoiceDate','InvoiceDate_normalized','DueDate','DueDate_normalized',
                      'VendorAddress','VendorAddressRecipient','BillingAddress','BillingAddressRecipient','ShippingAddress','ShippingAddressRecipient','CustomerAddress','CustomerAddressRecipient','ServiceAddress','ServiceAddressRecipient','RemittanceAddress','RemittanceAddressRecipient', 'ServiceStartDate','ServiceStartDate_normalized','ServiceEndDate','ServiceEndDate_normalized','PreviousUnpaidBalance','PreviousUnpaidBalance_normalized']
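        # extrasaction='ignore' skips any returned field that has no CSV column,
        # instead of raising ValueError in writerow().
        writer = csv.DictWriter(csv_file, fieldnames=fieldnames, extrasaction='ignore')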
        writer.writeheader()
        counter = 0
        for invoiceFullFilename in invoiceFiles:
            counter = counter + 1
            invoiceFilename = ntpath.basename(invoiceFullFilename)
            print("----- Processing {0}/{1} : {2} -----".format(counter, len(invoiceFiles),invoiceFullFilename))

            resp_json = analyzeInvoice(invoiceFullFilename)

            if (resp_json is not None):
                invoiceResults = parseInvoiceResults(resp_json)
                invoiceResults["FullFilename"] = invoiceFullFilename
                invoiceResults["Filename"] = invoiceFilename
                writer.writerow(invoiceResults)

if __name__ == '__main__':
    main(sys.argv)
  1. Save the code in a file with a .py extension. For example, form-recognizer-invoice-to-csv.py.
  2. Replace <Endpoint> with the endpoint that you obtained with your Form Recognizer subscription.
  3. Replace <subscription key> with the subscription key for your Form Recognizer resource.
  4. Make sure the requests and tabulate packages, which the script imports, are installed. For example, pip install requests tabulate.
  5. Open a command prompt window.
  6. At the prompt, use the python command to run the sample. For example, python form-recognizer-invoice-to-csv.py {file name or folder name}.

If a ".invoice.json" file already exists for an invoice, the script reuses it instead of calling the API again.
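
The "_normalized" columns in the CSV file come from the typed value that accompanies a field's raw text. The sketch below isolates that rule from parseInvoiceResults in the script above: a field gets a normalized value only when it has exactly one key that contains "value" and is not "valueString". The helper name normalized_value and the two sample fields are hypothetical and hand-written for illustration.

```python
def normalized_value(field):
    """Return the typed value of a field, or None if it has no single one.

    Mirrors the sample script: look for exactly one key that contains
    "value" and does not contain "valueString".
    """
    typed = [v for k, v in field.items()
             if "value" in k and "valueString" not in k]
    return typed[0] if len(typed) == 1 else None


# Hand-written fields for illustration, not real service output.
total = {"type": "number", "valueNumber": 110.0, "text": "$110.00"}
vendor = {"type": "string", "valueString": "Contoso", "text": "Contoso"}

print(normalized_value(total))   # 110.0
print(normalized_value(vendor))  # None
```

This is why numeric and date fields such as InvoiceTotal have a matching "_normalized" column, while plain string fields such as VendorName do not.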

Next steps

In this quickstart, you used the Form Recognizer REST API with Python to extract the content from invoices. Next, see the reference documentation to explore the Form Recognizer API in more depth.