# Document Information Extraction Demo

This notebook demonstrates how to consume the SAP AI Business Services - Document Information Extraction service. We first create a new client and then use the service to extract data from an example invoice.

## Fetch Python module

This notebook requires the Python package that contains the client.

In [7]:
# import from PyPI when available
# !pip install ...

## Settings

The settings require a valid service key for the Document Information Extraction service on SAP Cloud Platform.

The following values from the service key are needed. They are stored in the variables named below:

- url: The URL of the service deployment, provided at the outermost level of the service key JSON file
- uaa_url: The URL of the UAA server used for authentication, provided as url in the uaa section of the service key JSON file
- client_id: The client ID used for authentication to the UAA server, provided as clientid in the uaa section of the service key JSON file
- client_secret: The client secret used for authentication to the UAA server, provided as clientsecret in the uaa section of the service key JSON file
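
As an alternative to pasting the service key into a cell, the four values can be read from a file. The following is a minimal sketch, assuming the service key has been saved locally as JSON; the function name `load_settings` and the file name in the usage note are hypothetical and not part of the client package.

```python
import json


def load_settings(service_key_path):
    """Read a service key JSON file and return the four values this notebook needs."""
    with open(service_key_path) as key_file:
        service_key = json.load(key_file)
    uaa = service_key['uaa']
    return {
        'url': service_key['url'],             # outermost level of the service key
        'uaa_url': uaa['url'],                 # "url" inside the "uaa" section
        'client_id': uaa['clientid'],          # "clientid" inside the "uaa" section
        'client_secret': uaa['clientsecret'],  # "clientsecret" inside the "uaa" section
    }
```

For example, `settings = load_settings('service_key.json')` would provide `settings['url']`, `settings['client_id']`, `settings['client_secret']` and `settings['uaa_url']`, which keeps the secret out of the notebook itself.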

In [122]:
service_key = {
  "url": "https://aiservices-trial-dox.cfapps.eu10.hana.ondemand.com",
  "uaa": {
    "tenantmode": "shared",
    "sburl": "https://internal-xsuaa.authentication.eu10.hana.ondemand.com",
    "subaccountid": "11939619-af01-4a8c-bda5-8e5435280791",
    "clientid": "sb-15d327e4-af3a-471e-a5fd-05b51fb3a9b8!b76941|na-9e50499f-78dd-40ca-ad8d-60acf02cff8b!b30417",
    "xsappname": "15d327e4-af3a-471e-a5fd-05b51fb3a9b8!b76941|na-9e50499f-78dd-40ca-ad8d-60acf02cff8b!b30417",
    "clientsecret": "e6TnWWQC5yj2NGi7mE/QEslNTQM=",
    "url": "https://bee89f2ctrial.authentication.eu10.hana.ondemand.com",
    "uaadomain": "authentication.eu10.hana.ondemand.com",
    "verificationkey": "-----BEGIN PUBLIC KEY-----MIIBIjANBgkqhkiG9w0BAQEFAAOCAQ8AMIIBCgKCAQEAt4W3fOLGXUJn7n5lDqidj63AdwMP/MGR9Y7Jv6ZmSYAgvZPD2usA1l5vlDBfCbFS+2SDVCZfHTkspIpXv5tKQvVQ9V3NFQsJt+xReIb+5Rk3RSQFv7+B7OkhmZjCYJvBIftF2mInM16GJ1FgsK2lJt1rH2GfOna7inenbktiI5/gBYiXq/68lP/XgIeRf3hOaigasm3zsr4EEd+qxlzCB0ryK8EvRYXqYzIELLkRiZtg+iCz59RZGZnUZ29aTfF+4mZbzh+R/cAHL1uNmg5jOIr9fuZfQ3l8vYppGaay1i1A6vzNJU/QDMd1619ss6qg1MkPKyeHXZtHriOKGqLl8wIDAQAB-----END PUBLIC KEY-----",
    "apiurl": "https://api.authentication.eu10.hana.ondemand.com",
    "identityzone": "bee89f2ctrial",
    "identityzoneid": "11939619-af01-4a8c-bda5-8e5435280791",
    "tenantid": "11939619-af01-4a8c-bda5-8e5435280791",
    "zoneid": "11939619-af01-4a8c-bda5-8e5435280791"
  },
  "swagger": "/document-information-extraction/v1/"
}
url = service_key['url']
client_id = service_key['uaa']['clientid']
client_secret = service_key['uaa']['clientsecret']
uaa_url = service_key['uaa']['url']

## Initialize Demo

In [123]:
# Import DOX API client
import sys
sys.path.insert(1,'../../')
from sap_business_document_processing import DoxApiClient
import json

In [124]:
# Instantiate object used to communicate with Document Information Extraction REST API
api_client = DoxApiClient(url, client_id, client_secret, uaa_url)

## Display access token

In [125]:
# The token can be used with, e.g., the Swagger UI to explore the DOX API
print(api_client.session.token)
print(f"\nYou can use this token to Authorize here and explore the API via Swagger UI: \n{api_client.base_url}")

{'access_token': 'eyJhbGciOiJSUzI1NiIsImprdSI6Imh0dHBzOi8vYmVlODlmMmN0cmlhbC5hdXRoZW50aWNhdGlvbi5ldTEwLmhhbmEub25kZW1hbmQuY29tL3Rva2VuX2tleXMiLCJraWQiOiJkZWZhdWx0LWp3dC1rZXktNzcwMTI5NjY1IiwidHlwIjoiSldUIn0.eyJqdGkiOiIxYzc1NDIzMDYyN2E0Njk3ODI3NTU1YzU5YzA2YjA5OCIsImV4dF9hdHRyIjp7ImVuaGFuY2VyIjoiWFNVQUEiLCJzdWJhY2NvdW50aWQiOiIxMTkzOTYxOS1hZjAxLTRhOGMtYmRhNS04ZTU0MzUyODA3OTEiLCJ6ZG4iOiJiZWU4OWYyY3RyaWFsIiwic2VydmljZWluc3RhbmNlaWQiOiIxNWQzMjdlNC1hZjNhLTQ3MWUtYTVmZC0wNWI1MWZiM2E5YjgifSwic3ViIjoic2ItMTVkMzI3ZTQtYWYzYS00NzFlLWE1ZmQtMDViNTFmYjNhOWI4IWI3Njk0MXxuYS05ZTUwNDk5Zi03OGRkLTQwY2EtYWQ4ZC02MGFjZjAyY2ZmOGIhYjMwNDE3IiwiYXV0aG9yaXRpZXMiOlsidWFhLnJlc291cmNlIiwibmEtOWU1MDQ5OWYtNzhkZC00MGNhLWFkOGQtNjBhY2YwMmNmZjhiIWIzMDQxNy50ZWNobmljYWxzY29wZSJdLCJzY29wZSI6WyJ1YWEucmVzb3VyY2UiLCJuYS05ZTUwNDk5Zi03OGRkLTQwY2EtYWQ4ZC02MGFjZjAyY2ZmOGIhYjMwNDE3LnRlY2huaWNhbHNjb3BlIl0sImNsaWVudF9pZCI6InNiLTE1ZDMyN2U0LWFmM2EtNDcxZS1hNWZkLTA1YjUxZmIzYTliOCFiNzY5NDF8bmEtOWU1MDQ5OWYtNzhkZC00MGNhLWFkOGQtNjBhY2YwMmNmZjhiIW
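
The access token printed above is a JSON Web Token, so its claims can be inspected locally. The following is a minimal sketch using only the standard library; `decode_jwt_payload` is a hypothetical helper, and it does not verify the signature, so it is only suitable for inspection.

```python
import base64
import json


def decode_jwt_payload(access_token):
    """Return the claims of a JWT as a dict, WITHOUT verifying the signature."""
    # A JWT has three dot-separated parts: header, payload, signature.
    payload_b64 = access_token.split('.')[1]
    # base64url data in a JWT is unpadded; restore padding before decoding.
    payload_b64 += '=' * (-len(payload_b64) % 4)
    return json.loads(base64.urlsafe_b64decode(payload_b64))
```

With the client from this notebook, `decode_jwt_payload(api_client.session.token['access_token'])['scope']` would list the scopes granted to the token.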

## See list of document fields you can extract

In [126]:
# Get the fields and document types that can be used
capabilities = api_client.get_capabilities()
data_types = capabilities['documentTypes']
print('Available document types:', data_types)
print('Available extraction fields:')
for data_type in data_types:
    print(f"for '{data_type}':")
    print('\tHeader fields:')
    for hf in capabilities['extraction']['headerFields']:
        if data_type in hf['supportedDocumentTypes']:
            print('\t\t', hf['name'])
    print('\tLine item fields:')
    for li in capabilities['extraction']['lineItemFields']:
        if data_type in li['supportedDocumentTypes']:
            print('\t\t', li['name'])

Available document types: ['invoice', 'paymentAdvice']
Available extraction fields:
for 'invoice':
	Header fields:
		 documentNumber
		 taxId
		 taxName
		 purchaseOrderNumber
		 shippingAmount
		 netAmount
		 grossAmount
		 currencyCode
		 receiverContact
		 documentDate
		 taxAmount
		 taxRate
		 receiverName
		 receiverAddress
		 receiverTaxId
		 deliveryDate
		 paymentTerms
		 deliveryNoteNumber
		 senderBankAccount
		 senderAddress
		 senderName
		 dueDate
		 discount
		 barcode
	Line item fields:
		 description
		 netAmount
		 quantity
		 unitPrice
		 materialNumber
		 unitOfMeasure
for 'paymentAdvice':
	Header fields:
		 documentNumber
		 grossAmount
		 currencyCode
		 documentDate
		 senderName
	Line item fields:
		 netAmount
		 documentNumber
		 documentDate
		 discountAmount
		 deductionAmount
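
The same capabilities response can be turned into a lookup table instead of printed output. This is a minimal sketch that relies only on the keys used in the cell above (`documentTypes`, `extraction`, `headerFields`, `lineItemFields`, `name`, `supportedDocumentTypes`); the function name `fields_by_document_type` is hypothetical.

```python
def fields_by_document_type(capabilities):
    """Map each document type to the header and line item field names it supports."""
    extraction = capabilities['extraction']
    result = {}
    for document_type in capabilities['documentTypes']:
        result[document_type] = {
            'headerFields': [
                field['name'] for field in extraction['headerFields']
                if document_type in field['supportedDocumentTypes']
            ],
            'lineItemFields': [
                field['name'] for field in extraction['lineItemFields']
                if document_type in field['supportedDocumentTypes']
            ],
        }
    return result
```

For example, `fields_by_document_type(capabilities)['invoice']['headerFields']` would give the list of header fields that can be requested for invoices.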


## (Optional) Create a client

Document Information Extraction requires a client, which is used to distinguish and separate data. Create a new client if you want to run the information extraction with a separate client.

In [131]:
# Check which clients exist for this tenant
api_client.get_clients()

[]

In [91]:
# Create a new client with the id 'c_00' and name 'Client 00'
api_client.create_client(client_id='c_00', client_name='Client 00')

POST request to URL https://aiservices-trial-dox.cfapps.eu10.hana.ondemand.com/document-information-extraction/v1/clients failed with body: {"error": {"code": "E102", "message": "Request exceeds trial account quotas.", "details": []}}



BDPClientException: {'error': {'code': 'E102', 'message': 'Request exceeds trial account quotas.', 'details': []}}
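
The two calls above can be combined so that a client is only created when it is missing. This is a minimal sketch, assuming that `get_clients()` returns a list of dictionaries that each carry a `clientId` key; that return shape is an assumption, and `ensure_client` is a hypothetical helper. It does not work around the trial quota error shown above.

```python
def ensure_client(api_client, client_id, client_name):
    """Create the client only if it does not exist yet; return True if it was created."""
    # Assumption: each entry returned by get_clients() is a dict with a 'clientId' key.
    existing_ids = {client['clientId'] for client in api_client.get_clients()}
    if client_id in existing_ids:
        return False
    api_client.create_client(client_id=client_id, client_name=client_name)
    return True
```

Calling `ensure_client(api_client, 'c_00', 'Client 00')` twice would then attempt the creation only once.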

## Upload a document and retrieve the extracted result

In [109]:
# Specify the fields that should be extracted
header_fields = [
         "documentNumber",
         "taxId",
         "purchaseOrderNumber",
         "shippingAmount",
         "netAmount",
         "senderAddress",
         "senderName",
         "grossAmount",
         "currencyCode",
         "receiverContact",
         "documentDate",
         "taxAmount",
         "taxRate",
         "receiverName",
         "receiverAddress"
    ]
line_item_fields = [
         "description",
         "netAmount",
         "quantity",
         "unitPrice",
         "materialNumber"
    ]

# Extract information from invoice
document_result = api_client.extract_information_from_document('sample-invoice-1.pdf', 
                                                               client_id='default', 
                                                               document_type='invoice', 
                                                               header_fields=header_fields, 
                                                               line_item_fields=line_item_fields)

POST request to URL https://aiservices-trial-dox.cfapps.eu10.hana.ondemand.com/document-information-extraction/v1/document/jobs failed with body: {"error": {"code": "E1", "message": "Invalid client ID(s). The provided client ID(s) does/do not exist.", "details": [{"code": "0", "message": "Invalid client id(s): default"}]}}



BDPApiException: Information extraction failed for some documents: [{'error': {'code': 'E1', 'message': 'Invalid client ID(s). The provided client ID(s) does/do not exist.', 'details': [{'code': '0', 'message': 'Invalid client id(s): default'}]}, 'status_code': 400}]

In [16]:
# Check the extracted data
print(json.dumps(document_result, indent=2))

{
  "status": "DONE",
  "id": "96a8e9c9-6494-45b2-8f63-e4b5cd978355",
  "fileName": "sample-invoice-1.pdf",
  "documentType": "invoice",
  "created": "2021-04-29T14:08:32.907275+00:00",
  "finished": "2021-04-29T14:08:50.529580+00:00",
  "country": "XX",
  "extraction": {
    "headerFields": [
      {
        "name": "taxAmount",
        "category": "amounts",
        "value": 8.5,
        "rawValue": "8.50",
        "type": "number",
        "page": 1,
        "confidence": 0.996145057678223,
        "coordinates": {
          "x": 0.877734899520874,
          "y": 0.481346666812897,
          "w": 0.0364649891853333,
          "h": 0.00940251350402832
        },
        "group": 1
      },
      {
        "name": "documentNumber",
        "category": "document",
        "value": "INV-3337",
        "rawValue": "INV-3337",
        "type": "string",
        "page": 1,
        "confidence": 0.99559623003006,
        "coordinates": {
          "x": 0.75975975975976,
          "y": 0.1364
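
For downstream use, the nested result is often easier to handle as a flat mapping. The following is a minimal sketch that reads only the keys visible in the output above (`extraction`, `headerFields`, `name`, `value`, `confidence`); the helper name `header_values` and the threshold parameter are hypothetical.

```python
def header_values(document_result, min_confidence=0.0):
    """Return {field name: value} for header fields at or above min_confidence.

    If a field name occurs more than once, the last occurrence wins.
    """
    return {
        field['name']: field['value']
        for field in document_result['extraction']['headerFields']
        if field['confidence'] >= min_confidence
    }
```

For example, `header_values(document_result, min_confidence=0.9)` would keep only header fields the service reports with a confidence of at least 0.9.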

In [41]:
from IPython.display import IFrame
from fpdf import FPDF
from PyPDF2 import PdfFileReader, PdfFileWriter
import io
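def create_overlay(document_path, output_path):
    """Draw a box around every extracted field and write the annotated PDF to output_path.

    Reads the extraction result from the global `document_result`. The field
    coordinates are relative to the page size, so they are scaled by the page
    width and height before drawing.
    """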
    input_pdf = PdfFileReader(document_path)
    output_pdf = PdfFileWriter()
    
    pdf = FPDF(unit='pt')
    pdf.set_font('Helvetica')
    pdf.set_margins(0, 0)
    pdf.set_draw_color(102, 255, 178)
    pdf.set_line_width(1)
    
    for n in range(len(input_pdf.pages)):
        input_page = input_pdf.getPage(n)
        width = float(input_page.mediaBox.getWidth())
        height = float(input_page.mediaBox.getHeight())
        pdf.add_page(format=(width, height))
        
        for hf in (hf for hf in document_result['extraction']['headerFields'] if hf['page'] == n + 1):
            x = hf['coordinates']['x']
            y = hf['coordinates']['y']
            w = hf['coordinates']['w']
            h = hf['coordinates']['h']
            pdf.rect(x * width, y * height, w * width, h * height)
        for li in (li for line in document_result['extraction']['lineItems'] for li in line if li['page'] == n + 1):
            x = li['coordinates']['x']
            y = li['coordinates']['y']
            w = li['coordinates']['w']
            h = li['coordinates']['h']
            pdf.rect(x * width, y * height, w * width, h * height)
            
    overlay = PdfFileReader(io.BytesIO(pdf.output()))
    
    for n in range(len(input_pdf.pages)):
        page = input_pdf.getPage(n)
        page.mergePage(overlay.getPage(n))
        output_pdf.addPage(page)
      
    with open(output_path, 'wb') as out:
        output_pdf.write(out)
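
The overlay function scales the relative field coordinates to page units in two places. That step can be isolated as shown in this minimal sketch; `to_page_rect` is a hypothetical helper and not part of the notebook's original code.

```python
def to_page_rect(coordinates, page_width, page_height):
    """Scale relative coordinates (fractions of the page size) to page units.

    Returns (x, y, w, h) in the same unit as page_width and page_height.
    """
    return (
        coordinates['x'] * page_width,
        coordinates['y'] * page_height,
        coordinates['w'] * page_width,
        coordinates['h'] * page_height,
    )
```

Inside the loops of `create_overlay`, a call such as `pdf.rect(*to_page_rect(hf['coordinates'], width, height))` would draw the same rectangle as the four separate multiplications.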

In [97]:
# Let's visualize the extraction results
output_file = 'output.pdf'
create_overlay('sample-invoice-1.pdf', output_file)

IFrame(output_file, 700, 1000)

## Upload Ground Truth

Ground truth values can be uploaded to evaluate the results of Document Information Extraction.

In [98]:
# Load ground truth values from json file
with open('gt-sample-invoice-1.json') as ground_truth_file:
    ground_truth = json.load(ground_truth_file)

In [102]:
# Add ground truth values to the uploaded invoice
api_client.post_ground_truth_for_document(document_id=document_result['id'], ground_truth=ground_truth)

{'status': 'DONE',
 'message': 'Ground truth / corrected values uploaded successfully'}