# Document Information Extraction Demo

This notebook demonstrates how to consume the SAP AI Business Services - Document Information Extraction service. We first create an API client and then use the service to extract data from an example invoice.

## Fetch Python module

This notebook requires the Python package that contains the API client.

In [None]:
# Installs from TestPyPI; switch to PyPI once the package is available there
!pip install -i https://test.pypi.org/simple/ sap-business-document-processing

## Settings

The settings require a valid service key for the Document Information Extraction service on SAP Cloud Platform.

The following values are read from the service key and assigned to the variables listed here:

- url: The URL of the service deployment, found at the top level of the service key JSON file
- uaa_url: The URL of the UAA server used for authentication, found as url in the uaa section of the service key JSON file
- client_id: The clientid used to authenticate against the UAA server, found in the uaa section of the service key JSON file
- client_secret: The clientsecret used to authenticate against the UAA server, found in the uaa section of the service key JSON file

In [None]:
service_key = {
  "url": "https://aiservices-trial-dox.cfapps.eu10.hana.ondemand.com",
  "uaa": {
    "tenantmode": "shared",
    "sburl": "***************",
    "subaccountid": "***************",
    "clientid": "***************",
    "xsappname": "***************",
    "clientsecret": "***************",
    "url": "***************",
    "uaadomain": "***************",
    "verificationkey": "***************",
    "apiurl": "***************",
    "identityzone": "***************",
    "identityzoneid": "***************",
    "tenantid": "***************",
    "zoneid": "***************"
  },
  "swagger": "/document-information-extraction/v1/"
}
url = service_key['url']
client_id = service_key['uaa']['clientid']
client_secret = service_key['uaa']['clientsecret']
uaa_url = service_key['uaa']['url']
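
Instead of pasting the service key into the notebook, the same four values could be read from a JSON file. The following is a minimal sketch, assuming the key file has the structure shown above; the helper name `load_service_key` and the placeholder values are hypothetical and not part of the client package.

```python
import json
import os
import tempfile


def load_service_key(path):
    """Read a service key JSON file and return the four settings used in this notebook."""
    with open(path) as key_file:
        key = json.load(key_file)
    return {
        "url": key["url"],
        "client_id": key["uaa"]["clientid"],
        "client_secret": key["uaa"]["clientsecret"],
        "uaa_url": key["uaa"]["url"],
    }


# Demonstration with a throwaway file that holds placeholder values only
example_key = {
    "url": "https://service.example",
    "uaa": {
        "clientid": "my-id",
        "clientsecret": "my-secret",
        "url": "https://uaa.example",
    },
}
with tempfile.NamedTemporaryFile("w", suffix=".json", delete=False) as tmp:
    json.dump(example_key, tmp)

settings = load_service_key(tmp.name)
os.remove(tmp.name)
print(settings["uaa_url"])
```

Keeping the key in a separate file makes it easier to avoid committing credentials together with the notebook.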

## Initialize Demo

In [None]:
# Import DOX API client
from sap_business_document_processing import DoxApiClient
import json

In [None]:
# Instantiate object used to communicate with Document Information Extraction REST API
api_client = DoxApiClient(url, client_id, client_secret, uaa_url)

## Display access token

In [None]:
# Token can be used to interact with e.g. swagger UI to explore DOX API
print(api_client.session.token)
print(f"\nYou can use this token to Authorize here and explore the API via Swagger UI: \n{api_client.base_url}")

## See list of document fields you can extract

In [None]:
# Get the fields and document types that can be used
capabilities = api_client.get_capabilities()
data_types = capabilities['documentTypes']
print('Available document types:', data_types)
print('Available extraction fields:')
for data_type in data_types:
    print(f"for '{data_type}':")
    print('\tHeader fields:')
    for hf in capabilities['extraction']['headerFields']:
        if data_type in hf['supportedDocumentTypes']:
            print('\t\t', hf['name'])
    print('\tLine item fields:')
    for li in capabilities['extraction']['lineItemFields']:
        if data_type in li['supportedDocumentTypes']:
            print('\t\t', li['name'])
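
If the field names are needed later, for example to build the field lists for an extraction request, it may be more convenient to collect them in a dictionary than to print them. The sketch below assumes only the capabilities structure used in the cell above; the helper name `fields_by_document_type` and the sample data are made up for illustration.

```python
def fields_by_document_type(capabilities):
    """Map each document type to the names of its supported header and line item fields."""
    extraction = capabilities["extraction"]
    result = {}
    for doc_type in capabilities["documentTypes"]:
        result[doc_type] = {
            "headerFields": [
                field["name"]
                for field in extraction["headerFields"]
                if doc_type in field["supportedDocumentTypes"]
            ],
            "lineItemFields": [
                field["name"]
                for field in extraction["lineItemFields"]
                if doc_type in field["supportedDocumentTypes"]
            ],
        }
    return result


# Made-up sample that mimics the structure accessed above
sample_capabilities = {
    "documentTypes": ["invoice", "paymentAdvice"],
    "extraction": {
        "headerFields": [
            {"name": "documentNumber", "supportedDocumentTypes": ["invoice", "paymentAdvice"]},
            {"name": "taxId", "supportedDocumentTypes": ["invoice"]},
        ],
        "lineItemFields": [
            {"name": "quantity", "supportedDocumentTypes": ["invoice"]},
        ],
    },
}
fields = fields_by_document_type(sample_capabilities)
print(fields["invoice"]["headerFields"])
```

With a live service, `fields_by_document_type(api_client.get_capabilities())` would be the corresponding call.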

## (Optional) Create a Client

Document Information Extraction requires a client, which is used to distinguish and separate data. This client is not the same thing as the API client object created above. You can create a new client if you want to run the information extraction with a separate client.

In [None]:
# Check which clients exist for this tenant
api_client.get_clients()

In [None]:
# Create a new client with the id 'c_00' and name 'Client 00'
api_client.create_client(client_id='c_00', client_name='Client 00')

## Upload a document and retrieve the extracted result

In [None]:
# Specify the fields that should be extracted
header_fields = [
         "documentNumber",
         "taxId",
         "purchaseOrderNumber",
         "shippingAmount",
         "netAmount",
         "senderAddress",
         "senderName",
         "grossAmount",
         "currencyCode",
         "receiverContact",
         "documentDate",
         "taxAmount",
         "taxRate",
         "receiverName",
         "receiverAddress"
    ]
line_item_fields = [
         "description",
         "netAmount",
         "quantity",
         "unitPrice",
         "materialNumber"
    ]

# Extract information from invoice
document_result = api_client.extract_information_from_document('sample-invoice-1.pdf', 
                                                               client_id='default', 
                                                               document_type='invoice', 
                                                               header_fields=header_fields, 
                                                               line_item_fields=line_item_fields)

In [None]:
# Check the extracted data
print(json.dumps(document_result, indent=2))
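
The raw JSON can be hard to scan. A minimal sketch of flattening the result into plain name/value pairs follows. It assumes that every extracted field carries a `name` key next to the `value` key used later in this notebook, and that `lineItems` is a list of lists of fields; the helper names and the sample data are hypothetical.

```python
def header_values(result):
    """Return a dict that maps header field names to their extracted values."""
    return {hf["name"]: hf["value"] for hf in result["extraction"]["headerFields"]}


def line_item_rows(result):
    """Return one dict per line item, mapping field names to extracted values."""
    return [
        {li["name"]: li["value"] for li in line}
        for line in result["extraction"]["lineItems"]
    ]


# Made-up sample that mimics the assumed result structure
sample_result = {
    "extraction": {
        "headerFields": [
            {"name": "documentNumber", "value": "INV-001", "page": 1},
            {"name": "grossAmount", "value": 119.0, "page": 1},
        ],
        "lineItems": [
            [
                {"name": "description", "value": "Widget", "page": 1},
                {"name": "quantity", "value": 2.0, "page": 1},
            ]
        ],
    }
}
headers = header_values(sample_result)
rows = line_item_rows(sample_result)
print(headers)
print(rows)
```

If a field name occurs more than once, the dictionary keeps only the last value, so this form suits a quick look rather than a complete record.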

In [None]:
# Requires fpdf2 (pdf.output() returns bytes) and PyPDF2 < 3.0 (PdfFileReader/PdfFileWriter API)
from IPython.display import IFrame
from fpdf import FPDF
from PyPDF2 import PdfFileReader, PdfFileWriter
import io

max_text_width = 150
font_size = 10

def create_overlay(document_path, output_path):
    input_pdf = PdfFileReader(document_path)
    output_pdf = PdfFileWriter()
    
    pdf = FPDF(unit='pt')
    pdf.set_font('Helvetica')
    pdf.set_font_size(font_size)
    pdf.set_margins(0, 0)
    pdf.set_draw_color(102, 255, 178)
    pdf.set_line_width(1)
    
    
    for n in range(len(input_pdf.pages)):
        input_page = input_pdf.getPage(n)
        width = float(input_page.mediaBox.getWidth())
        height = float(input_page.mediaBox.getHeight())
        pdf.add_page(format=(width, height))
        
        for hf in (hf for hf in document_result['extraction']['headerFields'] if hf['page'] == n + 1):
            x = hf['coordinates']['x']
            y = hf['coordinates']['y']
            w = hf['coordinates']['w']
            h = hf['coordinates']['h']
            pdf.rect(x * width, y * height, w * width, h * height)
            pdf.set_xy((x + w) * width + 2, y * height)
            pdf.multi_cell(min(pdf.get_string_width(str(hf['value'])) + 6, max_text_width), h=font_size, txt=str(hf['value']), border=1)
        for li in (li for line in document_result['extraction']['lineItems'] for li in line if li['page'] == n + 1):
            x = li['coordinates']['x']
            y = li['coordinates']['y']
            w = li['coordinates']['w']
            h = li['coordinates']['h']
            pdf.rect(x * width, y * height, w * width, h * height)
            pdf.set_xy((x + w) * width, y * height)
            pdf.multi_cell(min(pdf.get_string_width(str(li['value'])) + 6, max_text_width), h=font_size, txt=str(li['value']), border=1)
    
    overlay = PdfFileReader(io.BytesIO(pdf.output()))
    
    for n in range(len(input_pdf.pages)):
        page = input_pdf.getPage(n)
        page.mergePage(overlay.getPage(n))
        output_pdf.addPage(page)
      
    with open(output_path, 'wb') as out:
        output_pdf.write(out)

In [None]:
# Let's visualize the extraction results
output_file = 'output.pdf'
create_overlay('sample-invoice-1.pdf', output_file)

IFrame(output_file, 700, 1000)
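
The overlay code multiplies each coordinate by the page width or height, which suggests that the service reports bounding boxes as fractions of the page size. The scaling step can be sketched in isolation as follows; the helper name `to_page_rect` and the sample numbers are made up for illustration.

```python
def to_page_rect(coordinates, page_width, page_height):
    """Scale relative coordinates (fractions of the page) to absolute page units."""
    return (
        coordinates["x"] * page_width,
        coordinates["y"] * page_height,
        coordinates["w"] * page_width,
        coordinates["h"] * page_height,
    )


# A box that starts halfway across a 600 x 800 pt page, a quarter of the way down
rect = to_page_rect({"x": 0.5, "y": 0.25, "w": 0.25, "h": 0.125}, 600, 800)
print(rect)  # (300.0, 200.0, 150.0, 100.0)
```

The returned tuple has the same order as the arguments passed to `pdf.rect` in `create_overlay`.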

## Upload Ground Truth

Ground truth values can be uploaded to evaluate the results of Document Information Extraction.

In [None]:
# Load ground truth values from json file
with open('gt-sample-invoice-1.json') as ground_truth_file:
    ground_truth = json.load(ground_truth_file)

In [None]:
# Add ground truth values to the uploaded invoice
api_client.post_ground_truth_for_document(document_id=document_result['id'], ground_truth=ground_truth)

In [None]:
# You can now also retrieve the uploaded ground truth values by setting extracted_values to False
api_client.get_extraction_for_document(document_id=document_result['id'], extracted_values=False)