# Document Information Extraction Showcase

This notebook demonstrates how to consume the SAP AI Business Services Document Information Extraction service. In this demo, we first create an API client and then use the service to extract data from an example invoice.

## Extract credentials from service key

You need a valid service key for the Document Information Extraction service on the SAP Business Technology Platform. For the detailed setup steps, see: https://help.sap.com/viewer/5fa7265b9ff64d73bac7cec61ee55ae6/SHIP/en-US/0d68dc0002f0484ba25f85f3170166e0.html

The necessary credentials are the following:

- url: The URL of the service deployment, found at the top level of the service key JSON file
- uaa_url: The URL of the UAA server used for authentication, found in the uaa section of the service key JSON file
- clientid: The client ID used to authenticate against the UAA server, found in the uaa section of the service key JSON file
- clientsecret: The client secret used to authenticate against the UAA server, found in the uaa section of the service key JSON file
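The mapping described in the list above can be sketched as a small helper that checks a parsed service key for the required entries before anything is passed to the API client. The function name `extract_credentials` and the example values are hypothetical and only serve as an illustration; they are not part of the SAP library.

```python
import json

# Keys that must be present in the "uaa" section of the service key
REQUIRED_UAA_KEYS = ("url", "clientid", "clientsecret")


def extract_credentials(service_key):
    """Return the four values this notebook needs from a parsed service key.

    Hypothetical helper for illustration; raises KeyError if an entry is missing.
    """
    if "url" not in service_key:
        raise KeyError("service key has no top-level 'url'")
    uaa = service_key.get("uaa", {})
    missing = [key for key in REQUIRED_UAA_KEYS if key not in uaa]
    if missing:
        raise KeyError(f"service key 'uaa' section is missing: {missing}")
    return {
        "url": service_key["url"],
        "uaa_url": uaa["url"],
        "client_id": uaa["clientid"],
        "client_secret": uaa["clientsecret"],
    }


# Placeholder values for illustration only, not a real service key
example_key = json.loads("""
{
  "url": "https://service.example.com",
  "uaa": {
    "url": "https://uaa.example.com",
    "clientid": "my-client-id",
    "clientsecret": "my-client-secret"
  }
}
""")
credentials = extract_credentials(example_key)
print(credentials["uaa_url"])
```

Failing early with a clear message can be easier to debug than an authentication error raised later by the client.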

In [None]:
# Please insert your copied service key json here
service_key = {
  "url": "*******",
  "uaa": {
    "tenantmode": "*******",
    "sburl": "*******",
    "subaccountid": "*******",
    "clientid": "*******",
    "xsappname": "*******",
    "clientsecret": "*******",
    "url": "*******",
    "uaadomain": "*******",
    "verificationkey": "*******",
    "apiurl": "*******",
    "identityzone": "*******",
    "identityzoneid": "*******",
    "tenantid": "*******",
    "zoneid": "*******"
  },
  "swagger": "/document-information-extraction/v1/"
}
url = service_key['url']
uaa_url = service_key['uaa']['url']
client_id = service_key['uaa']['clientid']
client_secret = service_key['uaa']['clientsecret']

## Initialize DoxApiClient

In [None]:
# Import DOX API client
from sap_business_document_processing import DoxApiClient

In [None]:
# Instantiate object used to communicate with Document Information Extraction REST API
api_client = DoxApiClient(url, client_id, client_secret, uaa_url)

## (optional) Display access token

In [None]:
# The token can be used, for example, to authorize in Swagger UI and explore the DOX API
print(api_client.session.token)
print(f"\nYou can use this token to authorize here and explore the API via Swagger UI: \n{api_client.base_url}")
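Outside of Swagger UI, an OAuth 2.0 access token is typically sent as a bearer token in the `Authorization` header. The sketch below builds such a request with the standard library without sending it. It assumes the token is a dictionary with an `access_token` entry, as in a standard OAuth 2.0 token response; the helper name `build_authorized_request`, the token value, and the endpoint URL are placeholders.

```python
from urllib.request import Request


def build_authorized_request(endpoint_url, token):
    """Build (but do not send) a GET request carrying an OAuth 2.0 bearer token."""
    # Assumption: the token is a dict with an 'access_token' entry
    access_token = token["access_token"]
    headers = {"Authorization": f"Bearer {access_token}"}
    return Request(endpoint_url, headers=headers, method="GET")


# Placeholder token and URL for illustration only
example_token = {"access_token": "abc123", "token_type": "bearer"}
request = build_authorized_request("https://service.example.com/some-endpoint", example_token)
print(request.get_header("Authorization"))
```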

## See list of document fields you can extract

In [None]:
# Get the available document types and corresponding extraction fields
from utils import display_capabilities
capabilities = api_client.get_capabilities()
display_capabilities(capabilities)

## (optional) Create a Client

To use Document Information Extraction, you need a client. Clients are used to distinguish and separate data, and they are unrelated to the `clientid` credential from the service key. One 'default' client already exists; create a new one only if you want to perform the information extraction with a separate client.

In [None]:
# Check which clients exist for this tenant
api_client.get_clients()

In [None]:
# Create a new client with the id 'c_00' and name 'Client 00'
api_client.create_client(client_id='c_00', client_name='Client 00')

## Upload a document and retrieve the extracted result

In [None]:
# The constants provide supported content types that can be imported, e.g. for PDF, PNG, JPEG or TIFF files as well as the
# CONTENT_TYPE_UNKNOWN that lets the library fetch the content type automatically based on the file's extension
from sap_business_document_processing.document_information_extraction_client.constants import CONTENT_TYPE_PDF

# Specify the fields that should be extracted
header_fields = [
    "documentNumber",
    "taxId",
    "purchaseOrderNumber",
    "shippingAmount",
    "netAmount",
    "senderAddress",
    "senderName",
    "grossAmount",
    "currencyCode",
    "receiverContact",
    "documentDate",
    "taxAmount",
    "taxRate",
    "receiverName",
    "receiverAddress",
]
line_item_fields = [
    "description",
    "netAmount",
    "quantity",
    "unitPrice",
    "materialNumber",
]

# Extract information from invoice document
document_result = api_client.extract_information_from_document(document_path='sample-invoice-1.pdf', 
                                                               client_id='default', 
                                                               document_type='invoice', 
                                                               mime_type=CONTENT_TYPE_PDF, 
                                                               header_fields=header_fields, 
                                                               line_item_fields=line_item_fields)

In [None]:
# Check the extracted data
import json
print(json.dumps(document_result, indent=2))
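The full JSON output can be long, so a compact summary is sometimes easier to read. The following is a minimal sketch that assumes the result contains an `extraction` entry with `headerFields` (a list of field dictionaries) and `lineItems` (a list of such lists), each field carrying `name` and `value`. The helper `summarize_extraction` and the sample values are hypothetical; compare them with the structure printed above before relying on them.

```python
def summarize_extraction(result):
    """Reduce an extraction result to plain name -> value mappings.

    Assumes the 'extraction' / 'headerFields' / 'lineItems' layout described above.
    """
    extraction = result.get("extraction", {})
    # One dict for all header fields
    header = {field["name"]: field.get("value") for field in extraction.get("headerFields", [])}
    # One dict per line item
    line_items = [
        {field["name"]: field.get("value") for field in item}
        for item in extraction.get("lineItems", [])
    ]
    return header, line_items


# Made-up sample data for illustration only, not real service output
example_result = {
    "status": "DONE",
    "id": "example-id",
    "extraction": {
        "headerFields": [
            {"name": "documentNumber", "value": "INV-001", "confidence": 0.98},
            {"name": "grossAmount", "value": 119.0, "confidence": 0.95},
        ],
        "lineItems": [
            [
                {"name": "description", "value": "Widget", "confidence": 0.9},
                {"name": "quantity", "value": 2.0, "confidence": 0.88},
            ]
        ],
    },
}
header, line_items = summarize_extraction(example_result)
print(header)
print(line_items)
```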

In [None]:
# Let's visualize the extracted values on the invoice document
from utils import display_extraction
display_extraction(document_result, 'sample-invoice-1.pdf')

## (optional) Upload Ground Truth

Ground truth values can be uploaded to evaluate the results of Document Information Extraction.

In [None]:
# Load ground truth values from json file
with open('gt-sample-invoice-1.json') as ground_truth_file:
    ground_truth = json.load(ground_truth_file)

In [None]:
# Add ground truth values to the uploaded invoice
api_client.post_ground_truth_for_document(document_id=document_result['id'], ground_truth=ground_truth)

In [None]:
# You can now also retrieve the uploaded ground truth values by setting extracted_values to False
api_client.get_extraction_for_document(document_id=document_result['id'], extracted_values=False)