# Auditing workflow for named entity detection using Amazon Textract, Amazon Comprehend and Amazon A2I

**Note:** This is the accompanying notebook for Chapter 14 - auditing workflows for named entity detection. Before executing the code in this notebook, please review Chapter 14 in the book and the Read Me included with this notebook. We use a Jupyter notebook to demonstrate the API calls and to show how the solution works step by step. You can, however, build this whole solution using AWS Lambda with event triggers that fire whenever a task completes (for example, you can use the start_entities_detection_job API and set up an Amazon CloudWatch event rule that is triggered when the job is complete and executes an AWS Lambda function to perform the next set of steps). Please refer to the Further Reading section in the book for a GitHub repository of this solution deployed using AWS CloudFormation and AWS Lambda.

In this notebook we will walk you through the code required to set up your own document processing workflow with Amazon Textract, an Amazon Comprehend Custom Entity Recognizer, and Amazon Augmented AI. We will extract the content of a sample loan form, detect custom entities to determine if the application should be approved or rejected, set up a human review loop and send the predictions to it for review, update the entity list, retrain the Comprehend custom entity model, and finally save the loan approval decision to a DynamoDB table.

* Step 0 - Install and import libraries
* Step 1 - Train an Amazon Comprehend Custom Entity Recognizer
* Step 2 - Create a private human review workforce
* Step 3 - Extract input document contents using Amazon Textract
* Step 4 - Detect custom entities using Amazon Comprehend
* Step 5 - Setup and send to Amazon A2I human loop
* Step 6 - Review and modify predictions
* Step 7 - Retrain Comprehend Custom Entity Recognizer with updated entities
* Step 8 - Store predictions for downstream processing

## Prerequisites

Please make sure that you review and complete all the prerequisites documented in Chapter 14 of the book before you execute the code provided in this notebook. To run this notebook you need to set up the permissions in AWS Identity and Access Management (IAM) as described below:

**`Sagemaker Notebook Execution Role`** 
Please attach the following policies to your [Amazon SageMaker Notebook IAM Role](https://docs.aws.amazon.com/glue/latest/dg/create-an-iam-role-sagemaker-notebook.html)
* Comprehend Full Access
* SageMaker Full Access
* Your SageMaker execution role should already have access to S3. If not, add the following JSON statement as [an inline policy](https://docs.aws.amazon.com/IAM/latest/UserGuide/access_policies_manage-attach-detach.html):
    * {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Action": [
                "s3:GetObject",
                "s3:ListBucket",
                "s3:PutObject"
            ],
            "Resource": [
                "*"
            ],
            "Effect": "Allow"
                }
            ]
        }
* Add an IAM:PassRole permission as an inline policy to your SageMaker Notebook Execution Role
    * {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Action": [
                "iam:PassRole"
            ],
            "Effect": "Allow",
            "Resource": "<your-sagemaker-notebook-instance-execution-role-ARN>"
            }
           ]
        }
        
**`Trust Relationship for SageMaker execution role`**
* Finally [update or replace the Trust Relationship](https://docs.aws.amazon.com/IAM/latest/UserGuide/roles-managingrole-editing-console.html) for your SageMaker execution role with the following JSON statement:
    * { "Version": "2012-10-17", 
        "Statement": [ 
            { "Effect": "Allow", 
              "Principal": 
                { "Service": 
                    [ "sagemaker.amazonaws.com", 
                      "s3.amazonaws.com", 
                      "comprehend.amazonaws.com" ] 
                    }, 
                    "Action": "sts:AssumeRole" } 
                ] 
            }



## Step 0 - Import Libraries

We will use the key-value pair example code from the [Amazon Textract documentation](https://docs.aws.amazon.com/textract/latest/dg/examples-extract-kvp.html) to parse the Textract response, the data science library [Pandas](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.html) for content analysis, and the [Amazon SageMaker Python SDK](https://sagemaker.readthedocs.io/en/stable/) and [AWS boto3 Python SDK](https://boto3.amazonaws.com/v1/documentation/api/latest/index.html) to work with Amazon Textract, Amazon Comprehend and Amazon A2I. Let's now import the libraries we need.

In [None]:
import pandas as pd
import webbrowser, os
import json
import boto3
import re
import sagemaker
from sagemaker import get_execution_role
from sagemaker.s3 import S3Uploader, S3Downloader
import uuid
import time
import io
from io import BytesIO
import sys
import csv
from pprint import pprint
from IPython.display import Image, display
from PIL import Image as PImage, ImageDraw

# Define IAM role
role = get_execution_role()
print("RoleArn: {}".format(role))
sess = sagemaker.Session()
bucket = '<your-s3-bucket>'
prefix = 'chapter14'

s3 = boto3.client('s3')

## Step 1 - Train an Amazon Comprehend Custom Entity Recognizer

As a first step we will train an [Amazon Comprehend custom entity recognizer](https://docs.aws.amazon.com/comprehend/latest/dg/custom-entity-recognition.html) model to detect two entity types, "PERSON" and "GHOST". A PERSON represents a genuine applicant for a mortgage, whose form is sent to downstream applications for further processing; a GHOST represents either a bot or a fake applicant and should be rejected. In this chapter and notebook we will walk you through setting up this workflow in its entirety.


In [None]:
# initialize the boto3 handle for comprehend
comprehend = boto3.client('comprehend')

In [None]:
s3_raw_key = prefix + "/train/raw_txt.csv" 
s3_entity_key = prefix + "/train/entitylist.csv"

# upload the datasets from our repo to S3
s3.upload_file('train/raw_txt.csv',bucket,s3_raw_key)
s3.upload_file('train/entitylist.csv',bucket,s3_entity_key)

In [None]:
# S3 locations for our training inputs

s3_raw_txt = 's3://{}/{}'.format(bucket, s3_raw_key)
s3_entity_list = 's3://{}/{}'.format(bucket, s3_entity_key)

In [None]:
# Declare a request object to send the S3 location for our entities list and the training dataset
cer_input_object = {

      "Documents": { 
         "S3Uri": s3_raw_txt
      },
      "EntityList": { 
         "S3Uri": s3_entity_list
      },
      "EntityTypes": [
                {
                    "Type": "PERSON"
                },
                {
                    "Type": "GHOST"
                }
      ]
   
}

In [None]:
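import datetime
# Append a Unix timestamp so the recognizer name is unique.
# strftime("%s") is platform-dependent, so we use timestamp() instead.
cer_name = "loan-app-recognizer" + str(int(datetime.datetime.now().timestamp()))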
cer_response = comprehend.create_entity_recognizer(
        RecognizerName = cer_name, 
        DataAccessRoleArn = role,
        InputDataConfig = cer_input_object,
        LanguageCode = "en"
)

In [None]:
import pprint
pp = pprint.PrettyPrinter(indent=4)

response = comprehend.describe_entity_recognizer(
    EntityRecognizerArn=cer_response['EntityRecognizerArn']
)
pp.pprint(response)

### Check the status of training
Let us now use the Amazon Comprehend AWS Console to check the status of our training job:
1. Go to the [Amazon Comprehend Console](https://console.aws.amazon.com/comprehend/v2/home?region=us-east-1#welcome)
1. Click on the burger symbol on the top left, and select `custom entity recognition`
1. Scroll down a little bit to the `Entity Recognizers` view and click on the name of your Entity recognizer
1. Review the `Status`, wait for some time, refresh the page until it changes to `Trained`
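
If you would rather poll from the notebook than refresh the console, the check can be sketched as below. This is a minimal sketch and not part of the original notebook: `wait_for_recognizer` is a hypothetical helper, and the polling interval and attempt limit are arbitrary assumptions. It relies on `describe_entity_recognizer` returning the status under `EntityRecognizerProperties`.

```python
import time

def wait_for_recognizer(client, recognizer_arn, poll_seconds=60, max_polls=120):
    """Poll until training ends and return the last status seen.

    `client` is expected to be a boto3 Comprehend client. This helper is
    hypothetical and the defaults are assumptions, not recommendations.
    """
    terminal = {"TRAINED", "IN_ERROR", "STOPPED"}
    status = None
    for _ in range(max_polls):
        props = client.describe_entity_recognizer(
            EntityRecognizerArn=recognizer_arn
        )["EntityRecognizerProperties"]
        status = props["Status"]
        if status in terminal:
            break
        time.sleep(poll_seconds)
    return status
```

In this notebook the call would look like `wait_for_recognizer(comprehend, cer_response['EntityRecognizerArn'])`; continue only if it returns `TRAINED`.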

## Step 2 - Create a private human review workforce

This step requires you to use the AWS Console. We highly recommend that you follow it, especially if you plan to create your own task with a custom template like the one we use in this notebook. We will create a private workteam and add only one user (you) to it.

To create a private team:

   1. Go to AWS Console > Amazon SageMaker > Labeling workforces
   1. Click "Private" and then "Create private team".
   1. Enter the desired name for your private workteam.
   1. Enter your own email address in the "Email addresses" section.
   1. Enter the name of your organization and a contact email to administer the private workteam.
   1. Click "Create Private Team".
   1. The AWS Console should now return to AWS Console > Amazon SageMaker > Labeling workforces. Your newly created team should be visible under "Private teams". Next to it you will see an ARN which is a long string that looks like arn:aws:sagemaker:region-name-123456:workteam/private-crowd/team-name. Please copy this ARN to paste in the cell below.
   1. You should get an email from no-reply@verificationemail.com that contains your workforce username and password.
   1. In AWS Console > Amazon SageMaker > Labeling workforces, click on the URL in Labeling portal sign-in URL. Use the email/password combination from Step 8 to log in (you will be asked to create a new, non-default password).
   1. This is your private worker's interface. When we start a human loop in Step 5 below, your task should appear in this window. You can invite your colleagues to participate in the review by clicking the "Invite new workers" button.

Please refer to the [Amazon SageMaker documentation](https://docs.aws.amazon.com/sagemaker/latest/dg/sms-workforce-management.html) if you need more details.

In [None]:
# Enter the Workteam ARN from step 7 above
WORKTEAM_ARN= '<your-private-workteam-arn>'

## Step 3 - Extract input document contents using Amazon Textract

We will now review the sample loan application image included in the repository and then use Amazon Textract to extract its content. Specifically, we will select the key-value pairs (form data) that are of interest to our solution and use them to create an inference request for our Comprehend custom entity recognizer.

### Review the input document

In [None]:
# Document
documentName = "input/sample-loan-application.png"

display(Image(filename=documentName))

In [None]:
# Let us now load this image into our S3 bucket
s3.upload_file(documentName,bucket,prefix+'/'+documentName)

### Analyze document using the Textract API

We will extract the key-value pair data from this document and transform it into a request string for inference. As you can see, we are using the Amazon Textract AnalyzeDocument API, which accepts image files (PNG or JPEG) as input.

To use this example with a PDF file, or to process multiple documents together, use the [StartDocumentAnalysis API](https://docs.aws.amazon.com/textract/latest/dg/API_StartDocumentAnalysis.html) instead. It returns a job identifier (JobId), and Amazon Textract sends a message to the Amazon Simple Notification Service (Amazon SNS) topic specified in the call. You then call GetDocumentAnalysis with the JobId from the initial call to get the results from Textract.
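
The asynchronous flow can be sketched roughly as follows. This is an illustration under stated assumptions rather than the notebook's method: `analyze_document_async` is a hypothetical helper, it polls `GetDocumentAnalysis` instead of subscribing to an SNS topic, and the polling interval and attempt limit are arbitrary.

```python
import time

def analyze_document_async(client, bucket, key, poll_seconds=5, max_polls=120):
    """Start an asynchronous FORMS analysis and return all result blocks.

    `client` is expected to be a boto3 Textract client (hypothetical helper).
    """
    job_id = client.start_document_analysis(
        DocumentLocation={"S3Object": {"Bucket": bucket, "Name": key}},
        FeatureTypes=["FORMS"],
    )["JobId"]

    # Poll until the job leaves the IN_PROGRESS state or we run out of attempts.
    for _ in range(max_polls):
        page = client.get_document_analysis(JobId=job_id)
        if page["JobStatus"] != "IN_PROGRESS":
            break
        time.sleep(poll_seconds)
    if page["JobStatus"] != "SUCCEEDED":
        raise RuntimeError(
            f"Textract job {job_id} did not succeed (status: {page['JobStatus']})"
        )

    # Results may be paginated; follow NextToken until it is absent.
    blocks = list(page["Blocks"])
    while "NextToken" in page:
        page = client.get_document_analysis(JobId=job_id, NextToken=page["NextToken"])
        blocks.extend(page["Blocks"])
    return blocks
```

The returned list holds the same kind of `Blocks` that the synchronous `analyze_document` call returns in its response.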

In [None]:
textract = boto3.client('textract')
    
response = textract.analyze_document(Document={'S3Object': {
            'Bucket': bucket,
            'Name': prefix+'/'+documentName
        }}, FeatureTypes=['FORMS'])

Now we will install the Amazon Textract Response Parser library, which helps us parse the JSON response from Amazon Textract.

In [None]:
!pip install amazon-textract-response-parser

We will now extract the key and value pairs we need for our solution. We will not use the checkbox fields but only those fields with values in them. Also we will filter out the fields that we actually need in the next few steps.

In [None]:
from trp import Document
doc = Document(response)

# Collect the key/value pairs from every form field, skipping checkbox fields
rows = []
for page in doc.pages:
    for field in page.form.fields:
        if field.key is not None and field.value is not None:
            if field.value.text not in ('SELECTED', 'NOT_SELECTED'):
                rows.append({'key': field.key.text, 'value': field.value.text})

df = pd.DataFrame(rows, columns=['key', 'value'])

In [None]:
df        

### Extract contents for sending to Comprehend CER

In [None]:
# Select only the keys we need
df = df.loc[df['key'].isin(['Name (First, Middle, Last, Suffix)','Cell Phone','Years','Social Security Number','Country','Date of Birth (mm/dd/yyyy)','TOTAL $'])]
df

In [None]:
# Transpose the dataframe so we have all we need in a single row
df_T = df.T
df_T

In [None]:
# Now we will drop the key row, rename columns and get it ready to create the CSV file
df_T.columns = df_T.iloc[0]
df_T = df_T.reset_index(drop=True)
df_T = df_T.drop([0])
df_T = df_T.reset_index(drop=True)
df_T = df_T.rename(columns={"Name (First, Middle, Last, Suffix)": "Name", "Date of Birth (mm/dd/yyyy)": "Date of Birth"})
df_T

## Step 4 - Detect Entities using Comprehend custom entity recognizer

### Create Comprehend custom entity recognizer Inference request
We will now create a request file that will be sent to the Amazon Comprehend CER model to detect the entities we trained it on. This request CSV file comprises data that we extracted from our input document using Amazon Textract.

In [None]:
# Let's remove unnecessary spaces and add a document column
df_T.columns = df_T.columns.str.rstrip()
df_T['doc'] = 1
df_T

### Run the Comprehend custom entity inference

If you recollect, we trained a custom entity recognizer in Step 1 of this notebook to recognize entities from our input document and determine if the contents indicate a real PERSON or a GHOST. Now we will call the Comprehend custom entity recognizer and pass the contents of the document from the dataframe.

In [None]:
# Get the contents of interest from the extracted document
for idx, row in df_T.iterrows():
        entry = 'Country'+':'+str(row['Country']).strip()+" "+'Years'+':'+str(row['Years']).strip()+" "+'Cell Phone'+':'+str(row['Cell Phone']).strip()+" "+'Name'+':'+str(row['Name']).strip()+" "+'Social Security Number'+':'+str(row['Social Security Number']).strip()+" "+'TOTAL $'+':'+str(row['TOTAL $']).strip()+" "+'Date of Birth'+':'+str(row['Date of Birth']).strip()
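
The same `key:value` string can be assembled with a small helper, which may be easier to maintain if the list of fields changes. This is only a sketch: `build_inference_text` is a hypothetical name, and it assumes the record supports lookup by key, as both a dict and a pandas row do.

```python
def build_inference_text(record, keys):
    """Join 'key:value' pairs with single spaces, in the order given by `keys`."""
    return " ".join(f"{key}:{str(record[key]).strip()}" for key in keys)

# Example with a plain dict standing in for a dataframe row (made-up values).
sample = {"Country": " US ", "Years": 5}
print(build_inference_text(sample, ["Country", "Years"]))
```

Passing the keys in the order `Country`, `Years`, `Cell Phone`, `Name`, `Social Security Number`, `TOTAL $`, `Date of Birth` mirrors the field order used in the cell above.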

In [None]:
# Lets setup an Amazon Comprehend real-time endpoint
custom_recognizer_arn=cer_response['EntityRecognizerArn']

endpoint_response = comprehend.create_endpoint(
    EndpointName='nlp-chapter14-cer-endpoint',
    ModelArn=custom_recognizer_arn,
    DesiredInferenceUnits=2,
    DataAccessRoleArn=role
)

endpoint_response['EndpointArn']

**Note:** Navigate to Amazon Comprehend console, go to custom entity recognition from the left menu, click on your recognizer, and scroll down to verify your real-time endpoint has been created successfully. If the endpoint is not active, the code in the cell below will fail. It may take about 15 minutes for the endpoint to be ready.
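
As an alternative to checking the console, the endpoint status can be polled from the notebook. The helper below is a hedged sketch: `wait_for_endpoint` is a hypothetical name, the interval and attempt limit are assumptions, and it relies on `describe_endpoint` reporting the status under `EndpointProperties`.

```python
import time

def wait_for_endpoint(client, endpoint_arn, poll_seconds=30, max_polls=60):
    """Return True once the endpoint is IN_SERVICE, False on failure or timeout.

    `client` is expected to be a boto3 Comprehend client (hypothetical helper).
    """
    for _ in range(max_polls):
        status = client.describe_endpoint(EndpointArn=endpoint_arn)[
            "EndpointProperties"]["Status"]
        if status == "IN_SERVICE":
            return True
        if status == "FAILED":
            return False
        time.sleep(poll_seconds)
    return False
```

Here it would be called as `wait_for_endpoint(comprehend, endpoint_response['EndpointArn'])` before running the detection cell.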

In [None]:
# Call the real-time endpoint to detect custom entities in our text
response = comprehend.detect_entities(Text=entry,
                    LanguageCode='en',
                    EndpointArn=endpoint_response['EndpointArn']
            )

print(response)

### Prepare the response from Comprehend for Amazon A2I

Let's now create a list to be sent to Amazon A2I for building the UI that the human workforce will use to review the predictions of our entity recognizer.

In [None]:
#Display the results from the detection
import json
human_loop_input = []
data = {}
ent = response['Entities']
existing_entities = []
if ent:  # at least one entity was detected
    for entity in ent:       
        current_entity = {}
        current_entity['label'] = entity['Type']
        current_entity['text'] = entity['Text']
        current_entity['startOffset'] = entity['BeginOffset']
        current_entity['endOffset'] = entity['EndOffset']
        existing_entities.append(current_entity)
        
    data['ORIGINAL_TEXT'] = entry
    data['ENTITIES'] = existing_entities   
    human_loop_input.append(data)

print(human_loop_input)        

## Step 5 - Setup and send to Amazon A2I human loop

Now that we have the detected entities from our Comprehend custom entity recognizer, it is time to set up a human workflow using the Private Team we created in `Step 2` and send the results to the Amazon A2I human loop for review, and modifications/augmentation as required. Subsequently, we will update the `entitylist.csv` file that we originally used to train our Comprehend custom entity recognizer so we can prepare it for retraining based on the human feedback. 

In [None]:
timestamp = time.strftime("%Y-%m-%d-%H-%M-%S", time.gmtime())
# Amazon SageMaker client
sagemaker = boto3.client('sagemaker')

# Amazon Augmented AI (A2I) client
a2i = boto3.client('sagemaker-a2i-runtime')

# Flow definition name
flowDefinition = 'fd-nlp-chapter14-' + timestamp

# Task UI name - this value is unique per account and region. You can also provide your own value here.
taskUIName = 'ui-nlp-chapter14-' + timestamp

# Flow definition outputs
OUTPUT_PATH = f's3://{bucket}/{prefix}/a2i-results'

### Create the human task UI

The template in the cell below will be rendered to the human workers whenever the human loop is required. For over 70 pre built UIs, check: https://github.com/aws-samples/amazon-a2i-sample-task-uis. Let's also declare some variables that we need during the next set of steps.

In [None]:
# We customized the tabular template for our notebook as below
template = r"""
<script src="https://assets.crowd.aws/crowd-html-elements.js"></script>

<crowd-entity-annotation
        name="crowd-entity-annotation"
        header="Highlight parts of the text below"
        labels="{{ task.input.labels | to_json | escape }}"
        initial-value="{{ task.input.initialValue }}"
        text="{{ task.input.originalText }}"
>
    <full-instructions header="Please follow the instructions below">
        <ol>
            <li><strong>Read</strong> the text carefully.</li>
            <li><strong>Highlight</strong> words, phrases, or sections of the text.</li>
            <li><strong>Choose</strong> the label that best matches what you have highlighted.</li>
            <li>To <strong>change</strong> a label, choose highlighted text and select a new label.</li>
            <li>To <strong>remove</strong> a label from highlighted text, choose the X next to the abbreviated label name on the highlighted text.</li>
            <li>You can select all of a previously highlighted text, but not a portion of it.</li>
        </ol>
    </full-instructions>

    <short-instructions>
        Highlight and label the custom entities that were not detected by the model
    </short-instructions>

</crowd-entity-annotation>

<script>
    document.addEventListener('all-crowd-elements-ready', () => {
        document
            .querySelector('crowd-entity-annotation')
            .shadowRoot
            .querySelector('crowd-form')
            .form;
    });
</script>
"""



In [None]:
def create_task_ui():
    '''
    Creates a Human Task UI resource.

    Returns:
    struct: HumanTaskUiArn
    '''
    response = sagemaker.create_human_task_ui(
        HumanTaskUiName=taskUIName,
        UiTemplate={'Content': template})
    return response

In [None]:
# Create task UI
humanTaskUiResponse = create_task_ui()
humanTaskUiArn = humanTaskUiResponse['HumanTaskUiArn']
print(humanTaskUiArn)

### Create the Flow Definition

Flow Definitions allow us to specify:

* The workforce that your tasks will be sent to.
* The instructions that your workforce will receive. This is called a worker task template.
* Where your output data will be stored.

This demo uses the API, but you can optionally create this flow definition in the console as well.

For more details and instructions, see: https://docs.aws.amazon.com/sagemaker/latest/dg/a2i-create-flow-definition.html.

In [None]:
create_workflow_definition_response = sagemaker.create_flow_definition(
        FlowDefinitionName= flowDefinition,
        RoleArn= role,
        HumanLoopConfig= {
            "WorkteamArn": WORKTEAM_ARN,
            "HumanTaskUiArn": humanTaskUiArn,
            "TaskCount": 1,
            "TaskDescription": "Review the contents and correct values as indicated",
            "TaskTitle": "LOAN APPLICATION REVIEW"
        },
        OutputConfig={
            "S3OutputPath" : OUTPUT_PATH
        }
    )
flowDefinitionArn = create_workflow_definition_response['FlowDefinitionArn'] # let's save this ARN for future use

In [None]:
for x in range(60):
    describeFlowDefinitionResponse = sagemaker.describe_flow_definition(FlowDefinitionName=flowDefinition)
    print(describeFlowDefinitionResponse['FlowDefinitionStatus'])
    if (describeFlowDefinitionResponse['FlowDefinitionStatus'] == 'Active'):
        print("Flow Definition is active")
        break
    time.sleep(2)

### Sending predictions to Amazon A2I human loops

In [None]:
# Let's start the human loop
human_loops_started = []

import json

for line in human_loop_input:
    humanLoopName = str(uuid.uuid4())
    human_loop_in = {}
    human_loop_in['labels'] = [{'label': 'PERSON', 'shortDisplayName': 'PER', 'fullDisplayName': 'PERSON'},{'label': 'GHOST', 'shortDisplayName': 'GHO', 'fullDisplayName': 'GHOST'}]
    human_loop_in['originalText'] = line['ORIGINAL_TEXT']
    human_loop_in['initialValue'] = line['ENTITIES']
                
        
    start_loop_response = a2i.start_human_loop(
        HumanLoopName=humanLoopName,
        FlowDefinitionArn=flowDefinitionArn,
        HumanLoopInput={
                "InputContent": json.dumps(human_loop_in)
            }
        )
    print(human_loop_in)
    human_loops_started.append(humanLoopName)
    print(f'Starting human loop with name: {humanLoopName}  \n')

#### Check status of human loop

In [None]:
completed_human_loops = []
a2i_resp = a2i.describe_human_loop(HumanLoopName=humanLoopName)
print(f'HumanLoop Name: {humanLoopName}')
print(f'HumanLoop Status: {a2i_resp["HumanLoopStatus"]}')
print(f'HumanLoop Output Destination: {a2i_resp["HumanLoopOutput"]}')
print('\n')
   
      
if a2i_resp["HumanLoopStatus"] == "Completed":
    completed_human_loops.append(a2i_resp)

## Step 6 - Review and modify predictions

Now we will log in to the Amazon A2I task UI to review, change, and re-label the predictions from the Amazon Comprehend custom entity recognizer.

#### Let's login to the worker portal to review the predictions and modify them as required

In [None]:
workteamName = WORKTEAM_ARN[WORKTEAM_ARN.rfind('/') + 1:]
print("Navigate to the private worker portal and do the tasks. Make sure you've invited yourself to your workteam!")
print('https://' + sagemaker.describe_workteam(WorkteamName=workteamName)['Workteam']['SubDomain'])

#### Let's check the status of the human loop again to see it completed

In [None]:
completed_human_loops = []
resp = a2i.describe_human_loop(HumanLoopName=humanLoopName)
print(f'HumanLoop Name: {humanLoopName}')
print(f'HumanLoop Status: {resp["HumanLoopStatus"]}')
print(f'HumanLoop Output Destination: {resp["HumanLoopOutput"]}')
print('\n')
    
if resp["HumanLoopStatus"] == "Completed":
    completed_human_loops.append(resp)

#### Let's review the annotation output

In [None]:
import re
import pprint

pp = pprint.PrettyPrinter(indent=4)

for resp in completed_human_loops:
    splitted_string = re.split('s3://' + bucket  + '/', resp['HumanLoopOutput']['OutputS3Uri'])
    output_bucket_key = splitted_string[1]
    response = s3.get_object(Bucket=bucket, Key=output_bucket_key)
    content = response["Body"].read()
    json_output = json.loads(content)
    pp.pprint(json_output)
    print('\n')
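
The cell above recovers the object key by splitting the URI on the bucket name. A more general way to separate bucket and key is to parse the URI, sketched below. `split_s3_uri` is a hypothetical helper, and the URI in the example is made up for illustration.

```python
from urllib.parse import urlparse

def split_s3_uri(uri):
    """Split 's3://bucket/key/parts' into ('bucket', 'key/parts')."""
    parsed = urlparse(uri)
    if parsed.scheme != "s3" or not parsed.netloc:
        raise ValueError(f"Not an S3 URI: {uri}")
    return parsed.netloc, parsed.path.lstrip("/")

# Made-up URI, used only to show the shape of the result.
print(split_s3_uri("s3://my-bucket/chapter14/a2i-results/output.json"))
```

This avoids treating the bucket name as a regular expression, which matters when the name contains characters such as dots.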

### Verify if new entities are present

If our human workflow team updated the entities and labels, we need to retrain our custom entity recognition model to ensure it's able to detect the new or updated entities the next time around. In the next set of cells we will update our original entitylist file with the new or changed entity labels and submit a retraining job.

In [None]:
a2i_entities = json_output['humanAnswers'][0]['answerContent']['crowd-entity-annotation']['entities']
a2i_entities

In [None]:
input_content = json_output['inputContent']
original_text = input_content['originalText']

In [None]:
print(input_content)
print("*********************")
print(original_text)

In [None]:
retrain='N'
# Read the current entity list once
with open('train/entitylist.csv','r') as f:
    el = f.read()
# Append any annotated text that is not already in the entity list;
# the with-block closes the file so the changes are flushed before the upload in Step 7
with open('train/entitylist.csv','a') as f:
    for annotated_entity in a2i_entities:
        entity_text = original_text[annotated_entity['startOffset']:annotated_entity['endOffset']]
        if entity_text not in el:
            retrain='Y'
            word = '\n'+entity_text+','+annotated_entity['label'].upper()
            print("Updating Entity List with: " + word)
            f.write(word)

if retrain == 'Y':
    print("Entity list updated, model to be retrained")
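
Note that the check above is a substring test against the whole file, so an annotation such as `John` is treated as already present if the file contains `John Doe`. A stricter variant compares against the entity text of each row. The sketch below is an illustration only: `find_new_entities` is a hypothetical helper, and it assumes the entity list is a CSV whose first column holds the entity text.

```python
import csv
import io

def find_new_entities(entity_csv_text, original_text, annotations):
    """Return (text, label) pairs for annotations missing from the entity list."""
    known = set()
    for row in csv.reader(io.StringIO(entity_csv_text)):
        if row:
            known.add(row[0].strip())  # first column holds the entity text

    new_rows = []
    for ann in annotations:
        text = original_text[ann["startOffset"]:ann["endOffset"]]
        if text not in known:
            new_rows.append((text, ann["label"].upper()))
            known.add(text)  # avoid adding the same text twice
    return new_rows
```

Each returned pair can then be appended to `train/entitylist.csv` as a `text,LABEL` line.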

## Step 7 - Retrain Comprehend Custom Entity Recognizer with updated entities

In the previous section we saw that the human loop had identified one new entity and 2 entities that the model detected correctly but that were not present in the original entity list, so all 3 entities were added to the entity list. In this step we will train a new Amazon Comprehend model using the updated entity list.

In [None]:
s3_raw_key = prefix + "/train/raw_txt.csv" 
s3_entity_key = prefix + "/train/entitylist.csv"

# upload the datasets from our repo to S3
s3.upload_file('train/raw_txt.csv',bucket,s3_raw_key)
s3.upload_file('train/entitylist.csv',bucket,s3_entity_key)

In [None]:
# S3 locations for our training inputs

s3_raw_txt = 's3://{}/{}'.format(bucket, s3_raw_key)
s3_entity_list = 's3://{}/{}'.format(bucket, s3_entity_key)

In [None]:
# Declare a request object to send the S3 location for our entities list and the training dataset
cer_input_object = {

      "Documents": { 
         "S3Uri": s3_raw_txt
      },
      "EntityList": { 
         "S3Uri": s3_entity_list
      },
      "EntityTypes": [
                {
                    "Type": "PERSON"
                },
                {
                    "Type": "GHOST"
                }
      ]
   
}

In [None]:
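import datetime
# Append a Unix timestamp so the recognizer name is unique.
# strftime("%s") is platform-dependent, so we use timestamp() instead.
cer_name = "retrain-loan-recognizer" + str(int(datetime.datetime.now().timestamp()))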
cer_response = comprehend.create_entity_recognizer(
        RecognizerName = cer_name, 
        DataAccessRoleArn = role,
        InputDataConfig = cer_input_object,
        LanguageCode = "en"
)

In [None]:
response = comprehend.describe_entity_recognizer(
    EntityRecognizerArn=cer_response['EntityRecognizerArn']
)
pp.pprint(response)

#### To proceed with testing the retrained recognizer, please follow steps 2 to 5 from above

## Step 8 - Store predictions for downstream processing

Now that we understand the complete document management workflow, let us execute the steps needed to persist the results of our entity detection so we can send them to a downstream application. For example, let us assume that an adjudication application needs our decision to determine whether it should process the incoming loan application. To do this, we will examine the output from Amazon A2I. If half or more of the entities are of type "GHOST" (which includes an even split), we send a rejection decision; if more than half but not more than 80% are of type "PERSON", we send a summary approval; and if more than 80% are "PERSON", we send an approval.

In [None]:
# Check the response from A2I
a2i_entities

In [None]:
# Let's convert the list of entity dicts to a list of labels for ease of use
labellist = []
for i in a2i_entities:
    labellist.append(i['label'])
    

In [None]:
# Let's check the weights and determine the document status
from collections import Counter

docstatus = ''

total = len(labellist)
label_counts = Counter(labellist)
ghost = label_counts['GHOST']
person = label_counts['PERSON']

if ghost >= total*.5:
    docstatus = 'REJECT'
elif person > total*.8:
    docstatus = 'APPROVE'
elif person > total*.5:
    # more than half but at most 80% of the labels are PERSON
    docstatus = 'SUMMARY APPROVE'

print(docstatus)    
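
To see how the thresholds behave on different label mixes, the rule can be wrapped in a function and tried on a few lists. This is a sketch, not the notebook's code: `decide_status` is a hypothetical helper, and two of its choices are assumptions, namely that exactly 80% PERSON counts as a summary approval and that anything without a clear PERSON majority is rejected.

```python
from collections import Counter

def decide_status(labels):
    """Map a list of entity labels to REJECT, SUMMARY APPROVE, or APPROVE."""
    total = len(labels)
    counts = Counter(labels)
    if total == 0 or counts["GHOST"] >= total * 0.5:
        return "REJECT"
    if counts["PERSON"] > total * 0.8:
        return "APPROVE"
    if counts["PERSON"] > total * 0.5:
        return "SUMMARY APPROVE"
    return "REJECT"  # assumption: no clear PERSON majority is treated as a rejection

print(decide_status(["PERSON"] * 5))              # APPROVE
print(decide_status(["PERSON"] * 4 + ["GHOST"]))  # SUMMARY APPROVE
print(decide_status(["PERSON", "GHOST"]))         # REJECT
```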

#### Create the DynamoDB table

In [None]:
# now we will create a dynamoDB table and upload the results along with the original content from the document
# Get the service resource.
dynamodb = boto3.resource('dynamodb')
tablename = "loan_status-"+str(uuid.uuid4())
# Create the DynamoDB table.
table = dynamodb.create_table(
    TableName=tablename,
    KeySchema=[
        {
            'AttributeName': 'doc',
            'KeyType': 'HASH'
        }
    ],
    AttributeDefinitions=[
        {
            'AttributeName': 'doc',
            'AttributeType': 'N'
        },
    ],
    ProvisionedThroughput={
        'ReadCapacityUnits': 5,
        'WriteCapacityUnits': 5
    }
)
# Wait until the table exists, this will take a minute or so
table.meta.client.get_waiter('table_exists').wait(TableName=tablename)

# Print out some data about the table.
print("Table successfully created. Item count is: " + str(table.item_count))

#### Load the table

In [None]:
# Use the loop variable `row` so each item is built from the row being iterated
for i, row in df_T.iterrows():
    table.put_item(
       Item={
        'doc': int(row['doc']),
        'Country': str(row['Country']) ,
        'Years': str(row['Years']),
        'Cell Phone': str(row['Cell Phone']),   
        'Name': str(row['Name']),
        'Social Security Number': str(row['Social Security Number']),
        'TOTAL $': str(row['TOTAL $']),
        'Date of Birth': str(row['Date of Birth']),
        'Document Status': docstatus 
        }
    )

print("Items were successfully created in DynamoDB table")

In [None]:
# Let's check our insert
response = table.get_item(
    Key={
        'doc': 1
    }
)
item = response['Item']
print(item)

## Conclusion

And that's it, we are done with our demo. Please refer to the Further Reading section in the book for more example approaches for this use case as well as the code sample for building this same solution using AWS Lambda and CloudFormation.