## Step 0 - Install and Import Libraries

We will use the [Amazon Textract Parser Library](https://github.com/aws-samples/amazon-textract-response-parser/tree/master/src-python) to parse the Textract response, the data science library [pandas](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.html) for content analysis, the [Amazon SageMaker Python SDK](https://sagemaker.readthedocs.io/en/stable/), and the [AWS SDK for Python (Boto3)](https://boto3.amazonaws.com/v1/documentation/api/latest/index.html) to work with Amazon Textract and Amazon A2I. Let's install and import them.

In [494]:
# install trp
!pip install amazon-textract-response-parser

Looking in indexes: https://pypi.org/simple, https://pip.repos.neuron.amazonaws.com


In [912]:
import pandas as pd
import webbrowser, os
import json
import boto3
import re
import sagemaker
from sagemaker import get_execution_role
from sagemaker.s3 import S3Uploader, S3Downloader
import uuid
import time
import io
from io import BytesIO
import sys
import csv
from pprint import pprint
from IPython.display import Image, display, IFrame
from PIL import Image as PImage, ImageDraw
from textractprettyprinter.t_pretty_print_expense import get_string, Textract_Expense_Pretty_Print, Pretty_Print_Table_Format, get_expensesummary_string, get_expenselineitemgroups_string
# from textractcaller.t_call import call_textract, Textract_Features
from textractprettyprinter.t_pretty_print import Textract_Pretty_Print, get_string

### Set up a private review workforce

This step requires the AWS Console. We highly recommend that you follow it, especially if you plan to create your own task with a custom template like the one used in this notebook. We will create a private work team and add only one user (you) to it.

To create a private team:

   1. Go to AWS Console > Amazon SageMaker > Labeling workforces
   1. Click "Private" and then "Create private team".
   1. Enter the desired name for your private workteam.
   1. Enter your own email address in the "Email addresses" section.
   1. Enter the name of your organization and a contact email to administer the private workteam.
   1. Click "Create Private Team".
   1. The AWS Console should now return to AWS Console > Amazon SageMaker > Labeling workforces. Your newly created team should be visible under "Private teams". Next to it you will see an ARN which is a long string that looks like arn:aws:sagemaker:region-name-123456:workteam/private-crowd/team-name. Please copy this ARN to paste in the cell below.
   1. You should get an email from no-reply@verificationemail.com that contains your workforce username and password.
   1. In AWS Console > Amazon SageMaker > Labeling workforces, click on the URL in Labeling portal sign-in URL. Use the email/password combination from Step 8 to log in (you will be asked to create a new, non-default password).
   1. This is your private worker's interface. When we create a verification task in the "Verify your task using a private team" section below, your task should appear in this window. You can invite your colleagues to participate in the labeling job by clicking the "Invite new workers" button.

Please refer to the Amazon SageMaker documentation if you need more details.
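Before pasting the work team ARN into the next cell, it can help to check that the copied string has the expected shape. The following is a minimal sketch, not part of the original notebook: the helper name `is_private_workteam_arn` and the regular expression are assumptions based on the ARN format described above (partition, region, 12-digit account ID, then `workteam/private-crowd/<team-name>`).

```python
import re

# Assumed shape: arn:<partition>:sagemaker:<region>:<12-digit account>:workteam/private-crowd/<team>
WORKTEAM_ARN_PATTERN = re.compile(
    r"^arn:aws[a-z-]*:sagemaker:[a-z0-9-]+:\d{12}:workteam/private-crowd/[A-Za-z0-9-]+$"
)


def is_private_workteam_arn(arn: str) -> bool:
    """Return True if the string looks like a private work team ARN."""
    return bool(WORKTEAM_ARN_PATTERN.match(arn.strip()))


# Hypothetical ARN used only to illustrate the check
example_arn = "arn:aws:sagemaker:us-east-1:123456789012:workteam/private-crowd/demo"
if not is_private_workteam_arn(example_arn):
    raise ValueError("The value does not look like a private work team ARN")
```

This check only catches copy-and-paste mistakes; it does not confirm that the work team exists in your account.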

In [913]:
# Enter the Workteam ARN from step 7 above
WORKTEAM_ARN= 'arn:aws:sagemaker:us-east-1:485636232393:workteam/private-crowd/demo'
 
# Define IAM role
role = get_execution_role()
print("RoleArn: {}".format(role))
sess = sagemaker.Session()
bucket = sess.default_bucket()
prefix = 'textract-a2i-handwritten'

RoleArn: arn:aws:iam::485636232393:role/TextractA2I-SageMakerIamRole-T6LRZRF62Q68


In [963]:
!pip install poppler-utils


Looking in indexes: https://pypi.org/simple, https://pip.repos.neuron.amazonaws.com
Collecting poppler-utils
  Downloading poppler_utils-0.1.0-py3-none-any.whl (9.2 kB)
Installing collected packages: poppler-utils
Successfully installed poppler-utils-0.1.0


In [1001]:
!df -h


Filesystem      Size  Used Avail Use% Mounted on
devtmpfs        1.9G     0  1.9G   0% /dev
tmpfs           2.0G     0  2.0G   0% /dev/shm
tmpfs           2.0G  588K  2.0G   1% /run
tmpfs           2.0G     0  2.0G   0% /sys/fs/cgroup
/dev/xvda1      180G  135G   46G  75% /
/dev/xvdf       9.7G   44M  9.2G   1% /home/ec2-user/SageMaker
tmpfs           391M     0  391M   0% /run/user/1002
tmpfs           391M     0  391M   0% /run/user/1001
tmpfs           391M     0  391M   0% /run/user/1000


## Step 1 - Use Amazon Textract to retrieve document content and inspect response

In this step, we will take our test document (a sample Form 1040 page with handwritten entries) on the notebook instance, use Amazon Textract to read the printed and handwritten content in its tables and form fields, and load the results into pandas DataFrames for analysis.

#### Review the sample document which has both printed and handwritten content in the tables

In [1005]:
import fitz  # PyMuPDF
from IPython.display import display_pdf
from pdf2image import convert_from_path

documentName = "f1040 Page 1 Sample - Handwritten - Soumya.pdf"


def pdf_to_images(pdf_file, output_folder):
    # Open the PDF file
    pdf_document = fitz.open(pdf_file)

    # Render each page to an image. Note: every page is written to the same
    # file name, so this works as intended only for a single-page PDF such as
    # the sample used here.
    for page_number in range(len(pdf_document)):
        page = pdf_document[page_number]
        image = page.get_pixmap()

        # Save the image as a PNG file
        image.save(f"{output_folder}/{pdf_file}.png")

output_folder = '.'  # Replace with your desired output folder

pdf_to_images(documentName, output_folder)
image_paths = [f"{output_folder}/{documentName}.png"]



In [1075]:
s3_img_url = S3Uploader.upload(f"{documentName}.png", 's3://{}/{}'.format(bucket, prefix))

In [784]:
client = boto3.client(
         service_name='textract',
         region_name= 'us-east-1',
         endpoint_url='https://textract.us-east-1.amazonaws.com',
)

In [785]:
with open(documentName, 'rb') as file:
    img_test = file.read()
    bytes_test = bytearray(img_test)
    print('Image loaded', documentName)

Image loaded f1040 Page 1 Sample - Handwritten - Soumya.pdf


In [786]:
# process using image bytes
response = client.analyze_document(Document={'Bytes': bytes_test}, FeatureTypes=['TABLES','FORMS'])

In [787]:
response

{'DocumentMetadata': {'Pages': 1},
 'Blocks': [{'BlockType': 'PAGE',
   'Geometry': {'BoundingBox': {'Width': 1.0,
     'Height': 1.0,
     'Left': 0.0,
     'Top': 0.0},
    'Polygon': [{'X': 0.0, 'Y': 2.555005664817145e-07},
     {'X': 1.0, 'Y': 0.0},
     {'X': 1.0, 'Y': 1.0},
     {'X': 9.367063853460422e-07, 'Y': 1.0}]},
   'Id': '1c20c8c8-3823-4188-beaa-ad8dbe42a600',
   'Relationships': [{'Type': 'CHILD',
     'Ids': ['8f7b1058-3309-46d4-8ac3-77e390f1176c',
      '7a5c8cfa-a7ec-4d0f-88d2-a797c829f3d9',
      'a79f6fab-e290-4190-bd07-856cef6c28ee',
      'e4119cb5-4403-4deb-b016-bd6c11c0ea69',
      'ad185e1a-cf8c-4b6c-ad82-e0be6dc95fe8',
      'dbd26741-74ee-49f0-95be-ddcebff675e0',
      '428e1c4a-42de-46b1-92f3-34ae3840a4ba',
      '9de74134-b335-4651-a26a-a15c61590ef2',
      'cdeb30c8-0037-4534-b956-c343b4b3bcf0',
      '456a762f-40f7-40dd-a9c7-a6e06fdf9b9d',
      '4ad03df3-7225-42a5-a1e6-ba17fe2ab990',
      'bfb0a1b8-d870-4095-b5aa-cc0b729feec8',
      'ce5871ef-65ed-4d2e
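
The raw response is long, so a quick summary by block type can be easier to read than the full dump. The sketch below is an illustration rather than part of the original notebook: `summarize_blocks` is a hypothetical helper, and `sample_response` is a tiny hand-made dictionary that only mimics the `Blocks` / `BlockType` layout shown above.

```python
from collections import Counter


def summarize_blocks(response: dict) -> Counter:
    """Count how many blocks of each BlockType a Textract-style response contains."""
    return Counter(block["BlockType"] for block in response.get("Blocks", []))


# Hand-made stand-in for a real response (values are illustrative only)
sample_response = {
    "DocumentMetadata": {"Pages": 1},
    "Blocks": [
        {"BlockType": "PAGE", "Id": "p1"},
        {"BlockType": "LINE", "Id": "l1", "Text": "Filing Status"},
        {"BlockType": "WORD", "Id": "w1", "Text": "Filing"},
        {"BlockType": "WORD", "Id": "w2", "Text": "Status"},
    ],
}

for block_type, count in summarize_blocks(sample_response).items():
    print(f"{block_type}: {count}")
```

In the notebook, the same helper could be called as `summarize_blocks(response)` on the real result of `analyze_document`.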

In [788]:
from trp import Document

# Parse JSON response from Textract
doc = Document(response)

# Iterate over elements in the document
for page in doc.pages:
    # Print lines and words
    for line in page.lines:
        print("Line: {}".format(line.text))
        for word in line.words:
            print("Word: {}".format(word.text))

    # Print tables
    for table in page.tables:
        for r, row in enumerate(table.rows):
            for c, cell in enumerate(row.cells):
                print("Table[{}][{}] = {}".format(r, c, cell.text))

    # Print fields
    for field in page.form.fields:
        key_text = field.key.text if field.key else "N/A"
        value_text = field.value.text if field.value else "N/A"
        print("Field: Key: {}, Value: {}".format(key_text, value_text))
        


Line: 1040
Word: 1040
Line: Department of the Treasury - Internal Revenue Service
Word: Department
Word: of
Word: the
Word: Treasury
Word: -
Word: Internal
Word: Revenue
Word: Service
Line: U.S. Individual Income Tax Return
Word: U.S.
Word: Individual
Word: Income
Word: Tax
Word: Return
Line: 2022
Word: 2022
Line: OMB No. 1545-0074
Word: OMB
Word: No.
Word: 1545-0074
Line: IRS Use Only-Do not write or staple in this space.
Word: IRS
Word: Use
Word: Only-Do
Word: not
Word: write
Word: or
Word: staple
Word: in
Word: this
Word: space.
Line: Filing Status
Word: Filing
Word: Status
Line: Single
Word: Single
Line: Married filing jointly
Word: Married
Word: filing
Word: jointly
Line: Married filing separately (MFS)
Word: Married
Word: filing
Word: separately
Word: (MFS)
Line: Head of household (HOH)
Word: Head
Word: of
Word: household
Word: (HOH)
Line: Qualifying surviving
Word: Qualifying
Word: surviving
Line: Check only
Word: Check
Word: only
Line: spouse (QSS)
Word: spouse
Word: (QSS)
Line

In [789]:
from trp import Document

# Parse JSON response from Textract
doc = Document(response)

# Iterate over elements in the document
for page in doc.pages:
    # Print lines and words
    for line in page.lines:
        print("Line: {}".format(line.text))
        for word in line.words:
            print("Word: {}, Confidence: {}".format(word.text, word.confidence))  # Add confidence

    # Print tables
    for table in page.tables:
        for r, row in enumerate(table.rows):
            for c, cell in enumerate(row.cells):
                print("Table[{}][{}] = {}, Confidence: {}".format(r, c, cell.text, cell.confidence))  # Add confidence

    # Print fields
    for field in page.form.fields:
        key_text = field.key.text if field.key else "N/A"
        key_confidence = field.key.confidence if field.key else "N/A"
        value_text = field.value.text if field.value else "N/A"
        value_confidence = field.value.confidence if field.value else "N/A"
        print("Field: Key: {}, Confidence: {}, Value: {}, Confidence: {}".format(key_text, key_confidence, value_text, value_confidence))  # Add confidence


Line: 1040
Word: 1040, Confidence: 99.93353271484375
Line: Department of the Treasury - Internal Revenue Service
Word: Department, Confidence: 99.87731170654297
Word: of, Confidence: 99.91316223144531
Word: the, Confidence: 99.95600891113281
Word: Treasury, Confidence: 90.90239715576172
Word: -, Confidence: 78.8201904296875
Word: Internal, Confidence: 89.28166961669922
Word: Revenue, Confidence: 99.81559753417969
Word: Service, Confidence: 99.86648559570312
Line: U.S. Individual Income Tax Return
Word: U.S., Confidence: 98.98809814453125
Word: Individual, Confidence: 98.58805847167969
Word: Income, Confidence: 98.80280303955078
Word: Tax, Confidence: 99.95079040527344
Word: Return, Confidence: 99.95491790771484
Line: 2022
Word: 2022, Confidence: 99.92893981933594
Line: OMB No. 1545-0074
Word: OMB, Confidence: 99.82048034667969
Word: No., Confidence: 99.04751586914062
Word: 1545-0074, Confidence: 99.56050872802734
Line: IRS Use Only-Do not write or staple in this space.
Word: IRS, Confi

In [790]:
from trp import Document
import json

# Create a dictionary to store fields and values
fields_dict = {}

# Parse JSON response from Textract
doc = Document(response)

# Function to calculate the average confidence score for a list of elements
def calculate_average_confidence(elements):
    total_confidence = sum(element.confidence for element in elements)
    return total_confidence / len(elements) if len(elements) > 0 else 0.0

# Iterate over elements in the document
for page in doc.pages:
    # Create a dictionary to store word confidence scores
    word_confidence_scores = {}
    
    # Print lines and words
    for line in page.lines:
        print("Line: {}, Confidence: {}".format(line.text, line.confidence))
        for word in line.words:
            word_text = word.text
            word_confidence = word.confidence
            word_confidence_scores[word_text] = word_confidence  # Store word confidence score in the dictionary
            print("Word: {}, Confidence: {}".format(word_text, word_confidence))

    # Print tables
    for table in page.tables:
        print("Table Confidence Score: {}".format(table.confidence))  # Print the confidence score of the table
        for r, row in enumerate(table.rows):
            for c, cell in enumerate(row.cells):
                cell_text = cell.text.strip()  # Get the text of the cell
                
                # Assign the confidence score of the word to the table cell if there is a matching word
                cell_confidence = word_confidence_scores.get(cell_text, 0.0)
                
                print("Table[{}][{}] = {}, Confidence: {}".format(r, c, cell_text, cell_confidence))


Line: 1040, Confidence: 99.93353271484375
Word: 1040, Confidence: 99.93353271484375
Line: Department of the Treasury - Internal Revenue Service, Confidence: 94.8041000366211
Word: Department, Confidence: 99.87731170654297
Word: of, Confidence: 99.91316223144531
Word: the, Confidence: 99.95600891113281
Word: Treasury, Confidence: 90.90239715576172
Word: -, Confidence: 78.8201904296875
Word: Internal, Confidence: 89.28166961669922
Word: Revenue, Confidence: 99.81559753417969
Word: Service, Confidence: 99.86648559570312
Line: U.S. Individual Income Tax Return, Confidence: 99.2569351196289
Word: U.S., Confidence: 98.98809814453125
Word: Individual, Confidence: 98.58805847167969
Word: Income, Confidence: 98.80280303955078
Word: Tax, Confidence: 99.95079040527344
Word: Return, Confidence: 99.95491790771484
Line: 2022, Confidence: 99.92893981933594
Word: 2022, Confidence: 99.92893981933594
Line: OMB No. 1545-0074, Confidence: 99.47616577148438
Word: OMB, Confidence: 99.82048034667969
Word: No
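
Looking up a cell's confidence by its text only works when the whole cell matches a single word, so multi-word cells such as "JOHN J" fall back to 0.0. One alternative is to average the confidence of the WORD blocks that a cell lists as children. The sketch below works on raw Textract-style block dictionaries; the helper name `average_child_word_confidence` and the sample blocks are assumptions made for illustration, not code from the original notebook.

```python
def average_child_word_confidence(blocks, cell_id):
    """Average the Confidence of the WORD blocks referenced as CHILD of a cell.

    Returns None when the cell has no child words (for example, an empty cell).
    """
    by_id = {block["Id"]: block for block in blocks}
    cell = by_id[cell_id]

    # Collect the Ids listed under CHILD relationships
    child_ids = [
        child_id
        for rel in cell.get("Relationships", [])
        if rel["Type"] == "CHILD"
        for child_id in rel["Ids"]
    ]

    # Keep only children that are WORD blocks
    scores = [
        by_id[child_id]["Confidence"]
        for child_id in child_ids
        if by_id.get(child_id, {}).get("BlockType") == "WORD"
    ]
    return sum(scores) / len(scores) if scores else None


# Hand-made blocks with illustrative confidence values
sample_blocks = [
    {"Id": "c1", "BlockType": "CELL",
     "Relationships": [{"Type": "CHILD", "Ids": ["w1", "w2"]}]},
    {"Id": "c2", "BlockType": "CELL"},
    {"Id": "w1", "BlockType": "WORD", "Text": "JOHN", "Confidence": 90.0},
    {"Id": "w2", "BlockType": "WORD", "Text": "J", "Confidence": 80.0},
]

print(average_child_word_confidence(sample_blocks, "c1"))
```

With the real response, the list to pass in would be `response['Blocks']`.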

In [791]:
import pandas as pd

# Create an empty list to store tables
tables_list = []

# Iterate through the tables and append each table to the list
for table in page.tables:
    tables_list.append(table)

# Create an empty list to store table data with confidence scores
table_data_with_confidence = []

# Loop through the tables and print their content separately
for i, table in enumerate(tables_list):
    table_data = []
    print(f"Table {i + 1}:")
    
    for r, row in enumerate(table.rows):
        row_data = []
        for c, cell in enumerate(row.cells):
            cell_text = cell.text.strip()
            cell_confidence = cell.confidence if hasattr(cell, 'confidence') else None
            row_data.append((cell_text, cell_confidence))
            print(f"Table[{r}][{c}] = {cell_text}, Confidence: {cell_confidence}")
        
        table_data.append(row_data)
    
    table_data_with_confidence.append(table_data)
    print("\n")  # Separate tables with a newline

# Now you have each table's data with confidence scores stored in `table_data_with_confidence`

# If you want to convert a specific table (e.g., Table 5) to a Pandas DataFrame
# with additional columns for confidence scores for each value in each column:

desired_table_index = 4

if desired_table_index < len(table_data_with_confidence):
    table_5_data = table_data_with_confidence[desired_table_index]
    
    # Create a list of column names as numbers
    column_names = [str(c) for c in range(len(table_5_data[0]))]
    
    # Create a list to hold the data for the DataFrame
    df_data = []
    
    for row_data in table_5_data:
        row_values = []
        for (_, cell_confidence) in row_data:
            row_values.append(cell_confidence)
        df_data.append(row_values)
    
    # Create a Pandas DataFrame with columns named as numbers
    df = pd.DataFrame(df_data, columns=column_names)
    
    print("DataFrame for Table 5 with Confidence Scores for Each Value in Each Column:")
    print(df)
else:
    print(f"Table {desired_table_index + 1} not found in the document.")


Table 1:
Table[0][0] = Your first name and middle initial, Confidence: 91.40625
Table[0][1] = Last name, Confidence: 91.2109375
Table[1][0] = JOHN J, Confidence: 90.185546875
Table[1][1] = SMITH, Confidence: 90.0390625
Table[2][0] = If joint return, spouse's first name and middle initial, Confidence: 89.2578125
Table[2][1] = Last name, Confidence: 89.0625
Table[3][0] = JANE K, Confidence: 92.3828125
Table[3][1] = SMITH, Confidence: 92.1875


Table 2:
Table[0][0] = Home address (number and street). If you have a P.O. box, see instructions., Confidence: 88.57421875
Table[0][1] = Apt. no., Confidence: 84.814453125
Table[1][0] = 1040 N PHOENIX RD, Confidence: 89.111328125
Table[1][1] = , Confidence: 85.302734375


Table 3:
Table[0][0] = City, town, or post office. If you have a foreign address, also complete spaces below., Confidence: 92.7734375
Table[0][1] = State, Confidence: 87.79296875
Table[0][2] = ZIP code, Confidence: 89.84375
Table[1][0] = GLENDALE, Confidence: 91.9921875
Table[1][

In [792]:
df

Unnamed: 0,0,1,2,3,4,5,6,7
0,46.582031,53.125,94.873047,94.873047,94.873047,96.191406,80.175781,80.175781
1,46.582031,53.125,94.873047,94.873047,94.873047,96.191406,70.751953,68.994141
2,87.597656,88.134766,77.783203,78.320312,85.9375,89.941406,87.988281,85.791016
3,88.476562,89.013672,78.564453,79.101562,86.816406,90.869141,88.867188,86.669922
4,86.132812,86.669922,76.464844,77.001953,84.472656,88.427734,86.474609,84.375
5,91.113281,91.699219,80.908203,81.445312,89.355469,93.554688,91.503906,89.208984


In [793]:
# Convert numerical columns to numeric (assuming they contain numerical values)
df["2"] = pd.to_numeric(df["2"], errors="coerce")
df["3"] = pd.to_numeric(df["3"], errors="coerce")
df["4"] = pd.to_numeric(df["4"], errors="coerce")

# Add the values in columns 2, 3, and 4 and store the result in column 3
df["3"] = df["2"] + df["3"] + df["4"]

# Now, you can drop the original columns
df.drop(["2", "4"], axis=1, inplace=True)

# Print the modified DataFrame
df


Unnamed: 0,0,1,3,5,6,7
0,46.582031,53.125,284.619141,96.191406,80.175781,80.175781
1,46.582031,53.125,284.619141,96.191406,70.751953,68.994141
2,87.597656,88.134766,242.041016,89.941406,87.988281,85.791016
3,88.476562,89.013672,244.482422,90.869141,88.867188,86.669922
4,86.132812,86.669922,237.939453,88.427734,86.474609,84.375
5,91.113281,91.699219,251.708984,93.554688,91.503906,89.208984


In [794]:
# Divide the sum by 3 to get the average confidence of the three merged columns
df["3"] = df["3"] / 3

In [795]:
df

Unnamed: 0,0,1,3,5,6,7
0,46.582031,53.125,94.873047,96.191406,80.175781,80.175781
1,46.582031,53.125,94.873047,96.191406,70.751953,68.994141
2,87.597656,88.134766,80.680339,89.941406,87.988281,85.791016
3,88.476562,89.013672,81.494141,90.869141,88.867188,86.669922
4,86.132812,86.669922,79.313151,88.427734,86.474609,84.375
5,91.113281,91.699219,83.902995,93.554688,91.503906,89.208984
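
The cells above merge three confidence columns by summing them and dividing by three, that is, by taking the mean. A more conservative choice is to keep the minimum, so that one weak fragment is not hidden by two strong ones. The sketch below shows both options in plain Python; `merge_confidences` is a hypothetical helper, and the input values are the rounded scores displayed for row 2 above.

```python
from statistics import mean


def merge_confidences(scores, how="mean"):
    """Combine several confidence scores into one using the mean or the minimum."""
    if not scores:
        raise ValueError("scores must not be empty")
    if how == "mean":
        return mean(scores)
    if how == "min":
        return min(scores)
    raise ValueError(f"unknown merge strategy: {how}")


# Columns 2, 3 and 4 of row 2, as displayed in the confidence table
row_2_scores = [77.783203, 78.320312, 85.9375]

print(merge_confidences(row_2_scores, how="mean"))
print(merge_confidences(row_2_scores, how="min"))
```

Which strategy fits better depends on how strict the later human review threshold should be.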


In [796]:
# Create an empty list to store tables
tables_list = []

# Iterate through the tables and append each table to the list
for table in page.tables:
    tables_list.append(table)

# Loop through the tables and print their content separately
for i, table in enumerate(tables_list):
    print(f"Table {i + 1}:")
    for r, row in enumerate(table.rows):
        for c, cell in enumerate(row.cells):
            print(f"Table[{r}][{c}] = {cell.text}")
    print("\n")  # Separate tables with a newline

# Now you have each table stored in the `tables_list` for further processing


Table 1:
Table[0][0] = Your first name and middle initial 
Table[0][1] = Last name 
Table[1][0] = JOHN J 
Table[1][1] = SMITH 
Table[2][0] = If joint return, spouse's first name and middle initial 
Table[2][1] = Last name 
Table[3][0] = JANE K 
Table[3][1] = SMITH 


Table 2:
Table[0][0] = Home address (number and street). If you have a P.O. box, see instructions. 
Table[0][1] = Apt. no. 
Table[1][0] = 1040 N PHOENIX RD 
Table[1][1] = 


Table 3:
Table[0][0] = City, town, or post office. If you have a foreign address, also complete spaces below. 
Table[0][1] = State 
Table[0][2] = ZIP code 
Table[1][0] = GLENDALE 
Table[1][1] = A2 
Table[1][2] = 85308 


Table 4:
Table[0][0] = Foreign country name 
Table[0][1] = Foreign province/state/county 
Table[0][2] = Foreign postal code 
Table[1][0] = 
Table[1][1] = 
Table[1][2] = 


Table 5:
Table[0][0] = (see instructions): 
Table[0][1] = 
Table[0][2] = (2) 
Table[0][3] = Social 
Table[0][4] = security 
Table[0][5] = (3) Relationship 
Table[0][

In [797]:
import pandas as pd

# Create an empty list to store tables
tables_list = []

# Iterate through the tables and append each table to the list
for table in page.tables:
    tables_list.append(table)

# Extract and convert Table 5 to a Pandas DataFrame (index 4 since Python uses 0-based indexing)
table_5 = tables_list[4]

# Create an empty list to store the table data
data = []

# Iterate over the rows and cells of Table 5 and populate the data list
for r, row in enumerate(table_5.rows):
    row_data = [cell.text.strip() for cell in row.cells]
    data.append(row_data)

# Create a Pandas DataFrame from the data list
df5 = pd.DataFrame(data)

# Print the DataFrame
df5


Unnamed: 0,0,1,2,3,4,5,6,7
0,(see instructions):,,(2),Social,security,(3) Relationship,(4) Check the box if qualifies,for (see instructions):
1,(1) First name,Last name,,number,,to you,Child tax credit,Credit for other dependents
2,JAY,SMITH,,135792468,,SON,"SELECTED,","NOT_SELECTED,"
3,VICTORIA,SMITH,246,81,3579,DAUGHTER,"SELECTED,","NOT_SELECTED,"
4,,,,,,,"NOT_SELECTED,","NOT_SELECTED,"
5,,,,,,,"NOT_SELECTED,","NOT_SELECTED,"


In [798]:
data = str(data)

In [799]:
ents, types = entity_detection(data, "en")
print("Entities:", ents)
print("Entity Types:", types)

Entities: ['JAY', '135792468', 'VICTORIA', 'SMITH', '3579']
Entity Types: ['PERSON', 'OTHER', 'PERSON', 'PERSON', 'OTHER']
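
The `entity_detection` helper is called above, but its definition is not shown in this part of the notebook. The sketch below is one possible implementation, assuming it wraps the Amazon Comprehend `detect_entities` API and returns two parallel lists. The optional `client` parameter and the `StubComprehend` class are additions made here so the function can be tried without calling AWS.

```python
def entity_detection(text, language_code, client=None):
    """Return (entities, entity_types) for the given text.

    Assumption: the helper wraps Amazon Comprehend's detect_entities call.
    """
    if client is None:
        import boto3  # imported lazily so a stub client can be used offline
        client = boto3.client("comprehend")

    result = client.detect_entities(Text=text, LanguageCode=language_code)
    entities = [entity["Text"] for entity in result["Entities"]]
    entity_types = [entity["Type"] for entity in result["Entities"]]
    return entities, entity_types


class StubComprehend:
    """Offline stand-in that mimics the shape of a detect_entities response."""

    def detect_entities(self, Text, LanguageCode):
        return {"Entities": [{"Text": "JAY", "Type": "PERSON", "Score": 0.99}]}


ents, types = entity_detection("JAY SMITH SON", "en", client=StubComprehend())
print("Entities:", ents)
print("Entity Types:", types)
```

When `client` is omitted, the function creates a real Comprehend client, which requires AWS credentials and the appropriate IAM permissions.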


In [800]:
df5[1]

0             
1    Last name
2        SMITH
3        SMITH
4             
5             
Name: 1, dtype: object

In [801]:
# Concatenate columns 2, 3, and 4 into a single column
df5[3] = df5[2] + ' ' + df5[3] + ' ' + df5[4]
df5.drop([2, 4], axis=1, inplace=True)

df5


Unnamed: 0,0,1,3,5,6,7
0,(see instructions):,,(2) Social security,(3) Relationship,(4) Check the box if qualifies,for (see instructions):
1,(1) First name,Last name,number,to you,Child tax credit,Credit for other dependents
2,JAY,SMITH,135792468,SON,"SELECTED,","NOT_SELECTED,"
3,VICTORIA,SMITH,246 81 3579,DAUGHTER,"SELECTED,","NOT_SELECTED,"
4,,,,,"NOT_SELECTED,","NOT_SELECTED,"
5,,,,,"NOT_SELECTED,","NOT_SELECTED,"
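
After the concatenation, the SSN column mixes formats: one value carries leading and trailing spaces and another is split into groups. If a single format is needed for later comparison or review, the values can be normalized to digits only. This is a minimal sketch; `normalize_ssn` is a hypothetical helper, and treating anything other than exactly nine digits as missing is an assumption.

```python
import re


def normalize_ssn(raw):
    """Strip every non-digit character and return the 9-digit string, or None."""
    digits = re.sub(r"\D", "", raw)
    return digits if len(digits) == 9 else None


for value in [" 135792468 ", "246 81 3579", "  "]:
    print(repr(value), "->", normalize_ssn(value))
```

In the notebook, the helper could be applied to the merged column with `df5[3].map(normalize_ssn)`.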


In [802]:


# Number of header rows to drop
n = 2
# Use DataFrame.tail(-n) to drop the first n rows from both the text and confidence tables
df5 = df5.tail(-n)
df = df.tail(-n)


In [803]:
df5

Unnamed: 0,0,1,3,5,6,7
2,JAY,SMITH,135792468,SON,"SELECTED,","NOT_SELECTED,"
3,VICTORIA,SMITH,246 81 3579,DAUGHTER,"SELECTED,","NOT_SELECTED,"
4,,,,,"NOT_SELECTED,","NOT_SELECTED,"
5,,,,,"NOT_SELECTED,","NOT_SELECTED,"


In [804]:
df5[0]

2         JAY
3    VICTORIA
4            
5            
Name: 0, dtype: object

In [805]:
df

Unnamed: 0,0,1,3,5,6,7
2,87.597656,88.134766,80.680339,89.941406,87.988281,85.791016
3,88.476562,89.013672,81.494141,90.869141,88.867188,86.669922
4,86.132812,86.669922,79.313151,88.427734,86.474609,84.375
5,91.113281,91.699219,83.902995,93.554688,91.503906,89.208984


In [806]:
NUM_TO_REVIEW = len(df5) # number of line items to review
dffn = df5[0].to_list()
dfln = df5[1].to_list()
dfssn = df5[3].to_list()
dfrl = df5[5].to_list()
dfct = df5[6].to_list()
dfcd = df5[7].to_list()
dfn = df["0"].to_list()
dln = df["1"].to_list()
dssn = df["3"].to_list()
drl = df["5"].to_list()
dct = df["6"].to_list()
dcd = df["7"].to_list()


# Pair each extracted value with the confidence score from the matching column
item_list5 = [
    {
        'row': "{}".format(x),
        'First Name': dffn[x], 'FirstNameConfidence': dfn[x],
        'Last Name': dfln[x], 'LastNameConfidence': dln[x],
        'SSN': dfssn[x], 'SSNConfidence': dssn[x],
        'Relation to you': dfrl[x], 'RTYConfidence': drl[x],
        'Child tax credit': dfct[x], 'CTCConfidence': dct[x],
        'Credit for other dependents': dfcd[x], 'CODConfidence': dcd[x],
    }
    for x in range(NUM_TO_REVIEW)
]
item_list5

[{'row': '0',
  'First Name': 'JAY',
  'FirstNameConfidence': 87.59765625,
  'Last Name': 'SMITH',
  'LastNameConfidence': 88.134765625,
  'SSN': ' 135792468 ',
  'SSNConfidence': 80.68033854166667,
  'Relation to you': 'SON',
  'RTYConfidence': 89.94140625,
  'Child tax credit': 'SELECTED,',
  'CTCConfidence': 87.98828125,
  'Credit for other dependents': 'NOT_SELECTED,',
  'CODConfidence': 85.791015625},
 {'row': '1',
  'First Name': 'VICTORIA',
  'FirstNameConfidence': 88.4765625,
  'Last Name': 'SMITH',
  'LastNameConfidence': 89.013671875,
  'SSN': '246 81 3579',
  'SSNConfidence': 81.494140625,
  'Relation to you': 'DAUGHTER',
  'RTYConfidence': 90.869140625,
  'Child tax credit': 'SELECTED,',
  'CTCConfidence': 88.8671875,
  'Credit for other dependents': 'NOT_SELECTED,',
  'CODConfidence': 86.669921875},
 {'row': '2',
  'First Name': '',
  'FirstNameConfidence': 86.1328125,
  'Last Name': '',
  'LastNameConfidence': 86.669921875,
  'SSN': '  ',
  'SSNConfidence': 79.31315104166667,
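
Once each row carries a confidence score per field, a simple rule can decide which rows a human reviewer should look at. The sketch below flags any row with at least one score below a threshold. The helper `rows_needing_review`, the threshold of 85, and the shortened sample rows are assumptions chosen for illustration; they are not values prescribed by the notebook.

```python
def rows_needing_review(items, threshold):
    """Return rows that have at least one '...Confidence' value below the threshold."""
    flagged = []
    for item in items:
        low = {
            key: value
            for key, value in item.items()
            if key.endswith("Confidence")
            and isinstance(value, (int, float))
            and value < threshold
        }
        if low:
            flagged.append({"row": item["row"], "low_confidence_fields": low})
    return flagged


# Shortened rows in the same shape as item_list5 (subset of fields)
sample_items = [
    {"row": "0", "First Name": "JAY", "FirstNameConfidence": 87.59765625,
     "SSN": " 135792468 ", "SSNConfidence": 80.68033854166667},
    {"row": "1", "First Name": "VICTORIA", "FirstNameConfidence": 88.4765625,
     "SSN": "246 81 3579", "SSNConfidence": 91.0},  # made-up high score
]

for entry in rows_needing_review(sample_items, threshold=85):
    print(entry["row"], sorted(entry["low_confidence_fields"]))
```

In the notebook, the call would be `rows_needing_review(item_list5, threshold=...)` with a threshold that matches your tolerance for errors.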


In [807]:


# Fields to extract confidence scores for
fields_to_extract = [
    "Married filing jointly",
    "Were born before January 2, 1958",
    "Head of household (HOH)",
    "Married filing separately (MFS)",
    "You as a dependent",
    "Single"
]

# Extract confidence scores for the specified fields
confidence_scores = {}

for block in response['Blocks']:
    if 'Text' in block:
        text = block['Text']
        confidence = block['Confidence']
        
        for field in fields_to_extract:
            if field in text:
                confidence_scores[field] = confidence

# Print extracted confidence scores
for field, confidence in confidence_scores.items():
    print(f"Field: {field}, Confidence: {confidence}")
    
    



Field: Single, Confidence: 99.9559097290039
Field: Married filing jointly, Confidence: 99.80797576904297
Field: Married filing separately (MFS), Confidence: 99.8315200805664
Field: Head of household (HOH), Confidence: 99.94461822509766
Field: You as a dependent, Confidence: 99.84811401367188
Field: Were born before January 2, 1958, Confidence: 98.56246948242188
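
The loop above uses a substring test (`field in text`), so a short label such as "Single" matches any block whose text contains that word, and a later block can overwrite an earlier score. A stricter variant is sketched below: it keeps only LINE blocks whose text equals a wanted label exactly. The helper name `line_confidences` and the sample blocks are assumptions for illustration.

```python
def line_confidences(blocks, wanted):
    """Map each wanted label to the Confidence of the LINE block with exactly that text."""
    wanted_set = set(wanted)
    return {
        block["Text"]: block["Confidence"]
        for block in blocks
        if block.get("BlockType") == "LINE" and block.get("Text") in wanted_set
    }


# Hand-made blocks; the third one would also match a substring test for "Single"
sample_line_blocks = [
    {"BlockType": "LINE", "Text": "Single", "Confidence": 99.95},
    {"BlockType": "WORD", "Text": "Single", "Confidence": 99.90},
    {"BlockType": "LINE", "Text": "Single filers only", "Confidence": 98.0},
]

print(line_confidences(sample_line_blocks, ["Single", "Head of household (HOH)"]))
```

Exact matching misses labels that Textract splits across several lines, so it is a trade-off rather than a strict improvement.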


You can use the Textract response parser library to easily parse the JSON returned by Amazon Textract. The library parses the JSON and provides programming-language-specific constructs for working with different parts of the document. For more details, please refer to the [Amazon Textract Parser Library](https://github.com/aws-samples/amazon-textract-response-parser/tree/master/src-python).

In [808]:
from trp import Document
import json
# Create a dictionary to store fields and values
fields_dict = {}

# Parse JSON response from Textract
doc = Document(response)

# Iterate over elements in the document
for page in doc.pages:
    # Print lines and words
    for line in page.lines:
        print("Line: {}".format(line.text))
        for word in line.words:
            print("Word: {}".format(word.text))

    # Print tables
    for table in page.tables:
        for r, row in enumerate(table.rows):
            for c, cell in enumerate(row.cells):
                print("Table[{}][{}] = {}".format(r, c, cell.text))




import json

# Initialize a dictionary to store field data
fields_dict = {}

# List of specific key values to extract
fields_to_extract = [
    "Married filing jointly",
    "Were born before January 2, 1958",
    "Head of household (HOH)",
    "Married filing separately (MFS)",
    "You as a dependent",
    "Single"
]

# Iterate through the fields
for field in page.form.fields:
    if field.key is not None and field.key.text in fields_to_extract:
        fields_dict[field.key.text] = {
            "value": field.value.text if field.value is not None else "NULL",
            "confidence": field.value.block["Confidence"] if field.value is not None else None
        }

# Convert the dictionary to a JSON string
json_data = json.dumps(fields_dict, indent=4)

# Print the JSON data
print(json_data)






Line: 1040
Word: 1040
Line: Department of the Treasury - Internal Revenue Service
Word: Department
Word: of
Word: the
Word: Treasury
Word: -
Word: Internal
Word: Revenue
Word: Service
Line: U.S. Individual Income Tax Return
Word: U.S.
Word: Individual
Word: Income
Word: Tax
Word: Return
Line: 2022
Word: 2022
Line: OMB No. 1545-0074
Word: OMB
Word: No.
Word: 1545-0074
Line: IRS Use Only-Do not write or staple in this space.
Word: IRS
Word: Use
Word: Only-Do
Word: not
Word: write
Word: or
Word: staple
Word: in
Word: this
Word: space.
Line: Filing Status
Word: Filing
Word: Status
Line: Single
Word: Single
Line: Married filing jointly
Word: Married
Word: filing
Word: jointly
Line: Married filing separately (MFS)
Word: Married
Word: filing
Word: separately
Word: (MFS)
Line: Head of household (HOH)
Word: Head
Word: of
Word: household
Word: (HOH)
Line: Qualifying surviving
Word: Qualifying
Word: surviving
Line: Check only
Word: Check
Word: only
Line: spouse (QSS)
Word: spouse
Word: (QSS)
Line

In [809]:
print(json_data)

{
    "Married filing jointly": {
        "value": "SELECTED",
        "confidence": 93.80208587646484
    },
    "Were born before January 2, 1958": {
        "value": "NOT_SELECTED",
        "confidence": 93.31975555419922
    },
    "Married filing separately (MFS)": {
        "value": "NOT_SELECTED",
        "confidence": 92.85863494873047
    },
    "Head of household (HOH)": {
        "value": "NOT_SELECTED",
        "confidence": 92.73064422607422
    },
    "Single": {
        "value": "NOT_SELECTED",
        "confidence": 90.8843002319336
    },
    "You as a dependent": {
        "value": "NOT_SELECTED",
        "confidence": 88.05580139160156
    }
}


In [810]:
# Parse JSON data to a dictionary
fields_dict = json.loads(json_data)

# Convert dictionary to DataFrame
data = []
for key, values in fields_dict.items():
    data.append([key, values["value"], values["confidence"]])

dff = pd.DataFrame(data, columns=["Key", "Value", "Confidence"])

dff

Unnamed: 0,Key,Value,Confidence
0,Married filing jointly,SELECTED,93.802086
1,"Were born before January 2, 1958",NOT_SELECTED,93.319756
2,Married filing separately (MFS),NOT_SELECTED,92.858635
3,Head of household (HOH),NOT_SELECTED,92.730644
4,Single,NOT_SELECTED,90.8843
5,You as a dependent,NOT_SELECTED,88.055801


In [811]:
from trp import Document
import json
# Create a dictionary to store fields and values
fields_dict = {}

# Parse JSON response from Textract
doc = Document(response)

# Iterate over elements in the document
for page in doc.pages:
    # Print lines and words
    for line in page.lines:
        print("Line: {}".format(line.text))
        for word in line.words:
            print("Word: {}".format(word.text))

    # Print tables
    for table in page.tables:
        for r, row in enumerate(table.rows):
            for c, cell in enumerate(row.cells):
                print("Table[{}][{}] = {}".format(r, c, cell.text))

Line: 1040
Word: 1040
Line: Department of the Treasury - Internal Revenue Service
Word: Department
Word: of
Word: the
Word: Treasury
Word: -
Word: Internal
Word: Revenue
Word: Service
Line: U.S. Individual Income Tax Return
Word: U.S.
Word: Individual
Word: Income
Word: Tax
Word: Return
Line: 2022
Word: 2022
Line: OMB No. 1545-0074
Word: OMB
Word: No.
Word: 1545-0074
Line: IRS Use Only-Do not write or staple in this space.
Word: IRS
Word: Use
Word: Only-Do
Word: not
Word: write
Word: or
Word: staple
Word: in
Word: this
Word: space.
Line: Filing Status
Word: Filing
Word: Status
Line: Single
Word: Single
Line: Married filing jointly
Word: Married
Word: filing
Word: jointly
Line: Married filing separately (MFS)
Word: Married
Word: filing
Word: separately
Word: (MFS)
Line: Head of household (HOH)
Word: Head
Word: of
Word: household
Word: (HOH)
Line: Qualifying surviving
Word: Qualifying
Word: surviving
Line: Check only
Word: Check
Word: only
Line: spouse (QSS)
Word: spouse
Word: (QSS)
Line

In [812]:
for page in doc.pages:
    # Print tables
    for table in page.tables:
        print("Table Confidence Score: {}".format(table.confidence))  # Print the confidence score of the table
        for r, row in enumerate(table.rows):
            for c, cell in enumerate(row.cells):
                print("Table[{}][{}] = {}".format(r, c, cell.text))

Table Confidence Score: 100.0
Table[0][0] = Your first name and middle initial 
Table[0][1] = Last name 
Table[1][0] = JOHN J 
Table[1][1] = SMITH 
Table[2][0] = If joint return, spouse's first name and middle initial 
Table[2][1] = Last name 
Table[3][0] = JANE K 
Table[3][1] = SMITH 
Table Confidence Score: 99.951171875
Table[0][0] = Home address (number and street). If you have a P.O. box, see instructions. 
Table[0][1] = Apt. no. 
Table[1][0] = 1040 N PHOENIX RD 
Table[1][1] = 
Table Confidence Score: 100.0
Table[0][0] = City, town, or post office. If you have a foreign address, also complete spaces below. 
Table[0][1] = State 
Table[0][2] = ZIP code 
Table[1][0] = GLENDALE 
Table[1][1] = A2 
Table[1][2] = 85308 
Table Confidence Score: 99.8046875
Table[0][0] = Foreign country name 
Table[0][1] = Foreign province/state/county 
Table[0][2] = Foreign postal code 
Table[1][0] = 
Table[1][1] = 
Table[1][2] = 
Table Confidence Score: 100.0
Table[0][0] = (see instructions): 
Table[0][1] 

In [813]:

import json

# Initialize a dictionary to store field data
fields_dict = {}

# List of specific key values to extract
fields_to_extract = [
    "Your first name and middle initial",
    "Last name",
    "Your social security number",
    "If joint return, spouse’s first name and middle initial",
    "Last name",
    "Home address (number and street). If you have a P.O. box, see instructions.",
    "Apt. no.",
    "City, town, or post office. If you have a foreign address, also complete spaces below. ",
    "State",
    "ZIP code",
    "Foreign country name",
    "Foreign province/state/county",
    "Foreign postal code",
    "Spouse's social security number",
    "You",
    "Spouse",
    "SSN"
]

# Iterate through the fields
for field in page.form.fields:
    if field.key is not None and field.key.text in fields_to_extract:
        fields_dict[field.key.text] = {
            "value": field.value.text if field.value is not None else "NULL",
            "confidence": field.value.block["Confidence"] if field.value is not None else None
        }

# Convert the dictionary to a JSON string
json_data = json.dumps(fields_dict, indent=4)
print(json_data)

{
    "Home address (number and street). If you have a P.O. box, see instructions.": {
        "value": "1040 N PHOENIX RD",
        "confidence": 91.69149780273438
    },
    "You": {
        "value": "NOT_SELECTED",
        "confidence": 91.49413299560547
    },
    "Spouse": {
        "value": "NOT_SELECTED",
        "confidence": 91.27055358886719
    },
    "Last name": {
        "value": "SMITH",
        "confidence": 90.18329620361328
    },
    "Your social security number": {
        "value": "123 45 6789",
        "confidence": 90.6737060546875
    },
    "Your first name and middle initial": {
        "value": "JOHN J",
        "confidence": 90.09069061279297
    },
    "Foreign province/state/county": {
        "value": "NULL",
        "confidence": null
    },
    "ZIP code": {
        "value": "85308",
        "confidence": 89.56851959228516
    },
    "State": {
        "value": "A2",
        "confidence": 89.18620300292969
    },
    "Spouse's social security number": {

In [814]:
# Parse JSON data to a dictionary
fields_dict = json.loads(json_data)

# Convert dictionary to DataFrame
data = []
for key, values in fields_dict.items():
    data.append([key, values["value"], values["confidence"]])

dff2 = pd.DataFrame(data, columns=["Key", "Value", "Confidence"])

# Join the extracted values into one comma-separated string for Comprehend
data = list(dff2['Value'])
my_string = ', '.join(data)
my_string

'1040 N PHOENIX RD, NOT_SELECTED, NOT_SELECTED, SMITH, 123 45 6789, JOHN J, NULL, 85308, A2, 9871654321, NULL, NULL, NULL'

## Amazon Comprehend

In [815]:
import boto3
import pandas as pd


# Replace "NULL" with a placeholder
placeholder = '<NULL>'
processed_string = my_string.replace('NULL', placeholder)

# Initialize the AWS Comprehend client
comprehend = boto3.client(service_name='comprehend', region_name='us-east-1')

print('Calling DetectEntities')

# Split the processed string into words
words = processed_string.split(', ')

# Create a list to store the results
result_data = []

for word in words:
    response = comprehend.detect_entities(Text=word, LanguageCode='en')
    if response['Entities']:
        entity_type = response['Entities'][0]['Type']
    else:
        entity_type = 'NULL'  # If there are no entities, consider it as "NULL"
    result_data.append({'Word': word, 'Entity_Type': entity_type})

# Create a DataFrame from the list
result_df = pd.DataFrame(result_data)

dff2['Entity'] = result_df['Entity_Type']
dff2


Calling DetectEntities


Unnamed: 0,Key,Value,Confidence,Entity
0,Home address (number and street). If you have ...,1040 N PHOENIX RD,91.691498,LOCATION
1,You,NOT_SELECTED,91.494133,
2,Spouse,NOT_SELECTED,91.270554,
3,Last name,SMITH,90.183296,PERSON
4,Your social security number,123 45 6789,90.673706,OTHER
5,Your first name and middle initial,JOHN J,90.090691,PERSON
6,Foreign province/state/county,,,
7,ZIP code,85308,89.56852,OTHER
8,State,A2,89.186203,OTHER
9,Spouse's social security number,9871654321,88.430283,OTHER
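
The cell above keeps the type of the first entity that `detect_entities` returns for each value. When a value yields several entities, choosing the one with the highest `Score` may be more robust. Below is a minimal sketch of that idea; the helper name `best_entity_type` and the sample response are made up for illustration and only mimic the shape of a DetectEntities response.

```python
def best_entity_type(response, default="NULL"):
    """Return the Type of the highest-scoring entity, or default if none."""
    entities = response.get("Entities", [])
    if not entities:
        return default
    return max(entities, key=lambda entity: entity["Score"])["Type"]


# Hypothetical response for illustration only; scores are invented.
sample_response = {
    "Entities": [
        {"Text": "1040", "Type": "QUANTITY", "Score": 0.61, "BeginOffset": 0, "EndOffset": 4},
        {"Text": "N PHOENIX RD", "Type": "LOCATION", "Score": 0.97, "BeginOffset": 5, "EndOffset": 17},
    ]
}

print(best_entity_type(sample_response))
print(best_entity_type({"Entities": []}))
```

In the loop above, this could stand in for the `if response['Entities']` branch.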


In [875]:
ssn = dff2.loc[dff2["Key"] == "Your social security number", "Value"].values[0]
ssn2 = dff2.loc[dff2["Key"] == "Spouse's social security number", "Value"].values[0]


### Custom Entity Recognition Model


In [883]:
model = !aws comprehend detect-entities --endpoint-arn arn:aws:comprehend:us-east-1:485636232393:entity-recognizer-endpoint/comprehendendpoint --language-code en --text '1040 N PHOENIX RD, NOT_SELECTED, NOT_SELECTED, SMITH, 123 45 6789, JOHN J, NULL, 85308, A2, 9871654321, NULL, NULL, NULL'

In [885]:
import json

# Your 'model' data as a string
model_str = ' '.join(model)  # Combine the list of strings into one string
model_dict = json.loads(model_str)  # Convert the string to a dictionary

data = {
    "Text": [entity["Text"] for entity in model_dict["Entities"]],
    "Type": [entity["Type"] for entity in model_dict["Entities"]]
}
df = pd.DataFrame(data)

# Label the rows that hold either SSN with the entity type returned by the custom model
ssn_rows = dff2['Value'].isin([ssn, ssn2])
if ssn_rows.any():
    dff2.loc[ssn_rows, 'Entity'] = df['Type'].values[0]


In [886]:
dff2

Unnamed: 0,Key,Value,Confidence,Entity
0,Home address (number and street). If you have ...,1040 N PHOENIX RD,91.691498,LOCATION
1,You,NOT_SELECTED,91.494133,
2,Spouse,NOT_SELECTED,91.270554,
3,Last name,SMITH,90.183296,PERSON
4,Your social security number,123 45 6789,90.673706,SSN
5,Your first name and middle initial,JOHN J,90.090691,PERSON
6,Foreign province/state/county,,,
7,ZIP code,85308,89.56852,OTHER
8,State,A2,89.186203,OTHER
9,Spouse's social security number,9871654321,88.430283,SSN
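
Both rows are now labelled SSN, yet the two extracted values do not have the same shape. A simple format check, sketched below, could help decide which values deserve closer attention from the reviewer. The pattern and the helper name `looks_like_ssn` are assumptions for illustration; the check covers only the nine-digit 3-2-4 layout, not whether the number was ever issued.

```python
import re

# Nine digits in a 3-2-4 layout, optionally separated by spaces or hyphens.
SSN_PATTERN = re.compile(r"^\d{3}[- ]?\d{2}[- ]?\d{4}$")


def looks_like_ssn(value):
    """Return True if value has nine digits in a 3-2-4 layout."""
    return bool(SSN_PATTERN.match(value.strip()))


for candidate in ["123 45 6789", "9871654321"]:
    print(candidate, looks_like_ssn(candidate))
```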


In [820]:
# Initialize a dictionary to store field data
fields_dict = {}

# List of specific key values to extract
fields_to_extract = [
    "Yes",
    "No"
]

# Iterate through the fields
for field in page.form.fields:
    if field.key is not None and field.key.text in fields_to_extract:
        fields_dict[field.key.text] = {
            "value": field.value.text if field.value is not None else "NULL",
            "confidence": field.value.block["Confidence"] if field.value is not None else None
        }

# Convert the dictionary to a JSON string
json_data_da = json.dumps(fields_dict, indent=4)

# Print the JSON data
print(json_data_da)

{
    "Yes": {
        "value": "NOT_SELECTED",
        "confidence": 90.43658447265625
    },
    "No": {
        "value": "SELECTED",
        "confidence": 88.84349822998047
    }
}


In [821]:
# Convert dictionary to DataFrame
data = []
for key, values in fields_dict.items():
    data.append([key, values["value"], values["confidence"]])

dff3 = pd.DataFrame(data, columns=["Key", "Value", "Confidence"])

dff3

Unnamed: 0,Key,Value,Confidence
0,Yes,NOT_SELECTED,90.436584
1,No,SELECTED,88.843498


In [822]:
# Initialize a dictionary to store field data
fields_dict = {}

# List of specific key values to extract
fields_to_extract = [
    "You as a dependent",
    "Your spouse as a dependent",
    "Spouse itemizes on a separate return or you were a dual-status alien",
    "Were born before January 2, 1958",
    "Are blind",
    "Was born before January 2, 1958",
    "Is blind"
]

# Iterate through the fields
for field in page.form.fields:
    if field.key is not None and field.key.text in fields_to_extract:
        fields_dict[field.key.text] = {
            "value": field.value.text if field.value is not None else "NULL",
            "confidence": field.value.block["Confidence"] if field.value is not None else None
        }

# Convert the dictionary to a JSON string
json_data_ab = json.dumps(fields_dict, indent=4)

# Print the JSON data
print(json_data_ab)

{
    "Were born before January 2, 1958": {
        "value": "NOT_SELECTED",
        "confidence": 93.31975555419922
    },
    "Was born before January 2, 1958": {
        "value": "NOT_SELECTED",
        "confidence": 90.74566650390625
    },
    "Is blind": {
        "value": "NOT_SELECTED",
        "confidence": 90.03730010986328
    },
    "Your spouse as a dependent": {
        "value": "NOT_SELECTED",
        "confidence": 89.89031219482422
    },
    "Spouse itemizes on a separate return or you were a dual-status alien": {
        "value": "NOT_SELECTED",
        "confidence": 89.16678619384766
    },
    "Are blind": {
        "value": "NOT_SELECTED",
        "confidence": 88.45318603515625
    },
    "You as a dependent": {
        "value": "NOT_SELECTED",
        "confidence": 88.05580139160156
    }
}


In [823]:
# Convert dictionary to DataFrame
data = []
for key, values in fields_dict.items():
    data.append([key, values["value"], values["confidence"]])

dff4 = pd.DataFrame(data, columns=["Key", "Value", "Confidence"])

dff4

Unnamed: 0,Key,Value,Confidence
0,"Were born before January 2, 1958",NOT_SELECTED,93.319756
1,"Was born before January 2, 1958",NOT_SELECTED,90.745667
2,Is blind,NOT_SELECTED,90.0373
3,Your spouse as a dependent,NOT_SELECTED,89.890312
4,Spouse itemizes on a separate return or you we...,NOT_SELECTED,89.166786
5,Are blind,NOT_SELECTED,88.453186
6,You as a dependent,NOT_SELECTED,88.055801
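
The three extraction cells above repeat the same loop with a different `fields_to_extract` list. That loop could be factored into one helper, sketched below. The name `extract_fields` is hypothetical, and the `SimpleNamespace` objects are stand-ins shaped like the parser's form fields so the sketch runs without a Textract response.

```python
from types import SimpleNamespace


def extract_fields(fields, wanted):
    """Collect value and confidence for each field whose key text is in wanted."""
    wanted = set(wanted)
    result = {}
    for field in fields:
        if field.key is None or field.key.text not in wanted:
            continue
        if field.value is None:
            # Same convention as the cells above: missing values become "NULL".
            result[field.key.text] = {"value": "NULL", "confidence": None}
        else:
            result[field.key.text] = {
                "value": field.value.text,
                "confidence": field.value.block["Confidence"],
            }
    return result


# Stand-in objects (assumption, for a dry run without Textract).
fake_fields = [
    SimpleNamespace(
        key=SimpleNamespace(text="Yes"),
        value=SimpleNamespace(text="NOT_SELECTED", block={"Confidence": 90.4}),
    ),
    SimpleNamespace(key=SimpleNamespace(text="No"), value=None),
    SimpleNamespace(key=None, value=None),
]

print(extract_fields(fake_fields, ["Yes", "No"]))
```

With the real parser objects, the call would look like `extract_fields(page.form.fields, fields_to_extract)`.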


# Step 3 - Setting up Amazon A2I human loop to review Textract's low confidence responses

In this step, we will send the form line items in tabular form to an Amazon A2I human loop, where a reviewer can check the data and modify or augment it as required. Once this is done, we will persist the updated data in a DynamoDB table so that downstream applications can use it for processing.
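
This notebook sends every extracted row to the reviewer. If only low-confidence rows should reach the human loop, a filter along the following lines could run first. This is a minimal sketch: the function name `needs_review`, the threshold of 90.0, and the sample rows are assumptions for illustration.

```python
def needs_review(rows, threshold=90.0):
    """Return rows whose confidence is missing or below the threshold."""
    flagged = []
    for row in rows:
        confidence = row.get("Confidence")
        # A missing confidence (no value detected) is treated as low confidence.
        if confidence is None or confidence < threshold:
            flagged.append(row)
    return flagged


# Sample rows shaped like the DataFrames built earlier (illustrative values).
sample_rows = [
    {"Key": "Single", "Value": "NOT_SELECTED", "Confidence": 90.8843},
    {"Key": "You as a dependent", "Value": "NOT_SELECTED", "Confidence": 88.055801},
    {"Key": "Foreign province/state/county", "Value": "NULL", "Confidence": None},
]

for row in needs_review(sample_rows):
    print(row["Key"])
```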

In [1076]:
timestamp = time.strftime("%Y-%m-%d-%H-%M-%S", time.gmtime())
# Amazon SageMaker client
sagemaker_client = boto3.client('sagemaker')

# Amazon Augment AI (A2I) client
a2i = boto3.client('sagemaker-a2i-runtime')

# Amazon S3 client 
s3 = boto3.client('s3')

# Flow definition name - this value must be unique per account and region. You can also provide your own value here.
flowDefinitionName = 'fd-hw-docs-' + timestamp

# Task UI name - this value must be unique per account and region. You can also provide your own value here.
taskUIName = 'ui-hw-docs-' + timestamp

# Flow definition outputs
OUTPUT_PATH = f's3://{sess.default_bucket()}/{prefix}/a2i-results'

### Step 3b - Create the human task UI

Create a human task UI resource from a UI template written in Liquid HTML. You can download this template and customize it: https://github.com/aws-samples/amazon-textract-a2i-dynamodb-handwritten-tabular/blob/main/tables-keyvalue-sample.liquid.html
This template will be rendered to the human workers whenever a human loop is required. For over 70 pre-built UIs, see: https://github.com/aws-samples/amazon-a2i-sample-task-uis. But first, let's declare some variables that we need during the next set of steps.

In [1077]:
template = r"""<script src="https://assets.crowd.aws/crowd-html-elements.js"></script>

<style>
  table, tr, th, td {
    border: 1px solid black;
    border-collapse: collapse;
    padding: 5px;
  }
  .green {
    background-color: green;
  }
  .yellow {
    background-color: yellow;
  }
  .red {
    background-color: red;
  }
</style>
<crowd-form>
  <div>
    <h3>Original PDF</h3>
      <classification-target>
        <img style="width: 100%; max-height: 40%; margin-bottom: 10px" src="{{ task.input.image1 | grant_read_access }}"/>        
      </classification-target> 
  </div>
  <div>
    <h1>Instructions</h1>
    <p>Please review the form and make corrections where needed.</p>
</div>
<br>

<h3>Filing Status (Check Only One Box)</h3>
<table>
    <tr>
        <th>Key</th>
        <th>Selection</th>
        <th>Confidence</th>
        <th>Change Required</th>
        <th>Corrected Value</th>
        <th>Comments</th>
    </tr>
    {% for entry in task.input.Pairs %}
        <tr>
            <td>{{ entry.Key }}</td>
            <td>{{ entry.Value }}</td>
            <td class="{% if entry.Confidence > 90 %} green {% elsif entry.Confidence >= 85 %} yellow {% else %} red {% endif %}">{{ entry.Confidence }}</td>
            <td>
                <input type="radio" name="change_required_{{ entry.row }}" value="agree" required> Correct
                <input type="radio" name="change_required_{{ entry.row }}" value="disagree" required> Incorrect
            </td>
            <td><input type="text" name="Corrected Column1 value{{ entry.row }}"></td>
            <td><input type="text" name="comments_{{ entry.row }}"></td>
        </tr>
    {% endfor %}
</table>
<br>
<br>
<h2>Section Two</h2>
<table>
    <tr>
        <th>Key</th>
        <th>Value</th>
        <th>Confidence</th>
        <th>Entity</th>
        <th>Change Required</th>
        <th>Corrected Value</th>
        <th>Comments</th>
    </tr>
    {% for entry in task.input.text %}
        <tr>
            <td>{{ entry.Key }}</td>
            <td>{{ entry.Value }}</td>
            <td class="{% if entry.Confidence > 90 %} green {% elsif entry.Confidence >= 85 %} yellow {% else %} red {% endif %}">{{ entry.Confidence }}</td>
            <td>{{ entry.Entity }}</td>
            <td>
                <input type="radio" name="change_required_{{ entry.row }}" value="agree" required> Correct
                <input type="radio" name="change_required_{{ entry.row }}" value="disagree" required> Incorrect
            </td>
            <td><input type="text" name="Corrected Column2 Values{{ entry.row }}"></td>
            <td><input type="text" name="comments_{{ entry.row }}"></td>
        </tr>
    {% endfor %}
</table>
<h2>DIGITAL ASSET</h2>
<table>
    <tr>
        <th>Key</th>
        <th>Value</th>
        <th>Confidence</th>
        <th>Change Required</th>
        <th>Corrected Value</th>
        <th>Comments</th>
    </tr>
    {% for entry in task.input.text1 %}
        <tr>
            <td>{{ entry.Key }}</td>
            <td>{{ entry.Value }}</td>
            <td class="{% if entry.Confidence > 90 %} green {% elsif entry.Confidence >= 85 %} yellow {% else %} red {% endif %}">{{ entry.Confidence }}</td>
            <td>
                <input type="radio" name="change_required_{{ entry.row }}" value="agree" required> Correct
                <input type="radio" name="change_required_{{ entry.row }}" value="disagree" required> Incorrect
            </td>
            <td><input type="text" name="Corrected Column3 Values{{ entry.row }}"></td>
            <td><input type="text" name="comments_{{ entry.row }}"></td>
        </tr>
    {% endfor %}
</table>
<h2>STANDARD DEDUCTION</h2>
<h3>Age/Blindness</h3>
<table>
    <tr>
        <th>Key</th>
        <th>Value</th>
        <th>Confidence</th>
        <th>Change Required</th>
        <th>Corrected Value</th>
        <th>Comments</th>
    </tr>
    {% for entry in task.input.text2 %}
        <tr>
            <td>{{ entry.Key }}</td>
            <td>{{ entry.Value }}</td>
            <td class="{% if entry.Confidence > 90 %} green {% elsif entry.Confidence >= 85 %} yellow {% else %} red {% endif %}">{{ entry.Confidence }}</td>
            <td>
                <input type="radio" name="change_required_{{ entry.row }}" value="agree" required> Correct
                <input type="radio" name="change_required_{{ entry.row }}" value="disagree" required> Incorrect
            </td>
            <td><input type="text" name="Corrected Column4 Values{{ entry.row }}"></td>
            <td><input type="text" name="comments_{{ entry.row }}"></td>
        </tr>
    {% endfor %}
</table>
<h2>Dependents</h2>
<table>
    <tr>
        <th>LINE ITEM</th>
        <th>FIRST NAME Value</th>
        <th>FIRST NAME Confidence</th>
        <th>LAST NAME Value</th>
        <th>LAST NAME Confidence</th>
        <th>SOCIAL SECURITY NUMBER Value</th>
        <th>SOCIAL SECURITY NUMBER Confidence</th>
        <th>RELATION TO YOU Value</th>
        <th>RELATION TO YOU Confidence</th>
        <th>CHILD TAX CREDIT Value</th>
        <th>CHILD TAX CREDIT Confidence</th>
        <th>CREDIT FOR OTHER DEPENDENTS Value</th>
        <th>CREDIT FOR OTHER DEPENDENTS Confidence</th>
        <th>CHANGE REQUIRED</th>
        <th>TRUE FIRST NAME</th>
        <th>TRUE LAST NAME</th>
        <th>TRUE SOCIAL SECURITY NUMBER</th>
        <th>TRUE RELATION TO YOU</th>
        <th>TRUE CHILD TAX CREDIT</th>
        <th>TRUE CREDIT FOR OTHER DEPENDENTS</th>
        <th>COMMENTS</th>
    </tr>
    {% for item in task.input.text3 %}
        <tr>
            <td>{{ item.row }}</td>
            <td>{{ item['First Name'] }}</td>
            <td class="{% if item['FirstNameConfidence'] > 90 %} green {% elsif item['FirstNameConfidence'] >= 85 %} yellow {% else %} red {% endif %}">{{ item['FirstNameConfidence'] }}</td>
            <td>{{ item['Last Name'] }}</td>
            <td class="{% if item['LastNameConfidence'] > 90 %} green {% elsif item['LastNameConfidence'] >= 85 %} yellow {% else %} red {% endif %}">{{ item['LastNameConfidence'] }}</td>
            <td>{{ item['SSN'] }}</td>
            <td class="{% if item['SSNConfidence'] > 90 %} green {% elsif item['SSNConfidence'] >= 85 %} yellow {% else %} red {% endif %}">{{ item['SSNConfidence'] }}</td>
            <td>{{ item['Relation to you'] }}</td>
            <td class="{% if item['RTYConfidence'] > 90 %} green {% elsif item['RTYConfidence'] >= 85 %} yellow {% else %} red {% endif %}">{{ item['RTYConfidence'] }}</td>
            <td>{{ item['Child tax credit'] }}</td>
            <td class="{% if item['CTCConfidence'] > 90 %} green {% elsif item['CTCConfidence'] >= 85 %} yellow {% else %} red {% endif %}">{{ item['CTCConfidence'] }}</td>
            <td>{{ item['Credit for other dependents'] }}</td>
            <td class="{% if item['CTCConfidence'] > 90 %} green {% elsif item['CTCConfidence'] >= 85 %} yellow {% else %} red {% endif %}">{{ item['CTCConfidence'] }}</td>
            <td>
                <p>
                    <input type="radio" id="agreeline{{ forloop.index }}" name="ratingline{{ forloop.index }}" value="agree" required>
                    <label for="agreeline{{ forloop.index }}">Correct</label>
                </p>
                <p>
                    <input type="radio" id="disagreeline{{ forloop.index }}" name="ratingline{{ forloop.index }}" value="disagree" required>
                    <label for="disagreeline{{ forloop.index }}">Incorrect</label>
                </p>
            </td>
            <td>
                <p>
                    <input type="text" name="TrueFirstName{{ forloop.index }}" placeholder="Corrected First Name" />
                </p>
            </td>
            <td>
                <p>
                    <input type="text" name="TrueLastName{{ forloop.index }}" placeholder="Corrected Last Name" />
                </p>
            </td>
            <td>
                <p>
                    <input type="text" name="TrueSSN{{ forloop.index }}" placeholder="Corrected SSN" />
                </p>
            </td>
            <td>
                <p>
                    <input type="text" name="TrueRelation{{ forloop.index }}" placeholder="Corrected Relation" />
                </p>
            </td>
            <td>
                <p>
                    <input type="text" name="TrueChildTaxCredit{{ forloop.index }}" placeholder="Corrected Child Tax Credit" />
                </p>
            </td>
            <td>
                <p>
                    <input type="text" name="TrueCreditForOtherDep{{ forloop.index }}" placeholder="Corrected Credit for Other Dependents" />
                </p>
            </td>
            <td>
                <p>
                    <input type="text" name="ChangeReason{{ forloop.index }}" placeholder="Explain why you changed the value" />
                </p>
            </td>
        </tr>
    {% endfor %}
</table>
</crowd-form>

"""


After creating this custom template in HTML, you must use it to generate an Amazon A2I human task UI Amazon Resource Name (ARN). This ARN has the following format: arn:aws:sagemaker:<aws-region>:<aws-account-number>:human-task-ui/<template-name>. The ARN is associated with a worker task template resource that you can use in one or more human review workflows (flow definitions). Generate a human task UI ARN from the worker task template with the CreateHumanTaskUi API operation by running the notebook cell below:

In [1078]:
def create_task_ui():
    '''
    Creates a Human Task UI resource.

    Returns:
    struct: HumanTaskUiArn
    '''
    response = sagemaker_client.create_human_task_ui(
        HumanTaskUiName=taskUIName,
        UiTemplate={'Content': template})
    return response

In [1079]:
# Create task UI
humanTaskUiResponse = create_task_ui()
humanTaskUiArn = humanTaskUiResponse['HumanTaskUiArn']
print(humanTaskUiArn)

arn:aws:sagemaker:us-east-1:485636232393:human-task-ui/ui-hw-docs-2023-09-18-20-26-27
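
The printed ARN follows the human task UI format described earlier. If the region, account, or template name is needed separately (for logging, say), the ARN can be split as sketched below. The helper name `parse_human_task_ui_arn` and the example ARN with a placeholder account number are assumptions for illustration.

```python
def parse_human_task_ui_arn(arn):
    """Split a human task UI ARN into region, account, and template name."""
    # Layout: arn:aws:sagemaker:<region>:<account>:human-task-ui/<name>
    parts = arn.split(":", 5)
    if len(parts) != 6 or parts[2] != "sagemaker":
        raise ValueError(f"Not a SageMaker ARN: {arn}")
    resource_type, _, name = parts[5].partition("/")
    if resource_type != "human-task-ui" or not name:
        raise ValueError(f"Not a human task UI ARN: {arn}")
    return {"region": parts[3], "account": parts[4], "name": name}


# Placeholder ARN for illustration only.
example_arn = "arn:aws:sagemaker:us-east-1:111122223333:human-task-ui/ui-hw-docs-example"
print(parse_human_task_ui_arn(example_arn))
```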


### Step 3c - Create the Flow Definition
In this section, we're going to create a flow definition. Flow definitions allow us to specify:

* The workforce that your tasks will be sent to.
* The instructions that your workforce will receive. This is called a worker task template.
* Where your output data will be stored.

This demo uses the API, but you can optionally create this workflow definition in the console as well.

For more details and instructions, see: https://docs.aws.amazon.com/sagemaker/latest/dg/a2i-create-flow-definition.html.

In [1080]:
create_workflow_definition_response = sagemaker_client.create_flow_definition(
        FlowDefinitionName= flowDefinitionName,
        RoleArn= role,
        HumanLoopConfig= {
            "WorkteamArn": WORKTEAM_ARN,
            "HumanTaskUiArn": humanTaskUiArn,
            "TaskCount": 1,
            "TaskDescription": "Review the contents and correct values as indicated",
            "TaskTitle": "Form 1040 Review"
        },
        OutputConfig={
            "S3OutputPath" : OUTPUT_PATH
        }
    )
flowDefinitionArn = create_workflow_definition_response['FlowDefinitionArn'] # let's save this ARN for future use

In [1081]:
for x in range(60):
    describeFlowDefinitionResponse = sagemaker_client.describe_flow_definition(FlowDefinitionName=flowDefinitionName)
    print(describeFlowDefinitionResponse['FlowDefinitionStatus'])
    if (describeFlowDefinitionResponse['FlowDefinitionStatus'] == 'Active'):
        print("Flow Definition is active")
        break
    time.sleep(2)

Initializing
Active
Flow Definition is active
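
The polling loop above stops silently if the flow definition never becomes active. A slightly stricter variant, sketched below, raises on a failed status or a timeout. The helper name `wait_for_status` and its parameters are assumptions for illustration; the status source is passed in as a function, so the sketch runs here against a fake sequence of statuses instead of AWS.

```python
import time


def wait_for_status(get_status, target="Active", failed="Failed",
                    attempts=60, delay=2.0, sleep=time.sleep):
    """Poll get_status() until it returns target; raise on failure or timeout."""
    for _ in range(attempts):
        status = get_status()
        if status == target:
            return status
        if status == failed:
            raise RuntimeError(f"Resource entered status {failed!r}")
        sleep(delay)
    raise TimeoutError(f"Status did not reach {target!r} after {attempts} attempts")


# Dry run with a fake status sequence and no real sleeping.
statuses = iter(["Initializing", "Initializing", "Active"])
print(wait_for_status(lambda: next(statuses), sleep=lambda _: None))
```

In the notebook, `get_status` could be a lambda that returns `sagemaker_client.describe_flow_definition(FlowDefinitionName=flowDefinitionName)['FlowDefinitionStatus']`.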


In [1082]:
dff

Unnamed: 0,Key,Value,Confidence
0,Married filing jointly,SELECTED,93.802086
1,"Were born before January 2, 1958",NOT_SELECTED,93.319756
2,Married filing separately (MFS),NOT_SELECTED,92.858635
3,Head of household (HOH),NOT_SELECTED,92.730644
4,Single,NOT_SELECTED,90.8843
5,You as a dependent,NOT_SELECTED,88.055801


# Sending predictions to Amazon A2I human loops

In [1083]:
NUM_TO_REVIEW = len(dff) # number of line items to review
dfstart = dff['Key'].to_list()
dfend = dff['Value'].to_list()
dfkeyCon = dff['Confidence'].to_list()


item_list = [{'row': "{}".format(x), 'Key': dfstart[x], 'Value': dfend[x], 'Confidence': dfkeyCon[x]} for x in range(NUM_TO_REVIEW)]
item_list

[{'row': '0',
  'Key': 'Married filing jointly',
  'Value': 'SELECTED',
  'Confidence': 93.80208587646484},
 {'row': '1',
  'Key': 'Were born before January 2, 1958',
  'Value': 'NOT_SELECTED',
  'Confidence': 93.31975555419922},
 {'row': '2',
  'Key': 'Married filing separately (MFS)',
  'Value': 'NOT_SELECTED',
  'Confidence': 92.85863494873047},
 {'row': '3',
  'Key': 'Head of household (HOH)',
  'Value': 'NOT_SELECTED',
  'Confidence': 92.73064422607422},
 {'row': '4',
  'Key': 'Single',
  'Value': 'NOT_SELECTED',
  'Confidence': 90.8843002319336},
 {'row': '5',
  'Key': 'You as a dependent',
  'Value': 'NOT_SELECTED',
  'Confidence': 88.05580139160156}]
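
The task template defined earlier colours most Confidence cells green above 90, yellow from 85 to 90, and red otherwise. To preview which band each item will land in before a reviewer sees it, the rule can be mirrored in Python. This is a sketch only; the function name `confidence_band` is hypothetical, and treating a missing confidence as red is an assumption.

```python
def confidence_band(confidence):
    """Mirror the template's colour rule: >90 green, >=85 yellow, else red."""
    if confidence is None:
        # Assumption: rows without a confidence value are shown as red.
        return "red"
    if confidence > 90:
        return "green"
    if confidence >= 85:
        return "yellow"
    return "red"


for value in (93.80208587646484, 88.05580139160156, 72.5, None):
    print(value, confidence_band(value))
```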

In [1084]:
NUM_TO_REVIEW = len(dff2) # number of line items to review
dfstart = dff2['Key'].to_list()
dfend = dff2['Value'].to_list()
dfkeyCon = dff2['Confidence'].to_list()
dfentity= dff2['Entity'].to_list()



item_list2 = [{'row': "{}".format(x), 'Key': dfstart[x], 'Value': dfend[x], 'Confidence': dfkeyCon[x], 'Entity': dfentity[x]} for x in range(NUM_TO_REVIEW)]
item_list2

[{'row': '0',
  'Key': 'Home address (number and street). If you have a P.O. box, see instructions.',
  'Value': '1040 N PHOENIX RD',
  'Confidence': 91.69149780273438,
  'Entity': 'LOCATION'},
 {'row': '1',
  'Key': 'You',
  'Value': 'NOT_SELECTED',
  'Confidence': 91.49413299560547,
  'Entity': 'NULL'},
 {'row': '2',
  'Key': 'Spouse',
  'Value': 'NOT_SELECTED',
  'Confidence': 91.27055358886719,
  'Entity': 'NULL'},
 {'row': '3',
  'Key': 'Last name',
  'Value': 'SMITH',
  'Confidence': 90.18329620361328,
  'Entity': 'PERSON'},
 {'row': '4',
  'Key': 'Your social security number',
  'Value': '123 45 6789',
  'Confidence': 90.6737060546875,
  'Entity': 'SSN'},
 {'row': '5',
  'Key': 'Your first name and middle initial',
  'Value': 'JOHN J',
  'Confidence': 90.09069061279297,
  'Entity': 'PERSON'},
 {'row': '6',
  'Key': 'Foreign province/state/county',
  'Value': 'NULL',
  'Confidence': nan,
  'Entity': 'NULL'},
 {'row': '7',
  'Key': 'ZIP code',
  'Value': '85308',
  'Confidence': 8

In [1085]:
NUM_TO_REVIEW = len(dff3) # number of line items to review
dfstart = dff3['Key'].to_list()
dfend = dff3['Value'].to_list()
dfkeyCon = dff3['Confidence'].to_list()


item_list3 = [{'row': "{}".format(x), 'Key': dfstart[x], 'Value': dfend[x], 'Confidence': dfkeyCon[x]} for x in range(NUM_TO_REVIEW)]
item_list3

[{'row': '0',
  'Key': 'Yes',
  'Value': 'NOT_SELECTED',
  'Confidence': 90.43658447265625},
 {'row': '1',
  'Key': 'No',
  'Value': 'SELECTED',
  'Confidence': 88.84349822998047}]
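
The item-list cells in this section all follow the same pattern: copy a few columns and add a string `row` index. A small helper, sketched below, derives the row count from the records themselves, so it cannot drift out of sync with the DataFrame being converted. The name `to_review_items` is hypothetical, and the sample records stand in for a DataFrame.

```python
def to_review_items(records, columns=("Key", "Value", "Confidence")):
    """Turn row dicts into template items with a string 'row' index."""
    items = []
    for index, record in enumerate(records):
        item = {"row": str(index)}
        for column in columns:
            item[column] = record[column]
        items.append(item)
    return items


# Stand-in records (illustrative); a DataFrame can supply these directly.
records = [
    {"Key": "Yes", "Value": "NOT_SELECTED", "Confidence": 90.43658447265625},
    {"Key": "No", "Value": "SELECTED", "Confidence": 88.84349822998047},
]

print(to_review_items(records))
```

With pandas, the records could come from `dff3.to_dict("records")`; for `dff2`, the `columns` argument would also include `"Entity"`.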

In [1086]:
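NUM_TO_REVIEW = len(dff4) # number of line items to review
dfstart = dff4['Key'].to_list()
dfend = dff4['Value'].to_list()
dfkeyCon = dff4['Confidence'].to_list()


item_list4 = [{'row': "{}".format(x), 'Key': dfstart[x], 'Value': dfend[x], 'Confidence': dfkeyCon[x]} for x in range(NUM_TO_REVIEW)]
item_list4

[{'row': '0',
  'Key': 'Were born before January 2, 1958',
  'Value': 'NOT_SELECTED',
  'Confidence': 93.31975555419922},
 {'row': '1',
  'Key': 'Was born before January 2, 1958',
  'Value': 'NOT_SELECTED',
  'Confidence': 90.74566650390625},
 {'row': '2',
  'Key': 'Is blind',
  'Value': 'NOT_SELECTED',
  'Confidence': 90.03730010986328},
 {'row': '3',
  'Key': 'Your spouse as a dependent',
  'Value': 'NOT_SELECTED',
  'Confidence': 89.89031219482422},
 {'row': '4',
  'Key': 'Spouse itemizes on a separate return or you were a dual-status alien',
  'Value': 'NOT_SELECTED',
  'Confidence': 89.16678619384766},
 {'row': '5',
  'Key': 'Are blind',
  'Value': 'NOT_SELECTED',
  'Confidence': 88.45318603515625},
 {'row': '6',
  'Key': 'You as a dependent',
  'Value': 'NOT_SELECTED',
  'Confidence': 88.05580139160156}]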

In [1087]:
import json
import math

# Replace NaN values with None (which will be converted to null in JSON)
for item in item_list2:
    for key, value in item.items():
        if isinstance(value, float) and math.isnan(value):
            item[key] = None

# Update ip_content dictionary
ip_content = {
    "Pairs": item_list,  # Use the transformed item_list here
    "text": item_list2,
    "text1": item_list3,
    "text2": item_list4,
    "text3": item_list5,
    "image1": s3_img_url 
}
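
The NaN clean-up above covers only `item_list2`. Because any list in the payload could carry a NaN confidence, a recursive variant may be safer; the sketch below walks the whole structure and then serializes with `allow_nan=False`, which makes `json.dumps` raise if a NaN slipped through. The helper name `replace_nan` and the sample payload are assumptions for illustration.

```python
import json
import math


def replace_nan(obj):
    """Recursively replace float NaN with None in dicts and lists."""
    if isinstance(obj, float) and math.isnan(obj):
        return None
    if isinstance(obj, dict):
        return {key: replace_nan(value) for key, value in obj.items()}
    if isinstance(obj, list):
        return [replace_nan(value) for value in obj]
    return obj


# Small stand-in payload shaped like ip_content (illustrative).
payload = {"text": [{"row": "6", "Value": "NULL", "Confidence": float("nan")}]}
clean = replace_nan(payload)

# allow_nan=False raises ValueError if any NaN is still present.
print(json.dumps(clean, allow_nan=False))
```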



In [1088]:
ip_content

{'Pairs': [{'row': '0',
   'Key': 'Married filing jointly',
   'Value': 'SELECTED',
   'Confidence': 93.80208587646484},
  {'row': '1',
   'Key': 'Were born before January 2, 1958',
   'Value': 'NOT_SELECTED',
   'Confidence': 93.31975555419922},
  {'row': '2',
   'Key': 'Married filing separately (MFS)',
   'Value': 'NOT_SELECTED',
   'Confidence': 92.85863494873047},
  {'row': '3',
   'Key': 'Head of household (HOH)',
   'Value': 'NOT_SELECTED',
   'Confidence': 92.73064422607422},
  {'row': '4',
   'Key': 'Single',
   'Value': 'NOT_SELECTED',
   'Confidence': 90.8843002319336},
  {'row': '5',
   'Key': 'You as a dependent',
   'Value': 'NOT_SELECTED',
   'Confidence': 88.05580139160156}],
 'text': [{'row': '0',
   'Key': 'Home address (number and street). If you have a P.O. box, see instructions.',
   'Value': '1040 N PHOENIX RD',
   'Confidence': 91.69149780273438,
   'Entity': 'LOCATION'},
  {'row': '1',
   'Key': 'You',
   'Value': 'NOT_SELECTED',
   'Confidence': 91.494132995605

In [1089]:
item_list2

[{'row': '0',
  'Key': 'Home address (number and street). If you have a P.O. box, see instructions.',
  'Value': '1040 N PHOENIX RD',
  'Confidence': 91.69149780273438,
  'Entity': 'LOCATION'},
 {'row': '1',
  'Key': 'You',
  'Value': 'NOT_SELECTED',
  'Confidence': 91.49413299560547,
  'Entity': 'NULL'},
 {'row': '2',
  'Key': 'Spouse',
  'Value': 'NOT_SELECTED',
  'Confidence': 91.27055358886719,
  'Entity': 'NULL'},
 {'row': '3',
  'Key': 'Last name',
  'Value': 'SMITH',
  'Confidence': 90.18329620361328,
  'Entity': 'PERSON'},
 {'row': '4',
  'Key': 'Your social security number',
  'Value': '123 45 6789',
  'Confidence': 90.6737060546875,
  'Entity': 'SSN'},
 {'row': '5',
  'Key': 'Your first name and middle initial',
  'Value': 'JOHN J',
  'Confidence': 90.09069061279297,
  'Entity': 'PERSON'},
 {'row': '6',
  'Key': 'Foreign province/state/county',
  'Value': 'NULL',
  'Confidence': None,
  'Entity': 'NULL'},
 {'row': '7',
  'Key': 'ZIP code',
  'Value': '85308',
  'Confidence': 

In [1090]:
# Serialize the ip_content dictionary to a JSON string
input_content_json = json.dumps(ip_content, default=str)  # Use default=str to handle non-serializable values

# Print the serialized JSON for verification
print("Serialized JSON Input Content:")
print(input_content_json)

# Continue with starting the human loop
humanLoopName = str(uuid.uuid4())

start_loop_response = a2i.start_human_loop(
    HumanLoopName=humanLoopName,
    FlowDefinitionArn=flowDefinitionArn,
    HumanLoopInput={
        "InputContent": input_content_json
    }
)

Serialized JSON Input Content:
{"Pairs": [{"row": "0", "Key": "Married filing jointly", "Value": "SELECTED", "Confidence": 93.80208587646484}, {"row": "1", "Key": "Were born before January 2, 1958", "Value": "NOT_SELECTED", "Confidence": 93.31975555419922}, {"row": "2", "Key": "Married filing separately (MFS)", "Value": "NOT_SELECTED", "Confidence": 92.85863494873047}, {"row": "3", "Key": "Head of household (HOH)", "Value": "NOT_SELECTED", "Confidence": 92.73064422607422}, {"row": "4", "Key": "Single", "Value": "NOT_SELECTED", "Confidence": 90.8843002319336}, {"row": "5", "Key": "You as a dependent", "Value": "NOT_SELECTED", "Confidence": 88.05580139160156}], "text": [{"row": "0", "Key": "Home address (number and street). If you have a P.O. box, see instructions.", "Value": "1040 N PHOENIX RD", "Confidence": 91.69149780273438, "Entity": "LOCATION"}, {"row": "1", "Key": "You", "Value": "NOT_SELECTED", "Confidence": 91.49413299560547, "Entity": "NULL"}, {"row": "2", "Key": "Spouse", "Val

In [1091]:
flowDefinitionArn

'arn:aws:sagemaker:us-east-1:485636232393:flow-definition/fd-hw-docs-2023-09-18-20-26-27'

In [1092]:
humanLoopName

'3d7e7dbb-d6de-4895-afb4-7980cced653e'

Check the status of the human loop

In [1093]:
completed_human_loops = []
resp = a2i.describe_human_loop(HumanLoopName=humanLoopName)
print(f'HumanLoop Name: {humanLoopName}')
print(f'HumanLoop Status: {resp["HumanLoopStatus"]}')
print(f'HumanLoop Output Destination: {resp["HumanLoopOutput"]}')
print('\n')

if resp["HumanLoopStatus"] == "Completed":
    completed_human_loops.append(resp)

HumanLoop Name: 3d7e7dbb-d6de-4895-afb4-7980cced653e
HumanLoop Status: InProgress
HumanLoop Output Destination: {'OutputS3Uri': 's3://sagemaker-us-east-1-485636232393/textract-a2i-handwritten/a2i-results/fd-hw-docs-2023-09-18-20-26-27/2023/09/18/20/26/39/3d7e7dbb-d6de-4895-afb4-7980cced653e/output.json'}




In [1094]:
resp = a2i.describe_human_loop(HumanLoopName=humanLoopName)
resp

{'ResponseMetadata': {'RequestId': '8c168732-30e1-4673-a0e1-af85377004f2',
  'HTTPStatusCode': 200,
  'HTTPHeaders': {'date': 'Mon, 18 Sep 2023 20:26:57 GMT',
   'content-type': 'application/json; charset=UTF-8',
   'content-length': '5114',
   'connection': 'keep-alive',
   'x-amzn-requestid': '8c168732-30e1-4673-a0e1-af85377004f2',
   'access-control-allow-origin': '*',
   'x-amz-apigw-id': 'LeDWvH4toAMErYg=',
   'x-amzn-trace-id': 'Root=1-6508b291-13a741b249c2e2204c52259f'},
  'RetryAttempts': 0},
 'CreationTime': datetime.datetime(2023, 9, 18, 20, 26, 39, 670000, tzinfo=tzlocal()),
 'HumanLoopStatus': 'InProgress',
 'HumanLoopName': '3d7e7dbb-d6de-4895-afb4-7980cced653e',
 'HumanLoopArn': 'arn:aws:sagemaker:us-east-1:485636232393:human-loop/3d7e7dbb-d6de-4895-afb4-7980cced653e',
 'FlowDefinitionArn': 'arn:aws:sagemaker:us-east-1:485636232393:flow-definition/fd-hw-docs-2023-09-18-20-26-27',
 'HumanLoopOutput': {'OutputS3Uri': 's3://sagemaker-us-east-1-485636232393/textract-a2i-handw

In [1095]:
workteamName = WORKTEAM_ARN[WORKTEAM_ARN.rfind('/') + 1:]
print("Navigate to the private worker portal and do the tasks. Make sure you've invited yourself to your workteam!")
print('https://' + sagemaker_client.describe_workteam(WorkteamName=workteamName)['Workteam']['SubDomain'])

Navigate to the private worker portal and do the tasks. Make sure you've invited yourself to your workteam!
https://pwko5jvpzh.labeling.us-east-1.sagemaker.aws


In [1050]:
completed_human_loops = []
resp = a2i.describe_human_loop(HumanLoopName=humanLoopName)
print(f'HumanLoop Name: {humanLoopName}')
print(f'HumanLoop Status: {resp["HumanLoopStatus"]}')
print(f'HumanLoop Output Destination: {resp["HumanLoopOutput"]}')
print('\n')

if resp["HumanLoopStatus"] == "Completed":
    completed_human_loops.append(resp)

HumanLoop Name: 09ad1b3b-ce3c-422d-9e93-8e5d81f0127c
HumanLoop Status: InProgress
HumanLoop Output Destination: {'OutputS3Uri': 's3://sagemaker-us-east-1-485636232393/textract-a2i-handwritten/a2i-results/fd-hw-docs-2023-09-18-20-09-33/2023/09/18/20/09/42/09ad1b3b-ce3c-422d-9e93-8e5d81f0127c/output.json'}
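The status is still `InProgress` here because the review task has not been submitted yet. Rather than re-running the status cell by hand, the check can be wrapped in a polling loop. The sketch below is one possible approach, not part of the original notebook: the function name, the default intervals, and the set of terminal statuses are assumptions, and `client` is assumed to behave like the `a2i` client used above.

```python
import time

# Statuses after which a human loop is assumed not to change any more.
TERMINAL_STATUSES = {"Completed", "Failed", "Stopped"}


def wait_for_human_loop(client, human_loop_name, poll_seconds=30, timeout_seconds=3600):
    """Poll describe_human_loop until the loop reaches a terminal status.

    Returns the last describe_human_loop response, or raises TimeoutError
    if the loop is still running when the timeout expires.
    """
    deadline = time.monotonic() + timeout_seconds
    while True:
        resp = client.describe_human_loop(HumanLoopName=human_loop_name)
        status = resp["HumanLoopStatus"]
        if status in TERMINAL_STATUSES:
            return resp
        if time.monotonic() >= deadline:
            raise TimeoutError(
                f"Human loop {human_loop_name} is still {status} after {timeout_seconds}s"
            )
        time.sleep(poll_seconds)
```

In this notebook it could be called as `resp = wait_for_human_loop(a2i, humanLoopName)`, appending `resp` to `completed_human_loops` only when `resp["HumanLoopStatus"] == "Completed"`.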




In [909]:
import json
import pprint
from urllib.parse import urlparse

pp = pprint.PrettyPrinter(indent=4)

# Download and print the reviewer output for every completed human loop
for resp in completed_human_loops:
    # Read bucket and key from the output URI itself rather than assuming the default bucket
    output_uri = urlparse(resp['HumanLoopOutput']['OutputS3Uri'])
    response = s3.get_object(Bucket=output_uri.netloc, Key=output_uri.path.lstrip('/'))
    content = response["Body"].read()
    json_output = json.loads(content)
    pp.pprint(json_output)
    print('\n')

In [910]:
json_output

NameError: name 'json_output' is not defined
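A likely reason for this `NameError` is that `json_output` is only assigned inside the loop over `completed_human_loops`, so it never exists if that list is empty, for example while the human loop is still `InProgress`. The sketch below shows one way to make the empty case explicit. The helper name is hypothetical, and `s3_client` is assumed to expose `get_object` like the `s3` client used in this notebook.

```python
import json
from urllib.parse import urlparse


def load_human_loop_outputs(completed_loops, s3_client):
    """Return the parsed output.json for each completed human loop.

    The result is an empty list when no loop has completed yet.
    """
    outputs = []
    for loop in completed_loops:
        uri = urlparse(loop["HumanLoopOutput"]["OutputS3Uri"])
        # For s3://bucket/key, netloc is the bucket and path is '/key'.
        obj = s3_client.get_object(Bucket=uri.netloc, Key=uri.path.lstrip("/"))
        outputs.append(json.loads(obj["Body"].read()))
    return outputs
```

With this helper, `outputs = load_human_loop_outputs(completed_human_loops, s3)` always yields a list, and an empty list is a signal to finish the review task in the worker portal and check the status again before reading `outputs[-1]`.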

In [870]:
dff

Unnamed: 0,Key,Value,Confidence
0,Married filing jointly,SELECTED,93.802086
1,"Were born before January 2, 1958",NOT_SELECTED,93.319756
2,Married filing separately (MFS),NOT_SELECTED,92.858635
3,Head of household (HOH),NOT_SELECTED,92.730644
4,Single,NOT_SELECTED,90.8843
5,You as a dependent,NOT_SELECTED,88.055801


In [871]:
# Copy the reviewer's corrections and comments into the DataFrame as strings.
# If several workers answered, the last entry in humanAnswers is used.
for i in json_output['humanAnswers']:
    x = i['answerContent']

for j in range(len(dff)):
    dff.at[j, 'TrueHeader'] = str(x.get('Corrected Column1 value' + str(j)))
    dff.at[j, 'Comments'] = str(x.get('comments_' + str(j)))

# Replace missing (NaN) values with None
dff = dff.where(dff.notnull(), None)

NameError: name 'json_output' is not defined
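Once a completed loop is available, the merge step above can be tried on small sample data first. The sketch below works on plain dictionaries so it runs without AWS access. The sample records and reviewer answers are made up, the helper name is hypothetical, and the field names `Corrected Column1 value<j>` and `comments_<j>` are taken from the cell above.

```python
def apply_corrections(records, answer_content):
    """Return copies of `records` with reviewer corrections and comments added."""
    merged = []
    for j, record in enumerate(records):
        row = dict(record)  # copy so the input records are left untouched
        # Rows the reviewer did not touch get None in both new fields.
        row["TrueHeader"] = answer_content.get(f"Corrected Column1 value{j}")
        row["Comments"] = answer_content.get(f"comments_{j}")
        merged.append(row)
    return merged


# Hypothetical reviewer output: only row 1 was corrected.
sample_answer = {"Corrected Column1 value1": "SELECTED", "comments_1": "Box was ticked"}
sample_records = [
    {"Key": "Married filing jointly", "Value": "SELECTED"},
    {"Key": "Single", "Value": "NOT_SELECTED"},
]

for row in apply_corrections(sample_records, sample_answer):
    print(row)
```

Unlike wrapping each value in `str(...)`, this keeps untouched rows as a real `None` instead of the string `'None'`, which makes them easier to filter later. With pandas, the same helper could be fed `dff.to_dict('records')` and the result passed back to `pd.DataFrame(...)`.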

In [None]:
dff