## Step 0 - Install and Import Libraries

We will use the [Amazon Textract Parser Library](https://github.com/aws-samples/amazon-textract-response-parser/tree/master/src-python) to parse the Textract response, the data science library [Pandas](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.html) for content analysis, the [Amazon SageMaker Python SDK](https://sagemaker.readthedocs.io/en/stable/), and the [AWS SDK for Python (Boto3)](https://boto3.amazonaws.com/v1/documentation/api/latest/index.html) to work with Amazon Textract and Amazon A2I. Let's install and import them.

In [2]:
import pandas as pd
import webbrowser, os
import json
import boto3
import re
import sagemaker
from sagemaker import get_execution_role
from sagemaker.s3 import S3Uploader, S3Downloader
import uuid
import time
import io
from io import BytesIO
import sys
import csv
from pprint import pprint
from IPython.display import Image, display, IFrame
from PIL import Image as PImage, ImageDraw
from textractprettyprinter.t_pretty_print_expense import get_string, Textract_Expense_Pretty_Print, Pretty_Print_Table_Format, get_expensesummary_string, get_expenselineitemgroups_string
# from textractcaller.t_call import call_textract, Textract_Features
from textractprettyprinter.t_pretty_print import Textract_Pretty_Print, get_string

In [3]:
# Enter the Workteam ARN from step 7 above
WORKTEAM_ARN= 'arn:aws:sagemaker:us-east-1:485636232393:workteam/private-crowd/demo'
 
# Define IAM role
role = get_execution_role()
print("RoleArn: {}".format(role))
sess = sagemaker.Session()
bucket = sess.default_bucket()
prefix = 'textract-a2i-handwritten'

RoleArn: arn:aws:iam::485636232393:role/TextractA2I-SageMakerIamRole-T6LRZRF62Q68


In [21]:
!pip install poppler

Looking in indexes: https://pypi.org/simple, https://pip.repos.neuron.amazonaws.com
ERROR: Could not find a version that satisfies the requirement poppler (from versions: none)
ERROR: No matching distribution found for poppler

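The install attempt above fails because pip finds no distribution named `poppler`. The `pdf2image` package imported later in this notebook relies on the Poppler command-line tools (`pdftoppm` or `pdftocairo`) being installed at the system level. As a minimal sketch, the check below tests whether those tools are on the `PATH`; the helper name `poppler_available` and the suggested package name are assumptions for illustration.

```python
import shutil


def poppler_available() -> bool:
    """Return True if a Poppler binary that pdf2image can call is on PATH."""
    return any(shutil.which(tool) for tool in ("pdftoppm", "pdftocairo"))


if not poppler_available():
    # Assumption: many Linux distributions ship these tools as 'poppler-utils'.
    print("Poppler tools not found; install them with the system package "
          "manager before calling pdf2image.")
```
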
## Step 1 - Use Amazon Textract to retrieve document content and inspect response

In this step, we will download our test invoice from an S3 bucket to our notebook instance, use Amazon Textract to read the handwritten content in the invoice line items table, and load it into a pandas dataframe for analysis.

#### Review the sample document which has both printed and handwritten content in the tables

In [5]:
from IPython.display import display_pdf
from pdf2image import convert_from_path

documentName = "test5.pdf"
display_pdf(documentName)
 

In [6]:
s3_img_url

's3://sagemaker-us-east-1-485636232393/textract-a2i-handwritten/test5-1.png'
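The `S3Object` argument that Textract expects takes the bucket and the object key as separate fields, so an S3 URI like the one above has to be split first. The following is a small sketch of that step; `split_s3_uri` is a hypothetical helper name, not part of boto3 or the SageMaker SDK.

```python
from typing import Tuple
from urllib.parse import urlparse


def split_s3_uri(uri: str) -> Tuple[str, str]:
    """Split 's3://bucket/key' into (bucket, key)."""
    parsed = urlparse(uri)
    if parsed.scheme != "s3" or not parsed.netloc:
        raise ValueError("Not an S3 URI: {}".format(uri))
    # The key is the path without its leading slash.
    return parsed.netloc, parsed.path.lstrip("/")


example_bucket, example_key = split_s3_uri(
    "s3://sagemaker-us-east-1-485636232393/textract-a2i-handwritten/test5-1.png"
)
print(example_bucket)
print(example_key)
```

Note that the key includes the prefix (`textract-a2i-handwritten/`), so the `Name` passed to Textract should match the full key under which the object was uploaded.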

In [17]:
import boto3
s3 = boto3.resource('s3')

client = boto3.client(
    service_name='textract',
    region_name='us-east-1'
)

bucket_name = "sagemaker-us-east-1-485636232393"
documentName = "f1040-sample-typed-Sidney.pdf"

# Analyze the form fields of the image stored in S3, and pass the A2I flow
# definition so that a human review loop can be started for this document.
response = client.analyze_document(
    Document={
        'S3Object': {
            "Bucket": bucket_name,
            "Name": "test5-1.png"
        }
    },
    HumanLoopConfig={
        "FlowDefinitionArn": "arn:aws:sagemaker:us-east-1:485636232393:flow-definition/textract-template",
        # HumanLoopName should be unique within the Region; change it before re-running this cell.
        "HumanLoopName": "humanloop",
        "DataAttributes": {
            "ContentClassifiers": [
                "FreeOfPersonallyIdentifiableInformation",
                "FreeOfAdultContent"
            ]
        }
    },
    FeatureTypes=["FORMS"]
)


In [None]:
response
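When `HumanLoopConfig` is supplied, the `AnalyzeDocument` response carries a `HumanLoopActivationOutput` section describing whether a human review loop was started. The sketch below summarizes that section; `summarize_human_loop` is a hypothetical helper, and the sample dictionary is hand-written for illustration, not real Textract output.

```python
import json


def summarize_human_loop(response: dict) -> dict:
    """Summarize the A2I activation details of an AnalyzeDocument response."""
    output = response.get("HumanLoopActivationOutput", {})
    raw_results = output.get("HumanLoopActivationConditionsEvaluationResults")
    return {
        "started": bool(output.get("HumanLoopArn")),
        "arn": output.get("HumanLoopArn"),
        "reasons": output.get("HumanLoopActivationReasons", []),
        # The evaluation results arrive as a JSON string, so decode them.
        "evaluation": json.loads(raw_results) if raw_results else None,
    }


# Hypothetical response fragment, for illustration only.
sample = {
    "HumanLoopActivationOutput": {
        "HumanLoopArn": "arn:aws:sagemaker:us-east-1:111122223333:human-loop/example",
        "HumanLoopActivationReasons": ["ConditionsEvaluation"],
        "HumanLoopActivationConditionsEvaluationResults": "{\"Conditions\": []}",
    }
}
print(summarize_human_loop(sample))
```
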

### Helper functions to parse Amazon Textract response

We will now import the Amazon Textract Response Parser library to parse and extract what we need from Amazon Textract's response. There are two main functions here. First, we extract the header data: the document heading and the form data (key-value pairs) in the header section of the document. Second, we parse the table and its cells to create a CSV file containing the tabular data. In this notebook, we use the Textract synchronous API for document extraction, [AnalyzeDocument](https://docs.aws.amazon.com/textract/latest/dg/API_AnalyzeDocument.html), which accepts image files (PNG or JPEG) as input. For example, here is a code snippet for AnalyzeDocument:
    
    client = boto3.client(
        service_name='textract',
        region_name='us-east-1',
        endpoint_url='https://textract.us-east-1.amazonaws.com',
    )

    # bytes_test is assumed to hold the raw bytes of the image file
    response = client.analyze_document(Document={'Bytes': bytes_test}, FeatureTypes=['TABLES', 'FORMS'])
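To make the two parsing steps described in this section concrete, here is a rough sketch that walks the raw `Blocks` list of a response directly. It does not use the Response Parser library's API; the function names and the hand-built `sample_response` are assumptions for illustration only, and the sample is not real Textract output.

```python
import csv
import io


def child_ids(block, rel_type="CHILD"):
    """Collect the Ids listed under one relationship type of a block."""
    ids = []
    for rel in block.get("Relationships", []):
        if rel["Type"] == rel_type:
            ids.extend(rel["Ids"])
    return ids


def block_text(block, by_id):
    """Join the text of the WORD children of a block."""
    children = [by_id[i] for i in child_ids(block) if i in by_id]
    return " ".join(c["Text"] for c in children if c["BlockType"] == "WORD")


def extract_key_values(response):
    """Map each form key to its value text."""
    by_id = {b["Id"]: b for b in response["Blocks"]}
    pairs = {}
    for block in response["Blocks"]:
        if block["BlockType"] == "KEY_VALUE_SET" and "KEY" in block.get("EntityTypes", []):
            values = [by_id[i] for i in child_ids(block, "VALUE") if i in by_id]
            pairs[block_text(block, by_id)] = " ".join(block_text(v, by_id) for v in values)
    return pairs


def tables_to_csv(response):
    """Return one CSV string per TABLE block, in response order."""
    by_id = {b["Id"]: b for b in response["Blocks"]}
    csv_tables = []
    for block in response["Blocks"]:
        if block["BlockType"] != "TABLE":
            continue
        cells = {}
        for cell_id in child_ids(block):
            cell = by_id.get(cell_id)
            if cell and cell["BlockType"] == "CELL":
                cells[(cell["RowIndex"], cell["ColumnIndex"])] = block_text(cell, by_id)
        if not cells:
            continue
        n_rows = max(r for r, _ in cells)
        n_cols = max(c for _, c in cells)
        buffer = io.StringIO()
        writer = csv.writer(buffer, lineterminator="\n")
        for r in range(1, n_rows + 1):
            writer.writerow([cells.get((r, c), "") for c in range(1, n_cols + 1)])
        csv_tables.append(buffer.getvalue())
    return csv_tables


# Hand-built, hypothetical response fragment (not real Textract output).
def _word(block_id, text):
    return {"Id": block_id, "BlockType": "WORD", "Text": text}


def _cell(block_id, row, col, word_id):
    return {"Id": block_id, "BlockType": "CELL", "RowIndex": row, "ColumnIndex": col,
            "Relationships": [{"Type": "CHILD", "Ids": [word_id]}]}


sample_response = {"Blocks": [
    _word("w1", "Invoice"), _word("w2", "No"), _word("w3", "1234"),
    {"Id": "k1", "BlockType": "KEY_VALUE_SET", "EntityTypes": ["KEY"],
     "Relationships": [{"Type": "VALUE", "Ids": ["v1"]},
                       {"Type": "CHILD", "Ids": ["w1", "w2"]}]},
    {"Id": "v1", "BlockType": "KEY_VALUE_SET", "EntityTypes": ["VALUE"],
     "Relationships": [{"Type": "CHILD", "Ids": ["w3"]}]},
    _word("w4", "Item"), _word("w5", "Qty"), _word("w6", "Pen"), _word("w7", "2"),
    {"Id": "t1", "BlockType": "TABLE",
     "Relationships": [{"Type": "CHILD", "Ids": ["c1", "c2", "c3", "c4"]}]},
    _cell("c1", 1, 1, "w4"), _cell("c2", 1, 2, "w5"),
    _cell("c3", 2, 1, "w6"), _cell("c4", 2, 2, "w7"),
]}

print(extract_key_values(sample_response))
print(tables_to_csv(sample_response)[0])
```

Working on plain dictionaries keeps the sketch free of extra dependencies. It ignores selection elements and merged cells, which a fuller parser would need to handle.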

Alternatively, if you would like to modify this notebook to process PDF files or batches of documents, use the [StartDocumentAnalysis API](https://docs.aws.amazon.com/textract/latest/dg/API_StartDocumentAnalysis.html). StartDocumentAnalysis returns a job identifier (JobId) that you use to get the results of the operation. When text analysis is finished, Amazon Textract publishes a completion status to the Amazon Simple Notification Service (Amazon SNS) topic that you specify in NotificationChannel. To get the results of the text analysis operation, first check that the status value published to the Amazon SNS topic is SUCCEEDED. If so, call GetDocumentAnalysis and pass the job identifier (JobId) from the initial call to StartDocumentAnalysis.
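As a rough sketch of the asynchronous flow, the helper below polls GetDocumentAnalysis for the job status instead of waiting for the Amazon SNS notification described above, which is a simpler alternative that can be enough for experiments in a notebook. The helper name `wait_for_document_analysis` and its default polling values are assumptions, not part of boto3.

```python
import time


def wait_for_document_analysis(client, job_id, poll_seconds=5.0, max_polls=120):
    """Poll a Textract analysis job and return all Blocks once it finishes."""
    for _ in range(max_polls):
        result = client.get_document_analysis(JobId=job_id)
        status = result["JobStatus"]
        if status in ("SUCCEEDED", "PARTIAL_SUCCESS"):
            break
        if status == "FAILED":
            raise RuntimeError(result.get("StatusMessage", "Textract job failed"))
        time.sleep(poll_seconds)
    else:
        raise TimeoutError("Textract job {} did not finish in time".format(job_id))

    # Results can span several pages; follow NextToken until it is exhausted.
    blocks = list(result.get("Blocks", []))
    next_token = result.get("NextToken")
    while next_token:
        page = client.get_document_analysis(JobId=job_id, NextToken=next_token)
        blocks.extend(page.get("Blocks", []))
        next_token = page.get("NextToken")
    return blocks


# Usage (requires AWS credentials; bucket and key are placeholders):
# job = client.start_document_analysis(
#     DocumentLocation={"S3Object": {"Bucket": "my-bucket", "Name": "my-doc.pdf"}},
#     FeatureTypes=["TABLES", "FORMS"],
# )
# blocks = wait_for_document_analysis(client, job["JobId"])
```
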