# Amazon Textract

**Table of Contents**
1. Introduction
    1. OCR (Optical Character Recognition)
    2. Amazon Textract
2. Setup
3. Implementation
5. Participant Exercise
6. Summary


## 1. Introduction

### A. OCR (Optical Character Recognition)

Optical character recognition (OCR) automates data extraction: it detects printed or handwritten text in a scanned document or image file and converts that text into a machine-readable form.


### B. Amazon Textract
Amazon Textract is a machine learning service that automatically extracts text, handwriting, and data from scanned documents. It goes beyond simple optical character recognition (OCR) to identify and extract data from forms and tables. Textract uses machine learning to read and process many kinds of documents, extracting text, handwriting, tables, and other data without manual transcription. It is a fully managed service that requires little to no ML experience to get started.


![Textract Benefits](img/textract-benefit.png)

## 2. Setup

We will load dependencies, define helper functions, and set up some resources for the notebook.


In [2]:
!pip install amazon-textract-response-parser==0.1.20
!pip install amazon-textract-caller==0.0.15


Collecting amazon-textract-response-parser==0.1.20
  Downloading amazon_textract_response_parser-0.1.20-py2.py3-none-any.whl (23 kB)
Collecting marshmallow==3.11.1
  Downloading marshmallow-3.11.1-py2.py3-none-any.whl (46 kB)
     |████████████████████████████████| 46 kB 557 kB/s             
Installing collected packages: marshmallow, amazon-textract-response-parser
Successfully installed amazon-textract-response-parser-0.1.20 marshmallow-3.11.1
Collecting amazon-textract-caller==0.0.15
  Downloading amazon_textract_caller-0.0.15-py2.py3-none-any.whl (10 kB)
Installing collected packages: amazon-textract-caller
Successfully installed amazon-textract-caller-0.0.15


In [3]:
# import dependencies
import os
import boto3
import json
from IPython.display import IFrame
import time

# a few helper functions
def split_s3_uri(uri):
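    """return (bucket, key) tuple from s3 uri like 's3://bucket/prefix/file.txt' """
    # strip only the leading scheme, split bucket from key, and return a tuple as documented
    return tuple(uri.replace('s3://', '', 1).split('/', 1))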

def s3_object_from_uri(uri):
    """Initialize a boto3 s3 Object instance from a URI"""
    s3 = boto3.resource('s3')
    return s3.Object(*split_s3_uri(uri))

def s3_contents_from_uri(uri, decode=True):
    """Read contents from S3 object into memory"""
    data = s3_object_from_uri(uri).get()['Body'].read()
    return data.decode() if decode else data

def write_to_s3(data, uri):
    """Write data to an S3 Object at the specified URI"""
    boto3.resource('s3').Object(*split_s3_uri(uri)).put(Body=data.encode())

def get_ssm_parameter(parameter_name):
    """Get the value of an SSM Parameter Store parameter"""
    return boto3.client('ssm').get_parameter(Name=parameter_name)['Parameter']['Value']

def visualize_pdf(document_location, width=600, height=800):
    """Visualize pdf stored in s3 as an IFrame in notebook"""
    bucket, key = split_s3_uri(document_location)
    tmpdir = 'tmp'
    tmpfile = os.path.join(tmpdir,'sample_doc.pdf')
    os.makedirs(tmpdir, exist_ok=True)
    boto3.resource('s3').Object(bucket, key).download_file(tmpfile)
    time.sleep(1)
    return IFrame(tmpfile, width=width, height=height)


# S3 URI prefix where documents are located
DATA_S3_PREFIX = os.path.join(get_ssm_parameter('AssetsS3Prefix'), 'documents/')

# DynamoDB table that we can store output metadata in
DDB_TABLE_NAME = get_ssm_parameter('TableName')  # i.e. "document-data"

# S3 bucket that we can write outputs to
OUTPUT_BUCKET_NAME = get_ssm_parameter('OutputBucketName')
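
The `split_s3_uri` helper above relies on plain string replacement. As a hedged alternative, here is a minimal sketch that parses the URI with the standard library and rejects anything that is not an `s3://` URI. The name `parse_s3_uri` and the example bucket are hypothetical and not part of this notebook.

```python
from urllib.parse import urlparse


def parse_s3_uri(uri):
    """Return a (bucket, key) tuple for a URI like 's3://bucket/prefix/file.txt'."""
    parsed = urlparse(uri)
    if parsed.scheme != "s3" or not parsed.netloc:
        raise ValueError(f"not an S3 URI: {uri!r}")
    # The parsed path starts with '/', which is not part of the object key.
    # Caveat: keys containing '?' or '#' would be split off by urlparse.
    return parsed.netloc, parsed.path.lstrip("/")


# Illustrative call with a made-up bucket and key
bucket, key = parse_s3_uri("s3://example-bucket/documents/sample.pdf")
print(bucket, key)
```

Failing early on a malformed URI tends to give a clearer error than letting a bad bucket name reach S3.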



## 3. Implementation



To begin document processing, we first explore some sample documents, then extract information from them, and finally save the results for future use. The work splits into four steps:

- *Step 1*: Explore a sample document and review it

- *Step 2*: Use Textract APIs to extract information

- *Step 3*: Parse the Textract response into easily consumable formats

- *Step 4*: Save Textract data into the data store


#### Step 1: Explore a sample document and review it

In [4]:
# render a document from the dataset in the notebook

SAMPLE_DOC_NAME = 'INSR_ACORD-Property-Loss-Notice-12.05.16_1_pii_00000.pdf'

visualize_pdf(os.path.join(DATA_S3_PREFIX, SAMPLE_DOC_NAME), width=600, height=800)

#### Step 2: Use Textract APIs to extract information



`StartDocumentAnalysis` can analyze text in documents that are in JPEG, PNG, and PDF format. We will use this to analyze our PDF documents for relationships between detected items such as key-value pairs, tables, and selection elements. 

reference: https://docs.aws.amazon.com/textract/latest/dg/analyzing-document-text.html


### Feature types supported (forms/tables) 

Add TABLES to the list to return information about the tables that are detected in the input document. Add FORMS to return detected form data. To perform both types of analysis, add TABLES and FORMS to FeatureTypes. All lines and words detected in the document are included in the response (including text that isn't related to the value of FeatureTypes). 
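
To make the `FeatureTypes` choice concrete, the sketch below builds the keyword arguments for a `start_document_analysis` call and checks the requested features first. The helper name `build_analysis_request`, the example bucket and key, and the decision to allow only the two feature types described in this section are assumptions for illustration.

```python
# Only the two feature types discussed in this section (an assumption of this sketch)
ALLOWED_FEATURES = {"TABLES", "FORMS"}


def build_analysis_request(bucket, key, features):
    """Build keyword arguments for boto3's start_document_analysis call."""
    if not features:
        raise ValueError("at least one feature type is required")
    unknown = set(features) - ALLOWED_FEATURES
    if unknown:
        raise ValueError(f"unsupported feature types: {sorted(unknown)}")
    return {
        "DocumentLocation": {"S3Object": {"Bucket": bucket, "Name": key}},
        # De-duplicate and sort so the request is deterministic
        "FeatureTypes": sorted(set(features)),
    }


# Request both kinds of analysis for a made-up document location
request = build_analysis_request(
    "example-bucket", "documents/sample.pdf", ["TABLES", "FORMS"]
)
print(request["FeatureTypes"])
```

The resulting dictionary could then be passed as `boto3.client('textract').start_document_analysis(**request)`.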

#### Forms
<!-- ![Forms](img/hieroglyph-key-value-set.png) -->
![Forms](img/textract-form.png)


#### Tables
<!-- ![Tables](img/hieroglyph-table-cell.png) -->
![Tables](img/textract-table.png)




**Calling the Textract API**

Let's call the Textract `StartDocumentAnalysis` API using the Python `boto3` library.

boto3 docs for Textract APIs: https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/textract.html#Textract.Client.start_document_analysis

In [5]:
sample_doc_uri = os.path.join(DATA_S3_PREFIX, SAMPLE_DOC_NAME)
sample_doc = s3_object_from_uri(sample_doc_uri)

textract_job = boto3.client('textract').start_document_analysis(
    DocumentLocation={
        'S3Object': {
            'Bucket': sample_doc.bucket_name,
            'Name': sample_doc.key
        }
    },
    FeatureTypes = ["TABLES","FORMS"]
)

# takes several seconds for job to complete - this will return a response showing that
# the JobStatus is "IN_PROGRESS"

textract_response = boto3.client('textract').get_document_analysis(
    JobId=textract_job['JobId']
)
print(json.dumps(textract_response,indent=4, sort_keys=True))

{
    "AnalyzeDocumentModelVersion": "1.0",
    "JobStatus": "IN_PROGRESS",
    "ResponseMetadata": {
        "HTTPHeaders": {
            "content-length": "63",
            "content-type": "application/x-amz-json-1.1",
            "date": "Thu, 02 Dec 2021 20:36:01 GMT",
            "x-amzn-requestid": "58662b45-df80-4b9b-98a8-ced38bdbb4c0"
        },
        "HTTPStatusCode": 200,
        "RequestId": "58662b45-df80-4b9b-98a8-ced38bdbb4c0",
        "RetryAttempts": 0
    }
}
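
Because the job runs asynchronously, the first `get_document_analysis` call usually reports `IN_PROGRESS`. As an alternative to sleeping for a fixed time, the sketch below polls until the job leaves `IN_PROGRESS` and then follows `NextToken` to gather every page of results. The function name, the poll interval, the timeout, and the choice to treat any final status other than `SUCCEEDED` as an error are assumptions for illustration.

```python
import time


def wait_and_collect_blocks(client, job_id, poll_seconds=2.0, timeout_seconds=300.0):
    """Poll get_document_analysis until the job ends, then gather all Blocks."""
    deadline = time.monotonic() + timeout_seconds
    while True:
        response = client.get_document_analysis(JobId=job_id)
        status = response["JobStatus"]
        if status != "IN_PROGRESS":
            break
        if time.monotonic() >= deadline:
            raise TimeoutError(f"job {job_id} still IN_PROGRESS after {timeout_seconds}s")
        time.sleep(poll_seconds)

    if status != "SUCCEEDED":
        # FAILED or PARTIAL_SUCCESS: surface the service's message, if any
        raise RuntimeError(
            f"job {job_id} ended with {status}: {response.get('StatusMessage')}"
        )

    blocks = list(response.get("Blocks", []))
    # Results are paginated: keep requesting pages while a NextToken is returned
    while "NextToken" in response:
        response = client.get_document_analysis(
            JobId=job_id, NextToken=response["NextToken"]
        )
        blocks.extend(response.get("Blocks", []))
    return blocks
```

In this notebook it could be called as `wait_and_collect_blocks(boto3.client('textract'), textract_job['JobId'])`.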


In [6]:
# we'll wait for a bit before calling the get analysis request again - you will see results once the job finishes

time.sleep(10)

textract_result = boto3.client('textract').get_document_analysis(
    JobId=textract_job['JobId']
)
print(json.dumps(textract_result,indent=2, sort_keys=True))

{
  "AnalyzeDocumentModelVersion": "1.0",
  "Blocks": [
    {
      "BlockType": "PAGE",
      "Geometry": {
        "BoundingBox": {
          "Height": 1.0,
          "Left": 0.0,
          "Top": 0.0,
          "Width": 1.0
        },
        "Polygon": [
          {
            "X": 1.5849614334573464e-16,
            "Y": 0.0
          },
          {
            "X": 1.0,
            "Y": 9.462437987838284e-17
          },
          {
            "X": 1.0,
            "Y": 1.0
          },
          {
            "X": 0.0,
            "Y": 1.0
          }
        ]
      },
      "Id": "5d25ef23-e6a6-4219-95ba-5d431cc97eb2",
      "Page": 1,
      "Relationships": [
        {
          "Ids": [
            "8cfd79f4-d504-43f1-a15d-c9b19a716844",
            "2b6d8e13-875a-48e2-a465-7fbfc5ceda1c",
            "787496b8-e335-4c1f-a50b-fadd4983be9e",
            "5fbdf84d-841c-4d5a-b17a-765dd11f7606",
            "7443e616-5c03-4fe1-9cf7-fd53aa4af91e",
            "e7182a86-ee69-4780

In [6]:
# Here's a useful function that does the same thing as the two steps above.
# it calls textract, but also does the work of waiting for the textract job to finish before returning results
# and paginates through multiple pages of results if there are multiple pages
from textractcaller import call_textract, Textract_Features, OutputConfig

# call textract async api in a single request
textract_result = call_textract(
    input_document=sample_doc_uri,
    force_async_api=True,
    features=[Textract_Features.TABLES, Textract_Features.FORMS],
    output_config=OutputConfig(OUTPUT_BUCKET_NAME,'textract'),
)
print(json.dumps(textract_result, indent=2, sort_keys=True))

{
  "AnalyzeDocumentModelVersion": "1.0",
  "Blocks": [
    {
      "BlockType": "PAGE",
      "Geometry": {
        "BoundingBox": {
          "Height": 1.0,
          "Left": 0.0,
          "Top": 0.0,
          "Width": 1.0
        },
        "Polygon": [
          {
            "X": 1.5849614334573464e-16,
            "Y": 0.0
          },
          {
            "X": 1.0,
            "Y": 9.462437987838284e-17
          },
          {
            "X": 1.0,
            "Y": 1.0
          },
          {
            "X": 0.0,
            "Y": 1.0
          }
        ]
      },
      "Id": "8e490713-a8ed-4095-a8b8-5aa081771ea1",
      "Page": 1,
      "Relationships": [
        {
          "Ids": [
            "0d5b465f-65ee-426c-b11c-b1f11b661d06",
            "c75c2b65-466e-4230-9edd-d7d6a954f531",
            "09aa2c42-e070-4296-9bba-5639a565ae19",
            "e0e250f1-03d7-48f1-a8c4-06da2da8c167",
            "b83da697-fca8-4009-8bce-36311bc1b46a",
            "75de8c99-3b28-40c4

#### Step 3: Parse the Textract response into easily consumable formats


Textract text analysis operations return a set of `Blocks`. Each `Block` can be of type:
* PAGE - Contains a list of child Block objects that are detected on a document page.
* KEY_VALUE_SET - Stores the KEY and VALUE Block objects for linked text that's detected on a document page. Use the `EntityTypes` field to determine if a KEY_VALUE_SET object is a KEY Block object or a VALUE Block object.
* WORD - A word that's detected on a document page. A word is one or more ISO basic Latin script characters that aren't separated by spaces.
* LINE - A string of tab-delimited, contiguous words that are detected on a document page.
* TABLE - A table that's detected on a document page. A table is grid-based information with two or more rows or columns, with a cell span of one row and one column each.
* CELL - A cell within a detected table. The cell is the parent of the block that contains the text in the cell.
* SELECTION_ELEMENT - A selection element such as an option button (radio button) or a check box that's detected on a document page. Use the value of SelectionStatus to determine the status of the selection element.
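
To make these block types concrete, here is a minimal sketch that tallies blocks by type in plain Python, splitting `KEY_VALUE_SET` blocks by their `EntityTypes` value and `SELECTION_ELEMENT` blocks by `SelectionStatus`. The `summarize_blocks` helper and the sample blocks are hand-made for illustration and are not real Textract output.

```python
from collections import Counter


def summarize_blocks(blocks):
    """Count blocks per type, refining KEY_VALUE_SET and SELECTION_ELEMENT."""
    counts = Counter()
    for block in blocks:
        block_type = block["BlockType"]
        if block_type == "KEY_VALUE_SET":
            # EntityTypes is a list such as ["KEY"] or ["VALUE"]
            for entity in block.get("EntityTypes", []):
                counts[f"KEY_VALUE_SET/{entity}"] += 1
        elif block_type == "SELECTION_ELEMENT":
            counts[f"SELECTION_ELEMENT/{block.get('SelectionStatus')}"] += 1
        else:
            counts[block_type] += 1
    return dict(counts)


# Hand-made blocks that mimic the shape of a Textract response
sample_blocks = [
    {"BlockType": "PAGE"},
    {"BlockType": "LINE", "Text": "PROPERTY LOSS NOTICE"},
    {"BlockType": "WORD", "Text": "PROPERTY"},
    {"BlockType": "KEY_VALUE_SET", "EntityTypes": ["KEY"]},
    {"BlockType": "KEY_VALUE_SET", "EntityTypes": ["VALUE"]},
    {"BlockType": "SELECTION_ELEMENT", "SelectionStatus": "NOT_SELECTED"},
]
print(summarize_blocks(sample_blocks))
```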



In [7]:
import pandas as pd

# We'll load the Textract blocks into a Pandas dataframe for inspection in a tabular format
blocks = pd.DataFrame(textract_result['Blocks'])

# Let's see what types of blocks Textract returned in the analysis, and how many of each kind
blocks.BlockType.value_counts()

WORD                 315
KEY_VALUE_SET        150
LINE                 145
SELECTION_ELEMENT     27
PAGE                   1
Name: BlockType, dtype: int64

In [8]:
# Visualize LINE blocks
blocks.loc[lambda x: x.BlockType=='LINE'].head()

| | BlockType | Geometry | Id | Relationships | Page | Confidence | Text | TextType | SelectionStatus | EntityTypes |
|---|---|---|---|---|---|---|---|---|---|---|
| 1 | LINE | {'BoundingBox': {'Width': 0.09557357430458069,... | 8cfd79f4-d504-43f1-a15d-c9b19a716844 | [{'Type': 'CHILD', 'Ids': ['d170d8cc-321f-4665... | 1 | 99.770439 | DATE (MM/DD/YYYY) | | | |
| 2 | LINE | {'BoundingBox': {'Width': 0.2869662642478943, ... | 2b6d8e13-875a-48e2-a465-7fbfc5ceda1c | [{'Type': 'CHILD', 'Ids': ['53d0e1a5-2d0a-4285... | 1 | 99.828239 | PROPERTY LOSS NOTICE | | | |
| 3 | LINE | {'BoundingBox': {'Width': 0.09483378380537033,... | 787496b8-e335-4c1f-a50b-fadd4983be9e | [{'Type': 'CHILD', 'Ids': ['ca6222e3-1ec8-43e8... | 1 | 99.683617 | 03-28-2007 | | | |
| 4 | LINE | {'BoundingBox': {'Width': 0.042315803468227386... | 5fbdf84d-841c-4d5a-b17a-765dd11f7606 | [{'Type': 'CHILD', 'Ids': ['ebb3ae41-10e0-4aa5... | 1 | 99.829231 | AGENCY | | | |
| 5 | LINE | {'BoundingBox': {'Width': 0.12648801505565643,... | 7443e616-5c03-4fe1-9cf7-fd53aa4af91e | [{'Type': 'CHILD', 'Ids': ['b5a95177-3bf9-4ff6... | 1 | 99.806358 | INSURED LOCATION CODE | | | |


In [9]:
# Key value relationships between blocks with references to the WORD/LINE blocks corresponding to key/value
blocks.loc[lambda x: x.BlockType=='KEY_VALUE_SET'].head()

| | BlockType | Geometry | Id | Relationships | Page | Confidence | Text | TextType | SelectionStatus | EntityTypes |
|---|---|---|---|---|---|---|---|---|---|---|
| 488 | KEY_VALUE_SET | {'BoundingBox': {'Width': 0.030021818354725838... | 0b4ed868-7e12-405c-ac06-e6a3683728ad | [{'Type': 'VALUE', 'Ids': ['a6854b13-e4de-4829... | 1 | 99.5 | | | | [KEY] |
| 489 | KEY_VALUE_SET | {'BoundingBox': {'Width': 0.01417785044759512,... | a6854b13-e4de-4829-b590-678e475c2e4c | [{'Type': 'CHILD', 'Ids': ['6da7e79b-7df3-4397... | 1 | 99.5 | | | | [VALUE] |
| 490 | KEY_VALUE_SET | {'BoundingBox': {'Width': 0.026599373668432236... | 1a52b184-2a42-4b3e-befe-5610c205546b | [{'Type': 'VALUE', 'Ids': ['c1525a06-fc8f-4cec... | 1 | 98.5 | | | | [KEY] |
| 491 | KEY_VALUE_SET | {'BoundingBox': {'Width': 0.016390914097428322... | c1525a06-fc8f-4cec-a3d0-bd506feb0223 | [{'Type': 'CHILD', 'Ids': ['8a55af6a-5f51-4e8d... | 1 | 98.5 | | | | [VALUE] |
| 492 | KEY_VALUE_SET | {'BoundingBox': {'Width': 0.023040059953927994... | 4b6f1569-0c7f-485d-aa45-06e49c65436d | [{'Type': 'VALUE', 'Ids': ['6fcf020c-dd58-4bca... | 1 | 98.5 | | | | [KEY] |


In [10]:
# The Relationships field of the KEY_VALUE_SET block references the child blocks containing the metadata (text, bounding box) for the key and value items of the key-value pair
blocks.loc[lambda x: x.BlockType=='KEY_VALUE_SET'].Relationships.iloc[0]

[{'Type': 'VALUE', 'Ids': ['a6854b13-e4de-4829-b590-678e475c2e4c']},
 {'Type': 'CHILD', 'Ids': ['b487c105-75b5-4986-b759-5c569dbdcca0']}]
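
To show what these relationships encode, the sketch below resolves key-value pairs by hand: a KEY block points to its VALUE block through a `VALUE` relationship, and each of them points to its text through `CHILD` relationships. The helper names and the sample blocks (with made-up short ids) are assumptions for illustration.

```python
def block_text(block, by_id):
    """Join the text of a block's CHILD blocks (words or selection elements)."""
    parts = []
    for rel in block.get("Relationships") or []:
        if rel["Type"] != "CHILD":
            continue
        for child_id in rel["Ids"]:
            child = by_id[child_id]
            if child["BlockType"] == "WORD":
                parts.append(child["Text"])
            elif child["BlockType"] == "SELECTION_ELEMENT":
                parts.append(child["SelectionStatus"])
    return " ".join(parts)


def key_value_pairs(blocks):
    """Return (key_text, value_text) tuples, one per KEY block."""
    by_id = {b["Id"]: b for b in blocks}
    pairs = []
    for block in blocks:
        if block["BlockType"] != "KEY_VALUE_SET":
            continue
        if "KEY" not in block.get("EntityTypes", []):
            continue
        # Follow the VALUE relationship from the KEY block to its VALUE block(s)
        value_ids = [
            value_id
            for rel in block.get("Relationships") or []
            if rel["Type"] == "VALUE"
            for value_id in rel["Ids"]
        ]
        value_text = " ".join(block_text(by_id[i], by_id) for i in value_ids)
        pairs.append((block_text(block, by_id), value_text))
    return pairs


# Hand-made blocks with made-up ids, shaped like a Textract response
sample_blocks = [
    {"Id": "k1", "BlockType": "KEY_VALUE_SET", "EntityTypes": ["KEY"],
     "Relationships": [{"Type": "VALUE", "Ids": ["v1"]},
                       {"Type": "CHILD", "Ids": ["w1", "w2"]}]},
    {"Id": "v1", "BlockType": "KEY_VALUE_SET", "EntityTypes": ["VALUE"],
     "Relationships": [{"Type": "CHILD", "Ids": ["w3", "w4"]}]},
    {"Id": "w1", "BlockType": "WORD", "Text": "REPORTED"},
    {"Id": "w2", "BlockType": "WORD", "Text": "BY"},
    {"Id": "w3", "BlockType": "WORD", "Text": "Dalton"},
    {"Id": "w4", "BlockType": "WORD", "Text": "Lott"},
]
print(key_value_pairs(sample_blocks))
```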

We'll use a helper library, `amazon-textract-response-parser` (imported as `trp`), to parse the Textract output and connect the text of parent and child blocks.

For more on `amazon-textract-response-parser`, see:
* https://github.com/aws-samples/amazon-textract-response-parser
* https://pypi.org/project/amazon-textract-response-parser/

In [11]:
# Use textract response parser library
from trp import Document

doc = Document(textract_result)

# See fields
for page in doc.pages:
#     print(page)
    for field in page.form.fields:
        print(field)



Field
Key: HOME
Value: NOT_SELECTED

Field
Key: CELL
Value: NOT_SELECTED

Field
Key: BUS
Value: NOT_SELECTED

Field
Key: HOME
Value: NOT_SELECTED

Field
Key: CELL
Value: NOT_SELECTED

Field
Key: HOME
Value: NOT_SELECTED

Field
Key: CELL
Value: NOT_SELECTED

Field
Key: SUBCODE:
Value: 

Field
Key: CONTACT INSURED
Value: NOT_SELECTED

Field
Key: HOME
Value: NOT_SELECTED

Field
Key: NAIC CODE
Value: 

Field
Key: LINE OF BUSINESS
Value: 

Field
Key: CELL
Value: NOT_SELECTED

Field
Key: FLOOD
Value: NOT_SELECTED

Field
Key: BUS
Value: NOT_SELECTED

Field
Key: HOME
Value: NOT_SELECTED

Field
Key: BUS
Value: NOT_SELECTED

Field
Key: REPORTED BY
Value: Dalton Lott

Field
Key: NAIC CODE
Value: 

Field
Key: PROBABLE AMOUNT ENTIRE LOSS
Value: 

Field
Key: POLICE OR FIRE DEPARTMENT CONTACTED
Value: 

Field
Key: HOME
Value: NOT_SELECTED

Field
Key: BUS
Value: NOT_SELECTED

Field
Key: CODE:
Value: 

Field
Key: INSURED LOCATION CODE
Value: 

Field
Key: NAIC CODE
Value: 

Field
Key: BUS
Value: NOT_SE

In [12]:
# here's a method to extract just the text from the structured response
document_text = '\n'.join([page.text for page in doc.pages])

# displaying just the starting portion
print(document_text[:1000] + '\n' + '...')

DATE (MM/DD/YYYY)
PROPERTY LOSS NOTICE
03-28-2007
AGENCY
INSURED LOCATION CODE
DATE OF LOSS AND TIME
AM
11-03-2016
11:02:22
PM
PROPERTY/HOME POLICY
CARRIER
NAIC CODE
CONTACT
Alex Serrano
POLICY NUMBER
LINE OF BUSINESS
NAME:
PHONE
383.764.2757
(A/C, No, Ext):
FAX
929-054-2926
FLOOD POLICY
(A/C, No):
E-MAIL
ADDRESS:
caban1894@protonmail.com
CARRIER
NAIC CODE
CODE:
SUBCODE:
AGENCY CUSTOMER ID:
POLICY NUMBER
WIND POLICY
CARRIER
NAIC CODE
POLICY NUMBER
INSURED
NAME OF INSURED (First, Middle, Last)
INSURED'S MAILING ADDRESS
Isis Fletcher
Salina, Idaho
DATE OF BIRTH
FEIN (if applicable)
MARITAL STATUS /
CIVIL UNION (if applicable)
01-10-2018
PRIMARY
HOME
BUS
CELL
SECONDARY
HOME
BUS
CELL
PHONE #
PHONE #
PRIMARY E-MAIL ADDRESS:
otys1997@yahoo.com
SECONDARY E-MAIL ADDRESS:
rindlish1808@protonmail.com
NAME OF SPOUSE (First, Middle, Last) (if applicable)
SPOUSE'S MAILING ADDRESS (if applicable)
Kizzy Navarro
Atascadero, Texas
DATE OF BIRTH
FEIN (if applicable)
MARITAL STATUS
10-23-2002
CIVIL UNION

#### Step 4: Save Textract data into the data store

**Data Storage Design**

Now that we have tried out the Textract API, building a pipeline also requires storing and retrieving this data, both for downstream services and modules and for referring back to the document later.

**Data Storage Layers**

We will use S3 to store the raw API response as well as any transformations required for the data.
We will then use DynamoDB to store the metadata (such as the S3 URI where the raw response is stored) for the corresponding document.


**Document Id**

To store and retrieve documents and related data, we need a primary key (identifier). This document ID defines the S3 partitioning of the data for easy retrieval and also serves as the partition key in DynamoDB (in a SQL/relational model, think of the primary key of your `Documents` table). For this workshop, we generate a UUID for each document.






In [13]:
import uuid

document_id = str(uuid.uuid4())

table = boto3.resource('dynamodb').Table(DDB_TABLE_NAME)


# store the raw textract response as a JSON file on s3
textract_s3_uri = f's3://{OUTPUT_BUCKET_NAME}/textract/{document_id}.json'
write_to_s3(json.dumps(textract_result), textract_s3_uri)

# store text-only contents on s3
text_s3_uri = f's3://{OUTPUT_BUCKET_NAME}/text/{document_id}.txt'
write_to_s3(document_text, text_s3_uri)

# store references to s3 files in dynamodb record for document
table.put_item(
    Item={
        'document_id': str(document_id),
        'src_s3_uri': sample_doc_uri,
        'textract_raw_output':textract_s3_uri,
        'textract_raw_text':text_s3_uri
    }    
)


# Store document_id and other variables for use in subsequent notebooks

%store document_id
document_id

%store DDB_TABLE_NAME
%store OUTPUT_BUCKET_NAME

Stored 'document_id' (str)
Stored 'DDB_TABLE_NAME' (str)
Stored 'OUTPUT_BUCKET_NAME' (str)
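
Retrieval is the mirror image of the storage step. The sketch below assumes the table's partition key is `document_id` and that the record holds the `textract_raw_output` URI written above. `load_textract_result` is a hypothetical helper, and the S3 reader is passed in as a function so the lookup logic can be exercised without AWS access.

```python
import json


def load_textract_result(table, document_id, read_s3_text):
    """Look up a document record and load its stored Textract response.

    table        -- object with a DynamoDB-style get_item(Key=...) method
    read_s3_text -- callable that takes an S3 URI and returns the file's text
    """
    response = table.get_item(Key={"document_id": document_id})
    item = response.get("Item")
    if item is None:
        # get_item returns no 'Item' entry when the key does not exist
        raise KeyError(f"no record for document_id {document_id!r}")
    return json.loads(read_s3_text(item["textract_raw_output"]))
```

With the objects defined in this notebook, a call might look like `load_textract_result(table, document_id, s3_contents_from_uri)`.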


## 5. (Optional) Participant Exercise
Now try running Textract on some other documents.

For new documents, try the following steps:
* Visualize the file (download it to your development environment with the `aws s3 cp` CLI command, then view it in the notebook or open it in a new tab)
* Call the Textract API
* Inspect the results extracted by Textract, such as text and any forms or tables


In [14]:
# Here are some documents you can try from the source s3 bucket
# displaying first 10 files from the bucket
!aws s3 ls {DATA_S3_PREFIX} | head -n 10

# "Broken pipe" / "BrokenPipeError: [Errno 32] Broken pipe" errors are expected here since we are only grabbing the first few results

2021-11-09 19:46:49     105903 INSR_ACORD-Property-Loss-Notice-12.05.16_1_pii_00000.pdf
2021-11-09 19:46:49     105934 INSR_ACORD-Property-Loss-Notice-12.05.16_1_pii_00001.pdf
2021-11-09 19:46:49     105919 INSR_ACORD-Property-Loss-Notice-12.05.16_1_pii_00002.pdf
2021-11-09 19:46:49     105923 INSR_ACORD-Property-Loss-Notice-12.05.16_1_pii_00003.pdf
2021-11-09 19:46:49     105939 INSR_ACORD-Property-Loss-Notice-12.05.16_1_pii_00004.pdf
2021-11-09 19:46:49     105907 INSR_ACORD-Property-Loss-Notice-12.05.16_1_pii_00005.pdf
2021-11-09 19:46:49     105945 INSR_ACORD-Property-Loss-Notice-12.05.16_1_pii_00006.pdf
2021-11-09 19:46:49     105945 INSR_ACORD-Property-Loss-Notice-12.05.16_1_pii_00007.pdf
2021-11-09 19:46:49     105926 INSR_ACORD-Property-Loss-Notice-12.05.16_1_pii_00008.pdf
2021-11-09 19:46:49     105917 INSR_ACORD-Property-Loss-Notice-12.05.16_1_pii_00009.pdf

[Errno 32] Broken pipe
Exception ignored in: <_io.TextIOWrapper name='<stdout>' mode='w' encoding='UTF-8'>
BrokenPipeEr

In [15]:
# choose a document from the list above
NEW_DOC_NAME = 'INSR_ACORD-Property-Loss-Notice-12.05.16_1_pii_00001.pdf'

new_doc_uri = os.path.join(DATA_S3_PREFIX, NEW_DOC_NAME)

visualize_pdf(os.path.join(DATA_S3_PREFIX, NEW_DOC_NAME), width=600, height=800)

In [19]:
from textractcaller import call_textract, Textract_Features, OutputConfig

# call textract async api in a single request
textract_result = call_textract(
    input_document=new_doc_uri,
    force_async_api=True,
    features=[Textract_Features.TABLES, Textract_Features.FORMS],
)

# Print the entire raw response (comment out the following line to suppress the output)
print(json.dumps(textract_result, indent = 4, sort_keys=True))

{
    "AnalyzeDocumentModelVersion": "1.0",
    "Blocks": [
        {
            "BlockType": "PAGE",
            "Geometry": {
                "BoundingBox": {
                    "Height": 1.0,
                    "Left": 0.0,
                    "Top": 0.0,
                    "Width": 1.0
                },
                "Polygon": [
                    {
                        "X": 1.5849614334573464e-16,
                        "Y": 0.0
                    },
                    {
                        "X": 1.0,
                        "Y": 9.462437987838284e-17
                    },
                    {
                        "X": 1.0,
                        "Y": 1.0
                    },
                    {
                        "X": 0.0,
                        "Y": 1.0
                    }
                ]
            },
            "Id": "f9100815-dd2b-4ba2-bc88-e088e6828163",
            "Page": 1,
            "Relationships": [
                {
          

In [21]:
# parse textract results with Pandas Dataframe or with textract-response-parser

# pandas
blocks = pd.DataFrame(textract_result['Blocks'])

# textract-response-parser
doc = Document(textract_result)


## 6. Summary

Amazon Textract uses machine learning to read and process documents, extracting text, handwriting, tables, and other data without manual data entry. You can automate document processing and act on the extracted information, for example when processing loans or extracting information from invoices and receipts. Textract can extract the data in minutes rather than the hours or days that manual processing can take.

To learn more, see Amazon Textract's [features](https://aws.amazon.com/textract/features), including setting confidence thresholds and adding human review workflows.
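
As a small illustration of a confidence threshold, the sketch below partitions blocks of one type by the `Confidence` score (0 to 100) that Textract attaches to them. The function name, the default threshold of 90, and the sample blocks are assumptions for illustration; an appropriate cut-off depends on the documents and the use case.

```python
def split_by_confidence(blocks, threshold=90.0, block_type="WORD"):
    """Partition blocks of one type into (accepted, needs_review) lists."""
    accepted, needs_review = [], []
    for block in blocks:
        if block.get("BlockType") != block_type:
            continue
        # Blocks without a Confidence value are treated as needing review
        if block.get("Confidence", 0.0) >= threshold:
            accepted.append(block)
        else:
            needs_review.append(block)
    return accepted, needs_review


# Hand-made blocks for illustration
sample_blocks = [
    {"BlockType": "WORD", "Text": "AGENCY", "Confidence": 99.8},
    {"BlockType": "WORD", "Text": "C0DE", "Confidence": 62.5},
    {"BlockType": "PAGE"},
]
accepted, needs_review = split_by_confidence(sample_blocks)
print([b["Text"] for b in accepted], [b["Text"] for b in needs_review])
```

Blocks in the second list are natural candidates to route to a human reviewer.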