# Deriving insights from handwritten content using Amazon Textract and Amazon QuickSight

This notebook is an accompanying utility for `Chapter 17 - Deriving insights from handwritten content` from the Packt book **Natural Language Processing with AWS AI Services**. Please read the chapter and its instructions before running this notebook.

In [None]:
# STEP 0 - CELL 1
import boto3
import json
import csv
import os

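infile = 'qsmani-raw.json'         # raw QuickSight manifest with placeholder bucket and prefix
outfile = 'qsmani-formatted.json'  # formatted manifest written in STEP 1
bucket = '<enter-S3-bucket-name>'  # replace with the name of your S3 bucket
prefix = 'chapter17' # change this prefix if you like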

### Update QuickSight Manifest
We will replace the S3 bucket and prefix placeholders in the raw manifest file with the values you entered in STEP 0 - CELL 1 above, and write the result to a new formatted manifest file. [Amazon QuickSight](https://aws.amazon.com/quicksight/) will use this manifest to create a dataset from the content we extract from the handwritten documents.

In [None]:
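# STEP 1 - CELL 1
import json
# Load the raw manifest; the context manager closes the file when done
with open(infile, 'r', encoding='utf-8') as manifest:
    ln = json.load(manifest)
# Swap the literal 'bucket' and 'prefix' placeholders in the S3 URI prefixes
t = json.dumps(ln['fileLocations'][0]['URIPrefixes'])
t = t.replace('bucket', bucket).replace('prefix', prefix)
ln['fileLocations'][0]['URIPrefixes'] = json.loads(t)
# Write the formatted manifest that QuickSight will read
with open(outfile, 'w', encoding='utf-8') as out:
    json.dump(ln, out, ensure_ascii=False, indent=4)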
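
To make the placeholder substitution concrete, here is a minimal sketch that applies the same idea to an in-memory manifest. The manifest contents, the `fill_manifest` helper, and the sample bucket name are assumptions for illustration only; the actual contents of `qsmani-raw.json` may differ.

```python
import copy
import json

# Hypothetical raw manifest; the real qsmani-raw.json may contain different settings.
raw_manifest = {
    "fileLocations": [{"URIPrefixes": ["s3://bucket/prefix/dashboard/"]}],
    "globalUploadSettings": {"format": "CSV", "delimiter": ",", "containsHeader": "true"},
}

def fill_manifest(manifest, bucket, prefix):
    """Return a copy of manifest with the 'bucket' and 'prefix' placeholders filled in."""
    filled = copy.deepcopy(manifest)  # leave the input dictionary untouched
    uris = filled["fileLocations"][0]["URIPrefixes"]
    filled["fileLocations"][0]["URIPrefixes"] = [
        uri.replace("bucket", bucket).replace("prefix", prefix) for uri in uris
    ]
    return filled

# "my-example-bucket" is a made-up bucket name
formatted = fill_manifest(raw_manifest, "my-example-bucket", "chapter17")
print(json.dumps(formatted, indent=4))
```

Because this is plain string replacement, a bucket name that itself contains the word `prefix` would also be altered by the second `replace` call, so it is worth inspecting the formatted manifest before uploading it.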

In [None]:
# STEP 1 - CELL 2
s3 = boto3.client('s3')
s3.upload_file(outfile,bucket,prefix+'/'+outfile)
print("Manifest file uploaded to: s3://{}/{}".format(bucket,prefix+'/'+outfile))

### Extract handwritten content using Textract
In this section, we will install the [Amazon Textract Response Parser](https://github.com/aws-samples/amazon-textract-response-parser/blob/master/src-python/README.md), use the [Amazon Textract boto3 client](https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/textract.html) to extract tables and form data from our handwritten images, write the table contents to CSV files, and upload those files to your [Amazon S3 bucket](https://aws.amazon.com/s3/).

In [None]:
# STEP 2 - CELL 1
!python -m pip install amazon-textract-response-parser

In [None]:
# STEP 2 - CELL 2
from trp import Document
textract = boto3.client('textract')

In [None]:
# STEP 2 - CELL 3
for docs in os.listdir('.'):
    if not docs.endswith('.jpg'):
        continue
    # Read the image bytes to send to Textract
    with open(docs, 'rb') as img:
        img_bytes = img.read()
    response = textract.analyze_document(Document={'Bytes': img_bytes}, FeatureTypes=['TABLES','FORMS'])
    print('Extracted text from', docs)
    text = Document(response)
    # Collect the rows of every table detected in the document
    rows = []
    for page in text.pages:
        for table in page.tables:
            for row in table.rows:
                # Keep non-empty cells only, dropping '$' signs and trailing whitespace
                rows.append([cell.text.replace('$','').rstrip() for cell in row.cells if cell.text])
    if not rows:
        # Nothing to write or upload for this document
        print('No tables found in', docs, '- skipping upload')
        continue
    # Swap only the file extension, not every 'jpg' in the name
    csvout = os.path.splitext(docs)[0] + '.csv'
    # newline='' lets the csv module control line endings
    with open(csvout, 'w', newline='') as csvf:
        csv.writer(csvf, delimiter=',').writerows(rows)
    s3.upload_file(csvout,bucket,prefix+'/dashboard/'+csvout)
    print("CSV file for document {} uploaded to: s3://{}/{}".format(docs,bucket,prefix+'/dashboard/'+csvout))
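
The row-cleaning logic in STEP 2 - CELL 3 can be tried out without calling Textract. The sketch below uses `SimpleNamespace` stand-ins for the parser's table, row, and cell objects, exposing only the `rows`, `cells`, and `text` attributes that the cell above reads. The `table_to_rows` and `make_table` helpers and the sample values are hypothetical and are not part of the chapter's code.

```python
import csv
import io
from types import SimpleNamespace

def table_to_rows(table):
    """Convert a table-like object into lists of cleaned cell strings."""
    rows = []
    for row in table.rows:
        # Skip empty cells; strip '$' signs and trailing whitespace from the rest
        rows.append([cell.text.replace('$', '').rstrip() for cell in row.cells if cell.text])
    return rows

def make_table(values):
    """Build a stand-in table from nested lists of cell text (illustration only)."""
    return SimpleNamespace(
        rows=[SimpleNamespace(cells=[SimpleNamespace(text=v) for v in r]) for r in values]
    )

# Made-up sample data, not taken from the book's images
sample = make_table([["Item ", "Price "], ["Coffee ", "$3.50 "], ["Tea ", ""]])
rows = table_to_rows(sample)

# Write to an in-memory buffer instead of a file on disk
buffer = io.StringIO()
csv.writer(buffer, delimiter=',').writerows(rows)
print(buffer.getvalue())
```

Skipping empty cells means that a row with a blank cell ends up with fewer columns and later values shift left, as the `Tea` row shows. It may be worth checking the generated CSV files for this before building the QuickSight dataset.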

### Conclusion
That concludes the steps for the notebook. Please continue to follow the instructions from Chapter 17 in the book to understand how you can visualize and generate insights from your handwritten content using **[Amazon QuickSight](https://aws.amazon.com/quicksight/)**.