<a href="https://colab.research.google.com/github/AbsolutUnit/Textract-Pipeline/blob/main/Textract_Conversion.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Preliminary Pruning and Organization
This section sets up the format of the data and how it will be processed before running AWS Textract.

We'll install and import boto3, which gives us access to AWS APIs and tools.

In [1]:
!pip install boto3
!pip install awscli

Collecting boto3
  Downloading https://files.pythonhosted.org/packages/5a/fd/d814f9cbefebbea88977628d11b860b5d564ba6f16f64c378e2da2a36405/boto3-1.17.112-py2.py3-none-any.whl (131kB)
Collecting botocore<1.21.0,>=1.20.112
  Downloading https://files.pythonhosted.org/packages/c7/ea/11c3beca131920f552602b98d7ba9fc5b46bee6a59cbd48a95a85cbb8f41/botocore-1.20.112-py2.py3-none-any.whl (7.7MB)
Collecting s3transfer<0.5.0,>=0.4.0
  Downloading https://files.pythonhosted.org/packages/63/d0/693477c688348654ddc21dcdce0817653a294aa43f41771084c25e7ff9c7/s3transfer-0.4.2-py2.py3-none-any.whl (79kB)
Collecting jmespath<1.0.0,>=0.7.1
  Downloading https://files.pythonhosted.org/packages/07/cb/5f001272b6faeb23c1c9e0acc04d48eaaf5c862c17709d20e3469c6e0139/jmespath-0.10.0-py2.py3-none-any.whl
Collect

In [2]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [33]:
import os
import time
import json
import boto3
import hashlib
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from botocore import exceptions as ex

In [4]:
# A `!export` would run in a throwaway subshell and never reach this kernel,
# so the variable is set through os.environ instead.
path = "/content/drive/My Drive/config/awscli.ini"
os.environ['AWS_SHARED_CREDENTIALS_FILE'] = path
print(os.environ['AWS_SHARED_CREDENTIALS_FILE'])

/content/drive/My Drive/config/awscli.ini


In [5]:
!aws s3 ls s3:// --recursive --human-readable --summarize

2021-07-14 19:36:35 bucketofpdfs
2021-07-14 20:37:35 outputjsonroseai

Total Objects: 0
   Total Size: 0 Bytes


Here we initialize the global variables: the names of the input and output buckets, and a list of Textract job IDs to keep track of during execution.

In [6]:
input_bucket_name = "bucketofpdfs"          # Bucket created earlier under anbarua@cs.stonybrook.edu
json_bucket_name = "outputjsonroseai"
jobs = []

Skeleton code for a function that uploads a file to the input bucket. This bucket currently acts as a repository of PDF files, with each object stored under its file name.

**To do:** Filter out or otherwise handle invalid file types, corrupted files, and files too large to be processed.

In [27]:
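from boto3.exceptions import S3UploadFailedError

def pdf_up(doc_name):
  s3 = boto3.client('s3')
  print(doc_name)
  try:
    # The local file name doubles as the S3 object key.
    s3.upload_file(doc_name, input_bucket_name, doc_name)
  except (ex.ClientError, S3UploadFailedError, OSError) as err:
    # upload_file reports failed transfers as S3UploadFailedError;
    # OSError covers a missing or unreadable local file.
    print("Error: ", err)
    return False
  return True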
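
As a possible starting point for the to-do above, here is a minimal sketch of a pre-upload check. The helper name `is_uploadable_pdf` and the `MAX_PDF_BYTES` cap are assumptions made for illustration; the cap in particular should be checked against the current Textract limits. It only inspects the extension, the size, and the `%PDF-` header, so it will not catch every corrupted file.

```python
import os

# Assumed size cap for illustration; verify against the current Textract quotas.
MAX_PDF_BYTES = 500 * 1024 * 1024

def is_uploadable_pdf(path, max_bytes=MAX_PDF_BYTES):
  """Return True if path looks like a non-empty PDF no larger than max_bytes."""
  if not path.lower().endswith(".pdf"):
    return False                      # wrong file type
  if not os.path.isfile(path):
    return False                      # missing file or a directory
  size = os.path.getsize(path)
  if size == 0 or size > max_bytes:
    return False                      # empty or too large
  with open(path, "rb") as handle:
    return handle.read(5) == b"%PDF-" # PDF files start with this header
```

A check like this could be called at the top of `pdf_up`, returning `False` early for files that fail it.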

This method converts the Textract output to JSON files and adds them to the second bucket. It is called by the pipeline method as the last step of the data conversion.

The first parameter is assumed to be a list with one entry per job, where each entry is the list of responses returned by get_document_text_detection for that job. The second parameter is the matching list of PDF file names.

In [63]:
def out_to_json(j, filenames):
  out_files = []
  print(len(j))
  for i, responses in enumerate(j):     # one list of Textract responses per file
    st = []
    for response in responses:
      for block in response["Blocks"]:
        if block["BlockType"] == "LINE":  # keep only the line-level text
          st.append(block["Text"])
    # Swap the .pdf extension for .json, e.g. A.pdf -> A.json
    out_name = os.path.splitext(str(filenames[i]))[0] + ".json"
    with open(out_name, 'w') as outfile:
      json.dump(st, outfile)
    out_files.append(out_name)
    if not upload_json(out_name):
      print("Error with uploading the JSON file to the second bucket, filename " + out_name)
  return out_files
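
To make the block filtering concrete, here is a small sketch run on hand-made data. The `sample` responses are invented for illustration and contain only the fields the filter reads; real Textract responses carry many more (geometry, IDs, relationships). The helper name `lines_from_responses` is also an assumption.

```python
def lines_from_responses(responses):
  """Collect the text of every LINE block across a list of response dicts."""
  return [
    block["Text"]
    for response in responses
    for block in response.get("Blocks", [])
    if block.get("BlockType") == "LINE"
  ]

# Hand-made stand-in for two pages of Textract output (illustrative only).
sample = [
  {"Blocks": [
    {"BlockType": "PAGE"},
    {"BlockType": "LINE", "Text": "Hello"},
    {"BlockType": "WORD", "Text": "Hello"},
  ]},
  {"Blocks": [{"BlockType": "LINE", "Text": "World"}]},
]

print(lines_from_responses(sample))
```

PAGE and WORD blocks are skipped, so each line of text appears once rather than being repeated word by word.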

In [10]:
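from boto3.exceptions import S3UploadFailedError

def upload_json(doc_name):
  s3 = boto3.client('s3')
  try:
    # The local file name doubles as the S3 object key.
    s3.upload_file(doc_name, json_bucket_name, doc_name)
  except (ex.ClientError, S3UploadFailedError, OSError) as err:
    # upload_file reports failed transfers as S3UploadFailedError;
    # OSError covers a missing or unreadable local file.
    print("Error: ", err)
    return False
  return True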

This method, parse_file, starts Textract document text detection on a specific file within the bucket. The job runs asynchronously on AWS, so the call returns right away with a job ID that is used later to fetch the results.

In [31]:
def parse_file(bucket_name, doc_name):
  client = boto3.client('textract')
  retval = client.start_document_text_detection(
      DocumentLocation = {'S3Object' : {'Bucket': bucket_name, 'Name' : doc_name}})
  return retval["JobId"]

Method for figuring out whether or not a job has been completed. It polls the Textract client to see if Textract has finished that specific job.

This polling approach to asynchronous Textract seems clunky, so I'll have to polish it later and look into other notification methods for a completed job.

Could use SNS notifications, possibly with AWS Lambda, instead.
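
As a rough sketch of the notification idea, `start_document_text_detection` accepts a `NotificationChannel` with an SNS topic ARN and an IAM role ARN, and Textract publishes the completion status to that topic. The helper name `build_detection_request` and the ARNs shown are placeholders, not resources that exist in this project.

```python
def build_detection_request(bucket_name, doc_name, sns_topic_arn=None, role_arn=None):
  """Build keyword arguments for start_document_text_detection."""
  request = {
    "DocumentLocation": {"S3Object": {"Bucket": bucket_name, "Name": doc_name}},
  }
  # Only ask for notifications when both ARNs are supplied.
  if sns_topic_arn and role_arn:
    request["NotificationChannel"] = {
      "SNSTopicArn": sns_topic_arn,
      "RoleArn": role_arn,
    }
  return request

# Placeholder ARNs for illustration only.
request = build_detection_request(
  "bucketofpdfs",
  "new_pdf.pdf",
  sns_topic_arn="arn:aws:sns:us-east-1:123456789012:example-topic",
  role_arn="arn:aws:iam::123456789012:role/example-textract-role",
)
# With real resources: boto3.client('textract').start_document_text_detection(**request)
```

Keeping the request construction separate from the API call makes it easy to inspect without touching AWS.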

In [61]:
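def job_check(job_id):
  client = boto3.client('textract')
  status = "IN_PROGRESS"
  while status == "IN_PROGRESS":
    time.sleep(5)     # pause between polls instead of calling the API in a tight loop
    status = client.get_document_text_detection(JobId = job_id)["JobStatus"]
  # Any other final status (e.g. FAILED) is reported as False
  return status == "SUCCEEDED"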
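
Until a notification-based approach is in place, the polling loop could be bounded by a timeout. The following is a minimal sketch, not the notebook's method: `wait_for_job` and the `"TIMED_OUT"` label are invented here, and the status lookup is passed in as a callable so the loop can be tried without AWS access.

```python
import time

def wait_for_job(get_status, interval=5.0, timeout=600.0,
                 sleep=time.sleep, clock=time.monotonic):
  """Poll get_status() until it stops returning IN_PROGRESS or the timeout passes."""
  deadline = clock() + timeout
  while True:
    status = get_status()
    if status != "IN_PROGRESS":
      return status             # e.g. SUCCEEDED or FAILED
    if clock() >= deadline:
      return "TIMED_OUT"        # hypothetical label, not a Textract status
    sleep(interval)

# Simulated job that finishes on the third poll (no real waiting).
statuses = iter(["IN_PROGRESS", "IN_PROGRESS", "SUCCEEDED"])
result = wait_for_job(lambda: next(statuses), sleep=lambda seconds: None)
print(result)
```

With a real client, `get_status` could be `lambda: client.get_document_text_detection(JobId=job_id)["JobStatus"]`.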

This collects the full output of a completed Textract job, following the NextToken pagination so that every page of results ends up in one list that is simpler to pass around and work with.

In [13]:
def parser(job_id):
  values = []
  client = boto3.client('textract')
  elements = client.get_document_text_detection(JobId = job_id)
  values.append(elements)
  is_next = elements.get('NextToken')   # absent on the last page of results
  while is_next:
    elements = client.get_document_text_detection(JobId = job_id, NextToken = is_next)
    values.append(elements)
    is_next = elements.get('NextToken')
  return values
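
The token-following loop can be tried without a Textract job by swapping the client call for a stub. In this sketch `collect_pages` and the `fake_pages` dictionary are made up for illustration; only the `NextToken` key mirrors what the real responses use.

```python
def collect_pages(fetch_page):
  """Call fetch_page(token) repeatedly, following NextToken until it is absent."""
  pages = []
  token = None                  # None requests the first page
  while True:
    page = fetch_page(token)
    pages.append(page)
    token = page.get("NextToken")
    if not token:
      return pages

# Stub standing in for the Textract client: three pages chained by tokens.
fake_pages = {
  None: {"Blocks": ["a"], "NextToken": "t1"},
  "t1": {"Blocks": ["b"], "NextToken": "t2"},
  "t2": {"Blocks": ["c"]},
}

pages = collect_pages(lambda token: fake_pages[token])
print(len(pages))
```

The stub returns three pages, and the loop stops at the first response that has no `NextToken`.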

Here's the main conversion pipeline. It uploads the files, starts a Textract job for each one, waits for the jobs to complete, and then uploads the corresponding JSON files to the output bucket.

In [44]:
def conversion_pipe(args):
  global jobs
  jobs = []                         # start fresh so job IDs line up with this run's files
  jsons = []
  converted = []
  files = args.split(" ")           # args is a space-separated string of PDF file names
  for each in files:
    if pdf_up(each):
      jobs.append(parse_file(input_bucket_name, each))
    else:
      print("Upload error: PDF " + each + " failed to upload.\n")
      return False
  print("All files successfully uploaded.\n")
  for name, j in zip(files, jobs):
    if job_check(j):
      jsons.append(parser(j))
      converted.append(name)
    else:
      print("Job ID " + j + " did not succeed; skipping " + name + ".")
  if jsons:
    print(jsons[0][0]["Blocks"])    # debug: blocks from the first response of the first job
  # out_to_json writes each JSON file and uploads it to the output bucket
  return out_to_json(jsons, converted)

Skeleton code for main, allowing the entire .py file to be run easily from the CLI. One idea is to have flags such as -u for upload and -c for conversion. The bucket handling is all within the file, so an example execution could be:

python3 Textract-Conversion -u A.pdf B.pdf C.pdf

python3 Textract-Conversion -c A B C

or even the following

python3 Textract-Conversion -u A.pdf B.pdf C.pdf -c
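
The flag layout above could be sketched with `argparse` as follows. The long option names `--upload` and `--convert`, and the choice to let `-c` take zero or more names, are assumptions made to match the three example invocations; nothing here is wired to the pipeline yet.

```python
import argparse

def build_parser():
  """Build a parser for the hypothetical -u / -c command-line flags."""
  parser = argparse.ArgumentParser(prog="Textract-Conversion")
  parser.add_argument("-u", "--upload", nargs="+", default=[], metavar="PDF",
                      help="PDF files to upload to the input bucket")
  # nargs="*" lets -c appear alone, meaning "convert what was just uploaded".
  parser.add_argument("-c", "--convert", nargs="*", default=None, metavar="NAME",
                      help="documents to convert; with no names, convert the uploads")
  return parser

# Parse the third example invocation from an explicit argument list.
args = build_parser().parse_args(["-u", "A.pdf", "B.pdf", "C.pdf", "-c"])
print(args.upload, args.convert)
```

Here `args.convert` is `None` when `-c` is missing and an empty list when `-c` is given without names, which lets main tell the two cases apart.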

In [16]:
%cd "/content/drive/My Drive"

/content/drive/My Drive


In [66]:
# def main():
  
#   return False

# if __name__ == "__main__":
#   main()
filename = "new_pdf.pdf"
! pwd
output = conversion_pipe(filename)

/content/drive/My Drive
new_pdf.pdf
All files successfully uploaded.

[{'BlockType': 'PAGE', 'Geometry': {'BoundingBox': {'Width': 0.9997642636299133, 'Height': 1.0, 'Left': 0.0, 'Top': 0.0}, 'Polygon': [{'X': 0.0, 'Y': 0.0}, {'X': 0.9997642636299133, 'Y': 8.65825299631423e-17}, {'X': 0.9997642636299133, 'Y': 1.0}, {'X': 0.0, 'Y': 1.0}]}, 'Id': 'd3c20ed9-9c13-45bd-a39c-a6347f91a950', 'Relationships': [{'Type': 'CHILD', 'Ids': ['d94127ad-6d3c-4fb4-8b24-a1a2a69c49f7', 'c58931c2-a9a2-4de3-9f0a-fc774e313d64', 'f6ab5495-66fe-4226-b70a-b58e2a05f44c', '3c6e8334-71b3-4ed3-a039-304ea25d5ea4', '179128e4-dcaa-4ea9-9729-18037793cd66', '6943fe75-ba1d-4c17-8aaa-f0ae06a5e451', '398040ae-a8fc-47f2-b163-c31dc26b0441', '3901ce38-965d-482c-a631-49db4cfa8a07', '4fcd92c2-2fec-48ac-8d17-2178b1599f8d', '47b119d7-2db0-462a-9731-ff939f4b3f84', '04d6be47-bfcc-490f-822d-4eeb6599c3be', 'd5645448-f8c2-463d-ba6b-0e0d79ddb99a', '5791a395-4815-48b8-be05-bf613e749b0b', '65bbd720-ff92-496b-b088-3e651bada2ec', 'c5bc2c2e