# Intelligent Document Processing Classification

Documents contain valuable information and come in many shapes and forms. In most cases, these documents are processed manually, which is time consuming, error prone, and costly. You want this information extracted quickly, and you also want to automate business processes that presently rely on manual input and intervention across various file types and formats.

To help you overcome these challenges, AWS Machine Learning (ML) services give you options for extracting information from complex content in any document format, such as insurance claims, mortgages, healthcare claims, and legal contracts.

The architecture below shows a sample use case that covers the phases of an intelligent document processing workflow: extracting text from documents, training a custom classifier to classify the documents, training a custom named entity recognizer to extract custom entities from the documents, and finally performing document enrichment such as redaction and extraction of other details from the document.

<img src="book-classification.png" />


# Prepare for Document Classification
In this hands-on lab, we walk you through document classification using an Amazon Comprehend custom classifier. We first use Amazon Textract to extract the text from our documents, then label the extracted text, and finally use the labeled data to train the Amazon Comprehend custom classifier.

In this notebook, we will:

- [Step 1: Set up notebook and upload sample documents to Amazon S3](#step1)
- [Step 2: Extract text from sample documents using Amazon Textract](#step2)
- [Step 3: Label the extracted data and prepare a CSV training dataset](#step3)
- [Step 4: Create Amazon Comprehend Classification training job](#step4)



---

# Step 1: Set up notebook and upload sample documents to Amazon S3 <a id="step1"></a>

In this step, we will import some necessary libraries that will be used throughout this notebook. We will then upload all the documents from the `/classification-training-dataset` folder to SageMaker's default bucket.

In [1]:
!pip install textract-trp



In [2]:
import boto3
import botocore
import sagemaker
import time
import os
import os.path
import json
import datetime
import io
import uuid
import pandas as pd
import numpy as np
from pytz import timezone
# PIL's Image is imported below as PImage; the name Image refers to IPython.display.Image
from PIL import ImageDraw, ImageFont
import multiprocessing as mp
from pathlib import Path


# Document
from pprint import pprint
from IPython.display import Image, display, HTML, JSON, IFrame
from PIL import Image as PImage, ImageDraw


# variables
data_bucket = sagemaker.Session().default_bucket()
region = boto3.session.Session().region_name

os.environ["BUCKET"] = data_bucket
os.environ["REGION"] = region
role = sagemaker.get_execution_role()

print(f"SageMaker role is: {role}\nDefault SageMaker Bucket: s3://{data_bucket}")

s3=boto3.client('s3')
textract = boto3.client('textract', region_name=region)
comprehend=boto3.client('comprehend', region_name=region)


SageMaker role is: arn:aws:iam::044573436347:role/text-cm-SagemakerRole-SXXWU3NUWVCX
Default SageMaker Bucket: s3://sagemaker-us-east-1-044573436347


### Upload sample data to S3 bucket

The sample documents are in the `classification-training-dataset` directory. For this workshop, we will be using sample paystubs, bank statements, and receipts.

In [3]:
# Upload images to S3 bucket:
!aws s3 cp classification-training-dataset s3://{data_bucket}/idp/textract --recursive --only-show-errors
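As a rough alternative to the CLI command above, the same upload could be planned in Python. The sketch below only computes the `(local path, S3 key)` pairs; the helper name `plan_uploads` is hypothetical, and skipping hidden files is an assumption of this sketch (the `aws s3 cp` command does not skip them).

```python
from pathlib import Path


def plan_uploads(local_dir, key_prefix):
    """Return (local_path, s3_key) pairs for every non-hidden file under local_dir.

    Hypothetical helper: the key layout mirrors what a recursive copy of
    local_dir into key_prefix would produce.
    """
    root = Path(local_dir)
    prefix = key_prefix.rstrip("/")
    pairs = []
    for file_path in sorted(root.rglob("*")):
        # Assumption: hidden files such as .DS_Store are not wanted in S3
        if file_path.is_file() and not file_path.name.startswith("."):
            relative = file_path.relative_to(root).as_posix()
            pairs.append((str(file_path), f"{prefix}/{relative}"))
    return pairs


# Possible usage in this notebook (requires AWS credentials):
# for local_path, key in plan_uploads("classification-training-dataset", "idp/textract"):
#     s3.upload_file(Filename=local_path, Bucket=data_bucket, Key=key)
```

Keeping the planning step separate from the upload call makes it easy to inspect the keys before anything is written to the bucket.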

# Step 2: Extract text from sample documents using Amazon Textract <a id="step2"></a>

In this section we define local directories, and then use Amazon Textract's `detect_document_text` API to extract the raw text and geometry (bounding box) information for all the documents in S3. The extracted text and geometry information will be written into plaintext files.
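To make the shape of the `detect_document_text` response more concrete, here is a small sketch that pulls the text and bounding box out of every `LINE` block. The `sample_response` dictionary is hand-written for illustration and its values are made up; only the key names follow the Textract response structure.

```python
def extract_lines(response):
    """Return (text, bounding_box) pairs for every LINE block in a Textract response."""
    return [
        (block["Text"], block["Geometry"]["BoundingBox"])
        for block in response.get("Blocks", [])
        if block["BlockType"] == "LINE"
    ]


# Hand-written sample that mimics the response shape; the values are made up.
sample_response = {
    "Blocks": [
        {
            "BlockType": "PAGE",
            "Geometry": {"BoundingBox": {"Width": 1.0, "Height": 1.0, "Left": 0.0, "Top": 0.0}},
        },
        {
            "BlockType": "LINE",
            "Text": "THE AIML STORE",
            "Geometry": {"BoundingBox": {"Width": 0.4, "Height": 0.05, "Left": 0.3, "Top": 0.1}},
        },
        {
            "BlockType": "WORD",
            "Text": "THE",
            "Geometry": {"BoundingBox": {"Width": 0.1, "Height": 0.05, "Left": 0.3, "Top": 0.1}},
        },
    ]
}

for text, box in extract_lines(sample_response):
    print(text, box)
```

Bounding box values are ratios of the page width and height, which is why they fall between 0 and 1.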

In [6]:
word_prefix=os.getcwd()+'/textract_output/LINES/'
box_prefix=os.getcwd()+'/textract_output/BBOX/'
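The extraction step needs a list of the S3 keys to process. One possible way to build such a list is sketched below; the helper name `list_document_keys` and the extension filter are assumptions for illustration, not part of the original notebook.

```python
def list_document_keys(s3_client, bucket, prefix,
                       extensions=(".png", ".jpg", ".jpeg", ".pdf")):
    """Return the S3 keys under prefix whose names end with one of extensions.

    Hypothetical helper: s3_client is expected to be a boto3 S3 client.
    """
    keys = []
    paginator = s3_client.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        # Pages without matching objects carry no "Contents" entry
        for obj in page.get("Contents", []):
            if obj["Key"].lower().endswith(extensions):
                keys.append(obj["Key"])
    return keys


# Possible usage in this notebook (requires AWS credentials):
# images = list_document_keys(s3, data_bucket, "idp/textract/")
```

Using the paginator avoids silently truncating the list when the prefix holds more than one page of objects.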

The following utility function uses Amazon Textract to extract the text of a document and writes the results to the directories defined above.

In [7]:
# Extract the text of one document with Amazon Textract and store the lines
# and their bounding boxes as .txt files
def textract_extract(table, bucket=data_bucket):
    """`table` is the S3 key of the document to process."""
    try:
        response = textract.detect_document_text(
            Document={
                'S3Object': {
                    'Bucket': bucket,
                    'Name': table
                }
            })
        boxes = []
        lines = []
        # Collect the text and bounding box of every detected LINE block
        for item in response["Blocks"]:
            if item["BlockType"] == "LINE":
                boxes.append(item['Geometry']['BoundingBox'])
                lines.append(item["Text"])

        print(word_prefix)
        print(os.path.dirname(table))
        # Mirror the directory structure of the S3 key under the local output folders
        Path(word_prefix+os.path.dirname(table)).mkdir(parents=True, exist_ok=True)
        Path(box_prefix+os.path.dirname(table)).mkdir(parents=True, exist_ok=True)
        with open(word_prefix+table+'.txt', 'w', encoding="utf-8") as f:
            for line in lines:
                f.write(line+'\n')
        with open(box_prefix+table+'.txt', 'w', encoding="utf-8") as p:
            for box in boxes:
                p.write(str(box)+'\n')
    except Exception as e:
        # Report the failure and continue with the remaining documents
        print(e)

Call the Textract function defined above

In [8]:
# `images` is expected to hold the S3 keys of the uploaded documents
pool = mp.Pool(mp.cpu_count())
pool.map(textract_extract, images)
pool.close()
pool.join()

/home/ec2-user/SageMaker/textract_output/LINES/
idp/textract/invoice-dataset
/home/ec2-user/SageMaker/textract_output/LINES/
idp/textract/invoice-dataset
/home/ec2-user/SageMaker/textract_output/LINES/
idp/textract/invoice-dataset
/home/ec2-user/SageMaker/textract_output/LINES/
idp/textract/invoice-dataset
/home/ec2-user/SageMaker/textract_output/LINES/
idp/textract/invoice-dataset
/home/ec2-user/SageMaker/textract_output/LINES/
idp/textract/invoice-dataset
/home/ec2-user/SageMaker/textract_output/LINES/
idp/textract/invoice-dataset
/home/ec2-user/SageMaker/textract_output/LINES/
idp/textract/invoice-dataset
/home/ec2-user/SageMaker/textract_output/LINES/
idp/textract/invoice-dataset
/home/ec2-user/SageMaker/textract_output/LINES/
idp/textract/invoice-dataset
/home/ec2-user/SageMaker/textract_output/LINES/
idp/textract/invoice-dataset
/home/ec2-user/SageMaker/textract_output/LINES/
idp/textract/invoice-dataset
/home/ec2-user/SageMaker/textract_output/LINES/
idp/textract/invoice-dataset
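Because each Textract call mostly waits on the network, a thread pool is a possible alternative to a process pool. The sketch below is a generic helper, not what the notebook uses; `run_in_threads` and its default `max_workers` are assumptions.

```python
from concurrent.futures import ThreadPoolExecutor


def run_in_threads(worker, items, max_workers=8):
    """Apply worker to every item using a thread pool and return the results in order."""
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        # executor.map yields results in the same order as the inputs
        return list(executor.map(worker, items))


# Possible usage in this notebook (requires AWS credentials):
# run_in_threads(textract_extract, images)
```

Keep `max_workers` modest, since the Textract API throttles requests beyond the account's rate limits.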

# Step 3: Label the extracted data and prepare training dataset <a id="step3"></a>

Now that we have text extracted from our documents, we will pre-process this data in order to train an [Amazon Comprehend custom classification model](https://docs.aws.amazon.com/comprehend/latest/dg/how-document-classification.html). Before we can train the custom classification model, we need to label the data appropriately. For example, the invoice text should be labeled as "invoice", the receipt text as "receipt", and so on. This needs to be done for every document text extracted by Textract.



In [9]:
# Loop through the directory and build a dict that maps each label
# (subdirectory name) to the Textract output files it contains
def data_path(path):

    def listdir_nohidden(path):
        for f in os.listdir(path):
            if not f.startswith('.'):
                yield f

    mapping={}
    # Each subdirectory of `path` is treated as one class
    for i in listdir_nohidden(path):
        if os.path.isdir(path+i):
            mapping[i] = sorted(listdir_nohidden(path+i))
    # label or class or target list
    label_compre = []
    # text file data list
    text_compre = []

    # unpacking and iterating through dictionary
    for i, j in mapping.items():
        # iterating through list of files for each class
        for k in j:
            # appending labels/class/target
            label_compre.append(i)
            # reading the file and appending its text, flattened to one line
            with open(path+i+"/"+k, encoding="utf-8") as f:
                text_compre.append(f.read().replace('\n',' '))
    return label_compre, text_compre

We can now call the function to label the data.

In [10]:
label_compre, text_compre=[],[]

# Label every extracted text file under the Textract output directory
path=word_prefix+'idp/textract/'
label_compre_train, text_compre_train=data_path(path)
label_compre.append(label_compre_train)
text_compre.append(text_compre_train)

# Flatten the nested lists into one list of labels and one list of documents
if isinstance(label_compre[0], list):
    label_compre=[item for sublist in label_compre for item in sublist]
    text_compre=[item for sublist in text_compre for item in sublist]

# Build the training DataFrame: one row per document
data_compre= pd.DataFrame()
data_compre["label"] =label_compre
data_compre["document"] = text_compre
data_compre

| | label | document |
|---|---|---|
| 0 | receipt-training | THE AIML StORE 1234 SOMEWHERE RD POWAY, CALIFO... |
| 1 | receipt-training | THE AIML StORE 1234 SOMEWHERE RD POWAY, CALIFO... |
| 2 | receipt-training | THE AIML StORE 1234 SOMEWHERE RD POWAY, CALIFO... |
| 3 | receipt-training | THE AIML StORE 1234 SOMEWHERE RD POWAY, CALIFO... |
| 4 | receipt-training | THE AIML StORE 1234 SOMEWHERE RD POWAY, CALIFO... |
| 5 | receipt-training | THE AIML StORE 1234 SOMEWHERE RD POWAY, CALIFO... |
| 6 | receipt-training | THE AIML StORE 1234 SOMEWHERE RD POWAY, CALIFO... |
| 7 | receipt-training | THE AIML StORE 1234 SOMEWHERE RD POWAY, CALIFO... |
| 8 | receipt-training | THE AIML StORE 1234 SOMEWHERE RD POWAY, CALIFO... |
| 9 | receipt-training | THE AIML StORE 1234 SOMEWHERE RD POWAY, CALIFO... |


# Step 4: Create Amazon Comprehend Classification training job <a id="step4"></a>

Once we have a labeled dataset ready, we will create and train an [Amazon Comprehend custom classification model](https://docs.aws.amazon.com/comprehend/latest/dg/how-document-classification.html) with the dataset.

In [11]:
# Upload Comprehend training data to S3

key='idp/comprehend/comprehend_train_data.csv'

data_compre.to_csv("comprehend_train_data.csv", index=False, header=False)
s3.upload_file(Filename='comprehend_train_data.csv', 
               Bucket=data_bucket, 
               Key=key)
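For multi-class training, Amazon Comprehend reads a CSV file with the label in the first column and the document text in the second, without a header row, which is why the cell above writes the file with `header=False`. The sketch below shows that layout using only the standard library; the function name `to_comprehend_csv` and the sample rows are made up for illustration.

```python
import csv
import io


def to_comprehend_csv(labels, documents):
    """Render (label, document) pairs as header-less CSV text, one document per row."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    for label, text in zip(labels, documents):
        # Collapse newlines and repeated whitespace so each document stays on one row
        writer.writerow([label, " ".join(text.split())])
    return buffer.getvalue()


# Made-up sample rows for illustration
print(to_comprehend_csv(
    ["receipt", "invoice"],
    ["THE AIML STORE\n1234 RD, POWAY", "Invoice #1"],
))
```

Fields that contain commas are quoted automatically by the `csv` module, so the two-column layout is preserved.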

### Create Amazon Comprehend custom classification Training Job

We will use Amazon Comprehend custom classification to train our own model for classifying the documents. The Amazon Comprehend `CreateDocumentClassifier` API creates a classifier and trains a custom model on the labeled CSV file we created above. The training data consists of the text extracted by Amazon Textract together with the label assigned to each document.

In [12]:
f's3://{data_bucket}/{key}'

's3://sagemaker-us-east-1-044573436347/idp/comprehend/comprehend_train_data.csv'

In [13]:
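# Create a document classifier
from botocore.exceptions import ClientError

account_id = boto3.client('sts').get_caller_identity().get('Account')

document_classifier_name = 'Doc-Classifier-IDP'
document_classifier_version = 'v1'
document_classifier_arn = ''
response = None

try:
    create_response = comprehend.create_document_classifier(
        InputDataConfig={
            'DataFormat': 'COMPREHEND_CSV',
            'S3Uri': f's3://{data_bucket}/{key}'
        },
        DataAccessRoleArn=role,
        DocumentClassifierName=document_classifier_name,
        VersionName=document_classifier_version,
        LanguageCode='en',
        Mode='MULTI_CLASS'
    )

    document_classifier_arn = create_response['DocumentClassifierArn']

    print(f"Comprehend Custom Classifier created with ARN: {document_classifier_arn}")
except ClientError as error:
    if error.response['Error']['Code'] == 'ResourceInUseException':
        # A classifier version with this name already exists; rebuild its ARN
        document_classifier_arn = f'arn:aws:comprehend:{region}:{account_id}:document-classifier/{document_classifier_name}/version/{document_classifier_version}'
        print(f"Comprehend Custom Classifier already exists with ARN: {document_classifier_arn}")
    else:
        print(error)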

Comprehend Custom Classifier created with ARN: arn:aws:comprehend:us-east-1:044573436347:document-classifier/Sample-Doc-Classifier-IDP/version/Sample-Doc-Classifier-IDP-v1


This job can take about 30 minutes to complete. Once the training job is completed, move on to the next step.
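Rather than checking the console, the training status can be polled with the `describe_document_classifier` API. The following is a minimal sketch; the helper name `wait_for_classifier`, the polling interval, and the set of statuses treated as final are assumptions of this sketch.

```python
import time

# Statuses after which this sketch stops polling
FINAL_STATUSES = {"TRAINED", "TRAINED_WITH_WARNING", "IN_ERROR", "STOPPED"}


def wait_for_classifier(comprehend_client, classifier_arn, poll_seconds=60, sleep=time.sleep):
    """Poll the classifier until it reaches a final status and return that status.

    Hypothetical helper: comprehend_client is expected to be a boto3 Comprehend client.
    """
    while True:
        description = comprehend_client.describe_document_classifier(
            DocumentClassifierArn=classifier_arn
        )
        status = description["DocumentClassifierProperties"]["Status"]
        print(f"Classifier status: {status}")
        if status in FINAL_STATUSES:
            return status
        sleep(poll_seconds)


# Possible usage in this notebook (requires AWS credentials):
# wait_for_classifier(comprehend, document_classifier_arn)
```

Only a return value of `TRAINED` (or `TRAINED_WITH_WARNING`) means the classifier is ready for use; `IN_ERROR` and `STOPPED` call for inspecting the job in the console.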