# 1. Modules

In [1]:
!pip install textract-trp

Collecting textract-trp
  Downloading textract_trp-0.1.3-py3-none-any.whl (5.8 kB)
Installing collected packages: textract-trp
Successfully installed textract-trp-0.1.3


In [2]:
#import awscli
import boto3
import trp
from trp import Document
import os

# 2. Configure AWS creds - IAM, S3, Textract

In [3]:
#Step to configure S3 details

#Bucket Name
s3BucketName = "demo-sanjay-1"

#Objects in the bucket
invoiceDoc = "page1.jpg"
hclsDoc = "hcls.pdf"
onlineTextDoc = "online.png"
omrDoc = "omr.jpg"

In [4]:
#Step to configure IAM Access key and IAM Secret Access Key
#Note: Use Environment Variables !

ACCESS_ID = os.getenv("aws_textract_id")
SECRET_ID = os.getenv("aws_textract_secret_id")

In [5]:
# Amazon Textract client
textract = boto3.client(
    "textract",
    aws_access_key_id=ACCESS_ID,
    aws_secret_access_key=SECRET_ID,
    region_name="ap-south-1",
)
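One thing to watch: `os.getenv` returns `None` when a variable is unset, and the client is still created without complaint. Below is a minimal sketch of a fail-early check. The helper name `load_textract_credentials` is hypothetical (it is not part of boto3); the environment variable names are the ones used above.

```python
import os


def load_textract_credentials(env=os.environ):
    """Return boto3-style credential kwargs, failing early if a variable is unset."""
    # Maps the boto3.client keyword argument to the environment variable used above.
    names = {
        "aws_access_key_id": "aws_textract_id",
        "aws_secret_access_key": "aws_textract_secret_id",
    }
    missing = [var for var in names.values() if not env.get(var)]
    if missing:
        raise RuntimeError("Missing environment variables: " + ", ".join(missing))
    return {kwarg: env[var] for kwarg, var in names.items()}


# Intended use (not executed here, needs boto3 and real credentials):
# textract = boto3.client("textract", region_name="ap-south-1",
#                         **load_textract_credentials())
```

The check only confirms that the variables are present; whether the keys are valid is still decided by AWS on the first API call.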

# 3. AWS Textract Codes

In [6]:
#Analyzing the detected text line by line (with word-level confidence)
def lines(doc):
    for page in doc.pages:
        print("PAGE\n====================")
        for line in page.lines:
            print("Line: {}--{}".format(line.text, line.confidence))
            for word in line.words:
                print("Word: {}--{}".format(word.text, word.confidence))
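The confidence scores printed by `lines` are most useful for spotting words that need a human check. Here is a minimal sketch of such a filter. The function name `low_confidence_words` and the threshold of 90 are assumptions for illustration, and the stand-in objects only imitate the `pages` / `lines` / `words` attributes used above so the example runs without calling Textract.

```python
from types import SimpleNamespace


def low_confidence_words(doc, threshold=90.0):
    """Collect (text, confidence) for every word scored below `threshold`."""
    flagged = []
    for page in doc.pages:
        for line in page.lines:
            for word in line.words:
                if word.confidence < threshold:
                    flagged.append((word.text, word.confidence))
    return flagged


def make_word(text, confidence):
    # Stand-in for a parsed word; only the two attributes used above.
    return SimpleNamespace(text=text, confidence=confidence)


# Hypothetical document with confidences rounded from a typical invoice scan.
demo_doc = SimpleNamespace(pages=[SimpleNamespace(lines=[
    SimpleNamespace(words=[make_word("INTEGRATEP", 25.85), make_word("SUPPLIER", 97.99)]),
])])

print(low_confidence_words(demo_doc))  # [('INTEGRATEP', 25.85)]
```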

In [7]:
#Analyzing the detected form fields (key-value pairs)
def forms(doc):
    for page in doc.pages:
        # Print fields
        print("Fields:")
        for field in page.form.fields:
            print("Key: {}, Value: {}".format(field.key, field.value))
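Printing is fine for a demo, but downstream code usually wants the fields as data. A minimal sketch, assuming each field exposes `key.text` and `value.text` and that `value` can be missing (the `forms` output prints `None` for empty fields). The name `fields_to_dict` is hypothetical, and the stand-in objects replace a real Textract response.

```python
from types import SimpleNamespace


def fields_to_dict(doc):
    """Flatten all form fields of a parsed document into {key text: value text or None}."""
    result = {}
    for page in doc.pages:
        for field in page.form.fields:
            key = field.key.text if field.key else ""
            value = field.value.text if field.value else None
            result[key] = value  # a repeated key overwrites the earlier one
    return result


def make_field(key, value):
    # Stand-in for a parsed form field; value=None models an empty field.
    return SimpleNamespace(
        key=SimpleNamespace(text=key),
        value=SimpleNamespace(text=value) if value is not None else None,
    )


demo_doc = SimpleNamespace(pages=[SimpleNamespace(form=SimpleNamespace(fields=[
    make_field("Fever", "N"),
    make_field("8. Port of Origin of journey", None),
]))])

print(fields_to_dict(demo_doc))
```

Note that keys are not guaranteed to be unique across pages, so a list of pairs may be the safer structure for multi-page forms.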

In [8]:
#Analyzing the detected tables (rows and columns)
def tables(doc):
    for page in doc.pages:
        print("Tables:")
        for table in page.tables:
            for r, row in enumerate(table.rows):
                for c, cell in enumerate(row.cells):
                    print("Table[{}][{}] = {}".format(r, c, cell.text))
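Because `tables` prints every table with the same `Table[r][c]` prefix, two tables on one page run together in the output. A minimal sketch that keeps tables separate by returning one list of rows per table. The name `tables_to_rows` is hypothetical, and the stand-in objects imitate only the `tables` / `rows` / `cells` / `text` attributes used above.

```python
from types import SimpleNamespace


def tables_to_rows(doc):
    """Return one entry per table; each entry is a list of rows of stripped cell texts."""
    all_tables = []
    for page in doc.pages:
        for table in page.tables:
            all_tables.append(
                [[cell.text.strip() for cell in row.cells] for row in table.rows]
            )
    return all_tables


def make_row(*texts):
    # Stand-in for a parsed table row.
    return SimpleNamespace(cells=[SimpleNamespace(text=t) for t in texts])


# Trimmed, hypothetical excerpt of a page that contains two tables.
demo_doc = SimpleNamespace(pages=[SimpleNamespace(tables=[
    SimpleNamespace(rows=[make_row("Invoice#: ", "101213 "), make_row("Page#: ", "1 ")]),
    SimpleNamespace(rows=[make_row("Ordered ", "Shipped "), make_row("504 ", "150 ")]),
])])

for index, rows in enumerate(tables_to_rows(demo_doc)):
    print("Table", index, rows)
```

Each inner list can be handed directly to `csv.writer(...).writerows(rows)` if a CSV export is needed.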

In [9]:
#Text Detection Mechanism
#Works similar to traditional OCR
def detect(documentName):
    response = textract.detect_document_text(
        Document={
            'S3Object': {
                'Bucket': s3BucketName,
                'Name': documentName
            }
        })

    # Print detected text
    for item in response["Blocks"]:
        if item["BlockType"] == "LINE":
            print ('\033[94m' +  item["Text"] + '\033[0m')
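The response is a plain dictionary, so the same filter can return the lines instead of printing them. A minimal sketch; the name `extract_lines` is hypothetical and the sample response is a hand-written stand-in that keeps only the `BlockType` and `Text` keys used above.

```python
def extract_lines(response):
    """Return the text of every LINE block in a Textract response, in order."""
    return [
        block["Text"]
        for block in response.get("Blocks", [])
        if block.get("BlockType") == "LINE"
    ]


# Hand-written stand-in for a response; real responses carry many more keys.
demo_response = {
    "Blocks": [
        {"BlockType": "PAGE"},
        {"BlockType": "LINE", "Text": "Dear Scholarship Selection Committee:"},
        {"BlockType": "WORD", "Text": "Dear"},
    ]
}

print(extract_lines(demo_response))  # ['Dear Scholarship Selection Committee:']
```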

In [10]:
#Text Analysis Mechanism
#This is where AWS Textract outperforms traditional OCR
def analyze(arg, documentName):
    # Call Amazon Textract
    response = textract.analyze_document(
        Document={
            'S3Object': {
                'Bucket': s3BucketName,
                'Name': documentName
            }
        },
        FeatureTypes=["FORMS", "TABLES"])

    doc = Document(response)
    
    if arg=="lines":
        lines(doc)
    elif arg=="forms":
        forms(doc)
    elif arg=="tables":
        tables(doc)
    else:
        print("Unknown mode {!r}: expected 'lines', 'forms' or 'tables'".format(arg))
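`analyze` always requests both `FORMS` and `TABLES`, whatever the mode. A possible refinement is to request only the feature the mode needs; the assumption here is that fewer features means less work (and possibly lower cost, which is worth checking against the current Textract pricing). The names `FEATURES_BY_MODE` and `build_analyze_request` are hypothetical.

```python
# Hypothetical mapping from the notebook's modes to Textract feature types.
FEATURES_BY_MODE = {
    "forms": ["FORMS"],
    "tables": ["TABLES"],
}


def build_analyze_request(mode, bucket, name):
    """Build keyword arguments for analyze_document, requesting one feature only."""
    if mode not in FEATURES_BY_MODE:
        raise ValueError("Unknown mode {!r}".format(mode))
    return {
        "Document": {"S3Object": {"Bucket": bucket, "Name": name}},
        "FeatureTypes": FEATURES_BY_MODE[mode],
    }


print(build_analyze_request("forms", "demo-sanjay-1", "hcls.pdf"))

# Intended use (not executed here):
# response = textract.analyze_document(**build_analyze_request("forms", s3BucketName, hclsDoc))
```

The "lines" mode is left out on purpose: line and word blocks are already returned by `detect_document_text`, so that mode does not need an analysis feature at all.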

# 4. Testing 

First, we need to understand what AWS Textract is good at and when it is worth choosing over traditional OCR tools.

## 1. Online Text Doc 

### Use Cases - Simple Emails, Minutes of Meeting, Notes 

In [11]:
detect(onlineTextDoc) 

SAMPLE 1: PERSONAL STATEMENT (500 words max)
My Name here
Carol E. Macpherson Scholarship Personal Statement
Date here
Dear Scholarship Selection Committee:
I have loved traveling and reading about other cultures since I was a little girl.
Sitting on the floor of our family kitchen and reading about people who lived all over
the globe as well as living for a year in Argentina, instilled in me a respect for
diversity and a burning desire to be an advocate for those underserved on this
planet. This goal has been my reason for past formal and informal studies, and a
degree in xyz will provide me the final tools I will need to meet the challenges I will
face, and to be the strong advocate I would like to be. The Carol E. Macpherson
Scholarship would help me greatly as I work toward my goal. Below please find my
personal statement using the sections you have requested.

### Conclusion: 

Traditional OCR tools are preferred here: they are cheaper, and for plain text like this they work the same way as AWS Textract's text detection.

## 2. OMR Sheets 

### Users - Schools, Varsities, Certification Organizations 

In [12]:
detect(omrDoc)

you -2 SIDE-2
Note: If Language-I or/and Language-II is other than English/Hindi, ask for a supplement Language Test Booklet as mentioned in Admit card.
get 427 total
year his
Language I (Part-IV)
II (Part-V)
Supplement Language
Supplement Language
Roll No.
Main Test Booklet No.
Main Booklet Code
Booklet Code
Booklet Code
Code
Code
8 7 0 0 0 7 5 7
3 5 0 4 6 9 7
M
M
M
0
0
a
0
0
0
0
0
0
0
0
0
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
-
2
2
2
2
2
2
2
2
2
2
2
2
2

In [13]:
analyze("tables",omrDoc)

Tables:
Table[0][0] = Roll No. 
Table[0][1] = 
Table[0][2] = get 427 total Main Test Booklet No. 
Table[0][3] = 
Table[0][4] = year his Main Booklet Code 
Table[0][5] = 
Table[0][6] = Language I (Part-IV) Supplement Language Booklet Code 
Table[0][7] = 
Table[0][8] = II (Part-V) Supplement Language Booklet Code 
Table[1][0] = 
Table[1][1] = 
Table[1][2] = 
Table[1][3] = 
Table[1][4] = 
Table[1][5] = 
Table[1][6] = Code 
Table[1][7] = 
Table[1][8] = Code 
Table[2][0] = 8 7 0 0 0 7 5 7 
Table[2][1] = 
Table[2][2] = 3 5 0 4 6 9 7 
Table[2][3] = 
Table[2][4] = M 
Table[2][5] = 
Table[2][6] = M 
Table[2][7] = 
Table[2][8] = M 
Table[3][0] = SELECTED, SELECTED, SELECTED, NOT_SELECTED, 0 0 a 0 0 0 
Table[3][1] = 
Table[3][2] = SELECTED, NOT_SELECTED, NOT_SELECTED, NOT_SELECTED, 0 0 0 0 0 0 
Table[3][3] = 
Table[3][4] = 
Table[3][5] = 
Table[3][6] = 
Table[3][7] = 
Table[3][8] = 
Table[4][0] = NOT_SELECTED, NOT_SELECTED, NOT_SELECTED, NOT_SELECTED, NOT_SELECTED, 1 1 1 1 1 1 1 1 
Table[4][1] = 

### Conclusion: 

AWS Textract is preferred here, because an OMR sheet is essentially tabular data (rows of questions, columns of answer options).
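The `SELECTED` / `NOT_SELECTED` entries in the table output are selection elements, i.e. the filled and empty bubbles. In the raw response they appear as blocks of type `SELECTION_ELEMENT` with a `SelectionStatus`. Here is a minimal sketch that tallies them; the name `selection_summary` is hypothetical and the sample response is a hand-written stand-in, not real output.

```python
from collections import Counter


def selection_summary(response):
    """Count SELECTED / NOT_SELECTED marks among SELECTION_ELEMENT blocks."""
    return Counter(
        block["SelectionStatus"]
        for block in response.get("Blocks", [])
        if block.get("BlockType") == "SELECTION_ELEMENT"
    )


# Hand-written stand-in for a response with three bubbles and one word.
demo_response = {
    "Blocks": [
        {"BlockType": "SELECTION_ELEMENT", "SelectionStatus": "SELECTED"},
        {"BlockType": "SELECTION_ELEMENT", "SelectionStatus": "NOT_SELECTED"},
        {"BlockType": "SELECTION_ELEMENT", "SelectionStatus": "NOT_SELECTED"},
        {"BlockType": "WORD", "Text": "Roll"},
    ]
}

print(selection_summary(demo_response))
```

Scoring an actual answer sheet would additionally need each bubble's position (its table cell or bounding box), which this tally ignores.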

#### ProTip: 
Traditional OCR tools do not analyze the relationships between the pieces of text they detect.

## 3. Health Forms

### Users - Airfare Checks, Medical Institutions, Govt. etc

In [14]:
analyze("forms",hclsDoc)

Fields:
Key: 8. Port of Origin of journey, Value: None
Key: 9. Port of final destination, Value: None
Key: 2. Street/Village, Value: Street 10
Key: 4. District/City, Value: Bombay
Key: 4. Passport No., Value: XXXX-1234
Key: 3 Flight No., Value: Very High
Key: 7. Residence Number, Value: 12
Key: 8. Mobile Number * (mandatory field), Value: +91 011235813
Key: 5. Nationality, Value: Indian
Key: 1. Name of the Passenger, Value: John Doe
Key: 7. Date of Arrival, Value: 01-01-2022
Key: 6. Age(in years), Value: 45
Key: 6. PIN code., Value: 123456
Key: 2. Seat No., Value: 7 C
Key: State, Value: Maharashtra
Key: 1. House Number, Value: 420
Key: 3 Tehsil, Value: What is this
Key: Cough, Value: Y
Key: Respiratory Distress, Value: N
Key: Fever, Value: N
Key: 9. Email-ID, Value: john.doe@gmail.com


### Conclusion: 

AWS Textract is preferred here, because health forms are, well, forms!
A form is simply a set of key:value pairs of information laid out in the document.

#### ProTip: 
Remember that traditional OCR tools do not analyze the relationships between the pieces of text they detect.
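To make "relationship" concrete: in the raw response a form key is a `KEY_VALUE_SET` block whose `Relationships` point to its own words (`CHILD`) and to its value block (`VALUE`). The `trp` parser resolves these links for us; the minimal sketch below does the same by hand. The function name `key_value_pairs` is hypothetical, and the sample blocks are a hand-written stand-in that keeps only the keys needed here.

```python
def key_value_pairs(response):
    """Resolve KEY -> VALUE links in a Textract response into {key text: value text}."""
    blocks = {block["Id"]: block for block in response["Blocks"]}

    def child_text(block):
        # Join the WORD children of a block into one string.
        words = []
        for rel in block.get("Relationships", []):
            if rel["Type"] == "CHILD":
                for child_id in rel["Ids"]:
                    child = blocks[child_id]
                    if child["BlockType"] == "WORD":
                        words.append(child["Text"])
        return " ".join(words)

    pairs = {}
    for block in blocks.values():
        if block["BlockType"] == "KEY_VALUE_SET" and "KEY" in block.get("EntityTypes", []):
            values = []
            for rel in block.get("Relationships", []):
                if rel["Type"] == "VALUE":
                    values.extend(child_text(blocks[value_id]) for value_id in rel["Ids"])
            pairs[child_text(block)] = " ".join(values)
    return pairs


# Hand-written stand-in: one key ("Fever") linked to one value ("N").
demo_response = {"Blocks": [
    {"Id": "k1", "BlockType": "KEY_VALUE_SET", "EntityTypes": ["KEY"],
     "Relationships": [{"Type": "VALUE", "Ids": ["v1"]}, {"Type": "CHILD", "Ids": ["w1"]}]},
    {"Id": "v1", "BlockType": "KEY_VALUE_SET", "EntityTypes": ["VALUE"],
     "Relationships": [{"Type": "CHILD", "Ids": ["w2"]}]},
    {"Id": "w1", "BlockType": "WORD", "Text": "Fever"},
    {"Id": "w2", "BlockType": "WORD", "Text": "N"},
]}

print(key_value_pairs(demo_response))  # {'Fever': 'N'}
```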

## 4. Invoice

### Use Cases - Sales

In [15]:
detect(invoiceDoc)

DONSCO
P.O. Box 2001
Wrightsville, PA 17368-0040
INCORPORATED
Phone: (717) 252-1561
INTEGRATEP SUPPLIER OF CAST PARTS
AN ISO 9001:2000 REGISTERED COMPANY
D-U-N-S: 06-977-6987
Invoice
BILL TO:
SUPPLIER#
TIS America, Inc.
Invoice#:
101213
Mr. Omri Gelb
Invoice Date: 04/19/13
1350 Avenue of the Americas,
Page#:
1
2nd Floor
Order#:
155126
New York, New York 10019
Packing List#: 004160
SHIP TO:
REMIT TO:
DONSCO FOUNDRY
TIS America, Inc.
PO BOX 64145
Mr. Omri Gelb
BALTIMORE, MD 21264-4145
1350 Avenue of the Americas,
USA
2nd Floor
New York, New York 10019
Ordered by:
Salesperson:
JAMES E. THOMAS
Payment Terms
Freight Terms
Carrier:

In [16]:
analyze("lines", invoiceDoc)

PAGE
Line: DONSCO--99.36158752441406
Word: DONSCO--99.36158752441406
Line: P.O. Box 2001--99.2244873046875
Word: P.O.--98.91168975830078
Word: Box--99.29154205322266
Word: 2001--99.47024536132812
Line: Wrightsville, PA 17368-0040--98.32840728759766
Word: Wrightsville,--97.42405700683594
Word: PA--98.66011047363281
Word: 17368-0040--98.90103912353516
Line: INCORPORATED--99.30694580078125
Word: INCORPORATED--99.30694580078125
Line: Phone: (717) 252-1561--86.68304443359375
Word: Phone:--98.85042572021484
Word: (717)--92.2434310913086
Word: 252-1561--68.95526123046875
Line: INTEGRATEP SUPPLIER OF CAST PARTS--82.42842864990234
Word: INTEGRATEP--25.851207733154297
Word: SUPPLIER--97.99254608154297
Word: OF--97.02413940429688
Word: CAST--98.21308898925781
Word: PARTS--93.06116485595703
Line: AN ISO 9001:2000 REGISTERED COMPANY--99.46259307861328
Word: AN--99.71941375732422
Word: ISO--99.55892944335938
Word: 9001:2000--98.48588562011719
Word: REGISTERED--99.8083724975586
Word: COMPANY--99.7403

In [17]:
analyze("forms",invoiceDoc)

Fields:
Key: Invoice Date:, Value: 04/19/13
Key: Order#:, Value: 155126
Key: Invoice#:, Value: 101213
Key: Salesperson:, Value: JAMES E. THOMAS
Key: Packing List#:, Value: 004160
Key: Payment Terms, Value: NET 30 DAYS
Key: SHIP TO:, Value: TIS America, Inc. Mr. Omri Gelb 1350 Avenue of the Americas, 2nd Floor New York, New York 10019
Key: REMIT TO:, Value: DONSCO FOUNDRY PO BOX 64145 BALTIMORE, MD 21264-4145 USA
Key: Carrier:, Value: None
Key: Page#:, Value: 1
Key: Freight Terms, Value: 3RD PARTY BILLING
Key: D-U-N-S:, Value: 06-977-6987
Key: BILL TO:, Value: TIS America, Inc. Mr. Omri Gelb 1350 Avenue of the Americas, 2nd Floor New York, New York 10019
Key: Ordered by:, Value: None
Key: Phone:, Value: (717) 252-1561
Key: INVOICE TOTAL, Value: $8,060.10


In [18]:
analyze("tables",invoiceDoc)

Tables:
Table[0][0] = Invoice#: 
Table[0][1] = 101213 
Table[1][0] = Invoice Date: 
Table[1][1] = 04/19/13 
Table[2][0] = Page#: 
Table[2][1] = 1 
Table[3][0] = Order#: 
Table[3][1] = 155126 
Table[4][0] = Packing List#: 
Table[4][1] = 004160 
Table[0][0] = Ordered 
Table[0][1] = Shipped 
Table[0][2] = BackOrd 
Table[0][3] = POLine# 
Table[0][4] = Part Number 
Table[0][5] = Weight 
Table[0][6] = Price/um 
Table[0][7] = Surcharge 
Table[0][8] = Extended Price 
Table[1][0] = 
Table[1][1] = 
Table[1][2] = 
Table[1][3] = 
Table[1][4] = 
Table[1][5] = 
Table[1][6] = 
Table[1][7] = 
Table[1][8] = 
Table[2][0] = 504 
Table[2][1] = 150 
Table[2][2] = 0 
Table[2][3] = 73 
Table[2][4] = 243264 
Table[2][5] = 1158 
Table[2][6] = $53.73400 
Table[2][7] = 
Table[2][8] = $8,060.10 
Table[3][0] = 
Table[3][1] = 
Table[3][2] = 
Table[3][3] = 
Table[3][4] = CUST PART#: 
Table[3][5] = 86520418 
Table[3][6] = -FNSHMA 
Table[3][7] = 
Table[3][8] = 
Table[4][0] = 
Table[4][1] = 
Table[4][2] = 
Table[4][3] 
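Having both the form fields and the table cells makes a simple sanity check possible: the extracted line item should add up to the extracted invoice total. A minimal sketch using the values printed above (150 shipped, $53.73400 per unit, $8,060.10 extended price and invoice total). It assumes that extended price = shipped quantity x unit price, which the invoice itself does not state; `parse_money` is a hypothetical helper.

```python
from decimal import Decimal


def parse_money(text):
    """Turn a string such as '$8,060.10' into a Decimal."""
    return Decimal(text.replace("$", "").replace(",", "").strip())


shipped = Decimal("150")                  # Table[2][1]
unit_price = parse_money("$53.73400")     # Table[2][6]
extended = parse_money("$8,060.10")       # Table[2][8]
invoice_total = parse_money("$8,060.10")  # form field "INVOICE TOTAL"

# Round the product to cents before comparing.
line_total = (shipped * unit_price).quantize(Decimal("0.01"))
print(line_total == extended == invoice_total)  # True
```

A mismatch here would point at either an OCR misread in one of the cells or a pricing rule (for example a surcharge) that the assumption above does not cover.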

### Conclusion? 

