# Processing ACORD-127 forms using Textract post-processing libraries

## Use case : Hierarchical Key-Value mapping

When a document has a hierarchical structure, an important IDP post-processing task is to infer the context within that structure. For example, in the sample ACORD 127 form (page 3) below, we want to relate each `Veh #` key to the corresponding vehicle information items within each highlighted section. This can be done using Textract post-processing libraries for hierarchical key-value pair mapping.


In [31]:
# PIL's Image is not imported here: IPython's Image would shadow it, and only IFrame/display are used.
from IPython.display import Image, display, HTML, JSON, IFrame

documentName = "doc_samples/Acord-127-sample.png"
display(IFrame(documentName, 500, 600));

## Step 1: Installation


### [Amazon Textract Geofinder](https://pypi.org/project/amazon-textract-geofinder/)
An Amazon Textract package that makes it easier to access data through geometric information. It provides functions that use this geometric information to extract data.

<b>Use cases include:</b>

   -  Give context to key/value pairs from the Amazon Textract AnalyzeDocument API for FORMS
   -  Find values in specific areas
   
### Other helper libraries for Textract response parsing : 

- <b>call_textract( )</b> from the [Textract-Caller](https://github.com/aws-samples/amazon-textract-textractor/tree/c689441c0562afb4976d4f248559e59289a33777/caller) library makes it easy to call the AnalyzeDocument API and get back its JSON response.

- The [Textract-PrettyPrinter](https://github.com/aws-samples/amazon-textract-textractor/tree/master/prettyprinter) library provides functions to format the output received from Textract in more easily consumable formats such as CSV.

You will need to run the cell below only once for installation.


In [None]:
!python -m pip install amazon-textract-helper amazon-textract-geofinder amazon-textract-caller amazon-textract-response-parser --upgrade 

In [3]:
from textractgeofinder.ocrdb import AreaSelection
from textractgeofinder.tgeofinder import KeyValue, TGeoFinder, AreaSelection, SelectionElement
from textractprettyprinter.t_pretty_print import get_forms_string, convert_form_to_list_trp2

from textractcaller.t_call import call_textract
from textractcaller.t_call import Textract_Features

import trp.trp2 as t2

This is the image we want to extract information from.

In [4]:
# PIL's Image is not imported here: IPython's Image would shadow it, and only IFrame/display are used.
from IPython.display import Image, display, HTML, JSON, IFrame

documentName = "doc_samples/Acord-127-pg3.png"
display(IFrame(documentName, 500, 600));

Calling Amazon Textract with the textractcaller library is easy. 

In [5]:
j = call_textract(input_document=documentName, features=[Textract_Features.FORMS, Textract_Features.TABLES])

In [7]:
print(get_forms_string(j))

The Textract JSON response contains multiple block types (for example `PAGE`, `LINE`, `WORD`, and `KEY_VALUE_SET`). We deserialize the JSON into block objects using the trp library.
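To get a feel for which block types a response holds before deserializing it, one option is to tally the `BlockType` field of each entry under `Blocks`. The sketch below uses a small hand-written dictionary as a stand-in for a real response; `sample_response` and `count_block_types` are illustrative names, not part of any Textract library.

```python
from collections import Counter

# Hypothetical, heavily trimmed stand-in for a Textract response.
# A real response carries many more fields per block (Geometry, Relationships, ...).
sample_response = {
    "Blocks": [
        {"BlockType": "PAGE", "Id": "p1"},
        {"BlockType": "LINE", "Id": "l1"},
        {"BlockType": "WORD", "Id": "w1"},
        {"BlockType": "WORD", "Id": "w2"},
        {"BlockType": "KEY_VALUE_SET", "Id": "k1", "EntityTypes": ["KEY"]},
        {"BlockType": "KEY_VALUE_SET", "Id": "v1", "EntityTypes": ["VALUE"]},
    ]
}


def count_block_types(response: dict) -> Counter:
    """Count how many blocks of each BlockType the response contains."""
    return Counter(block["BlockType"] for block in response.get("Blocks", []))


block_counts = count_block_types(sample_response)
print(block_counts)
```

Assuming `j` is a single response dictionary with a top-level `Blocks` list, `count_block_types(j)` should give the same kind of summary for the real document.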

In [27]:
#from textractcaller.t_call import call_textract, Textract_Features
from trp.trp2 import TDocument, TDocumentSchema
from trp.t_pipeline import order_blocks_by_geo
import trp

t_doc = TDocumentSchema().load(j)
# the ordered_doc has elements ordered by y-coordinate (top to bottom of page)
ordered_doc = order_blocks_by_geo(t_doc)
# send to trp for further processing logic
trp_doc = trp.Document(TDocumentSchema().dump(ordered_doc))
# print(trp_doc)


In [9]:
for page in trp_doc.pages:
    for field in page.form.fields:
        key = field.key.text if field.key else ""
        value = field.value.text if field.value else ""
        print(key, ":" ,value)
        

ACORD 129 attached for additional vehicles : SELECTED
COMP / OTC SYM : 
COLL SYM : 
BODY TYPE: : Sedan
VEH # : 1
YEAR : 1990
SYM / AGE : 
MAKE: : Chevrolet
PP : NOT_SELECTED
SPEC : NOT_SELECTED
COML : NOT_SELECTED
V.I.N.: : 5NPEU4AC8AH689378
MODEL: : Lumina
COUNTY : Magic
STREET (Required in KY) : 778 Brown Plaza
CITY : Ivanburgh
STATE : IL
ZIP : 41945
LIC STATE : IL
FARTHEST TERMINAL : 
FACTOR : 0.00
TERR : 
COST NEW : $
SEAT CP : 
GVW/GCW : 
RADIUS : 
CLASS : 
SIC : 
COMP/ OTC : SELECTED
RENT REIMB : NOT_SELECTED
SPEC COFL : NOT_SELECTED
ADD'L NO- FAULT : NOT_SELECTED
UNDRINS MOTOR : NOT_SELECTED
DEDUCTIBLES : $
COMM'L : NOT_SELECTED
FOR HIRE : NOT_SELECTED
LSP : NOT_SELECTED
F : NOT_SELECTED
ACV : NOT_SELECTED
COMP/ OTC : SELECTED
TOWING LABOR : NOT_SELECTED
PLEASURE : NOT_SELECTED
RETAIL : NOT_SELECTED
FT : SELECTED
FG : NOT_SELECTED
MED PAY : SELECTED
AA : NOT_SELECTED
AMT : $ 600
LIAB : SELECTED
NO- FAULT : NOT_SELECTED
UNINS MOTOR : NOT_SELECTED
SPEC C OF : NOT_SELECTED
SERVICE 

In [10]:
t_document = t2.TDocumentSchema().load(j)
doc_height = 1000
doc_width = 1000
geofinder_doc = TGeoFinder(j, doc_height=doc_height, doc_width=doc_width)

## Step 2 : Using BoundingBox information given by Textract

A [bounding box](https://docs.aws.amazon.com/textract/latest/dg/API_BoundingBox.html) is the box around a detected page, text, key-value pair, table, table cell, or selection element on a document page.

We will now find the `Top` variable which is the top coordinate of the bounding box as a ratio of overall document page height.

This `Top` value gives the Y coordinate of each `Veh #` key on the ACORD 127 form.
The X and Y values that are returned are ratios of the overall document page size. For example, if the input document is 700 x 200 and the operation returns X=0.5 and Y=0.25, then the point is at the (350,50) pixel coordinate on the document page.
Since our document dimensions are 1000x1000 (configured in the cell above), we multiply each `vehicle_top` value by 1000 to get the Y coordinate for each vehicle number.
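The ratio-to-coordinate arithmetic can be checked with a few lines of plain Python. This is a minimal sketch; `ratio_to_pixels` is a hypothetical helper introduced here for illustration, and it truncates with `int()` in the same way the notebook cell does.

```python
def ratio_to_pixels(x_ratio: float, y_ratio: float,
                    doc_width: int, doc_height: int) -> tuple[int, int]:
    """Scale bounding-box ratios (0..1) to coordinates in document units."""
    return int(x_ratio * doc_width), int(y_ratio * doc_height)


# The example from the text: a 700 x 200 document, X=0.5 and Y=0.25.
point = ratio_to_pixels(0.5, 0.25, doc_width=700, doc_height=200)
print(point)

# With the 1000 x 1000 dimensions used in this notebook, a hypothetical
# Top ratio of 0.125 maps to Y = 125.
_, y_coordinate = ratio_to_pixels(0.0, 0.125, doc_width=1000, doc_height=1000)
print(y_coordinate)
```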

In [11]:
vehicle_top_list = []
for page in trp_doc.pages:
    for field in page.form.fields:
        key = field.key.text if field.key else ""
        value = field.value.text if field.value else ""
        if (key).lower().startswith('veh'):
            vehicle_top = field.geometry.boundingBox.top
            vehicle_top_list.append((int(vehicle_top*1000)))

In [12]:
vehicle_top_list

[55, 207, 358, 510]

<b>set_hierarchy_kv</b> is a helper function that adds "virtual" keys, which we use to indicate context.

In [13]:
def set_hierarchy_kv(list_kv: list[KeyValue], t_document: t2.TDocument, page_block: t2.TBlock, prefix="BORROWER"):
    for x in list_kv:
        t_document.add_virtual_key_for_existing_key(key_name=f"{prefix}_{x.key.text}",
                                                    existing_key=t_document.get_block_by_id(x.key.id),
                                                    page_block=page_block)

In [14]:
geofinder_doc.find_phrase_on_page("total prem:", min_textdistance=0.99)

[TWord(text='total prem', original_text='TOTAL PREM: $', text_type='line', confidence=98.46007537841797, id='d48d570c-4e67-433b-acef-c7cde3fbe5e9', xmin=723, ymin=196, xmax=800, ymax=203, page_number=1, doc_width=1000, doc_height=1000, child_relationships='afeb5a50-b908-4926-ade2-7a379dbe25de,2249d9cd-e3c7-419e-ac4b-d5c3b837a2d0,d1c39201-a4ac-44ef-bb21-4bef4bad5971', reference=None, resolver=None),
 TWord(text='total prem', original_text='TOTAL PREM: $', text_type='line', confidence=97.84642791748047, id='8c87d5c9-c99d-4550-8dbd-a14be55577c7', xmin=723, ymin=348, xmax=800, ymax=355, page_number=1, doc_width=1000, doc_height=1000, child_relationships='fa04bab5-2ae1-4972-8a5b-bcb3b4e79444,5b2fb219-2bf9-4473-8d63-df332f97ed0d,e18b6c3c-d129-42bf-a6a2-e54477418a7f', reference=None, resolver=None),
 TWord(text='total prem', original_text='TOTAL PREM: $', text_type='line', confidence=98.10001373291016, id='1f2b215e-cac9-4f66-b662-1295214e71ee', xmin=723, ymin=499, xmax=800, ymax=506, page_num

## Step 3 : Map the Vehicle # to their corresponding kv pairs

We find the relevant phrases in the document to delimit the area containing the key-value pairs related to each vehicle.

We then use this information to add new key-value pairs to the Amazon Textract Response JSON Schema.
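The idea behind this step can be illustrated without the Textract libraries: each vehicle section is a vertical band that starts at a `Veh #` key and ends at the next `TOTAL PREM` line, and every key whose Y coordinate falls inside the band gets that vehicle's prefix. The sketch below is a simplified stand-in, not how TGeoFinder selects fields internally. `Field`, `fields_in_band`, and `prefix_fields_by_section` are hypothetical names, the band boundaries reuse the first two values printed earlier in this notebook, and the second vehicle's field values are made up.

```python
from typing import NamedTuple


class Field(NamedTuple):
    key: str
    value: str
    top: int  # Y coordinate of the key, already scaled to document units


def fields_in_band(fields, band_top, band_bottom):
    """Return the fields whose top coordinate lies within the vertical band."""
    return [f for f in fields if band_top <= f.top <= band_bottom]


def prefix_fields_by_section(fields, section_tops, section_bottoms):
    """Map 'Veh_<n>_<key>' to the value of each field found in section n."""
    prefixed = {}
    bands = zip(section_tops, section_bottoms)
    for number, (top, bottom) in enumerate(bands, start=1):
        for f in fields_in_band(fields, top, bottom):
            prefixed[f"Veh_{number}_{f.key}"] = f.value
    return prefixed


# Illustrative fields; the second vehicle's values are invented for the example.
fields = [
    Field("YEAR", "1990", 60),
    Field("MAKE", "Chevrolet", 62),
    Field("YEAR", "2005", 212),
    Field("MAKE", "Ford", 214),
]
section_tops = [55, 207]      # Y of the first two `Veh #` keys
section_bottoms = [196, 348]  # ymin of the first two `TOTAL PREM` lines

prefixed = prefix_fields_by_section(fields, section_tops, section_bottoms)
print(prefixed)
```

The cells that follow do the same thing with `AreaSelection` and `get_form_fields_in_area`, and write the prefixed keys back into the Textract document instead of a plain dictionary.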

In [15]:
# Vehicle 1 geo-info
top_left = t2.TPoint(y=vehicle_top_list[0], x=0)
total_prem_1 = geofinder_doc.find_phrase_on_page("total prem:", min_textdistance=0.99)[0]
lower_right = t2.TPoint(y=total_prem_1.ymin, x=doc_width)

# Vehicle 1 hierarchical key
form_fields = geofinder_doc.get_form_fields_in_area(
    area_selection=AreaSelection(top_left=top_left, lower_right=lower_right, page_number=1))
set_hierarchy_kv(list_kv=form_fields,
                 t_document=t_document,
                 prefix='Veh_1',
                 page_block=t_document.pages[0])

# Vehicle 2 geo-info
top_left = t2.TPoint(y=vehicle_top_list[1], x=0)
total_prem_2 = geofinder_doc.find_phrase_on_page("total prem:", min_textdistance=0.99)[1]
lower_right = t2.TPoint(y=total_prem_2.ymin, x=doc_width)

# Vehicle 2 hierarchical key
form_fields = geofinder_doc.get_form_fields_in_area(
    area_selection=AreaSelection(top_left=top_left, lower_right=lower_right, page_number=1))
set_hierarchy_kv(list_kv=form_fields,
                 t_document=t_document,
                 prefix='Veh_2',
                 page_block=t_document.pages[0])

# Vehicle 3 geo-info
top_left = t2.TPoint(y=vehicle_top_list[2], x=0)
total_prem_3 = geofinder_doc.find_phrase_on_page("total prem:", min_textdistance=0.99)[2]
lower_right = t2.TPoint(y=total_prem_3.ymin, x=doc_width)

# Vehicle 3 hierarchical key
form_fields = geofinder_doc.get_form_fields_in_area(
    area_selection=AreaSelection(top_left=top_left, lower_right=lower_right, page_number=1))
set_hierarchy_kv(list_kv=form_fields,
                 t_document=t_document,
                 prefix='Veh_3',
                 page_block=t_document.pages[0])

# Vehicle 4 geo-info
top_left = t2.TPoint(y=vehicle_top_list[3], x=0)
total_prem_4 = geofinder_doc.find_phrase_on_page("total prem:", min_textdistance=0.99)[3]
lower_right = t2.TPoint(y=total_prem_4.ymin, x=doc_width)

# Vehicle 4 hierarchical key
form_fields = geofinder_doc.get_form_fields_in_area(
    area_selection=AreaSelection(top_left=top_left, lower_right=lower_right, page_number=1))
set_hierarchy_kv(list_kv=form_fields,
                 t_document=t_document,
                 prefix='Veh_4',
                 page_block=t_document.pages[0])


In [16]:
print(get_forms_string(t2.TDocumentSchema().dump(t_document)))

## Step 4 : Print forms in Pandas dataframe

In [25]:
import pandas as pd
from textractprettyprinter.t_pretty_print import convert_form_to_list
from trp import Document

tdoc=Document(t2.TDocumentSchema().dump(t_document))

dfs = list()
for page in tdoc.pages:
    dfs.append(pd.DataFrame(convert_form_to_list(trp_form=page.form)))


In [26]:
dfs[0]

| | 0 | 1 |
|---|---|---|
| 0 | Key | Value |
| 1 | DEDUCTIBLES | $ |
| 2 | DEDUCTIBLES | $ |
| 3 | DEDUCTIBLES | $ |
| 4 | DEDUCTIBLES | $ |
| ... | ... | ... |
| 455 | Veh_4_otc | SELECTED |
| 456 | Veh_4_zip | 41945 |
| 457 | Veh_4_new | COST $ |
| 458 | Veh_4_sym | COLL |
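Once the keys carry a `Veh_<n>_` prefix, the flat key/value rows can be regrouped per vehicle. The following is a minimal sketch in plain Python, assuming the rows are available as `(key, value)` tuples; `group_by_vehicle` and `VEH_KEY` are hypothetical names, and the sample rows are taken from the output shown above.

```python
import re
from collections import defaultdict

# Matches keys such as "Veh_4_zip" and captures the vehicle number and the key name.
VEH_KEY = re.compile(r"^Veh_(\d+)_(.+)$")


def group_by_vehicle(rows):
    """Group (key, value) rows with a Veh_<n>_ prefix into {n: {key: value}}."""
    grouped = defaultdict(dict)
    for key, value in rows:
        match = VEH_KEY.match(key)
        if match:  # rows without the prefix are skipped
            vehicle_number = int(match.group(1))
            grouped[vehicle_number][match.group(2)] = value
    return dict(grouped)


rows = [
    ("DEDUCTIBLES", "$"),
    ("Veh_4_otc", "SELECTED"),
    ("Veh_4_zip", "41945"),
]

by_vehicle = group_by_vehicle(rows)
print(by_vehicle)
```

Note that a dictionary keeps only the last value when the same prefixed key appears more than once for a vehicle, so a list of pairs may be a better container if repeated keys matter.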
