# Explore amazon-textract-textractor

## What is amazon-textract-textractor?

Textract 的原生 API 返回的结果是一个巨大的 JSON 对象. 你需要阅读 [Text Detection and Document Analysis Response Objects](https://docs.aws.amazon.com/textract/latest/dg/how-it-works-document-layout.html) 才能理解如何解读这个 JSON 对象. 然后你还要自己写程序 Parse 这个 JSON 对象, 对数据做进一步的处理.

[amazon-textract-textractor](https://github.com/aws-samples/amazon-textract-textractor) 是 AWS 实验室里的一个开源 Python 项目. 致力于让 Textract 更好用. 简单来说就是对这个 JSON 对象的进一步封装.

``amazon-textract-textractor`` 是一个顶层项目, 内部有这么几个模块:

- amazon-textract-caller: 对 boto3 的封装, 毕竟 boto3 的 API 函数根本没有 signature 也没有 type hint.
- amazon-textract-response-parser: 对 JSON 对象的面向对象封装.

以上两个是 ``amazon-textract-textractor`` 的核心, 安装的时候会自动安装这两个.

- amazon-textract-overlayer: 用来在 PDF 或图片上画方框的.
- amazon-textract-prettyprinter: 把 Textract 的结果转化成其他 CSV, markdown 等格式.
- amazon-textract-geofinder: 实现了对 Textract 的 entity 用坐标来搜索. 底层是用 sqlite 数据库实现.

In [1]:
# 这里我们把他们全部装好得了
%pip install amazon-textract-textractor
%pip install amazon-textract-caller
%pip install amazon-textract-response-parser
%pip install amazon-textract-geofinder
%pip install amazon-textract-prettyprinter
%pip install amazon-textract-overlayer

You should consider upgrading via the '/Users/sanhehu/venvs/python/3.8.11/dev_exp_share_venv/bin/python -m pip install --upgrade pip' command.[0m
Note: you may need to restart the kernel to use updated packages.
You should consider upgrading via the '/Users/sanhehu/venvs/python/3.8.11/dev_exp_share_venv/bin/python -m pip install --upgrade pip' command.[0m
Note: you may need to restart the kernel to use updated packages.
You should consider upgrading via the '/Users/sanhehu/venvs/python/3.8.11/dev_exp_share_venv/bin/python -m pip install --upgrade pip' command.[0m
Note: you may need to restart the kernel to use updated packages.
You should consider upgrading via the '/Users/sanhehu/venvs/python/3.8.11/dev_exp_share_venv/bin/python -m pip install --upgrade pip' command.[0m
Note: you may need to restart the kernel to use updated packages.
You should consider upgrading via the '/Users/sanhehu/venvs/python/3.8.11/dev_exp_share_venv/bin/python -m pip install --upgrade pip' command.[

# Set AWS Credential

In [1]:
from boto_session_manager import BotoSesManager
from textractor import Textractor
from s3pathlib import context

aws_profile = "aws_data_lab_sanhe_us_east_1"

bsm = BotoSesManager(profile_name=aws_profile)
context.attach_boto_session(bsm.boto_ses)

# Textractor 的顶层 API
extractor = Textractor(profile_name=aws_profile)

## Enumerate Important Local Path and S3 Path

这里我们先做一些准备工作, 将 PDF 转化为图片, 上传至 S3 等工作.

In [24]:
import os
from pathlib_mate import Path
from s3pathlib import S3Path

#--- Local
dir_here = Path(os.getcwd()).absolute()

path_cms1500_pdf = dir_here / "cms1500-carrie-rodgers.pdf"
path_cms1500_png = dir_here / "page-1.png"

#--- S3
s3dir_root = S3Path("aws-data-lab-sanhe-for-everything", "poc", "2022-12-04-textractor").to_dir()
s3dir_input = s3dir_root.joinpath("input").to_dir()
s3dir_output = s3dir_root.joinpath("output").to_dir()
s3path_cms1500_pdf = s3dir_input / path_cms1500_pdf.basename

#--- Upload
print(f"preview: {s3dir_root.console_url}")

s3path_cms1500_pdf.upload_file(path_cms1500_pdf.abspath, overwrite=True)

preview: https://console.aws.amazon.com/s3/buckets/aws-data-lab-sanhe-for-everything?prefix=poc/2022-12-04-textractor/


In [27]:
# 用 PyMuPDF 将 PDF 切割并转化为 图片.
import fitz

# bytes protocol
doc = fitz.open(stream=path_cms1500_pdf.read_bytes())

for page_num, page in enumerate(doc, start=1):
    print(page_num)
    # split page
    one_page_doc = fitz.open()  # new empty PDF
    one_page_doc.insert_pdf(doc, from_page=page_num-1, to_page=page_num-1)
    p = dir_here / f"page-{page_num}.pdf"
    #
    # # you cannot write document to io.BytesIO
    # one_page_doc.save(f"{p}")

    # convert page to image
    pix = page.get_pixmap(dpi=200)

    p = dir_here / f"page-{page_num}.ppm"
    # you cannot write pix map to io.BytesIO
    p.write_bytes(pix.tobytes("ppm"))
    # pix.save(f"{p}")


1


RuntimeError: source object number out of range

## Detect Document Text

In [89]:
document = extractor.detect_document_text(file_source=s3path_cms1500_pdf.uri)
type(document)

textractor.entities.document.Document

In [90]:
document.lines

[Mail completed forms to:,
 Department of Labor and Industries,
 PO Box 44269,
 Olympia WA 98504-4269,
 HEALTH INSURANCE CLAIM FORM,
 CARRIER,
 APPROVED BY NATIONAL UNIF ORM CLAIM COMMITTEE (NUCC) 02/12,
 PICA,
 PICA,
 OTHER,
 1a INSURED'S ID NUMBER,
 FECA,
 GROUP,
 CHAMPVA,
 (For Program in Item 1),
 TRICARE,
 MEDICAID,
 1. MEDICARE,
 IKLUNG,
 HEALTH PLAN,
 (ID#),
 (ID#),
 (ID#),
 (Member ID#),
 (ID#/DoD#),
 (Medicaid#),
 (Medicare#),
 SEX,
 3. PATIENT'S BIRTH DATE,
 4. INSURED'S NAME (Last Name, First Name, Middle Initial),
 2 PATIENT'S NAME (Last Name, First Name, Middle Initial),
 YY,
 DD,
 MM,
 F,
 18,
 ALCON LABORATORIES,
 1974,
 9,
 7 INSURED'S ADDRESS (No., Street),
 Carrie Rodgers,
 6. PATIENT RELATIONSHIP TOINSURED,
 5 PATIENT'S ADDRESS (No., Street),
 Other,
 Child,
 Self,
 Spouse,
 6201 S freeway,
 2805 28th StNw,
 STATE,
 CITY,
 8. RESERVED FOR NUCC USE,
 STATE,
 CITY,
 DC,
 Tx,
 fort worth,
 Washington,
 ZIP CODE,
 TELEPHONE (include Area Code),
 TELEPHONE (Include Area C

In [91]:
results = document.search_lines("patient name, (Last Name, First Name, Middle Initial)", 3)
results

[2 PATIENT'S NAME (Last Name, First Name, Middle Initial),
 4. INSURED'S NAME (Last Name, First Name, Middle Initial),
 9. OTHER INSURED'S NAME (Last Name, First Name, Middle Initial)]

In [92]:
results[0].bbox

x: 0.020324068143963814, y: 0.1542317271232605, width: 0.2638625502586365, height: 0.008735546842217445

## Form and Table

In [19]:
from textractor.data.constants import TextractFeatures

analyzed_document = extractor.analyze_document(
	file_source=path_cms1500_pdf.abspath,
	features=[
        TextractFeatures.FORMS,
        TextractFeatures.TABLES,
    ]
)
print("done")

done


In [59]:
Path(dir_here, "test_1.json").write_text(json.dumps(analyzed_document.response, indent=4))

2006605

### Key Value

In [57]:
analyzed_document.key_values

[1a INSURED'S ID NUMBER : (For Program in Item 1),
 4. INSURED'S NAME (Last Name, First Name, Middle Initial) : ALCON LABORATORIES,
 2 PATIENT'S NAME (Last Name, First Name, Middle Initial) : Carrie Rodgers,
 YY : 1974,
 MM : 9,
 DD : 18,
 7 INSURED'S ADDRESS (No., Street) : 6201 S freeway,
 5 PATIENT'S ADDRESS (No., Street) : 2805 28th StNw,
 CITY : fort worth,
 8. RESERVED FOR NUCC USE : ,
 STATE : DC,
 CITY : Washington,
 ZIP CODE : 76134,
 TELEPHONE (include Area Code) : (815)571-3008,
 TELEPHONE (Include Area Code) : (202)614-5824,
 ZIP CODE : 20008,
 11. INSURED'S POLICY GROUP OR FECA NUMBER : FUR4398,
 9. OTHER INSURED'S NAME (Last Name, First Name, Middle Initial) : ,
 a OTHER INSURED'S POLICY OR GROUP NUMBER : X1573,
 DD : 11,
 MM : 4,
 1978 YY : ,
 b RESERVED FOR NUCC USE : ,
 b. OTHER CLAIM ID (Designated by NUCC) : Y41 FUR4398,
 C. INSURANCE PLAN NAME OR PROGR AM NAME : Travelers,
 C. RESERVED FOR NUCC USE : ,
 d. INSURANCE PLAN NAME OR PROGRAM NAME : ,
 10d. CLAIM CODES (D

In [38]:
key_value_I = analyzed_document.get(key="I")[0]
print(key_value_I)
doc_width = 2480
doc_height = 3509
x = key_value_I.bbox.x * doc_width
y = key_value_I.bbox.y * doc_height
width = key_value_I.bbox.width * doc_width
height = key_value_I.bbox.height * doc_height

x_min = x
y_min = y
x_max = x + width
y_max = y + width
(x_min, x_max, y_min, y_max)



I. : This is iphone


(61.32811039686203, 74.83414195477962, 2069.866504251957, 2083.3725358098745)

In [10]:
key_value = analyzed_document.key_values[2]
print(key_value.key)
print(key_value.value)

2 PATIENT'S NAME (Last Name, First Name, Middle Initial)
Carrie Rodgers


In [11]:
key_value = analyzed_document.get("INSURED POLICY GROUP".lower(), 3)[0]
print(f"{key_value.key} = {key_value.value}")



11. INSURED'S POLICY GROUP OR FECA NUMBER = FUR4398


### Checkbox

In [12]:
analyzed_document.checkboxes

[[ ] PICA,
 [X] PICA,
 [ ] OTHER (ID#),
 [ ] FECA (ID#) IKLUNG,
 [ ] GROUP HEALTH (ID#) PLAN,
 [ ] CHAMPVA (Member ID#),
 [ ] MEDICAID (Medicaid#),
 [ ] TRICARE (ID#/DoD#),
 [X] MEDICARE (Medicare#),
 [ ] F,
 [X] ,
 [ ] Other,
 [ ] Child,
 [ ] Self,
 [ ] Spouse,
 [X] STATE,
 [X] M,
 [ ] F,
 [X] YES,
 [ ] NO,
 [ ] PLACE (State),
 [ ] YES,
 [X] NO,
 [ ] YES,
 [X] NO,
 [X] NO,
 [ ] YES,
 [ ] DD,
 [ ] YES,
 [X] NO,
 [ ] ICD Ind.,
 [ ] H.,
 [ ] L.,
 [X] BIN,
 [ ] 28 TOTAL CHARGE,
 [X] YES,
 [ ] NO]

In [13]:
checkbox = analyzed_document.checkboxes[0]
checkbox.bbox

x: 0.8941406011581421, y: 0.1098247841000557, width: 0.02421991527080536, height: 0.006424476392567158

## Overlay

## Geo Finder

In [3]:
from textractgeofinder.ocrdb import AreaSelection
from textractgeofinder.tgeofinder import KeyValue, TGeoFinder, AreaSelection, SelectionElement
from textractprettyprinter.t_pretty_print import get_forms_string
from textractcaller import call_textract
from textractcaller.t_call import Textract_Features
import trp.trp2 as t2

In [4]:
js: dict = call_textract(
    input_document=s3path_cms1500_pdf.uri,
    features=[
        Textract_Features.FORMS,
        Textract_Features.TABLES,
    ]
)

In [53]:
import json

Path(dir_here, "test.json").write_text(json.dumps(js, indent=4))

2131574

In [5]:
document: t2.TDocument = t2.TDocumentSchema().load(js)
type(document)

trp.trp2.TDocument

In [6]:
doc_width = 2480
doc_height = 3509
geofinder_doc = TGeoFinder(js, doc_height=doc_height, doc_width=doc_width)
geofinder_doc

<textractgeofinder.tgeofinder.TGeoFinder at 0x1338ccd60>

In [10]:
geofinder_doc.__del__()
print("done")

done


In [7]:
key_21_phrase = geofinder_doc.find_phrase_on_page("DIAGNOSIS OR NATURE OF ILLNESS OR INJURY")[0]
key_21_phrase

TWord(text='diagnosis or nature of illness or injury', original_text='DIAGNOSIS OR NATURE OF ILLNESS OR INJURY', text_type='phrase', confidence=99.78182002476284, id='e10c607d-31a4-4f69-acdc-dcc99cbe224e', xmin=94, ymin=1904, xmax=671, ymax=1927, page_number=1, doc_width=2480, doc_height=3509, child_relationships='', reference=None, resolver=None)

In [8]:
from PIL import Image, ImageDraw

def show_bounding_box(path, phrase, fill=None):
    with Image.open(path) as img:
        x, y = img.size
        print(x, y)
        doc_width = key_21_phrase.doc_width
        doc_height = key_21_phrase.doc_height
        draw = ImageDraw.Draw(img)
        xy = [
            phrase.xmin / doc_width * x,
            phrase.ymin / doc_height * y,
            phrase.xmax / doc_width * x,
            phrase.ymax / doc_height * y,
        ]
        draw.rectangle(
            xy=xy,
            outline=128,
            fill=fill,
            width=2,
        )
        img.show()

In [15]:
show_bounding_box(path_cms1500_png.abspath, key_21_phrase)

2480 3509


In [9]:
key_diagnosis_pointer = geofinder_doc.find_phrase_on_page("DIAGNOSIS POINTER")[0]
key_diagnosis_pointer

TWord(text='diagnosis pointer', original_text='DIAGNOSIS POINTER', text_type='phrase', confidence=99.80844497680664, id='35d36ccd-c909-4294-8c25-4ae1f4764aa6', xmin=1326, ymin=2129, xmax=1466, ymax=2181, page_number=1, doc_width=2480, doc_height=3509, child_relationships='', reference=None, resolver=None)

In [18]:
show_bounding_box(path_cms1500_png.abspath, key_diagnosis_pointer)

2480 3509


In [10]:
top_left = t2.TPoint(x=50, y=key_21_phrase.ymin-50)
lower_right = t2.TPoint(x=key_diagnosis_pointer.xmax+50, y=key_diagnosis_pointer.ymin+100)

In [11]:
# a_to_l_fields = geofinder_doc.get_form_fields_in_area(
#     area_selection=AreaSelection(top_left=top_left, lower_right=lower_right, page_number=1)
# )
a_to_l_fields = geofinder_doc.get_form_fields_in_area(
    area_selection=AreaSelection(
        top_left=t2.TPoint(x=0, y=0),
        lower_right=t2.TPoint(x=doc_width, y=doc_height),
        page_number=1,
    )
)
print(len(a_to_l_fields))
for field in sorted(
    a_to_l_fields,
    key=lambda x: x.key.text,
):
    # print(field.key.text, field.value.text)
    # print(field.key.text, field.value)
    print(field.key.text)


115
$ charges
10d. claim codes (designated by nucc)
11. insured's policy group or feca number
17. name of referring provider or other source
19. additional claim information (designated by nucc)
1a. insured's i.d. number
2 patient's name (last name, first name, middle initial)
22. resubmission code
23. prior authorization number
25. federal tax i.d. number
26. patient's account no
28. total charge
29. amount paid
30. rsvd. for nucc use
32. service facility location information
33 billing provider info & ph #
4. insured's name (last name, first name, middle initial)
4/4/2022 date
5 patient's address (no. street)
7. insured's address (no., street)
8. reserved for nucc use
9 other insured's name (last name, first name, middle initial)
a
a.
a. other insured's policy or group number
approved
b
b
b.
b. other claim id (designated by nucc)
b. reserved for nucc use
c.
c. insurance plan name or program name
c. reserved for nucc use
champva (member id#)
child
city
city
d insurance plan name or pr

In [24]:
print(key_21_phrase.xmin, key_diagnosis_pointer.xmin, key_diagnosis_pointer.xmax)

94 1326 1466


In [36]:
# (61.32811039686203, 74.83414195477962, 2069.866504251957, 2083.3725358098745)
print(top_left.x, lower_right.x, top_left.y, lower_right.y)

50 1516 1854 2129


In [29]:
for field in sorted(
    a_to_l_fields,
    key=lambda x: x.key.text,
):
    print(field.key.text, field.value.text)

a s16 1xxa
b m75 42
c. m25512
d. this is david
e this is e
f. this is f
g this is g
h. this is h
icd ind. not_selected
j. this is jack
k this is king
l. this is letter
