# Amazon Textract
Amazon Textract makes it easy to add document text detection and analysis to your applications. The Amazon Textract Text Detection API can detect text in a variety of documents including financial reports, medical records, and tax forms. For documents with structured data, you can use the Amazon Textract Document Analysis API to extract text, forms and tables.
Amazon Textract provides synchronous operations for processing small, single-page documents with near real-time responses. It also provides asynchronous operations for processing larger, multipage documents; asynchronous responses are not returned in real time.


## 1. Synchronous operations

Synchronous operations can process **JPEG** and **PNG** format images. Typically these are images of single-page documents that you've scanned.

For Amazon Textract synchronous operations, you can use input documents that are stored in an **Amazon S3 bucket**, or you can pass **base64-encoded image bytes**. 
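
The two input styles map onto two shapes of the `Document` argument. The helper below is a minimal sketch of that choice; `build_document` is an illustrative name, not part of boto3 or of this notebook's helper module. When calling through an AWS SDK such as boto3, the SDK generally handles the base64 encoding of the `Bytes` field, so raw image bytes are passed here.

```python
def build_document(bucket=None, name=None, image_bytes=None):
    """Return the Document argument for a synchronous Textract call.

    Hypothetical helper: pass either raw image bytes, or an S3 bucket
    and object key.
    """
    if image_bytes is not None:
        # Raw bytes; the SDK is assumed to take care of base64 encoding.
        return {"Bytes": image_bytes}
    if bucket and name:
        return {"S3Object": {"Bucket": bucket, "Name": name}}
    raise ValueError("Provide either image_bytes or both bucket and name")
```

The returned dictionary can be passed as `Document=...` to `detect_document_text()` or `analyze_document()`.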

In [None]:
# Detect text in a document stored in an S3 bucket and display polygon boxes around the text, including angled text
import boto3
import numpy as np
import matplotlib.pyplot as plt
import textract.util as tu

%matplotlib inline

In [None]:
textract = boto3.client('textract')
s3_client = boto3.client('s3')

In [None]:
bucket = ''      # S3 bucket where the files to analyze (jpg, png, pdf) are uploaded
document = ''    # File name (object key) of the document

### 1) Detecting Text

Text detection returns only the text detected in a document.
> Sync method:  **detect_document_text()**   
> Async method: **start_document_text_detection()**

The response includes:

- The lines and words of detected text
- The relationships between the lines and words of detected text
- The page that the detected text appears on
- The location of the lines and words of text on the document page

In [None]:
%%time
#process using S3 object
response = textract.detect_document_text(
    Document={'S3Object': {'Bucket': bucket, 'Name': document}})

### 1-1) Post-processing for Detecting Text
 - The location and geometry of items found on a document page, such as lines and words
 - Bounding box: left (X coordinate) and width are given as ratios of the overall page width; top (Y coordinate) and height are given as ratios of the overall page height
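
Because the bounding box values are ratios, they have to be scaled by the image size before drawing. A minimal sketch of that conversion follows; `bbox_to_pixels` is an illustrative name, and the `Left`, `Top`, `Width`, and `Height` keys are the ones Textract uses in a block's `Geometry["BoundingBox"]`.

```python
def bbox_to_pixels(bbox, image_width, image_height):
    """Convert a ratio-based bounding box to (left, top, right, bottom) pixels."""
    left = bbox["Left"] * image_width
    top = bbox["Top"] * image_height
    right = left + bbox["Width"] * image_width
    bottom = top + bbox["Height"] * image_height
    # Round to whole pixels for drawing.
    return tuple(round(v) for v in (left, top, right, bottom))
```

The resulting tuple has the corner order that many drawing APIs expect for a rectangle.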

In [None]:
ori_image, image, blocks = tu.get_sync_detect_document_text(bucket, document, response)

### 1-2) Display the original document

In [None]:
fig_x, fig_y = 20, 15

In [None]:
plt.figure(figsize = (fig_x,fig_y))
plt.imshow(np.array(ori_image))

### 1-3) Display item locations on the document page

In [None]:
plt.figure(figsize = (fig_x,fig_y))
plt.imshow(np.array(image))

### 2) Analyzing Text

Amazon Textract analyzes documents and forms for relationships between detected text. The analysis operations return three categories of extracted content: text, forms, and tables.

> Sync method:  **analyze_document()**   
> Async method: **start_document_analysis()**

The response includes:

- The lines and words of detected text
- The relationships between detected items
- The page that the item was detected on
- The location of the item on the document page
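
For tables, the relationships between detected items form a small tree: a `TABLE` block lists its `CELL` blocks as children, and each cell lists its `WORD` blocks. The sketch below rebuilds a table as a list of rows under that assumption; `table_rows` is an illustrative name, not a function from the notebook's helper module.

```python
def table_rows(table_block, block_map):
    """Rebuild a TABLE block as a list of rows of cell text."""
    cells = {}
    for rel in table_block.get("Relationships", []):
        if rel["Type"] != "CHILD":
            continue
        for cell_id in rel["Ids"]:
            cell = block_map[cell_id]
            if cell["BlockType"] != "CELL":
                continue
            # Collect the words that belong to this cell.
            words = [
                block_map[word_id]["Text"]
                for r in cell.get("Relationships", [])
                if r["Type"] == "CHILD"
                for word_id in r["Ids"]
                if block_map[word_id]["BlockType"] == "WORD"
            ]
            cells[(cell["RowIndex"], cell["ColumnIndex"])] = " ".join(words)
    n_rows = max((r for r, _ in cells), default=0)
    n_cols = max((c for _, c in cells), default=0)
    # RowIndex and ColumnIndex are 1-based.
    return [[cells.get((r, c), "") for c in range(1, n_cols + 1)]
            for r in range(1, n_rows + 1)]
```

`block_map` is assumed to map each block `Id` to its block, for example `{b["Id"]: b for b in response["Blocks"]}`.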

### 2-1) Change Image to Binary

In [None]:
ori_image, image, stream = tu.s3_to_image(bucket, document)
image_binary = stream.getvalue()

### 2-2) Analyze the document

In [None]:
%%time
response = textract.analyze_document(Document={'Bytes': image_binary},
    FeatureTypes=["TABLES", "FORMS"])

### 2-3) Post-processing for Analyzing Document
Post-processing reads the location and geometry of key-value pairs, tables, cells, and selection elements.
 - Bounding box: left (X coordinate) and width are given as ratios of the overall page width; top (Y coordinate) and height are given as ratios of the overall page height
 - Polygon: an array of points that describes a finer-grained outline around a Block object

To convert either one to pixel coordinates, multiply each X coordinate by the document page width and each Y coordinate by the document page height.
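
The scaling rule for polygons can be sketched as follows. `polygon_to_pixels` is an illustrative name; the `X` and `Y` keys are the ones Textract uses for each point in a block's `Geometry["Polygon"]`.

```python
def polygon_to_pixels(polygon, image_width, image_height):
    """Scale ratio-based polygon points to whole-pixel (x, y) tuples."""
    return [
        (round(point["X"] * image_width), round(point["Y"] * image_height))
        for point in polygon
    ]
```

The list of tuples can be handed to a drawing routine to outline rotated or skewed text more tightly than an axis-aligned bounding box would.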

In [None]:
image, blocks = tu.get_sync_analyze_document(image, response)

### 2-4) Results of analyzing text on a document page

In [None]:
page = tu.get_page(blocks)

### 3. Detecting Entities with Amazon Comprehend

In [None]:
tu.detect_entities_for_comprehend(page)

### 4. Extracting Key-Value Pairs from a Form Document

In [None]:
key_map, value_map, block_map = tu.get_kv_map(blocks)

# Get Key Value relationship
kvs = tu.get_kv_relationship(key_map, value_map, block_map)
print("\n\n== FOUND KEY : VALUE pairs ===\n")
tu.print_kvs(kvs)

# Start searching a key value
while input('\n Do you want to search a value for a key? (enter "n" for exit) ') != 'n':
    search_key = input('\n Enter a search key:')
    print('The value is:', tu.search_value(kvs, search_key))
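
The helper functions used in the cell above live in the notebook's own utility module and are not shown here. As a rough, self-contained sketch of the underlying idea (not the module's actual implementation): form fields come back as `KEY_VALUE_SET` blocks, where a block with entity type `KEY` points to its value block through a `VALUE` relationship, and both point to their words through `CHILD` relationships. The function names below are illustrative, and selection elements such as checkboxes are ignored for brevity.

```python
def text_of(block, block_map):
    """Join the WORD children of a block into a single string."""
    words = []
    for rel in block.get("Relationships", []):
        if rel["Type"] == "CHILD":
            for child_id in rel["Ids"]:
                child = block_map[child_id]
                if child["BlockType"] == "WORD":
                    words.append(child["Text"])
    return " ".join(words)


def key_value_pairs(blocks):
    """Return {key text: value text} for every KEY block in the list."""
    block_map = {b["Id"]: b for b in blocks}
    pairs = {}
    for block in blocks:
        is_key = (block["BlockType"] == "KEY_VALUE_SET"
                  and "KEY" in block.get("EntityTypes", []))
        if not is_key:
            continue
        # Follow the VALUE relationship from the key to its value block(s).
        value_ids = [
            value_id
            for rel in block.get("Relationships", [])
            if rel["Type"] == "VALUE"
            for value_id in rel["Ids"]
        ]
        value_text = " ".join(text_of(block_map[v], block_map) for v in value_ids)
        pairs[text_of(block, block_map)] = value_text
    return pairs
```

Calling `key_value_pairs(response["Blocks"])` on an `analyze_document()` response that requested `FORMS` would give a plain dictionary that can be searched by key.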

## Limits

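 - The maximum document image (JPEG/PNG) size is 10 MB.
 - The maximum PDF file size is 500 MB.
 - The maximum number of pages in a PDF file is 3000.
 - The maximum height and width of the PDF media size is 40 inches or 2880 points.
 - The minimum text height is 15 pixels, which at 150 DPI is equivalent to an 8 pt font.
 - Documents can be rotated by at most +/- 10% from the vertical axis. Text can be horizontally aligned within the document.
 - Amazon Textract supports English text detection only.
 - Amazon Textract does not support handwriting detection.
 - Amazon Textract synchronous operations (DetectDocumentText and AnalyzeDocument) support the PNG and JPEG image formats.
 - Asynchronous operations (StartDocumentTextDetection, StartDocumentAnalysis) also support the PDF file format.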
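
Since PDF input goes through the asynchronous operations, the call pattern differs from the synchronous cells above: start a job, wait for it to finish, then page through the results. The sketch below polls for completion, which is the simplest approach for a notebook; a production setup would more likely rely on the completion notification that Textract can publish. `detect_text_async` and its `poll_seconds` and `max_polls` parameters are illustrative names, and the function treats any status other than `SUCCEEDED` as a failure.

```python
import time


def detect_text_async(client, bucket, name, poll_seconds=5, max_polls=120):
    """Run asynchronous text detection on an S3 object and return all blocks."""
    job = client.start_document_text_detection(
        DocumentLocation={"S3Object": {"Bucket": bucket, "Name": name}})
    job_id = job["JobId"]

    # Poll until the job leaves the IN_PROGRESS state.
    for _ in range(max_polls):
        result = client.get_document_text_detection(JobId=job_id)
        if result["JobStatus"] != "IN_PROGRESS":
            break
        time.sleep(poll_seconds)
    else:
        raise TimeoutError(f"Job {job_id} did not finish in time")

    if result["JobStatus"] != "SUCCEEDED":
        raise RuntimeError(f"Job {job_id} ended with status {result['JobStatus']}")

    # Results are paginated; follow NextToken until it is absent.
    blocks = list(result["Blocks"])
    while "NextToken" in result:
        result = client.get_document_text_detection(
            JobId=job_id, NextToken=result["NextToken"])
        blocks.extend(result["Blocks"])
    return blocks
```

With the client created earlier in the notebook, a call would look like `detect_text_async(textract, bucket, document)` for a PDF stored in the bucket.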

## Reference

> [Other Example](https://docs.aws.amazon.com/textract/latest/dg/other-examples.html)  
> [Index your pile of papers with Amazon Textract, Amazon Comprehend and Amazon Elasticsearch Service](https://github.com/aws-samples/workshop-textract-comprehend-es)  
> [amazon-textract-enhancer](https://github.com/aws-samples/amazon-textract-enhancer)  