![https://pieriantraining.com/](../PTCenteredPurple.png)

[Textract](https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/textract/client/get_document_analysis.html) automatically extracts textual content, forms, and tables from scanned documents.

You can use it to [detect](https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/textract/client/detect_document_text.html) text, [analyze](https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/textract/client/analyze_document.html) documents, or [analyze invoices and receipts](https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/textract/client/analyze_expense.html).


In [17]:
import boto3
textract_client = boto3.client('textract', region_name="us-east-1")


In [18]:
document_path = "Test.pdf"  # Sample pdf. Feel free to replace with your image or pdf

### Detect Text
To detect text, call client.detect_document_text(*Document*), where *Document* is a dictionary that holds either the file content under the Bytes key or an S3 location under the S3Object key.

In [19]:
response = textract_client.detect_document_text(
    Document={'Bytes': open(document_path, 'rb').read()})
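
The cell above passes the file as Bytes. As a minimal sketch of the S3Object variant, the helper below builds the *Document* dictionary for either case. The helper name `build_document` and the bucket and key are hypothetical and only serve as an illustration.

```python
def build_document(path=None, bucket=None, key=None):
    """Return the Document argument for Textract's synchronous calls.

    Pass `path` for a local file, or `bucket` and `key` for an object in S3.
    """
    if path is not None:
        # Read the file and close it again before returning
        with open(path, "rb") as f:
            return {"Bytes": f.read()}
    if bucket is not None and key is not None:
        return {"S3Object": {"Bucket": bucket, "Name": key}}
    raise ValueError("Provide either path, or both bucket and key")


# Hypothetical bucket and key, for illustration only
s3_document = build_document(bucket="my-example-bucket", key="scans/Test.pdf")
print(s3_document)
```

The result could then be passed as `textract_client.detect_document_text(Document=s3_document)`, assuming the bucket exists and the client's credentials are allowed to read from it.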


In [20]:
response

{'DocumentMetadata': {'Pages': 1},
 'Blocks': [{'BlockType': 'PAGE',
   'Geometry': {'BoundingBox': {'Width': 1.0,
     'Height': 0.9999853372573853,
     'Left': 0.0,
     'Top': 1.4679527339467313e-05},
    'Polygon': [{'X': 0.0, 'Y': 1.4679527339467313e-05},
     {'X': 1.0, 'Y': 0.0007638687966391444},
     {'X': 1.0, 'Y': 1.0},
     {'X': 0.0, 'Y': 0.9997200965881348}]},
   'Id': 'ab9c2e8c-9401-4ea5-bbfc-f27919136b96',
   'Relationships': [{'Type': 'CHILD',
     'Ids': ['e429cfe1-b2d7-4133-804d-1f028f700f2c',
      'c9de5822-9513-4393-af67-2deb1a46edeb',
      '1c332c2e-90e3-46a4-a7de-746a267c1f41',
      '3d91da3d-f837-4945-a813-f93fe737381c',
      'b47ff749-0d95-4b62-8440-82aa3785f1a6',
      '903ca3eb-28a0-4206-b1a7-30a179bfdec9',
      '82f37056-3da2-40bd-99ae-f3fea84af7dd',
      '3bc15117-df63-42b4-80e2-ca6003b83e33',
      'b95e06e0-059f-495c-9635-d978692f5185',
      '72d9c823-4565-4b23-8515-cba0077ff7e1',
      'ae554852-e5c4-4999-a731-49161951f718',
      'd45e274c-4cad-

To print only the detected text, loop over the blocks and keep the LINE entries:

In [21]:
for item in response['Blocks']:
    if item['BlockType'] == 'LINE':
        print(item['Text'])


1
Sample Text
Here is some sample text with a dummy table:
Column1
Column 2
Column 3
Test 1
Test 2
Test 3
Test 4
Test 5
Test 6
2
John Doe
Here is a phrase about John Doe:
Name: John Doe
Age: 30
1
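
Each block also carries a Confidence score, which can be used to drop uncertain detections. The sketch below filters LINE blocks by that score. The sample blocks are hand-made to mimic the response shape, and the threshold of 90.0 is an arbitrary assumption, not a recommended value.

```python
def confident_lines(blocks, min_confidence=90.0):
    """Return the text of LINE blocks whose Confidence is at least `min_confidence`."""
    return [
        block["Text"]
        for block in blocks
        if block["BlockType"] == "LINE"
        and block.get("Confidence", 0.0) >= min_confidence
    ]


# Hand-made blocks that mimic a Textract response (not real output)
sample_blocks = [
    {"BlockType": "PAGE", "Id": "p1"},
    {"BlockType": "LINE", "Id": "l1", "Text": "Sample Text", "Confidence": 99.5},
    {"BlockType": "LINE", "Id": "l2", "Text": "smudged", "Confidence": 41.0},
    {"BlockType": "WORD", "Id": "w1", "Text": "Sample", "Confidence": 99.7},
]
lines = confident_lines(sample_blocks)
print(lines)
```

With a real response, the call would be `confident_lines(response['Blocks'])`.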


As you can see, the response from detect_document_text contains not only the raw text and the bounding boxes around it, but also a Relationships field.
The Relationships field describes how detected items in the document relate to each other, for example which blocks are the children of a page. This becomes especially important when dealing with more complex document structures, such as forms and tables.

Each block returned by Textract has a BlockType that tells you the type of the block (e.g., PAGE, LINE, WORD, KEY_VALUE_SET, TABLE, CELL, etc.). When you're analyzing forms or tables, understanding the relationships between blocks helps you piece together the structure and content of the detected data. This is particularly useful when working with analyze_document.
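
To make the idea of relationships concrete, here is a minimal sketch that resolves the CHILD ids of a LINE block to its WORD blocks. The blocks are hand-made to mimic the response shape, and the helper name `child_blocks` is an assumption for illustration.

```python
# Hand-made blocks that mimic the shape of a Textract response (not real output)
sample_blocks = [
    {"BlockType": "LINE", "Id": "line-1", "Text": "Sample Text",
     "Relationships": [{"Type": "CHILD", "Ids": ["word-1", "word-2"]}]},
    {"BlockType": "WORD", "Id": "word-1", "Text": "Sample"},
    {"BlockType": "WORD", "Id": "word-2", "Text": "Text"},
]


def child_blocks(block, blocks_by_id):
    """Return the blocks referenced by the CHILD relationships of `block`."""
    children = []
    for relationship in block.get("Relationships", []):
        if relationship["Type"] == "CHILD":
            for child_id in relationship["Ids"]:
                if child_id in blocks_by_id:
                    children.append(blocks_by_id[child_id])
    return children


# Index the blocks by Id so that relationship ids can be looked up directly
blocks_by_id = {block["Id"]: block for block in sample_blocks}
words = [child["Text"] for child in child_blocks(sample_blocks[0], blocks_by_id)]
print(words)
```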

### Analyze Documents
To analyze a document, use client.analyze_document(*Document*, *FeatureTypes*), where *Document* is again either Bytes or an S3Object and *FeatureTypes* is a list containing at least one of the following strings: 'TABLES'|'FORMS'|'QUERIES'|'SIGNATURES'.

- TABLES: Return information about detected tables
- FORMS: Return information about detected forms
- QUERIES: Ask questions about the document
- SIGNATURES: Detect signatures within the document
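
For FORMS, the response describes each form field with KEY_VALUE_SET blocks: a KEY block points to its VALUE block, and both point to their words through CHILD relationships. The following is a rough sketch of how such pairs could be read out. The sample blocks are hand-made, and the helper names `text_of` and `key_value_pairs` are assumptions for illustration.

```python
def text_of(block, blocks_by_id):
    """Join the Text of the CHILD blocks of `block`."""
    words = []
    for relationship in block.get("Relationships", []):
        if relationship["Type"] == "CHILD":
            for child_id in relationship["Ids"]:
                child = blocks_by_id.get(child_id, {})
                if "Text" in child:
                    words.append(child["Text"])
    return " ".join(words)


def key_value_pairs(blocks):
    """Return {key text: value text} built from KEY_VALUE_SET blocks."""
    blocks_by_id = {block["Id"]: block for block in blocks}
    pairs = {}
    for block in blocks:
        is_key = (block["BlockType"] == "KEY_VALUE_SET"
                  and "KEY" in block.get("EntityTypes", []))
        if not is_key:
            continue
        value_text = ""
        for relationship in block.get("Relationships", []):
            if relationship["Type"] == "VALUE":
                value_text = " ".join(
                    text_of(blocks_by_id[value_id], blocks_by_id)
                    for value_id in relationship["Ids"]
                    if value_id in blocks_by_id
                )
        pairs[text_of(block, blocks_by_id)] = value_text
    return pairs


# Hand-made blocks that mimic a FORMS response (not real Textract output)
sample_blocks = [
    {"BlockType": "KEY_VALUE_SET", "Id": "k1", "EntityTypes": ["KEY"],
     "Relationships": [{"Type": "VALUE", "Ids": ["v1"]},
                       {"Type": "CHILD", "Ids": ["w1"]}]},
    {"BlockType": "KEY_VALUE_SET", "Id": "v1", "EntityTypes": ["VALUE"],
     "Relationships": [{"Type": "CHILD", "Ids": ["w2"]}]},
    {"BlockType": "WORD", "Id": "w1", "Text": "Age:"},
    {"BlockType": "WORD", "Id": "w2", "Text": "30"},
]
pairs = key_value_pairs(sample_blocks)
print(pairs)
```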

In [29]:
response = textract_client.analyze_document(
    Document={'Bytes': open(document_path, 'rb').read()},
    FeatureTypes=['TABLES'])




In [33]:
response

{'DocumentMetadata': {'Pages': 1},
 'Blocks': [{'BlockType': 'PAGE',
   'Geometry': {'BoundingBox': {'Width': 1.0,
     'Height': 0.9999853372573853,
     'Left': 0.0,
     'Top': 1.4679526429972611e-05},
    'Polygon': [{'X': 0.0, 'Y': 1.4679526429972611e-05},
     {'X': 1.0, 'Y': 0.0007638687966391444},
     {'X': 1.0, 'Y': 1.0},
     {'X': 0.0, 'Y': 0.99972003698349}]},
   'Id': '914488e1-1fb4-4ded-b4d1-d24aac0e74a7',
   'Relationships': [{'Type': 'CHILD',
     'Ids': ['b86e4c21-e1ef-4a88-aa1b-16e4356bfc3b',
      '15ef5af1-0403-42f7-b7e7-d3f1ff08a1bf',
      '1814023d-9643-4b00-84de-3393c4b4e6f1',
      'b59bfa5a-c981-42cd-a0bc-41085a01c91b',
      '350d9b5f-162e-4101-8769-7af6c23326b3',
      '6f93f6d5-a6c3-4492-a479-373116368549',
      '56b57a3f-fb4f-40f1-bceb-367dbf2839fe',
      '2f805f49-9df0-4d49-86f6-cc60f142da01',
      'eb1d859e-6b12-4053-92ba-83005a4e8cd7',
      '6fac7d24-2f30-4901-8e99-c95897a9fcdb',
      'a1cf7575-82f4-49dc-84f6-2076ebb3dc2f',
      'd704fbf0-9ba4-49

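A CELL block carries no text of its own. To obtain the text inside the cells, we need to follow the Ids in each cell's Relationships to the blocks they reference, so we first index all blocks by Id: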

In [55]:
block_id_map = {block['Id']: block for block in response['Blocks']}


In [74]:
# Loop through all the blocks and find the CELL blocks
for item in response['Blocks']:
    if item['BlockType'] == 'CELL':
        # Get the relationships of the CELL block (empty cells have none)
        for relationship in item.get("Relationships", []):
            # Only CHILD relationships point at the content of the cell
            if relationship['Type'] != 'CHILD':
                continue
            line = ""
            for related_id in relationship['Ids']:
                # Fetch the related block using the block_id_map
                related_block = block_id_map.get(related_id)
                # Skip missing blocks and blocks without text,
                # then concatenate the words to get the whole text in the cell
                if related_block and 'Text' in related_block:
                    line += " " + related_block['Text']
            print(item["RowIndex"], item["ColumnIndex"], line)

1 1  Column1
1 2  Column 2
1 3  Column 3
2 1  Test 1
2 2  Test 2
2 3  Test 3
3 1  Test 4
3 2  Test 5
3 3  Test 6
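
Instead of printing cell by cell, the same lookup can be used to arrange the cells into rows. This is a minimal sketch that assumes the response contains a single table. The helper name `table_to_rows` and the sample blocks are made up for illustration.

```python
def table_to_rows(blocks):
    """Arrange the CELL blocks of a single table into a list of rows."""
    blocks_by_id = {block["Id"]: block for block in blocks}
    cells = {}
    for block in blocks:
        if block["BlockType"] != "CELL":
            continue
        words = []
        for relationship in block.get("Relationships", []):
            if relationship["Type"] != "CHILD":
                continue
            for child_id in relationship["Ids"]:
                child = blocks_by_id.get(child_id, {})
                if "Text" in child:
                    words.append(child["Text"])
        # RowIndex and ColumnIndex are 1-based
        cells[(block["RowIndex"], block["ColumnIndex"])] = " ".join(words)
    if not cells:
        return []
    n_rows = max(row for row, _ in cells)
    n_cols = max(col for _, col in cells)
    return [[cells.get((r, c), "") for c in range(1, n_cols + 1)]
            for r in range(1, n_rows + 1)]


# Hand-made blocks that mimic a 2x2 table (not real Textract output)
sample_blocks = [
    {"BlockType": "CELL", "Id": "c1", "RowIndex": 1, "ColumnIndex": 1,
     "Relationships": [{"Type": "CHILD", "Ids": ["w1"]}]},
    {"BlockType": "CELL", "Id": "c2", "RowIndex": 1, "ColumnIndex": 2,
     "Relationships": [{"Type": "CHILD", "Ids": ["w2", "w3"]}]},
    {"BlockType": "CELL", "Id": "c3", "RowIndex": 2, "ColumnIndex": 1,
     "Relationships": [{"Type": "CHILD", "Ids": ["w4", "w5"]}]},
    {"BlockType": "CELL", "Id": "c4", "RowIndex": 2, "ColumnIndex": 2},
    {"BlockType": "WORD", "Id": "w1", "Text": "Column1"},
    {"BlockType": "WORD", "Id": "w2", "Text": "Column"},
    {"BlockType": "WORD", "Id": "w3", "Text": "2"},
    {"BlockType": "WORD", "Id": "w4", "Text": "Test"},
    {"BlockType": "WORD", "Id": "w5", "Text": "1"},
]
rows = table_to_rows(sample_blocks)
print(rows)
```

For documents with several tables, the cells would first have to be grouped by the TABLE block that lists them as children.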


### Queries
Let's try a query that answers the question: How old is John Doe?

To do so, pass the *QueriesConfig* dictionary, whose "Queries" key holds a list of {"Text": *actual_query*} entries, one per question.

In [83]:
response = textract_client.analyze_document(
    Document={'Bytes': open(document_path, 'rb').read()},
    FeatureTypes=['QUERIES'],
    QueriesConfig={
        "Queries": [
            {"Text": "How old is John Doe?"},
            {"Text": "What is the name of the 30 year old man?"},
        ]    
    }
)


In [84]:
for block in response["Blocks"]:
    if block["BlockType"] == "QUERY_RESULT":
        print(block)

{'BlockType': 'QUERY_RESULT', 'Confidence': 100.0, 'Text': '30', 'Geometry': {'BoundingBox': {'Width': 0.015545766800642014, 'Height': 0.009683375246822834, 'Left': 0.2577763795852661, 'Top': 0.3666277825832367}, 'Polygon': [{'X': 0.2577836811542511, 'Y': 0.3666277825832367}, {'X': 0.2733221650123596, 'Y': 0.36664098501205444}, {'X': 0.273314893245697, 'Y': 0.37631115317344666}, {'X': 0.2577763795852661, 'Y': 0.3762979209423065}]}, 'Id': '31fae192-1416-4b75-b689-05a6f94774a2'}
{'BlockType': 'QUERY_RESULT', 'Confidence': 95.0, 'Text': 'John Doe', 'Geometry': {'BoundingBox': {'Width': 0.06906871497631073, 'Height': 0.010395517572760582, 'Left': 0.270304799079895, 'Top': 0.3509661853313446}, 'Polygon': [{'X': 0.2703125476837158, 'Y': 0.3509661853313446}, {'X': 0.33937349915504456, 'Y': 0.3510245382785797}, {'X': 0.33936595916748047, 'Y': 0.36136171221733093}, {'X': 0.270304799079895, 'Y': 0.3613031804561615}]}, 'Id': '9a4e5970-e22f-4492-af84-e935eb821324'}


Textract correctly returns that John Doe is 30 years old, with a confidence of 100.
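
The QUERY_RESULT blocks alone do not say which question they answer. In the response, each QUERY block links to its results through a relationship of type ANSWER, so the two can be paired as in the sketch below. The sample blocks are hand-made with shortened ids, and the helper name `answers_by_query` is an assumption for illustration.

```python
def answers_by_query(blocks):
    """Map each query text to a list of (answer text, confidence) tuples."""
    blocks_by_id = {block["Id"]: block for block in blocks}
    results = {}
    for block in blocks:
        if block["BlockType"] != "QUERY":
            continue
        answers = []
        for relationship in block.get("Relationships", []):
            if relationship["Type"] != "ANSWER":
                continue
            for answer_id in relationship["Ids"]:
                answer = blocks_by_id.get(answer_id)
                if answer:
                    answers.append((answer.get("Text"), answer.get("Confidence")))
        # A query without an ANSWER relationship maps to an empty list
        results[block["Query"]["Text"]] = answers
    return results


# Hand-made blocks that mimic the response shape (not real Textract output)
sample_blocks = [
    {"BlockType": "QUERY", "Id": "q1", "Query": {"Text": "How old is John Doe?"},
     "Relationships": [{"Type": "ANSWER", "Ids": ["a1"]}]},
    {"BlockType": "QUERY_RESULT", "Id": "a1", "Text": "30", "Confidence": 100.0},
    {"BlockType": "QUERY", "Id": "q2", "Query": {"Text": "Unanswered question?"}},
]
paired = answers_by_query(sample_blocks)
print(paired)
```

With a real response, the call would be `answers_by_query(response["Blocks"])`.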