# Child Entity Tag Using Header Keyword

* Author: docai-incubator@google.com

# Disclaimer

This tool is not supported by the Google engineering team or product team. It is provided and supported on a best-effort basis by the **DocAI Incubator Team**. No guarantees of performance are implied.

# Purpose and Description

This tool takes labeled Document AI JSON files from a GCS bucket, together with a list of header keywords, as input. For the table column whose header matches a keyword, it tags the values below that header as new child entities.

**Example**
![](./images/objective.png)

All five tokens shown above need to be converted into entities of type `line_item`, each with a child entity of type `total_amount`.
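
The selection rule behind this can be sketched without the Document AI client library. The snippet below is a simplified illustration, not the notebook's implementation: tokens are plain dictionaries (a hypothetical stand-in for `Document.Page.Token`), the padding values mirror the tolerances used in the code cells of step 3, and "contains a digit" is a simplification of the regular expression used there.

```python
from typing import Dict, List, Union

TokenBox = Dict[str, Union[str, float]]  # hypothetical simplified token record


def tokens_under_header(
    tokens: List[TokenBox],
    header: Dict[str, float],
    y_start: float,
    y_stop: float,
    left_pad: float = 0.05,
    right_pad: float = 0.1,
) -> List[TokenBox]:
    """Keep digit-bearing tokens between y_start and y_stop inside the header's x-band."""
    selected = []
    for tok in tokens:
        in_band = (
            tok["min_y"] > y_start
            and tok["max_y"] <= y_stop
            and tok["min_x"] >= header["min_x"] - left_pad
            and tok["max_x"] <= header["max_x"] + right_pad
            and tok["max_x"] > header["min_x"]
        )
        # Simplification: treat any token containing a digit as a candidate amount
        if in_band and any(ch.isdigit() for ch in tok["text"]):
            selected.append(tok)
    return selected


# Normalized coordinates (0..1); the header column spans x = 0.80..0.90
header_box = {"min_x": 0.80, "max_x": 0.90}
page_tokens = [
    {"text": "Widget", "min_x": 0.10, "max_x": 0.20, "min_y": 0.30, "max_y": 0.32},
    {"text": "12.50", "min_x": 0.82, "max_x": 0.88, "min_y": 0.30, "max_y": 0.32},
    {"text": "7.00", "min_x": 0.83, "max_x": 0.88, "min_y": 0.35, "max_y": 0.37},
    {"text": "99.00", "min_x": 0.83, "max_x": 0.88, "min_y": 0.80, "max_y": 0.82},
]
picked = tokens_under_header(page_tokens, header_box, y_start=0.25, y_stop=0.60)
print([tok["text"] for tok in picked])
```

Here `Widget` is rejected because it lies outside the header's x-band, and `99.00` because it lies below `y_stop`, which in the notebook corresponds to the bottom of the existing line items.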

# Prerequisites

1. Vertex AI Notebook
2. Labeled JSON files in a GCS folder


# Step by Step procedure

## 1. Import Modules/Packages

In [None]:
# Download the incubator-tools utilities module to the present working directory
!wget https://raw.githubusercontent.com/GoogleCloudPlatform/document-ai-samples/main/incubator-tools/best-practices/utilities/utilities.py

In [None]:
!pip install google-cloud-documentai google-cloud-storage -q

In [None]:
import re
from collections import defaultdict
from typing import Dict, List, Tuple, Union, DefaultDict

from google.cloud import documentai_v1beta3 as documentai
from google.cloud import storage

import utilities

## 2. Input Details

 * **gcs_input_path:** GCS folder that contains the input JSON files to be processed.
 * **gcs_output_path:** GCS folder where the JSON files are saved after processing.
 * **list_total_amount:** List of header keywords. The values under a matching header are tagged with the child type given in `total_amount_type`.
 * **total_amount_type:** Entity type under which the values are tagged.
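
As a rough illustration of how `list_total_amount` is used: the code in step 3 looks for each keyword, case-insensitively, in the text of the header row and keeps the longest keyword that matches. The helper below is a minimal sketch of that rule only; `pick_header_keyword` and its arguments are hypothetical names, not part of the notebook.

```python
from typing import List, Optional


def pick_header_keyword(header_row_text: str, keywords: List[str]) -> Optional[str]:
    """Return the longest keyword contained (case-insensitively) in the header row."""
    lowered = header_row_text.lower()
    matches = [kw for kw in keywords if kw.lower() in lowered]
    # max() with key=len returns the first of several equally long matches
    return max(matches, key=len) if matches else None


sample_keywords = ["Nettowert", "Nettowert in EUR", "Wert"]
print(pick_header_keyword("Pos Artikel Menge Nettowert in EUR", sample_keywords))
print(pick_header_keyword("Qty Description", sample_keywords))
```

Because this is plain substring matching, short keywords such as `Net` or `cost` also match inside longer words, so it may be worth trimming the list to the headers that actually occur in your documents.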



In [None]:
gcs_input_path = "gs://xx_bucket_xx/path_to/input"
gcs_output_path = "gs://xx_bucket_xx/path_to/output"
# List of header words to match with tokens
list_total_amount = [
    "Total value",
    "Amount",
    "Nettowert",
    "Nettowert in EUR",
    "Wert",
    "Importo",
    "Nettobetrag",
    "Extension",
    "Net value",
    "Ext. price",
    "Extended Amt",
    "Costo riga",
    "Imp. Netto",
    "Summe",
    "Gesamtpreis",
    "Gesamt",
    "Gesamtgewicht",
    "Betrag",
    "Bedrag",
    "Wartość",
    "Wartość netto",
    "Value",
    "TOTAL",
    "Line Total",
    "Net",
    "Net Amount",
    "cost",
    "Subtotal",
]
total_amount_type = "line_item/total_amount"

## 3. Run the Code Cells Below

In [None]:
def split_gcs_folder(path: str, need_trailing_slash: bool = False) -> Tuple[str, str]:
    """Split a GCS URI into its bucket name and path prefix

    Args:
        path (str): GCS URI (for example, gs://bucket_name/path_to/folder)
        need_trailing_slash (bool, optional): If True, "/" is appended to the returned path prefix. Defaults to False.

    Returns:
        Tuple[str, str]: Bucket name and path prefix of the folder/file

    Raises:
        ValueError: If path is not of the form gs://bucket_name/path
    """

    pattern = re.compile(r"gs://(?P<bucket>.*?)/(?P<files_dir>.*)")
    path = path.rstrip("/")
    uri = pattern.match(path)
    if uri is None:
        # Fail with a clear message instead of an AttributeError on None
        raise ValueError(f"Expected a GCS URI like gs://bucket_name/path, got {path!r}")
    bucket, files_dir = uri.group("bucket"), uri.group("files_dir")
    if need_trailing_slash:
        files_dir += "/"
    return bucket, files_dir


def get_total_amount_type(doc: documentai.Document, total_amount_type: str) -> str:
    """Return the final total_amount_type (Entity.type_) to use for the new child entities

    Args:
        doc (documentai.Document): Document proto object
        total_amount_type (str): Initial total amount type given by the user

    Returns:
        str: Final entity type to use for the properties of the new entities
    """

    consider_ent_type = ""
    for ent in doc.entities:
        if not (ent.properties and ent.type_ == "line_item"):
            continue
        for sub_ent in ent.properties:
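            consider_ent_type = (
                "line_item/total_amount"
                if ("line_item" in sub_ent.type_)
                else "total_amount"
            )
    if "/" in consider_ent_type:
        # Existing children use the "line_item/" prefix, so keep it or add it
        if "/" not in total_amount_type:
            return "line_item" + "/" + total_amount_type
        return total_amount_type
    if "/" in total_amount_type:
        # Existing children are unprefixed, so drop the parent prefix
        return total_amount_type.split("/")[-1]
    return total_amount_type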


def get_page_wise_entities(
    doc: documentai.Document,
) -> DefaultDict[int, List[documentai.Document.Entity]]:
    """Group the entities of a Document object by page

    Args:
        doc (documentai.Document): Document proto object

    Returns:
        DefaultDict[int, List[documentai.Document.Entity]]: Dictionary with the page number as key and the list of all entities on that page as value
    """

    entities_page = defaultdict(list)
    for ent in doc.entities:
        page_no = ent.page_anchor.page_refs[0].page
        entities_page[page_no].append(ent)
    return entities_page


def get_all_line_items(
    entities: List[documentai.Document.Entity], entity_type: str = "line_item"
) -> List[documentai.Document.Entity]:
    """Filter the given entities, keeping those of entity_type that have child properties

    Args:
        entities (List[documentai.Document.Entity]): List of entities to filter
        entity_type (str, optional): Entity type to keep. Defaults to "line_item".

    Returns:
        List[documentai.Document.Entity]: Entities that match entity_type and have at least one property
    """

    line_items_all = []
    for entity in entities:
        if entity.properties and entity.type_ == entity_type:
            line_items_all.append(entity)
    return line_items_all


def get_x_y_list(
    bounding_poly: documentai.BoundingPoly,
) -> Tuple[List[float], List[float]]:
    """Split the normalized vertices of a BoundingPoly object into separate x and y lists

    Args:
        bounding_poly (documentai.BoundingPoly): Bounding polygon of a token or entity

    Returns:
        Tuple[List[float], List[float]]: x and y normalized coordinates as separate lists
    """

    x, y = [], []
    normalized_vertices = bounding_poly.normalized_vertices
    for nv in normalized_vertices:
        x.append(nv.x)
        y.append(nv.y)
    return x, y


def get_token(
    doc: documentai.Document,
    page: int,
    text_anchors_check: documentai.Document.TextAnchor.TextSegment,
) -> Dict[str, float]:
    """Get the minimum and maximum xy-coordinates of the tokens covered by a text segment

    Args:
        doc (documentai.Document): Document proto object
        page (int): Index of the page whose tokens are searched
        text_anchors_check (documentai.Document.TextAnchor.TextSegment): TextSegment object which contains the start and end indices

    Returns:
        Dict[str, float]: Minimum and maximum xy-coordinates of the matching token(s)
    """

    x, y = [], []
    start_check = text_anchors_check.start_index - 2
    end_check = text_anchors_check.end_index + 2
    for token in doc.pages[page].tokens:
        text_anc = token.layout.text_anchor.text_segments[0]
        start_temp = text_anc.start_index
        end_temp = text_anc.end_index
        if not (
            (start_temp >= start_check)
            and (end_temp <= end_check)
            and (end_temp - start_temp > 3)
        ):
            continue
        x_acc, y_acc = get_x_y_list(token.layout.bounding_poly)
        x += x_acc
        y += y_acc
    return {"min_x": min(x), "min_y": min(y), "max_x": max(x), "max_y": max(y)}


def remove_selected_items_from_groups_same_y(
    selected_elements: List[
        Dict[str, Union[str, float, documentai.Document.TextAnchor.TextSegment]]
    ],
    groups_same_y: List[
        List[Dict[str, Union[str, float, documentai.Document.TextAnchor.TextSegment]]]
    ],
) -> List[
    List[Dict[str, Union[str, float, documentai.Document.TextAnchor.TextSegment]]]
]:
    """Remove the selected elements from groups_same_y by comparing text anchors

    Args:
        selected_elements (List[Dict[str, Union[str, float, documentai.Document.TextAnchor.TextSegment]]]): Elements chosen to be kept, one per group, each with its text anchor
        groups_same_y (List[List[Dict[str, Union[str, float, documentai.Document.TextAnchor.TextSegment]]]]): Groups of elements that share the same row, as a list of lists

    Returns:
        List[List[Dict[str, Union[str, float, documentai.Document.TextAnchor.TextSegment]]]]: Filtered version of groups_same_y
    """

    text_anchors = [element["text_anc"] for element in selected_elements]
    # Rebuild each group in place; calling remove() while iterating
    # over the same list would skip the item after each removal
    for groups in groups_same_y:
        groups[:] = [item for item in groups if item["text_anc"] not in text_anchors]
    return groups_same_y


def remove_unwanted_line_items(
    line_items_temp: List[documentai.Document.Entity],
    groups_same_y: List[
        List[Dict[str, Union[str, float, documentai.Document.TextAnchor.TextSegment]]]
    ],
) -> List[documentai.Document.Entity]:
    """Remove unwanted line items whose text anchors appear in groups_same_y

    Args:
        line_items_temp (List[documentai.Document.Entity]): List of candidate entities of the Document object
        groups_same_y (List[List[Dict[str, Union[str, float, documentai.Document.TextAnchor.TextSegment]]]]): Text anchors of the entities which need to be removed from the list of line items

    Returns:
        List[documentai.Document.Entity]: Filtered version of the list of entities
    """

    text_anchors = [item["text_anc"] for groups in groups_same_y for item in groups]
    # Filter in place; calling remove() while iterating over the same list
    # would skip the entry that follows each removed entity
    line_items_temp[:] = [
        line_item
        for line_item in line_items_temp
        if line_item.text_anchor.text_segments not in text_anchors
    ]
    return line_items_temp


def get_same_y_entities(
    line_items_temp: List[documentai.Document.Entity],
) -> List[Dict[str, Union[str, float, documentai.Document.TextAnchor.TextSegment]]]:
    """Collect the y-range, left x-coordinate and text anchor of each candidate line item

    Args:
        line_items_temp (List[documentai.Document.Entity]): List of entities of the Document object

    Returns:
        List[Dict[str, Union[str, float, documentai.Document.TextAnchor.TextSegment]]]: One dictionary per entity with its min_y, max_y, min_x and text anchor
    """

    same_y_ent = []
    for dup in line_items_temp:
        temp_same_y = {"min_y": "", "max_y": "", "min_x": "", "text_anc": []}
        x, y = get_x_y_list(dup.page_anchor.page_refs[0].bounding_poly)
        temp_same_y["min_y"] = min(y)
        temp_same_y["max_y"] = max(y)
        temp_same_y["min_x"] = min(x)
        temp_same_y["text_anc"] = dup.text_anchor.text_segments
        same_y_ent.append(temp_same_y)
    return same_y_ent


def get_group_same_y(
    sorted_same_y_ent: List[
        Dict[str, Union[str, float, documentai.Document.TextAnchor.TextSegment]]
    ]
) -> List[
    List[Dict[str, Union[str, float, documentai.Document.TextAnchor.TextSegment]]]
]:
    """Group tokens/entities into rows based on the gap between their y-coordinates

    Args:
        sorted_same_y_ent (List[Dict[str, Union[str, float, documentai.Document.TextAnchor.TextSegment]]]): List of entities sorted by min_y

    Returns:
        List[List[Dict[str, Union[str, float, documentai.Document.TextAnchor.TextSegment]]]]: List of lists, where each inner list holds the entities that fall in the same row based on the difference between their y-coordinates
    """

    groups_same_y = []
    current_group = []
    for data in sorted_same_y_ent:
        if (not current_group) or (data["min_y"] - current_group[-1]["min_y"] < 0.005):
            current_group.append(data)
        else:
            groups_same_y.append(current_group)
            current_group = [data]
    if current_group:
        groups_same_y.append(current_group)
    return groups_same_y


def tag_ref_child_item(
    doc: documentai.Document,
    page: int,
    ent_min_dict: Dict[str, Dict[str, float]],
    consider_ent: str,
    total_amount_type: str,
    min_y_start: float,
    max_stop_y: float,
) -> List[documentai.Document.Entity]:
    """Create line_item entities, each with one child entity, for the tokens that lie under the header

    Args:
        doc (documentai.Document): Document proto object
        page (int): Index of the page whose tokens are searched
        ent_min_dict (Dict[str, Dict[str, float]]): Minimum and maximum xy-coordinates of the header token, keyed by entity name
        consider_ent (str): Key of ent_min_dict to use as the header
        total_amount_type (str): Type assigned to the child entity (property) of each new line item
        min_y_start (float): Minimum y-coordinate to start checking
        max_stop_y (float): Maximum y-coordinate to stop checking

    Returns:
        List[documentai.Document.Entity]: New entities whose tokens fall within the x-coordinate range of the target header
    """

    consider_type = total_amount_type
    line_items_temp = []
    for token in doc.pages[page].tokens:
        pr = documentai.Document.PageAnchor.PageRef(page=page)
        pa = documentai.Document.PageAnchor(page_refs=[pr])
        line_item_ent = documentai.Document.Entity(
            confidence=1.0, type_="line_item", page_anchor=pa
        )
        sub_ent = documentai.Document.Entity(confidence=1.0, type_="", page_anchor=pa)
        x, y = get_x_y_list(token.layout.bounding_poly)
        min_x, max_x = min(x), max(x)
        min_y, max_y = min(y), max(y)
        within_header_coords = (
            (min_y > min_y_start)
            and (min_x >= ent_min_dict[consider_ent]["min_x"] - 0.05)
            and (max_x <= ent_min_dict[consider_ent]["max_x"] + 0.1)
            and (max_y <= max_stop_y)
            and (max_x > ent_min_dict[consider_ent]["min_x"])
        )
        if not within_header_coords:
            continue
        end_index = token.layout.text_anchor.text_segments[0].end_index
        start_index = token.layout.text_anchor.text_segments[0].start_index
        search_pattern = re.compile(r"[0-9\s\\\/]+")
        search_string = (
            doc.text[start_index:end_index].replace(" ", "").replace("\n", "")
        )
        check_pattern = search_pattern.search(search_string)
        if not check_pattern:
            continue
        line_item_ent.mention_text = doc.text[start_index:end_index]
        line_item_ent.page_anchor.page_refs[
            0
        ].bounding_poly.normalized_vertices = (
            token.layout.bounding_poly.normalized_vertices
        )
        line_item_ent.text_anchor.text_segments = token.layout.text_anchor.text_segments
        sub_ent.mention_text = doc.text[start_index:end_index]
        sub_ent.page_anchor.page_refs[
            0
        ].bounding_poly.normalized_vertices = (
            token.layout.bounding_poly.normalized_vertices
        )
        sub_ent.text_anchor.text_segments = token.layout.text_anchor.text_segments
        sub_ent.type_ = consider_type
        line_item_ent.properties = [sub_ent]
        line_items_temp.append(line_item_ent)
    same_y_ent = get_same_y_entities(line_items_temp)
    sorted_same_y_ent = sorted(same_y_ent, key=lambda x: x["min_y"])
    groups_same_y = []
    if sorted_same_y_ent:
        groups_same_y = get_group_same_y(sorted_same_y_ent)
    selected_elements = [
        min(
            lst,
            key=lambda elem: abs(elem["min_x"] - ent_min_dict[consider_ent]["min_x"]),
        )
        for lst in groups_same_y
    ]
    if not groups_same_y:
        return line_items_temp
    groups_same_y = remove_selected_items_from_groups_same_y(
        selected_elements, groups_same_y
    )
    line_items_temp = remove_unwanted_line_items(line_items_temp, groups_same_y)
    return line_items_temp


def get_y_min_max(
    line_items_all: List[documentai.Document.Entity],
) -> Tuple[float, float]:
    """Get the minimum and maximum y-coordinates across all line items on a specific page

    Args:
        line_items_all (List[documentai.Document.Entity]): List of line-item entities on the same page

    Returns:
        Tuple[float, float]: Minimum and maximum y-coordinates of all line items on the page
    """

    min_y_line = 1
    max_y_line = 0
    min_y_child = 1
    for line_item in line_items_all:
        _, y = get_x_y_list(line_item.page_anchor.page_refs[0].bounding_poly)
        min_y_temp = min(y)
        max_y_temp = max(y)
        if max_y_line < max_y_temp:
            max_y_line = max_y_temp
        if not (min_y_line > min_y_temp):
            continue
        min_y_line = min_y_temp
        for child_ent in line_item.properties:
            norm_ver_child = child_ent.page_anchor.page_refs[
                0
            ].bounding_poly.normalized_vertices
            min_y_child_temp = min(vertex.y for vertex in norm_ver_child)
            if min_y_child > min_y_child_temp:
                min_y_child = min_y_child_temp
    return min_y_line, max_y_line


def get_entity_text_anchor(
    doc: documentai.Document, page: int, min_y_line: float, list_total_amount: List[str]
) -> Dict[str, documentai.Document.TextAnchor.TextSegment]:
    """Find header keywords in the tokens just above the first line item and return their TextSegment objects

    Args:
        doc (documentai.Document): Document proto object
        page (int): Index of the page whose tokens are searched
        min_y_line (float): Minimum y-coordinate of the line items on the given page
        list_total_amount (List[str]): List of header words used to identify the column whose values will be tagged with child type `total_amount_type`

    Returns:
        Dict[str, documentai.Document.TextAnchor.TextSegment]: Matched header keyword mapped to the TextSegment of its last occurrence
    """

    check_text = ""
    start_temp = 100000000
    end_temp = 0
    total_amount_textanc = {}
    for token in doc.pages[page].tokens:
        _, y = get_x_y_list(token.layout.bounding_poly)
        max_y_temp_token = max(y)
        min_y_temp_token = min(y)
        if (
            min_y_line >= (max_y_temp_token - 0.02)
            and abs(min_y_line - min_y_temp_token) <= 0.15
        ):
            end_index = token.layout.text_anchor.text_segments[0].end_index
            start_index = token.layout.text_anchor.text_segments[0].start_index
            check_text = check_text + doc.text[start_index:end_index]
            if start_temp > start_index:
                start_temp = start_index
            if end_temp < end_index:
                end_temp = end_index

    header_text = doc.text[start_temp:end_temp].lower()
    for check in list_total_amount:
        if check.lower() not in check_text.lower():
            continue
        # Escape the keyword so characters such as "." are matched literally
        matches = re.finditer(re.escape(check.lower()), header_text)
        starting_indices = [match.start() for match in matches]
        if not starting_indices:
            # The keyword only appears across non-adjacent tokens; skip it
            # rather than fail on max() of an empty list
            continue
        start_index_temp1 = max(starting_indices)
        start_index_1 = start_index_temp1 + start_temp
        end_index_1 = start_index_1 + len(check)
        total_amount_textanc[check] = documentai.Document.TextAnchor.TextSegment(
            start_index=start_index_1, end_index=end_index_1
        )
    return total_amount_textanc


def total_amount_entities(
    doc: documentai.Document, total_amount_type: str, list_total_amount: List[str]
) -> documentai.Document:
    """Append new entities to the Document proto for tokens that fall within the range of the header token

    Args:
        doc (documentai.Document): Document proto object
        total_amount_type (str): Type assigned to the child entity (property) of each new line item
        list_total_amount (List[str]): List of header words used to identify the column whose values will be tagged with child type `total_amount_type`

    Returns:
        documentai.Document: Document proto object, including the newly added entities
    """

    total_amount_type = get_total_amount_type(doc, total_amount_type)
    page_wise_ent = get_page_wise_entities(doc)
    previous_page_headers = ""
    total_amount_entities = []

    for page, ent2 in page_wise_ent.items():
        line_items_all = get_all_line_items(ent2)
        if not line_items_all:
            continue
        if not (len(line_items_all) > 1 or len(line_items_all[0].properties) > 2):
            continue
        min_y_line, max_y_line = get_y_min_max(line_items_all)
        total_amount_textanc = get_entity_text_anchor(
            doc, page, min_y_line, list_total_amount
        )
        final_key = ""
        for key in total_amount_textanc.keys():
            if len(final_key) < len(key):
                final_key = key
        if final_key:
            total_amount_dict = {
                "total_amount": get_token(doc, page, total_amount_textanc[final_key])
            }
            previous_page_headers = total_amount_dict
        else:
            total_amount_dict = previous_page_headers
        if not total_amount_dict:
            continue
        total_amount_line_items = tag_ref_child_item(
            doc,
            page,
            total_amount_dict,
            "total_amount",
            total_amount_type,
            min_y_line,
            max_y_line,
        )
        for item in total_amount_line_items:
            total_amount_entities.append(item)
    for total_en in total_amount_entities:
        doc.entities.append(total_en)
    return doc


def child_entity_tag_using_header(
    gcs_input_path: str,
    gcs_output_path: str,
    list_total_amount: List[str],
    total_amount_type: str,
) -> None:
    """It takes labeled json files in GCS bucket and header words as input and creates a new child entity tagging the values under the header keyword matching

    Args:
        gcs_input_path (str): GCS input uri-path(like-gs://bucket_name/path_to/folder)
        gcs_output_path (str): GCS output uri-path(like-gs://bucket_name/path_to/folder)
        list_total_amount (List[str]): List of header words used to identify the column whose values will be tagged with child type `total_amount_type`
        total_amount_type (str): Its value is set as type for an entity, here for all properties in an entity
    """

    count = 0
    issue_files = {}
    input_bucket, input_dir = split_gcs_folder(
        gcs_input_path, need_trailing_slash=False
    )
    output_bucket, output_dir = split_gcs_folder(
        gcs_output_path, need_trailing_slash=False
    )
    _, file_dict = utilities.file_names(gcs_input_path)
    file_dict = {fn: fp for fn, fp in file_dict.items() if fn.endswith(".json")}
    for filename, filepath in file_dict.items():
        print(f"Process started for - {filename} ...")
        input_filepath = filepath
        doc = utilities.documentai_json_proto_downloader(input_bucket, input_filepath)
        output_filepath = f"{output_dir}/{filename}"
        if not doc.pages or not doc.pages[0].tokens:
            print(f"\tNo Tokens, Unable to parse {input_filepath}")
            issue_files[input_filepath] = "No Tokens"
            count += 1
            document = documentai.Document.to_json(doc)
            # Write to the output folder rather than back to the input path
            utilities.store_document_as_json(document, output_bucket, output_filepath)
            continue
        doc = total_amount_entities(doc, total_amount_type, list_total_amount)
        document = documentai.Document.to_json(doc)
        utilities.store_document_as_json(document, output_bucket, output_filepath)
        print("\tCompleted.")
    print("Processing Completed for all files\n")
    if issue_files:
        print(f"Below {count}-file(s) have some issues to process...")
        print(issue_files)

After executing all the code cells above, run the entry-function cell below to start processing the input files whose child entities need to be tagged using header keywords.

In [None]:
child_entity_tag_using_header(
    gcs_input_path=gcs_input_path,
    gcs_output_path=gcs_output_path,
    list_total_amount=list_total_amount,
    total_amount_type=total_amount_type,
)

## 4. Output

The items located below the matched header keyword are tagged with the given entity type.
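
When several candidate tokens in the same row fall under the header, only one of them ends up tagged. The code in step 3 groups candidates into rows by their top y-coordinate (a gap below 0.005 counts as the same row) and keeps, per row, the candidate whose left edge is closest to the header's left edge. The following is a minimal sketch of that rule on plain dictionaries; `one_candidate_per_row` is a hypothetical name and the records are simplified stand-ins for the entities used in the notebook.

```python
from typing import Dict, List


def one_candidate_per_row(
    candidates: List[Dict[str, float]], header_min_x: float, row_gap: float = 0.005
) -> List[Dict[str, float]]:
    """Group candidates into rows by min_y, then keep the one nearest the header per row."""
    rows: List[List[Dict[str, float]]] = []
    for cand in sorted(candidates, key=lambda c: c["min_y"]):
        # Same row if the top edge is within row_gap of the previous candidate's
        if rows and cand["min_y"] - rows[-1][-1]["min_y"] < row_gap:
            rows[-1].append(cand)
        else:
            rows.append([cand])
    return [min(row, key=lambda c: abs(c["min_x"] - header_min_x)) for row in rows]


row_candidates = [
    {"min_x": 0.70, "min_y": 0.300},  # neighbouring column, same row
    {"min_x": 0.82, "min_y": 0.302},
    {"min_x": 0.83, "min_y": 0.350},
]
kept = one_candidate_per_row(row_candidates, header_min_x=0.80)
print([c["min_x"] for c in kept])
```

The first two candidates share a row, and the one at `min_x = 0.82` wins because it is closer to the header's left edge at 0.80.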

<table>
    <tr>
        <td> Sample JSON-Input file, Entities Count(24)</td>
        <td> <img src="./images/before_child_entity_tag.png" width=800 height=400> </td>
    </tr>
    <tr>
        <td> Sample JSON-Output file, Entities Count(25+4)</td>
        <td> <img src="./images/after_child_entity_tag.png" width=800 height=400>  </td>
    </tr>
</table>

Sample JSON output after tagging child entities: **5 new entities** were added to the Document proto.
<img src="./images/after_child_entity_tag_json_results.png" width=450 height=150></img>
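
To check the result on your own files, one option is to compare entity counts before and after processing. The sketch below works on the parsed JSON as a plain dictionary and assumes the key names `entities`, `type` and `properties`, which is what `Document.to_json` normally emits; `count_child_entities` is a hypothetical helper, and the two small dictionaries stand in for real input and output files.

```python
from typing import Any, Dict


def count_child_entities(document: Dict[str, Any], child_type: str) -> int:
    """Count line_item entities that carry at least one property of child_type."""
    count = 0
    for entity in document.get("entities", []):
        if entity.get("type") != "line_item":
            continue
        if any(prop.get("type") == child_type for prop in entity.get("properties", [])):
            count += 1
    return count


# Stand-ins for json.load() results of an input file and its processed output
before_doc = {
    "entities": [
        {"type": "line_item", "properties": [{"type": "line_item/description"}]},
    ]
}
after_doc = {
    "entities": before_doc["entities"]
    + [{"type": "line_item", "properties": [{"type": "line_item/total_amount"}]}]
}

added = len(after_doc["entities"]) - len(before_doc["entities"])
tagged = count_child_entities(after_doc, "line_item/total_amount")
print(f"entities added: {added}, line items with total_amount: {tagged}")
```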