# Format Based Line Items Extractor (Post-Processing) User Guide


* Author: docai-incubator@google.com

## Disclaimer

This tool is not supported by the Google engineering team or product team. 
It is provided and supported on a best-effort basis by the DocAI Incubator Team.
No guarantees of performance are implied. 


## Objective

This document provides functions for tagging line items in documents of a specific format
for which the default processor fails to extract the line item entities.



## Note

* This tool tags entities from the OCR output: the text below each key of headers_entities is tagged as a child entity of the type given by that key's value.
* If a line item spans multiple lines, the tool does not give the desired result and the output will be messy.
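
The column matching described in the note above can be sketched as follows. This is a minimal illustration rather than the tool's actual code: `assign_column` is a hypothetical helper, the boxes are plain dictionaries of normalized coordinates, and the 0.002 tolerance mirrors the value used for the reference column later in this guide.

```python
from typing import Dict, Optional


def assign_column(
    token_box: Dict[str, float],
    header_boxes: Dict[str, Dict[str, float]],
    tolerance: float = 0.002,
) -> Optional[str]:
    """Return the header whose horizontal span contains the token, if any."""
    for header, box in header_boxes.items():
        # The token must sit below the header row ...
        below_header = token_box["min_y"] > box["min_y"]
        # ... and inside the header's horizontal span, with a small tolerance.
        inside_span = (
            token_box["min_x"] >= box["min_x"] - tolerance
            and token_box["max_x"] <= box["max_x"] + tolerance
        )
        if below_header and inside_span:
            return header
    return None


# Hypothetical header boxes and one token, all in normalized coordinates.
header_boxes = {
    "QTY": {"min_x": 0.05, "max_x": 0.10, "min_y": 0.30, "max_y": 0.32},
    "Amount": {"min_x": 0.80, "max_x": 0.90, "min_y": 0.30, "max_y": 0.32},
}
token = {"min_x": 0.81, "max_x": 0.89, "min_y": 0.40, "max_y": 0.42}
column = assign_column(token, header_boxes)
print(column)
```

A token that is wider than its header, or that wraps onto a second line, will not satisfy this test cleanly, which is why multi-line line items give poor results.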


## Prerequisites

* Vertex AI Notebook or Colab (if using Colab, authenticate first)
* Storage bucket for storing the input and output JSON files
* Permissions for Google Cloud Storage and Vertex AI Notebook



## Step by Step procedure

### 1. Importing Required Modules

In [None]:
!pip install google-cloud-storage google-cloud-documentai==2.16.0 tqdm
!wget https://raw.githubusercontent.com/GoogleCloudPlatform/document-ai-samples/main/incubator-tools/best-practices/utilities/utilities.py

In [3]:
from google.cloud import storage
from utilities import *
import json
from tqdm import tqdm
from pprint import pprint
import re
from typing import Dict, List, Any, Tuple

### 2. Input and Output Paths

In [1]:
project_id = "xxxx-xxxx-xxxx"  # project id
gcs_input_path = "gs://xxxx/xxxx/xxx"  # path where the parsed jsons are stored
gcs_output_path = "gs://xxxx/xxxx/xxx/"  # path to save the updated jsons
headers = "QTY EQUIPMENT Min Day Week Month Amount"  # sample headers
# header entities with corresponding headers
headers_entities = {
    "QTY": "line_item/quantity",
    "EQUIPMENT": "line_item/description",
    "Min": "line_item/unit_price",
    "Day": "line_item/unit_price",
    "Week": "line_item/unit_price",
    "Month": "line_item/unit_price",
    "4 Week": "line_item/unit_price",
    "Amount": "line_item/amount",
}
stop_word = "SALES ITEMS"  # stop word where the line items should be stopped
consider_ent = "Amount"  # reference entity which has to be tagged as first or present in all the line items.

### Format specific input 

* To extract the line items from a specific invoice format, enter the details below, taken from that format.

<img src="./Images/Input.png" width=800 height=400></img>

### Headers:
* Provide the column headers of the invoice as a single space-separated string, as in the example below.

**headers='QTY EQUIPMENT Min Day Week Month Amount'**
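
As a rough illustration of how such a header string can be located in the OCR text, the sketch below searches for a span that starts with the first header word and ends with the last one. `find_header_span` and the sample OCR text are assumptions made for this example; the 50-character window mirrors the lookup used in the code later in this guide.

```python
import re
from typing import Optional, Tuple


def find_header_span(text: str, headers: str) -> Optional[Tuple[int, int]]:
    """Return (start, end) of the header row in `text`, or None if absent."""
    words = headers.split(" ")
    pattern = r"{}.*{}".format(re.escape(words[0]), re.escape(words[-1]))
    match = re.search(pattern, text, flags=re.DOTALL)
    if match is None:
        return None
    start = match.start()
    # Look for the last header word close to the first one.
    offset = text[start : start + 50].find(words[-1])
    if offset == -1:
        return None
    return start, start + offset + len(words[-1])


headers = "QTY EQUIPMENT Min Day Week Month Amount"
# Hypothetical OCR text, for illustration only.
ocr_text = "INVOICE 1001\nQTY EQUIPMENT Min Day Week Month Amount\n1 Drill 5.00"
span = find_header_span(ocr_text, headers)
print(ocr_text[span[0] : span[1]])
```

Because only the first and last words anchor the search, the header string should start and end with words that appear exactly as printed on the document.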

### Headers_entities:

* Provide the entity type that corresponds to each header as a dictionary. Text found under a header is tagged with the entity type mapped to that header.

**headers_entities={'QTY':'line_item/quantity','EQUIPMENT':'line_item/description','Min':'line_item/unit_price','Day':'line_item/unit_price','Week':'line_item/unit_price','Month':'line_item/unit_price','4 Week':'line_item/unit_price','Amount':'line_item/amount'}**
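
Before running the tool, it can help to check that the three inputs agree with each other. The helper below is a hypothetical sketch (`check_header_config` is not part of the tool): it reports header words without a mapping, a reference entity that is not a key of the dictionary, and a missing description column, which the tagging code later in this guide relies on.

```python
from typing import Dict, List


def check_header_config(
    headers: str, headers_entities: Dict[str, str], consider_ent: str
) -> List[str]:
    """Return a list of problems; an empty list means the inputs look consistent."""
    problems = []
    for word in headers.split(" "):
        if word not in headers_entities:
            problems.append(f"header '{word}' has no entity mapping")
    if consider_ent not in headers_entities:
        problems.append(f"reference entity '{consider_ent}' is not a header key")
    description_types = {"line_item/description", "description"}
    if not description_types & set(headers_entities.values()):
        problems.append("no header is mapped to a description entity")
    return problems


headers = "QTY EQUIPMENT Min Day Week Month Amount"
headers_entities = {
    "QTY": "line_item/quantity",
    "EQUIPMENT": "line_item/description",
    "Min": "line_item/unit_price",
    "Day": "line_item/unit_price",
    "Week": "line_item/unit_price",
    "Month": "line_item/unit_price",
    "4 Week": "line_item/unit_price",
    "Amount": "line_item/amount",
}
problems = check_header_config(headers, headers_entities, "Amount")
print(problems or "configuration looks consistent")
```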

### Stop_word:
* The stop word marks where the line items end. If no stop word is needed, leave it empty; the function then scans from the headers to the end of the page.

**stop_word='SALES ITEMS'**
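
The effect of the stop word can be pictured as a vertical cut-off. In the sketch below, `keep_tokens_above_stop` and the token values are hypothetical; the default of 1.0 stands for the bottom of the page in normalized coordinates, which is the value the code in this guide falls back to when no stop word is found.

```python
from typing import Dict, List


def keep_tokens_above_stop(tokens: List[Dict], max_stop_y: float = 1.0) -> List[Dict]:
    """Keep tokens whose bottom edge lies above the stop word's Y coordinate."""
    return [token for token in tokens if token["max_y"] < max_stop_y]


# Hypothetical tokens with the Y coordinate of their bottom edge.
tokens = [
    {"text": "Drill", "max_y": 0.42},
    {"text": "Ladder", "max_y": 0.47},
    {"text": "Gloves", "max_y": 0.71},  # printed below the "SALES ITEMS" heading
]
rental_only = keep_tokens_above_stop(tokens, max_stop_y=0.60)
whole_page = keep_tokens_above_stop(tokens)
print([token["text"] for token in rental_only])
```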

### Reference entity:
 * Specify the entity that should be tagged first, or that is present in every line item; this improves the tool's results.

**consider_ent='Amount'**
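
The reference entity defines the rows: every value found under its header starts one line item, and values on nearly the same baseline are merged. A minimal sketch of that grouping, assuming plain dictionaries and a hypothetical `group_rows` helper (the 0.005 threshold matches the one used in the code later in this guide):

```python
from typing import Dict, List


def group_rows(items: List[Dict], threshold: float = 0.005) -> List[List[Dict]]:
    """Group items whose min_y values are within `threshold` of the previous item."""
    ordered = sorted(items, key=lambda item: item["min_y"])
    groups: List[List[Dict]] = []
    for item in ordered:
        if groups and item["min_y"] - groups[-1][-1]["min_y"] < threshold:
            groups[-1].append(item)  # same row as the previous item
        else:
            groups.append([item])  # start a new row
    return groups


# Hypothetical tokens from the "Amount" column; the first amount was split by OCR.
amount_tokens = [
    {"text": "75.00", "min_y": 0.450},
    {"text": "1,250", "min_y": 0.400},
    {"text": ".00", "min_y": 0.402},
]
rows = group_rows(amount_tokens)
print([[item["text"] for item in row] for row in rows])
```

A column that is filled in for every line item, such as the amount, makes a good reference because each of its rows then corresponds to exactly one line item.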

### 3. Run the Code

In [None]:
# functions
def get_page_wise_entities(json_dict: documentai.Document):
    """
    Extracts entities from each page in the given loaded JSON file.

    Args:
    - json_dict (Dict[str, Any]): Loaded JSON file containing entities.

    Returns:
    - Dict[str, List[Any]]: A dictionary where keys are page identifiers
      and values are lists of entities associated with each page.
    """

    entities_page = {}
    for entity in json_dict.entities:
        page = entity.page_anchor.page_refs[0].page
        if page in entities_page.keys():
            entities_page[page].append(entity)
        else:
            entities_page[page] = [entity]

    return entities_page


def get_text_anc_headers(
    json_dict: documentai.Document,
    page: int,
    headers: str,
    headers_entities: dict = None,
):
    """
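    Searches the document text for the header row, using the first and last
    header words as anchors, and returns the bounding box of each header token.

    Args:
    - json_dict (documentai.Document): Loaded document.
    - page (int): Zero-based page number.
    - headers (str): Space-separated header row (also used for the stop word).
    - headers_entities (dict, optional): Maps header keywords to entity types.
      When omitted, every word in `headers` is looked up instead.

    Returns:
    - Dict[str, Dict[str, float]]: {'header_keyword': {'min_x': ..., 'min_y': ...,
      'max_x': ..., 'max_y': ...}}

    Raises:
    - ValueError: If the header row or its tokens cannot be located.
    """

    header_words = headers.split(" ")
    first_word, last_word = header_words[0], header_words[-1]
    if not first_word or not last_word:
        raise ValueError("headers must not be empty")
    pattern = r"{}.*{}".format(re.escape(first_word), re.escape(last_word))
    match = re.search(pattern, json_dict.text, flags=re.DOTALL)
    if match is None:
        raise ValueError(f"Headers {headers!r} not found in the document text")
    start = match.start()
    # The last header word is expected within 50 characters of the first one.
    end_temp = json_dict.text[start : start + 50].find(last_word)
    end = start + end_temp + len(last_word)
    if end - start < len(headers):
        raise ValueError(f"Headers {headers!r} do not match the document text")
    header_text = json_dict.text[start:end].lower()
    # Without a mapping (e.g. for the stop word), look up each header word.
    columns = list(headers_entities.keys()) if headers_entities else header_words
    ent_index = {}
    for col in columns:
        start_temp = header_text.find(col.lower())
        if start_temp == -1:
            continue
        # The extra character covers the whitespace that follows the token.
        end_temp = start_temp + start + len(col) + 1
        ent_index[col] = {
            "start_index": str(start_temp + start),
            "end_index": str(end_temp),
        }
    ent_min_dict = {}
    for col, anc in ent_index.items():
        try:
            ent_min_dict[col] = get_token_from_text_anc(json_dict, page, anc)
        except Exception:
            # Skip header words whose token cannot be located on this page.
            continue
    if not ent_min_dict:
        raise ValueError(f"No tokens found for headers {headers!r} on page {page}")

    return ent_min_dict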


def get_token_xy(token: Any) -> Tuple[float, float, float, float]:
    """
    Extracts the normalized bounding box coordinates (min_x, min_y, max_x, max_y) of a token.

    Args:
    - token (Any): A token object with layout information.

    Returns:
    - Tuple[float, float, float, float]: The normalized bounding box coordinates.

    """
    vertices = token.layout.bounding_poly.normalized_vertices
    minx_token, miny_token = min(point.x for point in vertices), min(
        point.y for point in vertices
    )
    maxx_token, maxy_token = max(point.x for point in vertices), max(
        point.y for point in vertices
    )

    return minx_token, miny_token, maxx_token, maxy_token


def get_token_from_text_anc(
    json_dict: documentai.Document, page_num: int, text_anchors_check: Dict[str, str]
) -> Dict[str, float]:
    """
    Extracts the x and y coordinates of a token based on the provided text anchors.

    Args:
    - json_dict (Dict[str, Any]): Loaded JSON.
    - page_num (int): Page number.
    - text_anchors_check (Dict[str, str]): Text anchors to check for.

    Returns:
    - Dict[str, float]: Dictionary containing x and y coordinates {'min_x': float,
    'min_y': float, 'max_x': float, 'max_y': float}.
    """

    start_check = int(text_anchors_check["start_index"])
    end_check = int(text_anchors_check["end_index"])
    bbox = None
    for page in json_dict.pages:
        if int(page_num) != int(page.page_number - 1):
            continue
        for token in page.tokens:
            for seg in token.layout.text_anchor.text_segments:
                # The anchors to check are stored as strings, so compare as
                # integers and allow a drift of up to 2 characters per side.
                if (
                    abs(int(seg.start_index) - start_check) <= 2
                    and abs(int(seg.end_index) - end_check) <= 2
                ):
                    bbox = get_token_xy(token)

    if bbox is None:
        raise ValueError("No token found for the given text anchor")
    minx_token, miny_token, maxx_token, maxy_token = bbox

    return {
        "min_x": minx_token,
        "min_y": miny_token,
        "max_x": maxx_token,
        "max_y": maxy_token,
    }


def get_entity_new(
    mt_new: str,
    norm_ver: List[Dict[str, float]],
    text_seg: List[Dict[str, Any]],
    type_line: str,
    line_item: bool,
) -> Dict[str, Any]:
    """
    Generates a new entity based on the provided parameters.

    Args:
    - mt_new (str): Mention text.
    - norm_ver (List[Dict[str, float]]): Normalized vertices.
    - text_seg (List[Dict[str, Any]]): Text segments.
    - type_line (str): Type of the entity.
    - line_item (bool): True if it's a line item entity, False otherwise.

    Returns:
    - Dict[str, Any]: The generated entity.
    """

    entity = {
        "confidence": 1,
        "mention_text": mt_new,
        "page_anchor": {
            "page_refs": [{"bounding_poly": {"normalized_vertices": norm_ver}}]
        },
        "text_anchor": {"text_segments": text_seg},
        "type_": type_line,
    }
    if line_item:
        # Parent line items collect their child entities here.
        entity["properties"] = []
    return entity


def tag_ref_child_item(
    json_dict: documentai.Document,
    page: int,
    ent_min_dict: Dict[str, Dict[str, float]],
    consider_ent: str,
    max_stop_y: float,
) -> List[Dict[str, Any]]:
    """
    Tags the reference column: every token that sits under the reference header
    and above the stop word becomes a line item with one child entity.

    Args:
    - json_dict (documentai.Document): Loaded document.
    - page (int): Zero-based page number.
    - ent_min_dict (Dict[str, Dict[str, float]]): Header keywords mapped to
      their X and Y coordinates.
    - consider_ent (str): Reference header keyword, e.g. "Amount".
    - max_stop_y (float): Y coordinate of the stop word (1 when there is none).

    Returns:
    - List[Dict[str, Any]]: Line items, each holding the reference child entity.
    """
    # TODO (from the original author): add a condition to check whether the
    # amount is similar to the other entities.
    # Process the requested page; the loop variable below reuses the name `page`.
    page_num = page
    # Note: headers_entities is read from the notebook's global scope.
    consider_type = headers_entities[consider_ent]
    line_items_temp = []
    for page in json_dict.pages:
        if int(page_num) == int(page.page_number - 1):
            for token in page.tokens:
                min_x, min_y, max_x, max_y = get_token_xy(token)
                norm_ver11 = [
                    {"x": min_x, "y": min_y},
                    {"x": min_x, "y": max_y},
                    {"x": max_x, "y": min_y},
                    {"x": max_x, "y": max_y},
                ]
                if (
                    min_y > ent_min_dict[consider_ent]["min_y"]
                    and min_x >= ent_min_dict[consider_ent]["min_x"] - 0.002
                    and max_x <= ent_min_dict[consider_ent]["max_x"] + 0.002
                    and max_y < max_stop_y
                ):
                    for seg in token.layout.text_anchor.text_segments:
                        end_index = seg.end_index
                        start_index = seg.start_index
                    line_item_ent = get_entity_new(
                        json_dict.text[int(start_index) : int(end_index)],
                        norm_ver11,
                        [{"start_index": start_index, "end_index": end_index}],
                        "line_item",
                        True,
                    )
                    sub_ent = get_entity_new(
                        json_dict.text[int(start_index) : int(end_index)],
                        norm_ver11,
                        [{"start_index": start_index, "end_index": end_index}],
                        consider_type,
                        False,
                    )
                    line_item_ent["properties"].append(sub_ent)
                    line_items_temp.append(line_item_ent)
    same_y_ent = []
    for dup in line_items_temp:
        temp_same_y = {"mention_text": "", "min_y": "", "max_y": "", "text_anc": []}
        temp_same_y["mention_text"] = dup["mention_text"]
        temp_norm_same_y = dup["page_anchor"]["page_refs"][0]["bounding_poly"]
        temp_same_y["min_y"] = min(
            vertex["y"] for vertex in temp_norm_same_y["normalized_vertices"]
        )
        temp_same_y["max_y"] = max(
            vertex["y"] for vertex in temp_norm_same_y["normalized_vertices"]
        )
        temp_same_y["text_anc"] = dup["text_anchor"]["text_segments"]
        same_y_ent.append(temp_same_y)
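    if not same_y_ent:
        # No token fell under the reference column, so there is nothing to group.
        return line_items_temp
    sorted_same_y_ent = sorted(same_y_ent, key=lambda x: x["min_y"])
    groups_same_y = []
    current_group = [sorted_same_y_ent[0]]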

    for i in range(1, len(sorted_same_y_ent)):
        if sorted_same_y_ent[i]["min_y"] - current_group[-1]["min_y"] < 0.005:
            current_group.append(sorted_same_y_ent[i])
        else:
            groups_same_y.append(current_group)
            current_group = [sorted_same_y_ent[i]]

    # Append the last group
    groups_same_y.append(current_group)

    if len(groups_same_y) != 0:
        for group in groups_same_y:
            merge_mention_text = ""
            merge_text_anc = []
            merge_page_anc_xy = {"x": [], "y": []}
            merge_type = ""
            for dup1 in group:
                # Iterate over a copy because matching items are removed below.
                for dup2 in list(line_items_temp):
                    if dup2["text_anchor"]["text_segments"] == dup1["text_anc"]:
                        merge_mention_text = merge_mention_text + dup2["mention_text"]
                        for anch2 in dup2["text_anchor"]["text_segments"]:
                            merge_text_anc.append(anch2)
                        norm_dup = dup2["page_anchor"]["page_refs"][0]["bounding_poly"][
                            "normalized_vertices"
                        ]
                        for norm_dup_xy in norm_dup:
                            merge_page_anc_xy["x"].append(norm_dup_xy["x"])
                            merge_page_anc_xy["y"].append(norm_dup_xy["y"])
                        line_items_temp.remove(dup2)
            dup_minx, dup_miny, dup_maxx, dup_maxy = (
                min(merge_page_anc_xy["x"]),
                min(merge_page_anc_xy["y"]),
                max(merge_page_anc_xy["x"]),
                max(merge_page_anc_xy["y"]),
            )
            dup_norm_ver = [
                {"x": dup_minx, "y": dup_miny},
                {"x": dup_minx, "y": dup_maxy},
                {"x": dup_maxx, "y": dup_miny},
                {"x": dup_maxx, "y": dup_maxy},
            ]
            line_item_ent3 = get_entity_new(
                merge_mention_text, dup_norm_ver, merge_text_anc, "line_item", True
            )
            sub_ent3 = get_entity_new(
                merge_mention_text, dup_norm_ver, merge_text_anc, consider_type, False
            )
            line_item_ent3["properties"].append(sub_ent3)
            line_items_temp.append(line_item_ent3)

    return line_items_temp


def tagging_rest_child(
    json_dict: documentai.Document,
    page_num: int,
    line_items_temp: List[Dict[str, Any]],
    headers_entities: Dict[str, Any],
    ent_min_dict: Dict[str, Dict[str, float]],
    consider_ent: str,
) -> List[Dict[str, Any]]:
    """
    Tags the remaining child entities of each line item, using the header
    positions and the reference entities that were tagged first.

    Args:
    - json_dict (documentai.Document): Loaded document.
    - page_num (int): Zero-based page number.
    - line_items_temp (List[Dict[str, Any]]): Line items holding the reference entity.
    - headers_entities (Dict[str, Any]): Header keywords mapped to entity types.
    - ent_min_dict (Dict[str, Dict[str, float]]): Header keywords mapped to
      their X and Y coordinates.
    - consider_ent (str): Reference header keyword, which is skipped here.

    Returns:
    - List[Dict[str, Any]]: Updated list of line items with tagged child items.
    """
    desired_values = ["line_item/description", "description"]

    # Get keys that have the desired value
    matching_keys = [
        key for key, value in headers_entities.items() if value in desired_values
    ]

    for line_item in line_items_temp:
        sub_ent_temp = []
        for sub in line_item["properties"]:
            normalized_vertices = sub["page_anchor"]["page_refs"][0]["bounding_poly"]
            min_x, min_y = min(
                (vertex["x"], vertex["y"])
                for vertex in normalized_vertices["normalized_vertices"]
            )
            max_x, max_y = max(
                (vertex["x"], vertex["y"])
                for vertex in normalized_vertices["normalized_vertices"]
            )
            for en1, min_xy in ent_min_dict.items():
                temp_mention_text = ""
                temp_page_anchor = {"x": [], "y": []}
                temp_text_anchor = []
                if en1 != consider_ent:
                    for page in json_dict.pages:
                        if int(page_num) == int(page.page_number - 1):
                            for token in page.tokens:
                                (
                                    min_x_token,
                                    min_y_token,
                                    max_x_token,
                                    max_y_token,
                                ) = get_token_xy(token)
                                if (
                                    en1 != matching_keys[0]
                                    and min_xy["min_x"] >= min_x_token - 0.02
                                    and min_xy["max_x"] <= max_x_token + 0.005
                                    and abs(min_y - min_y_token) <= 0.005
                                ) or (
                                    en1 == matching_keys[0]
                                    and min_xy["min_x"] <= min_x_token
                                    and min_xy["max_x"] >= max_x_token - 0.35
                                    and abs(min_y - min_y_token) <= 0.005
                                ):
                                    for seg in token.layout.text_anchor.text_segments:
                                        end_index = seg.end_index
                                        start_index = seg.start_index
                                    temp_text_anchor.append(
                                        {
                                            "start_index": start_index,
                                            "end_index": end_index,
                                        }
                                    )
                                    temp_page_anchor["x"].extend(
                                        [min_x_token, max_x_token]
                                    )
                                    temp_page_anchor["y"].extend(
                                        [min_y_token, max_y_token]
                                    )
                                    temp_mention_text = (
                                        temp_mention_text
                                        + json_dict.text[
                                            int(start_index) : int(end_index)
                                        ]
                                    )
                if temp_mention_text != "":
                    norm_vertices = [
                        {
                            "x": min(temp_page_anchor["x"]),
                            "y": min(temp_page_anchor["y"]),
                        },
                        {
                            "x": min(temp_page_anchor["x"]),
                            "y": max(temp_page_anchor["y"]),
                        },
                        {
                            "x": max(temp_page_anchor["x"]),
                            "y": min(temp_page_anchor["y"]),
                        },
                        {
                            "x": max(temp_page_anchor["x"]),
                            "y": max(temp_page_anchor["y"]),
                        },
                    ]
                    sub_ent = get_entity_new(
                        temp_mention_text,
                        norm_vertices,
                        temp_text_anchor,
                        headers_entities[en1],
                        False,
                    )
                    sub_ent_temp.append(sub_ent)
        for item in sub_ent_temp:
            line_item["properties"].append(item)
        line_item_mention_text = ""
        line_item_page_anchor = {"x": [], "y": []}
        line_item_text_anchor = []
        for sub1 in line_item["properties"]:
            line_item_mention_text = line_item_mention_text + sub1["mention_text"]
            for anch1 in sub1["text_anchor"]["text_segments"]:
                line_item_text_anchor.append(anch1)
            norm_temp = sub1["page_anchor"]["page_refs"][0]["bounding_poly"][
                "normalized_vertices"
            ]
            for i in norm_temp:
                line_item_page_anchor["x"].append(i["x"])
                line_item_page_anchor["y"].append(i["y"])
            min_line_x, min_line_y, max_line_x, max_line_y = (
                min(line_item_page_anchor["x"]),
                min(line_item_page_anchor["y"]),
                max(line_item_page_anchor["x"]),
                max(line_item_page_anchor["y"]),
            )
        line_norm_ver = [
            {"x": min_line_x, "y": min_line_y},
            {"x": min_line_x, "y": max_line_y},
            {"x": max_line_x, "y": min_line_y},
            {"x": max_line_x, "y": max_line_y},
        ]
        line_item["page_anchor"]["page_refs"][0]["bounding_poly"][
            "normalized_vertices"
        ] = line_norm_ver
        line_item["text_anchor"]["text_segments"] = line_item_text_anchor
        line_item["mention_text"] = line_item_mention_text

    return line_items_temp


def tag_description_bw_regions(
    json_dict: documentai.Document,
    page_num: int,
    line_items_temp: List[Dict[str, Any]],
    max_stop_y: float,
) -> List[Dict[str, Any]]:
    """
    Tags the OCR text found between consecutive line items (and between the
    last line item and the stop word) as line_item/description.

    Args:
    - json_dict (documentai.Document): Loaded document.
    - page_num (int): Zero-based page number.
    - line_items_temp (List[Dict[str, Any]]): Line items tagged so far.
    - max_stop_y (float): Max Y of the stop word (1 when there is none).

    Returns:
    - List[Dict[str, Any]]: Updated line items with tagged descriptions between them.
    """

    region = []
    region_line_item = []
    for n1 in range(len(line_items_temp)):
        norm_temp_1 = line_items_temp[n1]["page_anchor"]["page_refs"][0][
            "bounding_poly"
        ]["normalized_vertices"]
        y_min_1 = min(vertex["y"] for vertex in norm_temp_1)
        y_max_1 = max(vertex["y"] for vertex in norm_temp_1)
        if n1 < len(line_items_temp) - 1:
            norm_temp_2 = line_items_temp[n1 + 1]["page_anchor"]["page_refs"][0][
                "bounding_poly"
            ]["normalized_vertices"]
            y_min_2 = min(vertex["y"] for vertex in norm_temp_2)
            y_max_2 = max(vertex["y"] for vertex in norm_temp_2)
            region.append({"min_y": y_max_1, "max_y": y_min_2})
            region_line_item.append(({"min_y_1": y_min_1, "min_y_2": y_min_2}))
        else:
            if max_stop_y != 1:
                region.append({"min_y": y_max_1, "max_y": max_stop_y - 0.01})
                region_line_item.append(({"min_y_1": y_min_1, "min_y_2": max_stop_y}))
    line_desc_bw_regions = []
    for reg in region:
        temp_text = ""
        desc_text_anc = []
        desc_page_anc_xy = {"x": [], "y": []}
        for page in json_dict.pages:
            if int(page_num) == int(page.page_number - 1):
                for token1 in page.tokens:
                    (
                        min_x_token_1,
                        min_y_token_1,
                        max_x_token_1,
                        max_y_token_1,
                    ) = get_token_xy(token1)
                    if (
                        min_y_token_1 >= reg["min_y"] - 0.005
                        and max_y_token_1 <= reg["max_y"] + 0.005
                    ):
                        for seg in token1.layout.text_anchor.text_segments:
                            end_index = seg.end_index
                            start_index = seg.start_index
                        temp_text = (
                            temp_text
                            + json_dict.text[int(start_index) : int(end_index)]
                        )
                        desc_text_anc.append(
                            {"start_index": start_index, "end_index": end_index}
                        )
                        desc_page_anc_xy["x"].extend([min_x_token_1, max_x_token_1])
                        desc_page_anc_xy["y"].extend([min_y_token_1, max_y_token_1])
        if temp_text != "":
            norm_vertices_1 = [
                {"x": min(desc_page_anc_xy["x"]), "y": min(desc_page_anc_xy["y"])},
                {"x": min(desc_page_anc_xy["x"]), "y": max(desc_page_anc_xy["y"])},
                {"x": max(desc_page_anc_xy["x"]), "y": min(desc_page_anc_xy["y"])},
                {"x": max(desc_page_anc_xy["x"]), "y": max(desc_page_anc_xy["y"])},
            ]
            sub_ent_desc = get_entity_new(
                temp_text,
                norm_vertices_1,
                desc_text_anc,
                "line_item/description",
                False,
            )
            line_desc_bw_regions.append(sub_ent_desc)

    for reg3 in region_line_item:
        for line_5 in line_items_temp:
            norm_temp_line5 = line_5["page_anchor"]["page_refs"][0]["bounding_poly"][
                "normalized_vertices"
            ]
            y_min_line_5 = min(vertex["y"] for vertex in norm_temp_line5)
            y_max_line_5 = max(vertex["y"] for vertex in norm_temp_line5)
            # Iterate over a copy because matched descriptions are removed below.
            for line_desc in list(line_desc_bw_regions):
                norm_temp_desc_2 = line_desc["page_anchor"]["page_refs"][0][
                    "bounding_poly"
                ]["normalized_vertices"]
                y_min_desc = min(vertex["y"] for vertex in norm_temp_desc_2)
                y_max_desc = max(vertex["y"] for vertex in norm_temp_desc_2)
                if (
                    y_min_desc >= reg3["min_y_1"] - 0.01
                    and y_max_desc <= reg3["min_y_2"] + 0.01
                    and y_min_line_5 >= reg3["min_y_1"] - 0.01
                    and y_max_line_5 <= reg3["min_y_2"] + 0.01
                ):
                    # print(line_desc['mention_text'])
                    line_5["properties"].append(line_desc)

                    line_desc_bw_regions.remove(line_desc)

    for line_fin in line_items_temp:
        temp_text_2 = ""
        temp_text_anc_2 = []
        temp_page_anc_xy_2 = {"x": [], "y": []}
        for subline in line_fin["properties"]:
            for an5 in subline["text_anchor"]["text_segments"]:
                temp_text_anc_2.append(an5)
            for xy2 in subline["page_anchor"]["page_refs"][0]["bounding_poly"][
                "normalized_vertices"
            ]:
                temp_page_anc_xy_2["x"].append(xy2["x"])
                temp_page_anc_xy_2["y"].append(xy2["y"])
        # print(temp_text_anc_2)
        sorted_temp_text_anc_2 = sorted(
            temp_text_anc_2, key=lambda x: int(x["end_index"])
        )
        temp_done_anc = []
        for index_4 in sorted_temp_text_anc_2:
            if index_4 not in temp_done_anc:
                temp_text_2 = (
                    temp_text_2
                    + json_dict.text[
                        int(index_4["start_index"]) : int(index_4["end_index"])
                    ]
                )
                temp_done_anc.append(index_4)
        min_x_line_fin, min_y_line_fin, max_x_line_fin, max_y_line_fin = (
            min(temp_page_anc_xy_2["x"]),
            min(temp_page_anc_xy_2["y"]),
            max(temp_page_anc_xy_2["x"]),
            max(temp_page_anc_xy_2["y"]),
        )
        line_fin["page_anchor"]["page_refs"][0]["bounding_poly"][
            "normalized_vertices"
        ] = [
            {"x": min_x_line_fin, "y": min_y_line_fin},
            {"x": max_x_line_fin, "y": min_y_line_fin},
            {"x": min_x_line_fin, "y": max_y_line_fin},
            {"x": max_x_line_fin, "y": max_y_line_fin},
        ]
        line_fin["text_anchor"]["text_segments"] = sorted_temp_text_anc_2
        line_fin["mention_text"] = temp_text_2

        # pprint(line_fin)

    return line_items_temp


file_names_list, file_dict = file_names(gcs_input_path)
bucket_name = gcs_input_path.split("/")[2]
for filename, filepath in tqdm(file_dict.items(), desc="Progress"):
    input_bucket_name = gcs_input_path.split("/")[2]
    if ".json" in filepath:
        json_dict = documentai_json_proto_downloader(bucket_name, filepath)
        page_wise_ent = get_page_wise_entities(json_dict)
        line_item_entities = []
        for page, ent in page_wise_ent.items():
            ent_min_dict = get_text_anc_headers(
                json_dict, page, headers, headers_entities=headers_entities
            )
            # Default to the bottom of the page when the stop word is not found.
            max_stop_y = 1
            try:
                y_max_stop = get_text_anc_headers(json_dict, page, stop_word)
                for stop, ver in y_max_stop.items():
                    max_stop_y = ver["max_y"]
            except Exception:
                pass
            line_items_temp = tag_ref_child_item(
                json_dict, page, ent_min_dict, consider_ent, max_stop_y
            )
            line_items_temp_1 = tagging_rest_child(
                json_dict,
                page,
                line_items_temp,
                headers_entities,
                ent_min_dict,
                consider_ent,
            )
            line_items_temp_page = tag_description_bw_regions(
                json_dict, page, line_items_temp_1, max_stop_y
            )
            for line_temp_ent5 in line_items_temp_page:
                line_item_entities.append(line_temp_ent5)

        if line_item_entities != []:
            final_entities = []
            for entity in json_dict.entities:
                if entity.type_ != "line_item":
                    final_entities.append(entity)
            for line_ent in line_item_entities:
                final_entities.append(line_ent)
            json_dict.entities = final_entities
        else:
            print("No change in the file")
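        # Strip surrounding slashes so the object name has no empty path segment.
        output_prefix = "/".join(gcs_output_path.strip("/").split("/")[3:])
        store_document_as_json(
            documentai.Document.to_json(json_dict),
            gcs_output_path.split("/")[2],
            f"{output_prefix}/{filename}" if output_prefix else filename,
        )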

## OUTPUT

* Comparison before and after running the post-processing code

* Before post-processing

<img src="./Images/format_input.png" width=800 height=400></img>
    
* After post-processing

<img src="./Images/output.png" width=800 height=400></img>
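
To spot-check a saved file without opening it in a viewer, the line items and their child entities can be listed from the JSON. The sketch below is an assumption-laden example: `summarize_line_items` is a hypothetical helper, the document is assumed to be already loaded into a dictionary (for example with `json.load`), and both camelCase and snake_case key names are accepted because the serialized naming can differ.

```python
from typing import Dict, List


def summarize_line_items(document: Dict) -> List[Dict[str, List[str]]]:
    """Return one {child_type: [mention texts]} dictionary per line item."""
    rows = []
    for entity in document.get("entities", []):
        entity_type = entity.get("type", entity.get("type_"))
        if entity_type != "line_item":
            continue
        children: Dict[str, List[str]] = {}
        for prop in entity.get("properties", []):
            key = prop.get("type", prop.get("type_"))
            text = prop.get("mentionText", prop.get("mention_text", ""))
            children.setdefault(key, []).append(text.strip())
        rows.append(children)
    return rows


# Hypothetical, heavily trimmed document used only to show the output shape.
sample_document = {
    "entities": [
        {"type": "invoice_id", "mentionText": "1001"},
        {
            "type": "line_item",
            "mentionText": "1 Drill 5.00",
            "properties": [
                {"type": "line_item/quantity", "mentionText": "1 "},
                {"type": "line_item/description", "mentionText": "Drill "},
                {"type": "line_item/amount", "mentionText": "5.00\n"},
            ],
        },
    ]
}
summary = summarize_line_items(sample_document)
for row in summary:
    print(row)
```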