# Renaming Entity Type

* Author: docai-incubator@google.com

## Disclaimer

This tool is not supported by the Google engineering team or product team. It is provided and supported on a best-effort basis by the **DocAI Incubator Team**. No guarantees of performance are implied.

# Objective
This tool renames specific entity types (`documentai.Document.Entity.type_`) in a Document proto object, based on the mappings provided in the `RENAME_MAPPINGS` dictionary.

<table>
    <tr>
        <td><b>Entity Type Text Before Processing</b></td><td><img src="./images/sample_1_pre.png"></td>
        <td><b>Entity Type Text After Processing</b></td><td><img src="./images/sample_1_post.png"></td>
    </tr>
</table>

# Prerequisites
* Vertex AI Notebook
* GCS Folder Path

# Step-by-Step Procedure

## 1. Import required Modules/Packages

In [None]:
!pip install google-cloud-documentai --quiet
!pip install google-cloud-documentai-toolbox --quiet

In [11]:
# Run this cell to download utilities module
!wget https://raw.githubusercontent.com/GoogleCloudPlatform/document-ai-samples/main/incubator-tools/best-practices/utilities/utilities.py

In [4]:
from typing import Dict, Sequence

from google.cloud import documentai_v1beta3 as documentai
from google.cloud.documentai_toolbox import gcs_utilities

from utilities import (
    documentai_json_proto_downloader,
    file_names,
    store_document_as_json,
)

## 2. Input Details

* **INPUT_GCS_PATH**: GCS folder path that contains the Document AI processor JSON results.
* **OUTPUT_GCS_PATH**: GCS folder path where the post-processed results are stored.
* **RENAME_MAPPINGS**: Dictionary that maps each existing entity type (key) to its new entity type (value).

In [8]:
INPUT_GCS_PATH = "gs://bucket/path_to/input"
OUTPUT_GCS_PATH = "gs://bucket/path_to/output"
# {"old_entity_type": "new_entity_type", ..}
RENAME_MAPPINGS = {
    "annual_income": "INCOME_PER_YEAR",
    "due_date": "DUE_ON",
    "currency": "CURRENCY_SYMBOL",
    "purchase_order": "PO",
    "line_item/amount": "LINE_ITEM/TOTAL_AMOUNT",
}
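
Before running the renaming cells, it may help to sanity-check the mapping itself. The sketch below is not part of the tool: `check_rename_mappings` is a hypothetical helper, and the three checks it performs (identity pairs, chained renames, and child types whose parent prefix changes without a matching parent rename) are assumptions about what is usually worth flagging. Whether a warning matters depends on your processor schema.

```python
from typing import Dict, List


def check_rename_mappings(mappings: Dict[str, str]) -> List[str]:
    """Return warnings for a rename mapping (hypothetical helper, not in utilities.py)."""
    warnings: List[str] = []
    for old, new in mappings.items():
        if not old or not new:
            warnings.append(f"empty entity type in pair {old!r} -> {new!r}")
            continue
        if old == new:
            warnings.append(f"{old!r} maps to itself, so nothing changes")
        elif new in mappings:
            # The new name is also listed as an old name elsewhere in the mapping.
            warnings.append(f"{new!r} is both a new type and an old type")
        if "/" in old and "/" in new:
            old_parent = old.rsplit("/", 1)[0]
            new_parent = new.rsplit("/", 1)[0]
            # Assumption: a child whose prefix changes usually wants its parent renamed too.
            if old_parent != new_parent and mappings.get(old_parent) != new_parent:
                warnings.append(
                    f"{old!r} -> {new!r} changes the parent prefix, "
                    f"but {old_parent!r} is not renamed to {new_parent!r}"
                )
    return warnings


example_mappings = {
    "due_date": "DUE_ON",
    "line_item/amount": "LINE_ITEM/TOTAL_AMOUNT",
}
for warning in check_rename_mappings(example_mappings):
    print("WARNING:", warning)
```

An empty list means none of these checks fired; it does not prove that the mapping matches your schema.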

## 3. Run the Code Cells Below

In [None]:
def rename_entity_type(
    entities: Sequence[documentai.Document.Entity],
) -> Sequence[documentai.Document.Entity]:
    """Rename document entity types in place, based on the global RENAME_MAPPINGS.

    Child entities (ent.properties) are processed recursively.

    Args:
        entities (Sequence[documentai.Document.Entity]): A sequence/list of all entities in the Document proto object

    Returns:
        Sequence[documentai.Document.Entity]: The same entities, with their types renamed
    """

    for ent in entities:
        if ent.type_ in RENAME_MAPPINGS:
            print(f"\t\t {ent.type_} --> {RENAME_MAPPINGS[ent.type_]}")
            ent.type_ = RENAME_MAPPINGS[ent.type_]
        if ent.properties:
            rename_entity_type(ent.properties)
    return entities


print(
    f"Renaming specific entity type_ process started for all JSON files present in {INPUT_GCS_PATH}"
)
print(f"Renaming is based on given key-value pair only \n{RENAME_MAPPINGS}")
input_bucket, input_uri_prefix = gcs_utilities.split_gcs_uri(INPUT_GCS_PATH)
output_bucket, output_uri_prefix = gcs_utilities.split_gcs_uri(OUTPUT_GCS_PATH)
_, filenames_dict = file_names(INPUT_GCS_PATH)
filenames_dict = {fn: fp for fn, fp in filenames_dict.items() if fn.endswith(".json")}
for fn, fp in filenames_dict.items():
    print(f"\tProcess started for {fn}")
    doc = documentai_json_proto_downloader(input_bucket, fp)
    rename_entity_type(doc.entities)
    str_data = documentai.Document.to_json(
        doc, use_integers_for_enums=False, including_default_value_fields=False
    )
    output_path = f"{output_uri_prefix.rstrip('/')}/{fn}"
    store_document_as_json(str_data, output_bucket, output_path)
    print(
        f"\tProcess Completed and successfully uploaded file to GCS - Path - {OUTPUT_GCS_PATH.rstrip('/')}/{fn}"
    )
print("Renaming Process completed for all files")
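
To preview what a mapping would change without touching GCS, the same recursive traversal can be tried on the plain-JSON form of a document. The sketch below is an illustration only: `rename_types_in_json` and `sample_doc` are hypothetical, and it assumes that each entity in the JSON stores its type under the `type` key and its children under `properties`.

```python
from collections import Counter
from typing import Any, Dict, List


def rename_types_in_json(
    entities: List[Dict[str, Any]], mappings: Dict[str, str], counts: Counter
) -> None:
    """Rename entity types in place and count renames per old type (hypothetical helper)."""
    for ent in entities:
        old_type = ent.get("type", "")
        if old_type in mappings:
            ent["type"] = mappings[old_type]
            counts[old_type] += 1
        # Child entities live under "properties"; recurse as rename_entity_type does.
        rename_types_in_json(ent.get("properties", []), mappings, counts)


# Made-up document fragment, for illustration only.
sample_doc = {
    "entities": [
        {"type": "due_date", "mentionText": "2024-01-31"},
        {
            "type": "line_item",
            "properties": [
                {"type": "line_item/amount", "mentionText": "10.00"},
                {"type": "line_item/amount", "mentionText": "5.50"},
            ],
        },
    ]
}
mappings = {"due_date": "DUE_ON", "line_item/amount": "LINE_ITEM/TOTAL_AMOUNT"}
counts: Counter = Counter()
rename_types_in_json(sample_doc["entities"], mappings, counts)
print(dict(counts))
```

Note that the parent entity keeps the type `line_item` here, because only keys that appear in the mapping are renamed.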

## 4. Output Details

After the script runs successfully against the folder of Document AI processor JSON results, the entity types in the output files are renamed according to the provided `RENAME_MAPPINGS` dictionary. Refer to the sample input and output images below.

<table>
    <tr>
        <td><h3><b>Pre-processing</b></h3></td>
        <td><h3><b>Post-processing</b></h3></td>
    </tr>
<tr>
<td><img src="./images/sample_1_pre.png"></td>
<td><img src="./images/sample_1_post.png"></td>
</tr>
<tr>
<td><img src="./images/sample_2_pre.png"></td>
<td><img src="./images/sample_2_post.png"></td>
</tr>
</table>
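
Besides inspecting the images, the output can be checked programmatically. The following is a minimal sketch, assuming an output file has already been loaded as JSON text and that entities use the `type` and `properties` keys; `collect_entity_types`, `leftover_old_types`, and the sample JSON are hypothetical and not part of the tool.

```python
import json
from typing import Any, Dict, List, Set


def collect_entity_types(entities: List[Dict[str, Any]]) -> Set[str]:
    """Collect every entity type, including nested properties (hypothetical helper)."""
    found: Set[str] = set()
    for ent in entities:
        if ent.get("type"):
            found.add(ent["type"])
        found |= collect_entity_types(ent.get("properties", []))
    return found


def leftover_old_types(doc: Dict[str, Any], mappings: Dict[str, str]) -> List[str]:
    """Return old entity types from the mapping that are still present in the document."""
    return sorted(collect_entity_types(doc.get("entities", [])) & set(mappings))


# Made-up output document in which one child entity was not renamed.
output_json = """
{"entities": [
  {"type": "DUE_ON"},
  {"type": "line_item", "properties": [{"type": "line_item/amount"}]}
]}
"""
mappings = {"due_date": "DUE_ON", "line_item/amount": "LINE_ITEM/TOTAL_AMOUNT"}
leftovers = leftover_old_types(json.loads(output_json), mappings)
print("Old types still present:", leftovers)
```

An empty result suggests that every mapped type was renamed. If a mapping renames one type to a name that is also a key, that name can legitimately remain, so interpret the result with the mapping in mind.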