# Entity Amount Cleanup

* Author: docai-incubator@google.com

## Disclaimer

This tool is not supported by the Google engineering team or product team. It is provided and supported on a best-effort basis by the **DocAI Incubator Team**. No guarantees of performance are implied.

# Objective
This tool post-processes Document AI output by cleaning the `mention_text` field of an amount entity and converting it into a business-readable string format.

# Prerequisites
* Vertex AI Notebook
* GCS Folder Path

# Step-by-Step Procedure

## 1. Import Modules/Packages

In [7]:
!pip install google-cloud-storage
!pip install google-cloud-documentai
!pip install google-cloud-documentai-toolbox

In [8]:
# Run this cell to download utilities module
!wget https://raw.githubusercontent.com/GoogleCloudPlatform/document-ai-samples/main/incubator-tools/best-practices/utilities/utilities.py

In [9]:
import re

from google.cloud import documentai_v1beta3 as documentai
from google.cloud.documentai_toolbox import gcs_utilities

from utilities import (
    documentai_json_proto_downloader,
    file_names,
    store_document_as_json,
)

## 2. Input Details

* **INPUT_GCS_PATH**: GCS folder path that contains the Document AI processor JSON results
* **OUTPUT_GCS_PATH**: GCS folder path where the post-processing results are stored
* **AMOUNT_ENTITY_TYPE**: Entity type of the amount to clean and convert into business-readable text
* **IS_CURRENCY_EXIST**: Set to `True` if a currency symbol appears at the beginning of the amount entity text; otherwise set to `False`

In [10]:
INPUT_GCS_PATH = "gs://bucket/path_to/jsons"
OUTPUT_GCS_PATH = "gs://bucket/path_to/output"
# Entity type that holds the numeric amount; change it to match the entity name in your schema
AMOUNT_ENTITY_TYPE = "annual_income"
IS_CURRENCY_EXIST = True
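
For orientation, the two `gs://` paths above are later split into a bucket name and a folder prefix, and each output file name is joined onto the output prefix. The following is a minimal stdlib-only sketch of that decomposition, assuming the usual `gs://bucket/prefix` layout. `split_gcs_path` and `build_target_path` are hypothetical helper names used for illustration; they are not part of the notebook or of `documentai_toolbox`.

```python
from typing import Tuple
from urllib.parse import urlparse


def split_gcs_path(gcs_path: str) -> Tuple[str, str]:
    """Split a gs:// path into (bucket, prefix). Hypothetical helper."""
    parsed = urlparse(gcs_path)
    if parsed.scheme != "gs" or not parsed.netloc:
        raise ValueError(f"Not a valid GCS path: {gcs_path!r}")
    # netloc is the bucket; the rest of the path is the folder prefix
    return parsed.netloc, parsed.path.lstrip("/")


def build_target_path(prefix: str, file_name: str) -> str:
    """Join an output prefix and a file name without doubling slashes."""
    return f"{prefix.rstrip('/')}/{file_name}" if prefix else file_name


bucket, prefix = split_gcs_path("gs://bucket/path_to/output")
print(bucket, prefix)
print(build_target_path(prefix, "doc1.json"))
```

Checking the paths this way before running the main loop can catch a missing `gs://` scheme or an empty bucket name early.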

## 3. Run the Code Cells Below

In [None]:
def clean_annual_amount(doc: documentai.Document) -> documentai.Document:
    """Removes unexpected commas, periods, and spaces from amount entities.

    Args:
        doc (documentai.Document): Document AI Document proto object

    Returns:
        documentai.Document: Updated Document AI Document proto object for the given `AMOUNT_ENTITY_TYPE`
    """

    for entity in doc.entities:
        if entity.type_ == AMOUNT_ENTITY_TYPE:
            mention_text = entity.mention_text
            print("\t", mention_text, end=" ---> ")
            # Keep only the digits; separators and the currency symbol are dropped here
            digits = re.findall(r"\d", mention_text)
            if not digits:
                # Nothing numeric to clean, so leave the entity untouched
                print("skipped (no digits found)")
                continue
            currency_symbol = mention_text[0] if IS_CURRENCY_EXIST else ""
            mention_text = "".join(digits)
            # Heuristic: a trailing "00" on a value with more than 3 digits is treated as cents
            if mention_text.endswith("00") and len(digits) > 3:
                mention_text = f"{mention_text[:-2]}.{mention_text[-2:]}"
            # Re-format with thousands separators and two decimal places
            mention_text = f"{float(mention_text):,.2f}"
            mention_text = currency_symbol + mention_text
            print(mention_text)
            entity.mention_text = mention_text
    return doc


input_bucket, input_files_dir = gcs_utilities.split_gcs_uri(INPUT_GCS_PATH)
output_bucket, output_files_dir = gcs_utilities.split_gcs_uri(OUTPUT_GCS_PATH)
_, files_dict = file_names(INPUT_GCS_PATH)
for fn, fp in files_dict.items():
    print(f"Process started for {fn}")
    # print(f"\tReading data from gs://{input_bucket}/{fp}")
    doc = documentai_json_proto_downloader(input_bucket, fp)
    doc = clean_annual_amount(doc)
    str_data = documentai.Document.to_json(
        doc, use_integers_for_enums=False, including_default_value_fields=False
    )
    target_path = f"{output_files_dir.rstrip('/')}/{fn}" if output_files_dir else fn
    store_document_as_json(str_data, output_bucket, target_path)
    # print(f"\tStoring JSON file to gs://{output_bucket}/{target_path}")
print("Process Completed for all files")
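
To try the cleaning heuristic without a GCS bucket, the same steps can be applied to plain strings. This is a sketch that mirrors the logic of `clean_annual_amount`; `format_amount` and its `has_currency_symbol` parameter are hypothetical names introduced here, and the sample strings are made up for illustration.

```python
import re


def format_amount(text: str, has_currency_symbol: bool = True) -> str:
    """Normalize an amount string to the form 1,234.00 (hypothetical helper)."""
    digits = re.findall(r"\d", text)
    if not digits:
        return text  # nothing numeric to clean
    # Assumption: when present, the currency symbol is the first character
    symbol = text[0] if has_currency_symbol else ""
    number = "".join(digits)
    # Heuristic from the notebook: a trailing "00" on a value with more
    # than 3 digits is read as the cents part
    if number.endswith("00") and len(digits) > 3:
        number = f"{number[:-2]}.{number[-2:]}"
    return f"{symbol}{float(number):,.2f}"


for sample in ["$45,000.00", "$ 45.000,00", "$4500000"]:
    print(repr(sample), "--->", format_amount(sample))
```

All three samples come out as `$45,000.00`. Amounts with non-zero cents fall outside this heuristic: `$1,234.56` comes out as `$123,456.00`, so such values would need separate handling.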

## 4. Output Details

Refer to the images below for the pre-processed and post-processed results.

<table>
    <tr>
        <td>
            <b>Pre-processed data</b>
        </td>
        <td>
            <b>Post-processed data</b>
        </td>
    </tr>
    <tr>
        <td>
            <img src='./images/annual_income_sample1_pre.png' width=400 height=600></img>
        </td>
        <td>
            <img src='./images/annual_income_sample1_post.png' width=400 height=600></img>
        </td>
    </tr>
    <tr>
        <td>
            <img src='./images/annual_income_sample2_pre.png' width=400 height=600></img>
        </td>
        <td>
            <img src='./images/annual_income_sample2_post.png' width=400 height=600></img>
        </td>
    </tr>
    </table>
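
Besides the screenshots, the change can be checked programmatically by comparing the amount entities of an input JSON with those of its output counterpart. The following is a minimal sketch, assuming the documents are available as parsed JSON dictionaries with camelCase keys (`entities`, `type`, `mentionText`). `amount_mentions` is a hypothetical helper, and the two sample dictionaries are made-up stand-ins for real files.

```python
from typing import Dict, List


def amount_mentions(document: Dict, entity_type: str) -> List[str]:
    """Collect mention texts of one entity type from a parsed Document JSON."""
    return [
        # Accept snake_case keys too, in case the JSON was exported that way
        entity.get("mentionText", entity.get("mention_text", ""))
        for entity in document.get("entities", [])
        if entity.get("type") == entity_type
    ]


# Made-up stand-ins for an input JSON and its post-processed output
pre_doc = {"entities": [{"type": "annual_income", "mentionText": "$ 45.000,00"}]}
post_doc = {"entities": [{"type": "annual_income", "mentionText": "$45,000.00"}]}

for before, after in zip(
    amount_mentions(pre_doc, "annual_income"),
    amount_mentions(post_doc, "annual_income"),
):
    print(f"{before!r} ---> {after!r}")
```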
    