# Post-Processing of Negative Values


* Author: docai-incubator@google.com


## Disclaimer

This tool is not supported by the Google engineering team or product team. It is provided and supported on a best-effort basis by the DocAI Incubator Team. No guarantees of performance are implied.

## Purpose and Description
Use this post-processing script to correct negative values, especially amounts, in Document AI output. Such values appear enclosed in round brackets in the OCR text, for example ‘(123.99)’, which denotes a negative number. For every document, the Python script identifies these entities, removes the round brackets, and prefixes each occurrence with a minus sign ‘-’.
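
The conversion described above can be illustrated with a minimal sketch. The helper name `parenthesized_to_negative` and the regular expression are assumptions made for illustration; they are not part of the script in this document, which operates on Document AI entities rather than bare strings.

```python
import re

# Hypothetical helper for illustration only (not part of the original script).
# Matches a value that is fully enclosed in round brackets, e.g. "(123.99)".
_ENCLOSED = re.compile(r"^\s*\(\s*(.+?)\s*\)\s*$")


def parenthesized_to_negative(value: str) -> str:
    """Replace enclosing round brackets with a leading minus sign."""
    match = _ENCLOSED.match(value)
    if match is None:
        return value  # values without enclosing brackets are left untouched
    return "-" + match.group(1)


print(parenthesized_to_negative("(123.99)"))  # -123.99
print(parenthesized_to_negative("123.99"))    # 123.99
```

Values that are not enclosed in brackets pass through unchanged, so the helper can be applied to every candidate value without a separate check.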

## Prerequisites

1. Vertex AI Notebook
2. Parsed JSON files in a GCS folder.
3. An output GCS folder for the updated JSON files.

## Step-by-Step Procedure

### 1. Input details


In [1]:
!wget https://raw.githubusercontent.com/GoogleCloudPlatform/document-ai-samples/main/incubator-tools/best-practices/utilities/utilities.py

In [2]:
# INPUT: GCS URI of the folder containing the parsed JSON files
input_path = "gs://xxxxxxxx_xxxx_xxxx/JSON_OUTPUT/output_/"
# OUTPUT: GCS URI of the folder for the post-processed JSON files
output_path = "gs://xxxxxxx_xxxx_xxxx/JSON_OUTPUT/output_PostProcessed_xxxx/"
types = ["entity_1", "entity_2", "entity_3"]

**`input_path`**: GCS URI of the folder that contains the JSON files produced by the Document AI processor. The script reads its input from this folder.<br>
**`output_path`**: GCS URI of the folder where the post-processed JSON files are saved.<br>
**`types`**: List of the entity type names for which the correction should be applied.
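
The script in section 3 splits each `gs://` URI into a bucket name and an object prefix by splitting on `/`. As a minimal sketch of that step, the hypothetical helper `split_gcs_uri` below does the same with an explicit scheme check. The helper and the example bucket name are assumptions for illustration.

```python
from typing import Tuple


def split_gcs_uri(uri: str) -> Tuple[str, str]:
    """Split "gs://bucket/some/prefix/" into ("bucket", "some/prefix/")."""
    scheme = "gs://"
    if not uri.startswith(scheme):
        raise ValueError(f"Not a GCS URI: {uri!r}")
    # Everything up to the first "/" is the bucket; the rest is the prefix.
    bucket, _, prefix = uri[len(scheme):].partition("/")
    return bucket, prefix


bucket, prefix = split_gcs_uri("gs://example-bucket/JSON_OUTPUT/output/")
print(bucket)  # example-bucket
print(prefix)  # JSON_OUTPUT/output/
```

Because the script builds each output object name by appending the file name directly to the output prefix, `output_path` should end with a trailing `/`.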


### 2. Output

The post-processed JSON files can be found in the GCS folder specified by `output_path`. <br><hr>
<b>Comparison Between Input and Output File</b><br><br>
<i><h4>Post processing results</h4></i><br>
The following comparison shows how entities with negative values are corrected in a sample JSON document, with the affected key:value pairs marked. The example covers two cases: a negative value in an entry of the `entities` key, and a negative value in the `properties` key nested inside an entity. The script corrects both cases.

<img src="./Images/negative_value_comparison_1.png" width=800 height=400 alt="Negative value pre post comparison image">
<img src="./Images/negative_value_comparison_2.png" width=800 height=400 alt="Negative value pre post comparison image">
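
The two cases can be sketched on plain dictionaries shaped like a simplified Document AI JSON export. This is an illustration under stated assumptions, not the script itself: the helpers `fix_mention` and `fix_entities` and the sample data are hypothetical, and the sketch inspects `mentionText` directly, whereas the script in section 3 looks for brackets in the document text surrounding each entity.

```python
import re
from typing import Any, Dict, List

# Matches a value that is fully enclosed in round brackets, e.g. "(45.00)".
_ENCLOSED = re.compile(r"^\s*\(\s*(.+?)\s*\)\s*$")


def fix_mention(item: Dict[str, Any]) -> None:
    """Rewrite item["mentionText"] in place if it is enclosed in round brackets."""
    match = _ENCLOSED.match(item.get("mentionText", ""))
    if match:
        item["mentionText"] = "-" + match.group(1)


def fix_entities(entities: List[Dict[str, Any]], types: List[str]) -> None:
    """Correct entities of the selected types and all of their properties."""
    for entity in entities:
        if entity.get("type") not in types:
            continue
        fix_mention(entity)  # case 1: the value sits directly on the entity
        for prop in entity.get("properties", []):
            fix_mention(prop)  # case 2: the value sits on a nested property


# Hypothetical sample data for illustration.
sample = [
    {"type": "annual_income", "mentionText": "(123.99)"},
    {
        "type": "line_item",
        "mentionText": "Refund (45.00)",
        "properties": [{"type": "line_item/amount", "mentionText": "(45.00)"}],
    },
]
fix_entities(sample, types=["annual_income", "line_item"])
print(sample[0]["mentionText"])                   # -123.99
print(sample[1]["mentionText"])                   # Refund (45.00)
print(sample[1]["properties"][0]["mentionText"])  # -45.00
```

As in the script, nested properties are visited only when the parent entity's type is listed in `types`.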
    
<i><h4>Processor: Before and After images</h4></i><br>
The images below show the differences observed in the Document AI processor before and after the JSON documents are post-processed and imported.
<table style="float:left">
<tr style="border: 1px solid black"><td>
<img src="./Images/negative_value_comparison_3.png" width=800 height=400 alt="Negative value pre post comparison image"></td></tr>
<tr style="border: 1px solid black"><td>
<img src="./Images/negative_value_comparison_4.png" width=800 height=400 alt="Negative value pre post comparison image"></td></tr>
</table>

### 3. Run the code

In [None]:
from io import BytesIO
import json
from google.cloud import storage
from tqdm.notebook import tqdm
from google.cloud import documentai_v1beta3 as documentai
from utilities import (
    documentai_json_proto_downloader,
    store_document_as_json,
    file_names,
)

# INPUT: GCS URI of the folder containing the parsed JSON files
input_path = "gs://xxxx/xxxxxx/xxxxxxxx/"
# OUTPUT: GCS URI of the folder for the post-processed JSON files
output_path = "gs://xxxxx/xxxxxxx/xxxxxxxx/"


input_storage_bucket_name = input_path.split("/")[2]
input_bucket_path_prefix = "/".join(input_path.split("/")[3:])
output_storage_bucket_name = output_path.split("/")[2]
output_bucket_path_prefix = "/".join(output_path.split("/")[3:])


# List the entity types whose negative values appear in '()' format
types = ["case_number", "first_name", "last_name", "annual_income", "date_signed"]

json_files = file_names(input_path)[1].values()
list_of_files = [i for i in list(json_files) if i.endswith(".json")]

for k in tqdm(range(0, len(list_of_files))):
    document = documentai_json_proto_downloader(
        input_storage_bucket_name, list_of_files[k]
    )

    for entity in document.entities:
        if entity.type in types:
            try:
                textAnchor_Content = entity.text_anchor.content
            except:
                pass

            textAnchor_textSegments_startidx = entity.text_anchor.text_segments[
                0
            ].start_index
            textAnchor_textSegments_endidx = entity.text_anchor.text_segments[
                0
            ].end_index

            startidx_increment = int(textAnchor_textSegments_startidx)  # - 5
            endidx_decrement = int(textAnchor_textSegments_endidx)  # + 5

            # Clamp at 0: a negative start index would slice from the end of the text.
            text_select_left = document.text[
                max(0, startidx_increment - 5) : endidx_decrement
            ]
            text_select_right = document.text[startidx_increment : endidx_decrement + 5]

            if ("(" in text_select_left and ")" in text_select_right) or (
                "{" in text_select_left and "}" in text_select_right
            ):
                entity.mention_text = "-" + entity.mention_text.replace(
                    "(", ""
                ).replace(")", "")
                # Rewrite only populated fields so an empty value never becomes "-".
                if entity.normalized_value.text:
                    entity.normalized_value.text = (
                        "-"
                        + entity.normalized_value.text.replace("(", "").replace(")", "")
                    )
                if entity.text_anchor.content:
                    entity.text_anchor.content = (
                        "-"
                        + entity.text_anchor.content.replace("(", "").replace(")", "")
                    )

            if entity.properties:
                for lt in range(0, len(entity.properties)):
                    entity_prop = entity.properties[lt]

                    textAnchor_Content = entity_prop.text_anchor.content
                    textAnchor_textSegments_startidx = (
                        entity_prop.text_anchor.text_segments[0].start_index
                    )
                    textAnchor_textSegments_endidx = (
                        entity_prop.text_anchor.text_segments[0].end_index
                    )

                    # Clamp at 0: a negative start index would slice from the end of the text.
                    startidx_increment = max(0, int(textAnchor_textSegments_startidx) - 5)
                    endidx_decrement = int(textAnchor_textSegments_endidx) + 5

                    text_select = document.text[startidx_increment:endidx_decrement]

                    if ("(" in text_select and ")" in text_select) or (
                        "{" in text_select and "}" in text_select
                    ):
                        entity_prop.mention_text = (
                            "-"
                            + entity_prop.mention_text.replace("(", "").replace(")", "")
                        )
                        # Rewrite only populated fields so an empty value never becomes "-".
                        if entity_prop.normalized_value.text:
                            entity_prop.normalized_value.text = (
                                "-"
                                + entity_prop.normalized_value.text.replace(
                                    "(", ""
                                ).replace(")", "")
                            )
                        if entity_prop.text_anchor.content:
                            entity_prop.text_anchor.content = (
                                "-"
                                + entity_prop.text_anchor.content.replace(
                                    "(", ""
                                ).replace(")", "")
                            )
    store_document_as_json(
        documentai.Document.to_json(document),
        output_storage_bucket_name,
        output_bucket_path_prefix + list_of_files[k].split("/")[-1],
    )