# DocAI Special Character Removal

* Author: docai-incubator@google.com


## Disclaimer

This tool is not supported by the Google engineering team or product team. It is provided and supported on a best-effort basis by the DocAI Incubator Team. No guarantees of performance are implied.

## Purpose and Description

This document describes how to handle special characters in Custom Document Extractor (CDE) JSON samples. The provided code replaces an entity's original mention text with its post-processed value.

The process removes special characters such as hyphens (-) and forward slashes (/) from the amount field, because these characters can prevent downstream parsing from identifying the amount value correctly.
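
As a rough illustration of the intended effect on a mention text, the sketch below strips hyphens and forward slashes from a string. It is a simplification: the tool itself re-runs OCR on the cropped entity and uses symbol confidences. The helper name `strip_separators` is hypothetical and not part of the tool.

```python
def strip_separators(mention_text: str, unwanted: str = "-/") -> str:
    """Return mention_text with every character listed in `unwanted` removed.

    Hypothetical helper for illustration only; the actual tool decides what
    to keep from OCR symbol confidences rather than from the raw string.
    """
    # str.maketrans with a third argument maps those characters to None
    return mention_text.translate(str.maketrans("", "", unwanted))


print(strip_separators("12-500/-"))
```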

## Prerequisites

1. Access to a Vertex AI Notebook or Google Colab
2. Python
3. Access to the Google Cloud Storage bucket that holds the input JSON files

## Step-by-Step Procedure

### 1. Install the required libraries

In [None]:
%pip install Pillow
%pip install google-cloud-storage
%pip install google-cloud-documentai

In [None]:
!wget https://raw.githubusercontent.com/GoogleCloudPlatform/document-ai-samples/main/incubator-tools/best-practices/utilities/utilities.py

### 2. Import the required libraries/Packages

In [None]:
from io import BytesIO
from google.cloud import storage
from PIL import Image
from google.cloud import documentai_v1beta3 as documentai
from google.api_core.client_options import ClientOptions
from pathlib import Path
import base64
import io
import json
from utilities import (
    file_names,
    documentai_json_proto_downloader,
    store_document_as_json,
)

### 3. Input Details

<ul>
    <li><b>input_path : </b>GCS folder path that contains the Document AI processor JSON results.</li>
    <li><b>output_path : </b>GCS folder path where the post-processed results are stored.</li>
    <li><b>project_id : </b>ID of the Google Cloud project that hosts the processor.</li>
    <li><b>location : </b>Location (region) of the processor, for example <code>us</code> or <code>eu</code>.</li>
    <li><b>processor_id : </b>ID of the CDE processor.</li>
    <li><b>entity_name : </b>Name of the entity whose mention text is cleaned and replaced with the post-processed value.</li>
</ul>

In [None]:
input_path = "gs://bucket_name/path/to/jsons/"
output_path = "gs://bucket_name/path/to/output/"
project_id = "project-id"
location = "location"
processor_id = "processor-id"
entity_name = "Amount_in_number"  # Name of the entity whose mention text should be cleaned
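
The execution step splits each `gs://` path into a bucket name and an object prefix. A minimal sketch of that split, with basic validation added, is shown here. The function name `split_gcs_uri` is hypothetical and is not part of `utilities.py`.

```python
from typing import Tuple


def split_gcs_uri(gcs_uri: str) -> Tuple[str, str]:
    """Split 'gs://bucket/some/prefix/' into ('bucket', 'some/prefix/').

    Hypothetical helper for illustration; it raises ValueError when the
    input does not start with the gs:// scheme.
    """
    scheme = "gs://"
    if not gcs_uri.startswith(scheme):
        raise ValueError(f"Not a GCS URI: {gcs_uri!r}")
    # partition splits on the first "/" only, so the prefix keeps its slashes
    bucket, _, prefix = gcs_uri[len(scheme):].partition("/")
    return bucket, prefix


print(split_gcs_uri("gs://bucket_name/path/to/jsons/"))
```

Validating the scheme up front gives a clearer error than an `IndexError` from a plain `split("/")` when a path is mistyped.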

### 4. Execute the code

In [None]:
input_storage_bucket_name = input_path.split("/")[2]
input_bucket_path_prefix = "/".join(input_path.split("/")[3:])
output_storage_bucket_name = output_path.split("/")[2]
output_bucket_path_prefix = "/".join(output_path.split("/")[3:])


def remove_special_characters(
    json_proto_data: documentai.Document, entity_name: str
) -> documentai.Document:
    """
    Removes special characters from a specified entity type ("entity_name") in a Documentai document.

    This function processes the entity bounding box, extracts the image data, performs OCR with symbol confidence,
    and removes special characters like '-' and '/' based on confidence thresholds while considering adjacent digits.

    Args:
      json_proto_data: The Documentai document object containing text and entities (type: documentai.Document).
      entity_name: The name of the entity type to process (type: str).

    Returns:
      The modified Documentai document object with updated entity mentions after removing special characters (type: documentai.Document).
    """
    # Create the client and processor path once rather than once per entity
    docai_client = documentai.DocumentProcessorServiceClient(
        client_options=ClientOptions(
            api_endpoint=f"{location}-documentai.googleapis.com"
        )
    )
    resource_name = docai_client.processor_path(project_id, location, processor_id)
    process_options = {"ocr_config": {"enable_symbol": True}}

    for page_index, page in enumerate(json_proto_data.pages):
        # Skip pages without an embedded image; there is nothing to crop
        if not page.image.content:
            continue
        # page.image.content already holds raw bytes, so no base64 decoding is needed
        image = Image.open(io.BytesIO(page.image.content))
        img_width, img_height = image.size

        for entity in json_proto_data.entities:
            page_refs = entity.page_anchor.page_refs
            # The proto `type` field is exposed as `type_` in the Python client
            if entity.type_ != entity_name or not page_refs:
                continue
            # Crop each entity only from the page it is anchored to
            if int(page_refs[0].page) != page_index:
                continue
            bounding_box = page_refs[0].bounding_poly.normalized_vertices
            if len(bounding_box) < 3:
                continue

            # Convert normalized coordinates to pixel coordinates
            left = bounding_box[0].x * img_width
            top = bounding_box[0].y * img_height
            right = bounding_box[2].x * img_width
            bottom = bounding_box[2].y * img_height

            # Crop the entity region and serialize it as PNG bytes
            cropped_image = image.crop((left, top, right, bottom))
            cropped_image_bytes = BytesIO()
            cropped_image.save(cropped_image_bytes, format="PNG")

            # Re-run the processor on the crop with symbol-level OCR enabled
            request = documentai.ProcessRequest(
                name=resource_name,
                raw_document=documentai.RawDocument(
                    content=cropped_image_bytes.getvalue(), mime_type="image/png"
                ),
                process_options=process_options,
            )
            ocr_document = docai_client.process_document(request=request).document

            # Collect each OCR symbol together with its confidence
            complete_text = ocr_document.text
            symbols_confidence = []
            for ocr_page in ocr_document.pages:
                for symbol in ocr_page.symbols:
                    segment = symbol.layout.text_anchor.text_segments[0]
                    symbol_text = complete_text[
                        int(segment.start_index) : int(segment.end_index)
                    ]
                    symbols_confidence.append((symbol_text, symbol.layout.confidence))

            # Drop '-' and '/' without affecting adjacent numeric values
            symbols_confidence_filtered = [
                (sym, conf)
                for sym, conf in symbols_confidence
                if sym not in ("-", "/")
            ]

            # Remove the first symbol if its confidence is below 0.85
            if (
                symbols_confidence_filtered
                and symbols_confidence_filtered[0][1] < 0.85
            ):
                symbols_confidence_filtered.pop(0)

            # Remove the last symbol, then the second-to-last, if below 0.85
            if (
                len(symbols_confidence_filtered) > 2
                and symbols_confidence_filtered[-1][1] < 0.85
            ):
                symbols_confidence_filtered.pop(-1)
            if (
                len(symbols_confidence_filtered) > 2
                and symbols_confidence_filtered[-2][1] < 0.85
            ):
                symbols_confidence_filtered.pop(-2)

            # Drop interior symbols below 0.5; build a new list instead of
            # popping while iterating, which shifts indices and skips items
            if len(symbols_confidence_filtered) > 2:
                last_interior = len(symbols_confidence_filtered) - 2
                symbols_confidence_filtered = [
                    (sym, conf)
                    for j, (sym, conf) in enumerate(symbols_confidence_filtered)
                    if not (1 <= j < last_interior and conf < 0.5)
                ]

            # Join the remaining characters and overwrite the mention text
            entity.mention_text = "".join(
                sym for sym, _ in symbols_confidence_filtered
            )

    # Return after all pages are processed, even if no page carried an image
    return json_proto_data


list_of_files = [
    i for i in list(file_names(input_path)[1].values()) if i.endswith(".json")
]

for file_name in list_of_files:
    json_proto_data = documentai_json_proto_downloader(
        input_storage_bucket_name, file_name
    )
    print("Processing>>>>>>>", file_name)
    document_proto = remove_special_characters(json_proto_data, entity_name)
    # Use the last path component so nested input folders do not break the output name
    output_path_within_bucket = output_bucket_path_prefix + file_name.split("/")[-1]
    store_document_as_json(
        documentai.Document.to_json(document_proto),
        output_storage_bucket_name,
        output_path_within_bucket,
    )
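
The confidence rules inside `remove_special_characters` can be exercised without calling Document AI. The sketch below isolates them as a pure function over `(symbol, confidence)` pairs. The function name `filter_symbols` and the sample confidences are made up for illustration; the thresholds 0.85 and 0.5 are the ones used in the code above.

```python
from typing import List, Tuple


def filter_symbols(
    symbols: List[Tuple[str, float]],
    edge_threshold: float = 0.85,
    inner_threshold: float = 0.5,
) -> str:
    """Apply the separator and confidence rules to (symbol, confidence) pairs."""
    # Rule 1: drop '-' and '/' regardless of confidence
    kept = [(s, c) for s, c in symbols if s not in ("-", "/")]
    # Rule 2: drop a low-confidence first symbol
    if kept and kept[0][1] < edge_threshold:
        kept.pop(0)
    # Rule 3: drop a low-confidence last symbol, then second-to-last symbol
    if len(kept) > 2 and kept[-1][1] < edge_threshold:
        kept.pop(-1)
    if len(kept) > 2 and kept[-2][1] < edge_threshold:
        kept.pop(-2)
    # Rule 4: drop interior symbols (not the first, not the last two)
    # whose confidence falls below the lower threshold
    if len(kept) > 2:
        last_interior = len(kept) - 2
        kept = [
            (s, c)
            for j, (s, c) in enumerate(kept)
            if not (1 <= j < last_interior and c < inner_threshold)
        ]
    return "".join(s for s, _ in kept)


# Illustrative OCR output for an amount written as "1250/-"
sample = [("1", 0.99), ("2", 0.98), ("5", 0.97), ("0", 0.96), ("/", 0.90), ("-", 0.88)]
print(filter_symbols(sample))
```

Keeping the rules in a pure function like this makes it easier to tune the thresholds against known samples before spending processor calls.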

### 5. Output


The post-processed JSON files are written to the GCS folder given in output_path. <br><hr>
<b>Comparison Between Input and Output File</b><br><br>
<i><h4>Post processing results</h4></i><br>
After the code runs, each output JSON holds the cleaned value in the entity's mention text. The table below shows the 'Amount_in_number' entity in an input JSON and in the corresponding output JSON.<br>
    
<table>
    <tr>
        <td><h3><b>Input Json </b></h3></td>
        <td><h3><b>Output Json</b></h3></td>
    </tr>
<tr>
<td><img src="./images/image_input_json.png"></td>
<td><img src="./images/image_output_json.png"></td>
</tr>
</table>
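
To check the result programmatically rather than by eye, one option is to compare the mention texts of the input and output JSON files once they are loaded as dictionaries. The sketch below assumes the camelCase keys (`entities`, `type`, `mentionText`) found in exported Document AI JSON, and assumes both files list entities in the same order. The function name `changed_mentions` and the sample values are hypothetical.

```python
from typing import Any, Dict, List, Tuple


def changed_mentions(
    input_doc: Dict[str, Any], output_doc: Dict[str, Any], entity_name: str
) -> List[Tuple[str, str]]:
    """Return (before, after) mention texts that differ for the given entity type."""
    changes = []
    # Pair entities by position; assumes post-processing preserved their order
    pairs = zip(input_doc.get("entities", []), output_doc.get("entities", []))
    for before, after in pairs:
        if before.get("type") != entity_name:
            continue
        old = before.get("mentionText", "")
        new = after.get("mentionText", "")
        if old != new:
            changes.append((old, new))
    return changes


# Made-up documents standing in for json.load() results
input_doc = {"entities": [{"type": "Amount_in_number", "mentionText": "1250/-"}]}
output_doc = {"entities": [{"type": "Amount_in_number", "mentionText": "1250"}]}
print(changed_mentions(input_doc, output_doc, "Amount_in_number"))
```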