# Swap OCR Confusion Characters

* Author: docai-incubator@google.com

## Disclaimer

This tool is not supported by the Google engineering team or product team. It is provided and supported on a best-effort basis by the **DocAI Incubator Team**. No guarantees of performance are implied.

# Objective
This is a post-processing tool that corrects OCR text. Symbols whose OCR confidence is at or below a given threshold (`confidence_threshold`) are selected, and every selected symbol that appears as a key in the provided mapping dictionary (`swapper`) is replaced with its mapped character in the document text.
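The idea can be illustrated on plain data, without a Document AI object. The snippet below is only a minimal sketch: the function name `swap_low_confidence` and the `(character, confidence)` pairs are hypothetical and stand in for the symbols an OCR processor would return.

```python
from typing import Dict, List, Tuple


def swap_low_confidence(
    symbols: List[Tuple[str, float]], swapper: Dict[str, str], threshold: float
) -> str:
    """Rebuild text from (char, confidence) pairs, swapping only low-confidence chars."""
    out = []
    for char, conf in symbols:
        if conf <= threshold and char in swapper:
            # Low confidence and a known confusion character: replace it
            out.append(swapper[char])
        else:
            # High confidence, or no mapping available: keep the original
            out.append(char)
    return "".join(out)


# Hypothetical OCR symbols: ")" and "8" were read with low confidence,
# while the trailing "5" was read with high confidence and is left alone.
symbols = [(")", 0.41), ("O", 0.98), ("8", 0.52), ("5", 0.99)]
swapper = {")": "J", "8": "B", "5": "S"}
result = swap_low_confidence(symbols, swapper, threshold=0.65)
print(result)  # JOB5
```

Note that the final `5` is kept even though it has a mapping, because its confidence is above the threshold.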


# Prerequisites
* Vertex AI Notebook
* GCS Folder Path


# Step-by-Step Procedure

## 1. Import the libraries

In [None]:
!wget https://raw.githubusercontent.com/GoogleCloudPlatform/document-ai-samples/main/incubator-tools/best-practices/utilities/utilities.py

In [None]:
from google.cloud import documentai_v1beta3 as documentai
from google.cloud import storage
from typing import Tuple, List, Dict
from utilities import (
    file_names,
    documentai_json_proto_downloader,
    store_document_as_json,
)

## 2. Input Details

* **GCS_INPUT_FOLDER** : GCS folder path containing DocAI OCR processor results in JSON format, including symbol-level data.
* **GCS_OUTPUT_FOLDER** : GCS folder path where the post-processed results are stored.
* **CONFIDENCE_THRESHOLD** : Symbols with a confidence at or below this value are candidates for swapping. It must be between 0 and 1.
* **SWAPPER** : A dictionary that maps each character to its replacement, `{'old_char': 'new_char', ...}`.

In [None]:
GCS_INPUT_FOLDER = "gs://{bucket-name}/{sub-folders-path}/"
GCS_OUTPUT_FOLDER = "gs://{bucket-name}/{sub-folders-path}"
CONFIDENCE_THRESHOLD = 0.65
SWAPPER = {")": "J", "0": "Q", "5": "S", "1": "I", "2": "Z", "8": "B"}
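Because the swap rewrites the document text by character offsets, a replacement that is longer or shorter than the character it replaces would shift every later offset. A small sanity check on the inputs can catch this early. The helper below, `validate_swap_config`, is a hypothetical addition and not part of the original tool; it assumes the threshold is meant to lie between 0 and 1 inclusive.

```python
from typing import Dict


def validate_swap_config(threshold: float, swapper: Dict[str, str]) -> None:
    """Raise ValueError if the threshold or the mapping looks unusable."""
    if not 0.0 <= threshold <= 1.0:
        raise ValueError(f"threshold must be between 0 and 1, got {threshold}")
    for old, new in swapper.items():
        if not old:
            raise ValueError("swapper keys must not be empty")
        if len(old) != len(new):
            # Equal lengths keep all later text offsets valid after a swap
            raise ValueError(
                f"replacement {new!r} must have the same length as {old!r}"
            )


# Same values as the input cell above
validate_swap_config(0.65, {")": "J", "0": "Q", "5": "S", "1": "I", "2": "Z", "8": "B"})
print("Configuration looks valid")
```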

## 3. Run Below Code-Cells

In [None]:
def swap_confusion_chars(doc, confidence_threshold=0.75, swapper=None):
    """
    Swap low-confidence characters in the document text using a mapping.

    Parameters:
    doc (object): The document object containing text and page symbols.
    confidence_threshold (float): Symbols with a confidence at or below this value are candidates for swapping.
    swapper (dict): A dictionary mapping characters to their replacement characters.
        Defaults to the global SWAPPER when not provided.

    Returns:
    object: The same document object, with its text modified in place.
    """
    if swapper is None:
        swapper = SWAPPER
    text = doc.text
    for page in doc.pages:
        for symbol in page.symbols:
            if symbol.layout.confidence > confidence_threshold:
                continue
            segments = symbol.layout.text_anchor.text_segments
            if not segments:
                # Skip symbols that are not anchored to the document text
                continue
            start_ind, end_i = segments[0].start_index, segments[0].end_index
            char = text[start_ind:end_i]
            if char in swapper:
                # Offsets of later symbols stay valid only if the replacement has the same length
                text = text[:start_ind] + swapper[char] + text[end_i:]
    # Write the updated text back to the document once, after all swaps
    doc.text = text
    return doc


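splits = GCS_INPUT_FOLDER.strip("/").split("/")
input_bucket, input_folder = splits[2], "/".join(splits[3:])
# Parse the output location from GCS_OUTPUT_FOLDER instead of reusing the input path
out_splits = GCS_OUTPUT_FOLDER.strip("/").split("/")
output_bucket, output_folder = out_splits[2], "/".join(out_splits[3:])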
_, files_dict = file_names(GCS_INPUT_FOLDER)

print(
    "Swapping process started based on provided confidence threshold and swapper_dict"
)
for fn, fp in files_dict.items():
    print(f"Processing File: {fn} ...")
    doc_1 = documentai_json_proto_downloader(input_bucket, fp)
    doc_2 = swap_confusion_chars(doc_1, CONFIDENCE_THRESHOLD)
    json_str = documentai.Document.to_json(doc_2, including_default_value_fields=False)
    out_fp = f"{output_folder}/{fn}"
    print(f"\t Post-processed file storing at gs://{output_bucket}/{out_fp}")
    store_document_as_json(json_str, output_bucket, out_fp)
print("Process Completed!!!")
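To review what the swap actually changed, the text before and after processing can be compared position by position. The following is a rough sketch rather than part of the tool: `list_swaps` is a hypothetical helper, and the two strings are made-up sample values. With a real document, a copy of `doc.text` would need to be saved before calling `swap_confusion_chars`, since the function modifies the document in place.

```python
from typing import List, Tuple


def list_swaps(original: str, modified: str) -> List[Tuple[int, str, str]]:
    """Return (index, old_char, new_char) for every position that differs."""
    if len(original) != len(modified):
        raise ValueError("texts must have the same length to compare by position")
    return [
        (i, old, new)
        for i, (old, new) in enumerate(zip(original, modified))
        if old != new
    ]


# Made-up sample text before and after swapping
changes = list_swaps(")OHN 8ROWN", "JOHN BROWN")
for index, old, new in changes:
    print(f"index {index}: {old!r} -> {new!r}")
```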

## Output

<table>
    <tr>
        <td>
            <b>Pre-processed data</b>
        </td>
    </tr>
    <tr>
        <td>
            <img src='./Images/input_image.png' width=600 height=600 alt='input_image'></img>
        </td>
    </tr>
</table>