# Backmapping Entities From Parser Output to Original Language

* Author: docai-incubator@google.com

## Disclaimer

This tool is not supported by the Google engineering team or product team. It is provided and supported on a best-effort basis by the **DocAI Incubator Team**. No guarantees of performance are implied.

## Objective
This document explains how to backmap entities from the output of a parser trained on a different language to the original language of the document, using the Cloud Translation API.


## Prerequisites
* Vertex AI Notebook
* Documents in GCS folder to backmap
* Parser details
* The `textUnits` option of the **Cloud Translation API** must be allowlisted/enabled for the project

## Workflow to Backmap the Entities to the Original Language

<img src='./images/workflow.png' width=800 height=800></img>

## Step-by-Step Procedure

## 1. Import Modules/Packages

In [None]:
!pip install fuzzywuzzy -q
!pip install google-auth -q
!pip install google-cloud-documentai -q
!pip install google-cloud-storage -q
!pip install numpy -q
!pip install opencv-python -q
!pip install pandas -q
!pip install pillow -q
!pip install python-dateutil -q
!pip install requests -q

In [None]:
# Run this cell to download utilities module
!wget https://raw.githubusercontent.com/GoogleCloudPlatform/document-ai-samples/main/incubator-tools/best-practices/utilities/utilities.py

In [None]:
import pandas as pd

from backmap_utils import (
    document_to_json,
    download_pdf,
    process_document,
    run_consolidate,
    translation_text_units,
    upload_to_cloud_storage,
)
from utilities import file_names

## 2. Input Details

* **PROJECT_ID**: GCP project ID
* **LOCATION**: Location of DocumentAI Processor, either `us` or `eu`
* **PROCESSOR_ID**: DocumentAI Parser processor ID
* **PROCESSOR_VERSION_ID**: DocumentAI Parser processor version ID
* **ORIGINAL_SAMPLES_GCS_PATH**: GCS folder path containing native-language (non-English) documents
* **OUTPUT_BUCKET**: GCS output bucket name to store results (without gs://)
* **OUTPUT_GCS_DIR**: Output folder path to store results in the above-mentioned output bucket (without gs://)
* **MIME_TYPE**: MIME type of the input documents
* **TRANSLATION**: `True` to translate the documents from a non-English language to English, otherwise `False`
* **BACKMAPPING**: `True` to backmap entities from the parser output to the original (non-English) language, otherwise `False`
* **SAVE_TRANSLATED_PDF**: `True` to store the translated documents returned by the Cloud Translation API
* **ORIGINAL_LANGUAGE**: Language code of the original documents, e.g. `de` for German or `el` for Greek input files
* **TARGET_LANGUAGE**: Target language code, e.g. `en` to convert to English
* **DIFF_X**: X-coordinate offset
* **DIFF_Y**: Y-coordinate offset
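
The notebook does not spell out how `DIFF_X` and `DIFF_Y` are applied inside `run_consolidate`. One plausible reading, offered here only as an assumption, is that they act as tolerances on normalized page coordinates when pairing a region of the translated page with a region of the original page. The `Box` type and the `within_offset` helper below are hypothetical names for illustration and are not part of `backmap_utils`.

```python
from typing import NamedTuple, Tuple


class Box(NamedTuple):
    """Axis-aligned box in normalized page coordinates (0.0 to 1.0)."""

    x_min: float
    y_min: float
    x_max: float
    y_max: float


def center(box: Box) -> Tuple[float, float]:
    """Return the (x, y) center of a box."""
    return ((box.x_min + box.x_max) / 2, (box.y_min + box.y_max) / 2)


def within_offset(a: Box, b: Box, diff_x: float, diff_y: float) -> bool:
    """True when the two box centers differ by at most diff_x and diff_y."""
    ax, ay = center(a)
    bx, by = center(b)
    return abs(ax - bx) <= diff_x and abs(ay - by) <= diff_y


# Assumed example boxes: an entity on the translated page and a candidate
# region on the original page.
translated = Box(0.10, 0.20, 0.30, 0.24)
original = Box(0.15, 0.21, 0.40, 0.25)
print(within_offset(translated, original, diff_x=0.3, diff_y=0.05))
```

Under this reading, a larger `DIFF_X` than `DIFF_Y` would tolerate text that grows or shrinks horizontally after translation while still requiring the match to stay on roughly the same line.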

In [None]:
PROJECT_ID = "xx-xx-xx"
LOCATION = "us"  # or "eu"
PROCESSOR_ID = "xx-xx-xx"  # Invoice processor ID
PROCESSOR_VERSION_ID = "pretrained-invoice-v1.3-2022-07-15"
ORIGINAL_SAMPLES_GCS_PATH = "gs://bucket/path_to/backmapping/original_samples"
OUTPUT_BUCKET = "bucket_name_only"  # without gs://
OUTPUT_GCS_DIR = "directory_name"  # without gs://
MIME_TYPE = "application/pdf"
TRANSLATION = True
BACKMAPPING = True
SAVE_TRANSLATED_PDF = True
ORIGINAL_LANGUAGE = "de"
TARGET_LANGUAGE = "en"
DIFF_X = 0.3
DIFF_Y = 0.05
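
Several of the inputs above have format constraints (`LOCATION` is `us` or `eu`, and the bucket and directory values are given without `gs://`). A small check before starting the loop can catch a malformed value early. This is a minimal sketch; `validate_config` is a hypothetical helper, not something provided by `backmap_utils` or `utilities`.

```python
from typing import List


def validate_config(location: str, output_bucket: str, output_dir: str) -> List[str]:
    """Return a list of problems found; an empty list means the values look sane."""
    problems = []
    if location not in ("us", "eu"):
        problems.append(f"LOCATION must be 'us' or 'eu', got {location!r}")
    # A bare bucket name has no scheme and no path separators.
    if output_bucket.startswith("gs://") or "/" in output_bucket:
        problems.append("OUTPUT_BUCKET must be a bare bucket name without gs://")
    if output_dir.startswith("gs://"):
        problems.append("OUTPUT_GCS_DIR must not start with gs://")
    return problems


# Example with the placeholder values used in this notebook.
for problem in validate_config("us", "bucket_name_only", "directory_name"):
    print(problem)
```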

## 3. Run Below Code-Cells

In [None]:
files_list, files_dict = file_names(ORIGINAL_SAMPLES_GCS_PATH)
input_bucket_name = ORIGINAL_SAMPLES_GCS_PATH.split("/")[2]
OUTPUT_GCS_DIR = OUTPUT_GCS_DIR.strip("/")
df_merge = pd.DataFrame()
print(
    "Backmapping DocumentAI Parser Output to its Original Language Process Started..."
)
path_text_units = f"{OUTPUT_GCS_DIR}/text_units"
path_after_translation = f"{OUTPUT_GCS_DIR}/after_translation"
path_after_backmapping = f"{OUTPUT_GCS_DIR}/after_backmapping"
PATH_CONSOLIDATED_CSV = OUTPUT_GCS_DIR
CONSOLIDATED_CSV = "consolidated_csv_after_backamapping.csv"
for fn, fp in files_dict.items():
    print(f"File: {fn}")
    gcs_input_path = f"gs://{input_bucket_name}/{fp}"
    pdf_bytes_target = download_pdf(gcs_input_path, fp)
    # Process the original (non-English) PDF with the DocumentAI parser
    print("\tDocumentAI process sync-started for raw-document")
    target_docai_result = process_document(
        PROJECT_ID,
        LOCATION,
        PROCESSOR_ID,
        PROCESSOR_VERSION_ID,
        file_content=pdf_bytes_target,
        mime_type=MIME_TYPE,
        is_native=False,
        ocr=False,
    )
    json_data_target = document_to_json(target_docai_result)
    filename = fn.split(".")[0]
    if TRANSLATION:
        print("\t\tTranslation process started...")
        input_uri = f"gs://{input_bucket_name}/{fp}"
        pdf_bytes_source, text_units, json_response = translation_text_units(
            PROJECT_ID,
            LOCATION,
            PROCESSOR_VERSION_ID,
            PROCESSOR_ID,
            TARGET_LANGUAGE,
            ORIGINAL_LANGUAGE,
            input_uri,
            OUTPUT_BUCKET,
            OUTPUT_GCS_DIR,
            save_translated_doc=SAVE_TRANSLATED_PDF,
        )
        text_units_dict = {"text_units": text_units}
        upload_to_cloud_storage(
            filename, text_units_dict, OUTPUT_BUCKET, path_text_units
        )
        print("\tDocumentAI process sync-started for translated-document(English)")
        source_docai_result = process_document(
            PROJECT_ID,
            LOCATION,
            PROCESSOR_ID,
            PROCESSOR_VERSION_ID,
            file_content=pdf_bytes_source,
            mime_type=MIME_TYPE,
            is_native=False,
            ocr=False,
        )
        json_data_source = document_to_json(source_docai_result)
        upload_to_cloud_storage(
            filename, json_data_source, OUTPUT_BUCKET, path_after_translation
        )

    if TRANSLATION and BACKMAPPING:
        # Consolidate the extracted and processed data
        print("\t\tBackmapping process started...")
        df, target_json = run_consolidate(
            source_docai_result,
            target_docai_result,
            text_units,
            DIFF_X,
            DIFF_Y,
            ORIGINAL_LANGUAGE,
        )
        target_json = document_to_json(target_json)
        upload_to_cloud_storage(
            filename, target_json, OUTPUT_BUCKET, path_after_backmapping
        )
        df.insert(loc=0, column="File Name", value=filename)
        df_merge = pd.concat([df_merge, df])

upload_to_cloud_storage(
    CONSOLIDATED_CSV, df_merge, OUTPUT_BUCKET, PATH_CONSOLIDATED_CSV
)
print("Process Completed!!!")
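
The cell above derives the input bucket with `ORIGINAL_SAMPLES_GCS_PATH.split("/")[2]`, which silently returns the wrong piece if the value is not a `gs://` URI. A slightly more defensive way to split a GCS URI is sketched below; `split_gcs_uri` is a hypothetical helper written for illustration, not part of `utilities.py` or `backmap_utils`.

```python
from typing import Tuple


def split_gcs_uri(uri: str) -> Tuple[str, str]:
    """Split 'gs://bucket/some/prefix' into ('bucket', 'some/prefix')."""
    scheme = "gs://"
    if not uri.startswith(scheme):
        raise ValueError(f"Expected a gs:// URI, got {uri!r}")
    bucket, _, prefix = uri[len(scheme):].partition("/")
    if not bucket:
        raise ValueError(f"No bucket name found in {uri!r}")
    return bucket, prefix.strip("/")


bucket, prefix = split_gcs_uri("gs://bucket/path_to/backmapping/original_samples")
print(bucket)
print(prefix)
```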

## 4. Output Details

1. Raw document sample (Greek PDF sample)  
    <img src='./images/original_doc_greek.png' width=800 height=800></img><br>  

2. After translation from Greek to English using the Cloud Translation API
     <img src='./images/after_translation_greek_to_eng.png' width=800 height=800></img>

3. Every document translated by the Cloud Translation API contains the text `Machine Translated By Google` at the top-left corner of the translated page
    <img src='./images/redact_noise_after_translation.png' width=800 height=800></img>

4. Sample CSV output data comparing the mention text of entities in the original document with the mention text in the translated document
    <img src='./images/df_comparision_output.png' width=800 height=800></img>
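
Once the consolidated CSV is available, one quick sanity check is to list the entities for which no original-language text was recovered. The sketch below uses an inline sample instead of the real file. Only the `File Name` column is taken from the notebook code; the other column names and the sample rows are assumptions made up for illustration and should be replaced with the headers actually present in the CSV.

```python
import csv
import io

# Illustrative sample only; the real column names may differ.
SAMPLE_CSV = """File Name,entity_type,translated_mention_text,original_mention_text
invoice_1,supplier_name,Example Ltd,Beispiel GmbH
invoice_1,invoice_date,01/02/2023,
"""


def rows_missing_backmap(csv_text, column="original_mention_text"):
    """Return the rows whose backmapped text column is empty or missing."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [row for row in reader if not (row.get(column) or "").strip()]


missing = rows_missing_backmap(SAMPLE_CSV)
for row in missing:
    print(row["File Name"], row["entity_type"])
```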

