# Old OCR JSON to New OCR JSON Conversion

* Author: docai-incubator@google.com

## Disclaimer

This tool is not supported by the Google engineering team or product team. It is provided and supported on a best-effort basis by the **DocAI Incubator Team**. No guarantees of performance are implied. 

## Objective

* The main purpose of this tool is to reprocess a set of JSON files labeled with the old OCR through the new OCR engine while keeping the labeled entities consistent.
* The output of the tool is a new OCR JSON file that replicates the entities present in the original OCR data.
* The tool ensures that the entities identified by the new OCR Engine are mapped appropriately to their corresponding text and page layout information in the new OCR data.
* The term "Old OCR" specifically includes all documents labeled before March 2023. If your dataset falls into this category, it is essential to utilize this tool to reprocess and relabel your documents with the new OCR engine. This step ensures that your data benefits from the latest OCR Engine, enhancing accuracy.

**NOTE**:
* The tool assumes that the bounding boxes were drawn accurately during the original labeling.
* If the new OCR engine picks up noise (stray symbols) inside a bounding box, that noise may appear in the entity's `mentionText`.
* A human review is required to validate the changes.


## Prerequisites

* Vertex AI Notebook.
* Storage bucket that holds the input JSON files (labeled with the old OCR) and receives the output JSON files.
* Permissions for Google DocAI processors, Cloud Storage, and the Vertex AI Notebook.


## Step-by-Step Procedure

### 1. Importing Required Modules

In [None]:
!pip install PyPDF2
!pip install google-cloud-documentai==2.16.0

In [None]:
!wget https://raw.githubusercontent.com/GoogleCloudPlatform/document-ai-samples/main/incubator-tools/best-practices/utilities/utilities.py

In [None]:
from typing import (
    Container,
    Iterable,
    Iterator,
    List,
    Mapping,
    Optional,
    Sequence,
    Tuple,
    Union,
    Dict,
    Any,
)
from io import BytesIO
from pprint import pprint
from google.cloud import storage
from PIL import Image
from PyPDF2 import PdfFileReader
from google.cloud import documentai_v1beta3 as documentai
from tqdm.notebook import tqdm
import io
import json, copy
import utilities

### 2. Set Up the Inputs

* `gcs_input_path`: GCS URI of the folder that contains the Document AI JSON files labeled with the old OCR (for example, a dataset exported from the processor).
* `gcs_output_path`: GCS URI of the folder where the converted JSON files are saved.
* `offset`: Margin by which each existing bounding box is expanded so that it covers all tokens belonging to the entity. The default is 0.005; adjust it as needed.
* `project_number`: Google Cloud project number.
* `location_processor`: Location of the processor (for example, `us` or `eu`).
* `processor_id`: ID of the processor that is called to run the new OCR.
* `processor_version`: Version of that processor to use.
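
To illustrate what `offset` does, here is a minimal sketch that uses plain tuples instead of Document AI objects. The helper names `expand_box` and `contains` and the sample coordinates are hypothetical and not part of the tool; the containment test is a simplified version of the per-token check the tool performs later on.

```python
from typing import Tuple

# (x_min, y_min, x_max, y_max) in normalized page coordinates (0 to 1)
Box = Tuple[float, float, float, float]


def expand_box(box: Box, offset: float) -> Box:
    """Grow a box by `offset` on every side (hypothetical helper)."""
    x_min, y_min, x_max, y_max = box
    return (x_min - offset, y_min - offset, x_max + offset, y_max + offset)


def contains(outer: Box, inner: Box) -> bool:
    """Return True when `inner` lies entirely inside `outer`."""
    return (
        inner[0] >= outer[0]
        and inner[1] >= outer[1]
        and inner[2] <= outer[2]
        and inner[3] <= outer[3]
    )


entity_box = (0.10, 0.20, 0.30, 0.25)  # box drawn during the old labeling
token_box = (0.098, 0.199, 0.301, 0.251)  # new-OCR token, slightly larger

without_offset = contains(entity_box, token_box)
with_offset = contains(expand_box(entity_box, 0.005), token_box)
print(without_offset, with_offset)
```

In this made-up case the token is missed without the margin and captured with it. A larger `offset` catches more tokens but also raises the chance of pulling in neighboring text, which is one reason the results need human review.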

In [None]:
# INPUT: GCS folder that contains the JSON files labeled with the old OCR
gcs_input_path = "gs://xxx-xxx-xxx/xxxx-xxx-xxxx"
# OUTPUT: GCS folder where the converted JSON files are saved
gcs_output_path = "gs://xxx-xxx-xxx/xxxx-xxx-xxxx"
offset = 0.005  # Margin used to expand each existing bounding box so it covers all tokens corresponding to the entity. Adjust as needed.
project_number = "xxx-xxx-xxx"  # Project Number
location_processor = "us"  # Processor location, e.g. "us" or "eu" (used by the processing cell below)
processor_id = (
    "xxx-xxx-xxx"  # Processor ID -> To Call the new Invoice Processor with new OCR
)
processor_version = "xxxxxxxxxxxxx"
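
The processing cell later splits both GCS paths into a bucket name and a path inside the bucket. As a rough sketch of that convention (the helpers `split_gcs_uri` and `join_blob_name` and the example URIs are hypothetical, not functions from `utilities.py`), the split can be checked before running anything against a processor:

```python
from typing import Tuple


def split_gcs_uri(uri: str) -> Tuple[str, str]:
    """Split 'gs://bucket/some/prefix' into ('bucket', 'some/prefix')."""
    if not uri.startswith("gs://"):
        raise ValueError(f"Not a GCS URI: {uri!r}")
    bucket, _, prefix = uri[len("gs://"):].partition("/")
    # Drop leading/trailing slashes so the prefix joins cleanly with a file name.
    return bucket, prefix.strip("/")


def join_blob_name(prefix: str, filename: str) -> str:
    """Build the object name, avoiding a leading slash for an empty prefix."""
    return f"{prefix}/{filename}" if prefix else filename


bucket, prefix = split_gcs_uri("gs://my-bucket/exports/old-ocr/")
print(bucket, prefix, join_blob_name(prefix, "doc1.json"))
```

Normalizing the prefix this way means it does not matter whether the configured path ends with a trailing slash.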

### 3. Run the Code

In [None]:
def find_textSegment_list(
    x_min: float, y_min: float, x_max: float, y_max: float, js: object, page: int
) -> List:
    """
    Finds the text segments within the specified coordinates on the given page of the document.

    Args:
    - x_min (float): Minimum X coordinate.
    - y_min (float): Minimum Y coordinate.
    - x_max (float): Maximum X coordinate.
    - y_max (float): Maximum Y coordinate.
    - js (Document): Document in protobuf format.
    - page (int): Page number.

    Returns:
    - list: List of text segments within the specified coordinates.
    """

    textSegments_list = []
    for token in js.pages[page].tokens:
        vertices = token.layout.bounding_poly.normalized_vertices
        token_xMin, token_yMin = min(point.x for point in vertices), min(
            point.y for point in vertices
        )
        token_xMax, token_yMax = max(point.x for point in vertices), max(
            point.y for point in vertices
        )
        if (
            token_xMin >= x_min
            and token_xMax <= x_max
            and token_yMin >= y_min
            and token_yMax <= y_max
        ):
            textSegments_list.extend(token.layout.text_anchor.text_segments)
    return textSegments_list


def update_text_anchors_mention_text(
    entity: object, js: object, new_js: object
) -> Dict:
    """
    Updates text anchors and mention text for the given entity in the document.

    Args:
    - entity (Entity): Input entity in protobuf format.
    - js (Document): Original document in protobuf format (not used by this function).
    - new_js (Document): New document in protobuf format.

    The bounding box is expanded by the global `offset` set in the inputs cell.

    Returns:
    - Dict: Updated entity with text anchors, mention text, page anchors, and entity type.
    """

    new_entity = {}
    text_anchor = {}
    textAnchorList = []
    x_min = min(
        ver.x
        for ver in entity.page_anchor.page_refs[0].bounding_poly.normalized_vertices
    )
    y_min = min(
        ver.y
        for ver in entity.page_anchor.page_refs[0].bounding_poly.normalized_vertices
    )
    x_max = max(
        ver.x
        for ver in entity.page_anchor.page_refs[0].bounding_poly.normalized_vertices
    )
    y_max = max(
        ver.y
        for ver in entity.page_anchor.page_refs[0].bounding_poly.normalized_vertices
    )
    page = 0
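    try:
        page = int(entity.page_anchor.page_refs[0].page)
    except (TypeError, ValueError):
        # Fall back to the first page when the page reference is missing or malformed.
        page = 0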
    textSegmentList = find_textSegment_list(
        x_min - offset, y_min - offset, x_max + offset, y_max + offset, new_js, page
    )
    for j in textSegmentList:
        if not j.start_index:
            j.start_index = "0"
    text_anchor["text_segments"] = []
    for seg in textSegmentList:
        text_anchor["text_segments"].append(
            {"start_index": seg.start_index, "end_index": seg.end_index}
        )
    textSegmentList = sorted(textSegmentList, key=lambda x: int(x.start_index))
    mentionText = ""
    listOfIndex = []
    for j in textSegmentList:
        mentionText += new_js.text[int(j.start_index) : int(j.end_index)]
    text_anchor["content"] = mentionText
    new_entity["text_anchor"] = text_anchor
    new_entity["mention_text"] = mentionText
    temp_page_anchor = {}
    list_of_page_refs = []
    for i in entity.page_anchor.page_refs:
        temp = {}
        temp2 = {}
        temp3 = []
        for j in i.bounding_poly.normalized_vertices:
            temp3.append({"x": j.x, "y": j.y})
        temp2["normalized_vertices"] = temp3
        temp["bounding_poly"] = temp2
        temp["layout_type"] = i.layout_type
        temp["page"] = str(page)
        list_of_page_refs.append(temp)
    temp_page_anchor["page_refs"] = list_of_page_refs
    new_entity["page_anchor"] = temp_page_anchor
    new_entity["type_"] = entity.type_

    return new_entity


def make_parent_from_child_entities(temp_child: List, new_js: object) -> Dict:
    """
    Combines child entities into a parent entity based on text anchors and mention text.

    Args:
    - temp_child (List[Entity]): List of child entities in protobuf format.
    - new_js (Document): New document in protobuf format.

    Returns:
    - Dict : Parent entity with text anchors, mention text, and page anchors.
    """

    def combine_two_entities(entity1, entity2, js):
        new_entity = {}
        new_entity["type_"] = entity1["type_"]
        text_anchor = {}
        textAnchorList = []
        entity1["text_anchor"]["text_segments"] = sorted(
            entity1["text_anchor"]["text_segments"], key=lambda x: int(x["start_index"])
        )
        entity2["text_anchor"]["text_segments"] = sorted(
            entity2["text_anchor"]["text_segments"], key=lambda x: int(x["start_index"])
        )
        for j in entity1["text_anchor"]["text_segments"]:
            textAnchorList.append(j)
        for j in entity2["text_anchor"]["text_segments"]:
            textAnchorList.append(j)
        textAnchorList = sorted(textAnchorList, key=lambda x: int(x["start_index"]))
        mentionText = ""
        for j in textAnchorList:
            mentionText += js.text[int(j["start_index"]) : int(j["end_index"])]
        new_entity["mention_text"] = mentionText
        text_anchor["content"] = mentionText
        temp_text_anchor_list = []
        for i in range(len(entity1["text_anchor"]["text_segments"])):
            temp_text_anchor_list.append(
                {
                    "start_index": entity1["text_anchor"]["text_segments"][i][
                        "start_index"
                    ],
                    "end_index": entity1["text_anchor"]["text_segments"][i][
                        "end_index"
                    ],
                }
            )
        for i in range(len(entity2["text_anchor"]["text_segments"])):
            temp_text_anchor_list.append(
                {
                    "start_index": entity2["text_anchor"]["text_segments"][i][
                        "start_index"
                    ],
                    "end_index": entity2["text_anchor"]["text_segments"][i][
                        "end_index"
                    ],
                }
            )
        text_anchor["text_segments"] = temp_text_anchor_list
        new_entity["text_anchor"] = text_anchor
        norm_ent_1 = entity1["page_anchor"]["page_refs"][0]["bounding_poly"][
            "normalized_vertices"
        ]
        norm_ent_2 = entity2["page_anchor"]["page_refs"][0]["bounding_poly"][
            "normalized_vertices"
        ]
        min_x, max_x = min(v["x"] for v in [*norm_ent_1, *norm_ent_2]), max(
            v["x"] for v in [*norm_ent_1, *norm_ent_2]
        )
        min_y, max_y = min(v["y"] for v in [*norm_ent_1, *norm_ent_2]), max(
            v["y"] for v in [*norm_ent_1, *norm_ent_2]
        )

        A = {"x": min_x, "y": min_y}
        B = {"x": max_x, "y": min_y}
        C = {"x": max_x, "y": max_y}
        D = {"x": min_x, "y": max_y}
        new_entity["page_anchor"] = entity1["page_anchor"]
        new_entity["page_anchor"]["page_refs"][0]["bounding_poly"][
            "normalized_vertices"
        ] = [A, B, C, D]
        return new_entity

    if len(temp_child) == 1:
        return temp_child[0]
    if len(temp_child) == 2:
        parent_entity = combine_two_entities(temp_child[0], temp_child[1], new_js)
        return parent_entity
    parent_entity = combine_two_entities(temp_child[0], temp_child[1], new_js)
    for i in range(2, len(temp_child)):
        parent_entity = combine_two_entities(parent_entity, temp_child[i], new_js)

    return parent_entity
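
The two core operations in the functions above are rebuilding an entity's text from its sorted text segments and merging child bounding boxes into one parent box. The following is a minimal sketch of both on toy dictionaries; `text_from_segments`, `union_box`, and the sample data are hypothetical and only mirror the logic, they are not part of the tool.

```python
from typing import Dict, List


def text_from_segments(text: str, segments: List[Dict[str, int]]) -> str:
    """Concatenate document text for the segments, ordered by start index."""
    ordered = sorted(segments, key=lambda s: int(s.get("start_index", 0)))
    return "".join(
        text[int(s.get("start_index", 0)) : int(s["end_index"])] for s in ordered
    )


def union_box(*polys: List[Dict[str, float]]) -> List[Dict[str, float]]:
    """Smallest axis-aligned box covering all given normalized vertices."""
    xs = [v["x"] for poly in polys for v in poly]
    ys = [v["y"] for poly in polys for v in poly]
    return [
        {"x": min(xs), "y": min(ys)},  # top-left
        {"x": max(xs), "y": min(ys)},  # top-right
        {"x": max(xs), "y": max(ys)},  # bottom-right
        {"x": min(xs), "y": max(ys)},  # bottom-left
    ]


document_text = "Qty 2 Widget 9.99\n"
# Segments deliberately out of order to show the effect of sorting.
segments = [{"start_index": 6, "end_index": 13}, {"start_index": 4, "end_index": 6}]
mention = text_from_segments(document_text, segments)

poly_a = [{"x": 0.1, "y": 0.2}, {"x": 0.2, "y": 0.2},
          {"x": 0.2, "y": 0.25}, {"x": 0.1, "y": 0.25}]
poly_b = [{"x": 0.3, "y": 0.21}, {"x": 0.5, "y": 0.21},
          {"x": 0.5, "y": 0.26}, {"x": 0.3, "y": 0.26}]
parent_box = union_box(poly_a, poly_b)
print(repr(mention), parent_box)
```

Sorting by start index keeps the rebuilt text in reading order of the new OCR output, regardless of the order in which tokens were collected.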

In [None]:
file_names_list, file_dict = utilities.file_names(gcs_input_path)
for filename, filepath in tqdm(file_dict.items(), desc="Progress"):
    print(">>>>>>>>>>>>>>> Processing File : ", filename)
    input_bucket_name = gcs_input_path.split("/")[2]
    if ".json" in filepath:
        js = utilities.documentai_json_proto_downloader(input_bucket_name, filepath)
        merged_pdf, images = utilities.create_pdf_bytes_from_json(
            documentai.Document.to_dict(js)
        )
        res = utilities.process_document_sample(
            project_number,
            location_processor,
            processor_id,
            merged_pdf,
            processor_version,
        )
        del res.document.entities
        new_js = res.document
        updated_entities = []
        for entity in js.entities:
            temp_child = []
            ent = {}
            if entity.properties:
                for child_item in entity.properties:
                    ent_ch = update_text_anchors_mention_text(child_item, js, new_js)
                    temp_child.append(ent_ch)
                ent = make_parent_from_child_entities(copy.deepcopy(temp_child), new_js)
                ent["type_"] = entity.type_
                ent["properties"] = temp_child
            else:
                ent = update_text_anchors_mention_text(entity, js, new_js)
            updated_entities.append(ent)
        new_js.entities = updated_entities
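        output_bucket_name = gcs_output_path.split("/")[2]
        # Build "<prefix>/<filename>", adding the separator the URI may lack.
        output_prefix = "/".join(gcs_output_path.split("/")[3:]).strip("/")
        output_path_within_bucket = (
            f"{output_prefix}/{filename}" if output_prefix else filename
        )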
        utilities.store_document_as_json(
            documentai.Document.to_json(new_js),
            output_bucket_name,
            output_path_within_bucket,
        )

### 4. Output
The converted JSON files are stored in the output directory.
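
Because a human review is required, it can help to list the entities whose text changed between the old and new JSON. The sketch below is one possible approach, not part of the tool: `changed_mentions` is a hypothetical helper, and it assumes the entities were loaded from the exported JSON as dictionaries with camelCase `type` and `mentionText` keys and that both files list entities in the same order (the conversion loop preserves the original order).

```python
from typing import Dict, List, Tuple


def changed_mentions(
    old_entities: List[Dict[str, str]], new_entities: List[Dict[str, str]]
) -> List[Tuple[str, str, str]]:
    """Return (type, old_text, new_text) for entities whose text differs."""
    changes = []
    # Pair entities by position; assumes both lists share the same order.
    for old, new in zip(old_entities, new_entities):
        old_text = old.get("mentionText", "")
        new_text = new.get("mentionText", "")
        if old_text != new_text:
            changes.append((old.get("type", ""), old_text, new_text))
    return changes


# Made-up sample data for illustration only.
old = [
    {"type": "invoice_id", "mentionText": "INV-001"},
    {"type": "total_amount", "mentionText": "100.00"},
]
new = [
    {"type": "invoice_id", "mentionText": "INV-001"},
    {"type": "total_amount", "mentionText": "$ 100.00"},
]
to_review = changed_mentions(old, new)
for entity_type, before, after in to_review:
    print(f"{entity_type}: {before!r} -> {after!r}")
```

Entities that appear in this list are good candidates to check first, for example for extra symbols picked up by the new OCR.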
