# PRE - POST HITL Bounding Box Mismatch 

* Author: docai-incubator@google.com

## Disclaimer

This tool is not supported by the Google engineering team or product team. It is provided and supported on a best-effort basis by the DocAI Incubator Team. No guarantees of performance are implied.	

## Purpose of the script

This tool compares Pre-HITL and Post-HITL output and detects two kinds of issues: Parser issues and OCR issues.
The output consists of a summary JSON file with basic stats (the count of OCR and Parser issues for the entities in each document) and the corresponding analysis CSV files.

 * **Parser issue :** This issue is attributed to the parser when the bounding box does not cover the text region completely, so the required text is not captured completely. The user opens the HITL worker UI, adjusts the bounding box to include the whole text region, and saves. The script highlights such cases.

 * **OCR issue :** This issue is attributed to OCR when the bounding box covers the whole text region but the expected text is still not captured completely. The script highlights such cases.
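
The two definitions above can be sketched as a small decision rule. This is a minimal illustration rather than the script itself: `classify_issue` and its arguments are hypothetical names, and bounding boxes are assumed to be plain lists of coordinates compared with strict equality.

```python
def classify_issue(pre_text, post_text, pre_bbox, post_bbox):
    """Classify one entity as "OCR issue", "Parser issue", or "No issue"."""
    text_changed = pre_text != post_text
    bbox_changed = pre_bbox != post_bbox
    if text_changed and not bbox_changed:
        # Same region, different text: the text inside the box was misread.
        return "OCR issue"
    if text_changed and bbox_changed:
        # The reviewer had to move or resize the box to get the right text.
        return "Parser issue"
    return "No issue"


box = [0.1, 0.2, 0.3, 0.4]  # made-up coordinates
wider_box = [0.1, 0.2, 0.5, 0.4]
print(classify_issue("1O0.00", "100.00", box, box))
print(classify_issue("INV-10", "INV-1001", box, wider_box))
```

The first call keeps the box and changes only the text, so it falls into the OCR case; the second call changes both, so it falls into the Parser case.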

## Prerequisites
 * Vertex AI Notebook
 * Google Cloud Storage bucket
 * Pre-HITL and Post-HITL JSON files in GCS folders (the file names must be the same in both folders)
 * DocumentAI and HITL

## Step by Step procedure 
### 1. Setup the required inputs
#### Execute the below code

In [None]:
project_id = "<Project-ID>"
pre_HITL_output_URI = "gs://<bucket-name>/<folder_pre>"
post_HITL_output_URI = "gs://<bucket-name>/<folder_post>"

 * **project_id:** the Google Cloud project ID
 * **pre_HITL_output_URI:** the GCS path of the Pre-HITL JSON files (processed JSON files)
 * **post_HITL_output_URI:** the GCS path of the Post-HITL JSON files (JSON files processed through HITL)
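
The script takes the bucket name from each URI with `split("/")[2]`. A slightly more explicit way to separate the bucket from the folder prefix might look like the sketch below. `split_gcs_uri` is a hypothetical helper, not part of the tool, and the URI in the example is a placeholder.

```python
def split_gcs_uri(uri):
    """Split "gs://bucket/path/to/folder" into (bucket, prefix)."""
    if not uri.startswith("gs://"):
        raise ValueError(f"Not a GCS URI: {uri!r}")
    bucket, _, prefix = uri[len("gs://"):].partition("/")
    if not bucket:
        raise ValueError(f"Missing bucket name in {uri!r}")
    # Drop leading/trailing slashes so the prefix can be joined safely later.
    return bucket, prefix.strip("/")


bucket, prefix = split_gcs_uri("gs://my-bucket/folder_pre/")
print(bucket, prefix)
```

Failing early on a malformed URI gives a clearer error than an `IndexError` raised later in the run.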

**NOTE:** By default, the name of a Post-HITL JSON file is not the same as the original file name. Rename the Post-HITL files to match the Pre-HITL file names before using this tool.
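
Before running the tool, it can help to confirm that every Pre-HITL file has a Post-HITL file with the same name. The following is a minimal sketch, assuming the two lists of file names are already available (for example from listing each GCS folder). `unmatched_file_names` is a hypothetical helper and the file names are made up.

```python
def unmatched_file_names(pre_names, post_names):
    """Return (only_in_pre, only_in_post) as sorted lists of file names."""
    pre, post = set(pre_names), set(post_names)
    return sorted(pre - post), sorted(post - pre)


pre_files = ["invoice_1.json", "invoice_2.json", "invoice_3.json"]
post_files = ["invoice_1.json", "invoice_3.json", "hitl_output_42.json"]

only_pre, only_post = unmatched_file_names(pre_files, post_files)
print("Missing in Post-HITL folder:", only_pre)
print("Unexpected in Post-HITL folder:", only_post)
```

Any name reported in either list points to a file that still needs to be renamed or added.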

### 2. Output
The script produces a result summary table that highlights the count of Parser and OCR issues for each file. The result table contains details of the entity changes between Pre-HITL and Post-HITL, including whether the bounding box coordinates changed after HITL processing. The screenshots below show examples of a Parser issue and an OCR issue.

![](https://screenshot.googleplex.com/6S47qFm5SjP8eMC.png)
![](https://screenshot.googleplex.com/6HyQwucSQPZR4ii.png)

A summary JSON file is generated that gives, for each processed file, the count of bounding box mismatches, OCR issues, and Parser issues, together with the path to its analysis result table.

![](https://screenshot.googleplex.com/55R5NKSuVYmyP9H.png)
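
Based on the keys the script writes for each file (`bbox_mismatch`, `OCR_issue`, `Parser_issue`, `output_file`), the summary can be loaded and totalled as sketched below. The file names and counts in `sample` are invented for illustration, and `total_issues` is a hypothetical helper.

```python
import json

# Invented sample with the same per-file keys the script writes.
sample = """
{
  "doc_1.json": {"bbox_mismatch": 2, "OCR_issue": 1, "Parser_issue": 2,
                 "output_file": "analysis_01_01_24-120000/doc_1.csv"},
  "doc_2.json": {"bbox_mismatch": 0, "OCR_issue": 3, "Parser_issue": 0,
                 "output_file": "analysis_01_01_24-120000/doc_2.csv"}
}
"""


def total_issues(summary):
    """Sum the per-file counters across all files in the summary."""
    keys = ("bbox_mismatch", "OCR_issue", "Parser_issue")
    return {key: sum(stats[key] for stats in summary.values()) for key in keys}


summary = json.loads(sample)
totals = total_issues(summary)
print(totals)
```

With a real run, `summary` would come from `json.load` on the generated `summary_<timestamp>.json` file instead of the inline sample.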

The entity-wise analysis for each file is available in the CSV files under the analysis folder, which the script names `analysis_<timestamp>/`.

![](https://screenshot.googleplex.com/BKd5QCidEJac9Jy.png)
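
To pull only the flagged entities out of an analysis CSV, something like the following could be used. It is a sketch that reads the CSV with Python's standard `csv` module. The sample rows are invented, only a subset of the columns is shown, and `flagged_rows` is a hypothetical helper.

```python
import csv
import io

# Invented sample using column names that the script writes.
sample_csv = """File Name,Entity Type,Pre_HITL_Output,Post_HITL_Output,OCR Issue,Parser Issue
doc_1.json,invoice_id,INV-10,INV-1001,No,Yes
doc_1.json,total_amount,1O0.00,100.00,Yes,No
doc_1.json,currency,USD,USD,No,No
"""


def flagged_rows(csv_file):
    """Return the rows where either the OCR or the Parser flag is "Yes"."""
    reader = csv.DictReader(csv_file)
    return [
        row
        for row in reader
        if row["OCR Issue"] == "Yes" or row["Parser Issue"] == "Yes"
    ]


rows = flagged_rows(io.StringIO(sample_csv))
for row in rows:
    print(row["Entity Type"], row["OCR Issue"], row["Parser Issue"])
```

For a real file, pass an open file handle (for example `open(path, newline="")`) in place of the `io.StringIO` object.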

**Table columns:**

The result table has the following columns:
 * File Name : name of the file
 * Entity Type : type of the entity
 * Pre_HITL_Output : entity text before HITL
 * Pre_HITL_bbox : entity bounding box coordinates before HITL
 * Post_HITL_Output : entity text after HITL
 * hitl_update : whether HITL updated that particular entity (YES or NO)
 * Post_HITL_bbox : entity bounding box coordinates after HITL
 * Fuzzy Ratio : text match score between the Pre-HITL and Post-HITL text (1.0 means an exact match)
 * bbox_mismatch : whether the Pre-HITL and Post-HITL bounding box coordinates differ
 * OCR Issue : whether the entity is classified as an OCR issue
 * Parser Issue : whether the entity is classified as a Parser issue
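
How the Fuzzy Ratio is computed is not shown in this document. As an illustration only, a similarity score between 0 and 1 can be obtained with Python's standard `difflib`; the tool's own computation may differ, and `text_match_ratio` is a hypothetical name.

```python
from difflib import SequenceMatcher


def text_match_ratio(pre_text, post_text):
    """Return a similarity score between 0.0 and 1.0 (1.0 = identical)."""
    return SequenceMatcher(None, pre_text, post_text).ratio()


same = text_match_ratio("INV-1001", "INV-1001")
changed = text_match_ratio("INV-10", "INV-1001")
print(same, changed)
```

Identical strings score exactly 1.0, which is the value the script treats as "no HITL update"; any edit to the text lowers the score.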


## Notebook Script

**Install the libraries below before executing the script** \
If you encounter an error while importing libraries, verify that they are installed.

In [None]:
!pip install google-cloud-documentai
!pip install PyPDF2

**Script**

In [None]:
# Standard library
import datetime
import json
import os
import re

# Third-party libraries
import numpy as np
import pandas as pd

# Helper module providing the GCS and Document AI functions used below
import utilities

# Silence pandas' chained-assignment warning (default is "warn")
pd.options.mode.chained_assignment = None


now = str(datetime.datetime.now())
now = re.sub(r"\W+", "", now)

print("Creating temporary buckets")
pre_HITL_bucket_name_temp = "pre_hitl_output" + "_" + now
post_HITL_bucket_name_temp = "post_hitl_output_temp" + "_" + now
# bucket name and prefix
pre_HITL_bucket = pre_HITL_output_URI.split("/")[2]
post_HITL_bucket = post_HITL_output_URI.split("/")[2]
# getting all files and copying to temporary folder

try:
    utilities.check_create_bucket(pre_HITL_bucket_name_temp)
    utilities.check_create_bucket(post_HITL_bucket_name_temp)
except Exception as e:
    print("unable to create bucket because of exception : ", e)

try:
    pre_HITL_output_files, pre_HITL_output_dict = utilities.file_names(
        pre_HITL_output_URI
    )
    # print(pre_HITL_output_files,pre_HITL_output_dict)
    post_HITL_output_files, post_HITL_output_dict = utilities.file_names(
        post_HITL_output_URI
    )
    # print(post_HITL_output_files,post_HITL_output_dict)
    print("copying files to temporary bucket")
    for i in pre_HITL_output_files:
        utilities.copy_blob(
            pre_HITL_bucket, pre_HITL_output_dict[i], pre_HITL_bucket_name_temp, i
        )
    for i in post_HITL_output_files:
        utilities.copy_blob(
            post_HITL_bucket, post_HITL_output_dict[i], post_HITL_bucket_name_temp, i
        )
    pre_HITL_files_list = utilities.list_blobs(pre_HITL_bucket_name_temp)
    post_HITL_files_list = utilities.list_blobs(post_HITL_bucket_name_temp)
except Exception as e:
    print("unable to get list of files in buckets because : ", e)
# match the Pre-HITL and Post-HITL files across the two temporary buckets
relation_dict, non_relation_dict = utilities.matching_files_two_buckets(
    pre_HITL_bucket_name_temp, post_HITL_bucket_name_temp
)

time_stamp = datetime.datetime.now().strftime("%d_%m_%y-%H%M%S")
filename_error_count_dict = {}

compare_merged = pd.DataFrame()
accuracy_docs = []
print("comparing the PRE-HITL Jsons and POST-HITL jsons ....Wait for Summary ")
for i in relation_dict:
    # print("***** i : ", i)
    pre_HITL_json = utilities.documentai_json_proto_downloader(
        pre_HITL_bucket_name_temp, i
    )
    post_HITL_json = utilities.documentai_json_proto_downloader(
        post_HITL_bucket_name_temp, relation_dict[i]
    )
    # print('pre_HITL_json : ', pre_HITL_json)
    # print('post_HITL_json : ', post_HITL_json)
    compare_output = utilities.compare_pre_hitl_and_post_hitl_output(
        pre_HITL_json, post_HITL_json
    )[0]
    # Rename columns
    compare_output = compare_output.rename(
        columns={"pre_bbox": "Pre_HITL_bbox", "post_bbox": "Post_HITL_bbox"}
    )

    # Drop unwanted columns
    compare_output = compare_output.drop(["page1", "page2"], axis=1)

    # print('compare_output :',compare_output)
    # display(compare_output)
    column = [relation_dict[i]] * compare_output.shape[0]
    # print("++++column++++")
    # print(column)
    compare_output.insert(loc=0, column="File Name", value=column)

    compare_output.insert(loc=5, column="hitl_update", value=" ")
    # An entity counts as updated by HITL when its text changed (Fuzzy Ratio
    # other than 1.0), unless it is missing both before and after HITL.
    for j in range(len(compare_output)):
        text_changed = compare_output["Fuzzy Ratio"][j] != 1.0  # strict
        missing_in_both = (
            compare_output["Pre_HITL_Output"][j] == "Entity not found."
            and compare_output["Post_HITL_Output"][j] == "Entity not found."
        )
        updated = text_changed and not missing_in_both
        compare_output.loc[j, "hitl_update"] = "YES" if updated else "NO"

    # File-level flag used in the summary row appended below. It is reset for
    # every file so that a value from a previous file is never reused.
    # Assumption: "NO" is reported when no entity text changed.
    hitl_update = "NO"
    if (compare_output["Fuzzy Ratio"] != 1.0).any():
        hitl_update = "HITL UPDATED"

    ##
    compare_output["bbox_mismatch"] = (
        compare_output["Pre_HITL_bbox"] != compare_output["Post_HITL_bbox"]
    )

    # OCR Issue
    compare_output["OCR Issue"] = "No"
    # compare_output.loc[(compare_output['Pre_HITL_Output'] != compare_output['Post_HITL_Output']), 'OCR Issue']  = 'Yes' # & cordinates are same
    compare_output.loc[
        (compare_output["Pre_HITL_Output"] != compare_output["Post_HITL_Output"])
        & (compare_output["Pre_HITL_bbox"] == compare_output["Post_HITL_bbox"]),
        "OCR Issue",
    ] = "Yes"

    # Parser Issue
    compare_output["Parser Issue"] = "No"
    compare_output.loc[
        (compare_output["hitl_update"] == "YES")
        & (compare_output["bbox_mismatch"] == True),
        "Parser Issue",
    ] = "Yes"  # & cordinates are different
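    # Entities missing before or after HITL are also flagged as Parser issues.
    # Applied to the current file's table (compare_output) so that every file,
    # including the last one, is covered.
    compare_output.loc[
        (compare_output["Post_HITL_Output"] == "Entity not found.")
        | (compare_output["Pre_HITL_Output"] == "Entity not found."),
        "Parser Issue",
    ] = "Yes"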

    ## global dict : no of parser error / file
    temp = {}
    temp["bbox_mismatch"] = len(compare_output[compare_output["bbox_mismatch"] == True])

    temp["OCR_issue"] = len(
        compare_output.loc[
            (compare_output["Pre_HITL_Output"] != compare_output["Post_HITL_Output"])
            & (compare_output["Pre_HITL_bbox"] == compare_output["Post_HITL_bbox"])
        ]
    )
    temp["Parser_issue"] = len(
        compare_output.loc[
            (compare_output["hitl_update"] == "YES")
            & (compare_output["bbox_mismatch"] == True)
        ]
    )
    temp["output_file"] = "analysis_" + time_stamp + "/" + i.replace("json", "csv")

    filename_error_count_dict[i] = temp

    new_row = pd.Series(
        [
            i,
            "Entities",
            "are updated",
            "by HITL",
            ":",
            np.nan,
            hitl_update,
            "",
            "",
            "",
            "",
            "",
        ],
        index=compare_output.columns,
    )
    # DataFrame.append was removed in pandas 2.0; pd.concat adds the row instead
    compare_output = pd.concat(
        [compare_output, new_row.to_frame().T], ignore_index=True
    )
    frames = [compare_merged, compare_output]
    compare_merged = pd.concat(frames)

with open("summary_" + time_stamp + ".json", "w") as ofile:
    ofile.write(json.dumps(filename_error_count_dict))

# Write one analysis CSV per file into the analysis folder
analysis_dir = "analysis_" + time_stamp
os.makedirs(analysis_dir, exist_ok=True)
for x in relation_dict:
    file_out = compare_merged[compare_merged["File Name"] == x]
    file_out.to_csv(analysis_dir + "/" + x.replace("json", "csv"))

utilities.bucket_delete(pre_HITL_bucket_name_temp)
utilities.bucket_delete(post_HITL_bucket_name_temp)