# Pre and Post HITL Visualization

* Author: docai-incubator@google.com

## Disclaimer

This tool is not supported by the Google engineering team or product team. It is provided and supported on a best-effort basis by the DocAI Incubator Team. No guarantees of performance are implied.

## Purpose of the script
This tool takes Pre-HITL JSON files (parsed by a processor) and Post-HITL JSON files (updated through HITL) from a GCS bucket as input, compares them, and writes the differences to an Excel file together with page images annotated with bounding boxes.


## Prerequisites
 * Vertex AI Notebook
 * Pre-HITL and Post-HITL JSON files (with matching file names) in GCS folders



### Download utilities module for incubator-tools

In [None]:
# Download incubator-tools utilities module to present-working-directory
!wget https://raw.githubusercontent.com/GoogleCloudPlatform/document-ai-samples/main/incubator-tools/best-practices/utilities/utilities.py

## Step by Step procedure 
### 1. Setup the required inputs
#### Execute the below code

In [None]:
project_id = "<Project-ID>"
pre_HITL_output_URI = "gs://<bucket-name>/<folder_pre>"
post_HITL_output_URI = "gs://<bucket-name>/<folder_post>"

 * **project_id**: the Google Cloud project ID
 * **pre_HITL_output_URI**: the GCS path of the Pre-HITL JSON files (processor output)
 * **post_HITL_output_URI**: the GCS path of the Post-HITL JSON files (JSON files processed through HITL)
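
The script further down extracts the bucket name from these URIs with `split("/")[2]`. As a rough illustration only, the sketch below splits a `gs://` URI into bucket and folder prefix and rejects malformed input early. The helper name `split_gcs_uri` is hypothetical and is not part of `utilities.py`.

```python
from typing import Tuple


def split_gcs_uri(uri: str) -> Tuple[str, str]:
    """Split 'gs://bucket/path/to/folder' into (bucket, prefix).

    Hypothetical helper for illustration; the tool itself uses uri.split("/")[2].
    """
    scheme = "gs://"
    if not uri.startswith(scheme):
        raise ValueError(f"Expected a gs:// URI, got: {uri!r}")
    # Everything before the first "/" is the bucket, the rest is the folder prefix.
    bucket, _, prefix = uri[len(scheme):].partition("/")
    if not bucket:
        raise ValueError(f"No bucket name found in: {uri!r}")
    return bucket, prefix.strip("/")


print(split_gcs_uri("gs://my-bucket/folder_pre/"))  # ('my-bucket', 'folder_pre')
```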

**NOTE:** By default, a Post-HITL JSON file does not have the same name as the original file. Rename the Post-HITL files to match the Pre-HITL file names before using this tool.
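
Before running the tool, it can help to check which Pre-HITL files have a Post-HITL counterpart with the same name. The sketch below pairs two lists of object paths by base file name. It only illustrates the idea: the function name `match_by_file_name` and the sample paths are made up, and the tool itself relies on `utilities.matching_files_two_buckets`, whose matching rules may differ.

```python
import posixpath
from typing import Dict, Iterable, List, Tuple


def match_by_file_name(
    pre_paths: Iterable[str], post_paths: Iterable[str]
) -> Tuple[Dict[str, str], List[str]]:
    """Pair pre- and post-HITL object paths that share the same base file name.

    Returns (matched, unmatched): matched maps each pre-HITL path to its
    post-HITL path; unmatched lists pre-HITL paths with no counterpart.
    """
    post_by_name = {posixpath.basename(path): path for path in post_paths}
    matched: Dict[str, str] = {}
    unmatched: List[str] = []
    for pre_path in pre_paths:
        name = posixpath.basename(pre_path)
        if name in post_by_name:
            matched[pre_path] = post_by_name[name]
        else:
            unmatched.append(pre_path)
    return matched, unmatched


# Hypothetical object paths, for illustration only
pre = ["folder_pre/invoice_1.json", "folder_pre/invoice_2.json"]
post = ["folder_post/invoice_1.json", "folder_post/renamed.json"]
matched, unmatched = match_by_file_name(pre, post)
print(matched)
print(unmatched)
```

Any path reported as unmatched would need its Post-HITL file renamed (or is simply not yet processed through HITL).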

### 2. Output

The tool produces an Excel file that lists the entities updated in HITL as well as the unchanged ones, together with images of the labeled documents (both pre- and post-HITL).

The workbook contains a summary of all files in the “Consolidated_Data” sheet and a comparison in a separate sheet for each file.

Each workbook holds a batch of 20 files.
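
As a rough sketch of this batching, the snippet below assigns a list of file names to workbooks of 20 files each, using the `HITL_VISUAL<n>.xlsx` naming from the sample code. The function `plan_workbooks` is hypothetical; the sample code instead counts the sheets of an existing workbook to decide when to start a new one.

```python
from typing import Iterator, List, Sequence, Tuple

FILES_PER_WORKBOOK = 20  # batch size described above


def plan_workbooks(
    file_names: Sequence[str], batch_size: int = FILES_PER_WORKBOOK
) -> Iterator[Tuple[str, List[str]]]:
    """Yield (workbook name, files in that workbook) in batches of batch_size."""
    for start in range(0, len(file_names), batch_size):
        index = start // batch_size + 1  # workbooks are numbered from 1
        yield f"HITL_VISUAL{index}.xlsx", list(file_names[start:start + batch_size])


# Hypothetical file names, for illustration only
files = [f"doc_{n}.json" for n in range(45)]
for workbook_name, batch in plan_workbooks(files):
    print(workbook_name, len(batch))
```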

![](https://screenshot.googleplex.com/6nL7E3hrRSEi6ST.png)

The Excel file contains the Pre-HITL text, the Post-HITL text, and whether the entity was updated in HITL (YES or NO), as shown below.
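
The YES/NO flag follows the rule used in the sample code: an entity counts as updated when its fuzzy ratio is not 1.0, unless both sides are "Entity not found.". The sketch below reproduces that rule for a single pair of strings. It uses `difflib.SequenceMatcher` as a stand-in for the fuzzy ratio, which is an assumption: the ratio in the tool comes from `utilities.compare_pre_hitl_and_post_hitl_output` and may be computed differently. The function name `hitl_update_flag` is hypothetical.

```python
from difflib import SequenceMatcher

NOT_FOUND = "Entity not found."  # placeholder text used by the comparison output


def hitl_update_flag(pre_text: str, post_text: str) -> str:
    """Return "YES" if the entity text changed in HITL, otherwise "NO"."""
    if pre_text == NOT_FOUND and post_text == NOT_FOUND:
        return "NO"
    # Assumption: SequenceMatcher stands in for the tool's own fuzzy ratio.
    ratio = SequenceMatcher(None, pre_text, post_text).ratio()
    return "NO" if ratio == 1.0 else "YES"


print(hitl_update_flag("INV-001", "INV-001"))  # NO
print(hitl_update_flag("INV-001", "INV-007"))  # YES
```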

![](https://screenshot.googleplex.com/8wqPTMyUY5ASKZA.png)

Documents for which either the required confidence threshold was met or no HITL output has been created yet are listed at the end of the consolidated sheet, marked “NO POST HITL OUTPUT AVAILABLE”.

![](https://screenshot.googleplex.com/8tpFZsVfFdTBoKA.png)


**Bounding box color coding in images**

 * Blue bounding box ⇒ entities in the Pre-HITL JSON
 * Red bounding box ⇒ entities updated in HITL
 * Green bounding box ⇒ entities deleted in HITL (entities detected by the parser and then deleted in HITL)

![](https://screenshot.googleplex.com/9aph7w2N2vywPFP.png)
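
The sample code draws these boxes by scaling normalized coordinates to the pixel size of the page image. The sketch below shows that conversion on its own, assuming each box is `[x_min, y_min, x_max, y_max]` with values between 0 and 1, which is how the sample code indexes it. The names `BOX_COLORS_RGB` and `to_pixel_box` are hypothetical.

```python
from typing import Sequence, Tuple

# RGB colors matching the legend above (the page images are handled as RGB)
BOX_COLORS_RGB = {
    "pre_hitl": (0, 0, 255),  # blue: entities in the Pre-HITL JSON
    "updated": (255, 0, 0),   # red: entities updated in HITL
    "deleted": (0, 255, 0),   # green: entities deleted in HITL
}


def to_pixel_box(
    bbox: Sequence[float], image_width: int, image_height: int
) -> Tuple[Tuple[int, int], Tuple[int, int]]:
    """Convert a normalized [x_min, y_min, x_max, y_max] box to pixel corners."""
    x_min, y_min, x_max, y_max = bbox
    top_left = (int(x_min * image_width), int(y_min * image_height))
    bottom_right = (int(x_max * image_width), int(y_max * image_height))
    return top_left, bottom_right


print(to_pixel_box([0.25, 0.5, 0.75, 0.625], image_width=800, image_height=1000))
# ((200, 500), (600, 625))
```

The two corner tuples are in the form expected by `cv2.rectangle(image, pt1, pt2, color, thickness)`.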

Pre Post Bounding Box Mismatch
**Sample Code**

In [None]:
# Install the libraries below once if they are not already available
#!pip install google-cloud-storage
#!pip install google-cloud-documentai
#!pip install openpyxl
#!pip install opencv-python
#!pip install Pillow

# importing libraries (unused and duplicate imports removed)
import re

import cv2
import numpy
import openpyxl
import pandas as pd
from google.cloud import documentai_v1beta3 as documentai
from PIL import Image

import utilities

pd.options.mode.chained_assignment = None  # default='warn'


def find_excel_name():
    """Return the workbook to append to, creating a new one when needed.

    Each workbook holds a "Consolidated_Data" sheet plus up to 20 per-file sheets.
    Relies on the module-level ``compare_merged`` DataFrame built further below.
    """
    compare_analysis = compare_merged.drop(
        ["pre_bbox", "post_bbox", "page1", "page2"], axis=1
    )
    i = 1
    while True:
        excel_file = "HITL_VISUAL" + str(i) + ".xlsx"
        try:
            workbook = openpyxl.load_workbook(excel_file)
        except FileNotFoundError:
            # First use of this workbook: start it with the consolidated summary
            compare_analysis.to_excel(excel_file, sheet_name="Consolidated_Data")
            return excel_file
        if len(workbook.sheetnames) <= 20:
            return excel_file
        # Workbook is full: try the next one instead of overwriting it
        i += 1


def get_visualization_excel(pre_HITL_output_URI, compare_merged, relation_dict):
    # compare_merged.to_excel("HITL_VISUAL1.xlsx",sheet_name='Consolidated_Data')
    pre_HITL_bucket = pre_HITL_output_URI.split("/")[2]
    pre_HITL_output_files, pre_HITL_output_dict = utilities.file_names(
        pre_HITL_output_URI
    )
    for file in pre_HITL_output_dict:
        excel_file = find_excel_name()
        df = compare_merged.drop(["pre_bbox", "post_bbox", "page1", "page2"], axis=1)
        if file in relation_dict.keys():
            df_file = df[df["File Name"] == file]
            with pd.ExcelWriter(excel_file, engine="openpyxl", mode="a") as writer:
                df_file.to_excel(writer, sheet_name=str(file))

            # path="gs://"+pre_HITL_bucket+'/'+pre_HITL_output_dict[file]
            GT_json = utilities.documentai_json_proto_downloader(
                pre_HITL_bucket, pre_HITL_output_dict[file]
            )
            pdf_bytes, synthesized_images = utilities.create_pdf_bytes_from_json(
                documentai.Document.to_dict(GT_json)
            )
            # Group bounding boxes by page number
            list_bbox_no = {}  # unchanged entities (drawn in blue)
            list_bbox_yes_changed = {}  # updated entities (drawn in red)
            list_bbox_yes_old = {}  # deleted entities (drawn in green)
            for row in compare_merged.values:
                if row[0] != file:
                    continue
                # Positions used here: 4 and 5 hold the pre/post bounding boxes,
                # 6 and 7 the matching page numbers, 8 the hitl_update flag
                has_pre_bbox = isinstance(row[4], list) and row[4] != []
                has_post_bbox = isinstance(row[5], list) and row[5] != []
                if row[8] == "NO":
                    if has_pre_bbox:
                        list_bbox_no.setdefault(row[6], []).append(row[4])
                elif row[8] == "YES":
                    if has_post_bbox:
                        list_bbox_yes_changed.setdefault(row[7], []).append(row[5])
                    elif has_pre_bbox:
                        list_bbox_yes_old.setdefault(row[6], []).append(row[4])

            open_cv_image = {}
            for i in range(len(synthesized_images)):
                open_cv_image[i] = numpy.array(synthesized_images[i].convert("RGB"))
            # print(list_bbox_yes_changed)
            img_list = []
            list_bbox_no = {str(key): value for key, value in list_bbox_no.items()}
            list_bbox_yes_changed = {
                str(key): value for key, value in list_bbox_yes_changed.items()
            }
            list_bbox_yes_old = {
                str(key): value for key, value in list_bbox_yes_old.items()
            }

            # Colors are RGB: blue = pre-HITL, red = updated, green = deleted
            box_styles = [
                (list_bbox_no, (0, 0, 255)),
                (list_bbox_yes_changed, (255, 0, 0)),
                (list_bbox_yes_old, (0, 255, 0)),
            ]
            for i in range(len(open_cv_image)):
                size = open_cv_image[i].shape
                for boxes_by_page, color in box_styles:
                    # Pages without boxes of this kind are simply skipped
                    for bbox in boxes_by_page.get(str(i), []):
                        # Scale normalized coordinates to pixels
                        x1 = int(bbox[0] * size[1])
                        x2 = int(bbox[2] * size[1])
                        y1 = int(bbox[1] * size[0])
                        y2 = int(bbox[3] * size[0])
                        cv2.rectangle(open_cv_image[i], (x1, y1), (x2, y2), color, 2)

                img1 = Image.fromarray(open_cv_image[i])

                workbook = openpyxl.load_workbook(excel_file)
                worksheet = workbook[str(file)]

                # Save the annotated page to a temporary PNG and embed it in the sheet
                image_path = "annotated_page.png"
                img1.save(image_path, "PNG")
                img = openpyxl.drawing.image.Image(image_path)
                img.width = 500
                img.height = 700
                img.anchor = "K" + str(1 + int(i) * 50)
                worksheet.add_image(img)
                workbook.save(excel_file)


try:
    # creating temporary buckets
    import datetime

    now = str(datetime.datetime.now())
    now = re.sub(r"\W+", "", now)

    print("Creating temporary buckets")
    pre_HITL_bucket_name_temp = "pre_hitl_output" + "_" + now
    post_HITL_bucket_name_temp = "post_hitl_output_temp" + "_" + now
    # bucket name and prefix
    pre_HITL_bucket = pre_HITL_output_URI.split("/")[2]
    post_HITL_bucket = post_HITL_output_URI.split("/")[2]
    # getting all files and copying to temporary folder

    try:
        utilities.check_create_bucket(pre_HITL_bucket_name_temp)
        utilities.check_create_bucket(post_HITL_bucket_name_temp)
    except Exception as e:
        print("unable to create bucket because of exception : ", e)

    try:
        pre_HITL_output_files, pre_HITL_output_dict = utilities.file_names(
            pre_HITL_output_URI
        )
        post_HITL_output_files, post_HITL_output_dict = utilities.file_names(
            post_HITL_output_URI
        )
        print("copying files to temporary bucket")
        for i in pre_HITL_output_files:
            utilities.copy_blob(
                pre_HITL_bucket, pre_HITL_output_dict[i], pre_HITL_bucket_name_temp, i
            )
        for i in post_HITL_output_files:
            utilities.copy_blob(
                post_HITL_bucket,
                post_HITL_output_dict[i],
                post_HITL_bucket_name_temp,
                i,
            )
        pre_HITL_files_list = utilities.list_blobs(pre_HITL_bucket_name_temp)
        post_HITL_files_list = utilities.list_blobs(post_HITL_bucket_name_temp)
    except Exception as e:
        print("unable to get list of files in buckets because : ", e)
    # processing the files and saving the files in temporary gCP bucket
    relation_dict, non_relation_dict = utilities.matching_files_two_buckets(
        pre_HITL_bucket_name_temp, post_HITL_bucket_name_temp
    )
    compare_merged = pd.DataFrame()
    accuracy_docs = []
    print("comparing the PRE-HITL Jsons and POST-HITL jsons ....Wait for Summary ")
    for i in relation_dict:
        pre_HITL_json = utilities.documentai_json_proto_downloader(
            pre_HITL_bucket_name_temp, i
        )
        post_HITL_json = utilities.documentai_json_proto_downloader(
            post_HITL_bucket_name_temp, relation_dict[i]
        )
        compare_output = utilities.compare_pre_hitl_and_post_hitl_output(
            pre_HITL_json, post_HITL_json
        )[0]
        column = [relation_dict[i]] * compare_output.shape[0]
        compare_output.insert(loc=0, column="File Name", value=column)

        # Flag an entity as updated when its fuzzy ratio is not 1.0,
        # unless it is missing on both the pre- and post-HITL side
        compare_output.insert(loc=8, column="hitl_update", value="NO")
        both_missing = (compare_output["Pre_HITL_Output"] == "Entity not found.") & (
            compare_output["Post_HITL_Output"] == "Entity not found."
        )
        updated = (compare_output["Fuzzy Ratio"] != 1.0) & ~both_missing
        compare_output.loc[updated, "hitl_update"] = "YES"
        frames = [compare_merged, compare_output]
        compare_merged = pd.concat(frames)
    try:
        print("Deleting temporary buckets created")
        utilities.bucket_delete(pre_HITL_bucket_name_temp)
        utilities.bucket_delete(post_HITL_bucket_name_temp)
    except Exception as e:
        print("unable to delete temporary buckets because of exception : ", e)
    compare_merged.drop(["Match", "Fuzzy Ratio"], axis=1, inplace=True)

    def highlight(s):
        if s.hitl_update == "YES":
            return ["background-color: yellow"] * len(s)
        else:
            return ["background-color: white"] * len(s)

    for k in non_relation_dict:
        new_row = pd.DataFrame(
            [[k, "-", "-", "-", "", "", "", "", non_relation_dict[k]]],
            columns=compare_merged.columns,
        )
        # DataFrame.append was removed in pandas 2.0; pd.concat replaces it
        compare_merged = pd.concat([compare_merged, new_row], ignore_index=True)
    compare_analysis1 = compare_merged.drop(
        ["pre_bbox", "post_bbox", "page1", "page2"], axis=1
    )
    # compare_analysis1.to_csv('compare_analysis.csv')
    entity_change = compare_merged.loc[compare_merged["hitl_update"] == "YES"]
    compare_merged_style = compare_merged.style.apply(highlight, axis=1)
    import traceback

    try:
        print("HITL comparison Excel is being prepared")
        get_visualization_excel(pre_HITL_output_URI, compare_merged, relation_dict)
        print("Completed creating the HITL comparison Excel")
    except Exception as e:
        print("Unable to create HITL comparison Excel because of:", e)
        print(traceback.format_exc())
except Exception as e:
    print("unable to process the file   : ", e)
    # Best-effort cleanup of the temporary buckets
    try:
        utilities.bucket_delete(pre_HITL_bucket_name_temp)
        utilities.bucket_delete(post_HITL_bucket_name_temp)
    except Exception:
        pass