# Pre and Post HITL Visualization

* Author: docai-incubator@google.com

## Disclaimer

This tool is not supported by the Google engineering team or product team. It is provided and supported on a best-effort basis by the DocAI Incubator Team. No guarantees of performance are implied.	

## Purpose of the script
This tool takes Pre-HITL JSON files (parsed by a processor) and Post-HITL JSON files (updated through HITL) from a GCS bucket as input, compares each pair, and writes the differences to an Excel workbook together with page images annotated with bounding boxes.


## Prerequisites
 * Vertex AI Notebook
 * Pre-HITL and Post-HITL JSON files in GCS folders (each pair must have the same file name)



## Step by Step procedure 

**1. Config file Creation**  \
    Run the code below to create a configfile.ini file that holds the inputs.


In [None]:
import configparser

config = configparser.ConfigParser()
# Add the structure to the file we will create
config.add_section("Parameters")
config.set("Parameters", "project_id", "xxxx-xxxx-xxxx")
config.set("Parameters", "Pre_HITL_Output_URI", "gs://")
config.set("Parameters", "Post_HITL_Output_URI", "gs://")
# Write the new structure to the new file
with open(r"configfile.ini", "w") as configfile:
    config.write(configfile)

**2. Input Details**  

Once the **configfile.ini** file has been created in the step above, fill in the following values (configparser stores the key names in lowercase):
 * project_id: the project ID
 * Pre_HITL_Output_URI: the GCS path of the Pre-HITL JSON files (processed JSONs)
 * Post_HITL_Output_URI: the GCS path of the Post-HITL JSON files (JSONs processed through HITL)
 
![](https://screenshot.googleplex.com/7DMhDW8d5GZnUBG.png)
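Before running the tool, it can help to sanity-check the values entered in the config file. The sketch below is not part of the tool: `validate_config` is a hypothetical helper, and the project and bucket names are placeholders. It only checks that the three keys are filled in and that both URIs start with `gs://` followed by a bucket name.

```python
import configparser


def validate_config(config: configparser.ConfigParser) -> list:
    """Return a list of problems found in the [Parameters] section (hypothetical helper)."""
    if not config.has_section("Parameters"):
        return ["missing [Parameters] section"]

    problems = []
    # configparser lowercases option names, so the keys are checked in lowercase.
    for key in ("project_id", "pre_hitl_output_uri", "post_hitl_output_uri"):
        if not config.get("Parameters", key, fallback="").strip():
            problems.append(f"{key} is empty or missing")

    for key in ("pre_hitl_output_uri", "post_hitl_output_uri"):
        value = config.get("Parameters", key, fallback="").strip()
        # A usable path needs at least a bucket name after the gs:// scheme.
        if value and (not value.startswith("gs://") or value == "gs://"):
            problems.append(f"{key} must look like gs://bucket/folder")
    return problems


# Placeholder values for illustration only.
config = configparser.ConfigParser()
config.read_string("""
[Parameters]
project_id = my-project
pre_hitl_output_uri = gs://my-bucket/pre/
post_hitl_output_uri = gs://
""")
print(validate_config(config))
```

Here the post-HITL path was left at its default value of `gs://`, so the helper reports one problem for that key.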

**NOTE:** By default, a Post-HITL JSON file does not have the same name as the original file. Rename it manually to match the Pre-HITL file before using this tool.

**3. Run the Code**

Copy the code provided in this document, enter the path of the config file, and run it without any other edits.
![](https://screenshot.googleplex.com/BP8v3wHicSEs6xr.png)

**4. Output** 

The output of the tool is an Excel workbook that lists the entities updated in HITL as well as the unchanged ones, along with images of the labeled documents (both Pre- and Post-HITL).

The workbook contains a summary of all the files in the “Consolidated_Data” sheet and the comparison for each file in a separate sheet.

Each Excel workbook holds a batch of 20 files.

![](https://screenshot.googleplex.com/6nL7E3hrRSEi6ST.png)

The Excel file contains the Pre-HITL text, the Post-HITL text, and a YES or NO flag showing whether the entity was updated in HITL, as shown below.

![](https://screenshot.googleplex.com/8wqPTMyUY5ASKZA.png)
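The YES/NO flag can be thought of as a text comparison between the two versions of an entity. The sketch below mirrors the logic of the sample code in simplified form; `hitl_update_flag` is a hypothetical name used for illustration, and the entity values are made up.

```python
import difflib

NOT_FOUND = "Entity not found."  # placeholder the sample code uses for a missing entity


def hitl_update_flag(pre_text: str, post_text: str) -> str:
    """Return "YES" if the entity text differs between Pre- and Post-HITL, else "NO"."""
    if pre_text == NOT_FOUND and post_text == NOT_FOUND:
        return "NO"
    if pre_text == NOT_FOUND or post_text == NOT_FOUND:
        return "YES"  # entity was added or deleted during review
    ratio = difflib.SequenceMatcher(a=pre_text, b=post_text).ratio()
    # Only an exact text match counts as unchanged.
    return "NO" if ratio == 1.0 else "YES"


print(hitl_update_flag("INV-001", "INV-001"))   # NO
print(hitl_update_flag("INV-001", "INV-007"))   # YES
print(hitl_update_flag(NOT_FOUND, "ACME Ltd"))  # YES
```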

Documents for which no Post-HITL output exists, either because the required confidence threshold was met or because the HITL output has not been created yet, are listed at the end of the consolidated sheet as “NO POST HITL OUTPUT AVAILABLE”.

![](https://screenshot.googleplex.com/8tpFZsVfFdTBoKA.png)
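The pairing of Pre- and Post-HITL files is based on file-name similarity. A minimal sketch of that idea, assuming the same 0.9 similarity threshold as the sample code, is shown below; `pair_files` and the file names are hypothetical.

```python
import difflib


def pair_files(pre_names, post_names, threshold=0.9):
    """Pair each Pre-HITL file name with a sufficiently similar Post-HITL name."""
    matched = {}
    for pre in pre_names:
        for post in post_names:
            if difflib.SequenceMatcher(None, pre, post).ratio() > threshold:
                matched[pre] = post
    # Anything left over would be reported as "NO POST HITL OUTPUT AVAILABLE".
    unmatched = [name for name in pre_names if name not in matched]
    return matched, unmatched


matched, unmatched = pair_files(
    ["invoice_1.json", "receipt_7.json"],  # made-up Pre-HITL file names
    ["invoice_1.json"],                    # made-up Post-HITL file names
)
print(matched)
print(unmatched)
```

Because the match is fuzzy rather than exact, names that differ by a single character (for example `invoice_1.json` and `invoice_2.json`) also score above 0.9, which is one more reason to keep the Pre- and Post-HITL file names identical.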


 * Blue Bounding Box ⇒ Entities in Pre-HITL JSON
 * Red Bounding Box ⇒ Entities updated in HITL
 * Green Bounding Box ⇒ Entities deleted in HITL (entities detected by the parser and then deleted in HITL)

**Bounding box color coding in images**

![](https://screenshot.googleplex.com/9aph7w2N2vywPFP.png)
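The bounding boxes in the JSON files are normalized to the range 0 to 1, so they have to be scaled to the page image size before they can be drawn. The sketch below illustrates that scaling together with the color legend above; `BOX_COLORS` and `to_pixel_box` are hypothetical names, and the image size is an arbitrary example.

```python
# RGB colors matching the legend above (hypothetical mapping for illustration)
BOX_COLORS = {
    "pre_hitl": (0, 0, 255),  # blue: entities in the Pre-HITL JSON
    "updated": (255, 0, 0),   # red: entities updated in HITL
    "deleted": (0, 255, 0),   # green: entities deleted in HITL
}


def to_pixel_box(bbox, image_width, image_height):
    """Scale a normalized [x_min, y_min, x_max, y_max] box to integer pixels."""
    x_min, y_min, x_max, y_max = bbox
    return (
        int(x_min * image_width),
        int(y_min * image_height),
        int(x_max * image_width),
        int(y_max * image_height),
    )


# A box covering the middle of an 800 x 1000 pixel page image.
print(to_pixel_box([0.25, 0.5, 0.75, 0.625], 800, 1000))  # (200, 500, 600, 625)
```

The resulting corner points and the chosen color are what a drawing call such as `cv2.rectangle` would receive.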

## **Sample Code**

In [None]:
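# Install the libraries below once (uncomment to run).
# configparser and ast are part of the Python standard library.

#!pip install google-cloud-documentai google-cloud-storage gcsfs
#!pip install opencv-python openpyxl pandas Pillow PyPDF2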

import ast
import configparser
import difflib
import io
import json
import operator
import os
import re
import time
from collections.abc import Container, Iterable, Iterator, Mapping, Sequence
from typing import List, Optional, Tuple, Union

import cv2
import gcsfs
import numpy
import openpyxl
import pandas as pd
from google.cloud import documentai_v1beta3, storage
from PIL import Image, ImageDraw
from PyPDF2 import PdfFileReader

pd.options.mode.chained_assignment = None  # default='warn'

# input
Path = "configfile.ini"  # Enter the path of config file
config = configparser.ConfigParser()
config.read(Path)

project_id = config.get("Parameters", "project_id")
pre_HITL_output_URI = config.get("Parameters", "pre_hitl_output_uri")
post_HITL_output_URI = config.get("Parameters", "post_hitl_output_uri")

# FUNCTIONS


# Check whether the bucket exists; otherwise create a temporary bucket
def check_create_bucket(bucket_name):
    """Creates a temporary bucket for storing the processed files
    if it does not already exist.
    args: name of bucket"""

    storage_client = storage.Client()
    try:
        bucket = storage_client.get_bucket(bucket_name)
        print(f"Bucket {bucket_name} already exists.")
    except Exception:
        bucket = storage_client.create_bucket(bucket_name)
        print(f"Bucket {bucket_name} created.")
    return bucket


def bucket_delete(bucket_name):
    """This function deletes the bucket and is used for deleting the
    temporary bucket
    args: bucket name"""
    storage_client = storage.Client()
    try:
        bucket = storage_client.get_bucket(bucket_name)
        bucket.delete(force=True)
    except:
        pass


def file_names(file_path):
    """This Function will load the bucket and get the list of files
    in the gs path given
    args: gs path
    output: file names as list and dictionary with file names as keys and file path as values
    """
    bucket = file_path.split("/")[2]
    file_names_list = []
    file_dict = {}
    storage_client = storage.Client()
    source_bucket = storage_client.get_bucket(bucket)
    filenames = [
        filename.name for filename in list(
            source_bucket.list_blobs(
                prefix=(("/").join(file_path.split("/")[3:]))))
    ]
    for i in range(len(filenames)):
        x = filenames[i].split("/")[-1]
        if x != "":
            file_names_list.append(x)
            file_dict[x] = filenames[i]
    return file_names_list, file_dict


# list
def list_blobs(bucket_name):
    """This function will give the list of files in a bucket
    args: gcs bucket name
    output: list of files"""
    blob_list = []
    storage_client = storage.Client()
    blobs = storage_client.list_blobs(bucket_name)
    for blob in blobs:
        blob_list.append(blob.name)
    return blob_list


# Bucket operations
def relation_dict_generator(pre_hitl_output_bucket, post_hitl_output_bucket):
    """This Function will check the files from pre_hitl_output_bucket and post_hitl_output_bucket
    and finds the json with same names(relation)"""
    pre_hitl_bucket_blobs = list_blobs(pre_hitl_output_bucket)
    post_hitl_bucket_blobs = list_blobs(post_hitl_output_bucket)

    relation_dict = {}
    non_relation_dict = {}
    for i in pre_hitl_bucket_blobs:
        for j in post_hitl_bucket_blobs:
            matched_score = difflib.SequenceMatcher(None, i, j).ratio()
            if matched_score > 0.9:
                relation_dict[i] = j
            else:
                non_relation_dict[i] = "NO POST HITL OUTPUT AVAILABLE"
                # print(i)
    for i in relation_dict:
        if i in non_relation_dict.keys():
            del non_relation_dict[i]

    return relation_dict, non_relation_dict


def blob_downloader(bucket_name, blob_name):
    """This Function is used to download the files from gcs bucket"""
    storage_client = storage.Client()
    bucket = storage_client.bucket(bucket_name)
    blob = bucket.blob(blob_name)
    contents = blob.download_as_string()
    return json.loads(contents.decode())


def copy_blob(bucket_name, blob_name, destination_bucket_name,
              destination_blob_name):
    """This Method will copy files from one bucket(or folder) to another"""
    storage_client = storage.Client()
    source_bucket = storage_client.bucket(bucket_name)
    source_blob = source_bucket.blob(blob_name)
    destination_bucket = storage_client.bucket(destination_bucket_name)
    blob_copy = source_bucket.copy_blob(source_blob, destination_bucket,
                                        destination_blob_name)


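def bbox_maker(boundingPoly):
    """Returns [x_min, y_min, x_max, y_max] for a list of normalized vertices"""
    x_list = []
    y_list = []
    for i in boundingPoly:
        # A coordinate equal to 0 may be omitted from the JSON, so default to 0
        x_list.append(i.get("x", 0))
        y_list.append(i.get("y", 0))
    bbox = [min(x_list), min(y_list), max(x_list), max(y_list)]
    return bbox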


def JsonToDataframe(data):
    """Returns entities in dataframe format"""
    df = pd.DataFrame(columns=["type", "mentionText", "bbox", "page"])

    if "entities" not in data.keys():
        return df

    for entity in data["entities"]:
        if "properties" in entity and len(entity["properties"]) > 0:
            for sub_entity in entity["properties"]:
                if "type" in sub_entity:
                    try:
                        boundingPoly = sub_entity["pageAnchor"]["pageRefs"][0][
                            "boundingPoly"]["normalizedVertices"]
                        bbox = bbox_maker(boundingPoly)
                        # page=sub_entity['pageAnchor']['pageRefs'][0]['page']
                        # bbox = [boundingPoly[0]['x'], boundingPoly[0]['y'], boundingPoly[2]['x'], boundingPoly[2]['y']]
                        # df.loc[len(df.index)] = [sub_entity['type'], sub_entity['mentionText'], bbox]
                        try:
                            page = sub_entity["pageAnchor"]["pageRefs"][0][
                                "page"]
                            df.loc[len(df.index)] = [
                                sub_entity["type"],
                                sub_entity["mentionText"],
                                bbox,
                                page,
                            ]
                        except KeyError:
                            df.loc[len(df.index)] = [
                                sub_entity["type"],
                                sub_entity["mentionText"],
                                bbox,
                                "0",
                            ]
                    except KeyError:
                        if "mentionText" in sub_entity:
                            df.loc[len(df.index)] = [
                                sub_entity["type"],
                                sub_entity["mentionText"],
                                [],
                                "no",
                            ]
                        else:
                            df.loc[len(df.index)] = [
                                sub_entity["type"],
                                "Entity not found.",
                                [],
                                "no",
                            ]
        elif "type" in entity:
            try:
                boundingPoly = entity["pageAnchor"]["pageRefs"][0][
                    "boundingPoly"]["normalizedVertices"]
                bbox = bbox_maker(boundingPoly)
                # bbox = [boundingPoly[0]['x'], boundingPoly[0]['y'], boundingPoly[2]['x'], boundingPoly[2]['y']]
                # df.loc[len(df.index)] = [entity['type'], entity['mentionText'], bbox]
                try:
                    page = entity["pageAnchor"]["pageRefs"][0]["page"]
                    df.loc[len(df.index)] = [
                        entity["type"],
                        entity["mentionText"],
                        bbox,
                        page,
                    ]
                except KeyError:
                    df.loc[len(df.index)] = [
                        entity["type"],
                        entity["mentionText"],
                        bbox,
                        "0",
                    ]

            except KeyError:
                if "mentionText" in entity:
                    df.loc[len(df.index)] = [
                        entity["type"],
                        entity["mentionText"],
                        [],
                        "no",
                    ]
                else:
                    df.loc[len(df.index)] = [
                        entity["type"],
                        "Entity not found.",
                        [],
                        "no",
                    ]
    return df


def RemoveRow(df, entity):
    """Drops the entity passed from the dataframe"""
    return df[df["type"] != entity]


def FindMatch(entity_file1, df_file2):
    """Finds the matching entity from the dataframe using
    the area of IOU between bboxes reference
    """
    bbox_file1 = entity_file1[2]
    # Entity not present in json file
    if not bbox_file1:
        return None

    # filtering entities with the same name
    df_file2 = df_file2[df_file2["type"] == entity_file1[0]]

    # calculating IOU values for the entities
    index_iou_pairs = []
    for index, entity_file2 in enumerate(df_file2.values):
        if entity_file2[2]:
            iou = BBIntersectionOverUnion(bbox_file1, entity_file2[2])
            index_iou_pairs.append((index, iou))

    # choose the entity with the highest IOU; the IOU must be greater than 0.5
    matched_index = None
    for index_iou in sorted(index_iou_pairs,
                            key=operator.itemgetter(1),
                            reverse=True):
        if index_iou[1] > 0.5:
            matched_index = df_file2.index[index_iou[0]]
            break
    return matched_index


def BBIntersectionOverUnion(box1, box2):
    """Calculates the area of IOU between two bounding boxes"""
    x1 = max(box1[0], box2[0])
    y1 = max(box1[1], box2[1])
    x2 = min(box1[2], box2[2])
    y2 = min(box1[3], box2[3])

    inter_area = max(x2 - x1, 0) * max(y2 - y1, 0)
    if inter_area == 0:
        return 0
    box1_area = abs((box1[2] - box1[0]) * (box1[3] - box1[1]))
    box2_area = abs((box2[2] - box2[0]) * (box2[3] - box2[1]))
    iou = inter_area / float(box1_area + box2_area - inter_area)

    return iou


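def GetMatchRatio(values):
    """Returns the similarity ratio (0 to 1) between the Pre-HITL and
    Post-HITL text of a comparison row"""
    file1_value = values[1]
    file2_value = values[2]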
    if file1_value == "Entity not found." or file2_value == "Entity not found.":
        return 0
    else:
        return difflib.SequenceMatcher(a=file1_value, b=file2_value).ratio()


def compare_pre_hitl_and_post_hitl_output(file1, file2):
    """Compares the entities between two files and returns
    the results in a dataframe
    """
    df_file1 = JsonToDataframe(file1)
    df_file2 = JsonToDataframe(file2)
    # df_file1.to_csv("1.csv")
    # df_file2.to_csv("2.csv")
    file1_entities = [entity[0] for entity in df_file1.values]
    file2_entities = [entity[0] for entity in df_file2.values]

    # find entities which are present only once in both files
    # these entities will be matched directly
    common_entities = set(file1_entities).intersection(set(file2_entities))
    exclude_entities = []
    for entity in common_entities:
        if file1_entities.count(entity) > 1 or file2_entities.count(
                entity) > 1:
            exclude_entities.append(entity)
    for entity in exclude_entities:
        common_entities.remove(entity)
    df_compare = pd.DataFrame(columns=[
        "Entity Type",
        "Pre_HITL_Output",
        "Post_HITL_Output",
        "pre_bbox",
        "post_bbox",
        "page1",
        "page2",
    ])
    for entity in common_entities:
        value1 = df_file1[df_file1["type"] == entity].iloc[0]["mentionText"]
        value2 = df_file2[df_file2["type"] == entity].iloc[0]["mentionText"]
        pre_bbox = df_file1[df_file1["type"] == entity].iloc[0]["bbox"]
        post_bbox = df_file2[df_file2["type"] == entity].iloc[0]["bbox"]
        page1 = df_file1[df_file1["type"] == entity].iloc[0]["page"]
        page2 = df_file2[df_file2["type"] == entity].iloc[0]["page"]
        df_compare.loc[len(df_compare.index)] = [
            entity,
            value1,
            value2,
            pre_bbox,
            post_bbox,
            page1,
            page2,
        ]
        # common entities are removed from df_file1 and df_file2
        df_file1 = RemoveRow(df_file1, entity)
        df_file2 = RemoveRow(df_file2, entity)

    # remaining entities are matched comparing the area of IOU across them
    mentionText2 = pd.Series(dtype=str)
    bbox2 = pd.Series(dtype=object)
    bbox1 = pd.Series(dtype=object)
    page_1 = pd.Series(dtype=object)
    page_2 = pd.Series(dtype=object)

    for index, row in enumerate(df_file1.values):
        matched_index = FindMatch(row, df_file2)
        if matched_index is not None:
            mentionText2.loc[index] = df_file2.loc[matched_index][1]
            bbox2.loc[index] = df_file2.loc[matched_index][2]
            bbox1.loc[index] = row[2]
            page_2.loc[index] = df_file2.loc[matched_index][3]
            page_1.loc[index] = row[3]
            df_file2 = df_file2.drop(matched_index)
        else:
            mentionText2.loc[index] = "Entity not found."
            bbox2.loc[index] = "Entity not found."
            bbox1.loc[index] = row[2]
            page_1.loc[index] = row[3]
            page_2.loc[index] = "no"

    df_file1["mentionText2"] = mentionText2.values
    df_file1["bbox2"] = bbox2.values
    df_file1["bbox1"] = bbox1.values
    df_file1["page_1"] = page_1.values
    df_file1["page_2"] = page_2.values

    df_file1 = df_file1.drop(["bbox"], axis=1)
    df_file1 = df_file1.drop(["page"], axis=1)
    df_file1.rename(
        columns={
            "type": "Entity Type",
            "mentionText": "Pre_HITL_Output",
            "mentionText2": "Post_HITL_Output",
            "bbox1": "pre_bbox",
            "bbox2": "post_bbox",
            "page_1": "page1",
            "page_2": "page2",
        },
        inplace=True,
    )
    # DataFrame.append was removed in pandas 2.0; pd.concat is the replacement
    df_compare = pd.concat([df_compare, df_file1], ignore_index=True)
    # adding entities which are present in file2 but not in file1
    for row in df_file2.values:
        df_compare.loc[len(df_compare.index)] = [
            row[0],
            "Entity not found.",
            row[1],
            "[]",
            row[2],
            "[]",
            row[3],
        ]

    # df_compare['Match'] = df_compare['Ground Truth Text'] == df_compare['Output Text']
    match_array = []
    for i in range(0, len(df_compare)):
        match_string = ""
        if (df_compare.iloc[i]["Pre_HITL_Output"] == "Entity not found." and
                df_compare.iloc[i]["Post_HITL_Output"] == "Entity not found."):
            match_string = "TN"
        elif (df_compare.iloc[i]["Pre_HITL_Output"] != "Entity not found." and
              df_compare.iloc[i]["Post_HITL_Output"] == "Entity not found."):
            match_string = "FN"
        elif (df_compare.iloc[i]["Pre_HITL_Output"] == "Entity not found." and
              df_compare.iloc[i]["Post_HITL_Output"] != "Entity not found."):
            match_string = "FP"
        elif (df_compare.iloc[i]["Pre_HITL_Output"] != "Entity not found." and
              df_compare.iloc[i]["Post_HITL_Output"] != "Entity not found."):
            if (df_compare.iloc[i]["Pre_HITL_Output"] == df_compare.iloc[i]
                ["Post_HITL_Output"]):
                match_string = "TP"
            else:
                match_string = "FP"
        else:
            match_string = "Something went Wrong."

        match_array.append(match_string)

    df_compare["Match"] = match_array

    df_compare["Fuzzy Ratio"] = df_compare.apply(GetMatchRatio, axis=1)
    if list(df_compare.index):
        score = df_compare["Fuzzy Ratio"].sum() / len(df_compare.index)
    else:
        score = 0
    return df_compare, score


def create_pdf_bytes(path):
    """This function creates PDF bytes from the image content of the
    ground truth JSONs, which will be used for processing of files
    args: gs path of json file
    output: pdf bytes and the list of page images"""

    def decode_image(image_bytes: bytes) -> Image.Image:
        with io.BytesIO(image_bytes) as image_file:
            image = Image.open(image_file)
            image.load()
        return image

    def create_pdf_from_images(images: Sequence[Image.Image]) -> bytes:
        """Creates a PDF from a sequence of images.

        The PDF will contain 1 page per image, in the same order.

        Args:
          images: A sequence of images.

        Returns:
          The PDF bytes.
        """
        if not images:
            raise ValueError("At least one image is required to create a PDF")

        # PIL PDF saver does not support RGBA images
        images = [
            image.convert("RGB") if image.mode == "RGBA" else image
            for image in images
        ]

        with io.BytesIO() as pdf_file:
            images[0].save(pdf_file,
                           save_all=True,
                           append_images=images[1:],
                           format="PDF")
            return pdf_file.getvalue()

    d = documentai_v1beta3.Document
    document = d.from_json(fs.cat(path))
    synthesized_images = []
    for i in range(len(document.pages)):
        synthesized_images.append(decode_image(
            document.pages[i].image.content))
    pdf_bytes = create_pdf_from_images(synthesized_images)

    return pdf_bytes, synthesized_images


def find_excel_name():
    """Returns the name of the Excel workbook to write to.

    Workbooks are named HITL_VISUAL1.xlsx, HITL_VISUAL2.xlsx, and so on. A new
    workbook (with a "Consolidated_Data" sheet) is started once the current
    one holds more than 20 sheets.
    """
    compare_analysis = compare_merged.drop(
        ["pre_bbox", "post_bbox", "page1", "page2"], axis=1)
    i = 1
    while True:
        excel_file = "HITL_VISUAL" + str(i) + ".xlsx"
        try:
            workbook = openpyxl.load_workbook(excel_file)
        except FileNotFoundError:
            compare_analysis.to_excel(excel_file,
                                      sheet_name="Consolidated_Data")
            return excel_file
        # Reuse this workbook while it still has room, else try the next one
        if len(workbook.sheetnames) <= 20:
            return excel_file
        i += 1


def get_visualization_excel(pre_HITL_output_URI, compare_merged,
                            relation_dict):
    # compare_merged.to_excel("HITL_VISUAL1.xlsx",sheet_name='Consolidated_Data')
    pre_HITL_bucket = pre_HITL_output_URI.split("/")[2]
    pre_HITL_output_files, pre_HITL_output_dict = file_names(
        pre_HITL_output_URI)

    for file in pre_HITL_output_dict:
        excel_file = find_excel_name()
        df = compare_merged.drop(["pre_bbox", "post_bbox", "page1", "page2"],
                                 axis=1)
        if file in relation_dict.keys():
            df_file = df[df["File Name"] == file]
            with pd.ExcelWriter(excel_file, engine="openpyxl",
                                mode="a") as writer:
                df_file.to_excel(writer, sheet_name=str(file))

            path = "gs://" + pre_HITL_bucket + "/" + pre_HITL_output_dict[file]
            pdf_bytes, synthesized_images = create_pdf_bytes(path)
            list_bbox_no = {}
            list_bbox_yes_changed = {}
            list_bbox_yes_old = {}
            for row in compare_merged.values:
                if row[0] == file:
                    if row[8] == "NO":
                        if type(row[4]) == list and row[4] != []:
                            try:
                                if row[6] in list_bbox_no.keys():
                                    list_bbox_no[row[6]].append(row[4])
                                else:
                                    list_bbox_no[row[6]] = [row[4]]
                                # print({row[6]:row[4]})
                            except:
                                pass
                    elif row[8] == "YES":
                        if type(row[5]) == list and row[5] != []:
                            try:
                                if row[7] in list_bbox_yes_changed.keys():
                                    list_bbox_yes_changed[row[7]].append(
                                        row[5])
                                else:
                                    list_bbox_yes_changed[row[7]] = [row[5]]

                            except:
                                pass
                        elif type(row[4]) == list and row[4] != []:
                            if row[6] in list_bbox_yes_old.keys():
                                list_bbox_yes_old[row[6]].append(row[4])
                            else:
                                list_bbox_yes_old[row[6]] = [row[4]]

            open_cv_image = {}
            for i in range(len(synthesized_images)):
                open_cv_image[i] = numpy.array(
                    synthesized_images[i].convert("RGB"))
            # print(list_bbox_yes_changed)
            img_list = []
            for i in range(len(open_cv_image)):
                size = open_cv_image[i].shape
                try:
                    for bbox in list_bbox_no[str(i)]:
                        x1 = int(bbox[0] * size[1])
                        x2 = int(bbox[2] * size[1])
                        y1 = int(bbox[1] * size[0])
                        y2 = int(bbox[3] * size[0])
                        # print(bbox[0]*size[0])
                        cv2.rectangle(open_cv_image[i], (x1, y1), (x2, y2),
                                      (0, 0, 255), 2)
                        # cv2.putText(open_cv_image[i],'He',(x1,y1),font,2,(255,255,255),1)

                except:
                    pass
                try:
                    for bbox in list_bbox_yes_changed[str(i)]:
                        x1 = int(bbox[0] * size[1])
                        x2 = int(bbox[2] * size[1])
                        y1 = int(bbox[1] * size[0])
                        y2 = int(bbox[3] * size[0])
                        cv2.rectangle(open_cv_image[i], (x1, y1), (x2, y2),
                                      (255, 0, 0), 2)
                except:
                    pass
                try:
                    for bbox in list_bbox_yes_old[str(i)]:
                        x1 = int(bbox[0] * size[1])
                        x2 = int(bbox[2] * size[1])
                        y1 = int(bbox[1] * size[0])
                        y2 = int(bbox[3] * size[0])
                        cv2.rectangle(open_cv_image[i], (x1, y1), (x2, y2),
                                      (0, 255, 0), 2)
                except:
                    pass

                img1 = Image.fromarray(open_cv_image[i])
                # img_list.append(img)
                # img.save(file+str(i)+'.png')
                # img.show()
                workbook = openpyxl.load_workbook(excel_file)
                worksheet = workbook[str(file)]

                # Save each annotated page under its own file name
                image_path = f"open_cv_image_{i}.png"
                img1.save(image_path, "PNG")
                img = openpyxl.drawing.image.Image(image_path)
                # if len(open_cv_image)>0:
                #     for i in open_cv_image:
                #         img.anchor = 'K'+str(int(i)*200)
                img.anchor = "K" + str(1 + int(i) * 50)
                worksheet.add_image(img)
                img.width = 500
                img.height = 700
                workbook.save(excel_file)


# Execute the below code

pre_HITL_output_URI = config.get("Parameters", "pre_hitl_output_uri")
post_HITL_output_URI = config.get("Parameters", "post_hitl_output_uri")

try:
    # creating temporary buckets
    import datetime

    now = str(datetime.datetime.now())
    now = re.sub(r"\W+", "", now)

    print("Creating temporary buckets")
    pre_HITL_bucket_name_temp = "pre_hitl_output" + "_" + now
    post_HITL_bucket_name_temp = "post_hitl_output_temp" + "_" + now
    # bucket name and prefix
    pre_HITL_bucket = pre_HITL_output_URI.split("/")[2]
    post_HITL_bucket = post_HITL_output_URI.split("/")[2]
    # getting all files and copying to temporary folder

    try:
        check_create_bucket(pre_HITL_bucket_name_temp)
        check_create_bucket(post_HITL_bucket_name_temp)
    except Exception as e:
        print("unable to create bucket because of exception : ", e)

    try:
        pre_HITL_output_files, pre_HITL_output_dict = file_names(
            pre_HITL_output_URI)
        post_HITL_output_files, post_HITL_output_dict = file_names(
            post_HITL_output_URI)
        print("copying files to temporary bucket")
        for i in pre_HITL_output_files:
            copy_blob(pre_HITL_bucket, pre_HITL_output_dict[i],
                      pre_HITL_bucket_name_temp, i)
        for i in post_HITL_output_files:
            copy_blob(
                post_HITL_bucket,
                post_HITL_output_dict[i],
                post_HITL_bucket_name_temp,
                i,
            )
        pre_HITL_files_list = list_blobs(pre_HITL_bucket_name_temp)
        post_HITL_files_list = list_blobs(post_HITL_bucket_name_temp)
    except Exception as e:
        print("unable to get list of files in buckets because : ", e)
    # processing the files and saving the files in temporary gCP bucket
    fs = gcsfs.GCSFileSystem(project_id)
    relation_dict, non_relation_dict = relation_dict_generator(
        pre_HITL_bucket_name_temp, post_HITL_bucket_name_temp)
    compare_merged = pd.DataFrame()
    accuracy_docs = []
    print(
        "comparing the PRE-HITL Jsons and POST-HITL jsons ....Wait for Summary "
    )
    for i in relation_dict:
        pre_HITL_json = blob_downloader(pre_HITL_bucket_name_temp, i)
        post_HITL_json = blob_downloader(post_HITL_bucket_name_temp,
                                         relation_dict[i])
        compare_output = compare_pre_hitl_and_post_hitl_output(
            pre_HITL_json, post_HITL_json)[0]
        column = [relation_dict[i]] * compare_output.shape[0]
        # print(column)
        compare_output.insert(loc=0, column="File Name", value=column)

        compare_output.insert(loc=8, column="hitl_update", value=" ")
        for j in range(len(compare_output)):
            both_missing = (
                compare_output["Pre_HITL_Output"][j] == "Entity not found."
                and compare_output["Post_HITL_Output"][j]
                == "Entity not found.")
            # An entity counts as updated when its text differs between the
            # two files; .loc avoids chained assignment on the dataframe
            if compare_output["Fuzzy Ratio"][j] != 1.0 and not both_missing:
                compare_output.loc[j, "hitl_update"] = "YES"
            else:
                compare_output.loc[j, "hitl_update"] = "NO"

        # new_row=pd.Series([i,"Entities","are updated","by HITL",":",np.nan,hitl_update], index=compare_output.columns)
        # compare_output=compare_output.append(new_row,ignore_index= True)
        frames = [compare_merged, compare_output]
        compare_merged = pd.concat(frames)
    try:
        print("Deleting temporary buckets created")
        bucket_delete(pre_HITL_bucket_name_temp)
        bucket_delete(post_HITL_bucket_name_temp)
    except Exception:
        pass
    compare_merged.drop(["Match", "Fuzzy Ratio"], axis=1, inplace=True)

    def highlight(s):
        if s.hitl_update == "YES":
            return ["background-color: yellow"] * len(s)
        else:
            return ["background-color: white"] * len(s)

    for k in non_relation_dict:
        new_row = pd.Series(
            [k, "-", "-", "-", "", "", "", "", non_relation_dict[k]],
            index=compare_merged.columns,
        )
        # DataFrame.append was removed in pandas 2.0; pd.concat is the replacement
        compare_merged = pd.concat([compare_merged, new_row.to_frame().T],
                                   ignore_index=True)
        comapare_analysis1 = compare_merged.drop(
            ["pre_bbox", "post_bbox", "page1", "page2"], axis=1)
    # comapare_analysis1.to_csv('compare_analysis.csv')
    entity_change = compare_merged.loc[compare_merged["hitl_update"] == "YES"]
    compare_merged_style = compare_merged.style.apply(highlight, axis=1)
    try:
        print("HITL Comparison Excel is being prepared")
        get_visualization_excel(pre_HITL_output_URI, compare_merged,
                                relation_dict)
        print("Completed creating the HITL Comparison Excel")
    except Exception as e:
        print("Unable to create HITL Comparison Excel because of :", e)
except Exception as e:
    try:
        bucket_delete(pre_HITL_bucket_name_temp)
        bucket_delete(post_HITL_bucket_name_temp)
        print("unable to process the file   : ", e)
    except:
        print("unable to process the file   : ", e)