# Test Harness Guide

* Author: docai-incubator@google.com

# Disclaimer

This tool is not supported by the Google engineering team or product team. It is provided and supported on a best-effort basis by the **DocAI Incubator Team**. No guarantees of performance are implied. 

# Objective

In software testing, a test harness or automated test framework is a collection of software and test data configured to test a program unit by running it under varying conditions and monitoring its behavior and outputs.

The Test Harness tool helps you check:
* Whether the DocAI API is working properly, and
* How the output of a parser differs between two points in time.

This tool takes a few documents from Google Cloud Storage (GCS), parses them with a DocAI processor, and compares the results to previously accepted results. It then provides the output as a Google Sheet containing a report of the entity matching percentage and the fuzzy ratio.
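
To make the two report metrics concrete, here is a minimal sketch of how a matching percentage and a fuzzy ratio could be computed for pairs of entity values. It is only an illustration: the notebook takes its own `get_match_ratio` from `utilities.py`, while the functions `fuzzy_ratio` and `match_percentage` below are hypothetical names and use the standard library's `difflib` as a stand-in.

```python
from difflib import SequenceMatcher
from typing import List, Tuple


def fuzzy_ratio(initial: str, current: str) -> float:
    """Return a similarity score between 0.0 and 1.0 for two entity values.

    Assumption: difflib is used here only for illustration; the tool's own
    ratio comes from utilities.get_match_ratio.
    """
    return SequenceMatcher(None, initial, current).ratio()


def match_percentage(pairs: List[Tuple[str, str]]) -> float:
    """Return the percentage of (initial, current) pairs that are identical."""
    if not pairs:
        return 0.0
    exact = sum(1 for initial, current in pairs if initial == current)
    return 100.0 * exact / len(pairs)


# Hypothetical entity values from an initial run and a later run.
pairs = [
    ("INV-001", "INV-001"),
    ("1,200.00", "1,200.00"),
    ("Acme Corp", "Acme Co"),
]
print(f"matching percentage: {match_percentage(pairs):.1f}")
print(f"fuzzy ratio of last pair: {fuzzy_ratio(*pairs[-1]):.2f}")
```

An exact-match percentage shows how many entities stayed identical, while the fuzzy ratio shows how far a changed value drifted from the accepted one.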


## Test harness tool working chart

<img src="./images/flow_chart.png" width=600 height=300> </img>

# Modes of Test harness tool

1. Mode `I` (Initial Setup Mode)
    * This mode creates the initial JSON files for the given documents by running the documents through a parser at time t.
2. Mode `C` (Comparison Mode)
    * Running Mode `I` first is mandatory, because comparison mode compares against the initial JSON files created in Mode `I`.
    * This mode creates JSON files for the given documents by running the documents through a parser at time t+k.
    * The initial JSON files (output of Mode `I`) and the JSON files created in Mode `C` are compared, and the differences are provided in a Google Sheet shared by email.


# Modes of operation

The test harness tool supports two process types:  
i. Synchronous Mode  
ii. Asynchronous Mode  

### Synchronous Mode
* In **Synchronous mode**, the tool calls the DocAI processor and waits until it returns. All other code execution and user interaction stops until the call returns.

* Only a single document is processed at a time, and the tool receives the output for that document. Multiple documents cannot be processed at the same time.

### Asynchronous Mode
* In **Asynchronous mode**, other operations do not need to stop while waiting for the web service call to return. Other code keeps executing, and the user can continue to interact with the page (or program).

* Multiple documents can be processed simultaneously.

To learn more about synchronous and asynchronous processing, refer to this [article](https://cloud.google.com/blog/topics/developers-practitioners/differences-between-synchronous-web-apis-and-asynchronous-stateful-apis).
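
The difference between the two process types can be sketched without calling DocAI at all. In the example below, `fake_process` is a hypothetical stand-in for a processor call (it only sleeps), and a thread pool is used purely to illustrate concurrent processing. The tool itself relies on a DocAI batch job for asynchronous mode, not on threads.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from typing import List


def fake_process(name: str) -> str:
    """Hypothetical stand-in for a processor call; it sleeps instead of calling DocAI."""
    time.sleep(0.1)
    return f"{name}.json"


def run_synchronously(names: List[str]) -> List[str]:
    """Process documents one at a time; each call blocks until it returns."""
    return [fake_process(name) for name in names]


def run_concurrently(names: List[str]) -> List[str]:
    """Submit all documents together; results come back in input order."""
    with ThreadPoolExecutor(max_workers=max(1, len(names))) as pool:
        return list(pool.map(fake_process, names))


documents = ["invoice_1", "invoice_2", "invoice_3"]

start = time.perf_counter()
run_synchronously(documents)
print(f"one at a time: {time.perf_counter() - start:.2f}s")

start = time.perf_counter()
run_concurrently(documents)
print(f"all together:  {time.perf_counter() - start:.2f}s")
```

Both functions return the same results. Only the waiting pattern differs, which is why the asynchronous process type suits multi-document runs.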

# Installation Guide
This test harness tool is a Python notebook that can be used in the **Vertex AI JupyterLab environment or Google Colab**. First, put the test harness tool script in JupyterLab, then put all the reference documents in a specific folder. Also create a folder, or use an existing one, as the output folder.

For further details, see the steps below for using the test harness tool in Synchronous or Asynchronous mode, as required.


# Service Account and API key Creation
Google Sheets is a convenient tool for viewing and analyzing the data, so the tool needs a service account that creates and owns the Google Sheet and shares it with the user.

A service account in your Google Cloud project is therefore required before you run the notebook.

Use the steps below to create a service account, or see [this link](https://cloud.google.com/iam/docs/creating-managing-service-accounts) for details.

# Creating a service account
**Prerequisites** :
* Enable the IAM, Drive, and Sheets APIs
* Understand IAM service accounts
* Install the Google Cloud CLI

**Required roles** : 
To get the permissions that are required to manage service accounts, ask the administrator to grant the following IAM roles on the project:
* To view service accounts and service account metadata: View Service Accounts (roles/iam.serviceAccountViewer)
* To view and create service accounts: Create Service Accounts (roles/iam.serviceAccountCreator)
* To view and delete service accounts: Delete Service Accounts (roles/iam.serviceAccountDeleter)
* To fully manage (view, create, update, disable, enable, delete, undelete, and manage access to) service accounts: Service Account Admin (roles/iam.serviceAccountAdmin)  

For more information about granting roles, see [Manage access](https://cloud.google.com/iam/docs/granting-changing-revoking-access).  
To learn more about these roles, see [Service Accounts roles](https://cloud.google.com/iam/docs/understanding-roles#service-accounts-roles).  
IAM basic roles also contain permissions to manage service accounts. One should not grant basic roles in a production environment, but can grant them in a development or test environment.


# Creating a service account

When creating a service account, one must provide an alphanumeric ID (SA_NAME in the samples below), such as my-service-account. The ID must be between 6 and 30 characters, and can contain lowercase alphanumeric characters and dashes. After a service account is created, one cannot change its name.  

The service account's name appears in the email address that is provisioned during creation, in the format SA_NAME@PROJECT_ID.iam.gserviceaccount.com.  

Each service account also has a permanent, unique numeric ID, which is generated automatically. 

Also provide the following information when a service account is created:
* **SA_DESCRIPTION** is an optional description for the service account.
* **SA_DISPLAY_NAME** is a friendly name for the service account.
* **PROJECT_ID** is the ID of your Google Cloud project.
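
The naming rule and the email format described above can be checked before creating anything. The sketch below encodes only the rule as stated in this guide (6 to 30 characters, lowercase alphanumerics and dashes). IAM may enforce further constraints that are not covered here, and the function names are hypothetical.

```python
import re

# Rule as stated in this guide: 6 to 30 characters, lowercase letters,
# digits, and dashes. Assumption: no further constraints are checked.
_SA_NAME_PATTERN = re.compile(r"[a-z0-9-]{6,30}")


def is_valid_sa_name(sa_name: str) -> bool:
    """Return True if sa_name satisfies the rule described in this guide."""
    return _SA_NAME_PATTERN.fullmatch(sa_name) is not None


def service_account_email(sa_name: str, project_id: str) -> str:
    """Build the email address in the SA_NAME@PROJECT_ID.iam.gserviceaccount.com format."""
    if not is_valid_sa_name(sa_name):
        raise ValueError(f"Invalid service account ID: {sa_name!r}")
    return f"{sa_name}@{project_id}.iam.gserviceaccount.com"


# "my-project" is a placeholder project ID.
print(service_account_email("my-service-account", "my-project"))
```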

After creating a service account, one might need to wait 60 seconds or more before using it. This behavior occurs because read operations are eventually consistent; it can take time for the new service account to become visible. If one tries to read or use a service account immediately after creating it and receives an error, one can [retry the request with exponential backoff](https://cloud.google.com/iam/docs/retry-strategy). 
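
A retry with exponential backoff can be sketched in a few lines. This is a generic illustration, not code from the tool: `retry_with_backoff` and its parameters are hypothetical, and in real use you would catch only the transient error type that your client library raises rather than every `Exception`.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def retry_with_backoff(
    call: Callable[[], T],
    max_attempts: int = 5,
    base_delay: float = 1.0,
    max_delay: float = 60.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Run call() until it succeeds, doubling the wait after each failure."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(max_attempts):
        try:
            return call()
        except Exception:  # Assumption: narrow this to the transient error type.
            if attempt == max_attempts - 1:
                raise  # Out of attempts: surface the last error.
            delay = min(max_delay, base_delay * 2**attempt)
            # Random jitter keeps many clients from retrying in lockstep.
            sleep(delay + random.uniform(0, delay / 2))
    raise RuntimeError("unreachable")
```

The `sleep` parameter exists so the waiting can be replaced in tests. Calling `retry_with_backoff(lambda: some_read_call())`, where `some_read_call` stands for your own function, uses real sleeps.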

After a service account is created, [grant one or more roles to the service account](https://cloud.google.com/iam/docs/granting-changing-revoking-access) so that it can act on one’s behalf. 

Also, if the service account needs to access resources in other projects, usually one must [enable the APIs](https://cloud.google.com/apis/docs/getting-started#enabling_apis) for those resources in the project where the service account is created.


# Service Account Key Download 

In the test harness tool, the service account key (referred to as the API key in this guide) is used to authenticate, and the service account generates the Google Sheet that holds the comparison reports. 

### Downloading Service Account Key
To download the Service Account Key please follow these steps:
1. Go to Google Cloud Console and search for “Service accounts” and select “Service Accounts” of “IAM and Admin”.  
<img src="./images/sk_1.png"></img>
2. Select the Service Account made for the Test Harness Tool.  
<img src="./images/sk_2.png"></img>
3. Select “Keys”.  
<img src="./images/sk_3.png"></img>
4. Select "Add Key" and then Choose "Create New Key".  
<img src="./images/sk_4.png"></img>
5. Choose “JSON” and then click “Create”.  
<img src="./images/sk_5.png"></img>
6. Wait for the Success message.  
<img src="./images/sk_6.png"></img>  
 Also the newly added key can be seen in the console.  
<img src="./images/sk_final.png"></img>
7. Check your local machine for the downloaded JSON key.
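
Once the key is downloaded, a quick sanity check can confirm that the file looks like a service account key before the notebook uses it. The sketch below checks a few fields that such key files contain (`type`, `project_id`, `private_key`, `client_email`). The function name and the choice of fields to check are assumptions made for illustration.

```python
import json
from pathlib import Path
from typing import Any, Dict

# A subset of the fields found in a service account key file.
REQUIRED_FIELDS = ("type", "project_id", "private_key", "client_email")


def load_service_account_key(path: str) -> Dict[str, Any]:
    """Load a key file and fail early if it does not look like a service account key."""
    key = json.loads(Path(path).read_text(encoding="utf-8"))
    missing = [field for field in REQUIRED_FIELDS if field not in key]
    if missing:
        raise ValueError(f"Key file is missing fields: {missing}")
    if key["type"] != "service_account":
        raise ValueError(f"Unexpected key type: {key['type']!r}")
    return key
```

The `client_email` value of the loaded key is the address of the service account that will own the generated Google Sheet.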

# Tool Modes and prerequisites
## Tool Mode `I` (Initial setup mode)

Use this mode to get the initial output JSON files when you create a processor (at time t).

**Prerequisites**:
1. Fill in all input variables for the test harness tool script.
2. Generate the service account and API key as described in the corresponding sections above.

## Tool Mode `C` (Comparison mode)
Use this mode to get the output JSON files when you want to test a processor (at time t+k).


**Prerequisites**:
1. Run the test harness tool in Tool Mode `I` to create the initial JSON outputs to compare against.
2. Specify the general parameters in the configuration file and the test harness tool script.
3. Generate the service account and API key as described above.
4. i. If there is more than one document to test, use process type `A` and fill in the details required under async details in the configuration file.  
    ii. If you want to test with one document, use process type `S` and fill in the details required under sync details in the configuration file.
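
The choice of process type described above depends only on the number of documents, and this guide writes the mode letters in both upper and lower case. A small helper can make both points explicit. The function names below are hypothetical and are not part of the notebook.

```python
def choose_process_type(document_count: int) -> str:
    """Return "s" for a single document and "a" for more than one."""
    if document_count < 1:
        raise ValueError("At least one document is required")
    return "s" if document_count == 1 else "a"


def normalize_mode(mode: str) -> str:
    """Accept "I"/"i" or "C"/"c" and return the lowercase form."""
    normalized = mode.strip().lower()
    if normalized not in ("i", "c"):
        raise ValueError(f"Unknown tool mode: {mode!r}")
    return normalized


print(choose_process_type(1), choose_process_type(12), normalize_mode("C"))
```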



#### Run the cell below to install the required packages

In [1]:
!pip install colorama
!pip install gspread
!pip install gspread_formatting
!pip install google-cloud-documentai
!pip install google-cloud-storage
!pip install google-api-core
!pip install df2gspread
!pip install configparser
!pip install google-cloud
!pip install oauth2client

#### Import Required Packages/Modules

In [None]:
# Run this cell to download utilities module
!wget https://raw.githubusercontent.com/GoogleCloudPlatform/document-ai-samples/main/incubator-tools/best-practices/utilities/utilities.py

In [3]:
import re
import time
from datetime import datetime
from typing import List, Tuple

import gspread
import numpy as np
import pandas as pd
from colorama import Back, Fore, Style
from google.api_core import operation
from google.cloud import documentai_v1beta3 as documentai
from google.cloud import storage
from gspread import Worksheet, spreadsheet
from gspread_formatting import (
    BooleanCondition,
    BooleanRule,
    CellFormat,
    Color,
    ConditionalFormatRule,
    GridRange,
    get_conditional_format_rules,
    textFormat,
)
from oauth2client.service_account import ServiceAccountCredentials

from utilities import (
    batch_process_documents_sample,
    copy_blob,
    documentai_json_proto_downloader,
    file_names,
    find_match,
    get_match_ratio,
    process_document_sample,
    remove_row,
)

### Input Variables Description
* **EMAIL_ADDRESS** : Enter the email address to which the comparison Google Sheet should be sent.
* **API_KEY** : Enter the path of the API key JSON file, for example _/content/apikey.json_.
* **PROCESS_TYPE** : Enter `a` or `s`. To test the tool with a single document enter `s`, otherwise enter `a`.
* **PROJECT_NUMBER** : Enter the Google Cloud project number.
* **PROJECT_NAME** : Enter the Google Cloud project name (required only when you run the code in Google Colab).
* **PROCESSOR_ID** : Enter the ID of the processor that should parse the files.
* **LOCATION** : Enter the location (`us` or `eu`) chosen when the processor was created.
* **INITIAL_JSON_BUCKET** : Enter the path where the initial output JSON files are stored when you run the tool in Mode `I`.
* **MODE** : Enter the mode of the test harness tool.
    * **I** - Initial setup mode. This mode gets the initial output JSON files from a processor.
    * **C** - Comparison mode. This mode compares the output JSON files taken initially in Mode `I` with the output produced while testing the processor.
* **ASYNC_INPUT_BUCKET_PDFS** : Enter the path of the documents to parse and compare.
* **ASYNC_OUTPUT_BUCKET** : Enter the path where the output JSON files should be stored (required only in async mode and tool Mode `C`).
* **SYNC_INPUT_BUCKET_PDF** : Enter the path of the single file to process (required only in sync mode and tool Mode `C`).
* **SYNC_INPUT_BUCKET_JSON** : Enter the path of the initial output JSON file to compare against.
* **PROCESSOR_VERSION_ID** : Enter the ID of the processor version you want to test.
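
Several of these variables are `gs://` paths, and the notebook later splits them into a bucket name and an object prefix by splitting on `/`. The sketch below shows the same idea with explicit validation. `split_gcs_uri` is a hypothetical helper, not a function from the notebook or from `utilities.py`.

```python
from typing import Tuple


def split_gcs_uri(uri: str) -> Tuple[str, str]:
    """Split gs://bucket/some/prefix into ("bucket", "some/prefix")."""
    scheme = "gs://"
    if not uri.startswith(scheme):
        raise ValueError(f"Not a GCS URI: {uri!r}")
    bucket, _, prefix = uri[len(scheme):].partition("/")
    if not bucket:
        raise ValueError(f"Missing bucket name in: {uri!r}")
    return bucket, prefix


# Placeholder path in the same shape as the sample values in this guide.
bucket, prefix = split_gcs_uri("gs://xx-xx/test_harness_guide/output/mode_I")
print(bucket)
print(prefix)
```

Unlike a plain `uri.split("/", 3)[3]`, this version returns an empty prefix instead of raising `IndexError` when the path contains only a bucket name.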

In [4]:
EMAIL_ADDRESS = "<mail-id@domain.com>"
API_KEY = "<path/to/apikey.json>"
PROCESS_TYPE = "a"  # "a" or "s"
PROJECT_NUMBER = "<xx-xx-xx>"
PROCESSOR_ID = "<xx-xx>"
LOCATION = "us"  # 'us' or 'eu'
INITIAL_JSON_BUCKET = "gs://xx-xx/test_harness_guide/output/mode_I"
MODE = "c"  # "i" or "c"
ASYNC_INPUT_BUCKET_PDFS = "gs://xx-xx/test_harness_guide/input/async"
ASYNC_PROJECT_ID = PROJECT_NUMBER
ASYNC_PROCESSOR_ID = PROCESSOR_ID
ASYNC_LOCATION = LOCATION
ASYNC_OUTPUT_BUCKET = "gs://xx-xx/test_harness_guide/output/async"
SYNC_PROJECT_ID = PROJECT_NUMBER
SYNC_PROCESSOR_ID = PROCESSOR_ID
SYNC_LOCATION = LOCATION
SYNC_INPUT_BUCKET_PDF = "gs://xx-xx/test_harness_guide/input/sync/fake_invoice_1.pdf"
SYNC_INPUT_BUCKET_JSON = "gs://xx-xx/test_harness_guide/output/sync/fake_invoice_1.json"
PROCESSOR_VERSION_ID = "<processor-version>"  # "pretrained-invoice-v1.3-2022-07-15"

GS_CREDENTIALS_PATH = API_KEY
SCOPE = [
    "https://spreadsheets.google.com/feeds",
    "https://www.googleapis.com/auth/drive",
]

In [None]:
# Don't change the header data below; if you change it, you need to change a few parts of the code
_ENTITY_HEADERS = [
    "Entity",
    "Total entities",
    "Entities not captured",
    "Entities mismatch",
]
_ERROR_ENTITY_HEADERS = [
    "File name",
    "Entity",
    "initial_prediction_value",
    "current_prediction_value",
    "Match score",
]


def delete_folder(bucket_name: str, folder_name: str) -> None:
    """This Method will delete the folder in GCP bucket

    Args:
        bucket_name (str): GCS bucket name
        folder_name (str): GCS path prefix (usually starts after the bucket name in gs://bucket-name/xx-xx)
    """

    storage_client = storage.Client()
    bucket = storage_client.get_bucket(bucket_name)
    blobs = list(bucket.list_blobs(prefix=folder_name))
    bucket.delete_blobs(blobs)
    print(f"Folder gs://{bucket_name}/{folder_name} deleted.")


def batchprocess_intial_mode1(
    project_id: str,
    location: str,
    processor_id: str,
    gcs_input_uri: str,
    initial_json_path: str,
    processor_version_id: str,
    timeout: int = 500,
) -> Tuple[operation.Operation, str]:
    """Calls a batch process job and moves the parsed JSON files to the specified GCS path

    Args:
        project_id (str): GCP project ID
        location (str): Processor location `us` or `eu`
        processor_id (str): GCP DocumentAI ProcessorID
        gcs_input_uri (str): GCS path which contains all input files
        initial_json_path (str): GCS path to store processed JSON results for mode `I`
        processor_version_id (str): VersionID of GCP DocumentAI Processor
        timeout (int, optional): Maximum waiting time for operation to complete. Defaults to 500.

    Returns:
        Tuple[operation.Operation, str]: It returns the LRO operation for the current batch-job
                                            & GCS path for Mode-I
    """

    now = datetime.now()
    async_output_dir_prefix = now.strftime("%H%M%S%d%m%Y")
    temp = "/".join(f"{gcs_input_uri}".split("/")[:-1])
    temp_destination_uri = f"{temp}/initial_jsons_{async_output_dir_prefix}/"
    operation_base = batch_process_documents_sample(
        project_id,
        location,
        processor_id,
        gcs_input_uri,
        temp_destination_uri,
        processor_version_id,
        timeout,
    )
    temp_bucket = temp_destination_uri.split("/")[2]
    temp_folder = temp_destination_uri.split("/", 3)[3]
    storage_client = storage.Client()
    source_bucket = storage_client.get_bucket(temp_bucket)
    destination_bucket = initial_json_path.split("/")[2]
    temp_prefix = temp_destination_uri.split("/", 3)[-1]
    blobs = list(source_bucket.list_blobs(prefix=temp_prefix))
    files = [blob.name for blob in blobs]
    temp_dir = initial_json_path.split("/", 3)[3]
    for file in files:
        temp_file = file.split("/")[-1].split("-0")[0]
        prefix = f"{temp_dir}/{temp_file}.json"
        copy_blob(temp_bucket, file, destination_bucket, prefix)
        print(
            f"Blob copied from gs://{temp_bucket}/{file}"
            f" to\n\t\tgs://{destination_bucket}/{prefix}"
        )
    delete_folder(temp_bucket, temp_folder)
    return operation_base, initial_json_path


def create_custom_format_rules(
    sheet: Worksheet,
    condition_type: str,
    condition_values: List[str],
    background_color: Color,
) -> ConditionalFormatRule:
    """Useful to create custom format rules

    Args:
        sheet (Worksheet): Current Worksheet object of Spreadsheet
        condition_type (str): It is a boolean condition type for formatting
        condition_values (List[str]): It is a list of values used to apply color formatting on cell
        background_color (Color): Background color will be applied to cell in current worksheet based on condition_type and condition_values

    Returns:
        ConditionalFormatRule: It returns custom formatting rules for a column based on provided color object and condition type & values
    """

    ranges = [GridRange.from_a1_range("E", sheet)]
    condition = BooleanCondition(condition_type, condition_values)
    text_format = textFormat(bold=True)
    # Named cell_format to avoid shadowing the built-in format()
    cell_format = CellFormat(textFormat=text_format, backgroundColor=background_color)
    boolean_rule = BooleanRule(condition=condition, format=cell_format)
    rule = ConditionalFormatRule(
        ranges=ranges,
        booleanRule=boolean_rule,
    )
    return rule


def get_x_y_list(
    bounding_poly: documentai.BoundingPoly,
) -> Tuple[List[float], List[float]]:
    """It takes a BoundingPoly object and separates its x & y normalized coordinates into lists

    Args:
        bounding_poly (documentai.BoundingPoly): A token of Document Page object

    Returns:
        Tuple[List[float], List[float]]: It returns x & y normalized coordinates as separate lists
    """

    x, y = [], []
    normalized_vertices = bounding_poly.normalized_vertices
    for nv in normalized_vertices:
        x.append(nv.x)
        y.append(nv.y)
    return x, y


def add_entity_to_dataframe(
    entity: documentai.Document.Entity, df: pd.DataFrame
) -> pd.DataFrame:
    """It will append entity data to given DataFrame

    Args:
        entity (documentai.Document.Entity): An entity from Document Object
        df (pd.DataFrame): Target Dataframe to add an entity as new row

    Returns:
        pd.DataFrame: It is a Dataframe with newly appended entity as row
    """

    if entity.mention_text:
        coord1, _, coord3, _ = entity.page_anchor.page_refs[
            0
        ].bounding_poly.normalized_vertices
        bbox = [coord1.x, coord1.y, coord3.x, coord3.y]
        df.loc[len(df.index)] = [entity.type_, entity.mention_text, bbox]
    else:
        df.loc[len(df.index)] = [entity.type_, "Entity not found.", []]
    return df


def doc_proto_to_dataframe(data: documentai.Document) -> pd.DataFrame:
    """It will convert Document Proto object to DataFrame. Returns entities in dataframe format

    Args:
        data (documentai.Document): It is Document Proto Object

    Returns:
        pd.DataFrame: It is a DataFrame that has all entities data as rows
    """

    df = pd.DataFrame(columns=["type_", "mention_text", "bbox"])
    if not data.entities:
        print("No entities Found")
        return df
    for entity in data.entities:
        if entity.properties:
            for sub_entity in entity.properties:
                df = add_entity_to_dataframe(sub_entity, df)
            continue
        df = add_entity_to_dataframe(entity, df)
    return df


def compare_doc_proto_convert_dataframe(
    file1: documentai.Document, file2: documentai.Document
) -> Tuple[pd.DataFrame, np.float64]:
    """Compares the entities between two files and returns the results in a dataframe

    Args:
        file1 (documentai.Document): It is Document Proto Object
        file2 (documentai.Document): It is also Document Proto Object to compare with previous

    Returns:
        Tuple[pd.DataFrame, np.float64]: It returns Dataframe and matched score
                                            between two Document Protos
    """

    df_file1 = doc_proto_to_dataframe(file1)
    df_file2 = doc_proto_to_dataframe(file2)
    file1_entities = [entity[0] for entity in df_file1.values]
    file2_entities = [entity[0] for entity in df_file2.values]

    # find entities which are present only once in both files
    # these entities will be matched directly
    common_entities = set(file1_entities).intersection(set(file2_entities))
    exclude_entities = []
    for entity in common_entities:
        if file1_entities.count(entity) > 1 or file2_entities.count(entity) > 1:
            exclude_entities.append(entity)
    for entity in exclude_entities:
        common_entities.remove(entity)
    df_compare = pd.DataFrame(
        columns=["entity_name", "initial_prediction", "current_prediction"]
    )
    for entity in common_entities:
        value1 = df_file1[df_file1["type_"] == entity].iloc[0]["mention_text"]
        value2 = df_file2[df_file2["type_"] == entity].iloc[0]["mention_text"]
        df_compare.loc[len(df_compare.index)] = [entity, value1, value2]
        # common entities are removed from df_file1 and df_file2
        df_file1 = remove_row(df_file1, entity)
        df_file2 = remove_row(df_file2, entity)

    # remaining entities are matched comparing the area of IOU across them
    mention_text2 = pd.Series(dtype=str)
    for index, row in enumerate(df_file1.values):
        matched_index = find_match(row, df_file2)
        if matched_index is not None:
            # Select the value by column name instead of by position
            mention_text2.loc[index] = df_file2.loc[matched_index, "mention_text"]
            df_file2 = df_file2.drop(matched_index)
        else:
            mention_text2.loc[index] = "Entity not found."

    df_file1["mention_text2"] = mention_text2.values
    df_file1 = df_file1.drop(["bbox"], axis=1)
    df_file1.rename(
        columns={
            "type_": "entity_name",
            "mention_text": "initial_prediction",
            "mention_text2": "current_prediction",
        },
        inplace=True,
    )
    df_compare = pd.concat([df_compare, df_file1], ignore_index=True)

    # adding entities which are present in file2 but not in file1
    for row in df_file2.values:
        df_compare.loc[len(df_compare.index)] = [row[0], "Entity not found.", row[1]]

    df_compare["match"] = (
        df_compare["initial_prediction"] == df_compare["current_prediction"]
    )
    df_compare["fuzzy ratio"] = df_compare.apply(get_match_ratio, axis=1)
    if list(df_compare.index):
        score = df_compare["fuzzy ratio"].sum() / len(df_compare.index)
    else:
        score = 0
    return df_compare, score


def get_entity_level_statistics(
    df_all_files: pd.DataFrame, df: pd.DataFrame
) -> pd.DataFrame:
    """Returns dataframes with entity level stats - missed and wrongly captured entities.

    Args:
        df_all_files (pd.DataFrame): It is a DataFrame with following columns
                                     ['Entity', 'Total entities',
                                     'Entities not captured', 'Entities mismatch']
        df (pd.DataFrame): It is also DataFrame but with different columns
                           ['entity_name', 'initial_prediction', 'current_prediction']

    Returns:
        pd.DataFrame: It is a DataFrame which has all required statistics about entities
    """

    df_entities_count = pd.DataFrame(columns=_ENTITY_HEADERS)
    entity_count = df["entity_name"].value_counts()
    for entity in entity_count.index:
        entities_not_captured = 0
        entities_mismatch = 0
        for value in df[df["entity_name"] == entity].values:
            if value[1] == "Entity not found." or value[2] == "Entity not found.":
                entities_not_captured += 1
            elif value[4] > 0.0 and value[4] < 1.0:
                entities_mismatch += 1
        df_entities_count.loc[len(df_entities_count.index)] = [
            entity,
            entity_count[entity],
            entities_not_captured,
            entities_mismatch,
        ]

    for row in df_entities_count.values:
        match = df_all_files[df_all_files["Entity"] == row[0]]
        if match.shape[0] == 0:
            df_all_files.loc[len(df_all_files.index)] = row
        else:
            df_all_files.at[match.index[0], "Total entities"] = (
                df_all_files.loc[match.index[0]]["Total entities"] + row[1]
            )
            df_all_files.at[match.index[0], "Entities not captured"] = (
                df_all_files.loc[match.index[0]]["Entities not captured"] + row[2]
            )
            df_all_files.at[match.index[0], "Entities mismatch"] = (
                df_all_files.loc[match.index[0]]["Entities mismatch"] + row[3]
            )

    return df_all_files


def get_error_entities(df: pd.DataFrame, file_name: str) -> pd.DataFrame:
    """Returns a dataframe with mismatched and error entities

    Args:
        df (pd.DataFrame): It is DataFrame with following columns
                           ['entity_name', 'initial_prediction', 'current_prediction']
        file_name (str): Name Of the JSON file with errors

    Returns:
        pd.DataFrame: It returns DataFrame with following fields
                      ['File name', 'Entity', 'initial_prediction_value',
                      'current_prediction_value', 'Match score']
    """

    df_entities_error = pd.DataFrame(columns=_ERROR_ENTITY_HEADERS)
    entity_count = df["entity_name"].value_counts()
    for entity in entity_count.index:
        for value in df[df["entity_name"] == entity].values:
            # A row is reported when either side is missing or the values only partially match
            if (
                value[1] == "Entity not found."
                or value[2] == "Entity not found."
                or 0.0 < value[4] < 1.0
            ):
                df_entities_error.loc[len(df_entities_error.index)] = [
                    file_name,
                    entity,
                    value[1],
                    value[2],
                    value[4],
                ]
    return df_entities_error


def create_gsheet(file_name: str, user_email: str) -> spreadsheet.Spreadsheet:
    """Creates a google sheet

    Args:
        file_name (str): It is title for gsheet
        user_email (str): User mail-id to get report of tool output data as gsheet

    Returns:
        spreadsheet.Spreadsheet: It returns newly created gsheet object
    """

    credentials = ServiceAccountCredentials.from_json_keyfile_name(
        GS_CREDENTIALS_PATH, SCOPE
    )
    client = gspread.authorize(credentials)
    sheet = client.create(file_name)
    sheet.share(user_email, perm_type="user", role="writer")
    return sheet


def save_data_to_sheet(gsheet_name: str, df: pd.DataFrame, name: str) -> None:
    """It writes dataframe to google sheet

    Args:
        gsheet_name (str): Title of gsheet
        df (pd.DataFrame): It is DataFrame which need to write to gsheet
        name (str): It is Name of the Tab in gsheet
    """

    credentials = ServiceAccountCredentials.from_json_keyfile_name(
        GS_CREDENTIALS_PATH, SCOPE
    )
    gc = gspread.authorize(credentials)
    sheet_1 = gc.open(gsheet_name)
    sheet_1.add_worksheet(title=name, rows="1000", cols="20")
    sheet_2 = gc.open(gsheet_name).worksheet(name)
    values = [df.columns.values.tolist()]
    values.extend(df.values.tolist())
    try:
        sheet_2.update(range_name="A1:Z1000", values=values)
    except Exception:
        # Fall back to positional arguments (A1 notation, then the list of rows)
        print(
            "sheet.update() -> method takes positional args only: a1-notation & list[list[...]]"
        )
        sheet_2.update("A1:Z1000", values)

    rules = get_conditional_format_rules(sheet_2)
    color_red = Color(0.9803921568627451, 0.5019607843137255, 0.4470588235294118)
    color_yellow = Color(1.0, 1.0, 0.6)
    color_green = Color(0.5647058823529412, 0.9333333333333333, 0.5647058823529412)
    rule_1_params = "NUMBER_LESS_THAN_EQ", ["0.5"], color_red
    rule_2_params = "NUMBER_BETWEEN", ["0.51", "0.75"], color_yellow
    rule_3_params = "NUMBER_GREATER", ["0.75"], color_green
    for format_rule_params in [rule_1_params, rule_2_params, rule_3_params]:
        rule = create_custom_format_rules(sheet_2, *format_rule_params)
        rules.append(rule)

    rules.save()


def asynchronous(
    async_input_bucket_pdfs: str,
    async_input_bucket_jsons: str,
    async_output_bucket: str,
    async_project_id: str,
    async_processor_id: str,
    processor_version_id: str,
    async_location: str,
    email_address: str,
) -> None:
    """Here we are going to generate name of output folder as
        Time+date and then process the documents

    Args:
        async_input_bucket_pdfs (str): GCS Path of source PDF files
        async_input_bucket_jsons (str): GCS Path of JSON file
        async_output_bucket (str): GCS Path to store currently-processed JSON file
        async_project_id (str): GCP Project ID
        async_processor_id (str): DocumentAI Processor ID
        processor_version_id (str): VersionID of GCP DocumentAI Processor
        async_location (str): Processor location `us` or `eu`
        email_address (str): User mail-id to get asynchronous-report of tool output data as gsheet
    """

    now = datetime.now()
    async_output_dir_prefix = now.strftime("%H%M%S%d%m%Y")
    output_bucket_name = re.split("/", async_output_bucket)[2]
    input_bucket_name = re.split("/", async_input_bucket_jsons)[2]
    start_time_async = time.time()
    operation = batch_process_documents_sample(
        project_id=async_project_id,
        location=async_location,
        processor_id=async_processor_id,
        gcs_input_uri=async_input_bucket_pdfs,
        gcs_output_uri=f"{async_output_bucket}/{async_output_dir_prefix}/",
        processor_version_id=processor_version_id,
        timeout=500,
    )
    end_time_async = time.time()
    time_taken_async = end_time_async - start_time_async
    # Now we have to create a table to map reference documents and previously accepted results
    batch_metadata = operation.metadata
    print(f"{Back.CYAN}{batch_metadata}")
    print(f"{Back.CYAN}{batch_metadata.state.name}\n")
    print(f"{Back.CYAN}time_taken\n")
    print(f"{Back.YELLOW}\tseconds: {str(time_taken_async)}")
    pdf_to_output_folder_map = {}
    for individual_process in batch_metadata.individual_process_statuses:
        fn = individual_process.input_gcs_source.rsplit("/")[-1]
        fp = individual_process.output_gcs_destination.split("/", 3)[3]
        pdf_to_output_folder_map[fn] = fp

    # Now input files is mapped to output folders and
    # now we have to get input file to output file map
    client_obj = storage.Client()
    bucket_obj = client_obj.get_bucket(output_bucket_name)
    blobs_list = bucket_obj.list_blobs()
    temp_blobs_list = []
    for i in blobs_list:
        temp_blobs_list.append(str(i.name))
    pdf_to_output_file_map = {}
    for keys, values in pdf_to_output_folder_map.items():
        for i in temp_blobs_list[:]:
            if re.split("/", values) == re.split("/", i)[:-1]:
                pdf_to_output_file_map[keys] = i
    #  Now we have to map input pdf files to input jsons

    input_json_blob_name_list = re.split("/", async_input_bucket_jsons)[3:]
    pdf_to_input_file_map = {}
    for keys, values in pdf_to_output_folder_map.items():
        for i in temp_blobs_list[:]:
            if input_json_blob_name_list == re.split("/", i)[:-1]:
                if keys.split(".")[0] == re.split("/", i)[-1][:-5]:
                    pdf_to_input_file_map[keys] = i
    #  Create a output Gsheet
    print("\nCreating Comparison Gsheet, wait for the Success message...")
    create_gsheet(f"Asynchronous-{async_output_dir_prefix}", email_address)
    # Now both input and output jsons are mapped and one by one
    # we'll load in memory and compare it and load it in gSheet.
    df_entities_all_files = pd.DataFrame(columns=_ENTITY_HEADERS)
    df_entities_error_all_files = pd.DataFrame(columns=_ERROR_ENTITY_HEADERS)
    for key, value in pdf_to_output_file_map.items():
        print("for file: ", key)
        output_json_file = documentai_json_proto_downloader(output_bucket_name, value)
        input_json_file = documentai_json_proto_downloader(
            input_bucket_name, pdf_to_input_file_map[key]
        )
        async_output_dataframe, _ = compare_doc_proto_convert_dataframe(
            input_json_file, output_json_file
        )
        time.sleep(5)
        save_data_to_sheet(
            f"Asynchronous-{async_output_dir_prefix}", async_output_dataframe, str(key)
        )
        df_entities_all_files = get_entity_level_statistics(
            df_entities_all_files, async_output_dataframe
        )
        df_error_entities = get_error_entities(async_output_dataframe, str(key))
        df_entities_error_all_files = pd.concat(
            [df_entities_error_all_files, df_error_entities], ignore_index=True
        )
    time.sleep(5)
    save_data_to_sheet(
        f"Asynchronous-{async_output_dir_prefix}", df_entities_all_files, "All Entities"
    )
    time.sleep(5)
    save_data_to_sheet(
        f"Asynchronous-{async_output_dir_prefix}",
        df_entities_error_all_files,
        "Error entities",
    )


def synchronous(
    sync_input_bucket_pdf: str,
    sync_input_bucket_json: str,
    sync_project_id: str,
    sync_processor_id: str,
    processor_version_id: str,
    sync_location: str,
    email_address: str,
):
    """Process one PDF synchronously and compare it with the previously accepted result.

    The PDF file is downloaded into memory first and then sent to the processor.

    Args:
        sync_input_bucket_pdf (str): GCS URI of the source PDF file
        sync_input_bucket_json (str): GCS URI of the previously accepted JSON file
        sync_project_id (str): GCP Project ID
        sync_processor_id (str): DocumentAI Processor ID
        processor_version_id (str): Version ID of the GCP DocumentAI Processor
        sync_location (str): Processor location `us` or `eu`
        email_address (str): User mail-id to receive the synchronous report of tool output data as gsheet
    """

    pdf_file_name = sync_input_bucket_pdf.split("/")[-1]
    storage_client_obj = storage.Client()
    bucket_name = sync_input_bucket_pdf.split("/")[2]
    blob_name = sync_input_bucket_pdf.split("/", 3)[3]
    bucket_obj = storage_client_obj.bucket(bucket_name)
    blob_obj = bucket_obj.blob(blob_name)
    # download_as_bytes() replaces the deprecated download_as_string() alias
    document_as_string = blob_obj.download_as_bytes()
    # The document is now in memory as bytes and is ready to be processed.
    time_begin_sync = time.time()
    print("Document is processing...")
    processed_document = process_document_sample(
        sync_project_id,
        sync_location,
        sync_processor_id,
        document_as_string,
        processor_version_id,
    )
    time_end_sync = time.time()
    time_taken = time_end_sync - time_begin_sync
    print(f"{Back.CYAN}Processed 1 document successfully\n")
    print(f"{Back.CYAN}time_taken\n")
    print(f"{Back.YELLOW}\tseconds: {str(time_taken)}")
    # Document has been processed now we have
    # to download previously accepted result from bucket
    bucket_name_json = sync_input_bucket_json.split("/")[2]
    blob_name_json = sync_input_bucket_json.split("/", 3)[3]
    previously_accepted_result = documentai_json_proto_downloader(
        bucket_name_json, blob_name_json
    )
    print(
        "\nCreating comparison Gsheet, wait for the success message and check your mail..."
    )
    # Load processed_document and previously_accepted_result as
    # json files and compare using helper functions
    sync_output_dataframe, _ = compare_doc_proto_convert_dataframe(
        previously_accepted_result, processed_document.document
    )
    df_entities = pd.DataFrame(columns=_ENTITY_HEADERS)
    df_entities = get_entity_level_statistics(df_entities, sync_output_dataframe)
    df_entities_error = get_error_entities(sync_output_dataframe, pdf_file_name)
    create_gsheet("Synchronous", email_address)
    save_data_to_sheet("Synchronous", sync_output_dataframe, pdf_file_name)
    save_data_to_sheet("Synchronous", df_entities, "All Entities")
    save_data_to_sheet("Synchronous", df_entities_error, "Error entities")


if MODE.upper() == "I":
    start_time_async = time.time()
    now = datetime.now()
    async_output_dir_prefix = now.strftime("%H%M%S%d%m%Y")
    pdfs_names_list, _ = file_names(ASYNC_INPUT_BUCKET_PDFS)
    jsons_names_list, _ = file_names(INITIAL_JSON_BUCKET)
    file_name_dict = {a.split(".")[0]: a for a in pdfs_names_list}
    json_name_dict = {a.split(".")[0]: a for a in jsons_names_list}
    files_list = list(file_name_dict.keys())
    temp_bucket = ASYNC_INPUT_BUCKET_PDFS.split("/")[2]
    storage_client = storage.Client()
    source_bucket = storage_client.get_bucket(temp_bucket)
    list_new = []
    for file in files_list:
        if file in json_name_dict.keys():
            print(f"Initial Json file already exists for:{file_name_dict[file]}")
            continue
        list_new.append(file)
        source_blob = source_bucket.blob(file_name_dict[file])
        temp_fn = file_name_dict[file]
        temp_dir = ASYNC_INPUT_BUCKET_PDFS.split("/", 3)[3]
        file_name_temp = f"{temp_dir.rstrip('/')}/{temp_fn}"
        temp_dir = ASYNC_INPUT_BUCKET_PDFS.rsplit("/", maxsplit=1)[-1]
        prefix = f"{temp_dir}/initial_pdfs_{async_output_dir_prefix}/{temp_fn}"
        copy_blob(temp_bucket, file_name_temp, temp_bucket, prefix)
        print(
            f"Blob copied from gs://{temp_bucket}/{file_name_temp}"
            f" to\n\t\tgs://{temp_bucket}/{prefix}"
        )

    temp_dir = ASYNC_INPUT_BUCKET_PDFS.rsplit("/", maxsplit=1)[-1]
    temp_initial_path = (
        f"gs://{temp_bucket}/{temp_dir}/initial_pdfs_{async_output_dir_prefix}"
    )
    if not list_new:
        print("There are NO new files to Parse")
    else:
        async_batch_initial, initial_json_path = batchprocess_intial_mode1(
            project_id=ASYNC_PROJECT_ID,
            location=ASYNC_LOCATION,
            processor_id=ASYNC_PROCESSOR_ID,
            processor_version_id=PROCESSOR_VERSION_ID,
            gcs_input_uri=temp_initial_path,
            initial_json_path=INITIAL_JSON_BUCKET,
            timeout=500,
        )
        temp_dir = ASYNC_INPUT_BUCKET_PDFS.rsplit("/", maxsplit=1)[-1]
        temp_folder = f"{temp_dir}/initial_pdfs_{async_output_dir_prefix}/"
        delete_folder(temp_bucket, temp_folder)
        end_time_async = time.time()
        time_taken_async = end_time_async - start_time_async
        print(
            f"Time taken to process the files and create initial json files: {time_taken_async}",
        )
        print(f"The initial JSON files are saved in: {initial_json_path}")
elif MODE.upper() == "C":
    if PROCESS_TYPE.upper() == "A":
        asynchronous(
            ASYNC_INPUT_BUCKET_PDFS,
            INITIAL_JSON_BUCKET,
            ASYNC_OUTPUT_BUCKET,
            ASYNC_PROJECT_ID,
            ASYNC_PROCESSOR_ID,
            PROCESSOR_VERSION_ID,
            ASYNC_LOCATION,
            EMAIL_ADDRESS,
        )
    elif PROCESS_TYPE.upper() == "S":
        synchronous(
            SYNC_INPUT_BUCKET_PDF,
            SYNC_INPUT_BUCKET_JSON,
            SYNC_PROJECT_ID,
            SYNC_PROCESSOR_ID,
            PROCESSOR_VERSION_ID,
            SYNC_LOCATION,
            EMAIL_ADDRESS,
        )

print(f"{Fore.GREEN}{Back.CYAN}{Style.BRIGHT}{'*' * 47} SUCCESS {'*'* 47}")

# Step by Step procedure

### Step 1 : Fill the details for all input variables
The process type must be `S` for the synchronous process or `A` for the asynchronous process.
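
A minimal sketch of how these inputs could be checked before running the cells. `MODE` and `PROCESS_TYPE` are the variable names the script reads; the helper `validate_inputs` is hypothetical and not part of the tool. A check like this may be useful because the script shown above skips both branches silently when a value is not recognised.

```python
from typing import Tuple


def validate_inputs(mode: str, process_type: str) -> Tuple[str, str]:
    """Normalize and validate the mode and process type (hypothetical helper)."""
    mode = mode.strip().upper()
    process_type = process_type.strip().upper()
    if mode not in {"I", "C"}:
        raise ValueError(f"MODE must be 'I' or 'C', got {mode!r}")
    # The process type is only consulted in comparison mode.
    if mode == "C" and process_type not in {"S", "A"}:
        raise ValueError(f"PROCESS_TYPE must be 'S' or 'A', got {process_type!r}")
    return mode, process_type
```

Calling `validate_inputs(MODE, PROCESS_TYPE)` at the top of the notebook would raise a `ValueError` early instead of letting a mistyped value pass unnoticed.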

### Step 2 : Run the Python Code Cells and wait for the success message
If everything runs smoothly, a message like the one below appears after some time.  
<img src="./images/time_taken_by_api.png" width=500 height=100> </img>  
<!-- ![](./images/time_taken_by_api.png) -->
The text highlighted in yellow is the time the DocAI API took to process the document. At this point processing has succeeded; the tool next compares the processed result with the previously accepted result file and generates a Google Sheet as output.

Now wait for the process to finish. Once it is done, a “SUCCESS” message is displayed,
which means that a Google Sheet containing the Test Harness report has been shared with the given email address.  
<img src="./images/success_message.png" width=400 height=100> </img>  
<!-- ![](./images/success_message.png) -->

### Step 3 : Check the mailbox
Check the mailbox of the email address you provided and find the Google Sheet shared by the service account, which contains the full side-by-side comparison.  
<img src="./images/check_mail.png" width=400 height=200> </img>  
<!-- ![](./images/check_mail.png) -->

## Using Google Colab instead of Vertex AI Notebook
Google Colab can be used instead of a Vertex AI notebook, which can be a cost-effective way of running the Test Harness tool.
This requires a few preparation steps.
### Step 1 : Installing modules
Open a new Colab notebook and run the following commands to install the required packages:
```python
!pip install colorama  
!pip install gspread_formatting  
!pip install google-cloud-documentai  
!pip install df2gspread  
!pip install configparser  
!pip install google-cloud 
!pip install spread
```
This installs the required modules.  

### Step 2 : Restart the runtime
Running Step 1 displays a button to restart the runtime. Click “Restart Runtime” and then “Yes” to restart the runtime.  
<img src="./images/restart_runtime.png" width=400 height=100> </img>  
<!-- ![](./images/restart_runtime.png) -->

### Step 3 : Set the colab environment
Now run these commands in a new cell:
```python
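import os

# os.environ values must be strings, so PROJECT_NUMBER has to be a str
os.environ['GCLOUD_PROJECT'] = PROJECT_NUMBER
os.environ['GOOGLE_APPLICATION_CREDENTIALS'] = "PATH/to/apikey.json"
# Each `!` command runs in its own subshell, so this export does not persist
# across cells; the os.environ assignment above is what Python code sees.
!export GOOGLE_APPLICATION_CREDENTIALS=/content/config/apiKey.json
!gcloud auth login
!gcloud config set project PROJECT_NAME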
```
<img src="./images/set_colab_environment_2.png" width=200 height=100> </img>  
<!-- ![](./images/set_colab_environment_2.png) -->

After the cell runs, it prints a link for authenticating with the Google Cloud project. Click the link.  
<img src="./images/click_here.png" width=800 height=100> </img>  
<!-- ![](./images/click_here.png)   -->
A new popup window opens. Choose your account, click “Allow”, and copy the code.  
<img src="./images/gcloud_cli.png" width=400 height=400> </img>  
<!-- ![](./images/gcloud_cli.png)   -->
Return to the Colab tab, paste the copied code, and press Enter.  
<img src="./images/paste_code_here.png" width=800 height=100> </img>  
<!-- ![](./images/paste_code_here.png) -->
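
Before moving on, it may help to confirm that the credentials variable points at a real file. The sketch below is only an illustration: `credentials_ready` is a hypothetical helper, not part of the tool, and it checks only that the path exists, not that the key is valid.

```python
import os


def credentials_ready(env=os.environ) -> bool:
    """Return True if GOOGLE_APPLICATION_CREDENTIALS names an existing file.

    Hypothetical helper: it does not validate the key contents.
    """
    path = env.get("GOOGLE_APPLICATION_CREDENTIALS", "")
    return bool(path) and os.path.isfile(path)


if not credentials_ready():
    print("GOOGLE_APPLICATION_CREDENTIALS is unset or does not point to a file")
```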

### Step 4 : Copy and paste the test harness tool script (code) and run
Copy the test harness tool script as-is, paste it into the next cell, and run it.
Then follow the steps for the mode you selected.

# Sample outputs and Explanation

### Synchronous Output
Sample Output file : **Synchronous**  
Explanation : In synchronous mode the output contains three worksheets: “fileName.pdf”, “All Entities” and “Error Entities”.  
<img src="./images/tabs_sync_out.png" width=800 height=100> </img>  
<!-- ![](./images/tabs_sync_out.png)   -->
The “fileName.pdf” worksheet has five columns: “A:entity_name”, “B:initial_prediction”, “C:current_prediction”, “D:match” and “E:fuzzy ratio”.

Here, “A:entity_name” is the name of an entity present in the input and output JSON files. “B:initial_prediction” is the value of that entity in file 1 (i.e. the previously accepted result JSON file), and “C:current_prediction” is the value of the corresponding entity in file 2 (i.e. the JSON file newly generated by the DocAI API). “D:match” holds a boolean (True or False): True means the value is exactly the same as in the previously accepted result, and False means the values differ. Finally, “E:fuzzy ratio” is a number between 0 and 1 that indicates how similar the two values are when an entity is present in both files but its data differs. Column “E” is also color coded with three possible colors.
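
To make the “D:match” and “E:fuzzy ratio” columns concrete, here is a minimal sketch of how such values could be computed for one entity. It uses `difflib.SequenceMatcher` from the Python standard library as a stand-in; this is an assumption, and the tool's own helper may use a different fuzzy-matching library and produce different ratios. The function name `compare_values` is hypothetical.

```python
from difflib import SequenceMatcher
from typing import Tuple


def compare_values(initial: str, current: str) -> Tuple[bool, float]:
    """Return (match, fuzzy_ratio) for two entity values (illustrative only)."""
    # Exact equality corresponds to the boolean "match" column.
    match = initial == current
    # Similarity in [0, 1]; 1.0 means the two strings are identical.
    fuzzy_ratio = SequenceMatcher(None, initial, current).ratio()
    return match, round(fuzzy_ratio, 2)


print(compare_values("Invoice 123", "Invoice 123"))
print(compare_values("Invoice 123", "Invoice 128"))
```

Identical values give a match of True and a ratio of 1.0, while values that differ by a character or two give False with a ratio slightly below 1.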

The color coding shows how closely each entity matches between the two files:  
* Green→<img src="./images/green.png" width=20 height=7> </img>→ The entity values match by more than 75%.
* Yellow→<img src="./images/yellow.png" width=20 height=7> </img>→ The entity values match by 51% to 75%.
* Pink→<img src="./images/pink.png" width=20 height=7> </img>→ The entity values match by less than 51%.
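
The thresholds above can be sketched as a small function. This is an illustration of the rule as described here, not the tool's formatting code; `ratio_to_color` is a hypothetical name, and the fuzzy ratio is assumed to be on the 0 to 1 scale used in column “E”.

```python
def ratio_to_color(fuzzy_ratio: float) -> str:
    """Map a fuzzy ratio in [0, 1] to the color bucket described above."""
    if fuzzy_ratio > 0.75:
        return "green"   # more than 75% match
    if fuzzy_ratio >= 0.51:
        return "yellow"  # 51% to 75% match
    return "pink"        # below 51% match


for ratio in (0.9, 0.6, 0.3):
    print(ratio, ratio_to_color(ratio))
```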

The next worksheet, “All Entities”, reports the total number of entities of each type, along with how many were not captured and how many were captured with a slight change.

The last worksheet, “Error Entities”, lists every entity that has an issue and what is wrong with it. It has five columns: “input file name”, “name of entity”, “value in file 1”, “value in file 2” and “matching score”. The last column uses the same color coding as above.

### Asynchronous Output
Sample Output file : **Asynchronous-14224306092022**  
Explanation : If there are N input files, the output file contains N+2 worksheets.  
<img src="./images/tabs_async_out.png" width=800 height=50> </img>  
<!-- ![](./images/tabs_async_out.png)   -->
As in synchronous mode, the N + 2 worksheets consist of N worksheets named after the input files (“fileName.pdf”), plus “All Entities” and “Error Entities”.
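
As a rough sketch, the expected worksheet titles and the timestamp in the report name can be derived as follows. Both helper names are hypothetical. The worksheet titles are taken from the `save_data_to_sheet` calls in the script, and the decoding assumes the numeric suffix follows the `%H%M%S%d%m%Y` pattern that the script's Mode `I` block uses for its prefix.

```python
from datetime import datetime
from typing import List


def expected_worksheets(pdf_names: List[str]) -> List[str]:
    """Titles expected for N input files: N per-file sheets plus two summaries."""
    return list(pdf_names) + ["All Entities", "Error entities"]


def decode_report_suffix(sheet_name: str) -> datetime:
    """Parse the timestamp suffix of a name like 'Asynchronous-<digits>'."""
    suffix = sheet_name.rsplit("-", 1)[-1]
    # Assumed layout: hour, minute, second, day, month, four-digit year.
    return datetime.strptime(suffix, "%H%M%S%d%m%Y")


print(expected_worksheets(["a.pdf", "b.pdf"]))
print(decode_report_suffix("Asynchronous-14224306092022"))
```

Under that assumption, the sample file name above corresponds to a run at 14:22:43 on 6 September 2022.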


### Output Color Coding Sample
<img src="./images/color_code_sample_async.png" width=800 height=400></img>