# Identifying Poor Performing Documents

- Author: docai-incubator@google.com

## Purpose and Description

The goal is to automate the identification of poorly performing documents for uptraining. Performance is measured by the number of missed critical fields in each document. The script works under the following conditions:

### 1. Input provided to the script

    a. Input bucket of labeled documents
    
    b. Output bucket for poorly performing documents
    
    c. Project and processor ID (and version) to call the specified processor
    
    d. List of critical fields. Before running, the script validates that the critical field names match the schema names. If they do not match, the script throws an error and asks you to update the critical fields input to match the schema.
    
    e. Threshold a document must meet to be sent to the output bucket for poorly performing documents
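
The critical field validation described above can be sketched as follows. This is a minimal illustration, not the notebook's implementation: `validate_critical_fields` and the sample names are hypothetical, and the schema names are assumed to be already available as a plain list of strings.

```python
def validate_critical_fields(critical_fields, schema_names):
    """Raise ValueError if any critical field is absent from the schema."""
    known = set(schema_names)
    missing = [name for name in critical_fields if name not in known]
    if missing:
        raise ValueError(
            "Critical fields not found in processor schema: "
            + ", ".join(missing)
            + ". Update the critical fields input to match the schema."
        )


# Hypothetical schema and critical field names, for illustration only.
schema_names = ["Invoice_ID", "Address", "Taxes"]
validate_critical_fields(["Invoice_ID", "Address"], schema_names)  # passes silently
```

Raising an exception before any document is processed keeps a misspelled field name from silently producing zero misses.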
    
### 2. Numerical substring matching condition

    a. The script runs documents through the specified processor and identifies poorly performing documents by comparing the critical fields of each document against the ground truth (GT).
    
    b. Optional numerical substring matching, which can be set per entity. If enabled, a prediction is not counted as a miss as long as its numerical substring is correct. For example, if the ground truth is “ID 5123” and the model predicts “5123”, the script does not count it as a miss, because the prediction contains all the correct numerical digits.
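
One possible reading of the numerical substring rule is sketched below. The helper name `is_numeric_substring_match` is hypothetical, and the exact rule is an assumption: the prediction must appear inside the ground truth text and must carry exactly the digits of the ground truth.

```python
import re


def is_numeric_substring_match(ground_truth: str, predicted: str) -> bool:
    """Return True when the prediction is a substring of the ground truth
    and contains all of the ground truth's digits (assumed rule)."""
    gt_digits = re.sub(r"\D", "", ground_truth)  # keep digits only
    predicted = predicted.strip()
    if not gt_digits or not predicted:
        return False  # nothing numeric to compare
    predicted_digits = re.sub(r"\D", "", predicted)
    return predicted in ground_truth and predicted_digits == gt_digits


print(is_numeric_substring_match("ID 5123", "5123"))  # the example from the text
print(is_numeric_substring_match("ID 5123", "512"))   # a digit is missing
```

Under this rule the first call counts as a match and the second as a miss, since "512" drops one of the ground truth digits.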
    
### 3. Threshold logic to move poor performance documents to bucket

    a. The script outputs the worst-performing documents, as determined by a custom threshold. For example, documents with more than 50% of critical fields wrong are sent to the output. The threshold can also be an integer, for example any document with more than 5 missed critical fields is sent to the output bucket.
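
The threshold decision can be sketched as a small function. The name `is_poor_performer` and its parameters are hypothetical. The percentage is taken here as missed critical fields over total critical fields, which is an assumed reading of the example above, and the comparison is inclusive to mirror the `>=` used in the notebook's own function; switch to `>` for a strict "more than" reading.

```python
def is_poor_performer(missed, total_critical, flag, threshold):
    """Decide whether a document belongs in the poor-performance output.

    flag="count": compare the number of missed critical fields to threshold.
    flag="pctg":  compare the percentage of missed critical fields to threshold.
    """
    if flag == "count":
        return missed >= threshold
    if flag == "pctg":
        if total_critical == 0:
            return False  # nothing to miss, so never flagged
        return missed / total_critical * 100 >= threshold
    raise ValueError("flag must be 'count' or 'pctg'")


print(is_poor_performer(3, 5, "pctg", 50))    # 60% of critical fields missed
print(is_poor_performer(4, 10, "count", 5))   # fewer than 5 misses
```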
    
### 4. Output summary and stats file

    a. Output a list of misses by critical fields in sheets/CSV for each document that is being sent to the output bucket.




Example output CSV of missed critical fields:

| Document Names | Invoice_1 | Invoice_2 |
| --- | --- | --- |
| # of missed Invoice_ID | 2 | 0 |
| # of missed Address | 1 | 1 |
| # of missed Taxes | 1 | 3 |
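
A table of this shape can be produced from a flat list of misses. The sketch below uses only the standard library; `build_miss_summary` and `rows_to_csv` are hypothetical helpers, and the input is assumed to be one `(document_name, field_name)` pair per missed entity. The sample data mirrors the example table.

```python
import csv
import io
from collections import Counter


def build_miss_summary(misses, critical_fields):
    """Build table rows: one column per document, one row per critical field."""
    counts = Counter(misses)  # (document, field) -> number of misses
    documents = sorted({doc for doc, _ in counts})
    rows = [["Document Names", *documents]]
    for field in critical_fields:
        rows.append(
            [f"# of missed {field}", *(counts[(doc, field)] for doc in documents)]
        )
    return rows


def rows_to_csv(rows):
    """Serialize the rows to CSV text."""
    buffer = io.StringIO()
    csv.writer(buffer, lineterminator="\n").writerows(rows)
    return buffer.getvalue()


misses = (
    [("Invoice_1", "Invoice_ID")] * 2
    + [("Invoice_1", "Address"), ("Invoice_2", "Address")]
    + [("Invoice_1", "Taxes")]
    + [("Invoice_2", "Taxes")] * 3
)
print(rows_to_csv(build_miss_summary(misses, ["Invoice_ID", "Address", "Taxes"])))
```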

Example input critical fields:

| Critical Fields | Numerical substring matching |
| --- | --- | 
| Invoice_ID | Yes | 
| Address | No | 

## Prerequisites 
1. Access to a Google Cloud project to create Document AI processors.
   - Permission to Google project is needed to access Document AI processors.
2. Python: Jupyter notebook (Vertex AI) or Google Colab.
3. Critical fields list in a CSV file
4. Ground truth JSON files in GCS folders

### NOTE ON INPUT DETAILS

The value of pctg_or_count_flag must be one of the following:

- pctg_or_count_flag = count
- pctg_or_count_flag = pctg

1. If pctg_or_count_flag = count, a value for threshold_count must be provided and threshold_pctg must be 0. If a document's error count is greater than or equal to threshold_count, the predicted document is moved to the poor performance storage path.

2. If pctg_or_count_flag = pctg, a value for threshold_pctg must be provided and threshold_count must be 0. If a document's error percentage is greater than or equal to threshold_pctg, the predicted document is moved to the poor performance storage path.
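
The two rules above can be checked up front, before any document is processed. The following is a minimal sketch; `resolve_threshold` is a hypothetical helper that returns the active mode and threshold, and raising `ValueError` on an invalid combination is an assumption about the desired behavior.

```python
def resolve_threshold(pctg_or_count_flag, threshold_count, threshold_pctg):
    """Validate the flag/threshold combination and return (mode, threshold)."""
    if pctg_or_count_flag == "count":
        if threshold_pctg != 0:
            raise ValueError("threshold_pctg must be 0 when the flag is 'count'")
        return "count", threshold_count
    if pctg_or_count_flag == "pctg":
        if threshold_count != 0:
            raise ValueError("threshold_count must be 0 when the flag is 'pctg'")
        return "pctg", threshold_pctg
    raise ValueError("pctg_or_count_flag must be either 'count' or 'pctg'")


mode, threshold = resolve_threshold("count", threshold_count=1, threshold_pctg=0)
print(mode, threshold)
```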

### Critical Fields File

This file lists the required entities of the documents, each with a Yes/No flag. The flag determines whether numerical substring matching is enabled for that entity.
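
Reading such a file could look like the sketch below. It assumes a two-column CSV with no header row (entity name, then Yes/No), which is how the notebook's own parsing treats it; `parse_critical_fields` is a hypothetical helper and the sample text is illustrative.

```python
import csv
import io


def parse_critical_fields(csv_text):
    """Map each critical field name to True/False for numerical substring matching."""
    fields = {}
    for row in csv.reader(io.StringIO(csv_text)):
        if len(row) < 2 or not row[0].strip():
            continue  # skip blank or incomplete lines
        fields[row[0].strip()] = row[1].strip().lower() == "yes"
    return fields


sample = "Invoice_ID,Yes\nAddress,No\n"
print(parse_critical_fields(sample))
```

Skipping blank lines matters in practice, because a trailing newline at the end of the file would otherwise yield a row with no flag column.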


## Tool Operation Procedure

### 1. Install required libraries

In [None]:
# operator, difflib, json and ast ship with the Python standard library and need no installation
%pip install google-cloud-documentai
%pip install google-cloud-storage
%pip install google-api-core
%pip install pandas
%pip install numpy
%pip install gcsfs
%pip install PyPDF2
%pip install Pillow

In [None]:
# Download incubator-tools utilities module to present-working-directory
!wget https://raw.githubusercontent.com/GoogleCloudPlatform/document-ai-samples/main/incubator-tools/best-practices/utilities/utilities.py

### 2. Import Packages

In [2]:
# importing libraries
import pandas as pd
import numpy as np
import operator
import difflib
import json
import os
import time
import gcsfs
from google.cloud import storage
from google.cloud import documentai_v1beta3 as documentai
from PIL import Image
from typing import (
    Container,
    Iterable,
    Iterator,
    List,
    Mapping,
    Optional,
    Sequence,
    Tuple,
    Union,
)
from PyPDF2 import PdfFileReader
import ast
import io
import re
import sys  # needed for sys.exit() in main()
import datetime
import utilities  # --> DOWNLOAD THIS AND IMPORT ACCORDINGLY
import warnings

warnings.filterwarnings("ignore")

### 3. Input Details

In [3]:
## INPUT DETAILS
processor_ID = "7fbb1ccb4dff7b3c"  # processor ID based on which documents performance has to be checked
project_number = "514064100333"  # GCP Project number
processor_versionID = (
    "pretrained-invoice-v1.3-2022-07-15"  # Processor version ID to use for testing
)
location = "us"  # location of processor created
project_id = "rand-automl-project"  # GCP project ID
GT_Output_URI = "gs://scb_line_item_exp/SCB_Samples/groundtruth/"  # GCS Bucket where the ground truth files are saved
output_folder_path_name = "poor_performance_doc/"  # Folder (created in the ground truth bucket) where poor performance docs are uploaded
pctg_or_count_FLAG = "count"  # criteria to decide the poor performance documents
threshold_count = 1  # Threshold count
threshold_pctg = 0  # Threshold percentage
critical_fields_csv = (
    "CriticalFields.csv"  # path to csv file containing the list of critical fields
)

### 4. Run the functions


In [4]:
def get_poor_perfoming_docs(
    master_df,
    pctg_or_count_FLAG,
    threshold_count,
    threshold_pctg,
    cf,
    output_folder_path_name,
    processed_predected_documents,
):
    """
    Identify and upload poorly performing documents based on specified criteria.

    Args:
        master_df (DataFrame): The master dataframe containing document data.
        pctg_or_count_FLAG (str): The flag indicating whether to use 'count' or 'pctg' as the performance criterion.
        threshold_count (int): The threshold count value for 'count' criterion.
        threshold_pctg (float): The threshold percentage value for 'pctg' criterion.
        cf (dict): A dictionary of Critical fields given in csv.
        output_folder_path_name (str): The path where the output files will be stored.
        processed_predected_documents (dict): A dictionary containing processed predicted documents.

    Returns:
        dict: A dictionary containing performance statistics and file paths.

    Note:
        If the 'pctg_or_count_FLAG' and threshold combination is not valid,
        a message is printed and no documents are uploaded.
    """

    def blob_upload(output_folder_path_name, file_name, json_upload):
        """
        Upload a JSON document to a cloud storage bucket.

        Args:
            output_folder_path_name (str): The path where the document will be stored.
            file_name (str): The name of the file to be uploaded.
            json_upload (Document): The JSON document to be uploaded.
        """
        output_json_file = (
            output_folder_path_name + "poor_performance-" + str(file_name)
        )
        # Document.to_json already returns a JSON string, so upload it as-is
        # (wrapping it in json.dumps would encode it a second time)
        json_poor = documentai.Document.to_json(json_upload)
        storage_client = storage.Client()
        source_bucket = storage_client.bucket(GT_Output_URI.split("/")[2])
        blob = source_bucket.blob(output_json_file)
        blob.upload_from_string(data=json_poor, content_type="application/json")
        stats_json["poor_performed_doc"].append(output_json_file)

    stats_json = {}
    stats_json["GT_file_path"] = GT_Output_URI

    time_stamp = datetime.datetime.now().strftime("%d_%m_%y-%H%M%S")
    df_filtered = master_df.loc[
        (master_df["GTvsPredictedDifference"] == "YES"),
        ("File Name", "GT Entity Type", "GT_Output", "Predicted_Output"),
    ]
    df_filtered = df_filtered[df_filtered["GT Entity Type"].isin(cf.keys())]
    df_group = (
        df_filtered.groupby("GT Entity Type")["File Name"]
        .value_counts()
        .unstack()
        .fillna(0)
    )
    summary_df = pd.DataFrame()
    summary_df["GT Entity Type"] = cf.keys()
    summary_df.reset_index(drop=True, inplace=True)
    x = summary_df.join(df_group, on="GT Entity Type").fillna(0)
    x.iloc[:, 1:] = x.iloc[:, 1:].applymap(lambda x: int(x) if not pd.isnull(x) else x)
    # Write the summary to the same path that is recorded in the stats
    summary_csv_path = "analysis/summary_stats_" + time_stamp + ".csv"
    x.to_csv(summary_csv_path)
    stats_json["analysis_summary_csv"] = summary_csv_path
    stats_json["pctg_or_count_FLAG"] = pctg_or_count_FLAG
    stats_json["threshold_count"] = threshold_count
    stats_json["threshold_pctg"] = threshold_pctg
    stats_json["poor_performed_doc"] = []
    if (pctg_or_count_FLAG == "count") and (threshold_pctg == 0):
        for predicted_file in list(processed_predected_documents.keys()):
            try:
                if sum((x[predicted_file])) >= threshold_count:
                    blob_upload(
                        output_folder_path_name,
                        predicted_file,
                        processed_predected_documents[predicted_file],
                    )
                else:
                    print("not meeting the threshold count value")
            except Exception as e:
                print(e)
                continue
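    elif (pctg_or_count_FLAG == "pctg") and (threshold_count == 0):
        for predicted_file in list(processed_predected_documents.keys()):
            if predicted_file not in x.columns:
                # No critical field misses were recorded for this document
                print("not meeting the threshold pctg value")
                continue
            for val in x[predicted_file].values:
                try:
                    # val / total is the share of this document's misses
                    # that fall on a single critical field
                    if (val / sum(x[predicted_file].values) * 100) >= threshold_pctg:
                        blob_upload(
                            output_folder_path_name,
                            predicted_file,
                            processed_predected_documents[predicted_file],
                        )
                        break
                    else:
                        print("not meeting the threshold pctg value")
                except Exception as e:
                    print(e)
                    continue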
    else:
        print(
            "Please check input 'pctg_or_count_flag'. Value should be either 'count' or 'pctg'."
        )
    from pprint import pprint

    pprint(stats_json)
    with open("summary_run_" + time_stamp + ".json", "w") as fo:
        fo.write(json.dumps(stats_json))

    return stats_json

In [5]:
def main():
    # Getting data from Critical fields.csv
    with open(critical_fields_csv, "r") as cf_file:
        cf_data = cf_file.read()

    cf = {}
    for field in cf_data.splitlines():
        if not field.strip():
            continue  # skip blank lines such as a trailing newline
        data = field.split(",")
        cf[data[0].strip()] = data[1].strip()
    num_substring_entities_list = []
    for k, v in cf.items():
        if v.lower() == "yes":
            num_substring_entities_list.append(k)

    GT_bucket = GT_Output_URI.split("/")[2]
    # Create the local output folder for the summary CSV if it does not exist
    os.makedirs("analysis", exist_ok=True)

    storage_client = storage.Client()
    source_bucket = storage_client.bucket(GT_bucket)  # storage bucket name
    source_blob = source_bucket.list_blobs(
        prefix="/".join(GT_Output_URI.split("/")[3:-1])
    )

    list_of_files = []
    for blob in source_blob:
        if blob.name.endswith(".json"):
            list_of_files.append("gs://" + GT_bucket + "/" + blob.name)

    document_schema = utilities.get_document_schema(
        location, project_number, processor_ID, processor_versionID
    )

    # Checking whether critical entities are available in schema
    list_of_entities = []
    for entity_type in document_schema.entity_types:
        for entity in entity_type.properties:
            list_of_entities.append(entity.name)
    for ent in cf.keys():
        if ent not in list_of_entities:
            print(
                "Stop! Critical Field Entity {} NOT FOUND in  Ground Truth Entities \n Check the entities in each files and correct...".format(
                    ent
                )
            )
            sys.exit(1)

    master_df = pd.DataFrame(
        columns=[
            "File Name",
            "GT Entity Type",
            "GT_Output",
            "GT_bbox",
            "Predicted_Output",
            "GTvsPredictedDifference",
            "Predicted_bbox",
            "Match",
            "Fuzzy Ratio",
            "bbox_mismatch",
        ]
    )
    processed_predected_documents = {}

    for prcsd_file in list_of_files:
        compare_merged = pd.DataFrame()
        GT_json = utilities.documentai_json_proto_downloader(
            GT_bucket, ("/").join(prcsd_file.split("/")[3:])
        )
        pdf_bytes, synthesiz_images = utilities.create_pdf_bytes_from_json(
            documentai.Document.to_dict(GT_json)
        )
        processor_result = utilities.process_document_sample(
            project_id=project_number,
            location=location,
            processor_id=processor_ID,
            pdf_bytes=pdf_bytes,
            processor_version=processor_versionID,
        ).document
        temp_parsed_json = documentai.Document(processor_result)
        compare_output_1, score = utilities.compare_pre_hitl_and_post_hitl_output(
            GT_json, temp_parsed_json
        )
        compare_output_1.rename(
            columns={
                "Entity Type": "GT Entity Type",
                "Pre_HITL_Output": "GT_Output",
                "Post_HITL_Output": "Predicted_Output",
                "pre_bbox": "GT_bbox",
                "post_bbox": "Predicted_bbox",
            },
            inplace=True,
        )
        compare_output = compare_output_1.loc[
            :,
            [
                "GT Entity Type",
                "GT_Output",
                "GT_bbox",
                "Predicted_Output",
                "Predicted_bbox",
                "Fuzzy Ratio",
            ],
        ]
        compare_output.to_csv("2.csv")
        column = [prcsd_file.split("/")[-1]] * compare_output.shape[0]
        compare_output.insert(loc=0, column="File Name", value=column)
        compare_output.insert(loc=5, column="GTvsPredictedDifference", value=" ")
        for j in range(len(compare_output)):
            gt_value = str(compare_output["GT_Output"][j])
            predicted_value = str(compare_output["Predicted_Output"][j])
            if compare_output["Fuzzy Ratio"][j] == 1.0:  # strict match
                difference = "NO"
            elif (
                gt_value == "Entity not found."
                and predicted_value == "Entity not found."
            ):
                difference = "NO"
            elif (
                compare_output["GT Entity Type"][j] in num_substring_entities_list
                and predicted_value.isdigit()
                and predicted_value in gt_value
            ):
                # numerical substring matching: the all-digit prediction
                # is contained in the ground truth value
                difference = "NO"
            else:
                difference = "YES"
            compare_output.loc[j, "GTvsPredictedDifference"] = difference
        # Note for the per-file marker row: set only when any entity differs
        change_GT_and_parsed = ""
        if (compare_output["Fuzzy Ratio"] != 1.0).any():  # strict
            change_GT_and_parsed = "parsed json is diff from GT"
        processed_predected_documents[prcsd_file.split("/")[-1]] = temp_parsed_json
        # compare_output['bbox_mismatch'] = compare_output['GT_bbox'] != compare_output['Predicted_bbox']
        new_row = pd.Series(
            [
                prcsd_file.split("/")[-1],
                "parsed json",
                "is updated",
                "compared to GT",
                ":",
                np.nan,
                change_GT_and_parsed,
                "",
            ],
            index=compare_output.columns,
        )
        # DataFrame.append was removed in pandas 2.0; concat the row instead
        compare_output = pd.concat(
            [compare_output, new_row.to_frame().T], ignore_index=True
        )
        frames = [compare_merged, compare_output]
        compare_merged = pd.concat(frames)
        master_df = pd.concat([master_df, compare_merged], ignore_index=True)

    stats_json = get_poor_perfoming_docs(
        master_df,
        pctg_or_count_FLAG,
        threshold_count,
        threshold_pctg,
        cf,
        output_folder_path_name,
        processed_predected_documents,
    )


main()

min() arg is an empty sequence
min() arg is an empty sequence
min() arg is an empty sequence
min() arg is an empty sequence
min() arg is an empty sequence
min() arg is an empty sequence
min() arg is an empty sequence
min() arg is an empty sequence
min() arg is an empty sequence
'docai_scoring_output_client_11band12b_3686237892651273550_92_DTP_EXP_BIL_LC_INV_958151758472_ISS000_B39B76F7-CD00-4396-8579-C109EDB35CA7_7-0.json'
'docai_scoring_output_client_11band12b_3686237892651273550_104_DTP_EXP_COLN_INV_958130578848_ISS000_AE4D4916-9B2D-4D7B-A45D-28638CBC8F7E_10-0.json'
'docai_scoring_output_client_11band12b_3686237892651273550_112_DTP_EXP_COLN_INV_958130579990_ISS000_6058D335-D4FE-4AED-ADD1-FF9A405A90B1-0.json'
{'GT_file_path': 'gs://scb_line_item_exp/SCB_Samples/groundtruth/',
 'analysis_summary_csv': 'analysis/summary_stats_18_10_23-090132.csv',
 'pctg_or_count_FLAG': 'count',
 'poor_performed_doc': ['poor_performance_doc/poor_performance-docai_scoring_output_client_11band12b_36862378