# CS Decision Matrix Automation

* Author: docai-incubator@google.com

## Disclaimer

This tool is not supported by the Google engineering team or product team. It is provided and supported on a best-effort basis by the **DocAI Incubator Team**. No guarantees of performance are implied.

# Objective
This notebook builds a confidence score (CS) decision matrix by comparing ground truth files with predicted JSON files, both supplied as input. It also calculates the ECE (Expected Confidence Error), which measures how well the confidence scores reflect actual accuracy.

# Prerequisites
* Vertex AI Notebook
* Ground truth and predicted JSON files

# Step-by-Step Procedure

## 1. Import Modules/Packages

In [None]:
# Run this cell to install required packages
!pip install numpy
!pip install pandas
!pip install google-cloud-documentai

In [None]:
# Run this cell to download utilities module
!wget https://raw.githubusercontent.com/GoogleCloudPlatform/document-ai-samples/main/incubator-tools/best-practices/utilities/utilities.py

In [None]:
import difflib
import warnings
from typing import Dict, List, Tuple, Union

import numpy as np
import pandas as pd
from google.cloud import documentai_v1beta3 as documentai
from pandas import DataFrame

from utilities import (
    documentai_json_proto_downloader,
    file_names,
    find_match,
    get_match_ratio,
    json_to_dataframe,
    remove_row,
)

## 2. Input Details

* **GROUND_TRUTH_URI** : GCS folder path that contains the ground truth Document AI JSON files
* **PREDICTED_URI** : GCS folder path that contains the predicted Document AI JSON files

In [None]:
# GCS folder where the ground truth files are saved
GROUND_TRUTH_URI = "gs://BUCKET/input_path_cs_decision_matrix_automation/GT/"
# GCS folder where the prediction files are saved
PREDICTED_URI = "gs://BUCKET/output_path_cs_decision_matrix_automation/PT/"
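
The cells in the next step derive the bucket name from these URIs with `GROUND_TRUTH_URI.split("/")[2]`. As a minimal sketch of that convention, the snippet below shows how a `gs://` URI breaks into a bucket and a folder prefix, which can help sanity-check the two values before running the comparison. The helper name `split_gcs_uri` is hypothetical and is not part of `utilities.py`.

```python
from typing import Tuple


def split_gcs_uri(uri: str) -> Tuple[str, str]:
    """Split a gs:// URI into (bucket, prefix). Hypothetical helper, for illustration only."""
    if not uri.startswith("gs://"):
        raise ValueError(f"Expected a gs:// URI, got: {uri!r}")
    # "gs://BUCKET/some/folder/".split("/") -> ["gs:", "", "BUCKET", "some", "folder", ""]
    parts = uri.split("/")
    bucket = parts[2]
    prefix = "/".join(parts[3:])
    return bucket, prefix


bucket, prefix = split_gcs_uri(
    "gs://BUCKET/input_path_cs_decision_matrix_automation/GT/"
)
print(bucket)  # BUCKET
print(prefix)  # input_path_cs_decision_matrix_automation/GT/
```

A URI that does not start with `gs://` raises a `ValueError` in this sketch, whereas the plain `split("/")[2]` used in the notebook would silently return whatever the third path segment happens to be.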

## 3. Run Below Code-Cells

In [None]:
# Silence pandas PerformanceWarning messages
warnings.filterwarnings("ignore", category=pd.errors.PerformanceWarning)
# Disable the chained-assignment warning (default is "warn")
pd.options.mode.chained_assignment = None


def modify_ground_truth_and_predictions(
    df_file1: pd.DataFrame, df_file2: pd.DataFrame
) -> Tuple[pd.DataFrame, pd.DataFrame]:
    """
    Helper for the remaining entities, which are matched by comparing
    the IOU (intersection over union) of their bounding boxes.

    Args:
        df_file1 (pd.DataFrame): Dataframe containing ground truth data
        df_file2 (pd.DataFrame): Dataframe containing predicted details

    Returns:
        Tuple[pd.DataFrame, pd.DataFrame]: Modified dataframes
    """

    mention_text2 = pd.Series(dtype=str)
    bbox2 = pd.Series(dtype=object)
    bbox1 = pd.Series(dtype=object)
    page_1 = pd.Series(dtype=object)
    page_2 = pd.Series(dtype=object)
    confidence2 = pd.Series(dtype=object)
    for index, row in enumerate(df_file1.values):
        matched_index = find_match(row, df_file2)
        if matched_index is not None:
            mention_text2.loc[index] = df_file2.loc[matched_index][1]
            confidence2.loc[index] = df_file2.loc[matched_index][4]
            bbox2.loc[index] = df_file2.loc[matched_index][2]
            bbox1.loc[index] = row[2]
            page_2.loc[index] = df_file2.loc[matched_index][3]
            page_1.loc[index] = row[3]
            df_file2 = df_file2.drop(matched_index)
        else:
            mention_text2.loc[index] = "Entity not found."
            confidence2.loc[index] = "Not found"
            bbox2.loc[index] = "Entity not found."
            bbox1.loc[index] = row[2]
            page_1.loc[index] = row[3]
            page_2.loc[index] = "no"

    df_file1["mention_text2"] = mention_text2.values
    df_file1["bbox2"] = bbox2.values
    df_file1["bbox1"] = bbox1.values
    df_file1["page_1"] = page_1.values
    df_file1["page_2"] = page_2.values
    df_file1["confidence2"] = confidence2.values
    df_file1 = df_file1.drop(["bbox"], axis=1)
    df_file1 = df_file1.drop(["page"], axis=1)
    df_file1.rename(
        columns={
            "type_": "Entity Type",
            "mention_text": "GT_Output",
            "mention_text2": "prediction_Output",
            "bbox1": "pre_bbox",
            "bbox2": "post_bbox",
            "page_1": "page1",
            "page_2": "page2",
            "confidence": "confidence1",
            "confidence2": "confidence2",
        },
        inplace=True,
    )
    return df_file1, df_file2


def compare_gt_and_prediction_output(
    file1: documentai.Document, file2: documentai.Document
) -> Tuple[DataFrame, list]:
    """
    Compares the entities between two files and returns the results in a DataFrame.

    Args:
        file1 (documentai.Document): Document AI object for the first file (ground truth).
        file2 (documentai.Document): Document AI object for the second file (prediction).

    Returns:
        Tuple[DataFrame, list]:
            A tuple where the first element is a DataFrame based on the
            comparison, and the second element is a list of dicts.
    """

    df_file1 = json_to_dataframe(file1)
    df_file2 = json_to_dataframe(file2)

    file1_entities = [entity[0] for entity in df_file1.values]
    file2_entities = [entity[0] for entity in df_file2.values]
    # find entities which are present only once in both files
    # these entities will be matched directly
    common_entities = set(file1_entities).intersection(set(file2_entities))
    exclude_entities = []
    for entity in common_entities:
        if file1_entities.count(entity) > 1 or file2_entities.count(entity) > 1:
            exclude_entities.append(entity)
    for entity in exclude_entities:
        common_entities.remove(entity)
    df_compare = pd.DataFrame(
        columns=[
            "Entity Type",
            "GT_Output",
            "prediction_Output",
            "pre_bbox",
            "post_bbox",
            "page1",
            "page2",
            "confidence1",
            "confidence2",
        ]
    )

    for entity in common_entities:
        confidences = (
            df_file1[df_file1["type_"] == entity].iloc[0]["confidence"],
            df_file2[df_file2["type_"] == entity].iloc[0]["confidence"],
        )
        _values = (
            df_file1[df_file1["type_"] == entity].iloc[0]["mention_text"],
            df_file2[df_file2["type_"] == entity].iloc[0]["mention_text"],
        )
        bboxes = (
            df_file1[df_file1["type_"] == entity].iloc[0]["bbox"],
            df_file2[df_file2["type_"] == entity].iloc[0]["bbox"],
        )
        pages = (
            df_file1[df_file1["type_"] == entity].iloc[0]["page"],
            df_file2[df_file2["type_"] == entity].iloc[0]["page"],
        )

        df_compare.loc[len(df_compare.index)] = [
            entity,
            _values[0],
            _values[1],
            bboxes[0],
            bboxes[1],
            pages[0],
            pages[1],
            confidences[0],
            confidences[1],
        ]
        # common entities are removed from df_file1 and df_file2
        df_file1 = remove_row(df_file1, entity)
        df_file2 = remove_row(df_file2, entity)

    df_file1, df_file2 = modify_ground_truth_and_predictions(df_file1, df_file2)
    # df_compare = df_compare._append(df_file1, ignore_index=True)
    df_compare = pd.concat([df_compare, df_file1], ignore_index=True)
    # adding entities which are present in file2 but not in file1
    for entity in df_file2.values:
        df_compare.loc[len(df_compare.index)] = [
            entity[0],
            "Entity not found.",
            entity[1],
            "[]",
            entity[2],
            "[]",
            entity[3],
            "",
            entity[4],
        ]

    match_array_entity_data = get_match_array_ent_data(df_compare)
    df_compare["Match"] = match_array_entity_data[0]

    df_compare["Fuzzy Ratio"] = df_compare.apply(get_match_ratio, axis=1)

    return df_compare, match_array_entity_data[1]


def get_match_array_ent_data(
    df_compare: pd.DataFrame,
) -> Tuple[List[str], List[Dict[str, Dict[str, str]]]]:
    """
    Helper that returns the match labels and the prediction confidences as dicts

    Args:
        df_compare (pd.DataFrame): Dataframe which holds both ground truth and prediction details

    Returns:
        Tuple[List[str], List[Dict[str, Dict[str, str]]]]:
            Returns match strings and dict containing prediction scores
    """
    match_array = []
    entity_data = []
    for df_i in range(0, len(df_compare)):
        temp_dict = {}
        match_string = ""
        not_found_string = "Entity not found."
        pre_output = df_compare.iloc[df_i]["GT_Output"]
        post_output = df_compare.iloc[df_i]["prediction_Output"]

        if pre_output == not_found_string and post_output == not_found_string:
            match_string = "TN"
        elif pre_output != not_found_string and post_output == not_found_string:
            match_string = "FN"
            temp_dict[df_compare.iloc[df_i]["Entity Type"]] = {
                "Match": "N",
                "Confidence_pred": "0",
            }
            entity_data.append(temp_dict)
        elif pre_output == not_found_string and post_output != not_found_string:
            match_string = "FP"
            temp_dict[df_compare.iloc[df_i]["Entity Type"]] = {
                "Match": "N",
                "Confidence_pred": df_compare.iloc[df_i]["confidence2"],
            }
            entity_data.append(temp_dict)
        elif pre_output != not_found_string and post_output != not_found_string:
            if pre_output == post_output:
                match_string = "TP"
                temp_dict[df_compare.iloc[df_i]["Entity Type"]] = {
                    "Match": "Y",
                    "Confidence_pred": df_compare.iloc[df_i]["confidence2"],
                }
                entity_data.append(temp_dict)
            else:
                match_string = "TP"
                temp_dict[df_compare.iloc[df_i]["Entity Type"]] = {
                    "Match": "N",
                    "Confidence_pred": df_compare.iloc[df_i]["confidence2"],
                }
                entity_data.append(temp_dict)
        else:
            match_string = "Something went Wrong."

        match_array.append(match_string)
    return match_array, entity_data


# Function to assign values based on ranges
def assign_bin(value: float) -> Union[float, None]:
    """
    Assign a bin value based on a given input value.

    This function assigns a bin value to a given input value based on predefined ranges.

    Args:
        value (float): The input value.

    Returns:
        float or None:
        The assigned bin value if the input value falls within the defined ranges, else None.
    """
    ranges = [
        (0.0, 0.1),
        (0.1, 0.2),
        (0.2, 0.3),
        (0.3, 0.4),
        (0.4, 0.5),
        (0.5, 0.6),
        (0.6, 0.7),
        (0.7, 0.8),
        (0.8, 0.9),
        (0.9, 1.0),
    ]

    for start, end in ranges:
        if start <= value < end:
            return start
    if value == 1.0:
        return 1.0

    # Add more conditions for other ranges if needed
    return None  # Handle values outside specified ranges


def get_cs_mean(
    dataframe_cs: pd.DataFrame, cs_name: str
) -> Dict[str, Union[float, None]]:
    """Return the mean confidence score per bin for the given cs_name column.

    Args:
        dataframe_cs (pd.DataFrame): A dataframe containing the target column.
        cs_name (str): Column name

    Returns:
        Dict[str, Union[float, None]]: Mean score for the provided column, keyed by bin
    """

    cs_list = list(dataframe_cs[cs_name].values)
    cs_list = [
        float(value)
        if isinstance(value, (int, float)) or value.replace(".", "", 1).isdigit()
        else value
        for value in cs_list
    ]

    # Remove NaN values from the list
    cs_list_without_nan = [value for value in cs_list if pd.notna(value)]

    # Define bin edges
    bins = [i / 10.0 for i in range(11)]

    cs_mean = {}
    for bin_i in range(len(bins) - 1):
        bin_range = f"{bin_i/10:.1f}"
        values_in_bin = [
            value
            for value in cs_list_without_nan
            if bins[bin_i] <= value < bins[bin_i + 1]
        ]
        mean_value = round(np.mean(values_in_bin), 3) if values_in_bin else np.nan
        cs_mean[bin_range] = mean_value
    return cs_mean


# creating crosstab Dataframes function
def get_cs_analysis(
    entity_name: str, dataframe_cs: pd.DataFrame
) -> Tuple[pd.DataFrame, float]:
    """
    Perform analysis on confidence score (CS) data for a specific entity.

    This function performs analysis on confidence score (CS) data for a specific entity.
    It calculates various metrics including accuracy, cumulative error, weighted average,
    etc., and returns a DataFrame with the analysis results.

    Args:
        entity_name (str): The name of the entity for which CS analysis is performed.
        dataframe_cs (DataFrame): The DataFrame containing CS data.

    Returns:
        DataFrame, float: A DataFrame containing the CS analysis results and
        the expected confidence error (ECE) as a float.
    """

    # creating a pivot table
    cs_bin_name = entity_name + "_CS_bin"
    match_name = entity_name + "_Match"
    cs_name = entity_name + "_CS"
    cross_tab = pd.crosstab(
        index=dataframe_cs[cs_bin_name], columns=dataframe_cs[match_name]
    )
    cross_tab_sorted = cross_tab.sort_values(by=cs_bin_name, ascending=False)

    # Add grand total for 'Y' and 'N' in columns
    # cross_tab_sorted.loc['Grand Total', :] = cross_tab_sorted.sum()
    if "N" not in cross_tab_sorted.columns:
        # No mismatches for this entity: add an 'N' column of zeros
        cross_tab_sorted["N"] = 0
    if "Y" not in cross_tab_sorted.columns:
        # No matches for this entity: add a 'Y' column of zeros
        cross_tab_sorted["Y"] = 0
    # Add grand total for 'Y' and 'N' in rows
    cross_tab_sorted.loc[:, "Grand Total"] = cross_tab_sorted.sum(axis=1)
    cross_tab_sorted["F2CS Bins"] = cross_tab_sorted.index

    cross_tab_sorted["% Err Cum"] = (
        (cross_tab_sorted["N"].cumsum() / cross_tab_sorted["Grand Total"].sum()) * 100
    ).round()
    cross_tab_sorted["% Bypass"] = (
        (
            cross_tab_sorted["Grand Total"].cumsum()
            / cross_tab_sorted["Grand Total"].sum()
        )
        * 100
    ).round(1)
    cross_tab_sorted["Cum Acc"] = 100 - cross_tab_sorted["% Err Cum"]

    cs_mean = get_cs_mean(dataframe_cs, cs_name)
    cross_tab_sorted["AVERAGE of CS"] = (
        cross_tab_sorted["F2CS Bins"].astype(str).map(cs_mean)
    )
    cross_tab_sorted["Accuracy in Bin"] = (
        cross_tab_sorted["Y"] / cross_tab_sorted["Grand Total"]
    ).round(3)
    cross_tab_sorted["Diff Acc-AvgCS"] = abs(
        cross_tab_sorted["Accuracy in Bin"] - cross_tab_sorted["AVERAGE of CS"]
    )
    cross_tab_sorted["Weighted Avg"] = (
        cross_tab_sorted["Diff Acc-AvgCS"] * cross_tab_sorted["Grand Total"]
    ).round(3)
    ece = (
        (cross_tab_sorted["Weighted Avg"].sum() / cross_tab_sorted["Grand Total"].sum())
        * 100
    ).round(2)

    return cross_tab_sorted, ece


gt_files_list, gt_path_dict = file_names(GROUND_TRUTH_URI)
predicted_files_list, predicted_path_dict = file_names(PREDICTED_URI)
matched_files_dict = {}
non_matched_files_dict = {}
for i in gt_files_list:
    for j in predicted_files_list:
        matched_score = difflib.SequenceMatcher(None, i, j).ratio()

        if matched_score >= 0.8:
            matched_files_dict[i] = j
        else:
            non_matched_files_dict[i] = "No parsed output available"

for i in matched_files_dict:
    if i in non_matched_files_dict:
        del non_matched_files_dict[i]

file_wise_data = {}
total_file_wise_data = {}
merge_dataframe = pd.DataFrame()
for gt_file, pred_file in matched_files_dict.items():
    print(gt_file)
    try:
        gt_json = documentai_json_proto_downloader(
            GROUND_TRUTH_URI.split("/")[2], gt_path_dict[gt_file]
        )
        pred_json = documentai_json_proto_downloader(
            PREDICTED_URI.split("/")[2], predicted_path_dict[pred_file]
        )

        temp_compare, data_entity = compare_gt_and_prediction_output(gt_json, pred_json)
        temp_compare.insert(0, "filename", gt_file)
        merge_dataframe = pd.concat([merge_dataframe, temp_compare], ignore_index=True)
        file_wise_data[str(gt_file)] = data_entity
    except Exception as e:
        print(e)


transformed_data = []

for filename, values in file_wise_data.items():
    for entry in values:
        key_name, sub_data = entry.popitem()

        # Check if sub_data is a string and convert it to a dictionary
        if isinstance(sub_data, str):
            sub_data = {"Match": sub_data, "Confidence_pred": "0"}

        row_data = {"filename": filename, "Entity_name": key_name}
        row_data.update(sub_data)
        transformed_data.append(row_data)

# Create a DataFrame from the transformed data
DF_DATA = pd.DataFrame(transformed_data)
DF_DATA = DF_DATA.reset_index(drop=True)


df_cs = pd.DataFrame(columns=["File_name"])
for item in transformed_data:
    col_match = item["Entity_name"] + "_" + "Match"
    col_cs = item["Entity_name"] + "_" + "CS"
    new_row_data = {
        "File_name": item["filename"],
        col_match: item["Match"],
        col_cs: item["Confidence_pred"],
    }
    new_row = pd.DataFrame([new_row_data])
    df_cs = pd.concat([df_cs, new_row], ignore_index=True)


# Find columns containing "_CS" and apply the conditions
cs_columns = [col for col in df_cs.columns if "_CS" in col]
for col in cs_columns:
    df_cs[col + "_bin"] = df_cs[col].apply(lambda x: assign_bin(float(x)))

unique_columns = {
    col.replace("_CS", "")
    .replace("_CS_bin", "")
    .replace("_Match", "")
    .replace("_bin", "")
    for col in df_cs.columns
}
considered_col = []
cs_analysis = []
cs_analysis_dict = {}

for unique_col in unique_columns:
    try:
        x, y = get_cs_analysis(unique_col, df_cs)
        # The key must be "ECE" to match the label checked when writing the Excel sheet
        temporary_dict = {unique_col: x, "ECE": y}
        cs_analysis.append(temporary_dict)
        cs_analysis_dict[unique_col] = x
        considered_col.append(unique_col)
    except KeyError as e:
        print(e)

# exporting to excel
# Create a Pandas Excel writer using the `with` statement
EXCEL_FILE = "output.xlsx"
with pd.ExcelWriter(EXCEL_FILE, engine="xlsxwriter") as writer:
    # Get the xlsxwriter workbook and worksheet objects
    workbook = writer.book

    worksheet = workbook.add_worksheet(name="CS_analysis_sheet")

    # Set the format for bold text
    bold_format = workbook.add_format({"bold": True})
    # Wrap text and top-align the heading row
    format_with_gap = workbook.add_format({"text_wrap": True, "valign": "top"})
    worksheet.set_row(0, None, format_with_gap)

    # Add heading at the top of the sheet
    worksheet.write(0, 0, "CS ANALYSIS SHEET", bold_format)

    # Write each entity table, followed by its ECE value, to the sheet
    ROW = 3  # Starting row after the heading
    for cs_data in cs_analysis:
        for label, df in cs_data.items():
            if label != "ECE":
                # Write the label (text) to the sheet
                worksheet.write(ROW, 0, label, bold_format)

                # Move past the label row
                ROW += 1

                # Write the DataFrame to the sheet, starting from the next row
                df.to_excel(
                    writer, sheet_name="CS_analysis_sheet", startrow=ROW + 1, index=True
                )

                # Advance past the table (header plus data rows) for the next label
                ROW += len(df) + 2

            # Write the ECE label and value in columns 13 and 14
            elif label == "ECE":
                worksheet.write(ROW, 13, label, bold_format)
                worksheet.write(ROW, 14, df, bold_format)
                ROW += 2

    DF_DATA.to_excel(
        writer, sheet_name="Match_cs_data", startrow=0, startcol=0, index=True
    )

    # merge_dataframe=merge_dataframe.drop(['pre_bbox','post_bbox', 'page1', 'page2'],axis=1)
    merge_dataframe.to_excel(
        writer, sheet_name="Full_comparision_data", startrow=0, startcol=0, index=True
    )

    df_cs = df_cs[df_cs.columns[0:1].tolist() + sorted(df_cs.columns[1:].tolist())]
    df_cs.to_excel(
        writer, sheet_name="Entity_wise_data", startrow=0, startcol=0, index=True
    )

print(f'Excel file "{EXCEL_FILE}" created successfully.')
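
The cell above pairs each ground truth file with a prediction file by fuzzy file-name similarity (`difflib.SequenceMatcher`, ratio of at least 0.8). A minimal, self-contained sketch of that pairing step follows. The function name `pair_files` and the sample file names are illustrative assumptions and are not part of the notebook.

```python
import difflib
from typing import Dict, List, Tuple


def pair_files(
    gt_names: List[str], pred_names: List[str], threshold: float = 0.8
) -> Tuple[Dict[str, str], List[str]]:
    """Pair ground truth and prediction file names by similarity ratio."""
    matched: Dict[str, str] = {}
    for gt in gt_names:
        for pred in pred_names:
            ratio = difflib.SequenceMatcher(None, gt, pred).ratio()
            if ratio >= threshold:
                # As in the notebook, a later candidate above the
                # threshold overwrites an earlier one.
                matched[gt] = pred
    unmatched = [gt for gt in gt_names if gt not in matched]
    return matched, unmatched


matched, unmatched = pair_files(
    ["invoice_001.json", "receipt_abc.json"],
    ["invoice_001.json", "zzz.json"],
)
print(matched)    # {'invoice_001.json': 'invoice_001.json'}
print(unmatched)  # ['receipt_abc.json']
```

Because the match is fuzzy, near-identical names such as `invoice_001.json` and `invoice_002.json` also score above 0.8 and can be paired with each other. Using identical file names in both folders avoids that ambiguity.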

## 4. Output Details

The Excel file `output.xlsx` is saved and contains an entity-wise error analysis based on confidence scores.
Column names and descriptions:

**line_item/quantity_CS_bin**: Confidence score bin for the entity (here `line_item/quantity`). For example, scores from 0.1 up to but not including 0.2 fall in bin 0.1.  
**N**: Number of mismatched entities in each confidence score bin.  
**Y**: Number of matched entities in each confidence score bin.  
**Grand Total**: Total number of entities in each confidence score bin.  
**% Err Cum**: Cumulative error percentage: mismatches accumulated from the highest confidence score bin downward, as a share of all entities.  
**% Bypass**: Cumulative percentage of entities bypassed: entities accumulated from the highest confidence score bin downward, as a share of all entities.  
**Cum Acc**: Cumulative accuracy per confidence score bin (100 minus % Err Cum).  
**AVERAGE of CS**: Average of the confidence scores in each bin.  
**Accuracy in Bin**: Accuracy within each confidence score bin (Y / Grand Total).  
**Diff Acc-AvgCS**: Absolute difference between Accuracy in Bin and AVERAGE of CS.  
**Weighted Avg**: Weighted difference (Diff Acc-AvgCS × Grand Total) for each confidence score bin.  
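
To make the cumulative columns concrete, here is a small worked sketch in plain Python (no pandas) that reproduces `N`, `Y`, `Grand Total`, `% Err Cum`, `% Bypass`, and `Cum Acc` for a handful of made-up records. The sample data and the helper name `bin_lower_edge` are assumptions for illustration. The notebook itself builds these columns with `pd.crosstab`.

```python
from collections import defaultdict

# Hypothetical sample: (predicted confidence, "Y" if the prediction matched, else "N")
records = [
    (0.95, "Y"), (0.92, "Y"), (0.91, "N"),
    (0.85, "Y"),
    (0.55, "N"), (0.52, "Y"),
]


def bin_lower_edge(value):
    """Lower edge of the 0.1-wide bin that contains value (1.0 gets its own bin)."""
    if value == 1.0:
        return 1.0
    for i in range(10):
        if i / 10 <= value < (i + 1) / 10:
            return i / 10
    return None


# Count matches and mismatches per bin
counts = defaultdict(lambda: {"Y": 0, "N": 0})
for cs, match in records:
    counts[bin_lower_edge(cs)][match] += 1

total = len(records)
cum_n = 0
cum_total = 0
rows = []
# Accumulate from the highest confidence bin downward
for bin_start in sorted(counts, reverse=True):
    y, n = counts[bin_start]["Y"], counts[bin_start]["N"]
    cum_n += n
    cum_total += y + n
    err_cum = round(cum_n / total * 100)
    rows.append(
        {
            "bin": bin_start,
            "N": n,
            "Y": y,
            "Grand Total": y + n,
            "% Err Cum": err_cum,
            "% Bypass": round(cum_total / total * 100, 1),
            "Cum Acc": 100 - err_cum,
        }
    )

for row in rows:
    print(row)
```

Read row by row, the table answers a thresholding question: if every entity in bin 0.8 or higher were accepted without review, 66.7% of entities would bypass review in this sample, with a cumulative error of 17%.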

**ECE (Expected Confidence Error)**: Sum of Weighted Avg / Sum of Grand Total, expressed as a percentage.  
ECE indicates how well the confidence scores reflect actual accuracy. A lower ECE is preferred.
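
As a worked sketch of the ECE calculation, the snippet below follows the same steps as `get_cs_analysis`: per-bin average confidence, per-bin accuracy, their absolute difference weighted by the bin count, then the weighted sum divided by the total count and multiplied by 100. The sample records are made up for illustration, and the variable names are assumptions.

```python
# Hypothetical records already grouped by bin: bin -> [(confidence, "Y"/"N"), ...]
bins = {
    0.9: [(0.95, "Y"), (0.92, "Y"), (0.91, "N")],
    0.8: [(0.85, "Y")],
    0.5: [(0.55, "N"), (0.52, "Y")],
}

weighted_sum = 0.0
grand_total = 0
for bin_start, items in bins.items():
    count = len(items)
    # "AVERAGE of CS": mean confidence in the bin
    avg_cs = round(sum(cs for cs, _ in items) / count, 3)
    # "Accuracy in Bin": share of matches in the bin
    accuracy = round(sum(1 for _, match in items if match == "Y") / count, 3)
    # "Diff Acc-AvgCS" and "Weighted Avg"
    diff = abs(accuracy - avg_cs)
    weighted = round(diff * count, 3)
    weighted_sum += weighted
    grand_total += count
    print(bin_start, avg_cs, accuracy, weighted)

# ECE as a percentage, rounded to two decimals
ece = round(weighted_sum / grand_total * 100, 2)
print("ECE:", ece)
```

In this sample the 0.9 bin contributes most to the ECE: its average confidence is about 0.93 but only two of its three predictions are correct, so the confidence overstates accuracy there.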

<img src='./images/output_sample.png' width=1000 height=600 alt="Sample Output"></img>