# CDC Ground Truth/Parsed Output Comparison

* Author: docai-incubator@google.com


## Disclaimer

This tool is not supported by the Google engineering team or product team. It is provided and supported on a best-effort basis by the DocAI Incubator Team. No guarantees of performance are implied.

## Purpose and Description

This tool compares ground truth and parsed CDC JSON files to build a confusion matrix and a list of the files where the prediction and the ground truth do not match. This helps identify which classes the model confuses, so that the model can be retrained with more samples of those classes.


## Prerequisites

1. Vertex AI Notebook
2. Parsed JSON files in a GCS folder
3. Ground truth JSON files in a GCS folder


## Step by Step procedure 

### 1. Input Details

# Input details
project_id = "xxxx-xxxx-xxxx"  # Enter your project ID
GT_GCS_path = "gs://xxxx/xxx/xx"  # GCS folder where the ground truth JSON files are saved
parsed_GCS_path = "gs://xxx/xxx/xxx/xxx"  # GCS folder where the parsed JSON files are saved
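
The sample code derives each bucket name with `split("/")[2]`, so both paths need to be full `gs://bucket/prefix` URIs. The sketch below illustrates that split; `split_gcs_path` is a hypothetical helper and the path is a placeholder, neither is part of the tool.

```python
from typing import Tuple


def split_gcs_path(gcs_path: str) -> Tuple[str, str]:
    """Split a gs:// URI into (bucket, prefix). Hypothetical helper for illustration."""
    if not gcs_path.startswith("gs://"):
        raise ValueError(f"Expected a gs:// URI, got: {gcs_path!r}")
    # Everything up to the first "/" after the scheme is the bucket name.
    bucket, _, prefix = gcs_path[len("gs://"):].partition("/")
    return bucket, prefix


# Placeholder path, for illustration only
bucket, prefix = split_gcs_path("gs://my-bucket/ground_truth/jsons")
print(bucket)  # my-bucket
print(prefix)  # ground_truth/jsons
```

A path without the `gs://` scheme would shift the bucket name to a different index in the split, so validating the scheme up front is one way to catch a mistyped input early.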

### 2. Run the Code

Copy the code provided in the Sample Code section and run it to generate the comparison outputs.

### 3. Output

The tool produces four outputs:

1. **CDC_comparision.csv**:
   This CSV file compares the classifier ground truth with the parsed JSON files.

   For each parsed JSON file, the tool considers only the document type with the highest confidence.

   The screenshot below shows a sample of the CSV file. The column names are self-explanatory; the `match` column is TP (true positive) if the ground truth and the prediction match, and FP (false positive) otherwise.
   
<img src="./images/cdc_1.png" width=800 height=400></img>

2. **Confusion_matrix**:
    This is the confusion matrix of actual versus predicted classes; a sample is shown below.
<img src="./images/cdc_3.png" width=800 height=400></img>

3. **Prediction_errors.csv**:
 This CSV file has the same columns as **CDC_comparision.csv**, filtered to the files where the ground truth and the prediction differ.

<img src="./images/cdc_2.png" width=800 height=400></img>

4. **Error_predicted_files**:
This is the list of files that were predicted incorrectly.
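
The selection and labelling rules described above can be sketched in plain Python. This is a minimal illustration, not the tool's code: the file names, types and confidences are made up, and dictionaries stand in for the Document AI entities and the pandas data frame.

```python
from typing import Dict, List

# Hypothetical classifier entities per parsed file (type plus confidence).
predictions: Dict[str, List[dict]] = {
    "doc_a.json": [
        {"type": "invoice", "confidence": 0.91},
        {"type": "receipt", "confidence": 0.09},
    ],
    "doc_b.json": [
        {"type": "invoice", "confidence": 0.35},
        {"type": "receipt", "confidence": 0.65},
    ],
}
# Hypothetical ground truth labels for the same files.
ground_truth = {"doc_a.json": "invoice", "doc_b.json": "invoice"}

rows = []
for file_name, entities in predictions.items():
    # Keep only the predicted type with the highest confidence.
    best = max(entities, key=lambda e: e["confidence"])
    gt_type = ground_truth[file_name]
    rows.append(
        {
            "File_name": file_name,
            "GT_type": gt_type,
            "parsed_type": best["type"],
            "parsed_confidence": best["confidence"],
            "match": "TP" if best["type"] == gt_type else "FP",
        }
    )

# Rough equivalents of Prediction_errors.csv and Error_predicted_files
error_rows = [row for row in rows if row["match"] == "FP"]
error_files = [row["File_name"] for row in error_rows]
print(error_files)  # ['doc_b.json']
```

Here `doc_b.json` is labelled FP because its highest-confidence type (`receipt`) differs from the ground truth (`invoice`), so it is the only file in the error list.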


### Sample Code

!wget https://raw.githubusercontent.com/GoogleCloudPlatform/document-ai-samples/main/incubator-tools/best-practices/utilities/utilities.py

# functions needed
import warnings
import pandas as pd
from utilities import file_names, documentai_json_proto_downloader
from google.cloud import documentai
from typing import Any, Dict, List, Optional, Sequence, Tuple, Union

warnings.filterwarnings("ignore", category=UserWarning)


def max_confidence_type(
    data: List[documentai.Document.Entity],
) -> documentai.Document.Entity:
    """Get the type which has max confidence
    Args:
        data: List of entities.

    Returns:
        Returns the entity having max confidence.
    """

    # max() returns the first entity with the highest confidence, so ties
    # (for example, ground truth entities that all have confidence 0) still
    # resolve to a real entity instead of an empty placeholder.
    return max(data, key=lambda entity: entity.confidence)


def get_best_match_file(
    parsed_file_names_list: List[str], file_name: str
) -> Optional[str]:
    """Get the closest parsed file name when there is no exact match (similarity ratio above 90).
    Args:
        parsed_file_names_list: List of parsed JSON file names.
        file_name: Single ground truth file name.

    Returns:
        The parsed file name closest to the ground truth file name,
        or None if no candidate scores above 90.
    """
    from fuzzywuzzy import fuzz

    best_match = None
    best_ratio = 90
    for file_name_y in parsed_file_names_list:
        ratio = fuzz.ratio(file_name, file_name_y)
        if ratio > best_ratio:
            best_ratio = ratio
            best_match = file_name_y

    return best_match


def get_match_files(GT_GCS_path: str, parsed_GCS_path: str) -> dict:
    """Finds the matching file name between ground truth and parsed documents.

    Args:
        GT_GCS_path: Ground truth gcs path.
        parsed_GCS_path: Parsed json gcs path.

    Returns:
        A dictionary mapping each ground truth file path to the path of its
        matching parsed file.
    """
    GT_file_names_list, GT_file_dict = file_names(GT_GCS_path)
    parsed_file_names_list, parsed_file_dict = file_names(parsed_GCS_path)
    matches = {}
    for GT_file, GT_file_path in GT_file_dict.items():
        GT_path = GT_file_path
        parsed_file = None
        if GT_file in parsed_file_names_list:
            parsed_file = GT_file
        else:
            parsed_file = get_best_match_file(parsed_file_names_list, GT_file)
        if parsed_file is not None:
            parsed_path = parsed_file_dict[parsed_file]

            matches[GT_path] = parsed_path

    return matches


def Compare_GT_parsed(
    GT_GCS_path: str, parsed_GCS_path: str
) -> Tuple[pd.DataFrame, pd.DataFrame, pd.DataFrame, List[str]]:
    """Compare the parsed documents against their ground truth documents.

    Args:
        GT_GCS_path: Ground truth gcs path.
        parsed_GCS_path: Parsed json gcs path.

    Returns:
        A data frame of all compared files, the confusion matrix, a data frame
        of the mispredicted files, and the list of mispredicted file names.
    """
    matches = get_match_files(GT_GCS_path, parsed_GCS_path)
    compare_dict = {}
    GT_bucket_name = GT_GCS_path.split("/")[2]
    parsed_bucket_name = parsed_GCS_path.split("/")[2]
    for GT_file_match, parse_file_match in matches.items():
        GT_json = documentai_json_proto_downloader(GT_bucket_name, GT_file_match)
        parse_json = documentai_json_proto_downloader(
            parsed_bucket_name, parse_file_match
        )
        if len(GT_json.entities) > 1:
            GT_ent = max_confidence_type(GT_json.entities)
        else:
            GT_ent = GT_json.entities[0]
        if len(parse_json.entities) > 1:
            parsed_ent = max_confidence_type(parse_json.entities)
        else:
            parsed_ent = parse_json.entities[0]

        GT_type = getattr(GT_ent, "type", None)
        GT_confidence = getattr(GT_ent, "confidence", 1)
        parsed_type = getattr(parsed_ent, "type", None)
        parsed_confidence = getattr(parsed_ent, "confidence", 1)

        compare_dict[GT_file_match.split("/")[-1]] = {
            "GT_type": GT_type,
            "GT_confidence": GT_confidence,
            "parsed_type": parsed_type,
            "parsed_confidence": parsed_confidence,
        }

    df = pd.DataFrame.from_dict(compare_dict, orient="index")
    df["match"] = df.apply(
        lambda row: "TP" if row["parsed_type"] == row["GT_type"] else "FP", axis=1
    )
    df.reset_index(inplace=True)
    df.rename(columns={"index": "File_name"}, inplace=True)
    confusion_matrix = pd.crosstab(
        df["GT_type"], df["parsed_type"], rownames=["Actual"], colnames=["Predicted"]
    )
    Error_predicted_df = df[df["match"] == "FP"]
    Error_predicted_files = Error_predicted_df.iloc[:, 0].tolist()
    warnings.filterwarnings("default", category=UserWarning)
    return df, confusion_matrix, Error_predicted_df, Error_predicted_files


# calling the function
(
    df_allfiles,
    confusion_matrix,
    Error_predicted_df,
    Error_predicted_files,
) = Compare_GT_parsed(GT_GCS_path, parsed_GCS_path)

# To Generate CSV Files
df_allfiles.to_csv("CDC_comparision.csv")
Error_predicted_df.to_csv("Prediction_errors.csv")
print("Confusion Matrix : ", confusion_matrix)
print("Error predicted files : ", Error_predicted_files)
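
To decide which classes to retrain first, it can help to rank the off-diagonal cells of the confusion matrix. The sketch below counts mismatched (actual, predicted) pairs with `collections.Counter`; the pairs are made-up sample data. With the tool's output, the pairs could presumably be taken from `zip(df_allfiles["GT_type"], df_allfiles["parsed_type"])`.

```python
from collections import Counter

# Hypothetical (actual, predicted) pairs, one per compared file
pairs = [
    ("invoice", "invoice"),
    ("invoice", "receipt"),
    ("invoice", "receipt"),
    ("receipt", "receipt"),
    ("w2", "invoice"),
]

# Count only the mismatches, i.e. the off-diagonal confusion matrix cells.
confused = Counter((actual, predicted) for actual, predicted in pairs if actual != predicted)

# List the most frequent confusions first.
for (actual, predicted), count in confused.most_common():
    print(f"{actual} -> {predicted}: {count}")
```

In this sample, `invoice` predicted as `receipt` is the most frequent confusion, which suggests adding more training samples for those two classes.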