# Identity Document Proofing Evaluation

* Author: docai-incubator@google.com

## Disclaimer

This tool is not supported by the Google engineering team or product team. It is provided and supported on a best-effort basis by the DocAI Incubator Team. No guarantees of performance are implied.

## Objective

This document provides instructions and a Python script for evaluating identity document proofing. The script parses each document with the **Identity Document Proofing processor**, collects the returned entities, and stores them in a CSV file. It also reports the percentage of fraudulent and non-fraudulent documents.


## Prerequisites
* Python: Jupyter notebook (Vertex AI)
* A service account with the required permissions in the project.


## Step-by-step procedure

### 1. Import the required modules

In [None]:
!wget https://raw.githubusercontent.com/GoogleCloudPlatform/document-ai-samples/main/incubator-tools/best-practices/utilities/utilities.py

In [None]:
import json
import pandas as pd
import utilities

from google.cloud import storage
from io import BytesIO
from google.cloud import documentai_v1beta3 as documentai

### 2. Set up the required inputs
* `project_id`: Your Google Cloud project ID or name.
* `processor_id`: The processor ID, which can be found on the processor details tab in the GCP UI.
* `input_dir`: The path of the folder containing the image files to be processed, including the bucket name and ending with a slash (/). Example: `gs://bucket_name/folder_name/`
* `processor_output_dir`: The path of the processor's output folder, including the bucket name and without a trailing slash (/). Example: `gs://bucket_name/folder_name`
* `location_processor`: Your processor location.

In [None]:
project_id = "xxxxxxxxxxxxx"
processor_id = "xxxxxxxxxxx"
input_dir = "gs://xxxxxxx/xxxxxxx/xxxxxxx/"
processor_output_dir = "gs://xxxxxxxxx/xxxxxxxx"
location_processor = "us"
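The trailing-slash rules for the two paths are easy to get wrong, so a small check before starting the batch job may save a failed run. The following is a minimal sketch; `check_gcs_paths` is a hypothetical helper introduced here for illustration and is not part of `utilities.py`.

```python
def check_gcs_paths(input_dir, processor_output_dir):
    """Return a list of problems found in the two GCS paths (empty if none)."""
    problems = []
    paths = (("input_dir", input_dir), ("processor_output_dir", processor_output_dir))
    for name, path in paths:
        if not path.startswith("gs://"):
            problems.append(f"{name} must start with gs://")
    # input_dir is expected to end with a slash, processor_output_dir is not.
    if not input_dir.endswith("/"):
        problems.append("input_dir must end with a slash")
    if processor_output_dir.endswith("/"):
        problems.append("processor_output_dir must not end with a slash")
    return problems


# Placeholder paths, matching the examples in the list above.
print(check_gcs_paths("gs://bucket_name/folder_name/", "gs://bucket_name/folder_name"))
print(check_gcs_paths("gs://bucket_name/folder_name", "gs://bucket_name/folder_name/"))
```

An empty list means both paths follow the conventions described above; any other result names the path that needs fixing.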

### 3. Execute the first part of the code

In [None]:
res = utilities.batch_process_documents_sample(
    project_id=project_id,
    location=location_processor,
    processor_id=processor_id,
    gcs_input_uri=input_dir,
    gcs_output_uri=processor_output_dir,
    timeout=700,
)

The first part of the code generates JSON files, which are stored in a randomly named folder created inside **processor_output_dir**.

Set the `parser_output_folder_name` variable to that folder name.

In [None]:
parser_output_folder_name = "xxxxxxxxxxxxxx"

<img src="./Images/parser_output_filename.png" width=800 height=400></img>
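As an alternative to reading the folder name from the console, it could be derived from the object names stored under the output path. The sketch below assumes the output objects are laid out as `<output folder>/<generated folder>/...`; `generated_folders` is a hypothetical helper, and the sample object names are made up for illustration.

```python
def generated_folders(blob_names, output_prefix):
    """Collect the distinct first-level folder names found under output_prefix."""
    prefix = output_prefix.strip("/") + "/"
    folders = set()
    for name in blob_names:
        if name.startswith(prefix):
            remainder = name[len(prefix):]
            # Only keep names that have a sub-folder below the prefix.
            if "/" in remainder:
                folders.add(remainder.split("/", 1)[0])
    return sorted(folders)


# Illustrative object names only; real names come from your bucket.
sample_names = [
    "folder_name/1234567890/0/document_a-0.json",
    "folder_name/1234567890/1/document_b-0.json",
    "other_folder/999/0/document_c-0.json",
]
print(generated_folders(sample_names, "folder_name"))
```

In a notebook, the object names could come from `storage.Client().list_blobs(bucket_name, prefix=...)`, taking each blob's `name`. If the output folder has been used for several runs, more than one folder name is returned, and the one belonging to the current run has to be chosen manually.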

### 4. Execute the second part of the code


In [None]:
def main(parser_output_folder_name, processor_output_dir):
    # Build the full GCS path of the folder that holds the processor's JSON output.
    processor_output_dir = processor_output_dir.replace("gs://", "")
    processor_output_dir = (
        "gs://" + processor_output_dir + "/" + parser_output_folder_name + "/"
    )
    file_names_list, file_dict = utilities.file_names(processor_output_dir)
    bucket_name = processor_output_dir.split("/")[2]

    rows = []
    for key, value in file_dict.items():
        content = utilities.documentai_json_proto_downloader(bucket_name, value)
        file_name = key.replace("-0.json", "")
        # Every entity whose value is not "PASS" is recorded as a fraud signal.
        fraud_list = [
            f"{entity.type_} : {entity.mention_text}"
            for entity in content.entities
            if entity.mention_text != "PASS"
        ]
        rows.append(
            {
                "FileName": file_name,
                "FraudulentDocument": "Y" if fraud_list else "N",
                "SignalsDetected": ", ".join(fraud_list),
            }
        )

    # DataFrame.append was removed in pandas 2.0, so build the frame from a list of rows.
    df = pd.DataFrame(
        rows, columns=["FileName", "FraudulentDocument", "SignalsDetected"]
    )
    df.to_csv("output.csv", index=False)

    total_documents = len(df)
    if total_documents == 0:
        # Avoid dividing by zero when the output folder contains no JSON files.
        print("No processor output files found in", processor_output_dir)
        return

    fraudulent_count = int((df["FraudulentDocument"] == "Y").sum())
    non_fraudulent_count = total_documents - fraudulent_count
    fraudulent_pct = round(fraudulent_count / total_documents * 100, 2)
    non_fraudulent_pct = round(non_fraudulent_count / total_documents * 100, 2)
    print("Total Fraudulent documents:", fraudulent_pct, "%")
    print("Total Non-Fraudulent documents:", non_fraudulent_pct, "%")


main(parser_output_folder_name, processor_output_dir)

### 5. Output
After execution, the script creates a CSV file (`output.csv`) that lists each file name together with the fraud signals detected in that document. It also prints the percentage of fraudulent and non-fraudulent documents.
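To make the classification rule concrete, the sketch below applies it to made-up data without calling Document AI. A document is flagged `Y` as soon as one signal has a value other than `PASS`. The helper names, signal names, and the `NOT_PASS` value are placeholders for illustration, not the processor's actual entity types or values.

```python
def classify(signals):
    """Return ("Y" or "N", joined description) for a list of (type, value) pairs."""
    failed = [f"{kind} : {value}" for kind, value in signals if value != "PASS"]
    return ("Y" if failed else "N"), ", ".join(failed)


def percentages(flags):
    """Return (fraudulent %, non-fraudulent %) for a list of "Y"/"N" flags."""
    if not flags:
        return 0.0, 0.0
    fraud = flags.count("Y")
    return (
        round(fraud / len(flags) * 100, 2),
        round((len(flags) - fraud) / len(flags) * 100, 2),
    )


# Made-up documents and signals, used only to exercise the logic.
documents = {
    "doc_a": [("signal_one", "PASS"), ("signal_two", "PASS")],
    "doc_b": [("signal_one", "PASS"), ("signal_two", "NOT_PASS")],
    "doc_c": [("signal_one", "NOT_PASS"), ("signal_two", "NOT_PASS")],
}

results = {name: classify(signals) for name, signals in documents.items()}
for name, (flag, detected) in results.items():
    print(name, flag, detected)

print(percentages([flag for flag, _ in results.values()]))
```

With these three sample documents, two are flagged as fraudulent, so the percentages come out as 66.67 and 33.33.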
