# DocAI Reference Architecture - Feedback Improvement Workflow

* Author: docai-incubator@google.com

## Objective

This tool facilitates the following:

* <b>Batch Processing and Categorization</b>: It calls a batch process (requiring a minimum of 20 files) and categorizes files into "human review" and "bypass human review" folders based on their entity confidence score. For audit purposes, X% of bypassed files are copied to the human review folder, and all files requiring human review are then imported into the processor.
* <b>Metadata Storage and New File Identification</b>: All metadata regarding processed files is stored in a GCS path. This ensures that in subsequent runs, the script only selects new, unprocessed files and waits until the input path contains at least 20 new files.
* <b>Human-in-the-Loop (HITL) Review</b>: Files marked for HITL review are processed by human reviewers who correct labels after importing the documents into a processor.
* <b>Post-Review Analysis and Data Management</b>: After human review, Workflow 2 runs a pre-HITL versus post-HITL analysis, logs the analysis report to a BigQuery table, and copies the reviewed files to a feedback folder. It also backs up the processor's existing dataset to a backup folder before importing all files from the feedback folder into the processor as a new dataset.
* <b>Dataset Validation and Model Training</b>: The dataset is validated against training criteria (minimum document and label counts), which then triggers the training of a new processor version.
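
The new-file identification described above can be sketched as follows. This is a minimal illustration and not the tool's actual code: the function name `select_new_files` and the in-memory lists are assumptions, and the real script reads its metadata from a GCS path. Returning an empty list stands in for "wait until more files arrive".

```python
MIN_BATCH_FILES = 20  # minimum number of new files stated in this document


def select_new_files(input_files, processed_files, min_files=MIN_BATCH_FILES):
    """Return unprocessed files, or an empty list if there are too few to start.

    input_files: paths currently in the input location.
    processed_files: paths recorded in the metadata from earlier runs.
    """
    processed = set(processed_files)
    new_files = sorted(path for path in input_files if path not in processed)
    if len(new_files) < min_files:
        # Not enough new files yet; the caller should wait and retry later.
        return []
    return new_files
```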

## Pre-requisites

- GCP Project ID with billing set up
- Document AI processor IDs
- Cloud Storage (GCS)
- BigQuery
- Cloud Functions
- Workflows
- Service Account with the following permissions:
     * BigQuery Admin
     * Cloud Run Invoker
     * Storage Object User
     * Workflows Invoker
     * Document AI Editor


## Public Documentation Reference 

- [Create Service Account](https://cloud.google.com/iam/docs/service-accounts-create#creating)

## Overview of the workflows

- There are two workflows. Workflow 1 identifies the documents that require human review (HITL). Workflow 2 performs the HITL analysis and, once the dataset is validated, triggers training of a new processor version.


## Workflow-1

<b>1. Create BigQuery Dataset and Table</b>

- The workflow first validates whether the specified dataset exists in BigQuery.
  - If the dataset does not exist, the workflow creates a new dataset and a table with the defined schema.
  - If the dataset exists, the workflow checks if the specified table is present within the dataset.
    - If the table does not exist, the workflow creates the table with the provided schema.
    - If the table exists, the schema is updated as required.
- The workflow then lists the files in a Google Cloud Storage (GCS) bucket and stores their metadata in the BigQuery table.

<b>2. Split Batches</b>

- The workflow checks if any data is present in the temporary folder in GCS. If data is found, it deletes the existing files in the folder.
- It then splits the input data into smaller batches based on the specified batch size and copies these batches to the temporary folder.
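
The batch split in this step can be sketched as below. It is a minimal illustration under stated assumptions: the helper names and the `batch_<n>` folder naming are hypothetical, and the functions only plan the copy rather than calling Cloud Storage.

```python
def split_into_batches(file_paths, batch_size):
    """Split a list of file paths into consecutive batches of at most batch_size."""
    if batch_size <= 0:
        raise ValueError("batch_size must be a positive integer")
    return [
        file_paths[start:start + batch_size]
        for start in range(0, len(file_paths), batch_size)
    ]


def temp_destinations(batches, gcs_temp_path):
    """Map each source path to a per-batch folder under the temporary path.

    The batch_<n> folder naming is an assumption made for this sketch.
    """
    base = gcs_temp_path.rstrip("/")
    plan = {}
    for index, batch in enumerate(batches):
        for path in batch:
            file_name = path.rsplit("/", 1)[-1]
            plan[path] = f"{base}/batch_{index}/{file_name}"
    return plan
```

For example, seven input files with `batch_size = 3` produce batches of 3, 3, and 1 files.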

<b>3. Concurrent Processing</b>

- The workflow retrieves data from the GCS temporary folder and performs batch processing on these batches concurrently.
- Once the batch processing is complete, the workflow updates the BigQuery table with metadata about the processed files.

<b>4. HITL (Human-in-the-Loop) Confidence Criteria</b>

- The workflow retrieves metadata from the concurrent processing step and validates the confidence levels of all entities in each file.
- If the confidence levels do not meet the required threshold, the workflow copies those files to a designated "HITL Review" folder in GCS for manual review.
- For files that meet the confidence criteria, a configurable percentage (x%) is randomly selected for quality assurance and copied to the "HITL Review" folder.
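
The confidence check and audit sampling above can be sketched as follows. This is an illustration rather than the deployed cloud function: the simplified entity dictionaries (`type`, `confidence`), the function names, and the round-up of the sample size are assumptions.

```python
import math
import random


def needs_review(entities, confidence_threshold, critical_entities=()):
    """Return True if any checked entity falls below the threshold.

    An empty critical_entities means every entity is checked.
    """
    critical = set(critical_entities)
    for entity in entities:
        if critical and entity["type"] not in critical:
            continue
        if entity["confidence"] < confidence_threshold:
            return True
    return False


def route_documents(documents, confidence_threshold, test_files_percentage,
                    critical_entities=(), seed=None):
    """Split documents into review and bypass lists and pick an audit sample.

    documents: dict mapping file name to a list of entity dicts.
    """
    review, bypass = [], []
    for name, entities in documents.items():
        if needs_review(entities, confidence_threshold, critical_entities):
            review.append(name)
        else:
            bypass.append(name)
    # Rounding up is an assumption; the source only says "x%".
    sample_size = min(len(bypass),
                      math.ceil(len(bypass) * test_files_percentage / 100))
    audit_sample = random.Random(seed).sample(bypass, sample_size)
    return {"review": review, "bypass": bypass, "audit_sample": audit_sample}
```

Files in `review` and `audit_sample` would both be copied to the "HITL Review" folder.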


## The flowchart below highlights the steps involved in Workflow 1
<img src="./Images/Workflow-1.png" width=600 height=800 alt="workflow-1"></img>

## Workflow-2

<b>1. Export HITL Reviewed Dataset</b>

- The workflow deletes any existing dataset in the <b>post_HITL_output_URI</b> folder.
- After manual verification in the HITL Reviewed processor, the workflow exports the reviewed dataset to the specified location.

<b>2. HITL Analysis</b>

- Two temporary GCS buckets are created to store pre-HITL and post-HITL JSON files.
- The workflow compares data from these buckets based on file names, evaluating entity names, mention text, and bounding boxes.
- Based on this comparison, the workflow updates a DataFrame with the analyzed results.
- The updated DataFrame is then loaded into a BigQuery table for further analysis and reporting.
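
The comparison in this step can be sketched in plain Python as below. It is a simplified illustration, not the workflow's code: the flat entity dictionaries (`type`, `mention_text`, `bbox`), the status labels, and the assumption of one entity per type in each file are all hypothetical. The real workflow reads Document AI JSON and builds a DataFrame.

```python
def compare_entities(file_name, pre_entities, post_entities):
    """Compare pre-HITL and post-HITL entities for one file, keyed by entity type."""
    pre = {entity["type"]: entity for entity in pre_entities}
    post = {entity["type"]: entity for entity in post_entities}
    rows = []
    for entity_type in sorted(set(pre) | set(post)):
        before, after = pre.get(entity_type), post.get(entity_type)
        if before is None:
            status = "added_in_review"
        elif after is None:
            status = "removed_in_review"
        elif before["mention_text"] != after["mention_text"]:
            status = "text_changed"
        elif before["bbox"] != after["bbox"]:
            status = "bbox_changed"
        else:
            status = "unchanged"
        rows.append({"file_name": file_name, "entity_type": entity_type,
                     "status": status})
    return rows


def analyze(pre_docs, post_docs):
    """Compare files that appear under the same name in both locations."""
    rows = []
    for file_name in sorted(set(pre_docs) & set(post_docs)):
        rows.extend(compare_entities(file_name, pre_docs[file_name],
                                     post_docs[file_name]))
    return rows
```

The resulting list of row dictionaries has the shape that could be passed to a DataFrame constructor before loading into BigQuery.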

<b>3. Dataset Export and Import</b>

- A temporary folder with a timestamped name is created in the gcs_backup_uri location. The dataset from the training processor is exported into this folder for backup purposes.
- The verified and exported dataset from the post_HITL_output_URI is then imported into the training processor to prepare for model training.
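
The timestamped backup folder can be sketched as below. The `backup_` prefix and the UTC timestamp format are assumptions made for illustration; the source only says that the folder name is timestamped.

```python
from datetime import datetime, timezone


def backup_folder_uri(gcs_backup_uri, now=None):
    """Build a timestamped folder URI under gcs_backup_uri.

    The 'backup_' prefix and '%Y%m%d_%H%M%S' format are hypothetical.
    """
    now = now or datetime.now(timezone.utc)
    stamp = now.strftime("%Y%m%d_%H%M%S")
    return f"{gcs_backup_uri.rstrip('/')}/backup_{stamp}/"
```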

<b>4. Trigger Training</b>

- The workflow retrieves labeling statistics from the training processor.
- The labeling stats are validated to ensure they meet the required thresholds.
- If the stats are valid, the processor proceeds to train using the train and test datasets.
- If the stats are not valid, the workflow halts the training process and logs the reasons for failure, providing detailed information for troubleshooting.
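
The validation gate above can be sketched as follows. The shape of the `stats` dictionary and the default thresholds are placeholders chosen for this example and are not Document AI's actual training requirements; substitute the limits that apply to your processor type.

```python
def validate_label_stats(stats, min_docs_per_split=10, min_instances_per_label=10):
    """Return a list of failure reasons; an empty list means training can start.

    stats: {"train": {"document_count": int, "label_counts": {label: int}},
            "test": {...}}  (hypothetical shape for this sketch)
    """
    reasons = []
    for split in ("train", "test"):
        split_stats = stats.get(split, {})
        doc_count = split_stats.get("document_count", 0)
        if doc_count < min_docs_per_split:
            reasons.append(
                f"{split}: {doc_count} documents, need at least {min_docs_per_split}")
        for label, count in sorted(split_stats.get("label_counts", {}).items()):
            if count < min_instances_per_label:
                reasons.append(
                    f"{split}: label '{label}' has {count} instances, "
                    f"need at least {min_instances_per_label}")
    return reasons
```

A caller would trigger training only when the returned list is empty, and log the reasons otherwise.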

## The flowchart below highlights the steps involved in Workflow 2
<img src="./Images/workflow-2.png" width=600 height=800 alt="workflow-2"></img>

## The flowchart below shows the custom retry predicate
<img src="./Images/retry-mechanism.png" width=600 height=800 alt="retry-mechanism"></img>

## Deployment

The structure of the package is as follows:

<img src="./Images/zip-folder-details.png" width=600 height=800 alt="folder-structure"></img>

- The folders contain Python scripts for the individual cloud functions.
- `input.sh` contains the input parameters required for the workflow; the user needs to provide these parameters.
- `setup.sh` creates and deploys the cloud functions and the workflow, and then triggers the workflow.
- `workflow_<1/2>.yaml` contains the workflow source code.

## Input parameters

For Workflow 1

- `project_id`: The unique identifier of your Google Cloud project. It is used in various API calls to specify which project to interact with.
- `dataset_id`: The ID of the BigQuery dataset in which tables will be created or queried. The name may contain only alphanumeric characters and underscores.
- `table_id`: The ID of the BigQuery table where data will be inserted. The name may contain only alphanumeric characters and underscores.
- `gcs_input_path`: The Google Cloud Storage (GCS) path to the input files in gs:// format.
- `gcs_temp_path`: The GCS path for storing temporary files. This is where the files are copied into smaller batches for further processing.
- `batch_size`: The number of files to be processed in each batch during the file splitting step. Example: `batch_size = 30`.
- `gcs_output_uri`: The GCS path where the processed output files will be stored after processing. This is the final destination for the results.
- `location`: The regional location of the Google Cloud services. It defines where resources like BigQuery and Document AI processors are located (e.g., us-central1).
- `processor_id`: The ID of the Document AI processor used for processing the documents in batches. It refers to the specific Document AI model used for document extraction.
- `Gcs_HITL_folder_path`: The GCS path for storing files that require Human-In-The-Loop (HITL) feedback or validation. These are files that need further human interaction after initial processing. This path is used as `pre_HITL_output_URI` in Workflow 2.
- `critical_entities`: A list of the entities that are critical in the documents. Only these entities are checked against the HITL criteria. If the list is empty, all entities are checked.
- `confidence_threshold`: The threshold for deciding whether an extraction's confidence is sufficient. If the confidence is below this value, the file is marked for HITL. The value must be between 0 and 1. Example: `confidence_threshold = 0.5`.
- `test_files_percentage`: The percentage of files that passed the confidence criteria to set aside and copy to the HITL review folder for audit. The value must be between 0 and 100. Example: `test_files_percentage = 20`.
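
The constraints stated for these parameters can be checked before deployment with a sketch like the one below. The function `validate_workflow1_params` is hypothetical and is not part of `input.sh` or `setup.sh`; it only encodes the rules written in this list (name format, gs:// paths, and value ranges).

```python
import re

NAME_PATTERN = re.compile(r"^[A-Za-z0-9_]+$")  # alphanumeric plus underscores
GCS_KEYS = ("gcs_input_path", "gcs_temp_path", "gcs_output_uri",
            "Gcs_HITL_folder_path")


def validate_workflow1_params(params):
    """Return a list of human-readable problems; an empty list means all checks pass."""
    errors = []
    for key in ("dataset_id", "table_id"):
        if not NAME_PATTERN.match(str(params.get(key, ""))):
            errors.append(f"{key} must be alphanumeric (underscores allowed)")
    for key in GCS_KEYS:
        if not str(params.get(key, "")).startswith("gs://"):
            errors.append(f"{key} must be a gs:// path")
    batch_size = params.get("batch_size")
    if not isinstance(batch_size, int) or batch_size <= 0:
        errors.append("batch_size must be a positive integer")
    threshold = params.get("confidence_threshold")
    if not isinstance(threshold, (int, float)) or not 0 <= threshold <= 1:
        errors.append("confidence_threshold must be between 0 and 1")
    percentage = params.get("test_files_percentage")
    if not isinstance(percentage, (int, float)) or not 0 <= percentage <= 100:
        errors.append("test_files_percentage must be between 0 and 100")
    return errors
```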


For Workflow 2:

- `project_id`: The Google Cloud project identifier that is used in all the API calls and interactions with Google Cloud services.
- `location`: Location of processor.
- `hitl_processor_id`: The ID of the Human-In-The-Loop (HITL) processor. This is the Document AI processor in which documents are manually reviewed, corrected, and validated.
- `pre_HITL_output_URI`: The GCS URI of the output files that were generated before Human-In-The-Loop (HITL) review. This is used for comparison in the analysis step.
- `post_HITL_output_URI`: The GCS URI of the output files that were generated after Human-In-The-Loop (HITL) review. These documents contain corrections and human validation and are stored in the gcs_dest_uri_reviewed location.
- `dataset_id`: The BigQuery dataset ID used for storing or retrieving data during the workflow. This is the dataset in which analysis results will be updated. The name may contain only alphanumeric characters and underscores.
- `table_id`: The BigQuery table ID inside the specified dataset where the data related to pre-HITL and post-HITL analysis will be inserted. The name may contain only alphanumeric characters and underscores.
- `gcs_backup_uri`: The GCS URI where the dataset is backed up before the training process is initiated. The backup lets you roll the dataset back to its previous version if something goes wrong during training or the new version does not perform satisfactorily.
- `train_processor_id`: The ID of the processor that will be used to train on the reviewed and corrected datasets. This processor is responsible for improving accuracy based on feedback.
- `new_version_name`: The name for the new version of the processor that will be created after training. It helps in versioning the Document AI processor for tracking improvements.

Variables for deployment

- `REGION`: Region in which to deploy the GCP resources, e.g. us-central1.
- `WORKFLOW_NAME`: Workflow name to be used for deployment
- `YAML_FILE`: Workflow configuration file name
- `SERVICE_ACCOUNT`: Service account to be used for creating and deploying GCP resources. Ensure that the service account has the necessary permissions. 

**Note: All GCS paths are in gs:// format**

## Steps for deployment
1. There should be two folders, one each for Workflow 1 and Workflow 2.
2. Navigate to the Workflow_1 folder. Make sure the scripts input.sh and setup.sh have executable permission (for example, `chmod +x input.sh setup.sh`).
3. Update <b>input.sh</b> with the required parameter details.
4. Run <b>./setup.sh</b> in a terminal.
5. The required cloud resources are created, the cloud functions and workflow are deployed, and the workflow is triggered.
6. Once the workflow has run successfully, the files for HITL review can be found at <b>$Gcs_HITL_folder_path</b>.
7. Once the manual review (HITL) is done, repeat steps 2 to 5 for Workflow 2.
8. Once Workflow 2 has completed successfully, training of a new processor version will have been initiated with the reviewed data. Note that the workflow only triggers the training; it does not monitor whether the training succeeds.
9. Logs can be viewed in the Workflows UI as well as the Cloud Functions UI.



## General Troubleshooting
1. Issue: The workflow may sometimes fail when processing a huge dataset or large files within a dataset, and the logs may not show a specific error.
   - Solution: Try increasing the memory and CPU allocated to the cloud function by editing the cloud function details in the UI. At creation time, the cloud functions are allocated 256MB of memory. Based on your load, you can increase or decrease this value.


2. Issue: The service account doesn't have the required permissions.
   - Solution: Ensure the service account you are using has the required permissions to run all components of the workflows successfully.


3. Issue: `ERROR: (gcloud.functions.deploy) ResponseError: status=[400], code=[Ok], message=[One or more users named in the policy do not belong to a permitted customer,  perhaps due to an organization policy.]`
   - Solution: You can ignore this error message. Verify that the cloud function has been created and deployed in the Cloud Run functions UI.

- Drive link for demo: [Recording Demo for Cloud-workflow-improvement](https://drive.google.com/file/d/1x0ghr7vFV_WWU5jxeFagKkIn6NsAb_op/view?usp=drive_link)
