# CDC Document Type Entity Addition

* Author: docai-incubator@google.com

## Disclaimer

This tool is not supported by the Google engineering team or product team. It is provided and supported on a best-effort basis by the **DocAI Incubator Team**. No guarantees of performance are implied.

# Objective

Users of the Custom Document Classifier (CDC) often have document files already organized by class in separate folders and want to import them into the CDC processor dataset as annotated documents, rather than labeling each one in the processor console. This document is a guide to a tool that does that.

The tool runs an OCR processor over the files in the provided GCS input directory and saves the resulting JSON files to the specified GCS output directory. Each saved JSON file contains an added document `type` entity whose value is determined by the folder the file came from, since the documents are already separated by folder.
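
As a minimal sketch of the labeling step described above, the snippet below adds a `type` entity to a plain dictionary shaped like a Document JSON. The helper name `add_type_entity` and the sample dictionary are illustrative assumptions, not part of the tool, which works on `documentai.Document` objects instead.

```python
import json


def add_type_entity(document: dict, doc_type: str) -> dict:
    """Return a copy of a Document-like dict with one extra entity of the given type."""
    updated = dict(document)
    # Copy the entity list so the caller's dictionary is left untouched.
    entities = list(updated.get("entities", []))
    entities.append({"type": doc_type})
    updated["entities"] = entities
    return updated


# Hypothetical, heavily trimmed processor output for one file.
ocr_output = {"text": "Invoice 001", "pages": [{"pageNumber": 1}]}
labeled = add_type_entity(ocr_output, "invoice")
print(json.dumps(labeled, indent=2))
```

The original dictionary is not modified, so the same processor output could be relabeled with a different type if a file was placed in the wrong folder.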


# Prerequisites
* Vertex AI Notebook
* CDC Processor ID
* GCS Folder Path

### Organize the folders 

* Follow this folder structure convention:
<pre>
Parent Folder
    |---> folder for doc_type1
    |---> folder for doc_type2
    |---> folder for doc_type3
</pre>
#### Example
<img src="./images/sample.png">
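
To check that a bucket follows this layout before running the tool, object names can be grouped by their first-level sub-folder. The sketch below works on a plain list of object names, so it runs without GCS access. The function name `group_by_subfolder` and the sample names are hypothetical.

```python
from collections import defaultdict


def group_by_subfolder(blob_names, parent_prefix):
    """Group object names by the first-level folder under parent_prefix."""
    prefix = parent_prefix.strip("/")
    prefix = f"{prefix}/" if prefix else ""
    groups = defaultdict(list)
    for name in blob_names:
        # Skip objects outside the parent folder and folder placeholder objects.
        if not name.startswith(prefix) or name.endswith("/"):
            continue
        folder, sep, file_name = name[len(prefix):].partition("/")
        # Files sitting directly in the parent folder have no class folder.
        if sep and file_name:
            groups[folder].append(file_name)
    return dict(groups)


# Hypothetical object names, as a bucket listing might return them.
names = [
    "input/",
    "input/banks/statement_01.pdf",
    "input/banks/statement_02.pdf",
    "input/invoice/inv_01.pdf",
    "input/readme.txt",
]
print(group_by_subfolder(names, "input"))
```

The keys of the returned dictionary are the folder names that the document-type mapping is expected to cover.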


# Step by Step Procedure

## 1. Import Modules/Packages

In [None]:
!pip install google-cloud-documentai -q
!pip install google-cloud-storage -q

In [None]:
# Run this cell to download utilities module
!wget https://raw.githubusercontent.com/GoogleCloudPlatform/document-ai-samples/main/incubator-tools/best-practices/utilities/utilities.py

In [None]:
from google.cloud import documentai_v1beta3 as documentai
from google.cloud import storage

from utilities import file_names, process_document_sample, store_document_as_json

## 2. Input Details

* **PROJECT_ID**: GCP project ID
* **LOCATION**: Location of the processor (`us` or `eu`)
* **PROCESSOR_ID**: ID of the Custom Document Classifier (CDC) processor
* **GCS_INPUT_PATH**: GCS path of the parent folder whose sub-folders contain the input files. Follow the folder structure described earlier.
* **GCS_OUTPUT_PATH**: GCS path where the output JSON files are saved
* **DOCUMENT_TYPE_DICT**: Dictionary mapping each folder name to the type of the documents in that folder (`{"Folder_name1": "Doc_type1", "Folder_name2": "Doc_type2"}`)
    * Example: `{"Folder1": "Invoice", "Folder2": "Bank_statements"}`
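
Both GCS paths are expected in `gs://bucket/prefix` form. As a rough sketch, a small helper can split such a path into its bucket name and object prefix and reject malformed values early. The name `split_gcs_path` is an assumption for illustration and does not come from the utilities module.

```python
def split_gcs_path(gcs_path):
    """Split 'gs://bucket/some/prefix' into ('bucket', 'some/prefix')."""
    scheme = "gs://"
    if not gcs_path.startswith(scheme):
        raise ValueError(f"Expected a path starting with {scheme!r}: {gcs_path!r}")
    bucket, _, prefix = gcs_path[len(scheme):].partition("/")
    if not bucket:
        raise ValueError(f"Missing bucket name in {gcs_path!r}")
    # Trailing slashes are dropped so prefixes can be joined consistently.
    return bucket, prefix.strip("/")


bucket_name, prefix = split_gcs_path(
    "gs://BUCKET_NAME/cdc_document_type_entity_addition/input/"
)
print(bucket_name)  # BUCKET_NAME
print(prefix)       # cdc_document_type_entity_addition/input
```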


In [None]:
PROJECT_ID = "xx-xx-xx"
LOCATION = "us"
PROCESSOR_ID = "xx-xx-xx-xx"  # CDC Processor ID
GCS_INPUT_PATH = "gs://BUCKET_NAME/cdc_document_type_entity_addition/input"
GCS_OUTPUT_PATH = "gs://BUCKET_NAME/cdc_document_type_entity_addition/output"
DOCUMENT_TYPE_DICT = {"banks": "bank_statement", "invoice": "invoice"}

## 3. Run the Code Cells Below

In [None]:
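# Split the gs:// paths into bucket names and the output folder prefix
input_bucket = GCS_INPUT_PATH.split("/")[2]
splits = GCS_OUTPUT_PATH.rstrip("/").split("/")
output_bucket = splits[2]
OUTPUT_FOLDER_PATH = "/".join(splits[3:])
# Client and bucket handle used to download the input files
storage_client = storage.Client()
bucket = storage_client.get_bucket(input_bucket)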
for folder_name, entity_type in DOCUMENT_TYPE_DICT.items():
    print(f"Process started for folder - {folder_name}")
    folder_path = f"{GCS_INPUT_PATH.strip('/')}/{folder_name}"
    output_fp = f"{OUTPUT_FOLDER_PATH}/{folder_name}"
    files_list, files_dict = file_names(folder_path)
    # Label documents with the type mapped to this folder in DOCUMENT_TYPE_DICT
    entity = documentai.Document.Entity(type_=entity_type)
    for fn, fp in files_dict.items():
        uri = f"gs://{input_bucket}/{fp}"
        print(f"\tProcessing {fn}")
        pdf_bytes = bucket.blob(fp).download_as_bytes()
        try:
            doc = process_document_sample(
                PROJECT_ID, LOCATION, PROCESSOR_ID, pdf_bytes, ""
            ).document
        except Exception as e:
            print(f"Unable to process file {fn} - {uri} because of {e}")
            continue
        doc.entities.append(entity)
        json_data = documentai.Document.to_json(
            doc, including_default_value_fields=False
        )
        fn = fn.rsplit(".", 1)[0]  # drop only the final extension
        file_name = f"{output_fp}/{fn}.json"
        store_document_as_json(json_data, output_bucket, file_name)
        print(f"\t\tSaved JSON-data to gs://{output_bucket}/{file_name}")
    print(f"Process Completed for all files in {folder_name} folder.")
print("Process Completed for All Folders!!!")
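
The output object name for each file is built from the output prefix, the folder name, and the input file name with its extension swapped for `.json`. The sketch below shows one way to derive it that keeps dots inside the file name intact. The helper name `output_object_name` and the sample values are hypothetical.

```python
import posixpath


def output_object_name(output_prefix, folder_name, file_name):
    """Build '<output_prefix>/<folder_name>/<stem>.json' for one input file."""
    # splitext removes only the last extension, so 'a.b.pdf' keeps the stem 'a.b'.
    stem, _extension = posixpath.splitext(file_name)
    return posixpath.join(output_prefix.strip("/"), folder_name, f"{stem}.json")


print(output_object_name("cdc/output", "banks", "statement.2023.pdf"))
# cdc/output/banks/statement.2023.json
print(output_object_name("", "invoice", "inv_01.pdf"))
# invoice/inv_01.json
```

Keeping the full stem means two inputs such as `statement.2023.pdf` and `statement.2024.pdf` map to different JSON objects instead of overwriting each other.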