# OCR Flow

* Author: docai-incubator@google.com

## Disclaimer

This tool is not supported by the Google engineering team or product team. It is provided and supported on a best-effort basis by the **DocAI Incubator Team**. No guarantees of performance are implied.

## Objective

This tool is written in Python. It converts tables in PDF files to CSV files and stores them in a GCS bucket. The input PDF files are not enhanced beforehand; the tables are recovered by parsing the Form Parser and CDE results.

# Pre-requisites

This tool requires the following services:

 * Vertex AI Notebook instance
 * Access to Document AI CDE & Form Parser Processor
 * GCS Bucket for storage purpose
 
A Jupyter notebook on Vertex AI is used to run the Python notebook file. A Cloud Storage bucket is needed to hold the input files passed to this script and to store the results.

CDE for headers: create a Custom Document Extractor (CDE) processor and configure Human-in-the-Loop (HITL) review for poorly performing documents. Train the CDE for your use-case table by annotating its **row headers** and **column headers**.
* Input for this step is a GCS bucket containing PDF files (each containing only your use-case tables); run `batch_process_documents` on it.
* Output JSON files are stored in a GCS bucket.

Sample image after training the CDE processor on row headers and column headers:
<table>
  <tr>
      <td><b>CDE Sample</b></td>
    <td><img src="./Images/cde_train_sample.png" width=500 height=200></td>
  </tr>
</table> 
Here are the row headers and column headers we used while training the CDE for our use-case table.

The **column headers** are ["SCC", "DNSH", "DNSH_P", "code", "business_measure", "DNSH_BE", "DNSH_CCA", "DNSH_CCM", "DNSH_CE", "DNSH_WMR", "min_safeguards", "proportion_of_bm", "SCC_BE", "SCC_CCA", "SCC_CCM", "SCC_CE", "SCC_P", "SCC_WMR"] and the **row headers** are ["taxonomy_disclosure", "activity"].
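
The label names above can be kept as constants so that later code can tell the two kinds of header apart. This is a minimal sketch; `classify_label` is a hypothetical helper and is not part of `tool_helper_functions.py`.

```python
# Label names copied from the lists above.
COLUMN_HEADERS = {
    "SCC", "DNSH", "DNSH_P", "code", "business_measure", "DNSH_BE",
    "DNSH_CCA", "DNSH_CCM", "DNSH_CE", "DNSH_WMR", "min_safeguards",
    "proportion_of_bm", "SCC_BE", "SCC_CCA", "SCC_CCM", "SCC_CE",
    "SCC_P", "SCC_WMR",
}
ROW_HEADERS = {"taxonomy_disclosure", "activity"}


def classify_label(label: str) -> str:
    """Return "column", "row", or "unknown" for a CDE label name (hypothetical helper)."""
    if label in COLUMN_HEADERS:
        return "column"
    if label in ROW_HEADERS:
        return "row"
    return "unknown"
```

A check like this can catch misspelled labels in a trained schema before any batch job is started.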

# Script

# 1. Import Modules/Packages

**Note**: Download the **tool_helper_functions.py** Python file and place it where this notebook can import it before proceeding.

In [None]:
import datetime
from tool_helper_functions import (
    batch_process_documents,
    get_processor_metadata,
    poll_hitl_operations,
    walk_the_ocr,
)

# 2. Input Details : Configure below Input variables

* **project_id**: GCP project ID
* **project_num**: GCP project number
* **location**: Processor location, `us` or `eu`
* **cde_processor_id**: ID of the CDE processor used for batch processing
* **gcs_input_uri**: GCS folder containing the input PDF files (files with only the use-case tables)
* **input_mime_type**: MIME type of the input files, which is `application/pdf` here
* **gcs_output_bucket_uri**: GCS output bucket URI without a trailing slash
* **gcs_cde_output_uri_prefix**: GCS output folder path for CDE results
* **gcs_fp_output_uri_prefix**: GCS output folder path for FP results
* **gcs_cde_fp_output_uri_prefix**: GCS prefix for the final OCR walk results
* **field_mask**: Limits the stored output to specific keys of the Document proto (for example `entities,pages.pageNumber`)
* **timeout**: How long to wait for the batch process long-running operation (LRO) to complete
* **flow**: For this notebook the flow is `ocr_walk`
* **fp_processor_id**: ID of the Form Parser (FP) processor used for batch processing
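
Several of these values have format constraints (location, trailing slash, MIME type) that are easy to get wrong. The sketch below checks them before any batch job is started. `validate_inputs` and `join_gcs_uri` are hypothetical helpers, and the idea that the bucket URI and a prefix are joined with a single slash is an assumption about how the helper functions combine them.

```python
def validate_inputs(location: str, gcs_output_bucket_uri: str, input_mime_type: str) -> list:
    """Return a list of problems found; an empty list means the values look fine."""
    problems = []
    if location not in ("us", "eu"):
        problems.append(f"location must be 'us' or 'eu', got {location!r}")
    if not gcs_output_bucket_uri.startswith("gs://"):
        problems.append("gcs_output_bucket_uri must start with 'gs://'")
    if gcs_output_bucket_uri.endswith("/"):
        problems.append("gcs_output_bucket_uri must not end with a slash")
    if input_mime_type != "application/pdf":
        problems.append(f"input_mime_type should be 'application/pdf', got {input_mime_type!r}")
    return problems


def join_gcs_uri(bucket_uri: str, prefix: str) -> str:
    """Join a bucket URI and an object prefix with exactly one slash (assumed convention)."""
    return f"{bucket_uri.rstrip('/')}/{prefix.lstrip('/')}"


for problem in validate_inputs("us", "gs://bucket_name", "application/pdf"):
    print("Input problem:", problem)
```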

In [None]:
datetime_suffix = "{date:%Y-%m-%d_%H:%M:%S}".format(date=datetime.datetime.now())
project_id = "<your-project-id>"
project_num = "<your-project-number>"
location = "<processor-location>"  # us or eu
cde_processor_id = "<cde-processor-id>"
gcs_input_uri = f"gs://bucket_name/prefix/to_input/{datetime_suffix}"
input_mime_type = "<mime-type-of-input-file>"  # "application/pdf"
gcs_output_bucket_uri = "gs://bucket_name"
gcs_cde_output_uri_prefix = f"cde_output/prefix/{datetime_suffix}"
gcs_fp_output_uri_prefix = f"fp_output/prefix/{datetime_suffix}"
gcs_cde_fp_output_uri_prefix = f"cde_fp_output/prefix/{datetime_suffix}"
field_mask = None
timeout = 5000
flow = "ocr_walk"
fp_processor_id = "<fp-processor-id>"

# 3. Run below code

Call `batch_process_documents` with the CDE processor to process all files in the input folder (each file contains only the use-case table). It returns the metadata and the operation ID of the batch process, which is a long-running operation (LRO).

In [None]:
cde_metadata, cde_operation = batch_process_documents(
    project_id,
    location,
    cde_processor_id,
    gcs_input_uri,
    input_mime_type,
    gcs_output_bucket_uri,
    gcs_cde_output_uri_prefix,
    field_mask,
    timeout,
)
print("CDE batch process completed")
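
For orientation, Document AI addresses a processor by a resource name built from the project, location, and processor ID, and requests go to a location-specific endpoint. The sketch below only assembles those two strings. How `batch_process_documents` uses them internally is not shown in this notebook, so treat this as background rather than as the helper's implementation; both function names are hypothetical.

```python
def processor_resource_name(project_id: str, location: str, processor_id: str) -> str:
    """Build the resource name that identifies a Document AI processor."""
    return f"projects/{project_id}/locations/{location}/processors/{processor_id}"


def regional_endpoint(location: str) -> str:
    """Build the API endpoint host for a processor location such as 'us' or 'eu'."""
    return f"{location}-documentai.googleapis.com"


print(processor_resource_name("my-project", "eu", "abc123"))
print(regional_endpoint("eu"))
```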

Next, use `get_processor_metadata` from the `tool_helper_functions` module. It takes the batch process metadata as input and returns a mapping from each filename to its output prefix and its HITL operation ID (if the input file triggered HITL).

In [None]:
cde_input_output_map = get_processor_metadata(cde_metadata)
# cde_input_output_map variable data as below
# {'03_Non-Financial_Corporate_Report_2022_extracted.pdf': {'cde': 'msci/TESTING/test_cde_output/2023-11-03_05:45:35/4236894205843634293/0', 'hitl': '12795457638097959002'}, '1962771_extracted.pdf': {'cde': 'msci/TESTING/test_cde_output/2023-11-03_05:45:35/4236894205843634293/1', 'hitl': '11860520012484438543'}, '2022_VGT_Group Annual Report_extracted.pdf': {'cde': 'msci/TESTING/test_cde_output/2023-11-03_05:45:35/4236894205843634293/5', 'hitl': '2523802694474965110'}, 'DE0007030009-JA-2022-EQ-E-00-pg144_extracted.pdf': {'cde': 'msci/TESTING/test_cde_output/2023-11-03_05:45:35/4236894205843634293/3', 'hitl': '14342450698739476592'}, 'DE0007030009-JA-2022-EQ-E-00_extracted.pdf': {'cde': 'msci/TESTING/test_cde_output/2023-11-03_05:45:35/4236894205843634293/4', 'hitl': '17242897657994716395'}, 'DE000STRA555-JA-2022-EQ-E-00_extracted.pdf': {'cde': 'msci/TESTING/test_cde_output/2023-11-03_05:45:35/4236894205843634293/2', 'hitl': '2909143051612169782'}}
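
A map of this shape can be filtered to see which files are waiting on HITL review. The sketch below uses one entry from the sample above plus one hypothetical entry; that a file which did not trigger HITL simply lacks the `hitl` key (or has an empty value) is an assumption, since the sample only shows files that did trigger it.

```python
sample_map = {
    "1962771_extracted.pdf": {
        "cde": "msci/TESTING/test_cde_output/2023-11-03_05:45:35/4236894205843634293/1",
        "hitl": "11860520012484438543",
    },
    # Hypothetical entry: a file that did not trigger HITL (assumed shape).
    "no_hitl_example.pdf": {"cde": "example/prefix/0"},
}


def files_with_hitl(input_output_map: dict) -> dict:
    """Map each filename to its HITL operation ID, skipping files without one."""
    return {
        filename: info["hitl"]
        for filename, info in input_output_map.items()
        if info.get("hitl")
    }


pending = files_with_hitl(sample_map)
print(pending)
```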

`poll_hitl_operations` waits until the HITL operations triggered by the documents are resolved.

In [None]:
poll_hitl_operations(project_num, location, cde_input_output_map)
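
A waiting function of this kind is usually a poll-with-timeout loop. The following generic sketch shows the pattern; it is not the implementation of `poll_hitl_operations`, and `wait_until_done`, `is_done`, and the default intervals are all hypothetical.

```python
import time
from typing import Callable


def wait_until_done(
    is_done: Callable[[], bool],
    poll_interval_s: float = 30.0,
    timeout_s: float = 3600.0,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Call is_done() repeatedly until it returns True or timeout_s elapses.

    Returns True if the operation finished, False if the wait timed out.
    """
    deadline = time.monotonic() + timeout_s
    while True:
        if is_done():
            return True
        if time.monotonic() >= deadline:
            return False
        sleep(poll_interval_s)
```

In practice `is_done` would wrap whatever status check the HITL operation exposes. The `sleep` parameter is injectable so the loop can be tested without real delays.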

Call `batch_process_documents` again, this time with the Form Parser (FP) processor, to process the same input folder. It returns the metadata and the operation ID of the batch process (LRO).

In [None]:
# NOTE: `fp_processor_v2` is not defined in section 2; define it before running this cell.
fp_metadata, fp_operation = batch_process_documents(
    project_id,
    location,
    fp_processor_id,
    gcs_input_uri,
    input_mime_type,
    gcs_output_bucket_uri,
    gcs_fp_output_uri_prefix,
    field_mask,
    timeout,
    fp_processor_v2,
)
print("FP batch process completed")

Use `get_processor_metadata` again, this time with `fp=True`. It takes the FP batch process metadata as input and returns a mapping from each filename to its output prefix.

In [None]:
fp_input_output_map = get_processor_metadata(fp_metadata, fp=True)
# fp_input_output_map sample as below
# {'03_Non-Financial_Corporate_Report_2022_extracted.pdf': 'msci/TESTING/fp_output/2023-11-02_18:25:31/10273358736471385291/0', '1962771_extracted.pdf': 'msci/TESTING/fp_output/2023-11-02_18:25:31/10273358736471385291/1','2022_VGT_Group Annual Report_extracted.pdf': 'msci/TESTING/fp_output/2023-11-02_18:25:31/10273358736471385291/4','DE0007030009-JA-2022-EQ-E-00-pg144_extracted.pdf': 'msci/TESTING/fp_output/2023-11-02_18:25:31/10273358736471385291/5','DE0007030009-JA-2022-EQ-E-00_extracted.pdf': 'msci/TESTING/fp_output/2023-11-02_18:25:31/10273358736471385291/2','DE000STRA555-JA-2022-EQ-E-00_extracted.pdf': 'msci/TESTING/fp_output/2023-11-02_18:25:31/10273358736471385291/3'}
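
Because both maps are keyed by filename, the CDE and FP outputs for a document can be paired with a simple lookup. The sketch below uses one entry from each sample above plus one hypothetical CDE-only entry. `pair_outputs` and `files_missing_an_output` are hypothetical helpers, not functions from `tool_helper_functions.py`.

```python
cde_map = {
    "1962771_extracted.pdf": {
        "cde": "msci/TESTING/test_cde_output/2023-11-03_05:45:35/4236894205843634293/1",
        "hitl": "11860520012484438543",
    },
    # Hypothetical entry that has no Form Parser counterpart.
    "cde_only_example.pdf": {"cde": "example/prefix/0"},
}
fp_map = {
    "1962771_extracted.pdf": "msci/TESTING/fp_output/2023-11-02_18:25:31/10273358736471385291/1",
}


def pair_outputs(cde_map: dict, fp_map: dict) -> dict:
    """Return {filename: (cde_prefix, fp_prefix)} for files present in both maps."""
    return {
        filename: (cde_map[filename]["cde"], fp_map[filename])
        for filename in cde_map
        if filename in fp_map
    }


def files_missing_an_output(cde_map: dict, fp_map: dict) -> set:
    """Return filenames that appear in only one of the two maps."""
    return set(cde_map) ^ set(fp_map)


paired = pair_outputs(cde_map, fp_map)
unpaired = files_missing_an_output(cde_map, fp_map)
print(paired)
print(unpaired)
```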

`poll_hitl_operations` waits until the HITL operations triggered by the documents are resolved; here it is called again with `cde_input_output_map`.

In [None]:
poll_hitl_operations(project_num, location, cde_input_output_map)

`walk_the_ocr` parses the CDE and FP JSON output to produce the final output for both row headers and column headers.

In [None]:
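# NOTE: `gcs_output_bucket`, `gcs_cde_hitl_output_prefix`, `gcs_output_uri_prefix`, and `offset`
# are not defined in section 2; define them before running this cell.
walk_the_ocr(
    project_id,
    location,
    cde_input_output_map,
    gcs_output_bucket,
    gcs_cde_hitl_output_prefix,
    fp_input_output_map,
    f"{gcs_output_uri_prefix}/{flow}/{datetime_suffix}",
    offset,
)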
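
The final results are CSV tables. As an illustration of that last step only, the sketch below serializes rows to CSV text with Python's standard `csv` module. It does not reproduce the parsing done by `walk_the_ocr` or the upload to GCS. `rows_to_csv` is a hypothetical helper and the row values are made up; only the column names come from the labels used in this notebook.

```python
import csv
import io


def rows_to_csv(column_names: list, rows: list) -> str:
    """Serialize a list of row dicts to CSV text, one column per name."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=column_names, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


# Hypothetical values for illustration.
csv_text = rows_to_csv(
    ["activity", "code", "SCC_CCM"],
    [{"activity": "Example activity", "code": "1.1", "SCC_CCM": "Y"}],
)
print(csv_text)
```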

# Output Samples

Sample table from a PDF file:
![](./Images/ocr_walk_input_sample.png)

Output sample for one table, stored as CSV files in the GCS bucket:
![](./Images/ocr_walk_output_sample.png)