# Form Parser Table Results to CSV

* Author: docai-incubator@google.com

## Disclaimer

This tool is not supported by the Google engineering team or product team. It is provided and supported on a best-effort basis by the **DocAI Incubator Team**. No guarantees of performance are implied.

## Objective

This tool is built with Python. It uses Form Parser results to convert tables in PDF files to CSV files and stores them in a GCS bucket.

# Pre-requisites

This tool requires the following services:

 * Vertex AI Notebook instance
 * Access to a Document AI Form Parser processor and a CDE processor
 * A GCS bucket that holds the input PDFs and stores the output results

The Vertex AI Notebook instance is used to run the Python notebook file. Cloud Storage buckets are needed to hold the input files passed to this script and to store the results.

CDE for headers: create a Custom Document Extractor (CDE) processor and configure HITL to review poorly performing documents. Train the CDE for your use-case table by annotating the **row headers** and **column headers** of that table.
* The input for this step is a GCS bucket containing the rebuilt PDF files (which contain only tables). Run `batch_process_documents` on them.
* The output JSON files are stored in a GCS bucket.

Sample image after training the CDE processor on row headers and column headers:
<table>
  <tr>
      <td><b>CDE Sample</b></td>
    <td><img src="./Images/cde_train_sample.png" width=500 height=200></td>
  </tr>
</table> 
Here are the sample row headers and column headers we used while training the CDE for our specific use-case table:

**Column headers** are as follows: ["SCC", "DNSH", "DNSH_P", "code", "business_measure", "DNSH_BE", "DNSH_CCA", "DNSH_CCM", "DNSH_CE", "DNSH_WMR", "min_safeguards", "proportion_of_bm", "SCC_BE", "SCC_CCA", "SCC_CCM", "SCC_CE", "SCC_P", "SCC_WMR"], and **row headers** are as follows: ["taxonomy_disclosure", "activity"]
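
The label lists above can be kept as constants and used to sort CDE entity types into row and column headers. This is only a minimal sketch: `classify_entity_type` is a hypothetical helper and is not part of `tool_helper_functions.py`.

```python
# Label sets copied from the header lists above.
COLUMN_HEADERS = {
    "SCC", "DNSH", "DNSH_P", "code", "business_measure", "DNSH_BE",
    "DNSH_CCA", "DNSH_CCM", "DNSH_CE", "DNSH_WMR", "min_safeguards",
    "proportion_of_bm", "SCC_BE", "SCC_CCA", "SCC_CCM", "SCC_CE",
    "SCC_P", "SCC_WMR",
}
ROW_HEADERS = {"taxonomy_disclosure", "activity"}


def classify_entity_type(entity_type: str) -> str:
    """Return 'column', 'row', or 'unknown' for a CDE entity type.

    Hypothetical helper for illustration only.
    """
    if entity_type in COLUMN_HEADERS:
        return "column"
    if entity_type in ROW_HEADERS:
        return "row"
    return "unknown"
```

A check like this can flag entity types in the CDE output that do not belong to the schema the processor was trained on.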

# Script

# 1. Import Modules/Packages

**Note**: Download the **tool_helper_functions.py** Python file before proceeding with the next steps.

In [None]:
import datetime
from tool_helper_functions import (
    batch_process_documents,
    poll_hitl_operations,
    get_processor_metadata,
    parse_document_tables,
)

# 2. Input Details : Configure below Input variables

* **project_id**: GCP project ID
* **location**: Processor location `us` or `eu`
* **fp_processor_id**: Form Parser (FP) processor ID used for the batch process call
* **gcs_input_uri**: GCS folder containing the input PDF files (files that contain only the use-case tables)
* **input_mime_type**: MIME type of the input files, which is `application/pdf` here
* **gcs_output_bucket_uri**: GCS output bucket URI without a trailing slash
* **gcs_output_uri_prefix**: GCS output folder path in which to store the results
* **field_mask**: Keys of the Document proto to keep in the output (for example `entities,pages.pageNumber`)
* **timeout**: Time to wait for the batch process long-running operation (LRO) to complete
* **fp_processor_v**: Form Parser processor version ID (V1 or V2) used for the batch process call

In [None]:
project_id = "<your-project-id>"
location = "<processor-location>"  # us or eu
fp_processor_id = "<fp-processor-id>"
# Suffix shared by the input and output folder names, e.g. a run timestamp (assumed placeholder)
datetime_suffix = "<datetime-suffix>"
gcs_input_uri = f"gs://bucket_name/prefix/to_input/{datetime_suffix}"
input_mime_type = "<mime-type-of-input-file>"  # "application/pdf"
gcs_output_bucket_uri = "gs://bucket_name"
gcs_output_uri_prefix = f"tables_to_csv/output_folder/prefix/{datetime_suffix}"
field_mask = None
timeout = 5000
fp_processor_v = "<fp-version-id>"  # FP processor V1 or V2 id

# Bucket name without the gs:// scheme
gcs_output_bucket = gcs_output_bucket_uri.replace("gs://", "")
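
The configuration above derives the bucket name with a plain string replace, which would leave a path in the name if the URI carried a prefix or a trailing slash. A stricter parser could look like the sketch below. `split_gcs_uri` is a hypothetical helper, not part of `tool_helper_functions.py`.

```python
from typing import Tuple


def split_gcs_uri(uri: str) -> Tuple[str, str]:
    """Split 'gs://bucket/some/prefix' into ('bucket', 'some/prefix').

    Hypothetical helper for illustration; raises ValueError on malformed input.
    """
    scheme = "gs://"
    if not uri.startswith(scheme):
        raise ValueError(f"Not a GCS URI: {uri!r}")
    # Everything before the first slash is the bucket, the rest is the prefix.
    bucket, _, prefix = uri[len(scheme):].partition("/")
    if not bucket:
        raise ValueError(f"Missing bucket name in: {uri!r}")
    # Drop leading and trailing slashes so prefixes can be joined with "/".
    return bucket, prefix.strip("/")
```

For example, `split_gcs_uri("gs://bucket_name")` yields the bucket name with an empty prefix, and a trailing slash on a folder URI is removed from the returned prefix.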

# 3. Run the code below

Call the `batch_process_documents` function to process all files in the input folder (each file contains only the use-case table). It returns the metadata and the operation ID of the batch process (a long-running operation).

In [None]:
fp_metadata, fp_operation = batch_process_documents(
    project_id,
    location,
    fp_processor_id,
    gcs_input_uri,
    input_mime_type,
    gcs_output_bucket_uri,
    f"{gcs_output_uri_prefix}/fp_output",
    field_mask,
    timeout,
    fp_processor_v,
)
print("FP batch process completed")
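
The internals of `batch_process_documents` are not shown here. As a rough sketch, a Document AI batch request targets a regional endpoint and a processor version resource name, both of which can be derived from the inputs above. The function names below are hypothetical, and whether the helper builds these strings in the same way is an assumption.

```python
def api_endpoint(location: str) -> str:
    """Regional Document AI endpoint, e.g. 'us-documentai.googleapis.com'."""
    return f"{location}-documentai.googleapis.com"


def processor_version_name(
    project_id: str, location: str, processor_id: str, version_id: str
) -> str:
    """Full resource name of a processor version (hypothetical helper)."""
    return (
        f"projects/{project_id}/locations/{location}"
        f"/processors/{processor_id}/processorVersions/{version_id}"
    )
```

Printing both values before submitting a batch job is a cheap way to catch a mismatched `location` or a version ID that belongs to another processor.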

If you configured HITL, run the `get_processor_metadata` and `poll_hitl_operations` functions below. Otherwise, skip these two functions and go straight to the `parse_document_tables` function.

In [None]:
project_num = "<your-project-number>"  # numeric GCP project number (assumed placeholder)
fp_input_output_map = get_processor_metadata(fp_metadata, fp=True)
poll_hitl_operations(project_num, location, fp_input_output_map)

In [None]:
parse_document_tables(
    gcs_output_bucket,
    f"{gcs_output_uri_prefix}/fp_output",
    f"{gcs_output_uri_prefix}/tables_csv",
)
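
To illustrate what a table-to-CSV conversion involves, the sketch below reads one table from a Form Parser result in Document JSON form (camelCase field names, text referenced through `textAnchor` segments) and turns it into CSV text. This is not the implementation of `parse_document_tables`. The function names and the sample document are assumptions, and merged cells (`rowSpan`, `colSpan`) are ignored.

```python
import csv
import io
from typing import Any, Dict, List


def anchor_text(layout: Dict[str, Any], full_text: str) -> str:
    """Resolve a layout's text anchor against the document text."""
    segments = layout.get("textAnchor", {}).get("textSegments", [])
    parts = []
    for seg in segments:
        # JSON output encodes indices as strings and omits startIndex when it is 0.
        start = int(seg.get("startIndex", 0))
        end = int(seg.get("endIndex", 0))
        parts.append(full_text[start:end])
    # Collapse newlines and repeated spaces inside a cell.
    return " ".join("".join(parts).split())


def table_to_rows(table: Dict[str, Any], full_text: str) -> List[List[str]]:
    """Return header rows followed by body rows as lists of cell strings."""
    rows = []
    for row in table.get("headerRows", []) + table.get("bodyRows", []):
        cells = row.get("cells", [])
        rows.append([anchor_text(c.get("layout", {}), full_text) for c in cells])
    return rows


def rows_to_csv(rows: List[List[str]]) -> str:
    """Serialize rows to CSV text with '\n' line endings."""
    buf = io.StringIO()
    csv.writer(buf, lineterminator="\n").writerows(rows)
    return buf.getvalue()


def _cell(start: int, end: int) -> Dict[str, Any]:
    """Build a minimal table cell for the made-up sample document."""
    seg = {"endIndex": str(end)}
    if start:
        seg["startIndex"] = str(start)
    return {"layout": {"textAnchor": {"textSegments": [seg]}}}


# Made-up sample document with a single 2x2 table.
sample_document = {
    "text": "Code\nAmount\nA1\n10\n",
    "pages": [{"tables": [{
        "headerRows": [{"cells": [_cell(0, 5), _cell(5, 12)]}],
        "bodyRows": [{"cells": [_cell(12, 15), _cell(15, 18)]}],
    }]}],
}

sample_table = sample_document["pages"][0]["tables"][0]
print(rows_to_csv(table_to_rows(sample_table, sample_document["text"])))
```

In a real run, the resulting CSV text would be uploaded to the output folder in the bucket, one file per table.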

# 4. Output Samples

Sample table from a PDF file
![](./Images/line_enhancement_basic_table_img.png)

Sample output folder structure
![](./Images/fp_tables_to_csv_output_folder.png)

Sample output for one table, stored as CSV files in the GCS bucket
![](./Images/fp_tables_to_csv_output_csv_sample.png)