# Creating CDS Dataset from Various GCP Folders

* Author: docai-incubator@google.com

## Disclaimer

This tool is not supported by the Google engineering team or product team. It is provided and supported on a best-effort basis by the DocAI Incubator Team. No guarantees of performance are implied.

## Objective

This document describes a solution for automatically generating shoeboxes (composite PDF files that contain multiple documents in random order) from a collection of folders in a Google Cloud Storage bucket. Each folder should contain a dataset of one specific document type, such as W2, 1040, or paystub forms.

The Python script provided in this document creates a set of "shoebox" (composite PDF) documents based on the number of documents that should be pulled from each folder for each shoebox. You can configure:

* the number of shoebox files to be created, and
* the weight distribution for each type of document per shoebox.

The order of the documents is randomized for each shoebox file, and a source document that has been used in one shoebox is not reused in another.

Lastly, the script performs two final steps:
1. It calls the Document AI process-document API with an OCR processor on each shoebox file to generate a Google Document JSON, and
2. It injects CDS-compliant entities into the JSON that reflect how the PDF was composed: which document type occupies which pages.
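
As a rough illustration of step 2, the sketch below builds the kind of entity list that gets injected into the JSON. It mirrors the structure produced by the `make_entities` function later in this document (a `type` plus `pageAnchor.pageRefs` with zero-based page indices). The helper name `build_page_entities` and the sample composition are assumptions made for this example only.

```python
def build_page_entities(composition):
    """Map an ordered list of (doc_type, page_count) pairs to entity dicts."""
    entities = []
    next_page = 0  # zero-based page index into the merged PDF
    for doc_type, page_count in composition:
        page_refs = [{"page": next_page + i} for i in range(page_count)]
        next_page += page_count
        entities.append({"type": doc_type, "pageAnchor": {"pageRefs": page_refs}})
    return entities


# Hypothetical shoebox: a 2-page W2 followed by a 1-page paystub.
sample_entities = build_page_entities([("form_w2", 2), ("form_paystub", 1)])
print(sample_entities)
```

Each entity marks one source document, so a shoebox with two W2 files in it would carry two separate `form_w2` entities.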

## Prerequisites

* Vertex AI Notebook.
* Document AI.
* Storage bucket for storing input PDF files and output JSON files.
* Permissions for Google Cloud Storage and the Vertex AI Notebook.

## Step-by-Step Procedure

### 1. Project Folder structure

In the GCS bucket, store each document type in its own folder, as shown in the image below. For example, 1040, paystub, and W2 PDF forms are placed in their respective folders.

**NOTE:** Each folder must contain enough documents for the script to build the requested shoeboxes.
Folder names should be lowercase.

<img src="./images/shoebox_1.png" width=500 height=400></img>
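
The script later derives each document type from the folder that directly contains a PDF, using `"form_" + name.split("/")[-2]`. The sketch below shows that mapping on plain strings, without touching GCS. The helper names and the sample paths are hypothetical.

```python
def doc_type_from_blob_name(blob_name, prefix="form_"):
    """Return prefix + the name of the folder that directly contains the file."""
    parts = blob_name.split("/")
    if len(parts) < 2:
        raise ValueError(f"{blob_name!r} is not inside a folder")
    return prefix + parts[-2]


def group_by_doc_type(blob_names, known_types):
    """Group PDF paths by document type, ignoring unknown types and non-PDF files."""
    grouped = {doc_type: [] for doc_type in known_types}
    for name in blob_names:
        if not name.endswith(".pdf"):
            continue
        doc_type = doc_type_from_blob_name(name)
        if doc_type in grouped:
            grouped[doc_type].append(name)
    return grouped


# Hypothetical object names under a "2024_tax_documents" prefix.
sample_names = [
    "2024_tax_documents/w2/a.pdf",
    "2024_tax_documents/1040/b.pdf",
    "2024_tax_documents/paystub/c.pdf",
    "2024_tax_documents/paystub/notes.txt",
    "2024_tax_documents/other/d.pdf",
]
grouped = group_by_doc_type(sample_names, ["form_w2", "form_1040", "form_paystub"])
print(grouped)
```

Because only the immediate parent folder is used, nesting PDFs one level deeper would change the derived type.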

### 2. Install the required libraries

In [None]:
!pip install PyPDF2
!pip install google-cloud-storage
!pip install google-cloud-documentai

In [None]:
# Run this cell to download utilities module
!wget https://raw.githubusercontent.com/GoogleCloudPlatform/document-ai-samples/main/incubator-tools/best-practices/utilities/utilities.py

### 3. Import the required libraries

In [None]:
import random, PyPDF2
from io import BytesIO
from pprint import pprint
import json, copy
from collections import OrderedDict
from google.cloud import storage
from google.cloud import documentai_v1beta3 as documentai
import utilities

### 4. Setup the required inputs

In [None]:
# Configuration
input_number_of_shoeboxes = 5
min_number_variation = 1
input_min_docs = 4
input_max_docs = 7
weights = {"form_w2": 0.2, "form_1040": 0.5, "form_paystub": 0.3}
project_id = "<your-project-id>"
location = "us"
processor_id = "<your-processor-id>"
processor_version = "<your-processor-version>"

bucket_name = "<your-bucket-name>"
bucket_folder_prefix_name = "<your-bucket-folder-prefix>"
output_folder_pdf_name = "<path-for-output-pdf>"
output_folder_json_name = "<path-for-output-json>"
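
Before running the main loop, it can help to sanity-check these values. The sketch below is one possible pre-flight check, not part of the original tool: the function name `validate_config` and the small floating-point tolerance are assumptions (the script itself only checks `sum(weights.values()) > 1.0` inside the loop).

```python
def validate_config(weights, min_docs, max_docs, number_of_shoeboxes, tolerance=1e-9):
    """Return a list of human-readable problems; an empty list means no problem found."""
    problems = []
    total = sum(weights.values())
    if total > 1.0 + tolerance:  # tolerance is an assumption to absorb float error
        problems.append(f"weights sum to {total}, which exceeds 1.0")
    if any(w < 0 for w in weights.values()):
        problems.append("weights must not be negative")
    if min_docs > max_docs:
        # random.randint(min_docs, max_docs) would raise ValueError in this case
        problems.append("input_min_docs must not exceed input_max_docs")
    if number_of_shoeboxes < 1:
        problems.append("input_number_of_shoeboxes must be at least 1")
    return problems


issues = validate_config({"form_w2": 0.2, "form_1040": 0.5, "form_paystub": 0.3}, 4, 7, 5)
print(issues or "configuration looks consistent")
```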

## Configuration for Document Processing

- `input_number_of_shoeboxes`: Number of shoeboxes to be created.

- `min_number_variation`: Threshold on document-type variety. The script keeps creating shoeboxes only while more than this many document types still have unused documents.

- `input_min_docs`: Minimum number of documents in a shoebox.

- `input_max_docs`: Maximum number of documents in a shoebox.

- `bucket_name`: Name of the Google Cloud Storage bucket.
  - Example: 'my-bucket-name'

- `bucket_folder_prefix_name`: Folder prefix in the GCS bucket for documents.
  - Example: '2024_tax_documents'

- `weights`: Weights for each document type, dictating how many documents of that type go into each shoebox. The weights must not sum to more than 1.0. Each key in `weights` should be `form_` followed by a folder name inside `bucket_folder_prefix_name`, because the script derives the document type as `"form_" + <folder name>`.
  - Example: 
    - If you have folders named 'w2', '1040', and 'paystub' inside '2024_tax_documents', the `weights` might be:
    - {'form_w2': 0.2, 'form_1040': 0.5, 'form_paystub': 0.3}

- `project_id`: Google Cloud project ID.
  - Example: 'your-project-id'

- `location`: Location of the DocumentAI processor (e.g., 'us', 'eu').
  - Example: 'us'

- `processor_id`: ID of the DocumentAI processor.
  - Example: 'your-processor-id'

- `processor_version`: Version of the DocumentAI processor.
  - Example: 'your-processor-version'

- `output_folder_pdf_name`: Folder in the GCS bucket for storing merged PDFs. The path is relative to the bucket defined in `bucket_name` and should end with `/`, because the script appends the file name directly.
  - Example: '2024_tax_documents/output_pdf/'

- `output_folder_json_name`: Folder in the GCS bucket for storing processed JSON data. This path is also relative to the bucket defined in `bucket_name` and should end with `/`.
  - Example: '2024_tax_documents/output_json/'
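
To make the effect of `weights` concrete: for each shoebox the script draws a total between `input_min_docs` and `input_max_docs`, then takes `round(total * weight)` documents per type. The sketch below reproduces only that arithmetic; the function name `docs_per_type` is an assumption for illustration.

```python
def docs_per_type(combine_number, weights):
    """Number of documents to pull per type for one shoebox of the given size."""
    return {doc_type: round(combine_number * w) for doc_type, w in weights.items()}


example_weights = {"form_w2": 0.2, "form_1040": 0.5, "form_paystub": 0.3}
for total in (4, 5, 6, 7):
    counts = docs_per_type(total, example_weights)
    print(total, counts, "sum:", sum(counts.values()))
```

Note that Python's `round` rounds halves to the nearest even integer, so the per-type counts do not always add up to the drawn total. For instance, with two types weighted 0.5 each and a total of 5, each type gets `round(2.5) == 2` documents, for 4 in all.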


In [None]:
def min_condition(tracker: dict) -> bool:
    """
    Checks if the number of non-zero values in the tracker dictionary exceeds a predefined minimum number.

    This function iterates through the values of the input dictionary. It counts the number of values that are not zero.
    If this count exceeds the global variable `min_number_variation`, the function returns True, indicating that the
    condition is met. Otherwise, it returns False.

    Args:
        tracker (dict): A dictionary with values that are to be checked. The values are expected to be numeric.

    Returns:
        bool: True if the count of non-zero values is greater than `min_number_variation`, otherwise False.
    """
    non_zero_count = 0
    for value in tracker.values():
        if value != 0:
            non_zero_count += 1
    return non_zero_count > min_number_variation


def make_entities(shoebox_dict: list) -> list:
    """
    Constructs a list of entities based on the given shoebox dictionary. Each entity is formed with a type and
    a corresponding list of page references.

    The function iterates through each item in the shoebox dictionary. For each item, it creates an entity dictionary
    with a 'type' key (based on the item's key) and a 'pageAnchor' key containing a list of page references. The number
    of page references is determined by the value associated with each key in the shoebox dictionary. A global page
    counter is used to assign unique page numbers to each page reference.

    Args:
        shoebox_dict (list): A list of dictionaries, each having a single key-value pair where the key represents the
        type of the entity and the value represents the number of page references to be created for that entity.

    Returns:
        list: A list of dictionaries, each representing an entity with its type and associated page references.
    """
    entities = []
    global_page = 0
    for i in shoebox_dict:
        entity = {}
        entity["type"] = list(i.keys())[0]
        temp_pageRef_list = []
        for j in range(0, int(i[list(i.keys())[0]])):
            temp_page = {}
            temp_page["page"] = global_page
            global_page = global_page + 1
            temp_pageRef_list.append(temp_page)
        temp_pageRef = {}
        temp_pageRef["pageRefs"] = temp_pageRef_list
        entity["pageAnchor"] = temp_pageRef
        entities.append(entity)

    return entities


def convert_blob_content_type(blob_name: str) -> None:
    """
    Changes the content type of a specified blob in a Google Cloud Storage bucket to 'application/pdf'.

    This function first retrieves a reference to the specified source bucket using the global `storage_client`.
    It then fetches the blob (file) specified by `blob_name` within this bucket. Once the blob is obtained,
    the function updates its content type to 'application/pdf' and applies this change using the `patch` method.

    Note: This function relies on a globally available `storage_client`, which should be an instance of
    `google.cloud.storage.Client`, and `source_bucket`, which should be the `google.cloud.storage.Bucket` object created below.

    Args:
        blob_name (str): The name of the blob (file) within the bucket whose content type needs to be updated.

    Returns:
        None: This function does not return anything. It updates the content type of the blob in-place.
    """
    bucket = storage_client.get_bucket(source_bucket)
    blob = bucket.blob(blob_name)
    blob.content_type = "application/pdf"
    blob.patch()


storage_client = storage.Client()
source_bucket = storage_client.bucket(bucket_name)  # storage bucket name
source_blob = source_bucket.list_blobs(prefix=bucket_folder_prefix_name)

all_documents_dict = {}

list_of_files = []
for blob in source_blob:
    if blob.name.endswith(".pdf"):
        list_of_files.append(blob.name)

for k, v in weights.items():
    all_documents_dict[k] = []

for docType in list_of_files:
    doc_type = "form_" + docType.split("/")[-2]
    if doc_type in weights.keys():  # all_documents_dict:
        all_documents_dict[doc_type].append(docType)

sb_count = 0
while True:
    if min_number_variation == 0:
        break

    if sum(weights.values()) > 1.0:
        print(
            " Total Weights should not exceed 1.0 : you provided : ",
            sum(weights.values()),
        )
        break

    print("==== " + str(sb_count) + " ====")

    combine_number = random.randint(input_min_docs, input_max_docs)
    print("combine_number : ", combine_number)

    number_of_docs = {}
    for doc_type, weight in weights.items():
        number_of_docs[doc_type] = round(combine_number * weight)
    print(number_of_docs)

    tracker = {}
    for y in number_of_docs.keys():
        tracker[y] = len(all_documents_dict[y])
    min_flag = min_condition(tracker)
    if not min_flag:
        print(" >>>>>>> too few document types have documents left; stopping")
        break

    shoebox = []
    for docType, doc_list in all_documents_dict.items():
        # Draw without replacement; take fewer if the folder is running low.
        sample_size = min(number_of_docs[docType], len(doc_list))
        unique_random_docs = random.sample(doc_list, sample_size)
        for doc in unique_random_docs:
            doc_list.remove(doc)  # a document is used in at most one shoebox
        shoebox.extend(unique_random_docs)
    print("Shoebox : ", shoebox)

    random.shuffle(shoebox)

    blob_shoebox = []
    for x in shoebox:
        blob = source_bucket.blob(x)
        blob_shoebox.append(blob)

    output_pdf_file = output_folder_pdf_name + "shoebox-" + str(sb_count) + ".pdf"
    print("output : ", output_pdf_file)
    blob_out = source_bucket.blob(output_pdf_file)

    merged_pdf = PyPDF2.PdfMerger()
    shoebox_dict_list = []
    for pdf_file in blob_shoebox:
        shoebox_dict = OrderedDict()  # ordered_dict
        with pdf_file.open("rb") as file:
            # Parse the PDF once and reuse the reader for page count and merge.
            reader = PyPDF2.PdfReader(file)
            page_per_doc = len(reader.pages)
            merged_pdf.append(reader)
            shoebox_dict[
                "form_" + pdf_file.name.split("/")[-2]
            ] = page_per_doc  # document type -> number of pages
            shoebox_dict_list.append(shoebox_dict)
    try:
        entities = make_entities(shoebox_dict_list)
    except Exception as e:
        print("Exception occurred... unable to make entities:", e)
        break
    merged_pdf2 = copy.deepcopy(merged_pdf)
    with blob_out.open("wb") as file:
        merged_pdf.write(file)
    convert_blob_content_type(output_pdf_file)
    try:
        with BytesIO() as f:
            merged_pdf2.write(f)
            f.seek(0)
            result = utilities.process_document_sample(
                project_id, location, processor_id, f.read(), processor_version
            )
        doc_dict = documentai.Document.to_dict(result.document)
        doc_dict["entities"] = entities
    except Exception as e:
        print("Exception occurred during document processing:", e)
        break
    output_json_file = output_folder_json_name + "shoebox-" + str(sb_count) + ".json"
    json_string = json.dumps(doc_dict)
    utilities.store_document_as_json(json_string, bucket_name, output_json_file)
    sb_count += 1
    if sb_count == input_number_of_shoeboxes:
        break
print("Dataset Created")
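
After the dataset is created, one way to spot-check an output JSON is to confirm that the injected entities reference every page exactly once. The sketch below works on a plain dictionary shaped like the `doc_dict` built above; the function names and the tiny sample document are hypothetical, and it assumes the page count can be read as `len(doc_dict["pages"])`.

```python
def entity_pages(entities):
    """Collect every page index referenced by the entities, in order."""
    pages = []
    for entity in entities:
        for ref in entity.get("pageAnchor", {}).get("pageRefs", []):
            pages.append(int(ref.get("page", 0)))
    return pages


def entities_cover_all_pages(doc_dict):
    """True if the entities reference pages 0..N-1 exactly once each."""
    page_count = len(doc_dict.get("pages", []))
    referenced = entity_pages(doc_dict.get("entities", []))
    return sorted(referenced) == list(range(page_count))


# Hypothetical 3-page document: a 2-page W2 followed by a 1-page paystub.
sample_doc = {
    "pages": [{}, {}, {}],
    "entities": [
        {"type": "form_w2", "pageAnchor": {"pageRefs": [{"page": 0}, {"page": 1}]}},
        {"type": "form_paystub", "pageAnchor": {"pageRefs": [{"page": 2}]}},
    ],
}
print(entities_cover_all_pages(sample_doc))
```

A `False` result would suggest that the page counts used for the entities and the pages returned by the processor disagree for that shoebox.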