# Parse Table Into Chunks

* Author: docai-incubator@google.com

## Disclaimer

This tool is not supported by the Google engineering team or product team. It is provided and supported on a best-effort basis by the **DocAI Incubator Team**. No guarantees of performance are implied.

# Objective
This tool splits a table into chunks. Form Parser returns the tables and their geometries; the tool then groups the row objects of each table into chunks based on the user-provided `cells_limit`, the maximum number of cells allowed per chunked image.

NOTE: The number of rows in each chunked image depends on the table data in the Form Parser results, specifically on how many cells the table's header row contains.
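
The chunk arithmetic described above can be sketched as follows. This is a minimal illustration, not the notebook's exact code: `plan_chunks` is a hypothetical helper, the numbers in the example call are made up, and the "at least one row per chunk" floor is an assumption of this sketch so that a table wider than the limit does not yield a zero-row chunk size.

```python
import math
from typing import List, Tuple


def plan_chunks(
    num_body_rows: int, cells_per_row: int, cells_limit: int
) -> List[Tuple[int, int]]:
    """Return (start, end) body-row slices, one per chunk (hypothetical helper)."""
    # Number of body rows that fit under the cell limit.
    # Assumption of this sketch: always keep at least one row per chunk.
    rows_per_chunk = max(1, cells_limit // cells_per_row)
    num_chunks = math.ceil(num_body_rows / rows_per_chunk)
    return [
        (i * rows_per_chunk, min((i + 1) * rows_per_chunk, num_body_rows))
        for i in range(num_chunks)
    ]


# Illustrative numbers only: 25 body rows, 6 cells per row, limit of 60 cells
print(plan_chunks(num_body_rows=25, cells_per_row=6, cells_limit=60))
```

Each slice selects the body rows for one chunked image. The header rows are attached to every chunk separately and are not counted against the limit in this calculation.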


# Prerequisites
* Vertex AI Notebook
* GCS Folder Path
* Form Parser & CDE Processor

# Step-by-Step Procedure

## 1. Import Modules/Packages

In [None]:
# Run this cell to install required packages
!pip install google-cloud-documentai
!pip install google-cloud-storage
!pip install pillow

In [None]:
# Run this cell to download utilities module
!wget https://raw.githubusercontent.com/GoogleCloudPlatform/document-ai-samples/main/incubator-tools/best-practices/utilities/utilities.py

In [None]:
import math
import re
from io import BytesIO
from typing import List, Sequence, Tuple

from google.cloud import documentai_v1beta3 as documentai
from google.cloud import storage
from PIL import Image

from utilities import file_names, process_document_sample, store_document_as_json

## 2. Input Details

* **PROJECT_ID**: Provide GCP Project Id
* **LOCATION**: Processor Location
* **FP_PROCESSOR_ID**: DocumentAI Form Parser Processor Id
* **FP_PROCESSOR_VERSION_ID**: Form Parser processor version Id
* **CDE_PROCESSOR_ID**: DocumentAI Custom Document Extractor Processor(CDE) Id
* **CDE_PROCESSOR_VERSION_ID**: CDE processor version Id
* **GCS_INPUT_PATH**: GCS folder path containing the input samples. These samples are sent to Form Parser.  
    * **NOTE**: This tool currently accepts JPEG, PNG, and PDF files in `GCS_INPUT_PATH`

* **GCS_OUTPUT_PATH**: GCS folder path where the chunked images and the CDE results for each chunked image are stored
    * **NOTE**: This tool creates two folders under the provided `GCS_OUTPUT_PATH`
    * *chunked_files_raw*: Holds all chunked images
    * *chunked_files_duai_results*: Holds the CDE processor results for each chunked image as JSON files

* **CELLS_LIMIT**: Integer. Maximum number of table cells per chunked image; the number of body rows per chunk is `CELLS_LIMIT` divided by the number of cells in the table's header row, rounded down
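
To make the output layout concrete, the sketch below shows how a chunked image's blob name could be derived from `GCS_OUTPUT_PATH`. It mirrors the naming pattern used later in this notebook, but `split_gcs_uri` and `chunk_blob_name` are hypothetical helpers written for illustration, and `invoice.pdf` is a made-up input file name.

```python
import re
from typing import Tuple


def split_gcs_uri(gcs_uri: str) -> Tuple[str, str]:
    """Split 'gs://bucket/prefix' into (bucket, prefix) (hypothetical helper)."""
    match = re.match(r"gs://(.*?)/(.*)", gcs_uri)
    if match is None:
        raise ValueError(f"Not a gs://bucket/prefix URI: {gcs_uri}")
    bucket, prefix = match.groups()
    return bucket, prefix.strip("/")


def chunk_blob_name(
    gcs_output_path: str, file_name: str, page: int, table: int, chunk: int
) -> str:
    """Build the blob name of one chunked image under chunked_files_raw."""
    _, prefix = split_gcs_uri(gcs_output_path)
    folder = file_name.split(".")[0]  # input file name without its extension
    return (
        f"{prefix}/chunked_files_raw/{folder}/"
        f"page_{page}_table_{table}_chunk_{chunk}.jpeg"
    )


# "invoice.pdf" is a made-up input file name
print(chunk_blob_name("gs://BUCKET/path/output/", "invoice.pdf", 1, 0, 2))
```

The CDE results follow the same pattern under `chunked_files_duai_results`, with a `.json` extension in place of `.jpeg`.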

In [None]:
PROJECT_ID = "xx-xx-xx"
LOCATION = "us"  # Format is "us" or "eu"
FP_PROCESSOR_ID = "xx-xx-xx"
FP_PROCESSOR_VERSION_ID = "pretrained-form-parser-v2.0-2022-11-10"
CDE_PROCESSOR_ID = "xx-xx-xx"
CDE_PROCESSOR_VERSION_ID = "pretrained-foundation-model-v1.0-2023-08-22"
GCS_INPUT_PATH = "gs://BUCKET/path_to_split_table_into_chunks/input/"
GCS_OUTPUT_PATH = "gs://BUCKET/path_to_split_table_into_chunks/output/"
CELLS_LIMIT = 60

## 3. Run Below Code-Cells

In [None]:
def get_xy_coords_to_chunk(
    page: documentai.Document.Page,
    table_rows: Sequence[documentai.Document.Page.Table.TableRow],
) -> Tuple[int, int, int, int]:
    """
    Helper function to get xy-coordinates of a table

    Args:
        page (documentai.Document.Page): DocumentAI Page object containing Table data
        table_rows (Sequence[documentai.Document.Page.Table.TableRow]):
            List of TableRow objects

    Returns:
        Tuple[int, int, int, int]:
            Returns xy-coordinates of table based on provided table_rows
    """

    width, height = page.dimension.width, page.dimension.height
    x, y = [], []
    for table_row in table_rows:
        for cell in table_row.cells:
            _x, _y = get_x_y_list(cell.layout.bounding_poly)
            x.extend(_x)
            y.extend(_y)
            x = [min(x), max(x)]
            y = [min(y), max(y)]
    xy = (min(x) * width, min(y) * height, max(x) * width, max(y) * (height))
    return xy


def build_chunked_image(header_img: Image.Image, body_img: Image.Image) -> Image.Image:
    """
    It will merge provided cropped-images

    Args:
        header_img (Image.Image):
            It contains header-part (cropped) of table
        body_img (Image.Image):
            It contains body part(cropped) of table
    Returns:
        Image.Image: Returns an image containing header&body attached
    """

    image_chunks = [header_img, body_img]
    width, _ = header_img.size
    total_height = sum(img.size[1] for img in image_chunks)
    merged_image = Image.new("RGB", (width, total_height))
    y_offset = 0
    for img_chunk in image_chunks:
        merged_image.paste(img_chunk, (0, y_offset))
        y_offset += img_chunk.size[1]
    return merged_image


def get_x_y_list(
    bounding_poly: documentai.BoundingPoly,
) -> Tuple[List[float], List[float]]:
    """
    Helper function to get x & y coordinates of BBox

    Args:
        bounding_poly (documentai.BoundingPoly): BoundingPoly object containing normalized vertices

    Returns:
        Tuple[List[float], List[float]]: Returns x&y coords as list
    """
    x, y = [], []
    for nvs in bounding_poly.normalized_vertices:
        x.append(nvs.x)
        y.append(nvs.y)
    return x, y


def store_image_as_jpeg(document: bytes, bucket_name: str, file_name: str) -> None:
    """It will push images to GCS path

    Args:
        document (bytes): Image in bytes format
        bucket_name (str): Name of GCS bucket
        file_name (str): Blob name to store in GCS bucket
    """

    sc = storage.Client()
    process_result_bucket = sc.get_bucket(bucket_name)
    document_blob = storage.Blob(name=file_name, bucket=process_result_bucket)
    document_blob.upload_from_string(document, content_type="image/jpeg")


def parse_table_into_chunks(
    doc: documentai.Document, file_name: str, gcs_output_path: str
) -> None:
    """
    Helper function to split each table image into chunks that stay within the
    cell limit of the custom Document AI processor

    Args:
        doc (documentai.Document): DocumentAI processor result
        file_name (str): File name; its stem is used as the folder for the chunked images
        gcs_output_path (str): GCS output path to store chunked images
    """

    _output_bucket, _output_path = re.match("gs://(.*?)/(.*)", gcs_output_path).groups()
    for page in doc.pages:
        base_64_image = page.image.content
        bytes_io_img = BytesIO(base_64_image)
        img = Image.open(bytes_io_img)
        for table_idx, table in enumerate(page.tables):
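            hrs = table.header_rows
            brs = table.body_rows
            # Skip tables without header or body rows; there is nothing to crop
            if not hrs or not brs:
                print("Skipping table without header rows or body rows")
                continue
            header_img = img.crop(get_xy_coords_to_chunk(page, hrs))
            print(f"Total Rows in current Table {len(brs)}")
            cells_count_per_row = max(1, len(hrs[0].cells))
            # Keep at least one row per chunk, even if one row exceeds CELLS_LIMIT
            rows_per_chunk = max(1, CELLS_LIMIT // cells_count_per_row)
            num_of_chunks = math.ceil(len(brs) / rows_per_chunk)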
            for i in range(num_of_chunks):
                start = i * rows_per_chunk
                end = (i + 1) * rows_per_chunk
                chunk = brs[start:end]
                print(f"\tno.of rows in current chunk are {len(chunk)}")
                cropped_img = img.crop(get_xy_coords_to_chunk(page, chunk))
                chunked_image = build_chunked_image(header_img, cropped_img)
                # to store in gcs
                gcs_dir = file_name.split(".")[0]
                filename = f"{gcs_dir}/page_{page.page_number}_table_{table_idx}_chunk_{i}.jpeg"
                blob_prefix = f"{_output_path.strip('/')}/{filename}"
                buffer = BytesIO()
                chunked_image.save(buffer, format="JPEG")
                print(
                    f"\tStoring chunked image in gcs at gs://{_output_bucket}/{blob_prefix}"
                )
                store_image_as_jpeg(buffer.getvalue(), _output_bucket, blob_prefix)


def mime_type_lookup(file_ext: str) -> str:
    """
    Helper function to get MIMETYPE based on file-extension of input sample

    Args:
        file_ext (str): Valid file extension

    Returns:
        str: MIME type for the given file extension
    """

    # Update this lookup table based on your file-extensions
    lookup = {
        "jpeg": "image/jpeg",
        "jpg": "image/jpeg",
        "png": "image/png",
        "pdf": "application/pdf",
    }
    return lookup[file_ext.lower()]


file_list, file_dict = file_names(GCS_INPUT_PATH)
input_bucket, input_path = re.match("gs://(.*?)/(.*)", GCS_INPUT_PATH).groups()
output_bucket, output_path = re.match("gs://(.*?)/(.*)", GCS_OUTPUT_PATH).groups()
chunks_folder_raw = f"{GCS_OUTPUT_PATH.strip('/')}/chunked_files_raw/"
chunks_folder_duai_results = f"{GCS_OUTPUT_PATH.strip('/')}/chunked_files_duai_results/"
storage_client = storage.Client()
bucket = storage_client.get_bucket(input_bucket)

print("Form Parser Processing Starting to split tables into chunks...")
for fn, fp in file_dict.items():
    print(f"File: {fn}")
    file_extension = fn.split(".")[-1]
    mime_type = mime_type_lookup(file_extension)
    print(mime_type, fn)
    pdf_bytes = bucket.blob(fp).download_as_bytes()
    try:
        # print("Calling Form Parser API")
        res = process_document_sample(
            PROJECT_ID,
            LOCATION,
            FP_PROCESSOR_ID,
            FP_PROCESSOR_VERSION_ID,
            pdf_bytes,
            mime_type,
        ).document
    except Exception as e:
        print(f"Unable to parse document due to {type(e)}, {str(e)}")
        continue
    parse_table_into_chunks(res, fn, chunks_folder_raw)
print(
    f"Completed parsing tables into chunks based on provided cells_limit -> {chunks_folder_raw}"
)

# run below code only if you want to call cde on chunked files
storage_client = storage.Client()
bucket = storage_client.get_bucket(output_bucket)
print("CDE Processing Starting for all chunks...")
output_bucket, output_path = re.match(
    "gs://(.*?)/(.*)", chunks_folder_duai_results
).groups()
for fn, fp in file_dict.items():
    print(f"File: {fn}")
    folder = fn.split(".")[0]
    gcs_chunks_folder_prefix = f"{chunks_folder_raw.strip('/')}/{folder}"
    files_list, files_dict = file_names(gcs_chunks_folder_prefix)
    for chunks_fn, chunks_fp in files_dict.items():
        print(f"\tprocessing {chunks_fn}")
        pdf_bytes = bucket.blob(chunks_fp).download_as_bytes()
        try:
            # print("Calling CDE API")
            res = process_document_sample(
                PROJECT_ID,
                LOCATION,
                CDE_PROCESSOR_ID,
                CDE_PROCESSOR_VERSION_ID,
                pdf_bytes,
                "image/jpeg",
            ).document
        except Exception as e:
            print(f"\tUnable to parse document due to {type(e)}, {str(e)}")
            continue
        _chunk_fn = chunks_fn.split(".")[0]
        out_fp = f"{output_path.strip('/')}/{folder}/{_chunk_fn}.json"
        json_str = documentai.Document.to_json(
            res, including_default_value_fields=False
        )
        print(f"\tStoring DocAI response in gcs at gs://{output_bucket}/{out_fp}")
        store_document_as_json(json_str, output_bucket, out_fp)
print(
    f"Completed parsing chunks and docai results are stored -> {chunks_folder_duai_results}"
)

## 4. Output Details

In the sample below, one table is split into four chunks, each with the header rows attached.


### Preprocessed Table
<img src='./images/sample_input.png' width=1000 height=600 alt="Sample Input"></img>

### Postprocessed Table
<img src='./images/sample_output.png' width=1000 height=600 alt="Sample Output"></img>
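
The geometry behind each output image can be sketched without any image library. The example assumes cells are given as lists of normalized `(x, y)` vertices, as in the Form Parser bounding polygons; `crop_box` and `stacked_size` are hypothetical helpers, and all coordinates and page dimensions are made up for illustration.

```python
from typing import Sequence, Tuple

Point = Tuple[float, float]  # one normalized vertex (x, y), each in [0, 1]
Box = Tuple[float, float, float, float]  # (left, top, right, bottom) in pixels


def crop_box(
    cell_polys: Sequence[Sequence[Point]], width: float, height: float
) -> Box:
    """Smallest pixel box that covers every vertex of the given cells."""
    xs = [x for poly in cell_polys for x, _ in poly]
    ys = [y for poly in cell_polys for _, y in poly]
    return (min(xs) * width, min(ys) * height, max(xs) * width, max(ys) * height)


def stacked_size(header_box: Box, body_box: Box) -> Tuple[float, float]:
    """Size of the image made by pasting the body crop under the header crop."""
    header_w = header_box[2] - header_box[0]
    header_h = header_box[3] - header_box[1]
    body_h = body_box[3] - body_box[1]
    # The canvas width is taken from the header crop, as in build_chunked_image
    return (header_w, header_h + body_h)


# Made-up geometry: a two-column table on an 800 x 1600 pixel page
header_cells = [
    [(0.125, 0.25), (0.5, 0.25), (0.5, 0.375), (0.125, 0.375)],
    [(0.5, 0.25), (0.875, 0.25), (0.875, 0.375), (0.5, 0.375)],
]
body_cells = [
    [(0.125, 0.375), (0.5, 0.375), (0.5, 0.625), (0.125, 0.625)],
    [(0.5, 0.375), (0.875, 0.375), (0.875, 0.625), (0.5, 0.625)],
]
header_box = crop_box(header_cells, 800, 1600)
body_box = crop_box(body_cells, 800, 1600)
print(header_box, body_box, stacked_size(header_box, body_box))
```

Because the canvas width comes from the header crop, a body crop wider than the header would be clipped on the right when pasted; this sketch does not attempt to handle that case.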