# DocumentAI Merge Specific-Use-Case Table (Table Across Two Pages) Script

* Author: docai-incubator@google.com

## Disclaimer

This tool is not supported by the Google engineering team or product team. It is provided and supported on a best-effort basis by the **DocAI Incubator Team**. No guarantees of performance are implied.

## Objective

DocumentAI Page Merger is a tool built using the Python programming language. Its purpose is to provide a technique for merging a table (**specific use-case tables**) that spans two pages. This document describes how the tool (script) works and what it requires.

**NOTE**:
* Input PDF files should contain only use-case tables that span two pages.
* You need to train a CDE processor for the *specific use-case table* by annotating `row_header` and `column_header` entities. These headers are required to run this page merger script.

This tool requires the following services:

 * Vertex AI Notebook instance
 * Access to Document AI CDE Processor
 * Folder containing input PDFs

The Python notebook file is run in a Jupyter notebook on the Vertex AI Notebook instance. The input folder should contain the input files for this script. The CDE processor is trained, on documents with annotated row headers and column headers, to detect those headers for your specific use-case table.

## Approach  
* The CDE processor output is used to identify pairs of consecutive pages where a table starts on the first page (with row headers) and continues on the next page (without row headers but with column headers), which indicates a table split across those pages.
* For each identified pair, Pillow is used to find the pixel columns that contain non-white pixels. The white space on the right of the first page and on the left of the second page is then cropped.
* The cropped pages are merged horizontally to produce a single, complete table.
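
The detection rule in the first bullet can be sketched without calling Document AI. This is a minimal illustration, not the script's actual code path: the helper name `find_split_pairs` and the sample predictions are made up for the example, and the input is assumed to be a mapping from page index to the set of entity types predicted on that page.

```python
from typing import Dict, List, Set, Tuple


def find_split_pairs(
    page_to_headers: Dict[int, Set[str]],
    row_headers: List[str],
    col_headers: List[str],
) -> List[Tuple[int, int]]:
    """Return (first_page, second_page) pairs that look like one split table."""
    pages = sorted(page_to_headers)
    pairs = []
    for current_page, next_page in zip(pages, pages[1:]):
        current = page_to_headers[current_page]
        following = page_to_headers[next_page]
        # The table starts here: every row header was predicted on this page.
        starts_here = all(h in current for h in row_headers)
        # The table continues: no row headers, but at least one column header.
        continues = not any(h in following for h in row_headers) and any(
            h in following for h in col_headers
        )
        if starts_here and continues:
            pairs.append((current_page, next_page))
    return pairs


# Made-up predictions for a three-page file (0-based page indices).
example = {
    0: {"taxonomy_disclosure", "activity", "code"},
    1: {"SCC", "DNSH"},
    2: {"taxonomy_disclosure", "activity"},
}
rows = ["taxonomy_disclosure", "activity"]
cols = ["SCC", "DNSH", "code"]
print(find_split_pairs(example, rows, cols))  # [(0, 1)]
```

Pages 1 and 2 are not paired because page 1 carries no row headers, so no table starts there.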

**CDE for headers**: Create a Custom Document Extractor (CDE) processor and configure HITL to review poorly performing documents. Train the CDE for your use-case table by annotating **row headers** and **column headers**.
* The input for this step is a GCS bucket containing the rebuilt PDF files (the output of step 2a, advance_table_parser.ipynb). Run `batch_process_documents` on them.
* The output JSON files are stored in a GCS bucket.

Sample image after training the CDE processor on row headers and column headers:
<table>
  <tr>
      <td><b>CDE Sample</b></td>
    <td><img src="./Images/cde_train_sample.png" width=500 height=200></td>
  </tr>
</table>
Here are the sample row headers and column headers that we used while training the CDE for our specific use-case table:

**Column headers** are as follows: `["SCC", "DNSH", "DNSH_P", "code", "business_measure", "DNSH_BE", "DNSH_CCA", "DNSH_CCM", "DNSH_CE", "DNSH_WMR", "min_safeguards", "proportion_of_bm", "SCC_BE", "SCC_CCA", "SCC_CCM", "SCC_CE", "SCC_P", "SCC_WMR"]`, and **row headers** are as follows: `["taxonomy_disclosure", "activity"]`

# Script

## 1. Import Modules/Packages

**Note** : Please download the **tool_helper_functions.py** Python file before proceeding to further steps.

In [None]:
import os
import pathlib
from typing import List, Tuple, Union

import img2pdf
import pandas as pd
from google.cloud import documentai_v1 as documentai
from pdf2image import convert_from_path
from PIL import Image, PpmImagePlugin
from PyPDF2 import PdfMerger

## 2. Input Details : Configure below Input variables

* **PROJECT_ID** : Your GCP project ID
* **LOCATION** : The location of the processor, such as `us` or `eu`
* **PROCESSOR_ID** : The ID of the CDE processor
* **FOLDER_PATH** : The folder that holds the input PDF files (the PDFs should contain only use-case table pages)
* **OUTPUT_FOLDER** : The folder path on your local system where the merged PDFs should be stored
* **MIME_TYPE** : The MIME type of the input documents
* **COL_HEADERS** : The list of all entity types annotated in the CDE processor to identify *column headers*
* **ROW_HEADERS** : The list of all entity types annotated in the CDE processor to identify *row headers*

In [None]:
PROJECT_ID = "<your-project-id>"
LOCATION = "<location>"
PROCESSOR_ID = "<processor-id>"
FOLDER_PATH = "dir_path/to/input_folder/"
OUTPUT_FOLDER = "output_dir/path/"
MIME_TYPE = "application/pdf"
# replace COL_HEADERS & ROW_HEADERS with your list of annotation_types
COL_HEADERS = [
    "SCC",
    "DNSH",
    "DNSH_P",
    "code",
    "business_measure",
    "DNSH_BE",
    "DNSH_CCA",
    "DNSH_CCM",
    "DNSH_CE",
    "DNSH_WMR",
    "min_safeguards",
    "proportion_of_bm",
    "SCC_BE",
    "SCC_CCA",
    "SCC_CCM",
    "SCC_CE",
    "SCC_P",
    "SCC_WMR",
]
ROW_HEADERS = ["taxonomy_disclosure", "activity"]
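
Before running the pipeline, it may help to sanity-check these values. The sketch below is optional and not part of the original script. The helper name `check_config` is hypothetical, and treating an entity type listed in both header lists as a problem is an assumption, made because the split detection needs to tell row headers apart from column headers.

```python
import os
from typing import List


def check_config(
    folder_path: str, col_headers: List[str], row_headers: List[str]
) -> List[str]:
    """Return human-readable problems; an empty list means none were found."""
    problems = []
    if not os.path.isdir(folder_path):
        problems.append(f"input folder not found: {folder_path}")
    if not col_headers:
        problems.append("COL_HEADERS is empty")
    if not row_headers:
        problems.append("ROW_HEADERS is empty")
    # Assumption: a type listed as both kinds of header makes detection ambiguous.
    overlap = sorted(set(col_headers) & set(row_headers))
    if overlap:
        problems.append(f"listed as both row and column headers: {overlap}")
    return problems


# "." is used here only so that the folder check passes in any environment.
print(check_config(".", ["SCC", "DNSH"], ["activity"]))  # []
```

In the notebook, the call would be `check_config(FOLDER_PATH, COL_HEADERS, ROW_HEADERS)`.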

The image below shows a sample after annotating row headers and column headers for the CDE:
![](./Images/cde_train_sample.png)

## 3. Run the below code.

Run all the cells below. Update the path parameters if the files are not in the current working directory.


In [None]:
def online_process(
    project_id: str,
    location: str,
    processor_id: str,
    file_path: str,
    mime_type: str,
) -> documentai.Document:
    """
    Processes a document using the Document AI Online Processing API.
    """

    opts = {"api_endpoint": f"{location}-documentai.googleapis.com"}

    # Instantiates a client
    documentai_client = documentai.DocumentProcessorServiceClient(client_options=opts)

    # The full resource name of the processor, e.g.:
    # projects/project-id/locations/location/processors/processor-id
    # You must create new processors in the Cloud Console first
    resource_name = documentai_client.processor_path(project_id, location, processor_id)

    # Read the file into memory
    with open(file_path, "rb") as file:
        file_content = file.read()

    # Load Binary Data into Document AI RawDocument Object
    raw_document = documentai.RawDocument(content=file_content, mime_type=mime_type)

    # Configure the process request
    request = documentai.ProcessRequest(name=resource_name, raw_document=raw_document)
    print("\tOnline Document Process started..")
    # Use the Document AI client to process the sample form
    result = documentai_client.process_document(request=request)
    print("\tSuccessfully document process completed")
    return result.document

In [None]:
def crop_right(
    img: Union[Image.Image, PpmImagePlugin.PpmImageFile]
) -> Union[Image.Image, PpmImagePlugin.PpmImageFile]:
    """
    Crop the blank (white) margin from the right side of the image
    """

    img_data = img.getdata()
    # Columns that contain at least one non-white pixel
    non_empty_columns = [
        i
        for i in range(img.width)
        if not all(
            img_data[i + j * img.width][:3] == (255, 255, 255)
            for j in range(img.height)
        )
    ]
    if not non_empty_columns:
        return img  # fully white page: nothing to crop
    left = 0
    # The right edge of the crop box is exclusive, so add 1 to keep the last column
    right = max(non_empty_columns) + 1
    img_cropped = img.crop((left, 0, right, img.height))
    return img_cropped


def crop_left(
    img: Union[Image.Image, PpmImagePlugin.PpmImageFile]
) -> Union[Image.Image, PpmImagePlugin.PpmImageFile]:
    """
    Crop the blank (white) margin from the left side of the image
    """

    img_data = img.getdata()
    # Columns that contain at least one non-white pixel
    non_empty_columns = [
        i
        for i in range(img.width)
        if not all(
            img_data[i + j * img.width][:3] == (255, 255, 255)
            for j in range(img.height)
        )
    ]
    if not non_empty_columns:
        return img  # fully white page: nothing to crop
    left = min(non_empty_columns)
    right = img.width
    img_cropped = img.crop((left, 0, right, img.height))
    return img_cropped


def process_pdf(
    pdf_path: str, page_pairs: List[Tuple[int, int]], OUTPUT_FOLDER: str
) -> None:
    """
    This function processes pages of complex-table which spans across two pages
    """

    pdf_merger = PdfMerger()
    for pair in page_pairs:
        images = convert_from_path(
            pdf_path, first_page=pair[0] + 1, last_page=pair[1] + 1
        )  # incrementing page numbers
        if len(images) != 2:
            print("Expected exactly 2 pages for this pair, skipping..")
            continue
        img1, img2 = images
        img1_cropped = crop_right(img1)
        img2_cropped = crop_left(img2)

        # Merge horizontally
        total_width = img1_cropped.width + img2_cropped.width
        max_height = max(img1_cropped.height, img2_cropped.height)

        new_img = Image.new("RGB", (total_width, max_height), (255, 255, 255))
        new_img.paste(img1_cropped, (0, 0))
        new_img.paste(img2_cropped, (img1_cropped.width, 0))

        # Save as temporary image file
        temp_img_path = "temp_merged.png"
        new_img.save(temp_img_path, "PNG")

        # Convert to PDF
        with open(temp_img_path, "rb") as f:
            pdf_bytes = img2pdf.convert(f.read())

        temp_pdf_path = "temp_merged.pdf"
        with open(temp_pdf_path, "wb") as f:
            f.write(pdf_bytes)

        pdf_merger.append(temp_pdf_path)

        # Remove temporary files
        os.remove(temp_img_path)
        os.remove(temp_pdf_path)

    # Naming the output based on the input file
    output_name = os.path.join(
        OUTPUT_FOLDER, os.path.basename(pdf_path).replace(".pdf", "_merged.pdf")
    )
    pdf_merger.write(output_name)
    pdf_merger.close()
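
The cropping and merging geometry can be illustrated on a tiny pixel grid without Pillow. This is a simplified sketch, not the script's implementation: `content_columns` and `merged_canvas` are hypothetical names, and a page is assumed to be a list of rows of RGB tuples. Pillow's crop box treats its right edge as exclusive, so the sketch returns a half-open column range.

```python
from typing import List, Tuple

Pixel = Tuple[int, int, int]
WHITE: Pixel = (255, 255, 255)
BLACK: Pixel = (0, 0, 0)


def content_columns(rows: List[List[Pixel]]) -> Tuple[int, int]:
    """Return (first, last_exclusive) indices of columns with non-white pixels."""
    width = len(rows[0])
    used = [x for x in range(width) if any(row[x] != WHITE for row in rows)]
    if not used:
        return 0, width  # blank page: nothing to trim
    return min(used), max(used) + 1


def merged_canvas(left_size: Tuple[int, int], right_size: Tuple[int, int]) -> Tuple[int, int]:
    """Canvas (width, height) for two crops placed side by side."""
    return left_size[0] + right_size[0], max(left_size[1], right_size[1])


# A 4x2 "page" whose content sits in columns 1 and 2.
page = [
    [WHITE, BLACK, WHITE, WHITE],
    [WHITE, WHITE, BLACK, WHITE],
]
first, last = content_columns(page)
print(first, last)  # 1 3

# First page keeps columns [0, last); second page keeps columns [first, width).
left_crop = (last, len(page))
right_crop = (len(page[0]) - first, len(page))
print(merged_canvas(left_crop, right_crop))  # (6, 2)
```

The second crop is pasted at an x offset equal to the width of the first crop, which is what `new_img.paste(img2_cropped, (img1_cropped.width, 0))` does in `process_pdf`.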

In [None]:
def get_entities(document: documentai.Document) -> Tuple[List[str], List[int]]:
    """
    Return the type of every entity (and of its properties) in the Document object,
    together with the page number on which each one appears
    """

    types = []
    page_no = []
    for entity in document.entities:
        types.append(entity.type_)
        page_no.append(entity.page_anchor.page_refs[0].page)
        for prop in entity.properties:
            types.append(prop.type_)
            page_no.append(entity.page_anchor.page_refs[0].page)
    return types, page_no


def page_merger() -> None:
    """
    Entry function to start page merging process
    """

    print("Page Merger Pipeline started")
    pdfs_and_pages = {}
    for filename in os.listdir(FOLDER_PATH):
        if not filename.endswith(".pdf"):
            continue

        file_path = os.path.join(FOLDER_PATH, filename)
        print(
            "Processing ",
            filename,
        )
        # processing for each PDF file
        document = online_process(
            PROJECT_ID, LOCATION, PROCESSOR_ID, file_path, MIME_TYPE
        )
        types, page_no = get_entities(document)
        df = pd.DataFrame(
            {
                "Type": types,
                "Page No.": page_no,
            }
        )
        df_sorted = df.sort_values(by="Page No.")
        unique_pages = df_sorted["Page No."].unique()
        col_headers = COL_HEADERS
        row_headers = ROW_HEADERS
        page_to_headers = {}
        for page in unique_pages:
            page_to_headers[page] = set(
                df_sorted[df_sorted["Page No."] == page]["Type"].tolist()
            )

        split_pages = []
        for i in range(len(unique_pages) - 1):
            current_page, next_page = unique_pages[i], unique_pages[i + 1]
            current_headers, next_headers = (
                page_to_headers[current_page],
                page_to_headers[next_page],
            )
            if all(row in current_headers for row in row_headers) and not any(
                row in next_headers for row in row_headers
            ):
                if any(header in next_headers for header in col_headers):
                    split_pages.append((current_page, next_page))
        if len(split_pages) > 0:
            pdfs_and_pages[filename] = split_pages
            print(f"\t\tDetected split pages in {filename},  pages are -{split_pages}")

    for pdf_file, pages in pdfs_and_pages.items():
        full_pdf_path = os.path.join(FOLDER_PATH, pdf_file)
        # Create the output folder, including any missing parent folders
        pathlib.Path(OUTPUT_FOLDER).mkdir(parents=True, exist_ok=True)
        if pages:
            print(f"\tPage splits are - {pages}")
            process_pdf(full_pdf_path, pages, OUTPUT_FOLDER)
            print("Processed and saved: ", pdf_file)

    print("Page Merger Pipeline successfully completed for all files")
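
`page_merger()` uses a pandas DataFrame to group the predicted entity types by page. As an illustration only, the same grouping can be done with the standard library. The helper name `group_types_by_page` is hypothetical, and its inputs are assumed to be the two parallel lists returned by `get_entities`.

```python
from collections import defaultdict
from typing import Dict, List, Set


def group_types_by_page(types: List[str], page_no: List[int]) -> Dict[int, Set[str]]:
    """Map each page number to the set of entity types predicted on it."""
    grouped = defaultdict(set)
    for entity_type, page in zip(types, page_no):
        grouped[page].add(entity_type)
    # Sort by page number so consecutive pages can be compared in order.
    return dict(sorted(grouped.items()))


# Made-up entity types and page numbers for illustration.
sample = group_types_by_page(["activity", "SCC", "activity", "code"], [0, 1, 0, 0])
print(sorted(sample))  # [0, 1]
```

The result plays the same role as `page_to_headers` in `page_merger()`.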

To start the Page Merger pipeline, execute the `page_merger()` function.

In [None]:
page_merger()

## 4. Output

If a table spans two pages, the pages are processed and merged side by side into a single page. If a file contains no such table, it is skipped.

You can find the processed files in the given OUTPUT_FOLDER.
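
As a quick illustration of the naming rule, the snippet below repeats the expression that `process_pdf` uses to build the output path. The wrapper name `merged_output_path` and the file name `report.pdf` are hypothetical.

```python
import os


def merged_output_path(pdf_path: str, output_folder: str) -> str:
    """Build the output path the same way process_pdf does."""
    return os.path.join(
        output_folder, os.path.basename(pdf_path).replace(".pdf", "_merged.pdf")
    )


print(merged_output_path("dir_path/to/input_folder/report.pdf", "output_dir/path/"))
# output_dir/path/report_merged.pdf
```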

### Input file has a table across two pages
<table>
  <tr>
    <td><img src="./Images/page_merger_input_1.png" width=300 height=150></td>
    <td><img src="./Images/page_merger_input_2.png" width=300 height=150></td>
  </tr>
 </table>

### After running the page_merger script, the table is on a single page

<table>
  <tr>
    <td><img src="./Images/page_merger_output.png" width=600 height=300></td>
      <td> </td>
  </tr>
    </table>

## Limitations
* *CDE Prediction Impact*: The accuracy of split table detection relies on accurate CDE predictions for row and column headers. Inaccurate predictions could lead to false positives or missed splits, affecting the merging process.
* *Row Headers Across Both Pages*: If both pages contain row headers, the CDE might not correctly differentiate between them, and the script does not treat the pair as a split table.
* *Table Spanning Multiple Pages*: If a single table spans more than two pages, the split might not be detected.
* *Inconsistent Split Position*: If the split between the first and second page varies in row or column alignment, the merged pages might not line up.
* *Single Page Split*: The approach might miss cases where a table is split within a single