# Reorder Document Page Based On Unique Strings

* Author: docai-incubator@google.com

## Disclaimer

This tool is not supported by the Google engineering team or product team. It is provided and supported on a best-effort basis by the **DocAI Incubator Team**. No guarantees of performance are implied.

# Objective

This tool reorders the pages of the input PDFs based on a list of unique strings (i.e., logical identifiers) supplied in their logical order. Each page is matched to the first identifier it contains, and the pages are written to the output file in the order of those identifiers.
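The ordering idea can be illustrated without Document AI. The following is a minimal sketch that works on plain strings; the helper name `order_pages` and the sample page texts are hypothetical and used only for illustration.

```python
from typing import Dict, List


def order_pages(page_texts: List[str], unique_strings: List[str]) -> List[int]:
    """Return page indices sorted by the first identifier found on each page."""
    page_order: Dict[int, int] = {}
    for page_index, text in enumerate(page_texts):
        for order, unique_string in enumerate(unique_strings):
            if unique_string in text:
                # Remember the rank of the first matching identifier.
                page_order[page_index] = order
                break
    # Sort page indices by the rank of their identifier.
    return sorted(page_order, key=page_order.get)


# Hypothetical page texts, listed in their shuffled input order.
pages = [
    "Section C: Signature and date",
    "Section A: Your Contact Information",
    "Section B: Household income",
]
identifiers = ["Your Contact Information", "Household income", "Signature and date"]

new_order = order_pages(pages, identifiers)
print(new_order)  # [1, 2, 0]
```

Here the second input page comes first in the output because it contains the first identifier in the list.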

# Prerequisites

* Vertex AI Notebook
* Documents in a GCS folder
* Output GCS folder for the reordered documents

# Step By Step Procedure

## 1. Import Modules/Packages

In [2]:
%pip install PyPDF2
%pip install google-cloud-storage
%pip install google-cloud-documentai

In [None]:
# Run this cell to download utilities module
# !wget https://raw.githubusercontent.com/GoogleCloudPlatform/document-ai-samples/main/incubator-tools/best-practices/utilities/utilities.py

In [None]:
import io
from pathlib import Path
from typing import List

from google.cloud import documentai_v1beta3 as documentai
from google.cloud import storage
from PyPDF2 import PdfReader, PdfWriter

import utilities

## 2. Input Details

* **PROJECT_ID**: Provide the GCP project ID.
* **LOCATION**: Provide the processor location, `us` or `eu`.
* **PROCESSOR_ID**: Provide the Document AI processor ID.
* **PROCESSOR_VERSION_ID**: Provide the Document AI processor version ID.
* **GCS_INPUT_FOLDER**: Provide the GCS folder containing the input PDFs to be processed.
* **GCS_OUTPUT_FOLDER**: Provide the GCS folder in which to store the processed PDFs.
    * **NOTE**: The input and output buckets must be different. Each result is written to the same object path as its input file, so if both buckets are the same, the tool overwrites the input files with the output data.
* **UNIQUE_STRINGS**: Provide the list of unique strings in logical order. The output page order follows the index order of this list.

In [None]:
PROJECT_ID = "xx-xx-xx"
LOCATION = "us"
PROCESSOR_ID = "xx-xx-xx"
PROCESSOR_VERSION_ID = "pretrained-ocr-v2.0-2023-06-02"
GCS_INPUT_FOLDER = "gs://bucket_1/reorder_document_page_based_on_unique_strings/input"
GCS_OUTPUT_FOLDER = "gs://bucket_2"
# Define the order of unique strings
UNIQUE_STRINGS = [
    "Your Contact Information",
    "if you have one",
    "fill in their information here",
    "person doesn't want coverage",
    "person is filing a joint return",
    "How many babies are expected",
    "If hourly, average number",
    "income will this person get",
    "Veterans Administration",
    "Deduction type",
    "receive any information about their",
    "as amended by the Health Care",
    "information from these outside sources",
    "you are signing as the Authorized",
    "may also check your information",
]
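Because identical input and output buckets lead to overwritten inputs, it can help to check the two folders before processing. The sketch below is one possible way to do that; `split_gcs_uri` is a hypothetical helper, not part of the `utilities` module.

```python
from typing import Tuple


def split_gcs_uri(gcs_uri: str) -> Tuple[str, str]:
    """Split 'gs://bucket/some/prefix' into ('bucket', 'some/prefix')."""
    if not gcs_uri.startswith("gs://"):
        raise ValueError(f"Expected a gs:// URI, got: {gcs_uri}")
    bucket, _, prefix = gcs_uri[len("gs://"):].partition("/")
    return bucket, prefix.strip("/")


input_bucket, input_prefix = split_gcs_uri(
    "gs://bucket_1/reorder_document_page_based_on_unique_strings/input"
)
output_bucket, output_prefix = split_gcs_uri("gs://bucket_2")

# Fail early instead of silently overwriting the input files.
if input_bucket == output_bucket:
    raise ValueError("Input and output buckets must be different")

print(input_bucket, input_prefix)
print(output_bucket, repr(output_prefix))
```

In the notebook, the two literal URIs would be replaced by `GCS_INPUT_FOLDER` and `GCS_OUTPUT_FOLDER`.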

## 3. Run the Below Code-Cells

In [None]:
def layout_to_text(layout: documentai.Document.Page.Layout, text: str) -> str:
    """Document AI identifies text in different parts of the document by their
    offsets in the entirety of the document's text. This function converts
    offsets to a string.

    Args:
        layout (documentai.Document.Page.Layout): It is Layout proto of DocAI Document Object
        text (str): It is a text-data detected by DocAI Processor (i.e, Document.text)

    Returns:
        str: It returns text-data based on offset-indexes of Layout Proto
    """

    # If a text segment spans several lines, it will
    # be stored in different text segments.
    return "".join(
        text[segment.start_index : segment.end_index]
        for segment in layout.text_anchor.text_segments
    )


def store_blob(pdf_bytes: bytes, bucket_name: str, file_name: str) -> None:
    """Store PDF in GCS

    Args:
        pdf_bytes (bytes): Binary Format of pdf data
        bucket_name (str): GCS bucket name
        file_name (str): filename to store in specified GCS bucket
    """

    storage_client = storage.Client()
    process_result_bucket = storage_client.get_bucket(bucket_name)
    document_blob = storage.Blob(
        name=str(Path(file_name)), bucket=process_result_bucket
    )
    document_blob.upload_from_string(pdf_bytes, content_type="application/pdf")


def sort_pdf(
    pdf_bytes: bytes,
    output_pdf_path: str,
    unique_strings: List[str],
    project_id: str,
    location: str,
    processor_id: str,
    processor_version: str,
) -> None:
    """Reorder the PDF pages based on the provided list of strings (`unique_strings`).

    Pages that contain none of the strings are not written to the output PDF.

    Args:
        pdf_bytes (bytes): Binary Format of pdf data
        output_pdf_path (str): Output filename to store in specified GCS bucket
        unique_strings (List[str]): List of unique identifier strings
        project_id (str): GCP Project id
        location (str): Processor Location `us` or `eu`
        processor_id (str): DocumentAI Processor Id
        processor_version (str): Processor version id
    """

    # Call the Document AI service to process the PDF
    document = utilities.process_document_sample(
        project_id, location, processor_id, pdf_bytes, processor_version
    )
    document = document.document
    # Dictionary to hold page number and its position based on unique strings
    page_order = {}

    # Loop through each page in the document
    for page in document.pages:
        # Extract text from each page using Document OCR
        text = ""
        for paragraph in page.paragraphs:
            text += layout_to_text(paragraph.layout, document.text)

        # For each unique string, determine if it's in the content
        for order, unique_string in enumerate(unique_strings):
            if unique_string in text:
                page_order[page.page_number - 1] = order
                break

    # Sort the pages by the order determined
    sorted_pages = sorted(page_order, key=page_order.get)

    # Initialize PDF reader and writer
    reader = PdfReader(io.BytesIO(pdf_bytes))
    writer = PdfWriter()

    # Add pages to the writer in the sorted order
    for page_num in sorted_pages:
        writer.add_page(reader.pages[page_num])
    pdf_buffer = io.BytesIO()
    writer.write(pdf_buffer)
    output_pdf_bytes = pdf_buffer.getvalue()
    store_blob(output_pdf_bytes, output_storage_bucket_name, output_pdf_path)


input_storage_bucket_name = GCS_INPUT_FOLDER.split("/")[2]
input_bucket_path_prefix = "/".join(GCS_INPUT_FOLDER.split("/")[3:])
output_storage_bucket_name = GCS_OUTPUT_FOLDER.split("/")[2]
output_bucket_path_prefix = "/".join(GCS_OUTPUT_FOLDER.split("/")[3:])

storage_client = storage.Client()
source_bucket = storage_client.bucket(input_storage_bucket_name)
_, file_name_dict = utilities.file_names(GCS_INPUT_FOLDER)

for filename, filepath in file_name_dict.items():
    print(f"Process Started for {filename}")
    blob = source_bucket.blob(filepath)
    pdf_bytes = blob.download_as_bytes()
    sort_pdf(
        pdf_bytes,
        filepath,
        UNIQUE_STRINGS,
        PROJECT_ID,
        LOCATION,
        PROCESSOR_ID,
        PROCESSOR_VERSION_ID,
    )
print("Process Completed Successfully for all files")
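Note that `sort_pdf` only adds pages on which one of the `UNIQUE_STRINGS` was found, so a page with no match is missing from the output PDF. A minimal sketch for spotting such pages up front is shown below; the helper `unmatched_pages` and the sample texts are hypothetical, and in the notebook the per-page text would come from the Document AI response.

```python
from typing import List


def unmatched_pages(page_texts: List[str], unique_strings: List[str]) -> List[int]:
    """Return indices of pages that contain none of the unique strings."""
    return [
        index
        for index, text in enumerate(page_texts)
        if not any(unique_string in text for unique_string in unique_strings)
    ]


# Hypothetical per-page OCR text.
page_texts = [
    "Your Contact Information: name, address, phone",
    "A cover sheet with no identifier",
    "Deduction type and yearly amount",
]

missing = unmatched_pages(page_texts, ["Your Contact Information", "Deduction type"])
print(missing)  # [1]
```

If the returned list is not empty, consider adding an identifier for those pages to `UNIQUE_STRINGS` before running the tool.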

## 4. Output Details

After running the script, the pages of each PDF are reordered according to the logical order of the provided `UNIQUE_STRINGS`.

<table>
<tr>
<td> Pre-processing</td>
<td> Post-processing</td>
</tr>
<tr>
<td><img src="./images/pre_processing_sample.png" width=400 height=800></td>
<td><img src="./images/post_processing_sample.png" width=400 height=800></td>
</tr>
</table>