# DocAI OCR Based Documents Sections Splitter

* Author: docai-incubator@google.com

## Disclaimer

This tool is not supported by the Google engineering team or product team. It is provided and supported on a best-effort basis by the **DocAI Incubator Team**. No guarantees of performance are implied.

## Objective
This tool segments PDF documents into distinct sections based on the header coordinates obtained from the Document OCR processor. It then saves each section as an individual image, named after the source file and the section's part number. The tool also lets you specify which sections to save, allowing for selective processing.

## Prerequisites
* Python: Jupyter notebook (Vertex AI) or Google Colab
* Access to a Document AI processor
* Permissions for, or access to, the Google Cloud projects involved
* Document AI JSON output

## Tool Operation Procedure

### 1. Download and Install the required Libraries

In [None]:
!wget https://raw.githubusercontent.com/GoogleCloudPlatform/document-ai-samples/main/incubator-tools/best-practices/utilities/utilities.py
!pip install pillow google-cloud-storage google-cloud-documentai

### 2. Import the Libraries

In [6]:
import base64
from PIL import Image
import json
import io
from google.cloud import storage
from pprint import pprint
import utilities
from google.cloud import documentai_v1beta3
from typing import List, Tuple, Optional, Dict, Any

### 3. Setup the required inputs

In [11]:
bucket_name = "your-bucket-name"
input_folder = "your/input/folder"  # Replace with your input folder path
output_folder = "your/output/folder"  # Replace with your output folder path

search_strings_parts = {
    "Part 1": "Your Contact Information",
    "Part 2": "People in your household",
    "Part 3": "Information about tax returns",
    "Part 4": "Other health insurance coverage",
    "Part 5": "More information about household members",
    "Part 6": "Income from jobs",
    "Part 7": "Income from self-employment",
    "Part 8": "Other income",
    "Part 9": "Deductions",
    "Part 10": "Read and sign this application",
    "Part 11": "Signature",
}

selected_parts = [
    "Part 3",
    "Part 5",
]  # Parts to split out (replace with your actual part names)

# To split all parts, set selected_parts to None
# selected_parts = None

`bucket_name`: This variable should contain the name of the Google Cloud Storage bucket.

`input_folder`: This variable should contain the path to the input folder that holds the Document OCR output JSON files for the PDFs to be processed.
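
As a rough pre-flight check, the sketch below lists the fields that the processing loop in step 4 reads from each JSON file (`text`, `pages`, each page's `image.content`, and the page `tokens`). The helper name `find_missing_fields` and the sample dictionary are hypothetical and only illustrate the idea; they are not part of the tool.

```python
from typing import Any, Dict, List


def find_missing_fields(doc: Dict[str, Any]) -> List[str]:
    """Return the fields that the splitting loop reads but that are absent or empty."""
    problems: List[str] = []
    if not doc.get("text"):
        problems.append("text")
    pages = doc.get("pages") or []
    if not pages:
        problems.append("pages")
    for index, page in enumerate(pages):
        # The loop decodes each page image from base64, so it must be present
        if not (page.get("image") or {}).get("content"):
            problems.append(f"pages[{index}].image.content")
        # Header coordinates come from the page tokens
        if not page.get("tokens"):
            problems.append(f"pages[{index}].tokens")
    return problems


# Hypothetical, heavily truncated document: the second page has no image
sample_doc = {
    "text": "Your Contact Information\n",
    "pages": [
        {"image": {"content": "aGVsbG8="}, "tokens": [{}]},
        {"tokens": [{}]},
    ],
}
missing = find_missing_fields(sample_doc)
print(missing)
```

An empty list suggests the file has what the loop needs; any entry points at a field to investigate before running the split.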

`output_folder`: This variable should contain the path to the output folder where all the split images will be stored.

`search_strings_parts`: This dictionary maps each part name to a unique string that identifies where that part begins. In the provided example, each string is the title of a section on the page. These titles act as delimiters between sections, so each one must match the OCR text exactly (the lookup is case-sensitive and uses the first occurrence).
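
To see which headers will be found before running the full split, a minimal sketch along these lines may help. It mirrors the `str.find` lookup used in step 4; the function name `find_header_offsets` and the sample text are assumptions made for illustration.

```python
from typing import Dict, Optional, Tuple


def find_header_offsets(
    doc_text: str, headers: Dict[str, str]
) -> Dict[str, Optional[Tuple[int, int]]]:
    """Map each part name to the (start, end) offsets of its header, or None if absent."""
    offsets: Dict[str, Optional[Tuple[int, int]]] = {}
    for part, header in headers.items():
        start = doc_text.find(header)  # first occurrence only, case-sensitive
        offsets[part] = (start, start + len(header)) if start != -1 else None
    return offsets


# Hypothetical OCR text containing only the first two headers
sample_text = "Your Contact Information\nName: ...\nPeople in your household\n..."
sample_headers = {
    "Part 1": "Your Contact Information",
    "Part 2": "People in your household",
    "Part 3": "Information about tax returns",
}
header_offsets = find_header_offsets(sample_text, sample_headers)
for part, span in header_offsets.items():
    print(part, "->", span if span is not None else "not found")
```

Parts reported as not found produce no section, so the preceding section would extend down to the next header that was found.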

`selected_parts`: This is the list of parts to save. Specify the part names (the keys of `search_strings_parts`) within the list.

To select all parts, uncomment the line `selected_parts = None` and comment out the preceding list of parts.
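
The selection rule can be sketched as a small predicate. `is_selected` is a hypothetical name; the condition itself is the one applied in step 4 before each section is saved.

```python
from typing import List, Optional


def is_selected(part_key: str, selected_parts: Optional[List[str]]) -> bool:
    """Return True when the part should be saved: None means every part is kept."""
    return selected_parts is None or part_key in selected_parts


all_parts = ["Part 3", "Part 4", "Part 5"]
kept_some = [p for p in all_parts if is_selected(p, ["Part 3", "Part 5"])]
kept_all = [p for p in all_parts if is_selected(p, None)]
print(kept_some)
print(kept_all)
```

Note that an empty list is not the same as `None`: with `selected_parts = []`, no part is saved.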

### 4. Execute the code

In [10]:
from typing import List, Tuple, Optional, Dict
from google.cloud.documentai_v1beta3 import Document


def get_token(
    document: Document, start_index: int, end_index: int
) -> Tuple[Optional[int], Optional[Dict[str, float]], Optional[List], Optional[float]]:
    """
    Extracts the bounding box coordinates and additional information for tokens within a specified range in a Document AI document.

    The function iterates through the pages and tokens of the document, checking if each token falls within the specified index range.
    If it does, the function calculates the normalized coordinates for the token's bounding box and collects other relevant data.

    Args:
    document (Document): A Document AI document object.
    start_index (int): The starting index of the range to search for tokens.
    end_index (int): The ending index of the range to search for tokens.

    Returns:
    Tuple[Optional[int], Optional[Dict[str, float]], Optional[List], Optional[float]]:
        - The page number where the tokens were found.
        - A dictionary containing the minimum and maximum normalized x and y coordinates of the bounding box.
        - A list of text anchor segments.
        - The minimum confidence level found among the tokens, or 1 if no confidence attribute is present.

    If no tokens are found within the specified range, the function returns None for all elements of the tuple.
    """

    # Initialize variables for bounding box coordinates, confidence, and text anchor segments
    min_x_normalized = float("inf")
    min_y_normalized = float("inf")
    max_x_normalized = float("-inf")
    max_y_normalized = float("-inf")
    temp_confidence = []
    temp_text_anc_segments = []

    found_page_number = -1

    def get_token_xy(token) -> Tuple[float, float, float, float]:
        """
        Extracts the normalized x and y coordinates from a token's bounding box.

        Args:
        token: A token object from Document AI.

        Returns:
        Tuple[float, float, float, float]: The minimum and maximum x and y coordinates of the token's bounding box.
        """
        vertices = token.layout.bounding_poly.normalized_vertices
        minx = min(v.x for v in vertices)
        miny = min(v.y for v in vertices)
        maxx = max(v.x for v in vertices)
        maxy = max(v.y for v in vertices)
        return minx, miny, maxx, maxy

    # Iterate through all pages and tokens in the document
    for page_number, page in enumerate(document.pages):
        for token in page.tokens:
            for segment in token.layout.text_anchor.text_segments:
                token_start_index = int(segment.start_index)
                token_end_index = int(segment.end_index)

                # Check if the token is within the range of interest
                if (
                    start_index - 2 <= token_start_index
                    and token_end_index <= end_index + 2
                ):
                    minx, miny, maxx, maxy = get_token_xy(token)

                    # Update bounding box coordinates
                    min_x_normalized = min(min_x_normalized, minx)
                    min_y_normalized = min(min_y_normalized, miny)
                    max_x_normalized = max(max_x_normalized, maxx)
                    max_y_normalized = max(max_y_normalized, maxy)

                    temp_text_anc_segments.append(segment)
                    confidence = (
                        token.layout.confidence
                        if hasattr(token.layout, "confidence")
                        else 1
                    )
                    temp_confidence.append(confidence)

                    if found_page_number == -1:
                        found_page_number = page_number

    final_ver_normalized = {
        "min_x": min_x_normalized,
        "min_y": min_y_normalized,
        "max_x": max_x_normalized,
        "max_y": max_y_normalized,
    }
    final_confidence = min(temp_confidence, default=1)
    final_text_anc = sorted(temp_text_anc_segments, key=lambda x: int(x.end_index))

    if found_page_number == -1:
        return None, None, None, None

    return found_page_number, final_ver_normalized, final_text_anc, final_confidence


def convert_base64_to_image(base64_str: str) -> Image.Image:
    """
    Converts a base64 encoded string to an image.

    This function decodes a base64 encoded string into binary data and then
    loads it into a PIL Image object. It's useful for handling base64 encoded
    images typically found in JSON responses or binary data stored as text.

    Args:
    base64_str (str): A base64 encoded string representing an image.

    Returns:
    Image.Image: A PIL Image object created from the base64 encoded data.

    Example:
        image = convert_base64_to_image(base64_encoded_string)
        image.show() # To display the image
    """
    image_data = base64.b64decode(base64_str)
    image = Image.open(io.BytesIO(image_data))
    return image


def upload_image_to_bucket(
    bucket_name: str,
    destination_blob_name: str,
    image: Image.Image,
    output_folder: str = "",
) -> None:
    """
    Uploads an image to a Google Cloud Storage bucket.

    This function takes a PIL Image object, converts it to a byte stream, and uploads it to
    a specified bucket in Google Cloud Storage. The image is stored in the bucket with the
    given destination name. If an output folder is specified, the image will be uploaded to
    that folder within the bucket.

    Args:
    bucket_name (str): The name of the Google Cloud Storage bucket.
    destination_blob_name (str): The destination blob name within the bucket.
    image (Image.Image): The PIL Image object to be uploaded.
    output_folder (str, optional): The folder within the bucket to store the image. Defaults to "".

    Example:
        upload_image_to_bucket('my_bucket', 'path/to/my_image.jpg', my_image_object)
    """
    # Create a byte stream from the PIL Image object
    img_byte_arr = io.BytesIO()
    image.save(img_byte_arr, format="JPEG")
    img_byte_arr = img_byte_arr.getvalue()

    # Determine the full blob name, including the output folder if provided
    full_blob_name = (
        f"{output_folder}/{destination_blob_name}"
        if output_folder and not destination_blob_name.startswith(output_folder)
        else destination_blob_name
    )

    # Initialize the Google Cloud Storage client and upload the image
    client = storage.Client()
    bucket = client.bucket(bucket_name)
    blob = bucket.blob(full_blob_name)
    blob.upload_from_string(img_byte_arr, content_type="image/jpeg")
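
The core of `get_token` is a union of the bounding boxes of every token whose text range falls inside the header's range, with a tolerance of two characters on each side. The sketch below reproduces that idea on plain tuples instead of Document AI objects; `union_box_in_range` and the sample tokens are hypothetical.

```python
from typing import List, Optional, Tuple

# (min_x, min_y, max_x, max_y) in normalized page coordinates
Box = Tuple[float, float, float, float]
# (start_index, end_index, box) for one OCR token
Token = Tuple[int, int, Box]


def union_box_in_range(
    tokens: List[Token], start_index: int, end_index: int, tolerance: int = 2
) -> Optional[Box]:
    """Return the smallest box covering all tokens inside the index range, or None."""
    selected = [
        box
        for tok_start, tok_end, box in tokens
        if start_index - tolerance <= tok_start and tok_end <= end_index + tolerance
    ]
    if not selected:
        return None
    return (
        min(b[0] for b in selected),
        min(b[1] for b in selected),
        max(b[2] for b in selected),
        max(b[3] for b in selected),
    )


# Hypothetical tokens for "Other income" (offsets 0-12) plus an unrelated token
sample_tokens: List[Token] = [
    (0, 6, (0.10, 0.20, 0.18, 0.22)),
    (6, 12, (0.19, 0.20, 0.30, 0.23)),
    (40, 48, (0.10, 0.50, 0.25, 0.52)),
]
header_box = union_box_in_range(sample_tokens, 0, 12)
print(header_box)
```

Only the `min_y` of the resulting box is used later on, as the top edge of the section.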

In [None]:
client = storage.Client()

blobs = client.list_blobs(bucket_name, prefix=input_folder, delimiter=None)
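# Keep only JSON files; folder placeholders and other objects are skipped
json_blobs = [blob for blob in blobs if blob.name.endswith(".json")]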

for blob in json_blobs:
    json_data = utilities.blob_downloader(bucket_name, blob.name)
    document_object = documentai_v1beta3.Document.from_json(json.dumps(json_data))
    text = json_data["text"]
    doc_text = document_object.text
    pages_data = json_data["pages"]
    images = [convert_base64_to_image(page["image"]["content"]) for page in pages_data]

    total_height = sum(image.height for image in images)
    max_width = max(image.width for image in images)

    combined_image = Image.new("RGB", (max_width, total_height))

    current_height = 0
    for image in images:
        combined_image.paste(image, (0, current_height))
        current_height += image.height

    search_string_dict = {}
    for part, search_string in search_strings_parts.items():
        start_index = doc_text.find(search_string)
        if start_index != -1:
            end_index = start_index + len(search_string)
            # print(start_index, end_index)
            page_number, bounding_box_normalized, text_anchors, confidence = get_token(
                document_object, start_index, end_index
            )
            # Skip headers whose tokens could not be located on any page
            if page_number is None:
                continue
            search_string_dict[search_string] = {
                "page_number": page_number,
                "min_y": bounding_box_normalized["min_y"],
            }

    sorted_sections = sorted(
        search_string_dict.items(),
        key=lambda item: (item[1]["page_number"], item[1]["min_y"]),
    )

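    # Nothing to split if none of the header strings were found in this document
    if not sorted_sections:
        print("No sections found in", blob.name)
        continue

    first_section_page = sorted_sections[0][1]["page_number"]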
    first_section_min_y = sorted_sections[0][1]["min_y"]
    previous_min_y = int(first_section_min_y * images[first_section_page].height) + sum(
        images[j].height for j in range(first_section_page)
    )

    slices = []

    for i, (search_string, details) in enumerate(sorted_sections):
        current_page = details["page_number"]
        current_min_y = int(details["min_y"] * images[current_page].height) + sum(
            images[j].height for j in range(current_page)
        )

        if i == len(sorted_sections) - 1:
            next_min_y_absolute = total_height
        else:
            next_page = sorted_sections[i + 1][1]["page_number"]
            next_min_y = sorted_sections[i + 1][1]["min_y"]
            next_min_y_absolute = int(next_min_y * images[next_page].height) + sum(
                images[j].height for j in range(next_page)
            )

        slice_section = combined_image.crop(
            (0, previous_min_y, max_width, next_min_y_absolute)
        )
        slices.append((search_string, slice_section))

        previous_min_y = next_min_y_absolute

    for index, (search_string, slice_section) in enumerate(slices):
        part_key = next(
            (
                key
                for key, value in search_strings_parts.items()
                if value == search_string
            ),
            None,
        )
        if selected_parts is not None and (part_key not in selected_parts):
            continue

        part_number = part_key.split(" ")[-1]
        original_filename = blob.name.split("/")[-1].replace(".json", "")
        filename = f"{original_filename}_part_{part_number}.jpg".replace(
            " ", "_"
        ).replace("/", "_")
        full_path = f"{output_folder}/{filename}"

        print("Saving -", filename)
        # upload_image_to_bucket expects a PIL image and encodes it as JPEG itself
        upload_image_to_bucket(bucket_name, full_path, slice_section)
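
The cropping logic above stacks all page images vertically and converts each header's normalized `min_y` into an absolute pixel row of the stacked image. A minimal sketch of that arithmetic, using plain integers instead of images, is shown below; `to_absolute_y`, `slice_bounds`, and the sample values are assumptions made for illustration.

```python
from typing import List, Tuple

# (section name, page number, normalized min_y of the header)
Section = Tuple[str, int, float]


def to_absolute_y(page_number: int, min_y: float, page_heights: List[int]) -> int:
    """Convert a normalized y on a page to a pixel row in the stacked image."""
    return sum(page_heights[:page_number]) + int(min_y * page_heights[page_number])


def slice_bounds(
    sections: List[Section], page_heights: List[int]
) -> List[Tuple[str, int, int]]:
    """Return (name, top, bottom) rows; each section ends where the next one starts."""
    ordered = sorted(sections, key=lambda s: (s[1], s[2]))
    tops = [to_absolute_y(page, y, page_heights) for _, page, y in ordered]
    # The last section runs to the bottom of the stacked image
    bottoms = tops[1:] + [sum(page_heights)]
    return [(name, top, bottom) for (name, _, _), top, bottom in zip(ordered, tops, bottoms)]


# Two hypothetical pages of 1000 px each, with one header on each page
bounds = slice_bounds([("B", 1, 0.5), ("A", 0, 0.25)], [1000, 1000])
print(bounds)
```

Because sections are cut from the stacked image, a section that continues onto the next page comes out as one tall image rather than two.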

## Results

Each document is split according to the inputs above, and each selected section is stored as a separate image in the output folder, following the naming pattern `<file_name>_part_<part_number>.jpg`.
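
For reference, the output name is derived from the input blob name and the part key roughly as follows. `section_filename` is a hypothetical helper that repeats the string operations from step 4, and the sample blob name is made up.

```python
def section_filename(blob_name: str, part_key: str) -> str:
    """Build the output image name from the input JSON blob name and the part key."""
    original = blob_name.split("/")[-1].replace(".json", "")
    part_number = part_key.split(" ")[-1]
    # Spaces and slashes are replaced so the name stays a single path component
    return f"{original}_part_{part_number}.jpg".replace(" ", "_").replace("/", "_")


example_name = section_filename("your/input/folder/sample form.json", "Part 3")
print(example_name)
```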

### **Input PDF** 

<img src="./images/input_pdf.png" width=400 height=400 alt="None">

### **Output Split Images**

### **Part 3**
<img src="./images/part_3.png" width=400 height=400 alt="None">

### **Part 4**
<img src="./images/part_4.png" width=400 height=400 alt="None">

### **Part 5**
<img src="./images/part_5.png" width=400 height=400 alt="None">