# PDF Splitter

* Author: docai-incubator@google.com

## Disclaimer

This tool is not supported by the Google engineering team or product team. It is provided and supported on a best-effort basis by the **DocAI Incubator Team**. No guarantees of performance are implied.

# Objective

This notebook provides a Python script that splits a large PDF file into smaller chunk files. The chunk size, given as the number of pages per chunk PDF, is set by the user.

# Prerequisite
* A Vertex AI notebook and the GCS path of the large PDF files.

# Step-by-Step Procedure

## 1. Import Modules/Packages 

In [4]:
%pip install google-cloud-storage
%pip install PyPDF2

In [6]:
# Uncomment the line below and run this cell to download the utilities module
# !wget https://raw.githubusercontent.com/GoogleCloudPlatform/document-ai-samples/main/incubator-tools/best-practices/utilities/utilities.py

In [7]:
from io import BytesIO

from google.cloud import storage
from PyPDF2 import PdfReader, PdfWriter

from utilities import copy_blob, file_names

## 2. Input Details

* **PROJECT_ID**: Provide the GCP project ID
* **BUCKET_NAME**: Provide the GCS bucket name
* **INPUT_FOLDER_PATH**: Provide the GCS folder path that contains the input PDF files, _GCS URI without the bucket name_
* **OUTPUT_FOLDER_PATH**: Provide the GCS folder path where the chunked PDF files are stored, _GCS URI without the bucket name_
* **CHUNK_SIZE**: Provide the number of pages required in each PDF chunk.
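
Because the folder paths are given without the bucket name, the script later joins them with `BUCKET_NAME` to form a `gs://` URI. The sketch below illustrates that convention and a basic sanity check on the chunk size. `build_gcs_uri` and `validate_chunk_size` are hypothetical helpers written for this illustration; they are not part of the notebook or of `utilities.py`.

```python
def build_gcs_uri(bucket_name: str, folder_path: str) -> str:
    """Join a bucket name and a bucket-relative folder path into a gs:// URI."""
    # Strip stray slashes so "/path/" and "path" give the same result.
    return f"gs://{bucket_name}/{folder_path.strip('/')}"


def validate_chunk_size(chunk_size: int) -> int:
    """Return chunk_size unchanged, or raise if it is not a positive integer."""
    # bool is a subclass of int, so reject it explicitly.
    is_int = isinstance(chunk_size, int) and not isinstance(chunk_size, bool)
    if not is_int or chunk_size < 1:
        raise ValueError(f"CHUNK_SIZE must be a positive integer, got {chunk_size!r}")
    return chunk_size


# Placeholder values taken from the input cell of this notebook.
print(build_gcs_uri("bucket_name", "path_to/input_pdfs"))
print(validate_chunk_size(15))
```

Running a check like this before the main loop would surface a mistyped path or a zero chunk size early, rather than partway through a batch.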

In [8]:
PROJECT_ID = "xx-xx-xx"
BUCKET_NAME = "bucket_name"
INPUT_FOLDER_PATH = "path_to/input_pdfs"  # without bucket name
OUTPUT_FOLDER_PATH = "path_to/output"  # without bucket name
CHUNK_SIZE = 15  # number of pages per chunk

## 3. Run the Below Code-cells

In [None]:
def store_blob(bytes_stream: bytes, file: str) -> None:
    """To store PDF files in GCS

    Args:
        bytes_stream (bytes): Binary Format of pdf data
        file (str): filename to store in specified GCS bucket
    """

    storage_client = storage.Client()
    result_bucket = storage_client.get_bucket(BUCKET_NAME)
    document_blob = storage.Blob(name=str(file), bucket=result_bucket)
    document_blob.upload_from_string(bytes_stream, content_type="application/pdf")


def split_pdfs(filepath: str) -> None:
    """Splits the PDF into multiple chunks based on provided CHUNK_SIZE

    Args:
        filepath (str): filepath to read from specified GCS bucket
    """

    storage_client = storage.Client()
    bucket_obj = storage_client.get_bucket(BUCKET_NAME)
    blob = bucket_obj.get_blob(str(filepath))
    pdf_data = BytesIO(blob.download_as_bytes())
    pdf_reader = PdfReader(pdf_data)
    num_pages = len(pdf_reader.pages)
    filename = filepath.split("/")[-1]
    # Drop only the final extension so names such as "report.v2.pdf" keep their full stem
    sub_dir = filename.rsplit(".", 1)[0]
    output_folder_path = OUTPUT_FOLDER_PATH.strip("/")
    if num_pages <= CHUNK_SIZE:
        # copy the PDF file to the destination directory without splitting
        destination_filename = f"{output_folder_path}/{sub_dir}/{filename}"
        print(f"\tcopying blob to {destination_filename}")
        copy_blob(BUCKET_NAME, filepath, BUCKET_NAME, destination_filename)
        return
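    print("Chunking process started")
    # Split the PDF into multiple PDFs of CHUNK_SIZE pages each.
    # Ceiling division avoids an empty trailing chunk when num_pages
    # is an exact multiple of CHUNK_SIZE.
    num_splits = (num_pages + CHUNK_SIZE - 1) // CHUNK_SIZE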
    for i in range(num_splits):
        start_page = i * CHUNK_SIZE
        end_page = min((i + 1) * CHUNK_SIZE, num_pages)

        pdf_writer = PdfWriter()
        for page_num in range(start_page, end_page):
            pdf_writer.add_page(pdf_reader.pages[page_num])

        # Save the split PDF as a new file in the destination directory
        destination_filename = sub_dir + "_" + str(i + 1).zfill(5) + ".pdf"
        destination_filename = f"{output_folder_path}/{sub_dir}/{destination_filename}"

        response_bytes_stream = BytesIO()
        pdf_writer.write(response_bytes_stream)
        bytes_stream = response_bytes_stream.getvalue()
        print("\tStoring to ", destination_filename)
        store_blob(bytes_stream, destination_filename)


_, filenames_dict = file_names(f"gs://{BUCKET_NAME}/{INPUT_FOLDER_PATH}")
# Keep only PDF files; the extension check is case-insensitive
filenames_dict = {
    fn: fp for fn, fp in filenames_dict.items() if fn.lower().endswith(".pdf")
}
for filename, filepath in filenames_dict.items():
    print(f"filename: {filename}")
    try:
        split_pdfs(filepath)
        print(f"Processed: {filename} from {filepath}")
    except Exception as e:
        # Report the failure and continue with the remaining files
        print(f"Failed to process {filename}: {e}")
print("Process completed for all files")
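
To see which pages land in each chunk, the page arithmetic used in `split_pdfs` can be isolated into a pure function that needs no GCS access. This is an illustrative sketch only: `chunk_ranges` is a hypothetical helper, and it steps through the pages with `range`, so it never yields an empty range.

```python
from typing import List, Tuple


def chunk_ranges(num_pages: int, chunk_size: int) -> List[Tuple[int, int]]:
    """Return zero-based (start, end) page ranges, end exclusive, one per chunk."""
    if chunk_size < 1:
        raise ValueError("chunk_size must be at least 1")
    # Each range starts at a multiple of chunk_size; the last one is
    # clipped to num_pages, so it may hold fewer pages than the others.
    return [
        (start, min(start + chunk_size, num_pages))
        for start in range(0, num_pages, chunk_size)
    ]


print(chunk_ranges(40, 15))  # [(0, 15), (15, 30), (30, 40)]
print(chunk_ranges(30, 15))  # [(0, 15), (15, 30)]
```

With `CHUNK_SIZE = 15`, a 40-page PDF yields chunks of 15, 15, and 10 pages, while a 30-page PDF yields exactly two chunks.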

## 4. Output Details

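After the run, OUTPUT_FOLDER_PATH contains one folder per input PDF, named after the file. Each folder holds that file's numbered chunk PDFs, or an unsplit copy when the PDF has no more than CHUNK_SIZE pages.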
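
The naming pattern used for the output objects can be previewed without touching GCS. The sketch below mirrors the path construction in `split_pdfs`; `expected_output_paths` is a hypothetical helper for illustration, and it assumes a file name with a single extension dot and one chunk per non-empty page range.

```python
from typing import List


def expected_output_paths(
    filename: str, num_pages: int, chunk_size: int, output_folder: str
) -> List[str]:
    """List the bucket-relative object names expected for one input PDF."""
    output_folder = output_folder.strip("/")
    # The sub-folder is the file name without its extension.
    sub_dir = filename.rsplit(".", 1)[0]
    if num_pages <= chunk_size:
        # Small files are copied as-is into their own sub-folder.
        return [f"{output_folder}/{sub_dir}/{filename}"]
    # Ceiling division: one chunk per started block of chunk_size pages.
    num_chunks = (num_pages + chunk_size - 1) // chunk_size
    return [
        f"{output_folder}/{sub_dir}/{sub_dir}_{str(i + 1).zfill(5)}.pdf"
        for i in range(num_chunks)
    ]


for path in expected_output_paths("invoice.pdf", 40, 15, "path_to/output"):
    print(path)
# path_to/output/invoice/invoice_00001.pdf
# path_to/output/invoice/invoice_00002.pdf
# path_to/output/invoice/invoice_00003.pdf
```

The five-digit, zero-padded suffix keeps the chunks in page order when the folder is listed alphabetically.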

<img src="./images/output_sample_folders.png" width=800 height=600>
<br>
<img src="./images/output_sample_chunks.png" width=800 height=600>