# Divide PDF To High Quality Images

* Author: docai-incubator@google.com

## Disclaimer

This tool is not supported by the Google engineering team or product team. It is provided and supported on a best-effort basis by the **DocAI Incubator Team**. No guarantees of performance are implied.


## Objective

This tool converts each page of a PDF document into an image file and also saves an upscaled, higher-resolution copy of each page. The goal is for the generated images to preserve the quality and clarity of the original PDF content, making them suitable for archiving, sharing, or further image-based processing.

## Prerequisites
* Vertex AI Notebook
* Python 3

## Step-by-step procedure

### 1. Importing Required Modules

In [None]:
!pip install pymupdf pillow

In [None]:
import fitz  # PyMuPDF
from PIL import Image
import os

### 2. Set up the inputs

* `pdf_path` : The file path to the input PDF document that needs to be processed.
* `output_dir` : The directory where the processed output files (such as images extracted from the PDF) will be saved.
* `DPI` : The resolution of the output images when converting a PDF to an image format. A higher DPI value results in better image quality but increases file size.
* `UPSCALE_FACTOR` : The integer multiplier applied to the width and height of each rendered image to produce the high-resolution copy.
* `JPEG_QUALITY` : The lossy compression quality, passed to the encoder for both JPEG and WebP output. For WebP the range is 0 to 100; for JPEG, Pillow's documentation recommends staying at or below 95. A higher value retains more image detail but results in larger files.
* `USE_WEBP` : A boolean flag (`True`/`False`). When `True`, images are saved in WebP format; otherwise they are saved as JPEG.

In [None]:
pdf_path = "input.PDF"
# Output directory setup
output_dir = "new_output_images"
# Rendering resolution in dots per inch; 200 keeps files smaller than 300 would
DPI = 200
UPSCALE_FACTOR = 4  # Multiplier for the width and height of the upscaled copy
JPEG_QUALITY = 85  # Quality passed to the encoder (JPEG or WebP); lower means smaller files
USE_WEBP = True  # Set to False to save JPEGs instead
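
To get a feel for how `DPI` and `UPSCALE_FACTOR` interact, the arithmetic can be sketched with the standard library alone. This is a rough estimate, not part of the tool: the helper names `rendered_size` and `upscaled_size` are hypothetical, the US Letter page size (612 x 792 points) is an assumed example, and the size PyMuPDF actually produces may differ by a pixel because of rounding. It relies on PDF's default of 72 points per inch and on the WebP format's limit of 16383 pixels per side.

```python
from typing import Tuple

POINTS_PER_INCH = 72   # default PDF user-space unit: 1/72 inch
WEBP_MAX_SIDE = 16383  # largest width or height a WebP file can store


def rendered_size(width_pt: float, height_pt: float, dpi: int) -> Tuple[int, int]:
    """Approximate pixel size of a page of the given size (in points) rendered at `dpi`."""
    scale = dpi / POINTS_PER_INCH
    return round(width_pt * scale), round(height_pt * scale)


def upscaled_size(size: Tuple[int, int], factor: int) -> Tuple[int, int]:
    """Pixel size after multiplying both dimensions by `factor`."""
    return size[0] * factor, size[1] * factor


# Assumed example page: US Letter, 612 x 792 points.
base = rendered_size(612, 792, dpi=200)
big = upscaled_size(base, 4)

print(base)                       # (1700, 2200)
print(big)                        # (6800, 8800)
print(max(big) <= WEBP_MAX_SIDE)  # True: this page still fits in a WebP file
```

Larger pages or higher settings can push the upscaled image past the WebP limit, in which case lowering `UPSCALE_FACTOR` or setting `USE_WEBP = False` would be the likely remedies.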

### 3. Run the code

In [None]:
def main() -> None:
    """
    Converts a PDF into images, saves them in original and high-resolution formats.

    Steps:
    1. Convert each page of the PDF to an image at a specified DPI.
    2. Save the images in JPEG or WebP format.
    3. Upscale the images using the specified factor and save them in high resolution.

    Raises:
        FileNotFoundError: If the specified `pdf_path` does not exist.
    """
    os.makedirs(output_dir, exist_ok=True)

    original_dir = os.path.join(output_dir, "original")
    high_res_dir = os.path.join(output_dir, "high_resolution")
    os.makedirs(original_dir, exist_ok=True)
    os.makedirs(high_res_dir, exist_ok=True)

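    # Fail early with the documented exception if the input file is missing
    if not os.path.isfile(pdf_path):
        raise FileNotFoundError(f"PDF not found: {pdf_path}")

    # Open PDF
    doc = fitz.open(pdf_path)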

    # Choose the output format once, before either loop needs it
    img_format = "WEBP" if USE_WEBP else "JPEG"
    ext = "webp" if USE_WEBP else "jpg"

    # Step 1: Convert PDF pages to images
    for page_num in range(len(doc)):
        page = doc.load_page(page_num)
        pix = page.get_pixmap(dpi=DPI)  # Render the page at the configured DPI
        img = Image.frombytes("RGB", [pix.width, pix.height], pix.samples)

        # Save original image
        jpg_path = os.path.join(original_dir, f"page_{page_num+1}_original.{ext}")
        img.save(jpg_path, img_format, quality=JPEG_QUALITY)

        print(f"Saved original: {jpg_path}")

    # Step 2: Increase resolution (upscale) with better compression
    for page_num in range(len(doc)):
        jpg_path = os.path.join(original_dir, f"page_{page_num+1}_original.{ext}")
        img = Image.open(jpg_path)

        # Get original size
        width, height = img.size

        # Increase resolution (but not too much)
        new_width, new_height = width * UPSCALE_FACTOR, height * UPSCALE_FACTOR
        high_res_img = img.resize((new_width, new_height), Image.LANCZOS)

        # Save high-resolution image
        high_res_jpg_path = os.path.join(
            high_res_dir, f"page_{page_num+1}_high_res.{ext}"
        )
        high_res_img.save(high_res_jpg_path, img_format, quality=JPEG_QUALITY)

        print(f"Saved high-res: {high_res_jpg_path}")

    # Release the PDF file handle now that all pages have been processed
    doc.close()

    print("Processing complete!")


if __name__ == "__main__":
    main()
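
The second loop above re-opens each saved file before resizing it, so with a lossy format the upscaled copy starts from already-compressed pixels and is then compressed a second time. One possible alternative, sketched below under stated assumptions, is to resize the in-memory image and encode it only once. The helper name `upscale_in_memory` and the blank synthetic image standing in for a rendered page are illustrative and not part of the tool.

```python
from io import BytesIO

from PIL import Image


def upscale_in_memory(img: Image.Image, factor: int) -> Image.Image:
    """Return a copy of `img` enlarged by `factor` using Lanczos resampling."""
    width, height = img.size
    return img.resize((width * factor, height * factor), Image.LANCZOS)


# Synthetic stand-in for a rendered page (hypothetical data, not a real PDF page).
page_img = Image.new("RGB", (170, 220), "white")

high_res = upscale_in_memory(page_img, 4)

# Encode once, after upscaling, so the pixels go through lossy compression a single time.
buffer = BytesIO()
high_res.save(buffer, "JPEG", quality=85)

print(high_res.size)  # (680, 880)
```

In the tool itself this would mean calling the helper on `img` inside the first loop, right after the original is saved. Whether the difference is visible depends on the content and the quality setting.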

### Output

After you run the code above, the rendered pages are saved in the `original` subfolder of the output directory, and their upscaled copies are saved in the `high_resolution` subfolder.
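
A quick way to confirm that a run produced every file is to rebuild the expected paths from the naming scheme used in the code and check which ones are missing. The sketch below uses only the standard library. The helper names `expected_outputs` and `missing_outputs` are hypothetical, and the page count of 2 is an assumed example value.

```python
import os
from typing import List


def expected_outputs(output_dir: str, page_count: int, use_webp: bool) -> List[str]:
    """List the file paths the tool should write for a PDF with `page_count` pages."""
    ext = "webp" if use_webp else "jpg"
    paths = []
    for n in range(1, page_count + 1):
        paths.append(os.path.join(output_dir, "original", f"page_{n}_original.{ext}"))
        paths.append(os.path.join(output_dir, "high_resolution", f"page_{n}_high_res.{ext}"))
    return paths


def missing_outputs(output_dir: str, page_count: int, use_webp: bool) -> List[str]:
    """Return the expected paths that do not exist on disk."""
    return [p for p in expected_outputs(output_dir, page_count, use_webp) if not os.path.isfile(p)]


# Assumed example: a two-page PDF processed with the settings from step 2.
for path in expected_outputs("new_output_images", 2, True):
    print(path)
```

An empty list from `missing_outputs` indicates that every expected image was written.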

#### Original Image 
<img src="./images/page_1_original.webp" width=600 height=400 ></img>
#### High Resolution Image
<img src="./images/page_1_high_res.webp" width=600 height=400 ></img>