# Watermark Remover Tool

**Author:** docai-incubator@google.com

## Disclaimer

The Watermark Remover Tool is provided as-is, without any guarantees, by the DocAI Incubator team. It is supported on a best-effort basis; Google Engineering does not provide support for this tool.

## Purpose of the Script

This Python script removes half-tone (gray) watermarks from PDFs and images (for example, JPG or PNG files) using morphological image processing with OpenCV. It automates the pre-processing step of eliminating visible watermarks, and its output is a black-and-white (binarized) image or PDF.

## Considerations and Limitations

The script's effectiveness depends on the complexity and transparency of the watermark, and complex overlays can be hard to remove completely. Image quality, lighting, and contrast variations also affect the result. Some inputs may need manual intervention or specialized techniques. See the Results section for examples of inputs that work well.
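
To illustrate why the darkness of a watermark matters, the sketch below applies a plain fixed threshold to a tiny synthetic array. This is deliberately simpler than the script's morphological approach, and `whiten_light_pixels`, `make_page`, and the `cutoff` value are hypothetical names and values chosen only for illustration.

```python
import numpy as np


def whiten_light_pixels(gray: np.ndarray, cutoff: int = 128) -> np.ndarray:
    """Map pixels brighter than `cutoff` to white (255) and the rest to black (0)."""
    return np.where(gray > cutoff, 255, 0).astype(np.uint8)


def make_page(watermark_level: int) -> np.ndarray:
    """Build a tiny synthetic page: white paper, one text stroke, one watermark band."""
    page = np.full((4, 6), 255, dtype=np.uint8)
    page[1, 1:5] = 20  # dark "text" stroke (4 pixels)
    page[2, :] = watermark_level  # gray "watermark" band (6 pixels)
    return page


light = whiten_light_pixels(make_page(watermark_level=190))
dark = whiten_light_pixels(make_page(watermark_level=90))

# Light watermark: the band is whitened and only the text stays black.
print(int((light == 0).sum()))  # 4
# Dark watermark: the band falls below the cutoff and survives as black pixels.
print(int((dark == 0).sum()))  # 10
```

When the watermark is nearly as dark as the text, a threshold cannot tell the two apart, which is one reason results vary between documents.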

## Prerequisites

1. Python: Jupyter Notebook (Vertex AI)

## Installation Procedure

The script is distributed as a Python notebook. To load and run it:

1. Upload the IPYNB file to a Vertex AI notebook, or copy the code into one, and follow the operation procedure below.

## Operation Procedure

### Install the required libraries

In [None]:
import sys

!{sys.executable} -m pip install numpy pdf2image img2pdf opencv-python

In [35]:
!apt-get install poppler-utils -y
# If running on a Mac, use `brew install poppler`
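
`pdf2image` relies on the Poppler command-line tools (such as `pdftoppm`) being available on the `PATH`. As a quick sanity check after installation, something like the following sketch can be used; `poppler_available` is a hypothetical helper, not part of the notebook.

```python
import shutil


def poppler_available(tool: str = "pdftoppm") -> bool:
    """Return True if the given Poppler command-line tool is found on PATH."""
    return shutil.which(tool) is not None


if not poppler_available():
    print("Poppler not found on PATH; install poppler-utils before converting PDFs.")
```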

### Import the required libraries

In [None]:
import cv2
import numpy as np
from pdf2image import convert_from_path
import img2pdf
import os
from typing import List

In [None]:
def apply_watermark_removal(
    grayscale: np.ndarray, background: np.ndarray
) -> np.ndarray:
    """
    Apply watermark removal to a grayscale image using morphological operations.

    Parameters:
        grayscale (numpy.ndarray): The grayscale image to process.
        background (numpy.ndarray): The background image used for watermark removal.

    Returns:
        numpy.ndarray: The watermark-removed binary image.
    """
    # Compute the difference between the background and grayscale image
    difference = cv2.subtract(background, grayscale)

    # Threshold the difference to create a binary mask
    _, binary = cv2.threshold(
        difference, 0, 255, cv2.THRESH_BINARY_INV | cv2.THRESH_OTSU
    )

    # Threshold the background to obtain the dark region
    _, dark_region = cv2.threshold(
        background, 0, 255, cv2.THRESH_BINARY_INV | cv2.THRESH_OTSU
    )

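    # Get the dark pixels from the grayscale image within the dark region
    dark_mask = dark_region > 0
    dark_pixels = grayscale[dark_mask]

    # Skip the replacement when the dark region is empty (nothing to threshold)
    if dark_pixels.size > 0:
        _, dark_pixels = cv2.threshold(
            dark_pixels, 0, 255, cv2.THRESH_BINARY | cv2.THRESH_OTSU
        )
        # Replace the watermark with the thresholded dark pixels, flattened
        # back to 1-D to match the boolean-mask selection
        binary[dark_mask] = dark_pixels.ravel()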

    return binary


def remove_watermark_pdf(pdf_filename: str) -> None:
    """
    Remove watermarks from a PDF file and save the watermark-removed pages as new images.

    Parameters:
        pdf_filename (str): The path to the input PDF file.
    """
    dpi = 300  # higher dpi results in better resolution
    pages = convert_from_path(pdf_filename, dpi=dpi)

    processed_pages: List[str] = []

    for i, page in enumerate(pages):
        # Convert each PDF page to a grayscale image
        grayscale = cv2.cvtColor(np.array(page), cv2.COLOR_RGB2GRAY)

        background = grayscale.copy()
        for j in range(5):
            kernel_size = 2 * j + 1
            kernel = cv2.getStructuringElement(
                cv2.MORPH_ELLIPSE, (kernel_size, kernel_size)
            )
            background = cv2.morphologyEx(background, cv2.MORPH_CLOSE, kernel)
            background = cv2.morphologyEx(background, cv2.MORPH_OPEN, kernel)

        # Apply watermark removal
        binary = apply_watermark_removal(grayscale, background)

        # Save the watermark-removed image
        output_filename = f"{pdf_filename}_no_watermark_{i}.jpg"
        cv2.imwrite(output_filename, binary)
        processed_pages.append(output_filename)

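    # Convert the watermark-removed images into a new PDF
    pdf_output_filename = f"{pdf_filename}_no_watermark.pdf"
    imgs = []
    for image_path in processed_pages:
        # Read each page with a context manager so the file handle is closed
        with open(image_path, "rb") as image_file:
            imgs.append(image_file.read())
    with open(pdf_output_filename, "wb") as f:
        f.write(img2pdf.convert(imgs))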

    # Remove the temporary image files
    for p in processed_pages:
        os.remove(p)


def remove_watermark_image(filename: str) -> None:
    """
    Remove watermark from an image and save the watermark-removed image.

    Parameters:
        filename (str): The path to the input image file.
    """
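    image = cv2.imread(filename)
    if image is None:
        # cv2.imread returns None instead of raising when the file cannot be read
        raise FileNotFoundError(f"Could not read image: {filename}")
    grayscale = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)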

    background = grayscale.copy()
    kernel_size = 5
    kernel = cv2.getStructuringElement(cv2.MORPH_ELLIPSE, (kernel_size, kernel_size))
    background = cv2.morphologyEx(background, cv2.MORPH_CLOSE, kernel)
    background = cv2.morphologyEx(background, cv2.MORPH_OPEN, kernel)

    # Apply watermark removal
    binary = apply_watermark_removal(grayscale, background)

    # Save the watermark-removed image
    output_filename = f"{filename}_no_watermark.jpg"
    cv2.imwrite(output_filename, binary, [cv2.IMWRITE_JPEG_QUALITY, 100])

    print(f"Watermark removed. Output image saved as {output_filename}")

### Call the functions

Calling either function with the path to a PDF or image generates a new file without the watermark. The output is saved next to the input, with `_no_watermark` and the output extension appended to the original filename.
Feel free to customize the code to suit your needs.
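
As one possible customization, the sketch below processes every supported file in a folder. It is only an illustration under stated assumptions: `classify_input`, `process_folder`, and the set of image suffixes are hypothetical, and the handlers are passed in as callables so the block stays self-contained.

```python
from pathlib import Path
from typing import Callable, Dict, List

# Assumed set of image extensions; adjust to the formats you actually use.
IMAGE_SUFFIXES = {".jpg", ".jpeg", ".png"}


def classify_input(path: str) -> str:
    """Return "pdf", "image", or "skip" based on the file extension."""
    suffix = Path(path).suffix.lower()
    if suffix == ".pdf":
        return "pdf"
    if suffix in IMAGE_SUFFIXES:
        return "image"
    return "skip"


def process_folder(
    folder: str, handlers: Dict[str, Callable[[str], None]]
) -> List[str]:
    """Call the matching handler for each supported file and return their names."""
    processed: List[str] = []
    for path in sorted(Path(folder).iterdir()):
        kind = classify_input(path.name)
        # Ignore directories, unsupported files, and outputs of earlier runs.
        if not path.is_file() or kind == "skip" or "_no_watermark" in path.name:
            continue
        handlers[kind](str(path))
        processed.append(path.name)
    return processed
```

In the notebook, this could be called as `process_folder("./sample-files", {"pdf": remove_watermark_pdf, "image": remove_watermark_image})`.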

In [None]:
remove_watermark_pdf("./sample-files/input1.pdf")
remove_watermark_image("./sample-files/input3.png")
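
Going by the f-strings in the two functions, the output paths for the calls above can be predicted with a small helper. `expected_output` is a hypothetical name used only for this sketch.

```python
from pathlib import Path


def expected_output(path: str) -> str:
    """Mirror the naming used by the two functions: PDFs stay PDFs, images become JPGs."""
    suffix = ".pdf" if Path(path).suffix.lower() == ".pdf" else ".jpg"
    return f"{path}_no_watermark{suffix}"


for source in ["./sample-files/input1.pdf", "./sample-files/input3.png"]:
    print(expected_output(source))
# ./sample-files/input1.pdf_no_watermark.pdf
# ./sample-files/input3.png_no_watermark.jpg
```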

## Results

Input             |  Output
:-------------------------:|:-------------------------:
![Input 1](./sample-files/input1.jpg)   |  ![Output 1](./sample-files/output1.jpg)
![Input 2](./sample-files/input2.png)   |  ![Output 2](./sample-files/output2.jpg)
![Input 3](./sample-files/input3.png)   |  ![Output 3](./sample-files/output3.png)