## TODO: PDF Page Classification Task

Objective:
Implement the `classify_page` function within the given code structure to classify each page of a PDF file into one of three categories based on its content and readability.

Code Structure:
The main function `classify_all_pages` is already implemented, but if necessary you are allowed to change its implementation. Your task is to complete the `classify_page` function.

Input:
- A PDF file path (the function should be able to handle various PDF files)

Output:
- A list of integers, where each integer represents the class of a page in the PDF

Classification Categories:

0: Machine-readable / searchable
   - Pages with text that can be directly extracted and searched within the PDF

1: Non-machine readable but OCR-able
   - Pages containing text that isn't directly extractable but can be recognized through OCR
   - Essentially, these are pages with visible text but stored as images

2: Non-machine readable and not OCR-able
   - Pages without any recognizable text
   - This may include pages with only images, complex graphics, or blank pages
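
The three categories map naturally onto a small enumeration. The sketch below is one possible way to name them; `PageClass` and its member names are illustrative assumptions and are not part of the provided code structure.

```python
from enum import IntEnum


class PageClass(IntEnum):
    """Hypothetical names for the three page classes described above."""
    MACHINE_READABLE = 0   # text can be extracted directly
    OCR_ABLE = 1           # text is visible but stored as an image
    NOT_OCR_ABLE = 2       # no recognizable text (images, graphics, blank)


# IntEnum members compare equal to plain integers, so a list of them
# can still be converted to the required "list of integers" output.
classes = [PageClass.NOT_OCR_ABLE, PageClass.MACHINE_READABLE]
print([int(c) for c in classes])
```

Using named members instead of bare `0`, `1`, `2` may make the return statements in `classify_page` easier to read, at the cost of one extra definition.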

Task:
1. Implement the `classify_page` function:
   - Input: A single page object (PdfReader.PageObject)
   - Output: An integer (0, 1, or 2) representing the page's class

2. The function should analyze the content of the page and determine its class based on the categories described above.

3. Ensure your implementation is robust and can handle various types of PDF content.

Requirements:
1. The function should work with different PDF files, not just a specific one.
2. Implement methods to distinguish between the three categories accurately.
3. Handle potential exceptions or edge cases (e.g., corrupted pages, mixed content types on a single page).
4. Optimize for both accuracy and processing speed, as the function will be called for each page in the PDF.
5. You are allowed to use up to 40GB of GPU VRAM if necessary for your implementation.

Additional Considerations:
- You may use additional libraries if needed, but ensure they are imported properly.
- Provide clear comments in your code to explain the classification logic.

Testing:
- Test your implementation with various types of PDFs to ensure its robustness and generalizability.
- The main script provides a way to test your implementation on a file named "grouped_documents.pdf".


In [1]:
!pip install pdf2image



In [2]:
!pip install pymupdf



In [3]:
!pip install PyPDF2 pytesseract pillow pdf2image



In [4]:
!apt-get update
!apt-get install -y tesseract-ocr



In [5]:
import fitz  # PyMuPDF

def crop_pdf(input_pdf_path, output_pdf_path, top_percentage=10, bottom_percentage=10):
    # Open the original PDF
    pdf_document = fitz.open(input_pdf_path)

    # Create a new PDF for output
    new_pdf_document = fitz.open()

    # Iterate over each page in the original PDF
    for page_num in range(len(pdf_document)):
        # Load the page
        page = pdf_document.load_page(page_num)

        # Get the dimensions of the page
        rect = page.rect
        width = rect.width
        height = rect.height

        # Calculate the heights of the sections to discard
        top_height = height * (top_percentage / 100.0)
        bottom_height = height * (bottom_percentage / 100.0)

        # Define the crop box for the remaining middle section
        middle_crop_box = fitz.Rect(0, top_height, width, height - bottom_height)

        # Create a new page for the remaining section
        new_page = new_pdf_document.new_page(width=width, height=(height - top_height - bottom_height))
        new_page.show_pdf_page(fitz.Rect(0, 0, width, height - top_height - bottom_height), pdf_document, page_num, clip=middle_crop_box)

    # Save the new PDF
    new_pdf_document.save(output_pdf_path)
    new_pdf_document.close()
    pdf_document.close()



In [7]:
from PIL import Image
import pytesseract
import io

def extract_text_from_pdf_page(page):
    """Extracts text from a PDF page using PyMuPDF."""
    try:
        text = page.get_text() or ""
        return text
    except Exception as e:
        print(f"Error extracting text: {e}")
        return ""

def ocr_image(image):
    """Perform OCR on an image using pytesseract."""
    try:
        ocr_text = pytesseract.image_to_string(image)
        return ocr_text
    except Exception as e:
        print(f"Error during OCR: {e}")
        return ""

def convert_page_to_image(page):
    """Convert a PDF page to an image using PyMuPDF."""
    try:
        # Render the page as a pixmap
        pix = page.get_pixmap()
        # Convert pixmap to PIL Image
        img = Image.open(io.BytesIO(pix.tobytes()))
        return img
    except Exception as e:
        print(f"Error converting page to image: {e}")
        return None



In [13]:
from typing import List
from PyPDF2 import PdfReader, PdfWriter
import io
import numpy as np

def classify_all_pages(input_pdf: str) -> List[int]:
    """
    Analyze all pages in the input PDF and determine the class of each page.

    Args:
    input_pdf (str): The file path of the input PDF.

    Returns:
    List[int]: A list of classes for each page.
            0: machine-readable
            1: non-machine readable but OCR-able
            2: non-machine readable and not OCR-able
    """
    # Open PDF file with PyMuPDF
    pdf_document = fitz.open(input_pdf)

    page_classes = []

    for page_number in range(len(pdf_document)):
        page = pdf_document.load_page(page_number)
        page_class = classify_page(page)
        page_classes.append(page_class)

    pdf_document.close()
    return page_classes

def classify_page(page):
    """Classifies a PDF page into one of three categories."""
    # Extract text directly from the PDF page
    text = extract_text_from_pdf_page(page)


    if text.strip():
        return 0  # Machine-readable / Searchable

    # Convert the PDF page to an image
    image = convert_page_to_image(page)

    if image is None:  # rendering failed, so OCR is not possible
        return 2  # Non-machine readable and not OCR-able

    # Perform OCR on the image
    ocr_text = ocr_image(image)
    print(f"OCR text: {ocr_text}")

    if ocr_text.strip():
        return 1  # Non-machine readable but OCR-able

    return 2  # Non-machine readable and not OCR-able

# Usage
input_pdf: str = "/content/grouped_documents.pdf"
output_pdf_path = "/content/output.pdf"
crop_pdf(input_pdf, output_pdf_path)
page_classes: List[int] = classify_all_pages(output_pdf_path)
print(f"Classes for each page: {page_classes}")

OCR text: 
Classes for each page: [2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]


# Task 2
## Implementation

Each page of the file must be assigned one of three classes:
- 0: machine-readable
- 1: non-machine readable but OCR-able
- 2: non-machine readable and not OCR-able

The input file path is provided in `input_pdf`. I first cropped each page to its middle section, discarding the top 10% and bottom 10%, to remove the page numbers. This is because a page number on its own would otherwise make a page count as machine-readable. I then passed each page of the cropped document to the `classify_page` function, which classifies it.
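
The crop arithmetic can be checked without opening a PDF. The sketch below repeats the calculation used in `crop_pdf` as a pure function; `middle_crop_box` is a hypothetical helper name, and the 600 x 800 page size is only an example.

```python
from typing import Tuple


def middle_crop_box(width: float, height: float,
                    top_percentage: float = 10,
                    bottom_percentage: float = 10
                    ) -> Tuple[float, float, float, float]:
    """Return (x0, y0, x1, y1) of the middle band kept after cropping."""
    # Heights of the strips discarded at the top and bottom of the page
    top_height = height * (top_percentage / 100.0)
    bottom_height = height * (bottom_percentage / 100.0)
    return (0.0, top_height, width, height - bottom_height)


# Example: an 800-point-tall page keeps the band from y=80 to y=720.
x0, y0, x1, y1 = middle_crop_box(600, 800)
print(x0, y0, x1, y1)
```

With the default percentages, 80% of the page height is kept, so any body text that falls in the outer 10% strips is discarded along with the page numbers.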

### Logic

First I extracted the text directly from the page; if any text is found, the page is machine-readable (class 0). If not, I converted the page to an image and ran OCR on it; if OCR returns text, the page is non-machine readable but OCR-able (class 1). If neither check finds text, or if the page cannot be converted to an image, the page is non-machine readable and not OCR-able (class 2).
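
Both checks above treat any non-whitespace output as text. OCR engines can sometimes emit a few stray characters on pages that contain no real text, so a stricter test may be worth trying. The sketch below is one possible variant, not the method used in this notebook; `has_meaningful_text` and its `min_alnum` threshold are hypothetical and untuned.

```python
def has_meaningful_text(text: str, min_alnum: int = 3) -> bool:
    """Return True if `text` has at least `min_alnum` letters or digits.

    `min_alnum` is a hypothetical threshold, not a tuned value.
    """
    # Count only alphanumeric characters so that whitespace and
    # isolated punctuation marks do not count as text.
    alnum_count = sum(1 for ch in text if ch.isalnum())
    return alnum_count >= min_alnum


print(has_meaningful_text("  \n "))        # whitespace only
print(has_meaningful_text(". |"))          # stray marks
print(has_meaningful_text("Invoice 42"))   # real text
```

Under this assumption, `classify_page` could call `has_meaningful_text(text)` in place of `text.strip()`; a suitable threshold would need to be chosen by testing on real documents.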
