## TODO: PDF Page Classification Task

Objective:
Implement the `classify_page` function within the given code structure to classify each page of a PDF file into one of three categories based on its content and readability.

Code Structure:
The main function `classify_all_pages` is already implemented, but you may change its implementation if necessary. Your task is to complete the `classify_page` function.

Input:
- A PDF file path (the function should be able to handle various PDF files)

Output:
- A list of integers, where each integer represents the class of a page in the PDF

Classification Categories:

0: Machine-readable / searchable
   - Pages with text that can be directly extracted and searched within the PDF

1: Non-machine readable but OCR-able
   - Pages containing text that isn't directly extractable but can be recognized through OCR
   - Essentially, these are pages with visible text but stored as images

2: Non-machine readable and not OCR-able
   - Pages without any recognizable text
   - This may include pages with only images, complex graphics, or blank pages
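
The three categories above suggest a simple decision order. The sketch below assumes the page's directly extracted text and its OCR text are already available as strings. `PageClass` and `decide_class` are hypothetical names used for illustration only; they are not part of the provided scaffold.

```python
from enum import IntEnum


class PageClass(IntEnum):
    """Hypothetical labels mirroring the three categories in the task."""
    MACHINE_READABLE = 0
    OCR_ABLE = 1
    NOT_OCR_ABLE = 2


def decide_class(extracted_text: str, ocr_text: str) -> int:
    """Map the two text sources to a class, checking the cheaper source first."""
    # Class 0: the PDF itself exposes non-whitespace text.
    if extracted_text and extracted_text.strip():
        return int(PageClass.MACHINE_READABLE)
    # Class 1: no extractable text, but OCR recognized something.
    if ocr_text and ocr_text.strip():
        return int(PageClass.OCR_ABLE)
    # Class 2: neither source produced any text.
    return int(PageClass.NOT_OCR_ABLE)
```

Checking the extracted text first means the slower OCR step only matters when extraction yields nothing. A plain non-empty check is the loosest possible rule; pages with mixed content may need a stricter one.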

Task:
1. Implement the `classify_page` function:
   - Input: A single page object (PdfReader.PageObject)
   - Output: An integer (0, 1, or 2) representing the page's class

2. The function should analyze the content of the page and determine its class based on the categories described above.

3. Ensure your implementation is robust and can handle various types of PDF content.

Requirements:
1. The function should work with different PDF files, not just a specific one.
2. Implement methods to distinguish between the three categories accurately.
3. Handle potential exceptions or edge cases (e.g., corrupted pages, mixed content types on a single page).
4. Optimize for both accuracy and processing speed, as the function will be called for each page in the PDF.
5. You are allowed to use up to 40GB of GPU VRAM if necessary for your implementation.
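
One way to approach requirements 2 and 3 is to avoid treating every non-empty OCR result as real text, since OCR engines can emit stray characters for graphics or photos. The sketch below filters OCR tokens by confidence. It assumes the input is a dictionary with parallel `text` and `conf` lists, the shape returned by `pytesseract.image_to_data(image, output_type=pytesseract.Output.DICT)`. The function names and the thresholds (`min_conf=60.0`, `min_words=3`) are illustrative assumptions, not tuned values.

```python
from typing import Dict, List


def count_confident_words(ocr_data: Dict[str, List], min_conf: float = 60.0) -> int:
    """Count OCR tokens that contain a letter or digit and meet the confidence threshold."""
    count = 0
    for word, conf in zip(ocr_data.get("text", []), ocr_data.get("conf", [])):
        try:
            conf_value = float(conf)  # confidences may arrive as strings or numbers
        except (TypeError, ValueError):
            continue  # skip entries whose confidence cannot be parsed
        if conf_value >= min_conf and any(ch.isalnum() for ch in str(word)):
            count += 1
    return count


def looks_ocr_able(ocr_data: Dict[str, List], min_conf: float = 60.0, min_words: int = 3) -> bool:
    """Treat a page as OCR-able only if enough confident words were found (assumed thresholds)."""
    return count_confident_words(ocr_data, min_conf) >= min_words
```

Because the functions only inspect a plain dictionary, they can be exercised without running OCR, and the thresholds can be adjusted after inspecting real pages.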

Additional Considerations:
- You may use additional libraries if needed, but ensure they are imported properly.
- Provide clear comments in your code to explain the classification logic.

Testing:
- Test your implementation with various types of PDFs to ensure its robustness and generalizability.
- The main script provides a way to test your implementation on a file named "grouped_documents.pdf".
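
To make testing across several PDFs repeatable, predictions can be compared against hand-labeled classes. The helper below is a minimal sketch; `compare_to_labels` is a hypothetical name, and the expected labels would have to be produced by inspecting each page manually.

```python
from collections import Counter
from typing import Dict, List, Tuple


def compare_to_labels(
    predicted: List[int], expected: List[int]
) -> Tuple[float, Dict[Tuple[int, int], int]]:
    """Return the accuracy and a count of (expected, predicted) class pairs."""
    if len(predicted) != len(expected):
        raise ValueError("predicted and expected must have the same length")
    if not expected:
        return 0.0, {}
    pairs = Counter(zip(expected, predicted))
    # Pairs where both classes agree are the correctly classified pages.
    correct = sum(n for (exp, pred), n in pairs.items() if exp == pred)
    return correct / len(expected), dict(pairs)
```

The pair counts show which classes are being confused, for example how many pages labeled 1 were predicted as 2.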


In [1]:
from typing import List
from PyPDF2 import PdfReader, PdfWriter
import io
import numpy as np

def classify_all_pages(input_pdf: str) -> List[int]:
    """
    Analyze all pages in the input PDF and determine the class of each page.

    Args:
    input_pdf (str): The file path of the input PDF.

    Returns:
    List[int]: A list of classes for each page. 
            0: machine-readable
            1: non-machine readable but OCR-able
            2: non-machine readable and not OCR-able
    """
    reader = PdfReader(input_pdf)
    writer = PdfWriter()
    
    classes = []
    for page_number in range(len(reader.pages)):
        current_page = reader.pages[page_number]
        
        page_class = classify_page(current_page)
        classes.append(page_class)
    
    return classes

def classify_page(page: 'PdfReader.PageObject') -> int:
    """
    Determine the class of the pdf page.

    Args:
    page (PdfReader.PageObject): A single page from a PDF.

    Returns:
    int: The page is 
        0: machine-readable
        1: non-machine readable but OCR-able
        2: non-machine readable and not OCR-able
    """
    # TODO: Implement a function that determines the class of the pdf page
    return 2

# Usage
input_pdf: str = "grouped_documents.pdf"
page_classes: List[int] = classify_all_pages(input_pdf)
print(f"Classes for each page: {page_classes}")

Classes for each page: [2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2]


In [11]:
from typing import List
from PyPDF2 import PdfReader, PdfWriter
import pytesseract
from pdf2image import convert_from_path
from PIL import Image
import tempfile
import os

pytesseract.pytesseract.tesseract_cmd = r'C:\Program Files\Tesseract-OCR\tesseract.exe'

def classify_all_pages(input_pdf: str) -> List[int]:
    """
    Analyze all pages in the input PDF and determine the class of each page.

    Args:
    input_pdf (str): The file path of the input PDF.

    Returns:
    List[int]: A list of classes for each page. 
            0: machine-readable
            1: non-machine readable but OCR-able
            2: non-machine readable and not OCR-able
    """
    reader = PdfReader(input_pdf)
    writer = PdfWriter()
    
    classes = []
    for page_number in range(len(reader.pages)):
        current_page = reader.pages[page_number]
        
        page_class = classify_page(current_page)
        classes.append(page_class)
    
    return classes

# Render a single PDF page to a PIL image by writing it to a temporary one-page PDF
def convert_pdf_page_to_image(page: 'PdfReader.PageObject') -> 'Image.Image':
    pdf_writer = PdfWriter()
    pdf_writer.add_page(page)
    
    with tempfile.NamedTemporaryFile(suffix=".pdf", delete=False) as temp_pdf:
        pdf_writer.write(temp_pdf)
        temp_pdf_path = temp_pdf.name
    
    images = convert_from_path(temp_pdf_path, dpi=300)
    os.remove(temp_pdf_path)
    
    return images[0]

def classify_page(page: 'PdfReader.PageObject') -> int:
    """
    Determine the class of the pdf page.

    Args:
    page (PdfReader.PageObject): A single page from a PDF.

    Returns:
    int: The page is 
        0: machine-readable
        1: non-machine readable but OCR-able
        2: non-machine readable and not OCR-able
    """
    # Check if the page is machine-readable by extracting text.
    # Normalize a missing result to an empty string so later comparisons are safe.
    text = (page.extract_text() or "").strip()
    if text:
        print('machine_readable:')
        print(text)
        #return 0
    
    # Render the page to an image and use OCR to check if it has recognizable text.
    # ocr_text is initialized first so the comparison below still works if OCR fails.
    ocr_text = ""
    try:
        # Convert page to image
        image = convert_pdf_page_to_image(page)
        ocr_text = (pytesseract.image_to_string(image) or "").strip()
        if ocr_text:
            print('ocr_readable:')
            print(ocr_text)
            #return 1
    except Exception as e:
        print(f"Error during OCR processing: {e}")

    # Debug check: does the extracted text match the OCR text exactly?
    print('the same?:')
    if text == ocr_text:
        print('bazinga')
    print('\n')
    # Exploratory cell: the early returns above are commented out,
    # so every page currently falls through to class 2.
    return 2

# Usage
input_pdf: str = "grouped_documents.pdf"
page_classes: List[int] = classify_all_pages(input_pdf)
print(f"Classes for each page: {page_classes}")

machine_readable:
Document 1 - Page 1
ocr_readable:
Document 1 - Page 1
the same?:
bazinga


machine_readable:
This is page 2 of Document 1
Document 1 - Page 2
ocr_readable:
N
(0)
od)
©

ae

i]

—_

_~
Cc
0)
E
=)
oO
oO

QO
the same?:


machine_readable:
This is page 3 of Document 1
Document 1 - Page 3
ocr_readable:
This is page 3 of Document 1

Document 1 - Page 3
the same?:


machine_readable:
Document 2
Document 2 - Page 1
the same?:


machine_readable:
Document 2
Document 2 - Page 2
the same?:


machine_readable:
Document 2
Document 2 - Page 3
the same?:


machine_readable:
This is page 1 of Document 3
Document 3 - Page 1
ocr_readable:
Document 3 - Page 1
the same?:


machine_readable:
This is page 2 of Document 3
Document 3 - Page 2
ocr_readable:
Document 3 - Page 2
the same?:


machine_readable:
This is page 3 of Document 3
Document 3 - Page 3
ocr_readable:
Document 3 - Page 3
the same?:


machine_readable:
Document 4 - Page 1
This is page 1 of Document 4
the same?:


machine_read