## PDF Page Rotation Angle Detection Task

Objective:
Implement the `determine_rotation_angle` function within the given code structure to detect the rotation angle of each page in a PDF file.

Code Structure:
The main function `rotate_all_pages_upright` is already implemented, though you may change it if necessary. Your task is to complete the `determine_rotation_angle` function.

Input:
- A PDF file path (the function should be able to handle various PDF files)

Output:
- A list of integers, where each integer represents the rotation angle needed for a page in the PDF

Rotation Angle:
- The rotation angle should be in degrees, normalized to the range [0, 359].
- 0 means the page is already upright
- 90 means the page needs to be rotated 90 degrees clockwise to be upright
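- 180 means the page is upside down, and 270 means it needs to be rotated 270 degrees clockwise (that is, 90 degrees counterclockwise) to be upright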
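
The normalization described above can be sketched as follows. This is a minimal illustration, not part of the provided code structure; the helper name `normalize_to_int_degrees` is hypothetical.

```python
def normalize_to_int_degrees(angle: float) -> int:
    """Round an angle to the nearest whole degree and wrap it into [0, 359]."""
    # Python's % always returns a non-negative result for a positive modulus,
    # so negative angles wrap correctly (e.g. -90 becomes 270).
    return int(round(angle)) % 360


print(normalize_to_int_degrees(-90))    # 270
print(normalize_to_int_degrees(359.6))  # 0
print(normalize_to_int_degrees(450))    # 90
```

Rounding before wrapping matters: 359.6 rounds to 360, which then wraps to 0 instead of falling outside the allowed range.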

Task:
1. Implement the `determine_rotation_angle` function:
   - Input: A single page object (PdfReader.PageObject)
   - Output: An integer representing the rotation angle in degrees

2. The function should analyze the content of the page and determine the angle needed to make the page upright.
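
One way to turn page content into an angle is to look at the writing direction of text lines. The sketch below assumes a direction vector `(dx, dy)` expressed in image coordinates (x to the right, y downward); whether a given PDF library reports directions in that convention should be checked against its documentation. The helper name `clockwise_correction` is hypothetical.

```python
import math


def clockwise_correction(dx: float, dy: float) -> int:
    """Clockwise rotation (degrees) that brings text with writing direction
    (dx, dy) back to left-to-right horizontal.

    Assumption: (dx, dy) is in image coordinates (y points down), so a
    positive atan2 angle means the text runs clockwise from horizontal.
    """
    text_angle = math.degrees(math.atan2(dy, dx))
    # Undo the text's rotation, then wrap into [0, 359].
    return int(round(-text_angle)) % 360


print(clockwise_correction(1, 0))    # 0   (already upright)
print(clockwise_correction(0, -1))   # 90  (text runs upward)
print(clockwise_correction(-1, 0))   # 180 (text is upside down)
print(clockwise_correction(0, 1))    # 270 (text runs downward)
```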

Requirements:
1. The function should work with different PDF files, not just a specific one.
2. Implement robust methods to determine the correct rotation angle.
3. Handle potential exceptions or edge cases (e.g., pages with mixed orientations, complex layouts).
4. Optimize for both accuracy and processing speed, as the function will be called for each page in the PDF.
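
For pages with mixed orientations, a simple and fast option is a majority vote over per-line angles. The following is a sketch under that assumption; `dominant_angle` and its `bin_size` parameter are hypothetical names, and binning is just one possible way to group nearly equal angles.

```python
from collections import Counter
from typing import Iterable, Optional


def dominant_angle(angles: Iterable[float], bin_size: int = 1) -> Optional[int]:
    """Return the most common angle after rounding into bins of `bin_size` degrees.

    Returns None when there are no angles to vote on.
    """
    # Round each angle to its bin, then wrap so that e.g. 359.8 joins the 0 bin.
    binned = [(int(round(a / bin_size)) * bin_size) % 360 for a in angles]
    if not binned:
        return None
    return Counter(binned).most_common(1)[0][0]


print(dominant_angle([0.2, 359.8, 90.0, 0.0]))  # 0
print(dominant_angle([88, 91, 92], bin_size=5))  # 90
```

A coarser `bin_size` makes the vote more tolerant of small measurement noise at the cost of angular resolution.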

Additional Considerations:
- You are allowed to use up to 40GB of GPU VRAM if necessary for your implementation.
- You may create as many additional functions as needed to support your implementation.
- You may use additional libraries if required, but ensure they are imported properly.
- Provide clear comments in your code to explain your rotation detection logic.

Testing:
- Test your implementation with various types of PDFs to ensure its robustness and generalizability.
- The main script provides a way to test your implementation on a file named "grouped_documents.pdf".
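
When comparing detected angles against known values, a plain subtraction misjudges angles near the 0/360 boundary. The sketch below uses a circular difference instead; `angular_error`, `within_tolerance`, and the sample values are hypothetical and only illustrate the idea.

```python
from typing import Sequence


def angular_error(predicted: float, expected: float) -> float:
    """Smallest absolute difference between two angles, in degrees (0 to 180)."""
    diff = abs(predicted - expected) % 360
    return min(diff, 360 - diff)


def within_tolerance(predicted: Sequence[float],
                     expected: Sequence[float],
                     tol: float = 2.0) -> bool:
    """True if every page's predicted angle is within `tol` degrees of the expected one."""
    if len(predicted) != len(expected):
        return False
    return all(angular_error(p, e) <= tol for p, e in zip(predicted, expected))


print(angular_error(359, 1))                        # 2
print(within_tolerance([0.0, 91.0], [359.0, 90.0]))  # True
```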

Note:
The task involves determining the rotation angle only. The actual rotation of the pages is not required in this implementation.

In [9]:
import fitz  # PyMuPDF
from PIL import Image, ImageEnhance
import io
from collections import Counter
import math
from typing import List
import pytesseract
import numpy as np
from sklearn.decomposition import PCA

# Set the path to the Tesseract executable
pytesseract.pytesseract.tesseract_cmd = r'C:\Program Files\Tesseract-OCR\tesseract.exe'  # Adjust the path as needed

def is_watermark_or_header_footer(block: dict, page_height: float) -> bool:
    """
    Determine if a text block is a watermark, header, or footer.

    Args:
        block (dict): The text block to examine.
        page_height (float): The height of the page to identify headers and footers.

    Returns:
        bool: True if the block is a watermark, header, or footer, False otherwise.
    """
    # Header and footer criteria: the block starts in the top 10% or ends in the
    # bottom 10% of the page. This depends only on the block, so check it once.
    if block["bbox"][1] < 0.1 * page_height or block["bbox"][3] > 0.9 * page_height:
        return True
    for line in block["lines"]:
        for span in line["spans"]:
            # Watermark criteria: font size outside typical range for main text
            if span["size"] < 10 or span["size"] > 30:
                return True
    return False

def calculate_rotation_angle(orientation: tuple) -> float:
    """
    Calculate rotation angle from orientation vector.

    Args:
        orientation (tuple): Orientation vector (x, y).

    Returns:
        float: Rotation angle in degrees.
    """
    x, y = orientation
    angle = math.degrees(math.atan2(y, x))
    return angle

def find_rotation_by_ocr(image: Image.Image) -> float:
    """
    Find the rotation angle of an image using OCR.

    Args:
        image (Image.Image): The image to analyze.

    Returns:
        float: Rotation angle in degrees, or None if Tesseract cannot determine it.
    """
    try:
        osd = pytesseract.image_to_osd(image, output_type=pytesseract.Output.DICT)
        return float(osd['rotate'])
    except pytesseract.TesseractError as e:
        # These failures are expected on sparse or low-resolution pages, so stay
        # quiet for them and report anything else. Either way the caller falls back.
        expected_errors = ("Too few characters", "Invalid resolution 0 dpi")
        if not any(message in str(e) for message in expected_errors):
            print("Tesseract OCR failed to determine the rotation angle. Error:", e)
        return None

def find_rotation_by_pca(image: Image.Image) -> float:
    """
    Find the rotation angle of an image using PCA.

    Args:
        image (Image.Image): The image to analyze.

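    Returns:
        float: Rotation angle in degrees, or None if the image has too few dark pixels.
               PCA finds an axis, not a direction, so the result is ambiguous by 180 degrees.
    """
    # Work on a single grayscale channel so the pixel coordinates are 2-D (row, column)
    img = np.array(image.convert("L"))
    # Treat dark pixels as ink; on a white page `img > 0` would select almost every pixel.
    # The threshold of 128 is an assumed midpoint, not a tuned value.
    coords = np.column_stack(np.where(img < 128))
    if len(coords) < 2:
        return None
    pca = PCA(n_components=2)
    pca.fit(coords)
    # Coordinates are (row, column) = (y, x), so the first component is (dy, dx)
    dy, dx = pca.components_[0]
    angle = math.degrees(math.atan2(dy, dx))
    return (360 - angle) % 360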

def normalize_rotation_angle(angle: float) -> float:
    """
    Normalize the rotation angle to the range [0, 359) and round it to two decimal places.
    Values of 359 or more are treated as close enough to 360 and returned as 0.

    Args:
        angle (float): The rotation angle to normalize.

    Returns:
        float: Normalized rotation angle.
    """
    normalized_angle = round(angle % 360, 2)
    if normalized_angle >= 359:
        return 0.00
    return normalized_angle

def determine_rotation_angle(page: 'fitz.Page') -> float:
    """
    Determine the rotation angle needed to make the page upright.

    Args:
        page (fitz.Page): A single page from a PDF.

    Returns:
        float: The rotation angle in degrees (e.g. 0.00, 90.00, 180.00, 270.00).
               0 means the page is already upright, 90 means 90 degrees clockwise, etc.
    """
    text_info = page.get_text("dict")
    page_height = page.rect.height

    # Extract orientations, ignoring watermarks, headers, and footers
    orientations = []
    for block in text_info["blocks"]:
        if block["type"] == 0 and not is_watermark_or_header_footer(block, page_height):
            for line in block["lines"]:
                orientations.append(line["dir"])

    if orientations:
        rotation_angles = [calculate_rotation_angle(orientation) for orientation in orientations]
        normalized_angles = [normalize_rotation_angle((360 - angle) % 360) for angle in rotation_angles]
        most_common_angle = Counter(normalized_angles).most_common(1)[0][0]
        return most_common_angle

    # Render page as an image for OCR or PCA if no text found
    pix = page.get_pixmap(dpi=150)  # Lower DPI for faster processing
    img = Image.open(io.BytesIO(pix.tobytes()))

    # Try OCR
    rotation_angle = find_rotation_by_ocr(img)
    if rotation_angle is not None:
        return normalize_rotation_angle(rotation_angle)

    # Enhance image and try PCA if OCR fails
    enhanced_img = ImageEnhance.Contrast(img).enhance(2.0)
    enhanced_img = ImageEnhance.Brightness(enhanced_img).enhance(1.5)
    rotation_angle = find_rotation_by_pca(enhanced_img)

    return normalize_rotation_angle(rotation_angle) if rotation_angle is not None else 0.00

def rotate_all_pages_upright(input_pdf: str) -> List[float]:
    """
    Analyze all pages in the input PDF and determine the rotation angle needed for each page.

    Args:
        input_pdf (str): The file path of the input PDF.

    Returns:
        List[float]: A list of rotation angles (in degrees) for each page.
                     The angles are normalized to be in the range [0, 359.99].
                     0 means no rotation needed, 90 means 90 degrees clockwise, etc.
    """
    # Open the document in a context manager so the file is closed afterwards
    with fitz.open(input_pdf) as document:
        # Iterating over the document yields its pages in order
        angles = [determine_rotation_angle(page) for page in document]

    return angles

# Usage
input_pdf = "grouped_documents.pdf"
rotation_angles = rotate_all_pages_upright(input_pdf)
print(f"Rotation angles for each page: {rotation_angles}")


Rotation angles for each page: [0.0, 48.0, 28.0, 23.9, 193.84, 189.39, 348.0, 17.0, 261.0, 266.0, 198.0, 353.0, 168.0, 219.0, 122.0, 197.15, 188.34, 189.41]
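
The PCA fallback in `find_rotation_by_pca` estimates the dominant axis of the ink pixels. As a dependency-free illustration of that idea (not the scikit-learn code used above), the principal axis of a 2-D point cloud can be computed in closed form from its second moments. The helper name `principal_axis_angle` is hypothetical, and points are assumed to be given as `(x, y)` pairs.

```python
import math
from typing import Iterable, Tuple


def principal_axis_angle(points: Iterable[Tuple[float, float]]) -> float:
    """Angle in degrees, in [0, 180), of the axis along which the points spread most."""
    pts = list(points)
    n = len(pts)
    if n < 2:
        raise ValueError("need at least two points")
    mean_x = sum(x for x, _ in pts) / n
    mean_y = sum(y for _, y in pts) / n
    # Second central moments (unnormalized covariance entries)
    sxx = sum((x - mean_x) ** 2 for x, _ in pts)
    syy = sum((y - mean_y) ** 2 for _, y in pts)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in pts)
    # Orientation of the leading eigenvector of the 2x2 covariance matrix
    theta = 0.5 * math.atan2(2 * sxy, sxx - syy)
    return math.degrees(theta) % 180


horizontal = [(x, 0) for x in range(10)]
diagonal = [(x, x) for x in range(10)]
print(round(principal_axis_angle(horizontal), 6))  # 0.0
print(round(principal_axis_angle(diagonal), 6))    # 45.0
```

The result lives in [0, 180) because an axis has no head or tail. This is why a PCA-style estimate alone cannot tell an upright page from an upside-down one, and why it is best used as a last resort after text-direction and OCR-based detection.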
