## PDF Page Rotation Angle Detection Task

Objective:
Implement the `determine_rotation_angle` function within the given code structure to detect the rotation angle of each page in a PDF file.

Code Structure:
The main function `rotate_all_pages_upright` is already implemented, but if necessary you are allowed to change its implementation. Your task is to complete the `determine_rotation_angle` function.

Input:
- A PDF file path (the function should be able to handle various PDF files)

Output:
- A list of integers, where each integer represents the rotation angle needed for a page in the PDF

Rotation Angle:
- The rotation angle should be in degrees, normalized to the range [0, 359].
- 0 means the page is already upright
- 90 means the page needs to be rotated 90 degrees clockwise to be upright
- and so on...

Task:
1. Implement the `determine_rotation_angle` function:
   - Input: A single page object (PdfReader.PageObject)
   - Output: An integer representing the rotation angle in degrees

2. The function should analyze the content of the page and determine the angle needed to make the page upright.

Requirements:
1. The function should work with different PDF files, not just a specific one.
2. Implement robust methods to determine the correct rotation angle.
3. Handle potential exceptions or edge cases (e.g., pages with mixed orientations, complex layouts).
4. Optimize for both accuracy and processing speed, as the function will be called for each page in the PDF.

Additional Considerations:
- You are allowed to use up to 40GB of GPU VRAM if necessary for your implementation.
- You may create as many additional functions as needed to support your implementation.
- You may use additional libraries if required, but ensure they are imported properly.
- Provide clear comments in your code to explain your rotation detection logic.

Testing:
- Test your implementation with various types of PDFs to ensure its robustness and generalizability.
- The main script provides a way to test your implementation on a file named "grouped_documents.pdf".

Note:
The task involves determining the rotation angle only. The actual rotation of the pages is not required in this implementation.

In [1]:
from typing import List
from PyPDF2 import PdfReader, PdfWriter
import io
import numpy as np
from pdf2image import convert_from_path, convert_from_bytes

def rotate_all_pages_upright(input_pdf: str) -> List[int]:
    """
    Analyze all pages in the input PDF and determine the rotation angle needed for each page.

    Args:
    input_pdf (str): The file path of the input PDF.

    Returns:
    List[int]: A list of rotation angles (in degrees) for each page. 
               The angles are normalized to be in the range [0, 359].
               0 means no rotation needed, 90 means 90 degrees clockwise, etc.
    """
    reader = PdfReader(input_pdf)
    writer = PdfWriter()
    
    angles = []
    for page_number in range(len(reader.pages)):
        current_page = reader.pages[page_number]
        
        rotation_angle = determine_rotation_angle(current_page)
        angles.append(rotation_angle)
    
    return angles

def determine_rotation_angle(page: 'PdfReader.PageObject') -> int:
    """
    Determine the rotation angle needed to make the page upright.

    Args:
    page (PdfReader.PageObject): A single page from a PDF.

    Returns:
    int:  The rotation angle in degrees (e.g. 0, 90, 210).
          The rotation angle is  normalized to be in the range [0, 359].
          0 means the page is already upright, 90 means 90 degrees clockwise, etc.
    """
    # Approach Tesseract
    #image = convert_from_path('example.pdf')
    # TODO: Implement the logic to determine the rotation angle of the pdf page
    return 0

# Usage
input_pdf: str = "grouped_documents.pdf"
rotation_angles: List[int] = rotate_all_pages_upright(input_pdf)
print(f"Rotation angles for each page: {rotation_angles}")

Rotation angles for each page: [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]


In [13]:
# Install pip packages in the current Jupyter kernel
import sys
!{sys.executable} -m pip install pdf2image
!{sys.executable} -m pip install pytesseract
!{sys.executable} -m pip install cv2
!{sys.executable} -m pip install matplotlib

Defaulting to user installation because normal site-packages is not writeable
Defaulting to user installation because normal site-packages is not writeable
Defaulting to user installation because normal site-packages is not writeable


ERROR: Could not find a version that satisfies the requirement cv2 (from versions: none)
ERROR: No matching distribution found for cv2


Defaulting to user installation because normal site-packages is not writeable
Collecting matplotlib
  Downloading matplotlib-3.9.2-cp312-cp312-win_amd64.whl.metadata (11 kB)
Collecting contourpy>=1.0.1 (from matplotlib)
  Downloading contourpy-1.2.1-cp312-cp312-win_amd64.whl.metadata (5.8 kB)
Collecting cycler>=0.10 (from matplotlib)
  Downloading cycler-0.12.1-py3-none-any.whl.metadata (3.8 kB)
Collecting fonttools>=4.22.0 (from matplotlib)
  Downloading fonttools-4.53.1-cp312-cp312-win_amd64.whl.metadata (165 kB)
Collecting kiwisolver>=1.3.1 (from matplotlib)
  Downloading kiwisolver-1.4.5-cp312-cp312-win_amd64.whl.metadata (6.5 kB)
Collecting pyparsing>=2.3.1 (from matplotlib)
  Downloading pyparsing-3.1.2-py3-none-any.whl.metadata (5.1 kB)
Downloading matplotlib-3.9.2-cp312-cp312-win_amd64.whl (7.8 MB)
   ---------------------------------------- 0.0/7.8 MB ? eta -:--:--
   ---------------------- ----------------- 4.5/7.8 MB 24.4 MB/s eta 0:00:01
   ---------------------------------



In [22]:
from PIL import Image
import pytesseract # install tessearct Library for OCR, 
pytesseract.pytesseract.tesseract_cmd = r"C:\\Program Files\\Tesseract-OCR\\tesseract.exe"

In [10]:
"""
Converts PDf to image for better processing and saves pictures to "pictures" directory
"""
from pdf2image import convert_from_path # cave: Install Toddler first, Conda https://stackoverflow.com/questions/46184239/extract-a-page-from-a-pdf-as-a-jpeg

pdf_images = convert_from_path('grouped_documents.pdf')
for i in range(len(pdf_images)):
    pdf_images[i].save('pictures\\pdf_page_'+ str(i+1) +'.png', 'PNG') 
print("Successfully converted PDF to images")



Successfully converted PDF to images


In [53]:
#https://stackoverflow.com/questions/28816046/
#displaying-different-images-with-actual-size-in-matplotlib-subplot
"""
Function to show images inline in Jupyter Lab
@type image: cv2.image
@param image: The image to be shown in Notebook
"""
def display(image):
    print("test1")
    im_path = "tempImage.jpg" #temporarily save the immage
    cv2.imwrite("tempImage.jpg", image)
    
    dpi = 80
    im_data = plt.imread(im_path)

    height, width  = im_data.shape[:2]
    
    # What size does the figure need to be in inches to fit the image?
    figsize = width / float(dpi), height / float(dpi)

    # Create a figure of the right size with one axes that takes up the full figure
    fig = plt.figure(figsize=figsize)
    ax = fig.add_axes([0, 0, 1, 1])

    # Hide spines, ticks, etc.
    ax.axis('off')

    # Display the image.
    ax.imshow(im_data, cmap='gray')

    plt.show()

In [3]:
#Quelle Chat GPT
#Helper Functions
import numpy as np

def sort_points(points):
    """
    Sorts the four corner points of a quadrilateral so that they are in a consistent order (clockwise).
    """
    # Calculate the center of the quadrilateral
    center = np.mean(points, axis=0)

    # Sort the points based on their angle relative to the center
    def angle_from_center(point):
        return np.arctan2(point[1] - center[1], point[0] - center[0])

    points_sorted = sorted(points, key=angle_from_center)
    return np.array(points_sorted)

def calculate_quadrilateral_area(points):
    """
    Calculates the area of a quadrilateral given by 4 points, regardless of the order of the points.

    :param points: A 2D array or a list containing the 4 points representing the corners of the quadrilateral.
                   e.g., [[x1, y1], [x2, y2], [x3, y3], [x4, y4]]
    :return: The area of the quadrilateral.
    """
    # Sort the points to ensure a consistent order
    points_sorted = sort_points(points)

    # Shoelace formula to calculate the area of a quadrilateral
    x = points_sorted[:, 0]
    y = points_sorted[:, 1]

    # Calculate the area
    area = 0.5 * np.abs(np.dot(x, np.roll(y, 1)) - np.dot(y, np.roll(x, 1)))
    return area

def shorten_array(arr):
    """
    Shortens Python arrays by removing outer lists that contain only a single element.
    Works recursively for multi-dimensional arrays.
    
    :param arr: A nested array or list.
    :return: A shortened array with unnecessary outer lists removed.
    """
    # If the array is a list and has only one element, return that element
    while isinstance(arr, list) and len(arr) == 1:
        arr = arr[0]

    # If the array is a list, shorten all elements recursively
    if isinstance(arr, list):
        return [shorten_array(element) for element in arr]
    
    # If it's not an array, simply return it
    return arr

In [4]:
"""
preprocess pictures in the "pictures" Directory for angle Detection and OCR. Afterwards the processed pictures are stored in the folder "processImages"
1. Binarization -> Black and white Picture
2. Remove Borders -> e.g. red line around page is removed
3. Find contours of the Page Border and use it to align the page to the picture and crop the background
"""
import math
import cv2
from matplotlib import pyplot as plt
from PIL import ImageOps
import os
import fnmatch

filenames = fnmatch.filter(os.listdir(".\\pictures"), '*.png')

for filename in filenames: 

    path = ".\\pictures\\" + filename
    img = cv2.imread(path) # Load Image
    
    #Noise removal does not apply for example document
    
    #Binarization
    img = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
    thresh, img = cv2.threshold(img, 127, 255, cv2.THRESH_BINARY)
    
    #remove border
    contours, heiarchy = cv2.findContours(img, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
    cntsSorted = sorted(contours, key=lambda x:cv2.contourArea(x))
    cnt = cntsSorted[-1]
    x, y, w, h = cv2.boundingRect(cnt)
    img = img[y:y+h, x:x+w]
    
    
    """
    Find outer Contours.
    With the Border of the pages, every Page is at first turned so that the borders match the image
    """
    
    imgTurn = cv2.copyMakeBorder(img, 200, 200, 200, 200, cv2.BORDER_CONSTANT, 1) #Border is added to have head Room for Turning
    
    contours, hierarchy = cv2.findContours(imgTurn, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
    
    boxes = []
    for contour in contours:
        epsilon = 0.02 * cv2.arcLength(contour, True)
        rect = cv2.approxPolyDP(contour, epsilon, True) #PolyDP since rectangle shaped paper may be warped
        
        if (len(rect) == 4): # Teste auf 4Eck
            rect = np.squeeze(rect) #Transform to format without double-Array brackets
            boxes.append(rect)
     
    #Choose rectangle with greatest Area
    maxArea = 0
    paperBorder = None
    for box in boxes:
        shortend = box #shorten_array(box)
        #print (shortend)
        area = calculate_quadrilateral_area(shortend)
        if (area > maxArea):
            maxArea = area
            paperBorder = box
    paperBorder = sort_points(paperBorder)
    #print("Eckpunkte der Page:", paperBorder)
    
    #calculate rotation angle
    deltaX = paperBorder[1][0] - paperBorder[0][0]
    deltaY = paperBorder[1][1] - paperBorder[0][1]
    sign = 1
    
    if deltaY < 0:
        sign = -1
        deltaY = abs(deltaY)

    angle = math.degrees(math.atan2(deltaY, deltaX))
    #print("Der Winkel beträgt: " + str(angle))
 
    
    #rotate picture and paperBorder if necessary
    if angle > 0.5:
        height, width = imgTurn.shape
        center = (width // 2, height // 2)
        rotation_matrix = cv2.getRotationMatrix2D(center, sign*angle, 1.0) # calculate Rotation matrix
        rotated_image = cv2.warpAffine(imgTurn, rotation_matrix, dsize=(width, height))
        
    
        paperBorderRotation = np.array([[paperBorder[0]], [paperBorder[1]], [paperBorder[2]], [paperBorder[3]]]) #Transform points to no formatting
        paperBorderRotation = cv2.transform(paperBorderRotation, rotation_matrix)
        paperBorderRotation = np.squeeze(paperBorderRotation) #Transform to format without double-Array brackets
        paperBorderRotation = sort_points(paperBorderRotation)

    
        #crop background
        yMin = min (paperBorderRotation[0][1], paperBorderRotation[1][1])
        xMax = max (paperBorderRotation[1][0], paperBorderRotation[2][0])
        yMax = max (paperBorderRotation[2][1], paperBorderRotation [3][1])
        xMin = min (paperBorderRotation[3][0], paperBorderRotation[0][0])
        
        rotated_image = rotated_image[yMin:yMax, xMin:xMax]
        img = rotated_image
    
    
    #crop header and footer if necessary
    else:
        h, _ = img.shape
        offset = 0.1
        img = img[math.floor(h*offset):math.floor(h-h*offset), :]
    
    
    cv2.imwrite(".//processImages//"+filename, img)
    #display (img)
    """#img = rotateImage(img, -20)
    
    #img = rotateImage(img, -11.87)
    
    
    #angle = getSkewAngle(img)
    #print ("Der Winkel ist " + str(angle))
    #img = rotateImage (img, -1 * angle - 90)
    
    #print (pytesseract.image_to_osd(img, output_type='dict'))
    h, w = img.shape
    boxes = pytesseract.image_to_boxes(img)
    print (boxes)
    for b in boxes.splitlines():
        b = b.split(' ')
        img = cv2.rectangle(img, (int(b[1]), h - int(b[2])), (int(b[3]), h - int(b[4])), (0, 255, 0), 2)
    
    """
    
    
print("processed Images saved to 'processImages'")
print( os.getcwd())
print(os.listdir(".//processImages"))


processed Images saved to 'processImages'
C:\Users\Timo4\Documents\GitHub\hiringstudy
['.ipynb_checkpoints', 'pdf_page_1.png', 'pdf_page_10.png', 'pdf_page_11.png', 'pdf_page_12.png', 'pdf_page_13.png', 'pdf_page_14.png', 'pdf_page_15.png', 'pdf_page_16.png', 'pdf_page_17.png', 'pdf_page_18.png', 'pdf_page_2.png', 'pdf_page_3.png', 'pdf_page_4.png', 'pdf_page_5.png', 'pdf_page_6.png', 'pdf_page_7.png', 'pdf_page_8.png', 'pdf_page_9.png']


In [25]:
# Test Tesseact with different cases
im = cv2.imread(".\\TesseractTesting\\test1.PNG")
try:
    osd = pytesseract.image_to_osd(im, output_type='dict')
    print(osd)
    if (osd['orientation_conf'] > 3):
        im = rotateImage(im, -1* osd['rotate'])
        display(im)
    else:
        print ("Tesseract konnte Winkel nicht sicher erkennen")
except:
    print ("Tesseract akzeptiert Eingabedatei nicht")
    



test
Tesseract akzeptiert Eingabedatei nicht


In [45]:
import cv2
"""
segmetion of Letters -> draws Border around the letters
"""
filenames = fnmatch.filter(os.listdir(".\\processImages"), '*.png')
for filename in filenames:
    img = cv2.imread(".\\processImages\\"+filename)
    h, w, c = img.shape
    boxes = pytesseract.image_to_boxes(img)
    print (boxes)
    print (type(boxes))
    for b in boxes.splitlines():
        b = b.split(' ')
        img = cv2.rectangle(img, (int(b[1]), h - int(b[2])), (int(b[3]), h - int(b[4])), (0, 255, 0), 2)
        
    cv2.imwrite(".\\Boxes\\"+filename, img)
print ("pictures saved")    

~ 0 0 0 1644 0

<class 'str'>

<class 'str'>
b 619 870 634 914 0
v 623 876 643 904 0
J 654 874 663 901 0
U 667 878 684 900 0
S 688 876 707 897 0
8 699 851 709 905 0
U 710 872 738 895 0
N 741 870 759 892 0
D 762 869 780 890 0
0 782 866 801 887 0
g 804 855 828 883 0
4 838 853 851 881 0
0 853 859 872 880 0
2 885 848 906 877 0
o 918 852 937 873 0
B 939 850 958 878 0
e 962 848 980 869 0
d 982 844 1001 872 0
S 1014 841 1031 862 0
I 1033 833 1039 861 0
s 1053 837 1070 858 0
i 1072 829 1078 857 0
y 1082 827 1099 856 0
y 1100 823 1122 852 0

<class 'str'>

<class 'str'>

<class 'str'>

<class 'str'>

<class 'str'>
~ 71 2064 664 2065 0
T 40 2013 52 2033 0
h 48 2012 60 2033 0
i 54 2012 67 2032 0
s 71 2012 85 2028 0
I 94 2013 97 2028 0
s 100 2020 107 2027 0
p 116 2008 140 2029 0
a 132 2008 144 2029 0
g 141 2010 155 2028 0
e 157 2012 170 2027 0
1 181 2013 188 2033 0
o 197 2012 210 2028 0
t 211 2013 218 2033 0
D 231 2011 242 2034 0
e 242 2012 250 2028 0
w 250 2011 259 2024 0
u 260 2012 281 2028 0
r 

In [8]:
img = cv2.imread(".\\processImages\\pdf_page_3.PNG")
angle = getSkewAngle(img)
print (angle)
img = rotateImage (img, angle)
display(img)

error: OpenCV(4.10.0) D:\a\opencv-python\opencv-python\opencv\modules\imgproc\src\contours_new.cpp:330: error: (-2:Unspecified error) in function 'class std::shared_ptr<struct ContourScanner_> __cdecl ContourScanner_::create(class cv::Mat,int,int,class cv::Point_<int>)'
> Modes other than RETR_FLOODFILL and RETR_CCOMP support only CV_8UC1 images (expected: 'img.type() == CV_8UC1'), where
>     'img.type()' is 16 (CV_8UC3)
> must be equal to
>     'CV_8UC1' is 0 (CV_8UC1)


In [6]:
import numpy as np

def getSkewAngle(cvImage) -> float:
    thresh = cvImage

    # Apply dilate to merge text into meaningful lines/paragraphs.
    # Use larger kernel on X axis to merge characters into single line, cancelling out any spaces.
    # But use smaller kernel on Y axis to separate between different blocks of text
    kernel = cv2.getStructuringElement(cv2.MORPH_RECT, (30, 5))
    dilate = cv2.dilate(thresh, kernel, iterations=2)

    # Find all contours
    contours, hierarchy = cv2.findContours(dilate, cv2.RETR_LIST, cv2.CHAIN_APPROX_SIMPLE)
    contours = sorted(contours, key = cv2.contourArea, reverse = True)
    for c in contours:
        rect = cv2.boundingRect(c)
        x,y,w,h = rect
        cv2.rectangle(newImage,(x,y),(x+w,y+h),(0,255,0),2)

    # Find largest contour and surround in min area box
    largestContour = contours[0]
    print (len(contours))
    minAreaRect = cv2.minAreaRect(largestContour)
    cv2.imwrite(".//temp/boxes.jpg", newImage)
    # Determine the angle. Convert it to the value that was originally used to obtain skewed image
    angle = minAreaRect[-1]
    #if angle < -45:
    #    angle = 90 + angle
    return -1.0 * angle




# Rotate the image around its center
def rotateImage(cvImage, angle: float):
    newImage = cvImage.copy()
    (h, w) = newImage.shape[:2]
    center = (w // 2, h // 2)
    M = cv2.getRotationMatrix2D(center, angle, 1.0)
    newImage = cv2.warpAffine(newImage, M, (w, h), flags=cv2.INTER_CUBIC, borderMode=cv2.BORDER_REPLICATE)
    return newImage