# Image pre-processing
When we think about the challenges of optical character recognition (OCR) of rare materials, we often think about the difficulty of training a language model to recognize archaic letterforms or to handle the inherent variability of hand-set type.

Those are serious questions, for sure, but it turns out that one of the biggest determinants of OCR quality is the quality of the images that we attempt to run OCR on.

This all means that the pre-processing of images *before* they're OCR'ed can be at least as important as having a well-trained OCR model.

This notebook and the following one walk through pre-processing steps to prepare images in hopes of getting the best result we can from our OCR training.

In this notebook, we'll experiment with thresholding, and then work to crop the images to exclude portions of the page that don't include text.

In the next notebook, we'll work to straighten any skewing in the images so that the lines of text are more readily recognized by the OCR software. In that notebook we'll also convert the images to black and white so that we can experiment with performing OCR on them.

>*Note:* Your attention **really** doesn't need to be on the details of the code, itself, in these notebooks. *Most* of the code in this notebook ends up being devoted not to performing the transformations we're actually after, but simply to showing the stages of the transformation in the browser.
>
> In the first part of the notebook, focus on the different variables that affect the way that the images are transformed. I've set the notebook up so that you can change variables easily and re-process images to see what difference your changes make.
>
>In the second, longer, part of the notebook, focus on the strategies and heuristics involved in identifying the text block, even if the details of the code seem obscure.


### A - Move images from Google Drive to Colaboratory environment
**Note**: Make sure you've added the RBS shared folder to your Google Drive.

In [None]:
#Code cell #1
#Connect to Google Drive
from google.colab import drive
drive.mount('/gdrive')

#Import libraries to allow interactive widgets in this notebook
import ipywidgets as widgets
from ipywidgets import interact, interact_manual, interactive

In [None]:
#Code cell #2
%cp /gdrive/MyDrive/L-100\ Digital\ Approaches\ to\ Bibliography\ \&\ Book\ History-2023/2023_page_images.zip /content/2023_page_images.zip
%cd /content/
!unzip 2023_page_images.zip
%cd /content/2023_page_images/
%ls -al

### B - Setting our source image

In [None]:
#@title Select an image
#@markdown **Run this cell** to create a select list widget that allows us to choose an image to process. With an image selected (a default is provided), you can continue working through the code below.

#@markdown You only need to run this cell once (re-running it will just set things back to the default value). But you can change the image you're working with using the select list in order to see how these processes work given different starting images.
import os
import glob
file_list = sorted([os.path.basename(file) for file in glob.glob('/content/2023_page_images/*')])
image_select = widgets.Dropdown(
    description='Choose image',\
    options = file_list,\
    value = '1730f_p13.tif',
    style={'description_width': 'initial'})
display(image_select)

In [None]:
#Code cell #3
source_directory = '/content/2023_page_images/'
source_image = source_directory + image_select.value


## 1 - Getting the idea of binarizing: Setting a threshold manually

In [None]:
#Code cell #4
# !pip install pillow
#Import the Image and ImageDraw modules from PIL (actually Pillow)
from PIL import Image, ImageDraw

#Open the original color image
pilcolor_image = Image.open(source_image)
pilcolor_image

### 1.a - Converting from color to grayscale

In [None]:
#Code cell #5
#Use the Image library's convert() method to convert the color image to grayscale
pilgray_image = pilcolor_image.convert('L')

#Output
pilgray_image

### 1.b - Converting our grayscale image to black and white
This isn't going to work the way we might expect...

In [None]:
#Code cell #6
pilbw_image = pilgray_image.convert('1')
pilbw_image

#### 1.b.i - Overriding the default behavior: turning off dithering
See [the documentation](https://pillow.readthedocs.io/en/stable/reference/Image.html#PIL.Image.Image.convert):
>The default method of converting a greyscale (“L”) or “RGB” image into a bilevel (mode “1”) image uses Floyd-Steinberg dither to approximate the original image luminosity levels. If dither is NONE, all values larger than 127 are set to 255 (white), all other values to 0 (black). To use other thresholds, use the point() method.

In [None]:
#Code cell #7
pilbw_image = pilgray_image.convert('1', dither=0)
pilbw_image

#### 1.b.ii - Adjusting the threshold point manually
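To make the idea of a threshold concrete before we apply it to a real page, here is a minimal sketch that uses plain Python lists in place of an image. The `binarize` helper and the tiny sample "image" are illustrative assumptions, not part of Pillow. Gray values run from 0 (black) to 255 (white); anything greater than the threshold becomes pure white, and everything else becomes pure black.

```python
def binarize(rows, thresh):
    """Return a copy of rows with every gray value mapped to 0 (black) or 255 (white).

    Hypothetical helper for illustration only; it mirrors the rule
    "255 if value > thresh else 0" used with Pillow's point() method.
    """
    return [[255 if value > thresh else 0 for value in row] for row in rows]

# A tiny 2x4 "image": dark ink (low values) on lighter paper (high values)
tiny_image = [
    [30, 140, 160, 220],
    [90, 150, 151, 255],
]

low = binarize(tiny_image, 100)   # a low threshold keeps more pixels white
high = binarize(tiny_image, 150)  # a higher threshold turns mid-grays black
print(low)
print(high)
```

Note that a value exactly equal to the threshold (150 in the second call) ends up black, because the comparison is strictly "greater than".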


In [None]:
#@title Set a threshold value {display-mode: "form"}
#@markdown Run this cell, then enter a value between 0 and 255 and hit the "enter" key to adjust the threshold point for our image in the cell below. (We begin with a default of 150)

#@markdown You only need to run this cell once (re-running it will just set things back to the default value). Try entering a new value and then re-running the *next* cell a few times to see the difference that different threshold values make.
thresh_value_int = widgets.BoundedIntText(
    value=150,
    min=0,
    max=255,
    step=1,
    description='Threshold:',
    disabled=False
)
display(thresh_value_int)

In [None]:
#Code cell #8
#See https://stackoverflow.com/questions/9506841/using-python-pil-to-turn-a-rgb-image-into-a-pure-black-and-white-image/50090612#50090612

#Get the current value of the text-box widget from the cell above
thresh = thresh_value_int.value

#Look at every pixel of our image. If the value of that pixel is
#greater than the "thresh" value (set in the widget above), set the pixel to
#255 (pure white). Otherwise, set the pixel to 0 (pure black)
fn = lambda x : 255 if x > thresh else 0

#Convert our image, overriding the default dithering behavior
#with the threshold we've chosen, using the lambda function defined above
pilbinary_image = pilgray_image.convert('L').point(fn, mode='1')

pilbinary_image

## Interlude - Automating image optimization

### We are immediately going to run into a problem...

In [None]:
#Code cell #9
# !pip install opencv-python
import cv2
#This is a Google Colab-specific patch to enable us to view OpenCV images
#in the browser
from google.colab.patches import cv2_imshow
#Other Python libraries that CV2 needs
import matplotlib.pyplot as plt
import numpy as np

In [None]:
#Code cell 10
#Open the original color image
cv2color_image = cv2.imread(source_image, cv2.IMREAD_COLOR)
#Convert to grayscale
cv2gray_image = cv2.cvtColor(cv2color_image, cv2.COLOR_BGR2GRAY)
#Apply Gaussian blur
cv2blurred_image = cv2.GaussianBlur(cv2gray_image, (5, 5), 0)
#Threshold using Otsu's method
(T_orig, cv2binary_otsu_image) = cv2.threshold(cv2blurred_image, 0, 255, cv2.THRESH_OTSU)
#Show binarized image
cv2_imshow(cv2binary_otsu_image)
print('Otsu threshold value: ' + str(T_orig))

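Otsu's method chooses a threshold automatically rather than asking us to pick one. Roughly, it tries every possible threshold and keeps the one that best separates the dark pixels from the light ones (the one that maximizes the variance between the two groups). The following is a minimal pure-Python sketch of that idea, not OpenCV's implementation; `otsu_threshold` and the sample pixel values are assumptions made for illustration.

```python
def otsu_threshold(pixels):
    """Return the threshold t (0-255) that maximizes between-class variance.

    Illustrative sketch only: the "dark" class is values <= t and the
    "light" class is values > t. Ties keep the lowest such t.
    """
    # Count how many pixels have each gray level
    histogram = [0] * 256
    for value in pixels:
        histogram[value] += 1

    total = len(pixels)
    total_sum = sum(level * count for level, count in enumerate(histogram))

    best_threshold = 0
    best_variance = -1.0
    dark_count = 0
    dark_sum = 0
    for t in range(256):
        dark_count += histogram[t]
        dark_sum += t * histogram[t]
        light_count = total - dark_count
        # Skip thresholds that leave one of the two groups empty
        if dark_count == 0 or light_count == 0:
            continue
        dark_mean = dark_sum / dark_count
        light_mean = (total_sum - dark_sum) / light_count
        # Proportional to the between-class variance
        variance = dark_count * light_count * (dark_mean - light_mean) ** 2
        if variance > best_variance:
            best_variance = variance
            best_threshold = t
    return best_threshold

# Made-up sample: a cluster of dark "ink" values and a larger cluster of light "paper" values
pixels = [20, 25, 30, 35] * 10 + [200, 210, 220, 230] * 30
threshold = otsu_threshold(pixels)
print(threshold)
```

For a page image with clearly separated ink and paper, the chosen threshold falls in the gap between the two clusters, which is why the method tends to work well on clean scans and less well on stained or unevenly lit pages.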
## 2 - Cropping pages
Buckle up, because here's where things start to get weird.

### 2.a - Invert the image

In [None]:
#Code cell 11
#This sets anything above a threshold level to true black, and combines two types
#of thresholding (THRESH_BINARY_INV and the OpenCV implementation of Otsu's method).
invert_image = cv2.threshold(cv2blurred_image, 0, 255, cv2.THRESH_BINARY_INV + cv2.THRESH_OTSU)[1]
cv2_imshow(invert_image)

In [None]:
#Code cell #12
#Let's get rid of the (now) white border around the page image
#See: https://learnopencv.com/filling-holes-in-an-image-using-opencv-python-c/
h, w = invert_image.shape[:2]
mask = np.zeros((h+2, w+2), np.uint8)
floodfill_image = invert_image.copy()
#Floodfill from point (0, 0)
cv2.floodFill(floodfill_image, mask, (0,0), 0)
cv2_imshow(floodfill_image)

In [None]:
#@title Set kernel size for dilation {display-mode: "form"}

#@markdown **Run the code** in this cell to create a set of
#@markdown slider widgets for changing the values of the
#@markdown "kernel" used to dilate the white pixels in the
#@markdown image. You can change the height and width of the
#@markdown kernel (i.e., the amount of vertical and horizontal
#@markdown dilation to be applied) as well as the number of
#@markdown iterations (how many times the dilation operation
#@markdown will be applied.)

#@markdown You only need to run this cell once (re-running
#@markdown it will just re-set the values to their defaults).
#@markdown You can change the values of the sliders and
#@markdown then run Code cell 13 to see the different
#@markdown effects that different values have.
kernel_width = widgets.IntSlider(description = 'Kernel width', \
                                               min=1, max=25, step=1, value=10)
kernel_height = widgets.IntSlider(description='Kernel height', \
                                                 min=1, max=25, step=1, value=20)
num_iterations = widgets.IntSlider(description='Iterations', min=1, \
                      max=10, step=1, value=5)
display(kernel_width)
display(kernel_height)
display(num_iterations)

### 2.b - Dilate the inverted image
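Dilation grows every white pixel outward in the shape of the kernel, so that nearby letters and lines merge into larger white blobs. As a rough sketch of the idea (plain Python lists rather than OpenCV; the `dilate` helper is hypothetical and assumes odd kernel sizes so that the kernel is centered on each pixel):

```python
def dilate(rows, kernel_width, kernel_height):
    """Spread each white (255) pixel over a kernel-sized rectangle around it.

    Illustrative sketch only, assuming odd kernel dimensions; it is not
    OpenCV's implementation.
    """
    height, width = len(rows), len(rows[0])
    half_w, half_h = kernel_width // 2, kernel_height // 2
    result = [[0] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            if rows[y][x] == 255:
                # Paint the kernel-sized neighborhood, staying inside the image
                for ny in range(max(0, y - half_h), min(height, y + half_h + 1)):
                    for nx in range(max(0, x - half_w), min(width, x + half_w + 1)):
                        result[ny][nx] = 255
    return result

# One white pixel in the middle of a 3x5 black image
grid = [
    [0, 0, 0, 0, 0],
    [0, 0, 255, 0, 0],
    [0, 0, 0, 0, 0],
]

once = dilate(grid, 3, 1)   # kernel 3 wide, 1 tall: grows left and right only
twice = dilate(once, 3, 1)  # a second iteration grows the blob again
print(once[1])
print(twice[1])
```

A wide, short kernel joins letters into lines; a tall kernel joins lines into blocks. Repeating the operation (the "iterations" setting) has a similar effect to using a larger kernel.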

In [None]:
#Code cell 13
#The shape for dilating pixels
kernel = cv2.getStructuringElement(cv2.MORPH_RECT, (kernel_width.value, kernel_height.value))
#Create a new image by dilating the prior image using the kernel shape we've set
dilate_image = floodfill_image.copy()
dilate_image = cv2.dilate(dilate_image, kernel, iterations=num_iterations.value)
cv2_imshow(dilate_image)

### 2.c - Identify contours of dilated regions


In [None]:
#Code cell 14
#Identify the contours and their hierarchy
contours, hierarchy = cv2.findContours(dilate_image, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)

#These lines are just to visualize what we have. We make a copy of the dilate
#image, converting it from binary to color (so we can see colored lines on it),
#then draw all of the contours on that new image in green.
show_contours = cv2.cvtColor(dilate_image.copy(), cv2.COLOR_GRAY2BGR)
show_contours = cv2.drawContours(show_contours, contours, -1, (0,255,0), 3)
cv2_imshow(show_contours)

### 2.d - Defining contours of interest

#### 2.d.i - Define a region in the center of the page

In [None]:
#Code cell 15
#Get the height and width of the image
height = np.shape(dilate_image)[0]
width = np.shape(dilate_image)[1]

#Divide the width by 8
eighth = int(width/8)
#Find the midpoint on the x-axis
midpoint_x = int(width/2)
#Create a tuple with the left-most and right-most x-axis for this zone
middle_zone = (midpoint_x - eighth, midpoint_x + eighth)

#This code is just to display what's going on. We make a copy of the image
#that already has our contours drawn in green...
show_middle_zone = show_contours.copy()
#...then draw two blue lines to show the edges of the middle zone
show_middle_zone = cv2.line(show_middle_zone, (middle_zone[0],0),
                            (middle_zone[0], height), (255,0,0), 3)
show_middle_zone = cv2.line(show_middle_zone, (middle_zone[1],0),
                            (middle_zone[1],height), (255,0,0), 3)
cv2_imshow(show_middle_zone)

#### 2.d.ii - Identify contours centered on the area of interest
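The heuristic here is that the main text block sits roughly in the horizontal middle of the page, while marginalia, rulers, and scanner debris sit off to the sides. The sketch below illustrates the test with plain Python. The helper names are hypothetical, and the centroid is simplified to the average of a shape's corner points, whereas OpenCV's `cv2.moments` derives it from the contour's area.

```python
def centroid_x(points):
    """Average x coordinate of a list of (x, y) points, or None if empty.

    Simplified stand-in for the m10 / m00 calculation; returning None for
    an empty shape plays the role of guarding against division by zero.
    """
    if not points:
        return None
    return sum(x for x, _ in points) / len(points)

def middle_zone(width):
    """Left and right x limits of a band one quarter of the page wide, centered."""
    eighth = width // 8
    midpoint = width // 2
    return (midpoint - eighth, midpoint + eighth)

def in_middle_zone(points, width):
    """True if the shape's centroid falls inside the middle band of the page."""
    cx = centroid_x(points)
    left, right = middle_zone(width)
    return cx is not None and left <= cx <= right

page_width = 800
# Made-up shapes given by their corner points
text_block = [(100, 50), (700, 50), (700, 900), (100, 900)]
margin_note = [(20, 300), (80, 300), (80, 360), (20, 360)]

print(middle_zone(page_width))
print(in_middle_zone(text_block, page_width))
print(in_middle_zone(margin_note, page_width))
```

The text block is kept because it is *centered* on the middle band, even though it extends well beyond it; the margin note is dropped because its center lies far to the left.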

In [None]:
#Code cell 16
#Create an empty list
middle_zone_contours = []
#Iterate through the list of contours
for contour in contours :
  #https://learnopencv.com/find-center-of-blob-centroid-using-opencv-cpp-python/
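  M = cv2.moments(contour)
  #Skip degenerate contours with zero area, which would cause a division by zero
  if M["m00"] == 0 :
    continue
  contour_x = int(M["m10"] / M["m00"])
  #If the x-axis value of the centroid is in range for the x-axis values of the
  #middle_zone, then add it to the list of middle_zone_contours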
  if middle_zone[0] <= contour_x <= middle_zone[1] :
    middle_zone_contours.append(contour)

#This code just shows what we've done
show_middle_contours = show_middle_zone.copy()

#Iterate through the list of middle_zone_contours, outlining them in purple
for middle_contour in middle_zone_contours :
  show_middle_contours = cv2.drawContours(show_middle_contours, [middle_contour], -1, (255, 0, 255), 3)
cv2_imshow(show_middle_contours)

#### 2.d.iii - Find boundary rectangles for contours of interest

In [None]:
#Code cell 17

#The image is getting a little busy, so I'm making a new copy of the dilate image,
#converting it to color to be able to show colored lines and rectangles
show_rectangles = cv2.cvtColor(dilate_image.copy(), cv2.COLOR_GRAY2BGR)
#Put the detected contours back on the image
for contour in contours :
  #All contours in green
  show_rectangles = cv2.drawContours(show_rectangles, [contour], -1, (0,255,0), 3)
for middle_zone_contour in middle_zone_contours :
  #Middle zone contours in purple
  show_rectangles = cv2.drawContours(show_rectangles, [middle_zone_contour], -1, (255, 0, 255), 3)

#Create a list of rectangles found by getting the boundingRect of each contour
#in the middle_zone
rectangles = [cv2.boundingRect(contour) for contour in middle_zone_contours]
for rectangle in rectangles :

  #openCV stores boundingRects as a tuple consisting of the x, y coordinate of
  #the upper left corner, the width of the rectangle, and the height of the rectangle.
  #But it *draws* rectangles based on two points: the upper left corner and
  #the lower right corner. So these lines figure out those two points for each
  #rectangle
  start_point = (rectangle[0], rectangle[1])
  end_point = (rectangle[0] + rectangle[2], rectangle[1] + rectangle[3])
  #Then draw the rectangle on the image
  show_rectangles = cv2.rectangle(show_rectangles, start_point, end_point, (0, 0, 255), 3)
  #And print the coordinates of the upper left corner
  show_rectangles = cv2.putText(show_rectangles, str(rectangle[0]) + ',' + str(rectangle[1]),
                              (rectangle[0], rectangle[1]),
                              cv2.FONT_HERSHEY_SIMPLEX, 1.5, (0, 0, 255), 2)

cv2_imshow(show_rectangles)

#### 2.d.iv - Determine a rectangle large enough to contain all of the boundary rectangles of the contours of interest
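The logic of this step can be sketched without OpenCV at all. Each bounding rectangle is an `(x, y, width, height)` tuple, and we want the smallest box that contains all of them, plus some padding. The `enclosing_box` helper and the sample rectangles below are assumptions for illustration. The sketch also keeps the padded box inside the page, since a negative coordinate would otherwise wrap around when used to slice an image.

```python
def enclosing_box(rectangles, image_width, image_height, pad_x=100, pad_y=50):
    """Return (left, top, right, bottom) of a padded box around all rectangles.

    Hypothetical helper: rectangles are (x, y, w, h) tuples, and the result
    is clamped so it never extends past the edges of the image.
    """
    left = max(0, min(x for x, y, w, h in rectangles) - pad_x)
    top = max(0, min(y for x, y, w, h in rectangles) - pad_y)
    right = min(image_width, max(x + w for x, y, w, h in rectangles) + pad_x)
    bottom = min(image_height, max(y + h for x, y, w, h in rectangles) + pad_y)
    return left, top, right, bottom

# Two made-up bounding rectangles on a 700 x 1000 pixel page
sample_rectangles = [(60, 120, 500, 40), (80, 200, 450, 300)]
box = enclosing_box(sample_rectangles, image_width=700, image_height=1000)
print(box)
```

Here the left edge would have been 60 - 100 = -40 without the clamp, so it is held at 0 instead.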

In [None]:
#Code cell 18
#Construct lists of x- and y-axis coordinates for each rectangle
leftx_coords = [rectangle[0] for rectangle in rectangles]
rightx_coords = [rectangle[0] + rectangle[2] for rectangle in rectangles]
topy_coords = [rectangle[1] for rectangle in rectangles]
bottomy_coords = [rectangle[1]  + rectangle[3] for rectangle in rectangles]

#Get the left-, right-, top-, and bottom-most x- and y-axis values by getting
#the minima and maxima of the values in the lists we just made, then
#padding them a little bit so that we're not cropping right against the text.
#Clamp the padded values to the image edges so we never get negative
#coordinates (negative indices would wrap around when we crop)
leftmost = max(min(leftx_coords) - 100, 0)
rightmost = min(max(rightx_coords) + 100, width)
topmost = max(min(topy_coords) - 50, 0)
bottommost = min(max(bottomy_coords) + 50, height)

#Construct coordinates for the four corners of the imaginary rectangle using the
#left-, right-, top-, and bottom-most x- and y-axis values
upper_left = (leftmost, topmost)
upper_right = (rightmost, topmost)
lower_right = (rightmost, bottommost)
lower_left = (leftmost, bottommost)

In [None]:
#Code cell 19
#Make a copy of the show_rectangles image with the red rectangles already drawn
text_block = show_rectangles.copy()
#Draw a rectangle on the image, using the upper_left and lower_right coordinates
text_block = cv2.rectangle(text_block, upper_left, lower_right, (255, 255, 0), 3)

cv2_imshow(text_block)


#### 2.d.v - Use the imaginary rectangle to crop the original image to the text block
It can be hard to remember as we step through the code, but all of these strange-looking white-on-black images are actually representations of our original page image.

That means that we can take information derived from these reader-hostile (but computer-friendly) images and apply it to the original image.
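One detail that is easy to trip over: an image array is indexed by row (the y-axis) first and column (the x-axis) second, so a crop is written as `image[top:bottom, left:right]`. The sketch below shows the same operation on a plain list of lists; the `crop` helper and the sample grid are illustrative assumptions.

```python
def crop(rows, left, top, right, bottom):
    """Return the sub-grid between the given edges (right and bottom excluded).

    Hypothetical helper: rows are selected first (top to bottom), and then
    each selected row is trimmed to the columns left to right.
    """
    return [row[left:right] for row in rows[top:bottom]]

# A 4-row, 5-column grid whose values encode their position as row * 10 + column
grid = [[row * 10 + col for col in range(5)] for row in range(4)]

cropped = crop(grid, left=1, top=1, right=4, bottom=3)
print(cropped)
```

The end values are edges of the box, not a width and a height, which is why the padded right-most and bottom-most coordinates can be used directly.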

In [None]:
#Code cell 20
text_block_cropped = cv2color_image.copy()
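#NumPy slices rows (y) first, then columns (x). The end values are the
#bottom and right edges of the text block, not a height and width
text_block_cropped = text_block_cropped[topmost:bottommost, leftmost:rightmost]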
cv2_imshow(text_block_cropped)

### 2.e - Automatically cropping images

In [None]:
#Code cell 21
def get_text_block(image) :
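  #Convert the image we were given to grayscale and blur it, as in the steps above
  gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
  blurred = cv2.GaussianBlur(gray, (5, 5), 0)
  #Invert
  invert = cv2.threshold(blurred, 0, 255, cv2.THRESH_BINARY_INV + cv2.THRESH_OTSU)[1]
  #Floodfill
  h, w = invert.shape[:2]
  mask = np.zeros((h+2, w+2), np.uint8)
  floodfill = invert.copy()
  cv2.floodFill(floodfill, mask, (0,0), 0)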
  #Dilate
  kernel = cv2.getStructuringElement(cv2.MORPH_RECT, (10, 20))
  dilate = cv2.dilate(floodfill, kernel, iterations=5)
  #Find_all_contours
  contours, hierarchy = cv2.findContours(dilate, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
  #Define middle-of-page area of interest
  height = np.shape(dilate)[0]
  width = np.shape(dilate)[1]
  eighth = int(width/8)
  midpoint_x = int(width/2)
  middle_zone = (midpoint_x - eighth, midpoint_x + eighth)
  #Select contours centered on area of interest
  middle_zone_contours = []
  for contour in contours :
    M = cv2.moments(contour)
    #Skip zero-area contours to avoid dividing by zero
    if M["m00"] == 0 :
      continue
    contour_x = int(M["m10"] / M["m00"])
    if middle_zone[0] <= contour_x <= middle_zone[1] :
      middle_zone_contours.append(contour)
  #Get bounding rectangles
  rectangles = [cv2.boundingRect(contour) for contour in middle_zone_contours]
  #Construct text block rectangle
  leftx_coords = [rectangle[0] for rectangle in rectangles]
  rightx_coords = [rectangle[0] + rectangle[2] for rectangle in rectangles]
  topy_coords = [rectangle[1] for rectangle in rectangles]
  bottomy_coords = [rectangle[1]  + rectangle[3] for rectangle in rectangles]
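  #Pad the text block, clamping to the image edges to avoid negative indices
  leftmost = max(min(leftx_coords) - 100, 0)
  rightmost = min(max(rightx_coords) + 100, width)
  topmost = max(min(topy_coords) - 50, 0)
  bottommost = min(max(bottomy_coords) + 50, height)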
  return image[topmost:bottommost, leftmost:rightmost]


In [None]:
#Code cell 22
if not os.path.exists('/content/cropped/') :
  os.makedirs('/content/cropped/')
for file in glob.glob('/content/2023_page_images/*.tif') :
  basename = os.path.basename(file)[:-4] + '-cropped.tif'
  original = cv2.imread(file, cv2.IMREAD_COLOR)
  cropped = get_text_block(original)
  cv2.imwrite('/content/cropped/' + basename, cropped)
  print('Saved ' + basename)

### Move files out of Colab environment to Google Drive

In [None]:
#Code cell 23
%cd /content/
!zip -r cropped.zip cropped/
!mv /content/cropped.zip /gdrive/MyDrive/rbs_digital_approaches_2023/output/cropped.zip