## Using untrained Tesseract to get line coordinates and preliminary OCR
Even without training, Tesseract will do a pretty good job of recognizing lines of text on the page when dealing with print that's more or less like what it's been trained on. Its recognition of the characters isn't what we'd like, but it gets the lines right, at least.

To do that, though, we have to prepare the images for OCR, just as we would do if we were actually expecting to get good text out. That starts with binarization.
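
If you're curious what Otsu's method is doing when we binarize, here is a minimal pure-Python sketch of the idea: try every possible gray level as a cutoff and keep the one that best separates dark pixels from light ones. The function name `otsu_threshold` and the sample pixel values are made up for illustration; the notebook itself leaves this work to `cv2.threshold` with `cv2.THRESH_OTSU`.

```python
# Illustrative only: a pure-Python version of Otsu's threshold selection.
# The notebook itself relies on cv2.threshold(..., cv2.THRESH_OTSU).
def otsu_threshold(pixels):
    """Return the gray level t (0-255) that best separates `pixels` into
    a dark class (<= t) and a light class (> t)."""
    histogram = [0] * 256
    for value in pixels:
        histogram[value] += 1

    total = len(pixels)
    total_sum = sum(level * count for level, count in enumerate(histogram))

    best_threshold, best_variance = 0, -1.0
    dark_count, dark_sum = 0, 0
    for level in range(256):
        dark_count += histogram[level]
        dark_sum += level * histogram[level]
        light_count = total - dark_count
        if dark_count == 0 or light_count == 0:
            continue  # one class is empty, so there is nothing to separate
        dark_mean = dark_sum / dark_count
        light_mean = (total_sum - dark_sum) / light_count
        # Between-class variance: large when the two groups are far apart
        variance = dark_count * light_count * (dark_mean - light_mean) ** 2
        if variance > best_variance:
            best_threshold, best_variance = level, variance
    return best_threshold


# A made-up "page": mostly light paper (around 200) with a little dark ink (around 40)
sample_pixels = [200] * 80 + [210] * 10 + [40] * 8 + [50] * 2
threshold = otsu_threshold(sample_pixels)

# Everything brighter than the threshold becomes white (255), the rest black (0)
binarized = [255 if value > threshold else 0 for value in sample_pixels]
```

On a real page the paper and ink values overlap far more than in this toy sample, which is why the blur step before thresholding matters.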

## Connect to Google Drive

In [None]:
#Code cell #1
#Connect to Google Drive
from google.colab import drive
drive.mount('/gdrive')

## Import packages

In [None]:
#Code cell #2
#import libraries we'll need
import os
import glob
import cv2
from google.colab.patches import cv2_imshow
from PIL import Image
import numpy as np

## Move page images from Google Drive to Colaboratory


In [None]:
#Code cell #3
%cp /gdrive/MyDrive/L-100\ Digital\ Approaches\ to\ Bibliography\ \&\ Book\ History-2023/penn_pr3732_t7_1730b.zip /content/penn_pr3732_t7_1730b.zip
%cd /content/
!unzip penn_pr3732_t7_1730b.zip
%cd /content/penn_pr3732_t7_1730b/

## Define functions
You've seen all of this code before in prior notebooks. This notebook just repurposes the code from the interactive notebooks into functions that can be called from other cells.

In [None]:
#Code cell #4
#Define image processing functions: these should look familiar
def binarize_invert_color_image(cv2image) :
  # Convert to grayscale, blur to suppress noise, then apply Otsu's threshold,
  # inverted so that the text comes out white on a black background
  cv2gray_image = cv2.cvtColor(cv2image, cv2.COLOR_BGR2GRAY)
  cv2blurred_image = cv2.GaussianBlur(cv2gray_image, (9, 9), 0)
  cv2binary_inverted_image = cv2.threshold(cv2blurred_image, 0, 255, cv2.THRESH_BINARY_INV + cv2.THRESH_OTSU)[1]
  return cv2binary_inverted_image

def get_deskew_angle(cv2image) :
  # Dilate with a wide, short kernel so the characters on each line merge into one blob
  kernel = cv2.getStructuringElement(cv2.MORPH_RECT, (20, 1))
  dilate = cv2.dilate(cv2image, kernel, iterations=3)
  contours, hierarchy = cv2.findContours(dilate, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)

  # Fit a rotated rectangle to each blob, skipping small ones and those with no measurable tilt
  rects = []
  for contour in contours :
    minAreaRect = cv2.minAreaRect(contour)
    if minAreaRect[1][1] > 60 :
      if minAreaRect[-1] not in [0.0, -0.0, 90, -90.0] :
        rects.append(minAreaRect)
  if len(rects) == 0 :
    angle = 0
  else :
    angle_corrections = []
    for rect in rects :
      if 45 < rect[-1] < 90 :
        angle_corrections.append((-1* (90 - (rect[-1])), -1))
      else :
        angle_corrections.append((90 - (90 + rect[-1]), 1))
    angle = np.mean([angle_tuple[0] for angle_tuple in angle_corrections])
    plus_or_minus = sum(angle_tuple[1] for angle_tuple in angle_corrections)
    if plus_or_minus > 0 :
      angle = -1.0 * angle
  return angle

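def deskew_image(cv2image, angle) :
  # Rotate a copy of the image around its center by the given angle (in degrees)
  new_image = cv2image.copy()
  (h, w) = new_image.shape[:2]
  center = (w // 2, h // 2)
  M = cv2.getRotationMatrix2D(center, angle, 1.0)
  # BORDER_REPLICATE fills the corners exposed by the rotation with the nearest edge pixels
  deskewed_image = cv2.warpAffine(new_image, M, (w, h), flags=cv2.INTER_CUBIC, borderMode=cv2.BORDER_REPLICATE)
  return deskewed_image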

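def threshold_color_image(cv2image) :
  # Same steps as binarize_invert_color_image, but with a lighter blur and no inversion,
  # so the result is black text on a white background
  cv2_gray_image = cv2.cvtColor(cv2image, cv2.COLOR_BGR2GRAY)
  cv2_blurred_image = cv2.GaussianBlur(cv2_gray_image, (5, 5), 0)
  cv2_binary_otsu_image = cv2.threshold(cv2_blurred_image, 0, 255, cv2.THRESH_OTSU)[1]
  return cv2_binary_otsu_image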

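def prepare_color_image(image) :
  # cv2.imread returns None instead of raising an error when it can't read a file
  cv2image = cv2.imread(image, cv2.IMREAD_COLOR)
  if cv2image is None :
    raise FileNotFoundError('Could not read image: ' + image)
  # Measure the skew on an inverted binary copy, then rotate the original color image
  invert = binarize_invert_color_image(cv2image)
  deskew_angle = get_deskew_angle(invert)
  deskewed = deskew_image(cv2image, deskew_angle)
  # Binarize the deskewed color image for OCR
  otsu_image = threshold_color_image(deskewed)
  return otsu_image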
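
To make the rotation step in `deskew_image` a little less of a black box, here is a small sketch that builds the same kind of 2x3 matrix that the OpenCV documentation gives for `cv2.getRotationMatrix2D`, using only the `math` module. The helper names `rotation_matrix` and `apply_affine`, the page size, and the 1.5 degree angle are all invented for illustration.

```python
import math

def rotation_matrix(center, angle_degrees, scale=1.0):
    """Build a 2x3 rotation matrix using the formula documented for cv2.getRotationMatrix2D."""
    cx, cy = center
    alpha = scale * math.cos(math.radians(angle_degrees))
    beta = scale * math.sin(math.radians(angle_degrees))
    return [
        [alpha, beta, (1 - alpha) * cx - beta * cy],
        [-beta, alpha, beta * cx + (1 - alpha) * cy],
    ]

def apply_affine(matrix, point):
    """Apply a 2x3 affine matrix to an (x, y) point."""
    x, y = point
    return (matrix[0][0] * x + matrix[0][1] * y + matrix[0][2],
            matrix[1][0] * x + matrix[1][1] * y + matrix[1][2])

# Hypothetical page dimensions and skew angle, chosen only for this example
height, width = 3000, 2000
center = (width // 2, height // 2)
M = rotation_matrix(center, 1.5)

moved_center = apply_affine(M, center)   # the center of the page stays where it is
moved_corner = apply_affine(M, (0, 0))   # the corners swing around the center
print(moved_center, moved_corner)
```

Because the corners move while the canvas stays the same size, a rotated page leaves small empty wedges at the edges, and those have to be filled with something.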



## Choose images to work on
By default, I've set this notebook to only process a few images, just so you can see how the process flows. If you'd like to process all the images, simply comment out line 4 and uncomment the list from line 5 to line 13.

In [None]:
#Code cell #5
source_image_directory = '/content/penn_pr3732_t7_1730b/'
source_image_basename = 'PR3732_T7_1730b_body00'
source_pages = ['13', '21', '22', '86']
# source_pages = ['01', '03', '04', '05', '07', '08', '09', '10', '11',
#                 '12', '13', '14', '15', '16', '17', '18', '19', '20', '21',
#                 '22', '23', '24', '25', '26', '27', '28', '29', '30', '31',
#                 '32', '33', '34', '35', '36', '37', '38', '39', '40', '41',
#                 '42', '43', '44', '45', '46', '47', '48', '49', '50', '51',
#                 '52', '53', '54', '55', '56', '57', '58', '59', '60', '61',
#                 '62', '63', '64', '65', '66', '67', '68', '69', '70', '71',
#                 '72', '73', '74', '75', '76', '77', '78', '79', '80', '81',
#                 '82', '83', '84', '85', '86']
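
If you'd rather not maintain the long list by hand, a short comprehension can build it. This is only a sketch: it assumes the pages run from 01 to 86 with 02 and 06 skipped, which is what the commented-out list contains, and it uses a new name, `all_source_pages`, so that it doesn't overwrite `source_pages`.

```python
# Assumption: pages are numbered 01 through 86 and only 02 and 06 are skipped,
# matching the commented-out list in code cell #5
skipped_pages = {2, 6}
all_source_pages = [f'{number:02d}' for number in range(1, 87)
                    if number not in skipped_pages]

print(len(all_source_pages), all_source_pages[:5], all_source_pages[-1])
```

To process everything, you could then set `source_pages = all_source_pages`.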

## Binarize and deskew pages for OCR
This sends each of our color images through the `prepare_color_image` function (which, in turn, calls `binarize_invert_color_image`, `get_deskew_angle`, `deskew_image`, and `threshold_color_image`, all defined in code cell #4).

In [None]:
#Code cell #6
#Create deskewed black and white derivative files
page_output_directory = '/content/penn_pr3732_t7_1730b/bw/'
os.makedirs(page_output_directory, exist_ok=True)
for source_page in source_pages :
  image = source_image_directory + source_image_basename + source_page + '.tif'
  print(image)
  outfile_name = page_output_directory + source_image_basename + source_page + '-bw.tif'
  bw_deskewed = prepare_color_image(image)
  bw_deskewed = Image.fromarray(bw_deskewed)
  bw_deskewed.save(outfile_name, dpi=(400,400))
  print('Saving ' + outfile_name)
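
Since `cv2.imread` returns `None` rather than raising an error when a file can't be read, a mistyped page number tends to surface later as a confusing error message. One way to catch that early is to check that every expected file exists before processing. The helper below, `find_missing_pages`, is hypothetical and not part of the original notebook; the demo runs against a temporary directory with a made-up basename so that it works anywhere.

```python
import os
import tempfile

def find_missing_pages(directory, basename, pages, extension='.tif'):
    """Return the page numbers whose source image is not on disk."""
    missing = []
    for page in pages:
        path = os.path.join(directory, basename + page + extension)
        if not os.path.isfile(path):
            missing.append(page)
    return missing

# Demo in a throwaway directory: page 13 exists, page 21 does not
with tempfile.TemporaryDirectory() as demo_directory:
    open(os.path.join(demo_directory, 'demo_body0013.tif'), 'wb').close()
    missing = find_missing_pages(demo_directory, 'demo_body00', ['13', '21'])

print('Missing pages:', missing)
```

In this notebook, the call would look like `find_missing_pages(source_image_directory, source_image_basename, source_pages)`.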

## Move black and white pages back to Google Drive so we can inspect them more easily
If you examine the binarized images, you may well find some where the process we just used didn't work very well: perhaps Otsu's method picked a poor threshold, or perhaps the deskewing routine didn't quite do the trick for a particular page. (I noticed that page 86 fared pretty badly, for instance, and there may be others I'm missing.)

It's much easier to look at images in Google Drive, so we'll compress our folder of black and white images with `zip`, copy that .zip archive over to Google Drive, and then unzip it.

When that process is finished, you can view the images in your browser and note any that need to be re-processed. The next notebook gives you a way to tweak any problem images to get a better binarized image to use for preliminary OCR. (Note that, while you have black and white derivatives of *all* of the images, unless you changed the code in code cell #5, you only created new black and white versions of numbers 13, 21, 22, and 86.)

In [None]:
#Code cell #7
%cd /content/penn_pr3732_t7_1730b/
!zip -r penn_pr3732_t7_1730b-bw.zip bw/
!mkdir -p /gdrive/MyDrive/rbs_digital_approaches_2023/output/
!mv penn_pr3732_t7_1730b-bw.zip /gdrive/MyDrive/rbs_digital_approaches_2023/output/penn_pr3732_t7_1730b-bw.zip
%cd /gdrive/MyDrive/rbs_digital_approaches_2023/output/
!unzip penn_pr3732_t7_1730b-bw.zip
!mv bw/ penn_pr3732_t7_1730b-bw
# !rm penn_pr3732_t7_1730b-bw.zip
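
The same zip, copy, and unzip sequence can also be written in Python with the standard `shutil` module, which may be handy if you ever run this outside Colaboratory. This is a sketch of an alternative, not what the cell above does: the function name `archive_and_unpack` and the demo file names are invented, and unlike the shell version it writes the archive straight into the destination folder instead of moving it there afterward.

```python
import os
import shutil
import tempfile

def archive_and_unpack(source_directory, destination_directory, archive_name):
    """Zip source_directory into destination_directory, then unpack it there."""
    os.makedirs(destination_directory, exist_ok=True)
    archive_base = os.path.join(destination_directory, archive_name)
    # make_archive appends '.zip' to archive_base and returns the full archive path
    archive_path = shutil.make_archive(archive_base, 'zip', root_dir=source_directory)
    unpacked_directory = os.path.join(destination_directory, archive_name)
    shutil.unpack_archive(archive_path, unpacked_directory)
    return archive_path, unpacked_directory

# Demo with throwaway folders standing in for /content/.../bw/ and the Drive output folder
with tempfile.TemporaryDirectory() as workspace:
    bw_directory = os.path.join(workspace, 'bw')
    os.makedirs(bw_directory)
    with open(os.path.join(bw_directory, 'page13-bw.tif'), 'wb') as handle:
        handle.write(b'placeholder')

    drive_output = os.path.join(workspace, 'drive_output')
    archive_path, unpacked = archive_and_unpack(bw_directory, drive_output, 'demo-bw')
    archive_exists = os.path.isfile(archive_path)
    unpacked_files = sorted(os.listdir(unpacked))

print(archive_exists, unpacked_files)
```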

## Wipe out the contents of the Colaboratory environment
I don't *think* that leaving all of those .tif files will end up counting against your Google storage quota when you're done, but why take the risk? Let's just delete everything we uploaded here and head back to Google Drive for the next steps.

In [None]:
#Code cell #8
%cd /content/
! rm -r ./*