# Re-process bad images
If you examine the binarized images, you may well find some where the process we used in the last notebook fell short: perhaps Otsu's method didn't yield the best binarization, or perhaps the deskewing routine didn't quite do the trick for a particular page. (I noticed that page 86 fared pretty badly, for instance, and there may be others I'm missing.)

This notebook offers an interactive way to tweak the binarization and deskewing methods in order to come up with a better result for any given image. When you have a result that looks better, you can save a new binarized file for preliminary OCR.

If, after checking out the images, you don't see any that need fixing, then you can just skip this altogether. If you do see some that need tweaking, I'd recommend only doing one or two to get a feel for the kinds of adjustments you'd make—in the time we have, there's no need to go for perfect results for all of the images.

(**Note:** Because this notebook mostly repackages things we've already done, there are very few comments. There are also some differences here that I introduced to solve little snags along the way. I haven't tested this exhaustively, so some things might not work as expected.)

In [None]:
#Code cell #1
#Connect to Google Drive
from google.colab import drive
drive.mount('/gdrive')

In [None]:
#Code cell #2
!pip install pytesseract
import ipywidgets as widgets
from ipywidgets import interact, interact_manual, interactive
from PIL import Image, ImageDraw
import cv2
from google.colab.patches import cv2_imshow
import matplotlib.pyplot as plt
import numpy as np
import pytesseract

In [None]:
#Code cell #3
#I'm assuming that you'll be working with the page images I provided. If you're
#working on the images you produced yourself in the last notebook, just change
#the path below to retrieve the images from your rbs_digital_approaches_2021/output/
#folder
%cp /gdrive/MyDrive/L-100a/page_images.zip /content/page_images.zip
%cd /content/
!unzip page_images.zip
%cd /content/page_images/
!unzip penn_pr3732_t7_1730b.zip

In [None]:
#Code cell #4
image_source_directory = '/content/page_images/penn_pr3732_t7_1730b/'

In [None]:
#@title 1 - Choose Image to Reprocess {display-mode: "form"}
#@markdown Run this cell to generate a dropdown menu to select an image that needs to be reprocessed.
image_select = widgets.Dropdown(
    description='Choose image',
    #The page images are numbered body0001.tif through body0086.tif
    options = ['PR3732_T7_1730b_body{:04d}.tif'.format(i) for i in range(1, 87)],
    value = 'PR3732_T7_1730b_body0001.tif',
    style={'description_width': 'initial'})
display(image_select)

In [None]:
#Code cell #5
source_image = image_source_directory + image_select.value

## 2 - Try Adaptive Thresholding
If you get good results with adaptive thresholding in this step, you can proceed to number 4 (Deskew or Save?).

In [None]:
#Code cell #6
cv2color_image = cv2.imread(source_image, cv2.IMREAD_COLOR)
cv2gray_image = cv2.cvtColor(cv2color_image, cv2.COLOR_BGR2GRAY)


In [None]:
#@title Set values for Gaussian blur {display-mode: "form"}
#@markdown Try adjusting the value that will be used for blurring in the next cell.

#@markdown (You only need to run this cell once—re-running it will simply reset it to the default value. After changing the value of the slider, try re-running the cell below this one.)
blur = widgets.IntSlider(min=1, max=31, step=2, value=5, description='Blur')
display(blur)

In [None]:
#Code cell #7
cv2blurred_image = cv2.GaussianBlur(cv2gray_image, (blur.value, blur.value), 0)
cv2_imshow(cv2blurred_image)
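
The blur slider moves in steps of 2 starting from 1 because `cv2.GaussianBlur` expects odd kernel dimensions, so that the kernel has a single center pixel. To get a feel for what a larger value does, here's a rough sketch in plain Python that builds a one-dimensional Gaussian kernel. The function name `gaussian_kernel_1d` and the `sigma` values are my own assumptions for illustration; in the cell above we pass 0 and let OpenCV choose its own sigma, so the actual weights will differ.

```python
import math

def gaussian_kernel_1d(ksize, sigma):
    """Return ksize normalized Gaussian weights centered on the middle entry."""
    if ksize < 1 or ksize % 2 == 0:
        raise ValueError('ksize must be a positive odd number')
    center = ksize // 2
    weights = [math.exp(-((i - center) ** 2) / (2 * sigma ** 2))
               for i in range(ksize)]
    total = sum(weights)
    #Normalize so that blurring doesn't change overall brightness
    return [weight / total for weight in weights]

small = gaussian_kernel_1d(5, 1.0)    #hypothetical sigma
large = gaussian_kernel_1d(11, 2.5)   #hypothetical sigma
print([round(weight, 3) for weight in small])
print([round(weight, 3) for weight in large])
```

The larger kernel spreads its weight over more neighbors, so each output pixel depends less on its own original value. That is why high blur values smooth away speckle but also soften thin strokes.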

In [None]:
#Code cell #8
cv2binary_adaptive_image = cv2.adaptiveThreshold(cv2blurred_image, 255, cv2.ADAPTIVE_THRESH_GAUSSIAN_C, cv2.THRESH_BINARY, 101, 30)
cv2_imshow(cv2binary_adaptive_image)
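
The call above uses a block size of 101 and a constant of 30: each pixel is compared against a weighted average of its 101 x 101 neighborhood, minus 30. Here's a simplified sketch of the idea in plain Python. It is not what OpenCV does internally: it uses a plain (unweighted) mean rather than Gaussian weights, and it clips the neighborhood at the image edges. The function name `adaptive_mean_threshold` and the tiny sample row are made up for illustration.

```python
def adaptive_mean_threshold(gray, block_size, c):
    """Binarize a 2D list of 0-255 values against a local mean minus c."""
    half = block_size // 2
    height, width = len(gray), len(gray[0])
    result = []
    for y in range(height):
        row = []
        for x in range(width):
            #Gather the neighborhood, clipped at the image edges
            neighbors = [gray[j][i]
                         for j in range(max(0, y - half), min(height, y + half + 1))
                         for i in range(max(0, x - half), min(width, x + half + 1))]
            local_threshold = sum(neighbors) / len(neighbors) - c
            row.append(255 if gray[y][x] > local_threshold else 0)
        result.append(row)
    return result

#One row of "paper" that gets darker from left to right, with ink at
#positions 2 and 6
page = [[200, 190, 100, 170, 160, 150, 60, 130, 120]]
adaptive = adaptive_mean_threshold(page, block_size=3, c=10)
global_150 = [[255 if value > 150 else 0 for value in row] for row in page]
print(adaptive)
print(global_150)
```

On this sample the local threshold keeps both ink pixels black and the paper white, while a single global threshold of 150 turns the whole darker half black. That is the kind of unevenly lit page where adaptive thresholding tends to help.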

## 3 - Try Manual Thresholding
If adaptive thresholding doesn't give you good results, you can try manual thresholding instead. Once the image looks good to you, move on to number 4 (Deskew or Save?).

In [None]:
#Code cell #9
pilcolor_image = Image.open(source_image)
pilgray_image = pilcolor_image.convert('L')

In [None]:
#@title Set a threshold value {display-mode: "form"}
#@markdown Run this cell, then use the slider that will appear to adjust the threshold point for our image in the cell below.

#@markdown You only need to run this cell once (re-running it will just set things back to the default value). Try adjusting the slider and then re-running the *next* cell a few times to see the difference that different threshold values make.
thresh_value_slider = widgets.IntSlider(
    min=0,
    max=255,
    step=1,
    description='Threshold:',
    value=150
)
display(thresh_value_slider)

In [None]:
#Code cell #10
thresh = thresh_value_slider.value
fn = lambda x : 255 if x > thresh else 0
pilbinary_image = pilgray_image.convert('L').point(fn, mode='1')
pilbinary_image
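
The lambda above turns a pixel white only if its gray value is strictly greater than the threshold; everything else goes black. Here's a minimal sketch on a handful of made-up gray values (the helper `binarize` and the sample list are just for illustration) showing how raising the threshold pushes more pixels to black:

```python
def binarize(gray_values, thresh):
    """Map each 0-255 gray value to 255 (white) or 0 (black)."""
    return [255 if value > thresh else 0 for value in gray_values]

sample = [30, 90, 150, 151, 220]
for thresh in (100, 150, 200):
    binary = binarize(sample, thresh)
    print(thresh, binary, 'black pixels:', binary.count(0))
```

Note that a pixel exactly equal to the threshold (150 here) ends up black. If faint text is dropping out, try a higher threshold; if the background is filling in with speckle, try a lower one.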

## 4 - Deskew or Save?

In [None]:
#@title Deskew or save? {display-mode: "form"}
#@markdown (Run this cell to create some widgets for this step.)

#@markdown Do we need to deskew? If so, which thresholding method produced the better result?

#@markdown If you're ready to save the image, select "No" and choose which
#@markdown thresholded image to save, then skip to the "Save" section
#@markdown and proceed to re-OCR.

#@markdown If the image needs deskewing, select "Yes" and indicate which of
#@markdown the thresholded images should be used for deskewing.
proceed_to_deskew = widgets.Dropdown(
    description='Deskew?',\
    options = ['Yes', 'No'],\
    value = 'Yes',
    style = {'description_width': 'initial'}
    )
thresholded_image = widgets.Dropdown(
    description='Thresholded image',\
    options = ['Adaptive Threshold', 'Manual Threshold'],\
    value = 'Adaptive Threshold',
    style={'description_width': 'initial'}
    )
display(proceed_to_deskew)
display(thresholded_image)

### 4.a - Deskew

In [None]:
#Code cell #11
if thresholded_image.value == 'Adaptive Threshold' :
  image_to_deskew = cv2binary_adaptive_image
else :
  pass_to_cv2 = np.array(pilbinary_image) 
  image_to_deskew = pass_to_cv2.astype(np.uint8) * 255
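#Invert the image so that the text is white on a black background, which is
#what the dilation and contour steps below work with
thresh = cv2.threshold(image_to_deskew, 0, 255, cv2.THRESH_BINARY_INV + cv2.THRESH_OTSU)[1]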

In [None]:
#@title Set dilation variables { display-mode: "form" }
#@markdown (Run this cell to create sliders for setting the kernel size and the number of dilation iterations.)
kernel_width = widgets.IntSlider(description = 'Kernel width', \
                                               min=10, max=50, step=5, value=30)
kernel_height = widgets.IntSlider(description='Kernel height', \
                                                 min=1, max=10, step=1, value=5)
num_iterations = widgets.IntSlider(description='Iterations', min=1, \
                      max=10, step=1, value=5)
display(kernel_width) 
display(kernel_height)
display(num_iterations)

In [None]:
#Code cell #12
kernel = cv2.getStructuringElement(cv2.MORPH_RECT, (kernel_width.value, kernel_height.value))
#We dilate the pixels using the shape defined by kernel, and repeat the operation
#as many times as the Iterations slider says. You could try increasing or
#decreasing the kernel size or the number of iterations to see how the output changes.
dilate = cv2.dilate(thresh, kernel, iterations=num_iterations.value)
cv2_imshow(dilate)
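
The point of dilating with a kernel that is much wider than it is tall is to smear the letters on each line into one solid blob without letting neighboring lines touch. Here's a simplified sketch of rectangular dilation in plain Python. It assumes odd kernel dimensions so the window is symmetric around each pixel (OpenCV handles even sizes with an off-center anchor), and the function name `dilate_rect` and the tiny sample page are made up for illustration.

```python
def dilate_rect(binary, kernel_width, kernel_height, iterations=1):
    """Dilate a 2D list of 0/255 values with a rectangular kernel."""
    half_w, half_h = kernel_width // 2, kernel_height // 2
    height, width = len(binary), len(binary[0])
    current = binary
    for _ in range(iterations):
        dilated = []
        for y in range(height):
            row = []
            for x in range(width):
                #A pixel becomes white if any pixel under the kernel is white
                window = [current[j][i]
                          for j in range(max(0, y - half_h), min(height, y + half_h + 1))
                          for i in range(max(0, x - half_w), min(width, x + half_w + 1))]
                row.append(max(window))
            dilated.append(row)
        current = dilated
    return current

#Two "lines" of three separate "letters" each (white on black)
page = [[0, 255, 0, 0, 255, 0, 0, 255, 0],
        [0, 0, 0, 0, 0, 0, 0, 0, 0],
        [0, 0, 0, 0, 0, 0, 0, 0, 0],
        [0, 255, 0, 0, 255, 0, 0, 255, 0]]
merged = dilate_rect(page, kernel_width=3, kernel_height=1)
for row in merged:
    print(row)
```

With a kernel 3 wide and 1 tall, the letters in each row join into a single run while the two empty rows between them stay black. If your lines are merging into each other in the real image, try a smaller kernel height or fewer iterations.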

In [None]:
#Code cell #13
contours, hierarchy = cv2.findContours(dilate, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
sorted_contours = sorted(contours, key = cv2.contourArea, reverse = True)

In [None]:
#Code cell #14
def draw_min_area_rect(cv2minimumarearectangle, base_image) :
  #Convert the single-channel image to BGR so the boxes can be drawn in color
  annotated_image = cv2.cvtColor(base_image, cv2.COLOR_GRAY2BGR)
  #Accept either a single minAreaRect or a list of them
  if isinstance(cv2minimumarearectangle, list) :
    rects_to_draw = cv2minimumarearectangle
    print(len(rects_to_draw))
  else :
    rects_to_draw = [cv2minimumarearectangle]
  for rect in rects_to_draw :
    #boxPoints returns the four corners as floats; cast them to integers
    #for drawing (np.int0 is not available in NumPy 2.0 and later)
    min_area_box = cv2.boxPoints(rect).astype(np.int32)
    #Draw the closed four-sided outline in red (BGR order)
    cv2.polylines(annotated_image, [min_area_box], True, (0, 30, 255), 3)
    #Label each box with its angle, just left of the box's center
    cv2.putText(annotated_image, str(rect[-1]),
                (int(rect[0][0]) - 100, int(rect[0][1])),
                cv2.FONT_HERSHEY_SIMPLEX, 1, (0, 30, 255), 3)

  return annotated_image

In [None]:
#@title Angle Calculation Method{display-mode: 'form'}
#@markdown (Run this cell to create a widget for use in this step.)

#@markdown Do you want to use all minAreaRect angles for deskewing, or
#@markdown just the angles from a subset of the largest contours? 
angle_method = widgets.Dropdown(
    description='Select method',\
    options = ['All Rects', 'Selected'],\
    value = 'All Rects',
    style={'description_width': 'initial'})
num_rects = widgets.IntSlider(description='Top rects', min=1, \
                      max=5, step=1, value=1)

display(angle_method)
display(num_rects)


In [None]:
#Code cell #15
rects = []
if angle_method.value == 'All Rects' :
  for contour in contours :
    minAreaRect = cv2.minAreaRect(contour)
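    #Skip small boxes, and boxes reported at exactly 0 or -90 degrees, since
    #those don't tell us anything useful about the skew
    if minAreaRect[1][1] > 60 : 
      if minAreaRect[-1] not in [-0.0, 0.0, -90.0] :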
        rects.append(minAreaRect)
else :
  for contour in sorted_contours[0:num_rects.value] :
    minAreaRect = cv2.minAreaRect(contour)
    rects.append(minAreaRect)

draw_all_rects = draw_min_area_rect(rects, dilate)
cv2_imshow(draw_all_rects)

In [None]:
#Code cell #16
angle_corrections = []
for rect in rects :
  if rect[-1] < -45 :
    angle_corrections.append((90 - (-1.0 * rect[-1]), -1))
  else :
    angle_corrections.append((90 - (90 + rect[-1]), 1))
average_angle = np.mean([angle_tuple[0] for angle_tuple in angle_corrections])

plus_or_minus = sum(angle_tuple[1] for angle_tuple in angle_corrections)
if plus_or_minus > 0 :
  average_angle = -1.0 * average_angle
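
The cell above turns each box angle into a small skew magnitude, averages the magnitudes, and then lets the boxes vote on the direction of the correction. Here's the same logic as a standalone sketch in plain Python, with some made-up angles. It assumes, as the cell above does, that `cv2.minAreaRect` reports angles between -90 and 0; newer OpenCV releases may report angles in a different range, in which case the branches would need adjusting. The function name `average_correction` is just for illustration.

```python
from statistics import mean

def average_correction(angles):
    """Average the skew magnitudes, then choose the sign by majority vote."""
    corrections = []
    for angle in angles:
        if angle < -45:
            #Angles near -90: the skew magnitude is the distance from -90
            corrections.append((90 + angle, -1))
        else:
            #Angles near 0: the skew magnitude is the distance from 0
            corrections.append((-angle, 1))
    average = mean([magnitude for magnitude, _ in corrections])
    votes = sum(sign for _, sign in corrections)
    return -average if votes > 0 else average

print(average_correction([-88.0, -87.0, -89.0]))
print(average_correction([-2.0, -3.0, -1.0]))
```

Both sample lists describe boxes that are about 2 degrees off, but in opposite directions, so the corrections come out with opposite signs. Because the sign is decided by a vote, a page with a mix of angles on both sides can come out under-corrected; that is a good time to switch to the "Selected" method and use only the largest contours.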

In [None]:
#Code cell #17
average_angle_deskew = image_to_deskew.copy()
(h, w) = average_angle_deskew.shape[:2]
center = (w // 2, h // 2)
#Rotate about the center of the image by the average angle, keeping the scale at 1.0
M = cv2.getRotationMatrix2D(center, average_angle, 1.0)
deskewed_average_angle = cv2.warpAffine(average_angle_deskew, M, (w, h), flags=cv2.INTER_CUBIC, borderMode=cv2.BORDER_REPLICATE)
cv2_imshow(deskewed_average_angle)
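
If you're curious what `cv2.getRotationMatrix2D` hands to `cv2.warpAffine`, here's a sketch in plain Python that follows the formula given in the OpenCV documentation: a 2 x 3 matrix that rotates around the chosen center, with positive angles meaning counter-clockwise rotation in image coordinates (where y grows downward). The helper names `rotation_matrix_2d` and `apply_affine` and the sample points are my own, for illustration only.

```python
import math

def rotation_matrix_2d(center, angle_degrees, scale=1.0):
    """Build the 2x3 affine matrix for a rotation about center."""
    cx, cy = center
    alpha = scale * math.cos(math.radians(angle_degrees))
    beta = scale * math.sin(math.radians(angle_degrees))
    return [[alpha, beta, (1 - alpha) * cx - beta * cy],
            [-beta, alpha, beta * cx + (1 - alpha) * cy]]

def apply_affine(matrix, point):
    """Apply a 2x3 affine matrix to an (x, y) point."""
    x, y = point
    return (matrix[0][0] * x + matrix[0][1] * y + matrix[0][2],
            matrix[1][0] * x + matrix[1][1] * y + matrix[1][2])

M_sketch = rotation_matrix_2d((50, 50), 90)
moved_center = apply_affine(M_sketch, (50, 50))
moved_point = apply_affine(M_sketch, (60, 50))
print(moved_center, moved_point)
```

The center stays put, and a point just to the right of the center ends up just above it, which is what a counter-clockwise quarter turn looks like on screen. For deskewing, the angles involved are only a degree or two, so the page content barely shifts near the center and moves most near the edges.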

### 4.b - Save
Save the reprocessed image before re-OCR'ing.

In [None]:
#Code cell #18
import os
#Save into a bw/ subfolder of the source directory, creating it if needed
output_directory = image_source_directory + 'bw/'
os.makedirs(output_directory, exist_ok=True)
#Swap the extension for -bw.tif (rstrip strips characters, not a suffix)
outname = os.path.splitext(image_select.value)[0] + '-bw.tif'
if proceed_to_deskew.value == 'Yes' :
  final_image = Image.fromarray(deskewed_average_angle)
elif thresholded_image.value == 'Adaptive Threshold' :
  final_image = Image.fromarray(cv2binary_adaptive_image)
else :
  #The manually thresholded PIL image is in mode '1'; convert it to 8-bit
  #values of 0 and 255 so it matches the other outputs
  pass_to_cv2 = np.array(pilbinary_image)
  intermediate_image = pass_to_cv2.astype(np.uint8) * 255
  final_image = Image.fromarray(intermediate_image)

print('Saving ' + output_directory + outname)
final_image.save(output_directory + outname)
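
A note on building the output filename: the goal is to swap the `.tif` extension for `-bw.tif`. Here's a small sketch of doing that with `os.path.splitext`, which removes only the extension. By contrast, `str.rstrip` removes any trailing run of the characters it is given, which happens to work for our numbered page files but can eat into other names. The helper name `bw_output_name` and the filename `shift.tif` are made up for illustration.

```python
import os

def bw_output_name(filename):
    """Replace the file extension with '-bw.tif'."""
    stem, _extension = os.path.splitext(filename)
    return stem + '-bw.tif'

page_name = bw_output_name('PR3732_T7_1730b_body0086.tif')
other_name = bw_output_name('shift.tif')
stripped = 'shift.tif'.rstrip('.tif')
print(page_name)
print(other_name)
print(stripped)
```

The last line shows the pitfall: `rstrip('.tif')` keeps removing `.`, `t`, `i`, and `f` characters from the end until it hits something else, so more than the extension can disappear.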

## Move output files back to Google Drive

In [None]:
#Code cell #19
%cd /content/page_images/
!zip -r penn_pr3732_t7_1730b.zip penn_pr3732_t7_1730b/
!mv penn_pr3732_t7_1730b.zip /gdrive/MyDrive/rbs_digital_approaches_2021/output/penn_pr3732_t7_1730b.zip


## Clear Colaboratory environment

In [None]:
#Code cell #20
%cd /content/
!rm -r ./*

## Moving on to preliminary OCR to get hOCR output
The next notebook will have you moving files back into the Colaboratory environment to perform preliminary OCR to get hOCR output and then slice up your page images into line-level images. That will be the last step for now!