# Using untrained hOCR output to get coordinates to create line-level images
This notebook uses an untrained version of Tesseract to identify the coordinates of text lines on the pages. We'll get those coordinates out of the hOCR XML using `BeautifulSoup`, then use them to crop the page images using `Pillow`.

Because of the large number of files involved, we'll copy files over to the Colaboratory environment, rather than trying to read and write directly from Google Drive.

In [1]:
#Connect to Google Drive
from google.colab import drive
drive.mount('/gdrive')

Mounted at /gdrive


In [None]:
%cd /gdrive/MyDrive/rbs_digital_approaches_2021/data_class/page_images/
!zip -r penn_pr3732_t7_1730b.zip penn_pr3732_t7_1730b/

/gdrive/MyDrive/rbs_digital_approaches_2021/data_class/page_images
updating: penn_pr3732_t7_1730b/ (stored 0%)
  adding: penn_pr3732_t7_1730b/PR3732_T7_1730b_body0002.tif (deflated 48%)
  adding: penn_pr3732_t7_1730b/PR3732_T7_1730b_body0001.tif (deflated 41%)
  adding: penn_pr3732_t7_1730b/PR3732_T7_1730b_body0004.tif (deflated 38%)
  adding: penn_pr3732_t7_1730b/PR3732_T7_1730b_body0003.tif (deflated 36%)
  adding: penn_pr3732_t7_1730b/PR3732_T7_1730b_body0005.tif (deflated 42%)
  adding: penn_pr3732_t7_1730b/PR3732_T7_1730b_body0007.tif (deflated 42%)
  adding: penn_pr3732_t7_1730b/PR3732_T7_1730b_body0006.tif (deflated 46%)
  adding: penn_pr3732_t7_1730b/PR3732_T7_1730b_body0009.tif (deflated 43%)
  adding: penn_pr3732_t7_1730b/PR3732_T7_1730b_body0008.tif (deflated 43%)
  adding: penn_pr3732_t7_1730b/PR3732_T7_1730b_body0010.tif (deflated 42%)
  adding: penn_pr3732_t7_1730b/PR3732_T7_1730b_body0011.tif (deflated 45%)
  adding: penn_pr3732_t7_1730b/PR3732_T7_1730b_body0012.tif (def

## Install Tesseract and Pytesseract
Tesseract is not Python code. We're installing it with `apt` on the virtual machine that serves our Colaboratory environment. If you were doing this work in a different environment, you'd need to install Tesseract on your system, and the method would depend on your operating system.

`Pytesseract` is not installed in Colaboratory by default, so we download it with `pip`. If you were working in a different environment, you'd need to make sure `pytesseract` was installed and available to Python.

In [None]:
!apt install tesseract-ocr
!pip install pytesseract
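
If you're working outside Colaboratory, it can be useful to confirm that the Tesseract program is actually on your system's path before running the OCR cells. Here is a minimal sketch using only Python's standard library. The helper name `tesseract_available` is my own invention for illustration, not part of `pytesseract`.

```python
import shutil

#Hypothetical helper (not part of pytesseract): check whether an
#executable with the given name can be found on the system path
def tesseract_available(binary_name='tesseract') :
  #shutil.which returns the full path to the executable, or None
  #if it can't be found
  return shutil.which(binary_name) is not None

if tesseract_available() :
  print('Tesseract found at ' + shutil.which('tesseract'))
else :
  print('Tesseract not found; install it before running the OCR cells.')
```

This only checks that a program named `tesseract` exists. It doesn't check the version or which language models are installed.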

## Import modules

In [None]:
import os
import glob
import pytesseract
from PIL import Image
from bs4 import BeautifulSoup
import lxml
import cv2
from google.colab.patches import cv2_imshow
import numpy as np

## Copy files from Google Drive to Colaboratory environment

In [None]:
%cp /gdrive/MyDrive/rbs_digital_approaches_2021/data_class/page_images/penn_pr3732_t7_1730b.zip /content/penn_pr3732_t7_1730b.zip
%cd /content/
!unzip penn_pr3732_t7_1730b.zip

## Define a couple of directories

In [None]:
#Designate a directory for our hOCR output and create it if it 
#doesn't already exist
hocr_directory = '/content/hocr/'
if not os.path.exists(hocr_directory) :
  os.makedirs(hocr_directory)

#Identify source of our binarized images
bw_source_image_directory = '/content/penn_pr3732_t7_1730b/bw/'

## Perform OCR to get hOCR output
This will take a little while. For each page, Tesseract has to load the image and perform some preparatory transformations on it. Then comes the actual OCR'ing, followed by saving the output.

In [None]:


#Work through all of the .tif files in the directory
for tif_file in glob.glob(bw_source_image_directory + '*.tif') :
  #Get the filename (excluding the file path)
  filename = os.path.split(tif_file)[1]
  #Get the portion of the filename before "-bw.tif"
  basename = filename[:-7]
  
  #Get the last part of the basename: from "body" and four digits
  page = basename[basename.rfind('_')+1:]
  
  #Name for the hOCR output file: it will end up taking the form, e.g.
  #/content/hocr/PR3732_T7_1730b_body0001-hocr.xml
  outfile_name = hocr_directory + basename + '-hocr.xml'
  
  #Use our black and white image (previously deskewed)
  bw_deskewed = Image.open(tif_file)

  #Perform OCR using pytesseract, saving output as hOCR. This will take 
  #a few minutes
  hocr = pytesseract.image_to_pdf_or_hocr(bw_deskewed, extension='hocr')
  with open(outfile_name, 'wb') as outfile :
    print('Saving ' + outfile_name + '...')
    outfile.write(hocr)

## See an example of hOCR output
We can use `BeautifulSoup` and `lxml` to parse the hOCR XML. I'm selecting all of the `span`s with `class` equal to `ocr_line`, then printing the first ten of them so you can see what they look like. 

Each `ocr_line` has line-level coordinates (the bit at `title=bbox 1347 481 1457 531`, for example, is what we're after). Each of those lines contains `span`s with class `ocrx_word`, and those, in turn, have bounding boxes of their own. (Note, too, the confidence score for each word.) As you can see, the text recognition isn't great, due to problems with the long-s. But let's get a sense of how those line-level coordinates work in the next cell.

In [None]:
with open(hocr_directory + 'PR3732_T7_1730b_body0013-hocr.xml', 'r') as hocr_file :
  hocr_data = hocr_file.read()
  soup = BeautifulSoup(hocr_data, 'xml')
  ocr_lines = soup.find_all('span', class_='ocr_line')
  for ocr_line in ocr_lines[0:10] :
    print(ocr_line.prettify())
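
To make the `bbox` part of the `title` attribute concrete, here is a small sketch that pulls the four numbers out with a regular expression. In hOCR, they are the left, top, right, and bottom edges of the box, in pixels. The function name `parse_bbox` is hypothetical, and everything after the semicolon in the sample string is made up for illustration; only the `bbox` values come from the example above.

```python
import re

#Hypothetical helper: return (left, top, right, bottom) as integers,
#or None if the title has no bbox in it
def parse_bbox(title) :
  #Look for "bbox" followed by four whitespace-separated integers
  match = re.search(r'bbox\s+(\d+)\s+(\d+)\s+(\d+)\s+(\d+)', title)
  if match is None :
    return None
  return tuple(int(value) for value in match.groups())

#The properties after the semicolon are illustrative values only
sample_title = 'bbox 1347 481 1457 531; baseline 0 -10; x_size 40'
print(parse_bbox(sample_title))
```

A regular expression doesn't depend on the `bbox` property coming first in the `title` or on a semicolon following it, so it is a bit more forgiving than slicing the string by position.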

## See the line-level boxes that Tesseract's hOCR has found
This cell once again uses `BeautifulSoup` and `lxml` to find the line-level coordinates. We'll hand those coordinates off to `cv2` to draw boxes on the image so we can see what those coordinates mean.

In [None]:
sample_hocr_file = hocr_directory + 'PR3732_T7_1730b_body0013-hocr.xml'
sample_page = bw_source_image_directory + 'PR3732_T7_1730b_body0013-bw.tif'
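#Read the page as a single-channel grayscale image, then convert it to
#three-channel BGR so the boxes can be drawn in color
bw_image = cv2.imread(sample_page, cv2.IMREAD_GRAYSCALE)
boxes_sample = cv2.cvtColor(bw_image, cv2.COLOR_GRAY2BGR)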
# cv2.rectangle(boxes_sample, (186, 135), (1109, 223), (0, 255, 0), 2)
with open(sample_hocr_file, 'r') as hocr :
  soup = BeautifulSoup(hocr, 'xml')
  lines = soup.find_all('span', class_='ocr_line')
  
  for line in lines :
    coord_string = line['title'][5:line['title'].find(';')]
    coords = coord_string.split(' ')
    cv2.rectangle(boxes_sample, (int(coords[0]), int(coords[1])), \
                  (int(coords[2]), int(coords[3])), (0, 255, 0), 2)

cv2_imshow(boxes_sample)
# cv2_imshow(bw_image)

## Define some functions to attempt to deskew things at the level of individual lines
I'm not sure how successful I was here, but I wanted to attempt to deskew individual lines to correct for irregularity in the lines of the (already-deskewed) image.

In [None]:
def get_line_deskew_angle(cv2image) :
  kernel = cv2.getStructuringElement(cv2.MORPH_RECT, (30, 5))
  dilate = cv2.dilate(cv2image, kernel, iterations=5)
  contours, hierarchy = cv2.findContours(dilate, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
  if len(contours) == 0 :
    line_angle = 0
  else :
    contour = contours[0]
    line_min_area_rect = cv2.minAreaRect(contour)
    line_angle = line_min_area_rect[-1]
    if line_angle < -45 :
      line_angle = 90 + line_angle
  return line_angle

def deskew_line(cv2image, angle) :
  new_image = cv2image.copy()
  (h, w) = new_image.shape[:2]
  center = (w // 2, h // 2)
  M = cv2.getRotationMatrix2D(center, angle, 1.0)
  deskewed_line = cv2.warpAffine(new_image, M, (w, h), flags=cv2.INTER_CUBIC, borderMode=cv2.BORDER_CONSTANT)
  return deskewed_line

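def reprocess_line(pil_line_image) :
  #Convert the Pillow image to a single-channel cv2 (NumPy) array
  new_line_image = np.array(pil_line_image.convert('L'))
  #cv2.threshold returns (threshold_value, image); keep just the image
  invert = cv2.threshold(new_line_image, 0, 255, cv2.THRESH_BINARY_INV + cv2.THRESH_OTSU)[1]
  deskew_angle = get_line_deskew_angle(invert)
  deskewed = deskew_line(new_line_image, deskew_angle)
  return Image.fromarray(deskewed)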
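
If you're curious what `cv2.getRotationMatrix2D` hands back to `deskew_line`, here is a sketch in plain Python that builds the same kind of 2×3 matrix, following the formula given in the OpenCV documentation, and applies it to a couple of points. The function names `rotation_matrix_2d` and `apply_affine` are my own, and the center and angle are arbitrary values chosen for illustration.

```python
import math

#Hypothetical re-implementation of the formula OpenCV documents for
#getRotationMatrix2D: rotate by angle_degrees around center
def rotation_matrix_2d(center, angle_degrees, scale=1.0) :
  cx, cy = center
  alpha = scale * math.cos(math.radians(angle_degrees))
  beta = scale * math.sin(math.radians(angle_degrees))
  return [[alpha, beta, (1 - alpha) * cx - beta * cy],
          [-beta, alpha, beta * cx + (1 - alpha) * cy]]

#Apply a 2x3 affine matrix to a single (x, y) point
def apply_affine(matrix, point) :
  x, y = point
  return (matrix[0][0] * x + matrix[0][1] * y + matrix[0][2],
          matrix[1][0] * x + matrix[1][1] * y + matrix[1][2])

#Arbitrary example: a 2-degree rotation around the point (100, 20)
M = rotation_matrix_2d((100, 20), 2.0)
#The center of rotation stays (approximately) where it is
print(apply_affine(M, (100, 20)))
#A point to the right of center gets a smaller y value, i.e. it moves
#up the image, since y increases downward in image coordinates
print(apply_affine(M, (200, 20)))
```

A positive angle therefore rotates the image counter-clockwise, which is worth keeping in mind when deciding whether an angle from `cv2.minAreaRect` needs its sign flipped.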

## Use hOCR coordinates to extract images of individual lines
Now that we know how to get those line-level coordinates, we'll use them to extract sections of the page image for each line. You'll see things bouncing between `Pillow` and `cv2` in the code below. Frankly, I have a clearer understanding of how to crop images with `Pillow` than I do with `cv2`, and I also know how to make sure that the file gets written with information about its resolution intact. But I knew how to do the deskewing stuff in `cv2`.

In [None]:
line_output_directory = '/content/line_images/'
if not os.path.exists(line_output_directory) :
  os.makedirs(line_output_directory)

#Process each hOCR file
for hocr_file in glob.glob(hocr_directory + '*.xml') :
  filename = os.path.split(hocr_file)[1]
  basename = filename[:filename.rfind('-')]  
  page = basename[basename.rfind('_')+5:]
  #Ignore these pages. Pages 1 and 6 (title page and dramatis personae) have text
  #that's much larger than what's elsewhere in the text, and the layout of p. 6
  #is weird. P. 2 is a blank verso.
  #The page string is zero-padded (e.g. '0001'), so compare it as an integer
  if int(page) not in [1, 2, 6] :
    
    #Define a pattern for the file-names of our line-level images
    outfile_name = line_output_directory + 'Penn_PR3732_T7_1730b-' + page + '-line-'
    
    #Read the hOCR file and get line-level coordinates with BeautifulSoup and lxml
    with open(hocr_file, 'r') as hocr :
      file_read = hocr.read()
      soup = BeautifulSoup(file_read, 'xml')

      #Create an integer to use as a line number in the file names of our 
      #line-level images
      i = 1
      
      #Find all ocr_line spans
      lines = soup.find_all('span', class_='ocr_line')
      for line in lines :
        coord_string = line['title'][5:line['title'].find(';')]
        coords = coord_string.split(' ')
        
        #Open the full-page image 
        bw_tif = bw_source_image_directory + basename + '-bw.tif'
        read_tif = Image.open(bw_tif)
        
        #Create an image extracted from the full-page image using the line coordinates
        crop_line = read_tif.crop((int(coords[0]), int(coords[1]), int(coords[2]), int(coords[3])))
        
        #Try to deskew the line-level image using functions defined above
        new_line_image = np.array(crop_line.convert('L'))
        invert = cv2.threshold(new_line_image, 0, 255, cv2.THRESH_BINARY_INV + cv2.THRESH_OTSU)[1]
        line_deskew_angle = get_line_deskew_angle(invert)
        deskewed_line = deskew_line(new_line_image, line_deskew_angle)
        deskewed_line = Image.fromarray(deskewed_line)
        
        #Save the line-level image
        print('Saving ' + outfile_name + str(i) + '.tif...')
        deskewed_line.save(outfile_name + str(i) + '.tif', dpi=(400, 400))
        
        #increment the line number counter
        i += 1
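
The page numbers in these file names are zero-padded (`0013`, not `13`), so it can help to turn them into integers before comparing them against a list of pages to skip. Here is a minimal sketch of that step on its own. The function name `page_number_from_hocr` and the variable `pages_to_skip` are hypothetical; the file names follow the pattern used earlier in this notebook.

```python
import os

#Hypothetical helper: get the page number as an integer from the name
#of an hOCR file such as PR3732_T7_1730b_body0013-hocr.xml
def page_number_from_hocr(path) :
  filename = os.path.basename(path)
  #Drop the "-hocr.xml" suffix
  basename = filename[:filename.rfind('-')]
  #Everything after the last underscore, e.g. "body0013"
  last_part = basename[basename.rfind('_') + 1:]
  #Keep only the digits and convert to an integer, so "0013" becomes 13
  digits = ''.join(ch for ch in last_part if ch.isdigit())
  return int(digits)

pages_to_skip = [1, 2, 6]
sample_names = ['/content/hocr/PR3732_T7_1730b_body0001-hocr.xml',
                '/content/hocr/PR3732_T7_1730b_body0013-hocr.xml']
for name in sample_names :
  page = page_number_from_hocr(name)
  print(page, 'skip' if page in pages_to_skip else 'process')
```

Comparing integers rather than strings means the check doesn't depend on how many leading zeros the file names happen to use.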

## Compress the folders we've created and move them over to Google Drive

In [None]:
%cd /content/
!zip -r hocr.zip hocr/
!zip -r line_images.zip line_images/
!mv hocr.zip /gdrive/MyDrive/rbs_digital_approaches_2021/output/hocr.zip
!mv line_images.zip /gdrive/MyDrive/rbs_digital_approaches_2021/output/line_images.zip


## Delete materials from Colaboratory environment

In [None]:
%cd /content/
!rm -r ./*