# Using hOCR output from untrained Tesseract to get coordinates for line-level images
This notebook uses an untrained version of Tesseract to identify the coordinates of text lines on the pages. We'll get those coordinates out of the hOCR XML using `BeautifulSoup`, then use them to crop the page images using `Pillow`.

Because of the large number of files involved, we'll copy files over to the Colaboratory environment, rather than trying to read and write directly from Google Drive.

In [None]:
#Code cell #1
#Connect to Google Drive
from google.colab import drive
drive.mount('/gdrive')

In [None]:
#Code cell #2
%cp /gdrive/MyDrive/rbs_digital_approaches_2023/output/penn_pr3732_t7_1730b-bw.zip /content/penn_pr3732_t7_1730b-bw.zip
%cd /content/
!unzip penn_pr3732_t7_1730b-bw.zip


## 1 - Install Tesseract and Pytesseract
Tesseract is not Python code. We're installing it on the virtual machine that's serving up our Colaboratory environment using `apt`. If you were doing this work in a different environment, you'd need to install Tesseract on your system using whatever method suits your operating system.

`Pytesseract` is not installed in Colaboratory by default, so we download it with `pip`. If you were working in a different environment, you'd need to make sure `pytesseract` was installed and available to Python.
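
If you're working somewhere other than Colaboratory, it can be worth confirming that both pieces are in place before running any OCR. The following is a minimal sketch using only the standard library; the function name `ocr_tools_available` is my own, for illustration, and isn't part of Tesseract or `pytesseract`.

```python
import importlib.util
import shutil


def ocr_tools_available():
    """Report whether the tesseract binary and the pytesseract module can be found."""
    return {
        # shutil.which looks for an executable named 'tesseract' on the PATH
        'tesseract_binary': shutil.which('tesseract') is not None,
        # find_spec returns None if Python can't locate the module
        'pytesseract_module': importlib.util.find_spec('pytesseract') is not None,
    }


print(ocr_tools_available())
```

If either value comes back `False`, the install cell hasn't run yet (or didn't succeed).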

In [None]:
#Code cell #3
!apt install tesseract-ocr
!pip install pytesseract

## 2 - Import modules

In [None]:
#Code cell #4
import os
import glob
import pytesseract
from PIL import Image
from bs4 import BeautifulSoup
import lxml
import cv2
from google.colab.patches import cv2_imshow
import numpy as np

## 3 - Define a couple of directories

In [None]:
#Code cell #5
#Designate a directory for our hOCR output and create it if it
#doesn't already exist
hocr_directory = '/content/ocr_training_materials/hocr/'
os.makedirs(hocr_directory, exist_ok=True)

#Identify source of our binarized images
bw_source_image_directory = '/content/penn_pr3732_t7_1730b-bw/'

## 4 - Perform OCR to get hOCR output
This will take several minutes for the whole directory. For each image, Tesseract has to load the file and perform some preparatory transformations on it. Then comes the actual OCR'ing, followed by saving the output.

In [None]:
#Code cell #6
#Work through all of the .tif files in the directory
for tif_file in glob.glob(bw_source_image_directory + '*.tif') :
  #Get the filename (excluding the file path)
  filename = os.path.split(tif_file)[1]
  #Get the portion of the filename before "-bw.tif"
  basename = filename[:-7]

  #Get the last part of the basename: from "body" and four digits
  page = basename[basename.rfind('_')+1:]

  #Name for the hOCR output file: it will end up taking the form, e.g.
  #/content/ocr_training_materials/hocr/PR3732_T7_1730b_body0001-hocr.xml
  outfile_name = hocr_directory + basename + '-hocr.xml'

  #Use our black and white image
  bw_deskewed = Image.open(tif_file)

  #Perform OCR using pytesseract, saving output as hOCR. Across all of the
  #pages, this will take several minutes.
  hocr = pytesseract.image_to_pdf_or_hocr(bw_deskewed, extension='hocr')
  with open(outfile_name, 'wb') as outfile :
    print('Saving ' + outfile_name + '...')
    outfile.write(hocr)
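
The filename slicing in the loop above is easy to get wrong by a character, so here's a small sketch that isolates it so you can try it on a single path. The helper name `split_page_filename` is hypothetical; the sample path follows the naming pattern used elsewhere in this notebook.

```python
import os


def split_page_filename(tif_path):
    """Return (basename, page) for a path ending in '-bw.tif'."""
    # Drop the directory portion of the path
    filename = os.path.split(tif_path)[1]
    # Remove the trailing '-bw.tif' (seven characters)
    basename = filename[:-len('-bw.tif')]
    # Everything after the last underscore, e.g. 'body0009'
    page = basename[basename.rfind('_') + 1:]
    return basename, page


basename, page = split_page_filename(
    '/content/penn_pr3732_t7_1730b-bw/PR3732_T7_1730b_body0009-bw.tif')
print(basename)  # PR3732_T7_1730b_body0009
print(page)      # body0009
```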

## 5 - See an example of hOCR output
We can use `BeautifulSoup` and `lxml` to parse the hOCR XML. I'm selecting all of the `span`s with `class` equal to `ocr_line`, then printing the first ten of them so you can see what they look like.

Each `ocr_line` has line-level coordinates (the bit at `title=bbox 1347 481 1457 531`, for example, is what we're after). Each of those lines contains `span`s with class `ocrx_word`, and those, in turn, have bounding boxes of their own. (Note, too, the confidence score for each word.) As you can see, the text recognition isn't great, due to problems with the long-s. But let's get a sense of how those line-level coordinates work in the next cell.
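
To make the shape of that `title` attribute concrete, here's a minimal sketch that pulls the four numbers out of it with a regular expression. The cells below use string slicing instead; the function name `parse_bbox` and the sample string (including what follows the semicolon) are made up for illustration.

```python
import re

# Four whitespace-separated integers following the word 'bbox'
BBOX_PATTERN = re.compile(r'bbox (\d+) (\d+) (\d+) (\d+)')


def parse_bbox(title):
    """Return (left, top, right, bottom) as integers from an hOCR title string."""
    match = BBOX_PATTERN.search(title)
    if match is None:
        raise ValueError('No bbox found in: ' + title)
    return tuple(int(value) for value in match.groups())


# Hypothetical title value, modeled on the example in the text above
sample_title = 'bbox 1347 481 1457 531; baseline 0 -10'
print(parse_bbox(sample_title))  # (1347, 481, 1457, 531)
```

The order is left, top, right, bottom, which is also the order `Pillow` expects when cropping.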

In [None]:
#Code cell #7
with open('/content/ocr_training_materials/hocr/PR3732_T7_1730b_body0009-hocr.xml', 'r') as hocr_file :
  hocr_data = hocr_file.read()
  soup = BeautifulSoup(hocr_data, 'xml')
  ocr_lines = soup.find_all('span', class_='ocr_line')
  for ocr_line in ocr_lines[0:10] :
    print(ocr_line.prettify())
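
If you'd like to look at those word confidence scores on their own, something like the following would work. This is a sketch under a few assumptions: it uses the standard library's `xml.etree.ElementTree` rather than `BeautifulSoup`, and it runs on a tiny hand-written fragment (no XML namespace, invented words and numbers) rather than a real hOCR file, which would need namespace handling. The `x_wconf` label is what Tesseract's hOCR uses for word confidence.

```python
import re
import xml.etree.ElementTree as ET

# Invented fragment shaped like Tesseract's hOCR, for illustration only
sample_hocr = """<div class='ocr_page'>
  <span class='ocr_line' title='bbox 186 135 1109 223; baseline 0 -12'>
    <span class='ocrx_word' title='bbox 186 140 420 223; x_wconf 91'>The</span>
    <span class='ocrx_word' title='bbox 450 135 1109 220; x_wconf 38'>Mifer</span>
  </span>
</div>"""


def word_confidences(hocr_text):
    """Return a list of (word, confidence) pairs from ocrx_word spans."""
    root = ET.fromstring(hocr_text)
    results = []
    for span in root.iter('span'):
        if span.get('class') != 'ocrx_word':
            continue
        match = re.search(r'x_wconf (\d+)', span.get('title', ''))
        if match:
            results.append((span.text, int(match.group(1))))
    return results


print(word_confidences(sample_hocr))  # [('The', 91), ('Mifer', 38)]
```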

## 6 - See the line-level boxes that Tesseract's hOCR has found
This cell once again uses `BeautifulSoup` and `lxml` to find the line-level coordinates. We'll hand those coordinates off to `cv2` to draw boxes on the image so we can see what those coordinates mean.

In [None]:
#Code cell #8
sample_hocr_file = '/content/ocr_training_materials/hocr/PR3732_T7_1730b_body0009-hocr.xml'
sample_page = '/content/penn_pr3732_t7_1730b-bw/PR3732_T7_1730b_body0009-bw.tif'
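#Read the page as a single-channel grayscale image, then convert it to
#three-channel BGR so that the boxes can be drawn in color
bw_image = cv2.imread(sample_page, cv2.IMREAD_GRAYSCALE)
boxes_sample = cv2.cvtColor(bw_image, cv2.COLOR_GRAY2BGR)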
# cv2.rectangle(boxes_sample, (186, 135), (1109, 223), (0, 255, 0), 2)
with open(sample_hocr_file, 'r') as hocr :
  soup = BeautifulSoup(hocr, 'xml')
  lines = soup.find_all('span', class_='ocr_line')

  for line in lines :
    coord_string = line['title'][5:line['title'].find(';')]
    coords = coord_string.split(' ')
    cv2.rectangle(boxes_sample, (int(coords[0]), int(coords[1])), \
                  (int(coords[2]), int(coords[3])), (0, 255, 0), 2)

cv2_imshow(boxes_sample)
# cv2_imshow(bw_image)

## 7 - Use hOCR coordinates to extract images of individual lines
Now that we know how to get those line-level coordinates, we'll use them to extract sections of the page image for each line. You'll see things bouncing between `Pillow` and `cv2` in the code below. Frankly, I have a clearer understanding of how to crop images with `Pillow` than I do with `cv2`, and I also know how to make sure that the file gets written with information about its resolution intact. But I knew how to do the deskewing stuff in `cv2`.
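
`Pillow`'s `crop` takes a box in the order left, upper, right, lower, which matches the order of the numbers in the hOCR `bbox`. The cell below crops exactly at those coordinates. If you wanted a little margin around each line, a sketch like this one could compute a padded box that never runs off the edge of the page. The padding idea and the function name `padded_crop_box` are my additions, not something the notebook does; the page dimensions in the example are invented.

```python
def padded_crop_box(coords, padding, page_width, page_height):
    """Expand a bbox by `padding` pixels on each side, clamped to the page."""
    left, top, right, bottom = (int(value) for value in coords)
    return (
        max(left - padding, 0),
        max(top - padding, 0),
        min(right + padding, page_width),
        min(bottom + padding, page_height),
    )


# coords as they come out of coord_string.split(' '): a list of strings
print(padded_crop_box(['186', '135', '1109', '223'], 10, 1115, 2000))
# (176, 125, 1115, 233)
```

The returned tuple could be passed straight to `crop` in place of the unpadded coordinates.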

In [None]:
#Code cell #9
line_output_directory = '/content/ocr_training_materials/line_images/'
os.makedirs(line_output_directory, exist_ok=True)

#Process each hOCR file
for hocr_file in glob.glob(hocr_directory + '*.xml') :
  filename = os.path.split(hocr_file)[1]
  basename = filename[:filename.rfind('-')]
  page = basename[basename.rfind('_')+5:]
  #Ignore these pages. Pages 1 and 6 (title page and dramatis personae) have text
  #that's much larger than what's elsewhere in the text, and the layout of p. 6
  #is weird. P. 2 is a blank verso.
  #page is a zero-padded string such as '0009', so compare it as an integer
  if int(page) not in [1, 2, 6] :

    #Define a pattern for the file-names of our line-level images
    outfile_name = line_output_directory + 'Penn_PR3732_T7_1730b-' + page + '-line-'

    #Read the hOCR file and get line-level coordinates with BeautifulSoup and lxml
    with open(hocr_file, 'r') as hocr :
      file_read = hocr.read()
      soup = BeautifulSoup(file_read, 'xml')

      #Create an integer to use as a line number in the file names of our
      #line-level images
      i = 1

      #Open the full-page image once per page, rather than once per line
      bw_tif = bw_source_image_directory + basename + '-bw.tif'
      read_tif = Image.open(bw_tif)

      #Find all ocr_line spans
      lines = soup.find_all('span', class_='ocr_line')
      for line in lines :
        coord_string = line['title'][5:line['title'].find(';')]
        coords = coord_string.split(' ')

        #Create an image extracted from the full-page image using the line coordinates
        crop_line = read_tif.crop((int(coords[0]), int(coords[1]), int(coords[2]), int(coords[3])))

        #Save the line-level image
        print('Saving ' + outfile_name + str(i) + '.tif...')
        crop_line.save(outfile_name + str(i) + '.tif', dpi=(400, 400))

        #increment the line number counter
        i += 1
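
One thing to be aware of with the file names produced above: because the line numbers aren't zero-padded, an alphabetical listing will put `line-10` before `line-2`. If that matters for your later steps, a variant like the following sketch pads the line number to three digits. The helper name `line_image_name` and the choice of three digits are assumptions of mine, not part of the notebook's naming scheme.

```python
def line_image_name(output_directory, page, line_number):
    """Build a line-image path with a zero-padded line number."""
    return f'{output_directory}Penn_PR3732_T7_1730b-{page}-line-{line_number:03d}.tif'


directory = '/content/ocr_training_materials/line_images/'
padded = [line_image_name(directory, '0009', n) for n in (1, 2, 10)]
unpadded = [f'{directory}Penn_PR3732_T7_1730b-0009-line-{n}.tif' for n in (1, 2, 10)]

print(sorted(padded) == padded)      # True: alphabetical order matches line order
print(sorted(unpadded) == unpadded)  # False: 'line-10' sorts before 'line-2'
```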

## 8 - Compress the folders we've created and move them over to Google Drive

In [None]:
#Code cell #10
%cd /content/ocr_training_materials/
!zip -r hocr.zip hocr/
!mkdir -p /gdrive/MyDrive/rbs_digital_approaches_2023/output/ocr_training_materials/
!mv hocr.zip /gdrive/MyDrive/rbs_digital_approaches_2023/output/ocr_training_materials/hocr.zip
!zip -r line_images.zip line_images/
!mv line_images.zip /gdrive/MyDrive/rbs_digital_approaches_2023/output/ocr_training_materials/line_images.zip
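
If you were working outside of Colaboratory and didn't have the `zip` command available, the same kind of archive could be made from Python itself. This is a sketch using `shutil.make_archive` from the standard library; the helper name `zip_directory` is hypothetical, and the example runs on a throwaway temporary folder rather than on the real output directories.

```python
import os
import shutil
import tempfile
import zipfile


def zip_directory(source_directory, archive_base):
    """Zip source_directory into archive_base + '.zip'; return the archive path."""
    source_directory = os.path.abspath(source_directory)
    parent, folder = os.path.split(source_directory.rstrip(os.sep))
    # root_dir/base_dir keep the folder name as the top level inside the zip
    return shutil.make_archive(archive_base, 'zip', root_dir=parent, base_dir=folder)


# Demonstration on a temporary directory with one dummy file
with tempfile.TemporaryDirectory() as scratch:
    demo_directory = os.path.join(scratch, 'hocr')
    os.makedirs(demo_directory)
    with open(os.path.join(demo_directory, 'example-hocr.xml'), 'w') as demo_file:
        demo_file.write('<html/>')

    archive_path = zip_directory(demo_directory, os.path.join(scratch, 'hocr'))
    archive_name = os.path.basename(archive_path)
    with zipfile.ZipFile(archive_path) as archive:
        archive_members = archive.namelist()

print(archive_name)  # hocr.zip
```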

## 9 - Delete materials from Colaboratory environment

In [None]:
%cd /content/
!rm -r ./*