# Using lines from TCP text for Tesseract training ground truth
As we saw when we ran our page images through Tesseract to get coordinates for the line images in hOCR XML, Tesseract *does* recognize text in these page images; the accuracy just isn't what we'd like.

In this notebook, we're going to compare Tesseract's output to the lines we've extracted from the modified version of the ECCO-TCP transcription of *Sophonisba*.

In [None]:
#Code cell #1
#Connect to Google Drive
from google.colab import drive
drive.mount('/gdrive')

## 1 - Install Python package for calculating Levenshtein distance and import needed packages
In comparing the OCR output from our untrained installation of Tesseract to the lines of the transcribed (and corrected) text, we'll use a measure called [Levenshtein distance](https://en.wikipedia.org/wiki/Levenshtein_distance), which quantifies the difference between two strings: the minimum number of single-character insertions, deletions, or substitutions needed to turn the first string into the second.

In [None]:
#Code cell #2
!pip install python-Levenshtein

In [None]:
#Code cell #3
import os
import glob
from bs4 import BeautifulSoup
import lxml
import re
import Levenshtein

### 1.a - A quick look at Levenshtein distance
Let's take a quick look at Levenshtein distance in action.

In [None]:
#Code cell #4
def show_lev(pair) :
  #'pair' is a two-item list of strings (named to avoid shadowing the built-in list)
  lev_distance = Levenshtein.distance(pair[0], pair[1])
  return 'Levenshtein distance between "' + pair[0] + '" and "' + pair[1] + '" is: ' + str(lev_distance)

first = show_lev(['slip', 'slop'])
second = show_lev(['sister', 'sinister'])
third = show_lev(['I can see clearly now', 'I will pay dearly now'])
print(first)
print(second)
print(third)
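
To make the measure less of a black box, here is a minimal pure-Python sketch of how an edit distance of this kind can be computed. It is an illustration only, not the code inside the `Levenshtein` package; the function name `levenshtein` is my own.

```python
def levenshtein(a, b):
    """Edit distance between strings a and b (insert, delete, substitute)."""
    # previous[j] is the distance between the characters of a processed
    # so far (before the current one) and the first j characters of b
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # delete char_a
                               current[j - 1] + 1,       # insert char_b
                               previous[j - 1] + cost))  # substitute or keep
        previous = current
    return previous[-1]

for pair in [('slip', 'slop'), ('sister', 'sinister')]:
    print(pair, levenshtein(*pair))
```

Only two rows of the comparison table are kept in memory at a time, which is enough because each cell depends only on its left, upper, and upper-left neighbours.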

## 2 - Copy files from Google Drive to Colaboratory environment and set paths
To reduce the amount of input/output between Colab and Google Drive, we'll move files into the Colab environment to work on. In this notebook, I'm going to assume you're working with the pre-prepared materials I provided in the shared Google Drive folder for our class. If you'd like to use the (same) materials that you created in the prior notebooks, comment out lines 1 and 8 and uncomment the following lines.

In [None]:
#Code cell #5
!mkdir -p /content/ocr_training_materials/
!cp /gdrive/MyDrive/rbs_digital_approaches_2023/output/hocr.zip /content/ocr_training_materials/hocr.zip
!cp /gdrive/MyDrive/rbs_digital_approaches_2023/output/tcp_lines.zip /content/ocr_training_materials/tcp_lines.zip
!cp /gdrive/MyDrive/rbs_digital_approaches_2023/output/line_images.zip /content/ocr_training_materials/line_images.zip
%cd /content/ocr_training_materials/
!unzip hocr.zip
!unzip tcp_lines.zip
!unzip line_images.zip

In [None]:
#Code cell #6
hocr_directory = '/content/ocr_training_materials/hocr/'
tcp_line_directory = '/content/ocr_training_materials/tcp_lines/'
line_image_directory = '/content/ocr_training_materials/line_images/'

## 3 - Compare Tesseract output from hOCR files to lines from TCP text and try to match them

### 3.a - Get each Tesseract page
This reads the hOCR files and extracts the text content of the OCR'ed lines. The `tesseract_pages` dictionary has page number as keys and lists as values: for each page, there is a list of lines that we get from the hOCR files.

In [None]:
#Code cell #7
#Create dictionary
tesseract_pages = {}

#Begin processing hOCR files
for hocr_file in glob.glob(hocr_directory + '*.xml') :
  #Mangle filenames to get identifiers
  filename = os.path.split(hocr_file)[1]
  basename = filename[:filename.rfind('-')]
  page = int(basename[basename.rfind('_')+5:].lstrip('0'))

  #Don't need pages 1 and 2 (page is an integer, so compare against integers)
  if page not in [1, 2] :

    #Create an entry for this page in the dictionary
    tesseract_pages.setdefault(page, [])

    #Open the file and get ocr_lines
    with open(hocr_file, 'r') as hocr :
      file_read = hocr.read()
      soup = BeautifulSoup(file_read, 'xml')
      lines = soup.find_all('span', class_='ocr_line')

      #Set an identifier for line numbers
      i = 1

      for line in lines :
        #Replace line break characters with spaces
        recognized_line = line.get_text().replace('\n', ' ')
        #Append a new item to the list of lines for this page.
        #Each recognized line is presented as a tuple with an integer
        #for the line number and the text as recognized by Tesseract
        tesseract_pages[page].append((i, recognized_line))

        #Increment the counter for our line number identifier
        i += 1

#See what we have
for k, v in tesseract_pages.items() :
  print(k)
  for content in v :
    print(content)
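
To show the shape of the data without needing the real files, here is a small sketch that pulls `ocr_line` spans out of a hand-written hOCR fragment. The sample markup and its text are made up for illustration, and I've swapped in the standard library's `xml.etree.ElementTree` for BeautifulSoup so the example runs with no extra installs; `ocr_lines` and `sample_pages` are hypothetical names.

```python
import xml.etree.ElementTree as ET

# A tiny, invented hOCR fragment with two recognized lines
SAMPLE_HOCR = """<html xmlns="http://www.w3.org/1999/xhtml">
 <body>
  <div class="ocr_page" id="page_1">
   <span class="ocr_line" id="line_1_1"><span class="ocrx_word">SOPHONISBA.</span></span>
   <span class="ocr_line" id="line_1_2"><span class="ocrx_word">A</span> <span class="ocrx_word">TRAGEDY.</span></span>
  </div>
 </body>
</html>"""

def ocr_lines(hocr_text):
    """Return a list of (line_number, text) tuples, one per ocr_line span."""
    root = ET.fromstring(hocr_text)
    lines = []
    for element in root.iter():
        if element.get('class') == 'ocr_line':
            # Join the text of all nested word spans and collapse whitespace
            text = ' '.join(''.join(element.itertext()).split())
            lines.append((len(lines) + 1, text))
    return lines

# Same layout as tesseract_pages: page number -> list of (line number, text)
sample_pages = {3: ocr_lines(SAMPLE_HOCR)}
print(sample_pages)
```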


### 3.b - Get each TCP page
We're going to more or less repeat what we did for Tesseract's OCR with the lines we identified from the TCP text.

In [None]:
#Code cell #8
tcp_pages = {}

for tcp_line_file in glob.glob(tcp_line_directory + '*.txt') :
  filename = os.path.split(tcp_line_file)[1]
  #splitext removes the extension as a unit; rstrip('.txt') strips characters, not a suffix
  page_num = int(os.path.splitext(filename)[0])
  tcp_pages.setdefault(page_num, [])
  i = 1
  with open(tcp_line_file, 'r') as tcp_line :
    clean_text = tcp_line.readlines()
    for line in clean_text:
      tcp_pages[page_num].append((i, line.rstrip('\n')))
      i += 1

for k, v in sorted(tcp_pages.items()) :
  print(k)
  for entry in v :
    print(entry)
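
A side note on getting page numbers out of file names: `str.rstrip` removes any trailing characters that appear in its argument rather than removing a suffix. With purely numeric names that makes no difference, but it can bite on other names. A short sketch of a suffix-safe alternative follows; the helper name `page_number_from` is hypothetical.

```python
import os

def page_number_from(filename):
    """Turn a name like '12.txt' into the integer 12."""
    stem = os.path.splitext(filename)[0]
    return int(stem)

for name in ['12.txt', '7.txt']:
    print(name, '->', page_number_from(name))

# rstrip treats '.txt' as a set of characters ('.', 't', 'x'), so it can
# keep eating into the name itself
print('text.txt'.rstrip('.txt'))
print(os.path.splitext('text.txt')[0])
```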


### 3.c - Get Levenshtein distance and accept corrections (or not)
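Before comparing lines, the cell below strips punctuation with a hand-written regular expression. As an alternative sketch, the same idea can be built from Python's built-in `string.punctuation`. That constant covers ASCII punctuation only, so I add the curly apostrophe by hand to mirror the pattern in the cell; `punct_pattern` and `strip_punct` are names I've made up for this example.

```python
import re
import string

# ASCII punctuation plus the curly apostrophe (an addition for this sketch)
EXTRA = '\u2019'
punct_pattern = re.compile('[' + re.escape(string.punctuation + EXTRA) + ']')

def strip_punct(text):
    """Remove punctuation characters and surrounding whitespace."""
    return punct_pattern.sub('', text).strip()

print(strip_punct("Hello, world! (it's a test.)"))
```

`re.escape` takes care of characters such as `]`, `^`, `-`, and `\` that would otherwise have a special meaning inside the square brackets.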

In [None]:
#Code cell #10
#I know there's a built-in punctuation list, but couldn't remember how to use it
#when I was writing this
punct = re.compile(r'[\!@#\$\%\^&\*\(\)\-_\+=\{\}\[\]\|\\\:;\"\'\\’<\>,\.\?\/]')
emdash = re.compile(r'—')
replacements = {}

for tesseract_k, tesseract_v in tesseract_pages.items() :
  if tesseract_k > 2 :
    #Setting default value for all replacements to say "-NO_MATCH." Seems kind of
    #pessimistic, in retrospect
    for line_num, ocr_text in tesseract_v :
        label = str(tesseract_k) + '-' + str(line_num)
        replacements.setdefault(label, '-' + ocr_text + '-NO_MATCH')

    #TEI doesn't have any forme work. Need to figure out what to do with
    #running title and signature/catchword lines: could save them explicitly
    #marked as not attempted to match? Not doing that here. Just ignoring first
    #and last lines in tesseract output, hence [1:-1]
    for tesseract_index, entry in enumerate(tesseract_v[1:-1]) :
        tesseract_line_num = entry[0]
        tesseract_ocr = entry[1]
        label = str(tesseract_k) + '-' + str(tesseract_line_num)

        #Get rid of punctuation for purposes of comparison: there seems to be
        #lots of spurious punctuation in the recognized text, which causes
        #otherwise good matches to fail
        no_punct_tesseract = re.sub(punct, '', tesseract_ocr).strip()

        #Work through TCP lines from page corresponding to Tesseract page (off by 1)
        for tcp_v in sorted(tcp_pages[tesseract_k-1]) :
            tcp_line_num = tcp_v[0]
            transcribed_text = tcp_v[1]

            #Eliminate punctuation for matching of word (see above)
            no_punct_transcribed = re.sub(punct, '', transcribed_text)

            #Indices of lines need to be within 5 of each other so we're not
            #comparing a line at the top of the Tesseract page to something
            #way down the TCP page
            if -5 <= tcp_line_num - tesseract_index <= 5 :

                #If there's an exact match, then heck yeah, let's accept it
                if no_punct_tesseract == no_punct_transcribed :
                    replacements[label] = (tesseract_k-1, tcp_line_num, transcribed_text)
                else :
                    #Short lines get a looser length tolerance (25% rather than
                    #10%) before we compute Levenshtein distance at all
                    if len(no_punct_transcribed) < 20 :

                        #For method for determining Levenshtein distance threshold
                        #in this code, please see:
                        #https://en.wikipedia.org/wiki/Scientific_wild-ass_guess#Use
                        if -0.25 <= ((len(no_punct_transcribed) - len(no_punct_tesseract)) \
                                     / len(no_punct_transcribed)) <= 0.25 :
                            lev_dist = Levenshtein.distance(no_punct_tesseract, no_punct_transcribed)
                            if lev_dist / len(no_punct_transcribed) < 0.4 :
                                replacements[label] = (tesseract_k-1, tcp_line_num, transcribed_text)
                    else :
                        if -0.10 <= ((len(no_punct_transcribed) - len(no_punct_tesseract)) \
                                     / len(no_punct_transcribed)) <= 0.10 :
                            lev_dist = Levenshtein.distance(no_punct_tesseract, no_punct_transcribed)
                            if lev_dist / len(no_punct_transcribed) < 0.4 :
                                replacements[label] = (tesseract_k-1, tcp_line_num, transcribed_text)

for orig, repl in replacements.items() :
  print(orig, repl)
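
The acceptance rule used above can be sketched on its own, which makes the thresholds easier to experiment with. This is a simplified restatement under a few assumptions of mine: the helper names `edit_distance` and `probable_match` are hypothetical, the distance is computed in pure Python rather than with the `Levenshtein` package, and I've added a guard for empty transcribed lines that the cell above does not have.

```python
def edit_distance(a, b):
    """Plain dynamic-programming edit distance (insert, delete, substitute)."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            current.append(min(previous[j] + 1,
                               current[j - 1] + 1,
                               previous[j - 1] + (char_a != char_b)))
        previous = current
    return previous[-1]

def probable_match(ocr, transcribed, max_ratio=0.4):
    """Apply the length-tolerance and distance-ratio tests to one pair of lines."""
    if not transcribed:
        return False          # guard added for this sketch: avoid dividing by zero
    if ocr == transcribed:
        return True
    # Short lines (under 20 characters) get a looser length tolerance
    tolerance = 0.25 if len(transcribed) < 20 else 0.10
    length_gap = abs(len(transcribed) - len(ocr)) / len(transcribed)
    if length_gap > tolerance:
        return False
    return edit_distance(ocr, transcribed) / len(transcribed) < max_ratio

print(probable_match('Tbe Queen', 'The Queen'))
print(probable_match('x', 'The Queen'))
```

Both inputs are assumed to have had punctuation removed already, as in the cell above.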

### 3.d - How did we do?
Let's compare the number of Tesseract lines to the number we were able to match.

In [None]:
#Code cell #11
num_tesseract_lines = 0
for tesseract_key, tesseract_value in tesseract_pages.items() :
  for line in tesseract_value :
    num_tesseract_lines += 1

num_replacements = 0
for orig, repl in replacements.items() :
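  #Matched lines are stored as tuples; unmatched lines keep the default
  #'-NO_MATCH' string, and repl[2] on a string is just its third character
  if isinstance(repl, tuple) :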
    num_replacements += 1

print(num_tesseract_lines, num_replacements)

I can accept that.

## 4 - Save accepted corrections as individual text files
We now need to save each of the replacements we identified as its own text file.

In [None]:
#Code cell #12
filename_base = 'Penn_PR3732_T7_1730b-'

#I'm violating our class naming conventions by using hyphens rather than
#underscores in the directory name because the script we'll use to train
#Tesseract expects the folder to be named that way
groundtruth_directory = '/content/ocr_training_materials/sophonisba-ground-truth/'
if not os.path.exists(groundtruth_directory) :
  os.makedirs(groundtruth_directory)
  print('Creating directory')
a = 0
for orig, replacement in replacements.items() :
  if isinstance(replacement, tuple) :
    page = orig[:orig.find('-')]
    #zeropadding because I stupidly stripped zeroes out in an earlier step
    page_number = page.zfill(4)
    line = orig[orig.rfind('-')+1:]

    filename = filename_base + page_number + '-line-' + str(line)
    # print(filename)
    # print(filename + '.txt' + ' | ' + replacement[2])

    with open(groundtruth_directory + filename + '.gt.txt', 'w') as groundtruth_line_out :
      print('Writing ' + groundtruth_directory + filename + '.gt.txt...')
      groundtruth_line_out.write(replacement[2])
      a += 1
print(str(a) + ' files saved')

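In my run this wrote about 500 fewer text line files than the match count printed in 3.d. If you see a gap like that, trust the number here: this cell writes a file only when an entry in `replacements` is a tuple, which is how genuine matches are stored. Unmatched entries are plain strings, so any count that inspects `repl[2]` on them sees a single character, never finds '-NO_MATCH', and treats them as matches.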
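
To illustrate how a match count and a file count can diverge, here is a small sketch with made-up data. The keys and text in `toy_replacements` are invented for the example; the two counting styles are a string search on `repl[2]` and a type check with `isinstance`.

```python
toy_replacements = {
    '3-1': '-SOPHONISBA.-NO_MATCH',     # unmatched: stored as a string
    '3-2': (2, 1, 'A line of text'),    # matched: stored as a tuple
    '3-3': '-xx-NO_MATCH',              # unmatched: stored as a string
}

# Indexing a string with [2] returns one character, which can never
# contain '-NO_MATCH', so every entry passes this test
by_find = sum(1 for repl in toy_replacements.values()
              if repl[2].find('-NO_MATCH') == -1)

# Checking the type separates real matches from the default strings
by_type = sum(1 for repl in toy_replacements.values()
              if isinstance(repl, tuple))

print(by_find, by_type)
```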

## 5 - Get line image files for every matched ground truth line

In [None]:
#Code cell #13
#Start the counter at 0 so the total reflects the files actually moved
z = 0
for file in glob.glob(groundtruth_directory + '*.gt.txt') :
  filename = os.path.split(file)[1]
  basename = filename[:filename.rfind('.gt.txt')]

  if os.path.exists(line_image_directory + basename + '.tif') :
    print('Moving ' + line_image_directory + basename + '.tif to ' + groundtruth_directory)
    os.rename(line_image_directory + basename + '.tif', groundtruth_directory + basename + '.tif')
    z += 1

print(str(z) + ' image files moved')

## 6 - Have a look at our ground truth folder
We now have pairs of .tif files and .gt.txt files for a great many lines of text. Not every line, of course (since there were some that we couldn't match using our little Levenshtein distance trick above), but a lot of lines, nonetheless.

We need to make sure that all of these files come in pairs—that is, that for every .gt.txt file there's a corresponding .tif file. We just brought in the .tif files where we could match them to a .txt file, but let's take a pass through and see if we have any .txt files that *didn't* find a matching .tif file.

If all goes to plan, code cell #14 should print out 0. If you get anything other than that, run code cell #15.

In [None]:
#Code cell #14
partnerless = []
#Use groundtruth_directory (hyphenated folder name) so we look where the files were written
for textfile in glob.glob(groundtruth_directory + '*.gt.txt') :
  filename = os.path.split(textfile)[1]
  #Slice off the suffix; rstrip('.gt.txt') would strip characters, not the suffix
  basename = filename[:filename.rfind('.gt.txt')]
  if not os.path.exists(groundtruth_directory + basename + '.tif') :
    partnerless.append(basename)
print(len(partnerless))

In [None]:
#Code cell #15
for lonefile in partnerless :
  os.remove(groundtruth_directory + lonefile + '.gt.txt')
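
If you'd like to try the pairing check without touching the real ground truth folder, the sketch below runs the same idea against a temporary directory. The file names are invented, and `unpaired_text_files` is a hypothetical helper written with `pathlib` rather than `glob` and `os.path`.

```python
import tempfile
from pathlib import Path

def unpaired_text_files(folder):
    """Return base names of .gt.txt files that have no matching .tif file."""
    folder = Path(folder)
    suffix = '.gt.txt'
    missing = []
    for text_file in sorted(folder.glob('*' + suffix)):
        base = text_file.name[:-len(suffix)]
        if not (folder / (base + '.tif')).exists():
            missing.append(base)
    return missing

with tempfile.TemporaryDirectory() as tmp:
    tmp = Path(tmp)
    # One complete pair and one text file with no image
    (tmp / 'page-0003-line-2.gt.txt').write_text('a line')
    (tmp / 'page-0003-line-2.tif').write_bytes(b'')
    (tmp / 'page-0003-line-3.gt.txt').write_text('another line')
    demo_result = unpaired_text_files(tmp)

print(demo_result)
```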

For our purposes, we're going to stop here and hope that this gives us enough to work with for training Tesseract.

As we'll see later, the lines we collected here don't include every character we could possibly want for recognizing eighteenth-century print: there's no upper-case Z in *Sophonisba*, for example, and the text we have is light on numerals (we could perhaps correct that latter point by going back to get the page numbers in the running titles). If we were trying to do a large-scale OCR training, our time would be more profitably spent processing images of more texts than it would be going back and trying to chase down every line we missed.

## 7 - Compress folder of ground truth text lines and save to Google Drive

In [None]:
#Code cell #16
%cd /content/ocr_training_materials/
!zip -r sophonisba-ground-truth.zip sophonisba-ground-truth/
!mv sophonisba-ground-truth.zip /gdrive/MyDrive/rbs_digital_approaches_2023/output/ocr_training_materials/sophonisba-ground-truth.zip


## 8 - Clear Colaboratory environment

In [None]:
#Code cell #17
%cd /content/
! rm -r ./*