# Training Tesseract
As has been the case with so many things this week, the process of getting ready for OCR training is far more involved than the training, itself—if you've gotten everything set up correctly, you mostly just start a process running, wait for it to complete, start another process, wait, etc.

The training we're doing here wouldn't take the days or weeks that Carl mentioned yesterday when talking about the kinds of big neural net models that it's possible to get into. (Apparently, though, training Tesseract from scratch on a very large data set—like the hundreds of thousands of pages that were used to train Tesseract's standard English language model—can, indeed, take weeks).

My initial plan had been for you to work through the code *up to but not including* the cell that would build the full training so that you could see the steps that go into getting all the data ready to build into a completed "language" for Tesseract, but for you then to load the output of a completed training run so that you could experiment with how the results changed using more or fewer iterations.

For reasons I haven't yet been able to figure out, however, the after-the-fact experimentation works as expected when you're dealing with a model that you create using the code in this notebook, but not when you're working with the checkpoints from a previously completed model. (I strongly suspect that there is something simple that I'm missing here, but a few additional hours of work haven't made it any clearer to me.)

If you want to experiment with how the number of iterations affects the quality of the model, then, it seems that you'll need to actually run the code at step 6 (that will take about an hour to complete the default 10,000 iterations).

If you just want to see the output of a trained model, you can skip step 6 and go to step 7: adding the completed model to Tesseract works, even if rolling back and forward from prior checkpoints with tesstrain doesn't.

We'll use a shell script called `tesstrain` to handle the training. This isn't the only way to train Tesseract, by any means, but it's a relatively simple one: we provide the necessary data, and `tesstrain` takes care of firing off all the various commands to Tesseract.

In this notebook, we'll be stepping away from Python temporarily and working largely in the Unix `bash` shell, so the commands you'll be seeing have more in common with our routines for getting everything set up for the class (`mkdir`, `cd`, `ls`, etc.) than anything else.

(If you're familiar with working at the command line, you may wonder why I keep using `cd` even when stepping through the cells in order would already have put you in the right directory. I went that slightly paranoid route because Jupyter notebooks lend themselves to being run *out* of order, so I've tried to make sure that every command is immediately preceded by a change into the correct directory.)

## 1 - Connect to Google Drive

In [None]:
#Code cell #1
from google.colab import drive
drive.mount('/gdrive')
from google.colab import files

## 2 - Install Tesseract and other packages

In [None]:
#Code cell #2
!apt install tesseract-ocr
#tesstrain expects this to be available
!apt install bc

In [None]:
#Code cell #3
#Install Python wrapper for Tesseract
!pip install pytesseract

In [None]:
#Code cell #4
import pytesseract
from PIL import Image

## 3 - Get preliminary files
We'll clone `tesstrain` from GitHub and also copy the pre-cooked materials for training from Google Drive into the Colaboratory environment. We're going to circle back to some of these files in a bit, but for now we'll just move the `sophonisba-ground-truth` folder to the place where `tesstrain` expects it to be.

In [None]:
#Code cell #5
#Get page image files from Google Drive
%cp -r /gdrive/MyDrive/rbs_digital_approaches_2023/output/penn_pr3732_t7_1730b-bw.zip /content/penn_pr3732_t7_1730b-bw.zip
%cd /content/
!unzip penn_pr3732_t7_1730b-bw.zip

In [None]:
#Code cell #6
#Clone tesstrain repo
%cd /content/
! git clone https://github.com/tesseract-ocr/tesstrain

#Make expected data directory in tesstrain directory
%cd /content/tesstrain/
%mkdir data

In [None]:
#Code cell #7
%cd /gdrive/MyDrive/rbs_digital_approaches_2023/output/
!zip -r ocr_training_materials.zip ocr_training_materials/
#Copy prepared training materials from Google Drive to /content for now
%cp /gdrive/MyDrive/rbs_digital_approaches_2023/output/ocr_training_materials.zip /content/ocr_training_materials.zip
%cd /content/
!unzip /content/ocr_training_materials.zip
%cd ocr_training_materials/
!unzip /content/ocr_training_materials/sophonisba-ground-truth.zip

#Move sophonisba-ground-truth (line-level images and text) to tesstrain/data
%mv /content/ocr_training_materials/sophonisba-ground-truth/ /content/tesstrain/data/sophonisba-ground-truth

In [None]:
#Code cell #8
%cd /content/ocr_training_materials/
%ls

## 4 - Using tesstrain to create a proto-model for our new Tesseract training
Let's first have a look at the `tesstrain` directory: it's full of scripts to automate the process of training Tesseract. We'll trust that these people know what they're doing.

In [None]:
#Code cell #9
%cd /content/tesstrain/
%ls

### 4.a - Create the skeleton of our new training
Tesstrain will generate a list of Unicode characters associated with our ground truth files, as well as some other scaffolding for our model, which we'll call `sophonisba`. This will probably take about ten minutes. Expect to see some non-fatal errors reported at the end. Fingers crossed that they won't hinder us too much.
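
To get a feel for what a "list of Unicode characters associated with our ground truth files" means, here is a minimal Python sketch that tallies every character in the transcriptions. It is not what `tesstrain` actually runs (Tesseract's own tools also normalize the text), and both the function name `collect_characters` and the assumption that the transcription files end in `.gt.txt` are mine.

```python
from collections import Counter
from pathlib import Path


def collect_characters(ground_truth_dir, pattern="*.gt.txt"):
    """Count every character that appears in the transcription files.

    Assumes one transcription per file, matched by `pattern`.
    """
    counts = Counter()
    for path in sorted(Path(ground_truth_dir).glob(pattern)):
        text = path.read_text(encoding="utf-8")
        # Line breaks are file structure, not characters to be recognized
        counts.update(text.replace("\n", "").replace("\r", ""))
    return counts


# Example (directory taken from the cells above):
# counts = collect_characters("/content/tesstrain/data/sophonisba-ground-truth")
# print(sorted(counts))
```

Sorting the keys of the result gives a quick view of the character inventory; rare characters (a stray ligature, say) show up with very low counts.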

In [None]:
#Code cell #10
!make unicharset lists proto-model MODEL_NAME=sophonisba

### 4.b - Let's see what tesstrain created

In [None]:
#Code cell #11
%cd /content/tesstrain/data/sophonisba/
%ls

## 5 - Extending Tesseract's existing English model, rather than starting from scratch
In theory, we could train Tesseract with just the line images and transcriptions from *Sophonisba*. (Well, we could do it in practice, too—I did it that way as an experiment and can report that it works... -ish.) But that's really much too small a base on which to ground an entire language model.

There are certainly cases where it makes sense to think about building a model from scratch (for a language that Tesseract doesn't currently support, for instance, or for an especially unusual typeface).

Without more text than we have, though, we're almost surely better off "fine tuning" Tesseract: the documentation notes that it's possible to get fairly good results here even without a lot of training data. That seems like our best bet.

To do that, we'll have to extract some information from Tesseract's existing English language model. This involves several terminal commands that are specific to Tesseract.

In [None]:
#Code cell #12
#Create a new directory to hold a copy of Tesseract's English language model and
#copy that model from its location in the system's installation of Tesseract to
#a folder in /content
!mkdir /content/tesstrain/data/eng/
%cp /usr/share/tesseract-ocr/4.00/tessdata/eng.traineddata /content/tesstrain/data/eng/eng.traineddata

#Move to the directory with our copy of the English language model and extract
#several components using combine_tessdata (-e is for "extract")
%cd /content/tesstrain/data/eng/
!combine_tessdata -e eng.traineddata eng.lstm-unicharset eng.lstm-word-dawg eng.lstm-punc-dawg eng.lstm-number-dawg

### 5.a - Extract the English language model's word list and punctuation rules list
Tesseract's word lists are in the form of "Directed Acyclic Word Graphs," or "DAWG files." ([This blog post](http://stevehanov.ca/blog/?id=115) provides an explanation with illustrations.) If we want to extend Tesseract's vocabulary, we need to get the information out of those graphs using `dawg2wordlist`. We'll extract the English word list and punctuation list.
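
If the idea of a word graph is new, this toy Python sketch may help. It builds a plain prefix tree (a trie) from a few words and then counts how many nodes remain once identical subtrees are merged, which is the core idea behind a DAWG. It is an illustration only: the function names are mine, and it does not read or write Tesseract's binary DAWG files.

```python
END = ""  # key used to mark "a word ends here"


def build_trie(words):
    """Build a nested-dict trie; words share nodes for common prefixes."""
    root = {}
    for word in words:
        node = root
        for ch in word:
            node = node.setdefault(ch, {})
        node[END] = {}
    return root


def contains(root, word):
    """Return True if `word` was one of the words used to build the trie."""
    node = root
    for ch in word:
        if ch not in node:
            return False
        node = node[ch]
    return END in node


def count_trie_nodes(node):
    """Count every node in the trie, including end-of-word markers."""
    return 1 + sum(count_trie_nodes(child) for child in node.values())


def count_dawg_nodes(root):
    """Count nodes left after merging subtrees with identical contents."""
    registry = {}

    def canonical(node):
        # Two nodes are interchangeable when their outgoing edges match
        signature = tuple(sorted((ch, canonical(child)) for ch, child in node.items()))
        return registry.setdefault(signature, len(registry))

    canonical(root)
    return len(registry)


trie = build_trie(["tap", "taps", "top", "tops"])
print(count_trie_nodes(trie), count_dawg_nodes(trie))
```

The trie already shares the initial "t"; the merged version also shares the common endings, which is why word lists with many shared suffixes compress so well in this form.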

In [None]:
#Code cell #13
%cd /content/tesstrain/data/eng/
!dawg2wordlist eng.lstm-unicharset eng.lstm-word-dawg english_words.txt
!dawg2wordlist eng.lstm-unicharset eng.lstm-punc-dawg engpunclist.txt

#### 5.a.i - Let's have a look...
Yep. There are text files there now, all right.

In [None]:
#Code cell #14
%cd /content/tesstrain/data/eng/
%ls

### 5.b - Add words and punctuation patterns from ECCO to Tesseract's existing lists
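
The cell below does the merging with a shell pipeline (`cat | sort | uniq`). For anyone more comfortable in Python, roughly the same idea can be sketched as follows. The function name `merge_word_lists` is hypothetical, and the sketch differs from the pipeline in two small ways: it strips surrounding whitespace and drops blank lines, and it sorts by Unicode code point rather than by the shell's locale rules.

```python
def merge_word_lists(*word_lists):
    """Combine several iterables of words into one sorted list without duplicates."""
    unique = set()
    for words in word_lists:
        for word in words:
            word = word.strip()
            if word:  # skip blank lines
                unique.add(word)
    return sorted(unique)


# Example with files (paths taken from the cell below):
# with open("/content/tesstrain/data/eng/english_words.txt", encoding="utf-8") as a, \
#      open("/content/tesstrain/data/ecco-words.txt", encoding="utf-8") as b:
#     combined = merge_word_lists(a, b)

print(merge_word_lists(["the\n", "fame\n"], ["fame\n", "ſame\n"]))
```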


In [None]:
#Code cell #15
#Copy ECCO word and punctuation lists
%cd /content/ocr_training_materials/
!unzip /content/ocr_training_materials/training_lists.zip
%cp /content/ocr_training_materials/ecco-words.txt /content/tesstrain/data/ecco-words.txt
%cp /content/ocr_training_materials/ecco-punct.txt /content/tesstrain/data/ecco-punct.txt
%cd /content/tesstrain/data/

#Concatenate Tesseract's English word list with our ecco-words, then sort the
#resulting file and eliminate duplicate lines. Note: this is *literally* a
#"pipeline": we send the output of one command to the next with the pipe character
#("|") before saving the output as a file
!cat /content/tesstrain/data/eng/english_words.txt /content/tesstrain/data/ecco-words.txt | sort | uniq > combined-words-sorted-unique.txt

#Do the same thing for the punctuation lists
!cat /content/tesstrain/data/eng/engpunclist.txt /content/tesstrain/data/ecco-punct.txt | sort | uniq > combined-punc-sorted-unique.txt

#See what we have
%ls

#### 5.b.i - Turning our text files into DAWG files
Just as we used `dawg2wordlist` to unpack Tesseract's DAWG files into plain text, we now need to use the complementary `wordlist2dawg` to turn our plain text files into DAWG files that Tesseract can use.

In [None]:
#Code cell #16
!wordlist2dawg /content/tesstrain/data/combined-words-sorted-unique.txt /content/tesstrain/data/sophonisba/sophonisba.wordlist /content/tesstrain/data/sophonisba/sophonisba.unicharset
!wordlist2dawg /content/tesstrain/data/combined-punc-sorted-unique.txt /content/tesstrain/data/sophonisba/sophonisba.punc /content/tesstrain/data/sophonisba/sophonisba.unicharset

### 5.c - Let's take a quick look at the files we've added to sophonisba
These are the files that we'll be telling `tesstrain` about as it executes the Tesseract training routine.

In [None]:
#Code cell #17
%cd /content/tesstrain/data/sophonisba/
%ls

## 6 - Actually run the training (OPTIONAL)
This will take all the files we've produced so far and actually create the trained model based on the line images and text of *Sophonisba* and the word lists we created.

This cell will take about an hour to complete. You don't need to do anything but let it run, but if you don't want to wait an hour to see the results, you can skip to step 7.

In [None]:
#Code cell #18
#Start training. Go eat a sandwich, or something, 'cause this will take an hour.
#When it's done, it's going to be a huge folder. Create a .zip and then use the
#Colab UI to download the .zip: copying it over to Google Drive with cp or
#downloading it with google.colab's files.download() seems to choke.
%cd /content/tesstrain/
! make training MODEL_NAME=sophonisba TESSDATA=/data/eng FINETUNE_TYPE=Plus WORD_FILE=/data/sophonisba/sophonisba.wordlist PUNC_FILE=/data/sophonisba/sophonisba.punc MAX_ITERATIONS=10000

## 7 - Loading a completed training to see what Tesseract has done
Here's where we'll load the output of a completed Tesseract training so we can try a few things with it.

### 7.a - Adding a completed .traineddata file to our installation of Tesseract
This makes our new training available to Tesseract. If you actually ran the code in code cell #18, uncomment line 2 and comment out line 3 in code cell #19.

In [None]:
#Code cell #19
# %cp /content/tesstrain/data/sophonisba.traineddata /usr/share/tesseract-ocr/4.00/tessdata/sophonisba.traineddata
%cp /gdrive/MyDrive/L-100\ Digital\ Approaches\ to\ Bibliography\ \&\ Book\ History-2023/sophonisba.traineddata /usr/share/tesseract-ocr/4.00/tessdata/sophonisba.traineddata

#### 7.a.i - How's it look?
Let's see what this gets us.

In [None]:
#Code cell #20
image_file = '/content/penn_pr3732_t7_1730b-bw/PR3732_T7_1730b_body0009-bw.tif'
im = Image.open(image_file)
untrained_string = pytesseract.image_to_string(im, lang='eng')
trained_string = pytesseract.image_to_string(im, lang='sophonisba')

In [None]:
#Code cell #21
#Text recognized using Tesseract's default English language model
print(untrained_string)

In [None]:
#Code cell #22
#Text recognized with sophonisba language model
print(trained_string)

### 7.b - Other kinds of output
We've already seen that Tesseract can do more than just extract text. We saw when extracting the line-level images how `pytesseract`'s `image_to_pdf_or_hocr` creates XML with information about the recognized text. If set to produce a pdf, instead, that function will produce a searchable PDF—that's really not worth doing with this inadequate model, but it's nice to know it can be done. (Though I don't think it would be advisable to produce a searchable PDF riddled with long-s characters: if we wanted to use this to produce searchable text, we would want to have an alternative model that would modernize long-s.)

`Pytesseract` can also output more information about its recognition of text. The `image_to_data` function will show word-level coordinates and can also produce word-level confidence scores. If we reached the limit of our ability to improve Tesseract's recognition of this print, then knowing that "foon" has a confidence score of only 46, for example, might give us useful information for programmatically correcting the recognized text. Or it might allow us at least to flag that result for closer scrutiny.
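
As a rough sketch of what that flagging could look like, the function below takes the dictionary form of the `image_to_data` output (the `text` and `conf` columns) and returns the words whose confidence falls below a cutoff. The function name and the cutoff of 60 are my own assumptions, not anything Tesseract recommends.

```python
def flag_low_confidence(data, threshold=60.0):
    """Return (word, confidence) pairs whose confidence is below `threshold`.

    `data` is a dict with parallel "text" and "conf" lists, as produced by
    pytesseract.image_to_data(..., output_type=Output.DICT).
    """
    flagged = []
    for word, conf in zip(data["text"], data["conf"]):
        conf = float(conf)  # conf may arrive as a string or a number
        # A confidence of -1 marks layout rows (blocks, lines) with no word
        if conf < 0 or not str(word).strip():
            continue
        if conf < threshold:
            flagged.append((word, conf))
    return flagged


# Example with real output:
# from pytesseract import Output
# data = pytesseract.image_to_data(im, lang='sophonisba', output_type=Output.DICT)
# print(flag_low_confidence(data))

sample = {"text": ["", "foon", "the"], "conf": ["-1", 46, 96.5]}
print(flag_low_confidence(sample))
```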

In [None]:
#Code cell #23
trained_data_str = pytesseract.image_to_data(im, lang='sophonisba')
print(trained_data_str)

In [None]:
#Code cell #24
from pytesseract import Output
#Output.DATAFRAME returns a pandas DataFrame (pandas must be installed)
trained_data_df = pytesseract.image_to_data(im, lang='sophonisba', output_type=Output.DATAFRAME)
print(trained_data_df)

It's also possible to use `image_to_boxes` to get character-level coordinates. I'm finding that those coordinates aren't always quite as accurate as I'd expect—especially given that the recognized text is often correct, even if the coordinates don't seem to align quite correctly with the characters in the image. This is something I need to look into further: with Tesseract 3, I was able to pretty reliably extract images of individual characters from a page, which began to raise the possibility of studying things like type recurrence computationally.
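
One detail that matters when pulling character images out of a page: Tesseract's box coordinates are measured from the bottom-left corner of the image, while PIL's `crop` measures from the top-left. A minimal sketch of the conversion, with a hypothetical helper name:

```python
def box_to_crop(left, bottom, right, top, image_height):
    """Convert a bottom-left-origin box to PIL's (left, upper, right, lower)."""
    return (left, image_height - top, right, image_height - bottom)


# Example with the dictionary returned by image_to_boxes (see the next cell):
# i = 0
# crop_box = box_to_crop(boxes['left'][i], boxes['bottom'][i],
#                        boxes['right'][i], boxes['top'][i], im.height)
# char_image = im.crop(crop_box)

print(box_to_crop(10, 20, 30, 50, 100))
```

The next cell does the same flip inline (`height-y1`, `height-y2`) when it draws rectangles.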

In [None]:
#Code cell #25
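import cv2
from google.colab.patches import cv2_imshow
import numpy as np
from pytesseract import Output

#Convert to 8-bit grayscale so OpenCV gets a single-channel uint8 array
cv2_im = np.array(im.convert('L'))
#Make a 3-channel copy so the boxes can be drawn in color
display_im = cv2.cvtColor(cv2_im, cv2.COLOR_GRAY2BGR)

height = cv2_im.shape[0]
width = cv2_im.shape[1]

#Use the sophonisba model, as in the cells above
trained_boxes = pytesseract.image_to_boxes(cv2_im, lang='sophonisba', output_type=Output.DICT)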

n_boxes = len(trained_boxes['char'])
for i in range(n_boxes):
    (text,x1,y2,x2,y1) = (trained_boxes['char'][i],trained_boxes['left'][i],
                          trained_boxes['top'][i],trained_boxes['right'][i],
                          trained_boxes['bottom'][i])
    cv2.rectangle(display_im, (x1,height-y1), (x2,height-y2) , (0,255,0), 2)
cv2_imshow(display_im)

## 8 - Load the training output to experiment with different checkpoints (OPTIONAL)
If you ran the code in code cell #18, you can now experiment with how changing the number of iterations that `tesstrain` runs affects the output you get from the model.

In my own experiments, I began with 10,000 iterations (the default) and then "continued" the training by picking up from different checkpoints and setting a new number of iterations.

* I rolled back to an early checkpoint, for instance, to see how things looked with just 6,000 iterations (I read several comments about fine tuning that suggested that you could start to see good results with lower numbers of iterations—and that running too many iterations could cause the model to lose its ability to generalize. More about that in our discussion.)

* I continued from a later checkpoint, allowing it to run for 15,000 iterations. Interestingly, when running the language on a page from Sophonisba, I saw improvement up to about 14,000 iterations, but then things got *worse* between 14,000 and 15,000.

The next cell shows the contents of the folder of checkpoints that `tesstrain` produced from code cell #18 (sorted with the latest checkpoints at the top). The filename of the checkpoint gives useful clues: the first series of numbers shows the error rate achieved at that checkpoint, while the second series of numbers indicates how many iterations had been run at the time that checkpoint was written.

You can copy the names of different checkpoints into the code cell #27 and alter the value of MAX_ITERATIONS to see what kind of effect those adjustments can have.
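
If you'd rather not read those numbers off by eye, a small helper can pull them out of the filenames. This is a sketch under an assumption: that the names look something like `sophonisba_1.234_567_8900.checkpoint` (a made-up example), with the error rate first and an iteration count second. Anything that doesn't fit that shape returns `None`.

```python
import re

# Assumed shape: <model>_<error rate>_<iterations>[_<another count>].checkpoint
CHECKPOINT_PATTERN = re.compile(r"_(\d+(?:\.\d+)?)_(\d+)(?:_\d+)?\.checkpoint$")


def parse_checkpoint_name(filename):
    """Return (error_rate, iterations) from a checkpoint filename, or None."""
    match = CHECKPOINT_PATTERN.search(filename)
    if match is None:
        return None
    return float(match.group(1)), int(match.group(2))


# Made-up filenames, for illustration only
names = ["sophonisba_1.234_567_8900.checkpoint", "sophonisba_0.900_800_9900.checkpoint"]
best = min(names, key=lambda name: parse_checkpoint_name(name)[0])
print(best)
```

Picking the checkpoint with the lowest recorded error is only a starting point: as noted above, a lower training error doesn't guarantee better results on a new page.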

In [None]:
#Code cell #26
%cd /content/tesstrain/data/sophonisba/checkpoints/
%ls -lt

In [None]:
#Code cell #27
#Copy a checkpoint from the list above and paste it over "sophonisba_checkpoint"
#(right after "START_MODEL="). Then go to the end of the line and change the
#value of MAX_ITERATIONS. Because the training has already been run, this should
#usually only take a couple of minutes to re-run, depending on the values
#you supply
%cd /content/tesstrain/
! make training MODEL_NAME=sophonisba START_MODEL=sophonisba_checkpoint TESSDATA=/data/eng FINETUNE_TYPE=Plus WORD_FILE=/data/sophonisba/sophonisba.wordlist PUNC_FILE=/data/sophonisba/sophonisba.punc MAX_ITERATIONS=12000

### 8.a - Update the .traineddata in our Tesseract installation
For each new training you run, you'll need to move the resulting .traineddata file into our installation of Tesseract for the changes to take effect.

In [None]:
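#Code cell #28
#Copy the newly trained model into Tesseract's tessdata folder
%cp /content/tesstrain/data/sophonisba.traineddata /usr/share/tesseract-ocr/4.00/tessdata/sophonisba.traineddata
#Optionally bundle the training output into a .zip for download
%cd /content/tesstrain/data/
!zip -r completed_training.zip sophonisba.traineddata sophonisba-ground-truth/ sophonisba/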

### 8.b - See how the output changes
Re-run Tesseract with your new language model.

In [None]:
#Code cell #29
new_model = pytesseract.image_to_string(im, lang='sophonisba')
print(new_model)

## 9 - Okay, but let's try it on something other than *Sophonisba*
If we try running a page from a different text from Bowyer's press from around the same time, we'll see how inadequate this model is in its current state—it leans too heavily on a single text and doesn't give Tesseract's neural net enough data to allow for good generalization.

In [None]:
#Code cell #30
bl_image = '/content/page_images/bl_iiif-bw.png'
next_image = Image.open(bl_image)
next_test_string = pytesseract.image_to_string(next_image, lang='sophonisba')
print(next_test_string)

## 10 - Examine the output
As you can see, the results of this relatively small-scale training aren't really ready for prime time, and certainly not ready to drive work in analytical bibliography: I wouldn't want to rely on this model to give me an accurate sense of what characters are actually on the page.
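
If you want a number to go with your impressions, one common measure is the character error rate: the edit distance between the recognized text and a trusted transcription, divided by the length of that transcription. Here is a minimal sketch; the function names are mine, and you would need to supply your own hand-checked transcription of the page.

```python
def edit_distance(a, b):
    """Levenshtein distance: fewest insertions, deletions, and substitutions."""
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def character_error_rate(recognized, reference):
    """Edit distance divided by the length of the reference transcription."""
    if not reference:
        raise ValueError("reference transcription must not be empty")
    return edit_distance(recognized, reference) / len(reference)


print(character_error_rate("foon", "ſoon"))
```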

Have a look at the ways that this model falls down, though, and see if you can think of possible applications for using a *better* model to study printed texts.