# Training Tesseract
As has been the case with so many things this week, getting ready for OCR training is far more involved than the training itself: if you've gotten everything set up correctly, you mostly just start a process running, wait for it to complete, start another process, wait, and so on.

The training we're doing here wouldn't take the days or weeks that Carl mentioned yesterday when talking about the kinds of big neural net models that it's possible to get into. (Apparently, though, training Tesseract from scratch on a very large data set, like the hundreds of thousands of pages that were used to train Tesseract's standard English language model, can indeed take weeks.)

I'm recommending that you work through the code *up to but not including* the cell that would build the full training so that you can see the steps that go into getting all the data ready to build into a completed "language" for Tesseract. (That cell seems to take almost exactly an hour to complete the default 10,000 iterations.) I've put all the files that are needed to prepare the training in our `class_data` folder (these are the same files that are produced by the previous workbooks, with, I think, two hand-tweaks), so you can step through the process to see how it works. (And, if you really want to see the whole thing happen, you can run the cell that performs the training.)

I've also made the output of a completed training run available in the `class_data` folder. We can use that to test out different results: Tesseract allows us to continue a training from a specified checkpoint, so we can experiment with how the model does with varying numbers of iterations (up to 15,000).

We'll use `tesstrain`, a set of `make`-driven scripts provided by the maintainers of Tesseract, to handle the training. This isn't the only way to train Tesseract, by any means, but it's a relatively simple one: we provide the necessary data, and `tesstrain` takes care of firing off all the various commands to Tesseract.

In this notebook, we'll be stepping away from Python temporarily and working largely in the Unix `bash` shell, so the commands you'll be seeing have more in common with our routines for getting everything set up for the class (`mkdir`, `cd`, `ls`, etc.) than anything else. 

(If you're familiar with working at the command line, you may wonder why I'm constantly using `cd`, even when stepping through the cells in order would already have put you in the right directory. I went that slightly paranoid route because Jupyter notebooks lend themselves to being run *out* of order, so I've tried to make sure that every command is immediately preceded by a change into the correct directory.)

## Connect to Google Drive

In [None]:
from google.colab import drive
drive.mount('/gdrive')
from google.colab import files

## Install Tesseract and a utility that tesstrain needs

In [None]:
! apt install tesseract-ocr
! apt install bc

## Get preliminary files
We'll clone `tesstrain` from GitHub and also copy the pre-cooked materials for training from Google Drive into the Colaboratory environment. We're going to circle back to some of these files in a bit, but for now we'll just move the `sophonisba-ground-truth` folder to the place where `tesstrain` expects it to be.

In [None]:
#Clone tesstrain repo
%cd /content/
! git clone https://github.com/tesseract-ocr/tesstrain

#Make expected data directory in tesstrain directory
%cd /content/tesstrain/
%mkdir data

In [None]:
%cd /content/
#Copy prepared training materials from Google Drive to /content for now
%cp /gdrive/MyDrive/L-100a/ocr_training_materials.zip /content/ocr_training_materials.zip
!unzip /content/ocr_training_materials.zip

#Move sophonisba-ground-truth (line-level images and text) to tesstrain/data
%mv /content/ocr_training_materials/sophonisba-ground-truth/ /content/tesstrain/data/sophonisba-ground-truth 
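
Before moving on, it can be worth confirming that every line image has a transcription sitting beside it. As far as I understand the `tesstrain` conventions, each image (a `.tif` or `.png`, say) is paired with a `.gt.txt` file of the same base name. The sketch below is not part of `tesstrain`: `find_unpaired` is a hypothetical helper, and it assumes simple file names along the lines of `line_001.tif`.

```python
from pathlib import Path

# Hypothetical helper, not part of tesstrain. Assumes each line image
# (e.g. line_001.tif) should sit beside a transcription named line_001.gt.txt.
IMAGE_SUFFIXES = {".tif", ".png"}

def find_unpaired(ground_truth_dir):
    """Return (images without transcriptions, transcriptions without images)."""
    folder = Path(ground_truth_dir)
    names = [p.name for p in folder.iterdir() if p.is_file()]
    # Base names of the line images, e.g. "line_001" for "line_001.tif"
    images = {Path(n).stem for n in names if Path(n).suffix.lower() in IMAGE_SUFFIXES}
    # Base names of the transcriptions, e.g. "line_001" for "line_001.gt.txt"
    texts = {n[: -len(".gt.txt")] for n in names if n.endswith(".gt.txt")}
    return sorted(images - texts), sorted(texts - images)
```

In the notebook you could call `find_unpaired('/content/tesstrain/data/sophonisba-ground-truth')`; two empty lists would mean that everything is paired up.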

In [None]:
%cd /content/ocr_training_materials/
%ls

## Using tesstrain to create a proto-model for our new Tesseract training
Let's first have a look at the `tesstrain` directory: it's full of scripts to automate the process of training Tesseract. We'll trust that these people know what they're doing.

In [None]:
%cd /content/tesstrain
%ls

### Create the skeleton of our new training
`tesstrain` will generate a list of Unicode characters associated with our ground truth files, as well as some other scaffolding for our model, which we'll call `sophonisba`. This will take several minutes. Expect to see some non-fatal errors reported at the end. Fingers crossed that they won't hinder us too much.

In [None]:
!make unicharset lists proto-model MODEL_NAME=sophonisba
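
To get a feel for what the unicharset step is gathering, here is a rough Python sketch that counts the distinct characters in a folder of transcriptions. It is only an illustration: Tesseract's real unicharset file stores more than a bare list of characters, and `collect_characters` is a hypothetical name of my own.

```python
from collections import Counter
from pathlib import Path

def collect_characters(ground_truth_dir):
    """Count each character found in the *.gt.txt files of a folder."""
    counts = Counter()
    for path in sorted(Path(ground_truth_dir).glob("*.gt.txt")):
        text = path.read_text(encoding="utf-8")
        # Ignore line breaks; every other character is counted
        counts.update(ch for ch in text if ch != "\n")
    return counts
```

Something like `collect_characters('/content/tesstrain/data/sophonisba-ground-truth').most_common(20)` would show the most frequent characters; the rare ones at the other end of the list are often the more interesting ones to check.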

### Let's see what tesstrain created

In [None]:
%cd /content/tesstrain/data/sophonisba/
%ls

## Extending Tesseract's existing English model, rather than starting from scratch
In theory, we could train Tesseract with just the line images and transcriptions from *Sophonisba*. (Well, we could do it in practice, too—I did it that way last week and can report that it works. -Ish.) But that's really much too small a base on which to ground an entire language model. 

There are certainly cases where it makes sense to think about building a model from scratch (for a language that Tesseract doesn't currently support, for instance, or for an especially unusual typeface).

Without more text than we have, though, we're almost surely better off "fine tuning" Tesseract: the documentation notes that it's possible to get fairly good results here even without a lot of training data. That seems like our best bet.

To do that, we'll have to extract some information from Tesseract's existing English language model. This involves several terminal commands that are specific to Tesseract.

In [None]:
#Create a new directory to hold a copy of Tesseract's English language model and
#copy that model from its location in the system's installation of Tesseract to 
#a folder in /content
!mkdir /content/tesstrain/data/eng/
%cp /usr/share/tesseract-ocr/4.00/tessdata/eng.traineddata /content/tesstrain/data/eng/eng.traineddata

#Move to the directory with our copy of the English language model and extract
#several components using combine_tessdata (-e is for "extract")
%cd /content/tesstrain/data/eng/
!combine_tessdata -e eng.traineddata eng.lstm-unicharset eng.lstm-word-dawg eng.lstm-punc-dawg eng.lstm-number-dawg

### Extract the English language model's word list and punctuation rules list
Tesseract's word lists are in the form of "Directed Acyclic Word Graphs," or "DAWG files." ([This blog post](http://stevehanov.ca/blog/?id=115) provides an explanation with illustrations.) If we want to add anything to those lists, we first need to get the information out of the graphs using `dawg2wordlist`. We'll extract the English word list and punctuation list.

In [None]:
%cd /content/tesstrain/data/eng/
!dawg2wordlist eng.lstm-unicharset eng.lstm-word-dawg english_words.txt
!dawg2wordlist eng.lstm-unicharset eng.lstm-punc-dawg engpunclist.txt
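
As an aside on why word lists get stored as graphs at all: a simpler relative of the DAWG is the trie, in which words that share a prefix share nodes. The sketch below builds a plain trie in Python. A true DAWG goes further and also merges shared endings, which this example does not attempt, and none of these names come from Tesseract.

```python
END = None  # sentinel key marking "a word ends here"; never a real character

def build_trie(words):
    """Build a nested-dict trie in which shared prefixes share nodes."""
    root = {}
    for word in words:
        node = root
        for ch in word:
            node = node.setdefault(ch, {})
        node[END] = True
    return root

def contains(trie, word):
    """Return True if word was inserted into the trie as a complete word."""
    node = trie
    for ch in word:
        if ch not in node:
            return False
        node = node[ch]
    return END in node

def count_nodes(trie):
    """Count the character nodes in the trie (end markers excluded)."""
    return sum(1 + count_nodes(child)
               for key, child in trie.items() if key is not END)
```

For the four words `tap`, `taps`, `top`, and `tops`, the trie needs 7 character nodes rather than the 14 letters in the plain list, which hints at why a graph is an economical way to hold a large dictionary.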

#### Let's have a look...
Yep. There are text files there now, all right.

In [None]:
%cd /content/tesstrain/data/eng/
%ls

### Add words and punctuation patterns from ECCO to Tesseract's existing lists


In [None]:
!cat /content/tesstrain/data/eng/english_words.txt /content/ocr_training_materials/ecco-words.txt | sort | uniq > /content/testing_pipe.txt

In [None]:
%cd /content/tesstrain/data/
%ls

In [None]:
#Copy ECCO word and punctuation lists (cp rather than mv, so that the originals
#are still in place for the cat commands below)
!cp /content/ocr_training_materials/ecco-words.txt /content/tesstrain/data/ecco-words.txt
!cp /content/ocr_training_materials/ecco-punct.txt /content/tesstrain/data/ecco-punct.txt
%cd /content/tesstrain/data/

#Concatenate Tesseract's English word list with our ecco-words, then sort the 
#resulting file and eliminate duplicate lines. Note: this is *literally* a 
#"pipeline": we send the output of one command to the next with the pipe character
#("|") before saving the output as a file
!cat /content/tesstrain/data/eng/english_words.txt /content/ocr_training_materials/ecco-words.txt | sort | uniq > combined-words-sorted-unique.txt

#Do the same thing for the punctuation lists
!cat /content/tesstrain/data/eng/engpunclist.txt /content/ocr_training_materials/ecco-punct.txt | sort | uniq > combined-punc-sorted-unique.txt

#See what we have
%ls
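
If the shell pipeline feels opaque, the same idea can be sketched in Python. This is an illustration rather than a drop-in replacement: `merge_word_lists` is a hypothetical helper, it drops blank lines (which `sort | uniq` would keep), and Python sorts by code point, so the order may differ slightly from what `sort` produces under the system's locale.

```python
from pathlib import Path

def merge_word_lists(input_paths, output_path):
    """Combine one-entry-per-line files into one sorted, de-duplicated file."""
    entries = set()  # a set keeps only one copy of each line, like uniq
    for path in input_paths:
        for line in Path(path).read_text(encoding="utf-8").splitlines():
            if line:  # skip blank lines
                entries.add(line)
    merged = sorted(entries)
    Path(output_path).write_text("".join(e + "\n" for e in merged), encoding="utf-8")
    return merged
```

Called with the two word lists from the cell above and an output name of your choosing, it returns the merged list as well as writing it to disk, which makes it easy to check how many entries ECCO actually added.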

#### Turning our text files into DAWG files
Just as we used `dawg2wordlist` to unpack Tesseract's DAWG files into plain text, we now need to use the complementary `wordlist2dawg` to turn our plain text files into DAWG files that Tesseract can use.

In [None]:
!wordlist2dawg /content/tesstrain/data/combined-words-sorted-unique.txt /content/tesstrain/data/sophonisba/sophonisba.wordlist /content/tesstrain/data/sophonisba/sophonisba.unicharset
!wordlist2dawg /content/tesstrain/data/combined-punc-sorted-unique.txt /content/tesstrain/data/sophonisba/sophonisba.punc /content/tesstrain/data/sophonisba/sophonisba.unicharset

### Let's take a quick look at the files we've added to sophonisba
These are the files that we'll be telling `tesstrain` about as it executes the Tesseract training routine.

In [None]:
%cd /content/tesstrain/data/sophonisba/
%ls -lt

## DO NOT RUN THIS CELL
Unless you want to kick off the hour-long training.

In [None]:
#Start training. Go eat a sandwich, or something, 'cause this will take an hour.
#When it's done, it's going to be a huge folder. Create a .zip and then use the
#Colab UI to download the .zip: copying it over to Google Drive with cp or using
#files.download() from google.colab seems to choke.
%cd /content/tesstrain/
! make training MODEL_NAME=sophonisba TESSDATA=/data/eng FINETUNE_TYPE=Plus WORD_FILE=/data/sophonisba/sophonisba.wordlist PUNC_FILE=/data/sophonisba/sophonisba.punc MAX_ITERATIONS=6000
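
The comment in the cell above mentions creating a .zip of the training output. One way to do that from Python is `shutil.make_archive` from the standard library. This is a minimal sketch; `zip_folder` is a hypothetical wrapper, and the paths in the usage note are my guesses at where things would end up.

```python
import shutil

def zip_folder(folder, archive_base):
    """Zip the contents of folder into archive_base + '.zip'; return the archive path."""
    # Keep archive_base outside of folder so the archive doesn't try to include itself
    return shutil.make_archive(archive_base, "zip", root_dir=folder)
```

For example, `zip_folder('/content/tesstrain/data/sophonisba', '/content/sophonisba_training')` should leave a `sophonisba_training.zip` in `/content`, ready to download through the Colab file browser.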

## Loading a completed training to see what Tesseract has done
Here's where we'll load the output for a completed Tesseract training so we can try some various things.

### Adding a completed .traineddata file to our installation of Tesseract
This makes our new training available to Tesseract.

In [None]:
%cp /content/ocr_training_materials/sophonisba.traineddata /usr/share/tesseract-ocr/4.00/tessdata/sophonisba.traineddata

In [None]:
#Get page image files from Google Drive
%cp /gdrive/MyDrive/rbs_digital_approaches_2021/data_class/page_images/penn_pr3732_t7_1730b.zip /content/penn_pr3732_t7_1730b.zip
%cd /content/
!unzip penn_pr3732_t7_1730b.zip

In [None]:
#Install Python wrapper for Tesseract
!pip install pytesseract

In [None]:
import pytesseract
from PIL import Image

#### How's it look?
Let's see what this gets us.

In [None]:
image_file = '/content/penn_pr3732_t7_1730b/bw/PR3732_T7_1730b_body0009-bw.tif'
im = Image.open(image_file)
untrained_string = pytesseract.image_to_string(im, lang='eng')
trained_string = pytesseract.image_to_string(im, lang='sophonisba')

In [None]:
print(untrained_string)

In [None]:
print(trained_string)
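
Comparing two long printouts by eye gets tiring quickly. Python's standard `difflib` module can list only the lines that differ. The helper below is a sketch with a hypothetical name (`diff_ocr_outputs`); the labels simply echo the two language models used above.

```python
import difflib

def diff_ocr_outputs(before, after, before_label="eng", after_label="sophonisba"):
    """Return unified-diff lines showing where two OCR transcriptions differ."""
    return list(difflib.unified_diff(
        before.splitlines(),
        after.splitlines(),
        fromfile=before_label,
        tofile=after_label,
        lineterm="",  # our lines carry no trailing newlines
        n=0,          # no context lines: show only what changed
    ))
```

In the notebook, `print("\n".join(diff_ocr_outputs(untrained_string, trained_string)))` would show lines from the stock English model prefixed with `-` and the corresponding lines from our model prefixed with `+`.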

## Load the training output to experiment with different checkpoints
The training we're working with is one that I ran a few different times, varying the maximum number of iterations that the training ran.

I say that I ran it "a few different times," but really I ran it just once (at 10,000 iterations—the default) and then "continued" the training by picking up from different checkpoints and setting a new number of iterations. 

* I rolled back to an early checkpoint, for instance, to see how things looked with just 6,000 iterations (I read several comments about fine tuning that suggested that you could start to see good results with lower numbers of iterations—and that running too many iterations could cause the model to lose its ability to generalize. More about that in our discussion.)

* I continued from a later checkpoint, allowing it to run for 15,000 iterations. Interestingly, when running the language on a page from *Sophonisba*, I saw improvement up to about 14,000 iterations, but then things got *worse* between 14,000 and 15,000.

The next cell shows the contents of the folder of checkpoints that `tesstrain` produced from my training (sorted with the latest checkpoints at the top). You can copy the names of different checkpoints into the cell below that, altering the `MAX_ITERATIONS` variable to see what kind of effect those adjustments can have.

In [None]:
%ls -lt /content/ocr_training_materials/sophonisba/checkpoints/
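
If you would rather have the checkpoint names as a Python list (to loop over, say), something like the following mirrors `ls -lt`. It is a sketch: `newest_checkpoints` is a hypothetical helper, and it assumes that every checkpoint file name ends in `checkpoint`.

```python
from pathlib import Path

def newest_checkpoints(checkpoint_dir, limit=5):
    """Return up to `limit` checkpoint file names, most recently modified first."""
    # Assumption: checkpoint file names all end in "checkpoint"
    files = Path(checkpoint_dir).glob("*checkpoint")
    ordered = sorted(files, key=lambda p: p.stat().st_mtime, reverse=True)
    return [p.name for p in ordered[:limit]]
```

Calling `newest_checkpoints('/content/ocr_training_materials/sophonisba/checkpoints/')` should give the same names as the top of the listing above.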

In [None]:
#Copy a checkpoint from the list above and paste it over "sophonisba_checkpoint"
#(right after "START_MODEL="). Then go to the end of the line and change the 
#value of MAX_ITERATIONS. Because the training has already been run, this should 
#usually only take a couple of minutes to re-run, depending on the values
#you supply
%cd /content/tesstrain/
! make training MODEL_NAME=sophonisba START_MODEL=sophonisba_checkpoint TESSDATA=/data/eng FINETUNE_TYPE=Plus WORD_FILE=/data/sophonisba/sophonisba.wordlist PUNC_FILE=/data/sophonisba/sophonisba.punc MAX_ITERATIONS=12000
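
Editing that long line by hand invites typos. As a hedged alternative, the arguments can be assembled in Python. `build_training_command` is a hypothetical helper that only re-creates the variables already used in the cell above; it doesn't add any new options.

```python
def build_training_command(start_model, max_iterations, model_name="sophonisba"):
    """Assemble the `make training` call from the cell above as a list of arguments."""
    return [
        "make", "training",
        f"MODEL_NAME={model_name}",
        f"START_MODEL={start_model}",
        "TESSDATA=/data/eng",
        "FINETUNE_TYPE=Plus",
        f"WORD_FILE=/data/{model_name}/{model_name}.wordlist",
        f"PUNC_FILE=/data/{model_name}/{model_name}.punc",
        f"MAX_ITERATIONS={int(max_iterations)}",
    ]
```

The resulting list could be handed to `subprocess.run(cmd, cwd="/content/tesstrain", check=True)`, or joined with spaces and pasted after the `!` in a cell.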

### Update the .traineddata in our Tesseract installation
For each new training you run, you'll need to copy the resulting .traineddata file into our installation of Tesseract for the changes to take effect.

In [None]:
%cp /content/tesstrain/data/sophonisba.traineddata /usr/share/tesseract-ocr/4.00/tessdata/sophonisba.traineddata

### See how the output changes
Re-run Tesseract with your new language model.

In [None]:
new_model = pytesseract.image_to_string(im, lang='sophonisba')
print(new_model)
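
To put a rough number on "better" or "worse" from one checkpoint to the next, a common measure is the character error rate: the edit distance between the OCR output and a trusted transcription, divided by the length of that transcription. The sketch below is plain Python with no extra libraries. It assumes that you have a hand-corrected transcription of the same page, which this notebook does not supply.

```python
def edit_distance(a, b):
    """Levenshtein distance: fewest insertions, deletions, and substitutions to turn a into b."""
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # delete ch_a
                current[j - 1] + 1,      # insert ch_b
                previous[j - 1] + cost,  # substitute (or keep a match)
            ))
        previous = current
    return previous[-1]

def character_error_rate(reference, hypothesis):
    """Edit distance divided by the length of the reference transcription."""
    if not reference:
        raise ValueError("reference transcription must not be empty")
    return edit_distance(reference, hypothesis) / len(reference)
```

With a hypothetical `ground_truth` string holding the corrected text of the page, `character_error_rate(ground_truth, new_model)` gives a figure you can note down for each value of `MAX_ITERATIONS`. Lower is better, and on a full page the pure-Python loop may take a few seconds.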

## Okay, but let's try it on something other than *Sophonisba*
Sobering.

In [None]:
bl_image = '/gdrive/MyDrive/rbs_digital_approaches_2021/data_class/page_images/bl_iiif-bw.png'
next_image = Image.open(bl_image)
next_test_string = pytesseract.image_to_string(next_image, lang='sophonisba')
print(next_test_string)