# Creating ground truth files
To train Tesseract to better recognize the eighteenth-century print in *Sophonisba*, we need to provide line-level images of the text accompanied by corresponding line-level transcriptions. Because the transcriptions are known to be correct, each pairing of image and transcription provides a "ground truth" for Tesseract's training process.

This is a task that, when done at scale, ends up creating literally thousands of files (they're small files, but still). That's all well and good when you're working on your own computer, but it's not great when working with files in Google Drive—it's not so much the number of files or the amount of storage as it is the lag time between files being written by Colaboratory and their actually being reliably available in Google Drive to feed the next stage of the process. (I speak from experience, having done all this exclusively in Colab, just to see if it could be done.)

There also ends up being a certain amount of spot checking and tweaking by hand involved in getting everything ready that we don't really have time for in our format. In this notebook, we'll run through the processes on just a subset of the images we'd need to create a training so that we can see how things work without getting ourselves into more than we have time for. (In the manner of Julia Child pulling a completed soufflé out of a different oven, I've provided a finished set of the line images and text files so that we can actually create a training without all of the tweaking—I did the tedious work so you don't have to!)


### Playing to Tesseract's strengths...
Even without training, Tesseract can do a reasonably good job with our page images of *Sophonisba*: it's mostly reliable on identifying the lines of text, and it actually isn't *terrible* at recognizing the text itself. 

We'll take advantage of how good a job Tesseract does at recognizing the lines of the text in order to create our line-level images. We can get Tesseract to produce what's called [hOCR output](https://en.wikipedia.org/wiki/HOCR), which provides information in XML not just about the recognized text, but also about the coordinates in the image where that text was located. We can use Beautiful Soup to get the coordinates of the text lines that Tesseract recognizes, and then have Pillow use those coordinates to create new images of the individual lines of text extracted from our full-page images. (Not every single one of those line images will turn out to be perfect, but we should get plenty of good ones—enough to produce a good training set.)
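As a rough preview of the idea, here is a minimal sketch of pulling line coordinates out of hOCR. To keep it dependency-free, it swaps in the standard library's `HTMLParser` where the notebook itself uses Beautiful Soup, and the sample markup is a made-up fragment. It assumes lines are marked with the `ocr_line` class and carry a `bbox x0 y0 x1 y1` entry in their `title` attribute, which is how Tesseract's hOCR output typically looks. The names `LineBoxParser` and `line_boxes` are hypothetical.

```python
import re
from html.parser import HTMLParser

# hOCR stores coordinates in the title attribute as "bbox x0 y0 x1 y1".
BBOX = re.compile(r"bbox (\d+) (\d+) (\d+) (\d+)")


class LineBoxParser(HTMLParser):
    """Collect the bounding box of every element with class ocr_line."""

    def __init__(self):
        super().__init__()
        self.boxes = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        classes = (attrs.get("class") or "").split()
        if "ocr_line" in classes:
            match = BBOX.search(attrs.get("title") or "")
            if match:
                self.boxes.append(tuple(int(n) for n in match.groups()))


def line_boxes(hocr):
    """Return a list of (left, top, right, bottom) tuples, one per line."""
    parser = LineBoxParser()
    parser.feed(hocr)
    return parser.boxes


# A made-up hOCR fragment, for illustration only.
sample_hocr = """
<div class='ocr_page' title='bbox 0 0 1000 1500'>
  <span class='ocr_line' title='bbox 120 200 880 240; baseline 0 -6'>
    <span class='ocrx_word' title='bbox 120 200 300 240'>Why</span>
  </span>
  <span class='ocr_line' title='bbox 118 250 875 292; baseline 0 -5'>
    <span class='ocrx_word' title='bbox 118 250 310 292'>The</span>
  </span>
</div>
"""

print(line_boxes(sample_hocr))
```

Each tuple is already in the (left, upper, right, lower) order that Pillow's `Image.crop` expects, so cropping a line image comes down to passing one tuple to `crop` on the opened page image.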

### ... While working around its weaknesses
While Tesseract is pretty strong at recognizing typographic lines in the text, it's not quite where we'd like it to be for recognizing the text accurately (that's why we're trying to train it, after all). It wouldn't be prohibitively *difficult* to correct Tesseract's output a line at a time to provide our ground-truth text. But it would be pretty laborious and time-consuming, and the sort of thing that, on the whole, we probably want to avoid if we can. Fortunately, for the most part, we can.

In the case of *Sophonisba*, there's already a double-keyed transcription of the text available in TEI-compliant XML as part of the ECCO-TCP collection. Why not just use that as the basis for our ground truth, rather than correcting Tesseract's untrained output or transcribing the text from scratch ourselves? I've cleaned up the TCP text by using Penn's PR3732 T7 1730b to fill in gaps in the transcription caused by defects in the copy transcribed as part of TCP. (I've also made sure that PR3732 T7 1730b matches the TCP transcription on 19 of the 20 textual points I identified through a traditional collation of fifteen copies of *Sophonisba*, because, come on, this is Rare Book School. The point I couldn't check was a catchword that's not captured in the TCP transcription.)

### Taking TEI-XML where it was never meant to go
As it turns out, getting the transcription *out of* the TEI XML file in a way that's usable for our purposes isn't entirely straightforward. Like all XML, the TEI schema is built around a conceptual representation of a text as a nested tree structure. As a dramatic text, *Sophonisba* is made up of Acts, each of which has one or more Scenes nested within it; each scene has stage directions and speeches; each speech has one or more line groups, each of which has one or more lines; even individual lines can enclose other elements (like typographically highlighted spans of text).

The problem for our purposes, though, is that TEI generally privileges a representation of the *text*, and treats the structure of the *book* only in passing: page beginnings are marked using empty "milestone" elements, but the text is not *structured* by its pages. So getting the text that appears on a particular page isn't simply a matter of extracting the content of a particular element in the XML. (Inconvenient as it is for our purposes, this makes sense, given the aims of the Text Encoding Initiative. Consider a paragraph that spans more than one page: the paragraph doesn't end just because the page does, but would be the same if the book were published in a different format with different pagination. TEI does *allow for* representations that privilege the surface, but that's not the approach that the Text Creation Partnership took with the ECCO-TCP texts, and for perfectly understandable reasons.)

So we're going to need to get the text in the TCP file reorganized in a page-centric, rather than text-centric, way. (**Warning:** The code we'll use to do that is not conceptually elegant, and may be upsetting to people who care deeply about TEI markup. I will apologize in advance.)
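To make the milestone problem concrete, here is a minimal sketch of one way to regroup text by page: walk the tree in document order and file every bit of text under whichever page-break milestone was seen most recently. This is an illustration only, not necessarily the code used later in the notebook. It assumes page breaks are lowercase `pb` elements with an `n` attribute (the actual TCP file may name them differently), and `text_by_page` and the sample XML are hypothetical.

```python
import xml.etree.ElementTree as ET


def local_name(tag):
    """Strip any '{namespace}' prefix from an element tag."""
    return tag.rsplit("}", 1)[-1]


def text_by_page(xml_string):
    """Group text fragments under the most recent <pb n="..."/> milestone."""
    pages = {}
    current = None  # text before the first page break is ignored

    def add(text):
        if current is not None and text and text.strip():
            pages.setdefault(current, []).append(" ".join(text.split()))

    def walk(element):
        nonlocal current
        if local_name(element.tag) == "pb":
            # Assumption: every page break carries an n attribute.
            current = element.get("n")
        add(element.text)
        for child in element:
            walk(child)
            # Tail text follows the child, so it belongs to whatever
            # page is current after the child has been processed.
            add(child.tail)

    walk(ET.fromstring(xml_string))
    return pages


# A made-up fragment in which a verse line runs across a page break.
sample_tei = """
<text><body><div type="act">
  <pb n="1"/>
  <sp><speaker>Soph.</speaker>
    <l>First line,</l>
    <l>second line <pb n="2"/>continues here.</l>
  </sp>
</div></body></text>
"""

print(text_by_page(sample_tei))
```

Note that the second `l` element ends up split across two pages, which is exactly the mismatch between text structure and page structure described above.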

### Some light processing to check string similarity
Once we have our TCP transcription broken up by pages, we'll put Tesseract to work on OCR'ing our page images, but before we save the output, we'll check Tesseract's recognized text against our TCP file and, where possible, substitute the TCP text for Tesseract's dirty OCR. (Exactly what's happening at this stage will be easier to see in the context of the code, so I'll save the discussion until later.)
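In the meantime, here is a minimal sketch of the general idea, using `difflib` from the standard library. It is not necessarily the method used later in the notebook: the function name `best_ground_truth`, the sample lines, and the 0.8 cutoff are all assumptions chosen for illustration.

```python
import difflib


def best_ground_truth(ocr_line, tcp_lines, cutoff=0.8):
    """Return the TCP line most similar to an OCR'd line, or None.

    cutoff is a similarity ratio between 0 and 1; 0.8 is an arbitrary
    starting point, not a tuned value.
    """
    matches = difflib.get_close_matches(ocr_line, tcp_lines, n=1, cutoff=cutoff)
    return matches[0] if matches else None


# Made-up lines standing in for one page of the TCP transcription.
tcp_page = [
    "Why will you still be talking of the past?",
    "The past is dead.",
]

# A long s misread as "f" is a typical error on eighteenth-century print.
dirty = "Why will you ftill be talking of the paft?"

print(best_ground_truth(dirty, tcp_page))
print(best_ground_truth("xyz 123", tcp_page))
```

When no TCP line clears the cutoff, the function returns `None`, which leaves room to skip that line image rather than pair it with a transcription we can't trust.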


### Saving the output
Finally, we'll save a ton of tiny images, and a ton of tiny text files to feed to Tesseract using a training script made available by some pretty heavy users of Tesseract (several of whom have, I think, contributed to Tesseract itself).
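As a sketch of what saving one image-and-text pair might look like: the example below assumes the naming convention used by the tesstrain project, in which a line image such as `name.png` sits next to a transcription called `name.gt.txt`. The training script used here may expect something different, and the helper names and the `page_0012` stem are hypothetical. Only the text file is written; the matching image would be saved to the returned image path (with Pillow, for instance).

```python
from pathlib import Path


def ground_truth_paths(out_dir, page_stem, line_number):
    """Build matching image and transcription paths for one line."""
    base = f"{page_stem}_{line_number:03d}"
    out_dir = Path(out_dir)
    return out_dir / f"{base}.png", out_dir / f"{base}.gt.txt"


def save_transcription(out_dir, page_stem, line_number, text):
    """Write one line's transcription and return both paths of the pair."""
    image_path, text_path = ground_truth_paths(out_dir, page_stem, line_number)
    text_path.parent.mkdir(parents=True, exist_ok=True)
    # One line of text per file, with surrounding whitespace removed.
    text_path.write_text(text.strip() + "\n", encoding="utf-8")
    return image_path, text_path
```

Zero-padding the line number keeps the files for a page sorted in reading order, which helps with spot checking.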