# Getting transcribed text from ECCO-TCP's TEI-XML file
The text of *Sophonisba* in the ECCO-TCP corpus was double-keyed by two human readers, so we can be reasonably confident that it's pretty accurate.

Note, though, that the quality of the TCP text is conditioned by the copy that was consulted for transcription. There were several `gap` elements in the TCP text indicating places where the source copy was illegible. After comparing the TCP text on a number of different points that I'd identified in a collation of multiple copies of the play, I concluded that the TCP text was transcribed from a copy in the same state as Penn's PR3732 T7 1730b, so I filled in those gaps from the Penn copy. I also happened to catch a couple of transcription errors, which just goes to show that some of the letter forms can be hard to distinguish for human readers, too: in one case I found, an italic ſb ligature was mis-transcribed as ſh, which can look *very* similar in binarized images. (In other cases, though, my examination of Penn's PR3732 T7 1730b revealed that there really *was* an ſh ligature where there should have been an ſb.)

The plan here is to mine this corrected TCP text for known-good transcriptions of the lines on each page so that we can match that text to the output that Tesseract gives us before training. As we saw, Tesseract is pretty good at finding the lines of text, but the TCP text gives us a correct version of what those lines actually say.

Unfortunately for our purposes, TEI doesn't treat pages as structural elements in the XML, so there's not an immediately obvious way to extract all the lines for each page from the XML. (I should say: not immediately obvious to me. Someone who's a wizard with XSLT might have a better answer than I do.) Instead, we'll ignore all the carefully-crafted XML structure in the TCP transcription and deal with it as if it were just a plain text file.

We'll end up with a set of plain text files—one for each page of text—that we can use for checking untrained-Tesseract's output.


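For comparison, here is a rough sketch of what an XML-aware route might look like. It is not the approach this notebook takes, and everything in it is an assumption for illustration: the function name `text_by_page` is hypothetical, the sample markup is invented, and a real TCP file would also need its namespaces and `EOLhyphen` elements handled. The idea is just to walk the tree in document order and start a new bucket of text every time a `pb` element turns up.

```python
import xml.etree.ElementTree as ET

def text_by_page(xml_string):
    """Hypothetical helper: collect text in document order, one chunk per <pb>."""
    root = ET.fromstring(xml_string)
    chunks = [[]]  # chunks[0] collects anything that comes before the first <pb>

    def walk(element):
        # Drop a namespace prefix such as '{http://www.tei-c.org/ns/1.0}' if present
        local_name = element.tag.split('}')[-1]
        if local_name == 'pb':
            chunks.append([])
        if element.text:
            chunks[-1].append(element.text)
        for child in element:
            walk(child)
            # Text that follows a child's closing tag is stored in child.tail
            if child.tail:
                chunks[-1].append(child.tail)

    walk(root)
    return [''.join(chunk) for chunk in chunks]

# Invented sample markup, not taken from the TCP file
sample_xml = (
    '<text><front>Header-ish matter</front>'
    '<pb n="1"/><p>A made-up <hi>title</hi> page</p>'
    '<pb n="2"/><p>Madam, &amp;c.</p></text>'
)

page_texts = text_by_page(sample_xml)
for index, chunk in enumerate(page_texts):
    print(index, repr(chunk))
```

The parser decodes `&amp;` on its own here, but this sketch does nothing about end-of-line hyphens or line breaks, which is part of why the plain-text route is the simpler one for our purposes.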

## 1 - Connect to Google Drive, install packages, and set source directory

In [None]:
#Code cell #1
#Connect to Google Drive
from google.colab import drive
drive.mount('/gdrive')

In [None]:
#Code cell #2
#Import the libraries we need: os and glob for dealing with the file system,
#re for handling regular expressions
import os
import glob
import re

In [None]:
#Code cell #3
source_directory = '/gdrive/MyDrive/rbs_digital_approaches_2023/2023_data_class/'

## 2 - Replace ampersand entities, save the hyphens and chunk the text on page beginnings
For the most part, we're simply going to throw away all of the tagging in this file. But we're making a couple of exceptions to ensure that our text files show what's actually on the page: converting `&amp;` to "&" and replacing all instances of `<g ref="char:EOLhyphen"/>` with "-" and a newline character (`\n`). (That last substitution has the added bonus of being a workaround for the TCP's handling of verse lines that end up wrapped on the page: in the TCP text, those line breaks aren't reflected in the XML structure. Now they *will* be in our plain text version.)

Other than that, though, the only tags we care about right now are the `pb` (page beginning) tags. Those tell us where to split the text of our file.

In [None]:
#Code cell #4
#Compile regular expressions for substitution
ampersand = re.compile(r'&amp;')
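#[^>] ("any character except >") keeps the match inside a single tag. With
#.+? a match could begin at an earlier <g ...> or <gap ...> tag on the same
#line and swallow the text in between.
eol_hyphen = re.compile(r'<g[^>]+?EOLhyphen"/>')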
pb_tag = re.compile(r'<pb.+?>')

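#Open the file (ignoring that it's XML) and replace selected
#patterns. The encoding is stated explicitly so that characters like the
#long s (ſ) are read the same way on any system.
with open(source_directory + 'ecco-tcp_K132743.000_modified.xml', 'r', encoding='utf-8') as xml_file :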
  xml_content = xml_file.read()
  ampersanded = re.sub(ampersand, '&', xml_content)
  hyphenated = re.sub(eol_hyphen, '-\n', ampersanded)
  #Create a list of chunks of text for each page by splitting
  #the contents of the file on appearances of <pb>
  page_chunks = re.split(pb_tag, hyphenated)
print(len(page_chunks))

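To see what these three steps do without opening the real file, here is a small sketch run on an invented snippet. The sample string is made up for illustration (it is not a line from *Sophonisba*), and the patterns are modelled on the ones in Code cell #4 rather than copied from any TCP tooling.

```python
import re

# Patterns modelled on those in Code cell #4
ampersand = re.compile(r'&amp;')
eol_hyphen = re.compile(r'<g[^>]+?EOLhyphen"/>')
pb_tag = re.compile(r'<pb.+?>')

# Invented sample: a stand-in header, then two "pages"
sample = (
    '<teiHeader>metadata</teiHeader>'
    '<pb n="1"/><l>An in<g ref="char:EOLhyphen"/>vented line &amp; more</l>'
    '<pb n="2"/><l>Second page</l>'
)

ampersanded = ampersand.sub('&', sample)
hyphenated = eol_hyphen.sub('-\n', ampersanded)
sample_chunks = pb_tag.split(hyphenated)

for index, chunk in enumerate(sample_chunks):
    print(index, repr(chunk))
```

The split produces one more chunk than there are `pb` tags, because whatever precedes the first `pb` becomes a chunk of its own at index 0.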
## 3 - Get rid of the TEI header
Because we split the text of the file at every occurrence of the regular expression `<pb.+?>`, the first item in our list is whatever text came *before* the first `pb` in the file. That turns out to mostly be the TEI header, which we don't need. So we'll delete that from our list of page chunks, leaving the title page as the first of our page_chunks.

(Note that our page chunks still have all of the XML tags in them. Those are going away very soon.)

In [None]:
#Code cell #5
#See the first element in the list--the text before the first pb tag
print('Original page_chunks[0]: ')
print(page_chunks[0])

#Farewell, TEI header. We hardly knew ye.
del(page_chunks[0])

#There's a new item at index 0. Hello, title page.
print('----------\nNew page_chunks[0]')
print(page_chunks[0])

## 4 - Get rid of all remaining tags
We'll create a new list to hold the text of the `page_chunks` once the XML tags are stripped out. For each `page_chunk`, we'll look for every occurrence of an XML tag and substitute nothing (`''`) in its place. Then we'll add that text, shorn of its XML tags, to our list of `pages`.

In [None]:
#Code cell #6
#Create empty list
pages = []

#Define a regular expression to find all tags
xml_tag = re.compile('<.+?>')

#Loop through page_chunks
for page_chunk in page_chunks :
  #Delete all occurrences of the xml_tag pattern (by substituting ''
  #in their place), and append the resulting text to the pages list.
  pages.append(re.sub(xml_tag, '', page_chunk))

#Have a look at a sample page
print(pages[7])

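One caveat worth checking on any new file: in Python regular expressions, `.` does not match a line break by default, so a pattern like `<.+?>` will skip a tag whose attributes happen to be wrapped onto a second line. Whether that ever occurs in this TCP file isn't something I've verified; the sketch below just contrasts the two behaviours on invented strings, using a negated character class (`[^>]`) as one possible alternative.

```python
import re

dot_pattern = re.compile(r'<.+?>')        # '.' stops at a line break
negated_pattern = re.compile(r'<[^>]+>')  # '[^>]' matches line breaks too

# Invented samples: the same tag on one line, and wrapped across two lines
one_line = 'a <hi rend="italic">word</hi> here'
wrapped = 'a <hi\nrend="italic">word</hi> here'

dot_one_line = dot_pattern.sub('', one_line)
dot_wrapped = dot_pattern.sub('', wrapped)
negated_wrapped = negated_pattern.sub('', wrapped)

print(repr(dot_one_line))
print(repr(dot_wrapped))
print(repr(negated_wrapped))
```

If the page samples printed above come out free of stray angle brackets, the simpler pattern is doing its job for this file.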
## 5 - Strip extraneous white space—but keep the line breaks!
We want to get rid of that extraneous leading white space, but we can't just strip *all* whitespace from the list item, because that would remove the line breaks, too.

So we'll split each page up into a list of lines, strip the extraneous white space from each *line*, and save that resulting list to a new list of `page_lines`.

In [None]:
#Code cell #7
#Create what will be a list-of-lists
page_lines = []

#Loop through the pages list
for page in pages :
    #Create a new list by splitting each page at every occurrence of the
    #newline character (\n)
    line_list = page.split('\n')

    #Create yet another list to hold the lines stripped of extraneous
    #white space
    stripped_lines = []

    for line in line_list :
        #Strip leading white space (i.e., white space from the left end
        #of the string: lstrip() is short for "left strip").
        stripped_line = line.lstrip()

        #Ignore any empty lines--we won't need those, where we're going...
        if stripped_line != '' :
            #Add the now left-stripped line to the stripped_lines list
            stripped_lines.append(stripped_line)

    #Append the resulting list of stripped lines to the page_lines list
    page_lines.append(stripped_lines)

#Have a look at a sample
print(page_lines[7])

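For anyone who prefers a more compact form, the same split-strip-filter logic can be sketched as a list comprehension. The sample page here is invented, and the second half of the sketch shows the thing we were trying to avoid: collapsing *all* white space throws the line breaks away.

```python
# Invented sample page with indentation, blank lines, and trailing spaces
sample_page = '\n      FIRST LINE\n\n         Second line  \n   \n      Third line\n'

# Split on newlines, left-strip each line, and drop lines that end up empty
cleaned_lines = [
    line.lstrip()
    for line in sample_page.split('\n')
    if line.lstrip() != ''
]
print(cleaned_lines)

# By contrast, collapsing every run of white space loses the line structure
collapsed = ' '.join(sample_page.split())
print(repr(collapsed))
```

Note that `lstrip()` leaves any trailing spaces in place; `strip()` would remove white space from both ends of each line.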
## 6 - So what do we actually have, now?
Our `page_lines` list contains 86 lists (one list for each page of text in the TCP transcription). Each of those 86 lists is a list of lines that appear on one page of text. While we didn't explicitly save any information about which page those lines appear on, we can kind of work that out using each item's index in the list, itself: the list at `page_lines[0]` is the text of page 1 (the title page), the list at `page_lines[1]` is the text of page 2, and so on.


In [None]:
#Code cell #8
print(str(len(page_lines)) + ' items in page_lines.')
for page_lines_index, page_lines_content in enumerate(page_lines) :
  if page_lines_index in range(0,10) :
    print('\npage_lines[' + str(page_lines_index) + ']')
    for content_index, content_line in enumerate(page_lines_content) :
      print('--' + str(content_index) + ': ' + content_line)

Note that some pages don't reflect the typographic line breaks we would have hoped for: the title page, the dedication to the Queen, and the Preface, in the list printed out above, for example. This is another of those places where we see how the way that the TCP applied the TEI schema (understandably) aims to represent the *text* rather than the *book*.


### 6.a - A brief digression on TEI
TEI *does* have a way of representing ["topographic lines"](https://www.tei-c.org/release/doc/tei-p5-doc/en/html/ref-line.html), but the TEI `line` element is only permitted to appear within `zone` or `surface` elements that are intended for the representation of primary sources. As the TEI Guidelines note in [chapter 11](https://www.tei-c.org/release/doc/tei-p5-doc/en/html/PH.html):
>This chapter describes elements that may be used to represent primary source materials, such as manuscripts, printed books, ephemera, or other textual documents. Some of these specialized elements, particularly at phrase-level, add to the other elements available within [text](https://www.tei-c.org/release/doc/tei-p5-doc/en/html/ref-text.html) to deal with textual phenomena more specific to primary source transcription. Other structural and block-level elements described here can be used to represent primary source materials by prioritizing the encoding of their spatial features over their logical textual structure (that is, the elements described in chapter [4 Default Text Structure](https://www.tei-c.org/release/doc/tei-p5-doc/en/html/DS.html)). These elements, [facsimile](https://www.tei-c.org/release/doc/tei-p5-doc/en/html/ref-facsimile.html), [sourceDoc](https://www.tei-c.org/release/doc/tei-p5-doc/en/html/ref-sourceDoc.html), and their children, may be used in parallel and in combination with an encoding of logical text structures with [text](https://www.tei-c.org/release/doc/tei-p5-doc/en/html/ref-text.html), or as standalone representations. The element [sourceDoc](https://www.tei-c.org/release/doc/tei-p5-doc/en/html/ref-sourceDoc.html) in particular provides a way of combining facsimile and transcriptions by embedding transcribed text. This approach focuses on physical and textual features that can be primarily described spatially, such as the sequence of pages in a manuscript, or the layout of a printed page.

Note how there *are* choices available: for some purposes, the representation of the physical characteristics of a document might be more important than the representation of the logical structure of the text, and TEI *does* offer methods for accommodating those cases. As it says here, it's *possible* to offer these representations in parallel, but I gather that it's not entirely straightforward to do so (bear in mind, though, that I'm not an expert in XML, generally, or in TEI, more particularly).

## 7 - Okay, whatever. Let's save it!
Whatever may be *possible* in TEI, however, the file we're dealing with here represents the text rather than the page. We're going to want to have those line breaks in the non-lineated sections of the book later. There's probably a [clever solution](https://i.pinimg.com/originals/82/4f/9a/824f9aef3d67ef521c54f6ed966ae9e3.png) to be coded, and if we were working with more than one document, it would probably be [worth the time](https://xkcd.com/1205/) to figure that solution out. As it is, though, since there's just the one document and I'm only planning to do this once, I'm just going to save these files as-is, then manually add line breaks to pages 1, 3, 4, and 86.

(**Note:** The copy of *Sophonisba* we're dealing with lacks the dedication to the Queen, which could actually be [a rather interesting](http://xtf.lib.virginia.edu/xtf/view?docId=StudiesInBiblio/uvaBook/tei/sibv012.xml;chunk.id=vol012.12;toc.depth=1;toc.id=vol012.12;brand=default) bibliographical fact about this particular copy. Nonetheless, we'll skip over saving those pages because they won't be present in the Tesseract output we're dealing with, so removing them now simplifies matters later.)

Since we're going to be writing a lot of small files in quick succession, we'll save the files locally in the Colaboratory environment. When that's all done, we'll compress them into a .zip file and copy that file over to Google Drive to reduce the amount of input/output between Colaboratory and Google Drive.

In [None]:
#Code cell #9

#Penn PR3732 T7 1730b lacks the dedication to the Queen, which occupies pages
#2 and 3 of our TCP transcription, so we need to skip indexes 1 and 2 of page_lines

tcp_lines_output_directory = '/content/tcp_lines/'

#Make our output directory (and any intermediate directories we need along
#the way). exist_ok=True means it's fine if the directory already exists.
os.makedirs(tcp_lines_output_directory, exist_ok=True)

#Let's start our file numbering at 1, so that we're not having to think about "page zero"
file_number = 1
for page_index, lines_on_page in enumerate(page_lines) :
  if page_index not in [1, 2] :
    filename = tcp_lines_output_directory + str(file_number) + '.txt'
    #Write one line of text per line of the file, as UTF-8
    with open(filename, 'w', encoding='utf-8') as outfile :
      for tcp_line in lines_on_page :
        outfile.write(tcp_line + '\n')
    print('Saved ' + filename)
    file_number += 1

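Because two indexes are skipped, the number in each filename no longer lines up with the page's position in `page_lines`. A small sketch of the resulting correspondence, using an invented five-page stand-in list (the names `toy_page_lines` and `file_number_to_index` are hypothetical):

```python
# Invented stand-in for page_lines: five "pages", the 2nd and 3rd to be skipped
toy_page_lines = [['title'], ['dedication a'], ['dedication b'], ['preface'], ['act one']]
skipped_indexes = {1, 2}

# Indexes that survive the skip, in order
kept_indexes = [
    index for index in range(len(toy_page_lines))
    if index not in skipped_indexes
]

# File numbers start at 1 and simply count the pages that were kept
file_number_to_index = dict(enumerate(kept_indexes, start=1))

for file_number, index in file_number_to_index.items():
    print(str(file_number) + '.txt  <-  page_lines[' + str(index) + ']')
```

So the first file still holds the title page, but from the second file onward the file number is one less than the list index.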
### Move these files over to Google Drive and clear Colaboratory environment

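The cell for this step uses shell commands. If you would rather stay in Python, the zipping part could be sketched with the standard library's `shutil.make_archive`. This is only an illustration under stated assumptions: it works in a throwaway temporary directory with placeholder files rather than on `/content/tcp_lines/`, and it leaves out the copy to Google Drive.

```python
import os
import shutil
import tempfile
import zipfile

# Throwaway working directory standing in for /content/
work_dir = tempfile.mkdtemp()
lines_dir = os.path.join(work_dir, 'tcp_lines')
os.makedirs(lines_dir)

# Placeholder files standing in for the per-page text files
for file_number in (1, 2):
    placeholder_path = os.path.join(lines_dir, str(file_number) + '.txt')
    with open(placeholder_path, 'w', encoding='utf-8') as outfile:
        outfile.write('placeholder line\n')

# Create work_dir/tcp_lines.zip containing the tcp_lines/ folder
archive_path = shutil.make_archive(
    os.path.join(work_dir, 'tcp_lines'),  # archive name without the extension
    'zip',
    root_dir=work_dir,
    base_dir='tcp_lines',
)

# List what ended up in the archive
with zipfile.ZipFile(archive_path) as archive:
    archived_names = sorted(archive.namelist())
print(archived_names)

# Clean up the throwaway directory
shutil.rmtree(work_dir)
```

Either way, the point is the same: one compressed file crosses over to Google Drive instead of dozens of small ones.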
In [None]:
#Code cell #10
%cd /content/
! zip -r tcp_lines.zip tcp_lines/
! mv tcp_lines.zip /gdrive/MyDrive/rbs_digital_approaches_2023/output/ocr_training_materials/tcp_lines.zip
! rm -r ./*