# Build an eighteenth-century word list for Tesseract
We're mostly training Tesseract to recognize eighteenth-century letter forms, but we can improve its chances of getting our text right if we also give it some information about the kinds of words (and punctuation) we expect to find in the texts we want it to recognize.

In this notebook, we'll plunder the ECCO-TCP corpus for words to use as a dictionary for training. ECCO-TCP is *much* smaller than EEBO-TCP, but it's big enough to find words that aren't in Tesseract's default English language model.

## 1 - Connect to Google Drive and import packages

In [None]:
#Code cell #1
#Connect to Google Drive
from google.colab import drive
drive.mount('/gdrive')

In [None]:
#Code cell #2
import os
import glob
import shutil
from bs4 import BeautifulSoup
import lxml
import nltk

## 2 - Move files from Google Drive to Colaboratory
The files available from the TCP GitHub repository don't have the long-s characters in the transcriptions. I happen to have on my hard drive what I believe is an earlier release of the texts that does have them. (They may simply come from a different source: the versions at the [Oxford Text Archive](http://www.ota.ox.ac.uk/) do have them, and I suspect that is where I got my copy, at a point when the archive had a system for bulk download.) We'll use my copy, which I've placed in the shared Google Drive folder for the class rather than on GitHub.

In [None]:
#Code cell #3
%cp /gdrive/MyDrive/L-100a/ecco_tcp.zip /content/ecco_tcp.zip
%cd /content/
!unzip ecco_tcp.zip

## 3 - Extract plain text from the TEI files
For the purposes of generating a word list, we don't need any of the TEI markup that's in the TCP texts. This cell uses Beautiful Soup to extract the text content from each file and save it as a plain text file. (Note: this will take several minutes.)

In [None]:
#Code cell #4
#Save plaintext versions of ECCO-TCP texts
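corpus_directory = '/content/ecco_tcp/plain_text/'
os.makedirs(corpus_directory, exist_ok=True)
for filepath in glob.glob('/content/ecco_tcp/*.xml') :
  #Drop the .xml extension
  filename = os.path.splitext(os.path.basename(filepath))[0]
  #Open in binary mode so the parser can work out the encoding for itself
  with open(filepath, 'rb') as infile :
    soup = BeautifulSoup(infile, 'xml')
  text_element = soup.find('text')
  #Skip any file without a <text> element rather than crashing
  if text_element is None :
    print('No <text> element in ' + filename + '.xml')
    continue
  text = text_element.get_text()
  #Causes problems due to concatenation of end-of-line-hyphenated words??
  # text = text.replace('∣','')
  #Write as UTF-8 so the long-s and other non-ASCII characters survive
  with open(corpus_directory + filename + '.txt', 'w', encoding='utf-8') as outfile :
    outfile.write(text)
  print('Saved ' + filename + '.txt')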
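
To make the extraction step concrete, here is a minimal sketch on a made-up TEI fragment. It uses the standard library's `xml.etree.ElementTree` rather than Beautiful Soup, purely so that it runs without extra installs. The sample markup and the `extract_text` helper are illustrative assumptions, not anything taken from ECCO-TCP.

```python
import xml.etree.ElementTree as ET

# A made-up TEI-like fragment (not a real ECCO-TCP file)
sample = (
    '<TEI><teiHeader><title>Sample</title></teiHeader>'
    '<text><body><head>CHAP. I.</head>'
    '<p>The <hi>firſt</hi> paragraph.</p></body></text></TEI>'
)

def extract_text(xml_string, separator=''):
    """Return the text inside the first <text> element (hypothetical helper)."""
    root = ET.fromstring(xml_string)
    text_element = root.find('.//text')
    if text_element is None:
        return ''
    # itertext() yields every piece of text nested inside the element
    return separator.join(text_element.itertext())

joined = extract_text(sample)                  # pieces run together
spaced = extract_text(sample, separator=' ')   # pieces kept apart
print(repr(joined))
print(repr(spaced))
```

With no separator, the heading runs straight into the paragraph wherever the markup has no whitespace between elements, which would yield a spurious token. Beautiful Soup's `get_text()` also accepts a `separator` argument, which may be worth trying if run-together words show up in the word list. The trade-off is that a separator would also split any word that has inline markup in the middle of it.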

## 4 - Get distinct words
There may well be other ways to do this; it's a question I should have asked Carl on text mining day. I used `nltk` to build a corpus from the plain text versions of the ECCO-TCP texts, then ran a frequency distribution to get the unique tokens in the corpus.
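
One such other way, offered only as a sketch on toy data: the standard library's `collections.Counter` with a simple regular-expression tokenizer. The sample strings are made up, and the pattern `\w+|[^\w\s]+` is an assumption chosen to resemble, to my understanding, the word-and-punctuation tokenizer that `PlaintextCorpusReader` uses by default.

```python
import re
from collections import Counter

# Toy stand-ins for the plain-text files (not real ECCO-TCP content)
documents = [
    "The ſame words, the ſame letters.",
    "Words & letters; 1st edition.",
]

# Assumed tokenizer: runs of word characters, or runs of punctuation
token_pattern = re.compile(r"\w+|[^\w\s]+")

# Count every token across all documents
counts = Counter()
for document in documents:
    counts.update(token_pattern.findall(document))

# The distinct tokens are the keys of the counter
distinct_tokens = sorted(counts)
print(len(distinct_tokens))
print(counts.most_common(3))
```

Note that "The" and "the" count as different tokens here, as do "Words" and "words", which is one reason the number of distinct tokens grows so quickly.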

In [None]:
#Code cell #5
corpusdir = '/content/ecco_tcp/plain_text/'
from nltk.corpus.reader.plaintext import PlaintextCorpusReader
ecco_tcp_eng_corpus = PlaintextCorpusReader(corpusdir, '.*')
print(len(ecco_tcp_eng_corpus.words()))

In [None]:
#Code cell #6
from nltk import FreqDist
fdist = FreqDist(ecco_tcp_eng_corpus.words())

In [None]:
#Code cell #7
#Number of distinct tokens
print(len(fdist))
#Sorted list of the distinct tokens
tokens = sorted(fdist)

That seems like a lot of distinct words. The fact that I have to get past the 10,000th token just to get to the "A"s gives me pause.

In [None]:
#Code cell #8
for token in tokens[10500:10550] :
  print(token)
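
Part of the explanation is how Python sorts strings: by Unicode code point, so tokens that begin with ASCII punctuation or digits all come before "A", every capitalized token comes before every lowercase one, and the long-s sorts after "z". A small sketch with made-up tokens shows the order:

```python
# Toy tokens, assumed for illustration
sample_tokens = ['apple', 'Apple', '1st', '!', 'ſo', 'Zebra', '—']

# Default string sorting compares Unicode code points
sorted_tokens = sorted(sample_tokens)

for token in sorted_tokens:
    # Show the code point of each token's first character
    print(f'U+{ord(token[0]):04X}  {token}')
```

So the thousands of tokens ahead of the "A"s are the ones that start with punctuation or numerals, which is the noise the next step tries to trim.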

I worked up a few regular expressions to try to trim away some of the noise...

In [None]:
#Code cell #9
import re
#I know there are defined lists of punctuation. But there's all *sorts* of freaky
#stuff in the ECCO-TCP texts...
punct_pattern = re.compile(r'^[!@#$%\^&\*\(\)\-_\+=\{\}\[\]\|\\;\:\'\"\"<,>\.\?\/€‹›ﬂ‡°·—±„´ˇÁ¨\"\'»Ó˝◊¿▪.…☜☞]*$')

#Anything that, from beginning to end, is composed of characters that are not
#alphabetic
nonword = re.compile(r'^[^A-Za-z]+$')

#I don't have anything against numbers. I just don't want them at the beginning
#of my words. I mean, it's okay in the titles of Prince songs, I guess...
contaminated = re.compile(r'^[0-9]+')

#Anything that's only letters (including long-s)
all_alphabetic = re.compile(r'^[A-Za-zſ]+$')

#Define some lists to hold results
punct = []
words = []

#Search for matches of these regular expressions, and add them to the lists
for token in tokens :
  if re.match(punct_pattern, token) is not None :
    punct.append(token)
  if re.match(all_alphabetic, token) is not None:
    words.append(token)
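
To see what these patterns actually catch, here is a small sketch that runs three of them over a handful of made-up tokens. The patterns are copied from the cell above; the sample tokens and the `classification` dictionary are assumptions for illustration. Note that the loop above applies only `punct_pattern` and `all_alphabetic`; `nonword` and `contaminated` are defined but not used there.

```python
import re

# Patterns copied from the cell above
patterns = {
    'all_alphabetic': re.compile(r'^[A-Za-zſ]+$'),
    'nonword': re.compile(r'^[^A-Za-z]+$'),
    'contaminated': re.compile(r'^[0-9]+'),
}

# Made-up tokens chosen to exercise each pattern
sample_tokens = ['ſame', 'Letters', '1st', '1776', '...', 'café']

# Record which patterns match each token
classification = {}
for token in sample_tokens:
    classification[token] = [
        name for name, pattern in patterns.items() if pattern.match(token)
    ]
    print(token, classification[token])
```

A word with an accented letter, like "café", matches none of the three, so under these patterns it would be left out of the word list. Whether that matters depends on how much accented vocabulary the texts contain.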



There were still plenty of problems with my word list, but after trying a few things that were simply taking too long on Colaboratory (or possibly in Python), I exported the word list as-is and processed it some more in a matter of minutes in my text editor, reducing a 5.1MB file to a 3.8MB file (which is still more than 400,000 words). I've included those files in the collection of pre-prepared materials that you can use for the last notebook in this sequence.

In [None]:
#Code cell #10
with open('/content/ecco-words.txt', 'w', encoding='utf-8') as wordfile :
  for word in words :
    wordfile.write(word + '\n')

In [None]:
#Code cell #11
with open('/content/ecco-punct.txt', 'w', encoding='utf-8') as punctfile :
  for pattern in punct :
    punctfile.write(pattern + '\n')

In [None]:
#Code cell #12
%cd /content/
!zip training_lists.zip *.txt
!mv training_lists.zip /gdrive/MyDrive/rbs_digital_approaches_2023/output/ocr_training_materials/training_lists.zip

## 5 - Clear Colaboratory environment

In [None]:
#Code cell #13
%cd /content/
! rm -r ./*