# TOOLS_ELA

*tools_ela.py* is a command-line tool that extracts information from original transcribed text. It provides the following:

1. Indexes of Words, Lemmas, Types
2. Frequencies of Lemmas, Types
3. Concordances
4. TTR (type-token ratio)
5. Collocations (for both words and lemmas)
6. N-grams
7. Min, Max and Mean lengths of Types
8. POS Tagging (both Bayesian and HMM - Hidden Markov Model)
9. TEI attributes (on TEI files)
10. TEI entity lists (on TEI files)
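
To make a few of the measures above concrete, here is a minimal sketch of a type-token ratio, type-length statistics, and n-grams computed on a short sample. It assumes plain whitespace tokenization and uses hypothetical helper names (`type_token_ratio`, `type_length_stats`, `ngrams`); it is not how *tools_ela.py* computes them, since the tool relies on CLTK for its language processing.

```python
from statistics import mean


def type_token_ratio(tokens):
    """Distinct word forms (types) divided by the total number of tokens."""
    if not tokens:
        return 0.0
    return len(set(tokens)) / len(tokens)


def type_length_stats(tokens):
    """Return (min, max, mean) character length over the distinct types."""
    lengths = [len(t) for t in set(tokens)]
    return min(lengths), max(lengths), mean(lengths)


def ngrams(tokens, n):
    """All contiguous sequences of n tokens, in order of occurrence."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


# Assumption: naive whitespace tokenization, for illustration only.
tokens = "in principio erat verbum et verbum erat apud deum".split()

print(type_token_ratio(tokens))   # 7 types over 9 tokens
print(type_length_stats(tokens))  # shortest, longest and mean type length
print(ngrams(tokens, 2)[:3])      # first three bigrams
```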

The tool can handle both *XML-TEI encoded* and plain-text files. By default it expects *XML-TEI*, but a command line argument can instruct the tool to treat files as text. It is a multiplatform tool, actively tested on both Linux and Windows. It runs in the current Python *virtual environment*, the same one used by this Jupyter Lab session.

A configuration file, *tools_ela.cfg*, is required in the same directory as *tools_ela.py*; it specifies the working directories. The default configuration is:

```
[Paths]
base = .
origin_subdir = SOURCE/ela_txt
result_subdir = RESULT/tools_ela
database = tools_ela.db
```

Leaving the database aside, this configuration means that input files are expected in `SOURCE/ela_txt` and results are written to `RESULT/tools_ela`. The source directory is treated as *flat*: subdirectories are not traversed. Results are stored in a flat directory as well. Result files keep the base prefix of their input file and take suffixes such as the following:

* `BASEPREFIX_collocations.json`: word collocations
* `BASEPREFIX_lemma_collocations.json`: lemma collocations (Latin only)
* `BASEPREFIX_ngrams.json`: n-grams (at various levels)
* `BASEPREFIX_concordances.txt`: concordances (*note: these are text lines*)
* `BASEPREFIX_statistics.json`: statistics on the Latin part of the text
* `BASEPREFIX_fulltext_statistics.json`: statistics on the whole text, for all languages
* `BASEPREFIX_tei_attrs.json`: data from the XML-TEI header
* `BASEPREFIX_tei_lists.json`: lists from XML-TEI encoding (*persName*, *geogName*, *placeName*)

All results, except for *concordances*, are in JSON format.
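
As a rough illustration of how the configuration and the naming scheme fit together, the sketch below parses the default configuration with the standard `configparser` module and builds the result file names listed above for a given base prefix. The helper names `load_paths` and `expected_result_files` are hypothetical, and resolving the subdirectories and the database relative to `base` is an assumption, not documented behaviour of *tools_ela.py*.

```python
import configparser
from pathlib import Path

# The default configuration shown above, inlined so the sketch is self-contained.
DEFAULT_CFG = """
[Paths]
base = .
origin_subdir = SOURCE/ela_txt
result_subdir = RESULT/tools_ela
database = tools_ela.db
"""

# Suffixes taken from the list of result files above.
RESULT_SUFFIXES = [
    "collocations.json",
    "lemma_collocations.json",
    "ngrams.json",
    "concordances.txt",
    "statistics.json",
    "fulltext_statistics.json",
    "tei_attrs.json",
    "tei_lists.json",
]


def load_paths(cfg_text):
    """Parse the [Paths] section and return source, result and database paths."""
    parser = configparser.ConfigParser()
    parser.read_string(cfg_text)
    paths = parser["Paths"]
    base = Path(paths["base"])
    # Assumption: every entry is interpreted relative to `base`.
    return {
        "source": base / paths["origin_subdir"],
        "result": base / paths["result_subdir"],
        "database": base / paths["database"],
    }


def expected_result_files(result_dir, base_prefix):
    """List the result files that could be produced for one input file."""
    return [Path(result_dir) / f"{base_prefix}_{suffix}" for suffix in RESULT_SUFFIXES]


paths = load_paths(DEFAULT_CFG)
for path in expected_result_files(paths["result"], "Epistola"):
    print(path)
```

Which of these files are actually written depends on the command line switches given to the tool.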

**Note:** *tools_ela.py* requires CLTK to be installed, and the `latin_models_cltk` corpus to be loaded. This can be performed from a Python session as follows:

```
>>> from cltk.corpus.utils.importer import CorpusImporter
>>> corpus_importer = CorpusImporter('latin')
>>> corpus_importer.import_corpus('latin_models_cltk') 
```

otherwise the tool exits with a warning. In addition, producing data related to *TEI lists* (namely places and people) requires the *tools_ela.db* database, which is built from up-to-date dumps of the *Pleiades* and *Geonames* databases. It can be created using the *nbk_dbload* notebook.
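
Before launching the tool, the two prerequisites above can be checked with a small helper such as the following. This is only a sketch: `preflight` is a hypothetical name, the database is assumed to sit in the current directory as in the default configuration, and the check only verifies that the `cltk` package is importable, not that the `latin_models_cltk` corpus has been downloaded.

```python
import importlib.util
from pathlib import Path


def preflight(database="tools_ela.db"):
    """Return a list of human-readable problems; an empty list means none found."""
    problems = []
    # find_spec returns None when the top-level package cannot be located.
    if importlib.util.find_spec("cltk") is None:
        problems.append("CLTK is not installed in this environment")
    # Assumption: the database path is relative to the current directory.
    if not Path(database).is_file():
        problems.append(f"database file not found: {database}")
    return problems


for problem in preflight():
    print("warning:", problem)
```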

The following are the command line parameters:

In [3]:
!python tools_ela.py --help

usage: tools_ela.py [-h] [-a] [-C N] [-L N] [-N N] [-O] [-B] [-M] [-T] [-A]
                    [-S] [-v] [-p] [--full-monty]
                    [BASE_FILENAME]

Preprocess latin text file(s) for ELA

positional arguments:
  BASE_FILENAME         base of filename to process (no directory/extension)

optional arguments:
  -h, --help            show this help message and exit
  -a, --all             process all files in the source directory
  -C N, --collocations N
                        generate collocations file with window up to N
                        [N=2..5]
  -L N, --lemma-collocations N
                        generate lemma collocations file with window up to N
                        [N=2..5]
  -N N, --ngrams N      generate ngrams file of sizes up to N [N=2..5]
  -O, --concordances    generate concordances file
  -B, --bayesian-pos    generate bayesian POS tag file
  -M, --hmm-pos         generate HMM base POS tag file
  -T, --assume-text     assume simple text files inste

To invoke the tool on a specific file, the `BASE_FILENAME` argument has to be given explicitly, *without* extension: *tools_ela.py* chooses between *.txt* and *.xml* depending on whether `--assume-text` has been provided or not. To process all files in the source directory, just specify `--all`. The `--full-monty` switch is a shortcut that generates most of the useful information (everything except *POS tagging*) without having to specify all the related command line switches. For instance, to regenerate POS tagging for *Epistola.xml* in the source directory, issue the following:

```
$ python tools_ela.py -v -p -M -B Epistola
```

and to generate most of the useful information, the following command can be launched, as we will do below:

```
$ python tools_ela.py --full-monty --verbose -a
```
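
When the tool is driven from a script rather than from a shell, the same invocations can be assembled programmatically. The sketch below uses only switches that appear in this document (`--full-monty`, `--verbose`, `--assume-text`, `--all`); the helper names `build_command` and `input_filename` are hypothetical, and requiring either a base filename or `all_files=True` is an assumption about sensible usage rather than documented behaviour.

```python
import sys


def build_command(base_filename=None, *, all_files=False, full_monty=False,
                  verbose=False, assume_text=False):
    """Build the argument list for one tools_ela.py run."""
    if base_filename is None and not all_files:
        raise ValueError("give a BASE_FILENAME or set all_files=True")
    # Use the interpreter of the current virtual environment.
    cmd = [sys.executable, "tools_ela.py"]
    if full_monty:
        cmd.append("--full-monty")
    if verbose:
        cmd.append("--verbose")
    if assume_text:
        cmd.append("--assume-text")
    if all_files:
        cmd.append("--all")
    if base_filename is not None:
        cmd.append(base_filename)  # no directory, no extension
    return cmd


def input_filename(base_filename, assume_text=False):
    """File name the tool is described as looking for in the source directory."""
    return base_filename + (".txt" if assume_text else ".xml")


print(build_command(all_files=True, full_monty=True, verbose=True))
print(input_filename("Epistola"))
```

The returned list can be passed to `subprocess.run(cmd, check=True)` from the directory that contains *tools_ela.py* and *tools_ela.cfg*.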

**Note:** For optimal results this tool should be used on XML-TEI files processed by *RETAG*.

The following cell invokes *tools_ela.py* from the command line, in the form shown above.

In [4]:
!python tools_ela.py --full-monty --verbose -a

[Sun Mar  1 23:01:48 2020] reading configuration from 'tools_ela.cfg'
[Sun Mar  1 23:01:50 2020] notice: assuming input files are in XML-TEI specific format
[Sun Mar  1 23:01:50 2020] extracting: statistics
[Sun Mar  1 23:01:50 2020] extracting: collocations (window max size: 5)
[Sun Mar  1 23:01:50 2020] extracting: lemma_collocations (window max size: 5)
[Sun Mar  1 23:01:50 2020] extracting: ngrams (ngram max size: 5)
[Sun Mar  1 23:01:50 2020] extracting: concordances
[Sun Mar  1 23:01:50 2020] extracting: XML-TEI header attributes
[Sun Mar  1 23:01:50 2020] extracting: XML-TEI element lists
[Sun Mar  1 23:01:50 2020] starting process for all files
[Sun Mar  1 23:01:50 2020] environment:
source: 'SOURCE/ela_txt'
destination: 'RESULT/tools_ela'
[Sun Mar  1 23:01:50 2020] logging errors to: 'error.log'
[Sun Mar  1 23:01:50 2020] --- BEGIN ---
[Sun Mar  1 23:01:50 2020] --- END ---
[Sun Mar  1 23:01:50 2020] process ended: 0 files processed, 0 skipped (0.000 sec)


The `RESULT/tools_ela` directory now contains the requested information for every file that was processed.
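
Since all results except the concordances are JSON, they can be loaded back into Python with the standard `json` module. The following is a minimal sketch: `load_json_results` is a hypothetical helper, UTF-8 encoding is an assumption, and nothing is assumed about the internal structure of each file, which is not described here.

```python
import json
from pathlib import Path


def load_json_results(result_dir, base_prefix):
    """Load every BASEPREFIX_*.json file in result_dir, keyed by its suffix."""
    results = {}
    for path in sorted(Path(result_dir).glob(f"{base_prefix}_*.json")):
        # "Epistola_ngrams.json" -> key "ngrams"
        key = path.stem[len(base_prefix) + 1:]
        # Assumption: result files are UTF-8 encoded.
        with path.open(encoding="utf-8") as handle:
            results[key] = json.load(handle)
    return results


results = load_json_results("RESULT/tools_ela", "Epistola")
print(sorted(results))  # empty if no results exist for this prefix
```

Note that the glob pattern also matches files whose base prefix merely starts with the given one followed by an underscore, so prefixes should be chosen with that in mind.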