# RETAG: Part ONE

Replace and normalize `geogName` and `persName` tags in the corpus files using lists of names that are produced automatically and then reviewed by a human. Every known name and each of its variants is replaced by a normalized tag.

**Note:** This notebook performs the *FIRST PASS: create list of names and variants*.

In [1]:
import os
import time
import retag

from jnbkutility import log_progress

SOURCE = os.path.join("SOURCE", "ela_corpus")
DEST = os.path.join("RESULT", "retag", "intermediate")
RT_DATA = os.path.join("RESULT", "retag", "retag_output.csv")

# this is fixed and should not be changed
RT_PERSISTDATA = "retag.data"
if os.path.exists(RT_PERSISTDATA):
    retag.JUP_loadData()

# SOURCE is expected to contain directories; every immediate subdirectory
# of SOURCE is collected so that the same structure can be recreated
with os.scandir(SOURCE) as entries:
    dirs = [entry.name for entry in entries if entry.is_dir()]

Run the next cell only when one of the following is needed:

1. a *CSV* file to be examined and corrected by a reviewer, so that the second pass (in the *nbk_XML_retag2pass* notebook) can be performed;
2. a persistence database (in an internal, non-human-readable format), which is useful for large corpora.

For a sufficiently small corpus, only the first action is needed. Uncomment the desired action and run the cell.

In [2]:
retag.JUP_writeCSV(RT_DATA)
# retag.JUP_dumpData()
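Before handing the exported file to a reviewer, it can help to check that it was written and looks plausible. The following is a minimal sketch, not part of `retag`: `preview_csv` is a hypothetical helper, and it assumes the file is comma-delimited and UTF-8 encoded, which this notebook does not confirm. It makes no assumption about column names.

```python
import csv
import os
from itertools import islice


def preview_csv(path, limit=5):
    """Return the first `limit` rows of a CSV file as lists of strings.

    Assumption: comma-delimited, UTF-8 encoded (not confirmed for retag output).
    """
    with open(path, newline="", encoding="utf-8") as handle:
        return list(islice(csv.reader(handle), limit))


# Same path as RT_DATA in the first cell; skipped if the file is not there yet.
RT_DATA = os.path.join("RESULT", "retag", "retag_output.csv")
if os.path.exists(RT_DATA):
    for row in preview_csv(RT_DATA):
        print(row)
```

If the rows come back as a single long field, the delimiter assumption is probably wrong and the `csv.reader` call would need a `delimiter` argument.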

## First Preliminary Correction: Add Entities when Untagged

This intermediate step helps the reviewer by adding recognized entities to partially tagged texts: entities are added only where the corresponding elements are missing, or are incorrectly or incompletely tagged.

**Note:** This step is optional when the primary goal is to produce the `RT_DATA` file.

In [3]:
start = time.time()
for d in log_progress(dirs, name="Directories"):
    orig_dir = os.path.join(SOURCE, d)
    dest_dir = os.path.join(DEST, d)
    # create the destination directory if it does not exist yet
    os.makedirs(dest_dir, exist_ok=True)
    retag.JUP_addEntitiesXMLDir(orig_dir, dest_dir)
print("elapsed: %s sec" % (time.time() - start))

elapsed: 56.162230014801025 sec
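
The idea behind this step, adding tags only where a known name is not already tagged, can be illustrated with a small sketch. This is not how `retag.JUP_addEntitiesXMLDir` is implemented (its internals are not shown in this notebook): `add_missing_tags` is a hypothetical function that works on plain strings with regular expressions rather than on a parsed XML tree, and it ignores cases such as names inside attribute values or inside other elements.

```python
import re


def add_missing_tags(text, names, tag="persName"):
    """Wrap untagged occurrences of known names in <tag>...</tag>.

    Illustrative only: string-based, not a real XML transformation.
    """
    if not names:
        return text
    # Longest names first, so that "Marco Polo" is preferred over "Marco".
    ordered = sorted(names, key=len, reverse=True)
    name_re = re.compile(r"\b(?:%s)\b" % "|".join(re.escape(n) for n in ordered))
    # The capture group makes re.split keep already-tagged spans,
    # which end up at the odd indexes of the result.
    tagged_re = re.compile(r"(<%s\b[^>]*>.*?</%s>)" % (tag, tag), re.DOTALL)
    parts = tagged_re.split(text)
    # Only the even indexes (text outside existing tags) are modified.
    for i in range(0, len(parts), 2):
        parts[i] = name_re.sub(
            lambda m: "<%s>%s</%s>" % (tag, m.group(0), tag), parts[i]
        )
    return "".join(parts)


sample = "<p><persName>Marco Polo</persName> met Kublai Khan; Marco Polo left.</p>"
print(add_missing_tags(sample, ["Marco Polo", "Kublai Khan"]))
```

Because existing `persName` spans are left untouched, running the function a second time on its own output should change nothing, which is the property that makes a step like this safe to repeat on partially tagged files.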
