# RETAG: Part TWO

Replace and normalize `geogName` and `persName` tags in the corpus files using lists of entities that were machine-produced and then human-reviewed. All known names and their variants are replaced by normalized tags.

**Note:** This notebook performs the *SECOND PASS: replace tags in files*.

In [1]:
import os
import time
import retag

from jnbkutility import log_progress

SOURCE = os.path.join("SOURCE", "ela_corpus")
DEST = os.path.join("RESULT", "retag", "production")

# The following should be set to the actual corrected file, in CSV
# format: the format should be the same as the one produced in the
# first pass
# RT_DATA = os.path.join("RESULT", "retag", "retag_input.csv")
RT_DATA = "DATA/entities_20200217-01.csv"

# SOURCE is expected to contain directories: every immediate subdirectory
# of SOURCE is processed, so that the same structure can be recreated in DEST
dirs = []
with os.scandir(SOURCE) as o:
    for d in o:
        if d.is_dir():
            dirs.append(d.name)

Load persistent data if needed (this action is normally commented out), then read the reviewed data file. The `merge` parameter is set to `False` because we want to completely refresh the *retag* data.

In [2]:
# if os.path.exists("retag.data"):
#     retag.JUP_loadData()

start = time.time()
retag.JUP_readCSV(RT_DATA, merge=False)
retag.JUP_dumpData()
print("elapsed: %s sec" % (time.time() - start))

elapsed: 0.04660534858703613 sec
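
Because `merge=False` discards any previously loaded data, it can be worth checking the reviewed file for obvious damage before reading it. The following is only a minimal sketch using the standard `csv` module: `check_csv_shape` is a hypothetical helper, not part of `retag`, and it assumes nothing about the columns except that every row should have as many fields as the first one.

```python
import csv


def check_csv_shape(path, encoding="utf-8"):
    """Return (row_count, bad_lines) for the CSV file at `path`.

    `bad_lines` lists the source line numbers at which a row ends
    with a field count different from that of the first non-empty row.
    Hypothetical helper: it does not know the actual retag columns.
    """
    bad_lines = []
    expected = None
    row_count = 0
    with open(path, newline="", encoding=encoding) as f:
        reader = csv.reader(f)
        for row in reader:
            if not row:
                continue  # ignore completely blank lines
            row_count += 1
            if expected is None:
                expected = len(row)  # first row sets the reference width
            elif len(row) != expected:
                bad_lines.append(reader.line_num)
    return row_count, bad_lines
```

Calling `check_csv_shape(RT_DATA)` before `retag.JUP_readCSV` would then point at lines that were probably broken during manual review, for instance by an unquoted comma.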


## Final Step: Replace *all* Entity Tags

This step *replaces all* the entity tags with the ones provided in the reviewed data file, regardless of whether they are correct and complete. It is usually the final step, although it may be performed several times with refined versions of the data file. The original directory structure is preserved in the result.

An optional list of tags to replace can be provided. The available tags are `geogName`, `persName`, and `placeName`; if no list is specified, all three tags are replaced.

**Note:** This step is always applied to the original source files.

In [3]:
start = time.time()
for d in log_progress(dirs, name="Directories"):
    orig_dir = os.path.join(SOURCE, d)
    dest_dir = os.path.join(DEST, d)
    # create the destination directory if it does not exist yet
    os.makedirs(dest_dir, exist_ok=True)
    # retag.JUP_replaceEntitiesXMLDir(orig_dir, dest_dir, ['persName'])
    retag.JUP_replaceEntitiesXMLDir(orig_dir, dest_dir)
print("elapsed: %s sec" % (time.time() - start))

elapsed: 208.30879664421082 sec
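
Since the original directory structure is expected to be preserved in the result, a quick comparison of the two trees can confirm that no file was skipped. This is a sketch under one assumption that the notebook does not state: output files keep the same names as their sources. `relative_files` and `missing_in_dest` are hypothetical helpers, not part of `retag`.

```python
import os


def relative_files(root):
    """Return the set of file paths found under `root`, relative to it."""
    found = set()
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            full_path = os.path.join(dirpath, name)
            found.add(os.path.relpath(full_path, root))
    return found


def missing_in_dest(source, dest):
    """List files present under `source` with no counterpart under `dest`.

    Assumption: result files carry the same relative path as the originals.
    """
    return sorted(relative_files(source) - relative_files(dest))
```

With the variables defined above, `missing_in_dest(SOURCE, DEST)` returns an empty list when every source file has a counterpart in the result.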
