# Names Data Harmonization

This script finds xml:id's in Primary Source Coop documents and attempts to match them against the Names Authorization spreadsheet. It produces a spreadsheet listing the unmatched xml:id's. Editors can then find the corresponding entity in the Names Authorization and write the desired xml:id in a designated column (`correct_id`) of the generated report.

Once editors have confirmed the new xml:id, a second script will find and replace the old, unmatched xml:id's.

#### Assumptions
1. In this prototype, xml:id's are compared only to Taney's Names Authority.
    * Some xml:id's might not exist in Taney's Names Authority because the xml:id was pulled from a larger spreadsheet (e.g., JQA). Future versions of this script will need to compare id's to the larger list or to multiple lists.
2. This script assumes that the Names Authority has the correct unique identifier.
    * At the moment, it works best when the Names Authority is corrected first. The process can also be iterative, but that might require retracing steps.
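
The first assumption mentions comparing id's against more than one list. A minimal sketch of one way to do that, assuming each authority list has already been reduced to a plain sequence of identifier strings. The function name `combine_authority_ids` and the sample identifiers are hypothetical and not part of the script:

```python
def combine_authority_ids(*id_lists):
    """Merge identifiers from several authority lists into one lowercase set."""
    combined = set()
    for id_list in id_lists:
        for value in id_list:
            # Skip blanks and non-string cells (e.g., empty spreadsheet cells).
            if isinstance(value, str) and value.strip():
                combined.add(value.strip().lower())
    return combined


# Hypothetical identifiers standing in for two projects' authority lists.
taney_ids = ["taney-roger", "Key-Francis"]
jqa_ids = ["adams-john", "key-francis", None]

all_ids = combine_authority_ids(taney_ids, jqa_ids)
print(sorted(all_ids))
```

Using a set keeps each membership check fast and removes duplicates that appear in more than one list.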

In [1]:
import re, warnings, csv, sys, os, glob
import pandas as pd
import numpy as np
import xml.etree.ElementTree as ET
from lxml import etree

warnings.simplefilter('ignore')

## Find Unmatched xml:id's

### Variables for Directories + Files

In [4]:
%%time

# Collect XML files (this run points at the JQA 1821 diaries).
xml_directory = glob.glob("/Users/quinn.wi/Documents/Data/JQA/1821/*.xml")

names_auth = pd.read_excel("/Users/quinn.wi/Documents/Data/JQA/DJQA_Names-List_singleSheet.xlsx",
                           sheet_name = 0)[['Last Name', 'First Name', 'Hyogebated-unique-string-of-characters']]

print(len(names_auth['Hyogebated-unique-string-of-characters'].unique()), 'Number of unique identifiers.')

names_auth.head(5)

16035 Number of unique identifiers.
CPU times: user 1min 4s, sys: 354 ms, total: 1min 5s
Wall time: 1min 5s


Unnamed: 0,Last Name,First Name,Hyogebated-unique-string-of-characters
0,??,??,u
1,??,??,u
2,??,??,u
3,??,??,u
4,??,??,unknown


### Generate Report of Unmatched Entities

In [17]:
%%time

# Read in file and get root of XML tree.
def get_root(xml_file):
    tree = ET.parse(xml_file)
    root = tree.getroot()
    return root


# Get namespace of individual file from root element.
def get_namespace(root):
    namespace = re.match(r"{(.*)}", str(root.tag))
    ns = {"ns":namespace.group(1)}
    return ns


# Collect the unique-id's in the names authority as a set for fast lookup.
# NOTE: these are not lowercased here, although the ref values below are.
names_auth_ids = set(names_auth['Hyogebated-unique-string-of-characters'])

persData = []

for file in xml_directory:
    reFile = os.path.basename(file)  # File name without its directory.
    root = get_root(file)
    ns = get_namespace(root)
    
    for persRef in root.findall('.//ns:p/ns:persRef[@ref]', ns):
        ref_id = persRef.get('ref').lower() # Lowercase xml:id's.
        
        if ref_id not in names_auth_ids:
            persData.append({'file':reFile, 'ref_id':ref_id})
            
unmatched_persRef_df = pd.DataFrame(persData)

# Split refs that hold more than one id (at most one split, so two id's per ref).
unmatched_persRef_df['ref_id'] = unmatched_persRef_df['ref_id'].str.split(';', n=1).tolist()

unmatched_persRef_df = unmatched_persRef_df.explode('ref_id')

# Add empty column for user-input.
unmatched_persRef_df['correct_id'] = ''

print (unmatched_persRef_df.shape)
unmatched_persRef_df.head()

(145, 3)
CPU times: user 675 ms, sys: 6.64 ms, total: 681 ms
Wall time: 689 ms


Unnamed: 0,file,ref_id,correct_id
0,JQADiaries-v32-1821-09-p082.xml,morton-perez,
0,JQADiaries-v32-1821-09-p082.xml,morton-sarah,
1,JQADiaries-v32-1821-09-p082.xml,gray-william,
1,JQADiaries-v32-1821-09-p082.xml,gray-elizabeth,
2,JQADiaries-v32-1821-09-p082.xml,adams-charles2,
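
The cell above compares each `ref` value as a whole and only afterwards splits it on `;`. As a result, a `ref` holding two id's is reported in full even if one of them is already in the names authority. A minimal sketch of splitting first and comparing each id on its own follows. The helper `unmatched_ids` and the sample authority set are hypothetical, and whether this matches the intended behavior is an assumption:

```python
def unmatched_ids(ref_value, authority_ids, sep=";"):
    """Split a multi-valued ref and return only the ids missing from the authority."""
    ids = [part.strip().lower() for part in ref_value.split(sep)]
    return [ref_id for ref_id in ids if ref_id and ref_id not in authority_ids]


# Hypothetical authority identifiers, for illustration only.
authority = {"morton-perez", "gray-william"}

both = unmatched_ids("morton-perez;morton-sarah", authority)
single = unmatched_ids("gray-william", authority)
print(both, single)
```

With this approach the report would list only the id's that actually need an editor's attention.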


#### Write CSV for names correction

In [18]:
unmatched_persRef_df.to_csv('/Users/quinn.wi/Documents/Data/JQA/djqa_names_authorization.csv',
                            sep = ',', header = True)
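
Before any rewriting, it may help to check what editors entered in `correct_id`. The sketch below sorts report rows into usable, blank, and invalid corrections using only the standard library. The sample rows, the authority set, and the function name `check_corrections` are hypothetical. The column names follow the report generated above:

```python
import csv
import io

# Hypothetical report contents after an editor has filled in `correct_id`.
REPORT_CSV = """,file,ref_id,correct_id
0,JQADiaries-v32-1821-09-p082.xml,morton-perez,morton-perez2
0,JQADiaries-v32-1821-09-p082.xml,morton-sarah,
1,JQADiaries-v32-1821-09-p082.xml,gray-william,gray-wiliam3
"""

# Hypothetical authority identifiers.
AUTHORITY_IDS = {"morton-perez2", "gray-william3"}


def check_corrections(csv_text, authority_ids):
    """Sort report rows into usable, blank, and invalid corrections."""
    usable, blank, invalid = {}, [], []
    for row in csv.DictReader(io.StringIO(csv_text)):
        old_id = row["ref_id"].strip().lower()
        new_id = row["correct_id"].strip()
        if not new_id:
            blank.append(old_id)              # Editor has not filled this in yet.
        elif new_id in authority_ids:
            usable[old_id] = new_id           # Safe to use for find and replace.
        else:
            invalid.append((old_id, new_id))  # Correction is not in the authority.
    return usable, blank, invalid


usable, blank, invalid = check_corrections(REPORT_CSV, AUTHORITY_IDS)
print(usable, blank, invalid)
```

Only the `usable` mapping would then feed the rewrite step; blank and invalid rows go back to the editors.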

## Re-write xml:id's

### Variables for Directories + Files

In [None]:
%%time

# Collect Taney files
xml_directory = glob.glob("/Users/quinn.wi/Documents/SemanticData/Data/Taney/*/*.xml")

user_corrections = pd.read_csv("...")

### Write new xml:id's into XML docs

In [None]:
%%time

# Read in file and get root of XML tree.
def get_root(xml_file):
    tree = ET.parse(xml_file)
    root = tree.getroot()
    return root


# Get namespace of individual file from root element.
def get_namespace(root):
    namespace = re.match(r"{(.*)}", str(root.tag))
    ns = {"ns":namespace.group(1)}
    return ns


# Get list of old (unmatched) xml:id's from the user corrections & lowercase them.
# NOTE: the report generated above names this column 'ref_id', not 'xml_id'.
old_ids = [x.lower() for x in user_corrections['xml_id'].values.tolist()]

persData = []

for file in xml_directory:
    reFile = os.path.basename(file)  # File name without its directory.
    root = get_root(file)
    ns = get_namespace(root)
    
    for persRef in root.findall('.//ns:p/ns:persRef[@ref]', ns):
        xml_id = persRef.get('ref').lower() # Lowercase xml:id's.
        
        # TODO: check whether xml_id is in old_ids.
        # TODO: if so, replace (overwrite) xml_id with the corrected id.
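
The replacement step is still unwritten in the cell above. One possible shape for it is sketched below, assuming the corrections are available as a dictionary mapping each old id to its corrected id and that a `ref` may hold several id's separated by `;`. The namespace URI, the sample XML, the mapping, and the function name `rewrite_refs` are assumptions for illustration:

```python
import xml.etree.ElementTree as ET

# Assumed namespace, for illustration only.
NS_URI = "http://www.tei-c.org/ns/1.0"
ET.register_namespace("", NS_URI)  # Keep the default namespace when serializing.

SAMPLE_XML = f"""<TEI xmlns="{NS_URI}"><text><body>
<p><persRef ref="morton-perez;gray-william">the Mortons</persRef></p>
</body></text></TEI>"""

# Hypothetical mapping of old id -> corrected id.
CORRECTIONS = {"morton-perez": "morton-perez2"}


def rewrite_refs(root, corrections, sep=";"):
    """Replace old ids inside persRef/@ref; return the number of ids changed."""
    changed = 0
    ns = {"ns": NS_URI}
    for pers_ref in root.findall(".//ns:p/ns:persRef[@ref]", ns):
        ids = [part.strip() for part in pers_ref.get("ref").split(sep)]
        # Look up each id in lowercase; keep it unchanged if there is no correction.
        new_ids = [corrections.get(ref_id.lower(), ref_id) for ref_id in ids]
        changed += sum(old != new for old, new in zip(ids, new_ids))
        pers_ref.set("ref", sep.join(new_ids))
    return changed


root = ET.fromstring(SAMPLE_XML)
n_changed = rewrite_refs(root, CORRECTIONS)
new_ref = root.find(f".//{{{NS_URI}}}persRef").get("ref")
print(n_changed, new_ref)
```

To save the result, `ET.ElementTree(root).write(path, encoding="utf-8", xml_declaration=True)` would write the modified tree back to disk. Note that ElementTree's default parser discards XML comments, so writing to a copy and comparing it with the original first is the safer route.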
        