# nerHelper - Function by Function Testing

Below is each function separated into an individual cell for testing. Scrolling down should mimic the steps the nerHelper takes. Some of the user-interaction functions might not appear here as they would need to be de-bugged while app is running.

In [1]:
import warnings, re, glob, datetime, csv, sys, os, base64, io, spacy, datetime
import pandas as pd
import numpy as np

import xml.etree.ElementTree as ET ## not called in app.

from lxml import etree, isoschematron
from lxml.html import fragment_fromstring

import dash, dash_table
import dash_core_components as dcc
from dash.dependencies import Input, Output, State
from dash.exceptions import PreventUpdate
import dash_html_components as html
from jupyter_dash import JupyterDash

# Import spaCy language model.
nlp = spacy.load('en_core_web_sm')

# Ignore simple warnings.
warnings.simplefilter('ignore', DeprecationWarning)

# Declare directory location to shorten filepaths later.
abs_dir = "/Users/quinn.wi/Documents/"

## Function Order

1. [parse_contents()](#parse_contents)
    * get_namespace
    * get_abridged_xpath
    * get_text
    * get_encoding
    * get_spacy_entities
    * get_contents
    * make_dataframe
    * make_ner_suggestions
    
parse_contents() and the functions it calls reads in an existing XML file, converts it to plain text, finds named entities, and creates a dataframe of results with their metadata (filename, parent node, etc.).

make_ner_suggestions(), also part of parse_contents(), does the important job of transforming plaintext and the found entity back into XML using regular expressions. The old encoding can then be "found & replaced" with the new encoding (again, using regular expressions).
    
2. [highlighter()](#highlighter)
    * highlighter
    

3. [revisions](#revisions)
    * inherit_changes
        * up_convert_encoding
        * xml_cleanup
        * revise_without_uniq_id

4. [update header](#update_header)
    * append_change_to_revisionDesc
    * append_app_to_appInfo

5. [writing](#writing)
    * revise_xml

In [71]:
%%time

# Preset/User-defined variables: static here.
label_dict = {'PERSON':'persName',
                  'LOC':'placeName', # Non-GPE locations, mountain ranges, bodies of water.
                  'GPE':'placeName', # Countries, cities, states.
                  'FAC':'placeName', # Buildings, airports, highways, bridges, etc.
                  'ORG':'orgName', # Companies, agencies, institutions, etc.
                  'NORP':'name', # Nationalities or religious or political groups.
                  'EVENT':'name', # Named hurricanes, battles, wars, sports events, etc.
                  'WORK_OF_ART':'name', # Titles of books, songs, etc.
                  'LAW':'name', # Named documents made into laws.
                  'DATE':'date' # Absolute or relative dates or periods.
                 }

ner_values = ['LOC', 'GPE']
subset_ner = {k: label_dict[k] for k in ner_values}
filename = "JQADiaries-v28-1809-12-p047.xml"
banned_list = ['persRef', 'date']


xml_string = open(abs_dir + "Data/PSC/JQA/1809/JQADiaries-v28-1809-12-p047.xml", "r+").read().strip()

# Base lxml function: parse xml from a string.
root = etree.fromstring(xml_string.encode())

CPU times: user 3.04 ms, sys: 2.04 ms, total: 5.07 ms
Wall time: 17.7 ms


<a id="get_namespace">parse_contents</a>

## parse_contents

The test version here differs slightly than the app. This test reads files in with a local filepath rather than an upload.

In [3]:
%%time

# Create empty dataframe to store results.
df = pd.DataFrame(columns = ['file', 'abridged_xpath', 'previous_encoding', 'entities'])

"""
XML Parsing Function: Get Namespaces
"""
def get_namespace(root):
    namespace = re.match(r"{(.*)}", str(root.tag))
    ns = {"ns":namespace.group(1)}
    return ns

ns = get_namespace(root)
print ('get_namespace() is working.')


"""
XML Parsing Function: Retrieve XPaths
"""
def get_abridged_xpath(child):
    if child.getparent().get('{http://www.w3.org/XML/1998/namespace}id') is not None:    
        ancestor = child.getparent().tag
        xml_id = child.getparent().get('{http://www.w3.org/XML/1998/namespace}id')

        abridged_xpath = f'.//ns:body//{ancestor}[@xml:id="{xml_id}"]/{child.tag}'
        return abridged_xpath


"""
XML Parsing Function: Convert to String
"""
def get_text(elem):
    text_list = []
    text = ''.join(etree.tostring(elem, encoding='unicode', method='text', with_tail=False))
    text_list.append(re.sub(r'\s+', ' ', text))
    return ' '.join(text_list)

        
"""
XML Parsing Function: Get Encoded Content
"""    
def get_encoding(elem):
#     encoding = etree.tostring(elem, with_tail = False, pretty_print = True).decode('utf-8')
    encoding = ET.tostring(descendant, method = 'xml').decode('utf-8')
    
#     encoding = etree.fromstring(encoding)
#     encoding = etree.tostring(encoding).decode('utf-8')
    
    encoding = re.sub('\s+', ' ', encoding) # remove additional whitespace
    return encoding

  

"""
NER Function
"""
# spaCy
def get_spacy_entities(text, subset_ner):
    sp_entities_l = []
    doc = nlp(text)
    for ent in doc.ents:
        if ent.label_ in subset_ner.keys():
            sp_entities_l.append((str(ent), ent.label_))
        else:
            pass
    return sp_entities_l




"""
XML & NER: Retrieve Contents
"""
def get_contents(ancestor, xpath_as_string, namespace, subset_ner):
    
    textContent = get_text(ancestor) # Get plain text.
    encodedContent = get_encoding(ancestor) # Get encoded content.
    sp_entities_l = get_spacy_entities(textContent, subset_ner) # Get named entities from plain text.
    
    return (sp_entities_l, encodedContent)

    
"""
XML: & NER: Create Dataframe of Entities
"""
def make_dataframe(child, df, ns, subset_ner, filename, descendant_order):
    abridged_xpath = get_abridged_xpath(child)
    entities, previous_encoding = get_contents(child, './/ns:.', ns, subset_ner)

    df = df.append({
        'file':re.sub('.*/(.*.xml)', '\\1', filename),
        'descendant_order': descendant_order,
#         'abridged_xpath':abridged_xpath,
        'previous_encoding': previous_encoding,
        'entities':entities,
    },
        ignore_index = True)
    
    return df

    
desc_order = 0
for child in root.findall('.//ns:body//ns:div[@type="docbody"]', ns): # part of parse_contents()
    
    abridged_xpath = get_abridged_xpath(child)
                
    for descendant in child:
#         print (descendant.tag)
#         print (etree.tostring(descendant, with_tail = False))
#         print (ET.tostring(descendant, method = 'xml'))
    
        desc_order = desc_order + 1
        df = make_dataframe(descendant, df, ns, subset_ner, filename, desc_order)
        df['abridged_xpath'] = abridged_xpath
        
print ('abridged_xpath(), all functions in get_contents(), and make_dataframe() are working.')
df.head(5)

get_namespace() is working.
abridged_xpath(), all functions in get_contents(), and make_dataframe() are working.
CPU times: user 2.04 s, sys: 72.2 ms, total: 2.11 s
Wall time: 2.11 s


Unnamed: 0,file,abridged_xpath,previous_encoding,entities,descendant_order
0,JQADiaries-v28-1809-12-p047.xml,.//ns:body//{http://www.tei-c.org/ns/1.0}div[@...,"<ns0:pb xmlns:ns0=""http://www.tei-c.org/ns/1.0...",[],1.0
1,JQADiaries-v28-1809-12-p047.xml,.//ns:body//{http://www.tei-c.org/ns/1.0}div[@...,"<ns0:dateline xmlns:ns0=""http://www.tei-c.org/...",[],2.0
2,JQADiaries-v28-1809-12-p047.xml,.//ns:body//{http://www.tei-c.org/ns/1.0}div[@...,"<ns0:p xmlns:ns0=""http://www.tei-c.org/ns/1.0""...",[],3.0
3,JQADiaries-v28-1809-12-p047.xml,.//ns:body//{http://www.tei-c.org/ns/1.0}div[@...,"<ns0:p xmlns:ns0=""http://www.tei-c.org/ns/1.0""...","[(Holland, GPE)]",4.0
4,JQADiaries-v28-1809-12-p047.xml,.//ns:body//{http://www.tei-c.org/ns/1.0}div[@...,"<ns0:p xmlns:ns0=""http://www.tei-c.org/ns/1.0""...","[(Russia, GPE), (England, GPE), (Counsels, GPE...",5.0


## Make NER Suggestions 

Part of parse_contents()

In [4]:
%%time


"""
XML Parsing Function: Write New Encoding with Up-Conversion
"""
def make_ner_suggestions(previous_encoding, entity, label, subset_ner, kwic_range, banned_list):
#     Regularize spacing & store data as new variable ('converted_encoding').
    converted_encoding = re.sub('\s+', ' ', previous_encoding, re.MULTILINE)
    
#     Create regex that replaces spaces with underscores if spaces occur within tags.
#     This regex treats tags as a single token later.
    tag_regex = re.compile('<(.*?)>')

#     Accumulate underscores through iteration
    for match in re.findall(tag_regex, previous_encoding):
        replace_space = re.sub('\s', '_', match)
        converted_encoding = re.sub(match, replace_space, converted_encoding)
    
#     Up-convert entity (label remains unchanged).
    label = subset_ner[label]    
    converted_entity = ' '.join(['<w>' + e + '</w>' for e in entity.split(' ')])
    
#     Up-Converstion
#     Tokenize encoding and text, appending <w> tags, and re-join.
    converted_encoding = converted_encoding.split(' ')
    for idx, item in enumerate(converted_encoding):
        item = '<w>' + item + '</w>'
        converted_encoding[idx] = item
        
    converted_encoding = ' '.join(converted_encoding)
    
#     Find converted entities and kwic-converted entities, even if there's additional encoding within entity.
    try:
        entity_regex = re.sub('<w>(.*)</w>', '(\\1)(.*?</w>)', converted_entity)
        entity_match = re.search(entity_regex, converted_encoding)
        
        ban_decision = []
        for i in banned_list:
            if i in entity_match.group(0):
                ban_decision.append('y')
                
        if 'y' in ban_decision:
            return "Already Encoded"
        
#         If expanded regex is in previous encoding, find & replace it with new encoding.
        elif entity_match:
            new_encoding = re.sub(f'{entity_match.group(0)}',
                                  f'<{label}>{entity_match.group(1)}</{label}>{entity_match.group(2)}',
                                  converted_encoding)
            
#             Remove <w> tags to return to well-formed xml.
            new_encoding = re.sub('<[/]?w>', '', new_encoding)
#             Remove underscores.
            new_encoding = re.sub('_', ' ', new_encoding)

            return new_encoding

        else:
            return 'Error Making NER Suggestions'
    
#     Up-conversion works well because it 'breaks' if an entity already has been encoded:
#     <w>Abel</w> (found entity) does not match <w><persRef_ref="abel-mary">Mrs</w> <w>Abel</persRef></w>
#     <persRef> breaks function and avoids duplicating entities.
    
    except:
        return 'Error Occurred with Regex.'
        

                    
#             Join data
df = df \
    .explode('entities') \
    .dropna()

df[['entity', 'label']] = pd.DataFrame(df['entities'].tolist(), index = df.index)

df['new_encoding'] = df \
    .apply(lambda row: make_ner_suggestions(row['previous_encoding'],
                                            row['entity'],
                                            row['label'],
                                            subset_ner, 4, banned_list),
           axis = 1)


# Add additional columns for user input.
df['uniq_id'] = ''

#             Drop rows if 'new_encoding' value equals 'Already Encoded'.
df = df[df['new_encoding'] != 'Already Encoded']

df.head(5)

CPU times: user 441 ms, sys: 3.99 ms, total: 445 ms
Wall time: 443 ms


Unnamed: 0,file,abridged_xpath,previous_encoding,entities,descendant_order,entity,label,new_encoding,uniq_id
3,JQADiaries-v28-1809-12-p047.xml,.//ns:body//{http://www.tei-c.org/ns/1.0}div[@...,"<ns0:p xmlns:ns0=""http://www.tei-c.org/ns/1.0""...","(Holland, GPE)",4.0,Holland,GPE,"<ns0:p xmlns:ns0=""http://www.tei-c.org/ns/1.0""...",
4,JQADiaries-v28-1809-12-p047.xml,.//ns:body//{http://www.tei-c.org/ns/1.0}div[@...,"<ns0:p xmlns:ns0=""http://www.tei-c.org/ns/1.0""...","(Russia, GPE)",5.0,Russia,GPE,"<ns0:p xmlns:ns0=""http://www.tei-c.org/ns/1.0""...",
4,JQADiaries-v28-1809-12-p047.xml,.//ns:body//{http://www.tei-c.org/ns/1.0}div[@...,"<ns0:p xmlns:ns0=""http://www.tei-c.org/ns/1.0""...","(England, GPE)",5.0,England,GPE,"<ns0:p xmlns:ns0=""http://www.tei-c.org/ns/1.0""...",
4,JQADiaries-v28-1809-12-p047.xml,.//ns:body//{http://www.tei-c.org/ns/1.0}div[@...,"<ns0:p xmlns:ns0=""http://www.tei-c.org/ns/1.0""...","(Counsels, GPE)",5.0,Counsels,GPE,"<ns0:p xmlns:ns0=""http://www.tei-c.org/ns/1.0""...",
4,JQADiaries-v28-1809-12-p047.xml,.//ns:body//{http://www.tei-c.org/ns/1.0}div[@...,"<ns0:p xmlns:ns0=""http://www.tei-c.org/ns/1.0""...","(Britain, GPE)",5.0,Britain,GPE,"<ns0:p xmlns:ns0=""http://www.tei-c.org/ns/1.0""...",


<a id="highlighter">highlighter</a>

## Highlighter

In [5]:
%%time

"""
Reading Pane: Highlight Found Entity
"""
def highlighter(previous_encoding, entity):
    highlighted_text = etree.fromstring(previous_encoding)
    highlighted_text = etree.tostring(highlighted_text, method = 'text', encoding = 'UTF-8').decode('UTF-8')
    
    entity_match = re.search(f'(.*)({entity})(.*)', highlighted_text)
    
    highlighted_text = html.P([entity_match.group(1), html.Mark(entity_match.group(2)), entity_match.group(3)])
    
    return highlighted_text

# Mimicking row selection from app., which handles one row at a time.
reading_df = df.head(1)

highlighted_text = highlighter(reading_df['previous_encoding'].squeeze(),
                               reading_df['entity'].squeeze())

highlighted_text

CPU times: user 1.28 ms, sys: 280 µs, total: 1.56 ms
Wall time: 1.31 ms


P(['2. I sent a note this morning to Count Romanzoff on the subject of the vessels belonging to Mr: Thorndike of Beverley, and to the Weymouth Commercial Company— After employing as much of the morning as I had at my disposal in writing I took an exploring walk on the Canal— Dined with Mr: Six, the Minister from ', Mark('Holland'), '— Evening at home.'])

<a id="revisions">revisions</a>

## Revisions

In [6]:
%%time

"""
XML & Regex: Up Conversion

Function replaces all spaces between beginning and end tags with underscores.
Then, function wraps each token (determined by whitespace) with word tags (<w>...</w>)
"""
def up_convert_encoding(column):
#     Regularize spacing & store data as new variable ('converted_encoding').
    converted_encoding = re.sub('\s+', ' ', column, re.MULTILINE)
    
#     Create regex that replaces spaces with underscores if spaces occur within tags.
#     This regex treats tags as a single token later.
    tag_regex = re.compile('<(.*?)>')

#     Accumulate underscores through iteration
    for match in re.findall(tag_regex, column):
        replace_space = re.sub('\s', '_', match)
        converted_encoding = re.sub(match, replace_space, converted_encoding)
    
#     Up-Converstion
#     Tokenize encoding and text, appending <w> tags, and re-join.
    converted_encoding = converted_encoding.split(' ')
    for idx, item in enumerate(converted_encoding):
        item = '<w>' + item + '</w>'
        converted_encoding[idx] = item
    converted_encoding = ' '.join(converted_encoding)
    
    return converted_encoding


"""
XML: Remove word tags and clean up
"""
def xml_cleanup(encoding):
#     Clean up any additional whitespace and remove word tags.
    encoding = re.sub('\s+', ' ', encoding, re.MULTILINE)
    encoding = re.sub('<[/]?w>', '', encoding)

    encoding = re.sub('_', ' ', encoding) # Remove any remaining underscores in tags.
    encoding = re.sub('“', '"', encoding) # Change quotation marks to correct unicode.
    encoding = re.sub('”', '"', encoding)
    
    return encoding


"""
XML Parsing Function: Suggest New Encoding with Hand Edits

Similar to make_ner_suggestions(), this function folds in revision using regular expressions.
The outcome is the previous encoding with additional encoded information determined by user input.

Expected Columns:
    previous_encoding
    entities
    uniq_id
"""
def revise_without_uniq_id(label_dict, uniq_id, 
                           label, entity, previous_encoding, new_encoding):
    
    label = label_dict[label]
    
#     Up convert PREVIOUS ENCODING: assumes encoder will supply new encoding and attribute with value.
    converted_encoding = up_convert_encoding(previous_encoding)
    converted_entity = ' '.join(['<w>' + e + '</w>' for e in entity.split(' ')])

#     If there is a unique id to add & hand edits...
    if uniq_id == '':

        entity_regex = re.sub('<w>(.*)</w>', '(\\1)(.*?</w>)', converted_entity)
        entity_match = re.search(entity_regex, converted_encoding)

        revised_encoding = re.sub(f'{entity_match.group(0)}',
                                  f'<{label} type="nerHelper-added">{entity_match.group(1)}</{label}>{entity_match.group(2)}',
                                  converted_encoding)
        
        revised_encoding = xml_cleanup(revised_encoding)
        
        return revised_encoding

    else:
        pass
    
    

"""
XML & NER: Update/Inherit Accepted Changes
Expects a dataframe (from a .csv) with these columns:
    file
    abridged_xpath
    descendant_order
    previous_encoding
    entities
    new_encoding
    uniq_id
"""
def inherit_changes(label_dict, dataframe):
    
    dataframe = dataframe.fillna('')
    for index, row in dataframe.iterrows():
        
#         If HAND changes are accepted...
        if row['uniq_id'] != '':
        
            revised_by_hand = revise_with_uniq_id(label_dict, row['uniq_id'],
                                                  row['label'], row['entity'], 
                                                  row['previous_encoding'], row['new_encoding'])

            dataframe.loc[index, 'new_encoding'] = revised_by_hand
            
            try:
                if dataframe.loc[index + 1, 'abridged_xpath'] == row['abridged_xpath'] \
                and dataframe.loc[index + 1, 'descendant_order'] == row['descendant_order']:
                    dataframe.loc[index + 1, 'previous_encoding'] = row['new_encoding']
                    
                else:
                    dataframe.loc[index, 'new_encoding'] = revised_by_hand
                    
                    
            except KeyError as e:
                dataframe.loc[index, 'new_encoding'] = revised_by_hand
        
#         If NER suggestions are accepted as-is...
        elif row['label'] != '' and row['uniq_id'] == '':
        
            revised_no_uniq_id = revise_without_uniq_id(label_dict, row['uniq_id'],
                                                        row['label'], row['entity'],
                                                        row['previous_encoding'], row['new_encoding'])

            dataframe.loc[index, 'new_encoding'] = revised_no_uniq_id
        
            try:
                if dataframe.loc[index + 1, 'abridged_xpath'] == row['abridged_xpath'] \
                and dataframe.loc[index + 1, 'descendant_order'] == row['descendant_order']:
                    dataframe.loc[index + 1, 'previous_encoding'] = row['new_encoding']
                
                else:
                    dataframe.loc[index, 'new_encoding'] = row['new_encoding']
                    
            except KeyError as e:
                dataframe.loc[index, 'new_encoding'] = row['new_encoding']
                
#         If changes are rejected...
        else:
            try:
                if dataframe.loc[index + 1, 'abridged_xpath'] == row['abridged_xpath'] \
                and dataframe.loc[index + 1, 'descendant_order'] == row['descendant_order']:
                    dataframe.loc[index + 1, 'previous_encoding'] = dataframe.loc[index, 'previous_encoding']
                    
            except KeyError as e:
                dataframe.loc[index, 'new_encoding'] = dataframe.loc[index, 'previous_encoding']

#     Subset dataframe with finalized revisions.
    dataframe = dataframe.groupby(['abridged_xpath', 'descendant_order']).tail(1)
    
    return dataframe
    

# using same data from highlighter 
for index, row in reading_df.iterrows():
    
    revised_no_uniq_id = revise_without_uniq_id(label_dict, row['uniq_id'],
                                                            row['label'], row['entity'],
                                                            row['previous_encoding'], row['new_encoding'])

# reading_df.loc[0, 'new_encoding'] = revised_no_uniq_id
reading_df

# # Inheritance update: if change occurred, change next row's "previous_encoding" as long as elem. is the same.
# dataframe.loc[index, 'new_encoding'] = revised_no_uniq_id

# try:
#     if dataframe.loc[index + 1, 'abridged_xpath'] == row['abridged_xpath'] \
#     and dataframe.loc[index + 1, 'descendant_order'] == row['descendant_order']:
#         dataframe.loc[index + 1, 'previous_encoding'] = row['new_encoding']

#     else:
#         dataframe.loc[index, 'new_encoding'] = row['new_encoding']

# except KeyError as e:
#     dataframe.loc[index, 'new_encoding'] = row['new_encoding']

CPU times: user 790 µs, sys: 9 µs, total: 799 µs
Wall time: 797 µs


Unnamed: 0,file,abridged_xpath,previous_encoding,entities,descendant_order,entity,label,new_encoding,uniq_id
3,JQADiaries-v28-1809-12-p047.xml,.//ns:body//{http://www.tei-c.org/ns/1.0}div[@...,"<ns0:p xmlns:ns0=""http://www.tei-c.org/ns/1.0""...","(Holland, GPE)",4.0,Holland,GPE,"<ns0:p xmlns:ns0=""http://www.tei-c.org/ns/1.0""...",


<a id="update_header">update_header</a>

## Update Header

These functions called in <a id="writing">writing</a> section.

In [7]:
%%time

"""
XML: Write <change> to <revisionDesc>
Expects:
    XML File (xml_contents in revise_xml())
    
Output:
    Writes changes directly to xml structure (root)
"""
def append_change_to_revisionDesc(root, ns):
#     Create a change element for revisionDesc.
#     If revisionDesc already exists...
    if root.find('.//ns:teiHeader/ns:revisionDesc', ns):
        revision_desc = root.find('.//ns:teiHeader/ns:revisionDesc', ns)

        new_change = etree.SubElement(revision_desc, 'change',
                                      when = str(datetime.datetime.now().strftime("%Y-%m-%d")),
                                      who = '#nerHelper')
                                      
        new_change.text = f"Entities added by NER (spaCy: {spacy.__version__}) application."
#     Else, create revisionDesc with SubElement, then change.
    else:
        teiHeader = root.find('.//ns:teiHeader', ns)
        revision_desc = etree.SubElement(teiHeader, 'revisionDesc')
        new_change = etree.SubElement(revision_desc, 'change',
                                      when = str(datetime.datetime.now().strftime("%Y-%m-%d")),
                                      who = '#nerHelper')
        new_change.text = f"Entities added by NER (spaCy: {spacy.__version__}) application."
        


"""
XML: Write <application> to <appInfo>
Expects:
    XML File (xml_contents in revise_xml())
    
Output:
    Writes changes directly to xml structure (root)
"""
def append_app_to_appInfo(root, ns):
#     If <appInfo> already exists...
    if root.find('.//ns:teiHeader//ns:appInfo', ns):
        app_info = root.find('.//ns:teiHeader//ns:appInfo', ns)

        ner_app_info = etree.SubElement(app_info, 'application',
                                        ident = 'nerHelper',
                                        version = "0.1")

        # Without saving a variable.
        etree.SubElement(ner_app_info, 'label').text = 'nerHelper App'
        etree.SubElement(ner_app_info, 'p').text = f'Entities added with spaCy-{spacy.__version__}.'
        
#     If <appInfo> missing BUT <encodingDesc> exists...
    elif root.find('.//ns:teiHeader/ns:encodingDesc', ns):
        encoding_desc = root.find('.//ns:teiHeader/ns:encodingDesc', ns)
        
        app_info = etree.SubElement(encoding_desc, 'appInfo')

        ner_app_info = etree.SubElement(app_info, 'application',
                                ident = 'nerHelper',
                                version = "0.1")
        
        etree.SubElement(ner_app_info, 'label').text = 'nerHelper App'
        etree.SubElement(ner_app_info, 'p').text = f'Entities added with spaCy-{spacy.__version__}.'
        
#     Else <appInfo> and <encodingDesc> missing...
    else:
        teiHeader = root.find('.//ns:teiHeader', ns)
        
        encoding_desc = etree.SubElement(teiHeader, 'encodingDesc')
        
        app_info = etree.SubElement(encoding_desc, 'appInfo')

        ner_app_info = etree.SubElement(app_info, 'application',
                                ident = 'nerHelper',
                                version = "0.1")
        
        etree.SubElement(ner_app_info, 'label').text = 'nerHelper App'
        etree.SubElement(ner_app_info, 'p').text = f'Entities added with spaCy-{spacy.__version__}.'


CPU times: user 4 µs, sys: 1 µs, total: 5 µs
Wall time: 6.91 µs


<a id="writing">writing</a>

## Writing

In [10]:
%%time

new_df = inherit_changes(label_dict, reading_df)

new_df

CPU times: user 8.3 ms, sys: 2 ms, total: 10.3 ms
Wall time: 8.75 ms


Unnamed: 0,file,abridged_xpath,previous_encoding,entities,descendant_order,entity,label,new_encoding,uniq_id
3,JQADiaries-v28-1809-12-p047.xml,.//ns:body//{http://www.tei-c.org/ns/1.0}div[@...,"<ns0:p xmlns:ns0=""http://www.tei-c.org/ns/1.0""...","(Holland, GPE)",4.0,Holland,GPE,"<ns0:p xmlns:ns0=""http://www.tei-c.org/ns/1.0""...",


In [85]:
%%time

"""
XML & NER: Write New XML File with Accepted Revisions
Expects:
    XML File with Original Encoding
    CSV File with Accepted Changes
    Label Dictionary
"""
# def revise_xml(xml_contents, csv_df):
# #     Label dictionary.
#     label_dict = {'PERSON':'persRef',
#                   'LOC':'placeName', # Non-GPE locations, mountain ranges, bodies of water.
#                   'GPE':'placeName', # Countries, cities, states.
#                   'FAC':'placeName', # Buildings, airports, highways, bridges, etc.
#                   'ORG':'orgName', # Companies, agencies, institutions, etc.
#                   'NORP':'name', # Nationalities or religious or political groups.
#                   'EVENT':'name', # Named hurricanes, battles, wars, sports events, etc.
#                   'WORK_OF_ART':'name', # Titles of books, songs, etc.
#                   'LAW':'name', # Named documents made into laws.
#                   'DATE':'date' # Absolute or relative dates or periods.
#                  }
    
# #     First, update data to reflect accepted changes.
#     new_data = inherit_changes(label_dict, csv_df)
    
#     xml_content_type, xml_content_string = xml_contents.split(',')
#     xml_decoded = base64.b64decode(xml_content_string).decode('utf-8')
#     xml_file = xml_decoded.encode('utf-8')
    
# root = ET.fromstring(xml_string)
# root = etree.fromstring(xml_string)
#     ns = get_namespace(root)
    
#     Add <change> to <revisionDesc> and add <application> to <appInfo>

new_data = reading_df

append_change_to_revisionDesc(root, ns)
append_app_to_appInfo(root, ns) # Does not need to save as variable; changes written to root.


#     Convert XML structure to string for regex processing.

tree_as_string = ET.fromstring(xml_string)
tree_as_string = ET.tostring(tree_as_string, method = 'xml').decode('utf-8')

# tree_as_string = etree.tostring(root).decode('utf-8') #, pretty_print = True
tree_as_string = re.sub('\s+', ' ', tree_as_string) # remove additional whitespace

#     Write accepted code into XML tree.
for index, row in new_data.iterrows():
    original_encoding_as_string = row['previous_encoding']

    # Remove namespaces within tags to ensure regex matches accurately.
    original_encoding_as_string = re.sub('^<(.*?)( xmlns.*?)>(.*)$',
                                         '<\\1>\\3',
                                         original_encoding_as_string)
# #     Remove namespaces.
#     original_encoding_as_string = re.sub('ns0:', '', original_encoding_as_string)

    accepted_encoding_as_string = row['new_encoding']
    accepted_encoding_as_string = re.sub('<(.*?)( xmlns.*?)>(.*)$',
                                         '<\\1>\\3',
                                         accepted_encoding_as_string) # Remove namespaces within tags.
    
# #     Remove namespaces.
#     accepted_encoding_as_string = re.sub('ns0:', '', accepted_encoding_as_string)

    tree_as_string = re.sub(original_encoding_as_string,
                            accepted_encoding_as_string,
                            tree_as_string)
    
#     Remove namespaces.
    tree_as_string = re.sub('ns0:', '', tree_as_string)
    

#     Check well-formedness (will fail if not well-formed)
doc = etree.fromstring(tree_as_string)
et = etree.ElementTree(doc)

#     Convert to string.
et = etree.tostring(et, encoding='unicode', method='xml', pretty_print = True)
#     return et

changes happened True
CPU times: user 17.8 ms, sys: 1.55 ms, total: 19.3 ms
Wall time: 17.5 ms


In [80]:
# prev_str = new_df['previous_encoding'].values[0]

# prev_str = re.sub('ns0:', '', prev_str)

# prev_str = re.sub('^<(.*?)( xmlns.*?)>(.*)$',
#                   '<\\1>\\3',
#                   prev_str)

# original_encoding_as_string



#     Convert XML structure to string for regex processing.
# tree_as_string = etree.tostring(root).decode('utf-8') #, pretty_print = True


tree_as_string = ET.fromstring(xml_string)
tree_as_string = ET.tostring(tree_as_string, method = 'xml').decode('utf-8')
tree_as_string = re.sub('\s+', ' ', tree_as_string)

# new_xml_string

# tree_as_string = re.sub('\s+', ' ', tree_as_string) # remove additional whitespace

# error = 'others&#8212;<persRef ref="barney-john2">Barney</persRef>' # og has space between &#8212; and <persRef

# error in tree_as_string
# og in tree_as_string

tree_as_string

'<ns0:TEI xmlns:ns0="http://www.tei-c.org/ns/1.0" xmlns:ns1="http://www.masshist.org/ns/1.0" xml:id="v28-1809-12">\n\t<ns0:teiHeader>\n\t\t<ns0:fileDesc>\n\t\t\t<ns0:titleStmt>\n\t\t\t\t<ns0:title>John Quincy Adams Diary Digital Project</ns0:title>\n\t\t\t</ns0:titleStmt>\n\t\t\t<ns0:publicationStmt>\n\t\t\t\t<ns0:p />\n\t\t\t</ns0:publicationStmt>\n\t\t\t<ns0:sourceDesc>\n\t\t\t\t<ns0:p>DJQA SOURCE DESCRIPTION</ns0:p>\n\t\t\t</ns0:sourceDesc>\n\t\t</ns0:fileDesc>\n\t</ns0:teiHeader>\n\t<ns0:text>\n\t\t<ns0:body>\n\t\t\t<ns0:div type="month" ns1:startingPage="47" ns1:precedingFile="" ns1:followingFile="JQADiaries-v28-1810-01-063.xml" ns1:volume="28">\n\t\t\t\t<ns0:bibl n="metadata"><ns0:date from="1809-12-01" to="1809-12-31" /></ns0:bibl>\n\n\t\t\t\t\n\t\t\t\t\n\t\t\t\t<ns0:div type="entry" xml:id="jqadiaries-v28-1809-12-01">\n\t\t\t\t\t<ns0:head>1 December 1809</ns0:head>\n\t\t\t\t\t<ns0:bibl><ns0:author>JQA</ns0:author><ns0:date type="creation" when="1809-12-01" /><ns0:editor role="t

In [46]:
accepted_encoding_as_string in tree_as_string

True

In [45]:
tree_as_string

'<TEI xmlns="http://www.tei-c.org/ns/1.0" xmlns:mhs="http://www.masshist.org/ns/1.0" xmlns:tei="http://www.tei-c.org/ns/1.0" xml:id="v28-1809-12"> <teiHeader> <fileDesc> <titleStmt> <title>John Quincy Adams Diary Digital Project</title> </titleStmt> <publicationStmt> <p/> </publicationStmt> <sourceDesc> <p>DJQA SOURCE DESCRIPTION</p> </sourceDesc> </fileDesc> <revisionDesc><change when="2021-10-04" who="#nerHelper">Entities added by NER (spaCy: 2.3.7) application.</change></revisionDesc><encodingDesc><appInfo><application ident="nerHelper" version="0.1"><label>nerHelper App</label><p>Entities added with spaCy-2.3.7.</p></application></appInfo></encodingDesc><revisionDesc><change when="2021-10-04" who="#nerHelper">Entities added by NER (spaCy: 2.3.7) application.</change></revisionDesc><encodingDesc><appInfo><application ident="nerHelper" version="0.1"><label>nerHelper App</label><p>Entities added with spaCy-2.3.7.</p></application></appInfo></encodingDesc><revisionDesc><change when="20

## App

In [None]:
%%time

"""
Parse Contents: XML Structure (ouput-data-upload)
"""
def parse_contents(contents, filename, ner_values): # date, 
    ner_values = ner_values#.split(',')
    
    content_type, content_string = contents.split(',')
    decoded = base64.b64decode(content_string).decode('utf-8')
    
    # Label dictionary.
    label_dict = {'PERSON':'persRef',
                  'LOC':'placeName', # Non-GPE locations, mountain ranges, bodies of water.
                  'GPE':'placeName', # Countries, cities, states.
                  'FAC':'placeName', # Buildings, airports, highways, bridges, etc.
                  'ORG':'orgName', # Companies, agencies, institutions, etc.
                  'NORP':'name', # Nationalities or religious or political groups.
                  'EVENT':'name', # Named hurricanes, battles, wars, sports events, etc.
                  'WORK_OF_ART':'name', # Titles of books, songs, etc.
                  'LAW':'name', # Named documents made into laws.
                  'DATE':'date' # Absolute or relative dates or periods.
                 }
    
    #### Subset label_dict with input values from Checklist *****
    subset_ner = {k: label_dict[k] for k in ner_values}
    
#     Run XML Parser + NER here.
    try:
#         Assume that the user uploaded a CSV file
        if 'csv' in filename:
            df = pd.read_csv(
                io.StringIO(decoded)
            )
            
#         Assume that the user uploaded an XML file
        elif 'xml' in filename:
            xml_file = decoded.encode('utf-8')
            
            df = pd.DataFrame(columns = ['file', 'abridged_xpath', 'previous_encoding', 'entities'])
            
            root = etree.fromstring(xml_file)
            ns = get_namespace(root)
            
#             Search through elements for entities.
            desc_order = 0
            for child in root.findall('.//ns:body//ns:div[@type="docbody"]', ns):
            
                abridged_xpath = get_abridged_xpath(child)
                
                for descendant in child:
                    desc_order = desc_order + 1
                    df = make_dataframe(descendant, df, ns, subset_ner, filename, desc_order)
                    df['abridged_xpath'] = abridged_xpath
                
#             Join data
            df = df \
                .explode('entities') \
                .dropna()

            df[['entity', 'label']] = pd.DataFrame(df['entities'].tolist(), index = df.index)
            
            df['new_encoding'] = df \
                .apply(lambda row: make_ner_suggestions(row['previous_encoding'],
                                                        row['entity'],
                                                        row['label'],
                                                        subset_ner, 4, banned_list),
                       axis = 1)

            
            # Add additional columns for user input.
            df['uniq_id'] = ''
            
#             Drop rows if 'new_encoding' value equals 'Already Encoded'.
            df = df[df['new_encoding'] != 'Already Encoded']

            
    except Exception as e:
        return html.Div([
            f'There was an error processing this file: {e}.'
    ])


#     Return HTML with outputs.
    return df
#     return filename, date, df


In [None]:
%%time


app = JupyterDash(__name__) 
#                   external_scripts = external_scripts)

app.config.suppress_callback_exceptions = True


# Preset variables.
ner_labels = ['PERSON','LOC','GPE','FAC','ORG','NORP','EVENT','WORK_OF_ART','LAW','DATE']

# Layout.
app.layout = html.Div([
    
#     Add or substract labels to list for NER to find. Complete list of NER labels: https://spacy.io/api/annotation
    html.H2('Select Entities to Search For'),
    
    dcc.Checklist(
        className = 'ner-checklist',
        id = 'ner-checklist',
        options = [{
            'label': i,
            'value': i
        } for i in ner_labels],
        value = ['LOC', 'GPE']
    ),
    
#     Upload Data Area.
    html.H2('Upload File'),
    dcc.Upload(
        className = 'upload-data',
        id = 'upload-data',
        children = html.Div([
            'Drag and Drop or ', html.A('Select File')
        ]),
        style={
            'width': '95%',
            'height': '60px',
            'lineHeight': '60px',
            'borderWidth': '1px',
            'borderStyle': 'dashed',
            'borderRadius': '5px',
            'textAlign': 'center',
            'margin': '10px'
        },
        multiple=False # Allow multiple files to be uploaded
    ),
    
#     Store uploaded data.
    dcc.Store(id = 'data-upload-store'),
    
#     Display pane for file information.
    html.Div(className = 'file-information', id = 'file-information'),
    
#     Display pane for data as table.
    dash_table.DataTable(id = 'data-table-container',
                         row_selectable="single",
                         selected_rows = [0],
                         editable = True,
                         page_size=1,
                        )
])


####################################################################################################################
####################################################################################################################
######### Callbacks ################################################################################################
####################################################################################################################
####################################################################################################################



# Upload data & create table.
@app.callback([Output('file-information', 'children'),
               Output('data-upload-store', 'data')],
              [Input('upload-data', 'contents'),
               Input('ner-checklist', 'value')],
              [State('upload-data', 'filename'),
               State('upload-data', 'last_modified')])
def upload_data(contents, ner_values, filename, date):
    if contents is None:
        raise PreventUpdate
            
    data = parse_contents(contents, filename, ner_values)
    
    file_information = html.Div([html.P(f'File name: {filename}'),
                                 html.P(f'Last modified: {datetime.datetime.fromtimestamp(date)}')])
    
    return file_information, data.to_dict('rows')
    


if __name__ == "__main__":
    app.run_server(mode = 'inline', debug = True) # mode = 'inline' for JupyterDash
#     app.run_server(debug = True)

## Revise XML

In [262]:
%%time


"""
XML & Regex: Up Conversion

Function replaces all spaces between beginning and end tags with underscores.
Then, function wraps each token (determined by whitespace) with word tags (<w>...</w>)
"""
def up_convert_encoding(column):
#     Regularize spacing & store data as new variable ('converted_encoding').
    converted_encoding = re.sub('\s+', ' ', column, re.MULTILINE)
    
#     Create regex that replaces spaces with underscores if spaces occur within tags.
#     This regex treats tags as a single token later.
    tag_regex = re.compile('<(.*?)>')

#     Accumulate underscores through iteration
    for match in re.findall(tag_regex, column):
        replace_space = re.sub('\s', '_', match)
        converted_encoding = re.sub(match, replace_space, converted_encoding)
    
#     Up-Converstion
#     Tokenize encoding and text, appending <w> tags, and re-join.
    converted_encoding = converted_encoding.split(' ')
    for idx, item in enumerate(converted_encoding):
        item = '<w>' + item + '</w>'
        converted_encoding[idx] = item
    converted_encoding = ' '.join(converted_encoding)
    
    return converted_encoding
  
"""
XML Parsing Function: Suggest New Encoding with Hand Edits

Similar to make_ner_suggestions(), this function folds in revision using regular expressions.
The outcome is the previous encoding with additional encoded information determined by user input.

Expected Columns:
    previous_encoding
    entities
    uniq_id
"""
def revise_without_uniq_id(label_dict, uniq_id, 
                           label, entity, previous_encoding):
    
    label = label_dict[label]
    
#     Up convert PREVIOUS ENCODING: assumes encoder will supply new encoding and attribute with value.
    converted_encoding = up_convert_encoding(previous_encoding)
    converted_entity = ' '.join(['<w>' + e + '</w>' for e in entity.split(' ')])
    
    print ('found placeName', 'placeName' in converted_encoding)

#     If there is a unique id to add & hand edits...
    if uniq_id == '':

        entity_regex = re.sub('<w>(.*)</w>', '(\\1)(.*?</w>)', converted_entity)
        entity_match = re.search(entity_regex, converted_encoding)

        revised_encoding = re.sub(f'{entity_match.group(0)}',
                                  f'<{label} type="nerHelper-added">{entity_match.group(1)}</{label}>{entity_match.group(2)}',
                                  converted_encoding)
        
        revised_encoding = xml_cleanup(revised_encoding)
        
        return revised_encoding

    else:
        pass
    
    
"""
XML & NER: Update/Inherit Accepted Changes
Expects a dataframe (from a .csv) with these columns:
    file
    abridged_xpath
    descendant_order
    previous_encoding
    entities
    new_encoding
    uniq_id
"""
def inherit_changes(label_dict, dataframe):
    print ('starting inherit_changes()...')
    
    dataframe = dataframe.fillna('')
    for index, row in dataframe.iterrows():
        
# #         If HAND changes are accepted...
#         if row['uniq_id'] != '':
        
#             revised_by_hand = revise_with_uniq_id(label_dict, row['uniq_id'],
#                                                   row['label'], row['entity'], 
#                                                   row['previous_encoding'], row['new_encoding'])

#             dataframe.loc[index, 'new_encoding'] = revised_by_hand
            
#             try:
#                 if dataframe.loc[index + 1, 'abridged_xpath'] == row['abridged_xpath'] \
#                 and dataframe.loc[index + 1, 'descendant_order'] == row['descendant_order']:
#                     dataframe.loc[index + 1, 'previous_encoding'] = row['new_encoding']
                    
#                 else:
#                     dataframe.loc[index, 'new_encoding'] = revised_by_hand
                    
                    
#             except KeyError as e:
#                 dataframe.loc[index, 'new_encoding'] = revised_by_hand
        
#         If NER suggestions are accepted as-is...
#         elif row['label'] != '' and row['uniq_id'] == '':
        if row['accept'] == 'y':
        
            revised_no_uniq_id = revise_without_uniq_id(label_dict, row['uniq_id'],
                                                        row['label'], row['entity'],
                                                        row['previous_encoding'])

            dataframe.loc[index, 'new_encoding'] = revised_no_uniq_id
        
            try:
#                 If the next row has the same xpath & descendant order (next row handles the same element),
#                 update next row's "previous_encoding" to reflect changes.
                if (row['abridged_xpath'] == dataframe.loc[index + 1, 'abridged_xpath']) \
                and (row['descendant_order'] == dataframe.loc[index + 1, 'descendant_order']):
        
                    dataframe.at[index + 1, 'previous_encoding'] = revised_no_uniq_id
                
                else:
                    pass
                    
            except KeyError as e:
                dataframe.at[index, 'new_encoding'] = revised_no_uniq_id
                
                
#         If changes are rejected...
        else:
            try:
#                 If next row handles same element, pass current row's "previous_encoding"
#                 to carry forward accepted changes in previous rows.
                if dataframe.loc[index + 1, 'abridged_xpath'] == row['abridged_xpath'] \
                and dataframe.loc[index + 1, 'descendant_order'] == row['descendant_order']:
        
                    dataframe.at[index + 1, 'previous_encoding'] = dataframe.loc[index, 'previous_encoding']
                   
                    
            except KeyError as e:
                dataframe.at[index, 'new_encoding'] = dataframe.loc[index, 'previous_encoding']

#     Subset dataframe with finalized revisions.
    dataframe = dataframe.groupby(['abridged_xpath', 'descendant_order']).tail(1)
    return dataframe


label_dict = {'PERSON':'persRef',
                  'LOC':'placeName', # Non-GPE locations, mountain ranges, bodies of water.
                  'GPE':'placeName', # Countries, cities, states.
                  'FAC':'placeName', # Buildings, airports, highways, bridges, etc.
                  'ORG':'orgName', # Companies, agencies, institutions, etc.
                  'NORP':'name', # Nationalities or religious or political groups.
                  'EVENT':'name', # Named hurricanes, battles, wars, sports events, etc.
                  'WORK_OF_ART':'name', # Titles of books, songs, etc.
                  'LAW':'name', # Named documents made into laws.
                  'DATE':'date' # Absolute or relative dates or periods.
                 }


df = pd.read_excel(abs_dir + 'GitHub/dsg-mhs/Jupyter_Notebooks/Interfaces/NER_Application/before-JQADiaries-v36i-1828-12-p065.xlsx', 
                   index_col = 0)

inherit_changes(label_dict, df)

starting inherit_changes()...
found placeName False
found placeName False
found placeName False
found placeName False
found placeName False
found placeName False
found placeName False
found placeName False
found placeName False
found placeName False
CPU times: user 26.2 ms, sys: 2.76 ms, total: 29 ms
Wall time: 27.8 ms


Unnamed: 0,accept,entity,label,uniq_id,previous_encoding,new_encoding,entities,abridged_xpath,descendant_order,file
0,n,Johnston,GPE,,"<list xmlns=""http://www.tei-c.org/ns/1.0"" rend...","<list xmlns=""http://www.tei-c.org/ns/1.0"" rend...","['Johnston', 'GPE']",.//ns:body//{http://www.tei-c.org/ns/1.0}div[@...,5,before-JQADiaries-v36i-1828-12-p065.xml
9,y,Tiber,LOC,,"<p xmlns=""http://www.tei-c.org/ns/1.0"">Senator...","<p xmlns=""http://www.tei-c.org/ns/1.0"">Senator...","['Tiber', 'LOC']",.//ns:body//{http://www.tei-c.org/ns/1.0}div[@...,9,before-JQADiaries-v36i-1828-12-p065.xml
10,y,College Hill,LOC,,"<p xmlns=""http://www.tei-c.org/ns/1.0""><date>2...","<p xmlns=""http://www.tei-c.org/ns/1.0""><date>2...","['College Hill', 'LOC']",.//ns:body//{http://www.tei-c.org/ns/1.0}div[@...,11,before-JQADiaries-v36i-1828-12-p065.xml


In [265]:
df

Unnamed: 0,accept,entity,label,uniq_id,previous_encoding,new_encoding,entities,abridged_xpath,descendant_order,file
0,n,Johnston,GPE,,"<list xmlns=""http://www.tei-c.org/ns/1.0"" rend...","<list xmlns=""http://www.tei-c.org/ns/1.0"" rend...","['Johnston', 'GPE']",.//ns:body//{http://www.tei-c.org/ns/1.0}div[@...,5,before-JQADiaries-v36i-1828-12-p065.xml
1,y,Kentucky,GPE,,"<p xmlns=""http://www.tei-c.org/ns/1.0"">Senator...","<p xmlns=""http://www.tei-c.org/ns/1.0"">Senator...","['Kentucky', 'GPE']",.//ns:body//{http://www.tei-c.org/ns/1.0}div[@...,9,before-JQADiaries-v36i-1828-12-p065.xml
2,y,New-York,GPE,,"<p xmlns=""http://www.tei-c.org/ns/1.0"">Senator...","<p xmlns=""http://www.tei-c.org/ns/1.0"">Senator...","['New-York', 'GPE']",.//ns:body//{http://www.tei-c.org/ns/1.0}div[@...,9,before-JQADiaries-v36i-1828-12-p065.xml
3,y,New-Jersey,GPE,,"<p xmlns=""http://www.tei-c.org/ns/1.0"">Senator...","<p xmlns=""http://www.tei-c.org/ns/1.0"">Senator...","['New-Jersey', 'GPE']",.//ns:body//{http://www.tei-c.org/ns/1.0}div[@...,9,before-JQADiaries-v36i-1828-12-p065.xml
4,y,New-Jersey,GPE,,"<p xmlns=""http://www.tei-c.org/ns/1.0"">Senator...","<p xmlns=""http://www.tei-c.org/ns/1.0"">Senator...","['New-Jersey', 'GPE']",.//ns:body//{http://www.tei-c.org/ns/1.0}div[@...,9,before-JQADiaries-v36i-1828-12-p065.xml
5,y,Maryland,GPE,,"<p xmlns=""http://www.tei-c.org/ns/1.0"">Senator...","<p xmlns=""http://www.tei-c.org/ns/1.0"">Senator...","['Maryland', 'GPE']",.//ns:body//{http://www.tei-c.org/ns/1.0}div[@...,9,before-JQADiaries-v36i-1828-12-p065.xml
6,y,Charleston,GPE,,"<p xmlns=""http://www.tei-c.org/ns/1.0"">Senator...","<p xmlns=""http://www.tei-c.org/ns/1.0"">Senator...","['Charleston', 'GPE']",.//ns:body//{http://www.tei-c.org/ns/1.0}div[@...,9,before-JQADiaries-v36i-1828-12-p065.xml
7,y,S.C.,GPE,,"<p xmlns=""http://www.tei-c.org/ns/1.0"">Senator...","<p xmlns=""http://www.tei-c.org/ns/1.0"">Senator...","['S.C.', 'GPE']",.//ns:body//{http://www.tei-c.org/ns/1.0}div[@...,9,before-JQADiaries-v36i-1828-12-p065.xml
8,y,Mexico,GPE,,"<p xmlns=""http://www.tei-c.org/ns/1.0"">Senator...","<p xmlns=""http://www.tei-c.org/ns/1.0"">Senator...","['Mexico', 'GPE']",.//ns:body//{http://www.tei-c.org/ns/1.0}div[@...,9,before-JQADiaries-v36i-1828-12-p065.xml
9,y,Tiber,LOC,,"<p xmlns=""http://www.tei-c.org/ns/1.0"">Senator...","<p xmlns=""http://www.tei-c.org/ns/1.0"">Senator...","['Tiber', 'LOC']",.//ns:body//{http://www.tei-c.org/ns/1.0}div[@...,9,before-JQADiaries-v36i-1828-12-p065.xml


All information needed to revise an xml file is stored in a spreadsheet. 

The revision function(s) checks one row at a time, the "current row." The "current row" provides the entity's label and text (i.e. placeName, Kentucky). 

Before the function starts making revisions, it checks the "preceding row." If the "preceding row" handles the same element, has the same "abridged_xpath" and "descendant_order," then the function grabs the "preceding row's" "new_encoding" field. The reason the function grabs the "preceding row's" "new_encoding" rather than the "current row's" is because the "current row" does not know about previous revisions. When new changes are committed, described later, then those changes go into the "new_encoding" field, which has to include all previous changes in order for revisions to accumulate (as opposed to being over-written with each new row).

If the "current row" has different values in either "abridged_xpath" or "descendant_order," then the "current row" is a different element thant he "preceding row." This means that the current row is essentially starting from scratch and will add the first revision.

After the first revision is made (using regular expressions), the revision is stored in the "new_encoding" field of the "current row." If the "following row" handles the same element, then the "following row" must grab the "current row's" "new_encoding." Otherwise, the "following row" will miss/overlook changes that have been made.

In [316]:
%%time


"""
XML: Remove word tags and clean up
"""
def xml_cleanup(encoding):
#     Clean up any additional whitespace and remove word tags.
    encoding = re.sub('\s+', ' ', encoding, re.MULTILINE)
    encoding = re.sub('<[/]?w>', '', encoding)

    encoding = re.sub('_', ' ', encoding) # Remove any remaining underscores in tags.
    encoding = re.sub('“', '"', encoding) # Change quotation marks to correct unicode.
    encoding = re.sub('”', '"', encoding)
    
    return encoding


def revise_without_uniq_id(entity, converted_encoding, label):
    converted_entity = ' '.join(['<w>' + e + '</w>' for e in entity.split(' ')])
    entity_regex = re.sub('<w>(.*)</w>', '(\\1)(.*?</w>)', converted_entity)
    entity_match = re.search(entity_regex, converted_encoding)

    revised_encoding = re.sub(f'{entity_match.group(0)}',
                              f'<{label} type="nerHelper-added">{entity_match.group(1)}</{label}>{entity_match.group(2)}',
                              converted_encoding)
    
    cleaned_revisions = xml_cleanup(revised_encoding)
    
    return cleaned_revisions



dataframe = df#.query('descendant_order == 9')#.copy(deep=False)

for index, row in dataframe.iterrows():
    label = label_dict[row['label']]
    entity = row['entity']
    
    if row['accept'] == 'y': # if changes are accepted...
        
#         If the current row is handling same element as the preceding row...
        if (row['abridged_xpath'] == dataframe.loc[index - 1, 'abridged_xpath']) \
        and (row['descendant_order']== dataframe.loc[index - 1, 'descendant_order']):
            print ('previous elem is the same')

#         Up convert encoding.
#         Up convert preceding row's "new_encoding" or, if not, up convert current row's "previous encoding"
            try:
                converted_encoding = up_convert_encoding(dataframe.loc[index - 1, 'new_encoding'])
            except KeyError:
                converted_encoding = up_convert_encoding(row['previous_encoding']) # this KeyError triggers if the
            
            polished_revisions = revise_without_uniq_id(entity, converted_encoding, label)
        
            dataframe.loc[index, 'new_encoding'] = polished_revisions(entity, converted_encoding, label)
                
#         If the current row is handling a different row...
        else:
            print ('previous elem is NOT the same')
#             Up convert current row's "previous encoding"
            converted_encoding = up_convert_encoding(row['previous_encoding'])
    
            polished_revisions = revise_without_uniq_id(entity, converted_encoding, label)
        
            dataframe.loc[index, 'new_encoding'] = polished_revisions

#         else:
#             converted_encoding = up_convert_encoding(dataframe.loc[index, 'previous_encoding'])

#             converted_entity = ' '.join(['<w>' + e + '</w>' for e in entity.split(' ')])
#             entity_regex = re.sub('<w>(.*)</w>', '(\\1)(.*?</w>)', converted_entity)
#             entity_match = re.search(entity_regex, converted_encoding)
#             print (entity_match)
#             revised_encoding = re.sub(f'{entity_match.group(0)}',
#                                       f'<{label} type="nerHelper-added">{entity_match.group(1)}</{label}>{entity_match.group(2)}',
#                                       converted_encoding)


#             dataframe.loc[index, 'new_encoding'] = xml_cleanup(revised_encoding)
        
        
    else:
        print ('changes not accepted')
        dataframe.loc[index, 'new_encoding']

    
# newest_df = new_df.groupby(['abridged_xpath', 'descendant_order']).tail(1)

changes not accepted
previous elem is NOT the same
previous elem is the same
<w><p_xmlns="http://www.tei-c.org/ns/1.0">Senator</w> <w>from</w> <w><placeName_type="nerHelper-added">Kentucky</placeName>&#8212;quite</w> <w>shocked</w> <w>at</w> <w>the</w> <w>virulence</w> <w>of</w> <w>Newspaper</w> <w>Slanders</w> <w>ag<hi_rend="superscript">t.</hi></w> <w>the</w> <w>Administration&#8212;</w> <w><persRef_ref="allen-samuel">Allen</persRef></w> <w>thinks</w> <w>I</w> <w>have</w> <w>suffered,</w> <w>for</w> <w>not</w> <w>turning</w> <w>my</w> <w>enemies</w> <w>out</w> <w>of</w> <w>office;</w> <w>particularly</w> <w>the</w> <w><persRef_ref="mclean-john2">Post-master</w> <w>General</persRef>.</w> <w>Committee</w> <w>of</w> <w>both</w> <w>Houses</w> <w>of</w> <w>Congress,</w> <w>notified</w> <w>me</w> <w>that</w> <w>they</w> <w>had</w> <w>formed</w> <w>Quorums</w> <w>and</w> <w>were</w> <w>ready</w> <w>to</w> <w>receive</w> <w>any</w> <w>Communication</w> <w>from</w> <w>me.</w> <w>Answered</w> 

TypeError: 'str' object is not callable

In [241]:
for i, r in new_df.iterrows():
    if '<placeName type="nerHelper-added">Kentucky' in r['previous_encoding']:
        print (i, 'entity added')
    else:
        print (i, 'entity not added or Kentucy lost')

1 entity not added or Kentucy lost
2 entity added
3 entity added
4 entity added
5 entity added
6 entity added
7 entity added
8 entity added
9 entity added
10 entity added


In [255]:
newest_df.query('entity == "Tiber"')['new_encoding'].values[0]

'<p xmlns="http://www.tei-c.org/ns/1.0">Senator from <placeName type="nerHelper-added">Kentucky</placeName>&#8212;quite shocked at the virulence of Newspaper Slanders ag<hi rend="superscript">t.</hi> the Administration&#8212; <persRef ref="allen-samuel">Allen</persRef> thinks I have suffered, for not turning my enemies out of office; particularly the <persRef ref="mclean-john2">Post-master General</persRef>. Committee of both Houses of Congress, notified me that they had formed Quorums and were ready to receive any Communication from me. Answered that I should make one at 12 to-morrow&#8212; <persRef key="rush-richard">M<hi rend="superscript">r</hi> Rush</persRef> read me the draft of his annual report on the finances&#8212; Very pleasing&#8212;corrected the message by the revised figures of his Report&#8212; <persRef ref="sanford-nathan">Sanford</persRef> says <persRef ref="vanburen-martin">Van Burren</persRef> is not coming. He is elected Governor by a minority, in <placeName type="n

In [164]:
test = pd.DataFrame({'a': [1, 2], 'b': [0, 3], 'c': [5, 6]})

for i, r in test.iterrows():
    if r['b'] < r['a']:
        test.loc[i + 1, 'b'] = r['a'] + 5
        
test

Unnamed: 0,a,b,c
0,1,0,5
1,2,6,6


In [167]:
'ab' == 'a'

False

In [217]:
string = 'abcdef'

reg = 'cd'

re.sub(reg, '12', string)

'ab12ef'

In [228]:
conv_enc = '<w><p_xmlns="http://www.tei-c.org/ns/1.0">Senator</w> <w>from</w> <w>Kentucky&#8212;quite</w> <w>shocked</w> <w>at</w> <w>the</w> <w>virulence</w> <w>of</w> <w>Newspaper</w> <w>Slanders</w> <w>ag<hi_rend="superscript">t.</hi></w> <w>the</w> <w>Administration&#8212;</w> <w><persRef_ref="allen-samuel">Allen</persRef></w> <w>thinks</w> <w>I</w> <w>have</w> <w>suffered,</w> <w>for</w> <w>not</w> <w>turning</w> <w>my</w> <w>enemies</w> <w>out</w> <w>of</w> <w>office;</w> <w>particularly</w> <w>the</w> <w><persRef_ref="mclean-john2">Post-master</w> <w>General</persRef>.</w> <w>Committee</w> <w>of</w> <w>both</w> <w>Houses</w> <w>of</w> <w>Congress,</w> <w>notified</w> <w>me</w> <w>that</w> <w>they</w> <w>had</w> <w>formed</w> <w>Quorums</w> <w>and</w> <w>were</w> <w>ready</w> <w>to</w> <w>receive</w> <w>any</w> <w>Communication</w> <w>from</w> <w>me.</w> <w>Answered</w> <w>that</w> <w>I</w> <w>should</w> <w>make</w> <w>one</w> <w>at</w> <w>12</w> <w>to-morrow&#8212;</w> <w><persRef_key="rush-richard">M<hi_rend="superscript">r</hi></w> <w>Rush</persRef></w> <w>read</w> <w>me</w> <w>the</w> <w>draft</w> <w>of</w> <w>his</w> <w>annual</w> <w>report</w> <w>on</w> <w>the</w> <w>finances&#8212;</w> <w>Very</w> <w>pleasing&#8212;corrected</w> <w>the</w> <w>message</w> <w>by</w> <w>the</w> <w>revised</w> <w>figures</w> <w>of</w> <w>his</w> <w>Report&#8212;</w> <w><persRef_ref="sanford-nathan">Sanford</persRef></w> <w>says</w> <w><persRef_ref="vanburen-martin">Van</w> <w>Burren</persRef></w> <w>is</w> <w>not</w> <w>coming.</w> <w>He</w> <w>is</w> <w>elected</w> <w>Governor</w> <w>by</w> <w>a</w> <w>minority,</w> <w>in</w> <w>New-York.</w> <w><persRef_ref="condict-lewis">Condict</persRef></w> <w>spoke</w> <w>of</w> <w><persRef_ref="southard-samuel">Southard&#8217;s</persRef></w> <w>coming</w> <w>as</w> <w>Senator</w> <w>from</w> <w>New-Jersey.</w> <w>Fears</w> <w>they</w> <w>will</w> <w>make</w> <w>him</w> <w>a</w> <w>Non-resident;</w> <w>as</w> <w>they</w> <w>did</w> <w><persRef_ref="bailey-john">Bailey</persRef>.</w> <w>Asked</w> <w>if</w> <w>Southard</w> <w>could</w> <w>not</w> <w>withdraw</w> <w>and</w> <w>return</w> <w>to</w> <w>New-Jersey&#8212;</w> <w>I</w> <w>thought</w> <w>it</w> <w>unnecessary&#8212;</w> <w>Visit</w> <w>from</w> <w>Maryland</w> <w>members</w> <w>with</w> <w>others&#8212;</w> <w><persRef_ref="barney-john2">Barney</persRef></w> <w>lately</w> <w>at</w> <w>Charleston</w> <w>S.C.&#8212;</w> <w><persRef_ref="brent-daniel">M<hi_rend="superscript">r</hi></w> <w>Brent</persRef></w> <w>took</w> <w><persRef_ref="tacon-francisco">Tacon&#8217;s</persRef></w> <w>Letters</w> <w>complaining</w> <w>of</w> <w><persRef_ref="salmon-hilario">Salmon&#8217;s</persRef></w> <w>quarrels</w> <w>with</w> <w><persRef_ref="kirk-william">Kirk&#8217;s</persRef></w> <w>children</w> <w>and</w> <w>servant</w> <w><unclear>maids</unclear>&#8212;</w> <w><persRef_ref="debresson-charles">Bresson</persRef></w> <w>and</w> <w>the</w> <w><persRef_ref="lannes-napoleon">Duke</w> <w>of</w> <w>Montebello</persRef></w> <w>took</w> <w>leave&#8212;going</w> <w>to-morrow</w> <w>for</w> <w>Mexico&#8212;</w> <w>I</w> <w>rode</w> <w>before</w> <w>dinner</w> <w>on</w> <w>horseback</w> <w>across</w> <w>the</w> <w>Tiber</w> <w>round</w> <w>by</w> <w>the</w> <w>Navy-Yard</w> <w>and</w> <w>Eastern</w> <w>Branch</w> <w>upper</w> <w>bridge,</w> <w>returning</w> <w>by</w> <w>the</w> <w>Capitol&#8212;</w> <w>Evening</w> <w>visits</w> <w>from</w> <w>M<hi_rend="superscript">r</hi></w> <w>Bailey</w> <w>and</w> <w><persRef_ref="little-peter">Col<hi_rend="superscript">l.</hi></w> <w>Little</persRef>&#8212;</w> <w>Read</w> <w>to</w> <w>Bailey,</w> <w>my</w> <w>Letters</w> <w>to</w> <w><persRef_ref="bacon-ezekiel">Bacon</persRef></w> <w>of</w> <w>Dec<hi_rend="superscript">r.</hi></w> <w>1808&#8212;</w> <w>Collated</w> <w>the</w> <w>two</w> <w>Copies</w> <w>of</w> <w>my</w> <w>Message</w> <w>for</w> <w>the</w> <w>two</w> <w>Houses</w> <w>of</w> <w>Congress&#8212;</w> <w>My</w> <w>Son</w> <w><persRef_ref="adams-john2">John&#8217;s</persRef></w> <w><persRef_ref="hellen-mary">wife</persRef></w> <w>was</w> <w>taken</w> <w>in</w> <w>labour</w> <w>this</w> <w>day&#8212;</w> <w><persRef_ref="worthington-charles">D<hi_rend="superscript">r</hi></w> <w>Worthington</persRef></w> <w>attends</w> <w>her&#8212;</w> <w><persRef_ref="u">M<hi_rend="superscript">rs</hi></w> <w>Nowland</persRef></w> <w>is</w> <w>her</w> <w>Nurse&#8212;</p></w> <w></w>'

entity_match = re.search('(New-York)(.*?</w>)', converted_encoding)


revised_encoding = re.sub(f'{entity_match.group(0)}',
                          f'<{label} type="nerHelper-added">{entity_match.group(1)}</{label}>{entity_match.group(2)}',
                          converted_encoding)
        
xml_cleanup(revised_encoding)

'<p xmlns="http://www.tei-c.org/ns/1.0">Senator from Kentucky&#8212;quite shocked at the virulence of Newspaper Slanders ag<hi rend="superscript">t.</hi> the Administration&#8212; <persRef ref="allen-samuel">Allen</persRef> thinks I have suffered, for not turning my enemies out of office; particularly the <persRef ref="mclean-john2">Post-master General</persRef>. Committee of both Houses of Congress, notified me that they had formed Quorums and were ready to receive any Communication from me. Answered that I should make one at 12 to-morrow&#8212; <persRef key="rush-richard">M<hi rend="superscript">r</hi> Rush</persRef> read me the draft of his annual report on the finances&#8212; Very pleasing&#8212;corrected the message by the revised figures of his Report&#8212; <persRef ref="sanford-nathan">Sanford</persRef> says <persRef ref="vanburen-martin">Van Burren</persRef> is not coming. He is elected Governor by a minority, in <placeName type="nerHelper-added">New-York</placeName>. <persRef