# Overview of Markup Tags in John Quincy Adams Papers

This notebook surveys encoding practices across a directory of XML files, specifically the John Quincy Adams papers. The goal is to get a general sense of the encoding patterns and of the semantic relations implicit in the encoding.

General Guide for Parsing XML with Python:

Nair, Deepesh, "[Processing XML in Python—ElementTree](https://towardsdatascience.com/processing-xml-in-python-elementtree-c8992941efd2)," Accessed Sept. 22, 2020.

## Notes

1. Multiple namespaces in each file: namely, mhs & tei. These namespaces seem reserved for attributes, not elements?

The mhs namespace provides bibliographic information such as page order and volume.
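
One way to check the question raised in the note above (are the namespaces used on attributes, on elements, or both?) is to list the declared namespaces and then compare the namespaces found on tags with those found on attribute names. The following is a minimal sketch, not the notebook's own method. The sample document and the `http://example.org/mhs` URI are placeholders, and `declared_namespaces` is a hypothetical helper name.

```python
import io
import xml.etree.ElementTree as ET

# Hypothetical sample document. The mhs URI is a placeholder, not the
# namespace URI actually used in the JQA files.
sample = b"""<TEI xmlns="http://www.tei-c.org/ns/1.0" xmlns:mhs="http://example.org/mhs">
  <text><body><div mhs:volume="23"><p>Sample entry.</p></div></body></text>
</TEI>"""


def declared_namespaces(source):
    """Return {prefix: uri} for every namespace declaration in the source."""
    return {prefix: uri
            for _, (prefix, uri) in ET.iterparse(source, events=["start-ns"])}


namespaces = declared_namespaces(io.BytesIO(sample))

# ElementTree stores qualified names as '{uri}local', so the namespace of a
# tag or attribute is whatever sits between the braces.
root = ET.fromstring(sample)
element_namespaces = {el.tag[1:].split("}")[0]
                      for el in root.iter() if el.tag.startswith("{")}
attribute_namespaces = {key[1:].split("}")[0]
                        for el in root.iter()
                        for key in el.attrib if key.startswith("{")}

print(namespaces)
print("on elements:", element_namespaces)
print("on attributes:", attribute_namespaces)
```

Run against the real files, a namespace that appears only in `attribute_namespaces` would support the reading that it is reserved for attributes.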

In [125]:
# Import necessary libraries.
import re, glob, csv, sys, os
import pandas as pd
import numpy as np
from collections import Counter
import xml.etree.ElementTree as ET

# Declare directory location to shorten filepaths later.
abs_dir = "/Users/quinn.wi/Documents/SemanticData/"

# Gather all .xml files using glob.
list_of_files = glob.glob(abs_dir + "Data/JQA_papers/*/*.xml")

## Gather All Elements

This cell gathers each element in the XML directory and counts their frequencies.

In [2]:
%%time

# Create an empty list to store tags as they're gathered.
tags = []

# Loop through each file within a directory.
for file in list_of_files:
    
#     Read in file and get root and namespace.
#     Using root to get namespace does not assume all files share the same namespace.
    tree = ET.parse(file)
    root = tree.getroot()
    namespace = re.match(r"{(.*)}", str(root.tag)) # namespace URI of the root element
    ns = {"ns":namespace.group(1)}
    
#     Loop through every element (child of root).
    for element in root.iter():
#         Gather element tag and remove namespace.
        elem_tag = element.tag.replace(str('{' + namespace.group(1) + '}'), '')
#         Append tag to 'tags' list
        tags.append(elem_tag)

# Count occurrences of items in 'tags' list.
counter = Counter(tags)

# Convert Counter object to dataframe for easier use and functionality.
elements = pd.DataFrame.from_dict(counter, orient = 'index') \
    .reset_index() \
    .rename(columns = {'index':'tags', 0:'count'}) \
    .sort_values(by = ['count'], ascending = False)

elements  

CPU times: user 239 ms, sys: 11.7 ms, total: 251 ms
Wall time: 251 ms


| | tags | count |
|---|---|---|
| 21 | persRef | 28604 |
| 17 | hi | 14660 |
| 10 | div | 4465 |
| 12 | date | 4465 |
| 6 | p | 3406 |
| 11 | bibl | 2276 |
| 13 | head | 2238 |
| 14 | author | 2189 |
| 15 | editor | 2189 |
| 20 | note | 2090 |
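
The cell above strips only the root element's namespace from each tag, so an element in any other namespace would keep its `{uri}` prefix in the counts. A minimal sketch of a namespace-agnostic alternative follows. The sample XML, the `http://example.org/other` URI, and the `local_name` helper are illustrative assumptions, not part of the JQA data or the notebook.

```python
from collections import Counter
import xml.etree.ElementTree as ET

# Hypothetical sample with one element (x:note) outside the root namespace.
sample = """<TEI xmlns="http://www.tei-c.org/ns/1.0" xmlns:x="http://example.org/other">
  <text><body><div>
    <p>Dined with <persRef ref="calhoun-john">Calhoun</persRef>.</p>
    <x:note>editorial note</x:note>
  </div></body></text>
</TEI>"""


def local_name(tag):
    """Drop a leading '{namespace}' from an ElementTree tag, whatever the namespace."""
    return tag.rsplit("}", 1)[-1]


root = ET.fromstring(sample)
tag_counts = Counter(local_name(el.tag) for el in root.iter())

print(tag_counts.most_common())
```

Note the trade-off: stripping every namespace merges same-named elements from different namespaces into one count, which may or may not be wanted.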


## Examine Unique persRef

This cell explores the unique identifiers for persRef. A report can be produced to understand the general patterns and practices of encoding.

In [110]:
%%time

# Gather persons (unique ids of persRef)
persons_dict = {}

# Loop through each file within a directory.
for file in list_of_files:
    
#     Read in file and get root and namespace.
#     Using root to get namespace does not assume all files share the same namespace.
    tree = ET.parse(file)
    root = tree.getroot()
    namespace = re.match(r"{(.*)}", str(root.tag)) # namespace URI of the root element
    ns = {"ns":namespace.group(1)}
    
#     Loop through every persRef element.
    for elem in root.findall('.//ns:persRef', ns):
        
#         Assign variables to unique ID (ref) and text (content)
        ref = elem.get('ref')
        content = re.sub(r'\s+', ' ', str(elem.text))
        
#         Create a dictionary where keys are unique identifiers
#         and values are list of extracted content.
        persons_dict.setdefault(ref, []).append(content)

# Count how many mentions each unique id has. Every occurrence is counted,
# so repeated content strings (duplicates) are included.
# Rows are built first because DataFrame.append was removed in pandas 2.0.
person_counts = pd.DataFrame(
    [{'ref':k, 'count':len([item for item in v if item])}
     for k, v in persons_dict.items()],
    columns = ['ref', 'count'])

# Sort values from high to low.
person_counts.sort_values(by = ['count'], ascending=False)

CPU times: user 5.71 s, sys: 39.5 ms, total: 5.75 s
Wall time: 5.84 s


| | ref | count |
|---|---|---|
| 27 | | 2508 |
| 46 | calhoun-john | 427 |
| 47 | southard-samuel | 324 |
| 331 | adams-george | 302 |
| 26 | wyer-edward | 247 |
| ... | ... | ... |
| 1282 | johnson-unknown8;johnson-unknown9 | 1 |
| 3258 | kerr-unknown | 1 |
| 1280 | menard-unknown | 1 |
| 3260 | turner-b | 1 |
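
The counts above include every mention, so they do not show how many different surface forms point to the same identifier. A short sketch of a fuller report follows. `persons_dict_example` is a hypothetical stand-in for `persons_dict`, and the column names are illustrative choices.

```python
from collections import Counter
import pandas as pd

# Hypothetical stand-in for persons_dict: ref id -> list of extracted contents.
persons_dict_example = {
    "calhoun-john": ["Calhoun", "Calhoun", "M"],
    "wyer-edward": ["Wyer"],
    None: ["wife", "M"],
}

rows = []
for ref, contents in persons_dict_example.items():
    forms = Counter(contents)
    rows.append({
        "ref": ref,
        "mentions": len(contents),                       # duplicates included
        "distinct_forms": len(forms),                    # duplicates collapsed
        "most_common_form": forms.most_common(1)[0][0],
    })

report = pd.DataFrame(rows).sort_values(by="mentions", ascending=False)
print(report)
```

A large gap between `mentions` and `distinct_forms` suggests a stable naming habit, while many distinct forms for one id may be worth a closer look.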


## Examine Encoding Anomalies

In the overview of unique persRef identifiers, two encoding practices stand out. First, a missing ref value ('None') occurs very frequently (2,508 times), which could impede efforts to differentiate individuals or, with some contextualization, could be an encoding practice that explains phenomena in JQA. Second, one entry possibly conflates three people (adams-charles2;adams-john2;adams-george).

In both cases (the None values and the conflated identifiers), it would be useful to know where the encoding occurred so that it can be reviewed and then revised or accepted. The following cells track down where these instances occur. Ultimately, though, the results above could serve as variables in XQuery, which would allow them to be called as functions rather than stored.
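
For review purposes it may help to keep the markup itself alongside the document name. The sketch below flags both cases and serializes each flagged element with `ET.tostring`. The sample sentence, the document name `example.xml`, and the helper names (`review_reason`, `markup_of`, `flag_pers_refs`) are hypothetical. ElementTree does not record line numbers, so the serialized element stands in for a location.

```python
import copy
import xml.etree.ElementTree as ET

TEI_NS = "http://www.tei-c.org/ns/1.0"

# Hypothetical sample, not a quotation from the diaries.
sample = (
    '<TEI xmlns="http://www.tei-c.org/ns/1.0"><text><body><div><p>'
    'Visit from <persRef>M<hi rend="superscript">r</hi> Bagot</persRef> and my '
    '<persRef ref="adams-george;adams-john2">two eldest Sons</persRef>, then '
    '<persRef ref="calhoun-john">Calhoun</persRef>.'
    '</p></div></body></text></TEI>'
)


def review_reason(elem):
    """Return why a persRef needs review, or None if it looks fine."""
    ref = elem.get("ref")
    if ref is None:
        return "missing ref"
    if ";" in ref:
        return "multiple ids"
    return None


def markup_of(elem):
    """Serialize an element without the text that follows its closing tag."""
    clone = copy.copy(elem)   # shallow copy, so the original tail is untouched
    clone.tail = None
    return ET.tostring(clone, encoding="unicode")


def flag_pers_refs(root, doc_name):
    """Collect one record per persRef that needs review."""
    flagged = []
    for elem in root.findall(".//ns:persRef", {"ns": TEI_NS}):
        reason = review_reason(elem)
        if reason:
            flagged.append({"doc": doc_name,
                            "reason": reason,
                            "markup": markup_of(elem)})
    return flagged


flagged = flag_pers_refs(ET.fromstring(sample), "example.xml")
for row in flagged:
    print(row)
```

The serialized markup carries generated prefixes such as `ns0:` because ElementTree rewrites namespace prefixes on output. The content is otherwise unchanged.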

### Exploring 'None' attribute values

In [130]:
%%time

# Declare regex to simplify file paths below
regex = re.compile(r'.*/\d{4}/(.*)')

# Collect one row per persRef that lacks a ref attribute.
none_rows = []

# Loop through each file within a directory.
for file in list_of_files:
    
#     Read in file and get root and namespace.
#     Using root to get namespace does not assume all files share the same namespace.
    tree = ET.parse(file)
    root = tree.getroot()
    namespace = re.match(r"{(.*)}", str(root.tag)) # namespace URI of the root element
    ns = {"ns":namespace.group(1)}
    
#     Loop through every persRef element.
    for elem in root.findall('.//ns:persRef', ns):
        if elem.get('ref') is None:
            none_rows.append({'ref':re.sub(r'\s+', ' ', str(elem.text)),
                              'doc':regex.search(file).group(1)})

# Build the dataframe once; DataFrame.append was removed in pandas 2.0.
none_pers = pd.DataFrame(none_rows, columns = ['ref', 'doc'])
            
none_pers.groupby(['ref', 'doc']) \
    .size() \
    .to_frame(name = 'occurence') \
    .reset_index()

CPU times: user 1.85 s, sys: 17 ms, total: 1.87 s
Wall time: 1.89 s


| | ref | doc | occurence |
|---|---|---|---|
| 0 | M | JQADiaries-v23-1821-05-p359.xml | 1 |
| 1 | M | JQADiaries-v35-1824-07-p213.xml | 1 |
| 2 | Bagot’s | JQADiaries-v31-1821-01-p478.xml | 1 |
| 3 | Baron De Neuville | JQADiaries-v32-1821-11-p121.xml | 1 |
| 4 | Baron De Neuville | JQADiaries-v33-1822-03-p026.xml | 1 |
| ... | ... | ... | ... |
| 625 | wife | JQADiaries-v31-1821-01-p478.xml | 1 |
| 626 | wife | JQADiaries-v32-1821-09-p082.xml | 1 |
| 627 | wife | JQADiaries-v34-1823-10-p136.xml | 1 |
| 628 | wife | JQADiaries-v34-1824-06-p346.xml | 1 |
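
Values such as `M` in the table may be truncated. `elem.text` returns only the text that precedes an element's first child, so a name whose abbreviation is wrapped in a nested element would be cut off at that point. Whether the JQA files nest an element there (for example a `hi` for a superscript) is an assumption, and the snippet below is a hypothetical illustration. It compares `elem.text` with `itertext()`, which walks the element's full text content.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical persRef with a nested element after the first letter.
snippet = '<persRef>M<hi rend="superscript">r</hi> Bagot</persRef>'
elem = ET.fromstring(snippet)

# What the cells above record: only the text before the first child element.
text_only = re.sub(r"\s+", " ", str(elem.text))

# Alternative: join every text node inside the element, children included.
full_text = re.sub(r"\s+", " ", "".join(elem.itertext())).strip()

print(repr(text_only))
print(repr(full_text))
```

If the truncation does occur in the real data, swapping `str(elem.text)` for the `itertext()` form would make the extracted contents easier to review.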


### Finding cases of conflated persons

In [141]:
%%time

# Declare regex to simplify file paths below.
file_regex = re.compile(r'.*/\d{4}/(.*)')
# Declare regex to find conflated cases.
ref_regex = re.compile(r'.*;.*')

# Collect one row per persRef whose ref holds more than one identifier.
confl_rows = []

# Loop through each file within a directory.
for file in list_of_files:
    
#     Read in file and get root and namespace.
#     Using root to get namespace does not assume all files share the same namespace.
    tree = ET.parse(file)
    root = tree.getroot()
    namespace = re.match(r"{(.*)}", str(root.tag)) # namespace URI of the root element
    ns = {"ns":namespace.group(1)}
    
#     Loop through every persRef element.
    for elem in root.findall('.//ns:persRef', ns):
        ref = elem.get('ref')
        
        if ref_regex.search(str(ref)):
            
            confl_rows.append({'ref':ref,
                               'doc':file_regex.search(file).group(1),
                               'content':re.sub(r'\s+', ' ', str(elem.text))
                               # include field for location &/or encoding itself
                              })

# Build the dataframe once; DataFrame.append was removed in pandas 2.0.
confl_pers = pd.DataFrame(confl_rows, columns = ['ref', 'doc', 'content'])
            
confl_pers

CPU times: user 488 ms, sys: 8.34 ms, total: 496 ms
Wall time: 499 ms


| | ref | doc | content |
|---|---|---|---|
| 0 | brown-jacob;browne-pamela | JQADiaries-v49-1825-08-p891.xml | Gen |
| 1 | frye-nathaniel;frye-carolina | JQADiaries-v49-1825-08-p891.xml | M |
| 2 | smith-william-steuben;smith-catherine-johnson | JQADiaries-v49-1825-08-p891.xml | M |
| 3 | wirt-william;wirt-elizabeth | JQADiaries-v49-1825-08-p891.xml | M |
| 4 | hellen-mary;adams-elizabeth | JQADiaries-v49-1825-08-p891.xml | young Ladies |
| ... | ... | ... | ... |
| 406 | adams-george;adams-john2 | JQADiaries-v31-1821-02-p508.xml | two eldest Sons |
| 407 | way-andrew;way-andrew2 | JQADiaries-v31-1821-02-p508.xml | the Way’s |
| 408 | morris-thomas;livingston-john | JQADiaries-v31-1821-02-p508.xml | Marshals of the two Districts of New-York |
| 409 | stewart-charles;stewart-delia | JQADiaries-v31-1821-02-p508.xml | Commodore and M |
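
Several of these rows read like one mention that refers to more than one person ("two eldest Sons"), so the semicolon may be a deliberate convention rather than an error. That reading is an assumption. If it holds, per-person counts could split each ref on the semicolon, as in the sketch below. `confl_example` reuses two rows from the table above, and the `person` column name is an illustrative choice.

```python
import pandas as pd

# Two rows copied from the table above, standing in for confl_pers.
confl_example = pd.DataFrame({
    "ref": ["adams-george;adams-john2", "wirt-william;wirt-elizabeth"],
    "doc": ["JQADiaries-v31-1821-02-p508.xml", "JQADiaries-v49-1825-08-p891.xml"],
    "content": ["two eldest Sons", "M"],
})

# Split each ref on ';' and give every identifier its own row.
per_person = (
    confl_example
    .assign(person=confl_example["ref"].str.split(";"))
    .explode("person")
    .reset_index(drop=True)
)

print(per_person[["person", "doc", "content"]])
```

Each original row becomes one row per identifier, with `doc` and `content` repeated, so these mentions can be added to the per-person totals.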
