# XML Parser and Dataframe Creator

The first step is to parse XML files into documents.

* 09/23/2020: At the moment, I am simplifying this step: each file will be its own document. This will allow me to focus on the...

Useful Resources:
* Nair, Deepesh, "[Processing XML in Python—ElementTree](https://towardsdatascience.com/processing-xml-in-python-elementtree-c8992941efd2)," Accessed Sept. 22, 2020.

In [1]:
# Import necessary libraries.
import re, glob, csv, sys, os
import pandas as pd
import xml.etree.ElementTree as ET

# Declare directory location to shorten filepaths later.
abs_dir = "/Users/quinn.wi/Documents/SemanticData/"

# Gather all .xml files using glob.
list_of_files = glob.glob(abs_dir + "Data/JQA_papers/*/*.xml")

## Define Functions

In [2]:
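'''
Arguments of Functions:

    namespace: dictionary mapping the prefix "ns" to the file's namespace URI (see get_namespace).

    ancestor: element to search from (the root or a document-level element).

    xpath_as_string: XPath expression, relative to ancestor, written as a string.

    attrib_val_str: name of the attribute whose value is returned.

'''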

# Read in file and get root of XML tree.
def get_root(xml_file):
    tree = ET.parse(xml_file)
    root = tree.getroot()
    return root


# Get namespace of individual file from root element.
def get_namespace(root):
    namespace = re.match(r"{(.*)}", str(root.tag))
    ns = {"ns":namespace.group(1)}
    return ns


# Get document id.
def get_document_id(ancestor, attrib_val_str):
    doc_id = ancestor.get(attrib_val_str)
    return doc_id


# Get date of document.
def get_date_from_attrValue(ancestor, xpath_as_string, attrib_val_str, namespace):
    date = ancestor.find(xpath_as_string, namespace).get(attrib_val_str)
    return date


# Get people referenced in document.
def get_peopleList_from_attrValue(ancestor, xpath_as_string, attrib_val_str, namespace):
    people_list = []
    for elem in ancestor.findall(xpath_as_string, namespace):
        person = elem.get(attrib_val_str)
        people_list.append(person)
    # Return the list as a comma-separated string to be written to the output file. Can be split later.
    return ','.join(people_list)

    
# Get plain text of every element matched by xpath_as_string.
def get_textContent(ancestor, xpath_as_string, namespace):
    text_list = []
    for elem in ancestor.findall(xpath_as_string, namespace):
        # Serialize the element as plain text, dropping the markup.
        text = ET.tostring(elem, encoding='unicode', method='text')

        # Add text (cleaned of additional whitespace) to text_list.
        text_list.append(re.sub(r'\s+', ' ', text))

    # Return concatenated text list.
    return ' '.join(text_list)
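
To check the helper functions above on something small, the sketch below runs the same steps against an in-memory XML string. The sample markup, its namespace URI, and the `sample_xml` variable are assumptions made for illustration; they are not taken from the JQA files.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical sample; the real files are assumed to be namespaced in a similar way.
sample_xml = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <div type="entry" xml:id="sample-entry-01">
    <bibl><date when="1825-01-01"/></bibl>
    <div type="docbody">
      <p>Met <persRef ref="doe-jane">Mrs. Doe</persRef>
         at   noon.</p>
    </div>
  </div>
</TEI>"""

root = ET.fromstring(sample_xml)

# ElementTree stores tags as "{namespace-uri}localname".
match = re.match(r"{(.*)}", root.tag)
ns = {"ns": match.group(1)}

entry = root.find('.//ns:div[@type="entry"]', ns)

# xml:id lives in the reserved XML namespace.
entry_id = entry.get("{http://www.w3.org/XML/1998/namespace}id")
date = entry.find("./ns:bibl/ns:date", ns).get("when")
people = ",".join(e.get("ref") for e in entry.findall(".//ns:p/ns:persRef", ns))

# method="text" drops the markup; the regex collapses runs of whitespace.
paragraph = entry.find('./ns:div[@type="docbody"]/ns:p', ns)
text = re.sub(r"\s+", " ", ET.tostring(paragraph, encoding="unicode", method="text"))

print(entry_id, date, people, repr(text))
```

`ET.tostring` also serializes the whitespace that follows the element's closing tag, so a trailing space can remain after the substitution; calling `.strip()` on the result removes it if that matters.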

## Declare Variables

In [3]:
# Declare regex to simplify file paths below
regex = re.compile(r'.*/\d{4}/(.*)')

# Declare document level of file. Requires root starting point ('.').
doc_as_xpath = './/ns:div[@type="entry"]'

# Declare date element of each document.
date_path = './ns:bibl/ns:date[@when]'

# Declare person elements in each document.
person_path = './/ns:p/ns:persRef[@ref]'

# Declare text level within each document.
text_path = './ns:div[@type="docbody"]/ns:p'
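
The paths above combine three ideas: a namespace prefix, a `./` or `.//` starting point, and an attribute predicate. The following is a minimal sketch of how those pieces behave in `ElementTree`, using invented markup and an assumed namespace URI (`http://example.org/ns`) rather than the project's data.

```python
import xml.etree.ElementTree as ET

# Hypothetical markup with an assumed namespace URI.
sample_xml = """<root xmlns="http://example.org/ns">
  <div type="entry"><div type="docbody"><p>one</p><p>two</p></div></div>
  <div type="entry"><div type="docbody"><p>three</p></div></div>
  <div type="note"><p>skipped</p></div>
</root>"""

root = ET.fromstring(sample_xml)
ns = {"ns": "http://example.org/ns"}

# './/' searches all descendants; the predicate keeps only type="entry".
entries = root.findall('.//ns:div[@type="entry"]', ns)

# './' starts at the current element's direct children.
paragraphs_per_entry = [
    len(entry.findall('./ns:div[@type="docbody"]/ns:p', ns)) for entry in entries
]

# Without the predicate, every div at any depth is matched.
all_divs = root.findall(".//ns:div", ns)

print(len(entries), paragraphs_per_entry, len(all_divs))
```

Searching each entry with a relative path keeps paragraphs from different entries apart, which is why the text path starts from the entry element instead of the root.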

## Parse Documents

In [4]:
%%time

# Open/Create file to write contents.
with open(abs_dir + 'Output/Dataframes/ParsedXML/JQA_dataframe.txt', 'w', encoding='utf-8') as outFile:

    # Write headers for table.
    outFile.write('file' + '\t' + 'entry' + '\t' + 'date' + '\t' + \
                  'people' + '\t' + 'text' + '\n')

    # Loop through each file within a directory.
    for file in list_of_files:

        # Call functions to create necessary variables and grab content.
        root = get_root(file)
        ns = get_namespace(root)

        for eachDoc in root.findall(doc_as_xpath, ns):
            # Call functions.
            entry = get_document_id(eachDoc, '{http://www.w3.org/XML/1998/namespace}id')
            date = get_date_from_attrValue(eachDoc, date_path, 'when', ns)
            people = get_peopleList_from_attrValue(eachDoc, person_path, 'ref', ns)
            text = get_textContent(eachDoc, text_path, ns)

            # Write results in tab-separated format.
            # Note: .groups() returns a tuple, so the 'file' column holds its string form,
            # e.g. "('JQADiaries-v49-1825-01-p795.xml',)"; .group(1) would give the bare file name.
            outFile.write(str(regex.search(file).groups()) + '\t' +  entry + \
                          '\t' + date + '\t' + people + '\t' + text + '\n')

CPU times: user 532 ms, sys: 14 ms, total: 546 ms
Wall time: 551 ms
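
The cell above builds each row by concatenating strings with `'\t'`. As an alternative sketch (not what the notebook does), the standard-library `csv` module can write the same tab-separated layout and will quote any field that happens to contain a tab or a quote character. The rows below are hypothetical, and `io.StringIO` stands in for the output file.

```python
import csv
import io

# Hypothetical rows standing in for parsed entries.
rows = [
    ("file-a.xml", "entry-01", "1825-01-01", "", "Plain text."),
    ("file-a.xml", "entry-02", "1825-01-02", "doe-jane,roe-richard",
     'Text with a\ttab and a "quote".'),
]

buffer = io.StringIO()  # stands in for the output file
writer = csv.writer(buffer, delimiter="\t", lineterminator="\n")
writer.writerow(["file", "entry", "date", "people", "text"])
writer.writerows(rows)

# Reading back with the same delimiter recovers the original fields.
buffer.seek(0)
recovered = list(csv.reader(buffer, delimiter="\t"))
print(recovered[2][4])
```

Because `get_textContent` already collapses whitespace, tabs should not normally appear in the text column; the `csv` writer is only a guard in case another field ever contains one.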


## Import Dataframe

In [5]:
dataframe = pd.read_csv(abs_dir + 'Output/Dataframes/ParsedXML/JQA_dataframe.txt', sep = '\t')

dataframe

| | file | entry | date | people | text |
|---|---|---|---|---|---|
| 0 | ('JQADiaries-v49-1825-01-p795.xml',) | jqadiaries-v49-1825-01-01 | 1825-01-01 | | 1. VI:30. H. Humphreys here, for Methodist Chu... |
| 1 | ('JQADiaries-v49-1825-01-p795.xml',) | jqadiaries-v49-1825-01-02 | 1825-01-02 | | 2. VII:15— Heard Lynde at the Capitol—late. Ca... |
| 2 | ('JQADiaries-v49-1825-01-p795.xml',) | jqadiaries-v49-1825-01-03 | 1825-01-03 | | 3. VII. I called at M. Van-Buren’s lodgings—ou... |
| 3 | ('JQADiaries-v49-1825-01-p795.xml',) | jqadiaries-v49-1825-01-04 | 1825-01-04 | | 4. VI:30. W. Findlay here; Statesman Newspaper... |
| 4 | ('JQADiaries-v49-1825-01-p795.xml',) | jqadiaries-v49-1825-01-05 | 1825-01-05 | | 5. V:15. Mills came and took the draft of the ... |
| ... | ... | ... | ... | ... | ... |
| 2184 | ('JQADiaries-v31-1821-02-p508.xml',) | jqadiaries-v31-1821-02-25 | 1821-02-25 | palfrey-john,forsyth-john,hopkinson-joseph,cli... | 25. VII: Attended Church at the Bath-room, and... |
| 2185 | ('JQADiaries-v31-1821-02-p508.xml',) | jqadiaries-v31-1821-02-26 | 1821-02-26 | lowndes-william,randolph-john,dickinson-john,n... | 26. IV:15. The pressure of business upon me pu... |
| 2186 | ('JQADiaries-v31-1821-02-p508.xml',) | jqadiaries-v31-1821-02-27 | 1821-02-27 | baker-anthony,vanness-cornelius,porter-peter,w... | 27. VII:Mr Baker the British Consul General ca... |
| 2187 | ('JQADiaries-v31-1821-02-p508.xml',) | jqadiaries-v31-1821-02-28 | 1821-02-28 | edwards-ninian,walton-george,walker-freeman,cl... | 28. VII:Mr Ninian Edwards the Senator from Ill... |
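
The `people` column is stored as a comma-separated string so that it can be split later. A minimal sketch of that later step follows, assuming a hypothetical two-row sample in the same tab-separated layout; the `people_list` column name is introduced here for illustration only.

```python
import io
import pandas as pd

# Hypothetical two-row sample in the same tab-separated layout.
sample = (
    "file\tentry\tdate\tpeople\ttext\n"
    "a.xml\tentry-01\t1825-01-01\t\tNo one mentioned.\n"
    "a.xml\tentry-02\t1825-01-02\tdoe-jane,roe-richard\tTwo people mentioned.\n"
)

df = pd.read_csv(io.StringIO(sample), sep="\t", parse_dates=["date"])

# Empty 'people' fields are read as NaN; replace them before splitting.
df["people_list"] = df["people"].fillna("").apply(
    lambda s: s.split(",") if s else []
)

print(df[["entry", "date", "people_list"]])
```

Entries with no tagged people end up with an empty list instead of `NaN`, which keeps later steps such as counting or exploding the lists uniform.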
