### MODS XML & XPath Lab

1. To make sure things are working, adapt the code from class that we used to count the individual MODS records in the file. How many are there?

In [2]:
import xml.etree.ElementTree as ET
import os

In [8]:
# Pass the path directly so ElementTree opens and closes the file itself
mods = ET.parse('2018_lcwa_MODS_25.xml')
root = mods.getroot()

In [13]:
record_count = 0

for record in root: 
    record_count += 1

print(f"Answer: {record_count}")

Answer: 25
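
Looping over `root` counts every direct child, whatever its tag. A slightly stricter count selects the record elements by name. The sketch below assumes each record is a `mods:mods` element directly under the collection root; the inline sample and the names `SAMPLE`, `sample_root`, and `sample_count` are hypothetical stand-ins for the lab file.

```python
import xml.etree.ElementTree as ET

NS = {'mods': 'http://www.loc.gov/mods/v3'}

# Hypothetical two-record collection standing in for the lab file.
SAMPLE = """<modsCollection xmlns="http://www.loc.gov/mods/v3">
  <mods><titleInfo><title>First</title></titleInfo></mods>
  <mods><titleInfo><title>Second</title></titleInfo></mods>
</modsCollection>"""

sample_root = ET.fromstring(SAMPLE)

# Count only the direct children named mods:mods, ignoring any other elements.
sample_count = len(sample_root.findall('mods:mods', namespaces=NS))
print(f"Records: {sample_count}")
```

If the two counts differ on the real file, the root holds children that are not MODS records.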


2. These records contain 'subject' elements, but only some of them correspond to authorized headings in the Library of Congress Subject Headings (LCSH). Those are marked with the attribute authority='lcsh' on the 'subject' tag. Look through the 'subject' tags, identify only the ones that carry the LCSH attribute, then print the content of those subject headings. Use an XPath expression here.

In [14]:
ns = {
    'mods': 'http://www.loc.gov/mods/v3'
}

In [47]:
for subject in root.findall('.//mods:subject[@authority="lcsh"]/mods:topic', namespaces=ns):
    print(subject.tag, subject.attrib, subject.text)

{http://www.loc.gov/mods/v3}topic {} Animals
{http://www.loc.gov/mods/v3}topic {} Memes
{http://www.loc.gov/mods/v3}topic {} Memes
{http://www.loc.gov/mods/v3}topic {} Memes
{http://www.loc.gov/mods/v3}topic {} Web portals
{http://www.loc.gov/mods/v3}topic {} Political candidates
{http://www.loc.gov/mods/v3}topic {} Elections
{http://www.loc.gov/mods/v3}topic {} Politics and government
{http://www.loc.gov/mods/v3}topic {} Political candidates
{http://www.loc.gov/mods/v3}topic {} Elections
{http://www.loc.gov/mods/v3}topic {} Politics and government
{http://www.loc.gov/mods/v3}topic {} Political candidates
{http://www.loc.gov/mods/v3}topic {} Elections
{http://www.loc.gov/mods/v3}topic {} Politics and government
{http://www.loc.gov/mods/v3}topic {} Political candidates
{http://www.loc.gov/mods/v3}topic {} Elections
{http://www.loc.gov/mods/v3}topic {} Politics and government
{http://www.loc.gov/mods/v3}topic {} Political candidates
{http://www.loc.gov/mods/v3}topic {} Elections
{http://
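
Because the same heading appears in many records, a tally can be easier to read than the raw list. This is a minimal sketch that reuses the same XPath expression on a small hypothetical sample (`SAMPLE`, `sample_root`, and `topic_counts` are illustrative names, not part of the lab file) and counts the topic text with `collections.Counter`.

```python
import xml.etree.ElementTree as ET
from collections import Counter

NS = {'mods': 'http://www.loc.gov/mods/v3'}

# Hypothetical sample: three LCSH subjects and one subject without an authority.
SAMPLE = """<modsCollection xmlns="http://www.loc.gov/mods/v3">
  <mods>
    <subject authority="lcsh"><topic>Memes</topic></subject>
    <subject><topic>Internet humor</topic></subject>
  </mods>
  <mods>
    <subject authority="lcsh"><topic>Memes</topic></subject>
    <subject authority="lcsh"><topic>Elections</topic></subject>
  </mods>
</modsCollection>"""

sample_root = ET.fromstring(SAMPLE)

# Select topic elements whose parent subject carries authority="lcsh".
topics = sample_root.findall('.//mods:subject[@authority="lcsh"]/mods:topic', namespaces=NS)

# Tally the heading text so repeated headings collapse into one line each.
topic_counts = Counter(topic.text for topic in topics)
for heading, n in topic_counts.most_common():
    print(f"{heading}: {n}")
```

The subject without an `authority` attribute is skipped by the predicate, so it never reaches the tally.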

3. Data validation: check the local call number references to ensure that they are in the proper format (e.g., lcwaAddddddd). Try adapting the regular expression implementation that we used in class. Hint: this is very similar to what we did in class, but you will need to modify the regex: look carefully at the identifiers because some will be similar but won't match the expression we built in class. 

In [46]:
import re

# Keep an identifier only when its whole text is "lcwa", one capital letter, then seven digits
for identifier in root.findall('.//mods:identifier', namespaces=ns):
    if identifier.text and re.fullmatch(r"lcwa[A-Z]\d{7}", identifier.text):
        print(identifier.text)

lcwaN0010234
lcwaN0001999
lcwaN0003238
lcwaN0010144
lcwaN0010145
lcwaN0012178
lcwaN0012179
lcwaN0012180
lcwaN0012184
lcwaN0012195
lcwaN0010932
lcwaN0010933
lcwaN0010936
lcwaN0010937
lcwaN0010940
lcwaN0010888
lcwaN0010226
lcwaN0009692
lcwaN0009700
lcwaN0010401
lcwaE0008846
lcwaE0008263
lcwaE0008338
lcwaE0008918
lcwaE0008001
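
For validation it can help to see what fails as well as what passes. The sketch below applies the same pattern with `re.fullmatch` to a list of hypothetical identifier values (`candidates` and `is_valid_call_number` are made up for illustration) and sorts them into valid and invalid groups.

```python
import re

# "lcwa", exactly one capital letter, exactly seven digits.
CALL_NUMBER = re.compile(r"lcwa[A-Z]\d{7}")

# Hypothetical identifier values; only the first is in the expected format.
candidates = [
    "lcwaN0010234",                      # well-formed
    "lcwa0010234",                       # missing the letter
    "lcwaN001023",                       # only six digits
    "http://example.org/lcwaN0010234",   # pattern embedded in a longer string
    None,                                # element with no text
]

def is_valid_call_number(text):
    """Return True only if the entire string matches the call number pattern."""
    return text is not None and CALL_NUMBER.fullmatch(text) is not None

valid = [c for c in candidates if is_valid_call_number(c)]
invalid = [c for c in candidates if not is_valid_call_number(c)]

print("valid:", valid)
print("invalid:", invalid)
```

Using `fullmatch` rather than `search` or `findall` is what rejects the URL-style value, since the pattern has to cover the whole string.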


4. Data addition or modification: identify the local call numbers, then check to make sure all of them have appropriate attribute data attached. As in class, make sure that there is a type attribute as well as an updated attribute. You can add more if you want to.

In [53]:
from datetime import date

for identifier in root.findall('.//mods:identifier', namespaces=ns):
    if identifier.text and re.fullmatch(r"lcwa[A-Z]\d{7}", identifier.text):
        identifier.set('type', 'call_number')
        identifier.set('updated', 'yes')
        print(identifier.attrib)

{'type': 'call_number', 'updated': 'yes'}
{'type': 'call_number', 'updated': 'yes'}
{'type': 'call_number', 'updated': 'yes'}
{'type': 'call_number', 'updated': 'yes'}
{'type': 'call_number', 'updated': 'yes'}
{'type': 'call_number', 'updated': 'yes'}
{'type': 'call_number', 'updated': 'yes'}
{'type': 'call_number', 'updated': 'yes'}
{'type': 'call_number', 'updated': 'yes'}
{'type': 'call_number', 'updated': 'yes'}
{'type': 'call_number', 'updated': 'yes'}
{'type': 'call_number', 'updated': 'yes'}
{'type': 'call_number', 'updated': 'yes'}
{'type': 'call_number', 'updated': 'yes'}
{'type': 'call_number', 'updated': 'yes'}
{'type': 'call_number', 'updated': 'yes'}
{'type': 'call_number', 'updated': 'yes'}
{'type': 'call_number', 'updated': 'yes'}
{'type': 'call_number', 'updated': 'yes'}
{'type': 'call_number', 'updated': 'yes'}
{'type': 'call_number', 'updated': 'yes'}
{'type': 'call_number', 'updated': 'yes'}
{'type': 'call_number', 'updated': 'yes'}
{'type': 'call_number', 'updated':
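
The cell above imports `date` but stamps every identifier with the literal value 'yes'. One possible variant, shown here on a hypothetical two-record sample, checks whether `type` is already present before setting it and records the current date in `updated`. Using an ISO date for `updated` is an assumption, not something the task requires.

```python
import xml.etree.ElementTree as ET
from datetime import date

NS = {'mods': 'http://www.loc.gov/mods/v3'}

# Hypothetical sample: one identifier with no attributes, one that already has a type.
SAMPLE = """<modsCollection xmlns="http://www.loc.gov/mods/v3">
  <mods><identifier>lcwaN0010234</identifier></mods>
  <mods><identifier type="call_number">lcwaE0008846</identifier></mods>
</modsCollection>"""

sample_root = ET.fromstring(SAMPLE)
today = date.today().isoformat()  # e.g. a string shaped like YYYY-MM-DD

identifiers = sample_root.findall('.//mods:identifier', namespaces=NS)
for ident in identifiers:
    # Only fill in the type when it is missing, so existing values are preserved.
    if ident.get('type') is None:
        ident.set('type', 'call_number')
    # Record when the element was last touched (assumed convention).
    ident.set('updated', today)
    print(ident.text, ident.attrib)
```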

5. Save the updated records to a new file, which includes a valid namespace declaration for MODS, is well-formed MODS, has an appropriate XML document type declaration, and is valid XML.

In [54]:
ET.register_namespace('mods', 'http://www.loc.gov/mods/v3')

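# Write to a new file so the source records are not overwritten (the output name is chosen here for illustration)
mods.write('2018_lcwa_MODS_25_updated.xml', xml_declaration=True, encoding='utf-8', method='xml')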
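
A quick way to check the saved file is to read it back. The sketch below writes a small hypothetical tree to a temporary path (`SAMPLE`, `out_path`, and the file name are illustrative), confirms the XML declaration is on the first line, and re-parses the output. This only shows the file is well-formed: ElementTree does not validate against the MODS schema, so schema validity would need a separate tool.

```python
import os
import tempfile
import xml.etree.ElementTree as ET

MODS_NS = 'http://www.loc.gov/mods/v3'

# Ask ElementTree to serialize the MODS namespace with the "mods" prefix.
ET.register_namespace('mods', MODS_NS)

# Hypothetical one-record collection standing in for the updated lab data.
SAMPLE = """<modsCollection xmlns="http://www.loc.gov/mods/v3">
  <mods><identifier type="call_number">lcwaN0010234</identifier></mods>
</modsCollection>"""

tree = ET.ElementTree(ET.fromstring(SAMPLE))

# Write to a fresh temporary location rather than over any input file.
out_path = os.path.join(tempfile.mkdtemp(), 'sample_updated.xml')
tree.write(out_path, xml_declaration=True, encoding='utf-8', method='xml')

# The first line should be the XML declaration.
with open(out_path, encoding='utf-8') as f:
    first_line = f.readline().strip()
print(first_line)

# Re-parsing raises ParseError if the output is not well-formed.
reparsed = ET.parse(out_path).getroot()
print(reparsed.tag)
```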