# Assignment 2: Analyze XML Data with ElementTree

#### Code Sources: 
* GeeksForGeeks tutorial on [XML Parsing in Python](https://www.geeksforgeeks.org/xml-parsing-python/)
* OpenWritings tutorial on [Python - How to write XML file](https://openwritings.net/pg/python/python-how-write-xml-file)

***

## Course: Analyzing Structured Data with Pandas and ElementTree

### [Centre for Data, Culture & Society](https://www.cdcs.ed.ac.uk/)

#### Instructor Example by Lucy Havens

March 25, 2022

***

In [45]:
import csv
import urllib.request
from xml.dom import minidom
import xml.etree.ElementTree as ET

First we write **functions** that we'll use to access and parse (a.k.a. read, process) data in **Extensible Markup Language (XML)** format from a website.  Functions store lines of code you write under a name that allows you to quickly reference and run those lines of code without having to rewrite them.  As a result, functions improve the *efficiency* of your programming.
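
As a small illustration of the idea before the assignment's own functions, here is a minimal sketch of defining a function once and reusing it. The name `count_words` is a hypothetical helper made up for this example; it is not used elsewhere in the notebook.

```python
# Hypothetical helper (not part of the assignment): store one line of logic under a reusable name.
def count_words(text):
    """Return the number of whitespace-separated words in text."""
    return len(text.split())

# The same function can be run on different inputs without rewriting its body.
print(count_words("Extensible Markup Language"))   # 3
print(count_words("blue amulets and beads"))       # 4
```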

In [76]:
################# FUNCTION 1 #####################
# Input: file name (type=str) and URL
# Output: file of XML data from that website named as the input file name and XML tree
##################################################
def loadXMLFromURL(filename, url):
    # open the URL; the 'with' block closes the HTTP response once parsing is done
    with urllib.request.urlopen(url) as response_object:
        tree = ET.parse(response_object)
    root = tree.getroot()
    # create a formatted XML string
    xml_string = minidom.parseString(ET.tostring(root)).toprettyxml(indent="\t")
    # open a new, blank file in write mode ('w'); the 'with' block closes it automatically
    with open(filename, 'w', encoding='utf-8') as f:
        f.write(xml_string)                         # save the XML data string to file f
    print("Finished writing "+filename+"!")
    return tree

In [81]:
################# FUNCTION 2 #####################
# Input: ElementTree tree object of XML data
# Output: data as a list of dictionaries (type=dict)
##################################################
def parseXML(tree):
    root = tree.getroot()          # get the tree's root node (a.k.a. root element)
    items = []                     # create a list to add items (search results) to
    # iterate through the items in the tree of XML data with a for loop
    for elem in root.findall('./channel/item'):
        item_dict = {}             # create a dictionary to hold one item's data
        for child in elem:         # iterate through the children of elem
            item_dict[child.tag] = child.text   # map each child's tag name to its text
        items.append(item_dict)    # add the item's dictionary to the items list
    return items                   # return the list of dictionaries
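
To see what this parsing step produces without contacting a website, the sketch below applies the same loop to a short RSS string. The sample XML, the `example.org` links, and the names `sample_rss`, `sample_root`, and `sample_items` are made up for illustration; the real Europeana feed may contain additional tags.

```python
import xml.etree.ElementTree as ET

# Made-up RSS snippet with the same channel/item layout that parseXML looks for.
sample_rss = """<rss version="2.0">
  <channel>
    <title>Sample feed</title>
    <item>
      <guid>id-1</guid>
      <title>amulets</title>
      <link>https://example.org/1</link>
    </item>
    <item>
      <guid>id-2</guid>
      <title>beads</title>
      <link>https://example.org/2</link>
    </item>
  </channel>
</rss>"""

sample_root = ET.fromstring(sample_rss)   # parse a string; this returns the root element directly
sample_items = []
for elem in sample_root.findall('./channel/item'):
    # same logic as the inner loop of parseXML, written as a dict comprehension
    sample_items.append({child.tag: child.text for child in elem})

print(sample_items[0])
# {'guid': 'id-1', 'title': 'amulets', 'link': 'https://example.org/1'}
```

Each `item` element becomes one dictionary whose keys are the child tag names, which is the shape that `csv.DictWriter` expects for a row.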

In [82]:
################# FUNCTION 3 #####################
# Input: list of dictionaries (type=list), file name (type=str), and list (type=list) of 
#        column names (type=str) for the CSV file to be output (if none provided, use default)
# Output: file of the input data in comma-separated values format (CSV)
##################################################
europeana_tag_names = ['guid','title','description','link']
def writeCSV(items, filename, fields=europeana_tag_names):
    # open a new blank file to write to; newline='' prevents extra blank rows on Windows,
    # and the 'with' block closes the file automatically
    with open(filename, 'w', newline='', encoding='utf-8') as csvfile:
        writer = csv.DictWriter(csvfile, fieldnames=fields)  # create the csv module's writer object
        writer.writeheader()                                 # write fields as header (column names)
        writer.writerows(items)                              # write input data as rows
    print("Finished writing "+filename+"!")
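
One detail worth knowing about `csv.DictWriter`: by default it raises a `ValueError` if a row contains a key that is not listed in `fieldnames`, while keys missing from a row are filled with an empty string. The sketch below is one possible way to handle rows with extra keys, using the `extrasaction='ignore'` option; the sample rows and the `enclosure` key are invented for illustration, and an in-memory `io.StringIO` buffer stands in for a real file.

```python
import csv
import io

# Made-up rows: the first has an extra key, the second is missing 'link'.
rows = [
    {'guid': 'id-1', 'title': 'amulets', 'link': 'https://example.org/1', 'enclosure': None},
    {'guid': 'id-2', 'title': 'beads'},
]
fields = ['guid', 'title', 'link']

buffer = io.StringIO()   # in-memory stand-in for a file opened in write mode
# extrasaction='ignore' drops keys not in fields; restval='' fills in missing keys
writer = csv.DictWriter(buffer, fieldnames=fields, extrasaction='ignore', restval='')
writer.writeheader()
writer.writerows(rows)

csv_text = buffer.getvalue()
print(csv_text)
```

Whether to ignore extra keys or let the error surface is a design choice: ignoring them keeps the CSV to the chosen columns, while the default behaviour alerts you to tags you did not expect.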

In [83]:
################# FUNCTION 4 #####################
# Input: URL (if none given, use the default URL provided in the () below), XML file name, CSV file name
# Output: XML file and CSV file of data from an input URL
##################################################
europeana_search_blue = "https://api.europeana.eu/record/v2/opensearch.rss?searchTerms=blue&count=100&startIndex=100"
def getXMLandCSV(xmlname, csvname, url=europeana_search_blue):
    tree = loadXMLFromURL(xmlname, url)     # write an XML file and get the XML tree
    items = parseXML(tree)                  # parse the XML tree into a list of dictionaries
    writeCSV(items, csvname)                # write the parsed data to a CSV file

In [84]:
getXMLandCSV("europeana_search-blue_count-100_start-100.xml", "europeana_blue_examples.csv")

Finished writing europeana_search-blue_count-100_start-100.xml!
Finished writing europeana_blue_examples.csv!


Try opening the XML file in a text editor such as Atom and the CSV file as a spreadsheet in, for example, Microsoft Excel.  How does the data look?

Saving the XML data in a file is useful because it lets us recreate the ElementTree tree object for future programming tasks.  When we shut down this Jupyter Notebook, the `tree` variable is forgotten, but the file remains.

To create a tree from an XML file, instead of a URL, we can write:

In [44]:
xmlfile = "europeana_search-blue_count-100_start-100.xml"
tree = ET.parse(xmlfile)
root = tree.getroot()
titles = []
for elem in root.findall('./channel/item/title'):
    titles.append(elem.text)    # add each title's text to the list
print(titles[:5])

['amulets', 'amulet, leaf', 'amulets', 'beads', 'amulets']
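
Once the titles are in a list, a simple first analysis step is to count how often each one appears. This is a minimal sketch using `collections.Counter` from the standard library; `sample_titles` is a hand-copied stand-in for the first five titles printed above, so that the example runs on its own.

```python
from collections import Counter

# The five titles printed above, copied by hand so the sketch is self-contained.
sample_titles = ['amulets', 'amulet, leaf', 'amulets', 'beads', 'amulets']

title_counts = Counter(sample_titles)    # maps each distinct title to its frequency
print(title_counts.most_common(1))       # [('amulets', 3)]
```

In the notebook itself, the same call could be made on the full `titles` list instead of `sample_titles`.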
