# Denison CS-181/DA-210 Homework

---

## XML Procedural Operations Exercises

In [90]:
import os
import json
import sys
import lxml
from lxml import etree

def add_modules():
    """
    Starting at the current directory and proceeding up the file system
    tree, search for a directory named `modules`.  If found, and if not
    already there, add to the Python module search path.
    
    Params: None
    
    Return: None
    """
    directory = "."
    levels = 0
    while not os.path.isdir(os.path.join(directory, "modules")) and \
          levels < 5:
        directory = os.path.join(directory, "..")
        levels += 1
    module_path = os.path.abspath(os.path.join(directory, "modules"))
    if os.path.isdir(module_path):
        if module_path not in sys.path:
            sys.path.append(module_path)

add_modules()
import util

datadir = util.resolve_dir("hierarchicaldata")
myparser = etree.XMLParser(remove_blank_text=True)

**Q1** Write a function:
    
    buildXML(filename, datadir=".", parser=None)
    
that builds a path from the given `filename` and `datadir`, parses the XML file with the passed `parser`, if any, and returns the Element at the **root** of the tree.  If `parser` is not passed, the standard `XMLParser` (without removing blank text) should be used.

If the file is not found, or if the parse is unsuccessful (because the XML is not "well formed"), the function should return `None`. Remember that when a parse is unsuccessful, the `etree` module raises an exception.  That means the `parse()` invocation should be indented inside a `try` block.  The `try` block is followed by an `except Exception as e:` line, and within that, you return `None`.  If no exception is raised, execution proceeds beyond the `try`/`except` block, and that is where you return the root of the parsed tree.

In [91]:
# Solution cell
def buildXML(filename, datadir=".", parser=None):
    '''
    Parse an XML file and return the root Element of its tree.
    
    Parameters: filename: the name of the XML file
                datadir: directory containing the file; defaults
                to the current working directory
                parser: the parser to use; if none is given,
                a standard XMLParser is used
                
    Return: root: root Element of the parsed tree
            None: if no file exists at the given location or
            the XML is not well formed
    '''
    path = os.path.join(datadir, filename)
    if not os.path.isfile(path):
        return None
    if parser is None:
        parser = etree.XMLParser()
    # The parse sits inside the try block for both the default and the
    # passed parser, so malformed XML always yields None.
    try:
        tree = etree.parse(path, parser)
    except Exception as e:
        return None
    return tree.getroot()

In [92]:
# Testing cell

assert True
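
The exception handling that Q1 relies on can be observed without any file on disk. The following is a minimal sketch, not the Q1 solution: it parses in-memory strings with `etree.fromstring`, the helper name `parse_or_none` is hypothetical, and the import falls back to the standard library's `ElementTree` (an assumption made only so the cell runs where `lxml` is not installed).

```python
try:
    from lxml import etree
except ImportError:
    # Assumption: stdlib ElementTree stands in; it offers the calls used here.
    import xml.etree.ElementTree as etree

def parse_or_none(xml_text):
    """Return the root Element of xml_text, or None if it is not well formed."""
    try:
        root = etree.fromstring(xml_text)
    except Exception:
        # Malformed XML raises an exception; translate it into None.
        return None
    # Reached only when the parse succeeded.
    return root

good = parse_or_none("<a><b>text</b></a>")
bad = parse_or_none("<a><b>text</a>")   # mismatched closing tag
print(good.tag, bad)
```

The `return root` sits after the `try`/`except` block, so it runs only when no exception was raised.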

As preparation for the following sequence of exercises, we need to parse the file `widombooks.xml` in the data directory and assign to `wroot` the root `Element` object.  This is done for you in the following code cell.

In [93]:
wroot = buildXML("widombooks.xml", datadir, myparser)
assert isinstance(wroot, lxml.etree._Element)
util.print_xml(wroot, nlines=12)

<Bookstore>
  <Book ISBN="ISBN-0-13-713526-2" Price="85" Edition="3rd">
    <Title>A First Course in Database Systems</Title>
    <Authors>
      <Author>
        <First_Name>Jeffrey</First_Name>
        <Last_Name>Ullman</Last_Name>
      </Author>
      <Author>
        <First_Name>Jennifer</First_Name>
        <Last_Name>Widom</Last_Name>
      </Author>


**Use the `lxml` operational interface for solving the following problems, not XPath.**

**Q2** Using the Element `wroot` from above, get the attributes of the first child tagged 'Magazine', and store your answer as a dictionary `myAttrib`.  Try to do this with a single assignment expression.

In [94]:
myAttrib = wroot.find("Magazine").attrib
myAttrib

{'Month': 'January', 'Year': '2009'}

In [95]:
# Testing cell

assert True
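
The `.attrib` value of an Element supports the usual dictionary operations. The sketch below illustrates a few of them on a small in-memory document; the `Shop` root element is made up for the example, and the import fallback to the standard library is an assumption so the cell runs without `lxml`.

```python
try:
    from lxml import etree
except ImportError:
    # Assumption: stdlib ElementTree stands in; it offers the calls used here.
    import xml.etree.ElementTree as etree

root = etree.fromstring('<Shop><Magazine Month="January" Year="2009"/></Shop>')
attrs = root.find("Magazine").attrib

month = attrs["Month"]           # lookup by attribute name
edition = attrs.get("Edition")   # missing attribute gives None, not KeyError
plain = dict(attrs)              # independent copy as an ordinary dict

print(month, edition, plain)
```

Copying with `dict(...)` is useful when the attributes should not change if the Element is modified later.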

**Q3** Using the Element `wroot`, use iteration or a list comprehension to obtain a list of the **tags** of all the children of `wroot`.  Assign to `taglist`.

In [96]:
taglist = [child.tag for child in wroot]
taglist

['Book',
 'Book',
 'Book',
 'Book',
 'Magazine',
 'Magazine',
 'Magazine',
 'Magazine']

In [97]:
# Testing cell

assert True
assert isinstance(taglist, list)
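
Iterating over an Element visits only its direct children, while `iter()` walks the element itself and every descendant. The sketch below shows the difference on a small made-up tree (the `Store` document is hypothetical; the import fallback is an assumption so the cell runs without `lxml`).

```python
try:
    from lxml import etree
except ImportError:
    # Assumption: stdlib ElementTree stands in; it offers the calls used here.
    import xml.etree.ElementTree as etree

root = etree.fromstring(
    "<Store>"
    "<Book><Title>T1</Title></Book>"
    "<Magazine><Title>T2</Title></Magazine>"
    "</Store>"
)

child_tags = [child.tag for child in root]      # direct children only
all_tags = [elem.tag for elem in root.iter()]   # root plus all descendants

print(child_tags)  # ['Book', 'Magazine']
print(all_tags)    # ['Store', 'Book', 'Title', 'Magazine', 'Title']
```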

**Q4** Find all children of `wroot` with tag `Book`, and store them in a list of Elements called `booklist`.

In [98]:
booklist = wroot.findall('Book')
print("Length of result:", len(booklist))

Length of result: 4


In [99]:
# Testing cell

assert True
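
With a bare tag name, `find` and `findall` look only at direct children: `find` returns the first match or `None`, and `findall` returns a list that may be empty. A minimal sketch on a made-up tree (the `Store` document and its `id` attributes are hypothetical; the import fallback is an assumption so the cell runs without `lxml`):

```python
try:
    from lxml import etree
except ImportError:
    # Assumption: stdlib ElementTree stands in; it offers the calls used here.
    import xml.etree.ElementTree as etree

root = etree.fromstring(
    "<Store><Book id='1'/><Book id='2'/>"
    "<Magazine><Book id='3'/></Magazine></Store>"
)

book_ids = [b.get("id") for b in root.findall("Book")]  # nested Book is skipped
first_id = root.find("Book").get("id")
no_matches = root.findall("Journal")                    # empty list
no_match = root.find("Journal")                         # None

print(book_ids, first_id, no_matches, no_match)
```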

In the following, we give preliminary code to access the child at index 2 of `wroot` and assign it to `node`.  You are to use `node` in the subsequent set of exercises.

In [100]:
node = wroot[2]
util.print_xml(node)

<Book ISBN="ISBN-0-11-222222-3" Price="50">
  <Title>Hector and Jeff's Database Hints</Title>
  <Authors>
    <Author>
      <First_Name>Jeffrey</First_Name>
      <Last_Name>Ullman</Last_Name>
    </Author>
    <Author>
      <First_Name>Hector</First_Name>
      <Last_Name>Garcia-Molina</Last_Name>
    </Author>
  </Authors>
  <Remark>An indispensible companion to your textbook</Remark>
</Book>


**Q5** In a single assignment from an expression, set `title` to the **text** of the `Title` child of `node`.

In [101]:
title = node.find("Title").text
title

"Hector and Jeff's Database Hints"

In [102]:
# Testing cell

assert True

**Q6** Write a function

    findValue(anode, tag)
    
that, relative to `anode`, finds the first subelement matching `tag` and returns its `.text` attribute if found, and `None` if no match was found.  After you have defined the function, demonstrate its use by invoking it to find the text of the `Title` subelement of `node`, and assign the result to `title2` as an alternative way to solve what we did in Q5.

In [103]:
def findValue(anode, tag):
    '''
    Find the first subelement of a node matching a tag and
    return its text.
    
    Parameters: anode: the node whose subelements are searched
                tag: the tag of the subelement whose text is
                returned
                
    Return: tagtext: the text of the first matching subelement
            None: if the node is None or no subelement matches
            the tag
    '''
    if anode is None:
        return None
    match = anode.find(tag)
    # find() returns None when no subelement has the given tag
    if match is None:
        return None
    return match.text

title2 = findValue(node, "Title")
title2

"Hector and Jeff's Database Hints"

In [104]:
# Testing cell

assert True
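
For comparison, Elements also provide a built-in `findtext` method that covers much of what `findValue` does. The sketch below shows its behavior on a small made-up `Book` element; the import fallback is an assumption so the cell runs without `lxml`. Note one difference: for a matching subelement that has no text, `findtext` returns an empty string rather than `None`.

```python
try:
    from lxml import etree
except ImportError:
    # Assumption: stdlib ElementTree stands in; it offers the calls used here.
    import xml.etree.ElementTree as etree

book = etree.fromstring("<Book><Title>Sample</Title><Remark/></Book>")

found = book.findtext("Title")                    # text of the first match
missing = book.findtext("Price")                  # no match gives None
fallback = book.findtext("Price", default="n/a")  # no match gives the default
empty = book.findtext("Remark")                   # match without text gives ''

print(found, missing, fallback, repr(empty))
```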

**Q7** This problem will likely require multiple steps to solve.  Suppose we want to print the author(s) of a book, with one author per line.  So for `node`, the desired print output is:
```
Jeffrey Ullman
Hector Garcia-Molina
```
Described as plain English steps, the solution would look something like the following:
1. From `node`, obtain the Element for the `Authors` child; call it `authors`
2. For each child of `authors`, do the following:
    - get the text value for the `First_Name` subelement, call it `first`
    - get the text value for the `Last_Name` subelement, call it `last`
    - print on a single line `first` and `last`, separated by a single space

Write the code to obtain this output for the `node` book Element.

In [105]:
authors = node.find("Authors")
for child in authors:
    first = child.find("First_Name").text
    last = child.find("Last_Name").text
    print(first, last)

Jeffrey Ullman
Hector Garcia-Molina
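
The same steps can be packaged as a reusable function that returns the names instead of printing them. This is one possible sketch, not part of the assignment: the name `author_names` is hypothetical, the sample element copies the structure printed for `node` above, and the import fallback is an assumption so the cell runs without `lxml`.

```python
try:
    from lxml import etree
except ImportError:
    # Assumption: stdlib ElementTree stands in; it offers the calls used here.
    import xml.etree.ElementTree as etree

def author_names(book):
    """Return a list of 'First Last' strings for a Book element."""
    authors = book.find("Authors")
    if authors is None:
        # A book without an Authors child has no names to report.
        return []
    names = []
    for author in authors.findall("Author"):
        first = author.findtext("First_Name", default="")
        last = author.findtext("Last_Name", default="")
        names.append(f"{first} {last}".strip())
    return names

sample = etree.fromstring(
    "<Book><Authors>"
    "<Author><First_Name>Jeffrey</First_Name><Last_Name>Ullman</Last_Name></Author>"
    "<Author><First_Name>Hector</First_Name><Last_Name>Garcia-Molina</Last_Name></Author>"
    "</Authors></Book>"
)

names = author_names(sample)
for name in names:
    print(name)
```

Returning a list keeps the lookup separate from the printing, so the same function can feed a test or a later computation.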
