# Working with XML Data using Python's ElementTree API
Course Meeting 4 (the last one!!!)

**Note:** This Notebook is based on a Python Module of the Week tutorial on [Creating XML Documents](https://pymotw.com/2/xml/etree/ElementTree/create.html).

In [1]:
import xml.etree.ElementTree as ET
from xml.dom import minidom
import pandas as pd
import csv
import re

### 1. Build a tree of XML data node by node

In [2]:
root = ET.Element('root')
print(root)

<Element 'root' at 0x7f7a395fefb0>


In [3]:
comment = ET.Comment('Example created for CDCS Python Course 1')
root.append(comment)

In [4]:
child = ET.SubElement(root, 'child')
child.text = "Analyzing Structured Data"
print(child)
print(child.text)

<Element 'child' at 0x7f7a3960a170>
Analyzing Structured Data


In [5]:
grandchild = ET.SubElement(child, 'grandchild', {'week' : '1'})
grandchild.text = "Pandas"
print(grandchild)
print(grandchild.text)

<Element 'grandchild' at 0x7f7a3960aa70>
Pandas


In [6]:
grandchild1 = ET.SubElement(child, 'grandchild', {'week' : '2'})
grandchild1.text = "ElementTree"
print(grandchild1)
print(grandchild1.text)

<Element 'grandchild' at 0x7f7a3960ae90>
ElementTree


In [7]:
greatgrandchild = ET.SubElement(grandchild, 'greatgrandchild')
greatgrandchild.text = "CSV"
greatgrandchild1 = ET.SubElement(grandchild1, 'greatgrandchild')
greatgrandchild1.text = "XML"

In [8]:
print(ET.tostring(root))

b'<root><!--Example created for CDCS Python Course 1--><child>Analyzing Structured Data<grandchild week="1">Pandas<greatgrandchild>CSV</greatgrandchild></grandchild><grandchild week="2">ElementTree<greatgrandchild>XML</greatgrandchild></grandchild></child></root>'
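
Reading values back out of a tree uses the same objects we just built. The sketch below rebuilds a reduced version of the tree above (without the comment and great-grandchildren, purely to keep it short) and collects each `grandchild`'s `week` attribute and text:

```python
import xml.etree.ElementTree as ET

# Rebuild a reduced version of the tree from the cells above.
root = ET.Element('root')
child = ET.SubElement(root, 'child')
child.text = "Analyzing Structured Data"
for week, name in [('1', 'Pandas'), ('2', 'ElementTree')]:
    grandchild = ET.SubElement(child, 'grandchild', {'week': week})
    grandchild.text = name

# iter() walks the whole subtree in document order;
# get() reads an attribute value (or returns None if it is missing).
rows = [(elem.get('week'), elem.text) for elem in root.iter('grandchild')]
print(rows)
```

Note that attribute values come back as strings, so `week` is `'1'`, not `1`.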


Let's try to print our XML in an easier-to-read way:

In [17]:
# Approach from last class:
print(ET.tostring(root, encoding='utf8').decode('utf8'))

<?xml version='1.0' encoding='utf8'?>
<root><!--Example created for CDCS Python Course 1--><child>Analyzing Structured Data<grandchild week="1">Pandas<greatgrandchild>CSV</greatgrandchild></grandchild><grandchild week="2">ElementTree<greatgrandchild>XML</greatgrandchild></grandchild></child></root>


In [9]:
# Alternative approach using Python's minidom module:
def readable(elem):
    raw = ET.tostring(elem, 'utf-8')
    minidom_parsed = minidom.parseString(raw)
    return minidom_parsed.toprettyxml(indent="    ")

In [10]:
print(readable(root))

<?xml version="1.0" ?>
<root>
    <!--Example created for CDCS Python Course 1-->
    <child>
        Analyzing Structured Data
        <grandchild week="1">
            Pandas
            <greatgrandchild>CSV</greatgrandchild>
        </grandchild>
        <grandchild week="2">
            ElementTree
            <greatgrandchild>XML</greatgrandchild>
        </grandchild>
    </child>
</root>



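That looks much better! The elements are now indented by depth, and the first line is the XML declaration (`<?xml version="1.0" ?>`). Note that `toprettyxml` also puts text such as "Pandas" on its own line when an element has both text and children, which adds whitespace to that text.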
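
If you are on Python 3.9 or later, `ET.indent()` is another option that avoids the round trip through `minidom`. This is a minimal sketch on a small tree without mixed text and children, not a replacement for the `readable` helper above:

```python
import xml.etree.ElementTree as ET

root = ET.Element('root')
child = ET.SubElement(root, 'child')
ET.SubElement(child, 'grandchild', {'week': '1'}).text = "Pandas"
ET.SubElement(child, 'grandchild', {'week': '2'}).text = "ElementTree"

# ET.indent() (Python 3.9+) edits the tree in place, adding
# whitespace-only text/tail values between elements.
ET.indent(root, space="    ")

# encoding='unicode' returns a str instead of bytes.
pretty = ET.tostring(root, encoding='unicode')
print(pretty)
```

Unlike `toprettyxml`, `ET.indent()` leaves existing non-whitespace text where it is, and it changes the tree itself rather than only the printed string.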

### 2. Build XML data from lists

In [97]:
courses = ["Analyzing Structured Data", "Analyzing Unstructured Data", "Network Analysis and Data Visualization"]
libs = ["Pandas, ElementTree", "Natural Language Toolkit", "NetworkX, Matplotlib, Altair, Seaborn"]

In [98]:
root = ET.Element('CDCS_Python_Course_Series')
children = [ET.Element('course', name=c) for c in courses]
print(children)

[<Element 'course' at 0x7f7a39493ad0>, <Element 'course' at 0x7f7a39507710>, <Element 'course' at 0x7f7a39507170>]


In [99]:
root.extend(children)

In [100]:
print(ET.tostring(root))

b'<CDCS_Python_Course_Series><course name="Analyzing Structured Data" /><course name="Analyzing Unstructured Data" /><course name="Network Analysis and Data Visualization" /></CDCS_Python_Course_Series>'


In [101]:
print(readable(root))

<?xml version="1.0" ?>
<CDCS_Python_Course_Series>
    <course name="Analyzing Structured Data"/>
    <course name="Analyzing Unstructured Data"/>
    <course name="Network Analysis and Data Visualization"/>
</CDCS_Python_Course_Series>



In [102]:
i = 0
for child in root.iter('course'):
    child.text = libs[i]
    i += 1
print(readable(root))

<?xml version="1.0" ?>
<CDCS_Python_Course_Series>
    <course name="Analyzing Structured Data">Pandas, ElementTree</course>
    <course name="Analyzing Unstructured Data">Natural Language Toolkit</course>
    <course name="Network Analysis and Data Visualization">NetworkX, Matplotlib, Altair, Seaborn</course>
</CDCS_Python_Course_Series>
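
The manual counter `i` works, but pairing the two lists with `zip` avoids index bookkeeping. A sketch of the same idea, using shortened copies of the `courses` and `libs` lists and building each element with `ET.SubElement` in one pass:

```python
import xml.etree.ElementTree as ET

courses = ["Analyzing Structured Data", "Analyzing Unstructured Data"]
libs = ["Pandas, ElementTree", "Natural Language Toolkit"]

root = ET.Element('CDCS_Python_Course_Series')

# zip() pairs each course name with its library string;
# keyword arguments to SubElement become attributes.
for course_name, lib_names in zip(courses, libs):
    course = ET.SubElement(root, 'course', name=course_name)
    course.text = lib_names

xml_text = ET.tostring(root, encoding='unicode')
print(xml_text)
```

One thing to keep in mind: `zip` stops at the shorter list, so a length mismatch is silently ignored instead of raising an `IndexError`.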



### 3. Modify existing elements in your XML data
The XML above is **well-formed**, but let's try restructuring the data it holds so it's more extensible:

In [103]:
# courses = ["Analyzing Structured Data", "Analyzing Unstructured Data", "Network Analysis and Data Visualization"]
libs = [["Pandas", "ElementTree"], ["Natural Language Toolkit"], ["NetworkX", "Matplotlib", "Altair", "Seaborn"]]

i = 0
for child in root.iter('course'):
    child.text = courses[i]
    # Remember, attribute values must be strings, hence str(i+1)!
    # An int would raise a TypeError when the tree is serialized.
    # Assigning a new dict to .attrib also replaces the old 'name' attribute.
    child.attrib = {'number' : str(i+1)}
    for library in libs[i]:
        descendant = ET.SubElement(child, 'library')
        descendant.text = library
    i += 1
print(readable(root))

<?xml version="1.0" ?>
<CDCS_Python_Course_Series>
    <course number="1">
        Analyzing Structured Data
        <library>Pandas</library>
        <library>ElementTree</library>
    </course>
    <course number="2">
        Analyzing Unstructured Data
        <library>Natural Language Toolkit</library>
    </course>
    <course number="3">
        Network Analysis and Data Visualization
        <library>NetworkX</library>
        <library>Matplotlib</library>
        <library>Altair</library>
        <library>Seaborn</library>
    </course>
</CDCS_Python_Course_Series>
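
The `str(i+1)` in the loop above matters. As a small sketch of what happens otherwise: ElementTree accepts a non-string attribute value when you set it, and only complains once you serialize the element.

```python
import xml.etree.ElementTree as ET

course = ET.Element('course')
course.set('number', 1)           # an int is accepted at this point...

try:
    ET.tostring(course)           # ...but serializing it fails
    error = None
except TypeError as exc:
    error = exc
print(type(error).__name__)

course.set('number', str(1))      # convert to str before serializing
fixed = ET.tostring(course, encoding='unicode')
print(fixed)
```

Because the error surfaces late, it is easiest to convert values to strings at the point where you assign them.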



What other information might someone want to know about the CDCS Python Course Series? How about the types of data that can be analyzed with the Python libraries covered in each course?

In [104]:
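# Select the <library> element whose text is 'Pandas'.
# (The [.='text'] predicate requires Python 3.7+. The earlier path with a trailing
# slash only worked because Pandas happened to be the first child of its course.)
lib_pd = root.find("./course/library[.='Pandas']")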
print(lib_pd)

<Element 'library' at 0x7f7a394a8170>
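
`find` returns the first match only, while `findall` returns every match as a list. Here is a short sketch of a few path expressions from ElementTree's limited XPath support, run against a small hand-written document (the `series` root tag is made up for this example):

```python
import xml.etree.ElementTree as ET

root = ET.fromstring(
    '<series>'
    '<course number="1"><library>Pandas</library><library>ElementTree</library></course>'
    '<course number="2"><library>Natural Language Toolkit</library></course>'
    '</series>'
)

# [@attr='value'] filters elements on an attribute value.
course_1_libs = [lib.text for lib in root.findall("./course[@number='1']/library")]
print(course_1_libs)

# [tag='text'] selects elements that have a child <tag> with exactly that text.
course_with_pandas = root.find("./course[library='Pandas']")
print(course_with_pandas.get('number'))

# .//tag searches at any depth below the starting element.
all_libs = [lib.text for lib in root.findall('.//library')]
print(all_libs)
```

If nothing matches, `find` returns `None` and `findall` returns an empty list, so it is worth checking the result before calling methods on it.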


In [105]:
data_types = ["CSV", "TSV"]
lib_pd_children = [ET.Element('data', name=d) for d in data_types]
lib_pd.extend(lib_pd_children)
print(readable(root))

<?xml version="1.0" ?>
<CDCS_Python_Course_Series>
    <course number="1">
        Analyzing Structured Data
        <library>
            Pandas
            <data name="CSV"/>
            <data name="TSV"/>
        </library>
        <library>ElementTree</library>
    </course>
    <course number="2">
        Analyzing Unstructured Data
        <library>Natural Language Toolkit</library>
    </course>
    <course number="3">
        Network Analysis and Data Visualization
        <library>NetworkX</library>
        <library>Matplotlib</library>
        <library>Altair</library>
        <library>Seaborn</library>
    </course>
</CDCS_Python_Course_Series>



### 4. Export your XML data

In [106]:
tree = ET.ElementTree(root)
tree.write("python_courses.xml")
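
By default, `tree.write` produces the file without an XML declaration. The sketch below shows one way to request an explicit encoding and declaration, then parses the file back in as a quick check. It writes to a temporary directory so that it does not overwrite the file created above:

```python
import os
import tempfile
import xml.etree.ElementTree as ET

root = ET.Element('CDCS_Python_Course_Series')
ET.SubElement(root, 'course', number='1').text = "Analyzing Structured Data"

# Write to a temporary directory (an assumption made for this example only).
path = os.path.join(tempfile.mkdtemp(), "python_courses.xml")
ET.ElementTree(root).write(path, encoding='utf-8', xml_declaration=True)

# The first line of the file is now the XML declaration.
with open(path, encoding='utf-8') as f:
    first_line = f.readline().strip()
print(first_line)

# Parse the file back in to confirm it round-trips.
reloaded = ET.parse(path).getroot()
print(reloaded.tag, reloaded.find('course').get('number'))
```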

*Note: Options for writing XML to a file are summarized [here](https://stackoverflow.com/questions/3605680/creating-a-simple-xml-file-using-python)*