# Introduction to Analyzing XML Data with ElementTree
XML data is a structured data format that we can parse using Python's ElementTree API.

**Source:** This notebook uses data and code from the DataCamp tutorial [Python XML with ElementTree](https://www.datacamp.com/community/tutorials/python-xml-elementtree#intro).

### Demo 1

In [2]:
# Import ElementTree and give it an alias (ET)
# so it can be quickly and concisely referred to
import xml.etree.ElementTree as ET
import pandas as pd
import re

In [3]:
tree = ET.parse("xmlpractice.xml")

Once you've parsed, or loaded, an XML file, the first step is to locate the top-most node in the tree, called the **root** (remember that XML is a hierarchical data format):

In [4]:
root = tree.getroot()
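
As a minimal sketch of the same parse-then-get-root step without a file on disk, `ET.fromstring()` parses a string and returns the root element directly. The `sample_xml` string and the `sample_root` name below are hypothetical stand-ins for illustration; they are not the contents of `xmlpractice.xml`.

```python
import xml.etree.ElementTree as ET

# Hypothetical, abbreviated stand-in for xmlpractice.xml
sample_xml = """
<collection>
    <genre category="Action">
        <decade years="1980s">
            <movie favorite="True" title="THE KARATE KID">
                <year>1984</year>
            </movie>
        </decade>
    </genre>
</collection>
"""

# fromstring() returns the root Element itself (no ElementTree wrapper),
# so there is no separate getroot() call
sample_root = ET.fromstring(sample_xml)

print(sample_root.tag)      # name of the top-most node
print(sample_root.attrib)   # its attributes, as a dict
print(len(sample_root))     # number of direct children
```

With a real file, `ET.parse(...)` followed by `.getroot()` gives the same kind of `Element` object.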

For learning purposes we've parsed a pretty small file, so let's print it out to see what it looks like:

In [5]:
print(ET.tostring(root, encoding='utf8').decode('utf8'))
# For information about encodings, check out W3's "Character encodings for
# beginners" at https://www.w3.org/International/questions/qa-what-is-encoding

<?xml version='1.0' encoding='utf8'?>
<collection>
    <genre category="Action">
        <decade years="1980s">
            <movie favorite="True" title="Indiana Jones: The raiders of the lost Ark">
                <format multiple="No">DVD</format>
                <year>1981</year>
                <rating>PG</rating>
                <description>
                'Archaeologist and adventurer Indiana Jones
                is hired by the U.S. government to find the Ark of the
                Covenant before the Nazis.'
                </description>
            </movie>
               <movie favorite="True" title="THE KARATE KID">
               <format multiple="Yes">DVD,Online</format>
               <year>1984</year>
               <rating>PG</rating>
               <description>None provided.</description>
            </movie>
            <movie favorite="False" title="Back 2 the Future">
               <format multiple="False">Blu-ray</format>
               <year>1985</year>
    

Ta da!
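
The print cell above encodes the tree to bytes and then decodes it back to text. A possible shortcut, sketched here on a tiny hypothetical document rather than on `xmlpractice.xml`, is to pass `encoding="unicode"`, which makes `ET.tostring()` return a `str` directly and leaves out the XML declaration line.

```python
import xml.etree.ElementTree as ET

# Tiny hypothetical document, used only to compare the two encodings
sample_root = ET.fromstring('<collection><genre category="Action" /></collection>')

as_bytes = ET.tostring(sample_root, encoding="utf8")     # bytes, starts with an XML declaration
as_text = ET.tostring(sample_root, encoding="unicode")   # str, no XML declaration

print(type(as_bytes), type(as_text))
print(as_text)
```

Which form is preferable depends on the goal: bytes are what you would write to a file, while the `str` form is convenient for printing.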

### Demo 2
Say we want to strip out the `<`, `>`, `=`, and `/` symbols that characterize XML markup to make the data more human-readable. We can use **for loops** to iterate through the data, starting from the root or any other child node:

In [25]:
for child in root:
    print("Tag:", child.tag, "| Attribute:", child.attrib, "| Text:", child.text)

Tag: genre | Attribute: {'category': 'Action'} | Text: 
        
Tag: genre | Attribute: {'category': 'Thriller'} | Text: 
        


Above we see the first generation of children from the root node.  What if we want to see *all* of the nodes (children of children of children, etc.)?  We can use a for loop with the iterator method `.iter()`:

In [26]:
for child in root.iter():
    print("Tag:",child.tag, "| Attribute:", child.attrib, "| Text:", child.text)

Tag: collection | Attribute: {} | Text: 
    
Tag: genre | Attribute: {'category': 'Action'} | Text: 
        
Tag: decade | Attribute: {'years': '1980s'} | Text: 
            
Tag: movie | Attribute: {'favorite': 'True', 'title': 'Indiana Jones: The raiders of the lost Ark'} | Text: 
                
Tag: format | Attribute: {'multiple': 'No'} | Text: DVD
Tag: year | Attribute: {} | Text: 1981
Tag: rating | Attribute: {} | Text: PG
Tag: description | Attribute: {} | Text: 
                'Archaeologist and adventurer Indiana Jones
                is hired by the U.S. government to find the Ark of the
                Covenant before the Nazis.'
                
Tag: movie | Attribute: {'favorite': 'True', 'title': 'THE KARATE KID'} | Text: 
               
Tag: format | Attribute: {'multiple': 'Yes'} | Text: DVD,Online
Tag: year | Attribute: {} | Text: 1984
Tag: rating | Attribute: {} | Text: PG
Tag: description | Attribute: {} | Text: None provided.
Tag: movie | Attribute: {'favorite

Great!
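
The output above contains a lot of whitespace-only text, because the indentation between nested tags counts as an element's `.text`. The sketch below contrasts direct children with `.iter()` and trims that whitespace. The `clean_text` helper and the inline sample document are hypothetical, introduced only for this illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical, abbreviated stand-in for xmlpractice.xml
sample_root = ET.fromstring("""
<collection>
    <genre category="Action">
        <movie title="THE KARATE KID">
            <year>1984</year>
        </movie>
    </genre>
</collection>
""")

def clean_text(elem):
    """Return the element's text without surrounding whitespace ('' if there is none)."""
    return (elem.text or "").strip()

# Iterating over an element visits only its direct children
direct_children = [child.tag for child in sample_root]

# .iter() visits the element itself and every descendant, in document order
all_nodes = [(elem.tag, clean_text(elem)) for elem in sample_root.iter()]

print(direct_children)
print(all_nodes)
```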

If we only want a subset of the data, for example the title of the movie and the year it was released, we can use **if statements** in our for loop:

In [31]:
movie_titles = []
years = []
for child in root.iter():
    if child.tag == "movie":
        movie_titles.append(child.attrib['title'])
    if child.tag == "year":
        years.append(child.text)

print("Movie Titles:\n", movie_titles)
print("Release Years:\n", years)

assert len(movie_titles) == len(years)  # There should be one release year for each movie

Movie Titles:
 ['Indiana Jones: The raiders of the lost Ark', 'THE KARATE KID', 'Back 2 the Future', 'X-Men', 'Batman Returns', 'Reservoir Dogs', 'ALIEN', "Ferris Bueller's Day Off", 'American Psycho']
Release Years:
 ['1981', '1984', '1985', '2000', '1992', '1992', '1979', '1986', '2000']


Or, we can specify the tag we're looking for inside the parentheses of the `.iter()` method:

In [42]:
titles = []
for movie in root.iter('movie'):
    titles.append(movie.attrib['title'])

release_years = []
for release_year in root.iter('year'):
    release_years.append(release_year.text)

print("Movie Titles:\n", titles)
print("Release Years:\n", release_years)

Movie Titles:
 ['Indiana Jones: The raiders of the lost Ark', 'THE KARATE KID', 'Back 2 the Future', 'X-Men', 'Batman Returns', 'Reservoir Dogs', 'ALIEN', "Ferris Bueller's Day Off", 'American Psycho']
Release Years:
 ['1981', '1984', '1985', '2000', '1992', '1992', '1979', '1986', '2000']
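
Both approaches above build two parallel lists, which only line up if every movie has exactly one `<year>` child. One alternative, sketched here under the assumption that each record should stay together, is to loop over the movies and look up the year inside each one with `.findtext()`. The sample document is hypothetical, and its second movie deliberately lacks a year to show what happens in that case.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample; the second movie has no <year> on purpose
sample_root = ET.fromstring("""
<collection>
    <genre category="Action">
        <decade years="1980s">
            <movie favorite="True" title="THE KARATE KID">
                <year>1984</year>
            </movie>
            <movie favorite="False" title="Untitled (hypothetical)">
                <rating>PG</rating>
            </movie>
        </decade>
    </genre>
</collection>
""")

movies = []
for movie in sample_root.iter("movie"):
    title = movie.get("title")        # attribute lookup; None if the attribute is absent
    year = movie.findtext("year")     # text of the first <year> child; None if there is none
    movies.append((title, year))

for title, year in movies:
    print(title, "|", year)
```

Because each title is paired with the year found inside the same `<movie>` element, a missing year shows up as `None` instead of silently shifting the rest of the list.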


We can also use the `.iter()` method to get a list of all the tag names in our data:

In [39]:
tags = [elem.tag for elem in root.iter()]
print(set(tags))   # a set keeps only unique values and, unlike a list, has no guaranteed order

{'rating', 'format', 'collection', 'description', 'movie', 'genre', 'year', 'decade'}
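
If the number of times each tag occurs is also of interest, `collections.Counter` from the standard library can tally the tags in one pass. This is a small sketch on a hypothetical sample document, not on `xmlpractice.xml`.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Hypothetical, abbreviated stand-in for xmlpractice.xml
sample_root = ET.fromstring("""
<collection>
    <genre category="Action">
        <movie title="THE KARATE KID"><year>1984</year></movie>
        <movie title="Back 2 the Future"><year>1985</year></movie>
    </genre>
</collection>
""")

# Count how often each tag name appears anywhere in the tree
tag_counts = Counter(elem.tag for elem in sample_root.iter())

# sorted() gives a stable, readable order for the unique tag names
print(sorted(tag_counts))
print(tag_counts["movie"])
```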


Another useful method is `.findall()`, which takes either a single tag name or a *path* as its argument. The search starts from the element the method is called on, and a bare tag name matches only that element's direct children.

In [45]:
for g in root.findall("genre"):
    print(g.attrib)

{'category': 'Action'}
{'category': 'Thriller'}
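
A bare tag name works in the cell above because `genre` elements sit directly under the root. For elements nested more deeply, the path needs to say so. The sketch below, on a hypothetical sample document, compares a bare tag name with the `.//` prefix, which searches all descendants.

```python
import xml.etree.ElementTree as ET

# Hypothetical, abbreviated stand-in for xmlpractice.xml
sample_root = ET.fromstring("""
<collection>
    <genre category="Action">
        <decade years="1980s">
            <movie title="THE KARATE KID" />
            <movie title="Back 2 the Future" />
        </decade>
    </genre>
    <genre category="Thriller">
        <decade years="1970s">
            <movie title="ALIEN" />
        </decade>
    </genre>
</collection>
""")

# "movie" alone looks only at direct children of the root, so nothing matches
direct_movies = sample_root.findall("movie")

# ".//movie" looks at every level below the root
nested_titles = [m.get("title") for m in sample_root.findall(".//movie")]

print(len(direct_movies))
print(nested_titles)
```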


In [46]:
for movie in root.findall("./genre/decade/movie[year='1992']"):
    print(movie.attrib)

{'favorite': 'True', 'title': 'Batman Returns'}
{'favorite': 'False', 'title': 'Reservoir Dogs'}


In [48]:
genre_action = [elem for elem in root.findall('genre') if elem.attrib['category'] == 'Action']
print(genre_action)
print(type(genre_action[0]))

[<Element 'genre' at 0x7f5c7b1798f0>]
<class 'xml.etree.ElementTree.Element'>
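
The list comprehension above filters on an attribute after `.findall()` has returned. ElementTree's path syntax also supports an attribute predicate of the form `[@name='value']`, so the same filter can be written inside the path. The sketch below uses a hypothetical sample document and checks that both approaches select the same element.

```python
import xml.etree.ElementTree as ET

# Hypothetical, abbreviated stand-in for xmlpractice.xml
sample_root = ET.fromstring("""
<collection>
    <genre category="Action" />
    <genre category="Thriller" />
</collection>
""")

# Filter in Python, as in the cell above (.get() avoids a KeyError
# if some genre had no category attribute)
by_comprehension = [g for g in sample_root.findall("genre")
                    if g.get("category") == "Action"]

# Filter inside the path with an attribute predicate
by_predicate = sample_root.findall("genre[@category='Action']")

print(by_comprehension == by_predicate)
print([g.attrib for g in by_predicate])
```

The predicate form is shorter; the comprehension is more flexible when the condition is something a path cannot express.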
