# <span style="color:#54B1FF">Parsing data:</span> &nbsp; <span style="color:#1B3EA9"><b>XML files</b></span>

<br>

[XML](https://en.wikipedia.org/wiki/XML) is a text-file format that is similar to HTML, but there are two important differences:

* <span style="color:blue">**HTML files are meant for web browsers.**</span> They have a rigid format, and only specific element types can be used to ensure that web browsers can interpret them.
* <span style="color:blue">**XML files store arbitrary data.**</span> They have a much freer format, and programmers develop a specific XML format for specific needs. XML files are usually not renderable in web browsers.

In other words, HTML can loosely be regarded as one specific, fixed XML-like format.

Anyone can create any XML format for any purpose. The basic XML syntax is identical across files, so any XML parser can read them, but the actual structure of the XML file varies substantially from dataset to dataset.
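
As a minimal sketch of this point, the two XML strings below are made up for illustration and store the same book title in different structures. The sketch uses the standard library's `xml.etree.ElementTree` so that it runs without any data files; `lxml`'s `etree` offers the same `fromstring`, `find`, and `get` calls.

```python
import xml.etree.ElementTree as ET

# Two hypothetical XML documents that store the same information
# in different, made-up structures.
xml_a = "<library><book><title>Midnight Rain</title></book></library>"
xml_b = '<items><item kind="book" title="Midnight Rain"/></items>'

root_a = ET.fromstring(xml_a)   # the same parser reads both strings
root_b = ET.fromstring(xml_b)

# The lookup code differs because the structure differs.
title_a = root_a.find('book/title').text    # title stored as element text
title_b = root_b.find('item').get('title')  # title stored as an attribute

print(title_a)
print(title_b)
```

The parser call is the same for both strings; only the path to the data changes.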

Due to its flexibility, the XML format is more common than CSV for complex datasets. <span style="color:red">XML is a very common data format</span> in science, engineering, and many other applications. Many websites (e.g., retail websites) use XML to dynamically store and update data.



This notebook demonstrates how to parse relatively simple XML files.
<br>

⚠️ **NOTE!**  &nbsp; &nbsp; All data files are saved in the same directory as this notebook.



___

First let's import the modules we'll need for this lecture.

In [1]:
from lxml import etree

<a name="toc"></a>
# Table of Contents

* [Simple XML](#xml-simple)
* [XML real world example](#xml-realworld)

___
<a name="xml-simple"></a>
## `xml` <span style="background-color:powderblue;">Simple XML</span>
[Back to Table of Contents](#toc)
<br>

This example (`data6.xml`) is an excerpt of three books from an example XML file at [microsoft.com](https://docs.microsoft.com/en-us/previous-versions/windows/desktop/ms762271(v=vs.85))

The contents of the file are:

<br>

```
<?xml version="1.0"?>
<catalog>
   <book id="bk101">
      <author>Gambardella, Matthew</author>
      <title>XML Developer's Guide</title>
      <genre>Computer</genre>
      <price>44.95</price>
      <publish_date>2000-10-01</publish_date>
      <description>An in-depth look at creating applications 
      with XML.</description>
   </book>
   <book id="bk102">
      <author>Ralls, Kim</author>
      <title>Midnight Rain</title>
      <genre>Fantasy</genre>
      <price>5.95</price>
      <publish_date>2000-12-16</publish_date>
      <description>A former architect battles corporate zombies, 
      an evil sorceress, and her own childhood to become queen 
      of the world.</description>
   </book>
   <book id="bk103">
      <author>Corets, Eva</author>
      <title>Maeve Ascendant</title>
      <genre>Fantasy</genre>
      <price>5.95</price>
      <publish_date>2000-11-17</publish_date>
      <description>After the collapse of a nanotechnology 
      society in England, the young survivors lay the 
      foundation for a new society.</description>
   </book>
</catalog>
```

<br>
<br>

Let's try to extract the titles from these books. As with HTML file parsing, we can use the `lxml` package to parse XML files.

In [2]:
from lxml import etree

tree      = etree.parse('data6.xml')

print(tree)


<lxml.etree._ElementTree object at 0x7ffc104c0280>


Now that the XML data are stored in an element tree, we can parse the data in the same way as HTML data.

In [3]:
titles    = [e.find('title').text for e in tree.findall('book')]

print( titles )

["XML Developer's Guide", 'Midnight Rain', 'Maeve Ascendant']
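
The same approach extends to attributes and numeric fields. The sketch below is only an illustration: it uses an inline two-book excerpt of the catalog above (so it runs without `data6.xml`) and the standard library's `xml.etree.ElementTree`, whose `findall`, `findtext`, and `get` calls are also available in `lxml`. The `books` list of dictionaries is an assumed layout, not part of the original example.

```python
import xml.etree.ElementTree as ET

# Excerpt of the catalog shown above (two books, fewer fields).
catalog_xml = """
<catalog>
   <book id="bk101">
      <title>XML Developer's Guide</title>
      <price>44.95</price>
   </book>
   <book id="bk102">
      <title>Midnight Rain</title>
      <price>5.95</price>
   </book>
</catalog>
"""

root = ET.fromstring(catalog_xml)

books = []
for book in root.findall('book'):
    books.append({
        'id':    book.get('id'),                 # attribute of the <book> element
        'title': book.findtext('title'),         # text of the <title> child
        'price': float(book.findtext('price')),  # element text is a string, so convert it
    })

for b in books:
    print(b['id'], b['title'], b['price'])
```

Note that element text is always read as a string, so numeric values such as the price need an explicit conversion.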


___
<a name="xml-realworld"></a>
## `xml` <span style="background-color:powderblue;">XML real-world example</span>
[Back to Table of Contents](#toc)
<br>


<img alt="iTunes" width=300 src="https://upload.wikimedia.org/wikipedia/commons/thumb/2/2a/ITunes_12.2_logo.png/600px-ITunes_12.2_logo.png">

Apple's popular [iTunes](https://www.apple.com/itunes/) software previously stored all music library details in an XML file.  <span style="color:blue">(Apple now uses a different, similar file format.)</span>

`data7.xml` is an example iTunes library file that uses Apple's previous format.

This example file is from [stratify on GitHub](https://github.com/jasonrudolph/stratify/blob/master/spec/fixtures/iTunes%20Music%20Library.xml).

<br>
<br>

The data file is about 2000 lines long, so for brevity it is not displayed in full here. The key structure (including the first two audio tracks) looks something like this <span style="color:blue">(Note that many details are excluded)</span>:

<br>
<br>
<br>
<br>

```
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE plist PUBLIC "-//Apple Computer//DTD PLIST 1.0//EN" "http://www.apple.com/DTDs/PropertyList-1.0.dtd">
<plist version="1.0">
<dict>
	<key>Major Version</key><integer>1</integer>
	<key>Minor Version</key><integer>1</integer>
	<key>Music Folder</key><string>file://localhost/Users/z/Music/iTunes/iTunes%20Media/</string>
	<key>Tracks</key>
	<dict>
		<key>86</key>
		<dict>
			<key>Track ID</key><integer>86</integer>
			<key>Name</key><string>Play Your Part (Pt. 1)</string>
			<key>Artist</key><string>Girl Talk</string>
			<key>Album</key><string>Feed The Animals</string>
			<key>Genre</key><string>Mash-up</string>
			<key>Location</key><string>file://localhost/Users/z/Music/iTunes/iTunes%20Media/Music/Girl%20Talk/Feed%20The%20Animals/01%20Play%20Your%20Part%20(Pt.%201).mp3</string>
		</dict>
		<key>88</key>
		<dict>
			<key>Track ID</key><integer>88</integer>
			<key>Name</key><string>Shut The Club Down</string>
			<key>Artist</key><string>Girl Talk</string>
			<key>Album</key><string>Feed The Animals</string>
			<key>Genre</key><string>Mash-up</string>
			<key>Location</key><string>file://localhost/Users/z/Music/iTunes/iTunes%20Media/Music/Girl%20Talk/Feed%20The%20Animals/02%20Shut%20The%20Club%20Down.mp3</string>
			<key>File Folder Count</key><integer>5</integer>
			<key>Library Folder Count</key><integer>1</integer>
		</dict>
	</dict>
</dict>
</plist>
```

<br>
<br>
<br>
<br>


Notice that each track has `<key>` elements for both `Artist` and `Name` (i.e., song name), and that each value is stored in the element immediately after its `<key>`.

Suppose we wanted to retrieve all song names in our library for the artist: Nirvana. How could we do this?

<br>

First let's check that we can access the artist and song name for just the **first track**.

In [5]:
tree      = etree.parse('data7.xml')

track     = tree.find('dict/dict/dict')  # track details stored in the third nested <dict> element


for key in track.findall('key'):         # cycle through all <key> elements in the track
    if key.text == 'Artist':             # check whether this is the Artist key
        artist = key.getnext().text      # get the text from the next element (the <string> element)
    elif key.text == 'Name':             # check whether this is the Name key
        song_name = key.getnext().text   # get the text from the next element (the <string> element)

print(artist)
print(song_name)

Girl Talk
Play Your Part (Pt. 1)


Great! We have successfully accessed the artist and song names.
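
An alternative to calling `getnext()` for each key is to pair every `<key>` with the element that follows it and build a Python dictionary. The sketch below is one possible way to do this, not the only one. It assumes that keys and values strictly alternate and that each value is a simple text element; `track_to_dict` is a hypothetical helper name, and the inline track is a shortened copy of the first track above. It uses the standard library's `xml.etree.ElementTree`, but iterating over an element's children works the same way in `lxml`.

```python
import xml.etree.ElementTree as ET

# One track from the excerpt above, with fewer fields.
track_xml = """
<dict>
    <key>Track ID</key><integer>86</integer>
    <key>Name</key><string>Play Your Part (Pt. 1)</string>
    <key>Artist</key><string>Girl Talk</string>
</dict>
"""

def track_to_dict(track):
    """Pair each <key> with the element after it (hypothetical helper)."""
    children = list(track)        # child elements in document order
    keys     = children[0::2]     # every other element, starting with the first <key>
    values   = children[1::2]     # the value element that follows each key
    return {k.text: v.text for k, v in zip(keys, values)}

track = ET.fromstring(track_xml)
info  = track_to_dict(track)

print(info['Artist'])
print(info['Name'])
```

All values come back as strings (including the track ID), because only the element text is read.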

Next, let's try retrieving artist and song names for all tracks.

In [6]:
tracks    = tree.findall('dict/dict/dict')  # all tracks

for track in tracks:
    artist    = None                     # initialize the artist name for this track
    song_name = None                     # initialize the song name for this track
    for key in track.findall('key'):     
        if key.text == 'Artist':         
            artist = key.getnext().text
        elif key.text == 'Name':
            song_name = key.getnext().text
    print(artist, '-', song_name)

Girl Talk - Play Your Part (Pt. 1)
Girl Talk - Shut The Club Down
Girl Talk - Still Here
Girl Talk - What It's All About
Girl Talk - Set It Off
Girl Talk - No Pause
Girl Talk - Like This
Girl Talk - Give Me A Beat
Girl Talk - Hands In The Air
Girl Talk - In Step
Girl Talk - Let Me See You
Girl Talk - Here's The Thing
Girl Talk - Don't Stop
Girl Talk - Play Your Part (Pt. 2)
Nirvana - Smells Like Teen Spirit
Nirvana - In Bloom
Nirvana - Come As You Are
Nirvana - Breed
Nirvana - Lithium
Nirvana - Polly
Nirvana - Territorial Pissings
Nirvana - Drain You
Nirvana - Lounge Act
Nirvana - Stay Away
Nirvana - On A Plain
Nirvana - Something In The Way
Bush - Everything Zen
Bush - Swim
Bush - Bomb
Bush - Little Things
Bush - Comedown
Bush - Body
Bush - Machinehead
Bush - Testosterone
Bush - Monkey
Bush - Glycerine
Bush - Alien
Bush - X-Girlfriend
Steve Martin - Born Standing Up: A Comic's Life (Unabridged)
Tom Merritt, Becky Worley, Sarah Lane and Jason Howell - Tech News Today 187: To Xoom Or No

Excellent!

Now let's extract all song names for only the artist: Nirvana.

In [7]:

tracks    = tree.findall('dict/dict/dict')

song_names = []

for track in tracks:
    artist    = None                     # initialize the artist name for this track
    song_name = None                     # initialize the song name for this track

    for key in track.findall('key'):
        if key.text == 'Artist':
            artist = key.getnext().text
        elif key.text == 'Name':
            song_name = key.getnext().text

    if artist == 'Nirvana':              # decide per track, so no state carries over from the previous track
        song_names.append( song_name )
    

for name in song_names:
    print(name)



Smells Like Teen Spirit
In Bloom
Come As You Are
Breed
Lithium
Polly
Territorial Pissings
Drain You
Lounge Act
Stay Away
On A Plain
Something In The Way


Stupendous!

We have successfully achieved our goal of extracting all song names by the artist Nirvana.
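
Because the `<plist>` root element marks this file as an Apple property list, Python's standard `plistlib` module may offer a shorter route to the same result. The sketch below is an illustration under stated assumptions: the inline data is a shortened copy of the two-track excerpt shown earlier (so the example filters for Girl Talk rather than Nirvana), and `songs_by_artist` is a hypothetical helper name.

```python
import plistlib

# Shortened copy of the library excerpt shown earlier (two tracks, fewer fields).
library_xml = b"""<?xml version="1.0" encoding="UTF-8"?>
<plist version="1.0">
<dict>
    <key>Tracks</key>
    <dict>
        <key>86</key>
        <dict>
            <key>Track ID</key><integer>86</integer>
            <key>Name</key><string>Play Your Part (Pt. 1)</string>
            <key>Artist</key><string>Girl Talk</string>
        </dict>
        <key>88</key>
        <dict>
            <key>Track ID</key><integer>88</integer>
            <key>Name</key><string>Shut The Club Down</string>
            <key>Artist</key><string>Girl Talk</string>
        </dict>
    </dict>
</dict>
</plist>
"""

def songs_by_artist(library, artist):
    """Return the song names for one artist (hypothetical helper)."""
    tracks = library['Tracks'].values()   # each track is a plain Python dict
    return [t.get('Name') for t in tracks if t.get('Artist') == artist]

library = plistlib.loads(library_xml)     # nested <dict> elements become nested dicts
songs   = songs_by_artist(library, 'Girl Talk')

print(songs)
```

For the full file, one would presumably open `data7.xml` in binary mode, read it with `plistlib.load`, and pass `'Nirvana'` as the artist.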