# XML Tutorial

Python's standard library handles XML through the `xml` package. This tutorial uses its `xml.etree.ElementTree` module, so let's import it at the beginning of the notebook.

In [2]:
import xml.etree.ElementTree as ET

tree = ET.parse('data/data.xml')
print(type(tree))

<class 'xml.etree.ElementTree.ElementTree'>


In [4]:
# to get the root element of the file, call the method `getroot()`
root = tree.getroot()
root

<Element 'data' at 0x10affd630>

In [5]:
print(root.tag)
print(root.attrib)
print(len(root))

data
{}
3


The length of the element is 3, meaning that it has 3 children. They can be accessed in the same way as elements in a `list`.
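The list-like behaviour can be tried without the data file. The sketch below is a minimal illustration, assuming a small inline document (a hypothetical stand-in for `data/data.xml`, not its actual contents) parsed with `ET.fromstring`.

```python
import xml.etree.ElementTree as ET

# Hypothetical inline document standing in for data/data.xml
xml_text = """<data>
    <country name="Liechtenstein"><rank>1</rank></country>
    <country name="Singapore"><rank>4</rank></country>
    <country name="Panama"><rank>68</rank></country>
</data>"""

# fromstring returns the root Element directly (no getroot() needed)
root = ET.fromstring(xml_text)

# len() counts direct children only
n_children = len(root)

# indexing, negative indexing and slicing work as on a list
first_name = root[0].get('name')
last_name = root[-1].get('name')
rest_names = [child.get('name') for child in root[1:]]

# iterating over an element yields its direct children in document order
all_names = [child.get('name') for child in root]

print(n_children, first_name, last_name)
print(rest_names)
print(all_names)
```

Note that indexing past the last child raises an `IndexError`, exactly as it would for a `list`.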

In [6]:
# first child of the root
country1 = root[0]

# first child of the child
rank = country1[0]

# what is the tag of the grandchild
print(rank.tag)

# what is the text inside this grandchild
print(rank.text)

# what are the attributes of the last child
print(country1[4].attrib) # or
print(country1[-1].attrib)

rank
1
{'name': 'Switzerland', 'direction': 'W'}
{'name': 'Switzerland', 'direction': 'W'}


In [7]:
# finding the same information regarding the third child of the root
country3 = root[2]

# first child of the child
rank = country3[0]

# what is the tag of the grandchild
print(rank.tag)

# what is the text inside this grandchild
print(rank.text)

# what are the attributes of the last child
print(country3[4].attrib) # or
print(country3[-1].attrib)

rank
68
{'name': 'Colombia', 'direction': 'E'}
{'name': 'Colombia', 'direction': 'E'}


In [8]:
# find all children with the tag `country`
for country in root.findall('country'):
    # rank is child of the country
    rank = country.find('rank').text
    # name is attribute of the country
    name = country.get('name')
    print(name, rank)

Liechtenstein 1
Singapore 4
Panama 68


In [9]:
for neighbor in root.iter('neighbor'):
    print(neighbor.attrib)

{'name': 'Austria', 'direction': 'E'}
{'name': 'Switzerland', 'direction': 'W'}
{'name': 'Malaysia', 'direction': 'N'}
{'name': 'Costa Rica', 'direction': 'W'}
{'name': 'Colombia', 'direction': 'E'}


An XML file has a single root element that encloses everything else. In this case, the root tag is `data`. Within `data`, each `country` element encloses its own children (`rank`, `year`, `gdppc`, and `neighbor`) and their information.
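This nesting affects how searches behave. The sketch below assumes a reduced, one-country version of the file (reconstructed from the outputs above, so treat the exact layout as an assumption) and contrasts `findall`, which only inspects direct children when given a bare tag name, with `iter`, which walks the whole subtree.

```python
import xml.etree.ElementTree as ET

# Assumed shape of data/data.xml, reduced to one country for brevity
xml_text = """<data>
    <country name="Liechtenstein">
        <rank>1</rank>
        <year>2008</year>
        <gdppc>141100</gdppc>
        <neighbor name="Austria" direction="E"/>
        <neighbor name="Switzerland" direction="W"/>
    </country>
</data>"""

root = ET.fromstring(xml_text)

# findall with a bare tag name looks at direct children only;
# `neighbor` elements sit one level deeper, under `country`
direct = root.findall('neighbor')

# iter walks the whole subtree, so it reaches the grandchildren
nested = [n.get('name') for n in root.iter('neighbor')]

print(len(direct))  # 0
print(nested)       # ['Austria', 'Switzerland']
```

This is why the earlier cell used `root.iter('neighbor')` rather than `root.findall('neighbor')` to list every neighbor.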

In [12]:
# top-level elements
root.findall(".")

[<Element 'data' at 0x10affd630>]

In [13]:
# all `neighbor` grand-children of `country` children of the top-level elements
root.findall("./country/neighbor")

[<Element 'neighbor' at 0x10afd0450>,
 <Element 'neighbor' at 0x10afd0090>,
 <Element 'neighbor' at 0x10b03aea0>,
 <Element 'neighbor' at 0x10b0a75e0>,
 <Element 'neighbor' at 0x10b0a7e00>]

In [14]:
# elements with name='Singapore' that have a `year` child
root.findall(".//year/..[@name='Singapore']")

[<Element 'country' at 0x10b0d9770>]

In [15]:
# 'year' elements that are children of elements with name='Singapore'
root.findall(".//*[@name='Singapore']/year")

[<Element 'year' at 0x107b48180>]

In [16]:
# all `neighbor` elements that are the second `neighbor` child of their parent
root.findall(".//neighbor[2]")

[<Element 'neighbor' at 0x10afd0090>, <Element 'neighbor' at 0x10b0a7e00>]
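The element addresses printed above say little about which elements matched. The following sketch, assuming a trimmed inline copy of the data (an assumption, not the original file) and a hypothetical helper `names`, prints the `name` attribute of each match for a few of the predicates that `ElementTree`'s limited XPath subset supports.

```python
import xml.etree.ElementTree as ET

# Trimmed, assumed copy of the data used in this tutorial
xml_text = """<data>
    <country name="Liechtenstein">
        <year>2008</year>
        <neighbor name="Austria"/>
        <neighbor name="Switzerland"/>
    </country>
    <country name="Singapore">
        <year>2011</year>
        <neighbor name="Malaysia"/>
    </country>
    <country name="Panama">
        <year>2011</year>
        <neighbor name="Costa Rica"/>
        <neighbor name="Colombia"/>
    </country>
</data>"""

root = ET.fromstring(xml_text)


def names(elements):
    """Hypothetical helper: collect the `name` attribute of each element."""
    return [e.get('name') for e in elements]


# [tag='text']: countries whose `year` child has the text 2011
in_2011 = names(root.findall("./country[year='2011']"))

# [position]: counted among same-tag siblings, starting at 1
second_neighbors = names(root.findall(".//neighbor[2]"))

# [last()]: the last `neighbor` under each parent
last_neighbors = names(root.findall(".//neighbor[last()]"))

print(in_2011)           # ['Singapore', 'Panama']
print(second_neighbors)  # ['Switzerland', 'Colombia']
print(last_neighbors)    # ['Switzerland', 'Malaysia', 'Colombia']
```

Singapore has only one neighbor, so it contributes nothing to `[2]` but does contribute to `[last()]`.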

Extract the name, rank, year, and gdppc from the countries and create a pandas DataFrame. Try it on your own before checking the solution below.

In [17]:
import xml.etree.ElementTree as ET
import pandas as pd

my_dict = {'name': [],
           'rank': [],
           'year': [],
           'gdppc': []}

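for country in root:
    # `name` is an attribute of the `country` element
    name_value = country.attrib['name']
    my_dict['name'].append(name_value)

    # the remaining values are the text of child elements,
    # accessed by position: 0 = rank, 1 = year, 2 = gdppc
    rank_value = country[0].text
    my_dict['rank'].append(rank_value)

    year_value = country[1].text
    my_dict['year'].append(year_value)

    gdppc_value = country[2].text
    my_dict['gdppc'].append(gdppc_value)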

In [21]:
df = pd.DataFrame(my_dict)
df

            name  rank  year   gdppc
0  Liechtenstein     1  2008  141100
1      Singapore     4  2011   59900
2         Panama    68  2011   13600


The `name` attribute of each `country` element is read from the element's `attrib` dictionary. The other values are the text of child elements of `country`. Full documentation for XML parsing with `ElementTree` can be found [here](https://docs.python.org/3/library/xml.etree.elementtree.html).
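Positional access such as `country[0]` depends on the children always appearing in the same order, and `.text` always returns a string. As an alternative sketch (not the solution above, and using an assumed two-country inline sample), the children can be looked up by tag with `findtext` and converted to integers while the rows are built.

```python
import xml.etree.ElementTree as ET

# Assumed two-country sample in the same shape as the tutorial data
xml_text = """<data>
    <country name="Liechtenstein">
        <rank>1</rank><year>2008</year><gdppc>141100</gdppc>
    </country>
    <country name="Singapore">
        <rank>4</rank><year>2011</year><gdppc>59900</gdppc>
    </country>
</data>"""

root = ET.fromstring(xml_text)

rows = []
for country in root.findall('country'):
    rows.append({
        # .get() returns None instead of raising KeyError if the attribute is absent
        'name': country.get('name'),
        # findtext looks the child up by tag, so the order of children does not matter
        'rank': int(country.findtext('rank')),
        'year': int(country.findtext('year')),
        'gdppc': int(country.findtext('gdppc')),
    })

print(rows)
```

A list of dictionaries like `rows` can be passed straight to `pd.DataFrame(rows)`, which should produce the same columns as before but with integer values instead of strings.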