# 5 - Importing XML
In this fifth step I'll show you how to import and parse the XML data.

In [2]:
import pandas as pd
import xml.etree.ElementTree as ET

tree = ET.parse("data-set.xml")
root = tree.getroot()

print(root.tag)

data-set


Taking a look, we can see that the root tag is <code>data-set</code>.

In [4]:
print(root[0].tag)

record


The root's first child element has the tag <code>record</code>.

In [8]:
# How many child elements?

child_tags = []
for child in root:
    child_tags.append(child.tag)

len(child_tags)

50

Looping through all of the root's child elements, we count 50 of them. These are the <code>record</code> elements, so we have 50 records of data.
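As a side note, the same count can be obtained without building a list. The sketch below is only an illustration: it parses a small inline XML string (a hypothetical three-record stand-in for <code>data-set.xml</code>, with the names `sample`, `sample_root`, `n_children`, and `n_records` made up for this example) so that it runs on its own.

```python
import xml.etree.ElementTree as ET

# Hypothetical three-record sample standing in for data-set.xml
sample = """<data-set>
  <record><City>New York City</City><Zipcode>10012</Zipcode></record>
  <record><City>Los Angeles</City><Zipcode>90001</Zipcode></record>
  <record><City>Chicago</City><Zipcode>60610</Zipcode></record>
</data-set>"""

sample_root = ET.fromstring(sample)

# len() on an Element counts its direct children
n_children = len(sample_root)

# findall("record") keeps only the direct children tagged <record>
n_records = len(sample_root.findall("record"))

print(n_children, n_records)
```

If the two numbers match, every direct child of the root is a <code>record</code>. Applied to the real file, `len(root)` should give the same 50 as the loop above.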

Let's take a look at the child element tags contained in <code>record</code>.

In [9]:
print(root[0][0].tag)

City


Taking a quick look, we can see that the first one is <code>City</code>. Let's print all of the tags, along with their corresponding data.

In [11]:
for child in root[0]:
    print(child.tag, child.text)

City New York City
Zipcode 10012
Latitude_Longitude 40.726,-73.998


Taking a look, we can see that our first record contains <code>City</code>, <code>Zipcode</code>, and <code>Latitude_Longitude</code> tags.

The data in these tags is <code>New York City</code>, <code>10012</code>, and <code>40.726,-73.998</code>.

If we open up the <code>data-set.xml</code> file in a text editor, we can confirm that this is in fact the first record.
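If you'd rather stay in the notebook, one way to make the same check is to serialize an element back to text with `ET.tostring`. This is a minimal sketch, assuming a one-record inline sample (the `sample`, `sample_root`, and `first_xml` names are hypothetical) rather than the real file:

```python
import xml.etree.ElementTree as ET

# Hypothetical one-record sample mirroring the first record shown above
sample = (
    "<data-set><record>"
    "<City>New York City</City>"
    "<Zipcode>10012</Zipcode>"
    "<Latitude_Longitude>40.726,-73.998</Latitude_Longitude>"
    "</record></data-set>"
)

sample_root = ET.fromstring(sample)

# encoding="unicode" makes tostring return a str instead of bytes
first_xml = ET.tostring(sample_root[0], encoding="unicode")

print(first_xml)
```

With the real data, the equivalent call would be `ET.tostring(root[0], encoding="unicode")`.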

Let's start parsing and gathering all of the data. We'll start by making a list of the tags to use for the columns.

In [13]:
# getting all of the child tags for column names

names = []

for child in root[0]:
    names.append(child.tag)

print(names)

['City', 'Zipcode', 'Latitude_Longitude']


Next, let's gather all of the data.

We'll create a dictionary using the list entries <code>['City', 'Zipcode', 'Latitude_Longitude']</code> as the keys. These will be our columns.

We'll collect the actual data using lists. These lists will be the values in our dictionary. 
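Before running this on the full file, here is a minimal sketch of the dictionary-of-lists pattern on a tiny inline sample. The two records are copied from the data shown in this step, but the `sample`, `sample_root`, `columns`, and `table` names are hypothetical. It uses `findtext`, which looks each value up by tag name instead of by position, so a record with a missing tag would yield `None` rather than shifting the columns.

```python
import xml.etree.ElementTree as ET

# Hypothetical two-record sample standing in for data-set.xml
sample = """<data-set>
  <record><City>New York City</City><Zipcode>10012</Zipcode><Latitude_Longitude>40.726,-73.998</Latitude_Longitude></record>
  <record><City>Los Angeles</City><Zipcode>90001</Zipcode><Latitude_Longitude>33.973,-118.249</Latitude_Longitude></record>
</data-set>"""

sample_root = ET.fromstring(sample)

# Column names come from the tags of the first record
columns = [child.tag for child in sample_root[0]]

# One empty list per column
table = {name: [] for name in columns}

# Append each record's value to the matching column list
for record in sample_root:
    for name in columns:
        table[name].append(record.findtext(name))

print(table)
```

The cell below applies the same idea to the real data, indexing by position instead of by tag name.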

In [14]:
# creating the dictionary to build the columns

d = {}


# using the child tags as the dictionary keys

for name in names:
    d[name] = []
#print(d)


# collecting all of the data
# (len(root) and len(names) replace the hard-coded 50 and 3)

for i in range(len(root)):
    for j in range(len(names)):
        #print(root[i][j].text)
        value = root[i][j].text
        d[names[j]].append(value)

print(d)

{'City': ['New York City', 'New York City', 'New York City', 'New York City', 'New York City', 'Los Angeles', 'Los Angeles', 'Los Angeles', 'Los Angeles', 'Los Angeles', 'Chicago', 'Chicago', 'Chicago', 'Chicago', 'Chicago', 'Houston', 'Houston', 'Houston', 'Houston', 'Houston', 'Philadelphia', 'Philadelphia', 'Philadelphia', 'Philadelphia', 'Philadelphia', 'Phoenix', 'Phoenix', 'Phoenix', 'Phoenix', 'Phoenix', 'San Antonio', 'San Antonio', 'San Antonio', 'San Antonio', 'San Antonio', 'San Diego', 'San Diego', 'San Diego', 'San Diego', 'San Diego', 'Dallas', 'Dallas', 'Dallas', 'Dallas', 'Dallas', 'San Jose', 'San Jose', 'San Jose', 'San Jose', 'San Jose'], 'Zipcode': ['10012', '10013', '10004', '10128', '10002', '90001', '90016', '90008', '90020', '90029', '60610', '60611', '60605', '60602', '60603', '77001', '77005', '77009', '77004', '77012', '19019', '19102', '19110', '19115', '19118', '85001', '85004', '85015', '85019', '85027', '78006', '78109', '78206', '78214', '78073', '91911'

The final step is to convert the dictionary to a DataFrame with pandas.

In [15]:
df = pd.DataFrame(data=d)
print(df)

             City Zipcode Latitude_Longitude
0   New York City   10012     40.726,-73.998
1   New York City   10013     40.721,-74.005
2   New York City   10004     40.699,-74.041
3   New York City   10128      40.782,-73.95
4   New York City   10002     40.717,-73.987
5     Los Angeles   90001    33.973,-118.249
6     Los Angeles   90016     34.03,-118.353
7     Los Angeles   90008     34.01,-118.337
8     Los Angeles   90020    34.066,-118.309
9     Los Angeles   90029     34.09,-118.295
10        Chicago   60610     41.899,-87.637
11        Chicago   60611     41.905,-87.625
12        Chicago   60605      41.86,-87.619
13        Chicago   60602     41.883,-87.629
14        Chicago   60603       41.88,-87.63
15        Houston   77001      29.813,-95.31
16        Houston   77005     29.718,-95.428
17        Houston   77009     29.793,-95.367
18        Houston   77004     29.729,-95.366
19        Houston   77012      29.72,-95.279
20   Philadelphia   19019     40.002,-75.118
21   Phila