# XML example and exercise
****
+ study examples of accessing nodes in XML tree structure  
+ work on exercise to be completed and submitted
****
+ reference: https://docs.python.org/2.7/library/xml.etree.elementtree.html
+ data source: http://www.dbis.informatik.uni-goettingen.de/Mondial
****

In [1]:
from xml.etree import ElementTree as ET

## XML example

+ for details about tree traversal and iterators, see https://docs.python.org/2.7/library/xml.etree.elementtree.html
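
The difference between the main lookup methods can be seen on a small inline document. This is a minimal sketch: the `SAMPLE` string is hypothetical data shaped loosely like the Mondial file, not taken from it.

```python
from xml.etree import ElementTree as ET

# Hypothetical miniature document, loosely shaped like the Mondial data
SAMPLE = """
<mondial>
  <country>
    <name>Albania</name>
    <province>
      <name>Tirana County</name>
      <city><name>Tirana</name></city>
    </province>
    <city><name>Shkoder</name></city>
  </country>
  <country>
    <name>Andorra</name>
    <city><name>Andorra la Vella</name></city>
  </country>
</mondial>
"""

sample_root = ET.fromstring(SAMPLE.strip())

# find() returns the first matching direct child, or None if there is none
first_country = sample_root.find('country')

# findall() returns every matching direct child as a list
direct_cities = [c.find('name').text for c in first_country.findall('city')]

# iter() walks all matching descendants at any depth, in document order
all_cities = [c.find('name').text for c in first_country.iter('city')]

print(first_country.find('name').text)
print(direct_cities)
print(all_cities)
```

Here `findall('city')` misses the city nested inside `<province>`, while `iter('city')` reaches it, which is why the examples below use a descendant iterator to list cities.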

In [2]:
document_tree = ET.parse( './data/mondial_database_less.xml' )

In [3]:
# print names of all countries (every child of the root is assumed to be a country)
for child in document_tree.getroot():
    print(child.find('name').text)

Albania
Greece
Macedonia
Serbia
Montenegro
Kosovo
Andorra


In [4]:
# print the name of each country followed by its cities
for element in document_tree.iterfind('country'):
    # iter() replaces the deprecated getiterator() and walks all descendants
    city_names = [city.find('name').text for city in element.iter('city')]
    print('* ' + element.find('name').text + ': ' + ', '.join(city_names))

* Albania: Tirana, Shkodër, Durrës, Vlorë, Elbasan, Korçë
* Greece: Komotini, Kavala, Athina, Peiraias, Peristeri, Acharnes, Patra, Kozani, Kerkyra, Ioannina, Thessaloniki, Iraklio, Chania, Ermoupoli, Rhodes, Tripoli, Lamia, Chalkida, Larissa, Volos, Mytilini, Karyes
* Macedonia: Skopje, Kumanovo
* Serbia: Beograd, Novi Sad, Niš
* Montenegro: Podgorica
* Kosovo: Prishtine
* Andorra: Andorra la Vella


****
## XML exercise

Using the data in 'data/mondial_database.xml', the examples above, and referring to https://docs.python.org/2.7/library/xml.etree.elementtree.html, find

1. 10 countries with the lowest infant mortality rates
2. 10 cities with the largest population
3. 10 ethnic groups with the largest overall populations (sum of best/latest estimates over all countries)
4. name and country of a) longest river, b) largest lake and c) airport at highest elevation

In [5]:
tree = ET.parse( './data/mondial_database.xml' )
root = tree.getroot()

In [6]:
import pandas as pd

## 10 countries with lowest infant mortality rates

In [7]:
# print the 10 countries with the lowest infant mortality rates
name = []
im = []
for country in root.iterfind('country'):
    name.append(country.find('name').text)
    # findtext() returns None when a country has no infant_mortality element
    im.append(country.findtext('infant_mortality'))

df = pd.DataFrame({'name': name, 'im': im})
# convert to float, then drop countries without a reported rate
df['im'] = df['im'].astype(float)
df = df.dropna(subset=['im'])
df.sort_values('im', ascending=True).head(10)

| | im | name |
|---|---|---|
| 38 | 1.81 | Monaco |
| 98 | 2.13 | Japan |
| 117 | 2.48 | Bermuda |
| 36 | 2.48 | Norway |
| 106 | 2.53 | Singapore |
| 37 | 2.6 | Sweden |
| 10 | 2.63 | Czech Republic |
| 78 | 2.73 | Hong Kong |
| 79 | 3.13 | Macao |
| 44 | 3.15 | Iceland |
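
The same ranking can also be sketched with the standard library only, which makes the handling of missing values explicit. The `SAMPLE` data and the helper name `lowest_infant_mortality` are hypothetical and used purely for illustration.

```python
from xml.etree import ElementTree as ET

# Hypothetical sample: country B has no infant_mortality element
SAMPLE = """<mondial>
  <country><name>A</name><infant_mortality>4.5</infant_mortality></country>
  <country><name>B</name></country>
  <country><name>C</name><infant_mortality>1.8</infant_mortality></country>
</mondial>"""


def lowest_infant_mortality(root, n=10):
    """Return up to n (rate, country name) pairs, lowest rate first."""
    rates = []
    for country in root.iterfind('country'):
        # findtext() returns None instead of raising when the element is missing
        text = country.findtext('infant_mortality')
        if text is not None:
            rates.append((float(text), country.findtext('name')))
    return sorted(rates)[:n]


sample_root = ET.fromstring(SAMPLE)
print(lowest_infant_mortality(sample_root, n=2))
```

Skipping missing values before sorting avoids relying on how NaN entries are ordered.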


## 10 largest cities 

In [8]:
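# print the 10 cities with the largest population
name = []
population = []
for element in tree.iterfind('country'):
    # iter() replaces the deprecated getiterator()
    for city in element.iter('city'):
        name.append(city.find('name').text)
        # findtext() returns the first population element (None if absent),
        # which is not necessarily the most recent figure
        population.append(city.findtext('population'))

df = pd.DataFrame({'name': name, 'population': population})
df['population'] = df['population'].astype(float)
df = df.dropna(subset=['population'])
df.sort_values('population', ascending=False).head(10)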

| | name | population |
|---|---|---|
| 1928 | Seoul | 10229262.0 |
| 1527 | Mumbai | 9925891.0 |
| 2810 | São Paulo | 9412894.0 |
| 1757 | Jakarta | 8259266.0 |
| 1341 | Shanghai | 8205598.0 |
| 2109 | Ciudad de México | 8092449.0 |
| 479 | Moskva | 8010954.0 |
| 1876 | Tokyo | 7843000.0 |
| 1340 | Beijing | 7362426.0 |
| 1582 | Delhi | 7206704.0 |
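
The ranking above uses the first `population` element found for each city. If a city carries several population records, one possible refinement is to pick the most recent one. The sketch below assumes each `population` element has a `year` attribute; that attribute name, the `SAMPLE` data, and the helper names are assumptions to be checked against the actual file.

```python
from xml.etree import ElementTree as ET

# Hypothetical sample: Oldtown has two population records, Nodata has none
SAMPLE = """<mondial>
  <country>
    <name>Exampleland</name>
    <city>
      <name>Oldtown</name>
      <population year="1990">500</population>
      <population year="2010">900</population>
    </city>
    <city>
      <name>Newtown</name>
      <population year="2005">700</population>
    </city>
    <city><name>Nodata</name></city>
  </country>
</mondial>"""


def latest_population(city):
    """Return the population from the record with the highest year, or None."""
    records = city.findall('population')
    if not records:
        return None
    # assumption: the year is stored in a 'year' attribute; records without it sort first
    latest = max(records, key=lambda p: int(p.get('year', 0)))
    return int(latest.text)


def largest_cities(root, n=10):
    """Return up to n (population, city name) pairs, largest first."""
    cities = []
    for city in root.iter('city'):
        population = latest_population(city)
        if population is not None:
            cities.append((population, city.findtext('name')))
    return sorted(cities, reverse=True)[:n]


sample_root = ET.fromstring(SAMPLE)
print(largest_cities(sample_root, n=2))
```

With this selection, Oldtown is ranked by its 2010 figure rather than the older 1990 one.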
