# XML example and exercise
****
+ study examples of accessing nodes in XML tree structure  
+ work on exercise to be completed and submitted
****
+ reference: https://docs.python.org/2.7/library/xml.etree.elementtree.html
+ data source: http://www.dbis.informatik.uni-goettingen.de/Mondial
****

In [1]:
from xml.etree import ElementTree as ET

## XML example

+ for details about tree traversal and iterators, see https://docs.python.org/2.7/library/xml.etree.elementtree.html

In [3]:
document_tree = ET.parse( './data/mondial_database_less.xml' )

In [4]:
# print names of all countries
for child in document_tree.getroot():
    print child.find('name').text

Albania
Greece
Macedonia
Serbia
Montenegro
Kosovo
Andorra


In [5]:
# print names of all countries and their cities
for element in document_tree.iterfind('country'):
    print '* ' + element.find('name').text + ':',
    # iter() walks every descendant <city>, not only direct children
    # (getiterator() is a deprecated alias for iter())
    city_names = [city.find('name').text for city in element.iter('city')]
    print ', '.join(city_names)

* Albania: Tirana, Shkodër, Durrës, Vlorë, Elbasan, Korçë
* Greece: Komotini, Kavala, Athina, Peiraias, Peristeri, Acharnes, Patra, Kozani, Kerkyra, Ioannina, Thessaloniki, Iraklio, Chania, Ermoupoli, Rhodes, Tripoli, Lamia, Chalkida, Larissa, Volos, Mytilini, Karyes
* Macedonia: Skopje, Kumanovo
* Serbia: Beograd, Novi Sad, Niš
* Montenegro: Podgorica
* Kosovo: Prishtine
* Andorra: Andorra la Vella
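
The two cells above use lookups with different reach: with a plain tag name, `find` returns the first matching direct child, `iterfind` yields all matching direct children, and `iter` walks the whole subtree. The following is a minimal sketch on a made-up fragment; the `<province>` nesting and the names in `SAMPLE_XML` are assumptions for illustration and are not copied from the data file.

```python
from xml.etree import ElementTree as ET

# Hypothetical fragment for illustration only; not taken from the Mondial file.
SAMPLE_XML = """
<mondial>
  <country>
    <name>Albania</name>
    <city><name>Tirana</name></city>
    <province>
      <name>Shkoder</name>
      <city><name>Shkoder</name></city>
    </province>
  </country>
</mondial>
"""

root = ET.fromstring(SAMPLE_XML)
country = root.find('country')  # first direct child named <country>

# iterfind with a plain tag name only sees direct children of <country>
direct_cities = [c.find('name').text for c in country.iterfind('city')]

# iter walks the whole subtree, so the city inside <province> is included
all_cities = [c.find('name').text for c in country.iter('city')]
```

Here `direct_cities` holds only the city that sits directly under `<country>`, while `all_cities` also picks up the nested one. That difference is why the city listing above uses a subtree walk rather than `iterfind`.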


****
## XML exercise

Using data in 'data/mondial_database.xml', the examples above, and referring to https://docs.python.org/2.7/library/xml.etree.elementtree.html, find

1. 10 countries with the lowest infant mortality rates
2. 10 cities with the largest population
3. 10 ethnic groups with the largest overall populations (sum of best/latest estimates over all countries)
4. name and country of a) longest river, b) largest lake and c) airport at highest elevation

In [92]:
import pandas as pd
import numpy as np

document = ET.parse( './data/mondial_database.xml' )

In [84]:
#### Question 1 ####

infant_mortality_dict = {}
for element in document.iterfind('country'):
    infant_mortality = element.find('infant_mortality')
    # Countries without infant_mortality data are excluded
    if infant_mortality is not None:
        infant_mortality_dict[element.find('name').text] = float(infant_mortality.text)

# Sort the values
sorted_infant_mortality_dict = sorted(infant_mortality_dict.items(), key=lambda x:x[1])

# Get the 10 countries with the lowest mortality rates
sorted_infant_mortality_dict[:10]

[('Monaco', 1.81),
 ('Japan', 2.13),
 ('Bermuda', 2.48),
 ('Norway', 2.48),
 ('Singapore', 2.53),
 ('Sweden', 2.6),
 ('Czech Republic', 2.63),
 ('Hong Kong', 2.73),
 ('Macao', 3.13),
 ('Iceland', 3.15)]
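
The same selection can be sketched without sorting the full dictionary. This variant is an alternative, not the method used in the cell above: it reads child text with `findtext`, which returns `None` when the child is missing, and picks the smallest entries with `heapq.nsmallest`. The helper name `lowest_infant_mortality`, the country names and the rates in `SAMPLE_XML` are placeholders, not values from the data file.

```python
import heapq
from xml.etree import ElementTree as ET

# Placeholder data for illustration; country "B" has no infant_mortality child.
SAMPLE_XML = """
<mondial>
  <country><name>A</name><infant_mortality>4.5</infant_mortality></country>
  <country><name>B</name></country>
  <country><name>C</name><infant_mortality>2.0</infant_mortality></country>
  <country><name>D</name><infant_mortality>13.2</infant_mortality></country>
</mondial>
"""

def lowest_infant_mortality(root, n):
    """Return the n (name, rate) pairs with the lowest rates, ascending."""
    rates = []
    for country in root.iterfind('country'):
        rate_text = country.findtext('infant_mortality')
        if rate_text is not None:  # skip countries without a reported rate
            rates.append((country.findtext('name'), float(rate_text)))
    return heapq.nsmallest(n, rates, key=lambda pair: pair[1])

lowest_two = lowest_infant_mortality(ET.fromstring(SAMPLE_XML), 2)
```

For a top-10 over a few hundred countries the difference from `sorted(...)[:10]` is negligible, so this is mainly a matter of taste.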

In [121]:
#### Question 2 ####

# A city can carry several <population> entries, each tagged with a year.
# We assume the most recent figure is the one wanted for each city.

cities = []
countries = []
years = []
populations = []

for country in document.iterfind('country'):
    for city in country.iter('city'):
        latest_year = None
        latest_population = None
        # Use a separate loop variable so the <population> element does not
        # overwrite the running result
        for pop_element in city.iter('population'):
            year = int(pop_element.get('year'))
            if latest_year is None or year > latest_year:
                latest_year = year
                latest_population = int(pop_element.text)
        if latest_population is not None:
            cities.append(city.find('name').text)
            countries.append(country.find('name').text)
            years.append(latest_year)
            populations.append(latest_population)

population_df = pd.DataFrame({"City": cities, "Country": countries, "Year": years, "Population": populations})
population_df.sort_values('Population', ascending=False).head(10)

|      | City      | Country | Population | Year |
|------|-----------|---------|------------|------|
| 1251 | Shanghai  | China   | 22315474   | 2010 |
| 707  | Istanbul  | Turkey  | 13710512   | 2012 |
| 1421 | Mumbai    | India   | 12442373   | 2011 |
| 443  | Moskva    | Russia  | 11979529   | 2013 |
| 1250 | Beijing   | China   | 11716620   | 2010 |
| 2594 | São Paulo | Brazil  | 11152344   | 2010 |
| 1252 | Tianjin   | China   | 11090314   | 2010 |
| 974  | Guangzhou | China   | 11071424   | 2010 |
| 1467 | Delhi     | India   | 11034555   | 2011 |
| 977  | Shenzhen  | China   | 10358381   | 2010 |
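
Picking the most recent population figure for a single city can be sketched as a small helper. This is an illustration under assumptions: the helper name `latest_population`, the city name and the numbers in `SAMPLE_CITY_XML` are made up, and every `<population>` element is assumed to carry a `year` attribute. The most recent figure is deliberately not the last one listed, so the result does not depend on element order.

```python
from xml.etree import ElementTree as ET

# Made-up city for illustration; the newest figure (2011) is listed first.
SAMPLE_CITY_XML = """
<city>
  <name>Exampleville</name>
  <population year="2011">1500</population>
  <population year="1991">1000</population>
  <population year="2001">1200</population>
</city>
"""

def latest_population(city):
    """Return (year, population) for the most recent figure, or None."""
    figures = [(int(p.get('year')), int(p.text))
               for p in city.iter('population')]
    # Tuples compare by their first item, so max() selects the latest year
    return max(figures) if figures else None

sample_city = ET.fromstring(SAMPLE_CITY_XML)
latest = latest_population(sample_city)
no_data = latest_population(ET.fromstring('<city><name>Nowhere</name></city>'))
```

Returning `None` for a city without any `<population>` entry lets the caller skip it, in the same way the cell above leaves such cities out of the DataFrame.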
