# XML example and exercise
****
+ study examples of accessing nodes in XML tree structure  
+ work on exercise to be completed and submitted
****
+ reference: https://docs.python.org/2.7/library/xml.etree.elementtree.html
+ data source: http://www.dbis.informatik.uni-goettingen.de/Mondial
****

In [1]:
from xml.etree import ElementTree as ET

## XML example

+ for details about tree traversal and iterators, see https://docs.python.org/2.7/library/xml.etree.elementtree.html
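
As a minimal sketch of how the traversal methods used below differ, the following snippet parses a small inline document. The XML string and its names are made up for illustration and are not taken from the Mondial data. `find` and `iterfind` with a bare tag name match only direct children, whereas `iter` walks every descendant.

```python
from xml.etree import ElementTree as ET

# Hypothetical miniature document, for illustration only
sample = """
<mondial>
  <country>
    <name>Aland</name>
    <city><name>Port A</name></city>
    <province>
      <name>North</name>
      <city><name>Port B</name></city>
    </province>
  </country>
</mondial>
"""

root = ET.fromstring(sample)
country = root.find('country')

# iterfind('city') matches only direct children of <country>
direct_cities = [c.find('name').text for c in country.iterfind('city')]

# iter('city') walks the whole subtree, including the city inside <province>
all_cities = [c.find('name').text for c in country.iter('city')]

print(direct_cities)
print(all_cities)
```

Which method is appropriate depends on whether nested elements should be counted.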

In [2]:
document_tree = ET.parse( './data/mondial_database_less.xml' )

In [3]:
# print names of all countries
for child in document_tree.getroot():
    print(child.find('name').text)

Albania
Greece
Macedonia
Serbia
Montenegro
Kosovo
Andorra


In [4]:
# print names of all countries and their cities
for element in document_tree.iterfind('country'):
    # iter() walks all descendants of the country, so nested cities are included
    city_names = [city.find('name').text for city in element.iter('city')]
    print('* ' + element.find('name').text + ': ' + ', '.join(city_names))

* Albania: Tirana, Shkodër, Durrës, Vlorë, Elbasan, Korçë
* Greece: Komotini, Kavala, Athina, Peiraias, Peristeri, Acharnes, Patra, Kozani, Kerkyra, Ioannina, Thessaloniki, Iraklio, Chania, Ermoupoli, Rhodes, Tripoli, Lamia, Chalkida, Larissa, Volos, Mytilini, Karyes
* Macedonia: Skopje, Kumanovo
* Serbia: Beograd, Novi Sad, Niš
* Montenegro: Podgorica
* Kosovo: Prishtine
* Andorra: Andorra la Vella


****
## XML exercise

Using the data in 'data/mondial_database.xml', the examples above, and the documentation at https://docs.python.org/2.7/library/xml.etree.elementtree.html, find:

1. 10 countries with the lowest infant mortality rates
2. 10 cities with the largest population
3. 10 ethnic groups with the largest overall populations (sum of best/latest estimates over all countries)
4. name and country of a) longest river, b) largest lake and c) airport at highest elevation

In [92]:
from xml.etree import ElementTree as ET
document = ET.parse( './data/mondial_database.xml' )

In [164]:
import pandas as pd
import numpy as np

# Column labels for the three result tables (plain column lists, not a MultiIndex)
index1 = ['Infant Mortiality','Population']
index2 = ['Country','Population']
index3 = ['Country','Population']

# Collect rows and row labels first; build each DataFrame once after the loop
rows1, labels1 = [], []   # one row per country
rows2, labels2 = [], []   # one row per ethnic group and country
rows3, labels3 = [], []   # one row per city

for country in document.iterfind('country'):
    country_name = country.find('name').text
    # Table 1: infant mortality and population per country
    infant_node = country.find('infant_mortality')
    val_inf = np.nan if infant_node is None else float(infant_node.text)
    # Keep the last <population> entry listed for the country.
    # A specific year could instead be selected with
    # findall(".//population[@year='2011']"), and a missing year estimated
    # by interpolating over the available years (e.g. pandas' interpolate).
    for new_pop in country.iterfind('population'):
        val_pop = float(new_pop.text)
    rows1.append([val_inf, val_pop])
    labels1.append(country_name)
    # Table 2: estimated size of each ethnic group within the country
    for ethnic in country.iterfind('ethnicgroup'):
        rows2.append([country_name, val_pop * float(ethnic.attrib['percentage']) / 100])
        labels2.append(ethnic.text)
    # Table 3: population per city (direct <city> children only).
    # NOTE: val_pop is not reset here, so a city without a <population>
    # element keeps the previously assigned value.
    for city in country.iterfind('city'):
        for new_pop in city.iterfind('population'):
            val_pop = float(new_pop.text)
        rows3.append([country_name, val_pop])
        labels3.append(city.find('name').text)

table1 = pd.DataFrame(rows1, index=labels1, columns=index1)
table2 = pd.DataFrame(rows2, index=labels2, columns=index2)
table3 = pd.DataFrame(rows3, index=labels3, columns=index3)

In [116]:
# Solution for Prob 1
table1['Infant Mortiality'].sort_values().head(10)

Monaco            1.81
Japan             2.13
Bermuda           2.48
Norway            2.48
Singapore         2.53
Sweden            2.60
Czech Republic    2.63
Hong Kong         2.73
Macao             3.13
Iceland           3.15
Name: Infant Mortiality, dtype: float64

In [117]:
# Solution for Prob 2
table3['Population'].sort_values(ascending=False).head(10)

Marsa Matruh       83667047.0
Hurghada           83667047.0
Seoul               9708483.0
Mogadishu           9636173.0
Al Qahirah          8471859.0
Bangkok             7506700.0
Hong Kong           7055071.0
Ho Chi Minh         5968384.0
Singapore           5076700.0
Al Iskandariyah     4123869.0
Name: Population, dtype: float64
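
The first two rows share the value 83667047.0, which looks like a country-level figure carried over to cities that have no `<population>` element of their own. Note also that `iterfind("city")` only sees cities that are direct children of `<country>`. The following is a hedged sketch of a stricter approach, not the method used above: it walks all descendant cities with `iter`, skips cities without a population figure, and picks the entry with the latest `year` attribute. The helper name `city_populations` and the inline XML are hypothetical, and the presence of a `year` attribute on every `<population>` element is an assumption.

```python
from xml.etree import ElementTree as ET
import pandas as pd


def city_populations(root):
    """Return a DataFrame with one row per city that has a population figure.

    Hypothetical helper: cities without any <population> child are skipped
    instead of inheriting a value from elsewhere.
    """
    rows = []
    for country in root.iterfind('country'):
        country_name = country.find('name').text
        # iter() also reaches cities nested deeper below the country
        for city in country.iter('city'):
            pops = city.findall('population')
            if not pops:
                continue  # no figure recorded for this city
            # Assumes a 'year' attribute; entries without one rank lowest
            latest = max(pops, key=lambda p: int(p.get('year', 0)))
            rows.append((city.find('name').text, country_name, int(latest.text)))
    return pd.DataFrame(rows, columns=['City', 'Country', 'Population'])


# Made-up data for illustration, not taken from the Mondial file
sample = """
<mondial>
  <country>
    <name>Aland</name>
    <population year="2011">5000000</population>
    <city>
      <name>Port A</name>
      <population year="2001">100</population>
      <population year="2011">150</population>
    </city>
    <city><name>No Data</name></city>
    <province>
      <name>North</name>
      <city><name>Port B</name><population year="2011">900</population></city>
    </province>
  </country>
</mondial>
"""

root = ET.fromstring(sample)
top = city_populations(root).sort_values('Population', ascending=False)
print(top.head(10))
```

Run against the full file, this would likely produce a different ranking from the one shown above, since nested cities would be included and cities without data excluded.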

In [171]:
# Solution for Prob 3
# Sum the estimated populations per ethnic group (index level 0) across countries
c = table2.groupby(level=0)[['Population']].sum()
c = c.sort_values('Population', ascending=False)
c.head(10)

| Ethnic group | Population |
|---|---|
| Han Chinese | 1245059000.0 |
| Indo-Aryan | 871815600.0 |
| European | 494872200.0 |
| African | 318325100.0 |
| Dravidian | 302713700.0 |
| Mestizo | 157734400.0 |
| Bengali | 146776900.0 |
| Russian | 131857000.0 |
| Japanese | 126534200.0 |
| Malay | 121993600.0 |
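
The aggregation for problem 3 scales each country's population by the group's percentage and then sums the results per group name. A minimal sketch of that arithmetic with made-up numbers (the group and country names are hypothetical), using `groupby` on a long-format table:

```python
import pandas as pd

# Made-up figures for illustration only:
# (ethnic group, country, country population * percentage / 100)
records = [
    ('Group X', 'Aland', 1000000 * 60.0 / 100),
    ('Group Y', 'Aland', 1000000 * 40.0 / 100),
    ('Group X', 'Bland', 500000 * 10.0 / 100),
]
groups = pd.DataFrame(records, columns=['Ethnic group', 'Country', 'Population'])

# Sum the per-country estimates for each group, largest first
totals = (groups.groupby('Ethnic group')['Population']
                .sum()
                .sort_values(ascending=False))
print(totals.head(10))
```

Here 'Group X' totals 600000 + 50000 = 650000, ahead of 'Group Y' at 400000.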


I did not see any data about rivers and airports in the sample file.
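
If the full file does contain such entries, problem 4 could be approached along the following lines. This is only a sketch under stated assumptions: the element names `river`, `lake` and `airport`, their children `length`, `area` and `elevation`, and a `country` attribute holding country codes are all assumed here and have not been verified against the data. The helper `largest_by` and the inline XML are hypothetical.

```python
from xml.etree import ElementTree as ET


def largest_by(root, tag, measure):
    """Return (name, country attribute, value) for the <tag> element with the
    largest numeric <measure> child, or None if no element has a value."""
    best = None
    for element in root.iter(tag):
        value_node = element.find(measure)
        if value_node is None or not (value_node.text or '').strip():
            continue  # skip entries without a usable value
        value = float(value_node.text)
        if best is None or value > best[2]:
            best = (element.find('name').text, element.get('country'), value)
    return best


# Made-up data for illustration; the schema is an assumption
sample = """
<mondial>
  <river country="AA BB"><name>Long River</name><length>120</length></river>
  <river country="AA"><name>Short River</name><length>30.5</length></river>
  <river country="BB"><name>Unmeasured River</name></river>
  <lake country="AA"><name>Big Lake</name><area>800</area></lake>
  <lake country="BB"><name>Small Lake</name><area>12</area></lake>
  <airport country="BB"><name>High Field</name><elevation>4000</elevation></airport>
  <airport country="AA"><name>Low Field</name><elevation/></airport>
</mondial>
"""

root = ET.fromstring(sample)
longest_river = largest_by(root, 'river', 'length')
largest_lake = largest_by(root, 'lake', 'area')
highest_airport = largest_by(root, 'airport', 'elevation')
print(longest_river)
print(largest_lake)
print(highest_airport)
```

The sketch returns the raw `country` attribute; mapping those codes to country names would be an extra step that depends on how the file encodes them.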