# XML example and exercise
****
+ study examples of accessing nodes in XML tree structure  
+ work on exercise to be completed and submitted
****
+ reference: https://docs.python.org/2.7/library/xml.etree.elementtree.html
+ data source: http://www.dbis.informatik.uni-goettingen.de/Mondial
****

In [1]:
from xml.etree import ElementTree as ET
import pandas as pd
from collections import Counter

## XML example

+ for details about tree traversal and iterators, see https://docs.python.org/2.7/library/xml.etree.elementtree.html

In [2]:
document_tree = ET.parse( './data/mondial_database_less.xml' )

In [3]:
# print names of all countries
for child in document_tree.getroot():
    print(child.find('name').text)

Albania
Greece
Macedonia
Serbia
Montenegro
Kosovo
Andorra


In [4]:
# print names of all countries and their cities
for element in document_tree.iterfind('country'):
    print('* ' + element.find('name').text + ':')
    # Element.iter() replaces getiterator(), which was removed in Python 3.9
    city_names = [city.find('name').text for city in element.iter('city')]
    print(', '.join(city_names))

* Albania:
Tirana, Shkodër, Durrës, Vlorë, Elbasan, Korçë
* Greece:
Komotini, Kavala, Athina, Peiraias, Peristeri, Acharnes, Patra, Kozani, Kerkyra, Ioannina, Thessaloniki, Iraklio, Chania, Ermoupoli, Rhodes, Tripoli, Lamia, Chalkida, Larissa, Volos, Mytilini, Karyes
* Macedonia:
Skopje, Kumanovo
* Serbia:
Beograd, Novi Sad, Niš
* Montenegro:
Podgorica
* Kosovo:
Prishtine
* Andorra:
Andorra la Vella
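The two cells above rely on a handful of `ElementTree` navigation methods. The sketch below contrasts them on a small inline document; the sample XML, its names, and the `car_code` value are hypothetical and only mimic the shape of the Mondial data.

```python
from xml.etree import ElementTree as ET

# Hypothetical sample data, shaped like the country/city elements used above.
SAMPLE_XML = """
<mondial>
  <country car_code="X1">
    <name>Examplia</name>
    <city><name>Alpha</name></city>
    <province>
      <name>North</name>
      <city><name>Beta</name></city>
    </province>
  </country>
</mondial>
"""

root = ET.fromstring(SAMPLE_XML)
country = root.find('country')

# find() returns the first matching direct child, or None if there is none
country_name = country.find('name').text

# iterfind() with a bare tag only looks at direct children,
# so it misses the city nested inside the province
direct_cities = [c.find('name').text for c in country.iterfind('city')]

# iter() walks the whole subtree, so it also reaches nested cities
all_cities = [c.find('name').text for c in country.iter('city')]

# get() reads an attribute from the element's start tag
code = country.get('car_code')

print(country_name, direct_cities, all_cities, code)
```

This difference is why the city listing uses a subtree walk: a direct-child search alone would skip any city that sits inside a province.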


****
## XML exercise

Using data in 'data/mondial_database.xml', the examples above, and referring to https://docs.python.org/2.7/library/xml.etree.elementtree.html, find

1. 10 countries with the lowest infant mortality rates
2. 10 cities with the largest population
3. 10 ethnic groups with the largest overall populations (sum of best/latest estimates over all countries)
4. name and country of a) longest river, b) largest lake and c) airport at highest elevation

In [5]:
document = ET.parse( './data/mondial_database.xml' )

## Question 1

I iterate over the countries. I extract both the infant mortality value and country name by using .find(), and place the values into two lists. I assign the countries with missing mortality data a very high value since we are looking for countries with the lowest infant mortality. I take the two lists and make them columns of a dataframe, and then use .sort_values() to sort by infant mortality. .head(10) gives the 10 lowest. 

In [6]:
mortality_vals = []
countries_list = []
for element in document.iterfind('country'):
    countries_list.append(element.find("name").text)
    mor_elem = element.find('infant_mortality')
    if mor_elem is None:
        # sentinel: missing values sort after every real rate
        mortality_vals.append(1000)
    else:
        mortality_vals.append(float(mor_elem.text))
    
mortality_df = pd.DataFrame({'country': countries_list, 'infant_mortality': mortality_vals})
mortality_df.sort_values("infant_mortality").head(10)

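|     | country        | infant_mortality |
|-----|----------------|------------------|
| 38  | Monaco         | 1.81             |
| 98  | Japan          | 2.13             |
| 117 | Bermuda        | 2.48             |
| 36  | Norway         | 2.48             |
| 106 | Singapore      | 2.53             |
| 37  | Sweden         | 2.6              |
| 10  | Czech Republic | 2.63             |
| 78  | Hong Kong      | 2.73             |
| 79  | Macao          | 3.13             |
| 44  | Iceland        | 3.15             |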
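An alternative to the high sentinel value is to leave countries without a recorded rate out of the ranking altogether. The sketch below does this with the standard library only; the sample XML and its values are hypothetical.

```python
from xml.etree import ElementTree as ET

# Hypothetical sample: country B has no infant_mortality element.
SAMPLE_XML = """
<mondial>
  <country><name>A</name><infant_mortality>4.5</infant_mortality></country>
  <country><name>B</name></country>
  <country><name>C</name><infant_mortality>2.1</infant_mortality></country>
</mondial>
"""

root = ET.fromstring(SAMPLE_XML)

rates = {}
for country in root.iterfind('country'):
    # findtext() returns None when the child element is absent
    raw = country.findtext('infant_mortality')
    if raw is not None:
        rates[country.findtext('name')] = float(raw)

# sort ascending by rate and keep the two lowest
lowest = sorted(rates.items(), key=lambda item: item[1])[:2]
print(lowest)
```

Skipping missing entries avoids picking a sentinel that has to stay larger than every real value in the data.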


## Question 2

This cell ranks countries, rather than cities, by population. For each country I iterate over its population elements, keep the measurement with the most recent year, and append that value and the country name to two lists. I place these lists into a dataframe, sort it in descending order of population, and take the top 10.

In [7]:
pop_vals = []
countries_list = []
for element in document.iterfind('country'):
    pop_iter = element.iterfind('population')
    latest_year = 0
    latest_year_pop = 0
    for pop_elem in pop_iter:
        elem_year = int(pop_elem.get("year"))
        if elem_year > latest_year:
            latest_year = elem_year
            latest_year_pop = int(pop_elem.text)
        
    countries_list.append(element.find("name").text)
    pop_vals.append(latest_year_pop)                                            
    
pop_df = pd.DataFrame({'country': countries_list, 'population': pop_vals})
pop_df.sort_values("population", ascending = False).head(10)

|     | country       | population |
|-----|---------------|------------|
| 55  | China         | 1360720000 |
| 67  | India         | 1210854977 |
| 120 | United States | 318857056  |
| 88  | Indonesia     | 252124458  |
| 176 | Brazil        | 202768562  |
| 57  | Pakistan      | 173149306  |
| 202 | Nigeria       | 164294516  |
| 65  | Bangladesh    | 149772364  |
| 23  | Russia        | 143666931  |
| 98  | Japan         | 127298000  |
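The exercise statement asks for cities, while the cell above ranks countries. A city-level version could look like the sketch below. It assumes that each `city` element carries its own `population` children with a `year` attribute and that cities may be nested inside `province` elements; the sample XML is hypothetical.

```python
from xml.etree import ElementTree as ET

# Hypothetical sample: one top-level city, two cities nested in a province,
# one of which has no population measurement at all.
SAMPLE_XML = """
<mondial>
  <country>
    <name>Examplia</name>
    <city>
      <name>Alpha</name>
      <population year="2001">900</population>
      <population year="2011">1200</population>
    </city>
    <province>
      <name>North</name>
      <city>
        <name>Beta</name>
        <population year="2011">3000</population>
      </city>
      <city><name>Gamma</name></city>
    </province>
  </country>
</mondial>
"""

root = ET.fromstring(SAMPLE_XML)

city_pops = []
for country in root.iterfind('country'):
    country_name = country.findtext('name')
    # iter() also reaches cities nested inside provinces
    for city in country.iter('city'):
        measurements = city.findall('population')
        if not measurements:
            continue  # no population recorded for this city
        # keep the measurement with the most recent year
        latest = max(measurements, key=lambda p: int(p.get('year')))
        city_pops.append((city.findtext('name'), country_name, int(latest.text)))

# largest first, at most ten rows
largest = sorted(city_pops, key=lambda row: row[2], reverse=True)[:10]
print(largest)
```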


## Question 3

Here are the ethnic groups with the largest global populations:

In [8]:
group_counts = Counter()
for element in document.iterfind('country'):
    # find the most recent population measurement for this country
    latest_year = 0
    latest_year_pop = 0
    for pop_elem in element.iterfind('population'):
        elem_year = int(pop_elem.get("year"))
        if elem_year > latest_year:
            latest_year = elem_year
            latest_year_pop = int(pop_elem.text)

    for group_elem in element.iterfind('ethnicgroup'):
        group_name = group_elem.text
        # "percentage" is on a 0-100 scale, so convert it to a fraction
        group_fraction = float(group_elem.get("percentage")) / 100
        # += accumulates across countries; plain assignment would keep only one country's value
        group_counts[group_name] += int(round(group_fraction * latest_year_pop))

group_df = pd.DataFrame(group_counts.most_common(10), columns=["ethnic_group", "population"])
group_df
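The arithmetic behind this question can be checked by hand on a small sample. The sketch below assumes the `percentage` attribute is on a 0-100 scale and sums each group's share over every country in which it appears; the sample XML and group names are hypothetical.

```python
from collections import Counter
from xml.etree import ElementTree as ET

# Hypothetical sample: group "Red" appears in two countries.
SAMPLE_XML = """
<mondial>
  <country>
    <name>A</name>
    <population year="2010">1000</population>
    <ethnicgroup percentage="60">Red</ethnicgroup>
    <ethnicgroup percentage="40">Blue</ethnicgroup>
  </country>
  <country>
    <name>B</name>
    <population year="2010">500</population>
    <ethnicgroup percentage="10">Red</ethnicgroup>
  </country>
</mondial>
"""

root = ET.fromstring(SAMPLE_XML)

totals = Counter()
for country in root.iterfind('country'):
    pops = country.findall('population')
    # population of the most recent measurement
    latest_pop = int(max(pops, key=lambda p: int(p.get('year'))).text)
    for group in country.iterfind('ethnicgroup'):
        share = float(group.get('percentage')) / 100  # percent to fraction
        # += adds this country's contribution to the running total
        totals[group.text] += int(round(share * latest_pop))

print(totals.most_common())
```

Here "Red" totals 600 + 50 = 650. With plain assignment instead of `+=`, only the contribution of the last country processed would survive.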


## Question 4

Here is the longest river and the countries it travels through:

In [9]:
longest_river_length = 0
longest_river_name = "None"
longest_river_country = "None"
for element in document.iterfind('river'):
    length_elem = element.find("length")
    if length_elem is None:
        continue  # skip rivers with no recorded length
    elem_length = float(length_elem.text)
    if elem_length > longest_river_length:
        longest_river_length = elem_length
        longest_river_name = element.find("name").text
        longest_river_country = element.get("country")
    
print(longest_river_name)
print(longest_river_country)

Amazonas
CO BR PE
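The `country` attribute holds space-separated country codes rather than names. One way to translate them is a lookup table built from the `country` elements. The sketch below assumes each `country` element exposes its code in a `car_code` attribute, which is an assumption about the data rather than something shown above; the sample XML is hypothetical.

```python
from xml.etree import ElementTree as ET

# Hypothetical sample; the car_code attribute name is an assumption.
SAMPLE_XML = """
<mondial>
  <country car_code="X1"><name>Examplia</name></country>
  <country car_code="X2"><name>Samplestan</name></country>
  <river country="X1 X2"><name>Long River</name><length>1200</length></river>
</mondial>
"""

root = ET.fromstring(SAMPLE_XML)

# map each country code to its name
code_to_name = {c.get('car_code'): c.findtext('name')
                for c in root.iterfind('country')}

river = root.find('river')
# split the attribute on whitespace; fall back to the raw code if it is unknown
country_names = [code_to_name.get(code, code)
                 for code in river.get('country').split()]
print(country_names)
```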


Here is the name and country of the largest lake:

In [10]:
largest_lake_area = 0
largest_lake_name = "None"
largest_lake_country = "None"
for element in document.iterfind('lake'):
    area_elem = element.find("area")
    if area_elem is None:
        continue  # skip lakes with no recorded area
    elem_area = float(area_elem.text)
    # compare against the running lake maximum, not the river length from the previous cell
    if elem_area > largest_lake_area:
        largest_lake_area = elem_area
        largest_lake_name = element.find("name").text
        largest_lake_country = element.get("country")

print(largest_lake_name)
print(largest_lake_country)


Here is the highest elevation airport and the country it is in:

In [11]:
highest_airport_elev = 0
highest_airport_name = "None"
highest_airport_country = "None"
for element in document.iterfind('airport'):
    elev_elem = element.find("elevation")
    if elev_elem is None or elev_elem.text is None:
        continue  # skip airports with a missing or empty elevation
    elem_elev = float(elev_elem.text)
    if elem_elev > highest_airport_elev:
        highest_airport_elev = elem_elev
        highest_airport_name = element.find("name").text
        highest_airport_country = element.get("country")
    
print(highest_airport_name)
print(highest_airport_country)

El Alto Intl
BOL
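The three searches in this question share one pattern: scan the elements with a given tag and keep the one whose numeric child is largest. A small helper can express that once, which avoids copy-and-paste slips between cells. `max_by_child` is a hypothetical name introduced for this sketch, and the sample XML is made up.

```python
from xml.etree import ElementTree as ET


def max_by_child(root, tag, child_tag):
    """Return (name, country attribute, value) for the <tag> element whose
    <child_tag> holds the largest number, or None if no element has one."""
    best = None
    for element in root.iterfind(tag):
        raw = element.findtext(child_tag)
        # findtext() gives None for a missing child and '' for an empty one
        if raw is None or not raw.strip():
            continue
        value = float(raw)
        if best is None or value > best[2]:
            best = (element.findtext('name'), element.get('country'), value)
    return best


# Hypothetical sample data with a missing area and an empty elevation.
SAMPLE_XML = """
<mondial>
  <lake country="X1"><name>Big Lake</name><area>500</area></lake>
  <lake country="X2"><name>Small Lake</name><area>20</area></lake>
  <lake country="X2"><name>Unmeasured Lake</name></lake>
  <airport country="X1"><name>Low Field</name><elevation>12</elevation></airport>
  <airport country="X2"><name>High Field</name><elevation>4000</elevation></airport>
  <airport country="X2"><name>Unknown Field</name><elevation/></airport>
</mondial>
"""

root = ET.fromstring(SAMPLE_XML)
largest_lake = max_by_child(root, 'lake', 'area')
highest_airport = max_by_child(root, 'airport', 'elevation')
no_rivers = max_by_child(root, 'river', 'length')
print(largest_lake, highest_airport, no_rivers)
```

Starting from `None` rather than 0 means the helper would also handle data in which every value is negative, such as elevations below sea level.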
