# XML example and exercise
****
+ study examples of accessing nodes in XML tree structure  
+ work on exercise to be completed and submitted
****
+ reference: https://docs.python.org/2.7/library/xml.etree.elementtree.html
+ data source: http://www.dbis.informatik.uni-goettingen.de/Mondial
****

In [1]:
from xml.etree import ElementTree as ET

## XML example

+ for details about tree traversal and iterators, see https://docs.python.org/2.7/library/xml.etree.elementtree.html

In [3]:
document_tree = ET.parse( './data/mondial_database_less.xml' )

In [4]:
# print names of all countries
for child in document_tree.getroot():
    print child.find('name').text

Albania
Greece
Macedonia
Serbia
Montenegro
Kosovo
Andorra


In [5]:
# print names of all countries and their cities
for element in document_tree.iterfind('country'):
    print '* ' + element.find('name').text + ':',
    # iter() visits every descendant <city>; it replaces the deprecated getiterator()
    city_names = [city.find('name').text for city in element.iter('city')]
    print ', '.join(city_names)

* Albania: Tirana, Shkodër, Durrës, Vlorë, Elbasan, Korçë
* Greece: Komotini, Kavala, Athina, Peiraias, Peristeri, Acharnes, Patra, Kozani, Kerkyra, Ioannina, Thessaloniki, Iraklio, Chania, Ermoupoli, Rhodes, Tripoli, Lamia, Chalkida, Larissa, Volos, Mytilini, Karyes
* Macedonia: Skopje, Kumanovo
* Serbia: Beograd, Novi Sad, Niš
* Montenegro: Podgorica
* Kosovo: Prishtine
* Andorra: Andorra la Vella
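
The two loops above combine `find` (first matching direct child), `iterfind` (all matching direct children) and descendant iteration. A minimal self-contained sketch of the same pattern follows. It assumes a small made-up XML string (`SAMPLE_XML`, not the Mondial file), a hypothetical helper name `cities_by_country`, and Python 3 style `print()`:

```python
from xml.etree import ElementTree as ET

# Hypothetical sample that mirrors the assumed country/province/city/name layout
SAMPLE_XML = """
<mondial>
  <country>
    <name>Albania</name>
    <province>
      <name>Tirana</name>
      <city><name>Tirana</name></city>
    </province>
    <city><name>Shkoder</name></city>
  </country>
  <country>
    <name>Andorra</name>
    <city><name>Andorra la Vella</name></city>
  </country>
</mondial>
"""


def cities_by_country(root):
    """Map each country name to the names of all cities nested under it."""
    result = {}
    for country in root.iterfind('country'):      # direct children of the root
        name = country.find('name').text          # first direct <name> child
        result[name] = [city.find('name').text
                        for city in country.iter('city')]  # cities at any depth
    return result


root = ET.fromstring(SAMPLE_XML)
cities = cities_by_country(root)
for country_name, city_names in cities.items():
    print('* ' + country_name + ': ' + ', '.join(city_names))
```

Note that `find('name')` only looks at direct children, so the province name is not mistaken for the country name, while `iter('city')` still reaches the city nested inside the province.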


****
## XML exercise

Using the data in 'data/mondial_database.xml', the examples above, and referring to https://docs.python.org/2.7/library/xml.etree.elementtree.html, find

1. 10 countries with the lowest infant mortality rates
2. 10 cities with the largest population
3. 10 ethnic groups with the largest overall populations (sum of best/latest estimates over all countries)
4. name and country of a) longest river, b) largest lake and c) airport at highest elevation

## 1. 10 countries with the lowest infant mortality rates

In [32]:
document = ET.parse( './data/mondial_database.xml' )

In [34]:
inf_mort_dict = {}

# add each country name and infant mortality rate to a dict
for element in document.iterfind('country'):
    country = element.find('name').text
    rate = element.find('infant_mortality')
    # skip countries without a reported rate so the dict holds only floats
    if rate is not None:
        inf_mort_dict[country] = float(rate.text)

In [31]:
import operator

# sort the dict and print the first 10 key, value pairs
sorted_inf_mort = sorted(inf_mort_dict.items(),key=operator.itemgetter(1))
sorted_inf_mort[0:10]

[('Monaco', 1.81),
 ('Japan', 2.13),
 ('Bermuda', 2.48),
 ('Norway', 2.48),
 ('Singapore', 2.53),
 ('Sweden', 2.6),
 ('Czech Republic', 2.63),
 ('Hong Kong', 2.73),
 ('Macao', 3.13),
 ('Iceland', 3.15)]
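
As an alternative to sorting the whole dict, the ten smallest values can be picked directly with `heapq.nsmallest`. The sketch below is not the notebook's method. It assumes a made-up XML string (`SAMPLE_XML`) and a hypothetical helper name `lowest_infant_mortality`, and it skips countries that have no `infant_mortality` element:

```python
import heapq
from xml.etree import ElementTree as ET

# Hypothetical sample: country B has no infant_mortality element
SAMPLE_XML = """
<mondial>
  <country><name>A</name><infant_mortality>4.5</infant_mortality></country>
  <country><name>B</name></country>
  <country><name>C</name><infant_mortality>1.8</infant_mortality></country>
  <country><name>D</name><infant_mortality>13.2</infant_mortality></country>
</mondial>
"""


def lowest_infant_mortality(root, n=10):
    """Return the n (country, rate) pairs with the lowest reported rates."""
    rates = []
    for country in root.iterfind('country'):
        rate = country.find('infant_mortality')
        if rate is None or rate.text is None:
            continue  # no data for this country
        rates.append((float(rate.text), country.find('name').text))
    # tuples compare by rate first, then by name
    return [(name, value) for value, name in heapq.nsmallest(n, rates)]


lowest = lowest_infant_mortality(ET.fromstring(SAMPLE_XML), n=2)
print(lowest)
```

Leaving out missing values, rather than storing a placeholder string, keeps every comparison numeric, which matters on Python 3 where floats and strings cannot be ordered against each other.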

## 2. 10 cities with the largest population

In [175]:
pop_dict = {}

# add each country name and its largest population value to a dict
# note: this searches every <population> element under a country, so the
# maximum is generally the country-level figure rather than a single city's
for element in document.iterfind('country'):
    country = element.find('name').text
    pop = 0
    for subelement in element.iter('population'):
        sub_pop_int = int(subelement.text)
        if sub_pop_int > pop:
            pop = sub_pop_int
            pop_dict[country] = pop

In [176]:
# sort the dict and print the first 10 key, value pairs
sorted_pop_dict = sorted(pop_dict.items(),key=operator.itemgetter(1), reverse=True)
sorted_pop_dict[0:10]

[('China', 1360720000),
 ('India', 1210854977),
 ('United States', 318857056),
 ('Indonesia', 252124458),
 ('Brazil', 202768562),
 ('Pakistan', 173149306),
 ('Nigeria', 164294516),
 ('Bangladesh', 149772364),
 ('Russia', 148178487),
 ('Japan', 128057352)]
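
The result above ranks countries, because the search covers every `population` element under each country. A city-level ranking could be sketched as follows. This assumes that each `city` element carries its own `population` children with a `year` attribute; the sample XML and the helper name `largest_cities` are made up for illustration:

```python
from xml.etree import ElementTree as ET

# Hypothetical sample: cities may sit directly under a country or inside a province
SAMPLE_XML = """
<mondial>
  <country>
    <name>X</name>
    <population year="2011">5000000</population>
    <city>
      <name>Alpha</name>
      <population year="2001">900000</population>
      <population year="2011">1200000</population>
    </city>
    <province>
      <name>P</name>
      <city>
        <name>Beta</name>
        <population year="2011">300000</population>
      </city>
    </province>
  </country>
  <country>
    <name>Y</name>
    <city><name>Gamma</name></city>
  </country>
</mondial>
"""


def largest_cities(root, n=10):
    """Return up to n (city, country, population) tuples, largest first."""
    rows = []
    for country in root.iterfind('country'):
        country_name = country.find('name').text
        for city in country.iter('city'):
            # only the city's own <population> children, not the country's
            populations = city.findall('population')
            if not populations:
                continue  # city without any population figure
            # assumed: the entry with the highest year is the latest estimate
            latest = max(populations, key=lambda p: int(p.get('year', 0)))
            rows.append((int(latest.text), city.find('name').text, country_name))
    rows.sort(reverse=True)
    return [(city, country, pop) for pop, city, country in rows[:n]]


top_cities = largest_cities(ET.fromstring(SAMPLE_XML))
print(top_cities)
```

Using `city.findall('population')` restricts the search to the city's direct children, so country and province totals never enter the ranking.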

## 3. 10 ethnic groups with the largest overall populations (sum of best/latest estimates over all countries)

In [163]:
from lxml import etree
tree = etree.parse('./data/mondial_database.xml')

In [166]:
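# drop every <city> element so that city-level <population> values
# are not picked up when searching a country's subtree later on
for elem in tree.xpath('//city'):
    elem.getparent().remove(elem)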

In [229]:
pop_dict = {}

# record the most recent population estimate for each country
for element in tree.iterfind('country'):
    country = element.find('name').text
    latest_pop = get_recent_population(element)
    if latest_pop is not None:
        pop_dict[country] = latest_pop

In [228]:
def get_recent_population(country_element):
    """Return the population with the latest year, or None if there is none."""
    latest_year = None
    latest_pop = None

    # direct children only, so province-level figures are not mixed in
    for pop_element in country_element.findall('population'):
        # convert the year to int so it can be compared
        year = int(pop_element.attrib['year'])

        # keep this value if its year is later than the previous ones
        if latest_year is None or year > latest_year:
            latest_year = year
            latest_pop = int(pop_element.text)

    return latest_pop
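
The cells above only collect the latest population per country. The remaining step for this question, weighting each ethnic group by its share and summing over countries, might look like the sketch below. It assumes that each country has `ethnicgroup` children with a `percentage` attribute and the group name as text; the sample XML and the helper name `ethnic_group_totals` are made up for illustration:

```python
from collections import defaultdict
from xml.etree import ElementTree as ET

# Hypothetical sample: group "Blue" appears in two countries
SAMPLE_XML = """
<mondial>
  <country>
    <name>X</name>
    <population year="2001">900</population>
    <population year="2011">1000</population>
    <ethnicgroup percentage="80">Red</ethnicgroup>
    <ethnicgroup percentage="20">Blue</ethnicgroup>
  </country>
  <country>
    <name>Y</name>
    <population year="2010">500</population>
    <ethnicgroup percentage="50">Blue</ethnicgroup>
  </country>
</mondial>
"""


def ethnic_group_totals(root):
    """Return (group, estimated population) pairs, largest first."""
    totals = defaultdict(float)
    for country in root.iterfind('country'):
        populations = country.findall('population')
        if not populations:
            continue  # no estimate to apply the percentages to
        # assumed: the entry with the highest year is the latest estimate
        latest = max(populations, key=lambda p: int(p.get('year', 0)))
        latest_pop = int(latest.text)
        for group in country.findall('ethnicgroup'):
            percentage = float(group.get('percentage', 0))
            totals[group.text] += percentage * latest_pop / 100.0
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)


ranked_groups = ethnic_group_totals(ET.fromstring(SAMPLE_XML))
print(ranked_groups[:10])
```

Here "Blue" receives 20% of 1000 from country X plus 50% of 500 from country Y, which shows how the per-country shares accumulate into one total per group.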