# XML exercise

Using data from the [**Mondial database**](https://drive.google.com/file/d/14lFT4nWHgwN36ij4XZh6OUuup-K9qLgR/view?usp=sharing), answer the following questions:

1. 10 countries with the lowest infant mortality rates
2. 10 cities with the largest population
3. name and country of a) longest river, b) largest lake and c) airport at highest elevation

## 1. 10 countries with the lowest infant mortality rates

In [1]:
import pandas as pd 

In [4]:
import xml.etree.ElementTree as ET

In [5]:
document = ET.parse( '/Users/Oswal/Documents/GitHub/Other_data_types_exercise/mondial.xml' )

In [6]:
root = document.getroot()

In [7]:
# Collect the infant mortality rate of every country that reports one,
# keyed by country name. Sorting for the ten lowest happens in the next cell.
inf_mort = dict()
for element in document.iterfind('country'):
    for subelement in element.iterfind('infant_mortality'):
        inf_mort[element.find('name').text] = float(subelement.text)

In [8]:
infmort_df = pd.DataFrame.from_dict(inf_mort, orient ='index')
infmort_df.columns = ['infant_mortality']
infmort_df.index.names = ['country']
infmort_df.sort_values(by = 'infant_mortality', ascending = True).head(10)

country,infant_mortality
Monaco,1.81
Japan,2.13
Bermuda,2.48
Norway,2.48
Singapore,2.53
Sweden,2.6
Czech Republic,2.63
Spain,2.7
Hong Kong,2.73
Macao,3.13
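
The same ranking can be obtained without pandas. The following is a minimal sketch using `heapq.nsmallest` from the standard library; `SAMPLE_XML` and `lowest_infant_mortality` are hypothetical names introduced for illustration, and the sample only assumes the `country` / `name` / `infant_mortality` layout used in the cells above.

```python
import heapq
import xml.etree.ElementTree as ET

# Hypothetical miniature sample shaped like the elements read above.
SAMPLE_XML = """
<mondial>
  <country><name>A</name><infant_mortality>4.5</infant_mortality></country>
  <country><name>B</name><infant_mortality>1.8</infant_mortality></country>
  <country><name>C</name></country>
  <country><name>D</name><infant_mortality>2.1</infant_mortality></country>
</mondial>
"""

def lowest_infant_mortality(root, n=10):
    """Return the n (rate, country) pairs with the lowest infant mortality."""
    rates = []
    for country in root.iterfind('country'):
        rate = country.findtext('infant_mortality')
        if rate:  # skip countries with a missing or empty value
            rates.append((float(rate), country.findtext('name')))
    # nsmallest keeps only the n smallest tuples, compared by rate first
    return heapq.nsmallest(n, rates)

sample_root = ET.fromstring(SAMPLE_XML)
lowest_two = lowest_infant_mortality(sample_root, n=2)
print(lowest_two)
```

For the real file, passing `root` from the cell above in place of `sample_root` should give the same ten countries as the DataFrame, although ties may be ordered differently.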


## 2. 10 cities with the largest population

In [9]:
current_pop = 0
current_pop_year = 0
citypop = dict()

In [10]:
for country in document.iterfind('country'):
    for city in country.iter('city'):  # getiterator() was removed in Python 3.9
        for subelement in city.iterfind('population'):
            # A city can carry several <population> elements, one per census year.
            # Keep the value with the most recent 'year' attribute.
            if int(subelement.attrib['year']) > current_pop_year:
                current_pop = int(subelement.text)
                current_pop_year = int(subelement.attrib['year'])
                
        citypop[city.findtext('name')] = current_pop
        current_pop = 0
        current_pop_year = 0


In [11]:
citypop_df = pd.DataFrame.from_dict(citypop, orient ='index')
citypop_df.columns = ['population']
citypop_df.index.names = ['city']
citypop_df.sort_values(by = 'population', ascending = False).head(10)

city,population
Shanghai,22315474
Karachi,14916456
Lagos,13745000
Istanbul,13710512
Mumbai,12442373
Moskva,11979529
Beijing,11716620
Kinshasa,11575000
São Paulo,11152344
Lahore,11126285
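
One caveat with the loop above: the dictionary is keyed by city name alone, so two cities sharing a name overwrite each other, and cities without any `population` element are stored as 0. The sketch below is one possible way around both, keying on `(country, city)` and skipping cities without data. `SAMPLE_XML`, `latest_population` and `city_populations` are hypothetical names, and the nesting in the sample is an assumption made for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample: cities directly under a country and under a province.
SAMPLE_XML = """
<mondial>
  <country>
    <name>X</name>
    <city><name>Springfield</name>
      <population year="2001">100</population>
      <population year="2011">150</population>
    </city>
    <province>
      <city><name>Shelbyville</name>
        <population year="2011">90</population>
      </city>
    </province>
  </country>
  <country>
    <name>Y</name>
    <city><name>Springfield</name>
      <population year="2010">70</population>
    </city>
    <city><name>Nowhere</name></city>
  </country>
</mondial>
"""

def latest_population(city):
    """Return the population from the most recent year, or None if absent."""
    counts = [p for p in city.iterfind('population') if p.text and p.get('year')]
    if not counts:
        return None
    newest = max(counts, key=lambda p: int(p.get('year')))
    return int(newest.text)

def city_populations(root):
    """Map (country name, city name) to the latest known population."""
    result = {}
    for country in root.iterfind('country'):
        country_name = country.findtext('name')
        for city in country.iter('city'):  # iter() also reaches nested cities
            population = latest_population(city)
            if population is not None:
                result[(country_name, city.findtext('name'))] = population
    return result

sample_populations = city_populations(ET.fromstring(SAMPLE_XML))
print(sample_populations)
```

The resulting dictionary can be passed to `pd.Series` and sorted in the same way as `citypop_df`.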


## 3. name and country of a) longest river, b) largest lake and c) airport at highest elevation

### a) Longest river

In [12]:
river_ctry=None
river_name= None
river_length= 0

In [13]:
for river in document.iterfind('river'):
    for length in river.iterfind('length'):
        if river_length < float(length.text):
            river_length=float(length.text)
            river_ctry= river.attrib['country']
            river_name = river.findtext('name')

In [14]:
data_river = [[river_name], [river_ctry], [river_length]]
df_river = pd.DataFrame(data_river, columns = ['Longest River'],index=['Name','Location (Country Code)','Metric Value'])
df_river

,Longest River
Name,Yangtze
Location (Country Code),CN
Metric Value,6380.0


### b) Largest lake

In [15]:
lake_ctry= None
lake_name= None
lake_area = 0

In [17]:
for lake in document.iterfind('lake'):
    for area in lake.iterfind('area'):
        if lake_area < float(area.text):
            lake_area=float(area.text)
            lake_ctry= lake.attrib['country']
            lake_name = lake.findtext('name')

In [23]:
data_lake = [[lake_name],[lake_ctry],[lake_area]]
df_lake = pd.DataFrame(data_lake, columns = ['Largest Lake'],index=['Name','Location (Country Code)','Metric Value'])
df_lake

,Largest Lake
Name,Caspian Sea
Location (Country Code),R AZ KAZ IR TM
Metric Value,386400.0
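
The `country` attribute holds space-separated country codes rather than names. A possible way to translate them is sketched below. It assumes that each `country` element carries its code in a `car_code` attribute, which should be checked against the actual file; `SAMPLE_XML` and `country_names` are hypothetical names used only for this example.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample; the car_code attribute is an assumption about the file.
SAMPLE_XML = """
<mondial>
  <country car_code="AA"><name>Alpha Land</name></country>
  <country car_code="BB"><name>Beta Land</name></country>
  <lake country="AA BB"><name>Shared Lake</name><area>10</area></lake>
</mondial>
"""

def country_names(root, codes):
    """Translate a space-separated string of country codes into names."""
    lookup = {c.get('car_code'): c.findtext('name')
              for c in root.iterfind('country')}
    # Unknown codes are returned unchanged instead of raising an error
    return [lookup.get(code, code) for code in codes.split()]

sample_root = ET.fromstring(SAMPLE_XML)
shared = country_names(sample_root, sample_root.find('lake').get('country'))
print(shared)
```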


### c) Airport at highest elevation

In [16]:
airport_ctry= None
airport_name = None
airport_elv = 0

In [18]:
for airport in document.iterfind('airport'):
    for elv in airport.iterfind('elevation'):
        # Apparently at least one airport in the database has an <elevation> tag with no value,
        # so skip empty entries. The river and lake loops above would benefit from the same guard.
        if (elv.text is not None) and (airport_elv < float(elv.text)):
            airport_elv=float(elv.text)
            airport_ctry= airport.attrib['country']
            airport_name = airport.findtext('name') 

In [21]:
data_airport = [[airport_name],[airport_ctry],[airport_elv]]
df_airport = pd.DataFrame(data_airport, columns = ['Highest Airport'],index=['Name','Location (Country Code)','Metric Value'])
df_airport

,Highest Airport
Name,El Alto Intl
Location (Country Code),BOL
Metric Value,4063.0


### All together in one DataFrame

In [19]:

data_complete = [[lake_name,river_name,airport_name],[lake_ctry,river_ctry,airport_ctry],[lake_area,river_length,airport_elv]]
df_complete = pd.DataFrame(data_complete, columns = ['Largest Lake','Longest River','Highest Airport'],index=['Name','Location (Country Code)','Metric Value'])
df_complete

,Largest Lake,Longest River,Highest Airport
Name,Caspian Sea,Yangtze,El Alto Intl
Location (Country Code),R AZ KAZ IR TM,CN,BOL
Metric Value,386400.0,6380.0,4063.0
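
The three loops in this section differ only in the tag and the metric they read, so they could be folded into one helper that also applies the empty-value guard everywhere. This is a sketch under the assumption that each feature has at most one metric element; `SAMPLE_XML` and `largest_feature` are hypothetical names.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample, including one airport with an empty <elevation/> tag.
SAMPLE_XML = """
<mondial>
  <river country="AA"><name>Short</name><length>120</length></river>
  <river country="BB CC"><name>Long</name><length>980.5</length></river>
  <airport country="AA"><name>Empty</name><elevation/></airport>
  <airport country="BB"><name>High</name><elevation>3500</elevation></airport>
</mondial>
"""

def largest_feature(root, tag, metric):
    """Return (name, country codes, value) for the element with the largest
    metric, or None when no element of that tag has a usable value."""
    best = None
    for element in root.iterfind(tag):
        text = element.findtext(metric)
        if not text:  # metric element missing or empty
            continue
        value = float(text)
        if best is None or value > best[2]:
            best = (element.findtext('name'), element.get('country'), value)
    return best

sample_root = ET.fromstring(SAMPLE_XML)
longest_river = largest_feature(sample_root, 'river', 'length')
highest_airport = largest_feature(sample_root, 'airport', 'elevation')
largest_lake = largest_feature(sample_root, 'lake', 'area')  # no lakes in the sample
print(longest_river, highest_airport, largest_lake)
```

Starting from `None` rather than 0 also keeps the comparison valid for metrics that can be negative, such as elevations below sea level.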
