# Project 2

MADS-TPDS 

WS 24/25

### Diego Raúl Roldán Urueña

In [289]:
import requests
import numpy as np  # for using pandas
import pandas as pd  # for using dataframes

## Exercise 1

Python function used to query from the WorldBank Indicator API.

In [290]:
def fetch(indicators,countries=[],years="",verbose=False, per_page=32000):
    """
    Queries data from the Worldbank Indicator API.
    info about the API: https://datahelpdesk.worldbank.org/knowledgebase/articles/889392-about-the-indicators-api-documentation

    Parameters:
        indicators (list): List of indicators to query
        countries (list): List of countries to query. If empty, all countries will be queried.
        years (str): Years to query. Format: "YYYY:YYYY" or "YYYY"
        verbose (bool): If True, prints the number of API calls made.
        per_page (int): Number of results per page. Default is 32000 (maximum allowed value for API).
    
    Returns:
        pandas.DataFrame: DataFrame with the data queried from the API. Countryiso3code as index. Year and indicators as columns.
    """
    assert type(indicators) == list, "indicators must be a list"
    assert indicators.__len__() > 0, "You must add at least 1 indicator"
    assert len(set(indicators)) == len(indicators), "indicators must be unique"        
    assert all(isinstance(indicator, str) for indicator in indicators), "indicators must be a list of strings"

    assert type(countries) == list, "countries must be a list"
    assert all(isinstance(country, str) for country in countries), "countries must be a list of strings"

    assert type(years) == str, "years must be a string"

    assert type(verbose) == bool, "verbose must be a boolean"

    assert type(per_page) == int, "per_page must be an integer"


    indicators_str = ';'.join(indicators)
    countries_str = 'all' if len(countries)==0 else ';'.join(countries)
    date_param = f"date={years}&" if years != "" else ""

    actual_page = 1
    df = pd.DataFrame(columns=["countryiso3code","year","indicator","value"])
    while True:
        endpoint = f"https://api.worldbank.org/v2/country/{countries_str}/indicator/{indicators_str}?{date_param}format=json&per_page={per_page}&page={actual_page}&source=2"
        res = requests.get(endpoint)
        data = res.json()

        dfi = pd.DataFrame(data[1], columns=["countryiso3code","date","indicator","value"])
        dfi.rename(columns={"date":"year"}, inplace=True)
        dfi.index = range(df.shape[0],df.shape[0]+dfi.shape[0])

        df = pd.concat([df,dfi]) if not df.empty else dfi.copy()

        if data[0]['pages'] == actual_page:
            break
        else:
            actual_page+=1
    
    if verbose: print(f"{actual_page} API {'calls have' if actual_page>1 else 'call has'} been made")
    
    return getIndicators(df)


def getIndicators(df):
    assert isinstance(df, pd.DataFrame), "df must be a pandas DataFrame"

    df[['indicator_id', 'indicator_value']] = pd.DataFrame(df['indicator'].to_list(), index=df.index)

    for _id in df.indicator_id.unique().tolist():
        df[_id] = df[(df.indicator_id == _id)]['value']

    if df['indicator_id'].nunique() > 1:
        df = df.groupby('countryiso3code').first()
    else:
        df.set_index('countryiso3code', inplace=True)

    df.drop(columns=['value', 'indicator','indicator_id','indicator_value'], inplace=True)
    df.drop('', inplace=True, errors='ignore')
    return df

a) The total population (SP.POP.TOTL) of Germany (DE) and France (FR) between 2015 and 2020.

In [291]:
df_1a = fetch(["SP.POP.TOTL"],["FR","DE"],"2015:2020", verbose=True)
df_1a

1 API call has been made


Unnamed: 0_level_0,year,SP.POP.TOTL
countryiso3code,Unnamed: 1_level_1,Unnamed: 2_level_1
DEU,2020,83160871
DEU,2019,83092962
DEU,2018,82905782
DEU,2017,82657002
DEU,2016,82348669
DEU,2015,81686611
FRA,2020,67571107
FRA,2019,67388001
FRA,2018,67158348
FRA,2017,66918020


b) The total population (SP.POP.TOTL), GDP in current US$ (NY.GDP.MKTP.CD), and life expectancy in years at birth (SP.DYN.LE00.IN) of all countries (all) in 2012. Print the shape of the resulting DataFrame and display its first 10 rows

In [292]:
df_1b = fetch(["SP.POP.TOTL","NY.GDP.MKTP.CD","SP.DYN.LE00.IN"],[],"2012", verbose=True)
print(f"Dataframe shape is {df_1b.shape}")
df_1b.head(10)

1 API call has been made
Dataframe shape is (261, 4)


Unnamed: 0_level_0,year,SP.POP.TOTL,NY.GDP.MKTP.CD,SP.DYN.LE00.IN
countryiso3code,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
ABW,2012,102112.0,2615208000.0,75.531
AFE,2012,552530654.0,952675600000.0,60.05078
AFG,2012,30466479.0,19907330000.0,61.923
AFW,2012,376797999.0,737799600000.0,55.340561
AGO,2012,25188292.0,128052900000.0,58.623
ALB,2012,2900401.0,12319830000.0,78.064
AND,2012,71013.0,3188653000.0,
ARB,2012,380383408.0,2793776000000.0,70.180461
ARE,2012,8664969.0,384610100000.0,78.716
ARG,2012,41733271.0,545982400000.0,76.467


c) State how many API calls your function makes for a) and b) respectively.

- Printed on each cell

## Excercise 2

 The file medal_table.csv contains information about the number of medals won by each country at the
 Olympic Games 2012.

##### a) Preprocess both the medal table data and the Worldbank data retrieved in exercise 1 b) and combine the two datasets suitably into one tidy dataset. The final dataset should be such that it allows you to answer the following exercises (2b and 3). Explain your actions and decisions in a few sentences. 
 
*Notes:* 
 
*1. If there are missing values in the Worldbank data set (e.g. if no population data is available for Germany), then you do NOT need to impute these values.*

*2. Exercises 2b and 3 may require different handling of missing values. Therefore, it is fine if you create slightly different versions of the combined dataset for these exercises.*

In [293]:
df_medal = pd.read_csv("medal_table.csv")

# https://en.wikipedia.org/wiki/ISO_3166-1_alpha-3
country_code2iso3 = {
#   'USER_CODE' : 'ISO3',
    'GER':'DEU',
    'IRI':'IRN',
    'NED':'NLD',
    'RSA':'ZAF',
    'CRO':'HRV',
    'DEN':'DNK',
    'SUI':'CHE',
    'SLO':'SVN',
    'TPE':'TPE',
    'LAT':'LVA',
    'ALG':'DZA',
    'GRN':'GRD',
    'BAH':'BHS',
    'MGL':'MNG',
    'BUL':'BGR',
    'INA':'IDN',
    'MAS':'MYS',
    'PUR':'PRI',
    'BOT':'BWA',
    'GUA':'GTM',
    'POR':'PRT',
    'GRE':'GRC',
    'KSA':'SAU',
    'KUW':'KWT',
    'VIE':'VNM'
}

df_medal['iso3'] = df_medal['country_code'].map(country_code2iso3)

# set country_code as iso3 if iso3 is null
df_medal['iso3'] = df_medal['iso3'].combine_first(df_medal['country_code'])

df_medal.head(10)

Unnamed: 0,year,country,country_code,gold,silver,bronze,iso3
0,2012,United States,USA,46,28,30,USA
1,2012,People's Republic of China,CHN,38,31,22,CHN
2,2012,Great Britain,GBR,29,17,19,GBR
3,2012,Russian Federation,RUS,20,20,27,RUS
4,2012,Republic of Korea,KOR,13,9,8,KOR
5,2012,Germany,GER,11,20,13,DEU
6,2012,France,FRA,11,11,13,FRA
7,2012,Australia,AUS,8,15,12,AUS
8,2012,Italy,ITA,8,9,11,ITA
9,2012,Hungary,HUN,8,4,6,HUN


- Since there were some differences between the country code in the medal table and the one in the WorldBank API, I had to manually change the country code in the medal table to match the one in the WorldBank API. I used the this [link](https://en.wikipedia.org/wiki/ISO_3166-1_alpha-3) to find the correct country code.

In [294]:
df = pd.merge(left=df_medal, right=df_1b, left_on='iso3', right_index=True, how='left')
df.rename(columns={"year_x":"year"}, inplace=True)
df.drop(columns=['year_y','iso3'], inplace=True)
df.head(10)

Unnamed: 0,year,country,country_code,gold,silver,bronze,SP.POP.TOTL,NY.GDP.MKTP.CD,SP.DYN.LE00.IN
0,2012,United States,USA,46,28,30,313877700.0,16253970000000.0,78.741463
1,2012,People's Republic of China,CHN,38,31,22,1354190000.0,8532185000000.0,76.192
2,2012,Great Britain,GBR,29,17,19,63700220.0,2707090000000.0,80.904878
3,2012,Russian Federation,RUS,20,20,27,143378400.0,2208294000000.0,70.072195
4,2012,Republic of Korea,KOR,13,9,8,50199850.0,1278047000000.0,80.819512
5,2012,Germany,GER,11,20,13,80425820.0,3527143000000.0,80.539024
6,2012,France,FRA,11,11,13,65662240.0,2683672000000.0,81.968293
7,2012,Australia,AUS,8,15,12,22733460.0,1547650000000.0,82.046341
8,2012,Italy,ITA,8,9,11,59539720.0,2086958000000.0,82.239024
9,2012,Hungary,HUN,8,4,6,9920362.0,128814300000.0,75.063415


- A left join is performed because we don not want to lose any information from the medal table. If there is no information about the medals that row is not important for the analysis.

##### b) Create an alternative medal table for the 2012 Olympic Games by calculating the number of Gold, Silver, and Bronze medals won per 10 million inhabitants. Display the 10 most successful countries according to this alternative medal table

In [330]:
df_per_10M = df.copy()
df_per_10M['gold_per_10M'] = df_per_10M['gold'] / df_per_10M['SP.POP.TOTL'] * 10**7
df_per_10M['silver_per_10M'] = df_per_10M['silver'] / df_per_10M['SP.POP.TOTL'] * 10**7
df_per_10M['bronze_per_10M'] = df_per_10M['bronze'] / df_per_10M['SP.POP.TOTL'] * 10**7
df_per_10M['total_per_10M'] = (df_per_10M['gold_per_10M']+df_per_10M['silver_per_10M']+df_per_10M['bronze_per_10M'])

df_per_10M.sort_values(['gold_per_10M', 'silver_per_10M', 'bronze_per_10M'], ascending=False, inplace=True)

df_per_10M = df_per_10M[['country', 'gold_per_10M', 'silver_per_10M', 'bronze_per_10M', 'total_per_10M']]
df_per_10M.rename(columns={'country':'Country', 'gold_per_10M':'Gold', 'silver_per_10M':'Silver', 
                           'bronze_per_10M':'Bronze', 'total_per_10M':'Total' }, inplace=True)

df_per_10M['Rank'] = np.arange(1, len(df_per_10M)+1)
df_per_10M.set_index('Rank', inplace=True)

print(f"Medal table per 10M inhabitants in 2012")
df_per_10M.head(10)

Medal table per 10M inhabitants in 2012


Unnamed: 0_level_0,Country,Gold,Silver,Bronze,Total
Rank,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
1,Grenada,86.272345,0.0,0.0,86.272345
2,The Bahamas,26.173831,0.0,0.0,26.173831
3,Bahrain,24.592139,0.0,0.0,24.592139
4,Jamaica,14.493715,18.117143,10.870286,43.481144
5,New Zealand,13.611306,4.537102,11.342755,29.491164
6,Hungary,8.064222,4.032111,6.048166,18.144499
7,Croatia,7.029781,2.34326,4.686521,14.059563
8,Trinidad and Tobago,6.991164,6.991164,13.982328,27.964655
9,Lithuania,6.693949,0.0,10.040923,16.734872
10,Latvia,4.91565,0.0,4.91565,9.8313


## Exercise 3

##### Carry out a simple supervised machine learning experiment, in which you train a model to predict the number of medals a country wins at the Olympic Games based on demographic and economic features.

*Note: Since machine learning is not a focus topic of this course, you do not need to optimize the model. Just demonstrate that you are able to apply the steps we discussed in the course and correctly interpret the results.*

a) Train and evaluate a linear regression model: 
- 1. Split your data into a training and a test set. 
- 2. Train a linear regression model using population, life expectancy and the GDP per capita of a country as features.
- 3. Evaluate the model using the root mean squared error as the performance
 metric.

In [296]:
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import root_mean_squared_error

def split_fit_predict(X,y):
    x_train, x_test, y_train, y_test = train_test_split(X,y,test_size=0.1,shuffle=True)

    model = LinearRegression()
    model.fit(x_train, y_train)

    y_pred = model.predict(x_test)

    rmse = root_mean_squared_error(y_test, y_pred)
    return model, rmse

In [317]:
df_3 = df.copy()
df_3.dropna(how='any', inplace=True)
X = df_3[['SP.POP.TOTL', 'NY.GDP.MKTP.CD', 'SP.DYN.LE00.IN']]
y = df_3['gold'] + df_3['silver'] + df_3['bronze']

print("Printing RMSE for 4 different splits:")
for i in range(4):
    _,rmse = split_fit_predict(X,y)
    print(f"\troot Mean Squared Error: {rmse}")

Root Mean Squared Error: 21.474171484303533
Root Mean Squared Error: 5.717995558811607
Root Mean Squared Error: 12.773193161149457
Root Mean Squared Error: 7.263455507205382


- As this is a little dataset, the performance relys a lot on the selected train/test data, results change a lot depending on the split (from 4 to 25).

- For this reason a single split + prediction is not enough. We should run it not only once (let's try 1000) and get the root mean squared error mean of all iterarions. 

In [319]:
res = []
best_model = [None, None]
for i in range(1000):
    model_i,rmse_i = split_fit_predict(X,y)
    res.append(rmse_i)

    if best_model[0] == None or best_model[0] < rmse_i:
        best_model = [rmse_i,model_i]
        
res = np.array(res)
mean_rmse = res.mean()
print(f"The mean RMSE of all 100 iterations is {mean_rmse}")

np.float64(11.376468291096847)

b) Briefly discuss the results: How do you judge the performance? What are possible reasons for this
 performance? How could the model be improved?

<!-- Reasons -->
- The model is unacceptable if you want to predict other countries medals, RMSE 11.37 is too much.

- Maybe the model is overfitting to the training data, so when we test it with the new test data it performs poorly. 

- We could improve (if the model overfits) it by adding more features, like the number of athletes, the number of sports, the number of events, etc. We could also remove some useless features. These two things affect the performance of the model.
Other way to improve the model is to use a more complex model, like a Neural Network.

c) Predict the number of medals a hypothetical country with a population of 10 million, life expectancy
 of 70 years, and a GDP per capita of 20.000 US$ would win.

In [329]:
# 
names = ["SP.POP.TOTL","NY.GDP.MKTP.CD","SP.DYN.LE00.IN"]
values = np.array([10**7, 20000, 70]).reshape(1,-1)

x_sample = pd.DataFrame(data=values, columns=names)
y_sample = best_model[1].predict(x_sample)
print(f"The expeted number of medals is {y_sample}")

The expeted number of medals is [3.6562101]
