## OECD and IMF Data Processing for Machine Learning

This is a simple notebook for retrieving and working with two data sets: the OECD Better Life Index and IMF GDP per capita data.

Years covered: 2017

Inspiration drawn from: https://github.com/ageron/handson-ml

In [3]:
# Standard libraries
import json
import urllib.request as ureq

# Third-party libraries (if pandas and numpy are even considered non-standard at this point)
import numpy as np
import pandas as pd
np.random.seed(13)

### Part 1 - OECD Better Life Index
URL: http://www.oecdbetterlifeindex.org

Countries Covered: All Available
Measures: All available

In [4]:
"""
    URL for OECD API data that we'd like to get
"""
bli_2017_url = "https://stats.oecd.org/SDMX-JSON/data/BLI2017/AUS+AUT+BEL+CAN+CHL+CZE+DNK+EST+FIN+FRA+DEU+GRC+HUN+ISL+IRL+ISR+ITA+JPN+KOR+LVA+LUX+MEX+NLD+NZL+NOR+POL+PRT+SVK+SVN+ESP+SWE+CHE+TUR+GBR+USA+OECD+BRA+RUS+ZAF.HO+HO_HISH+HO_NUMR+IW+IW_HADI+IW_HNFW+JE+JE_EMPL+JE_LTUR+JE_PEARN+ES+ES_EDUA+ES_EDUEX+EQ+EQ_AIRP+EQ_WATER+HS+HS_LEB+SW+SW_LIFS+PS+PS_REPH+WL+WL_EWLH+WL_TNOW.L.TOT/all?&dimensionAtObservation=MeasureDimension&detail=DataOnly"


"""
    Retrieve the response and decode it to Unicode text
    Note: json.loads also accepts bytes, so the decode step is optional
"""
with ureq.urlopen(bli_2017_url) as resp:
    data = resp.read().decode("utf-8")


"""
    Parse the JSON text into Python dictionaries and lists
"""
data = json.loads(data)

In [7]:
"""
    We know that our initial json data has three keys:
        * header;
        * dataSets; and
        * structure

    The "structure" section lists the observation dimensions, and we'll put them into a
    list for readability.
    Out of the four dimensions, we really just want the first two, which cover country
    names and measures.

    Each observation key in the dataSets section holds one position per dimension, so
    with four dimensions a key looks like:
        0:0:0:0

    The first position in the key runs from 0 to 38, or 39 total values, which aligns
    with the number of countries in our data set.
"""

obs_list = data.get("structure").get("dimensions").get("observation")
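
To make the relationship between the "structure" dimensions and the observation keys concrete, here is a minimal sketch on a hand-made payload. The payload only mimics the shape described above: the dimension ids (`LOCATION`, `INDICATOR`, `MEASURE`, `INEQUALITY`), the numeric values, and the `decode_key` helper are assumptions for illustration, not taken from the real API response.

```python
# Hypothetical payload that imitates the shape of the OECD response.
# Dimension ids and observation values are made up for illustration.
mock = {
    "structure": {
        "dimensions": {
            "observation": [
                {"id": "LOCATION", "values": [{"id": "AUS", "name": "Australia"},
                                              {"id": "AUT", "name": "Austria"}]},
                {"id": "INDICATOR", "values": [{"id": "EQ_AIRP", "name": "Air pollution"}]},
                {"id": "MEASURE", "values": [{"id": "L", "name": "Value"}]},
                {"id": "INEQUALITY", "values": [{"id": "TOT", "name": "Total"}]},
            ]
        }
    },
    "dataSets": [{"observations": {"0:0:0:0": [1.5], "1:0:0:0": [2.5]}}],
}

dims = mock["structure"]["dimensions"]["observation"]


def decode_key(key, dims):
    """Translate each position in an observation key into the name it points at."""
    positions = [int(p) for p in key.split(":")]
    return [dim["values"][pos]["name"] for dim, pos in zip(dims, positions)]


# Pair every key with its decoded labels and its (unpacked) value.
decoded = {
    key: (decode_key(key, dims), obs[0])
    for key, obs in mock["dataSets"][0]["observations"].items()
}

for key, (labels, value) in decoded.items():
    print(key, labels, value)
```

The first position selects the country and the second selects the measure, which is why only the first two dimensions are needed in the cells that follow.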

In [27]:
"""
    Now we're enumerating each list of values so that they line up with the
    observation keys (the BLI 'index') used in the raw data.
    
    Each key holds one position per dimension in the "structure" section. Since
    we have four dimensions, a key looks like:
        0:0:0:0
        
    The first few items in the dataSets section look something like this:
    
    {'0:0:0:0': [20.0],
     '1:0:0:0': [21.0],
     '2:0:0:0': [21.0],
     '3:0:0:0': [22.0],
     '4:0:0:0': [24.0],
     '5:0:0:0': [24.0],
     '6:0:0:0': [23.0],
     '7:0:0:0': [21.0],
     '8:0:0:0': [20.0],
     '9:0:0:0': [24.0],
     ...}
     
     We'll build the lookups we need here and confirm in the next section, but the first
     position in the key runs from 0 to 38, or 39 total values,
     which aligns with the number of countries in our data set.
"""
country_dict = {k:v for k, v in enumerate(obs_list[0].get("values"))}
measure_dict = {k:v for k, v in enumerate(obs_list[1].get("values"))}
# value_dict = {k:v for k, v in enumerate(obs_list[2].get("values"))} # Not needed for this exercise
# total_dict = {k:v for k, v in enumerate(obs_list[3].get("values"))} # Not needed for this exercise

print("\n".join([f"{k}: {v}" for k, v in country_dict.items()]))

0: {'id': 'AUS', 'name': 'Australia'}
1: {'id': 'AUT', 'name': 'Austria'}
2: {'id': 'BEL', 'name': 'Belgium'}
3: {'id': 'CAN', 'name': 'Canada'}
4: {'id': 'CZE', 'name': 'Czech Republic'}
5: {'id': 'DNK', 'name': 'Denmark'}
6: {'id': 'FIN', 'name': 'Finland'}
7: {'id': 'FRA', 'name': 'France'}
8: {'id': 'DEU', 'name': 'Germany'}
9: {'id': 'GRC', 'name': 'Greece'}
10: {'id': 'HUN', 'name': 'Hungary'}
11: {'id': 'ISL', 'name': 'Iceland'}
12: {'id': 'IRL', 'name': 'Ireland'}
13: {'id': 'ITA', 'name': 'Italy'}
14: {'id': 'JPN', 'name': 'Japan'}
15: {'id': 'KOR', 'name': 'Korea'}
16: {'id': 'LUX', 'name': 'Luxembourg'}
17: {'id': 'MEX', 'name': 'Mexico'}
18: {'id': 'NLD', 'name': 'Netherlands'}
19: {'id': 'NZL', 'name': 'New Zealand'}
20: {'id': 'NOR', 'name': 'Norway'}
21: {'id': 'POL', 'name': 'Poland'}
22: {'id': 'PRT', 'name': 'Portugal'}
23: {'id': 'SVK', 'name': 'Slovak Republic'}
24: {'id': 'ESP', 'name': 'Spain'}
25: {'id': 'SWE', 'name': 'Sweden'}
26: {'id': 'CHE', 'name': 'Switzer

In [28]:
"""
    The dictionaries are converted into a pandas DataFrame for easier processing.
"""
country_df = pd.read_json(json.dumps(country_dict), orient="index").sort_index()
measure_df = pd.read_json(json.dumps(measure_dict), orient="index").sort_index()

In [29]:
"""
    In this section, we are processing our dataSets values and converting a dictionary
    output into a pandas DataFrame, akin to the above section.
    
    Note that in the middle portion of the code block we're creating a dictionary with labeled
    keys and values. Those labels become column names when converted into the DataFrame (kv_df).
    
    We're also 'unpacking' the values, which arrive wrapped in a list. To
    get a value out of its list, we just take the first element, which is v[0].
"""
key_values = data.get("dataSets")[0].get("observations")
kv_dict = {i:dict(idx=k, value=v[0]) for i, (k, v) in enumerate(key_values.items())}
kv_df = pd.read_json(json.dumps(kv_dict), orient="index").sort_index()

print(kv_df.head())

       idx  value
0  0:0:0:0   20.0
1  1:0:0:0   21.0
2  2:0:0:0   21.0
3  3:0:0:0   22.0
4  4:0:0:0   24.0
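
As a point of comparison, the same two-column frame can be built without the round trip through `json.dumps` and `pd.read_json`. This is only a sketch of an alternative, not the notebook's method: `observations` is a small hand-typed stand-in for `key_values` (using the first few items shown earlier), and `kv_sketch` is a hypothetical name.

```python
import pandas as pd

# Stand-in for data["dataSets"][0]["observations"]; each value arrives in a list.
observations = {"0:0:0:0": [20.0], "1:0:0:0": [21.0], "2:0:0:0": [21.0]}

# One (key, first list element) pair per observation becomes one row.
kv_sketch = pd.DataFrame(
    [(key, values[0]) for key, values in observations.items()],
    columns=["idx", "value"],
)

print(kv_sketch)
```

Because the rows are built in dictionary order, the default integer index already matches the enumeration used above and no `sort_index()` call is needed.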


In [30]:
"""
    Here, we will:
        split the index values into a separate DataFrame,
        keep the first two items,
        rename the columns
        convert the values to integers for joining with other data later on
"""
idx_df = (
    kv_df["idx"]
    .str.split(":", expand=True)
    .iloc[:, [0, 1]]
    .rename(columns={0: "country_id", 1: "measure_id"})
    .astype(int)
)
    
print(idx_df.head())

   country_id  measure_id
0           0           0
1           1           0
2           2           0
3           3           0
4           4           0
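
The split-and-convert step can be tried in isolation on a few keys. The sketch below uses a made-up Series (`idx`) rather than the real `kv_df`, and the third key is invented so that the measure position varies.

```python
import pandas as pd

# Made-up observation keys; the last one points at a second measure.
idx = pd.Series(["0:0:0:0", "1:0:0:0", "2:1:0:0"], name="idx")

# expand=True yields one string column per position in the key.
parts = idx.str.split(":", expand=True)

# Keep the country and measure positions and convert them to integers,
# so they can be matched against the integer index of the lookup frames.
ids = parts.iloc[:, [0, 1]].astype(int)
ids.columns = ["country_id", "measure_id"]

print(ids)
```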


And now, a smorgasbord of DataFrames and joins.

Basically, df1 merges country values into our core DataFrame, df2 merges measure values, and df3 adds a bit of formatting before merging with our key-value DataFrame.

The bli_df is a pivoted DataFrame with country as the index, measures as columns, and values as...well, values.

In [31]:
# Merge country and measure dataframes
df1 = idx_df.merge(country_df, how="left", left_on="country_id", right_index=True)
df2 = df1.merge(measure_df, how="left", left_on="measure_id", right_index=True)

# Reformat and rename for usability.
df3 = df2.drop(["country_id", "measure_id", "id_x", "id_y"], axis=1).copy()
df3 = df3.rename(columns={"name_x":"country","name_y":"measure"})

# Drop idx column from key-value dataframe prior to merging.
kv_df = kv_df.drop("idx", axis=1).copy()

# Merge final with kv_df to produce near-final dataframe.
df3 = df3.merge(kv_df, left_index=True, right_index=True)

# Final bli_df is a pivoted table.
bli_df = df3.pivot(index="country", columns="measure", values="value")

# Print sample of data.
print(bli_df.head())

measure    Air pollution  Educational attainment  \
country                                            
Australia            5.0                    80.0   
Austria             16.0                    85.0   
Belgium             15.0                    75.0   
Brazil              10.0                    49.0   
Canada               7.0                    91.0   

measure    Employees working very long hours  Employment rate  Homicide rate  \
country                                                                        
Australia                              13.20             72.0            1.0   
Austria                                 6.78             72.0            0.4   
Belgium                                 4.31             62.0            1.0   
Brazil                                  7.15             64.0           27.6   
Canada                                  3.73             73.0            1.4   

measure    Household net adjusted disposable income  \
country            
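
The `_x` and `_y` suffixes that get dropped and renamed above come from pandas itself: when both sides of a merge carry non-key columns with the same name, it disambiguates them. A toy version of the merge-then-pivot flow shows this. All frames and values here are illustrative stand-ins, not the real BLI data.

```python
import pandas as pd

# Toy lookup tables, indexed 0..n like country_df and measure_df.
country_lookup = pd.DataFrame({"id": ["AUS", "AUT"], "name": ["Australia", "Austria"]})
measure_lookup = pd.DataFrame(
    {"id": ["EQ_AIRP", "JE_EMPL"], "name": ["Air pollution", "Employment rate"]}
)

# Toy long-format observations: one row per (country, measure) pair.
long_df = pd.DataFrame({
    "country_id": [0, 1, 0, 1],
    "measure_id": [0, 0, 1, 1],
    "value": [1.0, 2.0, 3.0, 4.0],
})

# The second merge sees "id" and "name" on both sides, so pandas
# renames them to id_x/name_x (left) and id_y/name_y (right).
merged = (
    long_df
    .merge(country_lookup, how="left", left_on="country_id", right_index=True)
    .merge(measure_lookup, how="left", left_on="measure_id", right_index=True)
)

# Rename the two name columns and pivot to one row per country.
wide = (
    merged
    .rename(columns={"name_x": "country", "name_y": "measure"})
    .pivot(index="country", columns="measure", values="value")
)

print(wide)
```

Note that `pivot` raises an error if any (country, measure) pair appears more than once, which doubles as a sanity check on the long-format data.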

### Part 2 - IMF GDP Data
URL: https://www.imf.org/external/pubs/ft/weo/2017/01/weodata/index.aspx

Same story for the first cell as we had with OECD. Import and decode raw data for processing.

In [19]:
imf_gdp_url = "https://www.imf.org/external/pubs/ft/weo/2018/01/weodata/weoreptc.aspx?"
imf_gdp_url += "pr.x=52&pr.y=12&sy=2017&ey=2017&sort=country&ds=.&br=1&c=193%2C946%2C122%2C137%2C124%2C546%2C156%2C181%2C423%2C138%2C935%2C196%2C128%2C142%2C939%2C182%2C172%2C359%2C132%2C135%2C134%2C576%2C174%2C936%2C532%2C961%2C176%2C184%2C178%2C144%2C436%2C146%2C136%2C528%2C158%2C112%2C542%2C111%2C941&s=NGDPDPC&grp=0&a="
with ureq.urlopen(imf_gdp_url) as resp:
    imf_data  = resp.read().decode("utf-8")

Here, we'll split each line of our data into a list.

We're also going to skip any blank lines, along with the trailing line that starts with "International Monetary Fund".

In [20]:
tokens = imf_data.split("\r\n")
tokens = [i for i in tokens if len(i) > 0 and not i.startswith("International Monetary Fund")]
header = tokens[0].split("\t")
tokens = tokens[1:]

We'll create a dictionary with named keys and empty lists as values. Since each row is tab-delimited, we'll split it on tabs.

We also know that the values we want are the first and second-to-last items in each split row, so we index those positions and append each value to the matching list in our dictionary.

In [21]:
ddict = dict(country=[], gdp_per_capita=[])

for line in tokens:
    tmp = line.split("\t")
    ddict["country"].append(tmp[0])
    ddict["gdp_per_capita"].append(tmp[-2])

# Add to DataFrame; Reformat currency to integer data type.
gdp_df = pd.DataFrame(ddict)
gdp_df["gdp_per_capita"] = gdp_df["gdp_per_capita"].str.split(".", expand=True)[0].str.replace(",","").astype(int)

# Set the DataFrame index to our country names for easier joining; Drop the country column thereafter.
gdp_df.index = gdp_df["country"].values
gdp_df.drop("country", axis=1, inplace=True)
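
For comparison, the manual split-and-index steps above could also be handed to `pd.read_csv`, which understands tab separators and thousands separators. This is a sketch of that alternative, not the notebook's approach. It assumes a header row with `Country` and `2017` columns, and the mock text (`mock_tsv`), country names, and figures are all invented.

```python
import io

import pandas as pd

# Invented tab-delimited text imitating the layout handled above:
# a header row, data rows, a blank line, and a trailing source line.
mock_tsv = "\r\n".join([
    "Country\tSubject Descriptor\tUnits\t2017\tEstimates Start After",
    "Country A\tGDP per capita\tU.S. dollars\t12,345.67\t2016",
    "Country B\tGDP per capita\tU.S. dollars\t9,876.54\t2016",
    "",
    "International Monetary Fund, mock footer line",
])

# Same filtering as before: drop blank lines and the trailing source line.
lines = [
    ln for ln in mock_tsv.split("\r\n")
    if ln and not ln.startswith("International Monetary Fund")
]

# thousands="," lets pandas parse "12,345.67" as the float 12345.67.
frame = pd.read_csv(io.StringIO("\n".join(lines)), sep="\t", thousands=",")

# astype(int) truncates the decimals, like splitting on "." does above.
gdp_sketch = (
    frame.set_index("Country")["2017"]
    .astype(int)
    .to_frame("gdp_per_capita")
)

print(gdp_sketch)
```

Selecting the year column by name rather than by position keeps working if columns are added or reordered, though it does depend on the header text being stable.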

Finally, we'll merge our two DataFrames together on the indices as an inner join.

In [25]:
full_country_stats = pd.merge(left=bli_df, right=gdp_df, left_index=True, right_index=True)

We can print some sample data for the US.

In [26]:
print(full_country_stats[["gdp_per_capita", 'Life satisfaction']].loc["United States"])

gdp_per_capita       59501.0
Life satisfaction        6.9
Name: United States, dtype: float64
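
One thing to keep in mind with an inner join on country names is that any country spelled differently in the two sources is silently dropped. A quick way to audit this, sketched on made-up frames with placeholder names and values, is to run an outer join with `indicator=True` and look at the rows that did not match on both sides.

```python
import pandas as pd

# Made-up frames: "Country B" exists only on the left, "Country C" only on the right.
left = pd.DataFrame({"Life satisfaction": [7.0, 6.0]}, index=["Country A", "Country B"])
right = pd.DataFrame({"gdp_per_capita": [50000, 40000]}, index=["Country A", "Country C"])

# Inner join (the default): only names present in both frames survive.
inner = pd.merge(left, right, left_index=True, right_index=True)

# Outer join with an indicator column marks where each row came from.
audit = pd.merge(
    left, right, how="outer", left_index=True, right_index=True, indicator=True
)
unmatched = audit[audit["_merge"] != "both"]

print(inner)
print(unmatched[["_merge"]])
```

Running the same audit on `bli_df` and `gdp_df` would show whether any countries fall out of `full_country_stats` because of naming differences.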
