# Main Objective

The governator of Zootopia, a sibling country of Dystopia and another bad place to live, wants to maximize his country's score in Dystopia's Happiness Score. He claims to have the secret formula behind the score, and he wants to compare the policies he could pursue to accomplish his goal. In particular, he wants to see which score his country would get under different combinations of the variables.

The main objective is therefore to predict the Happiness Score that a country will get for different combinations of the inputs; in other words, to develop a predictor for the target variable Happiness Score. These models have one known limitation: the variable Dystopia Residual depends on how the other countries perform that year, and the model has no way to predict that.

# Environment

In [1]:
# Ideally we would create an environment and install specific library versions, to avoid issues caused by version differences.
# I was not sure whether this is possible with the DevSkill git repo and could not find the answer online, so I leave this part commented out.
# pip install -r requirements.txt

# Libraries

In [2]:
# for handling data:
import numpy as np
import pandas as pd

# for plotting:
import plotly.express as px
import matplotlib.pyplot as plt

# for machine learning:
import sklearn
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor

# Setting everything Up

In [3]:
# Function to load the CSVs, from min_year to max_year.
def load_world_happinnes_reports(min_year, max_year):
    """Load one World Happiness Report CSV per year into a global DataFrame named data_<year>."""

    # Iterate over the years to load every CSV.
    for i in range(min_year, max_year + 1):

        # Create the DataFrame from the CSV named after the year.
        df = pd.read_csv(f'https://raw.githubusercontent.com/ChFontana/Datasets/main/World%20Happiness%20Report/{i}.csv')

        # Create the column Year, equal to the year of the CSV.
        df['Year'] = i

        # Create the column Region if it does not exist.
        if 'Region' not in df.columns:
            df['Region'] = np.nan

        # Create the column Dystopia Residual if it does not exist.
        # df['Dystopia Residual'] = df.get('Dystopia Residual', np.nan)

        # Expose the DataFrame as a global variable, e.g. data_2015.
        globals()[f"data_{i}"] = df
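
Writing into `globals()` works in a notebook, but it makes the data flow harder to follow. As a minimal alternative sketch, assuming the same URL layout, the loader could return a dictionary keyed by year. The names `load_reports`, `URL_TEMPLATE` and the `reader` parameter are hypothetical and not part of the original notebook; `reader` exists only so the function can be tried without network access.

```python
import numpy as np
import pandas as pd

# Same URL layout as the loader above; {year} is filled in per file.
URL_TEMPLATE = (
    "https://raw.githubusercontent.com/ChFontana/Datasets/main/"
    "World%20Happiness%20Report/{year}.csv"
)

def load_reports(min_year, max_year, reader=pd.read_csv):
    """Return {year: DataFrame} for min_year..max_year (hypothetical helper)."""
    reports = {}
    for year in range(min_year, max_year + 1):
        df = reader(URL_TEMPLATE.format(year=year))
        df["Year"] = year
        # Older/newer reports may lack Region; add it as missing values.
        if "Region" not in df.columns:
            df["Region"] = np.nan
        reports[year] = df
    return reports

# Offline demonstration with a stand-in reader that returns a tiny made-up frame.
def fake_reader(url):
    return pd.DataFrame({"Country": ["Zootopia"], "Score": [5.0]})

demo_reports = load_reports(2018, 2019, reader=fake_reader)
print(sorted(demo_reports))
print(demo_reports[2019].columns.tolist())
```

With real data the call would simply be `load_reports(2015, 2019)`, and each year is then accessed as `reports[2017]` instead of `data_2017`.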

In [4]:
# Load all the CSV that we have, from 2015 to 2019.
load_world_happinnes_reports(2015,2019)

In [5]:
# The datasets use different column names, so we have to harmonize them.
data_2017.rename(columns = {
'Happiness.Score': 'Happiness Score', 'Dystopia.Residual':'Dystopia Residual',
'Happiness.Rank':'Happiness Rank', 'Economy..GDP.per.Capita.':'Economy (GDP per Capita)',
'Health..Life.Expectancy.':'Health (Life Expectancy)','Trust..Government.Corruption.':'Trust (Government Corruption)'
}, inplace=True)

data_2018.rename(columns = {
'Country or region':'Country', 'Overall rank':'Happiness Rank', 'GDP per capita':'Economy (GDP per Capita)',
'Social support':'Family', 'Healthy life expectancy':'Health (Life Expectancy)', 'Freedom to make life choices':'Freedom',
'Perceptions of corruption':'Trust (Government Corruption)', 'Score':'Happiness Score'
}, inplace=True)

data_2019.rename(columns = {
'Country or region':'Country', 'Overall rank':'Happiness Rank', 'GDP per capita':'Economy (GDP per Capita)',
'Social support':'Family', 'Healthy life expectancy':'Health (Life Expectancy)', 'Freedom to make life choices':'Freedom',
'Perceptions of corruption':'Trust (Government Corruption)', 'Score':'Happiness Score'
}, inplace=True)

# Select the columns that we will use.
columns = [ 'Country', 'Region', 'Happiness Rank', 'Happiness Score', 'Economy (GDP per Capita)', 
            'Family', 'Health (Life Expectancy)', 'Freedom', 'Trust (Government Corruption)',
            'Generosity', 'Year', 'Dystopia Residual']

In [6]:
# Let's define a function to select the years that we want to draw insights from.

def years_to_use(*year, mode='period'):
    """Build the global DataFrame `tables` from the selected yearly reports.

    mode='period' uses every year from min(year) to max(year);
    mode='individual' uses only the years passed in.
    """
    
    # Create the variables that we will use in the function.
    global tables
    modes = ['period', 'individual']
    years = []
    
    # Check that mode is one of the two options.
    if mode not in modes:
        raise ValueError("Invalid mode. Expected one of: %s" % modes)

    # Check that the years are within the expected range (current min and max year in the dataset).
    for individual_year in year:
        if individual_year not in range(2015,2020):
            raise ValueError("Invalid year. Expected values in the range: %s" % list(range(2015,2020))) 

    # Period mode: collect every year between the smallest and the largest year passed.
    if mode == 'period':
        for i in range(min(year),max(year)+1):
            years.append(globals()[f"data_{i}"])

    # Individual mode: collect only the years passed.
    elif mode == 'individual':
        for y in year:
            years.append(globals()[f"data_{y}"])

    # Merge everything into one table.
    tables = pd.concat(years)

    # Keep only the columns that we are interested in.
    tables = tables[columns]

    # Sort by country and year, so that we can fill the missing Region values afterwards.
    tables = tables.sort_values(by=['Country', 'Year'])

    # Some countries appear with inconsistent casing or whitespace; apply strip and upper to Country and Region.
    tables[['Country', 'Region']] = tables[['Country', 'Region']].apply(lambda x: x.str.strip().str.upper())

In [7]:
# Initially I'll use all the information available, from 2015 to 2019. 
# You can choose the period by changing the input arguments, or pick individual years by changing the mode to 'individual' and passing the years.
years_to_use(2015,2019, mode='period')

In [8]:
# Check the nulls in the DataFrame: they are basically in Region and Dystopia Residual, plus 1 in Government Corruption.
tables.isna().sum()

Country                            0
Region                           467
Happiness Rank                     0
Happiness Score                    0
Economy (GDP per Capita)           0
Family                             0
Health (Life Expectancy)           0
Freedom                            0
Trust (Government Corruption)      1
Generosity                         0
Year                               0
Dystopia Residual                312
dtype: int64

In [9]:
# We sorted by country and year, so we fill forward. It would be good to investigate the null value in Government Corruption further.
# For that null value we could evaluate more sophisticated ways of filling it, such as
# following the trend of other countries, following the country's own trend, or looking for the information online, among other methods.
tables[['Region', 'Trust (Government Corruption)']] = tables.loc[:,['Region', 'Trust (Government Corruption)']].ffill()

# Now let's get Dystopia Residual = Happiness Score - SUM(rest of variables)
rest_of_variables = [   'Economy (GDP per Capita)', 'Family', 'Health (Life Expectancy)',
                        'Freedom', 'Trust (Government Corruption)', 'Generosity']
tables['Dystopia Residual'] = tables.loc[:,'Happiness Score'] - tables.loc[:,rest_of_variables].sum(axis=1)
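
As a quick sanity check of the formula above, the arithmetic can be reproduced by hand for a single row. The sketch below uses the values of the AFGHANISTAN 2015 row as shown in the `tables.head(20)` output of this notebook; the variable names are only illustrative.

```python
# Values copied from the AFGHANISTAN 2015 row printed by tables.head(20).
happiness_score = 3.575
factors = {
    "Economy (GDP per Capita)": 0.31982,
    "Family": 0.30285,
    "Health (Life Expectancy)": 0.30335,
    "Freedom": 0.23414,
    "Trust (Government Corruption)": 0.09719,
    "Generosity": 0.3651,
}

# Dystopia Residual = Happiness Score - SUM(rest of variables)
dystopia_residual = happiness_score - sum(factors.values())
print(round(dystopia_residual, 5))  # 1.95255, the value shown for that row
```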

In [10]:
# Check that we don't have any nulls left.
tables.isna().sum()

Country                          0
Region                           0
Happiness Rank                   0
Happiness Score                  0
Economy (GDP per Capita)         0
Family                           0
Health (Life Expectancy)         0
Freedom                          0
Trust (Government Corruption)    0
Generosity                       0
Year                             0
Dystopia Residual                0
dtype: int64
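
One caveat of filling forward over the whole sorted table: if a country has no Region in its first year, it could inherit the Region of the previous country in alphabetical order. A minimal sketch of a stricter alternative, on a tiny made-up frame, fills within each country only (forward, then backward). This is an illustration of the idea, not the method used in this notebook.

```python
import numpy as np
import pandas as pd

# Tiny made-up frame: country B has no Region in its first year.
demo = pd.DataFrame({
    "Country": ["A", "A", "B", "B"],
    "Year": [2015, 2016, 2015, 2016],
    "Region": ["NORTH", np.nan, np.nan, "SOUTH"],
}).sort_values(["Country", "Year"])

# A plain forward fill copies A's region into B's first row.
global_fill = demo["Region"].ffill()

# Filling within each country (forward, then backward) keeps countries separate.
grouped_fill = demo.groupby("Country")["Region"].transform(lambda s: s.ffill().bfill())

print(global_fill.tolist())   # ['NORTH', 'NORTH', 'NORTH', 'SOUTH']
print(grouped_fill.tolist())  # ['NORTH', 'NORTH', 'SOUTH', 'SOUTH']
```

With the grouped fill, a country that never has a Region in any year stays null, which makes such cases visible instead of silently mislabelled.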

In [11]:
# Let's see our final DataFrame.
tables.head(20)

Unnamed: 0,Country,Region,Happiness Rank,Happiness Score,Economy (GDP per Capita),Family,Health (Life Expectancy),Freedom,Trust (Government Corruption),Generosity,Year,Dystopia Residual
152,AFGHANISTAN,SOUTHERN ASIA,153,3.575,0.31982,0.30285,0.30335,0.23414,0.09719,0.3651,2015,1.95255
153,AFGHANISTAN,SOUTHERN ASIA,154,3.36,0.38227,0.11037,0.17344,0.1643,0.07112,0.31268,2016,2.14582
140,AFGHANISTAN,SOUTHERN ASIA,141,3.794,0.401477,0.581543,0.180747,0.10618,0.061158,0.311871,2017,2.151024
144,AFGHANISTAN,SOUTHERN ASIA,145,3.632,0.332,0.537,0.255,0.085,0.036,0.191,2018,2.196
153,AFGHANISTAN,SOUTHERN ASIA,154,3.203,0.35,0.517,0.361,0.0,0.025,0.158,2019,1.792
94,ALBANIA,CENTRAL AND EASTERN EUROPE,95,4.959,0.87867,0.80434,0.81325,0.35733,0.06413,0.14272,2015,1.89856
108,ALBANIA,CENTRAL AND EASTERN EUROPE,109,4.655,0.9553,0.50163,0.73007,0.31866,0.05301,0.1684,2016,1.92793
108,ALBANIA,CENTRAL AND EASTERN EUROPE,109,4.644,0.996193,0.803685,0.73116,0.381499,0.039864,0.201313,2017,1.490287
111,ALBANIA,CENTRAL AND EASTERN EUROPE,112,4.586,0.916,0.817,0.79,0.419,0.032,0.149,2018,1.463
106,ALBANIA,CENTRAL AND EASTERN EUROPE,107,4.719,0.947,0.848,0.874,0.383,0.027,0.178,2019,1.462


In [12]:
# Check that every Country has only one Country - Region pair.
# To understand in depth how this works, remove everything after the first method call and add the calls back one by one.
tables[['Country', 'Region']].value_counts().to_frame().reset_index().Country.value_counts().sort_values()

AFGHANISTAN                 1
MOLDOVA                     1
KOSOVO                      1
MYANMAR                     1
NEPAL                       1
                           ..
NORTH MACEDONIA             1
DJIBOUTI                    1
TAIWAN PROVINCE OF CHINA    1
SOMALILAND REGION           1
OMAN                        1
Name: Country, Length: 169, dtype: int64

# Data Exploratory Analysis

Now that we have everything set up, we can start our analysis. It is always very important to check your data carefully before starting the analysis.

In [13]:
# Let's see the general descriptive statistics of the DataFrame.
# We can't infer much about the variables, because they are just calculations or transformations of the real ones, and we don't know how they are calculated.
# Anyway, this step is important to see whether we can find something strange or insightful.
tables.describe()

Unnamed: 0,Happiness Rank,Happiness Score,Economy (GDP per Capita),Family,Health (Life Expectancy),Freedom,Trust (Government Corruption),Generosity,Year,Dystopia Residual
count,782.0,782.0,782.0,782.0,782.0,782.0,782.0,782.0,782.0,782.0
mean,78.69821,5.379018,0.916047,1.078392,0.612416,0.411091,0.12569,0.218576,2016.993606,2.016806
std,45.182384,1.127456,0.40734,0.329548,0.248309,0.15288,0.105988,0.122321,1.417364,0.556426
min,1.0,2.693,0.0,0.0,0.0,0.0,0.0,0.0,2015.0,0.184
25%,40.0,4.50975,0.6065,0.869363,0.440183,0.309768,0.05425,0.13,2016.0,1.678
50%,79.0,5.322,0.982205,1.124735,0.64731,0.431,0.091033,0.201982,2017.0,2.022
75%,118.0,6.1895,1.236187,1.32725,0.808,0.531,0.156243,0.278832,2018.0,2.371473
max,158.0,7.769,2.096,1.644,1.141,0.724,0.55191,0.838075,2019.0,3.83738


In [14]:
# Here we can see the representation of each region, in the total dataset.
px.histogram(tables, x='Region', barmode='group', histnorm='probability',)

In [15]:
# Number of different countries in the table.
tables.Country.nunique()

169

In [16]:
# Countries that do not appear in all 5 yearly datasets: 27 of the 169 countries.
# There are a few cases of countries written in different ways, such as Trinidad and Tobago, Hong Kong, etc.
tables.groupby('Country')['Region'].count()[tables.groupby('Country')['Region'].count() < 5].reset_index().rename(columns={'Region' :'Count'}).sort_values('Count', ascending=False)

Unnamed: 0,Country,Count
0,ANGOLA,4
2,CENTRAL AFRICAN REPUBLIC,4
24,TAIWAN,4
21,SUDAN,4
20,SOUTH SUDAN,4
6,HONG KONG,4
18,SOMALIA,4
8,LAOS,4
9,LESOTHO,4
10,MACEDONIA,4
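
Countries written in different ways could be merged before counting, with a small alias map applied to the Country column. The sketch below is only an illustration: `COUNTRY_ALIASES` and `harmonize_countries` are hypothetical names, the two aliases are assumptions based on spellings that appear in the outputs of this notebook, and the full list would have to be built by inspecting the data.

```python
import pandas as pd

# Hypothetical alias map: which spelling counts as canonical is an assumption,
# and the real list of aliases has to be built by inspecting the data.
COUNTRY_ALIASES = {
    "TAIWAN PROVINCE OF CHINA": "TAIWAN",
    "NORTH MACEDONIA": "MACEDONIA",
}

def harmonize_countries(df, aliases=COUNTRY_ALIASES):
    """Return a copy of df with alias country names replaced by the canonical one."""
    out = df.copy()
    out["Country"] = out["Country"].replace(aliases)
    return out

# Made-up rows, only to show the effect of the mapping.
demo = pd.DataFrame({
    "Country": ["TAIWAN", "TAIWAN PROVINCE OF CHINA", "MACEDONIA", "NORTH MACEDONIA"],
    "Year": [2015, 2018, 2015, 2019],
})
clean = harmonize_countries(demo)
print(clean["Country"].value_counts().to_dict())
```

Applying something like `tables = harmonize_countries(tables)` before the count above should reduce the number of countries that seem to be missing years.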


In [27]:
# This part of the code creates dummies for the region, but the DataFrame gets harder to read and the improvement in accuracy from the region variables was too low.

# Create dummies for the regions. Drop the first.
# region_dummies = pd.get_dummies(tables['Region'], drop_first=True)

# Let's add the columns to tables. I use a longer way instead of join, so that the cell can be re-run without restarting.
# tables[['CENTRAL AND EASTERN EUROPE', 'EASTERN ASIA',
#        'LATIN AMERICA AND CARIBBEAN', 'MIDDLE EAST AND NORTHERN AFRICA',
#        'NORTH AMERICA', 'SOUTHEASTERN ASIA', 'SOUTHERN ASIA',
#        'SUB-SAHARAN AFRICA', 'WESTERN EUROPE']] = \
#        region_dummies[['CENTRAL AND EASTERN EUROPE', 'EASTERN ASIA',
#        'LATIN AMERICA AND CARIBBEAN', 'MIDDLE EAST AND NORTHERN AFRICA',
#        'NORTH AMERICA', 'SOUTHEASTERN ASIA', 'SOUTHERN ASIA',
#        'SUB-SAHARAN AFRICA', 'WESTERN EUROPE']]

# Check the final dataset.
# tables.head()

Unnamed: 0,Country,Region,Happiness Rank,Happiness Score,Economy (GDP per Capita),Family,Health (Life Expectancy),Freedom,Trust (Government Corruption),Generosity,...,Dystopia Residual,CENTRAL AND EASTERN EUROPE,EASTERN ASIA,LATIN AMERICA AND CARIBBEAN,MIDDLE EAST AND NORTHERN AFRICA,NORTH AMERICA,SOUTHEASTERN ASIA,SOUTHERN ASIA,SUB-SAHARAN AFRICA,WESTERN EUROPE
152,AFGHANISTAN,SOUTHERN ASIA,153,3.575,0.31982,0.30285,0.30335,0.23414,0.09719,0.3651,...,1.95255,0,0,0,0,0,0,1,0,0
153,AFGHANISTAN,SOUTHERN ASIA,154,3.36,0.38227,0.11037,0.17344,0.1643,0.07112,0.31268,...,2.14582,0,0,0,0,0,0,1,0,0
140,AFGHANISTAN,SOUTHERN ASIA,141,3.794,0.401477,0.581543,0.180747,0.10618,0.061158,0.311871,...,2.151024,0,0,0,0,0,0,1,0,0
144,AFGHANISTAN,SOUTHERN ASIA,145,3.632,0.332,0.537,0.255,0.085,0.036,0.191,...,2.196,0,0,0,0,0,0,1,0,0
153,AFGHANISTAN,SOUTHERN ASIA,154,3.203,0.35,0.517,0.361,0.0,0.025,0.158,...,1.792,0,0,0,0,0,0,1,0,0


In [19]:
# Let's check the overall correlations. Most of the variables are positively correlated.
# Generosity is the variable with the most negative and weakest correlations, and it has the lowest correlation with Happiness Score.
# Economy (GDP per Capita) and Health (Life Expectancy) seem to have the highest correlation with Happiness Score, and they are strongly correlated with each other (0.78).
# The Year column doesn't mean anything in this case.

# Only the numeric columns can be correlated.
tables.select_dtypes('number').corr()

Unnamed: 0,Happiness Rank,Happiness Score,Economy (GDP per Capita),Family,Health (Life Expectancy),Freedom,Trust (Government Corruption),Generosity,Year,Dystopia Residual,CENTRAL AND EASTERN EUROPE,EASTERN ASIA,LATIN AMERICA AND CARIBBEAN,MIDDLE EAST AND NORTHERN AFRICA,NORTH AMERICA,SOUTHEASTERN ASIA,SOUTHERN ASIA,SUB-SAHARAN AFRICA,WESTERN EUROPE
Happiness Rank,1.0,-0.992066,-0.794791,-0.644842,-0.743655,-0.537942,-0.37466,-0.117713,-0.007768,-0.469515,-0.032262,-0.058971,-0.257989,0.00716,-0.169882,0.009989,0.161835,0.616554,-0.450569
Happiness Score,-0.992066,1.0,0.789284,0.648799,0.742456,0.551258,0.400103,0.137578,0.007065,0.47494,0.021022,0.048314,0.229203,-0.013881,0.181384,-0.00949,-0.153374,-0.60393,0.473656
Economy (GDP per Capita),-0.794791,0.789284,1.0,0.585966,0.784338,0.340511,0.310933,-0.01456,0.019768,0.020577,0.119841,0.157817,0.020146,0.171464,0.141467,-0.013621,-0.147415,-0.658048,0.438308
Family,-0.644842,0.648799,0.585966,1.0,0.57265,0.420361,0.123841,-0.037262,0.367431,-0.093039,0.134345,0.073285,0.126087,-0.096466,0.096624,0.011329,-0.180454,-0.400013,0.310224
Health (Life Expectancy),-0.743655,0.742456,0.784338,0.57265,1.0,0.340745,0.250495,0.010638,0.130302,0.001125,0.159162,0.201381,0.104358,0.080123,0.114052,0.022667,-0.086642,-0.762491,0.448874
Freedom,-0.537942,0.551258,0.340511,0.420361,0.340745,1.0,0.456353,0.290706,0.010353,0.041099,-0.182865,0.009274,0.136967,-0.152814,0.110121,0.21197,-0.025819,-0.212095,0.266977
Trust (Government Corruption),-0.37466,0.400103,0.310933,0.123841,0.250495,0.456353,1.0,0.317545,-0.120242,0.012282,-0.223013,-0.014517,-0.12114,0.073648,0.104932,0.024089,-0.05177,-0.113238,0.355407
Generosity,-0.117713,0.137578,-0.01456,-0.037262,0.010638,0.290706,0.317545,1.0,-0.192587,-0.053444,-0.240063,-0.030676,-0.108849,-0.123341,0.14316,0.330816,0.145753,-0.053741,0.165044
Year,-0.007768,0.007065,0.019768,0.367431,0.130302,0.010353,-0.120242,-0.192587,1.0,-0.213523,-0.002514,0.000902,-0.01643,-0.003813,0.000514,0.001102,0.000977,0.010933,0.007062
Dystopia Residual,-0.469515,0.47494,0.020577,-0.093039,0.001125,0.041099,0.012282,-0.053444,-0.213523,1.0,-0.050233,-0.143946,0.337798,-0.0772,0.074128,-0.161635,-0.072402,-0.073142,0.077495


In [20]:
# Let's zoom in on the correlations with our variables Happiness Score and Dystopia Residual.
# We can see some differences between the correlations; for example, Family has a strong positive correlation with Happiness Score but a low one with Dystopia Residual.

tables.select_dtypes('number').corr()[['Happiness Score', 'Dystopia Residual']]

Unnamed: 0,Happiness Score,Dystopia Residual
Happiness Rank,-0.992066,-0.469515
Happiness Score,1.0,0.47494
Economy (GDP per Capita),0.789284,0.020577
Family,0.648799,-0.093039
Health (Life Expectancy),0.742456,0.001125
Freedom,0.551258,0.041099
Trust (Government Corruption),0.400103,0.012282
Generosity,0.137578,-0.053444
Year,0.007065,-0.213523
Dystopia Residual,0.47494,1.0


In [21]:
# Let's see the mean Happiness Score by Region.
# Select the column before aggregating, so that text columns such as Country are not averaged.
px.scatter(tables.groupby('Region')['Happiness Score'].mean().sort_values())

# Models

I'll use a linear regression and a random forest. I really like these models because they are easy to understand and to explain. More sophisticated models sometimes offer better results, but the trade-off between explainability and performance is always important to take into account.

In [34]:
# Create variable X with the columns that we will use as predictors, and y with the target variable.
# I will not use Country because it is irrelevant.
# I decided to try Region because there are many theories that having developed neighbours helps a country to develop, but that would be a far more elaborate analysis.
# For simplicity I kept Region out, because the improvement in accuracy was too low and the results are much simpler to read without it.
X = tables[['Economy (GDP per Capita)', 'Family', 'Health (Life Expectancy)',
       'Freedom', 'Trust (Government Corruption)', 'Generosity']].copy()
y = tables['Happiness Score'].copy()

In [35]:
# Let's split into train and test sets. If we were going to do hyperparameter tuning, we should also add a validation set.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)
print('Test size :', X_test.shape, '\nTrain size :', X_train.shape)

Test size : (235, 6) 
Train size : (547, 6)
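
If hyperparameter tuning were added, one common way to get the extra validation set is to call `train_test_split` twice: first to hold out the test set, then to carve the validation set out of the remainder. The sketch below uses made-up data and illustrative split sizes (60/20/20); the `_demo` names are hypothetical and nothing here was run on the happiness data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Made-up data with the same number of predictor columns (6) as X.
rng = np.random.default_rng(0)
X_demo = rng.random((100, 6))
y_demo = rng.random(100)

# First hold out the test set (20% of all rows).
X_rest, X_test_demo, y_rest, y_test_demo = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0
)

# Then carve the validation set out of the remainder (25% of 80 rows = 20 rows).
X_train_demo, X_val_demo, y_train_demo, y_val_demo = train_test_split(
    X_rest, y_rest, test_size=0.25, random_state=0
)

print(X_train_demo.shape, X_val_demo.shape, X_test_demo.shape)
```

Models would then be compared on the validation set, and the test set would be used only once, for the final figure.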


In [36]:
# Create our linear model.
linear_model = LinearRegression()

In [55]:
# Let's fit the variables to the model.
linear_model.fit(X_train, y_train)

# See the coefficients.
linear_coef = [(feature, round(importance, 2)) for feature, importance in zip(X.columns, linear_model.coef_)]
[print('Variable: {:20} coef: {}'.format(*pair)) for pair in linear_coef];

Variable: Economy (GDP per Capita) coef: 1.15
Variable: Family               coef: 0.58
Variable: Health (Life Expectancy) coef: 1.14
Variable: Freedom              coef: 1.58
Variable: Trust (Government Corruption) coef: 0.97
Variable: Generosity           coef: 0.33


In [None]:
# Here we have the coefficients. We can send them to the governator, telling him to use them to prioritize the policies that would raise the Happiness Score the most.
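
To compare policies with these coefficients, each policy can be expressed as a set of changes in the inputs, and the predicted change in score is the sum of coefficient times change. The sketch below copies the rounded coefficients printed above; `score_change` and the two policies are hypothetical, and the changes are in the units of the already transformed variables, not of the real-world quantities. The intercept is not needed, because it cancels out when comparing changes.

```python
# Rounded coefficients copied from the printed output of the linear model.
LINEAR_COEF = {
    "Economy (GDP per Capita)": 1.15,
    "Family": 0.58,
    "Health (Life Expectancy)": 1.14,
    "Freedom": 1.58,
    "Trust (Government Corruption)": 0.97,
    "Generosity": 0.33,
}

def score_change(policy, coef=LINEAR_COEF):
    """Predicted change in Happiness Score for a dict {variable: change}."""
    unknown = set(policy) - set(coef)
    if unknown:
        raise KeyError(f"Unknown variables: {sorted(unknown)}")
    return sum(coef[name] * delta for name, delta in policy.items())

# Two made-up policies, expressed as changes in the transformed inputs.
policy_a = {"Freedom": 0.10}
policy_b = {"Economy (GDP per Capita)": 0.05, "Health (Life Expectancy)": 0.05}

print("Policy A:", round(score_change(policy_a), 4))
print("Policy B:", round(score_change(policy_b), 4))
```

Because the predictors are correlated with each other, these figures are best read as a rough ranking of policies rather than as precise forecasts.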

In [44]:
# Let's predict for the test set.
y_pred_linear = linear_model.predict(X_test)

# Now let's see the R2.
from sklearn.metrics import r2_score
print('Linear Model R2:', round(r2_score(y_test, y_pred_linear) * 100, 2), '%.')

# The model explains about 72% of the variance. Taking into account that we leave out an important variable, Dystopia Residual, I think the model performs pretty well.

Linear Model R2: 71.76 %.


In [40]:
# Instantiate model with 1000 decision trees
forest = RandomForestRegressor(n_estimators = 1000, random_state = 0)
# Train the model on training data
forest.fit(X_train, y_train)
# Let's predict for the test set.
y_pred_forest= forest.predict(X_test)


In [42]:
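# Let's see the metrics of the random forest.
# Calculate the absolute errors.
errors = abs(y_pred_forest - y_test)
# First the mean absolute error (MAE).
print('Mean Absolute Error:', round(np.mean(errors), 3), 'Happiness Score points.')

# Calculate and display the accuracy as 100 minus the mean absolute percentage error (MAPE).
# Note that this is not R2, so it is not directly comparable with the R2 of the linear model.
accuracy = 100 - np.mean(100 * (errors / y_test))
print('Forest accuracy (100 - MAPE):', round(accuracy, 2), '%.')

Mean Absolute Error: 0.401 Happiness Score points.
Forest accuracy (100 - MAPE): 91.85 %.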
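
The percentage computed in the cell above is 100 minus the mean absolute percentage error, which is a different quantity from R2, so it should not be compared directly with the 71.76 % of the linear model. A small sketch with made-up numbers shows how far apart the two can be on the same predictions:

```python
import numpy as np

# Made-up targets and predictions, only to contrast the two metrics.
y_true = np.array([4.0, 5.0, 6.0, 7.0])
y_hat = np.array([4.5, 5.0, 5.5, 7.0])

# 100 minus the mean absolute percentage error (the formula used in the cell above).
abs_errors = np.abs(y_hat - y_true)
accuracy = 100 - np.mean(100 * abs_errors / y_true)

# R2: one minus residual sum of squares over total sum of squares.
ss_res = np.sum((y_true - y_hat) ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2 = 1 - ss_res / ss_tot

print("100 - MAPE:", round(accuracy, 2), "%")
print("R2:", round(100 * r2, 2), "%")
```

For a like-for-like comparison, `sklearn.metrics.r2_score(y_test, y_pred_forest)` would give the forest's R2 on the same test set.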


In [51]:
# Get numerical feature importances
importances = list(forest.feature_importances_)
# List of tuples with variable and importance
feature_importances = [(feature, round(importance, 2)) for feature, importance in zip(X.columns, importances)]
# Sort the feature importances by most important first
feature_importances = sorted(feature_importances, key = lambda x: x[1], reverse = True)
# Print out the feature importance.
[print('Variable: {:20} Importance: {}'.format(*pair)) for pair in feature_importances];

Variable: Economy (GDP per Capita) Importance: 0.39
Variable: Health (Life Expectancy) Importance: 0.36
Variable: Freedom              Importance: 0.11
Variable: Family               Importance: 0.05
Variable: Trust (Government Corruption) Importance: 0.04
Variable: Generosity           Importance: 0.04
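
The importances above are the impurity-based ones that the forest computes during training. Since Economy and Health are strongly correlated, it may be worth cross-checking them with permutation importance, which measures how much the score drops when one column is shuffled. The sketch below runs on made-up data where only the first column matters; the `_demo` names and column labels are hypothetical, and on the real data one would preferably pass the held-out `X_test` and `y_test`.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance

# Made-up data: only the first column drives the target, the others are noise.
rng = np.random.default_rng(0)
X_demo = rng.random((300, 3))
y_demo = 3.0 * X_demo[:, 0] + rng.normal(scale=0.1, size=300)

forest_demo = RandomForestRegressor(n_estimators=50, random_state=0)
forest_demo.fit(X_demo, y_demo)

# Shuffle each column a few times and record the mean drop in the model's score.
result = permutation_importance(
    forest_demo, X_demo, y_demo, n_repeats=5, random_state=0
)

for name, mean in zip(["signal", "noise_1", "noise_2"], result.importances_mean):
    print(f"{name:8} {mean:.3f}")
```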


# Conclusions.
- There is certainly a lot of work left to improve these models. We could look for external information on other correlated variables; for example, the oil price could affect the happiness score (theory: the more expensive the oil, the less happy the people).
- The models could be developed further with some hyperparameter tuning to improve their accuracy, while being careful not to overfit the data.
- We are working with transformations of the real data, and we don't know how they were transformed; we can only see how much each variable adds to the World Happiness Score.
- I think there are not many easy analyses left to do on this dataset. We don't have the full information about the variables and, in addition, the variable Dystopia Residual is hard to work with: it is not easy to predict (because it depends on the minimum score of each variable for that year), and it is not even clear how it is calculated.
- I really liked this dataset. I first chose it because I liked the idea of a World Happiness Score; after realizing that it contains only transformations of the real variables, and with the added difficulty of Dystopia Residual, I found it challenging, and it was a pleasure to work on.
- We could add a plot of one tree from the random forest and use it to explain to the governator how a tree chooses and works with the variables.
- The presentation would be on slides, with the most relevant key points for the governator and the importance of the coefficients, so that he can work with them.
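
As a lightweight alternative to plotting a tree, the rules of a single tree from a forest can be printed as text with `sklearn.tree.export_text`. The sketch below is only an illustration on made-up data with two shortened, hypothetical feature names and a shallow forest; on the notebook's model, the equivalent call would take `forest.estimators_[0]` and `list(X.columns)`.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.tree import export_text

# Shortened, made-up feature names and data for the demonstration.
feature_names = ["Economy", "Freedom"]
rng = np.random.default_rng(0)
X_demo = rng.random((200, 2))
y_demo = 2.0 * X_demo[:, 0] + 1.0 * X_demo[:, 1]

# A small, shallow forest keeps the printed tree readable.
forest_demo = RandomForestRegressor(n_estimators=10, max_depth=2, random_state=0)
forest_demo.fit(X_demo, y_demo)

# Print the split rules of the first tree in the forest.
tree_text = export_text(forest_demo.estimators_[0], feature_names=feature_names)
print(tree_text)
```

Each indented line is one split (variable and threshold), and each `value` line is the prediction of a leaf, which is usually enough to walk a non-technical audience through one decision path.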