# World Happiness Report 2015-2019 (EDA + Visualisation + Prediction)

This notebook presents an exploratory analysis and visualisation of the World Happiness Reports from 2015 to 2019, then applies a multiple linear regression model to predict each country's Happiness Score and determine which factors influence it.

## Features Explanation

**Happiness Rank:** Rank of any country in a particular year.<br>
**Country:** Name of the country.<br>
**Standard Error:** The standard error of the happiness score.<br>
**Happiness Score:** Happiness score as the sum of all numerical columns in the datasets.<br>
**Economy (GDP per Capita):** The extent to which GDP contributes to the calculation of the Happiness Score.<br>
**Trust:** A quantification of the people’s perceived trust in their governments.<br>
**Health (Life Expectancy):** The extent to which Life expectancy contributed to the calculation of the Happiness Score.<br>
**Generosity:** Numerical value estimated based on the perception of Generosity experienced by poll takers in their country.<br>
**Family Support:** Metric estimating satisfaction of people with their friends and family.<br>
**Freedom:** Perception of freedom quantified.<br>
**Dystopia:** Hypothetically the saddest country in the world.<br>
**Lower Confidence Interval:** Lower Confidence Interval of the Happiness Score.<br>
**Upper Confidence Interval:** Upper Confidence Interval of the Happiness Score.<br>


In [None]:
# import libraries
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

from sklearn import preprocessing
from sklearn.preprocessing import LabelEncoder
from sklearn.metrics import mean_squared_error
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

In [None]:
import warnings
warnings.filterwarnings('ignore')

In [None]:
# load datasets from csv files (one per report year)
years = range(2015, 2020)
dfs = [pd.read_csv(f'/kaggle/input/world-happiness/{year}.csv') for year in years]
# keep per-year names as well; they refer to the same objects as the entries of dfs
df2015, df2016, df2017, df2018, df2019 = dfs

# Data Exploration

In [None]:
# show first few records for each dataset
for year, df in enumerate(dfs, 2015):
    print(f'{year} dataset:')
    display(df.head(3))

In [None]:
# show the number of records and columns for each dataset
for year, df in enumerate(dfs, 2015):
    print(f'Size of {year} Report:', df.shape)

Since each row in these datasets represents a country, the differing row counts show that the reports do not all cover the same countries: for some countries, no data is available in some years.

In [None]:
# show column names for each dataset
for year, df in enumerate(dfs, 2015):
    print(f'Column names for {year} dataset:', df.columns, '\n')

The datasets differ in three ways. Some use different labels for the same column (e.g. the "Family" column of the earlier reports is labeled "Social support" in the 2018-2019 reports). Some represent the same information differently (e.g. the 2015 report has a "Standard Error" column, whereas the 2016 and 2017 reports give lower/upper confidence interval and low/high whisker values instead). And some contain data that others lack (e.g. the "Dystopia Residual" column does not exist in either of the 2018-2019 reports). To make the data tidy, the following need to be addressed in the data cleaning process:


1. Unifying column names
1. Creating a "Year" column for each dataset
1. Adding a "Region" column to the 2017-2019 datasets
1. Adding a "Standard Error" column to the 2016-2017 datasets based on the Confidence Interval and Whisker values
1. Adding a "Standard Error" column to the 2018-2019 datasets based on the average value of the previous years for each country
1. Dropping the "Dystopia Residual" column, as it is not present in the 2018-2019 datasets
1. Handling missing values
1. Merging all datasets into one giant dataset

# Data Cleaning

Since the datasets use slightly different naming conventions, we need to map their column names to a common set.

In [None]:
df2015.rename(columns = {'Economy (GDP per Capita)' : 'GDP',
                        'Health (Life Expectancy)' : 'Life',
                        'Trust (Government Corruption)' : 'Trust'}, inplace = True)

In [None]:
df2016.rename(columns = {'Economy (GDP per Capita)' : 'GDP',
                        'Health (Life Expectancy)' : 'Life',
                        'Trust (Government Corruption)' : 'Trust'}, inplace = True)

In [None]:
df2017.rename(columns = {'Happiness.Rank' : 'Happiness Rank',
                        'Happiness.Score' : 'Happiness Score',
                        'Economy..GDP.per.Capita.' : 'GDP',
                        'Health..Life.Expectancy.' : 'Life',
                        'Dystopia.Residual' : 'Dystopia Residual',
                        'Trust..Government.Corruption.' : 'Trust'}, inplace = True)

In [None]:
df2018.rename(columns = {'Overall rank' : 'Happiness Rank',
                        'Score' : 'Happiness Score',
                        'Country or region' : 'Country',
                        'Social support' : 'Family',
                        'Freedom to make life choices' : 'Freedom',
                        'GDP per capita' : 'GDP',
                        'Healthy life expectancy' : 'Life',
                        'Perceptions of corruption' : 'Trust'}, inplace = True)

In [None]:
df2019.rename(columns = {'Overall rank' : 'Happiness Rank',
                        'Score' : 'Happiness Score',
                        'Country or region' : 'Country',
                        'Social support' : 'Family',
                        'Freedom to make life choices' : 'Freedom',
                        'GDP per capita' : 'GDP',
                        'Healthy life expectancy' : 'Life',
                        'Perceptions of corruption' : 'Trust'}, inplace = True)

Creating "Year" column for each dataset.

In [None]:
# add year column for each dataset
for i, df in enumerate(dfs, 2015):
    df['Year'] = i

Adding a "Region" column to the 2017-2019 datasets, using the 2015 report as a lookup.

In [None]:
# add "Region" column by looking up each country's region in the 2015 report
region_lookup = df2015.set_index('Country')['Region']
for df in dfs:
    if 'Region' not in df:
        # countries that are absent from the 2015 report stay missing for now
        df['Region'] = df['Country'].map(region_lookup)
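
The lookup idea behind this step can be sketched on toy data. The frames `reference` and `later` below are hypothetical stand-ins for the 2015 report and a later report, not the real datasets; the sketch maps each country to its region by name and lists the countries that remain unmatched.

```python
import pandas as pd

# Toy stand-ins for the 2015 report (has Region) and a later report (no Region).
reference = pd.DataFrame({
    'Country': ['Norway', 'Japan', 'Kenya'],
    'Region': ['Western Europe', 'Eastern Asia', 'Sub-Saharan Africa'],
})
later = pd.DataFrame({'Country': ['Japan', 'Norway', 'Belize']})

# Build a Country -> Region lookup and apply it by country name.
region_lookup = reference.set_index('Country')['Region']
later['Region'] = later['Country'].map(region_lookup)

# Countries absent from the reference stay missing and need manual handling.
unmatched = later.loc[later['Region'].isna(), 'Country'].tolist()
print(later)
print('Unmatched:', unmatched)
```

Matching by name rather than by row position matters here because each report is sorted by its own ranking.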

Now we need to calculate the "Standard Error" column for the 2016 and 2017 datasets from the Confidence Interval and Whisker values, taking half the width of the interval.

In [None]:
df2016['Standard Error'] = round((df2016['Upper Confidence Interval'] - df2016['Lower Confidence Interval']) / 2, 3)

In [None]:
df2017['Standard Error'] = round((df2017['Whisker.high'] - df2017['Whisker.low']) / 2, 3)
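
Half the width of an interval matches the standard error only up to a constant factor. As a minimal sketch, assuming the bounds describe a 95% normal confidence interval (an assumption; the datasets themselves do not state the level), the half-width would be about 1.96 times the standard error. The bounds below are illustrative values, not taken from the reports.

```python
# Toy interval bounds (illustrative values, not taken from the reports).
lower, upper = 7.284, 7.460

# Half the interval width, as in the calculation used for the 2016 and 2017 data.
half_width = (upper - lower) / 2

# Assumption: a 95% normal interval, so half_width = z * SE with z close to 1.96.
Z_95 = 1.96
standard_error = half_width / Z_95

print(round(half_width, 3))
print(round(standard_error, 3))
```

Either quantity ranks countries the same way, since the two differ by a constant; the distinction only matters when the values are compared with the 2015 "Standard Error" column.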

Next, we add a "Standard Error" column to the 2018-2019 datasets by averaging the values of the previous years for each country. To do so, we create a temporary dataframe that combines the "Standard Error" columns of the previous years.

In [None]:
temp = pd.merge(df2015[['Country', 'Standard Error']], df2016[['Country', 'Standard Error']], on = 'Country')
temp.rename(columns = {'Standard Error_x' : 'Standard Error 2015',
                        'Standard Error_y' : 'Standard Error 2016'}, inplace = True)
standard_error_df = pd.merge(temp, df2017[['Country', 'Standard Error']], on = 'Country')
standard_error_df.rename(columns = {'Standard Error' : 'Standard Error 2017'}, inplace = True)
standard_error_df.head(3)

In [None]:
# calculate Standard Error values for df2018 (numeric columns only, "Country" is text)
standard_error_df['Standard Error 2018'] = round(standard_error_df.mean(axis = 1, numeric_only = True), 4)
standard_error_df.head(3)

In [None]:
# calculate Standard Error values for 2019 dataset
standard_error_df['Standard Error 2019'] = round(standard_error_df.mean(axis = 1, numeric_only = True), 3)
standard_error_df.head(3)

In [None]:
dfs[3] = pd.merge(dfs[3], standard_error_df[['Country','Standard Error 2018']], on = 'Country')
dfs[3].rename(columns = {'Standard Error 2018' : 'Standard Error'}, inplace = True)
dfs[4] = pd.merge(dfs[4], standard_error_df[['Country','Standard Error 2019']], on = 'Country')
dfs[4].rename(columns = {'Standard Error 2019' : 'Standard Error'}, inplace = True)
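
Because `pd.merge` performs an inner join by default, any country without an estimated Standard Error is dropped from the 2018 and 2019 datasets. A rough way to see which rows would be lost, shown here on hypothetical toy frames (`report` and `errors` are made-up stand-ins), is a left merge with `indicator = True`:

```python
import pandas as pd

# Toy stand-ins: a yearly report and a table of estimated standard errors.
report = pd.DataFrame({'Country': ['Finland', 'Norway', 'Somalia'],
                       'Happiness Score': [7.6, 7.5, 5.0]})
errors = pd.DataFrame({'Country': ['Finland', 'Norway'],
                       'Standard Error': [0.03, 0.04]})

# A left merge with indicator=True keeps every report row and flags unmatched ones.
checked = pd.merge(report, errors, on='Country', how='left', indicator=True)
dropped_by_inner = checked.loc[checked['_merge'] == 'left_only', 'Country'].tolist()

# The default inner merge keeps only countries present in both tables.
inner = pd.merge(report, errors, on='Country')
print('Rows before:', len(report), 'rows after inner merge:', len(inner))
print('Countries an inner merge would drop:', dropped_by_inner)
```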

Dropping columns that are not common to all five reports.

In [None]:
# columns that are missing from at least one report
extra_columns = ['Dystopia Residual',
                 'Lower Confidence Interval', 'Upper Confidence Interval',
                 'Whisker.high', 'Whisker.low']
for df in dfs:
    # errors = 'ignore' skips names that a given dataset does not have
    df.drop(columns = extra_columns, errors = 'ignore', inplace = True)
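
When checking that several columns exist before dropping them, note that an expression of the form `('a' and 'b') in df` tests only the second name, because `and` returns its last truthy operand. A small plain-Python illustration (the list `columns` is a made-up example):

```python
# 'Lower Confidence Interval' is deliberately absent from this toy list.
columns = ['Country', 'Upper Confidence Interval']

# `'a' and 'b'` evaluates to 'b', so only the second name is actually tested.
misleading = ('Lower Confidence Interval' and 'Upper Confidence Interval') in columns

# Testing each name explicitly gives the intended result.
wanted = ['Lower Confidence Interval', 'Upper Confidence Interval']
both_present = all(name in columns for name in wanted)

print(misleading, both_present)
```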

In [None]:
# check missing values
for i, df in enumerate(dfs, 2015):
    print ('\n' f'df{i} dataset:' '\n', df.isnull().sum())

As we still have some missing values in the "Region" column (countries that are not represented in the 2015 report), we need to fill these records manually.

In [None]:
# show records with missing values in df2017 
dfs[2][dfs[2].isnull().any(axis = 1)]

In [None]:
# fill missing values in df2017 manually
dfs[2].loc[32, ['Region']] = 'Eastern Asia'
dfs[2].loc[49, ['Region']] = 'Latin America and Caribbean'
dfs[2].loc[70, ['Region']] = 'Eastern Asia'
dfs[2].loc[92, ['Region']] = 'Sub-Saharan Africa'
dfs[2].loc[110, ['Region']] = 'Sub-Saharan Africa'
dfs[2].loc[146, ['Region']] = 'Sub-Saharan Africa'

In [None]:
# show records with missing values in df2018
dfs[3][dfs[3].isnull().any(axis = 1)]

In [None]:
# fill the missing Trust value in df2018 with the mean value of the previous years
previous_trust_uae = [df.loc[df['Country'] == 'United Arab Emirates', 'Trust'].item() for df in dfs[:3]]
dfs[3].loc[19, 'Trust'] = sum(previous_trust_uae) / len(previous_trust_uae)

Now that the structure of the five datasets is standardised, the last step is to combine them into a single giant dataframe.

In [None]:
# combine all datasets into a single dataframe
giant_df = pd.concat(dfs)

# Data Visualisation

In [None]:
# create a separate copy of the dataframe for the sake of visualisation
eda_df = giant_df.copy()

Let's take a look at how the Happiness Score relates to each of the other variables in the dataset.

In [None]:
import plotly.figure_factory as ff
# correlate numeric columns only ("Country" and "Region" are text)
corr = eda_df.select_dtypes('number').corr().round(2)
fig = ff.create_annotated_heatmap(corr.values.tolist(), x = corr.columns.tolist(), y = corr.index.tolist(), colorscale = 'Portland')
fig.update_layout(title = {'text': 'Correlation Heatmap', 'y' : 0.93, 'x' : 0.5}, title_font_size = 25)
fig.show()

So the Happiness Score is highly correlated with GDP, Family Support and Life expectancy, and least correlated with Generosity.
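
One way to read the same information without a heatmap is to rank the numeric columns by their correlation with the target. The sketch below uses a small made-up frame `toy` (its values are illustrative, not report data):

```python
import pandas as pd

toy = pd.DataFrame({
    'Country': ['A', 'B', 'C', 'D'],
    'Happiness Score': [7.0, 6.0, 5.0, 4.0],
    'GDP': [1.4, 1.2, 1.0, 0.8],
    'Generosity': [0.2, 0.3, 0.1, 0.25],
})

# Keep numeric columns only, then correlate each of them with the target.
numeric = toy.select_dtypes('number')
ranked = (numeric.corr()['Happiness Score']
          .drop('Happiness Score')
          .sort_values(ascending=False))
print(ranked)
```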

# What Are the 10 Happiest Countries in the World?

In [None]:
import plotly.express as px
# select several columns with a list (double brackets), then keep the ten highest scores
happiest_countries = eda_df.groupby(['Country'], sort = False)[['Happiness Score', 'Year', 'GDP']].max()
top10 = happiest_countries.sort_values('Happiness Score', ascending = False)[:10]
fig = px.scatter(top10,
                x = top10.index,
                y = 'Happiness Score',
                size = 'GDP',
                color = top10.index,
                template = 'xgridoff',
                animation_frame = 'Year',
                title = 'The Top 10 Happiest Countries in The World <br> (Bubble Size Indicates GDP)')
fig.show()

# Comparing Happiness Scores Across Regions

In [None]:
# map each report region to a broader group
region_groups = {'Eastern Asia': 'Asia', 'Southern Asia': 'Asia', 'Southeastern Asia': 'Asia',
                 'Western Europe': 'Europe', 'Central and Eastern Europe': 'Europe',
                 'Middle East and Northern Africa': 'Middle East',
                 'Sub-Saharan Africa': 'Africa',
                 'Australia and New Zealand': 'Australia',
                 'North America': 'North America'}
# any other region (i.e. "Latin America and Caribbean") falls back to "Latin America"
eda_df['Continent'] = [region_groups.get(region, 'Latin America') for region in eda_df['Region']]
fig = px.box(eda_df,
             x = 'Year',
             y = 'Happiness Score',
             color = 'Continent',
             template = 'xgridoff',
             labels = {'Continent': 'Region'},
             title = 'Happiness Score by Region from 2015-2019')
fig.show()

# How Has Countries' Happiness Changed from 2015 to 2019?

To answer this question, we need to calculate the relative change in Happiness Score for each country between 2015 and 2019.

In [None]:
# align the 2015 and 2019 scores by country before computing the relative change
score_2015 = df2015.set_index('Country')['Happiness Score']
score_2019 = df2019.set_index('Country')['Happiness Score']
change = ((score_2019 - score_2015) / score_2015).dropna()
# keep countries with at least 1% change, sorted from largest drop to largest gain
change = change[change.abs() > 0.01].sort_values()
temp = change.rename_axis('Country').rename('Happiness Change').reset_index()
fig = px.bar(temp,
             x = 'Happiness Change',
             y = 'Country',
             orientation = 'h',
             height = 900,
             template = 'gridon',
             title = 'Change in Happiness Score from 2015-2019')
fig.show()
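
When two yearly reports are compared, rows should be paired by country name rather than by row position, since each report is sorted by its own ranking. A minimal sketch on made-up frames (`first` and `last` are hypothetical stand-ins for the 2015 and 2019 reports):

```python
import pandas as pd

# Toy stand-ins for two yearly reports, each sorted by its own ranking.
first = pd.DataFrame({'Country': ['A', 'B', 'C'], 'Happiness Score': [7.0, 6.0, 5.0]})
last = pd.DataFrame({'Country': ['B', 'A', 'D'], 'Happiness Score': [6.6, 6.3, 5.5]})

# Index both by country so that subtraction pairs each country with itself.
s_first = first.set_index('Country')['Happiness Score']
s_last = last.set_index('Country')['Happiness Score']

# Countries missing from either year produce NaN and are dropped.
relative_change = ((s_last - s_first) / s_first).dropna().sort_values()
print(relative_change)
```

Subtracting the raw columns instead would compare the country ranked first in one year with whichever country is ranked first in the other.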

# Does Money Buy Happiness?

In [None]:
fig = px.scatter(eda_df,
                x = 'GDP',
                y = 'Happiness Score',
                size = 'Trust',
                color = 'Country',
                template = 'xgridoff',
                animation_frame = 'Year',
                title = 'GDP vs Happiness Score from 2015-2019 <br> (Bubble Size Indicates Trust)')
fig.show()

# How Is Life Expectancy Related to Happiness Score?

In [None]:
fig = px.scatter(eda_df,
                x = 'Life',
                y = 'Happiness Score',
                size = 'GDP',
                color = 'Country',
                template = 'xgridoff',
                animation_frame = 'Year',
                labels = {'Life': 'Life Expectancy'},
                title = 'Life Expectancy vs Happiness Score for Each Country from 2015-2019 <br> (Bubble Size Indicates GDP)')
fig.show()

# Family Support vs Happiness Score for Each Country from 2015-2019

In [None]:
fig = px.scatter(eda_df,
                x = 'Family',
                y = 'Happiness Score',
                size = 'GDP',
                color = 'Country',
                template = 'xgridoff',
                animation_frame = 'Year',
                labels = {'Family': 'Family Support'},
                title = 'Family Support vs Happiness Score from 2015-2019 <br> (Bubble Size Indicates GDP)')
fig.show()

# How Is Freedom Related to Happiness Score?

In [None]:
fig = px.scatter(eda_df,
                x = 'Freedom',
                y = 'Happiness Score',
                size = 'GDP',
                color = 'Country',
                template = 'xgridoff',
                animation_frame = 'Year',
                title = 'Freedom vs Happiness Score for Each Country from 2015-2019 <br> (Bubble Size Indicates GDP)')
fig.show()

# Modelling

Now it's time to use scikit-learn to fit a multiple linear regression, and XGBoost (gradient boosting), to predict the Happiness Score.<br>
First, categorical variables need to be encoded for the model; this can be done with the LabelEncoder class.

In [None]:
# encode categorical variables in order to prepare them for modelling
le = preprocessing.LabelEncoder()
giant_df['Region'] = le.fit_transform(giant_df['Region'])
giant_df['Country'] = le.fit_transform(giant_df['Country'])
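
LabelEncoder assigns arbitrary integer codes, which a linear model treats as ordered quantities. One common alternative, sketched here on a made-up frame `toy` rather than on `giant_df`, is one-hot encoding with `pd.get_dummies`:

```python
import pandas as pd

# Made-up example frame with one categorical and one numeric column.
toy = pd.DataFrame({'Region': ['Western Europe', 'Eastern Asia', 'Western Europe'],
                    'GDP': [1.4, 1.2, 1.3]})

# One indicator column per region; drop_first removes a redundant column
# when the model also fits an intercept.
encoded = pd.get_dummies(toy, columns=['Region'], drop_first=True)
print(encoded.columns.tolist())
```

This is practical for a column with few categories such as "Region"; for "Country" it would add one column per country.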

The next step is to split the dataset into train and test sets for an unbiased evaluation of the final model, where the dependent variable is "Happiness Score".

In [None]:
# define the predictors
features = ['Country', 'Region', 'Happiness Rank', 'Standard Error', 'GDP', 'Family', 'Life', 'Freedom', 'Trust', 'Generosity', 'Year']
X = giant_df[features]
# define the target
y = giant_df['Happiness Score']

In [None]:
# split into the two subsets using random selection (67-33 policy)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.33, random_state = 42)

# Linear Regression

There are many regressors to choose from. A particularly simple one is "LinearRegression", which is basically a wrapper around an ordinary least squares calculation.

In [None]:
# create linear regression object
lr = LinearRegression()
# train the model using the training set
lr.fit(X_train, y_train)
lr.score(X_train, y_train)

In [None]:
# how good is our model?
print('R^2 (training set):', lr.score(X_train, y_train))
print('Intercept:', lr.intercept_)
print('Coefficients:', lr.coef_)

In [None]:
# pair each feature with its fitted coefficient
coefficients = pd.DataFrame(list(zip(X.columns, lr.coef_)), columns = ['Features', 'Coefficients'])
coefficients.sort_values('Coefficients', ascending = False)
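
Raw coefficients depend on the units of each feature, so their magnitudes are not directly comparable. A possible way to compare them, sketched on synthetic data (all names and values below are illustrative), is to standardise the features before fitting:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Two synthetic features on very different scales (illustrative only).
small_scale = rng.normal(scale=1.0, size=300)
large_scale = rng.normal(scale=100.0, size=300)
X_toy = np.column_stack([small_scale, large_scale])
y_toy = 2.0 * small_scale + 0.05 * large_scale + rng.normal(scale=0.1, size=300)

# Coefficients in the original units of each feature.
raw_coef = LinearRegression().fit(X_toy, y_toy).coef_

# Coefficients per one standard deviation of each feature.
X_scaled = StandardScaler().fit_transform(X_toy)
scaled_coef = LinearRegression().fit(X_scaled, y_toy).coef_

print('raw:', raw_coef)
print('standardised:', scaled_coef)
```

In this toy setup the second feature has the smaller raw coefficient but the larger standardised one, which is why a ranking of raw coefficients should be read with care.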

Now that we have created and trained our model, it's time to test it on the test set.

In [None]:
# make predictions using testset
y_pred = lr.predict(X_test)

In [None]:
# compare actual and predicted scores for the first few test records
pred = pd.DataFrame({'Actual': y_test.tolist(), 'Predicted': y_pred.tolist()})
pred.head(10)
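
Beyond eyeballing individual predictions, the test-set fit can be summarised with RMSE and R². The sketch below uses synthetic data and hypothetical names (`X_toy`, `y_toy`); on the notebook's data, the same two metric calls would take `y_test` and `y_pred`:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

# Synthetic data (illustrative only): a noisy linear target with three features.
rng = np.random.default_rng(42)
X_toy = rng.normal(size=(200, 3))
y_toy = 5.0 + X_toy @ np.array([1.0, 0.5, -0.3]) + rng.normal(scale=0.1, size=200)

X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy, test_size=0.33, random_state=42)
model = LinearRegression().fit(X_tr, y_tr)
y_hat = model.predict(X_te)

# RMSE is in the units of the target; R^2 is the share of variance explained.
rmse_toy = np.sqrt(mean_squared_error(y_te, y_hat))
r2_toy = r2_score(y_te, y_hat)
print(f'Test RMSE: {rmse_toy:.3f}, test R^2: {r2_toy:.3f}')
```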

In [None]:
plt.style.use(style = 'fivethirtyeight')
plt.rcParams['figure.figsize'] = (10, 6)
plt.scatter(y_test, y_pred, alpha = 0.7, color = 'r')
# fit the trend line with actual scores on x and predicted scores on y, matching the scatter axes
m, b = np.polyfit(y_test, y_pred, 1)
plt.plot(y_test, (m * y_test + b), color = 'g')
plt.xlabel('Actual Score')
plt.ylabel('Predicted Score')
plt.title('Actual vs Predicted Happiness Score')

# XGBoost

In [None]:
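import xgboost as xgb
# use a separate name for the model so that the xgboost module is not shadowed
xgb_model = xgb.XGBRegressor(objective = 'reg:squarederror', n_estimators = 100, max_depth = 3, learning_rate = 0.1)

In [None]:
xgb_model.fit(X_train, y_train)

In [None]:
xgb_model.score(X_train, y_train)

In [None]:
y_preds = xgb_model.predict(X_test)

In [None]:
# report the error and R^2 on the test set
rmse = np.sqrt(mean_squared_error(y_test, y_preds))
print('RMSE:', rmse)
print('R^2 (test set):', xgb_model.score(X_test, y_test))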