![](https://i.imgur.com/E4eJdxu.png)

So... how **happy** are we? 

Are we **happier** than last year? 

How much happier are some of us **than others**?

What **dictates** our happiness?

It's time to find out. 😎

# 1. Imports, Data Preprocessing, Missing Values?

In [None]:
# Imports and preparing the dataset
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
import scipy.stats  # explicit submodule import, needed for scipy.stats.pearsonr later on

import warnings
warnings.filterwarnings("ignore")

import folium # for the map

# Setting the default style of the plots
sns.set_style('whitegrid')
sns.set_palette('Set2')

# My custom color palette
my_palette = ["#7A92FF", "#FF7AEF", "#B77AFF", "#A9FF7A", "#FFB27A", "#FF7A7A",
             "#7AFEFF", "#D57AFF", "#FFDF7A", "#D3FF7A"]

# Importing the 3 datasets
data_2015 = pd.read_csv("../input/world-happiness/2015.csv")
data_2016 = pd.read_csv("../input/world-happiness/2016.csv")
data_2017 = pd.read_csv("../input/world-happiness/2017.csv")

# First we need to prepare the data for merging the tables together (to form only 1 table)
# Tables have different columns, so first we will keep only the columns we need
data_2015 = data_2015[['Country', 'Happiness Rank', 'Happiness Score', 'Economy (GDP per Capita)', 'Family',
                       'Health (Life Expectancy)', 'Freedom', 'Generosity', 'Trust (Government Corruption)', 
                       'Dystopia Residual']]
data_2016 = data_2016[['Country', 'Happiness Rank', 'Happiness Score', 'Economy (GDP per Capita)', 'Family',
                       'Health (Life Expectancy)', 'Freedom', 'Generosity', 'Trust (Government Corruption)', 
                       'Dystopia Residual']]
data_2017 = data_2017[['Country', 'Happiness.Rank', 'Happiness.Score', 'Economy..GDP.per.Capita.', 'Family',
                       'Health..Life.Expectancy.', 'Freedom', 'Generosity', 'Trust..Government.Corruption.', 
                       'Dystopia.Residual']]

# Tables do not have the same column names, so we need to fix that
new_names = ['Country', 'Happiness Rank', 'Happiness Score', 'Economy (GDP per Capita)', 'Family',
                       'Health (Life Expectancy)', 'Freedom', 'Generosity', 'Trust (Government Corruption)', 
                       'Dystopia Residual']

data_2015.columns = new_names
data_2016.columns = new_names
data_2017.columns = new_names

# Add a new column containing the year of the survey
data_2015['Year'] = 2015
data_2016['Year'] = 2016
data_2017['Year'] = 2017

# Merge the data together
data = pd.concat([data_2015, data_2016, data_2017], axis=0)
data.head(3)

### Update: 2018 and 2019 data dropped in 😁

In [None]:
# New data
data_2018 = pd.read_csv("../input/world-happiness/2018.csv")
data_2019 = pd.read_csv("../input/world-happiness/2019.csv")

# Concatenate data
data_2018['Year'] = 2018
data_2019['Year'] = 2019

new_data = pd.concat([data_2018, data_2019], axis=0)

# Switching overall rank column with country/ region
columns_titles = ['Country or region', 'Overall rank', 'Score', 'GDP per capita',
       'Social support', 'Healthy life expectancy', 'Freedom to make life choices', 'Generosity',
       'Perceptions of corruption', 'Year']
new_data = new_data.reindex(columns=columns_titles)

# Renaming old data columns:
old_data = data[['Country', 'Happiness Rank', 'Happiness Score','Economy (GDP per Capita)', 'Family', 
                 'Health (Life Expectancy)', 'Freedom', 'Generosity', 'Trust (Government Corruption)', 'Year']]
old_data.columns = ['Country or region', 'Overall rank', 'Score', 'GDP per capita',
       'Social support', 'Healthy life expectancy', 'Freedom to make life choices', 'Generosity',
       'Perceptions of corruption', 'Year']

# Finally, concatenating all data
data = pd.concat([old_data, new_data], axis=0)

data.head(3)

### Missing values

There is only one missing value in the data, so we will just drop it.

In [None]:
data[data['Perceptions of corruption'].isna()]

In [None]:
data.dropna(axis = 0, inplace = True)

In [None]:
# Double check to see if there are any missing values left
plt.figure(figsize = (16,6))
sns.heatmap(data = data.isna(), cmap = 'Blues')

plt.xticks(fontsize = 13.5);

We're done with the preprocessing part.

#### Let's get to business!

# 2. Let's get familiar with the numbers

## I. Shape of Data

In [None]:
data.shape

# 10 columns, 781 rows

## II. Summary Statistics

In [None]:
data.groupby(by='Year')['Score'].describe()

 * Well, looks like we are **slowly but surely** becoming less and less happy.
 * 2019 was **better** than 2018, but still 2015 is the happiest year in our data.
 
## Difference in factors between 2015 and 2019

We first need to create a dataframe with the following columns:
 * `Factor` - the overall score plus our 6 factors
 * `Year` - the years between 2015 and 2019
 * `Avg_value` - average value of the factor for the year

In [None]:
# First we group the data by year and average the factors
grouped = data.groupby(by = 'Year')[['Score', 'GDP per capita',
       'Social support', 'Healthy life expectancy',
       'Freedom to make life choices', 'Generosity',
       'Perceptions of corruption']].mean().reset_index()

# Now we reconstruct the df by using melt() function
grouped = pd.melt(frame = grouped, id_vars='Year', value_vars=['Score', 'GDP per capita',
       'Social support', 'Healthy life expectancy',
       'Freedom to make life choices', 'Generosity',
       'Perceptions of corruption'], var_name='Factor', value_name='Avg_value')

grouped.head()

In [None]:
plt.figure(figsize = (16, 9))

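# Plot only the six factors: drop the overall Score rows first so x, y and hue stay aligned
factors_only = grouped[grouped['Factor'] != 'Score']
ax = sns.barplot(data = factors_only, x = 'Factor', y = 'Avg_value', hue = 'Year',
            palette = my_palette[1:6])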

plt.title("Difference in Factors - Then and Now - ", fontsize = 25)
plt.xlabel("Factor", fontsize = 20)
plt.ylabel("Average Score", fontsize = 20)
plt.xticks(fontsize = 15)
plt.yticks(fontsize = 15)
plt.legend(fontsize = 15)

ax.set_xticklabels(['Money','Family', 'Health', 'Freedom', 'Generosity', 'Trust']);

Let's look closer at the **top 10 nations** at each end of the ranking... what do they look like?

## III. Which are the happiest people in 2019?

In [None]:
# Top 10 happiest countries by average score
country_score_avg = data[data['Year']==2019].groupby(by = ['Country or region'])['Score'].mean().reset_index()
table = country_score_avg.sort_values(by = 'Score', ascending = False).head(10)

table

In [None]:
plt.figure(figsize = (16, 9))
sns.barplot(y = table['Country or region'], x = table['Score'], palette = my_palette)

plt.title("Top 10 Happiest Countries in 2019", fontsize = 25)
plt.xlabel("Happiness Score", fontsize = 20)
plt.ylabel("Country", fontsize = 20)
plt.xticks(fontsize = 15)
plt.yticks(fontsize = 15);

## IV. Which are the least happy people in 2019?

In [None]:
# Top 10 "not that happy" countries by average score
table2 = country_score_avg.sort_values(by = 'Score', ascending = True).head(10)

table2

In [None]:
plt.figure(figsize = (16, 9))
sns.barplot(y = table2['Country or region'], x = table2['Score'], palette = my_palette)

plt.title("Top 10 Least Happy Countries in 2019", fontsize = 25)
plt.xlabel("Happiness Score", fontsize = 20)
plt.ylabel("Country", fontsize = 20)
plt.xticks(fontsize = 15)
plt.yticks(fontsize = 15);

* Nothing surprising here either. Countries in **war zones**, or with **poor sanitation systems**, widespread disease or very poor infrastructure, are home to the least happy people of all.

Let's give them a hand! 🤝

## V. Distribution of Smiles

In [None]:
# Checking the distribution for Happiness Score
plt.figure(figsize = (16, 9))

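# histplot replaces the deprecated distplot; stat = 'density' keeps the same y-axis scale
sns.histplot(x = country_score_avg['Score'], bins = 20, kde = True, stat = 'density', color = "#A9FF7A")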
plt.xlabel('Happiness Score', fontsize = 20)
plt.title('Distribution of Average Happiness Score - 2019 -', fontsize = 25)
plt.xticks(fontsize = 15)
plt.yticks(fontsize = 15)
plt.xlim((1.5, 8.9));

* The distribution of happiness is quite **platykurtic**, evenly spread between a score of ~3 and 7.5.
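
"Platykurtic" means flatter than a normal distribution, which shows up as a negative excess kurtosis. The sketch below shows one way to check that claim numerically. It uses evenly spaced made-up scores rather than the real data, and `excess_kurtosis` is a helper written for this illustration:

```python
def excess_kurtosis(values):
    """Population excess kurtosis: m4 / m2**2 - 3 (0 for a normal distribution)."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n  # second central moment
    m4 = sum((v - mean) ** 4 for v in values) / n  # fourth central moment
    return m4 / m2 ** 2 - 3

# Hypothetical scores spread evenly between 3.0 and 7.5 (not the real data)
flat_scores = [3.0 + 4.5 * i / 99 for i in range(100)]
print(round(excess_kurtosis(flat_scores), 2))  # a negative value means platykurtic
```

On the real data, the same function could be called with `list(country_score_avg['Score'])`; `scipy.stats.kurtosis` with its default settings computes the same quantity.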

## Distribution for the other factors

In [None]:
## Creating the grouped table
country_factors_avg = data[data['Year'] == 2019].groupby(by = ['Country or region'])[['GDP per capita',
       'Social support', 'Healthy life expectancy', 'Freedom to make life choices', 'Generosity',
       'Perceptions of corruption']].mean().reset_index()

plt.figure(figsize = (16, 9))

# fill replaces the deprecated shade argument; explicit labels feed the legend
sns.kdeplot(x = country_factors_avg['GDP per capita'], color = "#B77AFF", fill = True, label = 'GDP per capita')
sns.kdeplot(x = country_factors_avg['Social support'], color = "#FD7AFF", fill = True, label = 'Social support')
sns.kdeplot(x = country_factors_avg['Healthy life expectancy'], color = "#FFB27A", fill = True, label = 'Healthy life expectancy')
sns.kdeplot(x = country_factors_avg['Freedom to make life choices'], color = "#A9FF7A", fill = True, label = 'Freedom to make life choices')
sns.kdeplot(x = country_factors_avg['Generosity'], color = "#7AFFD4", fill = True, label = 'Generosity')
sns.kdeplot(x = country_factors_avg['Perceptions of corruption'], color = "#FF7A7A", fill = True, label = 'Perceptions of corruption')

plt.xlabel('Factors Score', fontsize = 20)
plt.title('Distribution of Average Factors Score - 2019 -', fontsize = 25)
plt.xticks(fontsize = 15)
plt.yticks(fontsize = 15)
plt.xlim((-0.5, 2.3))
plt.legend(fontsize = 15);

Now, let's see the strength of the correlation between the happiness score and its 6 main factors:

## VI. What Influences Happiness?

In [None]:
# Calculating the Pearson Correlation

c1 = scipy.stats.pearsonr(data['Score'], data['GDP per capita'])
c2 = scipy.stats.pearsonr(data['Score'], data['Social support'])
c3 = scipy.stats.pearsonr(data['Score'], data['Healthy life expectancy'])
c4 = scipy.stats.pearsonr(data['Score'], data['Freedom to make life choices'])
c5 = scipy.stats.pearsonr(data['Score'], data['Generosity'])
c6 = scipy.stats.pearsonr(data['Score'], data['Perceptions of corruption'])

print('Happiness Score + GDP: pearson = ', round(c1[0],2), '   pvalue = ', round(c1[1],4))
print('Happiness Score + Family: pearson = ', round(c2[0],2), '   pvalue = ', round(c2[1],4))
print('Happiness Score + Health: pearson = ', round(c3[0],2), '   pvalue = ', round(c3[1],4))
print('Happiness Score + Freedom: pearson = ', round(c4[0],2), '   pvalue = ', round(c4[1],4))
print('Happiness Score + Generosity: pearson = ', round(c5[0],2), '   pvalue = ', round(c5[1],4))
print('Happiness Score + Trust: pearson = ', round(c6[0],2), '   pvalue = ', round(c6[1],4))

In [None]:
# Computing the Correlation Matrix

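corr = data.select_dtypes(include = 'number').corr()  # skip the text column with country names

# Generate a mask for the upper triangle
mask = np.triu(np.ones_like(corr, dtype=bool))  # np.bool was removed from NumPy; the builtin bool works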

# Set up the matplotlib figure
f, ax = plt.subplots(figsize=(16, 9))

# Generate a custom diverging colormap
cmap = sns.diverging_palette(0, 25, as_cmap=True, s = 90, l = 45, n = 5)

# Draw the heatmap with the mask and correct aspect ratio
sns.heatmap(corr, mask=mask, cmap=cmap, vmax=.3, center=0,
            square=True, linewidths=.5, cbar_kws={"shrink": .5})

plt.title('What influences our happiness?', fontsize = 25)
plt.xticks(fontsize = 15)
plt.yticks(fontsize = 15);

It seems that Happiness is most strongly correlated with **GDP (money moneyyy💰)** (very strong correlation) and **Health**. There is also a medium positive correlation between Happiness and Freedom.
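
For anyone curious what the Pearson coefficient actually measures, here is a small sketch that computes it by hand: the covariance of the two variables divided by the product of their spreads. The numbers are made up for illustration, and `pearson_r` is a helper written for this example, not part of the notebook:

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation: covariance divided by the product of the standard deviations."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Hypothetical values, for illustration only
gdp = [0.3, 0.7, 1.0, 1.3, 1.5]
score = [3.9, 4.8, 5.6, 6.4, 7.3]
generosity = [0.30, 0.10, 0.25, 0.05, 0.20]

print(round(pearson_r(gdp, score), 2))         # close to 1: the two rise together
print(round(pearson_r(generosity, score), 2))  # weaker: no clear linear pattern
```

A value near 1 or -1 means a tight linear relationship, a value near 0 means little linear relationship. Either way, it describes association only and says nothing about which variable drives the other.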

## VII. Globe Map 2019

In [None]:
# import os
# print(list(os.listdir("../input")))

In [None]:
#json file with the world map
import matplotlib.pyplot as plt
import geopandas as gpd

country_geo = gpd.read_file('../input/worldcountries/world-countries.json')

#import another CSV file that contains country codes
country_codes = pd.read_csv('../input/iso-country-codes-global/wikipedia-iso-country-codes.csv')
country_codes.rename(columns = {'English short name lower case' : 'Country or region'}, inplace = True)

#Merge the 2 files together to create the data to display on the map
data_to_plot = pd.merge(left= country_codes[['Alpha-3 code', 'Country or region']], 
                        right= country_score_avg[['Score', 'Country or region']], 
                        how='inner', on = ['Country or region'])
data_to_plot.drop(labels = 'Country or region', axis = 1, inplace = True)

data_to_plot.head(2)
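
One thing to watch with an inner merge on country names: any country spelled differently in the two files is silently dropped and ends up white on the map. I have not checked which names, if any, differ in these particular files, so the sketch below uses two tiny made-up tables to show how such mismatches could be listed with `indicator=True`:

```python
import pandas as pd

# Hypothetical tables, for illustration only
codes = pd.DataFrame({
    "Alpha-3 code": ["FIN", "USA"],
    "Country or region": ["Finland", "United States Of America"],
})
scores = pd.DataFrame({
    "Country or region": ["Finland", "United States"],
    "Score": [7.8, 6.9],
})

# A left merge with indicator=True flags score rows that found no country code
check = scores.merge(codes, how="left", on="Country or region", indicator=True)
unmatched = check.loc[check["_merge"] == "left_only", "Country or region"].tolist()
print(unmatched)
```

Names that show up in `unmatched` could then be renamed in one of the tables before building the map data.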

In [None]:
#Creating the map using Folium Package
my_map = folium.Map(location=[10, 6], zoom_start=1.49)

# folium.Choropleth replaces the Map.choropleth method, which newer folium versions no longer provide
folium.Choropleth(geo_data=country_geo, data=data_to_plot,
                  name='choropleth',
                  columns=['Alpha-3 code', 'Score'],
                  key_on='feature.id',
                  fill_color='BuPu', fill_opacity=0.5, line_opacity=0.2,
                  nan_fill_color='white',
                  legend_name='Average Happiness Indicator').add_to(my_map)

my_map.save('data_to_plot.html')

from IPython.display import HTML
HTML('<iframe src=data_to_plot.html width=850 height=500></iframe>')

![Choropleth map of the 2019 happiness scores](https://i.imgur.com/BRX6Zur.png)

So.......

1. **The Nordics** and the **West** are the happiest, plus **Australia** (kangaroos just MUST be a factor).
2. **Eastern Europe** and the majority of Asia are in the middle.
3. **The South** is the least happy.

# 3. A predictive model, because why not?

After all that jazz, I couldn't help myself.

What if we tried to predict the happiness score of a country using the other factors available in the analysis?

## I. Imports:

In [None]:
# Importing the libraries
from sklearn.model_selection import train_test_split # for data validation

# Models
from sklearn.linear_model import LinearRegression, BayesianRidge, LassoLars
from sklearn.ensemble import RandomForestRegressor
from sklearn.tree import DecisionTreeRegressor
import xgboost as xgb
from xgboost import XGBRegressor

# Metrics and Grid Search
from sklearn import model_selection, metrics
from sklearn.model_selection import GridSearchCV

## II. Preparing...

In [None]:
# Creating the table
# Selecting several columns after groupby needs a list (double brackets)
data_model = data.groupby(by= 'Country or region')[['Score', 'GDP per capita',
       'Social support', 'Healthy life expectancy',
       'Freedom to make life choices', 'Generosity',
       'Perceptions of corruption']].mean().reset_index()

# Creating the dependent and independent variables
y = data_model['Score']
X = data_model[['GDP per capita',
       'Social support', 'Healthy life expectancy',
       'Freedom to make life choices', 'Generosity',
       'Perceptions of corruption']]

# Splitting the data so the models are evaluated on countries they have not seen during training
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4, random_state=42)

## III. Models probation:

In [None]:
# Creating a predefined function to test the models
def modelfit(model):
    model.fit(X_train, y_train)
    preds = model.predict(X_test)
    mae = metrics.mean_absolute_error(y_test, preds)
    print('MAE:', round(mae,4))

In [None]:
# Linear Regression

lm = LinearRegression()
modelfit(lm)

In [None]:
# Random Forest Regressor

rf = RandomForestRegressor(n_jobs = -1)  # -1 uses all available CPU cores
modelfit(rf)

In [None]:
# XGBoost
xg = XGBRegressor(learning_rate=0.1, n_estimators=5000)
modelfit(xg)

In [None]:
# Decision Tree
dt = DecisionTreeRegressor()
modelfit(dt)

In [None]:
# Bayesian Linear Model
br = BayesianRidge(max_iter=1000, tol = 0.5)  # max_iter was called n_iter before scikit-learn 1.3
modelfit(br)

In [None]:
# Lasso Lars
ls = LassoLars()
modelfit(ls)

In [None]:
final_model = BayesianRidge(max_iter = 10, tol = 0.1, alpha_2 = 0.1)  # max_iter was called n_iter before scikit-learn 1.3
final_model.fit(X_train, y_train)

Linear Regression and Bayesian Ridge were the models that performed best (they had the smallest MAE of all).

I also did some **parameter tuning**, but the MAE didn't change.

So, we have a winner: congrats to Bayesian Ridge (if you find a better model, please don't keep it to yourself 😁)
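
As a quick reminder of what the MAE in the comparison above measures, here is a tiny hand-rolled version on made-up numbers (the values are illustrative and are not model output):

```python
def mean_absolute_error(y_true, y_pred):
    """Average absolute gap between actual and predicted values."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

# Hypothetical happiness scores and predictions, for illustration only
actual = [7.5, 6.0, 4.5, 3.0]
predicted = [7.0, 6.5, 4.5, 4.0]

print(mean_absolute_error(actual, predicted))  # → 0.5
```

Since the MAE is in the same units as the happiness score, a value of 0.5 would mean the predictions are off by half a point on average.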

## IV. How important are the variables?

In [None]:
# How important is each variable into predicting the overall Happiness Score?

import eli5
from eli5.sklearn import PermutationImportance

perm = PermutationImportance(estimator=final_model, random_state=1)
perm.fit(X_test, y_test)

eli5.show_weights(estimator= perm, feature_names = X_test.columns.tolist())

What actually influences our general well-being?

* Looks like **money** is of the highest importance.
* Following up next is **social support**, meaning the relationships in a family and the closest group of friends. Human interaction.
* I would like to point out **freedom** as well. Freedom to act. To talk. But be careful not to step on other people's freedom, though.
* The last one is **generosity**, but who likes to share anyway?
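
The idea behind permutation importance can be sketched in a few lines: shuffle one feature, and see how much worse the predictions get. The version below is a simplified illustration with a made-up "model" and made-up data; it is not how `eli5` implements it internally (which, among other things, repeats the shuffle several times), and `toy_model` and `permutation_importance` are hypothetical helpers written for this example:

```python
import random

def mae(y_true, y_pred):
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

def toy_model(row):
    # Hypothetical "fitted" model: uses feature 0 only and ignores feature 1
    return 2.0 * row[0]

def permutation_importance(model, X, y, column, seed=0):
    """Increase in MAE after shuffling the values of one feature column."""
    baseline = mae(y, [model(row) for row in X])
    shuffled = [row[column] for row in X]
    random.Random(seed).shuffle(shuffled)
    X_perm = [row[:column] + [value] + row[column + 1:] for row, value in zip(X, shuffled)]
    return mae(y, [model(row) for row in X_perm]) - baseline

# Made-up data: the target depends on feature 0 only
X = [[float(i), float(i % 3)] for i in range(20)]
y = [2.0 * row[0] for row in X]

print(permutation_importance(toy_model, X, y, column=0))  # positive: the model relies on it
print(permutation_importance(toy_model, X, y, column=1))  # 0.0: the model ignores it
```

A feature whose shuffling barely changes the error contributes little to the model's predictions, which is how the weights in the table above can be read.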

# 4. Final Thoughts

This report is amazing. It is very helpful for many industries, as it assesses the overall mood of a nation and gives a glimpse into how that mood evolves over time.

It also points to what makes us happy. What we value the most as beings. What we want in order to feel content and happy with our lives.

*And this report gives just that answer: money and healthy relationships... in exactly that order* 😅

If you guys have any ideas on how to improve this, don't hold back.

<div class="alert alert-block alert-info">
<p></p>
<p>If you liked this, don't be shy, upvote! 😁</p>
<b>Cheers!</b>
<p></p>
</div>