# The Happy Data Project

This data project investigates which factors make a country happy. The data comes from the 'World Happiness Report' for the years 2015-2017.

The project is structured as follows:  
1. The data is cleaned and appended
2. ...
3. ...
4. ...

## Preparing the Data

#### Importing relevant packages

In [1]:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import statsmodels.api as sm
import ipywidgets as widgets

#### Loading and cleaning data

In [2]:
# Loading datasets
h15 = pd.read_csv("2015.csv")
h16 = pd.read_csv("2016.csv")
h17 = pd.read_csv("2017.csv")   

# Renaming columns for 2017 so they match column names from other years
h17.rename(columns={
    'Happiness.Rank': 'Happiness Rank',
    'Happiness.Score': 'Happiness Score',
    'Economy..GDP.per.Capita.': 'Economy (GDP per Capita)',
    'Health..Life.Expectancy.': 'Health (Life Expectancy)',
    'Trust..Government.Corruption.': 'Trust (Government Corruption)',
}, inplace=True)

# Checking shape of datasets
print(f'h15 has shape {h15.shape}')
print(f'h16 has shape {h16.shape}')
print(f'h17 has shape {h17.shape}')

h15 has shape (158, 12)
h16 has shape (157, 13)
h17 has shape (155, 12)


The results above show that the datasets have different shapes, so some of them must contain different variables. Below we examine which variables differ and keep only those common to all three years before merging the data.

In [3]:
# Investigating which variables differ in the respective years
h15var = list(h15.columns)
h16var = list(h16.columns)
h17var = list(h17.columns)

# Finding the variables that are in all 3 data sets
keep = set(h15var).intersection(h16var, h17var)
print(keep)

# Finding what variables to drop from each data set
d15 = set(h15var).difference(keep)
d16 = set(h16var).difference(keep)
d17 = set(h17var).difference(keep)

print(f'From 2015 drop: {d15}')
print(f'From 2016 drop: {d16}')
print(f'From 2017 drop: {d17}')

{'Country', 'Freedom', 'Happiness Score', 'Health (Life Expectancy)', 'Generosity', 'Happiness Rank', 'Trust (Government Corruption)', 'Family', 'Economy (GDP per Capita)'}
From 2015 drop: {'Region', 'Dystopia Residual', 'Standard Error'}
From 2016 drop: {'Region', 'Upper Confidence Interval', 'Lower Confidence Interval', 'Dystopia Residual'}
From 2017 drop: {'Whisker.low', 'Whisker.high', 'Dystopia.Residual'}


In [4]:
# Dropping the variables found above
h15.drop(d15, axis=1, inplace=True)
h16.drop(d16, axis=1, inplace=True)
h17.drop(d17, axis=1, inplace=True)

In [5]:
# Checking which countries appear in 2015 but are missing from both 2016 and 2017
h15c = list(h15['Country'])
h16c = list(h16['Country'])
h17c = list(h17['Country'])

miss = set(h15c).difference(h16c,h17c)
print(miss)

{'Djibouti', 'Oman', 'Somaliland region', 'Swaziland'}
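
The check above only looks for countries that are in 2015 and absent from both later years. A minimal sketch of a check that covers every year is shown below. The country lists are toy values for illustration, and `missing_by_year` is a hypothetical helper, not part of the notebook.

```python
# Toy country lists (illustrative only, not the real datasets)
countries = {
    '2015': ['Denmark', 'Norway', 'Oman'],
    '2016': ['Denmark', 'Norway', 'Belize'],
    '2017': ['Denmark', 'Norway'],
}

def missing_by_year(countries_by_year):
    """For each year, return the countries seen in some year but absent from that year."""
    all_countries = set().union(*countries_by_year.values())
    return {year: all_countries - set(names)
            for year, names in countries_by_year.items()}

missing = missing_by_year(countries)
for year, names in missing.items():
    print(f'Missing in {year}: {sorted(names)}')
```

With the real data, the dictionary values would be the `h15c`, `h16c` and `h17c` lists built above.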


The datasets now have the same variables. They still differ in the number of rows, probably because the set of countries varies from year to year. Before appending the datasets, we add a year column so that data from different years can be compared in the analysis.

#### Appending the data sets

In [6]:
# Adding year column
h15['Year'] = '2015'
h16['Year'] = '2016'
h17['Year'] = '2017'
h15.head(1)

# Appending datasets
hap = pd.concat([h15,h16,h17], sort=True)
print(hap.shape)

(470, 10)
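
The 470 rows match the sum of the yearly row counts (158 + 157 + 155). As a sketch of how the `Year` column can be used afterwards, the example below appends two toy frames and summarises them per year. The frames and their values are made up for illustration, and `ignore_index=True` is an optional choice that differs from the cell above: without it, each year keeps its own index, so index labels repeat in the appended frame.

```python
import pandas as pd

# Toy stand-ins for the yearly datasets (values are illustrative, not real data)
a = pd.DataFrame({'Country': ['A', 'B'], 'Happiness Score': [7.0, 5.0]})
b = pd.DataFrame({'Country': ['A', 'B', 'C'], 'Happiness Score': [7.2, 5.1, 4.0]})
a['Year'] = '2015'
b['Year'] = '2016'

# Append the frames; ignore_index gives the result a fresh 0..n-1 index
toy = pd.concat([a, b], ignore_index=True)

# The appended frame should have as many rows as the inputs combined
assert len(toy) == len(a) + len(b)

# Rows per year and mean happiness score per year
rows_per_year = toy['Year'].value_counts().sort_index()
mean_score = toy.groupby('Year')['Happiness Score'].mean()
print(rows_per_year)
print(mean_score)
```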


## Data Analysis
In this section we take a closer look at the data and try to find out what makes countries happy.

#### The Happiest Countries 

In [7]:
# Below is used to print text in bold
BOLD = '\033[1m'
END = '\033[0m'

# Creating tables for the top 5 happiest countries in each of the 3 years
# (head(5) works because the rows printed below are already ordered by rank)
top_cols = ['Happiness Rank', 'Country', 'Happiness Score']

fiv15 = h15[top_cols].head(5)
print(BOLD + 'Top 5 Happy Countries in 2015' + END)
print(fiv15.to_string(index=False))

fiv16 = h16[top_cols].head(5)
print(BOLD + 'Top 5 Happy Countries in 2016' + END)
print(fiv16.to_string(index=False))

fiv17 = h17[top_cols].head(5)
print(BOLD + 'Top 5 Happy Countries in 2017' + END)
print(fiv17.to_string(index=False))

Top 5 Happy Countries in 2015
Happiness Rank      Country  Happiness Score
             1  Switzerland            7.587
             2      Iceland            7.561
             3      Denmark            7.527
             4       Norway            7.522
             5       Canada            7.427
Top 5 Happy Countries in 2016
Happiness Rank      Country  Happiness Score
             1      Denmark            7.526
             2  Switzerland            7.509
             3      Iceland            7.501
             4       Norway            7.498
             5      Finland            7.413
Top 5 Happy Countries in 2017
Happiness Rank      Country  Happiness Score
             1       Norway            7.537
             2      Denmark            7.522
             3      Iceland            7.504
             4  Switzerland            7.494
             5      Finland            7.469


Over the 3 years, the set of countries in the top 5 changes in only one place: Canada is in the top 5 in 2015 but drops out in 2016 and 2017, where Finland takes its place. The order within the top 5 does change from year to year.
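
The same comparison can be made programmatically with sets. A minimal sketch, with the country lists copied by hand from the tables printed above (the variable names are illustrative):

```python
# Top 5 countries per year, copied from the printed tables
top5 = {
    '2015': ['Switzerland', 'Iceland', 'Denmark', 'Norway', 'Canada'],
    '2016': ['Denmark', 'Switzerland', 'Iceland', 'Norway', 'Finland'],
    '2017': ['Norway', 'Denmark', 'Iceland', 'Switzerland', 'Finland'],
}

# Countries that are in the top 5 in every year
always_top5 = set.intersection(*(set(c) for c in top5.values()))

# Countries that are in the top 5 in at least one year
ever_top5 = set.union(*(set(c) for c in top5.values()))

print('In the top 5 every year:', sorted(always_top5))
print('In the top 5 only some years:', sorted(ever_top5 - always_top5))
```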

In [25]:
# Importing Bokeh plotting tools
from bokeh.io import push_notebook, show, output_notebook, curdoc
from bokeh.layouts import row, widgetbox, column
from bokeh.plotting import figure
from bokeh.models import Dropdown, Select, ColumnDataSource
output_notebook()

source15 = ColumnDataSource(h15)
source16 = ColumnDataSource(h16)
source17 = ColumnDataSource(h17)

# Create plot
print(BOLD + 'Correlation plots 2015'+ END)
f1 = figure(x_axis_label ='Economy (GDP per Capita)', y_axis_label ='Happiness Score', plot_width=250, plot_height=200)
f1.circle('Economy (GDP per Capita)', 'Happiness Score', source=source15)

f2 = figure(x_axis_label ='Freedom', y_axis_label ='Happiness Score', plot_width=250, plot_height=200)
f2.circle('Freedom', 'Happiness Score', source=source15)

f3 = figure(x_axis_label ='Health (Life Expectancy)', y_axis_label ='Happiness Score', plot_width=250, plot_height=200)
f3.circle('Health (Life Expectancy)', 'Happiness Score', source=source15)

f4 = figure(x_axis_label ='Generosity', y_axis_label ='Happiness Score', plot_width=250, plot_height=200)
f4.circle('Generosity', 'Happiness Score', source=source15)

f5 = figure(x_axis_label ='Trust (Government Corruption)', y_axis_label ='Happiness Score', plot_width=250, plot_height=200)
f5.circle('Trust (Government Corruption)', 'Happiness Score', source=source15)

f6 = figure(x_axis_label ='Family', y_axis_label ='Happiness Score', plot_width=250, plot_height=200)
f6.circle('Family', 'Happiness Score', source=source15)

# Display the plot
show(row(f1,f2,f3,f4,f5,f6), notebook_handle=True)

Correlation plots 2015
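
The scatter plots show the relationships visually. A numeric complement is the Pearson correlation coefficient between each factor and the happiness score. The sketch below uses a small toy frame with made-up values, so the numbers it prints say nothing about the real data.

```python
import pandas as pd

# Toy data (illustrative values only, not taken from the report)
toy = pd.DataFrame({
    'Happiness Score': [7.5, 6.8, 5.9, 5.0, 4.1],
    'Economy (GDP per Capita)': [1.4, 1.3, 1.0, 0.8, 0.4],
    'Generosity': [0.30, 0.20, 0.35, 0.15, 0.25],
})

# Pearson correlation of every factor with the happiness score
corr = toy.corr()['Happiness Score'].drop('Happiness Score')
print(corr.sort_values(ascending=False))
```

On the real data the same idea could be applied to the numeric columns only, for example via `h15.select_dtypes('number').corr()`.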


Above we see that the correlation between happiness score 