# Data Science Project: World Happiness Report 2021

<br><br>
<!-- <div class="container">
    <div style="float:left;width:40%">
	    <img src="../input/photos/happy2.jpg" alt="Children in South Africa">
    </div>
    <div style="float:left;width:40%">
	    <img src="../input/photos/happy.jpg" alt="Children in Myanmar">
    </div>
</div> -->

In [None]:
from IPython.display import Image
import os
Image("../input/images/happy.jpg")

## Table of Contents
<ul>
    <a href="#Intro">0) Introduction to this project</a> <br> 
    <a href="#EDA"> 1) Exploratory Data Analysis (EDA) </a> <br>
    <ul>
        <a href="#Def"> Data Sources and Variable Definitions </a> <br>
        <a href="#Widgets"> Visualization: Numerical Data and Widgets </a> <br>
        <a href="#Prep"> Preparation for prediction of values.</a> <br>
    </ul>
    <a href="#Simple_Regression"> 2) Linear Regression Modelling </a> <br> 
    <ul>
        <a href="#Var"> Explanatory and response variables </a> <br>
        <a href="#Eval"> Evaluating the model </a> <br>
        <a href="#RMSE"> Root Mean Squared Error </a> <br>
    </ul>
    <a href="#Example"> 3) Practical Example: Calculating average happiness score for people in Spain</a> <br> 
</ul>

## 0) Introduction <a id= "Intro"></a>

In this project I perform some basic Exploratory Data Analysis (EDA) on the data from the World Happiness Report 2021.  My aim is to get a good understanding of the data and then fit a multiple linear regression model (MRM).  With the beta coefficients of the regression equation we can estimate how happy people in different countries might feel on average, based on their financial and social situation.<br>

We go through an example and calculate the happiness factor for people living in Spain.

### Source of the data
Link to the dataset on [Kaggle](https://www.kaggle.com/ajaypalsinghlo/world-happiness-report-2021).
The whole report and its appendices can be downloaded from this [website](https://worldhappiness.report). <br>

### Context
The World Happiness Report is a landmark survey of the state of global happiness. The report continues to gain global recognition as governments, organizations and civil society increasingly use happiness indicators to inform their policy-making decisions. Leading experts across fields – economics, psychology, survey analysis, national statistics, health, public policy and more – describe how measurements of well-being can be used effectively to assess the progress of nations. The reports review the state of happiness in the world today and show how the new science of happiness explains personal and national variations in happiness.

### Content
The happiness scores and rankings use data from the Gallup World Poll. The columns following the happiness score estimate the extent to which each of six factors – economic production, social support, life expectancy, freedom, absence of corruption, and generosity – contribute to making life evaluations higher in each country than they are in Dystopia, a hypothetical country that has values equal to the world’s lowest national averages for each of the six factors. They have no impact on the total score reported for each country, but they do explain why some countries rank higher than others.

## 1) Exploratory Data Analysis (EDA) <a id= "EDA"></a>

We first want to analyze the data set to summarize its main characteristics.  We want to better understand the data and draw possible conclusions that can be tested later.

In [None]:
# Import all necessary libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns

# We'll create some widgets later
import ipywidgets as widgets
sns.set(rc={'figure.figsize':(11.7,8.27)})

# Confirm that everything is setup
print("Setup complete")

In [None]:
# Load the first csv file
# This file consists of the happiness index for the year 2021
data_2021=pd.read_csv("../input/world-happiness-report-2021/world-happiness-report-2021.csv")
# data_2021=pd.read_csv("world-happiness-report-2021.csv")
data_2021.head()

In [None]:
#Show all columns to decide which we'll use for further analysis
data_2021.columns

In [None]:
#Drop all columns not used for further analysis
#Drop the columns whose names start with "Explained" first
explained_cols = data_2021.columns.str.startswith("Explained")
happiness_2021= data_2021.loc[:, ~explained_cols]
#Now drop the rest of the columns which we'll not use
happiness_2021= happiness_2021.drop(columns=["Standard error of ladder score", "upperwhisker", 
                                             "lowerwhisker", "Ladder score in Dystopia", "Dystopia + residual"])

#Show first 5 rows of updated dataset
happiness_2021.head()

### Data Sources and Variable Definitions <a id= "Def"></a>

Source of variable definitions: [Link](https://happiness-report.s3.amazonaws.com/2021/Appendix1WHR2021C2.pdf) <br>

The rankings of the World Happiness Report 2021 use data that come from the Gallup World Poll surveys from 2018 to 2020. They are based on answers to the main life evaluation question asked in the poll. This is called the Cantril ladder: it asks respondents to think of a ladder, with the best possible life for them being a 10, and the worst possible life being a 0. They are then asked to rate their own current lives on that 0 to 10 scale. The rankings are from nationally representative samples, for the years 2018-2020. They are based entirely on the survey scores, using the Gallup weights to make the estimates representative.

1) **Ladder score**: Happiness or subjective well-being score.  Unless stated otherwise, it is the national average response to the question of life evaluations. The English wording of the question is “Please imagine a ladder, with steps numbered from 0 at the bottom to 10 at the top. The top of the ladder represents the best possible life for you and the bottom of the ladder represents the worst possible life for you. On which step of the ladder would you say you personally feel you stand at this time?”<br>
2) **Logged GDP per capita**: Natural logarithm of the GDP per capita of each country. For example, for Finland ln(45100.00) = 10.7166.  The gross domestic product per capita measures the country's economic output per person.  It is calculated by dividing the GDP of a country by its population.<br> 
3) **Social support**: (or having someone to count on in times of trouble) is the national average of the binary responses (either 0 or 1) to the GWP question “If you were in trouble, do you have relatives or friends you can count on to help you whenever you need them, or not?”<br> 
4) **Healthy life expectancy**: Healthy life expectancies at birth are based on the data extracted from the World Health Organization’s (WHO) Global Health Observatory data repository (Last updated: 2020-09-28). The data at the source are available for the years 2000, 2005, 2010, 2015 and 2016. To match this report’s sample period (2005-2020), interpolation and extrapolation are used.<br> 
5) **Freedom to make life choices**: Freedom to make life choices is the national average of responses to the GWP question “Are you satisfied or dissatisfied with your freedom to choose what you do with your life?” <br>
6) **Generosity**: Generosity is the residual of regressing national average of response to the GWP question “Have you donated money to a charity in the past month?” on GDP per capita.<br> 
7) **Perceptions of corruption**: The measure is the national average of the survey responses to two questions in the GWP: “Is corruption widespread throughout the government or not” and “Is corruption widespread within businesses or not?” The overall perception is just the average of the two 0-or-1 responses. In case the perception of government corruption is missing, we use the perception of business corruption as the overall perception. The corruption perception at the national level is just the average response of the overall perception at the individual level. <br>
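
The log transformation in item 2 can be checked with a few lines of Python. This is only a minimal sketch: the value 45100.00 is the Finland figure quoted above, and the variable names are made up for illustration.

```python
import math

# Finland's GDP per capita as quoted in the definition above
gdp_per_capita = 45100.00

# Natural logarithm, as used for the "Logged GDP per capita" column
logged_gdp = math.log(gdp_per_capita)
print(round(logged_gdp, 4))

# Inverting the transformation with exp() recovers the original value
recovered_gdp = math.exp(logged_gdp)
print(round(recovered_gdp, 2))
```

The same `math.exp` call converts any value read off the logged scale back into a GDP per capita figure.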



### Data Types

The Regional Indicator is a categorical variable.  All other variables are numerical.

In [None]:
happiness_2021.dtypes

### Summary Statistics

In [None]:
happiness_2021.describe()

**Some interesting observations are:**

- There seem to be no missing values as the count for each numerical variable is 149.  
- The happiness score ranges from 2.52 (saddest country) to 7.84 (happiest country). The median is 5.53.
- The Generosity variable also takes on negative values, which should be analyzed for better understanding.
- The values for social support, freedom to make life choices and perceptions of corruption all lie between 0 and 1, as expected for national averages of 0-or-1 responses.
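
The negative Generosity values follow from its definition as a regression residual. The sketch below illustrates the idea with made-up numbers for five hypothetical countries; it is not the report's actual computation, and `gdp_measure` and `donated_share` are assumed names.

```python
import numpy as np

# Made-up values for five hypothetical countries (not taken from the report)
gdp_measure = np.array([8.0, 9.0, 10.0, 11.0, 12.0])
donated_share = np.array([0.20, 0.35, 0.25, 0.50, 0.45])

# Fit a straight line: donated_share ≈ slope * gdp_measure + intercept
slope, intercept = np.polyfit(gdp_measure, donated_share, deg=1)

# Residual = actual value minus the value predicted from the GDP measure
predicted_share = slope * gdp_measure + intercept
residuals = donated_share - predicted_share
print(residuals.round(3))
```

Residuals of a least-squares line with an intercept sum to zero, so some countries necessarily end up with negative values.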

### Visualization: Numerical Data and Widgets <a id= "Widgets"></a>

For better understanding of the data we want to visualize the numerical variables in a histogram. We will create a widget to be able to switch between each of the variables.

In [None]:
# make the plots bigger
sns.set(rc={'figure.figsize':(11.7,8.27)})

# create a widget to plot and compare different histograms
explanatory_slider = widgets.Dropdown(options=["Logged GDP per capita", "Social support", 
                                               "Healthy life expectancy", "Freedom to make life choices",
                                              "Generosity", "Perceptions of corruption"])
display(widgets.interactive(lambda x: happiness_2021.hist(x, bins=30), x=explanatory_slider))

**The distributions can be described as follows:**

**Logged GDP per capita:** The distribution seems to be slightly right skewed, with a mode around 9.5, which corresponds to a GDP per capita of about USD 13,360.<br>
**Social support:** All values fall between 0 and 1, and the distribution is clearly left skewed.  There is a high peak at around 0.95.<br>
**Healthy life expectancy:** The life expectancy for the people in different countries takes on a wide range of values, from below 50 up to more than 75 years (on average for each country).  The distribution seems to be slightly left skewed.<br>
**Freedom to make life choices:** There is one value that stands out at below 0.4.  This distribution is clearly left skewed, with most of the data concentrated at around 0.7 to 0.9. <br>
**Generosity:** To generate this value, people were asked if they donated money to a charity in the past month.  A NO would count as 0 and a YES would count as 1.  Then the average for a country was regressed against its GDP per capita.  The residual for each country (the difference between the actual and the predicted value) was calculated and is plotted here.  A negative value for generosity hence indicates that people in this country donated less on average than would be expected based on the GDP of this country.  The distribution is slightly right skewed. <br>
**Perceptions of corruption:** This distribution is very left skewed.  There seem to be only a few countries which perceive their government and/or business world as being very resistant to corruption.  People from most countries seem to have answered that they perceive their government and/or business world as corrupt.
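
The skew descriptions above are read off the histograms by eye. As a rough numerical cross-check, one could use the pandas `skew()` method. The sketch below uses two made-up samples (not the report data) only to show the sign convention.

```python
import pandas as pd

# Made-up samples (not the report data) to illustrate the sign convention
long_left_tail = pd.Series([0.2, 0.6, 0.8, 0.85, 0.9, 0.9, 0.95, 0.95])
long_right_tail = pd.Series([0.05, 0.05, 0.1, 0.1, 0.15, 0.2, 0.4, 0.8])

# Series.skew() is negative for left-skewed data and positive for right-skewed data
left_skew = long_left_tail.skew()
right_skew = long_right_tail.skew()
print(left_skew, right_skew)
```

On the notebook's data, something like `happiness_2021.skew(numeric_only=True)` should give one value per numerical column.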

### Preparation for prediction of values.  Evaluation of correlation. <a id= "Prep"></a>

To predict the happiness score of a country (the response variable) based on an explanatory variable, we want to plot them against one another using a scatter plot.

In [None]:
# create a widget to make different scatter plots
explanatory_slider = widgets.Dropdown(options=["Logged GDP per capita", "Social support", 
                                               "Healthy life expectancy", "Freedom to make life choices",
                                              "Generosity", "Perceptions of corruption"])
def scatter_widget(var):
    plt.figure(figsize=(20,15))
    sns.scatterplot(x=var, y="Ladder score", data=happiness_2021,
                   label="Happiness score 2021", s=100)
    plt.xlim(happiness_2021[var].min(), happiness_2021[var].max())
    # label each point with its country name, slightly offset from the marker
    for _, row in happiness_2021.iterrows():
        plt.text(x=row[var]+0.01, y=row["Ladder score"]+0.01, 
                 s=row["Country name"], horizontalalignment='left',
                 size='small', color='black')
    plt.legend(loc="upper left")
widgets.interact(scatter_widget, var=explanatory_slider);

**Observations from the scatterplots:**

Based on the scatterplots there seems to be a positive linear correlation between the happiness score and the following explanatory variables: 
- the logged GDP per capita
- the social support someone receives
- a healthy life expectancy
- the freedom to make life choices.

There seems to be no correlation between how generous the people of a country are and the happiness score.  The scatter plot cloud looks mostly shapeless.
<br><br>
There seems to be a weak negative correlation between the happiness score and how people of a country perceive the level of corruption in their country.
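
These visual impressions could be backed up with Pearson correlation coefficients. The following is a minimal sketch on made-up values for six hypothetical countries; the column names mirror the dataset, but the numbers are invented for illustration.

```python
import pandas as pd

# Made-up values for six hypothetical countries (not the report data)
toy = pd.DataFrame({
    "Ladder score": [4.0, 4.8, 5.5, 6.1, 6.9, 7.5],
    "Social support": [0.55, 0.68, 0.75, 0.82, 0.90, 0.95],
    "Perceptions of corruption": [0.90, 0.85, 0.80, 0.82, 0.50, 0.30],
})

# Pearson correlation of every column with the response variable
correlations = toy.corr()["Ladder score"].drop("Ladder score")
print(correlations.round(2))
```

A value close to +1 or -1 indicates a strong linear relationship, while a value near 0 matches the "shapeless cloud" impression described for Generosity.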

## 2) Linear Regression Modelling <a id= "Simple_Regression"> </a> 

### Motivation: 
In order to predict how happy people of a certain country might feel (on average) under the circumstances they live in, we will perform regression modelling.  We will do this with a function from the [SciPy scientific computing package for Python](https://www.scipy.org/) called `lstsq`, which lives in the `scipy.linalg` module.<br>
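
Before applying `lstsq` to the happiness data, it may help to see it on a tiny synthetic problem where the true coefficients are known. The data below are made up, generated without noise from y = 1 + 2·x1 - 3·x2, so the fit should recover those three numbers.

```python
import numpy as np
from scipy.linalg import lstsq

# Synthetic data generated from y = 1 + 2*x1 - 3*x2 (no noise)
x1 = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
x2 = np.array([1.0, 0.0, 2.0, 1.0, 3.0])
y = 1.0 + 2.0 * x1 - 3.0 * x2

# Design matrix with a leading column of ones for the intercept
X = np.column_stack([np.ones_like(x1), x1, x2])

# lstsq returns (solution, residues, rank, singular values)
beta, residues, rank, singular_values = lstsq(X, y)
print(beta.round(6))
```

The first entry of `beta` belongs to the column of ones, which is why the intercept column is placed at position 0 in the design matrix.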

In [None]:
#Import packages for regression 
from scipy.linalg import lstsq
from sklearn.model_selection import train_test_split

#We set a random seed in order to make our results reproducible.
np.random.seed(28)

# split the data into training and test sets
happiness_2021_train, happiness_2021_test =train_test_split(happiness_2021, train_size=0.8, test_size=0.2)

happiness_2021_train.head()

### Explanatory and response variables <a id= "Var"></a>

Our response variable will be the "Ladder score" (the happiness score).  We'll try to predict the happiness people might feel given the situation they live in.  We select this single column by indexing, which gives a one-dimensional pandas Series rather than a DataFrame.

In [None]:
#response variable: Ladder score (training set)
y_train = happiness_2021_train["Ladder score"]

#response variable: Ladder score (validation set)
y_test = happiness_2021_test["Ladder score"]

#show the first 5 items in the array
y_train.head()

In [None]:
#Define the explanatory variables that should be taken into account for MRM
expl_vars=["Logged GDP per capita", "Social support", "Healthy life expectancy", 
           "Freedom to make life choices", "Generosity" , "Perceptions of corruption" ]

#I will also include an intercept column (at position loc=0, so that the lstsq function estimates beta_0 as the first coefficient)
#Explanatory variables (training set); .copy() avoids modifying a view of the original data
X_train = happiness_2021_train[expl_vars].copy()
X_train.insert(loc=0, column="intercept", value=1)

#Explanatory variables (test set)
X_test = happiness_2021_test[expl_vars].copy()
X_test.insert(loc=0, column="intercept", value=1)

#Preview the data (a notebook cell only displays the last expression, here X_test.tail())
X_train.head()
X_test.tail()

In [None]:
# Finding the 𝛽-terms of this regression

# calculate the least squares solution
lstsq_results = lstsq(X_train, y_train)
beta = lstsq_results[0]
beta


In [None]:
X_train.iloc[0,:]

**The happiness score** <br>

From this regression model we obtain the beta coefficients, which let us predict the happiness that people from a certain country should feel (on average).  This gives the following formula:<br>

Ladder score = $\beta_0 + \beta_1 \times \ln(\text{GDP per capita}) + \beta_2 \times (\text{Social support}) + \beta_3 \times (\text{Healthy life expectancy}) + \beta_4 \times (\text{Freedom to make life choices}) + \beta_5 \times (\text{Generosity}) + \beta_6 \times (\text{Perceptions of corruption})$   <br>

---

**Happiness score = -2.14 + 0.31 x ln (GDP per capita) + 2.19 x (Social support) + 0.03 x (Healthy life expectancy) + 1.84 x (Freedom to make life choices) + 0.60 x (Generosity) - 0.73 x (Perceptions of corruption)**

---

In [None]:
# take the dot product of every row of training data and beta
# the resulting array is a predicted number for the ladder score/ happiness score for a certain country
predict_train = X_train @ beta

# show the first 5 predictions
predict_train.head()

In [None]:
#Now we can add our predictions to our original DataFrame.

# make a new column in our training data with the predicted ladder score
happiness_2021_train["predicted ladder score"] = predict_train

# show the first five rows of the DataFrame
happiness_2021_train.head()

In [None]:
# make predictions for the test data using beta and the dot product
predict_test = X_test @ beta

# create a new column in our test data with predicted ladder score
happiness_2021_test["predicted ladder score"] = predict_test

# show the first 5 rows of test data
happiness_2021_test.head()

### Evaluating the model  <a id= "Eval"></a>

Our model makes predictions, but we want to evaluate how accurate they are.  We can start to get a sense of how we did by plotting the predictions versus the actual values on our training data on a scatter plot. If our model predicts perfectly:

- the predicted values will be equal to the actual values
- all points in the scatter plot will fall along a straight line with a slope of 1
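
The "slope of 1" statement can be illustrated numerically. In this minimal sketch with made-up scores, the predictions are set equal to the actual values, and a straight line fitted through the points has slope 1 and intercept 0 (up to floating-point error).

```python
import numpy as np

# Made-up actual scores; a perfect model predicts exactly these values
actual = np.array([3.5, 4.2, 5.0, 6.3, 7.1])
perfect_predictions = actual.copy()

# Fit a line: predictions ≈ slope * actual + intercept
slope, intercept = np.polyfit(actual, perfect_predictions, deg=1)
print(round(slope, 6), round(intercept, 6))
```

The line drawn by `sns.regplot` is such a fit through the plotted points, so comparing its slope with 1 is one way to read the plots.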

In [None]:
# make the plots bigger
sns.set(rc={'figure.figsize':(11.7,8.27)})

# create the scatter plot and regression line
sns.regplot(x="Ladder score", y="predicted ladder score", data=happiness_2021_train);


In [None]:
# create the scatter plot and regression line
sns.regplot(x="Ladder score", y="predicted ladder score", data=happiness_2021_test, color=(1.0, 200/256, 44/256));

### Root Mean Squared Error <a id= "RMSE"></a>
Now it would be nice to have a quantitative measure of how good our model is. An intuitive way to evaluate our model is to look at the *error* for each prediction: the size of the difference between the measured Ladder score and the predicted ladder score. We can get this for each item in our data set by **subtracting the array of predictions from the array of actual values**:

In [None]:
# get an array of errors
errors = y_train - predict_train

# square the array of errors
sq_error = errors ** 2

# take the average of the squared errors
mean_sq_error = np.average(sq_error)

# take the square root of the mean squared errors to calculate the RMSE
root_mean_sq_err = np.sqrt(mean_sq_error)
root_mean_sq_err

In [None]:
# calculate the error for each data point
test_errors = y_test - predict_test

# square the errors
test_sq_error = test_errors ** 2

# take the mean of the squared errors
test_mean_sq_error = np.average(test_sq_error)

# take the square root of the mean squared errors
test_rmse = np.sqrt(test_mean_sq_error)
test_rmse

**Conclusion:** <br>

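The RMSEs for the training and test data are very similar, which suggests that the model predicts unseen countries about as well as it fits the training data.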
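
The two RMSE cells above repeat the same four steps (subtract, square, average, take the square root). One possible way to avoid the repetition is a small helper function. The name `rmse` and the example values are assumptions for illustration.

```python
import numpy as np

def rmse(actual, predicted):
    """Root mean squared error between two equal-length sequences."""
    errors = np.asarray(actual, dtype=float) - np.asarray(predicted, dtype=float)
    return float(np.sqrt(np.mean(errors ** 2)))

# Made-up values: the errors are 3 and 4, so RMSE = sqrt((9 + 16) / 2)
example_rmse = rmse([3.0, 4.0], [0.0, 0.0])
print(example_rmse)
```

With this helper, `rmse(y_train, predict_train)` and `rmse(y_test, predict_test)` should reproduce the two values computed above.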

### Visualizing Error
We can also visualize our errors compared to the actual values on a scatter plot.

In [None]:
# plot the training errors on a scatter plot
happiness_2021_train["training error"] = errors
sns.scatterplot(x="Ladder score", y="training error", data=happiness_2021_train);
plt.hlines(0, min(happiness_2021_train["Ladder score"]), max(happiness_2021_train["Ladder score"]));

In [None]:
# plot the test errors on a scatter plot
happiness_2021_test["test error"] = y_test - predict_test
sns.scatterplot(x="Ladder score", y="test error", data=happiness_2021_test, color=(1.0, 200/256, 44/256));
plt.hlines(0, min(happiness_2021_test["Ladder score"]), max(happiness_2021_test["Ladder score"]));

## 3) Practical Example: Calculating average happiness score for people in Spain <a id= "Example"></a>

Coming back to our formula, we are able to calculate the happiness people from a certain country should feel (on average), making a few assumptions on how much social support they receive, how much freedom they have to make life choices and how they perceive corruption in the country they live in.

Happiness score = -2.14 + 0.31 x ln (GDP per capita) + 2.19 x (Social support) + 0.03 x (Healthy life expectancy) + 1.84 x (Freedom to make life choices) + 0.60 x (Generosity) - 0.73 x (Perceptions of corruption)

**Since I am living in Spain, let's estimate the happiness my dear Spanish friends should feel (on average) living in this beautiful country:**

Spanish GDP per capita: 29,600 USD<br>
Social support (based on my own subjective feeling): 0.8<br>
Healthy life expectancy Spain: 85.8 <br>
Freedom to make life choices: 0.8<br>
Generosity (based on my own subjective feeling on how generous the Spaniards are): 0.8<br>
Perception of corruption: 0.9<br>
<br>
Calculating the Happiness Score with the formula above we get the following value:<br>

**Happiness score for people in Spain** <br>
= -2.14 + 0.31 x ln (29600) + 2.19 x (0.8) + 0.03 x (85.8) + 1.84 x (0.8) + 0.60 x (0.8) - 0.73 x (0.9)<br>
= -2.14 + 3.19 + 1.75 + 2.57 + 1.47 + 0.48 - 0.66<br>
= 6.66<br>
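
The hand calculation can be cross-checked in Python. This sketch uses the rounded coefficients from the formula and the assumed inputs for Spain listed above; the dictionary names are made up for illustration. Summing the unrounded terms gives roughly 6.67, while the hand calculation arrives at 6.66 because each term is rounded to two decimals first.

```python
import math

# Rounded coefficients from the formula above
coefficients = {
    "intercept": -2.14,
    "Logged GDP per capita": 0.31,
    "Social support": 2.19,
    "Healthy life expectancy": 0.03,
    "Freedom to make life choices": 1.84,
    "Generosity": 0.60,
    "Perceptions of corruption": -0.73,
}

# Assumed inputs for Spain from the list above (intercept term is always 1)
spain = {
    "intercept": 1.0,
    "Logged GDP per capita": math.log(29600),
    "Social support": 0.8,
    "Healthy life expectancy": 85.8,
    "Freedom to make life choices": 0.8,
    "Generosity": 0.8,
    "Perceptions of corruption": 0.9,
}

# Dot product of coefficients and inputs
score = sum(coefficients[name] * spain[name] for name in coefficients)
print(round(score, 2))
```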


In [None]:
#Crosscheck with the value from the data set (Ladder score for Spain)
happiness_2021.loc[happiness_2021["Country name"]=="Spain", "Ladder score"]

### References

- World Happiness Dataset from Kaggle https://www.kaggle.com/ajaypalsinghlo/world-happiness-report-2021
- The structure of the notebook and a lot of the code are adapted from an assignment I worked on during this program: [Data Science: Bridging Principles and Practice](https://executive.berkeley.edu/programs/data-science#:~:text=Berkeley's%20Data%20Science%3A%20Bridging%20Principles,manipulate%20and%20analyze%20data%20yourself).
- The header images are photos I took on a trip to South Africa and Myanmar.