# 🌏 😊

Updated to include 2021 data.

> What makes us happy? Are some nations happier than others? Are we getting more or less happy over time?
> Has happiness declined this year? Might this reflect the impact of COVID-19 on people's happiness?



References:  
[2019: State of World Happiness. What drives us?](https://www.kaggle.com/andradaolteanu/2019-state-of-world-happiness-what-drives-us/data) by Andrada Olteanu  
[The World Happiness Report](https://worldhappiness.report)

# Load data

In [None]:
# Load libraries
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import warnings
warnings.filterwarnings('ignore')

In [None]:
# Set default plot style for seaborn
sns.set_style('whitegrid')
sns.set_palette('muted')

In [None]:
# Load data into dictionary
years = [*range(2015, 2022, 1)] # exclusive of 2022
data = {}

for year in years:
    filename = '../input/world-happiness-report-20152021/{}.csv'.format(year)
    data[year] = pd.read_csv(filename)

# 2021

In [None]:
data[2021].head()

In [None]:
data[2021].info()

In [None]:
# Columns of interest
data[2021] = data[2021][['Country name', 'Regional indicator', 'Ladder score',
                         'Logged GDP per capita', 'Social support',
                         'Healthy life expectancy', 'Freedom to make life choices',
                         'Generosity', 'Perceptions of corruption',
                         'Ladder score in Dystopia', 'Dystopia + residual',
                         'Explained by: Log GDP per capita', 
                         'Explained by: Social support',
                         'Explained by: Healthy life expectancy',
                         'Explained by: Freedom to make life choices',
                         'Explained by: Generosity',
                         'Explained by: Perceptions of corruption']]

In [None]:
top_10 = data[2021].head(10)
top_10

> The happiness scores and rankings use data from the Gallup World Poll. The scores are based on answers to the main life evaluation question asked in the poll. This question, known as the Cantril ladder, asks respondents to think of a ladder with the best possible life for them being a 10 and the worst possible life being a 0 and to rate their own current lives on that scale. The scores are from nationally representative samples for the years 2013-2016 and use the Gallup weights to make the estimates representative.

Basically, the happiness score is a national average of responses to a question on the Gallup World Poll.
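
To make "national average" concrete, here is a minimal sketch of a weighted mean of ladder answers per country. The responses, weights, and column names are made up for illustration; the actual Gallup weighting scheme is not part of this dataset, so treat the formula as an assumption about how such an average might be formed.

```python
import pandas as pd

# Hypothetical survey responses: Cantril ladder answers (0-10) with sampling weights
responses = pd.DataFrame({
    'country': ['A', 'A', 'A', 'B', 'B', 'B'],
    'ladder':  [8, 7, 9, 4, 5, 3],
    'weight':  [1.0, 2.0, 1.0, 1.0, 1.0, 2.0],
})

# Weighted mean per country: sum(weight * ladder) / sum(weight)
responses['weighted_ladder'] = responses['ladder'] * responses['weight']
totals = responses.groupby('country')[['weighted_ladder', 'weight']].sum()
national_score = totals['weighted_ladder'] / totals['weight']
print(national_score)
```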

In [None]:
# Long bars draw attention away from the differences between the countries
plt.figure(figsize=(9,6))
g = sns.barplot(data=top_10, x='Ladder score', y='Country name', 
                color='slategrey', alpha=0.95)
g.set_title('Top 10 Happiest Countries in 2021', y=1.02)
g.set(xlabel='Happiness Score',
      ylabel=None)
sns.despine(left=True, bottom=True, top=True);

In [None]:
# Dot plot highlights differences between top countries
plt.figure(figsize=(9,6))
g = sns.stripplot(data=top_10, x='Ladder score', y='Country name',
                  size=8, hue='Regional indicator', dodge=False,
                  palette='Dark2')
g.set_title('Top 10 Happiest Countries in 2021', y=1.02)
g.set(xlabel='Happiness Score',
      ylabel=None)
sns.despine(bottom=True);

The happiest countries are mostly located in Western Europe. Finland is doing well!

In [None]:
bottom_10 = data[2021].tail(10)

plt.figure(figsize=(9,6))
g = sns.barplot(data=bottom_10, x='Ladder score', y='Country name', alpha=0.95, 
                hue='Regional indicator', dodge=False,
                palette='Set1')
g.set_title('The 10 Least Happy Countries in 2021', y=1.02)
g.set(xlabel='Happiness Score',
      ylabel=None)
g.set_xlim(0, 8.25)
sns.despine(left=True, bottom=True, top=True);

The least happy countries are located in Sub-Saharan Africa, with the exception of Yemen (Middle East and North Africa), Haiti (Latin America and Caribbean), and Afghanistan (South Asia).

In [None]:
data[2021]['Ladder score'].mean()

In [None]:
data[2021][(data[2021]['Country name'].str.contains('Aus'))]

In [None]:
data[2021][(data[2021]['Country name'].str.contains('States'))]

In [None]:
data[2021].shape

In 2021 there are 149 countries represented in the survey. There are 195 countries in the world ([Worldometer](https://www.worldometers.info/geography/how-many-countries-are-there-in-the-world/)). I wonder if the countries with no respondents have anything in common that impacts happiness scores (e.g., civil war).

What's the difference between variables such as `Social support` and `Explained by: Social support`? It appears every factor has a second `Explained by` variable in the 2020 and 2021 data.

In [None]:
print(data[2021]['Social support'].describe())
print('\n')
print(data[2021]['Explained by: Social support'].describe())

These two look very similar.

In [None]:
print(data[2021]['Logged GDP per capita'].describe())
print('\n')
print(data[2021]['Explained by: Log GDP per capita'].describe())

In [None]:
print(data[2021]['Healthy life expectancy'].describe())
print('\n')
print(data[2021]['Explained by: Healthy life expectancy'].describe())

`Healthy life expectancy` most likely reflects actual values, whereas the corresponding `Explained by` variable is the output of statistical modelling. I will have to take this into consideration when analysing factors for 2020 and 2021.
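
One way to test that reading is to check whether the `Explained by` columns plus `Dystopia + residual` add up to `Ladder score`. The sketch below assumes they are per-factor contributions in score units, which I have not verified against the report. The single row uses made-up numbers chosen to sum neatly; on the real data you would run the same lines on the full 2021 table before it is narrowed down.

```python
import pandas as pd

# One illustrative row with the 2021 column names (values are placeholders, not real data)
row = pd.DataFrame({
    'Ladder score': [7.0],
    'Explained by: Log GDP per capita': [1.4],
    'Explained by: Social support': [1.1],
    'Explained by: Healthy life expectancy': [0.7],
    'Explained by: Freedom to make life choices': [0.6],
    'Explained by: Generosity': [0.1],
    'Explained by: Perceptions of corruption': [0.4],
    'Dystopia + residual': [2.7],
})

# Sum the six contributions and add the dystopia baseline plus residual
explained_cols = [c for c in row.columns if c.startswith('Explained by: ')]
reconstructed = row[explained_cols].sum(axis=1) + row['Dystopia + residual']

# Absolute gap between the reconstructed and reported score
gap = (reconstructed - row['Ladder score']).abs()
print(gap.max())
```

If the gap is close to zero for every country, the `Explained by` columns can be treated as score contributions rather than raw measurements.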

In [None]:
data[2020].columns

In [None]:
data[2021].columns

# Prepare all years

Some pre-processing of the data is required before we can analyse it further:
* Variables and variable names vary between years.
* Variables of interest will be retained and renamed for consistency across years.
* The remaining variables will be dropped.
* Datasets will be combined into one DataFrame to facilitate further analysis.
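
The steps above can be sketched on two toy tables. This version renames columns by name with a per-year mapping instead of by position, which is a design choice of the sketch rather than what the cells below do. The frames, the `Extra` column, and the `rename_maps` and `keep` names are hypothetical.

```python
import pandas as pd

# Toy stand-ins for two yearly files whose column names differ
frames = {
    2017: pd.DataFrame({'Country': ['A', 'B'],
                        'Happiness.Score': [7.5, 4.2],
                        'Family': [1.5, 0.9],
                        'Extra': [1, 2]}),
    2019: pd.DataFrame({'Country or region': ['A', 'B'],
                        'Score': [7.6, 4.3],
                        'Social support': [1.6, 1.0],
                        'Extra': [1, 2]}),
}

# Per-year mapping from the original names to one shared set of names
rename_maps = {
    2017: {'Happiness.Score': 'Happiness Score', 'Family': 'Social Support'},
    2019: {'Country or region': 'Country', 'Score': 'Happiness Score',
           'Social support': 'Social Support'},
}
keep = ['Country', 'Happiness Score', 'Social Support']

tidy = []
for year, frame in frames.items():
    # Rename, keep only the shared columns, and tag each row with its year
    renamed = frame.rename(columns=rename_maps[year])[keep].copy()
    renamed['Year'] = year
    tidy.append(renamed)

combined = pd.concat(tidy, ignore_index=True)
print(combined)
```

Renaming by name fails loudly with a `KeyError` if an expected column is missing, whereas assigning a list of names by position silently mislabels columns if their order ever changes.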

In [None]:
# Keep columns of interest
# If you look closely you will see that column names vary slightly across years
data[2015] = data[2015][['Country', 'Happiness Score', 'Economy (GDP per Capita)', 'Family',
                         'Health (Life Expectancy)', 'Freedom', 'Generosity',
                         'Trust (Government Corruption)']]
data[2016] = data[2016][['Country', 'Happiness Score', 'Economy (GDP per Capita)', 'Family',
                         'Health (Life Expectancy)', 'Freedom', 'Generosity',
                         'Trust (Government Corruption)']]
data[2017] = data[2017][['Country', 'Happiness.Score', 'Economy..GDP.per.Capita.', 'Family',
                         'Health..Life.Expectancy.', 'Freedom', 'Generosity',
                         'Trust..Government.Corruption.']]
data[2018] = data[2018][['Country or region', 'Score', 'GDP per capita', 'Social support',
                         'Healthy life expectancy', 'Freedom to make life choices',
                         'Generosity', 'Perceptions of corruption']]
data[2019] = data[2019][['Country or region', 'Score', 'GDP per capita', 'Social support',
                         'Healthy life expectancy', 'Freedom to make life choices',
                         'Generosity', 'Perceptions of corruption']]
data[2020] = data[2020][['Country name', 'Ladder score', 'Explained by: Log GDP per capita',
                         'Explained by: Social support', 
                         'Explained by: Healthy life expectancy',
                         'Explained by: Freedom to make life choices',
                         'Explained by: Generosity',
                         'Explained by: Perceptions of corruption']]
data[2021] = data[2021][['Country name', 'Ladder score', 'Explained by: Log GDP per capita',
                         'Explained by: Social support',
                         'Explained by: Healthy life expectancy', 
                         'Explained by: Freedom to make life choices',
                         'Explained by: Generosity', 
                         'Explained by: Perceptions of corruption']]

In [None]:
# New column names to apply across all years
new_names = ['Country', 'Happiness Score', 'Economy (GDP per Capita)', 'Social Support',
             'Health (Life Expectancy)', 'Freedom', 'Generosity',
             'Trust (Government Corruption)']

years = [*range(2015, 2022, 1)]
for year in years:
    data[year].columns = new_names # Apply new column names
    data[year]['Year'] = year      # Add Year column

In [None]:
# Combine all years into one DataFrame
data = pd.concat([data[2015], data[2016], data[2017], data[2018],
                  data[2019], data[2020], data[2021]], axis=0)
data.head()

In [None]:
data.shape

In [None]:
data.info()

In [None]:
# Drop observations with missing values
data.dropna(axis=0, inplace=True)

# Exploratory data analysis

In [None]:
# Summary statistics
data.groupby(by='Year')['Happiness Score'].describe()

In [None]:
data['Year'].value_counts()

There are slightly fewer countries represented over time.

> Are we getting more or less happy over time?  
> Has happiness declined since the spread of COVID-19?

`mean` is the average happiness score across all countries. Most of the remaining statistics describe the spread of scores: the `std` has decreased in the last two years, meaning happiness scores differ less between countries. The 25th percentile has also been increasing, so even the less happy countries are getting happier over time.

So global happiness does appear to be increasing. Additionally, reported happiness has increased since COVID-19, which is unexpected.

In [None]:
scores_by_year = data.groupby(by='Year')['Happiness Score'].describe()

In [None]:
scores_by_year.info()

In [None]:
plt.figure(figsize=(9,6))
g = sns.lineplot(data=scores_by_year['mean'])
g.set_title('Global Happiness Over Time')
g.set(ylabel='Mean Happiness Score')
g.set_ylim(0, 7)
sns.despine(left=True, bottom=True, top=True);

This looks pretty flat over time, though it does seem to increase slightly.
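
A rough way to put a number on "increases slightly" is to fit a straight line to the yearly means. The values below are placeholders, not the real means; on the actual data you would pass `scores_by_year['mean']` instead.

```python
import numpy as np

# Placeholder yearly means (not real results), one per report year
years = np.array([2015, 2016, 2017, 2018, 2019, 2020, 2021])
mean_scores = np.array([5.0, 5.0, 5.1, 5.1, 5.2, 5.2, 5.3])

# Centre the years so the fit is numerically well behaved
centred_years = years - years.mean()

# Degree-1 polynomial fit: the slope is the average change in score per year
slope, intercept = np.polyfit(centred_years, mean_scores, deg=1)
print(f'Change per year: {slope:.3f}')
```

A slope this small relative to the 0 to 10 scale would be consistent with the flat-looking line in the chart.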

In [None]:
plt.figure(figsize=(9,6))
g = sns.histplot(data=data, x='Happiness Score', bins=20)
g.set_title('Distribution of Happiness Score (All Years)')
sns.despine(left=True);

Happiness scores are approximately normally distributed.

The remaining columns represent factors that help explain the happiness score.
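
The histogram can be backed up with two quick summary numbers: skewness and excess kurtosis, both of which are close to 0 for a normal distribution. This is only a sketch on synthetic scores; swap in `data['Happiness Score']` to check the real column.

```python
import numpy as np
import pandas as pd

# Synthetic scores standing in for data['Happiness Score'] (not real survey values)
rng = np.random.default_rng(42)
toy_scores = pd.Series(rng.normal(loc=5.4, scale=1.1, size=1000),
                       name='Happiness Score')

# pandas reports excess kurtosis, so a normal distribution gives roughly 0 for both
skewness = toy_scores.skew()
excess_kurtosis = toy_scores.kurt()
print(f'skew={skewness:.2f}, excess kurtosis={excess_kurtosis:.2f}')
```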

In [None]:
# Group data by year and compute mean of factors
grouped = data.groupby(by='Year')[['Happiness Score', 'Economy (GDP per Capita)', 
                                   'Social Support', 'Health (Life Expectancy)',
                                   'Freedom', 'Generosity', 
                                   'Trust (Government Corruption)']].mean().reset_index()

In [None]:
# Reconstruct DataFrame
grouped = pd.melt(frame = grouped, id_vars='Year', 
                  value_vars=['Happiness Score', 'Economy (GDP per Capita)', 
                              'Social Support', 'Health (Life Expectancy)',
                              'Freedom', 'Generosity', 
                              'Trust (Government Corruption)'],
                  var_name='Factor', value_name='Avg_value')

In [None]:
plt.figure(figsize=(12,8))
# Filter once so x, y and hue all come from the same rows
factors_only = grouped[grouped['Factor'] != 'Happiness Score']
g = sns.barplot(data=factors_only, x='Factor', y='Avg_value', hue='Year')
g.set_title('Factors That Explain Happiness by Year')
g.set(xlabel=None, ylabel='Mean value for all countries')
sns.despine(left=True, bottom=True, top=True);

In [None]:
grouped_early = grouped[(grouped['Year'] < 2020)]

> What makes us happy?

`Social Support` and `Economy (GDP per Capita)` appear to explain happiness scores the most.

In [None]:
# Compute the correlation matrix (numeric columns only, since 'Country' is text)
corr = data.select_dtypes(include='number').corr()
corr

In [None]:
sns.set_theme(style='white')
mask = np.triu(np.ones_like(corr, dtype=bool))
f, ax = plt.subplots(figsize=(8,6))
cmap = sns.diverging_palette(240, 10, as_cmap=True)
sns.heatmap(corr, mask=mask, cmap=cmap, vmax=.9, center=0,
            square=True, linewidths=.5, cbar_kws={'shrink': .5});
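
A heatmap is easier to read as a ranking if you pull out the `Happiness Score` column of the correlation matrix and sort it. The sketch below runs on synthetic data in which `Social Support` is deliberately built to matter more than `Generosity`, so the output only shows the mechanics, not a finding.

```python
import numpy as np
import pandas as pd

# Synthetic data (not real survey values): support drives the score, generosity barely does
rng = np.random.default_rng(0)
n = 200
support = rng.normal(size=n)
generosity = rng.normal(size=n)
toy = pd.DataFrame({
    'Country': [f'C{i}' for i in range(n)],
    'Happiness Score': 5 + 0.8 * support + 0.1 * generosity
                       + rng.normal(scale=0.3, size=n),
    'Social Support': support,
    'Generosity': generosity,
})

# Correlate numeric columns, keep the score column, drop its self-correlation, and sort
corr_with_score = (toy.select_dtypes(include='number')
                      .corr()['Happiness Score']
                      .drop('Happiness Score')
                      .sort_values(ascending=False))
print(corr_with_score)
```

Keep in mind that a correlation describes association only; it does not show that a factor causes higher scores.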

# Summary

Some interesting things I have learned:
* The data suggests people are reasonably happy.
* Surprisingly, average happiness appears to have increased slightly over the last six years.
* Social support and GDP per capita appear to be most strongly associated with happiness scores.
* Health is also strongly associated with happiness.

Other technical details:
* Dictionaries are handy for storing a large number of datasets. They let you use for loops for importing data and for data wrangling tasks, thereby reducing repetition in your code (see the [DRY principle](https://en.wikipedia.org/wiki/Don%27t_repeat_yourself)).
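
As a small illustration of the dictionary pattern, the sketch below reads two in-memory CSV strings in place of the yearly files and then loops over the dictionary once to add a `Year` column. The `raw_files` contents and the `data_by_year` name are hypothetical.

```python
import io
import pandas as pd

# Hypothetical in-memory CSVs standing in for the yearly files
raw_files = {
    2015: 'Country,Happiness Score\nA,7.5\nB,4.2\n',
    2016: 'Country,Happiness Score\nA,7.4\nB,4.4\n',
}

# One dictionary entry per year, keyed by the year itself
data_by_year = {year: pd.read_csv(io.StringIO(text))
                for year, text in raw_files.items()}

# The same wrangling step is written once and applied to every year
for year, frame in data_by_year.items():
    frame['Year'] = year

print(data_by_year[2016])
```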