# <b>World Happiness Report Analysis: Companion Notebook</b>

## <b>Introduction</b>

### Brief Overview of the Project Goals
In this project, I will analyze and visualize trends in global happiness using the World Happiness Report data from 2015 to 2019 (or the most recent data available). My analysis will examine how happiness scores have changed over time, identify the top and bottom countries in terms of happiness, explore regional variations, and investigate the factors that correlate with happiness levels. By doing so, I hope to gain a deeper understanding of the determinants of happiness across different countries and regions.


# <b>About the Dataset</b>


### Data Sources
I obtained the data for this project from the World Happiness Report collection on Kaggle, which contains the annual reports from 2015 to 2019 with country-level data on factors influencing happiness: https://www.kaggle.com/datasets/unsdsn/world-happiness

For the original survey and more detailed information, please refer to the World Happiness Report website: https://worldhappiness.report/ed/2024/#appendices-and-data

### Data Combination
Initially, the data for each year existed as separate datasets. To facilitate a comprehensive analysis for the entire period from 2015 to 2019, I combined these datasets into a single, unified dataset. This consolidation allows for consistent analysis and comparison of happiness scores and factors across different years.


# Import necessary libraries


In [261]:
import pandas as pd


# Load datasets


In [262]:
happiness_2015 = pd.read_csv('../data/2015.csv')
happiness_2016 = pd.read_csv('../data/2016.csv')
happiness_2017 = pd.read_csv('../data/2017.csv')
happiness_2018 = pd.read_csv('../data/2018.csv')
happiness_2019 = pd.read_csv('../data/2019.csv')

Before merging the datasets for analysis across the years, I need to check the columns of each dataset. This will ensure compatibility and avoid issues during the analysis.

In [263]:
happiness_2015.columns

Index(['Country', 'Region', 'Happiness Rank', 'Happiness Score',
       'Standard Error', 'Economy (GDP per Capita)', 'Family',
       'Health (Life Expectancy)', 'Freedom', 'Trust (Government Corruption)',
       'Generosity', 'Dystopia Residual'],
      dtype='object')

In [264]:
happiness_2016.columns

Index(['Country', 'Region', 'Happiness Rank', 'Happiness Score',
       'Lower Confidence Interval', 'Upper Confidence Interval',
       'Economy (GDP per Capita)', 'Family', 'Health (Life Expectancy)',
       'Freedom', 'Trust (Government Corruption)', 'Generosity',
       'Dystopia Residual'],
      dtype='object')

In [265]:
happiness_2017.columns

Index(['Country', 'Happiness.Rank', 'Happiness.Score', 'Whisker.high',
       'Whisker.low', 'Economy..GDP.per.Capita.', 'Family',
       'Health..Life.Expectancy.', 'Freedom', 'Generosity',
       'Trust..Government.Corruption.', 'Dystopia.Residual'],
      dtype='object')

In [266]:
happiness_2018.columns

Index(['Overall rank', 'Country or region', 'Score', 'GDP per capita',
       'Social support', 'Healthy life expectancy',
       'Freedom to make life choices', 'Generosity',
       'Perceptions of corruption'],
      dtype='object')

In [267]:
happiness_2019.columns

Index(['Overall rank', 'Country or region', 'Score', 'GDP per capita',
       'Social support', 'Healthy life expectancy',
       'Freedom to make life choices', 'Generosity',
       'Perceptions of corruption'],
      dtype='object')

# Observations: 
* Not all columns are the same across each dataset. Some datasets contain additional columns, which can cause issues during the merge operation. Additionally, some column names are not appropriately labeled. To address this, I need to drop unnecessary columns from some datasets and rename certain columns.

* The happiness_2015 and happiness_2016 datasets include a region column, while others do not. Therefore, I will extract the country and region columns from one of these datasets. After the merge operation, I will add the region column to our merged data, as the regions that countries belong to remain consistent over the years.

* First, I will extract the countries and regions columns, then adjust each dataset individually.
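
The column differences noted above can also be surfaced programmatically instead of by eye. The following is a minimal sketch under stated assumptions: the `frames` dictionary and its toy DataFrames are hypothetical stand-ins that carry only a few of the real column names, not the actual yearly files.

```python
import pandas as pd

# Hypothetical stand-ins for the yearly files; only the column names matter here.
frames = {
    2015: pd.DataFrame(columns=['Country', 'Region', 'Happiness Score', 'Standard Error']),
    2017: pd.DataFrame(columns=['Country', 'Happiness.Score', 'Whisker.high']),
    2019: pd.DataFrame(columns=['Country or region', 'Score']),
}

# Columns that appear under the same name in every frame.
shared = set.intersection(*(set(df.columns) for df in frames.values()))

# Columns each frame has beyond the shared set.
extras = {year: sorted(set(df.columns) - shared) for year, df in frames.items()}

print("Shared by all years:", sorted(shared))
for year, cols in extras.items():
    print(year, "only:", cols)
```

With these toy frames no name is shared by all years, which mirrors why renaming is needed before the datasets can be stacked.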


## Extracting the country and region columns from the happiness_2015 dataset.

In [268]:
country_and_regions = happiness_2015[['Country', 'Region']]
country_and_regions

Unnamed: 0,Country,Region
0,Switzerland,Western Europe
1,Iceland,Western Europe
2,Denmark,Western Europe
3,Norway,Western Europe
4,Canada,North America
...,...,...
153,Rwanda,Sub-Saharan Africa
154,Benin,Sub-Saharan Africa
155,Syria,Middle East and Northern Africa
156,Burundi,Sub-Saharan Africa


# Actions for happiness_2015 Dataset:

* I'm cleaning up the dataset by removing the "Region" and "Standard Error" columns. Additionally, I'll be renaming some columns for consistency with other datasets. These include "Happiness Rank," "Happiness Score," "Economy (GDP per Capita)," "Health (Life Expectancy)," "Trust (Government Corruption)," and "Dystopia Residual." The current names contain spaces and special characters, which I'd like to avoid.

In [269]:
print("Before cleaning:")
print(happiness_2015.columns) # current column names 


happiness_2015.drop(['Region', 'Standard Error'], axis=1, inplace=True) # 1. Dropping irrelevant columns

happiness_2015.rename(columns={
    'Happiness Rank': 'Happiness_Rank',
    'Happiness Score': 'Happiness_Score',
    'Economy (GDP per Capita)': 'Economy_GDP_per_Capita',
    'Health (Life Expectancy)': 'Health_Life_Expectancy',
    'Trust (Government Corruption)': 'Trust_Government_Corruption',
    'Dystopia Residual': 'Dystopia_Residual'
}, inplace=True)# 2. Renaming columns for consistency


print("\nAfter cleaning:")
print(happiness_2015.columns)# updated column names



Before cleaning:
Index(['Country', 'Region', 'Happiness Rank', 'Happiness Score',
       'Standard Error', 'Economy (GDP per Capita)', 'Family',
       'Health (Life Expectancy)', 'Freedom', 'Trust (Government Corruption)',
       'Generosity', 'Dystopia Residual'],
      dtype='object')

After cleaning:
Index(['Country', 'Happiness_Rank', 'Happiness_Score',
       'Economy_GDP_per_Capita', 'Family', 'Health_Life_Expectancy', 'Freedom',
       'Trust_Government_Corruption', 'Generosity', 'Dystopia_Residual'],
      dtype='object')


# Actions for happiness_2016 Dataset:
* To keep this dataset consistent with the others, I'm removing the "Region," "Lower Confidence Interval," and "Upper Confidence Interval" columns. I'm also renaming the same columns as in 2015: "Happiness Rank," "Happiness Score," "Economy (GDP per Capita)," "Health (Life Expectancy)," "Trust (Government Corruption)," and "Dystopia Residual." The current names contain spaces and special characters, which I'm replacing with underscores.

In [270]:
print("Before cleaning:")
print(happiness_2016.columns) # current column names 

happiness_2016.drop(['Region', 'Lower Confidence Interval', 'Upper Confidence Interval'], axis=1, inplace=True) # 1. Dropping irrelevant columns

happiness_2016.rename(columns={
    'Happiness Rank': 'Happiness_Rank',
    'Happiness Score': 'Happiness_Score',
    'Economy (GDP per Capita)': 'Economy_GDP_per_Capita',
    'Health (Life Expectancy)': 'Health_Life_Expectancy',
    'Trust (Government Corruption)': 'Trust_Government_Corruption',
    'Dystopia Residual': 'Dystopia_Residual'
}, inplace=True) # 2. Renaming columns for consistency


print("\nAfter cleaning:")
print(happiness_2016.columns) # updated column names



Before cleaning:
Index(['Country', 'Region', 'Happiness Rank', 'Happiness Score',
       'Lower Confidence Interval', 'Upper Confidence Interval',
       'Economy (GDP per Capita)', 'Family', 'Health (Life Expectancy)',
       'Freedom', 'Trust (Government Corruption)', 'Generosity',
       'Dystopia Residual'],
      dtype='object')

After cleaning:
Index(['Country', 'Happiness_Rank', 'Happiness_Score',
       'Economy_GDP_per_Capita', 'Family', 'Health_Life_Expectancy', 'Freedom',
       'Trust_Government_Corruption', 'Generosity', 'Dystopia_Residual'],
      dtype='object')


# Actions for happiness_2017 Dataset:
* I follow the same procedure here for consistency with the earlier datasets. I drop the "Whisker.high" and "Whisker.low" columns, then rename "Happiness.Rank", "Happiness.Score", "Economy..GDP.per.Capita.", "Health..Life.Expectancy.", "Trust..Government.Corruption.", and "Dystopia.Residual". These changes replace the dots in the current names with single underscores so that they match the 2015 and 2016 names.

In [271]:

print("Before cleaning:")
print(happiness_2017.columns) # current column names 

happiness_2017.drop(['Whisker.high', 'Whisker.low'], axis=1, inplace=True) # 1. Dropping irrelevant columns

happiness_2017.rename(columns={
    'Happiness.Rank': 'Happiness_Rank',
    'Happiness.Score': 'Happiness_Score',
    'Economy..GDP.per.Capita.': 'Economy_GDP_per_Capita',
    'Health..Life.Expectancy.': 'Health_Life_Expectancy',
    'Trust..Government.Corruption.': 'Trust_Government_Corruption',
    'Dystopia.Residual': 'Dystopia_Residual'
}, inplace=True)# 2. Renaming columns for consistency

print("\nAfter cleaning:")
print(happiness_2017.columns)# updated column names


Before cleaning:
Index(['Country', 'Happiness.Rank', 'Happiness.Score', 'Whisker.high',
       'Whisker.low', 'Economy..GDP.per.Capita.', 'Family',
       'Health..Life.Expectancy.', 'Freedom', 'Generosity',
       'Trust..Government.Corruption.', 'Dystopia.Residual'],
      dtype='object')

After cleaning:
Index(['Country', 'Happiness_Rank', 'Happiness_Score',
       'Economy_GDP_per_Capita', 'Family', 'Health_Life_Expectancy', 'Freedom',
       'Generosity', 'Trust_Government_Corruption', 'Dystopia_Residual'],
      dtype='object')
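
The hand-written rename dictionaries for 2015 to 2017 all follow one pattern: runs of spaces, dots, and parentheses become a single underscore. A minimal sketch of that rule as a reusable helper is shown here. The function name `normalize_column_name` is my own assumption, not part of the notebook or of pandas, and it only covers files whose names differ in punctuation; the 2018 and 2019 files use different words and still need an explicit mapping.

```python
import re

def normalize_column_name(name: str) -> str:
    """Collapse each run of non-alphanumeric characters into one underscore
    and trim underscores left at either end."""
    return re.sub(r'[^0-9A-Za-z]+', '_', name).strip('_')

# Names taken from the 2015 and 2017 column listings above.
examples = [
    'Happiness Rank',
    'Economy (GDP per Capita)',
    'Economy..GDP.per.Capita.',
    'Trust..Government.Corruption.',
]
for name in examples:
    print(f"{name!r} -> {normalize_column_name(name)!r}")
```

Such a helper could be passed directly to `DataFrame.rename(columns=normalize_column_name)`, since `rename` accepts a function as well as a dictionary.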


# Actions for happiness_2018 Dataset:
* Unlike the previous datasets, this one doesn't require any column removal. However, the columns need renaming to match the earlier years: "Overall rank" becomes "Happiness_Rank," "Country or region" becomes "Country," "Score" becomes "Happiness_Score," "GDP per capita" becomes "Economy_GDP_per_Capita," "Social support" becomes "Family," "Healthy life expectancy" becomes "Health_Life_Expectancy," "Freedom to make life choices" becomes "Freedom," and "Perceptions of corruption" becomes "Trust_Government_Corruption." "Generosity" keeps its name. Note that this file has no "Dystopia Residual" column.

In [272]:
print("Before cleaning:")
print(happiness_2018.columns) # current column names 

happiness_2018.rename(columns={
    'Overall rank': 'Happiness_Rank',
    'Country or region': 'Country',
    'Score': 'Happiness_Score',
    'GDP per capita': 'Economy_GDP_per_Capita',
    'Social support': 'Family',
    'Healthy life expectancy': 'Health_Life_Expectancy',
    'Freedom to make life choices': 'Freedom',
    'Generosity': 'Generosity',
    'Perceptions of corruption': 'Trust_Government_Corruption'
}, inplace=True)# Renaming columns for consistency

print("\nAfter cleaning:")
print(happiness_2018.columns)# updated column names




Before cleaning:
Index(['Overall rank', 'Country or region', 'Score', 'GDP per capita',
       'Social support', 'Healthy life expectancy',
       'Freedom to make life choices', 'Generosity',
       'Perceptions of corruption'],
      dtype='object')

After cleaning:
Index(['Happiness_Rank', 'Country', 'Happiness_Score',
       'Economy_GDP_per_Capita', 'Family', 'Health_Life_Expectancy', 'Freedom',
       'Generosity', 'Trust_Government_Corruption'],
      dtype='object')


# Actions for happiness_2019 Dataset:

* This dataset has the same columns as 2018, so no column removal is needed and the same renames apply: "Overall rank" becomes "Happiness_Rank," "Country or region" becomes "Country," "Score" becomes "Happiness_Score," "GDP per capita" becomes "Economy_GDP_per_Capita," "Social support" becomes "Family," "Healthy life expectancy" becomes "Health_Life_Expectancy," "Freedom to make life choices" becomes "Freedom," and "Perceptions of corruption" becomes "Trust_Government_Corruption." "Generosity" keeps its name.

In [273]:
print("Before cleaning:")
print(happiness_2019.columns)# current column names 


happiness_2019.rename(columns={
    'Overall rank': 'Happiness_Rank',
    'Country or region': 'Country',
    'Score': 'Happiness_Score',
    'GDP per capita': 'Economy_GDP_per_Capita',
    'Social support': 'Family',
    'Healthy life expectancy': 'Health_Life_Expectancy',
    'Freedom to make life choices': 'Freedom',
    'Generosity': 'Generosity',
    'Perceptions of corruption': 'Trust_Government_Corruption'
}, inplace=True)# Renaming columns for consistency

print("\nAfter cleaning:")
print(happiness_2019.columns)# updated column names



Before cleaning:
Index(['Overall rank', 'Country or region', 'Score', 'GDP per capita',
       'Social support', 'Healthy life expectancy',
       'Freedom to make life choices', 'Generosity',
       'Perceptions of corruption'],
      dtype='object')

After cleaning:
Index(['Happiness_Rank', 'Country', 'Happiness_Score',
       'Economy_GDP_per_Capita', 'Family', 'Health_Life_Expectancy', 'Freedom',
       'Generosity', 'Trust_Government_Corruption'],
      dtype='object')
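
Because the 2018 and 2019 files share the same layout, the two rename cells could be driven by one shared mapping. The sketch below assumes toy one-row frames (`df_2018`, `df_2019`) with made-up values; only the column names come from the outputs above. It uses `rename` without `inplace=True`, so the originals are left untouched.

```python
import pandas as pd

# Column names copied from the 2018/2019 listings; the row values are made up.
raw_columns = ['Overall rank', 'Country or region', 'Score', 'GDP per capita',
               'Social support', 'Healthy life expectancy',
               'Freedom to make life choices', 'Generosity',
               'Perceptions of corruption']
df_2018 = pd.DataFrame([[1, 'A', 7.0, 1.0, 1.0, 1.0, 0.5, 0.2, 0.3]], columns=raw_columns)
df_2019 = pd.DataFrame([[1, 'B', 7.1, 1.1, 1.0, 1.0, 0.5, 0.2, 0.3]], columns=raw_columns)

# One mapping for both years; 'Generosity' already has its target name.
RENAME_2018_2019 = {
    'Overall rank': 'Happiness_Rank',
    'Country or region': 'Country',
    'Score': 'Happiness_Score',
    'GDP per capita': 'Economy_GDP_per_Capita',
    'Social support': 'Family',
    'Healthy life expectancy': 'Health_Life_Expectancy',
    'Freedom to make life choices': 'Freedom',
    'Perceptions of corruption': 'Trust_Government_Corruption',
}

# rename returns a new DataFrame, so the raw frames stay as loaded.
cleaned = {year: df.rename(columns=RENAME_2018_2019)
           for year, df in {2018: df_2018, 2019: df_2019}.items()}

print(list(cleaned[2018].columns))
```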


Next, I add a "Year" column to each dataset so that rows from different years can be told apart after the merge operation.

In [274]:
happiness_2015['Year'] = 2015
happiness_2016['Year'] = 2016
happiness_2017['Year'] = 2017
happiness_2018['Year'] = 2018
happiness_2019['Year'] = 2019
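
As an alternative to five separate assignments, the year tag can be attached while looping over a dictionary keyed by year. This is only a sketch: the `yearly` dictionary and its two toy frames are hypothetical, and `DataFrame.assign` is used so that the input frames are not modified.

```python
import pandas as pd

# Hypothetical miniature versions of two cleaned yearly frames.
yearly = {
    2015: pd.DataFrame({'Country': ['A', 'B'], 'Happiness_Score': [7.0, 5.0]}),
    2016: pd.DataFrame({'Country': ['A'], 'Happiness_Score': [7.1]}),
}

# assign returns a copy with the extra column; the originals keep their shape.
tagged = [df.assign(Year=year) for year, df in yearly.items()]
combined = pd.concat(tagged, ignore_index=True)

# Sanity check: stacking should neither lose nor invent rows.
assert len(combined) == sum(len(df) for df in yearly.values())
print(combined)
```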

So far, I have adjusted all five datasets. Now I can combine them into a single dataset and add the Region column to the merged data.

In [275]:
happiness_datasets = [happiness_2015, happiness_2016, happiness_2017, happiness_2018, happiness_2019] # List of all happiness datasets

all_happiness_data = pd.concat(happiness_datasets, ignore_index=True)# Concatenating datasets row-wise

all_happiness_data = all_happiness_data.sort_values(by=['Country', 'Year']).reset_index(drop=True)# Sorting by Country and Year

all_happiness_data = pd.merge(all_happiness_data, country_and_regions, on='Country', how='left')# Merging with country and regions data

desired_columns = [
    'Country', 'Region', 'Year', 'Happiness_Rank', 'Happiness_Score',
    'Economy_GDP_per_Capita', 'Family', 'Health_Life_Expectancy',
    'Freedom', 'Trust_Government_Corruption', 'Generosity',
    'Dystopia_Residual'
]# List of columns in the desired order

all_happiness_data = all_happiness_data[desired_columns]# Reindexing columns in all_happiness_data DataFrame

all_happiness_data.head()



Unnamed: 0,Country,Region,Year,Happiness_Rank,Happiness_Score,Economy_GDP_per_Capita,Family,Health_Life_Expectancy,Freedom,Trust_Government_Corruption,Generosity,Dystopia_Residual
0,Afghanistan,Southern Asia,2015,153,3.575,0.31982,0.30285,0.30335,0.23414,0.09719,0.3651,1.9521
1,Afghanistan,Southern Asia,2016,154,3.36,0.38227,0.11037,0.17344,0.1643,0.07112,0.31268,2.14558
2,Afghanistan,Southern Asia,2017,141,3.794,0.401477,0.581543,0.180747,0.10618,0.061158,0.311871,2.150801
3,Afghanistan,Southern Asia,2018,145,3.632,0.332,0.537,0.255,0.085,0.036,0.191,
4,Afghanistan,Southern Asia,2019,154,3.203,0.35,0.517,0.361,0.0,0.025,0.158,
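
Since the region lookup comes only from the 2015 file, any country that first appears in a later year ends up without a region after the left merge. A hedged sketch of how this could be detected at merge time is given here, using the `validate` and `indicator` parameters of `pd.merge`. The `scores` and `regions` frames are toy data for illustration.

```python
import pandas as pd

# Toy data: Belize has no entry in the region lookup.
scores = pd.DataFrame({
    'Country': ['Norway', 'Norway', 'Belize'],
    'Year': [2015, 2016, 2016],
})
regions = pd.DataFrame({
    'Country': ['Norway'],
    'Region': ['Western Europe'],
})

# validate='many_to_one' raises if a country appears twice in the lookup,
# which would otherwise silently duplicate rows in the result.
merged = pd.merge(scores, regions, on='Country', how='left',
                  validate='many_to_one', indicator=True)

# Rows marked 'left_only' found no match in the lookup.
unmatched = merged.loc[merged['_merge'] == 'left_only', 'Country'].unique()
print(list(unmatched))

merged = merged.drop(columns='_merge')
```

The `_merge` column is only a diagnostic aid and is dropped again before further analysis.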


Now my dataset is ready for Exploratory Data Analysis (EDA). Afterward, I can use it for visualizations.

# <b>Exploratory Data Analysis (EDA)</b>


In [276]:
all_happiness_data.head()

Unnamed: 0,Country,Region,Year,Happiness_Rank,Happiness_Score,Economy_GDP_per_Capita,Family,Health_Life_Expectancy,Freedom,Trust_Government_Corruption,Generosity,Dystopia_Residual
0,Afghanistan,Southern Asia,2015,153,3.575,0.31982,0.30285,0.30335,0.23414,0.09719,0.3651,1.9521
1,Afghanistan,Southern Asia,2016,154,3.36,0.38227,0.11037,0.17344,0.1643,0.07112,0.31268,2.14558
2,Afghanistan,Southern Asia,2017,141,3.794,0.401477,0.581543,0.180747,0.10618,0.061158,0.311871,2.150801
3,Afghanistan,Southern Asia,2018,145,3.632,0.332,0.537,0.255,0.085,0.036,0.191,
4,Afghanistan,Southern Asia,2019,154,3.203,0.35,0.517,0.361,0.0,0.025,0.158,


# Let's check the shape of the dataframe:


In [277]:
all_happiness_data.shape

(782, 12)

The dataset has 782 rows and 12 columns.

In [278]:
all_happiness_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 782 entries, 0 to 781
Data columns (total 12 columns):
 #   Column                       Non-Null Count  Dtype  
---  ------                       --------------  -----  
 0   Country                      782 non-null    object 
 1   Region                       757 non-null    object 
 2   Year                         782 non-null    int64  
 3   Happiness_Rank               782 non-null    int64  
 4   Happiness_Score              782 non-null    float64
 5   Economy_GDP_per_Capita       782 non-null    float64
 6   Family                       782 non-null    float64
 7   Health_Life_Expectancy       782 non-null    float64
 8   Freedom                      782 non-null    float64
 9   Trust_Government_Corruption  781 non-null    float64
 10  Generosity                   782 non-null    float64
 11  Dystopia_Residual            470 non-null    float64
dtypes: float64(8), int64(2), object(2)
memory usage: 73.4+ KB


# Let's count the number of unique values in each column

In [279]:
all_happiness_data.nunique()


Country                        170
Region                          10
Year                             5
Happiness_Rank                 158
Happiness_Score                716
Economy_GDP_per_Capita         742
Family                         732
Health_Life_Expectancy         705
Freedom                        697
Trust_Government_Corruption    635
Generosity                     664
Dystopia_Residual              470
dtype: int64

# Count duplicated rows in the dataset

In [280]:
all_happiness_data.duplicated().sum()


np.int64(0)
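
`duplicated()` with no arguments only flags rows that are identical in every column. For panel data like this, it may also be worth checking that each (Country, Year) pair occurs once. The sketch below uses a small made-up frame named `toy`; it does not report anything about the real dataset.

```python
import pandas as pd

# Made-up frame: country B appears twice for 2015 with different scores.
toy = pd.DataFrame({
    'Country': ['A', 'A', 'B', 'B'],
    'Year': [2015, 2016, 2015, 2015],
    'Happiness_Score': [7.0, 7.1, 5.0, 5.2],
})

# Whole-row duplicates: none, because the scores differ.
full_row_dupes = toy.duplicated().sum()

# Key duplicates: the second (B, 2015) row repeats an existing key.
key_dupes = toy.duplicated(subset=['Country', 'Year']).sum()

print(full_row_dupes, key_dupes)
```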

# Count missing values in each column:


In [281]:
all_happiness_data.isnull().sum()


Country                          0
Region                          25
Year                             0
Happiness_Rank                   0
Happiness_Score                  0
Economy_GDP_per_Capita           0
Family                           0
Health_Life_Expectancy           0
Freedom                          0
Trust_Government_Corruption      1
Generosity                       0
Dystopia_Residual              312
dtype: int64

The 'Dystopia_Residual' column has 312 missing values. These come from the 2018 and 2019 files, which do not include this column at all. Since the column is not critical to the analysis and is unavailable for two of the five years, I decided to remove it from the project.
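
To see where missing values are concentrated, the null counts can be broken down by year. This is a minimal sketch on a made-up frame (`toy`), with invented scores, that imitates the situation of a column being absent from the later files.

```python
import numpy as np
import pandas as pd

# Made-up rows: the residual is present for 2017 and absent afterwards.
toy = pd.DataFrame({
    'Year': [2017, 2017, 2018, 2019],
    'Happiness_Score': [7.5, 3.8, 7.6, 7.8],
    'Dystopia_Residual': [2.3, 2.2, np.nan, np.nan],
})

# isna() gives a boolean frame; grouping it by year and summing counts the nulls.
missing_by_year = toy.drop(columns='Year').isna().groupby(toy['Year']).sum()
print(missing_by_year)
```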

In [282]:
all_happiness_data = all_happiness_data.drop(columns=['Dystopia_Residual'])


In [283]:
all_happiness_data.columns

Index(['Country', 'Region', 'Year', 'Happiness_Rank', 'Happiness_Score',
       'Economy_GDP_per_Capita', 'Family', 'Health_Life_Expectancy', 'Freedom',
       'Trust_Government_Corruption', 'Generosity'],
      dtype='object')

Additionally, I observe that the 'Region' column contains some NaN values. Let's take a closer look at them.

In [284]:
all_happiness_data[all_happiness_data['Region'].isna()][['Country', 'Region','Year']]



Unnamed: 0,Country,Region,Year
64,Belize,,2016
65,Belize,,2017
66,Belize,,2018
235,Gambia,,2019
280,"Hong Kong S.A.R., China",,2017
482,Namibia,,2016
483,Namibia,,2017
484,Namibia,,2018
485,Namibia,,2019
519,North Macedonia,,2019


The 25 rows with a missing Region belong to the following countries and territories: Belize, Gambia, Hong Kong S.A.R., China, Namibia, North Macedonia, Northern Cyprus, Puerto Rico, Somalia, Somaliland Region, South Sudan, Taiwan Province of China, and Trinidad & Tobago. I looked up their regions in Wikipedia's list of countries and territories by the United Nations geoscheme (https://en.wikipedia.org/wiki/List_of_countries_and_territories_by_the_United_Nations_geoscheme), and they correspond to these regions:

- Belize: Latin America and the Caribbean
- Gambia: Sub-Saharan Africa
- Hong Kong S.A.R., China: Eastern Asia
- Namibia: Sub-Saharan Africa
- North Macedonia: Southern Europe
- Northern Cyprus: Western Asia
- Puerto Rico: Latin America and the Caribbean
- Somalia: Sub-Saharan Africa
- Somaliland Region: Eastern Africa
- South Sudan: Sub-Saharan Africa
- Taiwan Province of China: Eastern Asia
- Trinidad & Tobago: Latin America and the Caribbean

Let's check how many countries there are in each region.

In [285]:
region_country_counts = all_happiness_data.groupby('Region')['Country'].nunique().to_frame(name='number_of_countries')

region_country_counts

Region,number_of_countries
Australia and New Zealand,2
Central and Eastern Europe,29
Eastern Asia,6
Latin America and Caribbean,22
Middle East and Northern Africa,20
North America,2
Southeastern Asia,9
Southern Asia,7
Sub-Saharan Africa,40
Western Europe,21


If I filled these regions in manually, most of the affected countries would fall into 'Sub-Saharan Africa', 'Latin America and the Caribbean', and 'Eastern Asia', regions that are already well represented in the data. The missing values should therefore have little effect on the regional analysis. Consequently, I decided to leave them as null rather than fill them in by hand from Wikipedia, even though the information is consistent across sources. For this project, I prefer every step to be handled automatically.
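
For reference, had I chosen to fill the gaps, one way to do it would be a small lookup dictionary applied only where Region is null. This is a sketch of that alternative, not what the notebook does. The `toy` frame and the `manual_regions` dictionary are illustrative, and the labels use the dataset's spelling ('Latin America and Caribbean') so that filled values would group together with existing ones.

```python
import pandas as pd

# Illustrative frame: two countries lack a region.
toy = pd.DataFrame({
    'Country': ['Norway', 'Belize', 'Namibia'],
    'Region': ['Western Europe', None, None],
})

# Hypothetical manual lookup, spelled like the existing Region labels.
manual_regions = {
    'Belize': 'Latin America and Caribbean',
    'Namibia': 'Sub-Saharan Africa',
}

# map() builds a Series of candidate regions; fillna() uses it only for nulls,
# so regions that are already present are never overwritten.
filled = toy.assign(Region=toy['Region'].fillna(toy['Country'].map(manual_regions)))
print(filled)
```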

Displaying the data type of each column

In [286]:
all_happiness_data.dtypes

Country                         object
Region                          object
Year                             int64
Happiness_Rank                   int64
Happiness_Score                float64
Economy_GDP_per_Capita         float64
Family                         float64
Health_Life_Expectancy         float64
Freedom                        float64
Trust_Government_Corruption    float64
Generosity                     float64
dtype: object

Summary statistics for numeric columns

In [287]:
all_happiness_data.describe().transpose()

Unnamed: 0,count,mean,std,min,25%,50%,75%,max
Year,782.0,2016.993606,1.417364,2015.0,2016.0,2017.0,2018.0,2019.0
Happiness_Rank,782.0,78.69821,45.182384,1.0,40.0,79.0,118.0,158.0
Happiness_Score,782.0,5.379018,1.127456,2.693,4.50975,5.322,6.1895,7.769
Economy_GDP_per_Capita,782.0,0.916047,0.40734,0.0,0.6065,0.982205,1.236187,2.096
Family,782.0,1.078392,0.329548,0.0,0.869363,1.124735,1.32725,1.644
Health_Life_Expectancy,782.0,0.612416,0.248309,0.0,0.440183,0.64731,0.808,1.141
Freedom,782.0,0.411091,0.15288,0.0,0.309768,0.431,0.531,0.724
Trust_Government_Corruption,781.0,0.125436,0.105816,0.0,0.054,0.091,0.15603,0.55191
Generosity,782.0,0.218576,0.122321,0.0,0.13,0.201982,0.278832,0.838075


In order to incorporate this dataset into my storytelling Jupyter notebook, I am saving it locally and will subsequently read from it for analysis.

In [288]:
all_happiness_data.to_csv('../data/all_happiness_data.csv', index=False)
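
Before relying on the saved file in another notebook, a quick round-trip check can confirm that it reads back with the same shape and the same missing values. The sketch below writes to an in-memory buffer as a stand-in for '../data/all_happiness_data.csv', and `toy` is a made-up frame rather than the real data.

```python
import io

import pandas as pd

# Made-up frame with one missing region, mimicking the real layout.
toy = pd.DataFrame({
    'Country': ['Norway', 'Belize'],
    'Region': ['Western Europe', None],
    'Year': [2015, 2016],
    'Happiness_Score': [7.5, 5.9],
})

# In-memory stand-in for the CSV file on disk.
buffer = io.StringIO()
toy.to_csv(buffer, index=False)
buffer.seek(0)
reloaded = pd.read_csv(buffer)

# Same columns, same shape, and nulls survive the round trip as NaN.
assert list(reloaded.columns) == list(toy.columns)
assert reloaded.shape == toy.shape
assert reloaded['Region'].isna().sum() == toy['Region'].isna().sum()
```

Writing with `index=False`, as the cell above does, keeps the row index out of the file, so no extra unnamed column appears on reload.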
