# COGS 108 - Data Checkpoint

# Names

- Prabhjyot Sodhi
- Sahithi Karumudi
- Yash Patki
- Joy Yue Lam (Joyce)
- David Boateng


<a id='research_question'></a>
# Research Question

What is the relationship between Family/Social Support, GDP, Health (Life Expectancy), and Generosity in determining the overall Happiness of a country? Which factors have the strongest correlation with Happiness, and how do they interrelate? 

# Hypothesis

We hypothesize that GDP and Family/Social Support will show stronger positive correlations with a country's Happiness score than the other factors.
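To make "strongest correlation" concrete, here is a minimal sketch of how the factors could be ranked by their Pearson correlation with the Happiness score. The helper name `rank_factor_correlations` is hypothetical, and the sample is only the first five rows of the 2015 table shown later in this notebook, so the numbers it produces are illustrative and not a result.

```python
import pandas as pd

# First five rows of the cleaned 2015 table (see df_15.head() further down),
# used purely as a small stand-in for a full yearly dataframe.
sample = pd.DataFrame({
    'Happiness Score': [7.587, 7.561, 7.527, 7.522, 7.427],
    'GDP': [1.39651, 1.30232, 1.32548, 1.459, 1.32629],
    'Social Support': [1.34951, 1.40223, 1.36058, 1.33095, 1.32261],
    'Life Expectancy': [0.94143, 0.94784, 0.87464, 0.88521, 0.90563],
    'Generosity': [0.29678, 0.4363, 0.34139, 0.34699, 0.45811],
})


def rank_factor_correlations(df, target='Happiness Score'):
    """Pearson correlation of each numeric factor with `target`,
    ordered from strongest to weakest by absolute value."""
    corr = df.select_dtypes(include='number').corr()[target].drop(target)
    order = corr.abs().sort_values(ascending=False).index
    return corr.reindex(order)


factor_corr = rank_factor_correlations(sample)
print(factor_corr)
```

Sorting by absolute value keeps strongly negative relationships near the top, while the sign of each entry still shows the direction of the relationship.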

# Dataset(s)

We will be using data that has been collected by the World Happiness Report. The data includes information that has been collected from 2015 to 2023. 

Across the 2015-2023 editions of the World Happiness Report, each yearly file covers between 137 and 158 countries and has between 9 and 20 columns, for 1,370 country-year rows in total. The 2015 file, for example, includes the columns country, region, happiness rank, happiness score, standard error, economy (GDP per capita), family (social support), health (life expectancy), freedom, trust (government corruption), generosity, and dystopia residual. The columns change slightly between editions, but we plan to use these six factors: economy, freedom, trust, generosity, health, and family, along with each country and its happiness score. After 2017, the feature “family” was renamed “social support” (Helliwell et al., 2017).

Because some constructs were renamed over the years (e.g. ‘Family’ became ‘Social Support’), we treat such columns as the same variable, since they measure essentially the same factor. We only analyze countries that have data for every year in the range (2015-2023), so that we can compare countries consistently and follow trends over time. Our analysis relates GDP, Health (Life Expectancy), Family/Social Support, Generosity, Trust, and Freedom to how individuals in each country evaluated their overall happiness.

## Datasets

- Dataset Name: World Happiness Report 2015
- Link to the dataset: https://www.kaggle.com/datasets/mathurinache/world-happiness-report?select=2015.csv
- Number of observations: 158 Countries and 12 Columns

---

- Dataset Name: World Happiness Report 2016
- Link to the dataset: https://www.kaggle.com/datasets/mathurinache/world-happiness-report?select=2016.csv
- Number of observations: 157 Countries and 13 Columns

---

- Dataset Name: World Happiness Report 2017
- Link to the dataset: https://www.kaggle.com/datasets/mathurinache/world-happiness-report?select=2017.csv
- Number of observations: 155 Countries and 12 Columns

---

- Dataset Name: World Happiness Report 2018
- Link to the dataset: https://www.kaggle.com/datasets/mathurinache/world-happiness-report?select=2018.csv
- Number of observations: 156 Countries and 9 Columns

---

- Dataset Name: World Happiness Report 2019
- Link to the dataset: https://www.kaggle.com/datasets/mathurinache/world-happiness-report?select=2019.csv
- Number of observations: 156 Countries and 9 Columns

---

- Dataset Name: World Happiness Report 2020
- Link to the dataset: https://www.kaggle.com/datasets/mathurinache/world-happiness-report?select=2020.csv
- Number of observations: 153 Countries and 20 Columns

---

- Dataset Name: World Happiness Report 2021
- Link to the dataset: https://www.kaggle.com/datasets/mathurinache/world-happiness-report?select=2021.csv
- Number of observations: 149 Countries and 20 Columns

---

- Dataset Name: World Happiness Report 2022
- Link to the dataset: https://www.kaggle.com/datasets/mathurinache/world-happiness-report?select=2022.csv
- Number of observations: 149 Countries and 20 Columns

---

- Dataset Name: World Happiness Report 2023
- Link to the dataset: https://www.kaggle.com/datasets/ajaypalsinghlo/world-happiness-report-2023
- Number of observations: 137 Countries and 19 Columns



Since we have multiple datasets, we plan to standardize the column names and remove unnecessary columns from all of them. We do not need to combine the datasets to compare countries within a year, since each dataset represents a single year. To analyze trends for specific countries over time, however, we need to combine each country's rows across the yearly datasets.
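Combining the yearly datasets for trend analysis could look like the sketch below, which stacks the per-year dataframes into one long dataframe with an added `Year` column. The function name `combine_years` and the `Year` column are assumptions for illustration; the two-row frames reuse scores from the 2015 and 2016 tables shown later in this notebook.

```python
import pandas as pd


def combine_years(frames_by_year):
    """Stack per-year dataframes into one long dataframe,
    tagging every row with the year it came from."""
    tagged = [df.assign(Year=year) for year, df in frames_by_year.items()]
    return pd.concat(tagged, ignore_index=True)


# Tiny stand-ins for the cleaned yearly dataframes.
frames = {
    2015: pd.DataFrame({'Country or Region': ['Switzerland', 'Iceland'],
                        'Happiness Score': [7.587, 7.561]}),
    2016: pd.DataFrame({'Country or Region': ['Switzerland', 'Iceland'],
                        'Happiness Score': [7.509, 7.501]}),
}

combined = combine_years(frames)

# One row per year for a single country, ready for a trend plot.
switzerland = (combined[combined['Country or Region'] == 'Switzerland']
               .sort_values('Year'))
print(switzerland)
```

Keeping the year as a column, instead of in separate variables, lets the same dataframe serve both within-year comparisons (filter on `Year`) and per-country trends (filter on the country).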

# Setup

In [97]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Data Cleaning

We are using the official World Happiness Report datasets posted on Kaggle, so the data were already relatively clean and organized.

We use 9 datasets, one for each year from 2015 to 2023. The World Happiness Score is determined using 6 primary factors: GDP, Family, Life Expectancy (Health), Freedom, Trust in Government, and Generosity. Across the datasets, we remove the additional columns, such as the lower and upper confidence bounds, the dystopia residual scores, and other extraneous variables.

We plan to analyze the data in two ways: finding trends for specific countries between 2015 and 2023, and comparing countries with one another. We therefore only include countries that have the necessary data (all 6 factors) for every year between 2015 and 2023. This will allow us to track the Happiness Score and its relation to the other factors through time, and to compare scores between countries.

In other words, we further clean the data by removing rows for countries that do not appear in every dataset listed above. To get the data into a usable format, we also renamed columns to standardize them across all the datasets and deleted the extra columns mentioned previously. The steps we took to clean the data are shown below.


In [98]:
# Read in the data
df_15 = pd.read_csv('data/2015.csv')
df_16 = pd.read_csv('data/2016.csv')
df_17 = pd.read_csv('data/2017.csv')
df_18 = pd.read_csv('data/2018.csv')
df_19 = pd.read_csv('data/2019.csv')
df_20 = pd.read_csv('data/2020.csv')
df_21 = pd.read_csv('data/2021.csv')
df_22 = pd.read_csv('data/2022.csv')
df_23 = pd.read_csv('data/2023.csv')

# Create a list of the years
years = [2015, 2016, 2017, 2018, 2019, 2020, 2021, 2022, 2023]

# Make a list of the dataframes
dfs = [df_15, df_16, df_17, df_18, df_19, df_20, df_21, df_22, df_23]

# Standardize the column names, hard-coded values for each year
column_mapping = {
    2015: {
    'Happiness Score': 'Happiness Score',
    'Economy (GDP per Capita)': 'GDP',
    'Family': 'Social Support',
    'Health (Life Expectancy)': 'Life Expectancy',
    'Freedom': 'Freedom',
    'Trust (Government Corruption)': 'Trust',
    'Generosity': 'Generosity',
    },
    2016: {
    'Happiness Score': 'Happiness Score',
    'Economy (GDP per Capita)': 'GDP',
    'Family': 'Social Support',
    'Health (Life Expectancy)': 'Life Expectancy',
    'Freedom': 'Freedom',
    'Trust (Government Corruption)': 'Trust',
    'Generosity': 'Generosity',
    },
    2017: {
    'Happiness.Score': 'Happiness Score',
    'Economy..GDP.per.Capita.': 'GDP',
    'Family': 'Social Support',
    'Health..Life.Expectancy.': 'Life Expectancy',
    'Freedom': 'Freedom',
    'Trust..Government.Corruption.': 'Trust',
    'Generosity': 'Generosity',
    },
    2018: {
    'Country or region': 'Country',
    'Score': 'Happiness Score',
    'GDP per capita': 'GDP',
    'Social support': 'Social Support',
    'Healthy life expectancy': 'Life Expectancy',
    'Freedom to make life choices': 'Freedom',
    'Perceptions of corruption': 'Trust',
    'Generosity': 'Generosity',
    },
    2019: {
    'Country or region': 'Country',
    'Score': 'Happiness Score',
    'GDP per capita': 'GDP',
    'Social support': 'Social Support',
    'Healthy life expectancy': 'Life Expectancy',
    'Freedom to make life choices': 'Freedom',
    'Perceptions of corruption': 'Trust',
    'Generosity': 'Generosity',
    },
    2020: {
    'Country name': 'Country',
    'Ladder score': 'Happiness Score',
    'Explained by: Log GDP per capita': 'GDP',
    'Explained by: Social support': 'Social Support',
    'Explained by: Healthy life expectancy': 'Life Expectancy',
    'Explained by: Freedom to make life choices': 'Freedom',
    'Generosity': 'Generosity - 2020',
    'Explained by: Generosity': 'Generosity',
    'Explained by: Perceptions of corruption': 'Trust',
    },
    2021: {
    'Country name': 'Country',
    'Ladder score': 'Happiness Score',
    'Explained by: Log GDP per capita': 'GDP',
    'Explained by: Social support': 'Social Support',
    'Explained by: Healthy life expectancy': 'Life Expectancy',
    'Explained by: Freedom to make life choices': 'Freedom',
    'Generosity': 'Generosity - 2021',
    'Explained by: Generosity': 'Generosity',
    'Explained by: Perceptions of corruption': 'Trust',
    },
    2022: {
    'Happiness score': 'Happiness Score',
    'Explained by: GDP per capita': 'GDP',
    'Explained by: Social support': 'Social Support',
    'Explained by: Healthy life expectancy': 'Life Expectancy',
    'Explained by: Freedom to make life choices': 'Freedom',
    'Explained by: Generosity': 'Generosity',
    'Explained by: Perceptions of corruption': 'Trust',
    },
    2023: {
    'Country name': 'Country',
    'Ladder score': 'Happiness Score',
    'Explained by: Log GDP per capita': 'GDP',
    'Explained by: Social support': 'Social Support',
    'Explained by: Healthy life expectancy': 'Life Expectancy',
    'Explained by: Freedom to make life choices': 'Freedom',
    'Generosity': 'Generosity - 2023',
    'Explained by: Generosity': 'Generosity',
    'Explained by: Perceptions of corruption': 'Trust',
    }
}

# Rename the columns for each dataframe
for year, df in zip(years, dfs):
    df.rename(columns=column_mapping[year], inplace=True)

# Keep only the columns needed for the analysis (applies to every year)
columns_to_keep = ['Country', 'Happiness Score', 'GDP', 'Social Support', 'Life Expectancy', 'Freedom', 'Trust', 'Generosity']

for df in dfs:
    extra_columns = [column for column in df.columns if column not in columns_to_keep]
    df.drop(columns=extra_columns, inplace=True)

# Change country column name
for df in dfs:
    df.rename(columns={'Country': 'Country or Region'}, inplace=True)

# Drop the rows that have NaN values
for df in dfs:
    df.dropna(inplace=True)

# Collect the set of countries present in each dataframe
country_sets = [set(df['Country or Region'].unique()) for df in dfs]

# Find the countries that are in all of the dataframes
countries_in_all = set.intersection(*country_sets)

# Keep only the countries that are in all of the dataframes
for idx, df in enumerate(dfs):
    dfs[idx] = df[df['Country or Region'].isin(countries_in_all)].copy()

# Rebind the per-year names so that df_15 ... df_23 refer to the filtered dataframes
df_15, df_16, df_17, df_18, df_19, df_20, df_21, df_22, df_23 = dfs


In [99]:
# Print the shape of each dataframe
for year, df in zip(years, dfs):
    print(f'{year} Dataframe')
    print(df.shape)

2015 Dataframe
(115, 8)
2016 Dataframe
(115, 8)
2017 Dataframe
(115, 8)
2018 Dataframe
(115, 8)
2019 Dataframe
(115, 8)
2020 Dataframe
(115, 8)
2021 Dataframe
(115, 8)
2022 Dataframe
(115, 8)
2023 Dataframe
(115, 8)


Each dataframe now has the same shape, 115 countries and 8 columns, because we kept only the countries that appear in every year. Let's look at the head of each dataframe below as a spot check.

In [100]:
df_15.head()

| | Country or Region | Happiness Score | GDP | Social Support | Life Expectancy | Freedom | Trust | Generosity |
|---|---|---|---|---|---|---|---|---|
| 0 | Switzerland | 7.587 | 1.39651 | 1.34951 | 0.94143 | 0.66557 | 0.41978 | 0.29678 |
| 1 | Iceland | 7.561 | 1.30232 | 1.40223 | 0.94784 | 0.62877 | 0.14145 | 0.4363 |
| 2 | Denmark | 7.527 | 1.32548 | 1.36058 | 0.87464 | 0.64938 | 0.48357 | 0.34139 |
| 3 | Norway | 7.522 | 1.459 | 1.33095 | 0.88521 | 0.66973 | 0.36503 | 0.34699 |
| 4 | Canada | 7.427 | 1.32629 | 1.32261 | 0.90563 | 0.63297 | 0.32957 | 0.45811 |


In [101]:
df_16.head()

| | Country or Region | Happiness Score | GDP | Social Support | Life Expectancy | Freedom | Trust | Generosity |
|---|---|---|---|---|---|---|---|---|
| 0 | Denmark | 7.526 | 1.44178 | 1.16374 | 0.79504 | 0.57941 | 0.44453 | 0.36171 |
| 1 | Switzerland | 7.509 | 1.52733 | 1.14524 | 0.86303 | 0.58557 | 0.41203 | 0.28083 |
| 2 | Iceland | 7.501 | 1.42666 | 1.18326 | 0.86733 | 0.56624 | 0.14975 | 0.47678 |
| 3 | Norway | 7.498 | 1.57744 | 1.1269 | 0.79579 | 0.59609 | 0.35776 | 0.37895 |
| 4 | Finland | 7.413 | 1.40598 | 1.13464 | 0.81091 | 0.57104 | 0.41004 | 0.25492 |


In [102]:
df_17.head()

| | Country or Region | Happiness Score | GDP | Social Support | Life Expectancy | Freedom | Generosity | Trust |
|---|---|---|---|---|---|---|---|---|
| 0 | Norway | 7.537 | 1.616463 | 1.533524 | 0.796667 | 0.635423 | 0.362012 | 0.315964 |
| 1 | Denmark | 7.522 | 1.482383 | 1.551122 | 0.792566 | 0.626007 | 0.35528 | 0.40077 |
| 2 | Iceland | 7.504 | 1.480633 | 1.610574 | 0.833552 | 0.627163 | 0.47554 | 0.153527 |
| 3 | Switzerland | 7.494 | 1.56498 | 1.516912 | 0.858131 | 0.620071 | 0.290549 | 0.367007 |
| 4 | Finland | 7.469 | 1.443572 | 1.540247 | 0.809158 | 0.617951 | 0.245483 | 0.382612 |


In [103]:
df_18.head()

| | Country or Region | Happiness Score | GDP | Social Support | Life Expectancy | Freedom | Generosity | Trust |
|---|---|---|---|---|---|---|---|---|
| 0 | Finland | 7.632 | 1.305 | 1.592 | 0.874 | 0.681 | 0.202 | 0.393 |
| 1 | Norway | 7.594 | 1.456 | 1.582 | 0.861 | 0.686 | 0.286 | 0.34 |
| 2 | Denmark | 7.555 | 1.351 | 1.59 | 0.868 | 0.683 | 0.284 | 0.408 |
| 3 | Iceland | 7.495 | 1.343 | 1.644 | 0.914 | 0.677 | 0.353 | 0.138 |
| 4 | Switzerland | 7.487 | 1.42 | 1.549 | 0.927 | 0.66 | 0.256 | 0.357 |


In [104]:
df_19.head()

| | Country or Region | Happiness Score | GDP | Social Support | Life Expectancy | Freedom | Generosity | Trust |
|---|---|---|---|---|---|---|---|---|
| 0 | Finland | 7.769 | 1.34 | 1.587 | 0.986 | 0.596 | 0.153 | 0.393 |
| 1 | Denmark | 7.6 | 1.383 | 1.573 | 0.996 | 0.592 | 0.252 | 0.41 |
| 2 | Norway | 7.554 | 1.488 | 1.582 | 1.028 | 0.603 | 0.271 | 0.341 |
| 3 | Iceland | 7.494 | 1.38 | 1.624 | 1.026 | 0.591 | 0.354 | 0.118 |
| 4 | Netherlands | 7.488 | 1.396 | 1.522 | 0.999 | 0.557 | 0.322 | 0.298 |


In [105]:
df_20.head()

| | Country or Region | Happiness Score | GDP | Social Support | Life Expectancy | Freedom | Generosity | Trust |
|---|---|---|---|---|---|---|---|---|
| 0 | Finland | 7.8087 | 1.28519 | 1.499526 | 0.961271 | 0.662317 | 0.15967 | 0.477857 |
| 1 | Denmark | 7.6456 | 1.326949 | 1.503449 | 0.979333 | 0.66504 | 0.242793 | 0.49526 |
| 2 | Switzerland | 7.5599 | 1.390774 | 1.472403 | 1.040533 | 0.628954 | 0.269056 | 0.407946 |
| 3 | Iceland | 7.5045 | 1.326502 | 1.547567 | 1.000843 | 0.661981 | 0.36233 | 0.144541 |
| 4 | Norway | 7.488 | 1.424207 | 1.495173 | 1.008072 | 0.670201 | 0.287985 | 0.434101 |


In [106]:
df_21.head()

| | Country or Region | Happiness Score | GDP | Social Support | Life Expectancy | Freedom | Generosity | Trust |
|---|---|---|---|---|---|---|---|---|
| 0 | Finland | 7.842 | 1.446 | 1.106 | 0.741 | 0.691 | 0.124 | 0.481 |
| 1 | Denmark | 7.62 | 1.502 | 1.108 | 0.763 | 0.686 | 0.208 | 0.485 |
| 2 | Switzerland | 7.571 | 1.566 | 1.079 | 0.816 | 0.653 | 0.204 | 0.413 |
| 3 | Iceland | 7.554 | 1.482 | 1.172 | 0.772 | 0.698 | 0.293 | 0.17 |
| 4 | Netherlands | 7.464 | 1.501 | 1.079 | 0.753 | 0.647 | 0.302 | 0.384 |


In [107]:
df_22.head()

| | Country or Region | Happiness Score | GDP | Social Support | Life Expectancy | Freedom | Generosity | Trust |
|---|---|---|---|---|---|---|---|---|
| 0 | Finland | 7821 | 1892 | 1258 | 775 | 736 | 109 | 534 |
| 1 | Denmark | 7636 | 1953 | 1243 | 777 | 719 | 188 | 532 |
| 2 | Iceland | 7557 | 1936 | 1320 | 803 | 718 | 270 | 191 |
| 3 | Switzerland | 7512 | 2026 | 1226 | 822 | 677 | 147 | 461 |
| 4 | Netherlands | 7415 | 1945 | 1206 | 787 | 651 | 271 | 419 |
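The 2022 values above are about a thousand times larger than in the other years (for example 7821 against roughly 7.8 for Finland), which suggests the decimal point was lost somewhere. One possible cause, stated here as an assumption and not confirmed against the raw file, is that `2022.csv` stores numbers with a comma as the decimal separator (e.g. `"7,821"`). Under that assumption, a minimal sketch of a fix is shown below; the helper name `decimal_commas_to_float` and the sample strings are hypothetical.

```python
import pandas as pd


def decimal_commas_to_float(df, columns):
    """Return a copy of `df` in which string values such as '7,821'
    in the given columns are converted to floats such as 7.821."""
    out = df.copy()
    for col in columns:
        out[col] = (out[col].astype(str)
                    .str.replace(',', '.', regex=False)
                    .astype(float))
    return out


# Assumed shape of the raw 2022 values: numbers stored as strings
# with a decimal comma.
raw_22 = pd.DataFrame({
    'Country or Region': ['Finland', 'Denmark'],
    'Happiness Score': ['7,821', '7,636'],
    'GDP': ['1,892', '1,953'],
})

fixed_22 = decimal_commas_to_float(raw_22, ['Happiness Score', 'GDP'])
print(fixed_22)
```

If the assumption holds, an alternative is to pass `decimal=','` to `pd.read_csv` when loading the 2022 file, so the columns are parsed as floats from the start.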


In [108]:
df_23.head()

| | Country or Region | Happiness Score | GDP | Social Support | Life Expectancy | Freedom | Generosity | Trust |
|---|---|---|---|---|---|---|---|---|
| 0 | Finland | 7.804 | 1.888 | 1.585 | 0.535 | 0.772 | 0.126 | 0.535 |
| 1 | Denmark | 7.586 | 1.949 | 1.548 | 0.537 | 0.734 | 0.208 | 0.525 |
| 2 | Iceland | 7.53 | 1.926 | 1.62 | 0.559 | 0.738 | 0.25 | 0.187 |
| 3 | Israel | 7.473 | 1.833 | 1.521 | 0.577 | 0.569 | 0.124 | 0.158 |
| 4 | Netherlands | 7.403 | 1.942 | 1.488 | 0.535 | 0.672 | 0.251 | 0.394 |
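Looking at the first rows is only a spot check, and matching shapes alone do not guarantee matching contents. A programmatic check that every yearly dataframe holds the same countries and the same columns could be sketched as follows. The helper name `check_consistency` is hypothetical, and the small frames reuse a few GDP values from the tables above.

```python
import pandas as pd


def check_consistency(frames, key='Country or Region'):
    """Return (same_countries, same_columns) for a list of dataframes.
    Column order and row order are ignored."""
    country_sets = [set(df[key]) for df in frames]
    column_sets = [set(df.columns) for df in frames]
    same_countries = all(s == country_sets[0] for s in country_sets)
    same_columns = all(s == column_sets[0] for s in column_sets)
    return same_countries, same_columns


# Small stand-ins: `a` and `b` share countries (in different row and
# column order), while `c` contains a country the others lack.
a = pd.DataFrame({'Country or Region': ['Finland', 'Denmark'],
                  'GDP': [1.34, 1.383]})
b = pd.DataFrame({'GDP': [1.326949, 1.28519],
                  'Country or Region': ['Denmark', 'Finland']})
c = pd.DataFrame({'Country or Region': ['Finland', 'Israel'],
                  'GDP': [1.888, 1.833]})

print(check_consistency([a, b]))  # (True, True)
print(check_consistency([a, c]))  # (False, True)
```

Comparing sets and not lists matters here because the column order differs between years (for example, `Trust` and `Generosity` are swapped between the 2015 and 2017 tables).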
