## Analyze the World Happiness Report Dataset

The dataset is available on [Kaggle](https://www.kaggle.com/ajaypalsinghlo/world-happiness-report-2021).

This notebook examines how many countries appear in the World Happiness Report dataset and selects the countries to focus on in later analyses. It also examines year coverage and selects the years that give the cleanest dataset for those countries.

In [1]:
# Dependencies.
import pandas as pd

In [2]:
# Read in the World Happiness Index for country names.
df = pd.read_csv('resources/whr/world-happiness-report.csv')
df = df[['Country name', 'year', 'Life Ladder']]
df.head()

,Country name,year,Life Ladder
0,Afghanistan,2008,3.724
1,Afghanistan,2009,4.402
2,Afghanistan,2010,4.758
3,Afghanistan,2011,3.832
4,Afghanistan,2012,3.783


In [3]:
# Look at the data we hope to predict - year 2021.
df_21 = pd.read_csv('resources/whr/world-happiness-report-2021.csv')
df_21 = df_21[['Country name', 'Ladder score']]
df_21.head()

,Country name,Ladder score
0,Finland,7.842
1,Denmark,7.62
2,Switzerland,7.571
3,Iceland,7.554
4,Netherlands,7.464


In [4]:
# Print full list of country appearances.
print(df['Country name'].value_counts().to_string())

Ghana                        15
Russia                       15
Georgia                      15
Uganda                       15
Kyrgyzstan                   15
Thailand                     15
Kazakhstan                   15
Spain                        15
United Kingdom               15
Mexico                       15
Saudi Arabia                 15
Cameroon                     15
Denmark                      15
India                        15
Uruguay                      15
Lithuania                    15
Sweden                       15
Turkey                       15
Tanzania                     15
Italy                        15
Cambodia                     15
Germany                      15
Canada                       15
Bangladesh                   15
United States                15
Tajikistan                   15
Israel                       15
Ukraine                      15
China                        15
Kenya                        15
El Salvador                  15
Venezuel

In [5]:
# Check country counts in each dataset.
print(df['Country name'].nunique())
print(len(df_21['Country name']))

166
149


In [6]:
# Find the countries missing from 2021 - we will probably drop these.
countries = df['Country name'].unique().tolist()
countries_21 = df_21['Country name'].tolist()

missing = [country for country in countries if country not in countries_21]
        
missing

['Angola',
 'Belize',
 'Bhutan',
 'Central African Republic',
 'Congo (Kinshasa)',
 'Cuba',
 'Djibouti',
 'Guyana',
 'Oman',
 'Qatar',
 'Somalia',
 'Somaliland region',
 'South Sudan',
 'Sudan',
 'Suriname',
 'Syria',
 'Trinidad and Tobago']
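The same comparison can be sketched with set operations, which also make the reverse check easy: are there countries that appear only in 2021? The counts above suggest there are none, since 166 - 17 = 149 (assuming the 2021 file has no duplicate names). The lists below are hypothetical stand-ins for the two `Country name` columns, not the real data.

```python
# Hypothetical stand-ins for df['Country name'] and df_21['Country name'].
historical = ["Angola", "Finland", "Denmark", "Cuba", "Finland"]
latest = ["Finland", "Denmark", "Iceland"]

# Countries in the historical panel with no 2021 row (the `missing` list above).
missing_from_2021 = sorted(set(historical) - set(latest))

# The reverse check: countries that appear only in the 2021 file.
only_in_2021 = sorted(set(latest) - set(historical))

print(missing_from_2021)
print(only_in_2021)
```

Sorting the result keeps the output stable, since sets have no guaranteed order.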

In [7]:
# Pivot table to years and life ladder scores.
df_pv = df.pivot(index='Country name', columns='year', values='Life Ladder')
df_pv

Country name,2005,2006,2007,2008,2009,2010,2011,2012,2013,2014,2015,2016,2017,2018,2019,2020
Afghanistan,,,,3.724,4.402,4.758,3.832,3.783,3.572,3.131,3.983,4.220,2.662,2.694,2.375,
Albania,,,4.634,,5.485,5.269,5.867,5.510,4.551,4.814,4.607,4.511,4.640,5.004,4.995,5.365
Algeria,,,,,,5.464,5.317,5.605,,6.355,,5.341,5.249,5.043,4.745,
Angola,,,,,,,5.589,4.360,3.937,3.795,,,,,,
Argentina,,6.313,6.073,5.961,6.424,6.441,6.776,6.468,6.582,6.671,6.697,6.427,6.039,5.793,6.086,5.901
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
Venezuela,7.17,6.525,,6.258,7.189,7.478,6.580,7.067,6.553,6.136,5.569,4.041,5.071,5.006,5.081,4.574
Vietnam,,5.294,5.422,5.480,5.304,5.296,5.767,5.535,5.023,5.085,5.076,5.062,5.175,5.296,5.467,
Yemen,,,4.477,,4.809,4.350,3.746,4.061,4.218,3.968,2.983,3.826,3.254,3.058,4.197,
Zambia,,4.824,3.998,4.730,5.260,,4.999,5.013,5.244,4.346,4.843,4.348,3.933,4.041,3.307,4.838


In [8]:
# Drop the countries missing in 2021 dataset.
df_pv = df_pv.drop(labels=missing, axis=0)
df_pv

Country name,2005,2006,2007,2008,2009,2010,2011,2012,2013,2014,2015,2016,2017,2018,2019,2020
Afghanistan,,,,3.724,4.402,4.758,3.832,3.783,3.572,3.131,3.983,4.220,2.662,2.694,2.375,
Albania,,,4.634,,5.485,5.269,5.867,5.510,4.551,4.814,4.607,4.511,4.640,5.004,4.995,5.365
Algeria,,,,,,5.464,5.317,5.605,,6.355,,5.341,5.249,5.043,4.745,
Argentina,,6.313,6.073,5.961,6.424,6.441,6.776,6.468,6.582,6.671,6.697,6.427,6.039,5.793,6.086,5.901
Armenia,,4.289,4.882,4.652,4.178,4.368,4.260,4.320,4.277,4.453,4.348,4.325,4.288,5.062,5.488,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
Venezuela,7.17,6.525,,6.258,7.189,7.478,6.580,7.067,6.553,6.136,5.569,4.041,5.071,5.006,5.081,4.574
Vietnam,,5.294,5.422,5.480,5.304,5.296,5.767,5.535,5.023,5.085,5.076,5.062,5.175,5.296,5.467,
Yemen,,,4.477,,4.809,4.350,3.746,4.061,4.218,3.968,2.983,3.826,3.254,3.058,4.197,
Zambia,,4.824,3.998,4.730,5.260,,4.999,5.013,5.244,4.346,4.843,4.348,3.933,4.041,3.307,4.838


In [9]:
# Check quantity of countries in each year.
df_pv.count()

year
2005     27
2006     87
2007     99
2008    107
2009    108
2010    118
2011    136
2012    135
2013    132
2014    138
2015    137
2016    138
2017    143
2018    142
2019    144
2020     95
dtype: int64

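Based on the counts above, 2010-2019 looks like the best window: every year in that range covers at least 118 countries, while the earlier years cover fewer (27 in 2005, rising to 108 in 2009) and 2020 drops to 95. We keep only the countries that have a value in every one of those years.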
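One way to compare candidate windows is to count how many countries have a complete record in each. The sketch below does this on a small made-up pivot with the same shape as `df_pv`; the helper `complete_countries` and the toy values are illustrative assumptions, not part of the original analysis.

```python
import pandas as pd

# Toy pivot in the same shape as df_pv (countries x years); values are made up.
toy_pv = pd.DataFrame(
    {
        2009: [5.1, None, 6.0, 4.2],
        2010: [5.2, 4.8, 6.1, 4.3],
        2011: [5.0, 4.9, 6.3, None],
        2012: [5.3, 4.7, 6.2, 4.5],
    },
    index=["A", "B", "C", "D"],
)


def complete_countries(pivot, first_year, last_year):
    """Count rows with no missing value in first_year..last_year (inclusive)."""
    # .loc slices by label, so both endpoints are included.
    window = pivot.loc[:, first_year:last_year]
    return int(window.notna().all(axis=1).sum())


# Compare a few candidate windows (hypothetical choices).
for first, last in [(2009, 2012), (2010, 2012), (2010, 2011)]:
    print(first, last, complete_countries(toy_pv, first, last))
```

A longer window gives more history per country but usually fewer complete countries, so the choice is a trade-off rather than a single correct answer.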

In [10]:
# Get DataFrame with just these columns.
short_df = df_pv[[2010, 2011, 2012, 2013, 2014, 2015, 2016, 2017, 2018, 2019]]
short_df.head()

Country name,2010,2011,2012,2013,2014,2015,2016,2017,2018,2019
Afghanistan,4.758,3.832,3.783,3.572,3.131,3.983,4.22,2.662,2.694,2.375
Albania,5.269,5.867,5.51,4.551,4.814,4.607,4.511,4.64,5.004,4.995
Algeria,5.464,5.317,5.605,,6.355,,5.341,5.249,5.043,4.745
Argentina,6.441,6.776,6.468,6.582,6.671,6.697,6.427,6.039,5.793,6.086
Armenia,4.368,4.26,4.32,4.277,4.453,4.348,4.325,4.288,5.062,5.488


In [11]:
# Drop all rows that contain any null values.
short_df = short_df.dropna()
short_df

Country name,2010,2011,2012,2013,2014,2015,2016,2017,2018,2019
Afghanistan,4.758,3.832,3.783,3.572,3.131,3.983,4.220,2.662,2.694,2.375
Albania,5.269,5.867,5.510,4.551,4.814,4.607,4.511,4.640,5.004,4.995
Argentina,6.441,6.776,6.468,6.582,6.671,6.697,6.427,6.039,5.793,6.086
Armenia,4.368,4.260,4.320,4.277,4.453,4.348,4.325,4.288,5.062,5.488
Australia,7.450,7.406,7.196,7.364,7.289,7.309,7.250,7.257,7.177,7.234
...,...,...,...,...,...,...,...,...,...,...
Uzbekistan,5.095,5.739,6.019,5.940,6.049,5.972,5.893,6.421,6.205,6.154
Venezuela,7.478,6.580,7.067,6.553,6.136,5.569,4.041,5.071,5.006,5.081
Vietnam,5.296,5.767,5.535,5.023,5.085,5.076,5.062,5.175,5.296,5.467
Yemen,4.350,3.746,4.061,4.218,3.968,2.983,3.826,3.254,3.058,4.197


In [12]:
# List the final country list.
main_countries = short_df.index.tolist()
main_countries

['Afghanistan',
 'Albania',
 'Argentina',
 'Armenia',
 'Australia',
 'Austria',
 'Azerbaijan',
 'Bangladesh',
 'Belarus',
 'Belgium',
 'Bolivia',
 'Bosnia and Herzegovina',
 'Botswana',
 'Brazil',
 'Bulgaria',
 'Burkina Faso',
 'Cambodia',
 'Cameroon',
 'Canada',
 'Chad',
 'Chile',
 'China',
 'Colombia',
 'Costa Rica',
 'Croatia',
 'Cyprus',
 'Denmark',
 'Dominican Republic',
 'Ecuador',
 'Egypt',
 'El Salvador',
 'Finland',
 'France',
 'Georgia',
 'Germany',
 'Ghana',
 'Greece',
 'Guatemala',
 'Honduras',
 'Hungary',
 'India',
 'Indonesia',
 'Ireland',
 'Israel',
 'Italy',
 'Japan',
 'Jordan',
 'Kazakhstan',
 'Kenya',
 'Kosovo',
 'Kyrgyzstan',
 'Lebanon',
 'Lithuania',
 'Luxembourg',
 'Mali',
 'Malta',
 'Mauritania',
 'Mexico',
 'Moldova',
 'Mongolia',
 'Montenegro',
 'Nepal',
 'Netherlands',
 'New Zealand',
 'Nicaragua',
 'Niger',
 'North Macedonia',
 'Pakistan',
 'Palestinian Territories',
 'Panama',
 'Peru',
 'Philippines',
 'Poland',
 'Portugal',
 'Romania',
 'Russia',
 'Saudi Ara

In [13]:
# Check which countries don't exist in the 2020 column.
countries2020 = df_pv[2020].dropna().index.tolist()

for country in main_countries:
    if country not in countries2020:
        print(country)

Afghanistan
Armenia
Azerbaijan
Belarus
Botswana
Burkina Faso
Chad
Costa Rica
Guatemala
Honduras
Indonesia
Lebanon
Luxembourg
Mali
Mauritania
Nepal
Nicaragua
Niger
Pakistan
Palestinian Territories
Panama
Peru
Romania
Senegal
Uzbekistan
Vietnam
Yemen


In [14]:
# Check which countries don't exist in the 2021 dataset.
countries2021 = df_21['Country name'].tolist()

for country in main_countries:
    if country not in countries2021:
        print(country)

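Nothing is printed, so every selected country appears in the 2021 dataset. This is expected, because the countries absent from 2021 were already dropped in cell 8. The 2021 scores can therefore serve as the test target for our model.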
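An empty output is easy to misread, so the check could also be written to fail loudly when a country is missing. This is a minimal sketch with hypothetical stand-ins for `main_countries` and `countries2021`.

```python
# Hypothetical stand-ins for main_countries and countries2021.
selected = ["Finland", "Denmark", "Yemen"]
available_2021 = ["Finland", "Denmark", "Iceland", "Yemen"]

# Build the lookup set once so each membership test is cheap.
available_set = set(available_2021)

# Selected countries with no 2021 row; an empty list means full coverage.
uncovered = [name for name in selected if name not in available_set]

assert not uncovered, f"Missing from 2021: {uncovered}"
print(f"All {len(selected)} selected countries have a 2021 score.")
```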

In [15]:
# Prepare 2021 data for join.
df_21 = df_21.set_index('Country name').rename(columns={'Ladder score': 2021})
df_21

Country name,2021
Finland,7.842
Denmark,7.620
Switzerland,7.571
Iceland,7.554
Netherlands,7.464
...,...
Lesotho,3.512
Botswana,3.467
Rwanda,3.415
Zimbabwe,3.145


In [16]:
# Combine the DataFrames and save the table.
full_df = short_df.join(df_21)
full_df.to_csv('countries_happiness_2010_2019_2021')
full_df.head()

Country name,2010,2011,2012,2013,2014,2015,2016,2017,2018,2019,2021
Afghanistan,4.758,3.832,3.783,3.572,3.131,3.983,4.22,2.662,2.694,2.375,2.523
Albania,5.269,5.867,5.51,4.551,4.814,4.607,4.511,4.64,5.004,4.995,5.117
Argentina,6.441,6.776,6.468,6.582,6.671,6.697,6.427,6.039,5.793,6.086,5.929
Armenia,4.368,4.26,4.32,4.277,4.453,4.348,4.325,4.288,5.062,5.488,5.283
Australia,7.45,7.406,7.196,7.364,7.289,7.309,7.25,7.257,7.177,7.234,7.183
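The join relies on both frames sharing the `Country name` index. The sketch below, on made-up toy frames standing in for `short_df` and `df_21`, shows how the result could be verified: `DataFrame.join` defaults to a left join, so extra 2021 countries are dropped and any selected country without a 2021 score would show up as a null.

```python
import pandas as pd

# Toy versions of short_df and df_21; the values are made up.
history = pd.DataFrame(
    {2018: [7.1, 5.0], 2019: [7.2, 5.1]},
    index=pd.Index(["A", "B"], name="Country name"),
)
target = pd.DataFrame(
    {2021: [5.3, 7.0, 6.4]},
    index=pd.Index(["B", "A", "C"], name="Country name"),
)

# join aligns on the index and keeps the left frame's rows and order,
# so the extra country "C" does not appear in the result.
combined = history.join(target)

# Every kept country should have a 2021 value.
assert combined[2021].notna().all()
print(combined)
```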
