**EVOLUTION OF THE HAPPINESS SCORE AND ITS COMPONENTS**

**Source of information** 
- <a href = 'https://www.kaggle.com/datasets/ajaypalsinghlo/world-happiness-report-2021'> world happiness report 2021</a>
- <a href = 'https://www.kaggle.com/datasets/londeen/world-happiness-report-2020'> world happiness report_2020</a>
- <a href = 'https://www.kaggle.com/datasets/unsdsn/world-happiness'> world happiness report 2019-2017</a>

**Purpose** 
- visualize dynamics of happiness score and its components like GDP, social support, life expectancy, generosity, perception of corruption and freedom of choice. The analysis covers all countries for the last 5 years
- visualize the top achievers amoung countries 

**Steps**
- loading datasets from kaggle.com
- normalizing, creating rank attributes, concatenating data
- Analizing primary data 
- visualizing data mart in Tableau Public

# Loading datasets from Kaggle

We loaded datasets in csv format with names year.csv where year - year of report

In [1]:
#import libraries
import pandas as pd
import numpy as np

Lets read all csv and add them to list. In additional for each dataframe we created column `year`

In [2]:
#reading csv, adding df to list and getting names of columns for each dataframe
lst = []
for year in range(2017, 2022):
    df = pd.read_csv(f"{year}.csv")
    df['year'] = year
    lst.append(df)
    print(year)
    print(df.columns.to_list())

2017
['Country', 'Happiness.Rank', 'Happiness.Score', 'Whisker.high', 'Whisker.low', 'Economy..GDP.per.Capita.', 'Family', 'Health..Life.Expectancy.', 'Freedom', 'Generosity', 'Trust..Government.Corruption.', 'Dystopia.Residual', 'year']
2018
['Overall rank', 'Country or region', 'Score', 'GDP per capita', 'Social support', 'Healthy life expectancy', 'Freedom to make life choices', 'Generosity', 'Perceptions of corruption', 'year']
2019
['Overall rank', 'Country or region', 'Score', 'GDP per capita', 'Social support', 'Healthy life expectancy', 'Freedom to make life choices', 'Generosity', 'Perceptions of corruption', 'year']
2020
['Country name', 'Regional indicator', 'Ladder score', 'Standard error of ladder score', 'upperwhisker', 'lowerwhisker', 'Logged GDP per capita', 'Social support', 'Healthy life expectancy', 'Freedom to make life choices', 'Generosity', 'Perceptions of corruption', 'Ladder score in Dystopia', 'Explained by: Log GDP per capita', 'Explained by: Social support',

As we see data is different in each dataframe - we gonne select columns that we need and rename them for futher dataframes concatenation 

# Normalizing data

We gonna select columns for visualization, rename them the same way for different dataframes and make the same float format. Also we gonna create new attributes - ranks of country for each component of happiness score

In [3]:
#selection of necessary attributes
df2017 = lst[0][['Country', 'Happiness.Score', 'Economy..GDP.per.Capita.', 'Family',
       'Health..Life.Expectancy.', 'Freedom', 'Generosity',
       'Trust..Government.Corruption.', 'year']]
df2018 = lst[1].drop('Overall rank', axis = 1)
df2019 = lst[2].drop('Overall rank', axis = 1)
df2020 = lst[3][['Country name', 'Ladder score','Explained by: Log GDP per capita', 'Explained by: Social support',
       'Explained by: Healthy life expectancy',
       'Explained by: Freedom to make life choices',
       'Explained by: Generosity', 'Explained by: Perceptions of corruption',
        'year']]
df2021 = lst[4][['Country name', 'Ladder score','Explained by: Log GDP per capita', 'Explained by: Social support',
       'Explained by: Healthy life expectancy',
       'Explained by: Freedom to make life choices',
       'Explained by: Generosity', 'Explained by: Perceptions of corruption',
       'year']]

In [4]:
#renaming columns
lst_df = [df2017, df2018, df2019, df2020, df2021]
for df in lst_df:
    df.columns = ['country', 'score', 'gdp per capita', 'social support', 'healthy life expectancy', 
                  'freedom to make choices', 'generosity', 'perceptions of corruption','year']

In [5]:
#creating ranks and standartizing float format 
lst_col = ['score','gdp per capita', 'social support', 'healthy life expectancy', 
                  'freedom to make choices', 'generosity', 'perceptions of corruption']

def get_rank(df, lst_col):
    for col in lst_col:
        df = df.sort_values(by = col, ascending = False).reset_index(drop = 'True').reset_index().rename(columns = {'index':f'rank_{col}'})
        df[f'rank_{col}'] = df[f'rank_{col}'] + 1
        df[col] = round(df[col], 3)
    return df
    
for df in lst_df:
    df = get_rank(df, lst_col)
    display(df.head(2))

Unnamed: 0,rank_perceptions of corruption,rank_generosity,rank_freedom to make choices,rank_healthy life expectancy,rank_social support,rank_gdp per capita,rank_score,country,score,gdp per capita,social support,healthy life expectancy,freedom to make choices,generosity,perceptions of corruption,year
0,1,31,32,1,56,3,26,Singapore,6.572,1.692,1.354,0.949,0.55,0.346,0.464,2017
1,2,64,19,122,128,139,151,Rwanda,3.471,0.369,0.946,0.326,0.582,0.253,0.455,2017


Unnamed: 0,rank_perceptions of corruption,rank_generosity,rank_freedom to make choices,rank_healthy life expectancy,rank_social support,rank_gdp per capita,rank_score,country,score,gdp per capita,social support,healthy life expectancy,freedom to make choices,generosity,perceptions of corruption,year
0,1,30,22,2,44,4,34,Singapore,6.343,1.529,1.451,1.008,0.631,0.261,0.457,2018
1,2,58,18,121,131,141,151,Rwanda,3.408,0.332,0.896,0.4,0.636,0.2,0.444,2018


Unnamed: 0,rank_perceptions of corruption,rank_generosity,rank_freedom to make choices,rank_healthy life expectancy,rank_social support,rank_gdp per capita,rank_score,country,score,gdp per capita,social support,healthy life expectancy,freedom to make choices,generosity,perceptions of corruption,year
0,1,23,20,1,36,3,34,Singapore,6.262,1.572,1.463,1.141,0.556,0.271,0.453,2019
1,2,56,21,108,145,135,152,Rwanda,3.334,0.359,0.711,0.614,0.555,0.217,0.411,2019


Unnamed: 0,rank_perceptions of corruption,rank_generosity,rank_freedom to make choices,rank_healthy life expectancy,rank_social support,rank_gdp per capita,rank_score,country,score,gdp per capita,social support,healthy life expectancy,freedom to make choices,generosity,perceptions of corruption,year
0,1,57,14,1,37,2,31,Singapore,6.377,1.52,1.395,1.138,0.635,0.219,0.533,2020
1,2,46,4,21,3,13,2,Denmark,7.646,1.327,1.503,0.979,0.665,0.243,0.495,2020


Unnamed: 0,rank_perceptions of corruption,rank_generosity,rank_freedom to make choices,rank_healthy life expectancy,rank_social support,rank_gdp per capita,rank_score,country,score,gdp per capita,social support,healthy life expectancy,freedom to make choices,generosity,perceptions of corruption,year
0,1,69,14,1,33,2,32,Singapore,6.377,1.695,1.019,0.897,0.664,0.176,0.547,2021
1,2,44,28,107,144,137,147,Rwanda,3.415,0.364,0.202,0.407,0.627,0.227,0.493,2021


# Data frames concatenation 

Now, then dataframes are homogeneous - with same columns, format of float numbers and have rank attributes we can get the final data mart for visualization

In [6]:
df = pd.concat([df2021, df2020, df2019, df2018, df2017])

# Primary data analysis

Lets check a bit quality of data

## Missing values

In [7]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 769 entries, 0 to 154
Data columns (total 9 columns):
country                      769 non-null object
score                        769 non-null float64
gdp per capita               769 non-null float64
social support               769 non-null float64
healthy life expectancy      769 non-null float64
freedom to make choices      769 non-null float64
generosity                   769 non-null float64
perceptions of corruption    768 non-null float64
year                         769 non-null int64
dtypes: float64(7), int64(1), object(1)
memory usage: 60.1+ KB


We have just one NaN. Lets have a look what is it

In [8]:
df.loc[df['perceptions of corruption'].isna(), :]

Unnamed: 0,country,score,gdp per capita,social support,healthy life expectancy,freedom to make choices,generosity,perceptions of corruption,year
19,United Arab Emirates,6.774,2.096,0.776,0.67,0.284,0.186,,2018


To be sure that this Nan came from raw data

In [9]:
lst[1].loc[lst[1]['Country or region'] == "United Arab Emirates", :]

Unnamed: 0,Overall rank,Country or region,Score,GDP per capita,Social support,Healthy life expectancy,Freedom to make life choices,Generosity,Perceptions of corruption,year
19,20,United Arab Emirates,6.774,2.096,0.776,0.67,0.284,0.186,,2018


Very possible that the survey about corruption was not done in 2018 in Arab Emirates

## Zero in values

We check if there are attributes having value = 0

In [10]:
for col in lst_col: 
    display(df.loc[df[col] == 0, :])

Unnamed: 0,country,score,gdp per capita,social support,healthy life expectancy,freedom to make choices,generosity,perceptions of corruption,year


Unnamed: 0,country,score,gdp per capita,social support,healthy life expectancy,freedom to make choices,generosity,perceptions of corruption,year
139,Burundi,3.775,0.0,0.062,0.155,0.298,0.172,0.212,2021
139,Burundi,3.7753,0.0,0.403575,0.295213,0.275399,0.187402,0.212187,2020
111,Somalia,4.668,0.0,0.698,0.268,0.559,0.243,0.27,2019
97,Somalia,4.982,0.0,0.712,0.115,0.674,0.238,0.282,2018
154,Central African Republic,2.693,0.0,0.0,0.018773,0.270842,0.280876,0.056565,2017


Unnamed: 0,country,score,gdp per capita,social support,healthy life expectancy,freedom to make choices,generosity,perceptions of corruption,year
148,Afghanistan,2.523,0.37,0.0,0.126,0.0,0.122,0.01,2021
148,Central African Republic,3.4759,0.041072,0.0,0.0,0.292814,0.253513,0.028265,2020
154,Central African Republic,3.083,0.026,0.0,0.105,0.225,0.235,0.035,2019
154,Central African Republic,3.083,0.024,0.0,0.01,0.305,0.218,0.038,2018
154,Central African Republic,2.693,0.0,0.0,0.018773,0.270842,0.280876,0.056565,2017


Unnamed: 0,country,score,gdp per capita,social support,healthy life expectancy,freedom to make choices,generosity,perceptions of corruption,year
127,Chad,4.355,0.255,0.353,0.0,0.24,0.215,0.084,2021
148,Central African Republic,3.4759,0.041072,0.0,0.0,0.292814,0.253513,0.028265,2020
134,Swaziland,4.212,0.811,1.149,0.0,0.313,0.074,0.135,2019
112,Sierra Leone,4.571,0.256,0.813,0.0,0.355,0.238,0.053,2018
138,Lesotho,3.808,0.521021,1.190095,0.0,0.390661,0.157497,0.119095,2017


Unnamed: 0,country,score,gdp per capita,social support,healthy life expectancy,freedom to make choices,generosity,perceptions of corruption,year
148,Afghanistan,2.523,0.37,0.0,0.126,0.0,0.122,0.01,2021
152,Afghanistan,2.5669,0.300706,0.356434,0.266052,0.0,0.135235,0.001226,2020
153,Afghanistan,3.203,0.35,0.517,0.361,0.0,0.158,0.025,2019
141,Angola,3.795,0.73,1.125,0.269,0.0,0.079,0.061,2018
139,Angola,3.795,0.858428,1.104412,0.049869,0.0,0.097926,0.06972,2017


Unnamed: 0,country,score,gdp per capita,social support,healthy life expectancy,freedom to make choices,generosity,perceptions of corruption,year
67,Greece,5.723,1.273,0.811,0.76,0.243,0.0,0.074,2021
76,Greece,5.515,1.12807,1.168974,0.979432,0.173516,0.0,0.048844,2020
81,Greece,5.287,1.181,1.156,0.999,0.067,0.0,0.034,2019
78,Greece,5.358,1.154,1.202,0.879,0.131,0.0,0.044,2018
86,Greece,5.227,1.289487,1.239415,0.810199,0.095731,0.0,0.04329,2017


Unnamed: 0,country,score,gdp per capita,social support,healthy life expectancy,freedom to make choices,generosity,perceptions of corruption,year
59,Croatia,5.882,1.251,1.039,0.703,0.453,0.111,0.0,2021
95,Bulgaria,5.1015,1.046555,1.460579,0.777777,0.41782,0.103834,0.0,2020
70,Moldova,5.529,0.685,1.328,0.739,0.245,0.181,0.0,2019
66,Moldova,5.64,0.657,1.301,0.62,0.232,0.171,0.0,2018
92,Bosnia and Herzegovina,5.129,0.915,1.078,0.758,0.28,0.216,0.0,2018
89,Bosnia and Herzegovina,5.182,0.982409,1.069336,0.705186,0.204403,0.328867,0.0,2017


 To be sure that they exist in raw data 

In [11]:
display(lst[4].loc[lst[4]['Country name'] == "Afghanistan", :])
lst[4].loc[lst[4]['Country name'] == "Greece", :]

Unnamed: 0,Country name,Regional indicator,Ladder score,Standard error of ladder score,upperwhisker,lowerwhisker,Logged GDP per capita,Social support,Healthy life expectancy,Freedom to make life choices,...,Perceptions of corruption,Ladder score in Dystopia,Explained by: Log GDP per capita,Explained by: Social support,Explained by: Healthy life expectancy,Explained by: Freedom to make life choices,Explained by: Generosity,Explained by: Perceptions of corruption,Dystopia + residual,year
148,Afghanistan,South Asia,2.523,0.038,2.596,2.449,7.695,0.463,52.493,0.382,...,0.924,2.43,0.37,0.0,0.126,0.0,0.122,0.01,1.895,2021


Unnamed: 0,Country name,Regional indicator,Ladder score,Standard error of ladder score,upperwhisker,lowerwhisker,Logged GDP per capita,Social support,Healthy life expectancy,Freedom to make life choices,...,Perceptions of corruption,Ladder score in Dystopia,Explained by: Log GDP per capita,Explained by: Social support,Explained by: Healthy life expectancy,Explained by: Freedom to make life choices,Explained by: Generosity,Explained by: Perceptions of corruption,Dystopia + residual,year
67,Greece,Western Europe,5.723,0.046,5.813,5.632,10.279,0.823,72.6,0.582,...,0.823,2.43,1.273,0.811,0.76,0.243,0.0,0.074,2.561,2021


I do not think that it is possible that some indicator will be zero. We considere that zero is missing data and change zero values by NaN in dataframe

In [12]:
for col in lst_col: 
    df.loc[df[col] == 0, col] = np.nan

In [13]:
df.isna().mean()

country                      0.000000
score                        0.000000
gdp per capita               0.006502
social support               0.006502
healthy life expectancy      0.006502
freedom to make choices      0.006502
generosity                   0.006502
perceptions of corruption    0.009103
year                         0.000000
dtype: float64

We are not going to delete rows with NaN because they contain import information about other attributes. Just will keep in mind that there is tiny percent of missing data

In [14]:
df.to_csv('happiness_index_with_ranks.csv')

# Visualization data mart 

Dashbord is created in Tableau Public. Data mart - our file happiness_index_with_ranks.csv https://disk.yandex.ru/d/Q9j3Pvo3BApCBA

Link for dashboard **Evolution of happiness score and its components by the country period 2017-2021**
https://public.tableau.com/views/Dynamicsofthehappinessscoreanditscomponent/Tableaudebord1?:language=en-US&:display_count=n&:origin=viz_share_linkLink 