<a href="https://www.kaggle.com/code/martinab/world-happiness-plotly-eda?scriptVersionId=113722316" target="_blank"><img align="left" alt="Kaggle" title="Open in Kaggle" src="https://kaggle.com/static/images/open-in-kaggle.svg"></a>

# World Happiness (EDA)

The aim of this project is to demonstrate different visualisation techniques using the plotly library and to analyse the World Happiness datasets (2015-2019). 

Before we start exploring and visualising the data, we face a setback: the column naming conventions differ between the yearly datasets. We will sort this out in the Data Manipulations section. Of course, other approaches could tackle this problem 😉. We will also add a region column (taken from the 2015 dataset) and a continent column (using the pycountry_convert library). 


We try to use dynamic graphs and charts in the EDA section where possible. It's cool to have fancy visuals, but it's even cooler to be able to provide insight... 📊📈📉

![happy](https://www.uncsa.edu/mysa/img/announcements/2018/yellow-happy-blue-sad-balls.jpg)

<code style="background:blue;color:white">Please upvote if you like my work or if you found it helpful. Any feedback on how to improve is welcomed. </code>

## Import Libraries

In [None]:
# Importing numpy, pandas, matplotlib and seaborn:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Imports for plotly:
import plotly.graph_objs as go
import plotly.express as px
import plotly.figure_factory as ff
from plotly.subplots import make_subplots

import random

In [None]:
# Explore what's in world-happiness folder:
import os

print(os.listdir('../input/world-happiness/'))

### Load and Explore Data

In [None]:
# Load data for 2015 - 2019 and add year column to each dataset:
df_15 = pd.read_csv('../input/world-happiness/2015.csv')
df_15['Year'] = 2015

df_16 = pd.read_csv('../input/world-happiness/2016.csv')
df_16['Year'] = 2016

df_17 = pd.read_csv('../input/world-happiness/2017.csv')
df_17['Year'] = 2017

df_18 = pd.read_csv('../input/world-happiness/2018.csv')
df_18['Year'] = 2018

df_19 = pd.read_csv('../input/world-happiness/2019.csv')
df_19['Year'] = 2019

In [None]:
# Function to describe variables:
def desc(df):
    d = pd.DataFrame(df.dtypes,columns=['Data_Types'])
    d = d.reset_index()
    d['Columns'] = d['index']
    d = d[['Columns','Data_Types']]
    d['Missing'] = df.isnull().sum().values    
    d['Uniques'] = df.nunique().values
    return d

# Apply desc to df:
tab = ff.create_table(desc(df_19))
tab.show()

### Data Manipulations

The datasets for 2015 - 2019 do not follow the same naming convention, so we need to rename a couple of columns to make the datasets easier to combine. 

In [None]:
# We want to concatenate dataframes, but because columns are named differently we will be renaming them:
df_17.rename(columns={'Happiness.Rank': 'Happiness Rank', 'Happiness.Score': 'Happiness Score', 'Economy..GDP.per.Capita.':'Economy (GDP per Capita)'
                      , 'Health..Life.Expectancy.':'Health (Life Expectancy)', 'Trust..Government.Corruption.':'Trust (Government Corruption)'
                      , 'Dystopia.Residual':'Dystopia Residual' }, inplace=True)

df_18.rename(columns={'Overall rank':'Happiness Rank', 'Country or region':'Country','Score':'Happiness Score', 'GDP per capita':'Economy (GDP per Capita)'
                      ,'Social support':'Family', 'Healthy life expectancy':'Health (Life Expectancy)','Freedom to make life choices':'Freedom'
                      ,'Perceptions of corruption':'Trust (Government Corruption)' }, inplace=True)

df_19.rename(columns={'Overall rank':'Happiness Rank', 'Country or region':'Country', 'Score':'Happiness Score', 'GDP per capita':'Economy (GDP per Capita)'
                      ,'Social support':'Family', 'Healthy life expectancy':'Health (Life Expectancy)','Freedom to make life choices':'Freedom'
                      ,'Perceptions of corruption':'Trust (Government Corruption)' }, inplace=True)


Combine all available datasets using the pandas concat function. We drop the columns we won't use, such as Standard Error and Whisker.high, because they are not populated for every year. The final step is renaming columns so they are easier to read.

In [None]:
# Concatenate dataframes for world-happiness:
frames = [df_15, df_16, df_17, df_18, df_19]
df = pd.concat(frames)

# Drop columns that are not populated for all dataframes:
df = df.drop(['Region','Standard Error', 'Dystopia Residual', 'Lower Confidence Interval', 'Upper Confidence Interval', 'Whisker.high', 'Whisker.low'],axis = 1)

# Rename columns, so it's easier to read them:
df.rename(columns = {'Economy (GDP per Capita)':'GDP per Capita', 'Family':'Social Support', 'Health (Life Expectancy)':'Life Expectancy'
                     , 'Trust (Government Corruption)':'Corruption'}, inplace=True)

# Let's have a look at our dataframe:
df.head()
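
The rename-then-concat steps above can also be driven by one rename map per year. The following is a minimal sketch on made-up toy frames, not the notebook's actual data; `rename_maps`, `df_a` and `df_b` are illustrative names. It also shows one way to list the columns that end up only partly populated after the concat, which are the candidates for dropping.

```python
import pandas as pd

# Toy stand-ins for two yearly files with different column names (values are made up).
df_a = pd.DataFrame({'Country': ['X', 'Y'],
                     'Happiness Score': [7.1, 5.2],
                     'Standard Error': [0.03, 0.05]})
df_b = pd.DataFrame({'Country or region': ['X', 'Y'],
                     'Score': [7.0, 5.4]})

# One rename map per dataset, keyed by year (hypothetical helper structure).
rename_maps = {
    2015: {},
    2019: {'Country or region': 'Country', 'Score': 'Happiness Score'},
}

frames = []
for year, frame in [(2015, df_a), (2019, df_b)]:
    frame = frame.rename(columns=rename_maps[year])  # returns a new frame
    frame['Year'] = year
    frames.append(frame)

combined = pd.concat(frames, ignore_index=True)

# Columns with at least one missing value were not present in every yearly file.
partly_missing = combined.columns[combined.isnull().any()].tolist()
print(partly_missing)
```

Keeping the maps in one dictionary makes it easier to see at a glance which year deviates from the target naming.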

### Region and Continent

In this section we add Region and Continent columns. We take the regions from df_15 and attach them to our dataframe df with the merge function. To assign a continent we use the pycountry_convert library. So let's do this :)

In [None]:
# Match region from df_15 to countries from our df by using merge.
# The default inner join keeps only countries that also appear in df_15:
df_reg = df_15[['Country', 'Region']]
df = df.merge(df_reg, on='Country')
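
Because the merge above is an inner join, any country missing from the 2015 region table disappears from df without notice. Here is a small sketch, on made-up toy data, of how a left join with `indicator=True` could be used to list such countries before deciding how to handle them; the frame and variable names are illustrative.

```python
import pandas as pd

# Toy data (made up): 'Z' appears in a later year but not in the 2015 region table.
scores = pd.DataFrame({'Country': ['X', 'Y', 'Z'], 'Year': [2019, 2019, 2019]})
regions = pd.DataFrame({'Country': ['X', 'Y'],
                        'Region': ['Western Europe', 'Southern Asia']})

# Inner join (the pandas default): 'Z' is silently dropped.
inner = scores.merge(regions, on='Country')

# Left join with an indicator column: every row is kept and unmatched rows are flagged.
checked = scores.merge(regions, on='Country', how='left', indicator=True)
unmatched = checked.loc[checked['_merge'] == 'left_only', 'Country'].tolist()

print(len(inner), unmatched)
```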

As a next step we would like to add a continent column; for this we need to run !pip install pycountry_convert. A couple of territories do not match the country names the library expects, so we create a function country_2_continent that assigns those by hand. 

In [None]:
!pip install pycountry_convert

In [None]:
# Create a function to assign continent to country:

import pycountry_convert as pc

def country_2_continent(country_name):
    # Territories whose names do not match the library's country names are assigned by hand:
    if country_name in ['Holy See', 'Kosovo']:
        return 'Europe'
    if country_name in ['North Cyprus', 'East Timor', 'Timor-Leste', 'West Bank and Gaza', 'Palestinian Territories'
                        , 'Taiwan Province of China', 'Hong Kong S.A.R., China']:
        return 'Asia'
    if country_name in ['Congo (Brazzaville)', 'Congo (Kinshasa)', 'Somaliland region', 'Somaliland Region']:
        return 'Africa'
    if country_name in ['Trinidad & Tobago']:
        return 'South America'

    try:
        country_alpha2 = pc.country_name_to_country_alpha2(country_name)
        country_continent_code = pc.country_alpha2_to_continent_code(country_alpha2)
        return pc.convert_continent_code_to_continent_name(country_continent_code)
    except Exception:
        # Any name the library cannot resolve falls back to 'Other':
        return 'Other'

In [None]:
# Create a Continent column: 
df['Continent'] = df['Country'].apply(country_2_continent)
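
The chain of if statements in country_2_continent could also be written as a lookup table. The sketch below assumes the same hand-made assignments as above; `CONTINENT_OVERRIDES`, `continent_for` and `toy_lookup` are hypothetical names, and `toy_lookup` is only a stand-in for the pycountry_convert calls so that the example runs without the library.

```python
# Hand-made assignments, copied from country_2_continent above.
CONTINENT_OVERRIDES = {
    'Holy See': 'Europe', 'Kosovo': 'Europe',
    'North Cyprus': 'Asia', 'East Timor': 'Asia', 'Timor-Leste': 'Asia',
    'West Bank and Gaza': 'Asia', 'Palestinian Territories': 'Asia',
    'Taiwan Province of China': 'Asia', 'Hong Kong S.A.R., China': 'Asia',
    'Congo (Brazzaville)': 'Africa', 'Congo (Kinshasa)': 'Africa',
    'Somaliland region': 'Africa', 'Somaliland Region': 'Africa',
    'Trinidad & Tobago': 'South America',
}


def continent_for(country_name, lookup):
    """Return the override if there is one, else lookup(country_name), else 'Other'."""
    if country_name in CONTINENT_OVERRIDES:
        return CONTINENT_OVERRIDES[country_name]
    try:
        return lookup(country_name)
    except KeyError:
        return 'Other'


def toy_lookup(name):
    """Hypothetical stand-in for the library lookup; raises KeyError for unknown names."""
    return {'Finland': 'Europe', 'Chile': 'South America'}[name]


for name in ['Kosovo', 'Finland', 'Atlantis']:
    print(name, '->', continent_for(name, toy_lookup))
```

A table like this keeps the manual exceptions in one place, which makes them easier to review and extend.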

## Exploratory Data Analysis (EDA)

Let the fun begin! This is the exciting part: we can explore the data further, spot patterns and gain a deeper understanding. 

### Map for Happiness and Other Features in 2019

Maps below show values of Happiness Score, GDP per Capita, Social Support, Health (Life Expectancy), Freedom, Corruption and Generosity for countries.

In [None]:
# Create data-frame for year 2019:
df_19 = df[df.Year == 2019]

# Numerical columns of df_19 which we want to display:
cols_dd = ['Happiness Score', 'GDP per Capita','Social Support', 'Life Expectancy', 'Freedom', 'Corruption','Generosity']
# Array of trace names, used below to build the visibility mask for each button:
visible = np.array(cols_dd)

# Define traces and buttons:
traces = []
buttons = []
for value in cols_dd:
    traces.append(go.Choropleth(locations=df_19['Country']
                                , locationmode='country names'
                                , z=df_19[value].astype(float)
                                , colorbar_title=value
                                , visible=(value == cols_dd[0])
                                , colorscale='Bluered'
                                , reversescale=True
                               )
                 )

    buttons.append(dict(label=value
                        , method='update'
                        , args=[{'visible': list(visible == value)}
                                , {'title': f"<b>{value}</b>"}]))

updatemenus = [{'active':0
                ,'buttons':buttons
               }]


# Show figure
fig = go.Figure(data=traces,
                layout=dict(updatemenus=updatemenus))
# This is in order to get the first title displayed correctly
first_title = cols_dd[0]
fig.update_layout(title=f"<b>{first_title} (2019)</b>")
fig.show()

The maps above give an overview of which countries have the highest Happiness Score, GDP per Capita and so on. Just pick the feature you are interested in from the dropdown menu (top left corner).
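
The dropdown works because each button carries a boolean mask with one entry per trace. As a small illustration with a shortened column list (an assumption for brevity), comparing the NumPy array of trace names with a single label yields exactly that mask:

```python
import numpy as np

# Shortened list of trace names, one per choropleth trace.
cols = ['Happiness Score', 'GDP per Capita', 'Freedom']
visible = np.array(cols)

# Element-wise comparison: True only at the position of the selected trace.
# .tolist() converts the NumPy booleans into plain Python booleans.
mask = (visible == 'GDP per Capita').tolist()
print(mask)
```

Passing this mask as the `visible` argument of a button shows the matching trace and hides the others.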

### Correlation for Features

In [None]:
# Correlation matrix for Happiness dataset features:

corr = df[['Happiness Score', 'GDP per Capita','Social Support', 'Life Expectancy', 'Freedom', 'Corruption','Generosity']].astype(float).corr()
l = list(corr.columns)

fig = ff.create_annotated_heatmap(np.array(round(corr,4)), x=l, y=l, colorscale = 'Bluered', reversescale=True )
fig.update_layout(title='')

fig.show()

Happiness Score has a strong positive correlation with GDP per Capita (0.7973) and Health (Life Expectancy) (0.7477), followed by Social Support (0.6506) and Freedom (0.5501). 

This means that the populations of rich countries with higher life expectancy, social support and freedom tend to be happier. 
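
To read such a ranking directly from the correlation matrix rather than from the heatmap, the Happiness Score column can be sorted. The sketch below uses a made-up toy frame, so its numbers are unrelated to the values quoted above:

```python
import pandas as pd

# Made-up values, only to show the mechanics.
toy = pd.DataFrame({
    'Happiness Score': [1, 2, 3, 4, 5],
    'GDP per Capita': [1, 2, 3, 4, 5.5],
    'Generosity': [3, 1, 4, 1, 5],
})

corr = toy.corr()  # Pearson correlation by default

# Take the Happiness Score column, drop its self-correlation, sort strongest first.
ranked = corr['Happiness Score'].drop('Happiness Score').sort_values(ascending=False)
print(ranked.round(4))
```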


### Low, Medium & High Happiness Score

First of all, we split the happiness score into 3 categories: low, medium and high. We work with quartiles here: a score below q1 is low, a score between q1 and q3 is medium, and anything above q3 is high.

We will create boxplots for each of the features split by those 3 categories.

In [None]:
q1, q2, q3 = df['Happiness Score'].quantile([0.25,0.5,0.75])

def category(value):
    if value < q1:
        return 'low'
    if value > q3:
        return 'high'
    else:
        return 'medium'
    
df['Category'] = df['Happiness Score'].apply(category)    
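
The same three-way split can be done without a row-by-row apply. This is a sketch on made-up scores, assuming the same boundary rule as the category function above (values equal to q1 or q3 count as medium); `np.select` picks the first matching condition and falls back to the default otherwise.

```python
import numpy as np
import pandas as pd

scores = pd.Series([1, 2, 3, 4, 5, 6, 7, 8], dtype=float)  # made-up scores
q1, q3 = scores.quantile([0.25, 0.75])

# First condition that holds wins; rows matching neither are 'medium'.
labels = np.select([scores < q1, scores > q3], ['low', 'high'], default='medium')
categories = pd.Series(labels, index=scores.index)

print(categories.value_counts())
```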

In [None]:
# Overwrite data-frame for year 2019:
df_19 = df[df.Year == 2019]

# Boxplot with dropdown menu for main features:

fig = go.Figure()

# Add Traces

fig.add_trace(go.Box(x=df_19['Category'], y=df_19['GDP per Capita']))
fig.add_trace(go.Box(x=df_19['Category'], y=df_19['Social Support'], visible=False))  
fig.add_trace(go.Box(x=df_19['Category'], y=df_19['Life Expectancy'], visible=False))  
fig.add_trace(go.Box(x=df_19['Category'], y=df_19['Freedom'], visible=False))  
fig.add_trace(go.Box(x=df_19['Category'], y=df_19['Corruption'], visible=False))  
fig.add_trace(go.Box(x=df_19['Category'], y=df_19['Generosity'], visible=False))  
 

# Add Buttons

fig.update_layout(
    updatemenus=[
        dict(
            active=0,
            buttons=list([ 
                
                dict(label='GDP',
                     method='update',
                     args=[{'visible': [True, False,False, False, False, False]},
                           {'title': 'Boxplot for GDP per Capita (Happiness Category split)'}]),
                
                dict(label='Social Support',
                     method='update',
                     args=[{'visible': [False, True, False, False, False, False]},
                           {'title': 'Boxplot for Social Support (Happiness Category split)'}]),
                
                dict(label='Life Expectancy',
                     method='update',
                     args=[{'visible': [False,  False, True, False, False, False]},
                           {'title': 'Boxplot for Life Expectancy (Happiness Category split)'}]),
                
                dict(label='Freedom',
                     method='update',
                     args=[{'visible': [False, False, False, True,  False, False]},
                           {'title': 'Boxplot for Freedom (Happiness Category split)'}]),
                
                dict(label='Corruption',
                     method='update',
                     args=[{'visible': [False, False, False, False, True, False]},
                           {'title': 'Boxplot for Corruption (Happiness Category split)'}]),
                
                dict(label='Generosity',
                     method='update',
                     args=[{'visible': [False, False, False, False, False,True]},
                           {'title': 'Boxplot for Generosity (Happiness Category split)'}]),
                
               
            ]),
        )
    ])

# Set title
fig.update_layout(title_text='Boxplot for Happiness Score Categories (2019)')

fig.show()

The box plots above confirm the main findings from the feature correlation heatmap. 

Perhaps the most interesting feature here is Generosity. Countries with a low happiness score are on average more generous than countries with a medium score. 
Two of the three most generous countries have a low happiness score.
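
The six hand-written buttons in the boxplot cell differ only in their label and in the position of the single True value. A minimal sketch of generating the same list in a loop, assuming the traces were added in the order of `features`:

```python
features = ['GDP per Capita', 'Social Support', 'Life Expectancy',
            'Freedom', 'Corruption', 'Generosity']

buttons = []
for i, feature in enumerate(features):
    # True only for the trace at position i, False for all others.
    visible = [j == i for j in range(len(features))]
    buttons.append(dict(
        label=feature,
        method='update',
        args=[{'visible': visible},
              {'title': f'Boxplot for {feature} (Happiness Category split)'}],
    ))

print(buttons[2]['label'], buttons[2]['args'][0]['visible'])
```

The resulting list could then be passed as `buttons` inside `updatemenus`, which avoids mistakes when a feature is added or removed.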

### Top 10 Happiest Countries

What are the top 10 happiest countries in the world? The following bar chart answers this question for the years 2015 - 2019.

In [None]:
df_top = df[df['Happiness Rank']<=10]
df_top = df_top.sort_values(by = ['Year', 'Happiness Rank'])

In [None]:
# Create functions for running bar chart (source: https://towardsdatascience.com/bar-chart-race-with-plotly-f36f3a5df4f1)

def name_to_color(names, r_min=0, r_max=255, g_min=0, g_max=255, b_min=0, b_max=255):
    
    """Mapping of countries to random rgb colors.
    Parameters:
    names (Series): Pandas Series containing countries.
    r_min (int): Minimum intensity of the red channel (default 0).
    r_max (int): Maximum intensity of the red channel (default 255).
    g_min (int): Minimum intensity of the green channel (default 0).
    g_max (int): Maximum intensity of the green channel (default 255).
    b_min (int): Minimum intensity of the blue channel (default 0).
    b_max (int): Maximum intensity of the blue channel (default 255).
    Returns:
    dictionary: Mapping of countries (keys) to random rgb colors (values)
    """    
    mapping_colors = dict()
    
    for name in names.unique():
        red = random.randint(r_min, r_max)
        green = random.randint(g_min, g_max)
        blue = random.randint(b_min, b_max)
        rgb_string = 'rgb({}, {}, {})'.format(red, green, blue)
    
        mapping_colors[name] = rgb_string
    
    return mapping_colors


# Map colors to df_top Country column:
mapping_colors = name_to_color(df_top.Country)
df_top['Color'] = df_top['Country'].map(mapping_colors)

def frames_animation(df, title):
    
    """Creation of a sequence of frames.
    Parameters:
    df (DataFrame): Pandas data frame containing the categorical variable ['Country'],
    the score ['Happiness Score'], the year ['Year'], and the color['Color'] (separated columns).
    title (string): Title of each frame.
    Returns:
    list_of_frames (list): List of frames. Each frame contains a bar plot of a year.
    """  
    
    list_of_frames = []
    initial_year = df['Year'].min()
    final_year = df['Year'].max()

    for year in range(initial_year, final_year +1):
            fdata = df[df['Year'] == year]
            list_of_frames.append(go.Frame(data=[go.Bar(x=fdata['Country']
                                                        , y=fdata['Happiness Score']
                                                        , marker_color=fdata['Color']
                                                        , hoverinfo='none'
                                                        , textposition='outside'
                                                        , texttemplate='%{x}<br>%{y}'
                                                        , cliponaxis=False
                                                       )
                                                ],
                                           layout=go.Layout(font={'size': 10}
                                                            , plot_bgcolor = '#FFFFFF'
                                                            , xaxis={'showline': False, 'visible': False}
                                                            , yaxis={'showline': False, 'visible': False}
                                                            , bargap=0.15
                                                            , title=title + str(year)
                                                           )
                                          )
                                 )
    return list_of_frames 


def bar_race_plot (df, title, list_of_frames):
    
    """Creation of the bar chart race figure.
    Parameters:
    df (DataFrame): Pandas data frame containing the categorical variable ['Country'],
    the score ['Happiness Score'], the year ['Year'], and the color ['Color'] (separated columns).
    title (string): Title of the initial bar plot.
    list_of_frames (list): List of frames. Each frame contains a bar plot of a year.
    Returns:
    fig (figure instance): Bar chart race
    """
    
    # initial year - countries (categorical variable), happiness score (numerical variable), and color
    initial_year = df['Year'].min()
    initial_names = df['Country'][df['Year'] == initial_year]
    initial_numbers = df['Happiness Score'][df['Year'] == initial_year]
    initial_color = df['Color'][df['Year'] == initial_year]
    range_max = df['Happiness Score'].max()
    
    fig = go.Figure(
        data=[go.Bar(x=initial_names
                     , y=initial_numbers
                     , marker_color=initial_color
                     , hoverinfo='none',textposition='outside'
                     , texttemplate='%{x}<br>%{y}'
                     ,cliponaxis=False
                    )
             ],
        layout=go.Layout(font={'size': 10}
                         , plot_bgcolor = '#FFFFFF'
                         , xaxis={'showline': False, 'visible': False}
                         , yaxis={'showline': False, 'visible': False, 'range': (0, range_max)}
                         , bargap=0.15
                         , title=title + str(initial_year)
                         ,updatemenus=[dict(type="buttons"
                                            ,buttons=[dict(label="Play"
                                                           , method="animate"
                                                           ,args=[None,{"frame": {"duration": 2000, "redraw": True}, "fromcurrent": True}]),
                                                      dict(label="Stop"
                                                           ,method="animate"
                                                           ,args=[[None],{"frame": {"duration": 0, "redraw": False}, "mode": "immediate","transition": {"duration": 0}}])])]),
        frames=list(list_of_frames))
    
    return fig 

In [None]:
# Animated bar chart for Happiness Score: 
title = 'Top 10 Happiest Countries '
list_of_frames = frames_animation(df_top, title)
fig = bar_race_plot(df_top, title, list_of_frames)
fig.show()
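
Since name_to_color draws from the global random generator, the bar colors change on every run. If reproducible colors are preferred, one option is a local, seeded generator. This is a sketch; `seeded_colors` and its `seed` parameter are hypothetical and not part of the original helper.

```python
import random


def seeded_colors(names, seed=0):
    """Map each unique name to an rgb string, reproducibly for a given seed."""
    rng = random.Random(seed)  # local generator, leaves the global random state alone
    colors = {}
    for name in dict.fromkeys(names):  # unique names in first-seen order
        r, g, b = (rng.randint(0, 255) for _ in range(3))
        colors[name] = f'rgb({r}, {g}, {b})'
    return colors


first = seeded_colors(['Finland', 'Norway', 'Finland'], seed=1)
second = seeded_colors(['Finland', 'Norway', 'Finland'], seed=1)
print(first == second)
```

The returned dictionary has the same shape as the one from name_to_color, so it could be used with `map` in the same way.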

### The Happiest Region

Please note that the following statistics are based on the average happiness score of each region in a given year. 

In [None]:
# Create region dataset with mean happiness scores and mean ranks per year:

df_reg = df.groupby(['Region', 'Year'], as_index=False).agg({'Happiness Score': 'mean', 'Happiness Rank': 'mean'})
# Sort by year, then by mean rank, so the happiest region comes first within each year:
df_reg = df_reg.sort_values(['Year', 'Happiness Rank'])
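
To make the aggregation concrete, here is the same groupby on a made-up toy frame with two regions and one year. Each region's score and rank are averaged over its countries, and the rows are then ordered by mean rank:

```python
import pandas as pd

# Made-up values: two countries per region, one year.
toy = pd.DataFrame({
    'Region': ['A', 'A', 'B', 'B'],
    'Year': [2019, 2019, 2019, 2019],
    'Happiness Score': [7.0, 6.0, 4.0, 5.0],
    'Happiness Rank': [1, 2, 4, 3],
})

by_region = (toy.groupby(['Region', 'Year'], as_index=False)
                .agg({'Happiness Score': 'mean', 'Happiness Rank': 'mean'})
                .sort_values(['Year', 'Happiness Rank']))
print(by_region)
```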

In [None]:
# Map regions to random colors by using the name_to_color function:
mapping_colors = name_to_color(df_reg.Region, 0, 185, 0, 185, 125, 255)
df_reg['Color'] = df_reg['Region'].map(mapping_colors)

def frames_animation(df, title):
    
    """Creation of a sequence of frames.
    Parameters:
    df (DataFrame): Pandas data frame containing the categorical variable ['Region'],
    the score ['Happiness Score'], the year ['Year'], and the color['Color'] (separated columns).
    title (string): Title of each frame.
    Returns:
    list_of_frames (list): List of frames. Each frame contains a bar plot of a year.
    """
    
    list_of_frames = []
    initial_year = df['Year'].min()
    final_year = df['Year'].max()

    for year in range(initial_year, final_year +1):
            fdata = df[df['Year'] == year]
            list_of_frames.append(go.Frame(data=[go.Bar(x=fdata['Region']
                                                        , y=fdata['Happiness Score']
                                                        , marker_color=fdata['Color']
                                                        , hoverinfo='none'
                                                        , textposition='outside'
                                                        , texttemplate='%{x}<br>%{y}'
                                                        , cliponaxis=False
                                                       )
                                                ],
                                           layout=go.Layout(font={'size': 10}
                                                            , plot_bgcolor = '#FFFFFF'
                                                            , xaxis={'showline': False, 'visible': False}
                                                            , yaxis={'showline': False, 'visible': False}
                                                            , bargap=0.15
                                                            , title=title + str(year)
                                                           )
                                          )
                                 )
    return list_of_frames 


def bar_race_plot (df, title, list_of_frames):
    """Creation of the bar chart race figure.
    Parameters:
    df (DataFrame): Pandas data frame containing the categorical variable ['Region'],
    the score ['Happiness Score'], the year ['Year'], and the color ['Color'] (separated columns).
    title (string): Title of the initial bar plot.
    list_of_frames (list): List of frames. Each frame contains a bar plot of a year.
    Returns:
    fig (figure instance): Bar chart race
    """
    
    # initial year - regions (categorical variable), mean happiness score (numerical variable), and color
    initial_year = df['Year'].min()
    initial_names = df['Region'][df['Year'] == initial_year]
    initial_numbers = df['Happiness Score'][df['Year'] == initial_year]
    initial_color = df['Color'][df['Year'] == initial_year]
    range_max = df['Happiness Score'].max()
    
    fig = go.Figure(
        data=[go.Bar(x=initial_names
                     , y=initial_numbers
                     , marker_color=initial_color
                     , hoverinfo='none'
                     , textposition='outside'
                     , texttemplate='%{x}<br>%{y}'
                     , cliponaxis=False
                    )
             ],
        layout=go.Layout(font={'size': 10}
                         , plot_bgcolor = '#FFFFFF'
                         , xaxis={'showline': False, 'visible': False}
                         , yaxis={'showline': False, 'visible': False, 'range': (0, range_max)}
                         , bargap=0.15, title=title + str(initial_year)
                         , updatemenus=[dict(type="buttons"
                                             ,buttons=[dict(label="Play"
                                                            , method="animate"
                                                            , args=[None,{"frame": {"duration": 2000, "redraw": True}, "fromcurrent": True}]),
                                                       dict(label="Stop"
                                                            , method="animate"
                                                            , args=[[None],{"frame": {"duration": 0, "redraw": False}, "mode": "immediate","transition": {"duration": 0}}])])]),
        frames=list(list_of_frames))
    
    return fig

In [None]:
# Animated bar chart for Happiness Score: 
title = 'Happiness Score for Regions '
list_of_frames = frames_animation(df_reg, title)
fig = bar_race_plot(df_reg, title, list_of_frames)
fig.show()

### Happiness by Continent

In [None]:
round((max(df['GDP per Capita']))+0.5,2)

In [None]:
# Split Happiness data according to continent:

fig = px.scatter(df
                 , x ="GDP per Capita"
                 , y ="Happiness Score"
                 , animation_frame ="Year"
                 , animation_group ="Country"
                 , size ="GDP per Capita"
                 , color ="Continent"
                 , hover_name ="Country"
                 #, facet_col ="Continent"
                 , size_max = 10
                ) 


fig.update_layout(title_text='Happiness vs GDP per Capita (sized by GDP per Capita)')
fig.update_yaxes(range=[2,8])
fig.update_xaxes(range=[-0.01,round((max(df['GDP per Capita']))+0.1,2)])

fig.show()

In [None]:
# Split Happiness data according to continent:

fig = px.scatter(df
                 , x ="Social Support"
                 , y ="Happiness Score"
                 , animation_frame ="Year"
                 , animation_group ="Country"
                 , size ="GDP per Capita"
                 , color ="Continent"
                 , hover_name ="Country"
                 #, facet_col ="Continent"
                 , size_max = 10
                ) 


fig.update_layout(title_text='Happiness vs Social Support (sized by GDP per Capita)')
fig.update_yaxes(range=[2,8])
fig.update_xaxes(range=[-0.01,round((max(df['Social Support']))+0.1,2)])

fig.show()

In [None]:
df.columns

In [None]:
# Split Happiness data according to continent:

fig = px.scatter(df
                 , x ="Life Expectancy"
                 , y ="Happiness Score"
                 , animation_frame ="Year"
                 , animation_group ="Country"
                 , size ="GDP per Capita"
                 , color ="Continent"
                 , hover_name ="Country"
                 #, facet_col ="Continent"
                 , size_max = 10
                ) 


fig.update_layout(title_text='Happiness vs Life Expectancy (sized by GDP per Capita)')
fig.update_yaxes(range=[2,8])
fig.update_xaxes(range=[-0.01,round((max(df['Life Expectancy']))+0.1,2)])

fig.show()

In [None]:
# Split Happiness data according to continent:

fig = px.scatter(df
                 , x ="Freedom"
                 , y ="Happiness Score"
                 , animation_frame ="Year"
                 , animation_group ="Country"
                 , size ="GDP per Capita"
                 , color ="Continent"
                 , hover_name ="Country"
                 #, facet_col ="Continent"
                 , size_max = 10
                ) 


fig.update_layout(title_text='Happiness vs Freedom (sized by GDP per Capita)')
fig.update_yaxes(range=[2,8])
fig.update_xaxes(range=[-0.01,round((max(df['Freedom']))+0.1,2)])

fig.show()

In [None]:
# Split Happiness data according to continent:

fig = px.scatter(df
                 , x ="Generosity"
                 , y ="Happiness Score"
                 , animation_frame ="Year"
                 , animation_group ="Country"
                 , size ="GDP per Capita"
                 , color ="Continent"
                 , hover_name ="Country"
                 #, facet_col ="Continent"
                 , size_max = 10
                ) 


fig.update_layout(title_text='Happiness vs Generosity (sized by GDP per Capita)')
fig.update_yaxes(range=[2,8])
fig.update_xaxes(range=[-0.01,round((max(df['Generosity']))+0.1,2)])

fig.show()