# The dataset, its description + imports

Column descriptions from [whr+22.pdf](https://happiness-report.s3.amazonaws.com/2022/WHR+22.pdf):

**RANK**<br>
Ranking based on happiness score

**Country** <br>
The country surveyed.

**Happiness score** <br>
Sum of the contributions of the six key variables, measured relative to a hypothetical country called "Dystopia", plus the "Dystopia + residual" term.

**Whisker-high and Whisker-low** <br>
Upper and lower bounds of the 95% confidence interval around the happiness score (real numbers between 0 and 10).

**Dystopia (1.83) + residual**<br>
The predicted Cantril ladder score (1.83) for a hypothetical country with the world’s lowest values for each of the six variables, plus the country's own prediction error (residual).

> “Dystopia”—named because it has values equal to the world’s lowest national averages for 2019-2021 for each of the six key variables used in Table 2.1. We use Dystopia as a benchmark against which to compare contributions from each of the six factors. The choice of Dystopia as a benchmark permits every real country to have a positive (or at least zero) contribution from each of the six factors. Based on the estimates in the first column of Table 2.1, we calculate that Dystopia had a 2019-2021 life evaluation equal to 1.83 on the 0 to 10 scale. The final sub-bar is the sum of two components: the calculated average 2017-2019 life evaluation in Dystopia (=1.83) plus each country’s own prediction error, which measures the extent to which life evaluations are higher or lower than those predicted by our equation in the first column of Table 2.1. These residuals are as likely to be negative as positive

**Explained by: GDP per capita**<br>
GDP per capita in terms of Purchasing Power Parity (PPP), adjusted to constant 2017 international dollars, taken from the World Development Indicators (WDI) released by the World Bank on December 16, 2021. See Statistical Appendix 1 for more details. GDP data for 2021 are not yet available, so we extend the GDP time series from 2020 to 2021 using country-specific forecasts of real GDP growth from the OECD Economic Outlook No. 110 (Edition December 2021) or, if missing, the World Bank’s Global Economic Prospects (Last Updated: 01/11/2022), after adjustment for population growth. The equation uses the natural log of GDP per capita, as this form fits the data significantly better than GDP per capita.

**Explained by: Social support**<br>
The national average of the binary responses (0=no, 1=yes) to the Gallup World Poll (GWP) question “If you were in trouble, do you have relatives or friends you can count on to help you whenever you need them, or not?”

**Explained by: Healthy life expectancy**<br>
The time series for healthy life expectancy at birth is constructed based on data from the World Health Organization (WHO) Global Health Observatory data repository, with data available for 2000, 2010, 2015, and 2019. Interpolation and extrapolation are used to match this report’s sample period (2005-2021).

**Explained by: Freedom to make life choices**<br>
The national average of binary responses (0=no, 1=yes) to the GWP question “Are you satisfied or dissatisfied with your freedom to choose what you do with your life?”

**Explained by: Generosity**<br>
The residual of regressing the national average of GWP responses to the donation question “Have you donated money to a charity in the past month?” on log GDP per capita.

**Explained by: Perceptions of corruption**<br>
The average of binary answers to two GWP questions: “Is corruption widespread throughout the government in this country or not?” and “Is corruption widespread within businesses in this country or not?” Where data for government corruption are missing, the perception of business corruption is used as the overall corruption-perception measure.
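
The Generosity definition above (a regression residual) can be hard to picture, so here is a minimal sketch of the idea in plain Python. It assumes a simple one-variable least-squares fit; the helper `ols_residuals` and all input numbers are hypothetical illustrations, not values or code from the report.

```python
import math


def ols_residuals(x, y):
    """Residuals of a simple least-squares fit y ~ intercept + slope * x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]


# Hypothetical inputs (NOT taken from the report):
gdp_per_capita = [2_000, 8_000, 20_000, 45_000]   # PPP dollars
donation_share = [0.20, 0.15, 0.40, 0.55]         # share answering "yes"

log_gdp = [math.log(g) for g in gdp_per_capita]

# "Generosity" in this sense: how much more (or less) a country donates
# than its income level alone would predict.
generosity = ols_residuals(log_gdp, donation_share)
print([round(g, 3) for g in generosity])
```

A positive residual means a country donates more than its log GDP per capita would predict, a negative one means less. This is why generosity values can be small or negative even for wealthy countries.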

---

**Additional information**<br>
Countries marked with an asterisk do not have survey information in 2020 or 2021. Instead, their averages are based on the 2019 survey.

**Positive affect** is defined as the average of previous-day affect measures for laughter, enjoyment, and doing or learning something interesting. This marks a change from recent years, where only laughter and enjoyment were included. The inclusion of interest gives us three components in each of positive and negative affect and slightly improves the equation fit in column 4. The general form for the affect questions is: Did you experience the following feelings during a lot of the day yesterday? Only the interest question is phrased differently: Did you learn or do something interesting yesterday? See Statistical Appendix 1 for more details.

**Negative affect** is defined as the average of previous-day affect measures for worry, sadness, and anger.

For more detailed information on each of the predictors, see pages 20 and 21 of the PDF.

## Imports

In [1]:
import pandas as pd
import seaborn as sns
# import matplotlib.pyplot as plt
import pandas_bokeh
from bokeh.models import Whisker
from bokeh.plotting import show, output_notebook, figure
output_notebook()

pd.set_option('plotting.backend', 'pandas_bokeh')

import warnings
warnings.filterwarnings('ignore')



## Custom functions

In [2]:
DUPLICATE_COLS_TO_CHECK = ["INSERT COLS"] # Based on non-null cols gleaned from df.info()
DUPLICATE_VALUE_COL = "INSERT KEY" # Which col to return values from

## Summary, providing info on duplicates if asked
def mySummary(dataframe: pd.DataFrame, duplicateCheck: bool = False) -> str:
    if duplicateCheck:
        num_of_dupes = dataframe[dataframe[DUPLICATE_COLS_TO_CHECK].duplicated()][DUPLICATE_VALUE_COL].count()
        if num_of_dupes > 10:
            DUPLICATES = (num_of_dupes, "More than 10 duplicates, explore more")
        else:
            DUPLICATES = (num_of_dupes, dataframe[dataframe[DUPLICATE_COLS_TO_CHECK].duplicated()][DUPLICATE_VALUE_COL].values)

        info_string = """
            Rows: {} | Cols: {} | Duplicates: {}
            Duplicates from '{}' column: {}
            """.format(
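                dataframe.shape[0],   # rows (use the argument, not the global df)
                dataframe.shape[1],   # cols
                DUPLICATES[0],
                DUPLICATE_VALUE_COL,
                DUPLICATES[1],
                )
    else:
        info_string = """
            Rows: {} | Cols: {}
            """.format(
                dataframe.shape[0],   # rows (use the argument, not the global df)
                dataframe.shape[1],   # cols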
                )
    return info_string

In [3]:
def myInfo(dataframe: pd.DataFrame) -> pd.DataFrame:
    # Custom info dataframe
    my_info_df = pd.DataFrame()
    my_info_df["columns"] = dataframe.keys()

    # Custom info columns
    my_info_df["missing_data"] = dataframe.isna().sum().values
    my_info_df["col_dtype"] = dataframe.dtypes.values

    my_info_df["bytes"] = dataframe.memory_usage(index=False, deep=True).values
    my_info_df["nunique"] = dataframe.nunique().values
    return my_info_df

# Exploration

## Main Dataframe

In [4]:
# Data
df = pd.read_csv("data/World Happiness Report 2022.csv")
df.head()

Unnamed: 0,RANK,Country,Happiness score,Whisker-high,Whisker-low,Dystopia (1.83) + residual,Explained by: GDP per capita,Explained by: Social support,Explained by: Healthy life expectancy,Explained by: Freedom to make life choices,Explained by: Generosity,Explained by: Perceptions of corruption
0,1,Finland,7.821,7.886,7.756,2.518,1.892,1.258,0.775,0.736,0.109,0.534
1,2,Denmark,7.636,7.71,7.563,2.226,1.953,1.243,0.777,0.719,0.188,0.532
2,3,Iceland,7.557,7.651,7.464,2.32,1.936,1.32,0.803,0.718,0.27,0.191
3,4,Switzerland,7.512,7.586,7.437,2.153,2.026,1.226,0.822,0.677,0.147,0.461
4,5,Netherlands,7.415,7.471,7.359,2.137,1.945,1.206,0.787,0.651,0.271,0.419
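
As a sanity check on the column descriptions, the "Dystopia + residual" term and the six "Explained by" columns should add up to the happiness score. A minimal sketch in plain Python, using the first three rows of the output above (the dictionary layout and variable names are illustrative assumptions):

```python
# Values copied from the df.head() output:
# parts = dystopia + residual, GDP, social support, life expectancy,
#         freedom, generosity, corruption
rows = {
    "Finland": {"score": 7.821, "parts": [2.518, 1.892, 1.258, 0.775, 0.736, 0.109, 0.534]},
    "Denmark": {"score": 7.636, "parts": [2.226, 1.953, 1.243, 0.777, 0.719, 0.188, 0.532]},
    "Iceland": {"score": 7.557, "parts": [2.32, 1.936, 1.32, 0.803, 0.718, 0.27, 0.191]},
}

# Absolute gap between the sum of the components and the reported score
gaps = {
    country: abs(sum(row["parts"]) - row["score"])
    for country, row in rows.items()
}

for country, gap in gaps.items():
    print(f"{country}: gap between sum of parts and reported score = {gap:.3f}")
```

The gaps are a few thousandths at most, which is consistent with each column being rounded to three decimals.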


In [5]:
df = pd.read_csv("data/World Happiness Report 2022.csv")
# Conclusion:
# 1 - Don't need rank as it's just a copy of index+1
df.drop("RANK",axis=1,inplace=True)
# 2 - I don't like column names with mixed cases or spaces:
rename_to = {
    "Country": "country",
    "Happiness score": "happiness_score",
    "Whisker-high": "whisker_high",
    "Whisker-low": "whisker_low",
    "Dystopia (1.83) + residual": "dystopia",
    "Explained by: GDP per capita": "gdp_per_capita",
    "Explained by: Social support": "social_support",
    "Explained by: Healthy life expectancy": "life_expectancy",
    "Explained by: Freedom to make life choices": "free_life_choices",
    "Explained by: Generosity": "generosity",
    "Explained by: Perceptions of corruption": "corruption_perception"}
df.rename(columns=rename_to, inplace=True)
df.head(3)

Unnamed: 0,country,happiness_score,whisker_high,whisker_low,dystopia,gdp_per_capita,social_support,life_expectancy,free_life_choices,generosity,corruption_perception
0,Finland,7.821,7.886,7.756,2.518,1.892,1.258,0.775,0.736,0.109,0.534
1,Denmark,7.636,7.71,7.563,2.226,1.953,1.243,0.777,0.719,0.188,0.532
2,Iceland,7.557,7.651,7.464,2.32,1.936,1.32,0.803,0.718,0.27,0.191


## Summary

In [6]:
# Dataframe optimisation
df["country"] = df["country"].astype("string")
for col in df.columns[1:]:
    df[col] = df[col].astype('float32')

# Summary & info
print(mySummary(df))
myInfo(df)


            Rows: 146 | Cols: 11
            


Unnamed: 0,columns,missing_data,col_dtype,bytes,nunique
0,country,0,string,9545,146
1,happiness_score,0,float32,584,141
2,whisker_high,0,float32,584,144
3,whisker_low,0,float32,584,141
4,dystopia,0,float32,584,138
5,gdp_per_capita,0,float32,584,141
6,social_support,0,float32,584,133
7,life_expectancy,0,float32,584,134
8,free_life_choices,0,float32,584,128
9,generosity,0,float32,584,116


In [7]:
# Outlier inspection 1
df.describe().style.background_gradient(cmap = "PuBu")

Unnamed: 0,happiness_score,whisker_high,whisker_low,dystopia,gdp_per_capita,social_support,life_expectancy,free_life_choices,generosity,corruption_perception
count,146.0,146.0,146.0,146.0,146.0,146.0,146.0,146.0,146.0,146.0
mean,5.553575,5.673589,5.433568,1.831808,1.410445,0.905863,0.586171,0.517226,0.147377,0.154781
std,1.086843,1.065621,1.10938,0.534994,0.421663,0.280122,0.176336,0.145859,0.082799,0.127514
min,2.404,2.469,2.339,0.187,0.0,0.0,0.0,0.0,0.0,0.0
25%,4.88875,5.00625,4.75475,1.55525,1.0955,0.732,0.46325,0.4405,0.089,0.06825
50%,5.5685,5.68,5.453,1.8945,1.4455,0.9575,0.6215,0.5435,0.1325,0.1195
75%,6.305,6.44875,6.19,2.153,1.78475,1.11425,0.71975,0.626,0.19775,0.1985
max,7.821,7.886,7.756,2.844,2.209,1.32,0.942,0.74,0.468,0.587


## Inspection of additional information

In [8]:
# Countries whose averages are based on 2019 data (marked with an asterisk)
df[df["country"].str.contains("*", regex=False)]
# TODO: Explore the 'fairness' of including these countries in the updated statistics.

Unnamed: 0,country,happiness_score,whisker_high,whisker_low,dystopia,gdp_per_capita,social_support,life_expectancy,free_life_choices,generosity,corruption_perception
5,Luxembourg*,7.404,7.501,7.307,2.042,2.209,1.155,0.79,0.7,0.12,0.388
38,Guatemala*,6.262,6.46,6.064,2.746,1.274,0.831,0.522,0.662,0.112,0.115
49,Kuwait*,6.106,6.235,5.977,1.621,1.904,0.983,0.747,0.617,0.087,0.147
64,Belarus*,5.821,5.95,5.693,1.811,1.562,1.157,0.629,0.342,0.04,0.282
77,Turkmenistan*,5.474,5.578,5.371,1.16,1.484,1.319,0.516,0.649,0.314,0.032
78,North Cyprus*,5.467,5.609,5.325,1.078,1.815,0.888,0.819,0.523,0.13,0.213
85,Libya*,5.33,5.543,5.118,1.544,1.476,0.943,0.606,0.477,0.106,0.179
91,Azerbaijan*,5.173,5.265,5.082,1.098,1.458,1.093,0.56,0.601,0.023,0.341
92,Gambia*,5.164,5.409,4.918,2.531,0.785,0.621,0.369,0.367,0.388,0.103
96,Liberia*,5.122,5.428,4.815,2.844,0.636,0.67,0.309,0.405,0.178,0.08


In [9]:
# Adding confidence column: width of the interval between the whiskers (narrower = more precise)
df["confidence"] = df["whisker_high"] - df["whisker_low"]
df.sort_values("confidence")

Unnamed: 0,country,happiness_score,whisker_high,whisker_low,dystopia,gdp_per_capita,social_support,life_expectancy,free_life_choices,generosity,corruption_perception,confidence
135,India,3.777,3.828,3.726,0.795,1.167,0.376,0.471,0.647,0.198,0.123,0.102
4,Netherlands,7.415,7.471,7.359,2.137,1.945,1.206,0.787,0.651,0.271,0.419,0.112
8,Israel,7.364,7.426,7.301,2.634,1.826,1.221,0.818,0.568,0.155,0.143,0.125
0,Finland,7.821,7.886,7.756,2.518,1.892,1.258,0.775,0.736,0.109,0.534,0.130
71,China,5.585,5.650,5.520,1.516,1.508,0.958,0.705,0.656,0.099,0.142,0.130
...,...,...,...,...,...,...,...,...,...,...,...,...
115,Comoros*,4.609,4.849,4.368,2.304,0.899,0.476,0.424,0.185,0.195,0.125,0.481
103,Niger*,5.003,5.247,4.760,2.667,0.570,0.560,0.326,0.571,0.165,0.145,0.487
92,Gambia*,5.164,5.409,4.918,2.531,0.785,0.621,0.369,0.367,0.388,0.103,0.491
129,Chad*,4.251,4.503,3.999,2.419,0.662,0.506,0.225,0.180,0.182,0.077,0.504


## Hypothesis
- No missing data
- No outliers

# Visualisation

## Stacked barplot with confidence

In [10]:
df.head(3)

Unnamed: 0,country,happiness_score,whisker_high,whisker_low,dystopia,gdp_per_capita,social_support,life_expectancy,free_life_choices,generosity,corruption_perception,confidence
0,Finland,7.821,7.886,7.756,2.518,1.892,1.258,0.775,0.736,0.109,0.534,0.13
1,Denmark,7.636,7.71,7.563,2.226,1.953,1.243,0.777,0.719,0.188,0.532,0.147
2,Iceland,7.557,7.651,7.464,2.32,1.936,1.32,0.803,0.718,0.27,0.191,0.187


In [11]:
df_factors = df.copy()
df_factors.drop(axis=1, columns=["dystopia","whisker_high","whisker_low","happiness_score","confidence"], inplace=True)
df_factors["dystopia"] = df["dystopia"]
df_factors["happiness_score"] = df["happiness_score"]
df_factors["lower"] = df["happiness_score"] - df["confidence"]/2
df_factors["upper"] = df["happiness_score"] + df["confidence"]/2
df_factors.sort_values("happiness_score",inplace=True)
df_i = df_factors.set_index("country").copy()
df_i.head(3)


country,gdp_per_capita,social_support,life_expectancy,free_life_choices,generosity,corruption_perception,dystopia,happiness_score,lower,upper
Afghanistan,0.758,0.0,0.289,0.0,0.089,0.005,1.263,2.404,2.339,2.469
Lebanon,1.392,0.498,0.631,0.103,0.082,0.034,0.216,2.955,2.8615,3.0485
Zimbabwe,0.947,0.69,0.27,0.329,0.106,0.105,0.548,2.995,2.88,3.11


In [24]:
df_factors.to_csv("data.csv")

In [25]:
from bokeh.models import ColumnDataSource,LabelSet,Whisker
from bokeh.palettes import Category10_7
from bokeh.io import curdoc

# apply theme to current document
curdoc().theme = "dark_minimal"

# TODO: Make the legend dynamic, aka filter by factor.
# TODO: Make the label dynamic for above todo & also figure out how to format to 3sf

# Vars
WIDTH,HEIGHT = 1300,2300
TOOLS = "crosshair,pan,wheel_zoom,box_zoom,reset,hover,save"
TOOLTIPS = [
    ("country:","@country"),
    ("happiness:","@happiness_score"),
    ("GDP per Capita:", "@gdp_per_capita")
]

factors = list(df_i.columns)[:-3]
country = list(df_i.index)
source = ColumnDataSource(df_i)

# Plot
p = figure(
    width=WIDTH,
    height=HEIGHT,
    y_range = country,
    title = "Happiness score",
    tools=TOOLS,
    tooltips=TOOLTIPS
);

p.y_range.range_padding = 0.05
p.x_range.range_padding = 0


# Stacked horizontal bar
p.hbar_stack(
    factors,
    source=source,
    y="country",
    color=Category10_7,
    legend_label=factors,
    alpha=0.6,
    height=0.9
)

# Add label
labels = LabelSet(
    x=0.1, y='country',
    text='happiness_score',
    x_offset=0, y_offset=-3,
    source=source,
    text_font_size = "10px",
    text_color="white"
)
p.add_layout(labels);

p.legend.orientation = "horizontal"
p.legend.location = "top_center"
show(p)
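
On the TODO about formatting labels to 3 significant figures: one possible approach is to pre-format the scores as strings with Python's format mini-language. A minimal sketch, assuming values on the 0 to 10 scale; the helper name `to_3sf` is hypothetical:

```python
def to_3sf(value: float) -> str:
    """Format a number to 3 significant figures, keeping trailing zeros."""
    # "g" rounds to significant figures; "#" keeps trailing zeros,
    # so 2.404 becomes "2.40" rather than "2.4".
    return f"{value:#.3g}"


scores = [7.821, 2.404, 0.109]
labels = [to_3sf(score) for score in scores]
print(labels)  # ['7.82', '2.40', '0.109']
```

The formatted strings could then be stored in an extra column (for example a hypothetical `score_label`) and that column name passed as the `text` field of the `LabelSet`, instead of the raw `happiness_score` column.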


In [26]:
import plotly.graph_objects as go
from plotly.offline import init_notebook_mode, iplot, plot
init_notebook_mode(connected=True)

# TODO: Figure out how to make the api map dark theme
# TODO: Create widget that allows user to switch factors.

import plotly.io as pio
pio.templates['custom'] = go.layout.Template(
    layout_paper_bgcolor='rgba(0,0,0,0)',
    layout_plot_bgcolor='rgba(0,0,0,0)'
    )
pio.templates.default = 'plotly+custom'

MY_COLORSCALE = [
    [0.0, "rgb(165,0,38)"],
    [0.1111111111111111, "rgb(215,48,39)"],
    [0.2222222222222222, "rgb(244,109,67)"],
    [0.3333333333333333, "rgb(253,174,97)"],
    [0.4444444444444444, "rgb(254,224,144)"],
    [0.5555555555555556, "rgb(224,243,248)"],
    [0.6666666666666666, "rgb(171,217,233)"],
    [0.7777777777777778, "rgb(116,173,209)"],
    [0.8888888888888888, "rgb(69,117,180)"],
    [1.0, "rgb(49,54,149)"]
]

data=dict(
    type="choropleth",
    colorscale=MY_COLORSCALE,
    reversescale=True,
    locations=df['country'],
    locationmode="country names",
    z=df["happiness_score"],
    text=df["country"],
    colorbar={'title':'Happiness scale'}
)

layout=dict(
    title='Global Happiness of countries',
    geo=dict(showframe=False,projection={'type':'natural earth'}),
)

fig=go.Figure(data=data,layout=layout)

fig.update_layout(
    width = 1300, height=650,
    autosize=True,
    margin = {"r":0,"t":0,"l":0,"b":0},
    plot_bgcolor='rgb(17,17,17)',
    paper_bgcolor ='rgb(10,10,10)'
)

fig.update_traces(reversescale=False)
iplot(fig, validate=False)

In [21]:
# Create Bokeh-Table with DataFrame:
from bokeh.models.widgets import DataTable, TableColumn
from bokeh.models import ColumnDataSource

data_table = DataTable(
    columns=[TableColumn(field=Ci, title=Ci) for Ci in df_i.columns],
    source=ColumnDataSource(df_i),
    height=300,width=50
)

# Combine Table and Scatterplot via grid layout:
# pandas_bokeh.plot_grid([[data_table, p]], width=600, height=1700)


# Ideas from other contributors

## Notebook layout design & styling:
https://www.kaggle.com/code/georgyzubkov/world-happy-exploratory-data-analysis-with-plotly#list-tab

## <h1 style='background:#FF69B4; border:2; border-radius: 10px; font-size:250%; font-weight: bold; color:black'><center>World happiness report 2022</center></h1>

The Happy Planet Index (HPI) is an index of human well-being and environmental impact that was introduced by the New Economics Foundation in 2006. Each country's HPI value is a function of its average subjective life satisfaction, life expectancy at birth, and ecological footprint per capita. The exact function is a little more complex, but conceptually it approximates multiplying life satisfaction and life expectancy and dividing that by the ecological footprint. The index is weighted to give progressively higher scores to nations with lower ecological footprints.

The index is designed to challenge well-established indices of countries’ development, such as the gross domestic product (GDP) and the Human Development Index (HDI), which are seen as not taking sustainability into account. In particular, GDP is seen as inappropriate, as the usual ultimate aim of most people is not to be rich, but to be happy and healthy. Furthermore, it is believed that the notion of sustainable development requires a measure of the environmental costs of pursuing those goals.

Out of the 178 countries surveyed in 2006, the best scoring countries were Vanuatu, Colombia, Costa Rica, Dominica, and Panama. In 2009, Costa Rica was the best scoring country among the 143 analyzed, followed by the Dominican Republic, Jamaica, Guatemala and Vietnam. Tanzania, Botswana and Zimbabwe were featured at the bottom of the list.

For the 2012 ranking, 151 countries were compared, and the best scoring country for the second time in a row was Costa Rica, followed by Vietnam, Colombia, Belize and El Salvador. The lowest ranking countries in 2012 were Botswana, Chad and Qatar. In 2016, out of 140 countries, Costa Rica topped the index for the third time in a row. It was followed by Mexico, Colombia, Vanuatu and Vietnam. At the bottom were Chad, Luxembourg and Togo.

<a id='top'></a>
<div class="list-group" id="list-tab" role="tablist">

<h1 style='background:#FF69B4; border:0; border-radius: 10px; color:black'><center> TABLE OF CONTENTS </center></h1>

### [**1. IMPORTING LIBRARIES AND LOADING DATA**](#title-one)

### [**2. DATA INFORMATION**](#title-two)

### [**3. EXPLORATORY DATA ANALYSIS**](#title-three)

### [**4. STATISTICAL TESTS**](#title-four) 

### [**5. MACHINE LEARNING**](#title-five)

### [**6. RECOMMENDATIONS**](#title-six)

<a id="title-one"></a>
<h1 style='background:#FF69B4; border:2; border-radius: 10px; color:black'><center>IMPORTING LIBRARIES AND LOADING DATA</center></h1>