# World Happiness Exploration

Luckily, 2022 is coming to an end.

It's time to take stock of how happiness has been distributed among countries in recent years.

## Table of Contents

- [Data Preparation](#preparation)

  - [Load Data](#preparation-load)

  - [First Look](#preparation-first-look)

  - [Process the Data](#preparation-process)

- [Exploratory Data Analysis](#eda)

  - [General Plots](#eda-general)

  - [Explore Regions](#eda-regions)

  - [Data on Map](#eda-map)

  - [Influence of the Neighbour Countries](#eda-neighbours)

- [Conclusions](#conclusions)

In [1]:
from sys import executable
!{executable} -m pip install colorcet



In [2]:
import numpy as np
import pandas as pd
import colorcet as cc

from lets_plot import *
from lets_plot.mapping import *
from lets_plot.geo_data import *
from lets_plot.bistro import *
LetsPlot.setup_html()

The geodata is provided by © OpenStreetMap contributors and is made available here under the Open Database License (ODbL).


<a class="anchor" id="preparation"></a>

## Data Preparation

<a class="anchor" id="preparation-load"></a>

### Load Data

In [3]:
def get_data(path):
    from os import listdir

    def read_csv(fname):
        year = int(fname.split(".")[0])
        return pd.read_csv("{0}/{1}".format(path, fname)).assign(year=year)

    return pd.concat([
        read_csv(fname) for fname in listdir(path)
    ], ignore_index=True)
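A minimal sketch of what `get_data` does to each file, using in-memory frames instead of CSV files. The helper name `tag_year`, the country `Exampleland` and its score are made up for illustration; the column names are the ones that appear in the raw data.

```python
import pandas as pd

def tag_year(frame, fname):
    # Hypothetical helper: the year is taken from the file name, e.g. "2015.csv" -> 2015
    year = int(fname.split(".")[0])
    return frame.assign(year=year)

# Two toy "files" whose column names differ from year to year
frames = {
    "2015.csv": pd.DataFrame({"Country": ["Switzerland"], "Happiness Score": [7.587]}),
    "2019.csv": pd.DataFrame({"Country or region": ["Exampleland"], "Score": [6.0]}),
}

combined = pd.concat(
    [tag_year(frame, fname) for fname, frame in frames.items()],
    ignore_index=True,
)
print(combined)
```

Because each year may use its own column names, `pd.concat` keeps the union of all columns and fills the gaps with `NaN`. This is why the combined frame becomes wide and sparse.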

In [4]:
raw_df = get_data("data/world_happiness_report")
print(raw_df.shape)
raw_df.head()

(1231, 52)


Unnamed: 0,Country,Region,Happiness Rank,Happiness Score,Standard Error,Economy (GDP per Capita),Family,Health (Life Expectancy),Freedom,Trust (Government Corruption),...,Explained by: Freedom to make life choices,Explained by: Generosity,Explained by: Perceptions of corruption,Dystopia + residual,RANK,Happiness score,Whisker-high,Whisker-low,Dystopia (1.83) + residual,Explained by: GDP per capita
0,Switzerland,Western Europe,1.0,7.587,0.03411,1.39651,1.34951,0.94143,0.66557,0.41978,...,,,,,,,,,,
1,Iceland,Western Europe,2.0,7.561,0.04884,1.30232,1.40223,0.94784,0.62877,0.14145,...,,,,,,,,,,
2,Denmark,Western Europe,3.0,7.527,0.03328,1.32548,1.36058,0.87464,0.64938,0.48357,...,,,,,,,,,,
3,Norway,Western Europe,4.0,7.522,0.0388,1.459,1.33095,0.88521,0.66973,0.36503,...,,,,,,,,,,
4,Canada,North America,5.0,7.427,0.03553,1.32629,1.32261,0.90563,0.63297,0.32957,...,,,,,,,,,,


<a class="anchor" id="preparation-first-look"></a>

### First Look

In [5]:
raw_df.dtypes

Country                                        object
Region                                         object
Happiness Rank                                float64
Happiness Score                               float64
Standard Error                                float64
Economy (GDP per Capita)                      float64
Family                                        float64
Health (Life Expectancy)                      float64
Freedom                                       float64
Trust (Government Corruption)                 float64
Generosity                                    float64
Dystopia Residual                             float64
year                                            int64
Lower Confidence Interval                     float64
Upper Confidence Interval                     float64
Happiness.Rank                                float64
Happiness.Score                               float64
Whisker.high                                  float64
Whisker.low                 

**TODO:**

- Many columns are duplicated under different names.

In [6]:
raw_df.select_dtypes(include='object').nunique()

Country                                       192
Region                                         10
Country or region                             160
Country name                                  154
Regional indicator                             10
Explained by: Social support                  420
Explained by: Healthy life expectancy         404
Explained by: Freedom to make life choices    410
Explained by: Generosity                      387
Explained by: Perceptions of corruption       385
Happiness score                               141
Whisker-high                                  144
Whisker-low                                   141
Dystopia (1.83) + residual                    138
Explained by: GDP per capita                  141
dtype: int64

**TODO:**

- There are not that many regions.

In [7]:
raw_df.describe()

Unnamed: 0,Happiness Rank,Happiness Score,Standard Error,Economy (GDP per Capita),Family,Health (Life Expectancy),Freedom,Trust (Government Corruption),Generosity,Dystopia Residual,...,Perceptions of corruption,Ladder score,Standard error of ladder score,upperwhisker,lowerwhisker,Logged GDP per capita,Ladder score in Dystopia,Explained by: Log GDP per capita,Dystopia + residual,RANK
count,315.0,315.0,158.0,315.0,470.0,315.0,470.0,315.0,1084.0,315.0,...,613.0,302.0,302.0,302.0,302.0,302.0,302.0,302.0,302.0,147.0
mean,79.238095,5.378949,0.047885,0.899837,0.990347,0.594054,0.402828,0.140532,0.153545,2.212032,...,0.416267,5.502645,0.056111,5.612629,5.392641,9.363053,2.198127,0.922248,2.19829,74.0
std,45.538922,1.141531,0.017146,0.41078,0.318707,0.24079,0.150356,0.11549,0.167592,0.558728,...,0.34049,1.092111,0.020292,1.07485,1.110534,1.180595,0.229201,0.39183,0.595958,42.579338
min,1.0,2.839,0.01848,0.0,0.0,0.0,0.0,0.0,-0.300907,0.32858,...,0.0,2.523,0.025902,2.596,2.449,6.492642,1.972317,0.0,0.257241,1.0
25%,40.0,4.51,0.037268,0.5949,0.793,0.419645,0.297615,0.061315,0.064828,1.884135,...,0.082,4.7694,0.042,4.885588,4.636008,8.483295,1.972317,0.633963,1.823,37.5
50%,79.0,5.286,0.04394,0.97306,1.025665,0.64045,0.418347,0.10613,0.16214,2.21126,...,0.306,5.5245,0.052321,5.610132,5.426829,9.514612,1.972317,0.982509,2.223108,74.0
75%,118.5,6.269,0.0523,1.229,1.228745,0.78764,0.51685,0.17861,0.252,2.56347,...,0.780623,6.248375,0.066,6.362124,6.136381,10.356,2.43,1.241988,2.61975,110.5
max,158.0,7.587,0.13693,1.82427,1.610574,1.02525,0.66973,0.55191,0.838075,3.83772,...,0.939,7.842,0.173,7.904,7.78,11.647,2.43,1.751,3.482,147.0


**TODO:**

- The columns are not fully populated.

- Similar columns have similar aggregate values.
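Sparse columns with similar statistics suggest that the same quantity is stored under different names in different years. A small sketch of how such columns could be merged into one, assuming a hypothetical helper `coalesce` and a toy frame with made-up values:

```python
import numpy as np
import pandas as pd

def coalesce(frame, columns):
    # Hypothetical helper: take the first non-null value per row among `columns`
    return frame[columns].bfill(axis=1).iloc[:, 0]

# Toy frame: each year fills exactly one of the score columns (values are made up)
toy = pd.DataFrame({
    "year": [2015, 2017, 2018],
    "Happiness Score": [7.5, np.nan, np.nan],
    "Happiness.Score": [np.nan, 7.4, np.nan],
    "Score": [np.nan, np.nan, 7.6],
})
toy["happiness_score"] = coalesce(toy, ["Happiness Score", "Happiness.Score", "Score"])
print(toy[["year", "happiness_score"]])
```

A chain of `fillna` calls gives the same result; the row-wise `bfill` is just a compact alternative when the list of candidate columns is long.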

<a class="anchor" id="preparation-process"></a>

### Process the Data

In [8]:
df = raw_df.copy()

# country column
df["country"] = df["Country"].fillna(df["Country or region"]).fillna(df["Country name"])
df = df[df["country"] != "xx"]
df = df[~df["country"].str.contains("*", regex=False).astype(bool)]  # literal asterisk, no regex escaping needed
country_vc = df["country"].value_counts()
df = df[df["country"].isin(country_vc[country_vc == country_vc.max()].index)]
# region column
df["region"] = df["country"].replace(df[["country", "Region"]].set_index("country").dropna().to_dict()["Region"])
# happiness score column
df["happiness_score"] = df["Happiness Score"].fillna(df["Happiness.Score"])\
                                             .fillna(df["Score"])\
                                             .fillna(df["Ladder score"])\
                                             .fillna(df["Happiness score"])\
                                             .astype(str).str.replace(",", ".").astype(float)
# happiness rank column
df.sort_values(by=["year", "happiness_score"], ascending=[True, False], inplace=True)
df["happiness_rank"] = df.groupby("year").cumcount() + 1
# drop extra columns
df = df[["year", "country", "region", "happiness_rank", "happiness_score"]]
# sort values
df = df.sort_values(by=["year", "happiness_rank"]).reset_index(drop=True)

print(df.shape)
df.head()

(936, 5)


Unnamed: 0,year,country,region,happiness_rank,happiness_score
0,2015,Switzerland,Western Europe,1,7.587
1,2015,Iceland,Western Europe,2,7.561
2,2015,Denmark,Western Europe,3,7.527
3,2015,Norway,Western Europe,4,7.522
4,2015,Canada,North America,5,7.427
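The rank is recomputed from the score rather than taken from the source columns. A sketch on a toy frame (countries `A`, `B`, `C` and their scores are made up) showing that sorting plus `cumcount` gives the same result as a per-year `rank`:

```python
import pandas as pd

toy = pd.DataFrame({
    "year": [2015, 2015, 2015, 2016, 2016],
    "country": ["A", "B", "C", "A", "B"],
    "happiness_score": [5.0, 7.0, 6.0, 6.5, 6.0],
})

# Variant 1: sort by score within each year, then number the rows
by_sort = toy.sort_values(by=["year", "happiness_score"], ascending=[True, False])
by_sort = by_sort.assign(happiness_rank=by_sort.groupby("year").cumcount() + 1)

# Variant 2: rank directly; ties are broken by order of appearance
by_rank = toy.assign(
    happiness_rank=toy.groupby("year")["happiness_score"]
                      .rank(method="first", ascending=False)
                      .astype(int)
)
print(by_rank)
```

The second variant does not depend on the row order, which may be convenient if the frame is re-sorted later.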


In [9]:
years = df.year.unique()
years

array([2015, 2016, 2017, 2018, 2019, 2020, 2021, 2022], dtype=int64)

In [10]:
years_range = [years.min(), years.max()]
years_range

[2015, 2022]

<a class="anchor" id="eda"></a>

## Exploratory Data Analysis

<a class="anchor" id="eda-general"></a>

### General Plots

In [11]:
top_n = 7
cc_palette = "glasbey_bw"

top_df = df[df.happiness_rank <= top_n]
color_replace = {country: cc.palette[cc_palette][i] for i, country in enumerate(top_df.country.unique())}
top_df = top_df.assign(color=top_df.country.replace(color_replace))

print(top_df.shape)
top_df.head()

(56, 6)


Unnamed: 0,year,country,region,happiness_rank,happiness_score,color
0,2015,Switzerland,Western Europe,1,7.587,#d60000
1,2015,Iceland,Western Europe,2,7.561,#8c3bff
2,2015,Denmark,Western Europe,3,7.527,#018700
3,2015,Norway,Western Europe,4,7.522,#00acc6
4,2015,Canada,North America,5,7.427,#97ff00


In [12]:
bottom_n = 7
cc_palette = "glasbey_bw"

bottom_df = df.sort_values(by=["happiness_rank", "year"]).iloc[-len(years)*bottom_n:].sort_values(by=["year", "happiness_rank"])
color_replace = {country: cc.palette[cc_palette][top_n + i] for i, country in enumerate(bottom_df.country.unique())}
bottom_df = bottom_df.assign(color=bottom_df.country.replace(color_replace))

print(bottom_df.shape)
bottom_df.head()

(56, 6)


Unnamed: 0,year,country,region,happiness_rank,happiness_score,color
110,2015,Tanzania,Sub-Saharan Africa,111,3.781,#ffa52f
111,2015,Guinea,Sub-Saharan Africa,112,3.656,#00009c
112,2015,Ivory Coast,Sub-Saharan Africa,113,3.655,#857067
113,2015,Burkina Faso,Sub-Saharan Africa,114,3.587,#004942
114,2015,Afghanistan,Southern Asia,115,3.575,#4f2a00


In [13]:
rank_df = df[["year", "country", "happiness_rank"]].pivot(index="country", columns="year", values="happiness_rank")[years_range]
rank_df.columns = years_range
rank_df["progress"] = rank_df[years_range[1]] - rank_df[years_range[0]]
rank_df["trend"] = np.sign(rank_df.progress).astype(int)
rank_df.sort_values(by="progress", ascending=False, inplace=True)
rank_df.reset_index(inplace=True)

print(rank_df.shape)
rank_df.head()

(117, 5)


Unnamed: 0,country,2015,2022,progress,trend
0,Venezuela,21,90,69,1
1,Zambia,69,111,42,1
2,Jordan,66,108,42,1
3,Pakistan,65,102,37,1
4,Lebanon,81,116,35,1


In [14]:
ggplot(df, aes("year", "happiness_score")) + \
    geom_violin(color="#084594", fill="#9ecae1") + \
    geom_boxplot(color="#084594", width=.2) + \
    scale_x_continuous(breaks=years) + \
    ylab("happiness score") + \
    ggtitle("Happiness score density through the years") + \
    theme_minimal()

**TODO:**

- The happiness score grows slowly over time.

- The distribution of the happiness score shifts toward higher values over time.
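The shift is easier to judge with numbers than from the violins alone. A minimal sketch, assuming a frame with `year` and `happiness_score` columns like `df`; the helper name `yearly_summary` and the toy values are made up:

```python
import pandas as pd

def yearly_summary(frame):
    # Hypothetical helper: mean and median happiness score per year
    return frame.groupby("year")["happiness_score"].agg(["mean", "median"])

# Toy data, not the real report values
toy = pd.DataFrame({
    "year": [2015, 2015, 2015, 2016, 2016, 2016],
    "happiness_score": [4.0, 5.0, 6.0, 4.5, 5.0, 7.0],
})
summary = yearly_summary(toy)
print(summary)
```

Calling `yearly_summary(df)` on the prepared data would show whether both the mean and the median move upward, or only one of them.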

In [15]:
ggplot(top_df, aes("year", "happiness_rank", color="color")) + \
    geom_line(size=1) + \
    geom_point(size=4, tooltips=layer_tooltips().line("@|@country")) + \
    scale_x_continuous(breaks=years) + \
    scale_y_continuous(name="happiness rank", trans="reverse", \
                       breaks=top_df.happiness_rank.unique()) + \
    scale_color_identity() + \
    ggtitle("Top {0} countries by happiness rank for past {1} years".format(top_n, len(years))) + \
    theme_minimal()

**TODO:**

- Among the top countries, Norway has the saddest trend and Finland the most positive one.

- For the most part, the top is stable.

- There is a suspicion that the top consists mostly of northern (European) countries.
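The suspicion about regions can be checked by counting regions among the top entries. A sketch that uses only the five 2015 rows shown in the `df.head()` output, so the shares are illustrative and not the result for the full `top_df`:

```python
import pandas as pd

# The first five rows of 2015, as printed earlier in the notebook
top_sample = pd.DataFrame({
    "country": ["Switzerland", "Iceland", "Denmark", "Norway", "Canada"],
    "region": ["Western Europe"] * 4 + ["North America"],
    "happiness_rank": [1, 2, 3, 4, 5],
})

# Share of each region among the listed entries
region_share = top_sample["region"].value_counts(normalize=True)
print(region_share)
```

The same call on `top_df.region` would give the shares over all years and all top places.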

In [16]:
ggplot(bottom_df, aes("year", "happiness_rank", color="color")) + \
    geom_line(size=1) + \
    geom_point(size=4, tooltips=layer_tooltips().line("@|@country")) + \
    scale_x_continuous(breaks=years) + \
    scale_y_continuous(name="happiness rank", trans="reverse", \
                       breaks=bottom_df.happiness_rank.unique()) + \
    scale_color_identity() + \
    ggtitle("Bottom {0} countries by happiness rank for past {1} years".format(bottom_n, len(years))) + \
    theme_minimal()

**TODO:**

- There is more chaos among the less happy countries; usually nobody stays here for long.

- Nevertheless, things are consistently bad for Tanzania and Afghanistan.

In [17]:
ggplot(top_df) + \
    geom_pie(aes(fill=as_discrete("color", order_by='..count..')), \
             stroke=1, stroke_color="black", \
             tooltips=layer_tooltips().title("@country")\
                     .format("@..count..", "d").line("count|@..count..")) + \
    facet_wrap(facets="happiness_rank") + \
    scale_fill_identity() + \
    ggtitle("Share of each of top {0} countries in each value of happiness rank for past {1} years"\
            .format(top_n, len(years))) + \
    theme_minimal() + theme(axis_title='blank')

**TODO:**

- Finland is most often in first place, Denmark in second, and Iceland in third; the other places are more varied.

In [18]:
ggplot(bottom_df) + \
    geom_pie(aes(fill=as_discrete("color", order_by='..count..')), \
             stroke=1, stroke_color="black", \
             tooltips=layer_tooltips().title("@country")\
                     .format("@..count..", "d").line("count|@..count..")) + \
    facet_wrap(facets="happiness_rank") + \
    scale_fill_identity() + \
    ggtitle("Share of each of bottom {0} countries in each value of happiness rank for past {1} years"\
            .format(bottom_n, len(years))) + \
    theme_minimal() + theme(axis_title='blank')

**TODO:**

- Afghanistan is most often in last place.

In [19]:
ggplot(pd.concat([rank_df.head(top_n), rank_df.tail(bottom_n)])) + \
    geom_segment(aes(y=str(years_range[0]), yend=str(years_range[1]), color=as_discrete("trend")), \
                 x=years_range[0], xend=years_range[1], arrow=arrow()) + \
    geom_point(aes(y=str(years_range[0]), color=as_discrete("trend")), x=years_range[0], \
               tooltips=layer_tooltips().title("@country").line("@|@progress")) + \
    geom_point(aes(y=str(years_range[1]), color=as_discrete("trend")), x=years_range[1], \
               tooltips=layer_tooltips().title("@country").line("@|@progress")) + \
    scale_x_continuous(name="years", breaks=years_range) + \
    scale_y_continuous(name="happiness rank", trans="reverse") + \
    scale_color_manual(values=["#a50026", "#006837"]) + \
    ggtitle("Extreme {0} countries by happiness rank change".format(top_n + bottom_n)) + \
    ggsize(600, 800) + \
    theme_minimal() + theme(legend_position='none')

**TODO:**

- Romania has the best trend.

- Venezuela has the worst trend.
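When reading `progress`, keep in mind that rank 1 is the best, so a positive difference means the country fell in the ranking. A small sketch: the Venezuela and Zambia ranks come from the `rank_df` output above, the two `Hypothetical` rows and the `direction` column are added for illustration only.

```python
import numpy as np
import pandas as pd

ranks = pd.DataFrame({
    "country": ["Venezuela", "Zambia", "Hypothetical A", "Hypothetical B"],
    2015: [21, 69, 80, 50],
    2022: [90, 111, 30, 50],
})

# Positive progress: the rank number grew, i.e. the country moved down
ranks["progress"] = ranks[2022] - ranks[2015]
# np.sign gives -1, 0 or 1 without a special case for zero
ranks["trend"] = np.sign(ranks["progress"]).astype(int)
ranks["direction"] = ranks["trend"].map({1: "fell", 0: "unchanged", -1: "rose"})
print(ranks)
```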

<a class="anchor" id="eda-regions"></a>

### Explore Regions

In [20]:
regions_df = df.groupby("region").agg({"country": "count", "happiness_score": ["mean", "std"]})
regions_df = regions_df.droplevel(0, axis=1).reset_index()
regions_df["count"] = (regions_df["count"] / years.size).astype(int)
regions_df.sort_values(by="count", ascending=False, inplace=True)

print(regions_df.shape)
regions_df.head()

(10, 4)


Unnamed: 0,region,count,mean,std
1,Central and Eastern Europe,24,5.538956,0.589009
8,Sub-Saharan Africa,21,4.414707,0.633909
9,Western Europe,19,6.846412,0.701921
3,Latin America and Caribbean,18,6.065798,0.5339
4,Middle East and Northern Africa,13,5.40789,0.938793


In [21]:
ggplot(df) + \
    geom_boxplot(aes(as_discrete("region", order_by="..middle.."), "happiness_score"), \
                 color="#084594", fill="#9ecae1") + \
    ylab("happiness score") + \
    ggtitle("Happiness score main aggregation values by region") + \
    ggsize(600, 600) + \
    theme_minimal()

**TODO:**

- Although Europeans lead in happiness when looking at individual countries, on average people from the Australian region feel happier.

In [22]:
ggplot(regions_df) + \
    geom_pie(aes(slice="count", fill="mean"), stat='identity', \
             stroke=1, stroke_color="black", size=40, \
             labels=layer_labels(["mean"]).format("@mean", ".3f"), \
             tooltips=layer_tooltips().title("@region")\
                                      .line("countries count|@count")\
                                      .format("@mean", ".3f").line("happiness score mean|@mean")\
                                      .format("@std", ".3f").line("happiness score std|@std")) + \
    scale_fill_gradient(name="mean happiness score", low="#d73027", high="#1a9850") + \
    ggtitle("Happiness score by region with countries counts") + \
    ggsize(800, 800) + \
    theme_classic() + theme(axis='blank')

**TODO:**

- Perhaps Australia and North America are doing so well because the sample is small.

- Western Europe clearly looks good.

- Sub-Saharan Africa clearly looks bad.
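One rough way to account for the sample size is to look at the standard error of the regional mean, which grows as the number of countries shrinks. A sketch under simplifying assumptions: each country is first averaged over the years, and countries are treated as independent observations. The helper name `region_uncertainty`, the regions `Small` and `Large`, and all values are made up.

```python
import numpy as np
import pandas as pd

def region_uncertainty(frame):
    # Hypothetical helper: average each country over the years first,
    # so that a region is described by one value per country
    per_country = frame.groupby(["region", "country"], as_index=False)["happiness_score"].mean()
    stats = per_country.groupby("region")["happiness_score"].agg(["count", "mean", "std"])
    # Standard error of the mean: std / sqrt(number of countries)
    stats["sem"] = stats["std"] / np.sqrt(stats["count"])
    return stats

# Toy data: a region with 2 countries and a region with 4 countries
toy = pd.DataFrame({
    "region": ["Small"] * 4 + ["Large"] * 8,
    "country": ["S1", "S1", "S2", "S2",
                "L1", "L1", "L2", "L2", "L3", "L3", "L4", "L4"],
    "year": [2015, 2016] * 6,
    "happiness_score": [7.0, 7.2, 7.3, 7.5,
                        4.9, 5.1, 5.1, 5.3, 5.3, 5.5, 5.5, 5.7],
})
stats = region_uncertainty(toy)
print(stats)
```

A region with only a couple of countries gets a wide uncertainty band even if its mean looks high, which is worth remembering when comparing it with large regions.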

<a class="anchor" id="eda-map"></a>

### Data on Map

In [23]:
def map_plot(data, year):
    local_df = data[data.year == year]
    countries = geocode_countries(data.country.unique()).inc_res().get_boundaries()
    return ggplot() + \
        geom_livemap(zoom=1) + \
        geom_map(aes(fill="happiness_score"), data=local_df, map=countries, map_join="country", \
                 size=0, alpha=.5, tooltips=layer_tooltips().title("@country")\
                                                            .line("happiness_score|^fill")) + \
        scale_fill_gradient(name="happiness score", low="#d73027", high="#1a9850", \
                            limits=[data.happiness_score.min(), data.happiness_score.max()]) + \
        ggtitle("Happiness score for each country in {0}".format(year)) + \
        theme(legend_position='bottom')

width, height = 500, 500
bunch = GGBunch()
bunch.add_plot(map_plot(df, df.year.min()), 0, 0, width, height)
bunch.add_plot(map_plot(df, df.year.max()), width, 0, width, height)
bunch.show()

**TODO:**

- It is interesting that everything around Afghanistan (the unhappiest country) looks calm. Let's examine the influence of neighbours next.

<a class="anchor" id="eda-neighbours"></a>

### Influence of the Neighbour Countries

In [24]:
class CountryRename:
    def __init__(self):
        self._replace = {}

    def search(self, names):
        names = names[~names.isin(self._replace.keys())]
        if names.empty:
            return self
        geocoded_countries = geocode_countries(names.unique()).ignore_not_found().get_geocodes()
        self._replace = {
            **self._replace,
            **geocoded_countries.set_index("country")["found name"].to_dict()
        }
        return self

    def replace(self, names):
        return names.replace(self._replace)

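    def transform(self, data, names_col):
        names = data[names_col].astype(str)
        self.search(names)
        # keep only the rows whose name was geocoded
        found = names.isin(self._replace.keys())
        result = data[found].reset_index(drop=True)
        # reset the index of the renamed values too, so they align with `result`
        result[names_col] = self.replace(names[found]).reset_index(drop=True)
        return result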

In [25]:
cn = CountryRename()

nb_df = df[df.year == years_range[1]][["country", "happiness_score"]].reset_index(drop=True)
nb_df = cn.transform(nb_df, "country")

cb_df = pd.read_csv("https://raw.githubusercontent.com/JetBrains/lets-plot-docs/master/data/country_borders.csv")
cb_df = cn.transform(cb_df, "country_name")
cb_df = cn.transform(cb_df, "country_border_name")
cb_df = cb_df[cb_df.country_name.isin(nb_df.country)]
cb_df = cb_df[cb_df.country_border_name.isin(nb_df.country)]
cb_df["border_happiness_score"] = cb_df.country_border_name.replace(nb_df.set_index("country")\
                                                           .to_dict()["happiness_score"]).astype(float)
cb_df = cb_df.groupby("country_name").agg({
    "country_border_name": "count",
    "border_happiness_score": ["min", "mean", "max"]
}).reset_index()
cb_df.columns = ["country", "neighbours_count", "neighbour_min_happiness_score", \
                 "neighbour_mean_happiness_score", "neighbour_max_happiness_score"]

nb_df = nb_df.merge(cb_df, on="country", how="left")
nb_df.neighbours_count = nb_df.neighbours_count.fillna(0).astype(int)
nb_df["neighbours"] = nb_df.neighbours_count.apply(lambda r: str(r) if r <= 2 else "≥ 3")
nb_df = nb_df.sort_values(by="happiness_score").reset_index(drop=True)

print(nb_df.shape)
nb_df.head()

(117, 7)


Unnamed: 0,country,happiness_score,neighbours_count,neighbour_min_happiness_score,neighbour_mean_happiness_score,neighbour_max_happiness_score,neighbours
0,افغانستان,2.404,5,4.516,5.2858,6.063,≥ 3
1,لبنان,2.955,2,5.585,5.7065,5.828,2
2,Zimbabwe,2.995,0,,,,0
3,Sierra Leone,3.574,0,,,,0
4,Tanzania,3.702,0,,,,0
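The neighbour aggregation can be sketched on a toy border table. This version joins the scores with `merge` instead of `replace`; the helper name `neighbour_stats`, the countries `A` to `D`, their scores and their borders are all made up, and only the column names follow the cell above.

```python
import pandas as pd

scores = pd.DataFrame({
    "country": ["A", "B", "C", "D"],
    "happiness_score": [7.0, 5.0, 3.0, 6.0],
})
# One row per (country, neighbour) pair; D has no land borders
borders = pd.DataFrame({
    "country_name": ["A", "B", "B", "C"],
    "country_border_name": ["B", "A", "C", "B"],
})

def neighbour_stats(scores, borders):
    # Hypothetical helper: attach the neighbour's score to every border pair
    joined = borders.merge(
        scores.rename(columns={"country": "country_border_name",
                               "happiness_score": "border_happiness_score"}),
        on="country_border_name", how="inner",
    )
    stats = joined.groupby("country_name")["border_happiness_score"].agg(
        neighbours_count="count",
        neighbour_min_happiness_score="min",
        neighbour_mean_happiness_score="mean",
        neighbour_max_happiness_score="max",
    ).reset_index().rename(columns={"country_name": "country"})
    # Countries without neighbours keep NaN statistics and a count of 0
    result = scores.merge(stats, on="country", how="left")
    result["neighbours_count"] = result["neighbours_count"].fillna(0).astype(int)
    return result

toy_nb = neighbour_stats(scores, borders)
print(toy_nb)
```

Countries with no borders in the table, like `D` here, end up with `NaN` neighbour statistics, which matches the empty cells in the output above.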


In [26]:
ggplot(nb_df) + \
    geom_area_ridges(aes("happiness_score", "neighbours", fill="..quantile.."), \
                     scale=1.5, quantiles=[.1, .25, .5, .75, .9], quantile_lines=True) + \
    scale_y_discrete(breaks=["0", "1", "2", "≥ 3"]) + \
    scale_fill_gradient(low="#d73027", high="#1a9850") + \
    xlab("happiness score") + \
    ggtitle("Distribution of happiness score through neighbours count") + \
    theme_minimal()

In [27]:
ggplot(nb_df, aes(x="happiness_score", fill="happiness_score")) + \
    geom_dotplot(binwidth=.25) + \
    facet_grid(y="neighbours") + \
    scale_fill_gradient(name="happiness score", low="#d73027", high="#1a9850") + \
    xlab("happiness score") + \
    ggtitle("Dotplot distribution of happiness score through neighbours count") + \
    ggsize(600, 800) + \
    theme_minimal() + theme(axis_title_y='blank', axis_text_y='blank')

In [28]:
corr_plot(nb_df.corr(numeric_only=True))\
    .tiles(type='full').labels(type='full', color="black")\
    .palette_gradient(low="#d53e4f", mid="#ffffbf", high="#3288bd")\
    .build() + \
    ggtitle("Correlations between happiness score aggregated values") + \
    ggsize(800, 600)

In [29]:
ggplot(nb_df[nb_df.neighbours != "0"], \
       aes("happiness_score", "neighbour_mean_happiness_score", color="neighbours", fill="neighbours")) + \
    geom_smooth(method='loess') + \
    geom_point() + \
    facet_grid(y="neighbours") + \
    scale_color_brewer(type='qual', palette="Set1") + \
    scale_fill_brewer(type='qual', palette="Set1") + \
    xlab("happiness score") + ylab("mean happiness score of neighbours") + \
    ggtitle("Dependency between happiness score and happiness score of neighbours") + \
    theme_minimal()

<a class="anchor" id="conclusions"></a>

## Conclusions

**TODO:**

- Overall, the study of neighbours did not reveal anything.

- Happy New Year!