# 🙋 WORLD HAPPINESS EXPLORATORY DATA ANALYSIS 😊

![](https://4.bp.blogspot.com/-I3nGmxlA3rg/V4Q49lL1rqI/AAAAAAAAW2M/RpOx1_kTT5YSMSSHuuqr1IA2-sP0QoNuQCLcB/s640/pacific-ocean-4000x2486-big-sur-california-beach-mcway-falls-sunset-389.jpg)

# Introduction
* The World Happiness Report is a landmark survey of the state of global happiness. 
* The report continues to gain global recognition as governments, organizations and civil society increasingly use happiness indicators to inform their policy-making decisions. 
* Leading experts across fields – economics, psychology, survey analysis, national statistics, health, public policy and more – describe how measurements of well-being can be used effectively to assess the progress of nations. 
* The reports review the state of happiness in the world today and show how the new science of happiness explains personal and national variations in happiness.

<a id = "1"></a>
### Analysis Content
1. [Required Libraries](#1)
2. [Data Content](#2)
3. [Starting to Analyse Data](#3)
4. [Data Distribution for the 2021 Dataset](#4)
5. [The Happiest and Unhappiest Countries in 2021](#5)
6. [Ladder Score Distribution According to the Geographical Location](#6)
7. [Ladder Score Distribution According to Countries with the World Map Visualization](#7)
8. [Most Generous and Most Ungenerous Countries in 2021](#8)
9. [Generosity According to Countries with the World Map Visualization](#9)
10. [Generosity According to the Geographical Location in 2021](#10)

### 1. Required Libraries
* I am going to import the libraries that I need.

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import matplotlib.pyplot as plt # for visualization
import seaborn as sns
sns.set_style("whitegrid") # white plot background with grid lines
import plotly.express as px # for world map visualization
import plotly.graph_objs as go
from plotly.offline import init_notebook_mode, iplot # for working as offline


# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

import warnings
warnings.filterwarnings("ignore") # For filtering and not showing the warnings. 

<a id = "2"></a>
### 2. Data Content
* The happiness scores and rankings use data from the Gallup World Poll. 
* The columns following the happiness score estimate the extent to which each of six factors – economic production, social support, life expectancy, freedom, absence of corruption, and generosity – contribute to making life evaluations higher in each country than they are in Dystopia, a hypothetical country that has values equal to the world’s lowest national averages for each of the six factors. They have no impact on the total score reported for each country, but they do explain why some countries rank higher than others.

* **Ladder score**: Happiness score or subjective well-being. This is the national average response to the question of life evaluations.
* **Logged GDP per capita**: The logarithm of GDP per capita. The GDP-per-capita time series is extended from 2019 to 2020 using country-specific forecasts of real GDP growth in 2020.
* **Social support**: Social support refers to assistance or support provided by members of social networks (like government) to an individual.
* **Healthy life expectancy**: Healthy life expectancy is the average life in good health - that is to say without irreversible limitation of activity in daily life or incapacities - of a fictitious generation subject to the conditions of mortality and morbidity prevailing that year.
* **Freedom to make life choices**: Freedom to make life choices is the national average of binary responses to the GWP question “Are you satisfied or dissatisfied with your freedom to choose what you do with your life?”
* **Generosity**: Generosity is the residual of regressing the national average of responses to the GWP question “Have you donated money to a charity in the past month?” on GDP per capita.
* **Perceptions of corruption**: The measure is the national average of the survey responses to two questions in the GWP: “Is corruption widespread throughout the government or not?” and “Is corruption widespread within businesses or not?”
* **Ladder score in Dystopia**: Dystopia is an imaginary country that has the world's least-happy people, with values equal to the world’s lowest national averages for each of the six factors. It serves as a benchmark against which to compare the contributions of each factor. Since life would be very unpleasant in a country with the world's lowest incomes, lowest life expectancy, lowest generosity, most corruption, least freedom, and least social support, it is referred to as “Dystopia,” in contrast to Utopia.

World Happiness Report Official Website: https://worldhappiness.report/

<a id = "3"></a>
### 3. Starting to Analyse Data

In [None]:
# Read the survey results collected before 2021:
before_2021 = pd.read_csv("/kaggle/input/world-happiness-report-2021/world-happiness-report.csv")
before_2021

Let us get more information about the data using the **describe()** method.

In [None]:
before_2021.describe()

* This table shows the minimum and maximum values of each variable.
* If the mean and median of a variable are close, its distribution is roughly symmetric and may be approximately Gaussian. For example, in the "year" column the mean is 2013.216008 and the median is 2013.000000, so the difference is quite small. Closeness of mean and median alone does not prove normality.
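
The mean-versus-median comparison can be made systematic by computing both for every numeric column, together with the sample skewness. The sketch below uses a small invented DataFrame; the column names `symmetric` and `skewed` are hypothetical and do not come from the happiness data.

```python
import pandas as pd

# Toy data, invented for illustration only (not from the happiness dataset).
toy = pd.DataFrame({
    "symmetric": [1, 2, 3, 4, 5, 6, 7],
    "skewed": [1, 1, 1, 2, 2, 3, 20],
})

# One row per column: mean, median and sample skewness.
summary = pd.DataFrame({
    "mean": toy.mean(),
    "median": toy.median(),
    "skew": toy.skew(),
})

# A gap near zero together with a skew near zero suggests a symmetric shape.
summary["mean_minus_median"] = summary["mean"] - summary["median"]
print(summary)
```

The same three calls should work on `before_2021.select_dtypes("number")`; a large positive or negative skew is a hint that the Gaussian assumption does not hold for that column.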

Besides, using the **info()** method, we can get a general overview of our dataset.

In [None]:
before_2021.info()

* In total, we have 11 columns and 1949 rows.
* The column names are "Country name", "year", "Life Ladder", "Log GDP per capita", "Social support", "Healthy life expectancy at birth", "Freedom to make life choices", "Generosity", "Perceptions of corruption", "Positive affect" and "Negative affect".
* Country names are strings and years are integers. The remaining variables have float data types.
* Every column except "Country name", "year" and "Life Ladder" has some missing values.
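
The missing values reported by `info()` can also be counted directly. This is a minimal sketch on an invented three-row table (the values are made up); the same `isna().sum()` call can be applied to `before_2021`.

```python
import numpy as np
import pandas as pd

# Toy table with deliberately missing entries (values are invented).
toy = pd.DataFrame({
    "Country name": ["A", "B", "C"],
    "year": [2019, 2020, 2020],
    "Generosity": [0.1, np.nan, -0.05],
    "Social support": [np.nan, np.nan, 0.8],
})

# Number of missing values in each column.
missing_per_column = toy.isna().sum()

# Keep only the columns that actually have gaps.
columns_with_gaps = missing_per_column[missing_per_column > 0]
print(columns_with_gaps)
```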

Let us analyze the 2021 dataset:

In [None]:
dataset_2021 = pd.read_csv("/kaggle/input/world-happiness-report-2021/world-happiness-report-2021.csv")

In [None]:
dataset_2021

When we compare the dataset above with the previous one, before_2021, the column names and the number of rows differ. The reason is that feature extraction was applied to the second dataset: some new columns were added by classifying the existing ones. For example:

In [None]:
dataset_2021["Country name"].value_counts()

As you can see, there are many different countries, and visualizing all of them individually is not very meaningful. Grouping the countries by geographical location is a good choice at this point.

In [None]:
dataset_2021.describe()

In [None]:
dataset_2021.info()

* "Country name" and "Regional indicator" columns have string data type and the remaining variables are float values.
* We do not have any null values.

<a id = "4"></a>
### 4. Data Distribution for the 2021 Dataset

In the visualization part, we can use count plots for string variables. For numeric values, we prefer histograms in order to see the distribution.

Let us look at the countries:

In [None]:
dataset_2021["Country name"].unique()

We can also look at the region of these countries (Regional indicator):

In [None]:
dataset_2021["Regional indicator"].unique()

Using the **Seaborn library**, I am going to show how many countries belong to each region:

In [None]:
sns.countplot(y = dataset_2021["Regional indicator"])
plt.show()

* I am going to use box plots to visualize the "Social support", "Freedom to make life choices", "Generosity" and "Perceptions of corruption" columns. I chose these columns because their numeric values have similar ranges (roughly between 0 and 1).
* In this example I preferred a box plot to a histogram, because a box plot shows outliers more clearly.

You can see the general form of Box Plot:
![](https://miro.medium.com/max/1400/1*2c21SkzJMf3frPXPAR_gZA.png)

* Boxplots are a standardized way of displaying the distribution of data based on a five number summary (“minimum”, first quartile (Q1), median, third quartile (Q3), and “maximum”).

* **Median (Q2/50th Percentile)**: the middle value of the dataset.
* **First quartile (Q1/25th Percentile)**: the middle number between the smallest number (not the “minimum”) and the median of the dataset.
* **Third quartile (Q3/75th Percentile)**: the middle value between the median and the highest value (not the “maximum”) of the dataset.
* **Interquartile range (IQR)**: the distance from the 25th to the 75th percentile (Q3 - Q1).
* **Whiskers** (shown in blue)
* **Outliers** (shown as green circles)
* **“Maximum”**: Q3 + 1.5*IQR
* **“Minimum”**: Q1 - 1.5*IQR
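
The quantities in this list can be computed by hand with NumPy. The sketch below uses ten invented values (not taken from the dataset) and flags points outside the two 1.5*IQR limits as outliers.

```python
import numpy as np

# Ten made-up values between 0 and 1, with one unusually low point.
values = np.array([0.10, 0.62, 0.70, 0.75, 0.78, 0.81, 0.85, 0.88, 0.91, 0.95])

# Quartiles (NumPy's default linear interpolation between data points).
q1, median, q3 = np.percentile(values, [25, 50, 75])
iqr = q3 - q1

# The box plot "minimum" and "maximum" limits.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# Points beyond the limits are the ones drawn as individual outliers.
outliers = values[(values < lower_fence) | (values > upper_fence)]
print("Q1:", q1, "median:", median, "Q3:", q3, "IQR:", iqr)
print("limits:", lower_fence, upper_fence)
print("outliers:", outliers)
```

Note that Matplotlib-based box plots draw each whisker out to the most extreme data point that still lies inside these limits, so a drawn whisker can be shorter than the limit itself.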

The following picture compares a box plot of a nearly normal distribution with the probability density function (pdf) of a normal distribution:

![](https://miro.medium.com/max/656/1*NRlqiZGQdsIyAu0KzP7LaQ.png)

In [None]:
column_names = ["Social support", "Freedom to make life choices", "Generosity", "Perceptions of corruption"]
sns.boxplot(data = dataset_2021.loc[:,column_names], orient = "v")
plt.xticks(rotation = 60)
plt.show()

* Generosity has a slightly different range from the other three variables.
* Looking at the outliers, the "Perceptions of corruption" variable has more of them than the others.

In [None]:
column_names2 = ["Ladder score", "Logged GDP per capita"]
sns.boxplot(data = dataset_2021.loc[:, column_names2], palette = "Set3")
plt.xticks(rotation = 60)
plt.show()

In [None]:
column_name = ["Healthy life expectancy"]
sns.boxplot(data = dataset_2021.loc[:, column_name])
plt.xticks(rotation = 60)
plt.show()

<a id = "5"></a>
### 5. The Happiest and Unhappiest Countries in 2021

In [None]:
table_with_happiest_and_unhappiest = dataset_2021[(dataset_2021["Ladder score"] > 7.5) | (dataset_2021["Ladder score"] < 4)]
table_with_happiest_and_unhappiest

In [None]:
plt.title("The Happiest and Unhappiest Countries in 2021")
sns.barplot(x = table_with_happiest_and_unhappiest["Ladder score"], y = table_with_happiest_and_unhappiest["Country name"], palette = "coolwarm")
plt.show()


<a id = "6"></a>
### 6. Ladder Score Distribution According to the Geographical Location

I am going to use a KDE (kernel density estimation) plot to show the ladder score distribution for each geographical region.
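
A kernel density estimate places a small bell-shaped bump (the kernel) on every observation and averages the bumps into one smooth curve. The sketch below shows the idea with a Gaussian kernel in plain NumPy. The function name `gaussian_kde_1d`, the sample scores and the fixed bandwidth of 0.5 are all assumptions made for illustration; seaborn selects its own bandwidth.

```python
import numpy as np

def gaussian_kde_1d(samples, grid, bandwidth):
    """Evaluate a Gaussian kernel density estimate on the points in `grid`."""
    samples = np.asarray(samples, dtype=float)
    grid = np.asarray(grid, dtype=float)
    # Standardized distance from every grid point to every sample.
    z = (grid[:, None] - samples[None, :]) / bandwidth
    # Standard normal density evaluated at each distance.
    kernel = np.exp(-0.5 * z**2) / np.sqrt(2 * np.pi)
    # Average the bumps and rescale by the bandwidth.
    return kernel.sum(axis=1) / (len(samples) * bandwidth)

# Made-up ladder scores, for illustration only.
scores = np.array([4.1, 4.8, 5.2, 5.5, 6.0, 7.1, 7.4])
grid = np.linspace(0, 12, 1201)
density = gaussian_kde_1d(scores, grid, bandwidth=0.5)

# A density should integrate to about 1 (simple Riemann sum over the grid).
area = density.sum() * (grid[1] - grid[0])
print("area under the curve:", round(area, 4))
```

A larger bandwidth gives a smoother, flatter curve, and a smaller one follows the individual points more closely, which matters when a region contains only a few countries.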

In [None]:
plt.figure(figsize = (12,8))
sns.kdeplot(dataset_2021["Ladder score"], hue = dataset_2021["Regional indicator"], fill = True, linewidth = 2)
plt.title("Ladder Score Distribution According to the Geographical Location")
plt.show()

* According to the KDE plot, the happiest region is Western Europe, followed by North America and ANZ.
* The plot also shows that the unhappiest region is South Asia.

In the following plot, I added a line at the overall mean. It shows which regions lie below the mean happiness score and which lie above it:

In [None]:
plt.figure(figsize = (12,8))
sns.kdeplot(dataset_2021["Ladder score"], hue = dataset_2021["Regional indicator"])
plt.axvline(dataset_2021["Ladder score"].mean(), color = "black")
plt.title("Ladder Score Distribution According to the Geographical Location")
plt.show()

We can also say that North America and ANZ has the lowest standard deviation, which means the ladder scores of its countries are close to each other. South Asia, in contrast, has the largest standard deviation.
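
Statements about regional means and standard deviations can be checked numerically rather than read off the curves. This is a minimal sketch on invented data; "Region A" and "Region B" are hypothetical labels, and the same `groupby` should apply to `dataset_2021` with its "Regional indicator" and "Ladder score" columns.

```python
import pandas as pd

# Invented scores for two hypothetical regions.
toy = pd.DataFrame({
    "Regional indicator": ["Region A"] * 3 + ["Region B"] * 3,
    "Ladder score": [7.0, 7.1, 7.2, 3.0, 5.0, 7.0],
})

# Mean, standard deviation and number of countries per region,
# ordered from the tightest to the most spread-out region.
region_stats = (
    toy.groupby("Regional indicator")["Ladder score"]
       .agg(["mean", "std", "count"])
       .sort_values("std")
)

# Flag regions whose mean lies above the overall mean.
overall_mean = toy["Ladder score"].mean()
region_stats["above_overall_mean"] = region_stats["mean"] > overall_mean
print(region_stats)
```

The `count` column is worth keeping in view: a standard deviation estimated from only a handful of countries is a rough figure.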

<a id = "7"></a>
### 7. Ladder Score Distribution According to Countries with the World Map Visualization

For the world map visualization, we use the Plotly library.

In [None]:
before_2021.head(1)

In [None]:
#plt.figure(figsize = (15,10))
world_map = px.choropleth(before_2021.sort_values("year"), 
                    locations = "Country name", 
                    color = "Life Ladder",
                    locationmode = "country names",
                    animation_frame = "year")
world_map.update_layout(title = "Ladder Score Distribution According to Countries with the World Map Visualization")
world_map.show()

<a id = "8"></a>
### 8. Most Generous and Most Ungenerous Countries in 2021 

In [None]:
table_with_generous_and_ungenerous = dataset_2021[(dataset_2021["Generosity"] > 0.2) | (dataset_2021["Generosity"] < -0.2)]

In [None]:
plt.figure(figsize = (12,10))
plt.title("Most Generous and Most Ungenerous Countries in 2021")
sns.barplot(x = "Generosity", y = "Country name", data = table_with_generous_and_ungenerous, palette = "Set1")
plt.show()

* The right part of the plot is positive and the left part is negative. Indonesia has the highest generosity, followed by Myanmar.
* Greece is the most ungenerous country, followed by Japan and Botswana.
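
The table above was selected with fixed cut-offs (above 0.2 or below -0.2). An alternative, shown here only as a sketch, is to take a fixed number of countries from each end with `nlargest` and `nsmallest`. The country labels and values below are invented.

```python
import pandas as pd

# Invented generosity values for six hypothetical countries.
toy = pd.DataFrame({
    "Country name": ["A", "B", "C", "D", "E", "F"],
    "Generosity": [0.30, -0.25, 0.05, 0.45, -0.10, -0.28],
})

k = 2  # how many countries to take from each end
most_generous = toy.nlargest(k, "Generosity")
least_generous = toy.nsmallest(k, "Generosity")

# Combine both ends, ordered from most to least generous.
extremes = pd.concat([most_generous, least_generous]).sort_values(
    "Generosity", ascending=False
)
print(extremes)
```

Choosing by rank keeps the bar chart the same size whatever the spread of the data, whereas a fixed threshold can return very many or very few countries.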

<a id = "9"></a>
### 9. Generosity According to Countries with the World Map Visualization

In [None]:
world_map2 = px.choropleth(before_2021.sort_values("year"), 
                    locations = "Country name", 
                    color = "Generosity",
                    locationmode = "country names",
                    animation_frame = "year")
world_map2.update_layout(title = "Generosity According to Countries with the World Map Visualization")
world_map2.show()

<a id = "10"></a>
### 10. Generosity According to the Geographical Location in 2021

In [None]:
plt.title("Generosity According to the Geographical Location in 2021")
sns.swarmplot(x = "Regional indicator", y = "Generosity", data = dataset_2021)
plt.xticks(rotation = 90)
plt.show()

* The most generous countries are located in Southeast Asia. On the other hand, the ungenerous countries are located in Western Europe.
* For Sub-Saharan Africa, the standard deviation is very high: the region contains many generous countries as well as many ungenerous ones.