## Data from the World Happiness Report

The World Happiness Report is an annual publication of the United Nations Sustainable Development Solutions Network. It contains articles and rankings of national happiness based on respondents' ratings of their own lives, which the report also correlates with various life factors.

In this notebook we will explore the happiness of different countries and the features associated with it.
The datasets that we will use are available in *Data*: **happiness2020.csv** and **countries_info.csv**.

Although the features are self-explanatory, here is a summary:

**happiness2020.csv**
* country: *Name of the country*
* happiness_score: *Happiness score*
* social_support: *Social support (mitigating the effects of inequality)*
* healthy_life_expectancy: *Healthy Life Expectancy*
* freedom_of_choices: *Freedom to make life choices*
* generosity: *Generosity (charity, volunteers)*
* perception_of_corruption: *Corruption Perception*
* world_region: *Area of the world of the country*

**countries_info.csv**
* country_name: *Name of the country*
* area: *Area in sq mi*
* population: *Number of people*
* literacy: *Literacy percentage*

In [1]:
!head Data/countries_info.csv

country_name,area,population,literacy
afghanistan,647500,31056997,"36,0"
albania,28748,3581655,"86,5"
algeria,2381740,32930091,"70,0"
argentina,2766890,39921833,"97,1"
armenia,29800,2976372,"98,6"
australia,7686850,20264082,"100,0"
austria,83870,8192880,"98,0"
azerbaijan,86600,7961619,"97,0"
bahrain,665,698585,"89,1"


In [2]:
import pandas as pd
%matplotlib inline

DATA_FOLDER = 'Data/'

HAPPINESS_DATASET = DATA_FOLDER+"happiness2020.csv"
COUNTRIES_DATASET = DATA_FOLDER+"countries_info.csv"

## Task 1: Load the data

Load the two datasets into pandas dataframes (called *happiness* and *countries*), and show the first rows.


**Hint**: Use the correct reader and verify the data has the expected format.
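
The `head` output above shows that `literacy` is written with a decimal comma (`"36,0"`), which `read_csv` would otherwise keep as text. As a minimal sketch, the first rows shown above are pasted into an in-memory buffer (an assumption made so the example runs without the real file) and parsed with `decimal=","`:

```python
import io

import pandas as pd

# First rows of countries_info.csv, as printed by `head` above.
sample = io.StringIO(
    'country_name,area,population,literacy\n'
    'afghanistan,647500,31056997,"36,0"\n'
    'albania,28748,3581655,"86,5"\n'
)

# decimal="," tells pandas to read "36,0" as the float 36.0.
countries_sample = pd.read_csv(sample, decimal=",")

print(countries_sample.dtypes)
print(countries_sample)
```

Checking `dtypes` right after loading is a quick way to confirm that numeric columns were not read as strings.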

In [21]:
# Write your code here

happiness = pd.read_csv(HAPPINESS_DATASET, sep=',')
happiness


Unnamed: 0,country,happiness_score,social_support,healthy_life_expectancy,freedom_of_choices,generosity,perception_of_corruption,world_region
0,Afghanistan,2.5669,0.470367,52.590000,0.396573,-0.096429,0.933687,South Asia
1,Albania,4.8827,0.671070,68.708138,0.781994,-0.042309,0.896304,Central and Eastern Europe
2,Algeria,5.0051,0.803385,65.905174,0.466611,-0.121105,0.735485,Middle East and North Africa
3,Argentina,5.9747,0.900568,68.803802,0.831132,-0.194914,0.842010,Latin America and Caribbean
4,Armenia,4.6768,0.757479,66.750656,0.712018,-0.138780,0.773545,Commonwealth of Independent States
...,...,...,...,...,...,...,...,...
130,Venezuela,5.0532,0.890408,66.505341,0.623278,-0.169091,0.837038,Latin America and Caribbean
131,Vietnam,5.3535,0.849987,67.952736,0.939593,-0.094533,0.796421,Southeast Asia
132,Yemen,3.5274,0.817981,56.727283,0.599920,-0.157735,0.800288,Middle East and North Africa
133,Zambia,3.7594,0.698824,55.299377,0.806500,0.078037,0.801290,Sub-Saharan Africa


In [20]:
# The literacy column uses a decimal comma ("36,0"), so tell pandas how to parse it.
countries = pd.read_csv(COUNTRIES_DATASET, decimal=',')
countries

Unnamed: 0,country_name,area,population,literacy
0,afghanistan,647500,31056997,36.0
1,albania,28748,3581655,86.5
2,algeria,2381740,32930091,70.0
3,argentina,2766890,39921833,97.1
4,armenia,29800,2976372,98.6
...,...,...,...,...
130,venezuela,912050,25730435,93.4
131,vietnam,329560,84402966,90.3
132,yemen,527970,21456188,50.2
133,zambia,752614,11502010,80.6


## Task 2: Let's merge the data

Create a dataframe called *country_features* by merging *happiness* and *countries*. A row of this dataframe must describe all the features that we have about a country.

**Hint**: Verify that the final dataframe contains all the rows.
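
One possible way to check that no row is lost is to merge with `how="outer"` and `indicator=True`, then look for rows that were found in only one of the two tables. The sketch below uses two tiny stand-in frames (`happiness_toy` and `countries_toy` are hypothetical names; the values are copied from the outputs above) and assumes the country names differ only in letter case:

```python
import pandas as pd

# Tiny stand-ins for the two datasets.
happiness_toy = pd.DataFrame({
    "country": ["Afghanistan", "Albania"],
    "happiness_score": [2.5669, 4.8827],
})
countries_toy = pd.DataFrame({
    "country_name": ["afghanistan", "albania"],
    "area": [647500, 28748],
})

# Build a lower-cased key on the left side, then keep unmatched rows too.
merged_toy = happiness_toy.assign(
    country_name=happiness_toy["country"].str.lower()
).merge(countries_toy, on="country_name", how="outer", indicator=True)

# Rows marked "left_only" or "right_only" had no partner in the other table.
unmatched = merged_toy[merged_toy["_merge"] != "both"]
print(f"{len(unmatched)} unmatched rows out of {len(merged_toy)}")
```

If `unmatched` is empty, an inner merge on the same key keeps every country.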

In [23]:
# Write your code here

# Join on the lower-cased country name instead of relying on row order
# (assumes the names differ only in letter case, as the rows shown suggest).
merged_df = (happiness.assign(country_name=happiness['country'].str.lower())
             .merge(countries, on='country_name')
             .drop(columns='country_name'))  # drop() returns a copy, so keep the result
assert len(merged_df) == len(happiness)  # every country should find a match
country_features = merged_df  # name requested by the task
merged_df.head()


Unnamed: 0,country,happiness_score,social_support,healthy_life_expectancy,freedom_of_choices,generosity,perception_of_corruption,world_region,area,population,literacy
0,Afghanistan,2.5669,0.470367,52.59,0.396573,-0.096429,0.933687,South Asia,647500,31056997,36.0
1,Albania,4.8827,0.67107,68.708138,0.781994,-0.042309,0.896304,Central and Eastern Europe,28748,3581655,86.5
2,Algeria,5.0051,0.803385,65.905174,0.466611,-0.121105,0.735485,Middle East and North Africa,2381740,32930091,70.0
3,Argentina,5.9747,0.900568,68.803802,0.831132,-0.194914,0.84201,Latin America and Caribbean,2766890,39921833,97.1
4,Armenia,4.6768,0.757479,66.750656,0.712018,-0.13878,0.773545,Commonwealth of Independent States,29800,2976372,98.6


## Task 3: Where are people happier?

Print the top 10 countries based on their happiness score (high is better).

In [26]:
# Write your code here

merged_df[['country','happiness_score']].sort_values(ascending=False, by=['happiness_score']).head(10)


Unnamed: 0,country,happiness_score
38,Finland,7.8087
31,Denmark,7.6456
115,Switzerland,7.5599
50,Iceland,7.5045
92,Norway,7.488
87,Netherlands,7.4489
114,Sweden,7.3535
88,New Zealand,7.2996
6,Austria,7.2942
72,Luxembourg,7.2375


We want to know in which world region people are happiest.

Create and print a dataframe with (1) the average happiness score and (2) the number of countries for each world region.
Sort the result to show the happiness ranking.

In [40]:
# Write your code here

# Aggregate only the score column, then sort by the mean to obtain the ranking.
average_happiness_df = (merged_df.groupby('world_region')['happiness_score']
                        .agg(['count', 'mean'])
                        .sort_values('mean', ascending=False))
average_happiness_df

world_region,count,mean
North America and ANZ,4,7.173525
Western Europe,20,6.967405
Latin America and Caribbean,20,5.97128
Central and Eastern Europe,14,5.891393
Southeast Asia,8,5.517788
East Asia,3,5.483633
Commonwealth of Independent States,12,5.358342
Middle East and North Africa,16,5.269306
Sub-Saharan Africa,32,4.393856
South Asia,6,4.355083


In [41]:
grouped_merged_df = merged_df.groupby('world_region')
for world_region, country in grouped_merged_df:
    print('world_region', world_region)
    print('country', country)

world_region Central and Eastern Europe
country             country  happiness_score  social_support  healthy_life_expectancy  \
1           Albania           4.8827        0.671070                68.708138   
16         Bulgaria           5.1015        0.937840                66.803978   
28          Croatia           5.5047        0.874624                70.214905   
30   Czech Republic           6.9109        0.914431                70.047935   
36          Estonia           6.0218        0.934730                68.604958   
49          Hungary           6.0004        0.921934                67.609970   
66           Latvia           5.9500        0.918289                66.807465   
71        Lithuania           6.2155        0.926107                67.294075   
73        Macedonia           5.1598        0.820392                67.504425   
98           Poland           6.1863        0.874257                69.311134   
100         Romania           6.1237        0.825162         

The first region has only a few countries! Which are they, and what are their scores?

In [56]:
# Write your code here
# East Asia is the region with the fewest countries (3 in the table above).
east_asia = merged_df[merged_df['world_region'] == 'East Asia']
east_asia[['world_region', 'country', 'happiness_score']]

Unnamed: 0,world_region,country,happiness_score
24,East Asia,China,5.1239
59,East Asia,Japan,5.8708
83,East Asia,Mongolia,5.4562


## Task 4: How literate is the world?

Print the names of the countries with a literacy level of 100%.

For each country, print the name and the world region with the format: *{region name} - {country name} ({happiness score})*

In [None]:
# Write your code here
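
A minimal sketch of the filtering and formatting step, not the official solution. It assumes `literacy` has already been parsed as a float, and it runs on a hypothetical frame `toy` whose country names, regions and values are invented for illustration:

```python
import pandas as pd

# Hypothetical rows that use the merged column names; values are made up.
toy = pd.DataFrame({
    "country": ["Country A", "Country B", "Country C"],
    "world_region": ["Region 1", "Region 2", "Region 1"],
    "happiness_score": [7.2, 5.1, 6.0],
    "literacy": [100.0, 86.5, 100.0],
})

# Keep only the fully literate countries.
fully_literate = toy[toy["literacy"] == 100.0]

# Format: {region name} - {country name} ({happiness score})
lines = [
    f"{row.world_region} - {row.country} ({row.happiness_score})"
    for row in fully_literate.itertuples()
]
print("\n".join(lines))
```

On the real data, the same filter would be applied to the merged dataframe instead of `toy`.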




What is the global average?

In [None]:
# Write your code here
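
The question can be read in two ways, so this sketch shows both: a plain mean over countries and a population-weighted mean. It assumes the average in question is the literacy level, and it uses only the first three rows of `countries_info.csv` shown at the top of the notebook, so the printed numbers are not the global figures:

```python
import numpy as np
import pandas as pd

# First three rows of countries_info.csv, with literacy already parsed as floats.
toy = pd.DataFrame({
    "country_name": ["afghanistan", "albania", "algeria"],
    "population": [31056997, 3581655, 32930091],
    "literacy": [36.0, 86.5, 70.0],
})

# Unweighted: every country counts equally.
simple_mean = toy["literacy"].mean()

# Population-weighted: every person counts equally (an alternative reading).
weighted_mean = np.average(toy["literacy"], weights=toy["population"])

print(f"simple: {simple_mean:.2f}%, weighted: {weighted_mean:.2f}%")
```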

Calculate the proportion of countries with a literacy level below 50%. Print the value as a percentage, formatted with 2 decimals.

In [None]:
# Write your code here
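
A small sketch of the idea: the mean of a boolean mask is the share of `True` values, and the `.2%` format specifier prints it as a percentage with two decimals. The four literacy values are taken from the `head` output at the top of the notebook, so the result describes this sample only:

```python
import pandas as pd

# Literacy of the first four countries listed in countries_info.csv.
toy = pd.DataFrame({
    "country_name": ["afghanistan", "albania", "algeria", "argentina"],
    "literacy": [36.0, 86.5, 70.0, 97.1],
})

# The mean of a boolean Series equals the fraction of rows that are True.
share_below_50 = (toy["literacy"] < 50).mean()

# ".2%" multiplies by 100 and appends the percent sign.
formatted = f"{share_below_50:.2%}"
print(formatted)
```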

Print the raw number and the percentage of world population that is illiterate.

In [None]:
# Write your code here
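
One way to approach this, sketched on a hypothetical two-country frame with round numbers: the number of illiterate people in a country is `population * (100 - literacy) / 100`, and the world figure is the sum over countries divided by the total population. The sketch assumes `literacy` is the percentage of the whole population that is literate:

```python
import pandas as pd

# Hypothetical countries with round numbers so the arithmetic is easy to follow.
toy = pd.DataFrame({
    "country_name": ["a", "b"],
    "population": [1000, 3000],
    "literacy": [90.0, 50.0],  # percent of the population that is literate
})

# Illiterate people per country: 100 and 1500 here.
illiterate = toy["population"] * (100 - toy["literacy"]) / 100

illiterate_total = illiterate.sum()
illiterate_share = illiterate_total / toy["population"].sum()

summary = f"{illiterate_total:,.0f} people ({illiterate_share:.2%})"
print(summary)
```

Summing the per-country counts weights each country by its population, which a plain average of the literacy column would not do.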

## Task 5: Population density

Add to the dataframe a new field called *population_density* computed by dividing *population* by *area*.

In [None]:
# Write your code here
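
A minimal sketch of the column arithmetic, using the first two rows of `countries_info.csv` shown at the top of the notebook. Since *area* is in square miles, the resulting density is in people per square mile:

```python
import pandas as pd

# First two rows of countries_info.csv.
toy = pd.DataFrame({
    "country_name": ["afghanistan", "albania"],
    "area": [647500, 28748],
    "population": [31056997, 3581655],
})

# Element-wise division creates the new column in one step.
toy["population_density"] = toy["population"] / toy["area"]
print(toy)
```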

What is the happiness score of the 3 countries with the lowest population density?

In [None]:
# Write your code here
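
`DataFrame.nsmallest` is one convenient way to pick the rows with the lowest values of a column. The sketch below runs on a hypothetical frame whose countries and numbers are invented for illustration:

```python
import pandas as pd

# Hypothetical countries; all values are invented.
toy = pd.DataFrame({
    "country": ["A", "B", "C", "D"],
    "happiness_score": [5.0, 6.5, 4.2, 7.1],
    "population_density": [12.0, 250.0, 3.5, 48.0],
})

# The three least dense countries, ordered from lowest density upwards.
least_dense = toy.nsmallest(3, "population_density")[
    ["country", "happiness_score", "population_density"]
]
print(least_dense)
```

Sorting with `sort_values("population_density").head(3)` would give the same rows.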

## Task 6: Healthy and happy?

Draw a scatter plot of the happiness score (x) against healthy life expectancy (y).

In [None]:
# Write your code here
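
A rough sketch of the plot with matplotlib, using three rows copied from the happiness output earlier in the notebook. The `matplotlib.use("Agg")` line is only there so the example also runs outside a notebook; with `%matplotlib inline` it can be dropped. The axis labels and title are illustrative choices:

```python
import matplotlib

matplotlib.use("Agg")  # headless backend; not needed inside the notebook
import matplotlib.pyplot as plt
import pandas as pd

# Three rows copied from the happiness table shown earlier.
toy = pd.DataFrame({
    "country": ["Afghanistan", "Albania", "Algeria"],
    "happiness_score": [2.5669, 4.8827, 5.0051],
    "healthy_life_expectancy": [52.59, 68.708138, 65.905174],
})

fig, ax = plt.subplots()
# One point per country: happiness on x, healthy life expectancy on y.
ax.scatter(toy["happiness_score"], toy["healthy_life_expectancy"])
ax.set_xlabel("Happiness score")
ax.set_ylabel("Healthy life expectancy")
ax.set_title("Healthy life expectancy vs. happiness score")
```

With pandas alone, `toy.plot.scatter(x="happiness_score", y="healthy_life_expectancy")` produces a similar figure.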

Feel free to continue the exploration of the dataset! We'll release the solutions next week.

----
Enjoy EPFL and be happy, next year Switzerland must be #1.