## Data from World Happiness Report

The World Happiness Report is an annual publication of the United Nations Sustainable Development Solutions Network. It contains articles and rankings of national happiness based on respondents' ratings of their own lives, which the report also correlates with various life factors.

In this notebook we will explore the happiness of different countries and the features associated with it.
The datasets that we will use are available in *Data*: **happiness2020.csv** and **countries_info.csv**.

Although the features are self-explanatory, here is a summary:

**happiness2020.csv**
* country: *Name of the country*
* happiness_score: *Happiness score*
* social_support: *Social support (mitigating the effects of inequality)*
* healthy_life_expectancy: *Healthy Life Expectancy*
* freedom_of_choices: *Freedom to make life choices*
* generosity: *Generosity (charity, volunteers)*
* perception_of_corruption: *Corruption Perception*
* world_region: *Area of the world of the country*

**countries_info.csv**
* country_name: *Name of the country*
* area: *Area in sq mi*
* population: *Number of people*
* literacy: *Literacy percentage*

In [1]:
!head Data/countries_info.csv

country_name,area,population,literacy
afghanistan,647500,31056997,"36,0"
albania,28748,3581655,"86,5"
algeria,2381740,32930091,"70,0"
argentina,2766890,39921833,"97,1"
armenia,29800,2976372,"98,6"
australia,7686850,20264082,"100,0"
austria,83870,8192880,"98,0"
azerbaijan,86600,7961619,"97,0"
bahrain,665,698585,"89,1"


In [1]:
import pandas as pd
%matplotlib inline

DATA_FOLDER = 'Data/'

HAPPINESS_DATASET = DATA_FOLDER+"happiness2020.csv"
COUNTRIES_DATASET = DATA_FOLDER+"countries_info.csv"

## Task 1: Load the data

Load the two datasets into pandas dataframes (called *happiness* and *countries*), and show the first rows.


**Hint**: Use the correct reader and verify the data has the expected format.
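One way to act on this hint, sketched under the assumption that the literacy column uses a decimal comma as in the `head` output above: `pd.read_csv` accepts a `decimal` argument, so the column can be parsed as a float at load time. The three rows below are copied from that output; the names `sample_csv` and `sample` are only illustrative.

```python
import io
import pandas as pd

# A few rows copied from the `head` output of countries_info.csv above.
sample_csv = io.StringIO(
    'country_name,area,population,literacy\n'
    'afghanistan,647500,31056997,"36,0"\n'
    'albania,28748,3581655,"86,5"\n'
    'australia,7686850,20264082,"100,0"\n'
)

# decimal="," asks pandas to read "36,0" as the float 36.0 rather than a string.
sample = pd.read_csv(sample_csv, decimal=",")

# Verify the format: literacy should now be float64 instead of object.
print(sample.dtypes)
print(sample.head())
```

Parsing the column once at load time would avoid repeating a string-to-float conversion in every later cell that uses *literacy*.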

In [12]:
# Write your code here
happiness = pd.read_csv(HAPPINESS_DATASET)
happiness["country"] = happiness["country"].str.lower()
countries = pd.read_csv(COUNTRIES_DATASET)
countries.rename(columns={"country_name": "country"}, inplace=True)
print(happiness.head())
print(countries.head())

       country  happiness_score  social_support  healthy_life_expectancy  \
0  afghanistan           2.5669        0.470367                52.590000   
1      albania           4.8827        0.671070                68.708138   
2      algeria           5.0051        0.803385                65.905174   
3    argentina           5.9747        0.900568                68.803802   
4      armenia           4.6768        0.757479                66.750656   

   freedom_of_choices  generosity  perception_of_corruption  \
0            0.396573   -0.096429                  0.933687   
1            0.781994   -0.042309                  0.896304   
2            0.466611   -0.121105                  0.735485   
3            0.831132   -0.194914                  0.842010   
4            0.712018   -0.138780                  0.773545   

                         world_region  
0                          South Asia  
1          Central and Eastern Europe  
2        Middle East and North Africa  
3   

## Task 2: Let's merge the data

Create a dataframe called *country_features* by merging *happiness* and *countries*. A row of this dataframe must describe all the features that we have about a country.

**Hint**: Verify that all the rows are in the final dataframe.
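A possible way to do this check, shown on toy data rather than on the real files: an outer merge with `indicator=True` marks rows that exist on only one side, and `validate="one_to_one"` raises an error if a country appears more than once. The country "atlantis" and its score are made up for the illustration.

```python
import pandas as pd

# Toy frames; "atlantis" is a made-up country used to show an unmatched row.
left = pd.DataFrame({"country": ["albania", "algeria", "atlantis"],
                     "happiness_score": [4.8827, 5.0051, 6.0]})
right = pd.DataFrame({"country": ["albania", "algeria"],
                      "population": [3581655, 32930091]})

# An outer merge keeps every row from both sides; indicator=True adds a
# "_merge" column that records where each row came from.
check = pd.merge(left, right, on="country", how="outer", indicator=True)
unmatched = check[check["_merge"] != "both"]
print(unmatched[["country", "_merge"]])

# The default inner merge silently drops unmatched rows, so compare lengths.
inner = pd.merge(left, right, on="country", validate="one_to_one")
print(len(left), len(right), len(inner))
```

If `unmatched` is empty and the lengths agree, no country was lost in the merge.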

In [14]:
# Write your code here
country_features = pd.merge(happiness, countries, on="country")
print(country_features)

         country  happiness_score  social_support  healthy_life_expectancy  \
0    afghanistan           2.5669        0.470367                52.590000   
1        albania           4.8827        0.671070                68.708138   
2        algeria           5.0051        0.803385                65.905174   
3      argentina           5.9747        0.900568                68.803802   
4        armenia           4.6768        0.757479                66.750656   
..           ...              ...             ...                      ...   
130    venezuela           5.0532        0.890408                66.505341   
131      vietnam           5.3535        0.849987                67.952736   
132        yemen           3.5274        0.817981                56.727283   
133       zambia           3.7594        0.698824                55.299377   
134     zimbabwe           3.2992        0.763093                55.617260   

     freedom_of_choices  generosity  perception_of_corruption  

## Task 3: Where are people happier?

Print the top 10 countries based on their happiness score (higher is better).

In [16]:
# Write your code here
top10 = country_features.sort_values(by="happiness_score", ascending=False).head(10)
print(top10["country"])

38         finland
31         denmark
115    switzerland
50         iceland
92          norway
87     netherlands
114         sweden
88     new zealand
6          austria
72      luxembourg
Name: country, dtype: object


We want to know in which world region people are happiest.

Create and print a dataframe with (1) the average happiness score and (2) the number of countries for each world region.
Sort the result to show the happiness ranking.

In [21]:
# Write your code here
stat_regions = (
    country_features.groupby("world_region")
    .agg(
        avg_happiness=("happiness_score", "mean"),
        nb_countries=("country", "count")
    )
    .reset_index()
    .sort_values(by="avg_happiness", ascending=False)
    )
print(stat_regions)

                         world_region  avg_happiness  nb_countries
5               North America and ANZ       7.173525             4
9                      Western Europe       6.967405            20
3         Latin America and Caribbean       5.971280            20
0          Central and Eastern Europe       5.891393            14
7                      Southeast Asia       5.517788             8
2                           East Asia       5.483633             3
1  Commonwealth of Independent States       5.358342            12
4        Middle East and North Africa       5.269306            16
8                  Sub-Saharan Africa       4.393856            32
6                          South Asia       4.355083             6


The first region has only a few countries! Which are they and what are their scores?

In [23]:
# Write your code here
subset_1 = country_features[country_features["world_region"] == "North America and ANZ"][["country","happiness_score"]]
print(subset_1)

           country  happiness_score
5        australia           7.2228
21          canada           7.2321
88     new zealand           7.2996
127  united states           6.9396


## Task 4: How literate is the world?

Print the names of the countries with a level of literacy of 100%. 

For each country, print the world region, the name, and the happiness score in the format: *{region name} - {country name} ({happiness score})*

In [34]:
# Write your code here
subset_2 = country_features[country_features["literacy"].str.replace(",", ".").astype(float) == 100.0]
for _, row in subset_2.iterrows():
    print(f"{row['world_region']} - {row['country']} ({row['happiness_score']})")


North America and ANZ - australia (7.222799778)
Western Europe - denmark (7.645599842)
Western Europe - finland (7.808700085)
Western Europe - luxembourg (7.237500191)
Western Europe - norway (7.487999916000001)


What is the global average?

In [36]:
# Write your code here
print(country_features["literacy"].str.replace(",", ".").astype(float).mean())

81.85112781954888


Calculate the proportion of countries with a literacy level below 50%. Print the value in percentage, formatted with 2 decimals.

In [37]:
# Write your code here
below_50 = (country_features["literacy"].str.replace(",", ".").astype(float) < 50).sum()
total = country_features["literacy"].count()
proportion = (below_50 / total) * 100
print(f"{proportion:.2f}%")

12.03%


Print the raw number and the percentage of world population that is illiterate.

In [None]:
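# Write your code here
# Literacy is stored as a string with a decimal comma (e.g. "36,0"); convert it to float.
literacy = country_features["literacy"].str.replace(",", ".").astype(float)
# Illiterate people per country = population * share of people who are not literate.
illiterate = (country_features["population"] * (1 - literacy / 100)).sum()
# Countries with a missing literacy value yield NaN and are skipped by sum();
# the percentage is taken relative to the total population in the dataframe.
proportion_2 = illiterate / country_features["population"].sum() * 100
print(f"{illiterate:.0f}")
print(f"{proportion_2:.2f}%")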


## Task 5: Population density

Add to the dataframe a new field called *population_density* computed by dividing *population* by *area*.

In [12]:
# Write your code here
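
As a hedged sketch of the idea, here is the computation on a toy dataframe built from three rows of the `head` output of countries_info.csv (the name `toy` is illustrative). In the notebook the same division would apply to *country_features*.

```python
import pandas as pd

# Rows copied from the head of countries_info.csv; area is in sq mi.
toy = pd.DataFrame({
    "country": ["afghanistan", "albania", "australia"],
    "area": [647500, 28748, 7686850],
    "population": [31056997, 3581655, 20264082],
})

# Column-wise division: one density value (people per sq mi) per country.
toy["population_density"] = toy["population"] / toy["area"]
print(toy)
```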

What is the happiness score of the 3 countries with the lowest population density?

In [13]:
# Write your code here
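
One option for this question, again on toy data assembled from rows shown earlier in the notebook, is `DataFrame.nsmallest`, which returns the rows with the smallest values in a column. The names `toy` and `least_dense` are illustrative only.

```python
import pandas as pd

# Scores, areas and populations copied from the first rows printed earlier.
toy = pd.DataFrame({
    "country": ["afghanistan", "albania", "algeria", "argentina", "armenia"],
    "happiness_score": [2.5669, 4.8827, 5.0051, 5.9747, 4.6768],
    "area": [647500, 28748, 2381740, 2766890, 29800],
    "population": [31056997, 3581655, 32930091, 39921833, 2976372],
})
toy["population_density"] = toy["population"] / toy["area"]

# Keep the 3 rows with the lowest density, then show their happiness score.
least_dense = toy.nsmallest(3, "population_density")[["country", "happiness_score"]]
print(least_dense)
```

Sorting with `sort_values("population_density").head(3)` would be an equivalent alternative.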

## Task 6: Healthy and happy?

Draw a scatter plot of the happiness score (x) against the healthy life expectancy (y).

In [14]:
# Write your code here
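
A minimal sketch of such a plot, assuming matplotlib is installed (pandas uses it for plotting). It relies on the pandas `plot.scatter` accessor and on five rows copied from the Task 1 output; in the notebook the call would be made on *country_features* instead of the illustrative `toy` frame.

```python
import pandas as pd

# Values copied from the first rows of the happiness dataframe shown in Task 1.
toy = pd.DataFrame({
    "happiness_score": [2.5669, 4.8827, 5.0051, 5.9747, 4.6768],
    "healthy_life_expectancy": [52.590000, 68.708138, 65.905174,
                                68.803802, 66.750656],
})

# plot.scatter returns a matplotlib Axes object that can be customised further.
ax = toy.plot.scatter(x="happiness_score", y="healthy_life_expectancy")
ax.set_xlabel("Happiness score")
ax.set_ylabel("Healthy life expectancy")
```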

Feel free to continue the exploration of the dataset! We'll release the solutions next week.

----
Enjoy EPFL and be happy; next year Switzerland must be #1.