# World Happiness 
In this exercise, we will play with data from The World Happiness survey (https://worldhappiness.report/) for the years 2005-2022.

Data description:

- Country: Name of the country.
- Region: The world region the country belongs to.
- Year: The year in which the data was collected
- Happiness Score: The "Cantril Ladder": the answer to the question: "think of a ladder, with the best possible life for you being a 10, and the worst possible life for you being a 0, and rate your current life on this 0 to 10 ladder".
- Economy: Log GDP per capita
- Social support: the answer to the question "if you were in trouble, do you have relatives or friends you can count on to help you whenever you need them, or not."
- Health: Life expectancy at birth
- Freedom: the answer to the question "are you satisfied or dissatisfied with your freedom to choose what you do with your life?"
- Generosity:  the residual of regressing the national average responses to the question "Have you donated money to a charity in the past month?" on GDP per capita
- Perceptions of corruption: the average answer to the questions "Is corruption widespread throughout the government or not?" and "Is corruption widespread within businesses or not?"
- Positive affect: the average frequency of happiness, laughter, and enjoyment on the previous day
- Negative affect: the average frequency of worry, sadness, and anger on the previous day

In [None]:
# importing some of the main packages we'll use in this class
import numpy as np # numpy is the main Python package for scientific computing
import pandas as pd # pandas is a package designed for data manipulation and analysis
import matplotlib.pyplot as plt # matplotlib is a package for plotting data
import seaborn as sns # seaborn is also a package for plotting data, built on top of matplotlib

In [None]:
# just a couple of definitions to make the plots look nicer (IMO)
sns.set(style='ticks',font_scale=1.2)
sns.set_palette("deep")

In [None]:
happiness_df = pd.read_csv('happiness 2005_2022.csv') # read the dataset to a dataframe (more on this later)
happiness_df # default print method in Jupyter prints the top 5 and bottom 5 rows


In [None]:
# additional useful methods for overviewing the data
# often, the "display" function produces nicer and clearer output than "print"
display(happiness_df.columns.values)  # shows the names of the columns in our dataset
print("*****")
happiness_df.info() # prints basic structure information on each variable (info() prints by itself and returns None)
print("*****")
happiness_df.describe(include='all') # gives summary statistics (more on this later) for numeric variables and frequency counts for categorical variables


#### What can we learn from these outputs?
(don't worry if you don't know the answers (unless you're reading this before the final exam))
- How many observations do we have here?
- Happiness of how many different countries was measured?
- How many different regions of the world are there? Can you name at least one?
- Which years of the World Happiness Report do these data come from?
- What is the range (minimum and maximum) of the "happiness score"?
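
One way to read these answers directly off the dataframe, rather than from the printed summaries, is sketched below. The `toy_df` table is made-up data that only mimics some column names of `happiness_df`; the values are invented for illustration.

```python
import pandas as pd

# Made-up toy data with the same column names as happiness_df (values are invented)
toy_df = pd.DataFrame({
    'Country': ['A', 'A', 'B', 'B', 'C'],
    'Region': ['North', 'North', 'South', 'South', 'South'],
    'Year': [2005, 2006, 2005, 2006, 2006],
    'Happiness Score': [7.1, 7.3, 4.2, 4.0, 5.5],
})

n_observations = len(toy_df)                     # number of rows (one row per country and year)
n_countries = toy_df['Country'].nunique()        # number of distinct countries
regions = sorted(toy_df['Region'].unique())      # names of the distinct regions
first_year, last_year = toy_df['Year'].min(), toy_df['Year'].max()
score_min, score_max = toy_df['Happiness Score'].min(), toy_df['Happiness Score'].max()

print(n_observations, n_countries, regions, first_year, last_year, score_min, score_max)
```

Replacing `toy_df` with `happiness_df` should give the corresponding numbers for the real data.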

## World's happiest countries

Run the code below to see which 3 countries are the happiest in the world (over the years), and their corresponding mean happiness scores.

In [None]:
# First compute the average country happiness across all years and sort the countries by mean happiness
grpby_country = happiness_df.groupby('Country')
df_avg_happy = grpby_country.agg({'Happiness Score':'mean'}).reset_index()
df_avg_happy.sort_values('Happiness Score', inplace=True, ascending=False)
df_avg_happy['World happiness rank'] = np.arange(len(df_avg_happy))+1

# Now filter to get only the top 3 happiest countries
happiness_top_3 = df_avg_happy.loc[df_avg_happy['World happiness rank'] <= 3] 
# plot happiness by country
p = sns.catplot(x='Country', y='Happiness Score', data=happiness_top_3, kind="bar")
p.fig.set_size_inches(8,4) # changing plot size for readability
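
The ranking step above can also be written more compactly. The sketch below is an alternative, not the notebook's own method: it uses `Series.nlargest` on invented toy data (`toy_df` and its values are assumptions) to pick the top 3 countries by mean score.

```python
import pandas as pd

# Made-up toy data: two yearly scores for each of four invented countries
toy_df = pd.DataFrame({
    'Country': ['A', 'A', 'B', 'B', 'C', 'C', 'D', 'D'],
    'Happiness Score': [7.0, 7.4, 4.0, 4.4, 6.0, 6.2, 5.0, 5.2],
})

# Mean score per country across all years
mean_by_country = toy_df.groupby('Country')['Happiness Score'].mean()

# Keep the 3 countries with the largest mean, already sorted from happiest down
top_3 = mean_by_country.nlargest(3).reset_index()
top_3['World happiness rank'] = range(1, len(top_3) + 1)
print(top_3)
```

The resulting `top_3` has the same columns as `happiness_top_3`, so it could be passed to the same `sns.catplot` call.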

##### challenge
1. Try to modify the code below (a copy of the code above) to view the top 7 happiest countries 
2. Try to modify the code below (a copy of the code above) to see which 7 countries are the least happy (clue: there are 164 countries in this dataset)

In [None]:
# 1
happiness_top_3 = df_avg_happy.loc[df_avg_happy['World happiness rank'] <= 3] 
# plot happiness by country
p = sns.catplot(x='Country', y='Happiness Score', data=happiness_top_3, kind="bar")
p.fig.set_size_inches(8,4) # changing plot size for readability

# 2 
happiness_top_3 = df_avg_happy.loc[df_avg_happy['World happiness rank'] <= 3] 
# plot happiness by country
p = sns.catplot(x='Country', y='Happiness Score', data=happiness_top_3, kind="bar")
p.fig.set_size_inches(8,4) # changing plot size for readability

- Do the happiest countries have anything in common? 
- Do the least happy countries have anything in common?
***

***
## What about Israel?
Run the code below to compare the happiness in Israel to that in some other countries

In [None]:
# list of countries we want to check out
countries = ['Israel', 'United States', 'Turkey', 'Singapore']  
# filters data to include countries from our list
happiness_countries = happiness_df.loc[happiness_df['Country'].isin(countries)]  
# plot happiness by country (we'll present the same information in 2 ways)
sns.catplot(kind="bar", x='Country', y='Happiness Score', data=happiness_countries)
sns.catplot(kind="point", x='Country', y='Happiness Score', data=happiness_countries, join=False);

(note the little lines on the bars and around the points - these are called confidence intervals, and we will discuss them later in the course)

Are you surprised by these results? 

#### challenge
Change the code below (a copy of the code above) to compare 5 countries you think would make an interesting comparison:
- Why did you choose these countries? 
- What did you expect to see?
- Are you surprised by the results?

In [None]:
# list of countries we want to check out
countries = ['Israel', 'United States', 'Turkey', 'Singapore']  
# filters data to include countries from our list
happiness_countries = happiness_df.loc[happiness_df['Country'].isin(countries)]  
# plot happiness by country (we'll present the same information in 2 ways)
sns.catplot(kind="bar", x='Country', y='Happiness Score', data=happiness_countries)
sns.catplot(kind="point", x='Country', y='Happiness Score', data=happiness_countries, join=False);


## Let's compare the whole world!

In [None]:
# let's compare worldwide mean happiness in 2005-2022

# First compute the mean happiness over the years 2005-2022 and sort the countries by mean happiness, while keeping "Region"
grpby_country = happiness_df.groupby(['Country','Region'])
df_avg_happy = grpby_country.agg({'Happiness Score':'mean'}).reset_index()
df_avg_happy.sort_values('Happiness Score', inplace=True, ascending=False)

# save the color palette used here, for later
col_dict = dict(zip(df_avg_happy['Region'].unique(),sns.color_palette().as_hex()))

# next, plot the desired figure
p=sns.catplot(kind='bar', x='Happiness Score', y='Country', hue='Region', data=df_avg_happy, dodge=False, legend_out=False, palette=col_dict)
p.fig.set_size_inches(18,35)
p.ax.set_title('Worldwide Happiness 2005-2022 by Country', size=15);

Which **world regions** seem to you the happiest? Which are least happy?
***
Let's check if your answers hold when we look at region averages:

In [None]:
# compute region averages
grpby_region = happiness_df.groupby(['Region'])
avg_happy_region = grpby_region.agg({'Happiness Score':'mean'}).reset_index()
avg_happy_region.sort_values('Happiness Score', inplace=True, ascending=False)

# plot the region averages
p = sns.catplot(kind='bar', y='Happiness Score', x='Region', order=avg_happy_region['Region'], 
                data=happiness_df, dodge=False, palette=col_dict)
p.fig.set_size_inches(12,8)
p.ax.set_title('Worldwide Happiness 2005-2022 by Region', size=15)
p.set_xticklabels(rotation=40, ha="right");

Did your answers from above hold?
***
Which region is Israel in? 
Is it a "happy" region? 
Is Israel a "happy" country?
Why the difference?
***
Let's also look at the variance (more on this later) within regions

In [None]:
# plot the same information but also add some information on how different countries within a region are different (variance)
p = sns.catplot(kind='box', y='Happiness Score', x='Region', order=avg_happy_region['Region'], 
                data=happiness_df, dodge=False, palette=col_dict, showmeans=True)
p.fig.set_size_inches(12,10)
p.ax.set_title('Worldwide Happiness 2005-2022 by Region', size=15)
p.set_xticklabels(rotation=40, ha="right");
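
The spread shown by the boxes can also be summarized numerically. The sketch below uses invented toy data (`toy_df` and its values are assumptions) to compute the mean, standard deviation, and range of the scores within each region.

```python
import pandas as pd

# Made-up toy data: one region with similar scores, one with very different scores
toy_df = pd.DataFrame({
    'Region': ['North'] * 4 + ['South'] * 4,
    'Happiness Score': [7.0, 7.1, 6.9, 7.0, 3.0, 5.0, 6.0, 4.0],
})

# Summary statistics of the scores within each region
spread = toy_df.groupby('Region')['Happiness Score'].agg(['mean', 'std', 'min', 'max'])
spread['range'] = spread['max'] - spread['min']

# Region whose scores vary the most (largest standard deviation)
most_diverse = spread['std'].idxmax()
print(spread)
print(most_diverse)
```

Note that on the real data each row is one country in one year, so the spread within a region mixes differences between countries with changes over time.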

Which world region seems to you the most diverse in terms of happiness?
*** 
Let's look at the distribution of happiness in the Middle East and Northern Africa and compare it to the distribution in Eastern Asia

In [None]:
# First get only data for the regions of interest
region_to_check = ['Middle East and Northern Africa', 'Eastern Asia'] 
happiness_region = happiness_df.loc[happiness_df['Region'].isin(region_to_check)]

# next, plot the distribution of happiness in this region
p = sns.displot(happiness_region, x='Happiness Score', hue='Region', bins=np.arange(2.5, 8.5, 0.5), common_norm=False, stat='proportion', kde=True)
p.fig.set_size_inches(10,6)
p.ax.set(ylabel='Proportion (number of countries in each bar\n divided by total number of countries in region)');
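
To make the y-axis of this plot concrete, the sketch below recomputes bar heights by hand on invented toy data (`toy_df` and its values are assumptions). With `stat='proportion'` and `common_norm=False`, each region is normalized separately, so the bars of each region sum to 1.

```python
import numpy as np
import pandas as pd

# Made-up toy data: 4 observations in one region, 2 in another
toy_df = pd.DataFrame({
    'Region': ['East'] * 4 + ['West'] * 2,
    'Happiness Score': [5.1, 5.3, 5.6, 6.2, 4.2, 4.4],
})
bins = np.arange(2.5, 8.5, 0.5)  # same bin edges as in the plot above

proportions = {}
for region, scores in toy_df.groupby('Region')['Happiness Score']:
    counts, _ = np.histogram(scores, bins=bins)   # observations falling in each bin
    proportions[region] = counts / counts.sum()   # divide by that region's total

for region, values in proportions.items():
    print(region, values.round(2))
```

Because each region is divided by its own total, regions with very different numbers of observations can still be compared on the same axis.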


What can we learn from this?

#### challenge
Change the code below (a copy of the code above) to plot the distribution of another region. 
What did you learn?

In [None]:
# First get only data for the regions of interest
region_to_check = ['Middle East and Northern Africa', 'Eastern Asia'] 
happiness_region = happiness_df.loc[happiness_df['Region'].isin(region_to_check)]

# next, plot the distribution of happiness in this region
p = sns.displot(happiness_region, x='Happiness Score', hue='Region', bins=np.arange(2.5, 8.5, 0.5), common_norm=False, stat='proportion', kde=True)
p.fig.set_size_inches(10,6)
p.ax.set(ylabel='Proportion (number of countries in each bar\n divided by total number of countries in region)');


## Time trends in world happiness

So far we looked at average country happiness over all years of the dataset. We can also look at how happiness changes as a function of time.

The code below plots the mean world happiness for years 2005-2022

In [None]:
p = sns.relplot(kind='line', y='Happiness Score', x='Year', data=happiness_df, errorbar=None)
p.fig.set_size_inches(12,7)
p.set(xticks=range(2005, 2023), xticklabels=range(2005, 2023))
p.fig.suptitle('Mean worldwide happiness 2005-2022', size=15);

#### What happened in 2006?
Any suggestions?

Run the following code:

In [None]:
country_grpby_year = happiness_df[['Country','Year']].groupby('Year')
display(country_grpby_year.count())

So, what do you think happened in 2006?

Here's some more evidence:

In [None]:
region_to_check = ['North America', 'Western Europe', 'Sub-Saharan Africa', 'Southern Asia'] 
some_regions = happiness_df.loc[happiness_df['Region'].isin(region_to_check)]
country_grpby_year_region = some_regions[['Country','Year','Region']].groupby(['Year','Region'])
pd.set_option('display.max_rows', 100) # to display all rows
display(country_grpby_year_region.count())

### Warning! Data is usually not as clean as you'd like it to be!

<img src="0_YCghEemt6BtW9OZV.png" width="400" align="left"/>

Let's remove observations prior to 2007 before we continue:

In [None]:
new_happiness_df = happiness_df.loc[happiness_df['Year'] > 2006].copy() 
# when creating a new df based on an existing one, it is good practice to explicitly tell pandas to make a copy
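
A minimal sketch of why the explicit copy matters, on invented toy data (`toy_df`, `recent`, and the `Decade` column are hypothetical names used only here): changes made to the copy leave the original dataframe untouched. Without `.copy()`, some pandas versions emit a `SettingWithCopyWarning` when a column is later assigned on the filtered dataframe.

```python
import pandas as pd

# Made-up toy data
toy_df = pd.DataFrame({'Country': ['A', 'A', 'B'], 'Year': [2005, 2008, 2010]})

recent = toy_df.loc[toy_df['Year'] > 2006].copy()  # independent copy of the filtered rows
recent['Decade'] = (recent['Year'] // 10) * 10     # adding a column touches only the copy

print(list(toy_df.columns))   # original columns are unchanged
print(list(recent.columns))   # the copy has the extra column
```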

Now, let's check the trends again.

In [None]:
# Worldwide:
p = sns.relplot(kind='line', y='Happiness Score', x='Year', data=new_happiness_df, errorbar=None)
p.fig.set_size_inches(10,7)
p.fig.suptitle('Mean worldwide happiness 2007-2022', size=15)

# By region
p = sns.relplot(kind='line', y='Happiness Score', x='Year', data=new_happiness_df, errorbar=None, hue='Region', palette=col_dict)
p.fig.set_size_inches(12,7)
p.fig.suptitle('Mean worldwide happiness 2007-2022 - by region', size=15);

Hmm, 2020 was a good year, huh?
Was it?

What can we say about the trends of world happiness between 2007 and 2022?<br>
How meaningful are the changes? Can you tell?

What can we say about the differences in trends across different regions?

Are these plots reliable given the change in countries surveyed across time?
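
One possible way to probe this question is to recompute the yearly mean using only countries that were surveyed in every year. The sketch below does this on invented toy data (`toy_df` and its values are assumptions): no country's score changes, yet the overall mean drops in 2009 simply because the set of surveyed countries changes.

```python
import pandas as pd

# Made-up toy data: A is surveyed every year, B drops out in 2009, C appears only in 2009
toy_df = pd.DataFrame({
    'Country': ['A', 'A', 'A', 'B', 'B', 'C'],
    'Year': [2007, 2008, 2009, 2007, 2008, 2009],
    'Happiness Score': [7.0, 7.0, 7.0, 5.0, 5.0, 3.0],
})

# Countries that appear in every year of the data
n_years = toy_df['Year'].nunique()
years_per_country = toy_df.groupby('Country')['Year'].nunique()
always_surveyed = years_per_country[years_per_country == n_years].index

# Yearly mean over all rows vs. only over the always-surveyed countries
all_mean = toy_df.groupby('Year')['Happiness Score'].mean()
balanced_mean = (toy_df[toy_df['Country'].isin(always_surveyed)]
                 .groupby('Year')['Happiness Score'].mean())

print(all_mean)
print(balanced_mean)
```

On the real data, the same filter could be applied to `new_happiness_df` before plotting, at the cost of discarding countries with missing years.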

***
***

Let's look at the trends in specific countries. For example, let's compare the 5 most populous countries in the world:

In [None]:
largest_countries = ['China', 'India', 'United States', 'Indonesia', 'Brazil']
largest_happiness = new_happiness_df.loc[new_happiness_df['Country'].isin(largest_countries)] # build the filter from the same dataframe we are indexing
p = sns.relplot(kind="line", x="Year", y='Happiness Score', data=largest_happiness, hue='Country')
p.fig.set_size_inches(10,7)
p.fig.suptitle('Happiness 2007-2022 for most populated countries', size=15)

#### Challenge

Think of countries whose happiness trends you'd like to compare (why these countries?) and plot their happiness trends for 2007-2022. Use the code above as reference.<br>

What did you learn?

What can we say happened to happiness of people in the most populated countries in recent years? <br>
Do you find this strange? Reasonable? Why?

***

Let's compare the average happiness of the most populous countries with the average of all other countries.

In [None]:
# We can first add a column to the dataframe that marks whether the country is one of the largest countries or not 
# This is an example of feature engineering, more on this later
new_happiness_df['largest'] = "No"
new_happiness_df.loc[new_happiness_df['Country'].isin(largest_countries),'largest'] = "Yes"
display(new_happiness_df.describe(include='all')) # checking the result makes sense

# Now let's plot the difference in average trends between the largest and the other countries:
p = sns.relplot(kind='line', y='Happiness Score', x='Year', data=new_happiness_df, errorbar=('ci', 68), hue='largest', style='largest')
p.fig.set_size_inches(10,7)
p.fig.suptitle('Mean worldwide happiness 2007-2022', size=15);
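
The two-step labeling used above can also be done in a single step. The sketch below is an alternative using `np.where`, shown on invented toy data (`toy_df` and `big` are hypothetical names).

```python
import numpy as np
import pandas as pd

# Made-up toy data
toy_df = pd.DataFrame({'Country': ['China', 'Israel', 'India', 'Norway']})
big = ['China', 'India']  # hypothetical list of "largest" countries

# "Yes" where the country is in the list, "No" otherwise
toy_df['largest'] = np.where(toy_df['Country'].isin(big), 'Yes', 'No')

# Quick sanity check of the new column
counts = toy_df['largest'].value_counts()
print(toy_df)
print(counts)
```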

What can we say about the trends in happiness in the world's most populous countries and the rest of the world?

***

## What is related to happiness?

So far we only used 4 columns of our original data (Country, Region, Year, & Happiness Score). Let's check out the additional information we have, possible factors that may be related to happiness.

The code below shows the relationship between the Economy (log GDP per capita) on the x-axis, and happiness scores on the y-axis. <br>What do you infer from this plot?

In [None]:
# make scatter plot of Economy and happiness
ax = sns.regplot(x='Economy', y='Happiness Score', fit_reg=True, data=happiness_df)
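
A scatter plot can be complemented by a single number summarizing the linear relationship. The sketch below computes Pearson correlations with the happiness score on invented toy data (`toy_df` and its values are assumptions, and only two of the factors are included).

```python
import pandas as pd

# Made-up toy data with a few of the numeric columns (values are invented)
toy_df = pd.DataFrame({
    'Economy': [7.0, 8.0, 9.0, 10.0, 11.0],
    'Negative affect': [0.40, 0.35, 0.30, 0.28, 0.20],
    'Happiness Score': [3.9, 4.8, 5.5, 6.4, 7.3],
})

# Pearson correlation of every column with the happiness score, strongest positive first
corr_with_happiness = (toy_df.corr()['Happiness Score']
                       .drop('Happiness Score')
                       .sort_values(ascending=False))
print(corr_with_happiness)
```

A correlation close to +1 or -1 indicates a strong linear relationship, but it says nothing about which variable, if any, causes the other.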

#### Challenge
Do you expect other factors in the data would also be related to happiness? In what way? 
1. Make a list of your intuitive predictions for the relationship of each of the other factors in the data and happiness
2. Write code to test your intuitive predictions

What are your conclusions? <br>
Can you tell us what makes people happy?

***
***


### Warning: Correlation does not imply causation!

<img src="dilbert.gif" width="640" height="400" align="left"/>

### Some other things to consider

- We analyzed the average happiness in the world/regions with each country as one data-point. Does this make sense? Can you think of another way to do things?
- Look at the definitions for "Positive affect" and "Negative affect". Do you think the six factors explaining the Happiness Score should be related to these? More or less so than their relationship with happiness? How about repeating the exercise for one of these as our main target variable?
- We can think of more complex data manipulations that can help us explore other interesting questions:
    - Which countries' happiness increased the most between 2007 and 2022? Decreased the most? 
    - Can we spot major geo-political events using this data?
    - Can we predict a "new country"'s happiness score based on the six factors in our data? How well?
    - Which countries are more or less happy than we would expect them to be given their levels of the six factors? 
    - ...
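
As a starting point for the first of these questions, the sketch below computes each country's change in score between two years by reshaping the data to one column per year. It uses invented toy data (`toy_df` and its values are assumptions); countries missing either year are dropped.

```python
import pandas as pd

# Made-up toy data: C has no 2022 observation
toy_df = pd.DataFrame({
    'Country': ['A', 'A', 'B', 'B', 'C'],
    'Year': [2007, 2022, 2007, 2022, 2007],
    'Happiness Score': [5.0, 6.5, 6.0, 5.5, 4.0],
})

# One row per country, one column per year
wide = toy_df.pivot(index='Country', columns='Year', values='Happiness Score')

# Change from 2007 to 2022, largest increase first; countries missing a year are dropped
change = (wide[2022] - wide[2007]).dropna().sort_values(ascending=False)
print(change)
```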