# Happiness Report Analysis #

In this project, I analyzed the 2017 World Happiness Report to answer the three questions listed below. I used two datasets, which you can download here: https://www.kaggle.com/unsdsn/world-happiness/data#2019.csv
The 2018 and 2019 datasets are also available there, but they appear to be missing some columns, so I decided not to use them.

Questions to explore:
- What are the main factors contributing to a higher Happiness Score?
- Which regions of the world have the highest average happiness scores?
- Are there unique factors that make life evaluations higher in specific regions?

I chose this project because it is my very first personal project (I have done some guided projects, though), so I wanted a dataset that does not require much background research. I was also interested in the questions listed above.


## Summary ##
Q1
- Based on my analysis, all of the factors listed in the Happiness Report except generosity are associated with higher life evaluations in each country. In particular, gdp_per_capita, family, and health are the most influential factors, since their correlations with the Happiness Score are very strong.

Q2
- The region with the highest average happiness score is Australia and New Zealand. North America is second, and Western Europe is third. This is not surprising, because these regions are economically prosperous and therefore offer environments where people can live healthy lives.

Q3
- In Southeastern Asia, gdp_per_capita and family are still the main factors, but freedom and generosity seem to have more influence on life evaluations than they do in other regions. In the other regions, gdp_per_capita and family are the main factors contributing to higher life evaluations.

In [None]:
import pandas as pd 
import matplotlib.pyplot as plt 
import seaborn as sns
%matplotlib inline 

In [None]:
happiness2017 = pd.read_csv('2017.csv')

The CSV file is loaded into a DataFrame named happiness2017.

## Basic Information of happiness2017 ##

In [None]:
happiness2017.shape

There are 155 rows and 12 columns in this dataset. 

In [None]:
happiness2017.head()

### About values in columns ###
- **Happiness.Rank** is each country's rank, and **Happiness.Score** is its happiness score (0-10).
- Values in the columns from **Economy..GDP.per.Capita** to **Generosity** describe the extent to which each of the six factors contributes to making life evaluations higher in each country than they are in Dystopia. Dystopia is an imaginary country whose six factors are equal to the world’s lowest national averages for each of the six factors. Simply put, it is the least happy country.

For instance, in Norway, **Economy..GDP.per.Capita** is the biggest contributor to making its Happiness evaluation higher. 

Note: The World Happiness Report says to "note that we do not construct our happiness measure in each country using these six factors". The factor columns only describe the extent of each contribution.
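As a rough way to see how these columns relate to the score, the sketch below adds the six factor values and the Dystopia residual for one row and compares the total with the score. It assumes that the score is approximately the sum of those seven columns, which is my reading of the dataset and worth checking on the real file. The numbers and key names are made up for illustration and are not taken from 2017.csv.

```python
import math

# Hypothetical values for one country (made up, not from 2017.csv)
row = {
    "score": 6.0,
    "gdp_per_capita": 1.4,
    "family": 1.2,
    "health": 0.8,
    "freedom": 0.5,
    "generosity": 0.3,
    "trust": 0.2,
    "dystopia_residual": 1.6,
}

factor_columns = ["gdp_per_capita", "family", "health",
                  "freedom", "generosity", "trust"]

# Part of the score attributed to the six factors
explained = sum(row[c] for c in factor_columns)

# Assumption: score is roughly the six factors plus the Dystopia residual
total = explained + row["dystopia_residual"]
matches_score = math.isclose(total, row["score"], abs_tol=1e-3)

# The factor with the largest value is the biggest contributor for this row
largest_factor = max(factor_columns, key=lambda c: row[c])

print(matches_score, largest_factor)
```

On the real data, the same comparison could be run per row before the residual column is dropped.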

### Exploration ###

In [None]:
# convert all column names to lowercase
happiness2017.columns = happiness2017.columns.str.lower()

I converted all column names to lowercase because I find it easier to work with the dataset that way.

In [None]:
happiness2017.head()

In [None]:
# map the long column names to shorter ones
renaming = {'happiness.rank':'rank','happiness.score':'score','economy..gdp.per.capita.':'gdp_capita',
            'health..life.expectancy.':'health','trust..government.corruption.':'trust'}
happiness2017.rename(columns=renaming,inplace=True)

Renamed the long column names to shorter ones.
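A more general way to tidy names like these is to clean them with a regular expression instead of listing each one by hand. The sketch below is only an illustration: `clean_column_name` is a hypothetical helper, not something used elsewhere in this notebook, and it produces longer names than the short ones chosen above.

```python
import re

def clean_column_name(name):
    """Lowercase a name and collapse runs of non-alphanumeric characters into '_'."""
    lowered = name.lower()
    collapsed = re.sub(r"[^0-9a-z]+", "_", lowered)
    # Remove underscores left over from leading or trailing dots
    return collapsed.strip("_")

# Column names in the style of the raw 2017 file
raw_names = ["Happiness.Rank", "Economy..GDP.per.Capita.", "Trust..Government.Corruption."]
cleaned = [clean_column_name(n) for n in raw_names]
print(cleaned)
```

With a DataFrame `df`, the same helper could be applied as `df.columns = [clean_column_name(c) for c in df.columns]`. A renaming dictionary would still be needed to get short names such as gdp_capita.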

In [None]:
happiness2017.head()

In [None]:
happiness2017.dtypes

There is little variation in dtypes in this dataset: almost all columns are floats.

In [None]:
happiness2017.isnull().sum()

There are no null values in any column, so this dataset seems very clean.

In [None]:
happiness2017.describe()

In [None]:
happiness2017[(happiness2017['gdp_capita']==0) | (happiness2017['family']==0)|
              (happiness2017['health']==0) | (happiness2017['freedom']==0)|
              (happiness2017['generosity']==0) | (happiness2017['trust']==0)]


Because it seemed odd to me that the minimum of each of the six factors is 0, I checked which countries have those values. The zeros might not be unusual for these countries. For instance, according to Wikipedia, the Central African Republic has one of the poorest populations.
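The same zero check can be written more compactly by comparing all factor columns at once. This is a minimal sketch on a made-up three-row DataFrame; the country names and values are hypothetical, and only three factor columns are included to keep it short.

```python
import pandas as pd

# Hypothetical mini-dataset (values are made up)
df = pd.DataFrame({
    "country": ["A", "B", "C"],
    "gdp_capita": [1.2, 0.0, 0.8],
    "family": [1.1, 0.9, 0.0],
    "health": [0.7, 0.3, 0.5],
})

factors = ["gdp_capita", "family", "health"]

# True for every row where at least one factor equals 0
has_zero = (df[factors] == 0).any(axis=1)
zero_rows = df[has_zero]
print(zero_rows)
```

In this notebook, `factors` would list all six factor columns of happiness2017.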

In [None]:
happiness2017.drop(['whisker.high','whisker.low','dystopia.residual'],inplace=True, axis = 1)

Since these three columns are not among the six factors I analyze, I dropped them.

## Main Factors Contributing to Happiness Scores ##

From here, I look for the main factors that make Happiness Scores higher. To find the relationships, I calculate the correlation between the Happiness Score and each of the six factors.

In [None]:
fig = plt.figure(figsize=(10,10))

# one scatter plot per factor (columns 3 to 8), with the score on the x-axis
for num, label in enumerate(happiness2017.columns[3:9], start=1):
    ax = fig.add_subplot(2,3,num)
    ax.scatter(happiness2017['score'],happiness2017[label])
    ax.set_title(label,loc='center')
plt.show()

In [None]:
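# 'country' is text, so restrict the correlation to numeric columns (numeric_only needs pandas 1.5 or later)
correlation = happiness2017.corr(numeric_only=True)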

In [None]:
correlation

In [None]:
correlation['score']

Strong positive correlations can be observed for all factors except generosity. In particular, gdp, family, and health have very strong correlations with the Happiness Score.
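By default, the `corr` method in pandas computes the Pearson correlation coefficient. To make clear what that number measures, here is a minimal sketch that computes it by hand in plain Python. The `pearson` function and the sample values are hypothetical and only for illustration.

```python
import math

def pearson(xs, ys):
    """Pearson correlation: covariance divided by the product of the spreads."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Made-up values: gdp rises in step with the score, generosity does not
scores = [4.0, 5.0, 6.0, 7.0]
gdp = [0.5, 0.8, 1.1, 1.4]
generosity = [0.3, 0.1, 0.4, 0.2]

r_gdp = pearson(scores, gdp)
r_generosity = pearson(scores, generosity)
print(r_gdp, r_generosity)
```

A value near 1 means the two columns rise together almost linearly, and a value near 0 means there is no linear relationship.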

In [None]:
# for practice, I also visualize the correlations with a heatmap
figure = plt.figure(figsize=(10,10))
sns.heatmap(correlation, annot = True)

Given the scatter plots and the correlations between the Happiness Score and each of the six factors, strong positive correlations can be observed between the score and every factor except generosity.
In particular, gdp_capita, family, and health have very high correlations. Since the factor values express the extent to which each factor makes life evaluations higher, the trend is that the happier a country is, the more these three factors contribute to its life evaluation. In simpler words, in happy countries these factors are likely to be the main reasons the countries score higher than Dystopia.
Since correlation does not describe causality, I cannot be sure that increasing those factors makes people happier. However, countries with higher values in one or more of the factors tend to have higher scores.
Thus, the answer to the first question is that all factors except generosity are likely to contribute strongly to higher life evaluations, and gdp, family, and health are especially influential.
However, I am not sure that they 'CAUSE' the increase in life evaluation.

## Average Happiness Score by Regions ##

The 2017 happiness dataset does not have a column describing each country's region, but the 2015 happiness dataset does. So, I will add the region column to the 2017 dataset based on the 2015 dataset. You can access the 2015 dataset from the link in the first cell.

In [None]:
happiness2015 = pd.read_csv('2015.csv')

In [None]:
happiness2015.shape

In [None]:
happiness2015.head()

In [None]:
check = happiness2017['country'].isin(happiness2015['Country'])

In [None]:
happiness2017[~check]

Some countries in happiness2017 do not exist in happiness2015. However, there are only a few of them, so I will not lose a significant amount of data when I combine the two datasets.

In [None]:
happiness2015.columns = happiness2015.columns.str.lower()

In [None]:
happiness2015[['country','region']].head()

In [None]:
happiness2017_region = pd.merge(happiness2017,happiness2015[['country','region']], on='country')

An inner merge of the happiness2017 and happiness2015 datasets on the 'country' column.
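An inner merge silently drops countries that have no match. If it is useful to see exactly which rows would be lost, a left merge with `indicator=True` is one option. The sketch below uses two tiny made-up tables, not the real datasets.

```python
import pandas as pd

# Hypothetical tables: country "C" has no region entry
left = pd.DataFrame({"country": ["A", "B", "C"], "score": [7.0, 5.5, 4.2]})
right = pd.DataFrame({"country": ["A", "B"], "region": ["North", "South"]})

# Keep every row of the left table and record where each row came from
merged = pd.merge(left, right, on="country", how="left", indicator=True)

# Rows marked 'left_only' had no match in the right table
unmatched = merged.loc[merged["_merge"] == "left_only", "country"].tolist()
print(unmatched)
```

The unmatched countries could then be given a region by hand or left out, depending on how much they matter for the analysis.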

In [None]:
region_score = happiness2017_region.groupby('region',as_index=False)['score'].mean().sort_values(ascending=False,by='score')

In [None]:
region_score

Here I group the data by region and calculate the mean happiness score in each region. Then I sort the result by the average score.

In [None]:
region_score.plot.barh(x='region',y='score',figsize=(10,10))

It can be observed that the "North America" and "Australia and New Zealand" regions have very high happiness scores on average. (Sheep are really great, I love them.)
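A regional average is easier to interpret when it comes with the number of countries behind it, because a region with only a couple of countries can rank high on very little data. Here is a minimal sketch that computes both the mean and the count per region. The regions and scores are made up.

```python
import pandas as pd

# Hypothetical scores for three regions of different sizes
df = pd.DataFrame({
    "region": ["X", "X", "Y", "Y", "Y", "Z"],
    "score": [7.0, 7.4, 5.0, 5.5, 6.0, 4.0],
})

# Mean score and number of countries per region, highest mean first
summary = (df.groupby("region")["score"]
             .agg(["mean", "count"])
             .sort_values("mean", ascending=False))
print(summary)
```

Applied to happiness2017_region, the count column would show how many countries each regional average is based on.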

## Contributing Factors by Regions ##

In [None]:
# 'country' is text, so average only the numeric columns
happiness2017_avg = happiness2017_region.groupby('region',as_index=False).mean(numeric_only=True)

In [None]:
happiness2017_avg = happiness2017_avg.sort_values(by='score',ascending=False).set_index('region')

In [None]:
happiness2017_avg

In [None]:
figure = plt.figure(figsize=(10,10))
sns.heatmap(happiness2017_avg[['gdp_capita','family','health','freedom','generosity','trust']], annot=True)

Reading this figure row by row, we can see that in each region gdp_capita and family are the main contributors to higher life evaluations, since their cells are brighter than those of the other factors. However, in Southeastern Asia, the difference in brightness between these two factors and the other factors seems slightly smaller than in other regions.
Therefore, I would argue that there are generally no unique factors that make life evaluations higher in specific regions, but in Southeastern Asia, freedom and generosity might contribute slightly more to life evaluations than they do in other regions.
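Comparing brightness across rows is hard by eye because regions with higher scores have larger values in every column. One possible way to make the rows comparable is to convert each row to shares that sum to 1. This is only a sketch with made-up averages and only four factor columns; it is not part of the analysis above.

```python
import pandas as pd

# Hypothetical regional averages (made up)
avg = pd.DataFrame(
    {
        "gdp_capita": [1.5, 1.0],
        "family": [1.5, 1.0],
        "freedom": [0.5, 1.0],
        "generosity": [0.5, 1.0],
    },
    index=["Region P", "Region Q"],
)

# Divide each row by its own total so every row sums to 1
shares = avg.div(avg.sum(axis=1), axis=0)
print(shares)
```

In this example, freedom and generosity take a larger share in Region Q than in Region P even though the row totals are the same. A heatmap of `shares` would show that kind of difference directly.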