# Group Assignment: Statistical Data Analysis
#### Group 7:
 - Aggarwal, Brahm, b3aggarw
 - Chellamuthu, Shanmuga, spchella
 - Mammadov, Rashad, r2mammad
 - Mazloomi, Rod, rmazloom
 - Sanchez, Monica, m7sanche
 - Sheikh, Hassan, h25sheik 

### Introduction

#### Research question

We will explore the following questions:
 - Which factors matter most for national happiness?
 - Are these factors constant over time?
 - How does happiness change over time?

#### Data

We will use data from the World Happiness Report, available on Kaggle:
 - https://www.kaggle.com/unsdsn/world-happiness

The World Happiness Report is a publication of the Sustainable Development Solutions Network. It was launched in 2012 and has been published annually through 2020, except in 2014. The Report uses data from the Gallup World Poll and ranks countries by how happy their citizens perceive themselves to be. The rankings are based on responses to the main life evaluation question, in which participants rate their lives on a scale from 0 (worst possible life) to 10 (best possible life). The rankings are established relative to Dystopia, an imaginary country that has the world's least-happy people, with values equal to the world's lowest national averages for each of the six key variables.

The Kaggle dataset includes data from the 2015-2019 waves of the World Happiness Report. Each wave contains the national happiness score and the country ranking. The dataset also contains the contribution to the happiness evaluation from six main factors: GDP, life expectancy, generosity, social support, freedom, and corruption. Finally, it includes a variable called 'Dystopia residual', which reflects the extent to which the six main factors over- or under-explain the average 2014-2016 life evaluations.

### Data preparation

In [1]:
# import libraries
import numpy as np
import pandas as pd
import seaborn as sns
import scipy.stats as stats
import statsmodels.formula.api as sm

%matplotlib inline
import matplotlib.pyplot as plt

Before importing the data, we edited the column names in the individual CSV files to ensure consistency across files. This allowed us to import only the columns of interest: the country and its region, its happiness ranking, its happiness score, the contribution to happiness from each of the six main factors, and the Dystopia residual. With consistent variable names, we could combine the individual CSV files into a single data frame, which required an additional variable indicating the year each observation refers to. Because the Report measures countries' happiness over time, the resulting data frame is a panel.
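
The renaming step could also be done in code rather than by hand. The sketch below is not what we did: it uses toy frames in place of the yearly files, and the raw header names in `RENAME_MAP` are hypothetical and may differ from those in the actual CSV files.

```python
import pandas as pd

# Hypothetical raw headers; the actual names in each yearly file may differ.
RENAME_MAP = {
    'Happiness.Rank': 'Happiness Rank',
    'Happiness.Score': 'Happiness Score',
    'Overall rank': 'Happiness Rank',
    'Score': 'Happiness Score',
    'Country or region': 'Country',
}

def harmonize(df, year):
    """Rename known column variants and tag the frame with its year."""
    out = df.rename(columns=RENAME_MAP)
    out.insert(0, 'Year', year)
    return out

# Two toy frames standing in for two yearly files with different headers.
raw_a = pd.DataFrame({'Country': ['A'], 'Happiness.Rank': [1], 'Happiness.Score': [7.5]})
raw_b = pd.DataFrame({'Country or region': ['A'], 'Overall rank': [2], 'Score': [7.4]})

panel = pd.concat([harmonize(raw_a, 2017), harmonize(raw_b, 2018)], ignore_index=True)
print(panel)
```

Keeping the mapping in code would make the preparation reproducible from the original downloads, at the cost of having to list every header variant.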

In [2]:
# import the data of interest
frames = []
col_list = ['Country','Region','Happiness Rank','Happiness Score',
            'Economy (GDP per Capita)','Family','Health (Life Expectancy)',
            'Freedom','Trust (Government Corruption)','Generosity','Dystopia Residual']

for year in range(2015,2020):
    df = pd.read_csv('datasets_894_813759_{:.0f}.csv'.format(year),
                     usecols=lambda col: col in set(col_list))
    df.insert(0,'Year',year)
    frames.append(df)
    
# concatenate frames
data = pd.concat(frames, sort=False)

In [3]:
# visualize the data
data.head()

Unnamed: 0,Year,Country,Region,Happiness Rank,Happiness Score,Economy (GDP per Capita),Family,Health (Life Expectancy),Freedom,Trust (Government Corruption),Generosity,Dystopia Residual
0,2015,Switzerland,Western Europe,1,7.587,1.39651,1.34951,0.94143,0.66557,0.41978,0.29678,2.51738
1,2015,Iceland,Western Europe,2,7.561,1.30232,1.40223,0.94784,0.62877,0.14145,0.4363,2.70201
2,2015,Denmark,Western Europe,3,7.527,1.32548,1.36058,0.87464,0.64938,0.48357,0.34139,2.49204
3,2015,Norway,Western Europe,4,7.522,1.459,1.33095,0.88521,0.66973,0.36503,0.34699,2.46531
4,2015,Canada,North America,5,7.427,1.32629,1.32261,0.90563,0.63297,0.32957,0.45811,2.45176


In [4]:
# understand the data
data.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 782 entries, 0 to 155
Data columns (total 12 columns):
Year                             782 non-null int64
Country                          782 non-null object
Region                           315 non-null object
Happiness Rank                   782 non-null int64
Happiness Score                  782 non-null float64
Economy (GDP per Capita)         782 non-null float64
Family                           782 non-null float64
Health (Life Expectancy)         782 non-null float64
Freedom                          782 non-null float64
Trust (Government Corruption)    781 non-null float64
Generosity                       782 non-null float64
Dystopia Residual                470 non-null float64
dtypes: float64(8), int64(2), object(2)
memory usage: 79.4+ KB


Our data has 782 rows and 12 columns. All variables are numeric except the country and the region, which are text. The region and the Dystopia residual have a large number of missing values (467 and 312, respectively), and the Trust measure has one. Because the region to which a country belongs is time invariant, we fill the missing region values using the available data.

In [5]:
# check whether there are spelling mistakes for the country and region
for var in ('Country','Region'):
    list1 = data[var].dropna().unique()
    list1.sort()
    print('{:}_list ({:} unique vals) = {:}'.format(var,len(list1),list1))

Country_list (170 unique vals) = ['Afghanistan' 'Albania' 'Algeria' 'Angola' 'Argentina' 'Armenia'
 'Australia' 'Austria' 'Azerbaijan' 'Bahrain' 'Bangladesh' 'Belarus'
 'Belgium' 'Belize' 'Benin' 'Bhutan' 'Bolivia' 'Bosnia and Herzegovina'
 'Botswana' 'Brazil' 'Bulgaria' 'Burkina Faso' 'Burundi' 'Cambodia'
 'Cameroon' 'Canada' 'Central African Republic' 'Chad' 'Chile' 'China'
 'Colombia' 'Comoros' 'Congo (Brazzaville)' 'Congo (Kinshasa)'
 'Costa Rica' 'Croatia' 'Cyprus' 'Czech Republic' 'Denmark' 'Djibouti'
 'Dominican Republic' 'Ecuador' 'Egypt' 'El Salvador' 'Estonia' 'Ethiopia'
 'Finland' 'France' 'Gabon' 'Gambia' 'Georgia' 'Germany' 'Ghana' 'Greece'
 'Guatemala' 'Guinea' 'Haiti' 'Honduras' 'Hong Kong'
 'Hong Kong S.A.R., China' 'Hungary' 'Iceland' 'India' 'Indonesia' 'Iran'
 'Iraq' 'Ireland' 'Israel' 'Italy' 'Ivory Coast' 'Jamaica' 'Japan'
 'Jordan' 'Kazakhstan' 'Kenya' 'Kosovo' 'Kuwait' 'Kyrgyzstan' 'Laos'
 'Latvia' 'Lebanon' 'Lesotho' 'Liberia' 'Libya' 'Lithuania' 'Luxembourg'
 '

In [6]:
# make country names consistent across the dataset
# (assign the result back instead of calling replace in place on an attribute)
data['Country'] = data['Country'].replace({'Hong Kong S.A.R., China':'Hong Kong',
                                           'Trinidad & Tobago':'Trinidad and Tobago',
                                           'North Cyprus':'Northern Cyprus',
                                           'North Macedonia':'Macedonia',
                                           'Taiwan Province of China':'Taiwan'})
data.Country.nunique()

165
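
With 165 countries and five waves, a balanced panel would have 165 × 5 = 825 rows, so our 782 rows imply that some countries are missing from some waves. A minimal sketch of how to count waves per country, using a toy frame with made-up countries rather than our data:

```python
import pandas as pd

# Toy panel: country A appears in all three waves, B in two, C in one.
toy = pd.DataFrame({
    'Year':    [2015, 2016, 2017, 2015, 2017, 2016],
    'Country': ['A',  'A',  'A',  'B',  'B',  'C'],
})

# Number of distinct waves in which each country appears.
waves_per_country = toy.groupby('Country')['Year'].nunique()
n_waves = toy['Year'].nunique()

# Countries observed in fewer waves than the panel contains.
incomplete = waves_per_country[waves_per_country < n_waves]
print(incomplete)
```

On our data the equivalent grouping would be `data.groupby('Country')['Year'].nunique()`.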

In [7]:
# create a dictionary for region and country
dic = {}
for reg in data['Region'].dropna().unique():
    dic[reg] = data[['Region','Country']].groupby('Region').get_group(reg)['Country'].unique()

In [8]:
# fill region nan values using the dictionary
for row in range(0,len(data.Country)):
    if data.isnull().iloc[row,2]:
        for key, val in dic.items():
            for j in range(0,len(val)):
                if data.iloc[row,1] == val[j]:
                    data.iloc[row,2] = key

# check for missing values
data.Region.isna().sum()

1
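
The dictionary and nested loop above can also be written as a country-to-region lookup. The sketch below is an alternative under stated assumptions, not the code that produced the output above: it uses a toy frame with made-up countries and regions, and it relies on a unique index, so our concatenated frame would need its index reset first.

```python
import pandas as pd

toy = pd.DataFrame({
    'Country': ['A', 'B', 'A', 'B', 'C'],
    'Region':  ['North', 'South', None, None, None],
})

# One region per country, taken from the rows where the region is known.
region_by_country = (toy.dropna(subset=['Region'])
                        .drop_duplicates('Country')
                        .set_index('Country')['Region'])

# Fill only the missing regions; countries never seen with a region stay missing.
toy['Region'] = toy['Region'].fillna(toy['Country'].map(region_by_country))
print(toy)
```

Country C plays the role of the remaining missing case here: it never appears with a region, so it stays missing and has to be filled by hand.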

After mapping each country to its region using the available data, one observation is still missing a region. We identify the country that was not assigned a region by the mapping and fill it in manually based on its geographical location.

In [9]:
# explore which countries are still missing a region
data.Country[data.Region.isna()]

119    Gambia
Name: Country, dtype: object

In [10]:
# fill in the missing region manually (.loc avoids chained assignment)
data.loc[data.Country=='Gambia', 'Region'] = 'Sub-Saharan Africa'


The Dystopia residual also has a large number of missing values. We address these using the fact that the happiness score equals the sum of the contributions from the six main factors plus the Dystopia residual. We check that this identity holds in our data and investigate whether any differences are due to rounding errors.
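
As a quick arithmetic check of the relationship between the score and its components, the sketch below sums the seven components for Switzerland in 2015, using the values shown in the `data.head()` output above.

```python
# Switzerland, 2015: six factor contributions plus the Dystopia residual.
components = {
    'Economy (GDP per Capita)': 1.39651,
    'Family': 1.34951,
    'Health (Life Expectancy)': 0.94143,
    'Freedom': 0.66557,
    'Trust (Government Corruption)': 0.41978,
    'Generosity': 0.29678,
    'Dystopia Residual': 2.51738,
}
reported_score = 7.587

# The components should add up to the reported score, up to rounding.
total = sum(components.values())
print(round(total, 5), round(total, 3) == reported_score)
```

The components add up to 7.58696, which rounds to the reported score of 7.587.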

In [11]:
# sum the components
data['HS_sum'] = data.iloc[:,5:12].sum(axis=1,skipna=False)
                    
# compare the happiness score and the sum of the contributions
data['HS_match'] = (round(data['HS_sum'],3)==round(data['Happiness Score'],3))

# share of true and false, when dystopia residual is not null
data.groupby('HS_match')['Dystopia Residual'].count()

HS_match
False      2
True     468
Name: Dystopia Residual, dtype: int64

There are only 2 cases in which the sum of the components does not match the happiness score. We explore those individually.

In [12]:
data.groupby('HS_match').get_group(False).dropna()

Unnamed: 0,Year,Country,Region,Happiness Rank,Happiness Score,Economy (GDP per Capita),Family,Health (Life Expectancy),Freedom,Trust (Government Corruption),Generosity,Dystopia Residual,HS_sum,HS_match
14,2015,United States,North America,15,7.119,1.39451,1.24711,0.86179,0.54604,0.1589,0.40105,2.51011,7.11951,False
80,2016,Azerbaijan,Central and Eastern Europe,81,5.291,1.12373,0.76042,0.54504,0.35327,0.17914,0.0564,2.2735,5.2915,False


In both cases where we did not get a match, there is a small rounding difference (about 0.0005) between our calculation and the score in the dataset. Because errors of this type are small and infrequent, we feel confident replacing the missing values of the Dystopia residual with the difference between the happiness score and the sum of the six core factors.
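
An exact comparison after rounding is sensitive to values that sit near a rounding boundary. A tolerance-based comparison is one alternative. The sketch below applies `numpy.isclose` to the two mismatched rows from the table above; the absolute tolerance of 0.001 is our assumption, not a value taken from the Report.

```python
import numpy as np

hs_sum = np.array([7.11951, 5.2915])     # component sums (United States 2015, Azerbaijan 2016)
hs_reported = np.array([7.119, 5.291])   # reported happiness scores

# Absolute differences, and a match flag under an assumed tolerance of 0.001.
diff = hs_sum - hs_reported
match = np.isclose(hs_sum, hs_reported, rtol=0, atol=1e-3)
print(diff.round(5), match)
```

Under this tolerance both rows count as matches, which is consistent with treating the discrepancies as rounding noise.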

In [13]:
# calculate missing values for the Dystopia residual
# (.loc with a boolean mask avoids chained assignment)
missing = data['Dystopia Residual'].isna()
factor_sum = data.iloc[:,5:11].sum(axis=1,skipna=False)
data.loc[missing,'Dystopia Residual'] = (data['Happiness Score'] - factor_sum)[missing].to_numpy()
data.drop(['HS_sum','HS_match'], axis=1, inplace=True)
data.isna().sum()


Year                             0
Country                          0
Region                           0
Happiness Rank                   0
Happiness Score                  0
Economy (GDP per Capita)         0
Family                           0
Health (Life Expectancy)         0
Freedom                          0
Trust (Government Corruption)    1
Generosity                       0
Dystopia Residual                1
dtype: int64

In [14]:
# explore the remaining missing values
data[data.isnull().any(axis=1)]

Unnamed: 0,Year,Country,Region,Happiness Rank,Happiness Score,Economy (GDP per Capita),Family,Health (Life Expectancy),Freedom,Trust (Government Corruption),Generosity,Dystopia Residual
19,2018,United Arab Emirates,Middle East and Northern Africa,20,6.774,2.096,0.776,0.67,0.284,,0.186,


The two remaining missing values in our dataset belong to the same observation. Because we cannot recover them from the available information, we exclude the observation from the analysis.
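
To see why this observation cannot be recovered: the score decomposition gives one equation, but both the Trust contribution and the Dystopia residual are unknown, so only their sum is pinned down. A short sketch with the values from the row above:

```python
# United Arab Emirates, 2018.
score = 6.774
known = [2.096, 0.776, 0.67, 0.284, 0.186]  # GDP, Family, Health, Freedom, Generosity

# Trust + Dystopia residual must equal the score minus the known contributions.
trust_plus_residual = score - sum(known)
print(round(trust_plus_residual, 3))
```

Any split of that combined value (about 2.762) between the two variables would satisfy the identity, so neither can be imputed on its own without further assumptions.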

In [15]:
# drop the observation from the dataset
data.dropna(inplace=True)

### Data analysis

### Conclusion