In [1]:
#Import all relevant libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from scipy.stats import zscore
 
## This statement allows the visuals to render within your Jupyter Notebook.
%matplotlib inline

## Loading the data
We can now load the dataset into pandas using the read_csv() function. This converts the CSV file into a Pandas dataframe.

In [2]:
#Read in the csv file and convert to a Pandas dataframe
happiness2015 = pd.read_csv('Data/2015.csv')
happiness2016 = pd.read_csv('Data/2016.csv')
happiness2017 = pd.read_csv('Data/2017.csv')
happiness2018 = pd.read_csv('Data/2018.csv')
happiness2019 = pd.read_csv('Data/2019.csv')
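
The five near-identical calls above could also be written as a loop. The sketch below assumes the same `Data/<year>.csv` layout; `load_reports` is a hypothetical helper name, not something the notebook defines.

```python
from pathlib import Path

import pandas as pd


def load_reports(data_dir, years):
    """Read one CSV per year into a dict keyed by year (hypothetical helper)."""
    return {year: pd.read_csv(Path(data_dir) / f"{year}.csv") for year in years}


# Example usage, assuming the files live in Data/ as in the cell above:
# reports = load_reports("Data", range(2015, 2020))
# reports[2015].shape
```

Keeping the frames in a dict makes later per-year checks easy to run in a loop, at the cost of slightly longer access such as `reports[2015]`.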

### Viewing the dataframe
We can get a quick sense of the size of each dataset with the shape attribute, which returns a tuple containing the number of rows and columns.

In [5]:
happiness2015.shape, happiness2016.shape, happiness2017.shape, happiness2018.shape, happiness2019.shape

#The shapes raise a few questions. There are 12 features in 2015 & 2017, 13 features in 2016,
#and 9 features in 2018 & 2019. What happened here? Why were some features removed in 2018 & 2019? Were they less
#important for people's happiness?

((158, 12), (157, 13), (155, 12), (156, 9), (156, 9))
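
The shapes show that the number of columns changes, but not which columns. A minimal sketch of how the added and removed columns could be listed per pair of consecutive years follows; `column_changes` is a hypothetical helper, and the two demo frames are empty stand-ins that use only a few of the real column names.

```python
import pandas as pd


def column_changes(frames):
    """For each consecutive pair of years, list columns added and removed."""
    years = sorted(frames)
    changes = {}
    for prev, curr in zip(years, years[1:]):
        before, after = set(frames[prev].columns), set(frames[curr].columns)
        changes[(prev, curr)] = {
            "added": sorted(after - before),
            "removed": sorted(before - after),
        }
    return changes


# Empty stand-in frames (illustrative only) with a subset of the real columns.
demo = {
    2015: pd.DataFrame(columns=["Country", "Happiness Score", "Standard Error"]),
    2016: pd.DataFrame(columns=["Country", "Happiness Score",
                                "Lower Confidence Interval",
                                "Upper Confidence Interval"]),
}
print(column_changes(demo)[(2015, 2016)])
```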

In [6]:
#Country -> the country name
#Region -> the region of the world the country belongs to
#Happiness Rank -> the position of the country when countries are ordered by happiness score
#Happiness Score -> the country's overall happiness score (higher means happier); it determines the rank
#Standard Error -> how far the sample-based score is likely to be from the population mean (how accurately a sample
#represents a population)
#Economy (GDP per Capita) -> how much gross domestic product per person, which is measured annually, contributes to the
#happiness score (the values are contributions, not raw GDP figures)
#Family -> how much having a family contributes to the happiness score
#Health (Life Expectancy) -> how much life expectancy contributes to the happiness score (the values are not years)
#Freedom -> how much freedom contributes to the happiness score
#Trust (Government Corruption) -> how much the perception of government corruption contributes to the happiness score
#Dystopia Residual -> Dystopia is the anti-utopia, the opposite of utopia: an imaginary country with the least happy
#people. It is used as a benchmark against which all other countries can be compared.
#We can see from the first five rows that four of the top five happiest countries are located in Western Europe and
#the fifth one is in North America

In [7]:
happiness2015.head()

Unnamed: 0,Country,Region,Happiness Rank,Happiness Score,Standard Error,Economy (GDP per Capita),Family,Health (Life Expectancy),Freedom,Trust (Government Corruption),Generosity,Dystopia Residual
0,Switzerland,Western Europe,1,7.587,0.03411,1.39651,1.34951,0.94143,0.66557,0.41978,0.29678,2.51738
1,Iceland,Western Europe,2,7.561,0.04884,1.30232,1.40223,0.94784,0.62877,0.14145,0.4363,2.70201
2,Denmark,Western Europe,3,7.527,0.03328,1.32548,1.36058,0.87464,0.64938,0.48357,0.34139,2.49204
3,Norway,Western Europe,4,7.522,0.0388,1.459,1.33095,0.88521,0.66973,0.36503,0.34699,2.46531
4,Canada,North America,5,7.427,0.03553,1.32629,1.32261,0.90563,0.63297,0.32957,0.45811,2.45176


In [8]:
#Compared with 2015, two new columns are added here and one column, Standard Error, is no longer present
#New columns = Lower Confidence Interval & Upper Confidence Interval
#Confidence Interval -> a range of values that is likely to contain the true happiness score
#Lower Confidence Interval -> the lower limit of the confidence interval
#Upper Confidence Interval -> the upper limit of the confidence interval
#Here, the top five happiest countries are all located in Western Europe!

In [9]:
happiness2016.head()

Unnamed: 0,Country,Region,Happiness Rank,Happiness Score,Lower Confidence Interval,Upper Confidence Interval,Economy (GDP per Capita),Family,Health (Life Expectancy),Freedom,Trust (Government Corruption),Generosity,Dystopia Residual
0,Denmark,Western Europe,1,7.526,7.46,7.592,1.44178,1.16374,0.79504,0.57941,0.44453,0.36171,2.73939
1,Switzerland,Western Europe,2,7.509,7.428,7.59,1.52733,1.14524,0.86303,0.58557,0.41203,0.28083,2.69463
2,Iceland,Western Europe,3,7.501,7.333,7.669,1.42666,1.18326,0.86733,0.56624,0.14975,0.47678,2.83137
3,Norway,Western Europe,4,7.498,7.421,7.575,1.57744,1.1269,0.79579,0.59609,0.35776,0.37895,2.66465
4,Finland,Western Europe,5,7.413,7.351,7.475,1.40598,1.13464,0.81091,0.57104,0.41004,0.25492,2.82596


In [10]:
#Lower and Upper Confidence Interval are replaced by Whisker.high and Whisker.low
#The Region feature is removed
#The top five happiest countries are the same as in the previous year, but in a different order

In [11]:
happiness2017.head()

Unnamed: 0,Country,Happiness.Rank,Happiness.Score,Whisker.high,Whisker.low,Economy..GDP.per.Capita.,Family,Health..Life.Expectancy.,Freedom,Generosity,Trust..Government.Corruption.,Dystopia.Residual
0,Norway,1,7.537,7.594445,7.479556,1.616463,1.533524,0.796667,0.635423,0.362012,0.315964,2.277027
1,Denmark,2,7.522,7.581728,7.462272,1.482383,1.551122,0.792566,0.626007,0.35528,0.40077,2.313707
2,Iceland,3,7.504,7.62203,7.38597,1.480633,1.610574,0.833552,0.627163,0.47554,0.153527,2.322715
3,Switzerland,4,7.494,7.561772,7.426227,1.56498,1.516912,0.858131,0.620071,0.290549,0.367007,2.276716
4,Finland,5,7.469,7.527542,7.410458,1.443572,1.540247,0.809158,0.617951,0.245483,0.382612,2.430182


In [12]:
#Whisker high and low, Family, and Dystopia Residual features are removed
#Country is renamed to Country or region, Happiness Score to Score, Happiness Rank to
#Overall rank, Economy (GDP per Capita) to GDP per capita, Health (Life Expectancy) to Healthy life expectancy,
#Freedom to Freedom to make life choices, and Trust (Government Corruption) to Perceptions of corruption
#A new feature is added -> Social support
#Social support -> how much social support contributes to the happiness score (it may be the successor of Family)
#The top five happiest countries are the same as in the previous year, but in a different order

In [13]:
happiness2018.head()

Unnamed: 0,Overall rank,Country or region,Score,GDP per capita,Social support,Healthy life expectancy,Freedom to make life choices,Generosity,Perceptions of corruption
0,1,Finland,7.632,1.305,1.592,0.874,0.681,0.202,0.393
1,2,Norway,7.594,1.456,1.582,0.861,0.686,0.286,0.34
2,3,Denmark,7.555,1.351,1.59,0.868,0.683,0.284,0.408
3,4,Iceland,7.495,1.343,1.644,0.914,0.677,0.353,0.138
4,5,Switzerland,7.487,1.42,1.549,0.927,0.66,0.256,0.357


In [14]:
#In the 2019 dataset we see that the World Happiness Report keeps the same six features to explain happiness (GDP per capita,
#Social support, Healthy life expectancy, Freedom to make life choices, Generosity, and Perceptions of corruption)
#Four of the top five happiest countries are the same as in the previous year, in a different order; the Netherlands
#replaces Switzerland in fifth place. We also find that Finland ranked first in both 2018 and 2019.


In [15]:
happiness2019.head()

Unnamed: 0,Overall rank,Country or region,Score,GDP per capita,Social support,Healthy life expectancy,Freedom to make life choices,Generosity,Perceptions of corruption
0,1,Finland,7.769,1.34,1.587,0.986,0.596,0.153,0.393
1,2,Denmark,7.6,1.383,1.573,0.996,0.592,0.252,0.41
2,3,Norway,7.554,1.488,1.582,1.028,0.603,0.271,0.341
3,4,Iceland,7.494,1.38,1.624,1.026,0.591,0.354,0.118
4,5,Netherlands,7.488,1.396,1.522,0.999,0.557,0.322,0.298


## 1. Data Profiling:
Data profiling is a comprehensive process of examining the data available in an existing dataset and collecting statistics and information about that data. 

In [16]:
happiness2015.columns

Index(['Country', 'Region', 'Happiness Rank', 'Happiness Score',
       'Standard Error', 'Economy (GDP per Capita)', 'Family',
       'Health (Life Expectancy)', 'Freedom', 'Trust (Government Corruption)',
       'Generosity', 'Dystopia Residual'],
      dtype='object')

In [17]:
happiness2016.columns

Index(['Country', 'Region', 'Happiness Rank', 'Happiness Score',
       'Lower Confidence Interval', 'Upper Confidence Interval',
       'Economy (GDP per Capita)', 'Family', 'Health (Life Expectancy)',
       'Freedom', 'Trust (Government Corruption)', 'Generosity',
       'Dystopia Residual'],
      dtype='object')

In [18]:
happiness2017.columns

Index(['Country', 'Happiness.Rank', 'Happiness.Score', 'Whisker.high',
       'Whisker.low', 'Economy..GDP.per.Capita.', 'Family',
       'Health..Life.Expectancy.', 'Freedom', 'Generosity',
       'Trust..Government.Corruption.', 'Dystopia.Residual'],
      dtype='object')

In [19]:
happiness2018.columns

Index(['Overall rank', 'Country or region', 'Score', 'GDP per capita',
       'Social support', 'Healthy life expectancy',
       'Freedom to make life choices', 'Generosity',
       'Perceptions of corruption'],
      dtype='object')

In [20]:
happiness2019.columns

Index(['Overall rank', 'Country or region', 'Score', 'GDP per capita',
       'Social support', 'Healthy life expectancy',
       'Freedom to make life choices', 'Generosity',
       'Perceptions of corruption'],
      dtype='object')
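
Because the column names differ from year to year, comparing years usually requires a shared schema first. The sketch below shows one possible approach; the mapping `COMMON_NAMES` and the helper `harmonize` are hypothetical, and treating `Family` and `Social support` as the same factor is an assumption rather than something the data confirms.

```python
import pandas as pd

# Hypothetical mapping from year-specific column names to one shared schema.
# Treating "Family" and "Social support" as the same factor is an assumption.
COMMON_NAMES = {
    "Country or region": "Country",
    "Happiness Rank": "Rank",
    "Happiness.Rank": "Rank",
    "Overall rank": "Rank",
    "Happiness Score": "Score",
    "Happiness.Score": "Score",
    "Economy (GDP per Capita)": "GDP per capita",
    "Economy..GDP.per.Capita.": "GDP per capita",
    "Family": "Social support",
    "Health (Life Expectancy)": "Healthy life expectancy",
    "Health..Life.Expectancy.": "Healthy life expectancy",
    "Freedom to make life choices": "Freedom",
    "Trust (Government Corruption)": "Perceptions of corruption",
    "Trust..Government.Corruption.": "Perceptions of corruption",
}


def harmonize(df, year):
    """Return a renamed copy of df with an extra Year column."""
    out = df.rename(columns=COMMON_NAMES)  # rename returns a new frame
    out["Year"] = year
    return out


# First rows of the 2017 and 2018 tables, restricted to three columns.
demo_2017 = pd.DataFrame(
    {"Country": ["Norway"], "Happiness.Rank": [1], "Happiness.Score": [7.537]}
)
demo_2018 = pd.DataFrame(
    {"Overall rank": [1], "Country or region": ["Finland"], "Score": [7.632]}
)
combined = pd.concat(
    [harmonize(demo_2017, 2017), harmonize(demo_2018, 2018)], ignore_index=True
)
print(combined)
```

Columns that exist in only some years (for example the confidence bounds) are left untouched by this mapping and would show up as missing values for the other years after concatenation.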

In [21]:
#We see here the number of entries (158) and how they are indexed, plus the columns with their names and types, which are correct.
#The last line counts the columns of each data type
#There are no null values in the 2015 dataset

In [22]:
happiness2015.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 158 entries, 0 to 157
Data columns (total 12 columns):
 #   Column                         Non-Null Count  Dtype  
---  ------                         --------------  -----  
 0   Country                        158 non-null    object 
 1   Region                         158 non-null    object 
 2   Happiness Rank                 158 non-null    int64  
 3   Happiness Score                158 non-null    float64
 4   Standard Error                 158 non-null    float64
 5   Economy (GDP per Capita)       158 non-null    float64
 6   Family                         158 non-null    float64
 7   Health (Life Expectancy)       158 non-null    float64
 8   Freedom                        158 non-null    float64
 9   Trust (Government Corruption)  158 non-null    float64
 10  Generosity                     158 non-null    float64
 11  Dystopia Residual              158 non-null    float64
dtypes: float64(9), int64(1), object(2)
memory usage: 1

In [23]:
#We see here that the 2016 dataset has one entry fewer than the 2015 dataset.
#It has one more column, so the number of float columns goes up to 10; the int and object counts are unchanged and all types are correct

In [24]:
happiness2016.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 157 entries, 0 to 156
Data columns (total 13 columns):
 #   Column                         Non-Null Count  Dtype  
---  ------                         --------------  -----  
 0   Country                        157 non-null    object 
 1   Region                         157 non-null    object 
 2   Happiness Rank                 157 non-null    int64  
 3   Happiness Score                157 non-null    float64
 4   Lower Confidence Interval      157 non-null    float64
 5   Upper Confidence Interval      157 non-null    float64
 6   Economy (GDP per Capita)       157 non-null    float64
 7   Family                         157 non-null    float64
 8   Health (Life Expectancy)       157 non-null    float64
 9   Freedom                        157 non-null    float64
 10  Trust (Government Corruption)  157 non-null    float64
 11  Generosity                     157 non-null    float64
 12  Dystopia Residual              157 non-null    flo

In [25]:
#The 2017 dataset has fewer entries than both the 2015 and 2016 datasets.
#Unlike 2015 and 2016, it has only one object column, which is the Country feature.
#The numbers of float and int columns are the same as in 2016

In [26]:
happiness2017.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 155 entries, 0 to 154
Data columns (total 12 columns):
 #   Column                         Non-Null Count  Dtype  
---  ------                         --------------  -----  
 0   Country                        155 non-null    object 
 1   Happiness.Rank                 155 non-null    int64  
 2   Happiness.Score                155 non-null    float64
 3   Whisker.high                   155 non-null    float64
 4   Whisker.low                    155 non-null    float64
 5   Economy..GDP.per.Capita.       155 non-null    float64
 6   Family                         155 non-null    float64
 7   Health..Life.Expectancy.       155 non-null    float64
 8   Freedom                        155 non-null    float64
 9   Generosity                     155 non-null    float64
 10  Trust..Government.Corruption.  155 non-null    float64
 11  Dystopia.Residual              155 non-null    float64
dtypes: float64(10), int64(1), object(1)
memory usage: 

In [27]:
#The 2018 dataset has one more entry than the 2017 dataset, but still fewer
#than the 2015 and 2016 datasets
#There are fewer features here, so the number of float columns decreases
#Perceptions of corruption has 155 non-null values out of 156, so one value is missing

In [28]:
happiness2018.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 156 entries, 0 to 155
Data columns (total 9 columns):
 #   Column                        Non-Null Count  Dtype  
---  ------                        --------------  -----  
 0   Overall rank                  156 non-null    int64  
 1   Country or region             156 non-null    object 
 2   Score                         156 non-null    float64
 3   GDP per capita                156 non-null    float64
 4   Social support                156 non-null    float64
 5   Healthy life expectancy       156 non-null    float64
 6   Freedom to make life choices  156 non-null    float64
 7   Generosity                    156 non-null    float64
 8   Perceptions of corruption     155 non-null    float64
dtypes: float64(7), int64(1), object(1)
memory usage: 11.1+ KB


In [29]:
#The number of entries and the features, including their names, match the 2018 dataset, which gives us
#a first hint that the World Happiness Report datasets are becoming more consistent over the years.

In [30]:
happiness2019.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 156 entries, 0 to 155
Data columns (total 9 columns):
 #   Column                        Non-Null Count  Dtype  
---  ------                        --------------  -----  
 0   Overall rank                  156 non-null    int64  
 1   Country or region             156 non-null    object 
 2   Score                         156 non-null    float64
 3   GDP per capita                156 non-null    float64
 4   Social support                156 non-null    float64
 5   Healthy life expectancy       156 non-null    float64
 6   Freedom to make life choices  156 non-null    float64
 7   Generosity                    156 non-null    float64
 8   Perceptions of corruption     156 non-null    float64
dtypes: float64(7), int64(1), object(1)
memory usage: 11.1+ KB
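
The `info()` output for 2018 lists 155 non-null values for `Perceptions of corruption` against 156 entries, which is easy to miss when reading five printouts. A small sketch that collects only the columns with gaps is shown below; `missing_summary` is a hypothetical helper, and the demo frames hold illustrative values rather than the real rows.

```python
import pandas as pd


def missing_summary(frames):
    """Count nulls per column for each year, keeping only columns with gaps."""
    rows = []
    for year, df in frames.items():
        counts = df.isna().sum()
        for column, n_missing in counts[counts > 0].items():
            rows.append({"Year": year, "Column": column, "Missing": int(n_missing)})
    return pd.DataFrame(rows, columns=["Year", "Column", "Missing"])


# Illustrative stand-in frames, not the real data.
demo = {
    2018: pd.DataFrame({"Score": [1.0, 2.0],
                        "Perceptions of corruption": [0.5, None]}),
    2019: pd.DataFrame({"Score": [1.0, 2.0],
                        "Perceptions of corruption": [0.5, 0.4]}),
}
print(missing_summary(demo))
```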


In [31]:
#We can see from this that every country name is unique, so no country is duplicated.
#There are 10 regions, and Sub-Saharan Africa is the most frequent of them, with 40 countries
happiness2015.describe(include='object')

Unnamed: 0,Country,Region
count,158,158
unique,158,10
top,Switzerland,Sub-Saharan Africa
freq,1,40


In [32]:
happiness2015.describe()

Unnamed: 0,Happiness Rank,Happiness Score,Standard Error,Economy (GDP per Capita),Family,Health (Life Expectancy),Freedom,Trust (Government Corruption),Generosity,Dystopia Residual
count,158.0,158.0,158.0,158.0,158.0,158.0,158.0,158.0,158.0,158.0
mean,79.493671,5.375734,0.047885,0.846137,0.991046,0.630259,0.428615,0.143422,0.237296,2.098977
std,45.754363,1.14501,0.017146,0.403121,0.272369,0.247078,0.150693,0.120034,0.126685,0.55355
min,1.0,2.839,0.01848,0.0,0.0,0.0,0.0,0.0,0.0,0.32858
25%,40.25,4.526,0.037268,0.545808,0.856823,0.439185,0.32833,0.061675,0.150553,1.75941
50%,79.5,5.2325,0.04394,0.910245,1.02951,0.696705,0.435515,0.10722,0.21613,2.095415
75%,118.75,6.24375,0.0523,1.158448,1.214405,0.811013,0.549092,0.180255,0.309883,2.462415
max,158.0,7.587,0.13693,1.69042,1.40223,1.02525,0.66973,0.55191,0.79588,3.60214


In [33]:
#We see here that no country is duplicated. Because every country appears once, "top" is not a meaningful mode;
#Denmark is the happiest country in 2016 because it holds rank 1. Sub-Saharan Africa is still the most frequent
#region, but with 38 countries instead of 40, so the region has two fewer countries in 2016
happiness2016.describe(include='object')

Unnamed: 0,Country,Region
count,157,157
unique,157,10
top,Denmark,Sub-Saharan Africa
freq,1,38


In [34]:
happiness2016.describe()

Unnamed: 0,Happiness Rank,Happiness Score,Lower Confidence Interval,Upper Confidence Interval,Economy (GDP per Capita),Family,Health (Life Expectancy),Freedom,Trust (Government Corruption),Generosity,Dystopia Residual
count,157.0,157.0,157.0,157.0,157.0,157.0,157.0,157.0,157.0,157.0,157.0
mean,78.980892,5.382185,5.282395,5.481975,0.95388,0.793621,0.557619,0.370994,0.137624,0.242635,2.325807
std,45.46603,1.141674,1.148043,1.136493,0.412595,0.266706,0.229349,0.145507,0.111038,0.133756,0.54222
min,1.0,2.905,2.732,3.078,0.0,0.0,0.0,0.0,0.0,0.0,0.81789
25%,40.0,4.404,4.327,4.465,0.67024,0.64184,0.38291,0.25748,0.06126,0.15457,2.03171
50%,79.0,5.314,5.237,5.419,1.0278,0.84142,0.59659,0.39747,0.10547,0.22245,2.29074
75%,118.0,6.269,6.154,6.434,1.27964,1.02152,0.72993,0.48453,0.17554,0.31185,2.66465
max,157.0,7.526,7.46,7.669,1.82427,1.18326,0.95277,0.60848,0.50521,0.81971,3.83772


In [35]:
#We see here that no country is duplicated, and Norway (rank 1) is the happiest country in the world in 2017
happiness2017.describe(include='object')

Unnamed: 0,Country
count,155
unique,155
top,Norway
freq,1


In [36]:
happiness2017.describe()

Unnamed: 0,Happiness.Rank,Happiness.Score,Whisker.high,Whisker.low,Economy..GDP.per.Capita.,Family,Health..Life.Expectancy.,Freedom,Generosity,Trust..Government.Corruption.,Dystopia.Residual
count,155.0,155.0,155.0,155.0,155.0,155.0,155.0,155.0,155.0,155.0,155.0
mean,78.0,5.354019,5.452326,5.255713,0.984718,1.188898,0.551341,0.408786,0.246883,0.12312,1.850238
std,44.888751,1.13123,1.118542,1.14503,0.420793,0.287263,0.237073,0.149997,0.13478,0.101661,0.500028
min,1.0,2.693,2.864884,2.521116,0.0,0.0,0.0,0.0,0.0,0.0,0.377914
25%,39.5,4.5055,4.608172,4.374955,0.663371,1.042635,0.369866,0.303677,0.154106,0.057271,1.591291
50%,78.0,5.279,5.370032,5.193152,1.064578,1.253918,0.606042,0.437454,0.231538,0.089848,1.83291
75%,116.5,6.1015,6.1946,6.006527,1.318027,1.414316,0.723008,0.516561,0.323762,0.153296,2.144654
max,155.0,7.537,7.62203,7.479556,1.870766,1.610574,0.949492,0.658249,0.838075,0.464308,3.117485


In [37]:
#We see here that no country is duplicated, and Finland (rank 1) is the happiest country in the world in 2018
happiness2018.describe(include='object')

Unnamed: 0,Country or region
count,156
unique,156
top,Finland
freq,1


In [38]:
happiness2018.describe()

Unnamed: 0,Overall rank,Score,GDP per capita,Social support,Healthy life expectancy,Freedom to make life choices,Generosity,Perceptions of corruption
count,156.0,156.0,156.0,156.0,156.0,156.0,156.0,155.0
mean,78.5,5.375917,0.891449,1.213237,0.597346,0.454506,0.181006,0.112
std,45.177428,1.119506,0.391921,0.302372,0.247579,0.162424,0.098471,0.096492
min,1.0,2.905,0.0,0.0,0.0,0.0,0.0,0.0
25%,39.75,4.45375,0.61625,1.06675,0.42225,0.356,0.1095,0.051
50%,78.5,5.378,0.9495,1.255,0.644,0.487,0.174,0.082
75%,117.25,6.1685,1.19775,1.463,0.77725,0.5785,0.239,0.137
max,156.0,7.632,2.096,1.644,1.03,0.724,0.598,0.457


In [39]:
#We see here that no country is duplicated, and Finland (rank 1) is the happiest country in the world in 2019. The number
#of countries is the same as in 2018
happiness2019.describe(include='object')

Unnamed: 0,Country or region
count,156
unique,156
top,Finland
freq,1


In [40]:
happiness2019.describe()

Unnamed: 0,Overall rank,Score,GDP per capita,Social support,Healthy life expectancy,Freedom to make life choices,Generosity,Perceptions of corruption
count,156.0,156.0,156.0,156.0,156.0,156.0,156.0,156.0
mean,78.5,5.407096,0.905147,1.208814,0.725244,0.392571,0.184846,0.110603
std,45.177428,1.11312,0.398389,0.299191,0.242124,0.143289,0.095254,0.094538
min,1.0,2.853,0.0,0.0,0.0,0.0,0.0,0.0
25%,39.75,4.5445,0.60275,1.05575,0.54775,0.308,0.10875,0.047
50%,78.5,5.3795,0.96,1.2715,0.789,0.417,0.1775,0.0855
75%,117.25,6.1845,1.2325,1.4525,0.88175,0.50725,0.24825,0.14125
max,156.0,7.769,1.684,1.624,1.141,0.631,0.566,0.453


The process of profiling differs slightly for categorical and numerical variables due to their inherent differences.

**The two main types of data are:**
- Quantitative (numerical) data
- Qualitative (categorical) data

### Data Quality Checks
Data quality checks involve the process of ensuring that the data is accurate, complete, consistent, relevant, and reliable. 


**Here are typical steps involved in checking data quality:**

#### 1. Reliability:
Evaluate the data's source and collection process to determine its trustworthiness.

In [58]:
#The source of the datasets is the World Happiness Report which, as mentioned on its website, is a
#partnership of Gallup, the Oxford Wellbeing Research Centre, the UN Sustainable Development Solutions Network,
#and the WHR’s Editorial Board. This makes the source trustworthy and therefore reliable.

#### 2. Timeliness: 
Ensure the data is up-to-date and reflective of the current situation or the period of interest for the analysis.

In [None]:
#Each dataset covers one year of the period of interest, 2015-2019

#### 3. Consistency: 

Confirm that the data is consistent within the dataset and across multiple data sources. For example, the same data point should not have different values in different places.


In [59]:
#The country feature, which is supposed to be the same across the datasets, is consistent.
#The data schema changed from year to year during 2015-2017,
#but it became consistent in 2018, when the World Happiness Report standardized six features along with the overall rank
#and the country name
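
One way to check whether country names really match across years is to compare the sets of names directly. The following is a minimal sketch with a hypothetical helper, `country_differences`, and made-up stand-in names; with the real data the inputs would be, for example, `happiness2015["Country"]` and `happiness2018["Country or region"]`.

```python
import pandas as pd


def country_differences(earlier, later):
    """Return country names found in only one of two Series of names."""
    a, b = set(earlier), set(later)
    return {"only_in_earlier": sorted(a - b), "only_in_later": sorted(b - a)}


# Made-up stand-in names (illustrative only).
names_a = pd.Series(["Finland", "Norway", "Examplestan"])
names_b = pd.Series(["Finland", "Norway", "Example Land"])
print(country_differences(names_a, names_b))
```

Names that appear on only one side are either countries missing from a report or spelling variants of the same country, and they need a manual look before the years are merged.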

#### 4. Relevance: 
Assess whether the data is appropriate and applicable for the intended analysis. Data that is not relevant can skew results and lead to incorrect conclusions.

**Key considerations for relevance include:**

> 1. Sample Appropriateness: Confirm that your data sample aligns with your analysis objectives. For instance, utilizing data from the Northern region will not yield accurate insights for the Western region of the Kingdom.
>
> 2. Variable Selection: Any columns that are not relevant to our analysis can be removed with the drop() method. We set the “axis” argument to 1 since we’re dealing with columns, and set the “inplace” argument to True so that the change is applied to the dataframe itself.
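
As a small illustration of `drop()`, the sketch below removes one column from a one-row frame built from the first 2015 row. It uses `columns=`, which is equivalent to passing `axis=1`, and returns a new frame instead of using `inplace=True`; choosing `Dystopia Residual` as the column to drop is only an example.

```python
import pandas as pd

# One-row frame using values from the first 2015 row.
demo = pd.DataFrame(
    {
        "Country": ["Switzerland"],
        "Happiness Score": [7.587],
        "Dystopia Residual": [2.51738],
    }
)

# columns=[...] is equivalent to drop([...], axis=1); demo itself is unchanged.
trimmed = demo.drop(columns=["Dystopia Residual"])
print(trimmed.columns.tolist())
```

Returning a new frame keeps the original available for comparison, which is convenient while the set of relevant columns is still being decided.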


In [42]:
'''
To test the relevance of the datasets, we compare them against the questions we need to answer:
. What countries or regions rank the highest in overall happiness and each of
the six factors contributing to happiness?
. How did country ranks or scores change between the 2015 and 2016 as well as
the 2016 and 2017 reports?
. Did any country experience a significant increase or decrease in happiness?
. Bonus: Please begin your analysis, and don't hesitate to consider additional
relevant questions.
'''

'\nTo test the relevance of the dataset we need to check the questions we need to answer and the datasets\n\n'
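
The second question, how ranks changed between two reports, can be approached by merging two years on the country name. The sketch below assumes both frames share the `Country` and `Happiness Rank` columns, as 2015 and 2016 do; `rank_change` is a hypothetical helper, and the demo frames contain only four rows taken from the tables shown earlier.

```python
import pandas as pd


def rank_change(earlier, later, key="Country", rank_col="Happiness Rank"):
    """Merge two years and compute earlier rank minus later rank.

    A positive value means the country moved up the ranking.
    """
    merged = earlier.merge(later, on=key, suffixes=("_earlier", "_later"))
    merged["Rank change"] = (
        merged[f"{rank_col}_earlier"] - merged[f"{rank_col}_later"]
    )
    return merged.sort_values("Rank change", ascending=False).reset_index(drop=True)


# Four rows taken from the 2015 and 2016 tables shown earlier.
r2015 = pd.DataFrame(
    {"Country": ["Switzerland", "Iceland", "Denmark", "Norway"],
     "Happiness Rank": [1, 2, 3, 4]}
)
r2016 = pd.DataFrame(
    {"Country": ["Denmark", "Switzerland", "Iceland", "Norway"],
     "Happiness Rank": [1, 2, 3, 4]}
)
print(rank_change(r2015, r2016))
```

The default inner merge silently drops countries that appear in only one of the two years, so the country lists should be compared before relying on the result.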

In [41]:
#For now, the following features are the most relevant: [Country, Region, Happiness Rank,
#Happiness Score, Economy (GDP per Capita), Family, Health (Life Expectancy), Freedom,
#Trust (Government Corruption), Generosity]
#The less relevant column (for now): [Dystopia Residual]
happiness2015.head()

Unnamed: 0,Country,Region,Happiness Rank,Happiness Score,Standard Error,Economy (GDP per Capita),Family,Health (Life Expectancy),Freedom,Trust (Government Corruption),Generosity,Dystopia Residual
0,Switzerland,Western Europe,1,7.587,0.03411,1.39651,1.34951,0.94143,0.66557,0.41978,0.29678,2.51738
1,Iceland,Western Europe,2,7.561,0.04884,1.30232,1.40223,0.94784,0.62877,0.14145,0.4363,2.70201
2,Denmark,Western Europe,3,7.527,0.03328,1.32548,1.36058,0.87464,0.64938,0.48357,0.34139,2.49204
3,Norway,Western Europe,4,7.522,0.0388,1.459,1.33095,0.88521,0.66973,0.36503,0.34699,2.46531
4,Canada,North America,5,7.427,0.03553,1.32629,1.32261,0.90563,0.63297,0.32957,0.45811,2.45176


In [43]:
#Why was the standard error replaced by lower and upper confidence intervals in the 2016 dataset?
#Generally, the standard error shows the uncertainty in an estimate, while a confidence interval
#shows the range of likely true values.

#The World Happiness Report website says that confidence intervals are useful for seeing
#whether countries differ significantly in their average life evaluations, which can help us answer
#the third question: Did any country experience a significant increase or decrease in happiness?

happiness2016.head()

Unnamed: 0,Country,Region,Happiness Rank,Happiness Score,Lower Confidence Interval,Upper Confidence Interval,Economy (GDP per Capita),Family,Health (Life Expectancy),Freedom,Trust (Government Corruption),Generosity,Dystopia Residual
0,Denmark,Western Europe,1,7.526,7.46,7.592,1.44178,1.16374,0.79504,0.57941,0.44453,0.36171,2.73939
1,Switzerland,Western Europe,2,7.509,7.428,7.59,1.52733,1.14524,0.86303,0.58557,0.41203,0.28083,2.69463
2,Iceland,Western Europe,3,7.501,7.333,7.669,1.42666,1.18326,0.86733,0.56624,0.14975,0.47678,2.83137
3,Norway,Western Europe,4,7.498,7.421,7.575,1.57744,1.1269,0.79579,0.59609,0.35776,0.37895,2.66465
4,Finland,Western Europe,5,7.413,7.351,7.475,1.40598,1.13464,0.81091,0.57104,0.41004,0.25492,2.82596


In [44]:
#Here, the upper and lower confidence intervals are replaced by whisker high and whisker low. Why?
#In a box plot, the high whisker marks the largest value that is not considered an outlier,
#and the low whisker marks the smallest value that is not considered an outlier.
#In this dataset, however, the two whisker values sit at equal distances above and below each country's score,
#so they appear to play the same role as the 2016 confidence bounds.
#They may therefore help us answer the third question: Did any country experience a significant increase or decrease in happiness?
happiness2017.head()

Unnamed: 0,Country,Happiness.Rank,Happiness.Score,Whisker.high,Whisker.low,Economy..GDP.per.Capita.,Family,Health..Life.Expectancy.,Freedom,Generosity,Trust..Government.Corruption.,Dystopia.Residual
0,Norway,1,7.537,7.594445,7.479556,1.616463,1.533524,0.796667,0.635423,0.362012,0.315964,2.277027
1,Denmark,2,7.522,7.581728,7.462272,1.482383,1.551122,0.792566,0.626007,0.35528,0.40077,2.313707
2,Iceland,3,7.504,7.62203,7.38597,1.480633,1.610574,0.833552,0.627163,0.47554,0.153527,2.322715
3,Switzerland,4,7.494,7.561772,7.426227,1.56498,1.516912,0.858131,0.620071,0.290549,0.367007,2.276716
4,Finland,5,7.469,7.527542,7.410458,1.443572,1.540247,0.809158,0.617951,0.245483,0.382612,2.430182


In [45]:
#Standard error, confidence intervals, and whiskers along with dystopia residual are completely
#removed, and we're left with Overall rank, Country or region, score and the six features/factors.
happiness2018.head()

Unnamed: 0,Overall rank,Country or region,Score,GDP per capita,Social support,Healthy life expectancy,Freedom to make life choices,Generosity,Perceptions of corruption
0,1,Finland,7.632,1.305,1.592,0.874,0.681,0.202,0.393
1,2,Norway,7.594,1.456,1.582,0.861,0.686,0.286,0.34
2,3,Denmark,7.555,1.351,1.59,0.868,0.683,0.284,0.408
3,4,Iceland,7.495,1.343,1.644,0.914,0.677,0.353,0.138
4,5,Switzerland,7.487,1.42,1.549,0.927,0.66,0.256,0.357


In [46]:
#The features in the 2019 dataset are the same as the features in the 2018 dataset
happiness2019.head()

Unnamed: 0,Overall rank,Country or region,Score,GDP per capita,Social support,Healthy life expectancy,Freedom to make life choices,Generosity,Perceptions of corruption
0,1,Finland,7.769,1.34,1.587,0.986,0.596,0.153,0.393
1,2,Denmark,7.6,1.383,1.573,0.996,0.592,0.252,0.41
2,3,Norway,7.554,1.488,1.582,1.028,0.603,0.271,0.341
3,4,Iceland,7.494,1.38,1.624,1.026,0.591,0.354,0.118
4,5,Netherlands,7.488,1.396,1.522,0.999,0.557,0.322,0.298


In [None]:
'''
After checking relevance and the official World Happiness Report website, we're not
going to drop any columns because they are all relevant to the use case. Some features may help
answer one question and others may help answer all of them.

Even though 2015, 2016, and 2017 have different statistical metrics, we can:
- calculate the confidence interval from the standard error for 2015 and then compare it with 2016
'''
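
The idea of deriving 2015 confidence bounds from the standard error can be sketched as follows. This assumes a normal approximation with a 95% level (multiplier 1.96); whether the report computed its own 2016 intervals this way is not confirmed by the data, and `add_confidence_interval` is a hypothetical helper. The demo rows are the first two 2015 rows.

```python
import pandas as pd

Z_95 = 1.96  # normal-approximation multiplier for a 95% interval (assumption)


def add_confidence_interval(df, score_col="Happiness Score",
                            se_col="Standard Error", z=Z_95):
    """Return a copy with lower/upper bounds computed as score -/+ z * SE."""
    out = df.copy()
    out["Lower Confidence Interval"] = out[score_col] - z * out[se_col]
    out["Upper Confidence Interval"] = out[score_col] + z * out[se_col]
    return out


# First two rows of the 2015 table.
demo = pd.DataFrame(
    {
        "Country": ["Switzerland", "Iceland"],
        "Happiness Score": [7.587, 7.561],
        "Standard Error": [0.03411, 0.04884],
    }
)
with_ci = add_confidence_interval(demo)
print(with_ci)
```

The new columns reuse the 2016 names so that the 2015 and 2016 frames can be compared column by column.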

#### 5. Uniqueness: 
Check for and remove duplicate records to prevent skewed analysis results.


In [53]:
happiness2015.head(3)

Unnamed: 0,Country,Region,Happiness Rank,Happiness Score,Standard Error,Economy (GDP per Capita),Family,Health (Life Expectancy),Freedom,Trust (Government Corruption),Generosity,Dystopia Residual
0,Switzerland,Western Europe,1,7.587,0.03411,1.39651,1.34951,0.94143,0.66557,0.41978,0.29678,2.51738
1,Iceland,Western Europe,2,7.561,0.04884,1.30232,1.40223,0.94784,0.62877,0.14145,0.4363,2.70201
2,Denmark,Western Europe,3,7.527,0.03328,1.32548,1.36058,0.87464,0.64938,0.48357,0.34139,2.49204


In [56]:
#Checking for duplicates in 2015
happiness2015.duplicated().sum()

0

In [58]:
happiness2015.duplicated(['Happiness Rank']).sum()

1

In [55]:
#We see here that two countries have the same rank and the same score, which is a normal case.
#The happiness score (the overall score built from the six factor scores) determines the rank, so equal scores
#give equal ranks; the two rows are different countries, not duplicates
happiness2015[happiness2015.duplicated(['Happiness Rank'], keep=False)]

Unnamed: 0,Country,Region,Happiness Rank,Happiness Score,Standard Error,Economy (GDP per Capita),Family,Health (Life Expectancy),Freedom,Trust (Government Corruption),Generosity,Dystopia Residual
81,Jordan,Middle East and Northern Africa,82,5.192,0.04524,0.90198,1.05392,0.69639,0.40661,0.14293,0.11053,1.87996
82,Montenegro,Central and Eastern Europe,82,5.192,0.05235,0.97438,0.90557,0.72521,0.1826,0.14296,0.1614,2.10017
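
A shared rank is what a tie-aware ranking produces when two scores are equal. The sketch below reproduces that behaviour with `Series.rank(method="min")` on three scores taken from the 2015 table; that the report assigns ranks with exactly this rule is an assumption.

```python
import pandas as pd

# Scores taken from the 2015 table; Jordan and Montenegro are tied.
scores = pd.Series(
    [7.587, 5.192, 5.192],
    index=["Switzerland", "Jordan", "Montenegro"],
)

# method="min" gives tied values the same (lowest) rank; ascending=False
# ranks the highest score first. Assumed tie rule, not confirmed by the source.
ranks = scores.rank(method="min", ascending=False).astype(int)
print(ranks)
```

Rows that share a rank because they share a score are therefore expected, and `duplicated(['Happiness Rank'])` flags them even though nothing is wrong with the data.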


In [69]:
happiness2015.duplicated(['Health (Life Expectancy)']).sum()

1

In [70]:
#The duplicated value of Health (Life Expectancy) belongs to Cyprus and North Cyprus, the two parts of
#a divided island, which is a normal case
happiness2015[happiness2015.duplicated(['Health (Life Expectancy)'], keep=False)]

Unnamed: 0,Country,Region,Happiness Rank,Happiness Score,Standard Error,Economy (GDP per Capita),Family,Health (Life Expectancy),Freedom,Trust (Government Corruption),Generosity,Dystopia Residual
65,North Cyprus,Western Europe,66,5.695,0.05635,1.20806,1.07008,0.92356,0.49027,0.1428,0.26169,1.59888
66,Cyprus,Western Europe,67,5.689,0.0558,1.20813,0.89318,0.92356,0.40672,0.06146,0.30638,1.88931


In [79]:
#Checking for duplicates in 2016
happiness2016.duplicated().sum()

0

In [81]:
happiness2016.head(3)

Unnamed: 0,Country,Region,Happiness Rank,Happiness Score,Lower Confidence Interval,Upper Confidence Interval,Economy (GDP per Capita),Family,Health (Life Expectancy),Freedom,Trust (Government Corruption),Generosity,Dystopia Residual
0,Denmark,Western Europe,1,7.526,7.46,7.592,1.44178,1.16374,0.79504,0.57941,0.44453,0.36171,2.73939
1,Switzerland,Western Europe,2,7.509,7.428,7.59,1.52733,1.14524,0.86303,0.58557,0.41203,0.28083,2.69463
2,Iceland,Western Europe,3,7.501,7.333,7.669,1.42666,1.18326,0.86733,0.56624,0.14975,0.47678,2.83137


In [113]:
happiness2016.duplicated(['Happiness Rank']).sum()

3

In [112]:
happiness2016[happiness2016.duplicated(['Happiness Rank'], keep=False)]

Unnamed: 0,Country,Region,Happiness Rank,Happiness Score,Lower Confidence Interval,Upper Confidence Interval,Economy (GDP per Capita),Family,Health (Life Expectancy),Freedom,Trust (Government Corruption),Generosity,Dystopia Residual
33,Saudi Arabia,Middle East and Northern Africa,34,6.379,6.287,6.471,1.48953,0.84829,0.59267,0.37904,0.30008,0.15457,2.61482
34,Taiwan,Eastern Asia,34,6.379,6.305,6.453,1.39729,0.92624,0.79565,0.32377,0.0663,0.25495,2.61523
56,Poland,Central and Eastern Europe,57,5.835,5.749,5.921,1.24585,1.04685,0.69058,0.4519,0.055,0.14443,2.20035
57,South Korea,Eastern Asia,57,5.835,5.747,5.923,1.35948,0.72194,0.88645,0.25168,0.07716,0.18824,2.35015
144,Burkina Faso,Sub-Saharan Africa,145,3.739,3.647,3.831,0.31995,0.63054,0.21297,0.3337,0.12533,0.24353,1.87319
145,Uganda,Sub-Saharan Africa,145,3.739,3.629,3.849,0.34719,0.90981,0.19625,0.43653,0.06442,0.27102,1.51416


In [87]:
happiness2016.duplicated(['Lower Confidence Interval', 'Upper Confidence Interval']).sum()

0

In [85]:
happiness2016.duplicated(['Lower Confidence Interval']).sum()

3

In [88]:
#Rows that share a Lower Confidence Interval value
happiness2016[happiness2016.duplicated(['Lower Confidence Interval'], keep=False)]

Unnamed: 0,Country,Region,Happiness Rank,Happiness Score,Lower Confidence Interval,Upper Confidence Interval,Economy (GDP per Capita),Family,Health (Life Expectancy),Freedom,Trust (Government Corruption),Generosity,Dystopia Residual
102,Nigeria,Sub-Saharan Africa,103,4.875,4.75,5.0,0.75216,0.64498,0.05108,0.27854,0.0305,0.23219,2.88586
103,Honduras,Latin America and Caribbean,104,4.871,4.75,4.992,0.69429,0.75596,0.58383,0.26755,0.06906,0.2044,2.29551
119,Egypt,Middle East and Northern Africa,120,4.362,4.259,4.465,0.95395,0.49813,0.52116,0.18847,0.10393,0.12706,1.96895
121,Kenya,Sub-Saharan Africa,122,4.356,4.259,4.453,0.52267,0.7624,0.30147,0.40576,0.06686,0.41328,1.88326
146,Yemen,Middle East and Northern Africa,147,3.724,3.621,3.827,0.57939,0.47493,0.31048,0.2287,0.05892,0.09821,1.97295
147,Madagascar,Sub-Saharan Africa,148,3.695,3.621,3.769,0.27954,0.46115,0.37109,0.13684,0.07506,0.2204,2.15075


In [86]:
happiness2016.duplicated(['Upper Confidence Interval']).sum()

3

In [89]:
happiness2016[happiness2016.duplicated(['Upper Confidence Interval'], keep=False)]

Unnamed: 0,Country,Region,Happiness Rank,Happiness Score,Lower Confidence Interval,Upper Confidence Interval,Economy (GDP per Capita),Family,Health (Life Expectancy),Freedom,Trust (Government Corruption),Generosity,Dystopia Residual
55,Russia,Central and Eastern Europe,56,5.856,5.789,5.923,1.23228,1.05261,0.58991,0.32682,0.03586,0.02736,2.59115
57,South Korea,Eastern Asia,57,5.835,5.747,5.923,1.35948,0.72194,0.88645,0.25168,0.07716,0.18824,2.35015
84,Kyrgyzstan,Central and Eastern Europe,85,5.185,5.103,5.267,0.56044,0.95434,0.55449,0.40212,0.04762,0.38432,2.28136
87,Montenegro,Central and Eastern Europe,88,5.161,5.055,5.267,1.07838,0.74173,0.63533,0.15111,0.12721,0.17191,2.25531
95,Vietnam,Southeastern Asia,96,5.061,4.991,5.131,0.74037,0.79117,0.66157,0.55954,0.11556,0.25075,1.9418
98,Greece,Western Europe,99,5.033,4.935,5.131,1.24886,0.75473,0.80029,0.05822,0.04127,0.0,2.12944


In [103]:
# Checking duplicates in 2017
happiness2017.duplicated().sum()

0

In [104]:
happiness2017.head(3)

Unnamed: 0,Country,Happiness.Rank,Happiness.Score,Whisker.high,Whisker.low,Economy..GDP.per.Capita.,Family,Health..Life.Expectancy.,Freedom,Generosity,Trust..Government.Corruption.,Dystopia.Residual
0,Norway,1,7.537,7.594445,7.479556,1.616463,1.533524,0.796667,0.635423,0.362012,0.315964,2.277027
1,Denmark,2,7.522,7.581728,7.462272,1.482383,1.551122,0.792566,0.626007,0.35528,0.40077,2.313707
2,Iceland,3,7.504,7.62203,7.38597,1.480633,1.610574,0.833552,0.627163,0.47554,0.153527,2.322715


In [None]:
#Aren't countries with the same score supposed to have the same rank?

In [108]:
happiness2017.duplicated(['Happiness.Score']).sum() #Score is duplicated!

4

In [109]:
happiness2017.duplicated(['Happiness.Rank']).sum() #Rank is not duplicated!

0

In [111]:
#Comparing with 2015 and 2016 -> in those datasets, countries with the same happiness
#score also get the same happiness rank (2015 reports a standard error, 2016 reports a
#confidence interval).
#In the 2017 dataset, by contrast, two countries with the same happiness score get different happiness ranks.
#This tells us that the happiness rank in the 2017 dataset doesn't depend solely on the overall score.
happiness2017[happiness2017.duplicated('Happiness.Score', keep=False)]

Unnamed: 0,Country,Happiness.Rank,Happiness.Score,Whisker.high,Whisker.low,Economy..GDP.per.Capita.,Family,Health..Life.Expectancy.,Freedom,Generosity,Trust..Government.Corruption.,Dystopia.Residual
8,Sweden,9,7.284,7.344095,7.223905,1.494387,1.478162,0.830875,0.612924,0.385399,0.384399,2.097538
9,Australia,10,7.284,7.356651,7.211349,1.484415,1.510042,0.843887,0.601607,0.477699,0.301184,2.065211
27,Uruguay,28,6.454,6.545906,6.362094,1.21756,1.412228,0.719217,0.579392,0.175097,0.178062,2.17241
28,Guatemala,29,6.454,6.566874,6.341126,0.872002,1.255585,0.54024,0.531311,0.283488,0.077223,2.893891
54,South Korea,55,5.838,5.922559,5.753441,1.401678,1.128274,0.900214,0.257922,0.206674,0.063283,1.880378
55,Moldova,56,5.838,5.908371,5.767629,0.728871,1.251826,0.589465,0.240729,0.208779,0.010091,2.807808
93,Vietnam,94,5.074,5.147281,5.000719,0.788548,1.277491,0.652169,0.571056,0.234968,0.087633,1.462319
94,Nigeria,95,5.074,5.2095,4.9385,0.783756,1.21577,0.056916,0.394953,0.230947,0.026122,2.365391


In [130]:
#checking duplicates in 2018 dataset
happiness2018.duplicated().sum()

0

In [131]:
happiness2018.head(3)

Unnamed: 0,Overall rank,Country or region,Score,GDP per capita,Social support,Healthy life expectancy,Freedom to make life choices,Generosity,Perceptions of corruption
0,1,Finland,7.632,1.305,1.592,0.874,0.681,0.202,0.393
1,2,Norway,7.594,1.456,1.582,0.861,0.686,0.286,0.34
2,3,Denmark,7.555,1.351,1.59,0.868,0.683,0.284,0.408


In [137]:
happiness2018.duplicated(['Overall rank']).sum() #the rank is not duplicated

0

In [136]:
happiness2018.duplicated(['Score']).sum() #the score is duplicated, so countries can have
#the same score but not the same rank; as in the 2017 dataset, the rank in the 2018 dataset
#doesn't depend solely on the overall score

2

In [138]:
happiness2018[happiness2018.duplicated(['Score'], keep=False)]

Unnamed: 0,Overall rank,Country or region,Score,GDP per capita,Social support,Healthy life expectancy,Freedom to make life choices,Generosity,Perceptions of corruption
72,73,Belarus,5.483,1.039,1.498,0.7,0.307,0.101,0.154
73,74,Turkey,5.483,1.148,1.38,0.686,0.324,0.106,0.109
78,79,Greece,5.358,1.154,1.202,0.879,0.131,0.0,0.044
79,80,Lebanon,5.358,0.965,1.179,0.785,0.503,0.214,0.136


In [139]:
#This column wasn't duplicated in the previous years
happiness2018.duplicated(['GDP per capita']).sum() 

9

In [168]:
happiness2018[happiness2018.duplicated(['GDP per capita'], keep=False)].head()

Unnamed: 0,Overall rank,Country or region,Score,GDP per capita,Social support,Healthy life expectancy,Freedom to make life choices,Generosity,Perceptions of corruption
9,10,Australia,7.272,1.34,1.573,0.91,0.647,0.361,0.302
12,13,Costa Rica,7.072,1.01,1.459,0.817,0.632,0.143,0.101
14,15,Germany,6.965,1.34,1.474,0.861,0.586,0.273,0.28
45,46,Thailand,6.072,1.016,1.417,0.707,0.637,0.364,0.029
52,53,Latvia,5.933,1.148,1.454,0.671,0.363,0.092,0.066


In [141]:
happiness2018.duplicated(['Social support']).sum()

10

In [169]:
happiness2018[happiness2018.duplicated(['Social support'], keep=False)].head()

Unnamed: 0,Overall rank,Country or region,Score,GDP per capita,Social support,Healthy life expectancy,Freedom to make life choices,Generosity,Perceptions of corruption
6,7,Canada,7.328,1.33,1.532,0.896,0.653,0.321,0.291
8,9,Sweden,7.314,1.355,1.501,0.913,0.659,0.285,0.383
12,13,Costa Rica,7.072,1.01,1.459,0.817,0.632,0.143,0.101
14,15,Germany,6.965,1.34,1.474,0.861,0.586,0.273,0.28
24,25,Chile,6.476,1.131,1.331,0.808,0.431,0.197,0.061


In [143]:
happiness2018.duplicated(['Healthy life expectancy']).sum()

13

In [167]:
happiness2018[happiness2018.duplicated(['Healthy life expectancy'], keep=False)].head()

Unnamed: 0,Overall rank,Country or region,Score,GDP per capita,Social support,Healthy life expectancy,Freedom to make life choices,Generosity,Perceptions of corruption
1,2,Norway,7.594,1.456,1.582,0.861,0.686,0.286,0.34
6,7,Canada,7.328,1.33,1.532,0.896,0.653,0.321,0.291
7,8,New Zealand,7.324,1.268,1.601,0.876,0.669,0.365,0.389
13,14,Ireland,6.977,1.448,1.583,0.876,0.614,0.307,0.306
14,15,Germany,6.965,1.34,1.474,0.861,0.586,0.273,0.28


In [145]:
happiness2018.duplicated(['Freedom to make life choices']).sum()

20

In [166]:
happiness2018[happiness2018.duplicated(['Freedom to make life choices'], keep=False)].head(2)

Unnamed: 0,Overall rank,Country or region,Score,GDP per capita,Social support,Healthy life expectancy,Freedom to make life choices,Generosity,Perceptions of corruption
12,13,Costa Rica,7.072,1.01,1.459,0.817,0.632,0.143,0.101
16,17,Luxembourg,6.91,1.576,1.52,0.896,0.632,0.196,0.321


In [147]:
happiness2018.duplicated(['Generosity']).sum()

34

In [148]:
happiness2018.duplicated(['Perceptions of corruption']).sum()

45

In [None]:
'''New questions
In the 2015, 2016, and 2017 datasets the features, specifically the six factors, were not duplicated. So, what
is happening here? Has happiness around the world increased or decreased?
'''

In [152]:
#Checking duplicates in 2019 dataset
happiness2019.duplicated().sum()

0

In [153]:
happiness2019.head()

Unnamed: 0,Overall rank,Country or region,Score,GDP per capita,Social support,Healthy life expectancy,Freedom to make life choices,Generosity,Perceptions of corruption
0,1,Finland,7.769,1.34,1.587,0.986,0.596,0.153,0.393
1,2,Denmark,7.6,1.383,1.573,0.996,0.592,0.252,0.41
2,3,Norway,7.554,1.488,1.582,1.028,0.603,0.271,0.341
3,4,Iceland,7.494,1.38,1.624,1.026,0.591,0.354,0.118
4,5,Netherlands,7.488,1.396,1.522,0.999,0.557,0.322,0.298


In [157]:
happiness2019.duplicated(['Overall rank']).sum()

0

In [156]:
happiness2019.duplicated(['Score']).sum() 
#the same pattern appears here: the overall rank is unique while a score is repeated

1

In [158]:
happiness2019[happiness2019.duplicated(['Score'], keep=False)]

Unnamed: 0,Overall rank,Country or region,Score,GDP per capita,Social support,Healthy life expectancy,Freedom to make life choices,Generosity,Perceptions of corruption
88,89,Morocco,5.208,0.801,0.782,0.782,0.418,0.036,0.076
89,90,Azerbaijan,5.208,1.043,1.147,0.769,0.351,0.035,0.182


In [159]:
happiness2019.duplicated(['GDP per capita']).sum()

10

In [160]:
happiness2019.duplicated(['Social support']).sum()

11

In [161]:
happiness2019.duplicated(['Healthy life expectancy']).sum()

37

In [162]:
happiness2019.duplicated(['Freedom to make life choices']).sum()

26

In [163]:
happiness2019.duplicated(['Generosity']).sum()

38

In [195]:
happiness2019['Generosity'].value_counts().head() 

0.153    4
0.244    3
0.175    3
0.043    3
0.083    3
Name: Generosity, dtype: int64

In [196]:
#Checking which countries the duplicated values are assigned to,
#taking 0.153 (this cell) and 0.244 (next cell) as examples.
#They are assigned to different countries whose ranks aren't close to each other,
#so they are not redundant records; the values are simply repeated.

happiness2019[happiness2019['Generosity'] == 0.153]

Unnamed: 0,Overall rank,Country or region,Score,GDP per capita,Social support,Healthy life expectancy,Freedom to make life choices,Generosity,Perceptions of corruption
0,1,Finland,7.769,1.34,1.587,0.986,0.596,0.153,0.393
29,30,Spain,6.354,1.286,1.484,1.062,0.362,0.153,0.079
110,111,Senegal,4.681,0.45,1.134,0.571,0.292,0.153,0.072
127,128,Mali,4.39,0.385,1.105,0.308,0.327,0.153,0.052


In [197]:
happiness2019[happiness2019['Generosity'] == 0.244]

Unnamed: 0,Overall rank,Country or region,Score,GDP per capita,Social support,Healthy life expectancy,Freedom to make life choices,Generosity,Perceptions of corruption
9,10,Austria,7.246,1.376,1.475,1.016,0.532,0.244,0.226
86,87,Turkmenistan,5.247,1.052,1.538,0.657,0.394,0.244,0.028
129,130,Sri Lanka,4.366,0.949,1.265,0.831,0.47,0.244,0.047


In [175]:
happiness2019.duplicated(['Perceptions of corruption']).sum()

43

In [187]:
#seeing some duplicated values in the perceptions of corruption feature
happiness2019['Perceptions of corruption'].value_counts().head(30) 

0.028    4
0.078    4
0.089    4
0.041    3
0.064    3
0.056    3
0.100    3
0.167    3
0.055    3
0.027    3
0.034    3
0.093    3
0.006    2
0.086    2
0.073    2
0.087    2
0.110    2
0.047    2
0.164    2
0.097    2
0.050    2
0.053    2
0.080    2
0.085    2
0.082    2
0.025    2
0.182    2
0.060    2
0.076    1
0.114    1
Name: Perceptions of corruption, dtype: int64

In [193]:
#This column has a high percentage of duplicated values, but it is still an important column for EDA
happiness2019[(happiness2019['Perceptions of corruption'] == 0.028) |
              (happiness2019['Perceptions of corruption'] == 0.078) |
              (happiness2019['Perceptions of corruption'] == 0.089)]


Unnamed: 0,Overall rank,Country or region,Score,GDP per capita,Social support,Healthy life expectancy,Freedom to make life choices,Generosity,Perceptions of corruption
26,27,Guatemala,6.436,0.8,1.269,0.746,0.535,0.175,0.078
51,52,Thailand,6.008,1.05,1.409,0.828,0.557,0.359,0.028
55,56,Jamaica,5.89,0.831,1.478,0.831,0.49,0.107,0.028
58,59,Honduras,5.86,0.642,1.236,0.828,0.507,0.246,0.078
86,87,Turkmenistan,5.247,1.052,1.538,0.657,0.394,0.244,0.028
91,92,Indonesia,5.192,0.931,1.203,0.66,0.491,0.498,0.028
99,100,Nepal,4.913,0.446,1.226,0.677,0.439,0.285,0.089
125,126,Iraq,4.437,1.043,0.98,0.574,0.241,0.148,0.089
131,132,Chad,4.35,0.35,0.766,0.192,0.174,0.198,0.078
141,142,Comoros,3.973,0.274,0.757,0.505,0.142,0.275,0.078


In [198]:
happiness2019[happiness2019.duplicated(['Perceptions of corruption'], keep=False)]

Unnamed: 0,Overall rank,Country or region,Score,GDP per capita,Social support,Healthy life expectancy,Freedom to make life choices,Generosity,Perceptions of corruption
11,12,Costa Rica,7.167,1.034,1.441,0.963,0.558,0.144,0.093
12,13,Israel,7.139,1.276,1.455,1.029,0.371,0.261,0.082
20,21,United Arab Emirates,6.825,1.503,1.310,0.825,0.598,0.262,0.182
22,23,Mexico,6.595,1.070,1.323,0.861,0.433,0.074,0.073
24,25,Taiwan,6.446,1.368,1.430,0.914,0.351,0.242,0.097
...,...,...,...,...,...,...,...,...,...
145,146,Zimbabwe,3.663,0.366,1.114,0.433,0.361,0.151,0.089
146,147,Haiti,3.597,0.323,0.688,0.449,0.026,0.419,0.110
147,148,Botswana,3.488,1.041,1.145,0.538,0.455,0.025,0.100
149,150,Malawi,3.410,0.191,0.560,0.495,0.443,0.218,0.089


In [None]:
'''
The columns with duplicated values are the six factors we need for the analysis, so we cannot drop them.
From 2018 to 2019 the number of duplicates in the six factors increased, except for perceptions of corruption. So, do
these duplicates mean anything, or are they just repeated values? There were no duplicates
in 2015-2016 and a few in 2017, and then the count increased in 2018-2019. Do these duplicates suggest
that countries are starting to have the same level of happiness?
'''
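
One possible explanation worth checking before reading meaning into these counts is precision: in the outputs above, the 2015-2017 factor values are printed with five or more decimals, while the 2018-2019 values have three. Coarser rounding alone makes repeated values more likely. The sketch below illustrates the effect on synthetic numbers; the `factor` series is made up for illustration and is not taken from the datasets.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for one factor column: 156 values in [0, 0.5), five decimals.
rng = np.random.default_rng(0)
factor = pd.Series(rng.uniform(0, 0.5, size=156)).round(5)

dupes_5dp = factor.duplicated().sum()           # duplicates at five decimals
dupes_3dp = factor.round(3).duplicated().sum()  # duplicates after rounding to three

print(dupes_5dp, dupes_3dp)
```

If the real columns behave the same way, the rise in duplicates would reflect how the values were published rather than countries converging.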

#### 6. Completeness: 
Ensure that no critical data is missing. This might mean checking for null values or required fields that are empty.

We will start by checking the dataset for missing or null values. For this, we can use the isna() method which returns a dataframe of boolean values indicating if a field is null or not. To group all missing values by column, we can include the sum() method.
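
As a minimal sketch of that check, the example below runs `isna().sum()` on a small made-up frame (the `df` name and its values are assumptions for illustration; in the notebook the same call would be applied to each yearly dataframe).

```python
import numpy as np
import pandas as pd

# Toy frame standing in for one of the happiness datasets (values are made up).
df = pd.DataFrame({
    "Country": ["A", "B", "C", "D"],
    "Score": [7.5, np.nan, 5.1, 4.2],
    "Generosity": [0.30, 0.25, np.nan, np.nan],
})

missing_per_column = df.isna().sum()        # count of nulls in each column
missing_share = df.isna().mean().round(2)   # fraction of nulls in each column

print(missing_per_column)
print(missing_share)
```

Looking at the share as well as the count helps decide later whether a column can be imputed or is too sparse to use.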

In [1]:
#Display number missing values per column

In [68]:
# go to clean them 

#### 7. Check Accuracy:

Verify that the data is correct and precise. This could involve comparing data samples with known sources or using validation rules.

**The process includes:**
1. Validating the appropriateness of data types for the dataset.
2. Identifying outliers using established validation rules.
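
For the first step, a minimal sketch of a type check could look like this: inspect `dtypes`, then convert a column that was read as text into numbers. The toy frame and its column names are illustrative assumptions; `pd.to_numeric` with `errors="coerce"` turns unparseable entries into `NaN` so they can be reviewed afterwards.

```python
import pandas as pd

# Toy frame (made-up values) where a numeric column was stored as text.
df = pd.DataFrame({
    "Country": ["A", "B", "C"],
    "Score": ["7.5", "6.1", "n/a"],
    "Rank": [1, 2, 3],
})

print(df.dtypes)  # Score is reported as a text dtype instead of float64

# Convert to numeric; entries that cannot be parsed become NaN.
df["Score"] = pd.to_numeric(df["Score"], errors="coerce")

print(df.dtypes)
```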

In [2]:
# check columns types 

In [33]:
# go to clean them 

In [3]:
# check outliers 

**What is an Outlier?** 
An outlier is a row/observation that lies far away from, and diverges from, the overall pattern in a sample.

**What are the types of Outliers?**
1. Univariate: These outliers can be found when we look at the distribution of a single variable.
2. Multivariate: These are outliers in an n-dimensional space. To find them, you have to look at distributions in multiple dimensions. Example: (height=100, weight=100) for a person.

**What causes Outliers?**
Whenever we come across outliers, the ideal way to tackle them is to find out why they occur. The method for dealing with them then depends on the cause.

The most common causes of outliers are:

1. Data Entry Errors: Human errors made during data collection, recording, or entry can cause outliers in data.
2. Measurement Error: It is the most common source of outliers. It occurs when the measurement instrument used turns out to be faulty.
3. Data Processing Error: Whenever we perform data mining, we extract data from multiple sources. Manipulation or extraction errors may lead to outliers in the dataset.
4. Sampling Error: For instance, we have to measure the height of athletes and, by mistake, we include a few basketball players in the sample. This inclusion is likely to cause outliers in the dataset.
5. Natural Outlier: When an outlier is not artificial (due to error), it is a natural outlier. For instance: in my last assignment with a renowned insurance company, I noticed that the performance of the top 50 financial advisors was far higher than that of the rest of the population. Surprisingly, it was not due to any error. Hence, whenever we performed any data mining activity with advisors, we treated this segment separately.


**What is the impact of Outliers on a dataset?**


![image.png](https://www.analyticsvidhya.com/wp-content/uploads/2015/02/Outlier_31.png)



**How to detect Outliers?**

1. Most commonly used method to detect outliers is visualization (Univariate Graphical Analysis).

We use 3 common visualization methods:
>- Box-plot: A box plot is a method for graphically depicting groups of numerical data through their quartiles. The box extends from the Q1 to Q3 quartile values of the data, with a line at the median (Q2). The whiskers extend from the edges of the box to show the range of the data. Outlier points are those past the end of the whiskers. Box plots show robust measures of location and spread as well as providing information about symmetry and outliers.
>
>  
>![image.png](https://miro.medium.com/v2/resize:fit:698/format:webp/1*VK5iHA2AB28HSZwWwUbNYg.png)
>
>
>- Histogram
>- Scatter Plot: A scatter plot is a mathematical diagram using Cartesian coordinates to display values for two variables for a set of data. The data are displayed as a collection of points, each having the value of one variable determining the position on the horizontal axis and the value of the other variable determining the position on the vertical axis. The points that are far from the population can be termed as an outlier.
>
>  
>![image.png](https://miro.medium.com/v2/resize:fit:4800/format:webp/1*Ov6aH-8yIwNoUxtMFwgx4g.png)
>
>

2. Using statistical method (Univariate Non-Graphical analysis):
>- Any value below Q1 - 1.5 x IQR or above Q3 + 1.5 x IQR, where IQR = Q3 - Q1
 
![image.png](https://www.whatissixsigma.net/wp-content/uploads/2015/07/Box-Plot-Diagram-to-identify-Outliers-figure-1.png)

>- Use capping methods. Any value outside the range of the 5th and 95th percentiles can be considered an outlier.
>- Data points three or more standard deviations away from the mean are considered outliers. The Z-score is the signed number of standard deviations by which an observation lies above the mean of what is being measured. Computing Z-scores centers and rescales the data, so we look for data points that are too far from zero. In most cases a threshold of 3 is used: a data point whose Z-score is greater than 3 or less than -3 is identified as an outlier.
>- Outlier detection is a special case of examining data for influential data points, and it also depends on business understanding.
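
The three statistical rules above can be sketched as follows. The `values` series is made up for illustration, and the Z-score is computed by hand with `ddof=0`, which matches the default of `scipy.stats.zscore`.

```python
import pandas as pd

# Made-up values with one obvious extreme point (9.0).
values = pd.Series([0.2, 0.3, 0.25, 0.35, 0.3, 0.28, 0.32,
                    0.27, 0.31, 0.29, 0.33, 0.26, 9.0])

# 1) IQR rule: flag values below Q1 - 1.5*IQR or above Q3 + 1.5*IQR.
q1, q3 = values.quantile([0.25, 0.75])
iqr = q3 - q1
iqr_outliers = values[(values < q1 - 1.5 * iqr) | (values > q3 + 1.5 * iqr)]

# 2) Percentile rule: flag values outside the 5th-95th percentile range.
p5, p95 = values.quantile([0.05, 0.95])
pct_outliers = values[(values < p5) | (values > p95)]

# 3) Z-score rule: flag values more than 3 standard deviations from the mean.
z = (values - values.mean()) / values.std(ddof=0)
z_outliers = values[z.abs() > 3]

print(iqr_outliers.tolist(), pct_outliers.tolist(), z_outliers.tolist())
```

Note that the percentile rule flags the tails of any dataset, even a clean one, so its results need more judgement than the other two rules.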


In [23]:
# go to univariate graphical analysis
# go to lesson : data visualisation 1 - chart type section
# then go to univariate graphical analysis
# detect outliers using graphs visually

In [24]:
# go to lesson: statistics 1 then statistics 3
# then go to univariate Non graphical analysis
# detect outliers using numerical statistics 

In [25]:
# go to delete outliers

## 2. Data Cleaning: 

Preliminary findings from data profiling can lead to cleaning the data by:
- Handling missing values
- Correcting errors.
- Dealing with outliers.

-------------------



### Handling missing values:

**Why does my data have missing values?**
They may occur at two stages:
1. Data Extraction: It is possible that there are problems with extraction process. Errors at data extraction stage are typically easy to find and can be corrected easily as well.
2. Data collection: These errors occur at time of data collection and are harder to correct.

**Why do we need to handle the missing data?**
To avoid:
- Biasing the conclusions.
- Leading the business to make wrong decisions.

**What are the methods to treat missing values?**
1. Deletion: We delete rows where any of the variables is missing. Simplicity is the major advantage of this method, but it reduces the power of the model because it reduces the sample size.

2. Imputation: A method of filling in the missing values with estimated ones. Imputation is one of the most frequently used methods.

    2.1. Mean/ Mode/ Median Imputation: It consists of replacing the missing data for a given attribute by the mean or median (quantitative attribute) or mode (qualitative attribute) of all known values of that variable.
    > It can be of two types:
    > - Generalized Imputation: In this case, we calculate the mean or median for all non missing values of that variable then replace missing value with mean or median.
    > - Similar case Imputation: In this case, we calculate average for each group individually of non missing values then replace the missing value based on the group.

    2.2. Constant Value
   
    2.3. Forward Filling
   
    2.4. Backward Filling

3. Prediction Model: A prediction model is one of the more sophisticated methods for handling missing data. Here, we create a predictive model to estimate the values that will substitute the missing data. We divide the dataset into two sets: one with no missing values for the variable and another with missing values. The first set becomes the training set of the model, the second set is the test set, and the variable with missing values is treated as the target variable. Next, we build a model to predict the target variable from the other attributes of the training set and use it to populate the missing values of the test set.

> There are 2 drawbacks for this approach:
> - The model estimated values are usually more well-behaved than the true values
> - If there are no relationships with attributes in the data set and the attribute with missing values, then the model will not be precise for estimating missing values.

4. KNN Imputation: In this method, the missing values of an observation are imputed using a given number (k) of the observations most similar to it. The similarity of two observations is determined using a distance function. The method has certain advantages and disadvantages.

   > **Advantages:**
   > - k-nearest neighbour can predict both qualitative & quantitative attributes
   > - Creation of predictive model for each attribute with missing data is not required
   > - Attributes with multiple missing values can be easily treated
   > - Correlation structure of the data is taken into consideration

   > **Disadvantages:**
   > - The KNN algorithm is very time-consuming on large datasets, because it searches the whole dataset for the most similar instances.
   > - The choice of k is critical. A higher value of k includes neighbours that are significantly different from the observation being imputed, whereas a lower value of k risks missing relevant neighbours.
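
The deletion and simple imputation methods above (items 1 and 2.1 to 2.4) can be sketched with pandas as follows. The frame, its `Region` and `Score` columns, and the values are made up for illustration; the prediction-model and KNN approaches are not shown here.

```python
import numpy as np
import pandas as pd

# Toy data (made up): two groups, each with one missing score.
df = pd.DataFrame({
    "Region": ["West", "West", "West", "East", "East", "East"],
    "Score": [7.0, np.nan, 6.0, 4.0, 5.0, np.nan],
})

# 1) Deletion: drop every row that has a missing value.
deleted = df.dropna()

# 2.1) Generalized imputation: fill with the overall median.
general = df["Score"].fillna(df["Score"].median())

# 2.1) Similar-case imputation: fill with the mean of the row's own group.
by_group = df["Score"].fillna(df.groupby("Region")["Score"].transform("mean"))

# 2.2) Constant value, 2.3) forward fill, 2.4) backward fill.
constant = df["Score"].fillna(0)
forward = df["Score"].ffill()
backward = df["Score"].bfill()  # the last row stays NaN: nothing follows it
```

Forward and backward filling only make sense when row order carries meaning (for example, a time series); for an unordered country table the mean, median, or group-based fills are usually the safer choice.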

--------------------


In [80]:
# go back to 6th dimension --> Completeness

### Correcting errors

-------------------

In [None]:
# go back to 7th dimension Accuracy 

### Dealing with outliers:

**How to remove Outliers?**
Most of the ways to deal with outliers are similar to the methods for missing values: deleting rows, transforming them, binning them, treating them as a separate group, imputing values, and other statistical methods. Here, we will discuss the common techniques used to deal with outliers:

1. Deleting rows: We delete outlier values if they are due to a data entry error or a data processing error, or if the outlier rows are very few in number. We can also use trimming at both ends to remove outliers.

2. Imputing: As with missing values, we can also impute outliers, using mean, median, or mode imputation. Before imputing values, we should analyse whether the outlier is natural or artificial. If it is artificial, we can go ahead with imputing values. We can also use a statistical model to predict values for the outlier rows and then impute them with the predicted values.

3. Treat separately: If there is a significant number of outliers, we should treat them separately in the statistical model. One approach is to treat the two groups as different groups, build an individual model for each, and then combine the output.
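
A minimal sketch of the three options, assuming outliers are flagged with the IQR rule; the `values` series is made up for illustration.

```python
import pandas as pd

# Made-up values with one obvious extreme point (9.0).
values = pd.Series([0.2, 0.3, 0.25, 0.35, 0.3, 0.28, 0.32,
                    0.27, 0.31, 0.29, 0.33, 0.26, 9.0])

# Flag outliers with the IQR rule.
q1, q3 = values.quantile([0.25, 0.75])
iqr = q3 - q1
is_outlier = (values < q1 - 1.5 * iqr) | (values > q3 + 1.5 * iqr)

# 1) Deleting rows: keep only the non-outlier observations.
trimmed = values[~is_outlier]

# 2) Imputing: replace outliers with the median of the remaining values.
imputed = values.mask(is_outlier, trimmed.median())

# 3) Treating separately: split into two groups to analyse on their own.
regular, extreme = values[~is_outlier], values[is_outlier]
```

Which option fits depends on the cause found earlier: deletion or imputation suits artificial outliers, while natural outliers are usually better kept and analysed as their own group.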


## 3. Univariate Analysis: 

This involves examining single variables to understand their characteristics (distribution, central tendency, dispersion, and shape).

We calculate **numerical values** about the data that tell us about its distribution. We also **draw graphs** showing visually how the data is distributed. **Together they answer the following questions about the features/characteristics of the data:**
- Where is the center of the data? (location)
- How much does the data vary? (scale)
- What is the shape of the data? (shape)

**The benefits of this analysis:**
A statistical summary gives a high-level view that helps identify outliers and data entry errors, and shows how the data is distributed, for example whether it is normally distributed or left/right skewed.

**In this step, we will explore variables one by one using following approaches:**

### 1. Univariate Graphical Analysis:
The method used for univariate analysis depends on whether the variable type is categorical or numerical.

#### I. Categorical Variables:

We’ll use a frequency table to understand the distribution of each category:
- Bar Chart (Ordinal) - Ordered
- Pie Chart (Nominal) - Non-ordered
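
A frequency table for a categorical variable can be built with `value_counts()`. The sketch below uses a short made-up `regions` series; in the notebook the same call would apply to a column such as `Region`.

```python
import pandas as pd

# Made-up categorical values for illustration.
regions = pd.Series(
    ["Western Europe", "Sub-Saharan Africa", "Western Europe",
     "Eastern Asia", "Sub-Saharan Africa", "Western Europe"],
    name="Region",
)

counts = regions.value_counts()                # absolute frequency per category
shares = regions.value_counts(normalize=True)  # relative frequency per category
freq_table = pd.DataFrame({"count": counts, "share": shares.round(2)})

print(freq_table)
```

Calling `counts.plot(kind="bar")` on the result would draw the corresponding bar chart.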

#### II. Numerical Variables:

We need to understand the central tendency and spread of the variable (Descriptive Analysis) using:
   - Box plot
   - Histogram

### 2. Univariate Non-Graphical analysis: 

- Where is the center of the data? (location) --> **Measures of central tendency**
- How much does the data vary? (scale) --> **Measure of variability**
- What is the shape of the data? (shape) --> **Measures of variation combined with an average (measure of center) gives a good picture of the distribution of the data.**
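
The three questions map onto a handful of pandas calls. The sketch below uses a made-up `scores` series for illustration.

```python
import pandas as pd

# Made-up scores for illustration.
scores = pd.Series([7.5, 7.3, 6.9, 6.4, 5.8, 5.2, 5.1, 4.4, 3.6])

# Location: measures of central tendency.
mean, median = scores.mean(), scores.median()

# Scale: measures of variability.
std = scores.std()                          # sample standard deviation
value_range = scores.max() - scores.min()
iqr = scores.quantile(0.75) - scores.quantile(0.25)

# Shape: skewness (> 0 means a longer right tail, < 0 a longer left tail).
skew = scores.skew()

print(scores.describe())  # count, mean, std, min, quartiles, and max in one call
```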

## 4. Bivariate/Multivariate Analysis:

Here, you look at the relationships between two or more variables. This can involve looking for correlations, patterns, and trends that suggest a relationship or an association.

We can perform bi-variate analysis for any combination of categorical and numerical variables. The combination can be:

| Bi-variate variables | Plot type |
| ------------- | ------------- |
| Categorical & categorical | Stacked bar chart |
| Categorical & numerical | Scatter plot, histogram, box plot |
| Numerical & numerical | Scatter plot, line chart |


Multivariate Analysis:
- Heat map
- Bar Chart
- Scatter Chart
- Line Chart

**Categorical & Categorical --> (Stacked Column Chart)**

**Categorical & numerical --> (scatter plot, histogram, box plot)**

**numerical & numerical --> (Scatter plot, line chart)**

We could also use a correlation matrix to get more specific information about the relationship between these two variables.
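
A minimal sketch of a correlation matrix with `DataFrame.corr()`, using a small made-up frame (the column names mirror the datasets, the values do not):

```python
import pandas as pd

# Made-up values for illustration.
df = pd.DataFrame({
    "Score": [7.5, 6.9, 6.0, 5.2, 4.4, 3.6],
    "GDP per capita": [1.45, 1.34, 1.10, 0.95, 0.60, 0.35],
    "Generosity": [0.30, 0.10, 0.25, 0.05, 0.28, 0.20],
})

corr = df.corr()  # Pearson correlation by default
score_vs_gdp = corr.loc["Score", "GDP per capita"]

print(corr.round(2))
```

Passing the result to `sns.heatmap(corr, annot=True)` would render it as the heat map listed under multivariate analysis.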