# Kaggle Data Analysis

This notebook performs an initial analysis of the data, as a first contact with it. The data comes from Kaggle [here](https://www.kaggle.com/unsdsn/world-happiness) and covers the 2015 to 2019 reports. The 2012, 2014, 2020, and 2021 reports are not included in the Kaggle dataset and will be analyzed separately.

**The Kaggle data cannot be used to train an ML model, because each variable represents its contribution to the score, which means that the sum of all attributes equals the score. This analysis only aims to gather relevant insights about the data for future feature extraction.**
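The claim that the attributes add up to the score can be checked directly. The following is a minimal sketch, assuming the 2015 column names listed later in this notebook; the helper `score_gap` and the toy rows are hypothetical, and the numbers are made up for illustration rather than taken from the Kaggle files.

```python
import pandas as pd

# Component columns, named as in the 2015 file.
components = [
    'Economy (GDP per Capita)', 'Family', 'Health (Life Expectancy)',
    'Freedom', 'Trust (Government Corruption)', 'Generosity',
    'Dystopia Residual',
]

# Toy rows with made-up values (NOT real data), built so the parts sum to the score.
toy = pd.DataFrame({
    'Country': ['A', 'B'],
    'Happiness Score': [7.0, 4.5],
    'Economy (GDP per Capita)': [1.5, 0.5],
    'Family': [1.25, 0.75],
    'Health (Life Expectancy)': [1.0, 0.5],
    'Freedom': [0.5, 0.25],
    'Trust (Government Corruption)': [0.25, 0.25],
    'Generosity': [0.5, 0.25],
    'Dystopia Residual': [2.0, 2.0],
})


def score_gap(df, score_col, component_cols):
    """Absolute difference between the score and the sum of its components, per row."""
    return (df[score_col] - df[component_cols].sum(axis=1)).abs()


gap = score_gap(toy, 'Happiness Score', components)
print(gap.max())  # a value near zero means the score is just the sum of the columns
```

On the real files, the same call could be run with `df_2015` in place of `toy`. A gap close to zero would confirm that a model trained on these columns only learns an addition. The 2018 and 2019 files have no residual column, so the check would presumably leave a gap there.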

## Preparing the environment

Import the dependencies and configure the environment (visualization, directories, and so on).

In [1]:
import pandas as pd

In [10]:
base_directory = '../data/kaggle'

## Loading the data

In [12]:
df_2015 = pd.read_csv(base_directory + '/2015.csv')
df_2016 = pd.read_csv(base_directory + '/2016.csv')
df_2017 = pd.read_csv(base_directory + '/2017.csv')
df_2018 = pd.read_csv(base_directory + '/2018.csv')
df_2019 = pd.read_csv(base_directory + '/2019.csv')

## Analyzing data attributes over the years

In [17]:
print(f'2015: \n {df_2015.columns.tolist()}\n\n')
print(f'2016: \n {df_2016.columns.tolist()}\n\n')
print(f'2017: \n {df_2017.columns.tolist()}\n\n')
print(f'2018: \n {df_2018.columns.tolist()}\n\n')
print(f'2019: \n {df_2019.columns.tolist()}\n\n')

2015: 
 ['Country', 'Region', 'Happiness Rank', 'Happiness Score', 'Standard Error', 'Economy (GDP per Capita)', 'Family', 'Health (Life Expectancy)', 'Freedom', 'Trust (Government Corruption)', 'Generosity', 'Dystopia Residual']


2016: 
 ['Country', 'Region', 'Happiness Rank', 'Happiness Score', 'Lower Confidence Interval', 'Upper Confidence Interval', 'Economy (GDP per Capita)', 'Family', 'Health (Life Expectancy)', 'Freedom', 'Trust (Government Corruption)', 'Generosity', 'Dystopia Residual']


2017: 
 ['Country', 'Happiness.Rank', 'Happiness.Score', 'Whisker.high', 'Whisker.low', 'Economy..GDP.per.Capita.', 'Family', 'Health..Life.Expectancy.', 'Freedom', 'Generosity', 'Trust..Government.Corruption.', 'Dystopia.Residual']


2018: 
 ['Overall rank', 'Country or region', 'Score', 'GDP per capita', 'Social support', 'Healthy life expectancy', 'Freedom to make life choices', 'Generosity', 'Perceptions of corruption']


2019: 
 ['Overall rank', 'Country or region', 'Score', 'GDP per ca

### Consistency and equivalent attributes in the Kaggle dataset over the years

| # | 2015 | 2016 | 2017 | 2018 | 2019 |
| - | ---- | ---- | ---- | ---- | ---- |
| 1 | Country | Country | Country | Country or region | Country or region |
| 2 | Region | Region | - | - | - |
| 3 | Happiness Rank | Happiness Rank | Happiness.Rank | Overall rank | Overall rank |
| 4 | Happiness Score | Happiness Score | Happiness.Score | Score | Score |
| 5 | Standard Error | - | - | - | - |
| 6 | - | Lower Confidence Interval | Whisker.low | - | - |
| 7 | - | Upper Confidence Interval | Whisker.high | - | - |
| 8 | Economy (GDP per Capita) | Economy (GDP per Capita) | Economy..GDP.per.Capita. | GDP per capita | GDP per capita |
| 9 | Family | Family | Family | Social support | Social support |
| 10 | Health (Life Expectancy) | Health (Life Expectancy) | Health..Life.Expectancy. | Healthy life expectancy | Healthy life expectancy |
| 11 | Freedom | Freedom | Freedom | Freedom to make life choices | Freedom to make life choices |
| 12 | Trust (Government Corruption) | Trust (Government Corruption) | Trust..Government.Corruption. | Perceptions of corruption | Perceptions of corruption |
| 13 | Generosity | Generosity | Generosity | Generosity | Generosity |
| 14 | Dystopia Residual | Dystopia Residual | Dystopia.Residual | - | - |
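
The equivalences in the table can be turned into a rename map so that all years share one schema. The following is a minimal sketch: the canonical names (`country`, `score`, and so on), the helper `harmonize`, and the toy frames are hypothetical choices for illustration, while the source column names are copied from the table.

```python
import pandas as pd

# Canonical name (hypothetical) -> names used in the yearly files (from the table).
COLUMN_ALIASES = {
    'country': ['Country', 'Country or region'],
    'rank': ['Happiness Rank', 'Happiness.Rank', 'Overall rank'],
    'score': ['Happiness Score', 'Happiness.Score', 'Score'],
    'gdp_per_capita': ['Economy (GDP per Capita)', 'Economy..GDP.per.Capita.',
                       'GDP per capita'],
    'social_support': ['Family', 'Social support'],
    'life_expectancy': ['Health (Life Expectancy)', 'Health..Life.Expectancy.',
                        'Healthy life expectancy'],
    'freedom': ['Freedom', 'Freedom to make life choices'],
    'corruption': ['Trust (Government Corruption)', 'Trust..Government.Corruption.',
                   'Perceptions of corruption'],
    'generosity': ['Generosity'],
}

# Invert the mapping: original column name -> canonical name.
RENAME = {old: new for new, olds in COLUMN_ALIASES.items() for old in olds}


def harmonize(df, year):
    """Rename known columns, keep only the shared ones, and tag rows with the year."""
    out = df.rename(columns=RENAME)
    keep = [col for col in COLUMN_ALIASES if col in out.columns]
    out = out[keep].copy()
    out['year'] = year
    return out


# Toy frames with made-up values, only to show the mechanics.
toy_2015 = pd.DataFrame({'Country': ['A'], 'Happiness Score': [7.0],
                         'Family': [1.2], 'Region': ['X']})
toy_2018 = pd.DataFrame({'Country or region': ['A'], 'Score': [7.1],
                         'Social support': [1.3]})

combined = pd.concat([harmonize(toy_2015, 2015), harmonize(toy_2018, 2018)],
                     ignore_index=True)
print(combined)
```

Columns that exist in only some years (`Region`, `Standard Error`, the confidence interval columns, `Dystopia Residual`) are dropped in this sketch. Whether to keep them as partially filled columns is a separate decision.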

### What is the difference between `Country` and `Country or region`?

In [18]:
df_2017['Country'].unique()

array(['Norway', 'Denmark', 'Iceland', 'Switzerland', 'Finland',
       'Netherlands', 'Canada', 'New Zealand', 'Sweden', 'Australia',
       'Israel', 'Costa Rica', 'Austria', 'United States', 'Ireland',
       'Germany', 'Belgium', 'Luxembourg', 'United Kingdom', 'Chile',
       'United Arab Emirates', 'Brazil', 'Czech Republic', 'Argentina',
       'Mexico', 'Singapore', 'Malta', 'Uruguay', 'Guatemala', 'Panama',
       'France', 'Thailand', 'Taiwan Province of China', 'Spain', 'Qatar',
       'Colombia', 'Saudi Arabia', 'Trinidad and Tobago', 'Kuwait',
       'Slovakia', 'Bahrain', 'Malaysia', 'Nicaragua', 'Ecuador',
       'El Salvador', 'Poland', 'Uzbekistan', 'Italy', 'Russia', 'Belize',
       'Japan', 'Lithuania', 'Algeria', 'Latvia', 'South Korea',
       'Moldova', 'Romania', 'Bolivia', 'Turkmenistan', 'Kazakhstan',
       'North Cyprus', 'Slovenia', 'Peru', 'Mauritius', 'Cyprus',
       'Estonia', 'Belarus', 'Libya', 'Turkey', 'Paraguay',
       'Hong Kong S.A.R., China', '

In [23]:
df_2018['Country or region'].unique()

array(['Finland', 'Norway', 'Denmark', 'Iceland', 'Switzerland',
       'Netherlands', 'Canada', 'New Zealand', 'Sweden', 'Australia',
       'United Kingdom', 'Austria', 'Costa Rica', 'Ireland', 'Germany',
       'Belgium', 'Luxembourg', 'United States', 'Israel',
       'United Arab Emirates', 'Czech Republic', 'Malta', 'France',
       'Mexico', 'Chile', 'Taiwan', 'Panama', 'Brazil', 'Argentina',
       'Guatemala', 'Uruguay', 'Qatar', 'Saudi Arabia', 'Singapore',
       'Malaysia', 'Spain', 'Colombia', 'Trinidad & Tobago', 'Slovakia',
       'El Salvador', 'Nicaragua', 'Poland', 'Bahrain', 'Uzbekistan',
       'Kuwait', 'Thailand', 'Italy', 'Ecuador', 'Belize', 'Lithuania',
       'Slovenia', 'Romania', 'Latvia', 'Japan', 'Mauritius', 'Jamaica',
       'South Korea', 'Northern Cyprus', 'Russia', 'Kazakhstan', 'Cyprus',
       'Bolivia', 'Estonia', 'Paraguay', 'Peru', 'Kosovo', 'Moldova',
       'Turkmenistan', 'Hungary', 'Libya', 'Philippines', 'Honduras',
       'Belarus', 'Turkey

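`Country or region` is similar to `Country`, although some names are spelled differently between years (for example, `Taiwan Province of China` in 2017 versus `Taiwan` in 2018). **In the _feature extraction_ phase, it may be possible to derive the region from the country name, based on the 2015 and 2016 datasets.**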
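
The region lookup could work roughly as follows. This is a minimal sketch on toy frames: the region labels are placeholders, not the values from the 2015 and 2016 files, and the variable names are hypothetical. The country spellings are taken from the outputs above.

```python
import pandas as pd

# Stand-in for the 2015/2016 files, which carry a 'Region' column.
# On the real data this could be pd.concat([df_2015, df_2016])[['Country', 'Region']].
toy_with_region = pd.DataFrame({
    'Country': ['Norway', 'Trinidad and Tobago'],
    'Region': ['Region X', 'Region Y'],  # placeholder labels, not real values
})

# Stand-in for a later file that has no 'Region' column.
toy_2018 = pd.DataFrame({'Country or region': ['Norway', 'Trinidad & Tobago']})

# Build a country -> region lookup, keeping one row per country.
region_by_country = (
    toy_with_region.drop_duplicates('Country').set_index('Country')['Region']
)

# Map each country to its region; names without a match become NaN.
toy_2018['Region'] = toy_2018['Country or region'].map(region_by_country)

# List the names that need manual reconciliation (spelling differences, new entries).
unmatched = toy_2018.loc[toy_2018['Region'].isna(), 'Country or region'].tolist()
print(unmatched)
```

Inspecting `unmatched` first shows how many names differ only in spelling (`Trinidad & Tobago`, `Northern Cyprus`, `Taiwan`) and would need a small manual alias table before the mapping is complete.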

### Error and confidence interval

Considering that only two datasets (2016 and 2017) contain confidence interval attributes, and only 2015 reports a standard error, it is not possible _with this data_ to use these columns consistently across all years. **We may be able to find more data in the original dataset** rather than in the Kaggle dataset.