# Table of Contents

<div class="alert alert-block alert-info" style="margin-top: 20px">

[Context](#Co)<br>
[Import packages and data](#0)<br>
    
1. [**Factors representation**](#Fac)<br>
    
2. [**Historic representation**](#Hi)<br>

</div>
<hr>

<a name="Co"></a>
# Context

The **World Happiness Report** is a landmark survey of the state of global happiness. The data used here cover 2015 to 2019 and describe happiness through 6 main factors:
* economic production, 
* social support, 
* life expectancy, 
* freedom, 
* absence of corruption, 
* and generosity

### Purposes of the project
<ins> Data analysis: </ins>
1. Give a clear picture of happiness around the world in 2019
2. Analyse trends in happiness from 2015 to 2019

<ins> Forecasting with Machine Learning</ins>(\*)
1. Can we predict a country's happiness if we know its GDP per capita, life expectancy and the other factor values?
2. Can we predict a country's happiness from its history (happiness + factors)?

(\*) *Although the data contain no pandemic-related information, the global pandemic may have a tremendous impact on the results.*

You can find the whole presentation and information about the data in the **Project Presentation** notebook.

### Workflow
* Cleaning
* EDA
* Data Visualization
* **Feature Engineering**
* Machine Learning

-----

Feature engineering is the process of transforming the data into a representation that lets machine learning models learn more effectively.

In this notebook, we will create one dataset for each forecasting purpose:
* predict happiness from the factors: factor dataset
* predict happiness from the history of past years: historic dataset


------------
------------
<a name="0"></a>
# Import packages and data

In [1]:
import pandas as pd
import numpy as np

In [2]:
# import cleaned and normalized data
df = pd.read_csv('data/data_clean_norm.csv')
print("data dimension:",df.shape)

# list of factors
l_factors = ['life_expectancy', 'gdp_per_capita', 'social_support', 
             'freedom','generosity', 'corruption_perception'] 

data dimension: (705, 11)


------------
------------
<a name="Fac"></a>
# Factors representation

* **Train set** (train model): happiness and factors from 2016 to 2018
* **Test set** (test model accuracy): happiness and factors in 2019
* **Infer set** (our objective): factors in 2020 (missing)
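
The split described in this list is purely by year. As a minimal sketch on toy data (the values and the `toy` frame are made up for illustration, not taken from the project data), it could look like this:

```python
import pandas as pd

# Toy data (hypothetical values): one country observed from 2015 to 2019
toy = pd.DataFrame({
    "country": ["A"] * 5,
    "year": [2015, 2016, 2017, 2018, 2019],
    "happiness_score": [7.0, 7.1, 7.2, 7.3, 7.4],
})

# between() is inclusive on both bounds by default
train = toy[toy["year"].between(2016, 2018)]
test = toy[toy["year"] == 2019]

print(train["year"].tolist(), test["year"].tolist())
```

2015 is left out of the train set in this sketch because a year-1 happiness value cannot exist for the first year of data.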

In [3]:
fact_df = df.copy()

# sort values by country and year for groupby computing
# (sort_values returns a new DataFrame, so the result must be assigned)
fact_df = fact_df.sort_values(by=["country","year"],ascending=True)

# Get year-1 happiness value by shifting values per country
fact_df['happiness_scoreP1'] = fact_df.groupby('country')['happiness_score'].shift()

# set country as index
fact_df.set_index("country",inplace=True)
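
To make the per-country shift concrete, here is a small sketch on toy data (hypothetical values). It shows why the rows have to be sorted by country and year first: `shift()` simply takes the previous row within each group.

```python
import pandas as pd

# Toy data (hypothetical values), deliberately unsorted
toy = pd.DataFrame({
    "country": ["B", "A", "A", "B", "A"],
    "year": [2016, 2017, 2015, 2015, 2016],
    "happiness_score": [5.1, 7.2, 7.0, 5.0, 7.1],
})

# sort_values returns a new DataFrame, so the result is assigned back
toy = toy.sort_values(by=["country", "year"]).reset_index(drop=True)

# within each country, every row receives the score of the previous row
toy["happiness_scoreP1"] = toy.groupby("country")["happiness_score"].shift()

print(toy)
```

The first year of each country has no previous value, so its `happiness_scoreP1` is `NaN`.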

### 1. Convert categorical variable region into dummy/indicator variables.

In [4]:
# dummify region variable
df_dummies_reg = pd.get_dummies(fact_df['region']).reset_index().drop_duplicates().set_index("country")

# merge dummies with source dataframe
fact_df = pd.merge(fact_df, df_dummies_reg, left_index=True, right_index=True, how="inner")

# drop happiness_rank and region
fact_df.drop(columns=["happiness_rank","region"],inplace=True)
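
As an alternative sketch (not the approach used in this notebook), `pd.get_dummies` can encode a column of a DataFrame directly through its `columns` argument, which avoids the de-duplication and merge steps. The toy data and region names below are illustrative only.

```python
import pandas as pd

# Toy data (hypothetical values): two countries, two years each
toy = pd.DataFrame(
    {
        "region": ["Western Europe", "Western Europe", "East Asia", "East Asia"],
        "year": [2018, 2019, 2018, 2019],
        "happiness_score": [7.0, 7.1, 5.9, 6.0],
    },
    index=pd.Index(["A", "A", "B", "B"], name="country"),
)

# encode "region" in place: the original column is replaced by one
# indicator column per region; empty prefix keeps the raw region names
encoded = pd.get_dummies(toy, columns=["region"], prefix="", prefix_sep="")

print(encoded.columns.tolist())
```

The indicator columns are boolean or integer depending on the pandas version, so cast them explicitly if a model expects a specific dtype.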

### 2. Export datasets

In [5]:
# train set: 2016 to 2018, as described at the top of this section
fact_df[fact_df['year'].isin([2016, 2017, 2018])].to_csv('data/fact_train_set.csv', index=True)

fact_df[fact_df['year'].isin([2019])].to_csv('data/fact_test_set.csv', index=True)

# no infer set is exported: the 2020 factors are missing

------------
------------
<a name="Hi"></a>
# Historic representation

We decide to predict happiness in 2020 from a 3-year history of data (happiness and factors). To do so,
we have 3 datasets to build:
* **Train set** (train model): happiness in 2018 and data from 2015 to 2017
* **Test set** (test model accuracy): happiness in 2019 and data from 2016 to 2018
* **Infer set** (our objective): data from 2017 to 2019 to forecast 2020 happiness
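
The three windows in this list follow one rule: the predictors are the 3 years preceding the target year. A tiny sketch, where `history_window` is a hypothetical helper introduced only for illustration:

```python
def history_window(target_year, n_years=3):
    """Return the years used as predictors for target_year (hypothetical helper)."""
    return list(range(target_year - n_years, target_year))

for name, target in [("train", 2018), ("test", 2019), ("infer", 2020)]:
    print(name, target, history_window(target))
```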

### 1. Get data history 

In [6]:
def get_historic_N3(df, year, l_var, infer=False):
    """
    Build a dataset containing, for each country, the happiness score
    of a selected year and the variables of the 3 previous years.
    
    Parameters
    ----------
    df: DataFrame
        our dataset, indexed by country
        
    year: int
        selected year
        
    l_var: list
        historic variables
        
    infer: boolean (default:False)
        if True: create the dataset with only predicting variables (historic)
        
    Return
    ------
    DataFrame
        transformed dataset
    """
    if infer:
        # the selected year has no data: keep only the region of the latest known year
        df_y = df.loc[df["year"] == year - 1, ["region"]]
    else:
        df_y = df.loc[df["year"] == year, l_var]
    
    for i in range(1, 4):
        # variables of year-i, with columns suffixed by "P<i>"
        df_p = df.loc[df["year"] == year - i, l_var]
        df_p = df_p.drop(columns=["year", "region", "happiness_rank"])
        df_p = df_p.add_suffix("P" + str(i))

        # inner join: countries without data for year-i are dropped
        df_y = pd.merge(df_y, df_p, left_index=True, right_index=True, how="inner")
        
    if not infer:
        # from the selected year, keep only year, happiness_score and region
        df_y = df_y.drop(columns=[col for col in df.columns if col not in ["year", "happiness_score", "region"]])
    
    return df_y
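
To see what the loop in this function produces, here is a reduced sketch of the same idea on toy data (hypothetical values, a single variable). It also shows a consequence of the inner join: a country with an incomplete 3-year history is dropped.

```python
import pandas as pd

# Toy data (hypothetical values): country B has no 2015 record
toy = pd.DataFrame({
    "country": ["A"] * 4 + ["B"] * 3,
    "year": [2015, 2016, 2017, 2018, 2016, 2017, 2018],
    "happiness_score": [7.0, 7.1, 7.2, 7.3, 5.1, 5.2, 5.3],
}).set_index("country")

target_year = 2018

# target: happiness score of the selected year
wide = toy.loc[toy["year"] == target_year, ["happiness_score"]]

for i in range(1, 4):
    # score of year-i, renamed happiness_scoreP<i>
    past = toy.loc[toy["year"] == target_year - i, ["happiness_score"]].add_suffix(f"P{i}")
    wide = wide.merge(past, left_index=True, right_index=True, how="inner")

print(wide)
```

Only country A remains in `wide`, with one column per lag (`happiness_scoreP1` to `happiness_scoreP3`).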

In [7]:
df.set_index("country",inplace=True)

hist_train_set = get_historic_N3(df, 2018, df.columns.tolist())

hist_test_set = get_historic_N3(df, 2019, df.columns.tolist())

hist_infer_set = get_historic_N3(df, 2020, df.columns.tolist(), infer=True)

### 2. Convert categorical variable region into dummy/indicator variables.

In [8]:
hist_train_set = pd.merge(hist_train_set, pd.get_dummies(hist_train_set['region']), left_index=True, right_index=True, how="inner")

hist_test_set = pd.merge(hist_test_set, pd.get_dummies(hist_test_set['region']), left_index=True, right_index=True, how="inner")

hist_infer_set = pd.merge(hist_infer_set, pd.get_dummies(hist_infer_set['region']), left_index=True, right_index=True, how="inner")
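
Because each set is encoded separately, the indicator columns depend on which regions appear in that set. If a region happened to be absent from one set (an assumption, not something observed in the data here), the columns would no longer line up. A minimal sketch of one way to align them, using toy frames:

```python
import pandas as pd

# Toy sets (hypothetical): the infer set lacks the "East Asia" region
train = pd.DataFrame({"region": ["East Asia", "Western Europe"]}, index=["A", "B"])
infer = pd.DataFrame({"region": ["Western Europe"]}, index=["C"])

train_d = pd.get_dummies(train["region"])

# reindex on the train columns; regions absent from infer become all-zero columns
infer_d = pd.get_dummies(infer["region"]).reindex(columns=train_d.columns, fill_value=0)

print(infer_d.columns.tolist())
```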

### 3. Export datasets

In [9]:
hist_train_set.to_csv('data/hist_train_set.csv', index=True)

hist_test_set.to_csv('data/hist_test_set.csv', index=True)

# hist_infer_set.to_csv('data/_hist_infer_set.csv', index=True)
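
Since the sets are exported with `index=True`, the country index is written as the first CSV column. A short sketch of a round trip on a toy frame, using a temporary directory rather than the project's `data/` folder:

```python
import os
import tempfile

import pandas as pd

# Toy frame (hypothetical values) indexed by country, like the exported sets
toy = pd.DataFrame({"happiness_score": [7.3]}, index=pd.Index(["A"], name="country"))

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "toy_set.csv")
    toy.to_csv(path, index=True)
    # index_col restores the country column as the index when reading back
    restored = pd.read_csv(path, index_col="country")

print(restored)
```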