Option 1: Data from N-3 to N-1
* Train: 2018=f(2017, 2016, 2015)
* Test: 2019=f(2018, 2017, 2016)
* Inference: 2020=f(2019, 2018, 2017)


Test for drift ???


# Table of Contents

<div class="alert alert-block alert-info" style="margin-top: 20px">

[Context](#Co)<br>
[Import packages and data](#0)<br>
    
[**Data representation**](#Da)<br>

</div>
<hr>

<a name="Co"></a>
# Context

The **World Happiness Report** is a landmark survey of the state of global happiness. The data used here cover 2015 to 2019 and describe happiness through 6 main factors:
* economic production
* social support
* life expectancy
* freedom
* absence of corruption
* generosity

### Purposes of the project
<ins> Data analysis: </ins>
1. Give a clear picture of happiness around the world in 2019
2. Analyse trends in happiness from 2015 to 2019
    
<ins> Forecasting with Machine Learning</ins>(*)
1. How happy will countries be in 2020?
2. In which countries will happiness increase in 2020?

(\*) *The data contain no pandemic-related information, yet the global pandemic may have a tremendous impact on the actual 2020 results.*

You can find the whole presentation and information about the data in the **Project Presentation** notebook.

### Workflow
* Cleaning
* EDA
* Data Visualization
* **Preprocessing**
* Machine Learning

Before applying machine learning models to forecast happiness in 2020, we need a data representation that lets the models learn effectively.

Here, we reshape the dataset so that happiness in a given year is predicted from the three previous years of happiness scores and factors.

------------
------------
<a name="0"></a>
# Import packages and data

In [1]:
import pandas as pd
import numpy as np

In [2]:
# import cleaned and normalized data
df = pd.read_csv('data/data_clean_norm.csv')
print("data dimension:", df.shape)
# use the country name as the index
df.set_index("country", inplace=True)

# list of factors
l_factors = ['life_expectancy', 'gdp_per_capita', 'social_support', 
             'freedom','generosity', 'corruption_perception'] 

data dimension: (705, 11)


------------
------------
<a name="Da"></a>
# Data representation

We predict happiness in 2020 from a three-year history of data (happiness and factors). To do so, we build 3 datasets:
* **Train set** (train the model): happiness in 2018 and data from 2015 to 2017
* **Test set** (test model accuracy): happiness in 2019 and data from 2016 to 2018
* **Infer set** (our objective): data from 2017 to 2019
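
To make the target layout concrete, here is a minimal sketch on made-up data. The frame `toy`, its values, and the helper `lag_window` are hypothetical and only illustrate the idea of one row per country holding the target year plus three lagged values; the notebook's own function follows in the next section.

```python
import pandas as pd

# Hypothetical toy data: two countries, one variable, four years
toy = pd.DataFrame(
    {
        "year": [2015, 2016, 2017, 2018] * 2,
        "happiness_score": [5.0, 5.1, 5.2, 5.3, 6.0, 5.9, 6.1, 6.2],
    },
    index=pd.Index(["A"] * 4 + ["B"] * 4, name="country"),
)

def lag_window(frame, target_year, n_lags=3):
    """One row per country: the target-year score plus n_lags earlier scores."""
    out = frame.loc[frame["year"] == target_year, ["happiness_score"]]
    for lag in range(1, n_lags + 1):
        past = frame.loc[frame["year"] == target_year - lag, ["happiness_score"]]
        # suffix P1, P2, P3 marks how many years before the target the value was observed
        out = out.join(past.add_suffix(f"P{lag}"), how="inner")
    return out

toy_train = lag_window(toy, 2018)
print(toy_train)
```

Each row of `toy_train` pairs the 2018 score (the target) with the 2017, 2016 and 2015 scores (the features), which is the shape a standard regression model expects.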

### 1. Build the history

In [8]:
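def get_historic_N3(df, year, l_var, infer=False):
    """
    Build one row per country: target year N plus the three previous years.

    df    : cleaned data indexed by country, with a "year" column
    year  : target year N
    l_var : columns to carry over from each year
    infer : if True, year N is not in the data, so only the region
            (read from year N-1) is kept next to the history
    Columns from years N-1, N-2, N-3 get the suffixes P1, P2, P3.
    """
    if infer:
        # double brackets keep a DataFrame rather than a Series
        df_y = df.loc[df["year"] == year - 1, ["region"]]
    else:
        df_y = df.loc[df["year"] == year, l_var]

    for i in range(1, 4):
        df_p = df.loc[df["year"] == year - i, l_var]
        # errors="ignore" tolerates an l_var that omits these columns
        df_p = df_p.drop(columns=["year", "region", "happiness_rank"], errors="ignore")
        df_p = df_p.add_suffix("P" + str(i))
        # inner join: only countries present in every year are kept
        df_y = pd.merge(df_y, df_p, left_index=True, right_index=True, how="inner")

    if not infer:
        # from year N keep only year, target and region; suffixed columns are untouched
        to_drop = [col for col in df.columns if col not in ["year", "happiness_score", "region"]]
        df_y = df_y.drop(columns=to_drop, errors="ignore")

    return df_y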

In [9]:
train_set = get_historic_N3(df, 2018, df.columns.tolist())

test_set = get_historic_N3(df, 2019, df.columns.tolist())

infer_set = get_historic_N3(df, 2020, df.columns.tolist(), infer=True)
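
Because `get_historic_N3` assembles each set with inner joins, a country that lacks a row for any of the required years silently drops out. The sketch below is one possible way to list such countries before building the sets; the helper `countries_missing_a_year` and the toy frame are hypothetical and only assume a frame indexed by country with a `year` column, as in this notebook.

```python
import pandas as pd

def countries_missing_a_year(frame, years):
    """Return the sorted countries that lack a row for at least one of `years`."""
    all_countries = set(frame.index)
    complete = set(all_countries)
    for y in years:
        # keep only the countries that have a row for year y
        complete &= set(frame[frame["year"] == y].index)
    return sorted(all_countries - complete)

# Hypothetical toy data: country "B" has no row for 2017
toy = pd.DataFrame(
    {"year": [2015, 2016, 2017, 2018, 2015, 2016, 2018]},
    index=pd.Index(["A"] * 4 + ["B"] * 3, name="country"),
)

dropped = countries_missing_a_year(toy, range(2015, 2019))
print(dropped)
```

On the real data, a call such as `countries_missing_a_year(df, range(2015, 2019))` would show which countries cannot appear in the train set.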

### 2. Convert the categorical variable `region` into dummy/indicator variables

In [10]:
train_set = pd.merge(train_set, pd.get_dummies(train_set['region']), left_index=True, right_index=True, how="inner")

test_set = pd.merge(test_set, pd.get_dummies(test_set['region']), left_index=True, right_index=True, how="inner")

infer_set = pd.merge(infer_set, pd.get_dummies(infer_set['region']), left_index=True, right_index=True, how="inner")
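
Calling `pd.get_dummies` separately on each set yields one column per region present in that set, so the three sets share the same columns only if they contain the same regions. A minimal sketch of one way to enforce a common column list is shown below; the helper `add_region_dummies` and the toy frames are hypothetical, and taking the reference list from the train set is an assumption.

```python
import pandas as pd

def add_region_dummies(frame, regions):
    """Append one 0/1 column per name in `regions`, always in the same order."""
    dummies = pd.get_dummies(frame["region"])
    # reindex adds all-zero columns for regions absent from this frame
    dummies = dummies.reindex(columns=regions, fill_value=0).astype(int)
    return pd.concat([frame, dummies], axis=1)

# Hypothetical toy frames: the second one lacks the "Asia" region
toy_train = pd.DataFrame({"region": ["Europe", "Asia"]}, index=["A", "B"])
toy_infer = pd.DataFrame({"region": ["Europe"]}, index=["A"])

regions = sorted(toy_train["region"].unique())  # reference list taken from the train set
toy_train_d = add_region_dummies(toy_train, regions)
toy_infer_d = add_region_dummies(toy_infer, regions)
print(list(toy_infer_d.columns))
```

Note that `reindex` also discards any region that is not in the reference list, so the list should cover every region expected at inference time.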

### 3. Export datasets

In [11]:
train_set.to_csv('data/train_set.csv', index=True)

test_set.to_csv('data/test_set.csv', index=True)

infer_set.to_csv('data/infer_set.csv', index=True)
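
Since the files are written with `index=True`, the country ends up as the first CSV column. A downstream notebook can restore it as the index with `index_col`. The round trip below is a small sketch on a hypothetical toy frame written to a temporary directory, not on the project files.

```python
import os
import tempfile

import pandas as pd

# Hypothetical toy frame standing in for train_set
toy = pd.DataFrame(
    {"happiness_score": [5.3, 6.2]},
    index=pd.Index(["A", "B"], name="country"),
)

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "train_set.csv")
    toy.to_csv(path, index=True)
    # index_col restores the country column as the index
    reloaded = pd.read_csv(path, index_col="country")

print(reloaded)
```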