Notes:
**Linear regression Python tutorials:**
* https://medium.com/all-about-ml/linear-regression-d41a6a5dcab6
* https://anaconda.org/sanchitiitr/linear-regression-model/notebook

**Polynomial regression:**
* https://medium.com/analytics-vidhya/understanding-polynomial-regression-5ac25b970e18 (see the cost function)
* https://medium.com/@rushilp2311/in-this-article-we-will-explore-more-about-regression-algorithm-ca3c5594a0b0 (from sklearn.preprocessing import PolynomialFeatures)
* https://towardsdatascience.com/polynomial-regression-bbe8b9d97491

**Random Forest**
* https://medium.com/swlh/random-forest-and-its-implementation-71824ced454f

**Other:**
* Stepwise selection?
* At the end, a summary table of models and metrics

# Table of Contents

<div class="alert alert-block alert-info" style="margin-top: 20px">

[Context](#Co)<br>
[Import packages and data](#0)<br>
    
[**Forecasting**](#Da)<br>

</div>
<hr>

<a name="Co"></a>
# Context

The **World Happiness Report** is a landmark survey of the state of global happiness. The data used here cover 2015 to 2019 and describe happiness according to 6 main factors:
* economic production,
* social support,
* life expectancy,
* freedom,
* absence of corruption,
* and generosity.

### Purposes of the project
<ins> Data analysis: </ins>
1. Give a clear picture of happiness around the world in 2019
2. Analyse trends in happiness from 2015 to 2019
    
<ins> Forecasting with Machine Learning</ins>(*)
1. How happy will countries be in 2020?
2. In which countries will happiness increase in 2020?

(\*) *Although the data contain no information about it, the global pandemic may have a tremendous impact on the actual 2020 results*

You can find the whole presentation and information about the data in the **Project Presentation** notebook

### Workflow
* Cleaning
* EDA
* Data Visualization
* Preprocessing
* **Machine Learning**

--------
We will test the following algorithms: a benchmark where year N = year N-1, linear regression (with and without variable selection), polynomial regression, and random forest.
To do: place the Machine Learning section
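
Polynomial regression and random forest are listed above but not fitted in the cells below. As a rough idea of how they could be set up with scikit-learn, here is a minimal sketch on synthetic data. The toy arrays, `degree=2`, `n_estimators=100`, and the 150/50 split are assumptions for illustration, not the project's settings.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Synthetic stand-in for the happiness data (assumption: 3 features, 200 rows)
rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(200, 3))
y = 1.0 + 2.0 * X[:, 0] + 1.5 * X[:, 1] ** 2 + rng.normal(0, 0.1, size=200)

# Simple holdout split: first 150 rows for training, last 50 for testing
X_tr, X_te = X[:150], X[150:]
y_tr, y_te = y[:150], y[150:]

# Polynomial regression = polynomial feature expansion + ordinary least squares
poly_model = make_pipeline(PolynomialFeatures(degree=2), LinearRegression())
poly_model.fit(X_tr, y_tr)
poly_mse = mean_squared_error(y_te, poly_model.predict(X_te))

# Random forest regressor with a fixed seed for reproducibility
rf_model = RandomForestRegressor(n_estimators=100, random_state=0)
rf_model.fit(X_tr, y_tr)
rf_mse = mean_squared_error(y_te, rf_model.predict(X_te))

print("polynomial mse:", poly_mse)
print("random forest mse:", rf_mse)
```

Wrapping `PolynomialFeatures` and `LinearRegression` in one pipeline keeps the feature expansion identical at fit and predict time.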

------------
------------
<a name="0"></a>
# Import packages and data

In [1]:
import pandas as pd
import numpy as np

import seaborn as sns
import matplotlib.pyplot as plt

from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error,r2_score
from statsmodels.stats.outliers_influence import variance_inflation_factor

In [2]:
# import cleaned and normalized data
train_set = pd.read_csv('data/train_set.csv', index_col="country")
test_set = pd.read_csv('data/test_set.csv', index_col="country")
infer_set = pd.read_csv('data/infer_set.csv', index_col="country")

# drop region and year
train_set.drop(columns=['year','region'],inplace=True)
test_set.drop(columns=['year','region'],inplace=True)
infer_set.drop(columns=['region'],inplace=True)

# list of factors
l_factors = ['life_expectancy', 'gdp_per_capita', 'social_support', 
             'freedom','generosity', 'corruption_perception'] 

------------
------------
<a name="Da"></a>
# Forecasting

In [3]:
# features: every column except the target (drop returns a new DataFrame)
X_train = train_set.drop(columns=['happiness_score'])
Y_train = train_set["happiness_score"].copy()

X_test = test_set.drop(columns=['happiness_score'])
Y_test = test_set["happiness_score"].copy()

# ground truth used to score every model on the test set
y_true = test_set["happiness_score"]

## Benchmark: happiness in year N = happiness in year N-1

In [4]:
y_pred_bench = test_set["happiness_scoreP1"]

benchmark_mse = mean_squared_error(y_true, y_pred_bench)

print("benchmark mse:",benchmark_mse)

benchmark mse: 17.492573441032466
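
The benchmark is a persistence forecast: each country's score for year N is predicted by its score for year N-1, stored here in the `happiness_scoreP1` column. The sketch below reproduces the idea on a made-up three-country table; the country labels and values are hypothetical.

```python
import pandas as pd
from sklearn.metrics import mean_squared_error

# Hypothetical toy data: current score and previous-year score (P1)
toy = pd.DataFrame(
    {"happiness_score": [7.0, 5.0, 3.0],
     "happiness_scoreP1": [6.5, 5.0, 3.5]},
    index=pd.Index(["A", "B", "C"], name="country"),
)

# Persistence forecast: the prediction for year N is the value of year N-1
y_true_toy = toy["happiness_score"]
y_pred_toy = toy["happiness_scoreP1"]

# MSE = mean of squared differences = (0.25 + 0 + 0.25) / 3
toy_mse = mean_squared_error(y_true_toy, y_pred_toy)
print("toy benchmark mse:", toy_mse)
```

Any fitted model should beat this score on the test set to be worth keeping.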


## Linear regression without variable selection

In [96]:
model = LinearRegression() 
linear_fit = model.fit(X_train, Y_train)
y_pred_lin = linear_fit.predict(X_test)
y_train_pred_lin = linear_fit.predict(X_train)

In [97]:
linear_mse_test = mean_squared_error(y_true, y_pred_lin)
linear_mse_train = mean_squared_error(Y_train, y_train_pred_lin)

print("linear mse test:",linear_mse_test)
print("linear mse train:",linear_mse_train)

linear mse test: 32.71871849467362
linear mse train: 9.09998014942466




## Linear regression with variable selection

In [47]:
l_var_remove = ["gdp_per_capitaP2", "life_expectancyP2", "gdp_per_capitaP3", "happiness_scoreP2","corruption_perceptionP2",
                "social_supportP2", "generosityP2","freedomP2","life_expectancyP3","corruption_perceptionP3",
                "happiness_scoreP3","Western Europe","social_supportP3","freedomP3", "generosityP3","social_supportP1",
                "gdp_per_capitaP1","happiness_scoreP1", "freedomP1","life_expectancyP1"]


X_train_sel = X_train.copy().drop(columns=l_var_remove)
X_test_sel = X_test.copy().drop(columns=l_var_remove)

# show the VIF score of each remaining feature in a DataFrame
vif = pd.DataFrame()
vif["Features"] = X_train_sel.columns
# variance_inflation_factor computes the score of the feature at column index i
vif["VIF Factor"] = [variance_inflation_factor(X_train_sel.values, i) for i in range(X_train_sel.shape[1])]
# sort_values returns a new DataFrame, so vif itself keeps the original column order
vif.sort_values(by="VIF Factor", ascending=False)

display(vif.Features.tolist())

['corruption_perceptionP1',
 'generosityP1',
 'Australia and New Zealand',
 'Central and Eastern Europe',
 'Eastern Asia',
 'Latin America and Caribbean',
 'Middle East and Northern Africa',
 'North America',
 'Southeastern Asia',
 'Southern Asia',
 'Sub-Saharan Africa']
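
For reference, the VIF of feature j is 1 / (1 - R_j²), where R_j² comes from regressing feature j on the remaining features. The sketch below computes it with numpy only, on synthetic data. The helper `vif_scores` is hypothetical, and it adds an intercept to each auxiliary regression, which is an assumption: as far as I know, statsmodels' `variance_inflation_factor` uses the design matrix as given, so its values can differ unless a constant column is included.

```python
import numpy as np

def vif_scores(X):
    """Hypothetical helper: VIF of each column of X, with an intercept."""
    X = np.asarray(X, dtype=float)
    n, p = X.shape
    scores = []
    for j in range(p):
        target = X[:, j]
        # Regress column j on all other columns plus a constant term
        others = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
        coef, *_ = np.linalg.lstsq(others, target, rcond=None)
        residuals = target - others @ coef
        ss_res = np.sum(residuals ** 2)
        ss_tot = np.sum((target - target.mean()) ** 2)
        r_squared = 1.0 - ss_res / ss_tot
        scores.append(1.0 / (1.0 - r_squared))
    return np.array(scores)

rng = np.random.default_rng(0)
a = rng.normal(size=500)
b = rng.normal(size=500)
c = a + 0.1 * rng.normal(size=500)  # nearly a copy of a, so collinear with it

vifs = vif_scores(np.column_stack([a, b, c]))
print(vifs)
```

In this toy setup the two collinear columns get a large VIF while the independent column stays close to 1, which is the pattern used to decide which variables to drop.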

In [61]:
l_reg = ['Australia and New Zealand', 'Central and Eastern Europe', 'Eastern Asia', 'Latin America and Caribbean', 
         'Middle East and Northern Africa', 'North America', 'Southeastern Asia', 'Southern Asia', 'Sub-Saharan Africa']

l_sel_var = ["happiness_scoreP1","happiness_scoreP2","happiness_scoreP3"]

X_train_sel= X_train[l_sel_var+l_reg]
X_test_sel = X_test[l_sel_var+l_reg]


model = LinearRegression() 
linear_varsel_fit = model.fit(X_train_sel, Y_train)
# separate name so the predictions of the full model (y_pred_lin) are not overwritten
y_pred_lin_sel = linear_varsel_fit.predict(X_test_sel)

linear_mse_varsel = mean_squared_error(y_true, y_pred_lin_sel)

print("linear_varsel mse:",linear_mse_varsel)

linear_varsel mse: 13.149272474939718


In [92]:
from sklearn.feature_selection import RFE

# recursive feature elimination: drop one feature per step until 10 remain
model = LinearRegression() 
selector = RFE(model, n_features_to_select=10, step=1)
selector = selector.fit(X_train, Y_train)

display(selector.support_)

display(selector.ranking_)

array([False, False, False, False,  True, False, False, False, False,
       False, False, False, False, False, False, False, False, False,
       False, False, False,  True,  True,  True, False,  True,  True,
        True,  True,  True,  True])

array([19, 15, 10, 18,  1, 12,  7, 21, 14, 13, 17,  3,  5, 16, 20, 11,  9,
       22,  8,  4,  6,  1,  1,  1,  2,  1,  1,  1,  1,  1,  1])

In [93]:
# indices of the columns kept by RFE
selector_idx = [i for i, keep in enumerate(selector.support_) if keep]
display(selector_idx)

# names of the selected columns
selected_var = [X_train.columns[i] for i in selector_idx]
display(selected_var)

[4, 21, 22, 23, 25, 26, 27, 28, 29, 30]

['happiness_scoreP1',
 'Australia and New Zealand',
 'Central and Eastern Europe',
 'Eastern Asia',
 'Middle East and Northern Africa',
 'North America',
 'Southeastern Asia',
 'Southern Asia',
 'Sub-Saharan Africa',
 'Western Europe']
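
With a linear model, RFE repeatedly fits the estimator, removes the feature whose coefficient is smallest in magnitude, and stops when `n_features_to_select` remain. Kept features get rank 1 in `ranking_`. A minimal sketch on synthetic data, where the feature names `f0` to `f4` and the target are made up for illustration:

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import RFE
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X_toy = pd.DataFrame(rng.normal(size=(200, 5)),
                     columns=["f0", "f1", "f2", "f3", "f4"])
# Only f0 and f3 drive the target; the other columns are pure noise
y_toy = 3.0 * X_toy["f0"] - 2.0 * X_toy["f3"] + rng.normal(0, 0.1, size=200)

toy_selector = RFE(LinearRegression(), n_features_to_select=2, step=1)
toy_selector.fit(X_toy, y_toy)

# Boolean mask -> names of the kept columns
toy_selected = X_toy.columns[toy_selector.support_].tolist()
print(toy_selected)
print(toy_selector.ranking_)
```

Because the ranking relies on coefficient magnitude, it is only meaningful when the features share a comparable scale, which is why the normalized data matter here.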

In [99]:
X_train_sel= X_train[selected_var]
X_test_sel = X_test[selected_var]


model = LinearRegression() 
linear_varsel_fit = model.fit(X_train_sel, Y_train)

y_pred_lin_varsel = linear_varsel_fit.predict(X_test_sel)
y_train_pred_lin_varsel = linear_varsel_fit.predict(X_train_sel)


linear_varsel_mse_test = mean_squared_error(y_true, y_pred_lin_varsel)
linear_varsel_mse_train = mean_squared_error(Y_train, y_train_pred_lin_varsel)

print("linear varsel mse test:",linear_varsel_mse_test)
print("linear varsel mse train:",linear_varsel_mse_train)

linear varsel mse test: 14.573936484712481
linear varsel mse train: 13.195777478829317
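
The notes at the top call for a summary table of models and metrics. One possible layout is sketched below, collecting the MSE values printed in the cells above (rounded to two decimals) in a DataFrame. The row labels and column names are assumptions, and training MSE is left empty where the notebook does not compute it.

```python
import numpy as np
import pandas as pd

# MSE values copied from the cell outputs above, rounded to 2 decimals
summary = pd.DataFrame(
    {
        "model": ["benchmark (N = N-1)",
                  "linear regression, all variables",
                  "linear regression, lagged scores + regions",
                  "linear regression, RFE-selected variables"],
        "mse_train": [np.nan, 9.10, np.nan, 13.20],
        "mse_test": [17.49, 32.72, 13.15, 14.57],
    }
).set_index("model")

# Gap between test and train error; a large gap suggests overfitting
summary["test_minus_train"] = summary["mse_test"] - summary["mse_train"]

print(summary.sort_values("mse_test"))
best_model = summary["mse_test"].idxmin()
print("lowest test mse:", best_model)
```

Read this way, the full linear model has the lowest training error but the highest test error, while both reduced models score below the benchmark on the test set.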
