## Multiple Linear Regression
### and backward elimination

The purpose of this notebook is to 
* apply Multiple Linear Regression to the preprocessed dataset
* apply backward elimination to the model
* ultimately identify the independent variables (World Development Indicators) that influence the dependent variable (Happy Planet Index) the most.

The model will be applied to the "wdi_hpi_2016_df" dataset, which was created in the Data Preprocessing Jupyter notebook. This dataset is based on
* the Happy Planet Index for 2016 (see https://happyplanetindex.org/),
* the World Development Indicators (1960 - 2019) by the World Bank (see https://datacatalog.worldbank.org/dataset/world-development-indicators).

In [1]:
# Import standard libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline

In [2]:
# Import dataset
dataset = pd.read_pickle('../data/wdi_hpi_2016_df.pkl')
X = dataset.iloc[:, 1:-1].values
y = dataset.iloc[:, -1].values   # last column is the dependent variable
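
Because `.values` returns plain NumPy arrays, the indicator names are dropped, and statsmodels later labels the predictors `x1`, `x2`, and so on. The following is a minimal sketch of keeping the names next to the arrays. It assumes, as the slicing above suggests, that the first column is skipped and the last column holds the Happy Planet Index. The `demo` frame and its column names are made up for illustration and are not taken from the real dataset.

```python
import pandas as pd

# Hypothetical stand-in for wdi_hpi_2016_df: first column skipped, target last.
demo = pd.DataFrame({
    "country": ["A", "B", "C"],
    "indicator_1": [1.0, 2.0, 3.0],
    "indicator_2": [0.5, 0.1, 0.9],
    "hpi": [30.0, 25.0, 40.0],
})

# Names of the independent variables, in the same order as the columns of X_demo.
feature_names = demo.columns[1:-1].tolist()
X_demo = demo.iloc[:, 1:-1].values
y_demo = demo.iloc[:, -1].values

# Column k of X_demo corresponds to feature_names[k].
print(feature_names)
```

With such a list, a label like `x1` in a regression summary can be traced back to `feature_names[0]`, provided no extra column has been inserted in front of the predictors.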

In [3]:
# Split datasets into Training set and Test set
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2)
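
The split above draws a different random partition on every run, so downstream numbers can vary between executions. As a hedged illustration, passing `random_state` to `train_test_split` makes the partition repeatable. The data below is synthetic and the seed values are arbitrary choices for the example.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic data standing in for X and y (illustration only).
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(50, 3))
y_demo = rng.normal(size=50)

# Two calls with the same random_state give the same partition.
split_a = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)
split_b = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)

same_split = all(np.array_equal(a, b) for a, b in zip(split_a, split_b))
print(same_split)
```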

In [4]:
# Fit Multiple Linear Regression Model to the Training set
from sklearn.linear_model import LinearRegression
regressor = LinearRegression()
regressor.fit(X_train, y_train)

LinearRegression(copy_X=True, fit_intercept=True, n_jobs=1, normalize=False)

In [5]:
# Predict the Test set results
y_pred = regressor.predict(X_test)
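
The predictions are not evaluated in this notebook. One possible way to score them, sketched here on synthetic data rather than on the real dataset, is to compare predictions with the held-out targets using `r2_score` and `mean_squared_error` from scikit-learn. The coefficients and noise level below are assumptions chosen only to make the example self-contained.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

# Synthetic linear data with a little noise (illustration only).
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 3))
y_demo = 2.0 + X_demo @ np.array([1.5, -0.7, 0.0]) + rng.normal(scale=0.1, size=200)

X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=0)
model = LinearRegression().fit(X_tr, y_tr)
y_hat = model.predict(X_te)

# Coefficient of determination and root mean squared error on the test set.
test_r2 = r2_score(y_te, y_hat)
test_rmse = np.sqrt(mean_squared_error(y_te, y_hat))
print(f"R^2 = {test_r2:.3f}, RMSE = {test_rmse:.3f}")
```

In the notebook itself the same two calls would take `y_test` and `y_pred` as arguments.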

In [6]:
# Reduce less important variables with Backward Elimination
import statsmodels.regression.linear_model as sm
# statsmodels' OLS does not add an intercept on its own, so a column of ones is required
# to represent b0 in the regression equation (y = b0 + b1*x1 + b2*x2 + ... + bn*xn)
X_opt = np.append(arr = np.ones((len(X), 1)).astype(int), values = X, axis = 1)   # Add X to the newly created array of 1s
regressor_OLS = sm.OLS(endog = y, exog = X_opt).fit()
regressor_OLS.summary()

Dep. Variable:,y,R-squared:,0.908
Model:,OLS,Adj. R-squared:,0.699
Method:,Least Squares,F-statistic:,4.339
Date:,"Wed, 12 Feb 2020",Prob (F-statistic):,4.13e-07
Time:,18:09:34,Log-Likelihood:,-307.15
No. Observations:,139,AIC:,808.3
Df Residuals:,42,BIC:,1093.0
Df Model:,96,,
Covariance Type:,nonrobust,,

,coef,std err,t,P>|t|,[0.025,0.975]
const,-26.2703,44.902,-0.585,0.562,-116.886,64.345
x1,0.0054,0.003,2.120,0.040,0.000,0.011
x2,-0.0196,0.040,-0.489,0.627,-0.101,0.061
x3,-0.1998,0.166,-1.206,0.235,-0.534,0.135
x4,-0.0351,0.039,-0.899,0.374,-0.114,0.044
x5,-0.0623,0.040,-1.546,0.130,-0.144,0.019
x6,0.0338,0.031,1.101,0.277,-0.028,0.096
x7,-0.0560,0.053,-1.050,0.300,-0.164,0.052
x8,-0.0196,0.069,-0.282,0.779,-0.160,0.120

Omnibus:,0.101,Durbin-Watson:,1.927
Prob(Omnibus):,0.951,Jarque-Bera (JB):,0.226
Skew:,0.051,Prob(JB):,0.893
Kurtosis:,2.832,Cond. No.,1.09e+16
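
The gap between R-squared (0.908) and adjusted R-squared (0.699) in the summary follows from the large number of predictors relative to the number of observations. As a worked check, the standard formula `1 - (1 - R^2) * (n - 1) / (n - p - 1)` can be applied to the reported figures. The variable names below are chosen for this example only.

```python
# Figures taken from the OLS summary above.
r_squared = 0.908   # R-squared (rounded to three decimals in the summary)
n_obs = 139         # No. Observations
n_predictors = 96   # Df Model

# Residual degrees of freedom: 139 - 96 - 1 = 42, matching "Df Residuals".
df_resid = n_obs - n_predictors - 1

# Adjusted R-squared penalises R-squared for the number of predictors.
adj_r_squared = 1 - (1 - r_squared) * (n_obs - 1) / df_resid
print(round(adj_r_squared, 3))
```

The result is roughly 0.698, which agrees with the reported 0.699 once the rounding of R-squared to three decimals is taken into account.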


In [7]:
# Function to automatically remove columns whose p-value is above the 5% significance level
# (kept as a string literal, so this version is not executed)
'''
def backward_elimination(x, significance):
    num_vars = len(x[0])
    temp = np.zeros((X_opt.shape)).astype(int)
    for i in range(0, num_vars):
        regressor_OLS = sm.OLS(endog = y, exog = x).fit()
        max_var = max(regressor_OLS.pvalues).astype(float)
        adjR_before = regressor_OLS.rsquared_adj.astype(float)
        if max_var > significance:
            for j in range(0, num_vars - i):
                if (regressor_OLS.pvalues[j].astype(float) == max_var):
                    temp[:,j] = x[:, j]
                    x = np.delete(x, j, axis = 1)
                    tmp_regressor = sm.OLS(endog = y, exog = x).fit()
                    adjR_after = tmp_regressor.rsquared_adj.astype(float)
                    if (adjR_before >= adjR_after):
                        x_rollback = np.hstack((x, temp[:,[0,j]]))
                        x_rollback = np.delete(x_rollback, j, 1)
                        print (regressor_OLS.summary())
                        return x_rollback
                    else:
                        continue
    regressor_OLS.summary()
    return x

significance = 0.05
X_opt = X_opt[:, list(range(X_opt.shape[1]))]
X_opt2 = backward_elimination(X_opt, significance)
'''


In [10]:
# OLS is available from statsmodels.api; recent statsmodels releases no longer expose it via statsmodels.formula.api
import statsmodels.api as sm
def backwardElimination(x, sl):
    # Refit after every removal and drop the column with the highest p-value
    # until all remaining p-values are at or below the significance level sl
    while x.shape[1] > 0:
        regressor_OLS = sm.OLS(y, x).fit()
        j = int(np.argmax(regressor_OLS.pvalues))   # position of the largest p-value
        if regressor_OLS.pvalues[j] <= sl:
            break
        x = np.delete(x, j, 1)
    return x

SL = 0.05
X_Modeled = backwardElimination(X_opt, SL)
X_Modeled.shape
X_Modeled.shape

(139, 39)