## Multiple Linear Regression with Backward Elimination


A few things to know:
1. Dummy variable trap
2. p-Value
3. OLS (ordinary least squares)

#### Dummy variable trap
Dummy variables, also called indicator variables, take the values 1 or 0 to mark the presence or absence of a particular category. A regression model can only use numeric variables, so a categorical (character) variable has to be transformed into a quantitative one. Simply relabelling the categories as numbers is not a dummy variable; a dummy is a separate indicator variable created for each category. Let us see what this means with an example.

Say we want to study what affects the price of a car (a Scorpio), and the city is one of the attributes that probably has an impact on the price. We have four cities under consideration, Mumbai, Chennai, Bangalore and Kolkata, and City is the name of this variable. The first step is to create four indicator variables, one each for Mumbai, Chennai, Bangalore and Kolkata. We then add them to the model, but we use only three of the four. The fourth city acts as the baseline and provides no incremental information: its indicator is always 1 minus the sum of the other three, so including all four together with the intercept makes the columns perfectly collinear. That is the dummy variable trap.

The obvious question is which variable to drop. The answer is any of them. For a continuous independent variable in Y = alpha + beta * X, we interpret the beta coefficient as follows: a unit change in the independent variable X brings about a change of beta in the dependent variable Y. For a dummy variable, beta is instead the expected difference in Y between that category and the dropped baseline category, so the choice of baseline changes the interpretation of the coefficients but not the fit of the model.
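To make the trap concrete, here is a minimal sketch using the four cities from the example. The `cars` DataFrame and its rows are made up purely for illustration. It compares the rank of the design matrix (intercept plus dummies) when all four indicators are kept with the rank when one is dropped via `drop_first=True`.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: only the City column matters for this illustration.
cars = pd.DataFrame(
    {"City": ["Mumbai", "Chennai", "Bangalore", "Kolkata", "Mumbai", "Chennai"]}
)

# Keep all four indicators, then prepend an intercept column of ones.
all_dummies = pd.get_dummies(cars, columns=["City"], dtype=float)
design_all = np.column_stack([np.ones(len(cars)), all_dummies.to_numpy()])
rank_all = np.linalg.matrix_rank(design_all)

# Drop one indicator (pandas drops the first category, here Bangalore).
reduced = pd.get_dummies(cars, columns=["City"], drop_first=True, dtype=float)
design_reduced = np.column_stack([np.ones(len(cars)), reduced.to_numpy()])
rank_reduced = np.linalg.matrix_rank(design_reduced)

# With all dummies the matrix has more columns than its rank (collinear);
# after dropping one, the column count and the rank agree.
print(design_all.shape[1], rank_all)
print(design_reduced.shape[1], rank_reduced)
```

When the rank is smaller than the number of columns, the least-squares coefficients are not uniquely determined, which is why one indicator has to go.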


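#### p-Value
The p-value of a coefficient is the probability of seeing a t-statistic at least as extreme as the estimated one if that X actually had no effect on Y (a true coefficient of zero).<br>
If it is < 0.05, we treat the variable as statistically significant and keep it; otherwise we eliminate that parameter.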
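As a rough illustration of where such a number comes from, the sketch below computes a two-sided p-value from a t-statistic and the residual degrees of freedom, using only the standard library and numerically integrating the Student t density. The helper names `t_pdf` and `two_sided_p_value` are invented for this example; in practice a library routine such as `scipy.stats.t.sf` would normally be used instead.

```python
import math


def t_pdf(x, df):
    """Density of the Student t distribution with df degrees of freedom."""
    log_norm = (
        math.lgamma((df + 1) / 2)
        - math.lgamma(df / 2)
        - 0.5 * math.log(df * math.pi)
    )
    return math.exp(log_norm - (df + 1) / 2 * math.log1p(x * x / df))


def two_sided_p_value(t_stat, df, steps=2000):
    """P(|T| >= |t_stat|), via Simpson's rule on [0, |t_stat|] (steps must be even)."""
    upper = abs(t_stat)
    if upper == 0.0:
        return 1.0
    h = upper / steps
    total = t_pdf(0.0, df) + t_pdf(upper, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    central_area = total * h / 3  # probability mass between 0 and |t_stat|
    return max(0.0, 1.0 - 2.0 * central_area)


# Example inputs: a t-statistic of 1.574 with 44 residual degrees of freedom.
p = two_sided_p_value(1.574, 44)
print(round(p, 3), "keep" if p < 0.05 else "eliminate")
```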

In [16]:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np

In [47]:
data = pd.read_csv('datasets/50_Startups.csv')
data.head()

| | R&D Spend | Administration | Marketing Spend | State | Profit |
|---|---|---|---|---|---|
| 0 | 165349.2 | 136897.8 | 471784.1 | New York | 192261.83 |
| 1 | 162597.7 | 151377.59 | 443898.53 | California | 191792.06 |
| 2 | 153441.51 | 101145.55 | 407934.54 | Florida | 191050.39 |
| 3 | 144372.41 | 118671.85 | 383199.62 | New York | 182901.99 |
| 4 | 142107.34 | 91391.77 | 366168.42 | Florida | 166187.94 |


In [48]:
X = data.iloc[: , : -1]
print(X.head())
y = data['Profit']

   R&D Spend  Administration  Marketing Spend       State
0  165349.20       136897.80        471784.10    New York
1  162597.70       151377.59        443898.53  California
2  153441.51       101145.55        407934.54     Florida
3  144372.41       118671.85        383199.62    New York
4  142107.34        91391.77        366168.42     Florida


In [49]:
# One-hot encode the State column.
# dtype=int keeps the dummies as 0/1; newer pandas versions default to booleans.
X = pd.get_dummies(X, columns=['State'], dtype=int)
print(X.head())

   R&D Spend  Administration  Marketing Spend  State_California  \
0  165349.20       136897.80        471784.10                 0   
1  162597.70       151377.59        443898.53                 1   
2  153441.51       101145.55        407934.54                 0   
3  144372.41       118671.85        383199.62                 0   
4  142107.34        91391.77        366168.42                 0   

   State_Florida  State_New York  
0              0               1  
1              0               0  
2              1               0  
3              0               1  
4              1               0  


In [50]:
# To avoid the dummy variable trap, remove one State dummy column.
# Any one of them would do; here we drop the last (State_New York).
X = X.iloc[: , :-1]
X.head()

| | R&D Spend | Administration | Marketing Spend | State_California | State_Florida |
|---|---|---|---|---|---|
| 0 | 165349.2 | 136897.8 | 471784.1 | 0 | 0 |
| 1 | 162597.7 | 151377.59 | 443898.53 | 1 | 0 |
| 2 | 153441.51 | 101145.55 | 407934.54 | 0 | 1 |
| 3 | 144372.41 | 118671.85 | 383199.62 | 0 | 0 |
| 4 | 142107.34 | 91391.77 | 366168.42 | 0 | 1 |


In [51]:
# OLS with array inputs lives in statsmodels.api (statsmodels.formula.api is for formula strings)
import statsmodels.api as sm
# Drop the DataFrame wrapper and work with the underlying NumPy array
X = X.values
print(X[0])
# Add a column of ones to X so that the model includes the constant (intercept) term
# of the regression formula; sm.OLS does not add it by itself.
X = np.append(arr=np.ones((50, 1)).astype(int), values=X, axis=1)
print(X[0])

[165349.2 136897.8 471784.1      0.       0. ]
[1.000000e+00 1.653492e+05 1.368978e+05 4.717841e+05 0.000000e+00
 0.000000e+00]
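A shape-agnostic way to prepend the intercept column is sketched below, using the first two feature rows printed earlier. The names `features` and `with_const` are made up for this example. statsmodels also provides `sm.add_constant` for the same purpose.

```python
import numpy as np

# First two rows of the feature matrix shown above (before the constant is added).
features = np.array([
    [165349.20, 136897.80, 471784.10, 0.0, 0.0],
    [162597.70, 151377.59, 443898.53, 1.0, 0.0],
])

# Prepend a column of ones whose length follows the data instead of a fixed row count.
with_const = np.column_stack([np.ones(features.shape[0]), features])
print(with_const.shape)
print(with_const[0])
```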


In [52]:
X_opt = X[:, [0, 1, 2, 3, 4, 5]]  # include all rows and columns 0-5
# Here endog is the dependent variable, and exog are the independent variables
regressor_ols = sm.OLS(endog=y, exog=X_opt).fit() # more : https://en.wikipedia.org/wiki/Ordinary_least_squares
regressor_ols.summary()

| | | | |
|---|---|---|---|
| Dep. Variable: | Profit | R-squared: | 0.951 |
| Model: | OLS | Adj. R-squared: | 0.945 |
| Method: | Least Squares | F-statistic: | 169.9 |
| Date: | Sun, 20 Jan 2019 | Prob (F-statistic): | 1.34e-27 |
| Time: | 16:26:59 | Log-Likelihood: | -525.38 |
| No. Observations: | 50 | AIC: | 1063 |
| Df Residuals: | 44 | BIC: | 1074 |
| Df Model: | 5 | | |
| Covariance Type: | nonrobust | | |

| | coef | std err | t | P>\|t\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| const | 5.008e+04 | 6952.587 | 7.204 | 0.000 | 3.61e+04 | 6.41e+04 |
| x1 | 0.8060 | 0.046 | 17.369 | 0.000 | 0.712 | 0.900 |
| x2 | -0.0270 | 0.052 | -0.517 | 0.608 | -0.132 | 0.078 |
| x3 | 0.0270 | 0.017 | 1.574 | 0.123 | -0.008 | 0.062 |
| x4 | 41.8870 | 3256.039 | 0.013 | 0.990 | -6520.229 | 6604.003 |
| x5 | 240.6758 | 3338.857 | 0.072 | 0.943 | -6488.349 | 6969.701 |

| | | | |
|---|---|---|---|
| Omnibus: | 14.782 | Durbin-Watson: | 1.283 |
| Prob(Omnibus): | 0.001 | Jarque-Bera (JB): | 21.266 |
| Skew: | -0.948 | Prob(JB): | 2.41e-05 |
| Kurtosis: | 5.572 | Cond. No. | 1.47e+06 |


#### In the above summary, we notice that x4 (the State_California variable) has the highest p-value, about 0.99, with x5 (State_Florida) close behind at 0.943 <br>
#### which is abysmal, so we now rebuild the model with a state dummy removed

In [53]:
X_opt = X[:, [0, 1, 2, 3, 4]]  # keeps columns 0-4, i.e. drops column 5 (x5, State_Florida)
# Here endog is the dependent variable, and exog are the independent variables
regressor_ols = sm.OLS(endog=y, exog=X_opt).fit() # more : https://en.wikipedia.org/wiki/Ordinary_least_squares
regressor_ols.summary()

| | | | |
|---|---|---|---|
| Dep. Variable: | Profit | R-squared: | 0.951 |
| Model: | OLS | Adj. R-squared: | 0.946 |
| Method: | Least Squares | F-statistic: | 217.2 |
| Date: | Sun, 20 Jan 2019 | Prob (F-statistic): | 8.51e-29 |
| Time: | 17:21:44 | Log-Likelihood: | -525.39 |
| No. Observations: | 50 | AIC: | 1061 |
| Df Residuals: | 45 | BIC: | 1070 |
| Df Model: | 4 | | |
| Covariance Type: | nonrobust | | |

| | coef | std err | t | P>\|t\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| const | 5.016e+04 | 6798.992 | 7.377 | 0.000 | 3.65e+04 | 6.39e+04 |
| x1 | 0.8057 | 0.046 | 17.646 | 0.000 | 0.714 | 0.898 |
| x2 | -0.0268 | 0.052 | -0.520 | 0.606 | -0.131 | 0.077 |
| x3 | 0.0272 | 0.017 | 1.627 | 0.111 | -0.006 | 0.061 |
| x4 | -70.2265 | 2828.752 | -0.025 | 0.980 | -5767.625 | 5627.172 |

| | | | |
|---|---|---|---|
| Omnibus: | 14.785 | Durbin-Watson: | 1.281 |
| Prob(Omnibus): | 0.001 | Jarque-Bera (JB): | 21.242 |
| Skew: | -0.949 | Prob(JB): | 2.44e-05 |
| Kurtosis: | 5.568 | Cond. No. | 1.44e+06 |


#### We still have variables with p-values above 0.05, so we keep removing them until every remaining variable has a p-value below 0.05

In [56]:
X_opt = X[:, [0, 1, 3]]  # keep the constant, R&D Spend and Marketing Spend
# Here endog is the dependent variable, and exog are the independent variables
regressor_ols = sm.OLS(endog=y, exog=X_opt).fit()
regressor_ols.summary()

| | | | |
|---|---|---|---|
| Dep. Variable: | Profit | R-squared: | 0.950 |
| Model: | OLS | Adj. R-squared: | 0.948 |
| Method: | Least Squares | F-statistic: | 450.8 |
| Date: | Sun, 20 Jan 2019 | Prob (F-statistic): | 2.16e-31 |
| Time: | 17:23:16 | Log-Likelihood: | -525.54 |
| No. Observations: | 50 | AIC: | 1057 |
| Df Residuals: | 47 | BIC: | 1063 |
| Df Model: | 2 | | |
| Covariance Type: | nonrobust | | |

| | coef | std err | t | P>\|t\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| const | 4.698e+04 | 2689.933 | 17.464 | 0.000 | 4.16e+04 | 5.24e+04 |
| x1 | 0.7966 | 0.041 | 19.266 | 0.000 | 0.713 | 0.880 |
| x2 | 0.0299 | 0.016 | 1.927 | 0.060 | -0.001 | 0.061 |

| | | | |
|---|---|---|---|
| Omnibus: | 14.677 | Durbin-Watson: | 1.257 |
| Prob(Omnibus): | 0.001 | Jarque-Bera (JB): | 21.161 |
| Skew: | -0.939 | Prob(JB): | 2.54e-05 |
| Kurtosis: | 5.575 | Cond. No. | 5.32e+05 |


In [57]:
X_opt = X[:, [0, 1]]  # keep only the constant and R&D Spend
# Here endog is the dependent variable, and exog are the independent variables
regressor_ols = sm.OLS(endog=y, exog=X_opt).fit()
regressor_ols.summary()

| | | | |
|---|---|---|---|
| Dep. Variable: | Profit | R-squared: | 0.947 |
| Model: | OLS | Adj. R-squared: | 0.945 |
| Method: | Least Squares | F-statistic: | 849.8 |
| Date: | Sun, 20 Jan 2019 | Prob (F-statistic): | 3.50e-32 |
| Time: | 17:23:35 | Log-Likelihood: | -527.44 |
| No. Observations: | 50 | AIC: | 1059 |
| Df Residuals: | 48 | BIC: | 1063 |
| Df Model: | 1 | | |
| Covariance Type: | nonrobust | | |

| | coef | std err | t | P>\|t\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| const | 4.903e+04 | 2537.897 | 19.320 | 0.000 | 4.39e+04 | 5.41e+04 |
| x1 | 0.8543 | 0.029 | 29.151 | 0.000 | 0.795 | 0.913 |

| | | | |
|---|---|---|---|
| Omnibus: | 13.727 | Durbin-Watson: | 1.116 |
| Prob(Omnibus): | 0.001 | Jarque-Bera (JB): | 18.536 |
| Skew: | -0.911 | Prob(JB): | 9.44e-05 |
| Kurtosis: | 5.361 | Cond. No. | 1.65e+05 |


### After removing every variable with a p-value above 0.05, we are left with only R&D Spend, whose p-value is below 0.05, which indicates that it is statistically significant.