# Multiple Linear Regression

- All features
- Backward Elimination
- Forward Selection
- Hybrid of Backward and Forward (also called Stepwise)
- All combinations

#### Data Preparation
- Importing the necessary Python libraries
- Loading the dataset
- Dealing with missing values and categorical features
- Splitting the data into training and test sets
- Normalization of features
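Normalization is listed above but is not applied in the cells that follow. A minimal sketch of one common choice, standardization, is given here using NumPy only. The helper name `standardize` and the small demo arrays are assumptions for illustration; the demo values are taken from the WorkedHours and YearsExperience columns of the employee data.

```python
import numpy as np

def standardize(train, test):
    """Scale both sets with the mean and standard deviation of the training set."""
    mean = train.mean(axis=0)
    std = train.std(axis=0)
    std[std == 0] = 1.0  # avoid dividing by zero for constant columns
    return (train - mean) / std, (test - mean) / std

# Illustrative numeric features: WorkedHours and YearsExperience
X_train_demo = np.array([[2300, 1.1], [2100, 1.3], [2104, 1.5], [1200, 2.0]], dtype=float)
X_test_demo = np.array([[1254, 2.2], [1236, 2.9]], dtype=float)

X_train_scaled, X_test_scaled = standardize(X_train_demo, X_test_demo)
print(X_train_scaled.mean(axis=0))  # approximately 0 for each column
print(X_train_scaled.std(axis=0))   # approximately 1 for each column
```

The statistics are computed on the training set only, so no information from the test set leaks into the scaling. scikit-learn's `StandardScaler` offers the same behaviour through `fit` and `transform`.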

In [2]:
#Getting the necessary python libraries
import numpy as np 
import matplotlib.pyplot as plt 
import pandas as pd

#Loading the dataset
dataset = pd.read_csv('employee.csv') #Store the dataset in a dataframe
#print(dataset)

# [:, :-1] selects all the rows and every column except the last one (the features)
X = dataset.iloc[:, :-1].values
# [:, 4] selects all the rows of column 4 (Salary, the target)
y = dataset.iloc[:, 4].values

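#Dealing with categorical variables
#Note: the categorical_features argument was removed in scikit-learn 0.22,
#so this cell requires an older release.
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
labelencoder_X = LabelEncoder()
X[:,0] = labelencoder_X.fit_transform(X[:,0])  #Department names -> integer codes
onehotencoder = OneHotEncoder(categorical_features=[0])  #one-hot encode column 0
X = onehotencoder.fit_transform(X).toarray()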

# Removing the extra dummy variable (avoids the dummy variable trap)
X = X[:, 1:]

# Splitting the data into Training Set and Test Set
from sklearn.model_selection import train_test_split
#Test size = 20%, training size = 80%
X_train, X_test, y_train, y_test = train_test_split(X,y,test_size=0.2)

Note (from the scikit-learn warning emitted by this cell): if a LabelEncoder was used before the OneHotEncoder only to convert the categories to integers, the OneHotEncoder can now be applied directly to the original column.
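On scikit-learn versions where `OneHotEncoder(categorical_features=...)` is unavailable, one alternative is to build the dummy columns with pandas. The sketch below is an assumption-laden illustration rather than the notebook's own method: it uses a four-row sample of the employee data and the hypothetical names `sample`, `X_demo` and `y_demo`.

```python
import pandas as pd

# A few rows of the employee data used in this notebook
sample = pd.DataFrame({
    "Department": ["Development", "Testing", "Development", "UX Designer"],
    "WorkedHours": [2300, 2100, 2104, 1200],
    "Certification": [0, 1, 2, 1],
    "YearsExperience": [1.1, 1.3, 1.5, 2.0],
    "Salary": [39343, 46205, 37731, 43525],
})

features = sample.drop(columns="Salary")
# drop_first=True removes one dummy per category to avoid the dummy variable trap
encoded = pd.get_dummies(features, columns=["Department"], drop_first=True, dtype=float)

# Move the dummy columns to the front to mirror the column order used in this notebook
dummy_cols = [c for c in encoded.columns if c.startswith("Department_")]
other_cols = [c for c in encoded.columns if c not in dummy_cols]
X_demo = encoded[dummy_cols + other_cols].to_numpy(dtype=float)
y_demo = sample["Salary"].to_numpy()

print(dummy_cols + other_cols)
print(X_demo.shape)
```

Selecting the target by name (`"Salary"`) instead of by position also avoids off-by-one mistakes when columns are added or reordered.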


In [6]:
dataset

Unnamed: 0,Department,WorkedHours,Certification,YearsExperience,Salary
0,Development,2300,0,1.1,39343
1,Testing,2100,1,1.3,46205
2,Development,2104,2,1.5,37731
3,UX Designer,1200,1,2.0,43525
4,Testing,1254,2,2.2,39891
5,UX Designer,1236,1,2.9,56642
6,Development,1452,2,3.0,60150
7,Testing,1789,1,3.2,54445
8,UX Designer,1645,1,3.2,64445
9,UX Designer,1258,0,3.7,57189


#### Fitting Multiple Linear Regression on the training set

In [3]:
#Fitting Multiple Linear Regression to Training Set
from sklearn.linear_model import LinearRegression 

mlrObj= LinearRegression()
mlrObj.fit(X_train,y_train)

LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None, normalize=False)


#### Making predictions on the test set
### All features

In [27]:
#Predicting on the Test Set
y_pred= mlrObj.predict(X_test)

In [28]:
print (y_test)
print ()
print (y_pred)

[ 46205  54445  98273  66029 105582  39891]

[ 37961.51601741  54861.37364043  91160.08487645  74425.41535057
 110256.67948683  44585.24577587]
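The two printed arrays are hard to compare by eye. A short sketch of two common regression metrics, RMSE and R², computed with NumPy on the values printed above (the variable names `y_true` and `y_hat` are chosen here for illustration):

```python
import numpy as np

# Values copied from the printed output above
y_true = np.array([46205, 54445, 98273, 66029, 105582, 39891], dtype=float)
y_hat = np.array([37961.51601741, 54861.37364043, 91160.08487645,
                  74425.41535057, 110256.67948683, 44585.24577587])

residuals = y_true - y_hat
rmse = np.sqrt(np.mean(residuals ** 2))          # root mean squared error
ss_res = np.sum(residuals ** 2)                  # residual sum of squares
ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # total sum of squares
r2 = 1.0 - ss_res / ss_tot                       # coefficient of determination

print(f"RMSE: {rmse:.2f}")
print(f"R^2:  {r2:.3f}")
```

With only six test rows these numbers are noisy, so they are best read as a rough check rather than a firm estimate of model quality.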


### Backward Elimination
- Decide on a significance level for a variable to be retained in the model (usually set at 0.05).
- Fit the model with all independent variables.
- Select the variable with the highest p-value. If that p-value is below the significance level, stop: the model is final.
- Otherwise, remove that variable.
- Fit the model without the variable and repeat from the selection step.


In [29]:
#Backward Elimination
#import statsmodels.formula.api as sm
import statsmodels.api as sm

#Add a column of ones to X for the intercept term (sm.OLS does not add one automatically)
X = np.append(arr=np.ones((X.shape[0],1)).astype(int), values=X, axis=1)

#According to Backward elimination we first need to fit a model with all variables
X_sig= X[:,[0,1,2,3,4,5]]
obj_OLS = sm.OLS(endog= y, exog= X_sig).fit()

#The summary function displays a table with p-values for all the variables.
obj_OLS.summary()

0,1,2,3
Dep. Variable:,y,R-squared:,0.961
Model:,OLS,Adj. R-squared:,0.952
Method:,Least Squares,F-statistic:,117.2
Date:,"Tue, 15 Oct 2019",Prob (F-statistic):,4.71e-16
Time:,20:31:08,Log-Likelihood:,-300.09
No. Observations:,30,AIC:,612.2
Df Residuals:,24,BIC:,620.6
Df Model:,5,,
Covariance Type:,nonrobust,,

0,1,2,3,4,5,6
,coef,std err,t,P>|t|,[0.025,0.975]
const,2.146e+04,4689.376,4.576,0.000,1.18e+04,3.11e+04
x1,-1421.4255,2683.353,-0.530,0.601,-6959.595,4116.744
x2,92.8656,2735.524,0.034,0.973,-5552.978,5738.710
x3,3.3361,2.410,1.384,0.179,-1.637,8.310
x4,-423.7607,1282.271,-0.330,0.744,-3070.237,2222.716
x5,9437.2530,510.978,18.469,0.000,8382.645,1.05e+04

0,1,2,3
Omnibus:,2.02,Durbin-Watson:,1.918
Prob(Omnibus):,0.364,Jarque-Bera (JB):,1.629
Skew:,0.414,Prob(JB):,0.443
Kurtosis:,2.214,Cond. No.,8000.0


In [30]:
#x2 (column 2 of X, a Department dummy) has the highest p-value (0.973 > 0.05):
#remove it and fit the model without it.

X_sig= X[:,[0,1,3,4,5]]
obj_OLS = sm.OLS(endog= y, exog= X_sig).fit()
obj_OLS.summary()

0,1,2,3
Dep. Variable:,y,R-squared:,0.961
Model:,OLS,Adj. R-squared:,0.954
Method:,Least Squares,F-statistic:,152.6
Date:,"Tue, 15 Oct 2019",Prob (F-statistic):,3.54e-17
Time:,20:33:15,Log-Likelihood:,-300.09
No. Observations:,30,AIC:,610.2
Df Residuals:,25,BIC:,617.2
Df Model:,4,,
Covariance Type:,nonrobust,,

0,1,2,3,4,5,6
,coef,std err,t,P>|t|,[0.025,0.975]
const,2.151e+04,4315.092,4.986,0.000,1.26e+04,3.04e+04
x1,-1466.4735,2285.214,-0.642,0.527,-6172.961,3240.014
x2,3.3230,2.331,1.426,0.166,-1.477,8.124
x3,-422.1845,1255.570,-0.336,0.739,-3008.079,2163.710
x4,9439.2110,497.467,18.975,0.000,8414.659,1.05e+04

0,1,2,3
Omnibus:,2.056,Durbin-Watson:,1.918
Prob(Omnibus):,0.358,Jarque-Bera (JB):,1.63
Skew:,0.407,Prob(JB):,0.443
Kurtosis:,2.2,Cond. No.,7200.0


In [31]:
#x3 in the previous summary (column 4 of X, Certification) has the highest p-value (0.739):
#remove it and fit the model without it.

X_sig= X[:,[0,1,3,5]]
obj_OLS = sm.OLS(endog= y, exog= X_sig).fit()
obj_OLS.summary()

0,1,2,3
Dep. Variable:,y,R-squared:,0.96
Model:,OLS,Adj. R-squared:,0.956
Method:,Least Squares,F-statistic:,210.7
Date:,"Tue, 15 Oct 2019",Prob (F-statistic):,2.35e-18
Time:,20:38:10,Log-Likelihood:,-300.16
No. Observations:,30,AIC:,608.3
Df Residuals:,26,BIC:,613.9
Df Model:,3,,
Covariance Type:,nonrobust,,

0,1,2,3,4,5,6
,coef,std err,t,P>|t|,[0.025,0.975]
const,2.122e+04,4154.694,5.108,0.000,1.27e+04,2.98e+04
x1,-1479.8541,2245.558,-0.659,0.516,-6095.665,3135.956
x2,3.3059,2.290,1.443,0.161,-1.402,8.014
x3,9336.1224,385.024,24.248,0.000,8544.694,1.01e+04

0,1,2,3
Omnibus:,2.34,Durbin-Watson:,1.842
Prob(Omnibus):,0.31,Jarque-Bera (JB):,1.662
Skew:,0.375,Prob(JB):,0.436
Kurtosis:,2.124,Cond. No.,7040.0


In [32]:
#x1 in the previous summary (column 1 of X, a Department dummy) has the highest p-value (0.516):
#remove it and fit the model without it.

X_sig= X[:,[0,3,5]]
obj_OLS = sm.OLS(endog= y, exog= X_sig).fit()
obj_OLS.summary()

0,1,2,3
Dep. Variable:,y,R-squared:,0.96
Model:,OLS,Adj. R-squared:,0.957
Method:,Least Squares,F-statistic:,322.5
Date:,"Tue, 15 Oct 2019",Prob (F-statistic):,1.42e-19
Time:,20:39:24,Log-Likelihood:,-300.41
No. Observations:,30,AIC:,606.8
Df Residuals:,27,BIC:,611.0
Df Model:,2,,
Covariance Type:,nonrobust,,

0,1,2,3,4,5,6
,coef,std err,t,P>|t|,[0.025,0.975]
const,2.102e+04,4099.682,5.127,0.000,1.26e+04,2.94e+04
x1,3.1236,2.250,1.389,0.176,-1.492,7.739
x2,9340.2046,380.920,24.520,0.000,8558.621,1.01e+04

0,1,2,3
Omnibus:,1.451,Durbin-Watson:,1.842
Prob(Omnibus):,0.484,Jarque-Bera (JB):,1.323
Skew:,0.388,Prob(JB):,0.516
Kurtosis:,2.325,Cond. No.,7010.0


In [33]:
#x1 in the previous summary (column 3 of X, WorkedHours) has the highest p-value (0.176 > 0.05):
#remove it and fit the model without it.

X_sig= X[:,[0,5]]
obj_OLS = sm.OLS(endog= y, exog= X_sig).fit()
obj_OLS.summary()

0,1,2,3
Dep. Variable:,y,R-squared:,0.957
Model:,OLS,Adj. R-squared:,0.955
Method:,Least Squares,F-statistic:,622.5
Date:,"Tue, 15 Oct 2019",Prob (F-statistic):,1.14e-20
Time:,20:40:29,Log-Likelihood:,-301.44
No. Observations:,30,AIC:,606.9
Df Residuals:,28,BIC:,609.7
Df Model:,1,,
Covariance Type:,nonrobust,,

0,1,2,3,4,5,6
,coef,std err,t,P>|t|,[0.025,0.975]
const,2.579e+04,2273.053,11.347,0.000,2.11e+04,3.04e+04
x1,9449.9623,378.755,24.950,0.000,8674.119,1.02e+04

0,1,2,3
Omnibus:,2.14,Durbin-Watson:,1.648
Prob(Omnibus):,0.343,Jarque-Bera (JB):,1.569
Skew:,0.363,Prob(JB):,0.456
Kurtosis:,2.147,Cond. No.,13.2


##### Backward Elimination is now complete. Let us now build a model using the significant predictors

In [34]:
#Splitting the data into Training Set and Test Set
X_sig_train, X_sig_test, y_sig_train, y_sig_test= train_test_split(X_sig, y, test_size=0.2,random_state=0)

mlrObj_sig= LinearRegression()
mlrObj_sig.fit(X_sig_train, y_sig_train)

y_sig_pred= mlrObj_sig.predict(X_sig_test)

In [36]:
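print (y_test)
print ()
print (y_pred)
print ()
#Note: y_sig_pred comes from a separate split (random_state=0), so it corresponds
#to y_sig_test rather than to the y_test printed above.
print (y_sig_pred)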

[ 46205  54445  98273  66029 105582  39891]

[ 37961.51601741  54861.37364043  91160.08487645  74425.41535057
 110256.67948683  44585.24577587]

[ 40748.96184072 122699.62295594  64961.65717022  63099.14214487
 115249.56285456 107799.50275317]
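The prediction arrays are only comparable row by row when both models are evaluated on the same test rows. A sketch of one way to guarantee that, using synthetic stand-in data (the names `X_all`, `X_sel` and the generated values are assumptions, not the employee data): `train_test_split` accepts several arrays and splits them all with the same indices.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression

# Synthetic stand-in for the employee data (30 rows, 5 features)
rng = np.random.default_rng(0)
X_all = rng.normal(size=(30, 5))
y_all = 25000 + 9000 * X_all[:, 4] + rng.normal(scale=1000, size=30)
X_sel = X_all[:, [4]]  # keep only the predictor that survived elimination

# One call splits every array with the same row indices
(X_all_train, X_all_test,
 X_sel_train, X_sel_test,
 y_train_demo, y_test_demo) = train_test_split(
    X_all, X_sel, y_all, test_size=0.2, random_state=0)

full_model = LinearRegression().fit(X_all_train, y_train_demo)
reduced_model = LinearRegression().fit(X_sel_train, y_train_demo)

# Both prediction vectors now line up row by row with y_test_demo
pred_full = full_model.predict(X_all_test)
pred_reduced = reduced_model.predict(X_sel_test)
print(np.column_stack([y_test_demo, pred_full, pred_reduced]).round(1))
```

Note that `LinearRegression` fits its own intercept by default, so the column of ones added for statsmodels is not needed in the reduced feature matrix.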


In [5]:
TP = np.sum(income) 
FP = income.count() - TP 
TN = 0 
FN = 0

# TODO: Calculate accuracy, precision and recall [True Positives/(True Positives + False Negatives)]
accuracy = float (TP) / (TP+FP)
recall = float (TP) / (TP+FN)
precision = float (TP) / (TP+FP)

# TODO: Calculate F-score using the formula below for beta = 0.5 and correct values for precision and recall.
# F_beta = (1 + beta^2) * (precision * recall) / ((beta^2 * precision) + recall)
beta = 0.5
fscore = (1 + beta**2) * (precision * recall) / ((beta**2 * precision) + recall)

# Print the results 
print("MLR: [Accuracy score: {:.4f}, F-score: {:.4f}]".format(accuracy, fscore))

NameError: name 'income' is not defined