## Machine Learning A-Z™: Hands-On Python & R In Data Science

### Kirill Eremenko, Hadelin de Ponteves, SuperDataScience Team

https://www.udemy.com/machinelearning/

Part 2: Regression

**Section 5: Multiple Linear Regression**

Scenario:
- A venture capital fund provides data about 50 companies
- Goal: analyze the dataset and create a model of profit based on R&D, administration and marketing spend (plus the state)
- Understand what drives the performance of a company
- Find which type of spend (R&D, marketing, admin) is most strongly associated with profit
- The model will help in determining which criteria to use when assessing companies
- Investigate the 50 companies given to you

---
**Multiple Linear Regression**
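- Multiple linear regression: y = b0 + b1*x1 + b2*x2 + ... + bn*xn
- It is still linear because the model is linear in the coefficients (b0, b1, b2, ..., bn)
- y = dependent variable
- x1, x2, ..., xn = independent variables
- b0 = y-intercept (the constant)
- b1, ..., bn = slope coefficients, one per independent variable
- Assumptions of a linear regression: linearity, homoscedasticity, multivariate normality, independence of errors, lack of multicollinearity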
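
As a minimal sketch of the equation above, the following fits y = b0 + b1*x1 + b2*x2 by ordinary least squares with NumPy. The data and the coefficient values (5, 2, -3) are made up for illustration and are not taken from the startup dataset.

```python
import numpy as np

# Hypothetical, noise-free data generated from y = 5 + 2*x1 - 3*x2
rng = np.random.default_rng(0)
x1 = rng.uniform(0, 10, size=20)
x2 = rng.uniform(0, 10, size=20)
y = 5 + 2 * x1 - 3 * x2

# Design matrix: a column of ones for b0, then one column per independent variable
design = np.column_stack([np.ones_like(x1), x1, x2])

# Ordinary least squares: find b that minimises ||design @ b - y||^2
b, *_ = np.linalg.lstsq(design, y, rcond=None)
b0, b1, b2 = b
print(b0, b1, b2)
```

Because the toy data contain no noise, the recovered coefficients should match the ones used to generate y up to floating-point error.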

- We have a categorical column: State
- To deal with this, convert its categories into dummy variables (0/1 columns)
- Be wary of the dummy variable trap: always omit one dummy variable (with k categories, include k - 1 dummies)
- Multicollinearity is what causes the trap: the dummy variables sum to 1 in every row, so together they duplicate the constant
- You cannot have the constant (b0) and all of the dummy variables in the same equation
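
The trap can be seen numerically. The sketch below builds dummy columns by hand for the State values of the first five rows of the dataset and compares the rank of the design matrix with and without one dummy dropped. The variable names are illustrative only.

```python
import numpy as np

# State values of the first five rows of the dataset
states = np.array(["New York", "California", "Florida", "New York", "Florida"])

# One 0/1 column per category (np.unique returns the categories sorted)
categories = np.unique(states)
dummies = (states[:, None] == categories[None, :]).astype(float)
ones = np.ones((len(states), 1))

# Constant + all three dummies: 4 columns, but the dummies sum to the constant
with_all = np.hstack([ones, dummies])
# Constant + two dummies (first dummy dropped): 3 independent columns
with_dropped = np.hstack([ones, dummies[:, 1:]])

rank_all = np.linalg.matrix_rank(with_all)
rank_dropped = np.linalg.matrix_rank(with_dropped)
print(with_all.shape[1], rank_all)
print(with_dropped.shape[1], rank_dropped)
```

The matrix with every dummy has more columns than its rank, which is exactly the multicollinearity described above. Dropping one dummy restores full column rank.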

---

In [1]:
# Importing the libraries
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

In [2]:
# Importing the dataset
dataset = pd.read_csv('50_Startups.csv')

In [4]:
#let's take a look at the dataset
dataset.head()

| | R&D Spend | Administration | Marketing Spend | State | Profit |
|---|---|---|---|---|---|
| 0 | 165349.2 | 136897.8 | 471784.1 | New York | 192261.83 |
| 1 | 162597.7 | 151377.59 | 443898.53 | California | 191792.06 |
| 2 | 153441.51 | 101145.55 | 407934.54 | Florida | 191050.39 |
| 3 | 144372.41 | 118671.85 | 383199.62 | New York | 182901.99 |
| 4 | 142107.34 | 91391.77 | 366168.42 | Florida | 166187.94 |


In [5]:
dataset.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 50 entries, 0 to 49
Data columns (total 5 columns):
R&D Spend          50 non-null float64
Administration     50 non-null float64
Marketing Spend    50 non-null float64
State              50 non-null object
Profit             50 non-null float64
dtypes: float64(4), object(1)
memory usage: 2.0+ KB


In [9]:
#let's create matrix of features
x = dataset.iloc[:, :-1].values
print(x)
#here the matrix mixes numeric and text values
#we need to encode the categorical column (State)

[[165349.2 136897.8 471784.1 'New York']
 [162597.7 151377.59 443898.53 'California']
 [153441.51 101145.55 407934.54 'Florida']
 [144372.41 118671.85 383199.62 'New York']
 [142107.34 91391.77 366168.42 'Florida']
 [131876.9 99814.71 362861.36 'New York']
 [134615.46 147198.87 127716.82 'California']
 [130298.13 145530.06 323876.68 'Florida']
 [120542.52 148718.95 311613.29 'New York']
 [123334.88 108679.17 304981.62 'California']
 [101913.08 110594.11 229160.95 'Florida']
 [100671.96 91790.61 249744.55 'California']
 [93863.75 127320.38 249839.44 'Florida']
 [91992.39 135495.07 252664.93 'California']
 [119943.24 156547.42 256512.92 'Florida']
 [114523.61 122616.84 261776.23 'New York']
 [78013.11 121597.55 264346.06 'California']
 [94657.16 145077.58 282574.31 'New York']
 [91749.16 114175.79 294919.57 'Florida']
 [86419.7 153514.11 0.0 'New York']
 [76253.86 113867.3 298664.47 'California']
 [78389.47 153773.43 299737.29 'New York']
 [73994.56 122782.75 303319.26 'Florida']
 [67532

In [10]:
#create the dependent variable vector (y-value)
y = dataset.iloc[:,4].values
print(y)

[192261.83 191792.06 191050.39 182901.99 166187.94 156991.12 156122.51
 155752.6  152211.77 149759.96 146121.95 144259.4  141585.52 134307.35
 132602.65 129917.04 126992.93 125370.37 124266.9  122776.86 118474.03
 111313.02 110352.25 108733.99 108552.04 107404.34 105733.54 105008.31
 103282.38 101004.64  99937.59  97483.56  97427.84  96778.92  96712.8
  96479.51  90708.19  89949.14  81229.06  81005.76  78239.91  77798.83
  71498.49  69758.98  65200.33  64926.08  49490.75  42559.73  35673.41
  14681.4 ]


In [11]:
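#Encoding categorical data: the model is an equation, so text categories must become numbers
#Encoding the Independent Variable
#we don't need to encode the dependent variable (Profit is already numeric)
#note: the categorical_features argument used below was deprecated in scikit-learn 0.20
#and removed in 0.22, so this cell only runs on older versions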
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
labelencoder_x = LabelEncoder()
x[:, 3] = labelencoder_x.fit_transform(x[:,3]) #change text to numbers
onehotencoder = OneHotEncoder(categorical_features = [3])
x = onehotencoder.fit_transform(x).toarray()
print(x)

[[0.0000000e+00 0.0000000e+00 1.0000000e+00 1.6534920e+05 1.3689780e+05
  4.7178410e+05]
 [1.0000000e+00 0.0000000e+00 0.0000000e+00 1.6259770e+05 1.5137759e+05
  4.4389853e+05]
 [0.0000000e+00 1.0000000e+00 0.0000000e+00 1.5344151e+05 1.0114555e+05
  4.0793454e+05]
 [0.0000000e+00 0.0000000e+00 1.0000000e+00 1.4437241e+05 1.1867185e+05
  3.8319962e+05]
 [0.0000000e+00 1.0000000e+00 0.0000000e+00 1.4210734e+05 9.1391770e+04
  3.6616842e+05]
 [0.0000000e+00 0.0000000e+00 1.0000000e+00 1.3187690e+05 9.9814710e+04
  3.6286136e+05]
 [1.0000000e+00 0.0000000e+00 0.0000000e+00 1.3461546e+05 1.4719887e+05
  1.2771682e+05]
 [0.0000000e+00 1.0000000e+00 0.0000000e+00 1.3029813e+05 1.4553006e+05
  3.2387668e+05]
 [0.0000000e+00 0.0000000e+00 1.0000000e+00 1.2054252e+05 1.4871895e+05
  3.1161329e+05]
 [1.0000000e+00 0.0000000e+00 0.0000000e+00 1.2333488e+05 1.0867917e+05
  3.0498162e+05]
 [0.0000000e+00 1.0000000e+00 0.0000000e+00 1.0191308e+05 1.1059411e+05
  2.2916095e+05]
 [1.0000000e+00 0.000

Note (tail of the scikit-learn FutureWarning printed by the cell above): if a LabelEncoder was used before the OneHotEncoder only to convert the categories to integers, the OneHotEncoder can now be applied to the string column directly.
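
On recent scikit-learn versions the same encoding can be sketched with `ColumnTransformer`, without a `LabelEncoder` step. This is one possible replacement, not the code that produced the output above. The array `x_demo` holds the first three rows of the feature matrix, and `sparse_threshold=0` is set so that a dense array is returned.

```python
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# First three rows of the feature matrix (R&D, Administration, Marketing, State)
x_demo = np.array(
    [
        [165349.2, 136897.8, 471784.1, "New York"],
        [162597.7, 151377.59, 443898.53, "California"],
        [153441.51, 101145.55, 407934.54, "Florida"],
    ],
    dtype=object,
)

# One-hot encode column 3 and pass the numeric columns through unchanged.
# The encoded columns come first in the result, followed by the remaining columns.
ct = ColumnTransformer(
    transformers=[("state", OneHotEncoder(), [3])],
    remainder="passthrough",
    sparse_threshold=0,
)
x_encoded = ct.fit_transform(x_demo).astype(float)
print(x_encoded)
```

The dummy columns follow the sorted category order (California, Florida, New York), which is the same layout as the printed matrix above.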


In [13]:
#avoiding the dummy variable trap: drop the first dummy column
#most Python libraries take care of this, but keep it in mind
x = x[:, 1:]

In [14]:
#Splitting the dataset into the Training set and Test set
#random_state is set to 0 so that we get the same answer as the instructor
#test_size = 0.2 gives a 40/10 split of the 50 observations
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size = 0.2, random_state = 0)
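
A quick way to check the split sizes and the effect of `random_state` is sketched below on placeholder arrays (`x_demo` and `y_demo` are made up and only match the dataset in having 50 rows).

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder data with 50 observations, like the startup dataset
x_demo = np.arange(100).reshape(50, 2)
y_demo = np.arange(50)

x_tr, x_te, y_tr, y_te = train_test_split(
    x_demo, y_demo, test_size=0.2, random_state=0
)
print(x_tr.shape, x_te.shape)

# Repeating the call with the same random_state reproduces the same split
_, x_te_again, _, _ = train_test_split(
    x_demo, y_demo, test_size=0.2, random_state=0
)
print(np.array_equal(x_te, x_te_again))
```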

In [15]:
#Fitting multiple linear regression to the training set
##the regressor learns the relationship between the independent variables
## and the dependent variable from the training set
from sklearn.linear_model import LinearRegression
regressor = LinearRegression()
regressor.fit(x_train, y_train)

LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None,
         normalize=False)
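
After fitting, the learned b0 and b1..bn are available as `intercept_` and `coef_`. The sketch below shows this on made-up, noise-free data with three predictors. The generating coefficients (10, 0.8, -0.05, 0.03) are arbitrary and are not estimates from the startup dataset.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical data: y = 10 + 0.8*x1 - 0.05*x2 + 0.03*x3, without noise
rng = np.random.default_rng(1)
x_demo = rng.uniform(0, 100, size=(40, 3))
y_demo = 10 + 0.8 * x_demo[:, 0] - 0.05 * x_demo[:, 1] + 0.03 * x_demo[:, 2]

model = LinearRegression()
model.fit(x_demo, y_demo)

# intercept_ is b0, coef_ holds one slope per column of x_demo
print(model.intercept_)
print(model.coef_)
```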

In [22]:
#Predicting the test results
## y_pred is the vector prediction of the dependent variable
y_pred = regressor.predict(x_test)

In [23]:
##let's compare the actual profits with the predicted ones
#we won't be plotting a graph, as we have multiple predictors
print(y_test)

[103282.38 144259.4  146121.95  77798.83 191050.39 105008.31  81229.06
  97483.56 110352.25 166187.94]


In [24]:
print(y_pred)

[103615.70496732 132245.69745432 133070.23906339  72592.46097845
 179075.96157176 116014.3380813   67853.79186105  98837.47482921
 114480.26282341 168492.58649243]


**Observation:** The predicted profits track the actual profits in the test set

- Several predictions are very close (e.g. 103615.70 predicted vs. 103282.38 actual)
- Others are off by more than 10,000 (e.g. 132245.70 predicted vs. 144259.40 actual)
- Overall this suggests a strong linear relationship between the dependent variable and the independent variables
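
To put a number on how close the predictions are, one option is to compute the mean absolute error and R-squared on the test set. The sketch below uses the `y_test` and `y_pred` values printed above (predictions rounded to two decimals) and plain NumPy. The names `actual` and `predicted` are illustrative.

```python
import numpy as np

# Values copied from the printed y_test and y_pred above
actual = np.array([103282.38, 144259.40, 146121.95, 77798.83, 191050.39,
                   105008.31, 81229.06, 97483.56, 110352.25, 166187.94])
predicted = np.array([103615.70, 132245.70, 133070.24, 72592.46, 179075.96,
                      116014.34, 67853.79, 98837.47, 114480.26, 168492.59])

residuals = actual - predicted

# Mean absolute error: average size of the prediction error, in profit units
mae = np.mean(np.abs(residuals))

# R-squared: 1 - (residual sum of squares / total sum of squares)
ss_res = np.sum(residuals ** 2)
ss_tot = np.sum((actual - actual.mean()) ** 2)
r2 = 1 - ss_res / ss_tot

print(mae, r2)
```

With only 10 test observations these figures are rough indicators rather than precise estimates of model quality.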

In [31]:
# Backward Elimination
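## goal: find the optimal team of independent variables, i.e. those with the most impact
## on the dependent variable
## note: sm.OLS is provided by statsmodels.api (recent statsmodels releases
## no longer expose OLS through statsmodels.formula.api)
import statsmodels.api as sm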

In [32]:
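#we need a column of 1s in the matrix of features x for the constant b0
#(statsmodels OLS does not add the constant by itself)
#use the numpy library to add the column at the beginning of the matrix x
#note: run this cell only once; re-running it prepends another column of 1s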
x = np.append(arr = np.ones((50,1)).astype(int), values = x, axis = 1 )
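
Because notebook cells are easy to run twice, it can help to make this step safe to repeat. The helper below is a hypothetical sketch (`add_intercept` is not part of NumPy or statsmodels): it prepends the column of ones only when the first column is not already all ones, and it uses the row count of the input instead of the hard-coded 50.

```python
import numpy as np

def add_intercept(features):
    """Prepend a column of ones unless the first column is already all ones."""
    features = np.asarray(features, dtype=float)
    if features.shape[1] > 0 and np.all(features[:, 0] == 1.0):
        # Assumption: an all-ones first column is an existing intercept column
        return features
    ones = np.ones((features.shape[0], 1))
    return np.hstack([ones, features])

# Small made-up matrix with 4 rows and 2 features
demo = np.array([[0.0, 2.5], [1.0, 3.5], [0.0, 4.5], [1.0, 5.5]])
once = add_intercept(demo)
twice = add_intercept(once)
print(once.shape, twice.shape)
```

The check is a heuristic: a genuine feature that happens to be 1 in every row would be mistaken for the intercept. statsmodels also ships an `add_constant` function for the same purpose.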

In [33]:
#Create a new matrix of features which is the optimal matrix of features
x_opt = x[:, [0,1,2,3,4,5]]

In [35]:
#step1: select significance level to stay in the model (SL = 0.05)
#step2: fit the full model with all possible predictors
## ordinary least squares OLS
regressor_OLS = sm.OLS(endog = y, exog = x_opt).fit()

#step 3: consider the predictor with the highest p-value
#note: the lower the p-value, the more significant the predictor
regressor_OLS.summary()

| | | | |
|---|---|---|---|
| Dep. Variable: | y | R-squared: | 0.948 |
| Model: | OLS | Adj. R-squared: | 0.944 |
| Method: | Least Squares | F-statistic: | 278.7 |
| Date: | Wed, 10 Apr 2019 | Prob (F-statistic): | 1.68e-29 |
| Time: | 22:00:16 | Log-Likelihood: | -526.81 |
| No. Observations: | 50 | AIC: | 1062.0 |
| Df Residuals: | 46 | BIC: | 1069.0 |
| Df Model: | 3 | | |
| Covariance Type: | nonrobust | | |

| | coef | std err | t | P>\|t\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| const | 1.836e+04 | 2048.649 | 8.960 | 0.000 | 1.42e+04 | 2.25e+04 |
| x1 | 1.836e+04 | 2048.649 | 8.960 | 0.000 | 1.42e+04 | 2.25e+04 |
| x2 | 1.836e+04 | 2048.649 | 8.960 | 0.000 | 1.42e+04 | 2.25e+04 |
| x3 | -573.7029 | 2838.043 | -0.202 | 0.841 | -6286.386 | 5138.981 |
| x4 | 0.8624 | 0.030 | 28.282 | 0.000 | 0.801 | 0.924 |
| x5 | -0.0530 | 0.050 | -1.063 | 0.294 | -0.154 | 0.047 |

| | | | |
|---|---|---|---|
| Omnibus: | 14.902 | Durbin-Watson: | 1.199 |
| Prob(Omnibus): | 0.001 | Jarque-Bera (JB): | 21.212 |
| Skew: | -0.964 | Prob(JB): | 2.48e-05 |
| Kurtosis: | 5.543 | Cond. No. | 4.89e+17 |


In [37]:
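#step 4: remove the predictor with the highest p-value
#step 5: fit the model without that predictor
#here column index 2 of x is dropped
#note: in the summary above, const, x1 and x2 have identical estimates and
#Cond. No. is 4.89e+17, which suggests perfectly collinear (duplicated) columns in x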
x_opt = x[:, [0,1,3,4,5]] 
regressor_OLS = sm.OLS(endog = y, exog = x_opt).fit()
regressor_OLS.summary()

| | | | |
|---|---|---|---|
| Dep. Variable: | y | R-squared: | 0.948 |
| Model: | OLS | Adj. R-squared: | 0.944 |
| Method: | Least Squares | F-statistic: | 278.7 |
| Date: | Wed, 10 Apr 2019 | Prob (F-statistic): | 1.68e-29 |
| Time: | 22:01:02 | Log-Likelihood: | -526.81 |
| No. Observations: | 50 | AIC: | 1062.0 |
| Df Residuals: | 46 | BIC: | 1069.0 |
| Df Model: | 3 | | |
| Covariance Type: | nonrobust | | |

| | coef | std err | t | P>\|t\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| const | 2.753e+04 | 3072.973 | 8.960 | 0.000 | 2.13e+04 | 3.37e+04 |
| x1 | 2.753e+04 | 3072.973 | 8.960 | 0.000 | 2.13e+04 | 3.37e+04 |
| x2 | -573.7029 | 2838.043 | -0.202 | 0.841 | -6286.386 | 5138.981 |
| x3 | 0.8624 | 0.030 | 28.282 | 0.000 | 0.801 | 0.924 |
| x4 | -0.0530 | 0.050 | -1.063 | 0.294 | -0.154 | 0.047 |

| | | | |
|---|---|---|---|
| Omnibus: | 14.902 | Durbin-Watson: | 1.199 |
| Prob(Omnibus): | 0.001 | Jarque-Bera (JB): | 21.212 |
| Skew: | -0.964 | Prob(JB): | 2.48e-05 |
| Kurtosis: | 5.543 | Cond. No. | 1.53e+17 |


In [38]:
#repeat steps 3-5 until the highest remaining p-value is not above SL = 0.05

In [39]:
#run 1: drop column index 1; the summary below is unchanged apart from const,
#so this column added no information to the model
x_opt = x[:, [0,3,4,5]]
regressor_OLS = sm.OLS(endog = y, exog = x_opt).fit()
regressor_OLS.summary()

| | | | |
|---|---|---|---|
| Dep. Variable: | y | R-squared: | 0.948 |
| Model: | OLS | Adj. R-squared: | 0.944 |
| Method: | Least Squares | F-statistic: | 278.7 |
| Date: | Wed, 10 Apr 2019 | Prob (F-statistic): | 1.68e-29 |
| Time: | 22:01:22 | Log-Likelihood: | -526.81 |
| No. Observations: | 50 | AIC: | 1062.0 |
| Df Residuals: | 46 | BIC: | 1069.0 |
| Df Model: | 3 | | |
| Covariance Type: | nonrobust | | |

| | coef | std err | t | P>\|t\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| const | 5.507e+04 | 6145.947 | 8.960 | 0.000 | 4.27e+04 | 6.74e+04 |
| x1 | -573.7029 | 2838.043 | -0.202 | 0.841 | -6286.386 | 5138.981 |
| x2 | 0.8624 | 0.030 | 28.282 | 0.000 | 0.801 | 0.924 |
| x3 | -0.0530 | 0.050 | -1.063 | 0.294 | -0.154 | 0.047 |

| | | | |
|---|---|---|---|
| Omnibus: | 14.902 | Durbin-Watson: | 1.199 |
| Prob(Omnibus): | 0.001 | Jarque-Bera (JB): | 21.212 |
| Skew: | -0.964 | Prob(JB): | 2.48e-05 |
| Kurtosis: | 5.543 | Cond. No. | 674000.0 |


In [40]:
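#run 2: drop column index 4
#note: by the rule above, index 3 (p-value 0.841) was the column to drop;
#index 4 had a p-value of 0.000, and without it R-squared falls from 0.948 to 0.041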
x_opt = x[:, [0,3,5]] 
regressor_OLS = sm.OLS(endog = y, exog = x_opt).fit()
regressor_OLS.summary()

| | | | |
|---|---|---|---|
| Dep. Variable: | y | R-squared: | 0.041 |
| Model: | OLS | Adj. R-squared: | 0.0 |
| Method: | Least Squares | F-statistic: | 1.01 |
| Date: | Wed, 10 Apr 2019 | Prob (F-statistic): | 0.372 |
| Time: | 22:01:33 | Log-Likelihood: | -599.6 |
| No. Observations: | 50 | AIC: | 1205.0 |
| Df Residuals: | 47 | BIC: | 1211.0 |
| Df Model: | 2 | | |
| Covariance Type: | nonrobust | | |

| | coef | std err | t | P>\|t\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| const | 7.613e+04 | 2.59e+04 | 2.942 | 0.005 | 2.41e+04 | 1.28e+05 |
| x1 | 2555.2116 | 1.2e+04 | 0.212 | 0.833 | -2.16e+04 | 2.68e+04 |
| x2 | 0.2885 | 0.205 | 1.404 | 0.167 | -0.125 | 0.702 |

| | | | |
|---|---|---|---|
| Omnibus: | 0.119 | Durbin-Watson: | 0.097 |
| Prob(Omnibus): | 0.942 | Jarque-Bera (JB): | 0.139 |
| Skew: | 0.099 | Prob(JB): | 0.933 |
| Kurtosis: | 2.835 | Cond. No. | 567000.0 |


In [41]:
#run 3: drop column index 5 (p-value 0.167 in the summary above, higher than SL = 0.05)
#note: index 3 has the highest p-value in that summary (0.833)
x_opt = x[:, [0,3]]
regressor_OLS = sm.OLS(endog = y, exog = x_opt).fit()
regressor_OLS.summary()

| | | | |
|---|---|---|---|
| Dep. Variable: | y | R-squared: | 0.001 |
| Model: | OLS | Adj. R-squared: | -0.02 |
| Method: | Least Squares | F-statistic: | 0.04727 |
| Date: | Wed, 10 Apr 2019 | Prob (F-statistic): | 0.829 |
| Time: | 22:01:45 | Log-Likelihood: | -600.63 |
| No. Observations: | 50 | AIC: | 1205.0 |
| Df Residuals: | 48 | BIC: | 1209.0 |
| Df Model: | 1 | | |
| Covariance Type: | nonrobust | | |

| | coef | std err | t | P>\|t\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| const | 1.111e+05 | 7085.628 | 15.682 | 0.000 | 9.69e+04 | 1.25e+05 |
| x1 | 2642.1322 | 1.22e+04 | 0.217 | 0.829 | -2.18e+04 | 2.71e+04 |

| | | | |
|---|---|---|---|
| Omnibus: | 0.011 | Durbin-Watson: | 0.021 |
| Prob(Omnibus): | 0.994 | Jarque-Bera (JB): | 0.082 |
| Skew: | 0.022 | Prob(JB): | 0.96 |
| Kurtosis: | 2.807 | Cond. No. | 2.41 |


---
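**Observation:** R&D spend appears to be a very powerful (and the most significant) predictor of profit. In the summaries above, the predictor with coefficient 0.8624 has a p-value of 0.000, and dropping it in run 2 cuts R-squared from 0.948 to 0.041. The final fit (run 3) keeps only the constant and one column with a p-value of 0.829, so it should not be read as the optimal model.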

In [None]:
#note on automation of the above iteration
"""
Automation:
    Hi guys,

if you are also interested in some automatic implementations of 
Backward Elimination in Python, please find two of them below:

Backward Elimination with p-values only:

import statsmodels.api as sm
def backwardElimination(x, sl):
    numVars = len(x[0])
    for i in range(0, numVars):
        regressor_OLS = sm.OLS(y, x).fit()
        maxVar = max(regressor_OLS.pvalues).astype(float)
        if maxVar > sl:
            for j in range(0, numVars - i):
                if (regressor_OLS.pvalues[j].astype(float) == maxVar):
                    x = np.delete(x, j, 1)
    print(regressor_OLS.summary())
    return x
 
SL = 0.05
X_opt = x[:, [0, 1, 2, 3, 4, 5]]
X_Modeled = backwardElimination(X_opt, SL)

___

Backward Elimination with p-values and Adjusted R Squared:

import statsmodels.api as sm
def backwardElimination(x, SL):
    numVars = len(x[0])
    temp = np.zeros((50,6)).astype(int)
    for i in range(0, numVars):
        regressor_OLS = sm.OLS(y, x).fit()
        maxVar = max(regressor_OLS.pvalues).astype(float)
        adjR_before = regressor_OLS.rsquared_adj.astype(float)
        if maxVar > SL:
            for j in range(0, numVars - i):
                if (regressor_OLS.pvalues[j].astype(float) == maxVar):
                    temp[:,j] = x[:, j]
                    x = np.delete(x, j, 1)
                    tmp_regressor = sm.OLS(y, x).fit()
                    adjR_after = tmp_regressor.rsquared_adj.astype(float)
                    if (adjR_before >= adjR_after):
                        x_rollback = np.hstack((x, temp[:,[0,j]]))
                        x_rollback = np.delete(x_rollback, j, 1)
                        print (regressor_OLS.summary())
                        return x_rollback
                    else:
                        continue
    print(regressor_OLS.summary())
    return x
 
SL = 0.05
X_opt = x[:, [0, 1, 2, 3, 4, 5]]
X_Modeled = backwardElimination(X_opt, SL)

"""