In [1]:
import pandas as pd
import numpy as np

In [2]:
import matplotlib.pyplot as plt

#### Importing Dataset

In [3]:
dataset = pd.read_csv("50_Startups.csv")
dataset.head()

Unnamed: 0,R&D Spend,Administration,Marketing Spend,State,Profit
0,165349.2,136897.8,471784.1,New York,192261.83
1,162597.7,151377.59,443898.53,California,191792.06
2,153441.51,101145.55,407934.54,Florida,191050.39
3,144372.41,118671.85,383199.62,New York,182901.99
4,142107.34,91391.77,366168.42,Florida,166187.94


#### Handling the categorical values

In [4]:
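# drop_first=True drops one dummy column (California) to avoid the dummy variable trap;
# dtype=int keeps the 0/1 encoding on newer pandas versions, which default to True/False
N_state = pd.get_dummies(dataset["State"], drop_first=True, dtype=int)
dataset = pd.concat([dataset, N_state], axis=1)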

In [5]:
dataset.drop(columns=["State"], inplace=True)

In [6]:
dataset.head()

Unnamed: 0,R&D Spend,Administration,Marketing Spend,Profit,Florida,New York
0,165349.2,136897.8,471784.1,192261.83,0,1
1,162597.7,151377.59,443898.53,191792.06,0,0
2,153441.51,101145.55,407934.54,191050.39,1,0
3,144372.41,118671.85,383199.62,182901.99,0,1
4,142107.34,91391.77,366168.42,166187.94,1,0


__NOTE:__


__Multiple Linear Regression: y = b0 + b1X1 + b2X2 + ........ + bnXn__

where: b0 -> intercept (constant term) <br>
b1,b2...bn -> coefficients <br>
X1,X2,......Xn -> independent variables
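
To make the formula concrete, here is a minimal sketch on made-up numbers. The arrays `X_toy` and `y_toy` and the other variable names are illustrative assumptions, not values from the dataset. It fits scikit-learn's `LinearRegression` and checks that the model's prediction is the intercept plus the weighted sum of the features.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Toy data (hypothetical): 6 samples, 2 independent variables
X_toy = np.array([[1.0, 2.0], [2.0, 1.0], [3.0, 4.0],
                  [4.0, 3.0], [5.0, 6.0], [6.0, 5.0]])
y_toy = np.array([8.1, 7.9, 18.2, 16.8, 28.1, 27.0])

model = LinearRegression().fit(X_toy, y_toy)

b0 = model.intercept_   # the constant term b0
b = model.coef_         # the coefficients [b1, b2]

# y = b0 + b1*X1 + b2*X2, written as a matrix product
manual_pred = b0 + X_toy @ b

# Compare the hand-computed prediction with the model's own
print(np.allclose(manual_pred, model.predict(X_toy)))
```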

Now, if you look at the dataset above, you can see that some columns are dummy (categorical) variables that only take the values 0 and 1, while the remaining columns are numerical values that are much larger. So a question may arise: do we have to perform any feature scaling on the dataset?

The answer is no, we do not have to apply feature scaling. In multiple linear regression each independent variable is multiplied by its own coefficient, so it does not matter whether a variable takes large or small values: the coefficient compensates and puts every term on the same scale.
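
A quick way to check this claim is to fit the same model on raw and on standardized features and compare the predictions. The sketch below uses synthetic data (the names `spend`, `dummy` and `y_syn` are assumptions made for illustration) with one large-valued column and one 0/1 column, similar in shape to the dataset above.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Synthetic data: one large-valued feature and one 0/1 dummy feature
spend = rng.uniform(50_000, 150_000, size=40)
dummy = rng.integers(0, 2, size=40)
X_raw = np.column_stack([spend, dummy])
y_syn = 40_000 + 0.8 * spend + 500 * dummy + rng.normal(0, 1_000, size=40)

# Same features, standardized to zero mean and unit variance
X_scaled = StandardScaler().fit_transform(X_raw)

pred_raw = LinearRegression().fit(X_raw, y_syn).predict(X_raw)
pred_scaled = LinearRegression().fit(X_scaled, y_syn).predict(X_scaled)

# The coefficients differ between the two fits, but the predictions agree
print(np.allclose(pred_raw, pred_scaled))
```

This holds for ordinary least squares as used here. Regularized variants such as Ridge or Lasso penalize coefficient size, so for those models the scale of the features does matter.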

## __NOTE:__

As we are dealing with multiple linear regression, there are certain things we should keep in mind. <br>

Remember that in simple linear regression we have one independent variable for one dependent variable, i.e. <br>
                        X1 ---> Y
                      
That is not the case with multiple linear regression, where we have multiple independent variables for one dependent variable, i.e. <br>
                        X1,X2,X3,X4,X5.......Xn ---> Y
                        
With so many variables, we have to decide which ones to keep and which ones to drop. The reason is that if we include every variable, even those that do not make sense or are not important, the result will ultimately be garbage: the model will not be reliable and will not do what it is supposed to do.

Therefore it is important to keep only the features (variables) that matter. In other words, we need to do feature selection.
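
One reason irrelevant variables are hard to spot is that the training R² never goes down when a feature is added, even if that feature is pure noise. The following sketch illustrates this on synthetic data. All names and numbers in it are assumptions made for the example.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(1)
n = 50

# One feature that really drives the target
x_useful = rng.normal(size=n)
y_syn = 3.0 * x_useful + rng.normal(scale=0.5, size=n)

# Ten columns of pure noise, unrelated to y_syn by construction
noise = rng.normal(size=(n, 10))

X_base = x_useful.reshape(-1, 1)
X_full = np.column_stack([x_useful, noise])

# R^2 measured on the same data the model was trained on
r2_base = LinearRegression().fit(X_base, y_syn).score(X_base, y_syn)
r2_full = LinearRegression().fit(X_full, y_syn).score(X_full, y_syn)

print(f"training R^2, useful feature only: {r2_base:.4f}")
print(f"training R^2, with noise features: {r2_full:.4f}")
```

The second score is at least as high as the first although the extra columns carry no information, so a high training score alone does not tell us which variables deserve to stay in the model.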

##### Ways of feature selection:
1. All in
2. Backward Elimination
3. Forward selection

We will look at backward elimination toward the end of the notebook.

#### Splitting the data into dependent and independent variables

In [18]:
X = dataset.drop(columns=["Profit"])   # independent variables (features)
y = dataset["Profit"].to_numpy()       # dependent variable (target)



In [13]:
y

array([192261.83, 191792.06, 191050.39, 182901.99, 166187.94, 156991.12,
       156122.51, 155752.6 , 152211.77, 149759.96, 146121.95, 144259.4 ,
       141585.52, 134307.35, 132602.65, 129917.04, 126992.93, 125370.37,
       124266.9 , 122776.86, 118474.03, 111313.02, 110352.25, 108733.99,
       108552.04, 107404.34, 105733.54, 105008.31, 103282.38, 101004.64,
        99937.59,  97483.56,  97427.84,  96778.92,  96712.8 ,  96479.51,
        90708.19,  89949.14,  81229.06,  81005.76,  78239.91,  77798.83,
        71498.49,  69758.98,  65200.33,  64926.08,  49490.75,  42559.73,
        35673.41,  14681.4 ])

#### Splitting into train set and test set

In [19]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42) 

#### Training the model: Multiple Linear Regression

In [20]:
from sklearn.linear_model import LinearRegression

mlr = LinearRegression()
mlr.fit(X_train,y_train)

LinearRegression()

##### Comparing predicted and actual values

In [26]:
y_predict = mlr.predict(X_test)
np.set_printoptions(precision=2)
# Left column: predicted profit, right column: actual profit
print(np.concatenate((y_predict.reshape(len(y_predict),1), y_test.reshape(len(y_test),1)), axis=1))

[[126362.88 134307.35]
 [ 84608.45  81005.76]
 [ 99677.49  99937.59]
 [ 46357.46  64926.08]
 [128750.48 125370.37]
 [ 50912.42  35673.41]
 [109741.35 105733.54]
 [100643.24 107404.34]
 [ 97599.28  97427.84]
 [113097.43 122776.86]]


In [71]:
mlr.score(X_train, y_train)

0.9537019995248526

In [72]:
mlr.score(X_test, y_test)

0.8987266414328635

## Using Backward Elimination

Steps:

Step 1: First select a significance level (SL) to stay in the model. Generally it is 5%, i.e. 0.05.

Step 2: Fit the model with all possible independent variables (predictors).

Step 3: Choose the predictor which has the highest P-value. <br>
            --> If P-value > SL, go to step 4 <br>
            --> Else finish: our model is ready
            
Step 4: Remove the predictor.

Step 5: Rebuild and fit the model with the remaining variables, then go back to step 3.

In [30]:
dataset.head()

Unnamed: 0,R&D Spend,Administration,Marketing Spend,Profit,Florida,New York
0,165349.2,136897.8,471784.1,192261.83,0,1
1,162597.7,151377.59,443898.53,191792.06,0,0
2,153441.51,101145.55,407934.54,191050.39,1,0
3,144372.41,118671.85,383199.62,182901.99,0,1
4,142107.34,91391.77,366168.42,166187.94,1,0


In [42]:
new_X = dataset.iloc[:,[0,1,2,4,5]]   # every column except Profit
new_Y = dataset.iloc[:,[3]]           # Profit only

In [44]:
new_X.head()

Unnamed: 0,R&D Spend,Administration,Marketing Spend,Florida,New York
0,165349.2,136897.8,471784.1,0,1
1,162597.7,151377.59,443898.53,0,0
2,153441.51,101145.55,407934.54,1,0
3,144372.41,118671.85,383199.62,0,1
4,142107.34,91391.77,366168.42,1,0


In [45]:
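# statsmodels' OLS does not add an intercept on its own,
# so insert a column of ones to stand in for the constant term b0
new_X.insert(0, "ones", np.ones(len(new_X), dtype=int))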

In [48]:
new_X.head()

Unnamed: 0,ones,R&D Spend,Administration,Marketing Spend,Florida,New York
0,1,165349.2,136897.8,471784.1,0,1
1,1,162597.7,151377.59,443898.53,0,0
2,1,153441.51,101145.55,407934.54,1,0
3,1,144372.41,118671.85,383199.62,0,1
4,1,142107.34,91391.77,366168.42,1,0


In [50]:
import statsmodels.api as sm

In [56]:
X_opt = new_X.iloc[:, [0,1,2,3,4,5]]
regressor_OLS = sm.OLS(endog= new_Y, exog= X_opt).fit()
regressor_OLS.summary()

0,1,2,3
Dep. Variable:,Profit,R-squared:,0.951
Model:,OLS,Adj. R-squared:,0.945
Method:,Least Squares,F-statistic:,169.9
Date:,"Mon, 15 Nov 2021",Prob (F-statistic):,1.34e-27
Time:,09:14:41,Log-Likelihood:,-525.38
No. Observations:,50,AIC:,1063.0
Df Residuals:,44,BIC:,1074.0
Df Model:,5,,
Covariance Type:,nonrobust,,

0,1,2,3,4,5,6
,coef,std err,t,P>|t|,[0.025,0.975]
ones,5.013e+04,6884.820,7.281,0.000,3.62e+04,6.4e+04
R&D Spend,0.8060,0.046,17.369,0.000,0.712,0.900
Administration,-0.0270,0.052,-0.517,0.608,-0.132,0.078
Marketing Spend,0.0270,0.017,1.574,0.123,-0.008,0.062
Florida,198.7888,3371.007,0.059,0.953,-6595.030,6992.607
New York,-41.8870,3256.039,-0.013,0.990,-6604.003,6520.229

0,1,2,3
Omnibus:,14.782,Durbin-Watson:,1.283
Prob(Omnibus):,0.001,Jarque-Bera (JB):,21.266
Skew:,-0.948,Prob(JB):,2.41e-05
Kurtosis:,5.572,Cond. No.,1450000.0


In [58]:
X_opt = new_X.iloc[:, [0,1,2,3,4]]
regressor_OLS = sm.OLS(endog= new_Y, exog= X_opt).fit()
regressor_OLS.summary()

0,1,2,3
Dep. Variable:,Profit,R-squared:,0.951
Model:,OLS,Adj. R-squared:,0.946
Method:,Least Squares,F-statistic:,217.2
Date:,"Mon, 15 Nov 2021",Prob (F-statistic):,8.49e-29
Time:,09:16:06,Log-Likelihood:,-525.38
No. Observations:,50,AIC:,1061.0
Df Residuals:,45,BIC:,1070.0
Df Model:,4,,
Covariance Type:,nonrobust,,

0,1,2,3,4,5,6
,coef,std err,t,P>|t|,[0.025,0.975]
ones,5.011e+04,6647.870,7.537,0.000,3.67e+04,6.35e+04
R&D Spend,0.8060,0.046,17.606,0.000,0.714,0.898
Administration,-0.0270,0.052,-0.523,0.604,-0.131,0.077
Marketing Spend,0.0270,0.017,1.592,0.118,-0.007,0.061
Florida,220.1585,2900.536,0.076,0.940,-5621.821,6062.138

0,1,2,3
Omnibus:,14.758,Durbin-Watson:,1.282
Prob(Omnibus):,0.001,Jarque-Bera (JB):,21.172
Skew:,-0.948,Prob(JB):,2.53e-05
Kurtosis:,5.563,Cond. No.,1400000.0


In [59]:
X_opt = new_X.iloc[:, [0,1,2,3]]
regressor_OLS = sm.OLS(endog= new_Y, exog= X_opt).fit()
regressor_OLS.summary()

0,1,2,3
Dep. Variable:,Profit,R-squared:,0.951
Model:,OLS,Adj. R-squared:,0.948
Method:,Least Squares,F-statistic:,296.0
Date:,"Mon, 15 Nov 2021",Prob (F-statistic):,4.53e-30
Time:,09:27:31,Log-Likelihood:,-525.39
No. Observations:,50,AIC:,1059.0
Df Residuals:,46,BIC:,1066.0
Df Model:,3,,
Covariance Type:,nonrobust,,

0,1,2,3,4,5,6
,coef,std err,t,P>|t|,[0.025,0.975]
ones,5.012e+04,6572.353,7.626,0.000,3.69e+04,6.34e+04
R&D Spend,0.8057,0.045,17.846,0.000,0.715,0.897
Administration,-0.0268,0.051,-0.526,0.602,-0.130,0.076
Marketing Spend,0.0272,0.016,1.655,0.105,-0.006,0.060

0,1,2,3
Omnibus:,14.838,Durbin-Watson:,1.282
Prob(Omnibus):,0.001,Jarque-Bera (JB):,21.442
Skew:,-0.949,Prob(JB):,2.21e-05
Kurtosis:,5.586,Cond. No.,1400000.0


In [60]:
X_opt = new_X.iloc[:, [0,1,3]]
regressor_OLS = sm.OLS(endog= new_Y, exog= X_opt).fit()
regressor_OLS.summary()

0,1,2,3
Dep. Variable:,Profit,R-squared:,0.950
Model:,OLS,Adj. R-squared:,0.948
Method:,Least Squares,F-statistic:,450.8
Date:,"Mon, 15 Nov 2021",Prob (F-statistic):,2.16e-31
Time:,09:28:20,Log-Likelihood:,-525.54
No. Observations:,50,AIC:,1057.0
Df Residuals:,47,BIC:,1063.0
Df Model:,2,,
Covariance Type:,nonrobust,,

0,1,2,3,4,5,6
,coef,std err,t,P>|t|,[0.025,0.975]
ones,4.698e+04,2689.933,17.464,0.000,4.16e+04,5.24e+04
R&D Spend,0.7966,0.041,19.266,0.000,0.713,0.880
Marketing Spend,0.0299,0.016,1.927,0.060,-0.001,0.061

0,1,2,3
Omnibus:,14.677,Durbin-Watson:,1.257
Prob(Omnibus):,0.001,Jarque-Bera (JB):,21.161
Skew:,-0.939,Prob(JB):,2.54e-05
Kurtosis:,5.575,Cond. No.,532000.0


In [61]:
X_opt = new_X.iloc[:, [0,1]]
regressor_OLS = sm.OLS(endog= new_Y, exog= X_opt).fit()
regressor_OLS.summary()

0,1,2,3
Dep. Variable:,Profit,R-squared:,0.947
Model:,OLS,Adj. R-squared:,0.945
Method:,Least Squares,F-statistic:,849.8
Date:,"Mon, 15 Nov 2021",Prob (F-statistic):,3.50e-32
Time:,09:31:01,Log-Likelihood:,-527.44
No. Observations:,50,AIC:,1059.0
Df Residuals:,48,BIC:,1063.0
Df Model:,1,,
Covariance Type:,nonrobust,,

0,1,2,3,4,5,6
,coef,std err,t,P>|t|,[0.025,0.975]
ones,4.903e+04,2537.897,19.320,0.000,4.39e+04,5.41e+04
R&D Spend,0.8543,0.029,29.151,0.000,0.795,0.913

0,1,2,3
Omnibus:,13.727,Durbin-Watson:,1.116
Prob(Omnibus):,0.001,Jarque-Bera (JB):,18.536
Skew:,-0.911,Prob(JB):,9.44e-05
Kurtosis:,5.361,Cond. No.,165000.0


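Every remaining predictor now has a P-value below the significance level, so backward elimination stops here. "R&D Spend" is the only feature left and is therefore the one that matters most for "Profit". Note that "Marketing Spend" was removed in the last step with a P-value of 0.060, only slightly above the 0.05 threshold. With feature selection done, we will now build the model using this single feature and observe the result.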

##### Splitting the data into train set and test set

In [68]:
X1 = new_X.iloc[:,[1]]   # column 1 is "R&D Spend" (column 0 is the constant "ones")
y1 = new_Y.to_numpy()

X1_train, X1_test, Y1_train, Y1_test = train_test_split(X1, y1, test_size=0.2, random_state=42)

In [69]:
bmlr = LinearRegression()
bmlr.fit(X1_train, Y1_train)

LinearRegression()

In [70]:
ny_predict = bmlr.predict(X1_test)
np.set_printoptions(precision=2)
print(np.concatenate((ny_predict.reshape(len(ny_predict),1), Y1_test.reshape(len(Y1_test),1)), axis=1))

[[127862.21 134307.35]
 [ 82250.56  81005.76]
 [102255.72  99937.59]
 [ 50190.47  64926.08]
 [130136.88 125370.37]
 [ 49799.37  35673.41]
 [113638.08 105733.54]
 [104535.05 107404.34]
 [103463.05  97427.84]
 [123105.31 122776.86]]


In [73]:
bmlr.score(X1_train, Y1_train)

0.9467864227524652

In [74]:
bmlr.score(X1_test, Y1_test)

0.9265108109341951