## I. Multiple Linear Regression - 

### Multiple linear regression is an algorithm that models the linear relationship between a single continuous dependent variable and two or more independent variables.

### Example - Predicting a car's CO2 emissions from its engine size and number of cylinders.

### $y = b_0 + b_1 x_1 + b_2 x_2 + b_3 x_3 + \dots + b_n x_n$
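
As a small illustration of the equation, the sketch below evaluates it with NumPy for two predictors. The coefficient and feature values are made up for the example and are not fitted to any data.

```python
import numpy as np

# Hypothetical coefficients: b0 is the intercept, b holds b1 and b2.
b0 = 50.0
b = np.array([30.0, 8.0])

# Two hypothetical cars: columns are engine size (x1) and number of cylinders (x2).
X = np.array([[2.0, 4.0],
              [3.5, 6.0]])

# y = b0 + b1*x1 + b2*x2, evaluated for every row at once.
y = b0 + X @ b
print(y)
```

Writing the sum as a matrix product is what lets the same line work for any number of predictors.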

## II. Assumptions in Multiple Linear Regression - 

### 1. A linear relationship should exist between the target and the predictor variables.
### 2. The regression residuals should be normally distributed.
### 3. MLR assumes little or no multicollinearity (correlation among the independent variables) in the data.
<!-- a. Linearity
b. Homoscedasticity
c. Multivariate Normality
d. Independence of errors
e. Lack of multicollinearity -->
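
One common way to check the multicollinearity assumption is the variance inflation factor (VIF): regress each predictor on the others and compute 1 / (1 - R^2). The sketch below is a minimal NumPy version on synthetic data; the helper name `vif` and the data are assumptions made for illustration, not part of the original notebook.

```python
import numpy as np

def vif(X):
    """Variance inflation factor for each column of X (predictors only, no intercept)."""
    n, p = X.shape
    out = np.empty(p)
    for j in range(p):
        target = X[:, j]
        # Regress column j on an intercept plus all the other columns.
        others = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
        coef, *_ = np.linalg.lstsq(others, target, rcond=None)
        resid = target - others @ coef
        r2 = 1.0 - (resid @ resid) / np.sum((target - target.mean()) ** 2)
        out[j] = 1.0 / (1.0 - r2)
    return out

# Synthetic predictors: x3 is nearly a copy of x1, x2 is independent.
rng = np.random.default_rng(0)
x1 = rng.normal(size=200)
x2 = rng.normal(size=200)
x3 = x1 + 0.1 * rng.normal(size=200)
vifs = vif(np.column_stack([x1, x2, x3]))
print(vifs)
```

A VIF close to 1 suggests a predictor is not explained by the others; values above roughly 5 to 10 are often treated as a warning sign, although that threshold is a rule of thumb rather than a fixed rule.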

## III. Building a Model 

### 1. All-in -> Consider all the variables

### 2. Backward Elimination - Stepwise regression ->

#### Step 1 - Select a significance level to stay in the model (eg: SL = 0.05)
#### Step 2 - Fit the full model with all possible predictors
#### Step 3 - Consider the predictor with the highest P-value. If P > SL, go to Step 4; otherwise, go to FIN (the model is ready)
#### Step 4 - Remove that predictor
#### Step 5 - Refit the model without that variable, then go back to Step 3 and repeat the check with the new highest P-value

### 3. Forward Selection - Stepwise regression ->

#### Step 1 - Select a significance level to enter the model (eg: SL = 0.05)
#### Step 2 - Fit every simple regression model y ~ xn and select the one with the lowest P-value
#### Step 3 - Keep the selected variable(s) and fit every model that adds one more predictor to them
#### Step 4 - Consider the added predictor with the lowest P-value. If P < SL, go to Step 3; otherwise, go to FIN (keep the previous model)

### 4. Bi-directional Elimination - Stepwise regression ->

#### Step 1 - Select a significance level to enter and to stay in the model (eg: SLSTAY = 0.05 and SLENTER = 0.05)
#### Step 2 - Perform the next step of forward selection (a new variable must have P < SLENTER to enter)
#### Step 3 - Perform all steps of backward elimination (existing variables must have P < SLSTAY to stay), then go back to Step 2
#### Step 4 - Stop (FIN) when no new variable can enter and no existing variable can exit

### 5. Score Comparison of all possible models ->

#### Step 1 - Select a criterion of goodness of fit (eg: Akaike criterion)
#### Step 2 - Construct all possible regression models: 2^n - 1 combinations in total for n predictors
#### Step 3 - Select the model with the best criterion value; that is your final model
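
The all-possible-models approach can be sketched with `itertools.combinations`. The example below assumes the Akaike criterion for a Gaussian linear model, computed as n * ln(RSS / n) + 2k up to an additive constant that is identical for every model; the helper names and the synthetic data are illustrative assumptions.

```python
import itertools
import numpy as np

def gaussian_aic(X, y):
    """AIC of an OLS fit with intercept, up to a constant shared by all models."""
    n = len(y)
    design = np.column_stack([np.ones(n), X])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    rss = np.sum((y - design @ coef) ** 2)
    k = design.shape[1]                      # number of fitted coefficients
    return n * np.log(rss / n) + 2 * k

def best_subset(X, y):
    """Score every non-empty subset of columns and return the one with the lowest AIC."""
    n_features = X.shape[1]
    scores = {}
    for size in range(1, n_features + 1):
        for cols in itertools.combinations(range(n_features), size):
            scores[cols] = gaussian_aic(X[:, list(cols)], y)
    best = min(scores, key=scores.get)
    return best, scores

# Synthetic data: only columns 0 and 2 actually influence y.
rng = np.random.default_rng(7)
X = rng.normal(size=(150, 4))
y = 1.0 + 2.0 * X[:, 0] - 1.5 * X[:, 2] + rng.normal(scale=0.5, size=150)

best, scores = best_subset(X, y)
print(len(scores), best)
```

Because the number of models doubles with every extra predictor, this exhaustive search is only practical for a small number of candidate variables.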

In [105]:
## Data Preprocessing Steps

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

dataset = pd.read_csv('50_Startups.csv')
X = dataset.iloc[:,:-1].values
y = dataset.iloc[:,-1].values

# One-hot encode the categorical column (index 3); the encoded columns are placed first
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder
ct = ColumnTransformer(transformers=[('encoder', OneHotEncoder(), [3])], remainder='passthrough')
X = ct.fit_transform(X)

In [106]:
from sklearn.model_selection import train_test_split
X_train,X_test, y_train, y_test = train_test_split(X,y, test_size = 0.2, random_state = 0)

In [107]:
from sklearn.linear_model import LinearRegression
regressor = LinearRegression()
regressor.fit(X_train,y_train)

LinearRegression()

In [108]:
y_pred = regressor.predict(X_test)
np.set_printoptions(precision=2)
# Left column: predicted values, right column: actual values
print(np.concatenate((y_pred.reshape(-1, 1), y_test.reshape(-1, 1)), axis=1))

[[103015.2  103282.38]
 [132582.28 144259.4 ]
 [132447.74 146121.95]
 [ 71976.1   77798.83]
 [178537.48 191050.39]
 [116161.24 105008.31]
 [ 67851.69  81229.06]
 [ 98791.73  97483.56]
 [113969.44 110352.25]
 [167921.07 166187.94]]


In [109]:
import statsmodels.api as sm

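# statsmodels does not add an intercept by default, so prepend a column of ones
X = np.append(arr = np.ones((X.shape[0], 1)).astype(int), values = X, axis = 1)
# Keep the intercept, the three one-hot columns and the first two numeric predictors
X_temp = X[:,[0,1,2,3,4,5]]
print(X_temp)
X_temp = X_temp.astype(np.float64)

# Note: the intercept plus all three one-hot columns are perfectly collinear
# (the dummy variable trap), which appears in the summary as a very large
# condition number. Dropping one of the one-hot columns would avoid this.
regressor_OLS = sm.OLS(endog = y, exog = X_temp).fit()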
regressor_OLS.summary()

[[1 0.0 0.0 1.0 165349.2 136897.8]
 [1 1.0 0.0 0.0 162597.7 151377.59]
 [1 0.0 1.0 0.0 153441.51 101145.55]
 [1 0.0 0.0 1.0 144372.41 118671.85]
 [1 0.0 1.0 0.0 142107.34 91391.77]
 [1 0.0 0.0 1.0 131876.9 99814.71]
 [1 1.0 0.0 0.0 134615.46 147198.87]
 [1 0.0 1.0 0.0 130298.13 145530.06]
 [1 0.0 0.0 1.0 120542.52 148718.95]
 [1 1.0 0.0 0.0 123334.88 108679.17]
 [1 0.0 1.0 0.0 101913.08 110594.11]
 [1 1.0 0.0 0.0 100671.96 91790.61]
 [1 0.0 1.0 0.0 93863.75 127320.38]
 [1 1.0 0.0 0.0 91992.39 135495.07]
 [1 0.0 1.0 0.0 119943.24 156547.42]
 [1 0.0 0.0 1.0 114523.61 122616.84]
 [1 1.0 0.0 0.0 78013.11 121597.55]
 [1 0.0 0.0 1.0 94657.16 145077.58]
 [1 0.0 1.0 0.0 91749.16 114175.79]
 [1 0.0 0.0 1.0 86419.7 153514.11]
 [1 1.0 0.0 0.0 76253.86 113867.3]
 [1 0.0 0.0 1.0 78389.47 153773.43]
 [1 0.0 1.0 0.0 73994.56 122782.75]
 [1 0.0 1.0 0.0 67532.53 105751.03]
 [1 0.0 0.0 1.0 77044.01 99281.34]
 [1 1.0 0.0 0.0 64664.71 139553.16]
 [1 0.0 1.0 0.0 75328.87 144135.98]
 [1 0.0 0.0 1.0 72107.6 

| Statistic | Value | Statistic | Value |
|---|---|---|---|
| Dep. Variable: | y | R-squared: | 0.948 |
| Model: | OLS | Adj. R-squared: | 0.943 |
| Method: | Least Squares | F-statistic: | 205.0 |
| Date: | Sun, 25 Apr 2021 | Prob (F-statistic): | 2.9e-28 |
| Time: | 14:37:21 | Log-Likelihood: | -526.75 |
| No. Observations: | 50 | AIC: | 1064.0 |
| Df Residuals: | 45 | BIC: | 1073.0 |
| Df Model: | 4 | | |
| Covariance Type: | nonrobust | | |

| | coef | std err | t | P>\|t\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| const | 4.122e+04 | 4607.941 | 8.945 | 0.000 | 3.19e+04 | 5.05e+04 |
| x1 | 1.339e+04 | 2421.500 | 5.529 | 0.000 | 8511.111 | 1.83e+04 |
| x2 | 1.448e+04 | 2518.987 | 5.748 | 0.000 | 9405.870 | 1.96e+04 |
| x3 | 1.335e+04 | 2459.306 | 5.428 | 0.000 | 8395.623 | 1.83e+04 |
| x4 | 0.8609 | 0.031 | 27.665 | 0.000 | 0.798 | 0.924 |
| x5 | -0.0527 | 0.050 | -1.045 | 0.301 | -0.154 | 0.049 |

| Statistic | Value | Statistic | Value |
|---|---|---|---|
| Omnibus: | 14.275 | Durbin-Watson: | 1.197 |
| Prob(Omnibus): | 0.001 | Jarque-Bera (JB): | 19.26 |
| Skew: | -0.953 | Prob(JB): | 6.57e-05 |
| Kurtosis: | 5.369 | Cond. No. | 9.18e+17 |
