# Multiple Linear Regression
Multiple linear regression works like simple linear regression, but with multiple x-variables (independent variables).

### Formula
```y = m1*x1 + m2*x2 + ... + mn*xn + b```
- n = number of x-variables
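
As a minimal sketch of the formula, a prediction is a weighted sum of the x-values plus the intercept. The slopes, intercept, and inputs below are made-up numbers for illustration only, and `predict` is a hypothetical helper name.

```python
def predict(slopes, xs, b):
    """Evaluate y = m1*x1 + m2*x2 + ... + mn*xn + b for one observation."""
    return sum(m * x for m, x in zip(slopes, xs)) + b


# hypothetical model with n = 2 x-variables
slopes = [2.0, -0.5]   # m1, m2
b = 10.0               # intercept
y = predict(slopes, [3.0, 4.0], b)
print(y)  # 2.0*3.0 + (-0.5)*4.0 + 10.0 = 14.0
```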

### 3D Visual Example
<img src="images/mlr/mlr_example.png" height="60%" width="60%"></img>

https://towardsdatascience.com/graphs-and-ml-multiple-linear-regression-c6920a1f2e70

- The two independent variables are Weight and Horsepower
    - Ideally, these variables should not be strongly correlated with each other
- The dependent variable is MPG

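As a result of the linear regression, a plane is fitted in 3D space (with more than two independent variables, it becomes a hyper-plane).
- The plane was created through the OLS (Ordinary Least Squares) method, which minimizes the sum of squared differences between the actual and predicted values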
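
To make the OLS idea concrete, here is a small sketch that fits a plane with NumPy's least-squares solver. It uses synthetic data that lies exactly on a known plane (not the MPG data set from the image), so the recovered coefficients should come out close to the ones used to generate it.

```python
import numpy as np

# synthetic data (an assumption for illustration): y lies exactly on a known plane
rng = np.random.default_rng(0)
x1 = rng.uniform(0, 10, size=20)
x2 = rng.uniform(0, 10, size=20)
y = 3.0 * x1 - 2.0 * x2 + 5.0

# design matrix: one column per x-variable plus a column of 1s for the intercept b
X = np.column_stack([x1, x2, np.ones_like(x1)])

# OLS picks the coefficients that minimize the sum of squared residuals
coef, *_ = np.linalg.lstsq(X, y, rcond=None)
m1, m2, b = coef
print(m1, m2, b)  # close to 3.0, -2.0, 5.0
```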

### Feature Scaling? Not Needed in Regression!
Feature scaling is not needed for ordinary (unregularized) linear regression, even with multiple variables. The dependent variable is modeled as a weighted sum of the independent variables, so the coefficient (slope) of each independent variable absorbs the scale of that variable: multiplying a variable by a constant simply divides its coefficient by the same constant, and the predictions stay the same.

For example, ```y = 0.5x```. If y = 1000 when x = 2000, then the coefficient (slope) equals 0.5 to scale the result properly. This same concept applies in multi-variate linear regression.
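
The following sketch checks this claim numerically on synthetic data (all numbers are assumptions for illustration). It fits the same model twice with NumPy's least-squares solver, once on the raw features and once with the first feature multiplied by 1000; `fit_predict` is a hypothetical helper.

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.uniform(0, 100, size=(30, 2))
y = 0.5 * x[:, 0] + 4.0 * x[:, 1] + 7.0 + rng.normal(0, 1, size=30)


def fit_predict(features, target):
    """Fit OLS with an intercept column and return (coefficients, predictions)."""
    X = np.column_stack([features, np.ones(len(features))])
    coef, *_ = np.linalg.lstsq(X, target, rcond=None)
    return coef, X @ coef


coef_raw, pred_raw = fit_predict(x, y)

# rescale the first feature by a factor of 1000 and fit again
x_scaled = x.copy()
x_scaled[:, 0] *= 1000.0
coef_scaled, pred_scaled = fit_predict(x_scaled, y)

print(coef_raw[0] / coef_scaled[0])          # the slope shrinks by about 1000
print(np.allclose(pred_raw, pred_scaled))    # the predictions are unchanged
```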

# Dummy Variables
Sometimes variables are non-numeric and hold categorical information. Such variables/columns must be converted into dummy variables before they can be used in a mathematical equation like Multiple Linear Regression.

<img src="images/mlr/dummy_variables.png" height="75%" width="75%"></img>

In the example above, the "State" column is a categorical variable, thus it needs dummy variables.
- For each unique value in the State column, create a dummy variable/column that is 1 if the row belongs to that category and 0 otherwise
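
A minimal sketch of this encoding in plain Python, using the State values of the five rows displayed by ```salary_df.head()``` later in this notebook:

```python
# State values of the five rows shown by salary_df.head()
states = ["New York", "California", "Florida", "New York", "Florida"]

# one dummy column per unique category, in alphabetical order
categories = sorted(set(states))

# 1 if the row belongs to the category, otherwise 0
dummies = [[1 if state == category else 0 for category in categories]
           for state in states]

print(categories)
for state, row in zip(states, dummies):
    print(state, row)
```

Each row contains exactly one 1, marking the category that the row belongs to.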

### Dummy Variable Trap
In the example, we crossed-out the "California" dummy variable. But why?

In a multi-variate model with an intercept, it's necessary to always omit one dummy variable.
- This is because the omitted variable is redundant: its value is fully determined by the remaining dummy variables

The dummy variable trap causes perfect multicollinearity, which would skew the regression model. Perfect multicollinearity is when one predictor variable can be computed exactly from the other predictor variables (a stronger condition than the variables merely being highly correlated). This redundancy means the individual coefficients can no longer be determined uniquely, which skews the results of a regression model.
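
The redundancy can be seen directly in a small sketch. The three rows below are illustrative values, one per state, with the dummy columns ordered as California, Florida, New York.

```python
# dummy columns in the order [California, Florida, New York]
dummies = [
    [0, 0, 1],   # a New York row
    [1, 0, 0],   # a California row
    [0, 1, 0],   # a Florida row
]

# every row sums to 1, which is identical to the intercept's column of 1s
row_sums = [sum(row) for row in dummies]

# so any one dummy column can be rebuilt from the others:
# California = 1 - Florida - New York
california = [row[0] for row in dummies]
reconstructed = [1 - row[1] - row[2] for row in dummies]

print(row_sums)
print(california == reconstructed)
```

Because the California column carries no information beyond the other two, dropping it loses nothing.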

### Handling Dummy Variable Trap
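SKLearn's ```LinearRegression``` does not remove correlated variables, but its least-squares solver still returns a valid solution when the columns are redundant, so its predictions are not affected by the dummy variable trap. The individual coefficients are no longer uniquely determined, though, so they should not be interpreted.

Other statistical tools (such as the statsmodels API) report coefficient-level statistics like p-values, which are not meaningful under perfect multicollinearity. To handle the trap, just omit a single dummy variable.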

# Omitting Variables (Model Building)
Sometimes variables must be thrown out because they're "garbage-in" data that would lead to "garbage-out" predictions.

We don't want these garbage variables because they add noise without helping the model predict anything useful.

### P-Values
A p-value tells us how likely it is to get a result at least as extreme as the observed one if the Null Hypothesis is true.

P-values can determine if an outcome is statistically significant.
- Lower p-values mean that the outcome is unlikely to be due to chance alone, so it is significant to study
- Higher p-values mean that the outcome could easily be due to chance, so it is not significant to study

By convention, the significance level (SL) is usually 0.05:    
If p < 0.05, then we reject the null hypothesis.  
If p >= 0.05, then we fail to reject the null hypothesis.
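
As a worked example outside regression (a coin-flip scenario assumed purely for illustration), suppose the null hypothesis is "the coin is fair" and we observe 9 heads in 10 flips. The sketch below computes the one-sided p-value exactly and applies the 0.05 rule; `binomial_p_value` is a hypothetical helper.

```python
from math import comb


def binomial_p_value(heads, flips):
    """One-sided p-value: P(at least `heads` heads in `flips` flips of a fair coin)."""
    favourable = sum(comb(flips, k) for k in range(heads, flips + 1))
    return favourable / 2 ** flips


SL = 0.05
p = binomial_p_value(9, 10)   # (10 + 1) / 1024
print(p)
print("reject H0" if p < SL else "fail to reject H0")
```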

### 5 Methods of Building Models
1. All-in: Use all the variables

2. Backward Elimination
    - Step 1: Select a significance level (0.05) to stay in the model
    - Step 2: Fit the full model with all possible predictors
    - Step 3: Consider the predictor with the highest P-value.
        - If P > 0.05, go to Step 4
        - Otherwise, the model is ready
    - Step 4: Remove the predictor and re-fit the model without the predictor. Then go to Step 3.
    

3. Forward Selection
    - Step 1: Select a significance level (0.05) to enter in the model
    - Step 2: Fit all possible simple regression models ```y ~ xn```, one per predictor. Select the one with the lowest P-value.
    - Step 3: Keep this variable, then fit all possible models with one extra predictor added to the one(s) you kept
    - Step 4: Consider the predictor with the lowest P-value.
        - If P < 0.05, go to Step 3
        - Otherwise, the model is ready
        

4. Bidirectional Elimination (Stepwise Regression)
    - Step 1: Select a significance level (0.05) to enter and to stay in the model
    - Step 2: Perform the next step of forward selection
        - New variables must have P < 0.05 to enter
    - Step 3: Perform Backward Elimination
        - Old variables must have P < 0.05 to stay
        - Now go back, and repeat Step 2
    - Step 4: No new variables can enter and no old variables can exit. The model is ready.


5. Score Comparison (All Possible Models)
    - Step 1: Select a criterion of goodness of fit.
        - Ex: Akaike Criterion
    - Step 2: Construct all possible regression models
        - 2^N - 1 total combinations
    - Step 3: Select the one with the best criterion. Then your model is ready.
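
Two of the methods above can be sketched in a few lines of plain Python: enumerating all possible models (method 5) and forward selection (method 3). This is only a sketch under stated assumptions: `all_subsets`, `forward_selection`, and the `p_value_when_added` callback are hypothetical names, and the toy p-values are made up and do not change between fits, which real p-values would.

```python
from itertools import combinations


def all_subsets(predictors):
    """Every non-empty subset of the predictors: 2^N - 1 candidate models."""
    return [subset
            for size in range(1, len(predictors) + 1)
            for subset in combinations(predictors, size)]


def forward_selection(candidates, p_value_when_added, sl=0.05):
    """Repeatedly add the candidate with the lowest p-value while it is below sl."""
    kept = []
    remaining = list(candidates)
    while remaining:
        # fit one model per remaining candidate, each added to the kept predictors
        scores = {c: p_value_when_added(kept, c) for c in remaining}
        best = min(remaining, key=lambda c: scores[c])
        if scores[best] >= sl:      # no candidate is significant: the model is ready
            break
        kept.append(best)
        remaining.remove(best)
    return kept


# toy stand-in for fitting a model: made-up, fixed p-values per predictor
def toy_p_value(kept, candidate):
    table = {"x1": 0.001, "x2": 0.30, "x3": 0.02}
    return table[candidate]


predictors = ["x1", "x2", "x3"]
print(len(all_subsets(predictors)))                 # 2^3 - 1 = 7
print(forward_selection(predictors, toy_p_value))   # ['x1', 'x3']
```

In a real workflow the callback would fit a regression on `kept + [candidate]` and return the candidate's p-value.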

# SKLearn Multi-Linear Regression Model

In [2]:
# import libraries
import numpy as np
import pandas as pd

In [3]:
# import the data set
salary_df = pd.read_csv("datasets/50_startups.csv")

salary_df.head()

| | R&D Spend | Administration | Marketing Spend | State | Profit |
|---|---|---|---|---|---|
| 0 | 165349.2 | 136897.8 | 471784.1 | New York | 192261.83 |
| 1 | 162597.7 | 151377.59 | 443898.53 | California | 191792.06 |
| 2 | 153441.51 | 101145.55 | 407934.54 | Florida | 191050.39 |
| 3 | 144372.41 | 118671.85 | 383199.62 | New York | 182901.99 |
| 4 | 142107.34 | 91391.77 | 366168.42 | Florida | 166187.94 |


In [4]:
"""
x is the R&D Spend, Administration, Marketing Spend, and State columns.
Because the State is a categorical variable, we must encode it using OneHotEncoder.
"""
x = salary_df.iloc[:, :-1].values

# encode the State column; no need to use LabelEncoder with latest SKLearn for OneHotEncoder
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import make_column_transformer
dummy_transformer = make_column_transformer((OneHotEncoder(), [3]),
                                            remainder="passthrough")
x = dummy_transformer.fit_transform(x)

# avoid the dummy variable trap by dropping the first dummy column; SKLearn's predictions would be the same without this step
x = x[:, 1:]

# y is the Profit column
y = salary_df.iloc[:, 4].values

In [5]:
# split the data set into training and testing data sets
from sklearn.model_selection import train_test_split 
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=0)

In [6]:
# import the linear regression class
from sklearn.linear_model import LinearRegression

# create a regressor model
regressor = LinearRegression()

# fit the training data, feature scaling is not needed for regression models
regressor.fit(x_train, y_train)

LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None,
         normalize=False)

In [7]:
# predict the test set results, feature scaling is not needed for regression models
y_pred = regressor.predict(x_test)

y_pred

array([103015.20159796, 132582.27760815, 132447.73845175,  71976.09851258,
       178537.48221056, 116161.24230166,  67851.69209676,  98791.73374687,
       113969.43533013, 167921.06569551])

In [8]:
# compare y_pred (prediction) to the y_test (actual)
for pred, actual in zip(y_pred, y_test):
    diff = abs(round(pred) - actual)
    print("Predicted: " + str(round(pred)) + " vs Actual: " + str(actual) +
          " ---> Difference: " + str(diff))

Predicted: 103015.0 vs Actual: 103282.38 ---> Difference: 267.38000000000466
Predicted: 132582.0 vs Actual: 144259.4 ---> Difference: 11677.399999999994
Predicted: 132448.0 vs Actual: 146121.95 ---> Difference: 13673.950000000012
Predicted: 71976.0 vs Actual: 77798.83 ---> Difference: 5822.830000000002
Predicted: 178537.0 vs Actual: 191050.39 ---> Difference: 12513.390000000014
Predicted: 116161.0 vs Actual: 105008.31 ---> Difference: 11152.690000000002
Predicted: 67852.0 vs Actual: 81229.06 ---> Difference: 13377.059999999998
Predicted: 98792.0 vs Actual: 97483.56 ---> Difference: 1308.4400000000023
Predicted: 113969.0 vs Actual: 110352.25 ---> Difference: 3616.75
Predicted: 167921.0 vs Actual: 166187.94 ---> Difference: 1733.0599999999977
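
The row-by-row comparison can also be condensed into a single number. As a sketch, the block below computes the mean absolute error (MAE), a metric that the original notebook does not use, from the ten predicted and actual values printed above. The list names `predicted` and `actual` are chosen here so that the notebook's own variables are left untouched.

```python
# predictions and actual profits copied from the output above
predicted = [103015.20159796, 132582.27760815, 132447.73845175, 71976.09851258,
             178537.48221056, 116161.24230166, 67851.69209676, 98791.73374687,
             113969.43533013, 167921.06569551]
actual = [103282.38, 144259.4, 146121.95, 77798.83, 191050.39,
          105008.31, 81229.06, 97483.56, 110352.25, 166187.94]

# mean absolute error: the average size of the prediction error, in profit units
mae = sum(abs(p - a) for p, a in zip(predicted, actual)) / len(actual)
print(round(mae, 2))
```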


# Re-Building The Model with Backward Elimination
There might've been garbage variables that skewed the predictions. We can attempt to remove these garbage variables through a backward elimination process.
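
Before doing this by hand, the elimination loop from the model-building section can be sketched as a small function. This is only a sketch: `backward_elimination` and the `p_values_for` callback are hypothetical names. The callback is assumed to re-fit the model on the given columns and return one p-value per column (in this notebook it could wrap the ```pvalues``` of a fitted statsmodels OLS result). The toy p-values are made up and stay fixed between fits, which real p-values do not.

```python
def backward_elimination(columns, p_values_for, sl=0.05):
    """Drop the column with the highest p-value until every p-value is <= sl."""
    kept = list(columns)
    while kept:
        p_values = p_values_for(kept)            # re-fit on the columns still kept
        worst = max(kept, key=lambda col: p_values[col])
        if p_values[worst] <= sl:                # Step 3: nothing left to remove
            break
        kept.remove(worst)                       # Step 4: remove one column, then re-fit
    return kept


# toy stand-in for a real re-fit: made-up p-values per column index
def toy_p_values(kept):
    table = {1: 0.40, 2: 0.90, 3: 0.001, 4: 0.20, 5: 0.03}
    return {col: table[col] for col in kept}


selected = backward_elimination([1, 2, 3, 4, 5], toy_p_values)
print(selected)  # [3, 5]
```

Only one column is removed per iteration, because removing a predictor changes the p-values of the ones that remain.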

In [9]:
# import the statsmodels API (sm); in recent statsmodels releases OLS is exposed by statsmodels.api, not statsmodels.formula.api
import statsmodels.api as sm

In [10]:
"""
add a column of 1s as the first column of x.
this column represents the intercept (b0), which the statsmodels OLS class does not add on its own.
"""
ones = np.ones((x.shape[0], 1))
x = np.append(ones, x, axis=1)

# copy x into an x_opt variable, which will contain the significant independent variables
x_opt = np.array(x[:, [0, 1, 2, 3, 4, 5]], dtype=float)

In [11]:
# (Step 2) create an OLS regressor to fit the model with all possible predictors
regressor_OLS = sm.OLS(endog=y, exog=x_opt).fit()

"""
Reading the Summary, we notice:
- const (index 0) has a p-value of 0.000
- x1 (index 1, Florida dummy) has a p-value of 0.953
- x2 (index 2, New York dummy) has a p-value of 0.990
- x3 (index 3, R&D Spend) has a p-value of 0.000
- x4 (index 4, Administration) has a p-value of 0.608
- x5 (index 5, Marketing Spend) has a p-value of 0.123

x1, x2, x4, and x5 all have p-values greater than 0.05. Backward elimination removes only
the predictor with the highest p-value (x2, at 0.990) and then re-fits the model.
"""
regressor_OLS.summary()

| | | | |
|---|---|---|---|
| Dep. Variable: | y | R-squared: | 0.951 |
| Model: | OLS | Adj. R-squared: | 0.945 |
| Method: | Least Squares | F-statistic: | 169.9 |
| Date: | Tue, 28 May 2019 | Prob (F-statistic): | 1.34e-27 |
| Time: | 12:27:22 | Log-Likelihood: | -525.38 |
| No. Observations: | 50 | AIC: | 1063.0 |
| Df Residuals: | 44 | BIC: | 1074.0 |
| Df Model: | 5 | | |
| Covariance Type: | nonrobust | | |

| | coef | std err | t | P>\|t\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| const | 5.013e+04 | 6884.820 | 7.281 | 0.000 | 3.62e+04 | 6.4e+04 |
| x1 | 198.7888 | 3371.007 | 0.059 | 0.953 | -6595.030 | 6992.607 |
| x2 | -41.8870 | 3256.039 | -0.013 | 0.990 | -6604.003 | 6520.229 |
| x3 | 0.8060 | 0.046 | 17.369 | 0.000 | 0.712 | 0.900 |
| x4 | -0.0270 | 0.052 | -0.517 | 0.608 | -0.132 | 0.078 |
| x5 | 0.0270 | 0.017 | 1.574 | 0.123 | -0.008 | 0.062 |

| | | | |
|---|---|---|---|
| Omnibus: | 14.782 | Durbin-Watson: | 1.283 |
| Prob(Omnibus): | 0.001 | Jarque-Bera (JB): | 21.266 |
| Skew: | -0.948 | Prob(JB): | 2.41e-05 |
| Kurtosis: | 5.572 | Cond. No. | 1450000.0 |


In [12]:
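# (Step 4) re-fit the OLS regressor without columns 3 and 4
x_opt =  np.array(x[:, [0, 1, 2, 5]], dtype=float)
regressor_OLS = sm.OLS(endog=y, exog=x_opt).fit()

"""
This fit drops columns 3 (R&D Spend) and 4 (Administration) in one go. In the previous
summary, column 3 had a p-value of 0.000, so it should have stayed in the model; removing
it is why R-squared falls from 0.951 to 0.562 in the summary below.
The summary also still shows p-values above 0.05 for x1 (0.904) and x2 (0.660), so the
elimination is not finished. Backward elimination removes one predictor at a time, always
the one with the highest p-value, and re-fits the model after each removal.
"""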
regressor_OLS.summary()

| | | | |
|---|---|---|---|
| Dep. Variable: | y | R-squared: | 0.562 |
| Model: | OLS | Adj. R-squared: | 0.534 |
| Method: | Least Squares | F-statistic: | 19.71 |
| Date: | Tue, 28 May 2019 | Prob (F-statistic): | 2.32e-08 |
| Time: | 12:27:22 | Log-Likelihood: | -579.99 |
| No. Observations: | 50 | AIC: | 1168.0 |
| Df Residuals: | 46 | BIC: | 1176.0 |
| Df Model: | 3 | | |
| Covariance Type: | nonrobust | | |

| | coef | std err | t | P>\|t\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| const | 5.864e+04 | 8984.018 | 6.527 | 0.000 | 4.06e+04 | 7.67e+04 |
| x1 | -1194.5800 | 9818.999 | -0.122 | 0.904 | -2.1e+04 | 1.86e+04 |
| x2 | 4196.5465 | 9467.707 | 0.443 | 0.660 | -1.49e+04 | 2.33e+04 |
| x3 | 0.2480 | 0.033 | 7.525 | 0.000 | 0.182 | 0.314 |

| | | | |
|---|---|---|---|
| Omnibus: | 3.72 | Durbin-Watson: | 1.174 |
| Prob(Omnibus): | 0.156 | Jarque-Bera (JB): | 2.973 |
| Skew: | -0.299 | Prob(JB): | 0.226 |
| Kurtosis: | 4.034 | Cond. No. | 811000.0 |


# Re-Creating The Model?
Once the significant independent variables have been determined, we may re-create the Multiple Linear Regression model using the ```x_opt``` variable.

That process repeats the SKLearn steps shown earlier, so a demonstration is not needed.