# Multiple Linear Regression

Problem statement: A venture capital fund has hired you as a data scientist to build a model that predicts a startup's profit, so the fund can decide which companies are worth investing in (profit is the dependent variable). The fund also wants to understand the relationships between the parameters: for example, whether companies perform better in New York or California, and how marketing spend affects profit.

## Importing Libraries

In [1]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder
from sklearn.model_selection import train_test_split

END = "\n-------------------------------------\n"

Now we need to decide which variables to keep and which to discard, because garbage in means garbage out.
There are 5 ways to build a model:
* All in - when you know all are necessary based on pre-existing knowledge.
* Backward elimination:
    1. Select a significance level (say SL = 0.05).
    2. Fit the full model with all variables.
    3. Consider the predictor with the highest P value. If P > SL, go to step 4; otherwise, go to step 7.
    4. Remove that predictor.
    5. Refit the model without it.
    6. Repeat from step 3 until the highest P value is below the significance level.
    7. The model is ready.

    This is the fastest approach and is ideal for now.
* Forward Selection
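    * Select a significance level (SL = 0.05).
    * Fit every possible simple regression model for the dependent variable, one per predictor.
    * Select the one with the lowest P value.
    * Keep that variable, then fit all possible models with one extra predictor added to the ones already kept.
    * Consider the new predictor with the lowest P value. If P < SL, go back to the previous step; otherwise, go to the last step.
    * The model is ready: keep the previous model, since the last variable added was not significant.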
* Bidirectional Elimination aka Stepwise Regression
    * Select a significance level to enter and one to stay in the model (SLEnter = 0.05, SLStay = 0.05).
    * Perform the next step of forward selection (new variables must have P < SLEnter to enter).
    * Perform all steps of backward elimination (old variables must have P < SLStay to stay).
    * Repeat the previous 2 steps until no new variable can enter and no old variable can exit.
    * The model is ready.
* All Possible models (i.e. Score Comparison)
    * Select a criterion of goodness of fit (such as the Akaike information criterion).
    * Construct all possible regression models (2^n - 1 for n attributes).
    * Select the one with the best criterion.
    * The model is ready.

    This approach is impractical: 10 columns already mean 1023 models, so it is very resource-consuming.
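
The backward elimination steps listed above can be sketched in code. This is a minimal illustration and not part of the original notebook: the helper names `ols_p_values` and `backward_elimination` are hypothetical, the data is synthetic, and the P values come from a plain NumPy least-squares fit with SciPy's t-distribution rather than from a dedicated statistics package.

```python
import numpy as np
from scipy import stats


def ols_p_values(X, y):
    """Fit OLS by least squares and return a two-sided P value per column of X."""
    n, k = X.shape
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    residuals = y - X @ beta
    dof = n - k                                   # residual degrees of freedom
    sigma2 = residuals @ residuals / dof          # estimated noise variance
    cov = sigma2 * np.linalg.inv(X.T @ X)         # covariance of the coefficients
    t_stats = beta / np.sqrt(np.diag(cov))
    return 2 * stats.t.sf(np.abs(t_stats), dof)


def backward_elimination(X, y, names, sl=0.05):
    """Drop the predictor with the highest P value until all are at or below sl."""
    X = np.column_stack([np.ones(len(y)), X])     # column 0 is the intercept
    kept = list(range(X.shape[1]))
    while len(kept) > 1:
        p = ols_p_values(X[:, kept], y)
        worst = int(np.argmax(p[1:])) + 1         # never remove the intercept
        if p[worst] <= sl:
            break                                 # every remaining predictor is significant
        kept.pop(worst)
    return [names[i - 1] for i in kept[1:]]


# Synthetic data (assumption): y depends on x0 and x1 only; x2 is pure noise.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.5, size=200)
selected = backward_elimination(X, y, ["x0", "x1", "x2"])
print(selected)
```

With this setup `x0` and `x1` should survive. The noise column `x2` will usually be removed, but at a 0.05 significance level it can occasionally stay by chance.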



## Actual Code
Note that, as in simple linear regression, feature scaling is unnecessary here: each coefficient absorbs the scale of its variable, so rescaling a feature only rescales its coefficient and leaves the predictions unchanged.
Also, for multiple linear regression we skip checking the model assumptions up front. If the fitted model performs poorly on this dataset, that tells us multiple linear regression is not the way to go, and we simply try another model.
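
The point about feature scaling can be checked numerically. The sketch below uses synthetic data and hypothetical variable names (`spend`, `staff`, `profit`), and a plain NumPy least-squares fit instead of scikit-learn. It fits the same model twice, once with `spend` in dollars and once in thousands of dollars.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 50
spend = rng.uniform(0, 200_000, size=n)   # hypothetical feature on a large scale
staff = rng.uniform(1, 100, size=n)       # hypothetical feature on a small scale
profit = 0.8 * spend + 500.0 * staff + 40_000 + rng.normal(scale=1_000, size=n)


def fit(features):
    """Ordinary least squares with an intercept; returns (coefficients, predictions)."""
    X = np.column_stack([np.ones(n), features])
    beta, *_ = np.linalg.lstsq(X, profit, rcond=None)
    return beta, X @ beta


beta_raw, pred_raw = fit(np.column_stack([spend, staff]))
beta_scaled, pred_scaled = fit(np.column_stack([spend / 1_000, staff]))

print(np.allclose(pred_raw, pred_scaled))        # predictions are unchanged
print(beta_raw[1] * 1_000, beta_scaled[1])       # the coefficient absorbs the factor of 1000
```

Dividing a feature by 1000 multiplies its coefficient by 1000 and changes nothing else, which is why no scaler is applied before fitting.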

#### Data preprocessing

In [16]:
# The first step is data preprocessing
df = pd.read_csv('50_Startups.csv')
x = df.iloc[:, :-1].values
y = df.iloc[:, -1].values

# Checked for missing values and for columns that need encoding. Only State requires one-hot encoding.
ct = ColumnTransformer(transformers=[('encoder', OneHotEncoder(), [3])], remainder='passthrough') # [3] is the index of the State column, the one being encoded.
x = np.array(ct.fit_transform(x))

# Splitting into training and test sets
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=0)

#### Training the model

In [17]:
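# Do we need to do something to avoid the dummy variable trap? No: LinearRegression still fits and predicts correctly when all dummy columns are kept, although the individual dummy coefficients are then not uniquely determined.
# Do we have to select the best features ourselves, e.g. with backward elimination? LinearRegression does not select features; it fits a coefficient for every column. Here we simply keep all features (the "all in" approach).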

# Training the model
regressor = LinearRegression()
regressor.fit(x_train, y_train)
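
To see why the dummy variable trap does not break the fit, here is a small sketch on synthetic data using NumPy's least-squares solver. The data and variable names are hypothetical. I am assuming scikit-learn's `LinearRegression` behaves the same way because it also solves an ordinary least squares problem; that is not verified here.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 30
state = np.arange(n) % 3                      # hypothetical 3-category feature
dummies = np.eye(3)[state]                    # one-hot encoding, all 3 columns kept
spend = rng.uniform(0, 100, size=n)
y = 2.0 * spend + np.array([5.0, -3.0, 1.0])[state] + rng.normal(size=n)

ones = np.ones((n, 1))
X_full = np.hstack([ones, dummies, spend[:, None]])          # intercept + 3 dummies
X_drop = np.hstack([ones, dummies[:, 1:], spend[:, None]])   # intercept + 2 dummies

# The 3 dummies sum to the intercept column, so X_full loses one rank.
print(np.linalg.matrix_rank(X_full), X_full.shape[1])

beta_full, *_ = np.linalg.lstsq(X_full, y, rcond=None)
beta_drop, *_ = np.linalg.lstsq(X_drop, y, rcond=None)

# Both designs span the same column space, so the fitted values agree.
print(np.allclose(X_full @ beta_full, X_drop @ beta_drop))
```

The redundancy only affects how the coefficients are split between the intercept and the dummies, not the predictions. If interpretable dummy coefficients matter, dropping one dummy column is the usual remedy.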



#### Predictions

In [18]:
# Since there are 4 features, we can't draw a simple 2D plot; we would need a 5-dimensional graph.
# Instead, we display 2 vectors side by side: the predicted profits and the real profits of the test set, to check how close they are.
# After this, we'll look at metrics.

# Displaying the results
y_pred = regressor.predict(x_test)
np.set_printoptions(precision=2) # print with 2 decimal places
print("Predicted profits vs Real profits")
print(np.concatenate((y_pred.reshape(len(y_pred), 1), y_test.reshape(len(y_test), 1)), axis=1))

Predicted profits vs Real profits
[[103015.2  103282.38]
 [132582.28 144259.4 ]
 [132447.74 146121.95]
 [ 71976.1   77798.83]
 [178537.48 191050.39]
 [116161.24 105008.31]
 [ 67851.69  81229.06]
 [ 98791.73  97483.56]
 [113969.44 110352.25]
 [167921.07 166187.94]]


Some predictions are very close to the real profit (the first is off by less than 300), while others miss by more than 13000. Overall, the multiple linear regression model fits this dataset reasonably well. The goal is to be efficient, so we will still try other models and then select the best one.
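
To put a number on "how close", the printed pairs can be turned into two common metrics: the mean absolute error and the coefficient of determination (R^2). This is a sketch that copies the 10 values from the output above instead of reusing `y_pred` and `y_test` from the notebook, so it runs on its own.

```python
import numpy as np

# Values copied from the printed test-set output (predicted, real).
y_pred = np.array([103015.20, 132582.28, 132447.74, 71976.10, 178537.48,
                   116161.24, 67851.69, 98791.73, 113969.44, 167921.07])
y_test = np.array([103282.38, 144259.40, 146121.95, 77798.83, 191050.39,
                   105008.31, 81229.06, 97483.56, 110352.25, 166187.94])

mae = np.mean(np.abs(y_test - y_pred))             # average absolute miss
ss_res = np.sum((y_test - y_pred) ** 2)            # residual sum of squares
ss_tot = np.sum((y_test - y_test.mean()) ** 2)     # total sum of squares
r2 = 1 - ss_res / ss_tot

print(f"MAE: {mae:.2f}")
print(f"R^2: {r2:.4f}")
```

Inside the notebook, `sklearn.metrics.mean_absolute_error(y_test, y_pred)` and `sklearn.metrics.r2_score(y_test, y_pred)` compute the same quantities directly.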

In [19]:
# To make a prediction for a single observation, say R&D = 160000, Admin = 130000, Marketing = 300000, State = California:
print(regressor.predict([[1, 0, 0, 160000, 130000, 300000]]))
# 1, 0, 0 are the dummy variables for the 3 possible states (California, Florida, New York). Always enter them first, because ColumnTransformer places the encoded columns before the passthrough columns.

[181566.92]
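
The ordering of the dummy columns can be confirmed on a tiny example. The 3 rows below are made up; only the column layout (3 numeric columns followed by State at index 3) mirrors the notebook's `x`.

```python
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# Made-up rows in the notebook's column order:
# R&D Spend, Administration, Marketing Spend, State
rows = np.array([
    [100.0, 10.0, 1.0, "New York"],
    [200.0, 20.0, 2.0, "California"],
    [300.0, 30.0, 3.0, "Florida"],
], dtype=object)

ct = ColumnTransformer(
    transformers=[("encoder", OneHotEncoder(), [3])],
    remainder="passthrough",
)
encoded = np.array(ct.fit_transform(rows))

# Categories are sorted, so the dummy order is California, Florida, New York.
print(ct.named_transformers_["encoder"].categories_)
# Each row: 3 dummy columns first, then the untouched numeric columns.
print(encoded)
```

The California row starts with 1, 0, 0, followed by its three numeric values, which matches the layout used in the single prediction above.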


In [20]:
# Also, to get the final equation out of the regressor,
print(regressor.coef_)
print(regressor.intercept_)

[ 8.66e+01 -8.73e+02  7.86e+02  7.73e-01  3.29e-02  3.66e-02]
42467.5292485298


This means:
Profit = 86.6 × Dummy State 1 − 873 × Dummy State 2 + 786 × Dummy State 3 + 0.773 × R&D Spend + 0.0329 × Administration + 0.0366 × Marketing Spend + 42467.53
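
As a sanity check, the equation can be evaluated by hand for the single startup predicted earlier (California, R&D = 160000, Admin = 130000, Marketing = 300000). The sketch uses the rounded coefficients exactly as printed, so a small gap to the model's own output of 181566.92 is expected.

```python
# Rounded coefficients as printed above; the order matches the encoded columns.
coefficients = [86.6, -873.0, 786.0, 0.773, 0.0329, 0.0366]
intercept = 42467.53

# Dummy variables for California first, then R&D, Administration, Marketing.
features = [1, 0, 0, 160000, 130000, 300000]

profit = intercept + sum(c * f for c, f in zip(coefficients, features))
print(round(profit, 2))
print(round(abs(profit - 181566.92), 2))  # gap caused by the rounded coefficients
```

The hand-computed value lands within a fraction of a percent of the model's prediction. The remaining difference comes from rounding the coefficients to 3 significant figures.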