# Multiple Linear Regression

### WHAT ARE WE TRYING TO ACHIEVE:

We are trying to predict profit from 4 variables (R&D Spend, Administration, Marketing Spend and State).

## Importing the libraries

In [50]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

## Importing the dataset

In [51]:
dataset = pd.read_csv("50_Startups.csv")
dataset.head()

Unnamed: 0,R&D Spend,Administration,Marketing Spend,State,Profit
0,165349.2,136897.8,471784.1,New York,192261.83
1,162597.7,151377.59,443898.53,California,191792.06
2,153441.51,101145.55,407934.54,Florida,191050.39
3,144372.41,118671.85,383199.62,New York,182901.99
4,142107.34,91391.77,366168.42,Florida,166187.94


In [52]:
x = dataset.iloc[:,:-1].values
x

array([[165349.2, 136897.8, 471784.1, 'New York'],
       [162597.7, 151377.59, 443898.53, 'California'],
       [153441.51, 101145.55, 407934.54, 'Florida'],
       [144372.41, 118671.85, 383199.62, 'New York'],
       [142107.34, 91391.77, 366168.42, 'Florida'],
       [131876.9, 99814.71, 362861.36, 'New York'],
       [134615.46, 147198.87, 127716.82, 'California'],
       [130298.13, 145530.06, 323876.68, 'Florida'],
       [120542.52, 148718.95, 311613.29, 'New York'],
       [123334.88, 108679.17, 304981.62, 'California'],
       [101913.08, 110594.11, 229160.95, 'Florida'],
       [100671.96, 91790.61, 249744.55, 'California'],
       [93863.75, 127320.38, 249839.44, 'Florida'],
       [91992.39, 135495.07, 252664.93, 'California'],
       [119943.24, 156547.42, 256512.92, 'Florida'],
       [114523.61, 122616.84, 261776.23, 'New York'],
       [78013.11, 121597.55, 264346.06, 'California'],
       [94657.16, 145077.58, 282574.31, 'New York'],
       [91749.16, 114175.79, 29491

In [53]:
y = dataset.iloc[:,-1].values
y

array([192261.83, 191792.06, 191050.39, 182901.99, 166187.94, 156991.12,
       156122.51, 155752.6 , 152211.77, 149759.96, 146121.95, 144259.4 ,
       141585.52, 134307.35, 132602.65, 129917.04, 126992.93, 125370.37,
       124266.9 , 122776.86, 118474.03, 111313.02, 110352.25, 108733.99,
       108552.04, 107404.34, 105733.54, 105008.31, 103282.38, 101004.64,
        99937.59,  97483.56,  97427.84,  96778.92,  96712.8 ,  96479.51,
        90708.19,  89949.14,  81229.06,  81005.76,  78239.91,  77798.83,
        71498.49,  69758.98,  65200.33,  64926.08,  49490.75,  42559.73,
        35673.41,  14681.4 ])

## Encoding categorical data

In [54]:
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

ct = ColumnTransformer(transformers = [('encoder',OneHotEncoder(),[3])], remainder="passthrough")
x = np.array(ct.fit_transform(x))

In [55]:
x

array([[0.0, 0.0, 1.0, 165349.2, 136897.8, 471784.1],
       [1.0, 0.0, 0.0, 162597.7, 151377.59, 443898.53],
       [0.0, 1.0, 0.0, 153441.51, 101145.55, 407934.54],
       [0.0, 0.0, 1.0, 144372.41, 118671.85, 383199.62],
       [0.0, 1.0, 0.0, 142107.34, 91391.77, 366168.42],
       [0.0, 0.0, 1.0, 131876.9, 99814.71, 362861.36],
       [1.0, 0.0, 0.0, 134615.46, 147198.87, 127716.82],
       [0.0, 1.0, 0.0, 130298.13, 145530.06, 323876.68],
       [0.0, 0.0, 1.0, 120542.52, 148718.95, 311613.29],
       [1.0, 0.0, 0.0, 123334.88, 108679.17, 304981.62],
       [0.0, 1.0, 0.0, 101913.08, 110594.11, 229160.95],
       [1.0, 0.0, 0.0, 100671.96, 91790.61, 249744.55],
       [0.0, 1.0, 0.0, 93863.75, 127320.38, 249839.44],
       [1.0, 0.0, 0.0, 91992.39, 135495.07, 252664.93],
       [0.0, 1.0, 0.0, 119943.24, 156547.42, 256512.92],
       [0.0, 0.0, 1.0, 114523.61, 122616.84, 261776.23],
       [1.0, 0.0, 0.0, 78013.11, 121597.55, 264346.06],
       [0.0, 0.0, 1.0, 94657.16, 145077.58

In [56]:
dataset['State'].nunique()

3

#### NOTE: 
The State variable has 3 categories (3 states), so one-hot encoding created 3 dummy columns for us, one per state.
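To see which dummy column belongs to which state, one option is to inspect the encoder's `categories_` attribute. The sketch below uses a small illustrative State column rather than the real dataset, and a standalone `OneHotEncoder` instead of the `ColumnTransformer` used in this notebook.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Toy State column (illustrative values only, not the full dataset)
states = np.array([["New York"], ["California"], ["Florida"], ["New York"]])

encoder = OneHotEncoder()
encoded = encoder.fit_transform(states).toarray()  # default output is sparse; convert to dense

# categories_ lists the categories in the same order as the dummy columns
category_order = list(encoder.categories_[0])
print(category_order)
print(encoded)
```

The categories come out in sorted order, which matches the encoded matrix above where California is "1, 0, 0" and New York is "0, 0, 1".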

## Splitting the dataset into the Training set and Test set

In [57]:
from sklearn.model_selection import train_test_split
X_train, X_test, Y_train, Y_test = train_test_split(x, y, test_size=0.2, random_state = 1)

#### NOTE
We don't need to apply feature scaling in (ordinary least-squares) multiple linear regression. Each coefficient is fitted on the scale of its own variable, so a feature with large values simply gets a proportionally smaller coefficient and the predictions stay the same.
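A small check of this claim, as a sketch on synthetic data: the helper `fit_ols` is hypothetical and uses NumPy's least-squares solver rather than scikit-learn. Rescaling the features changes the coefficients by the same factor but leaves the predictions unchanged.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.uniform(0, 100_000, size=(30, 2))   # two synthetic features on a large scale
y = 3.0 * X[:, 0] - 0.5 * X[:, 1] + 1_000 + rng.normal(0, 10, size=30)

def fit_ols(features, target):
    """Hypothetical helper: least-squares fit with an intercept column."""
    design = np.column_stack([features, np.ones(len(features))])
    solution, *_ = np.linalg.lstsq(design, target, rcond=None)
    return solution[:-1], solution[-1]      # (coefficients, intercept)

coef_raw, intercept_raw = fit_ols(X, y)

X_scaled = X / 1_000.0                      # rescale both features
coef_scaled, intercept_scaled = fit_ols(X_scaled, y)

pred_raw = X @ coef_raw + intercept_raw
pred_scaled = X_scaled @ coef_scaled + intercept_scaled

print(coef_raw, coef_scaled)                # coefficients differ by the scale factor
print(np.allclose(pred_raw, pred_scaled))   # predictions agree
```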

## Training the Multiple Linear Regression model on the Training set

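#### Question: We have 3 columns for the 3 categories in the State variable. Don't we have to delete one because it is redundant? (We normally keep only N-1 dummy variables for N categories.)

Answer: Not for making predictions. The LinearRegression class does not drop a column for us, but its least-squares solver still returns a valid solution when the dummy columns are perfectly collinear with the intercept, so the predictions are unaffected. The individual dummy coefficients are then not uniquely determined, which matters when we interpret them. Passing `drop='first'` to OneHotEncoder keeps only N-1 dummy columns if that is a concern.

The LinearRegression class does not select the most statistically significant variables either: it fits a coefficient for every column it is given.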
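The effect of keeping all the dummy columns can be sketched with NumPy alone. This is an illustration on synthetic data, not scikit-learn's internals: with all 3 dummies plus an intercept the design matrix is rank deficient, yet a least-squares solver gives the same predictions as the version with one dummy dropped.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 42
state = np.arange(n) % 3                    # hypothetical category codes 0, 1, 2
dummies = np.eye(3)[state]                  # one-hot columns, one per category
spend = rng.uniform(0, 100, size=n)         # one synthetic numeric feature
y = 5.0 * spend + np.array([10.0, 20.0, 30.0])[state] + rng.normal(0, 1, size=n)

ones = np.ones((n, 1))
X_all = np.column_stack([dummies, spend, ones])             # all 3 dummies + intercept
X_dropped = np.column_stack([dummies[:, 1:], spend, ones])  # first dummy dropped

# The 3 dummy columns sum to the intercept column, so X_all loses one rank
rank_all = np.linalg.matrix_rank(X_all)
rank_dropped = np.linalg.matrix_rank(X_dropped)

beta_all, *_ = np.linalg.lstsq(X_all, y, rcond=None)        # minimum-norm solution
beta_dropped, *_ = np.linalg.lstsq(X_dropped, y, rcond=None)

same_predictions = np.allclose(X_all @ beta_all, X_dropped @ beta_dropped)
print(rank_all, rank_dropped, same_predictions)
```

The fitted coefficients differ between the two versions even though the predictions agree, which is why dummy coefficients need care when interpreted.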

In [58]:
from sklearn.linear_model import LinearRegression

In [59]:
regressor = LinearRegression()

In [60]:
regressor.fit(X_train, Y_train)

LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None, normalize=False)

#### The regressor has been trained. It has estimated how "ALL" the independent variables together relate to profit (the dependent variable), with one coefficient per variable

"ALL" is what makes this a MULTIPLE Linear Regression.
In SIMPLE Linear Regression, the model relates just one independent variable to the dependent variable. Now it does so for "ALL" of them.

##### Understanding numpy's concatenate & reshape

In [61]:
a = np.array([[1,3],[4,6]])
a

array([[1, 3],
       [4, 6]])

In [62]:
b = np.array([[11,15],[12,16]])
b

array([[11, 15],
       [12, 16]])

In [63]:
#numpy axis: https://www.youtube.com/watch?v=uB3o8a0g8Hg
#axis=0 (default) concatenates vertically (arrays need the same number of columns): rows are stacked on top of each other.
#axis=1 concatenates horizontally (arrays need the same number of rows): columns are placed side by side.

In [64]:
np.concatenate((a,b),axis=0)

array([[ 1,  3],
       [ 4,  6],
       [11, 15],
       [12, 16]])

In [65]:
np.concatenate((a,b),axis=1)

array([[ 1,  3, 11, 15],
       [ 4,  6, 12, 16]])

---

In [66]:
#Reshape
#https://www.geeksforgeeks.org/numpy-reshape-python/

In [67]:
a.reshape(4,1)

array([[1],
       [3],
       [4],
       [6]])

In [129]:
c = np.array([1,2,3,4,5])
c

array([1, 2, 3, 4, 5])

In [132]:
#I need 1 column, and -1 says "Figure out the number of rows yourself so that the number of elements remains the same"
#This is the same as: c.reshape(len(c),1)
c.reshape(-1,1)

array([[1],
       [2],
       [3],
       [4],
       [5]])

The new shape should be compatible with the original shape. If an integer, then the result will be a 1-D array of that length. One shape dimension can be -1. In this case, the value is inferred from the length of the array and remaining dimensions.
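A short illustration of the compatibility rule, using a 6-element array chosen for the example: -1 can stand in for either dimension, and a shape that cannot hold all the elements raises a `ValueError`.

```python
import numpy as np

c = np.array([1, 2, 3, 4, 5, 6])

as_column = c.reshape(-1, 1)      # number of rows inferred, 1 column
as_two_rows = c.reshape(2, -1)    # 2 rows, number of columns inferred

print(as_column.shape, as_two_rows.shape)

# 6 elements cannot be split evenly into rows of 4, so this shape is incompatible
try:
    c.reshape(-1, 4)
    reshape_failed = False
except ValueError:
    reshape_failed = True
print(reshape_failed)
```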

## Predicting the Test set results

In [68]:
y_pred = regressor.predict(X_test)
np.set_printoptions(precision = 2) #All numerical values will have 2 decimals

In [69]:
#Doing this to see original profit and predicted profit side by side vertically
print(np.concatenate((y_pred.reshape(len(y_pred),1),Y_test.reshape(len(Y_test),1)),axis=1))

[[114664.42 105008.31]
 [ 90593.16  96479.51]
 [ 75692.84  78239.91]
 [ 70221.89  81229.06]
 [179790.26 191050.39]
 [171576.92 182901.99]
 [ 49753.59  35673.41]
 [102276.66 101004.64]
 [ 58649.38  49490.75]
 [ 98272.03  97483.56]]
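Comparing the two columns by eye works for 10 rows, but a summary number is easier to read. The sketch below copies the printed values and computes the mean absolute error and R² by hand with NumPy; choosing these two metrics is an assumption of this example, not something the notebook prescribes.

```python
import numpy as np

# Predicted and actual profits copied from the printed test-set output
y_pred = np.array([114664.42, 90593.16, 75692.84, 70221.89, 179790.26,
                   171576.92, 49753.59, 102276.66, 58649.38, 98272.03])
y_true = np.array([105008.31, 96479.51, 78239.91, 81229.06, 191050.39,
                   182901.99, 35673.41, 101004.64, 49490.75, 97483.56])

mae = np.mean(np.abs(y_true - y_pred))            # mean absolute error
ss_res = np.sum((y_true - y_pred) ** 2)           # residual sum of squares
ss_tot = np.sum((y_true - y_true.mean()) ** 2)    # total sum of squares
r2 = 1 - ss_res / ss_tot                          # coefficient of determination

print(f"MAE: {mae:.2f}")
print(f"R^2: {r2:.3f}")
```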


#### NOTE:
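Backward Elimination is not performed here. Scikit-Learn's LinearRegression does not select statistically significant features; it fits a coefficient for every feature it is given. We skip feature selection because the goal is simply to make predictions from all the available features. If feature selection is needed, it has to be done as a separate step.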

## Making a single prediction (for example the profit of a startup with R&D Spend = 160000, Administration Spend = 130000, Marketing Spend = 300000 and State = 'California')

In [70]:
print(regressor.predict([[1, 0, 0, 160000, 130000, 300000]]))

[180892.25]


#### Therefore, our model predicts that the profit of a Californian startup which spent 160000 in R&D, 130000 in Administration and 300000 in Marketing is $180892.25.


#### Important note 1: Notice that the values of the features were all input in a double pair of square brackets. 
That's because the "predict" method always expects a 2D array as the format of its inputs. And putting our values into a double pair of square brackets makes the input exactly a 2D array. Simply put:

1,0,0,160000,130000,300000→scalars 

[1,0,0,160000,130000,300000]→1D array / List

[[1,0,0,160000,130000,300000]]→2D array 
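The difference between the 1D and 2D forms is visible in the array shapes. A minimal sketch with NumPy, where the variable names are chosen for illustration:

```python
import numpy as np

values = [1, 0, 0, 160000, 130000, 300000]

one_d = np.array(values)           # shape (6,): a single list of 6 values
two_d = np.array([values])         # shape (1, 6): 1 row (sample) with 6 columns (features)
reshaped = one_d.reshape(1, -1)    # another way to turn the 1D array into one row

print(one_d.ndim, one_d.shape)
print(two_d.ndim, two_d.shape)
print(np.array_equal(two_d, reshaped))
```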

#### Important note 2: Notice also that the "California" state was not input as a string in the last column but as "1, 0, 0" in the first three columns. 
That's because the predict method expects the one-hot-encoded values of the state, and as we see in the second row of the matrix of features x, "California" was encoded as "1, 0, 0". Be careful to put these values in the first three columns, not the last three: our ColumnTransformer places the encoded columns first and appends the passthrough columns after them.
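To avoid typing the dummy values by hand, the row can be built from the state name. `build_input_row` and `STATE_ORDER` below are hypothetical names introduced for this sketch; the column order is the one visible in the encoded matrix x.

```python
# Dummy column order as seen in the encoded matrix x:
# California -> 1,0,0   Florida -> 0,1,0   New York -> 0,0,1
STATE_ORDER = ["California", "Florida", "New York"]

def build_input_row(state, rd_spend, administration, marketing_spend):
    """Hypothetical helper: return a 2D list [[dummies..., numeric features...]]."""
    if state not in STATE_ORDER:
        raise ValueError(f"Unknown state: {state!r}")
    dummies = [1 if state == name else 0 for name in STATE_ORDER]
    return [dummies + [rd_spend, administration, marketing_spend]]

row = build_input_row("California", 160000, 130000, 300000)
print(row)
```

The returned value is already a 2D list, so it could be passed straight to `regressor.predict(row)`.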

## Getting the final linear regression equation with the values of the coefficients

In [71]:
print(regressor.coef_)
print(regressor.intercept_)

[-2.85e+02  2.98e+02 -1.24e+01  7.74e-01 -9.44e-03  2.89e-02]
49834.88507321703


## Profit = -285 × Dummy State 1 + 298 × Dummy State 2 - 12.4 × Dummy State 3 + 0.774 × R&D Spend - 0.00944 × Administration + 0.0289 × Marketing Spend + 49834.89

Important Note: To get these coefficients we called the "coef_" and "intercept_" attributes from our regressor object. Attributes in Python are different from methods: they are accessed without parentheses and usually hold a simple value or an array of values.
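As a sanity check, the equation can be evaluated by hand for the Californian startup from the single prediction. The coefficients below are the rounded values from the printed output, so the result should land close to, but not exactly on, the 180892.25 returned by `predict`.

```python
import numpy as np

# Coefficients and intercept copied from the printed output (coefficients are rounded)
coef = np.array([-2.85e+02, 2.98e+02, -1.24e+01, 7.74e-01, -9.44e-03, 2.89e-02])
intercept = 49834.88507321703

# Same startup as in the single prediction: dummies first, then the three spends
features = np.array([1, 0, 0, 160000, 130000, 300000])

# Linear regression prediction = dot product of features and coefficients + intercept
manual_prediction = features @ coef + intercept
print(round(manual_prediction, 2))
```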

#### Interpretation: https://www.youtube.com/watch?v=fTfMdCQJz4s&t=2s

A 1-unit increase in R&D Spend is associated with an increase of about 0.77 units in predicted profit, with all other variables held constant.

How to read a Categorical Variable Coefficient: <b>On Average</b>, Dummy State 2 is related to a $298 higher profit. That is, setting Dummy State 2 to 1 adds about 298 to the predicted profit, relative to a baseline where all three dummy variables are 0. Because all three dummy columns were kept, this baseline does not correspond to any actual state.

So, with every other metric the same, Dummy State 2 adds about $298 to the predicted profit compared with that baseline.

#### Let's prove this:

In [119]:
dataset['State'].unique()

array(['New York', 'California', 'Florida'], dtype=object)

In [126]:
#Regression prediction of dummy state 2: Florida (encoded as 0, 1, 0) with the same values for the other variables
print(regressor.predict([[0, 1, 0, 160000, 130000, 300000]]))

[181474.99]


In [127]:
#All state dummies set to 0, keeping the values of the other independent variables the same
print(regressor.predict([[0, 0, 0, 160000, 130000, 300000]]))

[181177.43]


In [128]:
181474.99 - 181177.43

297.5599999999977
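The same idea gives a comparison between two real states. Using the predictions already printed in this notebook for identical spend values, the gap between the "0, 1, 0" row (Florida in the encoded matrix x) and the "1, 0, 0" row (California) should match the gap between their dummy coefficients, up to rounding. The variable names are chosen for this sketch.

```python
# Predictions printed earlier for identical spend values
pred_california = 180892.25   # input [1, 0, 0, 160000, 130000, 300000]
pred_florida = 181474.99      # input [0, 1, 0, 160000, 130000, 300000]

# Rounded dummy coefficients from regressor.coef_
coef_california = -285.0
coef_florida = 298.0

diff_predictions = pred_florida - pred_california
diff_coefficients = coef_florida - coef_california

print(round(diff_predictions, 2), diff_coefficients)
```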

---

#### Let's try to understand this: https://www.youtube.com/watch?v=TBJsEb2UCPs

If we are predicting "risk of disease" from "smoker (yes/no)", we make a new column with 1 = smoker and 0 = non-smoker.
Let's say it gives us a regression result:
<br>
p value = less than 0.05 (at the conventional 0.05 threshold this is statistically significant, i.e. yes, smoking is related to risk)
<br>
coefficient = 8.7
<br>
intercept = -91

<b>Equation: Predicted Risk = -91 + 8.7D </b>

What this tells me is that being a smoker INCREASES the risk by 8.7

HOW?

Predicted Risk (Smoker) = -91 + 8.7(1) = -82.3
<br>
Predicted Risk (Non-Smoker) = -91 + 8.7(0) = -91

If I plot these lines, the intercept of the smoker line is shifted by 8.7 and the slope stays the same.
That shift is how the coefficient indicates the difference between the two groups.
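The arithmetic can be checked in a few lines. `predicted_risk` is a hypothetical function that simply encodes the example equation.

```python
def predicted_risk(is_smoker):
    """Example equation: Predicted Risk = -91 + 8.7 * D, with D = 1 for smokers."""
    intercept = -91.0
    coefficient = 8.7
    return intercept + coefficient * is_smoker

risk_smoker = predicted_risk(1)
risk_non_smoker = predicted_risk(0)

print(round(risk_smoker, 1), round(risk_non_smoker, 1))
print(round(risk_smoker - risk_non_smoker, 1))   # the gap equals the dummy coefficient
```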
