# Multiple Linear Regression

## Importing the libraries

In [1]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

## Importing the dataset

In [2]:
# pd is the pandas alias; read_csv loads the CSV file from the working folder into a DataFrame
dataset = pd.read_csv("50_Startups.csv")

# Build the feature matrix x (independent variables) and the target vector y (dependent variable)
x = dataset.iloc[:,:-1].values
y = dataset.iloc[:,-1].values
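
The slicing used above follows the usual NumPy rules: `[:, :-1]` keeps every row and all columns except the last, and `[:, -1]` keeps every row of the last column only. The snippet below is a small sketch on made-up numbers (the two rows are invented, only the column layout mirrors the dataset), using a plain NumPy array instead of a DataFrame.

```python
import numpy as np

# Two invented rows in the same column layout as the dataset:
# R&D Spend, Administration, Marketing Spend, State, Profit
rows = np.array(
    [
        [100.0, 50.0, 30.0, "New York", 20.0],
        [80.0, 60.0, 10.0, "Florida", 15.0],
    ],
    dtype=object,
)

features = rows[:, :-1]  # all rows, every column except the last
target = rows[:, -1]     # all rows, only the last column

print(features.shape)  # (2, 4)
print(target.shape)    # (2,)
```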

In [3]:
print(x)

[[165349.2 136897.8 471784.1 'New York']
 [162597.7 151377.59 443898.53 'California']
 [153441.51 101145.55 407934.54 'Florida']
 [144372.41 118671.85 383199.62 'New York']
 [142107.34 91391.77 366168.42 'Florida']
 [131876.9 99814.71 362861.36 'New York']
 [134615.46 147198.87 127716.82 'California']
 [130298.13 145530.06 323876.68 'Florida']
 [120542.52 148718.95 311613.29 'New York']
 [123334.88 108679.17 304981.62 'California']
 [101913.08 110594.11 229160.95 'Florida']
 [100671.96 91790.61 249744.55 'California']
 [93863.75 127320.38 249839.44 'Florida']
 [91992.39 135495.07 252664.93 'California']
 [119943.24 156547.42 256512.92 'Florida']
 [114523.61 122616.84 261776.23 'New York']
 [78013.11 121597.55 264346.06 'California']
 [94657.16 145077.58 282574.31 'New York']
 [91749.16 114175.79 294919.57 'Florida']
 [86419.7 153514.11 0.0 'New York']
 [76253.86 113867.3 298664.47 'California']
 [78389.47 153773.43 299737.29 'New York']
 [73994.56 122782.75 303319.26 'Florida']
 [67532

## Encoding categorical data

In [4]:
# One-hot encode the "State" column (index 3): New York, California, Florida

from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder


ct = ColumnTransformer(transformers=[("encoder",OneHotEncoder(),[3])], remainder="passthrough")


x = np.array(ct.fit_transform(x))

In [5]:
print(x)

[[0.0 0.0 1.0 165349.2 136897.8 471784.1]
 [1.0 0.0 0.0 162597.7 151377.59 443898.53]
 [0.0 1.0 0.0 153441.51 101145.55 407934.54]
 [0.0 0.0 1.0 144372.41 118671.85 383199.62]
 [0.0 1.0 0.0 142107.34 91391.77 366168.42]
 [0.0 0.0 1.0 131876.9 99814.71 362861.36]
 [1.0 0.0 0.0 134615.46 147198.87 127716.82]
 [0.0 1.0 0.0 130298.13 145530.06 323876.68]
 [0.0 0.0 1.0 120542.52 148718.95 311613.29]
 [1.0 0.0 0.0 123334.88 108679.17 304981.62]
 [0.0 1.0 0.0 101913.08 110594.11 229160.95]
 [1.0 0.0 0.0 100671.96 91790.61 249744.55]
 [0.0 1.0 0.0 93863.75 127320.38 249839.44]
 [1.0 0.0 0.0 91992.39 135495.07 252664.93]
 [0.0 1.0 0.0 119943.24 156547.42 256512.92]
 [0.0 0.0 1.0 114523.61 122616.84 261776.23]
 [1.0 0.0 0.0 78013.11 121597.55 264346.06]
 [0.0 0.0 1.0 94657.16 145077.58 282574.31]
 [0.0 1.0 0.0 91749.16 114175.79 294919.57]
 [0.0 0.0 1.0 86419.7 153514.11 0.0]
 [1.0 0.0 0.0 76253.86 113867.3 298664.47]
 [0.0 0.0 1.0 78389.47 153773.43 299737.29]
 [0.0 1.0 0.0 73994.56 122782.75 3

The encoded state now occupies the first three columns (indices 0 to 2 in Python), followed by the three numeric columns.

NY becomes 0.0 0.0 1.0

Cal becomes 1.0 0.0 0.0

Flo becomes 0.0 1.0 0.0 
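
The mapping above follows from the categories being sorted alphabetically (California, Florida, New York), so each state gets a 1.0 in the position of its name in that sorted list. Below is a hand-rolled sketch of the idea in plain Python; it is only an illustration, not scikit-learn's implementation, and the helper name `one_hot` is made up for this example.

```python
states = ["New York", "California", "Florida", "New York"]

# Sorted unique categories, matching the column order seen in the printed matrix
categories = sorted(set(states))  # ['California', 'Florida', 'New York']


def one_hot(state):
    """Return a list with 1.0 at the position of `state` and 0.0 elsewhere."""
    return [1.0 if state == category else 0.0 for category in categories]


for state in states:
    print(state, one_hot(state))
```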

## Splitting the dataset into the Training set and Test set

In [6]:
from sklearn.model_selection import train_test_split

x_train, x_test, y_train, y_test =  train_test_split(x,y,test_size=0.2,random_state=0)
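
With `test_size=0.2`, one fifth of the rows is held out for testing. Assuming the dataset has 50 rows (as the file name suggests, and consistent with the 10 test predictions printed later), that gives 40 training rows and 10 test rows. The sketch below shows the general idea of a shuffled hold-out split in plain Python. The helper `shuffled_split` is hypothetical and will not reproduce the exact rows that `train_test_split` selects.

```python
import math
import random


def shuffled_split(n_rows, test_size, seed):
    """Return (train_indices, test_indices) for a shuffled hold-out split."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)  # fixed seed makes the split repeatable
    n_test = math.ceil(n_rows * test_size)
    return indices[n_test:], indices[:n_test]


train_idx, test_idx = shuffled_split(n_rows=50, test_size=0.2, seed=0)
print(len(train_idx), len(test_idx))  # 40 10
```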

## Training the Multiple Linear Regression model on the Training set

In [7]:
from sklearn.linear_model import LinearRegression
regressor = LinearRegression()
regressor.fit(x_train, y_train)

# For background on what the fit method does, see https://intellipaat.com/community/10656/what-does-fit-method-in-scikit-learn-do

LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None, normalize=False)
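
Conceptually, the `fit` call above estimates an intercept and one coefficient per feature by ordinary least squares. As a rough sketch of that idea (on made-up data, using `np.linalg.lstsq` rather than scikit-learn), the example below recovers the coefficients of a known equation.

```python
import numpy as np

# Made-up data generated from y = 1 + 2*x1 + 3*x2, with no noise
X = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, 1.0]])
y = 1.0 + 2.0 * X[:, 0] + 3.0 * X[:, 1]

# Prepend a column of ones so the intercept is estimated with the coefficients
X_design = np.column_stack([np.ones(len(X)), X])

# Least-squares solution of X_design @ beta = y
beta, *_ = np.linalg.lstsq(X_design, y, rcond=None)
intercept, coefficients = beta[0], beta[1:]

print(np.round(intercept, 6), np.round(coefficients, 6))
```

Because the data is noise-free, the recovered values match the generating equation (intercept 1, coefficients 2 and 3) up to floating-point error.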

## Predicting the Test set results

In [13]:
# With 4 features we cannot plot the fit as we did for simple linear regression,
# so we print two vectors side by side instead:
# 1. the predicted profits for the test set (y_pred)
# 2. the actual profits in the test set (y_test)

# predict() is a method of the trained scikit-learn model.
# It takes one argument, the new data (here x_test),
# and returns the predicted value for each row.

y_pred = regressor.predict(x_test)
np.set_printoptions(precision=2) # Show 2 decimal places when printing arrays

# Print y_pred and y_test as two columns to compare them row by row
print(np.concatenate((y_pred.reshape(len(y_pred), 1), y_test.reshape(len(y_test), 1)), 1))
print("\nleft is the y_pred and right is y_test")


[[103015.2  103282.38]
 [132582.28 144259.4 ]
 [132447.74 146121.95]
 [ 71976.1   77798.83]
 [178537.48 191050.39]
 [116161.24 105008.31]
 [ 67851.69  81229.06]
 [ 98791.73  97483.56]
 [113969.44 110352.25]
 [167921.07 166187.94]]

left is the y_pred and right is y_test


Comparing y_pred (the predicted profit) with y_test (the actual profit), the model performs reasonably well: a few predictions are within about 2,000 of the actual value, while the largest miss is about 13,700.
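
To put a number on that comparison, one option is the mean absolute error over the ten test rows. The sketch below copies the two columns from the printed output and computes it in plain Python. The choice of metric is ours and is not part of the original notebook.

```python
# Values copied from the printed comparison above
y_pred = [103015.2, 132582.28, 132447.74, 71976.1, 178537.48,
          116161.24, 67851.69, 98791.73, 113969.44, 167921.07]
y_test = [103282.38, 144259.4, 146121.95, 77798.83, 191050.39,
          105008.31, 81229.06, 97483.56, 110352.25, 166187.94]

abs_errors = [abs(pred - actual) for pred, actual in zip(y_pred, y_test)]
mae = sum(abs_errors) / len(abs_errors)  # mean absolute error
worst = max(abs_errors)                  # largest single miss

print(round(mae, 2))    # 7514.29
print(round(worst, 2))  # 13674.21
```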

**Note:** Multiple linear regression does not need feature scaling. In the equation $\hat{y} = a_0 + a_1 x_1 + a_2 x_2 + \dots + a_n x_n$, each coefficient $a_i$ compensates for the scale of its feature $x_i$, so features measured on very different scales do not distort the predictions.

For more information, see video 58, "Multiple Linear Regression in Python - Step 2".
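
The note above can be checked numerically. The sketch below fits ordinary least squares twice on made-up data, once on the raw features and once with the first feature multiplied by 1000. The helpers `fit_ols` and `predict_ols` are hypothetical and use `np.linalg.lstsq` rather than scikit-learn.

```python
import numpy as np


def fit_ols(X, y):
    """Return [intercept, coef_1, ..., coef_n] from ordinary least squares."""
    X_design = np.column_stack([np.ones(len(X)), X])
    beta, *_ = np.linalg.lstsq(X_design, y, rcond=None)
    return beta


def predict_ols(beta, X):
    """Apply yhat = a0 + a1*x1 + ... + an*xn to every row of X."""
    return beta[0] + X @ beta[1:]


# Made-up data: two features and a noisy linear target
rng = np.random.default_rng(0)
X = rng.uniform(0.0, 1.0, size=(20, 2))
y = 5.0 + 2.0 * X[:, 0] - 3.0 * X[:, 1] + rng.normal(0.0, 0.1, size=20)

# Same data with the first feature expressed on a 1000x larger scale
X_rescaled = X.copy()
X_rescaled[:, 0] *= 1000.0

beta = fit_ols(X, y)
beta_rescaled = fit_ols(X_rescaled, y)

# The coefficient shrinks by the same factor, so the predictions are unchanged
print(np.isclose(beta_rescaled[1] * 1000.0, beta[1]))
print(np.allclose(predict_ols(beta, X), predict_ols(beta_rescaled, X_rescaled)))
```

Both checks print `True`: rescaling a feature only rescales its coefficient.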

## Making a single prediction (for example the profit of a startup with R&D Spend = 160000, Administration Spend = 130000, Marketing Spend = 300000 and State = 'California' or 1.0 0.0 0.0)

In [16]:
print("The Profit for R&D Spend = 160000, Administration Spend = 130000, Marketing Spend = 300000 and State = 'California' or 1.0 0.0 0.0 is: ")
print(regressor.predict([[1, 0, 0, 160000, 130000, 300000]]), "USD")

The Profit for R&D Spend = 160000, Administration Spend = 130000, Marketing Spend = 300000 and State = 'California' or 1.0 0.0 0.0 is: 
[181566.92] USD


Therefore, our model predicts that a Californian startup that spends 160,000 on R&D, 130,000 on Administration and 300,000 on Marketing makes a profit of about 181,566.92 USD.

**Important note 1:** Notice that the values of the features were all input in a double pair of square brackets. That's because the "predict" method always expects a 2D array as the format of its inputs. And putting our values into a double pair of square brackets makes the input exactly a 2D array. Simply put:

$1, 0, 0, 160000, 130000, 300000 \rightarrow \textrm{scalars}$

$[1, 0, 0, 160000, 130000, 300000] \rightarrow \textrm{1D array}$

$[[1, 0, 0, 160000, 130000, 300000]] \rightarrow \textrm{2D array}$
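
The difference between the 1D and 2D forms is visible in the array shapes. The short sketch below uses NumPy to show it, along with `reshape(1, -1)`, which turns a 1D array into a single-row 2D array.

```python
import numpy as np

one_d = np.array([1, 0, 0, 160000, 130000, 300000])    # single brackets
two_d = np.array([[1, 0, 0, 160000, 130000, 300000]])  # double brackets

print(one_d.shape)  # (6,)
print(two_d.shape)  # (1, 6)

# reshape(1, -1) converts the 1D array into one row with six columns
print(one_d.reshape(1, -1).shape)  # (1, 6)
```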

**Important note 2:** Notice also that the "California" state was not input as a string in the last column but as "1, 0, 0" in the first three columns. That's because the predict method expects the one-hot-encoded values of the state, and as we see in the second row of the matrix of features X, "California" was encoded as "1, 0, 0". Be careful to put these values in the first three columns, not the last three: the ColumnTransformer used earlier places the encoded columns first and appends the passthrough columns after them.

## Getting the final linear regression equation with the values of the coefficients

In [17]:
print(regressor.coef_)
print(regressor.intercept_)

[ 8.66e+01 -8.73e+02  7.86e+02  7.73e-01  3.29e-02  3.66e-02]
42467.52924853204


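From the output above, the fitted equation is:

Profit = 86.6 × Dummy State 1 - 873 × Dummy State 2 + 786 × Dummy State 3 + 0.773 × R&D Spend + 0.0329 × Administration + 0.0366 × Marketing Spend + 42467.53

There are three dummy variables because there are three states. Following the encoding shown earlier, Dummy State 1 is California, Dummy State 2 is Florida and Dummy State 3 is New York.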
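
As a quick check, the equation can be evaluated by hand for the Californian startup from the single prediction above. The sketch below uses the rounded coefficients as printed, so the result (about 181,491) differs slightly from the model's 181,566.92, which is computed with the unrounded values.

```python
# Rounded coefficients as printed: three state dummies, then R&D,
# Administration and Marketing Spend
coefficients = [86.6, -873.0, 786.0, 0.773, 0.0329, 0.0366]
intercept = 42467.53

# California (1, 0, 0) with R&D = 160000, Administration = 130000, Marketing = 300000
row = [1, 0, 0, 160000, 130000, 300000]

profit = intercept + sum(c * value for c, value in zip(coefficients, row))
print(round(profit, 2))  # 181491.13
```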