# Multiple Linear Regression: Application

In this notebook, we apply multiple linear regression in Python to a sample dataset with several independent variables in order to predict a company's profit.

Sources:
1. <a href='https://www.udemy.com/course/machinelearning/'>Machine Learning A-Z™: Hands-On Python & R In Data Science</a>

In [1]:
# Import machine learning support
from sklearn.model_selection import train_test_split
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder
from sklearn.linear_model import LinearRegression

# Import analytical libraries
import pandas as pd
import numpy as np

# Import other support
import os

## Load & Preview Data

In [2]:
# Define profit data file path
profit_data_file_path = os.path.join('Data', '50_Startups.csv')

# Load profit data
profits = pd.read_csv(profit_data_file_path)

In [3]:
# Preview data
print(profits.shape)
display(profits.head())
display(profits.describe().T)

(50, 5)


| | R&D Spend | Administration | Marketing Spend | State | Profit |
|---|---|---|---|---|---|
| 0 | 165349.2 | 136897.8 | 471784.1 | New York | 192261.83 |
| 1 | 162597.7 | 151377.59 | 443898.53 | California | 191792.06 |
| 2 | 153441.51 | 101145.55 | 407934.54 | Florida | 191050.39 |
| 3 | 144372.41 | 118671.85 | 383199.62 | New York | 182901.99 |
| 4 | 142107.34 | 91391.77 | 366168.42 | Florida | 166187.94 |


| | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| R&D Spend | 50.0 | 73721.6156 | 45902.256482 | 0.0 | 39936.37 | 73051.08 | 101602.8 | 165349.2 |
| Administration | 50.0 | 121344.6396 | 28017.802755 | 51283.14 | 103730.875 | 122699.795 | 144842.18 | 182645.56 |
| Marketing Spend | 50.0 | 211025.0978 | 122290.310726 | 0.0 | 129300.1325 | 212716.24 | 299469.085 | 471784.1 |
| Profit | 50.0 | 112012.6392 | 40306.180338 | 14681.4 | 90138.9025 | 107978.19 | 139765.9775 | 192261.83 |


## Prepare Data

In [4]:
# Define features and labels
X = profits.drop('Profit', axis = 1).values
y = profits['Profit'].values

Recall that because one of our features (State) is categorical, we must use one-hot encoding to convert it into numeric columns that a machine learning algorithm can work with.

In [5]:
# Create transformer object
transformer = ColumnTransformer(transformers = [('encoder'
                                  ,OneHotEncoder()
                                  ,[3])] # State in index position 3
                                ,remainder = 'passthrough'
                               )

# Fit transformer to data
transformer.fit(X)

# One-hot encode features
X = transformer.transform(X)

# Train-test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 0)

In [6]:
print(X_train)

[[0.0 1.0 0.0 55493.95 103057.49 214634.81]
 [0.0 0.0 1.0 46014.02 85047.44 205517.64]
 [0.0 1.0 0.0 75328.87 144135.98 134050.07]
 [1.0 0.0 0.0 46426.07 157693.92 210797.67]
 [0.0 1.0 0.0 91749.16 114175.79 294919.57]
 [0.0 1.0 0.0 130298.13 145530.06 323876.68]
 [0.0 1.0 0.0 119943.24 156547.42 256512.92]
 [0.0 0.0 1.0 1000.23 124153.04 1903.93]
 [0.0 0.0 1.0 542.05 51743.15 0.0]
 [0.0 0.0 1.0 65605.48 153032.06 107138.38]
 [0.0 0.0 1.0 114523.61 122616.84 261776.23]
 [0.0 1.0 0.0 61994.48 115641.28 91131.24]
 [1.0 0.0 0.0 63408.86 129219.61 46085.25]
 [1.0 0.0 0.0 78013.11 121597.55 264346.06]
 [1.0 0.0 0.0 23640.93 96189.63 148001.11]
 [1.0 0.0 0.0 76253.86 113867.3 298664.47]
 [0.0 0.0 1.0 15505.73 127382.3 35534.17]
 [0.0 0.0 1.0 120542.52 148718.95 311613.29]
 [1.0 0.0 0.0 91992.39 135495.07 252664.93]
 [1.0 0.0 0.0 64664.71 139553.16 137962.62]
 [0.0 0.0 1.0 131876.9 99814.71 362861.36]
 [0.0 0.0 1.0 94657.16 145077.58 282574.31]
 [1.0 0.0 0.0 28754.33 118546.05 172795.67]
 [1.

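In multiple linear regression fitted by ordinary least squares, feature scaling is not required: rescaling a feature simply rescales its coefficient, so the coefficients compensate for variables of different magnitudes and the predictions are unchanged.

As an exercise, try fitting the model with and without scaling and compare the results.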
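
The claim that scaling does not change an ordinary least squares fit can be checked on a small example. The sketch below uses synthetic data (not the startup dataset) and a hypothetical helper, `fit_predict`, built on `np.linalg.lstsq`; it standardizes the features and compares the predictions with and without scaling.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic features on very different scales (hypothetical data)
n = 40
X = np.column_stack([
    rng.uniform(0, 150_000, n),   # large-magnitude feature
    rng.uniform(0, 1, n),         # small-magnitude feature
])
y = 5_000 + 0.8 * X[:, 0] + 2_000 * X[:, 1] + rng.normal(0, 100, n)

def fit_predict(X_fit, y_fit, X_new):
    """Ordinary least squares with an explicit intercept column."""
    A = np.column_stack([np.ones(len(X_fit)), X_fit])
    coef, *_ = np.linalg.lstsq(A, y_fit, rcond=None)
    predictions = np.column_stack([np.ones(len(X_new)), X_new]) @ coef
    return predictions, coef

# Standardize each column to zero mean and unit variance
mean, std = X.mean(axis=0), X.std(axis=0)
X_scaled = (X - mean) / std

pred_raw, coef_raw = fit_predict(X, y, X)
pred_scaled, coef_scaled = fit_predict(X_scaled, y, X_scaled)

# The predictions agree; only the coefficients change
print(np.allclose(pred_raw, pred_scaled))
print(coef_raw[1:], coef_scaled[1:])
```

In this sketch each slope fitted on the scaled data equals the raw slope multiplied by the standard deviation of its feature, which is exactly the compensation described above.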

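Before we proceed, notice that our one-hot encoding produced three binary columns (0/1 for California, Florida, and New York). Together with the intercept, these columns are perfectly collinear, which is known as the dummy variable trap, so in theory we would omit one of them. In practice, the LinearRegression class we use still fits without error, because its least-squares solver returns a valid solution even when columns are collinear, and the predictions are the same either way. For the purpose of prediction, <b>we do not need to take extra steps to remove one variable on our own</b>. The trade-off is that the individual dummy coefficients are then not uniquely determined, so they should be interpreted with care; passing drop='first' to OneHotEncoder removes one column if interpretable coefficients are needed.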
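
To illustrate why keeping all three dummy columns does not affect predictions, here is a minimal sketch on hypothetical toy data (six made-up companies, not the startup dataset). It solves the least-squares problem directly with `np.linalg.lstsq`, which is an assumption made for illustration and not necessarily how scikit-learn is implemented internally.

```python
import numpy as np

# Hypothetical toy data: 6 companies in 3 states, one numeric feature
states = np.array([0, 1, 2, 0, 1, 2])   # 0=California, 1=Florida, 2=New York
spend = np.array([10.0, 20.0, 30.0, 40.0, 50.0, 65.0])
profit = np.array([15.0, 22.0, 41.0, 44.0, 53.0, 75.0])

dummies = np.eye(3)[states]             # one-hot encode the state
intercept = np.ones((6, 1))

# All three dummies plus an intercept: the columns are linearly dependent
A_full = np.hstack([intercept, dummies, spend[:, None]])
# Dropping the first dummy removes the dependency
A_drop = np.hstack([intercept, dummies[:, 1:], spend[:, None]])

print(A_full.shape[1], np.linalg.matrix_rank(A_full))   # 5 columns, rank 4
print(A_drop.shape[1], np.linalg.matrix_rank(A_drop))   # 4 columns, rank 4

# lstsq returns the minimum-norm solution for the rank-deficient system
coef_full, *_ = np.linalg.lstsq(A_full, profit, rcond=None)
coef_drop, *_ = np.linalg.lstsq(A_drop, profit, rcond=None)

# The fitted values are identical even though the coefficients differ
print(np.allclose(A_full @ coef_full, A_drop @ coef_drop))
```

Both design matrices span the same column space, so the fitted values coincide; only the way the fit is split across the intercept and the dummy coefficients changes.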

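Note, however, that this class does not determine which variables are statistically significant: it estimates a coefficient for every feature it is given and does not report p-values. In this notebook we simply keep all of the features and do not apply a separate method to select statistically significant variables.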
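
If a rough check of significance is wanted, the usual t-statistics can be computed by hand. The sketch below is an illustration on synthetic data (the variables `x1` and `x2` are hypothetical, with `x2` deliberately unrelated to the target), using the textbook ordinary least squares formulas and NumPy only.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical data: x1 drives y, x2 is pure noise
n = 200
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
y = 3.0 + 2.0 * x1 + rng.normal(scale=1.0, size=n)

A = np.column_stack([np.ones(n), x1, x2])     # design matrix with intercept
coef, *_ = np.linalg.lstsq(A, y, rcond=None)

residuals = y - A @ coef
dof = n - A.shape[1]                          # residual degrees of freedom
sigma2 = residuals @ residuals / dof          # estimate of the noise variance
cov = sigma2 * np.linalg.inv(A.T @ A)         # covariance of the coefficients
std_err = np.sqrt(np.diag(cov))
t_stats = coef / std_err

for name, b, t in zip(["intercept", "x1", "x2"], coef, t_stats):
    print(f"{name:>9}: coef={b: .3f}  t={t: .2f}")
```

As a rule of thumb, an absolute t-statistic well above 2 suggests that a coefficient differs from zero at roughly the 5% level when the sample is reasonably large.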

As we will soon see, all we need to do is call the LinearRegression class and fit it to our training data.

## Prepare Regressor

In [7]:
# Initialize regressor object
regressor = LinearRegression()

# Fit regressor to data
regressor.fit(X_train, y_train)

# Score regressor on the test set (returns R^2, the coefficient of determination)
print(regressor.score(X_test, y_test))

0.9347068473282303


## Evaluate Regressor

We cannot create a simple two-dimensional plot of our regressor's results the way we did with simple linear regression, since we now have multiple features and therefore multiple dimensions. Instead, we will examine a vector of predicted profits next to the actual profits.

In [8]:
# Predict test labels
y_predicted = regressor.predict(X_test)

# Display 2 decimal places
np.set_printoptions(precision=2)

In [9]:
# Define test set length
vector_length = len(y_predicted)

# Reshape predicted values & labels to single column
y_predicted = y_predicted.reshape(vector_length, 1)
y_test = y_test.reshape(vector_length, 1)

# Concatenate predicted values with actual values
print(np.concatenate((y_predicted, y_test), axis=1))
print('\nFigure 1.')

[[103015.2  103282.38]
 [132582.28 144259.4 ]
 [132447.74 146121.95]
 [ 71976.1   77798.83]
 [178537.48 191050.39]
 [116161.24 105008.31]
 [ 67851.69  81229.06]
 [ 98791.73  97483.56]
 [113969.44 110352.25]
 [167921.07 166187.94]]

Figure 1.


Examining Figure 1, we see that our first predicted profit is \$103,015, while the actual profit is \$103,282, a close prediction. For the sixth company, however, we predicted a profit of \$116,161, while the actual profit was \$105,008, a noticeably larger error. As with simple linear regression, multiple linear regression works best when the relationship between the features and the target is approximately linear. Some predictions will be accurate, while others may be off, for example because of outliers.
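
The row-by-row comparison in Figure 1 can also be summarized with error metrics. The following is a minimal sketch, assuming the ten pairs of values are copied by hand from Figure 1; it computes $R^2$ and the mean absolute error with NumPy. Up to rounding of the printed values, the $R^2$ should reproduce the test score of about 0.93 reported earlier.

```python
import numpy as np

# Predicted and actual profits copied from Figure 1
y_predicted = np.array([103015.20, 132582.28, 132447.74, 71976.10, 178537.48,
                        116161.24, 67851.69, 98791.73, 113969.44, 167921.07])
y_actual = np.array([103282.38, 144259.40, 146121.95, 77798.83, 191050.39,
                     105008.31, 81229.06, 97483.56, 110352.25, 166187.94])

errors = y_actual - y_predicted

# R^2 = 1 - SS_res / SS_tot
ss_res = np.sum(errors ** 2)
ss_tot = np.sum((y_actual - y_actual.mean()) ** 2)
r_squared = 1 - ss_res / ss_tot

# Mean absolute error, in the same units as profit
mae = np.mean(np.abs(errors))

print(f"R^2: {r_squared:.4f}")
print(f"MAE: {mae:,.2f}")
```

The mean absolute error is useful alongside $R^2$ because it is expressed in the same units as profit, which makes the typical size of a miss easier to judge.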

## Making A Prediction

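Now that we have trained our regressor, we can use it to make predictions. Suppose we have a startup based in Florida that spends 260,000 on R&D, 240,000 on administration, and 8,000 on marketing. Note that the R&D and administration figures exceed the largest values in our dataset (about 165,349 and 182,646, respectively), so this prediction is an extrapolation and should be treated with caution.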

In [10]:
# Define sample features
features = [[0, 1, 0, 260000, 240000, 8000]]

# Predict sample profit
regressor.predict(features)

array([250881.54])
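
The leading `[0, 1, 0]` in the sample features relies on the order of the dummy columns. As a quick check, the sketch below fits a fresh `OneHotEncoder` on just the three state names (a stand-in for the State column, not the fitted `transformer` from earlier) and inspects its `categories_` attribute, which holds the learned categories in sorted order.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# The three states that appear in the dataset, in arbitrary order
states = np.array([['New York'], ['California'], ['Florida']])

encoder = OneHotEncoder()
encoder.fit(states)

# Learned categories are sorted, which fixes the order of the dummy columns
print(encoder.categories_[0])

# Encode a Florida startup; toarray() converts the sparse result to a dense array
florida = encoder.transform(np.array([['Florida']])).toarray()
print(florida)
```

With the categories ordered as California, Florida, New York, a Florida startup is encoded as `[0, 1, 0]`, matching the sample features used above.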

We may also want to know the actual equation that produces these results. We can determine this by getting the coefficients and the $y$-intercept, as follows:

In [11]:
# Get y-intercept and coefficients
print(regressor.intercept_)
print(regressor.coef_)

42467.52924854249
[ 8.66e+01 -8.73e+02  7.86e+02  7.73e-01  3.29e-02  3.66e-02]


Therefore, our multiple linear regression equation would be:

$$
  y = 42{,}467.53 + 86.6 D_1 - 873 D_2 + 786 D_3 + 0.773 x_1 + 0.0329 x_2 + 0.0366 x_3
$$

where $D_1$ is the dummy variable "Is California," $D_2$ is the dummy variable "Is Florida," $D_3$ is the dummy variable "Is New York," $x_1$ is the R&D Spend, $x_2$ is the Administrative Costs, and $x_3$ is the Marketing Spend.
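
As a sanity check, the equation can be evaluated by hand for the Florida startup from the previous section. This sketch assumes the intercept and the coefficients exactly as printed above; because the printed coefficients are rounded to three significant figures, the result only approximates the output of `regressor.predict`.

```python
import numpy as np

# Intercept and coefficients as printed above (coefficients are rounded)
intercept = 42467.52924854249
coefficients = np.array([86.6, -873.0, 786.0, 0.773, 0.0329, 0.0366])

# Florida startup: D1=0, D2=1, D3=0, then R&D, administration, marketing spend
features = np.array([0, 1, 0, 260000, 240000, 8000])

predicted_profit = intercept + coefficients @ features
print(round(predicted_profit, 2))
```

The result is roughly 250,763, which is within about 0.05% of the 250,881.54 returned by the regressor; the small gap comes from the rounding of the printed coefficients.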