<a href="https://colab.research.google.com/github/JosephWildey/MachineLearningModels/blob/main/Multiple_Linear_Regression.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Multiple Linear Regression

While simple linear regression uses a single input to explain an output, a multiple linear regression model takes in several inputs to explain a single output.

## Problem Statement

An investor wants to know how several factors impact a business's profits: the state in which it does business, how much it spends on R&D, how much it spends on its administration, and how much it spends on marketing.

## Index
1. [Data Preprocessing](#Data-Preprocessing)
2. [Model Training](#Training-the-Multiple-Linear-Regression-Model)
3. [Predictions and Final Equation](#Predictions-and-final-equation)
4. [Conclusion](#Conclusion)

## Data Preprocessing

### Import Libraries

In [7]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

### Import the Dataset

This dataset contains financial performance data for 50 startups. It is loaded from the `50_Startups.csv` file in the `Data` folder.

In [20]:
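# Path is relative to the notebook; assumes the "Data" folder sits next to it
dataset = pd.read_csv("Data/50_Startups.csv")
x = dataset.iloc[:, :-1].values  # features: every column except the last
y = dataset.iloc[:, -1].values   # target: the last column (profit)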

### Encoding Categorical Data

Keep in mind that after encoding, the state data moves from the final column of `x` to the first columns. The encoding method is one-hot encoding, which replaces each string category with a set of binary (0 or 1) columns, one per category.

In [12]:
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder
ct = ColumnTransformer(transformers=[('encoder', OneHotEncoder(), [-1])], remainder='passthrough')
x = np.array(ct.fit_transform(x))
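
As a rough illustration of the column reordering described above, the sketch below applies the same kind of transformer to three made-up rows. The row values and the `toy_` names are hypothetical and are not taken from the dataset.

```python
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# Hypothetical toy rows in the same layout as x: [R&D, Admin, Marketing, State]
toy = np.array([
    [100.0, 50.0, 30.0, "New York"],
    [200.0, 60.0, 40.0, "California"],
    [150.0, 55.0, 35.0, "Florida"],
], dtype=object)

# One-hot encode the state column (index 3) and pass the other columns through
toy_ct = ColumnTransformer(
    transformers=[("encoder", OneHotEncoder(), [3])],
    remainder="passthrough",
)
toy_encoded = toy_ct.fit_transform(toy)

# Convert to a dense array in case the transformer returned a sparse matrix
if hasattr(toy_encoded, "toarray"):
    toy_encoded = toy_encoded.toarray()
toy_encoded = np.array(toy_encoded)

print(toy_encoded)
```

The encoder sorts categories alphabetically, so the three dummy columns correspond to California, Florida, and New York, in that order, followed by the untouched numeric columns.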

### Splitting the dataset 

In [13]:
from sklearn.model_selection import train_test_split
X_train, X_test, Y_train, Y_test = train_test_split(x, y, test_size = 0.2, random_state = 1)

## Training the Multiple Linear Regression Model

In [14]:
from sklearn.linear_model import LinearRegression
regressor = LinearRegression()
regressor.fit(X_train, Y_train)

LinearRegression()

## Predictions and final equation

### Predicting the Test Set Results

Important caveat: this dataset has several features, not just one. Up to this point, everything has followed the same methodology as with a single feature.

Instead of attempting to graph a 5-D plot, we will display two vectors. The first vector holds the predicted profits for the test set. The second holds the actual profits.

The step to get the first vector is the same as to get the prediction in Simple Linear Regression.

The final step is to display the predicted and actual profit vectors side by side using NumPy. NumPy has many functions for handling arrays, making it a natural choice here. The function we need is `concatenate`, which joins two or more vectors or arrays either vertically or horizontally.

Combining and displaying the two vectors starts with `print`, so the output appears on screen. Inside `print` we call NumPy's `concatenate` function. The vectors must have matching shapes along every axis except the one being joined (e.g., 3x1 and 3x1, or 1x3 and 1x3). They must also be wrapped in parentheses because `concatenate` expects a tuple of arrays.

`reshape` is a method of NumPy arrays that changes an array's shape, here turning each 1-D vector into a column. We pass `len(y_pred)` as the number of rows and 1 as the number of columns. The same idea is then applied to the second vector, `Y_test`.

The second argument to `concatenate` is the axis, which for 2-D arrays is 0 or 1. 0 means a vertical concatenation (stacking rows), and 1 means a horizontal concatenation (placing columns side by side).
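
To make the reshape and concatenate steps described above concrete, here is a minimal sketch on two made-up three-element vectors. The values and variable names are hypothetical and stand in for `y_pred` and `Y_test`.

```python
import numpy as np

# Hypothetical stand-ins for the predicted and actual profit vectors
pred = np.array([10.0, 20.0, 30.0])
actual = np.array([12.0, 18.0, 33.0])

# Turn each 1-D vector of length 3 into a 3x1 column
pred_col = pred.reshape(len(pred), 1)
actual_col = actual.reshape(len(actual), 1)

# axis=1 places the columns side by side; axis=0 stacks them vertically
side_by_side = np.concatenate((pred_col, actual_col), axis=1)
stacked = np.concatenate((pred_col, actual_col), axis=0)

print(side_by_side.shape)  # (3, 2)
print(stacked.shape)       # (6, 1)
```

The side-by-side form is the one used in this notebook, since each row then pairs a prediction with its actual value.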

In [15]:
y_pred = regressor.predict(X_test)
np.set_printoptions(precision=2)
print(np.concatenate((y_pred.reshape(len(y_pred), 1), Y_test.reshape(len(Y_test), 1)), 1))

[[114664.42 105008.31]
 [ 90593.16  96479.51]
 [ 75692.84  78239.91]
 [ 70221.89  81229.06]
 [179790.26 191050.39]
 [171576.92 182901.99]
 [ 49753.59  35673.41]
 [102276.66 101004.64]
 [ 58649.38  49490.75]
 [ 98272.03  97483.56]]


### Making a Single Prediction

In [16]:
print(regressor.predict([[1, 0, 0, 160000, 130000, 300000]]))

[180892.25]


According to our model, a Californian startup that spends \$160,000 on R&D, \$130,000 on administration, and \$300,000 on marketing is predicted to earn a profit of \$180,892.25.

We use a double array because the predict function always expects a 2d array. 

The dummy variables are the first three values, not the last three, because one-hot encoding moved the state columns to the front.
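
The 2-D input requirement can be sketched with a small toy model. The data below is hypothetical, generated from y = 1 + 2a + 3b, and is unrelated to the startup regressor.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical toy data generated from y = 1 + 2*a + 3*b (not the startup data)
toy_X = np.array([[0, 0], [1, 0], [0, 1], [1, 1], [2, 1]], dtype=float)
toy_y = 1 + 2 * toy_X[:, 0] + 3 * toy_X[:, 1]

toy_model = LinearRegression().fit(toy_X, toy_y)

# A single sample is still passed as a 2-D array: one row, two feature columns
single_as_list = toy_model.predict([[1, 2]])

# Equivalent: reshape a 1-D vector into shape (1, n_features)
single_reshaped = toy_model.predict(np.array([1.0, 2.0]).reshape(1, -1))

print(single_as_list, single_reshaped)
```

Both calls pass one row with two columns, so both return a one-element array holding the prediction for that single sample.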

### Getting the final linear regression equation with the values of the coefficients

In [17]:
print(regressor.coef_)
print(regressor.intercept_)

[-2.85e+02  2.98e+02 -1.24e+01  7.74e-01 -9.44e-03  2.89e-02]
49834.885073226884


Profit = -285 x Dummy State 1 + 298 x Dummy State 2 - 12.4 x Dummy State 3 + 0.774 x R&D Spend - 0.00944 x Admin + 0.0289 x Marketing + 49834.89
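
As a sanity check on the equation, the sketch below evaluates it by hand for the single prediction made earlier. It assumes the rounded coefficients from the printed output, so the variable names and the rounding are illustrative only.

```python
import numpy as np

# Rounded coefficients and intercept copied from the printed output above
coef = np.array([-285.0, 298.0, -12.4, 0.774, -0.00944, 0.0289])
intercept = 49834.885

# Feature order: three state dummies, then R&D, administration, marketing
features = np.array([1, 0, 0, 160000, 130000, 300000], dtype=float)

# Dot product of coefficients and features, plus the intercept
manual_profit = coef @ features + intercept
print(round(manual_profit, 2))
```

Because the printed coefficients are rounded to three significant figures, the result lands near, but not exactly on, the 180892.25 returned by `regressor.predict`.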

## Conclusion

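Judging by the coefficients, R&D spend has the strongest per-dollar association with profit (about \$0.77 of profit per additional dollar), followed by marketing spend (about \$0.03 per dollar), while administration spend has a slightly negative coefficient. Among the state dummies, Dummy State 2 has the largest positive coefficient, but it shifts predicted profit by only about \$300.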