# Multiple Linear Regression

## Dataset and Problem Statement
In this notebook I look at multiple linear regression, sometimes loosely called multivariate linear regression. I do this using a dataset of 50 businesses, with data on:
- R&D Spend (Feature)
- Admin Costs (Feature)
- Marketing Spend (Feature)
- State (Feature)
- Profit (Our Target/Label)


With this we want to work out which sort of companies would be best to invest in, and create a model that we can use to predict the profit of future companies by inputting data for the four features.

In other words, we want to use the independent variables to predict the dependent variable.

Unlike simple (single-variable) linear regression, this dataset has four features, so we have to use a slightly different formula:

$$\hat{y} = b_1 x_1 + b_2 x_2 + b_3 x_3 + b_4 x_4 + b_0$$

Here each x value corresponds to one feature value, each of b1 to b4 is the weight applied to that feature, and b0 is the intercept (the prediction when every feature is zero).
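To make the formula concrete, here is a minimal sketch that evaluates it for one company. The weights, intercept, and feature values are made up purely for illustration (they are not fitted to the dataset), and the fourth feature is assumed to be already encoded as a number.

```python
import numpy as np

# Hypothetical weights b1..b4 and intercept b0, chosen only for illustration
b = np.array([0.8, -0.03, 0.04, 500.0])
b0 = 42000.0

# One made-up company: four numeric feature values x1..x4
x = np.array([100000.0, 120000.0, 250000.0, 1.0])

# y_hat = b1*x1 + b2*x2 + b3*x3 + b4*x4 + b0
y_hat = float(np.dot(b, x) + b0)
print(y_hat)
```

The dot product is just the sum of each weight times its feature, so adding more features only makes the two arrays longer.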

## Assumptions of Linear Regression

1. First, you want to make sure there is a linear relationship between the independent variables (X) and the dependent variable (y).

2. You want equal variance (homoscedasticity). If the variance is increasing or decreasing for different values of the independent variable, the model will start to perform badly as it tends towards those values.

3. Multivariate normality. You want the residuals (the errors between predicted and actual values) to be approximately normally distributed across the whole dataset.

4. No multicollinearity. We don't want any of the independent variables to be strongly influenced by other independent variables, as these features are required to be independent. This is usually easy to notice, as you will see patterns appearing when you plot one feature against another. (Autocorrelation is a separate check on the residuals: the errors should not be correlated with one another.)

5. It is also worth checking for outliers at this point.
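Some of these checks can be roughed out in a few lines. The sketch below is only an illustration under stated assumptions: it uses synthetic data rather than the startups dataset, fits the line with `np.linalg.lstsq` instead of scikit-learn, and relies on rule-of-thumb comparisons rather than formal statistical tests.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in data (NOT the startups dataset): 200 rows, 3 numeric features
X = rng.normal(size=(200, 3))
y = 2.0 * X[:, 0] - 1.0 * X[:, 1] + 0.5 * X[:, 2] + rng.normal(scale=0.3, size=200)

# Fit ordinary least squares with an explicit intercept column
A = np.column_stack([np.ones(len(X)), X])
coef, *_ = np.linalg.lstsq(A, y, rcond=None)
fitted = A @ coef
residuals = y - fitted

# Correlation between features: off-diagonal values near 0 suggest independence
feature_corr = np.corrcoef(X, rowvar=False)

# Equal variance: compare residual spread for low vs high fitted values
order = np.argsort(fitted)
low_std = residuals[order[:100]].std()
high_std = residuals[order[100:]].std()
spread_ratio = high_std / low_std  # close to 1 suggests roughly equal variance

# Outliers: count residuals more than 3 standard deviations from zero
n_outliers = int(np.sum(np.abs(residuals) > 3 * residuals.std()))

print(np.round(feature_corr, 2))
print(spread_ratio, n_outliers)
```

In practice you would usually plot the residuals against the fitted values as well, since patterns are easier to spot by eye than from a single ratio.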

## Importing the Libraries

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder
from sklearn.linear_model import LinearRegression

## Importing the Data

In [2]:
df = pd.read_csv('50_startups.csv')

In [3]:
# Features: every column except the last; target: the last column (Profit)
X = df.iloc[:, :-1].values
y = df.iloc[:, -1].values

## Encoding the Categorical Data

In [4]:
# One-hot encode the State column (index 3); pass the numeric columns through unchanged
myColumnTransformer = ColumnTransformer(transformers = [('encoder', OneHotEncoder(), [3])], remainder = 'passthrough')
X = np.array(myColumnTransformer.fit_transform(X))
pd.DataFrame(X).head()

|   | 0 | 1 | 2 | 3 | 4 | 5 |
|---|---|---|---|---|---|---|
| 0 | 0.0 | 0.0 | 1.0 | 165349.2 | 136897.8 | 471784.1 |
| 1 | 1.0 | 0.0 | 0.0 | 162597.7 | 151377.59 | 443898.53 |
| 2 | 0.0 | 1.0 | 0.0 | 153441.51 | 101145.55 | 407934.54 |
| 3 | 0.0 | 0.0 | 1.0 | 144372.41 | 118671.85 | 383199.62 |
| 4 | 0.0 | 1.0 | 0.0 | 142107.34 | 91391.77 | 366168.42 |
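In the output above, the one-hot columns appear first and the passed-through numeric columns follow. The toy sketch below shows the same behaviour on a tiny made-up array. The labels "A", "B" and "C" are placeholders rather than the dataset's real state names, and `sparse_threshold=0` is set here only so the result is always a dense array.

```python
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# Toy rows: [spend, state]; the labels "A", "B", "C" are placeholders
rows = np.array([[10.0, "B"], [20.0, "A"], [30.0, "C"], [40.0, "A"]], dtype=object)

ct = ColumnTransformer(
    transformers=[("encoder", OneHotEncoder(), [1])],
    remainder="passthrough",
    sparse_threshold=0,  # always return a dense array
)
encoded = np.array(ct.fit_transform(rows), dtype=float)

# Categories are sorted, and each one gets its own 0/1 column
categories = ct.named_transformers_["encoder"].categories_[0]
print(categories)
print(encoded)
```

Each row has exactly one 1.0 among the encoded columns, marking which category that row belongs to, and the original numeric column moves to the end.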


## Train/Test Split the Data

In [5]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 0)

## Training a Multiple Linear Regression Model

In [6]:
myRegressor = LinearRegression()
myRegressor.fit(X_train, y_train)

## Predicting Profit in the Test Set

In [7]:
yHat = myRegressor.predict(X_test)

In [13]:
np.set_printoptions(precision=2)
# Left column: predicted profit (yHat); right column: actual profit (y_test)
print(np.concatenate((yHat.reshape(len(yHat),1), y_test.reshape(len(y_test), 1)), axis = 1))

[[103015.2  103282.38]
 [132582.28 144259.4 ]
 [132447.74 146121.95]
 [ 71976.1   77798.83]
 [178537.48 191050.39]
 [116161.24 105008.31]
 [ 67851.69  81229.06]
 [ 98791.73  97483.56]
 [113969.44 110352.25]
 [167921.07 166187.94]]
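Comparing the two columns by eye only goes so far, so it can help to summarise them with a couple of numbers. The sketch below copies the printed pairs (predictions from `yHat` first, then `y_test`, following the concatenation order in the cell) and computes the mean absolute error and R² by hand. The values are the rounded ones shown above, so the results are approximate.

```python
import numpy as np

# Values copied from the printed output above: [predicted, actual]
pairs = np.array([
    [103015.20, 103282.38],
    [132582.28, 144259.40],
    [132447.74, 146121.95],
    [71976.10, 77798.83],
    [178537.48, 191050.39],
    [116161.24, 105008.31],
    [67851.69, 81229.06],
    [98791.73, 97483.56],
    [113969.44, 110352.25],
    [167921.07, 166187.94],
])
y_pred, y_true = pairs[:, 0], pairs[:, 1]

# Mean absolute error: average size of the miss, in the same units as profit
mae = np.mean(np.abs(y_pred - y_true))

# R^2: 1 minus the ratio of squared prediction error to the
# squared spread of the actual values around their mean
ss_res = np.sum((y_true - y_pred) ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2 = 1.0 - ss_res / ss_tot

print(f"MAE: {mae:.2f}")
print(f"R^2: {r2:.4f}")
```

With only 10 test rows these figures are a rough guide rather than a reliable estimate of how the model would perform on new companies.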
