# Multiple Linear Regression
The business challenge: we have 50 companies in total, along with extracts from their income statements (R&D Spend, Administration, Marketing Spend, State, Profit).

Create a model that predicts profit from these factors, so that a venture capital firm can identify the type of company it should look for: for example, companies that spend more on R&D than on marketing, or companies in New York rather than Florida. The goal is to maximize the VC firm's profit.

Multiple regression intuition: we use it when we want several explanatory variables to account for the response variable.

y = b + m1x1 + m2x2 + m3x3
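
To make the equation concrete, here is a minimal sketch with made-up numbers. The intercept `b`, the slopes `m`, and the inputs `x` are illustrative assumptions, not values fitted from the dataset. It evaluates the sum term by term and then as a dot product, which is how the prediction is usually computed in practice.

```python
import numpy as np

# Hypothetical intercept and slopes, chosen only for illustration
b = 50000.0
m = np.array([0.8, -0.03, 0.03])

# One hypothetical company with three explanatory variables x1, x2, x3
x = np.array([100000.0, 120000.0, 200000.0])

# y = b + m1*x1 + m2*x2 + m3*x3, written out term by term
y_terms = b + m[0] * x[0] + m[1] * x[1] + m[2] * x[2]

# The same prediction expressed as a dot product
y_dot = b + m @ x

print(y_terms, y_dot)
```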

## Importing the libraries

In [3]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

## Importing the dataset

In [25]:
dataset = pd.read_csv('50_Startups.csv')
X = dataset.iloc[:, :-1].values
y = dataset.iloc[:, -1].values


## Checking for correlations

In [13]:
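#Correlate only the numeric columns; the State column holds strings and cannot be correlated directly
dataset.select_dtypes(include='number').corr()['Profit'].sort_values(ascending=False)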

Profit             1.000000
R&D Spend          0.972900
Marketing Spend    0.747766
Administration     0.200717
Name: Profit, dtype: float64

R&D Spend and Marketing Spend are the most strongly correlated with Profit and look the most useful. Administration is only weakly correlated, but we still include it among our features for now.

Ideally, we don't want to include ALL the variables. Choosing only the most relevant predictors of the response variable is known as feature selection.
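
As a rough illustration of feature selection, the sketch below uses scikit-learn's `SelectKBest` with the `f_regression` score to keep the two features most strongly related to the response. This is one possible technique, not necessarily the one used in this notebook, and the data is synthetic (the names `X_demo` and `y_demo` are placeholders, not the 50_Startups values).

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_regression

rng = np.random.default_rng(0)
n = 50

# Synthetic stand-in data with three candidate features
X_demo = rng.normal(size=(n, 3))

# The response is built from columns 0 and 2 only; column 1 is pure noise
y_demo = 5.0 * X_demo[:, 0] + 3.0 * X_demo[:, 2] + rng.normal(scale=0.5, size=n)

# Keep the k=2 features with the highest univariate F-scores
selector = SelectKBest(score_func=f_regression, k=2)
X_selected = selector.fit_transform(X_demo, y_demo)

print(selector.scores_)         # one F-score per original column
print(selector.get_support())   # boolean mask of the columns that were kept
print(X_selected.shape)
```

Univariate scores look at each feature in isolation, so they can miss features that only matter in combination with others.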

It is also fine if this model does not perform well, because there are many other models we can try. For this quick exploration we do not formally check the linear regression assumptions; we choose a model as a hypothesis to test, not as a definitive solution.

## Encoding categorical data and Imputing Missing Data

In [26]:
#Impute missing values in the three numeric columns (R&D Spend, Administration, Marketing Spend) with the column mean
from sklearn.impute import SimpleImputer
imr = SimpleImputer(missing_values=np.nan, strategy='mean')
imr = imr.fit(X[:,:3])
X[:,:3] = imr.transform(X[:,:3])

#One-hot encode the State column
#Remember the ColumnTransformer and OneHotEncoder from Part 1
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder
ct = ColumnTransformer(transformers=[('encoder', OneHotEncoder(), [3])], remainder='passthrough')  #index 3 is the fourth column (Python is zero-indexed)
X = np.array(ct.fit_transform(X))
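
To see what the encoder does to the column layout, here is a small sketch on made-up rows (`X_toy` and its values are illustrative assumptions). The one-hot columns are placed first, with categories in alphabetical order, and the passthrough columns follow. `sparse_threshold=0` is set here only to force a dense array for printing.

```python
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# Toy rows of [spend, state]; the values are made up for illustration
X_toy = np.array([
    [10.0, "New York"],
    [20.0, "California"],
    [30.0, "Florida"],
], dtype=object)

# Encode column 1 and pass the remaining column through unchanged
ct_toy = ColumnTransformer(
    transformers=[("encoder", OneHotEncoder(), [1])],
    remainder="passthrough",
    sparse_threshold=0,  # always return a dense array
)
X_toy_enc = ct_toy.fit_transform(X_toy)

# Categories are sorted alphabetically: California, Florida, New York
print(ct_toy.named_transformers_["encoder"].categories_)
print(X_toy_enc)
```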




In [27]:
print(X)

[[0.0 0.0 1.0 165349.2 136897.8 471784.1]
 [1.0 0.0 0.0 162597.7 151377.59 443898.53]
 [0.0 1.0 0.0 153441.51 101145.55 407934.54]
 [0.0 0.0 1.0 144372.41 118671.85 383199.62]
 [0.0 1.0 0.0 142107.34 91391.77 366168.42]
 [0.0 0.0 1.0 131876.9 99814.71 362861.36]
 [1.0 0.0 0.0 134615.46 147198.87 127716.82]
 [0.0 1.0 0.0 130298.13 145530.06 323876.68]
 [0.0 0.0 1.0 120542.52 148718.95 311613.29]
 [1.0 0.0 0.0 123334.88 108679.17 304981.62]
 [0.0 1.0 0.0 101913.08 110594.11 229160.95]
 [1.0 0.0 0.0 100671.96 91790.61 249744.55]
 [0.0 1.0 0.0 93863.75 127320.38 249839.44]
 [1.0 0.0 0.0 91992.39 135495.07 252664.93]
 [0.0 1.0 0.0 119943.24 156547.42 256512.92]
 [0.0 0.0 1.0 114523.61 122616.84 261776.23]
 [1.0 0.0 0.0 78013.11 121597.55 264346.06]
 [0.0 0.0 1.0 94657.16 145077.58 282574.31]
 [0.0 1.0 0.0 91749.16 114175.79 294919.57]
 [0.0 0.0 1.0 86419.7 153514.11 0.0]
 [1.0 0.0 0.0 76253.86 113867.3 298664.47]
 [0.0 0.0 1.0 78389.47 153773.43 299737.29]
 [0.0 1.0 0.0 73994.56 122782.75 3

NOTE: We do not need feature scaling for multiple linear regression, because the coefficient of each independent variable adjusts to that variable's scale. Scaling matters mainly for methods that depend on the distance between data points.
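
The claim about coefficients adjusting can be checked directly. The sketch below fits ordinary least squares on synthetic features with very different scales (all names and values are assumptions for illustration), once on the raw data and once after `StandardScaler`. The coefficients differ, but the predictions should agree up to floating-point error.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Two synthetic features on very different scales
X_demo = np.column_stack([
    rng.uniform(0, 1, size=40),
    rng.uniform(0, 100000, size=40),
])
y_demo = 3.0 * X_demo[:, 0] + 0.002 * X_demo[:, 1] + rng.normal(scale=0.1, size=40)

# Fit on the raw features
raw_model = LinearRegression().fit(X_demo, y_demo)

# Fit on standardized features
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X_demo)
scaled_model = LinearRegression().fit(X_scaled, y_demo)

# Coefficients change with the scale, predictions do not
print(raw_model.coef_, scaled_model.coef_)
same_predictions = np.allclose(raw_model.predict(X_demo), scaled_model.predict(X_scaled))
print(same_predictions)
```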

## Splitting the dataset into the Training set and Test set

In [28]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 0)
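
As a quick sanity check on what `test_size = 0.2` means for 50 rows, the following sketch splits placeholder arrays of the same length (`X_demo` and `y_demo` are stand-ins, not the real data). It should give 40 training rows and 10 test rows, and fixing `random_state` makes the split reproducible.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder data with 50 rows, matching the number of companies
X_demo = np.arange(100).reshape(50, 2)
y_demo = np.arange(50)

X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=0)
print(X_tr.shape, X_te.shape)

# The same random_state gives the same split again
X_tr2, X_te2, y_tr2, y_te2 = train_test_split(X_demo, y_demo, test_size=0.2, random_state=0)
print(np.array_equal(X_te, X_te2))
```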

## Training the Multiple Linear Regression model on the Training set

In [31]:
from sklearn.linear_model import LinearRegression
regressor = LinearRegression()
regressor.fit(X_train, y_train)


LinearRegression()

## Metrics, R^2 Value

In [33]:
regressor.score(X_test, y_test)

0.9347068473282303
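
For a regressor, `score` returns the coefficient of determination, R^2 = 1 - SS_res / SS_tot. The sketch below recomputes it by hand on synthetic data (the data and variable names are assumptions for illustration) and compares it with `score` and `r2_score`.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

rng = np.random.default_rng(0)

# Synthetic data: a linear signal plus noise
X_demo = rng.normal(size=(30, 3))
y_demo = 2.0 + X_demo @ np.array([1.5, -0.5, 0.25]) + rng.normal(scale=0.3, size=30)

model = LinearRegression().fit(X_demo, y_demo)
y_hat = model.predict(X_demo)

# Residual sum of squares and total sum of squares
ss_res = np.sum((y_demo - y_hat) ** 2)
ss_tot = np.sum((y_demo - y_demo.mean()) ** 2)
r2_manual = 1 - ss_res / ss_tot

print(r2_manual, model.score(X_demo, y_demo), r2_score(y_demo, y_hat))
```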

## Predicting the Test set results

In [47]:
y_pred = regressor.predict(X_test)
#Put the predicted and actual values side by side in a pandas DataFrame
data_compar = pd.DataFrame({"predicted": y_pred, "actual": y_test})


## Calculating the difference between the predicted and actual values

In [48]:
#A lambda is a small anonymous function and is worth learning. Here, apply() calls calc_diff on every row (axis=1),
#so each row of the new Difference column holds predicted minus actual for that row

def calc_diff(pred,real):
    return pred-real
data_compar["Difference"] = data_compar.apply(lambda x: calc_diff(x.predicted, x.actual), axis =1)
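
For simple arithmetic like this, pandas can also subtract the two columns directly, which avoids calling a Python function once per row. The sketch below compares both approaches on a small frame named `demo` (a hypothetical name) filled with three predicted/actual pairs from this notebook's test results.

```python
import pandas as pd

# Three predicted/actual pairs taken from the notebook's comparison output
demo = pd.DataFrame({
    "predicted": [103015.201598, 132582.277608, 132447.738452],
    "actual": [103282.38, 144259.40, 146121.95],
})

# Row-wise version, as in the cell above
row_wise = demo.apply(lambda row: row.predicted - row.actual, axis=1)

# Vectorized version: subtract whole columns at once
vectorized = demo["predicted"] - demo["actual"]

print(row_wise)
print(vectorized)
```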

In [52]:
data_compar

| | predicted | actual | Difference |
|---|---|---|---|
| 0 | 103015.201598 | 103282.38 | -267.178402 |
| 1 | 132582.277608 | 144259.4 | -11677.122392 |
| 2 | 132447.738452 | 146121.95 | -13674.211548 |
| 3 | 71976.098513 | 77798.83 | -5822.731487 |
| 4 | 178537.482211 | 191050.39 | -12512.907789 |
| 5 | 116161.242302 | 105008.31 | 11152.932302 |
| 6 | 67851.692097 | 81229.06 | -13377.367903 |
| 7 | 98791.733747 | 97483.56 | 1308.173747 |
| 8 | 113969.43533 | 110352.25 | 3617.18533 |
| 9 | 167921.065696 | 166187.94 | 1733.125696 |


In [53]:
data_compar["Difference"].mean()

-3952.0102448099483
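
The mean of the signed differences lets positive and negative errors cancel each other, so it describes the bias of the predictions more than their typical size. As a complementary sketch, the code below recomputes that mean from the ten rows shown above and adds the mean absolute error (MAE) and root mean squared error (RMSE). The variable names are assumptions; the numbers are copied from the table.

```python
import numpy as np

# Predicted and actual profits copied from the comparison table
predicted = np.array([
    103015.201598, 132582.277608, 132447.738452, 71976.098513, 178537.482211,
    116161.242302, 67851.692097, 98791.733747, 113969.43533, 167921.065696,
])
actual = np.array([
    103282.38, 144259.40, 146121.95, 77798.83, 191050.39,
    105008.31, 81229.06, 97483.56, 110352.25, 166187.94,
])

diff = predicted - actual

mean_diff = diff.mean()               # signed errors can cancel out
mae = np.abs(diff).mean()             # average size of the error
rmse = np.sqrt((diff ** 2).mean())    # penalizes large errors more heavily

print(mean_diff, mae, rmse)
```

A negative mean difference indicates that, on these ten test rows, the model underpredicts profit on average.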