## Business problem

We have a dataset of 50 startups with three spend types: R&D, administration and marketing. The data also records the state each company is based in and the profit it made.

We have been hired by a venture capitalist to create a model and advise on which company they should invest in. The venture capitalist wants to understand things like:

- In which state do companies perform better?
- Does a company that spends more on marketing yield a better profit?
- Or does a company that spends more on R&D yield more profit?

In [1]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

In [2]:
df = pd.read_csv('50_Startups.csv')

X = df.iloc[:, :-1].values
y = df.iloc[:, -1].values

## Encoding categorical features

In [3]:
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

ct = ColumnTransformer(transformers=[('encoder', OneHotEncoder(), [3])], remainder='passthrough')
X = np.array(ct.fit_transform(X))

When we fit a multiple linear regression by ordinary least squares, we do not need to apply feature scaling. Each independent variable is multiplied by its own coefficient, so it does not matter that some features have larger values than others: the coefficients adjust to compensate, and the predictions stay the same. Note to self: dive deeper into this.
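One way to check the scaling claim is to fit ordinary least squares twice, once on the raw features and once with a column rescaled, and compare the predictions. The sketch below uses synthetic data and `np.linalg.lstsq` rather than the startup dataset, so the helper `fit_ols` and every number in it are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic features on very different scales (illustrative, not the startup data)
X = np.column_stack((
    rng.uniform(0, 200_000, size=40),   # large-valued feature
    rng.uniform(0, 1, size=40),         # small-valued feature
))
y = 5_000 + 0.8 * X[:, 0] + 300 * X[:, 1] + rng.normal(0, 100, size=40)


def fit_ols(features, target):
    """Ordinary least squares with an intercept column prepended."""
    design = np.column_stack((np.ones(len(features)), features))
    beta, *_ = np.linalg.lstsq(design, target, rcond=None)
    return beta


# Express the first feature in thousands instead of units
X_scaled = X.copy()
X_scaled[:, 0] /= 1_000

beta_raw = fit_ols(X, y)
beta_scaled = fit_ols(X_scaled, y)

pred_raw = np.column_stack((np.ones(len(X)), X)) @ beta_raw
pred_scaled = np.column_stack((np.ones(len(X)), X_scaled)) @ beta_scaled

# The coefficient absorbs the factor of 1000; the predictions are unchanged
print(beta_scaled[1] / beta_raw[1])
print(np.allclose(pred_raw, pred_scaled))
```

This only holds for unregularised least squares. Penalised models such as ridge or lasso are sensitive to feature scale.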

## Splitting the data into training and testing

In [4]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

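We do not need to remove one of the dummy variables by hand before fitting. scikit-learn's `LinearRegression` still fits when the dummy columns are perfectly collinear with the intercept, and the predictions are unaffected, although the individual dummy coefficients are then not uniquely determined.

We also skip backward elimination here. Note, however, that `LinearRegression` does not select features or compute p-values; it fits every column it is given.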
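One way to see why the redundant dummy column does no harm to the predictions is to fit the same data with and without it. This is a sketch on made-up data, assuming scikit-learn's `LinearRegression`; the arrays and model names are hypothetical and not part of the notebook.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(1)

# Toy data: one numeric feature plus a three-level category (illustrative only)
n = 30
spend = rng.uniform(0, 100, size=n)
category = np.arange(n) % 3            # guarantees all three levels appear
dummies = np.eye(3)[category]          # one-hot columns, all three kept
y = 10 + 2 * spend + np.array([0.0, 5.0, -3.0])[category] + rng.normal(0, 1, size=n)

X_all = np.column_stack((dummies, spend))             # includes the redundant dummy
X_dropped = np.column_stack((dummies[:, 1:], spend))  # first dummy removed

model_all = LinearRegression().fit(X_all, y)
model_dropped = LinearRegression().fit(X_dropped, y)

# Both designs span the same column space, so the fitted values agree
print(np.allclose(model_all.predict(X_all), model_dropped.predict(X_dropped)))

# The dummy coefficients themselves differ between the two parameterisations
print(model_all.coef_)
print(model_dropped.coef_)
```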

## Training the model

In [5]:
from sklearn.linear_model import LinearRegression

regressor = LinearRegression()
regressor.fit(X_train, y_train)

LinearRegression()

## Predicting the test set

In [6]:
y_pred = regressor.predict(X_test)

# Print arrays to 2dp
np.set_printoptions(precision=2)

# Reshape both arrays into column vectors, then concatenate them
# side by side (axis=1): predictions on the left, actual values on the right
np.concatenate(
    (
        y_pred.reshape(len(y_pred), 1),
        y_test.reshape(len(y_test), 1)
    ),
    axis=1
)

array([[103015.2 , 103282.38],
       [132582.28, 144259.4 ],
       [132447.74, 146121.95],
       [ 71976.1 ,  77798.83],
       [178537.48, 191050.39],
       [116161.24, 105008.31],
       [ 67851.69,  81229.06],
       [ 98791.73,  97483.56],
       [113969.44, 110352.25],
       [167921.07, 166187.94]])
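The reshape-and-concatenate step can also be written with `np.column_stack`, which treats each 1-D array as a column. A minimal sketch, using only the first three predicted and actual pairs from the output above (the names `side_by_side` and `same_as_before` are made up for illustration):

```python
import numpy as np

# First three predicted/actual pairs copied from the output above
y_pred = np.array([103015.20, 132582.28, 132447.74])
y_test = np.array([103282.38, 144259.40, 146121.95])

# column_stack places each 1-D array in its own column
side_by_side = np.column_stack((y_pred, y_test))

# The reshape + concatenate form used in the notebook cell
same_as_before = np.concatenate(
    (y_pred.reshape(-1, 1), y_test.reshape(-1, 1)),
    axis=1,
)

print(np.array_equal(side_by_side, same_as_before))

# Absolute error per row, a simple way to read the comparison
print(np.abs(side_by_side[:, 0] - side_by_side[:, 1]))
```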

## Challenges

Question 1: How do I use my multiple linear regression model to make a single prediction, for example, the profit of a startup with R&D Spend = 160000, Administration Spend = 130000, Marketing Spend = 300000 and State = California?

### My answers

There are two ways I can answer this. The first is to use the `predict` method.

In [7]:
q1_X = np.array([0, 1, 0, 160000, 130000, 300000]).reshape(1, -1)
q1_X

array([[     0,      1,      0, 160000, 130000, 300000]])

In [8]:
q1_a_pred = regressor.predict(q1_X)
q1_a_pred

array([180607.64])

Or, I can take the intercept and coefficients from my answer to question 2 and plug the values into the regression equation.

In [28]:
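# Wrap the expression in parentheses so it can continue across lines
q1_b_pred = (
    intercept
    + (coef[0] * 0)
    + (coef[1] * 1)
    + (coef[2] * 0)
    + (coef[3] * 160000)
    + (coef[4] * 130000)
    + (coef[5] * 300000)
)

q1_b_pred

With the parentheses in place, this gives the same value as `q1_a_pred` (about 180607.64). The output originally recorded for this cell, 42467.52924853204, was only the intercept: without the parentheses, Python ends the assignment after `intercept` and evaluates each following `+ (...)` line as a separate expression.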
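The manual calculation is a dot product between the coefficients and the feature row, plus the intercept, which is what `predict` computes for a linear model. A small sketch on made-up data shows the two agree; the toy arrays and the names `model`, `manual` and `via_predict` are assumptions, not part of the notebook.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(2)

# Toy training data with three features (illustrative only)
X_toy = rng.uniform(0, 10, size=(20, 3))
y_toy = 4 + X_toy @ np.array([1.5, -2.0, 0.5]) + rng.normal(0, 0.1, size=20)

model = LinearRegression().fit(X_toy, y_toy)

x_new = np.array([1.0, 2.0, 3.0])

# b0 + b1*x1 + b2*x2 + b3*x3 written as a dot product
manual = model.intercept_ + model.coef_ @ x_new

# predict expects a 2-D array: one row per sample
via_predict = model.predict(x_new.reshape(1, -1))[0]

print(np.isclose(manual, via_predict))
```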

### Tutorial answer

The answer provided in the tutorial was similar to my first method. Note that the tutorial sets the first dummy column to 1, whereas I set the second. The encoder orders the categories alphabetically, so California maps to the first column, and my input with a 1 in the second column describes a different state.

In [30]:
print(regressor.predict([[1, 0, 0, 160000, 130000, 300000]]))

[181566.92]
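The column order can be checked on the encoder itself through its `categories_` attribute. The sketch below fits a fresh `OneHotEncoder` on a handful of state labels; the labels other than California are assumed for illustration, and `california_row` is a made-up name.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Assumed state labels, deliberately listed out of alphabetical order
states = np.array([['New York'], ['California'], ['Florida'], ['California']])

encoder = OneHotEncoder()
encoder.fit(states)

# Learned category order: sorted, regardless of order of appearance
print(encoder.categories_[0])

# Dummy columns for a single state; .toarray() densifies the sparse result
california_row = encoder.transform([['California']]).toarray()
print(california_row)
```

In the notebook, the fitted encoder sits inside the `ColumnTransformer`, so the equivalent check there would be `ct.named_transformers_['encoder'].categories_`.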


Question 2: How do I get the final regression equation y = b0 + b1 x1 + b2 x2 + ... with the final values of the coefficients?

### My answer

In [29]:
coef = regressor.coef_
intercept = regressor.intercept_

### Tutorial answer

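My calculation of the intercept and coefficients was correct, but the tutorial took it one step further and wrote out the complete equation, where x1 to x3 are the state dummy variables and x4 to x6 are the R&D, administration and marketing spend: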

Profit = 42467.53 + 86.6 × x1 - 873 × x2 + 786 × x3 + 0.773 × x4 + 0.0329 × x5 + 0.0366 × x6
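As a sanity check, the equation can be evaluated by hand for the startup from question 1. This sketch copies the rounded intercept and coefficients from the equation above, so the result is only approximate; the variable names are chosen for illustration.

```python
import numpy as np

# Rounded intercept and coefficients copied from the equation above
intercept = 42467.53
coefs = np.array([86.6, -873.0, 786.0, 0.773, 0.0329, 0.0366])

# x1..x3: state dummies (first column set, as in the tutorial answer)
# x4..x6: R&D, administration and marketing spend from question 1
x = np.array([1, 0, 0, 160000, 130000, 300000])

profit = intercept + coefs @ x
print(round(profit, 2))
```

The result is about 181491, close to the 181566.92 returned by `predict`. The gap comes from rounding the coefficients to three significant figures.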