## Loading the libraries

In [28]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

## Loading the dataset

## Part 1

In [29]:
df = pd.read_csv('50_Startups.csv')

In [30]:
# df

## Part 2

In [31]:
x = df.iloc[:, :-1]
y = df.iloc[:, -1]

## Taking care of the missing data

In [32]:
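# No missing data is visible when inspecting df; we confirm it by counting NaNs with df.isna().
# Note: iterating over df.isna() only yields the column labels, so we count with sum() instead of a loop.

flag = int(df.isna().sum().sum())  # total number of missing cells in the whole frame
print(flag)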

0
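
As a small illustration of the check above, here is a sketch on a hypothetical toy frame (the `toy` data below is made up for this example and is not part of `50_Startups.csv`). It shows that looping over a DataFrame gives column labels rather than cell values, and how `isna().sum()` counts missing cells per column.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with exactly one missing value (illustration only).
toy = pd.DataFrame({
    "R&D Spend": [1000.0, np.nan, 3000.0],
    "State": ["California", "Florida", "New York"],
})

# Iterating over a DataFrame yields its column labels, not its cells.
labels = [item for item in toy.isna()]

# Count missing values per column, then in total.
missing_per_column = toy.isna().sum()
total_missing = int(missing_per_column.sum())

print(labels)
print(missing_per_column)
print(total_missing)
```

A per-column count is often more useful than a single total, because it tells us which feature would need imputing.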


## Encoding categorical data

## Encoding the independent variables

In [33]:
# We need to encode "State", which is a categorical independent variable:

from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

ct = ColumnTransformer(transformers = [('encoder', OneHotEncoder(), [3])], remainder = 'passthrough')
x = ct.fit_transform(x)

In [34]:
# x
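
To see what the encoding step produces, here is a minimal sketch on a hypothetical three-row frame (the values are invented; only the column layout mirrors the dataset). It assumes `sparse_threshold=0`, which is not used in the cell above, simply to guarantee a dense array for printing.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# Hypothetical toy data with the same column layout as the features (illustration only).
toy = pd.DataFrame({
    "R&D Spend": [100.0, 200.0, 300.0],
    "Administration": [10.0, 20.0, 30.0],
    "Marketing Spend": [1.0, 2.0, 3.0],
    "State": ["New York", "California", "Florida"],
})

ct_toy = ColumnTransformer(
    transformers=[("encoder", OneHotEncoder(), [3])],  # encode column index 3 ("State")
    remainder="passthrough",                           # keep the other columns unchanged
    sparse_threshold=0,                                # assumption: force a dense result
)
encoded = ct_toy.fit_transform(toy)

# The encoder sorts the categories, which fixes the order of the dummy columns.
categories = list(ct_toy.named_transformers_["encoder"].categories_[0])

print(categories)
print(encoded)
```

The dummy columns come first, in sorted category order, and the passthrough columns follow in their original order.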

## Encoding the dependent variable

In [35]:
# The dependent variable (Profit) is numerical and doesn't need encoding.

## Train test split

In [36]:
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size = 0.2, random_state = 0)
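
As a quick sanity check of the split, the sketch below uses synthetic arrays with 50 rows (stand-ins, not the real data) to show the resulting sizes with `test_size = 0.2`, and that a fixed `random_state` reproduces the same split.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-ins for the features and target: 50 rows (illustration only).
X_toy = np.arange(100).reshape(50, 2)
y_toy = np.arange(50)

X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy, test_size=0.2, random_state=0)

# Repeating the call with the same random_state gives the same rows.
_, X_te_again, _, _ = train_test_split(X_toy, y_toy, test_size=0.2, random_state=0)

print(X_tr.shape, X_te.shape)
print(np.array_equal(X_te, X_te_again))
```

With 50 rows, 20% leaves 10 rows for testing and 40 for training.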

In [39]:
# x_train
# x_test

## Model

In [40]:
from sklearn.linear_model import LinearRegression

regressor = LinearRegression()
regressor.fit(x_train, y_train)

LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None, normalize=False)

## Predicting the test set results

In [41]:
y_pred_test = regressor.predict(x_test)
# y_pred_train = regressor.predict(x_train)

In [47]:
np.set_printoptions(precision = 2)
print(np.concatenate((y_pred_test.reshape(len(y_pred_test), 1), y_test.values.reshape(len(y_test), 1)), 1))

[[103015.2  103282.38]
 [132582.28 144259.4 ]
 [132447.74 146121.95]
 [ 71976.1   77798.83]
 [178537.48 191050.39]
 [116161.24 105008.31]
 [ 67851.69  81229.06]
 [ 98791.73  97483.56]
 [113969.44 110352.25]
 [167921.07 166187.94]]


## Making a single prediction (for example the profit of a startup with R&D Spend = 160000, Administration Spend = 130000, Marketing Spend = 300000 and State = 'California')

In [48]:
sample = [[1, 0, 0, 160000, 130000, 300000]]  # predict expects a 2D array: one row per sample
print(regressor.predict(sample))

[181566.92]


Therefore, our model predicts that the profit of a Californian startup which spent 160000 on R&D, 130000 on Administration and 300000 on Marketing is $181,566.92.

**Important note 1:** Notice that the feature values were passed inside a double pair of square brackets. That's because the "predict" method expects a 2D array as input, with one row per sample and one column per feature. Wrapping our values in a double pair of square brackets makes the input a 2D array with a single row. Simply put:

$1, 0, 0, 160000, 130000, 300000 \rightarrow \textrm{scalars}$

$[1, 0, 0, 160000, 130000, 300000] \rightarrow \textrm{1D array}$

$[[1, 0, 0, 160000, 130000, 300000]] \rightarrow \textrm{2D array}$
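
The difference between the 1D and 2D forms can be checked directly with NumPy. This is only a sketch of the shapes involved; the names `row_1d`, `row_2d` and `reshaped` are introduced here for illustration.

```python
import numpy as np

row_1d = np.array([1, 0, 0, 160000, 130000, 300000])    # single brackets: 1D
row_2d = np.array([[1, 0, 0, 160000, 130000, 300000]])  # double brackets: 2D

# A 1D row can be turned into a one-row 2D array with reshape(1, -1).
reshaped = row_1d.reshape(1, -1)

print(row_1d.shape)    # (6,)
print(row_2d.shape)    # (1, 6)
print(reshaped.shape)  # (1, 6)
```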

**Important note 2:** Notice also that the "California" state was not input as a string in the last column but as "1, 0, 0" in the first three columns. That's because the predict method expects the one-hot-encoded values of the state, and as we see in the second row of the matrix of features X, "California" was encoded as "1, 0, 0". Be careful to put these values in the first three columns, not the last three: the ColumnTransformer places the encoded dummy columns first and appends the passthrough columns after them.
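
Instead of typing the dummy values by hand, the raw row can also be passed through a fitted ColumnTransformer. The sketch below shows the idea on a hypothetical six-row frame (invented numbers, not the real dataset); `ct_toy`, `raw_sample` and `sparse_threshold=0` are assumptions made for this illustration.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import OneHotEncoder

# Hypothetical toy training data (illustration only).
toy = pd.DataFrame({
    "R&D Spend": [100.0, 200.0, 300.0, 400.0, 500.0, 600.0],
    "Administration": [50.0, 40.0, 60.0, 55.0, 45.0, 65.0],
    "Marketing Spend": [10.0, 30.0, 20.0, 50.0, 40.0, 60.0],
    "State": ["New York", "California", "Florida",
              "New York", "California", "Florida"],
})
profit = [120.0, 230.0, 310.0, 420.0, 540.0, 610.0]

ct_toy = ColumnTransformer(
    transformers=[("encoder", OneHotEncoder(), [3])],
    remainder="passthrough",
    sparse_threshold=0,  # assumption: force dense output
)
X_toy = ct_toy.fit_transform(toy)
model = LinearRegression().fit(X_toy, profit)

# A raw, un-encoded sample with the same columns in the same order.
raw_sample = pd.DataFrame(
    [[160.0, 130.0, 300.0, "California"]], columns=toy.columns
)
encoded_sample = ct_toy.transform(raw_sample)  # dummies first, then the spends
prediction = model.predict(encoded_sample)

print(encoded_sample)
print(prediction)
```

Using `transform` (not `fit_transform`) on the new row reuses the category order learned during fitting, so the dummy columns line up with the training matrix.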

## Getting the final linear regression equation with the values of the coefficients

In [49]:
print(regressor.coef_)
print(regressor.intercept_)

[ 8.66e+01 -8.73e+02  7.86e+02  7.73e-01  3.29e-02  3.66e-02]
42467.529248549545


Therefore, the equation of our multiple linear regression model is:

$$\textrm{Profit} = 86.6 \times \textrm{Dummy State 1} - 873 \times \textrm{Dummy State 2} + 786 \times \textrm{Dummy State 3} + 0.773 \times \textrm{R&D Spend} + 0.0329 \times \textrm{Administration} + 0.0366 \times \textrm{Marketing Spend} + 42467.53$$

**Important Note:** To get these coefficients we read the "coef_" and "intercept_" attributes of our regressor object. Unlike methods, attributes are accessed without parentheses; here "coef_" holds an array with one coefficient per feature and "intercept_" holds a single value.
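
To check that the equation really is what the model computes, we can rebuild the predictions by hand from "coef_" and "intercept_". The sketch below does this on synthetic data (random features and made-up true coefficients, chosen only for illustration).

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic, noise-free data: y = 2*x0 - 1*x1 + 0.5*x2 + 4 (illustration only).
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(20, 3))
y_toy = X_toy @ np.array([2.0, -1.0, 0.5]) + 4.0

model = LinearRegression().fit(X_toy, y_toy)

# Manual prediction: dot product of features and coefficients, plus the intercept.
manual = X_toy @ model.coef_ + model.intercept_

print(model.coef_, model.intercept_)
print(np.allclose(manual, model.predict(X_toy)))
```

The same arithmetic applied to the coefficients printed above and the sample row reproduces the single prediction, up to the rounding of the displayed coefficients.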

In [None]:
# Backward elimination is another option for feature selection. It is not performed separately in this notebook; note that LinearRegression fits all the features it is given and does not drop any on its own. It is covered in the videos.