# Main Template:
1 - Importing the Libraries \
2 - Importing the dataset \
3 - Encoding Categorical data \
4 - Splitting the dataset into the Training set and Test set \
5 - Training the MLR model on Training set \
6 - Predicting the Test set results

# 1 - Importing the Libraries

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# 2 - Importing the Dataset

In [9]:
dataset = pd.read_csv('50_Startups.csv')
X = dataset.iloc[:,:-1].values
y = dataset.iloc[:,-1].values
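
As a minimal sketch of what the two `iloc` slices select, the toy frame below stands in for the real file. Its values and most of its column names are made up for illustration; only `State` and the profit target are named in this notebook.

```python
import pandas as pd

# Toy stand-in for the dataset: values and the "Spend" column names are hypothetical
toy = pd.DataFrame({
    "Spend A": [100.0, 200.0, 300.0],
    "Spend B": [10.0, 20.0, 30.0],
    "State": ["New York", "California", "Florida"],
    "Profit": [1000.0, 2000.0, 3000.0],
})

X_toy = toy.iloc[:, :-1].values  # every column except the last: the features
y_toy = toy.iloc[:, -1].values   # only the last column: the target

print(X_toy.shape)  # (3, 3)
print(y_toy.shape)  # (3,)
```

Because the features mix numbers and text, `X_toy` is an object array, which is why the categorical column still has to be encoded before training.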

# 3 - Encoding Categorical data

In [3]:
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# The third element of each 'transformers' tuple lists the column indices to one-hot encode.
# 'State' is the categorical column, at index 3 (the fourth column of X).
ct = ColumnTransformer(transformers= [('encoder',OneHotEncoder(), [3])],
                       remainder ='passthrough')

X = np.array(ct.fit_transform(X))
print(X)

[[0.0 0.0 1.0 165349.2 136897.8 471784.1]
 [1.0 0.0 0.0 162597.7 151377.59 443898.53]
 [0.0 1.0 0.0 153441.51 101145.55 407934.54]
 [0.0 0.0 1.0 144372.41 118671.85 383199.62]
 [0.0 1.0 0.0 142107.34 91391.77 366168.42]
 [0.0 0.0 1.0 131876.9 99814.71 362861.36]
 [1.0 0.0 0.0 134615.46 147198.87 127716.82]
 [0.0 1.0 0.0 130298.13 145530.06 323876.68]
 [0.0 0.0 1.0 120542.52 148718.95 311613.29]
 [1.0 0.0 0.0 123334.88 108679.17 304981.62]
 [0.0 1.0 0.0 101913.08 110594.11 229160.95]
 [1.0 0.0 0.0 100671.96 91790.61 249744.55]
 [0.0 1.0 0.0 93863.75 127320.38 249839.44]
 [1.0 0.0 0.0 91992.39 135495.07 252664.93]
 [0.0 1.0 0.0 119943.24 156547.42 256512.92]
 [0.0 0.0 1.0 114523.61 122616.84 261776.23]
 [1.0 0.0 0.0 78013.11 121597.55 264346.06]
 [0.0 0.0 1.0 94657.16 145077.58 282574.31]
 [0.0 1.0 0.0 91749.16 114175.79 294919.57]
 [0.0 0.0 1.0 86419.7 153514.11 0.0]
 [1.0 0.0 0.0 76253.86 113867.3 298664.47]
 [0.0 0.0 1.0 78389.47 153773.43 299737.29]
 [0.0 1.0 0.0 73994.56 122782.75 3
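
The layout of the printed array (dummy columns first, untouched columns after) can be reproduced on a tiny example. This is only a sketch: the three rows are made up, the categorical column sits at index 1 rather than 3, and `sparse_threshold=0` is added here so the result is always a dense array.

```python
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# Made-up rows: one numeric column (index 0) and one categorical column (index 1)
X_toy = np.array([[1.0, "New York"],
                  [2.0, "California"],
                  [3.0, "Florida"]], dtype=object)

ct_toy = ColumnTransformer(
    transformers=[("encoder", OneHotEncoder(), [1])],  # encode column index 1
    remainder="passthrough",                           # keep the other columns unchanged
    sparse_threshold=0,                                # always return a dense array
)

encoded = np.array(ct_toy.fit_transform(X_toy))
print(encoded)
```

Each row gets a single 1.0 among the three dummy columns, one per category in sorted order, followed by the numeric value that was passed through.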

# 4 - Splitting the Dataset into Training and Testing

In [4]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, 
                                                    y, 
                                                    test_size = 0.2,
                                                    random_state = 0)
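
To check what `test_size = 0.2` does, here is a small sketch on placeholder data. It assumes 50 rows, which matches the 10 test startups shown later in this notebook; the arrays themselves are just counters.

```python
import numpy as np
from sklearn.model_selection import train_test_split

X_demo = np.arange(100).reshape(50, 2)  # 50 placeholder rows, 2 features
y_demo = np.arange(50)                  # 50 placeholder targets

X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo,
                                          test_size=0.2,
                                          random_state=0)

print(X_tr.shape, X_te.shape)  # (40, 2) (10, 2)
```

Fixing `random_state` makes the shuffle reproducible, so rerunning the cell gives the same training and test rows.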

# 5 - Training the MLR Model on the Training set

In [5]:
from sklearn.linear_model import LinearRegression # LinearRegression is a class

In [10]:
# create an instance of the LinearRegression class
regressor = LinearRegression()    # the multiple linear regression model, not yet trained

regressor.fit(X_train, y_train)   # trains the model on the training set

LinearRegression()
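
After `fit`, the learned weights are available as `coef_` and `intercept_`. The sketch below uses synthetic data (not the startup dataset) that follows a known formula, so the fitted values can be checked against it; the names `X_syn`, `y_syn` and `model` are illustrative.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data that follows y = 3*x1 + 2*x2 + 5 exactly (no noise)
rng = np.random.default_rng(0)
X_syn = rng.uniform(0, 10, size=(20, 2))
y_syn = 3 * X_syn[:, 0] + 2 * X_syn[:, 1] + 5

model = LinearRegression().fit(X_syn, y_syn)

print(model.coef_)       # close to [3. 2.]
print(model.intercept_)  # close to 5.0
```

On the real data, `regressor.coef_` would hold one weight per column of the encoded `X`, including the dummy columns.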

# 6 - Predicting the test set results
This time we have multiple features, so it is difficult to plot the predictions on a 2-D chart. Instead we can display 2 vectors:
 - the vector of the real profits
 - the vector of the predicted profits.

Both cover the test set, so we can compare them row by row.

To display the real and the predicted profits next to each other, we use 'concatenate'. It is a NumPy function that joins 2 vectors, or arrays in general, either vertically or horizontally. \
Here it takes 2 arguments:
- a tuple of the arrays to join, i.e. the vector of predicted profits and the vector of real profits
- axis : an integer. **0** joins the arrays vertically (one on top of the other) and **1** joins them horizontally (side by side)

The **.reshape(len(y_pred), 1)** call only turns each 1-D vector into a column with len(y_pred) rows and 1 column, so that joining along axis 1 places the two columns next to each other for easy comparison.
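
The effect of `axis` is easiest to see on a pair of short vectors. This is a sketch with placeholder numbers, not the model's output: joining two column vectors along axis 1 puts them side by side, while axis 0 stacks one on top of the other.

```python
import numpy as np

pred = np.array([1.0, 2.0, 3.0])     # placeholder "predicted" values
real = np.array([10.0, 20.0, 30.0])  # placeholder "real" values

pred_col = pred.reshape(len(pred), 1)  # shape (3,) becomes (3, 1)
real_col = real.reshape(len(real), 1)

side_by_side = np.concatenate((pred_col, real_col), axis=1)
stacked = np.concatenate((pred_col, real_col), axis=0)

print(side_by_side.shape)  # (3, 2)
print(stacked.shape)       # (6, 1)
```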

In [18]:
y_pred = regressor.predict(X_test) # 'predict' is a method
np.set_printoptions(precision=2)   # display floats with 2 digits after the decimal point


# display the vectors of predicted and real profits side by side

                                # reshape arguments: number of rows, number of columns
print(np.concatenate((y_pred.reshape(len(y_pred), 1),       # vector of predicted profits
                      y_test.reshape(len(y_test), 1)),      # vector of real profits
                      1))                                   # axis = 1: join the two columns side by side

[[103015.2  103282.38]
 [132582.28 144259.4 ]
 [132447.74 146121.95]
 [ 71976.1   77798.83]
 [178537.48 191050.39]
 [116161.24 105008.31]
 [ 67851.69  81229.06]
 [ 98791.73  97483.56]
 [113969.44 110352.25]
 [167921.07 166187.94]]


There are 2 vectors, left (vector of predicted profits) and right (vector of real profits), for the 10 startups of the test set. \
For the first startup, the predicted profit is 103015.2 and the real profit is 103282.38. A pretty close prediction!
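
One way to put a number on "pretty close" is to compute the error per startup. The sketch below copies the 10 pairs from the printed output above into two arrays (the names `y_pred_shown` and `y_test_shown` are illustrative) and reports the absolute and percentage errors.

```python
import numpy as np

# Predicted and real profits copied from the printed output above
y_pred_shown = np.array([103015.2, 132582.28, 132447.74, 71976.1, 178537.48,
                         116161.24, 67851.69, 98791.73, 113969.44, 167921.07])
y_test_shown = np.array([103282.38, 144259.4, 146121.95, 77798.83, 191050.39,
                         105008.31, 81229.06, 97483.56, 110352.25, 166187.94])

abs_err = np.abs(y_pred_shown - y_test_shown)  # error in profit units
rel_err = abs_err / y_test_shown               # error as a fraction of the real profit

print(np.round(100 * rel_err, 2))  # percentage error per startup
print(abs_err.mean())              # mean absolute error over the 10 startups
```

The first startup's error is well under 1% of its real profit, while several others are off by more, so the first row is the best case rather than a typical one.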


In [15]:
# y_pred = regressor.predict(X_test) # 'predict' is a method
# np.set_printoptions(precision=2)   # display floats with 2 digits after the decimal point
# # join the predicted and real profits into one long vector (no reshape)
# print(np.concatenate((y_pred,  # vector of predicted profits
#                       y_test), # vector of real profits
#                      0)) # axis = 0: joins the two 1-D vectors end to end