# Two most important resources for Multiple Regression 
https://www.youtube.com/watch?v=fTfMdCQJz4s for lecture 
<br> https://www.youtube.com/watch?time_continue=1553&v=NUXdtN1W1FE for python tutorial 

In [1]:
# Importing the libraries 
import numpy as np # numerical computation 
import matplotlib.pyplot as plt # 2D plotting library  
import pandas as pd # for dataframe, 2 D data with rows and columns 
import seaborn as sns # Seaborn is a library for making statistical graphics in Python. 
#It is built on top of matplotlib and closely integrated with pandas data structures.
%matplotlib inline 
# The output of plotting commands is displayed inline within frontends like the Jupyter notebook, 
# directly below the code cell that produced it. The resulting plots will then also be stored in the notebook document.

In [2]:
# Importing the dataset 
companies = pd.read_csv('1000_Companies.csv')
companies.head()

| | R&D Spend | Administration | Marketing Spend | State | Profit |
|---|---|---|---|---|---|
| 0 | 165349.2 | 136897.8 | 471784.1 | New York | 192261.83 |
| 1 | 162597.7 | 151377.59 | 443898.53 | California | 191792.06 |
| 2 | 153441.51 | 101145.55 | 407934.54 | Florida | 191050.39 |
| 3 | 144372.41 | 118671.85 | 383199.62 | New York | 182901.99 |
| 4 | 142107.34 | 91391.77 | 366168.42 | Florida | 166187.94 |


In [3]:
# Extracting the Independent and Dependent variables
# By convention X should be uppercase and y should be lowercase 
X = companies.iloc[:, :-1].values # all rows, all columns except last
y = companies.iloc[:, 4].values # all rows, fifth column (starts from 0, which is R&D Spend)
print (X)

[[165349.2 136897.8 471784.1 'New York']
 [162597.7 151377.59 443898.53 'California']
 [153441.51 101145.55 407934.54 'Florida']
 ...
 [100275.47 241926.31 227142.82 'California']
 [128456.23 321652.14 281692.32 'California']
 [161181.72 270939.86 295442.17 'New York']]


In [4]:
# Building the Correlation Matrix 
companies.corr()

| | R&D Spend | Administration | Marketing Spend | Profit |
|---|---|---|---|---|
| R&D Spend | 1.0 | 0.582434 | 0.978407 | 0.945245 |
| Administration | 0.582434 | 1.0 | 0.520465 | 0.74156 |
| Marketing Spend | 0.978407 | 0.520465 | 1.0 | 0.91727 |
| Profit | 0.945245 | 0.74156 | 0.91727 | 1.0 |
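
A note on versions: in newer pandas releases (2.0 and later, as far as I know) `corr()` no longer skips text columns such as State on its own and raises an error instead. A minimal sketch of a version-independent way to get the same matrix, using a hypothetical three-row stand-in (`toy`) rather than the real CSV:

```python
import pandas as pd

# Hypothetical three-row stand-in for the companies DataFrame
toy = pd.DataFrame({
    "R&D Spend": [165349.20, 162597.70, 153441.51],
    "State": ["New York", "California", "Florida"],
    "Profit": [192261.83, 191792.06, 191050.39],
})

# Keep only the numeric columns, then correlate them
numeric = toy.select_dtypes(include="number")
corr = numeric.corr()
print(corr)
```

The State column is dropped before the correlation is computed, which is what the older pandas used in this notebook did implicitly.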


In [5]:
# Encoding categorical data of State (New York, California and Florida)
# LabelEncoder: ----------------------------------------------------------------
# LabelEncoder is used to encode labels with value between 0 and n_classes-1
# It can be used to transform non-numerical labels to numerical labels 
# ------------------------------------------------------------------------------
from sklearn.preprocessing import LabelEncoder, OneHotEncoder

labelencoder = LabelEncoder() # creating a labelencoder object of class LabelEncoder
X[:, 3] = labelencoder.fit_transform(X[:, 3]) # classes are sorted alphabetically: California = 0, Florida = 1, New York = 2 
# only the column that needs to be labeled is passed in 
print (X) 

[[165349.2 136897.8 471784.1 2]
 [162597.7 151377.59 443898.53 0]
 [153441.51 101145.55 407934.54 1]
 ...
 [100275.47 241926.31 227142.82 0]
 [128456.23 321652.14 281692.32 0]
 [161181.72 270939.86 295442.17 2]]


## NOTE: LabelEncoder and OneHotEncoder 
Label encoding alone is not good enough. With California = 0, Florida = 1, New York = 2, a model that does arithmetic on the codes (an average, for example) would treat Florida as the average of California and New York. 
<br> This is a recipe for disaster.
<br> Likewise, an integer coding that turns [dog, cat, dog, mouse, cat] into [1, 2, 1, 3, 2] imposes an ordering in which the average 
of dog and mouse is cat. Some algorithms, such as decision trees and random forests, can work with integer-coded categorical 
variables just fine, and LabelEncoder can be used to store values using less disk space. One-hot encoding has the advantage
that the result is binary rather than ordinal and that everything sits in an orthogonal vector space. 
The disadvantage is that for high cardinality the feature space can blow up quickly and you start fighting 
the curse of dimensionality.
<br> For more information:
<br> https://hackernoon.com/what-is-one-hot-encoding-why-and-when-do-you-have-to-use-it-e3c6186d008f
<br> https://datascience.stackexchange.com/questions/9443/when-to-use-one-hot-encoding-vs-labelencoder-vs-dictvectorizor

<br> In the scikit-learn version used in this notebook, OneHotEncoder takes only integer labels, so we must use LabelEncoder first. (Since scikit-learn 0.20, OneHotEncoder also accepts string categories directly.) 
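
To check which integer each state receives, it helps to look at `classes_` on the fitted encoder. A minimal sketch on a hypothetical four-item list (`states`), not the real dataset:

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical sample of the State column
states = ["New York", "California", "Florida", "New York"]

encoder = LabelEncoder()
codes = encoder.fit_transform(states)

# classes_ is sorted, and a label's position in it is its integer code
print(list(encoder.classes_))
print(list(codes))

# inverse_transform maps the codes back to the original labels
print(list(encoder.inverse_transform(codes)))
```

The codes follow alphabetical order of the labels, not the order in which they first appear in the data.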

In [6]:
# OneHotEncoder -------------------------------------------------------------------------------------------------------------
# OneHotEncoder turns the ordinal integer codes into binary indicator (dummy) columns 
# (California, Florida, New York) were mapped to the integers (0, 1, 2) by LabelEncoder
# and OneHotEncoder maps the integers (0, 1, 2) to (1,0,0), (0,1,0) and (0,0,1)
# In this scikit-learn version OneHotEncoder takes only integer labels, so we must use LabelEncoder first  
# Only two dummy variables are needed for 3 categories (n-1); the extra column is dropped in the next cell
# After the drop: California = (0, 0), Florida = (1, 0), New York = (0, 1)
onehotencoder = OneHotEncoder(categorical_features =[3]) # categorical_features = index of the column that needs to be one-hot encoded
# The 4th column (index 3 of 0,1,2,3) holds the label-encoded State
# NOTE: categorical_features was deprecated in scikit-learn 0.20 and removed in 0.22; newer versions use ColumnTransformer instead
# Unlike LabelEncoder, we pass the entire array X to OneHotEncoder
X = onehotencoder.fit_transform(X).toarray() # convert the sparse result to a dense NumPy array 
print (X) 

[[0.0000000e+00 0.0000000e+00 1.0000000e+00 1.6534920e+05 1.3689780e+05
  4.7178410e+05]
 [1.0000000e+00 0.0000000e+00 0.0000000e+00 1.6259770e+05 1.5137759e+05
  4.4389853e+05]
 [0.0000000e+00 1.0000000e+00 0.0000000e+00 1.5344151e+05 1.0114555e+05
  4.0793454e+05]
 ...
 [1.0000000e+00 0.0000000e+00 0.0000000e+00 1.0027547e+05 2.4192631e+05
  2.2714282e+05]
 [1.0000000e+00 0.0000000e+00 0.0000000e+00 1.2845623e+05 3.2165214e+05
  2.8169232e+05]
 [0.0000000e+00 0.0000000e+00 1.0000000e+00 1.6118172e+05 2.7093986e+05
  2.9544217e+05]]


In [7]:
# OneHotEncoder generates one dummy column per category, i.e. 1 more than we need 
# We need only two dummy variables for 3 categories (n-1), where n = no of categories; keeping all n would be
# perfectly collinear with the intercept (the dummy variable trap)
# After dropping the California column: California = 0, 0  Florida = 1, 0  New York = 0, 1 
X = X[:, 1:] # all rows, 2nd to last column (0 is the 1st column)
# All this is doing is removing the first column 
print (X)

[[0.0000000e+00 1.0000000e+00 1.6534920e+05 1.3689780e+05 4.7178410e+05]
 [0.0000000e+00 0.0000000e+00 1.6259770e+05 1.5137759e+05 4.4389853e+05]
 [1.0000000e+00 0.0000000e+00 1.5344151e+05 1.0114555e+05 4.0793454e+05]
 ...
 [0.0000000e+00 0.0000000e+00 1.0027547e+05 2.4192631e+05 2.2714282e+05]
 [0.0000000e+00 0.0000000e+00 1.2845623e+05 3.2165214e+05 2.8169232e+05]
 [0.0000000e+00 1.0000000e+00 1.6118172e+05 2.7093986e+05 2.9544217e+05]]
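
On recent scikit-learn versions the `categorical_features` argument is no longer available, so the encoding cells above will not run as written. The following is a minimal sketch of one way to get the same column layout, assuming scikit-learn 0.21 or later (for `drop="first"`). The three-row DataFrame `toy` and the names `encode_state` and `X_toy` are made up for illustration; this is not the code that produced the outputs in this notebook.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# Hypothetical three-row stand-in for the feature columns
toy = pd.DataFrame({
    "R&D Spend": [165349.20, 162597.70, 153441.51],
    "Administration": [136897.80, 151377.59, 101145.55],
    "Marketing Spend": [471784.10, 443898.53, 407934.54],
    "State": ["New York", "California", "Florida"],
})

# One-hot encode State, drop the first (alphabetical) category,
# and pass the numeric columns through unchanged.
encode_state = ColumnTransformer(
    transformers=[("state", OneHotEncoder(drop="first"), ["State"])],
    remainder="passthrough",
    sparse_threshold=0,  # always return a dense array
)
X_toy = encode_state.fit_transform(toy)
print(X_toy)
```

The two dummy columns (Florida, New York) come first, followed by the three numeric columns, so California is again the baseline category and no separate LabelEncoder step is needed.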


In [8]:
# Splitting the dataset into the Training set and Test set 
from sklearn.model_selection import train_test_split 
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 0) 
# Convention --> (X,y) X_train, X_test, y_train, y_test 
# test_size = 0.2, 20% of the total data would be used for testing, rest for training 
# random_state:===================================================================================================== 
# If you don't specify the random_state in your code, then every time you run(execute) your code a new 
# random value is generated and the train and test datasets would have different values each time.
# However, if a fixed value is assigned like random_state = 42 (or 0) then no matter how many times
# you execute your code the result would be the same, i.e. same values in train and test datasets.
# In practice I would say, you should set the random_state to some fixed number while you test stuff, 
# but then remove it in production if you really need a random (and not a fixed) split.
# ==================================================================================================================
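
The effect of `random_state` is easy to verify on a small array. A minimal sketch, using a made-up 10-row dataset (`X_demo`, `y_demo`) instead of the company data:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 10 samples, 2 features
X_demo = np.arange(20).reshape(10, 2)
y_demo = np.arange(10)

# Two independent calls with the same random_state
first = train_test_split(X_demo, y_demo, test_size=0.2, random_state=0)
second = train_test_split(X_demo, y_demo, test_size=0.2, random_state=0)

# Compare all four returned arrays pairwise
same_split = all(np.array_equal(a, b) for a, b in zip(first, second))
print(same_split)

X_tr, X_te, y_tr, y_te = first
print(len(X_tr), len(X_te))
```

Both calls return identical train and test arrays, and with `test_size = 0.2` two of the ten rows end up in the test set.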

In [9]:
# Fitting Multiple Linear Regression to the Training set 
from sklearn.linear_model import LinearRegression 
regressor = LinearRegression() # regressor object of the class LinearRegression 
regressor.fit(X_train, y_train)

LinearRegression(copy_X=True, fit_intercept=True, n_jobs=1, normalize=False)

In [10]:
# Predicting the Test set results 
y_pred = regressor.predict(X_test)
print (y_pred)
y_pred.size

[ 89790.61532914  88427.0718736   94894.67836971 175680.86725613
  83411.73042087 110571.90200074 132145.2293644   91473.37719685
 164597.05380608  53222.82667398  66950.19050988 150566.43987006
 126915.20858596  59337.85971048 177513.91053064  75316.28143049
 118248.14406603 164574.40699904 170937.28981071 182069.11645087
 118845.03252688  85669.95112227 180992.59396146  84145.08220143
 105005.83769213 101233.56772746  53831.07669088  56881.41475222
  68896.39346903 210040.00765886 120778.72270894 111724.87157654
 101487.90541517 137959.02649624  63969.95996741 108857.91214126
 186014.7253199  171442.64130749 174644.26529207 117671.49128195
  96731.37857432 165452.25779411 107724.34331255  50194.54176911
 116513.89532178  58632.48986818 158416.46827611  78541.48521608
 159727.66671745 131137.87699644 184880.70924519 174609.08266882
  93745.66352058  78341.13383416 180745.90439082  84461.61490551
 142900.90602904 170618.44098399  84365.09530837 105307.3716218
 141660.07290787  52527.34

200

In [11]:
# Calculating the Coefficients 
print (regressor.coef_)

[-8.80536598e+02 -6.98169073e+02  5.25845857e-01  8.44390881e-01
  1.07574255e-01]


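### Note on number of coefficients and independent variables 
Remember there are 5 coefficients and one intercept. 
<br> y   =  m1x1 + m2x2 + m3x3 + m4x4 + m5x5 + C 
<br> Profit is modelled from State (3 different states, 2 dummy variables), R&D Spend, Administration and Marketing Spend. In the column order of X: x1 = Florida, x2 = New York, x3 = R&D Spend, x4 = Administration, x5 = Marketing Spend. 
<br> We have three different states but we need only 2 coefficients! California is the baseline, absorbed into the intercept C. 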
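
The equation above is what `predict` computes. A minimal sketch that checks this on synthetic, noise-free data; `X_demo`, `true_coef` and the intercept 4.0 are invented for the demonstration and have nothing to do with the company dataset:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data: 50 samples, 5 features, known coefficients and intercept
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(50, 5))
true_coef = np.array([1.0, -2.0, 0.5, 3.0, 0.0])
y_demo = X_demo @ true_coef + 4.0

model = LinearRegression().fit(X_demo, y_demo)

# y = m1*x1 + ... + m5*x5 + C, written as a matrix product
manual = X_demo @ model.coef_ + model.intercept_

print(model.coef_.shape)
print(np.allclose(manual, model.predict(X_demo)))
```

One coefficient per column of X plus a single intercept reproduces the model's predictions.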

In [12]:
# Calculating the Intercept 
print (regressor.intercept_)

-51035.22972405603


In [13]:
# Calculating the R squared value 
from sklearn.metrics import r2_score 
r2_score(y_test, y_pred)

0.9112695892268936
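
R squared is 1 minus the ratio of the residual sum of squares to the total sum of squares around the mean of y. A minimal sketch of that formula in NumPy; the helper name `r_squared` and the four-point example are made up for illustration:

```python
import numpy as np

def r_squared(y_true, y_hat):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_true = np.asarray(y_true, dtype=float)
    y_hat = np.asarray(y_hat, dtype=float)
    ss_res = np.sum((y_true - y_hat) ** 2)          # unexplained variation
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # total variation around the mean
    return 1.0 - ss_res / ss_tot

# Hypothetical actual values and predictions
score = r_squared([3.0, -0.5, 2.0, 7.0], [2.5, 0.0, 2.0, 8.0])
print(round(score, 3))
```

A value of 1 means the predictions match exactly, and a value near 0 means the model does no better than always predicting the mean.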

## Let's use Statsmodels and see the multiple regression summary 

In [14]:
import statsmodels.formula.api as smf
df = pd.DataFrame(X_train) # smf.ols requires a data argument 
# df has no columns named y_train or X_train, so the formula resolves both names from the notebook namespace
reg = smf.ols(formula='y_train~X_train', data=df).fit()
# ols --> ordinary least squares, i.e. linear regression 
reg.summary()

| | | | |
|---|---|---|---|
| Dep. Variable: | y_train | R-squared: | 0.959 |
| Model: | OLS | Adj. R-squared: | 0.958 |
| Method: | Least Squares | F-statistic: | 3672.0 |
| Date: | Wed, 08 Aug 2018 | Prob (F-statistic): | 0.0 |
| Time: | 09:13:36 | Log-Likelihood: | -8375.2 |
| No. Observations: | 800 | AIC: | 16760.0 |
| Df Residuals: | 794 | BIC: | 16790.0 |
| Df Model: | 5 | | |
| Covariance Type: | nonrobust | | |

| | coef | std err | t | P>\|t\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| Intercept | -5.104e+04 | 3998.483 | -12.764 | 0.000 | -5.89e+04 | -4.32e+04 |
| X_train[0] | -880.5366 | 738.404 | -1.192 | 0.233 | -2329.991 | 568.917 |
| X_train[1] | -698.1691 | 737.843 | -0.946 | 0.344 | -2146.523 | 750.185 |
| X_train[2] | 0.5258 | 0.034 | 15.523 | 0.000 | 0.459 | 0.592 |
| X_train[3] | 0.8444 | 0.031 | 26.983 | 0.000 | 0.783 | 0.906 |
| X_train[4] | 0.1076 | 0.016 | 6.548 | 0.000 | 0.075 | 0.140 |

| | | | |
|---|---|---|---|
| Omnibus: | 1508.437 | Durbin-Watson: | 2.004 |
| Prob(Omnibus): | 0.0 | Jarque-Bera (JB): | 2536304.512 |
| Skew: | 12.949 | Prob(JB): | 0.0 |
| Kurtosis: | 277.624 | Cond. No. | 3760000.0 |
