# What Is Multiple Linear Regression (MLR)?

<img src="First.png">

<img src="Image1.jpg">

Multiple linear regression (MLR), also known simply as multiple regression,
is a statistical technique that uses several independent (explanatory) variables
to predict the outcome of a dependent (response) variable.
In the model `y = b0 + b1*x1 + b2*x2 + ... + bk*xk`, b0 is the intercept and b1, b2, b3, ... are the regression coefficients of the independent variables.

With simple linear regression, there are only two regression coefficients, b0 and b1,
and therefore only two normal equations. Finding a least-squares solution involves solving
two equations with two unknowns, a task that is easily managed with ordinary algebra.

With multiple regression, things get more complicated. There are k independent variables and k + 1 regression coefficients. There are k + 1 normal equations. Finding a least-squares solution involves solving k + 1 equations with k + 1 unknowns. This can be done with ordinary algebra, but it is unwieldy.
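
The k + 1 normal equations are usually written in matrix form as (XᵀX) b = Xᵀy and solved numerically rather than by hand. The following is a minimal sketch of that idea, assuming a small synthetic data set; the names `features`, `true_b`, and the chosen coefficient values are illustrative and not part of this notebook's data.

```python
import numpy as np

# Hypothetical toy data: 6 observations, k = 2 independent variables.
rng = np.random.default_rng(0)
features = rng.normal(size=(6, 2))
true_b = np.array([1.5, -2.0, 0.5])  # b0, b1, b2 chosen only for illustration

# Prepend a column of ones so that b0 acts as the intercept.
X = np.column_stack([np.ones(len(features)), features])
y = X @ true_b  # noise-free response, so the fit should recover true_b

# Normal equations: (X^T X) b = X^T y, a (k + 1) x (k + 1) linear system.
b = np.linalg.solve(X.T @ X, X.T @ y)
print(b)
```

Because the toy response contains no noise, the solved coefficients should match `true_b` up to floating-point error. In practice, library routines such as `np.linalg.lstsq` are generally preferred over forming XᵀX explicitly, since they are numerically more stable.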

The equations below are formulated using the ordinary least squares (OLS) method.
<img src="Linear_Equation.png">

# Error Function

<img src="Error.png">
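
The figure above is not reproduced in text. Assuming it shows the usual least-squares error, that is, the sum of squared differences between observed and predicted values, the quantity can be sketched as follows. The function name `sum_squared_error` and the sample numbers are illustrative assumptions.

```python
import numpy as np

def sum_squared_error(y_true, y_pred):
    """Sum of squared residuals between observed and predicted values."""
    residuals = np.asarray(y_true, dtype=float) - np.asarray(y_pred, dtype=float)
    return float(np.sum(residuals ** 2))

# Hypothetical observed and predicted values.
y_true = [3.0, 5.0, 7.0]
y_hat = [2.5, 5.0, 8.0]

sse = sum_squared_error(y_true, y_hat)  # 0.5**2 + 0.0**2 + 1.0**2
mse = sse / len(y_true)                 # mean squared error
print(sse, mse)
```

Ordinary least squares chooses the coefficients that make this sum as small as possible for the training data.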

In [20]:
# Importing the libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

In [21]:
#Importing the dataset
dataset=pd.read_csv('Data1.csv')
print(dataset.head(2))

   R&D Spend  Administration  Infrastructure Spend Country     Profit
0   165349.2       136897.80             471784.10      US  192261.83
1   162597.7       151377.59             443898.53      UK  191792.06


In [22]:
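X=dataset.iloc[:,:-1].values # all columns except Profit
y=dataset.iloc[:,4].values # only the Profit column
# Correlation between the numeric columns. Note that dataset[:][:4] selects the first four rows,
# so the matrix below is computed from those rows only; numeric_only=True (pandas >= 1.5) skips Country.
dataset[:][:4].corr(numeric_only=True)
# print(X)
# print(y)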

                      R&D Spend  Administration  Infrastructure Spend    Profit
R&D Spend              1.000000        0.667633              0.976422  0.899008
Administration         0.667633        1.000000              0.669849  0.338165
Infrastructure Spend   0.976422        0.669849              1.000000  0.813424
Profit                 0.899008        0.338165              0.813424  1.000000


In [23]:
# Encoding the categorical data
from sklearn.preprocessing import LabelEncoder,OneHotEncoder
from sklearn.compose import ColumnTransformer
labelEncoder_X=LabelEncoder() # maps each country name to an integer code, e.g. [1,2,0,2,0,1]
X[:,3]=labelEncoder_X.fit_transform(X[:,3])
print(X[:2]) # printing 2 rows
# OneHotEncoder(categorical_features=[3]) was removed in scikit-learn 0.22; ColumnTransformer selects the column instead
columnTransformer=ColumnTransformer([('country',OneHotEncoder(),[3])],remainder='passthrough',sparse_threshold=0)
X=columnTransformer.fit_transform(X).astype(float) # integer codes become one-hot columns [[1,0,0],[0,1,0],...], placed first
print(X[:2]) # printing 2 rows


[[165349.2 136897.8 471784.1 2]
 [162597.7 151377.59 443898.53 1]]
[[0.0000000e+00 0.0000000e+00 1.0000000e+00 1.6534920e+05 1.3689780e+05
  4.7178410e+05]
 [0.0000000e+00 1.0000000e+00 0.0000000e+00 1.6259770e+05 1.5137759e+05
  4.4389853e+05]]


Note: recent versions of scikit-learn let OneHotEncoder handle string categories directly, so the LabelEncoder step used above to convert the categories to integers is optional.


In [24]:
# Avoiding the dummy variable trap
# The dummy variable trap is a scenario in which the independent variables are multicollinear:
# two or more variables are highly correlated; in simple terms, one variable can be predicted from the others.
X=X[:,1:] # drop the first dummy column; e.g. for [cat, dog], "not cat" already implies dog, so one column is enough

#Splitting into training and test set

from sklearn.model_selection import train_test_split
X_train,X_test,y_train,y_test=train_test_split(X,y,test_size=0.2,random_state=0)

#fitting the multiple regression model
from sklearn.linear_model import LinearRegression
regressor=LinearRegression()
regressor.fit(X_train,y_train)
print(regressor.coef_)
print(regressor.intercept_)

[9.59284160e+02 1.65865321e+03 7.73467193e-01 3.28845975e-02
 3.66100259e-02]
41594.8834576995
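
The dummy variable trap mentioned in the cell above can be checked numerically. The sketch below uses hypothetical category codes (not the notebook's data) and compares the rank of a design matrix that keeps all dummy columns with one that drops the first dummy column.

```python
import numpy as np

# Hypothetical category codes for six rows (0, 1, 2 stand for three countries).
codes = np.array([0, 1, 2, 1, 0, 2])
one_hot = np.eye(3)[codes]            # full set of three dummy columns
intercept = np.ones((len(codes), 1))  # column of ones for b0

full = np.hstack([intercept, one_hot])            # intercept + all 3 dummies
reduced = np.hstack([intercept, one_hot[:, 1:]])  # intercept + 2 dummies

rank_full = np.linalg.matrix_rank(full)
rank_reduced = np.linalg.matrix_rank(reduced)
print(full.shape[1], rank_full)        # 4 columns, rank 3: one column is redundant
print(reduced.shape[1], rank_reduced)  # 3 columns, rank 3: no redundancy
```

With all three dummies present, the dummy columns sum to the intercept column, so the matrix loses a rank and the coefficients are no longer uniquely determined. Dropping one dummy column removes that redundancy.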


In [25]:
#predicting the test set result
y_pred=regressor.predict(X_test)
print(y_pred)

[103015.20159796 132582.27760814 132447.73845175  71976.09851259
 178537.48221055 116161.24230166  67851.69209678  98791.73374688
 113969.43533013 167921.06569551]
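
One way to judge predictions like these is to compare them with the held-out targets using the coefficient of determination (R-squared). The sketch below is a plain NumPy version under stated assumptions: `r_squared`, `y_test_demo`, and `y_pred_demo` are hypothetical names and values, not the notebook's `y_test` and `y_pred`.

```python
import numpy as np

def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)         # unexplained variation
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # total variation
    return float(1.0 - ss_res / ss_tot)

# Hypothetical held-out targets and model predictions.
y_test_demo = np.array([100.0, 130.0, 140.0, 70.0])
y_pred_demo = np.array([103.0, 132.0, 132.0, 72.0])

r2 = r_squared(y_test_demo, y_pred_demo)
print(round(r2, 3))
```

A value close to 1 means the predictions explain most of the variation in the targets. With the notebook's own variables, the equivalent call would be `r_squared(y_test, y_pred)`.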


In [27]:
#building the optimal model using backward elimination
import statsmodels.formula.api as sm
import statsmodels.regression.linear_model as lm
# statsmodels' OLS does not add an intercept, so prepend a column of ones to X
X=np.append(arr=np.ones((X.shape[0],1)),values=X,axis=1)
X_opt=X[:,[0,1,2,3,4,5]]

regressor_OLS=lm.OLS(endog=y,exog=X_opt).fit() # OLS = ordinary least squares
print(regressor_OLS.summary())

# # eliminating x1 from X (removing column 1)
# X_opt = X[:, [0, 2, 3, 4, 5]]
# regressor_OLS=lm.OLS(endog=y,exog=X_opt).fit() # OLS = ordinary least squares
# print(regressor_OLS.summary())

# # eliminating x2 from X (removing column 2)
# X_opt = X[:, [0, 3, 4, 5]]
# regressor_OLS=lm.OLS(endog=y,exog=X_opt).fit() # OLS = ordinary least squares
# print(regressor_OLS.summary())

# # eliminating x4 from X (removing column 4)
# X_opt = X[:, [0, 3, 5]]
# regressor_OLS=lm.OLS(endog=y,exog=X_opt).fit() # OLS = ordinary least squares
# print(regressor_OLS.summary())

# # eliminating x5 from X (removing column 5)
# X_opt = X[:, [0, 3]]
# regressor_OLS=lm.OLS(endog=y,exog=X_opt).fit() # OLS = ordinary least squares
# print(regressor_OLS.summary())


                            OLS Regression Results                            
Dep. Variable:                      y   R-squared:                       0.948
Model:                            OLS   Adj. R-squared:                  0.943
Method:                 Least Squares   F-statistic:                     205.0
Date:                Sat, 28 Nov 2020   Prob (F-statistic):           2.90e-28
Time:                        17:05:59   Log-Likelihood:                -526.75
No. Observations:                  50   AIC:                             1064.
Df Residuals:                      45   BIC:                             1073.
Df Model:                           4                                         
Covariance Type:            nonrobust                                         
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const       2.785e+04   3251.266      8.565      0.0