# Multiple Linear Regression

Multiple Linear Regression (MLR) models the relationship between one continuous target variable and two or more predictor variables (features) by fitting a linear equation to the data.

Y: one continuous target variable  
X: two or more predictor variables (x1, x2, ...)  
b0: the intercept (the value of Y when every x is 0)  
b1: the coefficient (parameter) of x1  
b2: the coefficient (parameter) of x2, and so on  
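
Written out with the symbols defined above, the model for n predictors can be stated as follows. The error term \varepsilon is standard notation and an addition here; it is not part of the original list.

```latex
Y = b_0 + b_1 x_1 + b_2 x_2 + \dots + b_n x_n + \varepsilon
```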


To estimate the coefficients of a multiple linear regression on a very large, high-dimensional dataset, you should typically use an iterative optimization approach (for example gradient descent) rather than a closed-form solution.
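
As a rough illustration of the optimization approach, the sketch below fits b0 and the coefficients by batch gradient descent on the mean squared error. The function name `fit_gradient_descent`, the learning rate, the iteration count, and the synthetic data are all assumptions made for this example; none of them come from the notebook.

```python
import numpy as np

def fit_gradient_descent(X, y, lr=0.1, n_iter=5000):
    """Hypothetical helper: minimise mean squared error by batch gradient descent.

    Returns (intercept, coefficients).
    """
    n_samples, n_features = X.shape
    b0 = 0.0
    b = np.zeros(n_features)
    for _ in range(n_iter):
        residual = X @ b + b0 - y                      # prediction error per row
        b0 -= lr * 2.0 * residual.mean()               # gradient w.r.t. intercept
        b -= lr * 2.0 * (X.T @ residual) / n_samples   # gradient w.r.t. coefficients
    return b0, b

# Synthetic, noiseless data (assumed for illustration only):
# y = 1.5 + 2.0 * x1 - 3.0 * x2
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 2))
y_demo = 1.5 + 2.0 * X_demo[:, 0] - 3.0 * X_demo[:, 1]

b0_hat, b_hat = fit_gradient_descent(X_demo, y_demo)
print(b0_hat, b_hat)
```

The learning rate works here because the synthetic features are roughly standardized; unscaled features such as price and volume would usually need scaling first.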

## Step 1: Data Preprocessing

In [1]:
# Importing the libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

import warnings
warnings.filterwarnings("ignore")

# fix_yahoo_finance (later renamed yfinance) is used to fetch the price data
import fix_yahoo_finance as yf
yf.pdr_override()  # only needed when downloading through pandas_datareader

In [2]:
# input
symbol = 'AMD'
start = '2014-01-01'
end = '2018-08-27'

# Read data 
dataset = yf.download(symbol,start,end)

# View columns 
dataset.head()

[*********************100%***********************]  1 of 1 downloaded


| Date | Open | High | Low | Close | Adj Close | Volume |
|---|---|---|---|---|---|---|
| 2014-01-02 | 3.85 | 3.98 | 3.84 | 3.95 | 3.95 | 20548400 |
| 2014-01-03 | 3.98 | 4.0 | 3.88 | 4.0 | 4.0 | 22887200 |
| 2014-01-06 | 4.01 | 4.18 | 3.99 | 4.13 | 4.13 | 42398300 |
| 2014-01-07 | 4.19 | 4.25 | 4.11 | 4.18 | 4.18 | 42932100 |
| 2014-01-08 | 4.23 | 4.26 | 4.14 | 4.18 | 4.18 | 30678700 |


In [3]:
X = dataset.iloc[:, 0:4].values  # predictors: Open, High, Low, Close
Y = dataset.iloc[:, 4].values    # target: Adj Close

In [4]:
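# Encoding categorical data
# Note: column 3 of X is the numeric Close price, not a category. One-hot
# encoding it creates one dummy column per distinct closing price, which is
# why the models fitted below have hundreds of coefficients.
# The categorical_features argument was removed in scikit-learn 0.22; newer
# versions select columns with sklearn.compose.ColumnTransformer instead.
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
labelencoder = LabelEncoder()
X[:, 3] = labelencoder.fit_transform(X[:, 3])
onehotencoder = OneHotEncoder(categorical_features=[3])
X = onehotencoder.fit_transform(X).toarray()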



In [5]:
# Avoiding the Dummy Variable Trap
X = X[: , 1:]
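
The reason for dropping one dummy column can be checked numerically: a full set of one-hot columns sums to the intercept column, so the design matrix loses rank. The sketch below uses a made-up three-level category (an assumption for illustration, not the AMD data) and compares the matrix rank with and without the first dummy.

```python
import numpy as np

# Hypothetical categorical feature with three levels (not from the AMD data).
levels = np.array([0, 1, 2, 0, 1, 2, 0, 1])
one_hot = np.eye(3)[levels]                 # shape (8, 3): one column per level
intercept = np.ones((len(levels), 1))

# Intercept plus all three dummies: the dummies sum to the intercept column.
full = np.hstack([intercept, one_hot])
# Intercept plus two dummies: the first dummy is dropped, as in X[:, 1:].
reduced = np.hstack([intercept, one_hot[:, 1:]])

rank_full = np.linalg.matrix_rank(full)
rank_reduced = np.linalg.matrix_rank(reduced)
print("columns:", full.shape[1], "rank:", rank_full)
print("columns:", reduced.shape[1], "rank:", rank_reduced)
```

When the rank is lower than the number of columns, the least-squares coefficients are not unique, which is the practical meaning of the trap.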

In [6]:
# Splitting the dataset into the Training set and Test set
from sklearn.model_selection import train_test_split
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size = 0.2, random_state = 0)
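
For intuition, a shuffled 80/20 split can be sketched in plain NumPy. `shuffled_split` is a hypothetical helper written for this example and is not scikit-learn's implementation; rounding the test count up is an assumption chosen so that 1172 rows would give the 235 test rows reported in the outputs further down.

```python
import numpy as np

def shuffled_split(X, y, test_size=0.2, seed=0):
    """Hypothetical helper: shuffle row indices and hold out a fraction for testing."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(X))                   # random order of row indices
    n_test = int(np.ceil(test_size * len(X)))       # assumed rounding rule
    test_idx, train_idx = idx[:n_test], idx[n_test:]
    return X[train_idx], X[test_idx], y[train_idx], y[test_idx]

# Tiny made-up arrays, only to show the resulting shapes.
X_demo = np.arange(20).reshape(10, 2)
y_demo = np.arange(10)
X_tr, X_te, y_tr, y_te = shuffled_split(X_demo, y_demo)
print(X_tr.shape, X_te.shape)
```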

## Step 2: Fitting Multiple Linear Regression to the Training set

In [7]:
from sklearn.linear_model import LinearRegression
regressor = LinearRegression()
regressor.fit(X_train, Y_train)

LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None,
         normalize=False)
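
Fitting an ordinary least squares model amounts to solving a least-squares problem for the intercept and coefficients. The sketch below solves the same kind of problem directly with `np.linalg.lstsq` on synthetic data (the data and variable names are assumptions for illustration); it is meant to show the idea, not how `LinearRegression` is implemented internally.

```python
import numpy as np

# Synthetic, noiseless data (assumed): y = 4.0 + 0.5*x1 - 1.0*x2 + 2.0*x3
rng = np.random.default_rng(1)
X_demo = rng.normal(size=(50, 3))
true_b = np.array([0.5, -1.0, 2.0])
y_demo = 4.0 + X_demo @ true_b

# Prepend a column of ones so the first solved parameter is the intercept b0.
A = np.hstack([np.ones((len(X_demo), 1)), X_demo])
solution, *_ = np.linalg.lstsq(A, y_demo, rcond=None)

intercept_hat, coef_hat = solution[0], solution[1:]
print(intercept_hat, coef_hat)
```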

## Step 3: Predicting the Test set results

In [8]:
y_pred = regressor.predict(X_test)
y_pred.shape

(235,)

In [9]:
Y_test.shape

(235,)

In [11]:
print('Multiple Linear Coefficients:', regressor.coef_)
print('Multiple Linear Intercept:', regressor.intercept_)
print('Multiple Linear Score:', regressor.score(X_test, Y_test))

Multiple Linear Coefficients: [  4.00000000e-02   5.00000000e-02   8.00000000e-02   9.00000000e-02
   1.00000000e-01   1.10000000e-01   4.78364709e-02  -9.23875088e-03
   1.40000000e-01   1.50000000e-01   1.60000000e-01   1.70000000e-01
   1.80000000e-01   1.90000000e-01   2.00000000e-01   2.10000000e-01
   2.20000000e-01   2.30000000e-01   2.40000000e-01   2.50000000e-01
   2.60000000e-01   2.70000000e-01   2.80000000e-01   2.90000000e-01
   3.00000000e-01   3.10000000e-01   3.20000000e-01   3.30000000e-01
   3.40000000e-01   3.50000000e-01   3.60000000e-01   3.70000000e-01
   3.80000000e-01   3.90000000e-01   4.00000000e-01   4.10000000e-01
   4.30000000e-01   4.50000000e-01   4.60000000e-01   4.70000000e-01
   4.80000000e-01   4.90000000e-01   5.00000000e-01   5.10000000e-01
   5.20000000e-01   1.09728445e-02   5.40000000e-01   5.60000000e-01
   5.70000000e-01   5.80000000e-01   5.90000000e-01   6.00000000e-01
   6.10000000e-01   1.71066983e-10   6.30000000e-01   6.40000000e-01
   6

In [12]:
import statsmodels.api as sm

# with statsmodels
X = sm.add_constant(X) # adding a constant
 
model = sm.OLS(Y, X).fit()
predictions = model.predict(X) 
 
print_model = model.summary()
print(print_model)

                            OLS Regression Results                            
Dep. Variable:                      y   R-squared:                       1.000
Model:                            OLS   Adj. R-squared:                  1.000
Method:                 Least Squares   F-statistic:                 1.310e+28
Date:                Wed, 21 Aug 2019   Prob (F-statistic):               0.00
Time:                        08:34:06   Log-Likelihood:                 34567.
No. Observations:                1172   AIC:                        -6.781e+04
Df Residuals:                     511   BIC:                        -6.446e+04
Df Model:                         660                                         
Covariance Type:            nonrobust                                         
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const          1.6200   8.69e-14   1.86e+13      0.0

In [13]:
model.rsquared

1.0

In [14]:
from statsmodels.formula.api import ols
from statsmodels.stats.anova import anova_lm

model = ols("Close ~ Open + High + Low", dataset).fit()

# Print the summary
print(model.summary())

print("\nRetrieving manually the parameter estimates:")
print(model.params)

# Perform analysis of variance on the fitted linear model
anova_results = anova_lm(model)

print('\nANOVA results')
print(anova_results)

plt.show()

                            OLS Regression Results                            
Dep. Variable:                  Close   R-squared:                       1.000
Model:                            OLS   Adj. R-squared:                  1.000
Method:                 Least Squares   F-statistic:                 8.718e+05
Date:                Wed, 21 Aug 2019   Prob (F-statistic):               0.00
Time:                        08:34:07   Log-Likelihood:                 998.51
No. Observations:                1172   AIC:                            -1989.
Df Residuals:                    1168   BIC:                            -1969.
Df Model:                           3                                         
Covariance Type:            nonrobust                                         
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept     -0.0037      0.005     -0.694      0.4

In [15]:
model.rsquared

0.99955360755490297

In [16]:
from sklearn.metrics import explained_variance_score, mean_absolute_error, mean_squared_error, r2_score
ex_var_score = explained_variance_score(Y_test, y_pred)
m_absolute_error = mean_absolute_error(Y_test, y_pred)
m_squared_error = mean_squared_error(Y_test, y_pred)
r_2_score = r2_score(Y_test, y_pred)

print("Explained Variance Score: " + str(ex_var_score))
print("Mean Absolute Error: " + str(m_absolute_error))
print("Mean Squared Error: " + str(m_squared_error))
print("R Squared Score: " + str(r_2_score))

Explained Variance Score: -0.0184926564206
Mean Absolute Error: 3.15829141594
Mean Squared Error: 33.2780801426
R Squared Score: -0.454451340198
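
To make these numbers easier to read, the sketch below computes the same four metrics from their usual definitions with NumPy. `regression_metrics` is a hypothetical helper and the small arrays are invented for illustration. The point it shows is that R squared turns negative whenever the predictions fit worse than simply predicting the mean of the true values, which is how a score like the one above can fall below zero.

```python
import numpy as np

def regression_metrics(y_true, y_pred):
    """Hypothetical helper: compute MAE, MSE, R^2 and explained variance by hand."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    err = y_true - y_pred
    ss_res = np.sum(err ** 2)                         # residual sum of squares
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)    # total sum of squares
    return {
        "mae": np.mean(np.abs(err)),
        "mse": np.mean(err ** 2),
        "r2": 1.0 - ss_res / ss_tot,
        "explained_variance": 1.0 - np.var(err) / np.var(y_true),
    }

# Made-up values: one close fit and one fit that is worse than the mean.
y_true_demo = np.array([1.0, 2.0, 3.0, 4.0])
good = regression_metrics(y_true_demo, np.array([1.1, 1.9, 3.2, 3.8]))
bad = regression_metrics(y_true_demo, np.array([4.0, 3.0, 2.0, 1.0]))
print(good["r2"], bad["r2"])
```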


In [17]:
msk = np.random.rand(len(dataset)) < 0.8
train = dataset[msk]
test = dataset[~msk]

regr = LinearRegression()
x = np.asanyarray(train[['Open','High','Low','Volume']])
y = np.asanyarray(train[['Adj Close']])
regr.fit (x, y)
print ('Coefficients: ', regr.coef_)
y_= regr.predict(test[['Open','High','Low','Volume']])
x = np.asanyarray(test[['Open','High','Low','Volume']])
y = np.asanyarray(test[['Adj Close']])
print("Mean squared error: %.2f" % np.mean((y_ - y) ** 2))
print('Variance score: %.2f' % regr.score(x, y))

Coefficients:  [[ -4.91601198e-01   7.66890389e-01   7.27064720e-01  -5.52148661e-10]]
Mean squared error: 0.01
Variance score: 1.00
