### Assumptions of Multiple Linear Regression

1) Linearity<br>
2) No Multicollinearity<br>
3) Homoscedasticity (No Heteroscedasticity)<br>
4) No Autocorrelation<br>
5) Errors (residuals) are normally distributed<br>
6) No missing values<br>
7) No Outliers<br>

### 1) Linearity

a) Each independent variable should have a linear relationship with the dependent variable<br>
b) <b>How to check linearity</b><br>
i) Plot a scatter chart of each independent variable against the dependent variable<br>
ii) Pairplot<br>
c) <b>What to do if this assumption is violated</b><br>
i) Apply a nonlinear transformation to the independent and/or dependent variable. Common examples include taking the log, the square root, or the reciprocal of the independent and/or dependent variable.
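
As a numeric companion to the scatter plot, one option is to compare the Pearson correlation before and after a transformation. The sketch below uses made-up data (an exponential relationship, not taken from this notebook) and a small helper named `pearson`, which is an illustrative name only.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical data: y grows exponentially with x, so the raw relationship is curved
x = list(range(1, 11))
y = [math.exp(0.5 * xi) for xi in x]

r_raw = pearson(x, y)
r_log = pearson(x, [math.log(yi) for yi in y])  # log-transform the dependent variable

print("correlation of x with y      :", round(r_raw, 3))
print("correlation of x with log(y) :", round(r_log, 3))
```

If the correlation moves much closer to 1 (or -1) after the transformation, the transformed variable is probably a better fit for a linear model. A scatter plot is still worth checking, since correlation alone can hide curvature.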


### 2) No Multicollinearity
1) The independent variables should not be correlated with each other. The presence of such correlation is known as multicollinearity.<br>
2) Multicollinearity exists when the independent variables are moderately or highly correlated. In a model with correlated variables, it becomes difficult to figure out the true relationship of each predictor with the response variable. In other words, it becomes difficult to find out which variable is actually contributing to the prediction of the response variable.<br>

3) <b>How to Check Multicollinearity</b><br>
a) You can use scatter plots to visualize the correlation among variables.<br>
b) Use the VIF (Variance Inflation Factor). A VIF value <= 4 suggests no multicollinearity, whereas a value >= 10 implies serious multicollinearity.<br>
c) VIF measures how much the variance of an estimated regression coefficient increases when the predictors are correlated. If the variance increases, the coefficient estimates are not reliable.<br>
d) A correlation table can also serve the purpose.<br>
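
The VIF of a predictor can be computed by regressing it on the remaining predictors and taking 1 / (1 - R2) of that regression. The following is a minimal NumPy sketch of that idea on synthetic data; the helper name `vif` and the variables `x1`, `x2`, `x3` are made up for illustration and are not part of this notebook's datasets.

```python
import numpy as np

def vif(X):
    """Return the VIF of each column of X (rows are observations)."""
    X = np.asarray(X, dtype=float)
    n, p = X.shape
    out = []
    for j in range(p):
        target = X[:, j]
        others = np.delete(X, j, axis=1)
        A = np.column_stack([np.ones(n), others])        # intercept + other predictors
        coef, *_ = np.linalg.lstsq(A, target, rcond=None)
        resid = target - A @ coef
        r2 = 1.0 - (resid @ resid) / np.sum((target - target.mean()) ** 2)
        out.append(1.0 / (1.0 - r2))
    return np.array(out)

# Hypothetical predictors: x2 is nearly a copy of x1, x3 is unrelated noise
rng = np.random.default_rng(0)
x1 = rng.normal(size=200)
x2 = x1 + rng.normal(scale=0.1, size=200)
x3 = rng.normal(size=200)

vifs = vif(np.column_stack([x1, x2, x3]))
print(vifs)
```

Here `x1` and `x2` should come out far above 10 while `x3` stays close to 1. statsmodels also provides `variance_inflation_factor` in `statsmodels.stats.outliers_influence`; with that function the design matrix should normally include a constant column.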

### 3) Homoscedasticity (No Heteroscedasticity)
1) Non-constant variance in the error terms is called heteroscedasticity. Generally, non-constant variance arises in the presence of outliers.<br>
2) Linear regression assumes that the variance of the errors is constant (homoscedasticity).<br>
3) <b>How to check</b><br>
a) Breusch-Pagan test<br>
b) Goldfeld-Quandt test
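
To make the Breusch-Pagan idea concrete, here is a rough NumPy sketch of the n * R2 (studentized) form of the statistic for a single predictor: fit the regression, regress the squared residuals on the predictor, and multiply the R2 of that auxiliary regression by n. The data are synthetic and the function name `breusch_pagan_lm` is made up for this illustration.

```python
import numpy as np

def breusch_pagan_lm(x, y):
    """LM statistic (n * R^2 form) of the Breusch-Pagan test for one predictor."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    A = np.column_stack([np.ones(len(x)), x])
    # Step 1: fit y ~ x and take the squared residuals
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    e2 = (y - A @ beta) ** 2
    # Step 2: regress the squared residuals on x and compute R^2
    gamma, *_ = np.linalg.lstsq(A, e2, rcond=None)
    fitted = A @ gamma
    r2 = 1.0 - np.sum((e2 - fitted) ** 2) / np.sum((e2 - e2.mean()) ** 2)
    # Step 3: LM = n * R^2, roughly chi-square with 1 degree of freedom
    return len(x) * r2

# Hypothetical data: same line, two different noise patterns
rng = np.random.default_rng(1)
x = rng.uniform(1, 10, size=500)
y_const = 2 + 3 * x + rng.normal(scale=1.0, size=500)   # constant error variance
y_hetero = 2 + 3 * x + rng.normal(size=500) * x         # error spread grows with x

lm_const = breusch_pagan_lm(x, y_const)
lm_hetero = breusch_pagan_lm(x, y_hetero)
print("LM, constant variance :", round(lm_const, 2))
print("LM, growing variance  :", round(lm_hetero, 2))
```

With one predictor, an LM value above 3.841 (the 5% critical value of the chi-square distribution with 1 degree of freedom) suggests heteroscedasticity. For real work, statsmodels offers `het_breuschpagan` in `statsmodels.stats.diagnostic`, which also returns p-values.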

### 4) No Autocorrelation
1) Correlation in the error terms drastically reduces the model’s accuracy. This usually occurs in time series models where the next instant is dependent on the previous instant. If the error terms are correlated, the estimated standard errors tend to underestimate the true standard error.<br>
2) Errors are assumed to be randomly scattered around the regression line.<br>
3) Autocorrelation measures the relationship between a variable's present value and its past values.

4) <b>How to Check</b><br>
a) Durbin-Watson test<br>
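
The Durbin-Watson statistic is the sum of squared differences between consecutive residuals divided by the sum of squared residuals. A small pure-Python sketch, using two made-up residual series (not residuals from this notebook's models):

```python
def durbin_watson(residuals):
    """Durbin-Watson statistic for a sequence of residuals in time order."""
    num = sum((residuals[t] - residuals[t - 1]) ** 2 for t in range(1, len(residuals)))
    den = sum(e ** 2 for e in residuals)
    return num / den

# Hypothetical residuals
trending = [1.0, 0.8, 0.6, 0.4, 0.2, -0.2, -0.4, -0.6, -0.8, -1.0]      # neighbours are similar
alternating = [1.0, -1.0, 1.0, -1.0, 1.0, -1.0, 1.0, -1.0, 1.0, -1.0]  # sign flips every step

dw_trending = durbin_watson(trending)
dw_alternating = durbin_watson(alternating)
print("trending residuals   :", round(dw_trending, 3))
print("alternating residuals:", round(dw_alternating, 3))
```

The statistic lies between 0 and 4. A value near 2 indicates no autocorrelation, values toward 0 indicate positive autocorrelation, and values toward 4 indicate negative autocorrelation. statsmodels exposes the same calculation as `durbin_watson` in `statsmodels.stats.stattools`.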

### 5) Errors (Residuals) are Normally Distributed
### 6) No Missing Values
### 7) No Outliers

### OLS Model (Ordinary Least Squares)
This method aims to minimize the SSE (Sum of Squared Errors).

#### Least Square Linear Regression
ypred = mx + c<br>
Error = (y1 - ypred1)^2 + (y2 - ypred2)^2 + (y3 - ypred3)^2 + .... + (yn - ypredn)^2

SSE = Sum((yi - ypredi)^2), for i = 1 to n

<img src="cost_fun.png" height="150" width="250">
<img src="cost_fun2.png" height="150" width="250">


To minimize the SSE, its first-order partial derivatives with respect to m and c should be equal to zero.
Solving the resulting equations we get<br>
<img src="mc_linreg_ols.png">
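
The standard closed-form solution for simple linear regression is m = Sum((xi - xmean)(yi - ymean)) / Sum((xi - xmean)^2) and c = ymean - m * xmean. The sketch below applies it in plain Python to the x/y data used in the first example of this notebook; the variable names are chosen for illustration.

```python
# x/y data from the first example in this notebook
x = [20, 30, 25, 17, 36, 40, 46, 55, 52, 27, 50, 33, 42, 21, 48, 37]
y = [41, 63, 51, 39, 78, 75, 87, 98, 102, 60, 101, 70, 86, 47, 90, 69]

n = len(x)
x_mean = sum(x) / n
y_mean = sum(y) / n

# Slope: co-variation of x and y divided by the variation of x
m = (sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y))
     / sum((xi - x_mean) ** 2 for xi in x))
# Intercept: the fitted line passes through (x_mean, y_mean)
c = y_mean - m * x_mean

print("m =", round(m, 4))
print("c =", round(c, 4))
```

These values should agree with the slope (about 1.699) and intercept (about 10.83) that `LinearRegression` and the statsmodels summary report for the same data.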

### Decomposition of Variability
It deals with the measures that determine how good a regression model is:
1) SSE<br>
2) SSR<br>
3) SST<br>
4) R2 Score<br>
5) Adjusted R2 Score<br>
6) AIC <br>
7) BIC <br>
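
A short sketch of the first four quantities, assuming SSE is the error (residual) sum of squares, SSR is the regression (explained) sum of squares, and SST is the total sum of squares. Some texts swap the meaning of SSE and SSR, so treat the naming as an assumption. The data are the x/y values from the first example in this notebook.

```python
x = [20, 30, 25, 17, 36, 40, 46, 55, 52, 27, 50, 33, 42, 21, 48, 37]
y = [41, 63, 51, 39, 78, 75, 87, 98, 102, 60, 101, 70, 86, 47, 90, 69]

n = len(x)
x_mean, y_mean = sum(x) / n, sum(y) / n

# Least-squares line y_pred = m*x + c
m = (sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y))
     / sum((xi - x_mean) ** 2 for xi in x))
c = y_mean - m * x_mean
y_pred = [m * xi + c for xi in x]

sse = sum((yi - pi) ** 2 for yi, pi in zip(y, y_pred))  # unexplained variation
ssr = sum((pi - y_mean) ** 2 for pi in y_pred)          # explained variation
sst = sum((yi - y_mean) ** 2 for yi in y)               # total variation

r2 = 1 - sse / sst
print("SSE =", round(sse, 2), "SSR =", round(ssr, 2), "SST =", round(sst, 2))
print("R2  =", round(r2, 4))
```

For a least-squares fit with an intercept, SST = SSR + SSE, so R2 can equivalently be written as SSR / SST.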

#### Can R2 be negative?
<img src="r2_negative.png">
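
Since R2 = 1 - SSE/SST, it becomes negative whenever the predictions are worse than simply predicting the mean of y (SSE > SST). A tiny example with made-up numbers:

```python
y_true = [1, 2, 3]
y_pred = [3, 2, 1]   # hypothetical predictions that trend in the wrong direction

y_mean = sum(y_true) / len(y_true)
sse = sum((yt - yp) ** 2 for yt, yp in zip(y_true, y_pred))  # 4 + 0 + 4 = 8
sst = sum((yt - y_mean) ** 2 for yt in y_true)               # 1 + 0 + 1 = 2

r2 = 1 - sse / sst
print(r2)  # -3.0
```

`r2_score(y_true, y_pred)` from `sklearn.metrics`, which this notebook already imports, should return the same value. This situation typically arises when a model is evaluated on data it was not fitted to, or when it is fitted without an intercept.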

### Adjusted R2 Score
1) It measures the proportion of variation explained by only those independent variables that really help in explaining the dependent variable. It penalizes you for adding independent variables that don’t help in predicting the dependent variable.<br>
2) The Adjusted R-squared takes into account the number of independent variables used for predicting the target variable. In doing so, we can determine whether adding new variables to the model actually improves the model fit.<br>
3) Adjusted R-squared is never higher than R-squared, and it can be negative when the model explains very little of the variation.
<img src="adjusted_r2.png">

where<br>
n represents the number of data points in our dataset<br>
k represents the number of independent variables<br>
R represents the R-squared values determined by the model.<br>


4) If R-squared does not increase significantly on the addition of a new independent variable, then the value of Adjusted R-squared will actually decrease.<br>
5) If on adding the new independent variable we see a significant increase in R-squared value, then the Adjusted R-squared value will also increase.

<img src="adjusted_r2_theory.png" height="200" width="350" align="left">
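
The effect of adding a variable can be illustrated with the formula Adjusted R2 = 1 - (1 - R2)(n - 1)/(n - k - 1). In the sketch below, the baseline (R2 = 0.967, n = 16, k = 1) comes from the first example in this notebook, while the two R2 values for the two-variable models are hypothetical.

```python
def adjusted_r2(r2, n, k):
    """Adjusted R-squared for n data points and k independent variables."""
    return 1 - (1 - r2) * (n - 1) / (n - k - 1)

base = adjusted_r2(0.967, n=16, k=1)        # one-variable model from this notebook
small_gain = adjusted_r2(0.968, n=16, k=2)  # hypothetical: new variable barely raises R2
large_gain = adjusted_r2(0.980, n=16, k=2)  # hypothetical: new variable clearly raises R2

print("baseline            :", round(base, 4))
print("small gain in R2    :", round(small_gain, 4))
print("large gain in R2    :", round(large_gain, 4))
```

The small gain in R2 does not make up for the extra variable, so Adjusted R2 falls below the baseline. The large gain pushes it above the baseline.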

### AIC (Akaike Information Criterion)
<img src="aic.png" align="left" height="250" width="400">

### BIC (Bayesian Information Criterion)
<img src="bic.png" align="left" height="300" width="400">
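
In their log-likelihood form, AIC = 2k - 2 ln(L) and BIC = k ln(n) - 2 ln(L), where k is the number of estimated parameters and n is the number of observations. The figures above may present an equivalent form. As a rough check, the sketch below plugs in the log-likelihood and observation count shown in the OLS summary in this notebook, assuming k = 2 (intercept plus one slope).

```python
import math

log_likelihood = -43.439  # "Log-Likelihood" in the OLS summary
n_obs = 16                # "No. Observations" in the OLS summary
n_params = 2              # assumption: intercept + one slope

aic = 2 * n_params - 2 * log_likelihood
bic = n_params * math.log(n_obs) - 2 * log_likelihood

print(round(aic, 2), round(bic, 2))  # 90.88 92.42
```

Both values match the AIC and BIC lines of the summary. For either criterion, a lower value indicates a better trade-off between fit and model complexity when comparing models fitted to the same data.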

In [10]:
import numpy as np
import pandas as pd
from sklearn.metrics import r2_score
from sklearn.linear_model import LinearRegression

In [18]:
a = [20,30,25,17,36,40,46,55,52,27,50,33,42,21,48,37]
b = [41,63,51,39,78,75,87,98,102,60,101,70,86,47,90,69]

df = pd.DataFrame({'x':a,'y':b})
df.head()

Unnamed: 0,x,y
0,20,41
1,30,63
2,25,51
3,17,39
4,36,78


In [19]:
x = df[['x']]
y = df['y']
print(x.shape)
print(y.shape)

(16, 1)
(16,)


In [20]:
import statsmodels.api as sm

In [22]:
x = sm.add_constant(x)

result = sm.OLS(y, x).fit()
 
print(result.summary())

                            OLS Regression Results                            
Dep. Variable:                      y   R-squared:                       0.967
Model:                            OLS   Adj. R-squared:                  0.965
Method:                 Least Squares   F-statistic:                     415.7
Date:                Mon, 30 May 2022   Prob (F-statistic):           8.28e-12
Time:                        09:10:18   Log-Likelihood:                -43.439
No. Observations:                  16   AIC:                             90.88
Df Residuals:                      14   BIC:                             92.42
Df Model:                           1                                         
Covariance Type:            nonrobust                                         
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const         10.8313      3.170      3.417      0.0



#### Interpreting Results

1) R-squared : the coefficient of determination. It is the proportion of the variance in the dependent variable that is explained by the independent variables.

2) Adj. R-squared : Adjusted R-squared is the modified form of R-squared adjusted for the number of independent variables in the model. The value of Adj. R-squared increases only when the extra variables we include actually improve the model.

3) F-statistic : the ratio of the mean square explained by the model to the mean square of the residuals. It determines the overall significance of the model.

4) coef : the coefficients of the independent variables and the constant term in the equation.

5) t : the value of the t-statistic. It is the ratio of the difference between the estimated and hypothesized value of a parameter to its standard error.
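
Two of these numbers can be reproduced by hand as a sanity check. The sketch below assumes the usual relations F = (R2 / k) / ((1 - R2) / (n - k - 1)) and t = (estimate - hypothesized value) / standard error, with a hypothesized value of 0. The inputs are taken from this notebook's outputs for the one-variable model.

```python
r2 = 0.9674199185802056  # R-squared of the model (0.967 in the summary)
n, k = 16, 1             # observations and independent variables

# F-statistic from R-squared
f_stat = (r2 / k) / ((1 - r2) / (n - k - 1))

# t-statistic for the constant term, from the "const" row of the summary
coef, std_err = 10.8313, 3.170
t_stat = (coef - 0) / std_err

print(round(f_stat, 1), round(t_stat, 3))  # 415.7 3.417
```

Both results agree with the F-statistic and the t value for `const` shown in the summary, up to rounding of the inputs.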

In [30]:
# Calculation of Adjusted R2_score
n=16
k=1
adj_r2 = 1 - ((1-0.967)*(n-1)/(n-k-1))
print(adj_r2)

0.9646428571428571


#### Generating Predictions

In [31]:
x = df[['x']]
y = df['y']
print(x.shape)
print(y.shape)

(16, 1)
(16,)


In [32]:
model = LinearRegression()
model.fit(x,y)
print(model.score(x,y))

0.9674199185802056


In [34]:
# Y_pred = mx + c
m = model.coef_
c = model.intercept_
print('Coefficient or Slope',m)
print('Intercept or Constant',c)

Coefficient or Slope [1.69896233]
Intercept or Constant 10.831300639658856
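
With the slope and intercept in hand, a prediction is just Y_pred = mx + c. A minimal sketch using the values printed above; the input x = 30 and the helper name `predict` are chosen only for illustration.

```python
m = 1.69896233           # slope printed above
c = 10.831300639658856   # intercept printed above

def predict(x_new):
    """Predicted y for a single x value using Y_pred = m*x + c."""
    return m * x_new + c

y_new = predict(30)
print(round(y_new, 2))  # 61.8
```

The fitted model should give the same result through `model.predict`, which expects a 2-D input shaped like the training data, for example `model.predict(pd.DataFrame({'x': [30]}))`.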


#### Example - 2

In [65]:

Stock_Market = {'Year': [2017,2017,2017,2017,2017,2017,2017,2017,2017,2017,2017,2017,2016,2016,2016,2016,2016,2016,2016,2016,2016,2016,2016,2016],
                'Month': [12, 11,10,9,8,7,6,5,4,3,2,1,12,11,10,9,8,7,6,5,4,3,2,1],
                'Interest_Rate': [2.75,2.5,2.5,2.5,2.5,2.5,2.5,2.25,2.25,2.25,2,2,2,1.75,1.75,1.75,1.75,1.75,1.75,1.75,1.75,1.75,1.75,1.75],
                'Unemployment_Rate': [5.3,5.3,5.3,5.3,5.4,5.6,5.5,5.5,5.5,5.6,5.7,5.9,6,5.9,5.8,6.1,6.2,6.1,6.1,6.1,5.9,6.2,6.2,6.1],
                'Stock_Index_Price': [1464,1394,1357,1293,1256,1254,1234,1195,1159,1167,1130,1075,1047,965,943,958,971,949,884,866,876,822,704,719]        
                }

df1 = pd.DataFrame(Stock_Market,columns=['Year','Month','Interest_Rate','Unemployment_Rate','Stock_Index_Price']) 
df1.head()

Unnamed: 0,Year,Month,Interest_Rate,Unemployment_Rate,Stock_Index_Price
0,2017,12,2.75,5.3,1464
1,2017,11,2.5,5.3,1394
2,2017,10,2.5,5.3,1357
3,2017,9,2.5,5.3,1293
4,2017,8,2.5,5.4,1256


In [91]:
df1.shape

(24, 5)

#### Problem Statement - Based on Interest_Rate and Unemployment_Rate, predict the Stock_Index_Price