# Multiple Linear Regression

In [2]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

## Objectives

## Regression With Multiple Predictors

Multiple linear regression is a statistical technique that allows us to model the relationship between two or more independent variables and a dependent variable.

It lets us study how a dependent variable changes as several independent variables change.

Use multiple linear regression when you want to know:
* How strong the relationship is between two or more independent variables and one dependent variable.
* The value of the dependent variable at a certain value of the independent variables.

Formula and Calculation of Multiple Linear Regression

>>>$y = \beta_0 + \beta_1x_1 + \beta_2x_2 + ... + \beta_px_p + \varepsilon$

Where:
* $y$ is the dependent variable
* $\beta_0$ is the intercept
* $\beta_1,\beta_2,...,\beta_p$ are the regression coefficients, which represent the change in y when the corresponding independent variable changes by one unit, holding all other variables constant.
* $x_1,x_2,...,x_p$ are the independent variables.
* $\varepsilon$ is the error term, that is, the random variation in $y$ that is not explained by the independent variables.

To estimate the regression coefficients, we use Ordinary Least Squares (OLS), which minimizes the sum of squared residuals, that is, the squared differences between the actual and predicted values of the dependent variable.
The OLS method finds the values of $\beta_0$, $\beta_1$, $\beta_2$, ..., $\beta_p$ that minimize the following expression:

>>> $\sum\limits_{i=1}^n(y_i-\hat{y}_i)^2 = \sum\limits_{i=1}^n(y_i - \beta_0 - \beta_1x_{i1}-\beta_2x_{i2}-...-\beta_px_{ip})^2$

Where:

* $y_i$ is the actual value of the dependent variable for observation $i$.
* $\hat{y}_i$ is the predicted value of the dependent variable for observation $i$.
* $x_{i1},x_{i2},...,x_{ip}$ are the values of the independent variables for observation $i$.


Once we have estimated the regression coefficients, we can use the model to make predictions.
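As a minimal sketch of the estimation and prediction steps described above, the snippet below fits the coefficients by least squares with NumPy and then predicts for one new observation. The synthetic data and the names `X_demo`, `y_demo`, `true_beta` and `new_point` are assumptions made purely for illustration; they are not part of the advertising data used later.

```python
import numpy as np

# Synthetic data (illustrative assumption): y = 1 + 2*x1 - 3*x2 + noise
rng = np.random.default_rng(0)
n = 200
X_demo = rng.normal(size=(n, 2))
true_beta = np.array([1.0, 2.0, -3.0])  # beta_0, beta_1, beta_2

# Design matrix with a leading column of ones for the intercept
design = np.column_stack([np.ones(n), X_demo])
y_demo = design @ true_beta + rng.normal(scale=0.1, size=n)

# OLS: the coefficients that minimize the sum of squared residuals
beta_hat, *_ = np.linalg.lstsq(design, y_demo, rcond=None)
residuals = y_demo - design @ beta_hat
rss = float(np.sum(residuals ** 2))

# Prediction for a new observation with x1 = 0.5 and x2 = -1.0
new_point = np.array([1.0, 0.5, -1.0])  # leading 1 multiplies the intercept
prediction = float(new_point @ beta_hat)

print("estimated coefficients:", beta_hat)
print("sum of squared residuals:", rss)
print("prediction:", prediction)
```

Moving any coefficient away from `beta_hat` can only increase the sum of squared residuals, which is what "least squares" means here.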

## Comparison of Multiple Linear Regression with Simple Linear Regression

The main idea is simple. In simple linear regression (SLR), the dependent variable is modeled as a function of a single independent variable; in multiple linear regression (MLR), it is modeled as a function of several independent variables.

MLR accounts for the variance in the dependent variable due to all the predictor variables, whereas SLR accounts for the variance due to only one. As a result, an MLR model often fits the data better than an SLR model, provided the additional predictors carry real information about the dependent variable.

📝 Adding more predictor variables to the model does not necessarily increase the accuracy of the model. Sometimes adding too many predictor variables to the model can lead to overfitting, where the model fits the training data too closely and does not generalize well to new data.
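The note above can be illustrated with a small sketch on synthetic data (all names and numbers here are assumptions for illustration). Adding predictors that are unrelated to the response never lowers $R^2$ on the training data, so a higher training $R^2$ alone is not evidence of a better model. Adjusted $R^2$, computed as $1 - (1 - R^2)\frac{n-1}{n-p-1}$, penalizes the extra predictors and may fall.

```python
import numpy as np

def fit_r2(design, y):
    """Training R^2 of an OLS fit; `design` must already contain an intercept column."""
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    resid = y - design @ coef
    return 1.0 - np.sum(resid ** 2) / np.sum((y - y.mean()) ** 2)

def adjusted_r2(r2, n, p):
    """Adjusted R^2 for n observations and p predictors (intercept not counted)."""
    return 1.0 - (1.0 - r2) * (n - 1) / (n - p - 1)

rng = np.random.default_rng(1)
n = 60
x_demo = rng.normal(size=n)
y_demo = 2.0 + 1.5 * x_demo + rng.normal(size=n)
noise = rng.normal(size=(n, 20))  # 20 predictors unrelated to y_demo

small = np.column_stack([np.ones(n), x_demo])  # intercept + 1 real predictor
large = np.column_stack([small, noise])        # plus 20 noise predictors

r2_small = fit_r2(small, y_demo)
r2_large = fit_r2(large, y_demo)

print("R^2, 1 predictor:       ", r2_small)
print("R^2, 21 predictors:     ", r2_large)
print("adj. R^2, 1 predictor:  ", adjusted_r2(r2_small, n, 1))
print("adj. R^2, 21 predictors:", adjusted_r2(r2_large, n, 21))
```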

## Assumptions of Multiple Linear Regression

* **Linearity** - There is a linear relationship between each independent variable and the dependent variable.

With multiple predictors, we also need to check for collinearity.

**Collinearity** (also called multicollinearity) occurs when two or more predictor variables in a regression model are highly correlated with one another. It leads to unreliable coefficient estimates and inflated standard errors, and it makes the individual effect of each predictor on the response variable difficult to interpret.

To check for it, we look at two collinearity statistics reported for each coefficient: Tolerance and the Variance Inflation Factor (VIF).

Tolerance measures the proportion of variation in one predictor variable that cannot be explained by the other predictor variables in the model, that is, $1 - R^2$ from regressing that predictor on the others. A Tolerance value close to 1 indicates little or no collinearity, and a value close to 0 indicates high collinearity between the predictor variables.

VIF is calculated as 1/Tolerance. It starts at 1 and has no upper limit. As a common rule of thumb, a VIF of 1 indicates no collinearity; a value between 1 and 5 indicates moderate correlation between a given predictor and the other predictors, often not severe enough to require attention; and a value greater than 5 suggests severe correlation between a given predictor and the other predictors in the model. A high VIF means that an independent variable can be predicted well by the other independent variables in the data.

Here, $R^2$ is obtained by regressing one independent variable on all the other independent variables; it measures how well that variable is described by the others. A high $R^2$ means that the variable is highly correlated with the other variables.

>>>$VIF = \frac{1}{1-R^2}$

The closer $R^2$ is to 1, the higher the VIF and the stronger the multicollinearity involving that independent variable.
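To make the formula concrete, here is a minimal NumPy sketch that regresses each predictor on the others (with an intercept), takes the resulting $R^2$, and converts it to Tolerance and VIF. The synthetic predictors `x1`, `x2`, `x3` and the helper `r_squared` are assumptions for illustration only; `x3` is deliberately built as a near copy of `x1`.

```python
import numpy as np

def r_squared(target, others):
    """R^2 from regressing `target` on `others` plus an intercept."""
    design = np.column_stack([np.ones(len(target)), others])
    coef, *_ = np.linalg.lstsq(design, target, rcond=None)
    resid = target - design @ coef
    return 1.0 - np.sum(resid ** 2) / np.sum((target - target.mean()) ** 2)

rng = np.random.default_rng(42)
n = 500
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)                    # unrelated to x1 and x3
x3 = 0.9 * x1 + 0.1 * rng.normal(size=n)   # almost a rescaled copy of x1
X_demo = np.column_stack([x1, x2, x3])

vifs, tolerances = {}, {}
for j, name in enumerate(["x1", "x2", "x3"]):
    r2 = r_squared(X_demo[:, j], np.delete(X_demo, j, axis=1))
    tolerances[name] = 1.0 - r2        # Tolerance = 1 - R^2
    vifs[name] = 1.0 / (1.0 - r2)      # VIF = 1 / (1 - R^2) = 1 / Tolerance
    print(f"{name}: R^2={r2:.3f}  Tolerance={tolerances[name]:.3f}  VIF={vifs[name]:.1f}")
```

In this setup `x1` and `x3` should show large VIFs because each is well predicted by the other, while `x2` should stay close to 1.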

In [3]:
# load data 
data = pd.read_csv("../data/advertising.csv")
data.head()

| | TV | Radio | Newspaper | Sales |
|---|---|---|---|---|
| 0 | 230.1 | 37.8 | 69.2 | 22.1 |
| 1 | 44.5 | 39.3 | 45.1 | 10.4 |
| 2 | 17.2 | 45.9 | 69.3 | 12.0 |
| 3 | 151.5 | 41.3 | 58.5 | 16.5 |
| 4 | 180.8 | 10.8 | 58.4 | 17.9 |


In [4]:
# define predictors and response
X = data[["TV", "Radio", "Newspaper"]]
y = data["Sales"]

In [6]:
# add an intercept column; sm.OLS does not add one automatically
X_const = sm.add_constant(X)

# create model object (response first, then the design matrix)
model = sm.OLS(y, X_const)

In [7]:
# fit the model
result = model.fit(method='pinv')

In [8]:
# print the summary output 
result.summary()

| | | | |
|---|---|---|---|
| Dep. Variable: | Sales | R-squared: | 0.903 |
| Model: | OLS | Adj. R-squared: | 0.901 |
| Method: | Least Squares | F-statistic: | 605.4 |
| Date: | Tue, 09 May 2023 | Prob (F-statistic): | 8.13e-99 |
| Time: | 12:55:47 | Log-Likelihood: | -383.34 |
| No. Observations: | 200 | AIC: | 774.7 |
| Df Residuals: | 196 | BIC: | 787.9 |
| Df Model: | 3 | | |
| Covariance Type: | nonrobust | | |

| | coef | std err | t | P>\|t\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| const | 4.6251 | 0.308 | 15.041 | 0.000 | 4.019 | 5.232 |
| TV | 0.0544 | 0.001 | 39.592 | 0.000 | 0.052 | 0.057 |
| Radio | 0.1070 | 0.008 | 12.604 | 0.000 | 0.090 | 0.124 |
| Newspaper | 0.0003 | 0.006 | 0.058 | 0.954 | -0.011 | 0.012 |

| | | | |
|---|---|---|---|
| Omnibus: | 16.081 | Durbin-Watson: | 2.251 |
| Prob(Omnibus): | 0.0 | Jarque-Bera (JB): | 27.655 |
| Skew: | -0.431 | Prob(JB): | 9.88e-07 |
| Kurtosis: | 4.605 | Cond. No. | 454.0 |
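Several figures in the summary can be cross-checked by hand. The sketch below recomputes adjusted $R^2$, AIC and BIC from the reported $R^2$, log-likelihood and sample size, and then uses the reported coefficients to predict Sales for the first row of the data. It assumes the intercept is counted as a parameter in AIC and BIC (four parameters in total), which is consistent with the values shown. Because the inputs are rounded, the results match the summary only approximately.

```python
import math

# Values copied from the summary output
n_obs = 200
n_predictors = 3          # TV, Radio, Newspaper
r_squared = 0.903
log_likelihood = -383.34
n_params = n_predictors + 1   # assumption: intercept counted as a parameter

# Adjusted R^2 penalizes R^2 for the number of predictors
adj_r_squared = 1 - (1 - r_squared) * (n_obs - 1) / (n_obs - n_predictors - 1)

# Information criteria: lower values indicate a better trade-off of fit and size
aic = 2 * n_params - 2 * log_likelihood
bic = n_params * math.log(n_obs) - 2 * log_likelihood

# Prediction for the first row (TV=230.1, Radio=37.8, Newspaper=69.2),
# using the rounded coefficients from the table
predicted_sales = 4.6251 + 0.0544 * 230.1 + 0.1070 * 37.8 + 0.0003 * 69.2

print("adjusted R^2:", round(adj_r_squared, 4))
print("AIC:", round(aic, 2))
print("BIC:", round(bic, 2))
print("predicted Sales, row 0:", round(predicted_sales, 2), "(observed: 22.1)")
```

The Newspaper coefficient contributes almost nothing to the prediction, in line with its p-value of 0.954 and a confidence interval that contains zero.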


In [9]:
# calculate the VIF for each predictor

def calc_vif(X):
    """Return a DataFrame with one VIF value per column of X."""
    vif = pd.DataFrame()
    vif["variables"] = X.columns
    vif["VIF"] = [variance_inflation_factor(X.values, i) for i in range(X.shape[1])]

    return vif

In [10]:
calc_vif(X)

| | variables | VIF |
|---|---|---|
| 0 | TV | 2.486772 |
| 1 | Radio | 3.285462 |
| 2 | Newspaper | 3.055245 |
