# What is multicollinearity?
>Multicollinearity occurs when independent variables in a regression model are correlated. This correlation is a problem because independent variables should be independent. If the degree of correlation between variables is high enough, it can cause problems when you fit the model and interpret the results.

>Multicollinearity occurs when two or more independent variables are highly correlated with one another in a regression model. This means that one independent variable can be predicted from another independent variable in the same model.


# Why is multicollinearity a problem?

When independent variables are highly correlated, the model cannot cleanly separate their individual effects on the output, because the variables tend to move together. As a result, the coefficient estimates become unstable and can vary a lot after a small change in the data or the model. This creates the following problems:

1>It is hard to choose the list of significant variables for the model if the model gives you different results every time.

2>Coefficient estimates are not stable, which makes the model hard to interpret. In other words, you cannot reliably tell how much the output changes when one of your predictors changes by 1 unit.

3>The unstable nature of the model may cause overfitting. If you apply the model to another sample of data, the accuracy can drop significantly compared to the accuracy on your training dataset.
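
The instability described above can be illustrated with a small simulation. This is only a sketch on synthetic data: the helper name `coef_spread`, the true coefficients (2 and 1), and the sample sizes are assumptions chosen for illustration, not part of the notebook's datasets. It refits the same regression on many fresh samples and measures how much the fitted coefficient of the first predictor moves around.

```python
import numpy as np

def coef_spread(rho, n=100, n_trials=200, seed=0):
    """Std. dev. of the fitted coefficient on x1 across repeated samples,
    when x1 and x2 are drawn with population correlation rho."""
    rng = np.random.default_rng(seed)
    cov = [[1.0, rho], [rho, 1.0]]
    slopes = []
    for _ in range(n_trials):
        # Draw two predictors with the requested correlation
        X = rng.multivariate_normal([0.0, 0.0], cov, size=n)
        # Synthetic target: the true coefficients are 2 and 1 (illustrative)
        y = 2.0 * X[:, 0] + 1.0 * X[:, 1] + rng.normal(size=n)
        design = np.column_stack([np.ones(n), X])  # add intercept column
        beta, *_ = np.linalg.lstsq(design, y, rcond=None)
        slopes.append(beta[1])  # coefficient on x1
    return float(np.std(slopes))

spread_low = coef_spread(rho=0.05)   # nearly uncorrelated predictors
spread_high = coef_spread(rho=0.99)  # highly correlated predictors
print(f"coefficient spread, rho=0.05: {spread_low:.3f}")
print(f"coefficient spread, rho=0.99: {spread_high:.3f}")
```

With highly correlated predictors the fitted coefficient should swing much more from sample to sample, even though the true relationship is identical in both runs.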

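# How to identify that multicollinearity exists?
1>The correlation between two independent variables is greater than 0.8 (in absolute value)

2>The variance inflation factor (VIF) is greater than 20 (stricter rules of thumb already flag values above 5 or 10)

3>R-squared and Adj. R-squared lie between 0 and 1 [the closer to 1, the better the fit]. A high R-squared together with few significant coefficients is a warning sign

4>Check the coefficient values: they should not be unexpectedly high

5>A negative coefficient means a negative relationship. For example, a `newspaper` coefficient of -0.0010 means that a 1-unit increase in `newspaper` is associated with a 0.0010 decrease in `sales`, with the other variables held fixed

6>The standard errors should not be high. High standard errors suggest that multicollinearity exists

7>A high p-value means the coefficient is not statistically significant. When it comes together with a high correlation between predictors, the feature can be ignored (dropped)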
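
The VIF check in the list above can be sketched with NumPy alone. The VIF of predictor j is 1 / (1 - R_j²), where R_j² comes from regressing predictor j on all the other predictors. The function name `vif` and the synthetic `experience`, `age` and `unrelated` columns below are assumptions made for illustration. statsmodels also provides `variance_inflation_factor` in `statsmodels.stats.outliers_influence`, which could be used instead.

```python
import numpy as np

def vif(X):
    """Return the VIF of each column of X (2-D array without a constant column)."""
    X = np.asarray(X, dtype=float)
    n, k = X.shape
    out = []
    for j in range(k):
        target = X[:, j]
        # Regress column j on an intercept plus all remaining columns
        others = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
        beta, *_ = np.linalg.lstsq(others, target, rcond=None)
        resid = target - others @ beta
        ss_res = resid @ resid
        ss_tot = ((target - target.mean()) ** 2).sum()
        r2 = 1.0 - ss_res / ss_tot
        out.append(1.0 / (1.0 - r2))
    return np.array(out)

# Synthetic data (illustrative only): age tracks experience closely,
# while the third column is unrelated noise.
rng = np.random.default_rng(0)
experience = rng.uniform(1, 10, size=50)
age = 20 + experience + rng.normal(0, 0.3, size=50)
unrelated = rng.normal(size=50)

X_demo = np.column_stack([experience, age, unrelated])
vifs = vif(X_demo)
for name, value in zip(["experience", "age", "unrelated"], vifs):
    print(f"{name}: VIF = {value:.1f}")
```

The two columns that move together should get large VIFs, while the unrelated column should stay close to 1.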


In [7]:
#Importing all the required libraries
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
!pip install statsmodels
import statsmodels.api as sm


In [10]:
#Reading the dataset
df_adv = pd.read_csv('Advertising.csv', index_col=0)
X = df_adv[['TV', 'radio','newspaper']]
y = df_adv['sales']
df_adv.head()

|  | TV | radio | newspaper | sales |
|---|---|---|---|---|
| 1 | 230.1 | 37.8 | 69.2 | 22.1 |
| 2 | 44.5 | 39.3 | 45.1 | 10.4 |
| 3 | 17.2 | 45.9 | 69.3 | 9.3 |
| 4 | 151.5 | 41.3 | 58.5 | 18.5 |
| 5 | 180.8 | 10.8 | 58.4 | 12.9 |


In [12]:
#Adding a constant (intercept) column
X=sm.add_constant(X)

In [13]:
X

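|  | const | TV | radio | newspaper |
|---|---|---|---|---|
| 1 | 1.0 | 230.1 | 37.8 | 69.2 |
| 2 | 1.0 | 44.5 | 39.3 | 45.1 |
| 3 | 1.0 | 17.2 | 45.9 | 69.3 |
| 4 | 1.0 | 151.5 | 41.3 | 58.5 |
| 5 | 1.0 | 180.8 | 10.8 | 58.4 |
| ... | ... | ... | ... | ... |
| 196 | 1.0 | 38.2 | 3.7 | 13.8 |
| 197 | 1.0 | 94.2 | 4.9 | 8.1 |
| 198 | 1.0 | 177.0 | 9.3 | 6.4 |
| 199 | 1.0 | 283.6 | 42.0 | 66.2 |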


In [15]:
#Fit an OLS model with an intercept on TV, radio and newspaper
#Here y = endog (output) and X = exog (input)
Model=sm.OLS(y,X).fit()

In [16]:
Model.summary()
#R-squared and Adj. R-squared lie between 0 and 1 [the closer to 1, the better the fit]
#Check that the coefficient values are not unexpectedly high
#A negative coefficient means a negative relationship: a 1-unit increase in newspaper is associated with a 0.0010 decrease in sales
#High standard errors suggest that multicollinearity exists
#A high p-value (newspaper: 0.860) means the coefficient is not statistically significant; on its own it does not prove multicollinearity

|  |  |  |  |
|---|---|---|---|
| Dep. Variable: | sales | R-squared: | 0.897 |
| Model: | OLS | Adj. R-squared: | 0.896 |
| Method: | Least Squares | F-statistic: | 570.3 |
| Date: | Sat, 23 Oct 2021 | Prob (F-statistic): | 1.58e-96 |
| Time: | 17:33:48 | Log-Likelihood: | -386.18 |
| No. Observations: | 200 | AIC: | 780.4 |
| Df Residuals: | 196 | BIC: | 793.6 |
| Df Model: | 3 |  |  |
| Covariance Type: | nonrobust |  |  |

|  | coef | std err | t | P>\|t\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| const | 2.9389 | 0.312 | 9.422 | 0.000 | 2.324 | 3.554 |
| TV | 0.0458 | 0.001 | 32.809 | 0.000 | 0.043 | 0.049 |
| radio | 0.1885 | 0.009 | 21.893 | 0.000 | 0.172 | 0.206 |
| newspaper | -0.0010 | 0.006 | -0.177 | 0.860 | -0.013 | 0.011 |

|  |  |  |  |
|---|---|---|---|
| Omnibus: | 60.414 | Durbin-Watson: | 2.084 |
| Prob(Omnibus): | 0.0 | Jarque-Bera (JB): | 151.241 |
| Skew: | -1.327 | Prob(JB): | 1.44e-33 |
| Kurtosis: | 6.332 | Cond. No. | 454.0 |


In [17]:
X.iloc[:,1:].corr()

|  | TV | radio | newspaper |
|---|---|---|---|
| TV | 1.0 | 0.054809 | 0.056648 |
| radio | 0.054809 | 1.0 | 0.354104 |
| newspaper | 0.056648 | 0.354104 | 1.0 |


In [18]:
#Let's see an example of a dataset where multicollinearity exists

df_salary = pd.read_csv('Salary_Data.csv')
df_salary.head()

|  | YearsExperience | Age | Salary |
|---|---|---|---|
| 0 | 1.1 | 21.0 | 39343 |
| 1 | 1.3 | 21.5 | 46205 |
| 2 | 1.5 | 21.7 | 37731 |
| 3 | 2.0 | 22.0 | 43525 |
| 4 | 2.2 | 22.2 | 39891 |


In [19]:

X = df_salary[['YearsExperience', 'Age']]
y = df_salary['Salary']

In [23]:
X=sm.add_constant(X)
model_2=sm.OLS(y,X).fit()

In [24]:
model_2.summary()

|  |  |  |  |
|---|---|---|---|
| Dep. Variable: | Salary | R-squared: | 0.96 |
| Model: | OLS | Adj. R-squared: | 0.957 |
| Method: | Least Squares | F-statistic: | 323.9 |
| Date: | Sat, 23 Oct 2021 | Prob (F-statistic): | 1.35e-19 |
| Time: | 17:50:20 | Log-Likelihood: | -300.35 |
| No. Observations: | 30 | AIC: | 606.7 |
| Df Residuals: | 27 | BIC: | 610.9 |
| Df Model: | 2 |  |  |
| Covariance Type: | nonrobust |  |  |

|  | coef | std err | t | P>\|t\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| const | -6661.9872 | 2.28e+04 | -0.292 | 0.773 | -5.35e+04 | 4.02e+04 |
| YearsExperience | 6153.3533 | 2337.092 | 2.633 | 0.014 | 1358.037 | 1.09e+04 |
| Age | 1836.0136 | 1285.034 | 1.429 | 0.165 | -800.659 | 4472.686 |

|  |  |  |  |
|---|---|---|---|
| Omnibus: | 2.695 | Durbin-Watson: | 1.711 |
| Prob(Omnibus): | 0.26 | Jarque-Bera (JB): | 1.975 |
| Skew: | 0.456 | Prob(JB): | 0.372 |
| Kurtosis: | 2.135 | Cond. No. | 626.0 |
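
As a quick check on how the columns of this summary relate, the `t` column is the coefficient divided by its standard error, and `P>|t|` is the two-sided p-value of that statistic. Using the values reported for `model_2`:

$$
t_{\text{YearsExperience}} = \frac{6153.3533}{2337.092} \approx 2.633, \qquad
t_{\text{Age}} = \frac{1836.0136}{1285.034} \approx 1.429
$$

A larger standard error pulls the t-statistic toward zero, which is why inflated standard errors go together with higher p-values.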


In [25]:
X.iloc[:,1:].corr()

|  | YearsExperience | Age |
|---|---|---|
| YearsExperience | 1.0 | 0.987258 |
| Age | 0.987258 | 1.0 |
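
With only two predictors, the VIF can be worked out by hand from this correlation, because the $R^2$ from regressing one predictor on the other is simply the squared correlation $r^2$:

$$
\mathrm{VIF} = \frac{1}{1 - r^2} = \frac{1}{1 - 0.987258^2} \approx \frac{1}{0.0253} \approx 39.5
$$

This is well above a VIF threshold of 20, so both the correlation check and the VIF check flag this dataset.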


# From the above summary we observe that
1>Age is highly correlated with YearsExperience (0.987), which is greater than 0.80

2>The coefficient for YearsExperience is high (coef = 6153.3533)

3>The standard errors are high (2337 for YearsExperience)

4>The p-value for Age is high (0.165) and its correlation with YearsExperience is high, so we can drop this feature from the model
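
The effect of dropping the redundant feature can be sketched as follows. This is a minimal illustration on synthetic data that only mimics the shape of the salary example; the helper `ols_with_se`, the generated `experience`, `age` and `salary` arrays, and the numbers used to create them are all assumptions, not values from `Salary_Data.csv`. It fits the model with and without the age column and compares the standard error of the experience coefficient.

```python
import numpy as np

def ols_with_se(X, y):
    """OLS with an intercept; returns (coefficients, standard errors)."""
    X = np.asarray(X, dtype=float)
    if X.ndim == 1:
        X = X[:, None]
    n = X.shape[0]
    design = np.column_stack([np.ones(n), X])  # intercept first
    beta, *_ = np.linalg.lstsq(design, y, rcond=None)
    resid = y - design @ beta
    dof = n - design.shape[1]                  # residual degrees of freedom
    sigma2 = resid @ resid / dof               # residual variance estimate
    cov = sigma2 * np.linalg.inv(design.T @ design)
    return beta, np.sqrt(np.diag(cov))

# Synthetic stand-in data (illustrative only)
rng = np.random.default_rng(42)
n = 30
experience = rng.uniform(1, 10, size=n)
age = 20 + experience + rng.normal(0, 0.3, size=n)   # nearly redundant with experience
salary = 25000 + 9500 * experience + rng.normal(0, 5000, size=n)

# Full model: experience + age. Reduced model: experience only.
_, se_full = ols_with_se(np.column_stack([experience, age]), salary)
_, se_reduced = ols_with_se(experience, salary)

print(f"std err of experience, with age:    {se_full[1]:.1f}")
print(f"std err of experience, without age: {se_reduced[1]:.1f}")
```

On real data the same comparison could be made by refitting with `sm.OLS` on `df_salary[['YearsExperience']]` and reading the `std err` column of the new summary.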

