## Multicollinearity in linear regression

In regression, "multicollinearity" means that two or more predictors in the model are correlated with one another, not just with the response variable. In other words, some of the predictors carry partly redundant information.

A little multicollinearity isn't necessarily a problem. Severe multicollinearity is, because it increases the variance of the estimated regression coefficients and makes them unstable: small changes in the data can produce large changes in the estimates. The more variance the coefficients have, the harder they are to interpret.
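
To make the variance claim concrete, here is a minimal simulation sketch. It is not part of the original analysis: the function name `coef_spread`, the true coefficients (2 and 3), and the sample sizes are all assumptions chosen for illustration. It refits a two-predictor regression on many synthetic samples and measures how much the estimated coefficient of the first predictor moves around.

```python
import numpy as np

def coef_spread(rho, n_obs=100, n_trials=500, seed=0):
    """Std. dev. of the estimated coefficient on x1 across repeated synthetic samples.

    rho is the correlation between the two predictors (hypothetical setup).
    """
    rng = np.random.default_rng(seed)
    cov = np.array([[1.0, rho], [rho, 1.0]])
    estimates = []
    for _ in range(n_trials):
        # Draw two predictors with correlation rho
        X = rng.multivariate_normal([0.0, 0.0], cov, size=n_obs)
        # Assumed true model: y = 2*x1 + 3*x2 + noise
        y = 2.0 * X[:, 0] + 3.0 * X[:, 1] + rng.normal(0.0, 1.0, size=n_obs)
        design = np.column_stack([np.ones(n_obs), X])  # add intercept column
        beta, *_ = np.linalg.lstsq(design, y, rcond=None)
        estimates.append(beta[1])  # coefficient on x1
    return float(np.std(estimates))

spread_independent = coef_spread(rho=0.0)
spread_collinear = coef_spread(rho=0.99)
print("spread with uncorrelated predictors:", spread_independent)
print("spread with highly correlated predictors:", spread_collinear)
```

With highly correlated predictors the estimates should scatter several times more widely, even though the true coefficients and the noise level are identical in both runs.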

In [1]:
## To look for signs of multicollinearity, we fit an OLS regression model


## Ordinary least squares (OLS) regression
OLS is a statistical method that estimates the relationship between one or more independent variables and a dependent variable. It chooses the coefficients of a linear function (a straight line when there is a single predictor) by minimizing the sum of squared differences between the observed and predicted values of the dependent variable.
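
As a rough illustration of "minimizing the sum of squared differences", the sketch below fits a straight line with NumPy's least-squares solver. The five data points and the helper `rss_for` are made up for this example and have nothing to do with the datasets used later.

```python
import numpy as np

# Hypothetical toy data: y is roughly 1 + 2*x
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = np.array([1.1, 2.9, 5.2, 7.1, 8.8])

# Design matrix with a column of ones for the intercept
X = np.column_stack([np.ones_like(x), x])

# Least-squares solution: the coefficients that minimize ||y - X @ beta||^2
beta, *_ = np.linalg.lstsq(X, y, rcond=None)

def rss_for(b):
    """Residual sum of squares for a candidate coefficient vector b."""
    r = y - X @ b
    return float(r @ r)

rss = rss_for(beta)
print("intercept, slope:", beta)
print("residual sum of squares:", rss)

# Any other line gives a larger residual sum of squares
print("RSS after nudging the slope:", rss_for(beta + np.array([0.0, 0.1])))
```

Nudging either coefficient away from the fitted values increases the residual sum of squares, which is exactly the property that defines the OLS estimate.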

In [3]:
## OLS is the standard method for fitting a linear regression

In [6]:
import pandas as pd
df1 = pd.read_csv("Advertising.csv",index_col=0)
df1.head()

Unnamed: 0,TV,radio,newspaper,sales
1,230.1,37.8,69.2,22.1
2,44.5,39.3,45.1,10.4
3,17.2,45.9,69.3,9.3
4,151.5,41.3,58.5,18.5
5,180.8,10.8,58.4,12.9


In [7]:
X=df1.iloc[:,:-1]
Y=df1.iloc[:,-1]

In [8]:
## Importing the required library

import statsmodels.api as sm   ## statsmodels.api provides several types of regression models, including OLS

In [11]:
X= sm.add_constant(X) 

In [12]:
## We added this constant column so the model can estimate the intercept (B0); sm.OLS does not add one by default.

In [10]:
X.head()

Unnamed: 0,const,TV,radio,newspaper
1,1.0,230.1,37.8,69.2
2,1.0,44.5,39.3,45.1
3,1.0,17.2,45.9,69.3
4,1.0,151.5,41.3,58.5
5,1.0,180.8,10.8,58.4


In [13]:
model = sm.OLS(Y, X).fit()               ## sm.OLS(endog, exog): endog is the output (Y), exog is the input features (X)

In [14]:
model.summary()  ## Summarizes the fitted model: fit statistics, coefficients, and diagnostics

0,1,2,3
Dep. Variable:,sales,R-squared:,0.897
Model:,OLS,Adj. R-squared:,0.896
Method:,Least Squares,F-statistic:,570.3
Date:,"Thu, 03 Dec 2020",Prob (F-statistic):,1.58e-96
Time:,13:58:19,Log-Likelihood:,-386.18
No. Observations:,200,AIC:,780.4
Df Residuals:,196,BIC:,793.6
Df Model:,3,,
Covariance Type:,nonrobust,,

0,1,2,3,4,5,6
,coef,std err,t,P>|t|,[0.025,0.975]
const,2.9389,0.312,9.422,0.000,2.324,3.554
TV,0.0458,0.001,32.809,0.000,0.043,0.049
radio,0.1885,0.009,21.893,0.000,0.172,0.206
newspaper,-0.0010,0.006,-0.177,0.860,-0.013,0.011

0,1,2,3
Omnibus:,60.414,Durbin-Watson:,2.084
Prob(Omnibus):,0.0,Jarque-Bera (JB):,151.241
Skew:,-1.327,Prob(JB):,1.44e-33
Kurtosis:,6.332,Cond. No.,454.0


In [15]:
## This table gives us clues about whether there is multicollinearity in the regression

In [16]:
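## We have to check 4 things (coef, std err, R-squared, P>|t|)
## coef holds the estimated coefficients (B0 for const, then one per feature)
## std err should be small relative to its coefficient. Correlated features inflate it, although noisy data or a small sample can do the same.
## A std err that is large relative to its coefficient is therefore a clue, not proof, that there may be high correlation between features
## P>|t| below 0.05 means the coefficient is significantly different from zero. Here newspaper is at 0.860, which means newspaper adds little once TV and radio are in the model; this is unrelated to the sign of its coefficient.
## A high R-squared (here 0.897) combined with insignificant coefficients on most features would be the typical warning sign. R-squared on its own says nothing about multicollinearity.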


In [17]:
## Here the standard errors are small relative to the TV and radio coefficients, so the summary shows no sign of serious multicollinearity in our dataset

In [19]:
## We can cross-check with the pairwise correlations (the const column has zero variance, so its correlations are NaN)

X.corr()

Unnamed: 0,const,TV,radio,newspaper
const,,,,
TV,,1.0,0.054809,0.056648
radio,,0.054809,1.0,0.354104
newspaper,,0.056648,0.354104,1.0


In [20]:
X.iloc[:,1:].corr()


Unnamed: 0,TV,radio,newspaper
TV,1.0,0.054809,0.056648
radio,0.054809,1.0,0.354104
newspaper,0.056648,0.354104,1.0


In [21]:
## None of the features are strongly correlated: the largest pairwise correlation is 0.354 (radio and newspaper), and the others are close to 0

In [22]:
## Now we will try a different dataset

In [23]:
df2=pd.read_csv("Salary_Data.csv")

In [24]:
df2.head()

Unnamed: 0,YearsExperience,Age,Salary
0,1.1,21.0,39343
1,1.3,21.5,46205
2,1.5,21.7,37731
3,2.0,22.0,43525
4,2.2,22.2,39891


In [26]:
X=df2.iloc[:,:-1]
Y=df2.iloc[:,-1]

In [27]:
X = sm.add_constant(X)  ## Adds the constant column for the intercept (B0)

In [28]:
model = sm.OLS(Y,X).fit()

In [29]:
model.summary()

0,1,2,3
Dep. Variable:,Salary,R-squared:,0.96
Model:,OLS,Adj. R-squared:,0.957
Method:,Least Squares,F-statistic:,323.9
Date:,"Thu, 03 Dec 2020",Prob (F-statistic):,1.35e-19
Time:,14:16:00,Log-Likelihood:,-300.35
No. Observations:,30,AIC:,606.7
Df Residuals:,27,BIC:,610.9
Df Model:,2,,
Covariance Type:,nonrobust,,

0,1,2,3,4,5,6
,coef,std err,t,P>|t|,[0.025,0.975]
const,-6661.9872,2.28e+04,-0.292,0.773,-5.35e+04,4.02e+04
YearsExperience,6153.3533,2337.092,2.633,0.014,1358.037,1.09e+04
Age,1836.0136,1285.034,1.429,0.165,-800.659,4472.686

0,1,2,3
Omnibus:,2.695,Durbin-Watson:,1.711
Prob(Omnibus):,0.26,Jarque-Bera (JB):,1.975
Skew:,0.456,Prob(JB):,0.372
Kurtosis:,2.135,Cond. No.,626.0


In [30]:
## The YearsExperience coef is large: a 1-unit increase in YearsExperience changes the predicted Salary by about 6153.35, holding Age fixed

In [31]:
## If we look at R-squared (0.96) and the coefficients, the model looks fine.
## But the std err values are large relative to the coefficients (2337 against 6153 for YearsExperience, 1285 against 1836 for Age), which gives us a "CLUE" that there may be some correlation between features
## Looking at P>|t|, YearsExperience is fine at 0.014 (< 0.05), but Age is at 0.165 (> 0.05) even though R-squared is high, which again gives us a clue that the features may be correlated

In [32]:
## To confirm, 

X.iloc[:,1:].corr()

Unnamed: 0,YearsExperience,Age
YearsExperience,1.0,0.987258
Age,0.987258,1.0


In [None]:
## They are highly correlated: 0.987 is very close to 1.
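
A pairwise correlation can be turned into a variance inflation factor (VIF), which measures how much collinearity inflates a coefficient's variance. The sketch below is an addition to the original analysis and rests on one assumption-free identity: the VIF of each predictor equals the corresponding diagonal entry of the inverse of the predictors' correlation matrix. The helper name `vif_from_corr` is hypothetical, and the matrices are copied from the `corr()` outputs shown above.

```python
import numpy as np

def vif_from_corr(corr):
    """VIF of each predictor: diagonal of the inverse correlation matrix."""
    return np.diag(np.linalg.inv(np.asarray(corr, dtype=float)))

# Correlations between TV, radio, newspaper (from the Advertising output above)
advertising_corr = np.array([
    [1.0,      0.054809, 0.056648],
    [0.054809, 1.0,      0.354104],
    [0.056648, 0.354104, 1.0],
])

# Correlations between YearsExperience and Age (from the Salary output above)
salary_corr = np.array([
    [1.0,      0.987258],
    [0.987258, 1.0],
])

advertising_vif = vif_from_corr(advertising_corr)
salary_vif = vif_from_corr(salary_corr)
print("Advertising VIFs (TV, radio, newspaper):", advertising_vif)
print("Salary VIFs (YearsExperience, Age):", salary_vif)
```

A commonly cited rule of thumb treats VIF values above roughly 5 to 10 as a sign of problematic multicollinearity. The Advertising features should come out close to 1, while the two Salary features should land far above that range.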

#### Now that we know multicollinearity exists in our dataset, we can do one of two things:

##### 1) Ignore it and model with the multicollinearity in place

However, this may not be good practice in some cases, for example with datasets that have a large number of features.

##### 2) Handle it by removing one of the correlated features:

Here the p-value for Age (0.165) is greater than the one for YearsExperience (0.014), so we can drop Age, which is highly correlated with YearsExperience.
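
The effect of dropping a correlated feature can be sketched on synthetic data. Everything below is an assumption for illustration: the data is randomly generated to resemble the salary example (it is not `Salary_Data.csv`), and `ols_std_errors` is a hypothetical helper that computes the classical (nonrobust) standard errors with NumPy.

```python
import numpy as np

def ols_std_errors(X, y):
    """Return OLS coefficients and their classical (nonrobust) standard errors."""
    n, k = X.shape
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    sigma2 = resid @ resid / (n - k)          # residual variance estimate
    cov = sigma2 * np.linalg.inv(X.T @ X)     # covariance of the estimates
    return beta, np.sqrt(np.diag(cov))

# Synthetic data (assumed, not the real dataset): age is almost a linear
# function of experience, so the two features are highly correlated.
rng = np.random.default_rng(0)
n = 30
experience = rng.uniform(1.0, 10.0, size=n)
age = 21.0 + experience + rng.normal(0.0, 0.3, size=n)
salary = 25000.0 + 9000.0 * experience + rng.normal(0.0, 5000.0, size=n)

X_full = np.column_stack([np.ones(n), experience, age])   # const, experience, age
X_reduced = np.column_stack([np.ones(n), experience])     # const, experience

_, se_full = ols_std_errors(X_full, salary)
_, se_reduced = ols_std_errors(X_reduced, salary)
print("std err of experience with age in the model:", se_full[1])
print("std err of experience after dropping age:  ", se_reduced[1])
```

On data like this, the standard error of the experience coefficient should shrink substantially once the redundant feature is removed. The same comparison can be made on the real data by refitting `sm.OLS` without the Age column and reading the new summary.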