# Multicollinearity in Linear Regression

Multicollinearity is the occurrence of high intercorrelations among two or more independent variables in a multiple regression model.

Suppose we have a dataset of salary with respect to age and experience:

y = B0 + B1*x1 + B2*x2

where B0 is the intercept, and B1 and B2 are the coefficients of x1 (age) and x2 (experience).

If x1 (age) and x2 (experience) are strongly correlated with each other, the model is said to have multicollinearity.
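
To make the age and experience example concrete, here is a minimal sketch on synthetic data. The numbers are assumptions made up for illustration (experience is taken to be roughly age minus 22 plus noise); they are not the salary dataset used later in this notebook.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed so the sketch is reproducible

# Hypothetical data: experience roughly tracks age (an assumption for this demo)
age = rng.uniform(22, 60, size=200)
experience = age - 22 + rng.normal(0, 1.5, size=200)

# Pearson correlation between the two independent variables
r = np.corrcoef(age, experience)[0, 1]
print(f"correlation(age, experience) = {r:.3f}")
```

A correlation this close to 1 between two independent variables is the situation the rest of this notebook calls multicollinearity.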

In [1]:
import pandas as pd

In [2]:
import statsmodels.api as sm

statsmodels is a Python module that provides classes and functions for the estimation of many different statistical models, as well as for conducting statistical tests, and statistical data exploration.

In [4]:
# reading data from the csv
df_adv = pd.read_csv('Advertising.csv',index_col=0)

In [5]:
df_adv.head()

Unnamed: 0,TV,radio,newspaper,sales
1,230.1,37.8,69.2,22.1
2,44.5,39.3,45.1,10.4
3,17.2,45.9,69.3,9.3
4,151.5,41.3,58.5,18.5
5,180.8,10.8,58.4,12.9


In [23]:
# defining the variables
x = df_adv[['TV','radio','newspaper']]
y = df_adv['sales']

In [24]:
# adding the constant term
x = sm.add_constant(x)   # adds a new column 'const' filled with 1.0 (the intercept term)

In [25]:
x.head()

Unnamed: 0,const,TV,radio,newspaper
1,1.0,230.1,37.8,69.2
2,1.0,44.5,39.3,45.1
3,1.0,17.2,45.9,69.3
4,1.0,151.5,41.3,58.5
5,1.0,180.8,10.8,58.4


In [26]:
x.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 200 entries, 1 to 200
Data columns (total 4 columns):
 #   Column     Non-Null Count  Dtype  
---  ------     --------------  -----  
 0   const      200 non-null    float64
 1   TV         200 non-null    float64
 2   radio      200 non-null    float64
 3   newspaper  200 non-null    float64
dtypes: float64(4)
memory usage: 7.8 KB


# OLS = Ordinary Least Squares

In the OLS method, we choose the values of b_0 and b_1 such that the sum of the squared differences between the predicted and observed values of y is minimised.

To get the values of b_0 and b_1 which minimise this sum of squared errors, S(Error), we take the partial derivative of S with respect to each coefficient and set it equal to zero.
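
For a single feature x, setting those two partial derivatives to zero gives a closed-form solution. The sketch below uses a made-up toy dataset (an assumption, not data from this notebook) and cross-checks the closed form against NumPy's least-squares solver.

```python
import numpy as np

# Hypothetical toy data, chosen only for illustration
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])

# Solving dS/db_0 = 0 and dS/db_1 = 0 for S = sum((y - b_0 - b_1*x)^2):
b_1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b_0 = y.mean() - b_1 * x.mean()

# Cross-check with a generic least-squares solve on [1, x]
X = np.column_stack([np.ones_like(x), x])
b_check, *_ = np.linalg.lstsq(X, y, rcond=None)

print("closed form :", b_0, b_1)
print("lstsq       :", b_check)
```

With more than one feature the same idea leads to a system of linear equations, which is what sm.OLS(...).fit() solves for us.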

Approach for OLS model :

* First we define the variables x and y. In this example, the variables are read from a csv file using pandas. The file used in the example can be downloaded here.
* Next, we need to add the constant b_0 to the equation using the add_constant() method.
* The OLS() function of the statsmodels.api module is used to perform OLS regression. It returns an OLS object. Then fit() method is called on this object for fitting the regression line to the data.
* The summary() method is used to obtain a table which gives an extensive description about the regression results.
     

In [27]:
## fit an OLS model with intercept on TV, radio and newspaper
# performing the regression
# and fitting the model

model = sm.OLS(y,x).fit()

In [28]:
# printing the summary table

model.summary()

Dep. Variable:,sales,R-squared:,0.897
Model:,OLS,Adj. R-squared:,0.896
Method:,Least Squares,F-statistic:,570.3
Date:,"Thu, 09 Sep 2021",Prob (F-statistic):,1.58e-96
Time:,13:06:24,Log-Likelihood:,-386.18
No. Observations:,200,AIC:,780.4
Df Residuals:,196,BIC:,793.6
Df Model:,3,,
Covariance Type:,nonrobust,,

,coef,std err,t,P>|t|,[0.025,0.975]
const,2.9389,0.312,9.422,0.000,2.324,3.554
TV,0.0458,0.001,32.809,0.000,0.043,0.049
radio,0.1885,0.009,21.893,0.000,0.172,0.206
newspaper,-0.0010,0.006,-0.177,0.860,-0.013,0.011

Omnibus:,60.414,Durbin-Watson:,2.084
Prob(Omnibus):,0.0,Jarque-Bera (JB):,151.241
Skew:,-1.327,Prob(JB):,1.44e-33
Kurtosis:,6.332,Cond. No.,454.0


The summary above helps us judge whether there is multicollinearity among the independent features.

* In the second table, the coef column holds the estimates of B0, B1, B2 and B3. Each coefficient is the expected change in sales for a unit increase in that feature, with the other features held fixed. For example, B1 = 0.0458 means that a unit increase in TV expenditure is associated with an increase of 0.0458 in sales; similarly for the others.

* The newspaper coefficient, B3 = -0.0010, is negative and very close to zero: a unit increase in newspaper expenditure is associated with a decrease of 0.0010 in sales. Its confidence interval [-0.013, 0.011] includes zero, so there is no evidence that newspaper spending affects sales once TV and radio are in the model.

* R^2 ranges between 0 and 1 (adjusted R^2 is slightly lower), and the most common interpretation is how well the regression model fits the observed data. Here R^2 = 0.897 and adjusted R^2 = 0.896, both well above the 0.6 rule of thumb used in these notes, so the model fits the data very well.

* Now the main point: the std err values of the three independent features are small (0.001, 0.009 and 0.006), which indicates that no two features are strongly collinear. Multicollinearity inflates the standard errors of the affected coefficients, so these values stay small when the independent features are not related to each other.

* The P values of TV and radio are < 0.05. Only one feature, newspaper, has a large P value (P = 0.860), which shows that newspaper is not a significant predictor and can be left out of the model.

We can also check the correlation directly by using corr().

In [29]:
import matplotlib.pyplot as plt
x.iloc[:,1:].corr()

Unnamed: 0,TV,radio,newspaper
TV,1.0,0.054809,0.056648
radio,0.054809,1.0,0.354104
newspaper,0.056648,0.354104,1.0


Here we can see that the correlation between every pair of features is low; the largest is 0.354, between radio and newspaper.

In [30]:
# Let's try a different dataset where we can see correlation between the independent variables

In [32]:
df_salary = pd.read_csv('Salary_Data2.csv')

In [33]:
df_salary.head()

Unnamed: 0,YearsExperience,Age,Salary
0,1.1,21.0,39343
1,1.3,21.5,46205
2,1.5,21.7,37731
3,2.0,22.0,43525
4,2.2,22.2,39891


In [35]:
x = df_salary[['YearsExperience','Age']]
y = df_salary['Salary']

In [36]:
x = sm.add_constant(x)
model = sm.OLS(y,x).fit()
model.summary()

Dep. Variable:,Salary,R-squared:,0.96
Model:,OLS,Adj. R-squared:,0.957
Method:,Least Squares,F-statistic:,323.9
Date:,"Thu, 09 Sep 2021",Prob (F-statistic):,1.35e-19
Time:,13:38:42,Log-Likelihood:,-300.35
No. Observations:,30,AIC:,606.7
Df Residuals:,27,BIC:,610.9
Df Model:,2,,
Covariance Type:,nonrobust,,

,coef,std err,t,P>|t|,[0.025,0.975]
const,-6661.9872,2.28e+04,-0.292,0.773,-5.35e+04,4.02e+04
YearsExperience,6153.3533,2337.092,2.633,0.014,1358.037,1.09e+04
Age,1836.0136,1285.034,1.429,0.165,-800.659,4472.686

Omnibus:,2.695,Durbin-Watson:,1.711
Prob(Omnibus):,0.26,Jarque-Bera (JB):,1.975
Skew:,0.456,Prob(JB):,0.372
Kurtosis:,2.135,Cond. No.,626.0


* R^2 and adjusted R^2 are both > 0.6 (0.960 and 0.957), which means the model fits the data very well.
* The std err values are very large for both independent features (2337.092 and 1285.034), which suggests a strong correlation between years of experience and age.
* The coefficients are large: if YearsExperience increases by one unit, with Age held fixed, Salary increases by 6153.3533. The same reading applies to Age (1836.0136).

## Note:- 
If the model contains one more independent variable that is highly correlated with the other independent features, the std err values of the affected coefficients increase and can become very large.

* P values :- For YearsExperience the P value = 0.014 (< 0.05), but for Age P = 0.165 (> 0.05). The two features together explain salary very well, yet Age does not look significant on its own, which suggests that Age and YearsExperience may be correlated.
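
The note above, that a highly correlated extra feature inflates the std err values, can be checked on synthetic data. This is a minimal sketch under stated assumptions: x1, x2 and the true coefficients are made up, and ols_std_errors is a hypothetical helper that computes the usual nonrobust OLS standard errors by hand with NumPy.

```python
import numpy as np

def ols_std_errors(X, y):
    """Return OLS coefficients and their (nonrobust) standard errors."""
    n, k = X.shape
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    sigma2 = resid @ resid / (n - k)        # residual variance estimate
    cov = sigma2 * np.linalg.inv(X.T @ X)   # covariance matrix of the estimates
    return beta, np.sqrt(np.diag(cov))

rng = np.random.default_rng(42)
n = 200
x1 = rng.normal(size=n)
noise = rng.normal(size=n)

def simulate(rho):
    # Build x2 so that its correlation with x1 is approximately rho
    x2 = rho * x1 + np.sqrt(1 - rho ** 2) * noise
    y = 1.0 + 2.0 * x1 + 3.0 * x2 + rng.normal(size=n)
    X = np.column_stack([np.ones(n), x1, x2])
    return ols_std_errors(X, y)[1]

se_low = simulate(0.1)     # nearly uncorrelated features
se_high = simulate(0.99)   # strongly correlated features
print("std err (const, x1, x2), rho=0.10:", se_low.round(3))
print("std err (const, x1, x2), rho=0.99:", se_high.round(3))
```

The standard errors of x1 and x2 come out several times larger in the strongly correlated case, even though the sample size and the noise level are the same.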

## To confirm that, we will use corr()


In [37]:
x.iloc[:,1:].corr()

Unnamed: 0,YearsExperience,Age
YearsExperience,1.0,0.987258
Age,0.987258,1.0


So here we can see that the two independent features have a correlation of 0.987258, which is very high (> 0.9).
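
With more than two features, reading the correlation matrix by eye gets tedious. One possible sketch is a small helper that lists the pairs above a threshold. The name high_corr_pairs and the 0.9 default are assumptions for illustration, and the demo frame reuses only the five rows printed by df_salary.head() above.

```python
import pandas as pd

def high_corr_pairs(df, threshold=0.9):
    """Return (col_a, col_b, r) for every column pair with |r| above threshold."""
    corr = df.corr()
    cols = list(corr.columns)
    pairs = []
    for i, a in enumerate(cols):
        for b in cols[i + 1:]:          # upper triangle only, skip the diagonal
            r = corr.loc[a, b]
            if abs(r) > threshold:
                pairs.append((a, b, float(r)))
    return pairs

# The five rows shown by df_salary.head(), used only as a small demo
demo = pd.DataFrame({
    "YearsExperience": [1.1, 1.3, 1.5, 2.0, 2.2],
    "Age": [21.0, 21.5, 21.7, 22.0, 22.2],
})
pairs = high_corr_pairs(demo)
print(pairs)
```

On the full notebook data the call would be high_corr_pairs(x.iloc[:, 1:]), which skips the const column in the same way as the corr() cell above.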

## Note :- 
Since the correlation is > 0.9, only one of the two independent features is enough to predict the salary.

Solutions :-
1. Don't do anything; leave it as it is.
2. Check the P values of the independent features. Since the P value of Age is > 0.05 and greater than the P value of YearsExperience, we can drop the Age feature and train the model using only one independent feature.
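
The second option can be sketched as follows. Since Salary_Data2.csv is not bundled here, the sketch uses synthetic stand-in data (an assumption: age is generated as experience plus about 20 years, and salary depends on experience only) and a hypothetical helper r_squared, to show that dropping the redundant feature costs almost nothing in fit.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 30

# Synthetic stand-in for the salary data (made-up numbers)
experience = rng.uniform(1, 11, size=n)
age = experience + 20 + rng.normal(0, 0.5, size=n)   # nearly a linear function of experience
salary = 25000 + 9000 * experience + rng.normal(0, 5000, size=n)

def r_squared(X, y):
    """R^2 of an OLS fit of y on the columns of X (X includes the constant)."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    return 1 - resid @ resid / np.sum((y - y.mean()) ** 2)

ones = np.ones(n)
r2_both = r_squared(np.column_stack([ones, experience, age]), salary)
r2_exp_only = r_squared(np.column_stack([ones, experience]), salary)

print(f"R^2 with experience and age: {r2_both:.4f}")
print(f"R^2 with experience only   : {r2_exp_only:.4f}")
```

In the notebook itself the equivalent step would be to refit with x = sm.add_constant(df_salary[['YearsExperience']]) and compare the two summaries.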
