# Multicollinearity

## Learning Objectives
- What is multicollinearity
- Identifying multicollinearity
- Dealing with multicollinearity

In this notebook, we're going to discuss a concept known as **multicollinearity**. At a high level, multicollinearity occurs when two or more exogenous variables are so strongly related to each other that one of them can be accurately predicted from the others.

Let's use an example regression formula for the energy burnt by a runner. The exogenous variables provided to us are their average speed, distance covered, and number of track laps:
$$
\text{Energy Burnt}_i = \beta_0 + \beta_1(\text{Speed}_i) + \beta_2(\text{Distance}_i) + \beta_3(\text{Num Laps}_i) + \epsilon_i
$$

Recall from our last notebook/lecture that any given $\beta$ is the expected change in the endogenous variable for a one-unit increase in its respective exogenous variable, *holding the other exogenous variables constant*. Take a second to understand that and apply it to the formula given above.

You may have realised that the distance a runner covers is perfectly correlated with the number of laps they run. It doesn't make sense to increase the number of laps while holding distance constant. That is what *perfect multicollinearity* looks like. As another example, we could have salary as the response, with age and years of experience as explanatory variables. Once again, age and years of experience are closely tied to each other. This example is more likely to be a case of *imperfect multicollinearity*: although we'd expect each extra year of age to add a year of experience, an older person may have switched to a new career, so their years of experience could be lower despite their age being higher.
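To see why perfect multicollinearity is a problem numerically, here is a minimal sketch with made-up numbers. It assumes a 400 m track, so distance is exactly 0.4 km per lap; the variable names and values are illustrative and not taken from any dataset.

```python
import numpy as np

# Hypothetical runner data: laps on an assumed 400 m track
num_laps = np.array([5.0, 10.0, 12.0, 20.0, 25.0])
distance_km = 0.4 * num_laps            # exact linear function of num_laps
speed_kmh = np.array([9.5, 10.2, 11.0, 8.7, 12.3])

# Design matrix: intercept, speed, distance, num_laps
X = np.column_stack([np.ones_like(num_laps), speed_kmh, distance_km, num_laps])

# With perfect collinearity the rank is lower than the number of columns
rank = np.linalg.matrix_rank(X)
print(f"columns: {X.shape[1]}, rank: {rank}")
```

When the rank is lower than the number of columns, $X^TX$ cannot be inverted, so ordinary least squares has no unique solution for the coefficients.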

Let's load in a dataset, perform some analysis, identify the multicollinear variables, and see how, when and why this could be an issue.

In [80]:
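import pandas as pd
import numpy as np

# Load the tab-separated student marks data
student_df = pd.read_csv("https://aicore-files.s3.amazonaws.com/Data-Science/student_marks.tsv", delimiter="\t")
# Drop the columns we won't use and give the remaining ones short names
student_df = student_df.drop(["Student", "GPA"], axis=1)
student_df.columns = ["marks", "IQ", "study_hrs"]

# Add a synthetic sleep_hrs column drawn from N(7.5, 1.5), one value per student
# (unseeded, so the values change on each run)
sleep_hours = np.random.normal(7.5, 1.5, len(student_df))
student_df["sleep_hrs"] = sleep_hours
student_df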

,marks,IQ,study_hrs,sleep_hrs
0,100,125,30,9.743647
1,95,104,40,6.349966
2,92,110,25,7.762835
3,90,105,20,4.756762
4,85,100,20,8.517011
5,80,100,20,4.99033
6,78,95,15,9.593273
7,75,95,10,8.435627
8,72,85,0,6.240353
9,65,90,5,9.72057


Because multicollinearity is about the *linear relationship* between variables, a correlation heatmap is a great way to get us started on identifying which variables are highly related:

In [81]:
import plotly.express as px
px.imshow(student_df.corr(), title="Correlation heatmap of student dataframe")

Looks like we have some heavy correlation in the heatmap. Let's fit a linear regression model on the data. We're going to (initially) try to predict the number of hours a student studied based on their marks and IQ.

In [78]:
import statsmodels.formula.api as smf

## Fit a linear regression model to try and predict study_hrs from marks and IQ
model0 = smf.ols("study_hrs ~ marks + IQ", student_df).fit()
model0.summary()

Dep. Variable:,study_hrs,R-squared:,0.702
Model:,OLS,Adj. R-squared:,0.689
Method:,Least Squares,F-statistic:,55.27
Date:,"Tue, 11 Aug 2020",Prob (F-statistic):,4.52e-13
Time:,20:55:15,Log-Likelihood:,-155.29
No. Observations:,50,AIC:,316.6
Df Residuals:,47,BIC:,322.3
Df Model:,2,,
Covariance Type:,nonrobust,,

,coef,std err,t,P>|t|,[0.025,0.975]
Intercept,-11.1155,3.318,-3.350,0.002,-17.791,-4.440
marks,0.3181,0.165,1.924,0.060,-0.015,0.651
IQ,0.0452,0.160,0.282,0.779,-0.277,0.368

Omnibus:,5.861,Durbin-Watson:,1.672
Prob(Omnibus):,0.053,Jarque-Bera (JB):,6.724
Skew:,0.329,Prob(JB):,0.0347
Kurtosis:,4.671,Cond. No.,474.0


Interesting... despite our correlation plot showing that both `marks` and `IQ` are strongly correlated with `study_hrs`, both coefficients are coming up as insignificant (at alpha = 0.05) 🤔. In non-mathematical terms, this happens because marks and IQ are 'fighting' over the effect on study_hrs, and the model struggles to tell which variable deserves the credit because they move together.

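Importantly - and this is a point that we'll pick up more on later - the coefficient estimates remain unbiased and the $R^2$ value is still reliable. In fact, the only parts which become 'unreliable' are the columns following 'coef': the standard errors are inflated, which in turn distorts the t-statistics, p-values and confidence intervals.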
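The effect on the standard errors can be illustrated with a small simulation. This is a sketch on synthetic data rather than the student dataset; the helper names `ols_std_errors` and `simulate`, the sample size and the correlation of 0.98 are all assumptions chosen for illustration.

```python
import numpy as np

def ols_std_errors(X, y):
    """Return OLS coefficients and their classical standard errors."""
    n, k = X.shape
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    sigma2 = resid @ resid / (n - k)          # residual variance estimate
    cov = sigma2 * np.linalg.inv(X.T @ X)     # covariance of the estimates
    return beta, np.sqrt(np.diag(cov))

def simulate(rho, n=200, seed=0):
    """Simulate y = x1 + x2 + noise where corr(x1, x2) is roughly rho."""
    rng = np.random.default_rng(seed)
    x1 = rng.normal(size=n)
    x2 = rho * x1 + np.sqrt(1 - rho**2) * rng.normal(size=n)
    y = x1 + x2 + rng.normal(size=n)
    X = np.column_stack([np.ones(n), x1, x2])
    return ols_std_errors(X, y)

_, se_independent = simulate(rho=0.0)
_, se_collinear = simulate(rho=0.98)
print("std err of x1 coefficient, uncorrelated:", se_independent[1])
print("std err of x1 coefficient, collinear:   ", se_collinear[1])
```

The true coefficients are the same in both runs; only the correlation between the predictors changes, and the standard error of the `x1` coefficient grows with it.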

Our correlation heatmap earlier on is one of the two ways that we can check for multicollinearity. We saw that marks and IQ are heavily correlated with each other, and hence we can say that they are multicollinear. A question you're probably thinking is "how much correlation is too much?" And yes, that's a valid question. Unfortunately there's no law or *strict* rule which can answer this for us (although -1 and 1 are definitely "too much"). A loose rule of thumb is that anything above 0.9 in absolute value is probably starting to be too much, although some might suggest that anything under 0.95 isn't a problem. Personally, I'd exercise some caution and *think* about the variables in a bit more depth (not drop them outright) if I see an absolute correlation above 0.85.
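Reading a heatmap by eye gets harder as the number of columns grows, so it can help to list the offending pairs directly. The helper below is a hypothetical sketch (the name `high_corr_pairs` and the synthetic `demo_df` are not part of the notebook), using the cautious 0.85 cut-off mentioned above as its default.

```python
import numpy as np
import pandas as pd

def high_corr_pairs(df, threshold=0.85):
    """List column pairs whose absolute Pearson correlation exceeds threshold."""
    corr = df.corr()
    cols = list(corr.columns)
    pairs = []
    for i in range(len(cols)):
        for j in range(i + 1, len(cols)):   # upper triangle only, skip the diagonal
            r = corr.iloc[i, j]
            if abs(r) > threshold:
                pairs.append((cols[i], cols[j], float(r)))
    return pairs

# Synthetic stand-in data (not the student dataset)
rng = np.random.default_rng(0)
a = rng.normal(size=100)
demo_df = pd.DataFrame({
    "a": a,
    "b": 2 * a + rng.normal(scale=0.1, size=100),  # nearly a multiple of a
    "c": rng.normal(size=100),                     # unrelated noise
})
flagged = high_corr_pairs(demo_df)
print(flagged)
```

The same function could be called on `student_df` to list the pairs worth a closer look.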

The second method is something known as the Variance Inflation Factor, or VIF. This method is more powerful than the correlation heatmap because, rather than comparing variables in pairs, we fit a linear regression model of one of our exogenous variables against all the other exogenous variables. Mathematically, for one variable $X_1$:
$$
\hat{X_1} = \beta_0 + \beta_1(X_2) + \beta_2(X_3) + ...
$$

Subsequently, we'd work out the VIF, which uses the $R^2$ obtained from the model:
$$
\text{VIF} \equiv \frac{1}{1-R^2_1}
$$

The higher the $R^2$, the higher the VIF. This process would be carried out for all the exogenous variables we are considering, and we'd store the VIF score obtained for each one. If the VIF for one of our variables is too high (10 is a common rule of thumb, which corresponds to $R^2 = 0.9$), then we can say that the variable is being adequately explained by our other variables, and it would be reasonable to consider dropping it.
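To get a feel for how quickly the VIF grows as $R^2$ approaches 1, the formula can be tabulated for a few values. The function names here are illustrative helpers, not part of any library.

```python
def vif_from_r2(r2):
    """Variance inflation factor for an auxiliary regression with the given R^2."""
    return 1.0 / (1.0 - r2)

def r2_from_vif(vif):
    """Invert the formula: the auxiliary R^2 implied by a given VIF."""
    return 1.0 - 1.0 / vif

for r2 in (0.0, 0.5, 0.8, 0.9, 0.95):
    print(f"R^2 = {r2:.2f} -> VIF = {vif_from_r2(r2):.1f}")

print("R^2 at the VIF = 10 threshold:", r2_from_vif(10))
```

A VIF of 1 means the variable is not explained by the others at all, and the score rises without bound as $R^2$ approaches 1.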

In [82]:
model1 = smf.ols("study_hrs ~ marks + IQ + sleep_hrs", student_df).fit()
model1.summary()

Dep. Variable:,study_hrs,R-squared:,0.707
Model:,OLS,Adj. R-squared:,0.688
Method:,Least Squares,F-statistic:,37.05
Date:,"Tue, 11 Aug 2020",Prob (F-statistic):,2.49e-12
Time:,21:27:27,Log-Likelihood:,-154.82
No. Observations:,50,AIC:,317.6
Df Residuals:,46,BIC:,325.3
Df Model:,3,,
Covariance Type:,nonrobust,,

,coef,std err,t,P>|t|,[0.025,0.975]
Intercept,-15.1949,5.469,-2.778,0.008,-26.204,-4.186
marks,0.3207,0.166,1.937,0.059,-0.013,0.654
IQ,0.0479,0.160,0.298,0.767,-0.275,0.371
sleep_hrs,0.4940,0.526,0.939,0.353,-0.565,1.553

Omnibus:,5.677,Durbin-Watson:,1.727
Prob(Omnibus):,0.059,Jarque-Bera (JB):,6.433
Skew:,0.315,Prob(JB):,0.0401
Kurtosis:,4.641,Cond. No.,784.0


In [87]:
## Code up three models which, in turn, model one of the exogenous variables against the other two.
exog_marks_model = smf.ols("marks ~ IQ + sleep_hrs", student_df).fit()
exog_iq_model = smf.ols("IQ ~ marks + sleep_hrs", student_df).fit()
exog_sleep_model = smf.ols("sleep_hrs ~ marks + IQ", student_df).fit()

## print the R^2 for each model
print("R^2 for model: \n Marks: {} \n IQ: {} \n Sleep: {}".format(exog_marks_model.rsquared, 
                                                                  exog_iq_model.rsquared, 
                                                                  exog_sleep_model.rsquared))

R^2 for model: 
 Marks: 0.9562144580110146 
 IQ: 0.9562162533509233 
 Sleep: 0.025685476095520632


In [88]:
## Code up a VIF function
def VIF(r2):
    return 1/(1-r2)

## Work out the VIF scores for each of the models
vif_marks = VIF(exog_marks_model.rsquared)
vif_iq = VIF(exog_iq_model.rsquared)
vif_sleep = VIF(exog_sleep_model.rsquared)

## print the VIF scores
print("VIF scores: \n Marks: {}, \n IQ: {} \n Sleep: {}".format(vif_marks, vif_iq, vif_sleep))

VIF scores: 
 Marks: 22.838589054157584, 
 IQ: 22.839525543917514 
 Sleep: 1.0263626123447163
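Writing one auxiliary model per variable by hand doesn't scale beyond a few columns. As a rough sketch, the same procedure can be wrapped in a loop; `r_squared` and `vif_table` are hypothetical helper names, and the data is synthetic rather than the student dataset.

```python
import numpy as np

def r_squared(X, y):
    """R^2 of an OLS fit of y on X (X must already include an intercept column)."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    ss_res = resid @ resid
    ss_tot = ((y - y.mean()) ** 2).sum()
    return 1.0 - ss_res / ss_tot

def vif_table(columns):
    """Compute a VIF for every entry of a {name: 1-D array} dict."""
    names = list(columns)
    out = {}
    for target in names:
        y = np.asarray(columns[target], dtype=float)
        others = [np.asarray(columns[n], dtype=float) for n in names if n != target]
        X = np.column_stack([np.ones_like(y)] + others)   # intercept + other variables
        out[target] = 1.0 / (1.0 - r_squared(X, y))
    return out

# Synthetic stand-in data (not the student dataset)
rng = np.random.default_rng(1)
x1 = rng.normal(size=200)
x2 = x1 + rng.normal(scale=0.2, size=200)   # strongly tied to x1
x3 = rng.normal(size=200)                   # unrelated
vifs = vif_table({"x1": x1, "x2": x2, "x3": x3})
print(vifs)
```

statsmodels also provides `variance_inflation_factor` in `statsmodels.stats.outliers_influence`; check its documentation for how it expects the design matrix, including the constant column, to be passed in.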


Ok, so we've identified multicollinearity! Great! We know that marks and IQ are the collinear variables (thanks to the correlation map and their VIF scores), so we're safe to drop one of them. But *should we* actually do so?

Well, there are a couple of things worth discussing on this point. Perhaps the most straightforward thing to do is: **do nothing**. Why? Well, the regression model will still fit the data - and we can validate this by comparing the $R^2$ terms across different models. How much multicollinearity matters depends purely on what information we are seeking to obtain from our model. If we only want to use our model for predictive purposes, and don't care about the interpretation of coefficients, then keeping the collinear terms in is completely valid.

A second option could be to **remove one of the correlated terms** (you will *have* to do this in the case of perfect collinearity). If we care about interpreting the coefficients (which some industries such as insurance or finance do care about), then this would be the strategy to use. However, in some situations, dropping a variable may expose us to a problem known as [omitted variable bias](https://www.youtube.com/watch?v=b4jhrK03zhs).
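Omitted variable bias can be seen in a small simulation. This is a sketch on synthetic data, assuming two predictors with a correlation of about 0.9 and a true coefficient of 1 each; none of the names or numbers come from the student dataset.

```python
import numpy as np

rng = np.random.default_rng(42)
n = 5000
rho = 0.9

# Two correlated predictors, both with a true coefficient of 1
x1 = rng.normal(size=n)
x2 = rho * x1 + np.sqrt(1 - rho**2) * rng.normal(size=n)
y = 1.0 * x1 + 1.0 * x2 + rng.normal(size=n)

def fit(X, y):
    """OLS coefficients via least squares."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return beta

full = fit(np.column_stack([np.ones(n), x1, x2]), y)
reduced = fit(np.column_stack([np.ones(n), x1]), y)   # x2 dropped

print("x1 coefficient, both predictors:", full[1])
print("x1 coefficient, x2 omitted:     ", reduced[1])
```

With `x2` left out, the `x1` coefficient absorbs most of the effect of `x2`, so it no longer estimates the effect of `x1` alone.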

A third option could be to use PCA or another type of regression known as partial least squares. However, both have the disadvantage of leaving your model coefficients uninterpretable, and if that's the case, it's simpler to do nothing.

👍