<a href="https://colab.research.google.com/github/Kaushal-DCU-2023-25/CA683I_DA_AM_Assignment/blob/main/M4_2_3_Measuring_Multicollinearity.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Measuring Multicollinearity**

In Topic M1.3.12 we showed how the variance inflation factor (VIF) can be used to measure multicollinearity in a feature dataset. The VIF is calculated by regressing each independent variable against all the other independent variables, using a model of the following form:

>>$X_1=\alpha + \beta_2 X_2+\dots+\beta_n X_n + \varepsilon$

We then take the $R^2$ from each variable's regression model and calculate the VIF using the following equation:

>>$\mathrm{VIF} = \dfrac{1}{1 - R^2}$

If a VIF is greater than 10 then you really need to drop variables from your model, but if it is between 5 and 10 then you should consider doing so.
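
To make the two steps concrete, here is a minimal sketch that computes each VIF by hand: it runs the auxiliary regression for every column, takes its $R^2$, and applies the formula. The function name `vif_by_hand` and the synthetic data are assumptions made purely for illustration; they are not part of the course material.

```python
import numpy as np

def vif_by_hand(X):
    """Return one VIF per column of X (shape: n_samples x n_features)."""
    n, p = X.shape
    vifs = []
    for j in range(p):
        target = X[:, j]
        others = np.delete(X, j, axis=1)
        # Auxiliary regression: column j on an intercept plus all other columns
        A = np.column_stack([np.ones(n), others])
        coef, *_ = np.linalg.lstsq(A, target, rcond=None)
        resid = target - A @ coef
        r2 = 1.0 - (resid @ resid) / np.sum((target - target.mean()) ** 2)
        vifs.append(1.0 / (1.0 - r2))
    return np.array(vifs)

# Synthetic data (illustrative assumption): x3 is almost exactly x1 + x2,
# while x4 is unrelated to the others.
rng = np.random.default_rng(0)
x1 = rng.normal(size=500)
x2 = rng.normal(size=500)
x3 = x1 + x2 + 0.1 * rng.normal(size=500)
x4 = rng.normal(size=500)
X_demo = np.column_stack([x1, x2, x3, x4])

vifs = vif_by_hand(X_demo)
print(np.round(vifs, 1))
```

The three columns involved in the near-linear relationship should come out with very large VIFs, while the unrelated column should stay close to 1.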

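Let's look at the housing data from scikit-learn. This notebook was written around the Boston housing data, but `load_boston` was removed in scikit-learn 1.2, so the code below loads the California housing data instead and keeps the old `boston` and `bos` variable names.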

In [1]:
from sklearn.datasets import fetch_california_housing

import numpy as np
import pandas as pd

#Visualization Libraries
import seaborn as sns
import matplotlib.pyplot as plt

#To plot the graph embedded in the notebook
%matplotlib inline
boston = fetch_california_housing()
#print(boston.DESCR)
bos = pd.DataFrame(boston.data, columns = boston.feature_names)
bos['PRICE'] = boston.target
#print(bos.describe())
#print(bos.columns)


The Boston data had 14 variables (13 features plus PRICE); the California data loaded above has 9 (8 features plus PRICE). We will now examine the correlation matrix of the features. Any correlations above 0.7 in absolute value in this matrix tell us that we are likely to have issues.

In [None]:
bos_1 = pd.DataFrame(boston.data, columns = boston.feature_names)

correlation_matrix = bos_1.corr().round(2)
sns.heatmap(data=correlation_matrix, annot=True)
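
Reading high correlations off a heatmap by eye gets harder as the number of features grows. As a rough sketch, the pairs above the 0.7 threshold can also be listed directly. The helper name `high_corr_pairs` and the synthetic frame are assumptions for illustration only.

```python
import numpy as np
import pandas as pd

def high_corr_pairs(df, threshold=0.7):
    """List feature pairs whose absolute correlation exceeds threshold."""
    corr = df.corr()
    # Keep the strict upper triangle so each pair is reported once
    # and the diagonal of 1.0s is ignored.
    upper = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    pairs = corr.where(upper).stack()
    return pairs[pairs.abs() > threshold]

# Synthetic frame (illustrative assumption): b is a noisy copy of a.
rng = np.random.default_rng(0)
a = rng.normal(size=300)
demo = pd.DataFrame({
    "a": a,
    "b": a + 0.3 * rng.normal(size=300),
    "c": rng.normal(size=300),
})

flagged = high_corr_pairs(demo)
print(flagged)
```

On the notebook's data this would be called as `high_corr_pairs(bos_1)`. Keep in mind that pairwise correlation only catches relationships between two variables, which is why we still need the VIF.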

Now this is the important part which we didn't show in M1.3.12. When you run your model in Python, the summary gives two small warnings at the bottom. Don't worry about [1], but \[2\] tells us that the condition number is large, meaning that the $X^TX$ matrix is close to non-invertible. This is telling us that there is possible multicollinearity in our data. The condition number is based on the ratio of the largest eigenvalue to the smallest eigenvalue of $X^TX$, where $X$ is the design matrix (statsmodels reports the square root of this ratio). This eigenvalue ratio may also be high because of scaling differences in our design matrix, so we will have to calculate the VIFs for each variable before we can decide what to do next.
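
As a small sketch of where this number comes from, the snippet below builds the eigenvalues of $X^TX$ for a made-up design matrix and compares the square root of their ratio with NumPy's own 2-norm condition number of $X$. The matrix `X_demo` is a hypothetical example with independent columns on very different scales; it is not the housing data.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200

# Hypothetical design matrix: intercept, a unit-scale column,
# and an independent column that is roughly 1000 times larger.
X_demo = np.column_stack([
    np.ones(n),
    rng.normal(size=n),
    1000.0 * rng.normal(size=n),
])

# Eigenvalues of the symmetric matrix X^T X, returned in ascending order
eigvals = np.linalg.eigvalsh(X_demo.T @ X_demo)
eig_ratio = eigvals[-1] / eigvals[0]

# Square root of the eigenvalue ratio = ratio of the singular values of X
cond_X = np.sqrt(eig_ratio)

print("eigenvalue ratio of X^T X:", eig_ratio)
print("condition number of X:    ", cond_X)
print("np.linalg.cond(X):        ", np.linalg.cond(X_demo))
```

None of these columns is correlated with the others, yet the condition number is large, which is exactly the scaling effect described above.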

In [None]:
import statsmodels.api as sm
import numpy as np
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.metrics import mean_squared_error

X=bos[boston.feature_names]
# Use the next line if you want to drop DIS and RAD. These are Boston column
# names; with another dataset, list the columns you want to keep instead.
#X= X[['CRIM', 'ZN', 'INDUS', 'CHAS', 'NOX', 'RM', 'AGE' , 'TAX', 'PTRATIO','B', 'LSTAT']]

y=bos['PRICE']
X_train_1, X_test_1, y_train_1, y_test_1 = train_test_split(X, y, test_size = 0.2, random_state=5)


# Add the intercept column and fit OLS on log(PRICE)
X = sm.add_constant(X_train_1)

model = sm.OLS(np.log(y_train_1),X)
results = model.fit()
y_pred=results.predict(X)

# The model predicts log(PRICE), so the RMSE is computed on the log scale
rms = np.sqrt(mean_squared_error(np.log(y_train_1), y_pred))

X_test = sm.add_constant(X_test_1)
y_test_pred=results.predict(X_test)
rms_test = np.sqrt(mean_squared_error(np.log(y_test_1), y_test_pred))

print("training root mean square error (log scale) is: ",rms)
print("test root mean square error (log scale) is: ",rms_test)
print(results.summary())

In [None]:
import matplotlib.pyplot as plt

fig, ax = plt.subplots()
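# Predicted vs. actual values, both on the log scale used by the model
ax.plot(y_test_pred, np.log(y_test_1), 'o', label="Test")
ax.plot(y_pred, np.log(y_train_1), 'o', label="Train")
ax.set_xlabel("Predicted log(PRICE)")
ax.set_ylabel("Actual log(PRICE)")

ax.legend(loc="best");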

In [None]:
from statsmodels.stats.outliers_influence import variance_inflation_factor
from statsmodels.api import add_constant
import numpy as np

# Include the intercept so each auxiliary regression has a constant term
X = add_constant(X_train_1)

vif = [variance_inflation_factor(X.to_numpy(), i) for i in range(X.shape[1])]
print(vif[1:])  # skip the entry for the constant column

# vif[1:] lines up with X.columns[1:], so index the same slice of columns
print("VIF > 5:", X.columns[1:][np.asarray(vif[1:]) > 5])


The code above finds the variables that are affected by multicollinearity. In the Boston data these are DIS (weighted distances to five Boston employment centres) and RAD (index of accessibility to radial highways); with a different dataset, use whichever columns the cell flags. If we remove these variables, you will see that the condition number is lower but still high, whereas the VIFs are all fine. This high condition number is then purely a scaling issue, as the multicollinearity is now gone. The literature tells us that a condition number above 20 is high. However, this can be caused by variables measured on differing scales as well as by multicollinearity.
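
To see that scaling alone can inflate the condition number while leaving the VIFs untouched, here is a small sketch on synthetic data. It assumes two independent features on very different scales, and it uses the identity that the VIFs are the diagonal of the inverse correlation matrix rather than the regression route used earlier. The helper names are hypothetical.

```python
import numpy as np

def vif_from_corr(X):
    """VIFs as the diagonal of the inverse of the correlation matrix."""
    return np.diag(np.linalg.inv(np.corrcoef(X, rowvar=False)))

def cond_with_intercept(X):
    """2-norm condition number of the design matrix [1, X]."""
    design = np.column_stack([np.ones(len(X)), X])
    return np.linalg.cond(design)

# Synthetic data (illustrative assumption): two independent features,
# one around 0 with unit spread, one around 5000 with spread 1000.
rng = np.random.default_rng(2)
n = 500
X_raw = np.column_stack([
    rng.normal(0.0, 1.0, size=n),
    rng.normal(5000.0, 1000.0, size=n),
])

# Standardise each column to mean 0 and standard deviation 1
X_std = (X_raw - X_raw.mean(axis=0)) / X_raw.std(axis=0)

vif_raw, vif_std = vif_from_corr(X_raw), vif_from_corr(X_std)
cond_raw, cond_std = cond_with_intercept(X_raw), cond_with_intercept(X_std)

print("VIF raw:", vif_raw, " VIF standardised:", vif_std)
print("condition number raw:", cond_raw, " standardised:", cond_std)
```

The VIFs are the same before and after standardising, while the condition number drops sharply. So a high condition number alongside acceptable VIFs points to scaling rather than multicollinearity.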

I would now like you to experiment with the variables in this model and see what happens when you reduce them further. What happens to your predictions? Are you concerned about the low values of the Y variable? And can you explain why we have taken the log of Y?