## Imports and loading data
The Boston housing dataset is loaded directly from its [source](http://lib.stat.cmu.edu/datasets/boston).

In [1]:
from sklearn.model_selection import train_test_split

import pandas as pd
import numpy as np

import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor


In [2]:
import load
data = load.boston()

In [3]:
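# Hold out 20% of the rows for testing; the target is the log of PRICE
X_train, X_test, y_train, y_test = train_test_split(
    data.drop('PRICE', axis=1),
    np.log(data['PRICE']),
    test_size=0.2,
    random_state=10,
)
# statsmodels' OLS does not add an intercept on its own, so add a constant column
x_incl_const = sm.add_constant(X_train)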

model = sm.OLS(y_train, x_incl_const)
results = model.fit()

# Testing for Multicollinearity

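The variance inflation factor (VIF) of a feature is found by regressing that feature on all the other features and using the $R^2$ of that auxiliary regression. Taking TAX as an example:

$$ TAX = \alpha_0 + \alpha_1 RM + \alpha_2 NOX + \dots + \alpha_{12} LSTAT $$
$$ VIF_{TAX} = \frac{1}{1 - R^2_{TAX}} $$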
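The formula above can be checked by hand. The following is a minimal sketch, not the notebook's own method: it uses synthetic data rather than the Boston data, and `vif_by_hand` is a hypothetical helper name. It regresses one column on the others with `np.linalg.lstsq`, computes $R^2$, and applies $1/(1-R^2)$.

```python
import numpy as np

def vif_by_hand(X, idx):
    """VIF of column `idx` of X (hypothetical helper, for illustration only)."""
    y = X[:, idx]
    others = np.delete(X, idx, axis=1)
    # Auxiliary regression: column idx on an intercept plus all other columns
    A = np.column_stack([np.ones(len(y)), others])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ coef
    r_squared = 1.0 - (resid @ resid) / np.sum((y - y.mean()) ** 2)
    return 1.0 / (1.0 - r_squared)

# Synthetic data (an assumption, not the Boston dataset)
rng = np.random.default_rng(0)
a = rng.normal(size=500)
b = rng.normal(size=500)
c = a + b + rng.normal(scale=0.1, size=500)  # nearly a linear combination of a and b
d = rng.normal(size=500)                     # unrelated to the other columns
X = np.column_stack([a, b, c, d])

vifs = [vif_by_hand(X, i) for i in range(X.shape[1])]
print(np.round(vifs, 2))
```

In this sketch the columns involved in the near-linear relationship (`a`, `b`, `c`) should get large VIFs, while the unrelated column `d` should stay close to 1.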

In [4]:
variance_inflation_factor(exog=np.asarray(x_incl_const), exog_idx=1)

1.7145250443932485

In [5]:
variance_inflation_factor(exog=x_incl_const.values, exog_idx=1)

1.7145250443932485

In [6]:
# Both ways above return the same array:
np.array_equal(x_incl_const.values, np.asarray(x_incl_const)), np.all(x_incl_const.values == np.asarray(x_incl_const))

(True, True)

In [7]:
x_incl_const.shape[1], len(x_incl_const.columns) # Two ways to get the number of columns

(14, 14)

In [8]:
vif = [variance_inflation_factor(exog=x_incl_const.values, exog_idx=idx) for idx in range(x_incl_const.shape[1])]
pd.DataFrame(
    {
        'coef_name': x_incl_const.columns,
        'VIF': np.around(vif,2)
    }
)

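|   | coef_name | VIF |
|---|-----------|-----|
| 0 | const | 597.55 |
| 1 | CRIM | 1.71 |
| 2 | ZN | 2.33 |
| 3 | INDUS | 3.94 |
| 4 | CHAS | 1.08 |
| 5 | NOX | 4.41 |
| 6 | RM | 1.84 |
| 7 | AGE | 3.33 |
| 8 | DIS | 4.22 |
| 9 | RAD | 7.31 |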


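A VIF below 10 usually means multicollinearity is not a serious concern; some academics prefer a stricter cutoff of 5. Of the features listed above, RAD has the highest VIF (7.31), which is below 10 but above 5. The large value for `const` belongs to the intercept column rather than to a feature and is usually ignored.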
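As a small illustration of applying these cutoffs, the sketch below flags features whose VIF exceeds a chosen threshold. The values are copied from the rows shown in the output above (with `const` left out); `flag_high_vif` is a hypothetical helper name, not part of statsmodels.

```python
# VIF values copied from the output above; const is omitted on purpose
vif_by_feature = {
    'CRIM': 1.71, 'ZN': 2.33, 'INDUS': 3.94, 'CHAS': 1.08, 'NOX': 4.41,
    'RM': 1.84, 'AGE': 3.33, 'DIS': 4.22, 'RAD': 7.31,
}

def flag_high_vif(vifs, threshold):
    """Return the names of features whose VIF is above `threshold`."""
    return [name for name, value in vifs.items() if value > threshold]

over_10 = flag_high_vif(vif_by_feature, threshold=10)
over_5 = flag_high_vif(vif_by_feature, threshold=5)
print(over_10)  # → []
print(over_5)   # → ['RAD']
```

With the looser cutoff of 10 nothing is flagged; with the stricter cutoff of 5 only RAD would warrant a closer look.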