# Multicollinearity Detection Using VIF

source: https://www.analyticsvidhya.com/blog/2020/03/what-is-multicollinearity/

The variance inflation factor (VIF) measures how much the variance of an estimated regression coefficient is inflated by correlation among the predictors.

VIFs assess the precision of coefficient estimates: the higher the VIF, the wider the confidence interval for that coefficient. Lower VIF values are preferable:
- **values between 1 and 5** suggest manageable correlation,
- **values exceeding 5** indicate severe multicollinearity.

A common rule of thumb is to keep VIF below 5, although some texts treat only a VIF greater than 10 as severe. Deciding whether corrective measures are needed remains a judgment call.

For instance, a VIF of 10 means that multicollinearity inflates the variance of that coefficient tenfold compared to a model in which the predictor is uncorrelated with the others.
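The tenfold figure follows from the usual definition VIF_j = 1 / (1 - R_j^2), where R_j^2 is the R-squared obtained by regressing predictor j on all the other predictors. A minimal sketch of that arithmetic is shown here; the helper name `vif_from_r_squared` is hypothetical and not part of any library.

```python
import math


def vif_from_r_squared(r_squared):
    """VIF of a predictor whose regression on the other predictors has this R^2."""
    if not 0.0 <= r_squared < 1.0:
        raise ValueError("r_squared must be in [0, 1)")
    return 1.0 / (1.0 - r_squared)


# Show how the VIF, and the standard-error multiplier sqrt(VIF), grow with R^2.
for r2 in (0.0, 0.5, 0.8, 0.9):
    vif = vif_from_r_squared(r2)
    print(f"R^2 = {r2:.1f} -> VIF = {vif:.2f}, standard error x {math.sqrt(vif):.2f}")
```

An R-squared of 0.9 corresponds to a VIF of 10. Since the standard error scales with the square root of the variance, the confidence interval for that coefficient is roughly 3.16 times wider than it would be without the correlation.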

In [7]:
# to avoid SSLCertVerificationError while loading dataset
# (this disables HTTPS certificate verification for the whole process,
# so use it only as a local workaround)
import ssl

ssl._create_default_https_context = ssl._create_unverified_context

In [30]:
import numpy as np

from sklearn.datasets import fetch_california_housing
from sklearn.preprocessing import minmax_scale

housing = fetch_california_housing()
print(housing.data.shape, housing.target.shape)
cols = [
    'MedInc',
    'HouseAge',
    'AveRooms',
    'AveBedrms',
    'Population',
    'AveOccup',
    'Latitude',
    'Longitude'
]
print(housing.DESCR)

(20640, 8) (20640,)
.. _california_housing_dataset:

California Housing dataset
--------------------------

**Data Set Characteristics:**

:Number of Instances: 20640

:Number of Attributes: 8 numeric, predictive attributes and the target

:Attribute Information:
    - MedInc        median income in block group
    - HouseAge      median house age in block group
    - AveRooms      average number of rooms per household
    - AveBedrms     average number of bedrooms per household
    - Population    block group population
    - AveOccup      average number of household members
    - Latitude      block group latitude
    - Longitude     block group longitude

:Missing Attribute Values: None

This dataset was obtained from the StatLib repository.
https://www.dcc.fc.up.pt/~ltorgo/Regression/cal_housing.html

The target variable is the median house value for California districts,
expressed in hundreds of thousands of dollars ($100,000).

This dataset was derived from the 1990 U.S. census, 

In [31]:
housing.data

array([[   8.3252    ,   41.        ,    6.98412698, ...,    2.55555556,
          37.88      , -122.23      ],
       [   8.3014    ,   21.        ,    6.23813708, ...,    2.10984183,
          37.86      , -122.22      ],
       [   7.2574    ,   52.        ,    8.28813559, ...,    2.80225989,
          37.85      , -122.24      ],
       ...,
       [   1.7       ,   17.        ,    5.20554273, ...,    2.3256351 ,
          39.43      , -121.22      ],
       [   1.8672    ,   18.        ,    5.32951289, ...,    2.12320917,
          39.43      , -121.32      ],
       [   2.3886    ,   16.        ,    5.25471698, ...,    2.61698113,
          39.37      , -121.24      ]])

In [32]:
X = housing.data
y = housing.target

In [35]:
# Scaling
X = minmax_scale(X, axis=0)

print(np.max(X[:, 1]))
print(np.average(X[:, 1]))

1.0
0.541950714394285


In [36]:
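import pandas as pd

# Import library for VIF
from statsmodels.stats.outliers_influence import variance_inflation_factor


def calc_vif(X, cols):
    """Return a DataFrame with the VIF of each column of X, labelled by cols."""
    # Note: variance_inflation_factor uses X as given and does not add an
    # intercept column, so the VIFs here are computed without a constant term.
    vif = pd.DataFrame()
    vif["variables"] = cols
    vif["VIF"] = [variance_inflation_factor(X, i) for i in range(X.shape[1])]

    return vif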

In [37]:
print(calc_vif(X, cols))

    variables        VIF
0      MedInc   7.879822
1    HouseAge   5.556537
2    AveRooms  33.975595
3   AveBedrms  23.430221
4  Population   2.751701
5    AveOccup   1.059813
6    Latitude   4.117505
7   Longitude   7.484599
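Because `X` is passed to `variance_inflation_factor` without an intercept column, the values above come from auxiliary regressions that have no constant term. The textbook VIF, 1 / (1 - R_j^2), assumes each auxiliary regression includes an intercept, so it may give different numbers on data that is not centered. The following is a minimal sketch of that intercept-based calculation using only NumPy. The function `vif_with_intercept` and the synthetic `demo` data are assumptions made for illustration, not part of the original notebook.

```python
import numpy as np


def vif_with_intercept(X):
    """VIF per column: regress each column on the others plus an intercept."""
    X = np.asarray(X, dtype=float)
    n_rows, n_cols = X.shape
    vifs = np.empty(n_cols)
    for j in range(n_cols):
        target = X[:, j]
        others = np.delete(X, j, axis=1)
        design = np.column_stack([np.ones(n_rows), others])
        coef, *_ = np.linalg.lstsq(design, target, rcond=None)
        residuals = target - design @ coef
        ss_res = residuals @ residuals
        ss_tot = ((target - target.mean()) ** 2).sum()
        # 1 / (1 - R^2) simplifies to ss_tot / ss_res.
        vifs[j] = ss_tot / ss_res
    return vifs


# Synthetic data (an assumption for illustration): b is built from a, c is independent.
rng = np.random.default_rng(0)
a = rng.normal(size=500)
b = a + 0.3 * rng.normal(size=500)
c = rng.normal(size=500)
demo = np.column_stack([a, b, c])

demo_vif = vif_with_intercept(demo)
shifted_vif = vif_with_intercept(demo + 10.0)  # same data, shifted by a constant
print(demo_vif)
print(shifted_vif)
```

With an intercept in each auxiliary regression, adding a constant to every column leaves the VIFs unchanged, so the two printed arrays should agree. To get the same behavior from statsmodels, a constant column can be appended with `statsmodels.api.add_constant` before calling `variance_inflation_factor`, skipping the VIF reported for the constant itself.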

