# Calculating VIF

## Collinearity
Collinearity is the state where two variables are highly correlated and carry similar information about the variance within a given dataset. To detect collinearity between pairs of variables, create a correlation matrix and look for off-diagonal entries with large absolute values. In R, use the `cor` function; in Python, this can be accomplished with NumPy's `corrcoef` function.
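As a minimal sketch of this check, the snippet below builds a small synthetic dataset and lists the variable pairs whose absolute correlation exceeds a cutoff. The variable names, the data, and the 0.8 cutoff are assumptions chosen for illustration; they do not come from the Ames data used later.

```python
import numpy as np

rng = np.random.default_rng(0)

# synthetic data: `b` is nearly a copy of `a`, `c` is unrelated (illustrative only)
a = rng.normal(size=500)
b = a + rng.normal(scale=0.1, size=500)
c = rng.normal(size=500)
names = ["a", "b", "c"]

# np.corrcoef treats each row as one variable by default
corr = np.corrcoef(np.vstack([a, b, c]))

# collect every pair whose absolute correlation is above the assumed cutoff
cutoff = 0.8
collinear_pairs = [
    (names[i], names[j])
    for i in range(len(names))
    for j in range(i + 1, len(names))
    if abs(corr[i, j]) > cutoff
]
print(collinear_pairs)
```

Only the upper triangle is scanned, so each pair is reported once and the diagonal (always 1) is ignored.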

Multicollinearity, on the other hand, is harder to detect because it emerges when three or more correlated variables are included in a model. To make matters worse, multicollinearity can emerge even when no isolated pair of variables is collinear.
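The following sketch illustrates that last point with synthetic data (all values are made up for illustration). A fifth variable is constructed as nearly the sum of four independent ones: no single pairwise correlation is very large, yet the fifth variable is almost perfectly explained by the other four together.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 2000

# four independent predictors plus a fifth that is almost exactly their sum
x = rng.normal(size=(n, 4))
x5 = x.sum(axis=1) + rng.normal(scale=0.05, size=n)
X = np.column_stack([x, x5])

# largest off-diagonal correlation: no pair looks alarming on its own
corr = np.corrcoef(X, rowvar=False)
max_pairwise = np.abs(corr - np.eye(5)).max()

# R-squared of x5 regressed on the other four predictors (with an intercept)
A = np.column_stack([np.ones(n), x])
coef, *_ = np.linalg.lstsq(A, x5, rcond=None)
resid = x5 - A @ coef
r_squared = 1 - resid.var() / x5.var()

print(round(max_pairwise, 2), round(r_squared, 4))
```

In this construction the largest pairwise correlation stays around 0.5, while the R-squared of the joint regression is close to 1, which is the situation a correlation matrix alone would miss.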

## VIF
The <em>Variance Inflation Factor (VIF)</em> is a measure of collinearity among predictor variables within a multiple regression.

It provides an index that measures how much the variance (the square of the estimate's standard deviation) of an estimated regression coefficient is increased because of collinearity.

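It is calculated as the ratio of the variance of a coefficient estimated in the full model to the variance of that same coefficient if its predictor were fit alone. Equivalently, for predictor j it equals 1 / (1 - R_j²), where R_j² is the R-squared obtained by regressing predictor j on all the other predictors.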
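A small numerical check can make the ratio concrete. The sketch below uses synthetic predictors (made up for illustration) and assumes the same error variance in both fits, so that it cancels out of the ratio. It compares the variance ratio with the equivalent form 1 / (1 - R²), where R² comes from regressing the chosen predictor on the others.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 1000

# synthetic predictors: x2 is correlated with x1, x3 is independent
x1 = rng.normal(size=n)
x2 = 0.8 * x1 + rng.normal(scale=0.6, size=n)
x3 = rng.normal(size=n)
X = np.column_stack([x1, x2, x3])
Xc = X - X.mean(axis=0)  # centering accounts for the intercept

j = 0  # compute the VIF of x1

# variance ratio: coefficient variance in the full model over the variance
# when x1 is the only predictor (the common error variance cancels)
var_full = np.linalg.inv(Xc.T @ Xc)[j, j]
var_alone = 1.0 / (Xc[:, j] @ Xc[:, j])
vif_ratio = var_full / var_alone

# equivalent form: 1 / (1 - R^2) from regressing x1 on the other predictors
others = np.delete(Xc, j, axis=1)
coef, *_ = np.linalg.lstsq(others, Xc[:, j], rcond=None)
resid = Xc[:, j] - others @ coef
r_squared = 1 - (resid @ resid) / (Xc[:, j] @ Xc[:, j])
vif_r2 = 1 / (1 - r_squared)

print(vif_ratio, vif_r2)
```

The two numbers agree up to floating-point error, which is why the R-squared form is a convenient way to compute VIF in practice.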

![Interpreting VIF](img/vif.png)

https://www.youtube.com/watch?v=qmt7ZZoiDwc

Approach

- The aim of this script is to calculate VIF and determine which features to consider for model training.
- No extensive cleaning or treatment of outliers (including trend outliers) will be done.
    - The focus is VIF, not necessarily model performance.

In [1]:
from ipywidgets import interactive
from sklearn.linear_model import LinearRegression
import ipywidgets as widgets
from IPython.display import display, HTML
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

## Ingestion

In [2]:
# load the Ames housing data and work on a copy
file = "data/ames_housing.xlsx"
ames_df = pd.read_excel(file)
df = ames_df.copy()
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2930 entries, 0 to 2929
Data columns (total 22 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   Order          2930 non-null   int64  
 1   Lot Area       2930 non-null   int64  
 2   Street         2930 non-null   object 
 3   Lot Config     2930 non-null   object 
 4   Neighborhood   2930 non-null   object 
 5   Overall Qual   2930 non-null   int64  
 6   Overall Cond   2930 non-null   int64  
 7   Mas Vnr Area   2907 non-null   float64
 8   Total Bsmt SF  2929 non-null   float64
 9   1st Flr SF     2930 non-null   int64  
 10  2nd Flr SF     2930 non-null   int64  
 11  Gr Liv Area    2930 non-null   int64  
 12  Full Bath      2930 non-null   int64  
 13  Half Bath      2930 non-null   int64  
 14  Kitchen AbvGr  2930 non-null   int64  
 15  TotRms AbvGrd  2930 non-null   int64  
 16  Fireplaces     2930 non-null   int64  
 17  Garage Cars    2929 non-null   float64
 18  Garage A

## Rinse, clean and scrub
The 25 observations with missing values are removed (2930 rows down to 2905), and the `Order` column is dropped.

In [3]:
df = df.dropna(axis = 0)
df.drop(["Order"], axis=1, inplace=True)
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 2905 entries, 0 to 2929
Data columns (total 21 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   Lot Area       2905 non-null   int64  
 1   Street         2905 non-null   object 
 2   Lot Config     2905 non-null   object 
 3   Neighborhood   2905 non-null   object 
 4   Overall Qual   2905 non-null   int64  
 5   Overall Cond   2905 non-null   int64  
 6   Mas Vnr Area   2905 non-null   float64
 7   Total Bsmt SF  2905 non-null   float64
 8   1st Flr SF     2905 non-null   int64  
 9   2nd Flr SF     2905 non-null   int64  
 10  Gr Liv Area    2905 non-null   int64  
 11  Full Bath      2905 non-null   int64  
 12  Half Bath      2905 non-null   int64  
 13  Kitchen AbvGr  2905 non-null   int64  
 14  TotRms AbvGrd  2905 non-null   int64  
 15  Fireplaces     2905 non-null   int64  
 16  Garage Cars    2905 non-null   float64
 17  Garage Area    2905 non-null   float64
 18  Porch Ar

Predictor & response definitions

In [4]:
predictors = [
    'Lot Area', 'Street', 'Lot Config', 'Neighborhood',
    'Overall Qual', 'Overall Cond', 'Mas Vnr Area', 'Total Bsmt SF',
    '1st Flr SF', '2nd Flr SF', 'Gr Liv Area', 'Full Bath', 'Half Bath',
    'Kitchen AbvGr', 'TotRms AbvGrd', 'Fireplaces', 'Garage Cars',
    'Garage Area', 'Porch Area', 'Pool Area'
]

response = ['SalePrice']

# convert to dataframe
df_predictors = df.loc[:, predictors]
df_response = df.loc[:, response]
df_predictors.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 2905 entries, 0 to 2929
Data columns (total 20 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   Lot Area       2905 non-null   int64  
 1   Street         2905 non-null   object 
 2   Lot Config     2905 non-null   object 
 3   Neighborhood   2905 non-null   object 
 4   Overall Qual   2905 non-null   int64  
 5   Overall Cond   2905 non-null   int64  
 6   Mas Vnr Area   2905 non-null   float64
 7   Total Bsmt SF  2905 non-null   float64
 8   1st Flr SF     2905 non-null   int64  
 9   2nd Flr SF     2905 non-null   int64  
 10  Gr Liv Area    2905 non-null   int64  
 11  Full Bath      2905 non-null   int64  
 12  Half Bath      2905 non-null   int64  
 13  Kitchen AbvGr  2905 non-null   int64  
 14  TotRms AbvGrd  2905 non-null   int64  
 15  Fireplaces     2905 non-null   int64  
 16  Garage Cars    2905 non-null   float64
 17  Garage Area    2905 non-null   float64
 18  Porch Ar

Produce a dataframe in which the categorical columns are replaced by binary (dummy) columns

In [5]:
categorical = [
    "Street", "Lot Config", "Neighborhood"
]

# one-hot encode the categorical columns, then join them to the remaining columns
cat_dummies = pd.get_dummies(df.loc[:, categorical])
df_cat = pd.concat([
    df.loc[:, ~df.columns.isin(categorical)],
    cat_dummies],
    axis = 1
)

df_cat.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 2905 entries, 0 to 2929
Data columns (total 53 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   Lot Area              2905 non-null   int64  
 1   Overall Qual          2905 non-null   int64  
 2   Overall Cond          2905 non-null   int64  
 3   Mas Vnr Area          2905 non-null   float64
 4   Total Bsmt SF         2905 non-null   float64
 5   1st Flr SF            2905 non-null   int64  
 6   2nd Flr SF            2905 non-null   int64  
 7   Gr Liv Area           2905 non-null   int64  
 8   Full Bath             2905 non-null   int64  
 9   Half Bath             2905 non-null   int64  
 10  Kitchen AbvGr         2905 non-null   int64  
 11  TotRms AbvGrd         2905 non-null   int64  
 12  Fireplaces            2905 non-null   int64  
 13  Garage Cars           2905 non-null   float64
 14  Garage Area           2905 non-null   float64
 15  Porch Area           

## Inspect collinearity

This is a large matrix. Let's interactively find the features with the highest positive or negative correlation.

In [6]:
def f(Threshold):
    # compute the correlation matrix once and blank out entries within the threshold
    corr = df_cat.corr()
    display(
        corr[corr.abs() > Threshold].astype(object).fillna("")
    )
    return Threshold

widgets.interactive(f,  Threshold=(0, 1, 0.1))


In [34]:
def f(Threshold):
    # correlation of every feature with SalePrice, keeping only |r| > Threshold
    corr_price = df_cat.corr().loc[:, "SalePrice"]
    display(
        corr_price[corr_price.abs() > Threshold]
    )
    return Threshold

widgets.interactive(f,  Threshold=(0, 1, 0.1))


In [8]:
def f(Threshold):
    # Series.corr() needs a second series, so take the SalePrice column of the full matrix
    corr_price = df_cat.corr()[["SalePrice"]]
    display(
        corr_price[corr_price.abs() > Threshold].astype(object).fillna("")
    )
    return Threshold

widgets.interactive(f,  Threshold=(0, 1, 0.1))


## Calculate VIF for each feature

In [35]:
feature_name = []
vif_value = []
rsq_value = []

# note: df_cat still contains SalePrice, so the response is treated as a feature here
for col in df_cat.columns:
    # regress each column on all the remaining columns
    X = df_cat.drop(columns=col)
    y = df_cat[col]

    lr = LinearRegression().fit(X, y)

    # VIF = 1 / (1 - R^2); a perfect fit gives an infinite VIF
    rsq = lr.score(X, y)
    if rsq != 1:
        vif = round(1 / (1 - rsq), 2)
    else:
        vif = float("inf")

    feature_name.append(col)
    rsq_value.append(rsq)
    vif_value.append(vif)

vif_df = pd.DataFrame({
        "r_squared": rsq_value,
        "vif": vif_value },
    index = feature_name
).sort_values(by="vif")

## Observations

Notes:
1. The R-squared value is that of each predictor regressed on all the other features.
2. The VIF represents the magnitude of multicollinearity with respect to the other features.

Findings:
1. Only the dummy predictors have infinite VIF. Each categorical variable was encoded with one dummy per level, so within a group the dummies sum to one and each is an exact linear combination of the others.

Note: `inf` indicates perfect correlation.
- This means that the variance of the predictor's coefficient is inflated without bound because of collinearity.
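The infinite values for the dummy predictors are what one would expect when every level of a category gets its own column. The sketch below reproduces the effect on a toy three-level column (illustrative values, not the Ames data) and shows one common remedy, the `drop_first=True` option of `pd.get_dummies`. This is offered as a possible adjustment, not as what the notebook does; the helper `r_squared` is a hypothetical name introduced here.

```python
import numpy as np
import pandas as pd

# toy three-level category (illustrative values, not the Ames data)
toy = pd.DataFrame(
    {"Lot Config": ["Inside", "Corner", "CulDSac", "Inside", "Corner",
                    "Inside", "CulDSac", "Inside"]}
)

def r_squared(target, others):
    """R-squared of `target` regressed on `others` plus an intercept."""
    A = np.column_stack([np.ones(len(target)), others])
    coef, *_ = np.linalg.lstsq(A, target, rcond=None)
    resid = target - A @ coef
    centered = target - target.mean()
    return 1 - (resid @ resid) / (centered @ centered)

# every level kept: the dummies sum to 1, so each is determined by the rest
full = pd.get_dummies(toy, dtype=float).to_numpy()
rsq_full = r_squared(full[:, 0], full[:, 1:])

# one level dropped: the exact linear dependence is gone
reduced = pd.get_dummies(toy, drop_first=True, dtype=float).to_numpy()
rsq_reduced = r_squared(reduced[:, 0], reduced[:, 1:])

print(rsq_full, rsq_reduced)
```

With all levels kept, the R-squared is 1 up to floating-point error, which corresponds to an infinite VIF. After dropping one level it falls well below 1, so the VIF becomes finite.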

In [36]:
vif_df

| Feature | r_squared | vif |
|---|---|---|
| Pool Area | 0.066594 | 1.07 |
| Overall Cond | 0.238349 | 1.31 |
| Porch Area | 0.241045 | 1.32 |
| Lot Area | 0.296761 | 1.42 |
| Kitchen AbvGr | 0.299679 | 1.43 |
| Fireplaces | 0.381397 | 1.62 |
| Mas Vnr Area | 0.38958 | 1.64 |
| Half Bath | 0.548995 | 2.22 |
| Full Bath | 0.641082 | 2.79 |
| Total Bsmt SF | 0.706724 | 3.41 |
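Since the stated aim is to decide which features to consider for training, one way to use these values is a simple cutoff. The sketch below re-enters the ten rows shown above and splits them at a VIF of 5. That cutoff is an assumption (a commonly quoted rule of thumb), not a value taken from this notebook, and the names `vif_shown` and `VIF_CUTOFF` are hypothetical.

```python
import pandas as pd

# the ten rows shown in the output above
vif_shown = pd.DataFrame(
    {
        "r_squared": [0.066594, 0.238349, 0.241045, 0.296761, 0.299679,
                      0.381397, 0.38958, 0.548995, 0.641082, 0.706724],
        "vif": [1.07, 1.31, 1.32, 1.42, 1.43, 1.62, 1.64, 2.22, 2.79, 3.41],
    },
    index=["Pool Area", "Overall Cond", "Porch Area", "Lot Area",
           "Kitchen AbvGr", "Fireplaces", "Mas Vnr Area", "Half Bath",
           "Full Bath", "Total Bsmt SF"],
)

# assumed cutoff: 5 is a common rule of thumb, not a value from this notebook
VIF_CUTOFF = 5.0

# features below the cutoff are kept; the rest are flagged for review
keep = vif_shown.index[vif_shown["vif"] < VIF_CUTOFF].tolist()
flag = vif_shown.index[vif_shown["vif"] >= VIF_CUTOFF].tolist()
print(len(keep), flag)
```

In the notebook itself the same filter could be applied to `vif_df` directly, where features with infinite VIF would land in the flagged list.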


## Feature request

In [None]:
class Slider():

    def __init__(self, name, df, widgets):
        self.name = name
        self.df = df
        self.widgets = widgets
        self.initialize_widget()

    def f(self, rows):
        # show the first `rows` rows of the dataframe
        display(self.df.head(rows))
        return rows

    def initialize_widget(self):
        # build the slider and render it in the notebook
        self.widget = self.widgets.interactive(self.f, rows=(0, 10, 1))
        display(self.widget)

Slider("Neel", df_cat, widgets)