# Detecting Multicollinearity using VIF

A common way to detect multicollinearity is the VIF (Variance Inflation Factor).

Let’s try detecting multicollinearity in a dataset to give you a flavor of what can go wrong.

I have created a dataset for predicting the salary of a person in a company based on the following features:

- Gender (0 – female, 1 – male)
- Age
- Years of service (years spent working in the company)
- Education level (0 – no formal education, 1 – undergraduate, 2 – postgraduate)

In [None]:
import pandas as pd
df = pd.read_csv(r'C:/Users/Dell/Desktop/salary.csv')
df.head()

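> The VIF of an independent variable is 1 / (1 - R^2), where R^2 comes from regressing that variable on all the other independent variables. So, the closer the R^2 value is to 1, the higher the VIF and the stronger the multicollinearity involving that variable.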
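
To make the link between R^2 and VIF concrete, here is a minimal sketch that regresses each feature on the remaining ones (with an intercept) and converts the resulting R^2 into 1 / (1 - R^2). The function name `manual_vif` and the synthetic data are illustrative assumptions, not part of the salary dataset. Note that, as far as I know, the statsmodels function used in this notebook does not add an intercept on its own, so its numbers can differ from this sketch unless the feature matrix includes a constant column.

```python
import numpy as np

def manual_vif(X):
    """Return the VIF of each column of a 2-D array, using 1 / (1 - R^2)."""
    X = np.asarray(X, dtype=float)
    n_rows, n_cols = X.shape
    vifs = []
    for i in range(n_cols):
        y = X[:, i]                        # feature treated as the target
        others = np.delete(X, i, axis=1)   # all remaining features
        # Least-squares fit of y on the other features plus an intercept
        A = np.column_stack([np.ones(n_rows), others])
        coef, *_ = np.linalg.lstsq(A, y, rcond=None)
        resid = y - A @ coef
        r2 = 1.0 - (resid @ resid) / ((y - y.mean()) ** 2).sum()
        vifs.append(1.0 / (1.0 - r2))
    return vifs

# Synthetic example (hypothetical data): c is almost a copy of a
rng = np.random.default_rng(0)
a = rng.normal(size=200)
b = rng.normal(size=200)
c = a + 0.1 * rng.normal(size=200)
demo_vifs = manual_vif(np.column_stack([a, b, c]))
print(demo_vifs)
```

In this toy setup, the columns `a` and `c` should come out with large VIFs because each is nearly determined by the other, while `b` should stay close to 1.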

In [None]:
# Import library for VIF
from statsmodels.stats.outliers_influence import variance_inflation_factor

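def calc_vif(X):
    """Return a DataFrame listing the VIF of every column in X."""
    vif = pd.DataFrame()
    vif["variables"] = X.columns
    # variance_inflation_factor expects a 2-D array and a column index
    vif["VIF"] = [variance_inflation_factor(X.values, i) for i in range(X.shape[1])]
    return vif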

> - VIF starts at 1 and has no upper limit.
> - VIF = 1 means no correlation between the independent variable and the other variables.
> - A VIF exceeding 5 or 10 indicates high multicollinearity between this independent variable and the others.
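
These rules of thumb can be applied to a VIF table programmatically. The sketch below assumes a table with the same `variables` and `VIF` columns that `calc_vif` produces. The helper name `flag_high_vif`, the default threshold of 5, and the example values are hypothetical.

```python
import pandas as pd

def flag_high_vif(vif_table, threshold=5.0):
    """Return a copy of the VIF table with a boolean 'high' column."""
    flagged = vif_table.copy()
    flagged["high"] = flagged["VIF"] > threshold
    return flagged

# Hypothetical VIF values, for illustration only
example = pd.DataFrame({"variables": ["feature_a", "feature_b"],
                        "VIF": [1.2, 12.0]})
flagged_example = flag_high_vif(example)
print(flagged_example)
```

Passing `threshold=10` gives the more lenient of the two cut-offs mentioned above.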

In [None]:
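# Use every column except the last one as features
# (this assumes the target, Salary, is the last column)
X = df.iloc[:, :-1]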
calc_vif(X)

## Fixing Multicollinearity
Dropping one of the correlated features helps bring down the multicollinearity between them:

In [None]:
# Drop 'Age' along with the target column 'Salary'
X = df.drop(['Age', 'Salary'], axis=1)
calc_vif(X)
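
When it is not obvious which features overlap, a look at pairwise correlations can help pick a candidate to drop. The following is a minimal sketch: the helper `correlated_pairs`, the 0.8 threshold, and the toy values are assumptions made for illustration and are not taken from `salary.csv`.

```python
import pandas as pd

def correlated_pairs(X, threshold=0.8):
    """Return (feature_a, feature_b, r) for pairs with |r| above threshold."""
    corr = X.corr()  # Pearson correlation matrix
    cols = list(corr.columns)
    pairs = []
    for i in range(len(cols)):
        for j in range(i + 1, len(cols)):
            r = corr.iloc[i, j]
            if abs(r) > threshold:
                pairs.append((cols[i], cols[j], float(r)))
    return pairs

# Hypothetical toy data with the same column names as the features above
toy = pd.DataFrame({
    "Age": [25, 32, 41, 38, 50, 29],
    "Years of service": [2, 8, 17, 13, 25, 5],
    "Education level": [1, 2, 0, 2, 1, 0],
})
toy_pairs = correlated_pairs(toy)
print(toy_pairs)
```

Pairwise correlation only catches relationships between two features at a time, whereas VIF also reflects a feature being explained by several others jointly, so the two checks complement each other.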

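Alternatively, combine the correlated features into a single one and drop the originals. Here, 'Age' and 'Years of service' are replaced by the age at joining:

In [None]: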
df2 = df.copy()
# Vectorized column subtraction; equivalent to the row-wise apply
df2['Age_at_joining'] = df2['Age'] - df2['Years of service']
X = df2.drop(['Age', 'Years of service', 'Salary'], axis=1)
calc_vif(X)