### Multicollinearity Problems

**Multicollinearity** occurs when independent variables in a dataset are strongly correlated with one another, not only with the target variable. Multicollinearity is a problem because **independent variables should be independent**.

A key goal of regression analysis is to isolate the relationship between each independent variable and the dependent variable. The interpretation of a regression coefficient is that it represents the mean change in the dependent variable for each 1 unit change in an independent variable when you hold all of the other independent variables constant.

However, when independent variables are correlated, it indicates that changes in one variable are associated with shifts in another variable. The stronger the correlation, the more difficult it is to change one variable without changing another. It becomes difficult for the model to estimate the relationship between each independent variable and the dependent variable independently.

### What Problems Does Multicollinearity Cause?

- The weights (coefficients) of the model can swing wildly depending on which other independent variables are in the model. They become very sensitive to small changes in the model.
- The coefficient estimates lose precision (their standard errors grow), which weakens the statistical power of the regression model.
- You might not be able to trust the p-values to identify independent variables that are statistically significant.
- As a result, the weights are hard to interpret.

However, the coefficients of features with low VIF (variance inflation factor) values can still be trusted. The coefficients of highly correlated features cannot: they may not reflect reality, and their p-values are usually high, which makes these features look statistically insignificant.

Multicollinearity doesn't affect the predictions or the goodness-of-fit, provided new data has the same correlation structure as the training data. If you just want to make predictions, a model with severe multicollinearity is just as good!
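
Both claims (unstable coefficients, stable fit) can be checked with a small simulation. The sketch below assumes NumPy is available; the data-generating model `y = 1 + 2*x1 + 3*x2 + noise`, the sample sizes, and the helper names `fit_ols` and `simulate` are illustrative choices, not something from a specific dataset.

```python
import numpy as np

def fit_ols(X, y):
    """Ordinary least squares with an intercept; returns [intercept, b1, b2, ...]."""
    A = np.column_stack([np.ones(len(X)), X])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return coef

def simulate(rho, n_runs=200, n=200, seed=0):
    """Refit the same model on fresh samples where corr(x1, x2) is about rho."""
    rng = np.random.default_rng(seed)
    cov = np.array([[1.0, rho], [rho, 1.0]])
    coefs, r2s = [], []
    for _ in range(n_runs):
        X = rng.multivariate_normal([0.0, 0.0], cov, size=n)
        y = 1.0 + 2.0 * X[:, 0] + 3.0 * X[:, 1] + rng.normal(0.0, 1.0, size=n)
        coef = fit_ols(X, y)
        pred = coef[0] + X @ coef[1:]
        ss_res = np.sum((y - pred) ** 2)
        ss_tot = np.sum((y - y.mean()) ** 2)
        coefs.append(coef[1:])            # slopes only
        r2s.append(1.0 - ss_res / ss_tot)  # in-sample goodness-of-fit
    # Spread of each slope across refits, and the average R^2
    return np.array(coefs).std(axis=0), float(np.mean(r2s))

std_low, r2_low = simulate(rho=0.0)
std_high, r2_high = simulate(rho=0.99)
print("slope std, rho=0.00:", std_low, "mean R^2:", round(r2_low, 3))
print("slope std, rho=0.99:", std_high, "mean R^2:", round(r2_high, 3))
```

With strongly correlated inputs the slopes should vary far more from one refit to the next, while R^2 stays high in both settings.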

### Perfect Multicollinearity
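Perfect multicollinearity is when one independent variable is an exact linear combination of one or more other independent variables. In the simplest case, the correlation between two explanatory variables is exactly 1 or -1, but it can also involve three or more variables whose pairwise correlations are all below 1 in absolute value.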
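
A minimal sketch of how perfect multicollinearity shows up numerically, assuming NumPy. The toy columns `x1`, `x2`, `x3` are made up for illustration: `x3` is an exact sum of the other two, so the design matrix loses a rank even though no single pairwise correlation reaches 1.

```python
import numpy as np

rng = np.random.default_rng(0)
x1 = rng.normal(size=50)
x2 = rng.normal(size=50)
x3 = x1 + x2  # exact linear combination of x1 and x2

X = np.column_stack([x1, x2, x3])

# A rank below the number of columns means the columns are linearly dependent
rank = np.linalg.matrix_rank(X)

# Largest off-diagonal absolute correlation
corr = np.corrcoef(X, rowvar=False)
max_pairwise = np.max(np.abs(corr - np.eye(3)))

print("columns:", X.shape[1], "rank:", rank)
print("largest pairwise |r|:", round(float(max_pairwise), 3))
```

This is why a rank check (or VIF) catches dependencies that a glance at the correlation matrix can miss.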

### Imperfect Multicollinearity

- **High Multicollinearity** - if the absolute correlation is between 0.5 and 0.9 (0.5 < |r| < 0.9)
- **Low Multicollinearity** - if the absolute correlation is less than 0.5 (|r| < 0.5)

### Options for Dealing with Multicollinearity
- **PCA** - but model interpretability decreases
- **VIF** - calculate the variance inflation factor and run a **backward elimination algorithm**
- **LASSO / Ridge / Elastic Net** - apply regularization
- **Feature Combination** - combine highly correlated features
- **Feature Removal** - using the correlation matrix, choose a threshold, then drop features whose correlation exceeds it
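
The correlation-matrix option from the list above can be sketched as follows, assuming NumPy. The function name `drop_correlated`, the greedy keep-first strategy, and the default threshold of 0.9 are assumptions made for illustration; other tie-breaking rules (for example, keeping the feature more correlated with the target) are equally reasonable.

```python
import numpy as np

def drop_correlated(X, names, threshold=0.9):
    """Keep a column only if its |r| with every already-kept column is <= threshold."""
    corr = np.abs(np.corrcoef(X, rowvar=False))
    kept = []
    for j in range(X.shape[1]):
        if all(corr[j, k] <= threshold for k in kept):
            kept.append(j)
    return X[:, kept], [names[j] for j in kept]

# Toy data: a_copy is a noisy near-duplicate of a
rng = np.random.default_rng(0)
a = rng.normal(size=300)
b = rng.normal(size=300)
a_copy = a + rng.normal(scale=0.05, size=300)

X = np.column_stack([a, b, a_copy])
X_reduced, kept_names = drop_correlated(X, ["a", "b", "a_copy"], threshold=0.9)
print(kept_names)
```

Because columns are scanned left to right, the result depends on column order: the first member of each correlated group survives.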

A commonly applied pipeline is to drop features based on the correlation matrix first and then apply VIF. I've tested this on only one dataset, where correlation-based dropping followed by VIF and VIF alone led to the same final number of features. VIF calculation is computationally expensive, so it is better to use the combination: correlation-based dropping, then VIF.

**VIF values can be**:
- 1 - No Correlation
- 1 to 5 - Moderate Correlation
- More than 5 - Serious Correlation
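
For reference, the VIF of feature `j` is `1 / (1 - R_j^2)`, where `R_j^2` is the R-squared from regressing feature `j` on all the other features. The sketch below computes it with plain NumPy and wraps it in a simple backward elimination loop. The function names `vif` and `vif_backward_elimination` and the toy data are assumptions for illustration, and the code assumes no column is constant.

```python
import numpy as np

def vif(X):
    """VIF_j = 1 / (1 - R_j^2), with R_j^2 from regressing column j on the others."""
    n, p = X.shape
    out = np.empty(p)
    for j in range(p):
        target = X[:, j]
        others = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
        coef, *_ = np.linalg.lstsq(others, target, rcond=None)
        resid = target - others @ coef
        r2 = 1.0 - (resid @ resid) / np.sum((target - target.mean()) ** 2)
        out[j] = np.inf if r2 >= 1.0 else 1.0 / (1.0 - r2)
    return out

def vif_backward_elimination(X, names, threshold=5.0):
    """Drop the feature with the largest VIF until every VIF is <= threshold."""
    names = list(names)
    while X.shape[1] > 1:
        scores = vif(X)
        worst = int(np.argmax(scores))
        if scores[worst] <= threshold:
            break
        X = np.delete(X, worst, axis=1)
        names.pop(worst)
    return X, names

# Toy data: c is almost exactly a + b
rng = np.random.default_rng(0)
a = rng.normal(size=300)
b = rng.normal(size=300)
c = a + b + rng.normal(scale=0.1, size=300)

X = np.column_stack([a, b, c])
print("initial VIFs:", np.round(vif(X), 1))
X_final, kept = vif_backward_elimination(X, ["a", "b", "c"], threshold=5.0)
print("kept:", kept, "final VIFs:", np.round(vif(X_final), 2))
```

Each pass refits one regression per remaining feature, which is why VIF-based elimination gets expensive as the feature count grows.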

### Important notes
- If there are many features, set the VIF threshold to 5; otherwise use 10.