# 1. Multicollinearity in multiple linear regression models

Multicollinearity is a statistical phenomenon that occurs when two or more independent variables in a multiple regression model are highly correlated. This means that one independent variable can be linearly predicted from the others with a substantial degree of accuracy. Multicollinearity can cause problems in estimating the coefficients of the regression model, making it difficult to determine the effect of each independent variable on the dependent variable. 

## 1.1. Disadvantages of multicollinearity

The disadvantages of multicollinearity in a regression model primarily relate to interpretation and estimation accuracy. Here are some of the main issues caused by multicollinearity:

1. **Unstable Coefficient Estimates**

    - When multicollinearity is present, small changes in the data or model specification can lead to large changes in the estimated coefficients. This instability makes it difficult to rely on the coefficients for understanding relationships between predictors and the response variable.

2. **Increased Standard Errors**

    - Multicollinearity inflates the standard errors of the coefficient estimates, so the estimates are less precise and their confidence intervals are wider. This reduces the statistical power of hypothesis tests, making it harder to detect significant relationships.

3. **Difficulty in Assessing the Importance of Predictors**

    - When predictors are highly correlated, it becomes challenging to determine the individual contribution of each predictor to the model. Coefficients may be counterintuitive or misleading, as changes in one predictor are often associated with changes in others.

4. **Misleading Significance Tests**

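    - High multicollinearity can cause individual coefficients to appear statistically insignificant even when the predictors are jointly significant (for example, a significant overall F-test alongside non-significant t-tests for each coefficient). This can lead to incorrect conclusions about which predictors are important.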

5. **Overfitting and Model Interpretation**

    - Multicollinearity can contribute to overfitting, particularly when the number of predictors is large relative to the sample size: the high variance of the coefficient estimates lets the model fit noise in the training data and then perform poorly on new data. Additionally, interpreting a model with multicollinearity is problematic since it’s unclear which variables are driving the results.

6. **Limited Extrapolation**

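    - Models with multicollinearity may not generalize well to datasets in which the predictors are related to each other differently than in the training data. Predictions made outside the range of the data, or for combinations of predictor values that break the observed correlation pattern, can be particularly unreliable.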

7. **Complicated Variable Selection**

    - In the presence of multicollinearity, standard variable selection techniques (e.g., stepwise selection) may not work effectively, as removing or adding predictors can dramatically change the model coefficients.

8. **Increased Sensitivity to Multivariate Outliers**

    - Models with multicollinearity are more sensitive to outliers in the predictor variables, which can disproportionately affect the model's stability and predictions.
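
The first two issues (unstable estimates and inflated standard errors) can be seen in a small simulation. The sketch below is illustrative only: the helper names `fit_two_predictors` and `coefficient_spread`, the sample size, the noise level, and the true coefficients are all assumptions chosen for the demonstration, and it uses only the Python standard library. It repeatedly draws data with two predictors, fits ordinary least squares, and compares how much the estimate of the first coefficient varies when the predictors are uncorrelated and when they are strongly correlated.

```python
import random
import statistics


def fit_two_predictors(x1, x2, y):
    """Ordinary least squares for y = b0 + b1*x1 + b2*x2; returns (b1, b2)."""
    m1, m2, my = statistics.fmean(x1), statistics.fmean(x2), statistics.fmean(y)
    # Centered sums of squares and cross-products.
    s11 = sum((a - m1) ** 2 for a in x1)
    s22 = sum((b - m2) ** 2 for b in x2)
    s12 = sum((a - m1) * (b - m2) for a, b in zip(x1, x2))
    s1y = sum((a - m1) * (c - my) for a, c in zip(x1, y))
    s2y = sum((b - m2) * (c - my) for b, c in zip(x2, y))
    # The determinant approaches 0 as x1 and x2 become collinear,
    # which is what makes the solution unstable.
    det = s11 * s22 - s12 ** 2
    b1 = (s22 * s1y - s12 * s2y) / det
    b2 = (s11 * s2y - s12 * s1y) / det
    return b1, b2


def coefficient_spread(rho, n=100, repeats=200, seed=0):
    """Std. dev. of the estimated b1 over repeated samples with corr(x1, x2) ~ rho."""
    rng = random.Random(seed)
    estimates = []
    for _ in range(repeats):
        x1 = [rng.gauss(0, 1) for _ in range(n)]
        # x2 has unit variance and population correlation rho with x1.
        x2 = [rho * a + (1 - rho ** 2) ** 0.5 * rng.gauss(0, 1) for a in x1]
        # True model (assumed): y = x1 + x2 + noise.
        y = [a + b + rng.gauss(0, 1) for a, b in zip(x1, x2)]
        b1, _ = fit_two_predictors(x1, x2, y)
        estimates.append(b1)
    return statistics.stdev(estimates)


spread_independent = coefficient_spread(rho=0.0)
spread_collinear = coefficient_spread(rho=0.99)
print(f"spread of b1 with rho = 0.00: {spread_independent:.3f}")
print(f"spread of b1 with rho = 0.99: {spread_collinear:.3f}")
```

Under these assumptions, theory predicts that the spread grows by a factor of about $1/\sqrt{1 - 0.99^2} \approx 7$ in the correlated case, so the simulated ratio should be of that order even though the true coefficients are identical in both settings.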

## 1.2. Techniques to handle multicollinearity

### 1.2.1. Detecting Multicollinearity

Before addressing multicollinearity, it is important to detect it using the following techniques:

- **Correlation Matrix**

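    - Calculate the correlation matrix to identify pairs of variables whose correlation coefficient is high in absolute value (a common rule of thumb is above 0.8 or 0.9).

    - High correlation between two variables indicates that they might be explaining the same variance in the dependent variable. Pairwise correlations can miss collinearity that involves three or more variables at once, which is why the VIF is usually checked as well.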

- **Variance Inflation Factor (VIF)**

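    - Calculate the VIF for each independent variable. As a rule of thumb, a VIF above 5 is often treated as a warning sign and a VIF above 10 as a serious multicollinearity problem; these cutoffs are conventions, not strict tests.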

    - The formula for VIF is:

$$
\text{VIF}(X_i) = \frac{1}{1 - R_i^2}
$$

Where:
- $R_i^2$ is the R-squared value obtained by regressing the variable $X_i$ on all other independent variables.
- A higher VIF indicates a higher level of multicollinearity.
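
To connect the formula to the usual cutoffs, it can be evaluated for a few values of $R_i^2$. The second line below is the standard OLS result that gives the factor its name; the symbols $\sigma^2$ (error variance), $s_i^2$ (sample variance of $X_i$), $n$ (number of observations), and $\hat{\beta}_i$ (estimated coefficient of $X_i$) are introduced here for illustration and do not appear elsewhere in these notes.

$$
R_i^2 = 0.5 \Rightarrow \text{VIF}(X_i) = 2, \qquad
R_i^2 = 0.8 \Rightarrow \text{VIF}(X_i) = 5, \qquad
R_i^2 = 0.9 \Rightarrow \text{VIF}(X_i) = 10
$$

$$
\operatorname{Var}(\hat{\beta}_i) = \frac{\sigma^2}{(n - 1)\, s_i^2} \cdot \text{VIF}(X_i)
$$

So a VIF of 10 means the variance of $\hat{\beta}_i$ is ten times what it would be if $X_i$ were uncorrelated with the other predictors, and its standard error is larger by a factor of $\sqrt{10} \approx 3.2$.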

By assessing the VIF values and the correlation matrix, you can identify and address potential multicollinearity issues in your regression model.
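
Both checks can be sketched in a few lines. The example below is a minimal standard-library illustration, not a reference implementation: the helper names (`centered`, `dot`, `correlation`, `solve`, `r_squared`, `vif`) and the synthetic data are assumptions made for this example, and in practice a numerical library would normally be used instead. The data are built so that `x3` is close to `x1 + x2`, which makes the collinearity a three-variable relationship rather than a pairwise one.

```python
import random
import statistics


def centered(values):
    """Return the values with their mean subtracted."""
    mean = statistics.fmean(values)
    return [v - mean for v in values]


def dot(a, b):
    return sum(p * q for p, q in zip(a, b))


def correlation(a, b):
    """Pearson correlation coefficient of two equal-length sequences."""
    ca, cb = centered(a), centered(b)
    return dot(ca, cb) / (dot(ca, ca) * dot(cb, cb)) ** 0.5


def solve(matrix, rhs):
    """Solve matrix @ x = rhs by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    aug = [row[:] + [value] for row, value in zip(matrix, rhs)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    x = [0.0] * n
    for r in reversed(range(n)):
        tail = sum(aug[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (aug[r][n] - tail) / aug[r][r]
    return x


def r_squared(target, others):
    """R^2 from regressing `target` on `others` (intercept handled by centering)."""
    t = centered(target)
    cols = [centered(col) for col in others]
    gram = [[dot(ci, cj) for cj in cols] for ci in cols]
    beta = solve(gram, [dot(ci, t) for ci in cols])
    fitted = [sum(b * col[k] for b, col in zip(beta, cols)) for k in range(len(t))]
    ss_res = sum((a - f) ** 2 for a, f in zip(t, fitted))
    return 1 - ss_res / dot(t, t)


def vif(columns):
    """VIF(X_i) = 1 / (1 - R_i^2) for every column in a name -> values dict."""
    result = {}
    for name, col in columns.items():
        others = [c for other, c in columns.items() if other != name]
        result[name] = 1 / (1 - r_squared(col, others))
    return result


# Synthetic data (assumed for illustration only).
rng = random.Random(0)
n = 200
x1 = [rng.gauss(0, 1) for _ in range(n)]
x2 = [rng.gauss(0, 1) for _ in range(n)]
x3 = [a + b + rng.gauss(0, 0.3) for a, b in zip(x1, x2)]  # close to x1 + x2
x4 = [rng.gauss(0, 1) for _ in range(n)]  # unrelated to the others
columns = {"x1": x1, "x2": x2, "x3": x3, "x4": x4}

names = list(columns)
for i, a in enumerate(names):
    for b in names[i + 1:]:
        print(f"corr({a}, {b}) = {correlation(columns[a], columns[b]):+.2f}")

vifs = vif(columns)
for name, value in vifs.items():
    print(f"VIF({name}) = {value:.1f}")
```

In this synthetic setup no pairwise correlation is expected to reach the 0.8 threshold (the population correlation between `x1` and `x3` is about 0.69), yet the VIF of `x3` should be well above 10. This is the kind of case where the correlation matrix alone would miss the problem.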


### 1.2.2. Methods to Handle Multicollinearity

- **Remove Highly Correlated Predictors**

  - **Identify and Remove:** If two or more variables are highly correlated, consider removing one of them, especially if it does not add significant value to the model. Use domain knowledge to decide which variable to retain.

- **Combine Variables**

  - **Create Composite Variables:** Combine correlated variables into a single composite variable, such as by taking the average or sum (standardize the variables first if they are measured on different scales). This can reduce multicollinearity while retaining most of the explanatory power.

- **Principal Component Analysis (PCA)**

  - **Dimensionality Reduction:** Use PCA to transform correlated variables into a smaller set of uncorrelated variables (principal components). These components can then be used as predictors in the regression model.

- **Regularization Techniques**

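  - **Ridge Regression (L2 Regularization):** Add a penalty term to the loss function proportional to the sum of the squared coefficients. This shrinks the coefficients of correlated variables and stabilizes their estimates. Ridge does not remove the correlation among the predictors; it reduces the variance of the estimates at the cost of some bias.

  - The loss function for Ridge Regression is:

  $$
  \text{Loss} = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 + \lambda \sum_{j=1}^{p} w_j^2
  $$

  - **Lasso Regression (L1 Regularization):** Add a penalty term proportional to the sum of the absolute values of the coefficients, which can shrink some coefficients to exactly zero and effectively perform variable selection. Among a group of highly correlated predictors, lasso tends to keep one and drop the others, and which one it keeps can be somewhat arbitrary.

  - The loss function for Lasso Regression is:

  $$
  \text{Loss} = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 + \lambda \sum_{j=1}^{p} |w_j|
  $$

  In both formulas, $i$ indexes the $n$ observations, $j$ indexes the $p$ predictors, $w_j$ is the coefficient of the $j$-th predictor, and $\lambda \ge 0$ controls the strength of the penalty.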

- **Partial Least Squares Regression (PLS)**

  - **PLS Regression:** Similar to PCA, PLS reduces the predictors to a smaller set of uncorrelated components, but it also considers the response variable, ensuring the components are relevant for prediction.

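- **Feature Selection Techniques**

  - **Backward Elimination, Forward Selection, or Stepwise Selection:** Use these techniques to iteratively add or remove variables based on their statistical significance and contribution to the model, which can help drop redundant predictors. Because coefficient estimates and p-values are themselves unstable under multicollinearity, the selected subset can change from sample to sample, so check the result against VIF values and domain knowledge.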
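
The ridge and lasso losses above can be minimized directly for a two-predictor model, which makes the effect of the penalty easy to see. The following is a minimal sketch under stated assumptions, not a production implementation: the function names (`ridge_two_predictors`, `lasso_two_predictors`, `soft_threshold`), the synthetic data, and the values of $\lambda$ are all chosen for illustration. Ridge uses the closed-form solution of the penalized normal equations, and lasso uses coordinate descent with soft-thresholding; the data are centered so that no intercept is needed.

```python
import random
import statistics


def centered(values):
    """Return the values with their mean subtracted."""
    mean = statistics.fmean(values)
    return [v - mean for v in values]


def ridge_two_predictors(x1, x2, y, lam):
    """Minimize sum (y - yhat)^2 + lam * (w1^2 + w2^2) on centered data.

    Solves (X'X + lam * I) w = X'y; lam = 0 gives ordinary least squares.
    """
    s11 = sum(a * a for a in x1) + lam
    s22 = sum(b * b for b in x2) + lam
    s12 = sum(a * b for a, b in zip(x1, x2))
    s1y = sum(a * c for a, c in zip(x1, y))
    s2y = sum(b * c for b, c in zip(x2, y))
    det = s11 * s22 - s12 ** 2
    return (s22 * s1y - s12 * s2y) / det, (s11 * s2y - s12 * s1y) / det


def soft_threshold(value, threshold):
    """Shrink value toward zero by threshold, returning 0 inside the band."""
    if value > threshold:
        return value - threshold
    if value < -threshold:
        return value + threshold
    return 0.0


def lasso_two_predictors(x1, x2, y, lam, sweeps=500):
    """Minimize sum (y - yhat)^2 + lam * (|w1| + |w2|) by coordinate descent."""
    cols = [x1, x2]
    w = [0.0, 0.0]
    for _ in range(sweeps):
        for j in (0, 1):
            other = cols[1 - j]
            # Inner product of column j with the residual that leaves column j out.
            rho = sum(xj * (yi - w[1 - j] * xo)
                      for xj, yi, xo in zip(cols[j], y, other))
            z = sum(xj * xj for xj in cols[j])
            # The threshold is lam / 2 because the squared error is not halved.
            w[j] = soft_threshold(rho, lam / 2) / z
    return tuple(w)


# Synthetic data (assumed): two strongly correlated predictors, true weights 1 and 1.
rng = random.Random(0)
n = 50
raw_x1 = [rng.gauss(0, 1) for _ in range(n)]
raw_x2 = [a + rng.gauss(0, 0.3) for a in raw_x1]
raw_y = [a + b + rng.gauss(0, 1) for a, b in zip(raw_x1, raw_x2)]
x1, x2, y = centered(raw_x1), centered(raw_x2), centered(raw_y)

ols = ridge_two_predictors(x1, x2, y, lam=0.0)
ridge = ridge_two_predictors(x1, x2, y, lam=10.0)
lasso = lasso_two_predictors(x1, x2, y, lam=20.0)
print(f"OLS   w1 = {ols[0]:+.3f}, w2 = {ols[1]:+.3f}")
print(f"Ridge w1 = {ridge[0]:+.3f}, w2 = {ridge[1]:+.3f}")
print(f"Lasso w1 = {lasso[0]:+.3f}, w2 = {lasso[1]:+.3f}")
```

The ridge solution always has a smaller squared norm than the OLS solution, and the lasso solution never has a larger sum of absolute coefficients than OLS. Whether lasso sets one of the two weights exactly to zero depends on the data and on $\lambda$, which in practice is usually chosen by cross-validation.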