In [2]:
import pandas as pd
from sklearn.datasets import load_diabetes

# Collinearity

Collinearity occurs when two or more independent variables in a regression model are highly correlated with each other. Linear regression assumes that no independent variable is an exact linear combination of the others; strong (even if imperfect) correlation pushes the model toward violating that assumption.

## Why is this a problem?

If two variables are perfectly correlated, the OLS estimator $\hat{\beta} = (X^TX)^{-1}X^Ty$ cannot be computed at all because $X^TX$ is not invertible.

In practice, correlations are rarely perfect, but when they are high the model cannot reliably distinguish between the effects of the two variables, and the coefficient estimates become unstable.
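
Both cases can be sketched with NumPy on synthetic data. This is a minimal illustration, not part of the diabetes example: the names `x1`, `x2_exact`, `x2_noisy` and `x3` are made up here, and the noise scale of `1e-3` is an arbitrary choice.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100
x1 = rng.normal(size=n)
x3 = rng.normal(size=n)  # unrelated to x1

# Case 1: x2 is an exact multiple of x1.
# X has 3 columns but only rank 2, and X^T X has the same rank as X,
# so X^T X is singular and cannot be inverted.
x2_exact = 2.0 * x1
X_exact = np.column_stack([np.ones(n), x1, x2_exact])
rank_exact = np.linalg.matrix_rank(X_exact)
print("rank:", rank_exact, "columns:", X_exact.shape[1])

# Case 2: x2 is almost, but not exactly, a copy of x1.
# X^T X is invertible, but the problem is badly conditioned.
x2_noisy = x1 + 1e-3 * rng.normal(size=n)
X_noisy = np.column_stack([np.ones(n), x1, x2_noisy])
X_indep = np.column_stack([np.ones(n), x1, x3])
cond_noisy = np.linalg.cond(X_noisy)
cond_indep = np.linalg.cond(X_indep)
print("condition number, nearly collinear:", cond_noisy)
print("condition number, independent:", cond_indep)
```

The nearly collinear design should report a condition number far larger than the independent one, which is the numerical counterpart of "unstable results".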

## How to detect and handle multicollinearity?

1. Check that no variable has been added twice.
1. Check for redundant variables.
1. Use a regularized estimation method (Ridge, Lasso).
1. Combine the correlated features into a set of uncorrelated features, for example with principal component analysis (PCA).

Note that you shouldn't remove variables just because they are not significant in a model that has multicollinearity; they may be the ones that have predictive power.
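
To illustrate why a regularized estimator helps, here is a rough sketch on synthetic data. It uses a hand-written closed-form ridge solution (not scikit-learn's `Ridge`), fits no intercept, and the helper `fit_linear` and the penalty `alpha=1.0` are illustrative assumptions rather than recommendations.

```python
import numpy as np

def fit_linear(X, y, alpha=0.0):
    """Closed-form least squares with an optional ridge penalty (no intercept).

    alpha=0.0 gives plain OLS; alpha>0 adds alpha * I to X^T X.
    """
    n_features = X.shape[1]
    return np.linalg.solve(X.T @ X + alpha * np.eye(n_features), X.T @ y)

rng = np.random.default_rng(42)
ols_coefs, ridge_coefs = [], []

# Refit both estimators on 200 freshly simulated datasets.
for _ in range(200):
    x1 = rng.normal(size=100)
    x2 = x1 + 0.01 * rng.normal(size=100)  # nearly collinear with x1
    X = np.column_stack([x1, x2])
    y = x1 + x2 + rng.normal(size=100)     # true coefficients: (1, 1)
    ols_coefs.append(fit_linear(X, y, alpha=0.0))
    ridge_coefs.append(fit_linear(X, y, alpha=1.0))

# Spread of each coefficient across the simulated datasets.
ols_std = np.std(ols_coefs, axis=0)
ridge_std = np.std(ridge_coefs, axis=0)
print("OLS coefficient std:  ", ols_std)
print("Ridge coefficient std:", ridge_std)
```

The ridge coefficients vary much less from one dataset to the next. The price is a small bias toward zero, which is the usual trade-off of regularization.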




In [3]:
def sklearn_to_df(sklearn_data):
    df = pd.DataFrame(sklearn_data.data, columns=sklearn_data.feature_names)
    df["target"] = sklearn_data.target
    return df


In [4]:
data = load_diabetes()
df = sklearn_to_df(data)
df.head()


,age,sex,bmi,bp,s1,s2,s3,s4,s5,s6,target
0,0.038076,0.05068,0.061696,0.021872,-0.044223,-0.034821,-0.043401,-0.002592,0.019907,-0.017646,151.0
1,-0.001882,-0.044642,-0.051474,-0.026328,-0.008449,-0.019163,0.074412,-0.039493,-0.068332,-0.092204,75.0
2,0.085299,0.05068,0.044451,-0.00567,-0.045599,-0.034194,-0.032356,-0.002592,0.002861,-0.02593,141.0
3,-0.089063,-0.044642,-0.011595,-0.036656,0.012191,0.024991,-0.036038,0.034309,0.022688,-0.009362,206.0
4,0.005383,-0.044642,-0.036385,0.021872,0.003935,0.015596,0.008142,-0.002592,-0.031988,-0.046641,135.0


In [10]:
C = df.corr().round(2)

In [13]:

def colour_large_values_red(val):
    """Colour cells based on their value - a useful pattern for reports!"""
    color = 'red' if abs(val) >= 0.9 else 'lightgrey'
    return 'background-color: %s' % color

In [14]:
C.style.applymap(colour_large_values_red)




,age,sex,bmi,bp,s1,s2,s3,s4,s5,s6,target
age,1.0,0.17,0.19,0.34,0.26,0.22,-0.08,0.2,0.27,0.3,0.19
sex,0.17,1.0,0.09,0.24,0.04,0.14,-0.38,0.33,0.15,0.21,0.04
bmi,0.19,0.09,1.0,0.4,0.25,0.26,-0.37,0.41,0.45,0.39,0.59
bp,0.34,0.24,0.4,1.0,0.24,0.19,-0.18,0.26,0.39,0.39,0.44
s1,0.26,0.04,0.25,0.24,1.0,0.9,0.05,0.54,0.52,0.33,0.21
s2,0.22,0.14,0.26,0.19,0.9,1.0,-0.2,0.66,0.32,0.29,0.17
s3,-0.08,-0.38,-0.37,-0.18,0.05,-0.2,1.0,-0.74,-0.4,-0.27,-0.39
s4,0.2,0.33,0.41,0.26,0.54,0.66,-0.74,1.0,0.62,0.42,0.43
s5,0.27,0.15,0.45,0.39,0.52,0.32,-0.4,0.62,1.0,0.46,0.57
s6,0.3,0.21,0.39,0.39,0.33,0.29,-0.27,0.42,0.46,1.0,0.38


As we can see, s1 and s2 are highly correlated with each other (correlation of 0.9).
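
Reading the pairs off a coloured table works for ten variables, but the same check can be done programmatically. The helper `high_corr_pairs` below is a hypothetical name, and the small matrix only copies the s1 to s4 entries from the table above so that the sketch runs on its own; in the notebook you would pass `C` instead.

```python
import pandas as pd

def high_corr_pairs(corr, threshold):
    """List (var_a, var_b, correlation) for each pair with |correlation| >= threshold."""
    pairs = []
    cols = list(corr.columns)
    for i, a in enumerate(cols):
        for b in cols[i + 1:]:  # upper triangle only: skips the diagonal and mirrored pairs
            r = corr.loc[a, b]
            if abs(r) >= threshold:
                pairs.append((a, b, float(r)))
    return pairs

names = ["s1", "s2", "s3", "s4"]
corr = pd.DataFrame(
    [
        [1.00, 0.90, 0.05, 0.54],
        [0.90, 1.00, -0.20, 0.66],
        [0.05, -0.20, 1.00, -0.74],
        [0.54, 0.66, -0.74, 1.00],
    ],
    index=names,
    columns=names,
)

print(high_corr_pairs(corr, 0.9))
print(high_corr_pairs(corr, 0.7))
```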

In [18]:
import statsmodels.formula.api as smf

columns1 = "+".join(df.columns.difference(["target"]))
formula1 = "target ~" + columns1
all_model = smf.ols(formula=formula1, data=df).fit()
print(all_model.summary())


                            OLS Regression Results                            
Dep. Variable:                 target   R-squared:                       0.518
Model:                            OLS   Adj. R-squared:                  0.507
Method:                 Least Squares   F-statistic:                     46.27
Date:                Mon, 21 Oct 2024   Prob (F-statistic):           3.83e-62
Time:                        11:23:04   Log-Likelihood:                -2386.0
No. Observations:                 442   AIC:                             4794.
Df Residuals:                     431   BIC:                             4839.
Df Model:                          10                                         
Covariance Type:            nonrobust                                         
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept    152.1335      2.576     59.061      0.0

In [16]:
# Same regressors as the full model, with s1 removed by the "-" operator in the formula
columns2 = "+".join(df.columns.difference(["target"]))
formula2 = "target ~" + columns2 + "-s1"
model2 = smf.ols(formula=formula2, data=df).fit()
print(model2.summary())



                            OLS Regression Results                            
Dep. Variable:                 target   R-squared:                       0.514
Model:                            OLS   Adj. R-squared:                  0.504
Method:                 Least Squares   F-statistic:                     50.71
Date:                Mon, 21 Oct 2024   Prob (F-statistic):           3.06e-62
Time:                        11:22:50   Log-Likelihood:                -2387.8
No. Observations:                 442   AIC:                             4796.
Df Residuals:                     432   BIC:                             4837.
Df Model:                           9                                         
Covariance Type:            nonrobust                                         
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept    152.1335      2.584     58.883      0.0

In [17]:
# Same regressors as the full model, with s2 removed by the "-" operator in the formula
columns3 = "+".join(df.columns.difference(["target"]))
formula3 = "target ~" + columns3 + "-s2"
model3 = smf.ols(formula=formula3, data=df).fit()
print(model3.summary())

                            OLS Regression Results                            
Dep. Variable:                 target   R-squared:                       0.516
Model:                            OLS   Adj. R-squared:                  0.505
Method:                 Least Squares   F-statistic:                     51.08
Date:                Mon, 21 Oct 2024   Prob (F-statistic):           1.37e-62
Time:                        11:22:56   Log-Likelihood:                -2387.0
No. Observations:                 442   AIC:                             4794.
Df Residuals:                     432   BIC:                             4835.
Df Model:                           9                                         
Covariance Type:            nonrobust                                         
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept    152.1335      2.579     58.995      0.0

As we can see, eliminating s1 or s2 barely changes the R-squared of the model (0.518 with both, 0.514 without s1, 0.516 without s2), which means that once one of them is in the model, the other adds very little predictive power.

Moreover, s2 does not seem to be significant, either on its own (model 2, without s1) or together with s1 (the full model), whereas s1 is significant when s2 is left out (model 3).
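
The same pattern can be reproduced on synthetic data, where the relationship between the variables is known. This is only a sketch: `r_squared` is a hypothetical helper, and `x2` is deliberately built to be more strongly correlated with `x1` than s1 and s2 are, so the effect is easy to see.

```python
import numpy as np

def r_squared(features, y):
    """R-squared of an OLS fit of y on the given feature columns plus an intercept."""
    X = np.column_stack([np.ones(len(y))] + list(features))
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    residuals = y - X @ beta
    return 1.0 - residuals @ residuals / np.sum((y - y.mean()) ** 2)

rng = np.random.default_rng(1)
n = 500
x1 = rng.normal(size=n)
x2 = x1 + 0.1 * rng.normal(size=n)  # strongly correlated with x1
y = x1 + x2 + rng.normal(size=n)    # both variables truly matter

r2_both = r_squared([x1, x2], y)
r2_only_x1 = r_squared([x1], y)
r2_only_x2 = r_squared([x2], y)
print("both:", round(r2_both, 3))
print("x1 only:", round(r2_only_x1, 3))
print("x2 only:", round(r2_only_x2, 3))
```

Even though both variables enter the true model, dropping either one costs almost nothing in R-squared, because the remaining variable carries nearly the same information.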

## Multicollinearity

If we lower the correlation threshold, we find more pairs of highly correlated variables: with a threshold of 0.7, s3 and s4 (correlation of -0.74) are flagged in addition to s1 and s2.

In [20]:
def colour_medium_values_red(val):
    color = 'red' if abs(val) > 0.7 else 'lightgrey'
    return 'background-color: %s' % color
C.style.applymap(colour_medium_values_red)



,age,sex,bmi,bp,s1,s2,s3,s4,s5,s6,target
age,1.0,0.17,0.19,0.34,0.26,0.22,-0.08,0.2,0.27,0.3,0.19
sex,0.17,1.0,0.09,0.24,0.04,0.14,-0.38,0.33,0.15,0.21,0.04
bmi,0.19,0.09,1.0,0.4,0.25,0.26,-0.37,0.41,0.45,0.39,0.59
bp,0.34,0.24,0.4,1.0,0.24,0.19,-0.18,0.26,0.39,0.39,0.44
s1,0.26,0.04,0.25,0.24,1.0,0.9,0.05,0.54,0.52,0.33,0.21
s2,0.22,0.14,0.26,0.19,0.9,1.0,-0.2,0.66,0.32,0.29,0.17
s3,-0.08,-0.38,-0.37,-0.18,0.05,-0.2,1.0,-0.74,-0.4,-0.27,-0.39
s4,0.2,0.33,0.41,0.26,0.54,0.66,-0.74,1.0,0.62,0.42,0.43
s5,0.27,0.15,0.45,0.39,0.52,0.32,-0.4,0.62,1.0,0.46,0.57
s6,0.3,0.21,0.39,0.39,0.33,0.29,-0.27,0.42,0.46,1.0,0.38


In [25]:
formula = "target ~ s1 + s2 + s3 + s4 + s5 + s6 + age + bmi + sex + bp"
est = smf.ols(formula=formula, data=df).fit()
print(est.summary())



                            OLS Regression Results                            
Dep. Variable:                 target   R-squared:                       0.518
Model:                            OLS   Adj. R-squared:                  0.507
Method:                 Least Squares   F-statistic:                     46.27
Date:                Mon, 21 Oct 2024   Prob (F-statistic):           3.83e-62
Time:                        11:36:02   Log-Likelihood:                -2386.0
No. Observations:                 442   AIC:                             4794.
Df Residuals:                     431   BIC:                             4839.
Df Model:                          10                                         
Covariance Type:            nonrobust                                         
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept    152.1335      2.576     59.061      0.0

In [26]:
est.condition_number

np.float64(227.2248048548479)

Another way to check whether multicollinearity is a problem is to look at the condition number of the design matrix. As a common rule of thumb, a value greater than 30 signals a problem; here it is about 227.
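
For intuition, the condition number can be computed by hand as the ratio of the largest to the smallest singular value of the design matrix. The sketch below assumes this is also what the statsmodels attribute reports (with the intercept column included), which matches our reading of its documentation; the helper `condition_number` and the synthetic data are illustrative only.

```python
import numpy as np

def condition_number(design_matrix):
    """Ratio of the largest to the smallest singular value of the design matrix."""
    singular_values = np.linalg.svd(design_matrix, compute_uv=False)
    return singular_values.max() / singular_values.min()

rng = np.random.default_rng(2)
n = 200
x1 = rng.normal(size=n)
x2 = x1 + 0.01 * rng.normal(size=n)  # nearly collinear with x1
x3 = rng.normal(size=n)              # unrelated to x1

X_collinear = np.column_stack([np.ones(n), x1, x2])
X_independent = np.column_stack([np.ones(n), x1, x3])

cond_collinear = condition_number(X_collinear)
cond_independent = condition_number(X_independent)
print("nearly collinear design:", cond_collinear)
print("independent design:", cond_independent)
```

Note that the condition number also depends on how the columns are scaled, so it is most informative when the regressors are on comparable scales.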

Also, if the overall R-squared is high but the individual variables are not significant, multicollinearity may be present. Wide confidence intervals on the coefficients are another symptom. In that situation the variables are jointly informative about the target variable, but the model cannot attribute the effect to any one of them individually.
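
The widening of the confidence intervals can be seen directly in the coefficient standard errors. The following sketch computes them from the textbook formula $\hat{\sigma}^2 (X^TX)^{-1}$ on synthetic data; `ols_with_std_errors` is a hypothetical helper, not a statsmodels function.

```python
import numpy as np

def ols_with_std_errors(X, y):
    """Return OLS coefficients and their standard errors for design matrix X."""
    n, k = X.shape
    xtx_inv = np.linalg.inv(X.T @ X)
    beta = xtx_inv @ X.T @ y
    residuals = y - X @ beta
    sigma2 = residuals @ residuals / (n - k)  # unbiased estimate of the error variance
    std_errors = np.sqrt(sigma2 * np.diag(xtx_inv))
    return beta, std_errors

rng = np.random.default_rng(3)
n = 200
x1 = rng.normal(size=n)
noise = rng.normal(size=n)
second_regressor = {
    "collinear": x1 + 0.05 * rng.normal(size=n),  # nearly a copy of x1
    "independent": rng.normal(size=n),            # unrelated to x1
}

results = {}
for label, x2 in second_regressor.items():
    X = np.column_stack([np.ones(n), x1, x2])
    y = x1 + x2 + noise
    beta, se = ols_with_std_errors(X, y)
    results[label] = (beta, se)
    print(label, "std errors:", np.round(se, 3), "t-statistics:", np.round(beta / se, 2))
```

With the nearly collinear regressor, the standard errors of the `x1` and `x2` coefficients are many times larger, so their t-statistics shrink even though the data were generated with both effects equal to 1.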

## Gauss-Markov Theorem

The Gauss-Markov theorem states that in a linear regression model where the errors have mean zero, are homoscedastic, and are uncorrelated with each other, the best linear unbiased estimator (BLUE) is the OLS estimator.

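If multicollinearity is present but not perfect, the Gauss-Markov assumptions still hold, so the OLS estimator remains unbiased and is still BLUE. However, the variance of the individual coefficients can be very large: the coefficients are unstable, and individual variables may appear insignificant even when the overall fit is good.

As a consequence, the individual coefficient estimates are imprecise, but the model remains useful as an overall estimator.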
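
The loss of precision can be made explicit with a standard result for OLS with an intercept. The notation is introduced here and is not used elsewhere in this notebook: $\sigma^2$ is the error variance, and $R_j^2$ is the R-squared obtained by regressing the variable $x_j$ on all the other independent variables.

$$
\operatorname{Var}(\hat{\beta}_j) = \frac{\sigma^2}{(1 - R_j^2)\,\sum_{i=1}^{n} (x_{ij} - \bar{x}_j)^2}
$$

As $x_j$ becomes more predictable from the other variables, $R_j^2 \to 1$ and the variance of $\hat{\beta}_j$ grows without bound, even though the estimator stays unbiased. The factor $1 / (1 - R_j^2)$ is known as the variance inflation factor (VIF).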