# K: Multicollinearity

## Library Imports

In [1]:
library(car)  # provides vif()

Loading required package: carData



## Introduction

* **intercorrelation** exists whenever the predictor variables are correlated with one another
* **multicollinearity** is generally reserved for instances where that correlation is very high (greater than 0.9)
* multicollinearity can
  * make it difficult to judge the relative importance of the predictor variables
  * cause problems when fitting and interpreting the regression model
* Ex: in a regression analysis with response variable max vertical jump and predictor variables height, shoe size, and hours spent practicing per day, height and shoe size are likely to be highly correlated with each other, so multicollinearity is likely to be a problem
* multicollinearity is a problem because when two or more predictor variables are highly correlated, it becomes difficult to change one variable without changing the other
  * this makes it difficult for the model to estimate the relationship between each predictor variable and the response variable independently, because the predictor variables tend to change in unison
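The effect described above can be seen in a small simulation. This is only a sketch on made-up data: the variable names, sample size, and noise levels are assumptions chosen for illustration and are not part of the notes. It fits the same kind of two-predictor model twice, once with unrelated predictors and once with a second predictor that is nearly a copy of the first, and compares the standard error of the first coefficient.

```r
# Simulated illustration; every name and setting here is hypothetical
set.seed(1)
n <- 200

x1        <- rnorm(n)
x2_indep  <- rnorm(n)                  # unrelated to x1
x2_collin <- x1 + rnorm(n, sd = 0.1)   # nearly a copy of x1

# Same true coefficients in both cases
y_indep  <- 1 + 2 * x1 + 2 * x2_indep  + rnorm(n)
y_collin <- 1 + 2 * x1 + 2 * x2_collin + rnorm(n)

fit_indep  <- lm(y_indep  ~ x1 + x2_indep)
fit_collin <- lm(y_collin ~ x1 + x2_collin)

# Standard error of the x1 coefficient in each fit
se_indep  <- coef(summary(fit_indep))["x1", "Std. Error"]
se_collin <- coef(summary(fit_collin))["x1", "Std. Error"]

cor(x1, x2_indep)
cor(x1, x2_collin)
c(independent = se_indep, collinear = se_collin)
```

With the nearly duplicated predictor, the model has little independent variation in `x1` to work with, so its coefficient is estimated much less precisely even though the true relationship is unchanged.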

## Example: Credit Card Balance Data

**Description:** A simulated data set containing information on 400
credit card customers. The aim here is to model each customer's
credit card balance (`Balance`).

In [2]:
# load the dataset and preview the first 10 rows
library(ISLR2)
head(Credit, 10)

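| | Income | Limit | Rating | Cards | Age | Education | Own | Student | Married | Region | Balance |
|---|---|---|---|---|---|---|---|---|---|---|---|
| | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<fct>` | `<fct>` | `<fct>` | `<fct>` | `<dbl>` |
| 1 | 14.891 | 3606 | 283 | 2 | 34 | 11 | No | No | Yes | South | 333 |
| 2 | 106.025 | 6645 | 483 | 3 | 82 | 15 | Yes | Yes | Yes | West | 903 |
| 3 | 104.593 | 7075 | 514 | 4 | 71 | 11 | No | No | No | West | 580 |
| 4 | 148.924 | 9504 | 681 | 3 | 36 | 11 | Yes | No | No | West | 964 |
| 5 | 55.882 | 4897 | 357 | 2 | 68 | 16 | No | No | Yes | South | 331 |
| 6 | 80.180 | 8047 | 569 | 4 | 77 | 10 | No | No | No | South | 1151 |
| 7 | 20.996 | 3388 | 259 | 2 | 37 | 12 | Yes | No | No | East | 203 |
| 8 | 71.408 | 7114 | 512 | 2 | 87 | 9 | No | No | No | West | 872 |
| 9 | 15.125 | 3300 | 266 | 5 | 66 | 13 | Yes | No | No | South | 279 |
| 10 | 71.061 | 6819 | 491 | 3 | 41 | 19 | Yes | Yes | Yes | East | 1350 |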


* Find the correlation between Limit and Age

In [3]:
cor(Credit$Limit, Credit$Age)

* Find the correlation between Limit and Rating

In [4]:
cor(Credit$Limit, Credit$Rating)
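The two `cor()` calls above can also be combined into a single correlation matrix, which makes it easier to scan all pairs at once. A minimal sketch, assuming the `ISLR2` package is installed; the object name `cor_matrix` is introduced here for illustration.

```r
library(ISLR2)  # provides the Credit data

# Pairwise correlations among the three predictors used in these notes
cor_matrix <- cor(Credit[, c("Age", "Limit", "Rating")])
round(cor_matrix, 3)
```

Off-diagonal entries close to 1 or -1 flag pairs of predictors that are candidates for multicollinearity.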

Fit two multiple linear regression (MLR) models:
* Model1: Balance on Age and Limit
* Model2: Balance on Rating and Limit

Also, compare the standard error of the Limit coefficient in the two models
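The `vif()` function from `car` reports the variance inflation factor (VIF). The notes do not define it, so the standard textbook definition is given here for reference. For predictor $X_j$, with $R_j^2$ the $R^2$ from regressing $X_j$ on all the other predictors, $s_j^2$ the sample variance of $X_j$, and $\sigma^2$ the error variance:

```latex
\mathrm{VIF}_j = \frac{1}{1 - R_j^2},
\qquad
\operatorname{Var}(\hat{\beta}_j) = \frac{\sigma^2}{(n-1)\, s_j^2} \cdot \mathrm{VIF}_j
```

With only two predictors, $R_j^2$ is the squared correlation between them, so both predictors share the same VIF. The standard error of $\hat{\beta}_j$ grows by a factor of $\sqrt{\mathrm{VIF}_j}$ relative to the case where $X_j$ is uncorrelated with the other predictors.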

In [8]:
# fit Balance on Age and Limit
model1 <- lm(Balance ~ Age + Limit, data = Credit)
summary(model1)
print('')
print('VIF Model:')
# variance inflation factors (from the car package)
vif(model1)


Call:
lm(formula = Balance ~ Age + Limit, data = Credit)

Residuals:
    Min      1Q  Median      3Q     Max 
-696.84 -150.78  -13.01  126.68  755.56 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept) -1.734e+02  4.383e+01  -3.957 9.01e-05 ***
Age         -2.291e+00  6.725e-01  -3.407 0.000723 ***
Limit        1.734e-01  5.026e-03  34.496  < 2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 230.5 on 397 degrees of freedom
Multiple R-squared:  0.7498,	Adjusted R-squared:  0.7486 
F-statistic:   595 on 2 and 397 DF,  p-value: < 2.2e-16


[1] ""
[1] "VIF Model:"


In [9]:
# fit Balance on Rating and Limit
model2 <- lm(Balance ~ Rating + Limit, data = Credit)
summary(model2)
print('')
print('VIF Model:')
# variance inflation factors (from the car package)
vif(model2)


Call:
lm(formula = Balance ~ Rating + Limit, data = Credit)

Residuals:
   Min     1Q Median     3Q    Max 
-707.8 -135.9   -9.5  124.0  817.6 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept) -377.53680   45.25418  -8.343 1.21e-15 ***
Rating         2.20167    0.95229   2.312   0.0213 *  
Limit          0.02451    0.06383   0.384   0.7012    
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 232.3 on 397 degrees of freedom
Multiple R-squared:  0.7459,	Adjusted R-squared:  0.7447 
F-statistic: 582.8 on 2 and 397 DF,  p-value: < 2.2e-16


[1] ""
[1] "VIF Model:"
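In the two summaries above, the standard error of the Limit coefficient is 5.026e-03 in Model1 and 0.06383 in Model2, roughly 12.7 times larger, and Limit goes from highly significant (p < 2e-16) to not significant (p = 0.7012). The comparison can be made directly in code rather than by reading the printed tables. A minimal sketch that refits both models so it runs on its own; the `se_limit_*` names are introduced here for illustration.

```r
library(ISLR2)  # provides the Credit data

model1 <- lm(Balance ~ Age + Limit, data = Credit)
model2 <- lm(Balance ~ Rating + Limit, data = Credit)

# Pull the "Std. Error" entry for Limit out of each coefficient table
se_limit_1 <- coef(summary(model1))["Limit", "Std. Error"]
se_limit_2 <- coef(summary(model2))["Limit", "Std. Error"]

# A ratio above 1 means Limit is estimated less precisely in model2
se_limit_2 / se_limit_1
```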


* Calculate the VIF for a regression of Balance on Age, Rating, and Limit

In [10]:
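# fit Balance on Age, Rating, and Limit
model3 <- lm(Balance ~ Age + Rating + Limit, data = Credit)
summary(model3)
print('')
print('VIF Model:')
# variance inflation factors (from the car package)
vif(model3)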


Call:
lm(formula = Balance ~ Age + Rating + Limit, data = Credit)

Residuals:
    Min      1Q  Median      3Q     Max 
-729.67 -135.82   -8.58  127.29  827.65 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept) -259.51752   55.88219  -4.644 4.66e-06 ***
Age           -2.34575    0.66861  -3.508 0.000503 ***
Rating         2.31046    0.93953   2.459 0.014352 *  
Limit          0.01901    0.06296   0.302 0.762830    
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 229.1 on 396 degrees of freedom
Multiple R-squared:  0.7536,	Adjusted R-squared:  0.7517 
F-statistic: 403.7 on 3 and 396 DF,  p-value: < 2.2e-16


[1] ""
[1] "VIF Model:"
