-
Notifications
You must be signed in to change notification settings - Fork 4
/
correlated.qmd
64 lines (52 loc) · 3.24 KB
/
correlated.qmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
---
pagetitle: "Feature Engineering A-Z | Correlated Overview"
---
# Correlated Overview {#sec-correlated}
::: {style="visibility: hidden; height: 0px;"}
## Correlated Overview
:::
Correlation happens when two or more variables contain similar information. We typically refer to correlation when we talk about predictors. This can be a problem for some machine learning models as they don't perform well with correlated predictors. There are many different ways to calculate the degree of correlation. And those details aren't going to matter much right now. The important thing is that it can happen. Below we see such examples
```{r}
#| label: fig-correlation-examples
#| echo: false
#| message: false
#| fig-cap: |
#| uncorrelated features and correlated feature
#| fig-alt: |
#| 4 Scatter charts. All present the relationship between predictor_1 and
#| predictor_2. The first chart shows points uniformly spread. With each of
#| the following charts have stronger and stronger trends between the two
#| predictors.
library(ggplot2)
library(dplyr)
library(patchwork)
set.seed(1234)
g1 <- tibble(predictor_1 = runif(200)) |>
mutate(predictor_2 = runif(200)) |>
ggplot(aes(predictor_1, predictor_2)) +
geom_point() +
theme_minimal() +
labs(title = "zero correlation")
g2 <- tibble(predictor_1 = runif(200)) |>
mutate(predictor_2 = predictor_1 + rnorm(200, sd = 0.5)) |>
ggplot(aes(predictor_1, predictor_2)) +
geom_point() +
theme_minimal() +
labs(title = "mild correlation")
g3 <- tibble(predictor_1 = runif(200)) |>
mutate(predictor_2 = predictor_1 + rnorm(200, sd = 0.1)) |>
ggplot(aes(predictor_1, predictor_2)) +
geom_point() +
theme_minimal() +
labs(title = "medium correlation")
g4 <- tibble(predictor_1 = runif(200)) |>
mutate(predictor_2 = predictor_1 + rnorm(200, sd = 0.05)) |>
ggplot(aes(predictor_1, predictor_2)) +
geom_point() +
theme_minimal() +
labs(title = "strong correlation")
(g1 + g2) / (g3 + g4)
```
The reason why correlated features are bad for our models is that two correlated features have the potential to share information that is useful. Imagine we are working with strongly correlated variables. Furthermore, we propose that `predictor_1` is highly predictive in our model, since `predictor_1` and `predictor_2` are correlated, we can conclude that `predictor_2` would also be highly predictive. The problem then arises when one of these predictors is used, the other predictor will no longer be a predictor since they share their information. Another way to think about it is that we could replace these two predictors with just one predictor with minimal loss of information. This is one of the reasons why we sometimes want to do dimension reduction, as seen in @sec-too-many.
We will see how we can use the correlation structure to figure out which variables we can eliminate, this is covered in @sec-correlated-filter. This is a more specialized version of the methods we cover in @sec-too-many-filter as we are looking at correlation to determine which variables to remove rather than their relationship to the outcome.
Another set of methods that works well in anything PCA-related, which are covered in @sec-too-many-pca and @sec-too-many-pca-variants. The resulting data coming out of PCA will be uncorrelated.