-
Notifications
You must be signed in to change notification settings - Fork 4
/
categorical-frequency.qmd
104 lines (74 loc) · 3.44 KB
/
categorical-frequency.qmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
---
pagetitle: "Feature Engineering A-Z | Frequency Encoding"
---
# Frequency Encoding {#sec-categorical-frequency}
::: {style="visibility: hidden; height: 0px;"}
## Frequency Encoding
:::
Frequency encoding takes a categorical variable and replaces each level with its frequency in the training data set. This results in a single numeric variable, with values between 0 and 1. This is a trained method since we need to keep a record of the frequencies from the training data set.
This method isn't a silver bullet, as it will only sometimes be useful. It is useful when the frequency/rarity of a category level is related to our outcome. Imagine we have data about wines and their producers, some big producers produce many wines, and small producers only produce a couple. This information could potentially be useful and would be easily captured in frequency encoding. This method is not able to distinguish between two levels that have the same frequency.
Unseen levels can be automatically handled by giving them a value of 0 as they are unseen in the training data set. Thus no extra treatment is necessary. Sometimes taking the logarithm can be useful if you are having a big difference between the number of occurrences in your levels.
::: callout-note
This is similar to **count encoding** in the sense that both these encodings calculate the same quantity, the difference is just what you put in the denominator. Since we divide by a constant value in frequency encoding, these will be treated as identical methods.
:::
## Pros and Cons
### Pros
- Powerful and simple when used correctly
- High interpretability
### Cons
- Is not able to distinguish between two levels that have the same frequency
- May not add predictive power
## R Examples
We will be using the `ames` data set for these examples. The `step_encoding_frequency()` function from the [extrasteps](https://github.com/emilhvitfeldt/extrasteps) package allows us to perform frequency encoding.
```{r}
#| echo: false
set.seed(1234)
# To avoid changing recipe ID columns
```
```{r}
#| message: false
library(recipes)
library(extrasteps)
library(modeldata)
data("ames")
ames |>
select(Sale_Price, MS_SubClass, MS_Zoning)
```
We can take a quick look at the possible values `MS_SubClass` takes
```{r}
ames |>
count(MS_SubClass, sort = TRUE)
```
We can then apply frequency encoding using `step_encoding_frequency()`. Notice how we only get 1 numeric variable for each categorical variable
```{r}
dummy_rec <- recipe(Sale_Price ~ ., data = ames) |>
step_encoding_frequency(all_nominal_predictors()) |>
prep()
dummy_rec |>
bake(new_data = NULL, starts_with("MS_SubClass"), starts_with("MS_Zoning")) |>
glimpse()
```
We can pull the frequencies for each level of each variable by using `tidy()`.
```{r}
dummy_rec |>
tidy(1)
```
## Python Examples
```{python}
#| echo: false
import pandas as pd
from sklearn import set_config
set_config(transform_output="pandas")
pd.set_option('display.precision', 3)
```
We are using the `ames` data set for examples. {category_encoders} provided the `CountEncoder()` method we can use. This performs count encoding, which we know is functionally equivalent to frequency encoding.
```{python}
from feazdata import ames
from sklearn.compose import ColumnTransformer
from category_encoders.count import CountEncoder
ct = ColumnTransformer(
[('count', CountEncoder(), ['MS_Zoning'])],
remainder="passthrough")
ct.fit(ames)
ct.transform(ames).filter(regex="count.*")
```