---
pagetitle: "Feature Engineering A-Z | Binary Encoding"
---
# Binary Encoding {#sec-categorical-binary}
::: {style="visibility: hidden; height: 0px;"}
## Binary Encoding
:::
Binary encoding represents each category by the binary representation of an integer. You first assign an integer value to each level of the categorical variable, in the same way as in @sec-categorical-label. That integer is then converted to its binary representation, and each binary digit becomes a column in the output.
Suppose we have the following variable, where the levels have been assigned the values (cat = 11, dog = 3, horse = 20). We are using a small subset to better understand what is happening.
```{r}
#| echo: false
c("dog", "cat", "horse", "dog", "cat")
```
The first thing we need to do is calculate the binary representation of these numbers, using 5 digits since that is the most we need in this hypothetical example: 11 = 01011, 3 = 00011, 20 = 10100. We can then encode this in the following matrix.
```{r}
#| echo: false
dummy <- matrix(0L, nrow = 5, ncol = 5)
colnames(dummy) <- c(16, 8, 4, 2, 1)
dummy[1, 5] <- 1L
dummy[1, 4] <- 1L
dummy[2, 5] <- 1L
dummy[2, 4] <- 1L
dummy[2, 2] <- 1L
dummy[3, 1] <- 1L
dummy[3, 3] <- 1L
dummy[4, 5] <- 1L
dummy[4, 4] <- 1L
dummy[5, 5] <- 1L
dummy[5, 4] <- 1L
dummy[5, 2] <- 1L
dummy
```
With just 5 columns we are able to uniquely encode `2^5 = 32` different values, compared to the 32 columns it would take with the dummy encoding from @sec-categorical-dummy. In general, you will be able to encode `n` levels in `ceiling(log2(n))` columns.
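To make the mechanics concrete, here is a minimal sketch of binary encoding done by hand. This is not the implementation used by any particular package; it assigns integers by order of first appearance (unlike the arbitrary values in the hypothetical example above) and expands them into `ceiling(log2(n))` binary digit columns.

```python
import math

def binary_encode(values):
    # Assign an integer to each level in order of first appearance
    # (the label-encoding step), then expand to binary digit columns.
    levels = {v: i for i, v in enumerate(dict.fromkeys(values))}
    n_cols = max(1, math.ceil(math.log2(len(levels))))
    rows = []
    for v in values:
        code = levels[v]
        # Most significant bit first, padded to n_cols digits
        rows.append([(code >> (n_cols - 1 - j)) & 1 for j in range(n_cols)])
    return rows

# dog -> 0, cat -> 1, horse -> 2; three levels need ceiling(log2(3)) = 2 columns
print(binary_encode(["dog", "cat", "horse", "dog", "cat"]))
# [[0, 0], [0, 1], [1, 0], [0, 0], [0, 1]]
```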
::: callout-note
This style of encoding generalizes to other bases. Binary encoding is a base-2 encoder; you could just as well have a base-3 or base-10 encoding. We will not cover these methods beyond this mention, as they are similar in function to binary encoding.
:::
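To illustrate the generalization, here is a small sketch of expanding an integer code into digits of an arbitrary base (the helper name is hypothetical, not from any package):

```python
def base_k_digits(code, k, n_digits):
    # Expand an integer code into its base-k digits, most significant first.
    digits = []
    for _ in range(n_digits):
        digits.append(code % k)
        code //= k
    return digits[::-1]

# The level coded 11 becomes the digits 1, 0, 2 in base 3,
# since 11 = 1*9 + 0*3 + 2*1.
print(base_k_digits(11, 3, 3))  # [1, 0, 2]
# With k = 2 we recover binary encoding: 11 = 01011
print(base_k_digits(11, 2, 5))  # [0, 1, 0, 1, 1]
```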
This method isn't widely used. It does a good job of illustrating the midpoint between dummy encoding and label encoding in terms of how sparsely we want to store our data. Its limitations come in terms of how interpretable the final model ends up being. Further, if you want to encode your data more compactly than dummy encoding, you will likely have better luck with some of the methods described later.
::: {.callout-caution}
# TODO
link to actual methods
:::
::: {.callout-caution}
# TODO
talk about Gray encoding
:::
## Pros and Cons
### Pros
- Uses fewer variables to store the same information as dummy encoding
### Cons
- Less interpretable than dummy encoding
## R Examples
We will be using the `ames` data set for these examples. The `step_encoding_binary()` function from the [extrasteps](https://github.com/emilhvitfeldt/extrasteps) package allows us to perform binary encoding.
```{r}
#| echo: false
set.seed(1234)
# To avoid changing recipe ID columns
```
```{r}
#| message: false
library(recipes)
library(extrasteps)
library(modeldata)
data("ames")
ames |>
  select(Sale_Price, MS_SubClass, MS_Zoning)
```
We can take a quick look at the possible values that `MS_SubClass` takes.
```{r}
ames |>
  count(MS_SubClass, sort = TRUE)
```
We can then apply binary encoding using `step_encoding_binary()`. Notice how we get far fewer numeric variables for each categorical variable than dummy encoding would produce.
```{r}
dummy_rec <- recipe(Sale_Price ~ ., data = ames) |>
  step_encoding_binary(all_nominal_predictors()) |>
  prep()

dummy_rec |>
  bake(new_data = NULL, starts_with("MS_SubClass"), starts_with("MS_Zoning")) |>
  glimpse()
```
We can pull the number of distinct levels of each variable by using `tidy()`.
```{r}
dummy_rec |>
  tidy(1)
```
## Python Examples
```{python}
#| echo: false
import pandas as pd
from sklearn import set_config
set_config(transform_output="pandas")
pd.set_option('display.precision', 3)
```
We are using the `ames` data set for these examples. The {category_encoders} package provides the `BinaryEncoder()` method that we can use.
```{python}
from feazdata import ames
from sklearn.compose import ColumnTransformer
from category_encoders.binary import BinaryEncoder
ct = ColumnTransformer(
    [('binary', BinaryEncoder(), ['MS_Zoning'])],
    remainder="passthrough")
ct.fit(ames)
ct.transform(ames).filter(regex="binary.*")
```