---
pagetitle: "Feature Engineering A-Z | Unseen Levels"
---
# Unseen Levels {#sec-categorical-unseen}
::: {style="visibility: hidden; height: 0px;"}
## Unseen Levels
:::
Categorical variables can take many different values, and we have various methods for dealing with them regardless of what those values are. One problem that will eventually come up is that you try to apply a trained preprocessor to data containing levels of a categorical variable that weren't present during training. This can happen when you are fitting your model using resampled data, when you are applying your model to the testing data set, or, if you are unlucky, at some future time in production.
::: {.callout-caution}
# TODO
add diagram here
:::
The reason you need to think about this problem is that some methods and models will complain, or even error, if you give them unseen levels. Some implementations let you deal with this at the method level. Other methods, such as @sec-categorical-hashing, don't care at all that you have unseen levels in your data.
One surefire way to deal with this issue is to add a step to your data preprocessing pipeline that turns any unseen levels into `"unseen"`. In practice, this method looks at your categorical variables during training, takes note of all the levels it sees, and saves them. Then, any time the preprocessing is applied to new data, it looks at the levels again, and if it finds a level it hasn't seen before, it labels it `"unseen"` (or any other meaningful label that doesn't conflict with the data). This way, any future levels are accounted for.
::: callout-note
The above method will only work if the programming language you are modeling with has a factor-like data class.
:::
::: {.callout-caution}
# TODO
add diagram here
:::
## R Examples
We will be using the [nycflights13](https://github.com/tidyverse/nycflights13) data set. We are downsampling it to only the flights from the first day of the year and then doing a train-test split.
```{r}
#| message: false
library(recipes)
library(rsample)
library(nycflights13)
flights <- flights |>
  filter(year == 2013, month == 1, day == 1)
set.seed(13630)
flights_split <- initial_split(flights)
flights_train <- training(flights_split)
flights_test <- testing(flights_split)
```
Now we are committing the cardinal sin of looking at the testing data. In this case, it is okay because we are doing it for educational purposes.
```{r}
flights_train |> pull(carrier) |> unique() |> sort()
flights_test |> pull(carrier) |> unique() |> sort()
```
Notice that the testing data includes the carriers `"AS"` and `"HA"`, which never appear in the training data. Let us see what happens if we calculate dummy variables without making any adjustments.
```{r}
dummy_spec <- recipe(arr_delay ~ carrier, data = flights_train) |>
  step_dummy(carrier)
dummy_spec_prepped <- prep(dummy_spec)
bake(dummy_spec_prepped, new_data = flights_test)
```
We get a warning, and if you look at the affected rows, you will see that they contain `NA`s. Let us now use `step_novel()`, which implements the method described above.
```{r}
novel_spec <- recipe(arr_delay ~ carrier, data = flights_train) |>
  step_novel(carrier) |>
  step_dummy(carrier)
novel_spec_prepped <- prep(novel_spec)
bake(novel_spec_prepped, new_data = flights_test)
```
And this time we get no warning or error.
## Python Examples
I'm not aware of a good way to do this with scikit-learn. Please file an [issue on GitHub](https://github.com/EmilHvitfeldt/feature-engineering-az/issues/new?assignees=&labels=&projects=&template=general-issue.md&title) if you know of one.
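
The method described earlier does not depend on any particular library, though, so here is a minimal sketch of it in plain Python. The class `NovelLevelHandler`, its `fit()` and `transform()` methods, and the carrier codes are all made up for illustration; none of them come from scikit-learn or any other package.

```python
class NovelLevelHandler:
    """Hypothetical helper: maps levels not seen during fit() to a placeholder."""

    def __init__(self, new_level="unseen"):
        self.new_level = new_level
        self.levels_ = None

    def fit(self, values):
        # Record every distinct level observed in the training data.
        seen = sorted(set(values))
        if self.new_level in seen:
            raise ValueError(
                f"placeholder {self.new_level!r} conflicts with a level in the data"
            )
        self.levels_ = seen
        return self

    def transform(self, values):
        if self.levels_ is None:
            raise RuntimeError("call fit() before transform()")
        known = set(self.levels_)
        # Keep known levels as they are; replace everything else.
        return [v if v in known else self.new_level for v in values]


# Made-up carrier codes, for illustration only.
train = ["AA", "B6", "DL", "AA", "UA"]
new = ["AA", "HA", "DL", "AS"]

handler = NovelLevelHandler().fit(train)
print(handler.transform(new))
# ['AA', 'unseen', 'DL', 'unseen']
```

A handler like this would sit in front of an encoding step, so the encoder only ever receives levels it was trained on plus the placeholder. For this to help, the downstream encoder also needs to know about the placeholder level at training time.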