---
pagetitle: "Feature Engineering A-Z | Splines"
---
# Splines {#sec-numeric-splines}
::: {style="visibility: hidden; height: 0px;"}
## Splines
:::
**Splines** are one way to represent a curve, which makes them useful in a modeling context since they allow us to model non-linear relationships between predictors and outcomes. This is a trained method.
Being able to transform a numeric variable that has a non-linear relationship with the outcome into one or more variables that do have linear relationships with the outcome is of great importance, as many models wouldn't be able to work with these types of variables effectively themselves. Below is a toy example of one such variable.
```{r}
#| echo: false
set.seed(1234)
data_toy <- tibble::tibble(
predictor = rnorm(100) + 1:100
) |>
dplyr::mutate(outcome = sin(predictor/25) + rnorm(100, sd = 0.1) + 10)
```
```{r}
#| label: fig-splines-predictor-outcome
#| echo: false
#| message: false
#| fig-cap: |
#| Non-linear relationship between predictor and outcome.
#| fig-alt: |
#| Scatter chart. Predictor along the x-axis and outcome along the y-axis.
#| The data has some wiggliness to it, but it follows a curve. You would not
#| be able to fit a straight line to this data.
library(ggplot2)
data_toy |>
ggplot(aes(predictor, outcome)) +
geom_point() +
theme_minimal()
```
Here we have a non-linear relationship. It is a fairly simple one: the outcome is high when the predictor takes values between 25 and 50, and it takes lower values outside that range. Given that this is a toy example, we do not have any expert knowledge about what to expect the relationship to be outside this range. The trend could go back up, it could go down, or it could flatten out. We don't know.
As we saw in @sec-numeric-binning, one way to deal with this non-linearity is to chop up the predictor and emit indicators for which region each value falls in. While this works, we lose quite a lot of detail through the rounding that occurs. This is where splines come in. Imagine that, instead of indicators for whether a value is within a range, we have a set of functions that assigns each predictor value a set of values related to its location in the distribution.
```{r}
#| echo: false
#| message: false
library(recipes)
rec_splines <- recipe(outcome ~ predictor, data = data_toy) |>
step_bs(predictor, keep_original_cols = TRUE, degree = 7) |>
prep()
data_splines <- rec_splines |>
bake(new_data = data_toy) |>
select(-predictor_bs_7) |>
rename_all(\(x) {stringr::str_replace(x, "predictor_bs_", "Spline Feature ")})
```
```{r}
#| label: fig-splines-spline-curves
#| echo: false
#| message: false
#| fig-cap: |
#| Each part of the spline detects a part of the data set.
#| fig-alt: |
#| Facetted line chart. Predictor along the x-axis, value along the y-axis.
#| Each of the curves starts at 0, goes smoothly, and then down to zero.
#| The highpoint for each curve goes further to the right for each curve
#| shown.
data_splines |>
select(-outcome) |>
tidyr::pivot_longer(cols = -predictor) |>
ggplot(aes(predictor, value)) +
geom_line() +
facet_wrap(~name) +
theme_minimal()
```
```{r}
#| echo: false
spline_rounder <- function(value, name) {
name <- paste0("predictor_bs_", name)
value <- bake(rec_splines, tibble(predictor = value))[[name]]
value <- round(value / 0.05) * 0.05
sprintf("%.2f", value)
}
```
Above we see an example of a spline basis that creates 6 features. The curves represent the regions where each function is "activated". So if the predictor has a value of 15, the first basis function returns `r spline_rounder(15, 1)`, the second basis function returns `r spline_rounder(15, 2)`, and so on, with the last basis function returning `r spline_rounder(15, 6)` since its curve is flat in that region.
This is a trained method, as the location and shape of these functions are determined by the distribution of the variable we are applying the spline to.
So in this example, we are taking 1 numeric variable and turning it into 6 numeric variables.
```{r}
#| label: tbl-splines-values
#| tbl-cap: Spline values for different values of the predictor
#| echo: false
rec_splines |>
bake(tibble(predictor = c(0, 10, 35, 50, 80))) |>
select(-predictor_bs_7) |>
mutate_all(round, 2) |>
rename_all(stringr::str_replace, "predictor_bs_", "Spline Feature ") |>
knitr::kable()
```
This spline is set up in such a way that each spline function signals whether a value is close to a given region. This gives us a smooth transition throughout the distribution.
To see how this plays out when we bring back the outcome, we can look at the following visualization.
```{r}
#| label: fig-splines-spline-highlight
#| echo: false
#| message: false
#| fig-cap: |
#| Each part of the spline detects a part of the data set.
#| fig-alt: |
#| Facetted scatter chart. Predictor along the x-axis, outcome along the
#| y-axis. Each of the facets shows the same non-linear relationship between
#| predictor and outcome. Color is used to show how each spline term
#| highlights a different part of the predictor. The highlight goes further
#| to the right for each facet.
data_splines |>
tidyr::pivot_longer(cols = -c(outcome, predictor)) |>
ggplot(aes(predictor, outcome, color = value)) +
geom_point() +
facet_wrap(~name) +
scale_color_gradient(high = "darkblue", low = "white") +
theme_minimal()
```
Since the different spline features highlight different parts of the predictor, at least some of them are useful when we look at the relationship between the predictor and the outcome.
It is important to point out that this transformation only uses the predictor variable in its calculations. The reason it works in a modeling sense is that the predictor-outcome relationship, in this case and in many real-life cases, can usefully be explained in terms of which region the predictor value falls in.
```{r}
#| label: fig-splines-spline-outcome
#| echo: false
#| message: false
#| fig-cap: |
#| Some spline terms have a better relationship to the outcome than others.
#| fig-alt: |
#| Facetted scatter chart. Spline value along the x-axis, outcome along the
#| y-axis. Each facet shows the relationship between one of the spline terms
#| and the outcome. Some of them are non-linear, and a couple of them are
#| fairly linear. A fitted line is overlaid in blue.
data_splines |>
tidyr::pivot_longer(cols = -c(outcome, predictor)) |>
ggplot(aes(value, outcome)) +
geom_point() +
geom_smooth(method = "lm", formula = "y ~ x", se = FALSE) +
facet_wrap(~name) +
scale_color_viridis_c() +
theme_minimal()
```
As we see in the above visualization, some of these new predictors are not much better than the original. But a couple of them do appear to work pretty well, especially the third one. Depending on which model we use, having these 6 variables is likely to give us higher performance than using the original variable alone.
One thing to note is that you will get back correlated features when using splines. Some values of the predictor will influence several of the spline features, since the spline functions overlap. This is expected, but it is worth noting. If you are using a model type that doesn't handle correlated features well, take a look at the methods outlined in @sec-correlated for ways to deal with correlated features.
```{r}
#| label: fig-splines-correlation
#| echo: false
#| message: false
#| fig-cap: |
#| Neighboring features are highly correlated and anti-correlated with
#| far away features.
#| fig-alt: |
#| Correlation chart. The spline basis features are lined up one after
#| another. Neighboring features show high correlation, features 2 apart are
#| slightly correlated, and other features are anti-correlated.
data_splines |>
dplyr::select(-predictor, -outcome) |>
corrr::correlate(quiet = TRUE) |>
autoplot(method = "identity")
```
Lastly, the spline functions you have seen so far are called B-splines, but they are not the only kind of spline you can use.
```{r}
#| echo: false
data_example <- data.frame(x = rnorm(10000))
plot_convex <- recipe(~ x, data = data_example) |>
step_spline_convex(x, keep_original_cols = TRUE, deg_free = 6) |>
prep() |>
bake(new_data = data_example) |>
tidyr::pivot_longer(-x) |>
ggplot(aes(x, value, color = name)) +
geom_line() +
guides(color = "none") +
theme_minimal() +
labs(title = "C-spline", x = NULL)
plot_monotone <- recipe(~ x, data = data_example) |>
step_spline_monotone(x, keep_original_cols = TRUE, deg_free = 6) |>
prep() |>
bake(new_data = data_example) |>
tidyr::pivot_longer(-x) |>
ggplot(aes(x, value, color = name)) +
geom_line() +
guides(color = "none") +
theme_minimal() +
labs(title = "M-spline", x = NULL)
plot_natural <- recipe(~ x, data = data_example) |>
step_spline_natural(x, keep_original_cols = TRUE, deg_free = 6,
complete_set = TRUE) |>
prep() |>
bake(new_data = data_example) |>
tidyr::pivot_longer(-x) |>
ggplot(aes(x, value, color = name)) +
geom_line() +
guides(color = "none") +
theme_minimal() +
labs(title = "Natural spline", x = NULL)
plot_b <- recipe(~ x, data = data_example) |>
step_spline_b(x, keep_original_cols = TRUE, deg_free = 6,
options = list(periodic = TRUE), complete_set = TRUE) |>
prep() |>
bake(new_data = tibble(x = seq(-3, 15, by = 0.01))) |>
tidyr::pivot_longer(-x) |>
ggplot(aes(x, value, color = name)) +
geom_line() +
guides(color = "none") +
theme_minimal() +
labs(title = "Periodic b-spline", x = NULL)
```
```{r}
#| label: fig-splines-types-of-splines
#| echo: false
#| message: false
#| fig-cap: |
#| Different types of splines produce differently shaped basis functions.
#| fig-alt: |
#| 4 charts in a grid. Each represents a different type of spline. The
#| C-splines here are all increasing at different rates of change. The
#| M-splines appear to have a sigmoidal shape, starting at 0 and ending
#| at 1. The natural splines look very similar to the basic splines we saw
#| earlier. And the last chart shows a periodic b-spline. These splines are
#| the same kind as earlier, but they have been modified to repeat at a
#| specific interval.
library(patchwork)
(plot_convex + plot_monotone) / (plot_natural + plot_b)
```
Above we see several different kinds of splines. As we can see, they all try to do different things. You generally can't go too wrong by picking any of them, but knowing the data can help guide which one you should use. The M-splines can intuitively be seen as threshold features. The periodic example is also interesting; many types of splines can be formulated to work periodically, which can be handy for data that has a naturally periodic structure.
Below is a chart of how well splines work on our toy example. Since the data isn't that complicated, a small `deg_free` is sufficient to fit the data well.
```{r}
#| label: fig-splines-different-degrees
#| echo: false
#| message: false
#| fig-cap: |
#| All the splines follow the data well, the higher degrees appear to
#| overfit quite a bit.
#| fig-alt: |
#| Scatter chart. Predictor along the x-axis and outcome along the y-axis.
#| The data has some wiggliness to it, but it follows a curve. You would not
#| be able to fit a straight line to this data. 4 spline fits are
#| plotted to fit the data. deg_free = 5 appears to fit well without
#| overfitting, the rest are overfitting the data.
library(tidymodels)
map(
c(5, 15, 25, 35),
\(x) {
workflow(
recipe(outcome ~ predictor, data = data_toy) |>
step_spline_b(predictor, deg_free = x),
linear_reg()
) |>
fit(data = data_toy) |>
augment(new_data = arrange(data_toy, predictor))
}
) |>
list_rbind(names_to = "degree") |>
mutate(degree = c(5, 15, 25, 35)[degree]) |>
mutate(degree = as.factor(degree)) |>
ggplot(aes(predictor, .pred)) +
geom_point(aes(predictor, outcome), data = data_toy) +
geom_line(aes(color = degree, group = degree)) +
theme_minimal() +
scale_color_viridis_d() +
labs(y = "outcome", color = "deg_free")
```
## Pros and Cons
### Pros
- Fast to compute
- Good performance compared to binning
- Handles continuous changes in predictors well
### Cons
- Arguably less interpretable than binning
- Creates correlated features
- Can produce a lot of variables
- Has a hard time modeling sudden changes in distributions
## R Examples
```{r}
#| echo: false
#| message: false
library(tidymodels)
data("ames")
```
We will be using the `ames` data set for these examples.
```{r}
library(recipes)
library(modeldata)
ames |>
select(Lot_Area, Year_Built)
```
{recipes} provides a number of steps to perform spline operations, each of them starting with `step_spline_`. Let us use a B-spline and an M-spline as examples here:
```{r}
log_rec <- recipe(~ Lot_Area + Year_Built, data = ames) |>
step_spline_b(Lot_Area) |>
step_spline_monotone(Year_Built)
log_rec |>
prep() |>
bake(new_data = NULL) |>
glimpse()
```
We can set the `deg_free` argument to specify how many spline features we want for each of the splines.
```{r}
log_rec <- recipe(~ Lot_Area + Year_Built, data = ames) |>
step_spline_b(Lot_Area, deg_free = 3) |>
step_spline_monotone(Year_Built, deg_free = 4)
log_rec |>
prep() |>
bake(new_data = NULL) |>
glimpse()
```
These steps have more arguments, so we can change other things. The B-splines created by `step_spline_b()` default to cubic splines, but we can change that by specifying the polynomial degree we want with the `degree` argument.
```{r}
log_rec <- recipe(~ Lot_Area + Year_Built, data = ames) |>
step_spline_b(Lot_Area, deg_free = 3, degree = 1) |>
step_spline_monotone(Year_Built, deg_free = 4)
log_rec |>
prep() |>
bake(new_data = NULL) |>
glimpse()
```
## Python Examples
```{python}
#| echo: false
import pandas as pd
from sklearn import set_config
set_config(transform_output="pandas")
pd.set_option('display.precision', 3)
```
We are using the `ames` data set for these examples. {sklearn} provides the `SplineTransformer()` transformer, which we can use.
```{python}
from feazdata import ames
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import SplineTransformer
ct = ColumnTransformer(
[('spline', SplineTransformer(), ['Lot_Area'])],
remainder="passthrough")
ct.fit(ames)
ct.transform(ames)
```