/
datetime-advanced.qmd
202 lines (175 loc) · 7.62 KB
/
datetime-advanced.qmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
---
pagetitle: "Feature Engineering A-Z | Advanced Datetime Features"
---
# Advanced Features {#sec-datetime-advanced}
::: {style="visibility: hidden; height: 0px;"}
## Advanced Features
:::
```{r}
#| echo: false
#| message: false
library(tidyverse)
```
All the features we were able to extract were related to what day or time it was for a given observation. Or numbers on the form "how many since the start of the month" or "how many days since the start of the week". And while this information can be useful, there will often be times when we want to do slight modifications that can result in huge payoffs.
Consider merchandise sale-related data. The mere indication of specific dates might become useful, but the sale amount is not likely to be affected just on the sale days, but on the surrounding days as well. Consider the American Black Friday. This day is predetermined to come every year at an easily recognized day, namely the last Friday of November. Considering its close time to Christmas and other gift-giving holidays, it is a common day for thrifty people to start buying presents.
In the extraction since we have a single indicator for the day of Black Friday
```{r}
#| label: fig-datetime-advanced-indicator
#| echo: false
#| message: false
#| fig-cap: |
#| We only see the effect of a single Day
#| fig-alt: |
#| Bar chart. Dates along the x-axis, numeric effect along the y-axis. A
#| Single bar on Nov 24 with a value of 1 is shown.
tibble(
day = as_date("2023-11-24") + c(-5:5),
black_friday = c(rep(0, 5), 1, rep(0, 5))
) |>
ggplot(aes(day, black_friday)) +
geom_col() +
theme_minimal()
```
But it would make sense that since we know the day of Black Friday, that the sales will see a drop on the previous days, we can incorporate that as well.
```{r}
#| label: fig-datetime-advanced-indicator-before
#| echo: false
#| message: false
#| fig-cap: |
#| Negative before effects can capture hesitancy to buy before a big sale.
#| fig-alt: |
#| Bar chart. Dates along the x-axis, numeric effect along the y-axis. A
#| single bar on Nov 24 with a value of 1 is shown, the columns before the
#| 24ths takes negative values, with the 23 having the highest value, 22 less
#| and so on.
tibble(
day = as_date("2023-11-24") + c(-5:5),
black_friday = c(-0.5 ^ (5:1), 1, rep(0, 5))
) |>
ggplot(aes(day, black_friday)) +
geom_col() +
theme_minimal()
```
On the other hand, once the sale has started happening the sales to pick up again. Since this is the last big sale before the Holidays, shoppers are free to buy their remaining presents as they don't have to fear the item going on sale.
```{r}
#| label: fig-datetime-advanced-indicator-before-after
#| echo: false
#| message: false
#| fig-cap: |
#| Positive affects effects can capture the ease of mind that no other sale will
#| come.
#| fig-alt: |
#| Bar chart. Dates along the x-axis, numeric effect along the y-axis. A
#| single bar on Nov 24 with a value of 1 is shown, the columns before the
#| 24ths takes negative values, with the 23 having the highest value, 22 less
#| and so on. The values after Nov 24 have decreasing values.
tibble(
day = as_date("2023-11-24") + c(-5:5),
black_friday = c(-0.5 ^ (5:1), 1, 0.25 ^ (1:5/2))
) |>
ggplot(aes(day, black_friday)) +
geom_col() +
theme_minimal()
```
The exact effects shown here are just approximate to our story at hand. But they provide a useful illustration. There is a lot of bandwidth to be given if we look at date times from a distance perspective.
We can play around with "distance from" and "distance to", different numerical transformations we saw in @sec-numeric, and signs and indicators we talked about in @sec-datetime-extraction to tailor our feature engineering to our problem.
What all these methods have in common is a reference point. For an extracted `day` feature, the reference point is "first of the month" and the **after**-function is `x`, or in other words "days since the time of day". We see this in the following chart. Almost all extracted functions follow this formula
```{r}
#| label: fig-datetime-advanced-increasing
#| echo: false
#| message: false
#| fig-cap: |
#| Repeated increasing values.
#| fig-alt: |
#| Bar chart. Dates along the x-axis, numeric effect along the y-axis. Values
#| start at 1 and the first of the month, and increase by 1 for each day. A
#| Triangle pattern appears.
tibble(
day = as_date("2023-11-24") + c(-5:5),
black_friday = c(-0.5 ^ (5:1), 1, 0.25 ^ (1:5/2))
) |>
ggplot(aes(day, black_friday)) +
geom_col() +
theme_minimal()
tibble(
date = as_date("2023-11-24") + seq(-90, 0)
) |>
mutate(day = day(date)) |>
ggplot(aes(date, day)) +
geom_col() +
theme_minimal()
```
we could just as well do the inverse and look at how many days are left in the month. This would have a **before**-function of `x` as well.
```{r}
#| label: fig-datetime-advanced-decreasing
#| echo: false
#| message: false
#| fig-cap: |
#| Repeated increasing values.
#| fig-alt: |
#| Bar chart. Dates along the x-axis, numeric effect along the y-axis. Values
#| start at 1 and the last of the month, and increase by 1 for each day going
#| backwards. A Triangle pattern appears. The starting value is different for
#| each month as each month has a different number of days.
tibble(
date = as_date("2023-11-24") + seq(-90, 0)
) |>
mutate(day = day(date)) |>
mutate(month = month(date)) |>
mutate(day = max(day) - day + 1, .by = month) |>
ggplot(aes(date, day)) +
geom_col() +
theme_minimal()
```
We can do a both-sided formula by looking at "how many days are we away from a weekend". This would have both the before and after functions be `x` and look like so. Here it isn't too interesting as it is quite periodic, but using the same measure with "sale" instead of "weekend" and suddenly you have something different.
```{r}
#| label: fig-datetime-advanced-weekend
#| echo: false
#| message: false
#| fig-cap: |
#| Repeated
#| fig-alt: |
#| Bar chart. Dates along the x-axis, numeric effect along the y-axis. Values
#| are zero for both Saturdays and Sundays. 1 for Mondays and Fridays, 2 for
#| Tuesdays and Thursdays, and 3 for Wednesdays.
tibble(
date = as_date("2023-11-24") + seq(-30, 0)
) |>
mutate(day = c(1, 2, 3, 2, 1, 0, 0)[wday(date)]) |>
ggplot(aes(date, day)) +
geom_col() +
theme_minimal()
```
There are many other functions you can use, they will depend entirely on your task at hand. A few examples are shown below for inspiration.
```{r}
#| label: fig-datetime-advanced-functions
#| echo: false
#| message: false
#| fig-cap: |
#| Repeated
#| fig-alt: |
#| Faceted bar chart. Dates along the x-axis, numeric effect along the y-axis.
#| Each of the charts represents the day of the month for a couple of months.
#| One shows the logarithmic transformation, one shows the untransformed data
#| one looks at the square transformation, and one looks at the untransformed
#| data that has been rounded down to 10, creating a plateau.
tibble(
date = as_date("2023-11-24") + seq(-90, 0)
) |>
mutate(day = day(date)) |>
mutate(`log(x)` = log(day)) |>
mutate(`x^2` = day ^ 2) |>
mutate(`pmin(day, 10)` = pmin(day, 10)) |>
rename(x = day) |>
pivot_longer(-date) |>
ggplot(aes(date, value)) +
geom_col() +
facet_wrap(~name, scales = "free_y") +
theme_minimal()
```
What makes these calculations so neat is that they can be tailored to our task at hand and that they work with irregular events such as holidays and signup dates. These methods are not circular by definition, but they will work in many ways it. We will cover explicit circular methods in @sec-datetime-circular.
## Pros and Cons
### Pros
### Cons
## R Examples
## Python Examples