rolling funs / shift could support logical window #3241

jangorecki · 2018-12-22T13:15:03Z

I am filing this issue as a placeholder to gauge user demand for this feature. At present there are no plans to implement it, so if you need it, be sure to upvote.
Extension of #2778.

Rolling functions and shift have been implemented to operate on the physical order of the data, which means they do not handle "gaps" in, for example, time/date fields. To shift an IDate vector by one day, one has to ensure that every single day is present in the vector. If it isn't, one has to expand the vector (or the whole data.table) and perform the shift afterwards. This can be solved flexibly and time-efficiently using a "rolling join", but the problem is memory consumption, especially for very sparse data. Ideally we would isolate the roll functionality of rolling joins into a helper function and re-use it in those cases.
Some examples of expected output for input x:

library(data.table)
id = c(0L,1L,2L,5L,6L,8L)
x = data.table(date=as.IDate(id), value=c(1,2,3,4,5,6))
x
#         date value
#1: 1970-01-01     1
#2: 1970-01-02     2
#3: 1970-01-03     3
#4: 1970-01-06     4
#5: 1970-01-07     5
#6: 1970-01-09     6

## shift value by 1 date
cbind(x, data.table(s_date=as.IDate(id-1), s_value=c(NA,1,2,NA,4,NA)))
#         date value     s_date s_value
#1: 1970-01-01     1 1969-12-31      NA
#2: 1970-01-02     2 1970-01-01       1
#3: 1970-01-03     3 1970-01-02       2
#4: 1970-01-06     4 1970-01-05      NA
#5: 1970-01-07     5 1970-01-06       4
#6: 1970-01-09     6 1970-01-08      NA

## shift value by 1 date locf
cbind(x, data.table(s_date=as.IDate(id-1), s_value=c(NA,1,2,3,4,5)))
#         date value     s_date s_value
#1: 1970-01-01     1 1969-12-31      NA
#2: 1970-01-02     2 1970-01-01       1
#3: 1970-01-03     3 1970-01-02       2
#4: 1970-01-06     4 1970-01-05       3
#5: 1970-01-07     5 1970-01-06       4
#6: 1970-01-09     6 1970-01-08       5

## shift value by 1 date nocb
cbind(x, data.table(s_date=as.IDate(id-1), s_value=c(1,1,2,4,4,6)))
#         date value     s_date s_value
#1: 1970-01-01     1 1969-12-31       1
#2: 1970-01-02     2 1970-01-01       1
#3: 1970-01-03     3 1970-01-02       2
#4: 1970-01-06     4 1970-01-05       4
#5: 1970-01-07     5 1970-01-06       4
#6: 1970-01-09     6 1970-01-08       6

## shift value by -1 date
cbind(x, data.table(s_date=as.IDate(id+1), s_value=c(2,3,NA,5,NA,NA)))
#         date value     s_date s_value
#1: 1970-01-01     1 1970-01-02       2
#2: 1970-01-02     2 1970-01-03       3
#3: 1970-01-03     3 1970-01-04      NA
#4: 1970-01-06     4 1970-01-07       5
#5: 1970-01-07     5 1970-01-08      NA
#6: 1970-01-09     6 1970-01-10      NA

## shift value by -1 date locf
cbind(x, data.table(s_date=as.IDate(id+1), s_value=c(2,3,3,5,5,6)))
#         date value     s_date s_value
#1: 1970-01-01     1 1970-01-02       2
#2: 1970-01-02     2 1970-01-03       3
#3: 1970-01-03     3 1970-01-04       3
#4: 1970-01-06     4 1970-01-07       5
#5: 1970-01-07     5 1970-01-08       5
#6: 1970-01-09     6 1970-01-10       6

## shift value by -1 date nocb
cbind(x, data.table(s_date=as.IDate(id+1), s_value=c(2,3,4,5,6,NA)))
#         date value     s_date s_value
#1: 1970-01-01     1 1970-01-02       2
#2: 1970-01-02     2 1970-01-03       3
#3: 1970-01-03     3 1970-01-04       4
#4: 1970-01-06     4 1970-01-07       5
#5: 1970-01-07     5 1970-01-08       6
#6: 1970-01-09     6 1970-01-10      NA

## rollsum value by 3 date
cbind(x, data.table(w_date=sapply(as.IDate(id), function(x) paste(x+((-2):0), collapse=",")), w_value=c(sum(NA,NA,1),sum(NA,1,2),sum(1,2,3),sum(NA,NA,4),sum(NA,4,5),sum(5,NA,6))))
#         date value                           w_date w_value
#1: 1970-01-01     1 1969-12-30,1969-12-31,1970-01-01      NA
#2: 1970-01-02     2 1969-12-31,1970-01-01,1970-01-02      NA
#3: 1970-01-03     3 1970-01-01,1970-01-02,1970-01-03       6
#4: 1970-01-06     4 1970-01-04,1970-01-05,1970-01-06      NA
#5: 1970-01-07     5 1970-01-05,1970-01-06,1970-01-07      NA
#6: 1970-01-09     6 1970-01-07,1970-01-08,1970-01-09      NA

## rollsum value by 3 date locf
cbind(x, data.table(w_date=sapply(as.IDate(id), function(x) paste(x+((-2):0), collapse=",")), w_value=c(sum(NA,NA,1),sum(NA,1,2),sum(1,2,3),sum(3,3,4),sum(3,4,5),sum(5,5,6))))
#         date value                           w_date w_value
#1: 1970-01-01     1 1969-12-30,1969-12-31,1970-01-01      NA
#2: 1970-01-02     2 1969-12-31,1970-01-01,1970-01-02      NA
#3: 1970-01-03     3 1970-01-01,1970-01-02,1970-01-03       6
#4: 1970-01-06     4 1970-01-04,1970-01-05,1970-01-06      10
#5: 1970-01-07     5 1970-01-05,1970-01-06,1970-01-07      12
#6: 1970-01-09     6 1970-01-07,1970-01-08,1970-01-09      16

## rollsum value by 3 date nocb
cbind(x, data.table(w_date=sapply(as.IDate(id), function(x) paste(x+((-2):0), collapse=",")), w_value=c(sum(1,1,1),sum(1,1,2),sum(1,2,3),sum(4,4,4),sum(4,4,5),sum(5,6,6))))
#         date value                           w_date w_value
#1: 1970-01-01     1 1969-12-30,1969-12-31,1970-01-01       3
#2: 1970-01-02     2 1969-12-31,1970-01-01,1970-01-02       4
#3: 1970-01-03     3 1970-01-01,1970-01-02,1970-01-03       6
#4: 1970-01-06     4 1970-01-04,1970-01-05,1970-01-06      12
#5: 1970-01-07     5 1970-01-05,1970-01-06,1970-01-07      13
#6: 1970-01-09     6 1970-01-07,1970-01-08,1970-01-09      17
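
For comparison, the first rollsum table above (no locf/nocb) can already be reproduced with the expansion approach the description calls memory-hungry: materialise every date in the range, join, and roll over physical rows. This is only a sketch; the names `all_dates`, `expanded` and `res` are illustrative, and it assumes a data.table version that provides `frollsum`.

```r
library(data.table)

id = c(0L,1L,2L,5L,6L,8L)
x = data.table(date=as.IDate(id), value=c(1,2,3,4,5,6))

## every date between the first and last observation (9 rows for 6 observations)
all_dates = data.table(date=as.IDate(min(id):max(id)))

## join onto the full range: dates missing from x get value = NA
expanded = x[all_dates, on="date"]

## a physical window of 3 rows is now also a logical window of 3 days
expanded[, w_value := frollsum(value, 3L)]

## keep only the dates that were present in x
res = expanded[x[, .(date)], on="date"]
res
```

The cost is visible in `nrow(expanded)`: the intermediate table grows with the length of the date range rather than with the number of observations, which is the memory problem for very sparse data.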

Related Stack Overflow question tagged data.table: https://stackoverflow.com/questions/33553230/calculate-moving-average-every-n-hours

It is worth noting that pandas, as of 0.23.4, supports rolling functions by logical order when the window argument receives an offset instead of an int: https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.rolling.html

window : int, or offset
Size of the moving window. This is the number of observations used for calculating the statistic. Each window will be a fixed size.
If its an offset then this will be the time period of each window. Each window will be a variable sized based on the observations included in the time-period. This is only valid for datetimelike indexes. This is new in 0.19.0

jangorecki · 2019-04-24T19:05:27Z

These questions could probably be resolved efficiently with this feature:
https://stackoverflow.com/questions/55820330/how-to-mutate-variables-on-a-rollwing-time-window-by-groups-with-unequal-time-di
https://stackoverflow.com/questions/57584342/calculate-rolling-functions-on-up-to-a-time-interval-with-irregularly-spaced-tim

jangorecki · 2019-09-29T15:51:15Z

An interesting package written by @gogonzo already supports the feature requested in this issue: https://github.com/gogonzo/runner. The feature is called "handling missings" there. Another interesting feature of the runner package is "varying window size". It is implemented only to return all window values, without applying any function, so it requires much more memory, but the result can be post-processed more flexibly with lapply.
This "varying window size" is already implemented in data.table as the adaptive=TRUE argument.

jangorecki · 2019-10-04T10:57:56Z

Another package implementing the functionality described in this issue is https://github.com/DavisVaughan/slide

jangorecki · 2022-09-03T14:54:15Z

I am leaning towards dropping this feature from the scope of the rolling functions implementation, because this use case fits perfectly well into adaptive rolling functions, which have been part of rolling functions since the beginning.

Therefore, instead of adding support for it in our C code, we can simply provide a helper function that generates the expected n argument. Using the example above:

library(data.table)
id = c(0L,1L,2L,5L,6L,8L)
x = data.table(date=as.IDate(id), value=c(1,2,3,4,5,6))
x
#         date value
#1: 1970-01-01     1
#2: 1970-01-02     2
#3: 1970-01-03     3
#4: 1970-01-06     4
#5: 1970-01-07     5
#6: 1970-01-09     6

## non-adaptive window of width 3
n = 3L
x[, n3 := frollsum(value, n)]

## adaptive window of 3 days
an = c(3L,3L,3L,1L,2L,2L)
x[, an3 := frollsum(value, an, adaptive=TRUE)]

x
#         date value    n3   an3
#       <IDat> <num> <num> <num>
#1: 1970-01-01     1    NA    NA
#2: 1970-01-02     2    NA    NA
#3: 1970-01-03     3     6     6
#4: 1970-01-06     4     9     4
#5: 1970-01-07     5    12     9
#6: 1970-01-09     6    15    11

So the whole point is to provide a function adapt

adapt = function(index, window) ...

that, for an index column (date in the example above) and a window width (3 days in the example above), generates the expected window widths (an in the example above).
The helper function would have to handle various time units (at least seconds, minutes, hours, days, months, years).
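
A rough idea of what such a helper could compute, for the simplest case only: a strictly increasing integer-like index (for example an IDate column) and a window expressed in the same unit as the index. `adapt_sketch` is a hypothetical name used for illustration, not the function proposed in the PR, and it does not attempt the time-unit handling mentioned above. For windows that would start before the first observation it returns the full window width, which makes the adaptive rolling function return NA there, as in the `an` vector above.

```r
library(data.table)

## hypothetical helper, illustration only
adapt_sketch = function(index, window) {
  idx = as.integer(index)
  window = as.integer(window)
  stopifnot(!anyNA(idx), !is.unsorted(idx, strictly=TRUE),
            length(window) == 1L, window >= 1L)
  ## number of observations at or before idx[i] - window, i.e. outside the window
  before = findInterval(idx - window, idx)
  ## observations inside the window (idx[i] - window, idx[i]]
  n = seq_along(idx) - before
  ## window reaches back past the first observation: treat it as incomplete
  incomplete = (idx - idx[1L]) < (window - 1L)
  n[incomplete] = window
  n
}

id = c(0L,1L,2L,5L,6L,8L)
an = adapt_sketch(as.IDate(id), 3L)
an
frollsum(c(1,2,3,4,5,6), an, adaptive=TRUE)
```

Here `an` comes out as 3, 3, 3, 1, 2, 2, the same widths that were written by hand in the example above, so the `frollsum` call should give the `an3` column shown there.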

This will obviously not address the feature for the shift function, as there is no adaptive shift feature.
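
In the meantime, the logical shift from the first comment can be emulated with a join on the shifted dates, where the roll argument of the join gives the locf and nocb variants. A minimal sketch (the object name `lookup` is illustrative) that reproduces the three "shift value by 1 date" tables from the first comment:

```r
library(data.table)

id = c(0L,1L,2L,5L,6L,8L)
x = data.table(date=as.IDate(id), value=c(1,2,3,4,5,6))

## the date each row should take its value from
lookup = data.table(date=as.IDate(id - 1L))

## exact match only: gaps stay NA
s_exact = x[lookup, on="date", value]
## roll=TRUE: last observation carried forward
s_locf = x[lookup, on="date", roll=TRUE, value]
## roll=-Inf: next observation carried backward
s_nocb = x[lookup, on="date", roll=-Inf, value]

data.table(x, s_date=lookup$date, s_exact, s_locf, s_nocb)
```

Unlike the expansion approach, the join only looks up one date per row of x, so no intermediate table covering the full date range is needed.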

jangorecki · 2022-10-31T13:52:30Z

updated timings based on adapt branch

library(slider)
library(data.table)
set.seed(108)
N = 1e6
n = 1e3
x = rnorm(N)

## slightly sparse
idx = sort(sample(N*1.1, N))
system.time(s <- slide_index_dbl(x, idx, mean, .before=n-1L, .complete=TRUE))
#   user  system elapsed 
#  9.041   0.075   9.117 
system.time(d <- frollmean(x, frolladapt(idx, n), adaptive=TRUE))
#   user  system elapsed 
#  0.016   0.000   0.012 
all.equal(d, s)
#[1] TRUE

## sparse
idx = sort(sample(N*2, N))
system.time(s <- slide_index_dbl(x, idx, mean, .before=n-1L, .complete=TRUE))
#   user  system elapsed 
#  7.900   0.008   7.908 
system.time(d <- frollmean(x, frolladapt(idx, n), adaptive=TRUE))
#   user  system elapsed 
#  0.027   0.000   0.022 
all.equal(d, s)
#[1] TRUE

~~branch for now is in my repo, will push to Rdatatable namespace after rebase to master.~~

DavisVaughan · 2023-01-08T14:04:14Z

@jangorecki a better slider benchmark is probably against slide_index_mean(), which is specialized for this kind of thing. It uses a segment tree rather than an online algorithm to avoid some numerical issues, but is still fairly competitive most of the time. Implementation is from https://www.vldb.org/pvldb/vol8/p1058-leis.pdf.

But nice work with frolladapt()! Very cool.

library(slider)

set.seed(108)
N = 1e6
n = 1e3
x = rnorm(N)

## slightly sparse
idx = sort(sample(N*1.1, N))
system.time(s <- slide_index_dbl(x, idx, mean, .before=n-1L, .complete=TRUE))
#>    user  system elapsed 
#>   6.350   0.551   6.907
system.time(s2 <- slide_index_mean(x, idx, before=n-1L, complete=TRUE))
#>    user  system elapsed 
#>   0.270   0.012   0.282
all.equal(s, s2)
#> [1] TRUE

## sparse
idx = sort(sample(N*2, N))
system.time(s <- slide_index_dbl(x, idx, mean, .before=n-1L, .complete=TRUE))
#>    user  system elapsed 
#>   5.089   0.345   5.437
system.time(s2 <- slide_index_mean(x, idx, before=n-1L, complete=TRUE))
#>    user  system elapsed 
#>   0.291   0.016   0.308
all.equal(s, s2)
#> [1] TRUE

Created on 2023-01-08 with reprex v2.0.2.9000

jangorecki · 2023-01-08T15:49:36Z

Thanks for pointing out the _mean version. I thought it existed only for the non-index version and must have missed this one. frolladapt doesn't really do much here; it is adaptive=TRUE in the rolling functions that does almost all the work. I actually developed it for a different purpose, adaptive rolling functions. Unevenly spaced time series turned out to be a special case of that.

jangorecki added the someday label Dec 22, 2018

jangorecki mentioned this issue Mar 6, 2019

allow subsetting by times #607

Open

Rdatatable deleted a comment from ewisienka Jul 31, 2019

MichaelChirico mentioned this issue Oct 15, 2019

Master list of most-requested issues #3189

Open

76 tasks

jangorecki mentioned this issue Nov 21, 2019

nafill new type: approx #4066

Open

jangorecki mentioned this issue Apr 8, 2020

Provide aggregation functionality for use with data.table eddelbuettel/nanotime#64

Closed

MichaelChirico added the High label May 30, 2020

jangorecki removed the High label Jun 3, 2020

MichaelChirico added the idate/itime label Sep 8, 2020

jangorecki mentioned this issue Sep 9, 2020

Minor feedback CEAUL/Dados_COVID-19_PT#21

Open

jangorecki mentioned this issue Nov 12, 2021

Shift needs to consider the time variable #5259

Closed

jangorecki mentioned this issue Apr 11, 2022

significantly slower performance time based rolling 1.14.3 #5366

Closed

jangorecki added help-wanted and removed someday labels Sep 3, 2022

jangorecki added a commit that referenced this issue Sep 3, 2022

improve doc too mention adaptive rolling function to address #3241

7184024

jangorecki mentioned this issue Sep 12, 2022

rolling functions, rolling aggregates, sliding window, moving average #2778

Open

28 tasks

jangorecki added the froll label Sep 26, 2022

jangorecki removed the help-wanted label Oct 22, 2022

jangorecki self-assigned this Oct 23, 2022

jangorecki linked a pull request Jan 3, 2023 that will close this issue

adapt uneven time series rolling window #5576

Open

jangorecki added this to the 1.14.9 milestone Jan 3, 2023

jangorecki mentioned this issue Oct 5, 2023

Window functions, revisited PRQL/prql#2723

Open

jangorecki modified the milestones: 1.14.11, 1.15.1 Oct 29, 2023

MichaelChirico added the top request label Apr 14, 2024
