rolling funs / shift could support logical window #3241

Open
jangorecki opened this issue Dec 22, 2018 · 10 comments · May be fixed by #5576
Labels: froll, idate/itime, top request (One of our most-requested issues)


@jangorecki
Member

jangorecki commented Dec 22, 2018

I am filing this issue as a placeholder to gauge user demand for this feature. At present there are no plans to implement it, so if you need it, be sure to upvote.
Extension of #2778.

Rolling functions and shift have been implemented to operate on the physical order of data, which means that they do not handle "gaps" in, for example, time/date fields. If one wants to shift an IDate vector by one day, one has to ensure that every single day is present in the vector. If it is not, one has to expand the vector (or possibly the whole data.table) and perform the shift afterwards. This can be done flexibly and time-efficiently using a "rolling join", but the problem is memory consumption, especially for very sparse data. In an ideal world we would isolate the roll functionality of rolling joins into a helper function and re-use it in those cases.
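To make the problem concrete, here is a minimal sketch of the "expand, then shift" workaround described above. It assumes a small IDate table with gaps; the object names (`by_row`, `by_date`, `full`, `expanded`) are made up for illustration only.

```r
library(data.table)

x = data.table(date=as.IDate(c(0L,1L,2L,5L,6L,8L)), value=c(1,2,3,4,5,6))

## shift works on physical order: row 4 (1970-01-06) receives the value
## from 1970-01-03, which is three days earlier, not one day earlier
by_row = shift(x$value, 1L)

## workaround: expand to every date in the range, shift, keep original dates
rng = range(as.integer(x$date))
full = data.table(date=as.IDate(rng[1L]:rng[2L]))
expanded = x[full, on="date"]            # one row per day, NA on missing dates
expanded[, s_value := shift(value, 1L)]
by_date = expanded[x, s_value, on="date"]

by_row   # physical order, gaps ignored
by_date  # logical (date) order, gaps respected
```

The expanded table has one row per day in the range (9 rows here versus 6 observations), which is where the memory cost for very sparse data comes from.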
Some examples of expected output for input x:

library(data.table)
id = c(0L,1L,2L,5L,6L,8L)
x = data.table(date=as.IDate(id), value=c(1,2,3,4,5,6))
x
#         date value
#1: 1970-01-01     1
#2: 1970-01-02     2
#3: 1970-01-03     3
#4: 1970-01-06     4
#5: 1970-01-07     5
#6: 1970-01-09     6

## shift value by 1 date
cbind(x, data.table(s_date=as.IDate(id-1), s_value=c(NA,1,2,NA,4,NA)))
#         date value     s_date s_value
#1: 1970-01-01     1 1969-12-31      NA
#2: 1970-01-02     2 1970-01-01       1
#3: 1970-01-03     3 1970-01-02       2
#4: 1970-01-06     4 1970-01-05      NA
#5: 1970-01-07     5 1970-01-06       4
#6: 1970-01-09     6 1970-01-08      NA

## shift value by 1 date locf
cbind(x, data.table(s_date=as.IDate(id-1), s_value=c(NA,1,2,3,4,5)))
#         date value     s_date s_value
#1: 1970-01-01     1 1969-12-31      NA
#2: 1970-01-02     2 1970-01-01       1
#3: 1970-01-03     3 1970-01-02       2
#4: 1970-01-06     4 1970-01-05       3
#5: 1970-01-07     5 1970-01-06       4
#6: 1970-01-09     6 1970-01-08       5

## shift value by 1 date nocb
cbind(x, data.table(s_date=as.IDate(id-1), s_value=c(1,1,2,4,4,6)))
#         date value     s_date s_value
#1: 1970-01-01     1 1969-12-31       1
#2: 1970-01-02     2 1970-01-01       1
#3: 1970-01-03     3 1970-01-02       2
#4: 1970-01-06     4 1970-01-05       4
#5: 1970-01-07     5 1970-01-06       4
#6: 1970-01-09     6 1970-01-08       6

## shift value by -1 date
cbind(x, data.table(s_date=as.IDate(id+1), s_value=c(2,3,NA,5,NA,NA)))
#         date value     s_date s_value
#1: 1970-01-01     1 1970-01-02       2
#2: 1970-01-02     2 1970-01-03       3
#3: 1970-01-03     3 1970-01-04      NA
#4: 1970-01-06     4 1970-01-07       5
#5: 1970-01-07     5 1970-01-08      NA
#6: 1970-01-09     6 1970-01-10      NA

## shift value by -1 date locf
cbind(x, data.table(s_date=as.IDate(id+1), s_value=c(2,3,3,5,5,6)))
#         date value     s_date s_value
#1: 1970-01-01     1 1970-01-02       2
#2: 1970-01-02     2 1970-01-03       3
#3: 1970-01-03     3 1970-01-04       3
#4: 1970-01-06     4 1970-01-07       5
#5: 1970-01-07     5 1970-01-08       5
#6: 1970-01-09     6 1970-01-10       6

## shift value by -1 date nocb
cbind(x, data.table(s_date=as.IDate(id+1), s_value=c(2,3,4,5,6,NA)))
#         date value     s_date s_value
#1: 1970-01-01     1 1970-01-02       2
#2: 1970-01-02     2 1970-01-03       3
#3: 1970-01-03     3 1970-01-04       4
#4: 1970-01-06     4 1970-01-07       5
#5: 1970-01-07     5 1970-01-08       6
#6: 1970-01-09     6 1970-01-10      NA

## rollsum value by 3 date
cbind(x, data.table(w_date=sapply(as.IDate(id), function(x) paste(x+((-2):0), collapse=",")), w_value=c(sum(NA,NA,1),sum(NA,1,2),sum(1,2,3),sum(NA,NA,4),sum(NA,4,5),sum(5,NA,6))))
#         date value                           w_date w_value
#1: 1970-01-01     1 1969-12-30,1969-12-31,1970-01-01      NA
#2: 1970-01-02     2 1969-12-31,1970-01-01,1970-01-02      NA
#3: 1970-01-03     3 1970-01-01,1970-01-02,1970-01-03       6
#4: 1970-01-06     4 1970-01-04,1970-01-05,1970-01-06      NA
#5: 1970-01-07     5 1970-01-05,1970-01-06,1970-01-07      NA
#6: 1970-01-09     6 1970-01-07,1970-01-08,1970-01-09      NA

## rollsum value by 3 date locf
cbind(x, data.table(w_date=sapply(as.IDate(id), function(x) paste(x+((-2):0), collapse=",")), w_value=c(sum(NA,NA,1),sum(NA,1,2),sum(1,2,3),sum(3,3,4),sum(3,4,5),sum(5,5,6))))
#         date value                           w_date w_value
#1: 1970-01-01     1 1969-12-30,1969-12-31,1970-01-01      NA
#2: 1970-01-02     2 1969-12-31,1970-01-01,1970-01-02      NA
#3: 1970-01-03     3 1970-01-01,1970-01-02,1970-01-03       6
#4: 1970-01-06     4 1970-01-04,1970-01-05,1970-01-06      10
#5: 1970-01-07     5 1970-01-05,1970-01-06,1970-01-07      12
#6: 1970-01-09     6 1970-01-07,1970-01-08,1970-01-09      16

## rollsum value by 3 date nocb
cbind(x, data.table(w_date=sapply(as.IDate(id), function(x) paste(x+((-2):0), collapse=",")), w_value=c(sum(1,1,1),sum(1,1,2),sum(1,2,3),sum(4,4,4),sum(4,4,5),sum(5,6,6))))
#         date value                           w_date w_value
#1: 1970-01-01     1 1969-12-30,1969-12-31,1970-01-01       3
#2: 1970-01-02     2 1969-12-31,1970-01-01,1970-01-02       4
#3: 1970-01-03     3 1970-01-01,1970-01-02,1970-01-03       6
#4: 1970-01-06     4 1970-01-04,1970-01-05,1970-01-06      12
#5: 1970-01-07     5 1970-01-05,1970-01-06,1970-01-07      13
#6: 1970-01-09     6 1970-01-07,1970-01-08,1970-01-09      17
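For the shift cases, the expected outputs above can be approximated today with a join-based lookup rather than a true shift. This is only a sketch of one possible workaround, not the requested feature: it looks up the value at `date - 1` with an exact join, and uses the existing `roll` argument of joins for the locf and nocb variants. The names `lookup`, `s_exact`, `s_locf` and `s_nocb` are made up for illustration.

```r
library(data.table)

x = data.table(date=as.IDate(c(0L,1L,2L,5L,6L,8L)), value=c(1,2,3,4,5,6))

## dates to look up: one day before each observation
lookup = data.table(date=x$date - 1L)

s_exact = x[lookup, value, on="date"]             # NA when the date is missing
s_locf  = x[lookup, value, on="date", roll=TRUE]  # last observation carried forward
s_nocb  = x[lookup, value, on="date", roll=-Inf]  # next observation carried backward

cbind(x, s_date=lookup$date, s_exact, s_locf, s_nocb)
```

Unlike the expand-then-shift approach, this does not materialize a row for every missing date, but it still goes through the join machinery instead of a dedicated shift.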

Related issue tagged as data.table: https://stackoverflow.com/questions/33553230/calculate-moving-average-every-n-hours


It is worth noting that pandas, as of 0.23.4, supports rolling functions by logical order when the window argument receives an offset instead of an int: https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.rolling.html

window : int, or offset
Size of the moving window. This is the number of observations used for calculating the statistic. Each window will be a fixed size.
If its an offset then this will be the time period of each window. Each window will be a variable sized based on the observations included in the time-period. This is only valid for datetimelike indexes. This is new in 0.19.0

@Rdatatable Rdatatable deleted a comment from ewisienka Jul 31, 2019
@jangorecki
Member Author

jangorecki commented Sep 29, 2019

An interesting package written by @gogonzo already supports the feature requested in this issue: https://github.com/gogonzo/runner. The feature is called "handling missings" there. Another interesting feature of the runner package is "varying window size". It is implemented only to return all window values, without applying any function, so it requires much more memory, but the result can be post-processed more flexibly with lapply.
This "varying window size" is already implemented in data.table via the adaptive=TRUE argument.

@jangorecki
Member Author

Another package implementing the functionality described in this issue is https://github.com/DavisVaughan/slide

@jangorecki
Copy link
Member Author

jangorecki commented Sep 3, 2022

I am leaning towards removing this functionality from the rolling functions implementation, because this use case fits perfectly well into adaptive rolling functions, which have been part of rolling functions since the beginning.

Therefore, instead of adding support for it in our C code, we can simply provide a helper function that generates the expected n argument. Using the example above:

library(data.table)
id = c(0L,1L,2L,5L,6L,8L)
x = data.table(date=as.IDate(id), value=c(1,2,3,4,5,6))
x
#         date value
#1: 1970-01-01     1
#2: 1970-01-02     2
#3: 1970-01-03     3
#4: 1970-01-06     4
#5: 1970-01-07     5
#6: 1970-01-09     6

## non-adaptive window of width 3
n = 3L
x[, n3 := frollsum(value, n)]

## adaptive window of 3 days
an = c(3L,3L,3L,1L,2L,2L)
x[, an3 := frollsum(value, an, adaptive=TRUE)]

x
#         date value    n3   an3
#       <IDat> <num> <num> <num>
#1: 1970-01-01     1    NA    NA
#2: 1970-01-02     2    NA    NA
#3: 1970-01-03     3     6     6
#4: 1970-01-06     4     9     4
#5: 1970-01-07     5    12     9
#6: 1970-01-09     6    15    11

So the whole point is to provide a function adapt

adapt = function(index, window) ...

that, for an index column (date in the example above) and a window width (3 days in the example above), generates the expected window widths (an in the example above).
The helper function would have to handle various time units (at least seconds, minutes, hours, days, months, years).
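As a rough illustration of what such a helper could compute, here is a minimal sketch. `adapt_sketch` is a hypothetical name and not part of data.table; it assumes a strictly increasing index measured in whole days (or any integer unit), right-aligned windows, and it does not handle the other time units listed above. For windows that would start before the first observation it returns the full `window` width, which asks for more rows than are available and therefore yields NA, matching the `an` vector used in the example.

```r
library(data.table)

## hypothetical helper, NOT data.table API: integer index, right-aligned windows
adapt_sketch = function(index, window) {
  index = as.integer(index)
  window = as.integer(window)
  stopifnot(length(window) == 1L, window >= 1L,
            !is.unsorted(index, strictly=TRUE))
  ## rows with index <= index[i] - window lie outside the window ending at row i
  outside = findInterval(index - window, index)
  n = seq_along(index) - outside
  ## windows starting before the first observation are incomplete:
  ## request the full width so that the rolling function returns NA there
  incomplete = (index - window + 1L) < index[1L]
  n[incomplete] = window
  n
}

x = data.table(date=as.IDate(c(0L,1L,2L,5L,6L,8L)), value=c(1,2,3,4,5,6))
an = adapt_sketch(x$date, 3L)
an3 = frollsum(x$value, an, adaptive=TRUE)
```

With the example data this gives the same widths as the hand-written `an` above, and `an3` matches the an3 column in the printed table.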

This will obviously not address the feature for the shift function, as we don't have an adaptive shift there.

@jangorecki jangorecki self-assigned this Oct 23, 2022
@jangorecki
Member Author

jangorecki commented Oct 31, 2022

Updated timings based on the adapt branch:

library(slider)
library(data.table)
set.seed(108)
N = 1e6
n = 1e3
x = rnorm(N)

## slightly sparse
idx = sort(sample(N*1.1, N))
system.time(s <- slide_index_dbl(x, idx, mean, .before=n-1L, .complete=TRUE))
#   user  system elapsed 
#  9.041   0.075   9.117 
system.time(d <- frollmean(x, frolladapt(idx, n), adaptive=TRUE))
#   user  system elapsed 
#  0.016   0.000   0.012 
all.equal(d, s)
#[1] TRUE

## sparse
idx = sort(sample(N*2, N))
system.time(s <- slide_index_dbl(x, idx, mean, .before=n-1L, .complete=TRUE))
#   user  system elapsed 
#  7.900   0.008   7.908 
system.time(d <- frollmean(x, frolladapt(idx, n), adaptive=TRUE))
#   user  system elapsed 
#  0.027   0.000   0.022 
all.equal(d, s)
#[1] TRUE

The branch is in my repo for now; I will push it to the Rdatatable namespace after rebasing onto master.

@jangorecki jangorecki linked a pull request Jan 3, 2023 that will close this issue
@jangorecki jangorecki added this to the 1.14.9 milestone Jan 3, 2023
@DavisVaughan
Contributor

DavisVaughan commented Jan 8, 2023

@jangorecki a better slider benchmark is probably against slide_index_mean(), which is specialized for this kind of thing. It uses a segment tree rather than an online algorithm to avoid some numerical issues, but is still fairly competitive most of the time. Implementation is from https://www.vldb.org/pvldb/vol8/p1058-leis.pdf.

But nice work with frolladapt()! Very cool.

library(slider)

set.seed(108)
N = 1e6
n = 1e3
x = rnorm(N)

## slightly sparse
idx = sort(sample(N*1.1, N))
system.time(s <- slide_index_dbl(x, idx, mean, .before=n-1L, .complete=TRUE))
#>    user  system elapsed 
#>   6.350   0.551   6.907
system.time(s2 <- slide_index_mean(x, idx, before=n-1L, complete=TRUE))
#>    user  system elapsed 
#>   0.270   0.012   0.282
all.equal(s, s2)
#> [1] TRUE

## sparse
idx = sort(sample(N*2, N))
system.time(s <- slide_index_dbl(x, idx, mean, .before=n-1L, .complete=TRUE))
#>    user  system elapsed 
#>   5.089   0.345   5.437
system.time(s2 <- slide_index_mean(x, idx, before=n-1L, complete=TRUE))
#>    user  system elapsed 
#>   0.291   0.016   0.308
all.equal(s, s2)
#> [1] TRUE

Created on 2023-01-08 with reprex v2.0.2.9000
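To illustrate the kind of numerical issue an online rolling sum (add the entering value, subtract the leaving value) can run into, here is a minimal base R sketch. It is an illustration under assumptions only: the input vector is made up to provoke rounding in double precision, and the code does not describe how slider or data.table compute their results.

```r
## made-up input: one huge value followed by small ones
x = c(1e16, 1, 1, 1)
n = 2L

## online update: carry a running sum, add the entering and subtract the leaving value
online = rep(NA_real_, length(x))
s = sum(x[1:n])
online[n] = s
for (i in (n + 1L):length(x)) {
  s = s + x[i] - x[i - n]
  online[i] = s
}

## direct recomputation of every window from scratch
direct = sapply(seq_along(x), function(i) {
  if (i < n) NA_real_ else sum(x[(i - n + 1L):i])
})

online
direct
```

Once the huge value has entered the running sum, the small values added next to it are lost to rounding and stay lost after the huge value leaves the window, so the online result for the later windows differs from the direct one. Recomputing or combining exact partial results, as a segment tree does, avoids carrying that error forward.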

@jangorecki
Member Author

Thanks for pointing out the _mean version. I thought it existed only for the non-index version and must have missed this one. frolladapt doesn't really do much here; it is adaptive=TRUE in the rolling functions that does almost all of the work. I actually developed it for a different purpose, adaptive rolling functions. Unevenly spaced time series turned out to be a special case of it.

@jangorecki jangorecki modified the milestones: 1.14.11, 1.15.1 Oct 29, 2023
@MichaelChirico MichaelChirico added the top request One of our most-requested issues label Apr 14, 2024

3 participants