
rolling functions, rolling aggregates, sliding window, moving average #2778

Open
jangorecki opened this issue Apr 21, 2018 · 26 comments
@jangorecki commented Apr 21, 2018

Creating this issue to gather requirements in a single place and refresh the ~4-year-old discussions covering the rolling functions feature (also known as rolling aggregates, sliding windows, or moving averages/moving aggregates).

rolling functions

  • rollmean
  • rollsum
  • rollmin
  • rollmax
  • rollmedian
  • rollprod
  • rollsd
  • rollvar
  • rollrank
  • rollapply (user provided FUN)
  • rollregression (highly demanded)
  • rollcor?
  • rollcov?

features

  • multiple columns at once
  • multiple windows at once
  • multiple columns and multiple windows at once
  • atomic vectors input and single window returns atomic vectors
  • various length list of vectors
  • align: left/center/right
  • handling NAs
  • fill constant
  • long vector support
  • partial window support (if needed can be found in ea766f2)
  • adaptive window support
  • use openmp to parallelize calculation of multiple columns/windows
  • rounding error correction
  • timing in verbose mode from parallel region (blocked by #3422, #3423)
@jangorecki commented Apr 21, 2018

Proposed rollmean implementation, simplified.

x = data.table(v1=1:5, v2=1:5)
k = c(2, 3)
# i - single column
# j - single window size
# m - integer index of the current row
# w - running sum of the current window
# r - answer for each i, j
for (i in x) {
  for (j in k) {
    r = rep(NA_real_, length(i))
    w = 0
    for (m in seq_along(i)) {
      w = w + i[m]
      if (m > j) w = w - i[m - j]   # drop the value that left the window
      if (m >= j) r[m] = w / j      # complete windows only; else NA
    }
  }
}
@MichaelChirico commented Apr 21, 2018 (comment minimized)

@st-pasha commented Apr 21, 2018

I always envisioned rolling window functionality as grouping the dataset into multiple overlapping groups (windows). Then the API would look something like this:

DT[i, j,
   by = roll(width=5, align="center")]

Then if j contains, say, mean(A), we can internally replace it with rollmean(A) -- exactly like we are doing with gmean() right now. Or j can contain an arbitrarily complicated functionality (say, run a regression for each window), in which case we'd supply .SD data.table to it -- exactly like we do with groups right now.

This way there's no need to introduce 10+ new functions, just one. And it feels data.table-y in spirit too.

@MichaelChirico commented Apr 21, 2018 (comment minimized)

@jangorecki commented Apr 21, 2018

@st-pasha interesting idea, and it does look like data.table-y spirit, but it would impose many limitations and isn't really appropriate for this category of functions.


  • how to perform rollmean by group?
    DT[, rollmean(V1, 3), by=V2]
  • how to calculate different window sizes for different columns?
    DT[, .(rollmean(V1, 3), rollmean(V2, 100))]
  • how to calculate rollmean outside of [.data.table, as we now allow for shift?
    rollmean(rnorm(10), 3)
  • how to support queries like
    DT[, .(rollmean(list(V1, V2), c(5, 20)), rollmean(list(V2, V3), c(10, 30)))]
  • how to call mean and rollmean in the same j call?
    DT[, .(rollmean(V1, 3), mean(V1)), by=V2]

Usually when using by we aggregate data to a smaller number of rows, while rolling functions always return a vector of the same length as the input. These types of functions in SQL have an API in the following style:

SELECT AVG(value) OVER (ROWS BETWEEN 99 PRECEDING AND CURRENT ROW)
FROM tablename;

You can still combine it with GROUP BY as follows:

SELECT AVG(value) OVER (ROWS BETWEEN 99 PRECEDING AND CURRENT ROW)
FROM tablename
GROUP BY group_columns;

So in SQL those functions stay in SELECT, which corresponds to j in DT.
In DT we could achieve the same with:

DT[, rollmean(value, 100)]
DT[, rollmean(value, 100), group_columns]

Rolling functions fit into the same category as shift, which also returns the same number of rows as it receives on input.
Shift in SQL looks like:

SELECT LAG(value, 1) OVER ()
FROM tablename;

mean and rollmean are not just different functions, they are different categories of functions: one is meant to aggregate according to groups, the other not to aggregate at all. This is easily visible in SQL, where we don't use GROUP BY for the rolling-function type, but we do need GROUP BY for aggregates like mean (eventually getting the grand total when the grouping clause is not present).
I don't see strong reasoning for applying the same optimization rules as we do for mean, especially when it doesn't really fit the use case, and all that just for the sake of data.table-y spirit. The current proposal is in data.table-y spirit too: it can easily be combined with :=, same as shift. It just adds a set of new functions currently not available in data.table.
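For illustration, a rolling function combines with := and by just like shift does; a minimal sketch (using the froll* naming eventually adopted, purely illustrative here):

```r
library(data.table)
DT = data.table(grp = rep(c("a", "b"), each = 5), value = 1:10)
# rolling mean within each group, appended by reference --
# same number of rows as the input, exactly like shift()
DT[, roll3 := frollmean(value, 3), by = grp]
```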

@st-pasha commented Apr 22, 2018

@jangorecki Thanks, these are all valid considerations. Of course different people have different experiences, and different views as to what should be considered "natural".

It is possible to perform rollmean by group: this is just a 2-level grouping: DT[, mean(V1), by=.(V2, roll(3))]. However, I don't see how to make different window sizes on different columns with my syntax...

I must admit I never seen SQL syntax for rolling joins before. It's interesting that they use standard aggregator such as AVG yet apply the windowing specification to it. Looking at the Transact-SQL documentation there are some interesting ideas there, for example the distinction between logical/physical row selection. They do allow different "OVER" operators on different columns, however in all examples they give, it is the same OVER clause repeated multiple times. So it suggests that this use-case is much more common, and hence using a single roll() group would result in less repetition.

Also, this SO question provides an interesting insight why the OVER syntax was introduced in SQL at all:

You can use GROUP BY SalesOrderID. The difference is, with GROUP BY you can only have the aggregated values for the columns that are not included in GROUP BY. In contrast, using windowed aggregate functions instead of GROUP BY, you can retrieve both aggregated and non-aggregated values. That is, although you are not doing that in your example query, you could retrieve both individual OrderQty values and their sums, counts, averages etc. over groups of same SalesOrderIDs.

So it appears that the syntax is designed to circumvent the limitation of standard SQL where group-by results could not be combined with unaggregated values (i.e. selecting both A and mean(A) in the same expression). However data.table does not have such a limitation, so it has more freedom in its choice of syntax.


Now, if we want to really get ahead of the curve, we need to think in a broader perspective: what are the "rolling" functions, what are they used for, how can they be extended, etc. Here's my take on this, coming from a statistician's point of view:

"Rolling mean" function is used to smooth some noisy input. Say, if you have observations over time and you want to have some notion of "average quantity", which would nevertheless vary over time although very slowly. In this case "rolling mean over last 100 observations" or "rolling mean over all previous observations" can be considered. Similarly, if you observe certain quantity over a range of inputs, you may smooth it out by applying "rolling mean over ±50 observations".

  • So, the first extension is to look at "smooth windows": imagine a mean over past observations where the further an observation in the past, the less its contribution is. Or an average of nearby observations over a Gaussian kernel.
  • Second are adaptive windows. For example, if you have a noisy input defined over an interval [0, 1], then smoothing it using a ±N window produces a biased result near the edges. An unbiased estimator would adapt the shape of the window based on the distance from the edge.
  • Resample smoothing: the restriction to produce exactly as many observations as in the source data is too limiting. If you think of your data as noisy observations of some underlying function, then it is perfectly reasonable to ask to compute the smoothed values of that function on a mesh that is coarser / finer than the original input. Or perhaps the original data is spaced irregularly and you want to resample it onto a regular grid.
  • Jackknife: for each observation you want to compute average/regression over all observations except the current.
  • K-fold split: view data as multiple groups where each group excludes only a small portion of the observations.

All of these can be implemented as extended grouping operators, with rolling windows being just one of the elements on this list. That being said, I don't see why we can't have it both ways.

@jangorecki commented Apr 23, 2018

I must admit I never seen SQL syntax for rolling joins before.

I assume you mean rolling functions; this issue has nothing to do with rolling joins.

They do allow different "OVER" operators on different columns, however in all examples they give, it is the same OVER clause repeated multiple times. So it suggests that this use-case is much more common, and hence using a single roll() group would result in less repetition.

It is just a matter of use case: if you are calling the same OVER() many times, you may find it more performant to use GROUP BY, build a lookup table, and re-use it in other queries. Whatever the examples show, repeating OVER() is required to retain the locality of each measure provided. My use cases from Data Warehouses were not as simple as those in the Microsoft docs.

In contrast, using windowed aggregate functions instead of GROUP BY, you can retrieve both aggregated and non-aggregated values.

In data.table we do := and by in one query to achieve it.

So it appears that the syntax is designed to circumvent the limitation of standard SQL where group-by results could not be combined with unaggregated values (i.e. selecting both A and mean(A) in the same expression). However data.table does not have such a limitation, so it has more freedom in its choice of syntax.

It isn't so much a limitation of SQL as the design of GROUP BY: it aggregates, the same way that our by aggregates. A new API was required to cover the new window functionality. Grouping for an SQL window function can be provided per function call using FUN() OVER (PARTITION BY ...), where the partition by acts as local grouping for a single measure. So to achieve the flexibility of SQL we would need to use j = mean(V1, roll=5) or j = over(mean(V1), roll=5), keeping that API in j. Still, this approach would not support all the use cases mentioned above.

you may smooth it out by applying "rolling mean over ±50 observations".

This is what align argument is used for.

So, the first extension is to look at "smooth windows": imagine a mean over past observations where the further an observation in the past, the less its contribution is. Or an average of nearby observations over a Gaussian kernel.

There are many variants (a virtually unlimited number) of moving averages; the most common smoothing window function other than rollmean/SMA is the exponential moving average (EMA). Which should be included and which not is not trivial to decide, and it is actually best to make that decision according to the feature requests that come from users; so far none like this has been requested.
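For reference, a minimal EMA sketch in plain R (not part of data.table; alpha is the usual smoothing factor):

```r
# exponential moving average: each result mixes the new value with the
# previous result, so older observations contribute geometrically less
ema = function(x, alpha) {
  r = numeric(length(x))
  r[1] = x[1]
  for (m in seq_along(x)[-1]) r[m] = alpha * x[m] + (1 - alpha) * r[m - 1]
  r
}
ema(c(1, 2, 3, 4), alpha = 0.5)  # 1.000 1.500 2.250 3.125
```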

All of these can be implemented as extended grouping operators, with rolling windows being just one of the elements on this list.

Surely they can, but if you look at SO and the issues created in our repo, you will see that the few rolling functions listed here are responsible for 95+% of user requests. I am happy to work on EMA and other MAs (although I am not sure data.table is the best place for those), but as a separate issue. Some users, me included, have been waiting for just a simple moving average in data.table for 4 years already.

Here's my take on this, coming from a statistician's point of view

My point of view comes from Data Warehousing (where I used window functions at least once a week) and price trend analysis (where I used tens of different moving averages).

jangorecki added a commit that referenced this issue Apr 24, 2018
@jangorecki commented Apr 24, 2018

rollmean draft is pushed to the roll branch. I found that most other packages implementing rolling mean are not able to deal well with na.rm=FALSE when NAs are present in the input. This implementation handles NA consistently with mean, which imposes some extra overhead because of ISNAN calls. We could allow the API to expose a faster but less safe version for users who are sure there are no NAs in the input.
PR is in #2795
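A small illustration of the NA behaviour described above, using the frollmean interface from that branch (na.rm defaults to FALSE, consistent with mean):

```r
library(data.table)
x = c(1, 2, NA, 4, 5)
frollmean(x, 2)                # NA 1.5 NA NA 4.5 -- NA poisons its windows
frollmean(x, 2, na.rm = TRUE)  # NA 1.5 2.0 4.0 4.5 -- NA skipped within window
```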

@jangorecki commented Apr 27, 2018

@mattdowle answering questions from PR

Why are we doing this inside data.table? Why are we integrating it instead of contributing to existing packages and using them from data.table?

  1. There were 3 different issues created asking for this functionality in data.table, as well as multiple SO questions tagged data.table. Users expect it to be in scope for data.table.
  2. data.table fits time-series data perfectly, and rolling aggregates are a pretty useful statistic there.

my guess is it comes down to syntax (features only possible or convenient if built into data.table; e.g. inside [...] and optimized) and building data.table internals into the rolling function at C level; e.g. froll* should be aware and use data.table indices and key. If so, more specifics on that are needed; e.g. a simple short example.

For me personally it is about speed and the lack of a chain of dependencies, which is nowadays not easy to achieve.
Key/indices could be useful for frollmin/frollmax, but it is unlikely that a user will create an index on a measure variable; we also haven't made this optimization for min/max yet. I don't see much sense in GForce optimization because the allocated memory is not released after the froll* call but returned as the answer (as opposed to non-rolling mean, sum, etc.).

If there is no convincing argument for integrating, then we should contribute to the other packages instead.

I listed some above; if you are not convinced, I recommend you put the question to data.table users, ask on Twitter, etc. to check the response. This feature has been requested for a long time and by many users. If the response doesn't convince you, then you can close this issue.

jangorecki added a commit that referenced this issue May 19, 2018
jangorecki added a commit that referenced this issue May 29, 2018
@harryprince commented Aug 14, 2018

I found that sparklyr can support rolling functions very well on very large-scale datasets.

@jangorecki commented Aug 15, 2018

@harryprince could you shed a little more light on this by providing example code of how you do it in sparklyr?
According to "Window functions" dplyr vignette

Rolling aggregates operate in a fixed width window. You won’t find them in base R or in dplyr, but there are many implementations in other packages, such as RcppRoll.

AFAIU you use a custom Spark API via sparklyr, for which the dplyr interface is not implemented, correct?

This issue is about rolling aggregates; other "types" of window functions have been in data.table for a long time.

@MichaelChirico commented Aug 15, 2018

Providing some example so we can compare (in-memory) performance vs sparklyr/SparkR would also be helpful.

@st-pasha commented Aug 15, 2018

It just occurred to me that this question:

how to calculate different window sizes for different columns?

has in fact a broader scope, and does not apply to rolling functions only.

For example, it seems perfectly reasonable to ask how to select the average product price by date, then by week, and then maybe by week+category -- all within the same query. If we were ever to implement such functionality, the natural syntax for it could be

DT[, .( mean(price, by=date), 
        mean(price, by=week), 
        mean(price, by=c(week, category)) )]

Now, if this functionality were already implemented, it would be a simple leap from there to rolling means:

DT[, .( mean(price, roll=5), 
        mean(price, roll=20), 
        mean(price, roll=100) )]

Not saying that this is unequivocally better than rollmean(price, 5) -- just throwing in some alternatives to consider...

@jangorecki commented Aug 15, 2018

@st-pasha

how to select the average product price by date, and then by week, and then maybe by week+category -- all within the same query.

AFAIU this is already possible using ?groupingsets, but not hooked into [.data.table yet.

jangorecki added a commit that referenced this issue Nov 11, 2018
@jangorecki jangorecki mentioned this issue Dec 15, 2018
@jangorecki jangorecki removed this from the 1.12.0 milestone Jan 5, 2019
@msummersgill commented Jan 29, 2019

@jangorecki , @st-pasha , and Co. -- Thanks for all your work on this! I'm curious why partial window support was removed from the scope; is there any potential for that functionality to make it back onto the roadmap? It would come in handy for me sometimes, and would fill a functionality gap that to my knowledge hasn't been filled in either zoo or RcppRoll.

This Stack Overflow Question is a good example of a rolling application that could benefit from a partial = TRUE argument.

@jangorecki commented Jan 30, 2019

@msummersgill Thanks for the feedback. In the first post I explicitly linked the commit sha where the partial window feature code can be found. The implementation that was there was later removed to reduce the complexity of the code; it also imposed a small performance cost even when the feature was not used. The feature can (and probably should) be implemented the other way around: compute the complete windows first, and then fill up the missing partial windows using an extra loop over 1:window_size, so its overhead is only noticeable when you actually use it. Nevertheless we do provide that functionality via the adaptive argument, where partial is just a special case of an adaptive rolling mean. There is an example of how to achieve partial using adaptive in the ?froll manual. Pasting it here:

d = as.data.table(list(1:6/2, 3:8/4))
an = function(n, len) c(seq.int(n), rep(n, len-n))
n = an(3, nrow(d))
frollmean(d, n, adaptive=TRUE)

Of course it will not be as efficient as a non-adaptive rolling function that uses an extra loop to fill up just the partial windows.
AFAIK zoo has the partial feature.
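The "extra loop" approach described above can be sketched as follows (partial_rollmean is a hypothetical helper, not data.table API):

```r
# compute complete windows first, then patch the k-1 leading partial
# windows with cumulative means over the available observations
partial_rollmean = function(x, k) {
  r = data.table::frollmean(x, k)
  for (m in seq_len(min(k - 1, length(x))))
    r[m] = mean(x[seq_len(m)])
  r
}
partial_rollmean(c(1, 2, 3, 4), 3)  # 1.0 1.5 2.0 3.0
```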

jangorecki added a commit that referenced this issue May 24, 2019
mattdowle added a commit that referenced this issue May 25, 2019
@jangorecki jangorecki mentioned this issue May 26, 2019
@waynelapierre commented May 29, 2019

Do you guys have any plans to add rolling regression functions to data.table?

@jangorecki commented May 29, 2019

@waynelapierre if there is demand for that, then yes. You have my +1

@randomgambit commented Jul 1, 2019

thanks, this is great. Just one question though: I only see simple rolling aggregates, like a rolling mean or rolling median. Are you also implementing more refined rolling functions, such as rolling DT data frames? Say, create a rolling DT using the last 10 obs and run an lm regression on it.

Thanks!

@jangorecki commented Jul 1, 2019

@randomgambit I would say it is out of scope, unless there is high demand for it. It wouldn't be very difficult to make it faster than base R/zoo just by handling the nested loop in C. But we should try to implement it using an "online" algorithm, to avoid the nested loop. This is more tricky, and since we could eventually do that for any statistic, we have to cut off the list of supported statistics at some point.
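To make the use case concrete, here is a nested-loop baseline in plain R (no data.table API implied); an online version would instead update the regression's sufficient statistics (sums of x, y, x^2, x*y) incrementally as the window slides:

```r
set.seed(1)
d = data.frame(x = rnorm(20), y = rnorm(20))
k = 10
# slope of lm(y ~ x) over each trailing window of k rows;
# incomplete leading windows are NA, matching rolling-function conventions
slopes = c(rep(NA_real_, k - 1),
           sapply(k:nrow(d), function(m) {
             w = (m - k + 1):m
             coef(lm(y ~ x, data = d[w, ]))[["x"]]
           }))
```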

@randomgambit commented Jul 1, 2019

@jangorecki interesting thanks. That means I will keep using tsibble to embed... DATA.TABLES in a tibble! mind blown :D

@MichaelChirico commented Aug 7, 2019

Tried to use frollmean to calculate a nonparametric "logistic curve" which shows P[y | x] for binary y using the nearest neighbors of x. I was surprised that y stored as logical was not cast automatically to integer:

DT = data.table(x = rnorm(1000), y = runif(1000) > .5)
DT[order(x), .(x, p_y = frollmean(y, 50L))]

Error in froll(fun = "mean", x = x, n = n, fill = fill, algo = algo, align = align, :
x must be of type numeric
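A workaround (assuming an explicit numeric coercion of the logical column is acceptable) is:

```r
DT[order(x), .(x, p_y = frollmean(as.numeric(y), 50L))]
```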

@jangorecki commented Aug 22, 2019

An example of how vectorized x/n arguments can impact performance:
AdrianAntico/RemixAutoML@d837071#r34784427
fewer loops, code that is easier to read, and a much faster run (10x-36x speedup).
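For context, froll* accepts multiple columns and multiple window sizes in a single call, returning one vector per column/window pair; a small sketch:

```r
library(data.table)
DT = data.table(v1 = 1:6, v2 = 6:1)
# 2 columns x 2 window sizes -> a list of 4 vectors, no explicit R-level loop
res = frollmean(DT, n = c(2, 3))
length(res)  # 4
```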

@jangorecki commented Sep 1, 2019

frollapply ready: #3600

    ### fun             mean     sum  median
    # rollfun          8.815   5.151  60.175
    # zoo::rollapply  34.373  27.837  88.552
    # zoo::roll[fun]   0.215   0.185      NA
    # frollapply       5.404   1.419  56.475
    # froll[fun]       0.003   0.002      NA
@MichaelChirico MichaelChirico mentioned this issue Oct 15, 2019
@jerryfuyu0104 commented Oct 28, 2019

hi guys, will FUN (user-defined) passed to frollapply be changed to allow returning an R object or a data.frame (data.table)? Could the x passed to frollapply be a data.table of character columns, not coerced to numeric, so that FUN could operate on labels and frollapply return a list? Then we could do rolling regression or rolling testing, such as Benford's testing or summaries on labels.

@jangorecki commented Oct 28, 2019

It is always useful to provide a reproducible example. To clarify... in such a scenario you would like frollapply(dt, 3, FUN) to return a list of length nrow(dt) where each list element is the data.table returned by FUN(dt[window])?
frollapply(x=dt, n=3, FUN=FUN)[[3]] equals FUN(dt[1:3])
frollapply(x=dt, n=3, FUN=FUN)[[4]] equals FUN(dt[2:4])
is that correct? @jerryfuyu0104

Currently we support multiple columns passed to the first argument, but we process them separately, looping. We would probably need some extra argument, say multi.var=FALSE, which when set to TRUE would not loop over x (as it does now: list(FUN(x[[1]]), FUN(x[[2]]))) but would pass all columns at once: FUN(x).
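Until something like that exists, a rolling FUN over whole multi-column windows can be emulated in plain R (multi.var above is only a hypothetical argument):

```r
dt = data.frame(a = 1:5, b = letters[1:5])
n = 3
# FUN(dt[window]) for each complete trailing window;
# leading entries with incomplete windows stay NULL
res = lapply(seq_len(nrow(dt)), function(m)
  if (m >= n) dt[(m - n + 1):m, ] else NULL)
res[[3]]  # first complete window: rows 1:3
```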
