rolling functions, rolling aggregates, sliding window, moving average #2778
Comments
|
Proposed
x = data.table(v1=1:5, v2=1:5)
k = c(2, 3)
|
|
yes, and many more rolled functions follow the same basic idea (including
rolling standard deviation/any expectation-based moment, and any function
like rollproduct that uses invertible * instead of + to aggregate within
the window)
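The "same basic idea" can be sketched in base R as follows. This is only an illustration of the update-the-previous-window approach, not the code proposed in this issue; `roll_sum_online` and `roll_mean_online` are hypothetical names.

```r
# Rolling sum that updates the previous window instead of re-summing it.
# Hypothetical helper names, used only for this sketch.
roll_sum_online <- function(x, k) {
  n <- length(x)
  out <- rep(NA_real_, n)
  if (k > n) return(out)
  w <- sum(x[1:k])              # first complete window
  out[k] <- w
  if (n > k) for (i in (k + 1):n) {
    w <- w + x[i] - x[i - k]    # add the entering value, drop the leaving one
    out[i] <- w
  }
  out
}

# A rolling mean is the rolling sum divided by the window width.
roll_mean_online <- function(x, k) roll_sum_online(x, k) / k

roll_mean_online(c(1, 2, 3, 4, 5), 3)
```

A rolling product would follow the same pattern with `*` and `/` in place of `+` and `-`, with the caveat that a zero in the window makes the division step invalid.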
|
|
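I always envisioned rolling window functionality as grouping the dataset into multiple overlapping groups (windows). Then the API would look something like this:
DT[i, j,
   by = roll(width=5, align="center")]
Then if j contains, say, mean(A), we can internally replace it with rollmean(A), exactly like we are doing with gmean() right now. Or j can contain arbitrarily complicated functionality (say, run a regression for each window), in which case we'd supply the .SD data.table to it, exactly like we do with groups right now.
This way there's no need to introduce 10+ new functions, just one. And it feels data.table-y in spirit too. |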
|
yes, agree
…On Sat, Apr 21, 2018, 3:38 PM Pasha Stetsenko ***@***.***> wrote:
I always envisioned rolling window functionality as *grouping* the
dataset into multiple overlapping groups (windows). Then the API would look
something like this:
DT[i, j,
by = roll(width=5, align="center")]
Then if j contains, say, mean(A), we can internally replace it with
rollmean(A) -- exactly like we are doing with gmean() right now. Or j can
contain an arbitrarily complicated functionality (say, run a regression for
each window), in which case we'd supply .SD data.table to it -- exactly
like we do with groups right now.
This way there's no need to introduce 10+ new functions, just one. And it
feels data.table-y in spirit too.
|
|
@st-pasha interesting idea, looks like data.table-y spirit, but it will impose many limitations, and isn't really appropriate for this category of functions.
DT[, rollmean(V1, 3), by=V2]
DT[, .(rollmean(V1, 3), rollmean(V2, 100))]
rollmean(rnorm(10), 3)
DT[, .(rollmean(list(V1, V2), c(5, 20)), rollmean(list(V2, V3), c(10, 30)))]
DT[, .(rollmean(V1, 3), mean(V1)), by=V2]
Usually when using
SELECT AVG(value) OVER (ROWS BETWEEN 99 PRECEDING AND CURRENT ROW)
FROM tablename;
You can still combine it with GROUP BY as follows:
SELECT AVG(value) OVER (ROWS BETWEEN 99 PRECEDING AND CURRENT ROW)
FROM tablename
GROUP BY group_columns;
So in SQL those functions stay in
DT[, rollmean(value, 100)]
DT[, rollmean(value, 100), group_columns]
Rolling functions fit into the same category of functions as
SELECT LAG(value, 1) OVER ()
FROM tablename;
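As a rough illustration of the point about `j` and `by`, here is a minimal sketch. It uses `frollmean`, the name referred to later in this thread, rather than the `rollmean` spelling proposed in this comment, and it assumes a data.table version that ships `frollmean`. The table and column names are made up.

```r
library(data.table)

# Made-up data: two groups of four observations each.
DT <- data.table(
  grp   = rep(c("a", "b"), each = 4),
  value = c(1, 2, 3, 4, 10, 20, 30, 40)
)

# Rolling mean in j, restarted within each group.
# Unlike mean(), it returns one value per input row.
DT[, roll2 := frollmean(value, 2), by = grp]

# Ordinary aggregate and rolling aggregate in the same j call:
# the length-1 mean(value) is recycled to the length of the rolling result.
res <- DT[, .(roll2 = frollmean(value, 2), avg = mean(value)), by = grp]
res
```

The result keeps all eight rows, which is the difference from a plain `by` aggregation that the comment describes.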
|
|
@jangorecki Thanks, these are all valid considerations. Of course different people have different experiences, and different views as to what should be considered "natural".
It is possible to perform rollmean by group: this is just a 2-level grouping:
I must admit I have never seen SQL syntax for rolling joins before. It's interesting that they use standard aggregator such as
Also, this SO question provides an interesting insight into why the OVER syntax was introduced in SQL at all:
So it appears that the syntax is designed to circumvent the limitation of standard SQL where group-by results could not be combined with unaggregated values (i.e. selecting both
Now, if we want to really get ahead of the curve, we need to think in a broader perspective: what are the "rolling" functions, what are they used for, how they can be extended, etc. Here's my take on this, coming from a statistician's point of view:
The "rolling mean" function is used to smooth some noisy input. Say, if you have observations over time and you want to have some notion of "average quantity", which would nevertheless vary over time although very slowly. In this case "rolling mean over last 100 observations" or "rolling mean over all previous observations" can be considered. Similarly, if you observe a certain quantity over a range of inputs, you may smooth it out by applying "rolling mean over ±50 observations".
All of these can be implemented as extended grouping operators, with rolling windows being just one of the elements on this list. That being said, I don't see why we can't have it both ways. |
I assume you mean rolling functions, issue has nothing to do with rolling joins.
It is just a matter of use case; if you are calling the same OVER() many times you may find it more performant to use
In data.table we do
It isn't so much a limitation of SQL as just the design of GROUP BY, that it will aggregate, the same way that our
This is what
There are many variants (a virtually unlimited number) of moving averages; the most common smoothing window function (other than rollmean/SMA) is the exponential moving average (EMA). Which should be included, and which not, is not trivial to decide, and it is actually best to make that decision according to feature requests that come from users; so far nothing like this has been requested.
Surely they can, but if you will look at SO, and issues created in our repo, you will see that those few rolling functions here are responsible for 95+% of requests from users. I am happy to work on EMA and other MAs (although I am not sure if data.table is best place for those), but as a separate issue. Some users, me included, are waiting for just simple moving average in data.table for 4 years already.
My point of view comes from Data Warehousing (where I used window functions at least once a week) and price trend analysis (where I used tens of different moving averages). |
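For readers unfamiliar with the EMA mentioned above, here is a minimal base R sketch of one common definition, seeded with the first observation. The helper name `ema` and the parameter `alpha` are assumptions for illustration only; nothing like this is proposed for data.table in this issue.

```r
# Exponential moving average with smoothing factor alpha in (0, 1].
# Hypothetical helper: each output blends the new value with the previous output.
ema <- function(x, alpha) {
  if (!length(x)) return(numeric(0))
  out <- numeric(length(x))
  out[1] <- x[1]                 # seed with the first observation
  for (i in seq_along(x)[-1]) {
    out[i] <- alpha * x[i] + (1 - alpha) * out[i - 1]
  }
  out
}

ema(c(1, 2, 3), alpha = 0.5)
```

Unlike a simple moving average, there is no fixed window width here: every past observation contributes, with geometrically decaying weight.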
|
|
|
@mattdowle answering questions from PR
For me personally it is about speed and lack of chain of dependencies, nowadays not easy to achieve.
I listed some above; if you are not convinced, I recommend you put the question to data.table users, ask on twitter, etc. to check the response. This feature has been requested for a long time and by many users. If the response doesn't convince you then you can close this issue. |
|
I found |
|
@harryprince could you shed a little more light by providing example code of how you do it in sparklyr?
AFAIU you use custom spark API via sparklyr for which dplyr interface is not implemented, correct? This issue is about rolling aggregates, other "types" of window functions are already in |
|
Providing some example so we can compare (in-memory) performance vs |
|
It just occurred to me that this question:
has in fact a broader scope, and does not apply to rolling functions only. For example, it seems to be perfectly reasonable to ask how to select the average product price by date, and then by week, and then maybe by week+category, all within the same query. If we were ever to implement such functionality, the natural syntax for it could be
DT[, .( mean(price, by=date),
        mean(price, by=week),
        mean(price, by=c(week, category)) )]
Now, if this functionality were already implemented, then it would be a simple leap from there to rolling means:
DT[, .( mean(price, roll=5),
        mean(price, roll=20),
        mean(price, roll=100) )]
Not saying that this is unequivocally better than |
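For comparison with the hypothetical `mean(price, roll=...)` syntax above, several window widths can be computed in one call with `frollmean`, which accepts a vector of widths and returns one result per width. This is a sketch assuming a data.table version that ships `frollmean`; the data and the column names `ma2` and `ma3` are made up.

```r
library(data.table)

# Made-up price series.
DT <- data.table(price = c(10, 12, 11, 13, 15, 14))

# One call, two window widths: frollmean returns a list with one
# vector per width, which := assigns to the two new columns.
DT[, c("ma2", "ma3") := frollmean(price, c(2, 3))]
DT
```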
AFAIU this is already possible using |
|
thanks this is great. Just one question though. I only see simple rolling aggregates, like a rolling mean or rolling median. Are you also implementing more refined rolling functions such as rolling DT dataframes? Say, create a rolling DT using the last 10 obs and run a Thanks! |
|
@randomgambit I would say it is out of scope, unless there is high demand for that. It wouldn't be very difficult to make it faster than base R/zoo just by handling the nested loop in C. But we should try to implement it using an "online" algorithm, to avoid the nested loop. This is more tricky, and we could eventually do it for any statistic, so we have to cut off those statistics at some point. |
|
@jangorecki interesting thanks. That means I will keep using |
|
Tried to use
|
|
An example of how vectorized |
|
frollapply ready: #3600
|
|
Hi guys, will the user-defined FUN passed to frollapply be changed so that it can return an R object or a data.frame (data.table)? The x passed to frollapply could then be a data.table of character columns, not coerced to numeric, so FUN could operate on labels and frollapply would return a list. Then we could do rolling regression or rolling tests, such as Benford's test or a summary on labels. |
|
It is always useful to provide a reproducible example. To clarify... in such a scenario you would like to
Currently we support multiple columns passed to the first argument, but we process them separately, looping. We would probably need some extra argument |
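Until something like that exists, a rolling regression over several columns can be written by looping over window positions. The sketch below is base R only and does not use frollapply. It refits `lm()` for every window, so it is the nested-loop approach, not an online algorithm. `roll_slope` is a hypothetical helper name.

```r
# Rolling regression slope of y on x over the last k observations.
# Hypothetical helper: each window is refit from scratch with lm().
roll_slope <- function(x, y, k) {
  n <- length(x)
  out <- rep(NA_real_, n)
  if (k > n) return(out)
  for (end in k:n) {
    idx <- (end - k + 1):end                 # rows in the current window
    out[end] <- unname(coef(lm(y[idx] ~ x[idx]))[2])
  }
  out
}

x <- 1:8
y <- 2 * x + 1          # exact linear relation, so every slope should be about 2
roll_slope(x, y, 4)
```

Because both columns are indexed by the same window positions, the function sees them together, which is what processing columns separately cannot provide.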
|
any update for this? |
|
I second that previous request. Furthermore, would it be possible to support a "partial" argument to allow for partial windows? |
|
@eliocamp could you elaborate on what a |
|
It would mean computing the function from the beginning through the end instead of from the half-window point. |
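For a right-aligned window, partial windows can be emulated with the adaptive variant of `frollmean`, where `n` gives a separate window width for each element. This is a sketch under that assumption; it does not cover the centered case described above, and `n_partial` is just an illustrative variable name.

```r
library(data.table)

x <- c(1, 2, 3, 4, 5)
k <- 3L

# Complete windows only: the first k - 1 results are NA.
full <- frollmean(x, k)

# Partial windows: the width grows 1, 2, ..., k and then stays at k.
# With adaptive = TRUE, n holds one window width per element of x.
n_partial <- pmin(seq_along(x), k)
partial <- frollmean(x, n_partial, adaptive = TRUE)

full
partial
```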
|
@jangorecki oh, thanks, I didn't know that! I'll check it out. |
|
Agree, we need a partial argument, not just for convenience but also for speed. |
|
I'd love to help but my C++ skills are utterly non-existent. |
|
We don't code in C++ but in C. Yes, it is a good place to start. I did exactly that on frollmean. |
|
I looked at the code and it seems daunting. But I'll update you in any case. But now, for yet another request: frollmean(.SD) should preserve names. More generally, froll* should preserve names if the input is list-like with names. |
|
As a frequent user of data.table, I find it extremely useful to have "time aware" features, as those currently offered in the package |
|
@ywhcuhk Thanks for the feedback, I was actually thinking this issue was already asking for too much. Most of that is well covered by the still lightweight package roll, which is very fast. As for the other features, I suggest creating a new issue for each feature you are interested in, so that whether we want to implement/maintain it can be decided for each separately. Just from looking at the readme of tsibble I don't see anything new it offers... |
|
Thank you @jangorecki for the response. Maybe it's a context dependent issue. The data structure I deal with most frequently is known as "panel data", with an ID and time. If the program is "aware" of this data feature, a lot of operations, especially time-series operations, will be made very easy. For someone who knows STATA, it's the operations based on
Of course, these operations can be done in data.table functions like |
To gather requirements in a single place and refresh ~4-year-old discussions, I am creating this issue to cover the rolling functions feature (also known as rolling aggregates, sliding window or moving average/moving aggregates).
rolling functions
features
#3422, #3423)