Rdatatable / data.table
rolling functions, rolling aggregates, sliding window, moving average #2778
Comments
Proposed:

x = data.table(v1=1:5, v2=1:5)
k = c(2, 3)
|
yes, and many more rolled functions follow the same basic idea (including rolling standard deviation/any expectation-based moment, and any function like rollproduct that uses invertible * instead of + to aggregate within the window).
|
I always envisioned rolling window functionality as grouping the dataset into multiple overlapping groups (windows). Then the API would look something like this:
Then if j contains, say, mean(A), we can internally replace it with rollmean(A). This way there's no need to introduce 10+ new functions, just one. And it feels data.table-y in spirit too. |
yes, agree
…On Sat, Apr 21, 2018, 3:38 PM Pasha Stetsenko ***@***.***> wrote:
I always envisioned rolling window functionality as *grouping* the
dataset into multiple overlapping groups (windows). Then the API would look
something like this:
DT[i, j,
by = roll(width=5, align="center")]
Then if j contains, say, mean(A), we can internally replace it with
rollmean(A) -- exactly like we are doing with gmean() right now. Or j can
contain an arbitrarily complicated functionality (say, run a regression for
each window), in which case we'd supply .SD data.table to it -- exactly
like we do with groups right now.
This way there's no need to introduce 10+ new functions, just one. And it
feels data.table-y in spirit too.
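For comparison, a rough sketch of what this hypothetical by=roll() syntax would correspond to with the rolling-mean primitive data.table later gained (frollmean); the roll_A column name is illustrative, and the by=roll(...) call shown in the comment is only the proposal, not an existing API:

```r
library(data.table)

DT <- data.table(A = as.numeric(1:10))
# hypothetical API: DT[, mean(A), by = roll(width = 5, align = "center")]
# roughly what it would desugar to today:
DT[, roll_A := frollmean(A, 5, align = "center")]
```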
|
@st-pasha interesting idea, looks like data.table-y spirit, but it will impose many limitations, and isn't really appropriate for this category of functions.
DT[, rollmean(V1, 3), by=V2]
DT[, .(rollmean(V1, 3), rollmean(V2, 100))]
rollmean(rnorm(10), 3)
DT[, .(rollmean(list(V1, V2), c(5, 20)), rollmean(list(V2, V3), c(10, 30)))]
DT[, .(rollmean(V1, 3), mean(V1)), by=V2]

Usually in SQL you would write:

SELECT AVG(value) OVER (ROWS BETWEEN 99 PRECEDING AND CURRENT ROW)
FROM tablename;

You can still combine it with GROUP BY as follows:

SELECT AVG(value) OVER (ROWS BETWEEN 99 PRECEDING AND CURRENT ROW)
FROM tablename
GROUP BY group_columns;

So in SQL those functions stay in the select clause, which in data.table maps to:

DT[, rollmean(value, 100)]
DT[, rollmean(value, 100), group_columns]

Rolling functions fit into the same category of functions as:

SELECT LAG(value, 1) OVER ()
FROM tablename;
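A minimal runnable sketch of that mapping, using frollmean (which later landed in data.table); column names here are illustrative:

```r
library(data.table)

DT <- data.table(value = as.numeric(1:10), grp = rep(1:2, each = 5))

# SQL: SELECT AVG(value) OVER (ROWS BETWEEN 2 PRECEDING AND CURRENT ROW)
DT[, ma3 := frollmean(value, 3)]

# same window, restarted within each group (the GROUP BY analogue)
DT[, ma3_grp := frollmean(value, 3), by = grp]
```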
|
@jangorecki Thanks, these are all valid considerations. Of course different people have different experiences, and different views as to what should be considered "natural". It is possible to perform rollmean by group: this is just a 2-level grouping.

I must admit I have never seen SQL syntax for rolling joins before. It's interesting that they reuse a standard aggregator with the OVER clause. Also, this SO question provides an interesting insight into why the OVER syntax was introduced in SQL at all: it appears the syntax is designed to circumvent a limitation of standard SQL, where group-by results could not be combined with unaggregated values (i.e. selecting both in the same query).

Now, if we want to really get ahead of the curve, we need to think in a broader perspective: what are the "rolling" functions, what are they used for, how can they be extended? Here's my take on this, coming from a statistician's point of view: a "rolling mean" is used to smooth a noisy input. Say, if you have observations over time and you want some notion of "average quantity" which nevertheless varies over time, although very slowly, then "rolling mean over the last 100 observations" or "rolling mean over all previous observations" can be considered. Similarly, if you observe a certain quantity over a range of inputs, you may smooth it out by applying a "rolling mean over ±50 observations".

All of these can be implemented as extended grouping operators, with rolling windows being just one of the elements on this list. That being said, I don't see why we can't have it both ways. |
I assume you mean rolling functions; this issue has nothing to do with rolling joins.
It is just a matter of use case: if you are calling the same OVER() many times you may find it more performant to use
In data.table we do
It isn't so much a limitation of SQL as the design of GROUP BY, that it will aggregate, the same way that our
This is what
There are many variants (a virtually unlimited number) of moving averages; the most common smoothing window function (other than rollmean/SMA) is the exponential moving average (EMA). Which should be included, and which not, is not trivial to decide, and it is actually best to make that decision according to the feature requests that come from users; so far none like this has been requested.
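For reference, an EMA is a one-pass recurrence rather than a fixed-width window, which is why it doesn't fit the same kernel as rollmean. A minimal base R sketch (the name ema and the alpha parameter are illustrative, not a data.table API):

```r
# EMA recurrence: out[i] = alpha * x[i] + (1 - alpha) * out[i - 1]
ema <- function(x, alpha) {
  out <- numeric(length(x))
  out[1] <- x[1]
  for (i in seq_along(x)[-1]) {
    out[i] <- alpha * x[i] + (1 - alpha) * out[i - 1]
  }
  out
}

ema(c(1, 2, 3, 4), alpha = 0.5)  # 1.000 1.500 2.250 3.125
```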
Surely they can, but if you look at SO, and at the issues created in our repo, you will see that the few rolling functions here are responsible for 95+% of requests from users. I am happy to work on EMA and other MAs (although I am not sure if data.table is the best place for those), but as a separate issue. Some users, me included, have been waiting for just a simple moving average in data.table for 4 years already.
My point of view comes from Data Warehousing (where I used window functions at least once a week) and price trend analysis (where I used tens of different moving averages). |
|
@mattdowle answering questions from PR
For me personally it is about speed and a lack of a chain of dependencies, which nowadays is not easy to achieve.
I listed some above; if you are not convinced, I recommend you put the question to data.table users, ask on Twitter, etc. to check the response. This feature was requested long ago and by many users. If the response doesn't convince you, then you can close this issue. |
I found |
@harryprince could you shed a little more light by providing example code of how you do it in sparklyr?
AFAIU you use a custom Spark API via sparklyr for which a dplyr interface is not implemented, correct? This issue is about rolling aggregates; other "types" of window functions are already in |
Providing some example so we can compare (in-memory) performance vs |
It just occurred to me that this question:
has in fact a broader scope, and does not apply to rolling functions only. For example, it seems perfectly reasonable to ask how to select the average product price by date, and then by week, and then maybe by week+category, all within the same query. If we were ever to implement such functionality, the natural syntax for it could be:

DT[, .( mean(price, by=date),
        mean(price, by=week),
        mean(price, by=c(week, category)) )]

Now, if this functionality were already implemented, it would be a simple leap from there to rolling means:

DT[, .( mean(price, roll=5),
        mean(price, roll=20),
        mean(price, roll=100) )]

Not saying that this is unequivocally better than |
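The multi-window half of this became possible later: froll* functions accept a vector of window sizes, so one call covers several windows. A sketch (column names rm5/rm20/rm100 are illustrative):

```r
library(data.table)

DT <- data.table(price = as.numeric(1:200))
# one call, three windows; frollmean returns a list of columns
DT[, c("rm5", "rm20", "rm100") := frollmean(price, c(5, 20, 100))]
```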
AFAIU this is already possible using |
@randomgambit I would say it is out of scope, unless there is high demand for it. It wouldn't be very difficult to make it faster than base R/zoo just by handling the nested loop in C. But we should try to implement it using an "online" algorithm, to avoid the nested loop. This is more tricky, and since we could eventually do it for any statistic, we have to cut off those statistics at some point. |
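To illustrate the difference, a base R sketch of the "online" idea for the mean: maintain a running sum so each step is O(1), instead of recomputing every window, giving O(n) total rather than O(n*k). The function name is illustrative:

```r
# online rolling mean: add the incoming value, drop the one leaving the window
roll_mean_online <- function(x, k) {
  n <- length(x)
  out <- rep(NA_real_, n)
  s <- 0
  for (i in seq_len(n)) {
    s <- s + x[i]
    if (i > k) s <- s - x[i - k]   # value that just left the window
    if (i >= k) out[i] <- s / k    # window is full from position k onward
  }
  out
}

roll_mean_online(1:5, 3)  # NA NA 2 3 4
```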
@jangorecki interesting thanks. That means I will keep using |
Tried to use
|
An example of how vectorized |
frollapply ready: #3600
|
hi guys, will FUN (user defined) passed to frollapply be changed so it can return an R object or a data.frame (data.table)? Could x passed to frollapply be a data.table of character, not coerced to numeric, so that FUN could work on labels and frollapply return a list? Then we could do rolling regression, or rolling testing such as Benford's test, or a summary on labels. |
It is always useful to provide a reproducible example. To clarify... in such a scenario you would like to
Currently we support multiple columns passed to the first argument, but we process them separately, looping. We would probably need some extra argument |
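One workaround available today for window computations that need several columns (e.g. a rolling regression) is to roll over row indices with frollapply, since FUN must still return a scalar numeric. A sketch; the slope column and window size k are illustrative:

```r
library(data.table)

set.seed(1)
DT <- data.table(y = rnorm(20), x = rnorm(20))
k <- 5

# roll over row numbers; each window's indices go to FUN, which must
# return a single numeric (here: the regression slope for that window)
DT[, slope := frollapply(.I, k, function(idx) {
  coef(lm(y ~ x, data = DT[idx]))[["x"]]
})]
```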
any update for this? |
I second that previous request. Furthermore, would it be possible to support a "partial" argument to allow for partial windows? |
@eliocamp could you elaborate on what a |
It would mean computing the function from the beginning through the end, instead of from the half-window point.
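A partial window can already be emulated with the adaptive rolling feature: give each row its own width, growing it until it reaches the full window size k. A sketch (variable names are illustrative):

```r
library(data.table)

x <- as.numeric(1:5)
k <- 3L
# per-row window widths: 1 2 3 3 3 (grow until full width is reached)
width <- pmin(seq_along(x), k)
frollmean(x, width, adaptive = TRUE)  # 1.0 1.5 2.0 3.0 4.0
```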
@jangorecki oh, thanks, I didn't know that! I'll check it out. |
Agree, we need a partial argument, not just for convenience but also for speed.
I'd love to help but my C++ skills are utterly non-existent. |
We don't code in C++ but in C. Yes, it is a good place to start; I did exactly that with frollmean.
I looked at the code and it seems daunting. But I'll update you in any case. Now, for yet another request: frollmean(.SD) should preserve names. More generally, froll* should preserve names if the input is list-like with names. |
As a frequent user of data.table, I find it extremely useful to have "time aware" features, as those currently offered in the package |
@ywhcuhk Thanks for the feedback; I was actually thinking this issue is already asking for too much. Most of that is well covered by the still-lightweight package roll, which is very fast. As for the other features, I suggest creating a new issue for each feature you are interested in, so the discussion of whether we want to implement/maintain it can be decided separately for each. Just from looking at the readme of tsibble I don't see anything new it offers... |
Thank you @jangorecki for the response. Maybe it's a context-dependent issue. The data structure I deal with most frequently is known as "panel data", with an ID and a time. If the program is "aware" of this data feature, a lot of operations, especially time-series operations, are made very easy. For someone who knows STATA, it's the operations based on
Of course, these operations can be done in data.table functions like |
Hey guys, I'm bringing up a possible feature request. For ML and forecasting, I use the frollmean and shift functions quite a bit to generate useful features. In a scoring environment I typically only need to generate those rolling stat features for a handful of records from the data.table.

I already created some functions for recreating rolling stats on subsets of a data.table, using a bunch of lags and row means from outside the data.table package. However, I began testing whether I could generate them faster using shift and frollmean with a subset in i. When testing it out I realized that I have to include in i all the rows needed to create the lags and rolling means, and I'm not sure if that is the intended way to do so.

I have a few examples below where I try to create a lag column and a 2-period moving average for a single record in the data.table. In the examples, I first use the subset in i how I would like to use it, and then show that if I include the other rows used in the lag and rolling mean calc I get what I want. It would be more ideal for me if I only had to specify the rows I want the lags and rolling stats for, without having to include the other rows in i. @st-pasha I included you in this because I know you have frollmean on the roadmap for the Python version and you haven't gotten to it yet.
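A sketch of the usual workaround: since the window needs the preceding rows, compute the rolling features on the full table first and subset afterwards, rather than subsetting in i up front (column names lag1/ma2 are illustrative):

```r
library(data.table)

DT <- data.table(id = 1:10, v = as.numeric(1:10))
# compute on the full table, then subset the rows to be scored
DT[, `:=`(lag1 = shift(v, 1), ma2 = frollmean(v, 2))]
DT[id == 10]  # lag1 = 9, ma2 = 9.5
```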
|
To gather requirements in single place and refresh ~4 years old discussions creating this issue to cover rolling functions feature (also known as rolling aggregates, sliding window or moving average/moving aggregates).
rolling functions
features
#3422, #3423)