Skip to content

Commit 3bb2e76

Browse files
jangoreckimcolben-schwenaitapToby Dylan Hocking
authored
left aligned, adaptive, rolling max (#7264)
* rolling max, adaptive left * frollmax exact, buggy fast, no fast adaptive * frollmax fast fixing bugs * frollmax man to fix CRAN check * frollmax fast adaptive non NA, dev * froll docs, adaptive left * no frollmax fast adaptive * frollmax adaptive exact NAs handling * PR summary in news * align happens in one place, less duplicated code * push up even more to frollR to reduce code duplication * frollapply push up align arg and early stopping up * typo fix in NEWS.md Co-authored-by: Marco Colombo <m.colombo@ed.ac.uk> * keep R agnostic C code in froll.c, yet deduplicated * new functionality unit tests * doc further improving * tests and NEWS * remove unnecessary new line to reduce diff * Many wording corrections by Ben Co-authored-by: Benjamin Schwendinger <52290390+ben-schwen@users.noreply.github.com> * function args, suggested by Ivan * Add wording improvements by Ivan Co-authored-by: aitap <krylov.r00t@gmail.com> * follow up improvements added to TODO file --------- Co-authored-by: Marco Colombo <m.colombo@ed.ac.uk> Co-authored-by: Benjamin Schwendinger <52290390+ben-schwen@users.noreply.github.com> Co-authored-by: aitap <krylov.r00t@gmail.com> * partial window support for rolling functions * adaptive and partial added to frollapply() * NA support in frollmax * frollmax adaptive speed up by early NA detection * rolling functions support for Inf/-Inf (#9) * Deduplicate args handling in rolling functions * give.names argument for rolling functions * adaptive && partial support, helper functions * also update function call * frollmax PRs reviews feedback follow up (#13) * catf rather than cat * malloc and free as per feedback * remove R < 3.4.0 related code * rm TODO file * elaborate on a breaking change * fix compiler notes * update timings and add NOTE * partial2adaptive does not wrap in a list when not needed * forget to remove wrapping into list, amend error for different types * compile with -O3 and use performance mode rather than balanced * fix codecov * atime tests for froll* * explain tryCatch * typo fix Co-authored-by: Marco Colombo <m.colombo@ed.ac.uk> * reverting atime test cases as there is nothing to compare to from the past * address feedback from Ben * indent code block to keep the right HTML structure * Apply suggestions from Michael's code review Co-authored-by: Michael Chirico <chiricom@google.com> * proper indentation in NEWS file for long entry * fix typo in NEWS Co-authored-by: Marco Colombo <m.colombo@ed.ac.uk> --------- Co-authored-by: Marco Colombo <m.colombo@ed.ac.uk> Co-authored-by: Benjamin Schwendinger <52290390+ben-schwen@users.noreply.github.com> Co-authored-by: aitap <krylov.r00t@gmail.com> Co-authored-by: Toby Dylan Hocking <toby.dylan.hocking@usherbrooke.ca> Co-authored-by: Michael Chirico <chiricom@google.com>
1 parent 7920cad commit 3bb2e76

File tree

11 files changed

+1804
-733
lines changed

11 files changed

+1804
-733
lines changed

NAMESPACE

Lines changed: 1 addition & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -54,6 +54,7 @@ S3method(cube, data.table)
5454
S3method(rollup, data.table)
5555
export(frollmean)
5656
export(frollsum)
57+
export(frollmax)
5758
export(frollapply)
5859
export(nafill)
5960
export(setnafill)

NEWS.md

Lines changed: 52 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -4,6 +4,18 @@
44

55
## data.table [v1.17.99](https://github.com/Rdatatable/data.table/milestone/35) (in development)
66

7+
### BREAKING CHANGE
8+
9+
1. Rolling functions `frollmean` and `frollsum` distinguish `Inf`/`-Inf` from `NA` to match the same rules as base R when `algo="fast"` (previously they were considered the same). If your input into those functions has `Inf` or `-Inf` then you will be affected by this change. As a result, the argument that controls the handling of `NA`s has been renamed from `hasNA` to `has.nf` (_has non-finite_). `hasNA` continues to work with a warning, for now.
10+
```r
11+
## before
12+
frollsum(c(1,2,3,Inf,5,6), 2)
13+
#[1] NA 3 5 NA NA 11
14+
15+
## now
16+
frollsum(c(1,2,3,Inf,5,6), 2)
17+
#[1] NA 3 5 Inf Inf 11
18+
719
### NOTICE OF INTENDED FUTURE POTENTIAL BREAKING CHANGES
820

921
1. `data.table(x=1, <expr>)`, where `<expr>` is an expression resulting in a 1-column matrix without column names, will eventually have names `x` and `V2`, not `x` and `V1`, consistent with `data.table(x=1, <expr>)` where `<expr>` results in an atomic vector, for example `data.table(x=1, cbind(1))` and `data.table(x=1, 1)` will both have columns named `x` and `V2`. In this release, the matrix case continues to be named `V1`, but the new behavior can be activated by setting `options(datatable.old.matrix.autoname)` to `FALSE`. See point 5 under Bug Fixes for more context; this change will provide more internal consistency as well as more consistency with `data.frame()`.
@@ -75,6 +87,46 @@
7587

7688
15. New function `isoyear()` has been implemented as a complement to `isoweek()`, returning the ISO 8601 year corresponding to a given date, [#7154](https://github.com/Rdatatable/data.table/issues/7154). Thanks to @ben-schwen and @MichaelChirico for the suggestion and @venom1204 for the implementation.
7789

90+
16. Multiple improvements have been added to rolling functions. Request came from @gpierard who needed left aligned, adaptive, rolling max, [#5438](https://github.com/Rdatatable/data.table/issues/5438). There was no `frollmax` function yet. Adaptive rolling functions did not have support for `align="left"`. `frollapply` did not support `adaptive=TRUE`. Available alternatives were base R `mapply` or self-join using `max` and grouping `by=.EACHI`. As a follow up of his request, the following features have been added:
91+
- new function `frollmax`, applies `max` over a rolling window.
92+
- support for `align="left"` for adaptive rolling function.
93+
- support for `adaptive=TRUE` in `frollapply`.
94+
- `partial` argument to trim window width to available observations rather than returning `NA` whenever window is not complete.
95+
- `give.names` argument that can be used to automatically give the names based on the names of `x` and `n`.
96+
- `frollmean` and `frollsum` no longer treat `Inf` and `-Inf` as `NA`s as it used to be for `algo="fast"` (breaking change).
97+
- `hasNA` argument has been renamed to `has.nf` to convey that it is not only related to `NA/NaN` but other non-finite values (`Inf/-Inf`) as well.
98+
99+
Thanks to @jangorecki for implementation and @MichaelChirico and others for work on splitting into smaller PRs and reviews.
100+
For a comprehensive description about all available features see `?froll` manual.
101+
102+
Adaptive `frollmax` has observed to be around 80 times faster than second fastest solution (data.table self-join using `max` and grouping `by=.EACHI`). Note that important factor in performance is width of the rolling window. Code for the benchmark below has been taken from [this SO answer](https://stackoverflow.com/a/73408459/2490497).
103+
```r
104+
set.seed(108)
105+
setDTthreads(16)
106+
x = data.table(
107+
value = cumsum(rnorm(1e6, 0.1)),
108+
end_window = 1:1e6 + sample(50:500, 1e6, TRUE),
109+
row = 1:1e6
110+
)[, "end_window" := pmin(end_window, .N)
111+
][, "len_window" := end_window-row+1L]
112+
baser = function(x) x[, mapply(function(from, to) max(value[from:to]), row, end_window)]
113+
sj = function(x) x[x, max(value), on=.(row >= row, row <= end_window), by=.EACHI]$V1
114+
frmax = function(x) x[, frollmax(value, len_window, adaptive=TRUE, align="left", has.nf=FALSE)]
115+
frapply = function(x) x[, frollapply(value, len_window, max, adaptive=TRUE, align="left")]
116+
microbenchmark::microbenchmark(
117+
baser(x), sj(x), frmax(x), frapply(x),
118+
times=10, check="identical"
119+
)
120+
#Unit: milliseconds
121+
# expr min lq mean median uq max neval
122+
# baser(x) 3094.88357 3097.84966 3186.74832 3163.58050 3251.66753 3370.33785 10
123+
# sj(x) 2221.55456 2255.12083 2306.61382 2303.47883 2346.70293 2412.62975 10
124+
# frmax(x) 17.45124 24.16809 28.10062 28.58153 32.79802 34.83941 10
125+
# frapply(x) 272.07830 316.47060 366.94771 396.23566 416.06699 421.38701 10
126+
```
127+
128+
As of now, adaptive rolling max has no _on-line_ implementation (`algo="fast"`), it uses a naive approach (`algo="exact"`). Therefore further speed up is still possible if `algo="fast"` gets implemented.
129+
78130
### BUG FIXES
79131

80132
1. `fread()` no longer warns on certain systems on R 4.5.0+ where the file owner can't be resolved, [#6918](https://github.com/Rdatatable/data.table/issues/6918). Thanks @ProfFancyPants for the report and PR.

R/froll.R

Lines changed: 127 additions & 12 deletions
Original file line numberDiff line numberDiff line change
@@ -1,21 +1,136 @@
1-
froll = function(fun, x, n, fill=NA, algo=c("fast", "exact"), align=c("right", "left", "center"), na.rm=FALSE, hasNA=NA, adaptive=FALSE) {
2-
stopifnot(!missing(fun), is.character(fun), length(fun)==1L, !is.na(fun))
3-
algo = match.arg(algo)
1+
# helpers for partial2adaptive
2+
trimn = function(n, len, align) {
3+
n = min(n, len) ## so frollsum(1:2, 3, partial=TRUE) works
4+
if (align=="right")
5+
c(seq_len(n), rep.int(n, len-n))
6+
else
7+
c(rep.int(n, len-n), rev(seq_len(n)))
8+
}
9+
trimnadaptive = function(n, align) {
10+
if (align=="right")
11+
pmin(n, seq_along(n))
12+
else
13+
pmin(n, rev(seq_along(n)))
14+
}
15+
16+
# partial2adaptive helper function
17+
## tune provided 'n' via partial=TRUE to adaptive=TRUE by prepared adaptive 'n' as shown in ?froll examples
18+
# partial2adaptive(1:4, 2, "right", adaptive=FALSE)
19+
# partial2adaptive(1:4, 2:3, "right", adaptive=FALSE)
20+
# partial2adaptive(list(1:4, 2:5), 2:3, "right", adaptive=FALSE)
21+
# frollsum(1:4, 2, partial=FALSE, adaptive=FALSE)
22+
# frollsum(1:4, 2, partial=TRUE, adaptive=FALSE)
23+
# frollsum(1:4, 2:3, partial=FALSE, adaptive=FALSE)
24+
# frollsum(1:4, 2:3, partial=TRUE, adaptive=FALSE)
25+
# frollsum(list(1:4, 2:5), 2:3, partial=FALSE, adaptive=FALSE)
26+
# frollsum(list(1:4, 2:5), 2:3, partial=TRUE, adaptive=FALSE)
27+
partial2adaptive = function(x, n, align, adaptive) {
28+
if (!length(n))
29+
stopf("n must be non 0 length")
30+
if (align=="center")
31+
stopf("'partial' cannot be used together with align='center'")
32+
if (is.list(x) && length(unique(lengths(x))) != 1L)
33+
stopf("'partial' does not support variable length of columns in 'x'")
34+
len = if (is.list(x)) length(x[[1L]]) else length(x)
35+
verbose = getOption("datatable.verbose")
36+
if (!adaptive) {
37+
if (is.list(n))
38+
stopf("n must be an integer, list is accepted for adaptive TRUE")
39+
if (!is.numeric(n))
40+
stopf("n must be an integer vector or a list of integer vectors")
41+
if (verbose)
42+
catf("partial2adaptive: froll partial=TRUE trimming 'n' and redirecting to adaptive=TRUE\n")
43+
if (length(n) > 1L) {
44+
## c(2,3) -> list(c(1,2,2,2),c(1,2,3,3)) ## for x=1:4
45+
lapply(n, len, align, FUN=trimn)
46+
} else {
47+
## 3 -> c(1,2,3,3) ## for x=1:4
48+
trimn(n, len, align)
49+
}
50+
} else {
51+
if (!(is.numeric(n) || (is.list(n) && all(vapply_1b(n, is.numeric)))))
52+
stopf("n must be an integer vector or a list of integer vectors")
53+
if (length(unique(lengths(n))) != 1L)
54+
stopf("adaptive window provided in 'n' must not to have different lengths")
55+
if (is.numeric(n) && length(n) != len)
56+
stopf("length of 'n' argument must be equal to number of observations provided in 'x'")
57+
if (is.list(n) && length(n[[1L]]) != len)
58+
stopf("length of vectors in 'x' must match to length of adaptive window in 'n'")
59+
if (verbose)
60+
catf("partial2adaptive: froll adaptive=TRUE and partial=TRUE trimming 'n'\n")
61+
if (is.numeric(n)) {
62+
## c(3,3,3,2) -> c(1,2,3,2) ## for x=1:4
63+
trimnadaptive(n, align)
64+
} else {
65+
## list(c(3,3,3,2),c(4,2,3,3)) -> list(c(1,2,3,2),c(1,2,3,3)) ## for x=1:4
66+
lapply(n, align, FUN = trimnadaptive)
67+
}
68+
}
69+
}
70+
71+
froll = function(fun, x, n, fill=NA, algo, align=c("right","left","center"), na.rm=FALSE, has.nf=NA, adaptive=FALSE, partial=FALSE, FUN, rho, give.names=FALSE) {
472
align = match.arg(align)
5-
ans = .Call(CfrollfunR, fun, x, n, fill, algo, align, na.rm, hasNA, adaptive)
73+
if (isTRUE(give.names))
74+
orig = list(n=n, adaptive=adaptive)
75+
if (isTRUE(partial)) {
76+
n = partial2adaptive(x, n, align, adaptive)
77+
adaptive = TRUE
78+
}
79+
leftadaptive = isTRUE(adaptive) && align=="left"
80+
if (leftadaptive) {
81+
verbose = getOption("datatable.verbose")
82+
rev2 = function(x) if (is.list(x)) lapply(x, rev) else rev(x)
83+
if (verbose)
84+
catf("froll: adaptive=TRUE && align='left' pre-processing for align='right'\n")
85+
x = rev2(x)
86+
n = rev2(n)
87+
align = "right"
88+
} ## support for left adaptive added in #5441
89+
if (missing(FUN))
90+
ans = .Call(CfrollfunR, fun, x, n, fill, algo, align, na.rm, has.nf, adaptive)
91+
else
92+
ans = .Call(CfrollapplyR, FUN, x, n, fill, align, adaptive, rho)
93+
if (leftadaptive) {
94+
if (verbose)
95+
catf("froll: adaptive=TRUE && align='left' post-processing from align='right'\n")
96+
ans = rev2(ans)
97+
}
98+
if (isTRUE(give.names) && is.list(ans)) {
99+
n = orig$n
100+
adaptive = orig$adaptive
101+
nx = names(x)
102+
nn = names(n)
103+
if (is.null(nx)) nx = paste0("V", if (is.atomic(x)) 1L else seq_along(x))
104+
if (is.null(nn)) nn = if (adaptive) paste0("N", if (is.atomic(n)) 1L else seq_along(n)) else paste("roll", as.character(n), sep="_")
105+
setattr(ans, "names", paste(rep(nx, each=length(nn)), nn, sep="_"))
106+
}
6107
ans
7108
}
8109

9-
frollmean = function(x, n, fill=NA, algo=c("fast", "exact"), align=c("right", "left", "center"), na.rm=FALSE, hasNA=NA, adaptive=FALSE) {
10-
froll(fun="mean", x=x, n=n, fill=fill, algo=algo, align=align, na.rm=na.rm, hasNA=hasNA, adaptive=adaptive)
110+
frollfun = function(fun, x, n, fill=NA, algo=c("fast","exact"), align=c("right","left","center"), na.rm=FALSE, has.nf=NA, adaptive=FALSE, partial=FALSE, hasNA, give.names=FALSE) {
111+
stopifnot(!missing(fun), is.character(fun), length(fun)==1L, !is.na(fun))
112+
if (!missing(hasNA)) {
113+
if (!is.na(has.nf))
114+
stopf("hasNA is deprecated, use has.nf instead")
115+
warningf("hasNA is deprecated, use has.nf instead")
116+
has.nf = hasNA
117+
} # remove check on next major release
118+
algo = match.arg(algo)
119+
froll(fun=fun, x=x, n=n, fill=fill, algo=algo, align=align, na.rm=na.rm, has.nf=has.nf, adaptive=adaptive, partial=partial, give.names=give.names)
120+
}
121+
122+
frollmean = function(x, n, fill=NA, algo=c("fast","exact"), align=c("right","left","center"), na.rm=FALSE, has.nf=NA, adaptive=FALSE, partial=FALSE, hasNA, give.names=FALSE) {
123+
frollfun(fun="mean", x=x, n=n, fill=fill, algo=algo, align=align, na.rm=na.rm, has.nf=has.nf, adaptive=adaptive, partial=partial, hasNA=hasNA, give.names=give.names)
124+
}
125+
frollsum = function(x, n, fill=NA, algo=c("fast","exact"), align=c("right","left","center"), na.rm=FALSE, has.nf=NA, adaptive=FALSE, partial=FALSE, hasNA, give.names=FALSE) {
126+
frollfun(fun="sum", x=x, n=n, fill=fill, algo=algo, align=align, na.rm=na.rm, has.nf=has.nf, adaptive=adaptive, partial=partial, hasNA=hasNA, give.names=give.names)
11127
}
12-
frollsum = function(x, n, fill=NA, algo=c("fast","exact"), align=c("right", "left", "center"), na.rm=FALSE, hasNA=NA, adaptive=FALSE) {
13-
froll(fun="sum", x=x, n=n, fill=fill, algo=algo, align=align, na.rm=na.rm, hasNA=hasNA, adaptive=adaptive)
128+
frollmax = function(x, n, fill=NA, algo=c("fast","exact"), align=c("right","left","center"), na.rm=FALSE, has.nf=NA, adaptive=FALSE, partial=FALSE, hasNA, give.names=FALSE) {
129+
frollfun(fun="max", x=x, n=n, fill=fill, algo=algo, align=align, na.rm=na.rm, has.nf=has.nf, adaptive=adaptive, partial=partial, hasNA=hasNA, give.names=give.names)
14130
}
15-
frollapply = function(x, n, FUN, ..., fill=NA, align=c("right", "left", "center")) {
131+
132+
frollapply = function(x, n, FUN, ..., fill=NA, align=c("right","left","center"), adaptive=FALSE, partial=FALSE, give.names=FALSE) {
16133
FUN = match.fun(FUN)
17-
align = match.arg(align)
18134
rho = new.env()
19-
ans = .Call(CfrollapplyR, FUN, x, n, fill, align, rho)
20-
ans
135+
froll(FUN=FUN, rho=rho, x=x, n=n, fill=fill, align=align, adaptive=adaptive, partial=partial, give.names=give.names)
21136
}

0 commit comments

Comments
 (0)