Skip to content

Commit

Permalink
Merge branch 'master' into gshift-1
Browse files Browse the repository at this point in the history
  • Loading branch information
MichaelChirico committed Feb 25, 2024
2 parents 9e3b0be + 82f559f commit a4889f0
Show file tree
Hide file tree
Showing 8 changed files with 103 additions and 81 deletions.
4 changes: 0 additions & 4 deletions .github/workflows/R-CMD-check.yaml
Original file line number Diff line number Diff line change
Expand Up @@ -3,12 +3,8 @@
on:
push:
branches:
- main
- master
pull_request:
branches:
- main
- master

name: R-CMD-check

Expand Down
4 changes: 0 additions & 4 deletions .github/workflows/test-coverage.yaml
Original file line number Diff line number Diff line change
@@ -1,12 +1,8 @@
on:
push:
branches:
- main
- master
pull_request:
branches:
- main
- master

name: test-coverage

Expand Down
1 change: 1 addition & 0 deletions .gitignore
Original file line number Diff line number Diff line change
Expand Up @@ -9,6 +9,7 @@ Rplots.pdf
data.table_*.tar.gz
data.table.Rcheck
src/Makevars
.Rprofile

# Package install
inst/cc
Expand Down
5 changes: 4 additions & 1 deletion GOVERNANCE.md
Original file line number Diff line number Diff line change
Expand Up @@ -64,7 +64,7 @@ Functionality that is out of current scope:

* Definition: permission to commit to, and merge PRs into, master branch.
* How to obtain this role: after a reviewer has a consistent history of careful reviews of others' PRs, then a current Committer should ask all other current Committers if they approve promoting the Reviewer to Committer, and it should be done if there is Consensus among active Committers.
* How this role is recognized: credited via role="aut" in DESCRIPTION (so they appear in Author list on CRAN), and added to https://github.com/orgs/Rdatatable/teams/maintainers which gives permission to merge PRs into master branch.
* How this role is recognized: credited via role="aut" in DESCRIPTION (so they appear in Author list on CRAN), and added to https://github.com/orgs/Rdatatable/teams/committers which gives permission to merge PRs into master branch.

## CRAN maintainer

Expand Down Expand Up @@ -123,6 +123,9 @@ data.table Version line in DESCRIPTION typically has the following meanings

# Governance history

Feb 2024: change team name/link maintainers to committers, to be consistent with role defined in governance.

Nov-Dec 2023: initial version drafted by Toby Dylan Hocking and
reviewed by Tyson Barrett, Jan Gorecki, Michael Chirico, Benjamin
Schwendinger.

2 changes: 2 additions & 0 deletions NEWS.md
Original file line number Diff line number Diff line change
Expand Up @@ -16,6 +16,8 @@

2. `cedta()` now returns `FALSE` if `.datatable.aware = FALSE` is set in the calling environment, [#5654](https://github.com/Rdatatable/data.table/issues/5654).

3. Namespace-qualifying `data.table::shift()`, `data.table::first()`, or `data.table::last()` will not deactivate GForce, [#5942](https://github.com/Rdatatable/data.table/issues/5942). Thanks @MichaelChirico for the proposal and fix. Namespace-qualifying other calls like `stats::sum()`, `base::prod()`, etc., continue to work as an escape valve to avoid GForce, e.g. to ensure S3 method dispatch.

## NOTES

1. `transform` method for data.table sped up substantially when creating new columns on large tables. Thanks to @OfekShilon for the report and PR. The implemented solution was proposed by @ColeMiller1.
Expand Down
157 changes: 86 additions & 71 deletions R/data.table.R
Original file line number Diff line number Diff line change
Expand Up @@ -1734,7 +1734,6 @@ replace_dot_alias = function(e) {
else
catf("lapply optimization is on, j unchanged as '%s'\n", deparse(jsub,width.cutoff=200L, nlines=1L))
}
dotN = function(x) is.name(x) && x==".N" # For #334. TODO: Rprof() showed dotN() may be the culprit if iterated (#1470)?; avoid the == which converts each x to character?
# FR #971, GForce kicks in on all subsets, no joins yet. Although joins could work with
# nomatch=NULL even now.. but not switching it on yet, will deal it separately.
if (getOption("datatable.optimize")>=2L && !is.data.table(i) && !byjoin && length(f__)) {
Expand All @@ -1748,84 +1747,23 @@ replace_dot_alias = function(e) {
GForce = FALSE
} else {
# Apply GForce
# GForce needs to evaluate all arguments not present in the data.table before calling C part #5547
# Safe cases: variables [i], calls without variables [c(0,1), list(1)] # TODO extend this list
# Unsafe cases: functions containing variables [c(i), abs(i)], .N
is_constantish = function(expr, check_singleton=FALSE) {
if (!is.call(expr)) {
return(!dotN(expr))
}
if (check_singleton) {
return(FALSE)
}
# calls are allowed <=> there's no SYMBOLs in the sub-AST
return(length(all.vars(expr, max.names=1L, unique=FALSE)) == 0L)
}
.gshift_ok = function(q) {
q = match.call(shift, q)
is_constantish(q[["n"]]) &&
is_constantish(q[["fill"]]) &&
is_constantish(q[["type"]]) &&
!"give.names" %chin% names(q)
}
.ghead_ok = function(q) {
length(q) == 3L &&
is_constantish(q[[3L]], check_singleton = TRUE)
}
`.g[_ok` = function(q, x) {
length(q) == 3L &&
is_constantish(q[[3L]], check_singleton = TRUE) &&
(q[[1L]] != "[[" || eval(call('is.atomic', q[[2L]]), envir=x)) &&
!(as.character(q[[3L]]) %chin% names_x) && is.numeric(q3<-eval(q[[3L]], parent.frame(3L))) && length(q3)==1L && q3>0L
}
.gweighted.mean_ok = function(q, x) { #3977
q = match.call(gweighted.mean, q)
is_constantish(q[["na.rm"]]) &&
(is.null(q[["w"]]) || eval(call('is.numeric', q[["w"]]), envir=x))
}
.gforce_ok = function(q) {
if (dotN(q)) return(TRUE) # For #334
# run GForce for simple f(x) calls and f(x, na.rm = TRUE)-like calls where x is a column of .SD
# is.symbol() is for #1369, #1974 and #2949
if (!(is.call(q) && is.symbol(q[[1L]]) && is.symbol(q[[2L]]) && (q[[1L]]) %chin% gfuns)) return(FALSE)
if (!(q2 <- q[[2L]]) %chin% names(SDenv$.SDall) && q2 != ".I") return(FALSE) # 875
if (length(q)==2L || (!is.null(names(q)) && startsWith(names(q)[3L], "na") && is_constantish(q[[3L]]))) return(TRUE)
# ^^ base::startWith errors on NULL unfortunately
switch(as.character(q[[1L]]),
"shift" = .gshift_ok(q),
"weighted.mean" = .gweighted.mean_ok(q, x),
"tail" = , "head" = .ghead_ok(q),
"[[" = , "[" = `.g[_ok`(q, x),
FALSE
)
}
if (jsub[[1L]]=="list") {
GForce = TRUE
for (ii in seq.int(from=2L, length.out=length(jsub)-1L)) {
if (!.gforce_ok(jsub[[ii]])) {GForce = FALSE; break}
}
} else GForce = .gforce_ok(jsub)
gforce_jsub = function(x, names_x) {
x[[1L]] = as.name(paste0("g", x[[1L]]))
# gforce needs to evaluate arguments before calling C part TODO: move the evaluation into gforce_ok
# do not evaluate vars present as columns in x
if (length(x) >= 3L) {
for (i in 3:length(x)) {
if (is.symbol(x[[i]]) && !(x[[i]] %chin% names_x)) x[[i]] = eval(x[[i]], parent.frame(2L)) # tests 1187.2 & 1187.4
}
if (!.gforce_ok(jsub[[ii]], SDenv$.SDall)) {GForce = FALSE; break}
}
x
}
} else
GForce = .gforce_ok(jsub, SDenv$.SDall)
if (GForce) {
if (jsub[[1L]]=="list")
for (ii in seq_along(jsub)[-1L]) {
if (dotN(jsub[[ii]])) next; # For #334
jsub[[ii]] = gforce_jsub(jsub[[ii]], names_x)
if (is.N(jsub[[ii]])) next; # For #334
jsub[[ii]] = .gforce_jsub(jsub[[ii]], names_x)
}
else {
# adding argument to ghead/gtail if none is supplied to g-optimized head/tail
if (length(jsub) == 2L && jsub[[1L]] %chin% c("head", "tail")) jsub[["n"]] = 6L
jsub = gforce_jsub(jsub, names_x)
jsub = .gforce_jsub(jsub, names_x)
}
if (verbose) catf("GForce optimized j to '%s' (see ?GForce)\n", deparse(jsub, width.cutoff=200L, nlines=1L))
} else if (verbose) catf("GForce is on, but not activated for this query; left j unchanged (see ?GForce)\n");
Expand All @@ -1836,7 +1774,7 @@ replace_dot_alias = function(e) {
nomeanopt=FALSE # to be set by .optmean() using <<- inside it
oldjsub = jsub
if (jsub[[1L]]=="list") {
# Addressing #1369, #2949 and #1974. This used to be 30s (vs 0.5s) with 30K elements items in j, #1470. Could have been dotN() and/or the for-looped if()
# Addressing #1369, #2949 and #1974. This used to be 30s (vs 0.5s) with 30K elements items in j, #1470. Could have been is.N() and/or the for-looped if()
# jsub[[1]]=="list" so the first item of todo will always be FALSE
todo = sapply(jsub, `%iscall%`, 'mean')
if (any(todo)) {
Expand Down Expand Up @@ -3040,8 +2978,9 @@ rleidv = function(x, cols=seq_along(x), prefix=NULL) {
# (1) add it to gfuns
# (2) edit .gforce_ok (defined within `[`) to catch which j will apply the new function
# (3) define the gfun = function() R wrapper
gfuns = c("[", "[[", "head", "tail", "first", "last", "sum", "mean", "prod",
"median", "min", "max", "var", "sd", ".N", "shift", "weighted.mean") # added .N for #334
gdtfuns = c("first", "last", "shift") # exported by data.table, not generic, thus also accept data.table:: form under GForce, #5942.
gfuns = c(gdtfuns,
"[", "[[", "head", "tail", "sum", "mean", "prod", "median", "min", "max", "var", "sd", ".N", "weighted.mean") # added .N for #334
`g[` = `g[[` = function(x, n) .Call(Cgnthvalue, x, as.integer(n)) # n is of length=1 here.
ghead = function(x, n) .Call(Cghead, x, as.integer(n))
gtail = function(x, n) .Call(Cgtail, x, as.integer(n))
Expand Down Expand Up @@ -3073,6 +3012,82 @@ gshift = function(x, n=1L, fill=NA, type=c("lag", "lead", "shift", "cyclic")) {
}
gforce = function(env, jsub, o, f, l, rows) .Call(Cgforce, env, jsub, o, f, l, rows)

# GForce needs to evaluate all arguments not present in the data.table before calling C part #5547
# Safe cases: variables [i], calls without variables [c(0,1), list(1)] # TODO extend this list
# Unsafe cases: functions containing variables [c(i), abs(i)], .N
is.N = function(q) is.name(q) && q==".N" # For #334. TODO: Rprof() showed is.N() may be the culprit if iterated (#1470)?; avoid the == which converts each x to character?
is_constantish = function(q, check_singleton=FALSE) {
if (!is.call(q)) {
return(!is.N(q))
}
if (check_singleton) {
return(FALSE)
}
# calls are allowed <=> there's no SYMBOLs in the sub-AST
return(length(all.vars(q, max.names=1L, unique=FALSE)) == 0L)
}
.gshift_ok = function(q) {
q = match.call(shift, q)
is_constantish(q[["n"]]) &&
is_constantish(q[["fill"]]) &&
is_constantish(q[["type"]]) &&
!"give.names" %chin% names(q)
}
.ghead_ok = function(q) {
length(q) == 3L &&
is_constantish(q[[3L]], check_singleton = TRUE)
}
`.g[_ok` = function(q, x) {
length(q) == 3L &&
is_constantish(q[[3L]], check_singleton = TRUE) &&
(q[[1L]] != "[[" || eval(call('is.atomic', q[[2L]]), envir=x)) &&
!(as.character(q[[3L]]) %chin% names(x)) && is.numeric(q3 <- eval(q[[3L]], parent.frame(3L))) && length(q3)==1L && q3>0L
}
.gweighted.mean_ok = function(q, x) { #3977
q = match.call(gweighted.mean, q)
is_constantish(q[["na.rm"]]) &&
(is.null(q[["w"]]) || eval(call('is.numeric', q[["w"]]), envir=x))
}
# run GForce for simple f(x) calls and f(x, na.rm = TRUE)-like calls where x is a column of .SD
.get_gcall = function(q) {
if (!is.call(q)) return(NULL)
# is.symbol() is for #1369, #1974 and #2949
if (!is.symbol(q[[2L]])) return(NULL)
q1 = q[[1L]]
if (is.symbol(q1)) return(if (q1 %chin% gfuns) q1)
if (!q1 %iscall% "::") return(NULL)
if (q1[[2L]] != "data.table") return(NULL)
return(if (q1[[3L]] %chin% gdtfuns) q1[[3L]])
}
.gforce_ok = function(q, x) {
if (is.N(q)) return(TRUE) # For #334
q1 = .get_gcall(q)
if (is.null(q1)) return(FALSE)
if (!(q2 <- q[[2L]]) %chin% names(x) && q2 != ".I") return(FALSE) # 875
if (length(q)==2L || (!is.null(names(q)) && startsWith(names(q)[3L], "na") && is_constantish(q[[3L]]))) return(TRUE)
# ^^ base::startWith errors on NULL unfortunately
switch(as.character(q1),
"shift" = .gshift_ok(q),
"weighted.mean" = .gweighted.mean_ok(q, x),
"tail" = , "head" = .ghead_ok(q),
"[[" = , "[" = `.g[_ok`(q, x),
FALSE
)
}

.gforce_jsub = function(q, names_x) {
call_name = if (is.symbol(q[[1L]])) q[[1L]] else q[[1L]][[3L]] # latter is like data.table::shift, #5942. .gshift_ok checked this will work.
q[[1L]] = as.name(paste0("g", call_name))
# gforce needs to evaluate arguments before calling C part TODO: move the evaluation into gforce_ok
# do not evaluate vars present as columns in x
if (length(q) >= 3L) {
for (i in 3:length(q)) {
if (is.symbol(q[[i]]) && !(q[[i]] %chin% names_x)) q[[i]] = eval(q[[i]], parent.frame(2L)) # tests 1187.2 & 1187.4
}
}
q
}

.prepareFastSubset = function(isub, x, enclos, notjoin, verbose = FALSE){
## helper that decides, whether a fast binary search can be performed, if i is a call
## For details on the supported queries, see \code{\link{datatable-optimize}}
Expand Down
8 changes: 8 additions & 0 deletions inst/tests/tests.Rraw
Original file line number Diff line number Diff line change
Expand Up @@ -18281,3 +18281,11 @@ test(2245.1, dt[1], data.table(foo = 1L, bar = 4L)) # Single index should be int
test(2245.2, copy(dt[1]), data.table(foo = 1:3)) # Revert to data.frame syntax: Interpret index as column
rm(.datatable.aware)
test(2245.3, dt[1], data.table(foo = 1L, bar = 4L)) # Default in this environment should be datatable-aware

# data.table:: doesn't turn off GForce, #5942
DT = data.table(a = rep(1:5, 2L), b = 1:10)
old = options(datatable.optimize=Inf, datatable.verbose=TRUE)
test(2246.1, DT[, data.table::shift(b), by=a], DT[, shift(b), by=a], output="GForce TRUE")
test(2246.2, DT[, data.table::first(b), by=a], DT[, first(b), by=a], output="GForce TRUE")
test(2246.3, DT[, data.table::last(b), by=a], DT[, last(b), by=a], output="GForce TRUE")
options(old)
3 changes: 2 additions & 1 deletion src/fwriteR.c
Original file line number Diff line number Diff line change
Expand Up @@ -264,7 +264,8 @@ SEXP fwriteR(
if (xlength(rn)!=2 || INTEGER(rn)[0]==NA_INTEGER) {
// not R's default rownames c(NA,-nrow)
if (xlength(rn) != args.nrow)
error(_("input has specific integer rownames but their length (%"PRId64") != nrow (%"PRId64")"), xlength(rn), args.nrow); // # nocov
// Use (long long) to cast R_xlen_t to a fixed type to robustly avoid -Wformat compiler warnings, see #5768, PRId64 didnt work on M1
error(_("input has specific integer rownames but their length (%lld) != nrow (%"PRId64")"), (long long)xlength(rn), args.nrow); // # nocov
args.rowNames = INTEGER(rn);
args.rowNameFun = WF_Int32;
}
Expand Down

0 comments on commit a4889f0

Please sign in to comment.