New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
cbindlist, mergelist #4370
base: master
Are you sure you want to change the base?
cbindlist, mergelist #4370
Conversation
.dev/cc.R
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
move this to its own PR
@@ -16,6 +16,8 @@ | |||
|
|||
2. `cedta()` now returns `FALSE` if `.datatable.aware = FALSE` is set in the calling environment, [#5654](https://github.com/Rdatatable/data.table/issues/5654). | |||
|
|||
3. (add example here?) New functions `cbindlist` and `mergelist` have been implemented and exported. Works like `cbind`/`merge` but takes `list` of data.tables on input. `merge` happens in `Reduce` fashion. Supports `how` (_left_, _inner_, _full_, _right_, _semi_, _anti_, _cross_) joins and `mult` argument, closes [#599](https://github.com/Rdatatable/data.table/issues/599) and [#2576](https://github.com/Rdatatable/data.table/issues/2576). |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
self-citation here?
(add example here?)
also please address.
@@ -206,7 +206,7 @@ replace_dot_alias = function(e) { | |||
} | |||
return(x) | |||
} | |||
if (!mult %chin% c("first","last","all")) stopf("mult argument can only be 'first', 'last' or 'all'") | |||
if (!mult %chin% c("first","last","all","error")) stop("mult argument can only be 'first', 'last', 'all' or 'error'") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can this be moved to an antecedent PR as well?
else if (!is.null(x) && !is.null(y)) { | ||
if (length(x)>=length(y)) intersect(y, x) ## align order to shorter|rhs key | ||
else intersect(x, y) | ||
} else NULL # nocov ## internal error is being called later in mergepair |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
} else NULL # nocov ## internal error is being called later in mergepair | |
} | |
NULL # nocov ## internal error is being called later in mergepair |
cbindlist = function(l, copy=TRUE) { | ||
ans = .Call(Ccbindlist, l, copy) | ||
if (anyDuplicated(names(ans))) { ## invalidate key and index | ||
setattr(ans, "sorted", NULL) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why this non-standard way to reset key/index? shouldn't we re-use abstractions here rather than copy implementation details in multiple places?
cols = colnamesInt(x, cols) | ||
ans = union(keep, setdiff(cols, drop)) | ||
if (!retain.order) return(ans) | ||
intersect(colnamesInt(x, NULL), ans) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
colnamesInt(x, NULL)
? Isn't that just seq_along(x)
?
stop("internal error: void must be used with mult='error'") # nocov | ||
if (how=="cross") { ## short-circuit bmerge results only for cross join | ||
if (length(on) || mult!="all" || !join.many) | ||
stop("cross join must be used with zero-length on, mult='all', join.many=TRUE") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
switch to stopf()
everywhere
ans = bmerge(i, x, icols, xcols, roll=0, rollends=c(FALSE, TRUE), nomatch=nomatch, mult=mult, ops=rep.int(1L, length(on)), verbose=verbose) | ||
if (void) { ## void=T is only for the case when we want raise error for mult='error', and that would happen in above line | ||
return(invisible(NULL)) | ||
} else if (how=="semi" || how=="anti") { ## semi and anti short-circuit |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
} else if (how=="semi" || how=="anti") { ## semi and anti short-circuit | |
} | |
if (how=="semi" || how=="anti") { ## semi and anti short-circuit |
irows = which(if (how=="semi") ans$lens!=0L else ans$lens==0L) ## we will subset i rather than x, thus assign to irows, not to xrows | ||
if (length(irows)==length(ans$lens)) irows = NULL | ||
return(list(ans=ans, irows=irows)) | ||
} else if (mult=="all" && !ans$allLen1 && !join.many && ## join.many, like allow.cartesian, check |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
} else if (mult=="all" && !ans$allLen1 && !join.many && ## join.many, like allow.cartesian, check | |
} | |
if (mult=="all" && !ans$allLen1 && !join.many && ## join.many, like allow.cartesian, check |
mergepair = function(lhs, rhs, on, how, mult, lhs.cols=names(lhs), rhs.cols=names(rhs), copy=TRUE, join.many=TRUE, verbose=FALSE) { | ||
semianti = how=="semi" || how=="anti" | ||
innerfull = how=="inner" || how=="full" | ||
{ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this extra nesting under {
is unusual, any motivation?
stop("'on' is missing and necessary key is not present") | ||
} | ||
if (any(bad.on <- !on %chin% names(lhs))) | ||
stop(sprintf("'on' argument specify columns to join [%s] that are not present in LHS table [%s]", paste(on[bad.on], collapse=", "), paste(names(lhs), collapse=", "))) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
stop(sprintf("'on' argument specify columns to join [%s] that are not present in LHS table [%s]", paste(on[bad.on], collapse=", "), paste(names(lhs), collapse=", "))) | |
stopf("'on' argument specify columns to join [%s] that are not present in LHS table [%s]", brackify(on[bad.on]), brackify(names(lhs))) |
if (any(bad.on <- !on %chin% names(lhs))) | ||
stop(sprintf("'on' argument specify columns to join [%s] that are not present in LHS table [%s]", paste(on[bad.on], collapse=", "), paste(names(lhs), collapse=", "))) | ||
if (any(bad.on <- !on %chin% names(rhs))) | ||
stop(sprintf("'on' argument specify columns to join [%s] that are not present in RHS table [%s]", paste(on[bad.on], collapse=", "), paste(names(rhs), collapse=", "))) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
stop(sprintf("'on' argument specify columns to join [%s] that are not present in RHS table [%s]", paste(on[bad.on], collapse=", "), paste(names(rhs), collapse=", "))) | |
stopf("'on' argument specify columns to join [%s] that are not present in RHS table [%s]", brackify(on[bad.on]), brackify(names(rhs))) |
cp.x = !is.null(ans$xrows) | ||
## ensure no duplicated column names in merge results | ||
if (any(dup.i<-names(out.i) %chin% names(out.x))) | ||
stop("merge result has duplicated column names, use 'cols' argument or rename columns in 'l' tables, duplicated column(s): ", paste(names(out.i)[dup.i], collapse=", ")) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
stop("merge result has duplicated column names, use 'cols' argument or rename columns in 'l' tables, duplicated column(s): ", paste(names(out.i)[dup.i], collapse=", ")) | |
stopf("merge result has duplicated column names, use 'cols' argument or rename columns in 'l' tables, duplicated column(s): %s", brackify(names(out.i)[dup.i])) |
verbose = getOption("datatable.verbose") | ||
if (verbose) | ||
p = proc.time()[[3L]] | ||
{ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this lone {
approach is especially misleading here given the preceding "naked" if()
statement lacking its own braces
out = if (!n) as.data.table(l) else l[[1L]] | ||
if (copy) out = copy(out) | ||
if (verbose) | ||
cat(sprintf("mergelist: merging %d table(s), took %.3fs\n", n, proc.time()[[3L]]-p)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
cat(sprintf("mergelist: merging %d table(s), took %.3fs\n", n, proc.time()[[3L]]-p)) | |
catf("mergelist: merging %d table(s), took %.3fs\n", n, proc.time()[[3L]]-p) |
if (copy) | ||
.Call(CcopyCols, out, colnamesInt(out, names(out.mem)[out.mem %chin% unique(unlist(l.mem, recursive=FALSE))])) | ||
if (verbose) | ||
cat(sprintf("mergelist: merging %d tables, took %.3fs\n", n, proc.time()[[3L]]-p)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
cat(sprintf("mergelist: merging %d tables, took %.3fs\n", n, proc.time()[[3L]]-p)) | |
catf("mergelist: merging %d tables, took %.3fs\n", n, proc.time()[[3L]]-p) |
if (!isInteger(x)) error(_("x must be an integer vector")); | ||
if (!isInteger(len)) error(_("len must be an integer vector")); | ||
if (LENGTH(x) != LENGTH(len)) error(_("x and len must be the same length")); | ||
if (!isInteger(x)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Separate these changes to own PR?
As encouraged elsewhere I think this PR should be split up. There are a few tiny clean-up changes independent of More importantly, I think we can do two PRs here: (1) |
That make sense but it's quite a bit of work and I am not sure when I could do it. I think we should stop accepting new PRs till current queue is not cleared yet as conflicts will be only piling up. |
I disagree, on the contrary, if you have such high priority on getting these PRs through to avoid conflicts, you should be the one to spend the effort on getting them ready for review. The current PR may be complete but it is absolutely not in a state for review. Pausing progress elsewhere in the repo indefinitely in the meantime is not the way forward. I will put out a post asking for help here -- this could be a good way for an interested contributor to get more familiar with the codebase (and possibly with FWIW it took me 90 seconds to create #5941: git checkout master
git checkout -b cc-quiet
git checkout origin/cbind-merge-list -- .dev/cc.R
git commit -m 'new quiet option for cc()'
git push origin cc-quiet |
Yes, as long as related code is in its own file then there is nothing to do really, as your commands nicely present. |
bool perhapsDataTable(SEXP x) { | ||
return isDataTable(x) || isDataFrame(x) || isDataList(x); | ||
} | ||
SEXP perhapsDataTableR(SEXP x) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Surely ScalarLogical(perhapsDataTable(x))
is fine?
cbindlist
add function cbindlist #2576: very much likecbind()
ordata.table()
, but better control over copies, retains key/index.mergelist
[R-Forge #2461] Faster version of Reduce(merge, list(DT1,DT2,DT3,...)) called mergelist (a la rbindlist) #599: supportsleft, inner, full, right, semi, anti, cross
joins andmult
argument, implemented aroundbmerge
:dtmerge
callsbmerge
and process its results to give indices ready to subsetx
andi
tables. IMOdtmerge
could be used to simplify[.data.table
as well.mergepair
callsdtmerge
, gets indices, use them to expand or subsetx
andi
, then it stacks them usingcbindlist
.mergelist
loops aroundmergepair
, avoids any copies inside, once joining finished, it then ensure that expected objects are copied.mult="error"
[R-Forge #1654] Allow mult="error" #655: implemented insidebmerge
partially, because not hooked in
[.data.table
, this should go as a separate PR:[.data.table
#1717extra testing vs SQLite db, for
how="left|inner|full|right"