Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

cbindlist, mergelist #4370

Open
wants to merge 101 commits into
base: master
Choose a base branch
from
Open

cbindlist, mergelist #4370

wants to merge 101 commits into from

Conversation

jangorecki
Copy link
Member

@jangorecki jangorecki commented Apr 10, 2020

partially, because not hooked in [.data.table, this should go as a separate PR:


extra testing vs SQLite db, for how="left|inner|full|right"

Rscript inst/tests/sqlite.Rraw.manual

@jangorecki jangorecki added the WIP label Apr 10, 2020
@jangorecki jangorecki changed the title cbindlist cbindlist, mergelist Apr 10, 2020
R/mergelist.R Outdated Show resolved Hide resolved
R/mergelist.R Outdated Show resolved Hide resolved
src/mergelist.c Outdated Show resolved Hide resolved
src/mergelist.c Outdated Show resolved Hide resolved
@jangorecki jangorecki linked an issue Apr 17, 2020 that may be closed by this pull request
.dev/cc.R Outdated
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

move this to its own PR

@@ -16,6 +16,8 @@

2. `cedta()` now returns `FALSE` if `.datatable.aware = FALSE` is set in the calling environment, [#5654](https://github.com/Rdatatable/data.table/issues/5654).

3. (add example here?) New functions `cbindlist` and `mergelist` have been implemented and exported. Works like `cbind`/`merge` but takes `list` of data.tables on input. `merge` happens in `Reduce` fashion. Supports `how` (_left_, _inner_, _full_, _right_, _semi_, _anti_, _cross_) joins and `mult` argument, closes [#599](https://github.com/Rdatatable/data.table/issues/599) and [#2576](https://github.com/Rdatatable/data.table/issues/2576).
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

self-citation here?

(add example here?)

also please address.

@@ -206,7 +206,7 @@ replace_dot_alias = function(e) {
}
return(x)
}
if (!mult %chin% c("first","last","all")) stopf("mult argument can only be 'first', 'last' or 'all'")
if (!mult %chin% c("first","last","all","error")) stop("mult argument can only be 'first', 'last', 'all' or 'error'")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can this be moved to an antecedent PR as well?

else if (!is.null(x) && !is.null(y)) {
if (length(x)>=length(y)) intersect(y, x) ## align order to shorter|rhs key
else intersect(x, y)
} else NULL # nocov ## internal error is being called later in mergepair
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
} else NULL # nocov ## internal error is being called later in mergepair
}
NULL # nocov ## internal error is being called later in mergepair

cbindlist = function(l, copy=TRUE) {
ans = .Call(Ccbindlist, l, copy)
if (anyDuplicated(names(ans))) { ## invalidate key and index
setattr(ans, "sorted", NULL)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why this non-standard way to reset key/index? shouldn't we re-use abstractions here rather than copy implementation details in multiple places?

cols = colnamesInt(x, cols)
ans = union(keep, setdiff(cols, drop))
if (!retain.order) return(ans)
intersect(colnamesInt(x, NULL), ans)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

colnamesInt(x, NULL)? Isn't that just seq_along(x)?

stop("internal error: void must be used with mult='error'") # nocov
if (how=="cross") { ## short-circuit bmerge results only for cross join
if (length(on) || mult!="all" || !join.many)
stop("cross join must be used with zero-length on, mult='all', join.many=TRUE")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

switch to stopf() everywhere

ans = bmerge(i, x, icols, xcols, roll=0, rollends=c(FALSE, TRUE), nomatch=nomatch, mult=mult, ops=rep.int(1L, length(on)), verbose=verbose)
if (void) { ## void=T is only for the case when we want raise error for mult='error', and that would happen in above line
return(invisible(NULL))
} else if (how=="semi" || how=="anti") { ## semi and anti short-circuit
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
} else if (how=="semi" || how=="anti") { ## semi and anti short-circuit
}
if (how=="semi" || how=="anti") { ## semi and anti short-circuit

irows = which(if (how=="semi") ans$lens!=0L else ans$lens==0L) ## we will subset i rather than x, thus assign to irows, not to xrows
if (length(irows)==length(ans$lens)) irows = NULL
return(list(ans=ans, irows=irows))
} else if (mult=="all" && !ans$allLen1 && !join.many && ## join.many, like allow.cartesian, check
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
} else if (mult=="all" && !ans$allLen1 && !join.many && ## join.many, like allow.cartesian, check
}
if (mult=="all" && !ans$allLen1 && !join.many && ## join.many, like allow.cartesian, check

mergepair = function(lhs, rhs, on, how, mult, lhs.cols=names(lhs), rhs.cols=names(rhs), copy=TRUE, join.many=TRUE, verbose=FALSE) {
semianti = how=="semi" || how=="anti"
innerfull = how=="inner" || how=="full"
{
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this extra nesting under { is unusual, any motivation?

stop("'on' is missing and necessary key is not present")
}
if (any(bad.on <- !on %chin% names(lhs)))
stop(sprintf("'on' argument specify columns to join [%s] that are not present in LHS table [%s]", paste(on[bad.on], collapse=", "), paste(names(lhs), collapse=", ")))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
stop(sprintf("'on' argument specify columns to join [%s] that are not present in LHS table [%s]", paste(on[bad.on], collapse=", "), paste(names(lhs), collapse=", ")))
stopf("'on' argument specify columns to join [%s] that are not present in LHS table [%s]", brackify(on[bad.on]), brackify(names(lhs)))

if (any(bad.on <- !on %chin% names(lhs)))
stop(sprintf("'on' argument specify columns to join [%s] that are not present in LHS table [%s]", paste(on[bad.on], collapse=", "), paste(names(lhs), collapse=", ")))
if (any(bad.on <- !on %chin% names(rhs)))
stop(sprintf("'on' argument specify columns to join [%s] that are not present in RHS table [%s]", paste(on[bad.on], collapse=", "), paste(names(rhs), collapse=", ")))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
stop(sprintf("'on' argument specify columns to join [%s] that are not present in RHS table [%s]", paste(on[bad.on], collapse=", "), paste(names(rhs), collapse=", ")))
stopf("'on' argument specify columns to join [%s] that are not present in RHS table [%s]", brackify(on[bad.on]), brackify(names(rhs)))

cp.x = !is.null(ans$xrows)
## ensure no duplicated column names in merge results
if (any(dup.i<-names(out.i) %chin% names(out.x)))
stop("merge result has duplicated column names, use 'cols' argument or rename columns in 'l' tables, duplicated column(s): ", paste(names(out.i)[dup.i], collapse=", "))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
stop("merge result has duplicated column names, use 'cols' argument or rename columns in 'l' tables, duplicated column(s): ", paste(names(out.i)[dup.i], collapse=", "))
stopf("merge result has duplicated column names, use 'cols' argument or rename columns in 'l' tables, duplicated column(s): %s", brackify(names(out.i)[dup.i]))

verbose = getOption("datatable.verbose")
if (verbose)
p = proc.time()[[3L]]
{
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this lone { approach is especially misleading here given the preceding "naked" if() statement lacking its own braces

out = if (!n) as.data.table(l) else l[[1L]]
if (copy) out = copy(out)
if (verbose)
cat(sprintf("mergelist: merging %d table(s), took %.3fs\n", n, proc.time()[[3L]]-p))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
cat(sprintf("mergelist: merging %d table(s), took %.3fs\n", n, proc.time()[[3L]]-p))
catf("mergelist: merging %d table(s), took %.3fs\n", n, proc.time()[[3L]]-p)

if (copy)
.Call(CcopyCols, out, colnamesInt(out, names(out.mem)[out.mem %chin% unique(unlist(l.mem, recursive=FALSE))]))
if (verbose)
cat(sprintf("mergelist: merging %d tables, took %.3fs\n", n, proc.time()[[3L]]-p))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
cat(sprintf("mergelist: merging %d tables, took %.3fs\n", n, proc.time()[[3L]]-p))
catf("mergelist: merging %d tables, took %.3fs\n", n, proc.time()[[3L]]-p)

if (!isInteger(x)) error(_("x must be an integer vector"));
if (!isInteger(len)) error(_("len must be an integer vector"));
if (LENGTH(x) != LENGTH(len)) error(_("x and len must be the same length"));
if (!isInteger(x))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Separate these changes to own PR?

@MichaelChirico
Copy link
Member

As encouraged elsewhere I think this PR should be split up. There are a few tiny clean-up changes independent of mergelist/cbindlist that I've noted & can be filed alone.

More importantly, I think we can do two PRs here: (1) cbindlist() (2) mergelist(). The current diff is quite unmanageable for one PR for giving quality review / not just rubber-stamping. It may make sense to further subdivide mergelist into implementing different how= approaches sequentially, but splitting of cbindlist() first seems pretty manageable.

@jangorecki
Copy link
Member Author

jangorecki commented Feb 19, 2024

That make sense but it's quite a bit of work and I am not sure when I could do it. I think we should stop accepting new PRs till current queue is not cleared yet as conflicts will be only piling up.

@MichaelChirico
Copy link
Member

I think we should stop accepting new PRs till current queue is not cleared yet as conflicts will be only piling up.

I disagree, on the contrary, if you have such high priority on getting these PRs through to avoid conflicts, you should be the one to spend the effort on getting them ready for review. The current PR may be complete but it is absolutely not in a state for review. Pausing progress elsewhere in the repo indefinitely in the meantime is not the way forward.

I will put out a post asking for help here -- this could be a good way for an interested contributor to get more familiar with the codebase (and possibly with git generally) and add a lot of value at the same time.

FWIW it took me 90 seconds to create #5941:

git checkout master
git checkout -b cc-quiet
git checkout origin/cbind-merge-list -- .dev/cc.R
git commit -m 'new quiet option for cc()'
git push origin cc-quiet

@jangorecki
Copy link
Member Author

Yes, as long as related code is in its own file then there is nothing to do really, as your commands nicely present.

bool perhapsDataTable(SEXP x) {
return isDataTable(x) || isDataFrame(x) || isDataList(x);
}
SEXP perhapsDataTableR(SEXP x) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Surely ScalarLogical(perhapsDataTable(x)) is fine?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
5 participants