cbindlist, mergelist #4370

jangorecki · 2020-04-10T10:01:57Z

cbindlist (add function cbindlist #2576): very much like cbind() or data.table(), but with better control over copies; retains key/index.
mergelist ([R-Forge #2461] Faster version of Reduce(merge, list(DT1,DT2,DT3,...)) called mergelist (a la rbindlist) #599): supports left, inner, full, right, semi, anti and cross joins and the mult argument; implemented around bmerge:
1. internal dtmerge calls bmerge and processes its results to give indices ready to subset the x and i tables. IMO dtmerge could be used to simplify [.data.table as well.
2. internal mergepair calls dtmerge, gets the indices, uses them to expand or subset x and i, then stacks them using cbindlist.
3. exported mergelist loops around mergepair, avoiding any copies inside; once joining has finished, it ensures that the expected objects are copied.
unfold C code for better codecov in the C files that have been touched in this PR
fix codecov gaps
mult="error" ([R-Forge #1654] Allow mult="error" #655): implemented inside bmerge
allow.cartesian more precise (allow.cartesian should be more precise #4383), only for equi-joins!
rename allow.cartesian (Rename allow.cartesian to allow.i.dups? #914)
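
For context, here is a minimal sketch of the `Reduce(merge, ...)` baseline named in the title of issue #599. It deliberately uses plain data.frames and base R's `merge()` rather than the `mergelist` API proposed in this PR; the table contents and the helper name `left_join_pair` are made up for illustration.

```r
# Three small tables sharing the key column "id" (illustrative data only).
t1 <- data.frame(id = 1:3, a = c("x", "y", "z"))
t2 <- data.frame(id = c(1L, 2L, 4L), b = c(10, 20, 40))
t3 <- data.frame(id = c(2L, 3L, 1L), flag = c(TRUE, FALSE, TRUE))

# Hypothetical helper: left-join one table onto the running result.
left_join_pair <- function(lhs, rhs) merge(lhs, rhs, by = "id", all.x = TRUE)

# Reduce() applies the pairwise join from left to right:
# ((t1 join t2) join t3)
res <- Reduce(left_join_pair, list(t1, t2, t3))
print(res)
```

Each step of the `Reduce()` materialises a full intermediate table, which is the overhead that the description above says `mergelist` avoids by deferring copies until joining has finished.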

Partially addressed, because not yet hooked into [.data.table; that should go in a separate PR:

cross join (allow cross join in [.data.table #1717)
semi join (Support x[!!y] anti-anti-join syntax #915, Add syntax for "subsetting join" #2158)

extra testing vs SQLite db, for how="left|inner|full|right"

Rscript inst/tests/sqlite.Rraw.manual

R/mergelist.R

src/mergelist.c

MichaelChirico · 2024-02-19T08:57:45Z

.dev/cc.R

move this to its own PR

MichaelChirico · 2024-02-19T09:00:46Z

NEWS.md

@@ -16,6 +16,8 @@

 2. `cedta()` now returns `FALSE` if `.datatable.aware = FALSE` is set in the calling environment, [#5654](https://github.com/Rdatatable/data.table/issues/5654).

+3. (add example here?) New functions `cbindlist` and `mergelist` have been implemented and exported. Works like `cbind`/`merge` but takes `list` of data.tables on input. `merge` happens in `Reduce` fashion. Supports `how` (_left_, _inner_, _full_, _right_, _semi_, _anti_, _cross_) joins and `mult` argument, closes [#599](https://github.com/Rdatatable/data.table/issues/599) and [#2576](https://github.com/Rdatatable/data.table/issues/2576).


self-citation here? "(add example here?)" also please address.

MichaelChirico · 2024-02-19T09:01:11Z

R/data.table.R

@@ -206,7 +206,7 @@ replace_dot_alias = function(e) {
    }
    return(x)
  }
-  if (!mult %chin% c("first","last","all")) stopf("mult argument can only be 'first', 'last' or 'all'")
+  if (!mult %chin% c("first","last","all","error")) stop("mult argument can only be 'first', 'last', 'all' or 'error'")


can this be moved to an antecedent PR as well?

MichaelChirico · 2024-02-19T09:03:01Z

R/mergelist.R

+  else if (!is.null(x) && !is.null(y)) {
+    if (length(x)>=length(y)) intersect(y, x) ## align order to shorter|rhs key
+    else intersect(x, y)
+  } else NULL # nocov ## internal error is being called later in mergepair


Suggested change

} else NULL # nocov ## internal error is being called later in mergepair

}

NULL # nocov ## internal error is being called later in mergepair
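
The "align order" comment in the hunk above relies on base R's `intersect()` returning its result in the order of the first argument. A small sketch of that behaviour, with made-up key vectors `x_key` and `y_key` standing in for the two tables' keys:

```r
x_key <- c("a", "b", "c")   # longer key (hypothetical)
y_key <- c("c", "a")        # shorter key (hypothetical)

# Same branch logic as the hunk under review: the shorter key goes first,
# so the common columns come back in the shorter key's order.
common <- if (length(x_key) >= length(y_key)) intersect(y_key, x_key) else intersect(x_key, y_key)
print(common)

# Swapping the arguments changes only the order, not the set of columns.
print(intersect(x_key, y_key))
```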

MichaelChirico · 2024-02-19T09:04:32Z

R/mergelist.R

+cbindlist = function(l, copy=TRUE) {
+  ans = .Call(Ccbindlist, l, copy)
+  if (anyDuplicated(names(ans))) { ## invalidate key and index
+    setattr(ans, "sorted", NULL)


why this non-standard way to reset key/index? shouldn't we re-use abstractions here rather than copy implementation details in multiple places?

MichaelChirico · 2024-02-19T09:07:19Z

R/mergelist.R

+  cols = colnamesInt(x, cols)
+  ans = union(keep, setdiff(cols, drop))
+  if (!retain.order) return(ans)
+  intersect(colnamesInt(x, NULL), ans)


colnamesInt(x, NULL)? Isn't that just seq_along(x)?
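
To make the question concrete, here is a base-R sketch of the selection logic in the hunk, working directly on integer column positions. `select_cols` and its arguments are hypothetical stand-ins, and the sketch assumes what the comment suggests, namely that `colnamesInt(x, NULL)` yields the same positions as `seq_along(x)`; nothing in this thread confirms that.

```r
# Hypothetical stand-in for the helper under review; 'cols', 'keep' and
# 'drop' are integer column positions.
select_cols <- function(x, cols, keep = integer(), drop = integer(), retain.order = TRUE) {
  ans <- union(keep, setdiff(cols, drop))
  if (!retain.order) return(ans)
  # Assumption: seq_along(x) replaces colnamesInt(x, NULL) here.
  intersect(seq_along(x), ans)
}

x <- data.frame(a = 1, b = 2, c = 3, d = 4)
unordered <- select_cols(x, cols = c(3L, 2L, 1L), keep = 4L, drop = 2L, retain.order = FALSE)
ordered   <- select_cols(x, cols = c(3L, 2L, 1L), keep = 4L, drop = 2L)
print(unordered)
print(ordered)
```

With `retain.order = TRUE` the final `intersect()` only reorders the selected positions to follow the table's own column order.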

MichaelChirico · 2024-02-19T09:10:27Z

R/mergelist.R

+    stop("internal error: void must be used with mult='error'") # nocov
+  if (how=="cross") { ## short-circuit bmerge results only for cross join
+    if (length(on) || mult!="all" || !join.many)
+      stop("cross join must be used with zero-length on, mult='all', join.many=TRUE")


switch to stopf() everywhere
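
For readers unfamiliar with the convention: `stopf()` is an internal data.table helper for raising errors from a format string. The sketch below is a hypothetical stand-in (`stopf_sketch`), not the package's implementation; it uses only base R's `gettextf()` and `stop()`, and the format placeholders are added purely to show the pattern.

```r
# Hypothetical stand-in: format (and potentially translate) the message,
# then raise it without the calling frame in the error text.
stopf_sketch <- function(fmt, ...) {
  stop(gettextf(fmt, ...), domain = NA, call. = FALSE)
}

# Capture the message instead of aborting, so the example runs to completion.
msg <- tryCatch(
  stopf_sketch("cross join must be used with zero-length on, mult='%s', join.many=%s", "all", "TRUE"),
  error = function(e) conditionMessage(e)
)
print(msg)
```

Keeping the format string as a single literal, rather than assembling it with `paste()`, is what lets translation tooling extract the message as a whole.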

MichaelChirico · 2024-02-19T09:11:29Z

R/mergelist.R

+    ans = bmerge(i, x, icols, xcols, roll=0, rollends=c(FALSE, TRUE), nomatch=nomatch, mult=mult, ops=rep.int(1L, length(on)), verbose=verbose)
+    if (void) { ## void=T is only for the case when we want raise error for mult='error', and that would happen in above line
+      return(invisible(NULL))
+    } else if (how=="semi" || how=="anti") { ## semi and anti short-circuit


Suggested change

} else if (how=="semi" || how=="anti") { ## semi and anti short-circuit

}

if (how=="semi" || how=="anti") { ## semi and anti short-circuit

MichaelChirico · 2024-02-19T09:12:40Z

R/mergelist.R

+      irows = which(if (how=="semi") ans$lens!=0L else ans$lens==0L) ## we will subset i rather than x, thus assign to irows, not to xrows
+      if (length(irows)==length(ans$lens)) irows = NULL
+      return(list(ans=ans, irows=irows))
+    } else if (mult=="all" && !ans$allLen1 && !join.many && ## join.many, like allow.cartesian, check


Suggested change

} else if (mult=="all" && !ans$allLen1 && !join.many && ## join.many, like allow.cartesian, check

}

if (mult=="all" && !ans$allLen1 && !join.many && ## join.many, like allow.cartesian, check

MichaelChirico · 2024-02-19T09:14:03Z

R/mergelist.R

+mergepair = function(lhs, rhs, on, how, mult, lhs.cols=names(lhs), rhs.cols=names(rhs), copy=TRUE, join.many=TRUE, verbose=FALSE) {
+  semianti = how=="semi" || how=="anti"
+  innerfull = how=="inner" || how=="full"
+  {


this extra nesting under { is unusual, any motivation?

MichaelChirico · 2024-02-19T09:15:02Z

R/mergelist.R

+          stop("'on' is missing and necessary key is not present")
+      }
+      if (any(bad.on <- !on %chin% names(lhs)))
+        stop(sprintf("'on' argument specify columns to join [%s] that are not present in LHS table [%s]", paste(on[bad.on], collapse=", "), paste(names(lhs), collapse=", ")))


Suggested change

stop(sprintf("'on' argument specify columns to join [%s] that are not present in LHS table [%s]", paste(on[bad.on], collapse=", "), paste(names(lhs), collapse=", ")))

stopf("'on' argument specify columns to join [%s] that are not present in LHS table [%s]", brackify(on[bad.on]), brackify(names(lhs)))

MichaelChirico · 2024-02-19T09:15:36Z

R/mergelist.R

+      if (any(bad.on <- !on %chin% names(lhs)))
+        stop(sprintf("'on' argument specify columns to join [%s] that are not present in LHS table [%s]", paste(on[bad.on], collapse=", "), paste(names(lhs), collapse=", ")))
+      if (any(bad.on <- !on %chin% names(rhs)))
+        stop(sprintf("'on' argument specify columns to join [%s] that are not present in RHS table [%s]", paste(on[bad.on], collapse=", "), paste(names(rhs), collapse=", ")))


Suggested change

stop(sprintf("'on' argument specify columns to join [%s] that are not present in RHS table [%s]", paste(on[bad.on], collapse=", "), paste(names(rhs), collapse=", ")))

stopf("'on' argument specify columns to join [%s] that are not present in RHS table [%s]", brackify(on[bad.on]), brackify(names(rhs)))

MichaelChirico · 2024-02-19T09:16:37Z

R/mergelist.R

+    cp.x = !is.null(ans$xrows)
+    ## ensure no duplicated column names in merge results
+    if (any(dup.i<-names(out.i) %chin% names(out.x)))
+      stop("merge result has duplicated column names, use 'cols' argument or rename columns in 'l' tables, duplicated column(s): ", paste(names(out.i)[dup.i], collapse=", "))


Suggested change

stop("merge result has duplicated column names, use 'cols' argument or rename columns in 'l' tables, duplicated column(s): ", paste(names(out.i)[dup.i], collapse=", "))

stopf("merge result has duplicated column names, use 'cols' argument or rename columns in 'l' tables, duplicated column(s): %s", brackify(names(out.i)[dup.i]))

MichaelChirico · 2024-02-19T09:17:33Z

R/mergelist.R

+  verbose = getOption("datatable.verbose")
+  if (verbose)
+    p = proc.time()[[3L]]
+  {


this lone { approach is especially misleading here given the preceding "naked" if() statement lacking its own braces

MichaelChirico · 2024-02-19T09:17:55Z

R/mergelist.R

+    out = if (!n) as.data.table(l) else l[[1L]]
+    if (copy) out = copy(out)
+    if (verbose)
+      cat(sprintf("mergelist: merging %d table(s), took %.3fs\n", n, proc.time()[[3L]]-p))


Suggested change

cat(sprintf("mergelist: merging %d table(s), took %.3fs\n", n, proc.time()[[3L]]-p))

catf("mergelist: merging %d table(s), took %.3fs\n", n, proc.time()[[3L]]-p)

MichaelChirico · 2024-02-19T09:18:19Z

R/mergelist.R

+  if (copy)
+    .Call(CcopyCols, out, colnamesInt(out, names(out.mem)[out.mem %chin% unique(unlist(l.mem, recursive=FALSE))]))
+  if (verbose)
+    cat(sprintf("mergelist: merging %d tables, took %.3fs\n", n, proc.time()[[3L]]-p))


Suggested change

cat(sprintf("mergelist: merging %d tables, took %.3fs\n", n, proc.time()[[3L]]-p))

catf("mergelist: merging %d tables, took %.3fs\n", n, proc.time()[[3L]]-p)

MichaelChirico · 2024-02-19T09:19:25Z

src/vecseq.c

-  if (!isInteger(x)) error(_("x must be an integer vector"));
-  if (!isInteger(len)) error(_("len must be an integer vector"));
-  if (LENGTH(x) != LENGTH(len)) error(_("x and len must be the same length"));
+  if (!isInteger(x))


Separate these changes to own PR?

MichaelChirico · 2024-02-19T09:24:25Z

As encouraged elsewhere I think this PR should be split up. There are a few tiny clean-up changes independent of mergelist/cbindlist that I've noted & can be filed alone.

More importantly, I think we can do two PRs here: (1) cbindlist() (2) mergelist(). The current diff is too unmanageable for one PR to get a quality review rather than a rubber stamp. It may make sense to further subdivide mergelist into implementing the different how= approaches sequentially, but splitting off cbindlist() first seems pretty manageable.

jangorecki · 2024-02-19T16:45:49Z

That makes sense, but it's quite a bit of work and I am not sure when I could do it. I think we should stop accepting new PRs until the current queue is cleared, as conflicts will only keep piling up.

MichaelChirico · 2024-02-19T21:25:17Z

> I think we should stop accepting new PRs until the current queue is cleared, as conflicts will only keep piling up.

I disagree. On the contrary, if you place such high priority on getting these PRs through to avoid conflicts, you should be the one to spend the effort on getting them ready for review. The current PR may be complete, but it is absolutely not in a state for review. Pausing progress elsewhere in the repo indefinitely in the meantime is not the way forward.

I will put out a post asking for help here -- this could be a good way for an interested contributor to get more familiar with the codebase (and possibly with git generally) and add a lot of value at the same time.

FWIW it took me 90 seconds to create #5941:

git checkout master
git checkout -b cc-quiet
git checkout origin/cbind-merge-list -- .dev/cc.R
git commit -m 'new quiet option for cc()'
git push origin cc-quiet

jangorecki · 2024-02-19T22:47:04Z

Yes, as long as related code is in its own file then there is nothing to do really, as your commands nicely present.

HughParsonage · 2024-02-20T07:59:23Z

src/utils.c

+bool perhapsDataTable(SEXP x) {
+  return isDataTable(x) || isDataFrame(x) || isDataList(x);
+}
+SEXP perhapsDataTableR(SEXP x) {


Surely ScalarLogical(perhapsDataTable(x)) is fine?

cbindlist

3e72c8d

jangorecki added the WIP label Apr 10, 2020

jangorecki added 2 commits April 10, 2020 11:32

add cbind by reference, timing

a915832

R prototype of mergelist

05dd562

jangorecki changed the title from "cbindlist" to "cbindlist, mergelist" Apr 10, 2020

jangorecki added 2 commits April 10, 2020 14:31

wording

cba5bc1

use lower overhead funs

1edf4d3

MichaelChirico reviewed Apr 10, 2020

R/mergelist.R (outdated)

MichaelChirico reviewed Apr 10, 2020

R/mergelist.R (outdated)

jangorecki mentioned this pull request Apr 10, 2020

[R-Forge #2461] Faster version of Reduce(merge, list(DT1,DT2,DT3,...)) called mergelist (a la rbindlist) #599

Open

jangorecki commented Apr 10, 2020

src/mergelist.c (outdated)

jangorecki commented Apr 10, 2020

src/mergelist.c (outdated)

This was linked to issues Apr 11, 2020

[R-Forge #2461] Faster version of Reduce(merge, list(DT1,DT2,DT3,...)) called mergelist (a la rbindlist) #599

Open

add function cbindlist #2576

Open

jangorecki added 11 commits April 16, 2020 12:29

stick to int32 for now, correct R_alloc

36bbd25

bmerge C refactor for codecov and one loop for speed

7d51dd6

address revealed codecov gaps

0437da5

refactor vecseq for codecov

e287213

seqexp helper, some alloccol export on C

5dc07bd

bmerge codecov, types handled in R bmerge already

a4d124e

better comment seqexp

40d3bfe

bmerge mult=error #655

beffe39

multiple new C utils

4e211a1

swap if branches

fbddcd6

explain new C utils

01b2f9d

jangorecki linked an issue Apr 17, 2020 that may be closed by this pull request

[R-Forge #1654] Allow mult="error" #655

Closed

jangorecki added 4 commits April 17, 2020 15:50

comments mostly

c8e070b

reduce conflicts to PR #4386

3004748

comment C code

cf73fcf

address multiple matches during update-on-join #3747

b64c0c3

MichaelChirico reviewed Feb 19, 2024

Merge branch 'master' into cbind-merge-list

53b9b0d

jangorecki requested a review from HughParsonage as a code owner February 19, 2024 08:58

Update NEWS.md

c6add42

MichaelChirico reviewed Feb 19, 2024

MichaelChirico mentioned this pull request Feb 19, 2024

New quiet option for cc() #5941

Merged

Merge branch 'master' into cbind-merge-list

ec1973f

HughParsonage reviewed Feb 20, 2024

m-muecke mentioned this pull request Jun 16, 2024

map_dtc is unreasonably slow when .f returns data.table mlr-org/mlr3misc#78

Open

MichaelChirico modified the milestones: 1.16.0, 1.17.0 Jul 15, 2024
