as.data.table.list centralized (#3471)

Rdatatable · Jun 14, 2019 · 04eeb30 · 04eeb30
1 parent be1fae5
commit 04eeb30
Show file tree

Hide file tree

Showing 8 changed files with 213 additions and 240 deletions.
diff --git a/NEWS.md b/NEWS.md
@@ -92,6 +92,8 @@
 
 15. `DT[order(col)[1:5], ...]` (i.e. where `i` is a compound expression involving `order()`) is now optimized to use `data.table`'s multithreaded `forder`, [#1921](https://github.com/Rdatatable/data.table/issues/1921). This example is not a fully optimal top-N query since the full ordering is still computed. The improvement is that the call to `order()` is computed faster for any `i` expression using `order`.
 
+16. `as.data.table` now unpacks columns in a `data.frame` which are themselves a `data.frame`. This need arises when parsing JSON, a corollary in [#3369](https://github.com/Rdatatable/data.table/issues/3369#issuecomment-462662752). `data.table` does not allow columns to be objects which themselves have columns (such as `matrix` and `data.frame`), unlike `data.frame` which does. Bug fix 19 in v1.12.2 (see below) added a helpful error (rather than segfault) to detect such invalid `data.table`, and promised that `as.data.table()` would unpack these columns in the next release (i.e. this release) so that the invalid `data.table` is not created in the first place.
+
 #### BUG FIXES
 
 1. `first`, `last`, `head` and `tail` by group no longer error in some cases, [#2030](https://github.com/Rdatatable/data.table/issues/2030) [#3462](https://github.com/Rdatatable/data.table/issues/3462). Thanks to @franknarf1 for reporting.
@@ -134,7 +136,23 @@
 
 20. `c`, `seq` and `mean` of `ITime` objects now retain the `ITime` class via new `ITime` methods, [#3628](https://github.com/Rdatatable/data.table/issues/3628). Thanks @UweBlock for reporting. The `cut` and `split` methods for `ITime` have been removed since the default methods work, [#3630](https://github.com/Rdatatable/data.table/pull/3630).
 
-20. `as.data.table.array` now handles the case when some of the array's dimension names are `NULL`, [#3636](https://github.com/Rdatatable/data.table/issues/3636).
+21. `as.data.table.array` now handles the case when some of the array's dimension names are `NULL`, [#3636](https://github.com/Rdatatable/data.table/issues/3636).
+
+22. Adding a `list` column using `cbind`, `as.data.table`, or `data.table` now works rather than treating the `list` as if it were a set of columns, [#3471](https://github.com/Rdatatable/data.table/pull/3471). However, please note that using `:=` to add columns is preferred.
+
+    ```R
+    # cbind( data.table(1:2), list(c("a","b"),"a") )
+    #       V1     V2     NA    # v1.12.2 and before
+    #    <int> <char> <char>    # introduced invalid NA column name too
+    # 1:     1      a      a
+    # 2:     2      b      a
+    #
+    #       V1     V2           # v1.12.4+
+    #    <int> <list>
+    # 1:     1    a,b
+    # 2:     2      a
+    ```
+
 
 #### NOTES
 
@@ -231,7 +249,7 @@
 
 18. `cbind` with a null (0-column) `data.table` now works as expected, [#3445](https://github.com/Rdatatable/data.table/issues/3445). Thanks to @mb706 for reporting.
 
-19. Subsetting does a better job of catching a malformed `data.table` with error rather than segfault. A column may not be NULL, nor may a column be an object such as a data.frame or matrix which have columns. Thanks to a comment and reproducible example in [#3369](https://github.com/Rdatatable/data.table/issues/3369) from Drew Abbot which demonstrated the issue which arose from parsing JSON. The next release will enable `as.data.table` to unpack columns which are data.frame to support this use case.
+19. Subsetting does a better job of catching a malformed `data.table` with error rather than segfault. A column may not be NULL, nor may a column be an object which has columns (such as a `data.frame` or `matrix`). Thanks to a comment and reproducible example in [#3369](https://github.com/Rdatatable/data.table/issues/3369) from Drew Abbot which demonstrated the issue which arose from parsing JSON. The next release will enable `as.data.table` to unpack columns which are `data.frame` to support this use case.
 
 #### NOTES
 

diff --git a/R/as.data.table.R b/R/as.data.table.R
@@ -115,49 +115,75 @@ as.data.table.array = function(x, keep.rownames=FALSE, key=NULL, sorted=TRUE, va
   ans[]
 }
 
-as.data.table.list = function(x, keep.rownames=FALSE, key=NULL, ...) {
-  wn = sapply(x,is.null)
-  if (any(wn)) x = x[!wn]
-  if (!length(x)) return( null.data.table() )
-  # fix for #833, as.data.table.list with matrix/data.frame/data.table as a list element..
-  # TODO: move this entire logic (along with data.table() to C
-  for (i in seq_along(x)) {
-    dims = dim(x[[i]])
-    if (!is.null(dims)) {
-      ans = do.call("data.table", x)
-      setnames(ans, make.unique(names(ans)))
-      return(ans)
+as.data.table.list = function(x, keep.rownames=FALSE, key=NULL, check.names=FALSE, ...) {
+  n = length(x)
+  eachnrow = integer(n)          # vector of lengths of each column. may not be equal if silent repetition is required.
+  eachncol = integer(n)
+  missing.check.names = missing(check.names)
+  for (i in seq_len(n)) {
+    xi = x[[i]]
+    if (is.null(xi)) next    # eachncol already initialized to 0 by integer() above
+    if (!is.null(dim(xi)) && missing.check.names) check.names=TRUE
+    if ("POSIXlt" %chin% class(xi)) {
+      warning("POSIXlt column type detected and converted to POSIXct. We do not recommend use of POSIXlt at all because it uses 40 bytes to store one date.")
+      xi = x[[i]] = as.POSIXct(xi)
+    } else if (is.matrix(xi) || is.data.frame(xi)) {
+      if (!is.data.table(xi)) {
+        xi = x[[i]] = as.data.table(xi, keep.rownames=keep.rownames)  # we will never allow a matrix to be a column; always unpack the columns
+      }
+      # else avoid dispatching to as.data.table.data.table (which exists and copies)
+    } else if (is.table(xi)) {
+      xi = x[[i]] = as.data.table.table(xi, keep.rownames=keep.rownames)
+    } else if (is.function(xi)) {
+      xi = x[[i]] = list(xi)
     }
+    eachnrow[i] = NROW(xi)    # for a vector (including list() columns) returns the length
+    eachncol[i] = NCOL(xi)    # for a vector returns 1
   }
-  n = vapply(x, length, 0L)
-  mn = max(n)
-  x = copy(x)
-  idx = which(n < mn)
-  if (length(idx)) {
-    for (i in idx) {
-      # any is.null(x[[i]]) were removed above, otherwise warning when a list element is NULL
-      if (inherits(x[[i]], "POSIXlt")) {
-        warning("POSIXlt column type detected and converted to POSIXct. We do not recommend use of POSIXlt at all because it uses 40 bytes to store one date.")
-        x[[i]] = as.POSIXct(x[[i]])
-      }
-      # Implementing FR #4813 - recycle with warning when nr %% nrows[i] != 0L
-      if (!n[i] && mn)
-        warning("Item ", i, " is of size 0 but maximum size is ", mn, ", therefore recycled with 'NA'")
-      else if (n[i] && mn %% n[i] != 0L)
-        warning("Item ", i, " is of size ", n[i], " but maximum size is ", mn, " (recycled leaving a remainder of ", mn%%n[i], " items)")
-      x[[i]] = rep(x[[i]], length.out=mn)
+  ncol = sum(eachncol)  # hence removes NULL items silently (no error or warning), #842.
+  if (ncol==0L) return(null.data.table())
+  nrow = max(eachnrow)
+  ans = vector("list",ncol)  # always return a new VECSXP
+  recycle = function(x, nrow) {
+    if (length(x)==nrow) {
+      return(copy(x))
+      # This copy used to be achieved via .Call(CcopyNamedInList,x) at the top of data.table(). It maintains pre-Rv3.1.0
+      # behavior, for now. See test 548.2. The copy() calls duplicate() at C level which (importantly) also expands ALTREP objects.
+      # TODO: port this as.data.table.list() to C and use MAYBE_REFERENCED(x) || ALTREP(x) to save some copies.
+      #       That saving used to be done by CcopyNamedInList but the copies happened again as well, so removing CcopyNamedInList is
+      #       not worse than before, and gets us in a better centralized place to port as.data.table.list to C and use MAYBE_REFERENCED
+      #       again in future.
     }
+    if (identical(x,list())) vector("list", nrow) else rep(x, length.out=nrow)   # new objects don't need copy
   }
-  # fix for #842
-  if (mn > 0L) {
-    nz = which(n > 0L)
-    xx = point(vector("list", length(nz)), seq_along(nz), x, nz)
-    if (!is.null(names(x)))
-      setattr(xx, 'names', names(x)[nz])
-    x = xx
+  vnames = character(ncol)
+  k = 1L
+  for(i in seq_len(n)) {
+    xi = x[[i]]
+    if (is.null(xi)) next
+    if (eachnrow[i]>1L && nrow%%eachnrow[i]!=0L)   # in future: eachnrow[i]!=nrow
+      warning("Item ", i, " has ", eachnrow[i], " rows but longest item has ", nrow, "; recycled with remainder.")
+    if (eachnrow[i]==0L && nrow>0L && is.atomic(xi))   # is.atomic to ignore list() since list() is a common way to initialize; let's not insist on list(NULL)
+      warning("Item ", i, " has 0 rows but longest item has ", nrow, "; filled with NA")  # the rep() in recycle() above creates the NA vector
+    if (is.data.table(xi)) {   # matrix and data.frame were coerced to data.table above
+      # vnames[[i]] = names(xi)  #if (nm!="" && n>1L) paste(nm, names(xi), sep=".") else names(xi)
+      for (j in seq_along(xi)) {
+        ans[[k]] = recycle(xi[[j]], nrow)
+        vnames[k] = names(xi)[j]
+        k = k+1L
+      }
+    } else {
+      nm = names(x)[i]
+      vnames[k] = if (length(nm) && !is.na(nm) && nm!="") nm else paste0("V",i)
+      ans[[k]] = recycle(xi, nrow)
+      k = k+1L
+    }
   }
-  setDT(x, key=key) # copy ensured above; also, setDT handles naming
-  x
+  if (any(vnames==".SD")) stop("A column may not be called .SD. That has special meaning.")
+  if (check.names) vnames = make.names(vnames, unique=TRUE)
+  setattr(ans, "names", vnames)
+  setDT(ans, key=key) # copy ensured above; also, setDT handles naming
+  ans
 }
 
 # don't retain classes before "data.frame" while converting
@@ -180,6 +206,10 @@ as.data.table.data.frame = function(x, keep.rownames=FALSE, key=NULL, ...) {
       setnames(ans, 'rn', keep.rownames[1L])
     return(ans)
   }
+  if (any(!sapply(x,is.atomic))) {
+    # a data.frame with a column that is data.frame needs to be expanded; test 2013.4
+    return(as.data.table.list(x, keep.rownames=keep.rownames, ...))
+  }
   ans = copy(x)  # TO DO: change this deep copy to be shallow.
   setattr(ans, "row.names", .set_row_names(nrow(x)))