Skip to content

Commit

Permalink
as.data.table.list centralized (#3471)
Browse files Browse the repository at this point in the history
  • Loading branch information
mattdowle committed Jun 14, 2019
1 parent be1fae5 commit 04eeb30
Show file tree
Hide file tree
Showing 8 changed files with 213 additions and 240 deletions.
22 changes: 20 additions & 2 deletions NEWS.md
Expand Up @@ -92,6 +92,8 @@

15. `DT[order(col)[1:5], ...]` (i.e. where `i` is a compound expression involving `order()`) is now optimized to use `data.table`'s multithreaded `forder`, [#1921](https://github.com/Rdatatable/data.table/issues/1921). This example is not a fully optimal top-N query since the full ordering is still computed. The improvement is that the call to `order()` is computed faster for any `i` expression using `order`.

16. `as.data.table` now unpacks columns in a `data.frame` which are themselves a `data.frame`. This need arises when parsing JSON, a corollary in [#3369](https://github.com/Rdatatable/data.table/issues/3369#issuecomment-462662752). `data.table` does not allow columns to be objects which themselves have columns (such as `matrix` and `data.frame`), unlike `data.frame` which does. Bug fix 19 in v1.12.2 (see below) added a helpful error (rather than segfault) to detect such invalid `data.table`, and promised that `as.data.table()` would unpack these columns in the next release (i.e. this release) so that the invalid `data.table` is not created in the first place.

#### BUG FIXES

1. `first`, `last`, `head` and `tail` by group no longer error in some cases, [#2030](https://github.com/Rdatatable/data.table/issues/2030) [#3462](https://github.com/Rdatatable/data.table/issues/3462). Thanks to @franknarf1 for reporting.
Expand Down Expand Up @@ -134,7 +136,23 @@

20. `c`, `seq` and `mean` of `ITime` objects now retain the `ITime` class via new `ITime` methods, [#3628](https://github.com/Rdatatable/data.table/issues/3628). Thanks @UweBlock for reporting. The `cut` and `split` methods for `ITime` have been removed since the default methods work, [#3630](https://github.com/Rdatatable/data.table/pull/3630).

20. `as.data.table.array` now handles the case when some of the array's dimension names are `NULL`, [#3636](https://github.com/Rdatatable/data.table/issues/3636).
21. `as.data.table.array` now handles the case when some of the array's dimension names are `NULL`, [#3636](https://github.com/Rdatatable/data.table/issues/3636).

22. Adding a `list` column using `cbind`, `as.data.table`, or `data.table` now works rather than treating the `list` as if it were a set of columns, [#3471](https://github.com/Rdatatable/data.table/pull/3471). However, please note that using `:=` to add columns is preferred.

```R
# cbind( data.table(1:2), list(c("a","b"),"a") )
# V1 V2 NA # v1.12.2 and before
# <int> <char> <char> # introduced invalid NA column name too
# 1: 1 a a
# 2: 2 b a
#
# V1 V2 # v1.12.4+
# <int> <list>
# 1: 1 a,b
# 2: 2 a
```


#### NOTES

Expand Down Expand Up @@ -231,7 +249,7 @@

18. `cbind` with a null (0-column) `data.table` now works as expected, [#3445](https://github.com/Rdatatable/data.table/issues/3445). Thanks to @mb706 for reporting.

19. Subsetting does a better job of catching a malformed `data.table` with error rather than segfault. A column may not be NULL, nor may a column be an object such as a data.frame or matrix which have columns. Thanks to a comment and reproducible example in [#3369](https://github.com/Rdatatable/data.table/issues/3369) from Drew Abbot which demonstrated the issue which arose from parsing JSON. The next release will enable `as.data.table` to unpack columns which are data.frame to support this use case.
19. Subsetting does a better job of catching a malformed `data.table` with error rather than segfault. A column may not be NULL, nor may a column be an object which has columns (such as a `data.frame` or `matrix`). Thanks to a comment and reproducible example in [#3369](https://github.com/Rdatatable/data.table/issues/3369) from Drew Abbot which demonstrated the issue which arose from parsing JSON. The next release will enable `as.data.table` to unpack columns which are `data.frame` to support this use case.

#### NOTES

Expand Down
106 changes: 68 additions & 38 deletions R/as.data.table.R
Expand Up @@ -115,49 +115,75 @@ as.data.table.array = function(x, keep.rownames=FALSE, key=NULL, sorted=TRUE, va
ans[]
}

as.data.table.list = function(x, keep.rownames=FALSE, key=NULL, ...) {
wn = sapply(x,is.null)
if (any(wn)) x = x[!wn]
if (!length(x)) return( null.data.table() )
# fix for #833, as.data.table.list with matrix/data.frame/data.table as a list element..
# TODO: move this entire logic (along with data.table() to C
for (i in seq_along(x)) {
dims = dim(x[[i]])
if (!is.null(dims)) {
ans = do.call("data.table", x)
setnames(ans, make.unique(names(ans)))
return(ans)
as.data.table.list = function(x, keep.rownames=FALSE, key=NULL, check.names=FALSE, ...) {
n = length(x)
eachnrow = integer(n) # vector of lengths of each column. may not be equal if silent repetition is required.
eachncol = integer(n)
missing.check.names = missing(check.names)
for (i in seq_len(n)) {
xi = x[[i]]
if (is.null(xi)) next # eachncol already initialized to 0 by integer() above
if (!is.null(dim(xi)) && missing.check.names) check.names=TRUE
if ("POSIXlt" %chin% class(xi)) {
warning("POSIXlt column type detected and converted to POSIXct. We do not recommend use of POSIXlt at all because it uses 40 bytes to store one date.")
xi = x[[i]] = as.POSIXct(xi)
} else if (is.matrix(xi) || is.data.frame(xi)) {
if (!is.data.table(xi)) {
xi = x[[i]] = as.data.table(xi, keep.rownames=keep.rownames) # we will never allow a matrix to be a column; always unpack the columns
}
# else avoid dispatching to as.data.table.data.table (which exists and copies)
} else if (is.table(xi)) {
xi = x[[i]] = as.data.table.table(xi, keep.rownames=keep.rownames)
} else if (is.function(xi)) {
xi = x[[i]] = list(xi)
}
eachnrow[i] = NROW(xi) # for a vector (including list() columns) returns the length
eachncol[i] = NCOL(xi) # for a vector returns 1
}
n = vapply(x, length, 0L)
mn = max(n)
x = copy(x)
idx = which(n < mn)
if (length(idx)) {
for (i in idx) {
# any is.null(x[[i]]) were removed above, otherwise warning when a list element is NULL
if (inherits(x[[i]], "POSIXlt")) {
warning("POSIXlt column type detected and converted to POSIXct. We do not recommend use of POSIXlt at all because it uses 40 bytes to store one date.")
x[[i]] = as.POSIXct(x[[i]])
}
# Implementing FR #4813 - recycle with warning when nr %% nrows[i] != 0L
if (!n[i] && mn)
warning("Item ", i, " is of size 0 but maximum size is ", mn, ", therefore recycled with 'NA'")
else if (n[i] && mn %% n[i] != 0L)
warning("Item ", i, " is of size ", n[i], " but maximum size is ", mn, " (recycled leaving a remainder of ", mn%%n[i], " items)")
x[[i]] = rep(x[[i]], length.out=mn)
ncol = sum(eachncol) # hence removes NULL items silently (no error or warning), #842.
if (ncol==0L) return(null.data.table())
nrow = max(eachnrow)
ans = vector("list",ncol) # always return a new VECSXP
recycle = function(x, nrow) {
if (length(x)==nrow) {
return(copy(x))
# This copy used to be achieved via .Call(CcopyNamedInList,x) at the top of data.table(). It maintains pre-Rv3.1.0
# behavior, for now. See test 548.2. The copy() calls duplicate() at C level which (importantly) also expands ALTREP objects.
# TODO: port this as.data.table.list() to C and use MAYBE_REFERENCED(x) || ALTREP(x) to save some copies.
# That saving used to be done by CcopyNamedInList but the copies happened again as well, so removing CcopyNamedInList is
# not worse than before, and gets us in a better centralized place to port as.data.table.list to C and use MAYBE_REFERENCED
# again in future.
}
if (identical(x,list())) vector("list", nrow) else rep(x, length.out=nrow) # new objects don't need copy
}
# fix for #842
if (mn > 0L) {
nz = which(n > 0L)
xx = point(vector("list", length(nz)), seq_along(nz), x, nz)
if (!is.null(names(x)))
setattr(xx, 'names', names(x)[nz])
x = xx
vnames = character(ncol)
k = 1L
for(i in seq_len(n)) {
xi = x[[i]]
if (is.null(xi)) next
if (eachnrow[i]>1L && nrow%%eachnrow[i]!=0L) # in future: eachnrow[i]!=nrow
warning("Item ", i, " has ", eachnrow[i], " rows but longest item has ", nrow, "; recycled with remainder.")
if (eachnrow[i]==0L && nrow>0L && is.atomic(xi)) # is.atomic to ignore list() since list() is a common way to initialize; let's not insist on list(NULL)
warning("Item ", i, " has 0 rows but longest item has ", nrow, "; filled with NA") # the rep() in recycle() above creates the NA vector
if (is.data.table(xi)) { # matrix and data.frame were coerced to data.table above
# vnames[[i]] = names(xi) #if (nm!="" && n>1L) paste(nm, names(xi), sep=".") else names(xi)
for (j in seq_along(xi)) {
ans[[k]] = recycle(xi[[j]], nrow)
vnames[k] = names(xi)[j]
k = k+1L
}
} else {
nm = names(x)[i]
vnames[k] = if (length(nm) && !is.na(nm) && nm!="") nm else paste0("V",i)
ans[[k]] = recycle(xi, nrow)
k = k+1L
}
}
setDT(x, key=key) # copy ensured above; also, setDT handles naming
x
if (any(vnames==".SD")) stop("A column may not be called .SD. That has special meaning.")
if (check.names) vnames = make.names(vnames, unique=TRUE)
setattr(ans, "names", vnames)
setDT(ans, key=key) # copy ensured above; also, setDT handles naming
ans
}

# don't retain classes before "data.frame" while converting
Expand All @@ -180,6 +206,10 @@ as.data.table.data.frame = function(x, keep.rownames=FALSE, key=NULL, ...) {
setnames(ans, 'rn', keep.rownames[1L])
return(ans)
}
if (any(!sapply(x,is.atomic))) {
# a data.frame with a column that is data.frame needs to be expanded; test 2013.4
return(as.data.table.list(x, keep.rownames=keep.rownames, ...))
}
ans = copy(x) # TO DO: change this deep copy to be shallow.
setattr(ans, "row.names", .set_row_names(nrow(x)))

Expand Down

0 comments on commit 04eeb30

Please sign in to comment.