
as.data.table.array - convert multidimensional array into data.table #1418

Closed
jangorecki opened this issue Oct 30, 2015 · 10 comments
@jangorecki commented Oct 30, 2015

FR for converting a multidimensional array to a data.table.
The logic behind the conversion is to look up the value in the array for each combination of dimensions. The rationale is not only the similar subsetting API for array and data.table (see the examples below) but also the underlying organization of the data. It basically reduces the array dimensions to a tabular structure, keeping all the relations between dimensions and the corresponding value of a measure - so it is lossless.
The solution below is likely to be inefficient, as it looks up a value in the array for each group. The j argument may look scary, but it simply builds the call .(value = x[color, year, country]) to subset the x array for each group.

library(data.table)
set.seed(1)

# array
ar = array(rnorm(8,10,5), rep(2,3), dimnames = list(color = c("green","red"), year = c("2014","2015"), country = c("UK","IN")))
ar["green","2015",]
ar["green",c("2014","2015"),]

# data.table
as.data.table.array = function(x) {
    do.call(CJ, dimnames(x))[
        , .(value = eval(as.call(lapply(c("[", "x", names(dimnames(x))), as.symbol)))),
        keyby = c(names(dimnames(x)))
    ]
}
dt = as.data.table.array(ar)
dt[J("green","2015")]
dt[J("green", c("2014","2015"))]
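For illustration, the call that j assembles can be inspected on its own (a minimal sketch using the ar array defined above):

```r
# Build the indexing call from the dimension names: each name becomes a
# symbol, and as.call() turns the list into the call x[color, year, country].
dn <- names(dimnames(ar))                            # "color" "year" "country"
j_call <- as.call(lapply(c("[", "x", dn), as.symbol))
j_call
# x[color, year, country]
```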

update after merge: http://stackoverflow.com/questions/11141406/reshaping-an-array-to-data-frame

@jangorecki commented Nov 4, 2015

Already have it well managed in a separate project.

@jangorecki jangorecki closed this Nov 4, 2015
@jangorecki commented Mar 22, 2016

Reopening as it is worth improving. Current state:

library(data.table)
x = array(c(1, 0, 0, 2, 0, 0, 0, 3), dim=c(2, 2, 2))
as.data.frame(x)
#  V1 V2 V3 V4
#1  1  0  0  0
#2  0  2  0  3
as.data.table(x)
#   x
#1: 1
#2: 0
#3: 0
#4: 2
#5: 0
#6: 0
#7: 0
#8: 3

I would NOT aim for consistency with data.frame here, as it doesn't really provide useful output for arrays.

new.as.data.table.array = function(x) {
    d = dim(x)
    dn = dimnames(x)
    if (is.null(dn)) dn = lapply(d, seq.int)
    r = do.call(CJ, c(dn, list(sorted=TRUE, unique=TRUE)))
    dim.cols = copy(names(r))
    jj = as.call(list(
        as.name(":="),
        "value",
        as.call(lapply(c("[","x", dim.cols), as.symbol)) # lookup to 'x' array for each row
    )) # `:=`("value", x[V1, V2, V3])
    r[, eval(jj), by=c(dim.cols)][]
}
new.as.data.table.array(x)
#   V1 V2 V3 value
#1:  1  1  1     1
#2:  1  1  2     0
#3:  1  2  1     0
#4:  1  2  2     0
#5:  2  1  1     0
#6:  2  1  2     0
#7:  2  2  1     2
#8:  2  2  2     3

It would handle the use case described in the previous comments:

set.seed(1)
# array
x = array(rnorm(8,10,5), rep(2,3), dimnames = list(color = c("green","red"), year = c("2014","2015"), country = c("UK","IN")))
x["green","2015",]
#      UK       IN 
#17.55891 15.62465 
x["green",c("2014","2015"),]
#      country
#year         UK        IN
#  2014 12.87891  6.893797
#  2015 17.55891 15.624655

dt = new.as.data.table.array(x)
dt[J("green","2015")]
#   color year country    value
#1: green 2015      IN 15.62465
#2: green 2015      UK 17.55891
dt[J("green", c("2014","2015"))]
#   color year country     value
#1: green 2014      IN  6.893797
#2: green 2014      UK 12.878907
#3: green 2015      IN 15.624655
#4: green 2015      UK 17.558906
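As a quick sanity check that the conversion is lossless, continuing from the snippet above:

```r
# Keyed lookup into the converted table must agree with direct array
# subsetting; 'dt' and 'x' are from the snippet above.
stopifnot(isTRUE(all.equal(
    dt[J("red", "2014", "IN"), value],
    x["red", "2014", "IN"]
)))
```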

Any feedback on this draft is welcome.

@jangorecki jangorecki reopened this Mar 22, 2016
@jangorecki jangorecki self-assigned this Mar 22, 2016
@MichaelChirico commented Mar 22, 2016

Maybe better naming than V1:V3? Not sure how standard i, j, k is; perhaps dim_1:dim_3?

Like the idea though.

jangorecki added a commit that referenced this issue Mar 22, 2016
@jangorecki commented Mar 22, 2016

@MichaelChirico as.data.table.* needs to be a fairly low-level conversion, with as little data.table metadata as possible. If the source (array) doesn't have names, I think it is better to use data.table's default ones: V1:V3.

Just pushed an RC version, so feedback on it (or some new tests) is welcome; after a while I will rebase it onto master.
https://github.com/Rdatatable/data.table/compare/as.data.table.array

In summary:

  • arrays don't scale to more dimensions and sparse data, due to the cartesian product of dimensions.
  • as.data.table.array gives the ability to keep arrays in a sparse way, modelling a multidimensional array in a tabular structure the way I believe it should be modelled.

@arunsrinivasan FYI:
The worst thing about that PR is that it is not performance focused: for each combination of dimensions in the cartesian product we make a lookup for a value into the input x array, roughly dt[, value := x[a, b], by = c("a","b")]. It may not even need to scale better, given the poor memory scalability of arrays.
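To illustrate the sparsity point, a sketch assuming the new.as.data.table.array draft from the earlier comment: once the array is tabular, rows for empty cells can simply be dropped.

```r
library(data.table)
x = array(c(1, 0, 0, 2, 0, 0, 0, 3), dim = c(2, 2, 2))
dt = new.as.data.table.array(x)   # draft method from the earlier comment
# keep only the non-zero cells: 3 rows instead of the full
# cartesian product of 8
sparse = dt[value != 0]
```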

@MichaelChirico commented Mar 23, 2016

I guess we're saving the key argument for #890?

Also, it might be nice to have an option to generate more than one variable from this, e.g. for an M x N x P x Q array, generating Q columns, akin to margin.

This would make it more parallel to as.data.table.matrix.

I think the implementation is just a prudent use of dcast.
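A sketch of how the last dimension could be spread into value columns with dcast (assuming the new.as.data.table.array draft from the earlier comment):

```r
library(data.table)
set.seed(1)
x = array(rnorm(8, 10, 5), rep(2, 3),
          dimnames = list(color = c("green", "red"),
                          year = c("2014", "2015"),
                          country = c("UK", "IN")))
dt = new.as.data.table.array(x)   # long format, one row per cell
# cast the last dimension (country) into columns, akin to
# as.data.table.matrix output
wide = dcast(dt, color + year ~ country, value.var = "value")
```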

@jangorecki commented Mar 23, 2016

@MichaelChirico the use case you are describing would simply be the case where the Q dimension is a measure-type dimension. I'm not sure if we really need it; dcast can of course handle that as a post-processing step.
Not sure about the key; I cannot find a rationale for a default other than the current setkey on all dimensions, which is done in the CJ call and is unavoidable.

@mrdwab commented Mar 24, 2016

@jangorecki Perhaps you also want to consider a "wide" representation, as you would get if you did ftable(x). That's what I make use of in ftable2dt().

Also, with ftable2dt, if one wanted the long skinny version, they could use the "long" option (ftable2dt(x, "long")).
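For context, base R's ftable flattens an n-dimensional table into a two-dimensional layout; the wide representation mentioned here follows that shape (a minimal base-R sketch, not ftable2dt itself):

```r
x = array(1:8, dim = c(2, 2, 2),
          dimnames = list(a = c("a1", "a2"),
                          b = c("b1", "b2"),
                          c = c("c1", "c2")))
# by default the last dimension becomes the columns of the flat table,
# and the remaining dimensions are combined into the rows (4 x 2 here)
ftable(as.table(x))
```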

@jangorecki commented Mar 25, 2016

Will hold off on that. I don't see a big problem with dcast'ing measures as a post-processing step, but..
The last dimension could optionally be kept in columns, forming multiple measures, so it would be consistent with as.data.table.matrix.

@mrdwab commented Mar 26, 2016

@jangorecki I was also sharing it because it might be faster on larger arrays.

Here's the rough version of the function I'm proposing:

library(data.table)

am_adt <- function(inarray) {
  if (!is.array(inarray)) stop("input must be an array")
  dims <- dim(inarray)
  if (is.null(dimnames(inarray))) {
    # fall back to integer dimnames so ftable has labels to work with
    inarray <- provideDimnames(inarray, base = list(as.character(seq_len(max(dims)))))
  }
  FT <- if (inherits(inarray, "ftable")) inarray else ftable(inarray)
  out <- data.table(as.table(FT))
  nam <- names(out)[seq_along(dims)]
  # convert the dimension columns back from character, then sort by them
  setorderv(out[, (nam) := lapply(.SD, type.convert), .SDcols = nam], nam)[]
}

Here are a couple of large-ish arrays to test against. "M" has no names, and "N" does. It's just 1 million values, put into a 5D array.

dims <- c(10, 20, 50, 10, 10)
set.seed(1)
M <- `dim<-`(sample(100, prod(dims), TRUE), dims)
N <- `dimnames<-`(M, lapply(dims, function(x) c(letters, LETTERS)[seq_len(x)]))

Wrapping your approach in funDT and mine in funAM and running benchmarks, I get:

[benchmark results screenshot]

Certainly, there is room for improvement. I'm not sure that I should always use type.convert, for example. It might be better to only use it if the data does not have dimnames, and keep the values as characters otherwise.

By the way, regarding your comment about dcasting afterwards to get a wide dataset, I guess the question is whether dcast is in general faster than melt or not. If melt is faster, then it would make more sense to go to a wide format first and then get skinny, rather than the other way around. Getting the wide format is certainly fast:

[benchmark results screenshot]

@jangorecki jangorecki added this to the v1.9.10 milestone Nov 23, 2016
@jangorecki commented Dec 5, 2016

@mrdwab the previous implementation was unnecessarily complex and inefficient. The one just pushed is much faster. Looking at the as.data.frame.table method, I think it makes sense for this PR to actually update the table method instead of adding an array method.
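For context on that remark, base R's as.data.frame.table already produces the long format for table objects, which is what makes extending the table method attractive:

```r
x = array(1:8, dim = c(2, 2, 2))
# as.data.frame.table gives one row per cell: Var1, Var2, Var3, Freq
as.data.frame(as.table(x))
```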

@mattdowle mattdowle modified the milestones: v1.10.6, Candidate Aug 7, 2017
@mattdowle mattdowle closed this in 0ecb60b Aug 7, 2017
mattdowle added a commit that referenced this issue Aug 7, 2017
new as.data.table.array, closes #1418