Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

as.matrix.data.table incorrect results when nrows == 1 #2930

Closed
malcook opened this issue Jun 12, 2018 · 19 comments
Closed

as.matrix.data.table incorrect results when nrows == 1 #2930

malcook opened this issue Jun 12, 2018 · 19 comments
Labels
Milestone

Comments

@malcook
Copy link

@malcook malcook commented Jun 12, 2018

# Minimal reproducible example

library(data.table)
d<-data.table(name=c("bob","joe"),age=10:11,IQ=c(130,140))
as.matrix(d, 1) 
#>     age  IQ
#> bob  10 130
#> joe  11 140

as.matrix(d[1, ], 1)
#>   name  age  IQ   
#> 1 "bob" "10" "130"

Created on 2018-06-13 by the reprex package (v0.2.0).

The problem is that the rownames argument to as.matrix.data.table is not being respected as documented:

rownames: optional, a single column name or column index to use as the
'rownames' in the returned 'matrix'. If 'TRUE' the 'key' of
the 'data.table' will be used if it is a single column,
otherwise the first column in the 'data.table' will be used.
Alternative a vector of length 'nrow(x)' to assign as the row
names of the returned 'matrix'.

We should expect the 2nd call above to yield a 1x2 matrix of numeric having the single rowname of "bob". Instead, it yields a 1x3 character matrix.

R version 3.4.3 (2017-11-30)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: CentOS Linux 7 (Core)

Matrix products: default
BLAS: /n/apps/CentOS7/install/r-3.4.3/lib64/R/lib/libRblas.so
LAPACK: /n/apps/CentOS7/install/r-3.4.3/lib64/R/lib/libRlapack.so

locale:
[1] C

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] data.table_1.11.2 devtools_1.13.5  

loaded via a namespace (and not attached):
[1] compiler_3.4.3 withr_2.1.2    memoise_1.1.0  digest_0.6.15
@HughParsonage
Copy link
Member

@HughParsonage HughParsonage commented Jun 13, 2018

Doesn't seem to be a bug. (Did you mean as.matrix(d[1, 1])?) Please reopen and clarify with another example if you think it is.

@malcook
Copy link
Author

@malcook malcook commented Jun 13, 2018

@HughParsonage - I've clarified the original issue. I think it IS a bug. I don't think I have privs to re-open per https://stackoverflow.com/questions/21333654/how-to-re-open-an-issue-in-github - can you re-open for me if you agree?

@HughParsonage
Copy link
Member

@HughParsonage HughParsonage commented Jun 13, 2018

Thanks @malcook, good edit. I think this is a documentation issue: essentially rownames = 1 could mean the column index or a "vector of length 'nrow(x)'". It seems the latter takes precedence when available. (Consider as.matrix(d[1, ], 99).)

@malcook
Copy link
Author

@malcook malcook commented Jun 13, 2018

I think it is NOT a documentation issue. Try passing rownames='name' and we don't get the documented behavior. IMO, the mode and dimension of as.matrix(d,1) should not depend on the dimensionality of d. That is does reminds me of the default drop=TRUE option to [ which is annoying to everybody. (Erhm, ok, at least to me ;)

In fact...

I'm guessing that under the hood (in the data.table source code) it will prove to be exactly this underlying issue - and the fix will be to add drop=FALSE to some call to [.

@HughParsonage
Copy link
Member

@HughParsonage HughParsonage commented Jun 13, 2018

It's definitely an issue.

as.matrix(d[1, ], "name")
#>      name  age  IQ   
#> name "bob" "10" "130"

but this is a documented result:

rownames: optional, a single column name or column index to use as the
'rownames' in the returned 'matrix'. If 'TRUE' the 'key' of
the 'data.table' will be used if it is a single column,
otherwise the first column in the 'data.table' will be used.
Alternative a vector of length 'nrow(x)' to assign as the row
names of the returned 'matrix'.

Since nrow(x) == length("name") == 1, the last sentence is what determines the interpretation of rownames. Bear in mind that this is similar to data.frame:

data.frame(x = 1, y = 7, row.names = 2)
#>   x y
#> 2 1 7
data.frame(x = 1:2, y = 7:8, row.names = 2)
#>   x
#> 7 1
#> 8 2

Unfortunately, this means that forcing the column of a one-row data.frame to a row name is not possible, but that is the behaviour.

I think even if we agree that this is documented behaviour, the documentation should be changed to emphasize that if length(rownames) == nrow(x) then that behaviour takes precedence.

@malcook
Copy link
Author

@malcook malcook commented Jun 13, 2018

Hmm. I think you’re stretching.

Try setkey(d,’name’) and then read the documentation again, and then guess what the result of passing rownames=TRUE to as.matrix.

Also, decide what you wish/expect/hope the result of passing rownames=TRUE should be before trying it.

Now try it.

Surprised?

I claim that if you are correct and this is a bug in the documentation, and you “fix” the documentation to reflect current behavior, then the documentation will only more clearly reflect a bug in the design.

But I don’t think there is a bug in the design, except perhaps for trying to overload rownames with multiple possible conflicting possible interpretations. I hope the project finds a way to provide both behaviors and resolve the (presumably unintended) ambiguity.

@MichaelChirico
Copy link
Member

@MichaelChirico MichaelChirico commented Jun 13, 2018

@malcook
Copy link
Author

@malcook malcook commented Jun 13, 2018

My suggestion would be to change the design in such a way that the mode and dimension of the result of as.matrix should never depend on the dimensionality of d, which it does now as implemented in this (arguably) edge case.

So, any change would have to be a (possibly) breaking change.

I'd be surprised if anyone discovered this and chose to depend upon it, but YMMV.

I'm not sure who vets/approves pull requests. Is that you @MichaelChirico ?

Anyway, if that person(s) agrees that this is a poor design, and should be fixed, then perhaps there is a chance that a pull request from me that fixes the design/implementation/code would be welcome.

But I would understand that a reasonable position might be that this is not a bad design, or the badness is insufficient to merit a possible breaking change.

Please advise, oh data.table gods, and I will try and respond accordingly.

Thanks

@jangorecki
Copy link
Member

@jangorecki jangorecki commented Jun 13, 2018

I am guessing it is as.array.data.table method responsible for this behavior.

@malcook
Copy link
Author

@malcook malcook commented Jun 13, 2018

@jangorecki
Copy link
Member

@jangorecki jangorecki commented Jun 13, 2018

mode and dimension of the result of as.matrix should never depend on the dimensionality of d

and what if we want to apply same principle to as.array method? didn't look into the code yet but we may end up with conflict, simply because matrix is edge case of array.

@malcook
Copy link
Author

@malcook malcook commented Jun 13, 2018

There appears to be no data.table:::as.array.data.table so not much of an issue

@jangorecki
Copy link
Member

@jangorecki jangorecki commented Jun 13, 2018

so maybe there should be...
we have as.data.table.array already, reverse operation as.array.data.table would most often result in memory error because base array not scale in memory. It produce cross product of all elements for all columns, except the (single) measure.
as of now probably better just handle this edge case in as.matrix.

@malcook
Copy link
Author

@malcook malcook commented Jun 13, 2018

I don't really understand where you are leading, but feel it is not toward resolving the issue I've raise, but, thanks for your interest and please carry on if there is a relationship I am missing.

@jangorecki
Copy link
Member

@jangorecki jangorecki commented Jun 13, 2018

if result should not depend on dimensionality then we should focus on array method instead of matrix method which by name limits dimensionality.

@malcook
Copy link
Author

@malcook malcook commented Jun 15, 2018

I'm going to bow out of this exchange I've started. Thanks all for hearing me out. If you want to "fix" the documentation so others do not fall down the same rabbit hole as I did, that is one way of resolving it.

As always, thanks for data.table. I use it every day!

@sritchie73
Copy link
Contributor

@sritchie73 sritchie73 commented Jun 17, 2018

Thanks for raising this @malcook , and apologies for not thinking of that edge case!

@malcook malcook closed this as completed Jun 18, 2018
@mattdowle mattdowle added this to the 1.11.6 milestone Aug 14, 2018
@mattdowle
Copy link
Member

@mattdowle mattdowle commented Aug 14, 2018

Reopening because PR #2939 by @sritchie73 (thanks) fixes this in agreement with @malcook, about to be merged.

@mattdowle mattdowle reopened this Aug 14, 2018
@mattdowle
Copy link
Member

@mattdowle mattdowle commented Aug 14, 2018

In further agreement with @malcook, for even further clarity, safety and consistency, I've suggested a minor change to @sritchie73's PR here: #2939 (comment). I'll make the change to the PR unless there are any objections.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

6 participants