[BUG] dcast.data.table leads to inconsistent keying with NA #2202
Thanks for this report. Just fixed. Will push soon.
lhs cols with NAs are sorted correctly in result. Closes #2202.
Hello all,
this is my very first post on GitHub and I want to thank the creators of data.table for their amazing job!
I currently use data.table to deal with huge datasets (50 datasets of 15 million rows and dozens of columns each) and I noticed a strange behavior of dcast.data.table which should probably be considered a bug (although it could just be a misunderstanding of data.table's internals on my part).
Many thanks to those who will take the time to understand and address my problem,
Galileo
P.S.: I looked for a similar issue on SO and GitHub but did not find one, so I decided to open a new issue. Please let me know if a similar bug has already been reported.
P.S.2: The data.table version I use is "data.table_1.10.4", which seems to be the latest on CRAN.
Bug description
dcast.data.table marks the cast data.table as keyed, yet it orders NA differently from setkey. This leads to inconsistent results in subsequent merges.
To my understanding, and according to this SO answer, setkey treats NA as a large negative integer and sorts it consistently with base::sort(x, na.last=FALSE). This seems to be the desired behaviour of data.table. dcast.data.table, however, does not seem to sort the NA values.
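As a minimal sketch of the ordering described above, the snippet below compares base::sort(x, na.last = FALSE) with the row order produced by setkey. The vector x and the column names id and value are hypothetical and chosen only for illustration.

```r
library(data.table)

# Hypothetical toy column containing one missing value
x <- c(3L, NA, 1L, 2L)

# base R: na.last = FALSE places NA before all other values
base_sorted <- sort(x, na.last = FALSE)

# data.table: setkey sorts the table by reference on 'id'
dt <- data.table(id = x, value = seq_along(x))
setkey(dt, id)

print(base_sorted)
print(dt$id)

# TRUE when both orderings agree, NA first in each
print(identical(base_sorted, dt$id))
```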
Minimal Reproducible Example
First, I create the dataset:
I want to aggregate some values by id, since it is the only relevant information:
I want to cast the data.table to turn rows into columns (in the real dataset, I have many more variables and levels in factor_var):
This is where the bug lies: the cast dataset is marked as keyed, but its row order is not consistent with the order setkey would produce.
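One way to check for this kind of inconsistency is sketched below. The data is hypothetical (it is not the dataset from this report) and only borrows the column names id and factor_var from the description. It casts a small table whose id column contains NA, then compares the row order returned by dcast with the order obtained after dropping the key and calling setkey again. On a data.table version that includes the fix for this issue, the final comparison should presumably print TRUE.

```r
library(data.table)

# Hypothetical data: 'id' contains NA, 'factor_var' supplies the new columns
dt <- data.table(
  id         = c(2L, NA, 1L, NA, 2L, 1L),
  factor_var = c("a", "a", "b", "b", "b", "a"),
  value      = c(10, 20, 30, 40, 50, 60)
)

# Cast long to wide: one row per id, one column per level of factor_var
wide <- dcast(dt, id ~ factor_var, value.var = "value", fun.aggregate = sum)

# Row order of 'id' exactly as returned by dcast
order_from_dcast <- wide$id

# Row order of 'id' after dropping any existing key and keying a copy afresh
rekeyed <- copy(wide)
setkey(rekeyed, NULL)
setkey(rekeyed, id)
order_from_setkey <- rekeyed$id

# Key reported on the dcast result, if any
print(key(wide))

# TRUE when the dcast result is already in setkey order
print(identical(order_from_dcast, order_from_setkey))
```

If the last line prints FALSE while key(wide) reports "id", the table claims a sort order it does not actually have, which is the situation described in this report.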