You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
this is my very first post on Github and I want to thank the creators of data.table for their amazing job!
I currently use data.table to deal with huge datasets (50 datasets of 15 millions rows / dozens of columns each) and I noticed a strange behavior of dcast.data.table which should likely be considered as a bug (although it could just be a misunderstanding of data.table’s internals).
Many thanks to those who will take the time to understand and address my problem,
Galileo
P.S : I looked for a similar issue on SO and Github, but did not encountered it, so I decided to open a new issue. please let me know if a similar bug has already been reported.
P.S.2 : The data.table version I use is the "data.table_1.10.4" which seems to be the latest on CRAN.
Bug description
dcast data.table marks as « keyed » the cast data.table, while it handles NA differently than setkey. This leads to inconsistent results when further merging.
To my understanding, and related to this SO answer, setkey considers NA as large negative integer and sorts them consistently with base:: sort(x,na.last=FALSE). This seems to be the desired behaviour of data.table
dcast.data.table seems to not sort the NA.
Minimal Reproducible Example
First, I create the dataset :
# first create a toy dataset
toy_data = structure(list(id = structure(c(1L, 1L, 1L, 1L, NA, NA),
.Label = "123456", class = "factor"),
factor_var = structure(c(1L, 1L, 1L, 1L, 1L, 1L),
.Label = "U", class = "factor"),
num_var1 = c(0,300, 600, 500, 0, 800),
num_var2 = c(0,15, 50, 30, 0, 50)),
.Names = c("id", "factor_var", "num_var1", "num_var2"),
row.names = c(NA, -6L),
class = c("data.table","data.frame"))
# id factor_var num_var1 num_var2
# 1: 123456 U 0 0
# 2: 123456 U 300 15
# 3: 123456 U 600 50
# 4: 123456 U 500 30
# 5: NA U 0 0
# 6: NA U 800 50
#REMARK1 : The dataset is NOT KEYED!
key(toy_data_agg )
#REMARK2 : The dataset the factor has only one level here, but not in my real data so It is very unlikely that it is the cause of the error
I want to aggregate some values by id, since it is the only relevant information :
# Aggregate by id
toy_data_agg = toy_data[,.(num_var1 = sum(num_var1),
num_var2 = sum(num_var2)),
by=.(id,factor_var)]
# id factor_var num_var1 num_var2
# 1: 123456 U 1400 95
# 2: NA U 800 50
#REMARK1 : The dataset is NOT KEYED!
key(toy_data_agg )
# REMARK2 : NA appears in last !
I want to cast the data.table to turn rows as columns (in the real dataset, I have much more variables and levels in factor_var) :
# cast by factor variable
toy_data_cast = dcast.data.table(data = toy_data_agg,
formula = id ~ factor_var,
value.var = c("num_var1","num_var2"))
# id num_var1_U num_var2_U
# 1: 123456 1400 95
# 2: NA 800 50
#REMARK1 : The dataset is now KEYED, without any warning!
key(toy_data_cast)
# REMARK2 : NA appears in LAST! THIS SEEMS INCONSISTENT WITH KEYING NA? SEE BELOW.
This is where the bug lies : the dataset is marked as « keyed » while this is not coherent with setkey.
setkey(toy_data_cast,id)
# Warning message:
# In setkeyv(x, cols, verbose = verbose, physical = physical) :
# Already keyed by this key but had invalid row order, key rebuilt. If you didn't go under the hood
# please let datatable-help know so the root cause can be fixed.
toy_data_cast
# id num_var1_U num_var2_U
# 1: NA 800 50
# 2: 123456 1400 95
# REMARK : NA appears in FIRST!
The text was updated successfully, but these errors were encountered:
Galileo-Galilei
changed the title
[BUG] dcast.data.table lead to inconsistent keying with NA
[BUG] dcast.data.table leads to inconsistent keying with NA
Jun 15, 2017
Hello all,
this is my very first post on Github and I want to thank the creators of data.table for their amazing job!
I currently use data.table to deal with huge datasets (50 datasets of 15 millions rows / dozens of columns each) and I noticed a strange behavior of dcast.data.table which should likely be considered as a bug (although it could just be a misunderstanding of data.table’s internals).
Many thanks to those who will take the time to understand and address my problem,
Galileo
P.S : I looked for a similar issue on SO and Github, but did not encountered it, so I decided to open a new issue. please let me know if a similar bug has already been reported.
P.S.2 : The data.table version I use is the "data.table_1.10.4" which seems to be the latest on CRAN.
Bug description
dcast data.table marks as « keyed » the cast data.table, while it handles NA differently than setkey. This leads to inconsistent results when further merging.
To my understanding, and related to this SO answer, setkey considers NA as large negative integer and sorts them consistently with
base:: sort(x,na.last=FALSE)
. This seems to be the desired behaviour of data.tabledcast.data.table seems to not sort the NA.
Minimal Reproducible Example
First, I create the dataset :
I want to aggregate some values by id, since it is the only relevant information :
I want to cast the data.table to turn rows as columns (in the real dataset, I have much more variables and levels in factor_var) :
This is where the bug lies : the dataset is marked as « keyed » while this is not coherent with setkey.
The text was updated successfully, but these errors were encountered: