[BUG] dcast.data.table leads to inconsistent keying with NA #2202

Galileo-Galilei · 2017-06-15T12:02:55Z

Hello all,

This is my very first post on GitHub, and I want to thank the creators of data.table for their amazing work!

I currently use data.table to deal with huge datasets (50 datasets of 15 million rows and dozens of columns each), and I noticed a strange behavior of dcast.data.table that should probably be considered a bug (although it could just be a misunderstanding of data.table's internals on my part).

Many thanks to those who will take the time to understand and address my problem,

Galileo

P.S.: I looked for a similar issue on SO and GitHub but did not find one, so I decided to open a new issue. Please let me know if a similar bug has already been reported.

P.S. 2: I use data.table version 1.10.4, which seems to be the latest on CRAN.

Bug description

dcast.data.table marks the cast data.table as keyed, yet it orders NA differently than setkey does. This leads to inconsistent results in subsequent merges.

To my understanding, and related to this SO answer, setkey treats NA as a large negative integer and sorts it first, consistent with base::sort(x, na.last = FALSE). This seems to be the desired behaviour of data.table.
dcast.data.table, however, does not seem to sort the NA this way.
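
The setkey behaviour described above can be checked with a minimal sketch. The three-row table below is hypothetical (it is not the reporter's data) and only illustrates that setkey places NA first, as base::sort does with na.last = FALSE:

```r
library(data.table)

# Hypothetical toy table: one NA id and two non-NA ids, deliberately unsorted.
dt <- data.table(id = c(2L, NA, 1L), value = c("b", "na", "a"))

# setkey sorts the table by reference and marks it as keyed by id.
setkey(dt, id)

# Compare the keyed order with base R's ordering when NA is placed first.
base_order <- sort(c(2L, NA, 1L), na.last = FALSE)
same_order <- identical(dt$id, base_order)

print(dt)
print(same_order)
```

If same_order is TRUE, a keyed table is expected to have its NA rows at the top, which is the row order a table flagged as keyed should already have.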

Minimal Reproducible Example

First, I create the dataset:

# first create a toy dataset
toy_data = structure(list(id = structure(c(1L, 1L, 1L, 1L, NA, NA),
                                         .Label = "123456", class = "factor"),
                          factor_var = structure(c(1L, 1L, 1L, 1L, 1L, 1L),
                                                 .Label = "U", class = "factor"),
                          num_var1 = c(0,300, 600, 500, 0, 800),
                          num_var2 = c(0,15, 50, 30, 0, 50)),
                     .Names = c("id", "factor_var", "num_var1", "num_var2"),
                     row.names = c(NA, -6L),
                     class = c("data.table","data.frame"))

#      id factor_var num_var1 num_var2
# 1: 123456          U        0        0
# 2: 123456          U      300       15
# 3: 123456          U      600       50
# 4: 123456          U      500       30
# 5:    NA          U        0        0
# 6:    NA          U      800       50

# REMARK 1: the dataset is NOT KEYED!
key(toy_data)
# REMARK 2: the factor has only one level here, but not in my real data, so it is very unlikely to be the cause of the error.

I want to aggregate some values by id, since it is the only relevant information:

# Aggregate by id
toy_data_agg = toy_data[,.(num_var1 = sum(num_var1),
                           num_var2 = sum(num_var2)),
                        by=.(id,factor_var)]   
#      id factor_var num_var1 num_var2
# 1: 123456          U     1400       95
# 2:    NA          U      800       50

# REMARK 1: the dataset is NOT KEYED!
key(toy_data_agg)
# REMARK 2: NA appears last!

I want to cast the data.table to turn rows into columns (in the real dataset, I have many more variables and many more levels in factor_var):

# cast by factor variable
toy_data_cast = dcast.data.table(data = toy_data_agg,
                                 formula = id ~ factor_var,
                                 value.var = c("num_var1","num_var2"))
#      id num_var1_U num_var2_U
# 1: 123456       1400         95
# 2:    NA        800         50
# REMARK 1: the dataset is now KEYED, without any warning!
key(toy_data_cast)
# REMARK 2: NA appears LAST! This seems inconsistent with how keying orders NA; see below.

This is where the bug lies: the dataset is marked as keyed, but its row order is not consistent with setkey.

setkey(toy_data_cast,id) 
# Warning message:
# In setkeyv(x, cols, verbose = verbose, physical = physical) :
# Already keyed by this key but had invalid row order, key rebuilt. If you didn't go under the hood 
# please let datatable-help know so the root cause can be fixed.

toy_data_cast
#      id num_var1_U num_var2_U
# 1:    NA        800         50
# 2: 123456       1400         95
# REMARK: NA now appears FIRST!
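
Until a fixed version is installed, one possible workaround is to rebuild the key explicitly after casting. The sketch below is only an illustration under assumptions: the two-row input is hypothetical and merely mirrors the shape of the aggregated table in the report, and dropping the key with setkey(x, NULL) before re-keying is a defensive choice, not something prescribed in this thread:

```r
library(data.table)

# Hypothetical long-format input with an NA id, mirroring the report's shape.
long <- data.table(id = factor(c("123456", NA)),
                   factor_var = factor(c("U", "U")),
                   num_var1 = c(1400, 800))

wide <- dcast(long, id ~ factor_var, value.var = "num_var1")

# Defensive re-keying: drop whatever key dcast attached, then rebuild it,
# so the physical row order matches the order setkey expects.
setkey(wide, NULL)
setkey(wide, id)

print(key(wide))
print(wide)
```

Because the key is dropped first, the second setkey call sorts the rows from scratch instead of trusting the key attribute that dcast set, so later merges see the NA row in its keyed position.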


arunsrinivasan · 2019-02-16T18:08:07Z

Thanks for this report. Just fixed. Will push soon.

lhs cols with NAs are sorted correctly in result. Closes #2202.

Galileo-Galilei changed the title ~~[BUG] dcast.data.table lead to inconsistent keying with NA~~ [BUG] dcast.data.table leads to inconsistent keying with NA Jun 15, 2017

arunsrinivasan self-assigned this Feb 16, 2019

arunsrinivasan added the bug label Feb 16, 2019

arunsrinivasan added a commit that referenced this issue Feb 16, 2019

Fixes #2202 dcast issue

eb3fcff

lhs cols with NAs are sorted correctly in result. Closes #2202.

arunsrinivasan mentioned this issue Feb 16, 2019

#2202 dcast col with NA keying issue #3407

Merged

arunsrinivasan closed this as completed in #3407 Feb 16, 2019

arunsrinivasan added a commit that referenced this issue Feb 16, 2019

Fixes #2202 dcast issue (#3407)

47f7a05

lhs cols with NAs are sorted correctly in result. Closes #2202.
