Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[BUG] dcast.data.table leads to inconsistent keying with NA #2202

Closed
Galileo-Galilei opened this issue Jun 15, 2017 · 1 comment
Closed

[BUG] dcast.data.table leads to inconsistent keying with NA #2202

Galileo-Galilei opened this issue Jun 15, 2017 · 1 comment
Assignees
Labels

Comments

@Galileo-Galilei
Copy link

Galileo-Galilei commented Jun 15, 2017

Hello all,

this is my very first post on Github and I want to thank the creators of data.table for their amazing job!

I currently use data.table to deal with huge datasets (50 datasets of 15 millions rows / dozens of columns each) and I noticed a strange behavior of dcast.data.table which should likely be considered as a bug (although it could just be a misunderstanding of data.table’s internals).

Many thanks to those who will take the time to understand and address my problem,

Galileo

P.S : I looked for a similar issue on SO and Github, but did not encountered it, so I decided to open a new issue. please let me know if a similar bug has already been reported.

P.S.2 : The data.table version I use is the "data.table_1.10.4" which seems to be the latest on CRAN.


Bug description

dcast data.table marks as « keyed » the cast data.table, while it handles NA differently than setkey. This leads to inconsistent results when further merging.

To my understanding, and related to this SO answer, setkey considers NA as large negative integer and sorts them consistently with base:: sort(x,na.last=FALSE). This seems to be the desired behaviour of data.table
dcast.data.table seems to not sort the NA.

Minimal Reproducible Example

First, I create the dataset :

# first create a toy dataset
toy_data = structure(list(id = structure(c(1L, 1L, 1L, 1L, NA, NA),
                                         .Label = "123456", class = "factor"),
                          factor_var = structure(c(1L, 1L, 1L, 1L, 1L, 1L),
                                                 .Label = "U", class = "factor"),
                          num_var1 = c(0,300, 600, 500, 0, 800),
                          num_var2 = c(0,15, 50, 30, 0, 50)),
                     .Names = c("id", "factor_var", "num_var1", "num_var2"),
                     row.names = c(NA, -6L),
                     class = c("data.table","data.frame"))

#      id factor_var num_var1 num_var2
# 1: 123456          U        0        0
# 2: 123456          U      300       15
# 3: 123456          U      600       50
# 4: 123456          U      500       30
# 5:    NA          U        0        0
# 6:    NA          U      800       50

#REMARK1 : The dataset is NOT KEYED!
key(toy_data_agg )
#REMARK2 : The dataset the factor has only one level here, but not in my real data so It is very unlikely that it is the cause of the error

I want to aggregate some values by id, since it is the only relevant information :

# Aggregate by id
toy_data_agg = toy_data[,.(num_var1 = sum(num_var1),
                           num_var2 = sum(num_var2)),
                        by=.(id,factor_var)]   
#      id factor_var num_var1 num_var2
# 1: 123456          U     1400       95
# 2:    NA          U      800       50

#REMARK1 : The dataset is NOT KEYED!
key(toy_data_agg )
# REMARK2 : NA appears in last !

I want to cast the data.table to turn rows as columns (in the real dataset, I have much more variables and levels in factor_var) :

# cast by factor variable
toy_data_cast = dcast.data.table(data = toy_data_agg,
                                 formula = id ~ factor_var,
                                 value.var = c("num_var1","num_var2"))
#      id num_var1_U num_var2_U
# 1: 123456       1400         95
# 2:    NA        800         50
#REMARK1 : The dataset is now KEYED, without any warning!
key(toy_data_cast)
# REMARK2 : NA appears in LAST! THIS SEEMS INCONSISTENT WITH KEYING NA? SEE BELOW.

This is where the bug lies : the dataset is marked as « keyed » while this is not coherent with setkey.

setkey(toy_data_cast,id) 
# Warning message:
# In setkeyv(x, cols, verbose = verbose, physical = physical) :
# Already keyed by this key but had invalid row order, key rebuilt. If you didn't go under the hood 
# please let datatable-help know so the root cause can be fixed.

toy_data_cast
#      id num_var1_U num_var2_U
# 1:    NA        800         50
# 2: 123456       1400         95
# REMARK : NA appears in FIRST!
@Galileo-Galilei Galileo-Galilei changed the title [BUG] dcast.data.table lead to inconsistent keying with NA [BUG] dcast.data.table leads to inconsistent keying with NA Jun 15, 2017
@arunsrinivasan arunsrinivasan self-assigned this Feb 16, 2019
@arunsrinivasan
Copy link
Member

Thanks for this report. Just fixed. Will push soon.

arunsrinivasan added a commit that referenced this issue Feb 16, 2019
lhs cols with NAs are sorted correctly in result. Closes #2202.
arunsrinivasan added a commit that referenced this issue Feb 16, 2019
lhs cols with NAs are sorted correctly in result. Closes #2202.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

2 participants