[R-Forge #5660] Interaction between "Caret" package and data.table in R 3.1 gives different column name #476

Closed
arunsrinivasan opened this Issue Jun 8, 2014 · 3 comments

Member

arunsrinivasan commented Jun 8, 2014

Submitted by: Arun ; Assigned to: Nobody; R-Forge link

As illustrated here on SO.

@arunsrinivasan arunsrinivasan added the High label Aug 1, 2014

@arunsrinivasan arunsrinivasan added this to the v1.9.4 milestone Aug 1, 2014

arunsrinivasan Aug 1, 2014

Member

Some updates:

First off, @topepo seems to have already incorporated a fix for this issue into the caret package in v6.0-29. Much appreciated, thank you.

However, the fix doesn't seem to be working. Here's what I get on my system:

library(caret)
library(data.table)
DT <- data.table(x = rnorm(10), y = rnorm(10))
cv.ctrl <- trainControl(method = 'repeatedcv', number = 5, repeats = 1)
fit <- train(y ~ x, data = DT, 'lm', trControl = cv.ctrl)

> names(DT)
# [1] "x"        ".outcome"

> sessionInfo()
# R version 3.1.1 (2014-07-10)
# Platform: x86_64-apple-darwin10.8.0 (64-bit)

# locale:
# [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

# attached base packages:
# [1] graphics  grDevices datasets  stats     utils     methods   base     

# other attached packages:
# [1] data.table_1.9.3 caret_6.0-30     ggplot2_1.0.0    lattice_0.20-29  bit64_0.9-4     
# [6] bit_1.1-12      

# loaded via a namespace (and not attached):
#  [1] BradleyTerry2_1.0-5 brglm_0.5-9         car_2.0-20          codetools_0.2-8    
#  [5] colorspace_1.2-4    compiler_3.1.1      digest_0.6.4        foreach_1.4.2      
#  [9] grid_3.1.1          gtable_0.1.2        gtools_3.4.1        iterators_1.0.7    
# [13] lme4_1.1-7          MASS_7.3-33         Matrix_1.1-4        minqa_1.2.3        
# [17] munsell_0.4.2       nlme_3.1-117        nloptr_1.0.0        nnet_7.3-8         
# [21] plyr_1.8.1          proto_0.3-10        Rcpp_0.11.2         reshape2_1.4.0.99  
# [25] scales_0.2.4        splines_3.1.1       stringr_0.6.2       tools_3.1.1        

The issue seems to arise from these lines in the train.formula function:

    res$trainingData <- data
    isY <- names(res$trainingData) %in% as.character(form[[2]])
    if(any(isY)) colnames(res$trainingData)[isY] <- ".outcome"

which should work just fine, except it doesn't on R v3.1+. It works just fine on versions before v3.1. This is because R v3.1+ makes shallow copies wherever possible instead of deep copies, and data.table will have to take care of this. Here's a simple way to reproduce the issue:

# R v3.1.1
require(data.table)
dt = data.table(x=1:5, y=6:10)
ll = vector("list", 2L)
names(ll) <- c("a", "b")
ll$a = 1L; ll$b = dt
idx = c(FALSE, TRUE)
colnames(ll$b)[idx] = "bla"
names(dt) # [1] "x"   "bla"
names(ll$b) # [1] "x"   "bla"

The function colnames<- simply calls names(x) <- value, which then dispatches to the data.table method that uses setnames. However, x has only been shallow copied, and setnames changes the names by reference on that shallow copy, so the original data.table is renamed as well.
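
To isolate the by-reference part of that explanation, here is a minimal sketch that does not depend on the R or data.table version. It only relies on setnames() modifying its argument in place and on copy() producing an independent object. The variable names (alias, indep, names_via_alias, names_after_copy) are made up for illustration.

```r
library(data.table)

# Two names bound to the same data.table: plain assignment does not copy.
dt <- data.table(x = 1:5, y = 6:10)
alias <- dt
setnames(alias, "y", "bla")    # setnames() renames in place
names_via_alias <- names(dt)   # dt sees the rename too: "x" "bla"

# copy() creates an independent object, so the original keeps its names.
dt2 <- data.table(x = 1:5, y = 6:10)
indep <- copy(dt2)
setnames(indep, "y", "bla")
names_after_copy <- names(dt2) # still "x" "y"
```

The shallow copy made by R v3.1+ behaves like the alias case here, not like the copy() case, which is why the rename leaks back to the original table.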

Bumping priority. Assigning milestone 1.9.4.

arunsrinivasan Aug 1, 2014

Member

One possible fix for the caret package would be to change this line to:

if ("data.table" %in% class(data)) 
    res$trainingData <- copy(data)
else res$trainingData <- data
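
As a rough check of that idea outside caret, the sketch below wraps the three lines quoted from train.formula in a small function and applies the copy() guard. store_training_data is a hypothetical stand-in, not caret's actual code, and is.data.table() is used in place of the class() test as an equivalent assumption.

```r
library(data.table)

# Hypothetical stand-in for the relevant part of train.formula.
store_training_data <- function(form, data) {
  res <- list()
  # Deep-copy data.tables so the rename below cannot reach the caller's object.
  res$trainingData <- if (is.data.table(data)) copy(data) else data
  isY <- names(res$trainingData) %in% as.character(form[[2]])
  if (any(isY)) colnames(res$trainingData)[isY] <- ".outcome"
  res
}

DT <- data.table(x = rnorm(10), y = rnorm(10))
res <- store_training_data(y ~ x, DT)
names(DT)                # caller's table keeps "x" "y"
names(res$trainingData)  # stored copy has "x" ".outcome"
```

The trade-off is an extra full copy of the training data for data.table inputs; plain data.frames go through unchanged.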

@arunsrinivasan arunsrinivasan modified the milestones: 2.0.1, v2.0 Sep 6, 2014

arunsrinivasan Oct 3, 2014

Member

Updated the SO post: link.
