[R-Forge #5660] Interaction between "Caret" package and data.table in R 3.1 gives different column name #476

Closed
arunsrinivasan opened this issue Jun 8, 2014 · 3 comments

@arunsrinivasan
Member

Submitted by: Arun; Assigned to: Nobody; R-Forge link

As illustrated here on SO.

@arunsrinivasan arunsrinivasan added this to the v1.9.4 milestone Aug 1, 2014
@arunsrinivasan
Member Author

Some updates:

First off, @topepo seems to have already incorporated a fix for this issue into the caret package in v6.0-29. Much appreciated, thank you.

However, the fix doesn't seem to be working. Here's what I get on my system:

library(caret)
library(data.table)
DT <- data.table(x = rnorm(10), y = rnorm(10))
cv.ctrl <- trainControl(method = 'repeatedcv', number = 5, repeats = 1)
fit <- train(y ~ x, data = DT, 'lm', trControl = cv.ctrl)

> names(DT)
# [1] "x"        ".outcome"

> sessionInfo()
# R version 3.1.1 (2014-07-10)
# Platform: x86_64-apple-darwin10.8.0 (64-bit)

# locale:
# [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

# attached base packages:
# [1] graphics  grDevices datasets  stats     utils     methods   base     

# other attached packages:
# [1] data.table_1.9.3 caret_6.0-30     ggplot2_1.0.0    lattice_0.20-29  bit64_0.9-4     
# [6] bit_1.1-12      

# loaded via a namespace (and not attached):
#  [1] BradleyTerry2_1.0-5 brglm_0.5-9         car_2.0-20          codetools_0.2-8    
#  [5] colorspace_1.2-4    compiler_3.1.1      digest_0.6.4        foreach_1.4.2      
#  [9] grid_3.1.1          gtable_0.1.2        gtools_3.4.1        iterators_1.0.7    
# [13] lme4_1.1-7          MASS_7.3-33         Matrix_1.1-4        minqa_1.2.3        
# [17] munsell_0.4.2       nlme_3.1-117        nloptr_1.0.0        nnet_7.3-8         
# [21] plyr_1.8.1          proto_0.3-10        Rcpp_0.11.2         reshape2_1.4.0.99  
# [25] scales_0.2.4        splines_3.1.1       stringr_0.6.2       tools_3.1.1        

The issue seems to arise from these lines in the train.formula function:

    res$trainingData <- data
    isY <- names(res$trainingData) %in% as.character(form[[2]])
    if(any(isY)) colnames(res$trainingData)[isY] <- ".outcome"

which should work just fine, except it doesn't on R v3.1+. It works just fine on R < v3.1. This is because R v3.1+ makes shallow copies wherever possible instead of deep copies, and data.table will have to take care of this. Here's a simple way to reproduce the issue:

# R v3.1.1
require(data.table)
dt = data.table(x=1:5, y=6:10)
ll = vector("list", 2L)
names(ll) <- c("a", "b")
ll$a = 1L; ll$b = dt
idx = c(FALSE, TRUE)
colnames(ll$b)[idx] = "bla"
names(dt) # [1] "x"   "bla"
names(ll$b) # [1] "x"   "bla"

The function colnames<- simply calls names(x) <- value, which then dispatches to the data.table method that uses setnames. However, x has only been shallow copied, and setnames renames by reference, so the original data.table gets renamed as well. Hence the issue.
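To make the by-reference part concrete, here is a minimal sketch of setnames acting on a shared object versus a deep copy made with copy(). It does not go through colnames<-, so it should behave the same regardless of R version. The variable names (alias, dt2, deep) are just for illustration.

```r
library(data.table)

dt <- data.table(x = 1:5, y = 6:10)

# `<-` does not copy a data.table, and setnames() renames by reference,
# so renaming through `alias` also renames `dt`.
alias <- dt
setnames(alias, "y", "bla")
names(dt)    # "x" "bla"

# copy() makes a deep copy, so renaming the copy leaves the original alone.
dt2 <- data.table(x = 1:5, y = 6:10)
deep <- copy(dt2)
setnames(deep, "y", "bla")
names(dt2)   # "x" "y"
names(deep)  # "x" "bla"
```

The shallow copy R v3.1+ makes inside colnames<- ends up in the same position as `alias` here: a second handle on the same columns and names.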

Bumping priority. Assigning milestone 1.9.4.

@arunsrinivasan
Member Author

One possible fix for the caret package would be to change this line to:

if ("data.table" %in% class(data)) 
    res$trainingData <- copy(data)
else res$trainingData <- data
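As a rough sketch of how that could look in isolation, the check can be wrapped in a small helper and exercised outside of train(). Note that detach_training_data is a hypothetical name, not something that exists in caret, and the renaming step below only mirrors the lines quoted from train.formula.

```r
library(data.table)

# Hypothetical helper, not part of caret: return a deep copy when given a
# data.table, and the input unchanged otherwise.
detach_training_data <- function(data) {
  if (is.data.table(data)) copy(data) else data
}

DT <- data.table(x = rnorm(10), y = rnorm(10))
form <- y ~ x

# Same renaming logic as in train.formula, applied to the deep copy.
training <- detach_training_data(DT)
isY <- names(training) %in% as.character(form[[2]])
if (any(isY)) colnames(training)[isY] <- ".outcome"

names(DT)        # still "x" "y"
names(training)  # "x" ".outcome"
```

Since the rename only ever touches the deep copy, the caller's data.table keeps its column names even if the rename happens by reference.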

@arunsrinivasan arunsrinivasan modified the milestones: 2.0.1, v2.0 Sep 6, 2014
@arunsrinivasan
Member Author

Updated the SO post: link.
