data.table not returning the correct splinefun by group #4298

Closed
jphelps13 opened this issue Mar 11, 2020 · 5 comments

jphelps13 commented Mar 11, 2020

Grouping apparently overwrites the return values of other groups in the following example, which produces a list column of functions:

library(data.table) # data.table_1.12.8
library(stats) 

# mimic our data in simpler format
set.seed(1)
dt <- data.table(cat = rep(letters[1:3], each = 10), x = 1:10)
dt[, y := x^0.5 * rnorm(.N, mean=runif(1, 1, 100), sd=runif(1, 1, 10)), by=cat]

mod_splines <- dt[, .(Spline = list(splinefun(x=x, y=y, method = "natural"))),
                  by = c("cat")]

mod_splines$Spline[[1]](5)
# [1] 92.84816
mod_splines$Spline[[2]](5)
# [1] 92.84816
mod_splines$Spline[[3]](5)
# [1] 92.84816

If we do this in base R, the functions differ:

alt_splines <- lapply(
  split(dt, by='cat'), 
  function(x) with(x, splinefun(x, y, method='natural'))
)

alt_splines[[1]](5)
# [1] 53.03293
alt_splines[[2]](5)
# [1] 146.4205
alt_splines[[3]](5)
# [1] 92.84816

It appears the last group's function was copied to the other groups.
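
A note on reading these numbers: an interpolating spline passes through its knots, so evaluating at x = 5, which is one of the knots here, returns the y value stored for that knot. Three identical results at a knot would therefore be consistent with all three functions seeing the same stored y data. The following minimal sketch uses made-up data and base R only; it illustrates the knot property and does not reproduce the data.table behaviour itself.

```r
# Illustrative data (not from the issue): 10 knots at x = 1..10
x <- as.double(1:10)
y <- sqrt(x) * seq(10, 19)

f <- splinefun(x = x, y = y, method = "natural")

# At the knots, the spline reproduces the input y values
all.equal(f(x), y)

# Between knots, the result also depends on the fitted coefficients
f(5.5)
```

Evaluating the grouped splines at a non-knot value such as 5.5 may therefore give additional information about what each function actually holds.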

R version 3.6.3 (2020-02-29)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 18362)

Matrix products: default

locale:
[1] LC_COLLATE=English_United Kingdom.1252 
[2] LC_CTYPE=English_United Kingdom.1252   
[3] LC_MONETARY=English_United Kingdom.1252
[4] LC_NUMERIC=C                           
[5] LC_TIME=English_United Kingdom.1252    

attached base packages:
[1] stats     graphics  grDevices utils     datasets 
[6] methods   base     

other attached packages:
[1] ggplot2_3.2.1     data.table_1.12.8

loaded via a namespace (and not attached):
 [1] Rcpp_1.0.3       digest_0.6.25    withr_2.1.2     
 [4] assertthat_0.2.1 dplyr_0.8.4      crayon_1.3.4    
 [7] grid_3.6.3       R6_2.4.1         lifecycle_0.1.0 
[10] gtable_0.3.0     magrittr_1.5     scales_1.1.0    
[13] pillar_1.4.3     rlang_0.4.5      farver_2.0.3    
[16] lazyeval_0.2.2   rstudioapi_0.11  labeling_0.3    
[19] tools_3.6.3      glue_1.3.1       purrr_0.3.3     
[22] munsell_0.5.0    compiler_3.6.3   pkgconfig_2.0.3 
[25] colorspace_1.4-1 tidyselect_1.0.0 tibble_2.1.3  

shrektan commented Mar 11, 2020

I haven't looked into this issue deeply, but a workaround is to wrap the internal x and y variables in copy().

The issue itself, I believe, comes from how splinefun() works: it looks like it stores the input data without explicitly copying it.
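
As a hedged sketch of how one might check what a spline function has stored: the function returned by splinefun() is a closure, and its environment can be inspected. The object name `z` and its `x` and `y` fields are internals of stats::splinefun as I understand them, not a documented interface, so treat them as an assumption that may differ between R versions. The data below is made up for illustration.

```r
# Illustrative data (not from the issue)
knots_x <- as.double(1:10)
knots_y <- sqrt(knots_x) * 3

sf <- splinefun(x = knots_x, y = knots_y, method = "natural")

# The closure's environment holds the fitted spline description
env <- environment(sf)
ls(env)

# Assumption: the description is stored as `z`, with the knot data in z$x and z$y
str(env$z)
all.equal(env$z$y, knots_y)
```

Applying the same inspection to each element of mod_splines$Spline would show whether the groups ended up with the same stored y values.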

A tweaked, now-working example based on yours:

# R version: 3.6.3 (2020-02-29)
library(data.table) # data.table_1.12.8

# mimic our data in simpler format
set.seed(1)
dt <- data.table(cat = rep(letters[1:3], each = 10),
                 x = 1:10)
dt[, y := x^0.5 * rnorm(.N, mean=runif(1, 1, 100), sd=runif(1, 1, 10)), by=cat]

# fit spline, segment the data by category
######## this works ########
mod_splines <- dt[, .(Spline = list(splinefun(x=copy(x), y=copy(y), method = "natural"))),
                  by = cat]

# splinefun works such that you provide new values of x and it gives an output
# y from a spline fitted to y~x
# With copy(), the three values should now differ (without it they were all the same)
mod_splines$Spline[[1]](5)
mod_splines$Spline[[2]](5)
mod_splines$Spline[[3]](5)

# alternative approach
alt_splines <-  lapply(unique(dt$cat), function(x_cat){
  splinefun(x=dt[cat==x_cat, ]$x, 
            y=dt[cat==x_cat, ]$y, 
            method = "natural")
})

# looks more realistic
alt_splines[[1]](5)
alt_splines[[2]](5)
alt_splines[[3]](5) # Matches the mod_splines one!

@MichaelChirico

The problem is happening in dogroups.c... Haven't been able to track down why, but I'll note it's related to all the groups having the same size. If we augment your data with some extra rows, the problem goes away:

dt = rbind(dt, dt[1][ , x := 0], dt[11:12][ , x := -1:0])
dt[, y := x^0.5 * rnorm(.N, mean=runif(1, 1, 100), sd=runif(1, 1, 10)), by=cat]

mod_splines <- dt[, .(Spline = list(splinefun(x=x, y=y, method = "natural"))),
                  by = c("cat")]

mod_splines$Spline[[1]](5)
# [1] 127.596
mod_splines$Spline[[2]](5)
# [1] 56.64536
mod_splines$Spline[[3]](5)
# [1] 160.7622

@shrektan

@MichaelChirico If that's the case, it may be a bug?

@MichaelChirico

I think it's a bug for sure

shrektan commented Apr 6, 2020

@jangorecki is it a dup of #507?
