Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

1.13.0 slow down in a repeated loop on list column #4658

Closed
sandoronodi opened this issue Aug 3, 2020 · 2 comments · Fixed by #4655
Closed

1.13.0 slow down in a repeated loop on list column #4658

sandoronodi opened this issue Aug 3, 2020 · 2 comments · Fixed by #4655
Milestone

Comments

@sandoronodi
Copy link

sandoronodi commented Aug 3, 2020

I have noticed a huge performance drop in data.table loop operations, possibly due to the new version upgrade

library(data.table)
library(microbenchmark)

dt <- data.table('id'=1:20000,
                 'list_col'=sample(c('', '', 'a', 'a:b', 'a:b:c'), 20000, TRUE))
feature <- 'list_col'

microbenchmark(
  long_dt <- dt[, c("id", feature), with = FALSE][
    , feature_names := {
      x <- get(feature)
      stringr::str_split(x, ':')
    }][
      , .(
        feature_names = paste0(feature, "_", unlist(feature_names))
      )
      , by = "id"]
  , times = 10
  , unit = 'ms'
)

data.table 1.12.8, default settings, using 6 threads:

      min       lq     mean   median       uq      max neval
 122.2447 149.6991 173.3268 183.5777 193.9876 201.7234    10

data.table 1.13.0, default settings, using 6 threads:

      min      lq     mean   median      uq      max neval
 12820.75 12913.1 12989.59 13007.94 13065.1 13097.85    10

Also, I have tried several different threads and throttle combinations, but have seen no improvements at all

@tdeenes
Copy link
Member

tdeenes commented Aug 3, 2020

Probably a duplicate of #4646, and see also #4655 .

Notes to @sandoronodi

  • if possible, do not use irrelevant packages in a minimal reproducible example [MRE]: we have base::strsplit, so stringr::str_split is unnecessary (BTW, the latter is just a wrapper of stringi::stri_split - better to use that package)
  • split your processing pipeline into minimal steps, and profile those minimal steps: in the current case, dt[, c("id", feature), with = FALSE] just returns the original table, the splitting step could be created directly when you create the example table, and the performance degradation occurs in the unlist-by-group step (so only this last step is relevant)

@ColeMiller1
Copy link
Contributor

ColeMiller1 commented Aug 3, 2020

@sandoronodi thanks for the report and I can reproduce. It is very related to #4646. I tried the #4655 workaround and it seems to address performance. Could you see if it solves your actual use case as well?

remotes::install_github("https://github.com/Rdatatable/data.table", ref  = "extract_performance")

Also, and I am not sure if this is necessarily a more readable approach, but this is faster method to get the same result. You could probably tweak it more if this is a bottleneck, but this would allow you to move forward with 1.13.0.

library(data.table)

dt <- data.table('id'=1:20000,
                 'list_col'=sample(c('', '', 'a', 'a:b', 'a:b:c'), 20000, TRUE))
feature <- 'list_col'

dt[, {
    
    x = get(feature)
    l = strsplit(x, ":")
    
    lens = lengths(l)
    lens[lens == 0L] = 1L ##for those without matches, we'll still have `list_col_` for each row based on OP. Therefore, we need those rows.
    
    partial_text = paste0(feature, "_")
    
    list(id = rep(id, lens),
         feature_names = unlist(Map(function(y) if (length(y)) paste0(partial_text, y) else partial_text, l), use.names = FALSE)
    )}
]

## A tibble: 2 x 13
##  expression               min  median `itr/sec` mem_alloc
##  <bch:expr>             <bch> <bch:t>     <dbl> <bch:byt>
##1 potential_solution      50ms  56.5ms     18.0     1.33MB
##2 OP_extract_perf_branch 156ms 158.3ms      6.24       2MB

@mattdowle mattdowle changed the title Performance issues in a repeated loop at v1.30.0 1.13.0 slow down in a repeated loop on list column Aug 4, 2020
@mattdowle mattdowle added this to the 1.13.1 milestone Aug 4, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging a pull request may close this issue.

5 participants