
[[ by group takes forever (24 hours +) with v1.13.0 vs 4 seconds with v1.12.8 #4646

Closed
fabiocs8 opened this issue Jul 27, 2020 · 7 comments · Fixed by #4655

fabiocs8 commented Jul 27, 2020

I have a script that takes only 4 seconds to run with v1.12.8, and more than 24 hours with v1.13.0. All other variables are the same (computer, R and RStudio versions, dataset, etc.).

All relevant and detailed information about this issue is in the Stack Overflow post:

https://stackoverflow.com/questions/63105711/why-data-table-unnesting-time-grows-with-the-square-of-number-of-rows-for-a-spec

I am not a programmer, so I am not able to go through the code. Below you will find the sessionInfo() regarding the environment AFTER changing back to v1.12.8.

Please let me know whether I can be of any further help.

BTW, congratulations on this amazing game-changing package.

Regards,

Fabio.

sessionInfo()
R version 4.0.2 (2020-06-22)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 18363)

Matrix products: default

locale:
[1] LC_COLLATE=Portuguese_Brazil.1252 LC_CTYPE=Portuguese_Brazil.1252 LC_MONETARY=Portuguese_Brazil.1252
[4] LC_NUMERIC=C LC_TIME=Portuguese_Brazil.1252

attached base packages:
[1] stats graphics grDevices utils datasets methods base

other attached packages:
[1] microbenchmark_1.4-7 data.table_1.12.8 lubridate_1.7.9 stringi_1.4.6 runner_0.3.7 e1071_1.7-3
[7] ggplot2_3.3.2 stringr_1.4.0 magrittr_1.5

loaded via a namespace (and not attached):
[1] Rcpp_1.0.5 pillar_1.4.6 compiler_4.0.2 class_7.3-17 tools_4.0.2 digest_0.6.25 packrat_0.5.0 evaluate_0.14
[9] lifecycle_0.2.0 tibble_3.0.3 gtable_0.3.0 pkgconfig_2.0.3 rlang_0.4.7 rstudioapi_0.11 yaml_2.2.1 xfun_0.16
[17] withr_2.2.0 dplyr_1.0.0 knitr_1.29 generics_0.0.2 vctrs_0.3.2 grid_4.0.2 tidyselect_1.1.0 glue_1.4.1
[25] R6_2.4.1 rmarkdown_2.3 purrr_0.3.4 scales_1.1.1 ellipsis_0.3.1 htmltools_0.5.0 colorspace_1.4-1 tinytex_0.25
[33] munsell_0.5.0 crayon_1.3.4

mattdowle (Member) commented Jul 27, 2020

Many thanks for the great report; I've downloaded your RDS from the SO post.

There was a change to [[ by group in this release, so we'll have to look into that; bug fix 5 in NEWS:

GForce is deactivated for [[ on non-atomic input, part of #4159. Thanks @hongyuanjia and @ColeMiller1 for helping debug an issue in dev with the original fix before release, #4612.

The verbose output in the SO post (thanks) shows the following:

Making each group and running j (GForce FALSE) ... The result of j is a named list. It's very inefficient to create the same names over and over again for each group. When j=list(...), any names are detected, removed and put back after grouping has completed, for efficiency. Using j=transform(), for example, prevents that speedup (consider changing to :=). This message may be upgraded to warning in future.

This is more a clue for us to fix in this case than something for the user to act on.
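For readers who hit the same verbose message, a minimal sketch of the advice it gives (the toy table and column names below are illustrative, not from the original report):

```r
library(data.table)

dt <- data.table(g = rep(1:2, each = 3), x = 1:6)

# j = list(...): any names are detected once, removed, and re-applied after
# grouping completes, so they are not rebuilt for every group
res <- dt[, list(s = sum(x)), by = g]

# j = transform(...) would rebuild the names inside every group, which is what
# the message warns about; := sidesteps it by assigning by reference
dt[, s := sum(x), by = g]
```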

@mattdowle mattdowle added this to the 1.13.1 milestone Jul 27, 2020
@mattdowle mattdowle changed the title Version 1.13.0 takes forever (24 hours +) instead of 4 seconds when using version 1.12.8 [[ by group takes forever (24 hours +) with v1.13.0 vs 4 seconds with v1.12.8 Jul 27, 2020
MichaelChirico (Member) commented Jul 27, 2020

I had the same initial thought about the cause, but the SO post also reported the same slowdown for an unlist command.

Anyway there's repro data now which makes all the difference. Thanks @fabiocs8 !


ColeMiller1 (Contributor) commented Jul 28, 2020

Thanks for doing all the work of re-installing 1.12.8 to provide additional information. This type of info is great!

The most interesting thing in 1.13.0 is that there is a major performance difference between dt[, v[[1L]], by = grps] versus dt[seq_len(.N), v[[1L]], by = grps] for your RDS. I could only reproduce the performance degradation using the RDS and not the simulated data. Here is some of the verbose output in 1.13.0:

##dt[seq_len(.N), ...]
collecting discontiguous groups took 0.000s for 5000 groups 

##dt[, ...]
memcpy contiguous groups took 50.502s for 5000 groups 

It does appear to be a regression from 1.12.8. If I deactivate the GForce optimization using parentheses (e.g., (val[[1L]])), I get very different results:

##dt[seq_len(.N), (...), ...]
collecting discontiguous groups took 0.004s for 5000 groups 

##dt[, (...), ...]
memcpy contiguous groups took 0.000s for 5000 groups 

So it appears to be related to memcpy. For now, a workaround in 1.13.0 may be item_res[seq_len(.N), lance[[1]], by = item_id].
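A minimal sketch of the two workarounds on a toy nested table (the names dt/id/value are illustrative, standing in for the list column of data.tables in the RDS):

```r
library(data.table)

# toy nested table: a list column where each element is a small data.table
dt <- data.table(id = 1:3,
                 value = list(data.table(v = 1:2),
                              data.table(v = 3:4),
                              data.table(v = 5:6)))

# workaround 1: pass an explicit i so the contiguous memcpy path is not taken
res1 <- dt[seq_len(.N), value[[1L]], by = id]

# workaround 2: wrap j in parentheses to deactivate the GForce [[ optimization
res2 <- dt[, (value[[1L]]), by = id]
```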

In the meantime, here's a way to reproduce the issue with simulated data:

library(data.table) #1.13.0

set.seed(123L)
n = 500L
n_nested = 40L
dt = data.table(id = seq_len(n),
                value = replicate(n, data.table(val1 = sample(n_nested)), simplify = FALSE))

bench::mark(
    dt[seq_len(.N), value[[1L]], by = id]
    , dt[, value[[1L]], by = id]
)
#> Warning: Some expressions had a GC in every iteration; so filtering is disabled.
#> # A tibble: 2 x 6
#>   expression                                 min   median `itr/sec` mem_alloc
#>   <bch:expr>                            <bch:tm> <bch:tm>     <dbl> <bch:byt>
#> 1 dt[seq_len(.N), value[[1L]], by = id]   2.56ms   3.05ms    287.      2.24MB
#> 2 dt[, value[[1L]], by = id]            219.41ms 250.98ms      3.98   51.91MB
#> # ... with 1 more variable: `gc/sec` <dbl>

Using profmem, the seq_len(.N) approach has 29 allocations, while the NULL i has around 250,000 allocations, many of them 208 bytes each.

library(profmem)
profmem( dt[seq_len(.N), value[[1L]], by = id])
profmem( dt[, value[[1L]], by = id]) ## SLOW!

fabiocs8 (Author) commented Jul 28, 2020

I noticed other related weird behaviors. All the commands below were run under V 1.12.8 on the full database:

1) Problems when using item[, lance[[1]], by = unnest_names, verbose = T], see below:

> microbenchmark(
    item_int <- item[, lance[[1]], by = unnest_names,verbose = T]
    , times = 1L)

Detected that j uses these columns: lance
Finding groups using forderv ... forder.c received 872581 rows and 11 columns
0.170s elapsed (0.200s cpu)
Finding group sizes from the positions (can be avoided to save RAM) ... 0.000s elapsed (0.000s cpu)
lapply optimization is on, j unchanged as 'lance[[1]]'
GForce optimized j to 'g[[(lance, 1)'
Making each group and running j (GForce TRUE) ... gforce initial population of grp took 0.002
gforce assign high and low took 0.002
gforce eval took 0.035
0.130s elapsed (0.140s cpu)
Unit: seconds
expr min lq mean median uq max neval
item_int <- item[, lance[[1]], by = unnest_names, verbose = T] 1.175517 1.175517 1.175517 1.175517 1.175517 1.175517 1

> item_int
Error in [.data.table(x, i, , ) :
Column 12 ['V1'] is a data.frame or data.table; malformed data.table.

Note that 1) data.table did not issue any error or warning, and 2) the output has the same number of rows as the input (872,581), which does not make sense (the output should have 16,070,070 rows).

2) item_int <- item[, unlist(lance, recursive = F), by = item_id] runs smoothly and correctly:

> microbenchmark(
    item_int <- item[, unlist(lance, recursive = F), by = unnest_names]
    , times = 5L)

Detected that j uses these columns: lance
Finding groups using forderv ... forder.c received 872581 rows and 11 columns
0.140s elapsed (0.120s cpu)
Finding group sizes from the positions (can be avoided to save RAM) ... 0.000s elapsed (0.000s cpu)
lapply optimization is on, j unchanged as 'unlist(lance, recursive = F)'
GForce is on, left j unchanged
Old mean optimization is on, left j unchanged.
Making each group and running j (GForce FALSE) ... dogroups: growing from 872581 to 18513352 rows
Wrote less rows (16070070) than allocated (18513352).

memcpy contiguous groups took 0.074s for 872581 groups
eval(j) took 1.400s for 872581 calls
8.200s elapsed (6.940s cpu)
Unit: seconds
expr min lq mean median uq
item_int <- item[, unlist(lance, recursive = F), by = unnest_names, verbose = TRUE] 5.777739 8.352493 8.204699 8.3721 8.698475
max neval
9.822689 5

Unit: seconds
expr min lq mean median uq max neval
item_int <- item[, unlist(lance, recursive = F), by = unnest_names] 8.022987 8.245593 8.93737 8.914378 8.982508 10.52138 5

3) item_int <- item[, rbindlist(lance), by = unnest_names] runs correctly, but takes a long time... does it make sense?

> microbenchmark(
    item_int <- item[, rbindlist(lance), by = unnest_names, verbose = TRUE]
    , times = 1L)

Detected that j uses these columns: lance
Finding groups using forderv ... forder.c received 872581 rows and 11 columns
0.300s elapsed (0.280s cpu)
Finding group sizes from the positions (can be avoided to save RAM) ... 0.000s elapsed (0.000s cpu)
lapply optimization is on, j unchanged as 'rbindlist(lance)'
GForce is on, left j unchanged
Old mean optimization is on, left j unchanged.
Making each group and running j (GForce FALSE) ... The result of j is a named list. It's very inefficient to create the same names over and over again for each group. When j=list(...), any names are detected, removed and put back after grouping has completed, for efficiency. Using j=transform(), for example, prevents that speedup (consider changing to :=). This message may be upgraded to warning in future.
dogroups: growing from 872581 to 18513352 rows
Wrote less rows (16070070) than allocated (18513352).

memcpy contiguous groups took 0.113s for 872581 groups
eval(j) took 98.270s for 872581 calls
00:01:43 elapsed (00:01:39 cpu)
Unit: seconds
expr min lq mean median uq max neval
item_int <- item[, rbindlist(lance), by = unnest_names, verbose = TRUE] 103.8269 103.8269 103.8269 103.8269 103.8269 103.8269 1

Regards,

Fabio.

ColeMiller1 (Contributor) commented Jul 29, 2020

Additional research: #4164 is the commit where this started.

## merged commit before #4164
##remotes::install_github("https://github.com/Rdatatable/data.table/tree/793f8545c363d222de18ac892bc7abb80154e724") # good

##  expression                                 min   median `itr/sec` mem_alloc `gc/sec` n_itr  n_gc total_time result                    memory               time         gc             
##  <bch:expr>                            <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl> <int> <dbl>   <bch:tm> <list>                    <list>               <list>       <list>         
##1 dt[seq_len(.N), value[[1L]], by = id]   3.68ms   4.06ms      232.    2.26MB     5.11    91     2      392ms <data.table[,2] [20,000 ~ <Rprofmem[,3] [260 ~ <bch:tm [93~ <tibble [93 x ~
##2 dt[, value[[1L]], by = id]              2.06ms   2.29ms      430.  401.24KB     4.16   207     2      481ms <data.table[,2] [20,000 ~ <Rprofmem[,3] [17 x~ <bch:tm [20~ <tibble [209 x~

## final commit with merged #4164 
# remotes::install_github("https://github.com/Rdatatable/data.table/tree/4aadde8f5a51cd7c8f3889964e7280432ec65bbc") #bad - merged commit
##  expression                                 min   median `itr/sec` mem_alloc `gc/sec` n_itr  n_gc total_time result                   memory                  time        gc            
##  <bch:expr>                            <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl> <int> <dbl>   <bch:tm> <list>                   <list>                  <list>      <list>        
##1 dt[seq_len(.N), value[[1L]], by = id]   1.25ms   1.39ms    572.      2.26MB    12.0    286     6      500ms <data.table[,2] [20,000~ <Rprofmem[,3] [260 x 3~ <bch:tm [2~ <tibble [286 ~
##2 dt[, value[[1L]], by = id]            173.01ms 188.94ms      5.29   52.03MB     5.29     3     3      567ms <data.table[,2] [20,000~ <Rprofmem[,3] [251,669~ <bch:tm [3~ <tibble [3 x ~

The last thing I looked into: if byval is ordered (i.e., length(o__) == 0L), then memrecycle is triggered. So while ideally memrecycle would be more performant on lists, another fix would be to do the contiguous gather only when length(o__) == 0L && !any_col_is_list (i.e., when there is a list column, use subsetVectorRaw instead). Alternatively, we could look further into an unnest function (#4156).

Code paths for the discontiguous vs. contiguous gather:

data.table/src/dogroups.c

Lines 161 to 179 in db61844

if (LENGTH(order)==0) {
  const int rownum = grpn ? istarts[i]-1 : -1;
  for (int j=0; j<grpn; ++j) iI[j] = rownum+j+1;
  if (rownum>=0) {
    for (int j=0; j<length(SDall); ++j)
      memrecycle(VECTOR_ELT(SDall,j), R_NilValue, 0, grpn, VECTOR_ELT(dt, INTEGER(dtcols)[j]-1), rownum, grpn, j+1, "Internal error assigning to SDall");
    for (int j=0; j<length(xSD); ++j)
      memrecycle(VECTOR_ELT(xSD,j), R_NilValue, 0, 1, VECTOR_ELT(dt, INTEGER(xjiscols)[j]-1), rownum, 1, j+1, "Internal error assigning to xSD");
  }
  if (verbose) { tblock[0] += clock()-tstart; nblock[0]++; }
} else {
  for (int k=0; k<grpn; ++k) iI[k] = iorder[ istarts[i]-1 + k ];
  for (int j=0; j<length(SDall); ++j) {
    // this is the main non-contiguous gather, and is parallel (within-column) for non-SEXP
    subsetVectorRaw(VECTOR_ELT(SDall,j), VECTOR_ELT(dt,INTEGER(dtcols)[j]-1), I, /*anyNA=*/false);
  }
  if (verbose) { tblock[1] += clock()-tstart; nblock[1]++; }
  // The two blocks have separate timing statements to make sure which is running
}

Performance difference:

library(data.table) #1.13.0

set.seed(123L)
n = 500L
n_nested = 40L
dt = data.table(ordered_id = seq_len(n),
                unordered_id = sample(n),
                value = replicate(n, data.table(val1 = sample(n_nested)), simplify = FALSE))

bench::mark(
    dt[, value[[1L]], by = ordered_id]
    , dt[, value[[1L]], by = unordered_id]
    , check = FALSE
    , time_unit = "ms" 
)
#> Warning: Some expressions had a GC in every iteration; so filtering is disabled.
#> # A tibble: 2 x 6
#>   expression                              min median `itr/sec` mem_alloc
#>   <bch:expr>                            <dbl>  <dbl>     <dbl> <bch:byt>
#> 1 dt[, value[[1L]], by = ordered_id]   93.5   124.        8.25    53.7MB
#> 2 dt[, value[[1L]], by = unordered_id]  0.634   1.69    629.     409.2KB
#> # ... with 1 more variable: `gc/sec` <dbl>

I will work on a PR for !any_type_of_col_is_list to use subsetVectorRaw this weekend, unless I hear there's a better route to fix the root cause.
