uniqueN() is very slow compared to length(unique()) #3739

Open
shrektan opened this issue Aug 2, 2019 · 11 comments
Labels: bug, top request (One of our most-requested issues)

Comments

@shrektan
Member

shrektan commented Aug 2, 2019

I'm using the latest dev version of data.table. uniqueN() is still an order of magnitude slower than length(unique()), so slow that I think it should be tagged as a bug. See the reprex below.

Note that the code below uses 4 threads (on a Win7 machine). Setting the thread count to 1 halves the time, but uniqueN() remains significantly slower than length(unique()).

(In fact, I noticed this because a daily routine script of mine takes about 20 minutes, and trying to speed it up led me to the cause: uniqueN().)

Character

library(data.table)
set.seed(1000)
mk_rd_words <- function(min = 4, max = 20) {
  n <- floor(runif(1, min, max))
  paste0(sample(c(letters, LETTERS), size = n, replace = TRUE), collapse = '')
}
words <- vapply(1:1000, function(x) mk_rd_words(4, 50), FUN.VALUE = 'a')

n <- 1e4
tbl <- data.table(
  a = sample(words, size = n, replace = TRUE),
  b = sample(words, size = n, replace = TRUE)
)
microbenchmark::microbenchmark(
  times = 100,
  tbl[, .(N = uniqueN(b)), keyby = a],
  tbl[, .(N = length(unique(b))), keyby = a]
)
#> Unit: milliseconds
#>                                        expr       min         lq       mean     median         uq       max neval
#>         tbl[, .(N = uniqueN(b)), keyby = a] 169.13260 171.651133 176.649940 173.972808 181.373316 201.70846   100
#>  tbl[, .(N = length(unique(b))), keyby = a]   8.12066   8.607032   9.233874   8.738746   9.208014  16.28779   100

Created on 2019-08-02 by the reprex package (v0.2.1)

Double

library(data.table)
set.seed(1000)
n <- 1e4
tbl <- data.table(
  a = sample(1:1e3, size = n, replace = TRUE),
  b = sample(1:1e3, size = n, replace = TRUE)
)
microbenchmark::microbenchmark(
  times = 100,
  tbl[, .(N = uniqueN(b)), keyby = a],
  tbl[, .(N = length(unique(b))), keyby = a]
)
#> Unit: milliseconds
#>                                        expr        min         lq       mean     median         uq       max neval
#>         tbl[, .(N = uniqueN(b)), keyby = a] 107.329319 111.531912 119.497038 115.412347 124.303334 158.04780   100
#>  tbl[, .(N = length(unique(b))), keyby = a]   5.777745   5.980306   6.992722   6.314294   7.678619  13.08759   100

Created on 2019-08-02 by the reprex package (v0.2.1)
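
For anyone reproducing this: the thread count can be pinned explicitly with data.table's setDTthreads() before benchmarking (a minimal sketch; the single-thread setting here is just an example):

library(data.table)
setDTthreads(1)  # force single-threaded execution for a like-for-like comparison
getDTthreads()   # verify the current setting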

@shrektan shrektan added the bug label Aug 2, 2019
@shrektan
Member Author

shrektan commented Aug 2, 2019

Well, I see. For very large vectors, uniqueN() is faster. The problem is that when uniqueN() is called millions of times (e.g. when there are many subgroups), the slowdown is significant.

Maybe for small atomic vectors we should just use length(unique()) instead? (A sketch of that idea follows the benchmark below.)

(Uses 4 threads on macOS.)

library(data.table)
set.seed(1000)
# small vector, uniqueN() is slower
x <- sample(1:1e5, size = 1e2, replace = TRUE)
microbenchmark::microbenchmark(
  uniqueN(x),
  length(unique(x))
)
#> Unit: microseconds
#>               expr    min      lq     mean  median     uq      max neval
#>         uniqueN(x) 20.744 21.3460 32.05373 21.7050 22.226 1010.481   100
#>  length(unique(x))  4.757  5.3565  6.18967  5.9215  6.698   18.199   100
# big vector, uniqueN() is faster
x <- sample(1:1e5, size = 1e6, replace = TRUE)
microbenchmark::microbenchmark(
  uniqueN(x),
  length(unique(x))
)
#> Unit: milliseconds
#>               expr       min       lq     mean   median       uq      max neval
#>         uniqueN(x)  9.176062 10.31124 11.55733 10.80676 12.44610 16.72396   100
#>  length(unique(x)) 20.834950 22.74216 26.23369 25.77416 27.84571 43.23249   100

Created on 2019-08-02 by the reprex package (v0.2.1)
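
A user-side workaround along those lines (not data.table's internal fix; just a hypothetical helper with an arbitrary cutoff) could dispatch on vector length:

library(data.table)

# Hypothetical helper: below the cutoff, base R's unique() has less
# per-call overhead; above it, the parallel uniqueN() tends to win.
# The 1e4 cutoff is a guess, not a measured crossover point.
uniqueN2 <- function(x, cutoff = 1e4L) {
  if (length(x) < cutoff) length(unique(x)) else uniqueN(x)
}

tbl[, .(N = uniqueN2(b)), keyby = a]  # drop-in use in the grouped query from the original post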

@jangorecki
Member

It makes sense to include the number of cores in those timings.

@shrektan
Member Author

shrektan commented Aug 9, 2019

Thanks for the reminder. It uses 4 threads, and I've noted the thread count in my original post above.

@r2evans

r2evans commented Mar 10, 2020

Another example, perhaps also needing "n-cores" expansion: https://stackoverflow.com/questions/60623235/r-why-dplyr-counts-unique-values-n-distinct-by-groups-faster-than-data-table

@skanskan

The same happens for me.
These two lines produce the same result on my data (100,000 distinct IDs, around 5 rows per ID).

DT[, if (.N == uniqueN(test)) .SD, by = ID]
DT[, c(.SD, .N), by = .(ID, test)][, if (!any(N > 1)) .SD, by = ID][, -"N"]

But the former is 10 times slower.
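
For context, synthetic data matching that description could be built as below (the seed and value ranges are assumptions, not skanskan's actual data), after which both lines above can be timed on it:

library(data.table)
set.seed(1)
n_id <- 1e5L                              # 100,000 distinct IDs
DT <- data.table(
  ID   = rep(seq_len(n_id), each = 5L),   # about 5 rows per ID
  test = sample(1:10, n_id * 5L, replace = TRUE)
)
system.time(DT[, if (.N == uniqueN(test)) .SD, by = ID])
system.time(DT[, c(.SD, .N), by = .(ID, test)][, if (!any(N > 1)) .SD, by = ID][, -"N"])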

@MichaelChirico
Member

@skanskan can you confirm that timing is on current master? There was a related commit just yesterday; my understanding is the timing should be better (though probably not yet equivalent).

@skanskan

Right now I'm using the stable version; I'm doing calculations for an article and didn't want to mess with versions. But I'm going to try the master version.

@skanskan

skanskan commented Jun 19, 2020

OK, I have just compared data.table 1.12.8 vs 1.12.9.
Windows 10 x64, i7 7700HQ (8 cores), 16GB RAM.

DT[, if (.N == uniqueN(test)) .SD, by = ID] improved from 44.3s to 17.5s,
but
DT[, c(.SD, .N), by = .(ID, test)][, if (!any(N > 1)) .SD, by = ID][, -"N"] is now much slower: it went from 1.9s to 6.2s. The slow part seems to be [, if (!any(N > 1)) .SD, by = ID].

I have also tried
DT[, if (!anyDuplicated(test)) .SD, by = ID], which was 1.3s and is now 1.4s.
DT[, if (.N == length(unique(test))) .SD, by = ID] is 1.6s.

@jangorecki
Member

@shrektan any idea which version of data.table you were using when you reported this issue?

@skanskan j = if (...) .SD looks like a different issue. Could you please file a new issue documenting the regression?

@jangorecki
Member

jangorecki commented Oct 12, 2020

1.12.8

Unit: milliseconds
                                       expr       min        lq      mean    median        uq     max neval
        tbl[, .(N = uniqueN(b)), keyby = a] 22.487731 23.279140 26.204391 24.351137 25.483656 54.6224   100
 tbl[, .(N = length(unique(b))), keyby = a]  6.958176  7.077496  8.084379  7.321736  8.482808 21.8727   100

master

Unit: milliseconds
                                       expr       min        lq      mean   median        uq      max neval
        tbl[, .(N = uniqueN(b)), keyby = a] 18.444103 19.182902 20.200975 19.83011 20.730743 40.06141   100
 tbl[, .(N = length(unique(b))), keyby = a]  6.995607  7.304626  7.820315  7.48119  8.519765 10.77755   100

4 cores / 50%

@shrektan
Member Author

shrektan commented Oct 12, 2020

@jangorecki I think I was using the dev version of data.table at the time of writing, so it should be version 1.12.3 (commit 2986736).

BTW, I get basically the same result as you using version 1.13.0: still slower (2-3x), but not by as much (10-30x) as it was in 1.12.3.
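
For reference, whether data.table internally optimizes a grouped call (GForce) in a given version can be checked with verbose output; the exact messages vary by version (a minimal sketch, reusing tbl from the first reprex):

options(datatable.verbose = TRUE)
tbl[, .(N = uniqueN(b)), keyby = a]  # look for a "GForce ..." line in the console output
options(datatable.verbose = FALSE)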

@mattdowle mattdowle modified the milestones: 1.13.1, 1.13.3 Oct 17, 2020
@jangorecki jangorecki modified the milestones: 1.14.3, 1.14.5 Jul 19, 2022
@jangorecki jangorecki modified the milestones: 1.14.11, 1.15.1 Oct 29, 2023
@jangorecki jangorecki removed this from the 1.16.0 milestone Nov 6, 2023
@MichaelChirico MichaelChirico added the top request One of our most-requested issues label Apr 14, 2024