unique(DT) when there are no dups could be much faster #2013

mattdowle opened this issue Feb 3, 2017 · 1 comment


@mattdowle mattdowle commented Feb 3, 2017

(The new default of using all columns brings this to the fore.)

DT = data.table(A=1:3, B=4:6)
   A B
1: 1 4
2: 2 5
3: 3 6

Browse[3]> o
[1] 1 2 3
[1] 1

So at this point it knows that DT is unique and it could return it or a shallow copy straight away. But it doesn't. It carries on to turn all-FALSE into 1:nrow and then subset every column by that 1:nrow.
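The early return described above can be sketched in base R. This is only an illustration of the idea, not data.table's implementation: `unique_short_circuit` is a hypothetical helper, and it uses `anyDuplicated()` on a plain data.frame as a stand-in for the grouping information that the ordering step already provides inside `unique`.

```r
# Hypothetical helper illustrating the short-circuit; not data.table code.
unique_short_circuit <- function(x) {
  # anyDuplicated() returns 0L when no row repeats an earlier row
  if (anyDuplicated(x) == 0L) {
    return(x)  # nothing to drop: hand the input back without subsetting
  }
  # otherwise keep the first occurrence of each row
  x[!duplicated(x), , drop = FALSE]
}

df <- data.frame(A = 1:3, B = 4:6)
res <- unique_short_circuit(df)
```

In the no-duplicates case the function skips the row subset entirely, which is the step the issue identifies as wasted work.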

We should also time forderv to make sure it short-circuits correctly once it resolves ambiguities in the first few columns. In this example forderv should not touch B at all, because A alone is enough to reach uniqueness.


@MichaelChirico MichaelChirico commented Oct 19, 2017

Working on this on branch unique_speedup

For now, I have just inserted the short-circuit you mentioned. The speed-up from doing this alone seems to be about 30% (regardless of the number of rows). Speed testing script:

# dup_timing.R
use_old = commandArgs(trailingOnly = TRUE)[1L] == 'old'

repos = if (use_old) '' else NULL
pkgs = if (use_old) 'data.table' else '~/data.table_1.10.5.tar.gz'

install.packages(pkgs, type = 'source', repos = repos)
library(data.table)

# three integer columns with 1000 possible values each
NN = 1e8
DT = data.table(
  A = sample(1000, NN, TRUE),
  B = sample(1000, NN, TRUE),
  C = sample(1000, NN, TRUE)
)
DT = unique(DT)

Rscript dup_timing.R old
Rscript dup_timing.R new

This is free and required almost no effort.

Two remaining things can be done:

  • Confirm forderv can be short-circuited early when we're only checking for uniqueness and have already established it before iterating over all columns. Does this require a new argument to forderv?

  • Running within still necessitates declaring/returning the object, nrow(x)), which is probably slow. Better to split the logic of so we can just return(x) instead?
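To make the first point concrete, here is a rough base R sketch of stopping a column-by-column uniqueness check as soon as the leading columns already distinguish every row. It is an assumption-laden illustration: `cols_needed_for_uniqueness` is a hypothetical helper, and it refines groups with `interaction()` rather than the radix ordering that forderv actually performs.

```r
# Hypothetical helper: how many leading columns must be inspected
# before every row is known to be distinct? Not how forderv works.
cols_needed_for_uniqueness <- function(x) {
  grp <- rep(1L, nrow(x))  # all rows start in a single group
  for (j in seq_along(x)) {
    # split each existing group by the values of column j
    grp <- as.integer(interaction(grp, x[[j]], drop = TRUE))
    # every row is in its own group: later columns need not be read
    if (anyDuplicated(grp) == 0L) return(j)
  }
  NA_integer_  # duplicates remain even after using all columns
}

n_cols <- cols_needed_for_uniqueness(data.frame(A = 1:3, B = 4:6))
```

For the A=1:3, B=4:6 example from the top of this issue, the loop returns after the first column, so B is never read.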
