unique(DT) when there are no dups could be much faster #2013
Working on this on branch unique_speedup. For now, just inserted the short-circuit you mentioned in
# dup_timing.R
use_old = commandArgs(trailingOnly = TRUE)[1L] == 'old'
repos = if (use_old) 'http://Rdatatable.github.io/data.table' else NULL
pkgs = if (use_old) 'data.table' else '~/data.table_1.10.5.tar.gz'
remove.packages('data.table')
install.packages(pkgs, type = 'source', repos = repos)
library(data.table)
set.seed(039203)
NN = 1e8
DT = data.table(
A = sample(1000, NN, TRUE),
B = sample(1000, NN, TRUE),
C = sample(1000, NN, TRUE)
)
DT = unique(DT)
system.time(unique(DT))
# timing_runs.sh
Rscript dup_timing.R old
Rscript dup_timing.R new
This is free and required almost no effort. Two remaining things can be done:
(The new default of using all columns brings this to the fore.)
So at this point it knows that DT is unique and it could return it or a shallow copy straight away. But it doesn't. It carries on to turn all-FALSE into 1:nrow and then subset every column by that 1:nrow.
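The short-circuit described above can be sketched roughly as follows. This is only an illustration in base R on a plain data.frame, not data.table's actual code: `unique_short_circuit` is a hypothetical name, and the real method would return the data.table itself or a shallow copy. The idea carries over because `duplicated()` also has a data.table method.

```r
# Hypothetical helper, NOT data.table's implementation: a base R stand-in
# that illustrates returning early when no duplicates are found.
unique_short_circuit <- function(df) {
  dups <- duplicated(df)  # logical vector, TRUE for each repeated row
  if (!any(dups)) {
    # All FALSE: the input is already unique, so return it as-is instead of
    # turning all-FALSE into 1:nrow and subsetting every column by it.
    return(df)
  }
  # Slow path: keep the first occurrence of each row.
  df[!dups, , drop = FALSE]
}

no_dups   <- data.frame(A = 1:5, B = c(1L, 1L, 2L, 2L, 3L))
with_dups <- data.frame(A = c(1L, 1L, 2L), B = c(7L, 7L, 8L))

res_fast <- unique_short_circuit(no_dups)    # takes the early return
res_slow <- unique_short_circuit(with_dups)  # takes the subsetting path
```

The early return avoids allocating an index vector and copying every column, which is where the saving on an already-unique table would come from.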
Also, forderv should be timed to make sure it is short-circuiting correctly once it resolves ambiguities in the first few columns. forderv should not touch B in this example at all because A is enough to reach uniqueness.