So at this point it knows that DT is unique and could return it, or a shallow copy, straight away. But it doesn't: it carries on, turns the all-FALSE result into 1:nrow, and then subsets every column by that 1:nrow.
We should also time forderv to make sure it short-circuits correctly once it has resolved all ambiguities in the first few columns. forderv should not touch B in this example at all, because A alone is enough to establish uniqueness.
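To make the expected short-circuit concrete, here is a minimal base R sketch of the idea. It is not forderv and says nothing about how forderv works internally; `cols_needed_for_uniqueness` is a hypothetical helper and the example data is made up. It looks at columns left to right and stops as soon as the columns seen so far already identify every row.

```r
# Hypothetical helper (not part of data.table): returns how many leading
# columns must be inspected before every row is known to be unique,
# or NA_integer_ if the rows are not unique even using all columns.
cols_needed_for_uniqueness <- function(x) {
  for (k in seq_along(x)) {
    # Only the first k columns are examined; later columns are untouched.
    if (anyDuplicated(x[seq_len(k)]) == 0L) return(k)
  }
  NA_integer_
}

# Made-up example data: A alone is unique, so B never needs to be read.
a_is_enough <- data.frame(A = 1:10, B = rep(1, 10))
# Here A has ties, so B is needed to resolve them.
needs_b <- data.frame(A = c(1, 1, 2), B = c(1, 2, 2))
# Here rows 1 and 2 are identical, so no prefix of columns is unique.
has_dups <- data.frame(A = c(1, 1, 2), B = c(1, 1, 2))

n_a_is_enough <- cols_needed_for_uniqueness(a_is_enough)
n_needs_b <- cols_needed_for_uniqueness(needs_b)
n_has_dups <- cols_needed_for_uniqueness(has_dups)
```

A timing check would compare the A-only case against a case where every column has to be consulted; if the short-circuit works, the first should not get slower as more columns are added after A.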
Rscript dup_timing.R old
Rscript dup_timing.R new
This is free and required almost no effort.
Two remaining things can be done:
Confirm that forderv can short-circuit early when we're only checking for uniqueness and have already established it before iterating over all columns. Does this require a new argument to forderv?
Running duplicated.data.table within unique.data.table still requires allocating and returning rep.int(FALSE, nrow(x)), which is probably slow. Would it be better to split the logic of unique.data.table so we can just return(x) instead?
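The second point can be sketched roughly as follows. This is an illustration on a plain data.frame in base R, not the data.table implementation, and `unique_shortcut` is a hypothetical name. It checks for duplicates first and returns the input as-is when there are none, so no all-FALSE vector is built and no column is subset.

```r
# Hypothetical helper, base R only; not data.table's unique.data.table.
unique_shortcut <- function(x) {
  # anyDuplicated() returns 0L when no row is duplicated, so the
  # input can be handed back without subsetting any column.
  if (anyDuplicated(x) == 0L) return(x)
  # Otherwise fall back to the usual logical-vector subset.
  x[!duplicated(x), , drop = FALSE]
}

# Made-up example data.
df_unique <- data.frame(A = 1:5, B = c(1, 1, 2, 2, 3))
df_dups <- data.frame(A = c(1L, 1L, 2L), B = c("x", "x", "y"))

res_unique <- unique_shortcut(df_unique)  # returned unchanged
res_dups <- unique_shortcut(df_dups)      # duplicate row dropped
```

Whether the real method should return x itself or a shallow copy is a separate design question that this sketch does not address.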