
Option for unique to _also_ delete first instance of any duplicated observation #1163

Open
MichaelChirico opened this issue May 27, 2015 · 4 comments

Comments

@MichaelChirico
Member

I'm frequently presented with data.tables that have a very small percentage of duplicated keys, which causes trouble when I use them in j to merge.

In my application it makes sense to drop all of those observations: they can't reliably be distinguished, and they're infrequent enough that losing them is harmless.

unique seems perfectly suited to this end, except that it retains one observation from each duplicated group, whereas I'd prefer to cut such groups out completely; it's essential not to corrupt the merge, whose results are crucial for the whole project.

There's a workaround, but it's quite verbose; let's use the sample from ?duplicated.data.table:

DT <- data.table(A = rep(1:3, each=4), B = rep(1:4, each=3), C = rep(1:2, 6), key = "A,B")

> unique(DT)
   A B C
1: 1 1 1
2: 1 2 2
3: 2 2 1
4: 2 3 1
5: 3 3 1
6: 3 4 2

The troublesome rows are 1, 3, 4, and 6 of this output: each corresponds to a key that is duplicated in DT.

From what I can tell, my only recourse at the moment is something rather elaborate, like:

> DT[.(DT[,.N,by=key(DT)][N==1,!"N",with=F])]
   A B C
1: 1 2 2
2: 3 3 1

Something like unique(DT, only.unique=TRUE) that achieves the same end seems like it would be easy to implement.
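[Editor's note: to make the requested semantics concrete, here is a small language-neutral sketch in Python (variable names are illustrative, not part of any data.table API). It counts each (A, B) key of the example DT and keeps only the keys whose group size is exactly 1, i.e. it drops *every* member of a duplicated group rather than retaining the first:]

```python
from collections import Counter

# (A, B) key of each row of the example DT, in keyed (sorted) order
keys = [(1, 1), (1, 1), (1, 1), (1, 2), (2, 2), (2, 2),
        (2, 3), (2, 3), (2, 3), (3, 3), (3, 4), (3, 4)]

counts = Counter(keys)                               # group sizes per key
strictly_unique = [k for k in keys if counts[k] == 1]
# -> [(1, 2), (3, 3)], matching the two rows the workaround returns
```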

---EDIT 2015 May 27---

Arun's suggested workaround is much better than my approach, but a comparison with unique suggests considerable speed is still being left on the table:

> microbenchmark(times=1000L,
+                arun(),mike(),unique(DT))
Unit: microseconds
       expr      min        lq      mean    median       uq      max neval cld
     arun()  775.852  818.4715  950.2565  840.2605  865.327 47848.84  1000  b 
     mike() 2269.876 2346.1640 2953.5697 2413.6900 2478.700 50269.23  1000   c
 unique(DT)  199.339  225.0725  289.1449  239.5555  253.971 46924.54  1000 a  

Performance of the workaround also deteriorates substantially when .SD is large, while that of unique is barely affected:

DT[,paste0("V",1:100):=lapply(1:100,function(x)sample(.N))]
> microbenchmark(times=1000L,
+                arun(),unique(DT))
Unit: microseconds
       expr      min        lq      mean   median        uq       max neval cld
     arun() 3397.032 3517.9175 4686.5543 3631.659 3728.6725 56132.181  1000   b
 unique(DT)  212.203  234.8935  256.9668  248.669  267.7265   623.812  1000  a 
@arunsrinivasan
Member

Would be nice to have this option as well (e.g. strict = TRUE or strictly.unique = TRUE), but how about DT[, if (.N == 1L) .SD, by=key(DT)]?

@MichaelChirico
Member Author

Indeed, that's a much better workaround; I've edited the FR to reflect it. Note, though, that there still seems to be room for up to a 3x speed improvement by extending unique itself.

@franknarf1
Contributor

Another workaround: using unique itself and then differencing row numbers may be faster:

fr   <- function() unique(DT[,ii:=.I])[(!c(diff(ii)-1L,ii[.N]-nrow(DT)))]

For example in this case:

n_group       <- 1e5
draw_n_member <- 1:5
draw_val      <- 1:5

DT0 <- data.table(id = 1:n_group)
DT  <- DT0[, sample(draw_val,
                    sample(if (.GRP %in% c(1, n_group)) 1 else draw_n_member, 1),
                    replace = TRUE),
           keyby = id]

microbenchmark(times=10,arun(),fr(),unique(DT))

Change the simulation parameters, though, and this approach can turn out worse.

@MichaelChirico
Member Author

This would match the keep = False behavior of pandas.DataFrame.drop_duplicates:

https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.drop_duplicates.html
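[Editor's note: for comparison, a minimal pandas example of that keep=False behavior on a toy frame (data invented for illustration); every member of a duplicated group is dropped, including the first occurrence, which is exactly what this FR requests for unique:]

```python
import pandas as pd

df = pd.DataFrame({"A": [1, 1, 1, 2],
                   "B": [1, 1, 2, 2]})

# keep=False drops ALL rows whose (A, B) key is duplicated,
# leaving only rows whose key occurs exactly once
out = df.drop_duplicates(subset=["A", "B"], keep=False)
```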

@jangorecki jangorecki changed the title Feature Request: Option for unique to _also_ delete first instance of any duplicated observation Option for unique to _also_ delete first instance of any duplicated observation Apr 3, 2020