
Option for unique to _also_ delete first instance of any duplicated observation #1163

Open
MichaelChirico opened this issue May 27, 2015 · 4 comments

Comments

@MichaelChirico
Member

I'm frequently presented with data.tables that have a very small percentage of duplicated keys, which causes trouble when I use them in j to merge.

In my application it makes sense to drop all of those observations: they can't reliably be distinguished, and they're infrequent enough that losing them is harmless.

unique seems perfectly suited to this end, except that it retains one observation from each duplicated group, whereas I'd prefer to cut such groups out completely; it's essential not to corrupt the merge, whose results are crucial for the whole project.

There's a workaround, but it's quite verbose; let's use the sample from ?duplicated.data.table:

DT <- data.table(A = rep(1:3, each=4), B = rep(1:4, each=3), C = rep(1:2, 6), key = "A,B")

> unique(DT)
   A B C
1: 1 1 1
2: 1 2 2
3: 2 2 1
4: 2 3 1
5: 3 3 1
6: 3 4 2

The troublesome rows are 1, 3, 4, and 6 of this output: each corresponds to a key that is duplicated in DT.

From what I can tell, my only recourse at the moment is something rather elaborate, like:

> DT[.(DT[,.N,by=key(DT)][N==1,!"N",with=F])]
   A B C
1: 1 2 2
2: 3 3 1

Something like unique(DT, only.unique=TRUE) that achieves the same end seems like it would be easy to implement.
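[Editor's note: to make the requested semantics concrete, here is a small language-neutral sketch in Python (variable names are illustrative, not part of any data.table API). It counts each (A, B) key of the example DT and keeps only the keys whose group size is exactly 1, i.e. it drops *every* member of a duplicated group rather than retaining the first:]

```python
from collections import Counter

# (A, B) key of each row of the example DT, in keyed (sorted) order
keys = [(1, 1), (1, 1), (1, 1), (1, 2), (2, 2), (2, 2),
        (2, 3), (2, 3), (2, 3), (3, 3), (3, 4), (3, 4)]

counts = Counter(keys)                               # group sizes per key
strictly_unique = [k for k in keys if counts[k] == 1]
# -> [(1, 2), (3, 3)], matching the two rows the workaround returns
```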

---EDIT 2015 May 27---

Arun's suggested workaround is much better than my approach, but a comparison with unique suggests considerable speed is still being left on the table:

> microbenchmark(times=1000L,
+                arun(),mike(),unique(DT))
Unit: microseconds
       expr      min        lq      mean    median       uq      max neval cld
     arun()  775.852  818.4715  950.2565  840.2605  865.327 47848.84  1000  b 
     mike() 2269.876 2346.1640 2953.5697 2413.6900 2478.700 50269.23  1000   c
 unique(DT)  199.339  225.0725  289.1449  239.5555  253.971 46924.54  1000 a  

Performance of the workaround also deteriorates substantially when .SD is large, while that of unique is barely affected:

DT[,paste0("V",1:100):=lapply(1:100,function(x)sample(.N))]
> microbenchmark(times=1000L,
+                arun(),unique(DT))
Unit: microseconds
       expr      min        lq      mean   median        uq       max neval cld
     arun() 3397.032 3517.9175 4686.5543 3631.659 3728.6725 56132.181  1000   b
 unique(DT)  212.203  234.8935  256.9668  248.669  267.7265   623.812  1000  a 
@arunsrinivasan
Member

Would be nice to have this option as well (e.g. strict = TRUE or strictly.unique = TRUE), but how about DT[, if (.N == 1L) .SD, by=key(DT)]?

@MichaelChirico
Member Author

Indeed, that's a much better workaround; I've edited the FR to reflect it. Note, though, that there still seems to be room for up to a 3x speed improvement by extending unique itself.

@franknarf1
Contributor

Another workaround: using unique itself and then differencing row numbers may be faster:

fr   <- function() unique(DT[,ii:=.I])[(!c(diff(ii)-1L,ii[.N]-nrow(DT)))]

For example in this case:

n_group       <- 1e5
draw_n_member <- 1:5
draw_val      <- 1:5

DT0 <- data.table(id = 1:n_group)
DT  <- DT0[, sample(draw_val,
                    sample(if (.GRP %in% c(1, n_group)) 1 else draw_n_member, 1),
                    replace = TRUE),
           keyby = id]

microbenchmark(times=10,arun(),fr(),unique(DT))

Change the simulation parameters, though, and this approach can turn out worse.

@MichaelChirico
Member Author

This would match the keep = False behavior of pandas.DataFrame.drop_duplicates:

https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.drop_duplicates.html
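[Editor's note: for comparison, a minimal pandas example of that keep=False behavior on a toy frame (data invented for illustration); every member of a duplicated group is dropped, including the first occurrence, which is exactly what this FR requests for unique:]

```python
import pandas as pd

df = pd.DataFrame({"A": [1, 1, 1, 2],
                   "B": [1, 1, 2, 2]})

# keep=False drops ALL rows whose (A, B) key is duplicated,
# leaving only rows whose key occurs exactly once
out = df.drop_duplicates(subset=["A", "B"], keep=False)
```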

@jangorecki jangorecki changed the title Feature Request: Option for unique to _also_ delete first instance of any duplicated observation Option for unique to _also_ delete first instance of any duplicated observation Apr 3, 2020