Option for unique to _also_ delete first instance of any duplicated observation #1163
Comments
Would be nice to have this option as well (ex: …)
Indeed that's a much better workaround; I've edited the FR to reflect that, but also note that there still seems to be room for up to 3x improvement in speed by extending …
Another workaround: using …
For example in this case: …
Change the simulation parameters, though, and this way turns out worse.
This would match the behavior of pandas' drop_duplicates: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.drop_duplicates.html
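For reference, the pandas behavior being pointed to is the `keep=False` option of `drop_duplicates`, which removes *every* row in a duplicated group rather than retaining the first. A minimal sketch (the frame and column names are illustrative, not from the issue):

```python
import pandas as pd

# Small frame where the (A, B) key of the first two rows is duplicated.
df = pd.DataFrame({
    "A": [1, 1, 2, 3],
    "B": [1, 1, 2, 3],
    "C": [10, 20, 30, 40],
})

# Default keep="first": one row per duplicated group survives (3 rows here).
first_kept = df.drop_duplicates(subset=["A", "B"])

# keep=False: all rows of any duplicated group are dropped, including the
# first instance -- the behavior this feature request asks of unique().
deduped = df.drop_duplicates(subset=["A", "B"], keep=False)

print(first_kept)  # 3 rows
print(deduped)     # 2 rows: only the singleton (A, B) keys remain
```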
I'm frequently presented with data.tables which have (a very small percentage of) duplicated keys, which causes some trouble when I use them in j to merge. In my application, it makes sense to just drop any of those observations because they can't reliably be distinguished and they're so infrequent; unique seems perfectly suited to this end, except that it retains at least one of the observations in any duplicated group, while I prefer to cut them out completely because it's essential not to mess up the merge process, the results of which are crucial for the whole project.

There's a workaround which is quite verbose; let's use the sample in ?duplicated.data.table:

DT <- data.table(A = rep(1:3, each = 4), B = rep(1:4, each = 3), C = rep(1:2, 6), key = "A,B")
The troublesome observations are 1,3,4,6.
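As a pandas analogue of the requested behavior on this same sample (the issue itself is about data.table; this sketch just mirrors the construction above and filters out all rows belonging to the four duplicated key groups):

```python
import numpy as np
import pandas as pd

# pandas analogue of the data.table sample: same A, B, C columns, keyed on (A, B).
DT = pd.DataFrame({
    "A": np.repeat([1, 2, 3], 4),
    "B": np.repeat([1, 2, 3, 4], 3),
    "C": [1, 2] * 6,
})

# duplicated(..., keep=False) flags every row of any (A, B) group with more
# than one member; negating the mask keeps only the singleton key groups,
# dropping the first instance of each duplicated key along with the rest.
only_singletons = DT[~DT.duplicated(subset=["A", "B"], keep=False)]

print(only_singletons[["A", "B"]])  # only keys (1, 2) and (3, 3) survive
```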
From what I can tell my only recourse at the moment is something rather elaborate like
Something like unique(DT, only.unique = TRUE) that achieves the same end seems like it would be easy to implement.

--- EDIT 2015 May 27 ---
Arun's suggested workaround is much better than my approach, but comparison with unique suggests there's still considerable speed being lost. Performance of the workaround also deteriorates substantially when .SD is large, while that of unique is barely affected.