fsetequal treatment of dupes inconsistent with base #2968

Closed
franknarf1 opened this issue Jul 6, 2018 · 5 comments

@franknarf1 franknarf1 commented Jul 6, 2018

I mean:

library(data.table)
DT1 = data.table(x = "yeehaw")
DT2 = DT1[c(1,1)]

fsetequal(DT1, DT2) # FALSE
setequal(DT1$x, DT2$x) # TRUE

(Testing on an old version of data.table, but figured I'd file since I didn't see any issues filed on it and still see the same code up, missing calls to unique/funique.) EDIT: Just upgraded and am seeing the same behavior.

@sritchie73 sritchie73 commented Aug 20, 2018

I think you're right - fsetequal() should test whether the set of rows is equal, without regard for order or duplicates.
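A minimal sketch of that reading, using the data from the report: deduplicating both tables with unique() before the comparison makes the check ignore repeated rows. This is only a workaround built from existing calls, not a claim about how fsetequal() itself should be changed.

```r
library(data.table)

DT1 = data.table(x = "yeehaw")
DT2 = DT1[c(1, 1)]  # the same row, repeated twice

# Drop duplicate rows on both sides, then compare the remaining rows as sets.
same_rows = fsetequal(unique(DT1), unique(DT2))
print(same_rows)
```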

@jangorecki jangorecki added this to the 1.12.0 milestone Aug 21, 2018
@jangorecki jangorecki self-assigned this Aug 21, 2018

@sritchie73 sritchie73 commented Aug 21, 2018

Just noting that setequal() in base has the same problem with data.frames:

df1 <- data.frame(id=c(1, 1, 2, 3, 4))
df2 <- unique(df1)

setequal(df1, df2) # FALSE
setequal(df1$id, df2$id) # TRUE

@jangorecki I just noticed you've self-assigned this. I've just created a new branch with plans to add an all argument to fsetequal, so that all = FALSE ignores duplicate rows. Do you want me to delete the branch and leave it for you to resolve instead?

@franknarf1 franknarf1 commented Aug 21, 2018

(I guess you know but...) setequal in base behaves that way not because it counts dupe rows, but because it regards columns as the elements of the set instead of rows. So setequal(transpose(df1), transpose(df2)) is true.

It also does other wonky/undesirable stuff when not working with vanilla vectors:

setequal(
  data.frame(x = 1:2, y = 3:4), 
  data.frame(w = factor(3:4, levels=1:4), z = as.character(1:2))
)
# TRUE
setequal(
  list(x = 1:2), 
  list(z = as.character(1:2))
)
# FALSE

@sritchie73 sritchie73 commented Aug 21, 2018

Oh right, because a data.frame is just a list of columns, so it's treating each column as an element in a set...
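A small sketch of that point, reusing df1 and df2 from the earlier comment. It only inspects the objects with base R, so nothing here depends on data.table.

```r
df1 <- data.frame(id = c(1, 1, 2, 3, 4))
df2 <- unique(df1)

# A data.frame is a list whose elements are its columns, so each of these
# "sets" has a single element: the whole id vector.
is_list   <- is.list(df1)
n_elems   <- length(df1)
same_cols <- identical(df1$id, df2$id)  # the two id vectors differ in length

# Comparing the column vectors directly treats the values as set elements.
same_vals <- setequal(df1$id, df2$id)
```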

@jangorecki jangorecki removed their assignment Aug 22, 2018

@jangorecki jangorecki commented Aug 22, 2018

@sritchie73 go ahead, I was not planning to look at this anytime soon.
fsetequal is the only one of the set operation functions that doesn't have an all argument, so adding one is the right way to handle this.
https://rdatatable.gitlab.io/data.table/library/data.table/html/setops.html
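For illustration, the proposed semantics could be sketched with the existing fsetdiff(), which already takes all. The helper name row_setequal and its default are hypothetical and made up for this example; this is not the data.table implementation.

```r
library(data.table)

# Hypothetical helper: all = TRUE counts duplicate rows, all = FALSE ignores them.
row_setequal = function(x, y, all = TRUE) {
  if (!all) {
    x = unique(x)
    y = unique(y)
  }
  # Equal as multisets: same number of rows and nothing left over either way.
  nrow(x) == nrow(y) &&
    nrow(fsetdiff(x, y, all = TRUE)) == 0L &&
    nrow(fsetdiff(y, x, all = TRUE)) == 0L
}

DT1 = data.table(x = "yeehaw")
DT2 = DT1[c(1, 1)]

with_dupes    = row_setequal(DT1, DT2)               # duplicates counted
without_dupes = row_setequal(DT1, DT2, all = FALSE)  # duplicates ignored
```

Under this sketch, all = TRUE reproduces the behaviour reported at the top of the issue, and all = FALSE matches what setequal() gives on the column vectors.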
