Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

non-equi join masks column names from x and i in the j expression #2595

Closed
ethanbsmith opened this issue Jan 28, 2018 · 8 comments
Closed

non-equi join masks column names from x and i in the j expression #2595

ethanbsmith opened this issue Jan 28, 2018 · 8 comments
Labels
Milestone

Comments

@ethanbsmith
Copy link
Contributor

ethanbsmith commented Jan 28, 2018

This was very hard to track down. If prefixes are indeed required in this scenario, maybe a warning or error could be issued

might be related to #2313, #1761

a <- data.table(foo = c(1:5))
b <- data.table(bar = c(1:5))

This version produces incorrect values for fmean and fmax

a[b, on = .(foo >= bar), by = .EACHI, .(bar, .N, fmean = mean(foo), fmax = max(foo))]

   foo bar N fmean fmax
1:   1   1 5     1    1
2:   2   2 4     2    2
3:   3   3 3     3    3
4:   4   4 2     4    4
5:   5   5 1     5    5

This version, with the prefix works as expected:

a[b, on = .(foo >= bar), by = .EACHI, .(bar, .N, fmean = mean(x.foo), fmax = max(x.foo))]

   foo bar N fmean fmax
1:   1   1 5   3.0    5
2:   2   2 4   3.5    5
3:   3   3 3   4.0    5
4:   4   4 2   4.5    5
5:   5   5 1   5.0    5
@ethanbsmith
Copy link
Contributor Author

may be related to #2569

@MarkusBonsch
Copy link
Contributor

MarkusBonsch commented Feb 1, 2018

To me this is very much related to the general weird behaviour of non-equi joins. The column that is joined on looks very weird after the join. Just doing:

a <- data.table(foo = c(1:5))
b <- data.table(bar = c(1:5), iRows = 1:5)
a[b, on = .(foo >= bar)]

yields

#     foo iRows
# 1:   1     1
# 2:   1     1
# 3:   1     1
# 4:   1     1
# 5:   1     1
# 6:   2     2
# 7:   2     2
# 8:   2     2
# 9:   2     2
# 10:   3     3
# 11:   3     3
# 12:   3     3
# 13:   4     4
# 14:   4     4
# 15:   5     5

Here you see, that grouping by iRows (which is essentially, what by = .EACHi does) is naturally giving the unexpected result that you saw.
It is just weird that the resulting foo column is so different from the original x.foo column. Essentially foo is not x.foo, but i.bar. This seems close to a bug to me and I don't know, why this behaviour was chosen. I would expect the following:

a[b, on = .(foo >= bar)]
#     foo iRows
# 1:   1     1
# 2:   2     1
# 3:   3     1
# 4:   4     1
# 5:   5     1
# 6:   2     2
# 7:   3     2
# 8:   4     2
# 9:   5     2
# 10:   3     3
# 11:   4     3
# 12:   5     3
# 13:   4     4
# 14:   5     4
# 15:   5     5

MAybe @mattdowle or @arunsrinivasan can comment on this.

@jangorecki
Copy link
Member

@MarkusBonsch this behaviour is discussed in #1615. We can eventually force users to use x. and i. prefixes to avoid confusion.

@MarkusBonsch
Copy link
Contributor

@ethanbsmith I am closing this issue because the "bug" can be explained by known data.table behaviour. If you disagree, don't hesitate to reopen.

@ethanbsmith
Copy link
Contributor Author

this issue just bit me again and cost me hours to track down. I think this is a real bug that should be addressed, but I will defer to the owners. just my $.02 below:

  1. i think the better question to ask is is this desired/reasonable behavior, vs. is it known behavior
  2. to quote @MarkusBonsch: Essentially foo is not x.foo, but i.bar. This seems close to a bug to me and I don't know, why this behaviour was chosen. That foo masks x.foo in name resolution is the heart of the problem here.
  3. to quote @arunsrinivasan in this SO Issuer: For some reason, using frame instead of x.frame, i.e., frame[which.max(signal)], returns all NA, which I'd suppose is a bug .. Could you please file an issue by linking to this post? Thanks.
  4. when using non-equi joins you are pretty much required to uses the x. and i. prefixes. this is not the case for equi-joins. i would assume that most people would prefer consitency.
  5. this really bites when prototyping and moving between equi/non-equi joins. a simple > or < can entirely break your code
  6. I don't think that join in [.data.table could be consistent to SQL #1615 really explains this issue. join in [.data.table could be consistent to SQL #1615 adds support for x. and i. prefixes, but does not discuss that the are required.

@ethanbsmith ethanbsmith changed the title non-equi join with .EACHI produces invalid results without x. prefix non-equi join masks column names from x and i in the j expression Mar 5, 2018
@ethanbsmith
Copy link
Contributor Author

simpler example highlighting the masking problem:

a <- data.table(foo = c(1:5))
b <- data.table(bar = c(1:5))

identical(a[b, on = .(foo >= bar), .(foo, bar)], a[b, on = .(foo >= bar), .(x.foo, i.bar)])

@ethanbsmith
Copy link
Contributor Author

having chained through a bunch of other threads. it seems that they keep getting closed and linked back as duplicates of #1700, which remains open. for history or anyone else going down this path, I found that link via #1807. I will reproduce some of these comments in the #1700 thread

@jangorecki
Copy link
Member

jangorecki commented Mar 6, 2018

@ethanbsmith ad.2 please read why this behaviour was "chosen" in #1615, ad.4 I agree, ad.5 read comments there too, prefixes are required as of now, they are just not enforced on user while they probably should be.

Please discuss this behaviour in single issue, best the place where it was already discussed. You can also include some minimal test cases there so we can include them when closing #1615. Easiest way seems to be warning on lack of prefixes when column names mismatch occurs.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

4 participants