Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

multi-column non-equi join produces invalid results #2360

ethanbsmith opened this issue Sep 15, 2017 · 4 comments

multi-column non-equi join produces invalid results #2360

ethanbsmith opened this issue Sep 15, 2017 · 4 comments
bug non-equi joins


Copy link

@ethanbsmith ethanbsmith commented Sep 15, 2017

I'm trying to do self non-equi join on a date and a numeric column. The single column version of both non-equi joins produce the expected results, but when i try both columns in the on clause, i get some
very funky output.

Sample script to reproduce the problem.

    d <- data.table(index = as.Date(c('2017-08-25', '2017-08-28', '2017-08-29', '2017-08-30', '2017-08-31', '2017-09-01', '2017-09-05', '2017-09-06', '2017-09-07', '2017-09-08', '2017-09-11', '2017-09-12', '2017-09-13')),
        High = c(52.85, 51.81, 51.86, 52.29, 52.59, 52.77, 51.9, 50.77, 50.69, 50.45, 50.67, 51.07, 51.105),
        FwdHi = c(51.81, 51.86, 52.29, 51.9, 50.77, 50.69, 50.45, 50.45, 50.45, 50.67, NA, NA, NA))
    setkey(d, index)
    d[d, .(FirstDate = min(x.index)), on = .(index > index), by = .EACHI]
    d[d, .(FirstDate = min(x.index)), on = .(FwdHi > High), by = .EACHI]
    d[d, .(FirstDate = min(x.index)), on = .(index > index, FwdHi > High), by = .EACHI]

The final line produces an FirstDate = "1970-01-1"

Copy link

@franknarf1 franknarf1 commented Sep 15, 2017

Quite odd. Also, if we zoom in on that row...

d[d, if (.GRP == 4) x.index, on = .(index > index, FwdHi > High), by = .EACHI]
#         index FwdHi         V1
# 1: 2017-08-30 52.29 1970-01-01
# same result OP sees

d[d[4], x.index, on = .(index > index, FwdHi > High), by = .EACHI]
#         index FwdHi x.index
# 1: 2017-08-30 52.29    <NA>
# correct result

Somehow the inclusion of other groups in the join affects the values in this group. I think it's similar to #2275

Copy link
Contributor Author

@ethanbsmith ethanbsmith commented Sep 17, 2017

I have been doing some work on a workaround for this issues, by chaining d[...][...][...] the output of each condition into a discrete expression of the next. This produces correct output.

My real world case is more complex than the example above and i have been testing various combination of chaining schemes. What is interesting is that when i mix data-types on conditions. the processing is much slower than chaining those same conditions. eg:

d[d, on = .(index > index, FwdHi > High)]
where processing is on both date and numerics, is much slower than:
d[d, on = .(index > index)[FwdHi > High)]

I figured i'd note this, just in case it helps identify the source of the problem

@ethanbsmith ethanbsmith changed the title multi-column non-equi join produces invalid data multi-column non-equi join produces invalid results Sep 18, 2017
@arunsrinivasan arunsrinivasan added this to the v1.10.6 milestone Nov 2, 2017
Copy link

@arunsrinivasan arunsrinivasan commented Nov 3, 2017

Thanks for the nice report. Fixed in PR. Should be merged by Matt ASAP.

Copy link

@mattdowle mattdowle commented Nov 8, 2017

Closed by Arun's PR #2461 (the "closes #2461" needs to appear in the PR's first comment).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
bug non-equi joins
None yet

No branches or pull requests

4 participants