Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

multi-column non-equi join produces invalid results #2360

Closed
ethanbsmith opened this issue Sep 15, 2017 · 4 comments
Closed

multi-column non-equi join produces invalid results #2360

ethanbsmith opened this issue Sep 15, 2017 · 4 comments
Labels
bug non-equi joins rolling, overlapping, non-equi joins
Milestone

Comments

@ethanbsmith
Copy link
Contributor

ethanbsmith commented Sep 15, 2017

I'm trying to do self non-equi join on a date and a numeric column. The single column version of both non-equi joins produce the expected results, but when i try both columns in the on clause, i get some
very funky output.

Sample script to reproduce the problem.

    d <- data.table(index = as.Date(c('2017-08-25', '2017-08-28', '2017-08-29', '2017-08-30', '2017-08-31', '2017-09-01', '2017-09-05', '2017-09-06', '2017-09-07', '2017-09-08', '2017-09-11', '2017-09-12', '2017-09-13')),
        High = c(52.85, 51.81, 51.86, 52.29, 52.59, 52.77, 51.9, 50.77, 50.69, 50.45, 50.67, 51.07, 51.105),
        FwdHi = c(51.81, 51.86, 52.29, 51.9, 50.77, 50.69, 50.45, 50.45, 50.45, 50.67, NA, NA, NA))
    setkey(d, index)
    d[d, .(FirstDate = min(x.index)), on = .(index > index), by = .EACHI]
    d[d, .(FirstDate = min(x.index)), on = .(FwdHi > High), by = .EACHI]
    d[d, .(FirstDate = min(x.index)), on = .(index > index, FwdHi > High), by = .EACHI]

The final line produces an FirstDate = "1970-01-1"
image

@franknarf1
Copy link
Contributor

Quite odd. Also, if we zoom in on that row...

d[d, if (.GRP == 4) x.index, on = .(index > index, FwdHi > High), by = .EACHI]
#         index FwdHi         V1
# 1: 2017-08-30 52.29 1970-01-01
# same result OP sees

d[d[4], x.index, on = .(index > index, FwdHi > High), by = .EACHI]
#         index FwdHi x.index
# 1: 2017-08-30 52.29    <NA>
# correct result

Somehow the inclusion of other groups in the join affects the values in this group. I think it's similar to #2275

@ethanbsmith
Copy link
Contributor Author

I have been doing some work on a workaround for this issues, by chaining d[...][...][...] the output of each condition into a discrete expression of the next. This produces correct output.

My real world case is more complex than the example above and i have been testing various combination of chaining schemes. What is interesting is that when i mix data-types on conditions. the processing is much slower than chaining those same conditions. eg:

d[d, on = .(index > index, FwdHi > High)]
where processing is on both date and numerics, is much slower than:
d[d, on = .(index > index)[FwdHi > High)]

I figured i'd note this, just in case it helps identify the source of the problem

@ethanbsmith ethanbsmith changed the title multi-column non-equi join produces invalid data multi-column non-equi join produces invalid results Sep 18, 2017
@arunsrinivasan arunsrinivasan added this to the v1.10.6 milestone Nov 2, 2017
@arunsrinivasan
Copy link
Member

arunsrinivasan commented Nov 3, 2017

Thanks for the nice report. Fixed in PR. Should be merged by Matt ASAP.

@mattdowle
Copy link
Member

Closed by Arun's PR #2461 (the "closes #2461" needs to appear in the PR's first comment).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug non-equi joins rolling, overlapping, non-equi joins
Projects
None yet
Development

No branches or pull requests

4 participants