Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

How is order determined in the result of a non-equi join? #1991

Closed
Henrik-P opened this issue Jan 14, 2017 · 2 comments
Closed

How is order determined in the result of a non-equi join? #1991

Henrik-P opened this issue Jan 14, 2017 · 2 comments
Labels
bug Low non-equi joins rolling, overlapping, non-equi joins
Milestone

Comments

@Henrik-P
Copy link

Henrik-P commented Jan 14, 2017

This question was originally posted on SO. Arun asked me to file an issue, so here we go. Not sure about the ethics on data.table github, if a 'SO link only' is OK? Anyway, here's the question, verbatim from SO.


I'm trying to understand the underlying logic of how the result of a non-equi join in data.table is ordered within each level of the on-variable.

Just to make it clear from the start: I have no problem with the order itself, or to order the output in a desired way after the join. However, because I find the output from all other data.table operations highly consistent, I suspect there is a ordering pattern to be revealed in non-equi joins as well.

I will give two examples, where two different 'large' data sets are joined with a smaller. I have tried to describe the most obvious patterns in the output, as well as instances where the pattern differs between the joins of the two data sets.

# the first 'large' data set
d1 <- data.table(x = c(rep(c("b", "a", "c"), each = 3), c("a", "b")),
                 y = c(rep(c(1, 3, 6), 3), 6, 6),
                 id = 1:11) # to make it easier to track the original order in the output    
#     x y  id
# 1:  b 1   1
# 2:  b 3   2
# 3:  b 6   3
# 4:  a 1   4
# 5:  a 3   5
# 6:  a 6   6
# 7:  c 1   7
# 8:  c 3   8
# 9:  c 6   9
# 10: a 6  10
# 11: b 6  11

# the small data set
d2 <- data.table(id = 1:2, val = c(4, 2))   
#     id val
# 1:   1   4
# 2:   2   2

Non-equi join between the first large data set and the small, on = .(y >= val).

d1[d2, on = .(y >= val)]
#     x y  id  i.id
# 1:  b 4   3     1 # Row 1-5, first match: y >= val[1]; y >= 4
# 2:  a 4   6     1 # The rows within this match have the same order as the original data
# 3:  c 4   9     1 # and runs consecutively from first to last match
# 4:  a 4  10     1
# 5:  b 4  11     1

# 6:  b 2   2     2 # Row 6-13, second match: y >= val[2]; y >= 2 
# 7:  a 2   5     2 # The rows within this match do not have the same order as the original data
# 8:  c 2   8     2 # Rather, they seem to be come in chunks (6-8, 9-11, 12-13) 
                    # First chunk starts with the match with lowest index, y[2] 
# 9:  b 2   3     2  
# 10: a 2   6     2 
# 11: c 2   9     2 

# 12: a 2  10     2
# 13: b 2  11     2

The second 'large' data set:

d3 <- data.table(x = rep(c("a", "b", "c"), each = 3),
                 y = c(6, 1, 3),
                 id = 1:9)
#    x y id
# 1: a 6  1
# 2: a 1  2
# 3: a 3  3
# 4: b 6  4
# 5: b 1  5
# 6: b 3  6
# 7: c 6  7
# 8: c 1  8
# 9: c 3  9

Same non-equi join between the second large data set with the small:

d3[d2, on = .(y >= val)]

#    x y   id i.id
# 1: a 4   1     1 # Row 1-3, first match (y >= 4), similar to output above
# 2: b 4   4     1
# 3: c 4   7     1

# 4: a 2   3     2 # Row 4-9, second match (y >= 2).  
# 5: b 2   6     2 # Again, rows not consecutive.
# 6: c 2   9     2 # However, now the first chunk does not start with the match with lowest index,
                   # y[3] instead of y[1]
                   
# 7: a 2   1     2 # y[1] appears after y[3]
# 8: b 2   4     2 # ditto
# 9: c 2   7     2

Can anyone explain the logic of (1) the order within each level of the on-variable, here especially within the second match, where original order of the data isn't kept in the result. And (2) why does the order between chunks within matches differ when the two different data sets are used?


An even smaller example which makes it easier to track the re-ordering spotted by @franknarf1; in the result of the join, x is sorted by its join variable:

d1 <- data.table(end = c(4, 1, 3), ix = 1:3)
d1
#    end ix
# 1:   4  1 
# 2:   1  2
# 3:   3  3

d2 <- data.table(start = 2)
d2
#    start
# 1:     2


# result of join not in original order, but in order of "x.end"
d1[d2, .(x.end, ix), on = .(end > start)]
#    x.end ix
# 1:     3  3
# 2:     4  1
@franknarf1
Copy link
Contributor

franknarf1 commented Jan 14, 2017

I cannot explain the logic, but the pattern is d1[order(y)][y > 2, id] and d3[order(y)][y > 2, id], right? (Or in place of 2, put whatever other value from i.)

@arunsrinivasan arunsrinivasan added bug Low non-equi joins rolling, overlapping, non-equi joins labels Mar 28, 2017
@mattdowle mattdowle added this to the v1.10.6 milestone Nov 2, 2017
@arunsrinivasan
Copy link
Member

Update SO post https://stackoverflow.com/a/47148117/559784

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Low non-equi joins rolling, overlapping, non-equi joins
Projects
None yet
Development

No branches or pull requests

4 participants