How is order determined in the result of a non-equi join? #1991

Henrik-P · 2017-01-14T22:37:16Z

This question was originally posted on SO. Arun asked me to file an issue, so here we go. Not sure about the ethics on data.table github, if a 'SO link only' is OK? Anyway, here's the question, verbatim from SO.

I'm trying to understand the underlying logic of how the result of a non-equi join in data.table is ordered within each level of the on-variable.

Just to make it clear from the start: I have no problem with the order itself, or to order the output in a desired way after the join. However, because I find the output from all other data.table operations highly consistent, I suspect there is a ordering pattern to be revealed in non-equi joins as well.

I will give two examples, where two different 'large' data sets are joined with a smaller. I have tried to describe the most obvious patterns in the output, as well as instances where the pattern differs between the joins of the two data sets.

# the first 'large' data set
d1 <- data.table(x = c(rep(c("b", "a", "c"), each = 3), c("a", "b")),
                 y = c(rep(c(1, 3, 6), 3), 6, 6),
                 id = 1:11) # to make it easier to track the original order in the output    
#     x y  id
# 1:  b 1   1
# 2:  b 3   2
# 3:  b 6   3
# 4:  a 1   4
# 5:  a 3   5
# 6:  a 6   6
# 7:  c 1   7
# 8:  c 3   8
# 9:  c 6   9
# 10: a 6  10
# 11: b 6  11

# the small data set
d2 <- data.table(id = 1:2, val = c(4, 2))   
#     id val
# 1:   1   4
# 2:   2   2

Non-equi join between the first large data set and the small, on = .(y >= val).

d1[d2, on = .(y >= val)]
#     x y  id  i.id
# 1:  b 4   3     1 # Row 1-5, first match: y >= val[1]; y >= 4
# 2:  a 4   6     1 # The rows within this match have the same order as the original data
# 3:  c 4   9     1 # and runs consecutively from first to last match
# 4:  a 4  10     1
# 5:  b 4  11     1

# 6:  b 2   2     2 # Row 6-13, second match: y >= val[2]; y >= 2 
# 7:  a 2   5     2 # The rows within this match do not have the same order as the original data
# 8:  c 2   8     2 # Rather, they seem to be come in chunks (6-8, 9-11, 12-13) 
                    # First chunk starts with the match with lowest index, y[2] 
# 9:  b 2   3     2  
# 10: a 2   6     2 
# 11: c 2   9     2 

# 12: a 2  10     2
# 13: b 2  11     2

The second 'large' data set:

d3 <- data.table(x = rep(c("a", "b", "c"), each = 3),
                 y = c(6, 1, 3),
                 id = 1:9)
#    x y id
# 1: a 6  1
# 2: a 1  2
# 3: a 3  3
# 4: b 6  4
# 5: b 1  5
# 6: b 3  6
# 7: c 6  7
# 8: c 1  8
# 9: c 3  9

Same non-equi join between the second large data set with the small:

d3[d2, on = .(y >= val)]

#    x y   id i.id
# 1: a 4   1     1 # Row 1-3, first match (y >= 4), similar to output above
# 2: b 4   4     1
# 3: c 4   7     1

# 4: a 2   3     2 # Row 4-9, second match (y >= 2).  
# 5: b 2   6     2 # Again, rows not consecutive.
# 6: c 2   9     2 # However, now the first chunk does not start with the match with lowest index,
                   # y[3] instead of y[1]
                   
# 7: a 2   1     2 # y[1] appears after y[3]
# 8: b 2   4     2 # ditto
# 9: c 2   7     2

Can anyone explain the logic of (1) the order within each level of the on-variable, here especially within the second match, where original order of the data isn't kept in the result. And (2) why does the order between chunks within matches differ when the two different data sets are used?

An even smaller example which makes it easier to track the re-ordering spotted by @franknarf1; in the result of the join, x is sorted by its join variable:

d1 <- data.table(end = c(4, 1, 3), ix = 1:3)
d1
#    end ix
# 1:   4  1 
# 2:   1  2
# 3:   3  3

d2 <- data.table(start = 2)
d2
#    start
# 1:     2


# result of join not in original order, but in order of "x.end"
d1[d2, .(x.end, ix), on = .(end > start)]
#    x.end ix
# 1:     3  3
# 2:     4  1

The text was updated successfully, but these errors were encountered:

franknarf1 · 2017-01-14T22:45:12Z

I cannot explain the logic, but the pattern is d1[order(y)][y > 2, id] and d3[order(y)][y > 2, id], right? (Or in place of 2, put whatever other value from i.)

arunsrinivasan · 2017-11-07T00:33:53Z

Update SO post https://stackoverflow.com/a/47148117/559784

arunsrinivasan added bug Low non-equi joins rolling, overlapping, non-equi joins labels Mar 28, 2017

arunsrinivasan added a commit that referenced this issue Nov 2, 2017

Non-equi join row order is returned correctly. Fixes #1991.

3f1f018

mattdowle added this to the v1.10.6 milestone Nov 2, 2017

mattdowle closed this as completed in 27f8c21 Nov 2, 2017

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

How is order determined in the result of a non-equi join? #1991

How is order determined in the result of a non-equi join? #1991

Henrik-P commented Jan 14, 2017 •

edited

Loading

franknarf1 commented Jan 14, 2017 •

edited

Loading

arunsrinivasan commented Nov 7, 2017

How is order determined in the result of a non-equi join? #1991

How is order determined in the result of a non-equi join? #1991

Comments

Henrik-P commented Jan 14, 2017 • edited Loading

franknarf1 commented Jan 14, 2017 • edited Loading

arunsrinivasan commented Nov 7, 2017

Henrik-P commented Jan 14, 2017 •

edited

Loading

franknarf1 commented Jan 14, 2017 •

edited

Loading