binary search extensions to `<`, `<=`, `>`, `>=` #1452

arunsrinivasan · 2015-11-30T10:10:42Z

Using on= as on=.(x == y, a <= b) -- as simple as that..

Then extend #1068

The text was updated successfully, but these errors were encountered:

arunsrinivasan · 2016-03-29T16:39:06Z

Update posts:

http://stackoverflow.com/q/36284173/559784
http://stackoverflow.com/q/36207861/559784
http://stackoverflow.com/q/36190921/559784
http://stackoverflow.com/q/36533139/2490497
http://stackoverflow.com/q/37163070/559784

ywhcuhk · 2016-04-01T01:17:52Z

This would be awesome if implemented !

@jan

* non-equi-joins: non-equi joins update to NEWS. #1452. Patching another issue spotted by Jan. Thanks! Update ?data.table with current non-equi join functionality. Limit number of combinations for tests to max of 100. Closes #1257, on=.() syntax is now possible. Added test for join on char type with op other than '=='. Allow only '==' operator for joins on char type. Free allocated variable. Fix for the issue @jan spotted. Added tests. Thanks Jan. Finally, non-equi joins NAs/NaNs correctly in all cases, hopefully. Added a note to self comment to nestedid. Minor: fix code spacing. Adding tests for non-equi joins only for non-NA/NaN cases. Fixing logic for NAs in i. Various improvements and fixes to nestedid. better logic fixes edge cases, also removes for-loop = ~3x faster Just fixing indentation and minor code cleanup. No implementations. thinko! should be seq_len, not seq_along. First stab at non-equi joins

MichaelChirico · 2016-04-10T19:38:08Z

Any plans to support syntax like on = .(x == y, a < b +3)?

arunsrinivasan · 2016-04-10T19:48:11Z

Very likely not in this release (depends on how fast I wrap up the rest), but definitely useful. Perhaps as a new issue would be great.

MichaelChirico · 2016-04-10T20:02:50Z

Done, see #1639

eantonya · 2016-05-18T17:02:57Z

@arunsrinivasan can you give a quick example of .EACHI failing in combination with a non-equi join?

arunsrinivasan · 2016-05-18T17:40:17Z

Here's an example where it'll fail:

require(data.table)
dt = data.table(id="x", a=as.integer(c(3,8,8,15,15,15,16,22,22,25,25)), b=as.integer(c(9,10,25,19,22,25,38,3,9,7,28)), c=as.integer(c(22,33,44,14,49,44,40,25,400,52,77)))

dt[.(a=c(12,20), b=20), sum(c), on=c("a>a", "b<=b"), by=.EACHI]
# Error in `[.data.table`(dt, .(a = c(12, 20), b = 20), sum(c), on = c("a>a",  : 
#   by-joins are not yet implemented for multi-group non-equi-joins.

The idea for non-equi joins is to split data.table x (here, dt) into groups where rows of all columns, independently, have values always in increasing order. Here, there'll be more than one such "group". In the examples you've worked so far, there must have been only one such group. If this is hard to follow, please wait a while. I'll have better explanations (with visualisations) for UseR'16 soon.

eantonya · 2016-05-18T23:09:59Z

Ok, thanks, so failure is actually catastrophic and not silent. That's good - I thought it just works incorrectly for some cases and was looking for an incorrect result.

Iirc the cases I've worked with all had a single column on LHS, which always satisfies the single group condition.

arunsrinivasan · 2016-06-25T01:09:59Z

Updated all SO posts linked.

ywhcuhk · 2016-06-25T03:55:01Z

great work!!

ywhcuhk · 2016-06-28T19:30:56Z

It seems that for this to work, one has to specify j.

Here is a small example

> dt1 <- data.table(year=1991:2000, v=rnorm(10))
> dt1
    year          v
 1: 1991 -0.4465306
 2: 1992  0.3355444
 3: 1993  1.7731734
 4: 1994  0.5285609
 5: 1995 -1.7025382
 6: 1996  0.3752937
 7: 1997  2.0677762
 8: 1998  0.6509314
 9: 1999  0.6291038
10: 2000 -0.9639413
> dt2 <- data.table(start=dt1$year-5, end=dt1$year)
> dt2
    start  end
 1:  1986 1991
 2:  1987 1992
 3:  1988 1993
 4:  1989 1994
 5:  1990 1995
 6:  1991 1996
 7:  1992 1997
 8:  1993 1998
 9:  1994 1999
10:  1995 2000

what I want to get can be accomplished by foverlaps:

> dt1[, year_hlp:=year]
> setkey(dt1, year, year_hlp)
> setkey(dt2, start, end)
> ?foverlaps
> foverlaps(dt1, dt2, by.x = c('year','year_hlp'), by.y = c('start','end'))[order(start)]
    start  end year          v year_hlp
 1:  1986 1991 1991 -0.4465306     1991
 2:  1987 1992 1991 -0.4465306     1991
 3:  1987 1992 1992  0.3355444     1992
 4:  1988 1993 1991 -0.4465306     1991
 5:  1988 1993 1992  0.3355444     1992
 6:  1988 1993 1993  1.7731734     1993
 7:  1989 1994 1991 -0.4465306     1991
 8:  1989 1994 1992  0.3355444     1992
 9:  1989 1994 1993  1.7731734     1993
10:  1989 1994 1994  0.5285609     1994
11:  1990 1995 1991 -0.4465306     1991
12:  1990 1995 1992  0.3355444     1992
13:  1990 1995 1993  1.7731734     1993
14:  1990 1995 1994  0.5285609     1994
15:  1990 1995 1995 -1.7025382     1995
16:  1991 1996 1991 -0.4465306     1991
17:  1991 1996 1992  0.3355444     1992
18:  1991 1996 1993  1.7731734     1993
......

But I thought I could simply do this:

> dt1[dt2, on=.(year>=start, year<=end), by=.EACHI]

However, I got an error:

Error in `[.data.table`(dt1, dt2, on = .(year >= start, year <= end),  : 
  'by' or 'keyby' is supplied but not j

Why do I have to specify j? All I wanted is to do a simply inequality join like in SQL.

jangorecki · 2016-06-28T19:53:15Z

@ywhuofu if you want inequality join like in SQL then why you use by=.EACHI? it adds SQL's GROUP BY to your query. If you are grouping then it is reasonable to have an aggregate functions, in case of grouping on join it is mandatory to provide j, even when doing it for equi-join.

ywhcuhk · 2016-06-28T20:16:41Z

@jangorecki thanks for you response. So now I am a little confused. Based on my example, what should I do instead?

jangorecki · 2016-06-28T20:32:50Z

@ywhuofu Use set.seed when dealing with rnorm. I've changed year-5 to year-5L in your code to keep it as integer.

set.seed(1)
dt1 <- data.table(year=1991:2000, v=rnorm(10))
dt2 <- data.table(start=dt1$year-5L, end=dt1$year)
dt1[, year_hlp:=year]
setkey(dt1, year, year_hlp)
setkey(dt2, start, end)
rf = foverlaps(dt1, dt2, by.x = c('year','year_hlp'), by.y = c('start','end'))[order(start)]
rf[, year_hlp:=NULL]

set.seed(1)
dt1 <- data.table(year=1991:2000, v=rnorm(10))
dt2 <- data.table(start=dt1$year-5L, end=dt1$year)
r = dt1[dt2, .(start, end, year=x.year, v), on=.(year>=start, year<=end), allow.cartesian=TRUE][order(start)]
all.equal(rf, r)
#[1] TRUE

Note the year=x.year in j argument, the rationale for using x. explicitly is described in #1615, otherwise you would get year equal to the value to which it was joined in i data.table.

ywhcuhk · 2016-06-28T20:56:46Z

@jangorecki thanks a lot for the detailed observations. It seems the key is allow.cartesian. I think it could be useful to keep a cookbook for data.table strategies to replicate SQL queries.

One more interesting observation, the inequality join is faster than foverlaps.

arunsrinivasan · 2016-06-30T18:37:38Z

Benchmark:

TODO: add MonetDBLite benchmark later.

Data:

# sample data
require(data.table)
set.seed(1L)
ids = paste0("id", 1:30e3)
N = 40e6L
query = data.table(id=sample(ids, N, TRUE), range1=sample(1e2L, N, TRUE))
query[, range2 := range1 + as.integer(runif(N)*300L)]
query

subject = data.table(id=sample(ids), range1=sample(2e2L, 30e3L, TRUE))
subject[, range2 := range1 + as.integer(runif(30e3L)*10e3L)]
subject

Non-equi joins:

system.time(
  nq_ans <- query[subject, .N, on=.(id, range1>=range1, range2<=range2), nomatch=0L, by=.EACHI]
)
# 19.8s

findOverlaps

require(GenomicRanges)
q.gr = GRanges(query$id, IRanges(query$range1, query$range2)) # 12.7s!!!
s.gr = GRanges(subject$id, IRanges(subject$range1, subject$range2))
system.time(gr_ans <- findOverlaps(q.gr, s.gr, type="within"))
# 16.4s
# note that we have not obtained the counts yet, just the overlaps
# the fact that q.gr takes ~13s is quite suspicious (i.e., makes me think that it does 
# some preprocessing and therefore should be included in the total run time)

RSQLite

# Thanks @jangorecki 
library(RSQLite)
conn = dbConnect(SQLite())
dbWriteTable(conn, "query", query)
dbWriteTable(conn, "subject", subject)
sql = 'SELECT subject.id, subject.range1, subject.range2, COUNT(*) AS n FROM query INNER JOIN subject ON query.id = subject.id AND query.range1 >= subject.range1 AND query.range2 <= subject.range2 GROUP BY subject.id, subject.range1, subject.range2;'
system.time(sql_ans <- dbGetQuery(conn, sql)) 
# 53.3s

foverlaps

system.time({
  setkey(subject, id, range1, range2)
  folaps_ans <- foverlaps(query, subject, type="within", nomatch=0L, which=TRUE)
})
# 12.9s
# note that we have not obtained the counts yet, just the overlaps

Another non-equi joins comparison with RSQLite: (not an interval join)

non-equi joins

system.time(nq_ans <- query[subject, .N, on=.(id, range1>=range1), nomatch=0L, by=.EACHI])
# 4.3s

RSQLite

sql = 'SELECT subject.id, subject.range1, COUNT(*) AS n FROM query INNER JOIN subject ON query.id = subject.id AND query.range1 >= subject.range1 GROUP BY subject.id, subject.range1;'
system.time(sql_ans <- dbGetQuery(conn, sql)) 
# 50.7s

…rbose #1452.

…atatable#1452

arunsrinivasan added feature request High labels Nov 30, 2015

arunsrinivasan added this to the v1.9.8 milestone Nov 30, 2015

arunsrinivasan mentioned this issue Mar 9, 2016

Conditional Merge in Data.table #829

Closed

arunsrinivasan mentioned this issue Apr 1, 2016

foverlaps #1617

Closed

arunsrinivasan self-assigned this Apr 6, 2016

arunsrinivasan added a commit that referenced this issue Apr 6, 2016

non-equi joins update to NEWS. #1452.

55a7b55

arunsrinivasan added a commit that referenced this issue Apr 7, 2016

mult="first"/"last" moved to bmerge.c, handles non-equi joins too, #1452

fc6812b

ywhcuhk mentioned this issue Apr 12, 2016

Join on inequality constraints tidyverse/dplyr#557

Closed

arunsrinivasan added a commit that referenced this issue Jun 25, 2016

by=.EACHI implemented for non-equi joins. Added tests, #1452.

cfa844a

arunsrinivasan added a commit that referenced this issue Jun 25, 2016

Update NEWS, #1452.

898a582

arunsrinivasan added a commit that referenced this issue Jun 25, 2016

Minor bug fix and code clean up, #1452

fc61568

arunsrinivasan mentioned this issue Jun 25, 2016

Hook up foverlaps to [.data.table #830

Closed

arunsrinivasan added a commit that referenced this issue Jun 30, 2016

Handle nomatch=0L correctly for non-equi when by=.EACHI, #1452.

b8ba74e

arunsrinivasan closed this as completed Jun 30, 2016

arunsrinivasan added a commit that referenced this issue Jul 1, 2016

minor: verbose msg on timings/groups generated for nqjoins, #1452.

917419d

arunsrinivasan changed the title ~~binary search extensions to <, <=, >, >= and !=~~ binary search extensions to <, <=, >, >= Jul 5, 2016

arunsrinivasan added a commit that referenced this issue Jul 13, 2016

Better nestedid generation for nqjoins, better performance, better ve…

7889347

…rbose #1452.

worker000000 pushed a commit to worker000000/data.table that referenced this issue Aug 13, 2017

non-equi joins update to NEWS. Rdatatable#1452.

be47bdb

worker000000 pushed a commit to worker000000/data.table that referenced this issue Aug 13, 2017

mult="first"/"last" moved to bmerge.c, handles non-equi joins too, Rd…

7c8755e

…atatable#1452

franknarf1 mentioned this issue Aug 17, 2017

non equi join for character columns #2308

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

binary search extensions to `<`, `<=`, `>`, `>=` #1452

binary search extensions to `<`, `<=`, `>`, `>=` #1452

arunsrinivasan commented Nov 30, 2015 •

edited

arunsrinivasan commented Mar 29, 2016 •

edited

ywhcuhk commented Apr 1, 2016

MichaelChirico commented Apr 10, 2016

arunsrinivasan commented Apr 10, 2016

MichaelChirico commented Apr 10, 2016

eantonya commented May 18, 2016

arunsrinivasan commented May 18, 2016 •

edited

eantonya commented May 18, 2016 •

edited

arunsrinivasan commented Jun 25, 2016

ywhcuhk commented Jun 25, 2016

ywhcuhk commented Jun 28, 2016

jangorecki commented Jun 28, 2016

ywhcuhk commented Jun 28, 2016

jangorecki commented Jun 28, 2016 •

edited

ywhcuhk commented Jun 28, 2016

arunsrinivasan commented Jun 30, 2016 •

edited

binary search extensions to <, <=, >, >= #1452

binary search extensions to <, <=, >, >= #1452

Comments

arunsrinivasan commented Nov 30, 2015 • edited

arunsrinivasan commented Mar 29, 2016 • edited

ywhcuhk commented Apr 1, 2016

MichaelChirico commented Apr 10, 2016

arunsrinivasan commented Apr 10, 2016

MichaelChirico commented Apr 10, 2016

eantonya commented May 18, 2016

arunsrinivasan commented May 18, 2016 • edited

eantonya commented May 18, 2016 • edited

arunsrinivasan commented Jun 25, 2016

ywhcuhk commented Jun 25, 2016

ywhcuhk commented Jun 28, 2016

jangorecki commented Jun 28, 2016

ywhcuhk commented Jun 28, 2016

jangorecki commented Jun 28, 2016 • edited

ywhcuhk commented Jun 28, 2016

arunsrinivasan commented Jun 30, 2016 • edited

Data:

Non-equi joins:

findOverlaps

RSQLite

foverlaps

non-equi joins

RSQLite

binary search extensions to `<`, `<=`, `>`, `>=` #1452

binary search extensions to `<`, `<=`, `>`, `>=` #1452

arunsrinivasan commented Nov 30, 2015 •

edited

arunsrinivasan commented Mar 29, 2016 •

edited

arunsrinivasan commented May 18, 2016 •

edited

eantonya commented May 18, 2016 •

edited

jangorecki commented Jun 28, 2016 •

edited

arunsrinivasan commented Jun 30, 2016 •

edited