binary search extensions to `<`, `<=`, `>`, `>=` #1452

Closed
arunsrinivasan opened this Issue Nov 30, 2015 · 16 comments

@arunsrinivasan
Member

arunsrinivasan commented Nov 30, 2015

Using on=, as in on=.(x == y, a <= b) -- as simple as that.

  • handle operators >=, >, <= and <.
  • support for mult="first" and mult="last"
  • add by=.EACHI support
  • update ?data.table
  • add a benchmark

Then extend #1068
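
A minimal sketch of the proposed usage (the tables dt and thresholds below are made up for illustration; the by=.EACHI form mirrors the benchmark later in this thread):

require(data.table)
dt         = data.table(x = c("a", "a", "b"), a = c(1L, 5L, 3L), val = 10:12)
thresholds = data.table(y = c("a", "b"), b = c(4L, 2L))

# equi condition (x == y) combined with a non-equi condition (a <= b)
dt[thresholds, on = .(x == y, a <= b)]

# grouping by each row of i requires j, e.g. a count per row of thresholds
dt[thresholds, .N, on = .(x == y, a <= b), nomatch = 0L, by = .EACHI]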

@whyusc


whyusc commented Apr 1, 2016

This would be awesome if implemented!

arunsrinivasan self-assigned this Apr 6, 2016

arunsrinivasan added a commit that referenced this issue Apr 6, 2016

Merge branch 'non-equi-joins'
* non-equi-joins:
  non-equi joins update to NEWS. #1452.
  Patching another issue spotted by Jan. Thanks!
  Update ?data.table with current non-equi join functionality.
  Limit number of combinations for tests to max of 100.
  Closes #1257, on=.() syntax is now possible.
  Added test for join on char type with op other than '=='.
  Allow only '==' operator for joins on char type.
  Free allocated variable.
  Fix for the issue @jan spotted. Added tests. Thanks Jan.
  Finally, non-equi joins NAs/NaNs correctly in all cases, hopefully.
  Added a note to self comment to nestedid.
  Minor: fix code spacing.
  Adding tests for non-equi joins only for non-NA/NaN cases.
  Fixing logic for NAs in i.
  Various improvements and fixes to nestedid.
  better logic fixes edge cases, also removes for-loop = ~3x faster
  Just fixing indentation and minor code cleanup. No implementations.
  thinko! should be seq_len, not seq_along.
  First stab at non-equi joins
@MichaelChirico

Contributor

MichaelChirico commented Apr 10, 2016

Any plans to support syntax like on = .(x == y, a < b + 3)?

@arunsrinivasan

Member

arunsrinivasan commented Apr 10, 2016

Very likely not in this release (depends on how fast I wrap up the rest), but definitely useful. Filing it as a new issue would be great.

@MichaelChirico

Contributor

MichaelChirico commented Apr 10, 2016

Done, see #1639

@eantonya

Contributor

eantonya commented May 18, 2016

@arunsrinivasan can you give a quick example of .EACHI failing in combination with a non-equi join?

@arunsrinivasan

Member

arunsrinivasan commented May 18, 2016

Here's an example where it'll fail:

require(data.table)
dt = data.table(id = "x",
                a  = as.integer(c(3, 8, 8, 15, 15, 15, 16, 22, 22, 25, 25)),
                b  = as.integer(c(9, 10, 25, 19, 22, 25, 38, 3, 9, 7, 28)),
                c  = as.integer(c(22, 33, 44, 14, 49, 44, 40, 25, 400, 52, 77)))

dt[.(a = c(12, 20), b = 20), sum(c), on = c("a>a", "b<=b"), by = .EACHI]
# Error in `[.data.table`(dt, .(a = c(12, 20), b = 20), sum(c), on = c("a>a",  : 
#   by-joins are not yet implemented for multi-group non-equi-joins.

The idea behind non-equi joins is to split the data.table x (here, dt) into groups where, within each group, the values of all join columns are, independently, always in increasing order. Here there will be more than one such "group". In the examples you've worked with so far, there must have been only one such group. If this is hard to follow, please wait a while; I'll have better explanations (with visualisations) for UseR'16 soon.
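
For a rough intuition only (this greedy split is an illustration, not data.table's internal nestedid algorithm), here is why the dt above falls into more than one such group on the join columns a and b:

tmp = copy(dt)
# start a new "group" whenever any join column decreases from one row to the next
new_group = Reduce(`|`, lapply(tmp[, .(a, b)], function(col) c(FALSE, diff(col) < 0)))
tmp[, grp := cumsum(new_group) + 1L]
tmp[, .N, by = grp]   # several groups here, hence the multi-group error above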

@eantonya

Contributor

eantonya commented May 18, 2016

Ok, thanks, so the failure is actually an explicit error and not silent. That's good - I thought it just worked incorrectly in some cases and was looking for an incorrect result.

IIRC the cases I've worked with all had a single column on the LHS, which always satisfies the single-group condition.

arunsrinivasan added a commit that referenced this issue Jun 25, 2016

@arunsrinivasan

Member

arunsrinivasan commented Jun 25, 2016

Updated all SO posts linked.

@whyusc

whyusc commented Jun 25, 2016

great work!!

arunsrinivasan added a commit that referenced this issue Jun 25, 2016

@whyusc

whyusc commented Jun 28, 2016

It seems that for this to work, one has to specify j.

Here is a small example:

> dt1 <- data.table(year=1991:2000, v=rnorm(10))
> dt1
    year          v
 1: 1991 -0.4465306
 2: 1992  0.3355444
 3: 1993  1.7731734
 4: 1994  0.5285609
 5: 1995 -1.7025382
 6: 1996  0.3752937
 7: 1997  2.0677762
 8: 1998  0.6509314
 9: 1999  0.6291038
10: 2000 -0.9639413
> dt2 <- data.table(start=dt1$year-5, end=dt1$year)
> dt2
    start  end
 1:  1986 1991
 2:  1987 1992
 3:  1988 1993
 4:  1989 1994
 5:  1990 1995
 6:  1991 1996
 7:  1992 1997
 8:  1993 1998
 9:  1994 1999
10:  1995 2000

What I want can be accomplished with foverlaps:

> dt1[, year_hlp:=year]
> setkey(dt1, year, year_hlp)
> setkey(dt2, start, end)
> ?foverlaps
> foverlaps(dt1, dt2, by.x = c('year','year_hlp'), by.y = c('start','end'))[order(start)]
    start  end year          v year_hlp
 1:  1986 1991 1991 -0.4465306     1991
 2:  1987 1992 1991 -0.4465306     1991
 3:  1987 1992 1992  0.3355444     1992
 4:  1988 1993 1991 -0.4465306     1991
 5:  1988 1993 1992  0.3355444     1992
 6:  1988 1993 1993  1.7731734     1993
 7:  1989 1994 1991 -0.4465306     1991
 8:  1989 1994 1992  0.3355444     1992
 9:  1989 1994 1993  1.7731734     1993
10:  1989 1994 1994  0.5285609     1994
11:  1990 1995 1991 -0.4465306     1991
12:  1990 1995 1992  0.3355444     1992
13:  1990 1995 1993  1.7731734     1993
14:  1990 1995 1994  0.5285609     1994
15:  1990 1995 1995 -1.7025382     1995
16:  1991 1996 1991 -0.4465306     1991
17:  1991 1996 1992  0.3355444     1992
18:  1991 1996 1993  1.7731734     1993
......

But I thought I could simply do this:

> dt1[dt2, on=.(year>=start, year<=end), by=.EACHI]

However, I got an error:

Error in `[.data.table`(dt1, dt2, on = .(year >= start, year <= end),  : 
  'by' or 'keyby' is supplied but not j

Why do I have to specify j? All I wanted was a simple inequality join, like in SQL.

@jangorecki

Member

jangorecki commented Jun 28, 2016

@ywhuofu if you want an inequality join like in SQL, why use by=.EACHI? It adds SQL's GROUP BY to your query. If you are grouping, it is reasonable to have an aggregate function, and when grouping on a join it is mandatory to provide j, even for an equi-join.
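
For reference, a sketch of the two workable forms on the dt1/dt2 example above (my own illustration, not code from the thread; jangorecki's full answer follows below):

# plain SQL-style inequality join: no by=.EACHI, so no j is required
dt1[dt2, on = .(year >= start, year <= end), allow.cartesian = TRUE]

# by=.EACHI groups per row of dt2, so j (typically an aggregate) must be supplied
dt1[dt2, .N, on = .(year >= start, year <= end), by = .EACHI]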

@whyusc

whyusc commented Jun 28, 2016

@jangorecki thanks for your response. Now I am a little confused. Based on my example, what should I do instead?

@jangorecki

Member

jangorecki commented Jun 28, 2016

@ywhuofu Use set.seed when dealing with rnorm. I've changed year-5 to year-5L in your code to keep it as an integer.

set.seed(1)
dt1 <- data.table(year=1991:2000, v=rnorm(10))
dt2 <- data.table(start=dt1$year-5L, end=dt1$year)
dt1[, year_hlp:=year]
setkey(dt1, year, year_hlp)
setkey(dt2, start, end)
rf = foverlaps(dt1, dt2, by.x = c('year','year_hlp'), by.y = c('start','end'))[order(start)]
rf[, year_hlp:=NULL]

set.seed(1)
dt1 <- data.table(year=1991:2000, v=rnorm(10))
dt2 <- data.table(start=dt1$year-5L, end=dt1$year)
r = dt1[dt2, .(start, end, year=x.year, v), on=.(year>=start, year<=end), allow.cartesian=TRUE][order(start)]
all.equal(rf, r)
#[1] TRUE

Note the year=x.year in the j argument; the rationale for using the x. prefix explicitly is described in #1615. Otherwise you would get year equal to the value it was joined to in the i data.table.
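
A hedged illustration of that last point (my own snippet, not from the thread; the names year_i and year_x are made up): selecting both forms side by side shows the difference.

dt1[dt2, .(start, end, year_i = year, year_x = x.year, v),
    on = .(year >= start, year <= end), allow.cartesian = TRUE]
# year_i carries the i-side value the row was joined against, year_x is dt1's own year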

@whyusc

whyusc commented Jun 28, 2016

@jangorecki thanks a lot for the detailed observations. It seems the key is allow.cartesian. I think it could be useful to keep a cookbook for data.table strategies to replicate SQL queries.

One more interesting observation: the inequality join is faster than foverlaps.

arunsrinivasan added a commit that referenced this issue Jun 30, 2016

@arunsrinivasan

Member

arunsrinivasan commented Jun 30, 2016

Benchmark:

TODO: add MonetDBLite benchmark later.

Data:

# sample data
require(data.table)
set.seed(1L)
ids = paste0("id", 1:30e3)
N = 40e6L
query = data.table(id=sample(ids, N, TRUE), range1=sample(1e2L, N, TRUE))
query[, range2 := range1 + as.integer(runif(N)*300L)]
query

subject = data.table(id=sample(ids), range1=sample(2e2L, 30e3L, TRUE))
subject[, range2 := range1 + as.integer(runif(30e3L)*10e3L)]
subject

Non-equi joins:

system.time(
  nq_ans <- query[subject, .N, on=.(id, range1>=range1, range2<=range2), nomatch=0L, by=.EACHI]
)
# 19.8s

findOverlaps

require(GenomicRanges)
q.gr = GRanges(query$id, IRanges(query$range1, query$range2)) # 12.7s!!!
s.gr = GRanges(subject$id, IRanges(subject$range1, subject$range2))
system.time(gr_ans <- findOverlaps(q.gr, s.gr, type="within"))
# 16.4s
# note that we have not obtained the counts yet, just the overlaps
# the fact that q.gr takes ~13s is quite suspicious (i.e., makes me think that it does 
# some preprocessing and therefore should be included in the total run time)

RSQLite

# Thanks @jangorecki 
library(RSQLite)
conn = dbConnect(SQLite())
dbWriteTable(conn, "query", query)
dbWriteTable(conn, "subject", subject)
sql = 'SELECT subject.id, subject.range1, subject.range2, COUNT(*) AS n FROM query INNER JOIN subject ON query.id = subject.id AND query.range1 >= subject.range1 AND query.range2 <= subject.range2 GROUP BY subject.id, subject.range1, subject.range2;'
system.time(sql_ans <- dbGetQuery(conn, sql)) 
# 53.3s

foverlaps

system.time({
  setkey(subject, id, range1, range2)
  folaps_ans <- foverlaps(query, subject, type="within", nomatch=0L, which=TRUE)
})
# 12.9s
# note that we have not obtained the counts yet, just the overlaps
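
(An aside, not part of the original benchmark: if per-subject counts were wanted from these overlap indices for a like-for-like comparison, something along these lines should work; it is not included in the timings above.)

folaps_counts = folaps_ans[, .N, by = yid]   # foverlaps(which=TRUE) returns xid/yid pairs
gr_counts     = table(subjectHits(gr_ans))   # tabulate subject hits from the Hits object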

Another non-equi join comparison with RSQLite (not an interval join):

non-equi joins

system.time(nq_ans <- query[subject, .N, on=.(id, range1>=range1), nomatch=0L, by=.EACHI])
# 4.3s

RSQLite

sql = 'SELECT subject.id, subject.range1, COUNT(*) AS n FROM query INNER JOIN subject ON query.id = subject.id AND query.range1 >= subject.range1 GROUP BY subject.id, subject.range1;'
system.time(sql_ans <- dbGetQuery(conn, sql)) 
# 50.7s 

arunsrinivasan changed the title from "binary search extensions to `<`, `<=`, `>`, `>=` and `!=`" to "binary search extensions to `<`, `<=`, `>`, `>=`" on Jul 5, 2016

arunsrinivasan added a commit that referenced this issue Jul 13, 2016

2236529177 pushed a commit to 2236529177/data.table that referenced this issue Aug 13, 2017

2236529177 pushed a commit to 2236529177/data.table that referenced this issue Aug 13, 2017
