Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

binary search extensions to <, <=, >, >= #1452

Closed
5 tasks done
arunsrinivasan opened this issue Nov 30, 2015 · 16 comments
Closed
5 tasks done

binary search extensions to <, <=, >, >= #1452

arunsrinivasan opened this issue Nov 30, 2015 · 16 comments
Assignees
Milestone

Comments

@arunsrinivasan
Copy link
Member

arunsrinivasan commented Nov 30, 2015

Using on= as on=.(x == y, a <= b) -- as simple as that..

  • handle operators >=, >, <= and <.
  • support for mult="first" and mult="last"
  • add by=.EACHI support
  • update ?data.table
  • add a benchmark

Then extend #1068

@ywhcuhk
Copy link

ywhcuhk commented Apr 1, 2016

This would be awesome if implemented !

@arunsrinivasan arunsrinivasan self-assigned this Apr 6, 2016
arunsrinivasan added a commit that referenced this issue Apr 6, 2016
* non-equi-joins:
  non-equi joins update to NEWS. #1452.
  Patching another issue spotted by Jan. Thanks!
  Update ?data.table with current non-equi join functionality.
  Limit number of combinations for tests to max of 100.
  Closes #1257, on=.() syntax is now possible.
  Added test for join on char type with op other than '=='.
  Allow only '==' operator for joins on char type.
  Free allocated variable.
  Fix for the issue @jan spotted. Added tests. Thanks Jan.
  Finally, non-equi joins NAs/NaNs correctly in all cases, hopefully.
  Added a note to self comment to nestedid.
  Minor: fix code spacing.
  Adding tests for non-equi joins only for non-NA/NaN cases.
  Fixing logic for NAs in i.
  Various improvements and fixes to nestedid.
  better logic fixes edge cases, also removes for-loop = ~3x faster
  Just fixing indentation and minor code cleanup. No implementations.
  thinko! should be seq_len, not seq_along.
  First stab at non-equi joins
@MichaelChirico
Copy link
Member

Any plans to support syntax like on = .(x == y, a < b +3)?

@arunsrinivasan
Copy link
Member Author

Very likely not in this release (depends on how fast I wrap up the rest), but definitely useful. Perhaps as a new issue would be great.

@MichaelChirico
Copy link
Member

Done, see #1639

@eantonya
Copy link
Contributor

@arunsrinivasan can you give a quick example of .EACHI failing in combination with a non-equi join?

@arunsrinivasan
Copy link
Member Author

arunsrinivasan commented May 18, 2016

Here's an example where it'll fail:

require(data.table)
dt = data.table(id="x", a=as.integer(c(3,8,8,15,15,15,16,22,22,25,25)), b=as.integer(c(9,10,25,19,22,25,38,3,9,7,28)), c=as.integer(c(22,33,44,14,49,44,40,25,400,52,77)))

dt[.(a=c(12,20), b=20), sum(c), on=c("a>a", "b<=b"), by=.EACHI]
# Error in `[.data.table`(dt, .(a = c(12, 20), b = 20), sum(c), on = c("a>a",  : 
#   by-joins are not yet implemented for multi-group non-equi-joins.

The idea for non-equi joins is to split data.table x (here, dt) into groups where rows of all columns, independently, have values always in increasing order. Here, there'll be more than one such "group". In the examples you've worked so far, there must have been only one such group. If this is hard to follow, please wait a while. I'll have better explanations (with visualisations) for UseR'16 soon.

@eantonya
Copy link
Contributor

eantonya commented May 18, 2016

Ok, thanks, so failure is actually catastrophic and not silent. That's good - I thought it just works incorrectly for some cases and was looking for an incorrect result.

Iirc the cases I've worked with all had a single column on LHS, which always satisfies the single group condition.

@arunsrinivasan
Copy link
Member Author

Updated all SO posts linked.

@ywhcuhk
Copy link

ywhcuhk commented Jun 25, 2016

great work!!

@ywhcuhk
Copy link

ywhcuhk commented Jun 28, 2016

It seems that for this to work, one has to specify j.

Here is a small example

> dt1 <- data.table(year=1991:2000, v=rnorm(10))
> dt1
    year          v
 1: 1991 -0.4465306
 2: 1992  0.3355444
 3: 1993  1.7731734
 4: 1994  0.5285609
 5: 1995 -1.7025382
 6: 1996  0.3752937
 7: 1997  2.0677762
 8: 1998  0.6509314
 9: 1999  0.6291038
10: 2000 -0.9639413
> dt2 <- data.table(start=dt1$year-5, end=dt1$year)
> dt2
    start  end
 1:  1986 1991
 2:  1987 1992
 3:  1988 1993
 4:  1989 1994
 5:  1990 1995
 6:  1991 1996
 7:  1992 1997
 8:  1993 1998
 9:  1994 1999
10:  1995 2000

what I want to get can be accomplished by foverlaps:

> dt1[, year_hlp:=year]
> setkey(dt1, year, year_hlp)
> setkey(dt2, start, end)
> ?foverlaps
> foverlaps(dt1, dt2, by.x = c('year','year_hlp'), by.y = c('start','end'))[order(start)]
    start  end year          v year_hlp
 1:  1986 1991 1991 -0.4465306     1991
 2:  1987 1992 1991 -0.4465306     1991
 3:  1987 1992 1992  0.3355444     1992
 4:  1988 1993 1991 -0.4465306     1991
 5:  1988 1993 1992  0.3355444     1992
 6:  1988 1993 1993  1.7731734     1993
 7:  1989 1994 1991 -0.4465306     1991
 8:  1989 1994 1992  0.3355444     1992
 9:  1989 1994 1993  1.7731734     1993
10:  1989 1994 1994  0.5285609     1994
11:  1990 1995 1991 -0.4465306     1991
12:  1990 1995 1992  0.3355444     1992
13:  1990 1995 1993  1.7731734     1993
14:  1990 1995 1994  0.5285609     1994
15:  1990 1995 1995 -1.7025382     1995
16:  1991 1996 1991 -0.4465306     1991
17:  1991 1996 1992  0.3355444     1992
18:  1991 1996 1993  1.7731734     1993
......

But I thought I could simply do this:

> dt1[dt2, on=.(year>=start, year<=end), by=.EACHI]

However, I got an error:

Error in `[.data.table`(dt1, dt2, on = .(year >= start, year <= end),  : 
  'by' or 'keyby' is supplied but not j

Why do I have to specify j? All I wanted is to do a simply inequality join like in SQL.

@jangorecki
Copy link
Member

@ywhuofu if you want inequality join like in SQL then why you use by=.EACHI? it adds SQL's GROUP BY to your query. If you are grouping then it is reasonable to have an aggregate functions, in case of grouping on join it is mandatory to provide j, even when doing it for equi-join.

@ywhcuhk
Copy link

ywhcuhk commented Jun 28, 2016

@jangorecki thanks for you response. So now I am a little confused. Based on my example, what should I do instead?

@jangorecki
Copy link
Member

jangorecki commented Jun 28, 2016

@ywhuofu Use set.seed when dealing with rnorm. I've changed year-5 to year-5L in your code to keep it as integer.

set.seed(1)
dt1 <- data.table(year=1991:2000, v=rnorm(10))
dt2 <- data.table(start=dt1$year-5L, end=dt1$year)
dt1[, year_hlp:=year]
setkey(dt1, year, year_hlp)
setkey(dt2, start, end)
rf = foverlaps(dt1, dt2, by.x = c('year','year_hlp'), by.y = c('start','end'))[order(start)]
rf[, year_hlp:=NULL]

set.seed(1)
dt1 <- data.table(year=1991:2000, v=rnorm(10))
dt2 <- data.table(start=dt1$year-5L, end=dt1$year)
r = dt1[dt2, .(start, end, year=x.year, v), on=.(year>=start, year<=end), allow.cartesian=TRUE][order(start)]
all.equal(rf, r)
#[1] TRUE

Note the year=x.year in j argument, the rationale for using x. explicitly is described in #1615, otherwise you would get year equal to the value to which it was joined in i data.table.

@ywhcuhk
Copy link

ywhcuhk commented Jun 28, 2016

@jangorecki thanks a lot for the detailed observations. It seems the key is allow.cartesian. I think it could be useful to keep a cookbook for data.table strategies to replicate SQL queries.

One more interesting observation, the inequality join is faster than foverlaps.

@arunsrinivasan
Copy link
Member Author

arunsrinivasan commented Jun 30, 2016

Benchmark:

TODO: add MonetDBLite benchmark later.

Data:

# sample data
require(data.table)
set.seed(1L)
ids = paste0("id", 1:30e3)
N = 40e6L
query = data.table(id=sample(ids, N, TRUE), range1=sample(1e2L, N, TRUE))
query[, range2 := range1 + as.integer(runif(N)*300L)]
query

subject = data.table(id=sample(ids), range1=sample(2e2L, 30e3L, TRUE))
subject[, range2 := range1 + as.integer(runif(30e3L)*10e3L)]
subject

Non-equi joins:

system.time(
  nq_ans <- query[subject, .N, on=.(id, range1>=range1, range2<=range2), nomatch=0L, by=.EACHI]
)
# 19.8s

findOverlaps

require(GenomicRanges)
q.gr = GRanges(query$id, IRanges(query$range1, query$range2)) # 12.7s!!!
s.gr = GRanges(subject$id, IRanges(subject$range1, subject$range2))
system.time(gr_ans <- findOverlaps(q.gr, s.gr, type="within"))
# 16.4s
# note that we have not obtained the counts yet, just the overlaps
# the fact that q.gr takes ~13s is quite suspicious (i.e., makes me think that it does 
# some preprocessing and therefore should be included in the total run time)

RSQLite

# Thanks @jangorecki 
library(RSQLite)
conn = dbConnect(SQLite())
dbWriteTable(conn, "query", query)
dbWriteTable(conn, "subject", subject)
sql = 'SELECT subject.id, subject.range1, subject.range2, COUNT(*) AS n FROM query INNER JOIN subject ON query.id = subject.id AND query.range1 >= subject.range1 AND query.range2 <= subject.range2 GROUP BY subject.id, subject.range1, subject.range2;'
system.time(sql_ans <- dbGetQuery(conn, sql)) 
# 53.3s

foverlaps

system.time({
  setkey(subject, id, range1, range2)
  folaps_ans <- foverlaps(query, subject, type="within", nomatch=0L, which=TRUE)
})
# 12.9s
# note that we have not obtained the counts yet, just the overlaps

Another non-equi joins comparison with RSQLite: (not an interval join)

non-equi joins

system.time(nq_ans <- query[subject, .N, on=.(id, range1>=range1), nomatch=0L, by=.EACHI])
# 4.3s

RSQLite

sql = 'SELECT subject.id, subject.range1, COUNT(*) AS n FROM query INNER JOIN subject ON query.id = subject.id AND query.range1 >= subject.range1 GROUP BY subject.id, subject.range1;'
system.time(sql_ans <- dbGetQuery(conn, sql)) 
# 50.7s 

@arunsrinivasan arunsrinivasan changed the title binary search extensions to <, <=, >, >= and != binary search extensions to <, <=, >, >= Jul 5, 2016
worker000000 pushed a commit to worker000000/data.table that referenced this issue Aug 13, 2017
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

5 participants