Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

bed_subtract() reports overlapping intervals in larger datasets #316

Merged
merged 4 commits into from Jan 13, 2018

Conversation

kriemo
Copy link
Member

@kriemo kriemo commented Jan 12, 2018

In some cases bed_subtract() does not drop intervals that overlap. This behavior only occurs with larger datasets. This bug is due to assuming that the output from findOverlaps() is sorted by coordinate, which is not always true when overlapping intervals are in different nodes of the interval tree.

Below is an example. Calling bed_subtract() on the same bed data with ~20k rows returns a small number of intervals that should have been dropped. If the non-dropped intervals are rerun through bed_subtract() now they are successfully dropped.

library(valr)

bg <- read_bedgraph(valr_example("hela.h3k4.chip.bg.gz"))

bed_subtract(bg, bg)
#> # A tibble: 479 x 4
#>    chrom    start      end value
#>    <chr>    <int>    <int> <dbl>
#>  1 chr22 17565704 17565724    18
#>  2 chr22 17566504 17566524    45
#>  3 chr22 17639724 17639744    53
#>  4 chr22 17640684 17640704    45
#>  5 chr22 18111344 18111364   128
#>  6 chr22 18121504 18121524    31
#>  7 chr22 18122344 18122364    30
#>  8 chr22 18257044 18257064    60
#>  9 chr22 18483444 18483464    47
#> 10 chr22 18484304 18484324    39
#> # ... with 469 more rows

not_dropped_ivls <- bed_subtract(bg, bg)

bed_subtract(not_dropped_ivls, not_dropped_ivls)
#> # A tibble: 0 x 4
#> # ... with 4 variables: chrom <chr>, start <int>, end <int>, value <dbl>

The proposed fix sorts the output from findOverlaps() by start coordinate, and has a negligible impact on performance.

Performance with the proposed fix is ~0.5 sec per 1e6 random ivls.

library(valr)
library(dplyr)
library(microbenchmark)

genome <- read_genome(valr_example('hg19.chrom.sizes.gz'))

n <- 1e6
nrep <- 10

seed_x <- 1010486
x <- bed_random(genome, n = n, seed = seed_x)
seed_y <- 9283019
y <- bed_random(genome, n = n, seed = seed_y)

microbenchmark(bed_subtract(x, y), times = nrep, unit = 's')
#> Unit: seconds
#>                expr       min        lq      mean    median        uq
#>  bed_subtract(x, y) 0.4919131 0.5441874 0.5529519 0.5547336 0.5660038
#>        max neval
#>  0.6023845    10

bed_subtract(x, x)
#> # A tibble: 0 x 3
#> # ... with 3 variables: chrom <chr>, start <int>, end <int>

Performance on current master also is ~0.5 sec per 1e6 random ivls.

library(valr)
library(dplyr)
library(microbenchmark)

genome <- read_genome(valr_example('hg19.chrom.sizes.gz'))

n <- 1e6
nrep <- 10

seed_x <- 1010486
x <- bed_random(genome, n = n, seed = seed_x)
seed_y <- 9283019
y <- bed_random(genome, n = n, seed = seed_y)

microbenchmark(bed_subtract(x, y), times = nrep, unit = 's')
#> Unit: seconds
#>                expr       min        lq      mean   median        uq
#>  bed_subtract(x, y) 0.5727233 0.5818941 0.6381701 0.650193 0.6624005
#>        max neval
#>  0.7120917    10

bed_subtract(x, x)
#> # A tibble: 1,007 x 3
#>    chrom    start      end
#>    <chr>    <int>    <int>
#>  1  chr1  2295436  2295921
#>  2  chr1 11019061 11019863
#>  3  chr1 18648714 18649508
#>  4  chr1 26404705 26405241
#>  5  chr1 29222401 29223360
#>  6  chr1 29222624 29223360
#>  7  chr1 31946987 31947882
#>  8  chr1 38934793 38935368
#>  9  chr1 39925867 39926630
#> 10  chr1 41100720 41101368
#> # ... with 997 more rows

Created on 2018-01-12 by the reprex package (v0.1.1.9000).

@jayhesselberth
Copy link
Member

jayhesselberth commented Jan 12, 2018

Deploy preview for valr processing.

Built with commit 980e15a

https://app.netlify.com/sites/valr/deploys/5a59229da114774478c90fdb

@kriemo kriemo changed the title bed_subtract() reports overlapping intervals in larger datasets bed_subtract() reports overlapping intervals in larger datasets Jan 12, 2018
@jayhesselberth
Copy link
Member

Can you rebase at re-submit?

@jayhesselberth jayhesselberth merged commit cf5587c into rnabioco:master Jan 13, 2018
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants