bed_subtract() reports overlapping intervals in larger datasets #316

kriemo · 2018-01-12T15:36:17Z

In some cases bed_subtract() does not drop intervals that overlap. This behavior only occurs with larger datasets. This bug is due to assuming that the output from findOverlaps() is sorted by coordinate, which is not always true when overlapping intervals are in different nodes of the interval tree.

Below is an example. Calling bed_subtract() on the same bed data with ~20k rows returns a small number of intervals that should have been dropped. If the non-dropped intervals are rerun through bed_subtract() now they are successfully dropped.

library(valr)

bg <- read_bedgraph(valr_example("hela.h3k4.chip.bg.gz"))

bed_subtract(bg, bg)
#> # A tibble: 479 x 4
#>    chrom    start      end value
#>    <chr>    <int>    <int> <dbl>
#>  1 chr22 17565704 17565724    18
#>  2 chr22 17566504 17566524    45
#>  3 chr22 17639724 17639744    53
#>  4 chr22 17640684 17640704    45
#>  5 chr22 18111344 18111364   128
#>  6 chr22 18121504 18121524    31
#>  7 chr22 18122344 18122364    30
#>  8 chr22 18257044 18257064    60
#>  9 chr22 18483444 18483464    47
#> 10 chr22 18484304 18484324    39
#> # ... with 469 more rows

not_dropped_ivls <- bed_subtract(bg, bg)

bed_subtract(not_dropped_ivls, not_dropped_ivls)
#> # A tibble: 0 x 4
#> # ... with 4 variables: chrom <chr>, start <int>, end <int>, value <dbl>

The proposed fix sorts the output from findOverlaps() by start coordinate, and has a negligible impact on performance.

Performance with the proposed fix is ~0.5 sec per 1e6 random ivls.

library(valr)
library(dplyr)
library(microbenchmark)

genome <- read_genome(valr_example('hg19.chrom.sizes.gz'))

n <- 1e6
nrep <- 10

seed_x <- 1010486
x <- bed_random(genome, n = n, seed = seed_x)
seed_y <- 9283019
y <- bed_random(genome, n = n, seed = seed_y)

microbenchmark(bed_subtract(x, y), times = nrep, unit = 's')
#> Unit: seconds
#>                expr       min        lq      mean    median        uq
#>  bed_subtract(x, y) 0.4919131 0.5441874 0.5529519 0.5547336 0.5660038
#>        max neval
#>  0.6023845    10

bed_subtract(x, x)
#> # A tibble: 0 x 3
#> # ... with 3 variables: chrom <chr>, start <int>, end <int>

Performance on current master also is ~0.5 sec per 1e6 random ivls.

library(valr)
library(dplyr)
library(microbenchmark)

genome <- read_genome(valr_example('hg19.chrom.sizes.gz'))

n <- 1e6
nrep <- 10

seed_x <- 1010486
x <- bed_random(genome, n = n, seed = seed_x)
seed_y <- 9283019
y <- bed_random(genome, n = n, seed = seed_y)

microbenchmark(bed_subtract(x, y), times = nrep, unit = 's')
#> Unit: seconds
#>                expr       min        lq      mean   median        uq
#>  bed_subtract(x, y) 0.5727233 0.5818941 0.6381701 0.650193 0.6624005
#>        max neval
#>  0.7120917    10

bed_subtract(x, x)
#> # A tibble: 1,007 x 3
#>    chrom    start      end
#>    <chr>    <int>    <int>
#>  1  chr1  2295436  2295921
#>  2  chr1 11019061 11019863
#>  3  chr1 18648714 18649508
#>  4  chr1 26404705 26405241
#>  5  chr1 29222401 29223360
#>  6  chr1 29222624 29223360
#>  7  chr1 31946987 31947882
#>  8  chr1 38934793 38935368
#>  9  chr1 39925867 39926630
#> 10  chr1 41100720 41101368
#> # ... with 997 more rows

Created on 2018-01-12 by the reprex package (v0.1.1.9000).

jayhesselberth · 2018-01-12T15:36:19Z

Deploy preview for valr processing.

Built with commit 980e15a

https://app.netlify.com/sites/valr/deploys/5a59229da114774478c90fdb

jayhesselberth · 2018-01-12T20:35:20Z

Can you rebase at re-submit?

…laps

kriemo added 2 commits January 12, 2018 07:49

sort findOverlaps output prior to modifying start and end coordinates

f0c6411

test if all intervals are dropped

b16f9df

add news item

dd15fce

kriemo changed the title ~~bed_subtract() reports overlapping intervals in larger datasets~~ bed_subtract() reports overlapping intervals in larger datasets Jan 12, 2018

Merge remote-tracking branch 'upstream/master' into subtract-sortover…

980e15a

…laps

jayhesselberth merged commit cf5587c into rnabioco:master Jan 13, 2018

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

bed_subtract() reports overlapping intervals in larger datasets #316

bed_subtract() reports overlapping intervals in larger datasets #316

kriemo commented Jan 12, 2018

jayhesselberth commented Jan 12, 2018 •

edited

jayhesselberth commented Jan 12, 2018

bed_subtract() reports overlapping intervals in larger datasets #316

bed_subtract() reports overlapping intervals in larger datasets #316

Conversation

kriemo commented Jan 12, 2018

jayhesselberth commented Jan 12, 2018 • edited

jayhesselberth commented Jan 12, 2018

jayhesselberth commented Jan 12, 2018 •

edited