Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Missing overlaps with type='start' or type='end' in findOverlaps #46

Closed
jakob-wirbel opened this issue Nov 16, 2022 · 2 comments
Closed

Comments

@jakob-wirbel
Copy link

Hi folks 😃
I recently encountered an odd behaviour in the findOverlaps function.
I have a single query for which I want to find overlapping genes, stratified by the type of overlap. So, I want to know, if the overlap is within the gene or overlapping the C or N-terminus. Therefore, I used the type parameters in the findOverlap function.
I first checked if there are any overlaps and found two genes with overlap, with my region overlapping the start of one and the end of another one of the two hits. However, when I run it with type='end' or 'start', I do not get the expected overlaps.
I tried to understand the underlying code, but it seems to call some C function which I cannot really figure out, unfortunately.

This should be a reproducible example with a subset of the gene coordinates (using IRanges_2.32.0 and R 4.2.1):

library("IRanges")
query <- IRanges(start=638752, end=639346)
subject <- IRanges(start=c(635246, 636459, 637282, 638191, 639306, 
                           640686, 641814, 643721, 643992, 645102, 
                           645499), 
                   end=c(636343, 637097, 638187, 639150, 640670, 641798, 
                         643592, 643978, 644858, 645485, 647505))

for (type.overlap in c('within', 'start', 'end', 'any', 'equal')){
        message(type.overlap)
        overlaps <- findOverlaps(query=query, subject=subject, 
                                 type = type.overlap, select='all')
        print(overlaps)
}

This is the output:

within
Hits object with 0 hits and 0 metadata columns:
   queryHits subjectHits
   <integer>   <integer>
  -------
  queryLength: 1 / subjectLength: 11
start
Hits object with 0 hits and 0 metadata columns:
   queryHits subjectHits
   <integer>   <integer>
  -------
  queryLength: 1 / subjectLength: 11
end
Hits object with 0 hits and 0 metadata columns:
   queryHits subjectHits
   <integer>   <integer>
  -------
  queryLength: 1 / subjectLength: 11
any
Hits object with 2 hits and 0 metadata columns:
      queryHits subjectHits
      <integer>   <integer>
  [1]         1           4
  [2]         1           5
  -------
  queryLength: 1 / subjectLength: 11
equal
Hits object with 0 hits and 0 metadata columns:
   queryHits subjectHits
   <integer>   <integer>
  -------
  queryLength: 1 / subjectLength: 11

Just to clarify, the subjects 4 and 5 should be returned as overlaps with the query. This is the query:

query
IRanges object with 1 range and 0 metadata columns:
          start       end     width
      <integer> <integer> <integer>
  [1]    638752    639346       595

and this the expected overlaps:

subject[c(4,5),]
IRanges object with 2 ranges and 0 metadata columns:
          start       end     width
      <integer> <integer> <integer>
  [1]    638191    639150       960
  [2]    639306    640670      1365

I hope I am not missing something obvious, but it seems like the function would not return the expected results.
Your help will be much appreciated!
Cheers,
Jakob

@PeteHaitch
Copy link
Contributor

From help("findOverlaps-methods", "IRanges"), the documentation of type is (emphasis mine):

By default, any overlap is accepted. By specifying the type parameter, one can select for specific types of overlap. The types correspond to operations in Allen's Interval Algebra (see references). If type is start or end, the intervals are required to have matching starts or ends, respectively. Specifying equal as the type returns the intersection of the start and end matches. If type is within, the query interval must be wholly contained within the subject interval. Note that all matches must additionally satisfy the minoverlap constraint described above.

Plotting the ranges should help clarify the behaviour (see below).
We see that the query (red) overlaps the end of query[4] (grey), but does not have the same start or end, and start of query[5] (black), but does not have the same start or end.
Carefully reading the above documentation, that means this cannot be a type = "start", type = "end", type = "within", or type = "equal" match but is a type = "any" match, which reflects the results you got.

suppressPackageStartupMessages(library(IRanges))
                       
# Taken from IRanges vignette
plotRanges <- function(x, xlim = x, main = deparse(substitute(x)), col = "black", 
                       sep = 0.5, ...) {
  height <- 1
  if (is(xlim, "IntegerRanges"))
    xlim <- c(min(start(xlim)), max(end(xlim)))
  bins <- disjointBins(IRanges(start(x), end(x) + 1))
  plot.new()
  plot.window(xlim, c(0, max(bins)*(height + sep)))
  ybottom <- bins * (sep + height) - height
  rect(start(x) - 0.5, ybottom, end(x) + 0.5, ybottom + height, col = col, ...)
  title(main)
  axis(1)
}

# reprex from https://github.com/Bioconductor/IRanges/issues/46#issue-1450599624
query <- IRanges(start=638752, end=639346)
subject <- IRanges(start=c(635246, 636459, 637282, 638191, 639306, 
                           640686, 641814, 643721, 643992, 645102, 
                           645499), 
                   end=c(636343, 637097, 638187, 639150, 640670, 641798, 
                         643592, 643978, 644858, 645485, 647505))

# Plot regions (query in red, subject alternating in black and grey).
plotRanges(
  c(query, subject), 
  col = c(rep("red", length(query)), rep(c("black", "grey"), length(subject) / 2 + 1)))

Created on 2022-11-16 with reprex v2.0.2

Session info
sessionInfo()
#> R version 4.2.2 Patched (2022-11-10 r83330)
#> Platform: x86_64-pc-linux-gnu (64-bit)
#> Running under: Ubuntu 20.04.5 LTS
#> 
#> Matrix products: default
#> BLAS:   /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3
#> LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/liblapack.so.3
#> 
#> locale:
#>  [1] LC_CTYPE=en_AU.UTF-8       LC_NUMERIC=C              
#>  [3] LC_TIME=en_AU.UTF-8        LC_COLLATE=en_AU.UTF-8    
#>  [5] LC_MONETARY=en_AU.UTF-8    LC_MESSAGES=en_AU.UTF-8   
#>  [7] LC_PAPER=en_AU.UTF-8       LC_NAME=C                 
#>  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
#> [11] LC_MEASUREMENT=en_AU.UTF-8 LC_IDENTIFICATION=C       
#> 
#> attached base packages:
#> [1] stats4    stats     graphics  grDevices utils     datasets  methods  
#> [8] base     
#> 
#> other attached packages:
#> [1] IRanges_2.32.0      S4Vectors_0.36.0    BiocGenerics_0.44.0
#> 
#> loaded via a namespace (and not attached):
#>  [1] rstudioapi_0.14 xml2_1.3.3      knitr_1.40      magrittr_2.0.3 
#>  [5] R6_2.5.1        rlang_1.0.6     fastmap_1.1.0   stringr_1.4.1  
#>  [9] highr_0.9       httr_1.4.4      tools_4.2.2     xfun_0.34      
#> [13] cli_3.4.1       withr_2.5.0     htmltools_0.5.3 yaml_2.3.6     
#> [17] digest_0.6.30   lifecycle_1.0.3 purrr_0.3.5     fs_1.5.2       
#> [21] curl_4.3.3      mime_0.12       glue_1.6.2      evaluate_0.18  
#> [25] rmarkdown_2.18  reprex_2.0.2    stringi_1.7.8   compiler_4.2.2

BTW Questions like these are better asked on https://support.bioconductor.org/.

@jakob-wirbel
Copy link
Author

Hi @PeteHaitch
Thanks so much for the clarification!
I misunderstood the help and didn't recognize that you need exact matches of the start/end coordinates. I guess the overlap function is then not actually the most helpful to me :D
Thank you again!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants