Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Change to default minoverlap in subsetByOverlaps() feels unintuitive #1

Closed
PeteHaitch opened this issue Sep 11, 2017 · 15 comments
Closed

Comments

@PeteHaitch
Copy link
Contributor

The change of the default value of minoverlap to be 1 for subsetByOverlaps() feels really unintuitive to me (78ed68a). E.g., these used to have the same default behaviour (which felt intuitive) but no longer do:

suppressPackageStartupMessages(library(IRanges))
x <- IRanges(1, 4)
ranges <- IRanges(5, 8)
subsetByOverlaps(x, ranges)
#> IRanges object with 1 range and 0 metadata columns:
#>           start       end     width
#>       <integer> <integer> <integer>
#>   [1]         1         4         4
x[overlapsAny(x, ranges)]
#> IRanges object with 0 ranges and 0 metadata columns:
#>        start       end     width
#>    <integer> <integer> <integer>

Might there a way to get the 'correct' behaviour for zero width ranges in subsetByOverlaps() without changing the default? I'd expect more people to be confused by the new behaviour than by dropped zero-width ranges.

Cheers,
Pete

@hpages
Copy link
Contributor

hpages commented Sep 12, 2017

Hi Pete,
I'll try to work something so that adjacent ranges are excluded. Note that the default for minoverlap needs to be 0. Having it set to 1 would exclude zero-width ranges and loosing these ranges is seriously wrong (call it a disaster if 'x' contains SNPs and some of them are in-dels). Unfortunately that means subsetByOverlaps(x, ranges) is not equivalent to x[overlapsAny(x, ranges)] anymore (but they will at least produce the same result again for your above example).
Thanks for your feedback,
H.

@PeteHaitch
Copy link
Contributor Author

Thanks, Hervé! I appreciate you're trying to balance two different edge cases. I guess I'm unclear why the default minoverlap has to be 0 for some findOverlaps-based methods and 1 for others?

@hpages
Copy link
Contributor

hpages commented Sep 12, 2017

That's a good point. It would indeed be better to use the same default everywhere. Right now findOverlaps() and overlapsAny() ignore zero-width ranges by default, but subsetByOverlaps() does not. Not an ideal situation. However I guess having subsetByOverlaps() loose zero-width ranges by default is not acceptable so that change needed to be made. We could in theory make the same change for findOverlaps() and overlapsAny() but the impact of such a change is hard to anticipate (and easy to underestimate) so I'm a little bit nervous about it. I'd rather postpone for now.

@PeteHaitch
Copy link
Contributor Author

I agree that the impact of changing minoverlap is hard to anticipate - I got bitten by the latest change :).

My preference would be retaining the old default of minoverlap = 1 everywhere so that subsetByOverlaps() didn't by default consider adjacent ranges to be overlapping. And then some special handling for when any(width(x) == 0) is TRUE.

@lawremi
Copy link
Collaborator

lawremi commented Sep 12, 2017

I agree about not changing the behavior unless absolutely necessary. No one would ever expect subsetByOverlaps() to consider adjacent ranges to be overlapping.

@hpages
Copy link
Contributor

hpages commented Sep 12, 2017

Default behavior on zero-width ranges has to change because it's broken. After putting more thoughts into this I'm leaning towards having maxgap=-1 and minoverlap=0 by default for ALL the findOverlaps family. So by default, adjacent ranges won't be overlapping but when a zero-width range falls within a range they'll be considered to overlap. So if you didn't have zero-width ranges in your query or subject, you won't notice any change. But now you will be able to control whether you want adjacent ranges to overlap (by setting maxgap to 0), or not have zero-width ranges overlap anything (by setting minoverlap to > 0), or both.

@lawremi
Copy link
Collaborator

lawremi commented Sep 12, 2017

Is it really broken or do we just need a convenient way to find overlaps for single positions? One option is to simply resize() the ranges to width 1 just prior to overlap detection.

@hpages
Copy link
Contributor

hpages commented Sep 12, 2017

Broken. IMO [7,6] and [1,10] should be considered to overlap (by default) even though the intersection is empty (BTW a zero-width range is not a single position). The criteria for overlap should not be that there is a non-empty intersection between the 2 ranges. I think it's more useful to look at the position of one range w.r.t. the other. In other words, a more sensible (default) criteria is end1 >= start2 and end2 >= start1. This makes no difference if the ranges are not zero-width but it does make a difference on zero-width ranges.

With this criteria:

  • A zero-width range located within a non zero-width range (e.g. [7,6] and [1,10]) is an overlap.
  • Two adjacent (and non zero-width) ranges (e.g. [1, 10] and [11, 15]) still do not overlap.
  • A zero-width range adjacent to a non zero-width range (e.g. [1, 0] and [1, 10]) do not overlap. Note that you could also consider that the zero-width range is within the non zero-width range here. More on this below.
  • Two identical zero-width ranges (e.g. [1, 0] and [1, 0]) do not overlap.

The last 2 situations are extreme edge cases where the relative position of the 2 ranges is ambiguous: you can either consider that the 1st range is inside the other or that it's adjacent to it. There is no unique answer and different people will have a different opinion on that. Anyway the end1 >= start2 and end2 >= start1 criteria "sees" them as adjacent.

Using maxgap=-1 and minoverlap=0 by default will make findOverlaps() and family behave according to this criteria.

@lawremi
Copy link
Collaborator

lawremi commented Sep 13, 2017

Ok, but wow, those default parameters are not intuitive at all, unfortunately. I have no intuition what a maxgap of -1 would mean, and it's strange to see the minoverlap defaulting to 0.

@hpages
Copy link
Contributor

hpages commented Sep 13, 2017

Maybe that's because you've been with a mindset where overlap means non-empty intersection for too long? The fact that 10 years ago you chose these defaults for maxgap and minoverlap makes me think so but maybe I'm wrong. It does suggest that zero-width ranges were not really considered as part of the big picture though ;-)

@lawremi
Copy link
Collaborator

lawremi commented Sep 13, 2017

I guess minoverlap=0 makes sense given the new definition, but it does the conflict with the name of the function. It's the maxgap=-1 that I find pretty mysterious (how can a gap width be negative?), but I guess it is the best compromise. In some ways this suggests a new verb, but I don't think we should go there.

@hpages
Copy link
Contributor

hpages commented Sep 13, 2017

A gap > 0 means the 2 ranges are disjoint and separated by 1 or more nucleotides. A gap of 0 means they are adjacent. By extrapolation a gap of -1 just means the 2 ranges are "even closer than adjacent" i.e. one range has its start or end strictly inside the other. So maxgap=-1 just means "no adjacent ranges please". It's just a special value with a special meaning. The choice of -1 for this might sound arbitrary but so far I'm pleased with it as it seems to minimize the amount of changes that need to be done to the current C code.

An alternative would be to add an extra argument to findOverlaps() to control what to do with adjacent ranges but: (1) findOverlaps() already has too many arguments, and (2) this extra argument wouldn't be orthogonal to maxgap (having orthogonal arguments is a good property of an interface). It's kind of natural to control adjacency via the maxgap argument.

@hpages
Copy link
Contributor

hpages commented Sep 14, 2017

I made the change in IRanges 2.11.16, GenomicRanges 1.29.14, GenomicAlignments 1.13.6, and SummarizedExperiment 1.7.8.
Also posted an update on the support site to the guy who started all this:
https://support.bioconductor.org/p/99369/#100402

Hard to anticipate the impact of this change. Hopefully it won't be too bad...

@PeteHaitch Modifying the definitions of the findOverlaps and countOverlaps methods defined in GenomicTuples to make them adhere to the new maxgap and minoverlap defaults should be straightforward, I hope.

@PeteHaitch
Copy link
Contributor Author

@hpages I've updated GenomicTuples and it builds locally without error. Will push to BioC ASAP

@hpages
Copy link
Contributor

hpages commented Sep 20, 2017

Sounds good!

@hpages hpages closed this as completed Sep 20, 2017
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants