StructuralVariantAnnotation #1079

d-cameron · 2019-04-05T06:33:26Z

Update the following URL to point to the GitHub repository of
the package you wish to submit to Bioconductor

Repository: https://github.com/PapenfussLab/StructuralVariantAnnotation

Confirm the following by editing each check box to '[x]'

[ X ] I understand that by submitting my package to Bioconductor,
the package source and all review commentary are visible to the
general public.
[ X ] I have read the Bioconductor Package Submission
instructions. My package is consistent with the Bioconductor
Package Guidelines.
[X ] I understand that a minimum requirement for package acceptance
is to pass R CMD check and R CMD BiocCheck with no ERROR or WARNINGS.
Passing these checks does not result in automatic acceptance. The
package will then undergo a formal review and recommendations for
acceptance regarding other Bioconductor standards will be addressed.
[ X ] My package addresses statistical or bioinformatic issues related
to the analysis and comprehension of high throughput genomic data.
[ X ] I am committed to the long-term maintenance of my package. This
includes monitoring the support site for issues that users may
have, subscribing to the bioc-devel mailing list to stay aware
of developments in the Bioconductor community, responding promptly
to requests for updates from the Core team in response to changes in
R or underlying software.

I am familiar with the essential aspects of Bioconductor software
management, including:

[ X ] The 'devel' branch for new packages and features.
[ X ] The stable 'release' branch, made available every six
months, for bug fixes.
[ X ] Bioconductor version control using Git
(optionally via GitHub).

For help with submitting your package, please subscribe and post questions
to the bioc-devel mailing list.

bioc-issue-bot · 2019-04-05T06:33:27Z

Hi @d-cameron

Thanks for submitting your package. We are taking a quick
look at it and you will hear back from us soon.

The DESCRIPTION file for this package is:

Package: StructuralVariantAnnotation
Type: Package
Title: Variant annotations for structural variants
Version: 0.99.0
Date: 2019-04-05
Authors@R: c(
	person("Daniel", "Cameron", email="daniel.l.cameron@gmail.com", role=c("aut", "cre"), comment=c(ORCID = "0000-0002-0951-7116")),
	person("Ruining", "Dong", email="dong.rn@wehi.edu.au", role=c("aut"), comment=c(ORCID = "0000-0003-1433-0484")))
Description: StructuralVariantAnnotation contains useful helper
	functions for dealing with structural variants in VCF format.
	The packages contains functions for parsing VCFs from a number
	of popular callers as well as functions for dealing with 
	breakpoints involving two separate genomic loci encoded as
	GRanges objects.
License: GPL-3
Depends:
	GenomicRanges,
	rtracklayer,
  VariantAnnotation,
  BiocGenerics
Imports:
    assertthat,
    Biostrings,
    stringr,
    dplyr
Suggests:
    BSgenome.Hsapiens.UCSC.hg19,
    devtools,
    testthat,
    roxygen2,
    covr,
    knitr,
    plyranges,
    dplyr,
    ggbio,
    biovizBase,
    circlize
RoxygenNote: 6.1.1
VignetteBuilder: knitr
biocViews: DataImport, Sequencing, Annotation, Genetics, VariantAnnotation

Add SSH keys to your GitHub account. SSH keys
will are used to control access to accepted Bioconductor
packages. See these instructions to add SSH keys to
your GitHub account.

bioc-issue-bot · 2019-04-05T08:57:27Z

A reviewer has been assigned to your package. Learn what to expect
during the review process.

IMPORTANT: Please read the instructions for setting
up a push hook on your repository, or further changes to your
repository will NOT trigger a new build.

bioc-issue-bot · 2019-04-05T09:06:20Z

Dear Package contributor,

This is the automated single package builder at bioconductor.org.

Your package has been built on Linux, Mac, and Windows.

On one or more platforms, the build results were: "skipped, ERROR".
This may mean there is a problem with the package that you need to fix.
Or it may mean that there is a problem with the build system itself.

Please see the build report for more details.

d-cameron · 2019-04-07T22:50:54Z

Hmm. What's the best way to proceed when my package builds without error/warning when build against release dependencies, but is failing in devel due to dependencies? In this case, a test/example VCF parses in release, but I get a invalid class "VCFHeader" object: 'info(VCFHeader)' must be a 3 column DataFrame with names Number, Type, Description when I call VariantAnnotation::readVcf() in the vignette.

hpages · 2019-04-09T19:58:55Z

Hi @d-cameron ,

Thanks for this submission.

Please focus on BioC devel. There have been major changes to Rsamtools and VariantAnnotation between BioC 3.8 (release) and 3.9 (devel) that could explain these differences. Also let's ignore Windows for now: there is a known Windows-specific issue with the latest Rsamtools and VariantAnnotation in devel that is still under investigation.

Best,
H.

lawremi · 2019-04-09T20:35:25Z

The overlap functionality could greatly benefit from using features of the Hits objects, instead of reducing to data.frames and resorting to dplyr. For example, in findBreakpointOverlaps(), comments mention that duplicated(hits) takes too much memory. Is that just because hits is a data.frame? The duplicated,Hits() method is very efficient. I think intersect(x_hits, partner_hits) would be a better approach.

bioc-issue-bot · 2019-04-11T03:29:52Z

Received a valid push; starting a build. Commits are:

b67d361 version bump

bioc-issue-bot · 2019-04-11T03:45:37Z

Dear Package contributor,

This is the automated single package builder at bioconductor.org.

Your package has been built on Linux, Mac, and Windows.

On one or more platforms, the build results were: "skipped, ERROR".
This may mean there is a problem with the package that you need to fix.
Or it may mean that there is a problem with the build system itself.

Please see the build report for more details.

bioc-issue-bot · 2019-04-11T04:52:46Z

Received a valid push; starting a build. Commits are:

269c5ce remove test file bugs
d813bba version bump Merge branch 'master' of https://git...

bioc-issue-bot · 2019-04-11T05:04:39Z

Received a valid push; starting a build. Commits are:

9494b6b version bump 0.99.3

bioc-issue-bot · 2019-04-11T05:08:23Z

Dear Package contributor,

This is the automated single package builder at bioconductor.org.

Your package has been built on Linux, Mac, and Windows.

On one or more platforms, the build results were: "skipped, ERROR".
This may mean there is a problem with the package that you need to fix.
Or it may mean that there is a problem with the build system itself.

Please see the build report for more details.

bioc-issue-bot · 2019-04-11T05:19:39Z

Dear Package contributor,

This is the automated single package builder at bioconductor.org.

Your package has been built on Linux, Mac, and Windows.

On one or more platforms, the build results were: "skipped, ERROR".
This may mean there is a problem with the package that you need to fix.
Or it may mean that there is a problem with the build system itself.

Please see the build report for more details.

bioc-issue-bot · 2019-04-11T08:44:43Z

Received a valid push; starting a build. Commits are:

7420bba update examples
d5e4e0c update examples and version bump
79b93c6 update package documentation Merge branch 'master...

bioc-issue-bot · 2019-04-11T09:02:34Z

Dear Package contributor,

This is the automated single package builder at bioconductor.org.

Your package has been built on Linux, Mac, and Windows.

On one or more platforms, the build results were: "WARNINGS, skipped, ERROR".
This may mean there is a problem with the package that you need to fix.
Or it may mean that there is a problem with the build system itself.

Please see the build report for more details.

bioc-issue-bot · 2019-04-11T10:51:17Z

Received a valid push; starting a build. Commits are:

4b34ffc findBreakpointOverlaps() now returns a Hits object
17b4c8a Version bump

d-cameron · 2019-04-11T10:52:03Z

@lawremi that also had the advantage of the of the return type matching findOverlaps() which is a nicer design so I've refactored that code.

Attempting to do the obvious and concat the two hits objects fails with S4Vectors::c() fails with the error:

Error in validObject(ans) : 
  invalid class “SortedByQueryHits” object: 'queryHits(x)' must be sorted

Is this something I should raise an issue for? I would have expected either a zipper merger returning a SortedByQueryHits, or the return class to be demoted to Hits since the concatenated sequences are not in order.

Judging by performance in profvis, neither intersect() nor duplicated() have any optimisations for the sorted nature of the SortedByQueryHits returned by findOverlaps() (which is somewhat strange given that duplicated() requires the Hits to be queryHits sorted). intersect() so I just went with intersect as it's actually the operation that I'm attempting to perform.

bioc-issue-bot · 2019-04-11T11:16:11Z

Dear Package contributor,

This is the automated single package builder at bioconductor.org.

Your package has been built on Linux, Mac, and Windows.

On one or more platforms, the build results were: "WARNINGS, skipped, TIMEOUT, ERROR".
This may mean there is a problem with the package that you need to fix.
Or it may mean that there is a problem with the build system itself.

Please see the build report for more details.

bioc-issue-bot · 2019-04-11T12:26:13Z

Received a valid push; starting a build. Commits are:

3c217b1 Removed dplyr and stringr from NAMESPACE imports d...
a99ccfc Removing building warning

bioc-issue-bot · 2019-04-11T12:42:05Z

Dear Package contributor,

This is the automated single package builder at bioconductor.org.

Your package has been built on Linux, Mac, and Windows.

On one or more platforms, the build results were: "WARNINGS, skipped, ERROR".
This may mean there is a problem with the package that you need to fix.
Or it may mean that there is a problem with the build system itself.

Please see the build report for more details.

d-cameron · 2019-04-23T20:13:44Z

@hpages build now passing with your changes (hopefully) addressed.

lawremi · 2019-04-23T20:18:11Z

Extending GRanges a la VRanges (call it SVRanges?) might be a good idea.

d-cameron · 2019-04-23T20:20:42Z

@lawremi that would require adding hooks to everywhere that modifies a GRanges class so they also support the new class, correct? Wouldn't it be easier to modify GRanges itself to optionally track row pairings? Such as class would be useful in other contexts as well (e.g. InteractionSet).

lawremi · 2019-04-23T20:25:10Z

You can look at VRanges but basically it's not so bad, because GRanges is designed internally to be extensible. Many functions update the object without constructing a new GRanges instance.

lawremi · 2019-04-23T20:26:18Z

Actually maybe SVRanges should extend VRanges.

d-cameron · 2019-04-23T20:55:16Z

Actually maybe SVRanges should extend VRanges.

Unfortunately, most of the VRanges columns are not appropriate for arbitrary SVs so it's not a good fit. Even something as simple as equivalent of refCount and altCount are actually much more complicated for SVs (e.g. read pair support for smallish variants with either be wrong, or require a model to estimate the likelihood of a read pair coming from the ref or alt as the expected fragment size distributions overlap for small events).

Many functions update the object without constructing a new GRanges instance.

Thinking it over, extending from GRanges (or VRanges) is problematic as it violates Liskov Substitution Principle - the proposed SVRanges class would either restrict the permitted operation (thus making operations that were valid on a GRanges not valid on the derived class), or if you allow the operations, be effectively the same as the GRange implementation that I've implemented (raising errors when the GRanges violates the breakpoint pairing constraints).

hpages · 2019-04-23T21:56:29Z

@d-cameron

Yes that sounds like a reasonable path forward to me.

Some additional clarifications:

As I said, GRangesList was specifically designed for representing compound genome features, not for "exon lists" in particular. The "exon lists" use case just happened to be the first use case supported by these objects.
With the GRangesList approach, in the situation of a list element with A, B, and C, couldn't the encoding of which breakends correspond to which partners be achieved via a breakend-level metadata column?
Yes findOverlaps() ordinals are the list ordinals and you want the row ordinals. But isn't it something that findBreakpointOverlaps() could handle e.g. by returning an annotated Hits object? You could make the valid point that maybe this feature belongs to the findOverlaps() method for GRangesList objects. I guess the reason we don't have it yet is that nobody asked for it so far. In the meantime, nothing prevents you from implementing it in findBreakpointOverlaps().
Finally a note about the Liskov Substitution Principle: Unfortunately real-world objects sometimes violate this principle. For example, data.frame inherits from list but you cannot change the length of one list element without breaking the object:
```
df <- data.frame(aa=1:4, bb=letters[1:4])
df[[1]] <- 1:3  # works on a list, but not on a data.frame
# Error in `[[<-.data.frame`(`*tmp*`, 1, value = 1:3) : 
#   replacement has 3 rows, data has 4
```
We just have to live with that. There are many situations where it still makes sense to extend an existing class though, even if that means that the extended class supports a few less operations.

H.

lawremi · 2019-04-23T23:01:58Z

In many cases where the constraints would be broken by an exact endomorphism, one can just return the more general type. For example, intersect,GRanges,GRanges() always returns a GRanges instance, even if it receives a VRanges.

Principles like the LSP are good guides, but the ultimate test is whether the software is more useful or not. VRanges has proven empirically useful.

d-cameron · 2019-04-24T05:14:02Z

GRangesList was specifically designed for representing compound genome features
With the GRangesList approach, in the situation of a list element with A, B, and C, couldn't the encoding of which breakends correspond to which partners be achieved via a breakend-level metadata column?

Which is essentially a List version for the partner implementation, which would have the same drawbacks that I have now. If we restrict ourselves to breakpoints and single breakend there's no problem and GRangesList would work well. I don't currently have any plan to support A,B,C style variants as they're only used by a) multi-mapping aware callers that don't have widespread acceptance, or b) my variant caller but for the purpose of this package, they're treated as single breakend.

Yes that sounds like a reasonable path forward to me.

We now have proposals for the package to also accept SVRanges, GRangeList, and Pairs objects. I'm not sure which (if any) need to be implemented now. Are there any blockers/required review changes that are preventing the acceptance of the package in its current form?

Thanks to both of you for the feedback and discussions.

hpages · 2019-04-24T08:10:20Z

@d-cameron If, as you said earlier, most of the VRanges columns are not appropriate for arbitrary SVs, extending it might not be suitable for your use case.

I don't think anybody suggested that the package should also accept Pairs objects.

I see that the last check by the automated single package builder was successful on all platforms. I'll take another look at the package later today and will let you know.

BTW you closed this issue. Was it intentional?

d-cameron · 2019-04-25T06:16:18Z

BTW you closed this issue. Was it intentional?

It was not.

hpages · 2019-04-25T11:49:04Z

Hi @d-cameron

Thanks for the changes.

There must have been a misunderstanding about item 2. I was only suggesting that breakpointgr2bedpe(bedpe2breakpointgr(bedpe.file)) should work. This is a reasonable expectation for functions that have symmetric names, which suggests that they perform opposite operations. I was not suggesting that you get rid of bedpe2breakpointgr() and that you introduce the breakpointgr2pairs() / pairs2breakpointgr() combo. Exposing the intermediate Pairs representation has little value. Plus now you've lost the convenience of being able to go from bedpe.file to the breakpoint GRanges object with a single function call.

A few more things:

Looks like you're using tidyverse in the vignette so the package needs to be added to the Suggests field. Note that you only need to load tidyverse once in the vignette. Furthermore AFAICS you only seem to be using the ggplot2 and dplyr packages from the tidyverse. For the sake of minimizing package dependencies, it would be preferable to Suggest these 2 packages only instead of tidyverse. In the vignette it's better to load them right before you use them (i.e. right before generating the Precision-Recall and ROC curves).
Load the VariantAnnotation package before using it in the vignette, not after.
Use library() (or require()) consistently in the vignette, instead of a mix of both.
This code in the vignette:
```
breakpointgr <- breakpointRanges(vcf)
singleBreakendgr <- breakendRanges(vcf)
gr <- c(breakpointgr, singleBreakendgr)
```
concatenates breakpointgr with empty singleBreakendgr (both GRanges object). If that's intended, it deserves a comment. Ideally, an example where this concatenation is not a no-op would be more illustrative and would have more value. In particular it would show what happens to the partner metadata column and how the breakpointgr + singleBreakendgr blend is handled downstream.

"Once we know which calls match the truth set, it is relatively straight-forward to generate Precision-Recall and ROC curves for each caller."

ggplot(as.data.frame(svgr) %>%
  dplyr::select(QUAL, caller, truth_matches) %>%
  dplyr::group_by(caller, QUAL) %>%
  dplyr::summarise(
    calls=n(),
    tp=sum(truth_matches > 0)) %>%
  dplyr::group_by(caller) %>%
  dplyr::arrange(dplyr::desc(QUAL)) %>%
  dplyr::mutate(
    cum_tp=cumsum(tp),
    cum_n=cumsum(calls),
    cum_fp=cum_n - cum_tp,
    Precision=cum_tp / cum_n,
    Recall=cum_tp/length(truth_svgr))) +
  aes(x=Recall, y=Precision, colour=caller) +
  geom_point() +
  geom_line() +
  labs(title="NA12878 chr22 CREST and HYDRA\nSudmunt 2015 truth set")

That doesn't look straight-forward to me.

Please remove the parentheses in ggplot2() (ggplot2 is a package, not a function).

H.

d-cameron · 2019-04-25T14:19:47Z

There must have been a misunderstanding about item 2.
Exposing the intermediate Pairs representation has little value. Plus now you've lost the convenience of being able to go from bedpe.file to the breakpoint GRanges object with a single function call.

Replacing the direct bedpe parsing with Pairs makes BEDPE parsing consistent with VCF parsing. Read/write of VCFs is performed entirely by VariantAnnotation, and with the Pairs functions, read/write of BEDPEs is performed entirely by rtracklayer. I feel that this is a cleaner design as the package is now consistent with how IO is handled and I'm no longer providing a parallel implementation of functionality already provided by rtracklayer.

bioc-issue-bot · 2019-04-25T14:44:33Z

Received a valid push; starting a build. Commits are:

69343dd Round 2 of review feedback.

d-cameron · 2019-04-25T14:47:50Z

12, 13, 14, 16, 17

Updated

15

I've added a real world (somatic COLO829) call set to the vignette and included a section on how to handle mixed single breakend/breakpoint GRanges. An added bonus is that the circos plots actually have interesting events in them now.

d-cameron · 2019-04-25T16:49:17Z

@hpages round two of feedback has been addressed

bioc-issue-bot · 2019-04-25T17:17:28Z

Dear Package contributor,

This is the automated single package builder at bioconductor.org.

Your package has been built on Linux, Mac, and Windows.

Congratulations! The package built without errors or warnings
on all platforms.

Please see the build report for more details.

hpages · 2019-04-26T23:55:21Z

Thanks Daniel

bioc-issue-bot · 2019-04-26T23:55:40Z

Your package has been accepted. It will be added to the
Bioconductor Git repository and nightly builds. Additional
information will be posed to this issue in the next several
days.

Thank you for contributing to Bioconductor!

mtmorgan · 2019-04-27T17:26:29Z

The master branch of your GitHub repository has been added to Bioconductor's git repository.

To use the git.bioconductor.org repository, we need an 'ssh' key to associate with your github user name. If your GitHub account already has ssh public keys (https://github.com/d-cameron.keys is not empty), then no further steps are required. Otherwise, do the following:

See further instructions at

https://bioconductor.org/developers/how-to/git/

for working with this repository. See especially

https://bioconductor.org/developers/how-to/git/new-package-workflow/
https://bioconductor.org/developers/how-to/git/sync-existing-repositories/

to keep your GitHub and Bioconductor repositories in sync.

Your package will be included in the next nigthly 'devel' build (check-out from git at about 6 pm Eastern; build completion around 2pm Eastern the next day) at

https://bioconductor.org/checkResults/

(Builds sometimes fail, so ensure that the date stamps on the main landing page are consistent with the addition of your package). Once the package builds successfully, you package will be available for download in the 'Devel' version of Bioconductor using BiocManager::install("StructuralVariantAnnotation"). The package 'landing page' will be created at

https://bioconductor.org/packages/StructuralVariantAnnotation

If you have any questions, please contact the bioc-devel mailing list (https://stat.ethz.ch/mailman/listinfo/bioc-devel); this issue will not be monitored further.

bioc-issue-bot added the 1. awaiting moderation label Apr 5, 2019

bioc-issue-bot added 2. review in progress assign a reviewer and a more thorough review of package code and documentation taking place and removed 1. awaiting moderation labels Apr 5, 2019

bioc-issue-bot assigned hpages Apr 5, 2019

bioc-issue-bot added the ERROR label Apr 5, 2019

bioc-issue-bot added VERSION BUMP REQUIRED and removed VERSION BUMP REQUIRED labels Apr 11, 2019

bioc-issue-bot added the VERSION BUMP REQUIRED label Apr 11, 2019

bioc-issue-bot removed the VERSION BUMP REQUIRED label Apr 11, 2019

bioc-issue-bot added the WARNINGS label Apr 11, 2019

bioc-issue-bot added the TIMEOUT label Apr 11, 2019

bioc-issue-bot added OK and removed ERROR labels Apr 23, 2019

d-cameron closed this as completed Apr 24, 2019

d-cameron reopened this Apr 24, 2019

d-cameron mentioned this issue Apr 25, 2019

Packaged PapenfussLab/StructuralVariantAnnotation#7

Closed

bioc-issue-bot added the VERSION BUMP REQUIRED label Apr 25, 2019

bioc-issue-bot removed the VERSION BUMP REQUIRED label Apr 25, 2019

hpages added 3a. accepted will be ingested into Bioconductor daily builder for distribution and removed 2. review in progress assign a reviewer and a more thorough review of package code and documentation taking place labels Apr 26, 2019

mtmorgan closed this as completed Apr 27, 2019

StructuralVariantAnnotation #1079

StructuralVariantAnnotation #1079

Comments

d-cameron commented Apr 5, 2019

bioc-issue-bot commented Apr 5, 2019

bioc-issue-bot commented Apr 5, 2019

bioc-issue-bot commented Apr 5, 2019

d-cameron commented Apr 7, 2019

hpages commented Apr 9, 2019 • edited Loading

lawremi commented Apr 9, 2019

bioc-issue-bot commented Apr 11, 2019

bioc-issue-bot commented Apr 11, 2019

bioc-issue-bot commented Apr 11, 2019

bioc-issue-bot commented Apr 11, 2019

bioc-issue-bot commented Apr 11, 2019

bioc-issue-bot commented Apr 11, 2019

bioc-issue-bot commented Apr 11, 2019

bioc-issue-bot commented Apr 11, 2019

bioc-issue-bot commented Apr 11, 2019

d-cameron commented Apr 11, 2019

bioc-issue-bot commented Apr 11, 2019

bioc-issue-bot commented Apr 11, 2019

bioc-issue-bot commented Apr 11, 2019

d-cameron commented Apr 23, 2019

lawremi commented Apr 23, 2019

d-cameron commented Apr 23, 2019 • edited Loading

lawremi commented Apr 23, 2019

lawremi commented Apr 23, 2019

d-cameron commented Apr 23, 2019

hpages commented Apr 23, 2019

lawremi commented Apr 23, 2019

d-cameron commented Apr 24, 2019 • edited Loading

hpages commented Apr 24, 2019

d-cameron commented Apr 25, 2019

hpages commented Apr 25, 2019

d-cameron commented Apr 25, 2019

bioc-issue-bot commented Apr 25, 2019

d-cameron commented Apr 25, 2019

d-cameron commented Apr 25, 2019

bioc-issue-bot commented Apr 25, 2019

hpages commented Apr 26, 2019

bioc-issue-bot commented Apr 26, 2019

mtmorgan commented Apr 27, 2019

hpages commented Apr 9, 2019 •

edited

Loading

d-cameron commented Apr 23, 2019 •

edited

Loading

d-cameron commented Apr 24, 2019 •

edited

Loading