Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

StructuralVariantAnnotation #1079

Closed
d-cameron opened this issue Apr 5, 2019 · 84 comments
Closed

StructuralVariantAnnotation #1079

d-cameron opened this issue Apr 5, 2019 · 84 comments
Assignees
Labels
3a. accepted will be ingested into Bioconductor daily builder for distribution OK

Comments

@d-cameron
Copy link

Update the following URL to point to the GitHub repository of
the package you wish to submit to Bioconductor

Confirm the following by editing each check box to '[x]'

  • [ X ] I understand that by submitting my package to Bioconductor,
    the package source and all review commentary are visible to the
    general public.

  • [ X ] I have read the Bioconductor Package Submission
    instructions. My package is consistent with the Bioconductor
    Package Guidelines.

  • [X ] I understand that a minimum requirement for package acceptance
    is to pass R CMD check and R CMD BiocCheck with no ERROR or WARNINGS.
    Passing these checks does not result in automatic acceptance. The
    package will then undergo a formal review and recommendations for
    acceptance regarding other Bioconductor standards will be addressed.

  • [ X ] My package addresses statistical or bioinformatic issues related
    to the analysis and comprehension of high throughput genomic data.

  • [ X ] I am committed to the long-term maintenance of my package. This
    includes monitoring the support site for issues that users may
    have, subscribing to the bioc-devel mailing list to stay aware
    of developments in the Bioconductor community, responding promptly
    to requests for updates from the Core team in response to changes in
    R or underlying software.

I am familiar with the essential aspects of Bioconductor software
management, including:

  • [ X ] The 'devel' branch for new packages and features.
  • [ X ] The stable 'release' branch, made available every six
    months, for bug fixes.
  • [ X ] Bioconductor version control using Git
    (optionally via GitHub).

For help with submitting your package, please subscribe and post questions
to the bioc-devel mailing list.

@bioc-issue-bot
Copy link
Collaborator

Hi @d-cameron

Thanks for submitting your package. We are taking a quick
look at it and you will hear back from us soon.

The DESCRIPTION file for this package is:

Package: StructuralVariantAnnotation
Type: Package
Title: Variant annotations for structural variants
Version: 0.99.0
Date: 2019-04-05
Authors@R: c(
	person("Daniel", "Cameron", email="daniel.l.cameron@gmail.com", role=c("aut", "cre"), comment=c(ORCID = "0000-0002-0951-7116")),
	person("Ruining", "Dong", email="dong.rn@wehi.edu.au", role=c("aut"), comment=c(ORCID = "0000-0003-1433-0484")))
Description: StructuralVariantAnnotation contains useful helper
	functions for dealing with structural variants in VCF format.
	The packages contains functions for parsing VCFs from a number
	of popular callers as well as functions for dealing with 
	breakpoints involving two separate genomic loci encoded as
	GRanges objects.
License: GPL-3
Depends:
	GenomicRanges,
	rtracklayer,
  VariantAnnotation,
  BiocGenerics
Imports:
    assertthat,
    Biostrings,
    stringr,
    dplyr
Suggests:
    BSgenome.Hsapiens.UCSC.hg19,
    devtools,
    testthat,
    roxygen2,
    covr,
    knitr,
    plyranges,
    dplyr,
    ggbio,
    biovizBase,
    circlize
RoxygenNote: 6.1.1
VignetteBuilder: knitr
biocViews: DataImport, Sequencing, Annotation, Genetics, VariantAnnotation

Add SSH keys to your GitHub account. SSH keys
will are used to control access to accepted Bioconductor
packages. See these instructions to add SSH keys to
your GitHub account.

@bioc-issue-bot
Copy link
Collaborator

A reviewer has been assigned to your package. Learn what to expect
during the review process.

IMPORTANT: Please read the instructions for setting
up a push hook on your repository, or further changes to your
repository will NOT trigger a new build.

@bioc-issue-bot bioc-issue-bot added 2. review in progress assign a reviewer and a more thorough review of package code and documentation taking place and removed 1. awaiting moderation labels Apr 5, 2019
@bioc-issue-bot
Copy link
Collaborator

Dear Package contributor,

This is the automated single package builder at bioconductor.org.

Your package has been built on Linux, Mac, and Windows.

On one or more platforms, the build results were: "skipped, ERROR".
This may mean there is a problem with the package that you need to fix.
Or it may mean that there is a problem with the build system itself.

Please see the build report for more details.

@d-cameron
Copy link
Author

Hmm. What's the best way to proceed when my package builds without error/warning when build against release dependencies, but is failing in devel due to dependencies? In this case, a test/example VCF parses in release, but I get a invalid class "VCFHeader" object: 'info(VCFHeader)' must be a 3 column DataFrame with names Number, Type, Description when I call VariantAnnotation::readVcf() in the vignette.

@hpages
Copy link
Contributor

hpages commented Apr 9, 2019

Hi @d-cameron ,

Thanks for this submission.

Please focus on BioC devel. There have been major changes to Rsamtools and VariantAnnotation between BioC 3.8 (release) and 3.9 (devel) that could explain these differences. Also let's ignore Windows for now: there is a known Windows-specific issue with the latest Rsamtools and VariantAnnotation in devel that is still under investigation.

Best,
H.

@lawremi
Copy link

lawremi commented Apr 9, 2019

The overlap functionality could greatly benefit from using features of the Hits objects, instead of reducing to data.frames and resorting to dplyr. For example, in findBreakpointOverlaps(), comments mention that duplicated(hits) takes too much memory. Is that just because hits is a data.frame? The duplicated,Hits() method is very efficient. I think intersect(x_hits, partner_hits) would be a better approach.

@bioc-issue-bot
Copy link
Collaborator

Received a valid push; starting a build. Commits are:

b67d361 version bump

@bioc-issue-bot
Copy link
Collaborator

Dear Package contributor,

This is the automated single package builder at bioconductor.org.

Your package has been built on Linux, Mac, and Windows.

On one or more platforms, the build results were: "skipped, ERROR".
This may mean there is a problem with the package that you need to fix.
Or it may mean that there is a problem with the build system itself.

Please see the build report for more details.

@bioc-issue-bot
Copy link
Collaborator

Received a valid push; starting a build. Commits are:

269c5ce remove test file bugs
d813bba version bump Merge branch 'master' of https://git...

@bioc-issue-bot
Copy link
Collaborator

Received a valid push; starting a build. Commits are:

9494b6b version bump 0.99.3

@bioc-issue-bot
Copy link
Collaborator

Dear Package contributor,

This is the automated single package builder at bioconductor.org.

Your package has been built on Linux, Mac, and Windows.

On one or more platforms, the build results were: "skipped, ERROR".
This may mean there is a problem with the package that you need to fix.
Or it may mean that there is a problem with the build system itself.

Please see the build report for more details.

@bioc-issue-bot
Copy link
Collaborator

Dear Package contributor,

This is the automated single package builder at bioconductor.org.

Your package has been built on Linux, Mac, and Windows.

On one or more platforms, the build results were: "skipped, ERROR".
This may mean there is a problem with the package that you need to fix.
Or it may mean that there is a problem with the build system itself.

Please see the build report for more details.

@bioc-issue-bot
Copy link
Collaborator

Received a valid push; starting a build. Commits are:

7420bba update examples
d5e4e0c update examples and version bump
79b93c6 update package documentation Merge branch 'master...

@bioc-issue-bot
Copy link
Collaborator

Dear Package contributor,

This is the automated single package builder at bioconductor.org.

Your package has been built on Linux, Mac, and Windows.

On one or more platforms, the build results were: "WARNINGS, skipped, ERROR".
This may mean there is a problem with the package that you need to fix.
Or it may mean that there is a problem with the build system itself.

Please see the build report for more details.

@bioc-issue-bot
Copy link
Collaborator

Received a valid push; starting a build. Commits are:

4b34ffc findBreakpointOverlaps() now returns a Hits object
17b4c8a Version bump

@d-cameron
Copy link
Author

@lawremi that also had the advantage of the of the return type matching findOverlaps() which is a nicer design so I've refactored that code.

Attempting to do the obvious and concat the two hits objects fails with S4Vectors::c() fails with the error:

Error in validObject(ans) : 
  invalid class “SortedByQueryHits” object: 'queryHits(x)' must be sorted

Is this something I should raise an issue for? I would have expected either a zipper merger returning a SortedByQueryHits, or the return class to be demoted to Hits since the concatenated sequences are not in order.

Judging by performance in profvis, neither intersect() nor duplicated() have any optimisations for the sorted nature of the SortedByQueryHits returned by findOverlaps() (which is somewhat strange given that duplicated() requires the Hits to be queryHits sorted). intersect() so I just went with intersect as it's actually the operation that I'm attempting to perform.

@bioc-issue-bot
Copy link
Collaborator

Dear Package contributor,

This is the automated single package builder at bioconductor.org.

Your package has been built on Linux, Mac, and Windows.

On one or more platforms, the build results were: "WARNINGS, skipped, TIMEOUT, ERROR".
This may mean there is a problem with the package that you need to fix.
Or it may mean that there is a problem with the build system itself.

Please see the build report for more details.

@bioc-issue-bot
Copy link
Collaborator

Received a valid push; starting a build. Commits are:

3c217b1 Removed dplyr and stringr from NAMESPACE imports d...
a99ccfc Removing building warning

@bioc-issue-bot
Copy link
Collaborator

Dear Package contributor,

This is the automated single package builder at bioconductor.org.

Your package has been built on Linux, Mac, and Windows.

On one or more platforms, the build results were: "WARNINGS, skipped, ERROR".
This may mean there is a problem with the package that you need to fix.
Or it may mean that there is a problem with the build system itself.

Please see the build report for more details.

@bioc-issue-bot bioc-issue-bot added OK and removed ERROR labels Apr 23, 2019
@d-cameron
Copy link
Author

@hpages build now passing with your changes (hopefully) addressed.

@lawremi
Copy link

lawremi commented Apr 23, 2019

Extending GRanges a la VRanges (call it SVRanges?) might be a good idea.

@d-cameron
Copy link
Author

d-cameron commented Apr 23, 2019

@lawremi that would require adding hooks to everywhere that modifies a GRanges class so they also support the new class, correct? Wouldn't it be easier to modify GRanges itself to optionally track row pairings? Such as class would be useful in other contexts as well (e.g. InteractionSet).

@lawremi
Copy link

lawremi commented Apr 23, 2019

You can look at VRanges but basically it's not so bad, because GRanges is designed internally to be extensible. Many functions update the object without constructing a new GRanges instance.

@lawremi
Copy link

lawremi commented Apr 23, 2019

Actually maybe SVRanges should extend VRanges.

@d-cameron
Copy link
Author

Actually maybe SVRanges should extend VRanges.

Unfortunately, most of the VRanges columns are not appropriate for arbitrary SVs so it's not a good fit. Even something as simple as equivalent of refCount and altCount are actually much more complicated for SVs (e.g. read pair support for smallish variants with either be wrong, or require a model to estimate the likelihood of a read pair coming from the ref or alt as the expected fragment size distributions overlap for small events).

Many functions update the object without constructing a new GRanges instance.

Thinking it over, extending from GRanges (or VRanges) is problematic as it violates Liskov Substitution Principle - the proposed SVRanges class would either restrict the permitted operation (thus making operations that were valid on a GRanges not valid on the derived class), or if you allow the operations, be effectively the same as the GRange implementation that I've implemented (raising errors when the GRanges violates the breakpoint pairing constraints).

@hpages
Copy link
Contributor

hpages commented Apr 23, 2019

@d-cameron

Yes that sounds like a reasonable path forward to me.

Some additional clarifications:

  • As I said, GRangesList was specifically designed for representing compound genome features, not for "exon lists" in particular. The "exon lists" use case just happened to be the first use case supported by these objects.

  • With the GRangesList approach, in the situation of a list element with A, B, and C, couldn't the encoding of which breakends correspond to which partners be achieved via a breakend-level metadata column?

  • Yes findOverlaps() ordinals are the list ordinals and you want the row ordinals. But isn't it something that findBreakpointOverlaps() could handle e.g. by returning an annotated Hits object? You could make the valid point that maybe this feature belongs to the findOverlaps() method for GRangesList objects. I guess the reason we don't have it yet is that nobody asked for it so far. In the meantime, nothing prevents you from implementing it in findBreakpointOverlaps().

  • Finally a note about the Liskov Substitution Principle: Unfortunately real-world objects sometimes violate this principle. For example, data.frame inherits from list but you cannot change the length of one list element without breaking the object:

    df <- data.frame(aa=1:4, bb=letters[1:4])
    df[[1]] <- 1:3  # works on a list, but not on a data.frame
    # Error in `[[<-.data.frame`(`*tmp*`, 1, value = 1:3) : 
    #   replacement has 3 rows, data has 4
    

    We just have to live with that. There are many situations where it still makes sense to extend an existing class though, even if that means that the extended class supports a few less operations.

H.

@lawremi
Copy link

lawremi commented Apr 23, 2019

In many cases where the constraints would be broken by an exact endomorphism, one can just return the more general type. For example, intersect,GRanges,GRanges() always returns a GRanges instance, even if it receives a VRanges.

Principles like the LSP are good guides, but the ultimate test is whether the software is more useful or not. VRanges has proven empirically useful.

@d-cameron
Copy link
Author

d-cameron commented Apr 24, 2019

GRangesList was specifically designed for representing compound genome features
With the GRangesList approach, in the situation of a list element with A, B, and C, couldn't the encoding of which breakends correspond to which partners be achieved via a breakend-level metadata column?

Which is essentially a List version for the partner implementation, which would have the same drawbacks that I have now. If we restrict ourselves to breakpoints and single breakend there's no problem and GRangesList would work well. I don't currently have any plan to support A,B,C style variants as they're only used by a) multi-mapping aware callers that don't have widespread acceptance, or b) my variant caller but for the purpose of this package, they're treated as single breakend.

Yes that sounds like a reasonable path forward to me.

We now have proposals for the package to also accept SVRanges, GRangeList, and Pairs objects. I'm not sure which (if any) need to be implemented now. Are there any blockers/required review changes that are preventing the acceptance of the package in its current form?

Thanks to both of you for the feedback and discussions.

@hpages
Copy link
Contributor

hpages commented Apr 24, 2019

@d-cameron If, as you said earlier, most of the VRanges columns are not appropriate for arbitrary SVs, extending it might not be suitable for your use case.

I don't think anybody suggested that the package should also accept Pairs objects.

I see that the last check by the automated single package builder was successful on all platforms. I'll take another look at the package later today and will let you know.

BTW you closed this issue. Was it intentional?

@d-cameron d-cameron reopened this Apr 24, 2019
@d-cameron
Copy link
Author

BTW you closed this issue. Was it intentional?

It was not.

@hpages
Copy link
Contributor

hpages commented Apr 25, 2019

Hi @d-cameron

Thanks for the changes.

There must have been a misunderstanding about item 2. I was only suggesting that breakpointgr2bedpe(bedpe2breakpointgr(bedpe.file)) should work. This is a reasonable expectation for functions that have symmetric names, which suggests that they perform opposite operations. I was not suggesting that you get rid of bedpe2breakpointgr() and that you introduce the breakpointgr2pairs() / pairs2breakpointgr() combo. Exposing the intermediate Pairs representation has little value. Plus now you've lost the convenience of being able to go from bedpe.file to the breakpoint GRanges object with a single function call.

A few more things:

  1. Looks like you're using tidyverse in the vignette so the package needs to be added to the Suggests field. Note that you only need to load tidyverse once in the vignette. Furthermore AFAICS you only seem to be using the ggplot2 and dplyr packages from the tidyverse. For the sake of minimizing package dependencies, it would be preferable to Suggest these 2 packages only instead of tidyverse. In the vignette it's better to load them right before you use them (i.e. right before generating the Precision-Recall and ROC curves).

  2. Load the VariantAnnotation package before using it in the vignette, not after.

  3. Use library() (or require()) consistently in the vignette, instead of a mix of both.

  4. This code in the vignette:

    breakpointgr <- breakpointRanges(vcf)
    singleBreakendgr <- breakendRanges(vcf)
    gr <- c(breakpointgr, singleBreakendgr)
    

    concatenates breakpointgr with empty singleBreakendgr (both GRanges object). If that's intended, it deserves a comment. Ideally, an example where this concatenation is not a no-op would be more illustrative and would have more value. In particular it would show what happens to the partner metadata column and how the breakpointgr + singleBreakendgr blend is handled downstream.

  5. "Once we know which calls match the truth set, it is relatively straight-forward to generate Precision-Recall and ROC curves for each caller."

    ggplot(as.data.frame(svgr) %>%
      dplyr::select(QUAL, caller, truth_matches) %>%
      dplyr::group_by(caller, QUAL) %>%
      dplyr::summarise(
        calls=n(),
        tp=sum(truth_matches > 0)) %>%
      dplyr::group_by(caller) %>%
      dplyr::arrange(dplyr::desc(QUAL)) %>%
      dplyr::mutate(
        cum_tp=cumsum(tp),
        cum_n=cumsum(calls),
        cum_fp=cum_n - cum_tp,
        Precision=cum_tp / cum_n,
        Recall=cum_tp/length(truth_svgr))) +
      aes(x=Recall, y=Precision, colour=caller) +
      geom_point() +
      geom_line() +
      labs(title="NA12878 chr22 CREST and HYDRA\nSudmunt 2015 truth set")
    

    That doesn't look straight-forward to me.

  6. Please remove the parentheses in ggplot2() (ggplot2 is a package, not a function).

H.

@d-cameron
Copy link
Author

There must have been a misunderstanding about item 2.
Exposing the intermediate Pairs representation has little value. Plus now you've lost the convenience of being able to go from bedpe.file to the breakpoint GRanges object with a single function call.

Replacing the direct bedpe parsing with Pairs makes BEDPE parsing consistent with VCF parsing. Read/write of VCFs is performed entirely by VariantAnnotation, and with the Pairs functions, read/write of BEDPEs is performed entirely by rtracklayer. I feel that this is a cleaner design as the package is now consistent with how IO is handled and I'm no longer providing a parallel implementation of functionality already provided by rtracklayer.

@bioc-issue-bot
Copy link
Collaborator

Received a valid push; starting a build. Commits are:

69343dd Round 2 of review feedback.

@d-cameron
Copy link
Author

12, 13, 14, 16, 17

Updated

15

I've added a real world (somatic COLO829) call set to the vignette and included a section on how to handle mixed single breakend/breakpoint GRanges. An added bonus is that the circos plots actually have interesting events in them now.

@d-cameron
Copy link
Author

@hpages round two of feedback has been addressed

@bioc-issue-bot
Copy link
Collaborator

Dear Package contributor,

This is the automated single package builder at bioconductor.org.

Your package has been built on Linux, Mac, and Windows.

Congratulations! The package built without errors or warnings
on all platforms.

Please see the build report for more details.

@hpages
Copy link
Contributor

hpages commented Apr 26, 2019

Thanks Daniel

@hpages hpages added 3a. accepted will be ingested into Bioconductor daily builder for distribution and removed 2. review in progress assign a reviewer and a more thorough review of package code and documentation taking place labels Apr 26, 2019
@bioc-issue-bot
Copy link
Collaborator

Your package has been accepted. It will be added to the
Bioconductor Git repository and nightly builds. Additional
information will be posed to this issue in the next several
days.

Thank you for contributing to Bioconductor!

@mtmorgan
Copy link
Contributor

The master branch of your GitHub repository has been added to Bioconductor's git repository.

To use the git.bioconductor.org repository, we need an 'ssh' key to associate with your github user name. If your GitHub account already has ssh public keys (https://github.com/d-cameron.keys is not empty), then no further steps are required. Otherwise, do the following:

  1. Add an SSH key to your github account
  2. Submit your SSH key to Bioconductor

See further instructions at

https://bioconductor.org/developers/how-to/git/

for working with this repository. See especially

https://bioconductor.org/developers/how-to/git/new-package-workflow/
https://bioconductor.org/developers/how-to/git/sync-existing-repositories/

to keep your GitHub and Bioconductor repositories in sync.

Your package will be included in the next nigthly 'devel' build (check-out from git at about 6 pm Eastern; build completion around 2pm Eastern the next day) at

https://bioconductor.org/checkResults/

(Builds sometimes fail, so ensure that the date stamps on the main landing page are consistent with the addition of your package). Once the package builds successfully, you package will be available for download in the 'Devel' version of Bioconductor using BiocManager::install("StructuralVariantAnnotation"). The package 'landing page' will be created at

https://bioconductor.org/packages/StructuralVariantAnnotation

If you have any questions, please contact the bioc-devel mailing list (https://stat.ethz.ch/mailman/listinfo/bioc-devel); this issue will not be monitored further.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
3a. accepted will be ingested into Bioconductor daily builder for distribution OK
Projects
None yet
Development

No branches or pull requests

6 participants