[Feature Request] [Virusbreakend] Support for BAMs aligned to reference genomes containing viral decoy sequences #484

Closed
selkamand opened this issue Apr 6, 2021 · 4 comments

@selkamand

Thanks for VirusBreakend, it's a really nice tool!

tl;dr: Is it possible to make virusbreakend work on BAMs aligned to reference genomes containing decoy sequences, such that the output is identical to what would be obtained if those decoys were not included?

The human reference genomes used in many of our pipelines include viral decoy sequences. One common example of this is hs37d5 (1000genomes), which includes an EBV (NC_007605) decoy sequence that seems to interfere with virusbreakend's ability to correctly identify EBV positive samples (presumably because the viral reads end up mapped?).

In my case, it would be nice to treat reads that map to the sequences NC_007605 or hs37d5 as potentially viral (and thus include them in the kraken run). Further, when it comes to breakpoint calling, the breakpoints of interest are those that occur in the non-decoy sequences.

Would it be possible to add an option that makes virusbreakend aware of decoy sequences? This would allow end-users to easily integrate virusbreakend into existing workflows irrespective of the version of the human reference genome they use.

Thanks again for the tool!

Kind regards,
Sam

@d-cameron d-cameron self-assigned this Apr 9, 2021
@d-cameron
Member

d-cameron commented Apr 9, 2021

Implementation notes:

  • add --viralreferences command line parameter to virusbreakend.sh and gridsstools unmappedSequencesToFastq.
  • file should contain reference contig names (one per line)
    • abort if contig not found in reference
    • test that driver script actually recognises child script abort
  • treat reads mapped to these contigs as unmapped
    • don't need to handle SA tags for split read alignment. The unmapping will do the trick
    • (verify that we actually do process split read alignment & don't just abort when we see a SA tag). Just make unmapping force output
  • add hs37d5 example to readme
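
The notes above can be sketched roughly as follows. This is only an illustration in Python, not the actual virusbreakend.sh or gridsstools code: the function names `load_viral_contigs` and `treat_as_unmapped` are hypothetical, and the contig file is assumed to be plain text with one contig name per line, as described in the notes.

```python
def load_viral_contigs(lines, reference_contigs):
    """Parse one contig name per line; abort if a name is not in the reference.

    Hypothetical helper: `lines` is any iterable of strings (e.g. an open file),
    `reference_contigs` is the collection of contig names in the reference.
    """
    names = [line.strip() for line in lines if line.strip()]
    missing = [name for name in names if name not in reference_contigs]
    if missing:
        # Raising here corresponds to "abort if contig not found in reference".
        raise ValueError("contig(s) not found in reference: " + ", ".join(missing))
    return set(names)


def treat_as_unmapped(read_contig, viral_contigs):
    """A read is handled as unmapped if it has no alignment (None)
    or if it is aligned to one of the listed viral contigs."""
    return read_contig is None or read_contig in viral_contigs


# Example with hs37d5-style contig names (illustrative subset only).
reference = ["1", "2", "X", "NC_007605", "hs37d5"]
viral = load_viral_contigs(["NC_007605\n", "hs37d5\n"], reference)
```

With this sketch, a read aligned to NC_007605 would be passed on as if it were unmapped, while a read aligned to chromosome 1 would not.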

@d-cameron
Member

@selkamand Are you aware of any other common reference genomes that contain viral decoys?

@selkamand
Author

> @selkamand Are you aware of any other common reference genomes that contain viral decoys?

The GRCh38 human reference also has an EBV contig (chrEBV)

A bunch of the different versions include this EBV contig. The ref we use in our GRCh38 pipelines is based on:
ftp://ftp-trace.ncbi.nih.gov/1000genomes/ftp/technical/reference/GRCh38_reference_genome/GRCh38_full_analysis_set_plus_decoy_hla.fa

I don't know of any non-human examples, I'm afraid.

@d-cameron
Member

d-cameron commented May 17, 2021

Handling split read alignments to viral reference contigs was messier than I expected.

VIRUSBreakend now defaults to excluding "chrEBV" (for hg38 support) and any sequence in the viral database (hs37d5 support).
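
As a rough illustration of that default (not the actual VIRUSBreakend implementation), the excluded contigs could be computed like this. Matching by contig name is an assumption made for the sketch; the comment above does not say how sequences in the viral database are matched. The function name `default_viral_contigs` is hypothetical.

```python
# "chrEBV" is taken from the comment above (hg38 support).
DEFAULT_EXCLUDED_NAMES = {"chrEBV"}


def default_viral_contigs(reference_contigs, viral_db_names):
    """Return the reference contigs that would be treated as viral by default:
    chrEBV plus any contig whose name also appears in the viral database.

    Assumption: matching is done on contig names only.
    """
    candidates = DEFAULT_EXCLUDED_NAMES | set(viral_db_names)
    return {contig for contig in reference_contigs if contig in candidates}


# Illustrative inputs: a few contig names and a tiny stand-in viral database.
hs37d5_like = ["1", "2", "NC_007605", "hs37d5"]
hg38_like = ["chr1", "chr2", "chrEBV"]
viral_db = ["NC_007605", "NC_001526"]

excluded_hs37d5 = default_viral_contigs(hs37d5_like, viral_db)
excluded_hg38 = default_viral_contigs(hg38_like, viral_db)
```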
