Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

VIRUSBreakend VCF detail. #489

Closed
byeongill opened this issue Apr 16, 2021 · 1 comment
Closed

VIRUSBreakend VCF detail. #489

byeongill opened this issue Apr 16, 2021 · 1 comment

Comments

@byeongill
Copy link

Hi,
I am using VIRUSBreakend for detecting viral integration in human WGS(30X) samples.

I used the default parameters in the virusbreakend.sh script.

I got the output file in .vcf format and converted to excel for examples.(provided in the preview).

I have the following question regarding the output file:

1: What is the critera of FILTER? I found that the values are enough high in the QUAL(684.91, 357.52, 276.53), but they regarded as LOW-QUAL in the FILTER column. And, is it possible to change the filtering criteria and make it more strict?

2: I need to the supporting viral read count and human read count at breakpoint. I wonder RP,SR,REF,RF,VF FORAMT(or INFO) vaues are corresponding (representing) to these breakpoint values?

image

@d-cameron
Copy link
Member

d-cameron commented Apr 21, 2021

Apart from BEALN and

The FILTER, FORMAT, and INFO fields are the outputs from running GRIDSS on the viral genome. VIRUSBreakend identifies viral single breakends that go to non-viral sequence then identifies the integration locations in the host genome. The INFO and FORMAT annotations are single breakend viral annotations.

1: What is the critera of FILTER? I found that the values are enough high in the QUAL(684.91, 357.52, 276.53), but they regarded as LOW-QUAL in the FILTER column.

They are just the default GRIDSS FILTERs. GRIDSS has a higher threshold for single breakends than breakpoints hence the LOW_QUAL filter being applied to these integration sites.

And, is it possible to change the filtering criteria and make it more strict?

In the Hartwig cohort, I've found that if an integration site was detected at all (regardless of FILTER), then it's highly likely to be a true positive. Is this not the case for your data?

I wonder RP,SR,REF,RF,VF FORAMT(or INFO) vaues are corresponding (representing) to these breakpoint values?

No, they correspond to the support for the single breakend from a viral genome perspective. This means that RP and SR are always zero, REF/REFPAIR are viral reference supporting read counts.

2: I need to the supporting viral read count and human read count at breakpoint.

What do you mean by "supporting viral read count" and "human read count"?

To prevent doubling-counting of breakpoint-supporting read pairs that also have one read with a split read alignment, GRIDSS uses a supporting fragment approach. For a breakpoint, you have the following supporting fragment counts:

  • fragments aligned to the human reference overlapping the breakpoint position that support the reference allele (VIRUSBreakend does not report this)
  • fragments aligned to the viral reference overlapping the breakpoint position that support the reference allele (VIRUSBreakend reports this as: REF+REFPAIR)
  • fragments partially aligned to both human and viral sequence that support the breakpoint (VIRUSBreakend reports this as: BVF)

When the location of the viral integration is ambiguous (e.g. integration into a centromere), then it's not really possible to determine the number of reads/fragments supporting the reference allele on the host side.

The simplest way to get the human ref support is to run gridss.AnnotateReferenceCoverage but that's not going to work with a VIRUSBreakend VCF since the viral reference contig doesn't exist in the host-aligned input bam so it'll throw an error. Your options are:

  • hacking the VIRUSBreakend to make the VCF
    • e.g. filter to breakends with host coordinates, change ALT to N[[chr1:1[[, run gridss.AnnotateReferenceCoverage, merge just REF/PAIR into the original VIRUSBreakend VCF INFO fields.
    • host aligned bam can be subset to only the regions around the integration sites to make it run faster (it does a linear pass on the entire bam)
  • Re-calling with GRIDSS against a host+virus reference

It'll be much faster to only process the relevant subset of reads. The relevant subset is the union of the reads extracted by VIRUSBreakend (already in the virusbreakend working directory), and the reads surrounding the breakpoint. The latter can be extracted by gridss_Extract_overlapping_fragments.sh. You'll need to make sure you don't include the same reads twice and mess up your counts.

  • Run samtools view host_aligned.bam $region_of_interest (where $region_of_interest is +- 2kbp around all the host integration sites)and append to the virusbreakend readnames file
    • Alternatively: subset the VIRUSBreakend to the host breakend records, run gridss_extract_overlapping_fragments.sh on it, and get the read names from that subset bam file
  • run gridsstools extractFragmentsToFastq on larger readnames file
  • create a reference genome containing the host + viral genomes
  • realign the relevant reads to the host+viral reference with bwa; sort
  • run GRIDSS
  • Filter to host <-> viral breakpoints

Note that this only works for viral integration where the integration sites occurs in mappable sequence. Integration sites in unmappable sequence (such as centromeres, telomeres, and some repetative or low complexy sequences) will not be called by GRIDSS.

d-cameron pushed a commit that referenced this issue May 17, 2021
REF+REFPAIR works except for the specific edge case of a concordant read pair that has an internal split read.
>>>>    >>>>------<<<<<
primary sup       primary (primaries are concordant)
         ^
         |
      breakpoint position
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants