Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Unlabeled synthetic HBV sequences in Genbank can impact VIRUSBreakend output #508

Closed
toddajohnson opened this issue Jun 29, 2021 · 8 comments

Comments

@toddajohnson
Copy link

I followed the instructions in virusbreakend-build to produce the virusbreakenddb and have been examining VBE (GRIDSS 2.12.0) output for 11 ICGC PCAWG HCC samples, four of which have known HBV integrations, with previous short-read results and a recent long-read sequencing based analysis. For one sample in particular (HCC RK147), about half of integration sites were not being reported by VBE. Of course, some of those could be due to cutoffs for frequency/fragment support, but when I examined the gridss.assembly.bam and viral.bam files in IGV, I noticed that regions on the ends consisted mostly of reads with MAPQ 0. Since the read extraction and breakend assembly process removes reads with low mapping quality (GridssConfiguration minMapq=20 in the log), I suspect that causes overlapping genome integration sites to not be assembled and called by VBE.

Here is the IGV figure:

RK147_adjusted_AB819617 1_igv_snapshot

Both of the HBV entries in Genbank that were chosen by VBE as the best viral sequence (AB819617.1 and AB206816.2) for the Japanese HCC samples were uploaded by Japanese researchers, but they are not patient isolates or curated viral genome sequences, but rather, they were genetically engineered. AB819617.1 has two versions of the HBV X gene (one on each end) and AB206816.2 is described as being a "1.3 x complete genome" in the Genbank entry (designed to better infect mouse models).

Unless this is already included in the recent mods made for issue #502, it seems that the database build process may need additional filters. I am currently downloading the pre-built virusbreakenddb to check the results.

@toddajohnson
Copy link
Author

I can see that the GRIDSS-2.12.0 virusbreakend-build script sets a filter when creating the neighbour.fa file to remove synthetic sequences with "NOT gbdiv syn[prop]".

I searched the two sequence entries above for selectable keywords and tried "genetically engineered[Text words]" and "constructed[Text words]". Ten other entries from the same authors/reports were retrieved, but beside those, no other entries among the 30 or so that were retrieved appeared to be synthetic. To specifically filter out those ten HBV synthetic genomes, I am now re-creating the neighbour.fa file with:
esearch -db nuccore -query "Viruses[Organism] NOT cellular organisms[ORGN] NOT wgs[PROP] NOT AC_000001:AC_999999[pacc] NOT gbdiv syn[prop] AND (srcdb_refseq[PROP] OR nuccore genome samespecies[Filter]) NOT (AB206816:AB206817[ACCN] OR AB819611:AB819618[ACCN])" \ | efetch -format fasta > $dbname/neighbour.fa

@toddajohnson toddajohnson changed the title Genetically engineered viruses in Genbank and virusbreakenddb impact VIRUSBreakend output Unlabeled synthetic HBV sequences in Genbank can impact VIRUSBreakend output Jun 30, 2021
@d-cameron
Copy link
Member

Thanks for the investigation. Viral repeats are indeed problematic for VBE.
Did you find filtering those genomes ended up resolving the issue for the ICGC samples?

@toddajohnson
Copy link
Author

Removing the synthetic genomes helped a little, but even though the next best HBV genome (EU919163.1) was about the correct genome length (3197 bp vs 3215 bp in GenBank HBV genomes), it also turned out to have complications. Examining the IGV output, there are clusters of reads that have MAPQ 0 between about 1700-2300 bp.

RK147_MAPQ20_adjusted_EU919163 1_igv_snapshot

Those could be regions of the genome that are normally duplicated, but actually reading though the GenBank entry, it lists under CDS section 1374..2051 /note="includes in-frame duplication" and then 2027..2434 /note="includes in-frame deletion". Using LALIGN from EMBOSS, those are duplicated/deleted compared to reference HBV genomes. Also, based on the kmercounts table, I did pairwise alignment with the next best reference sequence (KP017271.1) and it did not have those CNAs. I wonder if the number of kmers may be getting over-counted for the viral genome sequence with duplicated regions?

I have played with some VBE configuration properties to see if I can recover more of the previous reported insertions (despite duplications/deletions), and by setting minMapq=0 and some of the assembly and variant calling writeFiltered flags to true, most of the insertion breakends with support from multiple reads seems to be recovered.

Here is my comparison of filtering/configuration changes; about twice as many insertions in the final output.

RK147_VBE_modification_results

There looks like one obvious insertion with HBV breakend around 2,000bp and chr8:67.633Mb that is still not getting called; Around that breakend, there are three insertions in different genome locations with overlapping HBV genome break ends, which I imagine is difficult for VBE to resolve. Are there any VBE or GRIDSS configuration flags that might help?

@d-cameron
Copy link
Member

I've been thinking about this issue lately and I can see the following potential solutions/workarounds:

  • Run GRIDSS as well as VIRUSBreakend
    • GRIDSS single breakends to viral sequence = viral integration site. This is what we used before VIRUSBreakend but it missed some integration. I clearly should have done a better job of regression testing when I added the non-RefSeq genomes to the VBE database.
    • This will pick up integrations that are into viral repeats but not human repeats. It's problematic since GRIDSS is around 50x more computationally expensive to run so I'd like to find a VIRUSBreakend-only solution. As a workaround it should be fine.
  • Always use the RefSeq genome if it exists (the neighbour genomes were initially added because HPV45 has no RefSeq exemplar so VIRUSBreakend was getting clinically relevant HPV subtypes incorrect)
  • Enable GRIDSS calling in low mappability regions by changing the VIRUSBreakend GRIDSS config file
minMapq = 0
scoring.model = ReadCount
variantcalling.minScore = 2
  • N masking the second+ copy of viral reference repeats genome
    • Would need to work out an appropriate repeat threshold.
  • De-novo assembly of viral reference
    • Potentially problematic/a lot of work as I'd need isolate the chimeric contigs and assemblers like to break contigs at branch points - the exact the opposite of you want when you're after breakpoint contigs
  • De-novo assembly of just the SV-supporting fragments.

Defaulting to RefSeq & allowing mapq=0 seems like the fastest way to recover the obvious calls. I'll ask around and see if I'm on an ICGC data access request that'll give me access to HCC RK147 to test myself.

@toddajohnson
Copy link
Author

Thank you! I will try some of the easy changes to VBE parameters or modifications to the script as time permits.

FYI, I have looked at this same data using GRIDSS along with GRIPSS, PURPLE, LINX, and there are very few integration sites left at the end of the pipeline. Of course, most are present in the unfiltered GRIDSS SV VCF file, but between the GRIPSS somatic and GRIPSS somatic.filtered, most HBV integration sites across the HCC samples disappeared. In the somatic file, RK147 had 7 sites but only one was left after hard-filtering. Most of the breakends that were filtered out had neighboring breakends representing the other end of the insertion, but they did not make it to PURPLE or LINX to be clustered together into one event. I suppose that is why HMF is developing the "virus-interpreter" to process VBE output?

As you noted

d-cameron pushed a commit that referenced this issue Jul 18, 2021
@d-cameron
Copy link
Member

Did using the ReadCount scoring method restore the missing calls?

The G/P/L pipeline is very conservative when calling single breakends. You are correct that we haven't special-cased the GRIDSS viral integration calls.

G/P/L is designed for 40x normal/90x tumour sequencing. If you've got 30x or even 60x tumour sequencing, the default single breakend calling threshold is probably too high to call even a clonal single-copy single breakend. I suspect this is the cause of your GRIPSS drop-out of true positive viral integration sites. Subclonal viral integration would be even worse.

I suspect we'll be whitelisting all single breakends that gridss_annotate_kraken2 against virusbreakenddb reported as viral when we next review the viral detection component of the Hartwig pipeline.

@toddajohnson
Copy link
Author

I ran VBE again yesterday (minMapq = 0, scoring.model = ReadCount, variantcalling.minScore = 2) and did not see any differencefrom just using minMapq = 0. As you suspect, our sequencing coverage is somewhere around 30x normal, 40x tumor; if you have suggestions for more appropriate configuration values for running GPL in such a situation, it would be appreciated. For this "simple" case of viral integration, like you wrote, I suspect that whitelisting gridss_annotate_kraken2 detected integrations in the GPL pipeline would be the simplest solution. In any case thank you so much for developing this useful tool!

@d-cameron
Copy link
Member

2.12.1 released which should solve the problem of unusual HBV ref genome choices (now always uses the RefSeq genome if it's available for that virus) but the low mappability issue is a more fundamental issue with GRIDSS that I'm in the process of writing a new assembler for (but this will take months).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants