Software, architecture, and data index design for the 2018/2019 Virus Discovery Project
Here we present a compromise pipeline for extracting virological information from publicly available metagenomic datasets and exposing it as a usable index for the virological research community.
Objectives (See Project Kanban Boards):
Known Virus Index
The 'Knowns' portion of the VirusDiscovery pipeline processes data from the guided assembly database to sort for virus-like contigs. Specifically, contigs are processed with BLASTN, sorting for an average nucleotide identity ('ANI') of greater than 80% or other defined cutoff. For contigs identified as viral, an index entry is generated as below.
Index for 'Known' viral contigs:
- Metagenome SRR accession [string]
- Contig name [string]
- Assembly type [denovo, reference guided]
- Median depth of coverage by reads of contig [int]
- Length [int]
- Covered length from hit [int]
- NCBI taxonomy id by kmer [int]
- NCBI taxonomic species by kmer [string]
- Unique kmer hits [int]
- Species for reference-guided assembly [string]
- Accession for subject in blastn [string]
- NCBI taxonomy id for subject in blastn [string]
- Percent identity of blastn hit [float]
- Evalue of blastn hit [float]
- Bit score of blastn hit [float]
- Length of blastn hit [int]
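As a rough illustration, the index entry above could be represented as a typed record. This is only a sketch: the class name, the field names, and the example values are hypothetical (the accession and contig name are borrowed from the SRA realign examples in this document), and the field types simply follow the bracketed types in the list.

```python
from dataclasses import dataclass, asdict

# Allowed values for the "Assembly type" field, as listed in the index.
ASSEMBLY_TYPES = ("denovo", "reference guided")


@dataclass
class KnownContigEntry:
    """One index row for a 'Known' viral contig (field names are hypothetical)."""
    srr_accession: str             # Metagenome SRR accession
    contig_name: str               # Contig name
    assembly_type: str             # "denovo" or "reference guided"
    median_depth: int              # Median depth of coverage by reads of contig
    length: int                    # Length
    covered_length: int            # Covered length from hit
    kmer_taxid: int                # NCBI taxonomy id by kmer
    kmer_species: str              # NCBI taxonomic species by kmer
    unique_kmer_hits: int          # Unique kmer hits
    guided_species: str            # Species for reference-guided assembly
    blastn_subject_accession: str  # Accession for subject in blastn
    blastn_subject_taxid: str      # NCBI taxonomy id for subject in blastn
    blastn_percent_identity: float # Percent identity of blastn hit
    blastn_evalue: float           # Evalue of blastn hit
    blastn_bit_score: float        # Bit score of blastn hit
    blastn_hit_length: int         # Length of blastn hit

    def __post_init__(self):
        # Reject assembly types that are not in the documented list.
        if self.assembly_type not in ASSEMBLY_TYPES:
            raise ValueError(f"unknown assembly type: {self.assembly_type!r}")


# Placeholder values for illustration only; they are not real results.
example = KnownContigEntry(
    srr_accession="ERR2227558",
    contig_name="Contig_100000_4.78977",
    assembly_type="denovo",
    median_depth=5,
    length=1000,
    covered_length=900,
    kmer_taxid=10239,
    kmer_species="placeholder species",
    unique_kmer_hits=42,
    guided_species="",
    blastn_subject_accession="PLACEHOLDER_ACCESSION",
    blastn_subject_taxid="10239",
    blastn_percent_identity=92.5,
    blastn_evalue=1e-50,
    blastn_bit_score=1500.0,
    blastn_hit_length=900,
)
```

Calling asdict(example) gives a plain dictionary that could be written out as one JSON or TSV row per contig.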
We assume that the contigs db will remain part of the VirusDiscoveryProject Index ('VDPI'), so that indices into that db are adequate for access and individual contigs need not be stored with the VDPI. The contig db is assumed to include metagenome accession IDs. From those IDs, a search can retrieve other desirable data features provided in the NCBI Virus DB, such as species, source material, and country of origin, so we need not provide that information ourselves.
The Knowns pipeline generates a possible taxonomic level based on more than 85% ANI and more than 80% coverage by the blastn hit, and provides that in the index. We propose that there also be entries allowing for expert curation, when and if any occurs. The auto-generated taxonomic identity is simply derived from the best blastn hit.
Contigs that have an ANI and coverage lower than the cutoff are sorted and their indices are provided to the Novel Virus processing pipeline.
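The known/novel split described above might be sketched as follows. This is not the pipeline's actual code. It makes three assumptions: ANI is approximated by the percent identity of the best blastn hit, coverage is the covered length divided by the contig length, and any contig that fails either cutoff is routed to the Novel Virus pipeline. The function name and return labels are hypothetical, and the cutoffs are left as parameters because they are configurable.

```python
def classify_contig(percent_identity, covered_length, contig_length,
                    ani_cutoff=85.0, coverage_cutoff=80.0):
    """Return "known" or "novel_candidate" for one contig.

    Assumptions (not confirmed by the pipeline itself):
    - percent_identity of the best blastn hit stands in for ANI
    - coverage is covered_length / contig_length, as a percentage
    - both cutoffs must be strictly exceeded for a "known" call
    """
    if contig_length <= 0:
        raise ValueError("contig_length must be positive")
    coverage = 100.0 * covered_length / contig_length
    if percent_identity > ani_cutoff and coverage > coverage_cutoff:
        return "known"
    return "novel_candidate"


# A 1000 bp contig with 900 bp covered at 92% identity passes both cutoffs.
print(classify_contig(92.0, 900, 1000))
# The same contig with only 500 bp covered fails the coverage cutoff.
print(classify_contig(92.0, 500, 1000))
```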
Detection of novel contigs and novel viruses
Scaling: Containerization, distribution, and user interface + contribution mechanism?
Test Data Selection
Detecting recombination and variation from/in known viruses
Determination of phylogenetic position of viruses similar to knowns
Look at the cool new sh*& NCBI built for us!
NCBI Blast Docker:
BLAST workbench docker (+magicBLAST, +EDirect) -- beta
NCBI_Powertools (+ sra toolkit)
There's also a nice cookbook here, with thanks to @christiam!
BLAST databases are currently being updated in the NIH STRIDES GCP bucket and can be obtained via update_blastdb.pl, or pre-configured via the BLAST GCP VM.
Virus specific ones!
- Virus RefSeq; Reference Viral sequences
- Entrez query - “latest_refseq[Prop] AND viruses[Organism]”
- Viroid RefSeq; Reference Viroid Sequences
- Entrez Query - “latest_refseq[Prop] AND viroids[Organism]”
- Proteins from coding-complete, genomic viral sequences
- Equivalent to protein records in https://www.ncbi.nlm.nih.gov/labs/virus/vssi/#/virus/Vira%252C%2520taxid%253A10239
- Coding-complete, genomic viral sequences
- Equivalent to nucleotide records in https://www.ncbi.nlm.nih.gov/labs/virus/vssi/#/virus/Viruses%252C%2520taxid%253A10239
SRA realign project
SRA public metagenome data has been aligned and assembled in Google Cloud, and the results are stored in the bucket gs://ncbi_sra_realign/. The project is ongoing, and as of 12/11/2018 there are almost 150,000 realigned runs available for the hackathon.
The URLs to individual files can be built from the SRA run accession, for example: gs://ncbi_sra_realign/ERR2227558.realign
Here is an example of how to download a realigned file: gsutil cp gs://ncbi_sra_realign/ERR2227558.realign .
To list all realign files: gsutil -m ls gs://ncbi_sra_realign/*.realign
The SRA Toolkit provides tools to read the data stored in realigned files. For example, to view references: align-info ERR2227558.realign
To extract a contig by name: dump-ref-fasta ERR2227558.realign Contig_100000_4.78977
To dump reads in fasta format: fastq-dump -Z --fasta ERR2227558.realign Contig_100000_4.78977 | head
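For scripting over many runs, the URL pattern and commands above can be generated rather than typed by hand. The sketch below only builds strings and argument lists and does not execute anything. The function names are hypothetical, and the accession check assumes run accessions have the usual SRR/ERR/DRR prefix followed by digits.

```python
import re

BUCKET = "gs://ncbi_sra_realign"


def realign_uri(run_accession):
    """Build the bucket URI for one realigned run, e.g. ERR2227558."""
    # Assumption: run accessions look like SRR/ERR/DRR followed by digits.
    if not re.fullmatch(r"[SED]RR\d+", run_accession):
        raise ValueError(f"not an SRA run accession: {run_accession!r}")
    return f"{BUCKET}/{run_accession}.realign"


def download_command(run_accession, dest="."):
    """Argument list for: gsutil cp <uri> <dest>"""
    return ["gsutil", "cp", realign_uri(run_accession), dest]


def extract_contig_command(run_accession, contig_name):
    """Argument list for: dump-ref-fasta <run>.realign <contig>"""
    realign_uri(run_accession)  # reuse the accession check
    return ["dump-ref-fasta", f"{run_accession}.realign", contig_name]


print(realign_uri("ERR2227558"))
print(" ".join(download_command("ERR2227558")))
print(" ".join(extract_contig_command("ERR2227558", "Contig_100000_4.78977")))
```

The argument lists can be passed to subprocess.run on a machine where gsutil and the SRA Toolkit are installed.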
For the duration of the hackathon, additional information is available in BigQuery: coverage, contig taxonomy, and a summary with a breakdown of host/viral/denovo/unmapped reads. Please note that processing is still ongoing, so the numbers provided are not final.
Some BigQuery examples:
number of available SRA runs (133200):
select count(distinct accession) from ncbi_sra_realign.coverage
number of guided contigs (274928):
select count(1) from ncbi_sra_realign.coverage where contig not like 'Contig_%' and REGEXP_CONTAINS(contig, '_[[:digit:]]$')
number of denovo contigs (2674975354):
select count(1) from ncbi_sra_realign.coverage where contig like 'Contig_%'
number of SRA runs with available contig taxonomy (100):
select count(distinct accession) from ncbi_sra_realign.taxonomy
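The naming convention those queries rely on can also be applied locally, for example when filtering contig names taken from align-info output. The sketch below mirrors the two WHERE clauses under stated assumptions: the function name and the "other" label are hypothetical, and the guided contig name in the example is made up for illustration. One difference is that `_` in SQL LIKE matches any single character, whereas this sketch requires a literal underscore after "Contig".

```python
import re

# Mirrors REGEXP_CONTAINS(contig, '_[[:digit:]]$'): an underscore followed by
# exactly one digit at the very end of the name.
_GUIDED_SUFFIX = re.compile(r"_[0-9]\Z")


def contig_kind(name):
    """Classify a contig name as "denovo", "guided", or "other"."""
    if name.startswith("Contig_"):
        return "denovo"
    if _GUIDED_SUFFIX.search(name):
        return "guided"
    return "other"


print(contig_kind("Contig_100000_4.78977"))  # name taken from the examples above
print(contig_kind("NC_001802.1_1"))          # hypothetical guided contig name
```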