Coronavirus annotation
Identifying and annotating Coronaviridae sequences other than SARS-CoV-2 using an extended VADR model library
1. Download and install the latest version of vadr, following the instructions on this page.
2. Download the latest coronavirus vadr models (gzipped tarball) from this FTP page and unpack them (e.g. tar xfz <tarball.gz>). Note the path to the directory created (<coronavirus-models-dir-path>) for step 3.
3. Run the v-annotate.pl program on an input fasta file with SARS-CoV-2 sequences using the recommended command and options below:
v-annotate.pl -r -s --lowsimterm 2 --mxsize 64000 --mdir <coronavirus-models-dir-path> --mkey NC_045512 --fstlowthr 0.0 --alt_fail lowscore,fsthicnf,fstlocnf --lowsc 0.75 <fasta-file-to-annotate> <output-directory-to-create>
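The directory name created by the unpacking step can be read from the tarball itself instead of being noted by hand. The following is a minimal sketch, assuming the tarball unpacks into a single top-level directory (not verified for every model release); `models_dir_from_tarball` is a hypothetical helper name, not part of vadr.

```shell
# Hypothetical helper (not part of vadr): print the top-level directory
# name stored in a gzipped tarball.
# Assumption: every entry in the archive lives under one top-level directory.
models_dir_from_tarball() {
    tar tfz "$1" \
        | sed -e 's|^\./||' -e 's|/.*||' \
        | grep -v '^$' \
        | sort -u \
        | head -n 1
}

# Usage (the tarball name is a placeholder, as in the steps above):
#   tar xfz <tarball.gz>
#   MODELS_DIR="$PWD/$(models_dir_from_tarball <tarball.gz>)"
```

The resulting `MODELS_DIR` value can then be passed to `--mdir` in place of `<coronavirus-models-dir-path>`.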
This section shows output from an example v-annotate.pl run on three
SARS-CoV-2 sequences from GenBank. The fasta file of those three
sequences can be downloaded here.
(A similar example for norovirus sequences, which may contain more details on certain aspects, is here.)
For this example, the coronavirus model directory is in /usr/local/vadr-models-corona-1.1-1
and the sars-cov2.3.fa sequence file is in the current directory.
To annotate these sequences using the recommended v-annotate.pl options for SARS-CoV-2, run the command:
v-annotate.pl -r -s --lowsimterm 2 --mxsize 64000 --mdir /usr/local/vadr-models-corona-1.1-1 --mkey NC_045512 --fstlowthr 0.0 --alt_fail lowscore,fsthicnf,fstlocnf --lowsc 0.75 sars-cov2.3.fa va-sars-cov2.3
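When the same options are reused across several input files, it can help to assemble the command from variables so the model directory and file names are set in one place. The sketch below only prints the command (a dry run) and uses only the options from the recommended command above; `build_vadr_cmd` is a hypothetical helper name, and paths containing spaces are not handled.

```shell
# Hypothetical helper (not part of vadr): print the recommended v-annotate.pl
# command for a given model directory, input fasta file, and output directory.
# All options are copied verbatim from the recommended command above.
build_vadr_cmd() {
    mdir=$1; fasta=$2; outdir=$3
    printf '%s' "v-annotate.pl -r -s --lowsimterm 2 --mxsize 64000"
    printf ' --mdir %s --mkey NC_045512' "$mdir"
    printf ' --fstlowthr 0.0 --alt_fail lowscore,fsthicnf,fstlocnf --lowsc 0.75'
    printf ' %s %s\n' "$fasta" "$outdir"
}

# Dry run for the example in this section: prints the command, does not execute it.
build_vadr_cmd /usr/local/vadr-models-corona-1.1-1 sars-cov2.3.fa va-sars-cov2.3
```

Printing first makes it easy to compare the assembled command against the recommended one before running it.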
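You should see output similar to the following block, which lists the relevant environment variable values, input arguments, and options (the [--mdir] path shown differs from the one in the command above, presumably because this sample output was captured on a different system):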
# v-annotate.pl :: classify and annotate sequences using a CM library
# VADR 1.1 (May 2020)
# - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
# date: Thu May 7 13:36:47 2020
# $VADRBIOEASELDIR: /usr/local/vadr-install/Bio-Easel
# $VADRBLASTDIR: /usr/local/vadr-install/ncbi-blast/bin
# $VADREASELDIR: /usr/local/vadr-install/infernal/binaries
# $VADRINFERNALDIR: /usr/local/vadr-install/infernal/binaries
# $VADRMODELDIR: /usr/local/vadr-install/vadr-models
# $VADRSCRIPTSDIR: /usr/local/vadr-install/vadr
#
# sequence file: sars-cov2.3.fa
# output directory: va-sars-cov2.3
# specify that alert codes in <s> cause FAILure: lowscore,fsthicnf,fstlocnf [--alt_fail]
# .cm, .minfo, blastn .fa files in $VADRMODELDIR start with key <s>, not 'vadr': NC_045512 [--mkey]
# model files are in directory <s>, not in $VADRMODELDIR: /Users/nawrockie/Dropbox/work/notebook/20_0505_vadr_1p1_release/vadr-models-corona-1.1-1 [--mdir]
# lowscore/LOW_SCORE bits per nucleotide threshold is <x>: 0.75 [--lowsc]
# lowsim{5s,5f,3s,3f}/LOW_{FEATURE}_SIMILARITY_{START,END} minimum length is <n>: 2 [--lowsimterm]
# fstlocnf/POSSIBLE_FRAMESHIFT_LOW_CONF minimum average probability for alert is <x>: 0.0 [--fstlowthr]
# set max allowed dp matrix size --mxsize value for cmalign calls to <n> Mb: 64000 [--mxsize]
# use max length ungapped region from blastn to seed the alignment: yes [-s]
# replace stretches of Ns with expected nts, where possible: yes [-r]
# - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
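The banner above reports six environment variables. Before a run, it can be useful to confirm that each one is set and points to an existing directory. This is a minimal sketch under that assumption; the variable names are taken from the banner, and `check_vadr_env` is a hypothetical helper name, not something vadr provides.

```shell
# Hypothetical helper (not part of vadr): report any environment variable
# listed in the banner above that is unset or does not point to an
# existing directory. Returns non-zero if any check fails.
check_vadr_env() {
    status=0
    for name in VADRBIOEASELDIR VADRBLASTDIR VADREASELDIR \
                VADRINFERNALDIR VADRMODELDIR VADRSCRIPTSDIR; do
        eval "value=\${$name:-}"
        if [ -z "$value" ]; then
            echo "$name is not set" >&2
            status=1
        elif [ ! -d "$value" ]; then
            echo "$name points to a missing directory: $value" >&2
            status=1
        fi
    done
    return "$status"
}

if check_vadr_env; then
    echo "VADR environment looks complete"
fi
```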
Next, v-annotate.pl will output information as it proceeds through different steps of the analysis:
# Validating input ... done. [ 0.1 seconds]
# Preprocessing for N replacement: blastn classification (3 seqs) ... done. [ 0.3 seconds]
# Preprocessing for N replacement: coverage determination from blastn results (3 seqs) ... done. [ 0.0 seconds]
# Replacing Ns based on results of blastn-based pre-processing ... done. [ 0.0 seconds]
# Classifying sequences with blastn (3 seqs) ... done. [ 0.2 seconds]
# Determining sequence coverage from blastn results (3 seqs) ... done. [ 0.0 seconds]
# Joining alignments from cmalign and blastn for model NC_045512 (3 seqs) ... done. [ 0.0 seconds]
# Determining annotation ... done. [ 0.5 seconds]
# Validating proteins with blastx (NC_045512: 3 seqs) ... done. [ 2.0 seconds]
# Generating tabular output ... done. [ 0.0 seconds]
# Generating feature table output ... done. [ 0.0 seconds]
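Several of the progress lines above report the number of sequences processed ("3 seqs"). A quick sanity check is to compare that number with the number of records in the input file. The sketch below counts header lines beginning with `>`; `count_fasta_records` is a hypothetical helper name.

```shell
# Hypothetical helper (not part of vadr): count fasta records by counting
# header lines, i.e. lines that start with '>'.
count_fasta_records() {
    grep -c '^>' "$1"
}

# Usage (file name from the example in this section):
#   count_fasta_records sars-cov2.3.fa
```

For the three-sequence example file, the count is expected to match the "3 seqs" shown in the progress output.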
Finally, v-annotate.pl concludes with a summary of the classification of sequences, and the alerts reported.
For more information, see the vadr documentation pages, linked to from here, or the VADR 1.0 paper, available on bioRxiv (https://www.biorxiv.org/content/10.1101/852657v2).