-
Notifications
You must be signed in to change notification settings - Fork 27
Coronavirus annotation
Identifying and annotating Coronaviridae sequences other than SARS-CoV-2 using a larger VADR model library
-
Download and install the latest version of VADR, following the instructions on this page
-
Download the latest coronavirus vadr models (gzipped tarball) from this FTP page, unpack them (e.g.
tar xfz <tarball.gz>). Note the path to the directory name created (<coronavirus-models-dir-path>) for step 3. -
Run the
v-annotate.plprogram on an input fasta file with SARS-CoV-2 sequences using the recommended command and options below (the command is long so you will likely have to scroll to the right to view the entire command). NOTE: It is recommended you run the command below on a machine with 64 Gb of RAM or more. In practice, less RAM is often okay (I can succcessfully runv-annotate.plfor all ~2000 current GenBank SARS-CoV-2 sequences on my laptop with 16Gb RAM), but in the worst case (a highly divergent sequence from the NC_045512 RefSeq), 64Gb could be required.
v-annotate.pl -r -s --nomisc --lowsimterm 2 --mxsize 64000 --mdir <coronavirus-models-dir-path> --mkey NC_045512 --fstlowthr 0.0 --alt_fail lowscore,fsthicnf,fstlocnf --lowsc 0.75 <fasta-file-to-annotate> <output-directory-to-create>
This section shows output from an example v-annotate.pl on three
SARS-CoV-2 sequences from GenBank. The fasta file of those three
sequences can be downloaded
here.
(A similar example for norovirus sequences, which may contain more details on certain aspects, is here.)
For this example, the coronavirus model directory is in /usr/local/vadr-models-corona-1.1-1
and the sars-cov2.3.fa sequence file is in the current directory, and we will create a new directory called my3
into which output files will be written.
To annotate these sequences using the recommended v-annotate.pl options for SARS-CoV-2, run the command (scroll to right to see full command):
v-annotate.pl -r -s --nomisc --lowsimterm 2 --mxsize 64000 --mdir /usr/local/vadr-models-corona-1.1-1 --mkey NC_045512 --fstlowthr 0.0 --alt_fail lowscore,fsthicnf,fstlocnf --lowsc 0.75 sars-cov2.3.fa my3
The options used are explained further below.
When you execute the above command, you should see output similar to the following block that lists relevant environment variable values, and input arguments and options:
# v-annotate.pl :: classify and annotate sequences using a CM library
# VADR 1.1 (May 2020)
# - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
# date: Thu May 7 13:36:47 2020
# $VADRBIOEASELDIR: /usr/local/vadr-install/Bio-Easel
# $VADRBLASTDIR: /usr/local/vadr-install/ncbi-blast/bin
# $VADREASELDIR: /usr/local/vadr-install/infernal/binaries
# $VADRINFERNALDIR: /usr/local/vadr-install/infernal/binaries
# $VADRMODELDIR: /usr/local/vadr-install/vadr-models
# $VADRSCRIPTSDIR: /usr/local/vadr-install/vadr
#
# sequence file: sars-cov2.3.fa
# output directory: my3
# specify that alert codes in <s> cause FAILure: lowscore,fsthicnf,fstlocnf [--alt_fail]
# .cm, .minfo, blastn .fa files in $VADRMODELDIR start with key <s>, not 'vadr': NC_045512 [--mkey]
# model files are in directory <s>, not in $VADRMODELDIR: /usr/local/vadr-models-corona-1.1-1 [--mdir]
# in feature table, never change feature type to misc_feature: yes [--nomisc]
# lowscore/LOW_SCORE bits per nucleotide threshold is <x>: 0.75 [--lowsc]
# lowsim{5s,5f,3s,3f}/LOW_{FEATURE}_SIMILARITY_{START,END} minimum length is <n>: 2 [--lowsimterm]
# fstlocnf/POSSIBLE_FRAMESHIFT_LOW_CONF minimum average probability for alert is <x>: 0.0 [--fstlowthr]
# set max allowed dp matrix size --mxsize value for cmalign calls to <n> Mb: 64000 [--mxsize]
# use max length ungapped region from blastn to seed the alignment: yes [-s]
# replace stretches of Ns with expected nts, where possible: yes [-r]
# - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
Next, v-annotate.pl will output information as it proceeds through different steps of the analysis:
# Validating input ... done. [ 0.1 seconds]
# Preprocessing for N replacement: blastn classification (3 seqs) ... done. [ 0.3 seconds]
# Preprocessing for N replacement: coverage determination from blastn results (3 seqs) ... done. [ 0.0 seconds]
# Replacing Ns based on results of blastn-based pre-processing ... done. [ 0.0 seconds]
# Classifying sequences with blastn (3 seqs) ... done. [ 0.2 seconds]
# Determining sequence coverage from blastn results (3 seqs) ... done. [ 0.0 seconds]
# Joining alignments from cmalign and blastn for model NC_045512 (3 seqs) ... done. [ 0.0 seconds]
# Determining annotation ... done. [ 0.5 seconds]
# Validating proteins with blastx (NC_045512: 3 seqs) ... done. [ 2.0 seconds]
# Generating tabular output ... done. [ 0.0 seconds]
# Generating feature table output ... done. [ 0.0 seconds]
The v-annotate.pl output then includes a summary of the classification of sequences, and the alerts reported:
# Summary of classified sequences:
#
# num num num
#idx model group subgroup seqs pass fail
#--- --------- ------------ ---------- ---- ---- ----
1 NC_045512 Sarbecovirus SARS-CoV-2 3 2 1
#--- --------- ------------ ---------- ---- ---- ----
- *all* - - 3 2 1
- *none* - - 0 0 0
#--- --------- ------------ ---------- ---- ---- ----
#
# Summary of reported alerts:
#
# alert causes short per num num long
#idx code failure description type cases seqs description
#--- -------- ------- --------------------------- ------- ----- ---- -----------
1 cdsstopn yes CDS_HAS_STOP_CODON feature 2 1 in-frame stop codon exists 5' of stop position predicted by homology to reference
2 cdsstopp yes CDS_HAS_STOP_CODON feature 2 1 stop codon in protein-based alignment
3 peptrans yes PEPTIDE_TRANSLATION_PROBLEM feature 26 1 mat_peptide may not be translated because its parent CDS has a problem
#--- -------- ------- --------------------------- ------- ----- ---- -----------
And finally a long list of the output files created (truncated for brevity):
# Output printed to screen saved in: my3.vadr.log
# List of executed commands saved in: my3.vadr.cmd
# List and description of all output files saved in: my3.vadr.filelist
# copy of input fasta file saved in: my3.vadr.in.fa
# copy of input fasta file with descriptions removed for blastn saved in: my3.vadr.blastn.fa
# esl-seqstat -a output for input fasta file saved in: my3.vadr.seqstat
# copy of input fasta file with Ns replaced with descriptions removed for blastn saved in: my3.vadr.blastn.rpn.fa
# model NC_045512 feature gene#1 predicted seqs saved in: my3.vadr.NC_045512.gene.1.fa
# model NC_045512 feature CDS#1 predicted seqs saved in: my3.vadr.NC_045512.CDS.1.fa
# model NC_045512 feature CDS#2 predicted seqs saved in: my3.vadr.NC_045512.CDS.2.fa
...snip...
# per-sequence tabular annotation summary file saved in: my3.vadr.sqa
# per-sequence tabular classification summary file saved in: my3.vadr.sqc
# per-feature tabular summary file saved in: my3.vadr.ftr
# per-model-segment tabular summary file saved in: my3.vadr.sgm
# per-model tabular summary file saved in: my3.vadr.mdl
# per-alert tabular summary file saved in: my3.vadr.alt
# alert count tabular summary file saved in: my3.vadr.alc
# ungapped seed alignment summary file (-s) saved in: my3.vadr.sda
# replaced stretches of Ns summary file (-r) saved in: my3.vadr.rpn
# 5 column feature table output for passing sequences saved in: my3.vadr.pass.tbl
# 5 column feature table output for failing sequences saved in: my3.vadr.fail.tbl
# list of passing sequences saved in: my3.vadr.pass.list
# list of failing sequences saved in: my3.vadr.fail.list
# list of alerts in the feature tables saved in: my3.vadr.alt.list
#
# All output files created in directory ./my3/
#
# Elapsed time: 00:00:03.35
# hh:mm:ss
#
Note that all the output files will be in the newly created my3 directory.
The Summary of classified sequences listed that two sequences passed and one failed.
The file my3.vadr.pass.list, lists the two sequences that passed:
MT159720.1
MT308693.1
and my3.vadr.fail.list lists the one sequence that failed:
MT159720.1/1406-G-to-T
For the two sequences that passed, the annotation is available in the output my3.vadr.pass.tbl file and for the sequence that failed
the annotation is in the my3.vadr.fail.tbl file.
my.vadr.pass.tbl: (with the middle of the table for each sequence removed for brevity)
>Feature MT159720.1
266 21555 gene
gene ORF1ab
266 13468 CDS
13468 21555
product ORF1ab polyprotein
exception ribosomal slippage
protein_id MT159720.1_1
266 13483 CDS
product ORF1a polyprotein
protein_id MT159720.1_2
266 805 mat_peptide
product leader protein
protein_id MT159720.1_1
...snip...
28274 29533 gene
gene N
28274 29533 CDS
product nucleocapsid phosphoprotein
protein_id MT159720.1_11
29558 29674 gene
gene ORF10
29558 29674 CDS
product ORF10 protein
protein_id MT159720.1_12
29609 29644 stem_loop
note Coronavirus 3' UTR pseudoknot stem-loop 1
29629 29657 stem_loop
note Coronavirus 3' UTR pseudoknot stem-loop 2
29728 29768 stem_loop
note Coronavirus 3' stem-loop II-like motif (s2m)
>Feature MT308693.1
217 21506 gene
gene ORF1ab
217 13419 CDS
13419 21506
product ORF1ab polyprotein
exception ribosomal slippage
protein_id MT308693.1_1
217 13434 CDS
product ORF1a polyprotein
protein_id MT308693.1_2
217 756 mat_peptide
product leader protein
protein_id MT308693.1_1
...snip...
28225 29484 gene
gene N
28225 29484 CDS
product nucleocapsid phosphoprotein
protein_id MT308693.1_11
29509 29625 gene
gene ORF10
29509 29625 CDS
product ORF10 protein
protein_id MT308693.1_12
29560 29595 stem_loop
note Coronavirus 3' UTR pseudoknot stem-loop 1
29580 29608 stem_loop
note Coronavirus 3' UTR pseudoknot stem-loop 2
29679 29719 stem_loop
note Coronavirus 3' stem-loop II-like motif (s2m)
my.vadr.fail.tbl (with the middle removed for brevity):
>Feature MT159720.1/1406-G-to-T
266 21555 gene
gene ORF1ab
266 13468 CDS
13468 21555
product ORF1ab polyprotein
exception ribosomal slippage
protein_id MT159720.1/1406-G-to-T_1
266 13483 CDS
product ORF1a polyprotein
protein_id MT159720.1/1406-G-to-T_2
266 805 mat_peptide
product leader protein
protein_id MT159720.1/1406-G-to-T_1
...snip...
29629 29657 stem_loop
note Coronavirus 3' UTR pseudoknot stem-loop 2
29728 29768 stem_loop
note Coronavirus 3' stem-loop II-like motif (s2m)
Additional note(s) to submitter:
ERROR: CDS_HAS_STOP_CODON: (ORF1ab polyprotein) in-frame stop codon exists 5' of stop position predicted by homology to reference [revised to 266..1408 (stop shifted 20147 nt)]
ERROR: CDS_HAS_STOP_CODON: (ORF1ab polyprotein) stop codon in protein-based alignment [stop codon(s) end at position(s) 1143]
ERROR: CDS_HAS_STOP_CODON: (ORF1a polyprotein) in-frame stop codon exists 5' of stop position predicted by homology to reference [revised to 266..1408 (stop shifted 12075 nt)]
ERROR: CDS_HAS_STOP_CODON: (ORF1a polyprotein) stop codon in protein-based alignment [stop codon(s) end at position(s) 1408]
ERROR: PEPTIDE_TRANSLATION_PROBLEM: (leader protein) mat_peptide may not be translated because its parent CDS has a problem
ERROR: PEPTIDE_TRANSLATION_PROBLEM: (nsp2) mat_peptide may not be translated because its parent CDS has a problem
ERROR: PEPTIDE_TRANSLATION_PROBLEM: (nsp3) mat_peptide may not be translated because its parent CDS has a problem
ERROR: PEPTIDE_TRANSLATION_PROBLEM: (nsp4) mat_peptide may not be translated because its parent CDS has a problem
ERROR: PEPTIDE_TRANSLATION_PROBLEM: (3C-like proteinase) mat_peptide may not be translated because its parent CDS has a problem
ERROR: PEPTIDE_TRANSLATION_PROBLEM: (nsp6) mat_peptide may not be translated because its parent CDS has a problem
ERROR: PEPTIDE_TRANSLATION_PROBLEM: (nsp7) mat_peptide may not be translated because its parent CDS has a problem
ERROR: PEPTIDE_TRANSLATION_PROBLEM: (nsp8) mat_peptide may not be translated because its parent CDS has a problem
ERROR: PEPTIDE_TRANSLATION_PROBLEM: (nsp9) mat_peptide may not be translated because its parent CDS has a problem
ERROR: PEPTIDE_TRANSLATION_PROBLEM: (nsp10) mat_peptide may not be translated because its parent CDS has a problem
ERROR: PEPTIDE_TRANSLATION_PROBLEM: (RNA-dependent RNA polymerase) mat_peptide may not be translated because its parent CDS has a problem
ERROR: PEPTIDE_TRANSLATION_PROBLEM: (helicase) mat_peptide may not be translated because its parent CDS has a problem
ERROR: PEPTIDE_TRANSLATION_PROBLEM: (3'-to-5' exonuclease) mat_peptide may not be translated because its parent CDS has a problem
ERROR: PEPTIDE_TRANSLATION_PROBLEM: (endoRNAse) mat_peptide may not be translated because its parent CDS has a problem
ERROR: PEPTIDE_TRANSLATION_PROBLEM: (2'-O-ribose methyltransferase) mat_peptide may not be translated because its parent CDS has a problem
ERROR: PEPTIDE_TRANSLATION_PROBLEM: (nsp11) mat_peptide may not be translated because its parent CDS has a problem
Feature table format is described at https://www.ncbi.nlm.nih.gov/Sequin/table.html.
Note that the end of the fail.tbl file lists ERRORs for
MT159720.1/1406-G-to-T, the third sequence in the input file, which
is identical to the first one MT159720.1 with position 1406
changed from a G to a T to purposefully introduce an early stop
codon for purposes of this example. Early stop codons are one reason
a sequence can fail. Note the CDS_HAS_STOP_CODON errors at the end
of the feature table. The PEPTIDE_TRANSLATION_PROBLEM occur for the
mature peptide features which are derived from the two CDS that have
errors.
The annotation information is also available in other files with
different formats, such as the my3/my3.vadr.ftr file, which may be
easier to parse for some applications. For more information on how to
interpret these results and files, including file formats, see the
vadr documentation pages, linked to from
here,
or the VADR 1.0 paper, available on bioRxiv
(https://www.biorxiv.org/content/10.1101/852657v2).
The options used in the above command are the recommended set of
options for SARS-CoV-2 annotation because they are currently (as of
May 7, 2020) being used by NCBI sequence indexers when they evaluate
an incoming SARS-CoV-2 sequence submission. The options are each briefly
explained in the table below. More information can be found
here,
and the -s and -r options are explained more below the table as well.
| option | explanation |
|---|---|
-r |
turn on the replace-N strategy: replace stretches of Ns with expected nucleotides, where possible |
-s |
turn on the seed acceleration heuristic: use the max length ungapped region from blastn to seed the alignment |
| `--nomisc | specify that features for failing sequences not be changed to misc_features in the output .tbl file |
--lowsimterm 2 |
set the minimum length for an alert related to sequence of low similarity to the RefSeq at the sequence terminii to 2 nt |
--mxsize 64000 |
set the maximum allowed dynamic programming matrix size for the cmalign program to 64Gb |
--mkey NC_045512 |
use the model files with prefix NC_045512 in the directory from --mdir
|
--mdir /usr/local/vadr-models-corona-1.1-1 |
specify that the models to use are in directory /usr/local/vadr-models-corona-1.1-2 |
--fstlowthr 0.0 |
specify that all possible frameshifts are reported as alerts (minimum average posterior probability of 0.0) |
--alt_fail lowscore,fsthicnf,fstlocnf |
specify that lowscore and frameshirt alerts (fsthicnf,fstlocnf) cause a sequence to fail |
--lowsc 0.75 |
specify that a sequence with average per-nucleotide bit score below 0.75 trigger a lowscore alert |
The -r option replaces stretches of Ns in the input sequences with
the expected nucleotides from the RefSeq. Be cautious, as this
strategy actually replaces Ns in your sequence prior to determination
of annotation. If this is not what you want to do, then do not use
-r. Using -r is the current (as of May 7, 2020) practice for NCBI
indexers analyzing incoming SARS-CoV-2 sequences, which is why it is
included as a recommended option here. By doing this the Ns are
assumed to be the 'expected' sequence in the corresponding regions for
purposes of annotation. More information on -r, including
information on other related options is
here.
The second sequence in the sars-cov2.3.fa file includes 344 Ns,
most of which are contained within consecutive stretches of 36 and 290 Ns in two
regions, from sequence positions 11487-11522 and 19237-19526.
With the -r option, v-annotate.pl replaces the Ns in these two stretches with the
corresponding 'expected' nucleotides from the NC_045512 RefSeq, and then determines annotation
with that doctored sequence. Some information on the Ns in each sequence including on those that were replaced is output to a file with suffix .rpn
(format described here).
For the example above, the my3.vadr.rpn file looks like this (scroll to right to see full file):
#seq seq seq num_Ns num_Ns fract_Ns ngaps ngaps ngaps ngaps ngaps nnt nnt replaced_coords
#idx name len model p/f tot rp rp tot int rp rp-full rp-part rp-full rp-part seq(S),mdl(M),#rp(N);
#--- ---------------------- ----- --------- ---- ------ ------ -------- ----- ----- ----- ------- ------- ------- ------- ---------------------
1 MT159720.1 29882 NC_045512 PASS 0 0 - 0 0 0 0 0 0 0 -
2 MT308693.1 29788 NC_045512 PASS 344 326 0.948 2 2 2 2 0 326 0 S:11487..11522,M:11536..11571,N:36/36;S:19237..19526,M:19286..19575,N:290/290;
3 MT159720.1/1406-G-to-T 29882 NC_045512 FAIL 0 0 - 0 0 0 0 0 0 0 -
If we run the above command without -r, the Ns are not replaced and
the MT308693.1 sequence fails because the N regions trigger
alerts/errors to be reported. Specifically, the following three errors
are reported in the fail.tbl file, amongst other
PEPTIDE_TRANSLATION_PROBLEM errors for mature peptides (not shown):
ERROR: LOW_FEATURE_SIMILARITY: (ORF1ab) region within annotated feature lacks significant similarity [36 nt overlap b/t low similarity region (11487..11522) and annotated feature (217..21506), strand: +]
ERROR: LOW_FEATURE_SIMILARITY: (ORF1ab) region within annotated feature lacks significant similarity [290 nt overlap b/t low similarity region (19237..19526) and annotated feature (217..21506), strand: +]
ERROR: INDEFINITE_ANNOTATION_END: (ORF1ab polyprotein) protein-based alignment does not extend close enough to nucleotide-based alignment 3' endpoint [2271 > 8 (strand:+ CM:21506 blastx:21506 - 2271, valid stop codon in CM prediction)]
Note that the regions that are causing the LOW_FEATURE_SIMILARITY
errors are stretches of Ns. Using -r allows some sequences like this that
only fail due to the Ns to pass, and sometimes changes their
annotation to be more accurate (based on the assumption that the replaced Ns
represent the expected nucleotides at the corresponding positions).
The -s option speeds up v-annotate.pl by identifying the maximum
length ungapped alignment region using blastn, and fixing that part of
the alignment to accelerate the usually much slower cmalign alignment step.
This heuristic works particularly well for many SARS-CoV-2 sequences
that are highly similar (~99% identical) to the RefSeq NC_045512. If
we run the above example command without -s, it would require several
minutes per sequence instead of several seconds per sequence. When
-s is used, an additional output file with suffix .sda is output
(format described here).
More information on -s, including information on other related
options is
here.
Identifying and annotating Coronaviridae sequences other than SARS-CoV-2 using a larger VADR model library
The set of coronavirus vadr models downloaded in step 1 at the top of this page from this FTP page includes 54 other Coronaviridae RefSeqs, besides the SARS-CoV-2 NC_045512 RefSeq.
The command above will only compare your input sequences to
NC_045512. If you want to annotate v-annotate.pl to perform the
additional step of checking if each input sequence is more similar to
SARS-CoV-2 or a different Coronaviridae RefSeq, or if your input
file contains non-SARS-CoV-2 Coronaviridae sequences,
change the --mkey option in the example command above from
--mkey NC_045512
to
--mkey corona
But be aware that changing this option will result in slower running times.
The available Coronaviridae models (~30Kb) are larger than the
maximum length (25Kb) allowed by v-build.pl and were built with a
still in-development version of VADR that is not yet publicly
available. This means that VADR users cannot currently build their own
Coronaviridae models, but a future version of VADR should make this possible.
Email eric.nawrocki@nih.gov if you have a request for a new model that you'd like
to be created.