Skip to content

Coronavirus annotation

Eric Nawrocki edited this page May 7, 2020 · 53 revisions

How to annotate SARS-CoV-2 sequences with vadr:

  1. Download and install the latest version of vadr, following the instructions on this page

  2. Download the latest coronavirus vadr models (gzipped tarball) from this FTP page, unpack them (e.g. tar xfz <tarball.gz>). Note the path to the directory name created (<coronavirus-models-dir-path>) for step 3.

  3. Run the v-annotate.pl program on an input fasta file with SARS-CoV-2 sequences using the recommended command and options below (careful, the command is long so you will likely have to scroll to the right to view the entire command):

v-annotate.pl -r -s --lowsimterm 2 --mxsize 64000 --mdir <coronavirus-models-dir-path> --mkey NC_045512 --fstlowthr 0.0 --alt_fail lowscore,fsthicnf,fstlocnf --lowsc 0.75 <fasta-file-to-annotate> <output-directory-to-create>

This section shows output from an example v-annotate.pl on three SARS-CoV-2 sequences from GenBank. The fasta file of those three sequences can be downloaded here.

(A similar example for norovirus sequences, which may contain more details on certain aspects, is here.)

For this example, the coronavirus model directory is in /usr/local/vadr-models-corona-1.1-1 and the sars-cov2.3.fa sequence file is in the current directory, and we will create a new directory called my3 into which output files will be written.

To annotate these sequences using the recommended v-annotate.pl options for SARS-CoV-2, run the command (scroll to right to see full command):

v-annotate.pl -s -r --lowsimterm 2 --mxsize 64000 --mdir /usr/local/vadr-models-corona-1.1-1 --mkey NC_045512 --fstlowthr 0.0 --alt_fail lowscore,fsthicnf,fstlocnf --lowsc 0.75 sars-cov2.3.fa my3

The options used are explained below.

When you execute the above command, you should see output similar to the following block that lists relevant environment variable values, and input arguments and options:

# v-annotate.pl :: classify and annotate sequences using a CM library
# VADR 1.1 (May 2020)
# - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
# date:              Thu May  7 13:36:47 2020
# $VADRBIOEASELDIR:  /usr/local/vadr-install/Bio-Easel
# $VADRBLASTDIR:     /usr/local/vadr-install/ncbi-blast/bin
# $VADREASELDIR:     /usr/local/vadr-install/infernal/binaries
# $VADRINFERNALDIR:  /usr/local/vadr-install/infernal/binaries
# $VADRMODELDIR:     /usr/local/vadr-install/vadr-models
# $VADRSCRIPTSDIR:   /usr/local/vadr-install/vadr
#
# sequence file:                                                                       sars-cov2.3.fa
# output directory:                                                                    my3
# specify that alert codes in <s> cause FAILure:                                       lowscore,fsthicnf,fstlocnf [--alt_fail]
# .cm, .minfo, blastn .fa files in $VADRMODELDIR start with key <s>, not 'vadr':       NC_045512 [--mkey]
# model files are in directory <s>, not in $VADRMODELDIR:                              /Users/nawrockie/Dropbox/work/notebook/20_0505_vadr_1p1_release/vadr-models-corona-1.1-1 [--mdir]
# lowscore/LOW_SCORE bits per nucleotide threshold is <x>:                             0.75 [--lowsc]
# lowsim{5s,5f,3s,3f}/LOW_{FEATURE}_SIMILARITY_{START,END} minimum length is <n>:      2 [--lowsimterm]
# fstlocnf/POSSIBLE_FRAMESHIFT_LOW_CONF minimum average probability for alert is <x>:  0.0 [--fstlowthr]
# set max allowed dp matrix size --mxsize value for cmalign calls to <n> Mb:           64000 [--mxsize]
# use max length ungapped region from blastn to seed the alignment:                    yes [-s]
# replace stretches of Ns with expected nts, where possible:                           yes [-r]
# - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -

Next, v-annotate.pl will output information as it proceeds through different steps of the analysis:

# Validating input                                                                        ... done. [    0.1 seconds]
# Preprocessing for N replacement: blastn classification (3 seqs)                         ... done. [    0.3 seconds]
# Preprocessing for N replacement: coverage determination from blastn results (3 seqs)    ... done. [    0.0 seconds]
# Replacing Ns based on results of blastn-based pre-processing                            ... done. [    0.0 seconds]
# Classifying sequences with blastn (3 seqs)                                              ... done. [    0.2 seconds]
# Determining sequence coverage from blastn results (3 seqs)                              ... done. [    0.0 seconds]
# Joining alignments from cmalign and blastn for model NC_045512 (3 seqs)                 ... done. [    0.0 seconds]
# Determining annotation                                                                  ... done. [    0.5 seconds]
# Validating proteins with blastx (NC_045512: 3 seqs)                                     ... done. [    2.0 seconds]
# Generating tabular output                                                               ... done. [    0.0 seconds]
# Generating feature table output                                                         ... done. [    0.0 seconds]

The v-annotate.pl output concludes with a summary of the classification of sequences, and the alerts reported:

# Summary of classified sequences:
#
#                                           num   num   num
#idx  model      group         subgroup    seqs  pass  fail
#---  ---------  ------------  ----------  ----  ----  ----
1     NC_045512  Sarbecovirus  SARS-CoV-2     3     2     1
#---  ---------  ------------  ----------  ----  ----  ----
-     *all*      -             -              3     2     1
-     *none*     -             -              0     0     0
#---  ---------  ------------  ----------  ----  ----  ----
#
# Summary of reported alerts:
#
#     alert     causes   short                            per    num   num  long
#idx  code      failure  description                     type  cases  seqs  description
#---  --------  -------  ---------------------------  -------  -----  ----  -----------
1     cdsstopn  yes      CDS_HAS_STOP_CODON           feature      2     1  in-frame stop codon exists 5' of stop position predicted by homology to reference
2     cdsstopp  yes      CDS_HAS_STOP_CODON           feature      2     1  stop codon in protein-based alignment
3     peptrans  yes      PEPTIDE_TRANSLATION_PROBLEM  feature     26     1  mat_peptide may not be translated because its parent CDS has a problem
#---  --------  -------  ---------------------------  -------  -----  ----  -----------

And finally a long list of the output files created (truncated for brevity):

# Output printed to screen saved in:                                                         my3.vadr.log
# List of executed commands saved in:                                                        my3.vadr.cmd
# List and description of all output files saved in:                                         my3.vadr.filelist
# copy of input fasta file saved in:                                                         my3.vadr.in.fa
# copy of input fasta file with descriptions removed for blastn saved in:                    my3.vadr.blastn.fa
# esl-seqstat -a output for input fasta file saved in:                                       my3.vadr.seqstat
# copy of input fasta file with Ns replaced with descriptions removed for blastn saved in:   my3.vadr.blastn.rpn.fa
# model NC_045512 feature gene#1 predicted seqs saved in:                                    my3.vadr.NC_045512.gene.1.fa
# model NC_045512 feature CDS#1 predicted seqs saved in:                                     my3.vadr.NC_045512.CDS.1.fa
# model NC_045512 feature CDS#2 predicted seqs saved in:                                     my3.vadr.NC_045512.CDS.2.fa
...snip...
# per-sequence tabular annotation summary file saved in:                                     my3.vadr.sqa
# per-sequence tabular classification summary file saved in:                                 my3.vadr.sqc
# per-feature tabular summary file saved in:                                                 my3.vadr.ftr
# per-model-segment tabular summary file saved in:                                           my3.vadr.sgm
# per-model tabular summary file saved in:                                                   my3.vadr.mdl
# per-alert tabular summary file saved in:                                                   my3.vadr.alt
# alert count tabular summary file saved in:                                                 my3.vadr.alc
# ungapped seed alignment summary file (-s) saved in:                                        my3.vadr.sda
# replaced stretches of Ns summary file (-r) saved in:                                       my3.vadr.rpn
# 5 column feature table output for passing sequences saved in:                              my3.vadr.pass.tbl
# 5 column feature table output for failing sequences saved in:                              my3.vadr.fail.tbl
# list of passing sequences saved in:                                                        my3.vadr.pass.list
# list of failing sequences saved in:                                                        my3.vadr.fail.list
# list of alerts in the feature tables saved in:                                             my3.vadr.alt.list
#
# All output files created in directory ./my3/
#
# Elapsed time:  00:00:03.35
#                hh:mm:ss
# 

Note that all the output files will be in the newly created my3 directory. The Summary of classified sequences listed that two sequences passed and one failed. The file my3.vadr.pass.list, lists the two sequences that passed:

MT159720.1
MT308693.1

and my3.vadr.fail.list lists the one sequence that failed:

MT159720.1/1406-G-to-T

The third sequence MT159720.1/1406-G-to-T is identical to the first one MT159720.1 with position 1406 changed from a G to a T to purposefully introduce an early stop codon for purposes of this example.

Early stop codons are one reason a sequence can fail. To the alerts reported for these sequences, look at the my3.vadr.alt.list file:

#sequence	error	feature	error-description
MT159720.1/1406-G-to-T	CDS_HAS_STOP_CODON	ORF1ab polyprotein	in-frame stop codon exists 5' of stop position predicted by homology to reference [revised to 266..1408 (stop shifted 20147 nt)]
MT159720.1/1406-G-to-T	CDS_HAS_STOP_CODON	ORF1ab polyprotein	stop codon in protein-based alignment [stop codon(s) end at position(s) 1143]
MT159720.1/1406-G-to-T	CDS_HAS_STOP_CODON	ORF1a polyprotein	in-frame stop codon exists 5' of stop position predicted by homology to reference [revised to 266..1408 (stop shifted 12075 nt)]
MT159720.1/1406-G-to-T	CDS_HAS_STOP_CODON	ORF1a polyprotein	stop codon in protein-based alignment [stop codon(s) end at position(s) 1408]
MT159720.1/1406-G-to-T	PEPTIDE_TRANSLATION_PROBLEM	leader protein	mat_peptide may not be translated because its parent CDS has a problem
MT159720.1/1406-G-to-T	PEPTIDE_TRANSLATION_PROBLEM	nsp2	mat_peptide may not be translated because its parent CDS has a problem
MT159720.1/1406-G-to-T	PEPTIDE_TRANSLATION_PROBLEM	nsp3	mat_peptide may not be translated because its parent CDS has a problem
MT159720.1/1406-G-to-T	PEPTIDE_TRANSLATION_PROBLEM	nsp4	mat_peptide may not be translated because its parent CDS has a problem
MT159720.1/1406-G-to-T	PEPTIDE_TRANSLATION_PROBLEM	3C-like proteinase	mat_peptide may not be translated because its parent CDS has a problem
MT159720.1/1406-G-to-T	PEPTIDE_TRANSLATION_PROBLEM	nsp6	mat_peptide may not be translated because its parent CDS has a problem
MT159720.1/1406-G-to-T	PEPTIDE_TRANSLATION_PROBLEM	nsp7	mat_peptide may not be translated because its parent CDS has a problem
MT159720.1/1406-G-to-T	PEPTIDE_TRANSLATION_PROBLEM	nsp8	mat_peptide may not be translated because its parent CDS has a problem
MT159720.1/1406-G-to-T	PEPTIDE_TRANSLATION_PROBLEM	nsp9	mat_peptide may not be translated because its parent CDS has a problem
MT159720.1/1406-G-to-T	PEPTIDE_TRANSLATION_PROBLEM	nsp10	mat_peptide may not be translated because its parent CDS has a problem
MT159720.1/1406-G-to-T	PEPTIDE_TRANSLATION_PROBLEM	RNA-dependent RNA polymerase	mat_peptide may not be translated because its parent CDS has a problem
MT159720.1/1406-G-to-T	PEPTIDE_TRANSLATION_PROBLEM	helicase	mat_peptide may not be translated because its parent CDS has a problem
MT159720.1/1406-G-to-T	PEPTIDE_TRANSLATION_PROBLEM	3'-to-5' exonuclease	mat_peptide may not be translated because its parent CDS has a problem
MT159720.1/1406-G-to-T	PEPTIDE_TRANSLATION_PROBLEM	endoRNAse	mat_peptide may not be translated because its parent CDS has a problem
MT159720.1/1406-G-to-T	PEPTIDE_TRANSLATION_PROBLEM	2'-O-ribose methyltransferase	mat_peptide may not be translated because its parent CDS has a problem
MT159720.1/1406-G-to-T	PEPTIDE_TRANSLATION_PROBLEM	nsp11	mat_peptide may not be translated because its parent CDS has a problem

All of these alerts are for MT159720.1/1406-G-to-T. Note the CDS_HAS_STOP_CODON errors. The PEPTIDE_TRANSLATION_PROBLEM occur for the mature peptide features which are derived from the two CDS that have errors.

For more information on how to interpret these results and files, including file formats, see the vadr documentation pages, linked to from here, or the VADR 1.0 paper, available on bioRxiv (https://www.biorxiv.org/content/10.1101/852657v2).

Explanation of options used in example command

The options used in the above command are the recommended set of options for SARS-CoV-2 annotation because they are currently (as of May 7, 2020) being used by NCBI sequence indexers when they evaluate an incoming SARS-CoV-2 sequence submission. The options are each briefly explained in the table below. More information can be found here, and the -s and -r options are explained more below as well.

option explanation
-r turn on the replace-N strategy: replace stretches of Ns with expected nucleotides, where possible
-s turn on the seed acceleration heuristic: use the max length ungapped region from blastn to seed the alignment
--lowsimterm 2 set the minimum length for an alert related to sequence of low similarity to the RefSeq at the sequence terminii to 2 nt
--mxsize 64000 set the maximum allowed dynamic programming matrix size for the cmalign program to 64Gb
--mkey NC_045512 use the model files with prefix NC_045512 in the directory from --mdir
--mdir /usr/local/vadr-models-corona-1.1-1 specify that the models to use are in directory /usr/local/vadr-models-corona-1.1-2
--fstlowthr 0.0 specify that all possible frameshifts are reported as alerts (minimum average posterior probability of 0.0)
--alt_fail lowscore,fsthicnf,fstlocnf specify that lowscore and frameshirt alerts (fsthicnf,fstlocnf) cause a sequence to fail
--lowsc 0.75 specify that a sequence with average per-nucleotide bit score below 0.75 trigger a lowscore alert

Two important options are the -s and -r options. and

The -r option

Briefly, the -r option replaces stretches of Ns in the input sequences with the expected nucleotides from the RefSeq. Be cautious, this strategy actually replaces Ns in your sequence prior to determination of annotation, if this is not what you want to do, then do not use -r. Using -r is the current (as of May 7, 2020) practice for NCBI indexers analyzing incombing SARS-CoV-2 sequences, which is why it is included as a recommended option here. More information on -r, including information on other related options is here.

The second sequence in the sars-cov2.3.fa file includes 344 Ns, most of which occured as consecutive stretches of 36 and 290 Ns in two regions, from sequence positions 11487-11522 and 19237-19526. With the -r option, v-annotate.pl replaces the Ns in these two stretches with the corresponding 'expected' nucleotides from the NC_045512 RefSeq, and then determines annotation with that doctored sequence. Some information on Ns in each sequence, and which ones were replaced is output to a file with suffix .rpn (format described here. For the example above, the my3.vadr.rpn file looks like this:

#seq  seq                       seq                   num_Ns  num_Ns  fract_Ns  ngaps  ngaps  ngaps    ngaps    ngaps      nnt      nnt  replaced_coords      
#idx  name                      len  model      p/f      tot      rp        rp    tot    int     rp  rp-full  rp-part  rp-full  rp-part  seq(S),mdl(M),#rp(N);
#---  ----------------------  -----  ---------  ----  ------  ------  --------  -----  -----  -----  -------  -------  -------  -------  ---------------------
1     MT159720.1              29882  NC_045512  PASS       0       0         -      0      0      0        0        0        0        0  -                    
2     MT308693.1              29788  NC_045512  PASS     344     326     0.948      2      2      2        2        0      326        0  S:11487..11522,M:11536..11571,N:36/36;S:19237..19526,M:19286..19575,N:290/290;
3     MT159720.1/1406-G-to-T  29882  NC_045512  FAIL       0       0         -      0      0      0        0        0        0        0  -                    

The -s option

The -s option speeds up v-annotate.pl by identifying the maximum length ungapped alignment region using blastn, and fixing that part of the alignment to accelerate the usually much slower alignment step. This heuristic works particularly well for many SARS-CoV-2 sequences that are highly similar (~99% identical) to the RefSeq NC_045512. If we run the example command without -s, it would require several minutes per sequence instead of several seconds per sequence. When -s is used an additional output file with suffix .sda is output with information on the ungapped region used to accelerate alignment, with format described here More information on -s, including information on other related options is here.

sequences other than SARS-CoV-2 using an extended vadr model library


Questions, comments, feature or model requests? Send a mail to eric.nawrocki@nih.gov.

Clone this wiki locally