Variant analysis pipeline for Coronavirus genomes. Specifically, SARS-CoV-2. This pipeline build a whole-genome Multiple Sequence Alignment (MSA) and performs following analyses:
- Variant calling
- Estimation of sequence diversity
- Sequence typing
- Phylogenetic inference
- Selection analysis
Following external programs are required:
- MAFFT : Build Multiple Sequence Alignment
- FastTreeMP : Build phylogeny
- Hyphy : Perform selection analysis to identify sites under positive selection
Create a python virtual environment, at a location of your choice.
- Use the following command, or something equivalent.
python3 -m venv <path/name>
Replace <path/name> by a path of your choice.
- Activate this environment by running
source <path/name>/bin/activate
Clone this repository.
git clone https://github.com/A-Farhan/cova.git
Install the package from the binary distribution.
Find the latest release (*.whl) in the release section of cova's github page, Download and run.
pip install <full-path-to-release-file>
Install external programs.
cd
to cova repositoryRun
./INSTALL.sh
. This is an interactive script. It will ask for the permission to proceed with each installation viz. MAFFT, FastTree and Hyphy. Appropriate response should be y (yes) OR n (no). If you have any of these programs already installed, then you can choose to skip its installation here.CoVa expects external programs to be available system-wide, by default. In other words, programs are invoked without providing the full path. If you installed external programs using the previous step, a new directory local will be created inside the repository. All 3 programs shall then be accessible from the path <repo-path>/local/bin/.
You'll need to add the following line to your ~/.bashrc file.
export COVA_BIN_PATH=<repo-path>/cova/local/bin
You can also access your pre-installed external programs similarly without using full paths. Simply, add their respective paths to your PATH variable.
export PATH=$PATH:<full-path-to-external-program>
You can also find instructions to download and install these programs from the following urls:
https://mafft.cbrc.jp/alignment/software/installation_without_root.html
This pipeline is built to be run as a command-line tool CoVa
To check if the command is available, run the following command inside the virtual environment
CoVa --help
CoVa runs in the curent directory by default. You can provide any directory-path through --indr
option.
CoVa expects a minimum input of a whole-genome multi-FASTA file under this path, named "genomes.fna" by default. Several commands require a reference accession [default: NC_045512
]. If this is not already present in the input file, it can be added by running accessory command preprocess
( more on this later).
To get familiar with CoVa, and its outputs, you can run CoVa on <repo-path>/datasets/example
directory. You can also copy the input file (genomes.fna) from this directory into an empty directory of your choce, for a fresh run.
CoVa has multiple subcommands, and these commands have their own arguments. To see these arguments, you can run, for example
CoVa msabuild --help
You can run these commands individually or in combination, with or without arguments, as all arguments have defaults. Combination works like this.
CoVa --indr <full-path-to-input-directory> msabuild msaref msaunq
Not all combinations would work, of course. As many commands depend on the input from specific preceding command(s), it is a largely rigid chain. You can refer to the above graph for help on this. To run the entire pipeline, use the sub-command full
.
CoVa full
which is equivalent to
CoVa msabuild msaref msaunq seqtype vcalpd anvpd nsvar rmstop msap vcali div tree sel
CoVa also allows to add more sequences and update results of a previous analysis. To do so, the main command can be run with the flag --addseq
. This flag is passed only to full
command, i.e., if you want to add a new set of sequences to an existing MSA, and redo CoVa analysis, then you can run
CoVa --addseq full
You can also use CoVa's cleanup command to remove a cova run's input/output files present in the working directory. It also includes an option to exclude a comma-separated list of files. This command is internally run by CoVa to update a previous analysis, wherein all cova-generated files except the initial alignment are first deleted, and then regenerated.
Other sub-commands are as follows:
MSABUILD
This command is a MAFFT wrapper to build whole-genome Multiple Sequence Alignments (MSA).
Input:
A multi-FASTA file of unaligned whole-genomes present in the working directory, named "genomes.fna" by default.
Output:
A multi-FASTA file of aligned whole-genomes present in the same directory, named "genome_aln.fna" by default.
MSAD
MAFFT allows for addition of new sequences to pre-existing MSAs. CoVa makes use of this feature to simplify incorporation of incoming genomic data and update analysis results. To facilitate rest of the analysis without changing any arguments, the MSA is changed in place and a copy is kept for back up. All the other analysis files are updated without a backup. If you wish to retain previous analysis, you can separately copy these files to a directory.
Input:
MSA generated by msabuild
, "genome_aln.fna" by default.
A FASTA file of possibly multiple genome sequences to be incorporated in the above MSA, named "new_seq.fna" by default.
Output:
Updated input MSA file. Additionally, a copy of the input is generated, named "old_genome_aln.fna" by default.
MSAREF
Before we can call variants ( point mutations and deletions) relative to a reference, our MSA must be restricted to the sites present in this reference. That's the job of this command.
Input:
MSA and a reference accession included in this MSA.
Output:
A multi-FASTA file of the above MSA limited to sites present in the given reference. default: "genome_aln_ref.fna"
MSAUNQ
Since it is possible that the original set of unaligned sequences, or the reference-limited MSA has duplicate sequences, it may be of interest to remove these duplicate sequences before further analysis. It would serve to speed up certain downstream analysis and also, to prevent polytomies in the phylogeny.
Input:
MSA generated by msaref
.
Output:
A multi-FASTA file of the above MSA excluding any duplicate sequences, default: "genome_aln_unq.fna"
A tab-delimited table of duplicate genomes with following columns:, default: "genome_dups.tsv"
- genome id included in the output MSA
- comma-separated list of ids of genomes identical to the one selected
SEQTYPE
Sequence typing is performed by CoVa using the genetic barcoding scheme suggested by Guan et al, 2020. For more details, refer to https://doi.org/10.1101/2020.04.21.054221
Input:
MSA generated by msaunq
Output:
CSV with 2 columns for genome ids and their corresponding types respectively. default: "genome_types.csv"
VCALPD
Variant CALling ( Point mutations / Deletions).
Input:
MSA generated by msaunq
.
Output:
"point_mutations.tsv": table with 1 row per point mutation and 1 column per genome, except the first 2 columns are for 1-indexed genomic coordinate and reference allele respectively.
"deletions.tsv": table with 1 row per deletion, and following columns:
- pos - 1-indexed genomic coordinate of the first base of deletion
- len - length of deletion
- seq - deleted reference sequence
- id - Bits for absence(0) OR presence(1) of deletion in the respective genome. There is one id column for every genome in the MSA.
ANVPD
ANNotation protein Variants w/ Point mutations or Deletions.
Input:
Point mutation table generated by vcalpd
.
Output:
"point_mutations_ann.tsv" w/ following columns:
- protein_id - protein's accession in the reference genome
- name - common name or abbreviation for the protein
- position - 1-indexed genomic position
- ref_base - nucleotide at the above position in the reference
- var_base - a different allele at this position in some genome
- ref_codon - codon at this position in the protein-coding sequence of reference
- var_codon - modified codon due to nucleotide substitution in some genome
- aa_change - amino acid change due to this substitution
- type - (S)ynonymous or (N)on-synonymous change
- freq - number of genomes w/ the mutation
- genomes - comma-separated list of genome ids with this variant
"deletions_ann.tsv" w/ 4 columns:
- pos - 1-indexed genomic position
- len - length of the deletion
- ann - annotation if it's w/i protein-coding sequence
- freq - number of genomes w/ the deletion
NSVAR
It may be of interest to characterize each isolate in terms of its unique variants and the variants that it shares with the others, for further analyses. These results are summarized by this command. Also, only non-synonymous variants are considered, in the interest of readability of the output table.
Input:
Annotated point mutation table generated by anvpd
.
Output:
A tab-delimited table with 1 row per genome and with following columns, default: "genome_vars.tsv"
- genome - genome id
- #variants - total number of variants in the genome
- #shared - number of shared non-synonymous variants
- #unique - number of unique non-synonymous variants
- shared - comma-separated list of shared variants
- unique - comma-separated list of unique variants
RMSTOP
Selection analysis performed by FUBAR throws an error if sequences with nonsense mutations are included in the MSA. This command removes such sequences from the MSA.
Input:
MSA generated by msaunq
Output:
MSA without sequences with nonsense mutations. default: "genome_aln_sf.fna"
MSAP
This command extracts nucleotide MSAs for all proteins/peptides-encoding regions from the whole-genome reference limited MSA. These MSAs are later used for selection analysis on individual proteins.
Input:
MSA file generated by rmstop
.
A directory path to store output MSAs.
Output:
Nucleotide MSA files of individual protein/peptide-encoding regions. default: "prots_nmsa"
VCALI
Variant Calling for Insertions relative to a reference.
Input:
MSA file generated by msabuild
.
Output:
A tab-delimited table with 1 row per insertion and following columns, default: "insertions.tsv"
- pos - 1-indexed genomic position of the reference base in the immediate left of the insertion
- ref - the reference base at the above position
- id - either the reference base, if no insertion is present, OR the insertion sequence. There is 1 id column for every genome.
DIV
This command calculates Nucleotide Diversity for the whole-genome, as well as for all proteins/peptide-encoding regions. Nucleotide Diversity is the average pairwise-difference per base. Optionally, one can calculate diversity with a sliding window over the entire genome. This is handy in identifying hypervariable regions.
Input:
MSA generated by rmstop
.
MSAs of protein/peptide-encoding regions generated by msap
.
Output:
A comma-delimited table. First row is for the whole-genome and following rows are for other regions. First column is the region's name and second column is for its nucleotide diversity. default: "divs.csv"
TREE
This command builds whole-genome based phylogeny using FastTree.
Input:
whole-genome MSA generated by rmstop
.
Output:
Output tree generated by FastTree in NEWICK format. default: "tree.nwk"
SEL
This command runs HYPHY FUBAR which perform selection analysis on protein-encoding regions by estimating synonymous and non-synonymous rates. It also identifies putative sites under positive selection.
Input:
MSAs generated by msap
.
Phylogeny tree generated by tree
.
Output:
Output files generated by FUBAR.
A comma-delimited table of rates with 1 row per protein and following columns:
- protein - common name or abbreviation for the protein
- exp_subs - expected substitution rate
- syn - synonymous rate
- nonsyn - non-synonymous rate
- dnds - (nonsyn-syn)
A comma-delimited table of sites with 1 row per site and following columns:
- protein - common name or abbreviation for the protein
- site - 1-indexed position in the protein
- syn - site-specific synonymous rate
- nonsyn - site-specific non-synonymous rate
- post_prob - posterior probability (nonsyn > syn)
MAFFT
Cova runs the following MAFFT command. To speed up the process, We use no more than 5 refinement iterations. For the same reason, we have assumed sufficient RAM would be available, and included --nomemsave
option, since wide alignments, by default, triggers a 2X slower algorithm to reduce memory requirement. Furthermore, MAFFT within CoVa can be run in 3 modes:- Standard [Default], Fast, and Ultra.
[standard] mafft --quiet --nomemsave --maxiterate 5 --thread <ncpu> <infile>
[fast] mafft --quiet --retree 2 --thread <ncpu> <infile>
[ultra] mafft --quiet --auto --thread <ncpu> --keeplength --addfragments <infile> <ref>
FastTree
FastTree in cova was built from the source with
- Double-precision: improves branch length precision for highly similar sequences, AND
- OpenMP: allows multi-threading for faster computations
using the following command
gcc -DUSE_DOUBLE -DOPENMP -fopenmp -O3 -finline-functions -funroll-loops -Wall -o FastTree FastTree.c -lm
Cova runs the following FastTree command. With speed and memory consideration, we limit split supports calculation to 100.
FastTree -quiet -nt -mlnni 4 -boot 100
Hyphy
Cova makes use of Hyphy's FUBAR program to perform selection analysis and identify sites under positive selection. It runs FUBAR as below.
hyphy fubar --alignment <msafile> --tree <treefile> --cache <cachefile>
cova also has a few other programs packed along with the main pipeline. The programs can be cosidered accessory but quite useful in an analysis. These are:
preprocess
This command performs some preprocessing on the genomes FASTA file before submitting it to CoVa. It adds the reference, if not already present; makes headers pretty in FASTA files ( especially useful for phylogeny plots) and filters out sequences above a threshold of ambiguous characters ( default: 1%).
extract_metadata
If genomes were downloaded from GenBank or GISAID along with their metadata, this command extract from the metadata, information on collection date and location. The output can be used to annotate phylogeny plots.
plottree
Phylogeny plots are drawn using ETE3 ( which needs to be installed separately). The program is packed with a large number of options to tweak various aspects of the plot. It also includes the minimal annotation of sequence types provided seqtype
output is available. Additionally, if extract_metadata
output is provided, it'll include location and date annotations in the plot as well.