Snplord is a Snakemake based pipeline that automates SNP-based phylogenetic analyses using snippy (https://github.com/tseemann/snippy), Gubbins (https://github.com/sanger-pathogens/gubbins), snp-sites (https://github.com/sanger-pathogens/snp-sites), snp-dists (https://github.com/tseemann/snp-dists) and FastTree2.
This analysis is most suited to comparing closely related haploid genomes for which there is a good quality reference genome available, however contig assemblies may also be used as a reference.
It requires Illumina short read pairs with filenames in format name.R1.fastq.gz/name.R2.fastq.gz and reference genome with filename name.fa (NOT .fasta).
This README is intended for members of the Djordjevic lab who work on the UTS iHPC but the pipeline can easily be used on any machine provided you are using conda as a package manager.
You will need to install Miniconda3 and create an environment
conda create -n snakemake -c bioconda snakemake
Clone the snplord repository into your home or data directory and enter this directory, this will be your working directory for analysis
git clone https://github.com/CJREID/snplord.git
cd snplord
Next you will need to edit the config file (config_files/config.yaml) so snplord knows where to look for input files. The simplest way to do this is to put all your reads in one folder with no subdirectories and add this path to the config file. Specify an output folder at the same level or higher than your input directory. Put your reference genome in another folder and add the path to the enclosing folder to the config file 'ref:' The reference genome must be in fasta format with suffix .fa (NOT .fasta).
To quote Croucher et al,
"Gubbins is an iterative algorithm that uses spatial scanning statistics to identify loci containing elevated densities of base substitutions suggestive of horizontal sequence transfer while concurrently constructing a maximum likelihood phylogeny based on the putative point mutations outside these regions of high sequence diversity."
In other words, it identifies and removes putative recombinant regions from whole genome alignments. Unfortunately, the amount of processing time increases exponentially with each additional sequence. For ~200 E. coli genomes, the Gubbins step takes about 3 days on our HPC and requires a lot of RAM. It is possible this time could be reduced if you are able to optimise thread and RAM usage on your machine.
If you are working with a polyclonal collection of genomes you may wish to use snplord without Gubbins as recombinant regions are less likely to form part of the core alignment generated by snippy's snippy-core
function.
Conversely, if you are working with a clonal collection such as E. coli of the same sequence type, Gubbins is highly recommended to ensure accurate estimation of the population structure.
If you have a very large collection of clonal sequences and are feeling patient, you may wish to attempt to optimise thread and memory usage and run the full pipeline with Gubbins. You can adjust the RAM requirements for Gubbins in the config file. Alternatively, you can try your luck without Gubbins and hope your reviewer isn't a hardcore phylogenomics wizard or consult said hardcore phylogenomics wizard for some sage advice.
You may wish to use the --restart-times
argument to snakemake in case Gubbins fails due to a temporary lack of sufficient memory. Or just set RAM as high as possible in the config file.
snakemake -p --use-conda -s Snakefile
snakemake -p --use-conda -s Snakefile_no_gubbins
Return to the snplord directory and activate your snakemake environment
source activate snakemake
Check the workflow that snakemake intends to run
snakemake -np
If you have no errors then run
snakemake -p --use-conda
Since the pipeline can take a very long time at the Gubbins step, use nohup
to run the computation in the background and pipe your standard output and error output to appropriately named text files like so:
nohup snakemake -p --use-conda --resource mem_mb=64000 >snplord.DDMMYY.out >>snplord.DDMMYY.err &
Don't forget the ampersand at the end of the line.
Output files will be present in six subdirectories of your specified output folder.
-
snippyout
This contains folders for each of your samples aligned to the reference genome.
-
core
This contains the outputs from snippy-core including full and core alignments. You may wish to use the 'core.aln' file to generate a tree, however this workflow takes the full alignment 'full.core.aln' and runs it through recombination filtering with Gubbins before identifying SNPs and inferring the tree.
-
gubbins
This folder contains all the output of the Gubbins analysis.
-
snp-sites
This folder contains the output from snp-sites run on the alignment generated by Gubbins.
-
snp-dists
This folder contains a .csv file containing pairwise SNP comparisons between all the samples based on the core defined by snippy and gubbins.
-
fasttree
This folder contains the final recombination filtered SNP tree.
I'd like to thank Max Cummins (github.com/maxlcummins) for his help with creating this pipeline. I'd also like to acknowledge Torsten Seemann (github.com/tseemann) and the sanger-pathogens group (github.com/sanger-pathogens) for developing programs used in this pipeline.
Price MN, Dehal PS, Arkin AP. FastTree 2--approximately maximum-likelihood trees for large alignments. PLoS One 2010;5(3):e9490.