Skip to content
/ snplord Public

Snakemake pipeline automating SNP phylogeny analysis

Notifications You must be signed in to change notification settings

CJREID/snplord

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

28 Commits
 
 
 
 
 
 
 
 

Repository files navigation

snplord

Snplord is a Snakemake based pipeline that automates SNP-based phylogenetic analyses using snippy (https://github.com/tseemann/snippy), Gubbins (https://github.com/sanger-pathogens/gubbins), snp-sites (https://github.com/sanger-pathogens/snp-sites), snp-dists (https://github.com/tseemann/snp-dists) and FastTree2.

This analysis is most suited to comparing closely related haploid genomes for which there is a good quality reference genome available, however contig assemblies may also be used as a reference.

It requires Illumina short read pairs with filenames in format name.R1.fastq.gz/name.R2.fastq.gz and reference genome with filename name.fa (NOT .fasta).

This README is intended for members of the Djordjevic lab who work on the UTS iHPC but the pipeline can easily be used on any machine provided you are using conda as a package manager.

Installation

You will need to install Miniconda3 and create an environment

conda create -n snakemake -c bioconda snakemake

Clone the snplord repository into your home or data directory and enter this directory, this will be your working directory for analysis

git clone https://github.com/CJREID/snplord.git
cd snplord

Prepare input files

Next you will need to edit the config file (config_files/config.yaml) so snplord knows where to look for input files. The simplest way to do this is to put all your reads in one folder with no subdirectories and add this path to the config file. Specify an output folder at the same level or higher than your input directory. Put your reference genome in another folder and add the path to the enclosing folder to the config file 'ref:' The reference genome must be in fasta format with suffix .fa (NOT .fasta).

To Gubb or not to Gubb?

To quote Croucher et al,

"Gubbins is an iterative algorithm that uses spatial scanning statistics to identify loci containing elevated densities of base substitutions suggestive of horizontal sequence transfer while concurrently constructing a maximum likelihood phylogeny based on the putative point mutations outside these regions of high sequence diversity."

In other words, it identifies and removes putative recombinant regions from whole genome alignments. Unfortunately, the amount of processing time increases exponentially with each additional sequence. For ~200 E. coli genomes, the Gubbins step takes about 3 days on our HPC and requires a lot of RAM. It is possible this time could be reduced if you are able to optimise thread and RAM usage on your machine.

If you are working with a polyclonal collection of genomes you may wish to use snplord without Gubbins as recombinant regions are less likely to form part of the core alignment generated by snippy's snippy-core function.

Conversely, if you are working with a clonal collection such as E. coli of the same sequence type, Gubbins is highly recommended to ensure accurate estimation of the population structure.

If you have a very large collection of clonal sequences and are feeling patient, you may wish to attempt to optimise thread and memory usage and run the full pipeline with Gubbins. You can adjust the RAM requirements for Gubbins in the config file. Alternatively, you can try your luck without Gubbins and hope your reviewer isn't a hardcore phylogenomics wizard or consult said hardcore phylogenomics wizard for some sage advice.

With Gubbins

You may wish to use the --restart-times argument to snakemake in case Gubbins fails due to a temporary lack of sufficient memory. Or just set RAM as high as possible in the config file.

snakemake -p --use-conda -s Snakefile

Without Gubbins

snakemake -p --use-conda -s Snakefile_no_gubbins

Running snplord

Return to the snplord directory and activate your snakemake environment

source activate snakemake

Check the workflow that snakemake intends to run

snakemake -np 

If you have no errors then run

snakemake -p --use-conda 

Since the pipeline can take a very long time at the Gubbins step, use nohup to run the computation in the background and pipe your standard output and error output to appropriately named text files like so:

nohup snakemake -p --use-conda --resource mem_mb=64000 >snplord.DDMMYY.out >>snplord.DDMMYY.err &

Don't forget the ampersand at the end of the line.

Output

Output files will be present in six subdirectories of your specified output folder.

  1. snippyout

    This contains folders for each of your samples aligned to the reference genome.

  2. core

    This contains the outputs from snippy-core including full and core alignments. You may wish to use the 'core.aln' file to generate a tree, however this workflow takes the full alignment 'full.core.aln' and runs it through recombination filtering with Gubbins before identifying SNPs and inferring the tree.

  3. gubbins

    This folder contains all the output of the Gubbins analysis.

  4. snp-sites

    This folder contains the output from snp-sites run on the alignment generated by Gubbins.

  5. snp-dists

    This folder contains a .csv file containing pairwise SNP comparisons between all the samples based on the core defined by snippy and gubbins.

  6. fasttree

    This folder contains the final recombination filtered SNP tree.

Acknowledgments

I'd like to thank Max Cummins (github.com/maxlcummins) for his help with creating this pipeline. I'd also like to acknowledge Torsten Seemann (github.com/tseemann) and the sanger-pathogens group (github.com/sanger-pathogens) for developing programs used in this pipeline.

Citations

Price MN, Dehal PS, Arkin AP. FastTree 2--approximately maximum-likelihood trees for large alignments. PLoS One 2010;5(3):e9490.

About

Snakemake pipeline automating SNP phylogeny analysis

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Languages