A consensus calling pipeline provided in two languages:
- python package
- bash script
Initially, the pipeline was used to make quick alignments of fastq reads against a reference using smalt. Now it is mainly used for HIV and HCV consensus generation for diagnostics.
It does the following:
- Subsample reads with seqtk (optional)
- Make de novo alignment of sampled reads with velvet
- Align sampled reads (and, in the first iteration only, the de novo contigs in triplicate) against reference with smalt
- Create consensus with freebayes
- Create vcf with lofreq
- Calculate depth with samtools
- If the maximum number of iterations has been reached, call the final consensus sequence using the final vcf file and the given ambiguity threshold; otherwise repeat from step 3
- cov_plot.R can be used to plot the coverage
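The final consensus call in the last pipeline step can be illustrated with a minimal sketch. It assumes that the ambiguity threshold is a minimum allele frequency: every base whose frequency at a position reaches the threshold is kept, and the kept set is written as an IUPAC ambiguity code. The names call_base and call_consensus and the input format are hypothetical, and the rule SmaltAlign actually applies may differ.

```python
# IUPAC nucleotide ambiguity codes, keyed by the set of bases they stand for.
IUPAC = {
    frozenset("A"): "A", frozenset("C"): "C",
    frozenset("G"): "G", frozenset("T"): "T",
    frozenset("AG"): "R", frozenset("CT"): "Y",
    frozenset("CG"): "S", frozenset("AT"): "W",
    frozenset("GT"): "K", frozenset("AC"): "M",
    frozenset("CGT"): "B", frozenset("AGT"): "D",
    frozenset("ACT"): "H", frozenset("ACG"): "V",
    frozenset("ACGT"): "N",
}


def call_base(freqs, threshold):
    """Return the IUPAC code for all bases with frequency >= threshold.

    freqs is a dict mapping a base (A, C, G, T) to its frequency at one
    position. If no base passes the threshold, N is returned.
    (Assumed rule, not taken from the SmaltAlign source.)
    """
    kept = frozenset(base for base, freq in freqs.items() if freq >= threshold)
    return IUPAC.get(kept, "N")


def call_consensus(positions, threshold):
    """Join the per-position calls into one consensus string."""
    return "".join(call_base(freqs, threshold) for freqs in positions)
```

With a threshold of 0.15, a position with 80% A and 20% G would be written as R under this assumed rule; raising the threshold to 0.25 would give A.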
All the necessary references are in the References directory.
To ensure you have all dependencies needed for SmaltAlign installed, you can use the environment.yml file.
First you need to have Conda installed.
With the command conda env create -f <path>/environment.yml you will create a copy of the smaltalign environment.
You enter the environment with the command conda activate smaltalign (and leave it with conda deactivate).
For more information, see Managing environments in the Conda documentation.
To install the package, the classic python setup.py install or pip install . will work.
smaltalign -d <fastq_file_directory> -r <reference_file> [options]
If you would like to run the bash script and you are not sure about the closest reference sequence, run smaltalign_select_ref.sh with a set of probable reference sequences in the <reference_file> in fasta format. It chooses the closest reference sequence from the given set and constructs the consensus sequence using the chosen reference sequence.
smaltalign_select_ref.sh -r <reference_file> [options] <fastq_file/directory>
OPTIONS
-r reference_file
-n INT number of reads (default 200'000)
-i INT iterations (default 4)
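How the closest reference is chosen is not spelled out here. One plausible approach, shown purely as an assumption, is to align a sample of reads against all candidate references at once and pick the reference that attracts the most reads. The function pick_reference and its input (one reference name per aligned read) are hypothetical and are not taken from smaltalign_select_ref.sh.

```python
from collections import Counter


def pick_reference(read_hits):
    """Return the reference name that most reads aligned to.

    read_hits is an iterable with one reference name per aligned read.
    Returns None if no read aligned. (Assumed selection rule.)
    """
    counts = Counter(read_hits)
    if not counts:
        return None
    # most_common(1) returns a list with the single top (name, count) pair
    return counts.most_common(1)[0][0]
```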
If you would like to give a single reference sequence in the <reference_file>, you can run smaltalign.sh.
smaltalign.sh -r <reference_file> [options] <fastq_file/directory>
OPTIONS
-r reference_file (only one reference sequence)
-n INT number of reads (default 200'000)
-i INT iterations (default 4)
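When smaltalign.sh is driven from Python, the command line can be assembled from the options listed above. This is a minimal sketch: the helpers build_command and run_smaltalign are hypothetical, the defaults simply mirror the documented ones, and the script is assumed to be on the PATH.

```python
import subprocess


def build_command(reference, fastq, n_reads=200000, iterations=4,
                  script="smaltalign.sh"):
    """Build the argument list: script -r REF -n INT -i INT FASTQ."""
    return [
        script,
        "-r", str(reference),
        "-n", str(n_reads),
        "-i", str(iterations),
        str(fastq),
    ]


def run_smaltalign(reference, fastq, **kwargs):
    """Run the script; check=True raises CalledProcessError on a non-zero exit."""
    return subprocess.run(build_command(reference, fastq, **kwargs), check=True)
```

Passing the arguments as a list avoids shell quoting problems with file names that contain spaces.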
batch.sh is used to run multiple samples in the current working directory with different references in one batch.
To analyse the results of a Diagnostic sequencing run, the following steps need to be done:
- create a new folder in /data/Diagnostics/experiments/ with the date of the sequencing run (start-date, yymmdd)
- in that new folder create links to the .fastq files you want to analyse (ln -sv) and copy the SampleSheet.csv of that run
- copy the batch.sh file into that new folder
- add the filenames (you can use sampleID_to_filename.xltx) to the empty virus arrays in batch.sh separated by a new line (works if you copy from the excel file)
- activate SmaltAlign environment (source activate smaltalign)
- execute ./batch.sh
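The folder setup in the first steps can also be scripted. The sketch below is only an illustration: the function prepare_run and its parameters are hypothetical, and it covers nothing beyond creating the run folder, linking the .fastq files (the equivalent of ln -sv) and copying SampleSheet.csv and batch.sh.

```python
import shutil
from pathlib import Path


def prepare_run(experiments_dir, run_date, fastq_files, sample_sheet, batch_script):
    """Create <experiments_dir>/<run_date> and populate it for batch.sh.

    run_date is the start date of the sequencing run as yymmdd.
    Returns the path of the run folder.
    """
    run_dir = Path(experiments_dir) / run_date
    run_dir.mkdir(parents=True, exist_ok=True)

    # Symlink each fastq file into the run folder, skipping existing links
    for fastq in map(Path, fastq_files):
        link = run_dir / fastq.name
        if not (link.is_symlink() or link.exists()):
            link.symlink_to(fastq.resolve())

    # shutil.copy keeps the permission bits, so batch.sh stays executable
    shutil.copy(sample_sheet, run_dir / "SampleSheet.csv")
    shutil.copy(batch_script, run_dir / "batch.sh")
    return run_dir
```

Filling the virus arrays in batch.sh is left as a manual step, as described in the list.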
This shell script was written to process Influenza sequences with SmaltAlign:
- iterate over all .fastq.gz files in the current directory
- create a folder for each sample containing segment1-8 subfolders
- run select_ref.py (written by @ozagordi), which selects the best reference sequence for each segment from an Influenza reference database (selected sequences from the NCBI Influenza Virus Database)
- use the best reference sequence to run smaltalign.sh for each segment
- run the R scripts cov_plot.R and wts.R
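The folder layout created in the first two steps can be sketched as follows. The function make_segment_dirs is hypothetical, and naming each sample folder after the file name without its .fastq.gz suffix is an assumption; the shell script may name the folders differently.

```python
from pathlib import Path


def make_segment_dirs(workdir="."):
    """Create <sample>/segment1 ... <sample>/segment8 for every .fastq.gz file.

    Returns the list of sample folders, sorted by file name.
    """
    workdir = Path(workdir)
    created = []
    for fastq in sorted(workdir.glob("*.fastq.gz")):
        # Assumed naming: strip the .fastq.gz suffix to get the sample name
        sample = fastq.name[: -len(".fastq.gz")]
        sample_dir = workdir / sample
        for segment in range(1, 9):
            (sample_dir / f"segment{segment}").mkdir(parents=True, exist_ok=True)
        created.append(sample_dir)
    return created
```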
Usage is the same as in batch.sh except that you don't need to enter the filenames.
wts.R is an R script to combine consensus sequence, variants and coverage for the last iteration of all lofreq.vcf files in a directory.
It saves a _x_WTS.fasta file containing the consensus sequence with wobbles (at a certain threshold x) and a .csv file containing coverage and variant frequencies for every position.
The variant threshold and the minimal coverage have to be adapted manually in the first lines.
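To make the per-position table more concrete, here is a minimal Python sketch of combining coverage and variant frequencies. It assumes samtools depth output (tab-separated reference, position, depth) and a VCF whose INFO column carries an AF field, as lofreq writes it. The function names and the CSV column layout are hypothetical and do not reproduce the output of wts.R.

```python
import csv
import io


def parse_depth(lines):
    """samtools depth lines (chrom, pos, depth) -> {pos: depth}."""
    depth = {}
    for line in lines:
        if line.strip():
            _chrom, pos, cov = line.split("\t")[:3]
            depth[int(pos)] = int(cov)
    return depth


def parse_vcf(lines):
    """VCF lines -> {pos: [(ref, alt, af), ...]}, reading AF from INFO."""
    variants = {}
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and header lines
        fields = line.rstrip("\n").split("\t")
        info = dict(kv.split("=", 1) for kv in fields[7].split(";") if "=" in kv)
        variants.setdefault(int(fields[1]), []).append(
            (fields[3], fields[4], float(info["AF"]))
        )
    return variants


def per_position_csv(depth, variants):
    """Return CSV text with one row per position (illustrative layout)."""
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(["pos", "coverage", "variants"])
    for pos in sorted(depth):
        calls = ";".join(
            f"{ref}>{alt}:{af:g}" for ref, alt, af in variants.get(pos, [])
        )
        writer.writerow([pos, depth[pos], calls])
    return out.getvalue()
```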
cov_plot.R is an R script to plot and save the coverage of all iterations of all .depth files in the working directory.
- Maryam Zaheri*
- Stefan Schmutz
- Osvaldo Zagordi
- Michael Huber**
*maintainer ; **group leader