btb-seq
is the pipeline for APHA's processing of raw Mycobacterium bovis Whole Genome Sequencing (WGS) data. The pipeline uses nextflow to process batches (1 or more samples) of paired-end fastq.gz
read files generated on an Illumina sequencer.
To get started quickly, just install Nextflow and Docker and run the pipeline using the following:
nextflow run APHA-CSU/btb-seq -with-docker aphacsubot/btb-seq --reads='path/to/input/directory' --outdir='path/to/output/directory'
This will pull the code from this github repository and run the analysis using the pre-prepared docker image containing all required dependencies here.
To run the pipeline on a batch of samples, a directory containing raw .fastq.gz
files is required (and defined using --reads
). Each read-pair sample is represented by a pair of files named *_R1*.fastq.gz
and *_R2*.fastq.gz
. For example, to batch two samples named bovis-a
and bovis-b
, a directory containing bovis-a_R1.fastq.gz
, bovis-a_R2.fastq.gz
, bovis-b_R1.fastq.gz
and bovis-b_R2.fastq.gz
, is required. This can be also be an AWS s3 uri (denoted using s3://..
) if required
Pipeline output is stored in a results directory (which can also be an s3 uri) that contains:
- A summary csv file (
FinalOut.csv
) that contains theOutcome
(see below), WGS Group (Clade) and other high-level metrics for each sample. - Consensus
fasta
files - Mapped
.bam
files - Variant call
.vcf
files - Kraken2/Bracken classification of non-M. bovis contaminants
Clone this github repository
git clone https://github.com/APHA-CSU/btb-seq
To install the software dependencies required in ubuntu, run
bash install.bash
This script installs the following dependencies and adds symlinks to the $PATH
:
nextflow
FastUniq
Trimmomatic
bwa
samtools
andbcftools
bedtools
Kraken2
(and database)Bracken
The pipeline processes data in several stages, as shown below. During the pre-processing stage duplicate reads, low quality bases and adapter sequences are removed from the fastq sample files. Following this, pre-processed reads are mapped to a reference genome (M. bovis AF2122), variant calling is performed, regions of poor mapping (both pre-defined and on a per-sample basis) are masked and the consensus genome sequence for the sample is generated. Samples are also assigned to a "Clade", representing M. bovis lineages known to be circulating in GB, based on sequence variation at ~3000 position in the genome. Data quality assessment assigns an "Outcome" to each sample by analysing data gathered during the pre-processing and alignment stages. The following "Outcomes" are used to signify subsequent lab processing steps:
- Pass: The sample contains a known M. bovis WGS Cluster.
- Contaminated: The sample contains contaminants
- Insufficient Data: The sample contains insufficient data to allow accurate identification of M. bovis
- Check Required: Further scrutiny of the output is needed as quality thresholds fall below certain criteria but is likely to contain M. bovis.
This pipeline has been accredited by the UK Accreditation Service (UKAS) to ISO17025 standard. It has also been internally validated, tested and approved against a dataset in excess of 10,000 samples that have been sequenced at APHA.
The automated tests provided here ensure the software runs as expected. When changes are made to the code, these tests verify the pipeline is behaving as intended. The tests are automatically run by .circleci
on each commit to the github repository.
To run a test
bash tests/jobs/NAME_OF_TEST.bash
A number of small tests that asserts the functionality of individual components
Asserts the Outcome
and WGS_CLUSTER
(clade) against samples uploaded by APHA to ENA.
The limit of detection test ensures mixtures of M. avium and M. bovis at varying proportions give the correct Outcome. This is performed by taking random reads from reference samples of M. bovis and M. avium.
M. bovis (%) | M. avium (%) | Outcome |
---|---|---|
100% | 0% | Pass |
65% | 35% | Pass |
60% | 40% | CheckRequired |
0% | 100% | Contaminated |
The quality test ensures that low quality reads (<20) are not considered for variant calling and genotyping. This is performed by setting uniform quality values to a real-world M. bovis sample and asserting output. Low quality bases are removed from the sequence using Trimmomatic
, which uses a sliding window that deletes reads when the average base quality drops below 20. A table of expected results is shown below.
Base Quality | Outcome |
---|---|
19 | LowQualData |
20 | Pass |