The input for this pipeline are bam files generated by an Ion Torrent S5 sequencer and aligned with Torrent Suite's Torrent Mapping Alignment Program (TMAP).
We use a padded reference genome where the first 400 bases in the HPV31 genome are copied and pasted to the end of the genome. This allows for reads to map fully across the URR portion of the circular genome. Variants called in the padded region are later adjusted to their standard location. Amplicon regions can be found in HPV31.amplicon.bed.
The main Snakefile contains the following steps and rules:
- Merge bams if a sample was sequenced more than once.
- Filter out reads with low mapping quality.
- Call variants using Torrent Suite's Torrent Variant Caller.
- Adjust position of variants called in the padded region to their standard location (e.g. Position 8012 is adjusted to position 100).
- Bam files are filtered to remove primers and non-HPV31 reads.
- A fasta file is generated for each sample using SNVs called in the vcf. This file is used for multiple-sequence alignemnts and phylogenetic tress, so indels are not included.
- Virus genomes with at least 4x coverage across >= 25% of the genome are compiled into a single fasta file.
The subworkflow qc_Snakefile runs FASTQC on the bam files before and after mapping quality filtering. The FASTQC output is summarized and reported using MultiQC.
The subworkflow annotation_Snakefile uses snpEff to annotate the vcf files with HPV31.genes.gtf. A summary report is compiled using MultiQC.
All input files and references are specified in hpv_config.yaml. HPV31-specific files are found in the reference directory.
The workflow inputs aligned bams as designated in the hpv_config.yaml file.
The workflow outputs an annotated vcf file, a bam file filtered for just the amplicon regions, and a consensus fasta file for each sample.