Pipeline is used to process SARS-CoV19 WGS data.
- The analysis begins with quality control of raw fastq files:
- Adapter sequences are removed using cutadapt[1] (screening for default illumina adapter sequences)
- Reads are trimmed based on base quality (Phred score >= 30) and length (read length >= 50bp) using fastp[2]
- Reads that passed QC are aligned to the reference using bwa[3].
- The output from bwa[3] is passed to samtools[6] to produce binary alignment map (bam) file.
- Primer sequences are removed from bam file using ivar[4].
- Primer sequences should be provided in Browser Extensible Data (bed) format - generated from fasta file using bwa[3] & bedtools[5].
- To improve alignment quality, local realignment is performed on primer-free bam file using abra[9].
- bed file with realignment targets is created from primer-free bam file using bedtools[5].
- Alignment QC metrics are extracted from raw and primer-free bam files using samtools[6].
- Variant-calling is performed on post-realignment bam file using freebayes[7].
- Raw variants are filtered using vcflib/vcffilter[10] based on quality (QUAL > 30) and sequencing depth (DP > 15).
- Filtered variants are annotated using snpEff[8], using genbank reference.
- Consensus sequence is generated (in fasta format) from post-realignment bam file using ivar[4].
- invalid base (N) is called if coverage is less that 15.
- Sample id is added to fasta header and invalid bases are replaced with N using bash scripts.
- Annotated vcf file is converted to csv format and coverage depth plot is generated from sequencing depth data using inhouse-developed python scripts.
- Temporary files are deleted after each sample is processed.
- Lineage assignment is performed based on consensus sequence using Pangolin[11].
- Inhouse-developed python & bash scripts are used to control the flow of analysis for multiple samples, generate summary report and visualize the results.
- cutadapt 2.31 - https://doi.org/10.14806/ej.17.1.200
- fastp 0.20.1 - https://doi.org/10.1093/bioinformatics/bty560
- bwa 0.7.17-r1198-dirty - https://arxiv.org/abs/1303.3997
- ivar 1.3.1 - https://doi.org/10.1186/s13059-018-1618-7
- bedtools v.2.30.00 - https://doi.org/10.1093/bioinformatics/btq033
- samtools 1.12 - https://doi.org/10.1093/bioinformatics/btp352
- freebayes v0.9.21 - https://arxiv.org/abs/1207.3907
- snpEff 5.0e - https://pcingola.github.io/SnpEff/adds/SnpEff_paper.pdf
- abra 0.97 - https://doi.org/10.1093/bioinformatics/btu376
- vcflib 1.0.2 - https://doi.org/10.1101/2021.05.21.445151
- Pangolin - https://doi.org/10.1038/s41564-020-0770-5