Variant Calling Workflow Pipeline

Background

Variant calling is widely used in human genetics to identify variants associated with a specific trait, population, or hereditary disease. It uses next-generation sequencing data to identify two main types of variants within a genome of interest: single nucleotide variants/polymorphisms (SNPs) and INDELs (small insertions or deletions).

Whole-exome sequencing data for human chromosome 1 was provided for the accreditation of the icipe node. The data is based on the hg19 GATK 2.8 bundle and was reported to have a mutation rate of 0.0003, a coverage of 50x, and an error rate of 0.005. The variant calling pipeline was trained on two synthetic exome datasets for human chromosome 1. Both datasets contained variants determined from the African genome ERR250949 combined with synthetic variants.

The reads for one training dataset were simulated using Wessim (REF), while those for the second dataset were simulated using a UIUC in-house simulator (unpublished). The latter dataset was expected to contain fewer false negative variants. VCF files containing the expected output for both training datasets were provided along with the datasets.
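Because an expected-output VCF accompanies each training dataset, one way to judge a pipeline run is to compare the called variants against the expected ones. The sketch below is an illustration only, not the method used in the accreditation: the function names `load_variants` and `compare_calls` and the toy records are hypothetical, and variants are matched by exact (chromosome, position, REF, ALT) without any normalization of INDEL representation.

```python
import io


def load_variants(handle):
    """Collect (chrom, pos, ref, alt) keys from VCF text, one key per ALT allele."""
    variants = set()
    for line in handle:
        # Skip header lines ("##..." and "#CHROM...") and blank lines.
        if line.startswith("#") or not line.strip():
            continue
        fields = line.rstrip("\n").split("\t")
        chrom, pos, ref, alts = fields[0], int(fields[1]), fields[3], fields[4]
        for alt in alts.split(","):
            variants.add((chrom, pos, ref, alt))
    return variants


def compare_calls(called, expected):
    """Count true/false positives and false negatives by exact key match."""
    tp = len(called & expected)
    fp = len(called - expected)
    fn = len(expected - called)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return {"tp": tp, "fp": fp, "fn": fn, "precision": precision, "recall": recall}


# Toy records invented for illustration; real files would be opened with open(path).
HEADER = "##fileformat=VCFv4.2\n#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n"
EXPECTED_VCF = HEADER + (
    "chr1\t100\t.\tA\tG\t.\t.\t.\n"
    "chr1\t200\t.\tC\tT\t.\t.\t.\n"
    "chr1\t300\t.\tG\tGA\t.\t.\t.\n"
)
CALLED_VCF = HEADER + (
    "chr1\t100\t.\tA\tG\t50\tPASS\t.\n"
    "chr1\t300\t.\tG\tGA\t40\tPASS\t.\n"
    "chr1\t400\t.\tT\tC\t30\tPASS\t.\n"
)

expected = load_variants(io.StringIO(EXPECTED_VCF))
called = load_variants(io.StringIO(CALLED_VCF))
summary = compare_calls(called, expected)
print(summary)
```

In this toy case two calls match the expected set, one call is extra, and one expected variant is missed. For real data, a dedicated comparison tool that normalizes variant representation would likely be more reliable than exact matching.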

Pipeline

H3ABioNet has developed standard operating procedures (SOPs) for the H3Africa Consortium. The icipe node recently participated in the accreditation exercise, for which it established a pipeline using Jupyter Notebooks, with Conda for package management. Although well documented and reproducible, that pipeline is not easily portable to other systems. This mini-project entails converting the pipeline to either Nextflow or Snakemake. The current pipeline tested various tools for the same task; for example, we used both GATK and Freebayes for variant calling. The pipeline you create should be flexible enough to accommodate the various tools used in the provided pipeline.
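One possible way to keep the converted pipeline flexible across variant callers is to select the command from a single configuration value. The sketch below is a hypothetical design, not the project's implementation: the names `CALLERS` and `build_call_command` are invented here, the GATK form assumes the GATK4 `gatk` launcher (the version used in the original notebooks is not stated), and the file names are placeholders.

```python
def gatk_command(reference, bam, out_vcf):
    """Argument list for GATK HaplotypeCaller (assumes the GATK4 `gatk` launcher)."""
    return ["gatk", "HaplotypeCaller", "-R", reference, "-I", bam, "-O", out_vcf]


def freebayes_command(reference, bam, out_vcf):
    """Argument list for freebayes, writing the VCF to a file instead of stdout."""
    return ["freebayes", "-f", reference, "--vcf", out_vcf, bam]


# Registry mapping a config value to a command builder; adding a new caller
# means adding one function and one entry here.
CALLERS = {
    "gatk": gatk_command,
    "freebayes": freebayes_command,
}


def build_call_command(caller, reference, bam, out_vcf):
    """Return the command for the requested caller, or raise for unknown names."""
    try:
        builder = CALLERS[caller.lower()]
    except KeyError:
        known = ", ".join(sorted(CALLERS))
        raise ValueError(f"Unknown caller {caller!r}; expected one of: {known}") from None
    return builder(reference, bam, out_vcf)


# Placeholder file names for illustration only.
command = build_call_command("freebayes", "chr1.fa", "sample.bam", "sample.vcf")
print(" ".join(command))
```

The resulting list can be passed to `subprocess.run(command, check=True)` or joined into a shell string for a workflow rule, so the workflow itself does not need a separate branch for each tool.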

The pipeline we used for our node accreditation is shown below:

Figure 1: Flowchart of the Human Variant Calling Pipeline. The format of the output files is given in brackets. The tools used in each step are italicized and shown on the right. Double-headed arrows indicate steps that were run more than once.

Project Task

Reproduce the pipeline by setting up the workspace in the system you will be using
Perform downstream analysis after variant calling
Convert the pipeline into Nextflow or Snakemake

To run the pipeline

Follow the instructions in the README.md in the pipeline directory.

Contributors

Michael Landi
Festus Nyasimi
Careen Naitore
Charles Kamonde

Through the project, you need to demonstrate collaborative research skills, informative visualization, and report writing.
