GitHub - SWEEP-Inc/GVCF: SWEEP - Joint variant calling workflow based on GATK best practices.

Published work

Article source: John A, Muenzen K, Ausmees K (2021) Evaluation of serverless computing for scalable execution of a joint variant calling workflow. PLOS ONE 16(7): e0254363. https://doi.org/10.1371/journal.pone.0254363

Workflow Description

This SWEEP workflow (referred to as GVCF from here onwards) implements the joint variant calling workflow based on GATK Best Practices [1].

The GATK best-practice joint variant calling pipeline is implemented as a SWEEP workflow comprising 18 tasks. The workflow starts by preparing the per-sample metadata for the entire population, which is required to orchestrate subsequent tasks and is propagated onwards. Tasks 2-6 then run in parallel, preparing one-time index files from the reference sequence and known SNP/indel files. The remaining tasks each rely on the successful completion of their predecessor and run sequentially. Tasks 7-13 are scattered by sample and produce QC'd GVCF files from the paired-end read files for each sample. Task 14 takes inventory of all GVCF files that have been successfully produced by task 12 and defines the separate task variables for task 15, which is scattered by chromosome. Joint variant calling is performed for chromosomes 1-22 in separate container tasks, and the VCF outputs of each task are stitched together by the Picard GatherVcfs function in task 17. The final output of the workflow is a single joint VCF file that contains SNP and indel information for each sample included in the workflow.
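The scatter structure described above can be sketched in a few lines of Python. This is only an illustration of the ordering and fan-out, not the SWEEP workflow definition itself: the function name `plan_stages`, the stage labels, and the sample IDs are hypothetical, and only the tasks that the description names explicitly are included.

```python
from typing import List, Tuple

# Joint calling is scattered over chromosomes 1-22 (sex chromosomes excluded).
CHROMOSOMES = [str(n) for n in range(1, 23)]


def plan_stages(samples: List[str]) -> List[Tuple[str, List[str]]]:
    """Return (stage label, work units) pairs in execution order.

    Each work unit would run as its own task instance; a stage with a
    single "all" unit runs once for the whole population.
    """
    return [
        ("task 1: per-sample metadata", ["all"]),
        # Tasks 2-6 are independent of each other and run in parallel.
        ("tasks 2-6: one-time index files", [f"task-{t}" for t in range(2, 7)]),
        # One chain of tasks 7-13 per sample.
        ("tasks 7-13: per-sample GVCF", list(samples)),
        ("task 14: GVCF inventory", ["all"]),
        # One joint-calling task per chromosome.
        ("task 15: joint calling", CHROMOSOMES),
        ("task 17: gather VCFs", ["all"]),
    ]


# Hypothetical sample IDs, used only to show the fan-out per stage.
for label, units in plan_stages(["sampleA", "sampleB", "sampleC"]):
    print(f"{label:35s} {len(units):3d} unit(s)")
```

With this layout the per-sample stage grows with the population size, while the joint-calling stage always fans out to 22 units.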

Each task in GVCF has a corresponding folder; in each task folder you will find a Docker Compose file and a batch script to create the Docker image. Shell scripts with a sample invocation that you can use can be found in the 'tasks' folder (the root folder for all the tasks).

Our Use Case

Joint variant calling of 1000 Genomes [2] data with 62 individuals. The files are part of the AWS Open Data initiative and are located at http://1000genomes.s3.amazonaws.com/phase3/data/. A DAG for a run with 14 individuals can be found here: DAG 14 individuals.
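To see what is available under that location, one option is the standard S3 REST listing request (ListObjectsV2). The sketch below only builds the request URL and parses a listing response; it does not contact the network. It assumes that the bucket permits anonymous listing and that each individual has its own folder directly under phase3/data/, neither of which is stated in this README. The function names are hypothetical.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode

BUCKET_URL = "http://1000genomes.s3.amazonaws.com/"
DATA_PREFIX = "phase3/data/"
# XML namespace used by S3 listing responses.
S3_NS = "{http://s3.amazonaws.com/doc/2006-03-01/}"


def listing_url(prefix=DATA_PREFIX, max_keys=1000):
    """Build a ListObjectsV2 URL that returns the 'folders' under prefix."""
    query = urlencode({
        "list-type": "2",
        "prefix": prefix,
        "delimiter": "/",   # group keys by the next path component
        "max-keys": max_keys,
    })
    return BUCKET_URL + "?" + query


def folder_names(listing_xml, prefix=DATA_PREFIX):
    """Extract the folder names (CommonPrefixes) from a listing response."""
    root = ET.fromstring(listing_xml)
    path = f"{S3_NS}CommonPrefixes/{S3_NS}Prefix"
    return [node.text[len(prefix):].rstrip("/") for node in root.findall(path)]


print(listing_url())
```

Fetching the printed URL (for example with `urllib.request.urlopen`) and passing the response body to `folder_names` would give the folder names under the prefix, if the assumptions above hold. A single response returns at most `max_keys` entries, so a complete listing may need the continuation token from the response.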

Implementation

SWEEP tasks are pre-deployed to the cloud provider, and SWEEP provides the workflow management backbone, but you can run the workflow without SWEEP.

Setup

Push the base image to your Docker Hub account (yourDockerHubId). Once pushed, update the repository location in all your task Docker Compose files.
In the shell scripts, an S3 bucket is used as a handoff between the tasks. Wherever s3://yourS3bucket appears in a shell script, change it to an S3 bucket you have read-write access to.
The base Docker Compose file has placeholders for AWS credentials that need to be replaced, namely &
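The bucket substitution can be scripted rather than done by hand. The following is a minimal sketch, assuming the shell scripts live under the 'tasks' folder and end in `.sh` (the extension is an assumption); the function name and the example bucket name are hypothetical.

```python
from pathlib import Path

# Placeholder string used in the shell scripts, as described in the setup steps.
PLACEHOLDER = "s3://yourS3bucket"


def point_scripts_at_bucket(tasks_dir, bucket_uri, pattern="*.sh"):
    """Replace the placeholder bucket in every matching script.

    Returns the list of files that were modified. Files are rewritten
    in place, so run this on a clean checkout you can diff afterwards.
    """
    changed = []
    for script in sorted(Path(tasks_dir).rglob(pattern)):
        text = script.read_text()
        if PLACEHOLDER in text:
            script.write_text(text.replace(PLACEHOLDER, bucket_uri))
            changed.append(script)
    return changed


if __name__ == "__main__":
    # "s3://my-handoff-bucket" is a made-up name; use a bucket you can write to.
    for path in point_scripts_at_bucket("tasks", "s3://my-handoff-bucket"):
        print("updated", path)
```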

Open Issues

SWEEP on AWS/Azure is limited to 62 individuals.
Sex chromosomes are excluded in the original runs.

Benchmarking

We have run it on AWS and Azure. The instructions in this repository are for AWS, but the workflow can be run on Azure as well.

Running without SWEEP Platform

Each of the tasks mentioned in the workflow can be run independently using Docker. Function tasks can be run by invoking the main Python file with Python 3.6 or higher; all the dependent files are in the corresponding folder as well.
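For function tasks, a small wrapper along the following lines could launch a task's main Python file from its own folder, so that the dependent files next to it are found. This is a sketch under assumptions: the entry-point file name `main.py`, the function names, and the example folder path are hypothetical, so check each task folder for the actual file name and arguments.

```python
import subprocess
import sys
from pathlib import Path

MIN_PYTHON = (3, 6)  # minimum version stated for function tasks


def task_command(task_dir, entry_point="main.py", args=()):
    """Build the command that runs a function task with the current interpreter."""
    if sys.version_info < MIN_PYTHON:
        raise RuntimeError("function tasks need Python 3.6 or higher")
    script = (Path(task_dir) / entry_point).resolve()
    return [sys.executable, str(script), *args]


def run_task(task_dir, entry_point="main.py", args=()):
    """Run the task inside its own folder and raise if it exits with an error."""
    command = task_command(task_dir, entry_point, args)
    return subprocess.run(command, cwd=str(Path(task_dir).resolve()), check=True)


if __name__ == "__main__":
    # "tasks/some_function_task" is a placeholder path, not a real folder name.
    print(task_command("tasks/some_function_task"))
```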

Running on SWEEP Platform

Please sign up at https://sweep.run

API (Python)

SWEEP workflows can be invoked via API. For more information, please visit our documentation.

Issues

If you have input or feedback, or you find issues, please report them through the Issues feature of this GitHub repository.

References

[1]: Germline short variant discovery (SNPs + Indels)
[2]: A global reference for human genetic variation
