The goal of this NSF RAPID project is to democratize genome sequence analysis so that researchers can better understand how COVID-19 affects individuals based on their genetic makeup. Using CloudLab, researchers can perform variant analysis on genomes at no charge. There is growing evidence that our genomes may hold answers for fighting COVID-19. A recent genome-wide association study linked individuals' genes and blood type to the risk of severe COVID-19 (NEJM, 2020). Another study from 23andMe hints at links between a person's genes and COVID-19. A recent Nature publication identified genes related to five comorbidities that result in severe COVID-19 infection. The COVID Human Genetic Effort is an international consortium focused on better understanding COVID-19 based on human genomes.
We recently published a data resource containing variant analysis of human genome sequences for COVID-19 research. Please check this out on IEEE DataPort.
- [Variant analysis (single node)]
- [Variant analysis using AVAH (cluster)]
- Create an account on CloudLab by signing up here. Select "Join Existing Project" with `EVA-public` as the project name. (If you already have a CloudLab account, then join the project `EVA-public` after logging in to CloudLab.)
- By signing up, you agree to follow the Acceptable Use Policy of CloudLab.
- After your account is approved, you can log in to your account. Read the CloudLab manual on how to start an experiment.
- After you log in to CloudLab, go to "Manage Account" (by clicking the leftmost dropdown menu) and choose `bash` as the default shell.
- Start an experiment using the profile `EVA-single-node-profile` on CloudLab. (Or just click here.) You will need to select a node/hardware type such as `c8220` (Clemson), `c240g5` (Wisc), etc. The complete list of supported hardware can be found here. Also provide your CloudLab user name. Check the box to agree to using only de-identified data. It will take a few minutes to start the experiment, so please be patient.
- Go to your experiment and, in `Topology View`, click the node icon and open a shell/terminal to connect to that node. Alternatively, you can use SSH to log in to the node: `$ ssh -i /path/to/CloudLab/private_key_file CloudLab_username@CloudLab_hostname`. (You can also run `ssh-agent` on your local machine to add your private key; see the sketch below.) The CloudLab hostname can be found in the `List View` or `Manifest` tabs.
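A minimal `ssh-agent` setup on your local machine, assuming your CloudLab private key is at the path shown:

```
# Start the agent and load your CloudLab private key
$ eval "$(ssh-agent -s)"
$ ssh-add /path/to/CloudLab/private_key_file
# Now you can log in without passing -i each time
$ ssh CloudLab_username@CloudLab_hostname
```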
- Run the following commands in the shell/terminal:
a. Clone the repo in your home directory.
```
$ cd ${HOME}
$ git clone https://github.com/MU-Data-Science/EVA.git
```
b. Set up all the tools such as bwa, samtools, sambamba, FreeBayes, Picard, GATK, etc. Feel free to modify our scripts if you intend to use other tools for variant analysis. (It will take about 6 minutes.)
```
$ ${HOME}/EVA/scripts/setup_tools.sh
```
c. Change to the local block storage directory, as we need ample space for running variant analysis.
```
$ cd /mydata
```
d. If you already have the reference genome built, copy all the files to `/mydata`. Otherwise, set up and index the reference genome (e.g., hs38 (GRCh38), hs37 (GRCh37)). This is a one-time step and can take an hour or more depending on the node hardware type. To avoid killing the process when the SSH session terminates due to disconnection, use the `screen` command.
```
$ screen -S ref
$ ${HOME}/EVA/scripts/setup_reference_genome.sh hs38
```
Press "Ctrl-a" "d" (i.e., Ctrl-a followed by d) to detach from the screen session. To reconnect to the session, use `screen -r`. The above step will create new files `hs38.*` in `/mydata/` (a quick check is sketched below).
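To confirm that indexing finished, list the generated files:

```
# The reference and its index files should all be in /mydata
$ ls -l /mydata/hs38.*
```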
e. Now copy a whole genome (paired-end) sequence to the CloudLab node. It is the user's responsibility to ensure that the data are de-identified prior to storing them on the CloudLab node. Also see the Acceptable Use Policy of CloudLab. Let's use a whole genome sequence sample from The 1000 Genomes Project. The FTP site is ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/phase3/data.
```
$ wget ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR016/ERR016314/ERR016314_1.fastq.gz
$ wget ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR016/ERR016314/ERR016314_2.fastq.gz
```
If you have de-identified sequences on your local machine to analyze, copy them to the CloudLab node using `scp` (see the sketch after this step). Then copy the genome files stored in `Genome_Data` under `/proj/eva-public-PG0/`, which are needed for running variant analysis with FreeBayes and GATK in the next steps:
```
$ ${HOME}/EVA/scripts/copy_1kgenome_files.sh
```
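A sketch of the `scp` upload for your own de-identified sequences (the FASTQ file names here are hypothetical):

```
# Copy local paired-end FASTQ files to the node's block storage
$ scp -i /path/to/CloudLab/private_key_file my_sample_1.fastq.gz my_sample_2.fastq.gz CloudLab_username@CloudLab_hostname:/mydata/
```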
f. There are three variant analysis scripts. The first one (`run_variant_analysis_fbayes_basic.sh`) performs alignment, sorting, marking of duplicates, and variant calling using FreeBayes. The other two perform alignment, sorting, marking of duplicates, variant calling, and Base Quality Score Recalibration (BQSR) based on GATK best-practice workflows: `run_variant_analysis_fbayes.sh` uses FreeBayes for variant calling, while `run_variant_analysis_gatk.sh` uses GATK's HaplotypeCaller. Run a variant analysis script by passing the required arguments. (See the usage statement of the script.) Here are three examples.
```
$ ${HOME}/EVA/scripts/run_variant_analysis_fbayes_basic.sh hs38 ERR016314_1.fastq.gz ERR016314_2.fastq.gz
```
OR
```
$ ${HOME}/EVA/scripts/run_variant_analysis_fbayes.sh hs38 ERR016314_1.fastq.gz ERR016314_2.fastq.gz
```
OR
```
$ ${HOME}/EVA/scripts/run_variant_analysis_gatk.sh hs38 ERR016314_1.fastq.gz ERR016314_2.fastq.gz
```
Note that for the GATK pipeline, a dummy read group is added to the `.bam` file.

g. The variants are stored in `.output.vcf` files. If BQSR is used, the file ending with `BQSR-output.vcf` contains the variants computed on the analysis-ready reads. Download to your local machine using `scp`.
```
$ scp -i /path/to/CloudLab/private_key_file CloudLab_username@CloudLab_hostname:/path/to/the/output_VCF_file .
```
Any `.vcf` file can be further processed using readily available tools such as scikit-allel, VCFtools, and GATK. You can also use visualization tools such as IGV. (A quick sanity check on a VCF is sketched after Step h.)

h. To perform variant analysis on more genome sequences, go to Step e.
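Returning to Step g: before downloading an output VCF, you can count its variant records with standard shell tools as a quick sanity check:

```
# Header lines in a VCF start with '#'; everything else is a variant record
$ grep -vc "^#" *output.vcf
```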
To start a named `screen` session:
```
$ screen -S my_session_name
```
Press "Ctrl-a" "d" (i.e., Ctrl-a followed by d) to detach from the screen session. To reattach, do either of the following.
```
$ screen -r my_session_name
```
OR
```
$ screen -r
```
To list the screen sessions, type the following.
```
$ screen -ls
```
We are using AVAH [CIKM '21], Apache Spark, Apache Hadoop, and ADAM-Cannoli or GATK4.
- Create and set up a 16-node cluster on CloudLab, following the steps described here. (The nodes are named `vm0`, `vm1`, ..., `vm15`.)
- Connect to `vm0` using SSH to open a terminal/shell.
- Set up the reference genome in `/mydata` on all nodes.
```
$ ${HOME}/EVA/scripts/copy_files.sh 16
```
Use of `screen` is recommended, as copying the files can take more than 15 minutes (see the sketch below).
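For example, a minimal way to run the copy inside a detachable session (the session name `refcopy` is arbitrary):

```
# Start a named screen session, then launch the copy inside it
$ screen -S refcopy
$ ${HOME}/EVA/scripts/copy_files.sh 16
# Detach with Ctrl-a d; reattach later with: screen -r refcopy
```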
- On `vm0`, clone the AVAH repo.
```
$ cd ; git clone https://github.com/raopr/AVAH.git
$ cp ${HOME}/AVAH/misc/sampleIDs-vlarge.txt /proj/eva-public-PG0/${USER}-sampleIDs-vlarge.txt
$ cp ${HOME}/AVAH/misc/sampleURLs-vlarge.txt /proj/eva-public-PG0/${USER}-sampleURLs-vlarge.txt
```
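To peek at the sample lists you just staged:

```
# Show the first few sample IDs and sequence URLs
$ head /proj/eva-public-PG0/${USER}-sampleIDs-vlarge.txt
$ head /proj/eva-public-PG0/${USER}-sampleURLs-vlarge.txt
```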
- If using GATK4, make the changes mentioned on this page.
- If using ADAM-Cannoli, run the following command to create the required files for BQSR/indel realignment. The known SNPs and INDELs folders will be created on HDFS.
```
$ ${HOME}/EVA/scripts/convert_known_snps_indels_to_adam.sh 16
```
This step will take several minutes to complete.
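Once it completes, you can verify on HDFS that the new folders appeared (the exact folder names are created by the script):

```
# List the HDFS root and look for the known SNPs/INDELs folders
$ hdfs dfs -ls /
```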
- Run variant analysis using AVAH. For usage, type:
```
$ ${HOME}/AVAH/scripts/run_variant_analysis_at_scale.sh -h
```
a. For ADAM-Cannoli (w/ BWA and FreeBayes):
- If sequences need to be downloaded to HDFS, then run the following command:
```
$ hdfs dfs -rm -r /tmp/logs; hdfs dfs -rm -r /spark-events/*; hdfs dfs -rm /*.fastq.gz
$ ${HOME}/AVAH/scripts/run_variant_analysis_at_scale.sh -i /proj/eva-public-PG0/${USER}-sampleIDs-vlarge.txt -d /proj/eva-public-PG0/${USER}-sampleURLs-vlarge.txt -n 16 -b 2 -p 15 -P H -B
```
The file `${USER}-sampleIDs-vlarge.txt` contains the IDs of the sequences and their sizes (e.g., one FASTQ file size). The file `${USER}-sampleURLs-vlarge.txt` contains the URLs of the (paired-end) FASTQ files. You can also download the sequences directly to CloudLab nodes using `wget` and then copy them to HDFS (see the sketch after this list). This is recommended if the above script does not download the sequences correctly due to issues with CloudLab's network. After that, you can run the step below.
- If sequences are already present in HDFS (and correctly downloaded), then run the following command:
```
$ hdfs dfs -rm -r /tmp/logs; hdfs dfs -rm -r /spark-events/*
$ ${HOME}/AVAH/scripts/run_variant_analysis_at_scale.sh -i /proj/eva-public-PG0/${USER}-sampleIDs-vlarge.txt -d NONE -n 16 -b 2 -p 15 -P H -B
```
- To skip BQSR/indel realignment, drop the `-B` option.
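A minimal sketch of the manual download route, reusing the 1000 Genomes sample from the single-node section; placing the files at the HDFS root is an assumption based on the `hdfs dfs -rm /*.fastq.gz` cleanup command above:

```
# Download the paired-end FASTQ files to the node
$ wget ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR016/ERR016314/ERR016314_1.fastq.gz
$ wget ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR016/ERR016314/ERR016314_2.fastq.gz
# Copy them into HDFS (destination assumed to be the HDFS root)
$ hdfs dfs -put ERR016314_*.fastq.gz /
```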
b. For GATK4 (w/ BWA and HaplotypeCaller):
- If sequences need to be downloaded to HDFS, then run the following command:
```
$ hdfs dfs -rm -r /tmp/logs; hdfs dfs -rm -r /spark-events/*; hdfs dfs -rm /*.fastq.gz
$ ${HOME}/AVAH/scripts/run_variant_analysis_at_scale.sh -i /proj/eva-public-PG0/${USER}-sampleIDs-vlarge.txt -d /proj/eva-public-PG0/${USER}-sampleURLs-vlarge.txt -n 16 -b 2 -p 15 -P H -G
```
The file `${USER}-sampleIDs-vlarge.txt` contains the IDs of the sequences and their sizes (e.g., one FASTQ file size). The file `${USER}-sampleURLs-vlarge.txt` contains the URLs of the (paired-end) FASTQ files. You can also download the sequences directly to CloudLab nodes using `wget` and then copy them to HDFS. This is recommended if the above script does not download the sequences correctly due to issues with the network/FTP servers. After that, you can run the step below.
- If sequences are already present in HDFS (and correctly downloaded), then run the following command:
```
$ hdfs dfs -rm -r /tmp/logs; hdfs dfs -rm -r /spark-events/*
$ ${HOME}/AVAH/scripts/run_variant_analysis_at_scale.sh -i /proj/eva-public-PG0/${USER}-sampleIDs-vlarge.txt -d NONE -n 16 -b 2 -p 15 -P H -G
```
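Before launching in `-d NONE` mode, you can confirm the sequences are actually in HDFS (the root location is assumed, matching the cleanup commands above):

```
# List the FASTQ files currently stored in HDFS
$ hdfs dfs -ls /*.fastq.gz
```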
- To re-execute failed sequences during variant analysis, use the `-e` option.
- For changing the YARN settings, refer to this page.
- The `.vcf` files containing raw variants will be stored in HDFS. Check the files using this command: `hdfs dfs -ls /*.vcf`.
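To copy the raw variant files out of HDFS to the local file system for downstream tools:

```
# Fetch all VCFs from the HDFS root into the current directory
$ hdfs dfs -get /*.vcf .
```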
We used Apache Spark to parallelize the execution of de novo assembly using ABySS. Multiple assemblies will be run on each cluster node using a combination of HDFS and local file systems. Assuming the cluster has been set up to run variant analysis, do the following:
- Log in to `vm0`. The two files `sampleIDs.txt` and `sampleURLs.txt` can be used, or you can create your own files. The file `sampleIDs.txt` contains one run accession ID per line. The file `sampleURLs.txt` contains the URLs of the paired-end FASTQ sequences.
- Execute the following command for a k-mer length of `31` (default: `51`):
```
$ ${HOME}/EVA/scripts/run_denovo.sh ${HOME}/EVA/misc/sampleIDs.txt ${HOME}/EVA/misc/sampleURLs.txt 31
```
- Once this task completes, the `.fa` files will be written to HDFS. Check the `.log` file written to `/mydata`.
- To re-run with a different k-mer length while avoiding downloading the FASTQ files again, try this:
```
$ ${HOME}/EVA/scripts/run_denovo.sh ${HOME}/EVA/misc/sampleIDs.txt NONE 41
```
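To spot-check a run afterwards (the HDFS root as the location of the `.fa` files is an assumption; adjust to where your deployment writes them):

```
# Most recent assembly logs on local storage
$ ls -lt /mydata/*.log
# Assembled contigs written to HDFS
$ hdfs dfs -ls /*.fa
```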
- Praveen Rao and Arun Zachariah. Enabling Large-Scale Human Genome Sequence Analysis on CloudLab. In IEEE INFOCOM Workshops: Computer and Networking Experimental Research using Testbeds (CNERT 2022), 2 pages, 2022. [PDF]
- Praveen Rao, Arun Zachariah, Deepthi Rao, Peter Tonellato, Wesley Warren, and Eduardo Simoes. Accelerating Variant Calling on Human Genomes Using a Commodity Cluster. In Proc. of 30th ACM International Conference on Information and Knowledge Management (CIKM), pages 3388-3392, Australia, 2021. [PDF] Nominated for Best Short Paper Award.
- Shivika Prasanna and Praveen Rao. A Data Science Perspective of Real-World COVID-19 Databases. In Leveraging Artificial Intelligence in Global Epidemics, pages 133-163, Elsevier, 2021. (Editors: Gruenwald, Jain, Groppe) [PDF]
The 1000 Genomes Project provides a few thousand human whole genome sequences that can be used to test variant analysis tasks at scale. However, these sequences are not from COVID-19 studies. The COVID-19 Data Portal has 700+ human genome sequences from COVID-19 studies that use the RNA sequencing (RNA-Seq) technique.
Note: For AVAH's performance evaluation using ADAM-Cannoli, please refer to our publications.
Please report any issues here.
Faculty: Drs. Praveen Rao (PI), Deepthi Rao, Peter Tonellato, Wesley Warren, and Eduardo Simoes
Ph.D. Students: Arun Zachariah, Shivika Prasanna
This work is supported by the National Science Foundation under Grant No. 2034247.
- https://github.com/ekg/alignment-and-variant-calling-tutorial
- https://github.com/biod/sambamba
- https://github.com/lh3/bwa
- https://github.com/ekg/freebayes
- https://github.com/samtools/samtools
- https://docs.brew.sh/Homebrew-on-Linux
- https://spark.apache.org
- https://hadoop.apache.org
- http://bdgenomics.org/
- https://cloudlab.us/
- https://biohpc.cornell.edu/lab/doc/Variant_workshop_Part1.pdf
- https://github.com/broadinstitute/picard
- https://github.com/broadinstitute/gatk
- https://github.com/bcgsc/abyss
- https://gencore.bio.nyu.edu/variant-calling-pipeline-gatk4