The goal of this NSF RAPID project is to democratize genome sequence analysis so that researchers can better understand how COVID-19 affects individuals based on their genetic makeup. Using CloudLab, researchers can perform variant analysis on genomes at no charge. There is growing evidence that our genomes may hold answers for fighting COVID-19. A recent genome-wide association study linked individuals' genes and blood type to the risk of severe COVID-19 (NEJM, 2020). Another study from 23andMe hints at links between a person's genes and COVID-19. A recent Nature publication identified genes related to five comorbidities that result in severe COVID-19 infection. The COVID Human Genetic Effort is an international consortium focused on better understanding COVID-19 based on human genomes.
We recently published a data resource containing variant analysis of human genome sequences for COVID-19 research. Please check this out on IEEE DataPort.
- [Variant analysis (single node)]
- [Variant analysis using AVAH (cluster)]
- Create an account on CloudLab by signing up here. Select "Join Existing Project" with `EVA-public` as the project name. (If you already have a CloudLab account, then join the project `EVA-public` after logging in to CloudLab.)
- By signing up, you agree to follow the Acceptable Use Policy of CloudLab.
- After your account is approved, you can log in to your account. Read the CloudLab manual on how to start an experiment.
- After you log in to CloudLab, go to "Manage Account" (by clicking the leftmost dropdown menu) and choose `bash` as the default shell.
- Start an experiment using the profile `EVA-single-node-profile` on CloudLab. (Or just click here.) You will need to select a node/hardware type such as `c8220` (Clemson), `c240g5` (Wisc), etc. The complete list of supported hardware can be found here. Also provide your CloudLab user name. Check the box to agree to using only de-identified data. It will take a few minutes to start the experiment, so please be patient.
- Go to your experiment and, in `Topology View`, click the node icon and open a shell/terminal to connect to that node. Alternatively, you can use SSH to log in to the node: `$ ssh -i /path/to/CloudLab/private_key_file CloudLab_username@CloudLab_hostname`. (You can also run `ssh-agent` on your local machine to add your private key; see the sketch below.) The CloudLab hostname can be found in the `List View` or `Manifest` tabs.
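A minimal `ssh-agent` setup on your local machine, assuming your CloudLab private key is at the path shown:

```
# Start the agent and load your CloudLab private key
$ eval "$(ssh-agent -s)"
$ ssh-add /path/to/CloudLab/private_key_file
# Now you can log in without passing -i each time
$ ssh CloudLab_username@CloudLab_hostname
```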
- Run the following commands in the shell/terminal:
a. Clone the repo in your home directory.
```
$ cd ${HOME}
$ git clone https://github.com/MU-Data-Science/EVA.git
```
b. Set up all the tools such as bwa, samtools, sambamba, FreeBayes, Picard, GATK, etc. Feel free to modify our scripts if you intend to use other tools for variant analysis. (It will take about 6 minutes.)
```
$ ${HOME}/EVA/scripts/setup_tools.sh
```
c. Change to the local block storage directory, as we need ample space for running variant analysis.
```
$ cd /mydata
```
d. If you already have the reference genome built, copy all the files to `/mydata`. Otherwise, set up and index the reference genome (e.g., hs38 (GRCh38), hs37 (GRCh37)). This is a one-time step and can take an hour or more depending on the node hardware type. To avoid killing the process when the SSH session terminates due to disconnection, use the `screen` command.
```
$ screen -S ref
$ ${HOME}/EVA/scripts/setup_reference_genome.sh hs38
```
Press "Ctrl-a" "d" (i.e., Ctrl-a followed by d) to detach from the screen session. To reconnect to the session, use `screen -r`. The above step will create new files `hs38.*` in `/mydata/` (a quick check is sketched below).
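To confirm that indexing finished, list the generated files:

```
# The reference and its index files should all be in /mydata
$ ls -l /mydata/hs38.*
```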
e. Now copy a whole genome (paired-end) sequence to the CloudLab node. It is the user's responsibility to ensure that the data are de-identified prior to storing them on the CloudLab node. Also see the Acceptable Use Policy of CloudLab. Let's use a whole genome sequence sample from The 1000 Genomes Project. The FTP site is ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/phase3/data.
```
$ wget ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR016/ERR016314/ERR016314_1.fastq.gz
$ wget ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR016/ERR016314/ERR016314_2.fastq.gz
```
If you have de-identified sequences on your local machine to analyze, copy them to the CloudLab node using `scp` (see the sketch after this step). Then copy the genome files stored in `Genome_Data` under `/proj/eva-public-PG0/`, which are needed for running variant analysis with FreeBayes and GATK in the next steps:
```
$ ${HOME}/EVA/scripts/copy_1kgenome_files.sh
```
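A sketch of the `scp` upload for your own de-identified sequences (the FASTQ file names here are hypothetical):

```
# Copy local paired-end FASTQ files to the node's block storage
$ scp -i /path/to/CloudLab/private_key_file my_sample_1.fastq.gz my_sample_2.fastq.gz CloudLab_username@CloudLab_hostname:/mydata/
```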
f. There are three variant analysis scripts. The first one (`run_variant_analysis_fbayes_basic.sh`) performs alignment, sorting, marking of duplicates, and variant calling using FreeBayes. The other two perform alignment, sorting, marking of duplicates, variant calling, and Base Quality Score Recalibration (BQSR) based on GATK best-practice workflows: `run_variant_analysis_fbayes.sh` uses FreeBayes for variant calling, while `run_variant_analysis_gatk.sh` uses GATK's HaplotypeCaller. Run a variant analysis script by passing the required arguments. (See the usage statement of the script.) Here are three examples.
```
$ ${HOME}/EVA/scripts/run_variant_analysis_fbayes_basic.sh hs38 ERR016314_1.fastq.gz ERR016314_2.fastq.gz
```
OR
```
$ ${HOME}/EVA/scripts/run_variant_analysis_fbayes.sh hs38 ERR016314_1.fastq.gz ERR016314_2.fastq.gz
```
OR
```
$ ${HOME}/EVA/scripts/run_variant_analysis_gatk.sh hs38 ERR016314_1.fastq.gz ERR016314_2.fastq.gz
```
Note that for the GATK pipeline, a dummy read group is added to the `.bam` file.

g. The variants are stored in `.output.vcf` files. If BQSR is used, the file ending with `BQSR-output.vcf` contains the variants computed on the analysis-ready reads. Download to your local machine using `scp`.
```
$ scp -i /path/to/CloudLab/private_key_file CloudLab_username@CloudLab_hostname:/path/to/the/output_VCF_file .
```
Any `.vcf` file can be further processed using readily available tools such as scikit-allel, VCFtools, and GATK. You can also use visualization tools such as IGV. (A quick sanity check on a VCF is sketched after Step h.)

h. To perform variant analysis on more genome sequences, go to Step e.
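Returning to Step g: before downloading an output VCF, you can count its variant records with standard shell tools as a quick sanity check:

```
# Header lines in a VCF start with '#'; everything else is a variant record
$ grep -vc "^#" *output.vcf
```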
To start a named `screen` session:
```
$ screen -S my_session_name
```
Press "Ctrl-a" "d" (i.e., Ctrl-a followed by d) to detach from the screen session. To reattach, do either of the following.
```
$ screen -r my_session_name
```
OR
```
$ screen -r
```
To list the screen sessions, type the following.
```
$ screen -ls
```
We are using AVAH [CIKM '21], Apache Spark, Apache Hadoop, and ADAM-Cannoli or GATK4.
- Create and set up a 16-node cluster on CloudLab, following the steps described here. (The nodes are named `vm0`, `vm1`, ..., `vm15`.)
- Connect to `vm0` using SSH to open a terminal/shell.
- Set up the reference genome in `/mydata` on all nodes.
```
$ ${HOME}/EVA/scripts/copy_files.sh 16
```
Use of `screen` is recommended, as copying the files can take more than 15 minutes (see the sketch below).
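For example, a minimal way to run the copy inside a detachable session (the session name `refcopy` is arbitrary):

```
# Start a named screen session, then launch the copy inside it
$ screen -S refcopy
$ ${HOME}/EVA/scripts/copy_files.sh 16
# Detach with Ctrl-a d; reattach later with: screen -r refcopy
```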
- On `vm0`, clone the AVAH repo.
```
$ cd ; git clone https://github.com/raopr/AVAH.git
$ cp ${HOME}/AVAH/misc/sampleIDs-vlarge.txt /proj/eva-public-PG0/${USER}-sampleIDs-vlarge.txt
$ cp ${HOME}/AVAH/misc/sampleURLs-vlarge.txt /proj/eva-public-PG0/${USER}-sampleURLs-vlarge.txt
```
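To peek at the sample lists you just staged:

```
# Show the first few sample IDs and sequence URLs
$ head /proj/eva-public-PG0/${USER}-sampleIDs-vlarge.txt
$ head /proj/eva-public-PG0/${USER}-sampleURLs-vlarge.txt
```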
- If using GATK4, make the changes mentioned on this page.
- If using ADAM-Cannoli, run the following command to create the required files for BQSR/indel realignment. The known SNPs and INDELs folders will be created on HDFS.
```
$ ${HOME}/EVA/scripts/convert_known_snps_indels_to_adam.sh 16
```
This step will take several minutes to complete.
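Once it completes, you can verify on HDFS that the new folders appeared (the exact folder names are created by the script):

```
# List the HDFS root and look for the known SNPs/INDELs folders
$ hdfs dfs -ls /
```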
- Run variant analysis using AVAH. For usage, type:
```
$ ${HOME}/AVAH/scripts/run_variant_analysis_at_scale.sh -h
```
a. For ADAM-Cannoli (w/ BWA and FreeBayes):
- If sequences need to be downloaded to HDFS, then run the following command:
```
$ hdfs dfs -rm -r /tmp/logs; hdfs dfs -rm -r /spark-events/*; hdfs dfs -rm /*.fastq.gz
$ ${HOME}/AVAH/scripts/run_variant_analysis_at_scale.sh -i /proj/eva-public-PG0/${USER}-sampleIDs-vlarge.txt -d /proj/eva-public-PG0/${USER}-sampleURLs-vlarge.txt -n 16 -b 2 -p 15 -P H -B
```
The file `${USER}-sampleIDs-vlarge.txt` contains the IDs of the sequences and their sizes (e.g., one FASTQ file size). The file `${USER}-sampleURLs-vlarge.txt` contains the URLs of the (paired-end) FASTQ files. You can also download the sequences directly to CloudLab nodes using `wget` and then copy them to HDFS (see the sketch after this list). This is recommended if the above script does not download the sequences correctly due to issues with CloudLab's network. After that, you can run the step below.
- If sequences are already present in HDFS (and correctly downloaded), then run the following command:
```
$ hdfs dfs -rm -r /tmp/logs; hdfs dfs -rm -r /spark-events/*
$ ${HOME}/AVAH/scripts/run_variant_analysis_at_scale.sh -i /proj/eva-public-PG0/${USER}-sampleIDs-vlarge.txt -d NONE -n 16 -b 2 -p 15 -P H -B
```
- To skip BQSR/indel realignment, drop the `-B` option.
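A minimal sketch of the manual download route, reusing the 1000 Genomes sample from the single-node section; placing the files at the HDFS root is an assumption based on the `hdfs dfs -rm /*.fastq.gz` cleanup command above:

```
# Download the paired-end FASTQ files to the node
$ wget ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR016/ERR016314/ERR016314_1.fastq.gz
$ wget ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR016/ERR016314/ERR016314_2.fastq.gz
# Copy them into HDFS (destination assumed to be the HDFS root)
$ hdfs dfs -put ERR016314_*.fastq.gz /
```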
b. For GATK4 (w/ BWA and HaplotypeCaller):
- If sequences need to be downloaded to HDFS, then run the following command:
```
$ hdfs dfs -rm -r /tmp/logs; hdfs dfs -rm -r /spark-events/*; hdfs dfs -rm /*.fastq.gz
$ ${HOME}/AVAH/scripts/run_variant_analysis_at_scale.sh -i /proj/eva-public-PG0/${USER}-sampleIDs-vlarge.txt -d /proj/eva-public-PG0/${USER}-sampleURLs-vlarge.txt -n 16 -b 2 -p 15 -P H -G
```
The file `${USER}-sampleIDs-vlarge.txt` contains the IDs of the sequences and their sizes (e.g., one FASTQ file size). The file `${USER}-sampleURLs-vlarge.txt` contains the URLs of the (paired-end) FASTQ files. You can also download the sequences directly to CloudLab nodes using `wget` and then copy them to HDFS. This is recommended if the above script does not download the sequences correctly due to issues with the network/FTP servers. After that, you can run the step below.
- If sequences are already present in HDFS (and correctly downloaded), then run the following command:
```
$ hdfs dfs -rm -r /tmp/logs; hdfs dfs -rm -r /spark-events/*
$ ${HOME}/AVAH/scripts/run_variant_analysis_at_scale.sh -i /proj/eva-public-PG0/${USER}-sampleIDs-vlarge.txt -d NONE -n 16 -b 2 -p 15 -P H -G
```
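Before launching in `-d NONE` mode, you can confirm the sequences are actually in HDFS (the root location is assumed, matching the cleanup commands above):

```
# List the FASTQ files currently stored in HDFS
$ hdfs dfs -ls /*.fastq.gz
```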
- To re-execute failed sequences during variant analysis, use the `-e` option.
- For changing the YARN settings, refer to this page.
- The `.vcf` files containing raw variants will be stored in HDFS. Check the files using this command: `hdfs dfs -ls /*.vcf`.
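To copy the raw variant files out of HDFS to the local file system for downstream tools:

```
# Fetch all VCFs from the HDFS root into the current directory
$ hdfs dfs -get /*.vcf .
```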
We used Apache Spark to parallelize the execution of de novo assembly using ABySS. Multiple assemblies will be run on each cluster node using a combination of HDFS and local file systems. Assuming the cluster has been set up to run variant analysis, do the following:
- Log in to `vm0`. The two files `sampleIDs.txt` and `sampleURLs.txt` can be used, or you can create your own files. The file `sampleIDs.txt` contains one run accession ID per line. The file `sampleURLs.txt` contains the URLs of the paired-end FASTQ sequences.
- Execute the following command for a k-mer length of `31` (default: `51`):
```
$ ${HOME}/EVA/scripts/run_denovo.sh ${HOME}/EVA/misc/sampleIDs.txt ${HOME}/EVA/misc/sampleURLs.txt 31
```
- Once this task completes, the `.fa` files will be written to HDFS. Check the `.log` file written to `/mydata`.
- To re-run with a different k-mer length while avoiding downloading the FASTQ files again, try this:
```
$ ${HOME}/EVA/scripts/run_denovo.sh ${HOME}/EVA/misc/sampleIDs.txt NONE 41
```
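To spot-check a run afterwards (the HDFS root as the location of the `.fa` files is an assumption; adjust to where your deployment writes them):

```
# Most recent assembly logs on local storage
$ ls -lt /mydata/*.log
# Assembled contigs written to HDFS
$ hdfs dfs -ls /*.fa
```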
- Praveen Rao and Arun Zachariah. Enabling Large-Scale Human Genome Sequence Analysis on CloudLab. In IEEE INFOCOM Workshops: Computer and Networking Experimental Research using Testbeds (CNERT 2022), 2 pages, 2022. [PDF]
- Praveen Rao, Arun Zachariah, Deepthi Rao, Peter Tonellato, Wesley Warren, and Eduardo Simoes. Accelerating Variant Calling on Human Genomes Using a Commodity Cluster. In Proc. of 30th ACM International Conference on Information and Knowledge Management (CIKM), pages 3388-3392, Australia, 2021. [PDF] Nominated for Best Short Paper Award.
- Shivika Prasanna and Praveen Rao. A Data Science Perspective of Real-World COVID-19 Databases. In Leveraging Artificial Intelligence in Global Epidemics, pages 133-163, Elsevier, 2021. (Editors: Gruenwald, Jain, Groppe) [PDF]
The 1000 Genomes Project provides a few thousand human whole genome sequences that can be used to test variant analysis tasks at scale. However, these sequences are not from COVID-19 studies. The COVID-19 Data Portal has 700+ human genome sequences from COVID-19 studies that use the RNA sequencing (RNA-Seq) technique.
Note: For AVAH's performance evaluation using ADAM-Cannoli, please refer to our publications.
Please report any issues here.
Faculty: Drs. Praveen Rao (PI), Deepthi Rao, Peter Tonellato, Wesley Warren, and Eduardo Simoes
Ph.D. Students: Arun Zachariah, Shivika Prasanna
This work is supported by the National Science Foundation under Grant No. 2034247.
- https://github.com/ekg/alignment-and-variant-calling-tutorial
- https://github.com/biod/sambamba
- https://github.com/lh3/bwa
- https://github.com/ekg/freebayes
- https://github.com/samtools/samtools
- https://docs.brew.sh/Homebrew-on-Linux
- https://spark.apache.org
- https://hadoop.apache.org
- http://bdgenomics.org/
- https://cloudlab.us/
- https://biohpc.cornell.edu/lab/doc/Variant_workshop_Part1.pdf
- https://github.com/broadinstitute/picard
- https://github.com/broadinstitute/gatk
- https://github.com/bcgsc/abyss
- https://gencore.bio.nyu.edu/variant-calling-pipeline-gatk4