CloudASM: an ultra-efficient cloud-based pipeline for mapping allele-specific DNA methylation
Table of contents
- Biological significance of CloudASM
- Main steps in the CloudASM pipeline
- Bioinformatics packages used in CloudASM
- If you are new to Google Cloud Computing
- How to use the pipeline
- Test the pipeline
- Prepare the fastq files to be analyzed
- Re-run failed jobs
CloudASM is a turnkey pipeline designed to call allele-specific CpG methylation in whole methylomes. It is designed to run on Google Cloud Platform (GCP).
This pipeline takes zipped fastq files as input and outputs a table of all single nucleotide polymorphisms (SNPs) with allele-specific methylation. Below, we show an example of the output table:
Here are the explanations of the different parameters on this table:
chr: the chromosome where the SNP is located.
snp_id: the unique identifier for the SNP that was evaluated for ASM.
snp_pos: Coordinate of the SNP.
asm_snp: whether the SNP has ASM or not.
asm_region_inf: position of the CpG with significant ASM for the SNP `snp_id` that has the smallest coordinate.
asm_region_sup: position of the CpG with significant ASM for the SNP `snp_id` that has the largest coordinate.
nb_ref_reads: Number of genomic segments that cover the REF of the SNP.
nb_alt_reads: Number of genomic segments that cover the ALT of the SNP.
asm_region_effect: the difference in fractional methylation between the ALT and REF genomic segments.
wilcoxon_corr_pvalue: Wilcoxon p-value of the `asm_region_effect`, corrected for multiple testing (across all SNPs).
nb_cpg: Number of CpGs with at least 5x coverage on both the REF and ALT genomic segments.
nb_pos_sig_cpg: Number of CpGs with significant ASM and where fractional methylation between ALT and REF is positive.
nb_neg_sig_cpg: Number of CpGs with significant ASM and where fractional methylation between ALT and REF is negative.
nb_consec_pos_sig_asm: Number of consecutive CpGs with significant ASM and where fractional methylation between ALT and REF is positive.
neg_consec_sig_cpg: number of consecutive CpGs with significant ASM and where fractional methylation between ALT and REF is negative.
See below what these parameters look like in an example.
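To make the `asm_region_effect` column concrete: it is the difference in pooled fractional methylation between reads carrying the ALT allele and reads carrying the REF allele. A minimal Python sketch (the function names and toy data are ours, not part of the pipeline):

```python
def fractional_methylation(cpg_states):
    """Fraction of methylated CpG observations (1 = methylated, 0 = unmethylated)."""
    return sum(cpg_states) / len(cpg_states)

def asm_region_effect(alt_states, ref_states):
    """Difference in fractional methylation between the ALT and REF
    genomic segments, pooled across all CpGs of the candidate region."""
    return fractional_methylation(alt_states) - fractional_methylation(ref_states)

# Toy data: reads carrying ALT are mostly methylated, reads carrying REF are not.
effect = asm_region_effect([1, 1, 1, 0, 1], [0, 0, 1, 0, 0])  # ~0.6
```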
Biological significance of CloudASM
Our laboratory has long-standing expertise in studying allele-specific methylation. To make sure our pipeline avoids outputting false positives, we have implemented the following stringency steps:
- we filter out reads where the confidence in the SNP nucleotide is lower than 30 (set by a variable in the pipeline parameters)
- we remove CpGs from the context file where the C or G overlaps a SNP found in the unfiltered list of variants identified by BisSNP
- we do not evaluate ASM for SNPs that are not within 500 bp of at least one CpG
As described in the figure above, to catch a "true positive" phenomenon of allele-specific methylation, we use two types of calculation:
- single-CpG ASM ("CpG ASM"), where we estimate whether there is ASM on a CpG that is covered at least 5x on both alleles. We use a p-value cut-off of 0.05 on Fisher's exact test.
- ASM over a region delimited by CpGs that are found on the same reads as the SNP. The region is delimited by two CpGs showing ASM, must contain at least 3 CpGs (variable `CPG_PER_asm_region`), must have at least 2 consecutive CpGs with ASM in the same direction (variable `CONSECUTIVE_CPG`), and must have an effect size of at least 20% between the two alleles (variable `ASM_REGION_EFFECT`). The effect size is the difference in methylation percentage between the two alleles, calculated across all CpGs in the region. We use a p-value cut-off of 0.05 on a Wilcoxon test.
All these variables can be adjusted by the user.
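As an illustration, the region-level filters above can be expressed as a simple predicate. This Python sketch is ours, not the pipeline's actual BigQuery logic; it assumes the default thresholds and leaves out the Wilcoxon test itself:

```python
def passes_region_filters(cpg_effects, significant, n_cpg_min=3,
                          consecutive_min=2, effect_min=0.20):
    """Check the region-level ASM criteria described above.

    cpg_effects: per-CpG methylation difference (ALT - REF) in the region
    significant: per-CpG flags, True if the CpG shows significant ASM
    The thresholds mirror CPG_PER_asm_region, CONSECUTIVE_CPG, and
    ASM_REGION_EFFECT (default values assumed here).
    """
    if len(cpg_effects) < n_cpg_min:
        return False
    # Region effect size: mean per-CpG difference must reach the cut-off.
    region_effect = sum(cpg_effects) / len(cpg_effects)
    if abs(region_effect) < effect_min:
        return False
    # Longest run of consecutive significant CpGs with the same direction.
    longest, run, prev_sign = 0, 0, 0
    for eff, sig in zip(cpg_effects, significant):
        sign = (eff > 0) - (eff < 0)
        if sig and sign != 0 and sign == prev_sign:
            run += 1
        elif sig and sign != 0:
            run = 1
        else:
            run = 0
        prev_sign = sign if sig else 0
        longest = max(longest, run)
    return longest >= consecutive_min
```

For example, three significant CpGs all more methylated on the same allele pass the filters, while a region whose per-CpG effects cancel out (small mean effect) does not.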
Main steps in the CloudASM pipeline
The pipeline follows these steps:
- Create a bisulfite-converted genome from the reference genome that was chosen
- Unzip fastq files and split them into smaller files ("shards").
- Trim each pair of fastq files
- Align each pair of fastq files
- Merge BAM files per chromosome and prepare them for variant and net methylation calling.
- Net methylation calling
- Re-calibrate BAM files
- Variant calling
- Allele-specific methylation calling
Note that the user can choose the reference genome to align the bisulfite-converted reads. The script automatically fetches the correct database of variants for the reference genome that is selected.
Bioinformatics packages used in CloudASM
All these packages are included in the publicly available Docker image `gcr.io/hackensack-tyco/wgbs-asm`. Note that a specific version of Java is required to run BisSNP.
If you are new to Google Cloud Computing
Google Cloud Platform (GCP) is a cloud computing service, like Amazon Web Services and Microsoft Azure.
To use CloudASM, you need to create an account on https://cloud.google.com/ and set up billing (e.g. with a credit card). As of January 2020, GCP offers $300 in credits for opening an account -- enough to test this pipeline.
GCP offers a suite of services. CloudASM uses Compute Engine, where virtual machines are created to run CloudASM's tasks; Cloud Storage, where data is stored before and after being processed by virtual machines; and BigQuery, where the aligned reads, variants, and methylation states are analyzed jointly to estimate ASM.
Once you open an account on GCP, you need to create a "project" within GCP and choose which geographical "region" and "zone" you want to request resources from. It is recommended to pick a region and zone near your location. Every job you submit within your project will pull resources from the region and zone you chose.
Very basically, data is stored in "buckets" in Cloud Storage. When CloudASM executes a batch of jobs, virtual machines (also called "instances") are created in Compute Engine to execute the jobs. They download the data from the bucket, obtain the script from CloudASM, execute it, and upload the output back to the bucket (or to BigQuery). Depending on the job, CloudASM requests instances with 2-16 CPUs, attaches larger disks to these instances (30-400 GB), and executes the commands on one of the Docker images we built or that were already available.
When you run CloudASM on more than one sample, the pipeline manager dsub will start queuing jobs within a batch if your project does not have enough resources. When you start running CloudASM, you may want to go to the Quotas page and monitor which resources you need to increase.
To run CloudASM, you need to install GCP's pipeline manager, "dsub". We recommend the installation method that clones the dsub repository and uses a virtual environment. You can also install dsub via conda, but that package was not created by the dsub authors, so it may not be the latest version.
The current version of CloudASM was validated with dsub 0.3.7.
How to use the pipeline
1. Prepare the zipped fastq files to be analyzed (see below how).
2. Clone this repository on your computer.
3. Customize the Library parameters, the ASM parameters, and the GCP parameters in the master script.
4. Launch a virtual environment by typing `python3 -m venv dsub_libs` and then `source dsub_libs/bin/activate` (it does not matter which directory you are in).
5. Copy, sequentially, all instructions from `master.sh` into the terminal, block by block (a "block" is a set of instructions between two headlines).
6. Before moving on to the next block of instructions, re-run any failed jobs if you use preemptible machines (the failure rate is about 5-10% with preemptible machines). See below how to do that.
Test the pipeline
We prepared a small dataset (~24MB of zipped fastq files) for anyone to test CloudASM without incurring large costs. The dataset was prepared by recombining bisulfite-converted reads between positions 9,000,000 and 10,000,000 on chromosome 1 of the lymphoblastoid cell line GM12878 (identifier: ENCSR890UQO), made publicly available by the ENCODE consortium.
The zipped fastq files are freely accessible on our GCP’s storage space. Using this dataset, CloudASM assessed 456 SNPs and found 13 ASM regions. All the data generated by CloudASM for this dataset is stored here. The final table of all variants and their assessment for ASM can be downloaded here.
To test the pipeline, you will have to change all GCP parameters except the variable `INPUT_B`. As you will see when running the pipeline, for each sample CloudASM creates the following folders:
`split_fastq`: fastq files unzipped and split into shards of 120,000 rows
`trimmed_fastq`: trimmed pairs of fastq shards from `split_fastq`
`aligned_per_chard`: BAM files created by aligning the trimmed fastq shards from `trimmed_fastq`
`bam_per_chard_and_chr`: the BAM files from `aligned_per_chard`, split by chromosome
`bam_per_chr`: the BAM files from `bam_per_chard_and_chr`, merged per chromosome
`net_methyl`: context files created by calling net methylation on the BAM files in `bam_per_chr`
`recal_bam_per_chr`: recalibrated BAM files from `bam_per_chr`
`variants_per_chr`: VCF files for each chromosome, created from the BAM files in `recal_bam_per_chr`
`asm`: tables of CpG-level and region-level ASM
`bedgraph`: bedgraph files of coverage and methylation percentage across the whole sample
`sam`: SAM files created from the BAM files
`snps_for_cpg`: database of SNPs used to remove CpG sites potentially overlapping a variant
Prepare the fastq files to be analyzed
When naming your samples, do not use dashes. Only use underscores.
Prepare a sample info file using this model. Download the file as a TSV file into your `run_files` directory, located where you cloned the CloudASM repository (`run_files` is automatically created by the master script). The sample info file looks like the table below. The first column is the name of the sample, the 2nd column is the list of all the zipped fastq files, the 3rd column is the lane of each zipped fastq file, and the 4th column is the read number of the zipped fastq file. The last column is created automatically from columns 1, 3, and 4.
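To make the expected layout concrete, here is a hypothetical sample info file parsed with Python. The header names and the derived-column format are assumptions for illustration; the model file linked above defines the real ones:

```python
import csv
import io

# Hypothetical sample info file (tab-separated). Note that the sample
# name uses underscores, never dashes, as CloudASM requires.
sample_info_tsv = (
    "sample\tfastq_file\tlane\tread\n"
    "gm12878\tgm12878_L01_R1.fastq.gz\tL01\tR1\n"
    "gm12878\tgm12878_L01_R2.fastq.gz\tL01\tR2\n"
)

rows = list(csv.DictReader(io.StringIO(sample_info_tsv), delimiter="\t"))

# The derived column combines the sample name, lane, and read number,
# mirroring what the master script generates automatically.
for row in rows:
    row["derived_id"] = "{}_{}_{}".format(row["sample"], row["lane"], row["read"])
```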