# The WGBS data analysis tutorial 4 - 
# run nf-core/methylseq using Google Batch

For real-world datasets, the sequence files are usually too large to process on a single virtual machine (a Vertex AI notebook), or processing them takes a long time. In this tutorial, we show how to run the nf-core/methylseq pipeline to process WGBS data using Google Batch.  

<img src="images/notebook4_2.png" width="900" />

[**Google Batch**](https://www.nextflow.io/docs/latest/google.html#cloud-batch) is a managed computing service that runs containerized workloads on Google Cloud Platform infrastructure. It provides a simple way to execute a series of containers on Compute Engine VMs. The most common use case is running an existing tool or custom script that reads and writes files, typically to and from Cloud Storage. Nextflow provides built-in support for Google Batch, which allows the seamless deployment of a Nextflow pipeline in the cloud by offloading process executions to the Google Cloud service. Google Batch can run such tasks independently over hundreds or thousands of files. 


There are several steps to complete before we can submit a job to Google Batch through Nextflow. These steps (listed below) are covered in this tutorial:

- [Create a Nextflow Service Account](#CNSA) -- if one does not already exist
- [Create a Notebook with Service Account Permissions](#Create-a-Notebook-with-Service-Account-Permissions)

Now, you can open a new Vertex AI notebook with a Nextflow service account, and:

- [Install Nextflow, and Create a Config File for Google Batch](#INCC) 
- [Download and Test nf-core/methylseq using Google Batch](#TEST)
- [An Example of a Real-World Dataset](#REAL)
- [Configuration of a Full-scale Dataset](#FULL) -- Troubleshooting

## Create a Nextflow Service Account<a name="CNSA"></a>
**Most of what is in this section is unnecessary if you are using an NIH Cloud Lab Project/Account**. If you are a Cloud Lab user, feel free to skip ahead to `Install Nextflow, and create a config file for Google Batch` without creating a Service Account. Everything should run fine without the Service Account. You will still need to enable the APIs.

Before creating a new service account, please check if there is already a Nextflow service account available (`Menu` > `IAM & Admin` > `Service Accounts`). There is no need to create a new one if there is one that already exists. If not, follow the steps below to create one.

#### Enable APIs  
- Enable the Batch, Compute Engine, Cloud Logging, and Cloud Storage APIs by searching for each GCP product and clicking the **`ENABLE`** button. (This applies to the whole project and should already have been done in the initial README.md.)

#### Create a Nextflow service account  
- Click the main navigation menu, go to **IAM & Admin**, and click **Service Accounts**
- Select **+ CREATE SERVICE ACCOUNT**
- Type in 'nextflow-service-account' as the service account name and press **`DONE`**

<img src="images/4_create_service_account.png" width="800">

#### Add roles to the service account:  
- On the **IAM & Admin** menu, click **IAM**, then click `edit` next to the Nextflow service account you just created
- Add the following roles and click **`SAVE`**:  
    - Service Account User
    - Batch Agent Reporter
    - Storage Admin
    - Storage Object Admin
    - Batch Job Editor

<img src="images/4_create_service_account_roles.png" width="800"> 
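The same role assignments can also be written as `gcloud projects add-iam-policy-binding` calls. The sketch below only prints the commands and does not run them. The role IDs are our mapping of the console names listed above, and the project ID and service account email are placeholders, so please verify all of them against your own project before use.

```python
# Map the console role names from the list above to IAM role IDs.
# Assumption: these IDs match the console labels; verify them in IAM.
ROLES = {
    "Service Account User": "roles/iam.serviceAccountUser",
    "Batch Agent Reporter": "roles/batch.agentReporter",
    "Storage Admin": "roles/storage.admin",
    "Storage Object Admin": "roles/storage.objectAdmin",
    "Batch Job Editor": "roles/batch.jobsEditor",
}


def role_binding_commands(project_id, service_account_email):
    """Return one gcloud command string per role (nothing is executed)."""
    member = f"serviceAccount:{service_account_email}"
    return [
        f"gcloud projects add-iam-policy-binding {project_id} "
        f"--member={member} --role={role}"
        for role in ROLES.values()
    ]


# Placeholder values for illustration only.
commands = role_binding_commands(
    "my-project-id",
    "nextflow-service-account@my-project-id.iam.gserviceaccount.com",
)
for command in commands:
    print(command)
```

Each printed line can be pasted into Cloud Shell once the placeholders are replaced.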

## Create a Notebook with Service Account Permissions

When creating a notebook, you can edit the permissions so that it uses the Nextflow service account.  
- Using the 'IAM & Admin' menu on the left, click 'Service Accounts' (if you aren't there already), locate your Nextflow service account, and copy the entire email address
- Edit the Permissions section by **unchecking** 'Use Compute Engine default service account' and entering your service account email
- Then click 'Create'

<img src="images/4_create_notebook.png" width="800">

## Create a storage bucket if you haven't

1. In the navigation menu `(≡)`, select `Cloud Storage` and then __Create bucket__.
2. Enter a name for your bucket. You will reference this name when you transfer output results from GCP or run the nf-core/methylseq pipeline. You can also upload your own dataset to the bucket to use in GCP. (**NOTE**: Do not use underscores (_) in your bucket name. Use hyphens (-) instead.) 
3. Select __Region__ for the __Location type__ and select the __Location__ for your bucket.
4. Select __Standard__ for the default storage class.
5. Select __Uniform__ for the Access control.
6. Select __Create__.
7. Once the bucket is created, you will be redirected to the Bucket details page.
8. Select Permissions, then + Add.
9. Copy the email address of the Nextflow service account into New principals.
10. Select the following roles:
    - Storage Admin
    - Storage Legacy Bucket Owner
    - Storage Legacy Object Owner
    - Storage Object Creator
11. If you have another service account that needs to access the bucket, repeat step 9 to enter the service account email, and step 10 to select the following roles: 
    - Storage Admin
    - Storage Object Admin
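The naming advice in step 2 can be checked before you create the bucket. The sketch below is a conservative check written for this tutorial, not the official Cloud Storage validation: it accepts only lowercase letters, digits, and hyphens, which is stricter than what Cloud Storage itself allows. The function name and pattern are our own.

```python
import re

# Deliberately conservative pattern (an assumption of this sketch):
# lowercase letters, digits and hyphens only, 3-63 characters,
# starting and ending with a letter or digit.
BUCKET_NAME_PATTERN = re.compile(r"[a-z0-9][a-z0-9-]{1,61}[a-z0-9]")


def follows_tutorial_naming(bucket_name):
    """Return True if the name fits the conservative pattern above."""
    return BUCKET_NAME_PATTERN.fullmatch(bucket_name) is not None


for name in ("dna-methyl", "dna_methyl", "ab"):
    print(name, follows_tutorial_naming(name))
```

A name that passes this check can still be rejected at creation time, because bucket names must also be globally unique.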

<div class="alert alert-block alert-danger">
    <i class="fa fa-exclamation-circle" aria-hidden="true"></i>
    <b>Alert: </b>  Please <b>do not create a service key</b>, even if a tutorial instructs you to. API keys are generally not considered secure; they are typically accessible to clients, making it easy for someone to steal them. A stolen key has no expiration, so it may be used indefinitely unless the project owner revokes or regenerates it.
</div>


## Now open JupyterLab with the Nextflow service account and download the tutorials from the repository, as shown in the README.md

Use the command `! git clone https://github.com/NIGMS/DNA-Methylation-Sequencing-Analysis-with-WGBS.git` and open **tutorial4_methylseq2.ipynb**.

**Note**: if your notebook already had the Nextflow service account permission from the beginning, you don't need to create a new notebook or re-download the notebooks (.ipynb).

---

## Install Nextflow, and Create a Config File for Google Batch<a name="INCC"></a>

In [None]:
import os
os.environ["PATH"] += os.pathsep + os.environ["HOME"]+"/mambaforge/bin"

! mamba install -c bioconda nextflow -y

### Create and modify your own config file to include a 'gcb' profile block

The config file allows Nextflow to utilize executors like Google Batch. Below is an example config file to run a Nextflow job using Google Batch:  
```bash
profiles{
  gcb{
      google.project = '<PROJECT_ID>'
      google.location = 'us-central1'
      google.region  = 'us-central1'
      
      process.executor = 'google-batch'
      process.machineType = 'c2-standard-30'
      
      workDir = 'gs://BUCKET_NAME/work'
      params.outdir = 'gs://BUCKET_NAME/result'
     }
}
```  
There are some fields that you need to define or pay attention to:
- **Your project ID**. Not the project name, but the project ID. It can be found when you click the project name in the menu at the top of the home page: <img src="images/4_project_ID.png" width="800" />
- **Executor**. To run the job using Google Batch, the executor must be defined here using: `process.executor = 'google-batch'`
- **Region**. Make sure that your region is one supported by Google Batch. A comprehensive list is available [here](https://cloud.google.com/batch/docs/locations). 
- **Machine type**. Specify the machine type you would like to use, ensuring that it has enough memory and CPUs for the workflow. Google Cloud provides different machine types within several machine families that you can choose from to create a virtual machine (VM) instance with the resources you need. For example, the machine type we defined here is `c2-standard-30`, which belongs to the compute-optimized machine family; this family has the highest performance per core on Compute Engine and is optimized for compute-intensive workloads. `c2` is the machine series, and `30` is the number of vCPUs. `c2-standard-30` also has 120 GB of memory, and you can attach up to 3 TB of local storage to these VMs for applications that require higher storage performance. For more information about machine types, please visit https://cloud.google.com/compute/docs/machine-types.
    - If not defined, Google Batch will automatically use 1 CPU
    - Keep in mind that the `c2-standard-30` machine type is compute-optimized and somewhat more expensive than the `e2` or `n1` machine types. You can use a cheaper option if running time is not the top priority.
- **Data storage**. For a full-scale dataset, make sure you create the bucket ahead of time, along with a directory in it to store the input, output, and intermediate files. `workDir` defines the working directory where intermediate files are stored. You can also set the input files and the output path with `params.input` and `params.outdir`.
    - If not defined, the work directory and output directory will be created in your local notebook directory, named `work` and `results`. This is risky, since the intermediate and final outputs can be too large to store in the notebook instance.
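To make the machine type naming described above concrete, here is a small sketch that splits a name such as `c2-standard-30` into its series, class, and vCPU count. It assumes the three-part `series-class-vcpus` form only; names with a different shape (for example shared-core or custom types) are outside its scope. The function name is our own.

```python
def parse_machine_type(machine_type):
    """Split a predefined machine type name into its three parts.

    Assumes the 'series-class-vcpus' form, e.g. 'c2-standard-30'.
    """
    series, machine_class, vcpus = machine_type.split("-")
    return {"series": series, "class": machine_class, "vcpus": int(vcpus)}


for name in ("c2-standard-30", "e2-standard-4"):
    print(name, parse_machine_type(name))
```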

If many parameters need to be specified, you can write the config file using **scopes**. Configuration settings can be organized in different scopes by dot prefixing the property names with a scope identifier or grouping the properties in the same scope using the curly brackets notation:
```bash
profiles{
  gcb{
      workDir = 'gs://BUCKET_NAME/work'
      process {
          executor = 'google-batch'
          machineType = 'c2-standard-30'
      }
      google {
          location = 'us-central1'
          region  = 'us-central1'
          project = '<PROJECT_ID>'
      }
      params {
          outdir = 'gs://BUCKET_NAME/output'
          input = 'gs://BUCKET_NAME/*_R{1,2}.fastq.gz'
          max_memory = 120.GB
          max_cpus = 30
          max_time = 24.h
      }
     }
}
```

__Note:__ Best practices are to make sure your working directory (`workDir`) and output directory (`outdir`) are **different**! Google Batch creates temporary files in the working directory within your bucket that do take up space. So once your pipeline has completed successfully, feel free to delete the temporary files.
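If you prefer to generate the config file from a notebook cell, the sketch below fills a template with the same fields as the scoped example and refuses to continue when the working and output directories are identical. The function name, default values, and template layout are our own choices for illustration, not part of Nextflow or nf-core.

```python
# Template mirroring the scoped 'gcb' profile; doubled braces are literal braces.
CONFIG_TEMPLATE = """\
profiles {{
  gcb {{
      workDir = '{work_dir}'
      process {{
          executor = 'google-batch'
          machineType = '{machine_type}'
      }}
      google {{
          location = '{location}'
          region  = '{location}'
          project = '{project_id}'
      }}
      params {{
          outdir = '{out_dir}'
      }}
  }}
}}
"""


def render_gcb_config(project_id, work_dir, out_dir,
                      machine_type="c2-standard-30", location="us-central1"):
    """Fill the template, rejecting identical work and output directories."""
    if work_dir.rstrip("/") == out_dir.rstrip("/"):
        raise ValueError("workDir and outdir must be different")
    return CONFIG_TEMPLATE.format(
        project_id=project_id, work_dir=work_dir, out_dir=out_dir,
        machine_type=machine_type, location=location,
    )


# Placeholders as used elsewhere in this tutorial.
config_text = render_gcb_config(
    "<PROJECT_ID>", "gs://BUCKET_NAME/work", "gs://BUCKET_NAME/results"
)
print(config_text)
```

The returned text can be written to a file and passed to Nextflow with `-c`.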

An example config file is `docs/test_LS.config`; we will use it to run the test profile (a very small dataset). 

## Create a Google Cloud Storage Bucket

You can create your own bucket to store the output from the pipeline. Bucket names must be **globally unique** across all Google Cloud projects, including those outside of your organization.


<b>Note:</b> We will use the bucket name "dna-methyl" as an example in the following steps, but you need to create your own bucket to store the data. Please <b>replace</b> "dna-methyl" below with your own bucket name.


## Download and **Test** nf-core/methylseq using Google Batch<a name="TEST"/>

The `test` profile (`-profile test`) uses a small dataset allowing you to ensure the workflow works with your config file without long run times. Ensure you include:
- Version of the nf-core tool [-r]
- Location of the config file [-c]

<div class="alert alert-block alert-success">
    <i class="fa fa-hand-paper-o" aria-hidden="true"></i>
<b>Note:</b> Please <b>replace</b> "dna-methyl" in docs/test_LS.config with your own bucket name and save the file. You can double-click the file to open and edit it, or edit it in the terminal.
</div>

`docs/test_LS.config`:
```
profiles{
  gcb{
      workDir = 'gs://dna-methyl/test/work'
      process {
          executor = 'google-batch'
          machineType = 'e2-standard-4'
      }
      google {
          location = 'us-central1'
          region  = 'us-central1'
          project = '<PROJECT_ID>'
      }
      params {
          outdir = 'gs://dna-methyl/test/results'
      } 
     }
}
```

In [1]:
# Create the output directory for this tutorial
! mkdir -p Tutorial_4

In [3]:
! rm -rf Tutorial_4/test
!nextflow self-update
!nextflow run nf-core/methylseq -r 2.4.0 -profile test,gcb -c docs/test_LS.config

Downloading nextflow dependencies. It may require a few seconds, please wait ..
CAPSULE: Downloading dependency org.multiverse:multiverse-core:jar:0.7.0
CAPSULE: Downloading dependency org.apache.ivy:ivy:jar:2.5.1
CAPSULE: Downloading dependency org.codehaus.groovy:groovy:jar:3.0.19
CAPSULE: Downloading dependency commons-io:commons-io:jar:2.11.0
CAPSULE: Downloading dependency com.google.guava:listenablefuture:jar:9999.0-empty-to-avoid-conflict-with-guava
CAPSULE: Downloading dependency com.beust:jcommander:jar:1.35
CAPSULE: Downloading dependency jline:jline:jar:2.9
CAPSULE: Downloading dependency org.codehaus.groovy:groovy-xml:jar:3.0.19
CAPSULE: Downloading dependency org.slf4j:log4j-over-slf4j:jar:2.0.7
CAPSULE: Downloading dependency com.github.zafarkhaja:java-semver:jar:0.9.0
CAPSULE: Downloading dependency io.nextflow:nf-commons:jar:23.10.1
CAPSULE: Downloading dependency io.nextflow:nf-httpfs:jar:23.10.1
CAPSULE: Downloading dependency javax.mail:mail:jar:1.4.

In [19]:
# Remove the remote trace file directory
! rm -rf gs:

<div class="alert alert-block alert-warning">
    <i class="fa fa-pencil" aria-hidden="true"></i>
<b>Note</b>: The <code>preseq</code> process may fail, but the pipeline ignores this failure and the output results are not affected. The preseq package predicts and estimates the complexity of a genomic sequencing library: starting from an initial sequencing experiment, it estimates the number of redundant reads at a given sequencing depth and how many reads can be expected from additional sequencing.
    </div>

This nf-core/methylseq test profile takes about 20 minutes to finish. Compared with the running time of the same test profile in Tutorial 3 (about 3 minutes), we can see that extra time is needed for Nextflow to communicate with Google Batch, Cloud Storage, and the VMs. This overhead is not worthwhile for a small dataset, but it becomes negligible for large datasets that need more computational resources. In other words, Google Batch works well for coarse-grained workloads, i.e. long-running jobs. It is not recommended for pipelines that spawn many short-lived tasks.

## An Example of a Real World Dataset<a name="REAL"/>

1. Install SRA-tools
2. Download the Data
3. Modify the config file
4. Run the job

#### Install SRA-tools

The **[SRA Toolkit](https://github.com/ncbi/sra-tools/wiki)** and SDK from NCBI is a collection of tools and libraries for using data in the INSDC Sequence Read Archives. Here we use `mamba` to install `sra-tools`:

In [6]:
import os
os.environ["PATH"] += os.pathsep + os.environ["HOME"]+"/mambaforge/bin"

! mamba install -c bioconda "sra-tools > 2.11" -y


Looking for: ["sra-tools[version='>2.11']"]

#### Download the Data

The data are from [Molaro, Antoine, et al. Cell 146.6 (2011): 1029-1041](https://www.sciencedirect.com/science/article/pii/S0092867411009421) and [Laurent, Louise, et al. Genome Research 20.3 (2010): 320-331](https://genome.cshlp.org/content/20/3/320.full). During germ cell and preimplantation development, mammalian cells undergo nearly complete reprogramming of DNA methylation patterns. The studies profiled the methylomes of human and chimp sperm as a basis for comparison to methylation patterns of embryonic stem cells (ESCs).   
<img src="images/4_data_graph.jpg" width="300" />

We use one sample from human sperm and one sample from ESCs as examples to demonstrate the workflow here.

Use `fasterq-dump` to download data from SRA using accession numbers. The data will be stored in `Tutorial_4/sra_download`:

In [12]:
# fasterq-dump has no --gz option; the files are compressed with pigz in a later step
! fasterq-dump --threads 4 --progress SRR306435 SRR033942 -O Tutorial_4/sra_download


Remove the temporary output directory from running `fasterq-dump`:

In [8]:
! rm -rf fasterq.tmp.*

Compress the files.

In [14]:
!pigz Tutorial_4/sra_download/SRR*


#### Create a samplesheet (saved in Tutorial_4) to provide all sample information

**Format:**    
sample,fastq_1,fastq_2    
sample1,sample1_R1.fastq,sample1_R2.fastq    
control1,control1_R1.fastq,control1  

In [16]:
# Build a pandas DataFrame from a list of dicts.
import pandas as pd
 
# One dict per sample: the sample name plus the paths to the paired FASTQ files.
samples = [{'sample': 'SRR033942', 'fastq_1': 'Tutorial_4/sra_download/SRR033942_1.fastq.gz', 'fastq_2': 'Tutorial_4/sra_download/SRR033942_2.fastq.gz'},
        {'sample': 'SRR306435', 'fastq_1': 'Tutorial_4/sra_download/SRR306435_1.fastq.gz', 'fastq_2': 'Tutorial_4/sra_download/SRR306435_2.fastq.gz'}
       ]
 
# Creates DataFrame.
df2 = pd.DataFrame(samples)
 
# Print the data
df2

|   | sample | fastq_1 | fastq_2 |
|---|---|---|---|
| 0 | SRR033942 | Tutorial_4/sra_download/SRR033942_1.fastq.gz | Tutorial_4/sra_download/SRR033942_2.fastq.gz |
| 1 | SRR306435 | Tutorial_4/sra_download/SRR306435_1.fastq.gz | Tutorial_4/sra_download/SRR306435_2.fastq.gz |


Export dataframe to CSV file.

In [17]:
df2.to_csv('Tutorial_4/samplesheet.csv', index=False)
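With more than a couple of samples, typing each row by hand becomes error-prone. The sketch below builds the same rows by scanning a directory for files that follow the `<sample>_1.fastq.gz` / `<sample>_2.fastq.gz` naming produced above. The function name is our own, and the naming pattern is an assumption that may not match files from other sources.

```python
from pathlib import Path

READ1_SUFFIX = "_1.fastq.gz"
READ2_SUFFIX = "_2.fastq.gz"


def build_samplesheet_rows(fastq_dir):
    """Return one dict per sample found in fastq_dir (empty list if none)."""
    directory = Path(fastq_dir)
    if not directory.is_dir():
        return []
    rows = []
    for read1 in sorted(directory.glob("*" + READ1_SUFFIX)):
        sample = read1.name[: -len(READ1_SUFFIX)]
        read2 = read1.with_name(sample + READ2_SUFFIX)
        rows.append({
            "sample": sample,
            "fastq_1": str(read1),
            # Leave fastq_2 empty when no mate file exists.
            "fastq_2": str(read2) if read2.exists() else "",
        })
    return rows


rows = build_samplesheet_rows("Tutorial_4/sra_download")
print(rows)
```

The resulting list can be passed to `pd.DataFrame(rows)` and exported with `to_csv(..., index=False)` exactly as in the cells above.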

#### Create/Modify the config file 

As mentioned [previously](#Create-and-modify-your-own-config-file-to-include-a-'gcb'-profile-block), we need to modify the config file for methylseq to run on Google Batch. The config file is located at `docs/human_sperm.config`. In this example, we set the working and output directories under `methyseq_sperm` in the Cloud Storage bucket. You need to change these paths to point to your own bucket. The input file is the sample sheet that we just created in the directory `Tutorial_4`.

```bash
profiles{
  gcb{
      process.executor = 'google-batch'
      google.location = 'us-central1'
      google.region  = 'us-central1'
      google.project = '<PROJECT_ID>'
      workDir = "gs://dna-methyl/methyseq_sperm/work"
      params.outdir = "gs://dna-methyl/methyseq_sperm/results"
      process.machineType = 'c2-standard-16'
     }
}
```

<div class="alert alert-block alert-success">
    <i class="fa fa-hand-paper-o" aria-hidden="true"></i>
<b>Again:</b> Please <b>replace</b> the "dna-methyl" in the docs/human_sperm.config with your own bucket name.
</div>

#### Run methylseq using Google Batch

If not defined in the config file, you can always use command line parameters:
- `-r` pipeline version
- `-profile` profile to use ('gcb' was defined in docs/human_sperm.config)
- `-c` config file
- `--genome` the reference genome to use. Here we use human assembly GRCh38 as the reference genome
- `--clip_r1` instructs Trim Galore to remove a certain number of bp from the 5' end of read 1
- `--tracedir` defines a local directory to save the pipeline information

Some pipeline information will be saved to the default `results` directory, so please make sure that directory is empty before running the pipeline.

In [None]:
! rm -rf Tutorial_4/methyseq_sperm

! nextflow run nf-core/methylseq \
    -profile gcb \
    -r 2.6.0 \
    -c docs/human_sperm.config  \
    --input 'Tutorial_4/samplesheet.csv' \
    --genome GRCh38 \
    --clip_r1 2 \
    --tracedir 'Tutorial_4/methyseq_sperm/pipeline_info' \
    -resume


N E X T F L O W  ~  version 23.10.1
Launching `https://github.com/nf-core/methylseq` [irreverent_liskov] DSL2 - revision: 54f823e102 [2.6.0]
WARN: The following invalid input values have been detected:

* --tracedir: Tutorial_4/methyseq_sperm/pipeline_info


------------------------------------------------------
                                        ,--./,-.
        ___     __   __   __   ___     /,-._.--~'
  |\ | |__  __ /  ` /  \ |__) |__         }  {
  | \| |       \__, \__/ |  \ |___     \`-._,-`-,
                                        `._,._,'
  nf-core/methylseq v2.6.0-g54f823e
------------------------------------------------------
Core Nextflow options
  revision     : 2.6.0
  runName      : irreverent_liskov
  launchDir    : /home/jupyter/DNA-Methylation-Sequencing-Analysis-with-WGBS

In [None]:
# Remove the remote trace file directory
! rm -rf gs:

#### Check to see if files are in your output directory bucket

The output files should be saved in your bucket's `methyseq_sperm/results` directory. You can list the results directory to see the file structure. You can also copy the files to your local directory to view them. For example, the MultiQC report file is located at `gs://dna-methyl/methyseq_sperm/results/MultiQC/multiqc_report.html`. Let's copy and view it using the commands below:

<div class="alert alert-block alert-success">
    <i class="fa fa-hand-paper-o" aria-hidden="true"></i>
<b>Note:</b> Please <b>replace</b> the "dna-methyl" in commands below with your own bucket name.
</div>

In [None]:
# List the output files/directories in the results folder
! gsutil ls gs://dna-methyl/methyseq_sperm/results

# Copy the multiQC output multiqc_report.html to local notebook:
! gsutil cp -r gs://dna-methyl/methyseq_sperm/results/MultiQC/multiqc_report.html .

# View the MultiQC output HTML file:
from IPython.display import IFrame
IFrame(src='multiqc_report.html', width=900, height=600)

Two files about the pipeline run (`execution_timeline.html` and `execution_report.html`) are saved locally in the notebook, in the `pipeline_info` directory. They provide detailed information about the running time of each process and its resource usage, which can give more insight into potential optimizations.

In [None]:
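import glob
from IPython.display import IFrame
# IFrame does not expand wildcards, so resolve the file name first (last match in name order)
IFrame(src=sorted(glob.glob('Tutorial_4/methyseq_sperm/pipeline_info/execution_timeline*.html'))[-1], width=800, height=600)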

In [None]:
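import glob
from IPython.display import IFrame
# IFrame does not expand wildcards, so resolve the file name first (last match in name order)
IFrame(src=sorted(glob.glob('Tutorial_4/methyseq_sperm/pipeline_info/execution_report*.html'))[-1], width=800, height=600)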

## <a name="FULL" />Configuration of a Full-scale Dataset - Troubleshooting

For a full-scale WGBS study, the sequencing data can range from several hundred GB to several TB. For example, the datasets used in this tutorial, GSE30340 and GSE19418, both contain many runs whose sizes add up to several hundred GB. Given such large data files, storage and memory can become an issue when running the pipeline as instructed in this tutorial. 

#### Download the data

There are several options that we can use to download the data:
1. Download the data in a notebook. Make sure that the disk size assigned to the notebook is large enough for the data you want to download. Also, when you use `prefetch` from the SRA Toolkit, there is a default maximum download size of 20 GB; you will need to increase that limit.  
2. Cloud Data Delivery Service. SRA has created a cloud data delivery service to deliver the source files and other file types from NCBI cold storage buckets to individual data consumers' buckets in AWS and GCP. This service is provided for both public and authorized access (dbGaP) data. [More detailed information here](https://www.ncbi.nlm.nih.gov/sra/docs/data-delivery/).
3. Upload to the storage bucket directly. You can upload the data to the GCP storage bucket directly from your local computer, HPC, or service server using the `gsutil` tool. [More detailed information here](https://cloud.google.com/storage/docs/discover-object-storage-gsutil). 
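For the first option, it helps to confirm that the notebook disk has enough free space before starting a large download. The sketch below uses the standard library's `shutil.disk_usage`. The function name and the safety factor of 2 (to leave room for temporary and uncompressed files) are our own assumptions, not a rule from the SRA Toolkit.

```python
import shutil


def has_room(path, needed_gb, safety_factor=2.0):
    """Return True if the disk holding `path` has needed_gb * safety_factor free.

    The safety factor is a guess to leave room for temporary files.
    """
    free_gb = shutil.disk_usage(path).free / 1e9
    return free_gb >= needed_gb * safety_factor


# Example: check the current directory for a hypothetical 50 GB download.
print("Enough space for 50 GB:", has_room(".", 50))
```

If the check fails, increase the notebook disk size or use one of the bucket-based options above.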

#### Troubleshooting the nf-core pipeline

If the nf-core pipeline does not complete successfully, you can refer to the [troubleshooting](https://nf-co.re/usage/troubleshooting) page provided by nf-core for more information. For this tutorial, the most likely reasons for a pipeline failure are:
- the service account is not set up correctly
- file paths are not correct
- memory or storage issues with large datasets

If a command exits with status 104, 134, 137, 139, 143, or 247, the most probable cause is an "out of memory" issue. To solve it, you need to increase the memory limit in the configuration file for the process that fails. For example:   
``` bash
profiles {
    gcb {
      process {
        withName: qualimap { 
              machineType = 'c2-standard-16'
              cpus = 16
              memory = 64.GB
        }
      }
    }
}
```
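When reading through a failed run, the exit statuses listed above can be checked programmatically. The sketch below flags a status as a likely memory problem and proposes a larger request. Doubling the memory up to the machine's limit is a hypothetical policy chosen for illustration, and both function names are our own.

```python
# Exit statuses named above as probable "out of memory" failures.
OOM_EXIT_STATUSES = frozenset({104, 134, 137, 139, 143, 247})


def likely_out_of_memory(exit_status):
    """Return True if the status is one of the codes listed above."""
    return exit_status in OOM_EXIT_STATUSES


def suggest_memory_gb(current_gb, limit_gb):
    """Double the request (hypothetical policy), capped at the machine's memory."""
    return min(current_gb * 2, limit_gb)


status = 137  # example value
if likely_out_of_memory(status):
    print("suggested memory (GB):", suggest_memory_gb(32, 64))
```

The suggested value would then go into the `memory` setting of the failing process in the config file.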

In GCP, the memory is also limited by the [machine type](https://cloud.google.com/compute/docs/machine-types) you select to run the process. For example, if you choose `c2-standard-8` then the memory is limited to 32GB. You can change the machine types to increase the memory. There are [memory-optimized machine families](https://cloud.google.com/compute/docs/memory-optimized-machines) (m1, m2) that you can use for workloads that require higher memory-to-vCPU ratios.  
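Since memory is tied to the machine type, choosing a type amounts to finding the smallest one that satisfies a process's needs. The sketch below does this for the `c2-standard` series only. The 4 GB per vCPU ratio follows from the figures quoted in this tutorial (32 GB for `c2-standard-8`, 120 GB for `c2-standard-30`); the list of available sizes is our assumption and should be checked against the machine types page.

```python
# Assumed c2-standard sizes (vCPUs); verify against the machine types page.
C2_STANDARD_VCPUS = (4, 8, 16, 30, 60)
GB_PER_VCPU = 4  # 120 GB / 30 vCPUs and 32 GB / 8 vCPUs, as quoted in this tutorial


def smallest_c2_standard(memory_gb, cpus=1):
    """Return the smallest c2-standard type meeting both needs, or None."""
    for vcpus in C2_STANDARD_VCPUS:
        if vcpus >= cpus and vcpus * GB_PER_VCPU >= memory_gb:
            return f"c2-standard-{vcpus}"
    return None


print(smallest_c2_standard(memory_gb=64, cpus=16))
print(smallest_c2_standard(memory_gb=100))
```

A result of `None` means no size in the assumed list is large enough, in which case a memory-optimized family is the place to look.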

#### Optimize nf-core/methylseq configuration

The nf-core/methylseq workflow contains multiple processes, and the requirements of computational and memory resources for each process vary a lot. For better performance or billing purposes, you can change the configuration for each process. You can check the default settings for each process at the pipeline's [base.config](https://github.com/nf-core/methylseq/blob/master/conf/base.config) file. 

As an example of processing a 12-sample WGBS dataset, [docs/optimization_example.config](docs/optimization_example.config) is the config file that finished processing the 24 FASTQ files (paired-end, average size 325M reads per FASTQ file) using Google Batch in less than 30 hours.

<div class="alert alert-block alert-danger">
    <i class="fa fa-exclamation-circle" aria-hidden="true"></i>
    <b>Don't forget:</b> after you finish running the notebook, stop it in Vertex AI Workbench to avoid accumulating costs.
</div>