# SubModule #5: Running workflows at scale with Google Batch


# Learning Objectives:
- Understand the benefit of using schedulers like Google Batch
- Identify the key components of a Google Batch profile in a Nextflow config file
- Launch an end-to-end nf-core pipeline on Google Batch

One of the greatest benefits of cloud computing is the ability to pay only for the resources you need, and only for as long as you need them. In the previous submodules we relied on a single VM that we manually turned on at the beginning of our work and off at the end. If we forgot to turn the VM off, we could incur an unnecessarily large bill. Similarly, some stages of an analysis may require huge computational resources while others require very little. Rather than stopping and resizing the VM at each stage, we can rely on the Google Batch scheduler to provision the resources required for each step of the analysis. Those resources are automatically shut down and deleted when the step finishes, so you stop paying for them as soon as the work is done.

In the previous submodules, many of the steps had to be run one at a time in the correct order. That is traditionally how bioinformatics pipelines have been executed. The arrival of workflow managers like [Nextflow](https://www.nextflow.io/) has changed that. Nextflow allows developers to chain many commands together into a mature pipeline. The flow of input and output files through the pipeline governs the order of execution, ensuring that a step is not launched until its input has been generated by a previous step. The Nextflow community has aggregated a growing number of curated workflows into a library of commonly used pipelines called [nf-core](https://nf-co.re/). From nf-core, you can easily launch workflows that run canonical tools for analyses like RNA-Seq, methyl-seq, and in our case, 16S metagenomic sequencing. For this submodule, we will run a classic 16S metagenomics analysis suite using nf-core's [ampliseq](https://nf-co.re/ampliseq) pipeline. We have made some adjustments to the original scripts so that the pipeline can run on Google Batch, so we will pull it from our Cloud Storage bucket rather than launch it from GitHub.
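To make the idea of file-driven ordering concrete, here is a minimal Python sketch. It is only an illustration of the principle, not how Nextflow is implemented, and the step and file names are hypothetical. Each step declares the files it consumes and produces, and a valid execution order is derived from those declarations alone.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical steps: each lists the files it consumes and the files it produces.
steps = {
    "fastqc": {"inputs": {"reads.fastq.gz"}, "outputs": {"qc_report.html"}},
    "trim": {"inputs": {"reads.fastq.gz"}, "outputs": {"trimmed.fastq.gz"}},
    "denoise": {"inputs": {"trimmed.fastq.gz"}, "outputs": {"asv_table.tsv"}},
    "summarize": {
        "inputs": {"asv_table.tsv", "qc_report.html"},
        "outputs": {"summary.html"},
    },
}


def execution_order(steps):
    """Order steps so each one runs only after the steps that produce its inputs."""
    # Map every produced file back to the step that creates it.
    producer = {f: name for name, s in steps.items() for f in s["outputs"]}
    # A step depends on the producers of its inputs; raw inputs have no producer.
    graph = {
        name: {producer[f] for f in s["inputs"] if f in producer}
        for name, s in steps.items()
    }
    return list(TopologicalSorter(graph).static_order())


order = execution_order(steps)
print(order)
```

Steps that do not depend on each other, such as `fastqc` and `trim` here, may appear in either order. That independence is what lets a scheduler run them at the same time on separate machines.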

First, we need to download some data and our updated version of the ampliseq pipeline. For this submodule, we are using a human fecal microbiome sample from an ulcerative colitis experiment. Information about the sample can be found on its [SRA listing](https://www.ncbi.nlm.nih.gov/sra/?term=SRR24091844).

In [None]:
!gsutil -m cp gs://nigms-sandbox/nosi-usd-biofilms/*.fastq.gz Core_Dataset_Prep/
!gsutil -m cp -r gs://nigms-sandbox/nosi-usd-biofilms/ampliseq .
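Before moving on, it can help to confirm that the reads actually arrived. The following is a small optional sketch, assuming the files were copied into `Core_Dataset_Prep/` as in the cell above. The helper name `list_fastq` is our own and not part of any tool used here.

```python
from pathlib import Path


def list_fastq(folder):
    """Return the sorted names of the *.fastq.gz files directly inside folder."""
    return sorted(p.name for p in Path(folder).glob("*.fastq.gz"))


fastq_files = list_fastq("Core_Dataset_Prep")
print(f"Found {len(fastq_files)} FASTQ file(s)")
for name in fastq_files:
    print(" ", name)
```

If the count is zero, re-run the download cell and check that your account can read the bucket.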

Nextflow requires Java, so we install it here.

In [None]:
#First install java
!sudo apt update
!sudo apt-get install default-jdk -y
!java -version

Next, we specify the version and platform for Nextflow, then install it. The `nextflow` executable is downloaded into the current working directory and can be run by typing `./nextflow`.

In [None]:
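# Specify the Nextflow version and platform.
# Note: in Jupyter, each `!` line runs in its own subshell, so these exports do not
# carry over to later lines; use the `%env` magic instead if the version must be pinned.
! export NXF_VER=21.10.0
! export NXF_MODE=google
# Install Nextflow, make it executable, and update it
! curl https://get.nextflow.io | bash
! chmod +x nextflow
! ./nextflow self-update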

Below is the gcb profile that we have added to the nextflow.config file. Config files specify an enormous range of parameters that Nextflow uses to run your workflow. You can see where we have specified Google Batch as the executor for the pipeline as well as lines that specify a working directory, an output directory, and a project ID. It is important that you update these with paths to your bucket and project ID prior to launching your run in the cell below.

```
gcb{
        process.executor = 'google-batch'
        workDir = 'gs://<your_bucket>/gcb'
        google.location = 'us-central1'
        google.region  = 'us-central1'
        google.project = '<your_project_id>'
        params.outdir = 'gs://<your_bucket>/gcb/outdir'
        process.machineType = 'c2d-highmem-16'
     }
```
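Because a run launched with the placeholders still in place will fail, a quick check before launching can save time. The sketch below is one possible approach, not part of Nextflow. The helper `unfilled_placeholders` is a name we made up, and it assumes placeholders follow the `<like_this>` pattern used in the profile above.

```python
import re

# Matches tokens such as <your_bucket> or <your_project_id>.
PLACEHOLDER = re.compile(r"<[A-Za-z_]+>")


def unfilled_placeholders(config_text):
    """Return the sorted, de-duplicated <placeholder> tokens left in config_text."""
    return sorted(set(PLACEHOLDER.findall(config_text)))


# A copy of the relevant profile lines, used here only for demonstration.
profile = """
gcb{
    process.executor = 'google-batch'
    workDir = 'gs://<your_bucket>/gcb'
    google.project = '<your_project_id>'
    params.outdir = 'gs://<your_bucket>/gcb/outdir'
}
"""

print(unfilled_placeholders(profile))
```

To check your own file, pass its contents instead, for example `unfilled_placeholders(open("ampliseq/nextflow.config").read())`. An empty list means no placeholder of this form remains.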

In [None]:
! ./nextflow run nf-core/ampliseq \
    -profile gcb \
    -c ampliseq/nextflow.config \
    --input_folder "Core_Dataset_Prep/" \
    --FW_primer GTGYCAGCMGCCGCGGTAA \
    --RV_primer GGACTACNVGGGTWTCTAAT \
    --skip_taxonomy "true" \
    --skip_cutadapt "true"

After the pipeline completes, you can see the output of each stage in the directory you specified as `params.outdir` in the config file. The output files generated by fastqc, dada2, and other tools are each neatly organized into their own directory. As you scale the size and number of metagenomic runs that you execute, using Google Batch with Nextflow will greatly improve the efficiency and reproducibility of your analysis.
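If you list the output bucket (for example with `gsutil ls -r`), the per-tool layout can be summarized with a few lines of Python. This is a sketch under assumptions: the bucket name and object paths below are hypothetical stand-ins rather than real pipeline output, and `group_by_stage` is our own helper.

```python
from collections import defaultdict


def group_by_stage(paths, outdir):
    """Group object paths by the first directory component below outdir."""
    prefix = outdir.rstrip("/") + "/"
    groups = defaultdict(list)
    for path in paths:
        if not path.startswith(prefix):
            continue  # ignore anything outside the output directory
        stage, sep, rest = path[len(prefix):].partition("/")
        if sep and rest:  # keep only files that sit inside a stage directory
            groups[stage].append(rest)
    return dict(groups)


# Hypothetical listing; substitute the paths from your own bucket.
listing = [
    "gs://example-bucket/gcb/outdir/fastqc/sample_1_fastqc.html",
    "gs://example-bucket/gcb/outdir/fastqc/sample_2_fastqc.html",
    "gs://example-bucket/gcb/outdir/dada2/table.tsv",
    "gs://example-bucket/gcb/work/ab/cdef/command.log",
]

by_stage = group_by_stage(listing, "gs://example-bucket/gcb/outdir")
for stage, files in sorted(by_stage.items()):
    print(f"{stage}: {len(files)} file(s)")
```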

---
# Conclusion

You can see from the progress updates above that each step of the pipeline was given a process name and allocated appropriate resources. Once a process finished, its resources were shut down and deleted so that you did not continue paying for them. Output files and logging information were stored safely in the storage buckets we specified in the config file. Using resource managers like Google Batch is a great way to scale your workflows as your datasets get bigger.