# Use Nextflow to run workflows using Google Batch: Part I

## Overview
__What is Google Batch?__ <br>
Batch allows you to schedule, queue, and execute batch processing workloads on Compute Engine VM instances. Batch provisions resources and manages capacity on your behalf, allowing your batch workloads to run at scale. 

__How does Batch differ from Cloud Life Sciences?__ <br>
You don't need to configure and manage third-party job schedulers, provision and deprovision resources, or request resources one zone at a time. To run a job, you specify parameters for the resources required for your workload, then Batch obtains resources and queues the job for execution. Batch provides native integration with other Google Cloud services to aid in the scheduling, execution, storage, and analysis of batch jobs.

<div class="alert alert-block alert-danger"> <b>Warning:</b> Google Life Sciences API is deprecated and will no longer be available on GCP by July 8, 2025. </div>

Here we are going to walk through submitting simple jobs directly to Google Batch, then dive into interacting with Google Batch using Nextflow. We will run some basic Hello World jobs, then move to a more complex [nf-core Methylseq workflow](https://nf-co.re/methylseq). 

## Learning Objectives
+ Learn how to use Nextflow in Google Cloud
+ Learn how to submit jobs to Google Batch

## Prerequisites
Make sure that Batch, Compute Engine, and Cloud Storage APIs are all enabled.

You also want to make sure your Compute Engine Default Service Account has the following Roles:

- Service Account User
- Batch Agent Reporter 
- Storage Admin
- Storage Object Admin
- Batch Job Editor <br>

Your Service Account should already have these roles assigned, but if not, reach out to Support to have your account updated.

## Get Started

### Install packages and set up environment

### Create a bucket

In [None]:
#make sure you change this name, it needs to be globally unique
%env BUCKET=gbatch-nextflow
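
Because bucket names must be globally unique, one option is to append a short random suffix to the name. The sketch below is only an illustration: the helper `unique_bucket_name` is a hypothetical name, not part of any Google tool, and the prefix is the example bucket name used in this tutorial.

```python
import uuid


def unique_bucket_name(prefix: str = "gbatch-nextflow") -> str:
    """Append a short random suffix so the name is unlikely to collide."""
    suffix = uuid.uuid4().hex[:8]  # 8 lowercase hexadecimal characters
    return f"{prefix}-{suffix}"


bucket = unique_bucket_name()
print(bucket)
```

To use the generated name in later cells, you could set it with `os.environ["BUCKET"] = bucket` so that the `!` commands see it.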

In [None]:
#will only create the bucket if it doesn't yet exist
! gsutil ls gs://$BUCKET > /dev/null 2>&1 || gsutil mb gs://$BUCKET

In [None]:
#enable versioning on the bucket so that overwritten files are kept as older versions
! gsutil versioning set on gs://$BUCKET

### Install dependencies

In [None]:
#First install java
! sudo apt update
! sudo apt-get install default-jdk -y
! java -version

In [None]:
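#Specify the Nextflow version and platform
#Note: each `!` line runs in its own subshell, so these exports do not carry over to the install command below
! export NXF_VER=21.10.0
! export NXF_MODE=google
#Install Nextflow, make it executable, and update it
! curl https://get.nextflow.io | bash
! chmod +x nextflow
! ./nextflow self-update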

### Submit Hello World to Batch Directly

#### Submitting a job through the command line

To submit a batch job through the command line, you first need to create a __json__ file; this is your config file. You can use the hello world script below as a template for your batch job. We will name the file hello-world.json.

```
{
    "taskGroups": [
        {
            "taskSpec": {
                "runnables": [
                    {
                        "container": {
                            "imageUri": "gcr.io/google-containers/busybox",
                            "entrypoint": "/bin/sh",
                            "commands": [
                                "-c",
                                "echo Hello world! This is task ${BATCH_TASK_INDEX}. This job has a total of ${BATCH_TASK_COUNT} tasks >> /mnt/disks/gbatch-nextflow/hello-world.txt"
                            ]
                        }
                    }
                ],
                "volumes": [
                    {
                        "gcs": {
                            "remotePath": "gbatch-nextflow"
                        },
                        "mountPath": "/mnt/disks/gbatch-nextflow"
                    }
                ],
                
                "computeResource": {
                    "cpuMilli": 2000,
                    "memoryMib": 16
                },
                "maxRetryCount": 2,
                "maxRunDuration": "3600s"
            },
            "taskCount": 4,
            "parallelism": 2
        }
    ],
    "allocationPolicy": {
        "instances": [
            {
                "policy": { "machineType": "e2-standard-4" }
            }
        ]
    },
    "labels": {
        "department": "finance",
        "env": "testing"
    },
    "logsPolicy": {
        "destination": "CLOUD_LOGGING"
    }
}
```

Let's break down the script:
- Our image and commands are specified in the block labeled "container": imageUri is the busybox image, and our commands echo Hello World.
    - You will notice that the command writes to the path where our bucket is mounted, so our output file hello-world.txt is stored in our bucket __(do not forget to change the mount path to your bucket name)__.
    - You may also have noticed the variables BATCH_TASK_INDEX and BATCH_TASK_COUNT. These are environment variables that Batch sets for every task, so they don't need to be defined beforehand; they show which task is currently running and how many tasks the job has in total.
- The 'volumes' block is where we specify our Google bucket and the path used to mount our bucket in the container. __(do not forget to change the mountPath and remotePath to your bucket name)__ 
- 'computeResource' is where we define the CPU (cpuMilli) and memory (memoryMib) for each task. 'maxRunDuration' defines how long a task may run, 'taskCount' how many tasks the job has, and 'parallelism' how many of those tasks run at the same time.
- Under 'instances' in our script is where we can specify our machine type.
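
Since the bucket name appears in three places (the command, remotePath, and mountPath), it can help to generate the config from a single variable. The following is a minimal sketch that rebuilds the same job definition as a Python dictionary; the function name `build_hello_job` is hypothetical, and the field values are copied from the template above.

```python
import json


def build_hello_job(bucket: str, task_count: int = 4, parallelism: int = 2) -> dict:
    """Build the hello-world job config with the bucket name filled in everywhere."""
    mount_path = f"/mnt/disks/{bucket}"
    # The ${...} variables are left as-is; Batch fills them in for each task.
    command = (
        "echo Hello world! This is task ${BATCH_TASK_INDEX}. "
        "This job has a total of ${BATCH_TASK_COUNT} tasks "
        f">> {mount_path}/hello-world.txt"
    )
    return {
        "taskGroups": [{
            "taskSpec": {
                "runnables": [{
                    "container": {
                        "imageUri": "gcr.io/google-containers/busybox",
                        "entrypoint": "/bin/sh",
                        "commands": ["-c", command],
                    }
                }],
                "volumes": [{"gcs": {"remotePath": bucket}, "mountPath": mount_path}],
                "computeResource": {"cpuMilli": 2000, "memoryMib": 16},
                "maxRetryCount": 2,
                "maxRunDuration": "3600s",
            },
            "taskCount": task_count,
            "parallelism": parallelism,
        }],
        "allocationPolicy": {"instances": [{"policy": {"machineType": "e2-standard-4"}}]},
        "labels": {"department": "finance", "env": "testing"},
        "logsPolicy": {"destination": "CLOUD_LOGGING"},
    }


job = build_hello_job("gbatch-nextflow")  # replace with your own bucket name
print(json.dumps(job, indent=4))
```

Saving the printed JSON as hello-world.json (for example with `json.dump`) gives a file you can pass to the submit command.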


Now we can submit our job, specifying the name of the job (hello-world), the location (us-central1), and the path to our json file.

In [None]:
! gcloud batch jobs submit hello-world \
  --location us-central1 \
  --config ~/hello-world.json

#### Submitting a job through the console

Running a batch job through the console allows for a user-friendly view to input data and scripts and view the status of the jobs you created.

Start by searching for __'Batch'__ in the console search bar. You should see a page similar to this:  
<img src="../../../images/batch_start.png" width="300" height="300">

Near the upper left corner click <img src="../../../../images/create.png" width="50" height="50">

The following should appear on the screen:
 
 <img src="../../../images/create_job_console.png" width="300" height="300">
 
 This is where you can:
 - Label your job
 - Select a region and zone to execute your job
 - Select your machine type (e.g. e2-medium)
 - Specify tasks by adding a script and/or specifying a container to run the task in
 - Allocate resources for each task
 - Add a storage volume

Once you have entered the settings for your batch job, you can view the full script that you would submit through the command line by clicking __'EQUIVALENT COMMAND LINE'__ next to __'CREATE'__. Delete the script that is already there and paste in the script we had above.

 <img src="../../../images/Batch_command_line_console.png" width="400" height="400">

Run your job by clicking __'CREATE'__. <br>
Once the job is created, click the job name to view more information about the job's settings and the resources applied, and view the logs by clicking  <img src="../../../images/log.png" width="100" height="100">

### Check job status

You can view the status of your job by looking at the __'Job List'__ in the Google Console. Here you will see your job name, status, region, memory per task, machine type, date started and run time.

 <img src="../../../images/Job_l.png" width="500" height="500">
 
To check the job status via the command line, enter the following, changing the job name and location to match your job.

In [None]:
! gcloud batch jobs describe hello-world \
    --location=LOCATION
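
If you only want the job state rather than the full description, you can ask gcloud for JSON output and read the `status.state` field. The sketch below assumes gcloud is installed and authenticated; the function names are hypothetical, and the sample response is a trimmed, made-up example so the parsing can be tried offline.

```python
import json
import subprocess


def parse_job_state(describe_json: str) -> str:
    """Return status.state from `gcloud batch jobs describe --format=json` output."""
    job = json.loads(describe_json)
    return job.get("status", {}).get("state", "STATE_UNSPECIFIED")


def describe_job_state(name: str, location: str) -> str:
    """Call gcloud (must be installed and authenticated) and return the job state."""
    result = subprocess.run(
        ["gcloud", "batch", "jobs", "describe", name,
         f"--location={location}", "--format=json"],
        check=True, capture_output=True, text=True,
    )
    return parse_job_state(result.stdout)


# Offline example using a made-up, trimmed response
sample = '{"name": "projects/p/locations/us-central1/jobs/hello-world", "status": {"state": "SUCCEEDED"}}'
print(parse_job_state(sample))  # SUCCEEDED
```

Against a real job you would call, for example, `describe_job_state("hello-world", "us-central1")`.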

### View your output

In [None]:
! gsutil ls gs://$BUCKET/

In [None]:
! gsutil cp gs://$BUCKET/hello-world.txt .
! cat hello-world.txt
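
To get a sense of what the output should contain, the sketch below mimics how Batch fills in BATCH_TASK_INDEX and BATCH_TASK_COUNT for each of the four tasks in the template job. It runs locally and does not contact Batch; the helper `expected_lines` is a hypothetical name, and the order of lines in the real file may differ because tasks run in parallel.

```python
from string import Template

# Same message as the echo command in hello-world.json
MESSAGE = Template(
    "Hello world! This is task ${BATCH_TASK_INDEX}. "
    "This job has a total of ${BATCH_TASK_COUNT} tasks"
)


def expected_lines(task_count: int) -> list[str]:
    """Return one line per task; task indexes start at 0."""
    return [
        MESSAGE.substitute(BATCH_TASK_INDEX=index, BATCH_TASK_COUNT=task_count)
        for index in range(task_count)
    ]


for line in expected_lines(4):
    print(line)
```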

## Run Nextflow Locally

### Nextflow 101

Nextflow uses several different files to make up a working workflow:

- __Main file__: The main file is a .nf file that holds the processes and channels describing the input, the output, a shell script of your commands, any conditions, and the workflow block, which acts like a recipe book for Nextflow. For Snakemake users, a process is roughly equivalent to a 'rule'.
    - __Process__: Contains channels and scripts that can be executed on a Linux server, like bash commands.
    - __Channel__: Provides the way processes communicate with each other. For example, inputs and outputs are channels that point the process to where data is or should be located.
- __Config file__: The .config file contains parameters and one or more profiles. Each profile can contain a different executor type (e.g. Life Sciences API, conda, docker, etc.), memory or machine type, output directory, working directory and more!
- __Docker file__: Contains the dependencies and environments that are needed for the Nextflow workflow to run.
- __Schema file__: Schema files are optional, structured json files that contain information about the usage and commands that your workflow will execute. You might have seen this information when running a command with the '--help' flag.
    

### Run a nextflow 'Hello World' process locally

We are going to first run Hello World locally using the main script file called hello.nf. 

It should look like this:

```
#!/usr/bin/env nextflow
nextflow.enable.dsl=2 

params.str = 'Hello World'

process sayHello {
  input:
  val str

  output:
  stdout

  """
  echo $str > hello.txt
  cat hello.txt
  """
}
workflow {
  sayHello(params.str) | view
}
```

In [None]:
! ./nextflow run hello.nf --str 'Hello!'

## Submit Nextflow Job to Google Batch
Create and modify your own config file to include a 'gbatch' profile block that tells Nextflow to submit the job to Google Batch instead of running it locally.

The config file allows Nextflow to use executors like Google Batch. In this tutorial the config file is named __'nextflow.config'__. Make sure you open this file and update the `<VARIABLES>` that are account specific.
- Make sure that your region is one where Google Batch is available!
- Specify your working directory bucket and output directory bucket
- Specify the machine type you would like to use, ensuring that there is enough memory and CPUs for the workflow
    - Otherwise Google Batch will automatically use 1 CPU

```
profiles{
  gbatch{
      process.executor = 'google-batch'
      workDir = 'gs://<BUCKET>/methyl-seq'
      google.location = 'us-central1'
      google.region  = 'us-central1'
      google.project = '<YOUR_PROJECT>'
      params.outdir = 'gs://<BUCKET>/methyl-seq/outdir'
      process.machineType = 'c2-standard-30'
     }
}
```

__Note:__ Make sure your working directory and output directory are different! Google Batch creates temporary files in the working directory within your bucket, and these take up space, so once your pipeline has completed successfully feel free to delete the temporary files.
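
One way to avoid typos when filling in the `<VARIABLES>` is to generate the profile text from your own values and check the two directories at the same time. This is only a sketch: `render_gbatch_profile` is a hypothetical helper, and `my-project` and `my-bucket` are placeholders to replace with your own project and bucket.

```python
def render_gbatch_profile(project: str, work_dir: str, out_dir: str,
                          location: str = "us-central1",
                          machine_type: str = "c2-standard-30") -> str:
    """Return the text of a 'gbatch' profile block for a Nextflow config file."""
    for label, path in (("workDir", work_dir), ("params.outdir", out_dir)):
        if not path.startswith("gs://"):
            raise ValueError(f"{label} must be a gs:// path, got {path!r}")
    # The working directory and output directory must not be the same location
    if work_dir.rstrip("/") == out_dir.rstrip("/"):
        raise ValueError("workDir and params.outdir must be different")
    lines = [
        "profiles {",
        "  gbatch {",
        "    process.executor = 'google-batch'",
        f"    workDir = '{work_dir}'",
        f"    google.location = '{location}'",
        f"    google.region = '{location}'",
        f"    google.project = '{project}'",
        f"    params.outdir = '{out_dir}'",
        f"    process.machineType = '{machine_type}'",
        "  }",
        "}",
    ]
    return "\n".join(lines) + "\n"


print(render_gbatch_profile(
    project="my-project",                         # placeholder
    work_dir="gs://my-bucket/methyl-seq",         # placeholder
    out_dir="gs://my-bucket/methyl-seq/outdir",   # placeholder
))
```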

### Optional: Listing nf-core tools with docker and viewing their commands
Using the command below you can see all the tools that nf-core holds and their versions/latest releases.

In [None]:
! docker run nfcore/tools list

You can view commands for methylseq (or any other specified nf-core tool) by using the [--help] flag

In [None]:
! ./nextflow run nf-core/methylseq -r 1.6.1 --help

### Run Methylseq with the test profile

The 'test' profile uses a small dataset allowing you to ensure the workflow works with your config file without long runtimes. Ensure you include:
- Version of the nf-core tool [-r]
- Location of the config file [-c]

In [None]:
! ./nextflow run nf-core/methylseq -r 1.6.1 -profile test,gbatch -c nextflow-methyseq.config

In the output above, the text in __[ ]__ to the left of each process is a __tag__ that you can search for in Google Batch, and the text before the __/__ corresponds to the __temporary directories__ within your working directory. Feel free to delete the temporary directories once your workflow has successfully completed.
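
To locate the temporary directory that belongs to a given tag, you can turn the tag into a path pattern under your working directory. The sketch below assumes the usual Nextflow layout, where the two characters before the __/__ are a subdirectory and the characters after it are the start of the task directory name; `work_dir_glob` is a hypothetical helper and the bucket name and tag are made up.

```python
import re

# Matches a tag such as [1a/2b3c4d], with or without the square brackets
TAG_PATTERN = re.compile(r"^\[?([0-9a-f]{2})/([0-9a-f]{6})\]?$")


def work_dir_glob(work_dir: str, tag: str) -> str:
    """Return a wildcard path matching the task directory for a tag."""
    match = TAG_PATTERN.match(tag.strip())
    if match is None:
        raise ValueError(f"not a Nextflow task tag: {tag!r}")
    prefix, rest = match.groups()
    return f"{work_dir.rstrip('/')}/{prefix}/{rest}*"


print(work_dir_glob("gs://my-bucket/methyl-seq", "[1a/2b3c4d]"))
# gs://my-bucket/methyl-seq/1a/2b3c4d*
```

The resulting pattern can be passed to `gsutil ls` to list the files for that task.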

Congrats! You are done with Part I. If you want to keep going and learn how to use the Methylseq workflow with real data, then move to Part II. If not, then feel free to clean up your resources. 

## Troubleshooting

Some of the nf-core tools require extra parameters:
- If you receive a __'quota exceeded'__ error, you can increase your boot disk size by adding the __google.batch.bootDiskSize__ parameter to the gbatch profile within your config file (e.g., google.batch.bootDiskSize = 100.GB)
- If an error says that a tool could not be used or was not installed, or does not really explain why the process stopped, you can try to increase the process time in your profile by using the __process.time__ parameter (e.g., process.time = '2h')
- If you receive an error like the one below, upgrading to a newer release of Nextflow (v23.04.0 or later) should fix it
```
Caused by:
  Task java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask@49b3a025[Not completed, task = java.util.concurrent.Executors$RunnableAdapter@2e0ceb8c[Wrapped task = TrustedListenableFutureTask@25c1396d[status=PENDING, info=[task=[running=[NOT STARTED YET], com.google.api.gax.rpc.AttemptCallable@2db57b9a]]]]] rejected from java.util.concurrent.ScheduledThreadPoolExecutor@aa6214[Terminated, pool size = 0, active threads = 0, queued tasks = 0, completed tasks = 0]

```
- Adding the __-log__ option on the command line produces a log file that will help you troubleshoot other errors, like so: 
`./nextflow -log DIRECTORY_NAME/nextflow.log run <process name>`
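
Nextflow log files can be long, so it can help to pull out only the warning and error lines first. The sketch below assumes each log line carries its level (such as WARN or ERROR) as a separate word; the sample lines are made up for illustration and `problem_lines` is a hypothetical helper.

```python
import re

# A log level surrounded by whitespace, e.g. "[main] ERROR nextflow..."
LEVEL_PATTERN = re.compile(r"\s(ERROR|WARN)\s")


def problem_lines(log_text: str) -> list[str]:
    """Return only the lines logged at WARN or ERROR level."""
    return [line for line in log_text.splitlines() if LEVEL_PATTERN.search(line)]


# Made-up sample in the general shape of a Nextflow log
sample_log = """\
Apr-01 10:00:00.000 [main] DEBUG nextflow.cli.Launcher - $> nextflow run hello.nf
Apr-01 10:00:05.000 [main] WARN  nextflow.Session - example warning
Apr-01 10:00:09.000 [main] ERROR nextflow.cli.Launcher - example error
"""

for line in problem_lines(sample_log):
    print(line)
```

For a real run you would read the file written by `-log` and pass its text to `problem_lines`.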

## Conclusion
Here you learned how to submit a test job directly to Google Batch, run Nextflow locally, and then submit a Nextflow job to Google Batch.

## Clean up
If you want to clean up all resources associated with this tutorial, then:
+ delete your bucket with `gsutil rm -r gs://$BUCKET`
+ delete this VM in either Vertex AI or Compute Engine