# What is Google Batch?

Batch allows you to schedule, queue, and execute batch processing workloads on VM instances. Batch provisions resources and manages capacity on your behalf, allowing your batch workloads to run at scale.

__What are tasks?__

Tasks are the actions or steps that Batch executes. You can set how many tasks a job has, the resources (CPUs and memory) that each task needs, and how many tasks should run in parallel.
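As a rough illustration of how these three settings interact, the sketch below computes the peak resource request of a task group. The field names (`taskCount`, `parallelism`, `cpuMilli`, `memoryMib`) follow the Batch job JSON format; the values and the `peak_demand` helper are hypothetical.

```python
# Hypothetical task settings; only the field names come from the Batch job format.
task_group = {
    "taskCount": 8,        # how many tasks the job contains
    "parallelism": 4,      # how many tasks may run at the same time
    "taskSpec": {
        "computeResource": {
            "cpuMilli": 2000,   # 2000 milli-CPU = 2 vCPUs per task
            "memoryMib": 4096,  # 4 GiB of memory per task
        }
    },
}


def peak_demand(group):
    """Return (vCPUs, MiB) requested when the maximum number of tasks run at once."""
    resources = group["taskSpec"]["computeResource"]
    running = min(group["parallelism"], group["taskCount"])
    return running * resources["cpuMilli"] / 1000, running * resources["memoryMib"]


cpus, memory_mib = peak_demand(task_group)
print(f"Peak request: {cpus:g} vCPUs, {memory_mib} MiB")
```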

__How does Batch differ from Cloud Life Sciences?__

You don't need to configure and manage third-party job schedulers, provision and deprovision resources, or request resources one zone at a time. To run a job, you specify parameters for the resources required for your workload, then Batch obtains resources and queues the job for execution. Batch provides native integration with other Google Cloud services to aid in the scheduling, execution, storage, and analysis of batch jobs.

# 0. Batch Prerequisites

Before using Batch, first enable the __Batch, Compute Engine, and Cloud Logging__ APIs, if you haven't already, by searching for each product and clicking <img src="batch_images/service_account_5.jpg" width="50" height="50">.
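If you prefer the command line to the console, the same three APIs can be enabled with `gcloud services enable`. The sketch below only builds and prints the command; `YOUR_PROJECT_ID` is a placeholder and the `enable_apis_command` helper is hypothetical.

```python
import shlex

# Service names for the Batch, Compute Engine, and Cloud Logging APIs.
REQUIRED_APIS = [
    "batch.googleapis.com",
    "compute.googleapis.com",
    "logging.googleapis.com",
]


def enable_apis_command(project_id):
    """Build the gcloud command that enables the required APIs for one project."""
    return ["gcloud", "services", "enable", *REQUIRED_APIS, "--project", project_id]


# Print the command so it can be reviewed before it is run in a terminal.
print(shlex.join(enable_apis_command("YOUR_PROJECT_ID")))
```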



Once the APIs are enabled, add the following roles to your service account:
- Service Account Admin
- Batch Agent Reporter 
- Storage Admin
- Storage Object Admin
- Batch Job Admin

To do this, open the navigation menu and go to __IAM & Admin__, then __Service Accounts__.

<img src="batch_images/service_account_1.jpeg" width="200" height="200"> 

You can create a new service account by clicking <img src="batch_images/service_account_2.jpeg" width="200" height="200"> and naming it __"batch service account"__.

To add roles to your new service account click the edit icon <img src="batch_images/service_account_3.jpeg" width="30" height="30">

Then add the above roles through the filter bar 

<img src="batch_images/service_account_4.jpg" width="300" height="300">

Once the roles are added click __Save__

__WARNING__: Please __do not create a service account key__, even if a tutorial instructs you to. Long-lived keys such as API keys are generally not considered secure: they are typically accessible to clients, which makes them easy to steal. A stolen key has no expiration, so it can be used indefinitely unless the project owner revokes or regenerates it.

# 1. Submit a job directly to Google Batch

### 1.1 Submitting a job through the console

Running a batch job through the console gives you a user-friendly view for entering data and scripts and for checking the status of the jobs you created.

Start by searching for __'Batch'__ in the console search bar. You should see a screen similar to this:
<img src="batch_images/batch_start.png" width="300" height="300">

Near the upper left corner click <img src="batch_images/create.png" width="50" height="50">

 
 The following should appear on the screen:
 
 <img src="batch_images/create_job_console.png" width="300" height="300">
 
 This is where you can:
 - Label your job
 - Select a region and zone to execute your job
 - Select your machine type (e.g. e2-medium)
 - Specify tasks by adding a script and/or specifying a container to run the task in
 - Allocate resources for each task
 - Add a storage volume

Once you have entered the settings for your batch job, you can view the full command that you would submit through the command line by clicking __'EQUIVALENT COMMAND LINE'__ next to __'CREATE'__, which shows the following output. The next section goes into more depth on how to edit this configuration and submit your job through the command line.

 <img src="batch_images/Batch_command_line_console.png" width="400" height="400">

Run your job by clicking __'CREATE'__. You can then view its status in the __'Job List'__, which shows the job name, status, region, memory per task, machine type, date started, and run time.

 <img src="batch_images/Job_l.png" width="500" height="500">

By clicking the job name you can view more information about the job's settings and the resources applied. To see the logs, click <img src="batch_images/log.png" width="100" height="100">

### 1.2 Submitting a job through the command line

To submit a batch job through the command line, you first need to create a __json__ file, which serves as your config file. You can use the attached hello world script as a template for your batch job.

The hello world script will look like this:
    
<img src="batch_images/script.png" width="600" height="600">
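For readers who cannot view the screenshot, the sketch below generates a comparable config file from Python. It is modelled on the general shape of a Batch script job, not copied from the attached file: the resource values, machine type, and the `hello_world_job` and `write_config` helpers are assumptions chosen for illustration.

```python
import json
import tempfile
from pathlib import Path


def hello_world_job(task_count=4, parallelism=2):
    """Return a Batch job config (as a dict) that runs one echo script per task."""
    return {
        "taskGroups": [
            {
                "taskSpec": {
                    "runnables": [
                        {
                            "script": {
                                "text": "echo Hello world! This is task ${BATCH_TASK_INDEX}. "
                                        "This job has a total of ${BATCH_TASK_COUNT} tasks."
                            }
                        }
                    ],
                    # Illustrative per-task resources and limits (assumed values).
                    "computeResource": {"cpuMilli": 2000, "memoryMib": 16},
                    "maxRetryCount": 2,
                    "maxRunDuration": "3600s",
                },
                "taskCount": task_count,
                "parallelism": parallelism,
            }
        ],
        "allocationPolicy": {
            "instances": [{"policy": {"machineType": "e2-standard-4"}}]
        },
        "logsPolicy": {"destination": "CLOUD_LOGGING"},
    }


def write_config(job, path):
    """Serialize the job dict to an indented JSON file and return its path."""
    path = Path(path)
    path.write_text(json.dumps(job, indent=2))
    return path


# Written to a temporary directory here; point this at your own path instead.
config_path = write_config(hello_world_job(), Path(tempfile.mkdtemp()) / "hello-world-script.json")
print(config_path.read_text())
```

Building the config as a dictionary first means a typo in the structure shows up as a Python error or an obviously wrong printout, before anything is sent to Batch.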

To submit the job enter:

gcloud beta batch jobs submit _job-name_ \
--location _location that job should be submitted in_ \
--config _the location of your json file_

The output shows our json file, and its first line tells us whether the submission was successful.



```
!gcloud beta batch jobs submit example-script-job \
  --location us-central1 \
  --config batch/hello-world-script.json
```
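When submissions are scripted, it can help to check the config locally before calling `gcloud`. The sketch below is one way to do that under stated assumptions: `submit_job` is a hypothetical helper, and by default it only prints the command (a dry run) without submitting anything.

```python
import json
import shlex
import subprocess
from pathlib import Path


def submit_job(job_name, location, config_path, dry_run=True):
    """Validate the JSON config, then print or run the gcloud submit command."""
    # Fail early on a missing or malformed config instead of waiting for gcloud.
    json.loads(Path(config_path).read_text())

    cmd = [
        "gcloud", "beta", "batch", "jobs", "submit", job_name,
        "--location", location,
        "--config", str(config_path),
    ]
    if dry_run:
        print(shlex.join(cmd))
    else:
        # Raises CalledProcessError if gcloud exits with a non-zero status.
        subprocess.run(cmd, check=True)
    return cmd
```

For example, `submit_job("example-script-job", "us-central1", "batch/hello-world-script.json")` prints the command shown above; passing `dry_run=False` would run it.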

To add a __bucket/storage__ to hold outputs from your batch job, add a __volumes__ block to your config file. This block sits next to the __runnables__ block inside __taskSpec__ (see the hello-world-script). You can list any of your existing buckets and have them mounted into the job's environment at the given mount path.

```
"volumes": [
    {
        "gcs": {
            "remotePath": "test-bucket"
        },
        "mountPath": "/mnt/share"
    }
]
```
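The same block can be added programmatically. This sketch assumes the `volumes` list belongs in `taskSpec`, next to `runnables`, as in the Batch job JSON format; `with_gcs_volume` is a hypothetical helper and the job dictionary is a stripped-down example.

```python
import copy
import json


def with_gcs_volume(job, bucket, mount_path="/mnt/share"):
    """Return a copy of `job` whose first task group mounts `bucket` at `mount_path`."""
    updated = copy.deepcopy(job)  # leave the caller's config untouched
    task_spec = updated["taskGroups"][0]["taskSpec"]
    task_spec.setdefault("volumes", []).append(
        {"gcs": {"remotePath": bucket}, "mountPath": mount_path}
    )
    return updated


# Minimal example job whose script writes its output into the mounted bucket.
job = {
    "taskGroups": [
        {"taskSpec": {"runnables": [{"script": {"text": "echo done > /mnt/share/result.txt"}}]}}
    ]
}
print(json.dumps(with_gcs_volume(job, "test-bucket"), indent=2))
```

Anything a task writes under the mount path ends up in the bucket, which is why the example script writes to `/mnt/share`.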

To run a task in your own __container__ instead of the default environment, your runnables block will look quite different.
Instead of writing "script", you will need to give the image name or address, the entrypoint, and any commands.

```
"runnables": [
    {
        "container": {
            "imageUri": "gcr.io/google-containers/busybox",
            "entrypoint": "/bin/sh",
            "commands": [
                "-c",
                "echo Hello world! This is task ${BATCH_TASK_INDEX}. This job has a total of ${BATCH_TASK_COUNT} tasks."
            ]
        }
    }
],
```
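To make the difference between the two kinds of runnable concrete, the sketch below builds both from small helper functions. The helpers `script_runnable` and `container_runnable` are hypothetical; the keys they emit (`script.text`, `container.imageUri`, `entrypoint`, `commands`) are the ones used in the snippets above.

```python
import json


def script_runnable(text):
    """A runnable that executes a shell script directly on the VM."""
    return {"script": {"text": text}}


def container_runnable(image_uri, commands, entrypoint=None):
    """A runnable that executes `commands` inside the given container image."""
    container = {"imageUri": image_uri, "commands": list(commands)}
    if entrypoint is not None:
        container["entrypoint"] = entrypoint
    return {"container": container}


message = "echo Hello world! This is task ${BATCH_TASK_INDEX}."

# The same message, once as a plain script and once inside busybox.
runnables = [
    script_runnable(message),
    container_runnable("gcr.io/google-containers/busybox", ["-c", message], entrypoint="/bin/sh"),
]
print(json.dumps({"runnables": runnables}, indent=2))
```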

# 2. Running Google Batch through Nextflow

A working Nextflow workflow relies on several different files:

- __Main file__: The main file is a .nf file that holds the processes and channels describing the inputs, outputs, and shell scripts of your commands, plus the workflow, which acts like a recipe book for Nextflow, and any conditions. For Snakemake users, processes are the equivalent of 'rules'.
    - __Process__: Contains channels and scripts that can be executed on a Linux server, such as bash commands.
    - __Channel__: Provides the means by which processes communicate with each other. For example, input and output are channels that point a process to where data is or should be located.
- __Config file__: The .config file contains parameters and multiple profiles. Each profile can set a different executor or runtime (e.g. LS API, conda, docker, etc.), memory or machine type, output directory, working directory and more!
- __Docker file__: Contains the dependencies and environments that the Nextflow workflow needs in order to run.
- __Schema file__: Schema files are optional, structured json files that describe the usage and parameters of your workflow. You may have seen this information when running a command with the '--help' flag.

Running Google Batch through Nextflow is very similar to running Life Sciences. The main differences are:
- You need a Nextflow release that supports Google Batch (version 22.07.1-edge or later from the edge channel). We used the latest version: 22.10.4
- You will need to create a new profile in your Nextflow config file
    - the profile will be named __gbatch__
    - the executor name will be 'google-batch'
    
```
profiles {
  gbatch {
      process.executor = 'google-batch'
      process.machineType = 'n2-standard-16'
      workDir = 'gs://BUCKETNAME/YOUR_WORKING_DIRECTORY' // Will store temporary files
      google.location = 'us-central1'
      google.region  = 'us-central1'
      google.project = 'YOUR_PROJECT_ID'
      params.outdir = 'gs://BUCKETNAME/YOUR_OUTPUT_DIRECTORY' // Will store your outputs; should be different from your working directory
  }
}
```
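If you maintain several projects or buckets, filling in the placeholders by hand is error-prone. The sketch below renders the same profile from parameters and refuses identical working and output directories. `render_gbatch_profile` and its arguments are hypothetical names; the settings mirror the profile above.

```python
from string import Template

# Mirrors the gbatch profile; $names are filled in by render_gbatch_profile.
PROFILE_TEMPLATE = Template("""\
profiles {
  gbatch {
      process.executor = 'google-batch'
      process.machineType = '$machine_type'
      workDir = 'gs://$bucket/$work_dir'
      google.location = '$location'
      google.region  = '$location'
      google.project = '$project'
      params.outdir = 'gs://$bucket/$out_dir'
  }
}
""")


def render_gbatch_profile(project, bucket, work_dir, out_dir,
                          location="us-central1", machine_type="n2-standard-16"):
    """Return the text of a Nextflow config containing a gbatch profile."""
    if work_dir.strip("/") == out_dir.strip("/"):
        raise ValueError("work_dir and out_dir should be different directories")
    return PROFILE_TEMPLATE.substitute(
        project=project, bucket=bucket, work_dir=work_dir, out_dir=out_dir,
        location=location, machine_type=machine_type,
    )


print(render_gbatch_profile("YOUR_PROJECT_ID", "BUCKETNAME",
                            "YOUR_WORKING_DIRECTORY", "YOUR_OUTPUT_DIRECTORY"))
```

The printed text can be saved as `nextflow.config` and passed to Nextflow with `-c`.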

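Always remember to self-update Nextflow and set the `NXF_MODE` variable so that the Google plugin is used.

```
# %env keeps the variable set for later cells; a "!export" only affects its own subshell
%env NXF_MODE=google
!./nextflow self-update
```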

Downloading nextflow dependencies. It may require a few seconds, please wait ..
CAPSULE: Downloading dependency org.multiverse:multiverse-core:jar:0.7.0
CAPSULE: Downloading dependency org.slf4j:jul-to-slf4j:jar:1.7.36
CAPSULE: Downloading dependency org.codehaus.groovy:groovy-templates:jar:3.0.16
CAPSULE: Downloading dependency io.nextflow:nf-httpfs:jar:23.04.0
CAPSULE: Downloading dependency org.slf4j:jcl-over-slf4j:jar:1.7.36
CAPSULE: Downloading dependency commons-io:commons-io:jar:2.11.0
CAPSULE: Downloading dependency org.codehaus.jsr166-mirror:jsr166y:jar:1.7.0
CAPSULE: Downloading dependency commons-codec:commons-codec:jar:1.15
CAPSULE: Downloading dependency com.google.guava:listenablefuture:jar:9999.0-empty-to-avoid-conflict-with-guava
CAPSULE: Downloading dependency ch.artecat.grengine:grengine:jar:3.0.0
CAPSULE: Downloading dependency org.yaml:snakeyaml:jar:1.33
CAPSULE: Downloading dependency com.beust:jcommander:jar:1.35
CAPSULE: Downloading dependency jl

Once your config file is set, we can run a hello world script to check that it works. Use the __-profile__ option to select **gbatch** and the __-c__ option to give the location of your config file.

```
!sudo nextflow run https://github.com/nextflow-io/hello -profile gbatch -c nextflow.config
```

## Test Data

Now we will run the nf-core RNAseq analysis using the test dataset. Even with the test data, this analysis still takes ~1hr to run. If the run ends in an error, you can use the -resume flag to pick back up where it stopped.

```
#G Batch RNA test
!./nextflow run nf-core/rnaseq -r 3.8.1 -c nextflow.config -profile test,gbatch -resume
```

## Real Data with Methylseq

```
#Real Data G Batch
!./nextflow run nf-core/methylseq \
    -r 1.6.1 \
    --input 'SRR067701.fastq.gz' \
    --genome GRCh37 \
    --single_end \
    --max_cpus 32 \
    --max_memory '110.GB' \
    -c nextflow.config \
    -profile gbatch \
    -resume
```

## Troubleshooting

Some of the nf-core tools require extra parameters:
- If you receive a 'quota exceeded' error, you can increase the boot disk size in the gbatch profile of your config file with the google.batch.bootDiskSize parameter (e.g., google.batch.bootDiskSize = 100.GB)
- If an error says that a tool could not be used or was not installed, or does not explain why the process stopped, try increasing the time limit in your profile with the process.time parameter (e.g., process.time = '2h')
- If you receive an error like the one below, upgrading to Nextflow v23.04.0 or later should fix it
```
Caused by:
  Task java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask@49b3a025[Not completed, task = java.util.concurrent.Executors$RunnableAdapter@2e0ceb8c[Wrapped task = TrustedListenableFutureTask@25c1396d[status=PENDING, info=[task=[running=[NOT STARTED YET], com.google.api.gax.rpc.AttemptCallable@2db57b9a]]]]] rejected from java.util.concurrent.ScheduledThreadPoolExecutor@aa6214[Terminated, pool size = 0, active threads = 0, queued tasks = 0, completed tasks = 0]

```
- Adding the -log option on the command line produces a log file that helps troubleshoot other errors, like so:
`./nextflow -log DIRECTORY_NAME/nextflow.log run <process name>`
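Nextflow log files can be long, so a small filter helps when hunting for the failing step. The sketch below lists lines that carry an `ERROR` or `WARN` level. It assumes the level appears as a separate word on each log line; `flagged_lines` is a hypothetical helper, and the path is the placeholder from the command above.

```python
from pathlib import Path


def flagged_lines(log_path, levels=("ERROR", "WARN")):
    """Return (line_number, text) pairs for log lines that mention one of `levels`."""
    hits = []
    lines = Path(log_path).read_text(errors="replace").splitlines()
    for number, line in enumerate(lines, start=1):
        # Match the level as a space-delimited word to skip incidental substrings.
        if any(f" {level} " in line for level in levels):
            hits.append((number, line.strip()))
    return hits


if __name__ == "__main__":
    log = Path("DIRECTORY_NAME/nextflow.log")  # placeholder path; adjust to your run
    if log.exists():
        for number, text in flagged_lines(log):
            print(f"{number}: {text}")
```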