# MDIBL Transcriptome Assembly Learning Module
# Notebook 4: Using TransPi on Google Batch

## Overview

So far, all of the computational work has been run locally, using the compute resources available within this Jupyter notebook. Although this works, it is not the ideal setup for fast, cost-efficient data analysis.

Google Batch is a scheduler: it provisions dedicated compute resources for the individual processes within our workflow. This provides two primary benefits:
> - Once a process is complete, the machine that ran it shuts down automatically, so you aren't paying for unused resources.
> - Multiple processes can be executed at the same time, allowing computational tasks to run in parallel. This means that the workflow finishes sooner from start to finish.
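
To make the second benefit concrete, here is a toy calculation that is not part of TransPi. The task names and durations are made up for illustration, and the sketch assumes that the tasks are fully independent and that each one gets its own machine.

```python
# Hypothetical task durations in minutes; illustrative values only.
durations = {"task_a": 30, "task_b": 45, "task_c": 20}


def serial_minutes(durations):
    """Wall-clock time when tasks run one after another on a single machine."""
    return sum(durations.values())


def parallel_minutes(durations):
    """Wall-clock time when every task runs at once on its own machine."""
    return max(durations.values())


print(serial_minutes(durations))    # 95
print(parallel_minutes(durations))  # 45
```

In a real workflow some processes wait on the output of others, so the actual speedup is usually smaller than this best case.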

Fortunately, Batch and Nextflow are compatible with each other, so any Nextflow workflow, including the TransPi workflow that we have been using, can be executed on Batch.


> <img src="images/gcbDiagram.jpg" width="1200">
>
> **Figure 1:** Diagram illustrating the interactions between the components used for the Google Batch run. 

For this to work, there are a few quick adjustment steps to make sure everything is set up for a Google Batch run!

## Learning Objectives:

1. **Utilize Google Batch for efficient and cost-effective data analysis:**  The notebook contrasts local computation with Google Batch, highlighting the benefits of the latter in terms of cost savings (auto-shutdown of unused resources) and speed (parallelization of tasks).

2. **Integrate Nextflow workflows with Google Batch:** The notebook demonstrates how to configure a Nextflow pipeline (TransPi) to run on Google Batch, emphasizing the compatibility between these tools.

3. **Manage files using Google Cloud Storage (GCS):**  The lesson requires users to create or utilize a GCS bucket to store the necessary files for the TransPi workflow, addressing the challenge of accessing local files from external compute resources.

4. **Configure a Nextflow pipeline for Google Batch execution:** This involves modifying the `nextflow.config` file to point to the GCS bucket, adjust compute allocations (CPU and memory), and specify the correct Google Batch profile.  It shows how to use Perl one-liners for efficient configuration changes.

5. **Interpret and compare the timelines of local and Google Batch runs:**  By comparing the `transpi_timeline.html` files from both local and Google Batch executions, users learn to analyze the performance differences and understand the impact of resource allocation.

6. **Execute and manage a Nextflow pipeline on Google Batch:** The notebook provides step-by-step instructions for running TransPi on Google Batch using specific command-line arguments and managing the output.

7. **Understand and utilize Google Cloud commands:**  The notebook uses `gcloud` and `gsutil` commands extensively, teaching users basic Google Cloud command-line interactions.

## Prerequisites

* **A Google Cloud Storage (GCS) Bucket:**  A bucket is needed to store the TransPi workflow's input files and output results. The notebook provides options to create a new bucket or use an existing one.
* **Sufficient Compute Resources:** The user needs to have sufficient quota available in their GCP project to handle the compute resources required by the TransPi workflow (CPUs, memory, disk space).  The notebook uses a `nextflow.config` file to configure the Google Batch execution.
* **`gcloud` CLI:** The Google Cloud SDK (`gcloud`) command-line tool must be installed and configured to authenticate with the GCP project.  The notebook uses `gcloud` commands to interact with GCP services.
* **`gsutil` CLI:** The `gsutil` command-line tool (part of the Google Cloud SDK) is used to interact with GCS.
* **Nextflow:** The Nextflow workflow engine must be installed in the Jupyter Notebook environment.
* **TransPi Workflow:** The TransPi Nextflow pipeline code must be available in the Jupyter Notebook environment's file system.  The notebook assumes it's in a `TransPi` directory.
* **Perl:** The notebook uses Perl one-liners for file manipulation.  Perl must be installed in the Jupyter Notebook environment.

## Get Started

In [None]:
#Run the command below to watch the video
from IPython.display import YouTubeVideo

YouTubeVideo('abw2XAg1e_g', width=800, height=400)

**Step 1:** Downsize the VM instance.
> Consider downloading or taking a screenshot of the following image as the downsizing process will involve briefly stopping this VM instance.
>
> <img src="images/VMdownsize.jpg" width="1200">

**Step 2:** Once again we are going to set the local working directory back to `/home/jupyter`.

In [None]:
%cd /home/jupyter

In [None]:
!pwd

<div class="alert alert-block alert-success">
    <i class="fa fa-hand-paper-o" aria-hidden="true"></i>
    <b>Note: </b>  Bucket for Batch
</div>

> Batch uses external machines to do our computing work for us, which means that it cannot find files that are stored locally within this Jupyter notebook. As a result, we need to put the files that TransPi needs in a location that those machines can reach: Google Cloud Storage (GCS) buckets!

**Step 3:** Create a variable for your Google project name
> - The first line is a Google Cloud command that gets the name of your project and puts it in a list named projName.
> - The second line gets the name, which is at the 0 index of the list and sets it to the variable `myProject`.
> - The third line just prints out the name.

In [None]:
projName=!gcloud config get-value project
myProject=projName[0]
myProject

**Step 4a:** Bucket Setup:

Set the variable `myBucketName` to one of the following:
1. If you plan on using an existing bucket, then set it to the name of that bucket.
2. If you would like to use a new bucket, then set the variable to whatever you would like to name your new bucket. Here are some quick naming guidelines:
    - You can use lowercase letters, numbers, dashes, underscores, and dots. 
    - The name cannot start or end in a dash, underscore, or dot.
    - Keep the name within the quotes.
    - More information can be found [here](https://cloud.google.com/storage/docs/buckets?_ga=2.188214954.-360038957.1673379287#naming).
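
The guidelines above can be checked before you try to create the bucket. The following is a minimal sketch, not an official validator: the function name `looks_like_valid_bucket_name` is hypothetical, and the 3 to 63 character length limit is an added assumption taken from the general GCS naming rules rather than from this notebook.

```python
import re

# Allowed characters: lowercase letters, digits, dashes, underscores, dots.
# The first and last characters must be a letter or digit.
# Assumption: total length of 3 to 63 characters (1 + 1..61 + 1).
_BUCKET_NAME_RE = re.compile(r"[a-z0-9][a-z0-9._-]{1,61}[a-z0-9]")


def looks_like_valid_bucket_name(name):
    """Return True if `name` follows the naming guidelines listed above."""
    return bool(_BUCKET_NAME_RE.fullmatch(name))


print(looks_like_valid_bucket_name("your-bucket-name"))  # True
print(looks_like_valid_bucket_name("-bad-name"))         # False
print(looks_like_valid_bucket_name("MyBucket"))          # False
```

Passing this check does not mean that the name is available. Bucket names are shared across all of Google Cloud, so creation can still fail if someone else already uses the name.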

In [None]:
myBucketName="your-bucket-name"
myBucketName

**Step 4b:** Replace names in files with personal bucket and project names. 

In [None]:
! sed -i "s/<YOUR-PROJECT-ID>/$myProject/g" nextflow.config
! sed -i "s/<YOUR-BUCKET-NAME>/$myBucketName/g" conf/test_params.config

**Step 4c:** Create a new GCS bucket. *If you are using an existing bucket, you can skip this step.*
> To do this, we can use a new `gsutil` command: `mb`, which stands for "make bucket".

In [None]:
!gsutil mb -p $myProject -c STANDARD -b on gs://$myBucketName

**Step 5:** Create a Google-recognizable path variable named `gbPath`.
> You don't need to change anything, just execute.

In [None]:
gbPath="gs://" + myBucketName + "/TransPi"
gbPath

**Step 6:** Copy the `resources` directory into your bucket.
> These are the same resources that we copied to the local directory in Submodule 01.

In [None]:
!gsutil -m cp -r gs://nigms-sandbox/nosi-inbremaine-storage/resources $gbPath/resources

**Step 7A:** Adjust our `nextflow.config` file paths.

This redirects all of the resource paths in the config so that they point to the GCS bucket instead of the local directory.
> This is a Perl one-liner that is very similar to the one used in Submodule 03.

In [None]:
!perl -i.annloc -pe "s#/home/jupyter#$gbPath#g" ./TransPi/nextflow.config
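
For readers less familiar with Perl, the one-liner above does roughly the following: it saves a backup copy of the file with the `.annloc` suffix, then replaces every occurrence of the old path with the new one. This is a sketch under assumptions, not a drop-in replacement. The helper name `replace_in_file` is hypothetical, and it matches literal text, whereas Perl treats the pattern as a regular expression (the two behave the same for a plain path like `/home/jupyter`).

```python
import shutil
from pathlib import Path


def replace_in_file(path, old, new, backup_suffix=".annloc"):
    """Replace every literal `old` with `new` in the file, keeping a backup.

    Returns the number of occurrences that were replaced.
    """
    path = Path(path)
    # Keep the original next to the file, e.g. nextflow.config.annloc
    shutil.copy2(path, path.with_name(path.name + backup_suffix))
    text = path.read_text()
    path.write_text(text.replace(old, new))
    return text.count(old)


# Example call (commented out so that it does not edit the config a second time):
# replace_in_file("./TransPi/nextflow.config", "/home/jupyter", gbPath)
```

The backup file is what lets you go back to the local configuration later if you need to.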

**Step 7B:** Adjust the names of directories and add your project name to the gcb profile.

In [None]:
!perl -i -pe 's/onlyAnnRun/basicRun/g' ./TransPi/nextflow.config

In [None]:
!perl -i -pe "s#your-project-name#$myProject#g" ./TransPi/nextflow.config

**Step 7C:** Adjust our `nextflow.config` compute allocations.

Now that we are using separately provisioned compute resources, we can allocate more CPU power and memory to specific processes.

> These are also Perl one-liners, but this time they are delimited with `/` instead of `#`.

In [None]:
!perl -i -pe "s/cpus='15'/cpus='20'/g" ./TransPi/nextflow.config

In [None]:
!perl -i -pe "s/memory='100 GB'/memory={ 100.Gb + (task.attempt * 50.Gb)}/g" ./TransPi/nextflow.config
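
The new memory setting is an expression rather than a fixed value, so the request grows each time a task is attempted. The sketch below reproduces that arithmetic in Python. It relies on the fact that Nextflow's `task.attempt` starts at 1, and the function name `memory_gb_for_attempt` is hypothetical. Whether a failed task is actually retried depends on the retry settings in the TransPi configuration, which this sketch does not inspect.

```python
def memory_gb_for_attempt(attempt, base_gb=100, step_gb=50):
    """Mirror 100.Gb + (task.attempt * 50.Gb) from the config, in GB."""
    if attempt < 1:
        raise ValueError("task.attempt starts at 1")
    return base_gb + attempt * step_gb


for attempt in (1, 2, 3):
    print(attempt, memory_gb_for_attempt(attempt))
# 1 150
# 2 200
# 3 250
```

Note that even the first attempt requests 150 GB, which is more than the original fixed value of 100 GB.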

<div class="alert alert-block alert-warning">
    <i class="fa fa-pencil" aria-hidden="true"></i>
    <b>Checkpoint 1:</b>
</div>

In [None]:
from jupytercards import display_flashcards
display_flashcards('Transcriptome-Assembly-Refinement-and-Applications/quiz-material/04-cp1-1.json')

**Step 8:** Time to run TransPi using Batch.
> This should take about **40 minutes.**

<div class="alert alert-block alert-success">
    <i class="fa fa-hand-paper-o" aria-hidden="true"></i>
    <b>Note: </b>  gcb profile
</div>

> Note that in the command, we use the profile gcb. This tells Nextflow that we want to use the gcb profile designated within the `nextflow.config` file. Here is what that profile looks like: 
> ```
> gcb {
>     process.executor = 'google-batch'
>     process.container = 'ubuntu'
>     google.location = 'us-central1'
>     google.region = 'us-central1'
>     google.project = 'your-project-name'
>     workDir = 'gs://your-bucket-name/TransPi/basicRun/work'
>     params.outdir = 'gs://your-bucket-name/TransPi/basicRun/output'
>     google.batch.bootDiskSize = 50.GB
>     google.storage.parallelThreadCount = 100
>     google.storage.maxParallelTransfers = 100
> }
> ```
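
Before launching a 40-minute run, it can be worth confirming that no placeholder text is left in the config. This is a minimal sketch that assumes the placeholder strings are exactly the ones shown in the profile above; the function name `unreplaced_placeholders` is hypothetical.

```python
# Placeholder strings as they appear in the example profile above.
PLACEHOLDERS = ("your-project-name", "your-bucket-name")


def unreplaced_placeholders(config_text, placeholders=PLACEHOLDERS):
    """Return the placeholders that still appear in the config text."""
    return [p for p in placeholders if p in config_text]


# Example call (assumes the config lives at the path used in earlier steps):
# with open("./TransPi/nextflow.config") as fh:
#     print(unreplaced_placeholders(fh.read()))
```

An empty list means that none of these placeholders remain. It does not prove that the project and bucket values are correct.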

In [None]:
!NXF_VER=22.10.1 ./nextflow run ./TransPi/TransPi.nf \
-profile gcb --k 17,25,43 --maxReadLen 50 --all -resume

**Step 9:** Take a look at `transpi_timeline.html` and compare it to the timeline of the local run.

>First we have to make a local directory to place the output.

In [None]:
!mkdir -p ./GCBbasicRun/output

>Now we can copy the `pipeline_info` directory from the bucket to our new local directory.

In [None]:
!gsutil -m cp -r $gbPath/basicRun/output/pipeline_info ./GCBbasicRun/output

>Now we can visualize both the local and GCB runs.

In [None]:
from IPython.display import IFrame
IFrame("../GCBbasicRun/output/pipeline_info/transpi_timeline.html",width=1200, height=900)

> **Figure 2:** GCB Run Timeline

In [None]:
IFrame('../basicRun/output/pipeline_info/transpi_timeline.html',width=1200, height=900)

> **Figure 3:** Local Run Timeline

<div class="alert alert-block alert-warning">
    <i class="fa fa-pencil" aria-hidden="true"></i>
    <b>Checkpoint 2:</b>
</div>

Consider the two figures that you just generated. In the markdown cell below, take some notes on the similarities and differences between the timelines of the two runs.

In [None]:
from jupytercards import display_flashcards
display_flashcards('../quiz-material/04-cp1-2.json')
display_flashcards('../quiz-material/04-cp1-3.json')
display_flashcards('../quiz-material/04-cp1-4.json')

**Step 10:** Now let's try a GCB run with `--onlyAnn`. Before we do, we need to change the `workDir` and `params.outdir` paths in `nextflow.config` so that the new run does not overwrite the output that we just created for the `--all` run.

In [None]:
!perl -i.allgcb -pe 's#basicRun#onlyAnnRun#g' ./TransPi/nextflow.config

**Step 11:** Time to run. The only change that we will make to the run command is to replace `--all` with `--onlyAnn`.
> This run should take about **30 minutes**.

In [None]:
!NXF_VER=22.10.1 ./nextflow run ./TransPi/TransPi.nf \
-profile gcb --onlyAnn

Feel free to explore the results found in the GCB `onlyAnn` run. The following cell will place the `pipeline_info` directory from the run into the directory: `./GCBonlyAnnRun/output`. The rest of the results should be essentially the same as the `onlyAnn` run locally in Submodule 03.

In [None]:
!mkdir -p ./GCBonlyAnnRun/output
!gsutil -m cp -r $gbPath/onlyAnnRun/output/pipeline_info ./GCBonlyAnnRun/output

##### At this point, you have the toolkit necessary to run TransPi in various configurations and the baseline knowledge to interpret the output that TransPi produces. You also have foundational knowledge of Google Cloud resources, including the ability to use buckets and cloud computing to execute your computational tasks. In particular, you have used Batch, which works not only with TransPi but also with any other Nextflow pipeline. We urge you to continue exploring TransPi with different data sets and to explore other Nextflow pipelines as well.

## Conclusion

This module demonstrated the execution of the TransPi transcriptome assembly workflow on Google Batch, a significant advancement from local Jupyter Notebook execution.  By leveraging Google Batch's scheduling capabilities, we achieved both cost efficiency through automated resource allocation and increased speed through parallelization of computational tasks.  The integration of Nextflow with Google Batch streamlined the process, requiring only minor adjustments to the `nextflow.config` file to redirect file paths to Google Cloud Storage (GCS) buckets and optimize compute allocations.  Comparison of local and Google Batch run timelines highlighted the benefits of cloud computing for large-scale bioinformatics analyses.  This learning module equipped users with the skills to effectively utilize Google Batch for efficient and scalable execution of Nextflow pipelines, paving the way for more complex and data-intensive bioinformatics projects.

## Clean Up

Proceed to the next notebook, [`Submodule_05_Bonus_Notebook.ipynb`](./Submodule_05_Bonus_Notebook.ipynb), or shut down your instance if you are finished.