# RNA-Seq Analysis using Nextflow and Google Batch

## Overview

This short tutorial demonstrates how to run an RNA-Seq workflow using a prokaryotic data set. Steps in the workflow include read trimming, read QC, read mapping, and counting mapped reads per gene to quantitate gene expression. This tutorial uses a popular workflow manager called [Nextflow](https://www.nextflow.io) run via [Google Batch](https://cloud.google.com/batch/docs/get-started). If you completed the other tutorials in this repo, you will see that it is similar to Tutorial 2, but instead of running Snakemake locally, we switch to Nextflow and run it using Batch in a serverless manner. 

## Learning Objectives

* **Installing and configuring Nextflow:** Learners will install mambaforge (a conda distribution) and Nextflow, a workflow management system, within their Jupyter environment. They will also learn how to configure their `nextflow.config` file to utilize the Google Batch execution environment.

* **Understanding and modifying a Nextflow pipeline:** The notebook utilizes a pre-existing Nextflow pipeline (`main.nf`).  The objective is to understand how to modify the pipeline's configuration file to point to the correct input data (samplesheet) and output locations within the GCS bucket. This includes setting appropriate parameters for the Google Batch execution (region, project ID, machine type).

* **Running a Nextflow workflow on Google Batch:**  Learners will submit their configured Nextflow workflow to Google Batch for execution. This involves understanding and utilizing command-line arguments to specify input files, output directory, working directory, and the execution profile.  They'll also learn about the `-resume` option for restarting partially completed workflows.

* **Analyzing RNA-Seq results:**  The final objective is to analyze the generated RNA-Seq data.  This involves using command-line tools (like `grep` and `sort`) to identify and report the top highly expressed genes and the expression levels of specific genes of interest (e.g., a putative acyl-ACP desaturase).

This Jupyter notebook performs RNA-Seq analysis using Nextflow on Google Cloud. The prerequisites are as follows:

**APIs that should be enabled:**

* **Google Cloud Batch API:** Required for submitting and managing the Nextflow workflow on Google Batch.
* **Google Cloud Compute Engine API:** Needed by Google Batch to provision the virtual machines that run the workflow.
* **Google Cloud Storage API:** Needed to interact with the Google Cloud Storage (GCS) buckets that hold input data, intermediate files, and output results. The notebook uses `gsutil` commands extensively.
* **Google Cloud Logging API:** Used to record log information about the Batch job execution.

**Cloud Platform Account Roles that must be assigned:**

The specific roles depend on how you want to manage resources. At minimum, the user needs roles that grant the following permissions:

* **Storage (for example, Storage Object Admin):** Create, read, write, and delete objects in the GCS bucket. This is needed for uploading inputs and storing outputs. A more granular role may suffice depending on your needs.
* **Batch (for example, Batch Job Editor):** Submit Batch jobs. Without this, the user cannot execute the workflow on Google Batch. A more granular role might suffice.
* **Compute Engine:** Allow Batch to provision VMs. This permission may already be covered if the Batch role is sufficiently powerful.


**Software Prerequisites (not Cloud-specific):**

* **Jupyter Notebook:**  The environment to run the notebook itself.
* **Nextflow:** The workflow management system used.  The notebook includes instructions for installing it using mambaforge.
* **Mambaforge (or Miniconda/Anaconda):**  A package manager for installing Nextflow and its dependencies.
* **Docker:**  Docker containers are utilized within the Google Batch jobs.

## Get Started

### STEP 1: Create a new GCS bucket to store input and output files and enable the Batch API
Note that your bucket name has to be globally unique, so don't just copy the example here or it won't work.

In [None]:
# Change this bucket name
%env BUCKET=nf-testing-bucket

In [None]:
# Will only create the bucket if it doesn't yet exist
! gsutil ls gs://$BUCKET > /dev/null 2>&1 || gsutil mb gs://$BUCKET
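
Before creating the bucket, it can help to sanity-check the name locally. The sketch below is only an illustration: the helper `is_plausible_bucket_name` is a hypothetical name, and it covers a simplified subset of the Cloud Storage naming rules (names without dots). It cannot tell you whether the name is already taken globally; only the `gsutil mb` call above can.

```python
import re

# Simplified subset of the Cloud Storage bucket naming rules (assumption:
# dot-free names of 3-63 characters, which covers typical tutorial buckets).
_BUCKET_RE = re.compile(r"^[a-z0-9][a-z0-9_-]{1,61}[a-z0-9]$")


def is_plausible_bucket_name(name):
    """Return True if `name` passes a basic local sanity check."""
    if not _BUCKET_RE.match(name):
        return False
    # Names may not start with "goog" or contain "google".
    if name.startswith("goog") or "google" in name:
        return False
    return True


print(is_plausible_bucket_name("nf-testing-bucket"))  # True
print(is_plausible_bucket_name("My_Bucket"))          # False (uppercase letters)
```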

Enable the Batch API, along with the Compute Engine and Logging APIs. You can also do this in the console [like this](https://cloud.google.com/batch/docs/get-started#console).

In [None]:
! gcloud services enable batch.googleapis.com compute.googleapis.com logging.googleapis.com

### STEP 2: Install Nextflow
Use mamba to install Nextflow. Skip this step if you have already completed it.

In [None]:
# Install Nextflow
! mamba install -y -c conda-forge -c bioconda nextflow

### STEP 3: Review input files
So that this tutorial runs quickly, we will analyze only 50,000 reads from one sample in each sample group instead of all the reads from all six samples. These files have been posted in a Google Storage bucket that we made publicly accessible. All other files needed to run the pipeline are also hosted in this public bucket and will be staged at runtime by Nextflow. To see the locations of all these files, open `nextflow.config`. You can modify any of these paths as desired, and you can also create a new `samplesheet.csv` if you want to point the pipeline to different samples. The samplesheet can be stored locally or in a GCS bucket.

### STEP 4: Modify config file to allow your project to interact with Google Batch
Create and modify your own config file to include a `gbatch` profile block, which tells Nextflow to submit the job to Google Batch instead of running locally.

The config file allows Nextflow to use executors like Google Batch. In this tutorial the config file is named `nextflow.config`. Make sure you open this file and update the account-specific placeholder values. In this case we will only replace `<YOUR_PROJECT>` with your Project ID. We will specify an output directory and a work directory on the command line at run time.

Make sure that your region is one where Google Batch is available.
Specify the machine type you would like to use, ensuring that it has enough memory and CPUs for the workflow. In this case 16 CPUs is plenty (otherwise Google Batch will automatically use 1 CPU).
```
profiles{
  gbatch{
      process.executor = 'google-batch'
      google.location = 'us-central1'
      google.region  = 'us-central1'
      google.project = '<YOUR_PROJECT>'
      process.machineType = 'c2-standard-16'
     }
}
```
Note: Make sure your working directory and output directory are different! Google Batch creates temporary files in the working directory within your bucket, and these take up space, so once your pipeline has completed successfully feel free to delete them.
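
A common mistake at this step is leaving the placeholder in the config file. The sketch below shows one way to fill it in and to check for leftovers from Python. It works on a copy of the profile block held in a string; the helper names and the project ID `my-example-project` are hypothetical, so substitute your own Project ID.

```python
import re

# Copy of the profile block shown above, held as text for illustration.
PROFILE_TEMPLATE = """\
profiles{
  gbatch{
      process.executor = 'google-batch'
      google.location = 'us-central1'
      google.region  = 'us-central1'
      google.project = '<YOUR_PROJECT>'
      process.machineType = 'c2-standard-16'
     }
}
"""


def fill_project(config_text, project_id):
    """Replace the <YOUR_PROJECT> placeholder with a real project ID."""
    return config_text.replace("<YOUR_PROJECT>", project_id)


def unfilled_placeholders(config_text):
    """List any remaining <UPPER_CASE> placeholders in the config text."""
    return re.findall(r"<[A-Z_]+>", config_text)


filled = fill_project(PROFILE_TEMPLATE, "my-example-project")  # hypothetical ID
print(unfilled_placeholders(PROFILE_TEMPLATE))  # ['<YOUR_PROJECT>']
print(unfilled_placeholders(filled))            # []
```

To apply the same check to the real file, read `nextflow.config` into a string and pass it to `unfilled_placeholders` before submitting the job.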

### STEP 5: Submit Nextflow Job to Google Batch

A few things to note here: 
+ `--input` points to a samplesheet in GCS. We could also point to a local samplesheet. This just tells Nextflow where to get the FASTQ files.
+ `-profile` selects profiles defined in `nextflow.config`. The first entry tells the pipeline which software environment to use (conda, singularity, or docker) and the second selects the compute environment (in this case `gbatch`; if left out, the pipeline runs locally).
+ `--outdir` sets the output directory. This can point to a local folder if run locally, but since we are using the serverless Google Batch, we need to point the output to a bucket.
+ `-w` sets the work directory. Like the output directory, this can be local if run locally, but needs to be in a bucket when running with Batch.
+ If you need to rerun your pipeline, you can always add `-resume`, and Nextflow will search the work directory and skip any processes that have already completed.
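
The arguments described above can also be assembled programmatically, which makes it easy to check that both directories live in a bucket and are not the same before anything is submitted. This is a minimal sketch: `build_nextflow_command` is a hypothetical helper and `my-bucket` is a placeholder bucket name, while the flags and the samplesheet path are the ones used in this step.

```python
import shlex


def build_nextflow_command(samplesheet, outdir, workdir, resume=False):
    """Assemble the `nextflow run` arguments for a Google Batch run."""
    # Both directories must be in a bucket when running with Batch.
    for label, path in (("outdir", outdir), ("workdir", workdir)):
        if not path.startswith("gs://"):
            raise ValueError(f"{label} must be a gs:// path, got {path!r}")
    # The work directory and output directory must be different.
    if outdir.rstrip("/") == workdir.rstrip("/"):
        raise ValueError("outdir and workdir must be different")
    cmd = [
        "nextflow", "run", "main.nf",
        "--input", samplesheet,
        "-profile", "docker,gbatch",
        "--outdir", outdir,
        "-w", workdir,
    ]
    if resume:
        cmd.append("-resume")
    return cmd


cmd = build_nextflow_command(
    "gs://nigms-sandbox/me-inbre-rnaseq-pipelinev2/data/raw_fastqSub/samplesheet.csv",
    "gs://my-bucket/outdir/",  # placeholder bucket name
    "gs://my-bucket/work/",
    resume=True,
)
print(shlex.join(cmd))
```

The resulting list can be passed to `subprocess.run(cmd, check=True)` if you prefer to launch the pipeline from Python rather than with a `!` shell cell.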

In [None]:
%%time
! nextflow run main.nf --input gs://nigms-sandbox/me-inbre-rnaseq-pipelinev2/data/raw_fastqSub/samplesheet.csv  -profile docker,gbatch  --outdir gs://$BUCKET/outdir/ -w gs://$BUCKET/work/

### STEP 6: Report the top 10 most highly expressed genes in the samples.

First list the quantification results in the bucket and copy them to the notebook environment. In each `quant.sf` file, the level of expression is reported in the Transcripts Per Million (`TPM`) and number of reads (`NumReads`) fields:  
`Name    Length  EffectiveLength TPM     NumReads`


In [None]:
! gsutil ls gs://$BUCKET/outdir/data/quant

In [None]:
! gsutil cp -r gs://$BUCKET/outdir/data/quant .

View the top 10 most highly expressed genes in each sample, ranked by `NumReads` (column 5).


In [None]:
%%bash
# For each sample, print the 10 rows with the highest NumReads (column 5)
for samp in quant/*/quant.sf; do
    echo "$samp"
    sort -nrk 5,5 "$samp" | head -10
done
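
The same ranking can be done in Python, which makes it easy to rank by `TPM` instead of `NumReads`. The sketch below assumes the tab-separated `quant.sf` layout with the five columns listed above; `top_genes` is a hypothetical helper, and the rows in `example` are made up for illustration and are not results from this tutorial.

```python
import csv
import io


def top_genes(quant_text, n=10, field="TPM"):
    """Return the n rows with the largest value in `field`, highest first."""
    reader = csv.DictReader(io.StringIO(quant_text), delimiter="\t")
    rows = list(reader)
    rows.sort(key=lambda row: float(row[field]), reverse=True)
    return rows[:n]


# Made-up rows for illustration only.
example = (
    "Name\tLength\tEffectiveLength\tTPM\tNumReads\n"
    "geneA\t900\t650.0\t120.5\t40\n"
    "geneB\t1200\t950.0\t3050.0\t1500\n"
    "geneC\t300\t80.0\t0.0\t0\n"
)

for row in top_genes(example, n=2):
    print(row["Name"], row["TPM"], row["NumReads"])
# geneB 3050.0 1500
# geneA 120.5 40
```

For a real sample, read the file contents first, for example with `pathlib.Path(path).read_text()`, and pass the text to `top_genes`.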

### STEP 7: Report the expression of a putative acyl-ACP desaturase (BB28_RS16545) that was downregulated in the double lysogen relative to wild-type
This gene was reported to be downregulated in the double lysogen, as shown in the table of the top 20 upregulated and downregulated genes from the paper describing the study.
![Table of the top 20 upregulated and downregulated genes](images/table-cushman.png)

Use `grep` to report the expression of this gene in each sample. The fields in the Salmon `quant.sf` file are as follows. The level of expression is reported in the Transcripts Per Million (`TPM`) and number of reads (`NumReads`) fields:  
`Name    Length  EffectiveLength TPM     NumReads`

In [None]:
%%bash
# For each sample, print the header and the row for the gene of interest
for samp in quant/*/quant.sf; do
    echo "$samp"
    echo "Name    Length  EffectiveLength TPM     NumReads"
    grep 'BB28_RS16545' "$samp"
done
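
As an alternative to `grep`, the lookup can be written in Python so that the values come back as numbers that are easy to compare across samples. This is a sketch under the same assumption about the `quant.sf` layout; `gene_expression` is a hypothetical helper, the sample names and values below are made up, and unlike `grep` it matches the gene ID exactly rather than as a substring.

```python
import csv
import io


def gene_expression(quant_text, gene_id):
    """Return (TPM, NumReads) for gene_id, or None if the gene is absent."""
    reader = csv.DictReader(io.StringIO(quant_text), delimiter="\t")
    for row in reader:
        if row["Name"] == gene_id:
            return float(row["TPM"]), float(row["NumReads"])
    return None


header = "Name\tLength\tEffectiveLength\tTPM\tNumReads\n"
# Made-up values for two hypothetical samples; not results from this tutorial.
samples = {
    "sample_1": header + "BB28_RS16545\t1000\t750.0\t200.0\t30\n",
    "sample_2": header + "BB28_RS16545\t1000\t750.0\t50.0\t8\n",
}

for name, text in samples.items():
    tpm, reads = gene_expression(text, "BB28_RS16545")
    print(name, tpm, reads)
# sample_1 200.0 30.0
# sample_2 50.0 8.0
```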

## Conclusion 

This Jupyter notebook demonstrated an RNA-Seq analysis workflow run with Nextflow on Google Batch. The workflow, which covers read trimming, quality control, mapping, and gene quantification, ran in a serverless environment, with Google Cloud Storage holding the input and output data and Google Batch running the tasks. The steps covered the setup, including Google Cloud project configuration and Nextflow installation, followed by execution of the pipeline and analysis of the resulting gene expression data. The analysis reported the top 10 most highly expressed genes in each sample and the expression of a specific gene of interest (BB28_RS16545). The same approach can be scaled to larger RNA-Seq data sets by pointing the pipeline at a different samplesheet and adjusting the machine type.

## Clean Up

Remember to move to the next notebook or shut down your instance if you are finished.