# RNA-Seq Analysis using Snakemake and Google Cloud Life Sciences API

## Overview

This short tutorial demonstrates how to run an RNA-Seq workflow using a prokaryotic data set. Steps in the workflow include read trimming, read QC, read mapping, and counting mapped reads per gene to quantitate gene expression. This tutorial uses a popular workflow manager called [Nextflow](https://www.nextflow.io) run via [Google Batch](https://cloud.google.com/batch/docs/get-started). If you completed the other tutorials in this repo, you will see that it is similar to Tutorial 2, but instead of running Snakemake locally, we switch to Nextflow and run it using Batch in a serverless manner. 

### Step 1: Create a new GS Bucket to store input and output files and Enable the Batch APi
Note that your bucket has to be globally unique, so make sure you don't just copy the example here or it won't work

In [2]:
# Change this bucket name
%env BUCKET=nf-testing-bucket

env: BUCKET=nf-testing-bucket


In [3]:
# Will only create the bucket if it doesn't yet exist
!gsutil ls gs://$BUCKET >& /dev/null || gsutil mb gs://$BUCKET

Creating gs://nf-testing-bucket/...


Enable the Batch API, likewise you can do this on the console [like this](https://cloud.google.com/batch/docs/get-started#console).

In [2]:
! gcloud services enable batch.googleapis.com compute.googleapis.com logging.googleapis.com

Operation "operations/acat.p2-500111908199-cfb56681-dfb8-4c61-a596-a34583365217" finished successfully.


To take a quick anonymous survey, run:
  $ gcloud survey



### STEP 2: Install mambaforge and nextflow
First install mambaforge, then use mamba to install nextflow. Skip this as needed if you have already completed this step.

In [None]:
! curl -L -O https://github.com/conda-forge/miniforge/releases/latest/download/Mambaforge-$(uname)-$(uname -m).sh
! bash Mambaforge-$(uname)-$(uname -m).sh -b -p $HOME/mambaforge

In [4]:
# Add to your path, do this every time you restart your kernel
import os
os.environ["PATH"] += os.pathsep + os.environ["HOME"]+"/mambaforge/bin"

In [None]:
# Install Nextflow
! mamba install -y -c conda-forge -c bioconda nextflow

### STEP 3: Review input files
In order for this tutorial to run quickly, we will only analyze 50,000 reads from a sample from both sample groups instead of analyzing all the reads from all six samples. These files have been posted on a Google Storage Bucket that we made publicly accessible. All other files needed to run the pipeline are also hosted in this public bucket, and will be staged at runtime by Nextflow. To view the locations of all these files, view the `nextflow.config`. You can modify any of these paths as desired, and you could also create a new samplesheet.csv if you want to point the pipeline to different samples. The samplesheet can be stored locally or in a GS bucket.

### STEP 4: Modify config file to allow your project to interact with Google Batch
Create and modify your own config file to include a 'gbatch' profile block to tell Nextflow to submit the job to Google Batch instead of running locally.

The config file allows nextflow to utilize excecuters like Google Batch. In this tutorial the config files is named 'nextflow.config'. Make sure you open this file and update the <VARIABLES> that are account specific. In this case will will only modify the <PROJECT> with your Project ID. We will specify an outdir and work directory on the command line at run time. 

Make sure that your region is a region included in the Google Batch!
Specify the machine type you would like to use, ensuring that there is enough memory and cpus for the workflow. In this case 16 CPUs is plenty (Otherwise Google Batch will automatically use 1 CPU).
```
profiles{
  gbatch{
      process.executor = 'google-batch'
      google.location = 'us-central1'
      google.region  = 'us-central1'
      google.project = '<YOUR_PROJECT>'
      process.machineType = 'c2-standard-16'
     }
}
```
Note: Make sure your working directory and output directory are different! Google Batch creates temporary file in the working directory within your bucket that do take up space so once your pipeline has completed succesfully feel free to delete the temporary files.

### STEP 5: Submit Nextflow Job to Google Batch

A few things to note here: 
+ --input points to a samplesheet in GS. We could also point to a local samplesheet. This just tells Nextflow where to get the fastq files. 
+ The profile comes from nextflow.config. It tells the pipeline what to use as execution environment (conda, singularity, or docker) and then you give it a compute environment (in this case gbatch, but if left blank would run locally). 
+ We specify an outdir. This can point to a local folder if run locally, but since we are using the serverless Google Batch, we need to point the output to a bucket. 
+ We specify a work dir. Like the outdir, this can be local if run locally, but needs to be in a bucket when running with Batch. 
+ If you need to rerun your pipeline, you can always add `-resume` and it will search the workdir and not rerun any processes that you have already run. 

In [5]:
%%time
! nextflow run main.nf --input gs://nigms-sandbox/me-inbre-rnaseq-pipelinev2/data/raw_fastqSub/samplesheet.csv  -profile docker,gbatch  --outdir gs://$BUCKET/outdir/ -w gs://$BUCKET/work/

N E X T F L O W  ~  version 23.04.3
Launching `main.nf` [confident_laplace] DSL2 - revision: 2df59780be
[33mWARN: The following invalid input values have been detected:

* --transcriptome: gs://nigms-sandbox/me-inbre-rnaseq-pipelinev2/data/reference/M_chelonae_transcripts.fasta
* --decoys: gs://nigms-sandbox/me-inbre-rnaseq-pipelinev2/data/reference/decoys.txt
* --libtype: A
* --trimmer: ILLUMINACLIP:TruSeq3-PE.fa:2:30:10 LEADING:3 TRAILING:3 SLIDINGWINDOW:4:15 MINLEN:36
* --alignment_mode: false

[39m[K


-[2m----------------------------------------------------[0m-
[0;35m  rnaseqprok/rnaseqprok v1.0dev[0m
-[2m----------------------------------------------------[0m-
[1mCore Nextflow options[0m
  [0;34mrunName        : [0;32mconfident_laplace[0m
  [0;34mcontainerEngine: [0;32mdocker[0m
  [0;34mlaunchDir      : [0;32m/home/jupyter/RNA-Seq-Differential-Expression-Analysis[0m
  [0;34mworkDir        : [0;32m/work/[0m
  [0;34mprojectDir     : [0;32m/home/jupyter/RNA-

### STEP 9: Report the top 10 most highly expressed genes in the samples.

Top 10 most highly expressed genes in the wild-type sample. The level of expression is reported in the Transcripts Per Million (`TPM`) and number of reads (`NumReads`) fields:  
`Name    Length  EffectiveLength TPM     NumReads`


In [13]:
!gsutil ls gs://$BUCKET/outdir/data/quant

gs://nf-testing-bucket/outdir/data/quant/SRR13349122_T1_meta_info.json
gs://nf-testing-bucket/outdir/data/quant/SRR13349123_T1_meta_info.json
gs://nf-testing-bucket/outdir/data/quant/SRR13349128_T1_meta_info.json
gs://nf-testing-bucket/outdir/data/quant/SRR13349122_T1/
gs://nf-testing-bucket/outdir/data/quant/SRR13349123_T1/
gs://nf-testing-bucket/outdir/data/quant/SRR13349128_T1/


In [14]:
!gsutil cp -r gs://$BUCKET/outdir/data/quant .

Copying gs://nf-testing-bucket/outdir/data/quant/SRR13349122_T1/aux_info/ambig_info.tsv...
Copying gs://nf-testing-bucket/outdir/data/quant/SRR13349122_T1/aux_info/expected_bias.gz...
Copying gs://nf-testing-bucket/outdir/data/quant/SRR13349122_T1/aux_info/fld.gz...
Copying gs://nf-testing-bucket/outdir/data/quant/SRR13349122_T1/aux_info/meta_info.json...
/ [4 files][ 21.6 KiB/ 21.6 KiB]                                                
==> NOTE: You are performing a sequence of gsutil operations that may
run significantly faster if you instead use gsutil -m cp ... Please
see the -m section under "gsutil help options" for further information
about when gsutil -m can be advantageous.

Copying gs://nf-testing-bucket/outdir/data/quant/SRR13349122_T1/aux_info/observed_bias.gz...
Copying gs://nf-testing-bucket/outdir/data/quant/SRR13349122_T1/aux_info/observed_bias_3p.gz...
Copying gs://nf-testing-bucket/outdir/data/quant/SRR13349122_T1/cmd_info.json...
Copying gs://nf-testing-bucket/outdir/d

View the top 10 most highly expressed genes in the double lysogen sample.


In [29]:
%%bash
for samp in quant/*/quant.sf; 
    do echo $samp; 
    sort -nrk 5,5 quant/*/quant.sf | head -10; 
    done

quant/SRR13349122_T1/quant.sf
BB28_RS20665	1293	1058.826	25387.263361	107.000
BB28_RS20665	1293	1052.729	7720.151458	58.000
BB28_RS20665	1293	1051.503	8162.677212	55.000
BB28_RS20690	10377	10136.729	594.408477	43.000
BB28_RS03075	1626	1384.503	4508.650030	40.000
BB28_RS07370	9255	9013.503	623.288755	36.000
BB28_RS20690	10377	10135.503	538.893736	35.000
BB28_RS20685	7731	7490.729	654.724112	35.000
BB28_RS03075	1626	1385.729	3539.192149	35.000
BB28_RS13330	2790	2555.826	3243.690393	33.000
quant/SRR13349123_T1/quant.sf
BB28_RS20665	1293	1058.826	25387.263361	107.000
BB28_RS20665	1293	1052.729	7720.151458	58.000
BB28_RS20665	1293	1051.503	8162.677212	55.000
BB28_RS20690	10377	10136.729	594.408477	43.000
BB28_RS03075	1626	1384.503	4508.650030	40.000
BB28_RS07370	9255	9013.503	623.288755	36.000
BB28_RS20690	10377	10135.503	538.893736	35.000
BB28_RS20685	7731	7490.729	654.724112	35.000
BB28_RS03075	1626	1385.729	3539.192149	35.000
BB28_RS13330	2790	2555.826	3243.690393	33.000
quant/SRR1334912

### STEP 10: Report the expression of a putative acyl-ACP desaturase (BB28_RS16545) that was downregulated in the double lysogen relative to wild-type
A acyl-transferase was reported to be downregulated in the double lysogen as shown in the table of the top 20 upregulated and downregulated genes from the paper describing the study.
![RNA-Seq workflow](images/table-cushman.png)

Use `grep` to report the expression in the wild-type sample. The fields in the Salmon `quant.sf` file are as follows. The level of expression is reported in the Transcripts Per Million (`TPM`) and number of reads (`NumReads`) fields:  
`Name    Length  EffectiveLength TPM     NumReads`

In [28]:
%%bash
for samp in quant/*/quant.sf; do echo $samp; 
    echo Name    Length  EffectiveLength TPM     NumReads;
    grep 'BB28_RS16545' quant/*/quant.sf; 
    done

quant/SRR13349122_T1/quant.sf
Name Length EffectiveLength TPM NumReads
quant/SRR13349122_T1/quant.sf:BB28_RS16545	987	745.503	837.319237	4.000
quant/SRR13349123_T1/quant.sf:BB28_RS16545	987	746.729	750.604918	4.000
quant/SRR13349128_T1/quant.sf:BB28_RS16545	987	752.826	333.704510	1.000
quant/SRR13349123_T1/quant.sf
Name Length EffectiveLength TPM NumReads
quant/SRR13349122_T1/quant.sf:BB28_RS16545	987	745.503	837.319237	4.000
quant/SRR13349123_T1/quant.sf:BB28_RS16545	987	746.729	750.604918	4.000
quant/SRR13349128_T1/quant.sf:BB28_RS16545	987	752.826	333.704510	1.000
quant/SRR13349128_T1/quant.sf
Name Length EffectiveLength TPM NumReads
quant/SRR13349122_T1/quant.sf:BB28_RS16545	987	745.503	837.319237	4.000
quant/SRR13349123_T1/quant.sf:BB28_RS16545	987	746.729	750.604918	4.000
quant/SRR13349128_T1/quant.sf:BB28_RS16545	987	752.826	333.704510	1.000
