# RNA-Seq Analysis using Snakemake and Google Cloud Life Sciences API

## Overview

This short tutorial demonstrates how to run an RNA-Seq workflow using a prokaryotic data set. Steps in the workflow include read trimming, read QC, read mapping, and counting mapped reads per gene to quantify gene expression. This tutorial uses a popular workflow manager called [Snakemake](https://snakemake.readthedocs.io/en/stable/), run via the [Google Cloud Life Sciences API](https://cloud.google.com/life-sciences/docs/reference/rest). If you completed the other tutorials in this repo, you will see that much of it repeats Tutorial 2, but instead of running Snakemake locally, it uses the Life Sciences API to run in a serverless way. You still need to repeat most of the setup steps here, because this notebook copies the data to a Cloud Storage bucket rather than to local disk.
**Make sure you enable the Life Sciences API before running this tutorial**.

## Learning Objectives

* **Use Snakemake for workflow management:** The notebook demonstrates how to utilize Snakemake to define and execute a bioinformatics workflow in a reproducible and scalable manner.  Learners will understand how to structure a Snakemake workflow (using the provided `snakefile_ls_api`).

* **Execute Snakemake on Google Cloud Life Sciences API:** The core learning objective focuses on leveraging the Google Cloud Life Sciences API to run a Snakemake pipeline in a serverless environment.  This avoids the need for managing local compute resources.  The steps cover authentication with Google Cloud using service account keys.

* **Perform standard RNA-Seq analysis steps:** The workflow encompasses typical RNA-Seq processing steps, including read trimming (Trimmomatic), quality control (FastQC), read mapping (Salmon), and read counting to quantify gene expression.

* **Interpret RNA-Seq results:** The notebook concludes by guiding users through interpreting the generated output, focusing on identifying highly expressed genes and comparing gene expression levels between different samples.  This includes extracting and analyzing data from the Salmon output files.

* **Troubleshoot cloud-based pipelines:** The tutorial emphasizes understanding how to access and interpret logs from the Google Cloud Life Sciences API jobs to troubleshoot any pipeline failures.

## Prerequisites


**APIs:**

* **Google Cloud Life Sciences API:**  This API must be enabled in your project.  It is used to run the Snakemake workflow in a serverless manner on Google Cloud.


**Cloud Platform Account Roles:**

The notebook requires at least the following roles (or equivalent permissions) to be assigned to the service account used:

* **Storage Object Admin:** This is needed to create, modify and delete objects (files) in Google Cloud Storage buckets.  Many actions in the notebook (like `gsutil` commands) require this level of access.
* **Life Sciences User:** This role allows the service account to submit jobs to the Life Sciences API.


**Cloud Platform Access:**

* **Google Cloud Project:** A Google Cloud project is essential.  The notebook uses `gcloud` commands which require this project context to be set.
* **Google Cloud Storage (GCS) Bucket:** A GCS bucket is created to store input (FASTQ files, reference genomes) and output files generated by the RNA-Seq pipeline. The notebook uses a unique bucket name to avoid conflicts.
* **Compute Engine (Implicit):** While not explicitly stated, the Life Sciences API relies on Compute Engine resources to execute the workflow's individual steps. Therefore, implicit access to Compute Engine resources within the project is required, even if it's managed automatically by the Life Sciences API.


**Software and Dependencies:**

* **Python:** The notebook uses Python code to interact with the Google Cloud libraries and manipulate data.
* **Mambaforge (or other conda distribution):** Used for managing Python packages and dependencies, specifically installing `snakemake`.
* **Snakemake:** The workflow management system orchestrating the RNA-Seq pipeline.
* **Git:** Used to initialize a local Git repository (not strictly necessary for the pipeline itself).
* **Various bioinformatics tools:** The Snakemake workflow depends on several bioinformatics tools (Trimmomatic, FastQC, Salmon). These are specified within the Snakefile and the conda environment files under `envs/`, so their versions are controlled by the conda environments.
* **Service Account Key File (`cloud_creds.json`):**  A JSON key file downloaded from the Google Cloud Console for the service account that will run the Life Sciences API jobs.  This is crucial for authentication.

## Get Started

### STEP 1: Create a new GCS bucket to store input and output files
Note that your bucket name has to be globally unique, so don't just copy the example name here or the command will fail.

In [None]:
#change this bucket name
%env BUCKET=cl-testing-bucket
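
One way to avoid a name collision is to append a random suffix to a prefix of your choice. The sketch below is only an illustration: the helper `make_bucket_name` and the prefix `rnaseq-tutorial` are hypothetical names, and the regular expression is a deliberately conservative subset of the Cloud Storage naming rules (lowercase letters, digits, hyphens and underscores, 3 to 63 characters, starting and ending with a letter or digit).

```python
import re
import uuid

# Conservative subset of the GCS bucket naming rules (assumption: no dots used).
BUCKET_NAME_RE = re.compile(r"^[a-z0-9][a-z0-9_-]{1,61}[a-z0-9]$")


def make_bucket_name(prefix: str) -> str:
    """Append a short random suffix to a prefix to reduce the chance of a collision."""
    suffix = uuid.uuid4().hex[:8]
    name = f"{prefix.lower()}-{suffix}"
    if not BUCKET_NAME_RE.match(name):
        raise ValueError(f"not a valid bucket name: {name!r}")
    return name


bucket = make_bucket_name("rnaseq-tutorial")  # hypothetical prefix
print(bucket)
```

Assigning the result with `os.environ["BUCKET"] = bucket` would make it visible to the `!gsutil` commands in the same way as the `%env` line.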

In [None]:
#will only create the bucket if it doesn't yet exist
!gsutil ls gs://$BUCKET > /dev/null 2>&1 || gsutil mb gs://$BUCKET

In [None]:
#enable object versioning so that overwritten files are kept as older versions
!gsutil versioning set on gs://$BUCKET

### STEP 2: Install mambaforge and snakemake
First install mambaforge, then use mamba to install snakemake. Skip this if you have completed the other tutorials.

In [None]:
! curl -L -O https://github.com/conda-forge/miniforge/releases/latest/download/Mambaforge-$(uname)-$(uname -m).sh
! bash Mambaforge-$(uname)-$(uname -m).sh -b -p $HOME/mambaforge

In [None]:
#add to your path
import os
os.environ["PATH"] += os.pathsep + os.environ["HOME"]+"/mambaforge/bin"
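
Because the install prefix is case-sensitive on Linux, it can help to confirm that the tools are actually discoverable before continuing. The following is a minimal sketch, assuming the installer was run with `-p $HOME/mambaforge` as above; the helper names `add_to_path` and `find_tool` are illustrative only.

```python
import os
import shutil


def add_to_path(directory: str) -> None:
    """Append a directory to PATH unless it is already listed."""
    parts = [p for p in os.environ.get("PATH", "").split(os.pathsep) if p]
    if directory not in parts:
        os.environ["PATH"] = os.pathsep.join(parts + [directory])


def find_tool(name: str):
    """Return the full path of an executable on PATH, or None if it is missing."""
    return shutil.which(name)


add_to_path(os.path.join(os.path.expanduser("~"), "mambaforge", "bin"))
for tool in ("mamba", "snakemake"):
    print(tool, "->", find_tool(tool) or "not found on PATH")
```

If `snakemake` is reported as not found after the install cell in this step has run, the PATH entry most likely does not match the directory passed to the installer.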

In [None]:
#install snakemake
! mamba install -y -c conda-forge -c bioconda snakemake

### STEP 3: Copy FASTQ Files
So that this tutorial runs quickly, we will analyze only 50,000 reads from one sample in each of the two sample groups instead of analyzing all the reads from all six samples. These files have been posted in a Google Cloud Storage bucket that we made publicly accessible.

In [None]:
! gsutil -m cp -r gs://nigms-sandbox/me-inbre-rnaseq-pipelinev2/data/raw_fastqSub/ gs://$BUCKET/data/raw_fastq

Create a placeholder object under data/fastqc so that Snakemake can write files to that bucket path; otherwise the pipeline crashes.

In [None]:
! touch blank.txt
! gsutil cp blank.txt gs://$BUCKET/data/fastqc/


### STEP 4: Copy reference files that will be used by Salmon
Salmon is a tool that aligns RNA-Seq reads to a set of transcripts rather than the entire genome.

In [None]:
!gsutil -m cp -r gs://nigms-sandbox/me-inbre-rnaseq-pipelinev2/data/reference/ gs://$BUCKET/data/


### STEP 5: Copy data file for Trimmomatic

In [None]:
!gsutil -m cp gs://nigms-sandbox/me-inbre-rnaseq-pipelinev2/config/TruSeq3-PE.fa gs://$BUCKET/TruSeq3-PE.fa

### STEP 6: Copy data and config files that will be used in our snakemake environment

Next download config files for our snakemake environment.

In [None]:
!gsutil -m cp -r gs://nigms-sandbox/me-inbre-rnaseq-pipelinev2/envs/ .
!gsutil -m cp gs://nigms-sandbox/me-inbre-rnaseq-pipelinev2/config.yaml .
!gsutil -m cp gs://nigms-sandbox/me-inbre-rnaseq-pipelinev2/snakefile_ls_api .

Add the bucket path to the end of your config file. Since this file was written for running snakemake locally we have to make a few edits to run on the LS API.

In [None]:
!echo 'bucket:' >> config.yaml

In [None]:
!echo '   '$BUCKET >>config.yaml
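
Re-running the two `echo` cells appends the `bucket` key a second time. A hedged Python alternative that only writes the key once is sketched below; `add_bucket_to_config` is a hypothetical helper, it writes the key in single-line form (`bucket: name`), and the demonstration uses a temporary file with made-up content rather than the real `config.yaml`.

```python
import os
import tempfile
from pathlib import Path


def add_bucket_to_config(config_path: str, bucket: str) -> bool:
    """Append a top-level `bucket` key to a YAML file; return False if already present."""
    path = Path(config_path)
    text = path.read_text() if path.exists() else ""
    if any(line.startswith("bucket:") for line in text.splitlines()):
        return False  # key already present, avoid writing a duplicate
    if text and not text.endswith("\n"):
        text += "\n"  # make sure the new key starts on its own line
    path.write_text(text + f"bucket: {bucket}\n")
    return True


# Demonstration on a temporary file (the existing content is made up).
with tempfile.TemporaryDirectory() as tmp:
    demo = os.path.join(tmp, "config.yaml")
    Path(demo).write_text("samples: demo")
    first = add_bucket_to_config(demo, "cl-testing-bucket")
    second = add_bucket_to_config(demo, "cl-testing-bucket")
    result = Path(demo).read_text()
print(result)
```

In the notebook this would be called as `add_bucket_to_config("config.yaml", os.environ["BUCKET"])`.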

Add bucket path to the snakefile

In [None]:
!sed -i 's/print(SAMPLES)/BUCKET=config["bucket"]/' snakefile_ls_api
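
The `sed` command succeeds silently even when `print(SAMPLES)` is not present, for example if the file was already patched. The sketch below shows the same substitution in Python with an explicit check; `patch_snakefile` is a hypothetical helper and the example text is made up, not the contents of `snakefile_ls_api`. Note that `str.replace` replaces every occurrence, whereas this `sed` expression replaces only the first one on each line.

```python
def patch_snakefile(text: str,
                    old: str = "print(SAMPLES)",
                    new: str = 'BUCKET=config["bucket"]') -> str:
    """Replace the placeholder statement and fail loudly if it is not there."""
    if old not in text:
        raise ValueError(f"{old!r} not found; the Snakefile may already be patched")
    return text.replace(old, new)


# Made-up stand-in for the relevant part of a Snakefile.
example = 'SAMPLES = ["SRR13349122"]\nprint(SAMPLES)\n'
patched = patch_snakefile(example)
print(patched)
```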

### STEP 7: Set up your local environment
You need to generate a [service account key](https://cloud.google.com/iam/docs/creating-managing-service-account-keys) for the Compute Engine default service account so that Snakemake can interact with the Life Sciences API. Download the key and copy it to this VM, then assign the path of the JSON file to an environment variable.

In [None]:
%env GOOGLE_APPLICATION_CREDENTIALS=cloud_creds.json

In [None]:
import os
os.environ["GOOGLE_APPLICATION_CREDENTIALS"]
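
Printing the variable only confirms that it is set, not that the file is a usable key. As a rough sanity check, the sketch below loads a key file and verifies a few fields that service account JSON keys contain (`type`, `project_id`, `private_key`, `client_email`). The helper `check_key_file` is hypothetical, and the demonstration writes a fake key with placeholder values to a temporary directory.

```python
import json
import os
import tempfile

REQUIRED_FIELDS = ("type", "project_id", "private_key", "client_email")


def check_key_file(path: str) -> str:
    """Return the service account email if the key file looks usable."""
    with open(path) as handle:
        key = json.load(handle)
    missing = [field for field in REQUIRED_FIELDS if field not in key]
    if missing:
        raise ValueError(f"key file is missing fields: {missing}")
    if key["type"] != "service_account":
        raise ValueError(f"expected a service account key, got type {key['type']!r}")
    return key["client_email"]


# Fake key for demonstration only; every value is a placeholder.
fake_key = {
    "type": "service_account",
    "project_id": "example-project",
    "private_key": "PLACEHOLDER",
    "client_email": "demo@example-project.iam.gserviceaccount.com",
}
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "cloud_creds.json")
    with open(demo_path, "w") as handle:
        json.dump(fake_key, handle)
    email = check_key_file(demo_path)
print(email)
```

With a real key, `check_key_file(os.environ["GOOGLE_APPLICATION_CREDENTIALS"])` should return the email of the service account that needs the roles listed in the prerequisites.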

Set your project (make sure to replace `$PROJECT` with your project ID).

In [None]:
!gcloud config set project $PROJECT

Initialize a local git repo

In [None]:
!git init

Configure conda

In [None]:
!conda config --set channel_priority strict

### STEP 8: Run snakemake using the Life Sciences API

Aside from the .yaml config files, which contain information about software, dependencies, and versions, Snakemake uses a Snakefile, which describes the workflow itself.

This can be a powerful tool because it lets you think in terms of workflows instead of individual steps. You should open the Snakefile and look at it further. It is composed of 'rules' we have created, and each rule declares its input and output files. Snakemake works backward from the files you ask for: for each requested output, it finds the rule that produces it, then the rules that produce that rule's inputs, and so on. It can take a while to get used to this, but in the simplest terms, we provide .fastq files as input, and the rules in the Snakefile carry those files through the entire workflow. The rule_all at the top determines which rules are run, because its input files are the outputs of the target rules. Comment out the inputs of rule_all for any rules you don't want to run.
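
To make the "work backward from the requested outputs" idea concrete, here is a toy resolver in plain Python. It is not how Snakemake is implemented (Snakemake also handles wildcards, timestamps and much more), and the rule and file names are hypothetical, not the ones in `snakefile_ls_api`.

```python
# Each hypothetical rule is keyed by its output and lists the inputs it needs.
RULES = {
    "trimmed.fastq": ("trim", ["raw.fastq"]),
    "fastqc.html": ("fastqc", ["trimmed.fastq"]),
    "index": ("salmon_index", ["transcripts.fa"]),
    "quant.sf": ("salmon_quant", ["trimmed.fastq", "index"]),
}


def plan(targets, rules):
    """Return rule names in an order that produces every requested target."""
    order, seen = [], set()

    def visit(path):
        if path not in rules or path in seen:
            return  # a source file, or a file that is already planned
        seen.add(path)
        name, inputs = rules[path]
        for dep in inputs:
            visit(dep)  # schedule the upstream rules first
        order.append(name)

    for target in targets:
        visit(target)
    return order


# The target list plays the role of the inputs of rule_all.
full = plan(["fastqc.html", "quant.sf"], RULES)
quant_only = plan(["quant.sf"], RULES)
print(full)        # ['trim', 'fastqc', 'salmon_index', 'salmon_quant']
print(quant_only)  # ['trim', 'salmon_index', 'salmon_quant']
```

Dropping `fastqc.html` from the targets removes the `fastqc` rule from the plan, which mirrors the effect of commenting out an input of rule_all.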

Snakemake requires that you have a service account key to authenticate with the Life Sciences API. This actually is not necessary to use the API from within a notebook, but Snakemake does require it since Snakemake is expecting you to run the command from your own terminal using the SDK. To see all the commands you can run with Snakemake via the Life Sciences API, check out the [docs](https://snakemake.readthedocs.io/en/stable/executor_tutorial/google_lifesciences.html).

Now we can run the workflow on the Life Sciences API. You will see that each rule is submitted as a separate job. If the pipeline crashes, troubleshoot it by reading the API logs or the Snakemake rule logs (they contain the same information). You can find the Life Sciences API logs by pasting in the gcloud command printed in yellow.

For example: 
```
gcloud beta lifesciences operations describe <JOB_ID>
```
Or you can view the logs by finding the path given for logs, and then use gsutil to copy that file locally, or go to the bucket and double click the file. You can get the job ID for the output file in the green section of the rule print out.

In [None]:
%%time
! snakemake --forceall --snakefile snakefile_ls_api --google-lifesciences --default-remote-prefix $BUCKET --use-conda --google-lifesciences-region us-central1 -j 24 --rerun-incomplete --default-resources "machine_type=n2-standard"
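
The same invocation can be assembled as an argument list in Python, which makes it easier to vary the bucket, region or job count without editing a long shell line. This is only a sketch: `build_snakemake_command` is a hypothetical helper, and it uses exactly the flags from the cell above without running anything.

```python
import os
import shlex


def build_snakemake_command(bucket: str, region: str = "us-central1", jobs: int = 24):
    """Assemble the snakemake command from this step as a list of arguments."""
    return [
        "snakemake",
        "--forceall",
        "--snakefile", "snakefile_ls_api",
        "--google-lifesciences",
        "--default-remote-prefix", bucket,
        "--use-conda",
        "--google-lifesciences-region", region,
        "-j", str(jobs),
        "--rerun-incomplete",
        "--default-resources", "machine_type=n2-standard",
    ]


# Fall back to the example bucket name if BUCKET is not set.
cmd = build_snakemake_command(os.environ.get("BUCKET", "cl-testing-bucket"))
print(shlex.join(cmd))
```

Passing the list to `subprocess.run(cmd, check=True)` would launch the workflow and raise an error if Snakemake exits with a non-zero status.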

### STEP 9: Report the top 10 most highly expressed genes in the samples.

Top 10 most highly expressed genes in the wild-type sample. The level of expression is reported in the Transcripts Per Million (`TPM`) and number of reads (`NumReads`) fields:  
`Name    Length  EffectiveLength TPM     NumReads`


In [None]:
!gsutil rm gs://$BUCKET/data/quants/SRR13349122_quant
!gsutil rm gs://$BUCKET/data/quants/SRR13349128_quant

In [None]:
!gsutil ls gs://$BUCKET/data/quants/*

In [None]:
!gsutil cp -r gs://$BUCKET/data/quants/SRR13349122_quant/ .
!gsutil cp -r gs://$BUCKET/data/quants/SRR13349128_quant/ .

In [None]:
!sort -nrk 5,5 SRR13349122_quant/quant.sf | head -10

Top 10 most highly expressed genes in the double lysogen sample.


In [None]:
!sort -nrk 5,5 SRR13349128_quant/quant.sf | head -10
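
The `sort -nrk 5,5` commands rank genes by the fifth column, `NumReads`. A small Python equivalent that reads the header by name and can rank by either `NumReads` or `TPM` is sketched below. The function `top_genes` is a hypothetical helper, and the rows in `SAMPLE` are made up for illustration; they are not results from this data set.

```python
import csv
import io


def top_genes(handle, n=10, column="NumReads"):
    """Return the n (name, value) pairs with the highest value in `column`."""
    reader = csv.DictReader(handle, delimiter="\t")
    rows = [(row["Name"], float(row[column])) for row in reader]
    return sorted(rows, key=lambda item: item[1], reverse=True)[:n]


# Made-up rows in the quant.sf column layout.
SAMPLE = (
    "Name\tLength\tEffectiveLength\tTPM\tNumReads\n"
    "geneA\t900\t750.0\t10.5\t12\n"
    "geneB\t1200\t1050.0\t80.0\t120\n"
    "geneC\t300\t150.0\t210.0\t45\n"
)

by_reads = top_genes(io.StringIO(SAMPLE), n=2)
by_tpm = top_genes(io.StringIO(SAMPLE), n=2, column="TPM")
print(by_reads)
print(by_tpm)
```

On the real output this would be called as `top_genes(open("SRR13349128_quant/quant.sf"))`. Ranking by `TPM` instead of `NumReads` accounts for transcript length, so the two orderings can differ, as they do for the made-up rows here.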

### STEP 10: Report the expression of a putative acyl-ACP desaturase (BB28_RS16545) that was downregulated in the double lysogen relative to wild-type
This gene was reported to be downregulated in the double lysogen, as shown in the table of the top 20 upregulated and downregulated genes from the paper describing the study.
![Table of the top upregulated and downregulated genes](images/table-cushman.png)

Use `grep` to report the expression in the wild-type sample. The fields in the Salmon `quant.sf` file are as follows. The level of expression is reported in the Transcripts Per Million (`TPM`) and number of reads (`NumReads`) fields:  
`Name    Length  EffectiveLength TPM     NumReads`

In [None]:
!grep 'BB28_RS16545' SRR13349122_quant/quant.sf

Use `grep` to report the expression in the double lysogen sample. The fields in the Salmon `quant.sf` file are as follows. The level of expression is reported in the Transcripts Per Million (`TPM`) and number of reads (`NumReads`) fields:  
`Name    Length  EffectiveLength TPM     NumReads`

In [None]:
!grep 'BB28_RS16545' SRR13349128_quant/quant.sf
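
To compare the two `grep` results numerically, one option is a log2 ratio of the TPM values with a small pseudocount. The sketch below assumes the `quant.sf` column layout shown above. The helpers `read_tpm` and `log2_fold_change` are hypothetical, and the TPM values in the two strings are invented for illustration, not measured results.

```python
import csv
import io
import math


def read_tpm(handle):
    """Map each transcript name to its TPM value."""
    reader = csv.DictReader(handle, delimiter="\t")
    return {row["Name"]: float(row["TPM"]) for row in reader}


def log2_fold_change(reference, treated, gene, pseudocount=1.0):
    """log2 of (treated + pseudocount) / (reference + pseudocount) for one gene."""
    return math.log2((treated[gene] + pseudocount) / (reference[gene] + pseudocount))


HEADER = "Name\tLength\tEffectiveLength\tTPM\tNumReads\n"
# Invented values, used only to show the calculation.
WILD_TYPE = HEADER + "BB28_RS16545\t1000\t850.0\t31.0\t40\n"
DOUBLE_LYSOGEN = HEADER + "BB28_RS16545\t1000\t850.0\t7.0\t9\n"

wt = read_tpm(io.StringIO(WILD_TYPE))
dl = read_tpm(io.StringIO(DOUBLE_LYSOGEN))
lfc = log2_fold_change(wt, dl, "BB28_RS16545")
print(lfc)  # -2.0 for the invented values above
```

A negative value means lower expression in the double lysogen. With one subsampled sample per group this is only a descriptive comparison; a proper differential expression test needs the replicates and a dedicated statistical tool.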

## Conclusion

This Jupyter Notebook successfully demonstrated a complete RNA-Seq analysis workflow using Snakemake orchestrated via the Google Cloud Life Sciences API.  The workflow, encompassing read trimming, quality control, mapping, and quantification of gene expression, was executed efficiently in a serverless environment. By leveraging the Life Sciences API, the computational demands of the pipeline were seamlessly managed on Google Cloud, eliminating the need for local resource management. The tutorial highlighted the key steps involved in preparing data (copying FASTQ files, reference genomes, and configuration files to a Google Cloud Storage bucket), setting up the execution environment (installing necessary software), configuring Snakemake for cloud execution, and finally running the workflow using the Life Sciences API. The results, including the top 10 most highly expressed genes in both wild-type and double lysogen samples, and the expression levels of a specific gene of interest (BB28_RS16545), were successfully retrieved and presented. This approach showcases a scalable and reproducible method for RNA-Seq analysis, particularly beneficial for large-scale genomics projects.  Future work could involve expanding this workflow to incorporate more sophisticated downstream analyses and integrating additional data types.

## Clean Up

Remember to move to the next notebook or shut down your instance if you are finished.