# RNA-Seq Analysis Training Demo on Azure

## Overview

This short tutorial demonstrates how to run an RNA-Seq workflow using a prokaryotic data set. Steps in the workflow include read trimming, read QC, read mapping, and counting mapped reads per gene to quantify gene expression.

## Prerequisites
We assume you have provisioned a compute environment in Azure ML Studio.

## Learning objectives
+ Learn how to copy data to and from Blob storage
+ Learn how to run and visualize a basic RNA-Seq analysis

## Get started

### Install packages

Note that within Jupyter you can run a bash command either by prefixing the command with the magic `!`, or by adding `%%bash` as the first line of your cell.

For example
```
%%bash
example command
```
Or
```
!example command
```
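
Outside of the notebook magics, the same kind of shell command can also be launched from plain Python. The following is a minimal sketch using the standard-library `subprocess` module; the helper name `run_shell` is hypothetical, and `echo` stands in for any real command.

```python
import subprocess


def run_shell(command: str) -> str:
    """Run a shell command and return its standard output as text."""
    completed = subprocess.run(
        command,
        shell=True,           # let the system shell interpret the string, as '!' does
        check=True,           # raise CalledProcessError on a non-zero exit code
        capture_output=True,  # collect stdout and stderr instead of printing them
        text=True,            # decode the output to str
    )
    return completed.stdout


print(run_shell("echo example command"))
```

This form is mainly useful when you later move the workflow out of a notebook and into a script.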

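The first step is to install Miniforge, a minimal conda-forge installer that includes mamba, a faster drop-in replacement for the conda package manager. The Mambaforge installer has been retired in favor of Miniforge, so the commands below download Miniforge3 and install it to `$HOME/mambaforge`.

In [None]:
! curl -L -O https://github.com/conda-forge/miniforge/releases/latest/download/Miniforge3-$(uname)-$(uname -m).sh
! bash Miniforge3-$(uname)-$(uname -m).sh -b -p $HOME/mambaforge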

In [None]:
# Add the mamba install directory to your PATH
import os
os.environ["PATH"] += os.pathsep + os.environ["HOME"]+"/mambaforge/bin"

In [None]:
! mamba info --envs

Next, we will install the necessary packages into the current environment.

In [None]:
! mamba install -c conda-forge -c bioconda -c defaults -y sra-tools pigz pbzip2 fastp fastqc multiqc salmon
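
Before continuing, it can help to confirm that the installed programs are actually visible on your `PATH`. The sketch below uses `shutil.which` from the standard library. The list of executable names is an assumption: `sra-tools` ships several programs, and `prefetch` is used here as a representative one.

```python
import shutil

# Executables expected from the packages installed above (assumed names;
# prefetch is one of several programs shipped with sra-tools).
EXPECTED_TOOLS = ["prefetch", "pigz", "pbzip2", "fastp", "fastqc", "multiqc", "salmon"]


def missing_tools(tools):
    """Return the tools that cannot be found on the current PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]


missing = missing_tools(EXPECTED_TOOLS)
print("Missing tools:", missing if missing else "none")
```

If anything is reported as missing, rerun the PATH cell and the install cell before moving on.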

Create a set of directories to store the reads, reference sequence files, and output files.


In [None]:
%%bash
mkdir -p data
mkdir -p data/raw_fastq
mkdir -p data/trimmed
mkdir -p data/fastqc
mkdir -p data/aligned
mkdir -p data/reference
mkdir -p data/quants
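
The same directory layout can be created from Python if you prefer to stay out of bash. This is a small sketch with `pathlib`; the function name `make_project_dirs` is hypothetical.

```python
from pathlib import Path

# Subdirectories used throughout this tutorial.
SUBDIRS = ["raw_fastq", "trimmed", "fastqc", "aligned", "reference", "quants"]


def make_project_dirs(base="data"):
    """Create base/<subdir> for every entry in SUBDIRS and return the paths."""
    created = []
    for name in SUBDIRS:
        path = Path(base) / name
        # parents=True and exist_ok=True give the same behavior as `mkdir -p`.
        path.mkdir(parents=True, exist_ok=True)
        created.append(path)
    return created
```

Calling `make_project_dirs("data")` is safe to repeat, since existing directories are left untouched.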

### Copy FASTQ Files
To keep this tutorial fast, we will analyze only 50,000 reads from one sample in each of the two sample groups, rather than all the reads from all six samples. These files are hosted in a publicly accessible cloud storage bucket (the URLs below point to Google Cloud Storage).

In [None]:
!curl https://storage.googleapis.com/nigms-sandbox/me-inbre-rnaseq-pipelinev2/data/raw_fastqSub/SRR13349122_1.fastq --output data/raw_fastq/SRR13349122_1.fastq
!curl https://storage.googleapis.com/nigms-sandbox/me-inbre-rnaseq-pipelinev2/data/raw_fastqSub/SRR13349122_2.fastq --output data/raw_fastq/SRR13349122_2.fastq
!curl https://storage.googleapis.com/nigms-sandbox/me-inbre-rnaseq-pipelinev2/data/raw_fastqSub/SRR13349128_1.fastq --output data/raw_fastq/SRR13349128_1.fastq
!curl https://storage.googleapis.com/nigms-sandbox/me-inbre-rnaseq-pipelinev2/data/raw_fastqSub/SRR13349128_2.fastq --output data/raw_fastq/SRR13349128_2.fastq
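
A quick way to check that a download completed is to count the reads in each file. The sketch below assumes uncompressed FASTQ with exactly four lines per record (no wrapped sequence lines); the function name `count_fastq_reads` is hypothetical.

```python
def count_fastq_reads(path):
    """Count reads in an uncompressed FASTQ file, assuming four lines per record."""
    with open(path) as handle:
        line_count = sum(1 for _ in handle)
    if line_count % 4 != 0:
        # A truncated download typically leaves an incomplete final record.
        raise ValueError(f"{path}: {line_count} lines is not a multiple of 4")
    return line_count // 4
```

For example, `count_fastq_reads("data/raw_fastq/SRR13349122_1.fastq")` should match the 50,000-read subsample described above if the file was copied in full.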

### Copy reference transcriptome files that will be used by Salmon
Salmon is a tool that aligns RNA-Seq reads to a set of transcripts rather than the entire genome.

In [None]:
!curl https://storage.googleapis.com/nigms-sandbox/me-inbre-rnaseq-pipelinev2/data/reference/M_chelonae_transcripts.fasta --output data/reference/M_chelonae_transcripts.fasta
!curl https://storage.googleapis.com/nigms-sandbox/me-inbre-rnaseq-pipelinev2/data/reference/decoys.txt --output data/reference/decoys.txt
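
Since Salmon will use both files together, it can be worth checking that every name listed in `decoys.txt` is present as a sequence in the FASTA file. The following is a rough sketch under two assumptions: `decoys.txt` holds one sequence name per line, and a sequence name is the first whitespace-delimited token of its `>` header. The function names are hypothetical.

```python
def fasta_names(path):
    """Return the sequence names (first token of each '>' header) in a FASTA file."""
    names = []
    with open(path) as handle:
        for line in handle:
            if line.startswith(">"):
                fields = line[1:].split()
                names.append(fields[0] if fields else "")
    return names


def missing_decoys(fasta_path, decoys_path):
    """Return decoy names from decoys_path that have no matching FASTA header."""
    names = set(fasta_names(fasta_path))
    with open(decoys_path) as handle:
        decoys = [line.strip() for line in handle if line.strip()]
    return [decoy for decoy in decoys if decoy not in names]
```

`len(fasta_names("data/reference/M_chelonae_transcripts.fasta"))` gives the number of sequences in the reference, and an empty list from `missing_decoys(...)` indicates that the two files are consistent.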

In [None]:
ls data/raw_fastq

### Trim our data with Fastp

In [None]:
! fastp -i data/raw_fastq/SRR13349122_1.fastq -I data/raw_fastq/SRR13349122_2.fastq -o data/trimmed/SRR13349122_1_trimmed.fastq -O data/trimmed/SRR13349122_2_trimmed.fastq
! fastp -i data/raw_fastq/SRR13349128_1.fastq -I data/raw_fastq/SRR13349128_2.fastq -o data/trimmed/SRR13349128_1_trimmed.fastq -O data/trimmed/SRR13349128_2_trimmed.fastq
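
With more samples, typing one fastp command per sample becomes error-prone. The sketch below builds the same commands from a sample ID. The `-i`, `-I`, `-o`, and `-O` arguments mirror the cells above; the `-j` and `-h` options name the JSON and HTML reports per sample so that a later run does not overwrite the default `fastp.json` and `fastp.html`. The report file names and the helper `fastp_command` are assumptions for illustration.

```python
def fastp_command(sample, raw_dir="data/raw_fastq", out_dir="data/trimmed"):
    """Build the fastp argument list for one paired-end sample."""
    return [
        "fastp",
        "-i", f"{raw_dir}/{sample}_1.fastq",          # read 1 input
        "-I", f"{raw_dir}/{sample}_2.fastq",          # read 2 input
        "-o", f"{out_dir}/{sample}_1_trimmed.fastq",  # read 1 output
        "-O", f"{out_dir}/{sample}_2_trimmed.fastq",  # read 2 output
        "-j", f"{out_dir}/{sample}_fastp.json",       # per-sample JSON report (assumed name)
        "-h", f"{out_dir}/{sample}_fastp.html",       # per-sample HTML report (assumed name)
    ]


for sample in ["SRR13349122", "SRR13349128"]:
    # Print the command; pass the list to subprocess.run(..., check=True) to execute it.
    print(" ".join(fastp_command(sample)))
```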

### Run FastQC
FastQC is an invaluable tool that allows you to evaluate whether there are problems with a set of reads. For example, it will provide a report of whether there is any bias in the sequence composition of the reads.

Once FastQC is done running, look at the outputs in data/fastqc. What can you say about the quality of the two samples we are looking at here? 

In [None]:
%%bash
fastqc -o data/fastqc data/trimmed/SRR13349122_1_trimmed.fastq
fastqc -o data/fastqc data/trimmed/SRR13349128_1_trimmed.fastq
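
To build some intuition for what the quality plots summarize, the sketch below decodes the quality line of each FASTQ record into Phred scores and averages them per read. It is not how FastQC is implemented; it assumes Phred+33 encoding and uncompressed four-line records, and the function names are hypothetical.

```python
def phred_scores(quality_line, offset=33):
    """Convert a FASTQ quality string to a list of Phred scores."""
    return [ord(char) - offset for char in quality_line]


def mean_read_quality(path, offset=33):
    """Return the mean Phred score of every read in a four-line-per-record FASTQ file."""
    means = []
    with open(path) as handle:
        for index, line in enumerate(handle):
            if index % 4 == 3:  # the fourth line of each record holds the qualities
                scores = phred_scores(line.rstrip("\n"), offset)
                if scores:
                    means.append(sum(scores) / len(scores))
    return means
```

For instance, averaging the values returned by `mean_read_quality("data/trimmed/SRR13349122_1_trimmed.fastq")` gives a single rough quality figure to compare between the two samples.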

### Run MultiQC
MultiQC reads the FastQC reports and generates a compiled report for all the analyzed FASTQ files.
Just as with FastQC, we can look at the MultiQC results once it finishes. As run below, MultiQC writes `multiqc_report.html` and a `multiqc_data` directory to the current working directory. Be sure to click on **'Trust HTML'** when you open the MultiQC report in order to view the graphs.

In [None]:
! multiqc -f data/fastqc
#! mv multiqc_data/ data/

### Index the Transcriptome so that Trimmed Reads Can Be Mapped Using Salmon

In [None]:
! salmon index -t data/reference/M_chelonae_transcripts.fasta -p 8 -i data/reference/transcriptome_index --decoys data/reference/decoys.txt -k 31 --keepDuplicates

### Run Salmon to Map Reads to Transcripts and Quantify Expression Levels
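Salmon maps the trimmed reads to the reference transcriptome and estimates the read counts per transcript. In this analysis, each gene has a single transcript. Note that the commands below quantify only the read 1 file of each sample (`-r`), using the stranded single-end library type `SR`.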

In [None]:
%%bash
salmon quant -i data/reference/transcriptome_index -l SR -r data/trimmed/SRR13349122_1_trimmed.fastq -p 8 --validateMappings -o data/quants/SRR13349122_quant
salmon quant -i data/reference/transcriptome_index -l SR -r data/trimmed/SRR13349128_1_trimmed.fastq -p 8 --validateMappings -o data/quants/SRR13349128_quant

In [None]:
ls data/quants/

### Report the top 10 most highly expressed genes in the samples

Top 10 most highly expressed genes in the wild-type sample.


In [None]:
! sort -nrk 4,4 data/quants/SRR13349122_quant/quant.sf | head -10

Top 10 most highly expressed genes in the double lysogen sample.


In [None]:
!sort -nrk 4,4 data/quants/SRR13349128_quant/quant.sf | head -10
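
The same ranking can be done in Python, which parses the header row instead of sorting it along with the data. This is a minimal sketch that assumes the tab-delimited `quant.sf` layout with `Name`, `TPM`, and `NumReads` columns; the function name `top_expressed` is hypothetical.

```python
import csv


def top_expressed(quant_path, n=10):
    """Return (Name, TPM, NumReads) for the n transcripts with the highest TPM."""
    with open(quant_path, newline="") as handle:
        rows = list(csv.DictReader(handle, delimiter="\t"))
    # Convert TPM to float so the sort is numeric rather than lexicographic.
    rows.sort(key=lambda row: float(row["TPM"]), reverse=True)
    return [
        (row["Name"], float(row["TPM"]), float(row["NumReads"]))
        for row in rows[:n]
    ]
```

For example, `top_expressed("data/quants/SRR13349122_quant/quant.sf")` returns the ten most highly expressed transcripts in the wild-type sample as a list of tuples.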

### Report the expression of a putative acyl-ACP desaturase (BB28_RS16545) that was downregulated in the double lysogen relative to wild-type
This gene was reported to be downregulated in the double lysogen, as shown in the table of the top 20 upregulated and downregulated genes in the paper describing the study.

Use `grep` to report the expression in the wild-type sample. The fields in the Salmon `quant.sf` file are as follows. The level of expression is reported in the Transcripts Per Million (`TPM`) and number of reads (`NumReads`) fields:  
`Name    Length  EffectiveLength TPM     NumReads`

In [None]:
!grep 'BB28_RS16545' data/quants/SRR13349122_quant/quant.sf

Use `grep` to report the expression in the double lysogen sample. The `quant.sf` fields are the same as above.

In [None]:
!grep 'BB28_RS16545' data/quants/SRR13349128_quant/quant.sf
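
To put the two `grep` results side by side, the sketch below reads the TPM for the gene from each `quant.sf` file and computes a log2 ratio. Several things here are assumptions: the helper names, the substring match on the `Name` column (mirroring `grep`), and the pseudocount of 1 that guards against division by zero. With one subsampled sample per group, the number is only illustrative and is not a statistical test of differential expression.

```python
import csv
import math


def gene_tpm(quant_path, gene_id):
    """Return the TPM of the first transcript whose Name contains gene_id."""
    with open(quant_path, newline="") as handle:
        for row in csv.DictReader(handle, delimiter="\t"):
            if gene_id in row["Name"]:  # substring match, like grep
                return float(row["TPM"])
    raise KeyError(f"{gene_id} not found in {quant_path}")


def log2_fold_change(tpm_treated, tpm_control, pseudocount=1.0):
    """Return log2((treated + pseudocount) / (control + pseudocount))."""
    return math.log2((tpm_treated + pseudocount) / (tpm_control + pseudocount))
```

A negative value from `log2_fold_change(gene_tpm(lysogen_path, "BB28_RS16545"), gene_tpm(wild_type_path, "BB28_RS16545"))` would be consistent with lower expression in the double lysogen, where `lysogen_path` and `wild_type_path` are the two `quant.sf` paths used above.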

## Conclusion
Here you learned how to copy data from cloud storage and then use FASTQ files to run a basic RNA-Seq analysis!

## Clean Up
Make sure you stop your compute instance and, if desired, delete the resource group associated with this tutorial.