# RNA-Seq Analysis Training Demo

## Overview

This short tutorial demonstrates how to run an RNA-Seq workflow using a prokaryotic data set. Steps in the workflow include read trimming, read QC, read mapping, and counting mapped reads per gene to quantitate gene expression.

![RNA-Seq workflow](images/rnaseq-workflow.png)

## Learning Objectives

*   **Install necessary bioinformatics tools:** Learn to install and manage bioinformatics software using mamba.
*   **Understand the steps in a typical RNA-Seq analysis workflow:** Work through a simplified pipeline that includes read trimming, quality control, mapping to a transcriptome, and gene expression quantification.
*   **Perform read trimming with Trimmomatic:** Learn how to use Trimmomatic to remove adapter sequences and low-quality bases from FASTQ reads.
*   **Assess read quality with FastQC:** Understand how to use FastQC to evaluate the quality of reads and identify potential biases or issues.
*   **Generate a consolidated QC report with MultiQC:** Learn to use MultiQC to combine FastQC results and generate an overview of quality metrics across multiple samples.
*   **Index a transcriptome with Salmon:** Understand the purpose of indexing and learn how to use Salmon to create an index of the reference transcriptome for efficient read mapping.
*   **Map reads to a transcriptome and quantify expression with Salmon:** Learn how to use Salmon to map reads to transcripts and quantify gene expression levels.

## Prerequisites

**APIs:**

*   **gsutil:** The Google Cloud Storage (GCS) command-line tool `gsutil` is used throughout to download data from a GCS bucket. This implicitly requires the Google Cloud Storage API to be enabled.

**Software and Dependencies:**

*   **Trimmomatic:** Used for trimming adapter sequences and low-quality bases from RNA-Seq reads.
*   **FastQC:** Used for quality control of FASTQ files, generating reports about read quality, sequence content, etc.
*   **Salmon:** Used for aligning reads to the transcriptome and quantifying gene expression.
*   **MultiQC:** Used to aggregate and visualize reports from FastQC.

## Get Started

### STEP 1: Install the tools

Using mamba and bioconda, install the tools that will be used in this tutorial.

In [None]:
! mamba install -y -c conda-forge -c bioconda trimmomatic fastqc multiqc salmon
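
Before moving on, it can help to confirm that the four tools are visible on the `PATH` of the notebook kernel. The following is a minimal sketch using only the Python standard library; the helper name `find_tools` is an illustration and not part of the original notebook.

```python
import shutil

# Executable names provided by the packages installed above.
TOOLS = ["trimmomatic", "fastqc", "multiqc", "salmon"]

def find_tools(names):
    """Map each executable name to its full path, or None if it is not on PATH."""
    return {name: shutil.which(name) for name in names}

for name, path in find_tools(TOOLS).items():
    print(f"{name:12s} {path or 'NOT FOUND'}")
```

If any tool is reported as `NOT FOUND`, rerunning the install cell or restarting the kernel is a reasonable first thing to try.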

### STEP 2: Set Up the Environment

Create a set of directories to store the reads, reference sequence files, and output files.


In [None]:
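# Each `!` command runs in its own subshell, so `! cd` would not change the
# notebook's working directory. All paths below are relative to the directory printed here.
! pwd
! mkdir -p data/trimmed data/fastqc data/reference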

Set the number of threads to use based on the number of CPUs on the VM, leaving one CPU free.

In [None]:
numthreads=!lscpu | grep '^CPU(s)'| awk '{print $2-1}'
THREADS = int(numthreads[0])
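
A pure-Python alternative, sketched here on the assumption that `os.cpu_count()` reflects the CPUs available to the VM, avoids parsing `lscpu` output and never returns fewer than one thread (the `lscpu` version yields 0 on a single-CPU machine). `pick_threads` is a hypothetical helper name.

```python
import os

def pick_threads(total_cpus):
    """Leave one CPU free for the notebook itself, but never go below one thread."""
    if not total_cpus:  # os.cpu_count() may return None when the count is unknown
        return 1
    return max(1, total_cpus - 1)

THREADS = pick_threads(os.cpu_count())
print(THREADS)
```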

### STEP 3: Copy FASTQ Files
So that this tutorial runs quickly, we will analyze only 50,000 reads from one sample in each of the two sample groups, instead of analyzing all the reads from all six samples. These files have been posted in a publicly accessible Google Cloud Storage bucket.


In [None]:
! gsutil -m cp -r gs://nigms-sandbox/me-inbre-rnaseq-pipelinev2/data/raw_fastqSub/ data/

### STEP 4: Copy reference transcriptome files that will be used by Salmon
Salmon is a tool that aligns RNA-Seq reads to a set of transcripts rather than the entire genome.

In [None]:
! gsutil -m cp -r gs://nigms-sandbox/me-inbre-rnaseq-pipelinev2/data/reference/M_chelonae_transcripts.fasta data/reference/M_chelonae_transcripts.fasta
! gsutil -m cp -r gs://nigms-sandbox/me-inbre-rnaseq-pipelinev2/data/reference/decoys.txt data/reference/decoys.txt


### STEP 5: Copy data file for Trimmomatic

In [None]:
! gsutil -m cp -r gs://nigms-sandbox/me-inbre-rnaseq-pipelinev2/config/TruSeq3-PE.fa .

### STEP 6: Run Trimmomatic
Trimmomatic will trim off any adapter sequences or low-quality bases it detects in the FASTQ files.

In [None]:
! trimmomatic PE -threads $THREADS data/raw_fastqSub/SRR13349122_1.fastq data/raw_fastqSub/SRR13349122_2.fastq data/trimmed/SRR13349122_1_trimmed.fastq data/trimmed/SRR13349122_1_trimmed_unpaired.fastq data/trimmed/SRR13349122_2_trimmed.fastq  data/trimmed/SRR13349122_2_trimmed_unpaired.fastq ILLUMINACLIP:TruSeq3-PE.fa:2:30:10:2:keepBothReads LEADING:3 TRAILING:3 MINLEN:36
! trimmomatic PE -threads $THREADS data/raw_fastqSub/SRR13349128_1.fastq data/raw_fastqSub/SRR13349128_2.fastq data/trimmed/SRR13349128_1_trimmed.fastq data/trimmed/SRR13349128_1_trimmed_unpaired.fastq data/trimmed/SRR13349128_2_trimmed.fastq  data/trimmed/SRR13349128_2_trimmed_unpaired.fastq ILLUMINACLIP:TruSeq3-PE.fa:2:30:10:2:keepBothReads LEADING:3 TRAILING:3 MINLEN:36
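
To make the quality-trimming steps in the commands above concrete, here is a simplified sketch of what `LEADING:3`, `TRAILING:3`, and `MINLEN:36` do to a single read. It is an illustration only, not Trimmomatic's implementation: it ignores adapter clipping (`ILLUMINACLIP`) and assumes Phred+33 quality encoding. The function name and the example read are made up.

```python
def trim_read(seq, qual, leading=3, trailing=3, minlen=36, offset=33):
    """Trim low-quality bases from both ends; return None if the read ends up too short.

    seq  -- base calls, e.g. "ACGT..."
    qual -- FASTQ quality string of the same length (Phred+offset encoded)
    """
    scores = [ord(c) - offset for c in qual]
    start, end = 0, len(seq)
    # LEADING: drop bases from the start while their quality is below the threshold.
    while start < end and scores[start] < leading:
        start += 1
    # TRAILING: drop bases from the end while their quality is below the threshold.
    while end > start and scores[end - 1] < trailing:
        end -= 1
    # MINLEN: discard the read entirely if too little of it survives.
    if end - start < minlen:
        return None
    return seq[start:end], qual[start:end]

# Made-up 40-base read: two poor bases ('#' = Q2) at each end, good bases ('I' = Q40) between.
seq = "NN" + "ACGT" * 9 + "NN"
qual = "##" + "I" * 36 + "##"
print(trim_read(seq, qual))            # the 36 good bases are kept
print(trim_read(seq[:30], qual[:30]))  # only 28 bases survive, so the read is dropped
```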

### STEP 7: Run FastQC
FastQC is an invaluable tool that allows you to evaluate whether there are problems with a set of reads. For example, it will provide a report of whether there is any bias in the sequence composition of the reads.

Because Jupyter is at its core a Python editor, we can use Python code and HTML support to display results inline.

In [None]:
! fastqc -o data/fastqc data/trimmed/SRR13349122_1_trimmed.fastq
! fastqc -o data/fastqc data/trimmed/SRR13349128_1_trimmed.fastq

from IPython.display import IFrame
IFrame(src='./data/fastqc/SRR13349122_1_trimmed_fastqc.html', width=800, height=600)
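
Besides the HTML report, FastQC writes a `*_fastqc.zip` archive containing a tab-separated `summary.txt` with one PASS/WARN/FAIL line per module. The sketch below shows one way to list the modules that did not pass; the helper names are hypothetical, and the example lines are illustrative input, not results from this tutorial's data.

```python
import zipfile

def parse_fastqc_summary(text):
    """Turn the tab-separated lines of a FastQC summary.txt into {module: status}."""
    results = {}
    for line in text.splitlines():
        if not line.strip():
            continue
        status, module, _filename = line.split("\t")
        results[module] = status
    return results

def read_fastqc_summary(zip_path):
    """Read summary.txt from a FastQC output archive such as *_fastqc.zip."""
    with zipfile.ZipFile(zip_path) as archive:
        # The archive holds one top-level folder; locate summary.txt inside it.
        member = next(n for n in archive.namelist() if n.endswith("/summary.txt"))
        return parse_fastqc_summary(archive.read(member).decode())

# Illustrative input only; these lines are not real output from the samples above.
example = (
    "PASS\tBasic Statistics\texample.fastq\n"
    "WARN\tPer base sequence content\texample.fastq\n"
    "FAIL\tSequence Duplication Levels\texample.fastq\n"
)
flagged = {m: s for m, s in parse_fastqc_summary(example).items() if s != "PASS"}
print(flagged)
```

In the notebook, something like `read_fastqc_summary("data/fastqc/SRR13349122_1_trimmed_fastqc.zip")` should work, assuming FastQC wrote its archive next to the HTML report.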

### STEP 8: Run MultiQC
MultiQC reads in the FastQC reports and generates a compiled report for all the analyzed FASTQ files.

Being able to use Python with bash also means we can seamlessly use popular Python packages, such as pandas, to interact with or view the files we create.

In [None]:
! multiqc -f data/fastqc

import pandas as pd
dframe = pd.read_csv("./multiqc_data/multiqc_fastqc.txt", sep='\t')
display(dframe)

### STEP 9: Index the Transcriptome so that Trimmed Reads Can Be Mapped Using Salmon

In [None]:
! salmon index -t data/reference/M_chelonae_transcripts.fasta -p $THREADS -i data/reference/transcriptome_index --decoys data/reference/decoys.txt -k 31 --keepDuplicates
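
The index above is built from k-mers of length 31 (`-k 31`), so a transcript shorter than 31 bases cannot contribute any k-mer to it. As a rough sanity check, the sketch below lists such transcripts in a FASTA file. The function names and the example records are made up for illustration.

```python
def fasta_lengths(lines):
    """Yield (name, length) for each record in an iterable of FASTA lines."""
    name, length = None, 0
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            if name is not None:
                yield name, length
            # Use the first word of the header as the record name.
            name, length = (line[1:].split() or [""])[0], 0
        elif name is not None:
            length += len(line)
    if name is not None:
        yield name, length

def shorter_than_k(lines, k=31):
    """Return the names of records whose sequence is shorter than k bases."""
    return [name for name, length in fasta_lengths(lines) if length < k]

# Made-up records: geneA spans two sequence lines (44 bases), geneB has 8 bases.
example = [">geneA some description", "ACGT" * 10, "ACGT", ">geneB", "ACGTACGT"]
print(shorter_than_k(example))
```

To check the reference used here, pass an open file handle instead of the example list, for instance `shorter_than_k(open("data/reference/M_chelonae_transcripts.fasta"))`.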


### STEP 10: Run Salmon to Map Reads to Transcripts and Quantify Expression Levels
Salmon aligns the trimmed reads to the reference transcriptome and generates the read counts per transcript. In this analysis, each gene has a single transcript.

In [None]:
! salmon quant -i data/reference/transcriptome_index -l SR -r data/trimmed/SRR13349122_1_trimmed.fastq -p $THREADS --validateMappings -o data/quants/SRR13349122_quant
! salmon quant -i data/reference/transcriptome_index -l SR -r data/trimmed/SRR13349128_1_trimmed.fastq -p $THREADS --validateMappings -o data/quants/SRR13349128_quant
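
The two commands above differ only in the sample accession. Note that they pass only the read 1 file (`-r`) with the single-end library type `SR`. With more samples, building the command in a loop avoids copy-and-paste errors. The sketch below only constructs and prints the commands; the `SAMPLES` list and the helper name `salmon_quant_cmd` are assumptions for illustration, while the flags are copied from the cells above.

```python
import shlex

SAMPLES = ["SRR13349122", "SRR13349128"]

def salmon_quant_cmd(sample, threads):
    """Build the same `salmon quant` call used above for one sample accession."""
    return [
        "salmon", "quant",
        "-i", "data/reference/transcriptome_index",
        "-l", "SR",
        "-r", f"data/trimmed/{sample}_1_trimmed.fastq",
        "-p", str(threads),
        "--validateMappings",
        "-o", f"data/quants/{sample}_quant",
    ]

for sample in SAMPLES:
    # Print the command; pass the list to subprocess.run(cmd, check=True) to execute it.
    print(shlex.join(salmon_quant_cmd(sample, threads=2)))
```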


### STEP 11: Report the top 10 most highly expressed genes in the samples

Top 10 most highly expressed genes in the wild-type sample. The level of expression is reported in the Transcripts Per Million (`TPM`) and number of reads (`NumReads`) fields:  
`Name    Length  EffectiveLength TPM     NumReads`


In [None]:
# Note: the 'write failed' message is harmless; it appears because `head` closes the pipe after reading 10 lines.
! sort -nrk 5,5 data/quants/SRR13349122_quant/quant.sf | head -10
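
The `sort -nrk 5,5` command ranks genes by the fifth column, `NumReads`, which is not the same ordering as `TPM`. TPM divides each gene's read count by its effective length and then rescales so that the values sum to one million. The sketch below applies that standard definition to made-up numbers; it is meant to illustrate the relationship between the columns, not to reproduce Salmon's internals.

```python
def tpm(num_reads, effective_lengths):
    """Compute transcripts per million from read counts and effective lengths."""
    # Reads per base of effective length for each transcript.
    rates = [n / length for n, length in zip(num_reads, effective_lengths)]
    total = sum(rates)  # assumes at least one transcript has reads
    return [1e6 * rate / total for rate in rates]

# Made-up example: the first two genes have equal read counts but different lengths.
values = tpm([100, 100, 300], [1000, 500, 1000])
print([round(v) for v in values])
```

The first two genes have the same `NumReads`, yet the shorter one gets twice the TPM, which is why a ranking by read count can differ from a ranking by TPM.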

Top 10 most highly expressed genes in the double lysogen sample.


In [None]:
! sort -nrk 5,5 data/quants/SRR13349128_quant/quant.sf | head -10
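
The same ranking can be done in Python, which skips the header line cleanly and makes it easy to rank by `TPM` instead of `NumReads`. This is a minimal sketch using the standard `csv` module; `top_genes` is a hypothetical helper and the example table is made up.

```python
import csv
import io

def top_genes(handle, n=10, field="NumReads"):
    """Return the n rows of a quant.sf file with the highest value in `field`."""
    reader = csv.DictReader(handle, delimiter="\t")
    rows = sorted(reader, key=lambda row: float(row[field]), reverse=True)
    return [(row["Name"], float(row["TPM"]), float(row["NumReads"])) for row in rows[:n]]

# Made-up table with the same columns as quant.sf; not real results.
example = io.StringIO(
    "Name\tLength\tEffectiveLength\tTPM\tNumReads\n"
    "geneA\t1200\t1000.0\t125000.0\t100.0\n"
    "geneB\t700\t500.0\t375000.0\t150.0\n"
    "geneC\t1200\t1000.0\t500000.0\t400.0\n"
)
print(top_genes(example, n=2))
```

For the real data, open the file first, for example `with open("data/quants/SRR13349122_quant/quant.sf") as fh: print(top_genes(fh))`.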

### STEP 12: Report the expression of a putative acyl-ACP desaturase (BB28_RS16545) that was downregulated in the double lysogen relative to wild-type
This gene was reported to be downregulated in the double lysogen, as shown in the table of the top 20 upregulated and downregulated genes from the paper describing the study.
![Top 20 upregulated and downregulated genes](images/table-cushman.png)

Use `grep` to report the expression in the wild-type sample. The fields in the Salmon `quant.sf` file are listed below; the level of expression is reported in the Transcripts Per Million (`TPM`) and number of reads (`NumReads`) fields:  
`Name    Length  EffectiveLength TPM     NumReads`

In [None]:
! grep 'BB28_RS16545' data/quants/SRR13349122_quant/quant.sf

Use `grep` to report the expression in the double lysogen sample. The fields in the Salmon `quant.sf` file are listed below; the level of expression is reported in the Transcripts Per Million (`TPM`) and number of reads (`NumReads`) fields:  
`Name    Length  EffectiveLength TPM     NumReads`

In [None]:
! grep 'BB28_RS16545' data/quants/SRR13349128_quant/quant.sf
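
To view both samples side by side, the lookup can also be done in Python. Unlike `grep`, which matches any line containing the text, the sketch below matches the `Name` column exactly. The helper name, the gene ID `geneX`, and the numbers are made up for illustration and are not results from this dataset.

```python
import csv
import io

def expression_of(handle, gene):
    """Return (TPM, NumReads) for one gene from an open quant.sf file, or None if absent."""
    for row in csv.DictReader(handle, delimiter="\t"):
        if row["Name"] == gene:
            return float(row["TPM"]), float(row["NumReads"])
    return None

# Made-up stand-ins for the two quant.sf files.
header = "Name\tLength\tEffectiveLength\tTPM\tNumReads\n"
wild_type = header + "geneX\t900\t700.0\t50.0\t20.0\n"
double_lysogen = header + "geneX\t900\t700.0\t5.0\t2.0\n"

for label, text in [("wild-type", wild_type), ("double lysogen", double_lysogen)]:
    print(label, expression_of(io.StringIO(text), "geneX"))
```

With only 50,000 reads per sample and one sample per group, such a comparison is a quick look rather than a statistical test; the DESeq2 workflow listed in the next section covers differential expression properly.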

## <a name="workflow">Additional Workflows</a>

Now that you have read counts per gene, feel free to explore the R workflow, which creates plots and analyses from these read count files, or try alternative workflows for creating read count files, such as the one using Snakemake.


[Workflow One:](Tutorial_1.ipynb) A short introduction to downloading and mapping sequences to a transcriptome using Trimmomatic and Salmon. Here is a link to the YouTube video demonstrating the tutorial: <https://youtu.be/ChGfBR4do_Y>.

[Workflow One (Extended):](Tutorial_1B_Extended.ipynb) An extended version of workflow one. Once you have gotten your feet wet, you can retry workflow one with this extended version, which covers the entire dataset and adds details such as using SRA tools to download sequences and running batches of FASTQ files through the pipeline. This workflow may take around an hour to run.

[Workflow One (Using Snakemake):](Tutorial_2_Snakemake.ipynb) Using Snakemake to run workflow one.

[Workflow Two (DEG Analysis):](Tutorial_3_DEG_Analysis.ipynb) Using DESeq2 and R to conduct clustering and differential gene expression analysis.


![RNA-Seq notebook homepage](images/RNA-Seq_Notebook_Homepage.png)

## Conclusion

In summary, this Jupyter Notebook provided a hands-on demonstration of a basic RNA-Seq analysis workflow, covering read trimming with Trimmomatic, quality control with FastQC and MultiQC, transcriptome indexing, and read mapping and quantification with Salmon. By processing a subset of reads from a prokaryotic dataset, we were able to observe the expression levels of specific genes, including a previously identified downregulated acyl-ACP desaturase. This workflow serves as a foundation for more advanced analyses, and further resources are available for exploring extended datasets, automating the pipeline with Snakemake, and performing differential gene expression analysis with R and DESeq2. The tutorial equips users with the basic skills to analyze RNA-Seq data and to understand the core components of a typical RNA-Seq pipeline.

## Clean Up

Remember to move to the next notebook or shut down your instance if you are finished.