# RNA-Seq Analysis Training Demo

## Overview

This short tutorial demonstrates how to run an RNA-Seq workflow using a prokaryotic data set. Steps in the workflow include read trimming, read QC, read mapping, and counting mapped reads per gene to quantitate gene expression.

![RNA-Seq workflow](images/rnaseq-workflow.png)

### STEP 1: Install Mambaforge.

First install Mambaforge.


In [1]:
!curl -L -O https://github.com/conda-forge/miniforge/releases/latest/download/Mambaforge-$(uname)-$(uname -m).sh
!bash Mambaforge-$(uname)-$(uname -m).sh -b -u -p $HOME/mambaforge
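import os
# `!export` runs in a temporary subshell and does not persist, so update PATH from Python instead.
os.environ["PATH"] = os.path.expanduser("~/mambaforge/bin") + os.pathsep + os.environ["PATH"]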

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0
  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0
100 88.1M  100 88.1M    0     0   114M      0 --:--:-- --:--:-- --:--:--  292M
PREFIX=/home/jupyter/mambaforge
Unpacking payload ...
Extracting "cffi-1.15.0-py39h4bc2ebd_0.tar.bz2"
Extracting "xz-5.2.5-h516909a_1.tar.bz2"
Extracting "cryptography-37.0.2-py39hd97740a_0.tar.bz2"
Extracting "ld_impl_linux-64-2.36.1-hea4e1c9_2.tar.bz2"
Extracting "ruamel_yaml-0.15.80-py39hb9d737c_1007.tar.bz2"
Extracting "libnghttp2-1.47.0-h727a467_0.tar.bz2"
Extracting "libgomp-12.1.0-h8d9b700_16.tar.bz2"
Extracting "pycosat-0.6.3-py39hb9d737c_1010.tar.bz2"
Extracting "reproc-14.2.3-h7f98852_0.tar.bz2"
Extracting "requests-2.27.1-pyhd8ed1ab_0.tar.bz2"
Extracting "tk-8.6.12-h27826a3_0.tar.bz2"
Ex

Next, use Mambaforge and the Bioconda channel to install the tools used in this tutorial.
The cell below installs them one at a time; they can also be installed with a single command, for example:

`!$HOME/mambaforge/bin/mamba install -y -c conda-forge -c bioconda trimmomatic fastqc multiqc salmon`

In [2]:
!$HOME/mambaforge/bin/mamba install -y -c conda-forge -c bioconda trimmomatic
!$HOME/mambaforge/bin/mamba install -y -c conda-forge -c bioconda fastqc
!$HOME/mambaforge/bin/mamba install -y -c conda-forge -c bioconda multiqc
!$HOME/mambaforge/bin/mamba install -y -c conda-forge -c bioconda salmon


                  __    __    __    __
                 /  \  /  \  /  \  /  \
                /    \/    \/    \/    \
███████████████/  /██/  /██/  /██/  /████████████████████████
              /  / \   / \   / \   / \  \____
             /  /   \_/   \_/   \_/   \    o \__,
            / _/                       \_____/  `
            |/
        ███╗   ███╗ █████╗ ███╗   ███╗██████╗  █████╗
        ████╗ ████║██╔══██╗████╗ ████║██╔══██╗██╔══██╗
        ██╔████╔██║███████║██╔████╔██║██████╔╝███████║
        ██║╚██╔╝██║██╔══██║██║╚██╔╝██║██╔══██╗██╔══██║
        ██║ ╚═╝ ██║██║  ██║██║ ╚═╝ ██║██████╔╝██║  ██║
        ╚═╝     ╚═╝╚═╝  ╚═╝╚═╝     ╚═╝╚═════╝ ╚═╝  ╚═╝

        mamba (0.22.1) supported by @QuantStack

        GitHub:  https://github.com/mamba-org/mamba
        Twitter: https://twitter.com/QuantStack

█████████████████████████████████████████████████████████████


Looking for: ['trimmomatic']


### STEP 2: Set Up Environment

Create a set of directories to store the reads, reference sequence files, and output files.


In [3]:
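# `!cd` runs in a temporary subshell and would not change the notebook's working directory,
# so the directories below are created relative to the current directory shown by `echo $PWD`.
!echo $PWD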
!mkdir -p data
!mkdir -p data/raw_fastq
!mkdir -p data/trimmed
!mkdir -p data/fastqc
!mkdir -p data/reference

/home/jupyter/rnaseq-myco-notebook
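
The same directory layout can also be created from Python, which sidesteps the subshell behavior of `!` commands. This is a minimal sketch, assuming the notebook's current working directory is the project root; the helper name `make_project_dirs` is hypothetical and not part of the original tutorial.

```python
from pathlib import Path

# Subdirectories used by the tutorial, relative to the project root.
SUBDIRS = ["raw_fastq", "trimmed", "fastqc", "reference"]

def make_project_dirs(root="."):
    """Create data/<subdir> under root (hypothetical helper); return the paths."""
    created = []
    for name in SUBDIRS:
        path = Path(root) / "data" / name
        # parents=True creates data/ as needed; exist_ok=True mirrors `mkdir -p`.
        path.mkdir(parents=True, exist_ok=True)
        created.append(path)
    return created
```

Calling `make_project_dirs()` with no arguments creates the four directories under the current working directory, and calling it again is harmless.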


### STEP 3: Copy FASTQ Files
To keep this tutorial fast, we analyze only 50,000 reads from one sample in each of the two sample groups, instead of all the reads from all six samples. These files are hosted in a publicly accessible Google Cloud Storage bucket.


In [4]:
!curl https://storage.googleapis.com/me-inbre-rnaseq-pipelinev2/data/raw_fastqSub/SRR13349122_1.fastq --output data/raw_fastq/SRR13349122_1.fastq
!curl https://storage.googleapis.com/me-inbre-rnaseq-pipelinev2/data/raw_fastqSub/SRR13349122_2.fastq --output data/raw_fastq/SRR13349122_2.fastq
!curl https://storage.googleapis.com/me-inbre-rnaseq-pipelinev2/data/raw_fastqSub/SRR13349128_1.fastq --output data/raw_fastq/SRR13349128_1.fastq
!curl https://storage.googleapis.com/me-inbre-rnaseq-pipelinev2/data/raw_fastqSub/SRR13349128_2.fastq --output data/raw_fastq/SRR13349128_2.fastq

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 8452k  100 8452k    0     0   179M      0 --:--:-- --:--:-- --:--:--  179M
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 8452k  100 8452k    0     0   206M      0 --:--:-- --:--:-- --:--:--  206M
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 8452k  100 8452k    0     0   168M      0 --:--:-- --:--:-- --:--:--  168M
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 8452k  100 8452k    0     0   171M      0 --:--:-- --:--:-- --:--:--  171M
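
To confirm that each download is complete, the reads can be counted directly. This is a minimal sketch, assuming uncompressed FASTQ files with exactly four lines per record (no wrapped sequence lines); `count_fastq_reads` is a hypothetical helper, not part of the original notebook.

```python
def count_fastq_reads(path):
    """Count reads in an uncompressed FASTQ file, assuming 4 lines per record."""
    with open(path) as handle:
        n_lines = sum(1 for _ in handle)
    # A line count that is not a multiple of 4 suggests a truncated download.
    if n_lines % 4 != 0:
        raise ValueError(f"{path}: {n_lines} lines is not a multiple of 4")
    return n_lines // 4
```

For example, `count_fastq_reads("data/raw_fastq/SRR13349122_1.fastq")` should match the 50,000 reads described above, and the `_1` and `_2` files of a sample should give the same count.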


### STEP 4: Copy reference transcriptome files that will be used by Salmon
Salmon is a tool that aligns RNA-Seq reads to a set of transcripts rather than the entire genome.

In [5]:
!curl https://storage.googleapis.com/me-inbre-rnaseq-pipelinev2/data/reference/M_chelonae_transcripts.fasta --output data/reference/M_chelonae_transcripts.fasta
!curl https://storage.googleapis.com/me-inbre-rnaseq-pipelinev2/data/reference/decoys.txt --output data/reference/decoys.txt


  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 9599k  100 9599k    0     0   170M      0 --:--:-- --:--:-- --:--:--  167M
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100    14  100    14    0     0   2000      0 --:--:-- --:--:-- --:--:--  2000


### STEP 5: Copy data file for Trimmomatic

In [6]:
!curl https://storage.googleapis.com/me-inbre-rnaseq-pipelinev2/config/TruSeq3-PE.fa --output TruSeq3-PE.fa

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100    95  100    95    0     0  11875      0 --:--:-- --:--:-- --:--:-- 11875


### STEP 6: Run Trimmomatic
Trimmomatic trims adapter sequences and low-quality bases from the reads in the FASTQ files.

In [7]:
!trimmomatic PE -threads 2 data/raw_fastq/SRR13349122_1.fastq data/raw_fastq/SRR13349122_2.fastq data/trimmed/SRR13349122_1_trimmed.fastq data/trimmed/SRR13349122_1_trimmed_unpaired.fastq data/trimmed/SRR13349122_2_trimmed.fastq  data/trimmed/SRR13349122_2_trimmed_unpaired.fastq ILLUMINACLIP:TruSeq3-PE.fa:2:30:10:2:keepBothReads LEADING:3 TRAILING:3 MINLEN:36
!trimmomatic PE -threads 2 data/raw_fastq/SRR13349128_1.fastq data/raw_fastq/SRR13349128_2.fastq data/trimmed/SRR13349128_1_trimmed.fastq data/trimmed/SRR13349128_1_trimmed_unpaired.fastq data/trimmed/SRR13349128_2_trimmed.fastq  data/trimmed/SRR13349128_2_trimmed_unpaired.fastq ILLUMINACLIP:TruSeq3-PE.fa:2:30:10:2:keepBothReads LEADING:3 TRAILING:3 MINLEN:36

TrimmomaticPE: Started with arguments:
 -threads 2 data/raw_fastq/SRR13349122_1.fastq data/raw_fastq/SRR13349122_2.fastq data/trimmed/SRR13349122_1_trimmed.fastq data/trimmed/SRR13349122_2_trimmed.fastq data/trimmed/SRR13349122_1_trimmed_unpaired.fastq data/trimmed/SRR13349122_2_trimmed_unpaired.fastq ILLUMINACLIP:TruSeq3-PE.fa:2:30:10:2:keepBothReads LEADING:3 TRAILING:3 MINLEN:36
Using PrefixPair: 'TACACTCTTTCCCTACACGACGCTCTTCCGATCT' and 'GTGACTGGAGTTCAGACGTGTGCTCTTCCGATCT'
ILLUMINACLIP: Using 1 prefix pairs, 0 forward/reverse sequences, 0 forward only sequences, 0 reverse only sequences
Quality encoding detected as phred33
Input Read Pairs: 50000 Both Surviving: 49870 (99.74%) Forward Only Surviving: 130 (0.26%) Reverse Only Surviving: 0 (0.00%) Dropped: 0 (0.00%)
TrimmomaticPE: Completed successfully
TrimmomaticPE: Started with arguments:
 -threads 2 data/raw_fastq/SRR13349128_1.fastq data/raw_fastq/SRR13349128_2.fastq data/trimmed/SRR13349128_1_trimmed.fastq data/trimmed/SRR133491
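
Trimmomatic's paired-end mode expects its four output files in a fixed order: forward paired, forward unpaired, reverse paired, reverse unpaired. Because the two commands in this step differ only in the accession, they can be generated from a sample list. The sketch below builds the argument list in Python; `trimmomatic_pe_args`, `run_trimmomatic`, and the default directories are illustrative assumptions, and the trimming settings are copied from the commands in this step.

```python
import subprocess

def trimmomatic_pe_args(sample, raw_dir="data/raw_fastq", out_dir="data/trimmed", threads=2):
    """Build the Trimmomatic PE command for one sample (hypothetical helper)."""
    return [
        "trimmomatic", "PE", "-threads", str(threads),
        f"{raw_dir}/{sample}_1.fastq",
        f"{raw_dir}/{sample}_2.fastq",
        f"{out_dir}/{sample}_1_trimmed.fastq",           # forward reads, pair survived
        f"{out_dir}/{sample}_1_trimmed_unpaired.fastq",  # forward reads, mate dropped
        f"{out_dir}/{sample}_2_trimmed.fastq",           # reverse reads, pair survived
        f"{out_dir}/{sample}_2_trimmed_unpaired.fastq",  # reverse reads, mate dropped
        "ILLUMINACLIP:TruSeq3-PE.fa:2:30:10:2:keepBothReads",
        "LEADING:3", "TRAILING:3", "MINLEN:36",
    ]

def run_trimmomatic(samples):
    """Run Trimmomatic on each sample; raises CalledProcessError on failure."""
    for sample in samples:
        subprocess.run(trimmomatic_pe_args(sample), check=True)
```

With this in place, `run_trimmomatic(["SRR13349122", "SRR13349128"])` would run both samples, and adding a sample only requires extending the list.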

### STEP 7: Run FastQC
FastQC is an invaluable tool that allows you to evaluate whether there are problems with a set of reads. For example, it will provide a report of whether there is any bias in the sequence composition of the reads.

Because Jupyter is at its core a Python environment, we can use Python code and its HTML support to display results inline.

In [8]:
!fastqc -o data/fastqc data/trimmed/SRR13349122_1_trimmed.fastq
!fastqc -o data/fastqc data/trimmed/SRR13349128_1_trimmed.fastq

from IPython.display import IFrame
IFrame(src='./data/fastqc/SRR13349122_1_trimmed_fastqc.html', width=800, height=600)

Started analysis of SRR13349122_1_trimmed.fastq
Approx 5% complete for SRR13349122_1_trimmed.fastq
Approx 10% complete for SRR13349122_1_trimmed.fastq
Approx 15% complete for SRR13349122_1_trimmed.fastq
Approx 20% complete for SRR13349122_1_trimmed.fastq
Approx 25% complete for SRR13349122_1_trimmed.fastq
Approx 30% complete for SRR13349122_1_trimmed.fastq
Approx 35% complete for SRR13349122_1_trimmed.fastq
Approx 40% complete for SRR13349122_1_trimmed.fastq
Approx 45% complete for SRR13349122_1_trimmed.fastq
Approx 50% complete for SRR13349122_1_trimmed.fastq
Approx 55% complete for SRR13349122_1_trimmed.fastq
Approx 60% complete for SRR13349122_1_trimmed.fastq
Approx 65% complete for SRR13349122_1_trimmed.fastq
Approx 70% complete for SRR13349122_1_trimmed.fastq
Approx 75% complete for SRR13349122_1_trimmed.fastq
Approx 80% complete for SRR13349122_1_trimmed.fastq
Approx 85% complete for SRR13349122_1_trimmed.fastq
Approx 90% complete for SRR13349122_1_trimmed.fastq
Approx 95% comple

### STEP 8: Run MultiQC
MultiQC reads the FastQC reports and generates a single compiled report for all the analyzed FASTQ files.

Being able to mix Python with shell commands also means we can use popular Python packages, such as pandas, to inspect the files we create.

In [9]:
!multiqc -f data/fastqc

import pandas as pd
dframe = pd.read_csv("./multiqc_data/multiqc_fastqc.txt", sep='\t')
display(dframe)


  /// MultiQC 🔍 | v1.12

|           multiqc | Search path : /home/jupyter/rnaseq-myco-notebook/data/fastqc
|         searching | ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% 4/4
|            fastqc | Found 2 reports
|           multiqc | Compressing plot data
|           multiqc | Report      : multiqc_report.html
|           multiqc | Data        : multiqc_data
|           multiqc | MultiQC complete


Unnamed: 0,Sample,Filename,File type,Encoding,Total Sequences,Sequences flagged as poor quality,Sequence length,%GC,total_deduplicated_percentage,avg_sequence_length,basic_statistics,per_base_sequence_quality,per_sequence_quality_scores,per_base_sequence_content,per_sequence_gc_content,per_base_n_content,sequence_length_distribution,sequence_duplication_levels,overrepresented_sequences,adapter_content
0,SRR13349122_1,SRR13349122_1_trimmed.fastq,Conventional base calls,Sanger / Illumina 1.9,49870.0,0.0,50-51,55.0,19.452577,50.997654,pass,pass,pass,fail,pass,pass,warn,fail,fail,pass
1,SRR13349128_1,SRR13349128_1_trimmed.fastq,Conventional base calls,Sanger / Illumina 1.9,49851.0,0.0,50-51,54.0,17.235361,50.997212,pass,pass,pass,fail,warn,pass,warn,fail,fail,pass
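
The columns holding `pass`, `warn`, or `fail` record the status of each FastQC module, so the modules that need attention can be listed per sample. This is a minimal sketch using only the standard library, assuming the tab-separated `multiqc_data/multiqc_fastqc.txt` file read in the cell above; `flagged_modules` is a hypothetical helper.

```python
import csv

def flagged_modules(path, statuses=("warn", "fail")):
    """Map each sample to the FastQC modules whose status is in `statuses`."""
    flagged = {}
    with open(path, newline="") as handle:
        for row in csv.DictReader(handle, delimiter="\t"):
            # Keep only the columns whose value is one of the requested statuses.
            flagged[row["Sample"]] = [
                column for column, value in row.items() if value in statuses
            ]
    return flagged
```

For the table above, `flagged_modules("multiqc_data/multiqc_fastqc.txt")` would list `per_base_sequence_content`, `sequence_length_distribution`, `sequence_duplication_levels`, and `overrepresented_sequences` for SRR13349122_1, plus `per_sequence_gc_content` for SRR13349128_1.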


### STEP 9: Index the Transcriptome so that Trimmed Reads Can Be Mapped Using Salmon

In [10]:
!salmon index -t data/reference/M_chelonae_transcripts.fasta -p 8 -i data/reference/transcriptome_index --decoys data/reference/decoys.txt -k 31 --keepDuplicates


Version Info: ### PLEASE UPGRADE SALMON ###
### A newer version of salmon with important bug fixes and improvements is available. ####
###
The newest version, available at https://github.com/COMBINE-lab/salmon/releases
contains new features, improvements, and bug fixes; please upgrade at your
earliest convenience.
###
Sign up for the salmon mailing list to hear about new versions, features and updates at:
https://oceangenomics.com/subscribe
index ["data/reference/transcriptome_index"] did not previously exist  . . . creating it
[2022-06-01 16:40:36.506] [jLog] [info] building index
out : data/reference/transcriptome_index
[2022-06-01 16:40:36.506] [puff::index::jointLog] [info] Running fixFasta

[Step 1 of 4] : counting k-mers

[2022-06-01 16:40:36.731] [puff::index::jointLog] [info] Replaced 0 non-ATCG nucleotides
[2022-06-01 16:40:36.731] [puff::index::jointLog] [info] Clipped poly-A tails from 0 transcripts
wrote 4868 cleaned references
[2022-

### STEP 10: Run Salmon to Map Reads to Transcripts and Quantify Expression Levels
Salmon aligns the trimmed reads to the reference transcriptome and generates the read counts per transcript. In this analysis, each gene has a single transcript.

In [11]:
!salmon quant -i data/reference/transcriptome_index -l SR -r data/trimmed/SRR13349122_1_trimmed.fastq -p 8 --validateMappings -o data/quants/SRR13349122_quant
!salmon quant -i data/reference/transcriptome_index -l SR -r data/trimmed/SRR13349128_1_trimmed.fastq -p 8 --validateMappings -o data/quants/SRR13349128_quant


### salmon (selective-alignment-based) v1.7.0
### [ program ] => salmon 
### [ command ] => quant 
### [ index ] => { data/reference/transcriptome_index }
### [ libType ] => { SR }
### [ unmatedReads ] => { data/trimmed/SRR13349122_1_trimmed.fastq }
### [ threads ] => { 8 }
### [ validateMappings ] => { }
### [ output ] => { data/quants/SRR13349122_quant }
Logs will be written to data/quants/SRR13349122_quant/logs
[2022-06-01 16:41:07.111] [jointLog] [info] setting maxHashResizeThreads to 8
[2022-06-01 16:41:07.111] [jointLog] [info] 

### STEP 11: Report the top 10 most highly expressed genes in the samples.

Top 10 most highly expressed genes in the wild-type sample. The level of expression is reported in the Transcripts Per Million (`TPM`) and number of reads (`NumReads`) fields:  
`Name    Length  EffectiveLength TPM     NumReads`


In [12]:
# Note: the 'write failed' message below is harmless. `head` exits after printing 10 lines and closes the pipe while `sort` is still writing.
!sort -nrk 5,5 data/quants/SRR13349122_quant/quant.sf | head -10


BB28_RS20665	1293	1043.000	5447.071698	55.000
BB28_RS03075	1626	1376.000	3077.869008	41.000
BB28_RS07370	9255	9005.000	424.426717	37.000
BB28_RS20690	10377	10127.000	367.203150	36.000
BB28_RS19405	3948	3698.000	1005.588509	36.000
BB28_RS18585	1305	1055.000	2741.512828	28.000
BB28_RS20685	7731	7481.000	372.811085	27.000
BB28_RS19310	1194	944.000	2626.176789	24.000
BB28_RS22260	1836	1586.000	1497.991546	23.000
BB28_RS09805	1437	1187.000	2001.528725	23.000
sort: write failed: 'standard output': Broken pipe
sort: write error
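
The same ranking can be done in Python, which skips the header row explicitly and avoids the broken-pipe message. This is a minimal sketch using the standard library, assuming the five-column tab-separated `quant.sf` layout listed above; `top_expressed` is a hypothetical helper.

```python
import csv

def top_expressed(quant_path, n=10, key="NumReads"):
    """Return the n rows of a Salmon quant.sf file with the largest `key` value."""
    with open(quant_path, newline="") as handle:
        rows = list(csv.DictReader(handle, delimiter="\t"))
    # Values are read as strings, so convert to float before sorting.
    rows.sort(key=lambda row: float(row[key]), reverse=True)
    return rows[:n]
```

Each returned row is a dictionary keyed by column name, so `top_expressed(path)[0]["Name"]` gives the transcript with the most reads. Passing `key="TPM"` ranks by TPM instead; rows with tied values may appear in a different order than in the `sort` output.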


Top 10 most highly expressed genes in the double lysogen sample.


In [13]:
!sort -nrk 5,5 data/quants/SRR13349128_quant/quant.sf | head -10


BB28_RS20665	1293	1043.000	16640.051824	107.000
BB28_RS18585	1305	1055.000	5381.096618	35.000
BB28_RS13330	2790	2540.000	2107.343957	33.000
BB28_RS14905	972	722.000	6065.711819	27.000
BB28_RS22260	1836	1586.000	1943.146845	19.000
BB28_RS18315	1767	1517.000	2031.529926	19.000
BB28_RS20685	7731	7481.000	368.590781	17.000
BB28_RS20690	10377	10127.000	240.251247	15.000
BB28_RS03075	1626	1376.000	1768.186333	15.000
BB28_RS11085	1278	1028.000	2208.971570	14.000
sort: write failed: 'standard output': Broken pipe
sort: write error
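
Ranking by `NumReads` is not the same as ranking by `TPM`: in the wild-type output, BB28_RS07370 has 37 reads but a TPM of only about 424, because TPM divides each read count by the transcript's effective length before scaling the values to sum to one million. The sketch below shows that calculation under the standard TPM definition. The function name and the three-transcript example are illustrative; the absolute values differ from Salmon's because Salmon normalizes over every transcript in the index, but the ratios between transcripts are the same.

```python
def tpm(num_reads, effective_lengths):
    """Compute TPM from per-transcript read counts and effective lengths."""
    # Reads per base of effective length for each transcript.
    rates = [reads / length for reads, length in zip(num_reads, effective_lengths)]
    total = sum(rates)
    # Scale so that the values sum to one million.
    return [rate / total * 1e6 for rate in rates]

# BB28_RS20665, BB28_RS03075, and BB28_RS07370 from the wild-type output,
# treated here as if they were the whole transcriptome.
values = tpm([55.0, 41.0, 37.0], [1043.0, 1376.0, 9005.0])
```

The long transcript BB28_RS07370 ends up with the smallest value even though its read count is close to the other two.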


### STEP 12: Report the expression of a putative acyl-ACP desaturase (BB28_RS16545) that was downregulated in the double lysogen relative to wild-type
This gene was reported to be downregulated in the double lysogen, as shown in the table of the top 20 upregulated and downregulated genes from the paper describing the study.
![Table of the top 20 upregulated and downregulated genes](images/table-cushman.png)

Use `grep` to report the expression in the wild-type sample. The fields in the Salmon `quant.sf` file are as follows. The level of expression is reported in the Transcripts Per Million (`TPM`) and number of reads (`NumReads`) fields:  
`Name    Length  EffectiveLength TPM     NumReads`

In [14]:
!grep 'BB28_RS16545' data/quants/SRR13349122_quant/quant.sf


BB28_RS16545	987	737.000	560.631139	4.000


Use `grep` to report the expression in the double lysogen sample. The fields in the Salmon `quant.sf` file are as follows. The level of expression is reported in the Transcripts Per Million (`TPM`) and number of reads (`NumReads`) fields:  
`Name    Length  EffectiveLength TPM     NumReads`

In [15]:
!grep 'BB28_RS16545' data/quants/SRR13349128_quant/quant.sf


BB28_RS16545	987	737.000	220.083619	1.000
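
The two TPM values can be summarized as a log2 fold change. This is only an illustrative sketch: with 4 reads and 1 read in a 50,000-read subsample and one sample per group, the result is not statistically meaningful, and a proper comparison needs the full dataset and a tool such as DESeq2 (see Workflow Two below). The function name is hypothetical.

```python
import math

def log2_fold_change(treatment, control):
    """Return log2(treatment / control); negative values indicate downregulation."""
    return math.log2(treatment / control)

# TPM values for BB28_RS16545 taken from the two grep outputs above.
wild_type_tpm = 560.631139
double_lysogen_tpm = 220.083619
lfc = log2_fold_change(double_lysogen_tpm, wild_type_tpm)
print(f"log2 fold change: {lfc:.2f}")
```

The value is negative (roughly -1.35), which agrees in direction with the reported downregulation in the double lysogen.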


## <a name="workflow">Additional Workflows</a>

Now that you have read counts per gene, you can explore the R workflow, which creates plots and analyses from these read count files, or try alternative workflows for generating read counts, such as one that uses Snakemake.


[Workflow One:](Tutorial_1.ipynb) A short introduction to downloading and mapping sequences to a transcriptome using Trimmomatic and Salmon. Here is a link to the YouTube video demonstrating the tutorial: <https://www.youtube.com/watch?v=NG1U7D4l31o&t=26s>.

[Workflow One (Extended):](Tutorial_1B_Extended.ipynb) An extended version of workflow one. Once you have gotten your feet wet, you can retry workflow one with this extended version, which covers the entire dataset and adds detail such as using SRA tools to download sequences and running batches of FASTQ files through the pipeline. This workflow may take around an hour to run.

[Workflow One (Using Snakemake):](Tutorial_2_Snakemake.ipynb) Using Snakemake to run workflow one.

[Workflow Two (DEG Analysis):](Tutorial_3_DEG_Analysis.ipynb) Using DESeq2 and R to conduct clustering and differential gene expression analysis.


![RNA-Seq workflow](images/RNA-Seq_Notebook_Homepage.png)