# RNA-Seq Analysis Training Demo on Azure

## Overview

This short tutorial demonstrates how to run an RNA-Seq workflow using a prokaryotic data set. Steps in the workflow include read trimming, read QC, read mapping, and counting mapped reads per gene to quantify gene expression.

### STEP 1: Setup Environment

Note that within Jupyter you can run a bash command either by putting the `!` magic in front of your command or by adding `%%bash` to the top of your cell.

For example
```
%%bash
example command
```
Or
```
!example command
```
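
Outside a notebook neither form is available. As a rough sketch only, a plain Python script could run the same kind of command with the standard library's `subprocess` module; `echo hello` is a placeholder command chosen for illustration.

```python
import subprocess

# Run a command and capture its output, roughly what `!` does in Jupyter.
# `echo hello` is a placeholder, not a command from this tutorial.
result = subprocess.run(["echo", "hello"], capture_output=True, text=True, check=True)
print(result.stdout.strip())
```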

The first step is to install Mambaforge, a minimal conda-forge installer that bundles mamba, a faster reimplementation of the conda package manager.

In [1]:
!curl -L -O https://github.com/conda-forge/miniforge/releases/latest/download/Mambaforge-$(uname)-$(uname -m).sh
!bash Mambaforge-$(uname)-$(uname -m).sh -b -p $HOME/mambaforge

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0
  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0
100 82.9M  100 82.9M    0     0   115M      0 --:--:-- --:--:-- --:--:--  198M
ERROR: File or directory already exists: '/home/azureuser/mambaforge'
If you want to update an existing installation, use the -u option.


In [2]:
# Add the mambaforge bin directory to your PATH
import os
os.environ["PATH"] += os.pathsep + os.environ["HOME"]+"/mambaforge/bin"
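
To confirm that the PATH change took effect, one option is to ask Python where it would find an executable. This is a small sketch using only the standard library; `find_tool` is a hypothetical helper name, not part of mamba or Jupyter.

```python
import shutil

def find_tool(name):
    """Return the full path of an executable found on PATH, or None if it is missing."""
    return shutil.which(name)

# Prints the location of mamba if the PATH update worked, otherwise None.
print(find_tool("mamba"))
```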

In [3]:
! mamba info --envs


                  __    __    __    __
                 /  \  /  \  /  \  /  \
                /    \/    \/    \/    \
███████████████/  /██/  /██/  /██/  /████████████████████████
              /  / \   / \   / \   / \  \____
             /  /   \_/   \_/   \_/   \    o \__,
            / _/                       \_____/  `
            |/
        ███╗   ███╗ █████╗ ███╗   ███╗██████╗  █████╗
        ████╗ ████║██╔══██╗████╗ ████║██╔══██╗██╔══██╗
        ██╔████╔██║███████║██╔████╔██║██████╔╝███████║
        ██║╚██╔╝██║██╔══██║██║╚██╔╝██║██╔══██╗██╔══██║
        ██║ ╚═╝ ██║██║  ██║██║ ╚═╝ ██║██████╔╝██║  ██║
        ╚═╝     ╚═╝╚═╝  ╚═╝╚═╝     ╚═╝╚═════╝ ╚═╝  ╚═╝

        mamba (1.1.0) supported by @QuantStack

        GitHub:  https://github.com/mamba-org/mamba
        Twitter: https://twitter.com/QuantStack

█████████████████████████████████████████████████████████████

# conda environments:
#
                         /anaconda
base                     /home/azureuser/mambaforge



Next, we will install the necessary packages into the current environment.

In [17]:
! mamba install -c conda-forge -c bioconda -c defaults -y sra-tools  pigz pbzip2 fastp fastqc multiqc salmon


Looking for: ['sra-tools', 'pigz=2.6', 'pbzip2=1.1', 'fastp=0.23.2', 'fastqc=0.11.9', 'multiqc', 'salmon=1.5.1']


Create a set of directories to store the reads, reference sequence files, and output files.


In [33]:
%%bash
mkdir -p data
mkdir -p data/raw_fastq
mkdir -p data/trimmed
mkdir -p data/fastqc
mkdir -p data/aligned
mkdir -p data/reference
mkdir -p data/quants
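
If you would rather stay in Python, the same layout can be created with `pathlib`. This is a sketch, not part of the original notebook; `SUBDIRS` and `make_dirs` are illustrative names.

```python
from pathlib import Path

# Same subdirectories as the mkdir commands above.
SUBDIRS = ["raw_fastq", "trimmed", "fastqc", "aligned", "reference", "quants"]

def make_dirs(base="data"):
    """Create base/<subdir> for every entry in SUBDIRS and return the paths."""
    paths = [Path(base) / name for name in SUBDIRS]
    for path in paths:
        # parents=True and exist_ok=True mirror the behaviour of `mkdir -p`.
        path.mkdir(parents=True, exist_ok=True)
    return paths
```

Calling `make_dirs()` with no arguments creates the directories under `data` in the current working directory.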

### STEP 2: Copy FASTQ Files
To keep this tutorial quick to run, we will analyze only 50,000 reads from one sample in each of the two sample groups, rather than all the reads from all six samples. These files have been posted in an Azure Blob Storage container that we made publicly accessible.

In [6]:
!curl https://storeshare.blob.core.windows.net/publicdata/testsample/RNAseq/raw_fastq/SRR13349122_1.fastq --output data/raw_fastq/SRR13349122_1.fastq
!curl https://storeshare.blob.core.windows.net/publicdata/testsample/RNAseq/raw_fastq/SRR13349122_2.fastq --output data/raw_fastq/SRR13349122_2.fastq
!curl https://storeshare.blob.core.windows.net/publicdata/testsample/RNAseq/raw_fastq/SRR13349128_1.fastq --output data/raw_fastq/SRR13349128_1.fastq
!curl https://storeshare.blob.core.windows.net/publicdata/testsample/RNAseq/raw_fastq/SRR13349128_2.fastq --output data/raw_fastq/SRR13349128_2.fastq

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 8452k  100 8452k    0     0  10.4M      0 --:--:-- --:--:-- --:--:-- 10.4M
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 8452k  100 8452k    0     0  9328k      0 --:--:-- --:--:-- --:--:-- 9319k
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 8452k  100 8452k    0     0  11.1M      0 --:--:-- --:--:-- --:--:-- 11.1M
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 8452k  100 8452k    0     0  12.7M      0 --:--:-- --:--:-- --:--:-- 12.7M
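
A quick sanity check after downloading is to count the reads in each file. The sketch below assumes uncompressed FASTQ with exactly four lines per record (no wrapped sequence lines); `count_fastq_reads` is a hypothetical helper written for this tutorial.

```python
def count_fastq_reads(path):
    """Count records in an uncompressed FASTQ file, assuming 4 lines per record."""
    with open(path) as handle:
        n_lines = sum(1 for _ in handle)
    if n_lines % 4 != 0:
        raise ValueError(f"{path}: {n_lines} lines is not a multiple of 4")
    return n_lines // 4
```

For a complete download, `count_fastq_reads("data/raw_fastq/SRR13349122_1.fastq")` should return 50000, matching the subset size described above.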


### STEP 3: Copy reference transcriptome files that will be used by Salmon
Salmon is a tool that aligns RNA-Seq reads to a set of transcripts rather than the entire genome.

In [27]:
!curl https://storeshare.blob.core.windows.net/publicdata/testsample/RNAseq/reference/M_chelonae_transcripts.fasta --output data/reference/M_chelonae_transcripts.fasta
!curl https://storeshare.blob.core.windows.net/publicdata/testsample/RNAseq/reference/decoys.txt --output data/reference/decoys.txt

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 9599k  100 9599k    0     0  12.3M      0 --:--:-- --:--:-- --:--:-- 12.3M
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100    14  100    14    0     0     76      0 --:--:-- --:--:-- --:--:--    76


In [38]:
ls data/raw_fastq

[0m[01;32mSRR13349122_1.fastq[0m*  [01;32mSRR13349128_1.fastq[0m*
[01;32mSRR13349122_2.fastq[0m*  [01;32mSRR13349128_2.fastq[0m*


### STEP 4: Trim our data with Fastp

In [39]:
! fastp -i data/raw_fastq/SRR13349122_1.fastq -I data/raw_fastq/SRR13349122_2.fastq -o data/trimmed/SRR13349122_1_trimmed.fastq -O data/trimmed/SRR13349122_2_trimmed.fastq
! fastp -i data/raw_fastq/SRR13349128_1.fastq -I data/raw_fastq/SRR13349128_2.fastq -o data/trimmed/SRR13349128_1_trimmed.fastq -O data/trimmed/SRR13349128_2_trimmed.fastq

Read1 before filtering:
total reads: 50000
total bases: 2550000
Q20 bases: 2451900(96.1529%)
Q30 bases: 2370275(92.952%)

Read2 before filtering:
total reads: 50000
total bases: 2550000
Q20 bases: 2376817(93.2085%)
Q30 bases: 2255260(88.4416%)

Read1 after filtering:
total reads: 49849
total bases: 2542226
Q20 bases: 2444408(96.1523%)
Q30 bases: 2363088(92.9535%)

Read2 after filtering:
total reads: 49849
total bases: 2542226
Q20 bases: 2374927(93.4192%)
Q30 bases: 2253977(88.6616%)

Filtering result:
reads passed filter: 99698
reads failed due to low quality: 246
reads failed due to too many N: 56
reads failed due to too short: 0
reads with adapter trimmed: 18
bases trimmed due to adapters: 146

Duplication rate: 23.57%

Insert size peak (evaluated by paired-end reads): 33

JSON report: fastp.json
HTML report: fastp.html

fastp -i data/raw_fastq/SRR13349122_1.fastq -I data/raw_fastq/SRR13349122_2.fastq -o data/trimmed/SRR13349122_1_trimmed.fastq -O data/trimmed/SRR13349122_2_trimmed.f
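
To put the filtering numbers in perspective, the short calculation below works out the fraction of reads that passed. The counts are copied from the "Filtering result" block above; note that fastp counts individual reads there, so the total is 2 × 50,000 for the paired files. The variable names are illustrative.

```python
# Numbers copied from the "Filtering result" block of the fastp output above.
filtering = {
    "passed": 99698,
    "low_quality": 246,
    "too_many_N": 56,
    "too_short": 0,
}

total = sum(filtering.values())          # all reads across both mates
pass_rate = filtering["passed"] / total  # fraction of reads kept after filtering

print(f"total reads: {total}")
print(f"pass rate: {pass_rate:.3%}")
```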

### STEP 6: Run FastQC
FastQC is an invaluable tool that allows you to evaluate whether there are problems with a set of reads. For example, it will provide a report of whether there is any bias in the sequence composition of the reads.

Once FastQC is done running, look at the outputs in data/fastqc. What can you say about the quality of the two samples we are looking at here? 

In [15]:
%%bash
fastqc -o data/fastqc data/trimmed/SRR13349122_1_trimmed.fastq
fastqc -o data/fastqc data/trimmed/SRR13349128_1_trimmed.fastq

Started analysis of SRR13349122_1_trimmed.fastq
Approx 5% complete for SRR13349122_1_trimmed.fastq
Approx 10% complete for SRR13349122_1_trimmed.fastq
Approx 15% complete for SRR13349122_1_trimmed.fastq
Approx 20% complete for SRR13349122_1_trimmed.fastq
Approx 25% complete for SRR13349122_1_trimmed.fastq
Approx 30% complete for SRR13349122_1_trimmed.fastq
Approx 35% complete for SRR13349122_1_trimmed.fastq
Approx 40% complete for SRR13349122_1_trimmed.fastq
Approx 45% complete for SRR13349122_1_trimmed.fastq
Approx 50% complete for SRR13349122_1_trimmed.fastq
Approx 55% complete for SRR13349122_1_trimmed.fastq
Approx 60% complete for SRR13349122_1_trimmed.fastq
Approx 65% complete for SRR13349122_1_trimmed.fastq
Approx 70% complete for SRR13349122_1_trimmed.fastq
Approx 75% complete for SRR13349122_1_trimmed.fastq
Approx 80% complete for SRR13349122_1_trimmed.fastq
Approx 85% complete for SRR13349122_1_trimmed.fastq
Approx 90% complete for SRR13349122_1_trimmed.fastq
Approx 95% comple

Analysis complete for SRR13349122_1_trimmed.fastq
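
Besides the HTML report, each FastQC zip archive contains a `summary.txt` file. The sketch below assumes that file has three tab-separated columns (status, module name, file name) and turns it into a dictionary; `parse_fastqc_summary` is a hypothetical helper, and the example text is made up rather than taken from the tutorial data.

```python
def parse_fastqc_summary(text):
    """Map each FastQC module name to its status (PASS, WARN or FAIL)."""
    statuses = {}
    for line in text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        status, module, _filename = line.split("\t")
        statuses[module] = status
    return statuses

# Made-up example content, for illustration only.
example = (
    "PASS\tBasic Statistics\tsample.fastq\n"
    "WARN\tPer base sequence content\tsample.fastq\n"
)
print(parse_fastqc_summary(example))
```

To apply it to real output, read `summary.txt` from the zip archive in `data/fastqc`, for example with the standard `zipfile` module.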


### STEP 7: Run MultiQC
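MultiQC reads in the FastQC reports and generates a compiled report for all the analyzed FASTQ files.
Just as with FastQC, we can look at the MultiQC results once it finishes. With the command below they are written to `multiqc_data` in the current working directory; uncomment the `mv` line to move that directory to `data/multiqc_data`.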

In [25]:
! multiqc -f data/fastqc
#! mv multiqc_data/ data/

[INFO   ]         multiqc : This is MultiQC v1.10.1
[INFO   ]         multiqc : Template    : default
[INFO   ]         multiqc : Searching   : /mnt/batch/tasks/shared/LS_root/mounts/clusters/cloud-lab-notebooks/code/Users/oconnellka/NIHCloudLabAzure-main 2/tutorials/notebooks/rnaseq-myco-tutorial-main/data/fastqc
Searching   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% 2/2  data/fastqc/SRR13349122_1_trimmed_fastqc.html
[ERROR  ]         multiqc : Oops! The 'custom_content' MultiQC module broke... 
  Please copy the following traceback and report it at https://github.com/ewels/MultiQC/issues 
  If possible, please include a log file that triggers the error - the last file found was:
    None
Module custom_content raised an exception: Traceback (most recent call last):
  File "/home/azureuser/mambaforge/lib/python3.10/site-packages/multiqc/multiqc.py", line 594, in run
    output = mod()
  File

### STEP 8: Index the Transcriptome so that Trimmed Reads Can Be Mapped Using Salmon

In [28]:
! salmon index -t data/reference/M_chelonae_transcripts.fasta -p 8 -i data/reference/transcriptome_index --decoys data/reference/decoys.txt -k 31 --keepDuplicates

Version Info: ### PLEASE UPGRADE SALMON ###
### A newer version of salmon with important bug fixes and improvements is available. ####
###
The newest version, available at https://github.com/COMBINE-lab/salmon/releases
contains new features, improvements, and bug fixes; please upgrade at your
earliest convenience.
###
Sign up for the salmon mailing list to hear about new versions, features and updates at:
https://oceangenomics.com/subscribe
###index ["data/reference/transcriptome_index"] did not previously exist  . . . creating it
[2023-04-26 13:54:40.001] [jLog] [info] building index
out : data/reference/transcriptome_index
[2023-04-26 13:54:40.023] [puff::index::jointLog] [info] Running fixFasta

[Step 1 of 4] : counting k-mers

[2023-04-26 13:54:40.454] [puff::index::jointLog] [info] Replaced 0 non-ATCG nucleotides
[2023-04-26 13:54:40.454] [puff::index::jointLog] [info] Clipped poly-A tails from 0 transcripts
wrote 4868 cleaned references
[20

### STEP 9: Run Salmon to Map Reads to Transcripts and Quantify Expression Levels
Salmon aligns the trimmed reads to the reference transcriptome and generates the read counts per transcript. In this analysis, each gene has a single transcript.

In [40]:
%%bash
salmon quant -i data/reference/transcriptome_index -l SR -r data/trimmed/SRR13349122_1_trimmed.fastq -p 8 --validateMappings -o data/quants/SRR13349122_quant
salmon quant -i data/reference/transcriptome_index -l SR -r data/trimmed/SRR13349128_1_trimmed.fastq -p 8 --validateMappings -o data/quants/SRR13349128_quant

Version Info: ### PLEASE UPGRADE SALMON ###
### A newer version of salmon with important bug fixes and improvements is available. ####
###
The newest version, available at https://github.com/COMBINE-lab/salmon/releases
contains new features, improvements, and bug fixes; please upgrade at your
earliest convenience.
###
Sign up for the salmon mailing list to hear about new versions, features and updates at:
https://oceangenomics.com/subscribe
###### salmon (selective-alignment-based) v1.5.1
### [ program ] => salmon 
### [ command ] => quant 
### [ index ] => { data/reference/transcriptome_index }
### [ libType ] => { SR }
### [ unmatedReads ] => { data/trimmed/SRR13349122_1_trimmed.fastq }
### [ threads ] => { 8 }
### [ validateMappings ] => { }
### [ output ] => { data/quants/SRR13349122_quant }
Logs will be written to data/quants/SRR13349122_quant/logs
[2023-04-26 14:00:23.857] [jointLog] [info] setting maxHashResizeThreads to 8
[2023-04-26 14:00:23.857] [jointLog] [info] Fragment inc

In [41]:
ls data/quants/

[0m[34;42mSRR13349122_quant[0m/  [34;42mSRR13349128_quant[0m/


### STEP 10: Report the top 10 most highly expressed genes in the samples

Top 10 most highly expressed genes in the wild-type sample.


In [42]:
! sort -nrk 4,4 data/quants/SRR13349122_quant/quant.sf | head -10

BB28_RS23830	213	10.625	48612.291220	5.000
BB28_RS02220	204	9.377	33047.563397	3.000
BB28_RS05530	180	6.996	29531.286140	2.000
BB28_RS18945	222	12.150	25504.663975	3.000
BB28_RS11370	195	8.348	24748.475090	2.000
BB28_RS12480	207	9.766	21154.305555	2.000
BB28_RS18745	300	51.326	20125.718383	10.000
BB28_RS20695	231	14.032	14723.212476	2.000
BB28_RS19155	282	36.744	14056.208165	5.000
BB28_RS18020	189	7.759	13312.711241	1.000
sort: write failed: 'standard output': Broken pipe
sort: write error
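
The "Broken pipe" messages are harmless: `head` exits after ten lines while `sort` is still writing. If you prefer to avoid them, the same ranking can be done in Python. The sketch below assumes a tab-separated table with the `quant.sf` header (`Name`, `Length`, `EffectiveLength`, `TPM`, `NumReads`); `top_expressed` is a hypothetical helper and the example table is made up.

```python
import csv
import io

def top_expressed(handle, n=10):
    """Return (Name, TPM) for the n rows of a quant.sf-style table with the highest TPM."""
    rows = list(csv.DictReader(handle, delimiter="\t"))
    rows.sort(key=lambda row: float(row["TPM"]), reverse=True)
    return [(row["Name"], float(row["TPM"])) for row in rows[:n]]

# Tiny made-up table with the same header as quant.sf, for illustration only.
example = io.StringIO(
    "Name\tLength\tEffectiveLength\tTPM\tNumReads\n"
    "geneA\t300\t51.3\t10.0\t2\n"
    "geneB\t200\t9.0\t250.5\t5\n"
    "geneC\t250\t20.0\t99.0\t3\n"
)
print(top_expressed(example, n=2))
```

On the real data, pass an open file instead, for example `open("data/quants/SRR13349122_quant/quant.sf")`.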


Top 10 most highly expressed genes in the double lysogen sample.


In [43]:
!sort -nrk 4,4 data/quants/SRR13349128_quant/quant.sf | head -10

BB28_RS18025	177	6.769	47953.929601	2.000
BB28_RS02220	204	9.377	34613.921846	2.000
BB28_RS13585	243	17.264	28200.832626	3.000
BB28_RS01170	225	12.734	25489.885138	2.000
BB28_RS20695	231	14.032	23131.574929	2.000
BB28_RS19045	183	7.236	22428.250651	1.000
BB28_RS04995	192	8.045	20173.388438	1.000
BB28_RS14885	195	8.348	19441.110656	1.000
BB28_RS18745	300	51.326	18971.657043	6.000
BB28_RS23535	201	9.012	18007.533576	1.000
sort: write failed: 'standard output': Broken pipe
sort: write error


### STEP 11: Report the expression of a putative acyl-ACP desaturase (BB28_RS16545) that was downregulated in the double lysogen relative to wild-type
This gene was reported to be downregulated in the double lysogen, as shown in the table of the top 20 upregulated and downregulated genes in the paper describing the study.

Use `grep` to report the expression in the wild-type sample. The fields in the Salmon `quant.sf` file are as follows. The level of expression is reported in the Transcripts Per Million (`TPM`) and number of reads (`NumReads`) fields:  
`Name    Length  EffectiveLength TPM     NumReads`

In [44]:
!grep 'BB28_RS16545' data/quants/SRR13349122_quant/quant.sf

BB28_RS16545	987	737.000	560.631139	4.000
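
To see how `TPM` relates to `NumReads` and `EffectiveLength`, the sketch below applies the standard TPM normalization: divide each count by its effective length, then rescale so the values sum to one million. Salmon's own estimates come from its statistical model, so this only illustrates the normalization step; the `tpm` function and the three-transcript input are made up for the example.

```python
def tpm(num_reads, effective_lengths):
    """Compute TPM values from read counts and effective lengths."""
    # Reads per base of effective length for each transcript.
    rates = [reads / length for reads, length in zip(num_reads, effective_lengths)]
    total = sum(rates)
    # Rescale so that all transcripts together sum to one million.
    return [rate / total * 1e6 for rate in rates]

# Made-up three-transcript example, not tutorial data.
values = tpm([4, 1, 5], [737.0, 737.0, 100.0])
print([round(v, 1) for v in values])
```

Because TPM depends on every other transcript in the sample, a gene's TPM can differ between samples even when its read count per base is unchanged.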


Use `grep` to report the expression in the double lysogen sample. The `quant.sf` fields are the same as for the wild-type sample:  
`Name    Length  EffectiveLength TPM     NumReads`

In [45]:
!grep 'BB28_RS16545' data/quants/SRR13349128_quant/quant.sf

BB28_RS16545	987	737.000	220.201284	1.000
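
The two TPM values can be compared as a fold change. The short calculation below uses the numbers copied from the two `grep` outputs above. Keep in mind that they rest on only 4 reads and 1 read in this 50,000-read subset, so the result is illustrative rather than statistically meaningful.

```python
import math

# TPM values copied from the two grep outputs above.
tpm_wild_type = 560.631139
tpm_double_lysogen = 220.201284

fold_change = tpm_double_lysogen / tpm_wild_type
log2_fold_change = math.log2(fold_change)

print(f"fold change (double lysogen / wild-type): {fold_change:.2f}")
print(f"log2 fold change: {log2_fold_change:.2f}")
```

A fold change below 1 (negative log2 value) is consistent with the gene being downregulated in the double lysogen.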


### That's it! 