# RNA-Seq Analysis Training Demo (Snakemake)

## Overview

This short tutorial demonstrates how to run an RNA-Seq workflow using a prokaryotic data set. Steps in the workflow include read trimming, read QC, read mapping, and counting mapped reads per gene to quantitate gene expression. This tutorial uses a popular workflow manager called 'snakemake'. More information on snakemake can be found <a href="https://snakemake.readthedocs.io/en/stable/">here</a>. Running the code in this tutorial will take approximately 12 minutes.

![RNA-Seq workflow](images/rnaseq-workflow.png)

### STEP 1: Install mambaforge and snakemake
First, install mambaforge.

We will use it to install snakemake and to create the snakemake environment.

In [None]:
!curl -L -O https://github.com/conda-forge/miniforge/releases/latest/download/Mambaforge-$(uname)-$(uname -m).sh
!bash Mambaforge-$(uname)-$(uname -m).sh -b -p $HOME/mambaforge

In [2]:
#add to your path
import os
os.environ["PATH"] += os.pathsep + os.environ["HOME"]+"/mambaforge/bin"
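
Before moving on, it can help to confirm that the PATH change took effect. The following is a minimal sketch using only the Python standard library; the helper name `find_tool` is hypothetical and not part of the tutorial.

```python
import shutil


def find_tool(name):
    """Return the full path of an executable found on PATH, or None if absent."""
    # shutil.which reads os.environ["PATH"], so it sees the change made above.
    return shutil.which(name)


path = find_tool("mamba")
print(f"mamba: {path if path else 'not found on PATH'}")
```

The same check can be repeated with `find_tool("snakemake")` once the install step has finished.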

In [3]:
! mamba install -y -c conda-forge -c bioconda snakemake


                  __    __    __    __
                 /  \  /  \  /  \  /  \
                /    \/    \/    \/    \
███████████████/  /██/  /██/  /██/  /████████████████████████
              /  / \   / \   / \   / \  \____
             /  /   \_/   \_/   \_/   \    o \__,
            / _/                       \_____/  `
            |/
        ███╗   ███╗ █████╗ ███╗   ███╗██████╗  █████╗
        ████╗ ████║██╔══██╗████╗ ████║██╔══██╗██╔══██╗
        ██╔████╔██║███████║██╔████╔██║██████╔╝███████║
        ██║╚██╔╝██║██╔══██║██║╚██╔╝██║██╔══██╗██╔══██║
        ██║ ╚═╝ ██║██║  ██║██║ ╚═╝ ██║██████╔╝██║  ██║
        ╚═╝     ╚═╝╚═╝  ╚═╝╚═╝     ╚═╝╚═════╝ ╚═╝  ╚═╝

        mamba (1.4.2) supported by @QuantStack

        GitHub:  https://github.com/mamba-org/mamba
        Twitter: https://twitter.com/QuantStack

█████████████████████████████████████████████████████████████


Looking for: ['snakemake']


**If you already completed Tutorial 1, you should be able to skip to STEP 6: Copy config files for Snakemake**

### STEP 2: Create directories that will be used in our analysis

In [4]:
!echo $PWD
!mkdir -p data
!mkdir -p data/raw_fastq
!mkdir -p data/trimmed
!mkdir -p data/reference
!mkdir -p data/fastqc
!mkdir -p envs

/home/jupyter/Untitled Folder


### STEP 3: Copy FASTQ Files
In order for this tutorial to run quickly, we will only analyze 50,000 reads from one sample in each of the two sample groups instead of analyzing all the reads from all six samples. These files have been posted on a Google Storage Bucket that we made publicly accessible.


In [8]:
! gsutil -m cp -r gs://rnaseq-myco-bucket/truncated-reads/* data/raw_fastq

Copying gs://rnaseq-myco-bucket/truncated-reads/SRR13349128_2.fastq...
Copying gs://rnaseq-myco-bucket/truncated-reads/SRR13349128_1.fastq...          
Copying gs://rnaseq-myco-bucket/truncated-reads/SRR13349122_2.fastq...
Copying gs://rnaseq-myco-bucket/truncated-reads/SRR13349122_1.fastq...          
- [4/5 files][ 33.0 MiB/ 33.0 MiB]  99% Done                                    
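
To check that the copied files are the truncated ones, the reads in each file can be counted and compared with the 50,000 reads mentioned above. This is a minimal sketch, assuming uncompressed FASTQ files with exactly four lines per record (no wrapped sequence lines); the function name `count_fastq_records` is hypothetical.

```python
import os


def count_fastq_records(lines):
    """Count FASTQ records in an iterable of lines, assuming 4 lines per record."""
    # Skip blank lines so a trailing empty line does not break the count.
    n_lines = sum(1 for line in lines if line.strip())
    if n_lines % 4 != 0:
        raise ValueError(f"{n_lines} non-empty lines is not a multiple of 4")
    return n_lines // 4


fastq_dir = "data/raw_fastq"
# Only runs if the directory from STEP 2 exists in the working directory.
if os.path.isdir(fastq_dir):
    for name in sorted(os.listdir(fastq_dir)):
        if name.endswith(".fastq"):
            with open(os.path.join(fastq_dir, name)) as handle:
                print(name, count_fastq_records(handle))
```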

### STEP 4: Copy reference transcriptome files that will be used by Salmon
Salmon is a tool that aligns RNA-Seq reads to a set of transcripts rather than the entire genome.

In [9]:
! gsutil -m cp -r gs://rnaseq-myco-bucket/reference/M_chelonae_transcripts.fasta data/reference/M_chelonae_transcripts.fasta
! gsutil -m cp -r gs://rnaseq-myco-bucket/reference/decoys.txt data/reference/decoys.txt

Copying gs://rnaseq-myco-bucket/reference/M_chelonae_transcripts.fasta...
/ [1/1 files][  9.4 MiB/  9.4 MiB] 100% Done                                    
Operation completed over 1 objects/9.4 MiB.                                      
Copying gs://rnaseq-myco-bucket/reference/decoys.txt...
/ [1/1 files][   14.0 B/   14.0 B] 100% Done                                    
Operation completed over 1 objects/14.0 B.                                       


### STEP 5: Copy data file for Trimmomatic

In [10]:
! gsutil -m cp -r gs://rnaseq-myco-bucket/reference/TruSeq3-PE.fa .

Copying gs://rnaseq-myco-bucket/reference/TruSeq3-PE.fa...
/ [1/1 files][   95.0 B/   95.0 B] 100% Done                                    
Operation completed over 1 objects/95.0 B.                                       


### STEP 6: Copy config files for Snakemake

Next, download the config files for our snakemake environment: `config.yaml`, the `snakefile`, and the environment files in `envs/`.


In [12]:
# Copy config files (envs/ may already exist from STEP 2, so use -p)
! mkdir -p envs
! gsutil cp  gs://rnaseq-myco-bucket/snakemake/config.yaml .
! gsutil cp  gs://rnaseq-myco-bucket/snakemake/snakefile .
! gsutil -m cp  gs://rnaseq-myco-bucket/snakemake/envs/*.yaml envs/

Copying gs://rnaseq-myco-bucket/snakemake/config.yaml...
/ [1 files][   67.0 B/   67.0 B]                                                
Operation completed over 1 objects/67.0 B.                                       
Copying gs://rnaseq-myco-bucket/snakemake/snakefile...
/ [1 files][  3.4 KiB/  3.4 KiB]                                                
Operation completed over 1 objects/3.4 KiB.                                      
Copying gs://rnaseq-myco-bucket/snakemake/envs/trimmomatic.yaml...
Copying gs://rnaseq-myco-bucket/snakemake/envs/fastqc.yaml...                   
Copying gs://rnaseq-myco-bucket/snakemake/envs/bwa.yaml...                      
Copying gs://rnaseq-myco-bucket/snakemake/envs/fastqc_old.yaml...
Copying gs://rnaseq-myco-bucket/snakemake/envs/samtools.yaml...                 
Copying gs://rnaseq-myco-bucket/snakemake/envs/sra-tools.yaml...                
Copying gs://rnaseq-myco-bucket/snakemake/envs/trinity.yaml...                  
Copying gs://rnaseq-myco

#### Explanation of config files

Snakemake manages workflows using config files in the form of 'yaml' files, together with a 'snakefile'.

Below is a brief example of some of the yaml config files:

In [13]:
!printf "The config.yaml file contains our sample names:\n\n Config.yaml\n"
!cat config.yaml
!printf "\n\nThe env folder contains information pertaining to packages to be used in the environment, \nas well as their version, for example, here is the 'envs/fastqc.yaml' file:\n\n Fastqc.Yaml\n"
!cat envs/fastqc.yaml

The config.yaml file contains our sample names:

 Config.yaml
samples:
    SRR13349122: SRR13349122
    SRR13349128: SRR13349128


The env folder contains information pertaining to packages to be used in the environment, 
as well as their version, for example, here is the 'envs/fastqc.yaml' file:

 Fastqc.Yaml
channels:
  - bioconda
  - conda-forge
  - defaults
dependencies:
  - fastqc ==0.11.9
  - multiqc ==1.12
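
To make the role of `config.yaml` more concrete, here is a minimal sketch of how the sample names could be turned into the paired-end FASTQ paths the workflow reads. The dictionary copies the two entries shown in `config.yaml`, and the `_1`/`_2` naming follows the files copied in STEP 3. The helper `paired_fastq_paths` is hypothetical, and the actual snakefile may build its paths differently.

```python
import os

# Same content as the `samples` section of config.yaml shown above.
samples = {"SRR13349122": "SRR13349122", "SRR13349128": "SRR13349128"}


def paired_fastq_paths(sample, fastq_dir="data/raw_fastq"):
    """Return the (mate 1, mate 2) FASTQ paths expected for one sample."""
    return tuple(f"{fastq_dir}/{sample}_{mate}.fastq" for mate in (1, 2))


for sample in samples:
    for path in paired_fastq_paths(sample):
        status = "found" if os.path.exists(path) else "missing"
        print(f"{path}: {status}")
```

Running this before starting the workflow gives a quick view of whether every sample named in the config has both of its read files in place.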



### STEP 7: Run snakemake on our snakefile

Aside from the .yaml config files, which contain information about software, dependencies, and versions, snakemake uses a snakefile, which contains information about a workflow.

This can be a powerful tool as it allows one to operate and think in terms of workflows instead of individual steps. 

Feel free to open the snakefile to look at it further. It is composed of 'rules' we have created.

Snakemake decides what to run from the files each rule declares: a rule lists its input files and the output files it produces, and a rule runs when its output is needed.

Snakefiles can take a while to get used to, but in the simplest terms, here we provide .fastq files as input, and the rules in our snakefile run those files through the entire workflow of Tutorial 1.
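
The idea of rules connected through their files can be illustrated with a toy example in plain Python. This is not Snakemake's implementation and not the contents of the tutorial's snakefile: the rule names, the trimmed file name, and the single `{sample}` placeholder are all assumptions made for illustration. Only the `quant.sf` output path matches the one used later in this tutorial.

```python
# Hypothetical, simplified rules: one output pattern and a list of input patterns.
RULES = [
    {
        "name": "trim",
        "output": "data/trimmed/{sample}_trimmed.fastq",
        "input": ["data/raw_fastq/{sample}_1.fastq", "data/raw_fastq/{sample}_2.fastq"],
    },
    {
        "name": "quant",
        "output": "data/quants/{sample}_quant/quant.sf",
        "input": ["data/trimmed/{sample}_trimmed.fastq"],
    },
]


def match_rule(target):
    """Return (rule, sample) for the rule whose output pattern fits target."""
    for rule in RULES:
        prefix, suffix = rule["output"].split("{sample}")
        sample = target[len(prefix):len(target) - len(suffix)]
        if (target.startswith(prefix) and target.endswith(suffix)
                and sample and "/" not in sample):
            return rule, sample
    return None, None


def plan(target, steps=None):
    """List (rule name, output file) pairs needed to build target, in run order."""
    steps = [] if steps is None else steps
    rule, sample = match_rule(target)
    if rule is None:
        return steps  # No rule makes this file, so treat it as existing input data.
    for pattern in rule["input"]:
        plan(pattern.format(sample=sample), steps)
    steps.append((rule["name"], target))
    return steps


for name, output in plan("data/quants/SRR13349122_quant/quant.sf"):
    print(f"{name} -> {output}")
```

Asking for the final `quant.sf` file is enough to pull in the trimming step, because the planner works backward from the requested output to the inputs that produce it.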


In [15]:
ls data/raw_fastq

SRR13349122_1.fastq  SRR13349128_1.fastq
SRR13349122_2.fastq  SRR13349128_2.fastq


In [16]:
! conda config --set channel_priority strict
! snakemake --cores --use-conda --forceall

{'SRR13349122': 'SRR13349122', 'SRR13349128': 'SRR13349128'}
Building DAG of jobs...
Creating conda environment envs/fastqc.yaml...
Downloading and installing remote packages.
Environment for /home/jupyter/Untitled Folder/envs/fastqc.yaml created (location: .snakemake/conda/4c2df237580278961cb95ee28aa5c3d9_)
Creating conda environment envs/multiqc.yaml...
Downloading and installing remote packages.
Environment for /home/jupyter/Untitled Folder/envs/multiqc.yaml created (location: .snakemake/conda/16880ab43432da359713b046c214761f_)
Creating conda environment envs/trimmomatic.yaml...
Downloading and installing remote packages.
Environment for /home/jupyter/Untitled Folder/envs/trimmomatic.yaml created (location: .snakemake/conda/93a920384b661a8590cde71ba73fbfc4_)
Creating conda environment envs/salmon.yaml...
Downloading and installing remote packages.
Environment for /home/ju

### STEP 8: Report the top 10 most highly expressed genes in the samples.

Top 10 most highly expressed genes in the wild-type sample, ranked by number of reads. The level of expression is reported in the Transcripts Per Million (`TPM`) and number of reads (`NumReads`) fields:  
`Name    Length  EffectiveLength TPM     NumReads`


In [17]:
!sort -nrk 5,5 data/quants/SRR13349122_quant/quant.sf | head -10

BB28_RS20665	1293	1044.000	5646.382017	55.000
BB28_RS03075	1626	1377.000	3191.229092	41.000
BB28_RS07370	9255	9006.000	440.329564	37.000
BB28_RS20690	10377	10128.000	380.966573	36.000
BB28_RS19405	3948	3699.000	1043.100692	36.000
BB28_RS18585	1305	1056.000	2841.856734	28.000
BB28_RS20685	7731	7482.000	386.771196	27.000
BB28_RS19310	1194	945.000	2721.996112	24.000
BB28_RS22260	1836	1587.000	1553.312992	23.000
BB28_RS09805	1437	1188.000	2075.006502	23.000
sort: write failed: 'standard output': Broken pipe
sort: write error


Top 10 most highly expressed genes in the double lysogen sample.


In [18]:
!sort -nrk 5,5 data/quants/SRR13349128_quant/quant.sf | head -10

BB28_RS20665	1293	1044.000	17163.199992	107.000
BB28_RS18585	1305	1056.000	5550.333897	35.000
BB28_RS13330	2790	2541.000	2174.824699	33.000
BB28_RS14905	972	723.000	6253.749084	27.000
BB28_RS22260	1836	1587.000	2004.895111	19.000
BB28_RS18315	1767	1518.000	2096.026708	19.000
BB28_RS20685	7731	7482.000	380.492585	17.000
BB28_RS20690	10377	10128.000	248.017626	15.000
BB28_RS03075	1626	1377.000	1824.199372	15.000
BB28_RS11085	1278	1029.000	2278.387793	14.000
sort: write failed: 'standard output': Broken pipe
sort: write error
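
The `sort` commands above rank genes by the fifth column, `NumReads`. As a minimal sketch, the same ranking can be done in Python, which also makes it easy to rank by `TPM` instead and avoids the "Broken pipe" message from `sort`. The function names `read_quant` and `top_expressed` are hypothetical; the code assumes `quant.sf` is tab-separated with the header row shown above. Ties may be ordered differently than by `sort`.

```python
import csv
import os


def read_quant(path):
    """Read a Salmon quant.sf file into a list of dicts keyed by column name."""
    with open(path, newline="") as handle:
        return list(csv.DictReader(handle, delimiter="\t"))


def top_expressed(rows, n=10, column="NumReads"):
    """Return the n rows with the largest numeric value in the given column."""
    return sorted(rows, key=lambda row: float(row[column]), reverse=True)[:n]


quant_path = "data/quants/SRR13349122_quant/quant.sf"
# Only runs once the snakemake workflow has produced the file.
if os.path.exists(quant_path):
    for row in top_expressed(read_quant(quant_path), n=10, column="TPM"):
        print(row["Name"], row["TPM"], row["NumReads"], sep="\t")
```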


### STEP 9: Report the expression of a putative acyl-ACP desaturase (BB28_RS16545) that was downregulated in the double lysogen relative to wild-type
This gene was reported to be downregulated in the double lysogen, as shown in the table of the top 20 upregulated and downregulated genes from the paper describing the study.
![RNA-Seq workflow](images/table-cushman.png)

Use `grep` to report the expression in the wild-type sample. The level of expression is reported in the Transcripts Per Million (`TPM`) and number of reads (`NumReads`) fields of the Salmon `quant.sf` file, whose fields are as follows:  
`Name    Length  EffectiveLength TPM     NumReads`

In [19]:
!grep 'BB28_RS16545' data/quants/SRR13349122_quant/quant.sf

BB28_RS16545	987	738.000	580.913806	4.000


Use `grep` to report the expression in the double lysogen sample. The level of expression is reported in the Transcripts Per Million (`TPM`) and number of reads (`NumReads`) fields of the Salmon `quant.sf` file, whose fields are as follows:  
`Name    Length  EffectiveLength TPM     NumReads`

In [20]:
!grep 'BB28_RS16545' data/quants/SRR13349128_quant/quant.sf

BB28_RS16545	987	738.000	226.912606	1.000
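
The two `TPM` values can be compared directly with a log2 fold change. This is a rough sketch using the numbers printed above; the function name `log2_fold_change` is hypothetical. With one truncated sample per group and only 4 and 1 reads for this gene, the result is illustrative only and is not a statistical test.

```python
import math


def log2_fold_change(treatment, control):
    """Return log2(treatment / control); negative means lower in treatment."""
    return math.log2(treatment / control)


# TPM values for BB28_RS16545 copied from the grep output above.
wild_type_tpm = 580.913806
double_lysogen_tpm = 226.912606

lfc = log2_fold_change(double_lysogen_tpm, wild_type_tpm)
print(f"log2 fold change (double lysogen vs wild-type): {lfc:.2f}")
```

A negative value is consistent with the gene being expressed at a lower level in the double lysogen sample.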


## <a name="workflow">Additional Workflows</a>

Now that you have read counts per gene, feel free to explore the R workflow, which creates plots and analyses from these read count files, or try the other workflows for creating read count files, such as the standard short or extended tutorials.


[Workflow One:](Tutorial_1.ipynb) A short introduction to downloading and mapping sequences to a transcriptome using Trimmomatic and Salmon. Here is a link to the YouTube video demonstrating the tutorial: <https://youtu.be/ChGfBR4do_Y>.

[Workflow One (Extended):](Tutorial_1B_Extended.ipynb) An extended version of workflow one. Once you have got your feet wet, you can retry workflow one with this extended version that covers the entire dataset, and includes elaboration such as using SRA tools for sequence downloading, and examples of running batches of fastq files through the pipeline. This workflow may take around an hour to run.

[Workflow One (Using Snakemake):](Tutorial_2_Snakemake.ipynb) Using snakemake to run workflow one.

[Workflow Two (DEG Analysis):](Tutorial_3_DEG_Analysis.ipynb) Using DESeq2 and R to conduct clustering and differential gene expression analysis.


![RNA-Seq workflow](images/RNA-Seq_Notebook_Homepage.png)