# RNA-Seq Analysis Training Demo (Snakemake)

## Overview

This short tutorial demonstrates how to run an RNA-Seq workflow using a prokaryotic data set. Steps in the workflow include read trimming, read QC, read mapping, and counting mapped reads per gene to quantitate gene expression. This tutorial uses a popular workflow manager called 'snakemake'. More information on snakemake can be found <a href="https://snakemake.readthedocs.io/en/stable/">here</a>. Running the code in this tutorial will take approximately 12 minutes.

![RNA-Seq workflow](images/rnaseq-workflow.png)

## Learning Objectives

* **Install necessary bioinformatics tools:** Learn to install and manage bioinformatics software using Mamba.
* **Understand Snakemake configuration:** Examine Snakemake configuration files (YAML) and understand their role in defining the workflow.
* **Run a Snakemake workflow:** Execute a pre-written Snakemake workflow for RNA-Seq analysis, encompassing read trimming, quality control, mapping, and read counting, and learn to interpret and use a Snakefile.
* **Interpret RNA-Seq results:** Access and interpret the output of the Snakemake workflow, including identifying highly expressed genes and examining the expression levels of specific genes of interest, using command-line tools such as `sort`, `head`, and `grep`.
* **Navigate a modular workflow:** See how the analysis is split across related notebooks (workflows) covering different aspects of RNA-Seq analysis (e.g., extended tutorials, differential gene expression analysis using R), which keeps the workflows organized and scalable.

## Prerequisites

This Jupyter Notebook performs RNA-Seq analysis using Snakemake, relying on several external tools and data sources. Here's a breakdown of the prerequisites:

**APIs:**

* **gsutil:**  The Google Cloud Storage (GCS) tool `gsutil` is used extensively to download data from a Google Cloud Storage bucket.  This implicitly requires the Google Cloud Storage API to be enabled.

**Software and Dependencies:**

* **Snakemake:** This workflow manager orchestrates the entire RNA-Seq pipeline.
* **Bioconda Channels:** The notebook uses bioconda channels to install bioinformatic tools.
* **Other tools (installed via bioconda):** The Snakemake workflow uses multiple bioinformatic tools (like Trimmomatic and Salmon) that will be installed through the specified mamba environments. The notebook doesn't explicitly list all the tools used within the Snakemake workflow. This is specified within the `envs/*.yaml` files.
* **Sufficient disk space:** The workflow requires sufficient space to store the downloaded data, intermediate files, and results. The amount of space will depend on the size of the datasets processed.
* **Multiple CPU cores:** The notebook uses multiple cores (`--cores $CORES`) to speed up the analysis with snakemake. The number of cores utilized is determined by `nproc`.


## Get Started

**If you already completed Tutorial 1, you should be able to skip to STEP 6: Copy config files for Snakemake.**

### STEP 1: Install the tools

Download and install Miniforge, an installer that provides the conda and mamba package managers.


In [1]:
!curl -L -O https://github.com/conda-forge/miniforge/releases/latest/download/Miniforge3-$(uname)-$(uname -m).sh 
!bash Miniforge3-$(uname)-$(uname -m).sh -b -u -p $HOME/miniforge 

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
  0     0   0     0   0     0     0     0  --:--:-- --:--:-- --:--:--     0
  0     0   0     0   0     0     0     0  --:--:-- --:--:-- --:--:--     0
100 83761k 100 83761k   0     0 133.4M     0  --:--:-- --:--:-- --:--:-- 133.4M
PREFIX=/home/jupyter/miniforge
Unpacking bootstrapper...
ln: failed to create symbolic link '/home/jupyter/miniforge/_conda': File exists


Show the system where to find the miniforge:

In [2]:
import os
os.environ["PATH"] += os.pathsep + os.environ["HOME"]+"/miniforge/bin" 

Using mamba, install the necessary tools for this analysis

In [3]:
! mamba install -y -c conda-forge -c bioconda trimmomatic fastqc multiqc salmon

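As an optional sanity check after the install, the short Python sketch below looks up each tool on the `PATH` using `shutil.which`. The helper name `find_tools` is hypothetical, and the executable names are assumed to match the package names passed to `mamba install`.

```python
import shutil

def find_tools(names):
    """Map each executable name to its full path, or None if it is not on PATH."""
    return {name: shutil.which(name) for name in names}

# Executable names are ASSUMED to match the packages installed above.
for tool, path in find_tools(["trimmomatic", "fastqc", "multiqc", "salmon"]).items():
    print(f"{tool}: {path or 'NOT FOUND'}")
```

Any tool reported as `NOT FOUND` suggests that the install cell or the `PATH` cell should be rerun.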

### STEP 2: Create directories that will be used in our analysis

In [4]:
! echo $PWD
! mkdir -p data
! mkdir -p data/raw_fastq
! mkdir -p data/raw_fastqSub
! mkdir -p data/trimmed
! mkdir -p data/reference
! mkdir -p data/fastqc
! mkdir -p envs

/home/jupyter/rnaseq-myco-notebook


In [5]:
numthreads=!nproc
numthreadsint = int(numthreads[0])
%env CORES = $numthreadsint
#!echo ${CORES}

env: CORES=4
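The cell above shells out to `nproc`. A pure-Python alternative, sketched here on the assumption that all CPUs reported by the operating system are available to the notebook, is `os.cpu_count()`. The variable name `cores` is illustrative.

```python
import os

# os.cpu_count() can return None on unusual platforms, so fall back to 1.
cores = os.cpu_count() or 1

# Export the value so that shell commands started from the notebook see $CORES.
os.environ["CORES"] = str(cores)
print(f"CORES={os.environ['CORES']}")
```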


### STEP 3: Copy FASTQ Files
To keep this tutorial quick to run, we will analyze only 50,000 reads from one sample in each of the two sample groups instead of all the reads from all six samples. These files have been posted on a Google Storage Bucket that we made publicly accessible.


In [6]:
! gsutil -m cp -r gs://nigms-sandbox/me-inbre-rnaseq-pipelinev2/data/raw_fastqSub/*.fastq data/raw_fastq

Copying gs://nigms-sandbox/me-inbre-rnaseq-pipelinev2/data/raw_fastqSub/SRR13349122_1.fastq...
Copying gs://nigms-sandbox/me-inbre-rnaseq-pipelinev2/data/raw_fastqSub/SRR13349122_2.fastq...
Copying gs://nigms-sandbox/me-inbre-rnaseq-pipelinev2/data/raw_fastqSub/SRR13349123_2.fastq...
Copying gs://nigms-sandbox/me-inbre-rnaseq-pipelinev2/data/raw_fastqSub/SRR13349128_2.fastq...
Copying gs://nigms-sandbox/me-inbre-rnaseq-pipelinev2/data/raw_fastqSub/SRR13349123_1.fastq...
Copying gs://nigms-sandbox/me-inbre-rnaseq-pipelinev2/data/raw_fastqSub/SRR13349128_1.fastq...
- [6/6 files][ 47.0 MiB/ 47.0 MiB] 100% Done                                    
Operation completed over 6 objects/47.0 MiB.                                     
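To confirm that the download produced the expected subsampled files, the read count of each FASTQ file can be checked. The sketch below assumes uncompressed FASTQ with exactly four lines per record; the helper name `count_fastq_reads` is hypothetical.

```python
from pathlib import Path

def count_fastq_reads(path):
    """Count reads in an uncompressed FASTQ file, assuming four lines per record."""
    with open(path) as handle:
        n_lines = sum(1 for _ in handle)
    if n_lines % 4 != 0:
        raise ValueError(f"{path}: line count {n_lines} is not a multiple of 4")
    return n_lines // 4

# Report the read count for every FASTQ file copied in this step.
for fastq in sorted(Path("data/raw_fastq").glob("*.fastq")):
    print(fastq.name, count_fastq_reads(fastq))
```

Both mates of a pair (`_1` and `_2`) should report the same number of reads.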


### STEP 4: Copy reference transcriptome files that will be used by Salmon
Salmon is a tool that aligns RNA-Seq reads to a set of transcripts rather than the entire genome.

In [7]:
! gsutil -m cp -r gs://nigms-sandbox/me-inbre-rnaseq-pipelinev2/data/reference/M_chelonae_transcripts.fasta data/reference/M_chelonae_transcripts.fasta
! gsutil -m cp -r gs://nigms-sandbox/me-inbre-rnaseq-pipelinev2/data/reference/decoys.txt data/reference/decoys.txt

Copying gs://nigms-sandbox/me-inbre-rnaseq-pipelinev2/data/reference/M_chelonae_transcripts.fasta...
/ [1/1 files][  9.4 MiB/  9.4 MiB] 100% Done                                    
Operation completed over 1 objects/9.4 MiB.                                      
Copying gs://nigms-sandbox/me-inbre-rnaseq-pipelinev2/data/reference/decoys.txt...
/ [1/1 files][   14.0 B/   14.0 B] 100% Done                                    
Operation completed over 1 objects/14.0 B.                                       
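A quick way to see how many transcripts Salmon will index is to count the header lines in the reference FASTA. This is a minimal sketch that assumes a plain-text FASTA in which every record starts with `>`; the helper name `count_fasta_records` is hypothetical.

```python
def count_fasta_records(path):
    """Count sequences in a FASTA file by counting header lines that start with '>'."""
    with open(path) as handle:
        return sum(1 for line in handle if line.startswith(">"))

reference = "data/reference/M_chelonae_transcripts.fasta"
try:
    print(f"{reference}: {count_fasta_records(reference)} transcripts")
except FileNotFoundError:
    print(f"{reference} not found; run the download cell first")
```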


### STEP 5: Copy data file for Trimmomatic

In [8]:
! gsutil -m cp -r gs://nigms-sandbox/me-inbre-rnaseq-pipelinev2/config/TruSeq3-PE.fa .

Copying gs://nigms-sandbox/me-inbre-rnaseq-pipelinev2/config/TruSeq3-PE.fa...
/ [1/1 files][   95.0 B/   95.0 B] 100% Done                                    
Operation completed over 1 objects/95.0 B.                                       


Next, download the configuration files for the Snakemake workflow. The data files to be analyzed were copied in the previous steps.

### STEP 6: Copy config files for Snakemake


In [9]:
# Copy the config file, the snakefile, and the environment definitions
! gsutil cp gs://nigms-sandbox/me-inbre-rnaseq-pipelinev2/config.yaml .
! gsutil cp gs://nigms-sandbox/me-inbre-rnaseq-pipelinev2/snakefile .
! gsutil -m cp gs://nigms-sandbox/me-inbre-rnaseq-pipelinev2/envs/*.yaml envs/

Copying gs://nigms-sandbox/me-inbre-rnaseq-pipelinev2/config.yaml...
/ [1 files][   67.0 B/   67.0 B]                                                
Operation completed over 1 objects/67.0 B.                                       
Copying gs://nigms-sandbox/me-inbre-rnaseq-pipelinev2/snakefile...
/ [1 files][  3.4 KiB/  3.4 KiB]                                                
Operation completed over 1 objects/3.4 KiB.                                      
Copying gs://nigms-sandbox/me-inbre-rnaseq-pipelinev2/envs/bwa.yaml...
Copying gs://nigms-sandbox/me-inbre-rnaseq-pipelinev2/envs/fastqc.yaml...       
Copying gs://nigms-sandbox/me-inbre-rnaseq-pipelinev2/envs/fastqc_old.yaml...   
Copying gs://nigms-sandbox/me-inbre-rnaseq-pipelinev2/envs/multiqc.yaml...      
Copying gs://nigms-sandbox/me-inbre-rnaseq-pipelinev2/envs/salmon.yaml...       
Copying gs://nigms-sandbox/me-inbre-rnaseq-pipelinev2/envs/sra-tools.yaml...    
Copying gs://nigms-sandbox/me-inbre-rnaseq-pipelinev2/envs/sam

#### Explanation of config files

Snakemake manages workflows with configuration files in YAML format, together with a 'snakefile' that describes the workflow.

Below is a brief example of some of the yaml config files:

In [10]:
! printf "The config.yaml file contains our sample names:\n\n Config.yaml\n"
! cat config.yaml
! printf "\n\nThe env folder contains information pertaining to packages to be used in the environment, \nas well as their version, for example, here is the 'envs/fastqc.yaml' file:\n\n Fastqc.Yaml\n"
! cat envs/fastqc.yaml

The config.yaml file contains our sample names:

 Config.yaml
samples:
    SRR13349122: SRR13349122
    SRR13349128: SRR13349128


The env folder contains information pertaining to packages to be used in the environment, 
as well as their version, for example, here is the 'envs/fastqc.yaml' file:

 Fastqc.Yaml
channels:
  - bioconda
  - conda-forge
  - defaults
dependencies:
  - fastqc ==0.11.9
  - multiqc ==1.12



### STEP 7: Install Snakemake

In [11]:
! conda create -c conda-forge -c bioconda -n snakemake snakemake -y;

Retrieving notices: done
Channels:
 - conda-forge
 - bioconda
Platform: linux-64
Collecting package metadata (repodata.json): done
Solving environment: done


    current version: 25.11.0
    latest version: 25.11.1

Please update conda by running

    $ conda update -n base -c conda-forge conda



## Package Plan ##

  environment location: /opt/conda/envs/snakemake

  added / updated specs:
    - snakemake


The following packages will be downloaded:

    package                    |            build
    ---------------------------|-----------------
    _libgcc_mutex-0.1          |      conda_forge           3 KB  conda-forge
    _openmp_mutex-4.5          |            2_gnu          23 KB  conda-forge
    amply-0.1.6                |     pyhd8ed1ab_1          21 KB  conda-forge
    annotated-types-0.7.0      |     pyhd8ed1ab_1          18 KB  conda-forge
    appdirs-1.4.4              |     pyhd8ed1ab_1          14 KB  conda-forge
    argparse-dataclass-2.0.0   |     pyhd8ed1ab_1   

### STEP 8: Run Snakemake on our snakefile

Aside from the .yaml config files, which contain information about software, dependencies, and versions, Snakemake uses a snakefile that describes the workflow itself.

This can be a powerful tool, as it allows one to operate and think in terms of workflows instead of individual steps.

Feel free to open the snakefile to look at it further. It is composed of 'rules' we have created.

Each rule declares its input files, its output files, and the command that turns one into the other. Snakemake works backward from the requested outputs to decide which rules need to run.

Snakefiles can take a while to get used to, but in the simplest terms: we provide .fastq files as input and, based on the rules in the snakefile, those files are run through the entire workflow of Tutorial 1.
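The idea of ordering steps by their dependencies can be illustrated with a toy example. The sketch below is not how Snakemake is implemented; it uses Python's standard `graphlib` module (Python 3.9 or later). The rule names are those Snakemake reports when this workflow runs, while the dependencies between them are assumptions made for illustration and are not read from the snakefile.

```python
from graphlib import TopologicalSorter

# Rule names are the ones Snakemake reports for this workflow; the
# dependencies between them are ASSUMED here, not read from the snakefile.
depends_on = {
    "trimmomatic_pe_fq": set(),
    "salmon_index": set(),
    "fastqc_trimmed": {"trimmomatic_pe_fq"},
    "multiqc_trimmed": {"fastqc_trimmed"},
    "salmon_quant_reads": {"trimmomatic_pe_fq", "salmon_index"},
    "all": {"multiqc_trimmed", "salmon_quant_reads"},
}

# static_order() yields each rule only after all of its prerequisites.
order = list(TopologicalSorter(depends_on).static_order())
print(" -> ".join(order))
```

Whatever order is printed, the final target `all` comes last, and trimming and indexing come before quantification.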


In [12]:
! ls data/raw_fastq

In [13]:
# Calling snakemake directly from conda to avoid "mamba activate snakemake" command
! /opt/conda/envs/snakemake/bin/snakemake --cores $CORES --forceall   

{'SRR13349122': 'SRR13349122', 'SRR13349128': 'SRR13349128'}
Assuming unrestricted shared filesystem usage.
Falling back to greedy scheduler because no default ILP solver is found (you have to install either coincbc or glpk).
host: king-hon350-2026
Building DAG of jobs...
Using shell: /usr/bin/bash
Provided cores: 4
Rules claiming more threads will be scaled down.
Conda environments: ignored
Job stats:
job                   count
------------------  -------
all                       1
fastqc_trimmed            1
multiqc_trimmed           1
salmon_index              1
salmon_quant_reads        2
trimmomatic_pe_fq         2
total                     8

Select jobs to execute...
Execute 1 jobs...

[Mon Jan  5 19:42:12 2026]
localrule trimmomatic_pe_fq:
    input: data/raw_fastq/SRR13349122_1.fastq, data/raw_fastq/SRR13349122_2.fastq
    output: data/trimmed/SRR13349122_trimmed_1.fastq, 
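Once the run finishes, a quick check that the expected Salmon output exists can save confusion in the next steps. This minimal sketch assumes the `data/quants/<sample>_quant/quant.sf` layout used by the `sort` and `grep` commands later in this notebook; the helper name `missing_quant_files` is hypothetical.

```python
from pathlib import Path

def missing_quant_files(samples, quant_dir="data/quants"):
    """Return the expected Salmon quant.sf paths that do not exist yet."""
    expected = [Path(quant_dir) / f"{sample}_quant" / "quant.sf" for sample in samples]
    return [str(path) for path in expected if not path.is_file()]

# Sample names as listed in config.yaml.
missing = missing_quant_files(["SRR13349122", "SRR13349128"])
print("all quant files present" if not missing else f"missing: {missing}")
```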

### STEP 9: Report the top 10 most highly expressed genes in the samples.

Top 10 most highly expressed genes in the wild-type sample. The level of expression is reported in the Transcripts Per Million (`TPM`) and number of reads (`NumReads`) fields:  
`Name    Length  EffectiveLength TPM     NumReads`


In [14]:
! sort -nrk 5,5 data/quants/SRR13349122_quant/quant.sf | head -10

BB28_RS20665	1293	1043.000	5447.071698	55.000
BB28_RS03075	1626	1376.000	3077.869008	41.000
BB28_RS07370	9255	9005.000	424.426717	37.000
BB28_RS20690	10377	10127.000	367.203150	36.000
BB28_RS19405	3948	3698.000	1005.588509	36.000
BB28_RS18585	1305	1055.000	2741.512828	28.000
BB28_RS20685	7731	7481.000	372.811085	27.000
BB28_RS19310	1194	944.000	2626.176789	24.000
BB28_RS22260	1836	1586.000	1497.991546	23.000
BB28_RS09805	1437	1187.000	2001.528725	23.000
sort: write failed: 'standard output': Broken pipe
sort: write error
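The command above sorts on the fifth column (`NumReads`), so the `TPM` column is not in descending order. TPM normalizes each read count by the transcript's effective length and by the total across all transcripts. The sketch below shows the standard TPM definition with toy numbers; it illustrates the formula and is not Salmon's own code.

```python
def tpm(num_reads, effective_lengths):
    """Compute TPM from per-transcript read counts and effective lengths."""
    # Reads per base of effective length for each transcript.
    rates = [n / length for n, length in zip(num_reads, effective_lengths)]
    total = sum(rates)
    # Scale so that the values sum to one million.
    return [rate / total * 1_000_000 for rate in rates]

# Toy values (not taken from the quant.sf files above).
values = tpm(num_reads=[10, 20, 30], effective_lengths=[100.0, 100.0, 300.0])
print([round(v, 1) for v in values])
```

In this toy example the third transcript has three times the reads of the first but is also three times as long, so both receive the same TPM.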


Top 10 most highly expressed genes in the double lysogen sample.


In [15]:
! sort -nrk 5,5 data/quants/SRR13349128_quant/quant.sf | head -10

BB28_RS20665	1293	1043.000	16640.051824	107.000
BB28_RS18585	1305	1055.000	5381.096618	35.000
BB28_RS13330	2790	2540.000	2107.343957	33.000
BB28_RS14905	972	722.000	6065.711819	27.000
BB28_RS22260	1836	1586.000	1943.146845	19.000
BB28_RS18315	1767	1517.000	2031.529926	19.000
BB28_RS20685	7731	7481.000	368.590781	17.000
BB28_RS20690	10377	10127.000	240.251247	15.000
BB28_RS03075	1626	1376.000	1768.186333	15.000
BB28_RS11085	1278	1028.000	2208.971570	14.000
sort: write failed: 'standard output': Broken pipe
sort: write error
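The same ranking can be done in Python, which also makes it easy to rank by `TPM` instead of `NumReads`. This sketch assumes `quant.sf` is tab-delimited with a header row containing the field names listed above; the helper names `top_genes` and `print_top` are hypothetical. Reading the table this way also avoids the `Broken pipe` messages, which appear to arise because `head` exits while `sort` is still writing.

```python
import csv

def top_genes(lines, n=10, key="NumReads"):
    """Return the n rows of a Salmon quant.sf table with the largest value in `key`."""
    rows = list(csv.DictReader(lines, delimiter="\t"))
    return sorted(rows, key=lambda row: float(row[key]), reverse=True)[:n]

def print_top(path, n=10, key="NumReads"):
    """Print name, TPM, and read count for the top n transcripts in a quant.sf file."""
    # newline="" is the mode the csv module recommends for file input.
    with open(path, newline="") as handle:
        for row in top_genes(handle, n, key):
            print(row["Name"], row["TPM"], row["NumReads"], sep="\t")

# Example call (path as used in the cells above):
# print_top("data/quants/SRR13349128_quant/quant.sf", key="TPM")
```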


### STEP 10: Report the expression of a putative acyl-ACP desaturase (BB28_RS16545) that was downregulated in the double lysogen relative to wild-type
This gene was reported to be downregulated in the double lysogen, as shown in the table of the top 20 upregulated and downregulated genes from the paper describing the study.
![Top 20 upregulated and downregulated genes](images/table-cushman.png)

Use `grep` to report the expression in the wild-type sample. The fields in the Salmon `quant.sf` file are as follows. The level of expression is reported in the Transcripts Per Million (`TPM`) and number of reads (`NumReads`) fields:  
`Name    Length  EffectiveLength TPM     NumReads`

In [16]:
! grep 'BB28_RS16545' data/quants/SRR13349122_quant/quant.sf

BB28_RS16545	987	737.000	560.631139	4.000


Use `grep` to report the expression in the double lysogen sample. The fields in the Salmon `quant.sf` file are as follows. The level of expression is reported in the Transcripts Per Million (`TPM`) and number of reads (`NumReads`) fields:  
`Name    Length  EffectiveLength TPM     NumReads`

In [17]:
! grep 'BB28_RS16545' data/quants/SRR13349128_quant/quant.sf

BB28_RS16545	987	737.000	220.083619	1.000
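The two values can be compared as a log2 ratio of TPM. This is only a rough illustration: it uses one subsampled library per condition and very low read counts (4 versus 1), so it is not a statistical test. Differential expression is covered properly in the DEG analysis workflow.

```python
import math

# TPM values for BB28_RS16545 copied from the two grep outputs above.
tpm_wild_type = 560.631139
tpm_double_lysogen = 220.083619

# A negative value means lower expression in the double lysogen.
log2_fold_change = math.log2(tpm_double_lysogen / tpm_wild_type)
print(f"log2(double lysogen / wild-type) = {log2_fold_change:.2f}")
```

The result is negative, which is consistent with the downregulation in the double lysogen reported in the study.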


## <a name="workflow">Additional Workflows</a>

Now that you have read counts per gene, feel free to explore the R workflow, which creates plots and analyses from these read count files, or try the alternative workflows for creating read count files, such as the standard short or extended tutorials.


[Workflow One:](Tutorial_1.ipynb) A short introduction to downloading and mapping sequences to a transcriptome using Trimmomatic and Salmon. Here is a link to the YouTube video demonstrating the tutorial: <https://youtu.be/ChGfBR4do_Y>.

[Workflow One (Extended):](Tutorial_1B_Extended.ipynb) An extended version of workflow one. Once you have got your feet wet, you can retry workflow one with this extended version that covers the entire dataset, and includes elaboration such as using SRA tools for sequence downloading, and examples of running batches of fastq files through the pipeline. This workflow may take around an hour to run.

[Workflow One (Using Snakemake):](Tutorial_2_Snakemake.ipynb) Using snakemake to run workflow one.

[Workflow Two (DEG Analysis):](Tutorial_3_DEG_Analysis.ipynb) Using DESeq2 and R to conduct clustering and differential gene expression analysis.


![RNA-Seq workflow](images/RNA-Seq_Notebook_Homepage.png)

## Conclusion

This notebook demonstrated RNA-Seq data analysis using Snakemake, a workflow management system. It processed a subsampled prokaryotic RNA-Seq dataset, covering read trimming with Trimmomatic, quality control with FastQC and MultiQC, and read mapping and quantification of gene expression with Salmon. Snakemake ran the entire pipeline from a single command, which makes the analysis easier to reproduce and manage. The resulting per-gene read counts can be explored directly, as in the `sort` and `grep` examples above, or carried into downstream analyses such as the differential gene expression workflow (Tutorial 3). The extended version of workflow one covers the entire dataset.

## Clean Up

Remember to move to the next notebook or shut down your instance if you are finished.