# Tutorial 1b - Extended Bulk RNA-Seq Analysis for Mouse

## Overview

<div class="alert alert-block alert-danger"> <b>WARNING</b>: Full fastq files can be rather large, and so the downloading, extracting, and analysis of them means this tutorial can take almost <u>13 hours</u> to run the code fully using an <b>ml.m5.4xlarge</b> instance. </div>

This extended tutorial demonstrates how to run an RNA-Seq workflow using a full *Mus musculus* dataset. Steps in the workflow include read trimming, quality control, read mapping, and counting mapped reads per gene to quantitate gene expression. This tutorial analyzes data published in association with [Mittenbühler MJ et al. 2023](https://pubmed.ncbi.nlm.nih.gov/36681077/).

All outputs used in [Tutorial 2](https://github.com/King-Laboratory/scRNASeq-miRNASeq-and-TF-Network-Analysis/blob/bda75860ace82cf180a6f9eae115ebaf2eabc5f9/Bulk_RNA-Seq_Tutorials/Bulk_RNA-Seq_Mouse/Tutorial_2_DEG_mouse.ipynb) for DEG analysis were created using this extended full dataset tutorial workflow.

![Mouse Bulk RNA-seq workflow](../../images/Mouse_workflow.png)

## Learning Objectives

- Explore an example Bulk RNA-sequencing dataset
- Understand the workflow of generating read counts, including:
    - Accessing SRA metadata
    - Quality control
    - Adapter trimming
    - Read mapping
    - Counting mapped reads
    - Quantifying gene expression levels
- Report expression of the top 10 highly expressed genes
- Combine read count files and store in AWS S3 bucket

## STEP 1: Getting Started

<div class="alert alert-block alert-warning"> NOTE: This Jupyter Notebook was developed to run within a customized container on AWS with all software and tools pre-configured. If running without this customized container, you will need to install tools using the Miniforge environment setup instructions below before moving on to Step 2.</div>

### Without Container: Install Miniforge and Workflow Tools

Miniforge is a lightweight Conda distribution that offers a streamlined installation process and efficient package management. It provides access to a vast repository of packages.

The following code performs these steps:
- Downloads Miniforge or Mambaforge (you can use either based on preference)
- Installs Miniforge (or Mambaforge) - no need to install conda since mamba will be available immediately
- Using miniforge and bioconda, installs the tools that will be used in this tutorial

<div class="alert alert-block alert-info">Tip: If using the Miniforge install, run the following code cells by removing the %%script false command. </div>

In [1]:
%%script false --no-raise-error
# Download Miniforge or Mambaforge (you can use either based on preference)
!curl -L -O https://github.com/conda-forge/miniforge/releases/latest/download/Miniforge3-$(uname)-$(uname -m).sh

# Install Miniforge (or Mambaforge) - no need to install conda since mamba will be available immediately
!bash Miniforge3-$(uname)-$(uname -m).sh -b -u -p $HOME/miniforge > /dev/null
!date +"%T"

Next, using mambaforge and bioconda, install the tools that will be used in this tutorial.

In [3]:
#%%script false --no-raise-error
# Update PATH to point to the Miniforge (or Mambaforge) bin files
import os
os.environ["PATH"] += os.pathsep + os.environ["HOME"]+"/miniforge/bin"

# Now we can use the 'mamba' command to install software (duplicate package names and the repeated -y flag removed)
!mamba install -y -c conda-forge -c bioconda trimmomatic fastqc multiqc sql-magic entrez-direct gffread parallel-fastq-dump sra-tools pyathena samtools star rsem subread pigz > /dev/null



---------------------------------------
## If running from a container, as noted above, start with <b> STEP 2 </b> below:
## STEP 2: Define Threads & Setup Directory Structures

Specify the number of available threads based on the VM. This value is used later by tools such as Trimmomatic and STAR.

In [4]:
import multiprocessing
import os

num_cores = multiprocessing.cpu_count()
THREADS = max(1, num_cores - 1)

print("Number of threads:", THREADS)
os.environ["THREADS"] = str(THREADS)

Number of threads: 15
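As an optional refinement, the sketch below counts only the CPUs the current process is allowed to use, which inside a container can be fewer than `multiprocessing.cpu_count()` reports. The helper name `usable_threads` and its `reserve` parameter are illustrative assumptions and are not part of the original notebook.

```python
import os


def usable_threads(reserve: int = 1) -> int:
    """Return a worker-thread count, leaving `reserve` CPUs free."""
    if hasattr(os, "sched_getaffinity"):
        # Linux only: the set of CPUs this process may actually run on
        n_cpus = len(os.sched_getaffinity(0))
    else:
        # Fallback for platforms without sched_getaffinity
        n_cpus = os.cpu_count() or 1
    return max(1, n_cpus - reserve)


THREADS = usable_threads()
print("Number of threads:", THREADS)
```

As in the cell above, the value would still need to be exported with `os.environ["THREADS"] = str(THREADS)` for the shell commands to see it.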


Create a set of directories under `data/` in the working directory to store the reads, reference sequence files, and output files.

In [12]:
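# Print the notebook's working directory; all paths below are relative to it.
# (A `!cd` command runs in its own subshell and would not change it; use `%cd` for that.)
!echo $PWD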
!mkdir -p data
!mkdir -p data/raw_fastq
!mkdir -p data/trimmed
!mkdir -p data/fastqc
!mkdir -p data/reference
!mkdir -p data/aligned_bam
!mkdir -p data/rsem_reference/mouse_rsem_reference
!mkdir -p data/rsem_output
!mkdir -p data/reference/STAR_index

/home/ec2-user/SageMaker/Bulk-and-Single-Cell-RNAseq/Bulk_RNA-Seq_Tutorials/Bulk_RNA-Seq_Mouse
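The same directory layout can also be created from Python. The following is a minimal sketch using `pathlib`; the helper name `make_dirs` is an assumption for illustration, and the demo runs in a temporary directory so that it can be tried without touching the tutorial's `data/` folder.

```python
import tempfile
from pathlib import Path

# Sub-directories used in this tutorial (same names as the mkdir commands above)
SUBDIRS = [
    "raw_fastq", "trimmed", "fastqc", "reference", "aligned_bam",
    "rsem_reference/mouse_rsem_reference", "rsem_output", "reference/STAR_index",
]


def make_dirs(base="data"):
    """Create every tutorial sub-directory under `base` and return the paths."""
    created = []
    for sub in SUBDIRS:
        path = Path(base) / sub
        path.mkdir(parents=True, exist_ok=True)  # same effect as `mkdir -p`
        created.append(path)
    return created


# Demo in a throwaway directory so the sketch can be run safely anywhere
with tempfile.TemporaryDirectory() as tmp:
    created = make_dirs(tmp)
    n_dirs = sum(p.is_dir() for p in created)
print(n_dirs, "directories created")
```

In the notebook itself, calling `make_dirs("data")` would reproduce the `mkdir -p` commands above.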


## STEP 3: Downloading relevant FASTQ files using the SRA Toolkit



Next we will need to download the relevant fastq files.

Because these files can be large, the process of downloading and extracting fastq files can be quite lengthy.

We will be downloading all samples from this project using the SRA Toolkit from the NCBI's SRA (Sequence Read Archive). However, first we need to find the associated accession numbers in order to download.

### STEP 3.1: Finding run accession numbers.


The SRA stores sequence data in terms of runs (SRR stands for Sequence Read Run). To download runs, we will need the accession ID for each run we wish to download.

The Mittenbühler MJ et al. project contains 8 runs. To make it easier, these are the run IDs associated with this project:

- SRR21972730
- SRR21972729
- SRR21972728
- SRR21972727
- SRR21972725
- SRR21972724
- SRR21972723
- SRR21972726

In this case, all these runs belong to the BioProject PRJNA892075. Sequence run experiments can be searched using the SRA database on the NCBI website, and article-specific sample run information can be found in the supplementary section of that article. For instance, here the authors posted a link to the GEO Series (GSE) accession for the sequence data, GSE164210. This leads to the appropriate 'Gene Expression Omnibus' page where, among other useful files and information, the relevant SRA database link can be found. Once the accession numbers are located, one can make a text file containing the list of accession IDs however they like. To make things easier, the command below uses Entrez Direct (`esearch` and `efetch`) to retrieve the run information for the BioProject and write the run accessions to accs.txt:

In [10]:
!esearch -db sra -query "PRJNA892075" | efetch -format runinfo | cut -d',' -f1 | tail -n +2 > accs.txt
!cat accs.txt

SRR21972730
SRR21972729
SRR21972728
SRR21972727
SRR21972725
SRR21972724
SRR21972723
SRR21972726
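The `cut -d',' -f1` step above relies on the run accession being the first column of the runinfo CSV. As a hedged alternative, the sketch below selects the `Run` column by name with Python's `csv` module. The function name `run_accessions` and the two-column example text are assumptions for illustration; real runinfo output has many more columns.

```python
import csv
import io


def run_accessions(runinfo_csv: str) -> list:
    """Return the values of the `Run` column from runinfo CSV text."""
    reader = csv.DictReader(io.StringIO(runinfo_csv))
    runs = []
    for row in reader:
        run = (row.get("Run") or "").strip()
        # Skip empty values and any repeated header line in the stream
        if run and run != "Run":
            runs.append(run)
    return runs


# Tiny illustrative input with only two columns (second spots value left blank)
example = "Run,spots\nSRR21972730,52663760\nSRR21972729,\n"
print(run_accessions(example))
```

In the notebook, the text could come from `efetch -format runinfo` captured into a Python string before being written to accs.txt.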


### STEP 3.2 Downloading multiple files using the SRA-toolkit.

The code uses prefetch to download multiple SRA files in parallel. It reads the list of SRR IDs from accs.txt and uses xargs to execute prefetch for each ID. The `-O` option sets the output directory, and `-f yes` forces a download even if a complete copy of the file already exists. To speed up the download, the `-P $THREADS` option lets xargs run up to `$THREADS` prefetch processes at the same time. Note that prefetch only downloads the SRA files; they are converted to FASTQ in the next step.

In [5]:
!cat accs.txt | xargs -P $THREADS -I {} prefetch {} -O data/raw_fastq -f yes

2025-05-13T14:17:26 prefetch.3.2.1: 1) Resolving 'SRR21972730'...
2025-05-13T14:17:26 prefetch.3.2.1: 1) Resolving 'SRR21972728'...
2025-05-13T14:17:26 prefetch.3.2.1: 1) Resolving 'SRR21972729'...
2025-05-13T14:17:26 prefetch.3.2.1: 1) Resolving 'SRR21972727'...
2025-05-13T14:17:26 prefetch.3.2.1: 1) Resolving 'SRR21972724'...
2025-05-13T14:17:26 prefetch.3.2.1: 1) Resolving 'SRR21972723'...
2025-05-13T14:17:26 prefetch.3.2.1: 1) Resolving 'SRR21972726'...
2025-05-13T14:17:26 prefetch.3.2.1: 1) Resolving 'SRR21972725'...
2025-05-13T14:17:26 prefetch.3.2.1: Current preference is set to retrieve SRA Normalized Format files with full base quality scores
2025-05-13T14:17:26 prefetch.3.2.1: Current preference is set to retrieve SRA Normalized Format files with full base quality scores
2025-05-13T14:17:26 prefetch.3.2.1: Current preference is set to retrieve SRA Normalized Format files with full base quality scores
2025-05-13T14:17:26 prefetch.3.2.1: Current preference is set to retrieve SR

### STEP 3.3 Converting Multiple SRA files to Fastq


In this step, the SRA files are converted to FASTQ using parallel-fastq-dump, which splits each run into blocks of spots and processes the blocks on multiple threads. Each SRR ID is read from accs.txt, and xargs executes parallel-fastq-dump for each ID. Because of `--split-files`, this creates two FASTQ files (forward and reverse reads) for each SRR ID, and `--gzip` compresses each of them to a .gz file to save space.

In [6]:
#process with parallel-fastq-dump using piping
!cat accs.txt | xargs -I {} parallel-fastq-dump -O data/raw_fastq/ --tmpdir . --threads $THREADS --gzip --split-files --sra-id {}

2025-05-13 14:22:27,572 - SRR ids: ['SRR21972730']
2025-05-13 14:22:27,572 - extra args: ['--gzip', '--split-files']
2025-05-13 14:22:27,573 - tempdir: ./pfd_gx2mz2h0
2025-05-13 14:22:27,573 - CMD: sra-stat --meta --quick SRR21972730
2025-05-13 14:22:28,023 - SRR21972730 spots: 52663760
2025-05-13 14:22:28,024 - blocks: [[1, 3510917], [3510918, 7021834], [7021835, 10532751], [10532752, 14043668], [14043669, 17554585], [17554586, 21065502], [21065503, 24576419], [24576420, 28087336], [28087337, 31598253], [31598254, 35109170], [35109171, 38620087], [38620088, 42131004], [42131005, 45641921], [45641922, 49152838], [49152839, 52663760]]
2025-05-13 14:22:28,024 - CMD: fastq-dump -N 1 -X 3510917 -O ./pfd_gx2mz2h0/0 --gzip --split-files SRR21972730
2025-05-13 14:22:28,024 - CMD: fastq-dump -N 3510918 -X 7021834 -O ./pfd_gx2mz2h0/1 --gzip --split-files SRR21972730
2025-05-13 14:22:28,024 - CMD: fastq-dump -N 7021835 -X 10532751 -O ./pfd_gx2mz2h0/2 --gzip --split-files SRR21972730
2025-05-13 1
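The `blocks` line in the log shows how the 52,663,760 spots of SRR21972730 are divided among 15 threads. The sketch below reproduces those boundaries with simple integer arithmetic. It is an illustration that matches the logged values, not a copy of the tool's source code, and the function name `split_blocks` is an assumption.

```python
def split_blocks(n_spots: int, n_pieces: int) -> list:
    """Split spots 1..n_spots into n_pieces contiguous [first, last] ranges."""
    size = n_spots // n_pieces          # spots per block (integer division)
    blocks = []
    first = 1
    for i in range(n_pieces):
        last = first + size - 1
        if i == n_pieces - 1:
            last = n_spots              # final block absorbs the remainder
        blocks.append([first, last])
        first = last + 1
    return blocks


blocks = split_blocks(52663760, 15)
print(blocks[0], blocks[-1])  # → [1, 3510917] [49152839, 52663760]
```

Each block is then passed to a separate `fastq-dump -N <first> -X <last>` call, as the `CMD` lines in the log show.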

Because `--gzip` was passed above, the FASTQ files are already compressed to `.fastq.gz`, which saves space.

The now-redundant SRA files downloaded by prefetch can also be deleted to save more space.

In [7]:
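# Find and delete all SRR subfolders in the raw_fastq directory.
# find may print "No such file or directory" when it tries to descend into a
# folder it has just deleted; these messages are harmless.
!find data/raw_fastq -type d -name 'SRR*' -exec rm -rf {} \;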

find: ‘data/raw_fastq/SRR21972723’: No such file or directory
find: ‘data/raw_fastq/SRR21972727’: No such file or directory
find: ‘data/raw_fastq/SRR21972726’: No such file or directory
find: ‘data/raw_fastq/SRR21972730’: No such file or directory
find: ‘data/raw_fastq/SRR21972725’: No such file or directory
find: ‘data/raw_fastq/SRR21972729’: No such file or directory
find: ‘data/raw_fastq/SRR21972724’: No such file or directory
find: ‘data/raw_fastq/SRR21972728’: No such file or directory
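Before moving on, it can be worth confirming that both FASTQ files exist for every accession. The following sketch lists any that are missing; the helper name `missing_fastq` is an assumption, and the demo uses a temporary directory with a single empty file rather than the real data.

```python
import tempfile
from pathlib import Path


def missing_fastq(accessions, fastq_dir):
    """List expected paired-end FASTQ files that are absent from fastq_dir."""
    missing = []
    for acc in accessions:
        for mate in (1, 2):
            path = Path(fastq_dir) / f"{acc}_{mate}.fastq.gz"
            if not path.is_file():
                missing.append(path.name)
    return missing


# Demo in a throwaway directory: only the forward file exists
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "SRR21972730_1.fastq.gz").touch()
    demo_missing = missing_fastq(["SRR21972730"], tmp)
print(demo_missing)  # → ['SRR21972730_2.fastq.gz']
```

In the notebook, `missing_fastq(open("accs.txt").read().split(), "data/raw_fastq")` should return an empty list if the conversion step completed for all runs.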


### STEP 3.4: Download reference genome and annotation files that will be used by STAR


This step downloads and prepares the reference data needed for your RNA-Seq analysis. It retrieves three essential files:

- **Mouse genome (Mus_musculus.GRCm39.dna.primary_assembly.fa.gz)**: This compressed FASTA file contains the mouse genome sequence (primary assembly), which will be used as the reference for aligning your RNA-seq reads.
- **Mouse gene annotations (Mus_musculus.GRCm39.104.gtf.gz)**: This compressed GTF file provides information about the genes and transcripts in the mouse genome, including their locations and structures. This data will be crucial for interpreting the aligned RNA-Seq reads and understanding which genes are expressed in each sample.
- **Mouse feature table (GCF_000001635.27_GRCm39_feature_table.txt.gz)**: This compressed table provides additional annotations for the mouse genome features, potentially including information about gene functions and pathways. This table is used later in the differential gene expression (DEG) analysis.

In [15]:
!wget ftp://ftp.ensembl.org/pub/release-104/fasta/mus_musculus/dna/Mus_musculus.GRCm39.dna.primary_assembly.fa.gz -O data/reference/mouse_genome.fa.gz
!wget ftp://ftp.ensembl.org/pub/release-104/gtf/mus_musculus/Mus_musculus.GRCm39.104.gtf.gz -O data/reference/mouse_annotation.gtf.gz
!wget -O data/reference/mouse_feature_table.txt.gz https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/001/635/GCF_000001635.27_GRCm39/GCF_000001635.27_GRCm39_feature_table.txt.gz

--2025-05-13 01:47:33--  ftp://ftp.ensembl.org/pub/release-104/fasta/mus_musculus/dna/Mus_musculus.GRCm39.dna.primary_assembly.fa.gz
           => ‘data/reference/mouse_genome.fa.gz’
Resolving ftp.ensembl.org (ftp.ensembl.org)... 193.62.193.169
Connecting to ftp.ensembl.org (ftp.ensembl.org)|193.62.193.169|:21... connected.
Logging in as anonymous ... Logged in!
==> SYST ... done.    ==> PWD ... done.
==> TYPE I ... done.  ==> CWD (1) /pub/release-104/fasta/mus_musculus/dna ... done.
==> SIZE Mus_musculus.GRCm39.dna.primary_assembly.fa.gz ... 806418890
==> PASV ... done.    ==> RETR Mus_musculus.GRCm39.dna.primary_assembly.fa.gz ... done.
Length: 806418890 (769M) (unauthoritative)


2025-05-13 01:47:55 (37.3 MB/s) - ‘data/reference/mouse_genome.fa.gz’ saved [806418890]

--2025-05-13 01:47:55--  ftp://ftp.ensembl.org/pub/release-104/gtf/mus_musculus/Mus_musculus.GRCm39.104.gtf.gz
           => ‘data/reference/mouse_annotation.gtf.gz’
Resolving ftp.ensembl.org (ftp.ensembl.org)... 193.62

In [16]:
!gunzip -f data/reference/mouse_genome.fa.gz 
!gunzip -f data/reference/mouse_annotation.gtf.gz
!gunzip -f data/reference/mouse_feature_table.txt.gz
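To make the structure of the annotation file more concrete: each GTF record has nine tab-separated columns, and the ninth holds `key "value";` attribute pairs such as `gene_id`. The sketch below parses a single record. The function name `parse_gtf_line` and the example record (including its identifiers and coordinates) are made up for illustration, and the simple split would not handle a semicolon inside a quoted value.

```python
def parse_gtf_line(line: str) -> dict:
    """Parse one GTF record into selected fixed fields plus an attribute dict."""
    cols = line.rstrip("\n").split("\t")
    if len(cols) != 9:
        raise ValueError("a GTF record has exactly 9 tab-separated columns")
    attrs = {}
    for item in cols[8].strip().split(";"):
        item = item.strip()
        if item:
            key, _, value = item.partition(" ")
            attrs[key] = value.strip('"')
    return {
        "seqname": cols[0], "feature": cols[2],
        "start": int(cols[3]), "end": int(cols[4]),
        "strand": cols[6], "attributes": attrs,
    }


# Made-up record for illustration only (identifiers and coordinates are not real)
example = '1\tensembl\tgene\t100\t900\t.\t+\t.\tgene_id "GENE0001"; gene_name "Demo1";'
rec = parse_gtf_line(example)
print(rec["feature"], rec["attributes"]["gene_id"])  # → gene GENE0001
```

Header lines in a real GTF file start with `#` and would need to be skipped before calling this function.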

### STEP 3.5: Copy data file for Trimmomatic

One of the functions of Trimmomatic is to trim adapter sequences unique to each sequencing platform. These adapter sequences are typically located within the trimmomatic installation directory in a folder called adapters.

Directories of packages within conda installations can be confusing, so in the case of using conda with Trimmomatic, it may be easier to simply download or create a file with the relevant adapter sequences and store it in an easy to find directory.

In [8]:
!aws s3 cp s3://nigms-sandbox/bulk-scRNAseq/reference/TruSeq3-PE.fa data/trimmed/

download: s3://nigms-sandbox/bulk-scRNAseq/reference/TruSeq3-PE.fa to data/trimmed/TruSeq3-PE.fa


## STEP 4: Run FastQC

FastQC is an invaluable tool that allows you to evaluate whether there are problems with a set of reads. For example, it will provide a report of whether there is any bias in the sequence composition of the reads.

The code below may take a while to run. To speed it up, `xargs -P $THREADS` runs several FastQC processes in parallel.

In [13]:
# Run fastqc for forward reads in parallel
!cat accs.txt | xargs -P $THREADS -I {} fastqc "data/raw_fastq/{}_1.fastq.gz" -o data/fastqc/

# Run fastqc for reverse reads in parallel
!cat accs.txt | xargs -P $THREADS -I {} fastqc "data/raw_fastq/{}_2.fastq.gz" -o data/fastqc/

application/gzip
application/gzip
application/gzip
application/gzip
application/gzip
application/gzip
application/gzip
application/gzip
Started analysis of SRR21972723_1.fastq.gz
Started analysis of SRR21972729_1.fastq.gz
Started analysis of SRR21972728_1.fastq.gz
Started analysis of SRR21972724_1.fastq.gz
Started analysis of SRR21972725_1.fastq.gz
Started analysis of SRR21972726_1.fastq.gz
Started analysis of SRR21972730_1.fastq.gz
Started analysis of SRR21972727_1.fastq.gz
Approx 5% complete for SRR21972725_1.fastq.gz
Approx 5% complete for SRR21972729_1.fastq.gz
Approx 5% complete for SRR21972724_1.fastq.gz
Approx 5% complete for SRR21972723_1.fastq.gz
Approx 5% complete for SRR21972726_1.fastq.gz
Approx 5% complete for SRR21972728_1.fastq.gz
Approx 5% complete for SRR21972727_1.fastq.gz
Approx 10% complete for SRR21972729_1.fastq.gz
Approx 10% complete for SRR21972725_1.fastq.gz
Approx 10% complete for SRR21972724_1.fastq.gz
Approx 10% complete for SRR21972723_1.fastq.gz
Approx 10%

FastQC will output the results in HTML format, as below, for all forward and reverse reads.

In [14]:
from IPython.display import IFrame
IFrame(src='data/fastqc/SRR21972724_1_fastqc.html', width=800, height=600)

Although it is best practice to look over the quality reports individually, tools like MultiQC allow one to quickly look at a combined summary of the quality reports of all fastq files.

For instance, the below table shows all warnings, passes, or failures, from each FastQC report. There are other summaries created as well by MultiQC.

In [15]:
!multiqc -f data/fastqc/

import pandas as pd
dframe = pd.read_csv("./multiqc_data/multiqc_fastqc.txt", sep='\t')
display(dframe)


/// MultiQC 🔍 v1.28

       file_search | Search path: /home/ec2-user/SageMaker/Bulk-and-Single-Cell-RNAseq/Bulk_RNA-Seq_Tutorials/Bulk_RNA-Seq_Mouse/data/fastqc
         searching | 100% 32/32
            fastqc | Found 16 reports
     write_results | Data        : multiqc_data
     write_results | Report      : multiqc_report.html
           multiqc | MultiQC complete


Unnamed: 0,Sample,Filename,File type,Encoding,Total Sequences,Total Bases,Sequences flagged as poor quality,Sequence length,%GC,total_deduplicated_percentage,...,per_base_sequence_quality,per_tile_sequence_quality,per_sequence_quality_scores,per_base_sequence_content,per_sequence_gc_content,per_base_n_content,sequence_length_distribution,sequence_duplication_levels,overrepresented_sequences,adapter_content
0,SRR21972723_1,SRR21972723_1.fastq.gz,Conventional base calls,Sanger / Illumina 1.9,59889148.0,8.9 Gbp,0.0,150.0,46.0,22.500888,...,pass,pass,pass,fail,warn,pass,pass,fail,warn,fail
1,SRR21972723_2,SRR21972723_2.fastq.gz,Conventional base calls,Sanger / Illumina 1.9,59889148.0,8.9 Gbp,0.0,150.0,46.0,26.104239,...,pass,pass,pass,warn,pass,pass,pass,fail,pass,fail
2,SRR21972724_1,SRR21972724_1.fastq.gz,Conventional base calls,Sanger / Illumina 1.9,51503257.0,7.7 Gbp,0.0,150.0,45.0,26.607164,...,pass,pass,pass,fail,warn,pass,pass,fail,warn,fail
3,SRR21972724_2,SRR21972724_2.fastq.gz,Conventional base calls,Sanger / Illumina 1.9,51503257.0,7.7 Gbp,0.0,150.0,45.0,27.43285,...,pass,pass,pass,warn,pass,pass,pass,fail,pass,fail
4,SRR21972725_1,SRR21972725_1.fastq.gz,Conventional base calls,Sanger / Illumina 1.9,53869826.0,8 Gbp,0.0,150.0,45.0,26.369692,...,pass,pass,pass,fail,pass,pass,pass,fail,warn,fail
5,SRR21972725_2,SRR21972725_2.fastq.gz,Conventional base calls,Sanger / Illumina 1.9,53869826.0,8 Gbp,0.0,150.0,46.0,28.207254,...,pass,pass,pass,warn,pass,pass,pass,fail,pass,fail
6,SRR21972726_1,SRR21972726_1.fastq.gz,Conventional base calls,Sanger / Illumina 1.9,61546553.0,9.2 Gbp,0.0,150.0,45.0,26.423534,...,pass,pass,pass,fail,pass,pass,pass,fail,warn,fail
7,SRR21972726_2,SRR21972726_2.fastq.gz,Conventional base calls,Sanger / Illumina 1.9,61546553.0,9.2 Gbp,0.0,150.0,46.0,27.994995,...,pass,pass,pass,warn,pass,pass,pass,fail,pass,fail
8,SRR21972727_1,SRR21972727_1.fastq.gz,Conventional base calls,Sanger / Illumina 1.9,79310114.0,11.8 Gbp,0.0,150.0,46.0,24.947601,...,pass,pass,pass,fail,warn,pass,pass,fail,warn,fail
9,SRR21972727_2,SRR21972727_2.fastq.gz,Conventional base calls,Sanger / Illumina 1.9,79310114.0,11.8 Gbp,0.0,150.0,47.0,27.034112,...,pass,pass,pass,warn,pass,pass,pass,fail,pass,fail
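With many samples, it can help to tally how often each FastQC module passes, warns, or fails. The sketch below does this with the standard library for tab-separated text such as `multiqc_fastqc.txt`. The function name `tally_statuses` is an assumption, and the example text contains only two rows and two module columns taken from the table above.

```python
import csv
import io
from collections import Counter

STATUSES = {"pass", "warn", "fail"}


def tally_statuses(tsv_text: str) -> dict:
    """Count pass/warn/fail per FastQC module column in tab-separated text."""
    tallies = {}
    for row in csv.DictReader(io.StringIO(tsv_text), delimiter="\t"):
        for column, value in row.items():
            if value in STATUSES:
                tallies.setdefault(column, Counter())[value] += 1
    return tallies


# Two rows and two module columns taken from the table above
example = (
    "Sample\tper_base_sequence_content\tadapter_content\n"
    "SRR21972723_1\tfail\tfail\n"
    "SRR21972723_2\twarn\tfail\n"
)
tallies = tally_statuses(example)
print(tallies["adapter_content"])  # → Counter({'fail': 2})
```

In the notebook, the full file could be passed in with `tally_statuses(open("multiqc_data/multiqc_fastqc.txt").read())`. An `adapter_content` failure across all samples, as seen here, is one reason to trim adapters in the next step.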


## STEP 5: Run Trimmomatic

Trimmomatic will trim off any adapter sequences or low quality sequence it detects in the FASTQ files.

Using a pipe and our accession list, we can queue up a batch run of Trimmomatic for all of our files. Unlike the FastQC step, `xargs` is used here without `-P`, so samples are processed one at a time while Trimmomatic itself uses `$THREADS` threads.

The code below may take approximately 30 minutes to run.

In [None]:
!cat accs.txt | xargs -I {} \
trimmomatic PE -threads $THREADS \
'data/raw_fastq/{}_1.fastq.gz' 'data/raw_fastq/{}_2.fastq.gz' \
'data/trimmed/{}_1_trimmed.fastq' 'data/trimmed/{}_1_trimmed_unpaired.fastq' \
'data/trimmed/{}_2_trimmed.fastq' 'data/trimmed/{}_2_trimmed_unpaired.fastq' \
ILLUMINACLIP:data/trimmed/TruSeq3-PE.fa:2:30:10:2:keepBothReads LEADING:3 TRAILING:3 MINLEN:36

TrimmomaticPE: Started with arguments:
 -threads 15 data/raw_fastq/SRR21972730_1.fastq.gz data/raw_fastq/SRR21972730_2.fastq.gz data/trimmed/SRR21972730_1_trimmed.fastq data/trimmed/SRR21972730_1_trimmed_unpaired.fastq data/trimmed/SRR21972730_2_trimmed.fastq data/trimmed/SRR21972730_2_trimmed_unpaired.fastq ILLUMINACLIP:data/trimmed/TruSeq3-PE.fa:2:30:10:2:keepBothReads LEADING:3 TRAILING:3 MINLEN:36
Using PrefixPair: 'TACACTCTTTCCCTACACGACGCTCTTCCGATCT' and 'GTGACTGGAGTTCAGACGTGTGCTCTTCCGATCT'
ILLUMINACLIP: Using 1 prefix pairs, 0 forward/reverse sequences, 0 forward only sequences, 0 reverse only sequences
Quality encoding detected as phred33
Input Read Pairs: 52663760 Both Surviving: 36050371 (68.45%) Forward Only Surviving: 16604182 (31.53%) Reverse Only Surviving: 0 (0.00%) Dropped: 9207 (0.02%)
TrimmomaticPE: Completed successfully
TrimmomaticPE: Started with arguments:
 -threads 15 data/raw_fastq/SRR21972729_1.fastq.gz data/raw_fastq/SRR21972729_2.fastq.gz data/trimmed/SRR21972
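The summary line in the Trimmomatic log reports how many read pairs survived trimming. As a small sketch, the code below parses that line with a regular expression and recomputes the surviving percentage. The names `SUMMARY_RE` and `parse_trimmomatic_summary` are assumptions for illustration; the input line is copied from the log above.

```python
import re

SUMMARY_RE = re.compile(
    r"Input Read Pairs: (?P<input>\d+) "
    r"Both Surviving: (?P<both>\d+) \(.*?\) "
    r"Forward Only Surviving: (?P<forward_only>\d+) \(.*?\) "
    r"Reverse Only Surviving: (?P<reverse_only>\d+) \(.*?\) "
    r"Dropped: (?P<dropped>\d+)"
)


def parse_trimmomatic_summary(line: str) -> dict:
    """Extract the read-pair counts from a Trimmomatic PE summary line."""
    m = SUMMARY_RE.search(line)
    if m is None:
        raise ValueError("not a Trimmomatic PE summary line")
    return {k: int(v) for k, v in m.groupdict().items()}


line = ("Input Read Pairs: 52663760 Both Surviving: 36050371 (68.45%) "
        "Forward Only Surviving: 16604182 (31.53%) "
        "Reverse Only Surviving: 0 (0.00%) Dropped: 9207 (0.02%)")
stats = parse_trimmomatic_summary(line)
pct_both = round(100 * stats["both"] / stats["input"], 2)
print(stats["both"], pct_both)  # → 36050371 68.45
```

The four output categories add up to the number of input pairs, which is a quick consistency check when comparing samples.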

## STEP 6: Run FastQC
It's best practice to run FastQC after trimming. However, you may decide to run FastQC only once, before or after trimming.

We run FastQC on both the forward and reverse trimmed files, because both are used for paired-end quantification later on. Note from the Trimmomatic log above that a substantial share of read pairs kept only the forward read (about 31.5% for SRR21972730). These orphaned reads are written to the `*_unpaired.fastq` files and are not used in the downstream steps.

The below code may take around 15-20 minutes to run.

In [None]:
# Create the output directory (FastQC does not create it), then run FastQC
!mkdir -p data/fastqc_samples
!cat accs.txt | xargs -P $THREADS -I {} fastqc data/trimmed/{}_1_trimmed.fastq data/trimmed/{}_2_trimmed.fastq -o data/fastqc_samples/

## STEP 7: Run MultiQC
MultiQC reads in the FastQC reports and generates a compiled report for all the analyzed FASTQ files.

In [None]:
# Compile the post-trimming FastQC reports; -o sets the report directory
!multiqc -f -o data/multiqc_samples/ data/fastqc_samples/

## STEP 8: Preparing the STAR-Compatible RSEM Reference

This command prepares a reference for RNA-Seq quantification with RSEM (RNA-Seq by Expectation-Maximization) and STAR (Spliced Transcripts Alignment to a Reference). `rsem-prepare-reference` takes a GTF file with gene annotations (mouse_annotation.gtf) and a FASTA file with the reference genome sequence (mouse_genome.fa), extracts the transcript sequences, and writes the reference files using the name `mouse_reference` as a file prefix in the current working directory. The `--star` option also builds the STAR index so that reads can be aligned with STAR during quantification. The `-p $THREADS` option sets the number of threads used, which speeds up the preparation.

In [None]:
# Preparing the reference genome
!rsem-prepare-reference --gtf data/reference/mouse_annotation.gtf --star -p $THREADS data/reference/mouse_genome.fa mouse_reference > /dev/null

## STEP 9: Run STAR for Alignment, Prepare and Run RSEM for Quantification

This script automates RNA-Seq gene expression quantification using RSEM and STAR. It reads SRR accession IDs from accs.txt and runs rsem-calculate-expression for each ID, saving the results in data/rsem_output. It uses the paired-end trimmed FASTQ files from data/trimmed/ and the STAR-compatible RSEM reference (mouse_reference) prepared in the previous step.

In [None]:
# Read the SRR accessions from the file, skipping blank lines
with open('accs.txt', 'r') as f:
    srr_accessions = [line.strip() for line in f if line.strip()]

# Define the output directory (created in STEP 2)
output_dir = "data/rsem_output"

# Loop through each SRR accession and run rsem-calculate-expression.
# --star aligns the reads with STAR before quantification; mouse_reference is
# the reference name (file prefix) passed to rsem-prepare-reference.
for srr in srr_accessions:
    !rsem-calculate-expression -p $THREADS --paired-end --star \
    data/trimmed/{srr}_1_trimmed.fastq data/trimmed/{srr}_2_trimmed.fastq mouse_reference {output_dir}/{srr} > /dev/null
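Outside a notebook, the same per-sample command can be assembled as an argument list and passed to `subprocess`. The sketch below only builds and prints the command (a dry run). The function name `rsem_command` and its default arguments are assumptions that mirror the paths used in this tutorial; the flags are the ones used in the cell above.

```python
import shlex


def rsem_command(srr, threads, reference="mouse_reference",
                 trimmed_dir="data/trimmed", output_dir="data/rsem_output"):
    """Build the rsem-calculate-expression argument list for one sample."""
    return [
        "rsem-calculate-expression",
        "-p", str(threads),
        "--paired-end", "--star",
        f"{trimmed_dir}/{srr}_1_trimmed.fastq",
        f"{trimmed_dir}/{srr}_2_trimmed.fastq",
        reference,
        f"{output_dir}/{srr}",
    ]


cmd = rsem_command("SRR21972730", threads=15)
print(shlex.join(cmd))  # dry run: show the command instead of executing it
```

To actually run it, `subprocess.run(cmd, check=True)` would execute the command and raise an error if RSEM exits with a non-zero status, which the `> /dev/null` redirection in the notebook cell would otherwise hide.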

## STEP 10: Report the top 10 most highly expressed genes in the samples

Top 10 most highly expressed genes (by TPM) in each sample.


In [None]:
import pandas as pd

# Path to RSEM results directory
rsem_results_dir = 'data/rsem_output'

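# Read the SRR accessions, skipping blank lines and closing the file afterwards
with open('accs.txt') as f:
    srr_ids = [line.strip() for line in f if line.strip()]

# Loop through each accession in accs.txt
for srr_id in srr_ids:
    rsem_result_file = f'{rsem_results_dir}/{srr_id}.genes.results'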

    # Load the RSEM results into a Pandas DataFrame
    df = pd.read_csv(rsem_result_file, sep='\t')

    # Sort the DataFrame by TPM values in descending order and get the top 10 genes
    top_10_genes = df.sort_values(by='TPM', ascending=False).head(10)

    # Print the top 10 genes with their TPM values
    print(f"Top 10 Genes by TPM for {srr_id}:")
    print(top_10_genes[['gene_id', 'TPM']])

## STEP 11: Report the expression of ENSMUSG00000064351 for each file

The code below uses pandas to report the expression of this gene in each sample. The columns in the RSEM `genes.results` file are as follows. The level of expression is reported in the Transcripts Per Million (`TPM`) and expected read count (`expected_count`) columns:  
`gene_id    transcript_id(s)    length    effective_length    expected_count    TPM    FPKM`

In [None]:
import pandas as pd

# Path to RSEM results directory
rsem_results_dir = 'data/rsem_output'

# Target gene ID
target_gene = 'ENSMUSG00000064351'

# Read the SRR accessions, skipping blank lines and closing the file afterwards
with open('accs.txt') as f:
    srr_ids = [line.strip() for line in f if line.strip()]

# Loop through each accession in accs.txt
for srr_id in srr_ids:
    rsem_result_file = f'{rsem_results_dir}/{srr_id}.genes.results'

    # Load the RSEM results into a Pandas DataFrame
    df = pd.read_csv(rsem_result_file, sep='\t')

    # Filter for the target gene
    target_gene_data = df[df['gene_id'] == target_gene]

    # Print the target gene's TPM value, or a notice if the gene is absent
    if target_gene_data.empty:
        print(f"{target_gene} not found in {srr_id}")
    else:
        print(f"TPM for {target_gene} in {srr_id}: {target_gene_data['TPM'].values[0]}")
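To make the TPM values reported above more concrete, the sketch below shows the usual TPM calculation: each gene's count is divided by its effective length, and the resulting rates are rescaled so that they sum to one million, i.e. TPM_i = 10^6 × (c_i / l_i) / Σ_j (c_j / l_j). This is a simplified illustration with made-up numbers, not RSEM's internal estimation procedure, and the function name `tpm` is an assumption.

```python
def tpm(counts, effective_lengths):
    """Convert read counts to TPM given per-gene effective lengths."""
    # Reads per base for each gene; genes with zero effective length get rate 0
    rates = [c / l if l > 0 else 0.0 for c, l in zip(counts, effective_lengths)]
    total = sum(rates)
    if total == 0:
        return [0.0] * len(rates)
    # Rescale so that the values sum to one million
    return [1e6 * r / total for r in rates]


# Made-up toy numbers for three genes
values = tpm([100, 200, 0], [1000, 1000, 500])
print(values)
```

Because the values always sum to one million within a sample, TPM is convenient for ranking genes within a sample, while the expected counts are what count-based tools such as DESeq2 use for differential expression analysis.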

## STEP 12: Export Read counts to AWS S3 Bucket

The code below extracts the gene ID, expected count, and gene length from each RSEM `genes.results` file and uploads them as tab-separated text files to an S3 bucket. This data will be accessible for further analysis in Tutorial 2 and Tutorial 3.

In [None]:
import os
import pandas as pd
import boto3

# Define the path to your RSEM output directory
rsem_output_path = "data/rsem_output"

# Define the S3 bucket and output path.
# boto3 expects the bucket name alone; the folder goes in the object key.
s3_bucket = "nigms-sandbox"
s3_output_path = "bulk-scRNAseq/readcounts/"

# Initialize S3 client
s3_client = boto3.client('s3')

# Get a list of all .genes.results files in the directory
genes_files = [f for f in os.listdir(rsem_output_path) if f.endswith('.genes.results')]

# Loop through each file to extract gene ID, expected counts, and gene length
for file in genes_files:
    file_path = os.path.join(rsem_output_path, file)
    
    # Read the .genes.results file
    rsem_data = pd.read_csv(file_path, sep="\t")

    # Check if the necessary columns exist
    if all(col in rsem_data.columns for col in ["gene_id", "expected_count", "length"]):
        # Create a new dataframe with required columns
        result_data = rsem_data[["gene_id", "expected_count", "length"]]
        result_data.columns = ["GeneID", "Count", "GeneLength"]

        # Define the output filename based on the input file name
        output_file_name = f"{os.path.splitext(file)[0]}.txt"
        s3_output_file_path = f"{s3_output_path}{output_file_name}"

        # Convert the DataFrame to a CSV string
        csv_buffer = result_data.to_csv(sep="\t", index=False)

        # Upload the result directly to S3
        s3_client.put_object(Bucket=s3_bucket, Key=s3_output_file_path, Body=csv_buffer)

    else:
        print(f"Warning: Required columns are missing in file: {file}")

# Optionally, print a message indicating completion
print("Extraction and file creation complete.")
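When working with boto3, note that methods such as `put_object` and `upload_file` take the bucket name and the object key as separate arguments, and a bucket name cannot contain a `/`. The sketch below splits an `s3://` URI of the kind used by the `aws s3 cp` commands in this tutorial into those two parts. The helper name `split_s3_uri` and the file name `example.txt` are assumptions for illustration.

```python
from urllib.parse import urlparse


def split_s3_uri(uri: str):
    """Split 's3://bucket/some/key' into (bucket, key)."""
    parsed = urlparse(uri)
    if parsed.scheme != "s3" or not parsed.netloc:
        raise ValueError(f"not an S3 URI: {uri!r}")
    return parsed.netloc, parsed.path.lstrip("/")


bucket, key = split_s3_uri("s3://nigms-sandbox/bulk-scRNAseq/readcounts/example.txt")
print(bucket, key)  # → nigms-sandbox bulk-scRNAseq/readcounts/example.txt
```

The two values can then be passed as `Bucket=bucket, Key=key` to `put_object`.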



## STEP 13: Save Merged Read Counts

This code combines the per-sample RSEM gene count files into a single matrix, making it easier to analyze and visualize the gene expression data. The merged file is also uploaded to the S3 bucket, along with the feature table, to allow further analysis in other tutorials.

In [None]:
# Ensure the RSEM quantification results directory exists
!mkdir -p data/rsem_output

# Merge RSEM results by gene counts (similar to Salmon's numreads merge)
!rsem-generate-data-matrix data/rsem_output/*.genes.results > data/rsem_output/merged_gene_counts.txt

# Optionally, rename the columns based on the samples
# If you want to assign your GSM identifiers or any other custom names, edit the header.
!sed -i "1s/.*/Name\tGSM6658439\tGSM6658438\tGSM6658435\tGSM6658441\tGSM6658433\tGSM6658431\tGSM6658429\tGSM6658427/" data/rsem_output/merged_gene_counts.txt

# Remove any unnecessary prefixes like 'gene-' or 'rna-' for easier formatting
!sed -i "s/gene-//g" data/rsem_output/merged_gene_counts.txt
!sed -i "s/rna-//g" data/rsem_output/merged_gene_counts.txt

# Show a preview of the merged quantification file
!head data/rsem_output/merged_gene_counts.txt

import boto3

# Define the local file paths and the S3 destination.
# boto3 expects the bucket name alone; the folder goes in the object key.
file_path = "data/rsem_output/merged_gene_counts.txt"
feature_table_path = "data/reference/mouse_feature_table.txt"
bucket_name = "nigms-sandbox"
s3_key = "bulk-scRNAseq/readcounts/merged_gene_counts.txt"
s3_feature_table_key = "bulk-scRNAseq/reference/mouse_feature_table.txt"

# Initialize an S3 client
s3_client = boto3.client('s3')

# Upload the merged gene count file and the feature table
for local_path, key in [(file_path, s3_key), (feature_table_path, s3_feature_table_key)]:
    try:
        s3_client.upload_file(local_path, bucket_name, key)
        print(f"File {local_path} uploaded successfully to s3://{bucket_name}/{key}")
    except Exception as e:
        print(f"Error uploading {local_path}: {e}")
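Conceptually, merging per-sample results means lining up the counts for each gene across samples into one matrix. The sketch below illustrates this with plain Python dictionaries and made-up numbers. It is not how `rsem-generate-data-matrix` is implemented, and the function name `merge_counts` and the sample and gene names are hypothetical.

```python
def merge_counts(per_sample: dict) -> list:
    """Merge {sample: {gene: count}} into rows of [gene, count_1, count_2, ...]."""
    samples = sorted(per_sample)
    genes = sorted({g for counts in per_sample.values() for g in counts})
    rows = [["gene_id"] + samples]          # header row
    for gene in genes:
        # Genes absent from a sample are filled with 0.0
        rows.append([gene] + [per_sample[s].get(gene, 0.0) for s in samples])
    return rows


# Made-up toy counts for two hypothetical samples
toy = {
    "sampleA": {"gene1": 10.0, "gene2": 0.0},
    "sampleB": {"gene1": 7.5, "gene3": 2.0},
}
matrix = merge_counts(toy)
for row in matrix:
    print("\t".join(map(str, row)))
```

Keeping the sample order explicit, as the sorted header does here, matters when the column names are later replaced by other identifiers such as the GSM IDs in the `sed` command above.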

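## STEP 14: Save RSEM reference and STAR index to AWS S3 Bucket

In [None]:
# Note: the rsem-prepare-reference command above writes its files with the prefix
# `mouse_reference` in the working directory. Move them into data/rsem_reference
# first if you want them included in this upload.
!aws s3 cp data/rsem_reference s3://nigms-sandbox/bulk-scRNAseq/reference/rsem_reference/ --recursive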

## Conclusion

In this tutorial, we covered the following key concepts and workflow steps:

- **Bulk RNA-seq Preprocessing**: Retrieving run accessions with Entrez Direct, downloading the full dataset with the SRA Toolkit, and setting up directories.
- **Quality Control**: Using FastQC and MultiQC to assess the quality of reads in the dataset and to combine the results into a single overview of quality metrics across samples.
- **Adapter Trimming**: Using Trimmomatic to remove adapter sequences and low-quality bases from FASTQ reads.
- **Read Mapping and Quantification**: Preparing a STAR-compatible RSEM reference from the genome and annotation, then mapping reads to the reference with STAR and quantifying gene expression levels with RSEM.
- **Storing Data Outputs**: Merging read count data and storing it in an AWS S3 bucket for use in subsequent analyses in Tutorials 2 and 3.

In summary, this Jupyter Notebook provided a hands-on demonstration of an extended Bulk RNA-Seq analysis workflow, guiding users through essential steps such as retrieving run accessions with Entrez Direct, obtaining sequencing data from the Sequence Read Archive using the SRA Toolkit, quality control with FastQC and MultiQC, read trimming with Trimmomatic, preparing a STAR-compatible reference, read mapping and quantification using STAR and RSEM, and storing data in an AWS S3 bucket. By processing the full set of reads from a mouse dataset, we were able to observe the expression levels of the top 10 most highly expressed genes in each sample. This workflow serves as a foundation for more advanced analyses. Further resources are available for using R/DESeq2 for differential gene expression analysis with the read counts generated in this tutorial, and for using NetAct for transcription factor network analysis. Ultimately, this tutorial equips users with the basic skills to analyze RNA-seq data and to understand the core components of a typical RNA-seq pipeline.



## <a name="workflow">Additional Workflows</a>

Now that you have read counts per gene, feel free to explore the R workflow, which creates plots and analyses using these read count files, or try alternative workflows for creating read count files, such as using Snakemake.


[Workflow One:](Tutorial_1_subsampling_mouse-miniforge.ipynb) A short introduction to downloading and mapping sequences to a mouse genome using STAR and RSEM.


[Workflow Two (DEG Analysis):](Tutorial_2_DEG_Analysis_mouse.ipynb) Using DESeq2 and R to conduct clustering and differential gene expression analysis.

[Workflow Three (Network Analysis):](Tutorial_3_NetAct.ipynb) Using NetAct and R to conduct transcription factor network analysis.
