# Bulk RNA-Seq Analysis Training Demo

## Overview

This tutorial workflow uses truncated, partial run data from the project by Mittenbühler MJ et al.

This short tutorial demonstrates how to run an RNA-Seq workflow using a subsampled Mus musculus data set. Steps in the workflow include read trimming, read QC, read mapping, and counting mapped reads per gene to quantify gene expression.

Running all of the code in this tutorial can take more than 1 hour 30 minutes. This is part of the reason we provide both a short, easy introductory tutorial and this longer, more complete tutorial for those interested.

![RNA-Seq workflow](https://github.com/King-Laboratory/scRNASeq-miRNASeq-and-TF-Network-Analysis/blob/main/images/Tutorial_1_Workflow.png)

### STEP 1: Install Mambaforge

First install Mambaforge.

Mambaforge is a Conda package manager. Conda is a tool which creates Conda ‘environments’ that Conda packages can be installed into.

Conda packages and environments are useful for several reasons. Conda packages contain metadata. This metadata includes information about what other programs the given software needs to be installed, in order to run. When installing a package with Conda, those other packages are automatically also installed. In this way, the user does not have to worry about manually installing each dependency. This makes installation quick and simple.

These packages are installed inside of environments, which are simply folders within the local installation of Conda. This has several benefits. Local installation means easier installation for non-admin users who may not have access to all system directories. Each environment can hold specific software with specific versions, and it is easy to swap between environments. In addition, the environments themselves are portable, as each environment contains a manifest on how to recreate that environment.

Mambaforge itself is a Conda package manager, which means it requires Conda in order to work. It is used to install and update Conda packages, which it gets from a ‘channel’, or repository. It is an alternative to the native Conda package manager and is often used because it is faster.

Bioconda is a ‘channel’, or repository, that the Mambaforge package manager can download packages from. It is a repository of Conda packages that are related to biology. These packages are versions of popular biology software that are curated and uploaded by contributing users.


In [None]:
!curl -L -O https://github.com/conda-forge/miniforge/releases/latest/download/Mambaforge-$(uname)-$(uname -m).sh
!bash Mambaforge-$(uname)-$(uname -m).sh -b -u -p $HOME/mambaforge
!date +"%T"

Next, using mambaforge and bioconda, install the tools that will be used in this tutorial.

In [None]:
#tell the computer where the mambaforge bin files are located
import os
os.environ["PATH"] += os.pathsep + os.environ["HOME"]+"/mambaforge/bin"

#now we can easily use 'mamba' command to install software 
!mamba install -y -c conda-forge -c bioconda trimmomatic fastqc multiqc salmon gsutil sql-magic entrez-direct gffread parallel-fastq-dump sra-tools pyathena samtools star rsem subread pigz

### STEP 2: Setup Environment

Create a set of directories to store the reads, reference sequence files, and output files.

In [None]:
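#note: '!cd' only changes directory inside a temporary subshell; use the '%cd' magic to change the notebook's working directory
!cd $HOMEDIR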
!echo $PWD
!mkdir -p data_sub
!mkdir -p data_sub/raw_fastq
!mkdir -p data_sub/trimmed
!mkdir -p data_sub/fastqc
!mkdir -p data_sub/fastqc_samples/
!mkdir -p data_sub/reference
!mkdir -p data_sub/star_alignments
!mkdir -p data_sub/quants
!mkdir -p data_sub/aligned_bam
!mkdir -p data_sub/rsem_reference/mouse_rsem_reference
!mkdir -p data_sub/rsem_output

Specify the number of available threads based on the VM.

This is useful for later tools such as trimmomatic, or STAR.

In [None]:
numthreads=!lscpu | grep '^CPU(s)'| awk '{print $2-1}'

#python variable to hold the amount of threads your cpu has,
#useful for downstream tools like STAR, trimmomatic, etc
threads = int(numthreads[0])

#it's also good to have a shell version of the variable for commands that use piping,
#in jupyter, shell commands with piping sometimes cause python variables to not be expanded correctly.
%env THREADS=$threads
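
As a portable alternative to parsing `lscpu` output, the same "all CPUs but one" rule can be sketched with the Python standard library. The helper name `usable_threads` and its `reserve` parameter are illustrative assumptions and are not part of the original notebook. Unlike the `awk` expression above, this sketch never returns 0 on a single-CPU machine.

```python
import os

def usable_threads(reserve=1):
    """Return the CPU count minus `reserve`, but never less than 1."""
    total = os.cpu_count() or 1  # os.cpu_count() can return None on some platforms
    return max(1, total - reserve)

threads = usable_threads()
# Export the value so shell commands launched with "!" can read $THREADS
os.environ["THREADS"] = str(threads)
print(f"Using {threads} thread(s)")
```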

### STEP 3: Downloading relevant files



### STEP 3.1: Finding run accession numbers.


This code retrieves the SRR IDs associated with project PRJNA892075 from the NCBI database and saves them to a .txt file.

In [None]:
!esearch -db sra -query "PRJNA892075" | efetch -format runinfo | cut -d',' -f1 | tail -n +2 > accs.txt
!cat accs.txt
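
The `cut`/`tail` pipeline above keeps the first column of the runinfo CSV and skips the header row. A minimal Python sketch of the same idea is shown here, assuming the runinfo text has a header with a `Run` column. The function name `run_accessions` and the two-row example are hypothetical placeholders, not real records from PRJNA892075.

```python
import csv
import io

def run_accessions(runinfo_csv):
    """Extract the values of the 'Run' column from runinfo CSV text."""
    reader = csv.DictReader(io.StringIO(runinfo_csv))
    # Skip empty values and any repeated header rows
    return [row["Run"] for row in reader if row.get("Run") and row["Run"] != "Run"]

# Hypothetical excerpt for illustration; real runinfo output has many more columns
example = "Run,spots\nSRR_EXAMPLE_1,100\nSRR_EXAMPLE_2,200\n"
print(run_accessions(example))
```

Reading the column by name rather than by position makes the extraction less fragile if the column order ever changes.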

### STEP 3.2: Download subsampled FASTQ files from S3 bucket

In order for this tutorial to run quickly, we will only analyze 1,000,000 reads from each of the eight samples in PRJNA892075. These subsampled files have been posted on a publicly accessible S3 bucket.

The code will connect to the S3 bucket and download the paired-end FASTQ files for each SRR.

In [None]:
#Load SRR IDs from accs.txt (if available)
with open('accs.txt', 'r') as f:
    accs = [line.strip() for line in f.readlines()]
    
# Define the path to S3 bucket

s3_path = 's3://sra-data-athena/fastqfiles/'

In [None]:
for acc in accs:
    !aws s3 cp {s3_path}subsampled_{acc}_1.fastq data_sub/raw_fastq/
    !aws s3 cp {s3_path}subsampled_{acc}_2.fastq data_sub/raw_fastq/
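
After the download it can be useful to confirm that both mates arrived for every accession. The following is a small sketch under the assumption that the files follow the `subsampled_<accession>_<1|2>.fastq` naming used above; the helper names `expected_fastq_files` and `missing_files` are illustrative and `SRR_EXAMPLE` is a placeholder accession.

```python
from pathlib import Path

def expected_fastq_files(accessions, directory="data_sub/raw_fastq"):
    """Build the forward and reverse FASTQ paths expected for each accession."""
    base = Path(directory)
    return [base / f"subsampled_{acc}_{mate}.fastq"
            for acc in accessions for mate in (1, 2)]

def missing_files(paths):
    """Return the subset of paths that do not exist on disk."""
    return [p for p in paths if not p.is_file()]

# Placeholder accession for illustration; in the notebook, pass the `accs` list instead
for path in missing_files(expected_fastq_files(["SRR_EXAMPLE"])):
    print(f"Missing: {path}")
```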

### STEP 3.3: Download reference genome files that will be used by STAR

This step downloads and unzips reference files for the mouse genome and annotations needed by STAR. These reference files will be used by STAR to align RNA-seq reads during the analysis.

In [None]:
! wget ftp://ftp.ensembl.org/pub/release-104/fasta/mus_musculus/dna/Mus_musculus.GRCm39.dna.primary_assembly.fa.gz -O data_sub/reference/mouse_genome.fa.gz
! wget ftp://ftp.ensembl.org/pub/release-104/gtf/mus_musculus/Mus_musculus.GRCm39.104.gtf.gz -O data_sub/reference/mouse_annotation.gtf.gz
! wget -O data_sub/reference/mouse_feature_table.txt.gz https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/001/635/GCF_000001635.27_GRCm39/GCF_000001635.27_GRCm39_feature_table.txt.gz

In [None]:
!gunzip -f data_sub/reference/mouse_genome.fa.gz 
!gunzip -f data_sub/reference/mouse_annotation.gtf.gz
!gunzip -f data_sub/reference/mouse_feature_table.txt.gz

### STEP 3.4: Copy data file for Trimmomatic


This step copies a data file for Trimmomatic from a Google Cloud Storage bucket to the local directory.

In [None]:
!gsutil -m cp -r gs://nigms-sandbox/me-inbre-rnaseq-pipelinev2/config/TruSeq3-PE.fa .
!head TruSeq3-PE.fa 

### STEP 4: Run Trimmomatic

Trimmomatic will trim off any adapter sequences or low quality sequence it detects in the FASTQ files.

Trimmomatic takes an input, in this case a forward and reverse FASTQ file, and the user names the outputs. In this case the first two outputs are the ‘paired’ and ‘orphaned’ trimmed files for the forward reads. The second two are the ‘paired’ and ‘orphaned’ trimmed files for the reverse reads.

Paired in this case means that both the forward and reverse reads of a pair survived trimming. Orphaned, or unpaired, typically means one of the reads was discarded as a result of trimming, and only the forward or reverse read survived.

Here we take only the files with paired reads, as there are only a few orphaned reads, and including orphaned reads can complicate downstream analyses. Unless there is a significant number of them, or a specific reason to use them, it is generally easier to discard unpaired reads. In the later steps, both the forward and reverse paired files are passed to STAR for alignment.
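
The way a read pair ends up in one of the four output files can be sketched as a simple decision. This is an illustration of the paired/orphaned logic described above, not Trimmomatic's actual code; the function name `route_pair` and the returned labels are assumptions made for this example.

```python
def route_pair(forward_survived, reverse_survived):
    """Return which output a read pair goes to after trimming."""
    if forward_survived and reverse_survived:
        return "paired"            # written to both *_trimmed files
    if forward_survived:
        return "forward_unpaired"  # only the forward read survived
    if reverse_survived:
        return "reverse_unpaired"  # only the reverse read survived
    return "dropped"               # neither read survived trimming

for outcome in [(True, True), (True, False), (False, True), (False, False)]:
    print(outcome, "->", route_pair(*outcome))
```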

The last part of the command specifies how the trimming is done: ILLUMINACLIP:TruSeq3-PE.fa:2:30:10:2:keepBothReads LEADING:3 TRAILING:3 MINLEN:36

In greater detail:

‘ILLUMINACLIP:TruSeq3-PE.fa’ refers to which adapters should be cut from the reads.

‘2:30:10:2’ refers to various metrics, which are recommended defaults.

‘2’ refers to the seed mismatch. This is the number of mismatches a 'seed' may have when aligning to a possible adapter.

‘30’ refers to the palindrome clip threshold. This is a similarity score: if the forward and reverse reads, with the adapter sequences attached, align to each other with a score above this threshold, the adapter fragments are trimmed. By default the forward read is clipped and the reverse read is dropped; the keepBothReads setting at the end of the ILLUMINACLIP option is intended to retain the reverse read as well.

‘10’ refers to the simple clip threshold. This is the basic alternative to palindrome mode for finding adapters: adapters are tested against reads and, if they match sufficiently (above the threshold, in this case 10), they are clipped.

‘2’ refers to the minimum adapter fragment length in palindrome mode.

‘LEADING:3’ refers to trimming bases from the start of a read. Bases at the start of a read will continue to be trimmed, sequentially, as long as the bases remain below a PHRED score of 3.

‘TRAILING:3’ refers to the same as above, but at the end of a read.

‘MINLEN:36’, the read is discarded if below this length.

More information about these parameters can be found in the Trimmomatic documentation.
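
To make the `LEADING`, `TRAILING`, and `MINLEN` settings concrete, here is a simplified sketch of what they do to a single read. It is not Trimmomatic's implementation, and the function name `trim_read` and the example read are assumptions made for illustration; qualities are given as a list of PHRED scores.

```python
def trim_read(seq, quals, leading=3, trailing=3, minlen=36):
    """Apply LEADING/TRAILING-style trimming, then a MINLEN-style filter.

    Returns (sequence, qualities), or None if the read is too short to keep.
    """
    start = 0
    # LEADING: remove bases from the start while their quality is below the threshold
    while start < len(quals) and quals[start] < leading:
        start += 1
    end = len(quals)
    # TRAILING: remove bases from the end while their quality is below the threshold
    while end > start and quals[end - 1] < trailing:
        end -= 1
    # MINLEN: discard the read if what remains is shorter than the minimum length
    if end - start < minlen:
        return None
    return seq[start:end], quals[start:end]

# Made-up read: 2 low-quality bases, 40 good bases, 3 low-quality bases
seq = "NN" + "ACGT" * 10 + "NNN"
quals = [2, 2] + [30] * 40 + [2, 2, 2]
trimmed = trim_read(seq, quals)
print(len(seq), "->", len(trimmed[0]))
```

A read that falls below 36 bases after trimming returns `None`, which corresponds to the read being discarded.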

In [None]:
!cat accs.txt | xargs -I {} \
trimmomatic PE -threads $THREADS \
'data_sub/raw_fastq/subsampled_{}_1.fastq' 'data_sub/raw_fastq/subsampled_{}_2.fastq' \
'data_sub/trimmed/subsampled_{}_1_trimmed.fastq.gz' 'data_sub/trimmed/subsampled_{}_1_trimmed_unpaired.fastq.gz' \
'data_sub/trimmed/subsampled_{}_2_trimmed.fastq.gz' 'data_sub/trimmed/subsampled_{}_2_trimmed_unpaired.fastq.gz' \
ILLUMINACLIP:TruSeq3-PE.fa:2:30:10:2:keepBothReads LEADING:3 TRAILING:3 MINLEN:36

### STEP 5: Run FastQC
FastQC is an invaluable tool that allows you to evaluate whether there are problems with a set of reads. For example, it will provide a report of whether there is any bias in the sequence composition of the reads.

Because Jupyter is at its core a Python environment, we can use Python code and HTML support to display results in-line.

FastQC looks for different characteristics of quality in reads. It is very rare that every metric will pass. In many cases, the results serve as warnings, which should be interpreted in the context of the experiment. For instance, here, per base sequence content, sequence length distribution, sequence duplication levels, and overrepresented sequences all throw warnings. Per base sequence content routinely fails in RNA-sequencing in roughly the first 15 bases due to biased fragmentation. In most of our samples, this is where we see the failure (20% or more difference between A/T or G/C), so it is not unexpected. The overrepresented sequences can be BLASTed to show that the majority of them are ribosomal RNA. Ribosomal RNA contamination is also common and is not a large concern for this tutorial. Other metrics look good.
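
The per base sequence content check can be illustrated with a small sketch that flags positions where the A/T or G/C fractions differ by more than 20%. This is a simplified illustration of the idea, not FastQC's implementation; the function name `flag_biased_positions` and the four toy reads are made up for this example.

```python
from collections import Counter

def flag_biased_positions(reads, threshold=0.20):
    """Return 0-based positions where |A-T| or |G-C| exceeds `threshold`."""
    flagged = []
    longest = max((len(r) for r in reads), default=0)
    for pos in range(longest):
        # Count the bases seen at this position across all reads that are long enough
        column = Counter(r[pos] for r in reads if len(r) > pos)
        total = sum(column.values())
        frac = {base: column.get(base, 0) / total for base in "ACGT"}
        if (abs(frac["A"] - frac["T"]) > threshold
                or abs(frac["G"] - frac["C"]) > threshold):
            flagged.append(pos)
    return flagged

# Toy reads: every read starts with A, so position 0 is strongly biased
reads = ["AACG", "ATCG", "AAGC", "ATGC"]
print(flag_biased_positions(reads))
```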

In [None]:
# Run FastQC
!cat accs.txt | xargs -P $THREADS -I {} fastqc data_sub/trimmed/subsampled_{}_1_trimmed.fastq.gz data_sub/trimmed/subsampled_{}_2_trimmed.fastq.gz -o data_sub/fastqc_samples/

### STEP 6: Run MultiQC
MultiQC reads in the FastQC reports and generates a compiled report for all the analyzed FASTQ files.

Being able to use python with bash also means we can seamlessly use popular python packages, such as pandas, to interact with or view the files we create.

In [None]:
#!multiqc -f data_sub/fastqc_samples/
!multiqc -f -o data_sub/multiqc_samples/ data_sub/fastqc_samples/
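
As a sketch of viewing the generated files from Python, the snippet below reads a tab-separated table using only the standard library. The path `data_sub/multiqc_samples/multiqc_data/multiqc_fastqc.txt` is an assumption about where MultiQC writes its FastQC summary table and may differ between versions; the helper name `read_tsv` is also illustrative.

```python
import csv
from pathlib import Path

def read_tsv(path):
    """Read a tab-separated file with a header row into a list of dicts."""
    with open(path, newline="") as handle:
        return list(csv.DictReader(handle, delimiter="\t"))

# Assumed location of the MultiQC summary table; adjust if your output differs
table_path = Path("data_sub/multiqc_samples/multiqc_data/multiqc_fastqc.txt")
if table_path.is_file():
    for row in read_tsv(table_path):
        print(row)
else:
    print(f"{table_path} not found; run MultiQC first")
```

The same file could be loaded with `pandas.read_csv(table_path, sep='\t')` for a tabular display in the notebook.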

### STEP 7: Build STAR Genome Index for Efficient RNA-Seq Read Alignment

This step generates a genome index for STAR (Spliced Transcripts Alignment to a Reference), a highly efficient and accurate aligner optimized for RNA-seq data. STAR is widely recognized for its ability to handle large datasets and accurately map reads across splice junctions, which is critical for eukaryotic RNA-seq analysis where splicing is common.

In this step, the index is built using the mouse genome sequence in FASTA format and corresponding gene annotations in GTF format, which were prepared in Step 3.3. The indexing process, essential for speeding up subsequent read alignments, typically takes around 25 minutes to complete.

In [None]:
import os
import subprocess
import pandas as pd

# Get the number of threads from the shell command
numthreads = !lscpu | grep '^CPU(s)' | awk '{print $2-1}'
threads = int(numthreads[0])

os.environ['THREADS'] = str(threads)

In [None]:
!/usr/bin/time -v STAR --runThreadN $THREADS --runMode genomeGenerate --genomeDir data_sub/reference/STAR_index --genomeFastaFiles data_sub/reference/mouse_genome.fa --sjdbGTFfile data_sub/reference/mouse_annotation.gtf --sjdbOverhang 100 --limitGenomeGenerateRAM 29000000000 --genomeSAsparseD 2
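
The command above sets `--sjdbOverhang 100`. The STAR manual describes the ideal value as the read length minus 1 and notes that 100 generally works well. A hedged sketch of estimating that value from FASTQ records is shown below; the function names, the `sample_size` parameter, and the two toy records are assumptions made for illustration.

```python
from itertools import islice

def max_read_length(fastq_lines, sample_size=10000):
    """Return the longest sequence among the first `sample_size` FASTQ records."""
    longest = 0
    for i, line in enumerate(islice(fastq_lines, sample_size * 4)):
        if i % 4 == 1:  # the sequence is the second line of each 4-line record
            longest = max(longest, len(line.strip()))
    return longest

def suggested_sjdb_overhang(fastq_lines, sample_size=10000):
    """Read length minus 1, with a floor of 1 for empty input."""
    return max(max_read_length(fastq_lines, sample_size) - 1, 1)

# Toy FASTQ records for illustration
records = ["@r1", "ACGTACGT", "+", "IIIIIIII", "@r2", "ACGT", "+", "IIII"]
print(suggested_sjdb_overhang(records))
```

For the gzip-compressed files in this tutorial, an open handle from `gzip.open(path, "rt")` can be passed in place of the list.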

### STEP 8: Align RNA-Seq Reads with STAR and Quantify Transcript Abundance with RSEM

This step employs the STAR aligner to map RNA-seq reads to the reference genome and generates aligned output in BAM format for each sample. The code processes multiple samples by iterating over a list of Sequence Read Archive (SRA) Run IDs (SRR) specified in the accs.txt file. For each sample, the paired-end FASTQ files are aligned to the reference genome using the STAR index.

The alignment is performed using the --quantMode TranscriptomeSAM GeneCounts option, which directs STAR to not only align the reads but also produce transcriptome-specific BAM files and gene-level read counts. The gene counts output by STAR can be utilized for downstream analyses, such as differential gene expression.

The use of BAM format (--outSAMtype BAM SortedByCoordinate) ensures that the aligned reads are saved in a compressed format, sorted by genomic coordinate, which is useful for efficient storage and further processing. Paired-end reads are provided to STAR via the --readFilesIn option, and gzip-compressed FASTQ files are handled with the --readFilesCommand zcat argument.

This alignment process typically takes approximately 10 minutes per sample.

In [None]:
# Align each sample
!cat accs.txt | xargs -I {} STAR --runThreadN $THREADS --genomeDir data_sub/reference/STAR_index --readFilesIn data_sub/trimmed/subsampled_{}_1_trimmed.fastq.gz data_sub/trimmed/subsampled_{}_2_trimmed.fastq.gz --readFilesCommand zcat --outFileNamePrefix data_sub/aligned_bam/subsampled_{}_ --outSAMtype BAM SortedByCoordinate --quantMode TranscriptomeSAM GeneCounts
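
With `--quantMode GeneCounts`, STAR writes a `ReadsPerGene.out.tab` file per sample. It has four tab-separated columns: the gene ID, then counts for unstranded, forward-stranded, and reverse-stranded libraries, with the first four rows holding summary counters such as `N_unmapped`. A minimal parsing sketch follows; the function name `parse_reads_per_gene` and the example values are made up for illustration.

```python
def parse_reads_per_gene(text):
    """Split ReadsPerGene.out.tab content into summary rows and per-gene counts."""
    summary, counts = {}, {}
    for line in text.splitlines():
        if not line.strip():
            continue
        fields = line.split("\t")
        name = fields[0]
        values = [int(v) for v in fields[1:4]]  # unstranded, forward, reverse
        if name.startswith("N_"):
            summary[name] = values
        else:
            counts[name] = values
    return summary, counts

# Made-up content for illustration only
example = "\n".join([
    "N_unmapped\t10\t10\t10",
    "N_multimapping\t3\t3\t3",
    "N_noFeature\t2\t7\t4",
    "N_ambiguous\t1\t0\t0",
    "geneA\t5\t1\t4",
    "geneB\t8\t0\t8",
])
summary, counts = parse_reads_per_gene(example)
print(summary["N_unmapped"], counts["geneA"])
```

Which count column to use downstream depends on the strandedness of the library preparation.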

This command creates a reference index for RSEM (RNA-Seq by Expectation-Maximization), a widely used tool for quantifying gene and transcript expression from RNA-seq data. The --star and --star-path parameters tell RSEM to build STAR indices for the reference, with the path to the directory containing the STAR executable explicitly provided; adjust --star-path to match where STAR is installed in your environment.
The command processes the provided genome annotation file (.gtf format) and the genome sequence file (.fa format) to build the RSEM reference index.

On average, this step takes approximately 10 minutes to complete.

In [None]:
!rsem-prepare-reference --gtf data_sub/reference/mouse_annotation.gtf --star --star-path /sw/STAR data_sub/reference/mouse_genome.fa data_sub/rsem_reference/mouse_rsem_reference

In this part of the code, RSEM uses the previously created reference index to assign the aligned reads to genes and estimate gene expression levels. Because the --alignments option is given, RSEM reads the transcriptome BAM files that STAR produced in the alignment step rather than aligning the reads again.
The code iterates through the SRR IDs, applying RSEM to each sample to quantify gene expression. This step typically takes around 40 minutes to complete.

In [None]:
# Run RSEM quantification for each sample listed in accs.txt
!cat accs.txt | xargs -I {} bash -c "rsem-calculate-expression --paired-end --alignments --bam --star -p $THREADS data_sub/aligned_bam/subsampled_{}_Aligned.toTranscriptome.out.bam data_sub/rsem_reference/mouse_rsem_reference data_sub/rsem_output/subsampled_{}"
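
RSEM reports abundance as TPM (Transcripts Per Million). The sketch below illustrates the general TPM definition, where each count is divided by its effective length and the resulting rates are scaled to sum to one million. It is not RSEM's internal computation; the function name `tpm` and the example numbers are made up for illustration.

```python
def tpm(expected_counts, effective_lengths):
    """Compute TPM values from counts and effective lengths."""
    # Reads per base of effective length; features with zero length get a rate of 0
    rates = [c / l if l > 0 else 0.0
             for c, l in zip(expected_counts, effective_lengths)]
    total = sum(rates)
    if total == 0:
        return [0.0] * len(rates)
    # Scale so that all values sum to one million
    return [r / total * 1e6 for r in rates]

# Made-up example: the third feature has the most reads but is also the longest
values = tpm([8, 16, 32], [64, 64, 256])
print(values)
```

Because of the length normalization, the feature with the most reads does not necessarily have the highest TPM.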

### STEP 11: Report the top 10 most highly expressed genes in the samples

The code below reports the top 10 most highly expressed genes, ranked by TPM, in each sample.


In [None]:
import pandas as pd

# Path to RSEM results directory
rsem_results_dir = 'data_sub/rsem_output'

# Loop through each file in accs.txt
for srr_id in open('accs.txt'):
    srr_id = srr_id.strip()  # Remove newline character
    rsem_result_file = f'{rsem_results_dir}/subsampled_{srr_id}.genes.results'

    # Load the RSEM results into a Pandas DataFrame
    df = pd.read_csv(rsem_result_file, sep='\t')

    # Sort the DataFrame by TPM values in descending order and get the top 10 genes
    top_10_genes = df.sort_values(by='TPM', ascending=False).head(10)

    # Print the top 10 genes with their TPM values
    print(f"Top 10 Genes by TPM for {srr_id}:")
    print(top_10_genes[['gene_id', 'TPM']])



### STEP 12: Report the expression of ENSMUSG00000064356 for each file

Use pandas to report the expression of the target gene in each sample. The fields in the RSEM `.genes.results` file are as follows. The level of expression is reported in the Transcripts Per Million (`TPM`) and expected read count (`expected_count`) fields:  
`gene_id transcript_id(s) length effective_length expected_count TPM FPKM`

In [None]:
import pandas as pd

# Path to RSEM results directory
rsem_results_dir = 'data_sub/rsem_output'

# Target gene ID
target_gene = 'ENSMUSG00000064356'

# Loop through each file in accs.txt
for srr_id in open('accs.txt'):
    srr_id = srr_id.strip()  # Remove newline character
    rsem_result_file = f'{rsem_results_dir}/subsampled_{srr_id}.genes.results'

    # Load the RSEM results into a Pandas DataFrame
    df = pd.read_csv(rsem_result_file, sep='\t')

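    # Filter for the target gene
    target_gene_data = df[df['gene_id'] == target_gene]

    # Print the target gene's TPM value for the SRR ID, or a notice if the gene is absent
    if target_gene_data.empty:
        print(f"{target_gene} not found in {srr_id}")
    else:
        print(f"TPM for {target_gene} in {srr_id}: {target_gene_data['TPM'].values[0]}")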


## <a name="workflow">Additional Workflows</a>

Now that you have read counts per gene, feel free to explore the R workflow, which creates plots and analyses using these read count files, or try alternate workflows for creating read count files, such as using Snakemake.
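
Downstream tools usually expect a single gene-by-sample count table rather than one file per sample. The sketch below shows one way to combine per-sample counts, assuming they have already been loaded into dictionaries; the function name `build_count_matrix`, the sample names, and the counts are hypothetical.

```python
import csv
import sys

def build_count_matrix(per_sample_counts):
    """Turn {sample: {gene_id: count}} into a header and gene-by-sample rows."""
    samples = sorted(per_sample_counts)
    genes = sorted({g for counts in per_sample_counts.values() for g in counts})
    header = ["gene_id"] + samples
    # Genes missing from a sample are filled with 0
    rows = [[g] + [per_sample_counts[s].get(g, 0) for s in samples] for g in genes]
    return header, rows

# Hypothetical counts for illustration
example = {
    "sample_1": {"geneA": 5, "geneB": 0},
    "sample_2": {"geneA": 7, "geneC": 2},
}
header, rows = build_count_matrix(example)
writer = csv.writer(sys.stdout)
writer.writerow(header)
writer.writerows(rows)
```

RSEM expected counts can be fractional, so they may need rounding before use with tools that require integer counts.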


[Workflow One:](Tutorial_1.ipynb) A short introduction to downloading and mapping sequences to a transcriptome using Trimmomatic and Salmon. Here is a link to the YouTube video demonstrating the tutorial: <https://youtu.be/ChGfBR4do_Y>.

[Workflow One (Extended):](Tutorial_1B_Extended.ipynb) An extended version of workflow one. Once you have got your feet wet, you can retry workflow one with this extended version that covers the entire dataset, and includes elaboration such as using SRA tools for sequence downloading, and examples of running batches of fastq files through the pipeline. This workflow may take around an hour to run.

[Workflow One (Using Snakemake):](Tutorial_2_Snakemake.ipynb) Using snakemake to run workflow one.

[Workflow Two (DEG Analysis):](Tutorial_3_DEG_Analysis.ipynb) Using DESeq2 and R to conduct clustering and differential gene expression analysis.


![RNA-Seq workflow](images/RNA-Seq_Notebook_Homepage.png)