# RNA-Seq Analysis Training Demo

## Overview:

This code provides a comprehensive framework for analyzing RNA-Seq data to identify differentially expressed genes and investigate potential regulatory mechanisms. 

The code will analyze read count data, followed by differential expression analysis utilizing the DESeq2 and edgeR packages to pinpoint genes with statistically significant expression changes between experimental groups. 

Additionally, the code explores the regulatory landscape by identifying potential transcription factors (TFs) involved in modulating these differentially expressed genes using the NetAct package. 

It further estimates TF activity levels and constructs networks of interactions among these TFs. 

Overall, this demo illustrates the essential steps in RNA-Seq analysis, emphasizing the identification of differential expression and the exploration of TF regulatory networks.

## STEP 1: Install Packages

### STEP 1.1 Install Miniforge

First install Miniforge.

In [1]:
# Download Miniforge
system('curl -L -O https://github.com/conda-forge/miniforge/releases/latest/download/Miniforge3-$(uname)-$(uname -m).sh', intern = TRUE)

# Install Miniforge
system('bash Miniforge3-$(uname)-$(uname -m).sh -b -u -p $HOME/miniforge', intern = TRUE)

# Add Miniforge bin to the system path
Sys.setenv(PATH = paste(Sys.getenv("HOME"), "/miniforge/bin:", Sys.getenv("PATH"), sep = ""))


Next, use mamba with the conda-forge and bioconda channels to install the tools that will be used in this tutorial.

In [2]:
# Install gsutil and dependencies using mamba
system('mamba install -y -c conda-forge -c bioconda gsutil', intern = TRUE)

### STEP 1.2 Install Bioconductor Packages

In [3]:
# Install BiocManager if not installed
if (!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager")

# Set repositories
options(repos = BiocManager::repositories())

# Install Bioconductor packages
BiocManager::install(c("ComplexHeatmap", "DESeq2", "edgeR"), force = TRUE)

# Install CRAN packages
install.packages(c("dplyr", "pheatmap", "ggrepel", "ggfortify", "devtools", "R.utils"), dependencies = TRUE)

'getOption("repos")' replaces Bioconductor standard repositories, see
'help("repositories", package = "BiocManager")' for details.
Replacement repositories:
    CRAN: https://cran.r-project.org

'getOption("repos")' replaces Bioconductor standard repositories, see
'help("repositories", package = "BiocManager")' for details.
Replacement repositories:
    BioCsoft: https://bioconductor.org/packages/3.19/bioc
    BioCann: https://bioconductor.org/packages/3.19/data/annotation
    BioCexp: https://bioconductor.org/packages/3.19/data/experiment
    BioCworkflows: https://bioconductor.org/packages/3.19/workflows
    BioCbooks: https://bioconductor.org/packages/3.19/books
    CRAN: https://cran.r-project.org

Bioconductor version 3.19 (BiocManager 1.30.25), R 4.4.1 (2024-06-14)

Installing package(s) 'ComplexHeatmap', 'DESeq2', 'edgeR'

also installing the dependencies ‘statmod’, ‘limma’


Updating HTML index of packages in '.Library'

Making 'packages.html' ...
 done

Old packages: 'data.tab

In [67]:
# Organism Annotation Package - Zebrafish
# org.Dr.eg.db and biomaRt are Bioconductor packages, so install them with BiocManager
BiocManager::install(c("org.Dr.eg.db", "biomaRt"))
library(org.Dr.eg.db)
library(biomaRt)


Updating HTML index of packages in '.Library'

Making 'packages.html' ...
 done

also installing the dependencies ‘filelock’, ‘BiocFileCache’


Updating HTML index of packages in '.Library'

Making 'packages.html' ...
 done
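Before loading the libraries, it can help to confirm that the installs above actually succeeded. The following is a minimal sketch using only base R; `missing_packages` is a hypothetical helper introduced here for illustration and is not part of BiocManager.

```r
# Report which of the requested packages are not yet installed.
# `missing_packages` is a hypothetical helper, not part of BiocManager.
missing_packages <- function(pkgs) {
  installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  pkgs[!installed]
}

needed <- c("DESeq2", "edgeR", "ComplexHeatmap", "org.Dr.eg.db", "biomaRt")
to_install <- missing_packages(needed)
if (length(to_install) > 0) {
  message("Not installed: ", paste(to_install, collapse = ", "))
} else {
  message("All requested packages are installed.")
}
```

Any names reported as missing can then be passed to `BiocManager::install()`.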



### STEP 1.3 Load libraries

In [4]:
library(DESeq2)
library(dplyr)
library(ComplexHeatmap)
library(edgeR)
library(ggplot2)
library(ggrepel)
library(ggfortify)
library(devtools)
library(Biobase)
library(R.utils)

Loading required package: S4Vectors

Loading required package: stats4

Loading required package: BiocGenerics


Attaching package: ‘BiocGenerics’


The following objects are masked from ‘package:stats’:

    IQR, mad, sd, var, xtabs


The following objects are masked from ‘package:base’:

    anyDuplicated, aperm, append, as.data.frame, basename, cbind,
    colnames, dirname, do.call, duplicated, eval, evalq, Filter, Find,
    get, grep, grepl, intersect, is.unsorted, lapply, Map, mapply,
    match, mget, order, paste, pmax, pmax.int, pmin, pmin.int,
    Position, rank, rbind, Reduce, rownames, sapply, setdiff, table,
    tapply, union, unique, unsplit, which.max, which.min



Attaching package: ‘S4Vectors’


The following object is masked from ‘package:utils’:

    findMatches


The following objects are masked from ‘package:base’:

    expand.grid, I, unname


Loading required package: IRanges

Loading required package: GenomicRanges

Loading required package: GenomeInfoDb

Loading r

### STEP 1.4 Install and Load NetAct from GitHub

In [5]:
devtools::install_github("lusystemsbio/NetAct", dependencies = TRUE, build_vignettes = FALSE)
# Load NetAct
library(NetAct)

Downloading GitHub repo lusystemsbio/NetAct@HEAD



Biostrings   (NA     -> 2.72.1  ) [CRAN]
KEGGREST     (NA     -> 1.44.1  ) [CRAN]
umap         (NA     -> 0.2.10.0) [CRAN]
visNetwork   (NA     -> 2.1.2   ) [CRAN]
Annotatio... (NA     -> 1.66.0  ) [CRAN]
fastmatch    (NA     -> 1.1-4   ) [CRAN]
cowplot      (NA     -> 1.1.3   ) [CRAN]
data.table   (1.15.4 -> 1.16.0  ) [CRAN]
sRACIPE      (NA     -> 1.20.0  ) [CRAN]
org.Hs.eg.db (NA     -> 3.19.1  ) [CRAN]
org.Mm.eg.db (NA     -> 3.19.1  ) [CRAN]
infotheo     (NA     -> 1.2.0.1 ) [CRAN]
entropy      (NA     -> 1.3.1   ) [CRAN]
mclust       (NA     -> 6.1.1   ) [CRAN]
fgsea        (NA     -> 1.30.0  ) [CRAN]
qvalue       (NA     -> 2.36.0  ) [CRAN]


Installing 16 packages: Biostrings, KEGGREST, umap, visNetwork, AnnotationDbi, fastmatch, cowplot, data.table, sRACIPE, org.Hs.eg.db, org.Mm.eg.db, infotheo, entropy, mclust, fgsea, qvalue

Updating HTML index of packages in '.Library'

Making 'packages.html' ...
 done



[36m──[39m [36mR CMD build[39m [36m─────────────────────────────────────────────────────────────────[39m
* checking for file ‘/tmp/RtmpLFr8ZR/remotes3142c04fd62/lusystemsbio-NetAct-f160a23/DESCRIPTION’ ... OK
* preparing ‘NetAct’:
* checking DESCRIPTION meta-information ... OK
* cleaning src
* checking for LF line-endings in source and make files and shell scripts
* checking for empty or unneeded directories
  NB: this package now depends on R (>= 3.5.0)
  serialize/load version 3 cannot be read in older versions of R.
  File(s) containing such objects:
    ‘NetAct/data/hDB.rdata’ ‘NetAct/data/mDB.rdata’
    ‘NetAct/vignettes/gsearslts_tutorial.RDS’
* building ‘NetAct_1.0.7.tar.gz’



“replacing previous import ‘mclust::logsumexp’ by ‘limma::logsumexp’ when loading ‘NetAct’”






## STEP 2: Download Read Count Files from S3

Download the read counts, generated by the STAR and RSEM analysis in Tutorial 1b, from the S3 bucket.

In [15]:
# Directory where the count files will be stored
path <- "data/counts/"
dir.create(path, recursive = TRUE, showWarnings = FALSE)

# Load SRR accession IDs, one per line, from accs.txt
accs <- readLines('accs.txt')

for (acc in accs) {
  # Construct the URL of the count file in the S3 bucket
  url <- paste0("https://sra-data-athena.s3.amazonaws.com/readcounts/", acc, ".genes_zebrafish.txt")

  # Construct the wget command, saving into `path`
  cmd <- paste("wget -P", path, url)

  # Execute the shell command
  system(cmd)
}
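If `wget` is not available on the system, the same download can be sketched in base R with `download.file`. The base URL and file suffix are taken from the loop above; `count_url` and `fetch_counts` are hypothetical helper names introduced only for this illustration.

```r
# Base URL and file suffix are taken from the download loop above.
base_url <- "https://sra-data-athena.s3.amazonaws.com/readcounts/"
suffix   <- ".genes_zebrafish.txt"

# Hypothetical helper: build the URL for one accession.
count_url <- function(acc) paste0(base_url, acc, suffix)

# Hypothetical helper: download one file unless it already exists locally.
fetch_counts <- function(acc, dest_dir = "data/counts") {
  dir.create(dest_dir, recursive = TRUE, showWarnings = FALSE)
  dest <- file.path(dest_dir, paste0(acc, suffix))
  if (!file.exists(dest)) {
    download.file(count_url(acc), destfile = dest, mode = "wb")
  }
  dest
}

# Example (not run here, needs network access):
# paths <- vapply(readLines("accs.txt"), fetch_counts, character(1))
count_url("SRR3371417")
```

Skipping files that already exist makes the cell safe to re-run without downloading everything again.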


## STEP 3: Read Count Data and Define Experimental Groups

This step reads the gene-level count files downloaded in Step 2, combines them into a single edgeR object, and assigns each sample to its experimental group (control or cortisol) in preparation for differential expression analysis.

In [38]:
# List the files with full paths
files <- paste0(path, c("SRR3371417.genes_zebrafish.txt", "SRR3371418.genes_zebrafish.txt", 
                        "SRR3371419.genes_zebrafish.txt", "SRR3371420.genes_zebrafish.txt",
                        "SRR3371421.genes_zebrafish.txt", "SRR3371422.genes_zebrafish.txt"))

# Read the DGE data from the files
x <- edgeR::readDGE(files, columns=c("GeneID","Count"))
x$counts <- round(x$counts)

# Define the group factor
group <- as.factor(c("control", "control", "control", "cortisol", "cortisol", "cortisol"))
x$samples$group <- group

# Define the sample names and set column names
samplenames <- c("control1", "control2", "control3", "cortisol1", "cortisol2", "cortisol3")
colnames(x) <- samplenames
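Because the group labels and sample names are typed by hand, a quick consistency check is worthwhile before going further. The sketch below uses a made-up count matrix as a stand-in for `x$counts`, so the numbers are illustrative only; the group and sample names match the cell above.

```r
# Toy stand-in for x$counts: 4 genes x 6 samples (values are made up).
set.seed(1)
toy_counts <- matrix(rpois(24, lambda = 20), nrow = 4,
                     dimnames = list(paste0("gene", 1:4), NULL))

group <- factor(c("control", "control", "control",
                  "cortisol", "cortisol", "cortisol"))
samplenames <- c("control1", "control2", "control3",
                 "cortisol1", "cortisol2", "cortisol3")
colnames(toy_counts) <- samplenames

# One group label per column, and each label matches its sample name prefix.
stopifnot(length(group) == ncol(toy_counts))
stopifnot(all(startsWith(colnames(toy_counts), as.character(group))))

table(group)              # replicates per condition
colSums(toy_counts)       # library size per sample
```

The same two `stopifnot` checks can be applied to the real object by replacing `toy_counts` with `x$counts`.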

## STEP 4: Data Preprocessing for Differential Expression Analysis

This step preprocesses the RNA-seq count data to prepare it for differential expression analysis. It defines the comparison to test, constructs phenotype data that annotates each sample with its group, preprocesses the counts, and removes any duplicate gene entries. These steps ensure the quality and integrity of the data before applying statistical methods to identify differentially expressed genes between the specified groups.

In [61]:
compList <- c("control-cortisol")
phenoData = new("AnnotatedDataFrame", data = data.frame(celltype = group))
rownames(phenoData) = colnames(x$counts)

In [64]:
counts <- Preprocess_counts(counts = x$counts, groups = group, mouse = FALSE)
# Remove duplicate rows, keeping the first occurrence
counts <- counts[!duplicated(rownames(counts)), ]

ERROR: Error in .testForValidKeys(x, keys, keytype, fks): None of the keys entered are valid keys for 'ENTREZID'. Please use the keys method to see a listing of valid arguments.
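The error above reports that none of the supplied keys were valid `ENTREZID` keys, so a reasonable first diagnostic is to check which kind of identifier the count matrix actually uses. The sketch below guesses the identifier type from its format and also illustrates the de-duplication step on a toy matrix. `guess_id_type` is a hypothetical helper, the example identifiers are made up, and the format rules (Ensembl gene IDs start with `ENS`, Entrez IDs are integers) are a simplification.

```r
# Hypothetical helper: guess the identifier type from its format.
guess_id_type <- function(ids) {
  ifelse(grepl("^ENS[A-Z]*G[0-9]+", ids), "ensembl",
         ifelse(grepl("^[0-9]+$", ids), "entrez", "symbol_or_other"))
}

# Made-up row names mixing the three formats, with one duplicate.
ids <- c("ENSDARG00000000001", "30037", "sox2", "sox2")
table(guess_id_type(ids))

# De-duplication as in the cell above: keep the first row for each name.
m <- matrix(1:8, nrow = 4, dimnames = list(ids, c("s1", "s2")))
m_unique <- m[!duplicated(rownames(m)), , drop = FALSE]
rownames(m_unique)
```

Running `table(guess_id_type(rownames(x$counts)))` on the real data shows at a glance whether the row names match the key type that the annotation lookup expects.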


## STEP 5: Run Differential Expression Analysis and Create Expression Set

This step identifies genes that are differentially expressed between the specified experimental groups and prepares the data for downstream analysis by creating a structured expression set object.

In [None]:
DErslt = RNAseqDegs_limma(counts = counts, phenodata = phenoData, 
                          complist = compList, qval = 0.05)

In [None]:
neweset = Biobase::ExpressionSet(assayData = as.matrix(DErslt$e), phenoData = phenoData)

In [None]:
DErslt$e
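To make the `qval = 0.05` cutoff concrete, the sketch below adjusts a handful of made-up p-values for multiple testing and keeps the genes that pass. It uses the Benjamini-Hochberg method from base R as an assumption; the exact adjustment applied inside `RNAseqDegs_limma` is not shown in this notebook.

```r
# Made-up raw p-values for six hypothetical genes.
pvals <- c(geneA = 0.0001, geneB = 0.004, geneC = 0.03,
           geneD = 0.04, geneE = 0.20, geneF = 0.75)

# Benjamini-Hochberg adjustment (an assumption; the exact q-value method
# used inside RNAseqDegs_limma is not shown in this notebook).
qvals <- p.adjust(pvals, method = "BH")

# Genes passing the same 0.05 cutoff used above.
degs <- names(qvals)[qvals < 0.05]
degs
```

Note that `geneC` and `geneD` have raw p-values below 0.05 but do not pass after adjustment, which is the point of controlling the false discovery rate.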

## STEP 6: Perform Transcription Factor (TF) Selection

This step will identify transcription factors (TFs) that are potentially involved in regulating the differentially expressed genes. By utilizing a database of TFs and employing statistical testing, this analysis aims to uncover crucial regulators that may contribute to the observed changes in gene expression.

In [None]:
data("mDB")
calc <- TRUE

if (calc) {
  gsearslts <- TF_Selection(GSDB = mDB, DErslt = DErslt, minSize = 5, nperm = 10000,
                            qval = 0.05, compList = compList,
                            nameFile = "zebrafish_gsearslts")
} else {
  gsearslts <- readRDS(file = "zebrafish_gsearslts.RDS")
}

tfs <- gsearslts$tfs
tfs
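The `calc` flag above follows a compute-once, reload-later pattern: the slow permutation test runs only when requested, and otherwise a saved `.RDS` file is read back. A minimal base R sketch of that pattern is shown below; `cached` and `slow_step` are hypothetical names, and the placeholder result stands in for the real TF selection output.

```r
# Hypothetical helper: compute once, then reuse the saved .RDS file.
cached <- function(path, compute, recalc = FALSE) {
  if (!recalc && file.exists(path)) {
    return(readRDS(path))
  }
  result <- compute()
  saveRDS(result, path)
  result
}

# Usage with a cheap stand-in for the permutation test.
cache_file <- file.path(tempdir(), "demo_gsearslts.RDS")
unlink(cache_file)
n_calls <- 0
slow_step <- function() {
  n_calls <<- n_calls + 1
  list(tfs = c("tfA", "tfB"))   # placeholder result, not real TFs
}

first  <- cached(cache_file, slow_step)   # computes and saves
second <- cached(cache_file, slow_step)   # reads the saved file
```

Checking `file.exists` instead of a manual flag avoids the failure that occurs when `calc` is set to `FALSE` before the file has ever been written.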

## STEP 7: Reselect Transcription Factors (TFs) by Applying a Stricter q-value Threshold of 0.01

By applying a stricter q-value threshold of 0.01, the code refines the selection of TFs, ensuring that only those with a higher level of statistical significance are retained. This refinement is crucial for downstream analyses as it helps to prioritize the most relevant TFs that may be driving the observed gene expression changes.

In [None]:
Reselect_TFs(GSEArslt = gsearslts$GSEArslt, qval = 0.01)
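The effect of tightening the cutoff can be seen on a small table. The sketch below uses made-up q-values for hypothetical TFs, not results from this dataset, and simply compares the sets retained at 0.05 and 0.01.

```r
# Made-up q-values for hypothetical TFs (not results from this dataset).
gsea <- data.frame(tf   = c("tfA", "tfB", "tfC", "tfD"),
                   qval = c(0.001, 0.008, 0.03, 0.20),
                   stringsAsFactors = FALSE)

tfs_05 <- gsea$tf[gsea$qval < 0.05]
tfs_01 <- gsea$tf[gsea$qval < 0.01]

# The stricter set is always contained in the looser one.
stopifnot(all(tfs_01 %in% tfs_05))
setdiff(tfs_05, tfs_01)   # TFs dropped by the stricter cutoff
```

Because the stricter set is a subset of the looser one, reselecting does not require rerunning the permutation test.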

## STEP 8: Calculate TF Activity

Calculating TF activity is crucial for understanding the regulatory mechanisms underlying the differential expression of genes. By assessing how active specific transcription factors are in different experimental groups, researchers can gain insights into the regulatory networks that may influence biological processes and pathways of interest. The heatmap visualization further aids in the interpretation of these results, highlighting the relationships between TF activity and gene expression patterns.

In [None]:
act.me <- TF_Activity(tfs, mDB, neweset, DErslt)
acts_mat = act.me$all_activities
Activity_heatmap(acts_mat, neweset)
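For intuition about what an activity score represents, the sketch below computes a deliberately simplified version: each gene is z-scored across samples, and a TF's activity in a sample is the mean z-score of its targets. This is a stand-in for illustration and not the method implemented by `TF_Activity`; the expression values and target sets are made up.

```r
# Toy expression matrix: 5 genes x 4 samples (values are made up).
expr <- matrix(c(1, 2, 8, 9,
                 2, 1, 9, 8,
                 5, 5, 5, 6,
                 9, 8, 2, 1,
                 8, 9, 1, 2),
               nrow = 5, byrow = TRUE,
               dimnames = list(paste0("g", 1:5), paste0("s", 1:4)))

# Hypothetical target sets for two TFs.
targets <- list(tfA = c("g1", "g2"), tfB = c("g4", "g5"))

# Z-score each gene across samples, then average over each TF's targets.
z <- t(scale(t(expr)))
activity <- t(sapply(targets, function(genes) colMeans(z[genes, , drop = FALSE])))
round(activity, 2)
```

The result is a TF-by-sample matrix, the same shape as the activity matrix passed to the heatmap above.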

## STEP 9: Build TF Network

The TF_Filter function constructs a filtered network of transcription factor interactions based on activity levels and known regulatory links, setting the stage for the network plot and circuit simulation in the following steps.

In [None]:
tf_links = TF_Filter(acts_mat, mDB, miTh = .05, nbins = 8, corMethod = "spearman", DPI = TRUE)
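The `corMethod = "spearman"` argument suggests that rank correlation between TF activity profiles plays a role in the filtering. The sketch below shows only that ingredient, using base R on a made-up activity matrix with an arbitrary cutoff of 0.8. It is not a reimplementation of `TF_Filter`, which also takes other parameters such as `miTh` and `nbins`.

```r
# Toy activity matrix: 3 hypothetical TFs x 6 samples (values are made up).
acts <- rbind(tfA = c(1, 2, 3, 4, 5, 6),
              tfB = c(2, 3, 4, 5, 6, 8),
              tfC = c(6, 5, 4, 3, 2, 1))

# Pairwise Spearman correlation between TF activity profiles.
rho <- cor(t(acts), method = "spearman")

# Keep each unordered pair once and filter by an arbitrary |rho| cutoff.
idx <- which(upper.tri(rho) & abs(rho) >= 0.8, arr.ind = TRUE)
edges <- data.frame(from = rownames(rho)[idx[, "row"]],
                    to   = colnames(rho)[idx[, "col"]],
                    rho  = rho[idx],
                    stringsAsFactors = FALSE)
edges
```

The sign of `rho` hints at whether a candidate link looks activating (positive) or inhibiting (negative).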

## STEP 10: Plot the transcription factor network

After the TF analysis, the plot_network function visualizes the filtered transcription factor network, showing the regulatory interactions among the selected TFs.

In [None]:
plot_network(tf_links)


## STEP 11: Run RACIPE simulation on the constructed GRN

Lastly, the RACIPE simulation is executed on the constructed gene regulatory network, allowing for the exploration of gene expression dynamics under varying regulatory parameters. The results can help to derive conclusions about the potential regulatory mechanisms at play and to identify key transcription factors that might be crucial in the system.

In [None]:
racipe_results <- sRACIPE::sracipeSimulate(circuit = tf_links, numModels = 200, plots = TRUE)
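A circuit for this kind of simulation is commonly described as a three-column table with `Source`, `Target`, and `Type`, where type 1 denotes activation and type 2 inhibition. The sketch below builds such a table from a hypothetical list of signed edges. Whether the `tf_links` object returned by `TF_Filter` already has this layout is not confirmed in this notebook, so treat the conversion as an assumption to check against the actual object.

```r
# Hypothetical signed edges (sign > 0: activation, sign < 0: inhibition).
signed_edges <- data.frame(from = c("tfA", "tfB", "tfC"),
                           to   = c("tfB", "tfC", "tfA"),
                           sign = c(1, -1, -1),
                           stringsAsFactors = FALSE)

# Three-column topology table: Source, Target, Type (1 = activation, 2 = inhibition).
circuit <- data.frame(Source = signed_edges$from,
                      Target = signed_edges$to,
                      Type   = ifelse(signed_edges$sign > 0, 1L, 2L),
                      stringsAsFactors = FALSE)
circuit
```

Inspecting `str(tf_links)` before the simulation call is a quick way to see which columns the real object provides.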