# RNA-Seq Analysis Training Demo

## Overview:

This notebook provides a framework for analyzing RNA-Seq data to identify differentially expressed genes and investigate potential regulatory mechanisms. It reads gene-level count data with edgeR, then runs a differential expression analysis (in this demo through NetAct's `RNAseqDegs_limma` function; DESeq2 and edgeR are also loaded) to find genes with statistically significant expression changes between experimental groups. It then uses the NetAct package to identify transcription factors (TFs) that may regulate these differentially expressed genes, estimate TF activity levels, and construct a network of interactions among the TFs.

Overall, this demo illustrates the essential steps in RNA-Seq analysis, emphasizing the identification of differential expression and the exploration of TF regulatory networks.

<div class="alert alert-block alert-warning"> NOTE: This Jupyter Notebook was developed to run within a customized container on AWS with all software and packages pre-configured. If running without this customized container, you will need to install tools using the Miniforge environment setup instructions below before moving on to <b><u>Step 1.3</u></b>.</div>

## STEP 1: Getting Started

### Without Container: Install Miniforge and R Packages

Miniforge is a lightweight Conda distribution that offers a streamlined installation process and efficient package management. It provides access to a vast repository of packages.

The following code performs these steps:
- Downloads Miniforge or Mambaforge (you can use either based on preference)
- Installs Miniforge (or Mambaforge); there is no need to install conda separately, since mamba is available immediately
- Installs gsutil and its dependencies
- Uses Miniforge and Bioconda to install the R packages used in this tutorial


<div class="alert alert-block alert-info">Tip: If using the Miniforge install, uncomment the commands in the following code cells by removing the leading <code>#</code> from each command line, then run the cells.</div>

In [None]:
# Download Miniforge
#system('curl -L -O https://github.com/conda-forge/miniforge/releases/latest/download/Miniforge3-$(uname)-$(uname -m).sh', intern = TRUE)

# Install Miniforge
#system('bash Miniforge3-$(uname)-$(uname -m).sh -b -u -p $HOME/miniforge', intern = TRUE)

# Add Miniforge bin to the system path
#Sys.setenv(PATH = paste(Sys.getenv("HOME"), "/miniforge/bin:", Sys.getenv("PATH"), sep = ""))


Next, use mamba with the conda-forge and bioconda channels to install the tools that will be used in this tutorial.

In [None]:
# Install gsutil and dependencies using mamba
#system('mamba install -y -c conda-forge -c bioconda gsutil', intern = TRUE)

<div class="alert alert-block alert-info">Tip: If using the Miniforge install, uncomment the commands in the following code cells by removing the leading <code>#</code> from each command line, then run the cells.</div>

### STEP 1.2 Install Bioconductor Packages

In [None]:
# Install BiocManager if not installed
#if (!requireNamespace("BiocManager", quietly = TRUE))
#    install.packages("BiocManager")

# Set repositories
#options(repos = BiocManager::repositories())

# Install Bioconductor packages
#BiocManager::install(c("ComplexHeatmap", "DESeq2", "edgeR"), force = TRUE)

# Install CRAN packages
#install.packages(c("dplyr", "pheatmap", "ggrepel", "ggfortify", "devtools", "R.utils"), dependencies = TRUE)

# Install NetAct from GitHub using devtools
#devtools::install_github("lusystemsbio/NetAct", dependencies = TRUE, build_vignettes = FALSE)
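
Before moving on, it can help to confirm that every package loaded in STEP 1.3 is actually installed. The following is a minimal sketch using only base R; the helper name `missing_packages` is hypothetical and is not part of any of the packages above.

```r
# Packages loaded later in this notebook (taken from the library() calls in STEP 1.3)
required <- c("DESeq2", "dplyr", "ComplexHeatmap", "edgeR", "ggplot2", "ggrepel",
              "ggfortify", "devtools", "Biobase", "R.utils", "NetAct")

# Hypothetical helper: return the packages that cannot be found in the library paths
missing_packages <- function(pkgs) {
  installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  pkgs[!installed]
}

not_installed <- missing_packages(required)
if (length(not_installed) > 0) {
  message("Not installed: ", paste(not_installed, collapse = ", "))
} else {
  message("All required packages are installed.")
}
```

If the message lists any packages, rerun the matching install command above before loading the libraries.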

---------------------------------------

## If running from a container, as noted above, start with <b> STEP 1.3 </b> below:

### STEP 1.3 Load libraries

In [None]:
# Install the CRAN package ggfortify, even if running from the container
install.packages("ggfortify", dependencies = FALSE)

In [None]:
library(DESeq2)
library(dplyr)
library(ComplexHeatmap)
library(edgeR)
library(ggplot2)
library(ggrepel)
library(ggfortify)
library(devtools)
library(Biobase)
library(R.utils)
library(NetAct)

## STEP 2: Download Read Count Files from S3

Download the read count files, generated by the STAR and RSEM analysis in Tutorial 1b, from the S3 bucket.

In [None]:
# Define the local directory for the count files and create it if needed
path <- "data/counts/"
dir.create(path, recursive = TRUE, showWarnings = FALSE)

# Load SRR IDs from accs.txt (one accession per line)
accs <- readLines('accs.txt')

# Define the path to the S3 bucket
s3_path <- 's3://nigms-sandbox/bulk-scRNAseq/readcounts/'

# Download the gene count files from the S3 bucket with the AWS CLI
for (acc in accs) {
  # Construct the command
  cmd <- paste("aws s3 cp", paste0(s3_path, acc, ".genes.txt"), path)
  
  # Execute the shell command
  system(cmd)
}

In [None]:
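# Alternative to the cell above: download the same files over HTTPS with wget
# Define the local directory for the count files and create it if needed
path <- "data/counts/"
dir.create(path, recursive = TRUE, showWarnings = FALSE)

# Load SRR IDs from accs.txt (one accession per line)
accs <- readLines('accs.txt')

# Define the base URL for the S3 bucket
base_url <- 'https://nigms-sandbox.s3.us-east-1.amazonaws.com/bulk-scRNAseq/readcounts/'

# Download the gene count files using wget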
for (acc in accs) {
  # Construct the full URL for each file
  file_url <- paste0(base_url, acc, ".genes.txt")
  
  # Construct the wget command
  cmd <- paste("wget -P", path, file_url)
  
  # Execute the shell command
  system(cmd)
}
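
Both download cells fetch one `<accession>.genes.txt` file per line of `accs.txt`. As a rough sketch of that naming scheme, the snippet below builds the HTTPS URLs and local paths without downloading anything, and lists which files are still missing locally. The helper `count_file_table` is hypothetical; the two accession IDs are taken from STEP 3.

```r
base_url <- "https://nigms-sandbox.s3.us-east-1.amazonaws.com/bulk-scRNAseq/readcounts/"
path <- "data/counts/"

# Hypothetical helper: one row per accession with its remote URL and local file path
count_file_table <- function(accs, base_url, path) {
  file_name <- paste0(accs, ".genes.txt")
  data.frame(
    acc = accs,
    url = paste0(base_url, file_name),
    local = file.path(sub("/+$", "", path), file_name),
    stringsAsFactors = FALSE
  )
}

accs <- c("SRR21972725", "SRR21972724")
files_tbl <- count_file_table(accs, base_url, path)

# Keep only the files that are not on disk yet, so reruns can skip finished downloads
to_fetch <- files_tbl[!file.exists(files_tbl$local), ]
print(to_fetch$url)
```

Looping over `to_fetch$url` instead of all accessions would avoid downloading files that are already present.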

## STEP 3: Read Count Data and Define Experimental Groups

Step 3 reads the gene-level counts from the downloaded files with edgeR's `readDGE` function, rounds them to integers, and defines the experimental groups (PSAP and GFP) and the sample names used in the rest of the analysis.

In [None]:
# List the files with full paths
files <- paste0(path, c("SRR21972725.genes.txt", "SRR21972724.genes.txt", 
                        "SRR21972723.genes.txt", "SRR21972726.genes.txt",
                        "SRR21972730.genes.txt", "SRR21972729.genes.txt", 
                        "SRR21972728.genes.txt", "SRR21972727.genes.txt"))

# Read the DGE data from the files
x <- edgeR::readDGE(files, columns=c("GeneID","Count"))
x$counts <- round(x$counts)

# Define the group factor
group <- as.factor(c("PSAP", "PSAP", "PSAP", "PSAP", "GFP", "GFP", "GFP", "GFP"))
x$samples$group <- group

# Define the sample names and set column names
samplenames <- c("PSAP1", "PSAP2", "PSAP3", "PSAP4", "GFP1", "GFP2", "GFP3", "GFP4")
colnames(x) <- samplenames
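
Conceptually, this step turns one two-column table per sample into a single genes-by-samples matrix with a matching group label per column. The sketch below imitates that with base R on two tiny made-up files. It is not how `readDGE` is implemented, and the gene IDs and counts are invented for illustration.

```r
# Write two tiny, made-up count tables to temporary files
tmp <- tempfile(c("sampleA_", "sampleB_"), fileext = ".genes.txt")
write.table(data.frame(GeneID = c("g1", "g2", "g3"), Count = c(10.4, 0, 7.6)),
            tmp[1], sep = "\t", quote = FALSE, row.names = FALSE)
write.table(data.frame(GeneID = c("g1", "g2", "g3"), Count = c(3.2, 5, 0)),
            tmp[2], sep = "\t", quote = FALSE, row.names = FALSE)

# Read each file and use the GeneID column as names for the Count column
read_counts <- function(f) {
  tab <- read.delim(f, stringsAsFactors = FALSE)
  setNames(tab$Count, tab$GeneID)
}
per_sample <- lapply(tmp, read_counts)

# Bind the samples column-wise, aligning rows by gene ID, then round to integers
genes <- names(per_sample[[1]])
toy_counts <- round(sapply(per_sample, function(v) v[genes]))
colnames(toy_counts) <- c("PSAP1", "GFP1")
toy_group <- factor(c("PSAP", "GFP"))

print(toy_counts)
```

The order of `toy_group` must match the column order of `toy_counts`, just as `group` must match the order of `files` in the cell above.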

## STEP 4: Data Preprocessing for Differential Expression Analysis

This step preprocesses the RNA-seq count data to prepare it for differential expression analysis. It creates a comparison list for the analysis, constructs phenotype data to annotate samples, preprocesses the counts, and removes any duplicate gene entries. These steps help ensure the quality and integrity of the data before statistical methods are applied to identify differentially expressed genes between the specified groups.

In [None]:
compList <- c("PSAP-GFP")
phenoData = new("AnnotatedDataFrame", data = data.frame(celltype = group))
rownames(phenoData) = colnames(x$counts)

In [None]:
counts <- Preprocess_counts(counts = x$counts, groups = group, mouse = TRUE)
# Remove duplicate rows, keeping the first occurrence
counts <- counts[!duplicated(rownames(counts)), ]
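
The `duplicated()` idiom in the cell above keeps the first row for each gene name and drops later repeats. A toy illustration follows, using an invented matrix. The low-count filter at the end is an added assumption about what a typical preprocessing step does and is not taken from `Preprocess_counts`.

```r
toy <- matrix(c(5, 0, 9,
                8, 1, 2,
                5, 0, 7,
                0, 0, 0),
              nrow = 4, byrow = TRUE,
              dimnames = list(c("Actb", "Gapdh", "Actb", "Xist"),
                              c("PSAP1", "PSAP2", "GFP1")))

# Keep the first occurrence of each row name (the second "Actb" row is dropped)
dedup <- toy[!duplicated(rownames(toy)), , drop = FALSE]

# Assumed extra step: drop genes with zero counts in every sample
kept <- dedup[rowSums(dedup) > 0, , drop = FALSE]
print(rownames(kept))
```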

## STEP 5: Run Differential Expression Analysis and Create Expression Set

This step identifies genes that are differentially expressed between the specified experimental groups (using a q-value cutoff of 0.05) and prepares the data for downstream analysis by creating a structured expression set object.

In [None]:
DErslt = RNAseqDegs_limma(counts = counts, phenodata = phenoData, 
                          complist = compList, qval = 0.05)

In [None]:
neweset = Biobase::ExpressionSet(assayData = as.matrix(DErslt$e), phenoData = phenoData)

In [None]:
DErslt$e
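
To make the idea of a q-value cutoff concrete, the following sketch runs a deliberately simple per-gene test on simulated data: log2 counts-per-million, a Welch t-test per gene, and Benjamini-Hochberg adjustment. This is a stand-in for illustration only. `RNAseqDegs_limma` is, as its name suggests, limma-based and will not give the same numbers, and all data and variable names below are made up.

```r
set.seed(1)
n_genes <- 200
grp <- factor(rep(c("PSAP", "GFP"), each = 4), levels = c("GFP", "PSAP"))

# Simulated counts: the first 20 genes are about 4-fold higher in the PSAP samples
mu <- matrix(100, nrow = n_genes, ncol = 8)
mu[1:20, grp == "PSAP"] <- 400
sim <- matrix(rnbinom(n_genes * 8, mu = mu, size = 10), nrow = n_genes,
              dimnames = list(paste0("gene", seq_len(n_genes)), NULL))

# log2 counts-per-million with a small offset to avoid log(0)
lib_size <- colSums(sim)
logcpm <- log2(t(t(sim + 0.5) / (lib_size + 1)) * 1e6)

# Per-gene Welch t-test (PSAP vs GFP), then BH adjustment to get q-values
pvals <- apply(logcpm, 1, function(y) t.test(y[grp == "PSAP"], y[grp == "GFP"])$p.value)
qvals <- p.adjust(pvals, method = "BH")
logfc <- rowMeans(logcpm[, grp == "PSAP"]) - rowMeans(logcpm[, grp == "GFP"])

# Genes passing the same 0.05 cutoff used in the cell above
de_genes <- names(qvals)[qvals < 0.05]
print(length(de_genes))
```

With only four samples per group, a plain t-test has little power, which is one reason dedicated RNA-Seq methods are used in practice.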

## STEP 6: Perform Transcription Factor (TF) Selection

This step will identify transcription factors (TFs) that are potentially involved in regulating the differentially expressed genes. By utilizing a database of TFs and employing statistical testing, this analysis aims to uncover crucial regulators that may contribute to the observed changes in gene expression.

In [None]:
data("mDB")
calc <- TRUE

if (calc) {
  gsearslts <- TF_Selection(GSDB = mDB, DErslt = DErslt, minSize = 5, nperm = 10000,
                            qval = 0.05, compList = compList,
                            nameFile = "mouse_gsearslts")
} else {
  gsearslts <- readRDS(file = "mouse_gsearslts.RDS")
}

tfs <- gsearslts$tfs
tfs
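
The parameter names `nperm` and `minSize` suggest a permutation-based gene-set test with a minimum target-set size. The sketch below shows the general idea of such a test on made-up data. It is a simplified stand-in, not NetAct's algorithm: the statistic (mean score of the targets), the helper `perm_enrichment`, and all gene and TF names are assumptions.

```r
set.seed(42)
# Made-up per-gene scores (for example, absolute test statistics)
scores <- setNames(abs(rnorm(500)), paste0("gene", 1:500))
scores[1:15] <- scores[1:15] + 2   # genes 1-15 are strongly changed

# Made-up TF -> target gene sets
tf_targets <- list(
  TfA = paste0("gene", 1:15),        # targets overlap the changed genes
  TfB = paste0("gene", 101:130),     # targets are unremarkable
  TfC = paste0("gene", 201:203)      # too few targets, removed by min_size
)

# Hypothetical helper: empirical p-value from random gene sets of the same size
perm_enrichment <- function(targets, scores, nperm = 2000) {
  observed <- mean(scores[targets])
  null <- replicate(nperm, mean(sample(scores, length(targets))))
  # Add-one correction keeps the empirical p-value above zero
  (sum(null >= observed) + 1) / (nperm + 1)
}

min_size <- 5
eligible <- tf_targets[lengths(tf_targets) >= min_size]
enrich_p <- vapply(eligible, perm_enrichment, numeric(1), scores = scores)
enrich_q <- p.adjust(enrich_p, method = "BH")
selected <- names(enrich_q)[enrich_q < 0.05]
print(selected)
```

More permutations give finer-grained p-values at the cost of run time, which is the same trade-off the `nperm` argument controls.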

## STEP 7: Reselect Transcription Factors (TFs) by Applying a Stricter q-value Threshold of 0.01

Applying a stricter q-value threshold of 0.01 refines the TF selection so that only TFs with stronger statistical support are retained. This helps prioritize the TFs most likely to drive the observed gene expression changes. As written, the cell below prints the reselected TFs without storing the result, so the later steps still use the `tfs` list from STEP 6.

In [None]:
Reselect_TFs(GSEArslt = gsearslts$GSEArslt, qval = 0.01)
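
Tightening the cutoff amounts to filtering the same results with a smaller q-value. A toy illustration follows, with invented TF names and q-values; the real structure of `gsearslts$GSEArslt` may differ, and `select_tfs` is a hypothetical helper.

```r
# Invented q-values for five TFs
toy_q <- c(Stat1 = 0.0004, Irf1 = 0.008, Nfkb1 = 0.02, Myc = 0.04, Sp1 = 0.3)

# Hypothetical helper: names of TFs passing a q-value cutoff
select_tfs <- function(q, cutoff) names(q)[q < cutoff]

tfs_05 <- select_tfs(toy_q, 0.05)
tfs_01 <- select_tfs(toy_q, 0.01)

# TFs that pass the 0.05 cutoff but are dropped at 0.01
print(setdiff(tfs_05, tfs_01))
```

The stricter list is always a subset of the looser one, so tightening the threshold can only remove TFs.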

## STEP 8: Calculate TF Activity

Calculating TF activity is crucial for understanding the regulatory mechanisms underlying the differential expression of genes. By assessing how active specific transcription factors are in different experimental groups, researchers can gain insights into the regulatory networks that may influence biological processes and pathways of interest. The heatmap visualization further aids in the interpretation of these results, highlighting the relationships between TF activity and gene expression patterns.

In [None]:
act.me <- TF_Activity(tfs, mDB, neweset, DErslt)
acts_mat = act.me$all_activities
Activity_heatmap(acts_mat, neweset)
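
One simple way to think about TF activity is as a summary of how a TF's targets behave in each sample. The sketch below averages z-scored target expression, flipping the sign for repressed targets. This is an assumption made for illustration and is not NetAct's `TF_Activity` calculation; the expression values, TF names, and target signs are invented.

```r
set.seed(7)
samples <- c(paste0("PSAP", 1:4), paste0("GFP", 1:4))
expr <- matrix(rnorm(6 * 8), nrow = 6,
               dimnames = list(paste0("gene", 1:6), samples))
# Make genes 1-3 higher in the PSAP samples
expr[1:3, 1:4] <- expr[1:3, 1:4] + 3

# Invented regulons: +1 marks an activated target, -1 a repressed target
regulons <- list(
  TfA = c(gene1 = 1, gene2 = 1, gene3 = 1),
  TfB = c(gene4 = 1, gene5 = -1, gene6 = 1)
)

# z-score each gene across samples so that targets are on a comparable scale
z <- t(scale(t(expr)))

# Activity = sign-weighted mean of the z-scored targets, per sample
toy_activity <- t(sapply(regulons, function(w) {
  colMeans(z[names(w), , drop = FALSE] * w)
}))
print(round(toy_activity, 2))
```

The result is a TFs-by-samples matrix, the same general shape that a heatmap of activities would display.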

## STEP 9: Build TF Network

The `TF_Filter` function constructs a filtered network of transcription factor interactions based on activity levels and known regulatory links, setting the stage for the transcriptional regulatory network plot and the circuit simulation in the following steps.

In [None]:
tf_links = TF_Filter(acts_mat, mDB, miTh = 0.05, nbins = 8, corMethod = "spearman", DPI = TRUE)
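
The arguments to `TF_Filter` point to two ingredients: a mutual-information threshold (`miTh`) computed on binned activities (`nbins`), and a correlation (`corMethod`) that can give an edge a sign. The sketch below computes both for toy activity profiles in base R. The equal-width binning and the helper `binned_mi` are assumptions for illustration; NetAct's estimator and its DPI pruning step may work differently.

```r
# Hypothetical helper: mutual information (in bits) between two numeric vectors,
# estimated from a 2-D histogram with equal-width bins
binned_mi <- function(x, y, nbins = 8) {
  bin_x <- cut(x, breaks = nbins)
  bin_y <- cut(y, breaks = nbins)
  joint <- table(bin_x, bin_y) / length(x)
  expected <- outer(rowSums(joint), colSums(joint))
  nz <- joint > 0
  sum(joint[nz] * log2(joint[nz] / expected[nz]))
}

set.seed(3)
tf_a <- rnorm(200)
tf_b <- -tf_a + rnorm(200, sd = 0.3)   # strongly (negatively) coupled to tf_a
tf_c <- rnorm(200)                     # unrelated to tf_a

mi_ab <- binned_mi(tf_a, tf_b)
mi_ac <- binned_mi(tf_a, tf_c)
rho_ab <- cor(tf_a, tf_b, method = "spearman")

# High mutual information with a negative correlation would suggest repression
print(c(mi_ab = mi_ab, mi_ac = mi_ac, rho_ab = rho_ab))
```

Mutual information captures whether two profiles are related at all, while the sign of the Spearman correlation indicates the direction of the relationship.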

## STEP 10: Plot the Transcription Factor Network

After the TF analysis, the `plot_network` function visualizes the transcription factor network, making the regulatory interactions among the TFs easier to interpret.

In [None]:
plot_network(tf_links)

## STEP 11: Run RACIPE Simulation on the Constructed GRN

Lastly, the RACIPE simulation is executed on the constructed gene regulatory network, allowing for the exploration of gene expression dynamics under varying regulatory parameters. The results can help to derive conclusions about the potential regulatory mechanisms at play and to identify key transcription factors that might be crucial in the system.

In [None]:
racipe_results <- sRACIPE::sracipeSimulate(circuit = tf_links, numModels = 200, plots = TRUE)
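
RACIPE-style simulation samples many random parameter sets for a fixed circuit topology and records the state each model settles into. The sketch below does this for a hypothetical two-gene mutual-inhibition circuit using Euler integration in base R. It is a toy illustration of the idea under assumed parameter ranges, not sRACIPE's implementation, and it does not use `tf_links`.

```r
# Inhibitory shifted Hill function: 1 when x is low, falls toward lambda when x is high
hill_inh <- function(x, threshold, n, lambda) {
  lambda + (1 - lambda) / (1 + (x / threshold)^n)
}

# Simulate one model of a two-gene toggle switch (A inhibits B, B inhibits A)
simulate_toggle <- function(p, steps = 2000, dt = 0.05) {
  a <- p$a0
  b <- p$b0
  for (i in seq_len(steps)) {
    da <- p$g_a * hill_inh(b, p$th_b, p$n, p$lambda) - p$k_a * a
    db <- p$g_b * hill_inh(a, p$th_a, p$n, p$lambda) - p$k_b * b
    a <- a + dt * da
    b <- b + dt * db
  }
  c(A = a, B = b)
}

# Assumed sampling ranges, chosen only for this toy example
random_params <- function() {
  list(g_a = runif(1, 1, 100), g_b = runif(1, 1, 100),     # production rates
       k_a = runif(1, 0.1, 1), k_b = runif(1, 0.1, 1),     # degradation rates
       th_a = runif(1, 5, 50), th_b = runif(1, 5, 50),     # Hill thresholds
       n = sample(1:6, 1), lambda = runif(1, 0.01, 0.1),   # Hill coefficient, fold change
       a0 = runif(1, 0, 100), b0 = runif(1, 0, 100))       # initial conditions
}

set.seed(11)
num_models <- 200
end_states <- t(replicate(num_models, simulate_toggle(random_params())))
print(head(round(end_states, 2)))
```

Plotting `end_states` on log axes would typically show clusters of models, which is how this kind of simulation is used to look for distinct expression states of a circuit.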