# Differential Gene Expression Notebook
Welcome to the Differential Gene Expression workflow for the SSRP workshop. This notebook covers the downstream RNA-seq analyses step by step and in greater detail.

Because this notebook is built to run on our [Binderhub](https://binderhub.readthedocs.io/en/latest/index.html), the environment comes preconfigured with the appropriate software installed. These packages will be called as needed. To view them, see the environment file in the binder directory. If you try to run this notebook on a local machine without properly installing each package, it will not work correctly.

This is an attenuated workflow focusing on just the downstream analyses. For a walkthrough that includes the upstream processing, you can see the [full workshop here](https://github.com/RU-MaGIC-Classes/SSRP_Workshop).

Because the focus is on the downstream steps only, everything here is done in R for ease.

### Analysis steps
Since we will be skipping the first few steps, here is a summary of what has been or will be performed:
- Primary analysis: Initial demultiplexing and conversion from raw image files to .fastq files. This should include initial sequencing QC.
- Secondary analysis: The alignment steps and associated QC. This should include things like removal of garbage reads and ambiguous bases and, for RNA-seq, quantification of things like ribosomal content.
- Tertiary analysis: The steps we start with here. The alignment and feature generation were performed upstream, and we start from those files to perform the actual differential gene expression comparisons.
- Quaternary analysis: We will also perform some of these. This includes visualization and things like pathway analysis, and is limited only by the constraints of your biological question.

## Tertiary analysis start
Now that we have determined we have a quality mapping and we have our table of gene counts, we are ready for the next step: differential gene expression analysis. One of the most widely used tools for this is [DESeq2](https://bioconductor.org/packages/release/bioc/html/DESeq2.html).

For RNA-seq experiments like this one, the statistics need to take several aspects of the biology into account. First, a standard t-test assumes a normal distribution. Think of a coin flip: if you flip a coin 50 times per test and run 50 tests, you would expect most results to land near 25 heads/25 tails, with outcomes like 1 heads/49 tails or 1 tails/49 heads becoming less and less likely. That bell shape is approximately a normal distribution. With RNA-seq, we instead see thousands of "tests" (genes), most of which receive few coin flips (reads), while a select few receive a much larger share of the reads. This is more comparable to the lottery: many, many people play, some win a little, and very few hit the jackpot. Counts like these are often modeled with a Poisson distribution, which assumes that the variance equals the mean. RNA-seq data goes a step further: the variance between biological replicates is typically larger than the mean (overdispersion), so the mean and variance can no longer be assumed to be the same. The negative binomial distribution adds a dispersion parameter to allow for this extra variance, and as you can see below, it fits an RNA-seq distribution.

![https://biohpc.cornell.edu/doc/RNA-Seq-2019-Lecture3.pdf](img/nb_mean_var.png)
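
To make the mean/variance point concrete, here is a minimal simulation sketch in base R. It uses made-up parameters rather than workshop data: `mu` and `size` are hypothetical values chosen only for illustration. For the negative binomial as parameterized by `rnbinom(mu=, size=)`, the variance is `mu + mu^2 / size`, so it should come out well above the mean, while the Poisson variance should sit close to the mean.

```r
# Toy simulation (not workshop data): compare mean and variance for
# Poisson and negative binomial counts that share the same mean.
set.seed(42)
n <- 10000
mu <- 100
size <- 5  # hypothetical dispersion parameter; NB variance = mu + mu^2 / size

pois_counts <- rpois(n, lambda = mu)
nb_counts <- rnbinom(n, mu = mu, size = size)

# Side-by-side summary of the two simulated distributions
summary_table <- data.frame(
  distribution = c("Poisson", "Negative binomial"),
  mean = c(mean(pois_counts), mean(nb_counts)),
  variance = c(var(pois_counts), var(nb_counts))
)
print(summary_table)
```

Both sets of counts have roughly the same mean, but only the negative binomial counts show the extra spread that replicate RNA-seq data typically has.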

You also have to consider how to normalize your data. This is critical in RNA-seq because each of the following effects can create apparent differences that are not biological:

- Differences due to sequencing depth. If Sample A is sequenced twice as deep as Sample B, its genes will look twice as expressed.
- Differences due to gene size. If Gene X is twice as long as Gene Y, you would expect more reads to map to Gene X, since more of the original RNA fragments come from the larger gene, and therefore Gene X will look more expressed than Gene Y.
- Differences due to RNA depth/composition. A handful of genes will be far more expressed than the rest, as the negative binomial distribution suggests. These "super signals" reduce the visibility of other signals. I always think of a car with those stupidly bright HID/LED headlights driving at night: it looks like the light of the sun, and then the next car with normal headlights looks like its lights aren't even on. Similar concept.

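With DESeq2, we fortunately can take these effects into account, which we will go into a bit more below. Its size factors correct for sequencing depth and RNA composition. Gene size does not need a separate correction when the same gene is compared between samples, because its length is the same in every sample.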
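
As a rough illustration of how depth differences get corrected, here is a simplified base R sketch of the median-of-ratios idea behind DESeq2's default size factors. The count matrix is a made-up toy example, and this is not the DESeq2 implementation itself (for instance, DESeq2 excludes genes with a zero count in any sample from the geometric means, a case the toy data avoids).

```r
# Toy count matrix (hypothetical values): sample B is sequenced twice as deep as A
toy_counts <- matrix(
  c(10, 20,
    20, 40,
    30, 60),
  ncol = 2, byrow = TRUE,
  dimnames = list(c("gene1", "gene2", "gene3"), c("A", "B"))
)

# Geometric mean of each gene across samples, kept on the log scale
log_geo_means <- rowMeans(log(toy_counts))

# Size factor per sample: median ratio of its counts to the geometric means
size_factors <- apply(toy_counts, 2, function(sample_counts) {
  exp(median(log(sample_counts) - log_geo_means))
})

# Dividing each column by its size factor puts the samples on a common scale
normalized <- sweep(toy_counts, 2, size_factors, "/")
print(size_factors)
print(normalized)
```

Because the median is taken over all genes, a few extremely expressed genes have little influence on the size factor, which is how the "bright headlights" problem is kept in check.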

### Load the required libraries
This is where we load all the various libraries that we will be needing for the analyses. 

In [None]:
library(DESeq2)
library(tidyr)
library(tidyverse)
library(dplyr)
library('pheatmap')
library(RColorBrewer)
library(ggplot2)
library('PCAtools')
library(plotly)
library(EnhancedVolcano)

library(clusterProfiler)
library(msigdbr)
library(stringr)
library(enrichplot)
#The values below aren't libraries; they are organism-specific parameters we will use for the quaternary analysis.
organism='org.Hs.eg.db'
korg='hsa'
msig_org='Homo sapiens'

In [None]:
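#Read the merged gene count table: rows are genes, the first column is a gene identifier, the rest are samples
filecontents <- read.csv('./input_data/merged_gene_counts.txt', header=TRUE, sep=',', row.names=1)
counts <- filecontents[, -c(1)]
counts <- counts[,order(names(counts))]

#Keep a lookup table linking the row names (Gene) to the first column (Geneid)
genetable <- subset(filecontents, select=c(1))
names(genetable)[1] <- 'Geneid'
genetable$Gene <- rownames(genetable)

#Read the sample metadata and sort it the same way as the count columns
filecontents <- read.csv('./input_data/metadata.csv', header=TRUE, sep=',', row.names=1)

sampletable <- filecontents[order(row.names(filecontents)),]
countdata <- as.matrix(counts)[, colnames(counts) %in% rownames(sampletable)]
sampletable$condition <- factor(sampletable$Group)

#The count columns and metadata rows must describe the same samples in the same order
stopifnot(identical(colnames(countdata), rownames(sampletable)))

#Build the DESeq2 object and run the full DESeq2 pipeline
dds <- DESeqDataSetFromMatrix(countData=countdata, colData=sampletable, design=~ condition)
dds <- DESeq(dds)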
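
The most common mistake at this step is a mismatch between the count columns and the metadata rows. The sketch below shows one way to line them up explicitly, using small made-up objects (`toy_counts` and `toy_meta` are hypothetical and stand in for `counts` and `sampletable`).

```r
# Toy inputs (hypothetical): counts has an extra sample and a different order
toy_counts <- matrix(
  1:6, nrow = 2,
  dimnames = list(c("gene1", "gene2"), c("s3", "s1", "s2"))
)
toy_meta <- data.frame(
  Group = c("Control", "Tumor"),
  row.names = c("s1", "s2")
)

# Keep only the samples present in both objects, in the metadata's order
shared <- intersect(rownames(toy_meta), colnames(toy_counts))
toy_counts <- toy_counts[, shared, drop = FALSE]
toy_meta <- toy_meta[shared, , drop = FALSE]

# Fail early if the two objects still disagree
stopifnot(identical(colnames(toy_counts), rownames(toy_meta)))
print(colnames(toy_counts))
```

Indexing both objects by the same `shared` vector guarantees identical order, rather than relying on two separate sorts agreeing with each other.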

In [None]:
#Extract the VST data
#VST is one of two transformations DESeq2 offers for this. rlog is also viable, but VST is much faster and is usually used for larger datasets.
#The result is stored as vsd so it does not share a name with the vst() function.
vsd <- vst(dds, blind=TRUE)

#Pull out the matrix of VST values so we can manipulate it. Here we remove the batch effect recorded in the batch column of the metadata, while the design matrix protects the condition effect. Other covariates could be handled the same way.
mat <- assay(vsd)
mm <- model.matrix(~condition, colData(vsd))
mat <- limma::removeBatchEffect(mat, batch=vsd$batch, design=mm)

#Now plot the PCA of the batch-corrected values
p <- pca(mat, metadata = sampletable, removeVar = 0.1)
plot <- biplot(p, colby='Etiology',
               shape='batch',
               legendPosition='top', legendLabSize=16, legendIconSize=8.0,
               pointSize=6,
               labSize=5)
print(plot)

In [None]:
#For comparison, run the PCA on the normalized counts without batch correction
p <- pca(counts(dds, normalized=TRUE), metadata=sampletable, removeVar=0.1)
uncorrected_plot <- biplot(p, colby='Etiology',
            legendPosition='top', legendLabSize=16, legendIconSize=8.0,
            pointSize=6,
            labSize=5)
print(uncorrected_plot)

In [None]:
#I use a loop here because I often have multiple comparisons to run, and it is easier to keep one code base either way. Set up the comparisons in the "compares" list, as list(numerator, denominator).

compares <- list(
  list('HCVTumor','HCVControl')
)

#Make sure the output directory exists before writing to it
dir.create('./hcv_only', showWarnings=FALSE)

#For each comparison, pull out the DESeq2 results and then shrink the log fold changes. This doesn't change which genes are significant, but it tames noisy fold changes from low-count genes, which makes plotting cleaner.
for(i in compares){
  out_res <- results(dds, contrast=c('condition',i[[1]], i[[2]]), alpha=0.05)
  shrink <- lfcShrink(dds, contrast=c('condition',i[[1]], i[[2]]),res=out_res, type="normal")
  comp_out <- shrink[order(shrink$padj),]
  #Attach the normalized counts for every sample to the statistics
  comp_out <- merge(as.data.frame(comp_out), as.data.frame(counts(dds, normalized=TRUE)),
    by='row.names', sort=FALSE)
  names(comp_out)[1] <- 'Gene'
  
  #Add the Geneid column from the lookup table and move it next to Gene
  DataIn <- merge(as.data.frame(comp_out), as.data.frame(genetable), by="Gene", sort=FALSE)
  DataIn <- DataIn %>% relocate(Geneid, .after = Gene)
  write.csv(DataIn, file=paste('./hcv_only/',i[[1]],'_vs_',i[[2]],'.csv', sep=''))
  #Also keep the table in the workspace under a name like HCVTumor_vs_HCVControl
  assign(paste(i[[1]],'_vs_',i[[2]], sep=''), DataIn)
}
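
The `padj` column used for sorting above holds p-values adjusted for multiple testing. DESeq2 uses the Benjamini-Hochberg method by default, and the basic idea can be sketched with base R's `p.adjust()` on a handful of made-up p-values. This is only an illustration: DESeq2 also applies independent filtering before adjusting, so its `padj` values will not match a plain `p.adjust()` call on the full table.

```r
# Hypothetical raw p-values for five genes
p_values <- c(0.001, 0.008, 0.039, 0.041, 0.60)

# Benjamini-Hochberg adjustment controls the false discovery rate
padj <- p.adjust(p_values, method = "BH")

comparison <- data.frame(
  p_value = p_values,
  padj = padj,
  significant_raw = p_values < 0.05,
  significant_adjusted = padj < 0.05
)
print(comparison)
```

Four of the five raw p-values fall below 0.05, but only two survive the adjustment, which is why the results should be filtered on `padj` rather than the raw p-value.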

In [None]:
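#Same loop as before, but this time merging the batch-corrected VST values (log2-like scale) instead of the normalized counts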
compares <- list(
  list('Radiation','Control')
)

for(i in compares){
  out_res <- results(dds, contrast=c('condition',i[[1]], i[[2]]), alpha=0.05)
  shrink <- lfcShrink(dds, contrast=c('condition',i[[1]], i[[2]]),res=out_res, type="normal")
  comp_out <- shrink[order(shrink$padj),]
  comp_out <- merge(as.data.frame(comp_out), as.data.frame(mat),
    by='row.names', sort=FALSE)
  names(comp_out)[1] <- 'Gene'
  
  DataIn <- merge(as.data.frame(comp_out), as.data.frame(genetable), by="Gene", sort=FALSE)
  DataIn <- DataIn %>% relocate(Geneid, .after = Gene)
  write.csv(DataIn, file=paste('./',i[[1]],'_vs_',i[[2]],'.csv', sep=''))
  assign(paste(i[[1]],'_vs_',i[[2]], sep=''), DataIn)
}

In [None]:
#Volcano plot of the last comparison: shrunken log2 fold change against adjusted p-value
plot <- EnhancedVolcano(DataIn,
                       lab=as.character(DataIn$Geneid),
                       title='Irradiated vs Control',
                       x='log2FoldChange',
                       y='padj',
                       legendPosition='bottom',
                       legendLabels=c('Not significant','Log2FC','padj','padj and Log2FC'),
                       pCutoff=0.05,
                       FCcutoff=1)

#Save a static copy of the plot
jpeg('mRNA_volcano.jpg', height=800, width=1000)
print(plot)
dev.off()

#Build an interactive version and save it as a standalone HTML file
if (!require("processx")) install.packages("processx")
fig <- ggplotly(plot + aes(x= log2FoldChange, y= -log10(padj), label = Geneid))

htmlwidgets::saveWidget(as_widget(fig), "./mRNA_volcano.html")
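
The four legend categories in the volcano plot come from combining the two cutoffs. Here is a small base R sketch of that logic on made-up results. The exact boundary handling (strict versus non-strict inequalities) inside EnhancedVolcano is an assumption here, and `toy_res` is a hypothetical table, not workshop output.

```r
# Hypothetical results for five genes
toy_res <- data.frame(
  Geneid = c("geneA", "geneB", "geneC", "geneD", "geneE"),
  log2FoldChange = c(2.5, 0.3, -1.8, 1.4, 0.2),
  padj = c(0.001, 0.0004, 0.03, 0.2, 0.9)
)

p_cutoff <- 0.05   # same value as pCutoff above
fc_cutoff <- 1     # same value as FCcutoff above

# A gene can pass the padj cutoff, the fold change cutoff, both, or neither
passes_p <- !is.na(toy_res$padj) & toy_res$padj < p_cutoff
passes_fc <- abs(toy_res$log2FoldChange) > fc_cutoff

toy_res$class <- ifelse(passes_p & passes_fc, "padj and Log2FC",
                 ifelse(passes_p, "padj",
                 ifelse(passes_fc, "Log2FC", "Not significant")))
print(toy_res)
```

Genes in the "padj and Log2FC" class are the ones usually carried forward, since they are both statistically significant and change by at least twofold in either direction.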


In [None]:
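#Pick the 250 genes with the highest variance across samples
topVarianceGenes <- head(order(rowVars(mat), decreasing=TRUE), 250)

#Diverging red/blue palette, reversed so that red marks high values
my_colors <- brewer.pal(n = 11, name = "RdBu")
my_colors <- colorRampPalette(my_colors)(50)
my_colors <- rev(my_colors)

#Column annotation for the heatmap. mat_col was not defined earlier in this notebook; building it from the condition column is an assumption.
mat_col <- data.frame(condition=sampletable$condition, row.names=rownames(sampletable))

#pheatmap draws the heatmap when it is called, so no print or return is needed
plot <- pheatmap(mat[topVarianceGenes,], cluster_rows=TRUE, color=my_colors,
    show_rownames=FALSE, cluster_cols=TRUE, scale='row', fontsize=20, annotation_col=mat_col)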
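
To see what the top-variance selection and `scale='row'` are doing, here is a base R sketch on a small random matrix. `toy_mat` is made up for illustration, with one gene forced to be highly variable, and the scaling shown (subtract the row mean, divide by the row standard deviation) is meant to mirror row scaling in general rather than reproduce pheatmap's internals exactly.

```r
# Toy expression matrix (hypothetical): 10 genes by 4 samples
set.seed(1)
toy_mat <- matrix(
  rnorm(40, mean = 8), nrow = 10,
  dimnames = list(paste0("gene", 1:10), paste0("s", 1:4))
)
toy_mat["gene3", ] <- c(2, 14, 3, 13)  # force one clearly variable gene

# Rank genes by variance across samples and keep the top 3
row_variance <- apply(toy_mat, 1, var)
top_genes <- head(order(row_variance, decreasing = TRUE), 3)

# Row z-scores: each gene is centered on 0 with a standard deviation of 1
z_scores <- t(scale(t(toy_mat[top_genes, ])))
print(round(z_scores, 2))
```

Scaling by row means the heatmap colors show how each sample compares with that gene's own average, so highly and lowly expressed genes can share one color scale.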