## DESeq2 tutorial: Investigating differential gene expression across different carbon sources
In this tutorial, we will use DESeq2 to identify genes that are differentially expressed between four carbon source conditions: glucose, fructoselysine, glucose+fructoselysine, and glucose+lysine. We will start with raw count data from the Wolf et al., 2019 Cell Host Microbe study (https://doi.org/10.1016/j.chom.2019.09.001), perform differential gene expression analysis, and visually explore some of the results.

### Import packages and load metadata

In [None]:
library("DESeq2")
packageVersion("DESeq2")

In [None]:
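#Get metadata (tab-delimited, one row per sample)
map=read.table("Invitro_RNASeq_map.txt",sep="\t",header=TRUE)
#Index rows by Description so they can be matched to count columns later
row.names(map)=map$Description
group=map$Treatment
#Peek at metadata
map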

Now let's import the data, which have already been mapped to the C. intestinalis genome, and keep only the columns that contain count data.

In [None]:
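#Read the per-gene table and index rows by gene identifier
data=read.table(file="Cint.normalized",header=TRUE,sep="\t")
row.names(data)=data$gene
#Keep only columns whose names contain "counts"
counts=data[,grep("counts",colnames(data))]
#Rename each column to the second "."-separated field of its original name
colnames(counts)=unlist(lapply(strsplit(as.character(colnames(counts)),"\\."),function(x)x[2]))
counts.invitro=counts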
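
To see what the renaming step does, here is a minimal sketch on made-up column names. The names `counts.sampleA` and `counts.sampleB` are hypothetical; the real headers in `Cint.normalized` may look different.

```r
# Hypothetical column names; the real file's headers may differ
cols <- c("gene", "counts.sampleA", "counts.sampleB", "length")
# Keep only names that contain "counts"
count_cols <- cols[grep("counts", cols)]
# Split each name on a literal "." and keep the second piece
sample_ids <- unlist(lapply(strsplit(count_cols, "\\."), function(x) x[2]))
sample_ids
```

Note that only the second piece is kept, so a sample name that itself contains a "." would be truncated.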

In [None]:
#Peek at count data
counts.invitro[1:4,]

### Pre-filtering and formatting
In this step we will pre-filter the data, keeping only genes that have more than 40 counts in more than 4 samples. We will also restrict the metadata to samples from a single bacterial species, C. intestinalis.

In [None]:
#Keep genes with more than 40 counts in more than 4 samples
tf=apply(counts.invitro,1,function(x)length(which(x>40))>4)
#Pre-processed counts include fractions, so round to integers, then subtract the pseudocount of 1
counts.invitro.adj=round(counts.invitro[tf,],digits = 0)-1
#Restrict the map to samples of this one species
map.species=map[colnames(counts.invitro.adj),]
map.species=map.species[!is.na(map.species$Experiment),]
#Reorder count columns to match the rows of the map
counts.invitro.adj=counts.invitro.adj[,row.names(map.species)]
#Peek at files
map.species
head(counts.invitro.adj)
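
The filter condition is easy to misread, so here is a small sketch on a made-up matrix showing which genes survive. The values are invented for illustration; `rowSums` is shown only as an equivalent alternative to the `apply` call.

```r
# Toy matrix: 3 genes x 6 samples (made-up values)
toy <- rbind(
  geneA = c(50, 60, 70, 80, 90, 10),  # 5 samples above 40: kept
  geneB = c(50, 60, 70, 80, 10, 10),  # 4 samples above 40: dropped
  geneC = c(40, 40, 40, 40, 40, 40)   # exactly 40 is not "> 40": dropped
)
# Same expression as in the tutorial cell
keep <- apply(toy, 1, function(x) length(which(x > 40)) > 4)
keep
# Equivalent vectorized form
keep2 <- rowSums(toy > 40) > 4
```

Because both comparisons are strict, a gene needs at least 5 samples with at least 41 counts to pass.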

### Format data for DESeq2 and run DESeq analysis
Here we will use DESeqDataSetFromMatrix to put our data into the format that DESeq2 expects, and then run the differential expression analysis. Below we show the log2 fold changes in gene expression between two treatment conditions, glucose+lysine vs. fructoselysine. Because no contrast is specified, results() reports the last level of Treatment compared against the reference level.

In [None]:
dds <- DESeqDataSetFromMatrix(countData = counts.invitro.adj, colData = map.species, design = ~Treatment)
dds
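
DESeq2 uses the first level of a factor as the reference for comparisons, and R orders factor levels alphabetically by default. The sketch below illustrates this in base R and shows how `relevel` changes the reference. The level spellings are assumptions and may not match the Treatment column exactly.

```r
# Treatment labels; exact spellings in the metadata are assumed here
treatment <- factor(c("glucose", "fructoselysine",
                      "glucose.fructoselysine", "glucose.lysine"))
# Levels are alphabetical; the first one acts as the reference
levels(treatment)
# Make glucose the reference instead
treatment_glc <- relevel(treatment, ref = "glucose")
levels(treatment_glc)
```

To apply the same idea to the tutorial object, one could relevel `dds$Treatment` before calling `DESeq`.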

In [None]:
dds.deseq <- DESeq(dds)
#Extract results; alpha is the FDR (padj) cutoff that independent filtering is optimized for
deseq2.res <- results(dds.deseq, alpha = 0.05)
deseq2.res

In [None]:
summary(deseq2.res)

### Analysis and visualization of results

The plot below shows the log2 fold change of each gene for the above treatment comparison (y-axis) against the mean of the normalized counts across all samples (x-axis).

In [None]:
plotMA(deseq2.res, ylim=c(-2,2))
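
To make the two axes concrete, here is a rough base R sketch that builds an MA-style plot from simulated counts. It uses a naive ratio of group means with a pseudocount of 1, not the model-based estimates that DESeq2 reports, and all values are made up.

```r
set.seed(1)
# Made-up counts: 500 genes, 3 samples per group, no true differences
n_genes <- 500
base <- rexp(n_genes, rate = 1 / 200) + 5
grpA <- matrix(rpois(n_genes * 3, base), ncol = 3)
grpB <- matrix(rpois(n_genes * 3, base), ncol = 3)
# x-axis: mean over all samples; y-axis: log2 ratio of group means
mean_counts <- rowMeans(cbind(grpA, grpB))
lfc <- log2((rowMeans(grpB) + 1) / (rowMeans(grpA) + 1))
plot(mean_counts, lfc, log = "x", pch = 20, cex = 0.5,
     xlab = "mean of normalized counts", ylab = "log2 fold change")
abline(h = 0, col = "red")
```

Even without any true differences, the points fan out at low mean counts, which is the noise that shrinkage is meant to reduce.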

In [None]:
resultsNames(dds.deseq)
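
Each coefficient name has the form variable_level_vs_reference, so only comparisons against the reference level appear in this list. Other pairs can be requested with the `contrast` argument of `results`. The sketch below demonstrates this on a simulated data set from `makeExampleDESeqDataSet`, not on the tutorial data; the levels A, B, and C are placeholders.

```r
library(DESeq2)
set.seed(1)
# Simulated data set: 1000 genes, 12 samples (not the tutorial data)
dds_toy <- makeExampleDESeqDataSet(n = 1000, m = 12)
# Replace the two-level condition with three placeholder levels
dds_toy$condition <- factor(rep(c("A", "B", "C"), each = 4))
dds_toy <- DESeq(dds_toy, quiet = TRUE)
# Coefficients compare each level with the reference level A
resultsNames(dds_toy)
# A comparison that is not among the coefficients: C vs B
res_C_vs_B <- results(dds_toy, contrast = c("condition", "C", "B"))
head(res_C_vs_B)
```

For the tutorial object, the analogous call would pass "Treatment" followed by the numerator and denominator levels, using the names returned by `levels(dds.deseq$Treatment)`.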

In [None]:
#Install apeglm only if it is missing, then load it for use by lfcShrink
if (!requireNamespace("apeglm", quietly = TRUE)) BiocManager::install("apeglm")
library(apeglm)

#### <ins>Log fold change shrinkage for visualization and ranking</ins>:
Shrinking the LFC estimates is useful for visualizing and ranking genes: it removes the noise associated with log2 fold changes from low-count genes without the need for arbitrary filtering thresholds.

In [None]:
#Shrink the log2 fold changes for this coefficient using the apeglm estimator
resLFC <- lfcShrink(dds.deseq, coef="Treatment_glucose.lysine_vs_fructoselysine", type="apeglm")
resLFC

In [None]:
plotMA(resLFC)

#### <ins>Regularized logarithm transformation</ins>:
The *rlog* function transforms the count data to the log2 scale in a way that removes the experiment-wide trend of variance over mean. *rlog* fits a model with a term for each sample and a prior distribution on the coefficients, which is estimated from the data.

In [None]:
#Rlog transform
rld <- rlog(dds.deseq, blind=FALSE)
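
The sketch below illustrates the mean-variance trend that motivates such a transformation. It does not implement *rlog*; it only simulates negative binomial counts in base R (with an arbitrary dispersion of 0.01) and compares the spread of a low-count and a high-count gene on a plain log2 scale.

```r
set.seed(1)
# Simulated counts for two genes, 1000 samples each
# (size = 100 corresponds to dispersion 0.01; values are illustrative only)
low  <- rnbinom(1000, mu = 10,   size = 100)
high <- rnbinom(1000, mu = 1000, size = 100)
# On the plain log2 scale the standard deviation depends on the mean
sd_low  <- sd(log2(low + 1))
sd_high <- sd(log2(high + 1))
c(low_mean_gene = sd_low, high_mean_gene = sd_high)
```

With a simple log2 transform the low-count gene is noticeably noisier, so low-count genes would dominate distance-based plots such as PCA or heatmaps.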

### Further visualizations of our results

In [None]:
plotPCA(rld, intgroup=c("Treatment"))

In [None]:
#Names of genes with padj < 0.05; which() drops genes whose padj is NA
select=row.names(deseq2.res)[which(deseq2.res$padj<.05)]
#rlog-transformed values for the significant genes
toplot.sig=assay(rld[select,])

In [None]:
heatmap(toplot.sig, Colv = NA, Rowv = T, scale="row",labCol=map.species[colnames(toplot.sig),"Treatment"])

In [None]:
#Plot normalized counts for the gene with the smallest adjusted p-value
plotCounts(dds.deseq, gene=which.min(deseq2.res$padj), intgroup="Treatment")

Let's take a closer look at the most significant genes, which can be looked up at https://www.patricbrc.org/.

In [None]:
#Sort results by adjusted p-value, most significant first (NA values go last)
deseq2.res.sig <- deseq2.res[order(deseq2.res$padj),]
head(deseq2.res.sig)
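
Sorting alone keeps every gene in the table, including those with an NA adjusted p-value. The sketch below uses a made-up data frame to show how `order` and `subset` treat NA values; the gene names and numbers are invented.

```r
# Made-up results table; NA padj mimics genes removed by independent filtering
res_toy <- data.frame(
  log2FoldChange = c(0.3, 2.1, -1.8, 0.9),
  padj = c(0.20, NA, 0.001, 0.04),
  row.names = c("g1", "g2", "g3", "g4")
)
# order() puts NA last, so the most significant genes come first
res_sorted <- res_toy[order(res_toy$padj), ]
# subset() drops rows where the condition is NA or FALSE
res_sig <- subset(res_sorted, padj < 0.05)
res_sig
```

The same `subset(deseq2.res.sig, padj < 0.05)` pattern should work on the sorted DESeq2 results to keep only the significant genes.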