<h1 style="font-size: 40px; margin-bottom: 0px;">12.2 Clustering and Differential Expression Analysis</h1>

<hr style="margin-left: 0px; border: 0.25px solid; border-color: #000000; width: 950px;"></hr>

Today, we'll continue to play around with our RNA-seq counts by looking at two other ways we can cluster our data, and then we'll perform differential expression analysis to obtain a .csv file that we can then load into Python. As we're performing the differential expression analysis, we'll break it up into smaller steps to review what Dr. Ingolia taught in lecture to see what's going on under the hood.

<strong>Learning objectives:</strong>

<ul>
    <li>Play with color palettes</li>
    <li>Explore clustering methods</li>
    <li>Review differential expression analysis</li>
    <li>Perform differential expression analysis</li>
</ul>

<h1>Load in packages</h1>

As in our previous lesson, we'll first load in the packages that we'll need for today's analysis, and then we'll briefly review some of the initial steps of DESeq2 that we did previously to set up for our principal component analysis. We'll use the same setup here in this notebook in order to perform some other clustering methods.

Two new packages that we'll make use of today are <code>pheatmap</code> to generate heatmaps and <code>viridis</code> to get a specific type of color palette. And we'll still be making use of <code>DESeq2</code>, <code>ggplot2</code>, and <code>hexbin</code>. 

In [None]:
library(DESeq2)
library(ggplot2)
library(hexbin)
library(pheatmap)
library(viridis)

<h1>Play with color palettes in R</h1>

To start off today, we'll play around with some color palettes. Specifically, we'll take a look at the viridis colormap, which you might recognize as the default colormap that matplotlib used in Python when we didn't specify a colormap for our imported image files. We'll then use the color palettes that we create to visualize our clustering results.

<a href="https://cran.r-project.org/web/packages/viridis/vignettes/intro-to-viridis.html" rel="noopener noreferrer" target="_blank"><u>A helpful vignette on the viridis color scale can be found here</u></a> (Rudis, Ross and Garnier). It describes the different color scales contained within the viridis package. It also demonstrates how the color scales, particularly viridis (designed by Eric Firing), remain readable for people with different types of color-blindness, which makes your data visualizations more accessible. The vignette also contains a pretty visualization that we can use to test out our own color palettes.

<h2>Viridis color palette</h2>

We can use the <code>viridis()</code> function to quickly generate a character vector of hex codes corresponding to the viridis color palette. We provide it with the number of hex codes we want, and it returns that many colors spanning the viridis color map.

```
viridis(255)
```

This example will generate a vector containing 255 hex codes that span the viridis color map.

Now let's see how the colors work in a visualization, pulling the code from the vignette with some slight modifications:

In [None]:
ggplot(data.frame(x = rnorm(10000), y = rnorm(10000)), aes(x = x, y = y)) +
    geom_hex() +
    theme_void() +
    coord_fixed() +
    scale_fill_gradientn(colors=viridis(255))

<h2>Setting up a color palette using <code>colorRampPalette()</code></h2>

Plenty of different color maps exist in both R and Python, which can be used for data visualizations such as heatmaps. However, sometimes you might feel that existing color maps don't capture exactly how you want your data to be visualized stylistically. In that case, you can make use of the <code>colorRampPalette()</code> function to generate your own graded color palettes. <a href="https://www.rdocumentation.org/packages/dichromat/versions/1.1/topics/colorRampPalette" rel="noopener noreferrer" target="_blank"><u>Documentation for <code>colorRampPalette()</code> is here.</u></a>

You provide the function with a vector of the colors (names, hex codes, etc.) that you want it to span, and it returns a function that generates a gradient of colors spanning the ones you specified:

```
my.fav.col.map <- colorRampPalette(colors = c("aquamarine", "grey", "hotpink"))
```

<a href="https://www.nceas.ucsb.edu/sites/default/files/2020-04/colorPaletteCheatsheet.pdf" rel="noopener noreferrer" target="_blank"><u>A helpful cheatsheet for colors and other color palettes can be found here by Melanie Frazier.</u></a>


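Because <code>colorRampPalette()</code> returns a function rather than the colors themselves, you specify how many hex codes to generate by calling that function with a number. You can do both in one step by slightly modifying your line of code: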

```
my.fav.col.map <- colorRampPalette(colors = c("aquamarine", "grey", "hotpink"))(255)
```

In this setup, you'll generate a vector of 255 hex codes that span from aquamarine to grey to hot pink.

Then, you can provide this color map to your functions for data visualizations, allowing you to create custom color maps based on the needs of your individual figures or the data/information that you are trying to convey.
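
To see the two steps separately, here is a minimal sketch that uses only base R. The names <code>make.palette</code> and <code>five.cols</code> are made up for this illustration, and only five colors are generated so that the output is easy to read.

```r
# colorRampPalette() returns a *function*, not a set of colors
make.palette <- colorRampPalette(colors = c("aquamarine", "grey", "hotpink"))

# Calling that function with a number returns that many hex codes
five.cols <- make.palette(5)
five.cols

# The first and last codes correspond to the first and last colors supplied
five.cols[c(1, 5)]

# Preview the gradient as a strip of colored tiles
image(matrix(1:5, ncol = 1), col = five.cols, axes = FALSE)
```

Swapping the 5 for 255 should give the same kind of smooth gradient used elsewhere in this lesson.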

<h1>Prepare your <code>DESeqDataSet</code></h1>

Since this is a new notebook, we'll need to bring in our counts matrix and conditions matrix again and use them to create a <code>DESeqDataSet</code> for us to use to do some more clustering. 

For convenience, I've just copied over the code from our previous lesson, so you don't need to retype it.

In [None]:
#Here we're importing our counts matrix.
counts <- read.csv('~/MCB201B_F2024/Week_10/quant/1M_counts_matrix.csv',
                   stringsAsFactors=FALSE,
                   row.names=1
                   )

#Then we're importing our conditions matrix.
conditions <- read.csv('~/MCB201B_F2024/Week_10/quant/1M_conditions_matrix.csv',
                   stringsAsFactors=FALSE,
                   row.names=1
                   )

#Update our column headers to match
colnames(counts) <- rownames(conditions)

#Filter out non-expressed genes.
means <- apply(counts, 1, mean)
counts <- counts[which(means>0),]

#Create your DESeqDataSet
dds <- DESeqDataSetFromMatrix(countData = counts,
                              colData = conditions,
                              design = ~ condition
                             )

#Estimate size factors - this is actually the start of differential expression analysis
dds <- estimateSizeFactors(dds)

#Perform a regularized log transformation
#Like with PCA, this will be the values we use for additional clustering
#This part is not differential expression but more like QC
rld <- rlog(dds, blind=FALSE)

#Pull the rlog transformed values to sort them and get the top 500 variance genes
rld.values <- assay(rld)
rld.var.sort <- rld.values[order(rowVars(rld.values), decreasing = TRUE),]
top500.var.rld <- head(rld.var.sort, 500)

<h1>Guided Exercise: Generate a distance matrix</h1>

Recall from Dr. Ingolia's clustering lecture that we can determine the similarities and dissimilarities of our samples by calculating their distances from one another, then using the resulting distance matrix to identify clusters of closely grouped samples. 

<h2>Calculate distances between replicates</h2>

To do this, we'll make use of the <code>dist()</code> function, which computes the distance matrix of a given data matrix. <a href="https://stat.ethz.ch/R-manual/R-devel/library/stats/html/dist.html" rel="noopener noreferrer" target="_blank"><u>Documentation for <code>dist()</code> is here.</u></a> If we dig into the documentation, we can see that by default it calculates the Euclidean distance between the rows, and it will output an object that can then be converted to a matrix via the function <code>as.matrix()</code>.

```
rep.distances <- dist(t(assay(rld)))
```

Breaking down this line of code, we have:

<code>rep.distances</code>

This is the variable to which we are saving our <code>dist</code> object.

<hr style="border: 1px solid; border-color: #AAAAAA;"></hr>

<code>&lt;-</code>

This is our assignment operator.

<hr style="border: 1px solid; border-color: #AAAAAA;"></hr>

<code>dist()</code>

This is the function to calculate the Euclidean distance between each pair of rows.

<hr style="border: 1px solid; border-color: #AAAAAA;"></hr>

<code>t(assay(rld))</code>

Here, we provide it with the transposed rlog transformed counts matrix. Like with principal component analysis, the distances are determined between the rows, and since we are more interested in the similarities/differences between each of our replicates, we will provide it with a transposed matrix of our rlog transformed counts using the <code>t()</code> function.

Let's take a look at the output:

<h2>Convert to a matrix</h2>

Now we can convert our <code>dist</code> object into a matrix that we can then use to generate a heatmap of our data based on the distance values for each point in our matrix. To do this, we'll use <a href="https://stat.ethz.ch/R-manual/R-devel/library/base/html/matrix.html" rel="noopener noreferrer" target="_blank"><u>a standard function called <code>as.matrix()</code></u></a>, which can convert our <code>dist</code> object into a 2D matrix. This function takes the object that you pass it and attempts to coerce it into a matrix.

```
rep.distances.matrix <- as.matrix(rep.distances)
```
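
Before looking at the real data, it can help to run both steps on a tiny matrix where the answer is easy to check by eye. This is only a sketch: the values, gene names, and sample names below are made up, and the <code>toy.*</code> variables are not part of the lesson's dataset.

```r
# Toy "expression" matrix: 3 genes (rows) x 4 samples (columns); values are made up
toy.expr <- matrix(c(1, 1, 5, 5,
                     2, 2, 6, 6,
                     3, 3, 7, 7),
                   nrow = 3, byrow = TRUE,
                   dimnames = list(c("geneA", "geneB", "geneC"),
                                   c("ctrl_1", "ctrl_2", "ko_1", "ko_2")))

# dist() works on rows, so transpose to compare samples rather than genes
toy.dist <- dist(t(toy.expr))

# Coerce the compact dist object into a full, symmetric matrix
toy.dist.matrix <- as.matrix(toy.dist)
toy.dist.matrix
```

The two control samples are identical here, so their distance is 0, while every control-to-KO distance is the same larger value. The diagonal is 0 because each sample is compared with itself.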

Let's take a look at how our distance matrix looks:

<h1>Plot a heatmap of sample distances</h1>

Now that we have a matrix of the Euclidean distances between our samples, we can then generate a heatmap of their distances while clustering our samples based on how close or far away they are from each other.

For this type of plotting, we'll make use of the pheatmap package. <a href="https://cran.r-project.org/web/packages/pheatmap/pheatmap.pdf" rel="noopener noreferrer" target="_blank"><u>Documentation for the pheatmap package is here.</u></a> Specifically, we'll make use of the <code>pheatmap()</code> function, which will allow us to cluster our data based on their Euclidean distances that we calculated using the <code>dist()</code> function.

You should see a heatmap colored with the color map that you specified, where the diagonal corresponds to zero because it is the distance between each sample and itself. 

We can identify groupings of our data from the dendrogram, which is built from the distances (recall from Dr. Ingolia's lecture), and we can also see them as the larger squares in the heatmap.
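
To make the link between distances and the dendrogram concrete, here is a small sketch on simulated data using base R's <code>hclust()</code>, the hierarchical clustering function that <code>pheatmap()</code> uses to build its dendrograms. The data and the <code>toy.*</code> names are made up, and the shift of 3 added to the "KO" samples is an arbitrary choice to create two obvious groups.

```r
set.seed(201)

# Made-up data: 50 genes x 4 samples, with the two "ko" samples shifted upward
toy.expr <- matrix(rnorm(200), nrow = 50,
                   dimnames = list(NULL, c("ctrl_1", "ctrl_2", "ko_1", "ko_2")))
toy.expr[, c("ko_1", "ko_2")] <- toy.expr[, c("ko_1", "ko_2")] + 3

# Sample-to-sample Euclidean distances, then complete-linkage clustering
toy.dist <- dist(t(toy.expr))
toy.tree <- hclust(toy.dist, method = "complete")

# Draw the dendrogram and cut it into two groups
plot(toy.tree)
toy.groups <- cutree(toy.tree, k = 2)
toy.groups

# With pheatmap loaded, the same dist object can be handed to pheatmap() through
# its clustering_distance_rows and clustering_distance_cols arguments.
```

Samples that share a group number sit on the same branch of the dendrogram, which is the same grouping that shows up as blocks along the heatmap's diagonal.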

<h2>Save distance matrix heatmap</h2>

We can make a quick adjustment to our code to then output the plot directly to a file rather than into the notebook by passing an additional argument to the <code>pheatmap()</code> function. This argument is <code>filename='name-of-file.ext'</code>.

<h1>Perform hierarchical clustering</h1>

Another way of determining the similarities and dissimilarities of our samples is to perform hierarchical clustering based on our top 500 genes with the highest variance.

We'll continue to make use of the <code>pheatmap()</code> function, but we'll provide it with a different set of arguments.

You should see a heatmap that looks a little different than the heatmap that we generated earlier. You can see how our data is grouped together, but now we can also see clusters of our genes, where we have clusters of genes that are overexpressed in our TAZ KO samples and groups of genes that exhibit reduced expression in our KO samples. 

You might notice that the legend for this figure looks different from that of our distance matrix heatmap. This is due to the fact that we supplied the argument <code>scale="row"</code>. This argument centers each row (gene) so that its mean is 0 and scales it so that its standard deviation is 1, which makes each gene's pattern across the samples easier to compare between the rows (our genes). You'll want to be careful interpreting the result because a negative scaled value does not necessarily mean that the gene exhibits low expression. Rather, it means that the gene's value in that sample is however many standard deviations below that gene's mean expression level across the samples you're looking at. Try commenting out the <code>scale="row"</code> argument to see how the heatmap changes.
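
The row scaling can be reproduced by hand in base R, which makes it easier to see what the legend values mean. This is a sketch with made-up values (the <code>toy.*</code> names are hypothetical), and it should be equivalent to what <code>scale="row"</code> does to each gene.

```r
# Made-up rlog-like values: one highly expressed gene, one lowly expressed gene
toy.expr <- matrix(c(12.0, 12.2, 11.0, 11.2,
                      2.0,  2.1,  3.0,  3.1),
                   nrow = 2, byrow = TRUE,
                   dimnames = list(c("high_gene", "low_gene"),
                                   c("ctrl_1", "ctrl_2", "ko_1", "ko_2")))

# scale() works on columns, so transpose, scale, and transpose back
# to center and scale each gene (row) separately
toy.scaled <- t(scale(t(toy.expr)))

round(toy.scaled, 2)
rowMeans(toy.scaled)        # approximately 0 for every gene
apply(toy.scaled, 1, sd)    # 1 for every gene
```

Notice that <code>high_gene</code> gets negative scaled values in the KO samples even though its unscaled values there are far higher than anything measured for <code>low_gene</code>. The sign only tells you where a sample falls relative to that gene's own mean.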

And again, if we want to output the figure into a file, we can make use of the same argument we used earlier. 

<h1>Differential Expression Analysis</h1>

Here, we'll return to our <code>DESeqDataSet</code> to finish up our differential expression analysis. Recall from Dr. Ingolia's lecture that we've already performed some of the initial steps for differential expression analysis, where we estimated the size factors in order to account for differences in sequencing depth.
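
As a refresher on what a size factor is, here is a minimal sketch of the median-of-ratios idea behind DESeq2's size factor estimation, written in base R on made-up counts. The <code>toy.*</code> names are hypothetical, and this is a simplified illustration rather than a replacement for <code>estimateSizeFactors()</code>.

```r
# Made-up counts: sample_2 was sequenced exactly twice as deeply as sample_1
toy.counts <- matrix(c( 10,  20,
                       100, 200,
                        50, 100,
                         0,   0),
                     ncol = 2, byrow = TRUE,
                     dimnames = list(paste0("gene", 1:4),
                                     c("sample_1", "sample_2")))

# Per-gene geometric mean across samples (computed on the log scale)
log.geo.means <- rowMeans(log(toy.counts))

# Median of each sample's ratio to the geometric mean,
# skipping genes whose geometric mean is zero (log is -Inf)
keep <- is.finite(log.geo.means)
toy.size.factors <- apply(toy.counts, 2, function(cnts) {
    exp(median((log(cnts) - log.geo.means)[keep]))
})
toy.size.factors

# Dividing by the size factors puts the samples on a common scale
sweep(toy.counts, 2, toy.size.factors, "/")
```

The size factor for <code>sample_2</code> comes out twice as large as the one for <code>sample_1</code>, so after dividing, the two samples have matching normalized counts for every gene.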

<h2>Estimate Dispersions</h2>

The next step is to estimate the spread of our measurements, otherwise referred to as the dispersion. DESeq2 calculates the estimated dispersion (&alpha;) as a function of the mean (&micro;) and variance. In other words, the estimated dispersion is the expected spread of the data for a given mean based on your data. This allows DESeq2 to identify what is likely to be true variation in the data resulting from biological or technical effects by shrinking the dispersion of each gene towards the estimated value for that gene's mean. This can be thought of as modeling the noise in our experiment in order to distinguish biological and technical differences in our samples from what is just noise in our measurements.

We'll do this by making use of the <code>estimateDispersions()</code> function, which is part of the DESeq2 package.

```
dds <- estimateDispersions(dds)
```

Like with our other intermediate calculations, we place the output into our <code>DESeqDataSet</code>.

Then we can visualize the dispersion estimate using the <code>plotDispEsts()</code> function.

```
plotDispEsts(dds)
```

In the plot, each black dot corresponds to a single gene, plotted with its mean on the X-axis and its calculated gene-wise dispersion on the Y-axis. The red line is the fitted dispersion trend based on your whole dataset. A "good" dispersion plot should have genes scattered around this fitted line. The blue dots indicate the final dispersion of each gene after shrinkage toward the trend, essentially removing variation that may just be due to noise in the measurements. Genes whose dispersion is much higher than the trend, showing potential biological or technical variation greater than the expected noise, are marked with a blue outline and are not shrunk.

What you would normally expect to see is a fitted line that increases in dispersion as the mean decreases (inversely correlated to mean). This is due to the fact that noise has a greater impact (accounts for more of the variation) when the mean is smaller.

<h2>Hypothesis testing: negative binomial Wald test to determine significance</h2>

Finally, we'll use the Wald test to detect differentially expressed genes and determine if they are significant. DESeq2 models expression based on a negative binomial distribution, and recall from Dr. Ingolia's lecture that the negative binomial distribution can be thought of like a Poisson distribution but with extra variance as a second parameter. 
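
To see what "Poisson with extra variance" looks like in practice, here is a short base R simulation. The mean and dispersion values are made up for illustration; the only fact relied on is that a negative binomial with mean &micro; and dispersion &alpha; has variance &micro; + &alpha;&micro;<sup>2</sup>.

```r
set.seed(201)
mu    <- 100    # mean count (made-up value)
alpha <- 0.1    # dispersion (made-up value)

# Poisson counts: the variance is tied to the mean
pois.counts <- rpois(1e5, lambda = mu)

# Negative binomial counts with the same mean; in rnbinom(), size = 1 / dispersion
nb.counts <- rnbinom(1e5, mu = mu, size = 1 / alpha)

c(poisson.mean = mean(pois.counts), poisson.var = var(pois.counts))
c(nb.mean = mean(nb.counts), nb.var = var(nb.counts))

# Variance expected under the negative binomial: mu + alpha * mu^2
mu + alpha * mu^2
```

Both sets of counts share the same mean, but the negative binomial counts are far more spread out, and the size of that extra spread is controlled by the dispersion.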

For hypothesis testing, DESeq2 sets the null hypothesis for each gene as having no difference between sample groups, so no log fold change difference (equal to 0). To test this hypothesis, DESeq2 makes use of the Wald test to compare the sample groups.
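
The arithmetic of a Wald test is short enough to do by hand. In this sketch, the log2 fold changes and standard errors are hypothetical numbers rather than values from our dataset; the column names mirror those in a DESeq2 results table.

```r
# Hypothetical log2 fold change estimates and their standard errors for three genes
toy.lfc <- c(geneA = 2.0, geneB = 0.3, geneC = -1.5)
toy.se  <- c(geneA = 0.5, geneB = 0.4, geneC = 0.3)

# Wald statistic: how many standard errors the estimate is away from 0
toy.stat <- toy.lfc / toy.se

# Two-sided p-value from the standard normal distribution
toy.pvalue <- 2 * pnorm(-abs(toy.stat))

data.frame(log2FoldChange = toy.lfc, lfcSE = toy.se,
           stat = toy.stat, pvalue = toy.pvalue)
```

A small fold change with a large standard error (<code>geneB</code>) gives a statistic close to 0 and a large p-value, while a fold change that is many standard errors away from 0 gives a small p-value regardless of its sign.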

To run a Wald test on our samples, we can make use of the <code>nbinomWaldTest()</code> function.

```
dds <- nbinomWaldTest(dds)
```

Then we can pull the results out of our <code>DESeqDataSet</code> using the <code>results()</code> function and assign them to a new variable.

```
res <- results(dds)
```

Let's take a look at what our results table looks like:

<h2>Export results of differential expression analysis</h2>

Like with Python, we can export the dataframe containing our results, just with slightly different syntax.

```
write.csv(res, '1M_results.csv')
```

This will output a .csv file containing the results of our differential expression analysis.

<h2>Export rlog transformed counts</h2>

While we're at it, let's also export our rlog transformed counts for use later on.
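
The export follows the same pattern as the results table. As a self-contained sketch, the example below writes a small made-up matrix (standing in for <code>assay(rld)</code>) to a temporary file and reads it back. The <code>toy.*</code> names and the temporary file are placeholders, so in the notebook you would supply the rlog matrix and a file name of your choosing.

```r
# Stand-in for assay(rld): a small made-up matrix of transformed counts
toy.rlog <- matrix(c(5.1, 5.3, 7.8, 8.0,
                     2.2, 2.0, 2.1, 2.3),
                   nrow = 2, byrow = TRUE,
                   dimnames = list(c("geneA", "geneB"),
                                   c("ctrl_1", "ctrl_2", "ko_1", "ko_2")))

# Write to a temporary file here; in the lesson you would pick your own file name
out.file <- tempfile(fileext = ".csv")
write.csv(toy.rlog, out.file)

# Read it back the same way we read in our counts matrix
toy.check <- read.csv(out.file, row.names = 1)
toy.check
```

Reading the file back with <code>row.names=1</code> restores the gene names as row names, matching how the counts matrix was imported earlier.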

<h2>Generate an MA Plot</h2>

Recall that you previously generated an MA plot for your group's replicate using the ratio between your TAZ KO and control counts and their average counts.

Now with our differential expression analysis results, we can generate an MA plot from our class dataset. To do this, we'll make use of DESeq2's <code>plotMA()</code> function.

```
plotMA(res)
```

By default, <code>plotMA()</code> highlights genes whose adjusted p-value (the p-value corrected for multiple hypothesis testing) is less than 0.1.
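
To get a feel for what the adjustment does, here is a short base R sketch using the Benjamini-Hochberg method, which is the default correction that DESeq2's <code>results()</code> applies. The raw p-values and gene names are made up.

```r
# Made-up raw p-values for five genes
toy.pvalue <- c(geneA = 0.001, geneB = 0.01, geneC = 0.03, geneD = 0.04, geneE = 0.5)

# Benjamini-Hochberg adjustment
toy.padj <- p.adjust(toy.pvalue, method = "BH")
toy.padj

# Genes that would be highlighted at the default threshold of 0.1
names(toy.padj)[toy.padj < 0.1]
```

Each adjusted value is at least as large as the raw p-value it came from, which is why fewer genes pass a given threshold after correction when many genes are tested.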

We can supply additional arguments to the function to slightly adjust our plot:

```
plotMA(res,
       alpha=0.05,
       xlab='Mean of normalized counts', 
       ylab='Log fold change', 
       main='MA plot',
       ylim=c(-5,5)
       )
```

<h2>Obtain shrunken log fold change values</h2>

As you can see, much like with our MA plot for our single replicate, genes that have a lower mean also tend to exhibit log fold changes of greater magnitude, giving the MA plot its characteristic arrowhead shape. This is because noise makes up a larger share of the measurement when counts are low, which leads to larger dispersion and more extreme log fold change estimates.
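
A quick simulation shows this effect in isolation. In this sketch there is no true difference between the two groups at all, and the only noise is Poisson sampling noise, which is a simplifying assumption; the means of 5 and 5000 and the pseudocount of 1 are arbitrary choices for illustration.

```r
set.seed(201)
n.genes <- 5000

# No true difference between groups: both conditions share the same mean
low.ctrl  <- rpois(n.genes, lambda = 5)
low.ko    <- rpois(n.genes, lambda = 5)
high.ctrl <- rpois(n.genes, lambda = 5000)
high.ko   <- rpois(n.genes, lambda = 5000)

# log2 ratios, with a pseudocount of 1 to avoid dividing by zero
low.lfc  <- log2((low.ko + 1) / (low.ctrl + 1))
high.lfc <- log2((high.ko + 1) / (high.ctrl + 1))

# Spread of the log fold changes at each expression level
c(low.mean.sd = sd(low.lfc), high.mean.sd = sd(high.lfc))
```

Even though nothing is differentially expressed, the low-count genes show a much wider spread of log fold changes than the high-count genes, which is the wide end of the arrowhead.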

We can correct for this by calculating the shrunken log fold change. This allows us to better visualize genes whose differential expression is likely due to true biological or technical variance rather than noise. 

First, let's pull out the comparison groups from our <code>DESeqDataSet</code>:

```
resultsNames(dds)
```

Then, we can call up the <code>lfcShrink()</code> function to calculate our shrunken log fold change:

```
resLFC <- lfcShrink(dds, 
                    coef="condition_tazko_vs_ctrl",
                    type="apeglm"
                    )
```

Now let's take a look at the results and export them for use later:

<h2>Plot MA plot for shrunken log fold change</h2>

Now let's take a look at what our MA plot looks like when we take into account that lower means are expected to have higher log fold change: