<h1 style="font-size: 40px; margin-bottom: 0px;">13.1 Exploring DESeq2 results</h1>

<hr style="margin-left: 0px; border: 0.25px solid; border-color: #000000; width: 950px;"></hr>

Last week, we ran DESeq2 on our class dataset and got the results of our differential expression analysis. Today, we'll be playing around with those results, looking at them in aggregate and pulling out genes that we think are interesting to look at in more detail. We'll remake some of the plots that we generated in R, but break each one down into smaller steps to better understand what is going on under the hood and exactly what we're looking at in each plot. We'll work together to walk through the logic behind each step as we build up plots of increasing complexity.

If you're comfortable, feel free to go on ahead at your own pace. If you're more comfortable using R, you can also change this notebook's kernel to R and do this lesson in R. I'll switch between R and Python if people want to work in R as well.

<strong>Learning objectives:</strong>

<ul>
    <li>Navigate differential expression results</li>
    <li>Practice working with data in Python</li>
    <li>Practice data visualization
        <ul>
            <li>MA plot</li>
            <li>Volcano plot</li>
            <li>Violin plot</li>
            <li>Box-and-whisker plot</li>
        </ul>
    </li>
</ul>

<h2>Packages for those of you who want to work in Python</h2>

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

<h2>Packages for those of you who want to work in R</h2>

In [None]:
# library(ggplot2)
# library(dplyr)
# library(ggrepel)
# library(reshape2)

<h1>Import data for today's exercises</h1>

To start, we'll import the data that we'll need for today's exercises. We'll be working with:

<ul>
    <li>Normalized counts matrix extracted from DESeq2</li>
    <li>DESeq2 results matrix</li>
    <li>DESeq2 shrunken log fold change results matrix</li>
    <li>Conditions matrix</li>
</ul>

In [None]:
norm_counts 
res 
shrinklfc 
conditions 

To make things easier later on, let's rename the <code>Unnamed: 0</code> column in each of our DataFrames to something more descriptive.
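Here is a minimal sketch of where the <code>Unnamed: 0</code> column comes from and how it could be renamed. It assumes the tables were exported from R as CSV files with row names. The in-memory CSV text and the new column name <code>gene</code> are made up for illustration; they are not the actual files or names used in this lesson.

```python
import io
import pandas as pd

# Stand-in for a CSV written from R with row names (hypothetical values).
# The first header field is blank, so pandas names that column 'Unnamed: 0'.
csv_text = (
    ",baseMean,log2FoldChange,padj\n"
    "GeneA,120.5,1.8,0.001\n"
    "GeneB,45.2,-2.1,0.03\n"
)

res = pd.read_csv(io.StringIO(csv_text))
print(res.columns.tolist())

# Rename the unnamed column; 'gene' is an assumed name, pick whatever is clearest
res = res.rename(columns={"Unnamed: 0": "gene"})
print(res.columns.tolist())
```

The same <code>rename()</code> call can be repeated for each DataFrame, using a name such as <code>sample</code> for the conditions matrix if its first column holds sample names.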

Now let's double check to see how these DataFrames look:

<h1>(Re)generate an MA plot</h1>

To refamiliarize ourselves with Python, we'll regenerate an MA plot from our DESeq2 results. Building it ourselves will help us better understand what we're looking at in the plot and where the values come from.

For this, we'll once again make use of <code>sns.scatterplot()</code>. <a href="https://seaborn.pydata.org/generated/seaborn.scatterplot.html" rel="noopener noreferrer" target="_blank"><u>Documentation is here.</u></a>
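As a rough sketch of what goes into an MA plot: each point is a gene, with its mean normalized count (<code>baseMean</code>) on a log-scaled X-axis and its <code>log2FoldChange</code> on the Y-axis. The toy DataFrame below is randomly generated for illustration and only stands in for the real results table.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Toy stand-in for the DESeq2 results table (values are random, not real data)
rng = np.random.default_rng(0)
res = pd.DataFrame({
    "baseMean": rng.lognormal(mean=5, sigma=2, size=200),
    "log2FoldChange": rng.normal(loc=0, scale=1.5, size=200),
})

fig, ax = plt.subplots(figsize=(6, 4))
sns.scatterplot(data=res, x="baseMean", y="log2FoldChange",
                s=10, color="grey", linewidth=0, ax=ax)
ax.set_xscale("log")                          # mean counts span orders of magnitude
ax.axhline(0, color="black", linewidth=0.8)   # reference line for "no change"
ax.set_xlabel("Mean of normalized counts")
ax.set_ylabel("log2 fold change")
```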

<h2>Plot MA plot of shrunken log fold change</h2>

For this one, we'll regenerate an MA plot of the shrunken log fold change and add a little bit more complexity by visually differentiating between significantly upregulated and significantly downregulated genes.
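One way to differentiate the genes is to add a label column and pass it to <code>hue</code>. The sketch below assumes a significance cutoff of <code>padj &lt; 0.05</code> and uses a hypothetical <code>direction</code> column with the labels <code>up</code>, <code>down</code>, and <code>ns</code>; the toy values are made up.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Toy stand-in for the shrunken log fold change table (made-up values)
shrinklfc = pd.DataFrame({
    "gene": ["g1", "g2", "g3", "g4", "g5"],
    "baseMean": [10.0, 250.0, 1200.0, 80.0, 5000.0],
    "log2FoldChange": [0.1, 2.3, -1.7, 1.9, -0.2],
    "padj": [0.8, 0.001, 0.02, 0.3, np.nan],
})

alpha = 0.05                      # assumed significance cutoff
sig = shrinklfc["padj"] < alpha   # a missing padj compares as False, so it is "ns"
shrinklfc["direction"] = np.select(
    [sig & (shrinklfc["log2FoldChange"] > 0),
     sig & (shrinklfc["log2FoldChange"] < 0)],
    ["up", "down"],
    default="ns",
)

fig, ax = plt.subplots(figsize=(6, 4))
sns.scatterplot(data=shrinklfc, x="baseMean", y="log2FoldChange",
                hue="direction", hue_order=["ns", "up", "down"],
                palette={"ns": "lightgrey", "up": "red", "down": "blue"}, ax=ax)
ax.set_xscale("log")
ax.axhline(0, color="black", linewidth=0.8)
```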

<h1>Generate a volcano plot from your DESeq2 results</h1>

Another plot that you'll commonly see accompanying differential expression analyses is the volcano plot. In volcano plots, each gene's log2 fold change is plotted along the X-axis, and its -log10(FDR) is plotted along the Y-axis (for DESeq2 results, the FDR is the <code>padj</code> column). The resulting scatter plot sort of resembles an erupting volcano, with the most significant genes higher up along the Y-axis, while genes that exhibit a greater log fold change are found further towards the negative and positive extremes of the X-axis.
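The Y-axis transformation can be sketched as follows. The column name <code>neg_log10_padj</code> is hypothetical and the values are made up. Genes with a missing <code>padj</code> are dropped here because they cannot be placed on the Y-axis.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Toy stand-in for the DESeq2 results table (made-up values)
res = pd.DataFrame({
    "gene": ["g1", "g2", "g3", "g4", "g5"],
    "log2FoldChange": [0.2, 3.1, -1.5, -2.8, 0.9],
    "padj": [0.5, 1e-8, 0.01, 1e-4, np.nan],
})

volcano = res.dropna(subset=["padj"]).copy()
volcano["neg_log10_padj"] = -np.log10(volcano["padj"])  # small padj -> large value

fig, ax = plt.subplots(figsize=(5, 4))
sns.scatterplot(data=volcano, x="log2FoldChange", y="neg_log10_padj", ax=ax)
ax.set_xlabel("log2 fold change")
ax.set_ylabel("-log10(padj)")
```

For example, a <code>padj</code> of 0.01 becomes 2 on the Y-axis, and a <code>padj</code> of 1e-8 becomes 8.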

For this plot, we'll continue to visually differentiate between upregulated and downregulated genes, and then add another layer of complexity by labeling the top 10 most significantly upregulated and top 10 most significantly downregulated genes.
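One possible way to pick and label the top genes is sketched below, assuming "most significant" means smallest <code>padj</code> among genes passing <code>padj &lt; 0.05</code>. The data are synthetic and deliberately regular so the selection is easy to follow. Labels are placed with plain <code>ax.text()</code>, which does not prevent overlaps.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Synthetic results: even-numbered genes go up, odd-numbered genes go down,
# and padj increases with the gene number (illustration only)
n = 60
res = pd.DataFrame({
    "gene": [f"gene{i}" for i in range(n)],
    "log2FoldChange": np.where(np.arange(n) % 2 == 0, 1, -1) * np.linspace(4, 0.5, n),
    "padj": np.logspace(-10, 0, n),
})
res["neg_log10_padj"] = -np.log10(res["padj"])

sig = res[res["padj"] < 0.05]
top_up = sig[sig["log2FoldChange"] > 0].nsmallest(10, "padj")
top_down = sig[sig["log2FoldChange"] < 0].nsmallest(10, "padj")

fig, ax = plt.subplots(figsize=(6, 5))
sns.scatterplot(data=res, x="log2FoldChange", y="neg_log10_padj",
                color="lightgrey", ax=ax)
# Add one text label per selected gene at its plotted position
for _, row in pd.concat([top_up, top_down]).iterrows():
    ax.text(row["log2FoldChange"], row["neg_log10_padj"], row["gene"], fontsize=7)
```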

<h1>Guided Exercise: Violin plots and box-and-whisker plots</h1>

Another way that you can visualize your RNA-seq data is to generate violin plots or box-and-whisker plots for individual genes (or sets of genes) using the normalized count matrix. The setup for either one is the same, since they are essentially different ways of visualizing how a gene's counts are distributed across your samples.

For plotting violin plots, we'll make use of <code>sns.violinplot()</code>. <a href="https://seaborn.pydata.org/generated/seaborn.violinplot.html" rel="noopener noreferrer" target="_blank"><u>Documentation is here.</u></a>

And to plot a box-and-whisker plot, we'll make use of <code>sns.boxplot()</code>. <a href="https://seaborn.pydata.org/generated/seaborn.boxplot.html" rel="noopener noreferrer" target="_blank"><u>Documentation is here.</u></a>

For this guided exercise, we can reuse the differentially expressed genes that we labeled in our volcano plot and pull out their normalized counts. First, we can do this for our top ten significantly upregulated genes (based on their padj).
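Pulling out the counts for a list of genes could look like the sketch below, which uses <code>isin()</code> to build a boolean filter. The gene names, sample names, and counts are made up, and the <code>gene</code> column name is an assumption.

```python
import pandas as pd

# Toy normalized counts matrix (made-up values); one row per gene
norm_counts = pd.DataFrame({
    "gene": ["g1", "g2", "g3", "g4"],
    "sample1": [10.0, 90.0, 500.0, 15.0],
    "sample2": [12.0, 110.0, 480.0, 20.0],
    "sample3": [11.0, 400.0, 510.0, 75.0],
    "sample4": [9.0, 380.0, 495.0, 60.0],
})

# Hypothetical list of top upregulated genes taken from the results table
top_up_genes = ["g2", "g4"]

# Keep only the rows whose gene name appears in the list
top_up_counts = norm_counts[norm_counts["gene"].isin(top_up_genes)]
print(top_up_counts)
```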

Let's take a look at the new filtered DataFrame.

Much like how DESeq2 required a conditions matrix to understand which condition each sample belonged to, we'll swap out our column headers with the information from our conditions matrix. That way, we can specify how we want to group our data later on based on which condition each sample belongs to. 
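A minimal sketch of the header swap, assuming the conditions matrix has one row per sample with hypothetical columns <code>sample</code> and <code>condition</code>. Building a lookup dictionary avoids relying on the samples being in the same order in both tables. Note that the result intentionally has repeated column names.

```python
import pandas as pd

# Toy filtered counts and conditions matrix (made-up values)
top_up_counts = pd.DataFrame({
    "gene": ["g2", "g4"],
    "sample1": [90.0, 15.0],
    "sample2": [110.0, 20.0],
    "sample3": [400.0, 75.0],
    "sample4": [380.0, 60.0],
})
conditions = pd.DataFrame({
    "sample": ["sample1", "sample2", "sample3", "sample4"],
    "condition": ["control", "control", "taz_ko", "taz_ko"],
})

# Map each sample name to its condition, then rename the matching columns
sample_to_condition = dict(zip(conditions["sample"], conditions["condition"]))
top_up_counts = top_up_counts.rename(columns=sample_to_condition)
print(top_up_counts.columns.tolist())
```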

We can then move our gene names and use them as an index. That way, when we transpose our DataFrame, the gene names will become the column headers.

Now let's transpose our DataFrame.
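The index-then-transpose step can be sketched like this, again with made-up values and an assumed <code>gene</code> column. After transposing, each row is one sample (labeled by its condition) and each column is one gene.

```python
import pandas as pd

# Toy filtered counts with condition labels as column headers (made-up values)
top_up_counts = pd.DataFrame(
    [["g2", 90.0, 110.0, 400.0, 380.0],
     ["g4", 15.0, 20.0, 75.0, 60.0]],
    columns=["gene", "control", "control", "taz_ko", "taz_ko"],
)

# Move gene names into the index, then flip rows and columns
by_gene = top_up_counts.set_index("gene").T
print(by_gene)
```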

Now let's take another look at our data:

<h2>Plot a violin plot for a single gene</h2>

Now let's set up a violin plot to take a look at a single gene first by identifying what our X-axis will be and what our Y-axis will be. Then we can begin adding additional parameters to modify the plot, and then call up specific plot attributes to pretty things up.
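A bare-bones version of a single-gene violin plot might look like the following. It assumes the condition labels are available as a regular column called <code>condition</code> (if they are in the index, <code>rename_axis("condition").reset_index()</code> is one way to get there). The gene name <code>g2</code> and its counts are made up.

```python
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Toy data: one row per sample, one column for the gene of interest
plot_df = pd.DataFrame({
    "condition": ["control"] * 3 + ["taz_ko"] * 3,
    "g2": [90.0, 110.0, 100.0, 400.0, 380.0, 420.0],
})

fig, ax = plt.subplots(figsize=(4, 4))
# X-axis: which condition a sample belongs to; Y-axis: that gene's counts
sns.violinplot(data=plot_df, x="condition", y="g2", ax=ax)
ax.set_xlabel("Condition")
ax.set_ylabel("Normalized counts")
ax.set_title("g2")
```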

<h2>Plot a box-and-whisker plot for a single gene</h2>

We can take our code for the violin plot and make modifications to the arguments that we pass to the <code>sns.boxplot()</code> function:
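As a sketch of that swap, the only required change from a violin plot is the function name, since <code>sns.boxplot()</code> accepts the same <code>data</code>, <code>x</code>, and <code>y</code> arguments. The <code>width</code> value here is just an example of a box-specific tweak, and the data are made up.

```python
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Toy data: one row per sample, one column for the gene of interest
plot_df = pd.DataFrame({
    "condition": ["control"] * 3 + ["taz_ko"] * 3,
    "g2": [90.0, 110.0, 100.0, 400.0, 380.0, 420.0],
})

fig, ax = plt.subplots(figsize=(4, 4))
sns.boxplot(data=plot_df, x="condition", y="g2", width=0.5, ax=ax)
ax.set_xlabel("Condition")
ax.set_ylabel("Normalized counts")
ax.set_title("g2")
```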

<h2>Set up to plot multiple genes on a single plot</h2>

The setup to plot multiple genes on a single violin plot or box-and-whisker plot is slightly different from plotting a single gene. The "wide-format" of our original DataFrame lets us distinguish between conditions for one gene, but to plot multiple genes, we also need to distinguish between the genes themselves. One way to do this is to convert the "wide-format" DataFrame into a "long-format" DataFrame, where all the normalized count values are contained within a single, long column, and the associated information on which condition (either control or TAZ KO) and which gene each value comes from are located in their own respective columns. With this format, each row corresponds to a single normalized count value and its "metadata".

<table style="text-align: center; margin: auto;">
    <tr>
        <th style="border: none">&nbsp;</th>
        <th style="border: 1px solid; border-color: #000000;">condition</th>
        <th style="border: 1px solid; border-color: #000000;">count</th>
        <th style="border: 1px solid; border-color: #000000;">gene</th>
    </tr>
    <tr>
        <th style="border: 1px solid; border-color: #000000;">0</th>
        <td style="border: 1px solid; border-color: #000000;">control</td>
        <td style="border: 1px solid; border-color: #000000;">100.000</td>
        <td style="border: 1px solid; border-color: #000000;">first_gene</td>
    </tr>
    <tr>
        <th style="border: 1px solid; border-color: #000000;">1</th>
        <td style="border: 1px solid; border-color: #000000;">taz_ko</td>
        <td style="border: 1px solid; border-color: #000000;">200.000</td>
        <td style="border: 1px solid; border-color: #000000;">first_gene</td>
    </tr>
    <tr>
        <th style="border: 1px solid; border-color: #000000;">2</th>
        <td style="border: 1px solid; border-color: #000000;">control</td>
        <td style="border: 1px solid; border-color: #000000;">150.000</td>
        <td style="border: 1px solid; border-color: #000000;">first_gene</td>
    </tr>
    <tr>
        <th style="border: 1px solid; border-color: #000000;">3</th>
        <td style="border: 1px solid; border-color: #000000;">taz_ko</td>
        <td style="border: 1px solid; border-color: #000000;">300.000</td>
        <td style="border: 1px solid; border-color: #000000;">first_gene</td>
    </tr>
    <tr>
        <th style="border: 1px solid; border-color: #000000;">4</th>
        <td style="border: 1px solid; border-color: #000000;">control</td>
        <td style="border: 1px solid; border-color: #000000;">400.000</td>
        <td style="border: 1px solid; border-color: #000000;">second_gene</td>
    </tr>
    <tr>
        <th style="border: 1px solid; border-color: #000000;">5</th>
        <td style="border: 1px solid; border-color: #000000;">taz_ko</td>
        <td style="border: 1px solid; border-color: #000000;">900.000</td>
        <td style="border: 1px solid; border-color: #000000;">second_gene</td>
    </tr>
    <tr>
        <th style="border: 1px solid; border-color: #000000;">6</th>
        <td style="border: 1px solid; border-color: #000000;">control</td>
        <td style="border: 1px solid; border-color: #000000;">300.000</td>
        <td style="border: 1px solid; border-color: #000000;">second_gene</td>
    </tr>
    <tr>
        <th style="border: 1px solid; border-color: #000000;">7</th>
        <td style="border: 1px solid; border-color: #000000;">taz_ko</td>
        <td style="border: 1px solid; border-color: #000000;">600.000</td>
        <td style="border: 1px solid; border-color: #000000;">second_gene</td>
    </tr>
    <tr>
        <th style="border: 1px solid; border-color: #000000;">...</th>
        <td style="border: 1px solid; border-color: #000000;">...</td>
        <td style="border: 1px solid; border-color: #000000;">...</td>
        <td style="border: 1px solid; border-color: #000000;">...</td>
    </tr>
    <tr>
        <th style="border: 1px solid; border-color: #000000;">96</th>
        <td style="border: 1px solid; border-color: #000000;">control</td>
        <td style="border: 1px solid; border-color: #000000;">50.000</td>
        <td style="border: 1px solid; border-color: #000000;">last_gene</td>
    </tr>
    <tr>
        <th style="border: 1px solid; border-color: #000000;">97</th>
        <td style="border: 1px solid; border-color: #000000;">taz_ko</td>
        <td style="border: 1px solid; border-color: #000000;">70.000</td>
        <td style="border: 1px solid; border-color: #000000;">last_gene</td>
    </tr>
    <tr>
        <th style="border: 1px solid; border-color: #000000;">98</th>
        <td style="border: 1px solid; border-color: #000000;">control</td>
        <td style="border: 1px solid; border-color: #000000;">20.000</td>
        <td style="border: 1px solid; border-color: #000000;">last_gene</td>
    </tr>
    <tr>
        <th style="border: 1px solid; border-color: #000000;">99</th>
        <td style="border: 1px solid; border-color: #000000;">taz_ko</td>
        <td style="border: 1px solid; border-color: #000000;">30.000</td>
        <td style="border: 1px solid; border-color: #000000;">last_gene</td>
    </tr>
</table>

This can kind of be thought of as flattening our DataFrame, since we're collapsing our 2D normalized count matrix into a single column, and the other columns can be thought of as extra information on where the values came from, so that we can distinguish between genes and conditions.

First, let's take another look at our normalized counts. For this example, we're interested in just our top upregulated genes.

To flatten our DataFrame, we can make use of the function <code>pd.melt()</code> which will allow us to convert the format of our DataFrame from a "wide-format" to a "long-format".

<a href="https://pandas.pydata.org/docs/reference/api/pandas.melt.html" rel="noopener noreferrer" target="_blank"><u>Documentation for <code>pd.melt()</code> is here.</u></a>
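A small sketch of the wide-to-long conversion with <code>pd.melt()</code>, assuming a transposed table with condition labels in the index and one column per gene. The values are made up, and the output column names <code>condition</code>, <code>count</code>, and <code>gene</code> follow the example table above.

```python
import pandas as pd

# Toy wide-format table: rows are samples (labeled by condition), columns are genes
by_gene = pd.DataFrame(
    {"g2": [90.0, 110.0, 400.0, 380.0], "g4": [15.0, 20.0, 75.0, 60.0]},
    index=["control", "control", "taz_ko", "taz_ko"],
)

# Turn the condition labels into a regular column so melt() can keep them
wide = by_gene.rename_axis("condition").reset_index()

# Every gene column is stacked into one 'count' column; 'gene' records its origin
long_df = pd.melt(wide, id_vars="condition", var_name="gene", value_name="count")
long_df = long_df[["condition", "count", "gene"]]
print(long_df)
```

With 4 samples and 2 genes, the long-format table has 8 rows, one per normalized count value.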

<h2>Plot violin plot for upregulated genes</h2>

We can make use of the same code that we used before to plot a violin plot for one gene with slight modifications to have it plot multiple genes together on a single plot.
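With a long-format table, one way to do this is to put the gene on the X-axis and use <code>hue</code> to separate the conditions. The long-format DataFrame below is built by hand with made-up values and assumed column names <code>condition</code>, <code>count</code>, and <code>gene</code>.

```python
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Toy long-format table: one row per normalized count value
long_df = pd.DataFrame({
    "condition": (["control"] * 3 + ["taz_ko"] * 3) * 2,
    "count": [90.0, 110.0, 100.0, 400.0, 380.0, 420.0,
              15.0, 20.0, 18.0, 75.0, 60.0, 70.0],
    "gene": ["g2"] * 6 + ["g4"] * 6,
})

fig, ax = plt.subplots(figsize=(6, 4))
# One pair of violins per gene, colored by condition
sns.violinplot(data=long_df, x="gene", y="count", hue="condition", ax=ax)
ax.set_xlabel("Gene")
ax.set_ylabel("Normalized counts")
```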

<h2>Plot a box-and-whisker plot for upregulated genes</h2>

We can similarly modify our box-and-whisker plot code to have it plot multiple genes on the same plot:
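A comparable sketch for the box-and-whisker version is shown below, using the same assumed long-format columns and made-up values. The log-scaled Y-axis is an optional extra that can help when genes differ widely in expression level.

```python
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Toy long-format table: one row per normalized count value
long_df = pd.DataFrame({
    "condition": (["control"] * 3 + ["taz_ko"] * 3) * 2,
    "count": [90.0, 110.0, 100.0, 400.0, 380.0, 420.0,
              15.0, 20.0, 18.0, 75.0, 60.0, 70.0],
    "gene": ["g2"] * 6 + ["g4"] * 6,
})

fig, ax = plt.subplots(figsize=(6, 4))
sns.boxplot(data=long_df, x="gene", y="count", hue="condition", ax=ax)
ax.set_yscale("log")   # optional: keeps low-count genes readable
ax.set_xlabel("Gene")
ax.set_ylabel("Normalized counts")
```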