<h1 style="font-size: 40px; margin-bottom: 0px;">13.1 Exploring DESeq2 results</h1>

<hr style="margin-left: 0px; border: 0.25px solid; border-color: #000000; width: 98%;"></hr>

In notebook 12-2, we ran DESeq2 on our class dataset and obtained the results of our differential expression analysis. Today, we'll explore those results, looking at them in aggregate and pulling out genes that look interesting enough to examine in more detail. We'll first rebuild the MA plot in Python, breaking it down into smaller steps to better understand what happens under the hood when we call DESeq2's <code>plotMA()</code> function.

Then we'll set up a volcano plot, which Dr. Ingolia introduced in his differential expression analysis lecture, so that we can visualize the overall transcriptomic changes by plotting the fold change and significance for each gene.  

We'll then pull out specific genes and look at them in more depth, generating violin plots and box-and-whisker plots to see how TAZ KO alters their expression.

<strong>Learning objectives:</strong>

<ul>
    <li>Navigate differential expression results</li>
    <li>Practice working with data in Python</li>
    <li>Practice data visualization</li>
    <ul>
        <li>MA plot</li>
        <li>Volcano plot</li>
        <li>Violin plot</li>
        <li>Box-and-whisker plot</li>
    </ul>
</ul>

<h1 style="font-size: 40px; margin-bottom: 0px;">Load in packages</h1>

<hr style="margin-left: 0px; border: 0.25px solid; border-color: #000000; width: 98%;"></hr>

For today's plotting, we'll load a package called <code>adjustText</code>, which makes it easier to label many data points without working out the position of each label by hand. That way, we can pull the relevant information straight from our dataset for plotting and visualization. We specify the text objects to plot, and <code>adjustText</code> iterates through possible positions to space the labels apart and can add connecting lines back to the data points being annotated. Specifically, we'll make use of its function <code>adjust_text()</code>.

<a href="https://adjusttext.readthedocs.io/en/latest/" rel="noopener noreferrer"><u>Documentation for <code>adjust_text()</code> can be found here.</u></a>

Since this isn't installed in our Biology Hub, we can go ahead and install it using our notebook.

In [None]:
%pip install adjustText

Then let's go ahead and import the packages that we'll use for this notebook.

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import adjustText

<h1 style="font-size: 40px; margin-bottom: 0px;">Import data for today's notebook</h1>

<hr style="margin-left: 0px; border: 0.25px solid; border-color: #000000; width: 98%;"></hr>

To start, we'll import the data that we'll need for today's notebook exercises. We'll be working with:

<ul>
    <li>Normalized counts matrix extracted from DESeq2</li>
    <li>DESeq2 results matrix</li>
    <li>DESeq2 shrunken log fold change results matrix</li>
    <li>Conditions matrix</li>
</ul>

To make things easier later on, let's update the first column name to be <code>'gene'</code> for our DataFrames.
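As a minimal sketch of the renaming step, the example below reads a small made-up table from an in-memory string instead of a file, since the actual file names for this notebook aren't shown here. The gene names and values are placeholders. The same <code>rename()</code> call can be applied to each of the DataFrames.

```python
import io
import pandas as pd

# Stand-in for a DESeq2 results CSV; the real file name and contents will differ.
csv_text = (
    ",baseMean,log2FoldChange,padj\n"
    "GENE_A,1200.5,2.1,0.0004\n"
    "GENE_B,35.2,-0.3,0.81\n"
    "GENE_C,410.0,-1.7,0.012\n"
)

# R writes row names into an unnamed first column, which pandas reads as 'Unnamed: 0'.
results = pd.read_csv(io.StringIO(csv_text))

# Rename whatever the first column is called to 'gene'.
results = results.rename(columns={results.columns[0]: "gene"})
print(results.columns.tolist())
```

Renaming by position (<code>results.columns[0]</code>) avoids having to know what the first column happened to be called in each file.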

<h1 style="font-size: 40px; margin-bottom: 0px;">Exercise #1: Generate MA plots with seaborn</h1>

<hr style="margin-left: 0px; border: 0.25px solid; border-color: #000000; width: 98%;"></hr>

To refamiliarize ourselves with Python, we'll regenerate MA plots using our results from the DESeq2 dataset to help us better understand what we're looking at in the plot and where the values are coming from.

For this exercise, see if you can plot an MA plot containing the following:
<ul>
    <li>non-significant genes in gray</li>
    <li>upregulated genes highlighted in red</li>
    <li>downregulated genes highlighted in blue</li>
    <li>a dashed horizontal line at y=0 (<a href="https://matplotlib.org/stable/api/_as_gen/matplotlib.pyplot.axhline.html" rel="noopener noreferrer"><u>you can use <code>plt.axhline()</code></u></a>)</li>
    <li>TAZ's data point highlighted in green</li>
    <li>an annotation labeling TAZ</li>
</ul>
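If you get stuck, here is one possible approach, sketched on randomly generated data rather than the class results. The column names (<code>baseMean</code>, <code>log2FoldChange</code>, <code>padj</code>) follow DESeq2's usual output; the <code>padj &lt; 0.05</code> cutoff, the <code>status</code> column, and the values given to the TAZ row are assumptions made for illustration.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Made-up results table that borrows DESeq2's usual column names.
rng = np.random.default_rng(0)
n = 500
res = pd.DataFrame({
    "gene": ["TAZ"] + [f"GENE_{i}" for i in range(1, n)],
    "baseMean": rng.lognormal(mean=5, sigma=2, size=n),
    "log2FoldChange": rng.normal(0, 1.5, size=n),
    "padj": rng.uniform(0, 1, size=n) ** 3,
})
# Give the made-up TAZ row a strong, significant decrease.
res.loc[0, ["baseMean", "log2FoldChange", "padj"]] = [800.0, -4.0, 1e-30]

# Label each gene; rows with a missing padj compare as False and stay gray.
sig = res["padj"] < 0.05  # assumed cutoff
res["status"] = "not significant"
res.loc[sig & (res["log2FoldChange"] > 0), "status"] = "up"
res.loc[sig & (res["log2FoldChange"] < 0), "status"] = "down"

palette = {"not significant": "gray", "up": "red", "down": "blue"}
fig, ax = plt.subplots(figsize=(6, 4))
sns.scatterplot(data=res, x="baseMean", y="log2FoldChange", hue="status",
                palette=palette, s=10, linewidth=0, ax=ax)
ax.set_xscale("log")  # mean counts span several orders of magnitude
ax.axhline(0, color="black", linestyle="--", linewidth=1)

# Highlight and label TAZ on top of the other points.
taz = res[res["gene"] == "TAZ"].iloc[0]
ax.scatter(taz["baseMean"], taz["log2FoldChange"], color="green", s=40, zorder=3)
ax.annotate("TAZ", xy=(taz["baseMean"], taz["log2FoldChange"]),
            xytext=(10, 10), textcoords="offset points",
            arrowprops=dict(arrowstyle="-", color="black"))
ax.set_xlabel("mean of normalized counts")
ax.set_ylabel("log2 fold change")
```

Storing the category in its own column keeps the plotting call short, since seaborn can color the points directly from <code>hue="status"</code>.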

<h2>Plot MA plot of shrunken log2 fold change</h2>

For this, see what adjustments you'll need to make to your code from the previous MA plot to visualize the shrunken log2 fold change.
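One way to think about the adjustment: if the plotting code is wrapped in a function, the only thing that changes is which DataFrame is passed in. The sketch below uses made-up data, and the "shrunken" values are a crude stand-in that only mimics the visual effect of shrinkage; it is not what DESeq2 computes. The helper name <code>ma_plot()</code> is hypothetical.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

PALETTE = {"not significant": "gray", "up": "red", "down": "blue"}

def ma_plot(df, ax, title):
    """Draw an MA plot of one results table on the given axes."""
    sig = df["padj"] < 0.05  # assumed cutoff
    status = np.select(
        [sig & (df["log2FoldChange"] > 0), sig & (df["log2FoldChange"] < 0)],
        ["up", "down"],
        default="not significant",
    )
    sns.scatterplot(x=df["baseMean"], y=df["log2FoldChange"], hue=status,
                    palette=PALETTE, s=10, linewidth=0, legend=False, ax=ax)
    ax.set_xscale("log")
    ax.axhline(0, color="black", linestyle="--", linewidth=1)
    ax.set_title(title)

# Made-up unshrunken results.
rng = np.random.default_rng(1)
n = 300
raw = pd.DataFrame({
    "baseMean": rng.lognormal(5, 2, n),
    "log2FoldChange": rng.normal(0, 1.5, n),
    "padj": rng.uniform(0, 1, n) ** 3,
})

# Crude stand-in for shrinkage: pull low-count genes toward zero.
shrunk = raw.copy()
shrunk["log2FoldChange"] = (
    raw["log2FoldChange"] * raw["baseMean"] / (raw["baseMean"] + 50)
)

# Same function, two tables, shared y-axis for a fair comparison.
fig, axes = plt.subplots(1, 2, figsize=(10, 4), sharey=True)
ma_plot(raw, axes[0], "unshrunken")
ma_plot(shrunk, axes[1], "shrunken")
```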

<h1 style="font-size: 40px; margin-bottom: 0px;">Exercise #2: Generate a volcano plot from your results</h1>

<hr style="margin-left: 0px; border: 0.25px solid; border-color: #000000; width: 98%;"></hr>

Another plot that commonly accompanies differential expression analyses is the volcano plot. In a volcano plot, each gene's log2 fold change is plotted along the x-axis, and its -log10(padj) is plotted along the y-axis. The resulting scatterplot resembles an erupting volcano: the most significant genes sit higher up the y-axis, while genes with a larger log2 fold change sit further toward the negative and positive extremes of the x-axis.

For this exercise, take what you know from setting up your MA plots to now generate a volcano plot that includes the following:
<ul>
    <li>non-significant genes in gray</li>
    <li>upregulated genes highlighted in red</li>
    <li>downregulated genes highlighted in blue</li>
    <li>dashed lines demarcating the following:</li>
    <ul>
        <li>a log2 fold change of -1</li>
        <li>a log2 fold change of +1</li>
        <li>a padj of 0.05</li>
    </ul>
    <li>Top 10 upregulated genes annotated based on their significance</li>
    <li>Top 10 downregulated genes annotated based on their significance</li>
</ul>
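As a hint, here is a sketch of one possible layout on randomly generated data. It assumes that "upregulated" and "downregulated" mean passing both the padj and the fold-change cutoffs, which is only one reasonable reading of the exercise. The <code>neg_log10_padj</code> and <code>status</code> columns are names made up for this example. The <code>adjust_text()</code> call is wrapped in <code>try</code>/<code>except</code> so the sketch still runs where <code>adjustText</code> is not installed.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Made-up results table.
rng = np.random.default_rng(2)
n = 500
res = pd.DataFrame({
    "gene": [f"GENE_{i}" for i in range(n)],
    "log2FoldChange": rng.normal(0, 1.5, size=n),
    "padj": rng.uniform(0, 1, size=n) ** 3,
})
res["neg_log10_padj"] = -np.log10(res["padj"])

# Assumed definition: significant AND beyond a log2 fold change of +/-1.
sig = res["padj"] < 0.05
res["status"] = "not significant"
res.loc[sig & (res["log2FoldChange"] > 1), "status"] = "up"
res.loc[sig & (res["log2FoldChange"] < -1), "status"] = "down"

palette = {"not significant": "gray", "up": "red", "down": "blue"}
fig, ax = plt.subplots(figsize=(6, 5))
sns.scatterplot(data=res, x="log2FoldChange", y="neg_log10_padj", hue="status",
                palette=palette, s=10, linewidth=0, ax=ax)

# Dashed threshold lines.
for x in (-1, 1):
    ax.axvline(x, color="black", linestyle="--", linewidth=1)
ax.axhline(-np.log10(0.05), color="black", linestyle="--", linewidth=1)

# Ten most significant genes in each direction (smallest padj first).
top_up = res[res["status"] == "up"].nsmallest(10, "padj")
top_down = res[res["status"] == "down"].nsmallest(10, "padj")
texts = [ax.text(row.log2FoldChange, row.neg_log10_padj, row.gene, fontsize=7)
         for row in pd.concat([top_up, top_down]).itertuples()]

try:
    from adjustText import adjust_text
    adjust_text(texts, ax=ax,
                arrowprops=dict(arrowstyle="-", color="black", lw=0.5))
except ImportError:
    pass  # without adjustText, labels simply stay at their data points

ax.set_xlabel("log2 fold change")
ax.set_ylabel("-log10(padj)")
```

Note that a padj of exactly 0 would give an infinite y value; real results may need those genes handled separately.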

<h1 style="font-size: 40px; margin-bottom: 0px;">Violin plots and box-and-whisker plots</h1>

<hr style="margin-left: 0px; border: 0.25px solid; border-color: #000000; width: 98%;"></hr>

Another way to visualize your RNA-seq data is to generate violin plots or box-and-whisker plots for individual genes (or sets of genes) using the normalized counts. The setup for both is the same, since they are two different ways of visualizing the distribution of a gene's normalized counts across samples.

For plotting violin plots, we'll make use of <code>sns.violinplot()</code>. <a href="https://seaborn.pydata.org/generated/seaborn.violinplot.html" rel="noopener noreferrer"><u>Documentation for <code>sns.violinplot()</code> can be found here.</u></a>

And to plot a box-and-whisker plot, we'll make use of <code>sns.boxplot()</code>. <a href="https://seaborn.pydata.org/generated/seaborn.boxplot.html" rel="noopener noreferrer"><u>Documentation for <code>sns.boxplot()</code> can be found here.</u></a>

For this guided exercise, we can continue to make use of the genes that were identified to be differentially expressed that we labeled in our volcano plot and use them to pull out the associated normalized counts. First we can do this for our top ten significantly upregulated genes (based on their <code>'padj'</code>).
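A minimal sketch of this filtering step is shown below on tiny made-up tables. The gene names, sample column names, and values are placeholders, and <code>n_top</code> is set to 2 only because the example table is so small (it would be 10 here).

```python
import pandas as pd

# Made-up DESeq2-style results and normalized counts.
results = pd.DataFrame({
    "gene": ["GENE_A", "GENE_B", "GENE_C", "GENE_D", "GENE_E"],
    "log2FoldChange": [2.5, 1.8, -2.0, 3.1, 0.2],
    "padj": [1e-8, 1e-3, 1e-9, 1e-5, 0.9],
})
norm_counts = pd.DataFrame({
    "gene": ["GENE_A", "GENE_B", "GENE_C", "GENE_D", "GENE_E"],
    "ctrl_1": [100.0, 50.0, 800.0, 400.0, 30.0],
    "ctrl_2": [150.0, 60.0, 900.0, 300.0, 35.0],
    "tazko_1": [200.0, 150.0, 200.0, 900.0, 33.0],
    "tazko_2": [300.0, 170.0, 250.0, 600.0, 31.0],
})

n_top = 2  # 10 in the notebook

# Keep significant genes that go up, then take the smallest padj values.
up = results[(results["padj"] < 0.05) & (results["log2FoldChange"] > 0)]
top_genes = up.nsmallest(n_top, "padj")["gene"]

# Pull the matching rows out of the normalized counts.
top_counts = norm_counts[norm_counts["gene"].isin(top_genes)]
print(top_counts)
```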

Let's take a look at the new filtered DataFrame.

Much like how DESeq2 required a conditions matrix to understand which condition each sample belonged to, we'll swap out our column headers with the information from our conditions matrix. That way, we can specify how we want to group our data later on based on which condition each sample belongs to. 
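One way to do the swap is to build a sample-to-condition lookup from the conditions matrix and pass it to <code>rename()</code>. The sketch below assumes the conditions matrix has columns called <code>sample</code> and <code>condition</code>; those names, like the sample names, are placeholders.

```python
import pandas as pd

# Made-up filtered counts and conditions matrix.
top_counts = pd.DataFrame({
    "gene": ["GENE_A", "GENE_D"],
    "ctrl_1": [100.0, 400.0], "ctrl_2": [150.0, 300.0],
    "tazko_1": [200.0, 900.0], "tazko_2": [300.0, 600.0],
})
conditions = pd.DataFrame({
    "sample": ["ctrl_1", "ctrl_2", "tazko_1", "tazko_2"],
    "condition": ["control", "control", "tazko", "tazko"],
})

# Map each sample name to its condition, then relabel the columns.
sample_to_condition = dict(zip(conditions["sample"], conditions["condition"]))
relabeled = top_counts.rename(columns=sample_to_condition)
print(relabeled.columns.tolist())
```

Mapping by sample name, rather than assigning a new list of headers by position, still works if the two tables list the samples in a different order. The <code>gene</code> column is untouched because it is not in the lookup.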

We can then move our gene names and use them as an index. That way, when we transpose our DataFrame, the gene names will become the column headers.

Now let's transpose our DataFrame.
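The two steps above can be sketched as follows on a small made-up table. The final <code>reset_index()</code> step is an optional extra, not something the notebook requires: it turns the condition labels into an ordinary column, which can be convenient for plotting because the transposed index contains repeated labels.

```python
import pandas as pd

# Made-up counts whose sample columns were already relabeled by condition.
relabeled = pd.DataFrame(
    [["GENE_A", 100.0, 150.0, 200.0, 300.0],
     ["GENE_D", 400.0, 300.0, 900.0, 600.0]],
    columns=["gene", "control", "control", "tazko", "tazko"],
)

# Gene names become the index, so after transposing they are the column headers.
wide = relabeled.set_index("gene").T
print(wide)

# Optional: make the condition labels a regular column.
tidy = wide.rename_axis("condition").reset_index()
print(tidy)
```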

Now let's take another look at our data:

<h2>Plot a violin plot for a single gene</h2>

Now let's set up a violin plot to take a look at a single gene first by identifying what our X-axis will be and what our Y-axis will be. Then we can begin adding additional parameters to modify the plot, and then call up specific plot attributes to pretty things up.
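Here is a minimal sketch of that sequence, using a made-up gene called <code>GENE_A</code> with four samples per condition. It assumes the conditions are stored in a column named <code>condition</code>; if they are in the index instead, the x values would come from the index.

```python
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Made-up normalized counts for one gene.
tidy = pd.DataFrame({
    "condition": ["control"] * 4 + ["tazko"] * 4,
    "GENE_A": [100.0, 150.0, 120.0, 135.0, 200.0, 300.0, 260.0, 240.0],
})

fig, ax = plt.subplots(figsize=(4, 4))
sns.violinplot(
    data=tidy,
    x="condition",                 # one violin per condition
    y="GENE_A",                    # normalized counts for the gene
    order=["control", "tazko"],
    inner="point",                 # show each sample inside its violin
    cut=0,                         # do not extend the violin past the data range
    ax=ax,
)

# Tidy up the labels.
ax.set_xlabel("")
ax.set_ylabel("normalized counts")
ax.set_title("GENE_A")
```

With only a handful of samples per condition, the violin shape is a smoothed estimate, so showing the individual points alongside it helps keep the plot honest.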

<h2>Plot a box-and-whisker plot for a single gene</h2>

We can take our code for the violin plot and make modifications to the arguments that we pass to the <code>sns.boxplot()</code> function:
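A sketch of the swap, on the same kind of made-up single-gene table (column names are placeholders). Violin-specific arguments such as <code>inner</code> and <code>cut</code> are dropped, since <code>sns.boxplot()</code> does not use them.

```python
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Made-up normalized counts for one gene.
tidy = pd.DataFrame({
    "condition": ["control"] * 4 + ["tazko"] * 4,
    "GENE_A": [100.0, 150.0, 120.0, 135.0, 200.0, 300.0, 260.0, 240.0],
})

fig, ax = plt.subplots(figsize=(4, 4))
sns.boxplot(data=tidy, x="condition", y="GENE_A",
            order=["control", "tazko"], width=0.5, ax=ax)
ax.set_xlabel("")
ax.set_ylabel("normalized counts")
ax.set_title("GENE_A")

# The line inside each box marks the median of that condition.
medians = tidy.groupby("condition")["GENE_A"].median()
print(medians)
```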

<h2>Swarmplot for individual genes</h2>

Like with our earlier notebooks, we can also plot our data as a swarmplot with overlaid annotations.
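The earlier notebooks aren't reproduced here, so the sketch below assumes that the overlaid annotation is a short horizontal line marking each condition's mean. The data and column names are made up.

```python
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Made-up normalized counts for one gene.
tidy = pd.DataFrame({
    "condition": ["control"] * 4 + ["tazko"] * 4,
    "GENE_A": [100.0, 150.0, 120.0, 135.0, 200.0, 300.0, 260.0, 240.0],
})
order = ["control", "tazko"]

fig, ax = plt.subplots(figsize=(4, 4))
sns.swarmplot(data=tidy, x="condition", y="GENE_A", order=order,
              color="black", size=6, ax=ax)

# Categories sit at x = 0, 1, ... in the order given, so a short
# horizontal line centered on each position marks that condition's mean.
means = tidy.groupby("condition")["GENE_A"].mean()
for i, cond in enumerate(order):
    ax.hlines(means[cond], i - 0.2, i + 0.2, color="red", linewidth=2)

ax.set_xlabel("")
ax.set_ylabel("normalized counts")
ax.set_title("GENE_A")
```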

<h2>Setup for plotting multiple genes on a single plot</h2>

The setup for plotting multiple genes on a single violin plot or box-and-whisker plot is slightly different from plotting a single gene. The "wide-format" of our original DataFrame lets us distinguish between conditions for a single gene, but to plot multiple genes we also need to distinguish between the genes themselves. One way to do this is to convert the "wide-format" DataFrame into a "long-format" DataFrame, where all the normalized count values are contained within a single, long column, and the associated information on which condition (either control or TAZ KO) and which gene each value comes from is stored in its own column. In this format, each row corresponds to a single normalized count value and its "metadata".

<table style="text-align: center; margin: auto;">
    <tr>
        <th style="border: none">&nbsp;</th>
        <th style="border: 1px solid; border-color: #000000;">condition</th>
        <th style="border: 1px solid; border-color: #000000;">count</th>
        <th style="border: 1px solid; border-color: #000000;">gene</th>
    </tr>
    <tr>
        <th style="border: 1px solid; border-color: #000000;">0</th>
        <td style="border: 1px solid; border-color: #000000;">control</td>
        <td style="border: 1px solid; border-color: #000000;">100.000</td>
        <td style="border: 1px solid; border-color: #000000;">first_gene</td>
    </tr>
    <tr>
        <th style="border: 1px solid; border-color: #000000;">1</th>
        <td style="border: 1px solid; border-color: #000000;">tazko</td>
        <td style="border: 1px solid; border-color: #000000;">200.000</td>
        <td style="border: 1px solid; border-color: #000000;">first_gene</td>
    </tr>
    <tr>
        <th style="border: 1px solid; border-color: #000000;">2</th>
        <td style="border: 1px solid; border-color: #000000;">control</td>
        <td style="border: 1px solid; border-color: #000000;">150.000</td>
        <td style="border: 1px solid; border-color: #000000;">first_gene</td>
    </tr>
    <tr>
        <th style="border: 1px solid; border-color: #000000;">3</th>
        <td style="border: 1px solid; border-color: #000000;">tazko</td>
        <td style="border: 1px solid; border-color: #000000;">300.000</td>
        <td style="border: 1px solid; border-color: #000000;">first_gene</td>
    </tr>
    <tr>
        <th style="border: 1px solid; border-color: #000000;">4</th>
        <td style="border: 1px solid; border-color: #000000;">control</td>
        <td style="border: 1px solid; border-color: #000000;">400.000</td>
        <td style="border: 1px solid; border-color: #000000;">second_gene</td>
    </tr>
    <tr>
        <th style="border: 1px solid; border-color: #000000;">5</th>
        <td style="border: 1px solid; border-color: #000000;">tazko</td>
        <td style="border: 1px solid; border-color: #000000;">900.000</td>
        <td style="border: 1px solid; border-color: #000000;">second_gene</td>
    </tr>
    <tr>
        <th style="border: 1px solid; border-color: #000000;">6</th>
        <td style="border: 1px solid; border-color: #000000;">control</td>
        <td style="border: 1px solid; border-color: #000000;">300.000</td>
        <td style="border: 1px solid; border-color: #000000;">second_gene</td>
    </tr>
    <tr>
        <th style="border: 1px solid; border-color: #000000;">7</th>
        <td style="border: 1px solid; border-color: #000000;">tazko</td>
        <td style="border: 1px solid; border-color: #000000;">600.000</td>
        <td style="border: 1px solid; border-color: #000000;">second_gene</td>
    </tr>
    <tr>
        <th style="border: 1px solid; border-color: #000000;">...</th>
        <td style="border: 1px solid; border-color: #000000;">...</td>
        <td style="border: 1px solid; border-color: #000000;">...</td>
        <td style="border: 1px solid; border-color: #000000;">...</td>
    </tr>
    <tr>
        <th style="border: 1px solid; border-color: #000000;">96</th>
        <td style="border: 1px solid; border-color: #000000;">control</td>
        <td style="border: 1px solid; border-color: #000000;">50.000</td>
        <td style="border: 1px solid; border-color: #000000;">last_gene</td>
    </tr>
    <tr>
        <th style="border: 1px solid; border-color: #000000;">97</th>
        <td style="border: 1px solid; border-color: #000000;">tazko</td>
        <td style="border: 1px solid; border-color: #000000;">70.000</td>
        <td style="border: 1px solid; border-color: #000000;">last_gene</td>
    </tr>
    <tr>
        <th style="border: 1px solid; border-color: #000000;">98</th>
        <td style="border: 1px solid; border-color: #000000;">control</td>
        <td style="border: 1px solid; border-color: #000000;">20.000</td>
        <td style="border: 1px solid; border-color: #000000;">last_gene</td>
    </tr>
    <tr>
        <th style="border: 1px solid; border-color: #000000;">99</th>
        <td style="border: 1px solid; border-color: #000000;">tazko</td>
        <td style="border: 1px solid; border-color: #000000;">30.000</td>
        <td style="border: 1px solid; border-color: #000000;">last_gene</td>
    </tr>
</table>

This can be thought of as flattening, or melting, our DataFrame: we collapse our 2D normalized count matrix into a single column, and the other columns record where each value came from, so that we can distinguish between genes and conditions.

First, let's take another look at our normalized counts. For this example, we're interested in just our top upregulated genes.

To flatten our DataFrame, we can make use of the function <code>pd.melt()</code> which will allow us to convert the format of our DataFrame from a "wide-format" to a "long-format".

<a href="https://pandas.pydata.org/docs/reference/api/pandas.melt.html" rel="noopener noreferrer"><u>Documentation for <code>pd.melt()</code> is here.</u></a>
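As a small illustration, the sketch below melts a made-up wide table that reuses the first eight values from the example table above. It assumes the conditions are stored in a column named <code>condition</code>, which is kept as an identifier while every gene column is stacked into <code>count</code>.

```python
import pandas as pd

# Made-up wide-format table: one row per sample, one column per gene.
tidy = pd.DataFrame({
    "condition": ["control", "tazko", "control", "tazko"],
    "first_gene": [100.0, 200.0, 150.0, 300.0],
    "second_gene": [400.0, 900.0, 300.0, 600.0],
})

long_df = pd.melt(
    tidy,
    id_vars="condition",   # column(s) to keep as identifiers
    var_name="gene",       # new column holding the old column headers
    value_name="count",    # new column holding the values
)

# Reorder the columns to match the table above.
long_df = long_df[["condition", "count", "gene"]]
print(long_df)
```

Each of the 4 rows in the wide table contributes one row per gene, so 4 samples and 2 genes give 8 rows.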

<h2>Plot violin plot for upregulated genes</h2>

We can make use of the same code that we used before to plot a violin plot for one gene with slight modifications to have it plot multiple genes together on a single plot.
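The main change is that the gene now goes on the x-axis and the condition moves to <code>hue</code>. Here is a sketch on a made-up long-format table with three hypothetical genes and four samples per condition:

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Build a made-up long-format table: one row per normalized count value.
rng = np.random.default_rng(3)
rows = []
for gene, base in [("GENE_A", 100), ("GENE_B", 400), ("GENE_C", 50)]:
    for condition, factor in [("control", 1.0), ("tazko", 2.0)]:
        for value in rng.normal(base * factor, base * 0.1, size=4):
            rows.append({"condition": condition, "count": value, "gene": gene})
long_df = pd.DataFrame(rows)

fig, ax = plt.subplots(figsize=(7, 4))
sns.violinplot(
    data=long_df,
    x="gene",                        # one group of violins per gene
    y="count",
    hue="condition",                 # one violin per condition within each gene
    hue_order=["control", "tazko"],
    cut=0,
    ax=ax,
)
ax.set_xlabel("")
ax.set_ylabel("normalized counts")
```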

<h2>Plot a box-and-whisker plot for upregulated genes</h2>

We can similarly modify our box-and-whisker plot code to have it plot multiple genes on the same plot:
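A sketch of the box-and-whisker version on the same kind of made-up long-format table. The log-scaled y-axis is an optional choice, not something the exercise asks for; it can help when the genes being compared are expressed at very different levels.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Made-up long-format table: one row per normalized count value.
rng = np.random.default_rng(4)
rows = []
for gene, base in [("GENE_A", 100), ("GENE_B", 4000), ("GENE_C", 50)]:
    for condition, factor in [("control", 1.0), ("tazko", 2.0)]:
        for value in rng.normal(base * factor, base * 0.1, size=4):
            rows.append({"condition": condition, "count": value, "gene": gene})
long_df = pd.DataFrame(rows)

fig, ax = plt.subplots(figsize=(7, 4))
sns.boxplot(data=long_df, x="gene", y="count", hue="condition",
            hue_order=["control", "tazko"], ax=ax)

# Optional: a log scale keeps low-count genes readable next to high-count ones.
ax.set_yscale("log")
ax.set_xlabel("")
ax.set_ylabel("normalized counts")
```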

<h2>Plot an annotated swarmplot for our upregulated genes</h2>

Now, let's go ahead and use our long-format DataFrame for our annotated swarmplot to visualize our set of top 10 upregulated genes.
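As a sketch of one way to set this up, the example below uses a made-up long-format table with two hypothetical genes. The annotation chosen here, the ratio of the mean TAZ KO count to the mean control count written above each gene, is an assumption; any other label could be placed the same way.

```python
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Made-up long-format table with two genes and two samples per condition.
long_df = pd.DataFrame({
    "condition": ["control", "tazko"] * 4,
    "count": [100.0, 200.0, 150.0, 300.0, 400.0, 900.0, 300.0, 600.0],
    "gene": ["first_gene"] * 4 + ["second_gene"] * 4,
})
genes = list(long_df["gene"].unique())

fig, ax = plt.subplots(figsize=(5, 4))
sns.swarmplot(data=long_df, x="gene", y="count", hue="condition",
              order=genes, hue_order=["control", "tazko"],
              dodge=True,  # separate the two conditions within each gene
              ax=ax)

# Mean count per gene and condition, then the TAZ KO / control ratio.
means = long_df.groupby(["gene", "condition"])["count"].mean().unstack("condition")
ratio = means["tazko"] / means["control"]

# Genes sit at x = 0, 1, ... in the order given, so place each label there.
for i, gene in enumerate(genes):
    top = long_df.loc[long_df["gene"] == gene, "count"].max()
    ax.text(i, top * 1.05, f"{ratio[gene]:.1f}x", ha="center", va="bottom")

ax.set_ylim(0, long_df["count"].max() * 1.2)  # leave room for the labels
ax.set_xlabel("")
ax.set_ylabel("normalized counts")
```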