<h1 style="font-size: 40px; margin-bottom: 0px;">11.1 Prepare matrices for DESeq2</h1>

<hr style="margin-left: 0px; border: 0.25px solid; border-color: #000000; width: 98%;"></hr>

To dig deeper into our RNA-seq data, we'll now prepare the class dataset: first we'll merge each group's count matrix into a single data matrix, then we'll create a conditions matrix that serves as the metadata for that dataset. We'll combine what we covered in notebook 10-2 with this notebook into Python scripts that can be integrated into our RNA-seq analysis pipeline.

<strong>Learning objectives:</strong>

<ul>
    <li>Review counts quality control using a "bad" counts setup</li>
    <li>Practice setting up Python scripts</li>
    <li>Integrate a Python script into your RNA-seq pipeline</li>
</ul>

<h1 style="font-size: 40px; margin-bottom: 0px;">Prepare a set of "bad" counts</h1>

<hr style="margin-left: 0px; border: 0.25px solid; border-color: #000000; width: 98%;"></hr>

To see what the output looks like when <code>htseq-count</code> is set up incorrectly, we'll start today's lesson by running <code>htseq-count</code> with the wrong strandedness for our library. This should produce an unusual count matrix in which a large number of reads are unassigned (not counted).

In [None]:
%%bash

We'll then separately analyze this bad counts file to see what the output looks like when we mix up the strandedness of our RNA-seq dataset.
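
As a rough way to put a number on "a large number of reads unassigned", the sketch below separates gene rows from the special counters that <code>htseq-count</code> appends to its output (rows whose names start with <code>__</code>, such as <code>__no_feature</code>). The example text, the single-column layout without a header, and the helper name <code>assigned_fraction</code> are assumptions made for illustration only.

```python
import csv
import io

def assigned_fraction(handle):
    """Return the fraction of reads assigned to genes in htseq-count output.

    Assumes tab-separated rows of name<TAB>count, with no header and a
    single counts column (a hypothetical layout for this sketch).
    """
    assigned = 0
    unassigned = 0
    for row in csv.reader(handle, delimiter="\t"):
        if not row:
            continue  # skip blank lines
        name, count = row[0], int(row[1])
        if name.startswith("__"):
            unassigned += count  # special counters, e.g. __no_feature
        else:
            assigned += count    # reads counted toward a gene
    total = assigned + unassigned
    return assigned / total if total else 0.0

# Made-up numbers, not real data
example = "geneA\t5\ngeneB\t15\n__no_feature\t70\n__ambiguous\t10\n"
fraction = assigned_fraction(io.StringIO(example))
print(f"{fraction:.0%} of reads were assigned to genes")
```

A much lower assigned fraction in the "bad" file than in the correctly stranded one would be consistent with a strandedness mix-up.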

<h1 style="font-size: 40px; margin-bottom: 0px;">Exercise #1: Work out Python script to QC counts</h1>

<hr style="margin-left: 0px; border: 0.25px solid; border-color: #000000; width: 98%;"></hr>

For this first exercise, you'll take what you know about setting up Python scripts and the exercise sets from notebook 10-2 to set up a Python script in the code cell below. Your QC script should do the following:

<ul>
    <li>Import any needed packages</li>
    <li>Confirm that you're in the <code>Week_10</code> directory (to stay consistent with the rest of the RNA-seq pipeline)</li>
    <li>Make a <code>counts-qc</code> directory</li>
    <li>Generate a stacked bar plot of count statistics</li>
    <li>Create a scatter plot of ctrl vs tazko counts and highlight potential upregulation and downregulation</li>
    <ul>
        <li>Export the plot as a PDF to <code>counts-qc</code></li>
    </ul>
    <li>Remove uncounted read statistics from counts file and export as a new <code>1M_*_gene_counts.csv</code></li>
    <ul>
        <li>The asterisk should be replaced with your group number in the format <code>g1</code></li>
        <li>Export the <code>.csv</code> to <code>counts</code> and keep the headers</li>
    </ul>
</ul>
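
The directory check and the export step from the list above might be sketched as follows, using only the standard library. The three-column layout (<code>gene</code>, <code>ctrl</code>, <code>tazko</code>), the group label <code>g1</code>, and both function names are assumptions for illustration; the plotting steps are left out.

```python
import csv
import tempfile
from pathlib import Path

def in_directory(name):
    """True if the current working directory is called `name`."""
    return Path.cwd().name == name

def export_gene_counts(rows, out_dir, group="g1"):
    """Write gene rows (without `__` counter rows) to 1M_<group>_gene_counts.csv.

    `rows` is a list of (gene, ctrl, tazko) tuples; this layout is assumed.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)  # no error if it already exists
    out_path = out_dir / f"1M_{group}_gene_counts.csv"
    # Drop the uncounted-read statistics such as __no_feature
    kept = [row for row in rows if not row[0].startswith("__")]
    with open(out_path, "w", newline="") as handle:
        writer = csv.writer(handle)
        writer.writerow(["gene", "ctrl", "tazko"])  # keep the headers
        writer.writerows(kept)
    return out_path

print(in_directory("Week_10"))

# Made-up rows, written to a temporary directory so nothing is left behind
rows = [("geneA", 5, 9), ("geneB", 0, 3), ("__no_feature", 70, 65)]
with tempfile.TemporaryDirectory() as tmp:
    path = export_gene_counts(rows, Path(tmp) / "counts")
    exported = path.read_text().splitlines()
print(exported)
```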

You can use your "bad" counts file to test your code as you work on it; we'll also take a look at that output on the side.

In [None]:
#Space for working on your Python script

Once you've confirmed that your script works, we'll reconvene and transfer that code into a new Python file called <code>counts-qc-script.py</code> in our <code>Week_10</code> directory for convenience.

With our Python script ready, we'll then integrate it into our RNA-seq script that we've been putting together and give it a test run on our truncated dataset.

<h2>Planned break</h2>

During this break, email Jack your group's <code>.csv</code> file of read counts. You don't need to send your "bad" one.

<h1 style="font-size: 40px; margin-bottom: 0px;">Exercise #2: Work out Python script to prepare matrices for DESeq2</h1>

<hr style="margin-left: 0px; border: 0.25px solid; border-color: #000000; width: 98%;"></hr>

For this exercise, you'll create another Python script called <code>class-merger.py</code> also in your <code>Week_10</code> directory for simplicity. The goal of this script will be to pull the separate class matrix files from our shared directory and merge the data into a single matrix. We'll also generate a metadata file that contains the information on the experimental setup/condition for each sample.

Normally, if you have multiple replicates, you can supply all your alignments to <code>htseq-count</code> at once, and each alignment gets its own column in the counts matrix that <code>htseq-count</code> outputs. Since we each analyzed our own replicates, we'll need to merge them into a single matrix. In the same script, we'll also create the accompanying metadata file so that DESeq2 "knows" which condition each column belongs to.

Then, you will export both matrices as <code>.csv</code> files to a subdirectory called <code>class-set</code> within your <code>counts</code> directory.

<h2>Class counts matrix</h2>

Using our Python script, we'll want to merge all our data into a matrix that resembles:

<table style="text-align:center;">
    <tr>
        <th>Gene</th>
        <th>ctrl_g1</th>
        <th>tazko_g1</th>
        <th>ctrl_g2</th>
        <th>tazko_g2</th>
        <th>...</th>
        <th>ctrl_g9</th>
        <th>tazko_g9</th>
    </tr>
    <tr>
        <td>gene name</td>
        <td>count data</td>
        <td>count data</td>
        <td>count data</td>
        <td>count data</td>
        <td>...</td>
        <td>count data</td>
        <td>count data</td>
    </tr>
    <tr>
        <td>gene name</td>
        <td>count data</td>
        <td>count data</td>
        <td>count data</td>
        <td>count data</td>
        <td>...</td>
        <td>count data</td>
        <td>count data</td>
    </tr>
    <tr>
        <td>gene name</td>
        <td>count data</td>
        <td>count data</td>
        <td>count data</td>
        <td>count data</td>
        <td>...</td>
        <td>count data</td>
        <td>count data</td>
    </tr>    
</table>
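
A minimal sketch of such a merge, assuming each group's file has the columns <code>gene</code>, <code>ctrl</code>, and <code>tazko</code> and that every file lists the same genes. The file contents, the group labels, and the function name are hypothetical; pandas would be an equally reasonable tool for this step.

```python
import csv
import io

def merge_group_counts(group_files):
    """Merge per-group count tables into one matrix.

    `group_files` maps a group label such as "g1" to an open CSV handle with
    the columns gene, ctrl, tazko (an assumed layout).
    Returns (header, rows); each row is a list starting with the gene name.
    """
    header = ["Gene"]
    tables = []
    for group, handle in group_files.items():
        header += [f"ctrl_{group}", f"tazko_{group}"]
        table = {rec["gene"]: (int(rec["ctrl"]), int(rec["tazko"]))
                 for rec in csv.DictReader(handle)}
        # Refuse to merge silently if the gene lists disagree
        if tables and table.keys() != tables[0].keys():
            raise ValueError(f"group {group} lists a different set of genes")
        tables.append(table)
    genes = list(tables[0]) if tables else []
    rows = [[gene] + [count for table in tables for count in table[gene]]
            for gene in genes]
    return header, rows

# Made-up counts for two groups
g1 = "gene,ctrl,tazko\ngeneA,5,9\ngeneB,0,3\n"
g2 = "gene,ctrl,tazko\ngeneA,7,4\ngeneB,1,2\n"
header, rows = merge_group_counts({"g1": io.StringIO(g1), "g2": io.StringIO(g2)})
print(header)
for row in rows:
    print(row)
```

Raising an error on mismatched gene lists is a design choice for this sketch: it avoids shifting counts into the wrong column if one group's file is missing rows.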

<h2>Conditions matrix</h2>

While we know which condition each column's sample comes from, DESeq2 does not, so we need to tell it. That way, DESeq2 can assign each sample to the correct condition (either control or tazko) and make the correct comparisons between conditions.

In the same Python script that we use to merge our class data, we'll also create a conditions matrix that acts as the metadata for each of our samples. It will look something like:

<table style="text-align:center;">
    <tr>
        <th>[Index]</th>
        <th>condition</th>
    </tr>
    <tr>
        <th>ctrl_g1</th>
        <td>control</td>
    </tr>
    <tr>
        <th>tazko_g1</th>
        <td>tazko</td>
    </tr>
    <tr>
        <th>ctrl_g2</th>
        <td>control</td>
    </tr>
    <tr>
        <th>tazko_g2</th>
        <td>tazko</td>
    </tr>
    <tr>
        <th>...</th>
        <td>...</td>
    </tr>
    <tr>
        <th>ctrl_g9</th>
        <td>control</td>
    </tr>
    <tr>
        <th>tazko_g9</th>
        <td>tazko</td>
    </tr>
</table>

The index values correspond to the column headers of your counts matrix, one per sample, while the values in the <code>condition</code> column indicate whether that sample is a control sample or a TAZ KO sample, allowing DESeq2 to properly group the data for differential expression analysis.
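
One way to keep the two matrices in agreement is to derive the conditions directly from the counts matrix column headers. The sketch below assumes sample names of the form <code>ctrl_g1</code> and <code>tazko_g1</code>; the prefix-to-condition mapping and the function name are illustrative, not a required implementation.

```python
def build_conditions(sample_columns):
    """Pair each sample column (e.g. "ctrl_g1") with its condition label."""
    # Assumed naming scheme: the text before the first underscore is the condition
    prefix_to_condition = {"ctrl": "control", "tazko": "tazko"}
    conditions = []
    for sample in sample_columns:
        prefix = sample.split("_", 1)[0]
        if prefix not in prefix_to_condition:
            raise ValueError(f"unrecognized sample name: {sample}")
        conditions.append((sample, prefix_to_condition[prefix]))
    return conditions

samples = ["ctrl_g1", "tazko_g1", "ctrl_g2", "tazko_g2"]
for sample, condition in build_conditions(samples):
    print(f"{sample},{condition}")
```

Because the rows are generated in the same order as the sample columns, the metadata lines up with the counts matrix, which is what DESeq2 expects.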

In [None]:
#Space for working on your Python script

With our second Python script ready, let's also integrate it into our RNA-seq script and give it another test run using our truncated dataset.