<h1 style="font-size: 40px; margin-bottom: 0px;">11.1 Prepare matrices for DESeq2</h1>

<hr style="margin-left: 0px; border: 0.25px solid; border-color: #000000; width: 98%;"></hr>

We'll now prepare our class dataset for a deeper dive into our RNA-seq data. First, we'll merge each group's count matrix into a single data matrix. Then we'll create a conditions matrix that serves as the metadata for the class dataset. We'll incorporate what we covered in notebook 10-2 and in this notebook into Python scripts that can then be integrated into our RNA-seq analysis pipeline.

<strong>Learning objectives:</strong>

<ul>
    <li>Review counts quality control using a deliberately "bad" set of counts</li>
    <li>Practice setting up Python scripts</li>
    <li>Integrate Python scripts into your RNA-seq pipeline</li>
</ul>

<h1 style="font-size: 40px; margin-bottom: 0px;">Prepare a set of "bad" counts</h1>

<hr style="margin-left: 0px; border: 0.25px solid; border-color: #000000; width: 98%;"></hr>

To see what it looks like when something goes wrong in how you set up <code>htseq-count</code>, we'll start today's lesson by running <code>htseq-count</code> with the incorrect strandedness for our library. This should produce an unusual count matrix in which a large number of reads are unassigned (not counted).

In [None]:
%%bash

#############################
#
# Just for practice purposes
# And to highlight QC
#
#############################

#We'll just make it in our Week_10 directory

htseq-count \
-t exon \
-i gene_id \
-r name \
-s yes \
-f bam \
./alignments-temp/bams/1M_g1_ctrl-name.bam \
./alignments-temp/bams/1M_g1_tazko-name.bam \
~/shared/2025-fall/courses/1547808/rna-seq/rna-feature/hg19-refseq.gtf \
> ~/MCB201B_F2025/Week_10/1M_g1_bad_counts.txt

We'll then analyze this bad counts file separately to see what the output looks like when we mix up the strandedness of our RNA-seq dataset.
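
A quick numeric check can complement the plots we'll make later. The sketch below is a minimal example on made-up numbers: it treats every row whose name starts with <code>__</code> (such as <code>__no_feature</code>) as a read statistic and reports the percent of reads that were assigned to genes. The helper <code>percent_assigned</code> and the toy table are hypothetical and not part of the course scripts.

```python
import pandas as pd

# Toy htseq-count-style table; the numbers are invented for illustration only
toy_counts = pd.DataFrame(
    {
        'gene': ['GENE_A', 'GENE_B', '__no_feature', '__ambiguous',
                 '__too_low_aQual', '__not_aligned', '__alignment_not_unique'],
        'ctrl': [40, 10, 900, 20, 10, 0, 20],
        'tazko': [30, 20, 880, 30, 15, 0, 25],
    }
)

def percent_assigned(counts, sample_cols=('ctrl', 'tazko')):
    """Return the percent of reads assigned to genes for each sample column."""
    cols = list(sample_cols)
    # Rows starting with '__' are htseq-count's read statistics, not genes
    is_stat = counts['gene'].str.startswith('__')
    assigned = counts.loc[~is_stat, cols].sum()
    total = counts[cols].sum()
    return assigned / total * 100

assigned_pct = percent_assigned(toy_counts)
print(assigned_pct.round(1))
```

In this toy table only 5% of reads land on genes while most fall under <code>__no_feature</code>, which is the kind of pattern we'd expect to see when the strandedness setting does not match the library.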

<h1 style="font-size: 40px; margin-bottom: 0px;">Exercise #1: Work out Python script to QC counts</h1>

<hr style="margin-left: 0px; border: 0.25px solid; border-color: #000000; width: 98%;"></hr>

For this first exercise, you'll take what you know about setting up Python scripts and the exercise sets from notebook 10-2 to set up a Python script in the code cell below. For this QC, you'll want to be able to do the following:

<ul>
    <li>Import any needed packages</li>
    <li>Confirm that you're in the <code>Week_10</code> directory (to stay consistent with the rest of the RNA-seq pipeline)</li>
    <li>Make a <code>counts-qc</code> directory</li>
    <li>Generate a stacked bar plot of count statistics</li>
    <li>Create a scatter plot of ctrl vs tazko counts and highlight potential upregulation and downregulation</li>
    <ul>
        <li>Export the plot as a PDF to <code>counts-qc</code></li>
    </ul>
    <li>Remove uncounted read statistics from counts file and export as a new <code>1M_*_gene_counts.csv</code></li>
    <ul>
        <li>The asterisk should be replaced with your group number in the format <code>g1</code></li>
        <li>Export the <code>.csv</code> to <code>counts</code> and keep the headers</li>
    </ul>
</ul>

You can use your "bad" counts file to test out your code as you're working on it, as we'll also take a look at that output on the side.
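
For the highlighting step, one possible approach is to label each gene by its fold change before plotting. The sketch below is only an illustration on invented counts. The twofold cutoff and the pseudocount of 1 (added so that genes with zero control reads don't cause a division by zero) are assumptions on our part, and you may prefer different choices.

```python
import numpy as np
import pandas as pd

# Toy gene counts; values are invented for illustration only
toy = pd.DataFrame(
    {
        'gene': ['GENE_A', 'GENE_B', 'GENE_C', 'GENE_D'],
        'ctrl': [100, 50, 0, 10],
        'tazko': [100, 10, 8, 45],
    }
)

pseudocount = 1  # assumption: avoids dividing by zero for unexpressed genes
log2_ratio = np.log2((toy['tazko'] + pseudocount) / (toy['ctrl'] + pseudocount))

# A ratio of >= 2 or <= 0.5 corresponds to >= 1 or <= -1 on a log2 scale
toy['status'] = np.select(
    [log2_ratio >= 1, log2_ratio <= -1],
    ['upregulated', 'downregulated'],
    default='no change',
)
print(toy)
```

The resulting <code>status</code> column could then be used to split the DataFrame into the subsets that get their own color in the scatter plot.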

In [1]:
#Space for working on your Python script

#Import any needed packages
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib.colors as mcolors
import seaborn as sns
import os

#Confirm our directory
print('Confirming our directory...\n')
print(f'Current directory is {os.getcwd()}')

os.chdir('/home/jovyan/MCB201B_F2025/Week_10')
print(f'Changed to confirm that the directory is {os.getcwd()}')

#Make our new counts-qc directory
try:
    os.mkdir('counts-qc')
except FileExistsError:
    pass

#Pull in our file names
os.chdir('./counts')
file_name = [name for name in os.listdir() if '.txt' in name]
print(f'Pulled in the file: {file_name[0]}')

#Set up your base name
base_name = file_name[0].split('_counts')[0]
print(f'Setting base name as: {base_name}')

#Load in our file
counts = pd.read_csv(file_name[0],
                     sep='\t',
                     header=None,
                     names=['gene', 'ctrl', 'tazko']
                    )

#Pull out the read count statistics
read_stats = counts[counts['gene'].str.contains('__')].copy()
read_stats['gene'] = read_stats['gene'].str.replace('__', '')

#Pull out the per-gene counts (every row that is not a read statistic)
counted_reads = counts[~counts['gene'].str.contains('__')].copy()

print(f'Saving a .csv file for {base_name}...')
counted_reads.to_csv(f'{base_name}_gene_counts.csv',
                     index=False,
                    )

#Update DataFrame for QC
read_stats.loc[len(read_stats)] = ['counted_reads', sum(counted_reads['ctrl']), sum(counted_reads['tazko'])]

#Update our DataFrame for plotting
read_stats.set_index('gene', inplace=True)
read_stats = read_stats.div(read_stats.sum(axis=0), axis=1) * 100
read_stats = read_stats.T

#Set up our stacked bar plot
#See 10-2 notebook
#We are just pulling and updating from that code
print('Generating our stacked bar plot for QC...\n')

#Set up plot in the usual way
#From first exercise set in 10-2, setting up stacked bar plot
fig, ax = plt.subplots()

x = read_stats.index
y = read_stats.columns[::-1]

bottom = np.zeros(len(x))

use_colors = pd.Series(mcolors.CSS4_COLORS)

for i in range(0, len(y), 1):
    plt.bar(x,
            read_stats[y[i]],
            label=y[i],
            bottom=bottom,
            color=use_colors.iloc[125+i*4],  #Use .iloc for positional lookup since the index is color names
            lw=0.5,
            edgecolor='k',
           )
    bottom+=read_stats[y[i]]

plt.title('Counts QC')
plt.ylabel('Percent of total read counts')
plt.xticks([0,1],
           ['Control', '$TAZ$ KO']
          )
plt.legend(loc='center',
           bbox_to_anchor=(1.3,0.5),
           fontsize=4,
           edgecolor='w',
          )

sns.despine()
fig.set_size_inches(2, 3)
fig.set_dpi(300)

#Here we then save our stacked bar plot
os.chdir('/home/jovyan/MCB201B_F2025/Week_10/counts-qc')
print(f'Saving PDF of stacked bar plot for {base_name} in {os.getcwd()}...\n')
fig.savefig(f'{base_name}_stacked_qc_barplot.pdf', bbox_inches='tight')

#Close out the figure
plt.close(fig)

#Now we set up our scatter plot
counted_reads['ratio'] = counted_reads['tazko'] / counted_reads['ctrl']
downreg = counted_reads[counted_reads['ratio'] <= 0.5]
upreg = counted_reads[counted_reads['ratio'] >= 2]

#Create a new figure but with the same variables
fig, ax = plt.subplots()

#We can set up lists to iterate through for our scatterplot
datasets = [counted_reads, downreg, upreg]
colors_to_use = ['grey', 'b', 'r']
labels_to_use = ['no change', 'downregulated', 'upregulated']

#Set up for loop to loop through all three lists
for i in range(0, len(datasets), 1):
    sns.scatterplot(data=datasets[i],
                    x='ctrl',
                    y='tazko',
                    color=colors_to_use[i],
                    label=labels_to_use[i],
                    s=4,
                   )

#Pretty up scatter plot
plt.loglog()
plt.xlabel('Control read counts')
plt.ylabel('$TAZ$ KO read counts')

plt.legend(loc='center',
           bbox_to_anchor=(1.1,0.5),
           fontsize=6,
           edgecolor='w',
          )

fig.set_size_inches(3, 3)
fig.set_dpi(300)
sns.despine()

#Save scatterplot
print(f'Saving PDF of scatter plot for {base_name} in {os.getcwd()}...\n')
fig.savefig(f'{base_name}_qc_scatterplot.pdf', bbox_inches='tight')

#Close out figure
plt.close(fig)

print('Done with counts QC!')

Confirming our directory...

Current directory is /home/jovyan/MCB201B_F2025/Week_11
Changed to confirm that the directory is /home/jovyan/MCB201B_F2025/Week_10
Pulled in the file: 1M_g1_counts.txt
Setting base name as: 1M_g1
Saving a .csv file for 1M_g1...
Generating our stacked bar plot for QC...

Saving PDF of stacked bar plot for 1M_g1 in /home/jovyan/MCB201B_F2025/Week_10/counts-qc...

Saving PDF of scatter plot for 1M_g1 in /home/jovyan/MCB201B_F2025/Week_10/counts-qc...

Done with counts QC!


Once you've confirmed that your script works, we'll reconvene and transfer that code into a new Python file called <code>counts-qc-script.py</code> in our <code>Week_10</code> directory for convenience.

With our Python script ready, we'll then integrate it into our RNA-seq script that we've been putting together and give it a test run on our truncated dataset.
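
When a script is launched from a pipeline, its starting directory may not be the one we were in while testing in the notebook. One way to make the script less dependent on <code>os.chdir()</code> is to build explicit paths with <code>pathlib</code>. The sketch below is a hedged alternative to the <code>try</code>/<code>except</code> pattern we used. The helper name <code>prepare_output_dir</code> is hypothetical, and the demo runs in a temporary directory so that nothing in <code>Week_10</code> is touched.

```python
from pathlib import Path
import tempfile

def prepare_output_dir(project_dir, name='counts-qc'):
    """Create project_dir/name if it does not exist yet and return its path."""
    out_dir = Path(project_dir) / name
    out_dir.mkdir(parents=True, exist_ok=True)
    return out_dir

# Demonstrated in a temporary directory rather than the real Week_10
with tempfile.TemporaryDirectory() as tmp:
    qc_dir = prepare_output_dir(tmp)
    qc_dir_again = prepare_output_dir(tmp)  # calling twice does not raise
    pdf_path = qc_dir / '1M_g1_stacked_qc_barplot.pdf'
    created = qc_dir.is_dir()
    print(pdf_path.name)
```

A path built this way can be passed straight to <code>fig.savefig()</code>, so the script behaves the same regardless of where it is started.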

<h2>Planned break</h2>

During this break, email Jack the <code>.csv</code> file with your group's read counts. You don't need to send the "bad" one.

<h1 style="font-size: 40px; margin-bottom: 0px;">Exercise #2: Work out Python script to prepare matrices for DESeq2</h1>

<hr style="margin-left: 0px; border: 0.25px solid; border-color: #000000; width: 98%;"></hr>

For this exercise, you'll create another Python script called <code>class-merger.py</code> also in your <code>Week_10</code> directory for simplicity. The goal of this script will be to pull the separate class matrix files from our shared directory and merge the data into a single matrix. We'll also generate a metadata file that contains the information on the experimental setup/condition for each sample.

Normally, if you have multiple replicates for <code>htseq-count</code>, you can supply all your alignments at once, and each alignment will have its own column in the counts matrix that <code>htseq-count</code> outputs. Since we each analyzed our own replicates, we'll need to merge them all together into a single matrix. In the same script, we'll also create the accompanying metadata file so that DESeq2 "knows" which condition each column belongs to.

Then, you will export both matrices as <code>.csv</code> files to a subdirectory called <code>class-set</code> within your <code>Week_10</code> directory.

<h2>Class counts matrix</h2>

Using our Python script, we'll want to merge all our data into a matrix that resembles:

<table style="text-align:center;">
    <tr>
        <th>Gene</th>
        <th>ctrl_g1</th>
        <th>tazko_g1</th>
        <th>ctrl_g2</th>
        <th>tazko_g2</th>
        <th>...</th>
        <th>ctrl_g9</th>
        <th>tazko_g9</th>
    </tr>
    <tr>
        <td>gene name</td>
        <td>count data</td>
        <td>count data</td>
        <td>count data</td>
        <td>count data</td>
        <td>...</td>
        <td>count data</td>
        <td>count data</td>
    </tr>
    <tr>
        <td>gene name</td>
        <td>count data</td>
        <td>count data</td>
        <td>count data</td>
        <td>count data</td>
        <td>...</td>
        <td>count data</td>
        <td>count data</td>
    </tr>
    <tr>
        <td>gene name</td>
        <td>count data</td>
        <td>count data</td>
        <td>count data</td>
        <td>count data</td>
        <td>...</td>
        <td>count data</td>
        <td>count data</td>
    </tr>    
</table>
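
To make the target layout concrete, here is a minimal sketch of the merge on three invented toy tables that stand in for the per-group <code>.csv</code> files. It assumes every table shares the same gene index and the same two columns. The gene names and counts are made up.

```python
import pandas as pd

genes = pd.Index(['GENE_A', 'GENE_B', 'GENE_C'], name='gene')

# Toy per-group tables standing in for the 1M_g*_gene_counts.csv files
toy_groups = [
    pd.DataFrame({'ctrl': [5, 0, 12], 'tazko': [7, 1, 3]}, index=genes),
    pd.DataFrame({'ctrl': [4, 2, 10], 'tazko': [9, 0, 2]}, index=genes),
    pd.DataFrame({'ctrl': [6, 1, 11], 'tazko': [8, 2, 4]}, index=genes),
]

# Tag each group's columns with its group number: ctrl -> ctrl_g1, and so on
renamed = [df.add_suffix(f'_g{n}') for n, df in enumerate(toy_groups, start=1)]

# Join the remaining groups onto group 1 using the shared gene index
merged = renamed[0].join(renamed[1:], how='left')
print(merged)
```

The columns come out as <code>ctrl_g1</code>, <code>tazko_g1</code>, <code>ctrl_g2</code>, and so on, matching the table above.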

<h2>Conditions matrix</h2>

While we know what condition each column's samples come from, DESeq2 does not, so we need to provide it with information on which condition each sample came from. That way, DESeq2 can assign them to the correct condition (either control or tazko) and make the correct comparisons between conditions.

In the same Python script that we use to merge our class data, we'll also simultaneously create a conditions matrix that acts as the metadata for each of our samples, and it will look something like:

<table style="text-align:center;">
    <tr>
        <th>[Index]</th>
        <th>condition</th>
    </tr>
    <tr>
        <th>ctrl_g1</th>
        <td>control</td>
    </tr>
    <tr>
        <th>tazko_g1</th>
        <td>tazko</td>
    </tr>
    <tr>
        <th>ctrl_g2</th>
        <td>control</td>
    </tr>
    <tr>
        <th>tazko_g2</th>
        <td>tazko</td>
    </tr>
    <tr>
        <th>...</th>
        <td>...</td>
    </tr>
    <tr>
        <th>ctrl_g9</th>
        <td>control</td>
    </tr>
    <tr>
        <th>tazko_g9</th>
        <td>tazko</td>
    </tr>
</table>

The index values correspond to the column headers of your counts matrix, one per sample, while the values in the <code>condition</code> column indicate whether that sample is a control sample or a TAZ KO sample. This allows DESeq2 to properly group the data for differential expression analysis.
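
As a small illustration, the condition can also be derived from each column header rather than from the column order. The sketch below assumes headers that start with either <code>ctrl</code> or <code>tazko</code>, as in the tables above, and it checks that the metadata rows line up with the counts columns, since DESeq2 expects them to be in the same order.

```python
import pandas as pd

# Column headers as they would appear in the merged counts matrix
sample_names = ['ctrl_g1', 'tazko_g1', 'ctrl_g2', 'tazko_g2']

# Derive the condition from the header prefix instead of relying on order
metadata = pd.DataFrame(
    {'condition': ['control' if name.startswith('ctrl') else 'tazko'
                   for name in sample_names]},
    index=sample_names,
)

# Sanity check: metadata rows must match the counts columns, in order
counts_columns = pd.Index(sample_names)
assert metadata.index.equals(counts_columns)
print(metadata)
```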

In [14]:
#Space for working on your Python script

####################################################################################
#
# Picked up here Wed 11/12
#
# This is set up due to the nature of how we analyzed our data as a class
# Note that for normal analyses, you can feed all BAMs to htseq at once
# For simplicity, we just worked in this notebook without adding to our bash script
#
####################################################################################

#Import our needed packages
import numpy as np
import pandas as pd
import os

#Confirm where our script is currently sitting
print('Getting our current directory...\n')
print(f'Current directory is {os.getcwd()}')

#Change to the shared directory that holds everyone's counts files
print('Changing into our counts file for good measure...\n')

#Everyone submitted their group's .csv file and Jack uploaded to shared directory
os.chdir('/home/jovyan/shared/2025-fall/courses/1547808/rna-seq/truncated-counts')
print(f'Current directory is changed to {os.getcwd()}')

#Let's pull in our data files to merge together
#We use our usual way of pulling names into a list via list comprehension
print('Pulling in our files...\n')
data = [name for name in os.listdir() if '.csv' in name]
data.sort()

#Set up a little for loop to then output file names just for checking
print('Files that we are merging are in the following order: \n')
for name in data:
    print(name, end='\n\n')

print('Going to create our lovely matrices for DESeq2...\n')

#When pulling objects into a list using list comprehension we can operate on that object
#In this case, we're pulling in name and using it to load in the data with pd.read_csv()
#This just easily loads in our data as a DataFrame and populates it into a list that we can loop through
full_set = [pd.read_csv(name, index_col='gene') for name in data]

#One thing to note is that we set the index values based on the 'gene' column using index_col='gene'
#So the only remaining columns are 'ctrl' and 'tazko', and the headers are all the same for all replicates
#So then we can prepare our DataFrames for joining/merging by first updating all the headers
#Essentially just going to replace/tack on additional group/replicate info

#Initialize for loop to iterate through our list of DataFrames
for i in range(0, len(full_set), 1):
    #This creates a list adapted to the group number using the position (i) + 1 b/c we don't have a group 0
    #The elements of that list then replace each replicate's column headers
    #So for group 1, i=0 and thus ctrl -> ctrl_g1 and tazko -> tazko_g1
    #Note this assumes the sorted file list runs g1, g2, ... with no gaps
    full_set[i].columns = [f'ctrl_g{i+1}', f'tazko_g{i+1}']

#Now we use pd.DataFrame.join() to join
#If you are a little confused, check the documentation for the join function
#It allows us to specify a left-side for our join, which will be our first DataFrame (group 1's)
#Then we can give it a list of DataFrames to join to that first one (groups 2 to 9)
#So we can pull out first element full_set[0]
#And the remaining replicates are encompassed by full_set[1:]
#Then we can join based on the left (group 1) index values which are gene names
#Should be shared for all replicates
class_data = full_set[0].join(full_set[1:], how='left')
#You'll then have a mega-matrix for the class data

#Refer back to text above for how "metadata" matrix should look
#We create a list with the different sample conditions - either control or tazko
#And since we set up our mega-matrix ordered as ctrl and tazko repeated for each replicate for columns
#We can just multiply by how many replicates we have to repeat our list elements
conditions = ['control', 'tazko']*len(data)

#Then create the DataFrame according to the text above
#The index will be based on column headers because we need this information for DESeq2
#It does not actually "know" what ctrl_g1 is and that it is a control condition
#So we have to tell it with this "metadata" matrix
metadata = pd.DataFrame(conditions, index=class_data.columns, columns=['condition'])


#Export using our usual way
print('Saving our class counts matrix and our conditions matrix as .csv files...\n')
os.chdir('/home/jovyan/MCB201B_F2025/Week_10')
try:
    os.mkdir('class-set')
except FileExistsError:
    pass

os.chdir('./class-set')


#We will proceed with our R notebook and DESeq2 + clustering analyses using these two files
class_data.to_csv('1M_class_counts_matrix.csv')
metadata.to_csv('1M_class_conditions_matrix.csv')

print('All done!')

Getting our current directory...

Current directory is /home/jovyan/shared/2025-fall/courses/1547808/rna-seq/truncated-counts
Changing into our counts file for good measure...

Current directory is changed to /home/jovyan/shared/2025-fall/courses/1547808/rna-seq/truncated-counts
Pulling in our files...

Files that we are merging are in the following order: 

1M_g1_gene_counts.csv

1M_g2_gene_counts.csv

1M_g3_gene_counts.csv

1M_g4_gene_counts.csv

1M_g5_gene_counts.csv

1M_g6_gene_counts.csv

1M_g7_gene_counts.csv

1M_g8_gene_counts.csv

1M_g9_gene_counts.csv

Going to create our lovely matrices for DESeq2...

Saving our class counts matrix and our conditions matrix as .csv files...

All done!


With our second Python script ready, let's also integrate it into our RNA-seq script and give it another test run using our truncated dataset.
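
One caveat about the merge script: it labels columns by list position (<code>i + 1</code>), which works here because the nine files sort as g1 through g9 with no gaps. If a group were missing, or if there were a group 10 (which sorts before g2 as text), the labels would drift. A more defensive sketch reads the group number from the file name instead. The helper <code>group_number</code> and the <code>g10</code> file name below are hypothetical and only serve to illustrate the idea.

```python
import re

# Hypothetical file list; g10 is included to show the text-sorting problem
file_names = ['1M_g1_gene_counts.csv',
              '1M_g10_gene_counts.csv',
              '1M_g2_gene_counts.csv']

def group_number(file_name):
    """Extract the group number from a name like 1M_g3_gene_counts.csv."""
    match = re.search(r'_g(\d+)_', file_name)
    if match is None:
        raise ValueError(f'No group number found in {file_name}')
    return int(match.group(1))

# Sort numerically by group rather than alphabetically by file name
ordered = sorted(file_names, key=group_number)

# Build the column labels from the parsed number, not the list position
labels = [[f'ctrl_g{group_number(name)}', f'tazko_g{group_number(name)}']
          for name in ordered]
print(ordered)
print(labels)
```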