<h1 style="font-size: 40px; margin-bottom: 0px;">15.1 Multiomics: Integrating ChIP-seq and RNA-seq</h1>

<hr style="margin-left: 0px; border: 0.25px solid; border-color: #000000; width: 950px;"></hr>

We've taken a look at the TAZ-TEAD-DNA binding interactions through ChIP-seq analysis, and we identified a set of reproducible peaks where TAZ-TEAD are likely to bind to DNA. Our analysis showed that TAZ-TEAD primarily bind to regions distant from the transcription start site (TSS) rather than the nearby promoter region, suggesting longer-range regulation of target gene expression through distant enhancers. Then we explored the transcriptional changes resulting from TAZ KO through RNA-seq, and we performed differential expression analysis to determine which genes are impacted by TAZ KO and whether these changes can explain the phenotypes we observed in MCB201A.

In this lesson, we'll be taking a basic multiomics approach by integrating our two big data analyses together in order to identify overlapping sets of genes from our two analyses. This approach can give us insight into the regulation of TAZ target genes and potentially identify some direct targets of TAZ and determine how TAZ regulates their transcription.

<strong>Learning objectives:</strong>

<ul>
    <li>Use HOMER to label ChIP-seq peaks with associated genes</li>
    <li>Continue practice working with data in Python</li>
    <li>Continue practicing data visualization</li>
    <li>Continue practicing statistical analysis and functional analysis</li>
</ul>

<h1>Annotate ChIP-seq peaks based on nearest TSS method</h1>

A common way of annotating peaks that we've made use of before is to determine the closest TSS to a significant peak and then assign that TSS's gene to the peak (nearest TSS method). This works well with annotating peaks in promoter regions since promoters regulate transcription at a much shorter distance from the TSS, and this generally works well for enhancer regions as well. However, there is the caveat that enhancer regions can act at very long distances, where there are intervening genes between their target TSS and the enhancer region. Additionally, some enhancers may act on multiple genes, or multiple enhancers can regulate a single gene. These situations can complicate annotating peaks via the nearest TSS method, and other functional genomics approaches can get around this issue.

To annotate peaks based on the nearest TSS, we can return to HOMER's <code>annotatePeaks.pl</code> command. Included in today's lesson is a .narrowPeak file containing merged peaks from the full ChIP-seq dataset. Instead of just the peaks that are in the top 500 in both replicates, this file contains peaks present in the top 2500 in both replicates. We'll pass this file to HOMER to annotate each peak based on the nearest TSS.

<h2>Set up HOMER</h2>

Change to your HOMER directory.

<pre style="width: 450px; margin-top: 15px; margin-bottom: 15px; color: #000000; background-color: #EEEEEE; border: 1px solid; border-color: #AAAAAA; padding: 10px; border-radius: 15px; font-size: 12px;">cd ~/homer</pre>

Then update your PATH, so that Terminal can find where all the HOMER files are:

<pre style="width: 450px; margin-top: 15px; margin-bottom: 15px; color: #000000; background-color: #EEEEEE; border: 1px solid; border-color: #AAAAAA; padding: 10px; border-radius: 15px; font-size: 12px;">PATH=$PATH:/home/jovyan/homer/.//bin/</pre>

<h2>Annotate the concordant ChIP-seq peaks</h2>

Change to your Week 15 directory:

<pre style="width: 450px; margin-top: 15px; margin-bottom: 15px; color: #000000; background-color: #EEEEEE; border: 1px solid; border-color: #AAAAAA; padding: 10px; border-radius: 15px; font-size: 12px;">cd ~/MCB201B_F2024/Week_15</pre>

Then annotate the peaks using <code>annotatePeaks.pl</code>:

<pre style="width: 450px; margin-top: 15px; margin-bottom: 15px; color: #000000; background-color: #EEEEEE; border: 1px solid; border-color: #AAAAAA; padding: 10px; border-radius: 15px; font-size: 12px;">annotatePeaks.pl \
./top_concordant_peaks_full-set.narrowPeak \
hg19 \
-noann \
> ./annotated-peaks-full-set.txt</pre>

<h1>Install and import packages needed for today</h1>

In [None]:
pip install matplotlib-venn

In [None]:
pip install gseapy

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import matplotlib_venn as vn
import seaborn as sns
import scipy.stats as stats
import gseapy as gp

<h1>Exercise #1: Load in your annotated peaks and pull out the gene names</h1>

For this exercise, load in your annotated peaks file then generate a set (the data type) of gene names that correspond to peaks annotated to a protein-coding gene.

Let's take a look to make sure the data imported properly:

Now, see if you can pull out the gene names corresponding to protein-coding genes from our dataset.

What additional modification can you make to the line of code above so that it also converts the result to a set?

Let's take a quick look at our data:

<h1>Exercise #2: Load in your DESeq2 results</h1>

Let's load in our RNA-seq results, so that we can begin looking for the intersection between our two 'omics datasets. Also included with today's lesson is a .csv file containing the full results table for our class's complete RNA-seq analysis, which you can load into this notebook.

Let's also rename the <code>Unnamed: 0</code> column header to <code>gene</code>, to make things easier for us later when we're merging our DataFrames.

Take a quick look at your DataFrame to make sure everything looks okay:

<h1>Guided Exercise: Pull out genes from DESeq2 results</h1>

Let's then take a look at our DESeq2 results to remove the genes that were filtered out by DESeq2's independent filtering method. In short, these genes were identified as having a mean normalized count that is too low. The resulting high dispersion makes it very unlikely that these genes would be called significant, so they are filtered out. While they will still have a calculated log2FoldChange, they will not have an adjusted p-value, so those values will be set to <code>NaN</code>.

More detailed information on independent filtering can be found in <a href="https://bioconductor.org/packages/release/bioc/vignettes/DESeq2/inst/doc/DESeq2.html#indfilttheory" rel="noopener noreferrer" target="_blank"><u>the DESeq2 vignette.</u></a>

DESeq2 also applies additional filters that remove a gene if it is not expressed in any sample (zero counts in all samples) or if an extreme outlier is present in one of the samples. These filters help to increase the detection power of DESeq2 without really changing the probability of a type I error.

<h2>Identify genes filtered out by DESeq2</h2>

We can determine how many genes were filtered out by counting how many <code>NaN</code> are present in the DataFrame.
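A minimal sketch of this count on a toy table. The column names mirror a typical DESeq2 results table (`baseMean`, `log2FoldChange`, `pvalue`, `padj`), but the gene names and values are invented for illustration, and `toy_res` is a hypothetical variable name.

```python
import numpy as np
import pandas as pd

# Toy stand-in for a DESeq2 results table; every value here is made up.
toy_res = pd.DataFrame({
    "gene": ["GENE_A", "GENE_B", "GENE_C", "GENE_D"],
    "baseMean": [250.0, 3.1, 980.4, 1.2],
    "log2FoldChange": [-1.8, 0.4, 0.1, -0.9],
    "pvalue": [1e-6, 0.40, 0.75, 0.55],
    "padj": [1e-4, np.nan, 0.90, np.nan],
})

# isna() flags missing values; sum() then counts them column by column.
nan_counts = toy_res.isna().sum()
print(nan_counts)
```

Reading the counts column by column shows which filter was responsible: `NaN` only in `padj` points to independent filtering, whereas `NaN` in `pvalue` as well would point to one of the other filters.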

We can see that no genes were filtered out based on an absence of expression in all samples because we had already done that prior to providing DESeq2 with our counts matrix. We can also see that a number of genes have been filtered out due to their low mean normalized counts based on the number of <code>NaN</code> in the adjusted p-values column.

<h2>Pull out gene names for genes that pass the filter</h2>

We can drop the rows containing <code>NaN</code>, which will allow us to focus just on those genes that passed the filter and use those to identify overlaps with our ChIP-seq peaks.
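One way this could look, sketched on a toy table with invented values. The names `toy_res`, `passed_filter`, and `passed_genes` are hypothetical, and restricting `dropna()` to the `padj` column is an assumption about which missing values matter here.

```python
import numpy as np
import pandas as pd

# Toy stand-in for a DESeq2 results table; values are made up.
toy_res = pd.DataFrame({
    "gene": ["GENE_A", "GENE_B", "GENE_C", "GENE_D"],
    "log2FoldChange": [-1.8, 0.4, 0.1, -0.9],
    "padj": [1e-4, np.nan, 0.90, np.nan],
})

# Keep only the rows that still have an adjusted p-value.
passed_filter = toy_res.dropna(subset=["padj"])

# Collect the surviving gene names as a set for later overlap checks.
passed_genes = set(passed_filter["gene"])
print(passed_genes)
```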

<h2>Pull out gene names for significant hits</h2>

We can also pull out just the significant hits identified by DESeq2 to narrow down our set of genes to those that are differentially expressed when we KO TAZ.
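A hedged sketch of that filtering step on toy data. The 0.05 cutoff stored in `alpha` is an assumption rather than a value stated in this lesson, and the table contents are invented.

```python
import numpy as np
import pandas as pd

# Toy stand-in for a DESeq2 results table; values are made up.
toy_res = pd.DataFrame({
    "gene": ["GENE_A", "GENE_B", "GENE_C", "GENE_D"],
    "log2FoldChange": [-1.8, 0.4, 0.1, -0.9],
    "padj": [1e-4, np.nan, 0.90, 0.03],
})

alpha = 0.05  # assumed significance cutoff for the adjusted p-value

# Boolean mask: comparisons against NaN are False, so filtered genes drop out.
sig_hits = toy_res[toy_res["padj"] < alpha]

# Gene names only, as a set.
sig_hits_genes_only = set(sig_hits["gene"])
print(sig_hits_genes_only)
```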

<h1>Guided Exercise: Create a Venn diagram of ChIP-seq and RNA-seq results</h1>

Now let's create a Venn diagram to see how our two analyses overlap. To do this, we'll make use of the matplotlib_venn package that we imported. <a href="https://pypi.org/project/matplotlib-venn/" rel="noopener noreferrer" target="_blank"><u>Documentation is here.</u></a>

We can start with a simple Venn diagram, and then pull out elements from it to modify the style of our figure.

```
venn_diag = vn.venn2((chip_genes_only, sig_hits_genes_only), 
                     set_labels=('ChIP Peaks', 'RNA-seq DE'))
```

We can see that our two groups do share some overlapping genes, but not all the genes actually overlap. As Dr. Ingolia talked about in lecture this week, there are additional considerations to keep in mind that can inform how we interpret our results. A direct target is likely to be found in both groups, whereas indirect targets may be a significant hit from our DESeq2 analysis, but it may not have an associated ChIP-seq peak. On the other hand, additional layers of transcriptional regulation may result in a ChIP-seq peak whose associated gene is not differentially expressed.

<h1>Guided Exercise: Identify overlapping genes between both sets</h1>

Now that we have an idea of the numbers of overlapping and non-overlapping genes between our two groups, let's take a look at which genes are in each group. We've generated a set data type object that contains the names of the genes from our ChIP-seq peaks and the names of the genes from our differential expression analysis. To identify overlapping genes, we can make use of set operators.
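As a quick reminder of how Python's set operators behave, here is a small sketch. The two variable names match those used for the Venn diagram, but the gene names inside them are placeholders.

```python
# Placeholder gene sets standing in for the real ChIP-seq and RNA-seq sets.
chip_genes_only = {"GENE_A", "GENE_B", "GENE_C"}
sig_hits_genes_only = {"GENE_B", "GENE_C", "GENE_D"}

# & gives the intersection: genes present in both sets.
in_both = chip_genes_only & sig_hits_genes_only

# | gives the union: genes present in at least one set.
in_either = chip_genes_only | sig_hits_genes_only

# - gives the difference: genes in the left set but not in the right set.
left_only = chip_genes_only - sig_hits_genes_only

print(in_both, in_either, left_only)
```

The sizes of these sets correspond to the numbers shown in the regions of the Venn diagram.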

We can also use this set to create a DataFrame that has a column named <code>gene</code> that has all the shared genes between our two analyses.
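A minimal sketch of that conversion, using a placeholder set. Sorting the set first is a design choice rather than a requirement: sets are unordered, so sorting gives the rows a reproducible order.

```python
import pandas as pd

# Placeholder set of shared genes.
overlap = {"GENE_C", "GENE_B"}

# Build a one-column DataFrame; sorted() turns the set into an ordered list.
overlap_df = pd.DataFrame({"gene": sorted(overlap)})
print(overlap_df)
```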

Let's take a look at our DataFrame object containing our overlapping genes:

<h1>Guided Exercise: Pull in DESeq2 results for overlapping genes</h1>

Let's merge the data contained in our DESeq2 results with our overlapping genes. This will allow us to look at the distribution of log2 fold changes for these genes.
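The merge could be sketched as follows on toy data. Both tables and their values are invented, and the shared `gene` column is assumed to be the merge key.

```python
import pandas as pd

# Placeholder table of overlapping genes.
overlap_df = pd.DataFrame({"gene": ["GENE_B", "GENE_C"]})

# Toy stand-in for the DESeq2 results table; values are made up.
toy_res = pd.DataFrame({
    "gene": ["GENE_A", "GENE_B", "GENE_C", "GENE_D"],
    "log2FoldChange": [-1.8, 0.4, 0.1, -0.9],
    "padj": [1e-4, 0.02, 0.04, 0.03],
})

# An inner merge keeps only the genes that appear in both tables.
overlap_res = overlap_df.merge(toy_res, on="gene", how="inner")
print(overlap_res)
```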

Now let's take the log2 fold change data and create a histogram to take a look at the distribution of expression changes for these genes.
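A sketch of a histogram with explicitly defined bin edges, drawn from randomly generated stand-in values rather than real results. The bin range of -4 to 4 in steps of 0.5 is an arbitrary assumption. Defining the edges once in a variable makes it easy to reuse the same bins for another subset of the data.

```python
import numpy as np
import matplotlib.pyplot as plt

# Randomly generated stand-in for a column of log2 fold changes.
rng = np.random.default_rng(0)
toy_lfc = rng.normal(loc=0.0, scale=1.0, size=200)

# Explicit bin edges from -4 to 4 in steps of 0.5 (assumed range).
bins = np.arange(-4, 4.5, 0.5)

fig, ax = plt.subplots(figsize=(5, 3))
counts, edges, _ = ax.hist(toy_lfc, bins=bins, edgecolor="black")
ax.set_xlabel("log2 fold change")
ax.set_ylabel("Number of genes")
```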

<h1>Exercise #3: Pull out results for genes not corresponding to a peak</h1>

For this exercise, see if you can identify the genes that do not correspond to a ChIP-seq peak and pull out the DESeq2 results for those genes.

<h1>Exercise #4: Plot distribution of log2 fold change</h1>

Once you've identified and pulled the information for genes that did not correspond to a ChIP-seq peak, generate a histogram of the distribution of log2 fold change values. Use the same bin set-up as we used for the overlapping genes, so we can visually compare the two subsets of our data.

<h1>Exercise #5: Determine if there is a statistical difference between the distribution of the two sets of data</h1>

It looks like there may be a minor difference in the distribution of our two subsets of data. For this exercise, you'll make use of the scipy.stats package to determine if there is a statistically significant difference in the distribution of their log2 fold change values.

<h1>Determine functional group enrichment in TAZ direct targets</h1>

With our subsetted data, we can perform an over-representation analysis or GSEA to determine which functional groups TAZ may be directly or indirectly regulating. 

First, let's prepare our gene sets from MSigDB Hallmark collection for use in functional analysis:

<h2>Perform over-representation analysis</h2>

Then we can prepare our set of background genes based on the genes present in our DESeq2 results DataFrame for an over-representation analysis.

Now we can define our "interesting" genes, the set for which we want to determine which gene sets or functional groups are enriched:

Let's run the over-representation analysis using the GSEApy package's <code>gp.enrich()</code> function.
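For intuition about the question an over-representation analysis asks, the sketch below runs a one-sided hypergeometric test by hand with `scipy.stats.hypergeom`. All four counts are invented, and this illustrates the general statistical idea rather than reproducing GSEApy's exact implementation.

```python
from scipy.stats import hypergeom

# Invented counts for a single gene set.
background_size = 12000  # genes in the background
gene_set_size = 200      # background genes that belong to the gene set
n_interesting = 400      # "interesting" genes drawn from the background
n_overlap = 20           # interesting genes that belong to the gene set

# Overlap expected by chance alone.
expected = n_interesting * gene_set_size / background_size

# P(overlap >= n_overlap): the survival function evaluated at n_overlap - 1.
p_value = hypergeom.sf(n_overlap - 1, background_size, gene_set_size, n_interesting)
print(expected, p_value)
```

An observed overlap well above the expected overlap gives a small p-value. Because one such test is run per gene set, the p-values are then adjusted for multiple testing.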

Now let's take a look at the output of our analysis:

We can sort our data by the adjusted p-value so that the most significant over-representation is at the top of the DataFrame:
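A sketch of the sort on a toy table. The column name `Adjusted P-value` is assumed to match the results table returned by the analysis, so check the actual column names before copying this. The term names and values are placeholders.

```python
import pandas as pd

# Toy stand-in for an over-representation results table; values are made up.
toy_ora = pd.DataFrame({
    "Term": ["TERM_1", "TERM_2", "TERM_3"],
    "Adjusted P-value": [0.20, 0.001, 0.04],
})

# Ascending sort puts the smallest adjusted p-value first.
ora_sorted = toy_ora.sort_values("Adjusted P-value").reset_index(drop=True)
print(ora_sorted)
```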

<h2>Generate a dot plot of over-represented gene sets</h2>

We can make use of the <code>gp.dotplot()</code> function to generate a dot plot of our over-representation analysis results.

What adjustments can we make to our code to instead look at the set of genes that are indirectly regulated by TAZ?

<h1>Perform GSEA on overlapping genes</h1>

As with over-representation analysis, we can also perform GSEA on our overlapping genes to get an idea of what functional groups may be direct targets of TAZ by taking a look at the difference in enrichment between our TAZ KO and control samples.

First, we can prepare our counts matrix to pass to GSEApy's <code>gp.GSEA()</code> function:

We can rename our <code>Unnamed: 0</code> column to <code>gene</code>, so that it's easier for us to merge our DataFrames:
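A minimal sketch of the rename on a toy counts matrix. The sample column names and count values are invented.

```python
import pandas as pd

# Toy counts matrix as it might look after reading a .csv whose first
# column had no header; sample names and counts are made up.
toy_counts = pd.DataFrame({
    "Unnamed: 0": ["GENE_A", "GENE_B"],
    "sample_1": [120, 15],
    "sample_2": [98, 22],
})

# rename() maps old column names to new ones and returns a new DataFrame.
toy_counts = toy_counts.rename(columns={"Unnamed: 0": "gene"})
print(toy_counts.columns.tolist())
```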

We already have our DataFrame containing our set of overlapping genes, so let's use that to pull the count information as a new DataFrame:

Like before, we can generate our own variables that would have been created by the .cls parser.
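A sketch of what those variables might look like, assuming three KO and three control samples whose column names begin with the group label. The sample names, the group labels, and the variable names `pheno_pos`, `pheno_neg`, and `class_vector` are all hypothetical.

```python
# Hypothetical sample column names, in the same order as the counts matrix.
sample_names = ["KO_1", "KO_2", "KO_3", "Control_1", "Control_2", "Control_3"]

# One class label per sample, derived from the prefix of each sample name.
class_vector = ["KO" if name.startswith("KO") else "Control" for name in sample_names]

# The two phenotype labels being compared.
pheno_pos = "KO"
pheno_neg = "Control"

print(class_vector)
```

The order of the labels has to match the order of the sample columns, since each label is paired with a column by position.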

Now we can run a GSEA on the genes from our RNA-seq results that overlap with our ChIP-seq results as another way to identify enrichment and to glean information on the change relative to our controls.

Let's sort our output DataFrame by the FDR, so we can see the significant hits at the top of the DataFrame.

<h2>Generate GSEA plots for significant hits</h2>

Let's work together to come up with a way to generate a set of GSEA plots for our significant hits without needing to manually count how many gene sets are enriched.

In our previous lesson, we made use of the following for-loop:

```
for i in np.arange(0, 5, 1):
    fig = gs_res.plot(gs_res.res2d.Term[i])
    fig.set_size_inches(4, 6)
```

Now let's modify this code to be a bit more flexible.
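One possible approach, sketched on a toy table standing in for `gs_res.res2d`: select the significant terms with a Boolean mask and loop over them directly, so the number of plots follows the data. The `FDR q-val` column name and the 0.05 cutoff are assumptions, and the plotting call appears only as a comment because it needs a real GSEA result object.

```python
import pandas as pd

# Toy stand-in for gs_res.res2d; term names and values are made up.
toy_res2d = pd.DataFrame({
    "Term": ["TERM_1", "TERM_2", "TERM_3", "TERM_4"],
    "FDR q-val": [0.001, 0.20, 0.03, 0.60],
})

fdr_cutoff = 0.05  # assumed significance cutoff

# Cast to float in case the column was stored as generic objects.
is_sig = toy_res2d["FDR q-val"].astype(float) < fdr_cutoff
sig_terms = toy_res2d.loc[is_sig, "Term"]

plotted = []
for term in sig_terms:
    # With a real result object, the plotting lines would go here:
    # fig = gs_res.plot(term)
    # fig.set_size_inches(4, 6)
    plotted.append(term)

print(f"{len(plotted)} significant gene sets: {plotted}")
```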

<h1>Exercise #6: Run GSEA on genes that do not overlap with our ChIP-seq peaks</h1>

For this last exercise, practice running GSEA on the RNA-seq genes that were found to not overlap with our ChIP-seq genes. Afterward, we can take a look at what the result might tell us about the indirect targets of TAZ.