# Lecture 14 - Co-expression networks and enrichment analysis

In this lecture you learned how to obtain biological insight from omics data by creating co-expression networks and performing enrichment analysis.

In this tutorial we will continue to use the dataset from the previous lecture [(Lee et al., 2016)](https://www.sciencedirect.com/science/article/pii/S1550413116302480), which analysed gene expression levels in liver and adipose tissue from 12 obese patients undergoing bariatric surgery.

### Learning objectives:

- Improve your data manipulation skills with *pandas*
- Refresh your memory of network visualization with *pyvis*
- Learn to perform statistical tests with *scipy*

----------

**Note:** Before continuing please change this in your settings:

![trust_html](../lecture_09/trust_html.png)

----------

## Exercise 1: gene co-expression networks

This time we will use the *"cleaned up"* version of the data that we generated last time.

In [None]:
import pandas as pd
data = pd.read_csv('files/E-GEOD-83322_clean.tsv', sep='\t')
data.sample(5)

Let's search for genes that are co-expressed across all patients in liver samples.

> **Note** (1): To make our life easier, we will first filter by tissue and convert the table to wide format.

> **Note** (2): The dataset contains **> 12k genes**, resulting in about **150 million** gene pair combinations. To make computations faster, we will only use **100** (randomly selected) genes.
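
As a minimal sketch of the long-to-wide step from Note (1), the toy table below reuses the column names of this tutorial (`gene`, `patient`, `tissue`, `value`); the genes, patients and values are made up purely for illustration.

```python
import pandas as pd

# Toy long-format table: one row per (gene, patient, tissue) measurement.
# All entries are invented for this example.
toy = pd.DataFrame({
    'gene':    ['A', 'A', 'B', 'B', 'A', 'B'],
    'patient': ['p1', 'p2', 'p1', 'p2', 'p1', 'p1'],
    'tissue':  ['liver', 'liver', 'liver', 'liver', 'adipose tissue', 'adipose tissue'],
    'value':   [1.0, 2.0, 3.0, 4.0, 9.0, 8.0],
})

# Keep liver rows only, then spread patients across columns (one row per gene)
toy_wide = toy.query('tissue == "liver"').pivot(index='gene', columns='patient', values='value')
print(toy_wide)
```

In the wide table each row is a gene and each column a patient, which makes it easy to compare two genes across all patients.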

In [None]:
df_liver = data.query('tissue == "liver"').pivot(index='gene', columns='patient', values='value').sample(100)

# this was a so-called "one-liner", we could also have done it step by step:
#
# df_liver = data.query('tissue == "liver"')
# df_liver = df_liver.pivot(index='gene', columns='patient', values='value')
# df_liver = df_liver.sample(100)

### 1.1

Calculate the [Spearman](https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.spearmanr.html) correlation between every pair of genes, storing the results in the format you find most appropriate.

> Tip 1: check the [docs](https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.spearmanr.html) to understand how to use the *spearmanr* function.

> Tip 2: we only need to compute half the number of combinations. The correlation of (gene1, gene2) is the same as that of (gene2, gene1).
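
To illustrate both tips on something small, here is a sketch using three hypothetical genes with made-up values. It assumes `spearmanr` is called with two sequences and unpacked into a correlation and a p-value, and uses `itertools.combinations` as one possible way to visit each unordered pair once.

```python
from itertools import combinations
from scipy.stats import spearmanr

# Made-up expression values for three hypothetical genes across five samples
toy_genes = {
    'geneA': [1.0, 2.0, 3.0, 4.0, 5.0],
    'geneB': [2.0, 4.0, 8.0, 16.0, 32.0],   # increases whenever geneA increases
    'geneC': [5.0, 4.0, 3.0, 2.0, 1.0],     # decreases whenever geneA increases
}

# combinations() yields each unordered pair once: n * (n - 1) / 2 pairs in total
pairs = list(combinations(toy_genes, 2))

results = {}
for g1, g2 in pairs:
    r, p = spearmanr(toy_genes[g1], toy_genes[g2])
    results[(g1, g2)] = r
    print(g1, g2, round(float(r), 3))
```

Because Spearman correlation works on ranks, `geneA` and `geneB` are perfectly correlated even though their relationship is not linear.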

In [None]:
from scipy.stats import spearmanr

# type your code here

Click below to see a solution:

In [None]:

from scipy.stats import spearmanr

correlations = []

for i in range(len(df_liver)):
    for j in range(i+1, len(df_liver)):
        gene1 = df_liver.index[i]
        gene2 = df_liver.index[j]
        values1 = df_liver.iloc[i,:]
        values2 = df_liver.iloc[j,:]
        r, p = spearmanr(values1, values2)
        correlations.append((gene1, gene2, r, p))
        
correlations = pd.DataFrame(correlations, columns=['gene1', 'gene2', 'r', 'p'])

### 1.2

Now filter only the gene pairs with an absolute correlation above 0.8, and plot the co-expression network using [Pyvis](https://pyvis.readthedocs.io/en/latest/).

> Remember: you have already used Pyvis in **Lecture 9 (exercise 2)**. You can go back and see how to use it.

In [None]:
# type your code here...

# first filter the pairs

# this is here to help you create the network
from IPython.display import HTML, display  # needed for display(HTML(...)) below
from pyvis.network import Network
net = Network(directed=False, notebook=True, height='300px', width='500px')

# now add the selected genes as nodes
# use net.add_node() or net.add_nodes()

# then add the links between pairs as edges
# use net.add_edge() or net.add_edges()

# now show the result
net.show('tmp.html')
display(HTML(filename='tmp.html'))

Click below to see a solution:

In [None]:

selected = correlations.query("abs(r) > 0.8")

from IPython.display import HTML, display  # needed for display(HTML(...)) below
from pyvis.network import Network
net = Network(directed=False, notebook=True, height='300px', width='500px')

net.add_nodes(selected['gene1'])
net.add_nodes(selected['gene2'])
net.add_edges(selected[['gene1', 'gene2']].values)

net.show('tmp.html')
display(HTML(filename='tmp.html'))

> 🤔 Remember that we used 100 randomly selected genes. Try going back to `.sample(100)` and re-running from there to see whether a different selection gives a different result. You can also try a threshold other than 0.8.
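
One possible way to see how the threshold affects the network is to count the surviving pairs (edges) for several cut-offs before plotting. The sketch below uses a made-up table with a subset of the columns of `correlations`; the threshold values are arbitrary choices.

```python
import pandas as pd

# Made-up correlation table (hypothetical genes A-D)
toy_corr = pd.DataFrame({
    'gene1': ['A', 'A', 'A', 'B', 'B', 'C'],
    'gene2': ['B', 'C', 'D', 'C', 'D', 'D'],
    'r':     [0.95, -0.85, 0.40, 0.72, -0.10, 0.65],
})

# Count how many pairs exceed each absolute-correlation threshold
edge_counts = {}
for threshold in [0.6, 0.7, 0.8, 0.9]:
    edge_counts[threshold] = int((toy_corr['r'].abs() > threshold).sum())

print(edge_counts)
```

A stricter threshold leaves fewer edges, so the network becomes sparser and easier to read, at the cost of dropping weaker associations.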

## Exercise 2: Gene enrichment analysis

In this exercise, we will search for genes that are **over- or under-expressed in adipose tissue compared to liver** samples.

We will begin with some *"pandas on steroids"* to compute fold-change, p-values, and adjusted p-values (also known as q-values).
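
To make the idea of a q-value more concrete, here is a hand-written sketch of the Benjamini-Hochberg procedure, a common FDR correction (to my knowledge it is also the default method of `fdrcorrection` in *statsmodels*). The function name `benjamini_hochberg` and the p-values are made up for illustration; in practice you would use the library function.

```python
import numpy as np

def benjamini_hochberg(pvals):
    """Return Benjamini-Hochberg adjusted p-values (q-values), in input order."""
    pvals = np.asarray(pvals, dtype=float)
    m = len(pvals)
    order = np.argsort(pvals)                          # indices that sort p ascending
    ranked = pvals[order] * m / np.arange(1, m + 1)    # p_(i) * m / i
    # enforce monotonicity: each q is the minimum of itself and all later values
    ranked = np.minimum.accumulate(ranked[::-1])[::-1]
    q = np.empty(m)
    q[order] = np.clip(ranked, 0, 1)                   # put back in the original order
    return q

example_p = [0.01, 0.04, 0.03, 0.20]   # made-up p-values
example_q = benjamini_hochberg(example_p)
print(example_q)
```

Each q-value is at least as large as its p-value, which reflects the penalty for testing many genes at once.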

In [None]:
# change from long to wide format (each tissue and patient becomes a column, 
# then the patient information is dropped)
df = data.pivot(index='gene', columns=['tissue','patient'], values='value').droplevel(1, axis=1)

# calculate mean for each tissue and divide to get fold-change
df['FC'] = df['adipose tissue'].mean(axis=1) / df['liver'].mean(axis=1)

# use Student's t-test for independent samples (ttest_ind) to compute p-values
from scipy.stats import ttest_ind
df['p'] = df.apply(lambda x: ttest_ind(x['liver'], x['adipose tissue'])[1], axis=1)

# use FDR correction to calculate adjusted p-values (also known as q-values)
# also returns a boolean value (is_significant) if q < 0.05 (default)
from statsmodels.stats.multitest import fdrcorrection
df['is_significant'], df['q'] = fdrcorrection(df['p'])

### 2.1

Draw a **volcano plot** as follows:

- use *log2(fold-change)* on the x-axis and *-log10(q-value)* on the y-axis
- you can import *log2* and *log10* from *numpy*
- you can use *df.plot.scatter()* if you store the results as new columns
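
The short sketch below shows, on made-up numbers, why these two transformations are used: *log2* makes over- and under-expression symmetric around zero, and *-log10* places the most significant genes at the top of the plot.

```python
import numpy as np

fold_changes = np.array([4.0, 2.0, 1.0, 0.5, 0.25])    # made-up fold-changes
q_values = np.array([0.001, 0.01, 0.5, 0.01, 0.001])   # made-up q-values

x = np.log2(fold_changes)    # 4-fold up -> +2, 4-fold down -> -2
y = -np.log10(q_values)      # smaller q -> larger value on the y-axis

print(x)
print(y)
```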

In [None]:
# type your code here

Click to see solution below:

In [None]:

import numpy as np
df['log2FC'] = np.log2(df['FC'])
df['log10q'] = -np.log10(df['q'])
df.plot.scatter('log2FC', 'log10q', alpha=0.2)

### 2.2

As expected, thousands of genes are differentially expressed between liver and adipose tissue (after all, they consist of different cells with different functions).

Now create two lists of genes that are:
- *significantly* over-expressed by more than 100-fold 
- *significantly* under-expressed by more than 100-fold 
    
Finally, upload each list of genes separately in [**g:Profiler**](https://biit.cs.ut.ee/gprofiler/gost) to run the enrichment analysis.

> Tip 1: `.query()` is your friend here... 

> Tip 2: use `' '.join()` to print a space-separated list of genes
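
As a small illustration of both tips, the sketch below applies `.query()` and `' '.join()` to a toy table. The gene names and numbers are invented; only the column names `FC` and `q` follow the table built earlier in this exercise.

```python
import pandas as pd

# Toy results table indexed by (hypothetical) gene name
toy_df = pd.DataFrame(
    {'FC': [250.0, 120.0, 1.2, 0.005, 0.002, 300.0],
     'q':  [0.001, 0.30, 0.01, 0.02, 0.60, 0.004]},
    index=['GENE1', 'GENE2', 'GENE3', 'GENE4', 'GENE5', 'GENE6'],
)

# More than 100-fold up means FC > 100; more than 100-fold down means FC < 1/100
toy_over = toy_df.query('FC > 100 and q < 0.05').index
toy_under = toy_df.query('FC < 0.01 and q < 0.05').index

print(' '.join(toy_over))
print(' '.join(toy_under))
```

Note that `GENE2` and `GENE5` pass the fold-change cut-off but are dropped because their q-values are not below 0.05.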

In [None]:
# type your code here...

Click to see solution below:

In [None]:

print('Over-expressed:')
over = df.query('FC > 100 and q < 0.05').index
print(' '.join(over))
print()
print('Under-expressed:')
under = df.query('FC < 0.01 and q < 0.05').index
print(' '.join(under))

🧠 Finally, explore the results you obtained with [**g:Profiler**](https://biit.cs.ut.ee/gprofiler/gost).

> Do these results make sense? 🤔
> Remember that under-expressed in adipose tissue compared to liver is the same as over-expressed in liver compared to adipose tissue.

----------

## Wrap-up

The coding part of this tutorial was difficult. Hopefully, by lecture 14 you are becoming comfortable with Python 🐍 (and pandas 🐼🐼🐼).

It is fine if you had to take a peek at some of the solutions. Do ask for help if you have any questions 😉