# Lecture 14 - Co-expression networks and enrichment analysis

In this lecture you learned how to obtain biological insight from omics data by creating co-expression networks and performing enrichment analysis.

In this tutorial we will continue to use the dataset from the previous lecture [(Lee et al., 2016)](https://www.sciencedirect.com/science/article/pii/S1550413116302480), which analysed gene expression levels in liver and adipose tissue from 12 obese patients undergoing bariatric surgery.

This time we will use the *"cleaned up"* version of the data that we generated last time.

In [None]:
import pandas as pd
data = pd.read_csv('files/E-GEOD-83322_clean.tsv', sep='\t')
data.sample(5)

## Exercise 1: gene co-expression networks

Let's search for genes that are co-expressed across all patients in liver samples.

To make our life easier, let's filter by tissue and convert the table to wide format:

In [None]:
df_liver = data.query('tissue == "liver"').pivot(index='gene', columns='patient', values='value')
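To see what this reshaping does, here is a minimal sketch on a toy table. The values are made up for illustration; only the column names (`gene`, `patient`, `tissue`, `value`) are taken from the code above.

```python
import pandas as pd

# Toy long-format table (made-up values, same column names as the dataset)
toy = pd.DataFrame({
    'gene':    ['A', 'A', 'B', 'B', 'A', 'B'],
    'patient': [1, 2, 1, 2, 1, 1],
    'tissue':  ['liver', 'liver', 'liver', 'liver', 'adipose tissue', 'adipose tissue'],
    'value':   [5.0, 6.0, 1.0, 2.0, 9.0, 3.0],
})

# Keep liver rows only, then reshape: one row per gene, one column per patient
toy_liver = toy.query('tissue == "liver"').pivot(index='gene', columns='patient', values='value')
print(toy_liver)
```

Each row of the wide table now holds one gene's expression profile across patients, which is the shape we need to correlate genes with each other.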

Now use two nested loops over the genes to calculate the Spearman correlation between every pair of genes, storing the results in the format you find most appropriate.

> Note: the dataset contains **> 12k genes**, resulting in about **150 million** ordered gene pairs. To make things faster, we will only use **100** (randomly selected) genes.

> Also: note that we only need to compute half the number of combinations. The correlation of (gene1, gene2) is the same as that of (gene2, gene1).
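As a small illustration of this symmetry, the sketch below enumerates unordered pairs with the standard library. The gene names are hypothetical placeholders.

```python
from itertools import combinations
from math import comb

genes = ['g1', 'g2', 'g3', 'g4']  # hypothetical gene names

# combinations() yields each unordered pair exactly once, never (g, g)
pairs = list(combinations(genes, 2))
print(pairs)

# n genes give n*(n-1)/2 unordered pairs
n_pairs_toy = comb(len(genes), 2)   # pairs among the 4 toy genes
n_pairs_100 = comb(100, 2)          # pairs among the 100 sampled genes
print(n_pairs_toy, n_pairs_100)
```

So for the 100 sampled genes there are 4950 correlations to compute rather than 10,000.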

In [None]:
from scipy.stats import spearmanr

df_liver = df_liver.sample(100) # sample 100 random genes

In [None]:
# type your code here

Click below to see solution:

In [None]:
%%time  
# this is just to print the total computation time

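corr = []
for i in range(len(df_liver)):
    # start at i+1 so each unordered pair is visited once and self-pairs are skipped
    for j in range(i+1, len(df_liver)):
        gene1 = df_liver.index[i]
        gene2 = df_liver.index[j]
        values1 = df_liver.iloc[i,:]  # expression of gene1 across patients
        values2 = df_liver.iloc[j,:]  # expression of gene2 across patients
        r, p = spearmanr(values1, values2)
        corr.append((gene1, gene2, r, p))

# one row per gene pair: correlation coefficient r and its p-value p
corr = pd.DataFrame(corr, columns=['gene1', 'gene2', 'r', 'p'])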
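If only the correlation coefficients are needed, the nested loops can be replaced by a single call to pandas. The following is a sketch of that alternative on random toy data, not the approach used in the solution: it does not return p-values, and the names `toy`, `r_matrix` and `toy_corr` are introduced here for illustration.

```python
import numpy as np
import pandas as pd

# Toy expression matrix: 5 genes x 12 patients (random values, illustration only)
rng = np.random.default_rng(0)
toy = pd.DataFrame(rng.normal(size=(5, 12)), index=[f'gene{i}' for i in range(5)])

# DataFrame.corr correlates columns, so transpose to put genes in columns
r_matrix = toy.T.corr(method='spearman')

# Indices of the upper triangle; k=1 excludes the diagonal (self-correlations)
i, j = np.triu_indices(len(r_matrix), k=1)

# Back to long format: one row per unordered gene pair
toy_corr = pd.DataFrame({
    'gene1': r_matrix.index[i],
    'gene2': r_matrix.index[j],
    'r': r_matrix.values[i, j],
})
print(toy_corr)
```

On the real data the same idea would be applied to `df_liver` instead of `toy`. This is usually much faster than looping in Python, at the cost of holding the full gene-by-gene matrix in memory.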

Now filter only the gene pairs with an absolute correlation above 0.8, and plot the co-expression network using [Pyvis](https://pyvis.readthedocs.io/en/latest/).

> Remember: you have already used Pyvis in Lecture 9, Exercise 2. You can go back and see how to use it.

In [None]:
# type your code here...

Click below to see solution:

In [None]:
selected = corr.query("abs(r) > 0.8")

from pyvis.network import Network
net = Network(directed=False, notebook=True, height='300px', width='500px')

net.add_nodes(selected['gene1'])
net.add_nodes(selected['gene2'])
net.add_edges(selected[['gene1', 'gene2']].values)

net.show('tmp.html')
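Besides looking at the picture, it can help to count how many partners each gene has in the filtered network. The sketch below does this on a hypothetical edge list with the same columns as `selected`; the gene names and correlation values are made up.

```python
import pandas as pd

# Hypothetical filtered edge list, same columns as `selected`
toy_selected = pd.DataFrame({
    'gene1': ['A', 'A', 'B', 'D'],
    'gene2': ['B', 'C', 'C', 'E'],
    'r':     [0.90, -0.85, 0.95, 0.82],
})

# A gene's degree is the number of edges it appears in, on either side
degree = pd.concat([toy_selected['gene1'], toy_selected['gene2']]).value_counts()
print(degree)
```

Genes with many partners are the densely connected nodes you should be able to spot in the Pyvis plot.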

Remember that we used 100 randomly selected genes, so try running a second time with a different selection.

## Exercise 2: Gene enrichment analysis

In this exercise, we will search for genes that are over- or under-expressed in adipose tissue compared to liver samples.

We will begin with some pandas magic to compute fold-change, p-values, and adjusted p-values (also known as q-values).

In [None]:
df = data.pivot(index='gene', columns=['tissue','patient'], values='value').droplevel(1, axis=1)
df['FC'] = df['adipose tissue'].mean(axis=1) / df['liver'].mean(axis=1)

from scipy.stats import ttest_ind
df['p'] = df.apply(lambda x: ttest_ind(x['liver'], x['adipose tissue'])[1], axis=1)

from statsmodels.stats.multitest import fdrcorrection
df['is_significant'], df['q'] = fdrcorrection(df['p'])
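To make the adjustment step less of a black box, here is a hand-written sketch of the Benjamini-Hochberg procedure, which to our understanding is what `fdrcorrection` applies with its default settings. The function name `bh_adjust` and the p-values are made up for illustration; in practice you would keep using the library function.

```python
import numpy as np

def bh_adjust(pvals):
    """Benjamini-Hochberg adjusted p-values (q-values), written out step by step."""
    p = np.asarray(pvals, dtype=float)
    m = len(p)
    order = np.argsort(p)                         # indices that sort p ascending
    scaled = p[order] * m / np.arange(1, m + 1)   # p_(i) * m / i for rank i
    # enforce monotonicity: each q is the minimum scaled value at its rank or any higher rank
    q_sorted = np.minimum.accumulate(scaled[::-1])[::-1]
    q = np.empty(m)
    q[order] = np.clip(q_sorted, 0, 1)            # back to the original order, capped at 1
    return q

toy_p = [0.001, 0.008, 0.039, 0.041, 0.60]        # made-up p-values
toy_q = bh_adjust(toy_p)
print(toy_q)
```

Note that the third and fourth q-values come out equal: the monotonicity step prevents a smaller p-value from ending up with a larger q-value than its neighbour.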

Now draw a **volcano plot** as follows:

- use *log2(fold-change)* on the x-axis and *-log10(q-value)* on the y-axis
- you can import *log2* and *log10* from *numpy*
- you can use *df.plot.scatter()* if you store the results as new columns

In [None]:
# type your code here

Click to see solution below:

In [None]:
import numpy as np

df['log2FC'] = np.log2(df['FC'])
df['log10q'] = -np.log10(df['q'])
df.plot.scatter('log2FC', 'log10q', alpha=0.2)

Now create two lists of genes that are:
- significantly over-expressed by more than 100-fold 
- significantly under-expressed by more than 100-fold 
    
Upload each list of genes separately in [g:Profiler](https://biit.cs.ut.ee/gprofiler/gost) to run the enrichment analysis.

> Tip: use `' '.join()` to print a space-separated list of genes.

In [None]:
# type your code here...

Click to see solution below:

In [None]:
print('Over-expressed:')
over = df.query('FC > 100 and q < 0.05').index
print(' '.join(over))
print()
print('Under-expressed:')
under = df.query('FC < 0.01 and q < 0.05').index
print(' '.join(under))
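It can be useful to locate these cut-offs on the volcano plot axes. The short sketch below converts the thresholds used in the solution above (`FC > 100`, `FC < 0.01`, `q < 0.05`) to the log scales; the variable names are introduced here for illustration.

```python
from math import log2, log10

# Fold-change cut-offs expressed on the volcano plot's x-axis
x_over = log2(100)     # about  6.64
x_under = log2(0.01)   # about -6.64, symmetric around zero

# Significance cut-off on the y-axis: q < 0.05 corresponds to -log10(q) > about 1.30
y_sig = -log10(0.05)

print(round(x_over, 2), round(x_under, 2), round(y_sig, 2))
```

The selected genes are therefore the points in the upper-left and upper-right corners of the plot.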

Finally, explore the results you obtained with g:Profiler. Do they make sense?