## MCL-based interactome clustering

This notebook documents how the interactome is clustered using Markov clustering (MCL). 

In [1]:
import pandas as pd
import sys
sys.path.append('../../scripts/interactome_markov_clustering')

import markov_clustering_utils as mcu

### Import datasets
For convenience, we start from clusters produced by the Markov clustering tool in Cytoscape (a popular PPI analysis program). The PPI edge list used for that original clustering is also included, with its edges already weighted by stoichiometry. In short, interactions whose stoichiometries suggest strong, stable binding are given larger edge weights; please refer to the manuscript for details.

In [3]:
root = '../../data/ppi_analysis/clustering/'
edge_list = pd.read_csv(root + 'oc_interactions_stoich_weighted.csv', index_col=0)
first_clusters = pd.read_csv(root + 'mcl_i3.0_exclusion.csv')

In [12]:
# For the second round of MCL, keep only high-stoichiometry interactions,
# which we treat as core interactions.
cores = edge_list[edge_list['circle_stoi'] > 2]

# rename and reformat the raw Cytoscape output
cleaned_clusters = mcu.clean_up_cytoscape_mcl(first_clusters, grouped=True)

### Prune the original clusters
We prune the original clusters with a recursive haircut: members attached to a cluster by only a single edge are removed, and the pruning is repeated until no cluster contains a singly connected member.

In [39]:
mcl_haircut = mcu.mcl_haircut(
    cleaned_clusters, edge_list, 'prot_1', 'prot_2', edge='circle_stoi', edge_thresh=1, clean=True
)
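The recursive haircut described above can be illustrated on a toy graph. This is a minimal sketch, not the implementation in `markov_clustering_utils`: the function name `recursive_haircut` and the toy edges are hypothetical, and the sketch ignores edge weights (so it does not model the `edge` or `edge_thresh` arguments).

```python
from collections import defaultdict


def recursive_haircut(edges):
    """Repeatedly drop nodes that have exactly one edge until none remain.

    `edges` is an iterable of (node_a, node_b) pairs from a single cluster.
    Returns the surviving edges as a set of frozensets.
    Hypothetical helper for illustration only.
    """
    # Store edges as unordered pairs and ignore self-loops.
    remaining = {frozenset(edge) for edge in edges if edge[0] != edge[1]}
    while True:
        # Count how many edges touch each node.
        degree = defaultdict(int)
        for edge in remaining:
            for node in edge:
                degree[node] += 1
        # Nodes with a single edge are the ones to trim in this round.
        leaves = {node for node, count in degree.items() if count == 1}
        if not leaves:
            return remaining
        # Remove every edge that touches a trimmed node, then repeat.
        remaining = {edge for edge in remaining if not (edge & leaves)}


# A triangle (A, B, C) with a two-node tail (C-D-E) hanging off it.
toy_edges = [("A", "B"), ("B", "C"), ("A", "C"), ("C", "D"), ("D", "E")]
kept = recursive_haircut(toy_edges)
print(sorted(tuple(sorted(edge)) for edge in kept))
```

In this toy case the first round trims `E`, which leaves `D` singly connected, so the second round trims `D`; only the triangle survives. That cascading behaviour is what makes the haircut recursive rather than a single pass.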

### Do the clustering to obtain the 'core clusters'
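We run a second round of MCL on the pruned clusters, using only the core interactions, to obtain the core clusters. We then merge the original super-cluster table with the core-cluster table and clean up the result to create the final cluster file.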

In [51]:
core_clusters = mcu.second_mcl(
    mcl_haircut, 
    cores, 
    'prot_1', 
    'prot_2',
    first_thresh=15, 
    mcl_thresh=2, 
    mcl_inflation=3, 
    edge='circle_stoi', 
    clean=True
)

Clustering contains overlapping, to enable soft clustering set keep_overlap to True
Clustering contains overlapping, to enable soft clustering set keep_overlap to True
Clustering contains overlapping, to enable soft clustering set keep_overlap to True
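For readers unfamiliar with MCL, the role of the inflation parameter (`mcl_inflation=3` above) can be seen in a bare-bones version of the algorithm. The following is a minimal NumPy sketch of the textbook expansion and inflation loop, not the code behind `mcu.second_mcl` or Cytoscape's tool; the function name `markov_cluster`, its defaults, and the toy matrix are assumptions made for illustration.

```python
import numpy as np


def markov_cluster(adjacency, inflation=3.0, expansion=2, max_iter=100, tol=1e-6):
    """Cluster a weighted, undirected graph with a basic MCL loop.

    Hypothetical helper for illustration only. Returns a sorted list of
    clusters, each a tuple of node indices.
    """
    matrix = np.asarray(adjacency, dtype=float).copy()
    np.fill_diagonal(matrix, 1.0)                # add self-loops
    matrix /= matrix.sum(axis=0, keepdims=True)  # make columns sum to 1
    for _ in range(max_iter):
        previous = matrix
        # Expansion: let the random walk take `expansion` steps.
        matrix = np.linalg.matrix_power(matrix, expansion)
        # Inflation: raise entries to a power and renormalise each column,
        # which strengthens strong flows and suppresses weak ones.
        matrix = matrix ** inflation
        matrix /= matrix.sum(axis=0, keepdims=True)
        if np.allclose(matrix, previous, atol=tol):
            break
    # Each row that still carries flow lists the members of one cluster.
    clusters = set()
    for row in matrix:
        members = tuple(int(i) for i in np.flatnonzero(row > tol))
        if members:
            clusters.add(members)
    return sorted(clusters)


# Two triangles (nodes 0-2 and 3-5) joined by one weak edge between 2 and 3.
weights = np.zeros((6, 6))
for a, b, w in [(0, 1, 1), (0, 2, 1), (1, 2, 1),
                (3, 4, 1), (3, 5, 1), (4, 5, 1),
                (2, 3, 0.1)]:
    weights[a, b] = weights[b, a] = w

print(markov_cluster(weights, inflation=3.0))
```

With a weak bridging edge and an inflation of 3, the flow across the bridge is expected to die out within a few iterations, leaving the two triangles as separate clusters. Higher inflation values generally yield smaller, tighter clusters.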


In [67]:
master_table = mcl_haircut.explode('gene_names')
master_table = master_table.merge(core_clusters,  on=['super_cluster','gene_names'], how='left')

master_table = (
    master_table[['gene_names', 'super_cluster', 'core_cluster']]
    .sort_values(by=['super_cluster', 'core_cluster', 'gene_names'])
    .reset_index(drop=True)
)
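The explode-and-merge step above can be checked on a toy example. This is a small sketch with made-up gene names and cluster IDs (none of them from the dataset); it only illustrates why a left merge leaves `core_cluster` empty for genes that were not assigned to any core cluster, and why the column becomes floating point.

```python
import pandas as pd

# Hypothetical super-cluster table: one row per cluster, genes stored as lists.
super_clusters = pd.DataFrame({
    "super_cluster": [0, 1],
    "gene_names": [["GENE_A", "GENE_B"], ["GENE_C"]],
})

# Hypothetical core-cluster table: one row per gene, covering super-cluster 0 only.
core_clusters = pd.DataFrame({
    "super_cluster": [0, 0],
    "gene_names": ["GENE_A", "GENE_B"],
    "core_cluster": [0, 0],
})

# explode() turns each list element into its own row.
long_table = super_clusters.explode("gene_names")

# A left merge keeps every gene from the super-cluster table; genes without
# a core cluster get NaN, which forces the integer column to float.
merged = long_table.merge(
    core_clusters, on=["super_cluster", "gene_names"], how="left"
)
print(merged)
```

Here `GENE_C` has no match in the core-cluster table, so its `core_cluster` value is missing after the merge.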

In [70]:
master_table

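| | gene_names | super_cluster | core_cluster |
|---|---|---|---|
| 0 | ATG4B | 0 | 0.0 |
| 1 | ATG7 | 0 | 0.0 |
| 2 | CAPRIN1 | 0 | 0.0 |
| 3 | CCDC124 | 0 | 0.0 |
| 4 | DRG1 | 0 | 0.0 |
| ... | ... | ... | ... |
| 2430 | AGAP1 | 338 | |
| 2431 | AGAP3 | 338 | |
| 2432 | TANC1 | 338 | |
| 2433 | AATF | 339 | |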
