# This Jupyter notebook contains a basic example of
- how to cluster and (potentially) select structure ensembles from a contact-guided REX (replica exchange) simulation

Note: you should already be familiar with:
- the concept of dimension reduction (here TSNE)
- the concept of clustering (here KMeans)

In [None]:
%matplotlib notebook

import numpy as np
import matplotlib.pyplot as plt
import pyrexMD.misc as misc
import pyrexMD.core as core
import pyrexMD.topology as top
import pyrexMD.analysis.analysis as ana
import pyrexMD.analysis.contacts as con
import pyrexMD.analysis.gdt as gdt
import pyrexMD.decoy.cluster as clu
import MDAnalysis as mda
import os

from tqdm.notebook import tqdm
misc.apply_matplotlib_rc_settings()

# general steps
1) pre-filter the REX trajectory based on task-specific criteria (e.g. QBias, QNative, energies)
<br>2) calculate distance matrices for the filtered frames
<br>3) cluster the filtered frames

In this example we start at step 3) and use unpublished sample data.
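
Steps 1) and 2) are not run in this notebook. As a rough illustration only, the sketch below filters frames by a score threshold and then builds one intra-structure distance matrix per kept frame. It uses plain Python and made-up toy coordinates. The helper names (`prefilter`, `distance_matrix`), the threshold of 0.5, and the assumption that each frame gets its own pairwise distance matrix are illustrative choices, not pyrexMD's API.

```python
import math


def prefilter(scores, min_score):
    """Return the indices of frames whose score is >= min_score."""
    return [i for i, score in enumerate(scores) if score >= min_score]


def distance_matrix(coords):
    """Return the symmetric matrix of pairwise Euclidean distances of one frame."""
    n = len(coords)
    dm = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            dm[i][j] = dm[j][i] = math.dist(coords[i], coords[j])
    return dm


# toy input: 3 frames with 3 points each (made-up coordinates)
frames = [
    [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.0)],
    [(0.0, 0.0, 0.0), (3.0, 4.0, 0.0), (0.0, 0.0, 5.0)],
    [(0.0, 0.0, 0.0), (0.0, 6.0, 8.0), (2.0, 0.0, 0.0)],
]
qbias = [0.2, 0.8, 0.9]  # made-up per-frame scores

kept = prefilter(qbias, min_score=0.5)            # step 1: keep frames 1 and 2
dms = [distance_matrix(frames[i]) for i in kept]  # step 2: one matrix per kept frame
print(kept)
print(dms[0])
```

In a real workflow the coordinates would come from the trajectory and the threshold from the task at hand; only the order of operations (filter first, then compute distances for the kept frames) is taken from the steps above.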

In [None]:
# load data
QBias = misc.pickle_load("./files/cluster/QBias.pickle")
RMSD = misc.pickle_load("./files/cluster/RMSD.pickle")
GDT_TS = misc.pickle_load("./files/cluster/GDT_TS.pickle")

score_file = "./files/cluster/energies.log"
ENERGY = misc.read_file(score_file, usecols=1, skiprows=1)
DM = clu.read_h5("./files/cluster/DM.h5")

In [None]:
# apply TSNE for dimension reduction
tsne = clu.apply_TSNE(DM, n_components=2, perplexity=50, random_state=1)

### apply KMeans on TSNE-transformed data (two variants: low and high cluster number)
# note: the high cluster number is only 20 here because the sample is small (500 frames)

cluster10 = clu.apply_KMEANS(tsne, n_clusters=10, random_state=1)
cluster20 = clu.apply_KMEANS(tsne, n_clusters=20, random_state=1)
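
For readers who want to see what the clustering step does conceptually, here is a minimal pure-Python sketch of the k-means iteration (Lloyd's algorithm) on 2D points such as a TSNE embedding. It is not the implementation behind `clu.apply_KMEANS`; the function `kmeans_sketch`, its deterministic farthest-point initialisation, and the toy points are assumptions made for illustration.

```python
import math


def kmeans_sketch(points, n_clusters, n_iter=100):
    """Cluster points with Lloyd's algorithm; return (labels, centers)."""
    # deterministic initialisation: start with the first point, then
    # repeatedly add the point farthest from all centers chosen so far
    centers = [points[0]]
    while len(centers) < n_clusters:
        farthest = max(points, key=lambda p: min(math.dist(p, c) for c in centers))
        centers.append(farthest)

    labels = None
    for _ in range(n_iter):
        # assignment step: each point gets the label of its nearest center
        new_labels = [
            min(range(n_clusters), key=lambda k: math.dist(p, centers[k]))
            for p in points
        ]
        if new_labels == labels:  # converged: assignments no longer change
            break
        labels = new_labels
        # update step: move each center to the mean of its members
        for k in range(n_clusters):
            members = [p for p, label in zip(points, labels) if label == k]
            if members:
                centers[k] = tuple(sum(x) / len(members) for x in zip(*members))
    return labels, centers


# toy 2D points forming two well-separated groups (made-up values)
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
          (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
labels, centers = kmeans_sketch(points, n_clusters=2)
print(labels)
print(centers)
```

Library implementations usually add randomised initialisation with several restarts, which is why the cell above fixes `random_state` for reproducibility.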

In [None]:
### map scores (energies) and accuracy (GDT, RMSD) to clusters
cluster10_scores = clu.map_cluster_scores(cluster_data=cluster10, score_file=score_file)
cluster10_accuracy = clu.map_cluster_accuracy(cluster_data=cluster10, GDT=GDT_TS, RMSD=RMSD)

cluster20_scores = clu.map_cluster_scores(cluster_data=cluster20, score_file=score_file)
cluster20_accuracy = clu.map_cluster_accuracy(cluster_data=cluster20, GDT=GDT_TS, RMSD=RMSD)
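
Conceptually, mapping scores to clusters means grouping one value per frame by that frame's cluster label and summarising each group. The sketch below shows this idea in plain Python with made-up labels and energies. `cluster_stats` is a hypothetical helper, and the return format of `clu.map_cluster_scores` may differ.

```python
from collections import defaultdict
from statistics import mean


def cluster_stats(labels, values):
    """Group per-frame values by cluster label and summarise each cluster."""
    groups = defaultdict(list)
    for label, value in zip(labels, values):
        groups[label].append(value)
    return {
        label: {
            "n": len(members),
            "mean": mean(members),
            "min": min(members),
            "max": max(members),
        }
        for label, members in sorted(groups.items())
    }


# made-up example: 5 frames assigned to 2 clusters
labels = [0, 1, 0, 1, 1]
energies = [-10.0, -20.0, -14.0, -26.0, -23.0]

stats = cluster_stats(labels, energies)
for label, row in stats.items():
    print(label, row)
```

The same grouping applies to accuracy values such as GDT or RMSD: only the per-frame input list changes.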

In [None]:
### plot cluster data
# here: TSNE-transformed data with n_clusters = 10
# also: plot cluster centers with different colors
#     - red dot: n10 centers
#     - black dot: n20 centers

clu.plot_cluster_data(cluster10, tsne, ms=40)
clu.plot_cluster_center(cluster10, marker="o", color="red", ms=20)
clu.plot_cluster_center(cluster20, marker="o", color="black")

In [None]:
### plot cluster data
# here: TSNE-transformed data with n_clusters = 20
# also: plot cluster centers with different colors
#     - red dot: n10 centers
#     - black dot: n20 centers

clu.plot_cluster_data(cluster20, tsne)
clu.plot_cluster_center(cluster10, marker="o", color="red", ms=20)
clu.plot_cluster_center(cluster20, marker="o", color="black")

In [None]:
### print table with cluster scores stats

_ = clu.WF_print_cluster_scores(cluster_data=cluster10, cluster_scores=cluster10_scores, 
                                score_file=score_file)
print("-------------------------------------------------------------------")
_ = clu.WF_print_cluster_scores(cluster_data=cluster20, cluster_scores=cluster20_scores, 
                                score_file=score_file)

In [None]:
### print table with cluster accuracy stats

_ = clu.WF_print_cluster_accuracy(cluster_data=cluster10, accuracy_data=cluster10_accuracy)
print("---------------------------------------------------------------------------------")
_ = clu.WF_print_cluster_accuracy(cluster_data=cluster20, accuracy_data=cluster20_accuracy)

Note: depending on the initial filtering and the setup of the energy function, it is possible to "guess" good structure ensembles from the cluster scores and then verify the selection with the accuracy stats.
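
One way to make such a guess explicit is to rank clusters by their mean energy and then check whether the top-ranked cluster also has the best mean accuracy. The sketch below assumes that lower energy is better and that higher GDT_TS is better. The per-cluster numbers are invented toy values, not results from the sample data, and `rank_clusters` is a hypothetical helper.

```python
def rank_clusters(mean_energy):
    """Return cluster labels sorted from lowest to highest mean energy."""
    return sorted(mean_energy, key=mean_energy.get)


# invented per-cluster averages (cluster label -> value)
mean_energy = {0: -12.0, 1: -23.0, 2: -18.5}
mean_gdt = {0: 35.0, 1: 62.5, 2: 48.0}

ranking = rank_clusters(mean_energy)
best_by_energy = ranking[0]                    # the "guess"
best_by_gdt = max(mean_gdt, key=mean_gdt.get)  # the verification
agrees = best_by_energy == best_by_gdt

print("ranking by mean energy:", ranking)
print("guess confirmed by accuracy:", agrees)
```

In a blind prediction no accuracy data exists, so only the ranking step would be available; the comparison is useful for checking the selection strategy on cases with a known reference.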