# Clustering Example

In this example we cluster a short trajectory (1000 frames) of the disordered peptide
[hiAPP](https://www.ncbi.nlm.nih.gov/pubmed/24021023).

We create a normalized covariance matrix using four different metrics:

* Radius of gyration
* Solvent accessible surface area
* Asphericity
* End-to-end distance
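
As a rough illustration of what "normalized" can mean here, the sketch below standardizes each property to zero mean and unit variance before computing the covariance matrix, so that properties measured in different units contribute on the same scale. The property values are synthetic, and the z-score normalization is an assumption made for this example; idpflex may normalize differently.

```python
import numpy as np

rng = np.random.default_rng(0)
n_frames = 1000

# Synthetic stand-ins for the four per-frame properties (made-up scales and units)
properties = np.column_stack([
    rng.normal(12.0, 1.5, n_frames),      # radius of gyration
    rng.normal(3500.0, 200.0, n_frames),  # solvent accessible surface area
    rng.normal(0.4, 0.1, n_frames),       # asphericity
    rng.normal(25.0, 6.0, n_frames),      # end-to-end distance
])

# z-score: subtract the mean and divide by the sample standard deviation of each column
zscores = (properties - properties.mean(axis=0)) / properties.std(axis=0, ddof=1)

# The covariance of standardized columns is the correlation matrix (unit diagonal)
normalized_cov = np.cov(zscores, rowvar=False)
print(np.round(normalized_cov, 2))
```

After standardization, a property with large numerical values (such as a surface area) no longer dominates one with small values (such as asphericity).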

<a id='Table of Contents'></a><h3>Table of Contents</h3>
<a href='#load_env'>Load Environment</a> 
<a href='#download_data'>Download Data</a>  
<a href='#load_traj'>Loading the Trajectory</a>  
<a href='#vis_traj'>Quick Trajectory Visualization</a>  
<a href='#clustering'>Clustering</a>  
<a href='#vis_cluster_tree'>Quick View of the Clustering Tree</a>  
<a href='#pdb_repr'>Extract PDB Files for Representative Structures</a>  
<a href='#xray_crysol'>Calculation of X-Ray Profiles with CRYSOL</a>  
<a href='#bench_xray'>"Experimental" X-Ray profile</a>  
<a href='#fit_tree'>Fit the Tree Against the Experimental Profile</a>  
<a href='#best_fit'>Analysis of the Tree Level with Best Fit to Experimental Profile</a>  
<a href='#weight_cluster'>Weight of Each Cluster</a>  
<a href='#node_repr'>Representative Structures of the Nodes</a>  



(<a href='#Table of Contents'>Top</a>)<a id='load_env'></a><h3>Update Environment</h3>

Some Python packages specific to the example notebooks may need to be installed. Use either `pip` or `conda`.

**Packages installed with conda**: [conda_requirements.yml](https://raw.githubusercontent.com/jmborr/idpflex/master/notebooks/conda_requirements.yml)    
`conda env update -f conda_requirements.yml`

**Packages installed with pip**: [pip_requirements.txt](https://raw.githubusercontent.com/jmborr/idpflex/master/notebooks/pip_requirements.txt)   
`pip install -r pip_requirements.txt`

(<a href='#Table of Contents'>Top</a>)<a id='load_env'></a><h3>Load Environment</h3>

In [None]:
%matplotlib notebook
import matplotlib.pyplot as plt
plt.ion()

import os
import sys
import subprocess
import numpy as np
import MDAnalysis as mda
import nglview
from tqdm import tqdm
import pathos
import multiprocessing
import scipy
from scipy.cluster.hierarchy import dendrogram

from idpflex.cnextend import load_tree
from idpflex.cluster import cluster_with_properties
from idpflex.properties import (RadiusOfGyration, EndToEnd, SaSa, Asphericity,
                               SaxsProperty, propagator_size_weighted_sum)
from idpflex.utils import write_frame
from idpflex.bayes import fit_to_depth

(<a href='#Table of Contents'>Top</a>)<a id='download_data'></a><h3>Download Data</h3>

We assume <code>git</code> is installed on your system. Otherwise,
[follow these instructions](http://idpflex.readthedocs.io/en/latest/installation.html#testing-tutorials-data)
to download and unpack the data to <code>/tmp/idpflex_data</code>.

In [None]:
%%bash
idpflex_data_dir="/tmp/idpflex_data"
if [ -d "${idpflex_data_dir}" ]; then
    cd ${idpflex_data_dir}
    git pull --rebase
else
    git clone https://github.com/jmborr/idpflex_data ${idpflex_data_dir}
fi

In [None]:
idpflex_data_dir = '/tmp/idpflex_data'
data_dir = os.path.join(idpflex_data_dir, 'data', 'simulation')
print(data_dir)

(<a href='#Table of Contents'>Top</a>)<a id='load_traj'></a><h3>Loading the Trajectory</h3>

In [None]:
simulation = mda.Universe(os.path.join(data_dir, 'hiAPP.pdb'),
                          os.path.join(data_dir, 'hiAPP.xtc'))
print('Number of frames in trajectory is ', simulation.trajectory.n_frames)

(<a href='#Table of Contents'>Top</a>)<a id='vis_traj'></a><h3>Quick Trajectory Visualization</h3>

In [None]:
w_show = nglview.show_mdanalysis(simulation)
w_show

(<a href='#Table of Contents'>Top</a>)<a id='clustering'></a><h3>Clustering</h3>

We cluster using four different scalar properties:

* Radius of gyration
* End-to-end distance
* Solvent accessible surface area
* Asphericity

In [None]:
properties = [RadiusOfGyration, EndToEnd, SaSa, Asphericity]
cl = cluster_with_properties(simulation, properties,
                             segment_length=100, n_representatives=100)
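
To give a feel for hierarchical clustering on a handful of scalar properties, here is a minimal sketch that uses SciPy directly on synthetic data. It is not the idpflex implementation: the standardization step, the `average` linkage method, and the Euclidean metric are all assumptions chosen for illustration.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(0)

# Two synthetic groups of 20 "conformations", each described by four scalar properties
compact = rng.normal(loc=0.0, scale=0.1, size=(20, 4))
extended = rng.normal(loc=5.0, scale=0.1, size=(20, 4))
features = np.vstack([compact, extended])

# Standardize every property so that none dominates the distance
features = (features - features.mean(axis=0)) / features.std(axis=0)

# Agglomerative clustering; each row of Z records one merge:
# (child_a, child_b, merge distance, number of leaves in the new cluster)
Z = linkage(features, method='average', metric='euclidean')

# Cut the tree into two flat clusters
labels = fcluster(Z, t=2, criterion='maxclust')
print(labels)
```

The linkage matrix `Z` has the same layout that `scipy.cluster.hierarchy.dendrogram` expects as input.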

(<a href='#Table of Contents'>Top</a>)<a id='vis_cluster_tree'></a><h3>Quick View of the Clustering Tree</h3>

In [None]:
plt.title('Hierarchical Clustering Dendrogram')
plt.xlabel('sample index')
plt.ylabel('RMSD (Angstroms)')
dendrogram(cl.tree.z,
           truncate_mode='lastp',  # show only the last p merged clusters
           p=10,  # show this many clusters at the bottom of the tree
           show_leaf_counts=False,  # otherwise numbers in brackets are counts
           leaf_rotation=90.,
           leaf_font_size=12.,
           show_contracted=True,  # to get a distribution impression in truncated branches
          )
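
With `truncate_mode='lastp'` and `p=10`, only the top of the tree is drawn, and every branch below it is contracted into a single leaf, so the truncated plot has `p` leaves. The sketch below checks this on synthetic data (not on `cl.tree.z`) by asking `dendrogram` for its layout without plotting.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, dendrogram

rng = np.random.default_rng(0)
points = rng.random((100, 4))          # 100 synthetic samples with four features
Z = linkage(points, method='average')  # linkage method is an arbitrary choice here

# no_plot=True returns the layout dictionary without drawing anything
layout = dendrogram(Z, truncate_mode='lastp', p=10, no_plot=True)
n_leaves_shown = len(layout['ivl'])    # 'ivl' holds the leaf labels of the truncated tree
print(n_leaves_shown)
```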

(<a href='#Table of Contents'>Top</a>)<a id='pdb_repr'></a><h3>Extract PDB Files for Representative Structures</h3>

We extract a PDB file for each of the 100 representatives and store them under the directory `/tmp/PDB`.

In [None]:
pdb_names = ['/tmp/PDB/conf_{}.pdb'.format(idx) for idx in cl.idx]

os.makedirs('/tmp/PDB', exist_ok=True)  # directory to store the PDB files
for idx, name in tqdm(list(zip(cl.idx, pdb_names))):
    write_frame(simulation, idx, name)

(<a href='#Table of Contents'>Top</a>)<a id='xray_crysol'></a><h3>Calculation of X-Ray Profiles with CRYSOL</h3>

If `crysol` is installed on your computer, we compute a profile from each PDB file; otherwise we fetch precomputed CRYSOL output files from the `idpflex_data` repository. Either way, we store a profile for each representative in directory `/tmp/CRYSOL`.

In [None]:
from shutil import which  # returns the path of an executable, or None if not found

crysol_names = ['/tmp/CRYSOL/conf_{}.int'.format(idx) for idx in cl.idx]

if which('crysol') is None:
    # CRYSOL is not installed: unpack the precomputed profiles from idpflex_data
    subprocess.call('cp /tmp/idpflex_data/data/simulation/CRYSOL.tar.gz /tmp'.split())
    subprocess.call('tar zxf /tmp/CRYSOL.tar.gz -C /tmp'.split())
    profiles = [SaxsProperty().from_crysol_int(name) for name in crysol_names]
else:
    # CRYSOL is installed: compute one profile per PDB file, in parallel
    pool = pathos.pools.ProcessPool(processes=multiprocessing.cpu_count())
    profiles = list(tqdm(pool.map(SaxsProperty().from_crysol_pdb, pdb_names), total=len(pdb_names)))

In [None]:
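os.makedirs('/tmp/CRYSOL', exist_ok=True)
for profile, name in zip(profiles, crysol_names):
    profile.to_ascii(name)  # save each profile to an ASCII file
propagator_size_weighted_sum(profiles, cl.tree)  # propagate profiles up the tree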
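
The name `propagator_size_weighted_sum` suggests that the profile of each internal node is built from the profiles of its children, weighted by cluster size. The sketch below illustrates that idea on a plain SciPy linkage matrix with synthetic profiles; the function `propagate_size_weighted` is hypothetical and written for this example only, and the idpflex implementation may differ in its details.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage


def propagate_size_weighted(leaf_profiles, Z):
    """Assign a profile to every node of the tree described by linkage matrix Z.

    The profile of a merged node is the average of its two children,
    weighted by the number of leaves under each child.
    """
    profiles = list(leaf_profiles)   # nodes 0..n-1 are the leaves
    sizes = [1] * len(profiles)
    for child_a, child_b, _distance, count in Z:
        a, b = int(child_a), int(child_b)
        merged = (sizes[a] * profiles[a] + sizes[b] * profiles[b]) / (sizes[a] + sizes[b])
        profiles.append(merged)      # row i of Z creates node n + i
        sizes.append(int(count))
    return np.asarray(profiles)


rng = np.random.default_rng(0)
leaf_profiles = rng.random((8, 20))  # 8 synthetic leaves, 20 Q-points each
Z = linkage(leaf_profiles, method='average')
all_profiles = propagate_size_weighted(leaf_profiles, Z)
root_profile = all_profiles[-1]
```

With this weighting, the root profile is simply the average over all leaves, which is what one expects for an ensemble in which every representative counts equally.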

(<a href='#Table of Contents'>Top</a>)<a id='bench_xray'></a><h3>"Experimental" X-Ray profile</h3>

We do not have an experimental profile, so we create a fake experimental profile from the profiles of some of the nodes. The task for the fit engine is to identify which nodes we used.

Starting from the top of the tree (the root node, at `level=0`), we descend to `level=5`, which contains six nodes. We assign a weight to each of the six profiles and construct our profile as their weighted sum.

The profile will be stored as a [SAXS property](http://idpflex.readthedocs.io/en/latest/idpflex/properties.html#idpflex.properties.SaxsProperty).

In [None]:
nodes = cl.tree.nodes_at_depth(5)
weights = np.asarray([0.00, 0.13, 0.00, 0.55, 0.32, 0.00])  # the weights add up to one
# x are the Q-values
x = nodes[0]['saxs'].x
# y are the intensities
y = np.sum(weights.reshape((6, 1)) * np.asarray([n['saxs'].y for n in nodes]), axis=0)
# Errors simply taken as 10% of the intensities
e = y * 0.1
# Now we create our X-Ray property
exp_saxs = SaxsProperty(qvalues=x, profile=y, errors=e)
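
The intensity line in the cell above relies on NumPy broadcasting: reshaping the weights to a column multiplies each node profile by its own weight before summing over nodes. A small self-contained check with synthetic profiles (the array `node_profiles` is made up for this example) shows that this equals a vector-matrix product.

```python
import numpy as np

rng = np.random.default_rng(1)
node_profiles = rng.random((6, 50))  # six synthetic profiles, 50 Q-points each
weights = np.asarray([0.00, 0.13, 0.00, 0.55, 0.32, 0.00])

# Column of weights times the (6, 50) array, then sum over the six nodes
mixed_broadcast = np.sum(weights.reshape((6, 1)) * node_profiles, axis=0)

# Equivalent formulation as a vector-matrix product
mixed_matmul = weights @ node_profiles

print(np.allclose(mixed_broadcast, mixed_matmul))
```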

We can plot the property:

In [None]:
fig, ax = plt.subplots(1,1)
ax.plot(exp_saxs.x, exp_saxs.y)
ax.set_xlabel('Q', size=25)
ax.set_ylabel('Intensity', size=25)
plt.tight_layout()

(<a href='#Table of Contents'>Top</a>)<a id='fit_tree'></a><h3>Fit the Tree Against the Experimental Profile</h3>

Starting from the root node, we fit each tree level against the experimental profile, up to a maximum depth (in this case, `level=7`). Then we inspect the goodness of fit for each level.

In [None]:
fits = fit_to_depth(cl.tree, exp_saxs, exp_saxs.name, max_depth=7)
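
As a rough picture of what fitting a single tree level involves, the sketch below recovers the mixing weights of a synthetic profile with non-negative least squares. This is a simplified stand-in and not what `fit_to_depth` does internally: idpflex builds lmfit models, which may include further parameters, whereas here only the non-negative weights are fitted.

```python
import numpy as np
from scipy.optimize import nnls

rng = np.random.default_rng(2)
n_nodes, n_q = 6, 50
node_profiles = rng.random((n_nodes, n_q))  # synthetic profiles, one row per node
true_weights = np.asarray([0.00, 0.13, 0.00, 0.55, 0.32, 0.00])
target = true_weights @ node_profiles       # noise-free "experimental" profile
errors = 0.1 * target                       # 10% errors, as in the notebook

# Divide by the errors so that each Q-point counts according to its uncertainty
design = (node_profiles / errors).T         # shape (n_q, n_nodes)
rhs = target / errors

fitted_weights, residual_norm = nnls(design, rhs)
print(np.round(fitted_weights, 2))
```

Because the target is an exact mixture of the node profiles, the fit recovers the weights, including the zeros for nodes that do not contribute.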

`fits` is a list of [ModelResult](https://lmfit.github.io/lmfit-py/model.html#lmfit.model.ModelResult) instances, one result for every level. We extract the reduced $\chi^2$ of each fit as the goodness of fit and plot it versus level.

In [None]:
chi2 = [fit.redchi for fit in fits]
fig, ax = plt.subplots(1,1)
ax.set_xlabel('level', size=25)
ax.set_ylabel('Chi-squared', size=25)
ax.set_yscale('log')
ax.plot(chi2)
plt.tight_layout()

The steep drop, by orders of magnitude, in $\chi^2$ at `level=5` indicates that the fit engine successfully fitted the experimental profile.
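
The `redchi` attribute of an lmfit result is the reduced chi-squared: the sum of squared, error-weighted residuals divided by the number of degrees of freedom (data points minus varied parameters). The helper below, `reduced_chi_squared`, is a hypothetical function written for this example to make the definition concrete.

```python
import numpy as np


def reduced_chi_squared(data, model, errors, n_varied):
    """Sum of squared weighted residuals over the degrees of freedom."""
    residuals = (data - model) / errors
    n_free = data.size - n_varied
    return np.sum(residuals ** 2) / n_free


data = np.linspace(1.0, 2.0, 20)  # 20 synthetic data points
errors = 0.1 * data

# A model that matches the data exactly
perfect = reduced_chi_squared(data, data, errors, n_varied=6)

# A model that is off by exactly one error bar at every point
one_sigma_off = reduced_chi_squared(data, data + errors, errors, n_varied=6)
print(perfect, one_sigma_off)
```

A model that misses every point by about one error bar gives a value of order one, so a drop of several orders of magnitude signals a fit that is far better than the assumed errors require, as expected for a noise-free synthetic profile.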

(<a href='#Table of Contents'>Top</a>)<a id='best_fit'></a><h3>Analysis of the Tree Level with Best Fit to Experimental Profile</h3>

In [None]:
best_fit = fits[5]

(<a href='#Table of Contents'>Top</a>)<a id='weight_cluster'></a><h3>Weight of Each Cluster</h3>

We inspect the weight that the fit engine assigned to each of the six clusters of `level=5`:

In [None]:
for key in best_fit.best_values:
    if 'amplitude' in key:
        print(key, '{:4.2f}'.format(best_fit.best_values[key]))
print(['{:4.2f}'.format(x) for x in weights])  # weights used to construct the experimental profile

The order in which the fitted weights are printed differs from the order of the experimental weights. `best_fit.best_values` is a Python dictionary whose keys are not necessarily ordered by node id. However, we can use the node id in the amplitude name to sort the fitted weights from lowest to highest node id.

The fit procedure correctly identified that only three of the six nodes contribute to the experimental profile.
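
The sorting by node id mentioned above can be sketched as follows. The key format (`n<id>_amplitude`), the node ids, and the values in `best_values` are all made up for illustration; inspect the keys of `best_fit.best_values` to see the actual naming before adapting the pattern.

```python
import re

# Hypothetical fit output: parameter name -> fitted value
best_values = {
    'n12_amplitude': 0.32,
    'n7_amplitude': 0.13,
    'n9_amplitude': 0.55,
    'background': 0.0,
}


def node_id(key):
    """Extract the first integer found in a parameter name."""
    return int(re.search(r'\d+', key).group())


amplitudes = {k: v for k, v in best_values.items() if 'amplitude' in k}

# Sort numerically by node id; a plain string sort would place 'n12' before 'n7'
sorted_amplitudes = sorted(amplitudes.items(), key=lambda item: node_id(item[0]))
for key, value in sorted_amplitudes:
    print(key, '{:4.2f}'.format(value))
```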

(<a href='#Table of Contents'>Top</a>)<a id='node_repr'></a><h3>Representative Structures of the Nodes</h3>

Find a representative structure for each of the three nodes that contribute to the match of the experimental profile.

In [None]:
node_ids = [190, 192, 193]  # IDs of the clusters matching the experimental profile
leafs = [cl.tree[node_id].representative(cl.rmsd) for node_id in node_ids]
repr_names = [pdb_names[leaf.id] for leaf in leafs]  # representative structures for each node
print(repr_names)

In [None]:
view = nglview.show_file(repr_names[0])
view.display()

In [None]:
view = nglview.show_file(repr_names[1])
view.display()

In [None]:
view = nglview.show_file(repr_names[2])
view.display()