# Asp7 Example - Advanced Usage

Run this notebook on Google Colab:

[![Open in Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/AG-Peter/encodermap/blob/main/tutorials/notebooks_starter/02_Advanced_Usage-Asp7_Example.ipynb)

Find the documentation of EncoderMap:

https://ag-peter.github.io/encodermap

## For Google Colab users: Install EncoderMap

In [1]:
# !pip install "git+https://github.com/AG-Peter/encodermap.git@main"
# !pip install -r https://raw.githubusercontent.com/AG-Peter/encodermap/main/tests/test_requirements.md

In this tutorial we will use example data from a molecular dynamics simulation and learn more about advanced usage of EncoderMap. EncoderMap can create low-dimensional maps of the vast conformational spaces of molecules. This allows easy identification of the most common molecular conformations and helps to understand the relations between these conformations. In this example, we will use data from a simulation of a simple peptide: hepta-aspartic-acid.

First we need to import some libraries:

In [2]:
import encodermap as em
import matplotlib as mpl
import matplotlib.pyplot as plt
import numpy as np
from math import pi
%config Completer.use_jedi=False
%load_ext autoreload
%autoreload 2

ModuleNotFoundError: No module named 'encodermap.autoencoder'

Fix the random state of tensorflow for reproducibility.

In [3]:
import tensorflow as tf
tf.random.set_seed(3)



Next, we need to load the input data. Different kinds of variables can be used to describe molecular conformations, e.g. Cartesian coordinates, distances, angles, or dihedrals. In principle EncoderMap can deal with any of these inputs; however, some are better suited than others. A molecular conformation does not change when the molecule is translated or rotated. The chosen input variables should reflect that and be translationally and rotationally invariant.

In this example we use the backbone dihedral angles phi and psi as input as they are translationally and rotationally invariant and describe the backbone of a protein/peptide very well.
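To illustrate the invariance argument, here is a minimal sketch that computes one dihedral angle from four points, then rotates and translates the points and computes it again. The coordinates and the helper name `dihedral` are made up for illustration; they are not taken from the Asp7 data or from EncoderMap.

```python
import numpy as np

def dihedral(p0, p1, p2, p3):
    """Signed dihedral angle (radians) defined by four points."""
    b0 = p0 - p1
    b1 = p2 - p1
    b2 = p3 - p2
    b1 = b1 / np.linalg.norm(b1)
    # Components of b0 and b2 perpendicular to the central bond b1
    v = b0 - np.dot(b0, b1) * b1
    w = b2 - np.dot(b2, b1) * b1
    x = np.dot(v, w)
    y = np.dot(np.cross(b1, v), w)
    return np.arctan2(y, x)

# Four arbitrary (hypothetical) atom positions
atoms = np.array([[1.0, 0.2, 0.0],
                  [0.0, 0.0, 0.0],
                  [0.3, 1.4, 0.1],
                  [0.9, 1.8, 1.2]])

# A proper rotation (about x, then z) followed by a translation
theta = 0.7
rot_z = np.array([[np.cos(theta), -np.sin(theta), 0.0],
                  [np.sin(theta),  np.cos(theta), 0.0],
                  [0.0, 0.0, 1.0]])
rot_x = np.array([[1.0, 0.0, 0.0],
                  [0.0, np.cos(theta), -np.sin(theta)],
                  [0.0, np.sin(theta),  np.cos(theta)]])
rotation = rot_z @ rot_x
moved = atoms @ rotation.T + np.array([5.0, -2.0, 3.0])

angle_original = dihedral(*atoms)
angle_moved = dihedral(*moved)
print(angle_original, angle_moved)  # the two values agree
```

The Cartesian coordinates of all four points change, but the dihedral does not, which is why dihedrals are a convenient input here.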

The "asp7.csv" file contains one column for each dihedral and one row for each frame of the trajectory. Additionally, the last column contains a `cluster_id` from a GROMOS clustering, which we can use later for comparison. We can load this data using `numpy.loadtxt`:

In [4]:
csv_path = "asp7.csv"
data = np.loadtxt(csv_path, skiprows=1, delimiter=",")
dihedrals = data[:, :-1]
cluster_ids = data[:, -1]

NameError: name 'np' is not defined
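If you want to check the file layout before training, the following sketch applies the same slicing to a tiny in-memory stand-in for the CSV. The column names and values are hypothetical, and the `toy_` prefix keeps the variables from overwriting the real `dihedrals` and `cluster_ids`.

```python
import io
import numpy as np

# Hypothetical three-frame file: two dihedral columns plus a cluster_id column
csv_text = """phi1,psi1,cluster_id
-1.20,2.90,1
-2.50,0.60,2
3.00,-3.10,1
"""

toy_data = np.loadtxt(io.StringIO(csv_text), skiprows=1, delimiter=",")
toy_dihedrals = toy_data[:, :-1]              # all columns except the last
toy_cluster_ids = toy_data[:, -1].astype(int)  # last column only

# Dihedrals given in radians should fit inside a single 2*pi period
in_range = np.all(np.abs(toy_dihedrals) <= np.pi)
print(toy_dihedrals.shape, toy_cluster_ids, in_range)
```

A range check like this is a cheap way to notice angles that were accidentally stored in degrees.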

In [5]:
import nglview as nv
import mdtraj as md
traj = md.load('asp7.xtc', top='asp7.pdb')
view = nv.show_mdtraj(traj)
view.add_representation('hyperball')
view

AttributeError: 'super' object has no attribute '_ipython_display_'

As in the previous example, we need to set some parameters. In contrast to the Cube example, we now have periodic input data: the dihedral angles are in radians with a periodicity of 2π. We also set a few further parameters, which you can ignore for now.

In [6]:
parameters = em.Parameters()
parameters.main_path = em.misc.run_path("runs/asp7")
parameters.n_steps = 1000
parameters.dist_sig_parameters = (4.5, 12, 6, 1, 2, 6)
parameters.periodicity = 2*pi
parameters.l2_reg_constant = 10.0
parameters.summary_step = 1
parameters.tensorboard

%matplotlib inline
em.plot.distance_histogram(dihedrals[::10], 
                           parameters.periodicity, 
                           parameters.dist_sig_parameters,
                           bins=50)

NameError: name 'em' is not defined
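The periodicity matters because distances between angles have to be measured the short way around the circle. The sketch below shows the idea with plain NumPy. The function names are hypothetical, and this is not EncoderMap's internal implementation.

```python
import numpy as np

def periodic_difference(a, b, periodicity=2 * np.pi):
    """Element-wise absolute difference, wrapped to the shortest way around the period."""
    d = np.abs(np.asarray(a) - np.asarray(b)) % periodicity
    return np.minimum(d, periodicity - d)

def periodic_distance(a, b, periodicity=2 * np.pi):
    """Euclidean distance built from the wrapped per-angle differences."""
    return np.linalg.norm(periodic_difference(a, b, periodicity), axis=-1)

# Two angles that sit on either side of the +pi / -pi boundary
a = np.array([3.1])
b = np.array([-3.1])

naive = np.linalg.norm(a - b)      # treats the angles as far apart
wrapped = periodic_distance(a, b)  # recognises that they are close
print(naive, wrapped)
```

Without the wrap, two nearly identical conformations could appear to be almost a full period apart.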

Next we can run the dimensionality reduction:

In [7]:
e_map = em.EncoderMap(parameters, dihedrals)

NameError: name 'em' is not defined

The new TensorFlow 2 version of EncoderMap also allows you to view the output of the latent space during training. Switch that feature on with `e_map.add_images_to_tensorboard()`.

In [8]:
e_map.add_images_to_tensorboard()

NameError: name 'e_map' is not defined

In [9]:
e_map.train()

NameError: name 'e_map' is not defined

Project all dihedrals to the low-dimensional space...

In [10]:
low_d_projection = e_map.encode(dihedrals)

NameError: name 'e_map' is not defined

and plot the result:

In [11]:
%matplotlib inline

# RGBA color from cluster_id. Cluster ids outside 2..max_clusters+1
# (including all 1.0s) are grey, RGBA=(.5, .5, .5, .1);
# the rest are colored with matplotlib C0, C1, C2, ...
def colors_from_cluster_ids(cluster_ids, max_clusters=10):
    colors = np.full(shape=(len(cluster_ids), 4), fill_value=(.5, .5, .5, .1))
    for i in range(2, max_clusters + 2):
        where = np.where(cluster_ids == i)
        color = (*mpl.colors.to_rgb(f'C{i - 2}'), 0.3)
        colors[where] = color
    return colors

# define max clusters
max_clusters = 5

# create figure
fig, ax = plt.subplots()
scatter = ax.scatter(*low_d_projection.T, s=20, c=colors_from_cluster_ids(cluster_ids, max_clusters))

# fake a legend, because using scatter with RGBA values will not produce a legend
recs = []
for i in range(max_clusters):
    recs.append(mpl.patches.Rectangle((0, 0), 1, 1, fc=f"C{i}"))
ax.legend(recs, range(max_clusters), loc=4)

plt.show()

NameError: name 'plt' is not defined

In the above map, points from different clusters (different colors) should be well separated. However, if you didn't change the parameters, they probably are not. Some of our parameter settings appear to be unsuitable. Let's see how we can find out what went wrong.

## Visualize Learning with TensorBoard

### Running TensorBoard on Google Colab

To use TensorBoard in Google Colab notebooks, you first need to load the TensorBoard extension

```python
%load_ext tensorboard
```

And then activate it with:

```python
%tensorboard --logdir .
```

The next code cell contains these commands. Uncomment them and then continue.

### Running TensorBoard locally

TensorBoard is a visualization tool from the machine learning library TensorFlow, which is used by the EncoderMap package. During the dimensionality reduction step, when the neural network autoencoder is trained, several readings are saved in a TensorBoard format. All output files are saved to the path defined in `parameters.main_path`. Set the parameter `tensorboard` to `True` to make EncoderMap log to TensorBoard, then navigate to this location in a shell and start TensorBoard.

In case you run this tutorial in the provided Docker container you can open a new console inside the container by typing the following command in a new system shell.
```shell
docker exec -it emap bash
```
Navigate to the location where all the runs are saved. e.g.:
```shell
cd notebooks_easy/runs/asp7/
```
Start TensorBoard in this directory with:
```shell
tensorboard --logdir .
```

You should now be able to open TensorBoard in your web browser on port 6006.  
`0.0.0.0:6006` or `127.0.0.1:6006`

In the SCALARS tab of TensorBoard you should see among other values the overall cost and different contributions to the cost. The two most important contributions are `auto_cost` and `distance_cost`. `auto_cost` indicates differences between the inputs and outputs of the autoencoder. `distance_cost` is the part of the cost function which compares pairwise distances in the input space and the low-dimensional (latent) space.
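To make the idea behind `distance_cost` more tangible, here is a rough NumPy sketch. It assumes the sketch-map style sigmoid and the parameter order `(sig_h, a_h, b_h, sig_l, a_l, b_l)` for `dist_sig_parameters`; both should be checked against the EncoderMap documentation. For brevity it ignores the periodicity of the inputs, and the random arrays are placeholders rather than the Asp7 data.

```python
import numpy as np

def sigmoid(r, sigma, a, b):
    """Sketch-map style sigmoid: 0 at r=0, 0.5 at r=sigma, approaches 1 for large r."""
    return 1 - (1 + (2 ** (a / b) - 1) * (r / sigma) ** a) ** (-b / a)

def pairwise_distances(x):
    """Euclidean distances between all rows of x (periodicity ignored here)."""
    diff = x[:, None, :] - x[None, :, :]
    return np.linalg.norm(diff, axis=-1)

def distance_cost(high_d, low_d, sig_h, a_h, b_h, sig_l, a_l, b_l):
    """Mean squared mismatch between sigmoid-transformed pairwise distances."""
    d_high = sigmoid(pairwise_distances(high_d), sig_h, a_h, b_h)
    d_low = sigmoid(pairwise_distances(low_d), sig_l, a_l, b_l)
    return np.mean((d_high - d_low) ** 2)

rng = np.random.default_rng(0)
high_d = rng.uniform(-np.pi, np.pi, size=(20, 12))  # placeholder inputs
low_d = rng.normal(size=(20, 2))                    # placeholder 2D projection

cost = distance_cost(high_d, low_d, 4.5, 12, 6, 1, 2, 6)
print(cost)
```

The sigmoids decide which distance range matters: very short and very long distances are squashed towards 0 and 1, so the cost mostly responds to mismatches at intermediate distances.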

**Fixing Reloading issues**
When training multiple models and writing multiple runs to TensorBoard's logdir, we often found that reloading the data, and even refreshing the web page, did not display the data of the current run. We had to kill TensorBoard and restart it to see the new data. Setting `reload_multifile` to `True` fixed this issue.

```bash
tensorboard --logdir . --reload_multifile True
```


In your case, the overall cost as well as the auto_cost and the distance_cost are probably still decreasing after all training iterations. This tells us that we can improve the result simply by increasing the number of training steps. The following cells contain the same code as above. Set a larger number of training steps to improve the result (e.g. 3000).

In [12]:
# for Google colab uncomment these lines
# %load_ext tensorboard
# %tensorboard --logdir .

In [13]:
parameters = em.Parameters(
    main_path=em.misc.run_path("runs/asp7"),
    n_steps=1000,
    dist_sig_parameters=(4.5, 12, 6, 1, 2, 6),
    periodicity=2*pi,
    l2_reg_constant=10,
    summary_step=1,
    tensorboard=True
)

e_map = em.EncoderMap(parameters, dihedrals)

# Logging images to Tensorboard can greatly reduce performance.
# So they need to be specifically turned on
e_map.add_images_to_tensorboard(dihedrals, 2,
                                scatter_kws={'s': 50, 'c': colors_from_cluster_ids(cluster_ids, 5)}
                               )

e_map.train()

NameError: name 'em' is not defined

The molecular conformations from different clusters (different colors) should be separated a bit better now. In TensorBoard you should see the cost curves for this new run. When the cost curve becomes more or less flat towards the end, longer training does not make sense.
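One way to make "more or less flat" concrete is to compare the average cost over the last steps with the average over the steps just before. The helper below is a hypothetical sketch, not part of EncoderMap. It assumes you have the cost values as a plain sequence, for example exported from TensorBoard, and the window and tolerance are arbitrary choices.

```python
import numpy as np

def has_plateaued(costs, window=100, tolerance=0.01):
    """True if the mean cost changed by at most `tolerance` (relative)
    between the last `window` steps and the `window` steps before them."""
    costs = np.asarray(costs, dtype=float)
    if len(costs) < 2 * window:
        return False  # not enough steps to judge
    previous = costs[-2 * window:-window].mean()
    recent = costs[-window:].mean()
    return bool(abs(previous - recent) <= tolerance * abs(previous))

steps = np.arange(1000)
still_decreasing = np.exp(-steps / 500)  # synthetic curve that keeps falling
flat = np.ones(1000)                     # synthetic curve that has converged

print(has_plateaued(still_decreasing), has_plateaued(flat))
```

Looking at the curve in TensorBoard remains the more reliable check; a numeric rule like this is only a convenience when comparing many runs.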

The resulting low-dimensional projection is probably still not very detailed and clusters are probably not well separated. Currently we use a regularization constant `parameters.l2_reg_constant = 10.0`. The regularization constant influences the complexity of the network and the map. A high regularization constant will result in a smooth map with little detail. A small regularization constant will result in a rougher, more detailed map.

Go back to the previous cell and decrease the regularization constant (e.g. `parameters.l2_reg_constant = 0.001`). Play with different settings to improve the separation of the clusters in the map. Have a look at TensorBoard to see how the cost changes for different parameters.
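As a reminder of what an L2 regularization term does, the sketch below computes a generic penalty as the constant times the sum of squared weights. This is the textbook form; EncoderMap's exact normalization may differ, and the weight arrays here are made-up placeholders.

```python
import numpy as np

def l2_penalty(weight_matrices, l2_reg_constant):
    """Generic L2 penalty: constant times the sum of all squared weights."""
    return l2_reg_constant * sum(np.sum(w ** 2) for w in weight_matrices)

# Placeholder weights standing in for the layers of a small network
weights = [np.ones((2, 3)), np.full((3,), 2.0)]

strong = l2_penalty(weights, 10.0)   # large constant: big weights are expensive
weak = l2_penalty(weights, 0.001)    # small constant: weights are almost free
print(strong, weak)
```

With a large constant the optimizer is pushed towards small weights and therefore a smoother map; with a small constant the network is free to fit finer details.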

In [14]:
%matplotlib inline
plt.close('all')
plt.scatter(*e_map.encode(dihedrals).T)

NameError: name 'plt' is not defined

**Here is what you can see in Tensorboard:**

<img src="Tensorboard_Cost.png" width="800">
<img src="Tensorboard_Histograms.png" width="800">
<img src="Tensorboard_Parameters.png" width="800">
<img src="Tensorboard_Images.png" width="800">

### Save and Load
Once you are satisfied with your EncoderMap, you might want to save the result. The good news is: EncoderMap automatically saves checkpoints during the training process in `parameters.main_path`. The frequency of writing checkpoints can be defined with `parameters.checkpoint_step`. Also, your selected parameters are saved in a file called `parameters.json`. Navigate to the directory of your last run and open this `parameters.json` file in a text editor. You should find all the parameters that we have set so far. You will also find some parameters which we did not set explicitly and for which EncoderMap used its default values.

Let's start by looking at the parameters from the last run and printing them in a nicely formatted table with the `.parameters` attribute.

In [15]:
loaded_parameters = em.Parameters.from_file('runs/asp7/run0/parameters.json')
print(loaded_parameters.parameters)

NameError: name 'em' is not defined
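Because `parameters.json` is a JSON file, it can also be inspected without EncoderMap. The sketch below assumes the file holds a flat JSON object of parameter names and values. To stay self-contained it writes a small stand-in file to a temporary directory instead of reading a real run, and `load_run_parameters` is a hypothetical helper name.

```python
import json
import tempfile
from pathlib import Path

def load_run_parameters(path):
    """Read a parameters.json file into a plain dictionary."""
    with open(path) as f:
        return json.load(f)

# Stand-in file with two of the parameters set in this tutorial
with tempfile.TemporaryDirectory() as tmp:
    demo_file = Path(tmp) / "parameters.json"
    demo_file.write_text(json.dumps({"n_steps": 1000, "l2_reg_constant": 10.0}))
    demo_parameters = load_run_parameters(demo_file)

for name, value in sorted(demo_parameters.items()):
    print(f"{name}: {value}")
```

For a real run you would pass the path of that run's `parameters.json` to the helper instead.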

Before we can reload our trained network we need to save it manually: the checkpoint step was set to 5000, so the only checkpoint written so far is the one at step 0 (random initial weights). We call `e_map.save()` to do so.

In [16]:
e_map.save()

NameError: name 'e_map' is not defined

And now we reload it.

In [17]:
# get the most recently modified model file
import os
from pathlib import Path
model_files = sorted(Path("runs/asp7").rglob("*model*"), key=os.path.getmtime, reverse=True)
# drop the "_decoder"/"_encoder" suffix to get the common checkpoint name
# (str.removesuffix requires Python 3.9+)
latest_checkpoint_file = str(model_files[0]).removesuffix("_decoder").removesuffix("_encoder")
print(latest_checkpoint_file)
loaded_e_map = em.EncoderMap.from_checkpoint(latest_checkpoint_file, overwrite_tensorboard_bool=True)

IndexError: list index out of range

Now we are finished with loading and we can, for example, use the loaded EncoderMap object to project data to the low-dimensional space and plot the result:

In [18]:
low_d_projection = loaded_e_map.encode(dihedrals)

# Plotting:
%matplotlib inline
fig, ax = plt.subplots()
ax.plot(low_d_projection[:, 0], low_d_projection[:, 1], linestyle="", marker=".",
         markersize=5, color="0.7", alpha=0.1)
for i in range(9):
    mask = cluster_ids == i + 1
    ax.plot(low_d_projection[:, 0][mask], low_d_projection[:, 1][mask], label=str(i),
             linestyle="", marker=".", markersize=5, alpha=0.3)

legend = ax.legend()
for lh in legend.legendHandles:
    if hasattr(lh, "legmarker"):
        lh.legmarker.set_alpha(1)

NameError: name 'e_map' is not defined

### Generate Molecular Conformations
Already in the cube example, you have seen that with EncoderMap it is not only possible to project points to the low-dimensional space. Also, a projection of low-dimensional points into the high-dimensional space is possible. 

Here, we will use a tool from the EncoderMap library to interactively select a path in the low-dimensional map. We will project points along this path into the high-dimensional dihedral space, and use these dihedrals to reconstruct molecular conformations. This can be very useful to explore the landscape and to see what changes in the molecular conformation when going from one cluster to another.

The next cell instantiates a class with which you can interact with the low-dimensional projection of the autoencoder. You can select clusters with the `Polygon`, `Ellipse`, `Rectangle` and `Lasso` tools. The clusters will be selected from the input conformations in `asp7.xtc`.

More interesting are the `Bezier` and `Path` tools. With these you can generate molecular conformations from a path in the latent space.

Click `Set Points` and then `Write`/`Generate` to make your own clusters/paths. You can have a look at what you selected using the `sess.view` attribute of the InteractivePlotting class. For this you need to have nglview installed.

Give the InteractivePlotting a try. We would like to hear your feedback at GitHub.

**Note:** Sometimes the `notebook` backend of matplotlib does not behave well and will not work with the InteractivePlotting class. Try switching to a different backend with

```python
%matplotlib qt5
```

Sadly, you need to restart the Kernel before switching. We hope you saved your trained EncoderMap.

In [19]:
sess = em.InteractivePlotting(e_map, "asp7.xtc", data=low_d_projection,
                              top='asp7.pdb', scatter_kws={'s': 2})

NameError: name 'em' is not defined

In [20]:
sess.view

NameError: name 'sess' is not defined
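To make the idea behind the `Path` tool concrete, the following sketch samples evenly spaced points along a piecewise linear path in a 2D latent space. The vertices are made up, and `sample_path` is a hypothetical helper rather than the code used by InteractivePlotting. Decoding the sampled points back to dihedrals is only indicated in a comment, since the exact decoder call should be checked against the EncoderMap documentation.

```python
import numpy as np

def sample_path(vertices, n_points):
    """Return n_points spaced evenly (by arc length) along a polyline."""
    vertices = np.asarray(vertices, dtype=float)
    segment_lengths = np.linalg.norm(np.diff(vertices, axis=0), axis=1)
    arc_length = np.concatenate([[0.0], np.cumsum(segment_lengths)])
    targets = np.linspace(0.0, arc_length[-1], n_points)
    # Interpolate each coordinate separately as a function of arc length
    return np.column_stack(
        [np.interp(targets, arc_length, vertices[:, k]) for k in range(vertices.shape[1])]
    )

# Hypothetical path with one corner, as it might be clicked in the map
path = sample_path([(0.0, 0.0), (3.0, 0.0), (3.0, 4.0)], n_points=8)
print(path)

# Assumed next step (not executed here): decode the path with the trained model,
# e.g. something like `e_map.generate(path)`; verify the method name in the docs.
```

Even spacing along the path gives a sequence of generated conformations that changes at a steady pace from one end of the path to the other.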

As backbone dihedrals contain no information about the side-chains, only the backbone of the molecule can be reconstructed. 
If the generated conformations change very abruptly, it might be sensible to increase the regularization constant to obtain a smoother representation. If the generated conformations along a path do not change at all, the regularization is probably too strong and prevents the network from generating different conformations.

### Conclusion

In this tutorial we applied EncoderMap to a molecular system. You have learned how to monitor the EncoderMap training procedure with TensorBoard, how to restore previously saved EncoderMaps, and how to generate molecular conformations using the path selection tool.