# Working with trajectory ensembles

Run this notebook on Google Colab:

[![Open in Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/AG-Peter/encodermap/blob/main/tutorials/notebooks_MD/01_Working_with_trajectory_ensembles.ipynb)

Find the documentation of EncoderMap:

https://ag-peter.github.io/encodermap

### For Google colab only:

If you're on Google colab, please uncomment these lines and install EncoderMap.

In [1]:
# !wget https://raw.githubusercontent.com/AG-Peter/encodermap/main/tutorials/install_encodermap_google_colab.sh
# !sudo bash install_encodermap_google_colab.sh

## Primer

### Collective Variables

Collective variables (CVs) are often used to simplify and filter the xyz data of MD simulations. They help to make sense of a complex protein (or, more generally, molecular) system via just a few well-defined descriptors. They are similar to reaction coordinates, which are 1-dimensional variables along a reaction pathway, but CVs can be higher-dimensional. In a receptor-ligand system, the distance between the two species is often used as a reaction coordinate. Combining this distance CV with the relative rotation between receptor and ligand adds much more information and helps to understand the system. Clever selection of CVs is therefore an important task that many scientists in different fields face day to day.

With the tools presented in this notebook we want to give you the ability to work in a fluent and natural way with MD trajectories and their associated CV data. In the scope of this work we define CVs as **a collection of data that is in one of its dimensions aligned with the frames/timesteps of the underlying simulation data.**

### Example CVs

Here is a list, off the top of my head, of CVs that are widely used.

- Distances: This category of CVs can contain a 1-dimensional scalar value describing the distance between two species in a receptor-ligand system. A 1-dimensional CV of the end-to-end distance of a protein can describe the protein's folding state in an approximate and generalizing manner. Distance CVs can also be higher-dimensional. The pairwise distances between $\mathrm{C_\alpha}$ atoms of a protein with $n$ residues can be captured as an $n \times n$ matrix. This is a so-called hollow matrix (all diagonal elements are zero), which is also symmetric. Most often, a vector of length $\binom{n}{2}$ is therefore used to describe the pairwise distances in a protein. The distance between a protein and a membrane/interface would also fall into this category.
 
- Angles: Angular CVs lie in a periodic space of $[-\pi, \pi]$, $[0, 2\pi]$, or $[0, 360]$. The well-known Ramachandran plot uses a protein's $\phi$ and $\psi$ angles to find energetically favored conformations in a protein's backbone. Besides the backbone angles, other angles can become important in understanding a protein's conformations. Lysine is often modified via acetylation, phosphorylation, methylation, or ubiquitylation, and its sidechain angles ($\chi_1$ to $\chi_5$) can be important descriptors in such a system. Pseudo-dihedral angles also fall into this category.

- Integer/binary values: If a protein has well-defined states (folded and unfolded), a binary value describing these states could also be a useful CV.

- Values from other calculations and hyperparameters: An example for this category is the temperature at which a simulation was carried out, when simulations were conducted at multiple temperatures. If phase-space substates were obtained either from Markov chain models or from some sort of clustering algorithm (e.g. GROMOS), the membership in such a cluster could also serve as a CV.

- Positional values: Maybe even the full position of an atom or a group of atoms could be an important descriptor for a system.

- etc.
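
To make the distance and angle categories more concrete, here is a minimal sketch in plain Python. The coordinates and the helper names (`pairwise_distances`, `condensed`, `wrap_angle`) are made up for illustration and are not part of EncoderMap.

```python
import math
from itertools import combinations


def pairwise_distances(coords):
    """Return the symmetric, hollow n x n distance matrix as nested lists."""
    n = len(coords)
    mat = [[0.0] * n for _ in range(n)]
    for i, j in combinations(range(n), 2):
        d = math.dist(coords[i], coords[j])
        mat[i][j] = mat[j][i] = d
    return mat


def condensed(mat):
    """Flatten the upper triangle (without the diagonal) into a vector."""
    n = len(mat)
    return [mat[i][j] for i, j in combinations(range(n), 2)]


def wrap_angle(angle):
    """Map an angle in radians into the periodic interval [-pi, pi)."""
    return (angle + math.pi) % (2 * math.pi) - math.pi


# four hypothetical C-alpha positions (in nm), one frame only
ca_positions = [(0.0, 0.0, 0.0), (0.38, 0.0, 0.0), (0.38, 0.38, 0.0), (0.0, 0.38, 0.0)]

dist_matrix = pairwise_distances(ca_positions)
dist_vector = condensed(dist_matrix)

print(len(dist_vector))            # n * (n - 1) / 2 entries for n = 4
print(wrap_angle(1.5 * math.pi))   # the same angle expressed in [-pi, pi)
```

Because the matrix is symmetric with a zero diagonal, the condensed vector carries the same information in roughly half the memory, which is why it is the more common representation.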

### Sharing Trajectory files

Thanks to Christoph Wehmeyer and his `mdshare` Python package (https://github.com/markovmodel/mdshare), we have a convenient way of obtaining MD data without bundling gigabytes upon gigabytes of MD trajectories with the `encodermap` codebase.

In [2]:
import mdshare
import encodermap as em

# use autoreload
%load_ext autoreload
%autoreload 2


You can initialize the encodermap Repository like so:

In [3]:
repo = em.Repository(ignore_checksums=True)

FileNotFoundError: [Errno 2] No such file or directory: '/opt/hostedtoolcache/Python/3.9.16/x64/lib/python3.9/site-packages/encodermap/trajinfo/data/repository.yaml'

Print all available files.

In [4]:
repo.files[:5]

NameError: name 'repo' is not defined

Print a nicely formatted catalogue.

In [5]:
repo.print_catalogue()

NameError: name 'repo' is not defined

You can search for files with unix-like filename patterns.

In [6]:
search = repo.search('*PF*')
for k, v in search.items():
    print(f'{k:<25}{v}')

NameError: name 'repo' is not defined
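
Unix-like patterns follow the rules of shell globbing: `*` matches any run of characters and `?` matches a single one. The standard library's `fnmatch` module implements these rules, so the matching can be tried out without a repository. The file names below are invented for illustration, and this notebook does not state whether `Repository.search` uses `fnmatch` internally.

```python
import fnmatch

# hypothetical catalogue entries, not the real repository contents
catalogue = [
    "PFFP_single.xtc",
    "PFFP_single.tpr",
    "1am7_corrected.xtc",
    "1am7_protein.pdb",
]

# '*PF*' matches every name that contains 'PF' anywhere
matches_pf = fnmatch.filter(catalogue, "*PF*")

# '1am7*' matches every name that starts with '1am7'
matches_1am7 = fnmatch.filter(catalogue, "1am7*")

print(matches_pf)
print(matches_1am7)
```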

`mdshare` is an easy way to download MD data directly via Python. You can choose the directory to download files to, or you can let encodermap download the files into its own directory structure, so that files which have already been downloaded can be used again from any working directory. You can download files like this:

In [7]:
files, directory = repo.fetch('PFFP_single*')

NameError: name 'repo' is not defined

Once downloaded you can use the files however you like:

In [8]:
import mdtraj as md
traj = md.load(files[0], top=files[1])
print(traj)
print()
with open(files[1], 'r') as f:
    lines = f.read().splitlines()[:7]
print('\n'.join(lines))

NameError: name 'files' is not defined

Normally, files are put into encodermap's own directory (this should also work if you installed encodermap via pip), but you can specify the download directory via the `working_directory` argument:

In [9]:
repo = em.Repository()
repo.fetch('1am7*', working_directory='.')

FileNotFoundError: [Errno 2] No such file or directory: '/opt/hostedtoolcache/Python/3.9.16/x64/lib/python3.9/site-packages/encodermap/trajinfo/data/repository.md5'

You can also fetch complete projects. Loading a project returns an instance of one of two new classes, which are introduced after the necessary files have been downloaded.

In [10]:
traj = repo.load_project('C23')
view = traj.show_traj()
view

NameError: name 'repo' is not defined

## Classes for Trajectory Ensembles

After we have obtained some files let us work with Encodermap's `Info`-classes.

In [11]:
import encodermap as em
import xarray as xr
import numpy as np
import mdtraj as md
import MDAnalysis as mda
import matplotlib as mpl
import matplotlib.pyplot as plt

import glob
import os

# if you have nglview set up you can also import it
try:
    import nglview as ngl
except ImportError:
    pass

## The new SingleTraj class

The `SingleTraj` class is meant as a single container to hold a trajectory's xyz coordinates, its topology, and its CVs. This class forms the backbone of a fast and comprehensive representation of *trajectory ensembles*.

### Initialize.

In the background of the `SingleTraj` class `mdtraj` works its magic; however, this class offers more benefits:

- Loading CVs in a multitude of ways and directly accessing them via indexing.
- Keeping the actual data unloaded on disk until it is actually needed.
- Keeping track of the original trajectory file and the original topology file.

The `SingleTraj` class can be initialized in many ways:

In [12]:
# from traj and top file on disk
traj_xtc = em.SingleTraj('1am7_corrected.xtc', '1am7_protein.pdb')

# from a url of the pdb database
traj_pdb = em.SingleTraj('https://files.rcsb.org/view/1YUF.pdb')

# with an alternate constructor providing a pdb-id
traj_pdb = em.SingleTraj.from_pdb_id('1YUG')

# with the encodermap repository by using the load_project() method
traj_proj = repo.load_project('C23', working_directory='.')

# from an existing mdtraj trajectory
traj = md.load('1am7_corrected.xtc', top='1am7_protein.pdb')
traj_mdtraj = em.SingleTraj(traj)

NameError: name 'repo' is not defined

In [13]:
print(traj_xtc, '\n\n', traj_pdb, '\n\n', traj_proj, '\n\n', traj_mdtraj)

NameError: name 'traj_proj' is not defined

**If you initialized the traj from files you get some extra options like the basename**

In [14]:
print(traj_xtc.basename, traj_xtc.traj_file, traj_xtc.top_file)
print(traj_pdb.basename, traj_pdb.traj_file, traj_pdb.top_file)

1am7_corrected 1am7_corrected.xtc 1am7_protein.pdb
1YUG https://files.rcsb.org/view/1YUG.pdb https://files.rcsb.org/view/1YUG.pdb


### On demand loading

**Difference between `traj`, `trajectory` and `top`, `topology`**

`traj` and `top` always give `mdtraj.Trajectory` and `mdtraj.Topology`, respectively. They are loaded "on demand": the corresponding `mdtraj` object is read from disk and returned, but it is not kept by the `SingleTraj` instance, so it can be garbage collected afterwards.

`trajectory` and `topology` can be `False` and represent the current *backend* of the `SingleTraj` object.

Loading on demand in this way saves RAM.

In [15]:
print(traj_xtc.topology)
print(traj_xtc.top)
print(traj_xtc.topology)

False


FileNotFoundError: [Errno 2] No such file or directory: '1am7_protein.pdb'

In [16]:
print(traj_xtc.trajectory)
print(traj_xtc.traj)
print(traj_xtc.trajectory)

False


FileNotFoundError: [Errno 2] No such file or directory: PosixPath('1am7_corrected.xtc')

Directly accessing attributes of the `mdtraj.Trajectory` will load it from disk and return the attributes. This means that if we executed these lines:

```python
>>> print(type(traj.xyz))
<class 'numpy.ndarray'>
```

we would always get the xyz coordinate of the trajectory (because the trajectory is loaded to RAM). Even this method:

```python
>>> hasattr(traj, 'xyz')
True
```

would cause the trajectory to get loaded. Only `traj.__dict__` gives us the information that the xyz coordinates are currently not there.

In [17]:
# hasattr() on the dict object itself is always False, so test for the key instead
'xyz' in traj_xtc.__dict__

False

In [18]:
if 'xyz' not in traj_xtc.__dict__:
    print("No xyz data here")
print(traj_xtc.xyz[0,0])

No xyz data here


FileNotFoundError: [Errno 2] No such file or directory: PosixPath('1am7_corrected.xtc')
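
The behaviour described in this section can be imitated with a small toy class. This is only a sketch of the general on-demand pattern, not EncoderMap's implementation; the names `LazyTraj` and `fake_loader` are hypothetical.

```python
class LazyTraj:
    """Toy container that reads its data only when it is requested."""

    def __init__(self, loader):
        self._loader = loader  # callable that returns the data

    def __getattr__(self, name):
        # __getattr__ is only called when the normal lookup fails,
        # i.e. when 'xyz' is not stored in the instance __dict__
        if name == "xyz":
            return self._loader()
        raise AttributeError(name)

    def load(self):
        """Keep the data in memory from now on."""
        self.__dict__["xyz"] = self._loader()

    def unload(self):
        """Drop the in-memory copy; it can be loaded again later."""
        self.__dict__.pop("xyz", None)


calls = []


def fake_loader():
    # stands in for an expensive read from disk
    calls.append(1)
    return [[0.0, 0.0, 0.0]]


lazy = LazyTraj(fake_loader)
print(hasattr(lazy, "xyz"))      # True, but this already triggered a read
print("xyz" in lazy.__dict__)    # False, nothing is kept in memory
lazy.load()
print("xyz" in lazy.__dict__)    # True
lazy.unload()
print("xyz" in lazy.__dict__)    # False
```

The sketch shows why `hasattr` is not a reliable test for "is the data in memory": it goes through the same attribute lookup that triggers the read, whereas a membership test on `__dict__` does not.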

**Loading can be forced**

Using the `load_traj()` method, the trajectory will be loaded. From that point forward, the xyz data is kept in RAM and can be accessed. The `unload()` method does the reverse and frees up the RAM, but the xyz data can be loaded again later. If your RAM is large enough you will not need `unload()`, but it is there nonetheless.

In [19]:
traj_xtc.load_traj()
print(traj_xtc.topology)
traj_xtc.unload()
print(traj_xtc.topology)

FileNotFoundError: [Errno 2] No such file or directory: PosixPath('1am7_corrected.xtc')

### Take a look with nglview

If nglview is set up, you can take a look at the trajectory with this code:

In [20]:
view = traj_xtc.show_traj()
view

FileNotFoundError: [Errno 2] No such file or directory: PosixPath('1am7_corrected.xtc')

### Class attributes

These attributes might be useful when you work with `SingleTraj` instances:

In [21]:
traj_xtc.load_traj()
print(traj_xtc.n_frames)
print(traj_xtc.n_atoms)
print(traj_xtc.traj_file)
print(traj_xtc.top_file)
print(traj_xtc.basename)

FileNotFoundError: [Errno 2] No such file or directory: PosixPath('1am7_corrected.xtc')

### Duplication of mdtraj

Some methods and attributes are duplicated from `mdtraj`. This allows us to call some `mdtraj` functions on the `SingleTraj` object directly.

In [22]:
selection = traj_xtc.select('name CA')
print(selection[:5])
dssp = md.compute_dssp(traj_xtc)
print(dssp[0, :5].tolist())

FileNotFoundError: [Errno 2] No such file or directory: '1am7_protein.pdb'

In [23]:
md.compute_center_of_mass(traj_xtc)[0]

OSError: File does not exist: b'1am7_corrected.xtc'

### Indexing

By indexing the `SingleTraj` class you get another instance of the `SingleTraj` class containing only one frame.

In [24]:
frame = traj_xtc[0]
print(frame)

OSError: File does not exist: b'1am7_corrected.xtc'

If the traj has currently not been loaded (backend = 'no_load') the frame number will be stored, until the traj is loaded.

In [25]:
traj_xtc.unload()
frame = traj_xtc[1]
print(frame)

OSError: File does not exist: b'1am7_corrected.xtc'

In [26]:
frame.load_traj()
print(frame)

NameError: name 'frame' is not defined

If the index is larger than the number of frames, the loading will throw an exception.

In [27]:
traj_xtc.unload()
frame = traj_xtc[1000]
print(frame)

OSError: File does not exist: b'1am7_corrected.xtc'

In [28]:
frame.load_traj()
print(frame)

NameError: name 'frame' is not defined

### Advanced slicing

You can also index with a numpy array, a list, or even a slice.

Indexing without a stop (`[::5]`) will put the trajectory into memory.

In [29]:
traj_xtc.unload()
subsample = traj_xtc[::2]
print(traj_xtc.traj.n_frames)
print(subsample)
print(subsample.n_frames)

FileNotFoundError: [Errno 2] No such file or directory: PosixPath('1am7_corrected.xtc')

In [30]:
traj_xtc.unload()
subsample = traj_xtc[[0, 1, 5, 6]]
print(subsample)
print(subsample.n_frames)

OSError: File does not exist: b'1am7_corrected.xtc'

Fancy slices (`[start:stop:step]`) won't put the traj into memory, but can raise errors down the line when stop is greater than the number of frames.

**Note:**
Advanced slicing can result in trajectories with 0 frames in them, or possibly reverse the time axis. Use this feature only if you are sure about what you are doing.

In [31]:
traj_xtc.unload()
subsample = traj_xtc[5:46:3]
print(subsample)
print(subsample.n_frames)

encodermap.SingleTraj object. Current backend is no_load. Basename is 1am7_corrected. At indices (None, slice(5, 46, 3)). Not containing any CVs.


OSError: File does not exist: b'1am7_corrected.xtc'
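
Before applying a slice to an unloaded trajectory, it can help to check which frames it would select. The sketch below uses Python's own slice semantics on a `range`; the trajectory length `n_frames = 51` and the helper `selected_frames` are hypothetical, and a deferred slice in EncoderMap may still raise once the file is actually read.

```python
def selected_frames(index, n_frames):
    """Return the frame numbers a slice picks from a trajectory of n_frames."""
    return list(range(n_frames)[index])


n_frames = 51  # hypothetical trajectory length

# every third frame, starting at 5 and stopping before 46
frames = selected_frames(slice(5, 46, 3), n_frames)
print(frames)
print(len(frames))

# a start beyond the last frame selects nothing (a 0-frame trajectory)
print(selected_frames(slice(60, 70), n_frames))

# a negative step reverses the time axis
print(selected_frames(slice(None, None, -10), n_frames))
```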

### Advanced slicing with HDF5

The HDF5 file format (ending with .h5) allows us to directly extract frames and accelerate loading.
We can save a .h5 formatted file from a `SingleTraj` class by calling:

In [32]:
traj_xtc.save('1am7_corrected.h5', overwrite=True)

FileNotFoundError: [Errno 2] No such file or directory: PosixPath('1am7_corrected.xtc')

Loading .h5 files works like loading any other file:

In [33]:
traj_h5 = em.SingleTraj('1am7_corrected.h5')

FileNotFoundError: [Errno 2] Unable to open file (unable to open file: name = '1am7_corrected.h5', errno = 2, error message = 'No such file or directory', flags = 0, o_flags = 0)

In [34]:
print(em.trajinfo.info_single.PRINTED_HDF_ANNOTATION)

AttributeError: module 'encodermap.trajinfo.info_single' has no attribute 'PRINTED_HDF_ANNOTATION'

In [35]:
traj_h5.unload()
subsample = traj_h5[5:46:3]
print(subsample)
print(subsample.n_frames)

NameError: name 'traj_h5' is not defined

Similar to the xtc trajectory, providing a slice with only a step (`[::3]`) will load the traj into memory. You will be greeted by a nice message thanking you for using HDF5-formatted trajectories. This message is only printed once. To see it again you have to restart the kernel.

In [36]:
traj_h5.unload()
subsample = traj_h5[::3].traj
print(subsample)
print(subsample.n_frames)

NameError: name 'traj_h5' is not defined

In [37]:
traj_h5.unload()
subsample = traj_h5[[0, 1, 5, 6]].traj
print(subsample)
print(subsample.n_frames)

NameError: name 'traj_h5' is not defined

## Loading CVs

After learning about the basics of the `SingleTraj` class we will come back to collective variables. There are many ways of adding CVs to your trajectories. The easiest is to provide an already existing numpy array. However, you will be asked to also provide the attribute name (`attr_name`) of the array. This way you can load multiple CV datasets that differ in their attribute names. Here's an example:

### From numpy

In [38]:
traj = traj_xtc

# random phi/psi angles in a [0, 2pi] interval
random_raman_angles = np.random.random((traj.n_frames, 2 * traj.n_residues)) * 2 * np.pi

# define labels:
phi_angles = [f'phi {i}' for i in range(traj.n_residues)]
psi_angles = [f'psi {i}' for i in range(traj.n_residues)]
raman_labels = [None]*(len(phi_angles)+len(psi_angles))
raman_labels[::2] = phi_angles
raman_labels[1::2] = psi_angles

# load the CV
traj.load_CV(random_raman_angles, 'raman', labels=raman_labels)

# define some integer values (can be cluster memberships)
random_integers_per_frame = np.random.randint(0, 3, size=traj.n_frames)
traj.load_CV(random_integers_per_frame, 'cluster_membership')

OSError: File does not exist: b'1am7_corrected.xtc'

These values can be accessed directly via their attribute names (so make sure to use valid identifiers).

In [39]:
print(traj.cluster_membership)

AttributeError: 'SingleTraj' object has no attribute 'cluster_membership'

In [40]:
print(traj.raman)

Exception ignored in: <function ReaderBase.__del__ at 0x7f626622c5e0>
Traceback (most recent call last):
  File "/opt/hostedtoolcache/Python/3.9.16/x64/lib/python3.9/site-packages/MDAnalysis/coordinates/base.py", line 1512, in __del__
    self.close()
  File "/opt/hostedtoolcache/Python/3.9.16/x64/lib/python3.9/site-packages/MDAnalysis/coordinates/XDR.py", line 187, in close
    self._xdr.close()
AttributeError: 'XTCReader' object has no attribute '_xdr'


AttributeError: 'SingleTraj' object has no attribute 'raman'

There's also the attribute `CVs` that is a dict of these collective variables.

In [41]:
traj.CVs

{}

However, this is not the end. CVs in a `SingleTraj` class are stored as `xarray.Dataset`s. The dataset can be accessed via `_CVs`.

In [42]:
traj._CVs

**Why xarray?**

The underlying `xarray.Dataset` is intended to make sure "everything is correct". Every value can be accessed via an unambiguous identifier.

In [43]:
traj._CVs['raman'].loc[{'frame_no': 20, 'RAMAN': 'psi 50'}].values

KeyError: 'raman'

### Slicing with CVs.

Slicing keeps your values where they should be.

In [44]:
index = np.where(np.array(raman_labels) == 'psi 50')
print(traj[20].raman[index])
print(traj[[0, 5, 10, 20]].raman[:,index])

NameError: name 'raman_labels' is not defined
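
The point of keeping labels next to the values is that a label always maps to the same column, no matter which frames are selected. A plain-Python sketch of this idea follows; the sizes and the data are made up, and nested lists stand in for the array that a real CV would be.

```python
n_frames, n_residues = 25, 60  # hypothetical sizes

phi_labels = [f"phi {i}" for i in range(n_residues)]
psi_labels = [f"psi {i}" for i in range(n_residues)]

# interleave the labels: phi 0, psi 0, phi 1, psi 1, ...
labels = [None] * (2 * n_residues)
labels[::2] = phi_labels
labels[1::2] = psi_labels

# made-up data: every value encodes its own (frame, column) position
data = [[(f, c) for c in range(2 * n_residues)] for f in range(n_frames)]

# look up the column once, by label
column = labels.index("psi 50")

# selecting frames does not change which column belongs to 'psi 50'
subset = [data[f] for f in [0, 5, 10, 20]]
print(column)
print([row[column] for row in subset])
```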

### Loading from files

CVs can be loaded from files by providing the file name as a string. First, let us save some files.

In [45]:
# save numpy
np.save('raman_file.npy', traj.raman)

# save text
np.savetxt('cluster_membership_file.txt', traj.cluster_membership)

# save full CV dataset as NetCDF
traj._CVs.to_netcdf('full_CV_dataset.nc')

AttributeError: 'SingleTraj' object has no attribute 'raman'

If no `attr_name` is provided while loading files, the filename will be used:

In [46]:
traj.load_CV('raman_file.npy')
traj.load_CV('cluster_membership_file.txt')

Exception: If features are loaded via a string, the string needs to be 'all', a features name ('central_dihedrals') or an existing file. Your string "raman_file.npy"is none of those
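
Since CVs are later accessed as attributes, a name derived from a file has to be a valid Python identifier. The sketch below assumes that the file name without its extension is what ends up as the attribute name; that detail, and the helper `attr_name_from_file`, are assumptions made for illustration rather than EncoderMap's documented behaviour.

```python
from pathlib import Path


def attr_name_from_file(filename):
    """Derive an attribute name from a file name and check that it is usable."""
    name = Path(filename).stem  # file name without directory and extension
    if not name.isidentifier():
        raise ValueError(f"{name!r} is not a valid Python identifier")
    return name


print(attr_name_from_file("raman_file.npy"))
print(attr_name_from_file("cluster_membership_file.txt"))

# a name that starts with a digit cannot be used for attribute access
try:
    attr_name_from_file("1am7_angles.npy")
except ValueError as err:
    print(err)
```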

In [47]:
print(traj.CVs.keys())

dict_keys([])


Multiple CVs can be reconstructed from xarray NetCDF files (most end with .nc). If there are conflicts the new data from disk will overwrite the old.

In [48]:
traj = em.SingleTraj('1am7_corrected.xtc', '1am7_protein.pdb')
print(traj.CVs.keys())
traj.load_CV('full_CV_dataset.nc')
print(traj.CVs.keys())

dict_keys([])


Exception: If features are loaded via a string, the string needs to be 'all', a features name ('central_dihedrals') or an existing file. Your string "full_CV_dataset.nc"is none of those

### Loading with PyEMMA featurizer

We will now use PyEMMA's featurization pipeline (http://emma-project.org/latest/) to load CV data into our trajectory. For this, encodermap has its own version of PyEMMA's featurizer, accessible as `em.Featurizer`, which can simply be provided to the `SingleTraj` class.

In [49]:
import encodermap as em
%load_ext autoreload
%autoreload 2
traj = em.SingleTraj('1am7_corrected.xtc', '1am7_protein.pdb')

# instantiate featurizer
feat = em.Featurizer(traj)

# add features
feat.add_backbone_torsions()

# load
traj.load_CV(feat, attr_name='backbone_torsion')

The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload


FileNotFoundError: [Errno 2] No such file or directory: '1am7_protein.pdb'

Possible `add_*` features can be found via:

In [50]:
i = 0
for attr in dir(feat):
    if attr.startswith('add_'):
        help(getattr(feat, attr))
        i += 1
    if i == 2:
        break

NameError: name 'feat' is not defined

The advantages of this method are:

- The same can be done with the `TrajEnsemble` class (more on that later), which is also parallelized.
- Most of the features contain comprehensive labels themselves.

The labels can be accessed via the `.coords` attribute of the `SingleTraj`'s `xarray.Dataset`. They are similar to the `attr_name`s, but in all caps and without underscores.

In [51]:
print(traj._CVs.coords)

Coordinates:
    *empty*


In [52]:
print(traj._CVs.coords['BACKBONETORSIONFEATURE'].values[:10])

KeyError: 'BACKBONETORSIONFEATURE'

Here, it can be seen that there are some errors in PyEMMA's backbone_torsion feature: the sequence of backbone angles is scrambled.

### Loading with Encodermap Features

Encodermap features inherit from pyemma, but their labels are better formatted. They can be loaded via `traj.load_CV('all')` to load all of them, or via a single string or a list of these strings:

In [53]:
from encodermap.misc.misc import FEATURE_NAMES
print(FEATURE_NAMES.values())

dict_values(['all_cartesians', 'all_distances', 'central_cartesians', 'central_distances', 'central_angles', 'central_dihedrals', 'side_cartesians', 'side_distances', 'side_angles', 'side_dihedrals'])


In [54]:
traj = em.SingleTraj('1am7_corrected.xtc', '1am7_protein.pdb')
traj.load_CV(['central_angles', 'central_dihedrals'])

FileNotFoundError: [Errno 2] No such file or directory: '1am7_protein.pdb'

In [55]:
print(traj._CVs.coords['CENTRAL_DIHEDRALS'].values)

KeyError: 'CENTRAL_DIHEDRALS'

In [56]:
print(traj._CVs.coords['CENTRAL_ANGLES'].values)

KeyError: 'CENTRAL_ANGLES'

### Writing custom features No 1

Writing your custom features can be done by subclassing pyemma's features. Required methods and attributes to make your feature work are:

- The class-level attributes `__serialize_version` and `__serialize_fields`
- The methods `__init__`, `describe`, and `transform`.
- The instance attribute `dimension`, which defines the shape of the returned array.

If you want to change the name of the feature, as it appears in the `xarray.Dataset` you can set the attribute `name`.

In the next cell we will define a Feature that provides a random integer to an atom, based on its hash.

In [57]:
import encodermap as em
from encodermap.loading.features import Feature
import copy

class RandomIntForAtomFeature(Feature):
    # class inherits from encodermap's Feature
    # set required class-level variables
    __serialize_version = 0
    __serialize_fields = ('indexes', 'selstr', )
    
    # write an __init__
    def __init__(self, top, selstr='all'):
        """Init of RandomIntForAtomFeature.
        
        Args:
            top (mdtraj.Topology): The topology to select atoms from.
            
        Keyword Args:
            selstr (str, optional): The string to provide to top.select().
            Defaults to 'all'.
        
        """
        # Copy top to save it from hypothetical changes
        self.top = copy.deepcopy(top)
        
        # define indexes (this is one of the serializable fields,
        # which could be used by pyemma to save a feature to disk.)
        self.indexes = top.select(selstr)
        
        # set dimension
        self.dimension = len(self.indexes)
        
        # inherit missing methods from base
        super().__init__()
        
    def describe(self):
        """This method is not allowed to take any arguments.
        
        This method provides labels.
        
        Returns:
            list: A list of str, each str describing one feature.
            
        """
        # In this method we will build a list of str
        # Each str should describe one of our features
        # We assign ints to atoms, so the labels should tell something about the atoms
        getlbl = lambda at: f"atom {at.name:>4}:{at.index:5} {at.residue.name}:{at.residue.resSeq:>4}"
        labels = []
        for i in self.indexes:
            i = self.top.atom(i)
            labels.append(f"Random int for {getlbl(i)}")
        return labels
    
    def transform(self, traj):
        """This method provides values.
        
        Args:
            traj (mdtraj.Trajectory): An mdtraj.Trajectory.
            
        Returns:
            np.ndarray: The values of the features defined in describe.
        
        """
        # Make sure that the returned array has correct shape
        # In general it is a good idea, that this array has the same length as
        # the trajectory has frames
        # In general means, like, ..., always
        values = traj.xyz[:,:,0].astype(int)
        for i in self.indexes:
            values[:,i] = int(str(hash(str(self.top.atom(i))))[-5:])
        return values
    
    @property
    def name(self):
        # define the name of the feature to appear in `SingleTraj._CVs`
        return 'MyAwesomeFeature'

ImportError: cannot import name 'Feature' from 'encodermap.loading.features' (/opt/hostedtoolcache/Python/3.9.16/x64/lib/python3.9/site-packages/encodermap/loading/features.py)
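
One caveat about the `hash()` call in `transform`: Python randomizes string hashes per interpreter session unless `PYTHONHASHSEED` is set, so the integers assigned to the atoms change between runs. If reproducible values are wanted, a checksum such as `zlib.crc32` is one possible substitute. The helper `stable_int_for_label` and the atom labels below are made up for this sketch.

```python
import zlib


def stable_int_for_label(label, digits=5):
    """Map a string to an integer with at most `digits` digits, identically in every run."""
    checksum = zlib.crc32(label.encode("utf-8"))
    return checksum % 10**digits


# made-up atom labels; any string that identifies an atom would do
for label in ["MET1-N", "MET1-CA", "GLN2-N"]:
    print(label, stable_int_for_label(label))
```

Inside `transform`, such a helper could take the place of `int(str(hash(...))[-5:])` without changing the shape of the returned array.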

In [58]:
traj = em.SingleTraj('1am7_corrected.xtc', '1am7_protein.pdb')
print(traj)
featurizer = em.Featurizer(traj)
feat = RandomIntForAtomFeature(traj.top)
for i in feat.describe()[:200:25]:
    print(i)

encodermap.SingleTraj object. Current backend is no_load. Basename is 1am7_corrected. At indices (None,). Not containing any CVs.


FileNotFoundError: [Errno 2] No such file or directory: '1am7_protein.pdb'

In [59]:
featurizer.add_custom_feature(feat)

NameError: name 'featurizer' is not defined

In [60]:
traj.load_CV(featurizer)

NameError: name 'featurizer' is not defined

In [61]:
traj._CVs.coords['MYAWESOMEFEATURE'].values

KeyError: 'MYAWESOMEFEATURE'

### Writing custom features No 2

In this example we will implement a method of calculating a nematic order parameter. This example will be quite different (working with coarse-grained hydrocarbon chains (so-called telechelics), and not with proteins), but we will work our way through. Here are some references you might consider:

```
@article{mukherjee2012derivation,
  title={Derivation of coarse grained models for multiscale simulation of liquid crystalline phase transitions},
  author={Mukherjee, Biswaroop and Delle Site, Luigi and Kremer, Kurt and Peter, Christine},
  journal={The Journal of Physical Chemistry B},
  volume={116},
  number={29},
  pages={8474--8484},
  year={2012},
  publisher={ACS Publications}
}

@article{flachmuller2021coarse,
  title={Coarse grained simulation of the aggregation and structure control of polyethylene nanocrystals},
  author={Flachm{\"u}ller, Alexander and Mecking, Stefan and Peter, Christine},
  journal={Journal of Physics: Condensed Matter},
  volume={33},
  number={26},
  pages={264001},
  year={2021},
  publisher={IOP Publishing}
}
```
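
For orientation, one common textbook definition of the nematic order parameter is sketched here. The cited papers may use a variant, so the notation is an assumption rather than the definition used in those works. With unit vectors $\mathbf{u}_i$ pointing along each of the $N$ chain segments, the ordering tensor is

$$
Q_{\alpha\beta} = \frac{1}{N} \sum_{i=1}^{N} \left( \frac{3}{2} u_{i\alpha} u_{i\beta} - \frac{1}{2} \delta_{\alpha\beta} \right), \qquad \alpha, \beta \in \{x, y, z\},
$$

and the order parameter $S$ is the largest eigenvalue of $Q$. Perfectly aligned segments give $S = 1$ (for example, all $\mathbf{u}_i = (0, 0, 1)$ yields $Q = \mathrm{diag}(-\tfrac{1}{2}, -\tfrac{1}{2}, 1)$), while an isotropic arrangement gives $S \approx 0$. As a CV, $S$ is one scalar per frame, so it is aligned with the frames of the simulation like every other CV in this notebook.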

## Saving trajectory and CVs into one file

A trajectory can be saved together with its CVs as one comprehensive file with the `save()` method. What's more, loading such a file again makes it possible to access any frames and their corresponding CVs almost instantaneously.

In [62]:
traj = em.SingleTraj('1am7_corrected.xtc', '1am7_protein.pdb')
traj.load_CV('all')
traj.save('1am7_all_CVs.h5')

FileNotFoundError: [Errno 2] No such file or directory: '1am7_protein.pdb'

In [63]:
new_traj = em.SingleTraj('1am7_all_CVs.h5')
frames = new_traj[[0, 5, 20, 35]]
frames.CentralCartesians.shape

FileNotFoundError: [Errno 2] Unable to open file (unable to open file: name = '1am7_all_CVs.h5', errno = 2, error message = 'No such file or directory', flags = 0, o_flags = 0)

In [64]:
frames = new_traj[::5]
print(frames.CentralBondDistances.shape)
print(frames._CVs.coords['CENTRALBONDDISTANCES'].values)

NameError: name 'new_traj' is not defined

However, CVs are deleted when the number of atoms is altered.

In [65]:
subset = frames.atom_slice(frames.select('name CA'))

NameError: name 'frames' is not defined

In [66]:
subset.CVs

NameError: name 'subset' is not defined

In [67]:
index = np.where(np.array(raman_labels) == 'psi 50')
print(traj[20].raman[index])
print(traj[[0, 5, 10, 20]].raman[:,index])

NameError: name 'raman_labels' is not defined

### Loading from files

CVs can be loaded by providing a string to files. First, let us save some files.

In [68]:
# save numpy
np.save('raman_file.npy', traj.raman)

# save text
np.savetxt('cluster_membership_file.txt', traj.cluster_membership)

# save full CV dataset as NetCDF
traj._CVs.to_netcdf('full_CV_dataset.nc')

AttributeError: 'SingleTraj' object has no attribute 'raman'

If not providing an `attr_name`, while loading files, the filename will be used:

In [69]:
traj.load_CV('raman_file.npy')
traj.load_CV('cluster_membership_file.txt')

Exception: If features are loaded via a string, the string needs to be 'all', a features name ('central_dihedrals') or an existing file. Your string "raman_file.npy"is none of those

In [70]:
print(traj.CVs.keys())

dict_keys([])


Multiple CVs can be reconstructed from xarray NetCDF files (most end with .nc). If there are conflicts the new data from disk will overwrite the old.

In [71]:
traj = em.SingleTraj('1am7_corrected.xtc', '1am7_protein.pdb')
print(traj.CVs.keys())
traj.load_CV('full_CV_dataset.nc')
print(traj.CVs.keys())

dict_keys([])


Exception: If features are loaded via a string, the string needs to be 'all', a features name ('central_dihedrals') or an existing file. Your string "full_CV_dataset.nc"is none of those

### Loading with PyEMMA featurizer

We will now use PyEMMA's featurization pipeline (http://emma-project.org/latest/) to load CV data into our trajectory. For this encodermap has its own Version of PyEMMA's featurizer accessible with `em.Featurizer` which can simply be provided to the `SingleTraj` class.

In [72]:
import encodermap as em
%load_ext autoreload
%autoreload 2
traj = em.SingleTraj('1am7_corrected.xtc', '1am7_protein.pdb')

# instantiate featurizer
feat = em.Featurizer(traj)

# add features
feat.add_backbone_torsions()

# load
traj.load_CV(feat, attr_name='backbone_torsion')

The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload


FileNotFoundError: [Errno 2] No such file or directory: '1am7_protein.pdb'

Possible `add_*` features can be found via:

In [73]:
i = 0
for attr in dir(feat):
    if attr.startswith('add_'):
        help(getattr(feat, attr))
        i += 1
    if i == 2:
        break

NameError: name 'feat' is not defined
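The loop above stops after printing two help pages. If only the method names are of interest, a comprehension over `dir()` is enough. The sketch below uses a hypothetical `DummyFeaturizer` as a stand-in, so it runs without EncoderMap or trajectory files; the method names on it are placeholders.

```python
class DummyFeaturizer:
    """Hypothetical stand-in exposing a few add_* methods."""

    def add_distances(self):
        pass

    def add_backbone_torsions(self):
        pass

    def describe(self):
        pass


def list_add_methods(obj):
    """Return the names of all callable attributes that start with 'add_'."""
    return [
        name
        for name in dir(obj)
        if name.startswith("add_") and callable(getattr(obj, name))
    ]


add_methods = list_add_methods(DummyFeaturizer())
print(add_methods)
```

Because `dir()` returns names in alphabetical order, the resulting list is sorted as well.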

The advantages of this method are:

- The same can be done with the `TrajEnsemble` class (more on that later), which is also parallelized.
- Most of the features contain comprehensive labels themselves.

The labels can be accessed via the `.coords` attribute of the `SingleTraj`'s `xarray.Dataset`. The coordinate names are similar to the `attr_name`s, but are written in all caps and without underscores.

In [74]:
print(traj._CVs.coords)

Coordinates:
    *empty*


In [75]:
print(traj._CVs.coords['BACKBONETORSIONFEATURE'].values[:10])

KeyError: 'BACKBONETORSIONFEATURE'

Here it can be seen that PyEMMA's `backbone_torsion` feature has a problem: the sequence of backbone angles is scrambled.

### Loading with Encodermap Features

EncoderMap's features inherit from PyEMMA's, but their labels are better formatted. They can be loaded all at once via `traj.load_CV('all')`, or selectively via a single feature name or a list of feature names:

In [76]:
from encodermap.misc.misc import FEATURE_NAMES
print(FEATURE_NAMES.values())

dict_values(['all_cartesians', 'all_distances', 'central_cartesians', 'central_distances', 'central_angles', 'central_dihedrals', 'side_cartesians', 'side_distances', 'side_angles', 'side_dihedrals'])
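The three accepted forms (`'all'`, a single name, a list of names) can be thought of as being normalized to a list before loading. The sketch below is an assumption about that logic, not EncoderMap's implementation: `resolve_features` is a hypothetical helper, the names are copied from the output above, and loading from file paths is not modeled.

```python
# feature names copied from the FEATURE_NAMES output above
FEATURE_NAMES = [
    "all_cartesians", "all_distances", "central_cartesians",
    "central_distances", "central_angles", "central_dihedrals",
    "side_cartesians", "side_distances", "side_angles", "side_dihedrals",
]


def resolve_features(request):
    """Turn 'all', a single name, or a list of names into a validated list."""
    if request == "all":
        return list(FEATURE_NAMES)
    names = [request] if isinstance(request, str) else list(request)
    unknown = [name for name in names if name not in FEATURE_NAMES]
    if unknown:
        raise ValueError(f"Unknown feature name(s): {unknown}")
    return names


single = resolve_features("central_dihedrals")
several = resolve_features(["central_angles", "central_dihedrals"])
everything = resolve_features("all")
print(single, several, len(everything))
```

Validating the names up front gives a clear error for typos such as `'central_dihedral'` before any trajectory data is touched.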


In [77]:
traj = em.SingleTraj('1am7_corrected.xtc', '1am7_protein.pdb')
traj.load_CV(['central_angles', 'central_dihedrals'])

FileNotFoundError: [Errno 2] No such file or directory: '1am7_protein.pdb'

In [78]:
print(traj._CVs.coords['CENTRAL_DIHEDRALS'].values)

KeyError: 'CENTRAL_DIHEDRALS'

In [79]:
print(traj._CVs.coords['CENTRAL_ANGLES'].values)

KeyError: 'CENTRAL_ANGLES'

### Writing custom features No 1

Writing your custom features can be done by subclassing pyemma's features. Required methods and attributes to make your feature work are:

- The class-level attributes `__serialize_version` and `__serialize_fields`
- The methods `__init__`, `describe`, and `transform`.
- The instance attribute `dimension`, which defines the shape of the returned array.

If you want to change the name under which the feature appears in the `xarray.Dataset`, you can set the attribute `name`.

In the next cell we will define a feature that assigns each atom a pseudo-random integer based on its hash.

In [80]:
import encodermap as em
from encodermap.loading.features import Feature
import copy

class RandomIntForAtomFeature(Feature):
    # the class inherits from the Feature base class imported above
    # set required class-level variables
    __serialize_version = 0
    __serialize_fields = ('indexes', 'selstr', )
    
    # write an __init__
    def __init__(self, top, selstr='all'):
        """Init of RandomIntForAtomFeature.
        
        Args:
            top (mdtraj.Topology): The topology to select atoms from.
            
        Keyword Args:
            selstr (str, optional): The string to provide to top.select().
            Defaults to 'all'.
        
        """
        # Copy top to save it from hypothetical changes
        self.top = copy.deepcopy(top)
        
        # define indexes (this is one of the serializable fields,
        # which could be used by pyemma to save a feature to disk.)
        self.indexes = top.select(selstr)
        
        # set dimension
        self.dimension = len(self.indexes)
        
        # inherit missing methods from base
        super().__init__()
        
    def describe(self):
        """This method is not allowed to take any arguments.
        
        This method provides labels.
        
        Returns:
            list: A list of str, each str describing one feature.
            
        """
        # In this method we will build a list of str
        # Each str should describe one of our features
        # We assign ints to atoms, so the labels should tell something about the atoms
        getlbl = lambda at: f"atom {at.name:>4}:{at.index:5} {at.residue.name}:{at.residue.resSeq:>4}"
        labels = []
        for i in self.indexes:
            i = self.top.atom(i)
            labels.append(f"Random int for {getlbl(i)}")
        return labels
    
    def transform(self, traj):
        """This method provides values.
        
        Args:
            traj (mdtraj.Trajectory): An mdtraj.Trajectory.
            
        Returns:
            np.ndarray: The values of the features defined in describe.
        
        """
        # The returned array must have shape (n_frames, self.dimension):
        # one row per trajectory frame, one column per selected atom.
        # Matching the number of frames is not optional.
        values = traj.xyz[:, self.indexes, 0].astype(int)
        for col, i in enumerate(self.indexes):
            values[:, col] = int(str(hash(str(self.top.atom(i))))[-5:])
        return values
    
    @property
    def name(self):
        # define the name of the feature to appear in `SingleTraj._CVs`
        return 'MyAwesomeFeature'

ImportError: cannot import name 'Feature' from 'encodermap.loading.features' (/opt/hostedtoolcache/Python/3.9.16/x64/lib/python3.9/site-packages/encodermap/loading/features.py)
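The contract between `describe`, `transform`, and `dimension` listed above can be checked without PyEMMA or EncoderMap. The following is a minimal sketch using a stand-in class; `ToyFeature` and `check_feature` are hypothetical names, and the coordinates are random numbers rather than a real trajectory.

```python
import numpy as np


class ToyFeature:
    """Stand-in for a feature class; not derived from PyEMMA."""

    def __init__(self, atom_names, indexes):
        self.atom_names = list(atom_names)
        self.indexes = np.asarray(indexes)
        self.dimension = len(self.indexes)

    def describe(self):
        # one label per selected atom
        return [
            f"x coordinate of atom {self.atom_names[i]} ({i})"
            for i in self.indexes
        ]

    def transform(self, xyz):
        # xyz has shape (n_frames, n_atoms, 3); keep only the selected atoms
        return xyz[:, self.indexes, 0]


def check_feature(feature, xyz):
    """Verify that labels, dimension, and returned values agree."""
    labels = feature.describe()
    values = feature.transform(xyz)
    assert len(labels) == feature.dimension
    assert values.shape == (xyz.shape[0], feature.dimension)
    return values.shape


xyz = np.random.default_rng(0).normal(size=(5, 4, 3))  # 5 frames, 4 atoms
feature = ToyFeature(["N", "CA", "C", "O"], indexes=[1, 3])
shape = check_feature(feature, xyz)
print(shape)
```

Running such a check with a selection smaller than the full topology is a quick way to catch a `transform` that returns one column per atom instead of one column per selected atom.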

In [81]:
traj = em.SingleTraj('1am7_corrected.xtc', '1am7_protein.pdb')
print(traj)
featurizer = em.Featurizer(traj)
feat = RandomIntForAtomFeature(traj.top)
for i in feat.describe()[:200:25]:
    print(i)

encodermap.SingleTraj object. Current backend is no_load. Basename is 1am7_corrected. At indices (None,). Not containing any CVs.


FileNotFoundError: [Errno 2] No such file or directory: '1am7_protein.pdb'

In [82]:
featurizer.add_custom_feature(feat)

NameError: name 'featurizer' is not defined

In [83]:
traj.load_CV(featurizer)

NameError: name 'featurizer' is not defined

In [84]:
traj._CVs.coords['MYAWESOMEFEATURE'].values

KeyError: 'MYAWESOMEFEATURE'

### Writing custom features No 2

In this example we will implement a feature that calculates a nematic order parameter. The system is quite different from before (coarse-grained hydrocarbon chains, so-called telechelics, instead of proteins), but we will work our way through it. Here are some references you might consider:

```
@article{mukherjee2012derivation,
  title={Derivation of coarse grained models for multiscale simulation of liquid crystalline phase transitions},
  author={Mukherjee, Biswaroop and Delle Site, Luigi and Kremer, Kurt and Peter, Christine},
  journal={The Journal of Physical Chemistry B},
  volume={116},
  number={29},
  pages={8474--8484},
  year={2012},
  publisher={ACS Publications}
}

@article{flachmuller2021coarse,
  title={Coarse grained simulation of the aggregation and structure control of polyethylene nanocrystals},
  author={Flachm{\"u}ller, Alexander and Mecking, Stefan and Peter, Christine},
  journal={Journal of Physics: Condensed Matter},
  volume={33},
  number={26},
  pages={264001},
  year={2021},
  publisher={IOP Publishing}
}
```
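As a starting point, a nematic order parameter can be computed from a set of orientation vectors via the Q tensor, $Q_{\alpha\beta} = \frac{3}{2}\langle u_\alpha u_\beta \rangle - \frac{1}{2}\delta_{\alpha\beta}$, whose largest eigenvalue is the scalar order parameter $S$. The sketch below uses this common textbook definition; it is an assumption that it suits the systems in the references above, and the input vectors are made up rather than taken from a simulation.

```python
import numpy as np


def nematic_order_parameter(vectors):
    """Return the scalar order parameter S and the director of 3D vectors."""
    u = np.asarray(vectors, dtype=float)
    # normalize every vector to unit length
    u = u / np.linalg.norm(u, axis=1, keepdims=True)
    # Q_ab = 3/2 <u_a u_b> - 1/2 delta_ab, averaged over all vectors
    q = 1.5 * np.einsum("ia,ib->ab", u, u) / len(u) - 0.5 * np.eye(3)
    # eigh returns eigenvalues in ascending order; the largest one is S
    eigvals, eigvecs = np.linalg.eigh(q)
    return eigvals[-1], eigvecs[:, -1]


# perfectly aligned vectors (head-tail flips do not matter)
aligned = np.array([[0.0, 0.0, 1.0], [0.0, 0.0, -2.0], [0.0, 0.0, 0.5]])
s_aligned, director = nematic_order_parameter(aligned)

# equal weight on the three Cartesian axes: no preferred direction
orthogonal = np.eye(3)
s_orthogonal, _ = nematic_order_parameter(orthogonal)
print(s_aligned, s_orthogonal)
```

In a custom feature, a function like this would be evaluated once per frame inside `transform`, so that the returned array has one value per frame.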

## Saving trajectory and CVs into one file

A trajectory can be saved together with its CVs as one comprehensive file with the `save()` method. What's more, loading such a file again makes it possible to access any frames and their corresponding CVs almost instantaneously.

In [85]:
traj = em.SingleTraj('1am7_corrected.xtc', '1am7_protein.pdb')
traj.load_CV('all')
traj.save('1am7_all_CVs.h5')

FileNotFoundError: [Errno 2] No such file or directory: '1am7_protein.pdb'

In [86]:
new_traj = em.SingleTraj('1am7_all_CVs.h5')
frames = new_traj[[0, 5, 20, 35]]
frames.CentralCartesians.shape

FileNotFoundError: [Errno 2] Unable to open file (unable to open file: name = '1am7_all_CVs.h5', errno = 2, error message = 'No such file or directory', flags = 0, o_flags = 0)

In [87]:
frames = new_traj[::5]
print(frames.CentralBondDistances.shape)
print(frames._CVs.coords['CENTRALBONDDISTANCES'].values)

NameError: name 'new_traj' is not defined
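Since CVs are aligned with the frames along their first axis, indexing frames selects the matching rows of every CV. A NumPy-only sketch with made-up shapes shows what to expect from the two indexing styles used above; the array names are placeholders and no EncoderMap objects are involved.

```python
import numpy as np

n_frames, n_atoms = 50, 12
# made-up CVs: the first axis is always the frame axis
central_cartesians = np.zeros((n_frames, n_atoms, 3))
central_distances = np.zeros((n_frames, n_atoms - 1))

picked = [0, 5, 20, 35]
shape_picked = central_cartesians[picked].shape  # four selected frames
shape_strided = central_distances[::5].shape     # every fifth frame
print(shape_picked, shape_strided)
```

Only the frame axis changes; the per-frame shape of each CV stays the same.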

However, CVs are deleted when the number of atoms is altered.

In [88]:
subset = frames.atom_slice(frames.select('name CA'))

NameError: name 'frames' is not defined

In [89]:
subset.CVs

NameError: name 'subset' is not defined