# Welcome

To start this tutorial I want to quote Daniel M. Zuckerman:

"The trajectory ensemble is everything you’ve always wanted, and more.  Really, it is.  Trajectory ensembles unlock fundamental ideas in statistical mechanics, including connections between equilibrium and non-equilibrium phenomena.  Simple sketches of these objects immediately yield important equations without a lot of math.  Give me the trajectory-ensemble pictures over fancy formalism any day.  It’s harder to make a mistake with a picture than a complicated equation."

Read more here: http://statisticalbiophysicsblog.org/?p=92

This idea and many more went into the creation of our new EncoderMap package. I have some exciting new ideas and concepts to show you so let us jump straight in with trajectory ensembles.

As computational chemists, we rarely work with single trajectories. Most modern Python packages account for this: they let you load multiple files and combine them into one data stream. However, they all fall short when it comes to ensembles with different topologies. Oftentimes we want to directly compare a mutated protein with its wild type (WT). A mutation might introduce additional atoms, which separates the WT and the mutant in every analysis pipeline: calculations are run for each protein separately and only combined afterwards to make sense of the results.

With the new framework of EncoderMap this is still possible. Even better, you can now simply provide a list of trajectories and topologies, and EncoderMap will put them through the same pipeline.

In this tutorial you will be introduced to two new classes in EncoderMap: `SingleTraj` and `TrajEnsemble`. They are meant to work with large **trajectory ensembles** and be easy on your system's resources (i.e. use less RAM). These classes only point to the files on disk until the data is needed, at which point the data is loaded into RAM and kept there for later use.

Most MD workflows don't use all coordinates but extract a subset of internal coordinates, so-called collective variables (CVs). The two new classes keep track of your CVs and the corresponding trajectory frames, so that you can simply grab a single frame from a large, possibly fragmented database of simulations.


In this tutorial you will see how these two classes work. You will learn how to:
- Instantiate the classes from different trajectory formats.
- Slice and subsample single trajectories.
- Load a set of trajectories with **different topologies** and group them by a common string.
- Use a subsample of the whole trajectory ensemble.
- Load and save high-dimensional and low-dimensional CVs for your trajectory ensemble.
- Keep track of where individual frames come from.

# Imports

First let us import the packages we are working with.

In [None]:
import encodermap as em
import numpy as np
import mdtraj as md
import MDAnalysis as mda
import matplotlib as mpl
import matplotlib.pyplot as plt

import glob
import os

# if you have nglview set up you can also import it
try:
    import nglview as ngl
except ImportError:
    ngl = None  # nglview is optional; only the viewer cells need it

# autoreload and matplotlib backend
%matplotlib notebook
%load_ext autoreload
%autoreload 2

# New classes for working with trajs and their CVs

## The new `SingleTraj` class

The `SingleTraj` class is meant as a single container that holds a trajectory's xyz coordinates, its topology, its high-dimensional CVs and its low-dimensional representation. Here are some examples of what you can do with it.

### Initialize.

The `SingleTraj` class can be initialized in many ways. Most of the input is just piped to mdtraj. The three most common are:

- From a trajectory file and a topology file
- From an .h5 trajectory file (faster for random access, e.g. for clustering)
- From an existing mdtraj trajectory.

In [None]:
traj1 = em.SingleTraj("tests/data/1am7_corrected.xtc", "tests/data/1am7_protein.pdb")
traj2 = em.SingleTraj("tests/data/traj.h5")
# use a descriptive name instead of `_`, which IPython reserves for the last output
md_traj = md.load("tests/data/1am7_corrected.xtc", top="tests/data/1am7_protein.pdb")
traj3 = em.SingleTraj(md_traj)

In [None]:
print(traj1, '\n', traj2, '\n', traj3)

**If you initialized the traj from files you get some extra options like the basename**

In [None]:
print(traj1.basename, traj1.traj_file, traj1.top_file)
print(traj2.basename, traj2.traj_file, traj2.top_file)

### The topology is always there

The topology of a `SingleTraj` class is always accessible through its `top` attribute. Getting the topology does not require many resources, as most trajectory formats save it in a separate file (.gro, .pdb, ...) or as a separate, quickly accessible data field (.h5).

In [None]:
for traj in [traj1, traj2, traj3]:
    print(traj.top.to_fasta())

### On demand loading

**Difference between traj, trajectory and top, topology**

`traj` and `top` always give an `mdtraj.Trajectory` and an `mdtraj.Topology`, respectively. They are loaded "on demand": the corresponding mdtraj object is read from disk and returned, but it is not kept on the `SingleTraj` instance, so it can be garbage collected afterwards.

`trajectory` and `topology` can be `False` and represent the current *backend* of the `SingleTraj` object.

This approach saves RAM.

In [None]:
print(traj2.topology)
print(traj2.top)
print(traj2.topology)

In [None]:
print(traj1.trajectory)
print(traj1.traj)
print(traj1.trajectory)

Directly accessing attributes of the `mdtraj.Trajectory` will load it from disk and return the attributes.

In [None]:
if 'xyz' not in traj1.__dict__:
    print("No xyz data here")
print(traj1.xyz[0,0])

**Loading can be forced**

In [None]:
traj1.load_traj()
print(traj1.topology)
traj1.unload()
print(traj1.topology)
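
The on-demand behaviour described in this section can be pictured with a small toy class. This is only a sketch of the general pattern, not EncoderMap's implementation: `LazyFrames`, `fake_loader` and the method names are hypothetical stand-ins, and plain lists replace real trajectory data.

```python
class LazyFrames:
    """Toy stand-in for an on-demand trajectory container (not EncoderMap code)."""

    def __init__(self, loader):
        self._loader = loader    # callable that reads the frames from "disk"
        self.trajectory = False  # False means: nothing is held in memory

    @property
    def traj(self):
        # Return the cached frames if present, otherwise read them without caching.
        if self.trajectory is not False:
            return self.trajectory
        return self._loader()

    def load(self):
        # Forced loading: keep the frames on the instance.
        self.trajectory = self._loader()

    def unload(self):
        self.trajectory = False

    def __len__(self):
        # The length reflects the loading state.
        return 0 if self.trajectory is False else len(self.trajectory)


calls = []  # records every simulated disk read


def fake_loader():
    calls.append("read")
    return [[0.0, 0.1], [0.2, 0.3], [0.4, 0.5]]


lazy = LazyFrames(fake_loader)
state_before = (lazy.trajectory, len(lazy))
n_on_demand = len(lazy.traj)         # triggers a read, nothing is cached
state_after_access = lazy.trajectory
lazy.load()                          # forced load keeps the data
state_loaded = len(lazy)
lazy.unload()
state_unloaded = len(lazy)
```

In this sketch the on-demand property never changes the stored state, while `load()` and `unload()` do, which mirrors the printed `False` values before and after the forced load above.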

### Take a look with nglview

In [None]:
view = traj2.show_traj()
view

### Class attributes

**len** of the class is special, as it also reflects the loading state. If the current backend is 'no_load', `len()` of the `SingleTraj` instance is 0.

In [None]:
traj1.load_traj()
print(traj1.n_frames)
print(traj1.n_atoms)
print(traj1.basename)
print(len(traj1), len(traj2), len(traj3))

### Duplication of mdtraj

Some methods and attributes are duplicated from mdtraj. This allows us to call some mdtraj functions on the SingleTraj object.

In [None]:
selection = traj1.top.select('name CA')
print(selection[:5])
dssp = md.compute_dssp(traj1.traj)
print(dssp[0, :5].tolist())

In [None]:
md.compute_center_of_mass(traj1.traj)[0]

### Indexing

By indexing the SingleTraj class you get another instance of the SingleTraj class containing only one frame.

In [None]:
frame = traj1[0]
print(len(traj1))
print(frame)
print(len(frame))

If the traj has currently not been loaded (backend = 'no_load') the frame number will be stored, until the traj is loaded.

In [None]:
traj1.unload()
frame = traj1[1]
print(frame)

In [None]:
frame.load_traj()
print(frame)
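
How an index can be remembered until the data is actually read is sketched below. This is a minimal illustration under stated assumptions: `DeferredFrames` and `read_frames` are hypothetical names, lists of strings stand in for frames, and chained indexing is deliberately left out.

```python
class DeferredFrames:
    """Toy model: remember which frames were requested until data is loaded."""

    def __init__(self, loader, pending=None):
        self._loader = loader    # callable standing in for reading a file
        self._pending = pending  # int, list or slice to apply at load time
        self.frames = None       # None means: nothing loaded yet

    def __getitem__(self, key):
        # Indexing an unloaded object only records the request.
        return DeferredFrames(self._loader, pending=key)

    def load(self):
        data = self._loader()
        key = self._pending
        if key is None:
            self.frames = data
        elif isinstance(key, int):
            self.frames = [data[key]]
        elif isinstance(key, slice):
            self.frames = data[key]
        else:
            self.frames = [data[i] for i in key]


def read_frames():
    # Pretend this reads ten frames from disk.
    return [f"frame_{i}" for i in range(10)]


frame = DeferredFrames(read_frames)[1]
pending_before_load = frame.frames  # nothing loaded yet
frame.load()
loaded_single = frame.frames        # only the requested frame

subset = DeferredFrames(read_frames)[[0, 1, 5, 6]]
subset.load()
loaded_subset = subset.frames
```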

### Advanced slicing

You can also index with a numpy array, a list or a slice.

In [None]:
traj1 = em.SingleTraj("tests/data/1am7_corrected.xtc", "tests/data/1am7_protein.pdb")
traj1.unload()
subsample = traj1[::2]
print(traj1.n_frames)
print(subsample)
print(subsample.n_frames)

In [None]:
traj1.unload()
subsample = traj1[[0, 1, 5, 6]]
print(traj1.n_frames)
print(subsample)
print(subsample.n_frames)

In [None]:
traj1.unload()
subsample = traj1[5:46:3]
print(traj1.n_frames)
print(subsample)
print(subsample.n_frames)
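
To see which frames the three indexing styles above select, the same keys can be applied to a plain `range`. The frame count of 51 is a hypothetical value chosen for illustration; it is not read from the test trajectory.

```python
n_frames = 51  # hypothetical frame count, not taken from the test data

all_frames = range(n_frames)

every_second = list(all_frames[::2])            # slice with a step
picked = [all_frames[i] for i in [0, 1, 5, 6]]  # explicit list of indices
strided = list(all_frames[5:46:3])              # start, stop and step

# slice.indices() resolves a slice against a given length
start, stop, step = slice(5, 46, 3).indices(n_frames)
```

The stop value is exclusive, so `5:46:3` ends at frame 44 rather than 46.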

### Advanced slicing with HDF5

The HDF5 file format (ending with .h5) allows us to directly extract frames and accelerate loading.

In [None]:
traj2.unload()
subsample = traj2[5:46:3]
print(traj2.n_frames)
print(subsample)
print(subsample.n_frames)

In [None]:
subsample.load_traj()
print(subsample.traj)

In [None]:
traj2.unload()
subsample = traj2[::3].traj
print(traj2.n_frames)
print(subsample)
print(subsample.n_frames)

In [None]:
traj2.unload()
subsample = traj2[[0, 1, 5, 6]].traj
print(subsample)
print(subsample.n_frames)

### Stacking, joining and adding

There are three operations to combine two `SingleTraj` objects.

- Adding (`traj1 + traj2`) adds trajectories along the 'trajectory-axis' and returns a `TrajEnsemble` (more on that later).
- Stacking combines the atoms of both trajs along the 'atom-axis'. For this method, the trajs need to have the same number of frames. This method returns an mdtraj Trajectory, because `SingleTraj` can't handle multiple file sources (yet?).
- Joining concatenates the frames of both trajs along the 'time-axis'. For this method, the trajs need to have the same topologies. This method returns an mdtraj Trajectory, because `SingleTraj` can't handle multiple file sources (yet?).
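
The three axes can be illustrated with nested lists, where the outer list runs over frames and the inner list over atoms. This is a conceptual sketch only: the helper functions `add`, `stack` and `join` are hypothetical and do not call EncoderMap or mdtraj. Stacking is modelled as requiring equal frame counts, which is what mdtraj's `Trajectory.stack` expects.

```python
# Outer list: frames (time-axis). Inner list: atoms (atom-axis).
traj_a = [["a0_f0", "a1_f0"], ["a0_f1", "a1_f1"]]  # 2 frames, 2 atoms
traj_b = [["b0_f0"], ["b0_f1"]]                    # 2 frames, 1 atom
traj_c = [["a0_f2", "a1_f2"]]                      # 1 frame, 2 atoms


def add(first, second):
    # Trajectory-axis: keep both trajectories side by side as an ensemble.
    return [first, second]


def stack(first, second):
    # Atom-axis: merge atoms frame by frame, so frame counts must match.
    if len(first) != len(second):
        raise ValueError("stacking needs the same number of frames")
    return [fa + fb for fa, fb in zip(first, second)]


def join(first, second):
    # Time-axis: append frames, so the atom counts must match.
    if first and second and len(first[0]) != len(second[0]):
        raise ValueError("joining needs the same atoms")
    return first + second


ensemble = add(traj_a, traj_b)
stacked = stack(traj_a, traj_b)
joined = join(traj_a, traj_c)
```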

In [None]:
traj1.traj

In [None]:
traj1 = em.SingleTraj("tests/data/1am7_corrected.xtc", "tests/data/1am7_protein.pdb")
traj2 = em.SingleTraj("tests/data/traj.h5")
new = traj1 + traj2
print(new, '\n')

if traj1.n_frames > traj2.n_frames:
    traj1 = traj1[np.arange(traj2.n_frames)]
else:
    traj2 = traj2[np.arange(traj1.n_frames)]
new = traj1.stack(traj2)
print(new, '\n')


traj1 = em.SingleTraj("tests/data/1am7_corrected.xtc", "tests/data/1am7_protein.pdb")[:5]
traj2 = em.SingleTraj("tests/data/1am7_corrected.xtc", "tests/data/1am7_protein.pdb")[:-5]
new = traj1.join(traj2)
print(new)

### Superposing

Similar to mdtraj trajectories, `SingleTraj` objects can be superposed.

In [None]:
traj1 = em.SingleTraj("tests/data/1am7_corrected.xtc", "tests/data/1am7_protein.pdb")[0]
traj2 = em.SingleTraj("tests/data/1am7_corrected.xtc", "tests/data/1am7_protein.pdb")[-2:-1]

superposed = traj1.superpose(traj2)

view = superposed.show_traj()
view

### Equality

In [None]:
traj1 = em.SingleTraj("tests/data/1am7_corrected.xtc", "tests/data/1am7_protein.pdb")
traj2 = em.SingleTraj("tests/data/1am7_corrected.xtc", "tests/data/1am7_protein.pdb")

print(traj1 == traj2)

### Inside a context manager

Inside a context manager the traj is loaded and upon exit, the traj is unloaded.

In [None]:
traj1 = em.SingleTraj("tests/data/1am7_corrected.xtc", "tests/data/1am7_protein.pdb")

with traj1 as t:
    print("inside context")
    print(t.backend)
    print(t.basename)
    print(t.trajectory)
    
print("\noutside context")
print(traj1.backend)
print(traj1.basename)
print(traj1.trajectory)
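
The load-on-enter, unload-on-exit behaviour follows Python's context manager protocol. The sketch below shows that general pattern with a hypothetical `ManagedFrames` class; the label `"loaded"` is a placeholder and not necessarily the backend name EncoderMap reports.

```python
class ManagedFrames:
    """Toy container that holds data only inside a `with` block."""

    def __init__(self, loader):
        self._loader = loader
        self.backend = "no_load"
        self.trajectory = False

    def __enter__(self):
        self.trajectory = self._loader()
        self.backend = "loaded"  # placeholder label for the loaded state
        return self

    def __exit__(self, exc_type, exc_value, traceback):
        # Always release the data, even if the block raised an exception.
        self.trajectory = False
        self.backend = "no_load"
        return False  # do not swallow exceptions


managed = ManagedFrames(lambda: ["frame_0", "frame_1"])

with managed as m:
    inside = (m.backend, len(m.trajectory))

outside = (managed.backend, managed.trajectory)
```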

### Reversed

In [None]:
traj1 = reversed(em.SingleTraj("tests/data/1am7_corrected.xtc", "tests/data/1am7_protein.pdb"))
print(traj1.time[:5])

### Iteration

To make iteration work and to know where to stop, the trajectory is loaded into memory.

In [None]:
traj1 = em.SingleTraj("tests/data/1am7_corrected.xtc", "tests/data/1am7_protein.pdb")
traj1.load_CV(traj1.xyz[:,:,2], 'z_coordinate')  # index 2 along the last axis is the z coordinate

for i, frame in enumerate(traj1):
    print(frame)
    print(frame.z_coordinate)
    if i == 3:
        break

### Save

The HDF5 file format is especially useful when saving trajs, as it also offers the possibility to save CVs directly into the same file.

In [None]:
traj1 = em.SingleTraj("tests/data/1am7_corrected.xtc", "tests/data/1am7_protein.pdb")
traj1.load_CV(traj1.xyz[:,:,:2], 'x_and_y_coordinate')
print(traj1.x_and_y_coordinate.shape)
traj1.save("tests/data/1am7_corrected_with_CVs.h5", overwrite=True)

In [None]:
test = em.SingleTraj("tests/data/1am7_corrected_with_CVs.h5")
test.x_and_y_coordinate.shape

## The new `TrajEnsemble` class.

This class is meant to keep track of many trajectories. Internally the TrajEnsemble class contains a list of SingleTraj classes.

### Initialize

The `TrajEnsemble` class takes lists of traj and top files as input. These files may have different topologies (number of atoms, bonds). In that case the `common_str` argument is used to group the trajectory files and topology files into sub-units with identical topology.

**Make sure that the `common_str` argument is a substring of the trajectory and topology files.**

**In contrast to the `SingleTraj` class, the trajectories are loaded here to ensure that all lists and arrays are of the correct size.**
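
The grouping idea behind `common_str` can be sketched with plain substring matching. This is an assumption-laden illustration, not EncoderMap's actual logic: the function `group_by_common_str`, the file names, and the choice of passing `common_str` as a list of strings are all hypothetical.

```python
def group_by_common_str(traj_files, top_files, common_str):
    """Group trajectory files with the one topology sharing the same substring."""
    groups = {}
    for key in common_str:
        tops = [f for f in top_files if key in f]
        if len(tops) != 1:
            raise ValueError(f"expected exactly one topology for {key!r}, got {tops}")
        groups[key] = {
            "top": tops[0],
            "trajs": [f for f in traj_files if key in f],
        }
    return groups


# Hypothetical file names for a wild type and a mutant
traj_files = ["wt_run1.xtc", "wt_run2.xtc", "mutant_run1.xtc"]
top_files = ["wt.pdb", "mutant.pdb"]

groups = group_by_common_str(traj_files, top_files, ["wt", "mutant"])
```

The sketch also shows why each string has to be a substring of the file names: a key that matches no topology file leaves its trajectories without a topology.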

In [None]:
# traj = em.SingleTraj("https://files.rcsb.org/view/1YUG.pdb")
# np.save("development/1YUG_x_and_y_coordinate.npy", traj.xyz[:,:,:2])
# traj = em.SingleTraj("https://files.rcsb.org/view/1YUF.pdb")
# np.save("development/1YUF_x_and_y_coordinate.npy", traj.xyz[:,:,:2])

In [None]:
traj1 = "https://files.rcsb.org/view/1YUG.pdb"
traj2 = "https://files.rcsb.org/view/1YUF.pdb"
trajs = em.TrajEnsemble([traj1, traj2])

# expects the .npy files written by the commented-out cell above in development/
trajs.load_CV('x_and_y_coordinate', directory='development/')
print(trajs)

In [None]:
import shutil
shutil.copyfile("tests/data/1am7_protein.pdb", "tests/data/1am7_protein.pdb")

In [None]:
trajs = glob.glob('tests/data/')
print(len(trajs)) # prints the number of objects in the list

ref_pdbs = glob.glob('/home/kevin/projects/expansion_elephant/example_files/*.pdb')
print(len(ref_pdbs))

# Loading arbitrary CVs

## Getting highd Info

The `loading` submodule leans heavily on PyEMMA's featurization. It is the way to extract high-dimensional data.

In [None]:
traj = em.SingleTraj("tests/data/1am7_corrected.xtc", "tests/data/1am7_protein.pdb")

Initialize an EncoderMap Featurizer. In contrast to PyEMMA, EncoderMap can work with multiple trajectories with arbitrary topologies (more on that in the TrajEnsemble class).

In [None]:
feat = em.loading.Featurizer(traj.reference)

Add some info and describe.

In [None]:
feat.add_backbone_torsions()
feat.describe()[:5]

Load. This is heavily parallelized, thanks to PyEMMA.

In [None]:
highd = em.loading.load(traj, feat)

Set the highd data. Note how now the frame info will be known, because the Trajectory has been opened once and the frames are in the highD data.

In [None]:
traj.set_highd(highd)

In [None]:
import pandas as pd
pd.options.display.max_rows = 5
traj.df

In [None]:
# %% Test the different ways to load a CV
# numpy file (.npy)
traj = em.SingleTraj('tests/data/1am7_corrected.xtc', 'tests/data/1am7_protein.pdb')
traj.load_CV("central_cartesians_test.npy")
traj.central_cartesians_test

# %% Numpy
np_array = np.squeeze(data['central_cartesians'].values)
traj = em.SingleTraj('tests/data/1am7_corrected.xtc', 'tests/data/1am7_protein.pdb')
traj.load_CV(np_array, 'central_cartesians')
traj.central_cartesians

# %% xarray
traj = em.SingleTraj('tests/data/1am7_corrected.xtc', 'tests/data/1am7_protein.pdb')
traj.load_CV(data['central_cartesians'])
traj.central_cartesians

# %% Feature
traj = em.SingleTraj('tests/data/1am7_corrected.xtc', 'tests/data/1am7_protein.pdb')
backbone_torsions = em.loading.features.CentralTorsions(traj.top)
traj.load_CV(backbone_torsions)

# %% Featurizer
traj = em.SingleTraj('tests/data/1am7_corrected.xtc', 'tests/data/1am7_protein.pdb')
backbone_torsions = em.loading.features.CentralTorsions(traj.top)
feat = em.Featurizer(traj)
feat.add_list_of_feats()
traj.load_CV(feat)
print(traj.CVs)