## Writing Analysis classes

Writing an Analysis class is a good way to share your analysis with other people.

In this tutorial we will walk through writing an Analysis class to perform a very simple analysis: the average end-to-end distance of a polymer chain.

**Additional resources**
 - During the workshop, feel free to ask questions at any time
 - For more on how to use MDAnalysis, see the [User Guide](https://userguide.mdanalysis.org/2.0.0-dev0/) and [documentation](https://docs.mdanalysis.org/2.0.0-dev0/)
 - Ask questions on the [user mailing list](https://groups.google.com/group/mdnalysis-discussion) or on [Discord](https://discord.gg/fXTSfDJyxE)
 - Report bugs on [GitHub](https://github.com/MDAnalysis/mdanalysis/issues?)

### Loading the data

First we'll load and briefly explore the dataset we'll be using.
This is a dataset with 126 coarse-grained polymer chains in solvent.

In [None]:
import MDAnalysis as mda

import MDAnalysisData

In [None]:
data = MDAnalysisData.CG_fiber.fetch_CG_fiber()

In [None]:
u = mda.Universe(data.topology, data.trajectory)

### Explore!

Have a browse of the dataset to familiarise yourself with the atom names, residue names, etc. in the system.
It is fairly common to be given a dataset and have to figure out how good the topology information is!

In [None]:
u.atoms
# ok we've got 8,000 atoms...

In [None]:
len(u.trajectory)
# and a nice number of frames!

In [None]:
len(u.residues), set(u.atoms.resnames)
# it looks like we've got A, B, C & D and Ion and W(ater)

In [None]:
len(u.segments), set(u.atoms.segids)
# sadly chains don't seem to be grouped into segments

In [None]:
u.atoms.bonds
# thankfully the system has bond information available!

In [None]:
# if we have bonds, we can define the fragments
# reminder: a fragment is a group of atoms fully traversable through its bonds, i.e. a "molecule"
frags = u.atoms.fragments
print(frags)
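
To make the idea of a fragment concrete, here is a minimal sketch that groups plain atom indices into connected components of a bond graph. The function `find_fragments` and the toy bond list are hypothetical and only illustrate the concept; this is not how MDAnalysis implements `fragments` internally.

```python
from collections import defaultdict


def find_fragments(n_atoms, bonds):
    """Group atom indices into fragments (connected components of the bond graph)."""
    neighbours = defaultdict(set)
    for i, j in bonds:
        neighbours[i].add(j)
        neighbours[j].add(i)

    seen = set()
    fragments = []
    for start in range(n_atoms):
        if start in seen:
            continue
        # a depth-first traversal from an unvisited atom collects one fragment
        stack, fragment = [start], []
        seen.add(start)
        while stack:
            atom = stack.pop()
            fragment.append(atom)
            for other in neighbours[atom] - seen:
                seen.add(other)
                stack.append(other)
        fragments.append(sorted(fragment))
    return fragments


# toy system: a 3-bead chain (0-1-2), a 2-bead chain (3-4) and a lone solvent bead (5)
toy_fragments = find_fragments(6, [(0, 1), (1, 2), (3, 4)])
print(toy_fragments)
```

Note that an atom with no bonds ends up as a fragment of size one, so single-atom fragments can be told apart from multi-atom chains by their length.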

In [None]:
chain = frags[0]
print(chain)

In [None]:
print([len(a.bonded_atoms) for a in chain])
# hmm most atoms have 2 bonds, but some have 1...

### Extracting our data

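Based upon what we've seen, we can pick out the polymer chains from the fragments and identify the two end atoms of each chain from their bond counts.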

In [None]:
# grab all polymer chains, this is just fragments with more than one atom (others are solvent)
chains = [ag for ag in u.atoms.fragments if (len(ag) > 1)]

In [None]:
def get_start_and_end(chain):
    # for a CG chain, grab the first and last atoms in the chain
    # we know that (for this system) if an atom only has one bond, it's on the end of the chain
    start, end = [atom for atom in chain if len(atom.bonded_atoms) == 1]
    return start, end
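
The unpacking in `get_start_and_end` relies on exactly two atoms in the chain having a single bond, which holds for linear chains. The sketch below shows the same bond-counting idea on plain indices, with an explicit check for that assumption. `find_chain_ends` is a hypothetical helper for illustration, not part of MDAnalysis.

```python
from collections import Counter


def find_chain_ends(bonds):
    """Return the two atom indices that appear in exactly one bond."""
    n_bonds = Counter(i for bond in bonds for i in bond)
    ends = sorted(i for i, n in n_bonds.items() if n == 1)
    if len(ends) != 2:
        # rings have no ends, branched molecules have more than two
        raise ValueError(f"expected 2 chain ends, found {len(ends)}")
    return tuple(ends)


linear_ends = find_chain_ends([(0, 1), (1, 2), (2, 3)])
print(linear_ends)

try:
    find_chain_ends([(0, 1), (1, 2), (2, 0)])  # a 3-membered ring
except ValueError as err:
    ring_error = str(err)
    print(ring_error)
```

A similar check in the real function would turn a confusing unpacking error into a clearer message if the topology ever contained rings or branches.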

In [None]:
start, end = get_start_and_end(chains[0])

print(start, end)

In [None]:
# the distance between these atoms can be calculated thus:
mda.lib.distances.calc_bonds(start.position, end.position, box=u.dimensions)
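
Passing `box` makes `calc_bonds` take periodic boundary conditions into account using the minimum image convention. As a rough illustration of what that means, the sketch below computes a minimum image distance in pure Python, assuming an orthorhombic box given by its three edge lengths. `minimum_image_distance` is a hypothetical helper; the library routine also handles triclinic boxes and is much faster.

```python
import math


def minimum_image_distance(a, b, box_lengths):
    """Distance between points a and b in an orthorhombic periodic box."""
    total = 0.0
    for xa, xb, length in zip(a, b, box_lengths):
        d = xb - xa
        # wrap each component into the range [-length/2, length/2]
        d -= length * round(d / length)
        total += d * d
    return math.sqrt(total)


# two points near opposite faces of a 10 x 10 x 10 box
p1, p2 = (0.5, 0.0, 0.0), (9.5, 0.0, 0.0)
plain = math.dist(p1, p2)
wrapped = minimum_image_distance(p1, p2, (10.0, 10.0, 10.0))
print(plain, wrapped)
```

Without the box the two points appear far apart, whereas across the periodic boundary they are close neighbours.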

### Writing the Analysis class

Now we've played a bit with what we want to do, it's time to roll it into a proper class.

Remember:
- the `__init__` function must request all the necessary pieces of data (i.e. the chains).  There won't be any chances later to get information from the user!  It is also important to call `super().__init__(trajectory)` to pass the trajectory to the base class so that the `run()` method works.
- the `_prepare` function needs to set up the required data structures (lists, dicts, etc.) which later functions expect.  These won't be visible to users, but are used to save intermediate values.  Remember that these need to be stored as "`self.X`" to persist on the instance.
- the `_single_frame` function calculates the value of interest for a single given frame. Moving between frames is handled by the `run()` function.
- `_conclude()` is called at the end to finalise everything.  At this point the data structures created in `_prepare` have been populated by many calls to `_single_frame`, and now these must be reduced to our final values in `.results`.
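
The order in which these methods are called can be sketched with a toy base class. `ToyAnalysisBase` and `MeanValue` below are hypothetical stand-ins that only mimic the call order; the real `AnalysisBase.run()` also handles frame selection, progress reporting and the `results` container.

```python
class ToyAnalysisBase:
    """Hypothetical stand-in for AnalysisBase, showing only the call order."""

    def __init__(self, frames):
        self._frames = frames

    def run(self):
        self._prepare()
        for frame in self._frames:
            self._frame = frame  # the "current frame" seen by _single_frame
            self._single_frame()
        self._conclude()
        return self


class MeanValue(ToyAnalysisBase):
    def _prepare(self):
        # intermediate values live on self so later methods can reach them
        self.total = 0.0
        self.count = 0

    def _single_frame(self):
        self.total += self._frame
        self.count += 1

    def _conclude(self):
        self.mean = self.total / self.count


toy = MeanValue([1.0, 2.0, 3.0, 6.0]).run()
print(toy.mean)
```

Here each "frame" is just a number, but the division of labour is the same: set up in `_prepare`, accumulate in `_single_frame`, reduce in `_conclude`.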

The skeleton for the class then looks like this:

In [None]:
from MDAnalysis.analysis.base import AnalysisBase


class EndToEndDistance(AnalysisBase):
    def __init__(self):
        pass
    
    def _prepare(self):
        pass
    
    def _single_frame(self):
        pass
    
    def _conclude(self):
        pass

## Solution

A proposed solution is given below.

One trick that has been used is that it is more efficient to calculate many distances at once in a single call to `calc_bonds` rather than calling this function once for each chain.
To achieve this, the input (a list of tuples) has been "transposed" into two AtomGroups, one containing the "first" atom of each chain and the other containing the "last" atom of each chain.
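
The same "transpose" can be illustrated on plain Python values. In this small sketch the strings are placeholders standing in for Atoms; the solution itself builds AtomGroups with `sum` rather than tuples.

```python
pairs = [("a_start", "a_end"), ("b_start", "b_end"), ("c_start", "c_end")]

# zip(*pairs) regroups the items by position rather than by chain
starts, ends = zip(*pairs)
print(starts)
print(ends)
```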

Secondly, as we only need the mean, we don't have to remember every value we calculate, but instead can do some of the reduction before `_conclude`.
This sort of early reduction can prevent the storage (memory) requirements of your Analysis from becoming too large and slowing things down.

In [None]:
from MDAnalysis.analysis.base import AnalysisBase


class EndToEndDistance(AnalysisBase):
    def __init__(self, starts_and_ends):
        """
        Parameters
        ----------
        starts_and_ends : list of tuples of (Atom, Atom)
          a list containing the start and end of each chain
        """        
        # we ask for a list of tuples, but we transpose these to two atomgroups
        self.ag1 = sum(v[0] for v in starts_and_ends)
        self.ag2 = sum(v[1] for v in starts_and_ends)
        
        # remember to set up the base class too 
        super().__init__(self.ag1.universe.trajectory)
    
    def _prepare(self):
        # we will accumulate end to end values here
        self.total_e2e = 0.0
        self.n_frames = 0
    
    def _single_frame(self):
        # we could call calc_bonds() individually for each chain,
        # but it's more efficient to call it once and calculate all these distances at once
        d = mda.lib.distances.calc_bonds(self.ag1.positions, self.ag2.positions, box=self.ag1.dimensions)
        
        # we can simply add the sum of all end-to-end distances to our accumulator
        self.total_e2e += d.sum()
        self.n_frames += 1
    
    def _conclude(self):
        # our total needs to be averaged across
        # - the number of frames
        avg_e2e = self.total_e2e / self.n_frames
        # - the number of chains
        avg_e2e /= len(self.ag1)
        
        self.results.e2e_distance = avg_e2e

In [None]:
e2ed = EndToEndDistance([get_start_and_end(c) for c in chains])

In [None]:
e2ed.run()

In [None]:
e2ed.results

## Extension work

To extend this you could try calculating a timeseries of the end-to-end distance, or the autocorrelation of the end-to-end distance, which is an important metric for polymer systems.  The module `MDAnalysis.lib.correlations` contains some useful functions for calculating autocorrelations.
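
As a starting point, here is a minimal pure-Python sketch of a normalised autocorrelation for a scalar timeseries, such as one end-to-end distance per frame. The `autocorrelation` function below is a hypothetical helper written for this illustration and is not the API of `MDAnalysis.lib.correlations`. It assumes `max_lag` is smaller than the series length and that the series is not constant.

```python
def autocorrelation(series, max_lag):
    """Normalised autocorrelation C(lag) of a scalar time series, lag = 0..max_lag."""
    n = len(series)
    mean = sum(series) / n
    fluct = [x - mean for x in series]
    variance = sum(f * f for f in fluct) / n
    acf = []
    for lag in range(max_lag + 1):
        # average the product of fluctuations separated by `lag` frames
        n_pairs = n - lag
        c = sum(fluct[t] * fluct[t + lag] for t in range(n_pairs)) / n_pairs
        acf.append(c / variance)
    return acf


# made-up, perfectly alternating series: neighbouring frames are anti-correlated
acf = autocorrelation([1.0, 3.0, 1.0, 3.0, 1.0, 3.0], max_lag=2)
print(acf)
```

To feed this with real data, `_single_frame` would need to store one value per frame instead of adding to a running total. For polymers the autocorrelation of the end-to-end vector, rather than its length, is often the quantity of interest.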

Alternatively, feel free to work on your own ideas!