## Writing a Custom Reader for MDAnalysis

MDAnalysis is designed to work with whatever data format you can throw at it,
but occasionally you will find yourself with a file it cannot read.
This can be because the file you want to load is in a strange or ancient format,
is badly formatted, or comes from an unusual source (e.g. the supplementary information of a paper).

This notebook demonstrates how to add support for these weird and wonderful files.
To quickly recap some nomenclature:
 - `Parser`s read the topology information (names, types, etc) which does not vary over time
 - `Reader`s read coordinates, velocities and forces over time (a trajectory)
 - `Writer`s write these trajectory files back out

You can dynamically add these to `MDAnalysis` without having to modify the package itself.
This is useful for reading a single one-off file, or when prototyping new functionality.
In this notebook, an example of writing each of these will be demonstrated.

### Our odd file

In this directory is a file called `atoms.csv` which, bizarrely for this field, is a CSV file.  It has five columns: the first is the atom name, the second a colour(!?), and the last three are the coordinates.

To read this oddity, we will need to write a new `Parser` and a new `Reader`.

In [1]:
!head atoms.csv

N,Red,52.02,43.559998,31.55
H1,Red,51.19,44.109997,31.72
H2,Red,51.550003,42.83,31.04
H3,Red,52.47,43.18,32.37
CA,Green,53.06,44.21,30.75
HA,Green,53.829998,43.47,30.539999
CB,Green,52.57,44.739998,29.41
HB1,Green,51.89,44.039997,28.929998
HB2,Green,52.02,45.64,29.66
CG,Yellow,53.71,45.11,28.45
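
Before wiring anything into MDAnalysis, it can help to confirm the column layout with plain Python. The sketch below is only an illustration: it uses the standard `csv` module on the first three lines of the listing above, and the helper name `split_columns` is made up for this example.

```python
import csv
import io

# First three lines of atoms.csv, copied from the listing above.
SAMPLE = """N,Red,52.02,43.559998,31.55
H1,Red,51.19,44.109997,31.72
H2,Red,51.550003,42.83,31.04
"""


def split_columns(handle):
    """Return (names, colours, coords) read from a five-column CSV handle."""
    names, colours, coords = [], [], []
    for row in csv.reader(handle):
        if not row:  # skip blank lines
            continue
        name, colour, x, y, z = row
        names.append(name)
        colours.append(colour)
        coords.append((float(x), float(y), float(z)))
    return names, colours, coords


names, colours, coords = split_columns(io.StringIO(SAMPLE))
print(names)
print(colours)
print(coords[0])
```

The names and colours are time-independent and belong in the `Parser`; the three floats per line are what the `Reader` has to supply.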


### Writing a Parser

A `Parser` is responsible for reading all time-independent attributes of a `Universe`.  These typically include atom names, masses and charges.

In `MDAnalysis`, parsers should inherit from the `MDAnalysis.topology.base.TopologyReaderBase` class.  By subclassing from this, the new parser automatically becomes known to MDAnalysis!
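
To give a feel for how subclassing alone can make a class "known", here is a toy sketch of one way such automatic registration can be done in Python, using `__init_subclass__`. It is not MDAnalysis code: `ToyParserBase`, `ToyCSVParser`, `registry` and `guess_parser` are all hypothetical names, and MDAnalysis's own registration machinery is more involved and is not reproduced here.

```python
class ToyParserBase:
    """Toy stand-in for a parser base class; not MDAnalysis code."""

    registry = {}  # maps upper-cased format name -> parser class

    def __init_subclass__(cls, **kwargs):
        # Runs once for every subclass, at class-definition time.
        super().__init_subclass__(**kwargs)
        fmt = getattr(cls, "format", None)
        if fmt is not None:
            ToyParserBase.registry[fmt.upper()] = cls

    def __init__(self, filename):
        self.filename = filename


class ToyCSVParser(ToyParserBase):
    format = "csv"


def guess_parser(filename):
    """Pick a parser class from the file extension."""
    extension = filename.rsplit(".", 1)[-1].upper()
    return ToyParserBase.registry[extension]


print(guess_parser("atoms.csv").__name__)
```

The point of the pattern is that defining the class is the registration step, so no separate "install this parser" call is needed.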

All that is left is to implement a `parse` method, which must return a `Topology` object:

In [2]:
from MDAnalysis.topology.base import TopologyReaderBase
from MDAnalysis.core.topology import Topology
from MDAnalysis.core import topologyattrs
import numpy as np

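class CSVParser(TopologyReaderBase):
    # the format name is matched against the file extension
    format = 'csv'
    
    def parse(self, **kwargs):
        # the atom name is the first comma-separated field on each line
        names = []
        with open(self.filename, 'r') as file_in:
            for line in file_in:
                names.append(line.split(',')[0])
                
        n_atoms = len(names)
       
        # one topology attribute per piece of per-atom data
        attrs = [topologyattrs.Atomnames(np.array(names))]
    
        # the file has no residue or segment records, so use one of each
        return Topology(n_atoms=n_atoms, n_res=1, n_seg=1,
                        attrs=attrs)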

We can now load our file, and check that the names have been read:

In [3]:
import MDAnalysis as mda

u = mda.Universe('atoms.csv')

print(u.atoms.names)

['N' 'H1' 'H2' 'H3' 'CA' 'HA' 'CB' 'HB1' 'HB2' 'CG' 'HG1' 'HG2' 'SD' 'CE']


Accessing positions will, however, raise an error, since nothing has read the coordinates yet:

In [4]:
u.atoms.positions

AttributeError: AtomGroup has no attribute positions

### Writing a Reader

Readers should inherit from `MDAnalysis.coordinates.base.ReaderBase`.
Since our file holds a single frame, we can use the simpler `SingleFrameReaderBase` instead.
We then only need to implement the `_read_first_frame` method,
which should create a `Timestep` object and fill it with data from `self.filename`.
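
Implementing a single method is enough because the base class drives the read and calls the method for us. The toy sketch below illustrates that hook pattern with the standard library only. It is an assumption-laden stand-in, not MDAnalysis code: `ToySingleFrameReader` and `ToyCSVReader` are hypothetical names, and the real base class handles much more.

```python
import os
import tempfile


class ToySingleFrameReader:
    """Toy stand-in for a single-frame reader base; not MDAnalysis code."""

    def __init__(self, filename):
        self.filename = filename
        self.n_atoms = 0
        self.positions = []
        # The base class drives the read; subclasses only fill in the hook.
        self._read_first_frame()

    def _read_first_frame(self):
        raise NotImplementedError("subclasses must implement this hook")


class ToyCSVReader(ToySingleFrameReader):
    def _read_first_frame(self):
        with open(self.filename, "r") as file_in:
            for line in file_in:
                fields = line.strip().split(",")
                if len(fields) < 5:  # skip blank or short lines
                    continue
                # The last three of the five columns are the coordinates.
                self.positions.append([float(v) for v in fields[2:5]])
        self.n_atoms = len(self.positions)


# Write two lines of atoms.csv to a temporary file and read them back.
with tempfile.TemporaryDirectory() as tmpdir:
    path = os.path.join(tmpdir, "atoms.csv")
    with open(path, "w") as handle:
        handle.write("N,Red,52.02,43.559998,31.55\n")
        handle.write("H1,Red,51.19,44.109997,31.72\n")
    reader = ToyCSVReader(path)

print(reader.n_atoms)
print(reader.positions[0])
```

Constructing the reader is all the caller does; the file is read as a side effect of `__init__` calling the hook.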

In [6]:
from MDAnalysis.coordinates import base

class CSVReader(base.SingleFrameReaderBase):
    # This line defines the file extension
    format = 'csv'
    
    def _read_first_frame(self):
        coords = []
        with open(self.filename, 'r') as file_in:
            for line in file_in:
                # the last three fields are the coordinates; convert text to floats
                coords.append([float(x) for x in line.split(',')[2:]])
                
        self.n_atoms = len(coords)
        self.ts = ts = base.Timestep(n_atoms=self.n_atoms)
        ts.positions = coords
        
        return ts

In [7]:
u = mda.Universe('atoms.csv')

In [8]:
u.atoms.names

array(['N', 'H1', 'H2', 'H3', 'CA', 'HA', 'CB', 'HB1', 'HB2', 'CG', 'HG1',
       'HG2', 'SD', 'CE'], dtype=object)

In [9]:
u.atoms.positions

array([[52.02    , 43.559998, 31.55    ],
       [51.19    , 44.109997, 31.72    ],
       [51.550003, 42.83    , 31.04    ],
       [52.47    , 43.18    , 32.37    ],
       [53.06    , 44.21    , 30.75    ],
       [53.829998, 43.47    , 30.539999],
       [52.57    , 44.739998, 29.41    ],
       [51.89    , 44.039997, 28.929998],
       [52.02    , 45.64    , 29.66    ],
       [53.71    , 45.11    , 28.45    ],
       [53.47    , 45.66    , 27.539999],
       [54.39    , 45.73    , 29.03    ],
       [54.640003, 43.68    , 27.84    ],
       [53.35    , 43.12    , 26.7     ]], dtype=float32)

### Adding new attributes

Our file also had a novel 'colour' field.

We can include this data in `AtomGroup` objects by defining a new topology attribute:

In [10]:
from MDAnalysis.topology.base import TopologyReaderBase
from MDAnalysis.core.topology import Topology
from MDAnalysis.core import topologyattrs
import numpy as np


class AtomColors(topologyattrs.AtomAttr):
    attrname = 'colors'
    singular = 'color'
    per_object = 'atom'
    dtype = object


class CSVParser(TopologyReaderBase):
    format = 'csv'
    
    def parse(self, **kwargs):
        names = []
        colors = []
        with open(self.filename, 'r') as file_in:
            for line in file_in:
                # split once: the first field is the name, the second the colour
                fields = line.split(',')
                names.append(fields[0])
                colors.append(fields[1])

        n_atoms = len(names)
       
        attrs = [
            topologyattrs.Atomnames(np.array(names)),
            AtomColors(np.array(colors)),
        ]
    
        return Topology(n_atoms=n_atoms, n_res=1, n_seg=1,
                        attrs=attrs)

In [11]:
u = mda.Universe('atoms.csv')

In [12]:
u.atoms.colors

array(['Red', 'Red', 'Red', 'Red', 'Green', 'Green', 'Green', 'Green',
       'Green', 'Yellow', 'Yellow', 'Yellow', 'Yellow', 'Yellow'],
      dtype=object)
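
Because the new attribute holds one value per atom, in the same order as the names, it can be used to pick atoms out. The sketch below shows the idea with plain lists, using the names and colours printed in the outputs above; `names_with_color` is a hypothetical helper. With the `Universe` loaded above, the same selection could presumably be written as a boolean mask such as `u.atoms[u.atoms.colors == 'Red']`, which is not run here.

```python
# Names and colours as printed in this notebook's outputs.
names = ['N', 'H1', 'H2', 'H3', 'CA', 'HA', 'CB', 'HB1', 'HB2',
         'CG', 'HG1', 'HG2', 'SD', 'CE']
colors = ['Red'] * 4 + ['Green'] * 5 + ['Yellow'] * 5


def names_with_color(names, colors, wanted):
    """Return the names of all atoms whose colour equals `wanted`."""
    return [name for name, color in zip(names, colors) if color == wanted]


for wanted in ('Red', 'Green', 'Yellow'):
    print(wanted, names_with_color(names, colors, wanted))
```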