Adding auxiliary information to trajectory #785

richardjgowers · 2016-03-19T10:45:13Z

It would be nice if we could add arbitrary data to run alongside a trajectory, so..

u = mda.Universe('this', 'that')
u.trajectory.add_auxiliary('some.file')

for ts in u.trajectory:
    print ts.stuff_from_file

I think the best route might be adding in the hooks after a Timestep has been read, so

def _read_next_timestep(self):
    ts = self._read_timestep()
    for aux in self._auxiliaries:
        aux.read_next(ts)
    return ts

So each aux object is like a mini Reader object, which augments a trajectory Reader.

In the longer term, it would be cool if u.trajectory.add_auxiliary(PullForce('pull.xvg')), then made AtomGroup.pullforce a viable attribute, this should be possible with the new dynamic classes + transplant system.

The text was updated successfully, but these errors were encountered:

jbarnoud · 2016-03-19T11:49:06Z

From what I understand of your examples, you see the auxiliaries as objects that implement a read_next method. Could they just be regular iterators?

Also, how do you see the assignment of the auxiliaries to an AtomGroup? In the case of pull.xvg as described in the mailing list (as an output of Gromacs for a simulation with pulling), we only have access to a time serie of pulling force. Should we inspect the TPR to know what atoms are pulled?

jbarnoud · 2016-03-19T11:52:08Z

From what I understand of your examples, you see the auxiliaries as objects that implement a read_next method. Could they just be regular iterators?

Answering to myself. It has to be a custom object, and a simple iterator won't do.

Indeed, we will have to handle the random seeking of the trajectory, so we will not only have to read values from the auxiliaries sequentially. Also, we will have to handle alignement of the auxiliaries: an auxiliary may be save with a different frequency as the coordinates.

richardjgowers · 2016-03-19T12:02:21Z

What I was thinking is that if the ts is passed in, then either the frame or time information from this would let the aux figure out what to return. Sort of, here's a Timestep that says it's from t=10ps, what would you like to add to this?

Ah yeah, sparse attributes aren't really handled currently. We'd need something like how pandas does missing information.

And yes, meshing different frequencies is a lovely headache for someone.

jbarnoud · 2016-03-19T12:27:45Z

What I was thinking is that if the ts is passed in, then either the frame or time information from this would let the aux figure out what to return. Sort of, here's a Timestep that says it's from t=10ps, what would you like to add to this?

OK. So this is very similar to something I had in mind for coordinate transformations. I'll tidy up my proposal so it can be discussed in parallel, then.

kain88-de · 2016-03-19T17:22:29Z

@richardjgowers I thought the add_auxillary method could just manipulate the __dict__ of the Reader class directly.

class Reader(object):
    def add_auxillary(self, name, val):
        self.__dict__[name] = val

read = Reader()
read.add_auxillary('pulling_force', np.arange(5))
read.pulling_force

This makes it very easy to add new values. We woudn't need to write special classes that can be added as auxillaries as in your last example. This way the information is easily accesible like velocities and positions. I could imagine that val is either a filename or a class that handles the access for xvg files for csv file objects. (actually add_auxillary could figure that out based on fileending as well and then use the classes itself for covenience).

richardjgowers · 2016-03-19T17:30:44Z

@kain88-de yeah hacking it into the namespace is cool. But there still needs to be some sort updating of the auxiliary data every time a frame is read?

kain88-de · 2016-03-19T17:41:37Z

But there still needs to be some sort updating of the auxiliary data every time a frame is read?

Yes. I've just given a simple example of the __dict__ change that I thought of using. But the class for updating could be very generic or a collection of generic ones. For one the add_auxillary method should have a simple algorithm to chose the best class and also accept a class as parameter.
I would like that the following list of auxillary data is possible

reader.add_auxillary('pulling_force', ndarray)
reader.add_auxillary('pulling_force', dataframe)
reader.add_auxillary('pulling_force', 'pull.xvg')
reader.add_auxillary('pulling_force', 'energy.edr')
reader.add_auxillary('pulling_force', 'some-variable.csv')

Of course each needs to be handled specially. Maybe with a class. And we need to do more then just add to the __dict__ method. The good thing about this is we can easily extend the TimeStep object with more information.

jbarnoud · 2016-03-19T17:55:16Z

I would prefer the auxiliary data to be in a separate namespace. I think it would be cleaner to call ts.aux.pullforce rather than ts.pullforce. The TimeStep name space can get crowded and messy otherwise. Also, having a aux namespace make clear that the data we access comes from somewhere else.

@kain88-de I am not sure I see how what you suggest differs from @richardjgowers initial proposal. Could you elaborate?

kain88-de · 2016-03-21T22:02:20Z

@kain88-de I am not sure I see how what you suggest differs from

It doesn't differ that much. Rather extends and generalizes it. With @richardjgowers example we would need to write a new class for every new auxillary that we want to add. While I propose that we simple write classes to read the different file formats, this is because xvg and others can actually contain a lot of different observables, this leaves the user more freedom.

# Richard
u.trajectory.add_auxiliary(PullForce('pull.xvg'))
# Me
u.trajectory.add_auxiliary('pulling_force', XVGReader('pull.xvg'))
## And an alias
u.trajectory.add_pulling('pull.xvg')

jbarnoud · 2016-03-21T22:08:15Z

On 21/03/16 23:02, kain88-de wrote:

@kain88-de <https://github.com/kain88-de> I am not sure I see how
what you suggest differs from
It doesn't differ that much. Rather extends and generalizes it. With
@richardjgowers https://github.com/richardjgowers example we would
need to write a new class for every new auxillary that we want to add.
While I propose that we simple write classes to read the different
file formats, this is because xvg and others can actually contain a
lot of different observables, this leaves the user more freedom.

Richard

u.trajectory.add_auxiliary(PullForce('pull.xvg'))

Me

u.trajectory.add_auxiliary('pulling_force', XVGReader('pull.xvg'))

And an alias

u.trajectory.add_pulling('pull.xvg')

—
You are receiving this because you commented.
Reply to this email directly or view it on GitHub
#785 (comment)

I understand. Then I like yours better. Anyway, in your example,
XVGReader will have to inherit from a Auxiliary class that know how
to handle the jumps in the trajectory, and the data alignment when the
auxiliary data is not saved at the same frequency than the trajectory.

kain88-de · 2016-03-21T22:24:44Z

XVGReader will have to inherit from a Auxiliary class that know how to handle the jumps in the trajectory,

yes that we the headache @richardjgowers talked about would only need to be solved once.

orbeckst · 2016-04-29T23:29:28Z

Is this what @fiona-naughton is going to do for GSoC? If so, we should restart the discussion here, especially with Fiona included.
I think the outcome of this discussion should be the definition of a preliminary API; it might change during testing but there should be something concrete to work towards to,

fiona-naughton · 2016-05-09T15:54:04Z

Adding to dict directly indeed sounds good.
It might be nice, in the case of an auxiliary file/array with multiple columns, to be able to assign each a different name? And, if there's no time column, to instead guess time assuming data was recorded with a regular timestep.

On the time sync front: when the auxiliary/trajectory are in phase (as it were), I guess it's just a matter of pulling up the matching timepoint, with some form of 'missing' label when the auxiliary data is less frequent. Otherwise we could find the closest time instead, but should probably flag that it doesn't exactly correspond to that point in the trajectory (the missing values in the less-frequent case could also be filled with the closest value); it may be worth also offering some sort of interpolation option to guess the value instead?

fiona-naughton · 2016-05-20T11:28:30Z

Having finally had a proper look through the existing timestep/reader stuff, if I understand correctly this will work something like this:

As above, given an aux name and a filename add_auxiliary will add an appropriate AuxReader instance (guessed from file extension or additionally passed in) to the trajectory namespace by adding directly to __dict__; e.g. trajectory.auxname.

Similar to the existing readers, the AuxReader will store the frequency, time offset and (last-read) time of the auxiliary data (depending on the type of auxiliary file, dt and offset might be determinable from the file). For meshing different frequencies, the general idea would be that each auxiliary timestep is allocated to the closest trajectory timestep; it should be possible to calculate how many of the next aux timesteps 'belong' to the next trajectory timestep, e.g. floor((time+3*dt/2-auxname.time)/auxname.dt) (though slightly more complicated at edge cases)?

Then in addition to a read_next method to read and return the next auxiliary time step data, have a read_next_ts method that:

calls read_next the appropriate number of times as calculated above, and stores the returned data in an array as something like auxname.all; if there are no closest auxiliary points (i.e. auxiliary data is more sparse) this would have length 0.
calculates a single representative value (or some form of NA flag as appropriate) for the trajectory timeframe as auxname.representative; this could be an exact match only, the closest auxiliary timestep, or an average, as specified in auxname.rep_method

Then call auxname.read_next_ts for each added auxiliary with the trajectory's read_next_timestep.

So, ultimately: auxiliary data is added by:

trajectory.add_auxiliary(auxname, filename, **kwargs)

with kwargs optionally including the auxiliary reader, dt, offset, or rep_method. The data can then be accessed either in full through trajectory.auxname.all and looping through the trajectory (or looping through the auxiliary only?), or one-value-per-frame as trajectory.auxname.representative.

Hopefully that sounds like it would work? I'm not sure where and how best to get all the aux.read_next_ts to run with the trajectory read_next_timestep (I guess we'll have to keep track of all the auxiliary names added to the trajectory in some form of list)? Also, picking a naming scheme that's not too confusing...

jbarnoud · 2016-05-20T12:43:41Z

From what I understand, trajectory.auxname gives access to the AuxReader, not the current value for the time step. Right? If so, it would be convenient to have access to the current auxiliary value from the TimeStep.

As I wrote earlier in the thread, I think it would be better to access the auxiliary data through trajectory.aux.name and/or ts.aux.name rather than trajectory.auxname. Having a distinct name space avoid naming conflicts, and is more explicit about the origin of the data.

Going for the closest time looks to me as the right way to do, assuming we have a user-defined tolerance. That tolerance can be low if we want exact match for the time between the trajectory and the auxiliary (we are talking about floating point number so exact should be understood as "close enough"), it can be higher if the sampling rates do not exactly match and we are OK with this.

The pandas library deals with aligning time series. It might be worth having a look at what they do. In addition to what to propose, I would suggest to have the option of raising an exception if a time is missing in the auxiliary data, and the option to iterate over the frames at the pace of the auxiliary data if the auxiliary data are more sparse.

richardjgowers · 2016-05-20T17:25:20Z

Yeah data has to be stored in the Timestep object, it'd be annoying if I had to one day rummage around in the Reader to find things.

So wrt getting Reader to read aux things, we could add some things to base.Reader like

# in base.Reader
def next(self):
    ts = self._read_next_frame()  # Regular timestep read
    for aux in self._auxs:
        aux.read_next(ts)  # AuxReader examines ts to figure out what to put inside ts
    return ts

Where self._auxs could just be an empty list (so nothing happens) but otherwise stores all the AuxReaders that have been added.

fiona-naughton · 2016-05-22T02:08:31Z

Ok, thanks. So we end up with the readers being stored in a list as trajectory._auxs and the data itself stored separately with the given name as ts.aux.name? If it's still worth having the full set of associated auxiliary values (so we're not loosing info when looping the trajectory) as well as the closest value, where would the former best end up to differentiate it from the latter?

jbarnoud · 2016-05-22T09:13:09Z

On 22/05/16 04:08, Fiona Naughton wrote:

Ok, thanks. So we end up with the readers being stored in a list as
|trajectory._auxs| and the data itself stored separately with the
given name as |ts.aux.name|? If it's still worth having the full set
of associated auxiliary values (so we're not loosing info when looping
the trajectory) as well as the closest value, where would the former
best end up to differentiate it from the latter?

—
You are receiving this because you commented.
Reply to this email directly or view it on GitHub
#785 (comment)

ts.aux.name should somehow give access to the current value. There
should indeed be a way to access the AuxReadder itself so we can
access any other relevant information from it (you mention the closest
value and other things).

richardjgowers · 2016-05-24T10:00:03Z

You should be able to break up this task by just writing a standalone XVGReader at first... so the following code should work

from MDAnalysis.trajectory.auxreaders import XVGReader

myreader = XVGReader('pull.xvg')

for value in myreader:
    print value

print myreader.get_value(4)  # not sure if this should pass in a Timestep or frame number... can fix later

Then all trajectory Readers are just "users" of this code.

orbeckst · 2016-05-24T17:28:02Z

There is a XVGReader in GromacsWrapper. You can use that code as a starting point of you like.

Oliver Beckstein
email: orbeckst@gmail.com

Am May 24, 2016 um 3:00 schrieb Richard Gowers notifications@github.com:

You should be able to break up this task by just writing a standalone XVGReader at first... so the following code should work

from MDAnalysis.trajectory.auxreaders import XVGReader

myreader = XVGReader('pull.xvg')

for value in myreader:
print value

print myreader.get_value(4) # not sure if this should pass in a Timestep or frame number... can fix later
Then all trajectory Readers are just "users" of this code.

—
You are receiving this because you commented.
Reply to this email directly or view it on GitHub

fiona-naughton · 2016-06-02T20:36:43Z

While it's relatively straightforward to in general get the auxiliary to update with the trajectory when iterating (since __iter__()/next() are defined in the base ProtoReader), I'm a bit stuck on what to do for the trajectories where jumping to a specified frame is possible with _read_frame(), since that's defined separately in each reader. Not sure if there's a nice way to get the auxiliary to update appropriately in these cases, without having to modify each reader?

(I'm also unsure on the best place to first create the _auxs list for the trajectory, since ProtoReader doesn't have an __init__; I'm currently checking it exists and creating it if not before using it, or is it alright to say just set it in Reader's __init__?)

richardjgowers · 2016-06-02T21:03:00Z

You can add a layer to Reader, so it calls maybe _goto_frame which in turn calls _read_frame. So then u.trajectory[27] would go something like:

base.Reader.__getitem__(27) -> base.Reader._goto_frame(27) -> XTCReader._read_frame(27)
                                                           -> XVGReader._read_frame(27)

This might also be useful for next, ideally there should only be one place where things happen.

You can add an __init__ to the base class, just make sure then all subclasses super in their inits. (And have a test that loops over all Readers checking that they init with this attribute).

mnmelo · 2016-06-06T09:30:14Z

Reading through the WIP PR (#868) I came across @richardjgowers' comment that we could read in xvg data all at once. I second that idea. However, I think the same will be applicable to any aux data format. Besides, holding all aux data in memory will certainly speed things up when loading new timesteps.

I think it's important to decide on this right away, because the nascent object model will have to change accordingly. I'm bringing this up here, instead of in the PR, to keep the discussion together.
Do we want to:

Read in all the aux data at once? In this case the AuxFileReader class will just be a wrapper to read from file, and the file-agnostic AuxReader should do the rest. This has the tiny danger of one running out of memory, for huge aux data files, or for many universes with large aux data.
Always read aux data from file one timestep at a time? This seems to be the direction of the current implementation in the PR. Probably not as fast as holding everything in memory, and with trickier implementation of things such as random frame access.
A mix of the above based on aux filesize?

jbarnoud · 2016-06-06T09:37:12Z

On 06/06/16 11:30, mnmelo wrote:

Reading through the WIP PR (#868
#868) I came across
@richardjgowers https://github.com/richardjgowers' comment that we
could read in xvg data all at once. I second that idea. However, I
think the same will be applicable to any aux data format. Besides,
holding all aux data in memory will certainly speed things up when
loading new timesteps.

I think it's important to decide on this right away, because the
nascent object model will have to change accordingly. I'm bringing
this up here, instead of in the PR, to keep the discussion together.
Do we want to:

Read in all the aux data at once? In this case the |AuxFileReader|
class will just be a wrapper to read from file, and the
file-agnostic |AuxReader| should do the rest. This has the tiny
danger of one running out of memory, for huge aux data files, or
for many universes with large aux data.

Always read aux data from file one timestep at a time? This seems
to be the direction of the current implementation in the PR.
Probably not as fast as holding everything in memory, and with
trickier implementation of things such as random frame access.

A mix of the above based on aux filesize?

—
You are receiving this because you commented.
Reply to this email directly, view it on GitHub
#785 (comment),
or mute the thread
https://github.com/notifications/unsubscribe/ABUWuv9HtxlVDA4dJxMrqTU9GuZUoe3pks5qI-ingaJpZM4H0a5w.

On the long run, the third option in the best. It allows to be fast when
we can, but still scales when the simulations become very long.

On a shorter term, I would go for the second option as it is what the
current implementation does. This allows to have something that work,
even if we iteratively improve it latter on.

Any way, AuxReader should not care abut how the data is obtained. It is
just a interface for the trajectory reader. Fot what AuxReader knows,
the auxiliary data can be computed on the fly (which may happen in the
future).

richardjgowers · 2016-06-06T09:37:27Z

Aux data could easily be something like 3 floats for every atom every step (maybe some derived quantity you don't want to recalculate) which would then make the data comparable in size to the trajectory. I'd rather let individual implementations (XVGReader) make the assumption that the data is small rather than do it from the outset.

fiona-naughton · 2016-06-06T10:59:29Z

Yeah, this was something I was unsure on; I went with the reading-in-step-at-a-time largely because I wanted to try get that figured out first. I agree it's probably safe to assume the xvg files won't be too large; but I can leave it for now, and come back to it later?

mnmelo · 2016-06-06T13:18:58Z

Seems a good and consensual plan. Still, I'd make sure to clearly define the roles of AuxReader and AuxFileReader in the object model. Right now the function of both classes seems a bit convoluted, with AuxReader also needing to be passed a filename.

richardjgowers · 2016-06-06T13:46:22Z

With value interpolations, there's already an interpolation class in scipy which does quadratic and cubic interpolations too. Scipy isn't a strict dependency, but it does seem daft to reinvent this wheel...

jbarnoud · 2016-09-05T05:30:52Z

Closed by #868

richardjgowers added Difficulty-hard proposal labels Mar 19, 2016

jbarnoud mentioned this issue Mar 19, 2016

On-the-fly coordinate transformations #786

Closed

fiona-naughton mentioned this issue May 5, 2016

Handling umbrella sampling simulations #843

Closed

jbarnoud assigned fiona-naughton May 5, 2016

kain88-de added this to the 0.16.0 milestone May 6, 2016

jbarnoud mentioned this issue May 31, 2016

Support for Gromacs's TNG format #865

Open

7 tasks

orbeckst mentioned this issue Jun 3, 2016

Adding WHAM to MDAnalysis #842

Closed

fiona-naughton mentioned this issue Jun 3, 2016

[WIP] Issue 785 add auxiliary #868

Merged

4 tasks

fiona-naughton mentioned this issue Jun 6, 2016

next/__iter__ in ProtoReader defined independetly #869

Closed

dotsdl mentioned this issue Jul 28, 2016

Support auxilliaries in Universe definitions datreant/MDSynthesis#59

Closed

jbarnoud closed this as completed Sep 5, 2016

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Adding auxiliary information to trajectory #785

Adding auxiliary information to trajectory #785

richardjgowers commented Mar 19, 2016

jbarnoud commented Mar 19, 2016

jbarnoud commented Mar 19, 2016

richardjgowers commented Mar 19, 2016

jbarnoud commented Mar 19, 2016

kain88-de commented Mar 19, 2016

richardjgowers commented Mar 19, 2016

kain88-de commented Mar 19, 2016

jbarnoud commented Mar 19, 2016

kain88-de commented Mar 21, 2016

jbarnoud commented Mar 21, 2016

Richard

Me

And an alias

kain88-de commented Mar 21, 2016

orbeckst commented Apr 29, 2016

fiona-naughton commented May 9, 2016

fiona-naughton commented May 20, 2016

jbarnoud commented May 20, 2016

richardjgowers commented May 20, 2016

fiona-naughton commented May 22, 2016

jbarnoud commented May 22, 2016

richardjgowers commented May 24, 2016

orbeckst commented May 24, 2016

fiona-naughton commented Jun 2, 2016

richardjgowers commented Jun 2, 2016

mnmelo commented Jun 6, 2016

jbarnoud commented Jun 6, 2016

richardjgowers commented Jun 6, 2016

fiona-naughton commented Jun 6, 2016

mnmelo commented Jun 6, 2016

richardjgowers commented Jun 6, 2016

jbarnoud commented Sep 5, 2016

Adding auxiliary information to trajectory #785

Adding auxiliary information to trajectory #785

Comments

richardjgowers commented Mar 19, 2016

jbarnoud commented Mar 19, 2016

jbarnoud commented Mar 19, 2016

richardjgowers commented Mar 19, 2016

jbarnoud commented Mar 19, 2016

kain88-de commented Mar 19, 2016

richardjgowers commented Mar 19, 2016

kain88-de commented Mar 19, 2016

jbarnoud commented Mar 19, 2016

kain88-de commented Mar 21, 2016

jbarnoud commented Mar 21, 2016

Richard

Me

And an alias

kain88-de commented Mar 21, 2016

orbeckst commented Apr 29, 2016

fiona-naughton commented May 9, 2016

fiona-naughton commented May 20, 2016

jbarnoud commented May 20, 2016

richardjgowers commented May 20, 2016

fiona-naughton commented May 22, 2016

jbarnoud commented May 22, 2016

richardjgowers commented May 24, 2016

orbeckst commented May 24, 2016

fiona-naughton commented Jun 2, 2016

richardjgowers commented Jun 2, 2016

mnmelo commented Jun 6, 2016

jbarnoud commented Jun 6, 2016

richardjgowers commented Jun 6, 2016

fiona-naughton commented Jun 6, 2016

mnmelo commented Jun 6, 2016

richardjgowers commented Jun 6, 2016

jbarnoud commented Sep 5, 2016