Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Adding auxiliary information to trajectory #785

Closed
richardjgowers opened this issue Mar 19, 2016 · 29 comments
Closed

Adding auxiliary information to trajectory #785

richardjgowers opened this issue Mar 19, 2016 · 29 comments
Assignees
Milestone

Comments

@richardjgowers
Copy link
Member

Original started on discussion board

It would be nice if we could add arbitrary data to run alongside a trajectory, so..

u = mda.Universe('this', 'that')
u.trajectory.add_auxiliary('some.file')

for ts in u.trajectory:
    print ts.stuff_from_file

I think the best route might be adding in the hooks after a Timestep has been read, so

def _read_next_timestep(self):
    ts = self._read_timestep()
    for aux in self._auxiliaries:
        aux.read_next(ts)
    return ts

So each aux object is like a mini Reader object, which augments a trajectory Reader.

In the longer term, it would be cool if u.trajectory.add_auxiliary(PullForce('pull.xvg')), then made AtomGroup.pullforce a viable attribute, this should be possible with the new dynamic classes + transplant system.

@jbarnoud
Copy link
Contributor

From what I understand of your examples, you see the auxiliaries as objects that implement a read_next method. Could they just be regular iterators?

Also, how do you see the assignment of the auxiliaries to an AtomGroup? In the case of pull.xvg as described in the mailing list (as an output of Gromacs for a simulation with pulling), we only have access to a time serie of pulling force. Should we inspect the TPR to know what atoms are pulled?

@jbarnoud
Copy link
Contributor

From what I understand of your examples, you see the auxiliaries as objects that implement a read_next method. Could they just be regular iterators?

Answering to myself. It has to be a custom object, and a simple iterator won't do.

Indeed, we will have to handle the random seeking of the trajectory, so we will not only have to read values from the auxiliaries sequentially. Also, we will have to handle alignement of the auxiliaries: an auxiliary may be save with a different frequency as the coordinates.

@richardjgowers
Copy link
Member Author

What I was thinking is that if the ts is passed in, then either the frame or time information from this would let the aux figure out what to return. Sort of, here's a Timestep that says it's from t=10ps, what would you like to add to this?

Ah yeah, sparse attributes aren't really handled currently. We'd need something like how pandas does missing information.

And yes, meshing different frequencies is a lovely headache for someone.

@jbarnoud
Copy link
Contributor

What I was thinking is that if the ts is passed in, then either the frame or time information from this would let the aux figure out what to return. Sort of, here's a Timestep that says it's from t=10ps, what would you like to add to this?

OK. So this is very similar to something I had in mind for coordinate transformations. I'll tidy up my proposal so it can be discussed in parallel, then.

@kain88-de
Copy link
Member

@richardjgowers I thought the add_auxillary method could just manipulate the __dict__ of the Reader class directly.

class Reader(object):
    def add_auxillary(self, name, val):
        self.__dict__[name] = val

read = Reader()
read.add_auxillary('pulling_force', np.arange(5))
read.pulling_force

This makes it very easy to add new values. We woudn't need to write special classes that can be added as auxillaries as in your last example. This way the information is easily accesible like velocities and positions. I could imagine that val is either a filename or a class that handles the access for xvg files for csv file objects. (actually add_auxillary could figure that out based on fileending as well and then use the classes itself for covenience).

@richardjgowers
Copy link
Member Author

@kain88-de yeah hacking it into the namespace is cool. But there still needs to be some sort updating of the auxiliary data every time a frame is read?

@kain88-de
Copy link
Member

But there still needs to be some sort updating of the auxiliary data every time a frame is read?

Yes. I've just given a simple example of the __dict__ change that I thought of using. But the class for updating could be very generic or a collection of generic ones. For one the add_auxillary method should have a simple algorithm to chose the best class and also accept a class as parameter.
I would like that the following list of auxillary data is possible

reader.add_auxillary('pulling_force', ndarray)
reader.add_auxillary('pulling_force', dataframe)
reader.add_auxillary('pulling_force', 'pull.xvg')
reader.add_auxillary('pulling_force', 'energy.edr')
reader.add_auxillary('pulling_force', 'some-variable.csv')

Of course each needs to be handled specially. Maybe with a class. And we need to do more then just add to the __dict__ method. The good thing about this is we can easily extend the TimeStep object with more information.

@jbarnoud
Copy link
Contributor

I would prefer the auxiliary data to be in a separate namespace. I think it would be cleaner to call ts.aux.pullforce rather than ts.pullforce. The TimeStep name space can get crowded and messy otherwise. Also, having a aux namespace make clear that the data we access comes from somewhere else.

@kain88-de I am not sure I see how what you suggest differs from @richardjgowers initial proposal. Could you elaborate?

@kain88-de
Copy link
Member

@kain88-de I am not sure I see how what you suggest differs from

It doesn't differ that much. Rather extends and generalizes it. With @richardjgowers example we would need to write a new class for every new auxillary that we want to add. While I propose that we simple write classes to read the different file formats, this is because xvg and others can actually contain a lot of different observables, this leaves the user more freedom.

# Richard
u.trajectory.add_auxiliary(PullForce('pull.xvg'))
# Me
u.trajectory.add_auxiliary('pulling_force', XVGReader('pull.xvg'))
## And an alias
u.trajectory.add_pulling('pull.xvg')

@jbarnoud
Copy link
Contributor

On 21/03/16 23:02, kain88-de wrote:

@kain88-de <https://github.com/kain88-de> I am not sure I see how
what you suggest differs from

It doesn't differ that much. Rather extends and generalizes it. With
@richardjgowers https://github.com/richardjgowers example we would
need to write a new class for every new auxillary that we want to add.
While I propose that we simple write classes to read the different
file formats, this is because xvg and others can actually contain a
lot of different observables, this leaves the user more freedom.

Richard

u.trajectory.add_auxiliary(PullForce('pull.xvg'))

Me

u.trajectory.add_auxiliary('pulling_force', XVGReader('pull.xvg'))

And an alias

u.trajectory.add_pulling('pull.xvg')


You are receiving this because you commented.
Reply to this email directly or view it on GitHub
#785 (comment)

I understand. Then I like yours better. Anyway, in your example,
XVGReader will have to inherit from a Auxiliary class that know how
to handle the jumps in the trajectory, and the data alignment when the
auxiliary data is not saved at the same frequency than the trajectory.

@kain88-de
Copy link
Member

XVGReader will have to inherit from a Auxiliary class that know how to handle the jumps in the trajectory,

yes that we the headache @richardjgowers talked about would only need to be solved once.

@orbeckst
Copy link
Member

Is this what @fiona-naughton is going to do for GSoC? If so, we should restart the discussion here, especially with Fiona included.
I think the outcome of this discussion should be the definition of a preliminary API; it might change during testing but there should be something concrete to work towards to,

@fiona-naughton
Copy link
Contributor

Adding to dict directly indeed sounds good.
It might be nice, in the case of an auxiliary file/array with multiple columns, to be able to assign each a different name? And, if there's no time column, to instead guess time assuming data was recorded with a regular timestep.

On the time sync front: when the auxiliary/trajectory are in phase (as it were), I guess it's just a matter of pulling up the matching timepoint, with some form of 'missing' label when the auxiliary data is less frequent. Otherwise we could find the closest time instead, but should probably flag that it doesn't exactly correspond to that point in the trajectory (the missing values in the less-frequent case could also be filled with the closest value); it may be worth also offering some sort of interpolation option to guess the value instead?

@fiona-naughton
Copy link
Contributor

Having finally had a proper look through the existing timestep/reader stuff, if I understand correctly this will work something like this:

As above, given an aux name and a filename add_auxiliary will add an appropriate AuxReader instance (guessed from file extension or additionally passed in) to the trajectory namespace by adding directly to __dict__; e.g. trajectory.auxname.

Similar to the existing readers, the AuxReader will store the frequency, time offset and (last-read) time of the auxiliary data (depending on the type of auxiliary file, dt and offset might be determinable from the file). For meshing different frequencies, the general idea would be that each auxiliary timestep is allocated to the closest trajectory timestep; it should be possible to calculate how many of the next aux timesteps 'belong' to the next trajectory timestep, e.g. floor((time+3*dt/2-auxname.time)/auxname.dt) (though slightly more complicated at edge cases)?

Then in addition to a read_next method to read and return the next auxiliary time step data, have a read_next_ts method that:

  1. calls read_next the appropriate number of times as calculated above, and stores the returned data in an array as something like auxname.all; if there are no closest auxiliary points (i.e. auxiliary data is more sparse) this would have length 0.
  2. calculates a single representative value (or some form of NA flag as appropriate) for the trajectory timeframe as auxname.representative; this could be an exact match only, the closest auxiliary timestep, or an average, as specified in auxname.rep_method

Then call auxname.read_next_ts for each added auxiliary with the trajectory's read_next_timestep.

So, ultimately: auxiliary data is added by:

trajectory.add_auxiliary(auxname, filename, **kwargs)

with kwargs optionally including the auxiliary reader, dt, offset, or rep_method. The data can then be accessed either in full through trajectory.auxname.all and looping through the trajectory (or looping through the auxiliary only?), or one-value-per-frame as trajectory.auxname.representative.

Hopefully that sounds like it would work? I'm not sure where and how best to get all the aux.read_next_ts to run with the trajectory read_next_timestep (I guess we'll have to keep track of all the auxiliary names added to the trajectory in some form of list)? Also, picking a naming scheme that's not too confusing...

@jbarnoud
Copy link
Contributor

From what I understand, trajectory.auxname gives access to the AuxReader, not the current value for the time step. Right? If so, it would be convenient to have access to the current auxiliary value from the TimeStep.

As I wrote earlier in the thread, I think it would be better to access the auxiliary data through trajectory.aux.name and/or ts.aux.name rather than trajectory.auxname. Having a distinct name space avoid naming conflicts, and is more explicit about the origin of the data.

Going for the closest time looks to me as the right way to do, assuming we have a user-defined tolerance. That tolerance can be low if we want exact match for the time between the trajectory and the auxiliary (we are talking about floating point number so exact should be understood as "close enough"), it can be higher if the sampling rates do not exactly match and we are OK with this.

The pandas library deals with aligning time series. It might be worth having a look at what they do. In addition to what to propose, I would suggest to have the option of raising an exception if a time is missing in the auxiliary data, and the option to iterate over the frames at the pace of the auxiliary data if the auxiliary data are more sparse.

@richardjgowers
Copy link
Member Author

Yeah data has to be stored in the Timestep object, it'd be annoying if I had to one day rummage around in the Reader to find things.

So wrt getting Reader to read aux things, we could add some things to base.Reader like

# in base.Reader
def next(self):
    ts = self._read_next_frame()  # Regular timestep read
    for aux in self._auxs:
        aux.read_next(ts)  # AuxReader examines ts to figure out what to put inside ts
    return ts

Where self._auxs could just be an empty list (so nothing happens) but otherwise stores all the AuxReaders that have been added.

@fiona-naughton
Copy link
Contributor

Ok, thanks. So we end up with the readers being stored in a list as trajectory._auxs and the data itself stored separately with the given name as ts.aux.name? If it's still worth having the full set of associated auxiliary values (so we're not loosing info when looping the trajectory) as well as the closest value, where would the former best end up to differentiate it from the latter?

@jbarnoud
Copy link
Contributor

On 22/05/16 04:08, Fiona Naughton wrote:

Ok, thanks. So we end up with the readers being stored in a list as
|trajectory._auxs| and the data itself stored separately with the
given name as |ts.aux.name|? If it's still worth having the full set
of associated auxiliary values (so we're not loosing info when looping
the trajectory) as well as the closest value, where would the former
best end up to differentiate it from the latter?


You are receiving this because you commented.
Reply to this email directly or view it on GitHub
#785 (comment)

ts.aux.name should somehow give access to the current value. There
should indeed be a way to access the AuxReadder itself so we can
access any other relevant information from it (you mention the closest
value and other things).

@richardjgowers
Copy link
Member Author

You should be able to break up this task by just writing a standalone XVGReader at first... so the following code should work

from MDAnalysis.trajectory.auxreaders import XVGReader

myreader = XVGReader('pull.xvg')

for value in myreader:
    print value

print myreader.get_value(4)  # not sure if this should pass in a Timestep or frame number... can fix later

Then all trajectory Readers are just "users" of this code.

@orbeckst
Copy link
Member

There is a XVGReader in GromacsWrapper. You can use that code as a starting point of you like.

Oliver Beckstein
email: orbeckst@gmail.com

Am May 24, 2016 um 3:00 schrieb Richard Gowers notifications@github.com:

You should be able to break up this task by just writing a standalone XVGReader at first... so the following code should work

from MDAnalysis.trajectory.auxreaders import XVGReader

myreader = XVGReader('pull.xvg')

for value in myreader:
print value

print myreader.get_value(4) # not sure if this should pass in a Timestep or frame number... can fix later
Then all trajectory Readers are just "users" of this code.


You are receiving this because you commented.
Reply to this email directly or view it on GitHub

@fiona-naughton
Copy link
Contributor

While it's relatively straightforward to in general get the auxiliary to update with the trajectory when iterating (since __iter__()/next() are defined in the base ProtoReader), I'm a bit stuck on what to do for the trajectories where jumping to a specified frame is possible with _read_frame(), since that's defined separately in each reader. Not sure if there's a nice way to get the auxiliary to update appropriately in these cases, without having to modify each reader?

(I'm also unsure on the best place to first create the _auxs list for the trajectory, since ProtoReader doesn't have an __init__; I'm currently checking it exists and creating it if not before using it, or is it alright to say just set it in Reader's __init__?)

@richardjgowers
Copy link
Member Author

You can add a layer to Reader, so it calls maybe _goto_frame which in turn calls _read_frame. So then u.trajectory[27] would go something like:

base.Reader.__getitem__(27) -> base.Reader._goto_frame(27) -> XTCReader._read_frame(27)
                                                           -> XVGReader._read_frame(27)

This might also be useful for next, ideally there should only be one place where things happen.

You can add an __init__ to the base class, just make sure then all subclasses super in their inits. (And have a test that loops over all Readers checking that they init with this attribute).

@mnmelo
Copy link
Member

mnmelo commented Jun 6, 2016

Reading through the WIP PR (#868) I came across @richardjgowers' comment that we could read in xvg data all at once. I second that idea. However, I think the same will be applicable to any aux data format. Besides, holding all aux data in memory will certainly speed things up when loading new timesteps.

I think it's important to decide on this right away, because the nascent object model will have to change accordingly. I'm bringing this up here, instead of in the PR, to keep the discussion together.
Do we want to:

  • Read in all the aux data at once? In this case the AuxFileReader class will just be a wrapper to read from file, and the file-agnostic AuxReader should do the rest. This has the tiny danger of one running out of memory, for huge aux data files, or for many universes with large aux data.
  • Always read aux data from file one timestep at a time? This seems to be the direction of the current implementation in the PR. Probably not as fast as holding everything in memory, and with trickier implementation of things such as random frame access.
  • A mix of the above based on aux filesize?

@jbarnoud
Copy link
Contributor

jbarnoud commented Jun 6, 2016

On 06/06/16 11:30, mnmelo wrote:

Reading through the WIP PR (#868
#868) I came across
@richardjgowers https://github.com/richardjgowers' comment that we
could read in xvg data all at once. I second that idea. However, I
think the same will be applicable to any aux data format. Besides,
holding all aux data in memory will certainly speed things up when
loading new timesteps.

I think it's important to decide on this right away, because the
nascent object model will have to change accordingly. I'm bringing
this up here, instead of in the PR, to keep the discussion together.
Do we want to:

  • Read in all the aux data at once? In this case the |AuxFileReader|
    class will just be a wrapper to read from file, and the
    file-agnostic |AuxReader| should do the rest. This has the tiny
    danger of one running out of memory, for huge aux data files, or
    for many universes with large aux data.
  • Always read aux data from file one timestep at a time? This seems
    to be the direction of the current implementation in the PR.
    Probably not as fast as holding everything in memory, and with
    trickier implementation of things such as random frame access.
  • A mix of the above based on aux filesize?


You are receiving this because you commented.
Reply to this email directly, view it on GitHub
#785 (comment),
or mute the thread
https://github.com/notifications/unsubscribe/ABUWuv9HtxlVDA4dJxMrqTU9GuZUoe3pks5qI-ingaJpZM4H0a5w.

On the long run, the third option in the best. It allows to be fast when
we can, but still scales when the simulations become very long.

On a shorter term, I would go for the second option as it is what the
current implementation does. This allows to have something that work,
even if we iteratively improve it latter on.

Any way, AuxReader should not care abut how the data is obtained. It is
just a interface for the trajectory reader. Fot what AuxReader knows,
the auxiliary data can be computed on the fly (which may happen in the
future).

@richardjgowers
Copy link
Member Author

Aux data could easily be something like 3 floats for every atom every step (maybe some derived quantity you don't want to recalculate) which would then make the data comparable in size to the trajectory. I'd rather let individual implementations (XVGReader) make the assumption that the data is small rather than do it from the outset.

@fiona-naughton
Copy link
Contributor

Yeah, this was something I was unsure on; I went with the reading-in-step-at-a-time largely because I wanted to try get that figured out first. I agree it's probably safe to assume the xvg files won't be too large; but I can leave it for now, and come back to it later?

@mnmelo
Copy link
Member

mnmelo commented Jun 6, 2016

Seems a good and consensual plan. Still, I'd make sure to clearly define the roles of AuxReader and AuxFileReader in the object model. Right now the function of both classes seems a bit convoluted, with AuxReader also needing to be passed a filename.

@richardjgowers
Copy link
Member Author

With value interpolations, there's already an interpolation class in scipy which does quadratic and cubic interpolations too. Scipy isn't a strict dependency, but it does seem daft to reinvent this wheel...

@jbarnoud
Copy link
Contributor

jbarnoud commented Sep 5, 2016

Closed by #868

@jbarnoud jbarnoud closed this as completed Sep 5, 2016
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

6 participants