Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Trajectory reader Timestep objects give inconsistent interfaces #250

Closed
dotsdl opened this issue Apr 13, 2015 · 25 comments
Closed

Trajectory reader Timestep objects give inconsistent interfaces #250

dotsdl opened this issue Apr 13, 2015 · 25 comments

Comments

@dotsdl
Copy link
Member

dotsdl commented Apr 13, 2015

The interfaces for Timestep objects for the various trajectory readers are inconsistent, making it difficult to use them in a format-agnostic way. As a specific example, the coordinates.xdrfile.core.Timestep object used for XTC and TRR files features a time attribute giving the time of the current frame, while the coordinates.DCD.Timestep object used for DCDs doesn't have one at all.

If some trajectory formats are more information rich than others, then it makes sense for accessors to vary across them. However, we should probably hammer out what accessors should be present for all formats, and clearly indicate which are not universal.

@richardjgowers
Copy link
Member

I like the idea of setting this in stone too. The docs for trajectory readers set out a strict format, so should Timestep.

http://pythonhosted.org/MDAnalysis/documentation_pages/coordinates/init.html?highlight=trajectory%20api#trajectory-api

Stuff related to this already:
Issue #213 wants to change the Timestep interface in general, so that all things access via .position not ._pos etc.
In the testsuite there is a lot of mixin tests, subclassed off, test_coordinate._TestTimestep which should be updated to reflect what the required interface is.

@orbeckst orbeckst added the API label Apr 13, 2015
@orbeckst orbeckst added this to the 1.0 milestone Apr 13, 2015
@orbeckst
Copy link
Member

I agree, this needs to be sorted put, especially with a view to API-freeze in 1.0. Technically, the Trajectory API already contains an API for Timestep but it's simply not good ;-p.

Can you guys (@dotsdl , @richardjgowers ) start drafting the API, starting with a discussion here? — Comments from everyone welcome, of course!

@dotsdl
Copy link
Member Author

dotsdl commented Apr 14, 2015

More than happy to hammer this out. Will work on a draft API given what we currently see across our existing readers.

@richardjgowers
Copy link
Member

So if we're changing the Timestep API, I guess I can list all the things about it that annoy me:

The init constructor is confusing, you can init one with either

  • an integer
  • another Timestep
  • an array of coordinates

It would make more sense to choose a single one of these methods (integer probably), and add the others as classmethod constructors... ie new_ts = Timestep.from_timestep(old_ts)
Initting from coordinates can just be done as
ts = Timestep(len(coords))
ts._positions = coords

Should we add the ability to have velocities & forces to base.Timestep? This would make it a lot more reusable in other classes.. it would be nice if subclasses of Timestep could use most of the base constructor and look like:

class OtherTimestep(Timestep)
    def __init__(self, arg, **kwargs):
        super(self, Timestep).__init__(arg, **kwargs)
        self.my_specific_thing = 1.0
        self.my_specific_thing2 = 2.0

@mnmelo
Copy link
Member

mnmelo commented Apr 21, 2015

I second Richard's call for less overloading of the constructor and for a broader Base Timestep.

Actually, should we let Base have all possible attributes (not just velocities and forces)? That would definitely help abstract the Reader.

@richardjgowers richardjgowers modified the milestones: 0.11, 1.0 Apr 21, 2015
@richardjgowers
Copy link
Member

I think there's lots of stuff that won't be common to all Timesteps, eg TRZTimestep has .temperature, but it's confusing/misleading if this is always left as 0.0 for every other format. Ideally the constructor for base.Timestep should read exactly like the API specifications.

@orbeckst
Copy link
Member

+1 for @richardjgowers 's proposal (in the spirit of PEP 20 "There should be one-- and preferably only one --obvious way to do it.")

I don't like the assignment to an underscored variable as the official way to get coordinates into a Timestep

ts = Timestep(len(coords))
ts._positions = coords

but rather have an explicit setter or make positions a first class citizen (probably through a managed attribute) so that we can do

ts = Timestep(len(coords))
ts.positions = coords
ts.velocities = v          # some magic has to happen to allocate the arrays...
ts.forces = f              # ... here to0

or use optional kwargs

ts = Timestep(len(coords), positions=coords, velocities=v, forces=f)

I also wouldn't clutter base.Timestep with everything but restrict it to a reasonable common subset along the lines of "if it is important in more than half of the readers it should certainly be in base". Velocities/forces are less common than that (I think) but still seem to be good candidates for base.

The API should really contain a sensible minimal subset of all possible features and user code will have to deal gracefully with information that is not available – that's much safer and cleaner than us trying to guess defaults for everything (see PEP 20 "In the face of ambiguity, refuse the temptation to guess."). We should, however, be consistent how this missing data is communicated. AttributeError on Timestep is ok-ish but it would be even cleaner if all additional data lived, say, in a Timestep.data dict-like object and you simply get a KeyError or could ask if observable in Timestep.data.

richardjgowers added a commit that referenced this issue Apr 22, 2015
Also solves adding velocities/forces to Timesteps without them (Issue #213)
@richardjgowers
Copy link
Member

Ok I've pushed a branch which has a look at what the new Timestep constructor could look like. Feedback very welcome!

Highlights are:

  • base.Timestep supports velocities & forces
  • new constructor that just does one thing
  • from_coordinates and from_timestep constructors
  • positions, velocities and forces are now all properties, which will automagically allocate if you try and set when the Timestep doesn't have it

Hopefully this will make subclasses of base.Timestep a lot smaller, so the enforcing API should be easier.

@orbeckst I played with the idea of having a ._data attribute which was a dict, then overrode getattr setattr to use this. It could work but I think it got confusing very fast. The class structure of a Python object is already a dict, so if we put a dict holding all the observables inside the Timestep class, then it's confusing because we've got a nested dict. Ie do I access Timestep.data['observable'] or Timestep.observable?
I've made the copy methods copy everything that isn't prefixed by an underscore, so underscore stuff is "private" and everything else gets copied.

This does mean that "users" of Timesteps will have to do some error checking on what is passed to them. Eg:

def write_something(ts):
    try:
        pressure = ts.pressure
        temperature = ts.temperature
    except AttributeError:
        raise NoDataError("The Timestep you gave me lacked info")

@orbeckst
Copy link
Member

On 22 Apr, 2015, at 03:45, Richard Gowers wrote:

@orbeckst I played with the idea of having a ._data attribute which was a dict, then overrode getattr setattr to use this. It could work but I think it got confusing very fast. The class structure of a Python object is already a dict, so if we put a dict holding all the observables inside the Timestep class, then it's confusing because we've got a nested dict. Ie do I access Timestep.data['observable'] or Timestep.observable?

My thinking went along the lines that Timestep.data would be an empty dict-like object in base.Timestep and the API will define that anything in data is optional. Anything in Timestep itself is mandatory.

This data object should be something that allows attrib or keyword access interchangeably:

Timestep.data.temperature

or
Timestep.data['temperature']

Maybe it gets too confusing but it would make the API definition straightforward.

@richardjgowers
Copy link
Member

Ahh, yeah it does make the namespace a bit cleaner. I've put in a .data dict that stores anything non essential. I've then also made __ getattr__ look in the data dict if the attribute isn't found in the class namespace.

So the namespace is kind of like

Timestep

  • frame
  • numatoms
  • anything else which is required?
  • _unitcell optional
  • _positions (property)
  • _velocities optional
  • _forces optional
  • data
    • pressure [optional]
    • temperature [optional]
    • whatever else people can find

So Timestep.temperature will work because it looks in Timestep.data['temperature'] automatically.

So still todo:

  1. Someone has to put their foot down on what is a mandatory field for Timestep @dotsdl ?
  2. All code which accesses Timesteps needs to start referring to .positions, .velocities and .forces and not ._pos and ._velocities ._forces.
  3. TRR Timestep. If I've understood it correctly, I need to keep track of which frame properties are set at, then only return them to users if they match the current frame. So something like this? @mnmelo
@property
def positions(self):
    if self.frame == self._pos_source:
        return self._positions
    else:
        raise NoDataError

@positions.setter
def positions(self, new):
    self._positions[:] = new
    self._pos_source = self.frame

@dotsdl
Copy link
Member Author

dotsdl commented Apr 26, 2015

Here is the API spec I propose we adopt, given the discussion here and a review of the existing docs. I agree that the guiding principle should be keeping it as minimal as possible.

Timestep:

  • time (property)
    • returns _time
  • numatoms (property)
    • returns _numatoms
  • positions (property)
    • returns _positions
  • data (property)
    • returns _data (attribute-dict for extra data elements that aren't shared by ALL trajectory formats), which can contain:
      • velocities
      • forces
      • temperature
      • pressure
      • unitcell
      • etc.

This puts even relatively common elements such as velocities and forces in data as extras, but I think this is reasonable because these are not things that are literally common to EVERY format. At a minimum, trajectories are timeseries of positions, and I think the base class should be kept at that.

One (minor) exception is that some trajectories may not store actual time information, in which case I think it appropriate to yield None for that property.

Also, I think these basic elements (time, numatoms, positions, data) should be made properties so that they can raise an exception when an attempt is made to set them, but I'm not sure this is really necessary. It at least allows for one to attach a docstring to each, since the docstring for the base class itself wouldn't necessarily be exposed by the child classes for each Reader. This is important for interactive work.

@richardjgowers
Copy link
Member

I think the problem with putting vels & forces into data is that they then can't (easily) be managed properties. eg what I've suggested for them

f8e2099#diff-99815fce73651366394eb2dd8fe2f872R227

With .time, it might be better to raise NoDataError, but I'm not 100% on that either. It would be really annoying to do something like [ts.time for ts in u.trajectory] and get [None, None, None, ....]

Otherwise I agree with all the above, if you want you can work off the branch I started which has the new constructors?

@dotsdl
Copy link
Member Author

dotsdl commented Apr 26, 2015

With .time, it might be better to raise NoDataError, but I'm not 100% on that either. It would be really annoying to do something like [ts.time for ts in u.trajectory] and get [None, None, None, ....]

I'm not sold on it, either. I think NoDataError sounds reasonable, but then again there is also the suggestion by @orbeckst to do some reasonable guessing instead. I'm not sure how I feel about that either, but I suppose what matters most is if the behavior is consistent with how MDAnalysis is understood to handle "missing" data in other instances. I'm inclined to say "don't guess."

@dotsdl
Copy link
Member Author

dotsdl commented Apr 26, 2015

I think the problem with putting vels & forces into data is that they then can't (easily) be managed properties. eg what I've suggested for them

Ah...I see your point. However, elements in data could be managed properties if data were defined as a class with its elements as properties instead of being an attribute-dict. Not sure if this becomes a case of overengineering, but it would work.

@richardjgowers
Copy link
Member

I'd say that would work but would be overcomplicated, but I'll let someone else weigh in.

I just noticed you've put _unitcell in data too, all my above complaints apply to that as well, (it's very common and needs special handling)

@orbeckst
Copy link
Member

orbeckst commented May 2, 2015

There are a number of common and high-profile data structures in Timestep that warrant top level status (taking @dotsdl 's proposal) and promoting velocities and forces (because they are fundamental to what we consider an MD trajectory frame)

Updated Timestep API

The Timestep class should have the following attributes:

  • frame
  • _frame
    • native frame number if provided by the trajectory (optional)
    • My current proposal is that _frame is essentially ignored and we set frame ourselves but keep _frame around in case the user wants to look there. We treat it similar to _time and _unitcell as the native information.
  • time (property)
  • _time
    • native time in MDA units (optional)
    • At the moment, trajectory readers do the conversion when they fill Timestep.
  • numatoms (property)
    • returns _numatoms
  • positions (property)
    • get/set _positions
  • _positions
    • atom coordinates converted to MDAnalysis units (conversion done by Reader)
  • velocities (property)
    • get/set _velocities (optional)
  • _velocities:
    • native velocities converted to MDA units (conversion done by Reader)
  • forces (property)
    • get/set _forces (optional)
  • _forces
    • native forces converted to MDA units (conversion done by Reader)
  • dimensions (property)
    • converts _unitcell to MDAnalysis native unit cell representation
  • volume (property)
    • derived from dimensions
  • _unitcell
    • native unitcell
    • converted to MDA units (by the Reader)
  • data
    • Let's make this just a plain dict without additional attribute access – basically, just one well-defined way of getting into it.
    • It can contain extra data elements that aren't shared by most trajectory formats), for example:
      • temperature
      • pressure
      • FEP lambda (e.g. TRR reader)
      • per-atom user data (e.g. B-factor array from Multi-PDB reader or occupancy)
      • ...
    • Everything should be converted to MDA units by the Reader.

I think that we should continue with the approach that Timestep contains data in MDA units and that unit conversions are handled at the Reader level. In this way one can always access underscore-attributes and still be sure that the data are usable. Furthermore, we avoid repeated conversion overhead if data in a Timestep is read repeatedly.

What did I miss? Comments?

History

  • draft: 2015-05-01
  • UPDATED 2015-06-29: struck out parts of the proposal that we decided to implement differently and linked to pertinent issues

@hainm
Copy link
Contributor

hainm commented May 2, 2015

n_atoms, n_frames look nicer numatom

orbeckst referenced this issue May 6, 2015
Removed obsolete Timestep class

Changed PDBReader to subclass SingleFrameReader

Cleaned up multiframe Readers

Moved Reader API tests into own file.
orbeckst referenced this issue in richardjgowers/mdanalysis Jun 9, 2015
(defines size of Timestep)

base.Timestep can now have velocities and forces through
keywords 'velocities' and 'forces' [False by default]

Added alternate constructors to Timestep

from_timestep(other_timestep)
  - Allows construction of a Timestep from another Timestep of different type

from_coordinates(pos, velocities=None, forces=None)
  - Allows construction from set of existing coordinates

All other Timesteps updated to reflect new capabilities of base.Timestep

Still TODO:
Docs!
Travis check
Sanity check!
@orbeckst
Copy link
Member

@dotsdl and @richardjgowers : is one of you working on this one? I updated the updated API specs in my comment above. "All" (cough, cough) that needs to be done is to implement it...

Or should this wait until @dotsdl 's Timestep hacking is done?

@dotsdl
Copy link
Member Author

dotsdl commented Jun 30, 2015

I'm a bit bottlenecked at the moment (preparing some last minute things for MDSynthesis for SciPyConf). I'm happy to do it, but it won't be this week I'm afraid.

@richardjgowers
Copy link
Member

I don't mind taking this on

@orbeckst orbeckst assigned richardjgowers and unassigned dotsdl Jun 30, 2015
@orbeckst
Copy link
Member

Cheers!

richardjgowers added a commit that referenced this issue Jul 2, 2015
Readers now always always return ts.frame = 0 on first frame.

Fixed INPCRD and DLPoly Config giving -1 as first frame (should be 0)
@richardjgowers
Copy link
Member

  • frame returns 0 based
  • store native frame as _frame when available
  • time -> going to defer this to Issue Introduce universally Timestep.time as part of the Reader/Timestep API #308
  • _time -> ditto
  • numatoms as read only property stored as _numatoms
  • positions with setter/getter for _positions
  • velocities with setter/getter for _velocities or raise NoDataError
  • forces with setter/getter for _forces or raise NoDataError
  • dimensions & _unitcell
  • volume
  • data dict for extra info
  • docs updated to reflect new API

richardjgowers added a commit that referenced this issue Jul 3, 2015
In general, moved to relative import style across coordinates modules

Timesteps now read their native frame into "_frame".
This replaces what was "step" in some Readers.

Timestep.data dictionary added for formats with nonstandard data for
their Timestep.

Writing a Timestep with XTC/TRR Writer no longer modifies the Timestep
passed
richardjgowers added a commit that referenced this issue Jul 10, 2015
In general, moved to relative import style across coordinates modules

Timesteps now read their native frame into "_frame".
This replaces what was "step" in some Readers.

Timestep.data dictionary added for formats with nonstandard data for
their Timestep.

Writing a Timestep with XTC/TRR Writer no longer modifies the Timestep
passed
richardjgowers added a commit that referenced this issue Jul 10, 2015
All timesteps now use has_positions, has_velocities
and has_forces flags.

TRR uses these flags to manage when data is missing
@richardjgowers
Copy link
Member

I'm going to roll #213 into this too

@orbeckst
Copy link
Member

Good idea.

orbeckst added a commit that referenced this issue Jul 11, 2015
closes Issue #250 and Issue #213

(just docs need to be update)
@orbeckst
Copy link
Member

@richardjgowers close the issue once you have added the docs. Many thanks!!!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

5 participants