New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

`ts.dimensions` object with different trajectory formats #1575

Open
tylerjereddy opened this Issue Aug 3, 2017 · 12 comments

Comments

Projects
None yet
4 participants
@tylerjereddy
Member

tylerjereddy commented Aug 3, 2017

Expected behaviour

The ts.dimensions returned object (the numpy array of values) is NOT linked back to the state of the trajectory -- the property should return a numpy array of dimensions that has no connection to the current state of trajectory iteration once the value invalue = ts.dimensions has been stored i.e., in a list.

Futhermore, this behavior must be consistent between different trajectory formats.

Actual behaviour

I'm so confused because look at this behavior when tracking dimensions over time with two different formats (and for even more confusion, look at what happens to the stored values for the XTC format after iteration ends!!!)

  1 import MDAnalysis                                                               
  2 from MDAnalysisTests.datafiles import GRO, XTC, PSF, DCD                        
  3                                                                                 
  4 u1 = MDAnalysis.Universe(GRO, XTC)                                              
  5 u2 = MDAnalysis.Universe(PSF, DCD)                                              
  6 list_dims_1 = []                                                                
  7 list_dims_2 = []                                                                
  8                                                                                 
  9 for universe, list_dims in zip([u1, u2],                                        
 10                                [list_dims_1, list_dims_2]):                     
 11     for ts in universe.trajectory[:2]:                                          
 12         frame = ts.frame                                                        
 13         dims = ts.dimensions                                                    
 14         dims[0] = frame                                                         
 15         list_dims.append((frame, dims))                                         
 16         print('universe:', universe)                                            
 17         print('frame:', frame)                                                  
 18         print('dims:', dims)                                                    
 19         print('list_dims:', list_dims)                                          
 20                                                                                 
 21 # if the above isn't crazy enough for you                                       
 22 # when the trajectory iteration ends, the                                       
 23 # apparent XTC rewind will reset all the dimension                              
 24 # values assigned to the list to those from                                     
 25 # the first frame !!!!                                                          
 26                                                                                 
 27 for element in list_dims_1:                                                     
 28     print(element)                                                              
 29                                                                                 
 30 for element in list_dims_2:                                                     
 31     print(element) 

Output:

universe: <Universe with 47681 atoms>
frame: 0
dims: [  0.          80.01700592  80.01700592  60.          60.          90.        ]
list_dims: [(0, array([  0.        ,  80.01700592,  80.01700592,  60.        ,
        60.        ,  90.        ], dtype=float32))]
universe: <Universe with 47681 atoms>
frame: 1
dims: [  1.          80.13008118  80.13008118  60.          60.          90.        ]
list_dims: [(0, array([  1.        ,  80.13008118,  80.13008118,  60.        ,
        60.        ,  90.        ], dtype=float32)), (1, array([  1.        ,  80.13008118,  80.13008118,  60.        ,
        60.        ,  90.        ], dtype=float32))]
universe: <Universe with 3341 atoms>
frame: 0
dims: [  0.   0.   0.  90.  90.  90.]
list_dims: [(0, array([  0.,   0.,   0.,  90.,  90.,  90.], dtype=float32))]
universe: <Universe with 3341 atoms>
frame: 1
dims: [  1.   0.   0.  90.  90.  90.]
list_dims: [(0, array([  0.,   0.,   0.,  90.,  90.,  90.], dtype=float32)), (1, array([  1.,   0.,   0.,  90.,  90.,  90.], dtype=float32))]


(0, array([ 80.01700592,  80.01700592,  80.01700592,  60.        ,
        60.        ,  90.        ], dtype=float32))
(1, array([ 80.01700592,  80.01700592,  80.01700592,  60.        ,
        60.        ,  90.        ], dtype=float32))
(0, array([  0.,   0.,   0.,  90.,  90.,  90.], dtype=float32))
(1, array([  1.,   0.,   0.,  90.,  90.,  90.], dtype=float32))

This is incredibly confusing to me. Why on earth is the interaction behavior of the XTC returned dimension value so different from DCD returned dimension value. The latter looks sane, but XTC looks wild and has caused me hours of confusing / debugging in a much larger framework context today.

I'm using Python 3.6 and the dev branch.

@tylerjereddy

This comment has been minimized.

Show comment
Hide comment
@tylerjereddy

tylerjereddy Aug 3, 2017

Member

Look very carefully at what happens on those append operations with XTC. All numpy arrays stored in the list are reassigned to the current numpy array of dimensions for XTC but not DCD.

Member

tylerjereddy commented Aug 3, 2017

Look very carefully at what happens on those append operations with XTC. All numpy arrays stored in the list are reassigned to the current numpy array of dimensions for XTC but not DCD.

@kain88-de

This comment has been minimized.

Show comment
Hide comment
@kain88-de

kain88-de Aug 3, 2017

Member

You have been running into a issue with references in python. Unfortunately we seem to handle references for the dimensions differently between formats. For the xtc reader ts.dimensions is retuning a reference that we keep updating when reading a frame. You only store that reference in your list. For the DCD reader we actually have a individual copy. Not sure why that is currently.

The question here is if ts.* should return a reference that gets updated by a reader or should it allocate new memory when a new frame is read.

The dcd has another curious thing. The file you are reading doesn't have any box information. Therefore the unitcell should always be 0, 0, 0, 90, 90, 90 (our default for none given). It's strange that we have a different unit-cell on the second frame.

Member

kain88-de commented Aug 3, 2017

You have been running into a issue with references in python. Unfortunately we seem to handle references for the dimensions differently between formats. For the xtc reader ts.dimensions is retuning a reference that we keep updating when reading a frame. You only store that reference in your list. For the DCD reader we actually have a individual copy. Not sure why that is currently.

The question here is if ts.* should return a reference that gets updated by a reader or should it allocate new memory when a new frame is read.

The dcd has another curious thing. The file you are reading doesn't have any box information. Therefore the unitcell should always be 0, 0, 0, 90, 90, 90 (our default for none given). It's strange that we have a different unit-cell on the second frame.

@richardjgowers

This comment has been minimized.

Show comment
Hide comment
@richardjgowers

richardjgowers Aug 3, 2017

Member

So AtomGroup.positions deliberately returns a copy of positions, and AtomGroup.dimensions should also for consistency (but it looks like it won't right now).

Timestep.positions I'm not completely sure on whether it's meant to return a view/reference/pointer or copy. Accessing the data within the Timestep object is a little more advanced, and I can see that maybe it would be better for this to default to views.

Member

richardjgowers commented Aug 3, 2017

So AtomGroup.positions deliberately returns a copy of positions, and AtomGroup.dimensions should also for consistency (but it looks like it won't right now).

Timestep.positions I'm not completely sure on whether it's meant to return a view/reference/pointer or copy. Accessing the data within the Timestep object is a little more advanced, and I can see that maybe it would be better for this to default to views.

@richardjgowers

This comment has been minimized.

Show comment
Hide comment
@richardjgowers

richardjgowers Aug 3, 2017

Member

Also related (I think @kain88-de mentions this above), some Readers create a new Timestep object each time they read a frame, others will fill the same object. Not sure what this should do either.

Member

richardjgowers commented Aug 3, 2017

Also related (I think @kain88-de mentions this above), some Readers create a new Timestep object each time they read a frame, others will fill the same object. Not sure what this should do either.

@tylerjereddy

This comment has been minimized.

Show comment
Hide comment
@tylerjereddy

tylerjereddy Aug 3, 2017

Member

@kain88-de There's nothing curious about the second DCD frame. I intentionally reassigned the first element of the unit cell to the frame number so that I could assess if the values were references or not--it was more confusing to deal with reference vs. copy if the unit cell is identical in each frame because the reassignments in the list are confounded by the fact that the unit cell doesn't change in the test file during iteration.

You'll also notice that I intentionally appended the value of ts.frame as well -- and that's not a reference for either format. So, they aren't even consistent in and of themselves.

My take is that we should never return references like this for values a user would typically want to access. I've always used MDAnalysis in a manner that allows me to use pythonic paradigms like appending trajectory-retrieved values to a list.

Regardless of what decision is made, I think we can agree that it should be consistent across all formats and probably all reasonably-accessed property types--a real power of MDAnalysis is the ability to perform reproducible analyses in a format-agnostic manner, and breaking this is why I've labelled a high priority here.

I would suggest that we even try to enforce the policy with a (parametrized?) unit test that crawls through the relevant parameters for the derived readers & their properties & ensures that whatever type of object (reference / value / view) is supposed to be returned is actually returned.

Member

tylerjereddy commented Aug 3, 2017

@kain88-de There's nothing curious about the second DCD frame. I intentionally reassigned the first element of the unit cell to the frame number so that I could assess if the values were references or not--it was more confusing to deal with reference vs. copy if the unit cell is identical in each frame because the reassignments in the list are confounded by the fact that the unit cell doesn't change in the test file during iteration.

You'll also notice that I intentionally appended the value of ts.frame as well -- and that's not a reference for either format. So, they aren't even consistent in and of themselves.

My take is that we should never return references like this for values a user would typically want to access. I've always used MDAnalysis in a manner that allows me to use pythonic paradigms like appending trajectory-retrieved values to a list.

Regardless of what decision is made, I think we can agree that it should be consistent across all formats and probably all reasonably-accessed property types--a real power of MDAnalysis is the ability to perform reproducible analyses in a format-agnostic manner, and breaking this is why I've labelled a high priority here.

I would suggest that we even try to enforce the policy with a (parametrized?) unit test that crawls through the relevant parameters for the derived readers & their properties & ensures that whatever type of object (reference / value / view) is supposed to be returned is actually returned.

@orbeckst

This comment has been minimized.

Show comment
Hide comment
@orbeckst

orbeckst Aug 3, 2017

Member

Following up from #276 (comment)

We should decide clearly if Timestep._unitcell is going to behave like a view (like Timestep.positions) or like a copy. Forpositions we should keep a view for performance reasons; creating big coordinate arrays is really bad. For _unitcell the overhead of creating a copy is small so we could just do that if necessary.

Timestep.dimensions is a property and will either produce a copy on its own by converting _unitcell or hand over _unitcell directly. For consistency between the two cases I suggest to prescribe that dimensions always provides a copy, even if it does not change _unitcell. The basic implementation is then

class Timestep:
  @property
  def dimensions(self):
       return self._unitcell.copy()

EDIT: I would then leave _unitcell as is – basically, whatever works best for the implementation of the reader. Users are not encouraged to use it anyway.

Member

orbeckst commented Aug 3, 2017

Following up from #276 (comment)

We should decide clearly if Timestep._unitcell is going to behave like a view (like Timestep.positions) or like a copy. Forpositions we should keep a view for performance reasons; creating big coordinate arrays is really bad. For _unitcell the overhead of creating a copy is small so we could just do that if necessary.

Timestep.dimensions is a property and will either produce a copy on its own by converting _unitcell or hand over _unitcell directly. For consistency between the two cases I suggest to prescribe that dimensions always provides a copy, even if it does not change _unitcell. The basic implementation is then

class Timestep:
  @property
  def dimensions(self):
       return self._unitcell.copy()

EDIT: I would then leave _unitcell as is – basically, whatever works best for the implementation of the reader. Users are not encouraged to use it anyway.

@orbeckst

This comment has been minimized.

Show comment
Hide comment
@orbeckst

orbeckst Aug 3, 2017

Member

@tylerjereddy in general I very much agree with your #1575 (comment):

My take is that we should never return references like this for values a user would typically want to access. I've always used MDAnalysis in a manner that allows me to use pythonic paradigms like appending trajectory-retrieved values to a list.

I consider Timestep.positions the one major exception to this approach; a common mistake is, as you know,

timeseries = [ts.positions for ts in u.trajectory]    # does NOT produce a timeseries of different positions

whereas [ag.positions for ts in u.trajectory] will work (as @richardjgowers already said). This could be considered a break in the user interface and is certainly a gotcha and needs to be documented better, especially when we improve the docs #1175 . The docs should show some "best practice" examples, both for the user and for the developers to refer back to.

However, I would consider working with Timestep directly reasonably advanced so I am not overly worried; power users should be able to figure this out. For performance reasons it is important that Timestep.positions remains a view because at least at the moment, AtomGroup.positions slices Timestep.positions – it would be awful if we made two copies at this stage.

If people had very strong feelings about Timestep.positions having to return a copy we could make it do it and then internally use something like Timestep._positions (and the same for velocities and forces).

Member

orbeckst commented Aug 3, 2017

@tylerjereddy in general I very much agree with your #1575 (comment):

My take is that we should never return references like this for values a user would typically want to access. I've always used MDAnalysis in a manner that allows me to use pythonic paradigms like appending trajectory-retrieved values to a list.

I consider Timestep.positions the one major exception to this approach; a common mistake is, as you know,

timeseries = [ts.positions for ts in u.trajectory]    # does NOT produce a timeseries of different positions

whereas [ag.positions for ts in u.trajectory] will work (as @richardjgowers already said). This could be considered a break in the user interface and is certainly a gotcha and needs to be documented better, especially when we improve the docs #1175 . The docs should show some "best practice" examples, both for the user and for the developers to refer back to.

However, I would consider working with Timestep directly reasonably advanced so I am not overly worried; power users should be able to figure this out. For performance reasons it is important that Timestep.positions remains a view because at least at the moment, AtomGroup.positions slices Timestep.positions – it would be awful if we made two copies at this stage.

If people had very strong feelings about Timestep.positions having to return a copy we could make it do it and then internally use something like Timestep._positions (and the same for velocities and forces).

@orbeckst

This comment has been minimized.

Show comment
Hide comment
@orbeckst

orbeckst Aug 21, 2017

Member

Should this still be considered a bug?

What is the action that should be taken on this issue?

Is it still high priority?

Member

orbeckst commented Aug 21, 2017

Should this still be considered a bug?

What is the action that should be taken on this issue?

Is it still high priority?

@kain88-de

This comment has been minimized.

Show comment
Hide comment
@kain88-de

kain88-de Aug 22, 2017

Member

This is a bug. The problem is a principal design deficiency in TimeStep that requires us to create a subclass for every format to support the current model to show the raw dimensions. I didn't have time to have a proper look into TimeStep and look into a solution that doesn't require sub-classing. It will likely be an easy fix but we have to check/change all readers for this.

I think we just need to redefine what is a raw cell data. I hope we can do this without breaking old user code.

Member

kain88-de commented Aug 22, 2017

This is a bug. The problem is a principal design deficiency in TimeStep that requires us to create a subclass for every format to support the current model to show the raw dimensions. I didn't have time to have a proper look into TimeStep and look into a solution that doesn't require sub-classing. It will likely be an easy fix but we have to check/change all readers for this.

I think we just need to redefine what is a raw cell data. I hope we can do this without breaking old user code.

@richardjgowers

This comment has been minimized.

Show comment
Hide comment
@richardjgowers

richardjgowers Jan 30, 2018

Member

So as @kain88-de has said, it's impossible to make ts.dimensions return a view currently, as some Timesteps store _unitcell in a proprietary format.

I'd like to make ts.dimensions return a view though, so ts.positions and ts.dimensions (and ts.whatever) all behave in the same "raw" way. For this we'd have to remove _unitcell and make Readers & Writers do whatever conversions necessary to handle a particular format. Ie shift the burden of weird box formats from Timestep to Reader/Writer. This might actually simplify things overall.

Thoughts?

Member

richardjgowers commented Jan 30, 2018

So as @kain88-de has said, it's impossible to make ts.dimensions return a view currently, as some Timesteps store _unitcell in a proprietary format.

I'd like to make ts.dimensions return a view though, so ts.positions and ts.dimensions (and ts.whatever) all behave in the same "raw" way. For this we'd have to remove _unitcell and make Readers & Writers do whatever conversions necessary to handle a particular format. Ie shift the burden of weird box formats from Timestep to Reader/Writer. This might actually simplify things overall.

Thoughts?

@kain88-de

This comment has been minimized.

Show comment
Hide comment
@kain88-de

kain88-de Jan 30, 2018

Member

Ie shift the burden of weird box formats from Timestep to Reader/Writer. This might actually simplify things overall.

This is actually already done for the XDR formats of GROMACS for simplicity. Personally, I would also like to get rid of sub classing timestep. It means less moving parts to get right and potentially works towards simplifying the readers to implement a read/write function that does all necessary conversions.

Member

kain88-de commented Jan 30, 2018

Ie shift the burden of weird box formats from Timestep to Reader/Writer. This might actually simplify things overall.

This is actually already done for the XDR formats of GROMACS for simplicity. Personally, I would also like to get rid of sub classing timestep. It means less moving parts to get right and potentially works towards simplifying the readers to implement a read/write function that does all necessary conversions.

@richardjgowers

This comment has been minimized.

Show comment
Hide comment
@richardjgowers

richardjgowers Jan 30, 2018

Member
Member

richardjgowers commented Jan 30, 2018

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment