New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Using load_new with MemoryReader #1685

Closed
Yuan-Yu opened this Issue Oct 24, 2017 · 12 comments

Comments

Projects
None yet
4 participants
@Yuan-Yu

Yuan-Yu commented Oct 24, 2017

Expected behaviour

Actual behaviour

return an error :(

Code to reproduce the behaviour

import MDAnalysis as mda

u = mda.Universe(top, trj)
prot = u.select_atoms('protein')
u2 = mda.Merge(prot)
u2.load_new( [ prot.positions ], format=MemoryReader )
....

Currently version of MDAnalysis:

0.16.2

The reason is that "load_new" remove one dimension of the input array when the first dimension of the array is one.

if len(util.asiterable(filename)) == 1:
# make sure a single filename is not handed to the ChainReader
filename = util.asiterable(filename)[0]

Thanks for help!

@orbeckst

This comment has been minimized.

Show comment
Hide comment
@orbeckst

orbeckst Oct 27, 2017

Member

Yes, that's a bug..

As a silly workaround, add an artifical extra list

u2.load_new( [[ prot.positions ]], format=MemoryReader )
Member

orbeckst commented Oct 27, 2017

Yes, that's a bug..

As a silly workaround, add an artifical extra list

u2.load_new( [[ prot.positions ]], format=MemoryReader )
@orbeckst

This comment has been minimized.

Show comment
Hide comment
@orbeckst

orbeckst Oct 27, 2017

Member

For anyone writing the unit test:

from MDAnalysis.tests.datafiles import GRO
import MDAnalysis as mda; from MDAnalysis.tests.datafiles import GRO
u = mda.Universe(GRO)
prot = u.select_atoms('protein')
u2 = mda.Merge(prot)
# next line fails and raises ValueError (should pass and give proper Universe)
u2.load_new( [ prot.positions ], format=mda.coordinates.memory.MemoryReader)

# workaround (when fixed, should fail with ValueError)
u2.load_new( [[ prot.positions ]], format=mda.coordinates.memory.MemoryReader)
Member

orbeckst commented Oct 27, 2017

For anyone writing the unit test:

from MDAnalysis.tests.datafiles import GRO
import MDAnalysis as mda; from MDAnalysis.tests.datafiles import GRO
u = mda.Universe(GRO)
prot = u.select_atoms('protein')
u2 = mda.Merge(prot)
# next line fails and raises ValueError (should pass and give proper Universe)
u2.load_new( [ prot.positions ], format=mda.coordinates.memory.MemoryReader)

# workaround (when fixed, should fail with ValueError)
u2.load_new( [[ prot.positions ]], format=mda.coordinates.memory.MemoryReader)
@orbeckst

This comment has been minimized.

Show comment
Hide comment
@orbeckst

orbeckst Nov 22, 2017

Member

Possibly #1453 (comment) is related to this issue.

Member

orbeckst commented Nov 22, 2017

Possibly #1453 (comment) is related to this issue.

@orbeckst orbeckst added this to the 0.17.x milestone Dec 9, 2017

@richardjgowers richardjgowers removed this from the 0.18 milestone Mar 5, 2018

@richardjgowers

This comment has been minimized.

Show comment
Hide comment
@richardjgowers

richardjgowers Mar 5, 2018

Member

This looks fairly simple, could just add some checks to see if the input is a np.ndarray

Member

richardjgowers commented Mar 5, 2018

This looks fairly simple, could just add some checks to see if the input is a np.ndarray

@palnabarun

This comment has been minimized.

Show comment
Hide comment
@palnabarun

palnabarun Mar 7, 2018

Member

Hi,

I am trying to solve this but I am not making any headway. Getting stuck somewhere. I tried to check if filename is of type np.ndarray and skipping the code block if it is. But, it is not yielding the ideal result.

if not isinstance(filename, np.ndarray) and len(util.asiterable(filename)) == 1:
            # make sure a single filename is not handed to the ChainReader
            filename = util.asiterable(filename)[0]

I think I am doing something very wrong here. Please guide me @orbeckst @richardjgowers

Member

palnabarun commented Mar 7, 2018

Hi,

I am trying to solve this but I am not making any headway. Getting stuck somewhere. I tried to check if filename is of type np.ndarray and skipping the code block if it is. But, it is not yielding the ideal result.

if not isinstance(filename, np.ndarray) and len(util.asiterable(filename)) == 1:
            # make sure a single filename is not handed to the ChainReader
            filename = util.asiterable(filename)[0]

I think I am doing something very wrong here. Please guide me @orbeckst @richardjgowers

@orbeckst

This comment has been minimized.

Show comment
Hide comment
@orbeckst

orbeckst Mar 7, 2018

Member

We're talking about

if len(util.asiterable(filename)) == 1:

At first glance your solution looks ok-ish. However, if you look at my test case #1685 (comment) you see that the object that you're testing is a list, not an ndarray. That's probably why your code does not skip the bla[0] clause.

Perhaps I should explain what happens in

u2.load_new( [ prot.positions ], format=mda.coordinates.memory.MemoryReader)

The intent is to create a trajectory with a single frame in "fac" (frame-atom-coordinates) order. The outer list is just the "trajectory" holding a single coordinate frame (prot.positions) with shape (n_atoms, 3). If you were to do

trj = np.asarray([ prot.positions ])

then trj.shape would be (1, n_atoms, 3).

Maybe the test should be something like

trj_source = util.asiterable(filename)
if len(trj_source) == 1 and not isinstance(trj_source[0], np.ndarray):
    # make sure a single filename is not handed to the ChainReader
    # (note: a single frame trajectory for the MemoryReader is also treated differently)
    filename = trj_source[0]
Member

orbeckst commented Mar 7, 2018

We're talking about

if len(util.asiterable(filename)) == 1:

At first glance your solution looks ok-ish. However, if you look at my test case #1685 (comment) you see that the object that you're testing is a list, not an ndarray. That's probably why your code does not skip the bla[0] clause.

Perhaps I should explain what happens in

u2.load_new( [ prot.positions ], format=mda.coordinates.memory.MemoryReader)

The intent is to create a trajectory with a single frame in "fac" (frame-atom-coordinates) order. The outer list is just the "trajectory" holding a single coordinate frame (prot.positions) with shape (n_atoms, 3). If you were to do

trj = np.asarray([ prot.positions ])

then trj.shape would be (1, n_atoms, 3).

Maybe the test should be something like

trj_source = util.asiterable(filename)
if len(trj_source) == 1 and not isinstance(trj_source[0], np.ndarray):
    # make sure a single filename is not handed to the ChainReader
    # (note: a single frame trajectory for the MemoryReader is also treated differently)
    filename = trj_source[0]
@palnabarun

This comment has been minimized.

Show comment
Hide comment
@palnabarun

palnabarun Mar 8, 2018

Member

Thank you @orbeckst for the guidance. I now understand how the whole workflow works.

However, when I tried the above snippet, the code returned a ValueError in the mda.Merge step itself.

In [1]: from MDAnalysis.tests.datafiles import GRO
   ...: import MDAnalysis as mda; from MDAnalysis.tests.datafiles import GRO
   ...: u = mda.Universe(GRO)
   ...: prot = u.select_atoms('protein')
   ...: u2 = mda.Merge(prot)
   ...: 
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-1-4eaa621a6aad> in <module>()
      3 u = mda.Universe(GRO)
      4 prot = u.select_atoms('protein')
----> 5 u2 = mda.Merge(prot)

/mnt/Data_SSD/codes/mdanalysis/package/MDAnalysis/core/universe.py in Merge(*args)
   1213     coords = np.vstack([a.positions for a in args])
   1214     u = Universe(top, coords[None, :, :],
-> 1215                  format=MDAnalysis.coordinates.memory.MemoryReader)
   1216 
   1217     return u

/mnt/Data_SSD/codes/mdanalysis/package/MDAnalysis/core/universe.py in __init__(self, *args, **kwargs)
    303             if not coordinatefile:
    304                 coordinatefile = None
--> 305             self.load_new(coordinatefile, **kwargs)
    306 
    307         # Check for guess_bonds

/mnt/Data_SSD/codes/mdanalysis/package/MDAnalysis/core/universe.py in load_new(self, filename, format, in_memory, **kwargs)
    551         kwargs['n_atoms'] = self.atoms.n_atoms
    552 
--> 553         self.trajectory = reader(filename, **kwargs)
    554         if self.trajectory.n_atoms != len(self.atoms):
    555             raise ValueError("The topology and {form} trajectory files don't"

/mnt/Data_SSD/codes/mdanalysis/package/MDAnalysis/coordinates/memory.py in __init__(self, coordinate_array, order, dimensions, dt, filename, **kwargs)
    277                                  "does not match the shape of the coordinate "
    278                                  "array ({})"
--> 279                                  .format(provided_n_atoms, self.n_atoms))
    280 
    281         self.ts = self._Timestep(self.n_atoms, **kwargs)

ValueError: The provided value for n_atoms (3341) does not match the shape of the coordinate array (1)

I inspected the code using debug statements and found that when the prot AtomGroup is passed on to the Merge function which in turn calls load_new down the call stack, this check fails.

if (provided_n_atoms is not None and
provided_n_atoms != self.n_atoms):
raise ValueError("The provided value for n_atoms ({}) "
"does not match the shape of the coordinate "
"array ({})"
.format(provided_n_atoms, self.n_atoms))

Because the coordinate_array passed to MemoryReader in this statement is a tuple of numpy arrays. So, when the line below is executed, the MemoryReader stores a numpy array with shape (1, 1, 3341, 3) which ultimately fails the above test.

self.set_array(np.asarray(coordinate_array), order)

I believe this is not the ideal behavior. We need to somehow make the workflow of load_new uniform.

Member

palnabarun commented Mar 8, 2018

Thank you @orbeckst for the guidance. I now understand how the whole workflow works.

However, when I tried the above snippet, the code returned a ValueError in the mda.Merge step itself.

In [1]: from MDAnalysis.tests.datafiles import GRO
   ...: import MDAnalysis as mda; from MDAnalysis.tests.datafiles import GRO
   ...: u = mda.Universe(GRO)
   ...: prot = u.select_atoms('protein')
   ...: u2 = mda.Merge(prot)
   ...: 
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-1-4eaa621a6aad> in <module>()
      3 u = mda.Universe(GRO)
      4 prot = u.select_atoms('protein')
----> 5 u2 = mda.Merge(prot)

/mnt/Data_SSD/codes/mdanalysis/package/MDAnalysis/core/universe.py in Merge(*args)
   1213     coords = np.vstack([a.positions for a in args])
   1214     u = Universe(top, coords[None, :, :],
-> 1215                  format=MDAnalysis.coordinates.memory.MemoryReader)
   1216 
   1217     return u

/mnt/Data_SSD/codes/mdanalysis/package/MDAnalysis/core/universe.py in __init__(self, *args, **kwargs)
    303             if not coordinatefile:
    304                 coordinatefile = None
--> 305             self.load_new(coordinatefile, **kwargs)
    306 
    307         # Check for guess_bonds

/mnt/Data_SSD/codes/mdanalysis/package/MDAnalysis/core/universe.py in load_new(self, filename, format, in_memory, **kwargs)
    551         kwargs['n_atoms'] = self.atoms.n_atoms
    552 
--> 553         self.trajectory = reader(filename, **kwargs)
    554         if self.trajectory.n_atoms != len(self.atoms):
    555             raise ValueError("The topology and {form} trajectory files don't"

/mnt/Data_SSD/codes/mdanalysis/package/MDAnalysis/coordinates/memory.py in __init__(self, coordinate_array, order, dimensions, dt, filename, **kwargs)
    277                                  "does not match the shape of the coordinate "
    278                                  "array ({})"
--> 279                                  .format(provided_n_atoms, self.n_atoms))
    280 
    281         self.ts = self._Timestep(self.n_atoms, **kwargs)

ValueError: The provided value for n_atoms (3341) does not match the shape of the coordinate array (1)

I inspected the code using debug statements and found that when the prot AtomGroup is passed on to the Merge function which in turn calls load_new down the call stack, this check fails.

if (provided_n_atoms is not None and
provided_n_atoms != self.n_atoms):
raise ValueError("The provided value for n_atoms ({}) "
"does not match the shape of the coordinate "
"array ({})"
.format(provided_n_atoms, self.n_atoms))

Because the coordinate_array passed to MemoryReader in this statement is a tuple of numpy arrays. So, when the line below is executed, the MemoryReader stores a numpy array with shape (1, 1, 3341, 3) which ultimately fails the above test.

self.set_array(np.asarray(coordinate_array), order)

I believe this is not the ideal behavior. We need to somehow make the workflow of load_new uniform.

@richardjgowers

This comment has been minimized.

Show comment
Hide comment
@richardjgowers

richardjgowers Mar 8, 2018

Member

@orbeckst I'm not sure the input you're proposing makes sense. I'd think that only supplying a numpy array of shape (nframes, natoms, 3) would make sense. Maybe with shape (natoms, 3) getting correctly reshaped/broadcasted to have the extra dimension - this should be done by the MemoryReader.

The list input should probably be for ChainReader only, then a list of arrays could then be read with a ChainReader of MemoryReaders

Member

richardjgowers commented Mar 8, 2018

@orbeckst I'm not sure the input you're proposing makes sense. I'd think that only supplying a numpy array of shape (nframes, natoms, 3) would make sense. Maybe with shape (natoms, 3) getting correctly reshaped/broadcasted to have the extra dimension - this should be done by the MemoryReader.

The list input should probably be for ChainReader only, then a list of arrays could then be read with a ChainReader of MemoryReaders

@orbeckst

This comment has been minimized.

Show comment
Hide comment
@orbeckst

orbeckst Mar 8, 2018

Member

@palnabarun thanks for trying out things and thanks @richardjgowers for the comments. I agree that this is more messy than I assumed. I like @richardjgowers 's suggestion to simply only accept ndarrays for the MemoryReader. Then the user has to do a

u2.load_new( prot.positions[np.newaxis, :, :])
  1. Either we check as proposed by @palnabarun or
  2. we hack the memory reader and special-case when it is handed a (N, 3) array. The latter will have the advantage that it will then work with the ChainReader (I think...) and creating trajectories as
    u2.load_new([protein1.positions, protein2.positions])
    will become possible. Previously you'd do u2.load_new(np.vstack([protein1.positions, protein2.positions])).

In any case, we should document clearly how to use load_new() with coordinates. It's a very nifty and versatile feature.

Member

orbeckst commented Mar 8, 2018

@palnabarun thanks for trying out things and thanks @richardjgowers for the comments. I agree that this is more messy than I assumed. I like @richardjgowers 's suggestion to simply only accept ndarrays for the MemoryReader. Then the user has to do a

u2.load_new( prot.positions[np.newaxis, :, :])
  1. Either we check as proposed by @palnabarun or
  2. we hack the memory reader and special-case when it is handed a (N, 3) array. The latter will have the advantage that it will then work with the ChainReader (I think...) and creating trajectories as
    u2.load_new([protein1.positions, protein2.positions])
    will become possible. Previously you'd do u2.load_new(np.vstack([protein1.positions, protein2.positions])).

In any case, we should document clearly how to use load_new() with coordinates. It's a very nifty and versatile feature.

@palnabarun

This comment has been minimized.

Show comment
Hide comment
@palnabarun

palnabarun Mar 21, 2018

Member

@orbeckst What do you suggest? How should I proceed on this issue?

I also think it is better to handle the special case in the memory reader. Should it look something like this?

if arr.shape[1] == 3:
    arr = arr[np.newaxis, :, :]
Member

palnabarun commented Mar 21, 2018

@orbeckst What do you suggest? How should I proceed on this issue?

I also think it is better to handle the special case in the memory reader. Should it look something like this?

if arr.shape[1] == 3:
    arr = arr[np.newaxis, :, :]
@orbeckst

This comment has been minimized.

Show comment
Hide comment
@orbeckst

orbeckst Mar 21, 2018

Member

I think the check should be more specific: it's only the (N, 3) case that we want to promote,

if len(arr.shape) == 2 and arr.shape[1] == 3:
  arr = arr[np.newaxis, :, :]

(with appropriate error checking).

I'd say try it out in the MemoryReader and then we see if this works. So far, this sounds like the best idea to me but obviously things can change when we see how it works in practice.

Member

orbeckst commented Mar 21, 2018

I think the check should be more specific: it's only the (N, 3) case that we want to promote,

if len(arr.shape) == 2 and arr.shape[1] == 3:
  arr = arr[np.newaxis, :, :]

(with appropriate error checking).

I'd say try it out in the MemoryReader and then we see if this works. So far, this sounds like the best idea to me but obviously things can change when we see how it works in practice.

@palnabarun

This comment has been minimized.

Show comment
Hide comment
@palnabarun

palnabarun Mar 21, 2018

Member

Oh. Yeah, the length of shape checks too.

I tried the implementation. As should be the case, it is handling the case of

u2.load_new( [ prot.positions ], format=mda.coordinates.memory.MemoryReader)

and raising an AttributeError in the case of

u2.load_new( [[prot.positions]], format=mda.coordinates.memory.MemoryReader)

as the input to the MemoryReader is [prot.positions] and lists do not have a shape property.

It also works with the ChainReader.

Member

palnabarun commented Mar 21, 2018

Oh. Yeah, the length of shape checks too.

I tried the implementation. As should be the case, it is handling the case of

u2.load_new( [ prot.positions ], format=mda.coordinates.memory.MemoryReader)

and raising an AttributeError in the case of

u2.load_new( [[prot.positions]], format=mda.coordinates.memory.MemoryReader)

as the input to the MemoryReader is [prot.positions] and lists do not have a shape property.

It also works with the ChainReader.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment