
cannot pickle universe object #173

Closed
GoogleCodeExporter opened this issue Apr 4, 2015 · 19 comments

Comments

@GoogleCodeExporter

What steps will reproduce the problem?
1. Create a Universe object by passing a PSF file and a DCD file:
          u = Universe(psf, dcd)
2. Try to serialize this object using the Python pickle module:
          u_blob = pickle.dumps(u)

What is the expected output? What do you see instead?
Expected output should be a byte stream, but instead I get the error message
'TypeError: 'AtomGroup' object is not callable'.
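
The steps above, written out as a minimal self-contained sketch (the file names
are placeholders for a real PSF/DCD pair):

    import pickle
    from MDAnalysis import Universe

    # placeholder PSF/DCD pair -- substitute real files
    u = Universe("system.psf", "traj.dcd")

    # MDAnalysis 0.8.0: TypeError: 'AtomGroup' object is not callable
    u_blob = pickle.dumps(u)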

What version of the product are you using? On what operating system?

MDAnalysis 0.8.0
Python 2.7.6 / Python 2.6.2
Arch Linux / CentOS

Please provide any additional information below.


Original issue reported on code.google.com by charlesz...@gmail.com on 27 Mar 2014 at 9:18

@GoogleCodeExporter
Author

I've certainly brought this kind of thing up in the past, but I think the fact 
that universe objects are closely tied to open files makes it difficult to 
allow proper serialization via pickle. This also makes it challenging to pass 
universe objects between cores in a parallel workflow. I generally end up 
adjusting my workflow to produce something more tractable like a numpy array of 
coordinates and pickle that for storage and / or interprocess communication.

Original comment by tyler.je.reddy@gmail.com on 27 Mar 2014 at 10:14

@GoogleCodeExporter
Author

The way Universes are built at the moment makes it impossible to pickle them as 
they contain trajectory reader objects which in turn contain open file 
descriptors.

This won't change any time soon unless someone comes up with a smart way to do 
this. Therefore I am closing this with 'WontFix' – but feel free to start a 
discussion on the developer mailing list or in the comments to this issue. If a 
sensible approach and consensus emerges we will reopen.

Oliver 

Original comment by orbeckst on 27 Mar 2014 at 10:16
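
To illustrate the underlying limitation independent of MDAnalysis: an open file
handle is itself unpicklable, so any object that holds one cannot go through
pickle's default machinery.

    import pickle
    import tempfile

    with tempfile.TemporaryFile(mode="w") as fh:   # stand-in for an open trajectory file
        try:
            pickle.dumps(fh)
        except TypeError as err:
            # e.g. "cannot pickle '_io.TextIOWrapper' object" (Python 3)
            # or   "can't pickle file objects"                (Python 2)
            print(err)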

@GoogleCodeExporter
Author

Original comment by orbeckst on 27 Mar 2014 at 10:16

  • Changed state: WontFix

@GoogleCodeExporter
Author

Just to clarify the cryptic error message:

pickle checks whether a class has a __getstate__ function, and then executes it 
if it does. The way this is done is a sort of duck-typing, where 
object.__getstate__ is assigned to a variable, which is subsequently called.
If there is no __getstate__ an AttributeError is raised, in which case pickle's 
default behavior ensues.

The thing with some MDAnalysis objects is that they manage their attributes and 
never raise an AttributeError. In particular if you try AtomGroup.__getstate__ 
you get a SelectionError which pickle does not handle; if you do the same with 
Segment.__getstate__ you get no error (!!) and an empty AtomGroup is returned. 
It is this last case that causes the 'TypeError: 'AtomGroup' object is not 
callable' when pickle tries to execute what it got for Segment.__getstate__.

This is a dangerous side effect of the syntactic sugar for selection shortcuts. 
A lot of errors can be silently ignored.
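
A toy illustration of the mechanism (not MDAnalysis code): a class whose
__getattr__ never raises AttributeError hands pickle a non-callable in place of
__getstate__. This bites on Python 2.7, as used here, and on Python 3 releases
before 3.11, where object gained a default __getstate__.

    import pickle

    class Sloppy(object):
        """Stand-in for an object that resolves any attribute via __getattr__."""
        def __getattr__(self, name):
            # Mimics a Segment returning an (empty) AtomGroup for any unknown
            # attribute, including the special name __getstate__.
            return []

    try:
        pickle.dumps(Sloppy())
    except TypeError as err:
        # pickle calls whatever it found for __getstate__ and fails with
        # "'list' object is not callable"
        print(err)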

On the topic of parallelization, I will soon post some code I have geared 
specifically towards parallelizing trajectory reads. The serialization of 
MDAnalysis objects only becomes a problem if they are to be passed back and 
forth between workers. If the code avoids that (and takes care of renewing file 
descriptors) multiprocessing works fine.

Original comment by manuel.n...@gmail.com on 27 Mar 2014 at 10:29
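
One possible shape of such a workflow, sketched with the current MDAnalysis API
and placeholder file names: each worker rebuilds its own Universe from the file
names, so only frame indices and plain floats (trivially picklable) ever cross
process boundaries.

    import multiprocessing

    import numpy as np
    import MDAnalysis

    PSF, DCD = "system.psf", "traj.dcd"   # placeholder file names

    def radius_of_gyration(frame_index):
        # Re-open the Universe inside the worker so no file handles are pickled.
        u = MDAnalysis.Universe(PSF, DCD)
        u.trajectory[frame_index]              # jump to the requested frame
        return u.atoms.radius_of_gyration()    # a plain float

    if __name__ == "__main__":
        n_frames = len(MDAnalysis.Universe(PSF, DCD).trajectory)
        with multiprocessing.Pool() as pool:
            rgyr = np.array(pool.map(radius_of_gyration, range(n_frames)))
        print(rgyr)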

@GoogleCodeExporter
Author

Is it possible to extract all the fields of the Universe object and serialize 
them to a byte stream? When we want to regenerate this object, we deserialize 
all the fields and reconstruct the object.

Original comment by charlesz...@gmail.com on 27 Mar 2014 at 11:18

@GoogleCodeExporter
Author

That is pickle's default approach. MDAnalysis universes are unpicklable in this 
way due to, among other things, open file descriptors and unpicklable function 
objects (SWIG stuff).

This can all be managed instead of leaving pickle to do its default process. 
Look into the __getstate__ and __setstate__ functions. There probably is a way 
of coercing the serialization of enough information for __setstate__ to 
recreate the universe. As far as I went it looked messy and, in my case, not 
worth the trouble.

Original comment by manuel.n...@gmail.com on 27 Mar 2014 at 11:34

@GoogleCodeExporter
Author

For __getstate__():
* all the arguments of Universe()
* current frame in the trajectory (often that will just be frame 1)
* optional: index of the XTC/TRR reader

For __setstate__():
1) build the Universe again
2) optional: recreate the XTC/TRR reader index
3) go to the saved frame

Original comment by orbeckst on 27 Mar 2014 at 11:42
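
A rough sketch of what that outline could look like; the _init_args/_init_kwargs
stash is hypothetical and would have to be recorded by Universe.__init__():

    class Universe(object):
        # ... existing Universe code ...

        def __getstate__(self):
            # hypothetical: constructor arguments stashed away in __init__()
            return {"args": self._init_args,
                    "kwargs": self._init_kwargs,
                    "frame": self.trajectory.frame}

        def __setstate__(self, state):
            # rebuild the Universe from scratch, then go to the saved frame
            self.__init__(*state["args"], **state["kwargs"])
            self.trajectory[state["frame"]]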

@GoogleCodeExporter
Author

I added explicit __setstate__() and __getstate__() methods to Universe and 
AtomGroup which raise a NotImplementedError. This should at least make this 
particular error message clearer (in 0.8.2-dev, commit 
4bb36e64e8d182226c9303777a166bdb56d08b34)

Original comment by orbeckst on 15 Apr 2014 at 8:25
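
The change is roughly of this shape (paraphrased, not the exact committed code):

    class Universe(object):
        # ...
        def __getstate__(self):
            raise NotImplementedError("Universe objects cannot be pickled "
                                      "(yet); see Issue 173")

        def __setstate__(self, state):
            raise NotImplementedError("Universe objects cannot be unpickled "
                                      "(yet); see Issue 173")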

@GoogleCodeExporter
Author

Maybe it would be worthwhile making pickling work along the lines of recreating 
a copy of the universe using the constructor information and information from 
the TrajectoryReader (which would need its own __getstate__/__setstate__) — 
see comments in this issue for more details.

I am reopening this as an enhancement; anyone interested can grab the ticket.

Original comment by orbeckst on 15 Apr 2014 at 8:29

  • Changed state: Accepted
  • Added labels: Type-Enhancement
  • Removed labels: Type-Defect

@GoogleCodeExporter
Author

There is a _dcd_c_ptr attribute inside the DCDReader object. It is a C object 
defined in dcd.c; if we want to serialize the DCDReader object, we have to 
serialize this C pointer as well. But it seems that pickle cannot dump such a 
PyObject.

Original comment by charlesz...@gmail.com on 15 Apr 2014 at 6:54

@GoogleCodeExporter
Author

I don't think that we should serialize a Reader wholesale but instead provide a 
way to re-instantiate it. In this way you only serialize state information such as

- filename
- number of atoms
- current frame
- ... all other parameters set in __init__()

and then essentially create a brand new Reader with this information.

Original comment by orbeckst on 15 Apr 2014 at 7:08
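
Sketched for a trajectory reader (the attribute and keyword names are
illustrative, not the actual DCDReader API):

    class DCDReader(object):
        # ... existing reader code ...

        def __getstate__(self):
            # Only what is needed to rebuild the reader; the open file and the
            # C-level _dcd_c_ptr are deliberately left out.
            return {"filename": self.filename,
                    "frame": self.frame,
                    "kwargs": self._init_kwargs}   # hypothetical __init__ stash

        def __setstate__(self, state):
            # A brand new reader re-reads the DCD header (recreating the C
            # state), then seeks back to the saved frame.
            self.__init__(state["filename"], **state["kwargs"])
            self[state["frame"]]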

@GoogleCodeExporter
Author

If we want to initialize a reader object, we must call _read_dcd_header() 
inside the initialization function, which is implemented in C in dcd.c; it sets 
a new attribute named __dcd_c_ptr on the reader object, which is a PyObject.

This attribute is used by functions like MDAnalysis.analysis.align.rmsd(A, B) 
and most other functions. So I think if we want to serialize DCDReader we will 
have to serialize this attribute, or we can just serialize the coordinate and 
topology file paths for regenerating the Universe object later.

Original comment by charlesz...@gmail.com on 15 Apr 2014 at 7:54

@GoogleCodeExporter
Author

Come to think of it, it will be difficult to pickle the exact state of a 
Universe, e.g. if I renamed atoms or changed coordinates.

What is quite doable (I think) is to re-create the Universe in the same way as 
it was created in the first place, as you just need to know the constructor 
arguments (namely, topology files and coordinate files and any keyword args). 
This might or might not be sufficient for many applications but it certainly is 
not true serialization of the current state of an object.

Original comment by orbeckst on 16 Apr 2014 at 8:53

@orbeckst
Member

Is this going to be solved by #643 ?

@richardjgowers
Member

Sort of: Universe pickling will just call topology.pickle and
trajectory.pickle. #643 is most of the hard work though.


@dotsdl
Member

dotsdl commented May 6, 2016

As currently written, Topology can be pickled, but the 10M-atom system I've been playing with takes much longer to deserialize from disk, and the filesize of the pickle is nearly 7 times larger than the JSON. I think if we decide to stick with JSON as the preferred serialization for topologies, we can just define the JSONified topology as the output of __getstate__ (and likewise de-JSONify it in __setstate__).

Trajectory pickling is probably just going to have to be pickling/unpickling the filepaths. That will be sufficient for multiprocessing or distributed computing on a common filesystem.

@kain88-de
Member

@dotsdl did you enable compression for the pickle? The standard pickle protocol is super inefficient. If you change it the results are comparable to JSON. I had the same problems deciding how to store offsets in the xdrlib refactor.
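
A quick way to see the effect of protocol choice and compression, using a toy
dictionary of per-atom arrays as a stand-in for a Topology (sizes will of
course differ from the real 10M-atom system):

    import gzip
    import pickle

    import numpy as np

    n = 1000 * 1000   # toy size; the real system has ~10M atoms
    topology = {"names":  np.array(["CA"] * n),
                "resids": np.arange(n, dtype=np.int32),
                "masses": np.full(n, 12.011)}

    text_blob   = pickle.dumps(topology, protocol=0)                        # old text protocol
    binary_blob = pickle.dumps(topology, protocol=pickle.HIGHEST_PROTOCOL)  # binary protocol
    compressed  = gzip.compress(binary_blob)

    print(len(text_blob), len(binary_blob), len(compressed))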

@dotsdl
Member

dotsdl commented May 6, 2016

@kain88-de I didn't; I'll give that a try and report back. I suppose if there's no benefit to JSONifying the topology before pickling we can skip that step entirely? Or is there still value in only serializing what we know we need to regenerate the Topology object completely?

@richardjgowers
Member

This is a duplicate of #878, as that revolves around unpickling a Universe.

orbeckst added the GSoC label on Aug 7, 2020