Persistent topologies in JSON #643

Open
dotsdl opened this issue Jan 19, 2016 · 12 comments

@dotsdl
Member

dotsdl commented Jan 19, 2016

Since the Topology object is now well defined and independent of the limitations of any particular file format, it should be possible to design a scheme for serializing its information to a file, perhaps as JSON. The core translation tables would simply be converted into JSON arrays, and any additional TopologyAttrs the Topology has could have their own serialize methods, called in sequence to produce a serializable data structure. JSON's dict-like object schema would map almost directly onto the Topology object it serializes.

An example could look like:

{
 "atom-to-residue": [ ... ],
 "residue-to-segment": [ ... ],
 "topattrs":
    {
     "names": [ ... ],
     "resids": [ ... ],
     "happiness": [ ... ],
     "bonds": [ [ ... ], ... ]
    }
}
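A minimal sketch of how that structure might be produced, assuming each attribute exposes a `serialize()` method and a `key` naming its JSON entry. The classes `Names` and `Bonds` and the helper `topology_to_dict` are hypothetical stand-ins for illustration, not MDAnalysis API.

```python
import json


class Names:
    """Hypothetical per-atom attribute; class and key names are illustrative."""
    key = "names"

    def __init__(self, values):
        self.values = list(values)

    def serialize(self):
        # A plain list of strings is already JSON-serializable.
        return self.values


class Bonds:
    """Hypothetical attribute holding pairs of atom indices."""
    key = "bonds"

    def __init__(self, pairs):
        self.pairs = [tuple(p) for p in pairs]

    def serialize(self):
        # Convert tuples to lists, which is how JSON will store them anyway.
        return [list(p) for p in self.pairs]


def topology_to_dict(atom_to_residue, residue_to_segment, attrs):
    """Build the JSON-ready structure by calling each attr's serialize() in turn."""
    return {
        "atom-to-residue": list(atom_to_residue),
        "residue-to-segment": list(residue_to_segment),
        "topattrs": {attr.key: attr.serialize() for attr in attrs},
    }


doc = topology_to_dict(
    atom_to_residue=[0, 0, 1],
    residue_to_segment=[0, 0],
    attrs=[Names(["N", "CA", "C"]), Bonds([(0, 1), (1, 2)])],
)
text = json.dumps(doc, indent=1)
```

Keeping the per-attribute logic inside each attribute means the top-level function never needs to know what kind of data an attribute holds.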

Likewise, the machinery would be there for complete deserialization. This gets trickier, though, since the appropriate TopologyAttrs would need to be determined from the JSON structure. Failure to find a TopologyAttr for a particular key in the schema wouldn't be a huge problem since they function independently anyhow.
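One way the lookup could work, sketched under the assumption that attribute classes are registered by the JSON key they handle. The registry `ATTR_REGISTRY`, the `register` decorator, and `load_topattrs` are hypothetical names. Keys with no registered class are skipped rather than treated as errors, since the attributes are independent of one another.

```python
import json

# Hypothetical registry mapping JSON keys to attribute classes.
ATTR_REGISTRY = {}


def register(cls):
    """Class decorator that records cls under its JSON key."""
    ATTR_REGISTRY[cls.key] = cls
    return cls


@register
class Names:
    key = "names"

    def __init__(self, values):
        self.values = values

    @classmethod
    def deserialize(cls, data):
        return cls(list(data))


def load_topattrs(text):
    """Return (attrs, skipped_keys) parsed from a JSON document."""
    doc = json.loads(text)
    attrs, skipped = {}, []
    for key, data in doc.get("topattrs", {}).items():
        cls = ATTR_REGISTRY.get(key)
        if cls is None:
            # No class knows this key: skip it, the other attrs are unaffected.
            skipped.append(key)
            continue
        attrs[key] = cls.deserialize(data)
    return attrs, skipped


text = json.dumps({
    "atom-to-residue": [0, 0],
    "topattrs": {"names": ["N", "CA"], "happiness": [1.0, 0.5]},
})
attrs, skipped = load_topattrs(text)
```

Returning the skipped keys lets a caller decide whether a missing attribute class deserves a warning.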

The consequence of this is that any changes made to a Topology object can be preserved for later use, without having to rely on the limitations of any particular file format.

@dotsdl dotsdl self-assigned this Jan 19, 2016
@dotsdl dotsdl added this to the 0.14 - Topology refactor milestone Jan 19, 2016
@hainm
Contributor

hainm commented Jan 19, 2016

What is your motivation for writing to JSON format?

@dotsdl
Member Author

dotsdl commented Jan 19, 2016

@hainm ease of mapping Python objects to and from JSON, future-proof, lightweight (no dependency added)

@richardjgowers
Member

Yeah, I had this idea before in #253, but it didn't make sense then because the data was scattered everywhere. I think you're right that each Attr has to define a serialized version of itself (usually just an array, but who knows).

@dotsdl
Member Author

dotsdl commented Jan 19, 2016

I'm going to try and prototype this fairly soon. I think this could inform a nicer resolution for some related issues, namely #292 and datreant/MDSynthesis#25.

@richardjgowers
Member

We've also got #618 which needs the spider scheme or pickling the topology and making a new Reader

@dotsdl
Member Author

dotsdl commented Jan 19, 2016

Yeah, we can set __getstate__ to use the same serializable form that is used for JSON serialization, making Topology objects completely pickleable.
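A rough sketch of that idea, using a toy `Topology` class that is not the MDAnalysis one. The `to_dict` and `from_dict` helpers are hypothetical. The point is only that `__getstate__` and `__setstate__` can reuse the same dict that would be written to JSON.

```python
import json
import pickle


class Topology:
    """Toy stand-in for illustration; not the MDAnalysis class."""

    def __init__(self, atom_to_residue, names):
        self.atom_to_residue = list(atom_to_residue)
        self.names = list(names)

    def to_dict(self):
        # The same structure that would be written out as JSON.
        return {
            "atom-to-residue": self.atom_to_residue,
            "topattrs": {"names": self.names},
        }

    @classmethod
    def from_dict(cls, doc):
        return cls(doc["atom-to-residue"], doc["topattrs"]["names"])

    def __getstate__(self):
        # pickle stores this dict instead of the raw instance __dict__.
        return self.to_dict()

    def __setstate__(self, state):
        # pickle creates the instance without calling __init__, so rebuild
        # through from_dict and copy the resulting fields over.
        restored = type(self).from_dict(state)
        self.__dict__.update(restored.__dict__)


top = Topology([0, 0, 1], ["N", "CA", "C"])
clone = pickle.loads(pickle.dumps(top))
as_json = json.dumps(top.to_dict())
```

With this arrangement there is a single serialized form to maintain, and pickling and JSON export cannot drift apart.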

@richardjgowers
Member

Completely off topic, but while I've got you here, we really need to merge develop into 363 soonish. So maybe that before you hack too much?

@dotsdl
Member Author

dotsdl commented Jan 19, 2016

@richardjgowers will do. I'll happily do this since you've been hard at work on other things.

jbarnoud added a commit to jbarnoud/mdanalysis that referenced this issue Jan 22, 2016
`basestring` got removed from Python 3. This commit replaces all its
occurrences by `six.string_types`.

See MDAnalysis#643 and MDAnalysis#260
dotsdl added a commit that referenced this issue Apr 21, 2016
We now register TopologyAttrs using the same metaclassing mechanism as
used for Parsers, Readers, and Writers. This is necessary in order to
deserialize Topology objects that have been serialized to JSON.

An existing Topology can now be serialized to JSON with:

```python
top.to_json('topology.json')
```

and a Topology can be generated from its JSON form with:

```python
top = Topology.from_json('topology.json')
```

Individual TopologyAttrs must define how to serialize themselves and
to deserialize their data to make an instance of themselves for this to
work.

Addresses #643.
@davidlmobley

Aaron Virshup pointed this out to me. It seems like the time is right to (a) do this, and (b) make sure it can support everything we need for the long haul. Do you have somewhere where you're laying out specs for this? (For example, on this end we have a particular interest in being able to carry bond types and bond orders along with the information one would normally have in a topology file for simulation, partly because this information becomes increasingly important for small molecules.)

Also, as @jchodera noted it will probably also be useful to talk with the mdtraj folks about this.

@jchodera

@rmcgibbo noted in mdtraj/mdtraj#1144 that mdtraj developed essentially this same format for their HDF5 storage layer:

http://mdtraj.org/latest/hdf5_format.html#topology

@dotsdl
Member Author

dotsdl commented Jul 17, 2016

@davidlmobley we have not laid out a formal spec for this yet, and at the moment this serialization scheme is meant to make it possible to fully serialize our Topology objects more transparently than pickling. I do think there is room for a standard, though, and JSON is flexible enough to be pretty robust against varied needs. Our new topology system already requires that kind of flexibility anyway, so the assumption is built in.

I'm hesitant to propose a standard too quickly, since at the moment our needs are local to MDAnalysis. We are in the midst of finishing up and merging the new topology system, and this is one component serving that end. For us it would be best to wait until after we've cleared that milestone before committing to anything new.

We'll see how the chips fall for this issue (and the corresponding WIP PR #831).

@dotsdl
Member Author

dotsdl commented Jul 17, 2016

@jchodera I've had a look at the mdtraj HDF5 spec a good while back, and I think it's a reasonable approach to this. Does that address your community's needs @davidlmobley? Like JSON, with HDF5 one can always add additional data within the internal structure of the file, but you also get built-in compression (if you want) and usually really good read performance (and the ability to read only parts you need at any time). The downside is how opaque and sometimes painful HDF5 is to work with.
