
Framework for saving and loading operators #91

Closed · babbush opened this issue Jul 7, 2017 · 7 comments

babbush (Contributor) commented Jul 7, 2017

We have a nice framework for saving and loading instances of the MolecularData class. We should also have something similar for FermionOperator and QubitOperator. In fact, there is a project going on right now which will benefit from this significantly in the coming weeks. A good example of an operator we might want to save and load is an error operator from the Trotter error code. These operators are expensive to compute and complex to analyze so there is good reason that one might want to save the output.

I suggest we continue to use HDF5 and loosely parallel the system by which MolecularData is saved. However, automatically generated names for arbitrary FermionOperators and QubitOperators are not a good idea since these classes are quite broad; naming should be left up to the user. The directory should be an optional argument, defaulting to the same place where MolecularData is saved. I suggest that save() and load() be external functions, kept in utils/; a possible shape for that interface is sketched below. We should anticipate automatic naming functions that will use these primitive save/load functions as subroutines. A good example of where this would be helpful is saving and loading plane wave Hamiltonians.
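A minimal sketch of what that interface could look like, independent of the on-disk format; the names save_operator, load_operator, and DEFAULT_DATA_DIRECTORY are hypothetical, not existing FermiLib API:

```python
import os

# Hypothetical default; the real default should match wherever
# MolecularData writes its files.
DEFAULT_DATA_DIRECTORY = os.path.join(os.path.dirname(__file__), 'data')


def save_operator(operator, file_name, data_directory=None):
    """Save a FermionOperator or QubitOperator under a user-chosen name.

    No automatic naming: file_name is always supplied by the user.
    """
    if data_directory is None:
        data_directory = DEFAULT_DATA_DIRECTORY
    file_path = os.path.join(data_directory, file_name + '.hdf5')
    # On-disk format still to be decided (see discussion below).
    raise NotImplementedError(file_path)


def load_operator(file_name, data_directory=None):
    """Load an operator previously written by save_operator."""
    if data_directory is None:
        data_directory = DEFAULT_DATA_DIRECTORY
    file_path = os.path.join(data_directory, file_name + '.hdf5')
    raise NotImplementedError(file_path)
```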

We should think about the most efficient way to save both types of operators. An easy (but not necessarily optimal) solution is to call the str() method that is already implemented on these operators; to load them back, one would need to write a parser. Since these are classes with a small number of attributes that are unlikely to change, it might make sense to use pickle (yes, I know about the security issue). A bigger concern with pickle is the discrepancy between pickling in Python 2 and 3. Is there a standard way to store Python dictionaries? That could work, since a dictionary essentially defines a QubitOperator or FermionOperator; a sketch of a JSON round trip over such a dictionary follows. We may also want to think about giving these classes a repr that the builtin eval can reconstruct. Keep in mind that if somebody is going to the trouble of saving these operators, they are likely rather large, so performance should be a priority.
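To make the dictionary idea concrete, here is a toy round trip through JSON. This is illustrative only: it assumes the operator exposes a .terms dict mapping tuples to (possibly complex) coefficients, and it stores tuple keys via repr(), which, as noted later in this thread, keeps the format somewhat Python-specific.

```python
import json
from ast import literal_eval  # safer than the builtin eval for tuple keys


def terms_to_json(terms):
    # repr() turns each tuple key into a string key; complex coefficients
    # are split into (real, imag) pairs, since JSON supports neither.
    return json.dumps({repr(key): [coeff.real, coeff.imag]
                       for key, coeff in terms.items()})


def terms_from_json(blob):
    return {literal_eval(key): complex(re, im)
            for key, (re, im) in json.loads(blob).items()}


# Round trip on a toy dict shaped like FermionOperator.terms,
# ((mode, raise-or-lower), ...) -> coefficient:
terms = {((0, 1), (1, 0)): 1.5 + 0.0j}
assert terms_from_json(terms_to_json(terms)) == terms
```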

I am curious to hear the opinions of @jarrodmcc, @damiansteiger, @thomashaener, and @Strilanc. We should discuss and agree on a solution prior to any pull requests being opened.

jarrodmcc (Contributor) commented:

A few points that come to mind:

  1. Assuming the complexity of generating these objects is much greater than the storage complexity (otherwise on-the-fly regeneration is a valid and practical solution for some cases), how big is big here? The threshold is: does the resulting operator fit in a reasonable amount of memory? If the answer is no, then we might need a streaming storage solution that writes to disk as terms are being computed, without holding everything in memory, and similarly reads and computes from disk without keeping too much around. In HDF5 world, I believe this means taking advantage of chunking and compression (compression should perhaps be used in MolecularData as well). See the Python docs: http://docs.h5py.org/en/latest/high/dataset.html . If the max storage is expected to be < memory, the sort of solution we have in MolecularData should work.

  2. If the number of terms is large, we should avoid ASCII storage solutions like using the str() representation and parsing. The overhead in both storage and decoding will be too large.

  3. Pickling, or more generally serialization by some means, is not a bad idea if it's okay to bring the whole thing back into memory. There are more portable options in Python than the pickle module; for example, the json module allows serialization to JSON, which can then be stored directly in HDF5. The downside is that this is perhaps not amenable to chunking for large datasets (not sure). I'm sure there are other portable serialization options as well.

  4. A mix of the two might be to store two associated, chunkable arrays of the dictionary keys and values; see the sketch after this list. This still incurs some of the string overhead and requires rebuilding the dictionary (perhaps not that expensive? needs to be tested), but it can likely handle datasets that do not fit in memory.
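A sketch of that key/value layout with h5py, under the assumption that each term can be rendered as a short fixed-width string key with a complex coefficient; the file name and dataset names are illustrative:

```python
import h5py
import numpy as np

# Toy stand-in for an operator's terms: fixed-width byte-string keys
# ('X0 Z3' style) paired with complex coefficients.
keys = np.array([b'X0 Z3', b'Y1'], dtype='S64')
values = np.array([0.5 + 0.0j, -1.2 + 0.3j], dtype=np.complex128)

with h5py.File('operator.hdf5', 'w') as f:
    # chunks=True plus gzip lets HDF5 stream and compress large term lists.
    f.create_dataset('keys', data=keys, chunks=True, compression='gzip')
    f.create_dataset('values', data=values, chunks=True, compression='gzip')

with h5py.File('operator.hdf5', 'r') as f:
    # Rebuild the dictionary; for data larger than memory, one would
    # instead iterate over the datasets a chunk at a time.
    terms = dict(zip(f['keys'][...], f['values'][...]))
```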

babbush (Contributor, Author) commented Jul 7, 2017

  1. I don't anticipate that people will frequently want to do numerics in FermiLib with QubitOperators and FermionOperators so large that they cannot fit in RAM. While there are certainly many reasons to want to generate such huge operators, building the data structure would be way too slow if it constantly needs to read from and write to disk, so I think people will run out of patience.

  2. I think we should aim to store between ten million and a hundred million terms. A hundred million terms may require something on the order of 10 GB even in an extremely minimal representation (see the estimate after this list).

  3. My concern about pickling specifically is the compatibility issue between Python 2 and 3: files pickled by 2 cannot be read by 3 and vice versa. But maybe we're okay with that. Serializing to JSON and storing it in HDF5 sounds promising though. I like that.

  4. Again, I am skeptical that numerics will be practical at all if the operator doesn't fit in memory. For instance, does it seem practical to take expectation values of operators that don't fit in memory? I am sure this can be done in a reasonable way, but it seems like more than we need at the moment. Next week I expect to be generating operators of this size, though, so maybe I'll change my mind once I see just how small the instances that run me out of RAM really are.
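A back-of-envelope check of the 10 GB figure in point 2, assuming (hypothetically) about 100 bytes per stored term, i.e. a short term key plus a 16-byte complex coefficient with some overhead:

```python
terms = 10**8            # a hundred million terms
bytes_per_term = 100     # assumed: key string + 16-byte complex coefficient + overhead
print(terms * bytes_per_term / 1e9, 'GB')  # -> 10.0 GB
```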

Strilanc (Collaborator) commented Jul 7, 2017

Don't use pickle. It's not good for long-term compatibility; e.g., it makes your data format Python-specific instead of language-agnostic.

Looking like the existing thing you're already storing is usually a good idea.

Using a human-readable string format keeps things simple. The human-readable strings might compress well; you should check how much compression you get with zlib or some other standard compression algorithm.
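A quick way to run that check with the standard zlib module, using a repetitive stand-in for str(operator) (a real operator string would be needed for a meaningful number):

```python
import zlib

# Stand-in for str(operator): many short, repetitive term lines.
text = ('1.5 [X0 Z3] +\n-0.3 [Y1 Y2] +\n' * 100000).encode('utf-8')
compressed = zlib.compress(text, 6)
print('raw bytes:       ', len(text))
print('compressed bytes:', len(compressed))
print('compression ratio: %.1fx' % (len(text) / len(compressed)))
```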

A binary format is good if you're speed-limited or space-limited.

babbush (Contributor, Author) commented Jul 7, 2017

@Strilanc how do you feel about @jarrodmcc's suggestion to serialize to JSON and store it in HDF5?

Strilanc (Collaborator) commented Jul 7, 2017

@babbush That falls under "looking/acting like the existing thing is good". I don't know much about HDF5... unless there's some reason that it doesn't fit this use case, it seems perfectly acceptable to keep using it.

jarrodmcc (Contributor) commented:

For the argument that you don't want data to be Python-specific, the JSON approach would also fall under that category: I expect serialization of a Python dict to be somewhat Python-specific, even if it's then stored in an HDF5 file. If having the data be somewhat language-agnostic is a strong desire of ours, we're only really left with either a giant string dump or associated lists of keys and values. Either has to be processed first and then used to build a dictionary, but the lists are likely to be faster and more compact if I had to guess.

babbush (Contributor, Author) commented Jul 15, 2017

Thanks to @Spaceenter!
