How to represent large arrays #15
Open
tovrstra opened this issue Aug 18, 2017 · 17 comments

@tovrstra
Contributor

Large arrays (e.g. density matrices when using a lot of basis functions) may cause some efficiency issues. At the moment all arrays are represented as a list of (lists of ...) numbers. This has some limitations:

  1. Writing a double precision number as text in JSON gives about +150% overhead. This may be solved to some extent with BSON or HDF5, provided arrays can be recognized easily before they are written to the file.

  2. Reading a long list of data, then detecting that it is an array and then converting it to an array is slow and very annoying in some programming languages. It would be very convenient to know in advance that the list is an array with a given shape and data type.

Both aspects of this issue can be fixed by representing arrays as follows and handling such data appropriately when reading and writing a BSON or HDF5 file.

{"density matrix": {
  "shape": [10, 10],
  "type": "float64",
  "array": [1.234123, ...]
}}
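
For illustration, a minimal sketch of how such a mapping could be produced and consumed with NumPy (the key names follow the example above; the helper names are hypothetical):

import json
import numpy as np

def mapping_from_array(arr):
    # Flatten a NumPy array into the proposed {"shape", "type", "array"} mapping.
    return {"shape": list(arr.shape), "type": str(arr.dtype), "array": arr.ravel().tolist()}

def array_from_mapping(mapping):
    # Rebuild the array; shape and dtype are known before the data is touched.
    return np.array(mapping["array"], dtype=mapping["type"]).reshape(mapping["shape"])

dm = np.eye(10)
doc = json.dumps({"density matrix": mapping_from_array(dm)})
restored = array_from_mapping(json.loads(doc)["density matrix"])
assert np.array_equal(restored, dm)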

It would even be possible to extend the JSON format to encode the array with base64 encoding (only +33% overhead). I know that such practical details are not really a part of the JSON schema. However, it is relevant to define the schema such that arrays can be read and written efficiently.
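
As a rough illustration of the base64 variant (a sketch; the "base64" key used here is hypothetical, not part of the proposal above):

import base64
import numpy as np

dm = np.random.rand(100, 100)
raw = dm.astype("<f8").tobytes()                 # force little-endian float64
encoded = {"shape": list(dm.shape), "type": "<f8",
           "base64": base64.b64encode(raw).decode("ascii")}
decoded = np.frombuffer(base64.b64decode(encoded["base64"]),
                        dtype=encoded["type"]).reshape(encoded["shape"])
assert np.array_equal(decoded, dm)
# 8 bytes per float64 become about 10.7 base64 characters (+33%),
# versus roughly 18 characters as decimal text.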

@cryos
Collaborator

cryos commented Aug 18, 2017

I totally agree, and think we should favor representations in this form. There is some push back as it is then necessary to do some work indexing into the arrays, but I think it is worth it for the points mentioned above.

@tovrstra
Contributor Author

I should ping @sunqm on this issue. I forgot earlier. Sorry...

@dgasmith
Collaborator

Agreed; we might want to roll this into our units tech as well, since that follows a similar format.

@dgasmith
Collaborator

Another way to store large arrays is to hold the binary data directly:

{
 'data': b'\xf3<\xa1...',
 'shape': (4, 4),
 'dtype': '<f8'
}

Most languages can understand this fairly easily and we do not have to go through the standard serialization/deserialization process, where some data can be lost. It kills readability, but for larger arrays it's likely the way to go.
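
For example, such a record could be read back with something like this (a sketch, assuming NumPy; the bytes below just stand in for real data):

import numpy as np

record = {
    "data": np.arange(16, dtype="<f8").tobytes(),   # placeholder for the raw binary payload
    "shape": (4, 4),
    "dtype": "<f8",
}
arr = np.frombuffer(record["data"], dtype=record["dtype"]).reshape(record["shape"])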

@sunqm

sunqm commented Oct 11, 2017

For a binary format, does big-endian / little-endian need to be considered?

@tovrstra
Contributor Author

@sunqm One must know the byte ordering, which is indicated by the < in '<f8'. (It means little-endian.)

I'm not sure if supporting a binary representation is that important. One is free to use the spec with a JSON alternative that natively supports (efficient) storage of binary data and takes care of byte order. One could see it as a problem to be solved outside the spec.
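
To illustrate why the byte-order marker matters (a small NumPy sketch):

import numpy as np

payload = np.arange(4, dtype="<f8").tobytes()   # little-endian float64, as declared by '<f8'
ok = np.frombuffer(payload, dtype="<f8")        # correct on any machine
bad = np.frombuffer(payload, dtype=">f8")       # same bytes read as big-endian: nonsense values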

@sunqm

sunqm commented Oct 11, 2017

I agree with @tovrstra. The binary format seems hard to test and debug. I think a readable text form is definitely needed, and the binary format can be an optional field if performance or file size matters in certain scenarios.

@BobHanson

BobHanson commented Dec 15, 2017

Since we are talking about a data transfer spec, I do think this is within scope.

The following is in relation to the sparse-array issue, but I suggest it could be a binary solution as well...

For my first try at this in Jmol, just for atom names and symbols in topology.atoms, I am implementing a simple run-length encoding of arrays with a first-element descriptor:

["_RLE_",.....]

My decoder simply checks the first array value; if it reads "_RLE_", then we know the array is RLE-encoded. But the developer certainly has the option not to do that if desired.

My suggestion is to allow different array "headers" like this -- so that it is still just an array being read:

["_BASE64_","<base64-encoded string>"]

The context should handle the data type, but if that is also desired, then perhaps a simple qualifier such as:

["_BASE64_/float32", ....]

for example. Very simple and flexible.
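
For illustration, a decoder along these lines might look as follows (a sketch only; the pair-wise value/count layout assumed for "_RLE_" is an illustration, not something specified above):

import base64
import numpy as np

def decode_array(items):
    # Plain arrays (no special string header) pass through untouched.
    if not items or not isinstance(items[0], str) or not items[0].startswith("_"):
        return items
    header = items[0]
    if header.startswith("_RLE_"):
        # Assumed layout for illustration: ["_RLE_", value, count, value, count, ...]
        out = []
        for value, count in zip(items[1::2], items[2::2]):
            out.extend([value] * count)
        return out
    if header.startswith("_BASE64_"):
        # Optional qualifier, e.g. "_BASE64_/float32"; byte order would still have to
        # be fixed by the spec (little-endian is suggested below).
        dtype = header.split("/", 1)[1] if "/" in header else "float64"
        return np.frombuffer(base64.b64decode(items[1]), dtype=dtype)
    return items

print(decode_array(["_RLE_", "N", 3, "C", 2]))   # ['N', 'N', 'N', 'C', 'C']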

Endianness is important -- and should be specified. I prefer little-endian, which is very common.

"shape" seems like an odd choice for a dimensionality descriptor. Why not just "dim": [4,4]

I realize that some may think it better to have some sort of other meta tag that would indicate array type. Here are the arguments against that:

  • You know you have to read an array, so you must have that information already. Better to need only that than to also have to check, every time, for the presence or absence of a type indicator.

  • This is similar to how data: URIs work.

In fact, another option for long arrays would be to use a true data URI:

["data:application/octet-stream;type=float64;base64,......"]

Alternatively, then you really don't need the array:

"data:application/octet-stream;type=float64;base64,......"

but that is then a totally different solution from a run-length-encoded array, where you need both a data-type header and numeric array data. But maybe we would want to have:

["data:application/qcschema;type=rle,......"]

I might experiment with that.
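
For reference, decoding such a data URI in Python could look like this (a sketch; the ";type=" parameter follows the examples above and is not a standard MIME parameter):

import base64
import numpy as np

uri = ("data:application/octet-stream;type=float64;base64,"
       + base64.b64encode(np.arange(3, dtype="<f8").tobytes()).decode("ascii"))

header, payload = uri.split(",", 1)
params = dict(p.split("=", 1) for p in header.split(";") if "=" in p)
dtype = np.dtype(params.get("type", "float64")).newbyteorder("<")   # little-endian assumed
arr = np.frombuffer(base64.b64decode(payload), dtype=dtype)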

Bob

@dgasmith
Collaborator

As a note, you can use single backticks for verbatim text such as ["_BASE64_/float32", ....], or triple backticks for code blocks with syntax highlighting:

import numpy as np
print(np.arange(5))

@tovrstra
Contributor Author

@BobHanson The name "shape" comes from the numpy API (https://docs.scipy.org/doc/numpy-1.13.0/reference/generated/numpy.ndarray.shape.html) but it seems to be a better choice than "dim", even without considering Numpy conventions.

In the Numpy API, "dimension" or "dim" is used to refer to the number of elements in shape, which is consistent with common terminology: a 1D array corresponds to a vector, a 2D array corresponds to a matrix, but there can also be higher-dimensional arrays. Likewise, "shape" is also a name consistent with common terminology, e.g. it determines whether a 2D array is square or rectangular.

I like your ideas regarding the encoding, but what do you do in case a list just happens to start with "_RLE_" accidentally?

@BobHanson

BobHanson commented Dec 19, 2017 via email

@dgasmith
Collaborator

Putting compression inside the spec itself could cause quite a few interoperability issues, as compression libraries can be relatively heavy to import in some languages. Why not just compress the whole JSON file while sending it between points?
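
For example, compressing the whole document is a one-liner with the standard library (a minimal sketch):

import gzip
import json

doc = {"density matrix": {"shape": [10, 10], "type": "float64", "array": [0.0] * 100}}
blob = gzip.compress(json.dumps(doc).encode("utf-8"))
restored = json.loads(gzip.decompress(blob).decode("utf-8"))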

at the same time, a parser should not have to read two key/value pairs
just to read an array.

I don't quite understand why this is important; reading multiple key/value pairs compared to checking the first element of the array doesn't seem terribly different.

@BobHanson

BobHanson commented Dec 19, 2017 via email

@cryos
Collaborator

cryos commented Dec 20, 2017

I share some of @dgasmith's concerns; what I think we are looking for is a spec that can be implemented using an off-the-shelf JSON parser. Optimizing for hand-written JSON parsers, and extending for highly efficient binary transport, seems to be pushing things in a very different direction. For C++ we currently use jsoncpp, and I am looking at a more modern json header library. In Fortran Bert used this, and in Python we usually just import the json module.

This was part of the reason there was pushback on using JSON pointers: many parsers do not support them at this point. The more we put in the spec, the more difficult it will be to develop support in each language, and I think that will make it less likely to be adopted by quantum codes etc. Personally I would rather develop an HDF5 or binary sister format than attempt to create a hybrid mixing JSON with binary encoders, compression, etc. The HDF5 dependency raises the bar quite a bit too, though, and to the best of my knowledge it effectively requires linking to the HDF5 C library at this point.

If you look at the work done in XDMF, it is interesting to note how they use very standard XML and point to external binary resources, which might be raw binary blobs, HDF5, or similar. We have seen some great uptake, but they took great care not to mix XML and binary.

I entirely agree that work must be done whatever the spec, but if we step too far outside of JSON, to the point of custom parsers etc., I think this becomes quite a different project. There is the proposal of vendor extensions that would enable experimentation with whatever we like, such as these encoded arrays. Just my opinion; there is also the BSON spec, which I am a huge fan of, that would enable direct encoding of arrays. We use it in databases already, and making use of it for binary could be a great option for end-to-end binary support, where many details are already hashed out (I haven't tried to use it for transport yet).

@BobHanson

BobHanson commented Dec 21, 2017 via email

@langner

langner commented Dec 21, 2017

My vote would also be to get a text-based array format finalized first. As mentioned, different binary representations can be tried out as "extensions", since it's unclear what the right thing to do is.

It might be the case that it would be best to support several different (binary) representations of arrays, and signal the choice at the top of the file.

@tovrstra
Contributor Author

I'm also in favor of keeping the spec as simple as possible and introducing fully orthogonal extensions later to obtain compact representations.
